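This post lists the latest papers retrieved from arXiv.org on 2025-08-05. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR.
Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 each day.
Reminder: if you would like to receive the daily paper list by email, leave your email address in the comments.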
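Contents
Overview (2025-08-05)
1135 papers were updated today, including:
- Natural Language Processing: 147 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 345 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 281 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 312 papers (Machine Learning (cs.LG))
Natural Language Processing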
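[NLP-0] Test Set Quality in Multilingual LLM Evaluation
Quick read: This paper addresses the uneven quality of current multilingual evaluation sets, in particular semi-automatically built datasets that may contain undetected errors and so distort assessments of the multilingual abilities of Large Language Models (LLMs). The key to the approach is a careful manual analysis of recent French and Telugu evaluation sets: the authors identify and correct errors, then compare how several LLMs perform on the original and revised versions, finding differences of almost 10% in some cases. On that basis, the paper argues that test sets should not be treated as immutable static resources but should be revisited, checked for correctness, and potentially versioned, which makes evaluation results more reliable.
Link: https://arxiv.org/abs/2508.02635
Authors: Kranti Chalamalasetti,Gabriel Bernier-Colborne,Yvan Gauthier,Sowmya Vajjala
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at the 1st Workshop on Multilingual Data Quality Signals, COLM 2025, Short paper. 10 pages in total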
Abstract:Several multilingual benchmark datasets have been developed in a semi-automatic manner in the recent past to measure progress and understand the state-of-the-art in the multilingual capabilities of Large Language Models. However, there is not a lot of attention paid to the quality of the datasets themselves, despite the existence of previous work in identifying errors in even fully human-annotated test sets. In this paper, we manually analyze recent multilingual evaluation sets in two languages, French and Telugu, identifying several errors in the process. We compare the performance difference across several LLMs with the original and revised versions of the datasets and identify large differences (almost 10% in some cases) in both languages. Based on these results, we argue that test sets should not be considered immutable and should be revisited, checked for correctness, and potentially versioned. We end with some recommendations for both the dataset creators as well as consumers on addressing the dataset quality issues.
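The comparison described in this abstract (scoring the same model outputs against original and revised gold labels) can be sketched as follows. This is a minimal illustration only: the function names and the toy labels are assumptions made up for the example, not data or code from the paper.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("gold must not be empty")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def accuracy_shift(predictions, original_gold, revised_gold):
    """Accuracy on the revised labels minus accuracy on the original labels."""
    return accuracy(predictions, revised_gold) - accuracy(predictions, original_gold)


# Toy, made-up labels purely for illustration (not data from the paper).
model_predictions = ["A", "B", "C", "D", "A"]
original_gold = ["A", "B", "D", "D", "B"]  # hypothetical labels with two annotation errors
revised_gold = ["A", "B", "C", "D", "A"]   # hypothetical labels after manual correction

shift = accuracy_shift(model_predictions, original_gold, revised_gold)
print(f"accuracy shift after revision: {shift:+.2f}")
```

A positive shift means the model was being penalized for errors in the test set rather than for its own mistakes, which is the kind of effect the abstract argues for checking.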
[NLP-1] Pointer: Linear-Complexity Long-Range Modeling without Pre-training
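Quick read: This paper targets the high computational cost (O(N^2)) of standard Transformers on long sequences, aiming for efficient, high-performing long-range dependency modeling without pre-training. The key idea is a new architecture called Pointer, which uses layer-wise pointer chaining: each layer's pointer selection depends on the previous layer's pointer positions, building explicit long-distance connection paths at linear O(NK) complexity while keeping accuracy high.
Link: https://arxiv.org/abs/2508.02631
Authors: Zixi Li
Affiliations: Noesis Lab (Independent Research Group); Sun Yat-sen University
Categories: Computation and Language (cs.CL)
Comments: Submitted to Nordic AI Meet 2025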
Abstract:We introduce Pointer, a novel architecture that achieves linear O(NK) complexity for long-range sequence modeling while maintaining superior performance without requiring pre-training. Unlike standard attention mechanisms that compute O(N^2) pairwise interactions, our approach uses layer-wise pointer chaining where each layer’s pointer selection depends on the previous layer’s pointer positions, creating explicit long-distance connections through pointer chains. We demonstrate that this architecture achieves 2–10× speedup on long sequences compared to standard transformers, maintains 95% accuracy on copy tasks at distances up to 2048 tokens, and learns interpretable pointer patterns that reveal structured dependency modeling. Our experiments on efficiency benchmarks, long-range dependency tasks, and interpretability analysis show that Pointer offers a compelling alternative to attention mechanisms for scenarios requiring efficient long-range modeling without pre-training dependencies.
[NLP-2] HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and Decision in Embodied Agents ICCV2025
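Quick read: This paper addresses the lack of effective mechanisms for embodied agents to adaptively monitor code-policy execution and repair it during a task, with the goal of improving the robustness and sample efficiency of robot manipulation policies. The key is the HyCodePolicy framework, which integrates code synthesis, geometric grounding, perceptual monitoring, and iterative repair into a closed-loop programming cycle. A vision-language model (VLM) observes selected checkpoints during execution to detect and localize failures; structured execution traces are then fused with the VLM's perceptual feedback to infer failure causes and repair the code automatically. This dual feedback mechanism, combining symbolic and perceptual information, enables self-correcting program synthesis with minimal human supervision.
Link: https://arxiv.org/abs/2508.02629
Authors: Yibin Liu,Zhixuan Liang,Zanxin Chen,Tianxing Chen,Mengkang Hu,Wanxi Dong,Congsheng Xu,Zhaoming Han,Yusen Qin,Yao Mu
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to ICCV 2025 Workshop on Multi-Modal Reasoning for Agentic Intelligence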
Abstract:Recent advances in multimodal large language models (MLLMs) have enabled richer perceptual grounding for code policy generation in embodied agents. However, most existing systems lack effective mechanisms to adaptively monitor policy execution and repair codes during task completion. In this work, we introduce HyCodePolicy, a hybrid language-based control framework that systematically integrates code synthesis, geometric grounding, perceptual monitoring, and iterative repair into a closed-loop programming cycle for embodied agents. Technically, given a natural language instruction, our system first decomposes it into subgoals and generates an initial executable program grounded in object-centric geometric primitives. The program is then executed in simulation, while a vision-language model (VLM) observes selected checkpoints to detect and localize execution failures and infer failure reasons. By fusing structured execution traces capturing program-level events with VLM-based perceptual feedback, HyCodePolicy infers failure causes and repairs programs. This hybrid dual feedback mechanism enables self-correcting program synthesis with minimal human supervision. Our results demonstrate that HyCodePolicy significantly improves the robustness and sample efficiency of robot manipulation policies, offering a scalable strategy for integrating multimodal reasoning into autonomous decision-making pipelines.
[NLP-3] Noosemia: toward a Cognitive and Phenomenological Account of Intentionality Attribution in Human-Generative AI Interaction
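Quick read: This paper examines how people interacting with generative AI systems (especially those supporting dialogic or multimodal exchanges) come to attribute intentionality, agency, and even interiority to them, a phenomenon the authors call Noosemia, and seeks its cognitive-phenomenological mechanism. The key is a multidisciplinary framework that treats linguistic performance, epistemic opacity, and emergent technological complexity as the core drivers. By linking an LLM declination of meaning holism to the technical notion of the LLM Contextual Cognitive Field, it explains how LLMs construct meaning relationally and how coherence and a simulacrum of agency arise at the human-AI interface. The framework distinguishes Noosemia from related phenomena and introduces the converse notion of a-Noosemia to describe the withdrawal of such projections, offering a theoretical basis and analytical tools for understanding AI-induced impressions of subjectivity.
Link: https://arxiv.org/abs/2508.02622
Authors: Enrico De Santis,Antonello Rizzi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: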
Abstract:This paper introduces and formalizes Noosemia, a novel cognitive-phenomenological phenomenon emerging from human interaction with generative AI systems, particularly those enabling dialogic or multimodal exchanges. We propose a multidisciplinary framework to explain how, under certain conditions, users attribute intentionality, agency, and even interiority to these systems - a process grounded not in physical resemblance, but in linguistic performance, epistemic opacity, and emergent technological complexity. By linking an LLM declination of meaning holism to our technical notion of the LLM Contextual Cognitive Field, we clarify how LLMs construct meaning relationally and how coherence and a simulacrum of agency arise at the human-AI interface. The analysis situates noosemia alongside pareidolia, animism, the intentional stance and the uncanny valley, distinguishing its unique characteristics. We also introduce a-noosemia to describe the phenomenological withdrawal of such projections. The paper concludes with reflections on the broader philosophical, epistemological, and social implications of noosemic dynamics and directions for future research.
[NLP-4] HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research ALT
Link: https://arxiv.org/abs/2508.02621
Authors: Yinghao Zhu,Yifan Qi,Zixiang Wang,Lei Gu,Dehao Sui,Haoran Hu,Xichen Zhang,Ziyi He,Liantao Ma,Lequan Yu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Code: this https URL
[NLP-5] Mitigating Attention Hacking in Preference-Based Reward Modeling via Interaction Distillation
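Quick read: This paper addresses "attention hacking" in reward models (RMs) used for Reinforcement Learning from Human Feedback (RLHF), which arises from inadequate token-level interaction modeling. Existing RMs typically use decoder-only architectures and an independent Siamese-encoding paradigm; the former causes attention to decay within the prompt-response sequence, and the latter leaves no token-level cross-sequence attention between chosen and rejected sequences, so reward signals are easily distorted by misallocated attention to context. The key to the solution is the "Interaction Distillation" framework: an interaction-based natural language understanding model serves as the teacher, its comprehensive attention patterns guide the student RM toward finer-grained token interactions, and training is driven by an attentional alignment objective, yielding more stable and generalizable reward signals.
Link: https://arxiv.org/abs/2508.02618
Authors: Jianxiang Zang,Meiling Ning,Shihan Dou,Jiazheng Zhang,Tao Gui,Qi Zhang,Xuanjing Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: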
Abstract:The reward model (RM), as the core component of reinforcement learning from human feedback (RLHF) for large language models (LLMs), is responsible for providing reward signals to generated responses. However, mainstream preference modeling in RM is inadequate in terms of token-level interaction, making its judgment signals vulnerable to being hacked by misallocated attention to context. This stems from two fundamental limitations: (1) Current preference modeling employs decoder-only architectures, where the unidirectional causal attention mechanism leads to forward-decaying intra-sequence attention within the prompt-response sequence. (2) The independent Siamese-encoding paradigm induces the absence of token-level inter-sequence attention between chosen and rejected sequences. To address this “attention hacking”, we propose “Interaction Distillation”, a novel training framework for more adequate preference modeling through attention-level optimization. The method introduces an interaction-based natural language understanding model as the teacher to provide sophisticated token interaction patterns via comprehensive attention, and guides the preference modeling to simulate the teacher model’s interaction pattern through an attentional alignment objective. Through extensive experiments, interaction distillation has demonstrated its ability to provide more stable and generalizable reward signals compared to state-of-the-art RM optimization methods that target data noise, highlighting that attention hacking constitutes a more fundamental limitation in RM.
[NLP-6] CharBench: Evaluating the Role of Tokenization in Character-Level Tasks
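Quick read: This paper addresses the poor performance of Large Language Models (LLMs) on character-level reasoning tasks such as counting or locating characters, and in particular asks whether tokenization is the main cause of the bottleneck. The core of the approach is CharBench, a comprehensive character-level benchmark two orders of magnitude larger than existing alternatives, on which a range of leading open-weight and proprietary models are systematically evaluated. The study finds that tokenization properties are only weakly correlated with correctness on counting tasks, where word length and the actual character count matter more. For tasks that require intra-word positional understanding, however, performance drops as the token containing the queried character gets longer, suggesting that long tokens obscure character position information. These results give empirical grounding and direction for improving character-level reasoning.
Link: https://arxiv.org/abs/2508.02591
Authors: Omri Uzan,Yuval Pinter
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: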
Abstract:Tasks that require character-level reasoning, such as counting or locating characters within words, remain challenging for contemporary language models. A common conjecture is that language models’ reliance on subword units, rather than characters, contributes to their struggles with character-level tasks, yet recent studies offer conflicting conclusions about the role of tokenization, leaving its impact unclear. To address this gap, we introduce CharBench, a comprehensive benchmark of character-level tasks that is two orders of magnitude larger than existing alternatives. We evaluate a diverse range of leading open-weight and proprietary models on CharBench and find that it presents a significant challenge to modern LLMs, with an average accuracy of 43.6% and 32.3% on some tasks. We present an in-depth analysis of how intrinsic properties of words and their segmentations into tokens correspond to model performance. For counting tasks, we find that tokenization properties are weakly correlated with correctness, while the length of the queried word and the actual character count play a more significant part. In contrast, for tasks requiring intra-word positional understanding, performance is negatively correlated with the length of the token containing the queried character, suggesting that longer tokens obscure character position information for LLMs. We encourage future work to build on the benchmark and evaluation methodology introduced here as tools for improving model performance on such tasks.
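To make the quantities in this abstract concrete, here is a small sketch of the ground truth for a counting task and a locating task, together with the length of the token that contains a queried character. The helper names and the example segmentation are hypothetical; they are not taken from CharBench or from any real tokenizer.

```python
def count_char(word, char):
    """Ground truth for a counting task: occurrences of char in word."""
    return word.count(char)


def first_index(word, char):
    """Ground truth for a locating task: 0-based index of the first occurrence, or -1."""
    return word.find(char)


def containing_token_length(tokens, char_index):
    """Length of the token that covers position char_index of the concatenated word."""
    if char_index < 0:
        raise ValueError("char_index must be non-negative")
    start = 0
    for token in tokens:
        end = start + len(token)
        if start <= char_index < end:
            return len(token)
        start = end
    raise ValueError("char_index is outside the word")


word = "strawberry"
tokens = ["str", "aw", "berry"]  # hypothetical segmentation, not from a real tokenizer
assert "".join(tokens) == word

position = first_index(word, "b")
print(count_char(word, "r"), position, containing_token_length(tokens, position))
```

Under the abstract's finding, the last value (the length of the token holding the queried character) is the property that correlates negatively with accuracy on positional tasks.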
[NLP-7] Parameter-Efficient Routed Fine-Tuning: Mixture-of-Experts Demands Mixture of Adaptation Modules
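Quick read: This paper addresses the failure of existing Parameter-Efficient Fine-Tuning (PEFT) strategies to exploit the dynamic routing mechanism of Mixture-of-Experts (MoE) models. The core challenge is that conventional PEFT methods are not aligned with MoE's multi-expert architecture, so adaptation modules cannot take advantage of conditional expert activation. The key to the solution is to build routing into the adaptation modules themselves, so that fine-tuning dynamically selects and adjusts the relevant paths according to the input. Experiments on OLMoE-1B-7B and Mixtral-8x7B across commonsense and math reasoning tasks support the performance and efficiency of this routed PEFT approach and identify the best configurations for different scenarios.
Link: https://arxiv.org/abs/2508.02587
Authors: Yilun Liu,Yunpu Ma,Yuetian Lu,Shuo Chen,Zifeng Ding,Volker Tresp
Affiliations: Technical University of Munich; Ludwig Maximilian University of Munich; University of Cambridge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: This paper is a preprint under review. arXiv admin note: text overlap with arXiv:2411.08212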
Abstract:Mixture-of-Experts (MoE) benefits from a dynamic routing mechanism among their specialized experts, which existing Parameter-Efficient Fine-Tuning (PEFT) strategies fail to leverage. This motivates us to investigate whether adaptation modules themselves should incorporate routing mechanisms to align with MoE’s multi-expert architecture. We analyze dynamics of core components when applying PEFT to MoE language models and examine how different routing strategies affect adaptation effectiveness. Extensive experiments adapting OLMoE-1B-7B and Mixtral-8x7B on various commonsense and math reasoning tasks validate the performance and efficiency of our routed approach. We identify the optimal configurations for different scenarios and provide empirical analyses with practical insights to facilitate better PEFT and MoE applications.
[NLP-8] MArgE: Meshing Argumentative Evidence from Multiple Large Language Models for Justifiable Claim Verification
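Quick read: This paper addresses the errors (such as hallucinations) that Large Language Models (LLMs) make, in settings where multiple LLMs are combined. For tasks that need trustworthy reasoning, existing multi-LLM methods based on unstructured interaction (such as free debate) struggle to provide faithful, inspectable justifications. The key to the solution is the MArgE framework: using a variant of Argumentative LLMs (ArgLLMs), which are driven by frameworks and semantics from computational argumentation, it turns each LLM's output into a structured argument tree, giving claim verification an inspectable path from the initial arguments to the final decision and a faithful justification of that decision.
Link: https://arxiv.org/abs/2508.02584
Authors: Ming Pok Ng,Junqi Jiang,Gabriel Freedman,Antonio Rago,Francesca Toni
Affiliations: Imperial College London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: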
Abstract:Leveraging outputs from multiple large language models (LLMs) is emerging as a method for harnessing their power across a wide range of tasks while mitigating their capacity for making errors, e.g., hallucinations. However, current approaches to combining insights from multiple LLMs often involve unstructured interactions (e.g., free debate), resulting in model generations that are not faithfully justifiable. In this work, we introduce MArgE, a novel framework to provide formal structure to the evidence from each LLM, in the form of a tree of extracted arguments, for the task of claim verification. We use a variant of Argumentative LLMs (ArgLLMs), i.e. LLMs driven by frameworks and semantics from the field of computational argumentation, to construct structured argument trees for given claims. This process creates an inspectable pathway from the initial arguments to the final claim verification decisions, providing a faithful justification thereof. We show experimentally that MArgE can significantly outperform single LLMs, including three open-source models (4B to 8B parameters), GPT-4o-mini and existing ArgLLMs, as well as prior methods for unstructured multi-LLM debates. We thus demonstrate the advantages of incorporating formal, argumentative reasoning mechanisms when combining multiple LLM outputs.
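[NLP-9] EHSAN: Leveraging ChatGPT in a Hybrid Framework for Arabic Aspect-Based Sentiment Analysis in Healthcare
Quick read: This paper addresses the difficulty of automatically assessing Arabic-language patient feedback, which stems from dialect diversity and scarce fine-grained sentiment labels; Aspect-Based Sentiment Analysis (ABSA) in healthcare in particular remains under-studied. The key to the solution is EHSAN, a data-centric hybrid pipeline that combines ChatGPT pseudo-labelling with targeted human review to build the first explainable Arabic aspect-based sentiment dataset for healthcare, with a ChatGPT-generated rationale attached to each label for transparency. Experiments show that Transformer-based models stay highly accurate even when trained only on machine-generated labels, indicating that the method reduces manual annotation cost while remaining robust and scalable.
Link: https://arxiv.org/abs/2508.02574
Authors: Eman Alamoudi,Ellis Solaiman
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: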
Abstract:Arabic-language patient feedback remains under-analysed because dialect diversity and scarce aspect-level sentiment labels hinder automated assessment. To address this gap, we introduce EHSAN, a data-centric hybrid pipeline that merges ChatGPT pseudo-labelling with targeted human review to build the first explainable Arabic aspect-based sentiment dataset for healthcare. Each sentence is annotated with an aspect and sentiment label (positive, negative, or neutral), forming a pioneering Arabic dataset aligned with healthcare themes, with ChatGPT-generated rationales provided for each label to enhance transparency. To evaluate the impact of annotation quality on model performance, we created three versions of the training data: a fully supervised set with all labels reviewed by humans, a semi-supervised set with 50% human review, and an unsupervised set with only machine-generated labels. We fine-tuned two transformer models on these datasets for both aspect and sentiment classification. Experimental results show that our Arabic-specific model achieved high accuracy even with minimal human supervision, reflecting only a minor performance drop when using ChatGPT-only labels. Reducing the number of aspect classes notably improved classification metrics across the board. These findings demonstrate an effective, scalable approach to Arabic aspect-based sentiment analysis (SA) in healthcare, combining large language model annotation with human expertise to produce a robust and explainable dataset. Future directions include generalisation across hospitals, prompt refinement, and interpretable data-driven modelling.
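[NLP-10] Guess or Recall? Training CNNs to Classify and Localize Memorization in LLMs
Quick read: This paper addresses the unclear classification of the mechanisms behind verbatim memorization in Large Language Models (LLMs), and in particular the poor alignment between the existing taxonomy and attention mechanisms. The key to the solution is a new method that trains Convolutional Neural Networks (CNNs) on the LLM's attention weights and uses them to measure how well different forms of memorization can be told apart. From this the authors build a new three-category taxonomy that aligns closely with the attention weights: memorized samples that are guessed using language modeling abilities, memorized samples that are recalled because they are highly duplicated in the training set, and non-memorized samples. The method shows that few-shot verbatim memorization does not correspond to a distinct attention mechanism and that a significant share of extractable samples are in fact guessed and should be studied separately. The authors also develop a custom visual interpretability technique to localize the attention regions involved in each form of memorization.
Link: https://arxiv.org/abs/2508.02573
Authors: Jérémie Dentan,Davide Buscaldi,Sonia Vanier
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: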
Abstract:Verbatim memorization in Large Language Models (LLMs) is a multifaceted phenomenon involving distinct underlying mechanisms. We introduce a novel method to analyze the different forms of memorization described by the existing taxonomy. Specifically, we train Convolutional Neural Networks (CNNs) on the attention weights of the LLM and evaluate the alignment between this taxonomy and the attention weights involved in decoding. We find that the existing taxonomy performs poorly and fails to reflect distinct mechanisms within the attention blocks. We propose a new taxonomy that maximizes alignment with the attention weights, consisting of three categories: memorized samples that are guessed using language modeling abilities, memorized samples that are recalled due to high duplication in the training set, and non-memorized samples. Our results reveal that few-shot verbatim memorization does not correspond to a distinct attention mechanism. We also show that a significant proportion of extractable samples are in fact guessed by the model and should therefore be studied separately. Finally, we develop a custom visual interpretability technique to localize the regions of the attention weights involved in each form of memorization.
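[NLP-11] Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction
Quick read: This paper addresses the quadratic computational complexity and memory overhead that Diffusion Large Language Models (dLLMs) incur at inference time, and in particular the bottleneck that existing caching techniques create by storing full-layer states, which consumes a lot of memory and limits long-context use. The key to the solution is Sparse-dLLM, a training-free framework that integrates dynamic cache eviction with sparse attention. Through delayed bidirectional sparse caching, it exploits the stability of token saliency across decoding steps, retaining pivotal tokens and using an attention-guided strategy to evict low-relevance prefix/suffix cache entries. This yields up to 10× higher throughput with comparable performance and similar peak memory.
Link: https://arxiv.org/abs/2508.02558
Authors: Yuerong Song,Xiaoran Liu,Ruixiao Li,Zhigeng Liu,Zengfeng Huang,Qipeng Guo,Ziwei He,Xipeng Qiu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 6 figures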
Abstract:Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining salient across decoding steps and low-relevance tokens staying unimportant, motivating selective cache eviction. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on LLaDA and Dream series demonstrate Sparse-dLLM achieves up to 10× higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness.
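The general idea of evicting low-saliency cache entries can be pictured with a generic top-k filter over per-token scores. This sketch is not the paper's delayed bidirectional sparse caching: `evict_cache`, `keep_ratio`, and the toy scores are assumptions introduced only to illustrate attention-guided selection.

```python
def evict_cache(cache, saliency, keep_ratio=0.5):
    """Keep the most salient cache entries, preserving their original order.

    cache:      list of cached entries, one per token position
    saliency:   list of scores (for example, aggregated attention), one per entry
    keep_ratio: fraction of entries to retain (assumed hyperparameter)
    Returns (kept_positions, kept_entries).
    """
    if len(cache) != len(saliency):
        raise ValueError("cache and saliency must have the same length")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep = max(1, int(len(cache) * keep_ratio))
    # Rank positions by descending saliency, then restore positional order.
    ranked = sorted(range(len(cache)), key=lambda i: saliency[i], reverse=True)
    kept_positions = sorted(ranked[:keep])
    return kept_positions, [cache[i] for i in kept_positions]


# Toy scores, made up for illustration.
cache = ["k0", "k1", "k2", "k3", "k4", "k5"]
saliency = [0.30, 0.02, 0.25, 0.01, 0.40, 0.02]
positions, kept = evict_cache(cache, saliency, keep_ratio=0.5)
print(positions, kept)
```

Because the abstract reports that token saliency is stable across decoding steps, a selection like this would not need to be recomputed from scratch at every step; how and when the paper refreshes it is not specified here.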
[NLP-12] Automated SNOMED CT Concept Annotation in Clinical Text Using Bi-GRU Neural Networks
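Quick read: This paper addresses the automatic annotation of clinical text with standardized medical concepts to support structured data extraction and decision support. The core challenge is that annotating with SNOMED CT requires extensive manual effort and does not scale. The key to the solution is a neural sequence labeling method based on a Bidirectional GRU, combined with domain-adapted SpaCy and SciBERT-based tokenization, which segments sentences into overlapping 19-token chunks enriched with contextual, syntactic, and morphological features. The method reaches a 90% F1-score on the validation set of a MIMIC-IV subset, surpassing traditional rule-based systems and matching or exceeding existing neural models. It also handles ambiguous terms and misspellings effectively and costs significantly less compute than Transformer architectures, which favors real-world deployment.
Link: https://arxiv.org/abs/2508.02556
Authors: Ali Noori,Pratik Devkota,Somya Mohanty,Prashanti Manda
Affiliations: University of North Carolina Greensboro; Fractal Analytics; United HealthGroup; University of Nebraska Omaha
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: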
Abstract:Automated annotation of clinical text with standardized medical concepts is critical for enabling structured data extraction and decision support. SNOMED CT provides a rich ontology for labeling clinical entities, but manual annotation is labor-intensive and impractical at scale. This study introduces a neural sequence labeling approach for SNOMED CT concept recognition using a Bidirectional GRU model. Leveraging a subset of MIMIC-IV, we preprocess text with domain-adapted SpaCy and SciBERT-based tokenization, segmenting sentences into overlapping 19-token chunks enriched with contextual, syntactic, and morphological features. The Bi-GRU model assigns IOB tags to identify concept spans and achieves strong performance with a 90 percent F1-score on the validation set. These results surpass traditional rule-based systems and match or exceed existing neural models. Qualitative analysis shows effective handling of ambiguous terms and misspellings. Our findings highlight that lightweight RNN-based architectures can deliver high-quality clinical concept annotation with significantly lower computational cost than transformer-based models, making them well-suited for real-world deployment.
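Two preprocessing ideas named in this abstract, overlapping 19-token chunks and IOB tags over concept spans, can be sketched in plain Python. The stride value, the function names, and the example sentence and spans are assumptions for illustration; the abstract does not state how much the chunks overlap.

```python
def overlapping_chunks(tokens, size=19, stride=10):
    """Split tokens into windows of `size` tokens that start every `stride` tokens.

    The stride is an assumed value; any stride smaller than size gives overlap.
    """
    if size <= 0 or stride <= 0 or stride > size:
        raise ValueError("require 0 < stride <= size")
    if len(tokens) <= size:
        return [list(tokens)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
        start += stride
    return chunks


def iob_tags(num_tokens, spans):
    """Build IOB tags from (start, end) token spans, with end exclusive."""
    tags = ["O"] * num_tokens
    for start, end in spans:
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"invalid span: {(start, end)}")
        tags[start] = "B"                # first token of a concept
        for i in range(start + 1, end):
            tags[i] = "I"                # continuation of the same concept
    return tags


# Made-up example sentence and concept spans.
sentence = "patient denies chest pain and shortness of breath".split()
print(iob_tags(len(sentence), [(2, 4), (5, 8)]))
print([len(c) for c in overlapping_chunks(list(range(40)))])
```

A sequence labeler such as the Bi-GRU described above would then be trained to predict one of these tags per token within each chunk.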
[NLP-13] Building and Aligning Comparable Corpora
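Quick read: This paper addresses how to build and align cross-lingual comparable corpora for multilingual natural language processing when no parallel text is available. The key to the solution is an automatic alignment method based on Cross-Lingual Latent Semantic Indexing (CL-LSI). Compared with a similarity measure that relies on a bilingual dictionary, CL-LSI aligns documents better at the topic level and even at the event level, so documents in different languages about the same topic or event can be matched automatically.
Link: https://arxiv.org/abs/2508.02555
Authors: Motaz Saad,David Langlois,Kamel Smaili
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 27 pages, 11 figures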
Abstract:A comparable corpus is a set of topic-aligned documents in multiple languages, which are not necessarily translations of each other. These documents are useful for multilingual natural language processing when there is no parallel text available in some domains or languages. In addition, comparable documents are informative because they can tell what is being said about a topic in different languages. In this paper, we present a method to build comparable corpora from the Wikipedia encyclopedia and the EURONEWS website in English, French and Arabic. We further experiment with a method to automatically align comparable documents using cross-lingual similarity measures. We investigate two cross-lingual similarity measures to align comparable documents. The first measure is based on a bilingual dictionary, and the second measure is based on Latent Semantic Indexing (LSI). Experiments on several corpora show that the Cross-Lingual LSI (CL-LSI) measure outperforms the dictionary-based measure. Finally, we collect English and Arabic news documents from the British Broadcasting Corporation (BBC) and from the ALJAZEERA (JSC) news website respectively. Then we use the CL-LSI similarity measure to automatically align comparable documents of BBC and JSC. The evaluation of the alignment shows that CL-LSI is not only able to align cross-lingual documents at the topic level, but also able to do so at the event level.
[NLP-14] What are you sinking? A geometric approach on attention sink
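Quick read: This paper asks what the attention sink (AS) phenomenon in Transformer attention really is, that is, why certain tokens (often special tokens or positional anchors) persistently attract a disproportionate amount of attention. It argues that AS is not an architectural artifact but the manifestation of a geometric principle: establishing stable reference frames in high-dimensional representation spaces. The key is the identification of three reference frame types (centralized, distributed, and bidirectional) and the finding that they emerge in the earliest stages of training to set up stable coordinate systems, which explains how AS arises and how it relates to architectural components such as position encoding. This view gives a geometric basis for understanding Transformer attention and offers guidance for model design.
Link: https://arxiv.org/abs/2508.02546
Authors: Valeria Ruscio,Umberto Nanni,Fabrizio Silvestri
Affiliations: Sapienza University of Rome
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: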
Abstract:Attention sink (AS) is a consistent pattern in transformer attention maps where certain tokens (often special tokens or positional anchors) disproportionately attract attention from other tokens. We show that in transformers, AS is not an architectural artifact, but it is the manifestation of a fundamental geometric principle: the establishment of reference frames that anchor representational spaces. We analyze several architectures and identify three distinct reference frame types, centralized, distributed, and bidirectional, that correlate with the attention sink phenomenon. We show that they emerge during the earliest stages of training as optimal solutions to the problem of establishing stable coordinate systems in high-dimensional spaces. We show the influence of architecture components, particularly position encoding implementations, on the specific type of reference frame. This perspective transforms our understanding of transformer attention mechanisms and provides insights for both architecture design and the relationship with AS.
[NLP-15] What's in the News? Towards Identification of Bias by Commission, Omission, and Source Selection (COSS)
Quick read: This paper addresses the difficulty of identifying bias in news, specifically bias by commission, omission, and source selection (COSS). Previous work handles these bias types separately; this paper instead proposes a methodology that identifies them automatically as a joint three-fold objective within a single pipeline. The key is to extract features and patterns of text reuse and to build on them a visualization that helps readers spot and understand bias in news coverage.
Link: https://arxiv.org/abs/2508.02540
Authors: Anastasia Zhukova,Terry Ruas,Felix Hamborg,Karsten Donnay,Bela Gipp
Affiliations: University of Wuppertal; University of Göttingen; Heidelberg Academy of Sciences and Humanities; University of Zurich
Categories: Computation and Language (cs.CL)
Comments: Published in the Proceedings of the 2023 ACM/IEEE Joint Conference on Digital Libraries
Abstract:In a world overwhelmed with news, determining which information comes from reliable sources or how neutral is the reported information in the news articles poses a challenge to news readers. In this paper, we propose a methodology for automatically identifying bias by commission, omission, and source selection (COSS) as a joint three-fold objective, as opposed to the previous work separately addressing these types of bias. In a pipeline concept, we describe the goals and tasks of its steps toward bias identification and provide an example of a visualization that leverages the extracted features and patterns of text reuse.
[NLP-16] Contextual Graph Transformer: A Small Language Model for Enhanced Engineering Document Information Extraction
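Quick read: This paper addresses the weakness of standard Transformer-based language models at modeling fine-grained syntax and entity relationships in complex technical documents. The core solution is a hybrid neural architecture, the Contextual Graph Transformer (CGT). It builds a dynamic graph over the input tokens with sequential, skip-gram, and semantic similarity edges, processes it with GATv2Conv layers for local structure learning, and then passes the enriched embeddings to a Transformer encoder to capture global dependencies, jointly modeling structural token interactions and long-range semantic coherence in technical text.
Link: https://arxiv.org/abs/2508.02532
Authors: Karan Reddy,Mayukha Pal
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: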
Abstract:Standard transformer-based language models, while powerful for general text, often struggle with the fine-grained syntax and entity relationships in complex technical and engineering documents. To address this, we propose the Contextual Graph Transformer (CGT), a hybrid neural architecture that combines Graph Neural Networks (GNNs) and Transformers for domain-specific question answering. CGT constructs a dynamic graph over input tokens using sequential, skip-gram, and semantic similarity edges, which is processed by GATv2Conv layers for local structure learning. These enriched embeddings are then passed to a Transformer encoder to capture global dependencies. Unlike generic large models, technical domains often require specialized language models with stronger contextualization and structure awareness. CGT offers a parameter-efficient solution for such use cases. Integrated into a Retrieval-Augmented Generation (RAG) pipeline, CGT outperforms baselines like GPT-2 and BERT, achieving 24.7% higher accuracy than GPT-2 with 62.4% fewer parameters. This gain stems from CGT's ability to jointly model structural token interactions and long-range semantic coherence. The model is trained from scratch using a two-phase approach: pretraining on general text followed by fine-tuning on domain-specific manuals. This highlights CGT's adaptability to technical language, enabling better grounding, entity tracking, and retrieval-augmented responses in real-world applications.
[NLP-17] I Have No Mouth and I Must Rhyme: Uncovering Internal Phonetic Representations in LLaMA 3.2
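Quick read: This paper asks how Large Language Models (LLMs), without explicit phonetic or auditory grounding, internally represent phoneme information in order to perform phonetic tasks such as rhyming. The key findings are evidence of high-level organization of phoneme representations in the latent space, with a vowel layout resembling the standard IPA vowel chart, and the identification of a "phoneme mover head" that promotes phonetic information during rhyming tasks. Together these suggest that the model builds phonetic knowledge implicitly, with no direct supervision to do so.
Link: https://arxiv.org/abs/2508.02527
Authors: Jack Merullo,Arjun Khurana,Oliver McLaughlin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: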
Abstract:Large language models demonstrate proficiency on phonetic tasks, such as rhyming, without explicit phonetic or auditory grounding. In this work, we investigate how Llama-3.2-1B-Instruct represents token-level phonetic information. Our results suggest that Llama uses a rich internal model of phonemes to complete phonetic tasks. We provide evidence for high-level organization of phoneme representations in its latent space. In doing so, we also identify a "phoneme mover head" which promotes phonetic information during rhyming tasks. We visualize the output space of this head and find that, while notable differences exist, Llama learns a model of vowels similar to the standard IPA vowel chart for humans, despite receiving no direct supervision to do so.
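[NLP-18] PoeTone: A Framework for Constrained Generation of Structured Chinese Songci with LLMs
【Quick Read】: This paper addresses how well large language models (LLMs) can satisfy formal constraints when generating Songci, a classical Chinese poetry form with strict structural, tonal, and rhyme rules. The core challenge is that Songci composition follows specific Cipai templates that tightly regulate line structure, tonal patterns, and rhyme, and existing LLMs often struggle to balance content quality with formal compliance. The key to the solution is a multi-faceted evaluation framework covering a formal conformity score, LLM-based automated quality assessment, human evaluation, and classification-based probing tasks. Building on it, the authors propose a Generate-Critic architecture in which the evaluation framework acts as an automated critic whose feedback serves as a reward signal for supervised fine-tuning (SFT), improving the formal conformity of three lightweight open-source models by up to 5.88%.
Link: https://arxiv.org/abs/2508.02515
Authors: Zhan Qu,Shuzhou Yuan,Michael Färber
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: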
Abstract:This paper presents a systematic investigation into the constrained generation capabilities of large language models (LLMs) in producing Songci, a classical Chinese poetry form characterized by strict structural, tonal, and rhyme constraints defined by Cipai templates. We first develop a comprehensive, multi-faceted evaluation framework that includes: (i) a formal conformity score, (ii) automated quality assessment using LLMs, (iii) human evaluation, and (iv) classification-based probing tasks. Using this framework, we evaluate the generative performance of 18 LLMs, including 3 proprietary models and 15 open-source models across four families, under five prompting strategies: zero-shot, one-shot, completion-based, instruction-tuned, and chain-of-thought. Finally, we propose a Generate-Critic architecture in which the evaluation framework functions as an automated critic. Leveraging the critic’s feedback as a reward signal, we fine-tune three lightweight open-source LLMs via supervised fine-tuning (SFT), resulting in improvements of up to 5.88% in formal conformity. Our findings offer new insights into the generative strengths and limitations of LLMs in producing culturally significant and formally constrained literary texts.
[NLP-19] Modular Arithmetic: Language Models Solve Math Digit by Digit
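【Quick Read】: This paper addresses the lack of a unified understanding of the internal mechanisms that large language models (LLMs) use for simple arithmetic. The key contribution is evidence for digit-position-specific circuits: modular subgroups of multilayer perceptron (MLP) neurons that operate independently on different digit positions (units, tens, hundreds) and that exist regardless of model size or tokenization strategy (whether numbers are encoded digit by digit or as a single token). Using feature importance analysis and causal interventions, the authors identify and validate these circuits and demonstrate their causal role in arithmetic reasoning, yielding an interpretable, compositional picture of the internal computation.
Link: https://arxiv.org/abs/2508.02513
Authors: Tanja Baeumel,Daniil Gurgurov,Yusser al Ghussin,Josef van Genabith,Simon Ostermann
Affiliations: German Research Center for AI (DFKI); Saarland University; Center for European Research in Trusted AI (CERTAIN)
Categories: Computation and Language (cs.CL)
Comments: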
Abstract:While recent work has begun to uncover the internal strategies that Large Language Models (LLMs) employ for simple arithmetic tasks, a unified understanding of their underlying mechanisms is still lacking. We extend recent findings showing that LLMs represent numbers in a digit-wise manner and present evidence for the existence of digit-position-specific circuits that LLMs use to perform simple arithmetic tasks, i.e. modular subgroups of MLP neurons that operate independently on different digit positions (units, tens, hundreds). Notably, such circuits exist independently of model size and of tokenization strategy, i.e. both for models that encode longer numbers digit-by-digit and as one token. Using Feature Importance and Causal Interventions, we identify and validate the digit-position-specific circuits, revealing a compositional and interpretable structure underlying the solving of arithmetic problems in LLMs. Our interventions selectively alter the model’s prediction at targeted digit positions, demonstrating the causal role of digit-position circuits in solving arithmetic tasks.
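[NLP-20] Test-time Prompt Intervention
【Quick Read】: This paper targets redundancy in the chains of thought (CoTs) that large language models (LLMs) generate during reasoning, such as unnecessary verification steps and repetitive reasoning shifts. The cause is post-training that relies too heavily on outcome rewards, because process-reward data covering intermediate steps is hard to build at scale. The key to the solution is PI (Prompt Intervention), a framework that intervenes dynamically in the reasoning process through three modules: a When module that decides the timing of an intervention, a How module that determines its form, and a Which module that handles post-intervention sampling of reasoning paths. This lets human problem-solving expertise and cognitive science principles be built into LLM reasoning, substantially shortening CoTs while reducing hallucination and making reasoning more concise and reliable.
Link: https://arxiv.org/abs/2508.02511
Authors: Chenxu Yang,Qingyi Si,Mz Dai,Dingyu Yao,Mingyu Zheng,Minghui Chen,Zheng Lin,Weiping Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 23 pages, 16 figures, under review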
Abstract:Test-time compute has led to remarkable success in the large language model (LLM) community, particularly for complex tasks, where longer chains of thought (CoTs) are generated to enhance reasoning capabilities. However, growing evidence reveals that such reasoning models often produce CoTs plagued by excessive redundancy, including unnecessary verification steps and repetitive reasoning shifts. The root cause lies in their post-training, which relies too heavily on outcome reward paradigms, because data for process reward paradigms, which regulate intermediate reasoning steps, is difficult to construct at scale. To address this, we propose PI, a novel framework for Test-time Prompt Intervention. PI provides an interface to dynamically guide and regulate reasoning paths during inference through timely (When module) and proper (How module) interventions and post-intervention sampling (Which module). This allows human problem-solving expertise and cognitive science principles to be seamlessly integrated into LLMs’ reasoning processes, enhancing controllability and interpretability. Extensive experiments across multiple models and datasets demonstrate that PI significantly shortens CoTs while reducing hallucination, yielding more concise and reliable reasoning.
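[NLP-21] OptiHive: Ensemble Selection for LLM-Based Optimization via Statistical Modeling
【Quick Read】: This paper addresses the unreliability and high latency of large language model (LLM)-based solvers for automated modeling and solving of optimization problems, in particular the inefficiency caused by relying on iterative self-correction. The key is OptiHive, a framework that uses a single batched LLM query to generate diverse components (solvers, problem instances, and validation tests) and then applies a statistical model to infer how well those components actually perform. This yields interpretable outputs, uncertainty quantification, and selection of the best solver without any iterative repair loop, and it substantially improves solution quality on complex optimization tasks (for example, raising the optimality rate from 5% to 92%).
Link: https://arxiv.org/abs/2508.02503
Authors: Maxime Bouscary,Saurabh Amin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: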
Abstract:LLM-based solvers have emerged as a promising means of automating problem modeling and solving. However, they remain unreliable and often depend on iterative repair loops that result in significant latency. We introduce OptiHive, an LLM-based framework that produces high-quality solvers for optimization problems from natural-language descriptions without iterative self-correction. OptiHive uses a single batched LLM query to generate diverse components (solvers, problem instances, and validation tests) and filters out erroneous components to ensure fully interpretable outputs. Taking into account the imperfection of the generated components, we employ a statistical model to infer their true performance, enabling principled uncertainty quantification and solver selection. On tasks ranging from traditional optimization problems to challenging variants of the Multi-Depot Vehicle Routing Problem, OptiHive significantly outperforms baselines, increasing the optimality rate from 5% to 92% on the most complex problems.
[NLP-22] From Monolingual to Bilingual: Investigating Language Conditioning in Large Language Models for Psycholinguistic Tasks
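【Quick Read】: This paper asks how large language models (LLMs) encode psycholinguistic knowledge under different linguistic identities, and whether their outputs and internal representations show human-like psycholinguistic responses. The approach uses two tasks, sound symbolism and word valence, and runs monolingual and bilingual prompting experiments in English, Dutch, and Chinese on Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct, combining behavioral analysis with probing analysis. The models adjust their outputs according to the prompted language identity, psycholinguistic signals are easier to decode in deeper layers, and Chinese prompts yield stronger and more stable valence representations, showing that language identity conditions both output behavior and internal representations.
Link: https://arxiv.org/abs/2508.02502
Authors: Shuzhou Yuan,Zhan Qu,Mario Tawfelis,Michael Färber
Affiliations: ScaDS.AI; TU Dresden
Categories: Computation and Language (cs.CL)
Comments: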
Abstract:Large Language Models (LLMs) exhibit strong linguistic capabilities, but little is known about how they encode psycholinguistic knowledge across languages. We investigate whether and how LLMs exhibit human-like psycholinguistic responses under different linguistic identities using two tasks: sound symbolism and word valence. We evaluate two models, Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct, under monolingual and bilingual prompting in English, Dutch, and Chinese. Behaviorally, both models adjust their outputs based on prompted language identity, with Qwen showing greater sensitivity and sharper distinctions between Dutch and Chinese. Probing analysis reveals that psycholinguistic signals become more decodable in deeper layers, with Chinese prompts yielding stronger and more stable valence representations than Dutch. Our results demonstrate that language identity conditions both output behavior and internal representations in LLMs, providing new insights into their application as models of cross-linguistic cognition.
[NLP-23] Monsoon Uprising in Bangladesh: How Facebook Shaped Collective Identity
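【Quick Read】: This paper asks how social media platforms, Facebook in particular, help protesters form and reinforce a collective identity under government repression, and thereby support political mobilization. The key contribution is showing how multimodal expression on Facebook (images, memes, videos, hashtags, and satirical content) combines visual rhetoric, verbal discourse, and digital parody to build a shared symbol system and a culture of resistance, which unites the group and challenges official narratives. The study identifies the symbolic use of red, the satirical reframing of the term "Razakar", and the visual circulation of themes of courage, injustice, and resistance as the core mechanisms behind a resilient collective identity.
Link: https://arxiv.org/abs/2508.02498
Authors: Md Tasin Abir,Arpita Chowdhury,Ashfia Rahman
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 9 figures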
Abstract:This study investigates how Facebook shaped collective identity during the July 2024 pro-democracy uprising in Bangladesh, known as the Monsoon Uprising. During government repression, protesters turned to Facebook as a central space for resistance, where multimodal expressions such as images, memes, videos, hashtags, and satirical posts played an important role in unifying participants. Using a qualitative approach, this research analyzes visual rhetoric, verbal discourse, and digital irony to reveal how shared symbols, protest art, and slogans built a sense of solidarity. Key elements included the symbolic use of red, the ironic metaphorical use of the term “Razakar”, and the widespread sharing of visuals representing courage, injustice, and resistance. The findings show that the combination of visual and verbal strategies on Facebook not only mobilized public sentiment, but also built a strong collective identity that challenged authoritarian narratives. This study aims to demonstrate how online platforms can serve as powerful tools for identity construction and political mobilization in the digital age.
[NLP-24] AIAP: A No-Code Workflow Builder for Non-Experts with Natural Language and Multi-Agent Collaboration
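【Quick Read】: This paper addresses two challenges that non-expert users face when designing artificial intelligence (AI) services: expressing their intent clearly and managing system complexity. The key is AIAP, a no-code platform that combines natural language input with visual workflows and uses a coordinated multi-agent system to decompose ambiguous user instructions into modular, actionable steps hidden behind a unified interface. A user study shows that AIAP's AI-generated suggestions, modular workflows, and automatic identification of data, actions, and context significantly improve users' ability to develop AI services intuitively.
Link: https://arxiv.org/abs/2508.02470
Authors: Hyunjn An,Yongwon Kim,Wonduk Seo,Joonil Park,Daye Kang,Changhoon Oh,Dokyun Kim,Seunghyun Lee
Affiliations: Enhans; Kaywon University of Art & Design; Yonsei University
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: 14 pages, 6 figures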
Abstract:While many tools are available for designing AI, non-experts still face challenges in clearly expressing their intent and managing system complexity. We introduce AIAP, a no-code platform that integrates natural language input with visual workflows. AIAP leverages a coordinated multi-agent system to decompose ambiguous user instructions into modular, actionable steps, hidden from users behind a unified interface. A user study involving 32 participants showed that AIAP’s AI-generated suggestions, modular workflows, and automatic identification of data, actions, and context significantly improved participants’ ability to develop services intuitively. These findings highlight that natural language-based visual programming significantly reduces barriers and enhances user experience in AI service design.
[NLP-25] LatentPrompt: Optimizing Promts in Latent Space
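【Quick Read】: This paper addresses the reliance of prompt optimization for large language models (LLMs) on heuristics or manual exploration. The key is LatentPrompt, a model-agnostic framework that uses a latent semantic space to automatically generate, evaluate, and refine candidate prompts without hand-crafted rules. Specifically, it embeds a set of seed prompts in a continuous latent space and searches that space systematically for prompts that maximize task-specific performance, achieving an accuracy gain of about 3% on the Financial PhraseBank sentiment classification benchmark.
Link: https://arxiv.org/abs/2508.02452
Authors: Mateusz Bystroński,Grzegorz Piotrowski,Nitesh V. Chawla,Tomasz Kajdanowicz
Affiliations: Wrocław University of Science and Technology; University of Notre Dame
Categories: Computation and Language (cs.CL)
Comments: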
Abstract:Recent advances have shown that optimizing prompts for Large Language Models (LLMs) can significantly improve task performance, yet many optimization techniques rely on heuristics or manual exploration. We present LatentPrompt, a model-agnostic framework for prompt optimization that leverages latent semantic space to automatically generate, evaluate, and refine candidate prompts without requiring hand-crafted rules. Beginning with a set of seed prompts, our method embeds them in a continuous latent space and systematically explores this space to identify prompts that maximize task-specific performance. In a proof-of-concept study on the Financial PhraseBank sentiment classification benchmark, LatentPrompt increased classification accuracy by approximately 3 percent after a single optimization cycle. The framework is broadly applicable, requiring only black-box access to an LLM and an automatic evaluation metric, making it suitable for diverse domains and tasks.
[NLP-26] AI-Based Measurement of Innovation: Mapping Expert Insight into Large Language Model Applications
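【Quick Read】: This paper addresses the limits that context-specific proxies and manual expert evaluation place on empirical innovation research, in particular the difficulty of running large-scale innovation analyses where high-quality expert data is lacking. The core solution is a large language model (LLM)-based framework, designed and validated by the authors, that reliably approximates domain experts' judgments of innovativeness from unstructured text data. The key lies in systematically tuning design factors such as model selection, prompt engineering, training data size and distribution, and parameter settings, so that the framework outperforms traditional machine learning and deep learning methods and existing text-based measures in both F1-score and consistency of results. This gives R&D personnel, researchers, and reviewers an efficient, stable, and broadly applicable tool for quantifying innovation.
Link: https://arxiv.org/abs/2508.02430
Authors: Robin Nowak,Patrick Figge,Carolin Haeussler
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: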
Abstract:Measuring innovation often relies on context-specific proxies and on expert evaluation. Hence, empirical innovation research is often limited to settings where such data is available. We investigate how large language models (LLMs) can be leveraged to overcome the constraints of manual expert evaluations and assist researchers in measuring innovation. We design an LLM framework that reliably approximates domain experts’ assessment of innovation from unstructured text data. We demonstrate the performance and broad applicability of this framework through two studies in different contexts: (1) the innovativeness of software application updates and (2) the originality of user-generated feedback and improvement ideas in product reviews. We compared the performance (F1-score) and reliability (consistency rate) of our LLM framework against alternative measures used in prior innovation studies, and to state-of-the-art machine learning- and deep learning-based models. The LLM framework achieved higher F1-scores than the other approaches, and its results are highly consistent (i.e., results do not change across runs). This article equips R&D personnel in firms, as well as researchers, reviewers, and editors, with the knowledge and tools to effectively use LLMs for measuring innovation and evaluating the performance of LLM-based innovation measures. In doing so, we discuss the impact of important design decisions (including model selection, prompt engineering, training data size, training data distribution, and parameter settings) on performance and reliability. Given the challenges inherent in using human expert evaluation and existing text-based measures, our framework has important implications for harnessing LLMs as reliable, increasingly accessible, and broadly applicable research tools for measuring innovation.
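The two evaluation measures named in the abstract can be sketched briefly. The F1-score below follows its standard definition; the consistency rate is an assumption on our part (the share of items that receive the same label in every run), since the abstract only says that results do not change across runs. The labels and runs are made-up toy data, not results from the paper.

```python
def f1_score(y_true, y_pred, positive=1):
    """Standard F1: harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def consistency_rate(runs):
    """Assumed definition: fraction of items labelled identically in all runs."""
    items = list(zip(*runs))
    return sum(1 for labels in items if len(set(labels)) == 1) / len(items)

# Toy data: expert labels (1 = innovative) and three repeated LLM runs.
expert = [1, 0, 1, 1, 0, 0]
runs = [
    [1, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 1],
]
print(f1_score(expert, runs[0]))
print(consistency_rate(runs))
```

Reporting both numbers side by side separates agreement with experts from run-to-run stability, which is the distinction the abstract draws between performance and reliability.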
[NLP-27] Learning to Evolve: Bayesian-Guided Continual Knowledge Graph Embedding
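【Quick Read】: This paper addresses catastrophic forgetting in continual knowledge graph embedding (CKGE), where a model loses previously learned knowledge. The key is a new model named BAKE, whose central idea is to treat each new batch of data as a Bayesian posterior update of the model prior, which in theory preserves historical knowledge effectively. It also introduces a continual clustering method that constrains the evolution difference (or change amplitude) between new and old knowledge across time snapshots, further reducing forgetting.
Link: https://arxiv.org/abs/2508.02426
Authors: Linyu Li,Zhi Jin,Yuanpeng He,Dongming Jin,Yichi Zhang,Haoran Duan,Nyima Tash
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: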
Abstract:Knowledge graphs (KGs) continue to evolve in real scenarios, yet traditional KGE models are only suitable for static knowledge graphs. Therefore, continual knowledge graph embedding (CKGE) has attracted the attention of researchers. Currently, a key challenge facing CKGE is that the model is prone to “catastrophic forgetting”, resulting in the loss of previously learned knowledge. In order to effectively alleviate this problem, we propose a new CKGE model BAKE. First, we note that the Bayesian posterior update principle provides a natural continual learning strategy that is insensitive to data order and can theoretically effectively resist the forgetting of previous knowledge during data evolution. Unlike existing CKGE methods, BAKE regards each batch of new data as a Bayesian update of the model prior. Under this framework, as long as the posterior distribution of the model is maintained, the model can better preserve the knowledge of early snapshots even after evolving through multiple time snapshots. Secondly, we propose a continual clustering method for CKGE, which further directly combats knowledge forgetting by constraining the evolution difference (or change amplitude) between new and old knowledge across different snapshots. We conduct extensive experiments on BAKE on multiple datasets, and the results show that BAKE significantly outperforms existing baseline models.
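The order-insensitivity of Bayesian posterior updates that the abstract relies on can be seen in a toy conjugate model. The sketch below uses a Beta-Bernoulli model, which is our own choice for illustration and has nothing to do with BAKE's actual embedding model; the snapshots are made-up data.

```python
from itertools import permutations

def update(prior, batch):
    """Conjugate Beta-Bernoulli update.

    prior is (alpha, beta); batch is a list of 0/1 outcomes.
    The posterior after the batch becomes the prior for the next one.
    """
    alpha, beta = prior
    ones = sum(batch)
    return (alpha + ones, beta + len(batch) - ones)

prior = (1, 1)                                   # uniform Beta(1, 1) prior
snapshots = [[1, 1, 0], [0, 0], [1, 0, 1, 1]]    # three toy data batches

# Feed the batches in every possible order and collect the final posterior.
posteriors = set()
for order in permutations(snapshots):
    state = prior
    for batch in order:
        state = update(state, batch)
    posteriors.add(state)

print(posteriors)
```

Every ordering ends in the same posterior because each update only adds counts, so earlier batches are never overwritten. That is the intuition behind treating each new snapshot as a posterior update rather than as a fresh training run.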
[NLP-28] Modality Bias in LVLMs: Analyzing and Mitigating Object Hallucination via Attention Lens
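【Quick Read】: This paper addresses severe object hallucination in large vision-language models (LVLMs), where generated descriptions are inconsistent with the input image. Prior work mostly attributes this to a linguistic prior caused by the scale mismatch between visual encoders and large language models (LLMs). Through in-depth analysis, this paper uncovers a previously overlooked phenomenon: during hallucination, LVLMs may ignore not only visual information but also the textual modality. This behavior, termed modality bias, indicates that the models struggle to attend to the visual and textual modalities at the same time, which leads to a fragmented understanding of user instructions. The key to the solution is a training-free intervention that adjusts the attention weights of textual and visual tokens to balance cross-modal compatibility and align better with user intent, combined with a contrastive decoding strategy that reduces the model's overreliance on its parametric knowledge and complements the attention adjustment. Experiments confirm that modality bias is widespread and show that the method mitigates object hallucination across multiple open-source LVLMs and benchmarks.
Link: https://arxiv.org/abs/2508.02419
Authors: Haohan Zheng,Zhenguo Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: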
Abstract:Large vision-language models (LVLMs) have demonstrated remarkable multimodal comprehension and reasoning capabilities, but they still suffer from severe object hallucination. Previous studies primarily attribute the flaw to linguistic prior caused by the scale mismatch between visual encoders and large language models (LLMs) in LVLMs. Specifically, as current LVLMs are built upon LLMs, they tend to over-rely on textual prompts and internal knowledge of LLMs, generating descriptions inconsistent with visual cues. However, through an in-depth investigation of the hallucination mechanisms, we empirically reveal a previously overlooked phenomenon: LVLMs may ignore not only visual information but also the textual modality during hallucination, a behavior termed modality bias, which indicates that LVLMs struggle to simultaneously attend to both visual and textual modalities, leading to fragmented understanding of user-provided instructions. Based on this observation, we propose a simple yet effective training-free method to mitigate object hallucination. Concretely, we intervene and adjust the attention weights of textual and visual tokens, balancing cross-modal compatibility for better alignment with user intentions. Furthermore, we adopt a contrastive decoding strategy to reduce the LVLM’s overreliance on its parametric knowledge, synergistically enhancing our attention manipulation. Extensive experiments confirm the widespread presence of modality bias in LVLMs. Notably, our method effectively mitigates hallucination across multiple open-source LVLMs and benchmarks, highlighting its generalizability and efficacy.
[NLP-29] CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation
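【Quick Read】: This paper addresses the high memory footprint and low execution efficiency caused by the growing key-value (KV) cache when large language models (LLMs) process long contexts. Existing KV cache compression methods typically rely on heuristic token eviction that uses all attention heads in Grouped Query Attention (GQA) architectures, which ignores functional differences between heads and can evict critical information, degrading model performance. The key to the solution is to first identify, in each layer, the attention heads that can retrieve the initial and final tokens of a prompt as well as important tokens inside the text together with their semantic context, and then use these heads to decide which tokens are important and retain their KV cache pairs. The paper also proposes a layer-adaptive KV cache allocation strategy to make better use of each layer's cache. Experiments show that the method outperforms state-of-the-art approaches on the LongBench and Needle-in-a-Haystack benchmarks.
Link: https://arxiv.org/abs/2508.02401
Authors: Xiaolin Lin,Jingcun Wang,Olga Kondrateva,Yiyu Shi,Bing Li,Grace Li Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: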
Abstract:Recent advances in large language models (LLMs) have significantly boosted long-context processing. However, the increasing key-value (KV) cache size poses critical challenges to memory and execution efficiency. Most KV cache compression methods rely on heuristic token eviction using all attention heads in Grouped Query Attention (GQA)-based LLMs. This method ignores the different functionalities of attention heads, leading to the eviction of critical tokens and thus degrading the performance of LLMs. To address the issue above, instead of using all the attention heads in GQA-based LLMs to determine important tokens as in the previous work, we first identify the attention heads in each layer that are not only capable of retrieving the initial and final tokens of a prompt, but also capable of retrieving important tokens within the text and attending to their surrounding semantic context. Afterwards, we exploit such heads to determine the important tokens and retain their corresponding KV cache pairs. Furthermore, we analyze the cache eviction error of each layer individually and introduce a layer-adaptive KV cache allocation strategy. Experimental results demonstrate the proposed CompressKV consistently outperforms state-of-the-art approaches under various memory budgets on LongBench and Needle-in-a-Haystack benchmarks. Our code is publicly available at: this https URL.
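The idea of scoring tokens with a chosen subset of heads rather than all heads can be sketched in a few lines. This is a rough illustration under our own assumptions, not the CompressKV algorithm: `select_tokens`, the toy attention matrix, the hand-picked head indices, and the fixed budget are all hypothetical, and neither the paper's head-identification procedure nor its layer-adaptive allocation is shown.

```python
def select_tokens(attn, heads, budget):
    """Pick which prompt tokens keep their KV cache entries.

    attn[h][t] is the attention mass head h places on prompt token t.
    Tokens are scored by summing over the chosen heads only, and the
    `budget` highest-scoring token indices are returned in prompt order.
    """
    n_tokens = len(attn[0])
    scores = [sum(attn[h][t] for h in heads) for t in range(n_tokens)]
    ranked = sorted(range(n_tokens), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:budget])

# Toy attention for one layer: 3 heads over 5 prompt tokens.
attn = [
    [0.40, 0.05, 0.05, 0.10, 0.40],   # head 0: mostly first and last token
    [0.10, 0.05, 0.60, 0.15, 0.10],   # head 1: picks up a mid-prompt token
    [0.20, 0.20, 0.20, 0.20, 0.20],   # head 2: uniform, uninformative
]

# Assume heads 0 and 1 were identified as the useful ones for this layer.
kept = select_tokens(attn, heads=[0, 1], budget=3)
print(kept)
```

Leaving the uniform head out of the score keeps it from diluting the ranking, which is the motivation the abstract gives for not using all heads.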
[NLP-30] Six Guidelines for Trustworthy, Ethical and Responsible Automation Design
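【Quick Read】: This paper addresses insufficiently calibrated trust in automated systems, where users' reliance on a system does not match its actual reliability, creating safety risks or efficiency losses. The key is a set of six actionable design guidelines that help users assess a system's trustworthiness accurately and thereby trust it in a reasoned, controlled way. The guidelines draw on core principles from human-computer interaction, cognitive psychology, automation research, and ethics, and they specifically bring in "common ground" and the Gricean communication maxims from pragmatics, so that systems provide clear and consistent trustworthiness cues across environments and situations, ultimately supporting safer, more efficient, and more responsible interaction between humans and automated systems.
Link: https://arxiv.org/abs/2508.02371
Authors: Matouš Jelínek,Nadine Schlicker,Ewart de Visser
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: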
Abstract:Calibrated trust in automated systems (Lee and See 2004) is critical for their safe and seamless integration into society. Users should only rely on a system recommendation when it is actually correct and reject it when it is factually wrong. One requirement to achieve this goal is an accurate trustworthiness assessment, ensuring that the user’s perception of the system’s trustworthiness aligns with its actual trustworthiness, allowing users to make informed decisions about the extent to which they can rely on the system (Schlicker et al. 2022). We propose six design guidelines to help designers optimize for accurate trustworthiness assessments, thus fostering ethical and responsible human-automation interactions. The proposed guidelines are derived from existing literature in various fields, such as human-computer interaction, cognitive psychology, automation research, user-experience design, and ethics. We are incorporating key principles from the field of pragmatics, specifically the cultivation of common ground (H. H. Clark 1996) and Gricean communication maxims (Grice 1975). These principles are essential for the design of automated systems because the user’s perception of the system’s trustworthiness is shaped by both environmental contexts, such as organizational culture or societal norms, and by situational context, including the specific circumstances or scenarios in which the interaction occurs (Hoff and Bashir 2015). Our proposed guidelines provide actionable insights for designers to create automated systems that make relevant trustworthiness cues available. This would ideally foster calibrated trust and more satisfactory, productive, and safe interactions between humans and automated systems. Furthermore, the proposed heuristics might work as a tool for evaluating to what extent existing systems enable users to accurately assess a system’s trustworthiness.
[NLP-31] Language Model Guided Reinforcement Learning in Quantitative Trading
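【Quick Read】: This paper addresses the misalignment between short-term decisions and long-term financial goals in algorithmic trading, in particular the limited adoption of reinforcement learning (RL) because of myopic behavior and opaque policy rationale. The key is a hybrid system in which large language models (LLMs) generate high-level trading strategies that guide the action decisions of RL agents, keeping RL's efficient execution while adding the strategic reasoning and multi-modal financial signal interpretation of LLMs. Experiments show that the strategies are judged more sound in expert review and that the guided agents beat unguided RL baselines on risk and return metrics such as Sharpe Ratio (SR) and Maximum Drawdown (MDD).
Link: https://arxiv.org/abs/2508.02366
Authors: Adam Darmanin,Vince Vella
Affiliations: University of Malta
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Trading and Market Microstructure (q-fin.TR)
Comments: 12 pages (4 pages appendix and references), 6 figures, preprint under review for FLLM 2025 conference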
Abstract:Algorithmic trading requires short-term decisions aligned with long-term financial goals. While reinforcement learning (RL) has been explored for such tactical decisions, its adoption remains limited by myopic behavior and opaque policy rationale. In contrast, large language models (LLMs) have recently demonstrated strategic reasoning and multi-modal financial signal interpretation when guided by well-designed prompts. We propose a hybrid system where LLMs generate high-level trading strategies to guide RL agents in their actions. We evaluate (i) the rationale of LLM-generated strategies via expert review, and (ii) the Sharpe Ratio (SR) and Maximum Drawdown (MDD) of LLM-guided agents versus unguided baselines. Results show improved return and risk metrics over standard RL.
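For readers unfamiliar with the two metrics in the abstract, here is a minimal sketch of how they are commonly computed. The definitions are the textbook ones (a per-period Sharpe ratio without annualization, and drawdown as a fraction of the running peak); the paper may use a different convention, and the equity curve is made-up toy data.

```python
from statistics import mean, stdev

def sharpe_ratio(returns, risk_free=0.0):
    """Per-period Sharpe ratio: mean excess return over its sample std dev.

    No annualization factor is applied here; that choice is an assumption.
    """
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess)

def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                    # running maximum so far
        worst = max(worst, (peak - value) / peak)  # current drawdown
    return worst

# Toy equity curve and the simple returns derived from it.
equity = [100.0, 110.0, 99.0, 120.0, 90.0, 108.0]
returns = [b / a - 1 for a, b in zip(equity, equity[1:])]

print(sharpe_ratio(returns))
print(max_drawdown(equity))
```

A higher Sharpe ratio and a lower maximum drawdown are both improvements, so a comparison between guided and unguided agents would look at the two numbers together.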
[NLP-32] Understanding and Mitigating Political Stance Cross-topic Generalization in Large Language Models
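【Quick Read】: This paper addresses the finding that fine-tuning large language models (LLMs) on one political topic unintentionally causes cross-topic generalization of political stance: the model's stance also shifts on other political issues it was never directly trained on. The core gap is a lack of understanding of the internal mechanisms behind this effect, especially at the neuron level. The key to the solution is, first, Political Neuron Localization through Activation Contrasting (PNLAC), which identifies two types of political neurons: general political neurons and topic-specific neurons. The authors then design InhibitFT, an inhibition-based fine-tuning method, and show that selectively inhibiting about 5% of neurons substantially reduces cross-topic stance generalization (by 20% on average) while preserving topic-specific performance.
Link: https://arxiv.org/abs/2508.02360
Authors: Jiayi Zhang,Shu Yang,Junchao Wu,Derek F. Wong,Di Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: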
Abstract:Fine-tuning Large Language Models on a political topic will significantly manipulate their political stance on various issues and unintentionally affect their stance on unrelated topics. While previous studies have raised this issue, there is still a lack of understanding regarding the internal representations of these stances and the mechanisms that lead to unintended cross-topic generalization. In this paper, we systematically explore the internal mechanisms underlying this phenomenon from a neuron-level perspective and how to mitigate the cross-topic generalization of political fine-tuning. Firstly, we propose Political Neuron Localization through Activation Contrasting (PNLAC) to identify two distinct types of political neurons: general political neurons, which govern stance across multiple political topics, and topic-specific neurons that affect the model’s political stance on individual topics. We confirm the existence of these political neuron types across four models and datasets through activation patching experiments. Leveraging these insights, we introduce InhibitFT, an inhibition-based fine-tuning method, effectively mitigating the cross-topic stance generalization. Experimental results demonstrate the robustness of identified neuron types across various models and datasets, and show that InhibitFT significantly reduces the cross-topic stance generalization by 20% on average, while preserving topic-specific performance. Moreover, we demonstrate that selectively inhibiting only 5% of neurons is sufficient to effectively mitigate the cross-topic stance generalization.
[NLP-33] Understanding User Preferences for Interaction Styles in Conversational Recommender Systems: The Predictive Role of System Qualities, User Experience, and Traits
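【Quick Read】: This paper addresses the unclear mechanisms behind user interaction preferences in Conversational Recommender Systems (CRSs), in particular the differences between task-oriented and exploratory interaction and the factors that shape them. The key is a within-subjects experiment with 139 participants, analyzed with logistic regression and clustering, which identifies the core factors driving user preference (enjoyment, usefulness, novelty, and conversational quality) and finds that perceived effectiveness is also positively associated with exploratory preference. Age, gender, and control preference act as moderators, and clustering reveals five latent user groups with distinct dialogue style preferences. The study builds a user modelling framework that integrates affective, cognitive, and trait-level predictors and proposes a predictive and adaptive design approach that adjusts dynamically to user needs, offering a theoretical basis and a practical path toward personalized interaction in conversational AI systems.
Link: https://arxiv.org/abs/2508.02328
Authors: Raj Mahmud,Shlomo Berkovsky,Mukesh Prasad,A. Baki Kocaballi
Affiliations: University of Technology Sydney; Macquarie University
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at OZCHI 2025. 21 pages, 9 figures, 8 tables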
Abstract:Conversational Recommender Systems (CRSs) deliver personalised recommendations through multi-turn natural language dialogue and increasingly support both task-oriented and exploratory interactions. Yet, the factors shaping user interaction preferences remain underexplored. In this within-subjects study (N = 139), participants experienced two scripted CRS dialogues, rated their experiences, and indicated the importance of eight system qualities. Logistic regression revealed that preference for the exploratory interaction was predicted by enjoyment, usefulness, novelty, and conversational quality. Unexpectedly, perceived effectiveness was also associated with exploratory preference. Clustering uncovered five latent user profiles with distinct dialogue style preferences. Moderation analyses indicated that age, gender, and control preference significantly influenced these choices. These findings integrate affective, cognitive, and trait-level predictors into CRS user modelling and inform autonomy-sensitive, value-adaptive dialogue design. The proposed predictive and adaptive framework applies broadly to conversational AI systems seeking to align dynamically with evolving user needs.
[NLP-34] CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
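【Quick Read】: This paper addresses the heavy computational and storage overhead that growing expert parameters cause in Mixture-of-Experts (MoE) large language models, along with the fact that performance gains do not scale proportionally with expert parameters. Existing methods such as expert-level pruning, merging, or decomposition still struggle to balance performance retention with computational efficiency. The key is the "micro-expert", a finer-grained compression unit that moves beyond treating a whole expert as the unit of optimization. On this basis the authors build CAMERA, a lightweight, training-free framework for identifying micro-expert redundancy, which reveals significant variance in micro-expert contributions during decoding, as well as CAMERA-P (structured micro-expert pruning) and CAMERA-Q (mixed-precision quantization for micro-experts). Experiments show that the method outperforms existing baselines on multiple downstream tasks and completes a full micro-expert analysis of Qwen2-57B-A14B in under 5 minutes on a single NVIDIA A100-40GB GPU.
Link: https://arxiv.org/abs/2508.02322
Authors: Yuzhuang Xu,Xu Han,Yuanchi Zhang,Yixuan Wang,Yijun Liu,Shiyu Ji,Qingfu Zhu,Wanxiang Che
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages, 9 figures, 7 tables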
Abstract:Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level ideas. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.
[NLP-35] VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
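Quick read: This paper addresses the challenges of training omni-modal large language models (omni-modal LLMs) at scale, in particular the system-design complexity and limited scalability that arise because different modalities require heterogeneous model architectures. Existing frameworks usually couple model definition with parallel logic, which limits scalability and incurs high engineering overhead. The key to the solution is VeOmni, a model-centric distributed training paradigm whose core innovation is decoupling communication from computation to enable efficient 3D parallelism, together with a flexible configuration interface that supports seamless integration of new modalities with minimal code changes, improving training efficiency and scalability.
Link: https://arxiv.org/abs/2508.02317
Authors: Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, Xin Liu
Affiliations: ByteDance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: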
Abstract: Recent advances in large language models (LLMs) have driven impressive progress in omni-modal understanding and generation. However, training omni-modal LLMs remains a significant challenge due to the heterogeneous model architectures required to process diverse modalities, necessitating sophisticated system design for efficient large-scale training. Existing frameworks typically entangle model definition with parallel logic, incurring limited scalability and substantial engineering overhead for end-to-end omni-modal training. We present VeOmni, a modular and efficient training framework to accelerate the development of omni-modal LLMs. VeOmni introduces model-centric distributed recipes that decouple communication from computation, enabling efficient 3D parallelism on omni-modal LLMs. VeOmni also features a flexible configuration interface supporting seamless integration of new modalities with minimal code change. Using VeOmni, an omni-modal mixture-of-experts (MoE) model with 30B parameters can be trained with over 2,800 tokens/sec/GPU throughput and scale to 160K context lengths via 3D parallelism on 128 GPUs, showcasing its superior efficiency and scalability for training large omni-modal LLMs.
[NLP-36] LaMPE: Length-aware Multi-grained Position Encoding for Adaptive Long-context Scaling Without Training
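Quick read: This paper addresses the sharp performance drop of large language models (LLMs) when the input sequence exceeds the pretraining context window, which stems from the out-of-distribution (OOD) behavior of Rotary Position Embedding (RoPE). Existing methods remap OOD positions into the in-distribution range with fixed mapping strategies, ignoring the dynamic relationship between input length and the model's effective context window. The paper proposes a training-free solution, Length-aware Multi-grained Positional Encoding (LaMPE). First, motivated by the left-skewed frequency distribution of relative positions, it uses a parametric scaled sigmoid function to link mapping length to input length, adaptively allocating positional capacity across input lengths. Second, it introduces a multi-grained attention mechanism that allocates positional resolution across different regions of the sequence to capture both local detail and long-range dependencies. The method applies to RoPE-based LLMs without retraining.
Link: https://arxiv.org/abs/2508.02308
Authors: Sikui Zhang, Guangze Gao, Ziyun Gan, Chunfeng Yuan, Zefeng Lin, Houwen Peng, Bing Li, Weiming Hu
Affiliations: 1. Institute of Automation, Chinese Academy of Sciences; 2. University of Chinese Academy of Sciences; 3. Beijing Key Laboratory of Intelligent Robotics; 4. School of Information Science and Engineering, Central South University
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 9 figures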
Abstract:Large language models (LLMs) experience significant performance degradation when the input exceeds the pretraining context window, primarily due to the out-of-distribution (OOD) behavior of Rotary Position Embedding (RoPE). Recent studies mitigate this problem by remapping OOD positions into the in-distribution range with fixed mapping strategies, ignoring the dynamic relationship between input length and the model’s effective context window. To this end, we propose Length-aware Multi-grained Positional Encoding (LaMPE), a training-free method that fully utilizes the model’s effective context window for adaptive long-context scaling in LLMs. Motivated by the left-skewed frequency distribution of relative positions, LaMPE establishes a dynamic relationship between mapping length and input length through a parametric scaled sigmoid function to adaptively allocate positional capacity across varying input lengths. Meanwhile, LaMPE devises a novel multi-grained attention mechanism that strategically allocates positional resolution across different sequence regions to capture both fine-grained locality and long-range dependencies. Our method can be seamlessly applied to a wide range of RoPE-based LLMs without training. Extensive experiments on three representative LLMs across five mainstream long-context benchmarks demonstrate that LaMPE achieves significant performance improvements compared to existing length extrapolation methods. The code will be released at this https URL.
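As a rough illustration of the "parametric scaled sigmoid" idea mentioned in the abstract, the sketch below maps an input length to a mapping length that grows with the input but saturates at a fixed window. The function name `mapping_length`, the parameter `alpha`, and the functional form `2 * sigmoid(x) - 1` are assumptions chosen for illustration only; the abstract does not state LaMPE's actual formula.

```python
import math


def mapping_length(input_len: int, window: int, alpha: float = 2.0) -> float:
    """Hypothetical scaled sigmoid from input length to mapping length.

    2 * sigmoid(x) - 1 rescales the sigmoid so that it passes through 0 at
    x = 0 and saturates at 1. The result therefore grows with input_len but
    never exceeds `window` (standing in for the effective context window).
    """
    x = alpha * input_len / window
    scaled_sigmoid = 2.0 / (1.0 + math.exp(-x)) - 1.0
    return window * scaled_sigmoid


# Longer inputs receive a larger share of the window, with diminishing returns.
lengths = [1_000, 8_000, 32_000, 128_000]
mapped = [mapping_length(n, window=8_192) for n in lengths]
```

The only property this sketch is meant to show is the shape of the relationship: monotone in the input length and bounded by the window.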
[NLP-37] CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment
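Quick read: This paper addresses the coarse-grained credit assignment of Reinforcement Learning with Verifiable Rewards (RLVR), which relies on rule-based rewards, when it is applied to the reasoning of large language models (LLMs). Current methods usually treat the whole response as a single action and give every token the same reward, which makes it hard to identify which reasoning steps led to success or failure and hurts the efficiency and quality of policy optimization. The authors propose Credit Assignment Policy Optimization (CAPO). Its key idea is to use an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) that produces all step-wise critiques in a single pass, yielding verifiable token-level rewards and more precise credit assignment. Voting mechanisms further improve accuracy and robustness. Experiments show that CAPO outperforms supervised and RL-based fine-tuning methods on several mathematical and out-of-domain benchmarks.
Link: https://arxiv.org/abs/2508.02298
Authors: Guofu Xie, Yunsheng Shi, Hongtao Tian, Ting Yao, Xiao Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress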
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback, helping to mitigate reward hacking. However, current RLVR methods typically treat whole responses as single actions, assigning the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies and inefficient learning. Methods like PPO provide credit assignment through value estimation, but often yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models can provide step-by-step judgments for each reasoning step, but they require high-quality process supervision labels and are time-consuming when applied in online reinforcement learning (RL). To overcome these limitations, we introduce a simple but efficient method Credit Assignment Policy Optimization (CAPO). Given a reasoning response rollout from the policy model, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critique by one pass, thereby providing verifiable token-level rewards to refine the tokens that were originally assigned identical rule-based rewards. This enables more fine-grained credit assignment in an effective way. Furthermore, to enhance the accuracy and robustness of CAPO, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments using different backbones like Llama and Qwen models and in different sizes show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across six challenging mathematical benchmarks and three out-of-domain benchmarks.
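The voting step described in the abstract can be sketched roughly as follows: several critiques each flag the reasoning steps they judge to be wrong, a majority vote decides which flags to trust, and tokens inside the voted-wrong steps have their shared outcome reward reduced. The function names, the majority threshold, and the fixed `penalty` are hypothetical; the abstract does not specify CAPO's actual reward formula.

```python
from collections import Counter


def vote_wrong_steps(critiques: list[set[int]], threshold: float = 0.5) -> set[int]:
    """Return step indices flagged as wrong by more than `threshold` of critiques."""
    counts = Counter(step for flagged in critiques for step in flagged)
    return {step for step, c in counts.items() if c / len(critiques) > threshold}


def token_rewards(step_of_token: list[int], outcome_reward: float,
                  wrong_steps: set[int], penalty: float = 1.0) -> list[float]:
    """Start from the shared rule-based reward, then penalize tokens in wrong steps."""
    return [outcome_reward - penalty if step in wrong_steps else outcome_reward
            for step in step_of_token]


# Three hypothetical critiques of a four-step response (steps 0..3).
critiques = [{2}, {2, 3}, {1, 2}]
wrong = vote_wrong_steps(critiques)  # only step 2 is flagged by a majority
rewards = token_rewards([0, 0, 1, 2, 2, 3], outcome_reward=1.0, wrong_steps=wrong)
```

In this toy run only the two tokens belonging to step 2 lose reward, while every other token keeps the original outcome reward.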
[NLP-38] Simple Methods Defend RAG Systems Well Against Real-World Attacks
Quick read: This paper addresses how to keep the responses of Retrieval-Augmented Generation (RAG) systems safe and in-domain in safety-critical settings, that is, how to prevent the system from producing irrelevant or potentially harmful answers to Out-Of-Domain (OOD) queries. The solution evaluates four OOD detection methods (GPT-4o, regression-based, Principal Component Analysis (PCA)-based, and Neural Collapse (NC)) and improves detection with two dimensionality-reduction and feature-separation strategies: selecting the top-k principal components by explained variance or OOD separability, and an adaptation of Neural Collapse feature separation. The experiments show that an external OOD detector is crucial for maintaining the relevance and accuracy of RAG responses.
Link: https://arxiv.org/abs/2508.02296
Authors: Ilias Triantafyllopoulos, Renyi Qu, Salvatore Giorgi, Brenda Curtis, Lyle H. Ungar, João Sedoc
Affiliations: University of Pennsylvania; University of California, Berkeley
Categories: Computation and Language (cs.CL)
Comments:
Abstract: Ensuring safety and in-domain responses for Retrieval-Augmented Generation (RAG) systems is paramount in safety-critical applications, yet remains a significant challenge. To address this, we evaluate four methodologies for Out-Of-Domain (OOD) query detection: GPT-4o, regression-based, Principal Component Analysis (PCA)-based, and Neural Collapse (NC), to ensure the RAG system only responds to queries confined to the system’s knowledge base. Specifically, our evaluation explores two novel dimensionality reduction and feature separation strategies: PCA, where top components are selected using explained variance or OOD separability, and an adaptation of Neural Collapse Feature Separation. We validate our approach on standard datasets (StackExchange and MSMARCO) and real-world applications (Substance Use and COVID-19), including tests against LLM-simulated and actual attacks on a COVID-19 vaccine chatbot. Through human and LLM-based evaluations of response correctness and relevance, we confirm that an external OOD detector is crucial for maintaining response relevance.
[NLP-39] A French Version of the OLDI Seed Corpus
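Quick read: This paper addresses the lack of parallel corpora for machine translation of the regional languages of France (such as Breton and Corsican). The key to the solution is the first French partition of the OLDI Seed Corpus: initial translations are produced by multiple machine translation systems and then post-edited by qualified native speakers through a custom-built interface, yielding a high-quality French parallel corpus. The corpus handles the challenge of source text that mixes technical encyclopedic terminology with the stylistic irregularities of user-generated Wikipedia content, and it is intended as a pivot resource for further collection and modelling of parallel corpora for the regional languages of France.
Link: https://arxiv.org/abs/2508.02290
Authors: Malik Marmonier, Benoît Sagot, Rachel Bawden
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: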
Abstract:We present the first French partition of the OLDI Seed Corpus, our submission to the WMT 2025 Open Language Data Initiative (OLDI) shared task. We detail its creation process, which involved using multiple machine translation systems and a custom-built interface for post-editing by qualified native speakers. We also highlight the unique translation challenges presented by the source data, which combines highly technical, encyclopedic terminology with the stylistic irregularities characteristic of user-generated content taken from Wikipedia. This French corpus is not an end in itself, but is intended as a crucial pivot resource to facilitate the collection of parallel corpora for the under-resourced regional languages of France.
[NLP-40] Dialogue Systems Engineering: A Survey and Future Directions
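Quick read: This paper addresses the lack of systematic software engineering support for deploying and continuously improving dialogue systems. Advances in large language models (LLMs) have greatly improved how dialogue systems are built and applied, but managing their life cycle (efficient development, stable operation, and continuous iteration) remains difficult. The key to the solution is to map the established software engineering body of knowledge (using SWEBOK v4.0 as the framework) onto dialogue systems, forming an engineering practice specific to them called "Dialogue Systems Engineering", and then to identify the unexplored topics in each knowledge area so that the field can develop in a more structured, scalable, and sustainable direction.
Link: https://arxiv.org/abs/2508.02279
Authors: Mikio Nakano, Hironori Takeuchi, Sadahiro Yoshikawa, Yoichi Matsuyama, Kazunori Komatani
Affiliations: SANKEN, University of Osaka; C4A Research Institute, Inc.; Musashi University; Equmenopolis, Inc.
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 2 figures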
Abstract:This paper proposes to refer to the field of software engineering related to the life cycle of dialogue systems as Dialogue Systems Engineering, and surveys this field while also discussing its future directions. With the advancement of large language models, the core technologies underlying dialogue systems have significantly progressed. As a result, dialogue system technology is now expected to be applied to solving various societal issues and in business contexts. To achieve this, it is important to build, operate, and continuously improve dialogue systems correctly and efficiently. Accordingly, in addition to applying existing software engineering knowledge, it is becoming increasingly important to evolve software engineering tailored specifically to dialogue systems. In this paper, we enumerate the knowledge areas of dialogue systems engineering based on those of software engineering, as defined in the Software Engineering Body of Knowledge (SWEBOK) Version 4.0, and survey each area. Based on this survey, we identify unexplored topics in each area and discuss the future direction of dialogue systems engineering.
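[NLP-41] CellForge: Agentic Design of Virtual Cell Models
Quick read: This paper addresses the difficulty of automatically building computational models for virtual cells, which stems from the complexity of biological systems, the heterogeneity of data modalities, and the need for expertise across multiple disciplines. The key to the solution is CellForge, an agentic system built on a multi-agent framework that turns raw single-cell multi-omics data and task descriptions directly into optimized computational models of virtual cells. It integrates three modules: Task Analysis for dataset characterization and literature retrieval, Method Design, where expert agents and a central moderator collaborate to produce an optimized modelling strategy, and Experiment Execution, which automatically generates trainable code. The agents reach consensus through iterative interaction between differing perspectives, and the system outperforms task-specific state-of-the-art methods across perturbation-prediction settings that include gene knockouts, drug treatments, and cytokine stimulations.
Link: https://arxiv.org/abs/2508.02276
Authors: Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, Arman Cohan, Xihong Lin, Fabian Theis, Smita Krishnaswamy, Mark Gerstein
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments: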
Abstract:Virtual cell modeling represents an emerging frontier at the intersection of artificial intelligence and biology, aiming to predict quantities such as responses to diverse perturbations quantitatively. However, autonomously building computational models for virtual cells is challenging due to the complexity of biological systems, the heterogeneity of data modalities, and the need for domain-specific expertise across multiple disciplines. Here, we introduce CellForge, an agentic system that leverages a multi-agent framework that transforms presented biological datasets and research objectives directly into optimized computational models for virtual cells. More specifically, given only raw single-cell multi-omics data and task descriptions as input, CellForge outputs both an optimized model architecture and executable code for training virtual cell models and inference. The framework integrates three core modules: Task Analysis for presented dataset characterization and relevant literature retrieval, Method Design, where specialized agents collaboratively develop optimized modeling strategies, and Experiment Execution for automated generation of code. The agents in the Design module are separated into experts with differing perspectives and a central moderator, and have to collaboratively exchange solutions until they achieve a reasonable consensus. We demonstrate CellForge’s capabilities in single-cell perturbation prediction, using six diverse datasets that encompass gene knockouts, drug treatments, and cytokine stimulations across multiple modalities. CellForge consistently outperforms task-specific state-of-the-art methods. Overall, CellForge demonstrates how iterative interaction between LLM agents with differing perspectives provides better solutions than directly addressing a modeling challenge. Our code is publicly available at this https URL.
[NLP-42] Dynaword: From One-shot to Continuously Developed Datasets
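Quick read: This paper addresses three challenges facing current large-scale natural language processing (NLP) datasets: (1) reliance on ambiguously licensed sources, which restricts use, sharing, and derivative works; (2) static releases, which hinder community participation and shorten a dataset's useful life; and (3) quality assurance carried out only by the publishing team, without drawing on community expertise. The solution consists of the "Dynaword" approach and its instance "Danish Dynaword". The former is a systematic way to build large-scale open datasets that are continuously updated through community collaboration. The latter is a validating implementation that contains over four times as many tokens as comparable releases, is exclusively openly licensed, has attracted contributions from both industry and academia, and uses lightweight tests to ensure consistent formatting, quality, and documentation, establishing a sustainable mechanism for community-driven dataset evolution.
Link: https://arxiv.org/abs/2508.02271
Authors: Kenneth Enevoldsen, Kristian Nørgaard Jensen, Jan Kostkan, Balázs Szabó, Márton Kardos, Kirten Vad, Andrea Blasi Núñez, Gianluca Barmina, Jacob Nielsen, Rasmus Larsen, Peter Vahlstrup, Per Møldrup Dalum, Desmond Elliott, Lukas Galke, Peter Schneider-Kamp, Kristoffer Nielbo
Affiliations: Aarhus University; The Alexandra Institute; University of Copenhagen; University of Southern Denmark
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: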
Abstract: Large-scale datasets are foundational for research and development in natural language processing. However, current approaches face three key challenges: (1) reliance on ambiguously licensed sources restricting use, sharing, and derivative works; (2) static dataset releases that prevent community contributions and diminish longevity; and (3) quality assurance processes restricted to publishing teams rather than leveraging community expertise. To address these limitations, we introduce two contributions: the Dynaword approach and Danish Dynaword. The Dynaword approach is a framework for creating large-scale, open datasets that can be continuously updated through community collaboration. Danish Dynaword is a concrete implementation that validates this approach and demonstrates its potential. Danish Dynaword contains over four times as many tokens as comparable releases, is exclusively openly licensed, and has received multiple contributions across industry and research. The repository includes light-weight tests to ensure data formatting, quality, and documentation, establishing a sustainable framework for ongoing community contributions and dataset evolution.
[NLP-43] SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System
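Quick read: This paper addresses the gap in the Arab world between Modern Standard Arabic (MSA) and regional dialects such as the Syrian dialect (Shami), a phenomenon known as diglossia that poses a major challenge for natural language processing and machine translation in particular. The key to the solution is SHAMI-MT, a bidirectional machine translation system built from two specialized models based on the AraT5v2-base-1024 architecture, one for MSA-to-Shami and one for Shami-to-MSA. Both are fine-tuned on the Nabra dataset and evaluated on unseen data from the MADAR corpus. The MSA-to-Shami model reaches an average score of 4.01 out of 5.0 as judged by GPT-4.1, indicating translations that are both accurate and dialectally authentic, and the system offers a tool for content localization, cultural heritage preservation, and intercultural communication.
Link: https://arxiv.org/abs/2508.02268
Authors: Serry Sibaee, Omer Nacar, Yasser Al-Habashi, Adel Ammar, Wadii Boulila
Affiliations: Prince Sultan University
Categories: Computation and Language (cs.CL)
Comments: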
Abstract: The rich linguistic landscape of the Arab world is characterized by a significant gap between Modern Standard Arabic (MSA), the language of formal communication, and the diverse regional dialects used in everyday life. This diglossia presents a formidable challenge for natural language processing, particularly machine translation. This paper introduces SHAMI-MT, a bidirectional machine translation system specifically engineered to bridge the communication gap between MSA and the Syrian dialect. We present two specialized models, one for MSA-to-Shami and another for Shami-to-MSA translation, both built upon the state-of-the-art AraT5v2-base-1024 architecture. The models were fine-tuned on the comprehensive Nabra dataset and rigorously evaluated on unseen data from the MADAR corpus. Our MSA-to-Shami model achieved an outstanding average quality score of 4.01 out of 5.0 when judged by OpenAI's GPT-4.1 model, demonstrating its ability to produce translations that are not only accurate but also dialectally authentic. This work provides a crucial, high-fidelity tool for a previously underserved language pair, advancing the field of dialectal Arabic translation and offering significant applications in content localization, cultural heritage, and intercultural communication.
[NLP-44] Decomposing the Entropy-Performance Exchange: The Missing Keys to Unlocking Effective Reinforcement Learning
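Quick read: This paper addresses the unclear dynamics of the exchange between entropy and policy performance when large language models (LLMs) are trained with Reinforcement Learning with Verifiable Rewards (RLVR). Existing work lacks a fine-grained understanding of how this exchange operates at different granularities (stage level, instance level, token level), which limits training efficiency. Through a systematic empirical analysis the paper finds that, in the rising stage, entropy reduction in negative samples helps the model learn effective reasoning patterns and quickly improves performance, whereas in the plateau stage, learning efficiency correlates most strongly with high-entropy tokens that appear in low-perplexity samples and with those located at the end of sequences. Based on these findings, the authors propose two methods that dynamically adjust the reward signal using perplexity and positional information to focus on tokens with high learning potential, improving on baseline methods.
Link: https://arxiv.org/abs/2508.02260
Authors: Jia Deng, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Ji-Rong Wen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 20 figures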
Abstract: Recently, reinforcement learning with verifiable rewards (RLVR) has been widely used for enhancing the reasoning abilities of large language models (LLMs). A core challenge in RLVR involves managing the exchange between entropy and performance of policies. Despite the importance of this exchange, a fine-grained understanding of when and how this exchange operates most effectively remains limited. To bridge this gap, we conduct a systematic empirical analysis of the entropy-performance exchange mechanism of RLVR across different levels of granularity. Specifically, we first divide the training process into two distinct stages based on entropy dynamics, i.e., rising stage and plateau stage, and then systematically investigate how this mechanism varies across stage-level, instance-level, and token-level granularities. Our analysis reveals that, in the rising stage, entropy reduction in negative samples facilitates the learning of effective reasoning patterns, which in turn drives rapid performance gains. Moreover, in the plateau stage, learning efficiency strongly correlates with high-entropy tokens present in low-perplexity samples and those located at the end of sequences. Motivated by these findings, we propose two methods that dynamically adjust the reward signal using perplexity and positional information to focus RL updates on tokens that exhibit high learning potential, achieving improvements compared to the baseline methods on various LLMs.
[NLP-45] Interference Matrix: Quantifying Cross-Lingual Interference in Transformer Encoders
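Quick read: This paper studies asymmetric interference between languages in multilingual encoder models, that is, how training on one language affects the representations learned for another. Its core contribution is a large-scale "interference matrix", built by training and evaluating small BERT-like models on all possible language pairs across 83 languages, which quantifies the extent and patterns of cross-lingual interference. The key findings are that these patterns do not align with traditional linguistic characteristics such as language family or with embedding similarity, but relate more closely to script, and that the interference matrix predicts downstream task performance, providing quantitative guidance for designing multilingual models.
Link: https://arxiv.org/abs/2508.02256
Authors: Belen Alastruey, João Maria Janeiro, Alexandre Allauzen, Maha Elbayad, Loïc Barrault, Marta R. Costa-jussà
Affiliations: Meta
Categories: Computation and Language (cs.CL)
Comments: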
Abstract:In this paper, we present a comprehensive study of language interference in encoder-only Transformer models across 83 languages. We construct an interference matrix by training and evaluating small BERT-like models on all possible language pairs, providing a large-scale quantification of cross-lingual interference. Our analysis reveals that interference between languages is asymmetrical and that its patterns do not align with traditional linguistic characteristics, such as language family, nor with proxies like embedding similarity, but instead better relate to script. Finally, we demonstrate that the interference matrix effectively predicts performance on downstream tasks, serving as a tool to better design multilingual models to obtain optimal performance.
[NLP-46] Isolating Culture Neurons in Multilingual Large Language Models
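Quick read: This paper addresses the unclear mechanism by which multilingual large language models encode culture, namely where cultural knowledge resides in the model and how it interacts with language-specific neurons. The key to the solution is an extended neuron-localization method for identifying and isolating culture-specific neurons, together with MUREL, a curated cross-cultural dataset of 85.2 million tokens, which makes it possible to intervene on culture neurons independently. Experiments show that information about different cultures is concentrated mainly in the upper layers and can be edited independently of language-specific neurons and of neurons specific to other cultures, offering a new path toward improving fairness, inclusivity, and alignment.
Link: https://arxiv.org/abs/2508.02241
Authors: Danial Namazifard, Lukas Galke
Affiliations: University of Tehran; University of Southern Denmark
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 13 figures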
Abstract: Language and culture are deeply intertwined, yet it is so far unclear how and where multilingual large language models encode culture. Here, we build on an established methodology for identifying language-specific neurons and extend it to localize and isolate culture-specific neurons, carefully disentangling their overlap and interaction with language-specific neurons. To facilitate our experiments, we introduce MUREL, a curated dataset of 85.2 million tokens spanning six different cultures. Our localization and intervention experiments show that LLMs encode different cultures in distinct neuron populations, predominantly in upper layers, and that these culture neurons can be modulated independently from language-specific neurons or those specific to other cultures. These findings suggest that cultural knowledge and propensities in multilingual language models can be selectively isolated and edited, promoting fairness, inclusivity, and alignment. Code and data are available at this https URL.
[NLP-47] LeanK: Learnable K Cache Channel Pruning for Efficient Decoding
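Quick read: This paper addresses the high GPU memory use and slow decoding that large language models (LLMs) suffer in long-context inference because of the growing key-value (KV) cache. The key to the solution is LeanK, a learning-based pruning method that exploits static channel sparsity: a two-stage training process learns a channel-wise static mask that prunes unimportant key (K) cache channels. While meeting a given sparsity ratio and hardware alignment requirement, the method reduces GPU memory use (up to 70% of the K cache) and accelerates attention computation (a 1.3x speedup with a custom decoding kernel) without sacrificing accuracy.
Link: https://arxiv.org/abs/2508.02215
Authors: Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, Lili Qiu
Affiliations: Tsinghua University; Microsoft Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: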
Abstract: Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns a channel-wise static mask that satisfies a specific sparsity ratio and hardware alignment requirement. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%-18% V cache memory reduction. A custom decoding kernel enables a 1.3x speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at this https URL.
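To make the idea of a channel-wise static mask concrete, here is a minimal sketch of how such a mask could be applied once it exists. LeanK learns its mask through a two-stage training process; the top-k selection from given importance scores used below is a simple stand-in, and the function names, the `keep_ratio` parameter, and the `align` value are all hypothetical.

```python
def static_channel_mask(importance: list[float], keep_ratio: float,
                        align: int = 4) -> list[bool]:
    """Keep the highest-scoring channels; the kept count is rounded up to a multiple of `align`."""
    n = len(importance)
    keep = max(1, round(n * keep_ratio))
    keep = min(n, -(-keep // align) * align)  # ceiling to a multiple of align
    top = sorted(range(n), key=lambda c: importance[c], reverse=True)[:keep]
    chosen = set(top)
    return [c in chosen for c in range(n)]


def prune_channels(vectors: list[list[float]], mask: list[bool]) -> list[list[float]]:
    """Drop masked-out channels from each cached key (or from a query)."""
    return [[v for v, m in zip(vec, mask) if m] for vec in vectors]


def attention_scores(query: list[float], keys: list[list[float]]) -> list[float]:
    """Unnormalized dot-product scores between one query and all cached keys."""
    return [sum(q * k for q, k in zip(query, key)) for key in keys]


# Toy example: 8 channels, hypothetical importance scores, keep about half.
importance = [0.9, 0.1, 0.8, 0.05, 0.7, 0.02, 0.6, 0.01]
mask = static_channel_mask(importance, keep_ratio=0.5)
k_cache = [[1.0] * 8, [2.0] * 8]
pruned_k = prune_channels(k_cache, mask)
pruned_q = prune_channels([[1.0] * 8], mask)[0]
scores = attention_scores(pruned_q, pruned_k)
```

Because the mask does not depend on the input, the dropped channels never have to be stored in the cache, and the query only needs the matching channels when scores are computed.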
[NLP-48] Proof2Hybrid: Automatic Mathematical Benchmark Synthesis for Proof-Centric Problems
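Quick read: This paper addresses a bottleneck in evaluating the mathematical ability of large language models (LLMs): the lack of high-quality, scalable, automated benchmarks for proof-centric problems. Existing approaches rely on manually built test sets, which are costly and hard to scale, so the true mathematical ability of LLMs goes largely unmeasured. The key to the solution is Proof2Hybrid, a framework that automatically synthesizes high-quality, proof-centric evaluation data from natural-language mathematical corpora. Its core novelty is Proof2X, a roadmap for converting mathematical proofs into various kinds of easily verifiable questions, which leads to a new hybrid question format called "m-out-of-n multiple judge questions" that supports robust automatic evaluation while resisting guessing and superficial pattern matching.
Link: https://arxiv.org/abs/2508.02208
Authors: Yebo Peng, Zixiang Liu, Yaoming Li, Zhizhuo Yang, Xinye Xu, Bowen Ye, Weijun Yuan, Zihan Wang, Tong Yang
Affiliations: Peking University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures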
Abstract: Evaluating the mathematical capability of Large Language Models (LLMs) is a critical yet challenging frontier. Existing benchmarks fall short, particularly for proof-centric problems, as manual creation is unscalable and costly, leaving the true mathematical abilities of LLMs largely unassessed. To overcome these barriers, we propose Proof2Hybrid, the first fully automated framework that synthesizes high-quality, proof-centric benchmarks from natural language mathematical corpora. The key novelty of our solution is Proof2X, a roadmap of converting mathematical proofs into various kinds of questions that are easy to verify. Instructed by this roadmap, we propose a new type of hybrid-formatted questions, named “m-out-of-n multiple judge questions”, specifically designed to enable robust, automatic evaluation while being resilient to guessing and superficial pattern matching inherent in traditional formats. As a demonstration of our framework, we introduce AlgGeoTest, a benchmark for algebraic geometry (a frontier domain of modern mathematics) comprising 456 challenging items. Our extensive evaluations on state-of-the-art LLMs using AlgGeoTest reveal profound deficits in their comprehension of algebraic geometry, providing a more precise measure of their true mathematical capabilities. Our framework and benchmark pave the way for a new wave of in-depth research into the mathematical intelligence of AI systems.
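The abstract does not define the "m-out-of-n multiple judge" format in detail. Under one plausible reading, assumed here purely for illustration, an item presents n candidate statements of which exactly m are correct, the model must judge every statement, and the item counts only if all n judgments are right. The sketch below shows why such a format would be harder to guess than a single true/false question; the function names and the scoring rule are hypothetical.

```python
from math import comb


def is_correct(predicted: list[bool], truth: list[bool]) -> bool:
    """All-or-nothing scoring: every one of the n judgments must match."""
    return len(predicted) == len(truth) and all(p == t for p, t in zip(predicted, truth))


def guess_probability(n: int, m: int, knows_m: bool = True) -> float:
    """Chance that a uniform random guess gets the whole item right."""
    # Knowing m, the guesser picks one of C(n, m) subsets; otherwise each
    # of the n judgments is an independent coin flip.
    return 1 / comb(n, m) if knows_m else 0.5 ** n


truth = [True, False, True, False, False, False]  # m = 2 correct out of n = 6
p_guess = guess_probability(n=6, m=2)             # 1 / C(6, 2) = 1 / 15
```

Under these assumptions a blind guess succeeds about 6.7% of the time, compared with 50% for a single true/false judgment.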
[NLP-49] Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
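Quick read: This paper addresses the slow inference of generative AI models on code generation, in particular the high latency of conventional token-by-token decoding. The key to the solution is Seed Diffusion Preview, a large language model based on discrete-state diffusion whose non-sequential, parallel generation greatly improves inference efficiency. While keeping performance competitive with mainstream models, it reaches an inference speed of 2,146 token/s on H20 GPUs, faster than the contemporary Mercury Coder and Gemini Diffusion, setting a new state of the art on the speed-quality Pareto frontier.
Link: https://arxiv.org/abs/2508.02193
Authors: Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, Yuwei Fu, Jing Su, Ge Zhang, Wenhao Huang, Mingxuan Wang, Lin Yan, Xiaoying Jia, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Yonghui Wu, Hao Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Demo is available at this https URL. Project page is this https URL.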
Abstract:We present Seed Diffusion Preview, a large-scale language model based on discrete-state diffusion, offering remarkably fast inference speed. Thanks to non-sequential, parallel generation, discrete diffusion models provide a notable speedup to mitigate the inherent latency of token-by-token decoding, as demonstrated recently (e.g., Mercury Coder, Gemini Diffusion). Seed Diffusion Preview achieves an inference speed of 2,146 token/s over H20 GPUs while maintaining competitive performance across a sweep of standard code evaluation benchmarks, significantly faster than contemporary Mercury and Gemini Diffusion, establishing new state of the art on the speed-quality Pareto frontier for code models.
[NLP-50] Learning Dynamics of Meta-Learning in Small Model Pretraining
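Quick read: This paper addresses the high training cost and limited interpretability of large language models (LLMs), asking how small language models can be pretrained more efficiently and more interpretably. The key to the solution is combining first-order MAML with subset-masked LM pretraining to produce four Llama-style decoder-only models (11M to 570M parameters). Experiments show that the method reaches the same loss up to 1.6x sooner than vanilla training and achieves higher F1 on multilingual Universal NER under equal compute. More importantly, effective-rank curves and attention-head entropy reveal a two-stage training dynamic, in which representations first "diversify" and later "compress" into a shared subspace, giving a compact and readable signal of meta-adaptation.
Link: https://arxiv.org/abs/2508.02189
Authors: David Demitri Africa, Yuval Weiss, Paula Buttery, Richard Diehl Martinez
Affiliations: University of Cambridge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: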
Abstract:Large language models are powerful but costly. We ask whether meta-learning can make the pretraining of small language models not only better but also more interpretable. We integrate first-order MAML with subset-masked LM pretraining, producing four LLama-style decoder-only models (11M-570M params), and evaluate it on a fundamental NLP task with many settings and real-world applications. Compared with vanilla training, our model (i) reaches the same loss up to 1.6x sooner, (ii) improves F1 on multilingual Universal NER under equal compute, and (iii) makes the training dynamics easy to read: first the network’s representations fan out (“diversify”) and later they collapse into a smaller, shared subspace (“compress”). This two-stage shift shows up as a rise-and-fall in both effective-rank curves and attention-head entropy. The same curves pinpoint which layers specialise earliest and which later reconverge, giving a compact, interpretable signature of meta-adaptation. Code, checkpoints and WandB logs are released.
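The effective-rank and attention-head-entropy curves mentioned above are built from quantities that are cheap to log during training. The sketch below is a minimal illustration, assuming effective rank is defined as the exponential of the Shannon entropy of the normalized singular values (a common definition; the abstract does not say which one the authors use). The function names `effective_rank` and `attention_entropy` are hypothetical and are not taken from the released code.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a probability vector; zero entries are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def effective_rank(singular_values):
    """Effective rank = exp(entropy of the normalized singular values).

    Assumed definition for illustration; the paper may use a different one.
    """
    total = sum(singular_values)
    if total <= 0.0:
        raise ValueError("singular values must have a positive sum")
    return math.exp(shannon_entropy([s / total for s in singular_values]))


def attention_entropy(attention_row):
    """Entropy of one attention distribution (one query position of one head)."""
    return shannon_entropy(attention_row)
```

Under this definition, a representation matrix whose singular values are all equal has an effective rank equal to its dimension, and one dominated by a single direction has an effective rank close to 1. A rise followed by a fall in this value over training steps would match the "diversify" then "compress" pattern the abstract describes.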
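[NLP-51] Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment through Latent Acoustic Pattern Triggers
[Quick read]: This paper examines a potential safety weakness of Audio Large Language Models (ALLMs): backdoor attacks that exploit acoustic triggers. Existing work concentrates on text and vision safety, while the temporal dynamics and spectral properties specific to audio pose distinct challenges. The paper proposes the Hidden in the Noise (HIN) framework, which modifies acoustic features of the raw waveform, for example by altering temporal dynamics and injecting spectrally tailored noise, to embed stealthy and robust triggers that the ALLM's acoustic feature encoder captures and that can then induce malicious behavior. Triggers such as environment noise and speech rate variations reach an average attack success rate above 90%, while poisoned samples cause only marginal fluctuations in the loss curve, which highlights how vulnerable current ALLMs are at the acoustic-feature level.
Link: https://arxiv.org/abs/2508.02175
Authors: Liang Lin,Miao Yu,Kaiwen Luo,Yibo Zhang,Lilan Peng,Dexian Wang,Xuehai Tang,Yuanhe Zhang,Xikang Yang,Zhenhong Zhou,Kun Wang,Yang Liu
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: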
Abstract:As Audio Large Language Models (ALLMs) emerge as powerful tools for speech processing, their safety implications demand urgent attention. While considerable research has explored textual and vision safety, audio’s distinct characteristics present significant challenges. This paper first investigates: Is ALLM vulnerable to backdoor attacks exploiting acoustic triggers? In response to this issue, we introduce Hidden in the Noise (HIN), a novel backdoor attack framework designed to exploit subtle, audio-specific features. HIN applies acoustic modifications to raw audio waveforms, such as alterations to temporal dynamics and strategic injection of spectrally tailored noise. These changes introduce consistent patterns that an ALLM’s acoustic feature encoder captures, embedding robust triggers within the audio stream. To evaluate ALLM robustness against audio-feature-based triggers, we develop the AudioSafe benchmark, assessing nine distinct risk types. Extensive experiments on AudioSafe and three established safety datasets reveal critical vulnerabilities in existing ALLMs: (I) audio features like environment noise and speech rate variations achieve over 90% average attack success rate. (II) ALLMs exhibit significant sensitivity differences across acoustic features, particularly showing minimal response to volume as a trigger, and (III) poisoned sample inclusion causes only marginal loss curve fluctuations, highlighting the attack’s stealth.
[NLP-52] Subject or Style: Adaptive and Training-Free Mixture of LoRAs
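[Quick read]: This paper addresses the difficulty that existing LoRA (Low-Rank Adaptation) fusion methods have in balancing the original subject and the style in subject-driven or style-driven generation, as well as their need for additional training or multiple hyperparameters. The key to the solution is EST-LoRA, a training-free adaptive LoRA fusion method that jointly considers three factors: the energy of the matrix, style discrepancy scores, and time steps. Borrowing from the Mixture of Experts (MoE) architecture, the model adaptively selects between the subject LoRA and the style LoRA within each attention layer, which keeps the contributions of both balanced during generation and improves generation quality and speed.
Link: https://arxiv.org/abs/2508.02165
Authors: Jia-Chen Zhang,Yu-Jie Xiong
Affiliations: Shanghai University of Engineering Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: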
Abstract:Fine-tuning models via Low-Rank Adaptation (LoRA) demonstrates remarkable performance in subject-driven or style-driven generation tasks. Studies have explored combinations of different LoRAs to jointly generate learned styles and content. However, current methods struggle to balance the original subject and style, and often require additional training. Recently, K-LoRA proposed a training-free LoRA fusion method. But it involves multiple hyperparameters, making it difficult to adapt to all styles and subjects. In this paper, we propose EST-LoRA, a training-free adaptive LoRA fusion method. It comprehensively considers three critical factors: Energy of matrix, Style discrepancy scores, and Time steps. Analogous to the Mixture of Experts (MoE) architecture, the model adaptively selects between subject LoRA and style LoRA within each attention layer. This integrated selection mechanism ensures balanced contributions from both components during the generation process. Experimental results show that EST-LoRA outperforms state-of-the-art methods in both qualitative and quantitative evaluations and achieves faster generation speed compared to other efficient fusion approaches. Our code is publicly available at: this https URL.
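[NLP-53] Trainable Dynamic Mask Sparse Attention
[Quick read]: This paper addresses the efficiency bottleneck that the quadratic complexity of standard self-attention creates for long-context modeling in large language models. Existing sparse attention mechanisms improve efficiency but can still suffer from static patterns or information loss. The key to the solution is a trainable dynamic mask sparse attention mechanism, Dynamic Mask Attention (DMA), which exploits both content-aware and position-aware sparsity through two innovations: it dynamically generates content-aware sparse masks from value representations so that the model can adaptively identify and focus on critical information, and it performs position-aware sparse attention computation that skips unnecessary calculation regions. According to the authors, this dual-sparsity design significantly reduces computational cost while retaining complete information, balancing information fidelity against computational efficiency.
Link: https://arxiv.org/abs/2508.02124
Authors: Jingze Shi,Yifan Wu,Bingheng Wu,Yiran Peng,Liangdong Wang,Guang Liu,Yuyu Luo
Affiliations: Hong Kong University of Science and Technology (Guangzhou)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 8 figures, 4 tables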
Abstract:In large language models, the demand for modeling long contexts is constantly increasing, but the quadratic complexity of the standard self-attention mechanism often becomes a bottleneck. Although existing sparse attention mechanisms have improved efficiency, they may still encounter issues such as static patterns or information loss. We introduce a trainable dynamic mask sparse attention mechanism, Dynamic Mask Attention, which effectively utilizes content-aware and position-aware sparsity. DMA achieves this through two key innovations: First, it dynamically generates content-aware sparse masks from value representations, enabling the model to identify and focus on critical information adaptively. Second, it implements position-aware sparse attention computation that effectively skips unnecessary calculation regions. This dual-sparsity design allows the model to significantly reduce the computational complexity of important information while retaining complete information, achieving an excellent balance between information fidelity and computational efficiency. We have verified the performance of DMA through comprehensive experiments. Comparative studies show that DMA outperforms multi-head attention, sliding window attention, multi-head latent attention, and native sparse attention in terms of perplexity under Chinchilla Scaling Law settings. Moreover, in challenging multi-query associative recall tasks, DMA also demonstrates superior performance and efficiency compared to these methods. Crucially, in the evaluation of a 1.7B parameter model, DMA significantly outperforms multi-head attention in both standard benchmark performance and the challenging needle-in-a-haystack task. These experimental results highlight its capability to balance model efficiency and long-context modeling ability effectively.
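The general idea of a content-dependent sparse mask can be illustrated for a single query position. The sketch below is not the DMA implementation: it assumes a per-key importance score is already available (in DMA this would be derived from value representations, by a procedure the abstract does not detail), keeps the top `k` keys, and applies softmax only over the kept positions. The names `topk_mask` and `sparse_attention_weights` are hypothetical.

```python
import math


def topk_mask(importance, k):
    """Boolean mask that keeps the k keys with the largest importance scores."""
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    kept = set(order[:k])
    return [i in kept for i in range(len(importance))]


def sparse_attention_weights(scores, mask):
    """Softmax over the kept positions only; masked positions get weight 0."""
    kept_scores = [s for s, keep in zip(scores, mask) if keep]
    if not kept_scores:
        raise ValueError("mask must keep at least one position")
    peak = max(kept_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

In a real kernel the saving comes from never computing the masked score entries at all; this toy version computes a dense score list first and only illustrates the masking logic.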
[NLP-54] “Harmless to You Hurtful to Me!”: Investigating the Detection of Toxic Languages Grounded in the Perspective of Youth AAAI
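[Quick read]: This paper addresses a gap in social media toxicity detection: language that adults perceive as harmless but that youth perceive as toxic. The authors construct the first Chinese "youth-toxicity" dataset and find that youth's perception of toxic content is associated with contextual factors such as the source of an utterance and text-related features. Incorporating this meta information into existing toxicity detection methods significantly improves overall accuracy, which offers direction and practical evidence for future youth-centered toxicity detection research.
Link: https://arxiv.org/abs/2508.02094
Authors: Yaqiong Li,Peng Zhang,Lin Wang,Hansu Gu,Siyuan Qiao,Ning Gu,Tun Lu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted at the 20th International AAAI Conference on Web and Social Media (ICWSM 2026)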
Abstract:Risk perception is subjective, and youth’s understanding of toxic content differs from that of adults. Although previous research has conducted extensive studies on toxicity detection in social media, the investigation of youth’s unique toxicity, i.e., languages perceived as nontoxic by adults but toxic by youth, is ignored. To address this gap, we aim to explore: 1) What are the features of "youth-toxicity" languages in social media (RQ1); 2) Can existing toxicity detection techniques accurately detect these languages (RQ2). For these questions, we took Chinese youth as the research target, constructed the first Chinese "youth-toxicity" dataset, and then conducted extensive analysis. Our results suggest that youth’s perception of these is associated with several contextual factors, like the source of an utterance and text-related features. Incorporating this meta information into current toxicity detection methods significantly improves accuracy overall. Finally, we propose several insights into future research on youth-centered toxicity detection.
[NLP-55] CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search
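[Quick read]: This paper addresses the speed-accuracy trade-off that approximate nearest-neighbor search (ANNS) algorithms face in practice, particularly the demand for fast and accurate ANNS in retrieval-augmented generation (RAG) and agent-based LLM applications. The key to the solution is CRINN, a new paradigm that treats ANNS optimization as a reinforcement learning (RL) problem in which execution speed serves as the reward signal, so that progressively faster ANNS implementations are generated automatically while accuracy constraints are maintained. The approach moves beyond manual tuning and indicates that LLMs augmented with reinforcement learning can automate sophisticated algorithmic optimization.
Link: https://arxiv.org/abs/2508.02091
Authors: Xiaoya Li,Xiaofei Sun,Albert Wang,Chris Shum,Jiwei Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
Comments: Preprint Version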
Abstract:Approximate nearest-neighbor search (ANNS) algorithms have become increasingly critical for recent AI applications, particularly in retrieval-augmented generation (RAG) and agent-based LLM applications. In this paper, we present CRINN, a new paradigm for ANNS algorithms. CRINN treats ANNS optimization as a reinforcement learning problem where execution speed serves as the reward signal. This approach enables the automatic generation of progressively faster ANNS implementations while maintaining accuracy constraints. Our experimental evaluation demonstrates CRINN’s effectiveness across six widely-used NNS benchmark datasets. When compared against state-of-the-art open-source ANNS algorithms, CRINN achieves best performance on three of them (GIST-960-Euclidean, MNIST-784-Euclidean, and GloVe-25-angular), and tied for first place on two of them (SIFT-128-Euclidean and GloVe-25-angular). The implications of CRINN’s success reach well beyond ANNS optimization: It validates that LLMs augmented with reinforcement learning can function as an effective tool for automating sophisticated algorithmic optimizations that demand specialized knowledge and labor-intensive manual refinement. Code can be found at this https URL
[NLP-56] When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models
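[Quick read]: This paper studies sycophancy in Large Language Models (LLMs), the tendency to agree with user-stated opinions even when they contradict factual knowledge. The behavior has been documented, but its internal mechanisms remain poorly understood. Through systematic experiments and mechanistic analysis (logit-lens analysis and causal activation patching), the paper identifies a two-stage emergence of sycophancy: (1) a late-layer output preference shift and (2) deeper representational divergence. It also finds that user authority is not encoded internally by the models, and that first-person framings ("I believe...") cause stronger representational perturbations in deeper layers than third-person framings ("They believe..."), which leads to higher sycophancy rates. These findings indicate that sycophancy is not a surface-level artifact but arises from a structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.
Link: https://arxiv.org/abs/2508.02087
Authors: Jin Li,Keyu Wang,Shu Yang,Zhuoran Zhang,Di Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: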
Abstract:Large Language Models (LLMs) often exhibit sycophantic behavior, agreeing with user-stated opinions even when those contradict factual knowledge. While prior work has documented this tendency, the internal mechanisms that enable such behavior remain poorly understood. In this paper, we provide a mechanistic account of how sycophancy arises within LLMs. We first systematically study how user opinions induce sycophancy across different model families. We find that simple opinion statements reliably induce sycophancy, whereas user expertise framing has a negligible impact. Through logit-lens analysis and causal activation patching, we identify a two-stage emergence of sycophancy: (1) a late-layer output preference shift and (2) deeper representational divergence. We also verify that user authority fails to influence behavior because models do not encode it internally. In addition, we examine how grammatical perspective affects sycophantic behavior, finding that first-person prompts ("I believe...") consistently induce higher sycophancy rates than third-person framings ("They believe...") by creating stronger representational perturbations in deeper layers. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.
[NLP-57] Human Capital Visualization using Speech Amount during Meetings SIGDIAL2025
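[Quick read]: This paper addresses the failure of conventional human-resource quantification methods to capture the fundamental role of conversation in human capital, in particular the lack of systematic measurement of everyday communication. The key to the solution is using speech amount during meetings as an indicator for visualizing human capital, quantified with conversation visualization technology. The authors measure differences in speech amount by attributes such as gender and job post, changes in speech amount depending on whether certain participants are present, and correlations between speech amount and continuous attributes, with the aim of a finer-grained view of human capital within an organization.
Link: https://arxiv.org/abs/2508.02075
Authors: Ekai Hashimoto,Takeshi Mizumoto,Kohei Nagira,Shun Shiramatsu
Affiliations: Nagoya Institute of Technology; Hylable Inc.
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: This paper has been accepted for presentation at the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2025). It represents the author’s version of the work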
Abstract:In recent years, many companies have recognized the importance of human resources and are investing in human capital to revitalize their organizations and enhance internal communication, thereby fostering innovation. However, conventional quantification methods have mainly focused on readily measurable indicators without addressing the fundamental role of conversations in human capital. This study focuses on routine meetings and proposes strategies to visualize human capital by analyzing speech amount during these meetings. We employ conversation visualization technology, which operates effectively, to quantify speech. We then measure differences in speech amount by attributes such as gender and job post, changes in speech amount depending on whether certain participants are present, and correlations between speech amount and continuous attributes. To verify the effectiveness of our proposed methods, we analyzed speech amounts by departmental affiliation during weekly meetings at small to medium enterprises.
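[NLP-58] The SMeL Test: A simple benchmark for media literacy in language models
[Quick read]: This paper examines whether large language models (LLMs) that browse the web autonomously can filter out unreliable information, that is, whether they have learned the simple heuristics human researchers use to identify trustworthy sources. It proposes the Synthetic Media Literacy Test (SMeL Test), a minimal benchmark that tests a model's ability to actively filter out untrustworthy information in context. The test shows that no commonly used instruction-tuned LLM consistently trusts more reliable sources, and that even the best API model tested hallucinates up to 70% of the time, although reasoning is associated with higher scores. Larger and more capable models do not necessarily outperform smaller ones. The results provide empirical grounding for developing new methods to combat this form of hallucination.
Link: https://arxiv.org/abs/2508.02074
Authors: Gustaf Ahdritz,Anat Kleiman
Affiliations: Kempner Institute, Harvard University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: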
Abstract:The internet is rife with unattributed, deliberately misleading, or otherwise untrustworthy content. Though large language models (LLMs) are often tasked with autonomous web browsing, the extent to which they have learned the simple heuristics human researchers use to navigate this noisy environment is not currently known. In this paper, we introduce the Synthetic Media Literacy Test (SMeL Test), a minimal benchmark that tests the ability of language models to actively filter out untrustworthy information in context. We benchmark a variety of commonly used instruction-tuned LLMs, including reasoning models, and find that no model consistently trusts more reliable sources; while reasoning in particular is associated with higher scores, even the best API model we test hallucinates up to 70% of the time. Remarkably, larger and more capable models do not necessarily outperform their smaller counterparts. We hope our work sheds more light on this important form of hallucination and guides the development of new methods to combat it.
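[NLP-59] MolReasoner: Toward Effective and Interpretable Reasoning for Molecular LLMs
[Quick read]: This paper addresses the limited molecular reasoning ability of Large Language Models (LLMs): existing approaches either rely on general-purpose prompting, which lacks domain-specific molecular semantics, or use fine-tuning strategies that struggle with interpretability and reasoning depth. The key to the solution is MolReasoner, a two-stage framework. The first stage, Mol-SFT, performs supervised fine-tuning on synthetic Chain-of-Thought (CoT) samples generated by GPT-4o and verified for chemical accuracy, which initializes the model's reasoning ability. The second stage, Mol-RL, applies reinforcement learning with specialized reward functions designed to align chemical structures with linguistic descriptions, improving molecular reasoning and interpretability. The approach moves the model from memorization-based outputs toward robust chemical reasoning.
Link: https://arxiv.org/abs/2508.02066
Authors: Guojiang Zhao,Sihang Li,Zixiang Lu,Zheng Cheng,Haitao Lin,Lirong Wu,Hanchen Xia,Hengxing Cai,Wentao Guo,Hongshuai Wang,Mingjun Xu,Siyu Zhu,Guolin Ke,Linfeng Zhang,Zhifeng Gao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: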
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across various domains, yet their capabilities in molecular reasoning remain insufficiently explored. Current approaches tend to rely heavily on general-purpose prompting, which lacks domain-specific molecular semantics, while those that use fine-tuning strategies often face challenges with interpretability and reasoning depth. To address these issues, we introduce MolReasoner, a two-stage framework designed to transition LLMs from memorization towards chemical reasoning. First, we propose Mol-SFT, which initializes the model’s reasoning abilities via synthetic Chain-of-Thought (CoT) samples generated by GPT-4o and verified for chemical accuracy. Subsequently, Mol-RL applies reinforcement learning with specialized reward functions designed explicitly to align chemical structures with linguistic descriptions, thereby enhancing molecular reasoning capabilities. Our approach notably enhances interpretability, improving the model's molecular understanding and enabling better generalization. Extensive experiments demonstrate that MolReasoner outperforms existing methods, marking a significant shift from memorization-based outputs to robust chemical reasoning.
[NLP-60] ProCut: LLM Prompt Compression via Attribution Estimation
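[Quick read]: This paper addresses prompt templates in large-scale industrial Large Language Model (LLM) systems that grow bloated through iterative expansion, which makes them hard to maintain and increases inference latency and serving cost. The key to the solution is Prompt Compression via Attribution Estimation (ProCut), a flexible, training-free, LLM-agnostic framework that segments a prompt template into semantically meaningful units, quantifies each unit's impact on task performance, and prunes low-utility components. ProCut substantially reduces prompt size (78% fewer tokens in production) while maintaining or slightly improving task performance (up to 62% better than alternative methods). The paper also introduces an LLM-driven attribution estimator that reduces compression latency by over 50%, and shows that ProCut integrates with existing prompt-optimization frameworks.
Link: https://arxiv.org/abs/2508.02053
Authors: Zhentao Xu,Fengyi Li,Albert Chen,Xiaofeng Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: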
Abstract:In large-scale industrial LLM systems, prompt templates often expand to thousands of tokens as teams iteratively incorporate sections such as task instructions, few-shot examples, and heuristic rules to enhance robustness and coverage. This expansion leads to bloated prompts that are difficult to maintain and incur significant inference latency and serving costs. To address this, we introduce Prompt Compression via Attribution Estimation (ProCut), a flexible, LLM-agnostic, training-free framework that compresses prompts through attribution analysis. ProCut segments prompt templates into semantically meaningful units, quantifies their impact on task performance, and prunes low-utility components. Through extensive experiments on five public benchmark datasets and real-world industrial prompts, we show that ProCut achieves substantial prompt size reductions (78% fewer tokens in production) while maintaining or even slightly improving task performance (up to 62% better than alternative methods). We further introduce an LLM-driven attribution estimator that reduces compression latency by over 50%, and demonstrate that ProCut integrates seamlessly with existing prompt-optimization frameworks to produce concise, high-performing prompts.
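The segment, attribute, and prune loop described above can be sketched with the simplest possible attribution rule. The example below assumes leave-one-out attribution (the score drop when one segment is removed); the abstract does not state which attribution method ProCut uses, so this is a stand-in rather than the authors' estimator. `loo_attribution`, `prune_segments`, and `toy_score` are hypothetical names, and `toy_score` replaces what would in practice be a task metric measured by running an LLM on a development set.

```python
def loo_attribution(segments, score_fn):
    """Leave-one-out attribution: score drop caused by removing each segment."""
    full_score = score_fn(segments)
    return [
        full_score - score_fn(segments[:i] + segments[i + 1:])
        for i in range(len(segments))
    ]


def prune_segments(segments, score_fn, min_gain=0.0):
    """Keep only the segments whose removal lowers the score by more than min_gain."""
    gains = loo_attribution(segments, score_fn)
    return [seg for seg, gain in zip(segments, gains) if gain > min_gain]


def toy_score(segments):
    """Stand-in scorer: fraction of required keywords present in the prompt."""
    required = ("sentiment", "positive")
    text = " ".join(segments).lower()
    return sum(keyword in text for keyword in required) / len(required)
```

Leave-one-out needs one scoring pass per segment, which is costly when each pass means running an LLM over an evaluation set; that cost is presumably what a cheaper attribution estimator, such as the LLM-driven one the abstract mentions, is meant to reduce.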
[NLP-61] Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models
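[Quick read]: This paper addresses the difficulty of evaluating how Large Language Models (LLMs) handle time-sensitive factual knowledge. Existing benchmarks often rely on manual curation or a small, fixed set of predefined templates, which restricts scalable and comprehensive Time-Sensitive Question-Answering (TSQA) evaluation. The key to the solution is TDBench, a new benchmark that systematically constructs TSQA pairs from temporal databases using database techniques such as temporal SQL and functional dependencies. It also introduces a fine-grained metric, time accuracy, which assesses the validity of time references in model explanations alongside traditional answer accuracy. The approach reduces reliance on human labor and supports LLM evaluation on application-specific data as well as seamless multi-hop question generation.
Link: https://arxiv.org/abs/2508.02045
Authors: Soyeon Kim,Jindong Wang,Xing Xie,Steven Euijong Whang
Affiliations: KAIST; William & Mary; Microsoft Research Asia
Categories: Computation and Language (cs.CL)
Comments: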
Abstract:Facts evolve over time, making it essential for Large Language Models (LLMs) to handle time-sensitive factual knowledge accurately and reliably. While factual Time-Sensitive Question-Answering (TSQA) tasks have been widely studied, existing benchmarks often rely on manual curation or a small, fixed set of predefined templates, which restricts scalable and comprehensive TSQA evaluation. To address these challenges, we propose TDBench, a new benchmark that systematically constructs TSQA pairs by harnessing temporal databases and database techniques such as temporal SQL and functional dependencies. We also introduce a fine-grained evaluation metric called time accuracy, which assesses the validity of time references in model explanations alongside traditional answer accuracy to enable a more reliable TSQA evaluation. Extensive experiments on contemporary LLMs show how TDBench enables scalable and comprehensive TSQA evaluation while reducing the reliance on human labor, complementing existing Wikipedia/Wikidata-based TSQA evaluation approaches by enabling LLM evaluation on application-specific data and seamless multi-hop question generation. Code and data are publicly available at: this https URL.
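How a temporal table can yield question-answer pairs is easy to show on a toy example. The sketch below is an illustration under stated assumptions, not the TDBench pipeline: the `tenure` table, its columns, the fictional rows, and the question template are all hypothetical. It assumes validity intervals of the form [start_year, end_year) and a functional dependency from (org, role, point in time) to person, which is what makes the answer unique.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of role tenures valid over [start_year, end_year)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tenure ("
        "org TEXT, role TEXT, person TEXT, start_year INTEGER, end_year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO tenure VALUES (?, ?, ?, ?, ?)",
        [
            ("Acme Corp", "CEO", "Alice", 2010, 2015),  # fictional rows
            ("Acme Corp", "CEO", "Bob", 2015, 2020),
        ],
    )
    return conn


def make_qa_pair(conn, org, role, year):
    """Build one (question, answer) pair from a point-in-time query, or None."""
    row = conn.execute(
        "SELECT person FROM tenure "
        "WHERE org = ? AND role = ? AND start_year <= ? AND ? < end_year",
        (org, role, year, year),
    ).fetchone()
    if row is None:
        return None
    return f"Who was the {role} of {org} in {year}?", row[0]
```

Because the stored interval travels with each answer, a checker could also compare the years a model cites in its explanation against `start_year` and `end_year`, which is the kind of signal a time-accuracy metric would need.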
[NLP-62] Marco-Voice Technical Report
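[Quick read]: This paper addresses the longstanding challenge of generating highly expressive, controllable, and natural speech that faithfully preserves speaker identity across diverse linguistic and emotional contexts. The key to the solution is a speaker-emotion disentanglement mechanism with in-batch contrastive learning, which enables independent manipulation of speaker identity and emotional style, together with a rotational emotional embedding integration method for smooth emotion control. The system is trained and evaluated with CSEMOTIONS, a high-quality Mandarin emotional speech dataset constructed by the authors. Experiments show that the resulting system, Marco-Voice, improves substantially on both objective and subjective metrics and delivers competitive speech clarity and emotional richness.
Link: https://arxiv.org/abs/2508.02038
Authors: Fengping Tian,Chenyang Lyu,Xuanfan Ni,Haoqin Sun,Qingjuan Li,Zhiqiang Qian,Haijun Li,Longyue Wang,Zhao Xu,Weihua Luo,Kaifu Zhang
Affiliations: Alibaba International Digital Commerce
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Technical Report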
Abstract:This paper presents a multifunctional speech synthesis system that integrates voice cloning and emotion control speech synthesis within a unified framework. The goal of this work is to address longstanding challenges in achieving highly expressive, controllable, and natural speech generation that faithfully preserves speaker identity across diverse linguistic and emotional contexts. Our approach introduces an effective speaker-emotion disentanglement mechanism with in-batch contrastive learning, enabling independent manipulation of speaker identity and emotional style, as well as a rotational emotional embedding integration method for smooth emotion control. To support comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality emotional speech dataset containing 10 hours of Mandarin speech from six professional speakers across seven emotional categories. Extensive experiments demonstrate that our system, Marco-Voice, achieves substantial improvements in both objective and subjective metrics. Comprehensive evaluations and analysis were conducted; the results show that Marco-Voice delivers competitive performance in terms of speech clarity and emotional richness, representing a substantial advance in the field of expressive neural speech synthesis.
[NLP-63] Diagnosing Memorization in Chain-of-Thought Reasoning One Token at a Time
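[Quick read]: This paper addresses errors in the Chain-of-Thought (CoT) reasoning of Large Language Models (LLMs) that stem from memorization, especially the performance drop that occurs when inputs change slightly. The core challenge is to identify and quantify how memorization sources at different ranges influence each token in a reasoning chain, and thereby to show how spurious memorized patterns trigger intermediate errors that cascade into incorrect final answers. The key to the solution is STIM (Source-aware Token-level Identification of Memorization), a framework that attributes each token in a reasoning chain to a local, mid-range, or long-range memorization source based on statistical co-occurrence in the pretraining corpus. The token-level analysis shows that models rely more on memorization in complex or long-tail cases, and that local memorization is often the dominant driver of errors, accounting for up to 67% of wrong tokens. STIM's memorization scores can also help predict the wrong tokens in a wrong reasoning step.
Link: https://arxiv.org/abs/2508.02037
Authors: Huihan Li,You Chen,Siyuan Wang,Yixin He,Ninareh Mehrabi,Rahul Gupta,Xiang Ren
Affiliations: University of Southern California; University of California, San Diego; Amazon AGI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: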
Abstract:Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs alter slightly, raising concerns about the extent to which their success relies on memorization. This issue is especially acute in Chain-of-Thought (CoT) reasoning, where spurious memorized patterns can trigger intermediate errors that cascade into incorrect final answers. We introduce STIM, a novel framework for Source-aware Token-level Identification of Memorization, which attributes each token in a reasoning chain to one of multiple memorization sources - local, mid-range, or long-range - based on their statistical co-occurrence with the token in the pretraining corpus. Our token-level analysis across tasks and distributional settings reveals that models rely more on memorization in complex or long-tail cases, and that local memorization is often the dominant driver of errors, leading to up to 67% of wrong tokens. We also show that memorization scores from STIM can be effective in predicting the wrong tokens in the wrong reasoning step. STIM offers a powerful tool for diagnosing and improving model reasoning and can generalize to other structured step-wise generation tasks.
[NLP-64] SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models
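[Quick read]: This paper addresses the insufficient evaluation of reasoning in speech-based scenarios for Large Audio-Language Models (LALMs). Existing evaluations focus mainly on surface-level perception, such as sentence-level transcription and emotion recognition, and neglect contextual and inference-driven reasoning. The key to the solution is SpeechR, a unified benchmark for evaluating reasoning over speech along three dimensions: factual retrieval, procedural inference, and normative judgment. It uses three evaluation formats: a multiple-choice version, a generative version that assesses reasoning chains, and an acoustic-feature version that varies stress and emotion. The evaluation reveals that high transcription accuracy does not translate into strong reasoning capability.
Link: https://arxiv.org/abs/2508.02018
Authors: Wanqi Yang,Yanda Li,Yunchao Wei,Meng Fang,Ling Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: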
Abstract:Large audio-language models (LALMs) have achieved near-human performance in sentence-level transcription and emotion recognition. However, existing evaluations focus mainly on surface-level perception, leaving the capacity of models for contextual and inference-driven reasoning in speech-based scenarios insufficiently examined. To address this gap, we introduce SpeechR, a unified benchmark for evaluating reasoning over speech in large audio-language models. SpeechR evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment. It includes three distinct evaluation formats. The multiple-choice version measures answer selection accuracy. The generative version assesses the coherence and logical consistency of reasoning chains. The acoustic-feature version investigates whether variations in stress and emotion affect reasoning performance. Evaluations on eleven state-of-the-art LALMs reveal that high transcription accuracy does not translate into strong reasoning capabilities. SpeechR establishes a structured benchmark for evaluating reasoning in spoken language, enabling more targeted analysis of model capabilities across diverse dialogue-based tasks.
[NLP-65] SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents
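[Quick read]: This paper addresses the limited attention that role-playing agent research pays to the speech modality, in particular the lack of a systematic evaluation framework for Speech Role-Playing Agents (SRPAs) in realistic interactive scenarios. The key to the solution is SpeechRole-Data, a large-scale, high-quality dataset of 98 diverse roles and 112k speech-based single-turn and multi-turn conversations, in which each role has distinct vocal characteristics such as timbre and prosody. The paper also proposes SpeechRole-Eval, a multidimensional benchmark that assesses SRPAs on fundamental interaction ability, speech expressiveness, and role-playing fidelity.
Link: https://arxiv.org/abs/2508.02013
Authors: Changhao Jiang,Jiajun Sun,Yifei Cao,Jiabao Zhuang,Hui Li,Xiaoran Fan,Ming Zhang,Junjie Ye,Shihan Dou,Zhiheng Xi,Jingqi Tong,Yilong Wu,Baoyu Fan,Zhen Wang,Tao Liang,Zhihui Fei,Mingyang Wan,Guojun Ma,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang
Affiliations: Fudan University; ByteDance
Categories: Computation and Language (cs.CL)
Comments: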
Abstract:Recently, role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. In particular, there is a lack of systematic evaluation for Speech Role-Playing Agents (SRPAs). To address this gap, we construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations. Each role demonstrates distinct vocal characteristics, including timbre and prosody, thereby enabling more sophisticated speech role-playing. Furthermore, we propose SpeechRole-Eval, a multidimensional evaluation benchmark that systematically assesses SRPAs performance in key aspects such as fundamental interaction ability, speech expressiveness, and role-playing fidelity. Experimental results reveal the advantages and challenges of both cascaded and end-to-end speech role-playing agents in maintaining vocal style consistency and role coherence. We release all data, code, and baseline models to provide a solid foundation for speech-driven multimodal role-playing research and to foster further developments in this field.
[NLP-66] Prompting Large Language Models to Detect Dementia Family Caregivers
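[Quick read]: This paper addresses how to automatically identify tweets posted by people who have a family member with dementia; the core challenge is distinguishing tweets that mention dementia in the context of a family member from those that do not. The key to the solution is binary classification with large language models (LLMs) under various prompting strategies. A simple zero-shot prompt on a fine-tuned model performed best, reaching a macro F1-score of 0.95 on both the validation set and the test set.
Link: https://arxiv.org/abs/2508.01999
Authors: Md Badsha Biswas,Özlem Uzuner
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: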
Abstract:Social media, such as Twitter, provides opportunities for caregivers of dementia patients to share their experiences and seek support for a variety of reasons. Availability of this information online also paves the way for the development of internet-based interventions in their support. However, for this purpose, tweets written by caregivers of dementia patients must first be identified. This paper demonstrates our system for the SMM4H 2025 shared task 3, which focuses on detecting tweets posted by individuals who have a family member with dementia. The task is outlined as a binary classification problem, differentiating between tweets that mention dementia in the context of a family member and those that do not. Our solution to this problem explores large language models (LLMs) with various prompting methods. Our results show that a simple zero-shot prompt on a fine-tuned model yielded the best results. Our final system achieved a macro F1-score of 0.95 on the validation set and the test set. Our full code is available on GitHub.
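For readers unfamiliar with the reported metric, macro F1 is the unweighted mean of the per-class F1 scores, so the minority class counts as much as the majority class. The sketch below is a plain-Python illustration of that standard definition, not the shared task's official scorer; the function names are hypothetical.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)
```

With gold labels `[1, 1, 0, 0]` and predictions `[1, 0, 0, 0]`, the per-class scores are 2/3 for class 1 and 0.8 for class 0, so the macro F1 is about 0.733.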
[NLP-67] Contextually Aware E-Commerce Product Question Answering using RAG
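[Quick read]: This paper addresses the information overload on e-commerce product pages, which makes it hard for users to find what they need quickly and accurately, and the limited ability of existing Product Question Answering (PQA) systems to exploit user context and diverse product information. The key to the solution is a scalable, end-to-end framework based on Retrieval Augmented Generation (RAG) that deeply integrates contextual understanding: it combines conversational history, user profiles, and product attributes to answer objective, subjective, and multi-intent queries accurately, and it identifies information gaps in the catalog to support ongoing content improvement.
Link: https://arxiv.org/abs/2508.01990
Authors: Praveen Tangarajan, Anand A. Rajasekar, Manish Rathi, Vinay Rao Dandin, Ozan Ersoy
Affiliations: Flipkart US R&D Center
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 1 figure, 5 tables. Preprint under review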
Abstract:E-commerce product pages contain a mix of structured specifications, unstructured reviews, and contextual elements like personalized offers or regional variants. Although informative, this volume can lead to cognitive overload, making it difficult for users to quickly and accurately find the information they need. Existing Product Question Answering (PQA) systems often fail to utilize rich user context and diverse product information effectively. We propose a scalable, end-to-end framework for e-commerce PQA using Retrieval Augmented Generation (RAG) that deeply integrates contextual understanding. Our system leverages conversational history, user profiles, and product attributes to deliver relevant and personalized answers. It adeptly handles objective, subjective, and multi-intent queries across heterogeneous sources, while also identifying information gaps in the catalog to support ongoing content improvement. We also introduce novel metrics to measure the framework’s performance which are broadly applicable for RAG system evaluations.
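[NLP-68] TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models
[Quick read]: This paper addresses the severe data scarcity facing Tibetan, a low-resource language, in natural language processing; the core challenge is the lack of large-scale, multi-domain corpora with reasoning content. The key to the solution is an automated construction method based on chain-of-thought (CoT) prompting that uses large language models (LLMs) to generate TIBSTC-CoT, a high-quality Tibetan dataset covering diverse domains and reasoning patterns and providing a structured basis for model training. Building on it, the team develops the Sunshine-thinking family of Tibetan-centric LLMs, trained entirely on TIBSTC-CoT, which show reasoning and generation ability comparable to state-of-the-art multilingual LLMs, closing the loop from data construction to model innovation and advancing inclusive AI.
Link: https://arxiv.org/abs/2508.01977
Authors: Fan Gao, Cheng Huang, Nyima Tashi, Yutong Liu, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Hao Wang, Yongbin Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: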
Abstract:To address the severe data scarcity in Tibetan, a low-resource language spoken by over six million people, we introduce TIBSTC-CoT, a large-scale, multi-domain Tibetan dataset automatically constructed via chain-of-thought prompting with large language models (LLMs). TIBSTC-CoT establishes a scalable and reproducible framework for dataset creation in low-resource settings, covering diverse domains and reasoning patterns essential for language understanding and generation. Building on this dataset, we develop the Sunshine-thinking LLM family, a series of Tibetan-centric LLMs equipped with chain-of-thought capabilities. Trained entirely on TIBSTC-CoT, Sunshine-thinking has demonstrated strong reasoning and generation performance, comparable to state-of-the-art (SOTA) multilingual LLMs. Our work marks a significant step toward inclusive AI by enabling high-quality Tibetan language processing through both resource creation and model innovation. All data are available: this https URL.
[NLP-69] SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension
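[Quick read]: This paper addresses the loss of contextual information caused by text chunking in retrieval-augmented generation (RAG) over long documents. Prior approaches extend the embedding model's context window to encode longer chunks, but gains are limited by embedding-model capacity and by the practical need to return localized evidence. The core solution is the situated embedding (SitEmb): encoding a short chunk conditioned on a broader context window to enrich its semantic representation and so improve retrieval accuracy. The key innovation is a new training paradigm that lets the embedding model capture the relationship between a short chunk and its surrounding context. On this basis the authors build the lightweight but strong SitEmb-v1 and the larger SitEmb-v1.5, which clearly outperform mainstream embedding models across languages and downstream tasks.
Link: https://arxiv.org/abs/2508.01959
Authors: Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu
Affiliations: HKUST; WeChat AI, Tencent; IIE-CAS; Zhejiang University
Categories: Computation and Language (cs.CL)
Comments: Our trained models can be downloaded from: this https URL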
Abstract:Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance – i.e., situating a chunk’s meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters, with only 1B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
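To make the idea of "situating" a short chunk concrete, the sketch below only builds the inputs: each retrieval unit stays short, but it is paired with a wider window of surrounding text that an encoder could condition on. The function name, the symmetric window, and the whitespace tokenization are assumptions for illustration; the abstract does not specify how SitEmb constructs its context or trains the encoder.

```python
from typing import List, Tuple


def situate_chunks(
    tokens: List[str], chunk_size: int, context_size: int
) -> List[Tuple[List[str], List[str]]]:
    """Pair each short chunk with a wider window of surrounding tokens.

    Assumption: the context is a symmetric window of `context_size` tokens
    on each side of the chunk, clipped at the document boundaries.
    """
    pairs = []
    for start in range(0, len(tokens), chunk_size):
        end = min(start + chunk_size, len(tokens))
        ctx_start = max(0, start - context_size)
        ctx_end = min(len(tokens), end + context_size)
        pairs.append((tokens[start:end], tokens[ctx_start:ctx_end]))
    return pairs


doc = "a b c d e f g h".split()
for chunk, context in situate_chunks(doc, chunk_size=2, context_size=2):
    print(chunk, "situated in", context)
```

The retrieved evidence remains the short chunk; only its representation would depend on the wider window.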
[NLP-70] ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks
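[Quick read]: This paper addresses the weak reasoning of vision-language models (VLMs) over long video sequences, especially in embodied tasks that require understanding long-horizon trajectories from continuous visual input. Existing methods struggle to model frame dependencies over long time spans, leading to inaccurate reasoning and hallucinations. The key to the solution is ROVER (Reasoning Over VidEo Recursively), a framework that recursively decomposes a long video trajectory into segments corresponding to shorter subtasks, enabling focused reasoning over temporally localized frame sequences while preserving global context. The method is implemented with in-context learning and uses a subtask-specific sliding window, so its time complexity scales linearly with video length, which improves reasoning accuracy and reduces hallucinations at non-optimal moments.
Link: https://arxiv.org/abs/2508.01943
Authors: Philip Schroeder, Ondrej Biza, Thomas Weng, Hongyin Luo, James Glass
Affiliations: MIT CSAIL; RAI Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: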
Abstract:Vision-language models (VLMs) have exhibited impressive capabilities across diverse image understanding tasks, but still struggle in settings that require reasoning over extended sequences of camera frames from a video. This limits their utility in embodied settings, which require reasoning over long frame sequences from a continuous stream of visual input at each moment of a task attempt. To address this limitation, we propose ROVER (Reasoning Over VidEo Recursively), a framework that enables the model to recursively decompose long-horizon video trajectories into segments corresponding to shorter subtasks within the trajectory. In doing so, ROVER facilitates more focused and accurate reasoning over temporally localized frame sequences without losing global context. We evaluate ROVER, implemented using an in-context learning approach, on diverse OpenX Embodiment videos and on a new dataset derived from RoboCasa that consists of 543 videos showing both expert and perturbed non-expert trajectories across 27 robotic manipulation tasks. ROVER outperforms strong baselines across three video reasoning tasks: task progress estimation, frame-level natural language reasoning, and video question answering. We observe that, by reducing the number of frames the model reasons over at each timestep, ROVER mitigates hallucinations, especially during unexpected or non-optimal moments of a trajectory. In addition, by enabling the implementation of a subtask-specific sliding context window, ROVER’s time complexity scales linearly with video length, an asymptotic improvement over baselines. Demos, code, and data available at: this https URL
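The linear-scaling claim can be illustrated with simple counting. The sketch assumes a baseline that re-reads the entire frame history at every timestep and compares it with a fixed-size sliding window; the window length of 10 is a hypothetical value, and ROVER's actual windows are subtask-specific rather than fixed.

```python
def frames_full_history(num_frames: int) -> int:
    """Total frames processed if every timestep re-reads all frames so far."""
    return sum(t for t in range(1, num_frames + 1))


def frames_sliding_window(num_frames: int, window: int) -> int:
    """Total frames processed if every timestep reads at most `window` frames."""
    return sum(min(t, window) for t in range(1, num_frames + 1))


for length in (100, 200, 400):
    print(length, frames_full_history(length), frames_sliding_window(length, window=10))
```

Doubling the video length roughly quadruples the full-history count, while the windowed count roughly doubles.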
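[NLP-71] Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback KDD ECML
[Quick read]: This paper examines why large language models (LLMs) overuse particular words such as “delve” and “intricate”, a phenomenon whose causes have been unclear. The key to the approach is to experimentally emulate the Learning from Human Feedback (LHF) procedure, identify LHF-induced lexical preferences, and examine the potential divergence in lexical expectations between different populations (LHF workers versus LLM users). The study presents a straightforward detection procedure and shows empirically that LHF does lead to the overuse of certain words, which can be seen as a form of misalignment, with implications for data and procedural transparency in explainable AI and alignment research.
Link: https://arxiv.org/abs/2508.01930
Authors: Tom S. Juzek, Zina B. Ward
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in the Proceedings of the 5th Workshop on Bias and Fairness in AI (BIAS 2025) at ECML PKDD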
Abstract:Large Language Models (LLMs) are known to overuse certain terms like “delve” and “intricate.” The exact reasons for these lexical choices, however, have been unclear. Using Meta’s Llama model, this study investigates the contribution of Learning from Human Feedback (LHF), under which we subsume Reinforcement Learning from Human Feedback and Direct Preference Optimization. We present a straightforward procedure for detecting the lexical preferences of LLMs that are potentially LHF-induced. Next, we more conclusively link LHF to lexical overuse by experimentally emulating the LHF procedure and demonstrating that participants systematically prefer text variants that include certain words. This lexical overuse can be seen as a sort of misalignment, though our study highlights the potential divergence between the lexical expectations of different populations – namely LHF workers versus LLM users. Our work contributes to the growing body of research on explainable artificial intelligence and emphasizes the importance of both data and procedural transparency in alignment research.
[NLP-72] Quantum-RAG and PunGPT 2: Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language
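[Quick read]: This paper addresses the long-standing neglect of low-resource languages, Punjabi in particular, in natural language processing (NLP): existing large language models (LLMs) support such languages poorly and perform badly on grammatical structure, semantic representation, and factual accuracy. The key to the solution is an end-to-end open-source stack. First, PunGPT2 is trained from scratch on a 35GB diverse corpus, with a tokenizer optimized using byte pair encoding (BPE) and linguistically aligned pretraining objectives, to capture Punjabi's distinctive syntax and morphology. Second, Pun-RAG, a retrieval-augmented generation framework, pairs a dense FAISS retriever with a curated knowledge base to improve factual accuracy and domain recall. Third, Pun-Instruct uses QLoRA-based parameter-efficient fine-tuning to add instruction following at much lower compute cost. Finally, Quantum-RAG is a hybrid retrieval system that fuses sparse (BM25) and dense retrieval and performs semantic matching with quantum-inspired amplitude embeddings and kernel similarity, achieving better contextual relevance with minimal memory overhead; the authors describe it as the first practical integration of quantum representations in low-resource language generation. Overall, the work offers a reusable blueprint for extending LLMs to underrepresented languages.
Link: https://arxiv.org/abs/2508.01918
Authors: Jaskaranjeet Singh, Rakesh Thakur
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: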
Abstract:Despite the rapid advancement of large language models (LLMs), low-resource languages remain largely excluded from the NLP landscape. We present PunGPT2, the first fully open-source suite of Punjabi large language models, trained from scratch on a 35GB domain-diverse corpus encompassing literature, religious texts, news, and social discourse. Unlike prior multilingual approaches, PunGPT2 captures rich syntactic and morphological features unique to Punjabi through a tokenizer optimised with byte pair encoding and linguistically aligned pretraining objectives. To improve factual grounding and domain recall, we introduce Pun-RAG, a retrieval-augmented generation framework combining PunGPT2 with a dense FAISS retriever over a curated Punjabi knowledge base. We further develop Pun-Instruct, a parameter-efficient, instruction-tuned variant using QLoRA, enabling robust zero-shot and instruction-following performance with significantly reduced compute needs. As a key innovation, we propose Quantum-RAG, a novel hybrid retrieval system that fuses sparse (BM25) and dense methods with quantum-inspired semantic matching. By encoding queries using amplitude-based embeddings and retrieving via quantum kernel similarity, Quantum-RAG achieves improved contextual relevance with minimal memory overhead, marking the first practical integration of quantum representations in low-resource language generation. Our models significantly outperform strong multilingual baselines (mBERT, mT5, MuRIL) in perplexity, factuality, and fluency. This work provides a scalable, reproducible blueprint for extending LLM capabilities to underrepresented languages and pioneers quantum-aware retrieval in low-resource NLP.
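The abstract does not say how Quantum-RAG computes its "quantum kernel similarity", so the following is only a sketch of one common quantum-inspired construction, not the authors' method: a vector is amplitude-encoded by scaling it to unit L2 norm, and two encodings are compared by the fidelity |⟨a|b⟩|². The fusion weight `alpha` and the idea of a linear combination with a sparse score are assumptions.

```python
import math
from typing import Sequence


def amplitude_encode(vec: Sequence[float]) -> list:
    """Scale a real vector to unit L2 norm so its entries can act as amplitudes."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot amplitude-encode the zero vector")
    return [x / norm for x in vec]


def fidelity_kernel(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared inner product of the two amplitude-encoded vectors, in [0, 1]."""
    qa, qb = amplitude_encode(a), amplitude_encode(b)
    return sum(x * y for x, y in zip(qa, qb)) ** 2


def hybrid_score(sparse_score: float, dense_score: float, alpha: float = 0.5) -> float:
    """Hypothetical linear fusion of a sparse (e.g. BM25-style) and a dense score."""
    return alpha * sparse_score + (1 - alpha) * dense_score


query, doc = [1.0, 0.0], [1.0, 1.0]
print(hybrid_score(sparse_score=0.4, dense_score=fidelity_kernel(query, doc)))
```

In practice the sparse score would need normalizing to a comparable range before fusion; that step is omitted here.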
[NLP-73] Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning
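[Quick read]: This paper asks whether different kinds of semantic information in a neural model's internal representations are encoded in separate, interpretable subspaces, and whether such “natural” subspaces can be found automatically without labeled data. The core solution is neighbor distance minimization (NDM), an unsupervised method that learns interpretable, non-basis-aligned subspaces by optimizing a seemingly unrelated training objective. Experiments show that the information encoded in these subspaces tends to share the same abstract concept across inputs, much like “variables” used by the model; the subspaces connect strongly to known circuits in GPT-2, and the method scales to 2B-parameter models, offering a new perspective on understanding model internals and building interpretable circuits.
Link: https://arxiv.org/abs/2508.01916
Authors: Xinting Huang, Michael Hahn
Affiliations: Saarland University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: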
Abstract:Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these “natural” subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable subspaces by a seemingly unrelated training objective. Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner. Qualitative analysis shows subspaces are interpretable in many cases, and encoded information in obtained subspaces tends to share the same abstract concept across different inputs, making such subspaces similar to “variables” used by the model. We also conduct quantitative experiments using known circuits in GPT-2; results show a strong connection between subspaces and circuit variables. We also provide evidence showing scalability to 2B models by finding separate subspaces mediating context and parametric knowledge routing. Viewed more broadly, our findings offer a new perspective on understanding model internals and building circuits.
[NLP-74] A Decentralized Framework for Ethical Authorship Validation in Academic Publishing: Leveraging Self-Sovereign Identity and Blockchain Technology
[Quick read]: This paper addresses increasingly serious unethical practices in academic publishing, including unconsented authorship, gift authorship, author ambiguity, and undisclosed conflicts of interest, which undermine the credibility and traceability of scholarly work. The key to the solution is a framework built on Self-Sovereign Identity (SSI) and blockchain technology: Decentralized Identifiers (DIDs) and Verifiable Credentials (VCs) securely verify author identities and contributor roles, a blockchain-based trust registry immutably records authorship consent and peer-review activity, and Zero-Knowledge Proofs (ZKPs) detect conflicts of interest without revealing sensitive information. The authors report that the scheme improves transparency, accountability, and ethical compliance in academic publishing.
Link: https://arxiv.org/abs/2508.01913
Authors: Kamal Al-Sabahi, Yousuf Khamis Al Mabsali
Affiliations: College of Banking and Financial Studies
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:
Abstract:Academic publishing, integral to knowledge dissemination and scientific advancement, increasingly faces threats from unethical practices such as unconsented authorship, gift authorship, author ambiguity, and undisclosed conflicts of interest. While existing infrastructures like ORCID effectively disambiguate researcher identities, they fall short in enforcing explicit authorship consent, accurately verifying contributor roles, and robustly detecting conflicts of interest during peer review. To address these shortcomings, this paper introduces a decentralized framework leveraging Self-Sovereign Identity (SSI) and blockchain technology. The proposed model uses Decentralized Identifiers (DIDs) and Verifiable Credentials (VCs) to securely verify author identities and contributions, reducing ambiguity and ensuring accurate attribution. A blockchain-based trust registry records authorship consent and peer-review activity immutably. Privacy-preserving cryptographic techniques, especially Zero-Knowledge Proofs (ZKPs), support conflict-of-interest detection without revealing sensitive data. Verified authorship metadata and consent records are embedded in publications, increasing transparency. A stakeholder survey of researchers, editors, and reviewers suggests the framework improves ethical compliance and confidence in scholarly communication. This work represents a step toward a more transparent, accountable, and trustworthy academic publishing ecosystem.
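The "immutable" registry of consent records can be illustrated with a hash chain, the basic tamper-evidence mechanism that blockchains build on. This is a single-machine sketch under stated assumptions, not the paper's system: the record fields and the `did:example:` identifiers are hypothetical, and there is no consensus, signing, or DID resolution.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(prev_hash: str, record: dict) -> str:
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(chain: list, record: dict) -> dict:
    """Append a record whose hash also commits to the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev_hash, "record": record, "hash": _digest(prev_hash, record)}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edit to an earlier record breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


ledger = []
append_record(ledger, {"author": "did:example:alice", "paper": "P-1", "consent": True})
append_record(ledger, {"author": "did:example:bob", "paper": "P-1", "consent": True})
print(verify_chain(ledger))   # True
ledger[0]["record"]["consent"] = False  # simulate tampering with a stored record
print(verify_chain(ledger))   # False
```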
[NLP-75] Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models
[Quick read]: This paper addresses the distribution shift caused by introducing new data during continual pre-training of large language models (LLMs), which degrades performance on previously learned tasks (catastrophic forgetting). The key to the solution is two classic continual-learning strategies, experience replay and gradient alignment, validated in large-scale experiments. Both markedly improve learning stability and mitigate forgetting; in particular, the paper proposes an efficient implementation of meta-experience replay (MER) that brings the benefits of gradient alignment at negligible compute and memory overhead. A scaling analysis further shows that small replay rates use compute more efficiently than simply enlarging the model, whereas high replay rates are less economical than scaling the model.
Link: https://arxiv.org/abs/2508.01908
Authors: Istabrak Abbes, Gopeshh Subbaraj, Matthew Riemer, Nizar Islah, Benjamin Therien, Tsuguchika Tabaru, Hiroaki Kingetsu, Sarath Chandar, Irina Rish
Affiliations: Université de Montréal; Mila – Quebec AI Institute; Chandar Research Lab; IBM Research; Fujitsu Research; Polytechnique Montréal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Training large language models (LLMs) typically involves pre-training on massive corpora, only to restart the process entirely when new data becomes available. A more efficient and resource-conserving approach would be continual pre-training, where models are updated with new data rather than retraining from scratch. However, the introduction of new data often causes distribution shifts, leading to performance degradation on previously learned tasks. In this paper, we take a deeper look at two popular proposals for addressing this distribution shift within the continual learning literature: experience replay and gradient alignment. We consider continual pre-training of models within the Llama family of architectures at a large scale across languages with 100 billion tokens of training data in each language, finding that both replay and gradient alignment lead to more stable learning without forgetting. This conclusion holds both as we vary the model scale and as we vary the number and diversity of tasks. Moreover, we are the first to demonstrate the effectiveness of gradient alignment techniques in the context of LLM pre-training and propose an efficient implementation of meta-experience replay (MER) that imbues experience replay with the benefits of gradient alignment despite negligible compute and memory overhead. Our scaling analysis across model sizes and replay rates indicates that small rates of replaying old examples are definitely a more valuable use of compute than investing in model size, but that it is more compute efficient to scale the size of the model than invest in high rates of replaying old examples.
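Experience replay at a given "replay rate" amounts to reserving a fraction of each training batch for examples from earlier data. The sketch below shows only that batch-mixing step; the function name and the uniform sampling are assumptions, and it does not cover gradient alignment or the paper's MER implementation.

```python
import random
from typing import List, Sequence


def mix_batch(
    new_data: Sequence[str],
    replay_buffer: Sequence[str],
    batch_size: int,
    replay_rate: float,
    rng: random.Random,
) -> List[str]:
    """Build one batch with about `replay_rate` of its slots drawn from old data.

    Assumes `new_data` holds at least as many examples as the non-replay slots.
    """
    n_replay = min(round(batch_size * replay_rate), len(replay_buffer))
    n_new = batch_size - n_replay
    batch = rng.sample(list(replay_buffer), n_replay) + rng.sample(list(new_data), n_new)
    rng.shuffle(batch)  # avoid a fixed old-then-new ordering inside the batch
    return batch


rng = random.Random(0)
new_docs = [f"new-{i}" for i in range(100)]   # stand-ins for new-language examples
old_docs = [f"old-{i}" for i in range(100)]   # stand-ins for previously seen examples
print(mix_batch(new_docs, old_docs, batch_size=8, replay_rate=0.25, rng=rng))
```

With `replay_rate=0.25`, two of the eight slots come from old data, which is the kind of small rate the scaling analysis favors.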
[NLP-76] Complete Evasion Zero Modification: PDF Attacks on AI Text Detection
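[Quick read]: This paper addresses the weak robustness of current AI-generated text detectors against evasion attacks. The key contribution is PDFuzz, a new attack that exploits the discrepancy between visual text layout and text extraction order in PDF documents: by precisely manipulating character positions, it scrambles the extracted sequence while leaving the textual content and the visual appearance unchanged, causing the detector to fail. Experiments show that the attack reduces the ArguGPT detector's accuracy from 93.6% to 50.4% and its F1 score from 0.938 to 0.0, exposing a vulnerability in existing detection systems that is inherent to the PDF structure.
Link: https://arxiv.org/abs/2508.01887
Authors: Aldan Creo
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Code: this https URL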
Abstract:AI-generated text detectors have become essential tools for maintaining content authenticity, yet their robustness against evasion attacks remains questionable. We present PDFuzz, a novel attack that exploits the discrepancy between visual text layout and extraction order in PDF documents. Our method preserves exact textual content while manipulating character positioning to scramble extraction sequences. We evaluate this approach against the ArguGPT detector using a dataset of human and AI-generated text. Our results demonstrate complete evasion: detector performance drops from (93.6 ± 1.4)% accuracy and 0.938 ± 0.014 F1 score to random-level performance ((50.4 ± 3.2)% accuracy, 0.0 F1 score) while maintaining perfect visual fidelity. Our work reveals a vulnerability in current detection systems that is inherent to PDF document structures and underscores the need for implementing sturdy safeguards against such attacks. We make our code publicly available at this https URL.
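The layout-versus-extraction discrepancy can be shown with a toy model that involves no PDF library. Each glyph is stored with its own x-position; a renderer places glyphs by position, whereas a naive extractor reads them in storage order. This is a conceptual sketch only: the data model is invented here, and it does not reproduce how PDFuzz or any real PDF extractor works.

```python
import random
from typing import List, Tuple

Glyph = Tuple[int, str]  # (x position on the line, character)


def place_glyphs(text: str) -> List[Glyph]:
    """Give every character an explicit position, in reading order."""
    return [(x, ch) for x, ch in enumerate(text)]


def scramble_stream(glyphs: List[Glyph], seed: int) -> List[Glyph]:
    """Reorder the stored glyphs without changing any position."""
    shuffled = list(glyphs)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def render(glyphs: List[Glyph]) -> str:
    """What a viewer shows: glyphs laid out by position."""
    return "".join(ch for _, ch in sorted(glyphs))


def extract_in_stream_order(glyphs: List[Glyph]) -> str:
    """What a naive extractor returns: glyphs in storage order."""
    return "".join(ch for _, ch in glyphs)


stream = scramble_stream(place_glyphs("This essay was written by a model."), seed=7)
print(render(stream))                   # visually identical to the original sentence
print(extract_in_stream_order(stream))  # same characters, different order
```

An extractor that sorts glyphs by position before joining them recovers the rendered text in this toy model; real PDFs involve fonts, transforms, and multi-line layout, so robust extraction is harder than this suggests.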
[NLP-77] Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models
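[Quick read]: This paper addresses the frequent hallucinations of large language models (LLMs), that is, outputs that read fluently but are factually wrong or unsupported. The key to the solution is Counterfactual Probing: dynamically generate counterfactual statements that look plausible but contain subtle factual errors, then measure the model's sensitivity to these perturbations. Based on the hypothesis that genuine knowledge is robust to counterfactual variation whereas hallucinated content shows inconsistent confidence patterns, the method detects and mitigates hallucinations. It needs no retraining and can be integrated into existing LLM pipelines as a real-time verification mechanism; on TruthfulQA and other datasets it outperforms baselines and reduces hallucination scores by 24.5% on average.
Link: https://arxiv.org/abs/2508.01862
Authors: Yijun Feng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: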
Abstract:Large Language Models have demonstrated remarkable capabilities across diverse tasks, yet they frequently generate hallucinations: outputs that are fluent but factually incorrect or unsupported. We propose Counterfactual Probing, a novel approach for detecting and mitigating hallucinations in LLM outputs. Our method dynamically generates counterfactual statements that appear plausible but contain subtle factual errors, then evaluates the model’s sensitivity to these perturbations. We hypothesize that genuine knowledge exhibits robustness to counterfactual variations, while hallucinated content shows inconsistent confidence patterns when confronted with plausible alternatives. Our comprehensive evaluation on TruthfulQA, factual statement datasets, and curated hallucination examples demonstrates that counterfactual probing achieves superior detection performance compared to baseline methods, while our adaptive mitigation strategies reduce hallucination scores by an average of 24.5%. The approach requires no model retraining and can be integrated into existing LLM pipelines as a real-time verification mechanism.
[NLP-78] Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents
[Quick read]: This paper addresses the limited cognitive reasoning of current web agents on complex tasks, which stems from the lack of a systematic knowledge structure and restricts generalization to unseen tasks. The core challenge is turning information in the web environment into knowledge the agent can use for reasoning and action. The key to the solution is the Web-CogKnowledge framework, which divides knowledge into Factual, Conceptual, and Procedural types and maps them to two learning stages, “Memorizing and Understanding” and “Exploring”, covering the “what” and the “how” respectively. On this basis the authors build Web-CogDataset as a structured knowledge foundation and design Web-CogReasoner, an agent trained with a knowledge-driven Chain-of-Thought (CoT) reasoning framework, closing the loop from knowledge acquisition to reasoning and decision-making. Experiments show it clearly outperforms existing models on unseen tasks, highlighting the role of structured knowledge in agent cognition.
Link: https://arxiv.org/abs/2508.01858
Authors: Yuhan Guo, Cong Guo, Aiwen Sun, Hongliang He, Xinyu Yang, Yue Lu, Yingji Zhang, Xuntao Guo, Dong Zhang, Jianzhuang Liu, Jiang Duan, Yijia Xiao, Liangjian Wen, Hai-Ming Xu, Yong Dai
Affiliations: Southwestern University of Finance and Economics; Shanghai Jiao Tong University; Central South University; Hithink Research; Westlake University; Harbin Institute of Technology; University of Manchester; University of California, Los Angeles; University of Adelaide; Fudan University; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Our code and data is open sourced at this https URL
Abstract:Multimodal large-scale models have significantly advanced the development of web agents, enabling perception and interaction with digital environments akin to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge to effectively engage in cognitive reasoning. Therefore, we decompose a web agent’s capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose the Web-CogKnowledge Framework, categorizing knowledge as Factual, Conceptual, and Procedural. In this framework, knowledge content learning corresponds to the agent’s processes of Memorizing and Understanding, which rely on the first two knowledge types, representing the “what” of learning. Conversely, cognitive processes correspond to Exploring, grounded in Procedural knowledge, defining the “how” of reasoning and action. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites, designed to systematically instill the core knowledge necessary for a web agent. This dataset serves as the agent’s conceptual grounding (the “nouns” upon which comprehension is built) as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, developing and training our proposed agent, the Web-CogReasoner. Extensive experimentation reveals its significant superiority over existing models, especially in generalizing to unseen tasks where structured knowledge is decisive. To enable rigorous evaluation, we introduce the Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities. Our code and data are open sourced at this https URL
[NLP-79] MLP Memory: Language Modeling with Retriever-pretrained External Memory
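[Quick read]: This paper addresses the hallucinations common in text generated by large language models (LLMs), especially on knowledge-intensive tasks, which severely limit practical use. Retriever-augmented generation (RAG) mitigates the problem, but its non-parametric retriever is hard to integrate deeply with the LLM. The key to the solution is to decouple memorization from the LLM by adding a pretrained, differentiable external MLP memory, trained to imitate a retriever's behavior on the entire pretraining dataset, that interacts efficiently with the transformer decoder. The architecture improves perplexity and downstream performance, exhibits steeper power-law scaling, outperforms decoder-only models on multiple benchmarks, and offers faster inference.
Link: https://arxiv.org/abs/2508.01832
Authors: Rubin Wei, Jiaqi Cao, Jiarui Wang, Jushi Kai, Qipeng Guo, Bowen Zhou, Zhouhan Lin
Affiliations: LUMIA Lab, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; Electronic Engineering, Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: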
Abstract:While modern decoder-only LLMs achieve superior performance across various domains, hallucinations have risen to be a common problem in their generated text, hindering their application in knowledge-intensive tasks. Retriever-augmented generation (RAG) offers a solution, but the non-parametric nature of the retriever hinders its deep interaction with LLM. In this work, we propose to decouple memorization from the LLM decoder using a pretrained, differentiable external memory. The external memory is an MLP pretrained by imitating the behavior of a retriever on the entire pretraining dataset. Our resulting architecture, which comprises a transformer decoder and an external MLP memory pretrained on language modeling and retriever imitation respectively, demonstrates strong perplexity and performance on downstream tasks. Experiments show our architecture exhibits steeper power-law scaling with model size, achieving 17.5% and 24.1% improvement on WikiText-103 and Web datasets compared to decoder-only models while benefiting from added training without overfitting. We demonstrate superior performance on three hallucination benchmarks and nine memory-intensive tasks. Additionally, our approach delivers 80× speedup over kNN-LM (500M tokens) and 1.3× faster inference than decoder-only models. Unlike kNN-LM, which impairs reasoning, our MLP memory improves StrategyQA performance. We will open-source our code and models in the future.
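[NLP-80] AgenticT²S: Robust Text-to-SPARQL via Agentic Collaborative Reasoning over Heterogeneous Knowledge Graphs for the Circular Economy
[Quick read]: This paper addresses three challenges in question answering over heterogeneous knowledge graphs (KGQA): inconsistent schemas across graphs, incomplete alignments, and the difficulty of reasoning jointly over distributed data sources. In low-resource domains such as the circular economy, existing text-to-SPARQL methods are limited to single-graph settings or need extensive domain-specific fine-tuning, so they cannot reliably execute queries that span graphs. The key to the solution is AgenticT²S, a modular framework that decomposes KGQA into retrieval, query generation, and verification subtasks, each handled by a specialized agent; a scheduler assigns subgoals to different graphs using weak-to-strong alignment strategies, and a two-stage verifier, combining symbolic validation with counterfactual consistency checks, detects structurally invalid and semantically underspecified queries. On real-world circular economy knowledge graphs the method improves execution accuracy by 17.3% and triple-level F1 by 25.4% while reducing average prompt length by 46.4%, supporting agent-based, schema-aware reasoning for scalable KGQA.
Link: https://arxiv.org/abs/2508.01815
Authors: Yang Zhao, Chengxiao Dai, Wei Zhuo, Tan Chuan Fu, Yue Xiu, Dusit Niyato, Jonathan Z. Low, Eugene Ho Hong Zhuang, Daren Zong Loong Tan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: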
Abstract:Question answering over heterogeneous knowledge graphs (KGQA) involves reasoning across diverse schemas, incomplete alignments, and distributed data sources. Existing text-to-SPARQL approaches rely on large-scale domain-specific fine-tuning or operate within single-graph settings, limiting their generalizability in low-resource domains and their ability to handle queries spanning multiple graphs. These challenges are particularly relevant in domains such as the circular economy, where information about classifications, processes, and emissions is distributed across independently curated knowledge graphs (KGs). We present AgenticT²S, a modular framework that decomposes KGQA into subtasks managed by specialized agents responsible for retrieval, query generation, and verification. A scheduler assigns subgoals to different graphs using weak-to-strong alignment strategies. A two-stage verifier detects structurally invalid and semantically underspecified queries through symbolic validation and counterfactual consistency checks. Experiments on real-world circular economy KGs demonstrate that AgenticT²S improves execution accuracy by 17.3% and triple-level F1 by 25.4% over the best baseline, while reducing the average prompt length by 46.4%. These results demonstrate the benefits of agent-based schema-aware reasoning for scalable KGQA and support decision-making in sustainability domains through robust cross-graph reasoning.
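The abstract reports a triple-level F1 without defining it. One plausible reading, assumed here, is set overlap between the triples a generated query returns and the gold triples; the function and the example triples below are hypothetical illustrations, not the paper's evaluation code.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def triple_f1(predicted: Iterable[Triple], gold: Iterable[Triple]) -> float:
    """Harmonic mean of precision and recall over sets of triples."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        return 1.0  # nothing expected and nothing returned
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


gold_triples = [
    ("ex:PET", "ex:classifiedAs", "ex:Plastic"),
    ("ex:PET", "ex:recycledBy", "ex:MechanicalRecycling"),
    ("ex:MechanicalRecycling", "ex:emits", "ex:CO2"),
    ("ex:PET", "ex:usedIn", "ex:Bottles"),
]
predicted_triples = gold_triples[:2] + [("ex:PET", "ex:usedIn", "ex:Textiles")]
print(round(triple_f1(predicted_triples, gold_triples), 3))  # 0.571
```

Two of three predicted triples are correct (precision 2/3) and two of four gold triples are found (recall 1/2), giving F1 = 4/7.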
[NLP-81] HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark
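[Quick read]: This paper addresses the neglect of semantic understanding in Hebrew natural language processing (NLP), in particular the lack of a high-quality Machine Reading Comprehension (MRC) dataset. The core challenge is that Hebrew is a Morphologically Rich Language (MRL): complex word forms make answer-span boundaries ambiguous, which causes annotation inconsistencies and breaks standard evaluation metrics such as F1 and Exact Match (EM). The key to the solution is a new set of annotation guidelines, a controlled crowdsourcing protocol, and revised evaluation metrics suited to morphologically rich languages, which together yield HeQ (Hebrew QA), a benchmark of 30,147 diverse question-answer pairs. Experiments show that the standard metrics are not appropriate for Hebrew and that performance on morpho-syntactic tasks correlates only weakly with MRC performance, so models designed for the former may underperform on semantics-heavy tasks. The work lays groundwork for better natural language understanding (NLU) models for Hebrew and other MRLs.
Link: https://arxiv.org/abs/2508.01812
Authors: Amir DN Cohen, Hilla Merhav, Yoav Goldberg, Reut Tsarfaty
Affiliations: Bar-Ilan University; Webiks; Allen Institute for AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: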
Abstract:Current benchmarks for Hebrew Natural Language Processing (NLP) focus mainly on morpho-syntactic tasks, neglecting the semantic dimension of language understanding. To bridge this gap, we set out to deliver a Hebrew Machine Reading Comprehension (MRC) dataset, where MRC is to be realized as extractive Question Answering. The morphologically rich nature of Hebrew poses a challenge to this endeavor: the indeterminacy and non-transparency of span boundaries in morphologically complex forms lead to annotation inconsistencies, disagreements, and flaws in standard evaluation metrics. To remedy this, we devise a novel set of guidelines, a controlled crowdsourcing protocol, and revised evaluation metrics that are suitable for the morphologically rich nature of the language. Our resulting benchmark, HeQ (Hebrew QA), features 30,147 diverse question-answer pairs derived from both Hebrew Wikipedia articles and Israeli tech news. Our empirical investigation reveals that standard evaluation metrics such as F1 scores and Exact Match (EM) are not appropriate for Hebrew (and other MRLs), and we propose a relevant enhancement. In addition, our experiments show low correlation between models’ performance on morpho-syntactic tasks and on MRC, which suggests that models designed for the former might underperform on semantics-heavy tasks. The development and exploration of HeQ illustrate some of the challenges MRLs pose in natural language understanding (NLU), fostering progression towards more and better NLU models for Hebrew and other MRLs.
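Why the standard metrics struggle with Hebrew can be seen with the usual SQuAD-style definitions of Exact Match and whitespace-token F1, sketched below. In Hebrew a preposition such as "in" attaches to the following word as a prefix, so a prediction that differs from the gold span only by that prefix shares no token with it at that position. The example pair is an illustration chosen here, not taken from HeQ, and the paper's revised metric is not reproduced.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> bool:
    """True only if the two answer strings are identical after trimming."""
    return prediction.strip() == gold.strip()


def token_f1(prediction: str, gold: str) -> float:
    """F1 over whitespace-separated tokens, counting repeated tokens."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold_answer = "תל אביב"        # "Tel Aviv"
predicted_answer = "בתל אביב"  # "in Tel Aviv": the preposition is a one-letter prefix
print(exact_match(predicted_answer, gold_answer))  # False
print(token_f1(predicted_answer, gold_answer))     # 0.5
```

A single attached letter costs the prediction the exact match and half its F1, even though a human judge would likely accept the answer.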
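[NLP-82] CSLRConformer: A Data-Centric Conformer Approach for Continuous Arabic Sign Language Recognition on the Isharah Dataset
Quick read: This paper tackles signer-independent recognition in Continuous Sign Language Recognition (CSLR), aiming to improve how well systems generalize across signers. The key to its solution is a data-centric methodology with three parts: a structured feature selection strategy, guided by Exploratory Data Analysis (EDA), that isolates communicative keypoints; a rigorous preprocessing pipeline with DBSCAN-based outlier filtering and spatial normalization; and a novel CSLRConformer architecture that adapts the hybrid CNN-Transformer design originally built for speech recognition to model the spatio-temporal dynamics of sign language, yielding more accurate recognition from keypoint representations.
Link: https://arxiv.org/abs/2508.01791
Authors: Fatimah Mohamed Emad Elden
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: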
Abstract:The field of Continuous Sign Language Recognition (CSLR) poses substantial technical challenges, including fluid inter-sign transitions, the absence of temporal boundaries, and co-articulation effects. This paper, developed for the MSLR 2025 Workshop Challenge at ICCV 2025, addresses the critical challenge of signer-independent recognition to advance the generalization capabilities of CSLR systems across diverse signers. A data-centric methodology is proposed, centered on systematic feature engineering, a robust preprocessing pipeline, and an optimized model architecture. Key contributions include a principled feature selection process guided by Exploratory Data Analysis (EDA) to isolate communicative keypoints, a rigorous preprocessing pipeline incorporating DBSCAN-based outlier filtering and spatial normalization, and the novel CSLRConformer architecture. This architecture adapts the hybrid CNN-Transformer design of the Conformer model, leveraging its capacity to model local temporal dependencies and global sequence context; a characteristic uniquely suited for the spatio-temporal dynamics of sign language. The proposed methodology achieved a competitive performance, with a Word Error Rate (WER) of 5.60% on the development set and 12.01% on the test set, a result that secured a 3rd place ranking on the official competition platform. This research validates the efficacy of cross-domain architectural adaptation, demonstrating that the Conformer model, originally conceived for speech recognition, can be successfully repurposed to establish a new state-of-the-art performance in keypoint-based CSLR.
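Word Error Rate, the metric reported above, is conventionally the token-level edit distance between the predicted and reference sequences divided by the reference length. The sketch below is a minimal illustration under that standard definition; the function name and the example gloss sequences are hypothetical and are not taken from the paper or from the challenge's scoring code.

```python
def word_error_rate(reference, hypothesis):
    """Token-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical gloss sequences, for illustration only.
ref = ["I", "GO", "SCHOOL", "TODAY"]
hyp = ["I", "WENT", "SCHOOL"]
# One substitution plus one deletion over four reference tokens: 2 / 4.
print(word_error_rate(ref, hyp))
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.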
[NLP-83] A comprehensive taxonomy of hallucinations in Large Language Models
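Quick read: This paper addresses hallucination in Large Language Models (LLMs), where a model generates plausible but factually incorrect or fabricated content. Its core contribution is a systematic taxonomy that separates intrinsic hallucinations (contradicting the input context) from extrinsic ones (inconsistent with training data or reality), and factuality (absolute correctness) from faithfulness (adherence to the input). The key step is a theoretical framework positing that hallucination is inevitable in computable LLMs. On that basis the report analyzes data-related, model-related, and prompt-related causes, and surveys evaluation benchmarks, detection metrics, and architectural and systemic mitigation strategies, stressing that future work must focus on robust detection, mitigation, and continuous human oversight for responsible deployment in critical applications.
Link: https://arxiv.org/abs/2508.01781
Authors: Manuel Cossio
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 55 pages, 16 figures, 3 tables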
Abstract:Large language models (LLMs) have revolutionized natural language processing, yet their propensity for hallucination, generating plausible but factually incorrect or fabricated content, remains a critical challenge. This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a formal definition and a theoretical framework that posits its inherent inevitability in computable LLMs, irrespective of architecture or training. It explores core distinctions, differentiating between intrinsic (contradicting input context) and extrinsic (inconsistent with training data or reality), as well as factuality (absolute correctness) and faithfulness (adherence to input). The report then details specific manifestations, including factual errors, contextual and logical inconsistencies, temporal disorientation, ethical violations, and task-specific hallucinations across domains like code generation and multimodal applications. It analyzes the underlying causes, categorizing them into data-related issues, model-related factors, and prompt-related influences. Furthermore, the report examines cognitive and human factors influencing hallucination perception, surveys evaluation benchmarks and metrics for detection, and outlines architectural and systemic mitigation strategies. Finally, it introduces web-based resources for monitoring LLM releases and performance. This report underscores the complex, multifaceted nature of LLM hallucinations and emphasizes that, given their theoretical inevitability, future efforts must focus on robust detection, mitigation, and continuous human oversight for responsible and reliable deployment in critical applications.
[NLP-84] LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools? ICIP
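Quick read: This paper addresses the limitation that existing Model Context Protocol (MCP) benchmarks cover only single-server settings with a few tools, so they cannot evaluate Large Language Model (LLM) agents in large-scale, real-world environments. Its solution is LiveMCPBench, presented as the first comprehensive benchmark of its kind, with 95 real-world tasks grounded in the MCP ecosystem for evaluating LLM agents at scale across diverse servers. The accompanying scalable, reproducible evaluation pipeline consists of LiveMCPTool (a collection of 70 MCP servers and 527 tools), LiveMCPEval (an LLM-as-a-Judge framework for automated, adaptive evaluation), and the MCP Copilot Agent (a multi-step agent that routes tools for dynamic planning and executes them for API interaction). Together these give a unified, scalable, and reproducible way to evaluate agents in complex, tool-rich, and dynamic MCP environments.
Link: https://arxiv.org/abs/2508.01780
Authors: Guozhao Mo, Wenliang Zhong, Jiawei Chen, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Our code and data will be publicly available at this https URL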
Abstract:With the rapid development of Model Context Protocol (MCP), the number of MCP servers has surpassed 10,000. However, existing MCP benchmarks are limited to single-server settings with only a few tools, hindering effective evaluation of agent capabilities in large-scale, real-world scenarios. To address this limitation, we present LiveMCPBench, the first comprehensive benchmark comprising 95 real-world tasks grounded in the MCP ecosystem, designed to evaluate LLM agents at scale across diverse servers. To support a scalable and reproducible evaluation pipeline in large-scale MCP environments, we curate LiveMCPTool, a diverse and readily deployable collection of 70 MCP servers and 527 tools. Furthermore, we introduce LiveMCPEval, an LLM-as-a-Judge framework that enables automated and adaptive evaluation in dynamic, time-varying task environments, achieving 81% agreement with human reviewers. Finally, we propose the MCP Copilot Agent, a multi-step agent that routes tools for dynamic planning and executes tools for API interaction across the entire LiveMCPTool suite. Our evaluation covers 10 leading models, with the best-performing model (Claude-Sonnet-4) reaching a 78.95% success rate. However, we observe large performance variance across models, and several widely-used models perform poorly in LiveMCPBench’s complex, tool-rich environments. Overall, LiveMCPBench offers the first unified framework for benchmarking LLM agents in realistic, tool-rich, and dynamic MCP environments, laying a solid foundation for scalable and reproducible research on agent capabilities. Our code and data will be publicly available at this https URL.
[NLP-85] Uncertainty-Based Methods for Automated Process Reward Data Construction and Output Aggregation in Mathematical Reasoning
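Quick read: This paper addresses the errors that large language models make across multi-step solutions in complex mathematical reasoning. The core challenge is building high-quality process reward data efficiently enough to train effective Process-level Reward Models (PRMs). The key to the solution is an uncertainty-driven framework for automated process reward data construction that covers both data generation and annotation, together with two uncertainty-aware output aggregation methods, Hybrid Majority Reward Vote and Weighted Reward Frequency Vote, which combine the strengths of majority vote and PRMs. The paper reports improved mathematical reasoning on ProcessBench, MATH, and GSMPlus.
Link: https://arxiv.org/abs/2508.01773
Authors: Jiuzhou Han, Wray Buntine, Ehsan Shareghi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: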
Abstract:Large language models have demonstrated remarkable capabilities in complex mathematical reasoning tasks, but they inevitably generate errors throughout multi-step solutions. Process-level Reward Models (PRMs) have shown great promise by providing supervision and evaluation at each intermediate step, thereby effectively improving the models’ reasoning abilities. However, training effective PRMs requires high-quality process reward data, yet existing methods for constructing such data are often labour-intensive or inefficient. In this paper, we propose an uncertainty-driven framework for automated process reward data construction, encompassing both data generation and annotation processes for PRMs. Additionally, we identify the limitations of both majority vote and PRMs, and introduce two generic uncertainty-aware output aggregation methods: Hybrid Majority Reward Vote and Weighted Reward Frequency Vote, which combine the strengths of majority vote with PRMs. Extensive experiments on ProcessBench, MATH, and GSMPlus show the effectiveness and efficiency of the proposed PRM data construction framework, and demonstrate that the two output aggregation methods further improve the mathematical reasoning abilities across diverse PRMs. The code and data will be publicly available at this https URL.
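To make the contrast between majority vote and PRM-based selection concrete, here is a minimal sketch of three generic aggregation rules over sampled answers. It is not the paper's Hybrid Majority Reward Vote or Weighted Reward Frequency Vote, whose exact definitions the abstract does not give; `reward_weighted_vote` is a simple assumed combination, and the answers and reward values are made up for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Pick the most frequent final answer, ignoring rewards."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers, rewards):
    """Pick the single answer whose solution has the highest reward."""
    return max(zip(answers, rewards), key=lambda pair: pair[1])[0]


def reward_weighted_vote(answers, rewards):
    """Assumed hybrid rule: sum the rewards of all solutions per answer."""
    totals = defaultdict(float)
    for answer, reward in zip(answers, rewards):
        totals[answer] += reward
    return max(totals, key=totals.get)


# Hypothetical final answers from six sampled solutions, with made-up
# solution-level rewards from a PRM.
answers = ["12", "12", "15", "15", "15", "9"]
rewards = [0.90, 0.80, 0.30, 0.20, 0.30, 0.95]

print(majority_vote(answers))                  # frequent but low-reward answer
print(best_of_n(answers, rewards))             # single high-reward outlier
print(reward_weighted_vote(answers, rewards))  # balances frequency and reward
```

On this toy input the three rules disagree, which is the situation a combined rule is meant to handle: pure voting trusts frequency, best-of-N trusts one score, and the weighted vote uses both.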
[NLP-86] AI-Generated Text is Non-Stationary: Detection via Temporal Tomography
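Quick read: This paper addresses the drop in performance of current AI-generated text detectors under localized adversarial perturbations. The root cause is that existing methods aggregate token-level statistics into a scalar score, discarding positional information about where anomalies occur. The key to the solution is Temporal Discrepancy Tomography (TDT), a signal-processing formulation that treats token-level discrepancies as a time-series signal and applies the Continuous Wavelet Transform to produce a two-dimensional time-scale representation capturing both the location and the linguistic scale of statistical anomalies. The method improves robustness, with a 14.1% AUROC gain on HART Level 2 paraphrasing attacks at only 13% computational overhead.
Link: https://arxiv.org/abs/2508.01754
Authors: Alva West, Yixuan Weng, Minjun Zhu, Luodan Zhang, Zhen Lin, Guangsheng Bao, Yue Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: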
Abstract:The field of AI-generated text detection has evolved from supervised classification to zero-shot statistical analysis. However, current approaches share a fundamental limitation: they aggregate token-level measurements into scalar scores, discarding positional information about where anomalies occur. Our empirical analysis reveals that AI-generated text exhibits significant non-stationarity: statistical properties vary by 73.8% more between text segments than in human writing. This discovery explains why existing detectors fail against localized adversarial perturbations that exploit this overlooked characteristic. We introduce Temporal Discrepancy Tomography (TDT), a novel detection paradigm that preserves positional information by reformulating detection as a signal processing task. TDT treats token-level discrepancies as a time-series signal and applies Continuous Wavelet Transform to generate a two-dimensional time-scale representation, capturing both the location and linguistic scale of statistical anomalies. On the RAID benchmark, TDT achieves 0.855 AUROC (7.1% improvement over the best baseline). More importantly, TDT demonstrates robust performance on adversarial tasks, with 14.1% AUROC improvement on HART Level 2 paraphrasing attacks. Despite its sophisticated analysis, TDT maintains practical efficiency with only 13% computational overhead. Our work establishes non-stationarity as a fundamental characteristic of AI-generated text and demonstrates that preserving temporal dynamics is essential for robust detection.
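The difference between a scalar score and a time-scale representation can be shown with a small continuous wavelet transform. The following is a minimal NumPy sketch, not the authors' implementation: the Ricker (Mexican hat) wavelet, the chosen widths, and the synthetic `discrepancy` signal are all assumptions made for illustration, since the abstract does not specify the wavelet or the discrepancy measure.

```python
import numpy as np


def ricker(points, width):
    """Ricker (Mexican hat) wavelet sampled at `points` positions."""
    t = np.arange(points) - (points - 1) / 2.0
    norm = 2.0 / (np.sqrt(3.0 * width) * np.pi ** 0.25)
    return norm * (1.0 - (t / width) ** 2) * np.exp(-(t ** 2) / (2.0 * width ** 2))


def cwt(signal, widths):
    """Return a (len(widths), len(signal)) time-scale map."""
    rows = []
    for width in widths:
        # Kernel support grows with the scale but never exceeds the signal.
        points = min(10 * int(width), len(signal))
        rows.append(np.convolve(signal, ricker(points, width), mode="same"))
    return np.stack(rows)


# Synthetic token-level discrepancy: flat except for a short local burst,
# standing in for a localized edit in an otherwise uniform text.
discrepancy = np.zeros(64)
discrepancy[40:44] = 1.0

scalar_score = discrepancy.mean()          # position of the burst is lost
scalogram = cwt(discrepancy, widths=[2, 4, 8])

print(scalar_score)
print(scalogram.shape)                     # one row per scale, one column per token
print(int(np.argmax(scalogram[0])))        # finest scale peaks inside the burst
```

The mean collapses the burst into one number, while the finest-scale row of the map still indicates where the burst sits; a downstream classifier could consume the full two-dimensional map.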
[NLP-87] Enhancing the Preference Extractor in Multi-turn Dialogues: From Annotating Disasters to Accurate Preference Extraction
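Quick read: This paper addresses user preference identification in multi-turn dialogues, which is hampered by scarce high-quality labeled data, an error-prone annotation process (termed "Annotating Disaster"), and error propagation from sequential dependencies during model training. The key to the solution is IterChat, a dialogue data generation framework. It first restructures dialogue data into attributed historical preferences plus one-turn dialogues, which lowers annotation difficulty and raises efficiency. It then uses GPT-4 to pre-define preference slots and randomly samples slots and their values to generate diverse, high-quality dialogue data. Fine-tuning or few-shot prompting on this format outperforms using the original multi-turn dialogues.
Link: https://arxiv.org/abs/2508.01739
Authors: Cheng Wang, ziru Liu, Pengcheng Tang, Mingyu Zhang, Quanyu Dai, Yue Zhu
Affiliations: Huawei Technologies Co., Ltd.; Huawei Noah’s Ark Lab
Categories: Computation and Language (cs.CL)
Comments: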
Abstract:Identifying user preferences in dialogue systems is a pivotal aspect of providing satisfying services. Current research shows that using large language models (LLMs) to fine-tune a task-specific preference extractor yields excellent results in terms of accuracy and generalization. However, the primary challenge stems from the inherent difficulty in obtaining high-quality labeled multi-turn dialogue data. Accurately tracking user preference transitions across turns not only demands intensive domain expertise and contextual consistency maintenance for annotators (termed "Annotating Disaster") but also complicates model training due to error propagation in sequential dependency learning. Inspired by the observation that multi-turn preference extraction can be decomposed into iterative executions of one-turn extraction processes, we propose a novel dialogue data generation framework named IterChat. First, we construct a new data format that categorizes the dialogue data into attributed historical preferences and one-turn dialogues. This reduces the probability of annotation errors and improves annotation efficiency. Then, to generate a high-quality and diverse dialogue dataset, we adopt GPT4 to pre-define the preference slots in the target preference extractor task and then randomly sample the subset of the slots and their corresponding schema values to create the dialogue datasets. Experimental results indicate that fine-tuning or only few-shot prompting with the new dialogue format yields superior performance compared to the original multi-turn dialogues. Additionally, the new data format improves annotator efficiency, with a win rate 28.4% higher than that of the original multi-turn dialogues.
[NLP-88] CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications
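Quick read: This paper addresses a marked weakness in content safety for multilingual Large Language Models (LLMs): the lack of high-quality, culturally adapted safety guards outside English. Existing work concentrates on English content safety, while other languages lag because culturally aligned labeled data is costly to collect. The key to the solution is CultureGuard, a four-stage synthetic data generation and filtering pipeline (cultural data segregation, cultural data adaptation, machine translation, and quality filtering) that expands the English Nemotron-Content-Safety-Dataset-V2 into eight further languages: Arabic, German, Spanish, French, Hindi, Japanese, Thai, and Chinese. The resulting Nemotron-Content-Safety-Dataset-Multilingual-v1 contains 386,661 samples across 9 languages and is used to train the guard model Llama-3.1-Nemotron-Safety-Guard-Multilingual-8B-v1 via LoRA-based fine-tuning, which reaches state-of-the-art results on several multilingual content safety benchmarks.
Link: https://arxiv.org/abs/2508.01710
Authors: Raviraj Joshi, Rakesh Paul, Kanishk Singla, Anusha Kamath, Michael Evans, Katherine Luna, Shaona Ghosh, Utkarsh Vaidya, Eileen Long, Sanjay Singh Chauhan, Niranjan Wartikar
Affiliations: NVIDIA
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: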
Abstract:The increasing use of Large Language Models (LLMs) in agentic applications highlights the need for robust safety guard models. While content safety in English is well-studied, non-English languages lack similar advancements due to the high cost of collecting culturally aligned labeled datasets. We present CultureGuard, a novel solution for curating culturally aligned, high-quality safety datasets across multiple languages. Our approach introduces a four-stage synthetic data generation and filtering pipeline: cultural data segregation, cultural data adaptation, machine translation, and quality filtering. This pipeline enables the conversion and expansion of the Nemotron-Content-Safety-Dataset-V2 English safety dataset into eight distinct languages: Arabic, German, Spanish, French, Hindi, Japanese, Thai, and Chinese. The resulting dataset, Nemotron-Content-Safety-Dataset-Multilingual-v1, comprises 386,661 samples in 9 languages and facilitates the training of Llama-3.1-Nemotron-Safety-Guard-Multilingual-8B-v1 via LoRA-based fine-tuning. The final model achieves state-of-the-art performance on several multilingual content safety benchmarks. We also benchmark the latest open LLMs on multilingual safety and observe that these LLMs are more prone to give unsafe responses when prompted in non-English languages. This work represents a significant step toward closing the safety gap in multilingual LLMs by enabling the development of culturally aware safety guard models.
[NLP-89] Am I Blue or Is My Hobby Counting Teardrops? Expression Leakage in Large Language Models as a Symptom of Irrelevancy Disruption
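Quick read: This paper addresses "expression leakage" in Large Language Models (LLMs): the systematic generation of sentimentally charged expressions that are semantically unrelated to the input context, which can introduce bias or irrelevant emotional content. The key to the solution is a benchmark dataset plus an automatic evaluation pipeline that correlates well with human judgment, so leakage can be measured without annotating the outputs of each model separately. The study finds that expression leakage decreases as parameter count grows within the same LLM family, but that mitigating it requires specific care during model building and cannot be achieved by prompting alone.
Link: https://arxiv.org/abs/2508.01708
Authors: Berkay Köprü, Mehrzad Mashal, Yigit Gurses, Akos Kadar, Maximilian Schmitt, Ditty Mathew, Felix Burkhardt, Florian Eyben, Björn W. Schuller
Affiliations: audEERING GmbH; Agile Robots SE; Technical University of Munich; Imperial College London
Categories: Computation and Language (cs.CL)
Comments: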
Abstract:Large language models (LLMs) have advanced natural language processing (NLP) skills such as through next-token prediction and self-attention, but their ability to integrate broad context also makes them prone to incorporating irrelevant information. Prior work has focused on semantic leakage, bias introduced by semantically irrelevant context. In this paper, we introduce expression leakage, a novel phenomenon where LLMs systematically generate sentimentally charged expressions that are semantically unrelated to the input context. To analyse the expression leakage, we collect a benchmark dataset along with a scheme to automatically generate a dataset from free-form text from common-crawl. In addition, we propose an automatic evaluation pipeline that correlates well with human judgment, which accelerates the benchmarking by decoupling from the need of annotation for each analysed model. Our experiments show that, as the model scales in the parameter space, the expression leakage reduces within the same LLM family. On the other hand, we demonstrate that expression leakage mitigation requires specific care during the model building process, and cannot be mitigated by prompting. In addition, our experiments indicate that, when negative sentiment is injected in the prompt, it disrupts the generation process more than the positive sentiment, causing a higher expression leakage rate.
[NLP-90] Collaborative Chain-of-Agents for Parametric-Retrieved Knowledge Synergy
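Quick read: This paper addresses the limited synergy between a model's internal parametric knowledge and externally retrieved knowledge in current Retrieval-Augmented Generation (RAG) methods, where weak interaction between the two leads to misled generation or under-used information. The key to the solution is the Collaborative Chain-of-Agents framework. It first introduces CoCoA-zero, a multi-agent RAG framework that performs conditional knowledge induction and then reasons to an answer, making the collaboration between retrieved content and model knowledge explicit. Building on this, CoCoA is a long-chain training strategy that synthesizes extended multi-agent reasoning trajectories from CoCoA-zero and uses them to fine-tune Large Language Models (LLMs), improving their ability to jointly exploit parametric and retrieved knowledge.
Link: https://arxiv.org/abs/2508.01696
Authors: Yi Jiang, Sendong Zhao, Jianbo Li, Haochun Wang, Lizhe Zhang, Yan Liu, Bin Qin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: code available at this https URL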
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a promising framework for enhancing the capabilities of Large Language Models (LLMs), especially in knowledge-intensive tasks. Despite its advantages, current RAG methods often struggle to fully exploit knowledge during generation. In particular, the synergy between the model’s internal parametric knowledge and external retrieved knowledge remains limited. Retrieved contents may sometimes mislead generation, while certain generated content can guide the model toward more accurate outputs. In this work, we propose Collaborative Chain-of-Agents, a framework designed to explicitly enhance the synergy between parametric and retrieved knowledge. Specifically, we first introduce CoCoA-zero, a multi-agent RAG framework that first performs conditional knowledge induction and then reasons answers. Building on this, we develop CoCoA, a long-chain training strategy that synthesizes extended multi-agent reasoning trajectories from CoCoA-zero to fine-tune the LLM. This strategy enhances the model’s capability to explicitly integrate and jointly leverage parametric and retrieved knowledge. Experimental results show that CoCoA-zero and CoCoA achieve superior performance on open-domain and multi-hop QA tasks.
[NLP-91] Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe
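Quick read: This paper addresses the under-modeling of dialects and regional language varieties in speech foundation models, focusing on the accuracy and robustness of dialect classification across languages. The key to the solution is Voxlect, a benchmark covering dialects and regional varieties of English, Arabic, Mandarin and Cantonese, Tibetan, Indic languages, Thai, Spanish, French, German, Brazilian Portuguese, and Italian. Built on more than 2 million training utterances annotated with dialect information, it systematically evaluates several widely used speech foundation models, tests robustness under noisy conditions, and includes an error analysis showing results aligned with geographic continuity. Voxlect is also applied to augment speech recognition (ASR) datasets with dialect labels and to evaluate speech generation systems.
Link: https://arxiv.org/abs/2508.01691
Authors: Tiantian Feng, Kevin Huang, Anfeng Xu, Xuan Shi, Thanathai Lertpetchpun, Jihwan Lee, Yoonjeong Lee, Dani Byrd, Shrikanth Narayanan
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: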
Abstract:We present Voxlect, a novel benchmark for modeling dialects and regional languages worldwide using speech foundation models. Specifically, we report comprehensive benchmark evaluations on dialects and regional language varieties in English, Arabic, Mandarin and Cantonese, Tibetan, Indic languages, Thai, Spanish, French, German, Brazilian Portuguese, and Italian. Our study used over 2 million training utterances from 30 publicly available speech corpora that are provided with dialectal information. We evaluate the performance of several widely used speech foundation models in classifying speech dialects. We assess the robustness of the dialectal models under noisy conditions and present an error analysis that highlights modeling results aligned with geographic continuity. In addition to benchmarking dialect classification, we demonstrate several downstream applications enabled by Voxlect. Specifically, we show that Voxlect can be applied to augment existing speech recognition datasets with dialect information, enabling a more detailed analysis of ASR performance across dialectal variations. Voxlect is also used as a tool to evaluate the performance of speech generation systems. Voxlect is publicly available with the license of the RAIL family at: this https URL.
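[NLP-92] The Bidirectional Process Reward Model
Quick read: This paper addresses a limitation of existing Process Reward Models (PRMs) for evaluating the reasoning trajectories of Large Language Models (LLMs): a unidirectional left-to-right (L2R) evaluation paradigm makes poor use of global context, so it is hard to check earlier reasoning steps for consistency against later ones. The key to the solution is a bidirectional evaluation paradigm, the Bidirectional Process Reward Model (BiPRM), which adds a parallel right-to-left (R2L) evaluation stream. The R2L stream is implemented purely through prompt modifications that reverse the reasoning trajectory, with no additional parameters or inference latency, and lets later steps help assess earlier ones, improving the accuracy and consistency of step-level scores.
Link: https://arxiv.org/abs/2508.01682
Authors: Lingyin Zhang, Jun Gao, Xiaoxue Ren, Ziqiang Cao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: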
Abstract:Process Reward Models (PRMs) have emerged as a promising approach to enhance the reasoning quality of Large Language Models (LLMs) by assigning fine-grained scores to intermediate reasoning steps within a solution trajectory. However, existing PRMs predominantly adopt a unidirectional left-to-right (L2R) evaluation paradigm, which limits their ability to leverage global context, making it challenging to verify the consistency of earlier steps based on later ones. In light of these challenges, we propose a novel bidirectional evaluation paradigm, named Bidirectional Process Reward Model (BiPRM). BiPRM seamlessly incorporates a parallel right-to-left (R2L) evaluation stream alongside the conventional L2R flow, enabling later reasoning steps to help assess earlier ones in real time. Notably, the built-in R2L evaluation is implemented solely through prompt modifications that reverse the original reasoning trajectory, without any additional parameters or inference latency introduced. This ensures BiPRM remains both efficient and broadly compatible with existing PRM studies. We conduct extensive experiments on two mathematical reasoning benchmarks using samples generated by three different policy models. Our method, BiPRM, is evaluated across three backbones and three distinct PRM objectives. Across all settings, BiPRM consistently outperforms unidirectional baselines, achieving up to a 31.9% improvement in stepwise reward evaluation. Generally, our results highlight BiPRM’s effectiveness, robustness, and general applicability, offering a promising new direction for process-based reward modeling.
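The bidirectional idea can be sketched with a toy scorer. This is an illustration under stated assumptions, not BiPRM itself: `toy_causal_scorer` is a hypothetical stand-in for a PRM that can only see the steps preceding each position in the order given, and averaging the two streams is an assumed fusion rule, since the abstract does not say how the L2R and R2L scores are combined.

```python
def toy_causal_scorer(steps):
    """Hypothetical PRM stand-in: each step's score depends only on the
    steps seen so far. Scores drop once a contradiction is visible."""
    scores = []
    contradiction_seen = False
    for step in steps:
        contradiction_seen = contradiction_seen or "contradiction" in step
        scores.append(0.2 if contradiction_seen else 0.9)
    return scores


def bidirectional_scores(steps, scorer):
    """Score the trajectory left-to-right and on its reversal, then
    average per step (assumed fusion rule)."""
    l2r = scorer(steps)
    # Reverse the trajectory, score it, and map scores back to the
    # original step order.
    r2l = scorer(steps[::-1])[::-1]
    return [(a + b) / 2.0 for a, b in zip(l2r, r2l)]


steps = [
    "Let x = 3.",
    "Therefore 2x = 5.",
    "Check: 2 * 3 = 6, a contradiction with the previous step.",
]

print(toy_causal_scorer(steps))                        # early steps look fine
print(bidirectional_scores(steps, toy_causal_scorer))  # later evidence lowers them
```

With the left-to-right pass alone, the first two steps receive high scores because the contradiction has not yet appeared; the reversed pass exposes it to them. The toy scorer is deliberately crude and penalizes every step once a contradiction is visible.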
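[NLP-93] CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions
Quick read: This paper addresses the weakness of Large Language Models (LLMs) at modeling dynamic, context-dependent preferences in personalized interaction. Prior approaches assume static user preferences, but in practice preferences shift with context, are revealed through multi-turn interaction, and must be inferred and carried over by the model. The key to the solution is CUPID, a benchmark of 756 human-curated interaction session histories between users and LLM-based chat assistants, where each session contains a request in a specific context and multi-turn feedback expressing the user's preference. The benchmark tests whether an LLM can infer the preference relevant to a new request from prior sessions and generate a response that satisfies it. It exposes a clear gap in current LLMs (under 50% precision and 65% recall) and offers a quantifiable, reproducible evaluation resource.
Link: https://arxiv.org/abs/2508.01674
Authors: Tae Soo Kim, Yoonjoo Lee, Yoonah Park, Jiho Kim, Young-Ho Kim, Juho Kim
Affiliations: KAIST; Seoul National University; Calvin University; NAVER AI LAB
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted to COLM 2025. Project Website: this https URL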
Abstract:Personalization of Large Language Models (LLMs) often assumes users hold static preferences that reflect globally in all tasks. In reality, humans hold dynamic preferences that change depending on the context. As users interact with an LLM in various contexts, they naturally reveal their contextual preferences, which a model must infer and apply in future contexts to ensure alignment. To assess this, we introduce CUPID, a benchmark of 756 human-curated interaction session histories between users and LLM-based chat assistants. In each interaction session, the user provides a request in a specific context and expresses their preference through multi-turn feedback. Given a new user request and prior interaction sessions, our benchmark assesses whether LLMs can infer the preference relevant to this request and generate a response that satisfies this preference. With CUPID, we evaluated 10 open and proprietary LLMs, revealing that state-of-the-art LLMs struggle to infer preferences from multi-turn interactions and fail to discern what previous context is relevant to a new request – under 50% precision and 65% recall. Our work highlights the need to advance LLM capabilities for more contextually personalized interactions and proposes CUPID as a resource to drive these improvements.
[NLP-94] Authorship Attribution in Multilingual Machine-Generated Texts
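Quick read: This paper addresses authorship attribution (AA) between machine-generated and human-written text in a multilingual setting, i.e., identifying the source of a text across many languages and multiple large language models (LLMs). The key is a systematic evaluation of how well existing monolingual AA methods suit and transfer to cross-lingual settings, and of how language families, writing scripts, and generators affect attribution performance. The results expose the limits of current methods in cross-lingual generalization and point to the need for more robust multilingual AA approaches that match real-world use.
Link: https://arxiv.org/abs/2508.01656
Authors: Lucio La Cava, Dominik Macko, Róbert Móro, Ivan Srba, Andrea Tagarelli
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Physics and Society (physics.soc-ph)
Comments: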
Abstract:As Large Language Models (LLMs) have reached human-like fluency and coherence, distinguishing machine-generated text (MGT) from human-written content becomes increasingly difficult. While early efforts in MGT detection have focused on binary classification, the growing landscape and diversity of LLMs require a more fine-grained yet challenging authorship attribution (AA), i.e., being able to identify the precise generator (LLM or human) behind a text. However, AA remains nowadays confined to a monolingual setting, with English being the most investigated one, overlooking the multilingual nature and usage of modern LLMs. In this work, we introduce the problem of Multilingual Authorship Attribution, which involves attributing texts to human or multiple LLM generators across diverse languages. Focusing on 18 languages – covering multiple families and writing scripts – and 8 generators (7 LLMs and the human-authored class), we investigate the multilingual suitability of monolingual AA methods, their cross-lingual transferability, and the impact of generators on attribution performance. Our results reveal that while certain monolingual AA methods can be adapted to multilingual settings, significant limitations and challenges remain, particularly in transferring across diverse language families, underscoring the complexity of multilingual AA and the need for more robust approaches to better match real-world scenarios.
[NLP-95] DUP: Detection-guided Unlearning for Backdoor Purification in Language Models
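Quick read: This paper addresses two weaknesses of current backdoor attack defenses: detection methods rely on coarse-grained feature statistics and miss subtle anomalies, and purification methods usually need full retraining or an additional clean model, which is inefficient and impractical. The core of the solution is DUP (Detection-guided Unlearning for Purification), a unified framework. Its detector jointly uses class-agnostic distances and inter-layer transitions for fine-grained, feature-level anomaly detection, and the detection results then drive a parameter-efficient unlearning mechanism. That mechanism repurposes knowledge distillation to push the student model's outputs away from the teacher's on detected poisoned samples, removing the backdoor behavior without full retraining or an external clean model.
Link: https://arxiv.org/abs/2508.01647
Authors: Man Hu, Yahui Ding, Yatao Yang, Liangyu Chen, Yanhao Jia, Shuai Zhao
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: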
Abstract:As backdoor attacks become more stealthy and robust, they reveal critical weaknesses in current defense strategies: detection methods often rely on coarse-grained feature statistics, and purification methods typically require full retraining or additional clean models. To address these challenges, we propose DUP (Detection-guided Unlearning for Purification), a unified framework that integrates backdoor detection with unlearning-based purification. The detector captures feature-level anomalies by jointly leveraging class-agnostic distances and inter-layer transitions. These deviations are integrated through a weighted scheme to identify poisoned inputs, enabling more fine-grained analysis. Based on the detection results, we purify the model through a parameter-efficient unlearning mechanism that avoids full retraining and does not require any external clean model. Specifically, we innovatively repurpose knowledge distillation to guide the student model toward increasing its output divergence from the teacher on detected poisoned samples, effectively forcing it to unlearn the backdoor behavior. Extensive experiments across diverse attack methods and language model architectures demonstrate that DUP achieves superior defense performance in detection accuracy and purification efficacy. Our code is available at this https URL.
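The "reversed distillation" idea can be written as a loss that rewards divergence from the teacher on flagged samples. The NumPy sketch below is an assumption-laden illustration, not the authors' code: it uses KL divergence between softmax outputs as the divergence measure and negates it on samples marked as poisoned; the function names, the choice of KL, and the toy logits are all hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) per row."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)


def unlearning_loss(teacher_logits, student_logits, poisoned_mask):
    """Negative mean KL(teacher || student) over samples flagged as
    poisoned; minimizing it pushes the student away from the teacher."""
    mask = np.asarray(poisoned_mask, dtype=bool)
    if not mask.any():
        return 0.0
    p = softmax(np.asarray(teacher_logits, dtype=float)[mask])
    q = softmax(np.asarray(student_logits, dtype=float)[mask])
    return float(-kl_divergence(p, q).mean())


# Toy logits: the teacher (backdoored model) predicts class 0 on the
# first sample, which the detector has flagged as poisoned.
teacher = np.array([[2.0, 0.0], [0.0, 2.0]])
student_same = teacher.copy()
student_moved = np.array([[0.0, 2.0], [0.0, 2.0]])
flags = [True, False]

print(unlearning_loss(teacher, student_same, flags))   # zero: no divergence yet
print(unlearning_loss(teacher, student_moved, flags))  # negative: student diverged
```

A student that still mimics the teacher on the flagged sample gets a loss of zero, while one that has moved away gets a lower (negative) loss, so gradient descent on this objective favors the latter. In practice such a term would need to be bounded or balanced against a utility objective on clean data, which this sketch omits.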
[NLP-96] ChEmbed: Enhancing Chemical Literature Search Through Domain-Specific Text Embeddings
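Quick read: This paper addresses poor retrieval quality in chemistry Retrieval-Augmented Generation (RAG) systems, caused by general-purpose text embedding models that fail to represent complex chemical terminology. The core solution is ChEmbed, a family of embedding models fine-tuned for chemical literature retrieval. Training data is drawn from the PubChem, Semantic Scholar, and ChemRxiv corpora, with large language models synthesizing about 1.7 million query-passage pairs. The tokenizer gains 900 chemistry-specific tokens to reduce fragmentation of chemical entities such as IUPAC names. ChEmbed also supports an 8192-token context length, longer than the 512 or 2048 tokens typical of many open-source embedding models, which helps retrieval of longer passages. On the newly introduced ChemRxiv Retrieval benchmark, ChEmbed raises nDCG@10 from 0.82 to 0.91, 9 percentage points above the best general-purpose embedding model.
Link: https://arxiv.org/abs/2508.01643
Authors: Ali Shiraee Kasmaee, Mohammad Khodadad, Mehdi Astaraki, Mohammad Arshi Saloot, Nicholas Sherck, Hamidreza Mahyar, Soheila Samiee
Affiliations: McMaster University; BASF Canada Inc; BASF Corporation
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: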
Abstract:Retrieval-Augmented Generation (RAG) systems in chemistry heavily depend on accurate and relevant retrieval of chemical literature. However, general-purpose text embedding models frequently fail to adequately represent complex chemical terminologies, resulting in suboptimal retrieval quality. Specialized embedding models tailored to chemical literature retrieval have not yet been developed, leaving a substantial performance gap. To address this challenge, we introduce ChEmbed, a domain-adapted family of text embedding models fine-tuned on a dataset comprising chemistry-specific text from the PubChem, Semantic Scholar, and ChemRxiv corpora. To create effective training data, we employ large language models to synthetically generate queries, resulting in approximately 1.7 million high-quality query-passage pairs. Additionally, we augment the tokenizer by adding 900 chemically specialized tokens to previously unused slots, which significantly reduces the fragmentation of chemical entities, such as IUPAC names. ChEmbed also maintains an 8192-token context length, enabling the efficient retrieval of longer passages compared to many other open-source embedding models, which typically have a context length of 512 or 2048 tokens. Evaluated on our newly introduced ChemRxiv Retrieval benchmark, ChEmbed outperforms state-of-the-art general embedding models, raising nDCG@10 from 0.82 to 0.91 (+9 pp). ChEmbed represents a practical, lightweight, and reproducible embedding solution that effectively improves retrieval for chemical literature search.
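For readers unfamiliar with the metric, nDCG@10 discounts each retrieved passage's relevance by the logarithm of its rank and normalizes by the best achievable ordering. The sketch below uses the common linear-gain form and, as a simplifying assumption, computes the ideal ranking from the retrieved list only (that is, it assumes every relevant passage appears somewhere in the list); the relevance labels are invented for illustration and the benchmark's own scoring may differ.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevances."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical binary relevance labels in ranked order for one query.
perfect_ranking = [1, 0, 0]
relevant_at_rank_two = [0, 1, 0]

print(ndcg_at_k(perfect_ranking))       # relevant passage ranked first
print(ndcg_at_k(relevant_at_rank_two))  # same passage pushed to rank two
```

Moving the single relevant passage from rank one to rank two already cuts the score from 1.0 to about 0.63, which gives a sense of scale for the reported move from 0.82 to 0.91 when averaged over many queries.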
[NLP-97] OpenMed NER: Open-Source Domain-Adapted State-of-the-Art Transformers for Biomedical NER Across 12 Public Datasets
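[Quick Read]: This paper addresses how to achieve state-of-the-art biomedical named-entity recognition (NER) across diverse entity types (diseases, chemicals, genes, and species) while keeping computation efficient. The solution is OpenMed NER, a suite of open-source, domain-adapted transformer models whose core idea is to combine lightweight domain-adaptive pre-training (DAPT) with parameter-efficient Low-Rank Adaptation (LoRA). It first runs low-cost DAPT on a 350k-passage corpus drawn from public and de-identified sources (PubMed, arXiv, and MIMIC-III) using DeBERTa-v3, PubMedBERT, and BioELECTRA backbones, then fine-tunes on task-specific data with LoRA, updating less than 1.5% of the parameters. The approach sets new state-of-the-art micro-F1 scores on 10 of 12 biomedical NER benchmarks, with the largest gains on gene and clinical cell line datasets (over 5.3 and 9.7 percentage points). Training completes in under 12 hours on a single GPU with a carbon footprint of 1.2 kg CO2e, showing that adapted open-source models can surpass closed-source solutions.
Link: https://arxiv.org/abs/2508.01630
Authors: Maziyar Panahi
Affiliations: CNRS (French National Centre for Scientific Research)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: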
Abstract:Named-entity recognition (NER) is fundamental to extracting structured information from the 80% of healthcare data that resides in unstructured clinical notes and biomedical literature. Despite recent advances with large language models, achieving state-of-the-art performance across diverse entity types while maintaining computational efficiency remains a significant challenge. We introduce OpenMed NER, a suite of open-source, domain-adapted transformer models that combine lightweight domain-adaptive pre-training (DAPT) with parameter-efficient Low-Rank Adaptation (LoRA). Our approach performs cost-effective DAPT on a 350k-passage corpus compiled from ethically sourced, publicly available research repositories and de-identified clinical notes (PubMed, arXiv, and MIMIC-III) using DeBERTa-v3, PubMedBERT, and BioELECTRA backbones. This is followed by task-specific fine-tuning with LoRA, which updates less than 1.5% of model parameters. We evaluate our models on 12 established biomedical NER benchmarks spanning chemicals, diseases, genes, and species. OpenMed NER achieves new state-of-the-art micro-F1 scores on 10 of these 12 datasets, with substantial gains across diverse entity types. Our models advance the state-of-the-art on foundational disease and chemical benchmarks (e.g., BC5CDR-Disease, +2.70 pp), while delivering even larger improvements of over 5.3 and 9.7 percentage points on more specialized gene and clinical cell line corpora. This work demonstrates that strategically adapted open-source models can surpass closed-source solutions. This performance is achieved with remarkable efficiency: training completes in under 12 hours on a single GPU with a low carbon footprint (1.2 kg CO2e), producing permissively licensed, open-source checkpoints designed to help practitioners facilitate compliance with emerging data protection and AI regulations, such as the EU AI Act.
[NLP-98] Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in Large Language Models
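[Quick Read]: This paper addresses a limitation of existing prompt-based adversarial attacks used to assess the robustness of Large Language Models (LLMs): they treat the prompt as a single block of text and ignore its structural heterogeneity. The authors show that in complex, domain-specific prompts, different components contribute unequally to adversarial robustness, so attacks that ignore this structure are less effective. The key to the solution is PromptAnatomy, a framework that automatically dissects prompts into functional components and, together with the ComPerturb method, selectively perturbs each component to generate diverse and interpretable adversarial examples. A perplexity (PPL)-based filter is added to preserve linguistic plausibility and reduce distribution shift. Experiments show that the approach achieves state-of-the-art attack success rates and confirm that prompt-structure awareness and controlled perturbation matter for reliable robustness evaluation.
Link: https://arxiv.org/abs/2508.01554
Authors: Yujia Zheng,Tianhao Li,Haotian Huang,Tianyu Zeng,Jingyu Lu,Chuangxin Chu,Yuekai Huang,Ziyou Jiang,Qian Xiong,Yuyao Ge,Mingyang Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: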
Abstract:Prompt-based adversarial attacks have become an effective means to assess the robustness of large language models (LLMs). However, existing approaches often treat prompts as monolithic text, overlooking their structural heterogeneity: different prompt components contribute unequally to adversarial robustness. Prior works like PromptRobust assume prompts are value-neutral, but our analysis reveals that complex, domain-specific prompts with rich structures have components with differing vulnerabilities. To address this gap, we introduce PromptAnatomy, an automated framework that dissects prompts into functional components and generates diverse, interpretable adversarial examples by selectively perturbing each component using our proposed method, ComPerturb. To ensure linguistic plausibility and mitigate distribution shifts, we further incorporate a perplexity (PPL)-based filtering mechanism. As a complementary resource, we annotate four public instruction-tuning datasets using the PromptAnatomy framework, verified through human review. Extensive experiments across these datasets and five advanced LLMs demonstrate that ComPerturb achieves state-of-the-art attack success rates. Ablation studies validate the complementary benefits of prompt dissection and PPL filtering. Our results underscore the importance of prompt structure awareness and controlled perturbation for reliable adversarial robustness evaluation in LLMs. Code and data are available at this https URL.
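The perplexity-based filtering mentioned in this abstract can be pictured with a small sketch. This is not the authors' code: it assumes each perturbed prompt already comes with per-token natural-log probabilities from some scoring model, and the names `perplexity`, `filter_by_ppl`, and the `max_ppl` threshold are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def filter_by_ppl(candidates, max_ppl):
    """Keep only candidate texts whose perplexity is at most max_ppl.

    `candidates` is a list of (text, token_logprobs) pairs; how the
    log-probabilities are obtained is outside this sketch.
    """
    return [text for text, logprobs in candidates if perplexity(logprobs) <= max_ppl]
```

A lower threshold keeps only perturbations that the scoring model finds fluent, at the cost of discarding more candidates.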
[NLP-99] MOPrompt: Multi-objective Semantic Evolution for Prompt Optimization
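[Quick Read]: This paper addresses prompt optimization for Large Language Models (LLMs) in practical use, in particular how to trade off task performance (such as accuracy) against context size (measured in tokens). Existing automatic prompt optimization methods usually target a single objective, typically accuracy, and ignore the balance between efficiency and effectiveness, which limits LLM deployment in resource-constrained settings. The key to the solution is MOPrompt, a prompt optimization framework based on Multi-objective Evolutionary Optimization (EMO) that optimizes accuracy and context size simultaneously and maps the Pareto front of prompt solutions, giving practitioners an interpretable set of trade-offs and supporting more flexible and efficient deployment.
Link: https://arxiv.org/abs/2508.01541
Authors: Sara Câmara,Eduardo Luz,Valéria Carvalho,Ivan Meneghini,Gladston Moreira
Affiliations: Universidade Federal de Ouro Preto; Federal Institute of Minas Gerais; Postgraduate Program in Computer Science, Federal University of Ouro Preto
Subjects: Computation and Language (cs.CL)
Comments: 8 pages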
Abstract:Prompt engineering is crucial for unlocking the potential of Large Language Models (LLMs). Still, since manual prompt design is often complex, non-intuitive, and time-consuming, automatic prompt optimization has emerged as a research area. However, a significant challenge in prompt optimization is managing the inherent trade-off between task performance, such as accuracy, and context size. Most existing automated methods focus on a single objective, typically performance, thereby failing to explore the critical spectrum of efficiency and effectiveness. This paper introduces the MOPrompt, a novel Multi-objective Evolutionary Optimization (EMO) framework designed to optimize prompts for both accuracy and context size (measured in tokens) simultaneously. Our framework maps the Pareto front of prompt solutions, presenting practitioners with a set of trade-offs between context size and performance, a crucial tool for deploying Large Language Models (LLMs) in real-world applications. We evaluate MOPrompt on a sentiment analysis task in Portuguese, using Gemma-2B and Sabiazinho-3 as evaluation models. Our findings show that MOPrompt substantially outperforms the baseline framework. For the Sabiazinho model, MOPrompt identifies a prompt that achieves the same peak accuracy (0.97) as the best baseline solution, but with a 31% reduction in token length.
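To make the idea of a Pareto front over accuracy and token count concrete, here is a minimal sketch. It only shows the non-dominated filtering step, not the evolutionary search described in the abstract; the `Candidate` type, the function names, and all numbers in the usage example are hypothetical.

```python
from typing import List, NamedTuple

class Candidate(NamedTuple):
    prompt: str
    accuracy: float  # higher is better
    tokens: int      # lower is better

def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and better on one."""
    no_worse = a.accuracy >= b.accuracy and a.tokens <= b.tokens
    strictly_better = a.accuracy > b.accuracy or a.tokens < b.tokens
    return no_worse and strictly_better

def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

# Made-up example values, not results from the paper.
pool = [
    Candidate("long prompt", 0.97, 100),
    Candidate("short prompt", 0.97, 69),
    Candidate("tiny prompt", 0.90, 40),
    Candidate("weak prompt", 0.85, 60),
]
front = pareto_front(pool)  # keeps "short prompt" and "tiny prompt"
```

The practitioner then picks a point on the front according to how much accuracy they are willing to give up for a shorter context.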
[NLP-100] A Theory of Adaptive Scaffolding for LLM-Based Pedagogical Agents
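[Quick Read]: This paper addresses the lack of a solid theoretical foundation in current educational uses of Large Language Models (LLMs), especially in STEM+C learning, where LLM-driven pedagogical agents often fail to integrate cognitive science and instructional design principles. The key to the solution is a framework that combines Evidence-Centered Design (ECD) with Social Cognitive Theory (SCT) to realize adaptive scaffolding. It is illustrated with Inquizzitor, an LLM-based formative assessment agent that integrates human-AI hybrid intelligence and provides feedback grounded in cognitive science principles, so that instructional interactions are both high in quality and consistent with learning theory.
Link: https://arxiv.org/abs/2508.01503
Authors: Clayton Cohn,Surya Rayala,Namrata Srivastava,Joyce Horn Fonteles,Shruti Jain,Xinying Luo,Divya Mereddy,Naveeduddin Mohammed,Gautam Biswas
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: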
Abstract:Large language models (LLMs) present new opportunities for creating pedagogical agents that engage in meaningful dialogue to support student learning. However, the current use of LLM systems like ChatGPT in classrooms often lacks the solid theoretical foundation found in earlier intelligent tutoring systems. To bridge this gap, we propose a framework that combines Evidence-Centered Design with Social Cognitive Theory for adaptive scaffolding in LLM-based agents focused on STEM+C learning. We illustrate this framework with Inquizzitor, an LLM-based formative assessment agent that integrates human-AI hybrid intelligence and provides feedback grounded in cognitive science principles. Our findings show that Inquizzitor delivers high-quality assessment and interaction aligned with core learning theories, offering teachers effective guidance that students value. This research underscores the potential for theory-driven LLM integration in education, highlighting the ability of these systems to provide adaptive and principled instruction.
[NLP-101] The Homogenizing Effect of Large Language Models on Human Expression and Thought
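[Quick Read]: This paper addresses the risk that widespread use of Large Language Models (LLMs) reduces cognitive diversity and thereby weakens collective intelligence and adaptability. The core issue is that LLMs, by imitating the dominant language patterns and reasoning styles in their training data, reinforce homogenization and further marginalize minority voices and non-mainstream ways of thinking. The key insight concerns how LLMs are designed and used: they not only mirror patterns in their training data but also amplify convergence because people increasingly rely on the same models. The summary concludes that diversity safeguards are needed across model development, training data selection, and deployment to preserve the richness and resilience of the cognitive ecosystem.
Link: https://arxiv.org/abs/2508.01491
Authors: Zhivar Sourati,Alireza S. Ziabari,Morteza Dehghani
Affiliations: University of Southern California
Subjects: Computation and Language (cs.CL)
Comments: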
Abstract:Cognitive diversity, reflected in variations of language, perspective, and reasoning, is essential to creativity and collective intelligence. This diversity is rich and grounded in culture, history, and individual experience. Yet as large language models (LLMs) become deeply embedded in people’s lives, they risk standardizing language and reasoning. This Review synthesizes evidence across linguistics, cognitive, and computer science to show how LLMs reflect and reinforce dominant styles while marginalizing alternative voices and reasoning strategies. We examine how their design and widespread use contribute to this effect by mirroring patterns in their training data and amplifying convergence as all people increasingly rely on the same models across contexts. Unchecked, this homogenization risks flattening the cognitive landscapes that drive collective intelligence and adaptability.
[NLP-102] TeSent: A Benchmark Dataset for Fairness-aware Explainable Sentiment Classification in Telugu
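[Quick Read]: This paper addresses the shortage of resources for Telugu, a language of southern India, in natural language processing (NLP) and machine learning. The lack of high-quality annotated datasets limits both model performance and explainability research on key tasks such as sentiment classification. The key to the solution is TeSent, a comprehensive benchmark of 26,150 manually annotated Telugu sentences with human-provided rationales. On top of it the authors fine-tune models in two ways, with labels only and with labels plus rationales, and provide a rationale-based plausibility and faithfulness evaluation suite for six widely used post-hoc explainers. They also curate TeEEC (Equity Evaluation Corpus in Telugu) to evaluate model fairness on sentiment-related tasks. The experiments suggest that training with rationales may improve accuracy, reduce bias, and bring explanations closer to human reasoning.
Link: https://arxiv.org/abs/2508.01486
Authors: Vallabhaneni Raj Kumar,Ashwin S,Supriya Manna,Niladri Sett,Cheedella V S N M S Hema Harshitha,Kurakula Harshitha,Anand Kumar Sharma,Basina Deepakraj,Tanuj Sarkar,Bondada Navaneeth Krishna,Samanthapudi Shakeer
Affiliations: SRM University
Subjects: Computation and Language (cs.CL)
Comments: work under review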
Abstract:In the Indian subcontinent, Telugu, one of India’s six classical languages, is the most widely spoken Dravidian Language. Despite its 96 million speaker base worldwide, Telugu remains underrepresented in the global NLP and Machine Learning landscape, mainly due to lack of high-quality annotated resources. This work introduces TeSent, a comprehensive benchmark dataset for sentiment classification, a key text classification problem, in Telugu. TeSent not only provides ground truth labels for the sentences, but also supplements with provisions for evaluating explainability and fairness, two critical requirements in modern-day machine learning tasks. We scraped Telugu texts covering multiple domains from various social media platforms, news websites and web-blogs to preprocess and generate 26,150 sentences, and developed a custom-built annotation platform and a carefully crafted annotation protocol for collecting the ground truth labels along with their human-annotated rationales. We then fine-tuned several SOTA pre-trained models in two ways: with rationales, and without rationales. Further, we provide a detailed plausibility and faithfulness evaluation suite, which exploits the rationales, for six widely used post-hoc explainers applied on the trained models. Lastly, we curate TeEEC, Equity Evaluation Corpus in Telugu, a corpus to evaluate fairness of Telugu sentiment and emotion related NLP tasks, and provide a fairness evaluation suite for the trained classifier models. Our experimental results suggest that training with rationales may improve model accuracy, reduce bias in models, and make the explainers’ output more aligned to human reasoning.
[NLP-103] Harnessing Collective Intelligence of LLMs for Robust Biomedical QA: A Multi-Model Approach
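[Quick Read]: This paper addresses the difficulty of biomedical text mining and question answering in the face of rapidly growing literature, focusing on biomedical semantic question answering (BioASQ Task 13b) and question answering on developing topics for the Synergy task. The key to the solution is to deploy a selection of open-source Large Language Models (LLMs) as retrieval-augmented generators and to combine them: Yes/No questions are answered by majority vote over the models' outputs, while list and factoid questions use the union of the models' answers. The authors evaluate combinations of 13 state-of-the-art open-source LLMs and build a tailored LLM pipeline for each question type, which yields strong results across the four rounds of the 2025 BioASQ challenge.
Link: https://arxiv.org/abs/2508.01480
Authors: Dimitra Panou,Alexandros C. Dimopoulos,Manolis Koubarakis,Martin Reczko
Affiliations: National and Kapodistrian University of Athens; Biomedical Sciences Research Center “Alexander Fleming”; Harokopio University; Athena Research Center
Subjects: Computation and Language (cs.CL)
Comments: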
Abstract:Biomedical text mining and question-answering are essential yet highly demanding tasks, particularly in the face of the exponential growth of biomedical literature. In this work, we present our participation in the 13th edition of the BioASQ challenge, which involves biomedical semantic question-answering for Task 13b and biomedical question-answering for developing topics for the Synergy task. We deploy a selection of open-source large language models (LLMs) as retrieval-augmented generators to answer biomedical questions. Various models are used to process the questions. A majority voting system combines their output to determine the final answer for Yes/No questions, while for list and factoid type questions, the union of their answers is used. We evaluated 13 state-of-the-art open source LLMs, exploring all possible model combinations to contribute to the final answer, resulting in tailored LLM pipelines for each question type. Our findings provide valuable insight into which combinations of LLMs consistently produce superior results for specific question types. In the four rounds of the 2025 BioASQ challenge, our system achieved notable results: in the Synergy task, we secured 1st place for ideal answers and 2nd place for exact answers in round 2, as well as two shared 1st places for exact answers in rounds 3 and 4.
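The two combination rules described in this abstract (majority vote for Yes/No questions, union for list and factoid questions) can be sketched as follows. This is an illustrative reading, not the authors' system: the function names are hypothetical, and both the tie-breaking default and the case-insensitive de-duplication are assumptions that the abstract does not specify.

```python
from collections import Counter

def majority_vote(answers, tie_break="no"):
    """Combine Yes/No answers from several models by majority.

    The tie-breaking value is an assumption made for this sketch.
    """
    counts = Counter(a.strip().lower() for a in answers)
    if counts["yes"] == counts["no"]:
        return tie_break
    return "yes" if counts["yes"] > counts["no"] else "no"

def union_answers(answer_lists):
    """Union of list/factoid answers, de-duplicated case-insensitively.

    First-seen order and spelling are preserved.
    """
    seen, merged = set(), []
    for answers in answer_lists:
        for ans in answers:
            key = ans.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(ans.strip())
    return merged
```

For example, `majority_vote(["Yes", "no", "yes"])` returns `"yes"`, and `union_answers([["BRCA1", "TP53"], ["tp53", "EGFR"]])` returns `["BRCA1", "TP53", "EGFR"]`.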
[NLP-104] TreeDiff: AST-Guided Code Generation with Diffusion LLMs
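[Quick Read]: This paper addresses the limited performance of diffusion models on code generation caused by ignoring the structure of programming languages. Standard diffusion training uses random token-level corruption, which disregards the strict syntactic and semantic structure of code and weakens the model's ability to learn what makes code correct. The key to the solution is a syntax-aware diffusion framework that injects structural priors from Abstract Syntax Trees (ASTs): during training it selectively corrupts syntactically meaningful code spans derived from AST subtrees instead of masking individual tokens at random. This lets the model respect grammatical boundaries and capture long-range dependencies while denoising, and it significantly improves syntactic correctness, reconstruction accuracy, and generalization to unseen code patterns.
Link: https://arxiv.org/abs/2508.01473
Authors: Yiming Zeng,Jinghan Cao,Zexin Li,Yiming Chen,Tao Ren,Dawei Xiang,Xidong Wu,Shangqian Gao,Tingting Yu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: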
Abstract:Recent advances in diffusion-based language models have opened new possibilities for controllable and bidirectional sequence generation. These models provide an alternative to traditional autoregressive approaches by framing text generation as an iterative denoising process. However, applying diffusion models to structured domains such as source code remains a significant challenge. Programming languages differ from natural language in that they follow strict syntactic and semantic rules, with hierarchical organization that must be preserved for correctness. Standard token-level corruption techniques used during training often ignore this structure, which may hinder the model’s ability to learn meaningful representations of code. To address this limitation, we propose a syntax-aware diffusion framework that incorporates structural priors from Abstract Syntax Trees (ASTs) into the denoising process. Instead of masking individual tokens at random, we selectively corrupt syntactically meaningful code spans derived from AST subtrees. This enables the model to reconstruct programs in a way that respects grammatical boundaries and captures long-range dependencies. Experimental results demonstrate that syntax-aware corruption significantly improves syntactic correctness, reconstruction accuracy, and generalization to unseen code patterns. These findings highlight the potential of incorporating structural information into diffusion-based training and suggest that syntax-guided denoising is a promising direction for advancing diffusion-based language models in code generation tasks.
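As a rough illustration of corrupting AST-derived spans instead of random tokens, the sketch below uses Python's standard `ast` module to pick a statement or expression subtree and replace its source text with a mask. It is not the paper's implementation: the `<mask>` string, the function names, and the uniform choice of subtree are all assumptions, and the offsets are only valid for ASCII source because `col_offset` counts UTF-8 bytes.

```python
import ast
import random

MASK = "<mask>"  # hypothetical mask string for this sketch

def subtree_spans(source):
    """Return (start, end) character offsets of every stmt/expr subtree."""
    line_starts = [0]
    for line in source.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    spans = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.expr, ast.stmt)):
            start = line_starts[node.lineno - 1] + node.col_offset
            end = line_starts[node.end_lineno - 1] + node.end_col_offset
            spans.append((start, end))
    return spans

def mask_random_subtree(source, rng):
    """Replace one randomly chosen subtree span with MASK.

    Returns the corrupted source and the original text of the span,
    which would serve as the reconstruction target.
    """
    start, end = rng.choice(subtree_spans(source))
    return source[:start] + MASK + source[end:], source[start:end]
```

Calling `mask_random_subtree("x = foo(a, b) + 1\n", random.Random(0))` masks a whole syntactic unit, such as the call `foo(a, b)` or the full right-hand side, never a fragment that cuts across subtree boundaries.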
[NLP-105] Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data
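[Quick Read]: This paper addresses the high computational cost and suboptimal performance that result from running Supervised Fine-Tuning (SFT) on unfiltered datasets in specialized domains such as medical reasoning. Existing methods usually select data by sample difficulty (defined by knowledge and reasoning complexity) but overlook the optimization utility reflected in each sample's gradient. The authors find that gradient-based influence alone favors samples that are easy to optimize but lack deep reasoning chains, while difficulty alone selects noisy or overly complex samples that do not support stable optimization. They therefore propose a data selection strategy called Difficulty-Influence Quadrant (DIQ), whose key idea is to prioritize samples that are both high in difficulty and high in gradient influence, preserving complex clinical reasoning while making parameter updates as effective as possible. Experiments show that DIQ matches full-dataset performance with only 1% of the data and consistently outperforms the baseline with 10%, improving both the efficiency and the quality of fine-tuning for medical reasoning.
Link: https://arxiv.org/abs/2508.01450
Authors: Xinlin Zhuang,Feilong Tang,Haolin Yang,Ming Hu,Huifa Li,Haochen Xue,Yichen Li,Junjun He,Zongyuan Ge,Ying Qian,Imran Razzak
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: preprint, under review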
Abstract:Supervised Fine-Tuning (SFT) plays a pivotal role in adapting Large Language Models (LLMs) to specialized domains such as medical reasoning. However, existing SFT practices often rely on unfiltered datasets that contain redundant and low-quality samples, leading to substantial computational costs and suboptimal performance. Although existing methods attempt to alleviate this problem by selecting data based on sample difficulty, defined by knowledge and reasoning complexity, they overlook each sample’s optimization utility reflected in its gradient. Interestingly, we find that gradient-based influence alone favors easy-to-optimize samples that cause large parameter shifts but lack deep reasoning chains, while difficulty alone selects noisy or overly complex cases that fail to guide stable optimization. Based on this observation, we propose a data selection strategy, Difficulty-Influence Quadrant (DIQ), which prioritizes samples in the high-difficulty-high-influence quadrant to balance complex clinical reasoning with substantial gradient influence, enabling efficient medical reasoning with minimal fine-tuning data. Furthermore, Human and LLM-as-a-judge evaluations show that DIQ-selected subsets demonstrate higher data quality and generate clinical reasoning that is more aligned with expert practices in differential diagnosis, safety check, and evidence citation, as DIQ emphasizes samples that foster expert-like reasoning patterns. Extensive experiments on medical reasoning benchmarks demonstrate that DIQ enables models fine-tuned on only 1% of selected data to match full-dataset performance, while using 10% consistently outperforms the baseline, highlighting the superiority of principled data selection over brute-force scaling. The code and data are available at this https URL.
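The quadrant idea in this abstract can be pictured with a small sketch that keeps only samples scoring high on both axes. It assumes difficulty and influence scores have already been computed by some other means, and it uses the median of each score as the cut-off; that threshold, like the function name, is an assumption for illustration and not a detail taken from the paper.

```python
from statistics import median

def select_high_high(samples, difficulty, influence):
    """Keep samples in the high-difficulty, high-influence quadrant.

    Cut-offs are the medians of the two score lists (an assumption made
    for this sketch; the paper's actual thresholds are not given here).
    """
    d_cut = median(difficulty)
    i_cut = median(influence)
    return [
        s for s, d, i in zip(samples, difficulty, influence)
        if d > d_cut and i > i_cut
    ]
```

With samples `["a", "b", "c", "d"]`, difficulty `[0.9, 0.8, 0.2, 0.1]`, and influence `[0.7, 0.1, 0.9, 0.2]`, only `"a"` is above both medians, so it is the only sample selected.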
[NLP-106] From Query to Logic: Ontology-Driven Multi-Hop Reasoning in LLMs
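[Quick Read]: This paper addresses the limited performance of Large Language Models (LLMs) on complex multi-hop question answering (MQA), which requires non-linear, structured reasoning; the root cause is that LLMs struggle to capture deep conceptual relationships between entities. The key to the solution is ORACLE (Ontology-driven Reasoning And Chain for Logical Elucidation), a training-free framework with three stages: (1) dynamically constructing a question-specific knowledge ontology with LLMs, (2) transforming the ontology into First-Order Logic reasoning chains, and (3) systematically decomposing the original query into logically coherent sub-questions. The method combines the generative capabilities of LLMs with the structural benefits of knowledge graphs, reaches performance comparable to advanced models such as DeepSeek-R1 on several standard MQA benchmarks, and produces more logical and interpretable reasoning chains.
Link: https://arxiv.org/abs/2508.01424
Authors: Haonan Bian,Yutao Qi,Rui Yang,Yuanxi Che,Jiaqian Wang,Heming Xia,Ranran Zhen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: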
Abstract:Large Language Models (LLMs), despite their success in question answering, exhibit limitations in complex multi-hop question answering (MQA) tasks that necessitate non-linear, structured reasoning. This limitation stems from their inability to adequately capture deep conceptual relationships between entities. To overcome this challenge, we present ORACLE (Ontology-driven Reasoning And Chain for Logical Elucidation), a training-free framework that combines LLMs’ generative capabilities with the structural benefits of knowledge graphs. Our approach operates through three stages: (1) dynamic construction of question-specific knowledge ontologies using LLMs, (2) transformation of these ontologies into First-Order Logic reasoning chains, and (3) systematic decomposition of the original query into logically coherent sub-questions. Experimental results on several standard MQA benchmarks show that our framework achieves highly competitive performance, rivaling current state-of-the-art models like DeepSeek-R1. Detailed analyses further confirm the effectiveness of each component, while demonstrating that our method generates more logical and interpretable reasoning chains than existing approaches.
[NLP-107] Discovering Bias Associations through Open-Ended LLM Generations
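[Quick Read]: This paper addresses social biases embedded in Large Language Models (LLMs), in particular representational harms, that is, unfair or distorted portrayals of demographic groups that can be expressed in subtle ways. Existing evaluation methods usually depend on predefined identity-concept associations and therefore struggle to surface new or unexpected forms of bias. The proposed solution is the Bias Association Discovery Framework (BADF), whose key idea is to systematically extract both known and previously unrecognized associations between demographic identities and descriptive concepts from open-ended generations. This enables broad mapping and analysis of bias patterns in LLMs and provides a scalable, data-driven tool for identifying bias.
Link: https://arxiv.org/abs/2508.01412
Authors: Jinhao Pan,Chahat Raj,Ziwei Zhu
Affiliations: George Mason University
Subjects: Computation and Language (cs.CL)
Comments: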
Abstract:Social biases embedded in Large Language Models (LLMs) raise critical concerns, resulting in representational harms – unfair or distorted portrayals of demographic groups – that may be expressed in subtle ways through generated language. Existing evaluation methods often depend on predefined identity-concept associations, limiting their ability to surface new or unexpected forms of bias. In this work, we present the Bias Association Discovery Framework (BADF), a systematic approach for extracting both known and previously unrecognized associations between demographic identities and descriptive concepts from open-ended LLM outputs. Through comprehensive experiments spanning multiple models and diverse real-world contexts, BADF enables robust mapping and analysis of the varied concepts that characterize demographic identities. Our findings advance the understanding of biases in open-ended generation and provide a scalable tool for identifying and analyzing bias associations in LLMs. Data, code, and results are available at this https URL
[NLP-108] ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles with English translations
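[Quick Read]: This paper addresses the lack of high-quality, multi-genre parallel corpora between Egyptian Arabic and English, which are needed to train and evaluate machine translation (MT) models and to support related research. The key to the solution is ArzEn-MultiGenre, a parallel dataset that is manually translated and aligned. It contains 25,557 segment pairs covering song lyrics, novels, and TV show subtitles, three genres not found in existing Egyptian Arabic-English parallel corpora. All translations were produced by human experts, which makes the dataset a reliable gold-standard resource for machine translation, cross-linguistic analysis, and teaching.
Link: https://arxiv.org/abs/2508.01411
Authors: Rania Al-Sabbagh
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: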
Abstract:ArzEn-MultiGenre is a parallel dataset of Egyptian Arabic song lyrics, novels, and TV show subtitles that are manually translated and aligned with their English counterparts. The dataset contains 25,557 segment pairs that can be used to benchmark new machine translation models, fine-tune large language models in few-shot settings, and adapt commercial machine translation applications such as Google Translate. Additionally, the dataset is a valuable resource for research in various disciplines, including translation studies, cross-linguistic analysis, and lexical semantics. The dataset can also serve pedagogical purposes by training translation students and aid professional translators as a translation memory. The contributions are twofold: first, the dataset features textual genres not found in existing parallel Egyptian Arabic and English datasets, and second, it is a gold-standard dataset that has been translated and aligned by human experts.
[NLP-109] MedSynth: Realistic Synthetic Medical Dialogue-Note Pairs
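[Quick Read]: This paper addresses the large amount of time clinicians spend on medical documentation, a burden that contributes to professional burnout. To tackle it, the authors introduce MedSynth, a dataset of synthetic medical dialogues and clinical notes. Informed by a systematic analysis of disease distributions, it contains over 10,000 dialogue-note pairs covering more than 2000 ICD-10 codes, and it markedly improves model performance on the Dialogue-to-Note and Note-to-Dialogue tasks. The dataset is open-access, privacy-compliant, and diverse, filling a gap in a field where high-quality training data for medical NLP is scarce.
Link: https://arxiv.org/abs/2508.01401
Authors: Ahmad Rezaie Mianroodi,Amirali Rezaie,Niko Grisel Todorov,Cyril Rakovski,Frank Rudzicz
Affiliations: Dalhousie University; Vector Institute; Shahrood University of Technology; Chapman University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages excluding references and appendices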
Abstract:Physicians spend significant time documenting clinical encounters, a burden that contributes to professional burnout. To address this, robust automation tools for medical documentation are crucial. We introduce MedSynth – a novel dataset of synthetic medical dialogues and notes designed to advance the Dialogue-to-Note (Dial-2-Note) and Note-to-Dialogue (Note-2-Dial) tasks. Informed by an extensive analysis of disease distributions, this dataset includes over 10,000 dialogue-note pairs covering over 2000 ICD-10 codes. We demonstrate that our dataset markedly enhances the performance of models in generating medical notes from dialogues, and dialogues from medical notes. The dataset provides a valuable resource in a field where open-access, privacy-compliant, and diverse training data are scarce. Code is available at this https URL and the dataset is available at this https URL.
[NLP-110] MaRGen: Multi-Agent LLM Approach for Self-Directed Market Research and Analysis
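[Quick Read]: This paper addresses the manual, slow, and costly nature of traditional market analysis and report writing, especially when businesses need high-quality insights quickly. The core solution is an autonomous framework based on Large Language Models (LLMs) in which several specialized agents (Researcher, Reviewer, Writer, and Retriever) collaborate to automate the full workflow: querying databases, analyzing data, creating visualizations, and writing the report. The key innovations are the use of in-context learning so that agents pick up analytical methods from real professional consultants' presentation materials, and an LLM-based evaluation system with an iterative improvement loop that steadily raises report quality. Experiments show that the framework produces a detailed 6-page report in 7 minutes at a cost of about $1, making market insights more accessible and efficient to obtain.
Link: https://arxiv.org/abs/2508.01370
Authors: Roman Koshkin,Pengyu Dai,Nozomi Fujikawa,Masahito Togami,Marco Visentini-Scarzanella
Affiliations: Okinawa Institute of Science and Technology; Institute of Science Tokyo; Amazon
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: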
Abstract:We present an autonomous framework that leverages Large Language Models (LLMs) to automate end-to-end business analysis and market report generation. At its core, the system employs specialized agents (Researcher, Reviewer, Writer, and Retriever) that collaborate to analyze data and produce comprehensive reports. These agents learn from real professional consultants’ presentation materials at Amazon through in-context learning to replicate professional analytical methodologies. The framework executes a multi-step process: querying databases, analyzing data, generating insights, creating visualizations, and composing market reports. We also introduce a novel LLM-based evaluation system for assessing report quality, which shows alignment with expert human evaluations. Building on these evaluations, we implement an iterative improvement mechanism that optimizes report quality through automated review cycles. Experimental results show that report quality can be improved by both automated review cycles and consultants’ unstructured knowledge. In experimental validation, our framework generates detailed 6-page reports in 7 minutes at a cost of approximately $1. Our work could be an important step to automatically create affordable market insights.
[NLP-111] ConfGuard: A Simple and Effective Backdoor Detection for Large Language Models
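[Quick Read]: This paper addresses the security threat that backdoor attacks pose to Large Language Models (LLMs). Existing defenses were designed mainly for classification tasks and perform poorly, with high latency, when faced with the autoregressive nature and vast output space of LLMs. The key to the solution is to study how benign and backdoored models differ in output space, which reveals a phenomenon the authors call "sequence lock": a backdoored model generates the target sequence with abnormally high and consistent confidence. Building on this, they propose ConfGuard, a detection method that monitors token confidences within a sliding window to identify sequence lock. It achieves a near 100% true positive rate (TPR) with a negligible false positive rate (FPR) in the vast majority of cases and adds almost no latency, making it practical for real-world deployment.
Link: https://arxiv.org/abs/2508.01365
Authors: Zihan Wang,Rui Zhang,Hongwei Li,Wenshu Fan,Wenbo Jiang,Qingchuan Zhao,Guowen Xu
Affiliations: University of Electronic Science and Technology of China; City University of Hong Kong
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Under review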
Abstract:Backdoor attacks pose a significant threat to Large Language Models (LLMs), where adversaries can embed hidden triggers to manipulate LLM’s outputs. Most existing defense methods, primarily designed for classification tasks, are ineffective against the autoregressive nature and vast output space of LLMs, thereby suffering from poor performance and high latency. To address these limitations, we investigate the behavioral discrepancies between benign and backdoored LLMs in output space. We identify a critical phenomenon which we term sequence lock: a backdoored model generates the target sequence with abnormally high and consistent confidence compared to benign generation. Building on this insight, we propose ConfGuard, a lightweight and effective detection method that monitors a sliding window of token confidences to identify sequence lock. Extensive experiments demonstrate ConfGuard achieves a near 100% true positive rate (TPR) and a negligible false positive rate (FPR) in the vast majority of cases. Crucially, the ConfGuard enables real-time detection almost without additional latency, making it a practical backdoor defense for real-world LLM deployments.
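A sliding-window check over token confidences, as described in this abstract, might look roughly like the sketch below. It is a simplified reading and not the ConfGuard implementation: the function name, the window size of 5, the 0.99 threshold, and the "every token in the window is above the threshold" rule are all assumptions chosen for illustration.

```python
from collections import deque

def detect_sequence_lock(token_probs, window=5, threshold=0.99):
    """Return the index at which a suspicious run is first detected, else None.

    A run is flagged when the last `window` token probabilities are all at
    least `threshold`, i.e. confidence is both high and consistent.
    """
    recent = deque(maxlen=window)
    for idx, prob in enumerate(token_probs):
        recent.append(prob)
        if len(recent) == window and min(recent) >= threshold:
            return idx
    return None
```

Because the check only inspects probabilities that the model already produces during decoding, it can run token by token alongside generation, which is consistent with the low-latency claim in the abstract.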
[NLP-112] Large-Scale Diverse Synthesis for Mid-Training
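[Quick Read]: This paper addresses the scarcity of high-quality, knowledge-intensive data for training Large Language Models (LLMs): traditional corpora carry limited information, and existing question-answering (QA) data is hard to scale and lacks knowledge diversity in cross-domain settings. The key to the solution is a diversified synthesis pipeline that builds BoostQA, a large-scale QA dataset of 100B tokens. The pipeline has three core steps: (1) curating seed data from heterogeneous sources; (2) using DeepSeek-R1 for STEM-focused multi-grade synthesis to increase diversity, together with high-difficulty synthesis to mitigate difficulty degradation; and (3) refining answers with DeepSeek-V3 to improve output quality. The authors use BoostQA in mid-training, a stage between pre-training and post-training, which markedly improves performance on STEM and high-difficulty tasks: a Llama-3 8B model gains an average of 12.74% on MMLU and CMMLU and reaches state-of-the-art average performance across 12 benchmarks.
Link: https://arxiv.org/abs/2508.01326
Authors: Xuemiao Zhang,Chengying Tu,Can Ren,Rongxiang Weng,Hongfei Yan,Jingang Wang,Xunliang Cai
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: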
Abstract:The scarcity of high-quality, knowledge-intensive training data hinders the development of large language models (LLMs), as traditional corpora provide limited information. Previous studies have synthesized and integrated corpora-dependent question-answering (QA) data to improve model performance but face challenges in QA data scalability and knowledge diversity, particularly in cross-domain contexts. Furthermore, leveraging our designed discipline and difficulty annotation system, we probe model deficiencies in STEM disciplines and high-difficulty data. To overcome these limitations, we propose a novel diversified pipeline to synthesize BoostQA, a 100B-token large-scale QA dataset. Our synthesis framework: (1) curates seed data from heterogeneous sources; (2) utilizes DeepSeek-R1 to implement STEM-focused multi-grade synthesis to boost data diversity and high-difficulty synthesis to mitigate difficulty degradation; (3) refines answers via DeepSeek-V3 to improve output quality. We utilize BoostQA in mid-training, a mid-stage between pre-training and post-training, to optimize domain-specific knowledge acquisition and enhance data quality. Our method enables Llama-3 8B, mid-trained on a 40B-token dataset, to achieve an average improvement of 12.74% on MMLU and CMMLU and establish SOTA average performance across 12 benchmarks. BoostQA also demonstrates robust scalability, with performance consistently improving as model size, data volume, and initial FLOPs scale.
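[NLP-113] LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points
[Quick read]: This paper addresses the scarcity of high-quality, diverse data for training large language models (LLMs). Its core solution is LinkSyn, a synthesis framework based on a knowledge point (KP) graph. The key idea is to build a KP graph that guides the generation of diverse question-answering (QA) data: first, a knowledge distribution value function dynamically adjusts path sampling probabilities to balance KP coverage and popularity; second, DeepSeek-R1 performs diffusion-based synthesis from multiple seeds, strengthening their logical association; finally, difficulty can be adjusted flexibly within a given discipline to improve the generation of high-difficulty questions. The method yields LinkQA, a multi-disciplinary QA dataset of 50B tokens. Experiments show that continual pre-training on LinkQA significantly improves Llama-3 8B on MMLU and CMMLU, reaching a new state of the art (SOTA).
Link: https://arxiv.org/abs/2508.01317
Authors: Xuemiao Zhang,Can Ren,Chengying Tu,Rongxiang Weng,Hongfei Yan,Jingang Wang,Xunliang Cai
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: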
Abstract:The advancement of large language models (LLMs) struggles with the scarcity of high-quality, diverse training data. To address this limitation, we propose LinkSyn, a novel knowledge point (KP) graph-based synthesis framework that enables flexible control over discipline and difficulty distributions while balancing KP coverage and popularity. LinkSyn extracts KPs from question-answering (QA) seed data and constructs a KP graph to synthesize diverse QA data from multiple seeds strongly linked by KPs and sampled from graph walks. Specifically, LinkSyn incorporates (1) a knowledge distribution value function to guide the adjustment of path sampling probability and balance KP coverage and popularity during graph walks; (2) diffusion-based synthesis via DeepSeek-R1 by leveraging multiple seeds with dense logical associations along each path; and (3) high-difficulty QA enhancement within given disciplines by flexible difficulty adjustments. By executing LinkSyn, we synthesize LinkQA, a diverse multi-disciplinary QA dataset with 50B tokens. Extensive experiments on Llama-3 8B demonstrate that continual pre-training with LinkQA yields an average improvement of 11.51% on MMLU and CMMLU, establishing new SOTA results. LinkQA consistently enhances performance across model size and initial FLOPs scales.
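Editor's sketch: the idea of adjusting path sampling probabilities during graph walks can be illustrated with a small weighted random walk. This is not the LinkSyn value function, which the abstract does not specify. The weight used here (popularity divided by one plus the visit count) is a hypothetical stand-in, and the graph, popularity scores, and function name are all invented for illustration.

```python
import random
from collections import Counter


def sample_walk(graph, popularity, start, steps, visits=None, rng=None):
    """Walk a knowledge-point graph, preferring popular but rarely visited nodes."""
    if rng is None:
        rng = random.Random(0)
    if visits is None:
        visits = Counter()
    path = [start]
    visits[start] += 1
    current = start
    for _ in range(steps):
        neighbours = graph.get(current, [])
        if not neighbours:
            break
        # Hypothetical value function: popularity discounted by earlier visits,
        # so coverage of rarely sampled knowledge points rises over time.
        weights = [popularity[n] / (1 + visits[n]) for n in neighbours]
        current = rng.choices(neighbours, weights=weights, k=1)[0]
        visits[current] += 1
        path.append(current)
    return path


# Toy knowledge-point graph (invented for this example).
graph = {
    "limits": ["derivatives", "continuity"],
    "derivatives": ["integrals", "limits"],
    "continuity": ["limits"],
    "integrals": ["derivatives"],
}
popularity = {"limits": 5.0, "derivatives": 8.0, "continuity": 2.0, "integrals": 6.0}

path = sample_walk(graph, popularity, "limits", steps=3, rng=random.Random(1))
```

Sharing one `visits` counter across many walks is what would push later walks toward under-covered knowledge points; each walk's node sequence would then serve as the set of linked seeds.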
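[NLP-114] D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation
[Quick read]: This paper addresses the performance bottleneck that domain-specific large language models (LLMs) face in supervised fine-tuning (SFT) because high-quality question-answering (QA) datasets are scarce and expensive. Its core solution is D-SCoRE, a training-free pipeline that uses LLMs and prompt engineering to build diverse, high-quality QA datasets automatically from arbitrary text sources. The key innovation is the combination of Document-centric processing, Segmentation, Chain-of-Thought (CoT) Reasoning, and structured Export, together with multi-dimensional control mechanisms (semantic role transformation, question type balancing, and counterfactual materials). These markedly increase the diversity and relevance of the generated QA pairs and enable efficient, scalable, domain-aware fine-tuning.
Link: https://arxiv.org/abs/2508.01309
Authors: Weibo Zhou,Lingbo Li,Shangsong Liang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: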
Abstract:The scarcity and high cost of high-quality question-answering (QA) datasets hinder supervised fine-tuning (SFT) for domain-specific large language models (LLMs). To address this, we introduce D-SCoRE, a training-free pipeline that utilizes LLMs and prompt engineering to produce diverse, high-quality QA datasets from arbitrary textual sources. D-SCoRE integrates Document-centric processing, Segmentation, CoT Reasoning, and structured Export to generate QA-CoT datasets tailored for domain-aware SFT. Multi-dimensional control mechanisms, such as semantic role transformation, question type balancing, and counterfactual materials, enhance diversity and relevance, overcoming limitations of existing QA generation. LLMs fine-tuned on D-SCoRE-generated QA datasets, and human-annotated QA datasets (SQuAD, Covid-QA) are evaluated on SQuADShifts and Covid-QA test sets, with D-SCoRE outperforming across most domains. D-SCoRE generates six QA-CoT pairs with four-option counterfactual materials per 100-200-word text in 90 seconds using an 8B LLM on consumer-grade hardware. Its simplicity and scalability enable efficient QA generation and high-performance fine-tuning across domains.
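[NLP-115] KEDAS: Knowledge Editing Alignment with Diverse Augmentation and Self-adaptive Inference
[Quick read]: This paper addresses the inefficiency of updating outdated knowledge in large language models (LLMs), that is, how to modify a model's internal knowledge efficiently without harming its existing capabilities. The key to the solution is a new framework, KEDAS (Knowledge Editing alignment with Diverse Augmentation and Self-adaptive inference). First, in the alignment phase, the model learns to apply in-context edited knowledge via low-rank adaptation (LoRA). Second, a diverse edit augmentation technique improves the recall of edits. Finally, a self-adaptive post-alignment inference mechanism uses a filter-based smart retriever to select the inference route dynamically: irrelevant queries go directly through the original model, while relevant queries, together with their related edits, go through the model with the aligned adapters activated. With these components, KEDAS significantly outperforms existing parameter-editing and retrieval-based baselines across multiple datasets and model settings, while retaining good general-task performance and computational efficiency.
Link: https://arxiv.org/abs/2508.01302
Authors: Chenming Tang,Yutong Yang,Yunfang Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Preprint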
Abstract:Knowledge editing aims to modify outdated knowledge in large language models (LLMs) efficiently while retaining their powerful capabilities. Most existing methods rely on either parameter-level editing or retrieval-based approaches. In this work, we propose Knowledge Editing alignment with Diverse Augmentation and Self-adaptive inference (KEDAS) to better align LLMs with knowledge editing. In the alignment phase, LLMs learn to apply in-context edited knowledge via low-rank adaptation. During editing, we design a diverse edit augmentation technique to improve the recall of edits. After that, a self-adaptive post-alignment inference mechanism is proposed, in which a filter-based smart retriever is employed to perform a dynamic selection of inference routing. Specifically, irrelevant queries will go through the original pre-alignment model directly, while relevant ones, together with their related edits, go through the model with aligned adapters activated. In experiments, KEDAS secures the highest overall performance scores in 35 out of 36 cases across four datasets with three LLMs on three settings, surpassing its strong knowledge editing alignment counterpart by about 19.8 harmonic mean scores of edit success, locality and portability and outperforming both parameter editing and retrieval-based baselines significantly. Analysis of computational cost and performance on general tasks further validates the robustness and efficiency of KEDAS, indicating that it presents an ideal paradigm of knowledge editing alignment.
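Editor's sketch: the abstract reports an overall score as the harmonic mean of edit success, locality, and portability. The snippet below shows how such an aggregate behaves, assuming all three metrics share a 0-100 scale; the function name and the example numbers are hypothetical and not taken from the paper.

```python
from statistics import harmonic_mean


def overall_score(edit_success, locality, portability):
    """Harmonic mean of three knowledge-editing metrics on a shared scale."""
    return harmonic_mean([edit_success, locality, portability])


# A weak portability score drags the aggregate well below the arithmetic mean (70).
balanced = overall_score(50, 50, 50)
skewed = overall_score(90, 90, 30)
```

Here `skewed` comes out at about 54, which illustrates why a harmonic mean rewards methods that do well on all three criteria rather than excelling on only one or two.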
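[NLP-116] Prompting Large Language Models with Partial Knowledge for Answering Questions with Unseen Entities
[Quick read]: This paper addresses the poor handling of partially relevant knowledge by retrieval-augmented generation (RAG) systems, especially in knowledge graph question answering (KGQA), where an incomplete knowledge base leaves retrieval results full of noisy or irrelevant information. Conventional methods select knowledge by embedding similarity, which tends to introduce noise. The paper proposes a new perspective: large language models (LLMs) can be "awakened" by partially relevant knowledge that is related to the question and already embedded in the model, which improves their reasoning. The key to the solution is to construct partially relevant knowledge from the gold reasoning path and its variants, and to show through theoretical analysis and experiments that this awakening effect markedly improves answer generation under incomplete knowledge. On the Unseen Entity KGQA task, which simulates real-world entity linking failures, the approach outperforms traditional methods based on embedding similarity.
Link: https://arxiv.org/abs/2508.01290
Authors: Zhichao Yan,Jiapu Wang,Jiaoyan Chen,Yanyan Wang,Hongye Tan,Jiye Liang,Xiaoli Li,Ru Li,Jeff Z.Pan
Affiliations: Shanxi University; Beijing University of Technology; University of Manchester; Singapore University of Technology and Design; University of Edinburgh
Categories: Computation and Language (cs.CL)
Comments: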
Abstract:Retrieval-Augmented Generation (RAG) shows impressive performance by supplementing and substituting parametric knowledge in Large Language Models (LLMs). Retrieved knowledge can be divided into three types: explicit answer evidence, implicit answer clue, and insufficient answer context which can be further categorized into totally irrelevant and partially relevant information. Effectively utilizing partially relevant knowledge remains a key challenge for RAG systems, especially in incomplete knowledge base retrieval. Contrary to the conventional view, we propose a new perspective: LLMs can be awakened via partially relevant knowledge already embedded in LLMs. To comprehensively investigate this phenomenon, the triplets located in the gold reasoning path and their variants are used to construct partially relevant knowledge by removing the path that contains the answer. We provide theoretical analysis of the awakening effect in LLMs and support our hypothesis with experiments on two Knowledge Graphs (KGs) Question Answering (QA) datasets. Furthermore, we present a new task, Unseen Entity KGQA, simulating real-world challenges where entity linking fails due to KG incompleteness. Our awakening-based approach demonstrates greater efficacy in practical applications, outperforms traditional methods that rely on embedding-based similarity which are prone to returning noisy information.
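[NLP-117] Multi-TW: Benchmarking Multimodal Models on Traditional Chinese Question Answering in Taiwan
[Quick read]: This paper addresses the lack of a comprehensive benchmark for multimodal large language models (MLLMs) in Traditional Chinese, in particular the neglect of tri-modal (visual, acoustic, and textual) evaluation and of inference latency. The key to the solution is Multi-TW, the first multimodal benchmark for Traditional Chinese. It contains 900 multiple-choice questions (image-text and audio-text pairs) sourced from official Chinese proficiency tests, and it systematically evaluates the performance and latency of any-to-any models and vision-language models (VLMs) in Traditional Chinese settings. The results show that closed-source models generally outperform open-source ones, although open-source models are competitive on audio tasks. End-to-end any-to-any pipelines also have a clear latency advantage over VLMs that rely on separate audio transcription, which underlines the importance of efficient multimodal architectures and of optimization for Traditional Chinese.
Link: https://arxiv.org/abs/2508.01274
Authors: Jui-Ming Yao,Bing-Cheng Xie,Sheng-Wei Peng,Hao-Yuan Chen,He-Rong Zheng,Bing-Jia Tan,Peter Shaojui Wang,Shun-Feng Su
Affiliations: National Taiwan University of Science and Technology; National Taiwan University; University of London
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: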
Abstract:Multimodal Large Language Models (MLLMs) process visual, acoustic, and textual inputs, addressing the limitations of single-modality LLMs. However, existing benchmarks often overlook tri-modal evaluation in Traditional Chinese and do not consider inference latency. To address this, we introduce Multi-TW, the first Traditional Chinese benchmark for evaluating the performance and latency of any-to-any multimodal models. Multi-TW includes 900 multiple-choice questions (image and text, audio and text pairs) sourced from official proficiency tests developed with the Steering Committee for the Test of Proficiency-Huayu (SC-TOP). We evaluated various any-to-any models and vision-language models (VLMs) with audio transcription. Our results show that closed-source models generally outperform open-source ones across modalities, although open-source models can perform well in audio tasks. End-to-end any-to-any pipelines offer clear latency advantages compared to VLMs using separate audio transcription. Multi-TW presents a comprehensive view of model capabilities and highlights the need for Traditional Chinese fine-tuning and efficient multimodal architectures.
[NLP-118] Bridging LLMs and Symbolic Reasoning in Educational QA Systems: Insights from the XAI Challenge at IJCNN 2025
[Quick read]: This paper addresses the lack of transparency and interpretability in current educational applications of artificial intelligence (AI), especially where generative AI systems struggle to give logically clear, trustworthy grounds for their decisions. Its core solution is the XAI Challenge 2025, a competition that asked participants to build question-answering (QA) systems able to answer student queries about university policies and to generate logic-based natural language explanations. The key requirements were the mandatory use of lightweight large language models (LLMs) or hybrid LLM-symbolic systems to improve explainability and trustworthiness, supported by a high-quality dataset built from logic-based templates with Z3 validation and by an expert review process, so that results stay close to real educational scenarios.
Link: https://arxiv.org/abs/2508.01263
Authors: Long S. T. Nguyen,Khang H. N. Vo,Thu H. A. Nguyen,Tuan C. Bui,Duc Q. Nguyen,Thanh-Tung Tran,Anh D. Nguyen,Minh L. Nguyen,Fabien Baldacci,Thang H. Bui,Emanuel Di Nardo,Angelo Ciaramella,Son H. Le,Ihsan Ullah,Lorenzo Di Rocco,Tho T. Quan
Affiliations: URA Research Group, Ho Chi Minh City University of Technology (HCMUT), Vietnam; Ho Chi Minh City International University (HCMIU), Vietnam; University of South-Eastern Norway, Norway; Japan Advanced Institute of Science and Technology (JAIST), Japan; Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, F-33400 Talence, France; University of Naples Parthenope, Italy; Sapienza University of Rome, Italy; VNU Information Technology Institute, Vietnam National University, Vietnam; Visual Intelligence Lab, School of Computer Science & Insight Center for Data Analytics, University of Galway, Ireland
Categories: Computation and Language (cs.CL)
Comments: The XAI Challenge @ TRNS-AI Workshop, IJCNN 2025: Explainable AI for Educational Question Answering. Website: this https URL
Abstract:The growing integration of Artificial Intelligence (AI) into education has intensified the need for transparency and interpretability. While hackathons have long served as agile environments for rapid AI prototyping, few have directly addressed eXplainable AI (XAI) in real-world educational contexts. This paper presents a comprehensive analysis of the XAI Challenge 2025, a hackathon-style competition jointly organized by Ho Chi Minh City University of Technology (HCMUT) and the International Workshop on Trustworthiness and Reliability in Neurosymbolic AI (TRNS-AI), held as part of the International Joint Conference on Neural Networks (IJCNN 2025). The challenge tasked participants with building Question-Answering (QA) systems capable of answering student queries about university policies while generating clear, logic-based natural language explanations. To promote transparency and trustworthiness, solutions were required to use lightweight Large Language Models (LLMs) or hybrid LLM-symbolic systems. A high-quality dataset was provided, constructed via logic-based templates with Z3 validation and refined through expert student review to ensure alignment with real-world academic scenarios. We describe the challenge’s motivation, structure, dataset construction, and evaluation protocol. Situating the competition within the broader evolution of AI hackathons, we argue that it represents a novel effort to bridge LLMs and symbolic reasoning in service of explainability. Our findings offer actionable insights for future XAI-centered educational systems and competitive research initiatives.
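[NLP-119] AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection
[Quick read]: This paper addresses the security risks that arise when large language model (LLM) agents execute external tools, since their behavior is dynamic and opaque; prompt injection attacks are the main threat considered. The key to the solution is to treat agent runtime traces as structured programs with analyzable semantics. The proposed AgentArmor framework uses a graph constructor to convert agent traces into graph intermediate representations based on control flow graphs (CFG), data flow graphs (DFG), and program dependence graphs (PDG). Combined with a registry of tool-interaction metadata and a type system, it statically detects and constrains sensitive data flows, trust boundaries, and policy violations.
Link: https://arxiv.org/abs/2508.01249
Authors: Peiran Wang,Yang Liu,Yunfei Lu,Yifeng Cai,Hongbo Chen,Qingyou Yang,Jie Zhang,Jue Hong,Ye Wu
Affiliations: ByteDance
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: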
Abstract:Large Language Model (LLM) agents offer a powerful new paradigm for solving various problems by combining natural language reasoning with the execution of external tools. However, their dynamic and non-transparent behavior introduces critical security risks, particularly in the presence of prompt injection attacks. In this work, we propose a novel insight that treats the agent runtime traces as structured programs with analyzable semantics. Thus, we present AgentArmor, a program analysis framework that converts agent traces into graph intermediate representation-based structured program dependency representations (e.g., CFG, DFG, and PDG) and enforces security policies via a type system. AgentArmor consists of three key components: (1) a graph constructor that reconstructs the agent’s working traces as graph-based intermediate representations with control flow and data flow described within; (2) a property registry that attaches security-relevant metadata of interacted tools data, and (3) a type system that performs static inference and checking over the intermediate representation. By representing agent behavior as structured programs, AgentArmor enables program analysis over sensitive data flow, trust boundaries, and policy violations. We evaluate AgentArmor on the AgentDojo benchmark, the results show that AgentArmor can achieve 95.75% of TPR, with only 3.66% of FPR. Our results demonstrate AgentArmor’s ability to detect prompt injection vulnerabilities and enforce fine-grained security constraints.
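Editor's sketch: a plain taint-style reachability check over a data-flow graph gives a rough feel for analyzing sensitive data flow across trust boundaries in an agent trace. This is a much simpler technique than the type system the abstract describes, and it is not AgentArmor's implementation. The node names, the edge list, and the function name are all hypothetical.

```python
from collections import deque


def tainted_sinks(edges, untrusted_sources, sensitive_sinks):
    """Return the sensitive sinks reachable from untrusted sources via data-flow edges."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    seen = set(untrusted_sources)
    queue = deque(untrusted_sources)
    # Breadth-first propagation of the "untrusted" label along data-flow edges.
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen & set(sensitive_sinks))


# Hypothetical trace: data read from a web page ends up in an outgoing email body.
trace_edges = [
    ("user_prompt", "plan"),
    ("web_page", "summary"),
    ("summary", "send_email.body"),
    ("plan", "read_file.path"),
]
flagged = tainted_sinks(
    trace_edges,
    untrusted_sources={"web_page"},
    sensitive_sinks={"send_email.body", "read_file.path"},
)
```

In this toy trace only `send_email.body` is flagged, because `read_file.path` depends solely on the trusted user prompt.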
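[NLP-120] WarriorMath: Enhancing the Mathematical Ability of Large Language Models with a Defect-aware Framework
[Quick read]: This paper addresses the limited performance of large language models (LLMs) on mathematical problem solving. The root cause is that existing training data lacks targeting and diversity and, above all, does not cover the model's specific failure modes, so synthetic data does little to improve the model. The key to the solution is WarriorMath, a defect-aware framework with two core parts. First, multiple expert LLMs collaborate to generate, critique, and refine problems, focusing on identifying and iteratively improving the problems that the base model cannot solve, which produces high-quality, defect-oriented training data. Second, a progressive training mechanism adjusts training difficulty dynamically according to the model's weaknesses, giving targeted iterative fine-tuning. Experiments show an average improvement of 12.57% on six mathematical benchmarks, setting a new state of the art.
Link: https://arxiv.org/abs/2508.01245
Authors: Yue Chen,Minghua He,Fangkai Yang,Pu Zhao,Lu Wang,Yu Kang,Yifei Dong,Yuefeng Zhan,Hao Sun,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang
Affiliations: Peking University; Microsoft; KTH Royal Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: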
Abstract:Large Language Models (LLMs) excel in solving mathematical problems, yet their performance is often limited by the availability of high-quality, diverse training data. Existing methods focus on augmenting datasets through rephrasing or difficulty progression but overlook the specific failure modes of LLMs. This results in synthetic questions that the model can already solve, providing minimal performance gains. To address this, we propose WarriorMath, a defect-aware framework for mathematical problem solving that integrates both targeted data synthesis and progressive training. In the synthesis stage, we employ multiple expert LLMs in a collaborative process to generate, critique, and refine problems. Questions that base LLMs fail to solve are identified and iteratively improved through expert-level feedback, producing high-quality, defect-aware training data. In the training stage, we introduce a progressive learning framework that iteratively fine-tunes the model using increasingly challenging data tailored to its weaknesses. Experiments on six mathematical benchmarks show that WarriorMath outperforms strong baselines by 12.57% on average, setting a new state-of-the-art. Our results demonstrate the effectiveness of a defect-aware, multi-expert framework for improving mathematical ability.
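[NLP-121] WebDS: An End-to-End Benchmark for Web-based Data Science
[Quick read]: This paper addresses the limitations of existing benchmarks for evaluating large language models (LLMs) on web-based data science tasks. Conventional web benchmarks focus on simple interactions (such as form submissions or e-commerce transactions) and do not require multi-tool coordination or the handling of heterogeneous data, while conventional data science benchmarks are confined to static text datasets and cannot measure end-to-end workflows that span data acquisition, cleaning, analysis, and insight generation. In response, the authors propose WebDS, the first end-to-end benchmark for web-based data science. It comprises 870 complex tasks across 29 diverse websites (from structured government data portals to unstructured news media) that require agents to perform multi-hop, multi-step operations and to use several tools on data of different modalities, which better reflects modern data analysis practice. Experiments show that current state-of-the-art LLM agents fall well short on WebDS (for example, Browser Use completes only 15% of tasks), exposing new failure modes such as poor information grounding, repetitive behavior, and shortcut-taking, and confirming the value of WebDS as a more challenging and realistic testbed.
Link: https://arxiv.org/abs/2508.01222
Authors: Ethan Hsu,Hong Meng Yam,Ines Bouissou,Aaron Murali John,Raj Thota,Josh Koe,Vivek Sarath Putta,G K Dharesan,Alexander Spangher,Shikhar Murty,Tenghao Huang,Christopher D. Manning
Affiliations: Stanford University; University of California, Berkeley; Singapore University of Technology and Design; University of Southern California
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages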
Abstract:A large portion of real-world data science tasks are complex and require multi-hop web-based interactions: finding appropriate data available on the internet, synthesizing real-time data of various modalities from different locations, and producing summarized analyses. Existing web benchmarks often focus on simplistic interactions, such as form submissions or e-commerce transactions, and often do not require diverse tool-using capabilities required for web based data science. Conversely, traditional data science benchmarks typically concentrate on static, often textually bound datasets and do not assess end-to-end workflows that encompass data acquisition, cleaning, analysis, and insight generation. In response, we introduce WebDS, the first end-to-end web-based data science benchmark. It comprises 870 web-based data science tasks across 29 diverse websites from structured government data portals to unstructured news media, challenging agents to perform complex, multi-step operations requiring the use of tools and heterogeneous data formats that better reflect the realities of modern data analytics. Evaluations of current SOTA LLM agents indicate significant performance gaps in accomplishing these tasks. For instance, Browser Use, which accomplishes 80% of tasks on Web Voyager, successfully completes only 15% of tasks in WebDS, which our analysis suggests is due to new failure modes like poor information grounding, repetitive behavior and shortcut-taking that agents performing WebDS’ tasks display. By providing a more robust and realistic testing ground, WebDS sets the stage for significant advances in the development of practically useful LLM-based data science.
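[NLP-122] Show or Tell? Modeling the evolution of request-making in Human-LLM conversations
[Quick read]: This paper addresses how to extract structured patterns of user behavior from large language model (LLM) chat logs, in order to reveal how users organize their requests when interacting with LLMs. The key to the solution is a new task: segmenting chat queries into contents of requests, roles, query-specific context, and additional expressions, which systematically decomposes the semantic structure of user input. With this segmentation, the study finds that user behavior varies between individuals but tends to converge with experience, and that the introduction of new models markedly changes interaction patterns at the community level, giving a traceable, quantitative framework for studying how human-LLM interaction evolves.
Link: https://arxiv.org/abs/2508.01213
Authors: Shengqi Zhu,Jeffrey M. Rzeszotarski,David Mimno
Affiliations: Cornell University
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: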
Abstract:Chat logs provide a rich source of information about LLM users, but patterns of user behavior are often masked by the variability of queries. We present a new task, segmenting chat queries into contents of requests, roles, query-specific context, and additional expressions. We find that, despite the familiarity of chat-based interaction, request-making in LLM queries remains significantly different from comparable human-human interactions. With the data resource, we introduce an important perspective of diachronic analyses with user expressions. We find that query patterns vary between early ones emphasizing requests, and individual users explore patterns but tend to converge with experience. Finally, we show that model capabilities affect user behavior, particularly with the introduction of new models, which are traceable at the community level.
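[NLP-123] Adaptive Content Restriction for Large Language Models via Suffix Optimization
[Quick read]: This paper addresses the difficulty of enforcing content restrictions flexibly and efficiently on deployed large language models (LLMs). When different user groups or scenarios define "harmful content" differently and those definitions change over time, conventional approaches that rely on supervised fine-tuning (SFT) become impractical because of their computational cost and poor scalability. The key to the solution is a new task, Adaptive Content Restriction (AdaCoRe), together with the first lightweight method for it that needs no model fine-tuning: Suffix Optimization (SOP). SOP appends a short, optimized suffix to the input prompt, which effectively prevents the target LLM from generating specific restricted terms without noticeably harming output quality.
Link: https://arxiv.org/abs/2508.01198
Authors: Yige Li,Peihai Jiang,Jun Sun,Peng Shu,Tianming Liu,Zhen Xiang
Affiliations: Singapore Management University; The University of Georgia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages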
Abstract:Large Language Models (LLMs) have demonstrated significant success across diverse applications. However, enforcing content restrictions remains a significant challenge due to their expansive output space. One aspect of content restriction is preventing LLMs from generating harmful content via model alignment approaches such as supervised fine-tuning (SFT). Yet, the need for content restriction may vary significantly across user groups, change rapidly over time, and not always align with general definitions of harmfulness. Applying SFT to each of these specific use cases is impractical due to the high computational, data, and storage demands. Motivated by this need, we propose a new task called Adaptive Content Restriction (AdaCoRe), which focuses on lightweight strategies – methods without model fine-tuning – to prevent deployed LLMs from generating restricted terms for specific use cases. We propose the first method for AdaCoRe, named Suffix Optimization (SOP), which appends a short, optimized suffix to any prompt to a) prevent a target LLM from generating a set of restricted terms, while b) preserving the output quality. To evaluate AdaCoRe approaches, including our SOP, we create a new Content Restriction Benchmark (CoReBench), which contains 400 prompts for 80 restricted terms across 8 carefully selected categories. We demonstrate the effectiveness of SOP on CoReBench, which outperforms the system-level baselines such as system suffix by 15%, 17%, 10%, 9%, and 6% on average restriction rates for Gemma2-2B, Mistral-7B, Vicuna-7B, Llama3-8B, and Llama3.1-8B, respectively. We also demonstrate that SOP is effective on POE, an online platform hosting various commercial LLMs, highlighting its practicality in real-world scenarios.
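Editor's sketch: the abstract compares methods by "restriction rate" but does not define it. The snippet assumes one plausible definition, the fraction of model outputs that contain none of the restricted terms, using case-insensitive substring matching. Both the definition and the helper name are assumptions rather than the CoReBench metric.

```python
def restriction_rate(outputs, restricted_terms):
    """Fraction of outputs containing no restricted term (assumed definition)."""
    if not outputs:
        return 0.0
    terms = [term.lower() for term in restricted_terms]
    # An output counts as "clean" when no restricted term appears in it.
    clean = sum(
        1 for text in outputs
        if not any(term in text.lower() for term in terms)
    )
    return clean / len(outputs)


# Hypothetical model outputs for three prompts with one restricted term.
outputs = ["I like apples.", "Bananas are great.", "No fruit mentioned here."]
rate = restriction_rate(outputs, ["banana"])
```

In this toy case two of the three outputs avoid the term, so `rate` is 2/3. Substring matching is deliberately crude; a stricter evaluation would presumably match on word boundaries or handle paraphrases.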
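[NLP-124] Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
[Quick read]: This paper asks whether Chain-of-Thought (CoT) prompting, widely believed to improve large language model (LLM) performance across tasks, reflects a reasoning mechanism that truly generalizes, or merely relies on surface consistency with the training data distribution. To answer this, the authors take a data distribution view and examine whether CoT reasoning reflects a structured inductive bias learned from training data, which lets the model conditionally generate reasoning paths similar to those seen during training. The key to the solution is DataAlchemy, an isolated and controlled environment for training LLMs from scratch and systematically probing their CoT reasoning under different distribution conditions. The experiments show that CoT reasoning is a brittle mirage that fails once test queries fall outside the training distribution. This exposes a fundamental limitation of current CoT methods: their effectiveness is bounded by the distribution discrepancy between training data and test queries, rather than reflecting genuine general reasoning ability.
Link: https://arxiv.org/abs/2508.01191
Authors: Chengshuai Zhao,Zhen Tan,Pingchuan Ma,Dawei Li,Bohan Jiang,Yancheng Wang,Yingzhen Yang,Huan Liu
Affiliations: Arizona State University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: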
Abstract:Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further. In this paper, we study CoT reasoning via a data distribution lens and investigate if CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. Thus, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning via three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment to train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.
[NLP-125] CSIRO-LT at SemEval-2025 Task 11: Adapting LLMs for Emotion Recognition for Multiple Languages
[Quick read]: This paper addresses the challenge of text-based emotion recognition across languages: emotional expression differs culturally and varies widely between languages, which makes it hard for models to identify an author's emotional states and their intensity accurately. The key to the solution is a fine-tuning strategy based on LoRA (Low-Rank Adaptation), in which a pre-trained multilingual large language model (LLM) is adapted separately for each language, markedly improving cross-lingual emotion recognition.
Link: https://arxiv.org/abs/2508.01161
Authors: Jiyu Chen,Necva Bölücü,Sarvnaz Karimi,Diego Mollá,Cécile L. Paris
Affiliations: CSIRO Data61; Macquarie University
Categories: Computation and Language (cs.CL)
Comments: In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), Vienna, Austria. Association for Computational Linguistics
Abstract:Detecting emotions across different languages is challenging due to the varied and culturally nuanced ways of emotional expressions. The SemEval 2025 Task 11: Bridging the Gap in Text-Based emotion shared task was organised to investigate emotion recognition across different languages. The goal of the task is to implement an emotion recogniser that can identify the basic emotional states that general third-party observers would attribute to an author based on their written text snippet, along with the intensity of those emotions. We report our investigation of various task-adaptation strategies for LLMs in emotion recognition. We show that the most effective method for this task is to fine-tune a pre-trained multilingual LLM with LoRA setting separately for each language.
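Editor's sketch: LoRA keeps a pre-trained weight matrix frozen and trains two low-rank factors, B (d_out x r) and A (r x d_in), whose product is added to it. That is why training a separate adapter per language stays cheap. The arithmetic below compares the trainable parameter counts for a single layer; the layer size and rank are illustrative assumptions, not the settings used in the paper.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out, d_in):
    """Parameters in the full (frozen) weight matrix of the same layer."""
    return d_out * d_in


# Illustrative layer: a 4096 x 4096 projection adapted with rank 8.
d_out, d_in, rank = 4096, 4096, 8
adapter = lora_trainable_params(d_out, d_in, rank)
ratio = adapter / full_params(d_out, d_in)
```

For this layer the adapter holds 65,536 trainable parameters against 16,777,216 in the full matrix, a ratio of 1/256, so storing one adapter per language adds little on top of the shared base model.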
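[NLP-126] Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates
[Quick read]: This paper examines whether large language models (LLMs) can generate structured clinical consultation templates that are both clinically sound and well prioritized. The core challenge is that although some frontier models (such as o3) produce templates with high comprehensiveness (up to 92.2%), they generally produce overly long output and fail to identify and prioritize the most important clinical questions under length constraints, with performance dropping markedly in narrative-driven specialties such as psychiatry and pain medicine. The key to the solution is a multi-agent pipeline that combines prompt optimization, semantic autograding, and prioritization analysis to evaluate systematically how well LLMs rank the importance of information in real clinical settings, supporting generative AI applications that better match the efficiency of actual physician communication.
Link: https://arxiv.org/abs/2508.01159
Authors: Liam G. McCoy,Fateme Nateghi Haredasht,Kanav Chopra,David Wu,David JH Wu,Abass Conteh,Sarita Khemani,Saloni Kumar Maharaj,Vishnu Ravi,Arth Pahwa,Yingjie Weng,Leah Rosengaus,Lena Giang,Kelvin Zhenghao Li,Olivia Jee,Daniel Shirvani,Ethan Goh,Jonathan H. Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: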
Abstract:This study evaluates the capacity of large language models (LLMs) to generate structured clinical consultation templates for electronic consultation. Using 145 expert-crafted templates developed and routinely used by Stanford’s eConsult team, we assess frontier models – including o3, GPT-4o, Kimi K2, Claude 4 Sonnet, Llama 3 70B, and Gemini 2.5 Pro – for their ability to produce clinically coherent, concise, and prioritized clinical question schemas. Through a multi-agent pipeline combining prompt optimization, semantic autograding, and prioritization analysis, we show that while models like o3 achieve high comprehensiveness (up to 92.2%), they consistently generate excessively long templates and fail to correctly prioritize the most clinically important questions under length constraints. Performance varies across specialties, with significant degradation in narrative-driven fields such as psychiatry and pain medicine. Our findings demonstrate that LLMs can enhance structured clinical information exchange between physicians, while highlighting the need for more robust evaluation methods that capture a model’s ability to prioritize clinically salient information within the time constraints of real-world physician communication.
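[NLP-127] DBAIOps: A Reasoning LLM-Enhanced Database Operation and Maintenance System using Knowledge Graphs
[Quick read]: This paper addresses the difficulty of making effective use of expert experience in database operation and maintenance (OM), especially during automatic diagnosis and recovery. Existing rule-based methods cannot incorporate non-numerical, textual experience (such as troubleshooting guidance in manuals), while methods based on large language models (LLMs) retrieve fragmented information and therefore produce inaccurate or generic results. The key to the solution is DBAIOps, a hybrid database OM system that combines a reasoning LLM with knowledge graphs. First, it builds a heterogeneous knowledge graph that gives diagnosis experience a structured representation, extracted from thousands of documents by a semi-automatic construction algorithm. Second, it defines more than 800 reusable anomaly models that identify both directly alerted metrics and implicitly correlated experience and metrics. Finally, a two-stage graph evolution mechanism automatically explores diagnosis paths and fills in missing relations, after which a reasoning LLM (such as DeepSeek-R1) infers root causes and generates clear diagnosis reports, achieving DBA-style diagnosis.
Link: https://arxiv.org/abs/2508.01136
Authors: Wei Zhou,Peng Sun,Xuanhe Zhou,Qianglei Zang,Ji Xu,Tieying Zhang,Guoliang Li,Fan Wu
Affiliations: Shanghai Jiao Tong University; Baisheng (Shenzhen) Technology Co., Ltd.; Tsinghua University; Bytedance
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: DBAIOps supports 25 database systems and has been deployed in 20 real-world scenarios, covering domains like finance, energy, and healthcare. See website at: this https URL ; See code at: this https URL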
Abstract:The operation and maintenance (OM) of database systems is critical to ensuring system availability and performance, typically requiring expert experience (e.g., identifying metric-to-anomaly relations) for effective diagnosis and recovery. However, existing automatic database OM methods, including commercial products, cannot effectively utilize expert experience. On the one hand, rule-based methods only support basic OM tasks (e.g., metric-based anomaly detection), which are mostly numerical equations and cannot effectively incorporate literal OM experience (e.g., troubleshooting guidance in manuals). On the other hand, LLM-based methods, which retrieve fragmented information (e.g., standard documents + RAG), often generate inaccurate or generic results. To address these limitations, we present DBAIOps, a novel hybrid database OM system that combines reasoning LLMs with knowledge graphs to achieve DBA-style diagnosis. First, DBAIOps introduces a heterogeneous graph model for representing the diagnosis experience, and proposes a semi-automatic graph construction algorithm to build that graph from thousands of documents. Second, DBAIOps develops a collection of (800+) reusable anomaly models that identify both directly alerted metrics and implicitly correlated experience and metrics. Third, for each anomaly, DBAIOps proposes a two-stage graph evolution mechanism to explore relevant diagnosis paths and identify missing relations automatically. It then leverages a reasoning LLM (e.g., DeepSeek-R1) to infer root causes and generate clear diagnosis reports for both DBAs and common users. Our evaluation over four mainstream database systems (Oracle, MySQL, PostgreSQL, and DM8) demonstrates that DBAIOps outperforms state-of-the-art baselines, 34.85% and 47.22% higher in root cause and human evaluation accuracy, respectively.
[NLP-128] Cross-Domain Web Information Extraction at Pinterest
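[Quick read]: This paper addresses the problem of efficiently and accurately extracting structured product attributes from the vast amount of unstructured web content, which is essential for improving user experience and content distribution on e-commerce platforms. The key to the solution is a novel webpage representation that fuses structural, visual, and text modalities: each visible HTML node is encoded compactly with its text content, style, and layout information, which optimizes learning for small models such as XGBoost. Experiments show that this representation lets simple models outperform complex Large Language Models (LLMs) on attribute extraction while sustaining a throughput of over 1,000 URLs per second at one-thousandth the cost of the cheapest GPT alternative.
Link: https://arxiv.org/abs/2508.01096
Authors: Michael Farag,Patrick Halina,Andrey Zaytsev,Alekhya Munagala,Imtihan Ahmed,Junhao Wang
Affiliations: Pinterest
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: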
Abstract:The internet offers a massive repository of unstructured information, but it’s a significant challenge to convert this into a structured format. At Pinterest, the ability to accurately extract structured product data from e-commerce websites is essential to enhance user experiences and improve content distribution. In this paper, we present Pinterest’s system for attribute extraction, which achieves remarkable accuracy and scalability at a manageable cost. Our approach leverages a novel webpage representation that combines structural, visual, and text modalities into a compact form, optimizing it for small model learning. This representation captures each visible HTML node with its text, style and layout information. We show how this allows simple models such as eXtreme Gradient Boosting (XGBoost) to extract attributes more accurately than much more complex Large Language Models (LLMs) such as Generative Pre-trained Transformer (GPT). Our results demonstrate a system that is highly scalable, processing over 1,000 URLs per second, while being 1000 times more cost-effective than the cheapest GPT alternatives.
[NLP-129] CADDesigner: Conceptual Design of CAD Models Based on General-Purpose Agent
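[Quick read]: This paper addresses the high level of designer expertise that Computer-Aided Design (CAD) demands in industrial manufacturing and the resulting low design efficiency. The key to the solution is an LLM-powered agent for CAD conceptual design built on a novel Context-Independent Imperative Paradigm (CIP). The agent accepts textual descriptions and freehand sketches as input and analyzes and clarifies requirements through interactive dialogue in order to generate high-quality CAD modeling code. The method also uses iterative visual feedback to improve model quality and stores generated design cases in a structured knowledge base so that the agent's code generation ability keeps improving.
Link: https://arxiv.org/abs/2508.01031
Authors: Jingzhe Ni,Xiaolong Yin,Xintong Li,Xingyu Lu,Ji Wei,Ruofeng Tong,Min Tang,Peng Du
Affiliations: Zhejiang University; Tsinghua University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: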
Abstract:Computer-Aided Design (CAD) plays a pivotal role in industrial manufacturing but typically requires a high level of expertise from designers. To lower the entry barrier and improve design efficiency, we present an agent for CAD conceptual design powered by large language models (LLMs). The agent accepts both abstract textual descriptions and freehand sketches as input, engaging in interactive dialogue with users to refine and clarify design requirements through comprehensive requirement analysis. Built upon a novel Context-Independent Imperative Paradigm (CIP), the agent generates high-quality CAD modeling code. During the generation process, the agent incorporates iterative visual feedback to improve model quality. Generated design cases are stored in a structured knowledge base, enabling continuous improvement of the agent’s code generation capabilities. Experimental results demonstrate that our method achieves state-of-the-art performance in CAD code generation.
[NLP-130] UrBLiMP: A Benchmark for Evaluating the Linguistic Competence of Large Language Models in Urdu
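[Quick read]: This paper addresses the limited understanding of how well multilingual Large Language Models (LLMs) capture grammatical knowledge in low-resource languages such as Urdu. Existing models usually perform well on high-resource languages such as English, but their fine-grained syntactic competence in Urdu remains unclear. The key to the solution is UrBLiMP, the Urdu Benchmark of Linguistic Minimal Pairs, a high-quality benchmark targeting core Urdu syntactic phenomena. It contains 5,696 minimal pairs covering ten core syntactic phenomena, and human validation of the annotations yielded 96.10% inter-annotator agreement. The authors then systematically evaluate twenty multilingual LLMs, exposing performance differences and limitations in Urdu syntactic understanding and providing a quantifiable evaluation standard for future work on low-resource languages.
Link: https://arxiv.org/abs/2508.01006
Authors: Farah Adeeba,Brian Dillon,Hassan Sajjad,Rajesh Bhatt
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: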
Abstract:Multilingual Large Language Models (LLMs) have shown remarkable performance across various languages; however, they often include significantly less data for low-resource languages such as Urdu compared to high-resource languages like English. To assess the linguistic knowledge of LLMs in Urdu, we present the Urdu Benchmark of Linguistic Minimal Pairs (UrBLiMP) i.e. pairs of minimally different sentences that contrast in grammatical acceptability. UrBLiMP comprises 5,696 minimal pairs targeting ten core syntactic phenomena, carefully curated using the Urdu Treebank and diverse Urdu text corpora. A human evaluation of UrBLiMP annotations yielded a 96.10% inter-annotator agreement, confirming the reliability of the dataset. We evaluate twenty multilingual LLMs on UrBLiMP, revealing significant variation in performance across linguistic phenomena. While LLaMA-3-70B achieves the highest average accuracy (94.73%), its performance is statistically comparable to other top models such as Gemma-3-27B-PT. These findings highlight both the potential and the limitations of current multilingual LLMs in capturing fine-grained syntactic knowledge in low-resource languages.
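A minimal sketch of how a minimal-pair benchmark of this kind is commonly scored, assuming the BLiMP-style rule that a model is counted correct on a pair when it assigns a higher score (for example a sentence log-probability) to the acceptable sentence. The abstract does not state the exact scoring rule, and the sentences and scores below are hypothetical English stand-ins, not UrBLiMP data.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (acceptable, unacceptable) pairs where the
    acceptable sentence receives the strictly higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical log-probabilities standing in for a real language model.
TOY_LOGPROBS = {
    "the dogs bark": -12.0,
    "the dogs barks": -15.5,
    "she has eaten": -10.0,
    "she have eaten": -9.0,  # the toy scorer gets this pair wrong
}

toy_pairs = [
    ("the dogs bark", "the dogs barks"),
    ("she has eaten", "she have eaten"),
]

accuracy = minimal_pair_accuracy(toy_pairs, TOY_LOGPROBS.__getitem__)
print(accuracy)
```

In practice `score` would wrap a real model's sentence log-probability; per-phenomenon accuracy follows by grouping the pairs before calling the function.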
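[NLP-131] MAO-ARAG: Multi-Agent Orchestration for Adaptive Retrieval-Augmented Generation
[Quick read]: This paper addresses the difficulty fixed Retrieval-Augmented Generation (RAG) pipelines have in balancing performance and cost efficiency across real-world Question Answering (QA) queries of varying complexity. The key to the solution is MAO-ARAG, an adaptive RAG framework based on multi-agent orchestration. It models RAG as a multi-turn process, defines multiple executor agents (query reformulation, document selection, and generation), and introduces a planner agent that dynamically selects and composes a workflow for each query. The planner agent is trained with reinforcement learning using an outcome-based reward (F1 score) and a cost-based penalty, which keeps answer quality high while controlling resource consumption.
Link: https://arxiv.org/abs/2508.01005
Authors: Yiqun Chen,Erhan Zhang,Lingyong Yan,Shuaiqiang Wang,Jizhou Huang,Dawei Yin,Jiaxin Mao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: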
Abstract:In question-answering (QA) systems, Retrieval-Augmented Generation (RAG) has become pivotal in enhancing response accuracy and reducing hallucination issues. The architecture of RAG systems varies significantly, encompassing single-round RAG, iterative RAG, and reasoning RAG, each tailored to address different types of queries. Due to the varying complexity of real-world queries, a fixed RAG pipeline often struggles to balance performance and cost efficiency across different queries. To address this challenge, we propose an adaptive RAG framework called MAO-ARAG, which leverages multi-agent orchestration. Our adaptive RAG is conceived as a multi-turn framework. Specifically, we define multiple executor agents, representing typical RAG modules such as query reformulation agents, document selection agents, and generation agents. A planner agent intelligently selects and integrates the appropriate agents from these executors into a suitable workflow tailored for each query, striving for high-quality answers while maintaining reasonable costs. During each turn, the planner agent is trained using reinforcement learning, guided by an outcome-based reward (F1 score) and a cost-based penalty, continuously improving answer quality while keeping costs within a reasonable range. Experiments conducted on multiple QA datasets demonstrate that our approach, which dynamically plans workflows for each query, not only achieves high answer quality but also maintains both cost and latency within acceptable limits. The code of MAO-ARAG is on this https URL.
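A minimal sketch of a reward that combines an outcome-based F1 score with a cost-based penalty, as the abstract describes in words. The abstract does not give the exact formula, so the linear form, the `cost_weight` value, and the use of "number of agent calls" as the cost unit are all assumptions made for illustration; the F1 here is the usual token-overlap F1 from extractive QA evaluation.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def planner_reward(prediction: str, reference: str,
                   cost: float, cost_weight: float = 0.1) -> float:
    """Assumed reward shape: answer quality minus a weighted cost.
    `cost` and `cost_weight` are hypothetical, not from the paper."""
    return token_f1(prediction, reference) - cost_weight * cost


# Same answer quality, different workflow cost (e.g. agent calls).
cheap = planner_reward("paris france", "paris", cost=2)
expensive = planner_reward("paris france", "paris", cost=5)
print(cheap, expensive)
```

With a penalty of this shape, two workflows that produce the same answer are ranked by cost, which is the trade-off the planner is meant to learn.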
[NLP-132] Small sample-based adaptive text classification through iterative and contrastive description refinement
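[Quick read]: This paper addresses the performance bottleneck of zero-shot text classification in domains with evolving knowledge and ambiguous category boundaries, such as ticketing systems, where Large Language Models (LLMs) generalize poorly because of limited topic separability and few-shot methods are constrained by insufficient data diversity. The key to the solution is a classification framework that combines iterative topic refinement, contrastive prompting, and active learning. A small set of labeled samples is first used to generate initial topic labels; misclassified or ambiguous samples then drive iterative contrastive prompting that explicitly teaches the model to separate closely related classes. A human-in-the-loop component lets users revise or add category definitions in natural language, so new categories can be integrated without retraining. The reported accuracy is 91% on AGNews (3 seen, 1 unseen class) and 84% on DBpedia (8 seen, 1 unseen class), with a minimal accuracy shift after unseen classes are introduced (82% and 87%, respectively).
Link: https://arxiv.org/abs/2508.00957
Authors: Amrit Rajeev,Udayaadithya Avadhanam,Harshula Tulapurkar,SaiBarath Sundar
Affiliations: Mphasis Limited
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: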
Abstract:Zero-shot text classification remains a difficult task in domains with evolving knowledge and ambiguous category boundaries, such as ticketing systems. Large language models (LLMs) often struggle to generalize in these scenarios due to limited topic separability, while few-shot methods are constrained by insufficient data diversity. We propose a classification framework that combines iterative topic refinement, contrastive prompting, and active learning. Starting with a small set of labeled samples, the model generates initial topic labels. Misclassified or ambiguous samples are then used in an iterative contrastive prompting process to refine category distinctions by explicitly teaching the model to differentiate between closely related classes. The framework features a human-in-the-loop component, allowing users to introduce or revise category definitions in natural language. This enables seamless integration of new, unseen categories without retraining, making the system well-suited for real-world, dynamic environments. The evaluations on AGNews and DBpedia demonstrate strong performance: 91% accuracy on AGNews (3 seen, 1 unseen class) and 84% on DBpedia (8 seen, 1 unseen), with minimal accuracy shift after introducing unseen classes (82% and 87%, respectively). The results highlight the effectiveness of prompt-based semantic reasoning for fine-grained classification with limited supervision.
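[NLP-133] XAutoLM: Efficient Fine-Tuning of Language Models via Meta-Learning and AutoML
[Quick read]: This paper addresses the automation of model selection and Hyperparameter Optimization (HPO) for resource-efficient Language Model (LM) fine-tuning, where repeated trials incur substantial computational overhead and environmental impact. According to the authors, no existing automated framework tackles the whole model selection and HPO task at once. The key to the solution is XAutoLM, an AutoML framework augmented with meta-learning. It extracts task-level and system-level meta-features from past experiences to bias sampling toward promising configurations and away from costly dead ends, which lets it optimize both discriminative and generative LM fine-tuning pipelines efficiently. Across four text classification and two question-answering benchmarks, XAutoLM surpasses the zero-shot optimizer's peak F1 on five of six tasks, reduces evaluation time and error ratios, and finds more pipelines above the zero-shot Pareto front.
Link: https://arxiv.org/abs/2508.00924
Authors: Ernesto L. Estevanell-Valladares,Suilan Estevez-Velarde,Yoan Gutiérrez,Andrés Montoyo,Ruslan Mitkov
Affiliations: University of Alicante; University of Havana; Lancaster University
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 10 figures, 7 tables. Preprint. Under review at EMNLP 2025. This is not the final version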
Abstract:Experts in machine learning leverage domain knowledge to navigate decisions in model selection, hyperparameter optimisation, and resource allocation. This is particularly critical for fine-tuning language models (LMs), where repeated trials incur substantial computational overhead and environmental impact. However, no existing automated framework simultaneously tackles the entire model selection and HPO task for resource-efficient LM fine-tuning. We introduce XAutoLM, a meta-learning-augmented AutoML framework that reuses past experiences to optimise discriminative and generative LM fine-tuning pipelines efficiently. XAutoLM learns from stored successes and failures by extracting task- and system-level meta-features to bias its sampling toward fruitful configurations and away from costly dead ends. On four text classification and two question-answering benchmarks, XAutoLM surpasses zero-shot optimiser’s peak F1 on five of six tasks, cuts mean evaluation time by up to 4.5x, reduces error ratios by up to sevenfold, and uncovers up to 50% more pipelines above the zero-shot Pareto front. In contrast, simpler memory-based baselines suffer negative transfer. We release XAutoLM and our experience store to catalyse resource-efficient, Green AI fine-tuning in the NLP community.
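The abstract compares optimizers by how many pipelines land above a Pareto front. As a rough illustration of what a Pareto front is in this setting, the sketch below assumes two objectives, F1 (higher is better) and evaluation time (lower is better); the abstract does not state the exact objectives, and the pipeline names and numbers are hypothetical.

```python
from typing import List, NamedTuple


class Pipeline(NamedTuple):
    name: str
    f1: float            # higher is better
    eval_seconds: float  # lower is better


def dominates(a: Pipeline, b: Pipeline) -> bool:
    """True if `a` is at least as good as `b` on both objectives
    and strictly better on at least one."""
    no_worse = a.f1 >= b.f1 and a.eval_seconds <= b.eval_seconds
    better = a.f1 > b.f1 or a.eval_seconds < b.eval_seconds
    return no_worse and better


def pareto_front(pipelines: List[Pipeline]) -> List[Pipeline]:
    """Pipelines that no other pipeline dominates."""
    return [p for p in pipelines
            if not any(dominates(q, p) for q in pipelines)]


# Hypothetical fine-tuning results.
candidates = [
    Pipeline("A", f1=0.80, eval_seconds=100),
    Pipeline("B", f1=0.85, eval_seconds=300),
    Pipeline("C", f1=0.78, eval_seconds=250),  # dominated by A
    Pipeline("D", f1=0.85, eval_seconds=400),  # dominated by B
]

front = pareto_front(candidates)
print([p.name for p in front])
```

A new pipeline counts as an improvement over an existing front when no pipeline on that front dominates it.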
[NLP-134] Cyber-Zero: Training Cybersecurity Agents without Runtime
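[Quick read]: This paper addresses how to train high-quality cybersecurity Large Language Models (LLMs) when no executable runtime environment is available. Conventional approaches rely on real runtime environments to generate agent trajectories, but in cybersecurity settings such as Capture The Flag (CTF) competitions, challenge configurations and execution contexts are often ephemeral or restricted. The key to the solution is Cyber-Zero, presented as the first runtime-free framework for this purpose. It uses publicly available CTF writeups and persona-driven LLM simulation to reverse-engineer runtime behaviors and generate realistic, long-horizon interaction sequences, synthesizing high-quality agent trajectories without actual environments. Agents trained on these trajectories achieve up to 13.1% absolute gains over baseline models on three CTF benchmarks.
Link: https://arxiv.org/abs/2508.00910
Authors: Terry Yue Zhuo,Dingmin Wang,Hantian Ding,Varun Kumar,Zijian Wang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Public Link: this https URL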
Abstract:Large Language Models (LLMs) have achieved remarkable success in software engineering tasks when trained with executable runtime environments, particularly in resolving GitHub issues. However, such runtime environments are often unavailable in other domains, especially cybersecurity, where challenge configurations and execution contexts are ephemeral or restricted. We present Cyber-Zero, the first runtime-free framework for synthesizing high-quality agent trajectories to train cybersecurity LLMs. Cyber-Zero leverages publicly available CTF writeups and employs persona-driven LLM simulation to reverse-engineer runtime behaviors and generate realistic, long-horizon interaction sequences without actual environments. Using trajectories synthesized by Cyber-Zero, we train LLM-based agents that achieve up to 13.1% absolute performance gains over baseline models on three prominent CTF benchmarks: InterCode-CTF, NYU CTF Bench, and Cybench. Our best model, Cyber-Zero-32B, establishes new state-of-the-art performance among open-weight models, matching the capabilities of proprietary systems like DeepSeek-V3-0324 and Claude-3.5-Sonnet while offering superior cost-effectiveness, and demonstrating that runtime-free trajectory synthesis can effectively democratize the development of state-of-the-art cybersecurity agents.
[NLP-135] An analysis of AI Decision under Risk: Prospect theory emerges in Large Language Models
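[Quick read]: This paper examines whether Large Language Models (LLMs) show the characteristically human biases in risk judgment when making decisions under uncertainty, and in particular whether the prospect theory of Daniel Kahneman and Amos Tversky applies to today's state-of-the-art chain-of-thought reasoning models. The approach is to design risky decision tasks across a range of scenarios and test empirically how sensitive the models are to losses versus gains in different contexts. The study finds that the linguistic frame is central to explaining variance in risk appetite: military scenarios produce far larger framing effects than civilian ones. This suggests that the models have absorbed human heuristics and biases and that these biases are context dependent, in line with the contingent, localized character described by Wittgenstein's notion of language games. The findings are also used to reframe the debate about reasoning and memorization in LLMs.
Link: https://arxiv.org/abs/2508.00902
Authors: Kenneth Payne
Affiliations: King’s College London
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 26 pages, 2 figures, 9 tables, 2 appendices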
Abstract:Judgment of risk is key to decision-making under uncertainty. As Daniel Kahneman and Amos Tversky famously discovered, humans do so in a distinctive way that departs from mathematical rationalism. Specifically, they demonstrated experimentally that humans accept more risk when they feel themselves at risk of losing something than when they might gain. I report the first tests of Kahneman and Tversky’s landmark ‘prospect theory’ with Large Language Models, including today’s state of the art chain-of-thought ‘reasoners’. In common with humans, I find that prospect theory often anticipates how these models approach risky decisions across a range of scenarios. I also demonstrate that context is key to explaining much of the variance in risk appetite. The ‘frame’ through which risk is apprehended appears to be embedded within the language of the scenarios tackled by the models. Specifically, I find that military scenarios generate far larger ‘framing effects’ than do civilian settings, ceteris paribus. My research suggests, therefore, that language models the world, capturing our human heuristics and biases. But also that these biases are uneven - the idea of a ‘frame’ is richer than simple gains and losses. Wittgenstein’s notion of ‘language games’ explains the contingent, localised biases activated by these scenarios. Finally, I use my findings to reframe the ongoing debate about reasoning and memorisation in LLMs.
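For readers unfamiliar with prospect theory, the sketch below illustrates the asymmetry the abstract refers to, using the standard power-form value function with the median parameter estimates reported by Tversky and Kahneman in 1992 (curvature 0.88, loss aversion 2.25). It is a simplification: probability weighting is omitted, and the monetary amounts are hypothetical rather than taken from the paper's scenarios.

```python
from typing import Iterable, Tuple


def value(x: float, alpha: float = 0.88, loss_aversion: float = 2.25) -> float:
    """Subjective value of a gain (x >= 0) or loss (x < 0)
    measured relative to a reference point."""
    if x >= 0:
        return x ** alpha
    return -loss_aversion * ((-x) ** alpha)


def gamble_value(outcomes: Iterable[Tuple[float, float]]) -> float:
    """Value of a gamble given (probability, outcome) pairs.
    Uses raw probabilities, i.e. no probability weighting."""
    return sum(p * value(x) for p, x in outcomes)


# Gain frame: a sure 500 versus a 50% chance of 1000.
sure_gain = gamble_value([(1.0, 500)])
risky_gain = gamble_value([(0.5, 1000), (0.5, 0)])

# Loss frame: a sure -500 versus a 50% chance of -1000.
sure_loss = gamble_value([(1.0, -500)])
risky_loss = gamble_value([(0.5, -1000), (0.5, 0)])

print("prefers sure option over gains:", sure_gain > risky_gain)
print("prefers gamble over losses:", risky_loss > sure_loss)
```

The same expected amounts lead to opposite choices depending on whether they are framed as gains or losses, which is the kind of framing effect the study tests for in model responses.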
[NLP-136] Filtering with Self-Attention and Storing with MLP: One-Layer Transformers Can Provably Acquire and Extract Knowledge
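[Quick read]: This paper addresses the lack of theoretical understanding of how Large Language Models (LLMs) store knowledge during pre-training and extract it at inference time after fine-tuning. Prior theoretical work is largely limited to single-layer, attention-only architectures, whereas empirical evidence points to MLP modules as the main contributors to knowledge storage, and such attention-only models may fail to acquire or extract factual knowledge. The key to the solution is a tractable one-layer transformer framework that includes both self-attention and Multi-Layer Perceptron (MLP) modules, together with convergence and generalization guarantees obtained by tracking its gradient dynamics: 1) the model can reach near-optimal training loss during pre-training, indicating effective knowledge acquisition; 2) with a large fine-tuning dataset and specific data multiplicity conditions, the model achieves low generalization error on knowledge that was learned in pre-training but not reinforced in fine-tuning, indicating successful extraction; otherwise it hallucinates. The analysis covers both full and low-rank fine-tuning and offers insight into empirical phenomena such as the role of learning rate schedules.
Link: https://arxiv.org/abs/2508.00901
Authors: Ruichen Xu,Kexin Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: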
Abstract:Modern large language models excel in knowledge-intensive tasks, yet how transformers acquire (store) knowledge during pre-training and extract (retrieve) it during post-fine-tuning inference remains theoretically opaque. While prior theoretical work has begun to investigate these questions through the analysis of training dynamics, such studies are limited to single-layer, attention-only architectures. However, most existing studies suggest that MLPs are the most contributing components for storing knowledge in transformer-based language models. Meanwhile, our empirical investigations reveal that such simplified models, when trained using standard next-token prediction objectives, may be incapable of acquiring or extracting factual knowledge. To overcome this limitation, we introduce a tractable one-layer transformer framework that crucially incorporates both self-attention and MLP modules. By tracking its gradient dynamics, we establish convergence and generalization guarantees that illuminate the ability of knowledge acquisition and extraction. We prove that 1) Transformers can achieve near-optimal training loss during pre-training, signifying effective knowledge acquisition; 2) With a large fine-tuning dataset and specific data multiplicity conditions met, transformers can achieve low generalization error when tested on factual knowledge learned during pre-training but not reinforced during the fine-tuning, indicating successful knowledge extraction; 3) When the conditions are not satisfied, transformers exhibit high generalization loss, resulting in hallucinations. Our analysis includes both full fine-tuning and low-rank fine-tuning. Furthermore, our analysis offers theoretical insights into several pertinent empirical phenomena, such as the role of learning rate schedules. Experiments on synthetic and real-world PopQA datasets with GPT-2 and Llama-3.2-1B validate our results.
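[NLP-137] AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks
[Quick read]: This paper addresses test-time compute-optimal scaling in multi-stage complex tasks: given a sequence of heterogeneous subtasks, how to choose a suitable Large Language Model (LLM) and allocate a compute budget for each subtask so as to maximize overall performance. The core challenges are that (i) the search space of model and budget combinations is combinatorial and inference is expensive, which makes brute-force search impractical, and (ii) the optimal allocations across subtasks are interdependent, which further complicates optimization. The key to the solution is AgentTTS, an LLM-agent-based framework that autonomously searches for compute-optimal allocations through iterative, feedback-driven interaction with the execution environment. The authors report better search efficiency than traditional and other LLM-based baselines, improved robustness to varying training set sizes, and enhanced interpretability.
Link: https://arxiv.org/abs/2508.00890
Authors: Fali Wang,Hui Liu,Zhenwei Dai,Jingying Zeng,Zhiwei Zhang,Zongyu Wu,Chen Luo,Zhen Li,Xianfeng Tang,Qi He,Suhang Wang
Affiliations: The Pennsylvania State University; Amazon
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Under review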
Abstract:Test-time scaling (TTS) enhances the performance of large language models (LLMs) by allocating additional compute resources during inference. However, existing research primarily investigates TTS in single-stage tasks, while many real-world problems are multi-stage complex tasks composed of a sequence of heterogeneous subtasks, each requiring an LLM with specific capabilities. Therefore, we study a novel problem: the test-time compute-optimal scaling in multi-stage complex tasks, aiming to select suitable models and allocate budgets per subtask to maximize overall performance. TTS in multi-stage tasks introduces two fundamental challenges: (i) The combinatorial search space of model and budget allocations, combined with the high cost of inference, makes brute-force search impractical. (ii) The optimal model and budget allocations across subtasks are interdependent, increasing the complexity of the compute-optimal search. To address this gap, we conduct extensive pilot experiments on four tasks across six datasets, deriving three empirical insights characterizing the behavior of LLMs in multi-stage complex tasks. Informed by these insights, we propose AgentTTS, an LLM-agent-based framework that autonomously searches for compute-optimal allocations through iterative feedback-driven interactions with the execution environment. Experimental results demonstrate that AgentTTS significantly outperforms traditional and other LLM-based baselines in search efficiency, and shows improved robustness to varying training set sizes and enhanced interpretability.
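To make challenge (i) concrete, the sketch below counts the configurations a brute-force search would have to evaluate when each subtask independently picks one model and one budget level. The model names, budget levels, and subtasks are hypothetical placeholders, and treating the budget as a small discrete set is an assumption made only to keep the count simple.

```python
from itertools import product

# Hypothetical candidates; none of these names come from the paper.
models = ["model-small", "model-medium", "model-large", "model-xl"]
budgets = [1, 2, 4, 8, 16]  # e.g. samples generated per subtask
subtasks = ["retrieval", "reasoning", "generation"]

# Every (model, budget) choice available to a single subtask.
per_subtask = list(product(models, budgets))

# A full configuration assigns one such choice to every subtask,
# so the space grows exponentially in the number of subtasks.
num_configs = len(per_subtask) ** len(subtasks)
print(len(per_subtask), num_configs)

# Configurations can be enumerated lazily without materializing them.
first_config = next(product(per_subtask, repeat=len(subtasks)))
print(dict(zip(subtasks, first_config)))
```

Because each configuration requires running the whole multi-stage pipeline to score it, even this small toy space would be costly to sweep exhaustively, which motivates a guided search.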
[NLP-138] FECT: Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts
[Quick read]: This paper addresses hallucination by Large Language Models (LLMs) in enterprise settings, specifically in contact center conversation analysis, where generated content may lack factual grounding. Because ground-truth labels often do not exist for interpretations of sentiment or of the root causes of business problems, conventional factuality evaluation is hard to apply. The key to the solution is a new 3D paradigm (Decompose, Decouple, Detach) that anchors factuality judgments in linguistically informed evaluation criteria. Under this paradigm the authors build FECT (Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts), a benchmark dataset for this task. They also report findings from aligning LLM judges with the 3D paradigm, offering an approach to automatically evaluating the factuality of AI output in this setting.
Link: https://arxiv.org/abs/2508.00889
Authors: Hagyeong Shin,Binoy Robin Dalal,Iwona Bialynicka-Birula,Navjot Matharu,Ryan Muir,Xingwei Yang,Samuel W. K. Wong
Affiliations: Cresta; University of Waterloo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for an oral presentation at Agentic GenAI Evaluation KDD 2025: KDD workshop on Evaluation and Trustworthiness of Agentic and Generative AI Models
Abstract:Large language models (LLMs) are known to hallucinate, producing natural language outputs that are not grounded in the input, reference materials, or real-world knowledge. In enterprise applications where AI features support business decisions, such hallucinations can be particularly detrimental. LLMs that analyze and summarize contact center conversations introduce a unique set of challenges for factuality evaluation, because ground-truth labels often do not exist for analytical interpretations about sentiments captured in the conversation and root causes of the business problems. To remedy this, we first introduce a 3D (Decompose, Decouple, Detach) paradigm in the human annotation guideline and the LLM-judges’ prompt to ground the factuality labels in linguistically-informed evaluation criteria. We then introduce FECT, a novel benchmark dataset for Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts, labeled under our 3D paradigm. Lastly, we report our findings from aligning LLM-judges on the 3D paradigm. Overall, our findings contribute a new approach for automatically evaluating the factuality of outputs generated by an AI system for analyzing contact center conversations.
[NLP-139] Hallucination Detection and Mitigation with Diffusion in Multi-Variate Time-Series Foundation Models
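[Quick read]: This paper addresses the absence of clear definitions of hallucination, and of detection and mitigation methods, for Multi-Variate Time-Series (MVTS) foundation models. Foundation models for natural language processing already have established definitions and countermeasures, but no analogous framework exists for MVTS. The paper proposes new definitions of MVTS hallucination, a detection mechanism that uses a diffusion model to estimate hallucination levels, and a corresponding mitigation method. Relational datasets derived from popular time-series datasets are used to benchmark relational hallucination levels. The authors find that open-source pre-trained MVTS imputation foundation models relationally hallucinate on average up to 59.5% as much as a weak baseline, and that the proposed mitigation reduces this by up to 47.7%.
Link: https://arxiv.org/abs/2508.00881
Authors: Vijja Wichitwechkarn,Charles Fox,Ruchi Choudhary
Affiliations: University of Cambridge; University of Lincoln
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: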
Abstract:Foundation models for natural language processing have many coherent definitions of hallucination and methods for its detection and mitigation. However, analogous definitions and methods do not exist for multi-variate time-series (MVTS) foundation models. We propose new definitions for MVTS hallucination, along with new detection and mitigation methods using a diffusion model to estimate hallucination levels. We derive relational datasets from popular time-series datasets to benchmark these relational hallucination levels. Using these definitions and models, we find that open-source pre-trained MVTS imputation foundation models relationally hallucinate on average up to 59.5% as much as a weak baseline. The proposed mitigation method reduces this by up to 47.7% for these models. The definition and methods may improve adoption and safe usage of MVTS foundation models.
[NLP-140] Rethinking Graph-Based Document Classification: Learning Data-Driven Structures Beyond Heuristic Approaches
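[Quick read]: This paper addresses the reliance of graph-based document representations on heuristics or domain-specific knowledge, which makes them less data-driven and ties them to particular domains. The key to the solution is a method that learns document graph structures automatically: it builds homogeneous weighted graphs with sentences as nodes, learns the edges with a self-attention model that identifies dependencies between sentence pairs, and applies a statistical filtering strategy that retains only strongly correlated sentences, improving graph quality while reducing graph size. Experiments on three document classification datasets show that the learned graphs outperform heuristic-based graphs in accuracy and F1 score, and that the statistical filtering improves classification robustness.
Link: https://arxiv.org/abs/2508.00864
Authors: Margarita Bugueño,Gerard de Melo
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 3 figures, 3 tables. Appendix starts on page 10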
Abstract:In document classification, graph-based models effectively capture document structure, overcoming sequence length limitations and enhancing contextual understanding. However, most existing graph document representations rely on heuristics, domain-specific rules, or expert knowledge. Unlike previous approaches, we propose a method to learn data-driven graph structures, eliminating the need for manual design and reducing domain dependence. Our approach constructs homogeneous weighted graphs with sentences as nodes, while edges are learned via a self-attention model that identifies dependencies between sentence pairs. A statistical filtering strategy aims to retain only strongly correlated sentences, improving graph quality while reducing the graph size. Experiments on three document classification datasets demonstrate that learned graphs consistently outperform heuristic-based graphs, achieving higher accuracy and F_1 score. Furthermore, our study demonstrates the effectiveness of the statistical filtering in improving classification robustness. These results highlight the potential of automatic graph generation over traditional heuristic approaches and open new directions for broader applications in NLP.
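[NLP-141] The Attribution Crisis in LLM Search Results
[Quick read]: This paper addresses the "attribution gap" in web-enabled LLMs: when answering queries, models often fail to credit the web pages from which they obtained information, which leaves the provenance of the answer opaque. Drawing on approximately 14,000 real-world LMArena conversation logs, the study finds that mainstream systems (Google Gemini, OpenAI GPT-4o, and Perplexity Sonar) exhibit no-search, no-citation, and high-volume, low-credit behaviors, and that citation efficiency varies widely across models, shaped mainly by retrieval design rather than technical limits. The proposed remedy is a transparent LLM search architecture based on standardized telemetry and full disclosure of search traces and citation logs.
Link: https://arxiv.org/abs/2508.00838
Authors: Ilan Strauss,Jangho Yang,Tim O’Reilly,Sruly Rosenblat,Isobel Moure
Affiliations: AI Disclosures Project (Social Science Research Council); Institute for Innovation and Public Purpose (University College London); University of Waterloo; O’Reilly Media
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: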
Abstract:Web-enabled LLMs frequently answer queries without crediting the web pages they consume, creating an “attribution gap” - the difference between relevant URLs read and those actually cited. Drawing on approximately 14,000 real-world LMArena conversation logs with search-enabled LLM systems, we document three exploitation patterns: 1) No Search: 34% of Google Gemini and 24% of OpenAI GPT-4o responses are generated without explicitly fetching any online content; 2) No citation: Gemini provides no clickable citation source in 92% of answers; 3) High-volume, low-credit: Perplexity’s Sonar visits approximately 10 relevant pages per query but cites only three to four. A negative binomial hurdle model shows that the average query answered by Gemini or Sonar leaves about 3 relevant websites uncited, whereas GPT-4o’s tiny uncited gap is best explained by its selective log disclosures rather than by better attribution. Citation efficiency - extra citations provided per additional relevant web page visited - varies widely across models, from 0.19 to 0.45 on identical queries, underscoring that retrieval design, not technical limits, shapes ecosystem impact. We recommend a transparent LLM search architecture based on standardized telemetry and full disclosure of search traces and citation logs.
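The attribution gap defined in the abstract (relevant URLs read minus those actually cited) can be illustrated per query with a simple set difference. The sketch below is only that illustration: the log structure and URLs are hypothetical, and it does not reproduce the paper's negative binomial hurdle model or its citation-efficiency estimates.

```python
from typing import Dict, List, Set


def attribution_gap(relevant_read: Set[str], cited: Set[str]) -> int:
    """Number of relevant URLs that were read but never cited."""
    return len(relevant_read - cited)


# Hypothetical per-query logs: URLs fetched versus URLs cited.
logs: List[Dict[str, Set[str]]] = [
    {"read": {"a.example/1", "b.example/2", "c.example/3"},
     "cited": {"a.example/1"}},
    {"read": set(), "cited": set()},  # answered without searching
    {"read": {"d.example/4"}, "cited": {"d.example/4"}},
]

gaps = [attribution_gap(entry["read"], entry["cited"]) for entry in logs]
mean_gap = sum(gaps) / len(gaps)
print(gaps, round(mean_gap, 2))
```

Note that a query answered without any search has a gap of zero by construction, which is one reason the abstract reports the no-search rate separately from the uncited counts.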
[NLP-142] EH-Benchmark: Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow
[Quick read]: This paper addresses hallucinations of Medical Large Language Models (MLLMs) in ophthalmic diagnosis, which stem from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, and which impair lesion detection and disease diagnosis. Existing medical benchmarks neither evaluate the different types of hallucination effectively nor offer actionable mitigation strategies. The authors therefore propose EH-Benchmark, a new ophthalmology benchmark that categorizes MLLM hallucinations by task and error type into two primary classes, Visual Understanding and Logical Composition, and an agent-centric three-phase framework consisting of Knowledge-Level Retrieval, Task-Level Case Studies, and Result-Level Validation. According to the reported experiments, the multi-agent framework significantly mitigates both types of hallucination and improves accuracy, interpretability, and reliability.
Link: https://arxiv.org/abs/2507.22929
Authors: Xiaoyu Pan,Yang Bai,Ke Zou,Yang Zhou,Jun Zhou,Huazhu Fu,Yih-Chung Tham,Yong Liu
Affiliations: Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR); Centre for Innovation and Precision Eye Health; Department of Ophthalmology, NUHS Tower Block, Level 7, 1E Kent Ridge Road, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: 9 figures, 5 tables
Abstract:Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs’ hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at this https URL.
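[NLP-143] Towards Actionable Pedagogical Feedback: A Multi-Perspective Analysis of Mathematics Teaching and Tutoring Dialogue
[Quick read]: This paper addresses two challenges in analyzing classroom dialogue in mathematics education: multifunctionality, where a single utterance may serve several purposes that one tag cannot capture, and the exclusion of many utterances from domain-specific talk move classifications, which causes them to be ignored in feedback. The key to the solution is a multi-perspective discourse analysis framework that combines domain-specific talk moves with dialogue acts (the flattened multi-functional SWBD-MASL schema with 43 tags) and discourse relations (Segmented Discourse Representation Theory with 16 relations), so that utterances with and without talk moves can both be analyzed. Using a top-down analysis, the study shows that utterances without talk moves play a key role in guiding, acknowledging, and structuring classroom discourse. The framework may inform more precise feedback in AI-assisted education systems and the development of AI agents that emulate the roles of educators and students.
Link: https://arxiv.org/abs/2505.07161
Authors: Jannatun Naim,Jie Cao,Fareen Tasneem,Jennifer Jacobs,Brent Milne,James Martin,Tamara Sumner
Affiliations: University of Colorado Boulder; University of Oklahoma; University of Chittagong; Saga Education
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted to EDM’2025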
Abstract:Effective feedback is essential for refining instructional practices in mathematics education, and researchers often turn to advanced natural language processing (NLP) models to analyze classroom dialogues from multiple perspectives. However, utterance-level discourse analysis encounters two primary challenges: (1) multifunctionality, where a single utterance may serve multiple purposes that a single tag cannot capture, and (2) the exclusion of many utterances from domain-specific discourse move classifications, leading to their omission in feedback. To address these challenges, we propose a multi-perspective discourse analysis that integrates domain-specific talk moves with dialogue act (using the flattened multi-functional SWBD-MASL schema with 43 tags) and discourse relation (applying Segmented Discourse Representation Theory with 16 relations). Our top-down analysis framework enables a comprehensive understanding of utterances that contain talk moves, as well as utterances that do not contain talk moves. This is applied to two mathematics education datasets: TalkMoves (teaching) and SAGA22 (tutoring). Through distributional unigram analysis, sequential talk move analysis, and multi-view deep dive, we discovered meaningful discourse patterns, and revealed the vital role of utterances without talk moves, demonstrating that these utterances, far from being mere fillers, serve crucial functions in guiding, acknowledging, and structuring classroom discourse. These insights underscore the importance of incorporating discourse relations and dialogue acts into AI-assisted education systems to enhance feedback and create more responsive learning environments. Our framework may prove helpful not only for providing feedback to human educators, but also for aiding in the development of AI agents that can effectively emulate the roles of both educators and students.
[NLP-144] Enhancing Talk Moves Analysis in Mathematics Tutoring through Classroom Teaching Discourse COLING’2025
【Quick Read】: This paper addresses the difficulty of collecting, annotating, and analyzing mathematics tutoring discourse at scale in order to develop machine learning models for tutoring settings. The key to the solution is SAGA22, a compact dataset, together with an exploration of modeling strategies including dialogue context, speaker information, the choice of pretraining datasets, and further fine-tuning. The study finds that supplementary pretraining on classroom teaching data improves model performance in tutoring settings, especially when combined with longer context and speaker information.
Link: https://arxiv.org/abs/2412.13395
Authors: Jie Cao,Abhijit Suresh,Jennifer Jacobs,Charis Clevenger,Amanda Howard,Chelsea Brown,Brent Milne,Tom Fischaber,Tamara Sumner,James H. Martin
Affiliations: University of Oklahoma; University of Colorado Boulder; Saga Education
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted to COLING’2025
Abstract:Human tutoring interventions play a crucial role in supporting student learning, improving academic performance, and promoting personal growth. This paper focuses on analyzing mathematics tutoring discourse using talk moves - a framework of dialogue acts grounded in Accountable Talk theory. However, scaling the collection, annotation, and analysis of extensive tutoring dialogues to develop machine learning models is a challenging and resource-intensive task. To address this, we present SAGA22, a compact dataset, and explore various modeling strategies, including dialogue context, speaker information, pretraining datasets, and further fine-tuning. By leveraging existing datasets and models designed for classroom teaching, our results demonstrate that supplementary pretraining on classroom data enhances model performance in tutoring settings, particularly when incorporating longer context and speaker information. Additionally, we conduct extensive ablation studies to underscore the challenges in talk move modeling.
[NLP-145] Observing Dialogue in Therapy: Categorizing and Forecasting Behavioral Codes ACL2019
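【Quick Read】: This paper addresses the automatic analysis of behavioral codes in psychotherapy dialogue to provide real-time guidance, specifically in the setting of Motivational Interviewing (MI). The key to the solution is a dialogue observer built on neural network models with two core functions: categorizing the MI behavioral codes of therapist and client, and forecasting the codes of upcoming utterances, which helps guide the conversation and can alert the therapist. The models build on recent advances in dialogue modeling, outperform several baselines on both tasks in experiments, and are accompanied by a careful analysis of how different network design choices affect the modeling of therapy dialogue.
Link: https://arxiv.org/abs/1907.00326
Authors: Jie Cao,Michael Tanana,Zac E. Imel,Eric Poitras,David C. Atkins,Vivek Srikumar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted to ACL 2019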
Abstract:Automatically analyzing dialogue can help understand and guide behavior in domains such as counseling, where interactions are largely mediated by conversation. In this paper, we study modeling behavioral codes used to assess a psychotherapy treatment style called Motivational Interviewing (MI), which is effective for addressing substance abuse and related problems. Specifically, we address the problem of providing real-time guidance to therapists with a dialogue observer that (1) categorizes therapist and client MI behavioral codes and (2) forecasts codes for upcoming utterances to help guide the conversation and potentially alert the therapist. For both tasks, we define neural network models that build upon recent successes in dialogue modeling. Our experiments demonstrate that our models can outperform several baselines for both tasks. We also report the results of a careful analysis that reveals the impact of the various network design tradeoffs for modeling therapy dialogue.
[NLP-146] Contextual Phenotyping of Pediatric Sepsis Cohort Using Large Language Models
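【Quick Read】: This paper addresses the challenges that traditional clustering methods face with high-dimensional, heterogeneous healthcare data, in particular their lack of clinical contextual understanding, which makes precise patient subgroup identification difficult. The key to the solution is to use Large Language Models (LLMs) to generate semantically rich embeddings: patient records are serialized into text, embedded with quantized LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B, and Stella-En-400M-V5 models, and then clustered with K-means. Stella-En-400M-V5 achieved the highest Silhouette Score (0.86), and the LLM approach guided by a clustering objective identified subgroups with distinct nutritional, clinical, and socioeconomic profiles. This suggests that LLMs capture key features better and improve clustering quality, with potential to support precision medicine decisions in resource-limited settings.
Link: https://arxiv.org/abs/2505.09805
Authors: Aditya Nagori,Ayush Gautam,Matthew O. Wiens,Vuong Nguyen,Nathan Kenya Mugisha,Jerome Kabakyenga,Niranjan Kissoon,John Mark Ansermino,Rishikesan Kamaleswaran
Affiliations: University of British Columbia; St. Paul’s Hospital; BC Children’s Hospital; University of Oxford; Makerere University
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 11 pages, 2 Figures, 1 Table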
Abstract:Clustering patient subgroups is essential for personalized care and efficient resource use. Traditional clustering methods struggle with high-dimensional, heterogeneous healthcare data and lack contextual understanding. This study evaluates Large Language Model (LLM) based clustering against classical methods using a pediatric sepsis dataset from a low-income country (LIC), containing 2,686 records with 28 numerical and 119 categorical variables. Patient records were serialized into text with and without a clustering objective. Embeddings were generated using quantized LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B with low-rank adaptation (LoRA), and Stella-En-400M-V5 models. K-means clustering was applied to these embeddings. Classical comparisons included K-Medoids clustering on UMAP and FAMD-reduced mixed data. Silhouette scores and statistical tests evaluated cluster quality and distinctiveness. Stella-En-400M-V5 achieved the highest Silhouette Score (0.86). LLAMA 3.1 8B with the clustering objective performed better with a higher number of clusters, identifying subgroups with distinct nutritional, clinical, and socioeconomic profiles. LLM-based methods outperformed classical techniques by capturing richer context and prioritizing key features. These results highlight the potential of LLMs for contextual phenotyping and informed decision-making in resource-limited settings.
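The evaluation loop described in this abstract (embed records, cluster with K-means, score with the silhouette coefficient) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the 2-D points stand in for LLM embeddings, and `dist`, `kmeans`, `silhouette_score`, and the toy data are hypothetical names and values chosen for the example.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, init_centroids, iters=20):
    """Plain Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def silhouette_score(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(same) / len(same)  # mean distance within own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy "embeddings": two well-separated groups (hypothetical data).
embeddings = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, _ = kmeans(embeddings, init_centroids=[embeddings[0], embeddings[3]])
score = silhouette_score(embeddings, labels)
print(labels, round(score, 2))
```

A score close to 1 means points sit much nearer to their own cluster than to any other, which is the sense in which the reported 0.86 indicates well-separated subgroups.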
Computer Vision
[CV-0] Raw Data Matters: Enhancing Prompt Tuning by Internal Augmentation on Vision-Language Models
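【Quick Read】: This paper addresses a problem in CLIP-based (Contrastive Language–Image Pre-training) prompt tuning: existing data augmentation strategies rely on external knowledge (such as large language models or pre-built knowledge bases), which raises data collection and processing costs and leaves image-modality features underexploited. The key to the solution is Augmentation-driven Prompt Tuning (AugPT), a self-contained, distillation-based prompt tuning method that uses only self-supervised image augmentation within the raw dataset and introduces a novel gating mechanism based on a consensus test, which reuses the pre-trained prompt tuning backbone to filter noisy samples automatically. This improves the quality of the augmented views and improves both model performance and generalization without introducing any external knowledge.
Link: https://arxiv.org/abs/2508.02671
Authors: Haoyang Li,Liang Wang,Chao Wang,Siyu Zhou,Jing Jiang,Yan Peng,Guodong Long
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 6 figures, 15 tables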
Abstract:For CLIP-based prompt tuning, introducing more data as additional knowledge for enhancing the fine-tuning process has proven to be an effective approach. Existing data amplification strategies for prompt tuning typically rely on external knowledge (e.g., large language models or pre-structured knowledge bases), resulting in higher costs for data collection and processing, while generally ignoring further utilization of features in the image modality. To address this, we propose Augmentation-driven Prompt Tuning (AugPT), a self-contained distillation-based prompt tuning approach using only internal augmentation on the raw dataset to better exploit known features. Specifically, AugPT employs self-supervised augmentation on unlabeled images in the training set, and introduces a novel gating mechanism based on a consensus test, reusing the pre-trained prompt tuning backbone model to spontaneously filter noisy samples, further enhancing the quality of augmented views. Extensive experiments validate that AugPT simultaneously enhances model performance and generalization capability without using appended external knowledge. The code of AugPT is available at: this https URL.
[CV-1] MedVLThinker: Simple Baselines for Multimodal Medical Reasoning
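【Quick Read】: This paper addresses the lack of open and reproducible recipes for building medical Large Multimodal Models (LMMs), which hinders community-wide research, analysis, and comparison. The key to the solution is a complete open-source training recipe consisting of: (1) systematic data curation for text-only and image-text medical data, filtered by level of reasoning difficulty; and (2) two training paradigms, Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Experiments show that RLVR significantly outperforms SFT. A key counter-intuitive finding is that, under the RLVR framework, training on text-only reasoning data yields a larger performance gain than training on multimodal image-text data, which suggests a new route to building medical multimodal reasoning models efficiently.
Link: https://arxiv.org/abs/2508.02669
Authors: Xiaoke Huang,Juncheng Wu,Hui Liu,Xianfeng Tang,Yuyin Zhou
Affiliations: UC Santa Cruz; Amazon Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page and code: this https URL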
Abstract:Large Reasoning Models (LRMs) have introduced a new paradigm in AI by enabling models to "think before responding" via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical LMMs hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Across extensive experiments on the Qwen2.5-VL model family (3B, 7B) and six medical QA benchmarks, we find that RLVR consistently and significantly outperforms SFT. Additionally, under the RLVR framework, a key, counter-intuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data. Our best open 7B model, trained using the RLVR recipe on text-only data, establishes a new state-of-the-art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs. Furthermore, scaling our model to 32B achieves performance on par with the proprietary GPT-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.
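A reward "based on final answer correctness" can be as simple as a function that extracts the final choice from the model output and compares it with the gold label. The sketch below is an assumption-laden illustration, not the MedVLThinker implementation: the `<answer>...</answer>` tag format, the single-letter choices, and the function name `verifiable_reward` are all hypothetical.

```python
import re

# Hypothetical output format: the final choice is wrapped in <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>\s*([A-Za-z])\s*</answer>")

def verifiable_reward(model_output, gold_choice):
    """Return 1.0 if the extracted final answer matches the gold choice, else 0.0."""
    match = ANSWER_PATTERN.search(model_output)
    if match is None:
        return 0.0  # no parseable final answer, so no reward
    return 1.0 if match.group(1).upper() == gold_choice.upper() else 0.0

# The reasoning text is ignored; only the final answer is checked.
good = "The opacity is in the left lower lobe. <answer>B</answer>"
bad = "Probably pneumothorax. <answer>C</answer>"
print(verifiable_reward(good, "B"), verifiable_reward(bad, "B"))
```

Because the reward depends only on a checkable final answer, it needs no learned reward model, which is what makes it "verifiable".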
[CV-2] PMGS: Reconstruction of Projectile Motion across Large Spatiotemporal Spans via 3D Gaussian Splatting
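【Quick Read】: This paper addresses the difficulty of modeling complex rigid motion across large spatiotemporal spans in dynamic reconstruction; existing methods are mostly confined to short-term, small-scale deformation and give limited consideration to physical consistency. The core solution is PMGS (Projectile Motion via 3D Gaussian Splatting), which works in two stages: first, object-centralized reconstruction through dynamic scene decomposition and improved point density control; second, recovery of the full motion sequence by learning per-frame SE(3) poses. It introduces an acceleration consistency constraint that ties Newtonian mechanics to pose estimation, a dynamic simulated annealing strategy that adaptively schedules learning rates based on motion state, and a Kalman fusion scheme that combines multi-source observations to suppress accumulated error from disturbances.
Link: https://arxiv.org/abs/2508.02660
Authors: Yijun Xu,Jingrui Zhang,Yuhan Chen,Dingwen Wang,Lei Yu,Chu He
Affiliations: Wuhan University; Chongqing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: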
Abstract:Modeling complex rigid motion across large spatiotemporal spans remains an unresolved challenge in dynamic reconstruction. Existing paradigms are mainly confined to short-term, small-scale deformation and offer limited consideration for physical consistency. This study proposes PMGS, focusing on reconstructing Projectile Motion via 3D Gaussian Splatting. The workflow comprises two stages: 1) Target Modeling: achieving object-centralized reconstruction through dynamic scene decomposition and an improved point density control; 2) Motion Recovery: restoring full motion sequences by learning per-frame SE(3) poses. We introduce an acceleration consistency constraint to bridge Newtonian mechanics and pose estimation, and design a dynamic simulated annealing strategy that adaptively schedules learning rates based on motion states. Furthermore, we devise a Kalman fusion scheme to optimize error accumulation from multi-source observations to mitigate disturbances. Experiments show PMGS’s superior performance in reconstructing high-speed nonlinear rigid motion compared to mainstream dynamic methods.
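One way to read an "acceleration consistency constraint" is as a penalty on how much the finite-difference acceleration of the estimated positions varies between frames, since a projectile under gravity alone has constant acceleration. The sketch below illustrates that reading only; the exact loss used in PMGS is not given in the abstract, and the function names, time step, and trajectory values are assumptions.

```python
def second_differences(positions, dt):
    """Finite-difference acceleration for each interior frame of a trajectory."""
    acc = []
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        acc.append(tuple((n - 2 * c + p) / dt ** 2
                         for p, c, n in zip(prev, cur, nxt)))
    return acc

def acceleration_consistency_loss(positions, dt):
    """Mean squared deviation of per-frame acceleration from its average."""
    acc = second_differences(positions, dt)
    mean_acc = [sum(axis) / len(acc) for axis in zip(*acc)]
    return sum(sum((a - m) ** 2 for a, m in zip(vec, mean_acc))
               for vec in acc) / len(acc)

G = 9.81  # gravitational acceleration in m/s^2

def projectile(t, vx=3.0, vz=5.0):
    """Ideal drag-free projectile position at time t (hypothetical launch speeds)."""
    return (vx * t, 0.0, vz * t - 0.5 * G * t * t)

dt = 0.1
ideal = [projectile(i * dt) for i in range(6)]
perturbed = [list(p) for p in ideal]
perturbed[2][2] += 0.05  # simulate a 5 cm pose error in one frame

print(acceleration_consistency_loss(ideal, dt))      # close to zero
print(acceleration_consistency_loss(perturbed, dt))  # noticeably larger
```

A small per-frame position error is amplified by the division by `dt ** 2`, so a term of this kind reacts strongly to poses that break the constant-acceleration pattern.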
[CV-3] Evaluating Variance in Visual Question Answering Benchmarks ICCV2025
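【Quick Read】: This paper addresses the reliance on point estimates when evaluating Multimodal Large Language Models (MLLMs) on Visual Question Answering (VQA) tasks, which overlooks significant performance variance caused by stochastic model outputs, training seed sensitivity, and hyperparameter configurations. The key to the solution is a systematic analysis of the sources of performance variance across 14 mainstream VQA benchmarks, including the effects of training seed, framework non-determinism, model scale, and extended instruction finetuning, together with an exploration of Cloze-style evaluation as an alternative strategy for reducing stochasticity and improving reliability. The study argues for variance-aware evaluation methods to support more robust and reliable development of MLLMs.
Link: https://arxiv.org/abs/2508.02645
Authors: Nikitha SR
Affiliations: Adobe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ICCV 2025 Workshop on What’s Next in Multimodal Foundational Models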
Abstract:Multimodal large language models (MLLMs) have emerged as powerful tools for visual question answering (VQA), enabling reasoning and contextual understanding across visual and textual modalities. Despite their advancements, the evaluation of MLLMs on VQA benchmarks often relies on point estimates, overlooking the significant variance in performance caused by factors such as stochastic model outputs, training seed sensitivity, and hyperparameter configurations. This paper critically examines these issues by analyzing variance across 14 widely used VQA benchmarks, covering diverse tasks such as visual reasoning, text understanding, and commonsense reasoning. We systematically study the impact of training seed, framework non-determinism, model scale, and extended instruction finetuning on performance variability. Additionally, we explore Cloze-style evaluation as an alternate assessment strategy, studying its effectiveness in reducing stochasticity and improving reliability across benchmarks. Our findings highlight the limitations of current evaluation practices and advocate for variance-aware methodologies to foster more robust and reliable development of MLLMs.
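The difference between a point estimate and a variance-aware report can be shown with a few lines of standard-library Python. This is a minimal sketch with made-up accuracy numbers; the helper names and the crude "gap smaller than the combined standard deviations" rule are assumptions for illustration, not the paper's methodology.

```python
import statistics

def summarize_runs(scores):
    """Mean and sample standard deviation over repeated runs (e.g. seeds)."""
    return statistics.mean(scores), statistics.stdev(scores)

def gap_is_within_noise(scores_a, scores_b):
    """Crude check: is the mean gap no larger than the summed standard deviations?"""
    mean_a, std_a = summarize_runs(scores_a)
    mean_b, std_b = summarize_runs(scores_b)
    return abs(mean_a - mean_b) <= std_a + std_b

# Hypothetical benchmark accuracies (%) for two models over three seeds each.
model_a = [70.0, 72.0, 74.0]
model_b = [71.0, 73.0, 75.0]

mean_a, std_a = summarize_runs(model_a)
print(f"model A: {mean_a:.1f} +/- {std_a:.1f}")
print("gap within noise:", gap_is_within_noise(model_a, model_b))
```

Reporting a single run of each model here could show either one ahead by several points; the per-seed spread makes clear that a one-point mean gap is not meaningful.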
[CV-4] ReMoMask: Retrieval-Augmented Masked Motion Generation
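【Quick Read】: This paper addresses two core challenges in Text-to-Motion (T2M) generation: generative models (such as diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. The key to the solution is ReMoMask, a unified framework with three innovations: 1) a Bidirectional Momentum Text-Motion Model that decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) a Semantic Spatio-temporal Attention mechanism that enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free Guidance, which incorporates a small amount of unconditional generation to improve generalization. Built on MoMask’s RVQ-VAE architecture, the framework generates temporally coherent motion sequences in very few steps, and experiments show FID improvements of 3.88% on HumanML3D and 10.97% on KIT-ML over the previous SOTA method RAG-T2M.
Link: https://arxiv.org/abs/2508.02605
Authors: Zhengdao Li,Siheng Wang,Zeyu Zhang,Hao Tang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: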
Abstract:Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: Generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free Guidance incorporates minor unconditional generation to enhance generalization. Built upon MoMask’s RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in minimal steps. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of ReMoMask, achieving a 3.88% and 10.97% improvement in FID scores on HumanML3D and KIT-ML, respectively, compared to the previous SOTA method RAG-T2M. Code: this https URL. Website: this https URL.
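The idea of decoupling the number of negative samples from the batch size with a momentum queue can be sketched as a fixed-size FIFO of past embeddings plus a slowly updated second encoder. The code below follows the general momentum-contrast pattern and is an assumption about how such a queue might look, not the ReMoMask implementation; `MomentumQueue`, `momentum_update`, and all values are hypothetical.

```python
from collections import deque

class MomentumQueue:
    """Fixed-size FIFO of past embeddings used as contrastive negatives."""

    def __init__(self, max_size):
        # deque with maxlen drops the oldest entries automatically.
        self.queue = deque(maxlen=max_size)

    def enqueue(self, batch_embeddings):
        self.queue.extend(batch_embeddings)

    def negatives(self):
        return list(self.queue)

def momentum_update(online_params, momentum_params, m=0.99):
    """Move the momentum encoder slowly toward the online encoder."""
    return [m * k + (1 - m) * q for q, k in zip(online_params, momentum_params)]

# Batch size is 2, yet 4 negatives are available once the queue fills up.
queue = MomentumQueue(max_size=4)
for step in range(3):
    batch = [f"emb_{step}_{i}" for i in range(2)]  # placeholder embeddings
    queue.enqueue(batch)

print(queue.negatives())
print(momentum_update([1.0], [0.0], m=0.9))
```

The queue length, not the batch size, sets how many negatives each contrastive step sees, while the slow momentum update keeps older queued embeddings comparable with newer ones.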
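[CV-5] Explainable AI Methods for Neuroimaging: Systematic Failures of Common Tools, the Need for Domain-Specific Validation, and a Proposal for Safe Application
【Quick Read】: This paper addresses the trustworthy interpretation of deep learning models in neuroimaging, where widely used Explainable AI (XAI) methods lack rigorous validation and can lead to misinterpretation. The key to the solution is a novel XAI validation framework that constructs prediction tasks with known signal sources (such as localized anatomical features or subject-specific clinical lesions), establishing verifiable ground truth without artificially altering input images, and uses it to evaluate several XAI methods systematically on roughly 45,000 structural MRI scans. The study finds systematic failures in GradCAM and Layer-wise Relevance Propagation, whereas the simpler gradient-based method SmoothGrad remains consistently accurate. This indicates that XAI methods need domain-specific adaptation to neuroimaging data and that conclusions of prior studies based on standard XAI methods should be re-examined.
Link: https://arxiv.org/abs/2508.02560
Authors: Nys Tjade Siegel,James H. Cole,Mohamad Habes,Stefan Haufe,Kerstin Ritter,Marc-André Schulz
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
Comments: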
Abstract:Trustworthy interpretation of deep learning models is critical for neuroimaging applications, yet commonly used Explainable AI (XAI) methods lack rigorous validation, risking misinterpretation. We performed the first large-scale, systematic comparison of XAI methods on ~45,000 structural brain MRIs using a novel XAI validation framework. This framework establishes verifiable ground truth by constructing prediction tasks with known signal sources - from localized anatomical features to subject-specific clinical lesions - without artificially altering input images. Our analysis reveals systematic failures in two of the most widely used methods: GradCAM consistently failed to localize predictive features, while Layer-wise Relevance Propagation generated extensive, artifactual explanations that suggest incompatibility with neuroimaging data characteristics. Our results indicate that these failures stem from a domain mismatch, where methods with design principles tailored to natural images require substantial adaptation for neuroimaging data. In contrast, the simpler, gradient-based method SmoothGrad, which makes fewer assumptions about data structure, proved consistently accurate, suggesting its conceptual simplicity makes it more robust to this domain shift. These findings highlight the need for domain-specific adaptation and validation of XAI methods, suggest that interpretations from prior neuroimaging studies using standard XAI methodology warrant re-evaluation, and provide urgent guidance for practical application of XAI in neuroimaging.
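SmoothGrad, the method that held up in this comparison, averages the input gradient over several noise-perturbed copies of the input. The sketch below shows that averaging on a toy scalar function with a numerical gradient; it is an illustration under assumptions (a plain Python function instead of a neural network, hypothetical parameter values), not the study's code.

```python
import random

def numeric_gradient(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def smoothgrad(f, x, noise_std=0.1, n_samples=50, seed=0):
    """Average the gradient over Gaussian-perturbed copies of the input."""
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
        for i, g in enumerate(numeric_gradient(f, noisy)):
            total[i] += g
    return [t / n_samples for t in total]

def toy_model(x):
    """Stand-in 'model': only features 0 and 2 influence the output."""
    return 3.0 * x[0] - 2.0 * x[2]

saliency = smoothgrad(toy_model, [0.5, 0.5, 0.5])
print([round(s, 3) for s in saliency])
```

For this toy model the map assigns zero relevance to the unused feature, which mirrors the paper's validation idea: when the true signal source is known, a saliency method can be checked against it.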
[CV-6] MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming
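【Quick Read】: This paper addresses the performance gap between monocular input and panoramic RGB-D input in Vision-Language Navigation (VLN). Methods based on Vision-Language Action (VLA) models already achieve strong results with monocular input, but still fall short of navigation systems that use panoramic RGB-D information. The key to the solution is MonoDream, a lightweight VLA framework that learns a Unified Navigation Representation (UNR), jointly aligning navigation-relevant visual semantics (such as global layout, depth, and future cues) with language-grounded action intent to make action prediction more reliable. It also introduces Latent Panoramic Dreaming (LPD) tasks, which supervise the UNR by training the model to predict, from monocular input alone, the latent features of panoramic RGB and depth observations at current and future steps, significantly narrowing the gap between monocular and panoramic methods.
Link: https://arxiv.org/abs/2508.02549
Authors: Shuo Wang,Yongcai Wang,Wanting Li,Yucheng Wang,Maiyue Chen,Kaihui Wang,Zhizhong Su,Xudong Cai,Yeying Jin,Deying Li,Zhaoxin Fan
Affiliations: Horizon Robotics
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: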
Abstract:Vision-Language Navigation (VLN) tasks often leverage panoramic RGB and depth inputs to provide rich spatial cues for action planning, but these sensors can be costly or less accessible in real-world deployments. Recent approaches based on Vision-Language Action (VLA) models achieve strong results with monocular input, yet they still lag behind methods using panoramic RGB-D information. We present MonoDream, a lightweight VLA framework that enables monocular agents to learn a Unified Navigation Representation (UNR). This shared feature representation jointly aligns navigation-relevant visual semantics (e.g., global layout, depth, and future cues) and language-grounded action intent, enabling more reliable action prediction. MonoDream further introduces Latent Panoramic Dreaming (LPD) tasks to supervise the UNR, which train the model to predict latent features of panoramic RGB and depth observations at both current and future steps based on only monocular input. Experiments on multiple VLN benchmarks show that MonoDream consistently improves monocular navigation performance and significantly narrows the gap with panoramic-based agents.
[CV-7] Precision-Aware Video Compression for Reducing Bandwidth Requirements in Video Communication for Vehicle Detection-Based Applications
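【Quick Read】: This paper addresses inefficient video transmission under limited communication bandwidth in intelligent transportation systems (ITS), which can markedly degrade vehicle detection accuracy that depends on real-time video streams. The key to the solution is Precision-Aware Video Compression (PAVC), a dynamic adaptive framework that adjusts the video compression level in real time according to current weather and lighting conditions, minimizing bandwidth consumption while maintaining vehicle detection accuracy. Experiments show that, with moderate bandwidth availability, PAVC improves detection accuracy by up to 13% and reduces bandwidth usage by up to 8.23x; under severely limited bandwidth, it reduces bandwidth requirements by up to 72x without sacrificing detection performance.
Link: https://arxiv.org/abs/2508.02533
Authors: Abyad Enan,Jon C Calhoun,Mashrur Chowdhury
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the Transportation Research Record: Journal of the Transportation Research Board for possible publication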
Abstract:Computer vision has become a popular tool in intelligent transportation systems (ITS), enabling various applications through roadside traffic cameras that capture video and transmit it in real time to computing devices within the same network. The efficiency of this video transmission largely depends on the available bandwidth of the communication system. However, limited bandwidth can lead to communication bottlenecks, hindering the real-time performance of ITS applications. To mitigate this issue, lossy video compression techniques can be used to reduce bandwidth requirements, at the cost of degrading video quality. This degradation can negatively impact the accuracy of applications that rely on real-time vehicle detection. Additionally, vehicle detection accuracy is influenced by environmental factors such as weather and lighting conditions, suggesting that compression levels should be dynamically adjusted in response to these variations. In this work, we utilize a framework called Precision-Aware Video Compression (PAVC), where a roadside video camera captures footage of vehicles on roadways, compresses videos, and then transmits them to a processing unit, running a vehicle detection algorithm for safety-critical applications, such as real-time collision risk assessment. The system dynamically adjusts the video compression level based on current weather and lighting conditions to maintain vehicle detection accuracy while minimizing bandwidth usage. Our results demonstrate that PAVC improves vehicle detection accuracy by up to 13% and reduces communication bandwidth requirements by up to 8.23x in areas with moderate bandwidth availability. Moreover, in locations with severely limited bandwidth, PAVC reduces bandwidth requirements by up to 72x while preserving vehicle detection performance.
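The core decision in a scheme like this, picking the strongest compression that still keeps detection accuracy above a target for the current conditions, can be sketched as a table lookup. Everything below is hypothetical: the condition names, compression levels, accuracy figures, and the function `choose_compression` are invented for illustration and do not come from the paper.

```python
# Hypothetical offline profile: condition -> {compression level: detection accuracy}.
# Higher level means stronger compression and lower bandwidth.
PROFILE = {
    "clear_day":   {10: 0.95, 20: 0.94, 30: 0.91, 40: 0.85},
    "rainy_night": {10: 0.90, 20: 0.84, 30: 0.72, 40: 0.60},
}

def choose_compression(profile, condition, min_accuracy):
    """Pick the strongest compression level that still meets the accuracy target."""
    levels = profile[condition]
    candidates = [level for level, acc in levels.items() if acc >= min_accuracy]
    if not candidates:
        return min(levels)  # nothing qualifies: fall back to the lightest compression
    return max(candidates)

print(choose_compression(PROFILE, "clear_day", 0.90))    # can compress harder
print(choose_compression(PROFILE, "rainy_night", 0.90))  # must compress less
```

The same accuracy target leads to different compression levels under different conditions, which is the sense in which the compression is adjusted dynamically rather than fixed.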
[CV-8] Understanding the Risks of Asphalt Art on the Reliability of Surveillance Perception Systems
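【Quick Read】: This paper examines how the visual complexity introduced by asphalt art in urban environments affects vision-based pedestrian detection models, in particular how performance differs under benign and adversarial conditions. The key to the approach is to construct realistic crosswalk scenes by compositing a variety of street art patterns into a fixed surveillance scene, and to evaluate a pretrained object detection model's pedestrian detection accuracy on different asphalt art designs. The results show that complex, highly salient artistic patterns can significantly degrade detection performance and can even be maliciously crafted to obscure real pedestrians or produce false detections, underscoring the importance of accounting for environmental visual variation when designing robust pedestrian perception models.
Link: https://arxiv.org/abs/2508.02530
Authors: Jin Ma,Abyad Enan,Long Cheng,Mashrur Chowdhury
Affiliations: Clemson University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: J. Ma and A. Enan are co-first authors; they have contributed equally. This work has been submitted to the Transportation Research Record: Journal of the Transportation Research Board for possible publication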
Abstract:Artistic crosswalks featuring asphalt art, introduced by different organizations in recent years, aim to enhance the visibility and safety of pedestrians. However, their visual complexity may interfere with surveillance systems that rely on vision-based object detection models. In this study, we investigate the impact of asphalt art on pedestrian detection performance of a pretrained vision-based object detection model. We construct realistic crosswalk scenarios by compositing various street art patterns into a fixed surveillance scene and evaluate the model’s performance in detecting pedestrians on asphalt-arted crosswalks under both benign and adversarial conditions. A benign case refers to pedestrian crosswalks painted with existing normal asphalt art, whereas an adversarial case involves digitally crafted or altered asphalt art perpetrated by an attacker. Our results show that while simple, color-based designs have minimal effect, complex artistic patterns, particularly those with high visual salience, can significantly degrade pedestrian detection performance. Furthermore, we demonstrate that adversarially crafted asphalt art can be exploited to deliberately obscure real pedestrians or generate non-existent pedestrian detections. These findings highlight a potential vulnerability in urban vision-based pedestrian surveillance systems and underscore the importance of accounting for environmental visual variations when designing robust pedestrian perception models.
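[CV-9] Towards Reliable Audio Deepfake Attribution and Model Recognition: A Multi-Level Autoencoder-Based Framework
【Quick Read】: This paper addresses audio deepfake attribution, that is, identifying the generation technology and the specific model instance behind fake audio in digital communications, to strengthen traceability and trust. Detection methods have advanced, but attributing a deepfake to its source model remains underexplored, and robustness under open-set conditions is a particular concern. The key to the solution is LAVA (Layered Architecture for Voice Attribution), which extracts attention-enhanced latent representations with a convolutional autoencoder trained solely on fake audio and performs hierarchical recognition with two specialized classifiers: Audio Deepfake Attribution (ADA) identifies the generation technology, and Audio Deepfake Model Recognition (ADMR) identifies the specific model instance. Confidence-based rejection thresholds improve reliability in open-set scenarios, and experiments show strong performance on several public datasets.
Link: https://arxiv.org/abs/2508.02521
Authors: Andrea Di Pierno(1),Luca Guarnera(2),Dario Allegra(2),Sebastiano Battiato(2) ((1) IMT School of Advanced Studies, (2) University of Catania)
Affiliations: IMT School of Advanced Studies; University of Catania
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Comments: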
Abstract:The proliferation of audio deepfakes poses a growing threat to trust in digital communications. While detection methods have advanced, attributing audio deepfakes to their source models remains an underexplored yet crucial challenge. In this paper we introduce LAVA (Layered Architecture for Voice Attribution), a hierarchical framework for audio deepfake detection and model recognition that leverages attention-enhanced latent representations extracted by a convolutional autoencoder trained solely on fake audio. Two specialized classifiers operate on these features: Audio Deepfake Attribution (ADA), which identifies the generation technology, and Audio Deepfake Model Recognition (ADMR), which recognizes the specific generative model instance. To improve robustness under open-set conditions, we incorporate confidence-based rejection thresholds. Experiments on ASVspoof2021, FakeOrReal, and CodecFake show strong performance: the ADA classifier achieves F1-scores over 95% across all datasets, and the ADMR module reaches 96.31% macro F1 across six classes. Additional tests on unseen attacks from ASVspoof2019 LA and error propagation analysis confirm LAVA’s robustness and reliability. The framework advances the field by introducing a supervised approach to deepfake attribution and model recognition under open-set conditions, validated on public benchmarks and accompanied by publicly released models and code. Models and code are available at this https URL.
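A confidence-based rejection threshold for open-set classification can be sketched as follows: convert classifier logits to probabilities and return "unknown" when the top probability is too low. This is a generic illustration, not LAVA's code; the class names, the threshold of 0.8, and the function names are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_with_rejection(logits, class_names, threshold=0.8):
    """Return (label, confidence); label is 'unknown' below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown", probs[best]
    return class_names[best], probs[best]

# Hypothetical generator families used as class labels.
CLASSES = ["tts_system", "voice_conversion", "neural_codec"]

confident = classify_with_rejection([5.0, 0.0, 0.0], CLASSES)
uncertain = classify_with_rejection([1.0, 0.9, 0.8], CLASSES)
print(confident[0], uncertain[0])
```

Samples from a generator not seen in training tend to produce flatter probability distributions, so rejecting low-confidence predictions keeps them from being forced into a known class.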
[CV-10] Engagement Prediction of Short Videos with Large Multimodal Models ICCV
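【Quick Read】: This paper addresses video engagement prediction for user-generated content (UGC) on short-form video platforms, where the core challenge is modeling the complex interplay of multimodal factors such as semantic content, visual quality, audio characteristics, and user background. The key to the solution is joint modeling of multimodal information with Large Multimodal Models (LMMs). Two representative LMMs are used: VideoLLaMA2 (audio, visual, and language modalities) and Qwen2.5-VL (visual and language modalities only). Trained on the SnapUGC dataset, both demonstrate the effectiveness of LMMs for engagement prediction, and VideoLLaMA2 performs better thanks to its audio features, which highlights the importance of audio in this task. An ensemble of the two model types took first place in the ICCV VQualA 2025 EVQA-SnapUGC Challenge.
Link: https://arxiv.org/abs/2508.02516
Authors: Wei Sun,Linhan Cao,Yuqin Cao,Weixia Zhang,Wen Wen,Kaiwei Zhang,Zijian Chen,Fangfang Lu,Xiongkuo Min,Guangtao Zhai
Affiliations: East China Normal University; Shanghai Jiao Tong University; City University of Hong Kong; Shanghai University of Electric Power
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The proposed method achieves first place in the ICCV VQualA 2025 EVQA-SnapUGC Challenge on short-form video engagement prediction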
Abstract:The rapid proliferation of user-generated content (UGC) on short-form video platforms has made video engagement prediction increasingly important for optimizing recommendation systems and guiding content creation. However, this task remains challenging due to the complex interplay of factors such as semantic content, visual quality, audio characteristics, and user background. Prior studies have leveraged various types of features from different modalities, such as visual quality, semantic content, background sound, etc., but often struggle to effectively model their cross-feature and cross-modality interactions. In this work, we empirically investigate the potential of large multimodal models (LMMs) for video engagement prediction. We adopt two representative LMMs: VideoLLaMA2, which integrates audio, visual, and language modalities, and Qwen2.5-VL, which models only visual and language modalities. Specifically, VideoLLaMA2 jointly processes key video frames, text-based metadata, and background sound, while Qwen2.5-VL utilizes only key video frames and text-based metadata. Trained on the SnapUGC dataset, both models demonstrate competitive performance against state-of-the-art baselines, showcasing the effectiveness of LMMs in engagement prediction. Notably, VideoLLaMA2 consistently outperforms Qwen2.5-VL, highlighting the importance of audio features in engagement prediction. By ensembling two types of models, our method achieves first place in the ICCV VQualA 2025 EVQA-SnapUGC Challenge on short-form video engagement prediction. The code is available at this https URL.
[CV-11] QuaDreamer: Controllable Panoramic Video Generation for Quadruped Robots
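【Quick Read】: This paper addresses the performance bottleneck in panoramic visual perception for quadruped robots caused by the scarcity of high-quality training data, which stems from inherent kinematic constraints and complex sensor calibration challenges. The key to the solution is QuaDreamer, the first panoramic data generation engine designed specifically for quadruped robots, with three core components: (1) Vertical Jitter Encoding (VJE), which extracts controllable vertical vibration signals through frequency-domain feature filtering to mimic the vertical jitter characteristic of quadruped locomotion; (2) a Scene-Object Controller (SOC), which uses the attention mechanism to manage object motion and background jitter together and improve generation quality; (3) a Panoramic Enhancer (PE), a dual-stream architecture that combines frequency-texture refinement with spatial-structure correction to reduce panoramic distortion in wide-FoV video. Together these modules provide controllable, realistic panoramic video generation, and the generated videos improve multi-object tracking performance for quadruped robots in 360-degree scenes.
Link: https://arxiv.org/abs/2508.02512
Authors: Sheng Wu,Fei Teng,Hao Shi,Qi Jiang,Kai Luo,Kaiwei Wang,Kailun Yang
Affiliations: Hunan University; Zhejiang University; Nanyang Technological University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted to CoRL 2025. The source code and model weights will be publicly available at this https URL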
Abstract:Panoramic cameras, capturing comprehensive 360-degree environmental data, are suitable for quadruped robots in surrounding perception and interaction with complex environments. However, the scarcity of high-quality panoramic training data, caused by inherent kinematic constraints and complex sensor calibration challenges, fundamentally limits the development of robust perception systems tailored to these embodied platforms. To address this issue, we propose QuaDreamer, the first panoramic data generation engine specifically designed for quadruped robots. QuaDreamer focuses on mimicking the motion paradigm of quadruped robots to generate highly controllable, realistic panoramic videos, providing a data source for downstream tasks. Specifically, to effectively capture the unique vertical vibration characteristics exhibited during quadruped locomotion, we introduce Vertical Jitter Encoding (VJE). VJE extracts controllable vertical signals through frequency-domain feature filtering and provides high-quality prompts. To facilitate high-quality panoramic video generation under jitter signal control, we propose a Scene-Object Controller (SOC) that effectively manages object motion and boosts background jitter control through the attention mechanism. To address panoramic distortions in wide-FoV video generation, we propose the Panoramic Enhancer (PE), a dual-stream architecture that synergizes frequency-texture refinement for local detail enhancement with spatial-structure correction for global geometric consistency. We further demonstrate that the generated video sequences can serve as training data for the quadruped robot's panoramic visual perception model, enhancing the performance of multi-object tracking in 360-degree scenes. The source code and model weights will be publicly available at this https URL.
[CV-12] Rethinking Transparent Object Grasping: Depth Completion with Monocular Depth Estimation and Instance Mask
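【Quick Read】: This paper addresses the problem that transparent objects cause depth cameras to produce incomplete or invalid depth data, which degrades the accuracy and reliability of robotic grasping. Existing methods typically feed the RGB-D image directly into a network that outputs a completed depth map, relying on the model to infer the reliability of depth values implicitly; in real scenes, however, complex light interactions make the distribution of valid and invalid depth highly variable, so these methods generalize poorly. The key to the solution is the ReMake framework, which is guided by an instance mask and monocular depth estimation: the mask explicitly separates transparent from non-transparent regions, so that during training the model concentrates on learning accurate depth in those regions and relies less on implicit reasoning, while monocular depth estimation supplies depth context between the transparent object and its surroundings and improves prediction accuracy. This design significantly improves generalization to real-world scenes.
Link: https://arxiv.org/abs/2508.02507
Authors: Yaofeng Cheng,Xinkai Gao,Sen Zhang,Chao Zeng,Fusheng Zha,Lining Sun,Chenguang Yang
Affiliations: Harbin Institute of Technology; Lanzhou University of Technology; University of Liverpool
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: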
Abstract:Due to the optical properties, transparent objects often lead depth cameras to generate incomplete or invalid depth data, which in turn reduces the accuracy and reliability of robotic grasping. Existing approaches typically input the RGB-D image directly into the network to output the complete depth, expecting the model to implicitly infer the reliability of depth values. However, while effective in training datasets, such methods often fail to generalize to real-world scenarios, where complex light interactions lead to highly variable distributions of valid and invalid depth data. To address this, we propose ReMake, a novel depth completion framework guided by an instance mask and monocular depth estimation. By explicitly distinguishing transparent regions from non-transparent ones, the mask enables the model to concentrate on learning accurate depth estimation in these areas from RGB-D input during training. This targeted supervision reduces reliance on implicit reasoning and improves generalization to real-world scenarios. Additionally, monocular depth estimation provides depth context between the transparent object and its surroundings, enhancing depth prediction accuracy. Extensive experiments show that our method outperforms existing approaches on both benchmark datasets and real-world scenarios, demonstrating superior accuracy and generalization capability. Code and videos are available at this https URL.
[CV-13] Clinical Expert Uncertainty Guided Generalized Label Smoothing for Medical Noisy Label Learning
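【Quick Read】: This paper addresses label noise in medical image annotation that stems from clinical expert uncertainty (uncertainty-aware notes). Existing text-mining methods overlook hedging expressions such as "maybe" or "not excluded" that physicians commonly write in clinical notes, which introduces noisy labels, while conventional denoising methods mostly rely on post-processing and ignore expert uncertainty as a root cause. The key to the solution is to first systematically assess how clinical expert uncertainty affects label noise, and then to propose a clinical expert uncertainty-aware benchmark together with a label smoothing method that models the uncertainty explicitly and significantly outperforms current state-of-the-art methods.
Link: https://arxiv.org/abs/2508.02495
Authors: Kunyu Zhang,Lin Gu,Liangchen Liu,Yingke Chen,Bingyang Wang,Jin Yan,Yingying Zhu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: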
Abstract:Many previous studies have proposed extracting image labels from clinical notes to create large-scale medical image datasets at a low cost. However, these approaches inherently suffer from label noise due to uncertainty from the clinical experts. When radiologists and physicians analyze medical images to make diagnoses, they often include uncertainty-aware notes such as "maybe" or "not excluded". Unfortunately, current text-mining methods overlook these nuances, resulting in the creation of noisy labels. Existing methods for handling noisy labels in medical image analysis, which typically address the problem through post-processing techniques, have largely ignored the important issue of expert-driven uncertainty contributing to label noise. To better incorporate the expert-written uncertainty in clinical notes into medical image analysis and address the label noise issue, we first examine the impact of clinical expert uncertainty on label noise. We then propose a clinical expert uncertainty-aware benchmark, along with a label smoothing method, which significantly improves performance compared to current state-of-the-art approaches.
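To make the idea of uncertainty-guided label smoothing concrete, here is a minimal sketch. It uses the standard label smoothing formula with a per-sample rate chosen from hedging phrases in the note; the phrase-to-rate table and the function names are hypothetical, and the paper's actual generalized smoothing rule is not stated in the abstract.

```python
# Hypothetical mapping from hedging phrases to a smoothing rate in [0, 1).
UNCERTAINTY_RATE = {"maybe": 0.4, "not excluded": 0.5}


def smoothing_rate(note):
    """Return the largest rate among hedging phrases found in the note."""
    text = note.lower()
    rates = [rate for phrase, rate in UNCERTAINTY_RATE.items() if phrase in text]
    return max(rates, default=0.0)


def smooth_label(class_index, num_classes, rate):
    """Standard label smoothing: (1 - rate) * one_hot + rate / num_classes."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must lie in [0, 1)")
    target = [rate / num_classes] * num_classes
    target[class_index] += 1.0 - rate
    return target


# A confident note keeps a hard label; a hedged note gets a softer target.
print(smooth_label(1, 2, smoothing_rate("Pneumonia confirmed.")))
print(smooth_label(1, 2, smoothing_rate("Pneumonia not excluded.")))
```

The soft target can then be used with a cross-entropy loss in place of the hard one-hot label.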
[CV-14] Low-Frequency First: Eliminating Floating Artifacts in 3D Gaussian Splatting
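【Quick Read】: This paper addresses the floating artifacts that 3D Gaussian Splatting (3DGS) produces in 3D reconstruction: erroneous structures detached from the true geometry that significantly degrade visual fidelity. The analysis shows that floating artifacts mainly originate from under-optimized Gaussians in poorly initialized scenes, whose insufficiently learned high-frequency components lead to implausible spatial distributions. The key to the solution is EFA-GS, which identifies and selectively expands these under-optimized Gaussians to prioritize accurate learning of low-frequency information, and adds depth-based and scale-based dynamic strategies that adaptively control the expansion to mitigate detail erosion. Experiments show that the method significantly reduces floating artifacts on both synthetic and real-world datasets and is more robust in downstream 3D editing tasks.
Link: https://arxiv.org/abs/2508.02493
Authors: Jianchao Wang,Peng Zhou,Cen Li,Rong Quan,Jie Qin
Affiliations: Nanjing University of Aeronautics and Astronautics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: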
Abstract:3D Gaussian Splatting (3DGS) is a powerful and computationally efficient representation for 3D reconstruction. Despite its strengths, 3DGS often produces floating artifacts, which are erroneous structures detached from the actual geometry and significantly degrade visual fidelity. The underlying mechanisms causing these artifacts, particularly in low-quality initialization scenarios, have not been fully explored. In this paper, we investigate the origins of floating artifacts from a frequency-domain perspective and identify under-optimized Gaussians as the primary source. Based on our analysis, we propose Eliminating-Floating-Artifacts Gaussian Splatting (EFA-GS), which selectively expands under-optimized Gaussians to prioritize accurate low-frequency learning. Additionally, we introduce complementary depth-based and scale-based strategies to dynamically refine Gaussian expansion, effectively mitigating detail erosion. Extensive experiments on both synthetic and real-world datasets demonstrate that EFA-GS substantially reduces floating artifacts while preserving high-frequency details, achieving an improvement of 1.68 dB in PSNR over the baseline method on our RWLQ dataset. Furthermore, we validate the effectiveness of our approach in downstream 3D editing tasks. Our implementation will be released on GitHub.
[CV-15] MindShot: Multi-Shot Video Reconstruction from fMRI with LLM Decoding
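【Quick Read】: This paper addresses the reconstruction of multi-shot dynamic videos from functional magnetic resonance imaging (fMRI), a key step toward understanding visual cognition and building highly immersive brain-computer interfaces. Current methods are limited to single-shot clips and cannot handle the multi-shot nature of real-world experience, mainly because fMRI signals mix across shots, the temporal resolution mismatch between fMRI and video obscures rapid scene changes, and no dataset exists for multi-shot fMRI-video alignment. The core of the solution is a novel divide-and-decode framework: first, a shot boundary predictor module explicitly decomposes the mixed fMRI signal into shot-specific segments; second, a large language model (LLM) performs generative keyframe captioning, using high-level semantics to overcome temporal blur; finally, a large-scale synthetic dataset of 20,000 samples is built to support training. Experiments show that the framework outperforms existing state-of-the-art methods in multi-shot reconstruction fidelity, and ablation studies confirm the critical role of fMRI decomposition and semantic captioning, with the decomposition module improving the CLIP similarity of decoded captions by 71.8%.
Link: https://arxiv.org/abs/2508.02480
Authors: Wenwen Zeng,Yonghuang Wu,Yifan Chen,Xuan Xie,Chengqian Zhao,Feiyu Yin,Guoqing Wu,Jinhua Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: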
Abstract:Reconstructing dynamic videos from fMRI is important for understanding visual cognition and enabling vivid brain-computer interfaces. However, current methods are critically limited to single-shot clips, failing to address the multi-shot nature of real-world experiences. Multi-shot reconstruction faces fundamental challenges: fMRI signal mixing across shots, the temporal resolution mismatch between fMRI and video obscuring rapid scene changes, and the lack of dedicated multi-shot fMRI-video datasets. To overcome these limitations, we propose a novel divide-and-decode framework for multi-shot fMRI video reconstruction. Our core innovations are: (1) A shot boundary predictor module explicitly decomposing mixed fMRI signals into shot-specific segments. (2) Generative keyframe captioning using LLMs, which decodes robust textual descriptions from each segment, overcoming temporal blur by leveraging high-level semantics. (3) Novel large-scale data synthesis (20k samples) from existing datasets. Experimental results demonstrate our framework outperforms state-of-the-art methods in multi-shot reconstruction fidelity. Ablation studies confirm the critical role of fMRI decomposition and semantic captioning, with decomposition significantly improving decoded caption CLIP similarity by 71.8%. This work establishes a new paradigm for multi-shot fMRI reconstruction, enabling accurate recovery of complex visual narratives through explicit decomposition and semantic prompting.
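The "divide" half of divide-and-decode can be sketched as follows. This assumes a hypothetical predictor that emits one boundary probability per fMRI time point; the function name, the threshold, and the toy inputs are illustrative and not taken from the paper.

```python
def split_at_boundaries(frames, boundary_probs, threshold=0.5):
    """Start a new segment before every frame whose boundary probability
    exceeds the threshold (the first frame always opens the first segment)."""
    if len(frames) != len(boundary_probs):
        raise ValueError("need one boundary probability per frame")
    segments, current = [], []
    for frame, prob in zip(frames, boundary_probs):
        if prob > threshold and current:
            segments.append(current)
            current = []
        current.append(frame)
    if current:
        segments.append(current)
    return segments


# Toy sequence of six fMRI time points with one predicted shot change.
frames = ["t0", "t1", "t2", "t3", "t4", "t5"]
probs = [0.9, 0.1, 0.1, 0.8, 0.2, 0.1]
print(split_at_boundaries(frames, probs))
```

Each resulting segment would then be decoded separately, for example into a keyframe caption.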
[CV-16] Fine-grained Multiple Supervisory Network for Multi-modal Manipulation Detecting and Grounding
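【Quick Read】: This paper addresses the performance bottleneck in Detecting and Grounding Multi-Modal Media Manipulation (DGM⁴) caused by ignoring the interference of unreliable unimodal data and by lacking supervision for mining fine-grained tampering traces. The core solution is a Fine-grained Multiple Supervisory (FMS) network built on three kinds of supervision: 1) modality reliability supervision, through the Multimodal Decision Supervised Correction (MDSC) module, which uses unimodal weak supervision to correct the multi-modal decision process; 2) unimodal internal supervision, through the Unimodal Forgery Mining Reinforcement (UFMR) module, which amplifies the difference between real and fake information at both the feature level and the sample level; 3) cross-modal supervision, through the Multimodal Forgery Alignment Reasoning (MFAR) module, which uses soft-attention interactions to perceive cross-modal features from both consistency and inconsistency perspectives and adds interaction constraints to ensure interaction quality.
Link: https://arxiv.org/abs/2508.02479
Authors: Xinquan Yu,Wei Lu,Xiangyang Luo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: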
Abstract:The task of Detecting and Grounding Multi-Modal Media Manipulation (DGM⁴) is a branch of misinformation detection. Unlike traditional binary classification, it includes complex subtasks such as forgery content localization and forgery method classification. Existing methods are often limited in performance because they neglect the erroneous interference caused by unreliable unimodal data and fail to establish comprehensive forgery supervision for mining fine-grained tampering traces. In this paper, we present a Fine-grained Multiple Supervisory (FMS) network, which incorporates modality reliability supervision, unimodal internal supervision and cross-modal supervision to provide comprehensive guidance for DGM⁴ detection. For modality reliability supervision, we propose the Multimodal Decision Supervised Correction (MDSC) module. It leverages unimodal weak supervision to correct the multi-modal decision-making process. For unimodal internal supervision, we propose the Unimodal Forgery Mining Reinforcement (UFMR) module. It amplifies the disparity between real and fake information within each modality from both feature-level and sample-level perspectives. For cross-modal supervision, we propose the Multimodal Forgery Alignment Reasoning (MFAR) module. It utilizes soft-attention interactions to achieve cross-modal feature perception from both consistency and inconsistency perspectives, where we also design the interaction constraints to ensure the interaction quality. Extensive experiments demonstrate the superior performance of our FMS compared to state-of-the-art methods.
[CV-17] Multi-class Image Anomaly Detection for Practical Applications: Requirements and Robust Solutions
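【Quick Read】: This paper addresses the drop in per-class detection accuracy that a shared model architecture causes in multi-class image anomaly detection, focusing on how detection thresholds should be set to keep performance stable depending on whether class labels are used during training and evaluation. Existing methods aim to narrow the gap to class-specific models, but how class information is used and how that affects threshold definition remains understudied. The key to the solution is the Hierarchical Coreset (HierCore) framework, whose core innovation is a hierarchical memory bank that estimates class-wise decision criteria without class labels, so that the stated requirements are met and anomaly detection stays stable and effective under every label-availability condition.
Link: https://arxiv.org/abs/2508.02477
Authors: Jaehyuk Heo,Pilsung Kang
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: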
Abstract:Recent advances in image anomaly detection have extended unsupervised learning-based models from single-class settings to multi-class frameworks, aiming to improve efficiency in training time and model storage. When a single model is trained to handle multiple classes, it often underperforms compared to class-specific models in terms of per-class detection accuracy. Accordingly, previous studies have primarily focused on narrowing this performance gap. However, the way class information is used, or not used, remains a relatively understudied factor that could influence how detection thresholds are defined in multi-class image anomaly detection. These thresholds, whether class-specific or class-agnostic, significantly affect detection outcomes. In this study, we identify and formalize the requirements that a multi-class image anomaly detection model must satisfy under different conditions, depending on whether class labels are available during training and evaluation. We then re-examine existing methods under these criteria. To meet these challenges, we propose Hierarchical Coreset (HierCore), a novel framework designed to satisfy all defined requirements. HierCore operates effectively even without class labels, leveraging a hierarchical memory bank to estimate class-wise decision criteria for anomaly detection. We empirically validate the applicability and robustness of existing methods and HierCore under four distinct scenarios, determined by the presence or absence of class labels in the training and evaluation phases. The experimental results demonstrate that HierCore consistently meets all requirements and maintains strong, stable performance across all settings, highlighting its practical potential for real-world multi-class anomaly detection tasks.
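The difference between class-agnostic and class-specific thresholds can be shown with a small sketch. This is not HierCore itself: it assumes class labels are known and simply takes a nearest-rank quantile of anomaly scores from normal samples; the function names, the quantile `q`, and the scores are all illustrative assumptions.

```python
import math


def quantile_threshold(scores, q=0.95):
    """Nearest-rank quantile of anomaly scores measured on normal samples."""
    if not scores:
        raise ValueError("need at least one score")
    ordered = sorted(scores)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def class_agnostic_threshold(scores_by_class, q=0.95):
    """One threshold computed from the scores of all classes pooled together."""
    pooled = [s for scores in scores_by_class.values() for s in scores]
    return quantile_threshold(pooled, q)


def class_specific_thresholds(scores_by_class, q=0.95):
    """A separate threshold for every class."""
    return {name: quantile_threshold(s, q) for name, s in scores_by_class.items()}


# Made-up scores: the two classes live on very different score scales.
normal_scores = {"screw": [0.1, 0.2, 0.3, 0.4], "cable": [1.0, 2.0, 3.0, 4.0]}
print(class_agnostic_threshold(normal_scores, q=0.75))
print(class_specific_thresholds(normal_scores, q=0.75))
```

With these toy scores the pooled threshold sits far above every "screw" score, so anomalies in that class would be missed, which is the kind of effect that motivates class-wise decision criteria.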
[CV-18] SAMPO: Visual Preference Optimization for Intent-Aware Segmentation with Vision Foundation Models
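【Quick Read】: This paper addresses the "intent gap" of vision foundation models such as the Segment Anything Model (SAM) in promptable segmentation: the model segments only explicitly prompted objects and fails to generalize to the semantically related instances the user implicitly wants. The gap is most acute in scenes with dense homogeneous objects (such as biomedical nuclei segmentation), where sparse visual prompts often yield incomplete results and dense annotation is too costly to be practical. The key to the solution is the SAMPO (Segment Anything Model with Preference Optimization) framework, which uses preference optimization to make the model infer high-level categorical intent from sparse visual interactions instead of relying on language models or pixel-level fine-tuning. Without an auxiliary prompt generator or language-model assistance, the method achieves robust multi-object segmentation and significantly better performance with little data, demonstrating its advantage and data efficiency on medical image segmentation tasks.
Link: https://arxiv.org/abs/2508.02464
Authors: Yonghuang Wu,Wenwen Zeng,Xuan Xie,Chengqian Zhao,Guoqing Wu,Jinhua Yu
Affiliations: Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: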
Abstract:Foundation models like Segment Anything Model (SAM) excel in promptable segmentation but suffer from an intent gap: they segment only explicitly prompted objects, failing to generalize to semantically related instances implicitly desired by users. This limitation is critical in domains with dense homogeneous objects (e.g., biomedical nuclei segmentation), where sparse visual prompts typically yield incomplete results, rendering dense annotations impractical due to prohibitive cost. To bridge this gap, we introduce SAMPO (Segment Anything Model with Preference Optimization), a novel framework that teaches visual foundation models to infer high-level categorical intent from sparse visual interactions. Unlike conventional pixel-level fine-tuning, SAMPO optimizes models to implicitly capture target-class characteristics through preference optimization. This approach, which operates without dependency on language models, enables robust multi-object segmentation even under sparse prompting and demonstrates superior data efficiency during fine-tuning. Validated on three medical segmentation tasks, SAMPO achieves state-of-the-art performance: on challenging tasks like PanNuke-T2, our method, when fine-tuned with only 10% of the training data, significantly outperforms all existing methods trained on the full 100% dataset, achieving an improvement of over 9 percentage points compared to the best baseline. Our work establishes a new paradigm for intent-aware alignment in visual foundation models, removing dependencies on auxiliary prompt generators or language-model-assisted preference learning.
[CV-19] InfoSyncNet: Information Synchronization Temporal Convolutional Network for Visual Speech Recognition
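【Quick Read】: This paper addresses the accurate estimation of spoken content from silent videos, which matters for applications in Assistive Technology (AT) and Augmented Reality (AR). Because lip movement sequences vary across individuals and information is unevenly distributed within each sequence, conventional methods struggle to map them to words accurately. The key to the solution is InfoSyncNet, a network architecture based on non-uniform sequence modeling. Its core innovation is a non-uniform quantization module placed between the encoder and decoder, which dynamically adjusts the model's focus and handles the natural inconsistencies in visual speech data; combined with tailored data augmentation and several training strategies, it makes the model markedly more robust to changes in lighting and speaker pose.
Link: https://arxiv.org/abs/2508.02460
Authors: Junxiao Xue,Xiaozhen Liu,Xuecheng Wu,Fei Yu,Jun Wang
Affiliations: Zhengzhou University; Xi’an Jiaotong University; Zhejiang Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL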
Abstract:Estimating spoken content from silent videos is crucial for applications in Assistive Technology (AT) and Augmented Reality (AR). However, accurately mapping lip movement sequences in videos to words poses significant challenges due to variability across sequences and the uneven distribution of information within each sequence. To tackle this, we introduce InfoSyncNet, a non-uniform sequence modeling network enhanced by tailored data augmentation techniques. Central to InfoSyncNet is a non-uniform quantization module positioned between the encoder and decoder, enabling dynamic adjustment to the network’s focus and effectively handling the natural inconsistencies in visual speech data. Additionally, multiple training strategies are incorporated to enhance the model’s capability to handle variations in lighting and the speaker’s orientation. Comprehensive experiments on the LRW and LRW1000 datasets confirm the superiority of InfoSyncNet, achieving new state-of-the-art accuracies of 92.0% and 60.7% Top-1 ACC. The code is available for download (see comments).
[CV-20] Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Visibility
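【Quick Read】: This paper addresses the lack of reliable uncertainty estimation (UE) for Gaussian Splatting in critical applications such as robotics and medicine. Existing methods usually estimate only the variance of the Gaussian primitives and obtain pixel-wise uncertainty through the rendering process, which makes it hard to capture the true error distribution in complex scenes. The key to the solution is to build primitive representations of training error and view visibility: this information is projected onto the Gaussian primitives, which can then be rendered into uncertainty feature maps. A pixel-wise regression on holdout data over these feature maps yields uncertainty estimates for novel views. The method performs especially well on foreground objects and generalizes across scenes, allowing uncertainty prediction without additional holdout data.
Link: https://arxiv.org/abs/2508.02443
Authors: Thomas Gottwald,Edgar Heinert,Matthias Rottmann
Affiliations: University of Wuppertal; Osnabrück University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: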
Abstract:In this work, we present a novel method for uncertainty estimation (UE) in Gaussian Splatting. UE is crucial for using Gaussian Splatting in critical applications such as robotics and medicine. Previous methods typically estimate the variance of Gaussian primitives and use the rendering process to obtain pixel-wise uncertainties. Our method establishes primitive representations of error and visibility of training views, which carry meaningful uncertainty information. This representation is obtained by projection of training error and visibility onto the primitives. Uncertainties of novel views are obtained by rendering the primitive representations of uncertainty for those novel views, yielding uncertainty feature maps. To aggregate these uncertainty feature maps of novel views, we perform a pixel-wise regression on holdout data. In our experiments, we analyze the different components of our method, investigating various combinations of uncertainty feature maps and regression models. Furthermore, we consider the effect of separating splatting into foreground and background. Our UEs show high correlations to true errors, outperforming state-of-the-art methods, especially on foreground objects. The trained regression models show generalization capabilities to new scenes, allowing uncertainty estimation without the need for holdout data.
[CV-21] Glioblastoma Overall Survival Prediction With Vision Transformers
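【Quick Read】: This paper addresses the prediction of Overall Survival (OS) for glioblastoma patients in order to support personalized treatment strategies and better clinical decisions. Traditional methods usually depend on a tumor segmentation step to extract features, which makes the pipeline complex and computationally expensive. The key to the solution is a new Artificial Intelligence (AI) model based on the Vision Transformer (ViT) that extracts hidden features directly from Magnetic Resonance Imaging (MRI) images without tumor segmentation, simplifying the workflow and reducing computational cost. On the BRATS dataset the method reaches a test accuracy of 62.5%, with balanced precision, recall, and F1 score that exceed those of the best existing model, supporting the applicability and efficiency of ViTs for downsampled medical imaging tasks.
Link: https://arxiv.org/abs/2508.02439
Authors: Yin Lin,Riccardo Barbieri,Domenico Aquino,Giuseppe Lauria,Marina Grisoli,Elena De Momi,Alberto Redaelli,Simona Ferrante
Affiliations: Politecnico di Milano; Fondazione IRCCS Istituto Neurologico Carlo Besta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 4 figures, EMBC2025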
Abstract:Glioblastoma is one of the most aggressive and common brain tumors, with a median survival of 10-15 months. Predicting Overall Survival (OS) is critical for personalizing treatment strategies and aligning clinical decisions with patient outcomes. In this study, we propose a novel Artificial Intelligence (AI) approach for OS prediction using Magnetic Resonance Imaging (MRI) images, exploiting Vision Transformers (ViTs) to extract hidden features directly from MRI images, eliminating the need for tumor segmentation. Unlike traditional approaches, our method simplifies the workflow and reduces computational resource requirements. The proposed model was evaluated on the BRATS dataset, reaching an accuracy of 62.5% on the test set, comparable to the top-performing methods. Additionally, it demonstrated balanced performance across precision, recall, and F1 score, surpassing the best model in these metrics. The dataset size limits the generalization of the ViT, which typically requires larger datasets than convolutional neural networks. This limitation in generalization is observed across all the cited studies. This work highlights the applicability of ViTs for downsampled medical imaging tasks and establishes a foundation for OS prediction models that are computationally efficient and do not rely on segmentation.
[CV-22] HGTS-Former: Hierarchical HyperGraph Transformer for Multivariate Time Series Analysis
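【Quick Read】: This paper addresses the modeling difficulties in multivariate time series analysis that arise from high dimensionality, dynamic behavior, and complex coupling among variables. The key to the solution is HGTS-Former, a hypergraph-based time series transformer backbone: it builds hierarchical hypergraphs to capture the temporal patterns within each channel and the fine-grained relations between variables, converts hyperedges into node features through the EdgeToNode module, and applies a feed-forward network to further strengthen the representation, thereby modeling the complex dependencies in multivariate time series.
Link: https://arxiv.org/abs/2508.02411
Authors: Xiao Wang,Hao Si,Fan Zhang,Xiaoya Zhou,Dengdi Sun,Wanli Lyu,Qingquan Yang,Jin Tang
Affiliations: Anhui University; Institute of Plasma Physics, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: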
Abstract:Multivariate time series analysis has long been one of the key research topics in the field of artificial intelligence. However, analyzing complex time series data remains a challenging and unresolved problem due to its high dimensionality, dynamic nature, and complex interactions among variables. Inspired by the strong structural modeling capability of hypergraphs, this paper proposes a novel hypergraph-based time series transformer backbone network, termed HGTS-Former, to address the multivariate coupling in time series data. Specifically, given the multivariate time series signal, we first normalize and embed each patch into tokens. Then, we adopt the multi-head self-attention to enhance the temporal representation of each patch. The hierarchical hypergraphs are constructed to aggregate the temporal patterns within each channel and fine-grained relations between different variables. After that, we convert the hyperedge into node features through the EdgeToNode module and adopt the feed-forward network to further enhance the output features. Extensive experiments conducted on two multivariate time series tasks and eight datasets fully validated the effectiveness of our proposed HGTS-Former. The source code will be released on this https URL.
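The first step of the pipeline, normalizing a signal and cutting it into patches before embedding, can be sketched as below. The patch length, stride, and function names are hypothetical, and the learned projection that turns each patch into a token is omitted; this only illustrates the preprocessing under those assumptions.

```python
import math


def normalize(series):
    """Scale one channel to zero mean and unit variance."""
    mean = sum(series) / len(series)
    var = sum((x - mean) ** 2 for x in series) / len(series)
    std = math.sqrt(var) or 1.0  # constant series: avoid dividing by zero
    return [(x - mean) / std for x in series]


def to_patches(series, patch_len, stride):
    """Slide a window of patch_len over the series with the given stride."""
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    return [series[i:i + patch_len] for i in range(0, last_start + 1, stride)]


# One toy channel split into three non-overlapping patches.
channel = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(to_patches(normalize(channel), patch_len=2, stride=2))
```

In a full model each patch would be projected to a token, and the same routine would run independently on every channel.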
[CV-23] Hydra: Accurate Multi-Modal Leaf Wetness Sensing with mm-Wave and Camera Fusion
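【Quick Read】: This paper addresses the lack of standardized measurement for Leaf Wetness Duration (LWD) detection and the difficulty existing techniques have in staying accurate and robust across plant characteristics and complex environmental conditions. The key to the solution is the Hydra system, which fuses millimeter-wave (mm-Wave) radar with a visible-light camera. A Convolutional Neural Network (CNN) selectively fuses multiple mm-Wave depth images with an RGB image to generate multiple feature images; a transformer-based encoder then captures the inherent connections among these feature images and outputs a unified feature map, which is fed to a classifier that identifies leaf wetness. The system is implemented on an FMCW radar platform in the 76 to 81 GHz band and uses data augmentation to improve generalization. Field tests show that it holds an accuracy of around 90% in a range of agricultural settings, including rain, dawn, and poorly lit nights, clearly outperforming conventional methods.
Link: https://arxiv.org/abs/2508.02409
Authors: Yimeng Liu,Maolin Gan,Huaili Zeng,Li Liu,Younsuk Dong,Zhichao Cao
Affiliations: Michigan State University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: In Proceedings of ACM MobiCom (2024)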
Abstract:Leaf Wetness Duration (LWD), the time that water remains on leaf surfaces, is crucial in the development of plant diseases. Existing LWD detection lacks standardized measurement techniques, and variations across different plant characteristics limit its effectiveness. Prior research proposes diverse approaches, but they fail to measure real natural leaves directly and lack resilience in various environmental conditions. This reduces the precision and robustness, revealing a notable practical application and effectiveness gap in real-world agricultural settings. This paper presents Hydra, an innovative approach that integrates millimeter-wave (mm-Wave) radar with camera technology to detect leaf wetness by determining if there is water on the leaf. We can measure the time to determine the LWD based on this detection. First, we design a Convolutional Neural Network (CNN) to selectively fuse multiple mm-Wave depth images with an RGB image to generate multiple feature images. Then, we develop a transformer-based encoder to capture the inherent connection among the multiple feature images to generate a feature map, which is further fed to a classifier for detection. Moreover, we augment the dataset during training to generalize our model. Implemented using a frequency-modulated continuous-wave (FMCW) radar within the 76 to 81 GHz band, Hydra's performance is meticulously evaluated on plants, demonstrating the potential to classify leaf wetness with up to 96% accuracy across varying scenarios. When deployed on a farm under rainy, dawn, or poorly lit night conditions, Hydra still achieves an accuracy of around 90%.
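The step from per-sample wet/dry detections to a duration can be sketched as follows. The abstract only says the duration is measured from the detections, so the interval-counting rule, the function name, and the sample readings below are assumptions for illustration.

```python
def leaf_wetness_duration(timestamps, is_wet):
    """Total wet time, in the same unit as the timestamps.

    Assumption: the interval from one reading to the next counts as wet
    when the earlier reading was classified wet.
    """
    if len(timestamps) != len(is_wet):
        raise ValueError("need one detection per timestamp")
    total = 0.0
    for i in range(len(timestamps) - 1):
        if is_wet[i]:
            total += timestamps[i + 1] - timestamps[i]
    return total


# Hypothetical readings every 10 minutes.
minutes = [0, 10, 20, 30, 40]
detections = [False, True, True, False, True]
print(leaf_wetness_duration(minutes, detections))
```

The final reading has no following interval, so it does not add to the total under this rule.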
[CV-24] Improving Generalization of Language-Conditioned Robot Manipulation
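【Quick Read】: This paper addresses the need for large amounts of data to fine-tune a Vision-Language Model (VLM) when a robot performs object-arrangement tasks in unseen environments. The core challenge is achieving efficient generalization and zero-shot transfer from only a few demonstrations. The key to the solution is a two-stage framework: a target localization stage that identifies the object specified by the natural language instruction, and a region determination stage that plans where to place it. Its main innovation is an instance-level semantic fusion module that aligns instance-level image crops with the text embedding, so the model can accurately identify the target objects defined in natural language, which significantly improves zero-shot manipulation on real robots.
Link: https://arxiv.org/abs/2508.02405
Authors: Chenglin Cui,Chaoran Zhu,Changjae Oh,Andrea Cavallaro
Affiliations: Queen Mary University of London; Idiap Research Institute; École Polytechnique Fédérale de Lausanne
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages,18 figures,2 tables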
Abstract:The control of robots for manipulation tasks generally relies on visual input. Recent advances in vision-language models (VLMs) enable the use of natural language instructions to condition visual input and control robots in a wider range of environments. However, existing methods require a large amount of data to fine-tune VLMs for operating in unseen environments. In this paper, we present a framework that learns object-arrangement tasks from just a few demonstrations. We propose a two-stage framework that divides object-arrangement tasks into a target localization stage, for picking the object, and a region determination stage for placing the object. We present an instance-level semantic fusion module that aligns the instance-level image crops with the text embedding, enabling the model to identify the target objects defined by the natural language instructions. We validate our method on both simulation and real-world robotic environments. Our method, fine-tuned with a few demonstrations, improves generalization capability and demonstrates zero-shot ability in real-robot manipulation scenarios.
[CV-25] ε-Softmax: Approximating One-Hot Vectors for Mitigating Label Noise NEURIPS2024
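【Quick Read】: This paper addresses the performance degradation that label noise causes when training deep neural networks. Existing methods mostly rely on symmetric losses for robustness, but the overly strict symmetric condition often leads to underfitting. The key to the solution is a simple and effective method called ε-softmax, which relaxes the symmetric condition by modifying the softmax layer's outputs to approximate one-hot vectors with a controllable error ε. The theoretical analysis shows that ε-softmax achieves noise-tolerant learning with a controllable excess risk bound for almost any loss function. To balance robustness against fitting ability on clean data, ε-softmax-enhanced losses are further combined with a symmetric loss, giving a better trade-off.
Link: https://arxiv.org/abs/2508.02387
Authors: Jialiang Wang,Xiong Zhou,Deming Zhai,Junjun Jiang,Xiangyang Ji,Xianming Liu
Affiliations: Harbin Institute of Technology; Tsinghua University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS2024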
Abstract:Noisy labels pose a common challenge for training accurate deep neural networks. To mitigate label noise, prior studies have proposed various robust loss functions to achieve noise tolerance in the presence of label noise, particularly symmetric losses. However, they usually suffer from the underfitting issue due to the overly strict symmetric condition. In this work, we propose a simple yet effective approach for relaxing the symmetric condition, namely ε-softmax, which simply modifies the outputs of the softmax layer to approximate one-hot vectors with a controllable error ε. Essentially, ε-softmax not only acts as an alternative for the softmax layer, but also implicitly plays the crucial role in modifying the loss function. We prove theoretically that ε-softmax can achieve noise-tolerant learning with controllable excess risk bound for almost any loss function. Recognizing that ε-softmax-enhanced losses may slightly reduce fitting ability on clean datasets, we further incorporate them with one symmetric loss, thereby achieving a better trade-off between robustness and effective learning. Extensive experiments demonstrate the superiority of our method in mitigating synthetic and real-world label noise. The code is available at this https URL.
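One simple way to push softmax outputs toward a one-hot vector with a bounded error is sketched below. The abstract does not give the formula, so this construction (add a mass `m` to the largest entry, then renormalize) and the name `eps_softmax_sketch` are assumptions, not necessarily the paper's definition. Under this construction the L1 distance to the one-hot vector at the arg-max is at most 2 / (1 + m).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def eps_softmax_sketch(logits, m):
    """Add mass m >= 0 to the arg-max probability, then divide by 1 + m."""
    if m < 0:
        raise ValueError("m must be non-negative")
    probs = softmax(logits)
    top = max(range(len(probs)), key=probs.__getitem__)
    return [(p + (m if i == top else 0.0)) / (1.0 + m) for i, p in enumerate(probs)]


logits = [1.0, 0.0, 0.0]
for m in (0.0, 9.0):
    out = eps_softmax_sketch(logits, m)
    l1_to_one_hot = (1.0 - out[0]) + out[1] + out[2]
    print(m, [round(p, 4) for p in out], round(l1_to_one_hot, 4))
```

A larger `m` gives outputs closer to one-hot, while `m = 0` recovers the ordinary softmax.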
[CV-26] Enhancing Object Discovery for Unsupervised Instance Segmentation and Object Detection
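[Quick read]: This paper addresses the lack of high-quality pseudo labels in unsupervised instance segmentation and object detection, and the reliance of existing models on complex post-processing or specially designed loss functions. The key to the solution is a simple framework named Cut-Once-and-LEaRn (COLER). It first generates coarse pseudo labels with the authors' CutOnce method, which applies Normalized Cut only once yet produces multiple object masks per image, without clustering algorithms or mask post-processing. A detector then learns from these pseudo labels, and self-training improves it further, giving zero-shot unsupervised object localization and segmentation without human annotation. The authors report that COLER outperforms previous state-of-the-art methods in the unsupervised setting.
Link: https://arxiv.org/abs/2508.02386
Authors: Xingyu Feng,Hebei Gao,Hong Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: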
Abstract:We propose Cut-Once-and-LEaRn (COLER), a simple approach for unsupervised instance segmentation and object detection. COLER first uses our developed CutOnce to generate coarse pseudo labels, then enables the detector to learn from these masks. CutOnce applies Normalized Cut only once and does not rely on any clustering methods, but it can generate multiple object masks in an image. We have designed several novel yet simple modules that not only allow CutOnce to fully leverage the object discovery capabilities of self-supervised models, but also free it from reliance on mask post-processing. During training, COLER achieves strong performance without requiring specially designed loss functions for pseudo labels, and its performance is further improved through self-training. COLER is a zero-shot unsupervised model that outperforms previous state-of-the-art methods on multiple this http URL. We believe our method can help advance the field of unsupervised object localization.
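For readers unfamiliar with Normalized Cut, the sketch below shows the classical spectral bipartition (thresholding the second-smallest generalized eigenvector of the graph Laplacian) on a toy affinity matrix. It is only the textbook building block: the function names and the toy graph are hypothetical, and the modules that let CutOnce obtain multiple masks from a single cut are not described in the abstract and are not reproduced here.

```python
import numpy as np


def normalized_cut_bipartition(W):
    """Split a graph in two with the classical Normalized Cut relaxation.

    W is a symmetric, non-negative affinity matrix with positive node degrees.
    Returns a boolean array marking one side of the cut.
    """
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                      # node degrees
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)     # eigenvalues in ascending order
    # Map the second eigenvector back to the generalized problem (D - W) y = lambda D y
    fiedler = d_inv_sqrt * eigvecs[:, 1]
    return fiedler > 0.0


def toy_affinity():
    """Two tightly connected groups of three nodes joined by weak edges."""
    W = np.full((6, 6), 0.01)
    W[:3, :3] = 1.0
    W[3:, 3:] = 1.0
    np.fill_diagonal(W, 0.0)
    return W
```

In an object-discovery setting the nodes would typically be image patches and the affinities would come from self-supervised features; that mapping is an assumption here, not something the abstract specifies.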
[CV-27] SMART-Ship: A Comprehensive Synchronized Multi-modal Aligned Remote Sensing Targets Dataset and Benchmark for Berthed Ships Analysis
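[Quick read]: This paper addresses the challenges that multi-scale targets and dynamic environments pose for maritime target monitoring, in particular the under-use of multi-modal remote sensing (RS) data. The key to the solution is the SMART-Ship dataset, which contains spatiotemporally registered images in five modalities (visible-light, synthetic aperture radar (SAR), panchromatic, multi-spectral, and near-infrared): 1092 multi-modal image sets covering 38,838 ships, with fine-grained annotations including polygonal locations, fine-grained categories, instance-level identifiers, and change region masks. These properties let the dataset support diverse multi-modal RS tasks and provide a standardized benchmark for follow-up research.
Link: https://arxiv.org/abs/2508.02384
Authors: Chen-Chen Fan,Peiyao Guo,Linping Zhang,Kehan Qi,Haolin Huang,Yong-Qiang Mao,Yuxi Suo,Zhizhuo Jiang,Yu Liu,You He
Affiliations: Tsinghua University; Tsinghua Shenzhen International Graduate School
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: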
Abstract:Given the limitations of satellite orbits and imaging conditions, multi-modal remote sensing (RS) data is crucial in enabling long-term earth observation. However, maritime surveillance remains challenging due to the complexity of multi-scale targets and the dynamic environments. To bridge this critical gap, we propose a Synchronized Multi-modal Aligned Remote sensing Targets dataset for berthed ships analysis (SMART-Ship), containing spatiotemporal registered images with fine-grained annotation for maritime targets from five modalities: visible-light, synthetic aperture radar (SAR), panchromatic, multi-spectral, and near-infrared. Specifically, our dataset consists of 1092 multi-modal image sets, covering 38,838 ships. Each image set is acquired within one week and registered to ensure spatiotemporal consistency. Ship instances in each set are annotated with polygonal location information, fine-grained categories, instance-level identifiers, and change region masks, organized hierarchically to support diverse multi-modal RS tasks. Furthermore, we define standardized benchmarks on five fundamental tasks and comprehensively compare representative methods across the dataset. Thorough experiment evaluations validate that the proposed SMART-Ship dataset could support various multi-modal RS interpretation tasks and reveal the promising directions for further exploration.
[CV-28] Uni-Layout: Integrating Human Feedback in Unified Layout Generation and Evaluation ACM-MM2025
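[Quick read]: This paper addresses two core problems in current layout generation methods: task-specific generation capabilities that limit applicability, and evaluation metrics that are misaligned with human perception, which makes measurement ineffective. The key to the solution is a unified framework, Uni-Layout, with three main contributions: (1) it incorporates various layout tasks into a single taxonomy and uses natural language prompts for universal generation under background or element constraints; (2) it builds Layout-HF100k, the first large-scale human feedback dataset (100,000 expertly annotated layouts), and designs a human-mimicking evaluator that integrates visual and geometric information, using a Chain-of-Thought mechanism for qualitative assessment and a confidence estimation module for quantitative scores; (3) it adopts Dynamic-Margin Preference Optimization (DMPO), which dynamically adjusts margins according to preference strength so that the generator and evaluator align more closely with human preferences.
Link: https://arxiv.org/abs/2508.02374
Authors: Shuo Lu,Yanyin Chen,Wei Feng,Jiahao Fan,Fengheng Li,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law,Jian Liang
Affiliations: NLPR & MAIS, CASIA; School of AI, UCAS; JD.COM
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to ACM MM 2025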
Abstract:Layout generation plays a crucial role in enhancing both user experience and design efficiency. However, current approaches suffer from task-specific generation capabilities and perceptually misaligned evaluation metrics, leading to limited applicability and ineffective measurement. In this paper, we propose Uni-Layout, a novel framework that achieves unified generation, human-mimicking evaluation and alignment between the two. For universal generation, we incorporate various layout tasks into a single taxonomy and develop a unified generator that handles background or element contents constrained tasks via natural language prompts. To introduce human feedback for the effective evaluation of layouts, we build Layout-HF100k, the first large-scale human feedback dataset with 100,000 expertly annotated layouts. Based on Layout-HF100k, we introduce a human-mimicking evaluator that integrates visual and geometric information, employing a Chain-of-Thought mechanism to conduct qualitative assessments alongside a confidence estimation module to yield quantitative measurements. For better alignment between the generator and the evaluator, we integrate them into a cohesive system by adopting Dynamic-Margin Preference Optimization (DMPO), which dynamically adjusts margins based on preference strength to better align with human judgments. Extensive experiments show that Uni-Layout significantly outperforms both task-specific and general-purpose methods. Our code is publicly available at this https URL.
[CV-29] TRUDI and TITUS: A Multi-Perspective Dataset and A Three-Stage Recognition System for Transportation Unit Identification BMVC
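[Quick read]: This paper addresses inefficient identification of transportation units (TUs) in port logistics, where the core obstacle is the lack of publicly available benchmark datasets that capture the diverse, dynamic scenes of real port environments. The authors present the TRUDI dataset, with 35,034 annotated instances across five categories (container, tank container, trailer, ID text, and logo), captured by ground-based and aerial cameras under varied lighting and weather conditions. The key to the solution is the TITUS pipeline, which identifies TUs in three stages: (1) instance segmentation, (2) ID text localization, and (3) ID recognition and validation. Unlike conventional systems that depend on specific viewpoints or setups, TITUS remains robust across camera angles and environmental conditions, which improves the practicality and generalization of TU identification.
Link: https://arxiv.org/abs/2508.02372
Authors: Emre Gülsoylu,André Kelm,Lennart Bengtson,Matthias Hirsch,Christian Wilms,Tim Rolff,Janick Edinger,Simone Frintrop
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 2 figures, 6 tables. Author version of the paper. Accepted for publication in The 36th British Machine Vision Conference (BMVC) 2025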
Abstract:Identifying transportation units (TUs) is essential for improving the efficiency of port logistics. However, progress in this field has been hindered by the lack of publicly available benchmark datasets that capture the diversity and dynamics of real-world port environments. To address this gap, we present the TRUDI dataset, a comprehensive collection comprising 35,034 annotated instances across five categories: container, tank container, trailer, ID text, and logo. The images were captured at operational ports using both ground-based and aerial cameras, under a wide variety of lighting and weather conditions. For the identification of TUs, which involves reading the 11-digit alphanumeric ID typically painted on each unit, we introduce TITUS, a dedicated pipeline that operates in three stages: (1) segmenting the TU instances, (2) detecting the location of the ID text, and (3) recognising and validating the extracted ID. Unlike alternative systems, which often require similar scenes, specific camera angles or gate setups, our evaluation demonstrates that TITUS reliably identifies TUs from a range of camera perspectives and in varying lighting and weather conditions. By making the TRUDI dataset publicly available, we provide a robust benchmark that enables the development and comparison of new approaches. This contribution supports digital transformation efforts in multipurpose ports and helps to increase the efficiency of entire logistics chains.
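The "validating the extracted ID" step can be illustrated with the ISO 6346 check digit used for freight container codes (four letters, six digits, one check digit). The abstract does not say which validation rule TITUS applies, so treating it as ISO 6346 is an assumption, and the function names below are hypothetical. A minimal sketch:

```python
import string


def _letter_values():
    """ISO 6346 letter values: A=10, B=12, ..., Z=38, skipping multiples of 11."""
    values, v = {}, 10
    for ch in string.ascii_uppercase:
        if v % 11 == 0:
            v += 1
        values[ch] = v
        v += 1
    return values


LETTER_VALUES = _letter_values()


def iso6346_check_digit(first_ten):
    """Check digit for the first ten characters (4 letters + 6 digits)."""
    total = 0
    for pos, ch in enumerate(first_ten.upper()):
        value = LETTER_VALUES[ch] if ch in LETTER_VALUES else int(ch)
        total += value * (2 ** pos)        # weights 1, 2, 4, ..., 512
    return (total % 11) % 10               # a remainder of 10 maps to 0


def is_valid_unit_id(unit_id):
    """True if unit_id has the 4+6+1 layout and a matching check digit."""
    unit_id = unit_id.replace(" ", "").upper()
    if len(unit_id) != 11:
        return False
    if not all(c in LETTER_VALUES for c in unit_id[:4]):
        return False
    if not all(c in string.digits for c in unit_id[4:]):
        return False
    return iso6346_check_digit(unit_id[:10]) == int(unit_id[10])
```

A check like this lets an OCR stage reject most single-character misreads before the ID is passed downstream, for example `is_valid_unit_id("CSQU3054383")` accepts the code while changing its last digit makes it fail.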
[CV-30] Transport-Guided Rectified Flow Inversion: Improved Image Editing Using Optimal Transport Theory WACV
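[Quick read]: This paper addresses image inversion in rectified flow models, that is, how to keep reconstruction fidelity high while improving editing flexibility for high-quality image editing. Existing methods struggle to balance reconstruction accuracy against editing controllability. The proposed Optimal Transport Inversion Pipeline (OTIP) uses optimal transport theory to guide the generation path during the reverse diffusion process, trading the two off through principled trajectory optimization. Its key idea is to compute optimal transport paths between the image and noise distributions while keeping computation efficient. Reported results include 7.8% to 12.9% improvements in reconstruction loss over RF-Inversion on LSUN-Bedroom and LSUN-Church, respectively, and, for semantic face editing, an 11.2% improvement in identity preservation and a 1.6% improvement in perceptual quality, at a computational cost comparable to baseline approaches.
Link: https://arxiv.org/abs/2508.02363
Authors: Marian Lupascu,Mihai-Sorin Stupariu
Affiliations: University of Bucharest
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 24 figures, WACV conference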
Abstract:Effective image inversion in rectified flow models - mapping real images to editable latent representations - is crucial for practical image editing applications; however, achieving optimal balance between reconstruction fidelity and editing flexibility remains a fundamental challenge. In this work, we introduce the Optimal Transport Inversion Pipeline (OTIP), a zero-shot framework that leverages optimal transport theory to guide the inversion process in rectified flow models. Our underlying hypothesis is that incorporating transport-based guidance during the reverse diffusion process can effectively balance reconstruction accuracy and editing controllability through principled trajectory optimization. The method computes optimal transport paths between image and noise distributions while maintaining computational efficiency. Our approach achieves high-fidelity reconstruction with LPIPS scores of 0.001 and SSIM of 0.992 on face editing benchmarks, demonstrating superior preservation of fine-grained details compared to existing methods. We evaluate the framework across multiple editing tasks, observing 7.8% to 12.9% improvements in reconstruction loss over RF-Inversion on the LSUN-Bedroom and LSUN-Church datasets, respectively. For semantic face editing, our method achieves an 11.2% improvement in identity preservation and a 1.6% enhancement in perceptual quality, while maintaining computational efficiency comparable to baseline approaches. Qualitatively, our method produces visually compelling edits with superior semantic consistency and fine-grained detail preservation across diverse editing scenarios. Code is available at: this https URL
[CV-31] Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering
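[Quick read]: This paper addresses the difficulty of aligning linguistic meaning with facial articulation when generating semantically coherent and visually accurate talking faces. Existing audio-driven methods rely on high-quality paired audio-visual data and face an inherent ambiguity in mapping acoustics to lip motion, which limits scalability and robustness. The key to the solution is Text2Lip, a viseme-centric framework that embeds textual input into structured viseme sequences, building an interpretable phonetic-visual bridge that serves as a linguistically grounded prior for lip motion prediction. It also uses a progressive viseme-audio replacement strategy based on curriculum learning, in which the model gradually transitions from real audio to pseudo-audio reconstructed from enhanced viseme features via cross-modal attention, so that generation is robust in both audio-present and audio-free scenarios.
Link: https://arxiv.org/abs/2508.02362
Authors: Xu Wang,Shengeng Tang,Fei Wang,Lechao Cheng,Dan Guo,Feng Xue,Richang Hong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: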
Abstract:Generating semantically coherent and visually accurate talking faces requires bridging the gap between linguistic meaning and facial articulation. Although audio-driven methods remain prevalent, their reliance on high-quality paired audio visual data and the inherent ambiguity in mapping acoustics to lip motion pose significant challenges in terms of scalability and robustness. To address these issues, we propose Text2Lip, a viseme-centric framework that constructs an interpretable phonetic-visual bridge by embedding textual input into structured viseme sequences. These mid-level units serve as a linguistically grounded prior for lip motion prediction. Furthermore, we design a progressive viseme-audio replacement strategy based on curriculum learning, enabling the model to gradually transition from real audio to pseudo-audio reconstructed from enhanced viseme features via cross-modal attention. This allows for robust generation in both audio-present and audio-free scenarios. Finally, a landmark-guided renderer synthesizes photorealistic facial videos with accurate lip synchronization. Extensive evaluations show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness, establishing a new paradigm for controllable and flexible talking face generation. Our project homepage is this https URL.
[CV-32] mmWave Radar-Based Non-Line-of-Sight Pedestrian Localization at T-Junctions Utilizing Road Layout Extraction via Camera
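[Quick read]: This paper addresses pedestrian localization in Non-Line-of-Sight (NLoS) regions of urban environments, a significant challenge for autonomous driving systems. In NLoS scenes, mmWave radar suffers from multipath reflections, which distort its 2D radar point cloud (PCD) and make accurate spatial inference difficult; cameras provide high-resolution visual information but lack depth perception and cannot directly observe objects in NLoS regions. The key to the solution is a novel framework that interprets the radar PCD using road layout inferred from the camera, enabling spatial scene reconstruction and localization of NLoS pedestrians. The approach is validated with a radar-camera system mounted on a real vehicle, on a dataset collected in outdoor NLoS driving environments.
Link: https://arxiv.org/abs/2508.02348
Authors: Byeonggyu Park,Hee-Yeun Kim,Byonghyok Choi,Hansang Cho,Byungkwan Kim,Soomok Lee,Mingu Jeon,Seong-Woo Kim
Affiliations: Seoul National University; Samsung Electro-Mechanics Co., Ltd.; Chungnam National University; Ajou University; Institute of Engineering Research at Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: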
Abstract:Pedestrian localization in Non-Line-of-Sight (NLoS) regions within urban environments poses a significant challenge for autonomous driving systems. While mmWave radar has demonstrated potential for detecting objects in such scenarios, the 2D radar point cloud (PCD) data is susceptible to distortions caused by multipath reflections, making accurate spatial inference difficult. Additionally, although camera images provide high-resolution visual information, they lack depth perception and cannot directly observe objects in NLoS regions. In this paper, we propose a novel framework that interprets radar PCD through road layout inferred from camera for localization of NLoS pedestrians. The proposed method leverages visual information from the camera to interpret 2D radar PCD, enabling spatial scene reconstruction. The effectiveness of the proposed approach is validated through experiments conducted using a radar-camera system mounted on a real vehicle. The localization performance is evaluated using a dataset collected in outdoor NLoS driving environments, demonstrating the practical applicability of the method.
[CV-33] Learning Partially-Decorrelated Common Spaces for Ad-hoc Video Search
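[Quick read]: This paper addresses incomplete retrieval in Ad-hoc Video Search (AVS) caused by the visual diversity of relevant videos: a single textual query can match videos with many different scenes, environments, and styles, and existing methods that fuse multiple features into one or a few common spaces fail to capture this diversity. The key to the solution is LPD (Learning Partially Decorrelated common spaces), with two core innovations: (1) feature-specific common space construction, which learns a separate common space for each video and text feature to exploit each feature's expressive capability; and (2) a de-correlation loss that diversifies the ordering of negative samples across spaces, improving result diversity. An entropy-based fair multi-space triplet ranking loss is also designed to keep multi-space convergence consistent. Experiments on the TRECVID AVS benchmarks (2016-2023) support the effectiveness of LPD, and visualizations of its spaces highlight its ability to enhance result diversity.
Link: https://arxiv.org/abs/2508.02340
Authors: Fan Hu,Zijie Xin,Xirong Li
Affiliations: Renmin University of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: Accepted by ACM MM 2025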
Abstract:Ad-hoc Video Search (AVS) involves using a textual query to search for multiple relevant videos in a large collection of unlabeled short videos. The main challenge of AVS is the visual diversity of relevant videos. A simple query such as “Find shots of a man and a woman dancing together indoors” can span a multitude of environments, from brightly lit halls and shadowy bars to dance scenes in black-and-white animations. It is therefore essential to retrieve relevant videos as comprehensively as possible. Current solutions for the AVS task primarily fuse multiple features into one or more common spaces, yet overlook the need for diverse spaces. To fully exploit the expressive capability of individual features, we propose LPD, short for Learning Partially Decorrelated common spaces. LPD incorporates two key innovations: feature-specific common space construction and the de-correlation loss. Specifically, LPD learns a separate common space for each video and text feature, and employs de-correlation loss to diversify the ordering of negative samples across different spaces. To enhance the consistency of multi-space convergence, we designed an entropy-based fair multi-space triplet ranking loss. Extensive experiments on the TRECVID AVS benchmarks (2016-2023) justify the effectiveness of LPD. Moreover, diversity visualizations of LPD’s spaces highlight its ability to enhance result diversity.
[CV-34] Correspondence-Free Fast and Robust Spherical Point Pattern Registration ICCV2025
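[Quick read]: This paper addresses the computational cost and robustness of rotation estimation between two spherical patterns. Existing methods typically maximize the cross-correlation between spherical functions, have time complexity greater than cubic, O(n^3), with respect to the rotation space discretization, and lack extensive evaluation under significant outlier contamination. The key to the solution is to represent spherical patterns explicitly as discrete 3D point sets on the unit sphere, recasting rotation estimation as spherical point-set alignment, which fits naturally into the Wahba problem framework (optimal rotational alignment of unit vectors). The authors propose three new algorithms: SPMC (Spherical Pattern Matching by Correlation), FRS (Fast Rotation Search), and a hybrid (SPMC+FRS). These run in linear time, O(n), and in correspondence-free settings are reported to be over 10x faster and over 10x more accurate than current state-of-the-art methods.
Link: https://arxiv.org/abs/2508.02339
Authors: Anik Sarker,Alan T. Asbeck
Affiliations: Virginia Tech
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted at ICCV 2025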
Abstract:Existing methods for rotation estimation between two spherical ($\mathbb{S}^2$) patterns typically rely on spherical cross-correlation maximization between two spherical functions. However, these approaches exhibit computational complexities greater than cubic O(n^3) with respect to rotation space discretization and lack extensive evaluation under significant outlier contamination. To this end, we propose a rotation estimation algorithm between two spherical patterns with linear time complexity O(n). Unlike existing spherical-function-based methods, we explicitly represent spherical patterns as discrete 3D point sets on the unit sphere, reformulating rotation estimation as a spherical point-set alignment (i.e., Wahba problem for 3D unit vectors). Given the geometric nature of our formulation, our spherical pattern alignment algorithm naturally aligns with the Wahba problem framework for 3D unit vectors. Specifically, we introduce three novel algorithms: (1) SPMC (Spherical Pattern Matching by Correlation), (2) FRS (Fast Rotation Search), and (3) a hybrid approach (SPMC+FRS) that combines the advantages of the previous two methods. Our experiments demonstrate that in the $\mathbb{S}^2$ domain and in correspondence-free settings, our algorithms are over 10x faster and over 10x more accurate than current state-of-the-art methods for the Wahba problem with outliers. We validate our approach through extensive simulations on a new dataset of spherical patterns, the "Robust Vector Alignment Dataset." Furthermore, we adapt our methods to two real-world tasks: (i) Point Cloud Registration (PCR) and (ii) rotation estimation for spherical images.
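As background for the Wahba problem mentioned above, the sketch below shows its classical closed-form SVD solution for the easy case where correspondences are known and there are no outliers. This is deliberately not the paper's SPMC or FRS algorithm, which target the correspondence-free setting; the function name `solve_wahba` and the synthetic data in the usage note are illustrative assumptions.

```python
import numpy as np


def solve_wahba(a, b, weights=None):
    """Rotation R minimizing sum_i w_i * ||b_i - R a_i||^2 for matched unit vectors.

    a, b: arrays of shape (n, 3); row i of b corresponds to row i of a.
    Returns a proper rotation matrix (determinant +1).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    w = np.ones(len(a)) if weights is None else np.asarray(weights, dtype=float)
    # Attitude profile matrix B = sum_i w_i * b_i a_i^T  (3 x 3)
    B = (w[:, None] * b).T @ a
    U, _, Vt = np.linalg.svd(B)
    # Flip the last axis if needed so the result is a rotation, not a reflection
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    return U @ np.diag([1.0, 1.0, d]) @ Vt
```

A quick check is to sample random unit vectors, rotate them by a known matrix, and confirm that `solve_wahba` returns that matrix. Once correspondences are unknown or contaminated by outliers this closed form no longer applies directly, which is the setting the abstract targets.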
[CV-35] CLIP-IN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions
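[Quick read]: This paper addresses the limited fine-grained visual understanding of current Vision-Language Models (VLMs) such as CLIP, which struggle to distinguish subtle visual differences in images and tie them to the right semantics. The key to the solution is the CLIP-IN framework, built on two mechanisms. First, it uses instruction-editing datasets, originally designed for image manipulation, as a source of hard negative image-text pairs and trains with a symmetric hard negative contrastive loss, which sharpens the model's ability to tell apart subtle visual-semantic differences. Second, it incorporates long descriptive captions and uses rotary positional encodings to capture rich semantic context. Experiments show substantial gains on MMVP and other fine-grained tasks without compromising zero-shot classification and retrieval, and integrating CLIP-IN's visual representations into Multimodal Large Language Models reduces visual hallucinations and enhances reasoning.
Link: https://arxiv.org/abs/2508.02329
Authors: Ziteng Wang,Siqi Yang,Limeng Qiao,Lin Ma
Affiliations: South China Normal University; Meituan Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: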
Abstract:Despite the success of Vision-Language Models (VLMs) like CLIP in aligning vision and language, their proficiency in detailed, fine-grained visual comprehension remains a key challenge. We present CLIP-IN, a novel framework that bolsters CLIP’s fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Coupled with a symmetric hard negative contrastive loss, this enables the model to effectively distinguish subtle visual-semantic differences. Secondly, CLIP-IN incorporates long descriptive captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP. Our experiments demonstrate that CLIP-IN achieves substantial gains on the MMVP benchmark and various fine-grained visual recognition tasks, without compromising robust zero-shot performance on broader classification and retrieval tasks. Critically, integrating CLIP-IN’s visual representations into Multimodal Large Language Models significantly reduces visual hallucinations and enhances reasoning abilities. This work underscores the considerable potential of synergizing targeted, instruction-based contrastive learning with comprehensive descriptive information to elevate the fine-grained understanding of VLMs.
[CV-36] Qwen-Image Technical Report
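[Quick read]: This paper addresses the challenges image generation models face in complex text rendering and precise image editing. For complex text rendering, the key is a comprehensive data pipeline covering large-scale data collection, filtering, annotation, synthesis, and balancing, combined with a progressive training strategy that moves from non-text to text rendering and from simple to complex textual inputs. This curriculum learning substantially improves native text rendering for both alphabetic languages such as English and logographic languages such as Chinese. For image editing consistency, the core idea is an improved multi-task training paradigm that combines text-to-image (T2I), text-image-to-image (TI2I), and image-to-image (I2I) reconstruction tasks, aligning the latent representations of Qwen2.5-VL and MMDiT. A dual-encoding mechanism feeds the original image separately into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively, so that the editing module can balance semantic consistency against visual fidelity.
Link: https://arxiv.org/abs/2508.02324
Authors: Chenfei Wu,Jiahao Li,Jingren Zhou,Junyang Lin,Kaiyuan Gao,Kun Yan,Sheng-ming Yin,Shuai Bai,Xiao Xu,Yilei Chen,Yuxiang Chen,Zecheng Tang,Zekai Zhang,Zhengyi Wang,An Yang,Bowen Yu,Chen Cheng,Dayiheng Liu,Deqing Li,Hang Zhang,Hao Meng,Hu Wei,Jingyuan Ni,Kai Chen,Kuan Cao,Liang Peng,Lin Qu,Minggang Wu,Peng Wang,Shuting Yu,Tingkun Wen,Wensen Feng,Xiaoxiao Xu,Yi Wang,Yichang Zhang,Yongqiang Zhu,Yujia Wu,Yuxuan Cai,Zenan Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL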
Abstract:We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model’s native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.
[CV-37] Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images ICCV2025
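[Quick read]: This paper addresses volumetric scene reconstruction from a single image, a task that matters for applications such as autonomous driving and robotics. Existing methods generally require expensive 3D ground truth or multi-view supervision, which limits their practicality. The key to the solution is to use pre-trained 2D diffusion models and depth prediction models to generate synthetic scene geometry, and then use that geometry to distill a feed-forward scene reconstruction model, achieving strong reconstruction without multi-view supervision. Experiments on the challenging KITTI-360 and Waymo datasets show that the method matches or outperforms state-of-the-art baselines that use multi-view supervision, with particular advantages for dynamic scenes.
Link: https://arxiv.org/abs/2508.02323
Authors: Philipp Wulff,Felix Wimbauer,Dominik Muhle,Daniel Cremers
Affiliations: Technical University of Munich; MCML; SE3 Labs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Website: this https URL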
Abstract:Volumetric scene reconstruction from a single image is crucial for a broad range of applications like autonomous driving and robotics. Recent volumetric reconstruction methods achieve impressive results, but generally require expensive 3D ground truth or multi-view supervision. We propose to leverage pre-trained 2D diffusion models and depth prediction models to generate synthetic scene geometry from a single image. This can then be used to distill a feed-forward scene reconstruction model. Our experiments on the challenging KITTI-360 and Waymo datasets demonstrate that our method matches or outperforms state-of-the-art baselines that use multi-view supervision, and offers unique advantages, for example regarding dynamic scenes.
[CV-38] Zero-shot Compositional Action Recognition with Neural Logic Constraints ACM-MM2025
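[Quick read]: This paper addresses two key challenges in Zero-shot Compositional Action Recognition (ZS-CAR): the missing compositional structure constraint, which produces spurious correlations between verb and object primitives, and the neglected semantic hierarchy constraint, which causes semantic ambiguity and impairs training. The core of the solution is LogicCAR, a logic-driven framework that embeds dual symbolic constraints to enable human-like symbolic reasoning. Explicit Compositional Logic models the restrictions within compositions and strengthens compositional reasoning; Hierarchical Primitive Logic captures the semantic dependencies among primitives and gives the model fine-to-coarse reasoning capacity. By formalizing these constraints in first-order logic and embedding them into neural network architectures, LogicCAR systematically bridges the gap between symbolic abstraction and existing models.
Link: https://arxiv.org/abs/2508.02320
Authors: Gefan Ye,Lin Li,Kexin Li,Jun Xiao,Long chen
Affiliations: Zhejiang University; The Hong Kong University of Science and Technology; AI Chip Center for Emerging Smart Systems; Zhejiang Tobacco Monopoly Administration
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 6 figures; Accepted by ACM MM 2025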
Abstract:Zero-shot compositional action recognition (ZS-CAR) aims to identify unseen verb-object compositions in the videos by exploiting the learned knowledge of verb and object primitives during training. Despite compositional learning’s progress in ZS-CAR, two critical challenges persist: 1) Missing compositional structure constraint, leading to spurious correlations between primitives; 2) Neglecting semantic hierarchy constraint, leading to semantic ambiguity and impairing the training process. In this paper, we argue that human-like symbolic reasoning offers a principled solution to these challenges by explicitly modeling compositional and hierarchical structured abstraction. To this end, we propose a logic-driven ZS-CAR framework LogicCAR that integrates dual symbolic constraints: Explicit Compositional Logic and Hierarchical Primitive Logic. Specifically, the former models the restrictions within the compositions, enhancing the compositional reasoning ability of our model. The latter investigates the semantical dependencies among different primitives, empowering the models with fine-to-coarse reasoning capacity. By formalizing these constraints in first-order logic and embedding them into neural network architectures, LogicCAR systematically bridges the gap between symbolic abstraction and existing models. Extensive experiments on the Sth-com dataset demonstrate that our LogicCAR outperforms existing baseline methods, proving the effectiveness of our logic-driven constraints.
[CV-39] Is Uncertainty Quantification a Viable Alternative to Learned Deferral? MICCAI
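[Quick read]: This paper addresses the loss of decision reliability that data shift causes when AI models are deployed clinically, specifically how to identify cases with a high risk of misclassification and defer them to a human expert, studied here for glaucoma classification from fundus images. It compares two families of strategies: learned deferral, which is trained in a supervised fashion by optimizing a surrogate loss, and uncertainty quantification (UQ) methods, which estimate model confidence and use it directly as the deferral criterion. The authors hypothesise that UQ methods are more robust to out-of-distribution (OOD) input, evaluate both families in- and out-of-distribution on a large ophthalmology dataset, and conclude that uncertainty quantification methods may be a promising choice for AI deferral.
Link: https://arxiv.org/abs/2508.02319
Authors: Anna M. Wundram,Christian F. Baumgartner
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as an oral presentation at MICCAI UNSURE 2025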
Abstract:Artificial Intelligence (AI) holds the potential to dramatically improve patient care. However, it is not infallible, necessitating human-AI-collaboration to ensure safe implementation. One aspect of AI safety is the models’ ability to defer decisions to a human expert when they are likely to misclassify autonomously. Recent research has focused on methods that learn to defer by optimising a surrogate loss function that finds the optimal trade-off between predicting a class label or deferring. However, during clinical translation, models often face challenges such as data shift. Uncertainty quantification methods aim to estimate a model’s confidence in its predictions. However, they may also be used as a deferral strategy which does not rely on learning from specific training distribution. We hypothesise that models developed to quantify uncertainty are more robust to out-of-distribution (OOD) input than learned deferral models that have been trained in a supervised fashion. To investigate this hypothesis, we constructed an extensive evaluation study on a large ophthalmology dataset, examining both learned deferral models and established uncertainty quantification methods, assessing their performance in- and out-of-distribution. Specifically, we evaluate their ability to accurately classify glaucoma from fundus images while deferring cases with a high likelihood of error. We find that uncertainty quantification methods may be a promising choice for AI deferral.
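A simple way to picture uncertainty-based deferral is to compute the predictive entropy of the classifier's class probabilities and hand the case to a human when it exceeds a threshold. The sketch below assumes entropy as the uncertainty score and a hand-picked threshold `max_entropy`; the abstract does not say which uncertainty quantification methods or thresholds the study uses, so both the score and the function names are illustrative.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def predict_or_defer(probs, max_entropy):
    """Return the arg-max class index, or the string "defer" when the model
    is too uncertain according to the (assumed) entropy threshold."""
    if predictive_entropy(probs) > max_entropy:
        return "defer"
    return max(range(len(probs)), key=probs.__getitem__)
```

Unlike a learned deferral head, a rule like this is not fitted to a specific training distribution; in practice the threshold would still need to be tuned on validation data for a target deferral rate.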
[CV-40] Whole-body Representation Learning For Competing Preclinical Disease Risk Assessment
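[Quick read]: This paper addresses limitations of current image-based models for preclinical disease risk prediction, which usually consider one condition at a time and depend on hand-crafted features obtained through segmentation tools. The key to the solution is a whole-body self-supervised representation learning method that learns general representations from whole-body imaging and, under a competing risk modeling framework, predicts risk for multiple diseases, including cardiovascular disease (CVD), type 2 diabetes (T2D), chronic obstructive pulmonary disease (COPD), and chronic kidney disease (CKD), outperforming whole-body radiomics. The results indicate potential both as a standalone screening modality and as part of a multi-modal clinical workflow for early personalized risk stratification.
Link: https://arxiv.org/abs/2508.02307
Authors: Dmitrii Seletkov,Sophie Starck,Ayhan Can Erdur,Yundi Zhang,Daniel Rueckert,Rickmer Braren
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: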
Abstract:Reliable preclinical disease risk assessment is essential to move public healthcare from reactive treatment to proactive identification and prevention. However, image-based risk prediction algorithms often consider one condition at a time and depend on hand-crafted features obtained through segmentation tools. We propose a whole-body self-supervised representation learning method for the preclinical disease risk assessment under a competing risk modeling. This approach outperforms whole-body radiomics in multiple diseases, including cardiovascular disease (CVD), type 2 diabetes (T2D), chronic obstructive pulmonary disease (COPD), and chronic kidney disease (CKD). Simulating a preclinical screening scenario and subsequently combining with cardiac MRI, it sharpens further the prediction for CVD subgroups: ischemic heart disease (IHD), hypertensive diseases (HD), and stroke. The results indicate the translational potential of whole-body representations as a standalone screening modality and as part of a multi-modal framework within clinical workflows for early personalized risk stratification. The code is available at this https URL
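[CV-41] Towards Real Unsupervised Anomaly Detection Via Confident Meta-Learning ICCV2025
[Quick read]: This paper addresses a key limitation of current unsupervised anomaly detection methods: they are in effect semi-supervised, because they assume that all training data are nominal samples. This requires manual filtering and cleaning of the data, which introduces bias and limits adaptability in real-world settings. The authors propose Confident Meta-learning (CoMet), a training strategy that combines Soft Confident Learning with Meta-Learning. The former down-weights low-confidence samples to reduce the effect of noise; the latter regularizes gradient updates using the covariance between training and validation losses, which stabilizes training, prevents overfitting, and improves robustness to uncurated datasets that contain anomalies. The method is model-agnostic and applies to any anomaly detection model trainable by gradient descent; its effectiveness is validated on the MVTec-AD, VIADUCT, and KSDD2 benchmarks.
Link: https://arxiv.org/abs/2508.02293
Authors: Muhammad Aqeel, Shakiba Sharifi, Marco Cristani, Francesco Setti
Affiliations: University of Verona; Qualyco S.r.l.
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to the IEEE/CVF International Conference on Computer Vision (ICCV 2025)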
Abstract:So-called unsupervised anomaly detection is better described as semi-supervised, as it assumes all training data are nominal. This assumption simplifies training but requires manual data curation, introducing bias and limiting adaptability. We propose Confident Meta-learning (CoMet), a novel training strategy that enables deep anomaly detection models to learn from uncurated datasets where nominal and anomalous samples coexist, eliminating the need for explicit filtering. Our approach integrates Soft Confident Learning, which assigns lower weights to low-confidence samples, and Meta-Learning, which stabilizes training by regularizing updates based on training validation loss covariance. This prevents overfitting and enhances robustness to noisy data. CoMet is model-agnostic and can be applied to any anomaly detection method trainable via gradient descent. Experiments on MVTec-AD, VIADUCT, and KSDD2 with two state-of-the-art models demonstrate the effectiveness of our approach, consistently improving over the baseline methods, remaining insensitive to anomalies in the training set, and setting a new state-of-the-art across all datasets.
[CV-42] Unleashing the Temporal Potential of Stereo Event Cameras for Continuous-Time 3D Object Detection ICCV2025
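[Quick read]: This paper addresses perception gaps in 3D object detection for highly dynamic scenes. Conventional LiDAR- and RGB-camera-based methods cannot perceive in continuous time because of their fixed frame rates, and existing methods that fuse event cameras with conventional sensors struggle in fast-motion scenarios, mainly because they depend on synchronized sensors. The key to the solution is a stereo 3D object detection framework that relies solely on event cameras. It introduces a dual filter mechanism to extract semantic and geometric information from sparse event data, and refines bounding-box regression by aligning boxes with object-centric information, enabling robust continuous-time perception without conventional 3D sensors.
Link: https://arxiv.org/abs/2508.02288
Authors: Jae-Young Kang, Hoonhee Cho, Kuk-Jin Yoon
Affiliations: KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025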
Abstract:3D object detection is essential for autonomous systems, enabling precise localization and dimension estimation. While LiDAR and RGB cameras are widely used, their fixed frame rates create perception gaps in high-speed scenarios. Event cameras, with their asynchronous nature and high temporal resolution, offer a solution by capturing motion continuously. The recent approach, which integrates event cameras with conventional sensors for continuous-time detection, struggles in fast-motion scenarios due to its dependency on synchronized sensors. We propose a novel stereo 3D object detection framework that relies solely on event cameras, eliminating the need for conventional 3D sensors. To compensate for the lack of semantic and geometric information in event data, we introduce a dual filter mechanism that extracts both. Additionally, we enhance regression by aligning bounding boxes with object-centric information. Experiments show that our method outperforms prior approaches in dynamic environments, demonstrating the potential of event cameras for robust, continuous-time 3D perception. The code is available at this https URL.
[CV-43] Do Edges Matter? Investigating Edge-Enhanced Pre-Training for Medical Image Segmentation
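[Quick read]: This paper addresses the unclear effect of pre-training strategy on cross-modality performance in medical image segmentation, in particular the role of edge information in foundation-model pre-training, which has not been studied systematically. The core question is how processing the pre-training data (for example, with computationally efficient edge-enhancement kernels) can improve a foundation model's segmentation across medical imaging modalities. The key to the solution is a meta-learning-based decision mechanism that uses the standard deviation and entropy of the raw image as features to choose automatically between the foundation model pre-trained on edge-enhanced data and the one pre-trained on raw data before fine-tuning. This improves overall cross-modality segmentation performance by 16.42% over pre-training on edge-enhanced data only and by 19.30% over pre-training on raw data only.
Link: https://arxiv.org/abs/2508.02281
Authors: Paul Zaha, Lars Böcking, Simeon Allmendinger, Leopold Müller, Niklas Kühl
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 5 figures, Third International Workshop on Data Engineering in Medical Imaging (DEMI 2025)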
Abstract: Medical image segmentation is crucial for disease diagnosis and treatment planning, yet developing robust segmentation models often requires substantial computational resources and large datasets. Existing research shows that pre-trained and finetuned foundation models can boost segmentation performance. However, questions remain about how particular image preprocessing steps may influence segmentation performance across different medical imaging modalities. In particular, edges (abrupt transitions in pixel intensity) are widely acknowledged as vital cues for object boundaries but have not been systematically examined in the pre-training of foundation models. We address this gap by investigating to what extent pre-training with data processed using computationally efficient edge kernels, such as Kirsch, can improve cross-modality segmentation capabilities of a foundation model. Two versions of a foundation model are first trained on either raw or edge-enhanced data across multiple medical imaging modalities, then finetuned on selected raw subsets tailored to specific medical modalities. After systematic investigation using the medical domains Dermoscopy, Fundus, Mammography, Microscopy, OCT, US, and XRay, we discover both increased and reduced segmentation performance across modalities using edge-focused pre-training, indicating the need for a selective application of this approach. To guide such selective applications, we propose a meta-learning strategy. It uses standard deviation and image entropy of the raw image to choose between a model pre-trained on edge-enhanced or on raw data for optimal performance. Our experiments show that integrating this meta-learning layer yields an overall segmentation performance improvement across diverse medical imaging tasks by 16.42% compared to models pre-trained on edge-enhanced data only and 19.30% compared to models pre-trained on raw data only.
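The selection step described in the abstract (choosing a pre-trained model from the standard deviation and entropy of the raw image) can be illustrated with a minimal sketch. The function names, the thresholds, and the direction of the decision rule below are hypothetical placeholders; the abstract does not specify how the two statistics are combined.

```python
import math
from collections import Counter


def image_stats(pixels):
    """Return (standard deviation, Shannon entropy in bits) of gray values."""
    n = len(pixels)
    mean = sum(pixels) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / n)
    counts = Counter(pixels)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return std, entropy


def choose_pretraining(pixels, std_threshold=40.0, entropy_threshold=5.0):
    """Pick which pre-trained model to fine-tune for this image.

    Hypothetical rule: the thresholds and the decision direction are
    illustrative assumptions, not values taken from the paper.
    """
    std, entropy = image_stats(pixels)
    if std < std_threshold and entropy < entropy_threshold:
        return "edge_enhanced"
    return "raw"
```

In practice the rule would presumably be learned from validation results per modality rather than fixed by hand; the sketch only shows which two statistics feed the decision.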
[CV-44] SGAD: Semantic and Geometric-aware Descriptor for Local Feature Matching
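[Quick read]: This paper addresses the limited scalability of local feature matching methods based on Area to Point Matching (A2PM), which rely on inefficient pixel-level comparisons and complex graph matching. The key to the solution is the Semantic and Geometric-aware Descriptor Network (SGAD), which generates highly discriminative area descriptors that allow direct matching without complex graph optimization, improving both the accuracy and the efficiency of area matching. The paper also introduces a supervision strategy that decomposes area matching into classification and ranking subtasks, and a Hierarchical Containment Redundancy Filter (HCRF) that removes overlapping areas, yielding gains in both accuracy and speed on multiple benchmarks.
Link: https://arxiv.org/abs/2508.02278
Authors: Xiangzeng Liu, Chi Wang, Guanglu Shi, Xiaodong Zhang, Qiguang Miao, Miao Fan
Affiliations: Xidian University; Navinfo Europe B.V
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: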
Abstract:Local feature matching remains a fundamental challenge in computer vision. Recent Area to Point Matching (A2PM) methods have improved matching accuracy. However, existing research based on this framework relies on inefficient pixel-level comparisons and complex graph matching that limit scalability. In this work, we introduce the Semantic and Geometric-aware Descriptor Network (SGAD), which fundamentally rethinks area-based matching by generating highly discriminative area descriptors that enable direct matching without complex graph optimization. This approach significantly improves both accuracy and efficiency of area matching. We further improve the performance of area matching through a novel supervision strategy that decomposes the area matching task into classification and ranking subtasks. Finally, we introduce the Hierarchical Containment Redundancy Filter (HCRF) to eliminate overlapping areas by analyzing containment graphs. SGAD demonstrates remarkable performance gains, reducing runtime by 60x (0.82s vs. 60.23s) compared to MESA. Extensive evaluations show consistent improvements across multiple point matchers: SGAD+LoFTR reduces runtime compared to DKM, while achieving higher accuracy (0.82s vs. 1.51s, 65.98 vs. 61.11) in outdoor pose estimation, and SGAD+ROMA delivers +7.39% AUC@5° in indoor pose estimation, establishing a new state-of-the-art.
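As a rough illustration of containment-based redundancy filtering, the sketch below drops any candidate area that is almost entirely covered by a larger area that has already been kept. This is only an assumed, simplified stand-in: the abstract says HCRF analyzes containment graphs but gives no details, so the box format, the `threshold` value, and the keep-the-larger policy are all hypothetical.

```python
def area(box):
    """Area of an axis-aligned box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def containment(inner, outer):
    """Fraction of `inner` that lies inside `outer` (0.0 to 1.0)."""
    overlap = (
        max(inner[0], outer[0]),
        max(inner[1], outer[1]),
        min(inner[2], outer[2]),
        min(inner[3], outer[3]),
    )
    inner_area = area(inner)
    return area(overlap) / inner_area if inner_area else 0.0


def filter_contained(boxes, threshold=0.9):
    """Keep boxes from largest to smallest, skipping mostly contained ones."""
    kept = []
    for box in sorted(boxes, key=area, reverse=True):
        if all(containment(box, other) < threshold for other in kept):
            kept.append(box)
    return kept
```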
[CV-45] Semi-Supervised Dual-Threshold Contrastive Learning for Ultrasound Image Classification and Segmentation ECAI2025
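[Quick read]: This paper addresses two problems in semi-supervised contrastive learning: overconfident pseudo-labels that lead to incorrect predictions, and the independent treatment of classification and segmentation, which leaves the relationship between the two tasks unmodeled. The core solution is Hermes, a dual-threshold semi-supervised contrastive learning strategy that uses two confidence thresholds to select high-quality pseudo-labels dynamically, reducing the risk of early model misguidance and overfitting. An inter-task attention and saliency module strengthens information exchange between classification and segmentation, and an inter-task consistency learning strategy aligns tumor feature representations, reducing negative transfer and the feature discrepancy between tasks.
Link: https://arxiv.org/abs/2508.02265
Authors: Peng Zhang, Zhihui Lai, Heng Kong
Affiliations: Shenzhen University; Guangdong Medical University Shenzhen Baoan Central Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ECAI 2025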
Abstract:Confidence-based pseudo-label selection usually generates overly confident yet incorrect predictions, due to the early misleadingness of model and overfitting inaccurate pseudo-labels in the learning process, which heavily degrades the performance of semi-supervised contrastive learning. Moreover, segmentation and classification tasks are treated independently and the affinity fails to be fully explored. To address these issues, we propose a novel semi-supervised dual-threshold contrastive learning strategy for ultrasound image classification and segmentation, named Hermes. This strategy combines the strengths of contrastive learning with semi-supervised learning, where the pseudo-labels assist contrastive learning by providing additional guidance. Specifically, an inter-task attention and saliency module is also developed to facilitate information sharing between the segmentation and classification tasks. Furthermore, an inter-task consistency learning strategy is designed to align tumor features across both tasks, avoiding negative transfer for reducing features discrepancy. To solve the lack of publicly available ultrasound datasets, we have collected the SZ-TUS dataset, a thyroid ultrasound image dataset. Extensive experiments on two public ultrasound datasets and one private dataset demonstrate that Hermes consistently outperforms several state-of-the-art methods across various semi-supervised settings.
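A dual-threshold selection rule can be sketched as follows. This is a generic illustration under stated assumptions, not the Hermes procedure: the abstract does not say how the two thresholds are set or how each group of samples is used, so the `low` and `high` values and the three-way split are hypothetical.

```python
def split_by_confidence(probabilities, low=0.5, high=0.9):
    """Partition unlabeled samples by the confidence of their top prediction.

    probabilities: one list of class probabilities per sample.
    Returns (confident, uncertain, rejected), where the first two hold
    (sample_index, predicted_label) pairs and the last holds indices.
    """
    confident, uncertain, rejected = [], [], []
    for index, probs in enumerate(probabilities):
        confidence = max(probs)
        label = probs.index(confidence)
        if confidence >= high:
            confident.append((index, label))   # usable as a pseudo-label
        elif confidence >= low:
            uncertain.append((index, label))   # could be down-weighted or deferred
        else:
            rejected.append(index)             # too unreliable to use
    return confident, uncertain, rejected
```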
[CV-46] SplatSSC: Decoupled Depth-Guided Gaussian Splatting for Semantic Scene Completion
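[Quick read]: This paper addresses two core problems in monocular 3D Semantic Scene Completion (SSC) that arise from relying on a large number of randomly initialized 3D Gaussian primitives: inefficient primitive initialization, and outlier primitives that introduce erroneous artifacts. The key to the solution is the SplatSSC framework, with two main contributions: 1) a depth-guided initialization strategy that uses a dedicated depth branch with a Group-wise Multi-scale Fusion (GMF) module to integrate multi-scale image and depth features and generate a sparse yet representative set of initial Gaussian primitives; 2) a Decoupled Gaussian Aggregator (DGA) that separates geometric and semantic predictions during Gaussian-to-voxel splatting, improving robustness to outlier primitives. Combined with a dedicated Probability Scale Loss, the method outperforms prior approaches on the Occ-ScanNet dataset by over 6.3% in IoU and 4.1% in mIoU, while reducing latency and memory consumption by more than 9.3%.
Link: https://arxiv.org/abs/2508.02261
Authors: Rui Qian, Haozhi Cao, Tianchen Deng, Shenghai Yuan, Lihua Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: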
Abstract:Monocular 3D Semantic Scene Completion (SSC) is a challenging yet promising task that aims to infer dense geometric and semantic descriptions of a scene from a single image. While recent object-centric paradigms significantly improve efficiency by leveraging flexible 3D Gaussian primitives, they still rely heavily on a large number of randomly initialized primitives, which inevitably leads to 1) inefficient primitive initialization and 2) outlier primitives that introduce erroneous artifacts. In this paper, we propose SplatSSC, a novel framework that resolves these limitations with a depth-guided initialization strategy and a principled Gaussian aggregator. Instead of random initialization, SplatSSC utilizes a dedicated depth branch composed of a Group-wise Multi-scale Fusion (GMF) module, which integrates multi-scale image and depth features to generate a sparse yet representative set of initial Gaussian primitives. To mitigate noise from outlier primitives, we develop the Decoupled Gaussian Aggregator (DGA), which enhances robustness by decomposing geometric and semantic predictions during the Gaussian-to-voxel splatting process. Complemented with a specialized Probability Scale Loss, our method achieves state-of-the-art performance on the Occ-ScanNet dataset, outperforming prior approaches by over 6.3% in IoU and 4.1% in mIoU, while reducing both latency and memory consumption by more than 9.3%. The code will be released upon acceptance.
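[CV-47] Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning
[Quick read]: This paper addresses hallucinations in pathology Vision Language Models (VLMs) used in clinical settings, that is, diagnostic conclusions that are inconsistent with the image evidence and therefore undermine clinical trust. The core challenge is that pathology images have ultra-high resolution, complex tissue structures, and nuanced clinical semantics, while existing Retrieval-Augmented Generation (RAG) methods built on text-based knowledge bases cannot make effective use of diagnostic visual cues. The key to the solution is Patho-AgenticRAG, a multimodal RAG framework built on page-level embeddings of authoritative pathology textbooks. It supports joint text-image retrieval and can directly locate textbook pages that contain both the queried text and the relevant visual features, avoiding the loss of image information. The framework also integrates reasoning, task decomposition, and multi-turn interaction, improving accuracy in complex pathology diagnosis scenarios.
Link: https://arxiv.org/abs/2508.02258
Authors: Wenchuan Zhang, Jingru Guo, Hengzhe Zhang, Penghao Zhang, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, Hong Bu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: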
Abstract:Although Vision Language Models (VLMs) have shown strong generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced clinical semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing RAG approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text-image search, enabling direct retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical image-based information. Patho-AgenticRAG also supports reasoning, task decomposition, and multi-turn search interactions, improving accuracy in complex diagnostic scenarios. Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models in complex pathology tasks like multiple-choice diagnosis and visual question answering. Our project is available at the Patho-AgenticRAG repository: this https URL.
[CV-48] Semi-Supervised Semantic Segmentation via Derivative Label Propagation
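[Quick read]: This paper addresses the insufficient reliability of pseudo-labels in semi-supervised semantic segmentation, in particular how to improve pseudo-label quality when only a few labeled images are available. The key to the solution is a derivative label propagation method (DerProp) that applies discrete derivative operations to pixel-wise feature vectors as additional regularization, producing strictly regularized similarity metrics. This alleviates the ill-posed problem in which identical similarities correspond to different features, constrains the solution space, and improves the accuracy and consistency of pseudo-labels.
Link: https://arxiv.org/abs/2508.02254
Authors: Yuanbin Fu, Xiaojie Guo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: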
Abstract:Semi-supervised semantic segmentation, which leverages a limited set of labeled images, helps to relieve the heavy annotation burden. While pseudo-labeling strategies yield promising results, there is still room for enhancing the reliability of pseudo-labels. Hence, we develop a semi-supervised framework, namely DerProp, equipped with a novel derivative label propagation to rectify imperfect pseudo-labels. Our label propagation method imposes discrete derivative operations on pixel-wise feature vectors as additional regularization, thereby generating strictly regularized similarity metrics. Doing so effectively alleviates the ill-posed problem that identical similarities correspond to different features, through constraining the solution space. Extensive experiments are conducted to verify the rationality of our design, and demonstrate our superiority over other methods. Codes are available at this https URL.
[CV-49] I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking
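[Quick read]: This paper addresses two key problems in current Large Language Model (LLM)-based multimodal entity linking: unnecessary use of image data in some scenarios, which wastes resources and may introduce interference, and reliance on a single pass of visual feature extraction, which fails to exploit image information fully for accurate matching. The key to the solution is a framework called Intra- and Inter-modal Collaborative Reflections. It first tries to complete the linking task from text alone and, when text is insufficient to determine the correct entity, uses a multi-round iterative strategy to integrate key visual clues from different aspects of the image for cross-modal collaborative reasoning, improving matching accuracy.
Link: https://arxiv.org/abs/2508.02243
Authors: Ziyan Liu, Junwen Li, Kaiwen Li, Tong Ruan, Chao Wang, Xinyan He, Zongyu Wang, Xuezhi Cao, Jingping Liu
Affiliations: East China University of Science and Technology; South China University of Technology; Shanghai University; Meituan
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 10 pages, 6 figures, accepted by ACMMM 2025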
Abstract:Multimodal entity linking plays a crucial role in a wide range of applications. Recent advances in large language model-based methods have become the dominant paradigm for this task, effectively leveraging both textual and visual modalities to enhance performance. Despite their success, these methods still face two challenges, including unnecessary incorporation of image data in certain scenarios and the reliance only on a one-time extraction of visual features, which can undermine their effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for the multimodal entity linking task, called Intra- and Inter-modal Collaborative Reflections. This framework prioritizes leveraging text information to address the task. When text alone is insufficient to link the correct entity through intra- and inter-modality evaluations, it employs a multi-round iterative strategy that integrates key visual clues from various aspects of the image to support reasoning and enhance matching accuracy. Extensive experiments on three widely used public datasets demonstrate that our framework consistently outperforms current state-of-the-art methods in the task, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively. Our code is available at this https URL.
[CV-50] Forecasting When to Forecast: Accelerating Diffusion Models with Confidence-Gated Taylor
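[Quick read]: This paper addresses slow inference in diffusion models, a bottleneck for deployment in resource-constrained settings. Existing training-free acceleration methods improve efficiency by caching and reusing past features, but methods such as TaylorSeer perform Taylor-expansion prediction at the module level, which requires storing fine-grained intermediate features and incurs notable memory and computation overhead. Their fixed caching schedule also ignores how prediction error varies across timesteps, so failed predictions can degrade generation quality. The key to the solution is twofold: 1) move the Taylor prediction target from the module level to the last block, greatly reducing the number of cached features; 2) use the prediction error of the first Transformer block as a reliability indicator to decide dynamically whether to use the Taylor prediction. If the error is small the prediction is trusted; otherwise the method falls back to full computation, giving an adaptive caching mechanism. The method achieves 3.17x, 2.36x, and 4.14x acceleration on FLUX, DiT, and Wan Video respectively, with negligible quality loss.
Link: https://arxiv.org/abs/2508.02240
Authors: Xiaoliu Guan, Lielin Jiang, Hanqi Chen, Xu Zhang, Jiaxing Yan, Guanzhong Wang, Yi Liu, Zetao Zhang, Yu Wu
Affiliations: Wuhan University; Baidu Inc; Zhejiang University; Yunnan Key Laboratory of Media Convergence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures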
Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable performance in visual generation tasks. However, their low inference speed limits their deployment in low-resource applications. Recent training-free approaches exploit the redundancy of features across timesteps by caching and reusing past representations to accelerate inference. Building on this idea, TaylorSeer instead uses cached features to predict future ones via Taylor expansion. However, its module-level prediction across all transformer blocks (e.g., attention or feedforward modules) requires storing fine-grained intermediate features, leading to notable memory and computation overhead. Moreover, it adopts a fixed caching schedule without considering the varying accuracy of predictions across timesteps, which can lead to degraded outputs when prediction fails. To address these limitations, we propose a novel approach to better leverage Taylor-based acceleration. First, we shift the Taylor prediction target from the module level to the last block level, significantly reducing the number of cached features. Furthermore, observing strong sequential dependencies among Transformer blocks, we propose to use the error between the Taylor-estimated and actual outputs of the first block as an indicator of prediction reliability. If the error is small, we trust the Taylor prediction for the last block; otherwise, we fall back to full computation, thereby enabling a dynamic caching mechanism. Empirical results show that our method achieves a better balance between speed and quality, achieving a 3.17x acceleration on FLUX, 2.36x on DiT, and 4.14x on Wan Video with negligible quality drop. The project page is available at this https URL.
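The gating idea in the abstract (trust the last-block forecast only when the first-block forecast matches the actual first-block output) can be sketched in a few lines. This is a simplified illustration, assuming first-order extrapolation from two cached steps and a relative L2 error with a hypothetical tolerance `tol`; the function names and the callable `run_full_model` are placeholders, and the paper's expansion order and error metric may differ.

```python
def taylor_predict(previous, current):
    """First-order extrapolation one step ahead (finite-difference slope)."""
    return [c + (c - p) for p, c in zip(previous, current)]


def relative_error(estimate, actual):
    """Relative L2 error between a forecast and the actual features."""
    num = sum((e - a) ** 2 for e, a in zip(estimate, actual)) ** 0.5
    den = sum(a * a for a in actual) ** 0.5
    return num / (den + 1e-12)


def gated_step(first_history, last_history, first_actual, run_full_model, tol=0.05):
    """Return (last-block features, mode) for the current timestep.

    first_history / last_history: (previous, current) cached features of the
    first and last block. first_actual: the first block's real output at this
    timestep, which is cheap to compute. run_full_model: fallback callable.
    """
    first_estimate = taylor_predict(*first_history)
    if relative_error(first_estimate, first_actual) <= tol:
        return taylor_predict(*last_history), "forecast"
    return run_full_model(), "full"
```

The design point is that only one block is evaluated before deciding, so a rejected forecast costs little more than a normal step.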
[CV-51] An Event-based Fast Intensity Reconstruction Scheme for UAV Real-time Perception
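[Quick read]: This paper addresses the difficulty of extracting and using effective information from the asynchronous event streams of event cameras, so that event cameras can be deployed efficiently onboard and maintain stable perception under poor visual conditions such as low illumination. The key to the solution is a streamlined event-based intensity reconstruction method, event-based single integration (ESI), which reconstructs intensity images in real time at a high frame rate (typically 100 FPS) through a single integration of the event stream combined with an enhanced decay algorithm. It retains the high dynamic range and high temporal resolution of event cameras while lowering the computational load, making it suitable for real-time visual tracking on resource-constrained platforms such as unmanned aerial vehicles (UAVs).
Link: https://arxiv.org/abs/2508.02238
Authors: Xin Dong, Yiwei Zhang, Yangjie Cui, Jinwu Xiang, Daochun Li, Zhan Tu
Affiliations: Hangzhou International Innovation Institute, Beihang University; School of Aeronautic Science and Engineering, Beihang University; Institute of Unmanned System, Beihang University; Tianmushan Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: A supplementary video is available at this https URL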
Abstract: Event cameras offer significant advantages, including a wide dynamic range, high temporal resolution, and immunity to motion blur, making them highly promising for addressing challenging visual conditions. Extracting and utilizing effective information from asynchronous event streams is essential for the onboard implementation of event cameras. In this paper, we propose a streamlined event-based intensity reconstruction scheme, event-based single integration (ESI), to address such implementation challenges. This method guarantees the portability of conventional frame-based vision methods to event-based scenarios and maintains the intrinsic advantages of event cameras. The ESI approach reconstructs intensity images by performing a single integration of the event streams combined with an enhanced decay algorithm. Such a method enables real-time intensity reconstruction at a high frame rate, typically 100 FPS. Furthermore, the relatively low computation load of ESI suits onboard implementation, such as in UAV-based visual tracking scenarios. Extensive experiments have been conducted to compare the performance of ESI with state-of-the-art algorithms. Compared to state-of-the-art algorithms, ESI demonstrates remarkable runtime efficiency improvements, superior reconstruction quality, and a high frame rate. As a result, ESI enhances UAV onboard perception significantly in visually adverse surroundings. In in-flight tests, ESI demonstrates effective performance for UAV onboard visual tracking under extremely low illumination conditions (2-10 lux), whereas other comparative algorithms fail due to insufficient frame rate, poor image quality, or limited real-time performance.
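To make "a single integration of the event streams combined with a decay" concrete, here is a minimal per-pixel sketch. It assumes a plain exponential decay and a fixed contrast step per event; the paper's "enhanced decay algorithm" is not described in the abstract, so `contrast`, `decay_rate`, and the event tuple layout are hypothetical.

```python
import math


def integrate_events(events, width, height, contrast=0.1, decay_rate=5.0):
    """Accumulate events into a log-intensity image in one pass.

    events: iterable of (t_seconds, x, y, polarity), sorted by time,
    with polarity +1 or -1. Each pixel decays toward zero between its
    own events, then jumps by polarity * contrast.
    """
    log_intensity = [[0.0] * width for _ in range(height)]
    last_update = [[0.0] * width for _ in range(height)]
    for t, x, y, polarity in events:
        elapsed = t - last_update[y][x]
        log_intensity[y][x] *= math.exp(-decay_rate * elapsed)  # decay old state
        log_intensity[y][x] += polarity * contrast              # integrate event
        last_update[y][x] = t
    return log_intensity
```

A frame can be read out at any time by sampling `log_intensity`, which is why this style of reconstruction is not tied to a fixed sensor frame rate.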
[CV-52] Welcome New Doctor: Continual Learning with Expert Consultation and Autoregressive Inference for Whole Slide Image Analysis
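[Quick read]: This paper addresses the continual learning challenge raised by the large data volume and diverse tasks of Whole Slide Images (WSIs) in clinical use: how to adapt trained models efficiently to new tasks without retraining or fine-tuning on previous tasks, while balancing resource efficiency and high performance. The key to the solution is COSFormer, a Transformer-based continual learning framework that uses a modular design and attention mechanisms to learn sequentially from new tasks without access to the full historical datasets. Under both class-incremental and task-incremental settings it shows strong generalizability and stability, offering a scalable and efficient continual learning solution for clinical WSI analysis.
Link: https://arxiv.org/abs/2508.02220
Authors: Doanh Cao Bui, Jin Tae Kwak
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: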
Abstract: Whole Slide Image (WSI) analysis, with its ability to reveal detailed tissue structures in magnified views, plays a crucial role in cancer diagnosis and prognosis. Due to their giga-sized nature, WSIs require substantial storage and computational resources for processing and training predictive models. With the rapid increase in WSIs used in clinics and hospitals, there is a growing need for a continual learning system that can efficiently process and adapt existing models to new tasks without retraining or fine-tuning on previous tasks. Such a system must balance resource efficiency with high performance. In this study, we introduce COSFormer, a Transformer-based continual learning framework tailored for multi-task WSI analysis. COSFormer is designed to learn sequentially from new tasks while avoiding the need to revisit full historical datasets. We evaluate COSFormer on a sequence of seven WSI datasets covering seven organs and six WSI-related tasks under both class-incremental and task-incremental settings. The results demonstrate COSFormer’s superior generalizability and effectiveness compared to existing continual learning frameworks, establishing it as a robust solution for continual WSI analysis in clinical applications.
[CV-53] CMIC: Content-Adaptive Mamba for Learned Image Compression
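[Quick read]: This paper addresses the content-agnostic nature of current Mamba-based Learned Image Compression (LIC) methods, which limits their ability to model global dependencies. Vanilla Mamba uses fixed, predefined selective scans that cannot adapt to the content, which restricts how well it captures long-range dependencies in complex image structures. The key to the solution is Content-Adaptive Mamba (CAM), with two core contributions. First, a content-aware token reorganization mechanism clusters and reorders tokens by similarity in feature space, so that semantically similar regions are adjacent in the sequence. Second, a prompt dictionary injects global priors into the State-Space Model (SSM), mitigating Mamba's strict causality and the decay of token interactions with distance. These changes let CAM capture global dependencies better while keeping linear computational complexity, and the resulting CMIC model achieves state-of-the-art rate-distortion performance.
Link: https://arxiv.org/abs/2508.02192
Authors: Yunuo Chen, Zezheng Lyu, Bing He, Hongwei Hu, Qi Wang, Yuan Tian, Li Song, Wenjun Zhang, Guo Lu
Affiliations: Shanghai Jiao Tong University; Massachusetts Institute of Technology; Alibaba Group; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: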
Abstract:Recent Learned image compression (LIC) leverages Mamba-style state-space models (SSMs) for global receptive fields with linear complexity. However, vanilla Mamba is content-agnostic, relying on fixed and predefined selective scans, which restricts its ability to dynamically and fully exploit content dependencies. We introduce Content-Adaptive Mamba (CAM), a dynamic SSM that addresses two critical limitations. First, it employs content-aware token reorganization, clustering and reordering tokens based on content similarity to prioritize proximity in feature space over Euclidean space. Second, it integrates global priors into SSM via a prompt dictionary, effectively mitigating the strict causality and long-range decay in the token interactions of Mamba. These innovations enable CAM to better capture global dependencies while preserving computational efficiency. Leveraging CAM, our Content-Adaptive Mamba-based LIC model (CMIC) achieves state-of-the-art rate-distortion performance, surpassing VTM-21.0 by -15.91%, -21.34%, and -17.58% BD-rate on Kodak, Tecnick, and CLIC benchmarks, respectively.
[CV-54] A Moment Matching-Based Method for Sparse and Noisy Point Cloud Registration
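[Quick read]: This paper addresses the robustness and accuracy of point cloud registration under sparse points and heavy noise, conditions that are common in robotic perception tasks such as Simultaneous Localization and Mapping (SLAM). Traditional methods such as Iterative Closest Point (ICP) and the Normal Distributions Transform (NDT) often perform poorly in these settings. The key to the solution is a registration framework based on moment matching: the point clouds are treated as independent and identically distributed (i.i.d.) samples from the same distribution, and the rigid transformation between two frames is estimated by matching generalized Gaussian Radial Basis moments computed from the point clouds, without explicit point-to-point correspondences. The method is shown to be consistent, achieves higher accuracy and robustness than existing methods on synthetic and real-world datasets, and, when integrated into a 4D Radar SLAM system, significantly improves localization performance to a level comparable to LiDAR-based systems.
Link: https://arxiv.org/abs/2508.02187
Authors: Xingyi Li, Han Zhang, Ziliang Wang, Yukai Yang, Weidong Chen
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: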
Abstract:Point cloud registration is a key step in robotic perception tasks, such as Simultaneous Localization and Mapping (SLAM). It is especially challenging in conditions with sparse points and heavy noise. Traditional registration methods, such as Iterative Closest Point (ICP) and Normal Distributions Transform (NDT), often have difficulties in achieving a robust and accurate alignment under these conditions. In this paper, we propose a registration framework based on moment matching. In particular, the point clouds are regarded as i.i.d. samples drawn from the same distribution observed in the source and target frames. We then match the generalized Gaussian Radial Basis moments calculated from the point clouds to estimate the rigid transformation between two frames. Moreover, such method does not require explicit point-to-point correspondences among the point clouds. We further show the consistency of the proposed method. Experiments on synthetic and real-world datasets show that our approach achieves higher accuracy and robustness than existing methods. In addition, we integrate our framework into a 4D Radar SLAM system. The proposed method significantly improves the localization performance and achieves results comparable to LiDAR-based systems. These findings demonstrate the potential of moment matching technique for robust point cloud registration in sparse and noisy scenarios.
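The correspondence-free idea can be illustrated with a toy 2D example: summarize each cloud by the mean Gaussian RBF response at a few fixed centers, then search for the rotation that makes the two summaries agree. Everything here is an assumption for illustration (2D, rotation only, a brute-force one-degree grid, the chosen centers and `sigma`); the abstract does not define the paper's "generalized Gaussian Radial Basis moments" or its optimizer.

```python
import math


def rotate(points, angle):
    """Rotate 2D points about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def rbf_moments(points, centers, sigma=1.0):
    """Mean Gaussian RBF response of the whole cloud at each center."""
    moments = []
    for cx, cy in centers:
        total = sum(
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            for x, y in points
        )
        moments.append(total / len(points))
    return moments


def estimate_rotation_deg(source, target, centers):
    """Grid-search the rotation (whole degrees) that best matches moments."""
    target_moments = rbf_moments(target, centers)

    def mismatch(degrees):
        moved = rbf_moments(rotate(source, math.radians(degrees)), centers)
        return sum((a - b) ** 2 for a, b in zip(moved, target_moments))

    return min(range(360), key=mismatch)
```

Because the moments are averages over all points, the point order in `target` is irrelevant, which is the sense in which no point-to-point correspondences are needed.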
[CV-55] Failure Cases Are Better Learned But Boundary Says Sorry: Facilitating Smooth Perception Change for Accuracy-Robustness Trade-Off in Adversarial Training ICCV’25
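[Quick read]: This paper addresses the trade-off between clean accuracy and adversarial robustness that is common in Adversarial Training (AT). The conventional view attributes this trade-off to insufficient learning of hard adversarial samples, which complicates the decision boundary. The paper reports a counterintuitive finding: hard adversarial samples that can still attack the robust model after AT are in fact learned better than those that are successfully defended, so the issue is over-sufficient learning. The excessive pursuit of perception consistency makes the model treat perturbations as noise and ignore the information they carry, which harms the smoothness of the decision boundary and worsens the trade-off. The authors therefore define a new objective, Robust Perception, which encourages the model's perception to change smoothly under input perturbations, and build on it the Robust Perception Adversarial Training (RPAT) method, which mitigates the conflict between accuracy and robustness.
Link: https://arxiv.org/abs/2508.02186
Authors: Yanyun Wang, Li Liu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2025 IEEE/CVF International Conference on Computer Vision (ICCV’25)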
Abstract:Adversarial Training (AT) is one of the most effective methods to train robust Deep Neural Networks (DNNs). However, AT creates an inherent trade-off between clean accuracy and adversarial robustness, which is commonly attributed to the more complicated decision boundary caused by the insufficient learning of hard adversarial samples. In this work, we reveal a counterintuitive fact for the first time: From the perspective of perception consistency, hard adversarial samples that can still attack the robust model after AT are already learned better than those successfully defended. Thus, different from previous views, we argue that it is rather the over-sufficient learning of hard adversarial samples that degrades the decision boundary and contributes to the trade-off problem. Specifically, the excessive pursuit of perception consistency would force the model to view the perturbations as noise and ignore the information within them, which should have been utilized to induce a smoother perception transition towards the decision boundary to support its establishment to an appropriate location. In response, we define a new AT objective named Robust Perception, encouraging the model perception to change smoothly with input perturbations, based on which we propose a novel Robust Perception Adversarial Training (RPAT) method, effectively mitigating the current accuracy-robustness trade-off. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with ResNet-18, PreActResNet-18, and WideResNet-34-10 demonstrate the effectiveness of our method beyond four common baselines and 12 state-of-the-art (SOTA) works. The code is available at this https URL.
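[CV-56] Test-Time Model Adaptation for Quantized Neural Networks
[Quick read]: This paper addresses the severe performance degradation of quantized models under domain shift in dynamic environments, which is more pronounced than for full-precision models. Existing Test-Time Adaptation (TTA) methods usually rely on gradient backpropagation, which quantized models cannot support because of vanishing gradients and memory and latency constraints. The paper proposes a continual zeroth-order adaptation (ZOA) framework whose key idea is to update model parameters using only two forward passes, with no gradient information, which greatly reduces computational cost. It also introduces a domain knowledge management scheme that stores and reuses knowledge from different domains with very low memory consumption, reducing interference between domains and supporting knowledge accumulation during long-term adaptation. Experiments show that ZOA outperforms existing methods on several quantized models, for example a 5.0% improvement for the W6A6 ViT-B model on ImageNet-C.
Link: https://arxiv.org/abs/2508.02180
Authors: Zeshuai Deng, Guohao Chen, Shuaicheng Niu, Hui Luo, Shuhai Zhang, Yifan Yang, Renjie Chen, Wei Luo, Mingkui Tan
Affiliations: South China University of Technology; Nanyang Technological University; Institute of Optics and Electronics, Chinese Academy of Sciences; South China Agricultural University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: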
Abstract:Quantizing deep models prior to deployment is a widely adopted technique to speed up inference for various real-time applications, such as autonomous driving. However, quantized models often suffer from severe performance degradation in dynamic environments with potential domain shifts and this degradation is significantly more pronounced compared with their full-precision counterparts, as shown by our theoretical and empirical illustrations. To address the domain shift problem, test-time adaptation (TTA) has emerged as an effective solution by enabling models to learn adaptively from test data. Unfortunately, existing TTA methods are often impractical for quantized models as they typically rely on gradient backpropagation–an operation that is unsupported on quantized models due to vanishing gradients, as well as memory and latency constraints. In this paper, we focus on TTA for quantized models to improve their robustness and generalization ability efficiently. We propose a continual zeroth-order adaptation (ZOA) framework that enables efficient model adaptation using only two forward passes, eliminating the computational burden of existing methods. Moreover, we propose a domain knowledge management scheme to store and reuse different domain knowledge with negligible memory consumption, reducing the interference of different domain knowledge and fostering the knowledge accumulation during long-term adaptation. Experimental results on three classical architectures, including quantized transformer-based and CNN-based models, demonstrate the superiority of our methods for quantized model adaptation. On the quantized W6A6 ViT-B model, our ZOA is able to achieve a 5.0% improvement over the state-of-the-art FOA on ImageNet-C dataset. The source code is available at this https URL.
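The abstract states only that ZOA adapts the model with two forward passes and no backpropagation. As a rough illustration of how a gradient-free update can work with two evaluations, here is a generic two-point (SPSA-style) zeroth-order step. This is a minimal sketch under stated assumptions, not the authors' algorithm: `zeroth_order_step`, `quadratic_loss`, the step size, and the perturbation scale are all hypothetical, and the toy loss stands in for whatever test-time objective a real system would use.

```python
import random


def zeroth_order_step(loss_fn, params, lr=0.05, eps=1e-3, rng=None):
    """One generic two-point zeroth-order update (SPSA-style sketch).

    Only two evaluations of loss_fn are needed; no backpropagation.
    """
    rng = rng if rng is not None else random
    # Random +/-1 perturbation direction, one entry per parameter.
    direction = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = [p + eps * d for p, d in zip(params, direction)]
    minus = [p - eps * d for p, d in zip(params, direction)]
    # The two "forward passes".
    loss_plus = loss_fn(plus)
    loss_minus = loss_fn(minus)
    # Finite-difference estimate of the slope along the direction.
    slope = (loss_plus - loss_minus) / (2.0 * eps)
    return [p - lr * slope * d for p, d in zip(params, direction)]


def quadratic_loss(params):
    # Toy stand-in for a test-time objective (hypothetical).
    return sum((p - 3.0) ** 2 for p in params)


params = [0.0, 0.0]
rng = random.Random(0)
for _ in range(200):
    params = zeroth_order_step(quadratic_loss, params, rng=rng)
```

On this toy problem the parameters drift toward the minimizer at 3.0 even though no gradient is ever computed, which is the property that makes such estimators attractive when backpropagation is unavailable.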
[CV-57] Weakly Supervised Multimodal Temporal Forgery Localization via Multitask Learning
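【Quick read】: This paper addresses weakly supervised multimodal fine-grained temporal forgery localization (WS-MTFL), that is, locating the partially forged segments of a Deepfake video in time using only video-level labels. The key to its solution is WMMT (Weakly Supervised Multimodal Temporal Forgery Localization via Multitask Learning), a multitask framework that formulates visual and audio detection as two binary classification tasks and optimizes them jointly. A Mixture-of-Experts structure adaptively selects features and localization heads, improving localization precision and flexibility. In addition, a feature enhancement module with a temporal-property-preserving attention mechanism captures intra- and inter-modality feature deviations, and an extensible deviation perceiving loss enlarges the deviation between adjacent segments of forged samples while reducing it for genuine samples, which improves localization under weak supervision.
Link: https://arxiv.org/abs/2508.02179
Authors: Wenbo Xu,Wei Lu,Xiangyang Luo
Affiliations: Sun Yat-sen University; State Key Laboratory of Mathematical Engineering and Advanced Computing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 4 figures. arXiv admin note: text overlap with arXiv:2507.16596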
Abstract:The spread of Deepfake videos has caused a trust crisis and impaired social stability. Although numerous approaches have been proposed to address the challenges of Deepfake detection and localization, there is still a lack of systematic research on the weakly supervised multimodal fine-grained temporal forgery localization (WS-MTFL). In this paper, we propose a novel weakly supervised multimodal temporal forgery localization via multitask learning (WMMT), which addresses the WS-MTFL under the multitask learning paradigm. WMMT achieves multimodal fine-grained Deepfake detection and temporal partial forgery localization using merely video-level annotations. Specifically, visual and audio modality detection are formulated as two binary classification tasks. The multitask learning paradigm is introduced to integrate these tasks into a multimodal task. Furthermore, WMMT utilizes a Mixture-of-Experts structure to adaptively select appropriate features and localization head, achieving excellent flexibility and localization precision in WS-MTFL. A feature enhancement module with temporal property preserving attention mechanism is proposed to identify the intra- and inter-modality feature deviation and construct comprehensive video features. To further explore the temporal information for weakly supervised learning, an extensible deviation perceiving loss has been proposed, which aims to enlarge the deviation of adjacent segments of the forged samples and reduce the deviation of genuine samples. Extensive experiments demonstrate the effectiveness of multitask learning for WS-MTFL, and the WMMT achieves comparable results to fully supervised approaches in several evaluation metrics.
[CV-58] Deep classification algorithm for De-identification of DICOM medical images
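【Quick read】: This paper addresses the de-identification of DICOM (Digital Imaging and Communications in Medicine) files in medical imaging research, so that personally identifiable information (PII) and protected health information (PHI) are handled in line with HIPAA privacy rules. The key to its solution is an algorithm based on the HIPAA Safe Harbor method that uses customizable input parameters to classify DICOM tags and selectively de-identify them. According to the abstract, the method targets information in the header as well as information burned into the pixel data, and it successfully recognizes sensitive content such as names, history, personal data, and institution. The customizable parameters make the tool flexible enough for both everyday practice and research.
Link: https://arxiv.org/abs/2508.02177
Authors: Bufano Michele,Kotter Elmar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: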
Abstract:Background: De-identification of DICOM (Digital Imaging and Communications in Medicine) files is an essential component of medical image research. Personally Identifiable Information (PII) and/or Personal Health Identifying Information (PHI) need to be hidden or removed for legal reasons. According to the Health Insurance Portability and Accountability Act (HIPAA) and privacy rules, full-face photographic images and any comparable images are also direct identifiers and are considered protected health information that needs to be de-identified. Objective: The study aimed to implement a method that de-identifies the PII and PHI present in the header and burned into the pixel data of DICOM files. Methods: To perform the de-identification, we implemented an algorithm based on the Safe Harbor method defined by HIPAA. Our algorithm uses customizable input parameters to classify and then possibly de-identify individual DICOM tags. Results: The most sensitive information, such as names, history, personal data, and institution, was successfully recognized. Conclusions: We developed a Python algorithm that is able to classify information present in a DICOM file. The flexibility provided by the customizable input parameters, which allow the user to tailor the entire process to the case at hand (e.g., the language), makes the program very promising for both everyday use and research purposes. Our code is available at this https URL.
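To make the "classify, then possibly de-identify individual tags" idea concrete, the following sketch applies a small, user-supplied rule table to a header represented as a plain dictionary. It is an illustration under stated assumptions and not the authors' code: the function `deidentify_header`, the rule table `DEFAULT_RULES`, and the sample values are hypothetical, the table is far from the complete Safe Harbor identifier list, and a real tool would read the header with a DICOM library and would also have to handle text burned into the pixel data.

```python
# Hypothetical rule table: DICOM keyword -> action.
# Illustrative only; not the complete HIPAA Safe Harbor identifier list.
DEFAULT_RULES = {
    "PatientName": "remove",
    "PatientID": "replace",
    "PatientBirthDate": "remove",
    "InstitutionName": "remove",
}


def deidentify_header(header, rules=None, placeholder="ANONYMIZED"):
    """Classify each tag and return (clean_header, report).

    header: dict mapping DICOM keyword -> value (a stand-in for a real header).
    rules:  dict mapping keyword -> "keep" | "replace" | "remove".
    """
    rules = DEFAULT_RULES if rules is None else rules
    clean, report = {}, {}
    for keyword, value in header.items():
        action = rules.get(keyword, "keep")  # unlisted tags are kept
        report[keyword] = action
        if action == "keep":
            clean[keyword] = value
        elif action == "replace":
            clean[keyword] = placeholder
        # Any other action (including "remove") drops the tag entirely.
    return clean, report


# Made-up sample header for demonstration.
header = {
    "PatientName": "DOE^JANE",
    "PatientID": "12345",
    "Modality": "CT",
    "InstitutionName": "Example Hospital",
}
clean, report = deidentify_header(header)
```

Passing a different `rules` dictionary is the customization point: the same loop can be made stricter or more permissive per site or per study without touching the code.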
[CV-59] GaussianCross: Cross-modal Self-supervised 3D Representation Learning via Gaussian Splatting
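【Quick read】: This paper addresses model collapse and structural information deficiency in 3D point cloud representation learning, both of which stem from insufficient point discrimination difficulty and undermine the reliability of features for 3D scene understanding. The core of its solution is the GaussianCross architecture, which integrates feed-forward 3D Gaussian Splatting (3DGS) to convert scale-inconsistent 3D point clouds into a unified cuboid-normalized Gaussian representation without losing detail, enabling stable and generalizable self-supervised pre-training. A tri-attribute adaptive distillation splatting module then builds a 3D feature field that jointly captures appearance, geometry, and semantic cues to maintain cross-modal consistency, improving downstream tasks such as semantic and instance segmentation.
Link: https://arxiv.org/abs/2508.02172
Authors: Lei Yao,Yi Wang,Yi Zhang,Moyun Liu,Lap-Pui Chau
Affiliations: Hong Kong Polytechnic University; Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 14 pages, 8 figures, accepted by MM'25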
Abstract:The significance of informative and robust point representations has been widely acknowledged for 3D scene understanding. Despite existing self-supervised pre-training counterparts demonstrating promising performance, the model collapse and structural information deficiency remain prevalent due to insufficient point discrimination difficulty, yielding unreliable expressions and suboptimal performance. In this paper, we present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3D Gaussian Splatting (3DGS) techniques to address current challenges. GaussianCross seamlessly converts scale-inconsistent 3D point clouds into a unified cuboid-normalized Gaussian representation without missing details, enabling stable and generalizable pre-training. Subsequently, a tri-attribute adaptive distillation splatting module is incorporated to construct a 3D feature field, facilitating synergetic feature capturing of appearance, geometry, and semantic cues to maintain cross-modal consistency. To validate GaussianCross, we perform extensive evaluations on various benchmarks, including ScanNet, ScanNet200, and S3DIS. In particular, GaussianCross shows a prominent parameter and data efficiency, achieving superior performance through linear probing (0.1% parameters) and limited data training (1% of scenes) compared to state-of-the-art methods. Furthermore, GaussianCross demonstrates strong generalization capabilities, improving the full fine-tuning accuracy by 9.3% mIoU and 6.1% AP_50 on ScanNet200 semantic and instance segmentation tasks, respectively, supporting the effectiveness of our approach. The code, weights, and visualizations are publicly available at this https URL.
[CV-60] After the Party: Navigating the Mapping From Color to Ambient Lighting
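【Quick read】: This paper addresses image restoration under complex illumination (multiple colored light sources, occlusions, and material-dependent reflectance), where existing methods rely on oversimplified assumptions such as a single light source or uniform white-balanced lighting and therefore produce illumination inconsistencies, texture leakage, and color distortion. The core challenge is to separate the illumination component from the reflectance component precisely. The key to its solution is a novel learning framework that, inspired by the Retinex model, uses explicit chromaticity and luminance components as guidance to achieve a more accurate illumination-reflectance decomposition, improving robustness under non-homogeneous colored lighting and material-specific reflectance variations while keeping computational cost competitive. The paper also introduces the CL3AN dataset for this task.
Link: https://arxiv.org/abs/2508.02168
Authors: Florin-Alexandru Vasluianu,Tim Seizinger,Zongwei Wu,Radu Timofte
Affiliations: University of Würzburg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: an 8-pages manuscript, 9 figures, 3 tables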
Abstract:Illumination in practical scenarios is inherently complex, involving colored light sources, occlusions, and diverse material interactions that produce intricate reflectance and shading effects. However, existing methods often oversimplify this challenge by assuming a single light source or uniform, white-balanced lighting, leaving many of these complexities unaddressed. In this paper, we introduce CL3AN, the first large-scale, high-resolution dataset of its kind designed to facilitate the restoration of images captured under multiple Colored Light sources to their Ambient-Normalized counterparts. Through benchmarking, we find that leading approaches often produce artifacts, such as illumination inconsistencies, texture leakage, and color distortion, primarily due to their limited ability to precisely disentangle illumination from reflectance. Motivated by this insight, we achieve such a desired decomposition through a novel learning framework that leverages explicit chromaticity and luminance components guidance, drawing inspiration from the principles of the Retinex model. Extensive evaluations on existing benchmarks and our dataset demonstrate the effectiveness of our approach, showcasing enhanced robustness under non-homogeneous color lighting and material-specific reflectance variations, all while maintaining a highly competitive computational cost. The benchmark, codes, and models are available at this http URL.
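The abstract mentions guidance from explicit chromaticity and luminance components. As a small illustration of what such a split looks like for a single RGB pixel, the sketch below uses the common normalized-RGB definition of chromaticity and the channel mean as luminance. These definitions and the function name `split_chroma_luma` are assumptions chosen for illustration; the paper may define both components differently.

```python
def split_chroma_luma(pixel, eps=1e-8):
    """Split one RGB pixel into (chromaticity, luminance).

    Assumed definitions (not taken from the paper):
      luminance    = mean of the three channels
      chromaticity = each channel divided by the channel sum
    """
    r, g, b = pixel
    luminance = (r + g + b) / 3.0
    total = r + g + b + eps  # eps avoids division by zero for black pixels
    chromaticity = (r / total, g / total, b / total)
    return chromaticity, luminance


# The same surface color under dim and bright light (made-up values).
dim_chroma, dim_luma = split_chroma_luma((0.2, 0.4, 0.4))
bright_chroma, bright_luma = split_chroma_luma((0.4, 0.8, 0.8))
```

Scaling the pixel's intensity changes the luminance but leaves the chromaticity essentially unchanged, which is why the two components are convenient, separate cues when trying to tell colored lighting apart from brightness.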
[CV-61] Unified Category-Level Object Detection and Pose Estimation from RGB Images using 3D Prototypes ICCV2025
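【Quick read】: This paper addresses category-level 3D pose estimation from RGB images alone. Traditional methods either rely on RGB-D input, which is not always available, or use two-stage pipelines with separate models for detection and pose estimation, which complicates the workflow. The key to its solution is a unified framework that combines neural mesh models with learned features and multi-model RANSAC, so that detection and pose estimation are handled in a single model without depth information. On REAL275 it improves on the previous state of the art by 22.9% averaged across all scale-agnostic metrics.
Link: https://arxiv.org/abs/2508.02157
Authors: Tom Fischer,Xiaojie Zhang,Eddy Ilg
Affiliations: Saarland University; University of Technology Nuremberg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at ICCV 2025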
Abstract:Recognizing objects in images is a fundamental problem in computer vision. Although detecting objects in 2D images is common, many applications require determining their pose in 3D space. Traditional category-level methods rely on RGB-D inputs, which may not always be available, or employ two-stage approaches that use separate models and representations for detection and pose estimation. For the first time, we introduce a unified model that integrates detection and pose estimation into a single framework for RGB images by leveraging neural mesh models with learned features and multi-model RANSAC. Our approach achieves state-of-the-art results for RGB category-level pose estimation on REAL275, improving on the current state-of-the-art by 22.9% averaged across all scale-agnostic metrics. Finally, we demonstrate that our unified method exhibits greater robustness compared to single-stage baselines. Our code and models are available at this https URL.
[CV-62] DreamPainter: Image Background Inpainting for E-commerce Scenarios
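【Quick read】: This paper addresses two challenges in diffusion-based background generation for e-commerce: first, keeping the generated result consistent with the input product while maintaining a reasonable spatial layout and harmonious shadows and reflections between foreground product and background; second, going beyond text-only control by integrating visual information for more precise inpainting. The key to its solution is DreamEcom-400K, a high-quality e-commerce dataset with accurate product instance masks, background reference images, text prompts, and aesthetically pleasing product images, on top of which the authors propose DreamPainter. The framework uses text prompts as a control signal and additionally incorporates reference images as a second control source, enabling joint text and visual control and improving product consistency and controllability.
Link: https://arxiv.org/abs/2508.02155
Authors: Sijie Zhao,Jing Cheng,Yaoyao Wu,Hao Xu,Shaohui Jiao
Affiliations: Bytedance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: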
Abstract:Although diffusion-based image generation has been widely explored and applied, background generation tasks in e-commerce scenarios still face significant challenges. The first challenge is to ensure that the generated products are consistent with the given product inputs while maintaining a reasonable spatial arrangement, harmonious shadows, and reflections between foreground products and backgrounds. Existing inpainting methods fail to address this due to the lack of domain-specific data. The second challenge involves the limitation of relying solely on text prompts for image control, as effectively integrating visual information to achieve precise control in inpainting tasks remains underexplored. To address these challenges, we introduce DreamEcom-400K, a high-quality e-commerce dataset containing accurate product instance masks, background reference images, text prompts, and aesthetically pleasing product images. Based on this dataset, we propose DreamPainter, a novel framework that not only utilizes text prompts for control but also flexibly incorporates reference image information as an additional control signal. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, maintaining high product consistency while effectively integrating both text prompt and reference image information.
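[CV-63] Efficient Chambolle-Pock based algorithms for Convolutional sparse representation
【Quick read】: This paper addresses the efficiency and convergence of optimization algorithms for convolutional sparse representation (CSR), in particular the sensitivity of ADMM-based convolutional sparse coding (CSC) to its penalty parameter: a poor choice can lead to no convergence or very slow convergence. The key to its solution is the Chambolle-Pock (CP) framework, which requires no additional manually tuned parameter and converges faster. The paper further proposes an anisotropic total variation (ATV) penalty on the coefficient maps for CSC, solved with the CP algorithm, and applies the CP framework to the corresponding convolutional dictionary learning (CDL) problem. On noise-free images the proposed CSC algorithms match the latest ADMM-based approach, and they perform better when removing noise from images corrupted by Gaussian noise.
Link: https://arxiv.org/abs/2508.02152
Authors: Yi Liu,Junjing Li,Yang Chen,Haowei Tang,Pengcheng Zhang,Tianling Lyu,Zhiguo Gui
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: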
Abstract:Recently, convolutional sparse representation (CSR), as a sparse representation technique, has attracted increasing attention in the field of image processing due to its translation invariance. CSR usually consists of convolutional sparse coding (CSC) and convolutional dictionary learning (CDL), and many studies focus on how to solve the corresponding optimization problems. At present, the most efficient optimization scheme for CSC is based on the alternating direction method of multipliers (ADMM). However, the ADMM-based approach involves a penalty parameter that needs to be carefully selected, and improper parameter selection may result in either no convergence or very slow convergence. In this paper, a novel fast and efficient method using the Chambolle-Pock (CP) framework is proposed, which does not require extra manually selected parameters during solving and has a faster convergence speed. Furthermore, we propose an anisotropic total variation penalty of the coefficient maps for CSC and apply the CP algorithm to solve it. In addition, we also apply the CP framework to solve the corresponding CDL problem. Experiments show that for noise-free images the proposed CSC algorithms achieve results that rival the latest ADMM-based approach, while outperforming it in removing noise from images corrupted by Gaussian noise.
[CV-64] AttriCtrl: Fine-Grained Control of Aesthetic Attribute Intensity in Diffusion Models
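【Quick read】: This paper addresses fine-grained, continuous, intensity-specific control of aesthetic attributes in generative image models. Existing methods either rely on vague text prompts, which cannot express aesthetic semantics and their intensity precisely, or depend on costly human preference data for alignment, which limits scalability and practicality. The key to its solution is the AttriCtrl framework: it quantifies abstract aesthetics using semantic similarity from pre-trained vision-language models, and it introduces a lightweight value encoder that maps scalar intensities in [0,1] to learnable embeddings, giving diffusion models intuitive, flexible, low-overhead control over aesthetic attributes while integrating seamlessly into mainstream controllable generation frameworks.
Link: https://arxiv.org/abs/2508.02151
Authors: Die Chen,Zhongjie Duan,Zhiwen Li,Cen Chen,Daoyuan Chen,Yaliang Li,Yinda Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: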
Abstract:Recent breakthroughs in text-to-image diffusion models have significantly enhanced both the visual fidelity and semantic controllability of generated images. However, fine-grained control over aesthetic attributes remains challenging, especially when users require continuous and intensity-specific adjustments. Existing approaches often rely on vague textual prompts, which are inherently ambiguous in expressing both the aesthetic semantics and the desired intensity, or depend on costly human preference data for alignment, limiting their scalability and practicality. To address these limitations, we propose AttriCtrl, a plug-and-play framework for precise and continuous control of aesthetic attributes. Specifically, we quantify abstract aesthetics by leveraging semantic similarity from pre-trained vision-language models, and employ a lightweight value encoder that maps scalar intensities in [0,1] to learnable embeddings within diffusion-based generation. This design enables intuitive and customizable aesthetic manipulation, with minimal training overhead and seamless integration into existing generation pipelines. Extensive experiments demonstrate that AttriCtrl achieves accurate control over individual attributes as well as flexible multi-attribute composition. Moreover, it is fully compatible with popular open-source controllable generation frameworks, showcasing strong integration capability and practical utility across diverse generation scenarios.
[CV-65] AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation
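【Quick read】: This paper addresses three problems in Reference Audio-Visual Segmentation (Ref-AVS): models lack genuine semantic understanding, they tend to memorize fixed reasoning patterns, and jointly training for reasoning and segmentation can reduce pixel-level precision. The key to its solution is the AURORA framework, which uses a structured Chain-of-Thought (CoT) prompting mechanism to guide step-by-step reasoning and introduces a segmentation feature distillation loss so that reasoning ability is integrated without sacrificing segmentation performance. It also adopts a two-stage training strategy: a corrective reflective-style training stage first improves the quality of reasoning paths, and reinforcement learning via Group Reward Policy Optimization (GRPO) then strengthens robustness in challenging scenarios, improving both generalization and segmentation accuracy.
Link: https://arxiv.org/abs/2508.02149
Authors: Ziyang Luo,Nian Liu,Fahad Shahbaz Khan,Junwei Han
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: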
Abstract:Reference Audio-Visual Segmentation (Ref-AVS) tasks challenge models to precisely locate sounding objects by integrating visual, auditory, and textual cues. Existing methods often lack genuine semantic understanding, tending to memorize fixed reasoning patterns. Furthermore, jointly training for reasoning and segmentation can compromise pixel-level precision. To address these issues, we introduce AURORA, a novel framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation. We employ a structured Chain-of-Thought (CoT) prompting mechanism to guide the model through a step-by-step reasoning process and introduce a novel segmentation feature distillation loss to effectively integrate these reasoning abilities without sacrificing segmentation performance. To further cultivate the model's genuine reasoning capabilities, we devise a further two-stage training strategy: first, a "corrective reflective-style training" stage utilizes self-correction to enhance the quality of reasoning paths, followed by reinforcement learning via Group Reward Policy Optimization (GRPO) to bolster robustness in challenging scenarios. Experiments demonstrate that AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
[CV-66] ScrewSplat: An End-to-End Method for Articulated Object Recognition
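【Quick read】: This paper addresses articulated object recognition, that is, identifying both the geometry and the kinematic joints of objects with movable parts (such as doors and laptops), which is essential for robots that interact with everyday objects. Existing methods often rely on strong assumptions (such as a known number of parts), require extra inputs (such as depth images), or involve complex intermediate steps, which limits their practicality. The key to its solution is ScrewSplat, an end-to-end method driven only by RGB observations: screw axes are randomly initialized and then iteratively optimized, together with Gaussian Splatting, to recover the object's kinematic structure while simultaneously reconstructing the 3D geometry and segmenting the object into rigid, movable parts. The recovered kinematic model also supports zero-shot, text-guided manipulation.
Link: https://arxiv.org/abs/2508.02146
Authors: Seungyeon Kim,Junsu Ha,Young Hun Kim,Yonghyeon Lee,Frank C. Park
Affiliations: Seoul National University; Massachusetts Institute of Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 12 figures, Conference on Robot Learning (CoRL) 2025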
Abstract:Articulated object recognition – the task of identifying both the geometry and kinematic joints of objects with movable parts – is essential for enabling robots to interact with everyday objects such as doors and laptops. However, existing approaches often rely on strong assumptions, such as a known number of articulated parts; require additional inputs, such as depth images; or involve complex intermediate steps that can introduce potential errors – limiting their practicality in real-world settings. In this paper, we introduce ScrewSplat, a simple end-to-end method that operates solely on RGB observations. Our approach begins by randomly initializing screw axes, which are then iteratively optimized to recover the object’s underlying kinematic structure. By integrating with Gaussian Splatting, we simultaneously reconstruct the 3D geometry and segment the object into rigid, movable parts. We demonstrate that our method achieves state-of-the-art recognition accuracy across a diverse set of articulated objects, and further enables zero-shot, text-guided manipulation using the recovered kinematic model.
[CV-67] TrackletGait: A Robust Framework for Gait Recognition in the Wild
【Quick read】: This paper addresses gait recognition in real-world surveillance scenarios, in particular the performance drop on non-periodic and occluded silhouette sequences. Existing methods depend on periodic gait cycles and controlled environments and perform poorly in the wild. The key to its solution is a new framework, TrackletGait, with three components: Random Tracklet Sampling, which balances robustness and representation when capturing diverse walking patterns; Haar Wavelet-based Downsampling, which preserves information during spatial downsampling; and a Hardness Exclusion Triplet Loss, which excludes low-quality silhouettes by discarding hard triplet samples. The method reaches 77.8% and 80.4% rank-1 accuracy on the Gait3D and GREW datasets, respectively, with only 10.3M backbone parameters, which the authors report as state of the art.
Link: https://arxiv.org/abs/2508.02143
Authors: Shaoxiong Zhang,Jinkai Zheng,Shangdong Zhu,Chenggang Yan
Affiliations: Hangzhou Dianzi University; Key Laboratory of Micro-nano Sensing and IoT of Wenzhou; Wenzhou Institute of Hangzhou Dianzi University; Lishui Institute of Hangzhou Dianzi University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Gait recognition aims to identify individuals based on their body shape and walking patterns. Though much progress has been achieved driven by deep learning, gait recognition in real-world surveillance scenarios remains quite challenging to current methods. Conventional approaches, which rely on periodic gait cycles and controlled environments, struggle with the non-periodic and occluded silhouette sequences encountered in the wild. In this paper, we propose a novel framework, TrackletGait, designed to address these challenges in the wild. We propose Random Tracklet Sampling, a generalization of existing sampling methods, which strikes a balance between robustness and representation in capturing diverse walking patterns. Next, we introduce Haar Wavelet-based Downsampling to preserve information during spatial downsampling. Finally, we present a Hardness Exclusion Triplet Loss, designed to exclude low-quality silhouettes by discarding hard triplet samples. TrackletGait achieves state-of-the-art results, with 77.8 and 80.4 rank-1 accuracy on the Gait3D and GREW datasets, respectively, while using only 10.3M backbone parameters. Extensive experiments are also conducted to further investigate the factors affecting gait recognition in the wild.
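The abstract says that Haar wavelet-based downsampling preserves information during spatial downsampling, without giving implementation details. The sketch below shows a generic single-level 2D Haar transform on a small silhouette to illustrate why that holds: the four half-resolution sub-bands together retain everything that plain average pooling would throw away. It is a minimal illustration under stated assumptions, not the TrackletGait layer; the function `haar_downsample`, the sub-band names, and the toy input are hypothetical.

```python
def haar_downsample(image):
    """Single-level 2D Haar transform of a 2D list with even height and width.

    Returns four half-resolution sub-bands: the local average (approx) and
    three detail bands holding what plain average pooling would discard.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must both be even")
    approx, col_detail, row_detail, diag_detail = [], [], [], []
    for r in range(0, rows, 2):
        bands = ([], [], [], [])
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            # Orthonormal 2x2 Haar basis (each coefficient scaled by 1/2).
            bands[0].append((a + b + d + e) / 2.0)  # local average
            bands[1].append((a - b + d - e) / 2.0)  # left vs. right column
            bands[2].append((a + b - d - e) / 2.0)  # top vs. bottom row
            bands[3].append((a - b - d + e) / 2.0)  # diagonal difference
        approx.append(bands[0])
        col_detail.append(bands[1])
        row_detail.append(bands[2])
        diag_detail.append(bands[3])
    return approx, col_detail, row_detail, diag_detail


# Toy 4x4 binary "silhouette" (hypothetical input).
silhouette = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
approx, col_detail, row_detail, diag_detail = haar_downsample(silhouette)
```

Because the 2x2 Haar basis is orthonormal, the transform is invertible and the total signal energy is unchanged; a network can stack the four sub-bands as channels and halve the spatial resolution without losing the fine silhouette edges.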
[CV-68] AID4AD: Aerial Image Data for Automated Driving Perception
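【Quick read】: This paper addresses the lack of accurate, scalable environmental context for perception in automated vehicles (AVs), especially where high-definition (HD) maps are unavailable, outdated, or costly to maintain. The key to its solution is the AID4AD dataset, which aligns high-resolution aerial imagery precisely to the nuScenes local coordinate system using the SLAM-based point cloud maps provided by nuScenes. An alignment workflow corrects for localization and projection distortions, and a manual quality control process selects a set of high-quality alignments that are published as ground truth. Experiments show a 15-23% improvement in online map construction accuracy and, in motion prediction, a 2% gain in trajectory prediction when aerial imagery replaces HD maps, supporting aerial imagery as a scalable environmental representation.
Link: https://arxiv.org/abs/2508.02140
Authors: Daniel Lengerer,Mathias Pechinger,Klaus Bogenberger,Carsten Markgraf
Affiliations: Technical University of Applied Sciences Augsburg; Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: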
Abstract:This work investigates the integration of spatially aligned aerial imagery into perception tasks for automated vehicles (AVs). As a central contribution, we present AID4AD, a publicly available dataset that augments the nuScenes dataset with high-resolution aerial imagery precisely aligned to its local coordinate system. The alignment is performed using SLAM-based point cloud maps provided by nuScenes, establishing a direct link between aerial data and nuScenes local coordinate system. To ensure spatial fidelity, we propose an alignment workflow that corrects for localization and projection distortions. A manual quality control process further refines the dataset by identifying a set of high-quality alignments, which we publish as ground truth to support future research on automated registration. We demonstrate the practical value of AID4AD in two representative tasks: in online map construction, aerial imagery serves as a complementary input that improves the mapping process; in motion prediction, it functions as a structured environmental representation that replaces high-definition maps. Experiments show that aerial imagery leads to a 15-23% improvement in map construction accuracy and a 2% gain in trajectory prediction performance. These results highlight the potential of aerial imagery as a scalable and adaptable source of environmental context in automated vehicle systems, particularly in scenarios where high-definition maps are unavailable, outdated, or costly to maintain. AID4AD, along with evaluation code and pretrained models, is publicly released to foster further research in this direction: this https URL.
[CV-69] Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference ICCV2025
【Quick read】: This paper addresses the performance drop of Video Multimodal Large Language Models (Video-MLLMs) on long videos, which is caused by the context length limit of the underlying LLM. Existing remedies such as token compression or streaming inference ease the length limit but sacrifice feature granularity or inference efficiency. The key to its solution is Free-MoRef, a training-free method that reconstructs the vision tokens into several short sequences used as multi-references, introduces MoRef-attention to gather clues from the reference chunks in parallel and summarize unified query activations, and, after the shadow layers of the LLM, applies a reference fusion step that combines key tokens from the parallel chunks to compensate for the cross-reference visual interactions that MoRef-attention neglects. This allows comprehensive perception of longer frame inputs within a single inference pass at much lower compute cost. Experiments show full perception of 2× to 8× longer inputs on a single A100 GPU with instant responses, even surpassing dedicatedly trained long-video models.
Link: https://arxiv.org/abs/2508.02134
Authors: Kuo Wang,Quanlong Zheng,Junlin Xie,Yanhao Zhang,Jinguo Luo,Haonan Lu,Liang Lin,Fan Zhou,Guanbin Li
Affiliations: Sun Yat-sen University; Peng Cheng Laboratory; OPPO AI Center, OPPO Inc.; The Chinese University of Hong Kong, Shenzhen; Harbin Institute of Technology, Shenzhen; Shenzhen Key Laboratory of Digital Living Network and Content Service; Guangdong Key Laboratory of Big Data Analysis and Processing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: published in ICCV 2025
Abstract:Video Multimodal Large Language Models (Video-MLLM) have achieved remarkable advancements in video understanding tasks. However, constrained by the context length limitation in the underlying LLMs, existing Video-MLLMs typically exhibit suboptimal performance on long video scenarios. To understand extended input frames, common solutions span token compression and streaming inference techniques, which sacrifice feature granularity or inference efficiency. Differently, to efficiently achieve comprehensive understanding of longer frame inputs, we draw ideas from MoE and propose a training-free approach Free-MoRef, which instantly multiplexes the context perception capabilities of Video-MLLMs within one inference pass. Specifically, Free-MoRef reconstructs the vision tokens into several short sequences as multi-references. Subsequently, we introduce MoRef-attention, which gathers clues from the multi-reference chunks in parallel to summarize unified query activations. After the shadow layers in LLMs, a reference fusion step is derived to compose a final mixed reasoning sequence with key tokens from parallel chunks, which compensates the cross-reference vision interactions that are neglected in MoRef-attention. By splitting and fusing the long vision token sequences, Free-MoRef achieves improved performance under much lower computing costs in reasoning multiplexed context length, demonstrating strong efficiency and effectiveness. Experiments on VideoMME, MLVU, LongVideoBench show that Free-MoRef achieves full perception of 2× to 8× longer input frames without compression on a single A100 GPU while keeping instant responses, thereby bringing significant performance gains, even surpassing dedicatedly trained long-video-MLLMs. Codes are available at this https URL
[CV-70] A Neural Quality Metric for BRDF Models
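【Quick read】: This paper addresses the failure of traditional BRDF-space metrics to reflect perceived differences when evaluating bidirectional reflectance distribution function (BRDF) models: such metrics usually rely on numerical error measures and miss the visual differences that appear in rendered images. The key to its solution is the first perceptually informed neural quality metric that operates directly in BRDF space and predicts perceptual quality without rendering. At its core is a compact multi-layer perceptron (MLP) trained on measured BRDFs supplemented with synthetic data and labelled with a perceptually validated image-space metric. Given paired samples of a reference and an approximated BRDF, it outputs their perceptual difference as a just-objectionable-difference (JOD) score and correlates significantly better with human judgments than existing BRDF-space metrics.
Link: https://arxiv.org/abs/2508.02131
Authors: Behnaz Kavoosighafi,Rafal K. Mantiuk,Saghi Hajisharif,Ehsan Miandji,Jonas Unger
Affiliations: Linköping University; University of Cambridge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: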
Abstract:Accurately evaluating the quality of bidirectional reflectance distribution function (BRDF) models is essential for photo-realistic rendering. Traditional BRDF-space metrics often employ numerical error measures that fail to capture perceptual differences evident in rendered images. In this paper, we introduce the first perceptually informed neural quality metric for BRDF evaluation that operates directly in BRDF space, eliminating the need for rendering during quality assessment. Our metric is implemented as a compact multi-layer perceptron (MLP), trained on a dataset of measured BRDFs supplemented with synthetically generated data and labelled using a perceptually validated image-space metric. The network takes as input paired samples of reference and approximated BRDFs and predicts their perceptual quality in terms of just-objectionable-difference (JOD) scores. We show that our neural metric achieves significantly higher correlation with human judgments than existing BRDF-space metrics. While its performance as a loss function for BRDF fitting remains limited, the proposed metric offers a perceptually grounded alternative for evaluating BRDF models.
[CV-71] VDEGaussian: Video Diffusion Enhanced 4D Gaussian Splatting for Dynamic Urban Scenes Modeling
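[Quick read]: This paper addresses the limitations of existing neural radiance field and Gaussian Splatting methods in dynamic urban scene modeling when handling fast-moving objects, in particular the difficulty of modeling temporal discontinuities under undersampled capture and the dependence on pre-calibrated object tracks. The key to the solution is a video diffusion-enhanced 4D Gaussian Splatting framework that distills robust, temporally consistent priors from a test-time adapted video diffusion model, combined with two core innovations: a joint timestamp optimization strategy that refines the poses of interpolated frames, and an uncertainty distillation method that adaptively extracts target content while preserving well-reconstructed regions. Together these markedly improve dynamic modeling of fast-moving objects, yielding a PSNR gain of about 2 dB in novel view synthesis.
Link: https://arxiv.org/abs/2508.02129
Authors: Yuru Xiao,Zihan Lin,Chao Lu,Deming Zhai,Kui Jiang,Wenbo Zhao,Wei Zhang,Junjun Jiang,Huanran Wang,Xianming Liu
Affiliations: Tsinghua University; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: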
Abstract:Dynamic urban scene modeling is a rapidly evolving area with broad applications. While current approaches leveraging neural radiance fields or Gaussian Splatting have achieved fine-grained reconstruction and high-fidelity novel view synthesis, they still face significant limitations. These often stem from a dependence on pre-calibrated object tracks or difficulties in accurately modeling fast-moving objects from undersampled capture, particularly due to challenges in handling temporal discontinuities. To overcome these issues, we propose a novel video diffusion-enhanced 4D Gaussian Splatting framework. Our key insight is to distill robust, temporally consistent priors from a test-time adapted video diffusion model. To ensure precise pose alignment and effective integration of this denoised content, we introduce two core innovations: a joint timestamp optimization strategy that refines interpolated frame poses, and an uncertainty distillation method that adaptively extracts target content while preserving well-reconstructed regions. Extensive experiments demonstrate that our method significantly enhances dynamic modeling, especially for fast-moving objects, achieving an approximate PSNR gain of 2 dB for novel view synthesis over baseline approaches.
[CV-72] Beyond RGB and Events: Enhancing Object Detection under Adverse Lighting with Monocular Normal Maps
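[Quick read]: This paper addresses the drop in object detection accuracy under adverse lighting (for example, distracting reflections from tunnel walls or road surfaces), where relying on RGB images or event streams alone readily produces false detections. The key to the solution is to use surface normal maps predicted from monocular RGB images as robust geometric cues that suppress false positives and improve detection accuracy. To this end, the authors propose NRE-Net, which fuses RGB images, normal maps, and event streams through an Adaptive Dual-stream Fusion Module (ADFM) and an Event-modality Aware Fusion Module (EAFM), and significantly outperforms existing frame-based and fusion-based methods.
Link: https://arxiv.org/abs/2508.02127
Authors: Mingjie Liu,Hanqing Liu,Chuang Zhu
Affiliations: Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: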
Abstract:Accurate object detection under adverse lighting conditions is critical for real-world applications such as autonomous driving. Although neuromorphic event cameras have been introduced to handle these scenarios, adverse lighting often induces distracting reflections from tunnel walls or road surfaces, which frequently lead to false obstacle detections. However, neither RGB nor event data alone is robust enough to address these complexities, and mitigating these issues without additional sensors remains underexplored. To overcome these challenges, we propose leveraging normal maps, directly predicted from monocular RGB images, as robust geometric cues to suppress false positives and enhance detection accuracy. We introduce NRE-Net, a novel multi-modal detection framework that effectively fuses three complementary modalities: monocularly predicted surface normal maps, RGB images, and event streams. To optimize the fusion process, our framework incorporates two key modules: the Adaptive Dual-stream Fusion Module (ADFM), which integrates RGB and normal map features, and the Event-modality Aware Fusion Module (EAFM), which adapts to the high dynamic range characteristics of event data. Extensive evaluations on the DSEC-Det-sub and PKU-DAVIS-SOD datasets demonstrate that NRE-Net significantly outperforms state-of-the-art methods. Our approach achieves mAP50 improvements of 7.9% and 6.1% over frame-based approaches (e.g., YOLOX), while surpassing the fusion-based SFNet by 2.7% on the DSEC-Det-sub dataset and SODFormer by 7.1% on the PKU-DAVIS-SOD dataset.
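A note on the mAP50 figures quoted above: the "50" refers to an intersection-over-union (IoU) threshold of 0.5 used when matching a predicted box to a ground-truth box. The following is a minimal sketch of that matching criterion only, not the NRE-Net evaluation code. The `(x1, y1, x2, y2)` box format and both function names are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a hit at mAP50 when its IoU with the ground truth is >= 0.5."""
    return iou(pred_box, gt_box) >= threshold
```

A full mAP50 computation additionally ranks predictions by confidence and averages precision over recall levels and classes; that part is omitted here.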
[CV-73] DeflareMamba: Hierarchical Vision Mamba for Contextually Consistent Lens Flare Removal
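[Quick read]: This paper addresses the information confusion problem in lens flare removal, namely how to separate the image background from optical flares under complex optical interactions while keeping the result contextually consistent. Existing methods often fail to maintain the local-global dependencies of the scene, which leads to incomplete or inconsistent removal. The key to the solution is DeflareMamba, the first work to introduce state space models (SSMs) to flare removal: a hierarchical framework establishes long-range pixel correlations through sampling patterns with varied strides, and locally enhanced state space modules preserve local details at the same time, so that scattering and reflective flares are removed effectively while non-flare regions keep their natural appearance.
Link: https://arxiv.org/abs/2508.02113
Authors: Yihang Huang,Yuanfei Huang,Junhui Lin,Hua Huang
Affiliations: School of Artificial Intelligence, Beijing Normal University; Engineering Research Center of Intelligent Technology and Educational Application, Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by ACMMM 2025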
Abstract:Lens flare removal remains an information confusion challenge between the underlying image background and the optical flares, due to the complex optical interactions between light sources and the camera lens. While recent solutions have shown promise in decoupling the flare corruption from the image, they often fail to maintain contextual consistency, leading to incomplete and inconsistent flare removal. To eliminate this limitation, we propose DeflareMamba, which leverages the efficient sequence modeling capabilities of state space models while maintaining the ability to capture local-global dependencies. In particular, we design a hierarchical framework that establishes long-range pixel correlations through varied stride sampling patterns, and utilize local-enhanced state space models that simultaneously preserve local details. To the best of our knowledge, this is the first work that introduces state space models to the flare removal task. Extensive experiments demonstrate that our method effectively removes various types of flare artifacts, including scattering and reflective flares, while maintaining the natural appearance of non-flare regions. Further downstream applications demonstrate the capacity of our method to improve visual object recognition and cross-modal semantic understanding. Code is available at this https URL.
[CV-74] AutoLoRA: Automatic LoRA Retrieval and Fine-Grained Gated Fusion for Text-to-Image Generation
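[Quick read]: This paper addresses the practical deployment limits of generative AI models for which parameter fine-tuning is intractable, and in particular three challenges in using open-source low-rank adaptation (LoRA) modules: sparse metadata annotation, the need for zero-shot adaptation, and suboptimal multi-LoRA fusion strategies. The key to the solution is a semantic-driven LoRA retrieval and dynamic aggregation framework with two core components: (1) a weight-encoding-based LoRA retriever that builds a shared semantic space between LoRA parameter matrices and text prompts, removing the dependence on the original training data; and (2) a fine-grained gated fusion mechanism that computes context-specific fusion weights across network layers and diffusion timesteps to integrate multiple LoRA modules during generation. The method significantly improves image generation performance and supports scalable, data-efficient enhancement of foundation models.
Link: https://arxiv.org/abs/2508.02107
Authors: Zhiwen Li,Zhongjie Duan,Die Chen,Cen Chen,Daoyuan Chen,Yaliang Li,Yingda Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: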
Abstract:Despite recent advances in photorealistic image generation through large-scale models like FLUX and Stable Diffusion v3, the practical deployment of these architectures remains constrained by their inherent intractability to parameter fine-tuning. While low-rank adaptation (LoRA) has demonstrated efficacy in enabling model customization with minimal parameter overhead, the effective utilization of distributed open-source LoRA modules faces three critical challenges: sparse metadata annotation, the requirement for zero-shot adaptation capabilities, and suboptimal strategies for multi-LoRA fusion. To address these limitations, we introduce a novel framework that enables semantic-driven LoRA retrieval and dynamic aggregation through two key components: (1) a weight-encoding-based LoRA retriever that establishes a shared semantic space between LoRA parameter matrices and text prompts, eliminating dependence on original training data, and (2) a fine-grained gated fusion mechanism that computes context-specific fusion weights across network layers and diffusion timesteps to optimally integrate multiple LoRA modules during generation. Our approach achieves significant improvement in image generation performance, thereby facilitating scalable and data-efficient enhancement of foundational models. This work establishes a critical bridge between the fragmented landscape of community-developed LoRAs and practical deployment requirements, enabling collaborative model evolution through standardized adapter integration.
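To make the idea of gated multi-LoRA fusion concrete, here is a minimal sketch of merging several low-rank updates into one weight matrix, W = W0 + Σ g_i · B_i A_i. It is not the paper's mechanism: the abstract does not say how the gates are computed, so the softmax over externally supplied `gate_scores` is an assumption, as are all function names. In the paper the weights vary per layer and per diffusion timestep; this sketch fuses a single layer once.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m), both lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def softmax(scores):
    """Numerically stable softmax; used here as a hypothetical gating function."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_loras(base_weight, loras, gate_scores):
    """Return base_weight + sum_i gate_i * (B_i @ A_i) without mutating the inputs.

    loras is a list of (B, A) pairs, with B of shape d_out x r and A of shape r x d_in.
    """
    gates = softmax(gate_scores)
    fused = [row[:] for row in base_weight]
    for gate, (b, a) in zip(gates, loras):
        delta = matmul(b, a)  # low-rank update of this adapter
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                fused[i][j] += gate * value
    return fused
```

With equal gate scores every adapter contributes equally; raising one score shifts the fused weights toward that adapter.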
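[CV-75] Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis ICCV2025
[Quick read]: This paper addresses the challenge of generating physically plausible human interaction motions in real time for immersive virtual reality (VR) / augmented reality (AR) systems and humanoid robots, where existing methods struggle to reconcile real-time responsiveness, physical feasibility, and safety. The core of the solution is the Human-X framework: an auto-regressive reaction diffusion planner jointly predicts actions and reactions in an interaction, giving real-time synchronization and context-aware responses, while an actor-aware motion tracking policy trained with reinforcement learning adapts dynamically to the interaction partner's movements and avoids physical artifacts such as foot sliding and penetration. The result is a marked improvement in interaction continuity, physical plausibility, and safety in real-world scenarios.
Link: https://arxiv.org/abs/2508.02106
Authors: Kaiyang Ji,Ye Shi,Zichen Jin,Kangyi Chen,Lan Xu,Yuexin Ma,Jingyi Yu,Jingya Wang
Affiliations: ShanghaiTech University; Shanghai Engineering Research Center of Intelligent Vision and Imaging
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by ICCV 2025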
Abstract:Real-time synthesis of physically plausible human interactions remains a critical challenge for immersive VR/AR systems and humanoid robotics. While existing methods demonstrate progress in kinematic motion generation, they often fail to address the fundamental tension between real-time responsiveness, physical feasibility, and safety requirements in dynamic human-machine interactions. We introduce Human-X, a novel framework designed to enable immersive and physically plausible human interactions across diverse entities, including human-avatar, human-humanoid, and human-robot systems. Unlike existing approaches that focus on post-hoc alignment or simplified physics, our method jointly predicts actions and reactions in real-time using an auto-regressive reaction diffusion planner, ensuring seamless synchronization and context-aware responses. To enhance physical realism and safety, we integrate an actor-aware motion tracking policy trained with reinforcement learning, which dynamically adapts to interaction partners’ movements while avoiding artifacts like foot sliding and penetration. Extensive experiments on the Inter-X and InterHuman datasets demonstrate significant improvements in motion quality, interaction continuity, and physical plausibility over state-of-the-art methods. Our framework is validated in real-world applications, including a virtual reality interface for human-robot interaction, showcasing its potential for advancing human-robot collaboration.
[CV-76] VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
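[Quick read]: This paper addresses a fundamental limitation of current vision language models (VLMs) in understanding dynamic spatiotemporal interactions: they have difficulty capturing translational and rotational object motion, perspective changes, and motion continuity. The key to the solution is VLM4D, the first benchmark designed specifically to evaluate the spatiotemporal reasoning of VLMs. It comprises diverse real-world and synthetic videos with carefully curated question-answer pairs, systematically exposes the shortcomings of existing models in integrating multiple visual cues and maintaining temporal coherence, and further validates 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning as effective directions for improving the spatial and temporal grounding of VLMs in dynamic environments.
Link: https://arxiv.org/abs/2508.02095
Authors: Shijie Zhou,Alexander Vilesov,Xuehai He,Ziyu Wan,Shuwang Zhang,Aditya Nagachandra,Di Chang,Dongdong Chen,Xin Eric Wang,Achuta Kadambi
Affiliations: UCLA; Microsoft; UCSC; USC
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: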
Abstract:Vision language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts, abilities that are essential for robust dynamic real-world understanding yet notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, demonstrating their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs’ spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.
[CV-77] S-RRG-Bench: Structured Radiology Report Generation with Fine-Grained Evaluation Framework
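[Quick read]: This paper addresses the redundancy and inconsistent language of traditional free-text radiology reports, as well as the limited clinical usefulness of existing structured radiology report generation (S-RRG) methods, which depend on predefined label sets and produce fragmented or templated outputs that are weakly expressive and omit key clinical details. The solution has three parts: first, a high-quality structured report dataset for chest X-rays (MIMIC-STRUC) containing disease names, severity levels, probabilities, and anatomical locations; second, a model based on a large language model (LLM), trained to generate standardized, high-quality reports; and third, an evaluation metric tailored to S-RRG (S-Score) that measures both disease prediction accuracy and the precision of disease-specific details, which brings it closer to clinical decision-making needs and aligns it more closely with human assessments.
Link: https://arxiv.org/abs/2508.02082
Authors: Yingshu Li,Yunyi Liu,Zhanyu Wang,Xinyu Liang,Lingqiao Liu,Lei Wang,Luping Zhou
Affiliations: University of Sydney; University of Wollongong; University of Adelaide; First Clinical Medical College, Guangzhou University of Chinese Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: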
Abstract:Radiology report generation (RRG) for diagnostic images, such as chest X-rays, plays a pivotal role in both clinical practice and AI. Traditional free-text reports suffer from redundancy and inconsistent language, complicating the extraction of critical clinical details. Structured radiology report generation (S-RRG) offers a promising solution by organizing information into standardized, concise formats. However, existing approaches often rely on classification or visual question answering (VQA) pipelines that require predefined label sets and produce only fragmented outputs. Template-based approaches, which generate reports by replacing keywords within fixed sentence patterns, further compromise expressiveness and often omit clinically important details. In this work, we present a novel approach to S-RRG that includes dataset construction, model training, and the introduction of a new evaluation framework. We first create a robust chest X-ray dataset (MIMIC-STRUC) that includes disease names, severity levels, probabilities, and anatomical locations, ensuring that the dataset is both clinically relevant and well-structured. We train an LLM-based model to generate standardized, high-quality reports. To assess the generated reports, we propose a specialized evaluation metric (S-Score) that not only measures disease prediction accuracy but also evaluates the precision of disease-specific details, thus offering a clinically meaningful metric for report quality that focuses on elements critical to clinical decision-making and demonstrates a stronger alignment with human assessments. Our approach highlights the effectiveness of structured reports and the importance of a tailored evaluation metric for S-RRG, providing a more clinically relevant measure of report quality.
[CV-78] YOLOv1 to YOLOv11: A Comprehensive Survey of Real-Time Object Detection Innovations and Challenges
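[Quick read]: This paper systematically reviews the evolution of the YOLO (You Only Look Once) family of models over the past decade and examines how object detection can keep improving the balance between speed, accuracy, and deployment efficiency. The key point is that architectural innovations and algorithmic improvements have taken YOLO from its original regression-based detector (YOLOv1) to modern multi-task frameworks (such as YOLOv9), substantially improving detection accuracy at real-time speed and extending to instance segmentation, pose estimation, object tracking, medical imaging, and other applications.
Link: https://arxiv.org/abs/2508.02067
Authors: Manikanta Kotthapalli,Deepika Ravipati,Reshma Bhatia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 2 figures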
Abstract:Over the past decade, object detection has advanced significantly, with the YOLO (You Only Look Once) family of models transforming the landscape of real-time vision applications through unified, end-to-end detection frameworks. From YOLOv1’s pioneering regression-based detection to the latest YOLOv9, each version has systematically enhanced the balance between speed, accuracy, and deployment efficiency through continuous architectural and algorithmic advancements… Beyond core object detection, modern YOLO architectures have expanded to support tasks such as instance segmentation, pose estimation, object tracking, and domain-specific applications including medical imaging and industrial automation. This paper offers a comprehensive review of the YOLO family, highlighting architectural innovations, performance benchmarks, extended capabilities, and real-world use cases. We critically analyze the evolution of YOLO models and discuss emerging research directions that extend their impact across diverse computer vision domains.
[CV-79] StarPose: 3D Human Pose Estimation via Spatial-Temporal Autoregressive Diffusion
[Quick read]: This paper addresses the limited accuracy and poor temporal consistency of monocular 3D human pose estimation caused by depth ambiguity and occlusion. Existing diffusion-based methods generate high-quality results but do not model the spatial and temporal correlations across frames, which limits their performance in dynamic scenes. The key to the solution is the StarPose framework, with two core innovations: (1) the 2D-to-3D pose mapping is modeled as an autoregressive diffusion process, in which a Historical Pose Integration Module (HPIM) fuses the 3D predictions of previous frames into temporally informative historical pose embeddings that guide denoising; and (2) a plug-and-play Spatial-Temporal Physical Guidance (STPG) mechanism constrains spatial anatomical plausibility and temporal motion dynamics during iterative denoising, markedly improving accuracy and temporal consistency.
Link: https://arxiv.org/abs/2508.02056
Authors: Haoxin Yang,Weihong Chen,Xuemiao Xu,Cheng Xu,Peng Xiao,Cuifeng Sun,Shaoyu Huang,Shengfeng He
Affiliations: South China University of Technology; State Key Laboratory of Subtropical Building Science; Ministry of Education Key Laboratory of Big Data and Intelligent Robot; Guangdong Provincial Key Lab of Computational Intelligence and Cyberspace Information; The Hong Kong Polytechnic University; Chinese Academy of Sciences; Guangzhou Yichuang Information Technology Co., Ltd.; Singapore Management University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Monocular 3D human pose estimation remains a challenging task due to inherent depth ambiguities and occlusions. Compared to traditional methods based on Transformers or Convolutional Neural Networks (CNNs), recent diffusion-based approaches have shown superior performance, leveraging their probabilistic nature and high-fidelity generation capabilities. However, these methods often fail to account for the spatial and temporal correlations across predicted frames, resulting in limited temporal consistency and inferior accuracy in predicted 3D pose sequences. To address these shortcomings, this paper proposes StarPose, an autoregressive diffusion framework that effectively incorporates historical 3D pose predictions and spatial-temporal physical guidance to significantly enhance both the accuracy and temporal coherence of pose predictions. Unlike existing approaches, StarPose models the 2D-to-3D pose mapping as an autoregressive diffusion process. By synergically integrating previously predicted 3D poses with 2D pose inputs via a Historical Pose Integration Module (HPIM), the framework generates rich and informative historical pose embeddings that guide subsequent denoising steps, ensuring temporally consistent predictions. In addition, a fully plug-and-play Spatial-Temporal Physical Guidance (STPG) mechanism is tailored to refine the denoising process in an iterative manner, which further enforces spatial anatomical plausibility and temporal motion dynamics, rendering robust and realistic pose estimates. Extensive experiments on benchmark datasets demonstrate that StarPose outperforms state-of-the-art methods, achieving superior accuracy and temporal consistency in 3D human pose estimation. Code is available at this https URL.
[CV-80] HCF: Hierarchical Cascade Framework for Distributed Multi-Stage Image Compression
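[Quick read]: This paper addresses efficiency and performance bottlenecks in distributed multi-stage image compression: traditional progressive methods allow bitstream truncation but underuse available compute; successive compression repeats costly pixel-domain operations, causing cumulative quality loss and inefficiency; and fixed-parameter models lack post-encoding flexibility. The key to the solution is the Hierarchical Cascade Framework (HCF), which performs latent-space transformations directly across network nodes for an efficient, high-quality compression pipeline, introduces policy-driven quantization control to optimize the rate-distortion trade-off, and establishes an edge quantization principle through differential entropy analysis, significantly improving compression performance and computational efficiency.
Link: https://arxiv.org/abs/2508.02051
Authors: Junhao Cai,Taegun An,Chengjun Jin,Sung Il Choi,JuHyun Park,Changhee Joo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: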
Abstract:Distributed multi-stage image compression – where visual content traverses multiple processing nodes under varying quality requirements – poses challenges. Progressive methods enable bitstream truncation but underutilize available compute resources; successive compression repeats costly pixel-domain operations and suffers cumulative quality loss and inefficiency; fixed-parameter models lack post-encoding flexibility. In this work, we developed the Hierarchical Cascade Framework (HCF) that achieves high rate-distortion performance and better computational efficiency through direct latent-space transformations across network nodes in distributed multi-stage image compression systems. Under HCF, we introduced policy-driven quantization control to optimize rate-distortion trade-offs, and established the edge quantization principle through differential entropy analysis. The configuration based on this principle demonstrates up to 0.6 dB PSNR gains over other configurations. When comprehensively evaluated on the Kodak, CLIC, and CLIC2020-mobile datasets, HCF outperforms successive-compression methods by up to 5.56% BD-Rate in PSNR on CLIC, while saving up to 97.8% FLOPs, 96.5% GPU memory, and 90.0% execution time. It also outperforms state-of-the-art progressive compression methods by up to 12.64% BD-Rate on Kodak and enables retraining-free cross-quality adaptation with 7.13-10.87% BD-Rate reductions on CLIC2020-mobile.
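The PSNR gains quoted above are easier to interpret with the definition at hand: PSNR = 10 · log10(MAX² / MSE). The sketch below implements this standard formula over flat pixel lists. It is a generic illustration, not code from the paper, and the function name and the default peak value of 255 (8-bit images) are assumptions.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel sequences."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a 0.6 dB gain at a fixed peak value corresponds to a mean squared error that is roughly 13% lower (a factor of 10^(-0.06) ≈ 0.87).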
[CV-81] Mapillary Vistas Validation for Fine-Grained Traffic Signs: A Benchmark Revealing Vision-Language Model Limitations ICCV2025
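[Quick read]: This paper addresses the lack of fine-grained traffic sign annotations for autonomous driving: existing datasets such as Mapillary usually provide only coarse-grained labels and cannot distinguish semantically important types (for example, stop signs versus speed limit signs). The key to the solution is a new validation set, Mapillary Vistas Validation for Traffic Signs (MVV), which decomposes composite traffic signs into fine-grained, semantically meaningful categories, provides pixel-level instance masks, and is manually annotated by experts to ensure label fidelity. The authors also benchmark several vision-language models (VLMs) against the self-supervised DINOv2 model on this dataset and find that DINOv2 consistently outperforms them, both on traffic sign recognition and on heavily represented categories such as vehicles and humans, which reveals the limits of current VLMs in fine-grained visual understanding and establishes DINOv2 as a strong baseline for dense semantic matching in autonomous driving scenarios.
Link: https://arxiv.org/abs/2508.02047
Authors: Sparsh Garg,Abhishek Aich
Affiliations: NEC Laboratories, America
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025 Workshop (4th DataCV Workshop and Challenge)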
Abstract:Obtaining high-quality fine-grained annotations for traffic signs is critical for accurate and safe decision-making in autonomous driving. Widely used datasets, such as Mapillary, often provide only coarse-grained labels, without distinguishing semantically important types such as stop signs or speed limit signs. To this end, we present a new validation set for traffic signs derived from the Mapillary dataset called Mapillary Vistas Validation for Traffic Signs (MVV), where we decompose composite traffic signs into granular, semantically meaningful categories. The dataset includes pixel-level instance masks and has been manually annotated by expert annotators to ensure label fidelity. Further, we benchmark several state-of-the-art VLMs against the self-supervised DINOv2 model on this dataset and show that DINOv2 consistently outperforms all VLM baselines, not only on traffic sign recognition, but also on heavily represented categories like vehicles and humans. Our analysis reveals significant limitations in current vision-language models for fine-grained visual understanding and establishes DINOv2 as a strong baseline for dense semantic matching in autonomous driving scenarios. This dataset and evaluation framework pave the way for more reliable, interpretable, and scalable perception systems. Code and data are available at: this https URL
[CV-82] Conditional Diffusion Model with Anatomical-Dose Dual Constraints for End-to-End Multi-Tumor Dose Prediction
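[Quick read]: This paper addresses the reliance of radiotherapy treatment planning on expert experience, its time cost and poor generalization, and the limits of existing deep learning methods in prediction accuracy and clinical applicability. The key to the solution is ADDiff-Dose, a conditional diffusion model with anatomical-dose dual constraints. It compresses high-dimensional CT data with a lightweight 3D variational autoencoder (LightweightVAE3D), integrates multimodal inputs including target and organ-at-risk (OAR) masks and beam parameters, models conditional features with multi-head attention inside a progressive noise addition and denoising framework, and uses a composite loss combining mean squared error (MSE), conditional terms, and KL divergence to keep the dose distribution both accurate and consistent with clinical constraints. Evaluation on a large public dataset and several external institutional cohorts shows improved prediction accuracy and faster plan generation.
Link: https://arxiv.org/abs/2508.02043
Authors: Hui Xie,Haiqin Hu,Lijuan Ding,Qing Li,Yue Sun,Tao Tan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: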
Abstract:Radiotherapy treatment planning often relies on time-consuming, trial-and-error adjustments that heavily depend on the expertise of specialists, while existing deep learning methods face limitations in generalization, prediction accuracy, and clinical applicability. To tackle these challenges, we propose ADDiff-Dose, an Anatomical-Dose Dual Constraints Conditional Diffusion Model for end-to-end multi-tumor dose prediction. The model employs LightweightVAE3D to compress high-dimensional CT data and integrates multimodal inputs, including target and organ-at-risk (OAR) masks and beam parameters, within a progressive noise addition and denoising framework. It incorporates conditional features via a multi-head attention mechanism and utilizes a composite loss function combining MSE, conditional terms, and KL divergence to ensure both dosimetric accuracy and compliance with clinical constraints. Evaluation on a large-scale public dataset (2,877 cases) and three external institutional cohorts (450 cases in total) demonstrates that ADDiff-Dose significantly outperforms traditional baselines, achieving an MAE of 0.101-0.154 (compared to 0.316 for UNet and 0.169 for GAN models), a DICE coefficient of 0.927 (a 6.8% improvement), and limiting spinal cord maximum dose error to within 0.1 Gy. The average plan generation time per case is reduced to 22 seconds. Ablation studies confirm that the structural encoder enhances compliance with clinical dose constraints by 28.5%. To our knowledge, this is the first study to introduce a conditional diffusion model framework for radiotherapy dose prediction, offering a generalizable and efficient solution for automated treatment planning across diverse tumor sites, with the potential to substantially reduce planning time and improve clinical workflow efficiency.
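The two headline metrics above, MAE and the DICE coefficient, have simple standard definitions, sketched below for flat sequences. This is a generic illustration and not the paper's evaluation code: the abstract does not state how dose maps are normalised for MAE or which binary volumes DICE is computed on, so the inputs here are assumed to be already prepared, and the function names are hypothetical.

```python
def mean_absolute_error(pred, target):
    """Average absolute difference between two equally sized sequences of values."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def dice_coefficient(mask_a, mask_b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary masks of equal length."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have equal length")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Convention chosen here: two empty masks are treated as a perfect match.
    return 2.0 * intersection / total if total else 1.0
```

DICE ranges from 0 (no overlap) to 1 (identical masks), so the reported 0.927 indicates a high degree of overlap under whatever thresholding the authors applied.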
[CV-83] Protego: User-Centric Pose-Invariant Privacy Protection Against Face Recognition-Induced Digital Footprint Exposure
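[Quick read]: This paper addresses the privacy leakage caused by face recognition (FR) in large-scale image retrieval systems: anyone can upload a facial photo and trace the person's digital footprint (such as social media activity, private photos, and news reports), often without consent. In response, the authors propose Protego, a user-centric privacy protection method. Its core idea is to encode a user's 3D facial signatures into a pose-invariant 2D representation, which is dynamically deformed into a natural-looking 3D mask matching the pose and expression of the target image and applied before the image is shared online. Protego also amplifies the sensitivity of FR models so that even different protected images of the same user cannot be matched to each other, which sharply reduces retrieval accuracy. Experiments show that Protego suppresses recognition across a range of black-box FR models and offers better visual coherence than existing methods, especially for video.
Link: https://arxiv.org/abs/2508.02034
Authors: Ziling Wang,Shuya Yang,Jialin Lu,Ka-Ho Chow
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: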
Abstract:Face recognition (FR) technologies are increasingly used to power large-scale image retrieval systems, raising serious privacy concerns. Services like Clearview AI and PimEyes allow anyone to upload a facial photo and retrieve a large amount of online content associated with that person. This not only enables identity inference but also exposes their digital footprint, such as social media activity, private photos, and news reports, often without their consent. In response to this emerging threat, we propose Protego, a user-centric privacy protection method that safeguards facial images from such retrieval-based privacy intrusions. Protego encapsulates a user’s 3D facial signatures into a pose-invariant 2D representation, which is dynamically deformed into a natural-looking 3D mask tailored to the pose and expression of any facial image of the user, and applied prior to online sharing. Motivated by a critical limitation of existing methods, Protego amplifies the sensitivity of FR models so that protected images cannot be matched even among themselves. Experiments show that Protego significantly reduces retrieval accuracy across a wide range of black-box FR models and performs at least 2x better than existing methods. It also offers unprecedented visual coherence, particularly in video settings where consistency and natural appearance are essential. Overall, Protego contributes to the fight against the misuse of FR for mass surveillance and unsolicited identity tracing.
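The retrieval threat described above typically works by comparing face embeddings. The sketch below shows a generic embedding-based lookup with cosine similarity, to illustrate what a protection method has to defeat. It does not reproduce Protego or any particular FR service: the embeddings are assumed to come from some hypothetical face encoder, and the function names and the 0.5 threshold are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_embedding, gallery, threshold=0.5):
    """Return ids of gallery entries that match the query, most similar first.

    gallery maps an image id to its embedding (hypothetical encoder output).
    """
    scored = [(cosine_similarity(query_embedding, emb), image_id)
              for image_id, emb in gallery.items()]
    return [image_id for score, image_id in sorted(scored, reverse=True)
            if score >= threshold]
```

Under this view, a protection succeeds when the embeddings of protected images fall below the threshold against the user's real photos and, as the abstract stresses, also against each other.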
[CV-84] Bench2ADVLM: A Closed-Loop Benchmark for Vision-language Models in Autonomous Driving
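[Quick read]: This paper addresses the fact that current evaluation of vision-language models (VLMs) in autonomous driving (AD) relies heavily on open-loop settings with static inputs and does not measure closed-loop interactive behavior, feedback resilience, or real-world safety. The core solution is Bench2ADVLM, a unified hierarchical closed-loop evaluation framework. A dual-system adaptation architecture has a general-purpose VLM interpret the high-level driving commands produced by the target ADVLM as standardized mid-level control actions, and a physical control abstraction layer translates these into low-level actuation signals, enabling closed-loop testing both in simulation and on physical vehicles. A self-reflective scenario generation module automatically explores model behavior and uncovers potential failure modes, strengthening the diagnosis of ADVLMs in complex, dynamic environments.
Link: https://arxiv.org/abs/2508.02028
Authors: Tianyuan Zhang,Ting Jin,Lu Wang,Jiangfan Liu,Siyuan Liang,Mingchuan Zhang,Aishan Liu,Xianglong Liu
Affiliations: Beihang University; Nanyang Technological University; Henan University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: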
Abstract:Vision-Language Models (VLMs) have recently emerged as a promising paradigm in autonomous driving (AD). However, current performance evaluation protocols for VLM-based AD systems (ADVLMs) are predominantly confined to open-loop settings with static inputs, neglecting the more realistic and informative closed-loop setting that captures interactive behavior, feedback resilience, and real-world safety. To address this, we introduce Bench2ADVLM, a unified hierarchical closed-loop evaluation framework for real-time, interactive assessment of ADVLMs across both simulation and physical platforms. Inspired by dual-process theories of cognition, we first adapt diverse ADVLMs to simulation environments via a dual-system adaptation architecture. In this design, heterogeneous high-level driving commands generated by target ADVLMs (fast system) are interpreted by a general-purpose VLM (slow system) into standardized mid-level control actions suitable for execution in simulation. To bridge the gap between simulation and reality, we design a physical control abstraction layer that translates these mid-level actions into low-level actuation signals, enabling, for the first time, closed-loop testing of ADVLMs on physical vehicles. To enable more comprehensive evaluation, Bench2ADVLM introduces a self-reflective scenario generation module that automatically explores model behavior and uncovers potential failure modes for safety-critical scenario generation. Overall, Bench2ADVLM establishes a hierarchical evaluation pipeline that seamlessly integrates high-level abstract reasoning, mid-level simulation actions, and low-level real-world execution. Experiments on diverse scenarios across multiple state-of-the-art ADVLMs and physical platforms validate the diagnostic strength of our framework, revealing that existing ADVLMs still exhibit limited performance under closed-loop conditions.
[CV-85] Devil is in the Detail: Towards Injecting Fine Details of Image Prompt in Image Generation via Conflict-free Guidance and Stratified Attention
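[Quick read]: This paper addresses the failure of current image-prompt-based text-to-image diffusion models to reproduce the fine details of the image prompt. Existing methods inject image prompt information by modifying the self-attention mechanism, but have two key problems: first, they mishandle the image prompt in classifier-free guidance, using it as both the desired and the undesired condition, which produces conflicting signals; second, existing self-attention modifications trade off realism against alignment with the prompt and cannot optimize both. The key to the solution is conflict-free guidance, which uses the image prompt only as the desired condition, together with Stratified Attention, which jointly uses keys and values from both the image prompt and the generated image rather than selecting between them, improving fidelity to the image prompt while keeping the image realistic.
Link: https://arxiv.org/abs/2508.02004
Authors: Kyungmin Jo,Jooyeol Yun,Jaegul Choo
Affiliations: Korea Advanced Institute of Science and Technology (KAIST)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: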
Abstract:While large-scale text-to-image diffusion models enable the generation of high-quality, diverse images from text prompts, these prompts struggle to capture intricate details, such as textures, preventing the user intent from being reflected. This limitation has led to efforts to generate images conditioned on user-provided images, referred to as image prompts. Recent work modifies the self-attention mechanism to impose image conditions in generated images by replacing or concatenating the keys and values from the image prompt. This enables the self-attention layer to work like a cross-attention layer, generally used to incorporate text prompts. In this paper, we identify two common issues in existing methods of modifying self-attention to generate images that reflect the details of image prompts. First, existing approaches neglect the importance of image prompts in classifier-free guidance. Specifically, current methods use image prompts as both desired and undesired conditions in classifier-free guidance, causing conflicting signals. To resolve this, we propose conflict-free guidance by using image prompts only as desired conditions, ensuring that the generated image faithfully reflects the image prompt. In addition, we observe that the two most common self-attention modifications involve a trade-off between the realism of the generated image and alignment with the image prompt. Specifically, selecting more keys and values from the image prompt improves alignment, while selecting more from the generated image enhances realism. To balance both, we propose a new self-attention modification method, Stratified Attention, to jointly use keys and values from both images rather than selecting between them. Through extensive experiments across three image generation tasks, we show that the proposed method outperforms existing image-prompting models in faithfully reflecting the image prompt.
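
The conflict the abstract describes can be illustrated with the standard classifier-free guidance combination, `undesired + scale * (desired - undesired)`. The sketch below is not the authors' code: `toy_denoiser` is a hypothetical stand-in for a diffusion model's noise prediction, and how the paper builds its undesired branch is an assumption here. It only shows that when the image prompt conditions both branches, its contribution cancels out of the guidance direction.

```python
def cfg_combine(eps_desired, eps_undesired, scale):
    """Standard classifier-free guidance: move away from the undesired
    prediction and toward the desired one, element by element."""
    return [u + scale * (d - u) for d, u in zip(eps_desired, eps_undesired)]


def toy_denoiser(text_cond, image_cond):
    # Hypothetical noise prediction; the second component depends only
    # on the image prompt so that its effect is easy to read off.
    return [0.1 + 0.5 * text_cond + 0.3 * image_cond,
            -0.2 + 0.4 * image_cond]


scale = 7.5
desired = toy_denoiser(text_cond=1, image_cond=1)

# Image prompt used in BOTH branches: its effect cancels in (d - u).
conflicting = cfg_combine(desired, toy_denoiser(0, 1), scale)

# Image prompt used ONLY in the desired branch (assumed reading of
# "conflict-free guidance"): its effect is amplified by the scale.
conflict_free = cfg_combine(desired, toy_denoiser(0, 0), scale)

print(conflicting[1], conflict_free[1])
```

In this toy setup the image-only component stays at its unguided value in the conflicting case and is scaled up in the conflict-free case.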
[CV-86] Fast and Memory-efficient Non-line-of-sight Imaging with Quasi-Fresnel Transform
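[Quick Read]: This paper addresses the high computational complexity and memory consumption of non-line-of-sight (NLOS) imaging caused by conventional methods modeling both the measurement data and the hidden scene in three dimensions, which limits real-time use and deployment on lightweight devices. The key to the solution is a new method that represents the hidden scene with two-dimensional functions and introduces a Quasi-Fresnel transform to establish a direct inversion formula between the measurement data and the hidden scene, exploiting the two-dimensional nature of the problem to substantially cut computational complexity and memory requirements and enable efficient, low-power NLOS imaging.
Link: https://arxiv.org/abs/2508.02003
Authors: Yijun Wei, Jianyu Wang, Leping Xiao, Zuoqiang Shi, Xing Fu, Lingyun Qiu
Affiliations: Tsinghua University (清华大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: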
Abstract:Non-line-of-sight (NLOS) imaging seeks to reconstruct hidden objects by analyzing reflections from intermediary surfaces. Existing methods typically model both the measurement data and the hidden scene in three dimensions, overlooking the inherently two-dimensional nature of most hidden objects. This oversight leads to high computational costs and substantial memory consumption, limiting practical applications and making real-time, high-resolution NLOS imaging on lightweight devices challenging. In this paper, we introduce a novel approach that represents the hidden scene using two-dimensional functions and employs a Quasi-Fresnel transform to establish a direct inversion formula between the measurement data and the hidden scene. This transformation leverages the two-dimensional characteristics of the problem to significantly reduce computational complexity and memory requirements. Our algorithm efficiently performs fast transformations between these two-dimensional aggregated data, enabling rapid reconstruction of hidden objects with minimal memory usage. Compared to existing methods, our approach reduces runtime and memory demands by several orders of magnitude while maintaining imaging quality. The substantial reduction in memory usage not only enhances computational efficiency but also enables NLOS imaging on lightweight devices such as mobile and embedded systems. We anticipate that this method will facilitate real-time, high-resolution NLOS imaging and broaden its applicability across a wider range of platforms.
[CV-87] Localizing Audio-Visual Deepfakes via Hierarchical Boundary Modeling
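[Quick Read]: This paper addresses audio-visual temporal deepfake localization under content-driven partial manipulation, where the forged regions typically span only a few frames and the rest of the content is identical to the original, making detection much harder. The key to the solution is the Hierarchical Boundary Modeling Network (HBMNet), whose main components are: (1) an Audio-Visual Feature Encoder that extracts discriminative frame-level representations, with frame-level supervision to strengthen discriminability; (2) a Coarse Proposal Generator and a Fine-grained Probabilities Generator, which refine candidate regions using bidirectional boundary-content probabilities; and (3) the fusion of multi-scale temporal cues with bidirectional boundary modeling, which improves localization precision and recall. Experiments show that the modules are complementary, that HBMNet outperforms BA-TFD and UMMAFormer, and that it has the potential to scale with more training data.
Link: https://arxiv.org/abs/2508.02000
Authors: Xuanjun Chen, Shih-Peng Cheng, Jiawei Du, Lin Zhang, Xiaoxiao Miao, Chung-Che Wang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
Affiliations: 1. National Taiwan University (国立台湾大学); 2. National Taiwan University of Science and Technology (国立台湾科技大学); 3. Huawei Technologies Co., Ltd. (华为技术有限公司); 4. Tsinghua University (清华大学); 5. Alibaba Group (阿里巴巴集团)
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Comments: Work in progress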
Abstract:Audio-visual temporal deepfake localization under the content-driven partial manipulation remains a highly challenging task. In this scenario, the deepfake regions are usually only spanning a few frames, with the majority of the rest remaining identical to the original. To tackle this, we propose a Hierarchical Boundary Modeling Network (HBMNet), which includes three modules: an Audio-Visual Feature Encoder that extracts discriminative frame-level representations, a Coarse Proposal Generator that predicts candidate boundary regions, and a Fine-grained Probabilities Generator that refines these proposals using bidirectional boundary-content probabilities. From the modality perspective, we enhance audio-visual learning through dedicated encoding and fusion, reinforced by frame-level supervision to boost discriminability. From the temporal perspective, HBMNet integrates multi-scale cues and bidirectional boundary-content relationships. Experiments show that encoding and fusion primarily improve precision, while frame-level supervision boosts recall. Each module (audio-visual fusion, temporal scales, bi-directionality) contributes complementary benefits, collectively enhancing localization performance. HBMNet outperforms BA-TFD and UMMAFormer and shows improved potential scalability with more training data.
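
To make the output of temporal localization concrete, here is a minimal sketch that turns per-frame fake probabilities into contiguous forged segments by simple thresholding. This is a deliberately naive baseline, not HBMNet's proposal or refinement procedure; the function name, the threshold, and the example probabilities are all assumptions for illustration.

```python
def frames_to_segments(fake_probs, threshold=0.5):
    """Group consecutive frames whose fake probability meets the
    threshold into (start_frame, end_frame) segments, inclusive."""
    segments = []
    start = None
    for i, p in enumerate(fake_probs):
        if p >= threshold and start is None:
            start = i                      # a forged run begins
        elif p < threshold and start is not None:
            segments.append((start, i - 1))  # the run ended at i - 1
            start = None
    if start is not None:                  # run extends to the last frame
        segments.append((start, len(fake_probs) - 1))
    return segments


# Made-up frame scores: most frames are genuine, two short forged runs.
probs = [0.1, 0.2, 0.9, 0.8, 0.1, 0.1, 0.7, 0.1]
segments = frames_to_segments(probs)
print(segments)
```

Short forged runs like these are exactly what makes partial manipulation hard: a single noisy frame score can split or erase a segment, which is one motivation for modeling boundaries explicitly.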
[CV-88] Deeply Dual Supervised learning for melanoma recognition
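[Quick Read]: This paper addresses the difficulty existing deep learning models have in capturing subtle visual features for melanoma recognition, which limits diagnostic accuracy. The key to the solution is a Deeply Dual Supervised Learning framework that uses a dual-pathway structure to extract fine-grained local features and global contextual information simultaneously, combines it with a dual attention mechanism that dynamically emphasizes critical features, and adds a multi-scale feature aggregation strategy for robustness across image resolutions, thereby improving melanoma detection accuracy and reducing the false-positive rate.
Link: https://arxiv.org/abs/2508.01994
Authors: Rujosh Polma, Krishnan Menon Iyer
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: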
Abstract:As the application of deep learning in dermatology continues to grow, the recognition of melanoma has garnered significant attention, demonstrating potential for improving diagnostic accuracy. Despite advancements in image classification techniques, existing models still face challenges in identifying subtle visual cues that differentiate melanoma from benign lesions. This paper presents a novel Deeply Dual Supervised Learning framework that integrates local and global feature extraction to enhance melanoma recognition. By employing a dual-pathway structure, the model focuses on both fine-grained local features and broader contextual information, ensuring a comprehensive understanding of the image content. The framework utilizes a dual attention mechanism that dynamically emphasizes critical features, thereby reducing the risk of overlooking subtle characteristics of melanoma. Additionally, we introduce a multi-scale feature aggregation strategy to ensure robust performance across varying image resolutions. Extensive experiments on benchmark datasets demonstrate that our framework significantly outperforms state-of-the-art methods in melanoma detection, achieving higher accuracy and better resilience against false positives. This work lays the foundation for future research in automated skin cancer recognition and highlights the effectiveness of dual supervised learning in medical image analysis.
[CV-89] IMoRe: Implicit Program-Guided Reasoning for Human Motion QA ICCV2025
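[Quick Read]: This paper addresses the limited scalability and adaptability of existing Human Motion Question Answering (HMQA) methods based on explicit program execution, which depend on manually designed functional modules. The core solution is an implicit program-guided motion reasoning framework (IMoRe) that conditions the reasoning process directly on structured program functions rather than inferring operations from question words alone, unifying reasoning across multiple query types. The key innovations are a program-guided reading mechanism that dynamically selects multi-level motion representations from a pretrained motion Vision Transformer (ViT), and an iterative memory refinement module that uses structured program functions to extract the information relevant to each query type, improving performance and generalization on Babel-QA and a newly constructed motion QA dataset based on HuMMan.
Link: https://arxiv.org/abs/2508.01984
Authors: Chen Li, Chinthani Sugandhika, Yeo Keat Ee, Eric Peh, Hao Zhang, Hong Yang, Deepu Rajan, Basura Fernando
Affiliations: Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore (新加坡科技研究局高性能计算研究所); Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore (新加坡科技研究局前沿人工智能研究中心); College of Computing and Data Science, Nanyang Technological University, Singapore (南洋理工大学计算机与数据科学学院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: *Equal contribution. Accepted by the International Conference on Computer Vision (ICCV 2025)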
Abstract:Existing human motion Q&A methods rely on explicit program execution, where the requirement for manually defined functional modules may limit the scalability and adaptability. To overcome this, we propose an implicit program-guided motion reasoning (IMoRe) framework that unifies reasoning across multiple query types without manually designed modules. Unlike existing implicit reasoning approaches that infer reasoning operations from question words, our model directly conditions on structured program functions, ensuring a more precise execution of reasoning steps. Additionally, we introduce a program-guided reading mechanism, which dynamically selects multi-level motion representations from a pretrained motion Vision Transformer (ViT), capturing both high-level semantics and fine-grained motion cues. The reasoning module iteratively refines memory representations, leveraging structured program functions to extract relevant information for different query types. Our model achieves state-of-the-art performance on Babel-QA and generalizes to a newly constructed motion Q&A dataset based on HuMMan, demonstrating its adaptability across different motion reasoning datasets. Code and dataset are available at: this https URL.
[CV-90] On-the-Fly Object-aware Representative Point Selection in Point Cloud
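[Quick Read]: This paper addresses the storage, bandwidth, and processing costs caused by the large volume of point cloud data generated by autonomous vehicles (AVs). The core challenge is to shrink the point cloud substantially while preserving critical object information and filtering out irrelevant background points. The key to the solution is a representative point selection framework with two steps: (1) Object Presence Detection, which uses an unsupervised density peak-based classifier and a supervised Naïve Bayes classifier to handle diverse scenarios; and (2) Sampling Budget Allocation, an optimization strategy that selects object-relevant points while maintaining a high retention rate of object information. The method is model-agnostic and integrates with a range of downstream 3D point cloud models, improving the efficiency and effectiveness of point cloud downsampling.
Link: https://arxiv.org/abs/2508.01980
Authors: Xiaoyu Zhang, Ziwei Wang, Hai Dong, Zhifeng Bao, Jiajun Liu
Affiliations: RMIT University (皇家墨尔本理工大学); CSIRO (澳大利亚联邦科学与工业研究组织)
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: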
Abstract:Point clouds are essential for object modeling and play a critical role in assisting driving tasks for autonomous vehicles (AVs). However, the significant volume of data generated by AVs creates challenges for storage, bandwidth, and processing cost. To tackle these challenges, we propose a representative point selection framework for point cloud downsampling, which preserves critical object-related information while effectively filtering out irrelevant background points. Our method involves two steps: (1) Object Presence Detection, where we introduce an unsupervised density peak-based classifier and a supervised Naïve Bayes classifier to handle diverse scenarios, and (2) Sampling Budget Allocation, where we propose a strategy that selects object-relevant points while maintaining a high retention rate of object information. Extensive experiments on the KITTI and nuScenes datasets demonstrate that our method consistently outperforms state-of-the-art baselines in both efficiency and effectiveness across varying sampling rates. As a model-agnostic solution, our approach integrates seamlessly with diverse downstream models, making it a valuable and scalable addition to the 3D point cloud downsampling toolkit for AV applications.
[CV-91] Self-Supervised YOLO: Leveraging Contrastive Learning for Label-Efficient Object Detection
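[Quick Read]: This paper addresses the heavy reliance of one-stage object detectors such as the YOLO family on large-scale labeled data, which limits their use where annotation is expensive or data is scarce. The key to the solution is contrastive self-supervised learning (SSL): the backbones of YOLOv5 and YOLOv8 are pretrained on unlabeled images with the SimCLR framework, using a contrastive loss, image augmentations, global pooling, and projection heads, so that the model learns more generalizable feature representations from unlabeled data. Experiments show improved detection performance in low-label settings (for example, a mAP@50:95 of 0.7663) and faster convergence, supporting unsupervised pretraining as an effective route to label-efficient object detection.
Link: https://arxiv.org/abs/2508.01966
Authors: Manikanta Kotthapalli, Reshma Bhatia, Nainsi Jain
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 9 figures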
Abstract:One-stage object detectors such as the YOLO family achieve state-of-the-art performance in real-time vision applications but remain heavily reliant on large-scale labeled datasets for training. In this work, we present a systematic study of contrastive self-supervised learning (SSL) as a means to reduce this dependency by pretraining YOLOv5 and YOLOv8 backbones on unlabeled images using the SimCLR framework. Our approach introduces a simple yet effective pipeline that adapts YOLO’s convolutional backbones as encoders, employs global pooling and projection heads, and optimizes a contrastive loss using augmentations of the COCO unlabeled dataset (120k images). The pretrained backbones are then fine-tuned on a cyclist detection task with limited labeled data. Experimental results show that SSL pretraining leads to consistently higher mAP, faster convergence, and improved precision-recall performance, especially in low-label regimes. For example, our SimCLR-pretrained YOLOv8 achieves a mAP@50:95 of 0.7663, outperforming its supervised counterpart despite using no annotations during pretraining. These findings establish a strong baseline for applying contrastive SSL to one-stage detectors and highlight the potential of unlabeled data as a scalable resource for label-efficient object detection.
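
The contrastive loss used by SimCLR is usually the NT-Xent (normalized temperature-scaled cross-entropy) loss. The sketch below is a plain-Python version on tiny hand-written embeddings, not the authors' training code: the embeddings, the temperature of 0.5, and the convention that consecutive rows are two augmented views of the same image are assumptions for illustration. In the paper's pipeline the embeddings would come from the YOLO backbone followed by global pooling and a projection head.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(z, temperature=0.5):
    """NT-Xent loss over 2N embeddings, where z[2k] and z[2k + 1]
    are two augmented views of image k (assumed layout)."""
    n = len(z)
    total = 0.0
    for i in range(n):
        j = i ^ 1  # index of the positive partner of row i
        pos = math.exp(cosine(z[i], z[j]) / temperature)
        # Denominator: every other embedding, positive included.
        denom = sum(math.exp(cosine(z[i], z[k]) / temperature)
                    for k in range(n) if k != i)
        total += -math.log(pos / denom)
    return total / n


# Views of the same image point in similar directions...
aligned = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
# ...versus a layout where "partners" are actually different images.
mismatched = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]
print(nt_xent(aligned), nt_xent(mismatched))
```

The loss is lower when the two views of each image are close to each other and far from the other images, which is the signal that lets the backbone learn without labels.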
[CV-92] From Photons to Physics: Autonomous Indoor Drones and the Future of Objective Property Assessment
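[Quick Read]: This paper addresses the reliance of traditional real-estate assessment on subjective visual judgment and its lack of objective quantitative data, proposing that combining autonomous indoor drones with physics-aware sensing can shift the practice from qualitative observation to quantitative measurement. The review is organized around four technical areas: (1) platform architectures optimized for indoor navigation, including heterogeneous computing, collision-tolerant design, and hierarchical control systems to cope with weight constraints; (2) advanced sensing modalities beyond human vision, such as hyperspectral imaging for material identification, polarimetric sensing for surface characterization, and computational imaging with metaphotonics for miniaturization; (3) intelligent autonomy based on active reconstruction algorithms, using 3D Gaussian Splatting to choose viewpoints strategically and maximize information gain within battery constraints; and (4) integration pathways with existing property workflows, including Building Information Modeling (BIM) systems and standards such as Uniform Appraisal Dataset (UAD) 3.6.
Link: https://arxiv.org/abs/2508.01965
Authors: Petteri Teikari, Mike Jarrell, Irene Bandera Moreno, Harri Pesola
Affiliations: Mill Hill Garage; JB Real Estate Valuation & Advisory, LLC
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 63 pages, 5 figures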
Abstract:The convergence of autonomous indoor drones with physics-aware sensing technologies promises to transform property assessment from subjective visual inspection to objective, quantitative measurement. This comprehensive review examines the technical foundations enabling this paradigm shift across four critical domains: (1) platform architectures optimized for indoor navigation, where weight constraints drive innovations in heterogeneous computing, collision-tolerant design, and hierarchical control systems; (2) advanced sensing modalities that extend perception beyond human vision, including hyperspectral imaging for material identification, polarimetric sensing for surface characterization, and computational imaging with metaphotonics enabling radical miniaturization; (3) intelligent autonomy through active reconstruction algorithms, where drones equipped with 3D Gaussian Splatting make strategic decisions about viewpoint selection to maximize information gain within battery constraints; and (4) integration pathways with existing property workflows, including Building Information Modeling (BIM) systems and industry standards like Uniform Appraisal Dataset (UAD) 3.6.
[CV-93] CVD-SfM: A Cross-View Deep Front-end Structure-from-Motion System for Sparse Localization in Multi-Altitude Scenes
[Quick Read]: This paper addresses robust and accurate multi-altitude camera pose estimation when only sparse image input is available. The key to the solution is a unified framework that integrates a cross-view transformer, deep features, and structure-from-motion, which handles diverse environmental conditions and viewpoint variations and improves the accuracy and robustness of pose estimation.
Link: https://arxiv.org/abs/2508.01936
Authors: Yaxuan Li, Yewei Huang, Bijay Gaudel, Hamidreza Jafarnejadsani, Brendan Englot
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:We present a novel multi-altitude camera pose estimation system, addressing the challenges of robust and accurate localization across varied altitudes when only considering sparse image input. The system effectively handles diverse environmental conditions and viewpoint variations by integrating the cross-view transformer, deep features, and structure-from-motion into a unified framework. To benchmark our method and foster further research, we introduce two newly collected datasets specifically tailored for multi-altitude camera pose estimation; datasets of this nature remain rare in the current literature. The proposed framework has been validated through extensive comparative analyses on these datasets, demonstrating that our system achieves superior performance in both accuracy and robustness for multi-altitude sparse pose estimation tasks compared to existing solutions, making it well suited for real-world robotic applications such as aerial navigation, search and rescue, and automated inspection.
[CV-94] Proactive Disentangled Modeling of Trigger-Object Pairings for Backdoor Defense
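[Quick Read]: This paper addresses the difficulty of detecting multi-trigger backdoor attacks on deep neural networks (DNNs) and generative AI (GenAI), in which attackers inject multiple triggers across different object classes to form unseen trigger-object configurations that evade standard detection pipelines. The key to the solution is DBOM (Disentangled Backdoor-Object Modeling). It uses the frozen, pre-trained encoders of Vision-Language Models (VLMs), with a learnable visual prompt repository and prompt prefix tuning, to decompose input image representations into independent trigger and object components, and introduces trigger-object separation and diversity losses to strengthen the disentanglement. This enables zero-shot generalization to both seen and unseen trigger-object pairings and improves the robustness of dataset-level backdoor defense.
Link: https://arxiv.org/abs/2508.01932
Authors: Kyle Stein, Andrew A. Mahyari, Guillermo Francia III, Eman El-Sheikh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: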
Abstract:Deep neural networks (DNNs) and generative AI (GenAI) are increasingly vulnerable to backdoor attacks, where adversaries embed triggers into inputs to cause models to misclassify or misinterpret target labels. Beyond traditional single-trigger scenarios, attackers may inject multiple triggers across various object classes, forming unseen backdoor-object configurations that evade standard detection pipelines. In this paper, we introduce DBOM (Disentangled Backdoor-Object Modeling), a proactive framework that leverages structured disentanglement to identify and neutralize both seen and unseen backdoor threats at the dataset level. Specifically, DBOM factorizes input image representations by modeling triggers and objects as independent primitives in the embedding space through the use of Vision-Language Models (VLMs). By leveraging the frozen, pre-trained encoders of VLMs, our approach decomposes the latent representations into distinct components through a learnable visual prompt repository and prompt prefix tuning, ensuring that the relationships between triggers and objects are explicitly captured. To separate trigger and object representations in the visual prompt repository, we introduce the trigger-object separation and diversity losses that aid in disentangling trigger and object visual features. Next, by aligning image features with feature decomposition and fusion, as well as learned contextual prompt tokens in a shared multimodal space, DBOM enables zero-shot generalization to novel trigger-object pairings that were unseen during training, thereby offering deeper insights into adversarial attack patterns. Experimental results on CIFAR-10 and GTSRB demonstrate that DBOM robustly detects poisoned images prior to downstream training, significantly enhancing the security of DNN training pipelines.
[CV-95] IAUNet: Instance-Aware U-Net CVPR
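[Quick Read]: This paper addresses instance segmentation of overlapping, variably sized objects such as cells in biomedical images, and in particular the largely unexplored potential of the U-Net architecture in query-based methods. The key to the solution is IAUNet, a query-based U-Net architecture. Its core design features a full U-Net architecture with a lightweight convolutional Pixel decoder that improves efficiency and reduces the parameter count, plus a Transformer decoder that refines object-specific features across multiple scales. The authors also release the 2025 Revvity Full Cell Segmentation Dataset, which provides detailed annotations for this task.
Link: https://arxiv.org/abs/2508.01928
Authors: Yaroslav Prytula, Illia Tsiporenko, Ali Zeynalli, Dmytro Fishman
Affiliations: University of Tartu (塔尔图大学); Ukrainian Catholic University (乌克兰天主教大学); STACC OÜ
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published in CVPR Workshops (CVMI), 2025. Project page/code/models/dataset: \href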
Abstract:Instance segmentation is critical in biomedical imaging to accurately distinguish individual objects like cells, which often overlap and vary in size. Recent query-based methods, where object queries guide segmentation, have shown strong performance. While U-Net has been a go-to architecture in medical image segmentation, its potential in query-based approaches remains largely unexplored. In this work, we present IAUNet, a novel query-based U-Net architecture. The core design features a full U-Net architecture, enhanced by a novel lightweight convolutional Pixel decoder, making the model more efficient and reducing the number of parameters. Additionally, we propose a Transformer decoder that refines object-specific features across multiple scales. Finally, we introduce the 2025 Revvity Full Cell Segmentation Dataset, a unique resource with detailed annotations of overlapping cell cytoplasm in brightfield images, setting a new benchmark for biomedical instance segmentation. Experiments on multiple public datasets and our own show that IAUNet outperforms most state-of-the-art fully convolutional, transformer-based, and query-based models and cell segmentation-specific models, setting a strong baseline for cell instance segmentation tasks. Code is available at this https URL
[CV-96] InspectVLM: Unified in Theory, Unreliable in Practice ICCV
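[Quick Read]: This paper addresses the complexity, inefficiency, and maintenance overhead of managing separate task-specific models in industrial visual inspection, and examines whether unified vision-language models (VLMs) can fold classification, detection, and keypoint localization into a single language-driven interface. The key is building and evaluating InspectVLM, a Florence-2-based VLM trained on InspectMM, a new large-scale multimodal, multitask industrial inspection dataset, to test the viability and performance limits of this unified paradigm in real industrial settings.
Link: https://arxiv.org/abs/2508.01921
Authors: Conor Wallace, Isaac Corley, Jonathan Lwowski
Affiliations: Zeitview
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to 2025 ICCV VISION Workshop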
Abstract:Unified vision-language models (VLMs) promise to streamline computer vision pipelines by reframing multiple visual tasks such as classification, detection, and keypoint localization within a single language-driven interface. This architecture is particularly appealing in industrial inspection, where managing disjoint task-specific models introduces complexity, inefficiency, and maintenance overhead. In this paper, we critically evaluate the viability of this unified paradigm using InspectVLM, a Florence-2-based VLM trained on InspectMM, our new large-scale multimodal, multitask inspection dataset. While InspectVLM performs competitively on image-level classification and structured keypoint tasks, we find that it fails to match traditional ResNet-based models in core inspection metrics. Notably, the model exhibits brittle behavior under low prompt variability, produces degenerate outputs for fine-grained object detection, and frequently defaults to memorized language responses regardless of visual input. Our findings suggest that while language-driven unification offers conceptual elegance, current VLMs lack the visual grounding and robustness necessary for deployment in precision critical industrial inspections.
[CV-97] EgoTrigger: Toward Audio-Driven Image Capture for Human Memory Enhancement in All-Day Energy-Efficient Smart Glasses
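[Quick Read]: This paper addresses the excessive energy consumption of continuous multimodal sensing (such as vision and audio) in all-day smart glasses, which limits their usefulness for applications such as human memory enhancement. The key to the solution is EgoTrigger, a context-aware triggering mechanism that uses a lightweight audio model (YAMNet) with a custom classification head to recognize audio cues of hand-object interaction (HOI), such as a drawer opening or a medication bottle being opened, and selectively activates the power-intensive camera. Experiments show that EgoTrigger uses 54% fewer frames on average, reducing energy use in sensing and downstream processing while achieving comparable performance on the memory enhancement task.
Link: https://arxiv.org/abs/2508.01915
Authors: Akshay Paruchuri, Sinan Hersek, Lavisha Aggarwal, Qiao Yang, Xin Liu, Achin Kulshrestha, Andrea Colaco, Henry Fuchs, Ishan Chatterjee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD)
Comments: 15 pages, 6 figures, 6 tables. Accepted to ISMAR 2025 as a TVCG journal paper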
Abstract:All-day smart glasses are likely to emerge as platforms capable of continuous contextual sensing, uniquely positioning them for unprecedented assistance in our daily lives. Integrating the multi-modal AI agents required for human memory enhancement while performing continuous sensing, however, presents a major energy efficiency challenge for all-day usage. Achieving this balance requires intelligent, context-aware sensor management. Our approach, EgoTrigger, leverages audio cues from the microphone to selectively activate power-intensive cameras, enabling efficient sensing while preserving substantial utility for human memory enhancement. EgoTrigger uses a lightweight audio model (YAMNet) and a custom classification head to trigger image capture from hand-object interaction (HOI) audio cues, such as the sound of a drawer opening or a medication bottle being opened. In addition to evaluating on the QA-Ego4D dataset, we introduce and evaluate on the Human Memory Enhancement Question-Answer (HME-QA) dataset. Our dataset contains 340 human-annotated first-person QA pairs from full-length Ego4D videos that were curated to ensure that they contained audio, focusing on HOI moments critical for contextual understanding and memory. Our results show EgoTrigger can use 54% fewer frames on average, significantly saving energy in both power-hungry sensing components (e.g., cameras) and downstream operations (e.g., wireless transmission), while achieving comparable performance on datasets for an episodic memory task. We believe this context-aware triggering strategy represents a promising direction for enabling energy-efficient, functional smart glasses capable of all-day use – supporting applications like helping users recall where they placed their keys or information about their routine activities (e.g., taking medications).
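
The triggering idea can be sketched as a simple gate: a cheap audio score decides, window by window, whether the expensive camera fires. The code below is only an illustration under stated assumptions. The per-window scores are made up, the 0.6 threshold is hypothetical, and in the actual system the scores would come from YAMNet plus the custom classification head rather than a hard-coded list.

```python
def gated_capture(hoi_scores, threshold=0.6):
    """Return the indices of time windows whose hand-object-interaction
    audio score is high enough to trigger an image capture."""
    return [i for i, score in enumerate(hoi_scores) if score >= threshold]


# Hypothetical audio scores for ten consecutive time windows.
scores = [0.05, 0.10, 0.85, 0.90, 0.20, 0.15, 0.70, 0.05, 0.10, 0.65]

captured = gated_capture(scores)
# Fraction of frames avoided compared with capturing every window.
saved = 1 - len(captured) / len(scores)
print(captured, saved)
```

The reported 54% average reduction in frames is the real-data counterpart of the `saved` ratio computed here; the threshold controls the trade-off between energy saved and moments missed.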
[CV-98] Medical Image De-Identification Resources: Synthetic DICOM Data and Tools for Validation
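[Quick Read]: This paper addresses the tension between patient privacy and scientific utility in medical image sharing, specifically how to remove protected health information (PHI) and personally identifiable information (PII) from DICOM data to meet compliance requirements while keeping the data usable. The key to the solution is a public DICOM dataset containing synthetic PHI/PII (the MIDI dataset) and a standardized, automated evaluation framework aligned with the HIPAA "Safe Harbor" method, the DICOM PS3.15 Confidentiality Profiles, and TCIA best practices, which supports objective, reproducible, quantitative evaluation of de-identification workflows and promotes safer, more consistent medical image sharing.
Link: https://arxiv.org/abs/2508.01889
Authors: Michael W. Rutherford, Tracy Nolan, Linmin Pei, Ulrike Wagner, Qinyan Pan, Phillip Farmer, Kirk Smith, Benjamin Kopchick, Laura Opsahl-Ong, Granger Sutton, David Clunie, Keyvan Farahani, Fred Prior
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: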
Abstract:Medical imaging research increasingly depends on large-scale data sharing to promote reproducibility and train Artificial Intelligence (AI) models. Ensuring patient privacy remains a significant challenge for open-access data sharing. Digital Imaging and Communications in Medicine (DICOM), the global standard data format for medical imaging, encodes both essential clinical metadata and extensive protected health information (PHI) and personally identifiable information (PII). Effective de-identification must remove identifiers, preserve scientific utility, and maintain DICOM validity. Tools exist to perform de-identification, but few assess its effectiveness, and most rely on subjective reviews, limiting reproducibility and regulatory confidence. To address this gap, we developed an openly accessible DICOM dataset infused with synthetic PHI/PII and an evaluation framework for benchmarking image de-identification workflows. The Medical Image de-identification (MIDI) dataset was built using publicly available de-identified data from The Cancer Imaging Archive (TCIA). It includes 538 subjects (216 for validation, 322 for testing), 605 studies, 708 series, and 53,581 DICOM image instances. These span multiple vendors, imaging modalities, and cancer types. Synthetic PHI and PII were embedded into structured data elements, plain text data elements, and pixel data to simulate real-world identity leaks encountered by TCIA curation teams. Accompanying evaluation tools include a Python script, answer keys (known truth), and mapping files that enable automated comparison of curated data against expected transformations. The framework is aligned with the HIPAA Privacy Rule “Safe Harbor” method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices. It supports objective, standards-driven evaluation of de-identification workflows, promoting safer and more consistent medical image sharing.
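
The automated comparison against an answer key might look roughly like the sketch below. It does not reproduce the MIDI evaluation script or its file formats, which are not described here: the dictionary layout, the convention that `None` means "attribute must be removed", and the function name `score_deid` are all assumptions. Only the attribute names (`PatientName`, `PatientBirthDate`, `Modality`) are standard DICOM keywords.

```python
def score_deid(deidentified, answer_key):
    """Compare de-identified attributes against expected values.

    answer_key maps an attribute name to its expected value after
    de-identification; None means the attribute must be absent
    (hypothetical convention). Returns a list of
    (attribute, expected, actual) tuples for every mismatch.
    """
    failures = []
    for tag, expected in answer_key.items():
        actual = deidentified.get(tag)  # None if the attribute is gone
        if actual != expected:
            failures.append((tag, expected, actual))
    return failures


answer_key = {
    "PatientName": "ANON-001",   # must be replaced with a pseudonym
    "PatientBirthDate": None,    # must be removed
    "Modality": "CT",            # must be kept for scientific utility
}
tool_output = {
    "PatientName": "ANON-001",
    "PatientBirthDate": "19700101",  # leaked: the tool failed to remove it
    "Modality": "CT",
}

failures = score_deid(tool_output, answer_key)
print(failures)
```

Because the injected PHI/PII is synthetic and its location is known in advance, a check of this kind can be fully automatic and repeatable, which is the gap in subjective review that the abstract points to.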
[CV-99] StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding
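[Quick Read]: This paper addresses the response latency and passive decision making in real-time streaming video understanding that result from a lack of task-driven planning and future anticipation. Existing methods rely on alternating perception-reaction or asynchronous triggers and struggle to respond proactively in dynamically evolving video streams. The key to the solution is the StreamAgent framework, which prompts an anticipatory agent to predict the temporal intervals and spatial regions likely to contain future task-relevant information, anticipating the temporal progression of key events, aligning current observations with expected future evidence, and adjusting perception actions accordingly (for example, attending to task-relevant regions or continuing to track). In addition, a hierarchical streaming KV-cache memory mechanism enables efficient semantic retrieval while reducing the storage overhead of a traditional KV-cache, improving inference efficiency and accuracy.
Link: https://arxiv.org/abs/2508.01875
Authors: Haolin Yang, Feilong Tang, Linxiao Zhao, Xiang An, Ming Hu, Huifa Li, Xinlin Zhuang, Boqian Wang, Yifan Lu, Xiaofeng Zhang, Abdalla Swikir, Junjun He, Zongyuan Ge, Imran Razzak
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: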
Abstract:Real-time streaming video understanding in domains such as autonomous driving and intelligent surveillance poses challenges beyond conventional offline video processing, requiring continuous perception, proactive decision making, and responsive interaction based on dynamically evolving visual content. However, existing methods rely on alternating perception-reaction or asynchronous triggers, lacking task-driven planning and future anticipation, which limits their real-time responsiveness and proactive decision making in evolving video streams. To this end, we propose a StreamAgent that anticipates the temporal intervals and spatial regions expected to contain future task-relevant information to enable proactive and goal-driven responses. Specifically, we integrate question semantics and historical observations through prompting the anticipatory agent to anticipate the temporal progression of key events, align current observations with the expected future evidence, and subsequently adjust the perception action (e.g., attending to task-relevant regions or continuously tracking in subsequent frames). To enable efficient inference, we design a streaming KV-cache memory mechanism that constructs a hierarchical memory structure for selective recall of relevant tokens, enabling efficient semantic retrieval while reducing the overhead of storing all tokens in the traditional KV-cache. Extensive experiments on streaming and long video understanding tasks demonstrate that our method outperforms existing methods in response accuracy and real-time efficiency, highlighting its practical value for real-world streaming scenarios.
[CV-100] DiffusionFF: Face Forgery Detection via Diffusion-based Artifact Localization
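Quick read: This paper addresses the difficulty of precisely localizing forgery artifacts in deepfake images, with the aim of improving the explainability of detection models and user trust. The key to the solution is the DiffusionFF framework, which uses a denoising diffusion model to generate high-quality Structural Dissimilarity (DSSIM) maps that capture subtle traces of manipulation. These DSSIM maps are then fused with high-level semantic features extracted by a pretrained forgery detector, which improves detection accuracy and yields fine-grained localization of forged regions.
Link: https://arxiv.org/abs/2508.01873
Authors: Siran Peng,Haoyuan Zhang,Li Gao,Tianshuo Zhang,Bao Li,Zhen Lei
Affiliations: 1. Institute of Automation, Chinese Academy of Sciences; 2. University of Chinese Academy of Sciences; 3. Huawei Technologies Co., Ltd.; 4. School of Information Science and Technology, Sun Yat-sen University; 5. Guangdong Provincial Key Laboratory of Brain-Inspired Intelligent Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: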
Abstract:The rapid evolution of deepfake generation techniques demands robust and accurate face forgery detection algorithms. While determining whether an image has been manipulated remains essential, the ability to precisely localize forgery artifacts has become increasingly important for improving model explainability and fostering user trust. To address this challenge, we propose DiffusionFF, a novel framework that enhances face forgery detection through diffusion-based artifact localization. Our method utilizes a denoising diffusion model to generate high-quality Structural Dissimilarity (DSSIM) maps, which effectively capture subtle traces of manipulation. These DSSIM maps are then fused with high-level semantic features extracted by a pretrained forgery detector, leading to significant improvements in detection accuracy. Extensive experiments on both cross-dataset and intra-dataset benchmarks demonstrate that DiffusionFF not only achieves superior detection performance but also offers precise and fine-grained artifact localization, highlighting its overall effectiveness.
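To make the DSSIM quantity concrete, here is a minimal sketch in plain Python. It assumes the common definition DSSIM = (1 - SSIM) / 2 and computes a single SSIM value over one small grayscale patch. The abstract does not say how DiffusionFF windows or normalizes its maps, so the function names, the patch layout, and the constants are illustrative assumptions, not the authors' implementation.

```python
def ssim_patch(a, b, dynamic_range=255.0):
    """SSIM between two equally sized grayscale patches given as flat lists."""
    if len(a) != len(b) or not a:
        raise ValueError("patches must be non-empty and the same size")
    n = len(a)
    c1 = (0.01 * dynamic_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilizer for the contrast/structure term
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def dssim_patch(a, b, dynamic_range=255.0):
    """Structural dissimilarity under the assumed definition (1 - SSIM) / 2."""
    return (1.0 - ssim_patch(a, b, dynamic_range)) / 2.0


# Hypothetical 2x2 patches, flattened row by row.
real_patch = [10.0, 20.0, 30.0, 40.0]
fake_patch = [10.0, 20.0, 30.0, 200.0]
same_score = dssim_patch(real_patch, real_patch)
diff_score = dssim_patch(real_patch, fake_patch)
```

Sliding this computation over local windows of an image pair would give a per-pixel dissimilarity map; identical patches score close to zero and the score grows as structure diverges.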
[CV-101] Implicit Search Intent Recognition using EEG and Eye Tracking: Novel Dataset and Cross-User Prediction
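Quick read: This paper addresses the problem that machines struggle to tell whether a user in a challenging visual search task is simply glancing at a scene (navigational intent) or searching for a target object (informational intent); recognizing this implicitly would allow more natural, non-intrusive interaction. Earlier work combined electroencephalography (EEG) and eye tracking to recognize such intents implicitly, but it has two limitations. First, the experiments used fixed search times, whereas real visual search ends when the target is found. Second, the models required labeled training data from the target user, which limits scalability in practice. The key contributions are the first publicly available EEG and eye-tracking dataset in which users determine their own search times, and the first cross-user prediction method, which reaches 84.5% accuracy in leave-one-user-out evaluation without training data from the target user, close to the within-user accuracy of 85.5%.
Link: https://arxiv.org/abs/2508.01860
Authors: Mansi Sharma,Shuang Chen,Philipp Müller,Maurice Rekrut,Antonio Krüger
Affiliations: DFKI (German Research Center for Artificial Intelligence); Saarland Informatics Campus, Saarbrücken, Germany
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: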
Abstract:For machines to effectively assist humans in challenging visual search tasks, they must differentiate whether a human is simply glancing into a scene (navigational intent) or searching for a target object (informational intent). Previous research proposed combining electroencephalography (EEG) and eye-tracking measurements to recognize such search intents implicitly, i.e., without explicit user input. However, the applicability of these approaches to real-world scenarios suffers from two key limitations. First, previous work used fixed search times in the informational intent condition, in stark contrast to visual search, which naturally terminates when the target is found. Second, methods incorporating EEG measurements addressed prediction scenarios that require ground truth training data from the target user, which is impractical in many use cases. We address these limitations by making the first publicly available EEG and eye-tracking dataset for navigational vs. informational intent recognition, where the user determines search times. We present the first method for cross-user prediction of search intents from EEG and eye-tracking recordings and reach 84.5% accuracy in leave-one-user-out evaluations, comparable to within-user prediction accuracy (85.5%) but offering much greater flexibility.
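The leave-one-user-out protocol mentioned in the abstract can be sketched as follows. This is only an illustration of the evaluation split, under the assumption that every trial is tagged with a user ID; the function name and the toy trial list are hypothetical and not taken from the paper.

```python
def leave_one_user_out(user_ids):
    """Yield (held_out_user, train_indices, test_indices), one fold per user."""
    for held_out in sorted(set(user_ids)):
        train = [i for i, u in enumerate(user_ids) if u != held_out]
        test = [i for i, u in enumerate(user_ids) if u == held_out]
        yield held_out, train, test


# Hypothetical assignment of six recorded trials to three users.
trial_users = ["u1", "u1", "u2", "u3", "u3", "u3"]
folds = list(leave_one_user_out(trial_users))
```

Each fold trains on all users except one and tests on the held-out user, so the reported accuracy reflects performance on people the model has never seen.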
[CV-102] Distinguishing Target and Non-Target Fixations with EEG and Eye Tracking in Realistic Visual Scenes
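Quick read: This paper addresses the problem of distinguishing target fixations from non-target fixations in realistic scenes, in order to better understand users' intended actions and build more effective assistance systems. Prior studies relied on explicitly instructed search trajectories and abstract visual stimuli and disregarded scene context, even though human visual search is largely driven by scene characteristics, which raises questions about generalizability to realistic scenarios. This paper studies the problem for the first time during free visual search, in a user study with 36 participants and 140 realistic scenes covering two application scenarios: searching for icons on desktop backgrounds and finding tools in a cluttered workshop. The key to the solution is a classifier based on gaze and EEG features. It outperforms the previous state-of-the-art approach based on fixation duration and saccade-related potentials, reaching 83.6% accuracy in cross-user evaluation compared with 56.9% for the earlier method.
Link: https://arxiv.org/abs/2508.01853
Authors: Mansi Sharma,Camilo Andrés Martínez Martínez,Benedikt Emanuel Wirth,Antonio Krüger,Philipp Müller
Affiliations: Saarland University; DFKI (German Research Center for Artificial Intelligence); Saarland Informatics Campus
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: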
Abstract:Distinguishing target from non-target fixations during visual search is a fundamental building block to understand users’ intended actions and to build effective assistance systems. While prior research indicated the feasibility of classifying target vs. non-target fixations based on eye tracking and electroencephalography (EEG) data, these studies were conducted with explicitly instructed search trajectories, abstract visual stimuli, and disregarded any scene context. This is in stark contrast with the fact that human visual search is largely driven by scene characteristics and raises questions regarding generalizability to more realistic scenarios. To close this gap, we, for the first time, investigate the classification of target vs. non-target fixations during free visual search in realistic scenes. In particular, we conducted a 36-participant user study using a large variety of 140 realistic visual search scenes in two highly relevant application scenarios: searching for icons on desktop backgrounds and finding tools in a cluttered workshop. Our approach based on gaze and EEG features outperforms the previous state-of-the-art approach based on a combination of fixation duration and saccade-related potentials. We perform extensive evaluations to assess the generalizability of our approach across scene types. Our approach significantly advances the ability to distinguish between target and non-target fixations in realistic scenarios, achieving 83.6% accuracy in cross-user evaluations. This substantially outperforms previous methods based on saccade-related potentials, which reached only 56.9% accuracy.
[CV-103] Context Guided Transformer Entropy Modeling for Video Compression ICCV2025
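Quick read: This paper addresses two problems of existing conditional entropy models for video compression: incorporating temporal context usually increases model complexity and computational cost, and most spatial context models do not explicitly model the ordering of spatial dependencies, which may limit the relevant context available during decoding. The key to the solution is the Context Guided Transformer (CGT) entropy model, which has two parts. First, a temporal context resampler learns predefined latent queries and uses transformer encoders to extract critical temporal information, which reduces downstream computation. Second, a teacher-student network acts as a dependency-weighted spatial context assigner: the teacher generates an attention map and an entropy map that guide the student to select the weighted top-k tokens with the highest spatial dependency. At inference time only the student is used, and it predicts undecoded tokens from the high-dependency context. The abstract reports roughly 65% less entropy modeling time and an 11% BD-Rate reduction compared with the previous state-of-the-art conditional entropy model.
Link: https://arxiv.org/abs/2508.01852
Authors: Junlong Tong,Wei Zhang,Yaohui Jin,Xiaoyu Shen
Affiliations: Shanghai Jiao Tong University; Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: ICCV 2025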
Abstract:Conditional entropy models effectively leverage spatio-temporal contexts to reduce video redundancy. However, incorporating temporal context often introduces additional model complexity and increases computational cost. In parallel, many existing spatial context models lack explicit modeling of the ordering of spatial dependencies, which may limit the availability of relevant context during decoding. To address these issues, we propose the Context Guided Transformer (CGT) entropy model, which estimates probability mass functions of the current frame conditioned on resampled temporal context and dependency-weighted spatial context. A temporal context resampler learns predefined latent queries to extract critical temporal information using transformer encoders, reducing downstream computational overhead. Meanwhile, a teacher-student network is designed as a dependency-weighted spatial context assigner to explicitly model the dependency of spatial context order. The teacher generates an attention map to represent token importance and an entropy map to reflect prediction certainty from randomly masked inputs, guiding the student to select the weighted top-k tokens with the highest spatial dependency. During inference, only the student is used to predict undecoded tokens based on high-dependency context. Experimental results demonstrate that our CGT model reduces entropy modeling time by approximately 65% and achieves an 11% BD-Rate reduction compared to the previous state-of-the-art conditional entropy model.
[CV-104] Beyond Vulnerabilities: A Survey of Adversarial Attacks as Both Threats and Defenses in Computer Vision Systems
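Quick read: This survey addresses the complexity and diversity of adversarial attacks on computer vision systems and highlights their dual nature as both security threats and defensive tools. Its core is a systematic taxonomy covering three domains: pixel-space attacks, physically realizable attacks, and latent-space attacks. It traces the technical evolution from early gradient-based methods such as FGSM and PGD to optimization techniques that incorporate momentum, adaptive step sizes, and transferability mechanisms. It also examines how adversarial patches, 3D textures, and dynamic optical perturbations turn digital vulnerabilities into real-world threats, and how latent-space attacks exploit semantic structure in internal representations to produce more transferable and meaningful adversarial examples. Finally, it discusses constructive uses of adversarial techniques, namely vulnerability assessment of biometric authentication systems and protection against malicious generative models, and identifies open research directions for more robust and trustworthy computer vision systems.
Link: https://arxiv.org/abs/2508.01845
Authors: Zhongliang Guo,Yifei Qian,Yanli Li,Weiye Li,Chun Tong Lei,Shuai Zhao,Lei Fang,Ognjen Arandjelović,Chun Pong Lau
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 33 pages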
Abstract:Adversarial attacks against computer vision systems have emerged as a critical research area that challenges the fundamental assumptions about neural network robustness and security. This comprehensive survey examines the evolving landscape of adversarial techniques, revealing their dual nature as both sophisticated security threats and valuable defensive tools. We provide a systematic analysis of adversarial attack methodologies across three primary domains: pixel-space attacks, physically realizable attacks, and latent-space attacks. Our investigation traces the technical evolution from early gradient-based methods such as FGSM and PGD to sophisticated optimization techniques incorporating momentum, adaptive step sizes, and advanced transferability mechanisms. We examine how physically realizable attacks have successfully bridged the gap between digital vulnerabilities and real-world threats through adversarial patches, 3D textures, and dynamic optical perturbations. Additionally, we explore the emergence of latent-space attacks that leverage semantic structure in internal representations to create more transferable and meaningful adversarial examples. Beyond traditional offensive applications, we investigate the constructive use of adversarial techniques for vulnerability assessment in biometric authentication systems and protection against malicious generative models. Our analysis reveals critical research gaps, particularly in neural style transfer protection and computational efficiency requirements. This survey contributes a comprehensive taxonomy, evolution analysis, and identification of future research directions, aiming to advance understanding of adversarial vulnerabilities and inform the development of more robust and trustworthy computer vision systems.
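As a reminder of what the earliest gradient-based method in this lineage does, the sketch below applies the FGSM update x_adv = x + eps * sign(grad_x loss) to a toy logistic-regression classifier, where the input gradient has the closed form (p - y) * w. The weights, the input, and eps are hypothetical values chosen for illustration; the survey itself covers image models, not this toy setup.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def bce_loss(w, b, x, y):
    """Binary cross-entropy of the model on one labeled example."""
    p = predict(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One FGSM step: move each input coordinate by eps along the loss gradient's sign."""
    p = predict(w, b, x)
    grad_x = [(p - y) * wi for wi in w]  # d(loss)/dx for logistic regression
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]


# Hypothetical toy model and input.
w, b = [1.5, -2.0, 0.5], 0.1
x, y = [0.2, -0.4, 0.9], 1
x_adv = fgsm(w, b, x, y, eps=0.1)
clean_loss = bce_loss(w, b, x, y)
adv_loss = bce_loss(w, b, x_adv, y)
```

The perturbation stays inside an L-infinity ball of radius eps while increasing the loss, which is the property that PGD later iterates with projection.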
[CV-105] OmniEvent: Unified Event Representation Learning
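Quick read: This paper addresses the problem that networks for event camera data depend heavily on task-specific designs, because the data distribution is unstructured and spatial-temporal (S-T) inhomogeneous, which makes it hard to reuse existing vision architectures. The solution is OmniEvent, described as the first unified event representation learning framework, built on a decouple-enhance-fuse paradigm. Local feature aggregation and enhancement are first performed independently in the spatial and temporal domains to avoid S-T inhomogeneity issues. Space-filling curves are then applied to enable large receptive fields while improving memory and compute efficiency. Finally, the features from the two domains are fused by attention to learn S-T interactions. The output is a grid-shaped tensor, so standard vision models can process event data without architectural changes.
Link: https://arxiv.org/abs/2508.01842
Authors: Weiqi Yan,Chenlu Lin,Youbiao Wang,Zhipeng Cai,Xiuhong Lin,Yangyang Shi,Weiquan Liu,Yu Zang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: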
Abstract:Event cameras have gained increasing popularity in computer vision due to their ultra-high dynamic range and temporal resolution. However, event networks heavily rely on task-specific designs due to the unstructured data distribution and spatial-temporal (S-T) inhomogeneity, making it hard to reuse existing architectures for new tasks. We propose OmniEvent, the first unified event representation learning framework that achieves SOTA performance across diverse tasks, fully removing the need for task-specific designs. Unlike previous methods that treat event data as 3D point clouds with manually tuned S-T scaling weights, OmniEvent proposes a decouple-enhance-fuse paradigm, where the local feature aggregation and enhancement is done independently on the spatial and temporal domains to avoid inhomogeneity issues. Space-filling curves are applied to enable large receptive fields while improving memory and compute efficiency. The features from individual domains are then fused by attention to learn S-T interactions. The output of OmniEvent is a grid-shaped tensor, which enables standard vision models to process event data without architecture change. With a unified framework and similar hyper-parameters, OmniEvent outperforms (task-specific) SOTA by up to 68.2% across 3 representative tasks and 10 datasets (Fig. 1). Code will be ready in this https URL.
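The role of a space-filling curve can be illustrated with a short sketch. The abstract does not name the curve OmniEvent uses, so the Z-order (Morton) curve below is an assumption chosen because it is simple to write down; the event tuples are hypothetical. The idea is that sorting events by curve index turns a 2D neighborhood into a mostly contiguous 1D range, which is what makes large receptive fields cheap to gather.

```python
def morton2d(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order index."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x occupies the even bit positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y occupies the odd bit positions
    return code


# Hypothetical events as (x, y, timestamp) tuples.
events = [(3, 5, 0.01), (0, 0, 0.02), (1, 0, 0.03), (0, 1, 0.04)]
ordered = sorted(events, key=lambda e: morton2d(e[0], e[1]))
```

After sorting, spatially close pixels tend to sit next to each other in the list, so a fixed-size window over the sequence covers a compact spatial region.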
[CV-106] A Simple Algebraic Solution for Estimating the Pose of a Camera from Planar Point Features IROS2025
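Quick read: This paper addresses the problem of estimating the pose of a camera relative to a planar target from reference points with known coordinates in the target frame and their corresponding bearing measurements in the camera frame. The key to the solution is a hierarchical algebraic method: the unit vector normal to the target plane is determined first, followed by the camera's position vector, its distance to the target plane, and finally the full orientation. To improve robustness to measurement noise, an averaging scheme refines the estimate of the target's normal direction. Accuracy and robustness are validated experimentally.
Link: https://arxiv.org/abs/2508.01836
Authors: Tarek Bouazza,Tarek Hamel,Claude Samson
Affiliations: I3S, CNRS, Université Côte d’Azur; Institut Universitaire de France; INRIA Sophia Antipolis
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 10 pages, 6 figures. To appear in IEEE/RSJ IROS 2025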
Abstract:This paper presents a simple algebraic method to estimate the pose of a camera relative to a planar target from $n \geq 4$ reference points with known coordinates in the target frame and their corresponding bearing measurements in the camera frame. The proposed approach follows a hierarchical structure; first, the unit vector normal to the target plane is determined, followed by the camera’s position vector, its distance to the target plane, and finally, the full orientation. To improve the method’s robustness to measurement noise, an averaging methodology is introduced to refine the estimation of the target’s normal direction. The accuracy and robustness of the approach are validated through extensive experiments.
[CV-107] Diffusion-based 3D Hand Motion Recovery with Intuitive Physics ICCV2025
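Quick read: This paper addresses the difficulty of obtaining accurate and temporally coherent 3D hand motion estimates from monocular video, particularly during hand-object interactions. The key to the solution is a diffusion-based, physics-augmented motion refinement framework. The model learns the distribution of refined motion estimates conditioned on initial reconstructions and is trained only on motion capture data, without images. Intuitive physics knowledge about hand-object interaction, including key motion states and their associated motion constraints, is integrated into the diffusion process. The approach improves various frame-wise reconstruction methods and achieves state-of-the-art performance on existing benchmarks.
Link: https://arxiv.org/abs/2508.01835
Authors: Yufei Zhang,Zijun Cui,Jeffrey O. Kephart,Qiang Ji
Affiliations: Rensselaer Polytechnic Institute; Michigan State University; IBM Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV2025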
Abstract:While 3D hand reconstruction from monocular images has made significant progress, generating accurate and temporally coherent motion estimates from videos remains challenging, particularly during hand-object interactions. In this paper, we present a novel 3D hand motion recovery framework that enhances image-based reconstructions through a diffusion-based and physics-augmented motion refinement model. Our model captures the distribution of refined motion estimates conditioned on initial ones, generating improved sequences through an iterative denoising process. Instead of relying on scarce annotated video data, we train our model only using motion capture data without images. We identify valuable intuitive physics knowledge during hand-object interactions, including key motion states and their associated motion constraints. We effectively integrate these physical insights into our diffusion model to improve its performance. Extensive experiments demonstrate that our approach significantly improves various frame-wise reconstruction methods, achieving state-of-the-art (SOTA) performance on existing benchmarks.
[CV-108] SoccerTrack v2: A Full-Pitch Multi-View Soccer Dataset for Game State Reconstruction SOCC
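Quick read: This paper addresses the lack of high-quality, large-scale, fully annotated datasets for multi-object tracking (MOT), game state reconstruction (GSR), and ball action spotting (BAS) in soccer analytics. Existing datasets typically rely on broadcast views or limited scenarios. The solution is SoccerTrack v2, a dataset of 10 full-length, panoramic 4K recordings of university-level matches, captured with BePro cameras for complete player visibility. Each video is annotated with GSR labels (2D pitch coordinates, jersey-based player IDs, roles, teams) and BAS labels for 12 action classes (e.g., Pass, Drive, Shot), which provides a basis for new benchmarks in computer vision and soccer analytics.
Link: https://arxiv.org/abs/2508.01802
Authors: Atom Scott,Ikuma Uchida,Kento Kuroda,Yufi Kim,Keisuke Fujii
Affiliations: Nagoya University; University of Tsukuba
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 1 figure. Dataset and code available at this https URL and this https URL . Preliminary paper for dataset release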
Abstract:SoccerTrack v2 is a new public dataset for advancing multi-object tracking (MOT), game state reconstruction (GSR), and ball action spotting (BAS) in soccer analytics. Unlike prior datasets that use broadcast views or limited scenarios, SoccerTrack v2 provides 10 full-length, panoramic 4K recordings of university-level matches, captured with BePro cameras for complete player visibility. Each video is annotated with GSR labels (2D pitch coordinates, jersey-based player IDs, roles, teams) and BAS labels for 12 action classes (e.g., Pass, Drive, Shot). This technical report outlines the dataset's structure, collection pipeline, and annotation process. SoccerTrack v2 is designed to advance research in computer vision and soccer analytics, enabling new benchmarks and practical applications in tactical analysis and automated tools.
[CV-109] Sonify Anything: Towards Context-Aware Sonic Interactions in AR
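Quick read: This paper addresses the lack of natural sonic feedback when virtual and real objects interact in Augmented Reality (AR). Current AR applications play either no sound or a generic sound, which produces an incongruent multisensory experience and reduces interaction and object realism. The key to the solution is a context-aware sound framework: computer vision methods recognize and segment the materials of real objects, and the material's physical properties together with the impact dynamics of the interaction drive physical modelling synthesis that generates material-based sounds in real time.
Link: https://arxiv.org/abs/2508.01789
Authors: Laura Schütz,Sasan Matinfar,Ulrich Eck,Daniel Roth,Nassir Navab
Affiliations: Technical University of Munich
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: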
Abstract:In Augmented Reality (AR), virtual objects interact with real objects. However, the lack of physicality of virtual objects leads to the absence of natural sonic interactions. When virtual and real objects collide, either no sound or a generic sound is played. Both lead to an incongruent multisensory experience, reducing interaction and object realism. Unlike in Virtual Reality (VR) and games, where predefined scenes and interactions allow for the playback of pre-recorded sound samples, AR requires real-time sound synthesis that dynamically adapts to novel contexts and objects to provide audiovisual congruence during interaction. To enhance real-virtual object interactions in AR, we propose a framework for context-aware sounds using methods from computer vision to recognize and segment the materials of real objects. The material’s physical properties and the impact dynamics of the interaction are used to generate material-based sounds in real-time using physical modelling synthesis. In a user study with 24 participants, we compared our congruent material-based sounds to a generic sound effect, mirroring the current standard of non-context-aware sounds in AR applications. The results showed that material-based sounds led to significantly more realistic sonic interactions. Material-based sounds also enabled participants to distinguish visually similar materials with significantly greater accuracy and confidence. These findings show that context-aware, material-based sonic interactions in AR foster a stronger sense of realism and enhance our perception of real-world surroundings.
[CV-110] Skip priors and add graph-based anatomical information for point-based Couinaud segmentation MICCAI2025
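Quick read: This paper addresses a limitation of point-based Couinaud segmentation from CT images for preoperative planning of liver surgery: existing point-based methods need prior knowledge of the liver vessel structure, which is time consuming to acquire. The key to the solution is a point-based method that does not require the vessel structure as an explicit input. A graph reasoning module is added on top of the point features so that the model learns affinities across point neighborhoods and thereby acquires the anatomical vessel information implicitly.
Link: https://arxiv.org/abs/2508.01785
Authors: Xiaotong Zhang,Alexander Broersen,Gonnie CM van Erp,Silvia L. Pintea,Jouke Dijkstra
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI 2025 GRAIL workshop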
Abstract:The preoperative planning of liver surgery relies on Couinaud segmentation from computed tomography (CT) images, to reduce the risk of bleeding and guide the resection procedure. Using 3D point-based representations, rather than voxelizing the CT volume, has the benefit of preserving the physical resolution of the CT. However, point-based representations need prior knowledge of the liver vessel structure, which is time consuming to acquire. Here, we propose a point-based method for Couinaud segmentation, without explicitly providing the prior liver vessel structure. To allow the model to learn this anatomical liver vessel structure, we add a graph reasoning module on top of the point features. This adds implicit anatomical information to the model, by learning affinities across point neighborhoods. Our method is competitive on the MSD and LiTS public datasets in Dice coefficient and average surface distance scores compared to four pioneering point-based methods. Our code is available at this https URL.
[CV-111] DiffSemanticFusion: Semantic Raster BEV Fusion for Autonomous Driving via Online HD Map Diffusion
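Quick read: This paper addresses a trade-off in online HD map generation: raster-based representations suit vision models but lack geometric precision, while graph-based representations retain structural detail but become unstable without precise maps. The core solution is DiffSemanticFusion, a fusion framework for multimodal trajectory prediction and planning that reasons over a semantic raster-fused BEV space and adds a map diffusion module to improve the stability and expressiveness of online HD map representations. The framework is evaluated on two real-world autonomous driving benchmarks, nuScenes and NAVSIM, where the abstract reports a 5.1% improvement for prediction on nuScenes and a 15% gain in NavHard scenarios.
Link: https://arxiv.org/abs/2508.01778
Authors: Zhigang Sun,Yiru Wang,Anqing Jiang,Shuo Wang,Yu Gao,Yuwen Heng,Shouyi Zhang,An He,Hao Jiang,Jinhao Chai,Zichong Gu,Wang Jijun,Shichen Tang,Lavdim Halilaj,Juergen Luettin,Hao Sun
Affiliations: Bosch Corporate Research, Bosch (China) Investment Ltd.; Shanghai University; Shanghai Jiaotong University; Tsinghua University; Robert Bosch GmbH
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: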
Abstract:Autonomous driving requires accurate scene understanding, including road geometry, traffic agents, and their semantic relationships. In online HD map generation scenarios, raster-based representations are well-suited to vision models but lack geometric precision, while graph-based representations retain structural detail but become unstable without precise maps. To harness the complementary strengths of both, we propose DiffSemanticFusion – a fusion framework for multimodal trajectory prediction and planning. Our approach reasons over a semantic raster-fused BEV space, enhanced by a map diffusion module that improves both the stability and expressiveness of online HD map representations. We validate our framework on two downstream tasks: trajectory prediction and planning-oriented end-to-end autonomous driving. Experiments on real-world autonomous driving benchmarks, nuScenes and NAVSIM, demonstrate improved performance over several state-of-the-art methods. For the prediction task on nuScenes, we integrate DiffSemanticFusion with the online HD map informed QCNet, achieving a 5.1% performance improvement. For end-to-end autonomous driving in NAVSIM, DiffSemanticFusion achieves state-of-the-art results, with a 15% performance gain in NavHard scenarios. In addition, extensive ablation and sensitivity studies show that our map diffusion module can be seamlessly integrated into other vector-based approaches to enhance performance. All artifacts are available at this https URL.
[CV-112] VPN: Visual Prompt Navigation
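Quick read: This paper addresses the reduced effectiveness of language-guided navigation in complex environments caused by the inherent ambiguity and verbosity of natural language. The key to the solution is a new paradigm, Visual Prompt Navigation (VPN), in which the agent is guided only by user-provided visual prompts on 2D top-view maps rather than by language instructions. The visual prompt marks the navigation trajectory on a top-down view of the scene, which gives intuitive, spatially grounded guidance, is friendlier to non-expert users, and reduces interpretive ambiguity. To support the task, the authors build two new datasets, R2R-VP and R2R-CE-VP, and introduce a baseline network, VPNet, with two data augmentation strategies (view-level and trajectory-level) to improve navigation performance.
Link: https://arxiv.org/abs/2508.01766
Authors: Shuo Feng,Zihan Wang,Yuchen Li,Rui Kong,Hengyi Cai,Shuaiqiang Wang,Gim Hee Lee,Piji Li,Shuqiang Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: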
Abstract:While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To this end, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. This visual prompt primarily focuses on marking the visual navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is more friendly for non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network to handle the VPN tasks, with two data augmentation strategies: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes), to enhance navigation performance. Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation. The code is available at this https URL.
[CV-113] Vision transformer-based multi-camera multi-object tracking framework for dairy cow monitoring
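Quick read: This paper addresses the difficulty of early disease identification in dairy cows when health and welfare monitoring depends on laborious and inconsistent manual observation. The core challenge is continuous, accurate tracking of cow activity in complex indoor environments. The solution is a multi-camera, real-time tracking system. Six camera feeds are geometrically aligned with homographic transformations into a top-down barn panorama. A refined YOLO11-m model performs detection (mAP@0.50 = 0.97, F1 = 0.95), and SAMURAI, an upgraded Segment Anything Model 2.1, produces pixel-precise instance masks using zero-shot learning and motion-aware memory. A motion-aware linear Kalman filter with IoU-based data association handles occlusion and posture changes and keeps identities stable across viewpoints. The system reaches a MOTA of 98.7% and 99.3% on two benchmark sequences with IDF1 above 99%, outperforming Deep SORT Realtime.
Link: https://arxiv.org/abs/2508.01752
Authors: Kumail Abbas,Zeeshan Afzal,Aqeel Raza,Taha Mansouri,Andrew W. Dowsey,Chaidate Inchaisri,Ali Alameer
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted in Smart Agriculture Technology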
Abstract:Activity and behaviour correlate with dairy cow health and welfare, making continual and accurate monitoring crucial for disease identification and farm productivity. Manual observation and frequent assessments are laborious and inconsistent for activity monitoring. In this study, we developed a unique multi-camera, real-time tracking system for indoor-housed Holstein Friesian dairy cows. This technology uses cutting-edge computer vision techniques, including instance segmentation and tracking algorithms to monitor cow activity seamlessly and accurately. An integrated top-down barn panorama was created by geometrically aligning six camera feeds using homographic transformations. The detection phase used a refined YOLO11-m model trained on an overhead cow dataset, obtaining high accuracy (mAP@0.50 = 0.97, F1 = 0.95). SAMURAI, an upgraded Segment Anything Model 2.1, generated pixel-precise cow masks for instance segmentation utilizing zero-shot learning and motion-aware memory. Even with occlusion and fluctuating posture, a motion-aware Linear Kalman filter and IoU-based data association reliably identified cows over time for object tracking. The proposed system significantly outperformed Deep SORT Realtime. Multi-Object Tracking Accuracy (MOTA) was 98.7% and 99.3% in two benchmark video sequences, with IDF1 scores above 99% and near-zero identity switches. This unified multi-camera system can track dairy cows in complex interior surroundings in real time, according to our data. The system reduces redundant detections across overlapping cameras, maintains continuity as cows move between viewpoints, with the aim of improving early sickness prediction through activity quantification and behavioural classification.
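The IoU-based data association step can be sketched in a few lines. The example below is a simplified illustration, not the authors' pipeline: it assumes axis-aligned boxes in (x1, y1, x2, y2) form, uses a greedy highest-IoU-first matching instead of an optimal assignment, and the threshold and box values are hypothetical. In a full tracker the track boxes would be the Kalman filter's predictions for the current frame.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_associate(tracks, detections, min_iou=0.3):
    """Match track boxes to detection boxes, taking the highest IoU pairs first."""
    pairs = sorted(
        (
            (iou(t, d), ti, di)
            for ti, t in enumerate(tracks)
            for di, d in enumerate(detections)
        ),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for score, ti, di in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if ti in used_tracks or di in used_dets:
            continue
        matches[ti] = di
        used_tracks.add(ti)
        used_dets.add(di)
    return matches


# Hypothetical predicted track boxes and new detections for one frame.
tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 19, 31, 29), (1, 1, 11, 11)]
matches = greedy_associate(tracks, detections)
```

Tracks left unmatched would typically be kept alive for a few frames to bridge occlusions, and unmatched detections would start new tracks.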
[CV-114] Improving Noise Efficiency in Privacy-preserving Dataset Distillation ICCV2025
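Quick read: This paper addresses two problems. Differentially private (DP) data generation needs a large amount of synthetic data to match the performance of models trained on the original data, and existing private Dataset Distillation (DD) methods suffer from a synchronized sampling-optimization process and from noisy training signals produced by randomly initialized networks, which leads to inefficient use of private information. The key to the solution is twofold: sampling is decoupled from optimization for better convergence, and matching is performed in an informative subspace to mitigate the impact of DP noise on the training signal. On CIFAR-10, the method achieves a 10.0% improvement with 50 images per class and an 8.3% increase with only one-fifth of the distilled set size of previous state-of-the-art methods.
Link: https://arxiv.org/abs/2508.01749
Authors: Runkai Zheng,Vishnu Asutosh Dasu,Yinong Oliver Wang,Haohan Wang,Fernando De la Torre
Affiliations: Carnegie Mellon University; Pennsylvania State University; University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICCV 2025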
Abstract:Modern machine learning models heavily rely on large datasets that often include sensitive and private information, raising serious privacy concerns. Differentially private (DP) data generation offers a solution by creating synthetic datasets that limit the leakage of private information within a predefined privacy budget; however, it requires a substantial amount of data to achieve performance comparable to models trained on the original data. To mitigate the significant expense incurred with synthetic data generation, Dataset Distillation (DD) stands out for its remarkable training and storage efficiency. This efficiency is particularly advantageous when integrated with DP mechanisms, curating compact yet informative synthetic datasets without compromising privacy. However, current state-of-the-art private DD methods suffer from a synchronized sampling-optimization process and the dependency on noisy training signals from randomly initialized networks. This results in the inefficient utilization of private information due to the addition of excessive noise. To address these issues, we introduce a novel framework that decouples sampling from optimization for better convergence and improves signal quality by mitigating the impact of DP noise through matching in an informative subspace. On CIFAR-10, our method achieves a 10.0% improvement with 50 images per class and 8.3% increase with just one-fifth the distilled set size of previous state-of-the-art methods, demonstrating significant potential to advance privacy-preserving DD.
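The "excessive noise" the abstract refers to comes from the way DP mechanisms privatize per-example signals. The sketch below shows the standard clip-then-add-Gaussian-noise pattern in plain Python, as a generic illustration rather than the paper's framework; the function names and parameters are hypothetical, and the privacy accounting that maps the noise multiplier to an (epsilon, delta) budget is not shown.

```python
import math
import random


def clip_l2(vec, max_norm):
    """Scale a vector down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def privatize_mean(per_example_signals, max_norm, noise_multiplier, rng):
    """Clip each per-example vector, sum, add Gaussian noise, then average."""
    clipped = [clip_l2(v, max_norm) for v in per_example_signals]
    dim = len(clipped[0])
    total = [sum(v[j] for v in clipped) for j in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [t / len(clipped) for t in noisy]


# Hypothetical per-example signals (e.g. gradients or feature statistics).
signals = [[3.0, 4.0], [0.0, 0.0]]
noiseless = privatize_mean(signals, max_norm=1.0, noise_multiplier=0.0, rng=random.Random(0))
noisy = privatize_mean(signals, max_norm=1.0, noise_multiplier=1.0, rng=random.Random(0))
```

Because the noise is added in every coordinate, matching in a lower-dimensional subspace is one plausible reason the signal survives better, which fits the motivation stated in the abstract.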
[CV-115] Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation
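Quick read: This paper addresses long-term action anticipation from egocentric video. Existing methods underuse fine-grained visual cues from hand-object interactions, neglect semantic dependencies between verbs and nouns, and lack explicit cognitive reasoning, which limits generalization and long-term forecasting ability. The key to the solution is INSIGHT, a unified two-stage framework. The first stage extracts semantically rich features from hand-object interaction regions and enhances action representations with a verb-noun co-occurrence matrix. The second stage introduces a reinforcement learning-based module that simulates explicit cognitive reasoning as a structured process: visual perception (think), intention inference (reason), action anticipation (answer).
Link: https://arxiv.org/abs/2508.01742
Authors: Qiaohui Chu,Haoyu Zhang,Meng Liu,Yisen Feng,Haoxiang Shi,Liqiang Nie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Our code will be released upon acceptance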
Abstract:Long-term action anticipation from egocentric video is critical for applications such as human-computer interaction and assistive technologies, where anticipating user intent enables proactive and context-aware AI assistance. However, existing approaches suffer from three key limitations: 1) underutilization of fine-grained visual cues from hand-object interactions, 2) neglect of semantic dependencies between verbs and nouns, and 3) lack of explicit cognitive reasoning, limiting generalization and long-term forecasting ability. To overcome these challenges, we propose INSIGHT, a unified two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions and enhances action representations using a verb-noun co-occurrence matrix. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning through a structured process: visual perception (think) - intention inference (reason) - action anticipation (answer). Extensive experiments on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks show that INSIGHT achieves state-of-the-art performance, demonstrating its effectiveness and strong generalization capability.
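A verb-noun co-occurrence matrix of the kind mentioned in the first stage can be built directly from action labels. The sketch below is a minimal illustration with hypothetical labels; the abstract does not describe how INSIGHT injects the matrix into its representations, so the row-normalized prior shown here is only one plausible use.

```python
from collections import Counter


def cooccurrence(actions):
    """Count how often each (verb, noun) pair appears in a list of action labels."""
    counts = Counter(actions)
    verbs = sorted({v for v, _ in actions})
    nouns = sorted({n for _, n in actions})
    matrix = [[counts[(v, n)] for n in nouns] for v in verbs]
    return verbs, nouns, matrix


def noun_prior(verbs, nouns, matrix, verb):
    """Row-normalize one verb's counts into an estimate of P(noun | verb)."""
    row = matrix[verbs.index(verb)]
    total = sum(row)
    return {n: c / total for n, c in zip(nouns, row)}


# Hypothetical action annotations as (verb, noun) pairs.
labels = [("cut", "onion"), ("cut", "onion"), ("cut", "bread"), ("wash", "pan")]
verbs, nouns, matrix = cooccurrence(labels)
prior = noun_prior(verbs, nouns, matrix, "cut")
```

Such a prior captures, for instance, that "cut" rarely pairs with "pan", which is the kind of semantic dependency between verbs and nouns that the abstract says earlier methods neglect.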
[CV-116] Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models
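[Quick read]: This paper addresses the risk that open-source Vision-Language Models (VLMs) retain the base model's vulnerabilities after fine-tuning, leaving them open to transferable jailbreak attacks. The key contribution is the Simulated Ensemble Attack (SEA), a novel grey-box jailbreak method built on two techniques: Fine-tuning Trajectory Simulation (FTS), which generates highly transferable adversarial images by simulating parameter shifts in the vision encoder, and Targeted Prompt Guidance (TPG), a textual strategy that steers the language decoder toward harmful outputs. Experiments on several fine-tuned Qwen2-VL models show attack success rates above 86.5% and toxicity rates near 49.5%, far better than conventional PGD-based methods, revealing that vulnerabilities inherited from the base model are a key blind spot in securing fine-tuned VLMs.
Link: https://arxiv.org/abs/2508.01741
Authors: Ruofan Wang,Xin Wang,Yang Yao,Xuan Tong,Xingjun Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: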
Abstract:Fine-tuning open-source Vision-Language Models (VLMs) creates a critical yet underexplored attack surface: vulnerabilities in the base VLM could be retained in fine-tuned variants, rendering them susceptible to transferable jailbreak attacks. To demonstrate this risk, we introduce the Simulated Ensemble Attack (SEA), a novel grey-box jailbreak method in which the adversary has full access to the base VLM but no knowledge of the fine-tuned target’s weights or training configuration. To improve jailbreak transferability across fine-tuned VLMs, SEA combines two key techniques: Fine-tuning Trajectory Simulation (FTS) and Targeted Prompt Guidance (TPG). FTS generates transferable adversarial images by simulating the vision encoder’s parameter shifts, while TPG is a textual strategy that steers the language decoder toward adversarially optimized outputs. Experiments on the Qwen2-VL family (2B and 7B) demonstrate that SEA achieves high transfer attack success rates exceeding 86.5% and toxicity rates near 49.5% across diverse fine-tuned variants, even those specifically fine-tuned to improve safety behaviors. Notably, while direct PGD-based image jailbreaks rarely transfer across fine-tuned VLMs, SEA reliably exploits inherited vulnerabilities from the base model, significantly enhancing transferability. These findings highlight an urgent need to safeguard fine-tuned proprietary VLMs against transferable vulnerabilities inherited from open-source foundations, motivating the development of holistic defenses across the entire model lifecycle.
[CV-117] AG2aussian: Anchor-Graph Structured Gaussian Splatting for Instance-Level 3D Scene Understanding and Editing
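[Quick read]: This paper addresses a weakness of current 3D Gaussian Splatting (3DGS) methods in semantic-aware settings: they attach semantic features to freely distributed Gaussians via differentiable rendering, which produces noisy segmentation and messy Gaussian selection and makes precise semantic understanding and editing difficult. The key to the solution is the AG²aussian framework, which introduces an anchor-graph structure to organize semantic features and regulate Gaussian primitives. The structure promotes compact, instance-aware Gaussian distributions and supports graph-based propagation, giving clean and accurate instance-level Gaussian selection and markedly improving semantic precision and controllability in applications such as interactive click-based query, open-vocabulary text-driven query, object removal editing, and physics simulation.
Link: https://arxiv.org/abs/2508.01740
Authors: Zhaonan Wang,Manyi Li,Changhe Tu
Affiliations: Shandong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: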
Abstract:3D Gaussian Splatting (3DGS) has witnessed exponential adoption across diverse applications, driving a critical need for semantic-aware 3D Gaussian representations to enable scene understanding and editing tasks. Existing approaches typically attach semantic features to a collection of free Gaussians and distill the features via differentiable rendering, leading to noisy segmentation and a messy selection of Gaussians. In this paper, we introduce AG²aussian, a novel framework that leverages an anchor-graph structure to organize semantic features and regulate Gaussian primitives. Our anchor-graph structure not only promotes compact and instance-aware Gaussian distributions, but also facilitates graph-based propagation, achieving a clean and accurate instance-level Gaussian selection. Extensive validation across four applications, i.e. interactive click-based query, open-vocabulary text-driven query, object removal editing, and physics simulation, demonstrates the advantages of our approach and its benefits to various applications. The experiments and ablation studies further evaluate the effectiveness of the key designs of our approach.
[CV-118] SpectralX: Parameter-efficient Domain Generalization for Spectral Remote Sensing Foundation Models
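[Quick read]: This paper addresses the fact that existing Remote Sensing Foundation Models (RSFMs) are mostly pretrained on optical imagery while multispectral/hyperspectral data lack corresponding foundation models, which limits the potential of spectral information in earth observation. The key to the solution is the SpectralX framework, which adapts RSFMs to diverse spectral modalities through a two-stage parameter-efficient fine-tuning strategy. The first stage uses a masked-reconstruction task with a specialized Hyper Tokenizer (HyperT) that extracts attribute tokens from the spatial and spectral dimensions, together with an Attribute-oriented Mixture of Adapter (AoMoA) that dynamically aggregates multi-attribute expert knowledge. The second stage inserts an Attribute-refined Adapter (Are-adapter) for the downstream semantic segmentation task; by iteratively querying low-level features with high-level representations, the model focuses on task-beneficial attributes and customizes the RSFM, markedly improving domain generalization across regions and seasons.
Link: https://arxiv.org/abs/2508.01731
Authors: Yuxiang Zhang,Wei Li,Mengmeng Zhang,Jiawei Han,Ran Tao,Shunlin Liang
Affiliations: The University of Hong Kong; Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: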
Abstract:Recent advances in Remote Sensing Foundation Models (RSFMs) have led to significant breakthroughs in the field. While many RSFMs have been pretrained with massive optical imagery, multispectral/hyperspectral data still lack corresponding foundation models. To leverage the advantages of spectral imagery in earth observation, we explore whether existing RSFMs can be effectively adapted to process diverse spectral modalities without requiring extensive spectral pretraining. In response to this challenge, we propose SpectralX, an innovative parameter-efficient fine-tuning framework that adapts existing RSFMs as the backbone while introducing a two-stage training approach to handle various spectral inputs, thereby significantly improving domain generalization performance. In the first stage, we employ a masked-reconstruction task and design a specialized Hyper Tokenizer (HyperT) to extract attribute tokens from both spatial and spectral dimensions. Simultaneously, we develop an Attribute-oriented Mixture of Adapter (AoMoA) that dynamically aggregates multi-attribute expert knowledge while performing layer-wise fine-tuning. With semantic segmentation as the downstream task in the second stage, we insert an Attribute-refined Adapter (Are-adapter) into the first-stage framework. By iteratively querying low-level semantic features with high-level representations, the model learns to focus on task-beneficial attributes, enabling customized adjustment of RSFMs. Following this two-phase adaptation process, SpectralX is capable of interpreting spectral imagery from new regions or seasons. The code will be available from the website: this https URL.
[CV-119] Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos
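[Quick read]: This paper addresses unstable association and identity ambiguity in Multi-object Tracking (MOT) for unmanned aerial vehicle (UAV) video, caused by frequent viewpoint changes and complex UAV-ground relative motion dynamics. Existing methods typically model motion and appearance features separately and ignore their spatio-temporal interplay, which limits tracking performance. The key to the solution is the AMOT framework, which jointly exploits appearance and motion cues through two components. The Appearance-Motion Consistency (AMC) matrix computes bi-directional spatial consistency under the guidance of appearance features, making identity association more reliable and context-aware. The Motion-aware Track Continuation (MTC) module reactivates unmatched tracks by aligning appearance-guided predictions with Kalman-filter predictions, reducing broken trajectories caused by missed detections. The method clearly outperforms current state-of-the-art methods on three UAV benchmarks (VisDrone2019, UAVDT, and VT-MOT-UAV) and generalizes in a plug-and-play, training-free manner.
Link: https://arxiv.org/abs/2508.01730
Authors: Jianbo Ma,Hui Luo,Qi Chen,Yuankai Qi,Yumei Sun,Amin Beheshti,Jianlin Zhang,Ming-Hsuan Yang
Affiliations: Institute of Optics and Electronics, Chinese Academy of Sciences; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences; University of Adelaide; Macquarie University; University of California, Merced
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: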
Abstract:Multi-object tracking (MOT) aims to track multiple objects while maintaining consistent identities across frames of a given video. In unmanned aerial vehicle (UAV) recorded videos, frequent viewpoint changes and complex UAV-ground relative motion dynamics pose significant challenges, which often lead to unstable affinity measurement and ambiguous association. Existing methods typically model motion and appearance cues separately, overlooking their spatio-temporal interplay and resulting in suboptimal tracking performance. In this work, we propose AMOT, which jointly exploits appearance and motion cues through two key components: an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module. Specifically, the AMC matrix computes bi-directional spatial consistency under the guidance of appearance features, enabling more reliable and context-aware identity association. The MTC module complements AMC by reactivating unmatched tracks through appearance-guided predictions that align with Kalman-based predictions, thereby reducing broken trajectories caused by missed detections. Extensive experiments on three UAV benchmarks, including VisDrone2019, UAVDT, and VT-MOT-UAV, demonstrate that our AMOT outperforms current state-of-the-art methods and generalizes well in a plug-and-play and training-free manner.
[CV-120] Granular Concept Circuits: Toward a Fine-Grained Circuit Discovery for Concept Representations ICCV2025
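[Quick read]: This paper addresses the difficulty of locating specific visual concepts within the distributed representations of deep vision models, that is, identifying precisely which combinations of neurons encode the visual concepts relevant to a given query. The key to the solution is a circuit discovery method called Granular Concept Circuit (GCC). It iteratively assesses inter-neuron connectivity, considering both functional dependencies and semantic alignment, and automatically builds multiple fine-grained circuits, each corresponding to one specific concept. This gives a concept-wise interpretation of the model and, according to the authors, is the first approach to identify circuits tied to specific visual concepts at a fine-grained level.
Link: https://arxiv.org/abs/2508.01728
Authors: Dahee Kwon,Sehyun Lee,Jaesik Choi
Affiliations: KAIST AI; INEEJI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025 accepted paper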
Abstract:Deep vision models have achieved remarkable classification performance by leveraging a hierarchical architecture in which human-interpretable concepts emerge through the composition of individual neurons across layers. Given the distributed nature of representations, pinpointing where specific visual concepts are encoded within a model remains a crucial yet challenging task. In this paper, we introduce an effective circuit discovery method, called Granular Concept Circuit (GCC), in which each circuit represents a concept relevant to a given query. To construct each circuit, our method iteratively assesses inter-neuron connectivity, focusing on both functional dependencies and semantic alignment. By automatically discovering multiple circuits, each capturing specific concepts within that query, our approach offers a profound, concept-wise interpretation of models and is the first to identify circuits tied to specific visual concepts at a fine-grained level. We validate the versatility and effectiveness of GCCs across various deep image classification models.
[CV-121] OccamVTS: Distilling Vision Models to 1% Parameters for Time Series Forecasting
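[Quick read]: This paper addresses parameter redundancy and semantic noise in current time series forecasting methods built on Large Vision Models (LVMs): although LVMs improve forecasting performance, 99% of their parameters contribute nothing substantial to time series tasks, and their high-level semantic features can introduce interference that lowers accuracy. The key to the solution is OccamVTS, a knowledge distillation framework that uses pyramid-style feature alignment, correlation constraints, and feature distillation to extract the essential 1% of predictive information from a pre-trained LVM and transfer it into a lightweight network. This preserves key temporal patterns while filtering out irrelevant visual semantic noise, giving forecasts that are both parameter-efficient and more accurate, with especially strong results in few-shot and zero-shot scenarios.
Link: https://arxiv.org/abs/2508.01727
Authors: Sisuo Lyu,Siru Zhong,Weilin Ruan,Qingxiang Liu,Qingsong Wen,Hui Xiong,Yuxuan Liang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: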
Abstract:Time series forecasting is fundamental to diverse applications, with recent approaches leveraging large vision models (LVMs) to capture temporal patterns through visual representations. We reveal that while vision models enhance forecasting performance, 99% of their parameters are unnecessary for time series tasks. Through cross-modal analysis, we find that time series align with low-level textural features but not high-level semantics, which can impair forecasting accuracy. We propose OccamVTS, a knowledge distillation framework that extracts only the essential 1% of predictive information from LVMs into lightweight networks. Using pre-trained LVMs as privileged teachers, OccamVTS employs pyramid-style feature alignment combined with correlation and feature distillation to transfer beneficial patterns while filtering out semantic noise. Counterintuitively, this aggressive parameter reduction improves accuracy by eliminating overfitting to irrelevant visual features while preserving essential temporal patterns. Extensive experiments across multiple benchmark datasets demonstrate that OccamVTS consistently achieves state-of-the-art performance with only 1% of the original parameters, particularly excelling in few-shot and zero-shot scenarios.
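A minimal sketch of what combining feature distillation with correlation distillation might look like. The abstract does not give OccamVTS's loss, so this is a generic formulation under stated assumptions: the feature term is a mean squared error between student and teacher features (assumed to share a dimension, with any projection layer omitted), and the correlation term matches the pairwise cosine-similarity structure across samples. The weights `alpha` and `beta` are hypothetical.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(features):
    """Pairwise similarity between samples within a batch."""
    return [[cosine(a, b) for b in features] for a in features]


def distillation_loss(student, teacher, alpha=1.0, beta=1.0):
    """Assumed form: alpha * feature term + beta * correlation term."""
    feature_term = sum(mse(s, t) for s, t in zip(student, teacher)) / len(student)
    s_sim = similarity_matrix(student)
    t_sim = similarity_matrix(teacher)
    correlation_term = sum(mse(rs, rt) for rs, rt in zip(s_sim, t_sim)) / len(s_sim)
    return alpha * feature_term + beta * correlation_term


# Toy batch of two samples with 2-D features (values are made up).
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.8, 0.2], [0.3, 0.7]]
loss = distillation_loss(student, teacher)
```

The correlation term depends only on relations between samples, so it can still be informative when a small student cannot match the teacher's features value for value.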
[CV-122] Imbalance-Robust and Sampling-Efficient Continuous Conditional GANs via Adaptive Vicinity and Auxiliary Regularization
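[Quick read]: This paper addresses two core problems of continuous conditional generative models (such as CcGAN and CCDM) when estimating high-dimensional data distributions: data imbalance, which arises from fixed-size vicinity constraints, and low computational efficiency, in particular the expensive iterative sampling that CCDM requires. The key to the solution is CcGAN-AVAR, an enhanced CcGAN framework with two innovations. First, it uses the GAN's native one-step generation in place of CCDM's iterative sampling, giving 300x–2000x faster inference. Second, it introduces an adaptive vicinity mechanism that dynamically adjusts the vicinity size, and a multi-task discriminator that builds regularization terms from auxiliary regression and density ratio estimation, which mitigates data imbalance and markedly improves the stability and quality of generator training.
Link: https://arxiv.org/abs/2508.01725
Authors: Xin Ding,Yun Chen,Yongwei Wang,Kao Zhang,Sen Zhang,Peibei Cao,Xiangxue Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: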
Abstract:Recent advances in conditional generative modeling have introduced Continuous conditional Generative Adversarial Network (CcGAN) and Continuous Conditional Diffusion Model (CCDM) for estimating high-dimensional data distributions conditioned on scalar, continuous regression labels (e.g., angles, ages, or temperatures). However, these approaches face fundamental limitations: CcGAN suffers from data imbalance due to fixed-size vicinity constraints, while CCDM requires computationally expensive iterative sampling. We present CcGAN-AVAR, an enhanced CcGAN framework that addresses both challenges: (1) leveraging the GAN framework’s native one-step generation to overcome CCDMs’ sampling bottleneck (achieving 300x-2000x faster inference), while (2) two novel components specifically target data imbalance - an adaptive vicinity mechanism that dynamically adjusts vicinity’s size, and a multi-task discriminator that constructs two regularization terms (through auxiliary regression and density ratio estimation) to significantly improve generator training. Extensive experiments on four benchmark datasets (64x64 to 192x192 resolution) across eight challenging imbalanced settings demonstrate that CcGAN-AVAR achieves state-of-the-art generation quality while maintaining sampling efficiency.
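To make the imbalance problem concrete, here is a small sketch contrasting a fixed vicinity with an adaptive one. The abstract only states that the vicinity size is adjusted dynamically; the specific rule below (widen the radius until at least `min_samples` real samples fall inside) and all names are assumptions for illustration, not the paper's mechanism.

```python
def adaptive_vicinity(labels, target, min_samples, base_radius):
    """Return (radius, indices) of the samples used to condition on `target`.

    Assumed rule: keep the base radius where data is dense, and widen it to
    the distance of the k-th nearest label where data is sparse.
    """
    distances = sorted(abs(y - target) for y in labels)
    k = min(min_samples, len(distances))
    radius = max(base_radius, distances[k - 1])
    indices = [i for i, y in enumerate(labels) if abs(y - target) <= radius]
    return radius, indices


# Imbalanced toy regression labels (e.g. ages): dense near 20, sparse beyond.
labels = [20, 21, 22, 23, 40, 60]

dense = adaptive_vicinity(labels, target=22, min_samples=3, base_radius=2)
sparse = adaptive_vicinity(labels, target=60, min_samples=3, base_radius=2)
```

In the dense region the base radius already covers enough samples, while around the rare label the radius grows so that the discriminator is not conditioned on a single example.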
[CV-123] Dynamic Robot-Assisted Surgery with Hierarchical Class-Incremental Semantic Segmentation
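[Quick read]: This paper addresses how semantic segmentation models for robot-assisted surgery adapt to dynamic, evolving environments: conventional models trained on static datasets cope poorly with the continual introduction of new classes and are prone to catastrophic forgetting. The key to the solution is TOPICS+, an improved Class-Incremental Semantic Segmentation (CISS) method that adds the Dice loss to handle class imbalance, introduces hierarchical pseudo-labeling, and builds label taxonomies tailored to robotic surgery, so that new surgical structures can be learned continually while existing knowledge is retained. The authors also propose six new CISS benchmarks and a refined label set with more than 144 classes on the Syn-Mediverse synthetic dataset, providing a standardized evaluation platform for this area.
Link: https://arxiv.org/abs/2508.01713
Authors: Julia Hindel,Ema Mekic,Enamundram Naga Karthik,Rohit Mohan,Daniele Cattaneo,Maria Kalweit,Abhinav Valada
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: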
Abstract:Robot-assisted surgeries rely on accurate and real-time scene understanding to safely guide surgical instruments. However, segmentation models trained on static datasets face key limitations when deployed in these dynamic and evolving surgical environments. Class-incremental semantic segmentation (CISS) allows models to continually adapt to new classes while avoiding catastrophic forgetting of prior knowledge, without training on previous data. In this work, we build upon the recently introduced Taxonomy-Oriented Poincaré-regularized Incremental Class Segmentation (TOPICS) approach and propose an enhanced variant, termed TOPICS+, specifically tailored for robust segmentation of surgical scenes. Concretely, we incorporate the Dice loss into the hierarchical loss formulation to handle strong class imbalances, introduce hierarchical pseudo-labeling, and design tailored label taxonomies for robotic surgery environments. We also propose six novel CISS benchmarks designed for robotic surgery environments including multiple incremental steps and several semantic categories to emulate realistic class-incremental settings in surgical environments. In addition, we introduce a refined set of labels with more than 144 classes on the Syn-Mediverse synthetic dataset, hosted online as an evaluation benchmark. We make the code and trained models publicly available at this http URL.
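The reason a Dice term helps with strong class imbalance can be seen from the standard soft Dice loss, sketched below for a single binary class. This is the textbook formulation, not the hierarchical loss used in TOPICS+; the function name and the smoothing constant `eps` are assumptions.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - (2 * |P ∩ G| + eps) / (|P| + |G| + eps)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


# A rare class: only 1 of 100 pixels is foreground (toy data).
target = [1.0] + [0.0] * 99
all_background = [0.0] * 100

# Predicting "background everywhere" is 99% accurate per pixel ...
accuracy = sum(p == t for p, t in zip(all_background, target)) / len(target)
# ... yet the Dice loss stays near its maximum because the overlap is zero.
missed = dice_loss(all_background, target)
perfect = dice_loss(target, target)
```

Because the loss is normalized by the size of the predicted and true regions rather than by the pixel count, a small structure contributes as much as a large one.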
[CV-124] HateClipSeg: A Segment-Level Annotated Dataset for Fine-Grained Hate Video Detection
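[Quick read]: This paper addresses the difficulty of detecting hate speech in videos, where the core challenges are the complexity of multimodal content and the lack of fine-grained annotations in existing datasets. The key to the solution is HateClipSeg, a large-scale multimodal dataset with both video-level and segment-level annotations, comprising over 11,714 segments labeled across six categories (Normal, Hateful, Insulting, Sexual, Violence, Self-Harm) together with explicit target victim labels. A three-stage annotation process ensures high agreement (Krippendorff’s alpha = 0.817), and three benchmark tasks evaluate models on trimmed hateful video classification, temporal localization, and online classification, with the aim of encouraging more advanced multimodal and temporally aware methods.
Link: https://arxiv.org/abs/2508.01712
Authors: Han Wang,Zhuoran Wang,Roy Ka-Wei Lee
Affiliations: Singapore University of Technology and Design
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures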
Abstract:Detecting hate speech in videos remains challenging due to the complexity of multimodal content and the lack of fine-grained annotations in existing datasets. We present HateClipSeg, a large-scale multimodal dataset with both video-level and segment-level annotations, comprising over 11,714 segments labeled as Normal or across five Offensive categories: Hateful, Insulting, Sexual, Violence, Self-Harm, along with explicit target victim labels. Our three-stage annotation process yields high inter-annotator agreement (Krippendorff’s alpha = 0.817). We propose three tasks to benchmark performance: (1) Trimmed Hateful Video Classification, (2) Temporal Hateful Video Localization, and (3) Online Hateful Video Classification. Results highlight substantial gaps in current models, emphasizing the need for more sophisticated multimodal and temporally aware approaches. The HateClipSeg dataset is publicly available at this https URL.
[CV-125] GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval
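[Quick read]: This paper addresses imprecise alignment between language and temporally rich video signals in text-to-video retrieval. Existing methods rely mainly on visual cues and overlook complementary audio semantics, or use coarse fusion strategies, which limits the quality of the multimodal representation. The key to the solution is the GAID framework, with two core components: (i) Frame-level Gated Fusion (FGF), which adaptively fuses audio and visual features under textual guidance for fine-grained temporal alignment; and (ii) Directional Adaptive Semantic Perturbation (DASP), which injects structure-aware perturbations into text embeddings to improve robustness and cross-modal discrimination without multi-pass inference. The two modules work together: fusion narrows the modality gap and perturbation stabilizes cross-modal matching, yielding more stable and expressive multimodal representations.
Link: https://arxiv.org/abs/2508.01711
Authors: Bowen Yang,Yun Cao,Chen He,Xiaosu Su
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: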
Abstract:Text-to-video retrieval requires precise alignment between language and temporally rich video signals. Existing methods predominantly exploit visual cues and often overlook complementary audio semantics or adopt coarse fusion strategies, leading to suboptimal multimodal representations. We present GAID, a framework that jointly addresses this gap via two key components: (i) a Frame-level Gated Fusion (FGF) that adaptively integrates audio and visual features under textual guidance, enabling fine-grained temporal alignment; and (ii) a Directional Adaptive Semantic Perturbation (DASP) that injects structure-aware perturbations into text embeddings, enhancing robustness and discrimination without incurring multi-pass inference. These modules complement each other – fusion reduces modality gaps while perturbation regularizes cross-modal matching – yielding more stable and expressive representations. Extensive experiments on MSR-VTT, DiDeMo, LSMDC, and VATEX show consistent state-of-the-art results across all retrieval metrics with notable efficiency gains. Our code is available at this https URL.
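A rough sketch of frame-level gated fusion under textual guidance. GAID's actual gate is presumably learned; the parameter-free rule below is an assumption chosen to keep the example short: for each frame, the gate is a sigmoid of how much more the audio feature agrees with the text embedding than the visual feature does. All names and numbers are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_fusion(visual, audio, text):
    """Fuse per-frame features; gate near 1 favors audio, near 0 favors visual."""
    fused, gates = [], []
    for v, a in zip(visual, audio):
        # Assumed gate: relative agreement of each modality with the text query.
        gate = 1.0 / (1.0 + math.exp(-(dot(a, text) - dot(v, text))))
        gates.append(gate)
        fused.append([(1.0 - gate) * vi + gate * ai for vi, ai in zip(v, a)])
    return fused, gates


# Two frames with 2-D features (toy values).
visual = [[1.0, 0.0], [1.0, 0.0]]
audio = [[0.0, 1.0], [0.0, 1.0]]

# A query that matches both modalities equally gives an even blend ...
fused_equal, gates_equal = gated_fusion(visual, audio, text=[1.0, 1.0])
# ... while a query aligned with the audio direction shifts weight to audio.
fused_audio, gates_audio = gated_fusion(visual, audio, text=[0.0, 4.0])
```

Computing one gate per frame, rather than one per video, is what allows the fusion to follow moments where sound carries the relevant information.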
[CV-126] LT-Gaussian: Long-Term Map Update Using 3D Gaussian Splatting for Autonomous Driving
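[Quick read]: This paper addresses the inefficiency of updating autonomous driving maps based on 3D Gaussian Splatting (3D-GS) in changing environments, that is, how to reduce time and computational cost while preserving reconstruction quality. The key to the solution is LT-Gaussian, which consists of three core modules: Multimodal Gaussian Splatting, a Structural Change Detection Module, and a Gaussian-Map Update Module. It identifies structural changes by comparing the old map with the current LiDAR data stream and applies targeted updates only to the affected local regions instead of rebuilding from scratch, giving efficient, high-quality map maintenance.
Link: https://arxiv.org/abs/2508.01704
Authors: Luqi Cheng,Zhangshuo Qi,Zijie Zhou,Chao Lu,Guangming Xiong
Affiliations: Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IV 2025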
Abstract:Maps play an important role in autonomous driving systems. The recently proposed 3D Gaussian Splatting (3D-GS) produces rendering-quality explicit scene reconstruction results, demonstrating the potential for map construction in autonomous driving scenarios. However, because of the time and computational costs involved in generating Gaussian scenes, how to update the map becomes a significant challenge. In this paper, we propose LT-Gaussian, a map update method for 3D-GS-based maps. LT-Gaussian consists of three main components: Multimodal Gaussian Splatting, Structural Change Detection Module, and Gaussian-Map Update Module. Firstly, the Gaussian map of the old scene is generated using our proposed Multimodal Gaussian Splatting. Subsequently, during the map update process, we compare the outdated Gaussian map with the current LiDAR data stream to identify structural changes. Finally, we perform targeted updates to the Gaussian-map to generate an up-to-date map. We establish a benchmark for map updating on the nuScenes dataset to quantitatively evaluate our method. The experimental results show that LT-Gaussian can effectively and efficiently update the Gaussian-map, handling common environmental changes in autonomous driving scenarios. Furthermore, by taking full advantage of information from both new and old scenes, LT-Gaussian is able to produce higher quality reconstruction results compared to map update strategies that reconstruct maps from scratch. Our open-source code is available at this https URL.
[CV-127] TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
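[Quick read]: This paper addresses a limitation of existing Video Large Language Models (Video-LLMs) on Video Temporal Grounding (VTG): they process all task tokens through the same static pathway and so cannot distinguish the fundamentally different subtasks of temporal localization, saliency assessment, and text generation. The key to the solution is TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that uses dynamic routing to send different types of task tokens (such as timestamps and saliency scores) to specialized experts, modeling each subtask precisely and markedly improving VTG performance while remaining computationally efficient.
Link: https://arxiv.org/abs/2508.01699
Authors: Zuhao Yang,Yingchen Yu,Yunqing Zhao,Shijian Lu,Song Bai
Affiliations: Nanyang Technological University; ByteDance Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: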
Abstract:Video Temporal Grounding (VTG) aims to precisely identify video event segments in response to textual queries. The outputs of VTG tasks manifest as sequences of events, each defined by precise timestamps, saliency scores, and textual descriptions. Despite recent advances, a fundamental limitation persists in existing Video Large Language Models (Video-LLMs): they process all task tokens through identical and static pathways, failing to recognize that temporal localization, saliency assessment, and textual generation represent fundamentally distinct tasks requiring specialized processing. To address this, we introduce TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that effectively decomposes VTG tasks by dynamically routing task-specific tokens (e.g., timestamps, saliency scores) to specialized experts, with increased computational efficiency. Our design choices enable precise handling of each subtask, leading to improved event modeling across diverse VTG applications. Extensive experiments demonstrate that TimeExpert consistently achieves state-of-the-art performance on various VTG tasks such as Dense Video Captioning, Moment Retrieval, and Video Highlight Detection.
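The routing idea can be sketched with a generic top-1 Mixture-of-Experts layer. This is not TimeExpert's architecture: the gate weights, the three toy experts, and the one-hot token features are all hypothetical, and a real model would learn the router and use neural experts.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(token, gate_weights):
    """Pick the expert with the highest gate probability (top-1 routing)."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def moe_layer(tokens, gate_weights, experts):
    """Send each token to one expert and scale the output by its gate value."""
    outputs = []
    for token in tokens:
        expert, prob = route(token, gate_weights)
        outputs.append([prob * y for y in experts[expert](token)])
    return outputs


# Hypothetical setup: feature 0 marks timestamp tokens, 1 saliency, 2 text.
gate_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
experts = [
    lambda t: [2.0 * x for x in t],  # stand-in "timestamp" expert
    lambda t: [x + 1.0 for x in t],  # stand-in "saliency" expert
    lambda t: [-x for x in t],       # stand-in "text" expert
]
tokens = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
outputs = moe_layer(tokens, gate_weights, experts)
```

Since only one expert runs per token, compute per token stays roughly constant as experts are added, which is the usual efficiency argument for sparse MoE layers.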
[CV-128] Versatile Transition Generation with Image-to-Video Diffusion
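[Quick read]: This paper addresses the challenge of generating smooth, high-fidelity, and semantically coherent transition videos given the first and last frames of a video and a descriptive text prompt. Diffusion-based video generation performs well for automated, high-quality generation, but it falls short on this kind of conditioned transition task, where motion smoothness and generation fidelity are limited by the capabilities of pre-trained image-to-video diffusion models. The key to the solution is the VTG (Versatile Transition video Generation) framework, whose main innovations are: (1) interpolation-based initialization, which preserves object identity and handles abrupt content changes; (2) dual-directional motion fine-tuning, which improves motion continuity; and (3) representation alignment regularization, which improves visual consistency and fidelity. Together these improve the quality and plausibility of transition videos, and experiments show that VTG clearly outperforms existing methods across multiple tasks.
Link: https://arxiv.org/abs/2508.01698
Authors: Zuhao Yang,Jiahui Zhang,Yingchen Yu,Shijian Lu,Song Bai
Affiliations: Nanyang Technological University; ByteDance Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: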
Abstract:Leveraging text, images, structure maps, or motion trajectories as conditional guidance, diffusion models have achieved great success in automated and high-quality video generation. However, generating smooth and rational transition videos given the first and last video frames as well as descriptive text prompts is far underexplored. We present VTG, a Versatile Transition video Generation framework that can generate smooth, high-fidelity, and semantically coherent video transitions. VTG introduces interpolation-based initialization that helps preserve object identity and handle abrupt content changes effectively. In addition, it incorporates dual-directional motion fine-tuning and representation alignment regularization to mitigate the limitations of pre-trained image-to-video diffusion models in motion smoothness and generation fidelity, respectively. To evaluate VTG and facilitate future studies on unified transition generation, we collected TransitBench, a comprehensive benchmark for transition generation covering two representative transition tasks: concept blending and scene transition. Extensive experiments show that VTG achieves superior transition performance consistently across all four tasks.
[CV-129] Register Anything: Estimating “Corresponding Prompts” for Segment Anything Model
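[Quick read]: This paper addresses the establishment of pixel/voxel-level or region-level correspondences in medical image registration, focusing on accurate region matching without a complex training process. Traditional methods use two steps: segment the regions of interest (ROIs) in each image, then match them between the two images. The paper proposes a one-step alternative, PromptReg, whose key idea is to recast registration as a "corresponding prompt problem": for any visual Prompt X in Image X, find a corresponding Prompt Y in Image Y such that the two prompt-conditioned segmentations form a pair of corresponding ROIs. An "inverse prompt" strategy maps Prompt X into the prompt space of Image Y, and marginalizing over the prompt and spatial dimensions identifies multiple paired corresponding ROIs, giving efficient training-free registration.
Link: https://arxiv.org/abs/2508.01697
Authors: Shiqi Huang,Tingfa Xu,Wen Yan,Dean Barratt,Yipeng Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: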
Abstract:Establishing pixel/voxel-level or region-level correspondences is the core challenge in image registration. The latter, also known as region-based correspondence representation, leverages paired regions of interest (ROIs) to enable regional matching while preserving fine-grained capability at pixel/voxel level. Traditionally, this representation is implemented via two steps: segmenting ROIs in each image then matching them between the two images. In this paper, we simplify this into one step by directly “searching for corresponding prompts”, using extensively pre-trained segmentation models (e.g., SAM) for a training-free registration approach, PromptReg. Firstly, we introduce the “corresponding prompt problem”, which aims to identify a corresponding Prompt Y in Image Y for any given visual Prompt X in Image X, such that the two respectively prompt-conditioned segmentations are a pair of corresponding ROIs from the two images. Secondly, we present an “inverse prompt” solution that generates primary and optionally auxiliary prompts, inverting Prompt X into the prompt space of Image Y. Thirdly, we propose a novel registration algorithm that identifies multiple paired corresponding ROIs by marginalizing the inverted Prompt X across both prompt and spatial dimensions. Comprehensive experiments are conducted on five applications of registering 3D prostate MR, 3D abdomen MR, 3D lung CT, 2D histopathology and, as a non-medical example, 2D aerial images. Based on metrics including Dice and target registration errors on anatomical structures, the proposed registration outperforms both intensity-based iterative algorithms and learning-based DDF-predicting networks, even yielding competitive performance with weakly-supervised approaches that require fully-segmented training data.
[CV-130] SURE-Med: Systematic Uncertainty Reduction for Enhanced Reliability in Medical Report Generation
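[Quick read]: This paper addresses three sources of uncertainty that hinder clinical deployment of Automated Medical Report Generation (MRG) systems: visual uncertainty (noisy or incorrect view annotations that compromise feature extraction), label distribution uncertainty (long-tailed disease prevalence that makes models insensitive to rare but critical conditions), and contextual uncertainty (unverified historical reports that introduce factual hallucinations). The key to the solution is SURE-Med, a unified framework that reduces uncertainty systematically along these three dimensions. A Frontal-Aware View Repair Resampling module corrects view annotation errors and adaptively selects informative features from supplementary views to mitigate visual uncertainty. A Token Sensitive Learning objective strengthens the modeling of critical diagnostic sentences and reweights low-frequency diagnostic terms, improving recognition of rare diseases. A Contextual Evidence Filter validates and selects prior information consistent with the current image, suppressing hallucinations. The method achieves state-of-the-art performance on the MIMIC-CXR and IU-Xray benchmarks and markedly improves the reliability and clinical trustworthiness of MRG systems.
Link: https://arxiv.org/abs/2508.01693
Authors: Yuhang Gu,Xingyu Hu,Yuyu Fan,Xulin Yan,Longhuan Xu,Peng peng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: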
Abstract:Automated medical report generation (MRG) holds great promise for reducing the heavy workload of radiologists. However, its clinical deployment is hindered by three major sources of uncertainty. First, visual uncertainty, caused by noisy or incorrect view annotations, compromises feature extraction. Second, label distribution uncertainty, stemming from long-tailed disease prevalence, biases models against rare but clinically critical conditions. Third, contextual uncertainty, introduced by unverified historical reports, often leads to factual hallucinations. These challenges collectively limit the reliability and clinical trustworthiness of MRG systems. To address these issues, we propose SURE-Med, a unified framework that systematically reduces uncertainty across three critical dimensions: visual, distributional, and contextual. To mitigate visual uncertainty, a Frontal-Aware View Repair Resampling module corrects view annotation errors and adaptively selects informative features from supplementary views. To tackle label distribution uncertainty, we introduce a Token Sensitive Learning objective that enhances the modeling of critical diagnostic sentences while reweighting underrepresented diagnostic terms, thereby improving sensitivity to infrequent conditions. To reduce contextual uncertainty, our Contextual Evidence Filter validates and selectively incorporates prior information that aligns with the current image, effectively suppressing hallucinations. Extensive experiments on the MIMIC-CXR and IU-Xray benchmarks demonstrate that SURE-Med achieves state-of-the-art performance. By holistically reducing uncertainty across multiple input modalities, SURE-Med sets a new benchmark for reliability in medical report generation and offers a robust step toward trustworthy clinical decision support.
[CV-131] DisCo3D: Distilling Multi-View Consistency for 3D Scene Editing
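[Quick read]: This paper addresses the difficulty of maintaining multi-view consistency in 3D editing. Conventional methods refine iteratively from a single view, which leads to slow convergence and cross-view artifacts, while existing methods that propagate 2D attention features still show fine-grained inconsistencies and failure modes in complex scenes. The key to the solution is the DisCo3D framework, which distills 3D consistency priors into a 2D editor: it first fine-tunes a 3D generator on multi-view inputs to adapt to the scene, then trains a 2D editor through consistency distillation, and finally optimizes the edited multi-view outputs into a high-quality 3D representation via Gaussian Splatting, giving stable, high-fidelity 3D editing.
Link: https://arxiv.org/abs/2508.01684
Authors: Yufeng Chi,Huimin Ma,Kafeng Wang,Jianmin Li
Affiliations: Tsinghua University; National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 7 figures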
Abstract:While diffusion models have demonstrated remarkable progress in 2D image generation and editing, extending these capabilities to 3D editing remains challenging, particularly in maintaining multi-view consistency. Classical approaches typically update 3D representations through iterative refinement based on a single editing view. However, these methods often suffer from slow convergence and blurry artifacts caused by cross-view inconsistencies. Recent methods improve efficiency by propagating 2D editing attention features, yet still exhibit fine-grained inconsistencies and failure modes in complex scenes due to insufficient constraints. To address this, we propose DisCo3D, a novel framework that distills 3D consistency priors into a 2D editor. Our method first fine-tunes a 3D generator using multi-view inputs for scene adaptation, then trains a 2D editor through consistency distillation. The edited multi-view outputs are finally optimized into 3D representations via Gaussian Splatting. Experimental results show DisCo3D achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality.
[CV-132] Cure or Poison? Embedding Instructions Visually Alters Hallucination in Vision-Language Models
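[Quick read]: This paper targets hallucination in Vision-Language Models (VLMs), which stems in part from the difficulty of aligning multimodal information. The key to the solution is Prompt-in-Image, which embeds the textual instruction directly into the image, removing the need for a separate text input and forcing the model to process all content through the visual channel. Unifying processing in a single modality narrows the modality gap and improves cross-modal alignment. On Qwen2.5-VL the method improves performance (POPE accuracy rises by 4.1 percentage points) and lowers hallucination rates, whereas LLaVA-1.5 and InstructBLIP degrade severely because their CLIP encoders attend excessively to the embedded text regions, which shows that sensitivity to text embedding differs sharply across model architectures.
Link: https://arxiv.org/abs/2508.01678
Authors: Zhaochen Wang,Yiwei Wang,Yujun Cai
Affiliations: The University of Queensland; University of California, Merced
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Work in progress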
Abstract:Vision-Language Models (VLMs) often suffer from hallucination, partly due to challenges in aligning multimodal information. We propose Prompt-in-Image, a simple method that embeds textual instructions directly into images. This removes the need for separate text inputs and forces the model to process all content through the visual channel. We evaluate this method on three popular open-source VLMs: Qwen2.5-VL, LLaVA-1.5, and InstructBLIP. The results reveal sharp differences. Prompt-in-Image improves Qwen2.5-VL’s performance, increasing POPE accuracy by 4.1 percent (from 80.2 percent to 84.3 percent) and also reducing hallucination rates on MS-COCO. In contrast, LLaVA-1.5 and InstructBLIP experience a severe performance drop, with accuracy falling from around 84 percent to near-random levels. Through detailed analysis, we found that CLIP-based encoders in LLaVA and InstructBLIP exhibit excessive attention bias toward embedded text regions, disrupting visual understanding. In contrast, Qwen’s vision encoder handles text-embedded images robustly. Crucially, Prompt-in-Image reduces Qwen’s modality gap, enhancing cross-modal alignment by unifying information processing through a single modality.
[CV-133] Benchmarking Adversarial Patch Selection and Location
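[Quick read]: This paper addresses the threat that adversarial patch attacks pose to the reliability of modern vision models, and in particular the under-studied effect of patch location on attack success. The key to the solution is PatchMap, the first spatially exhaustive benchmark of patch placement, built from over 1.5e8 forward passes on ImageNet validation images, which systematically identifies vulnerable hot-spots. The authors also propose a placement heuristic guided by off-the-shelf segmentation masks that locates vulnerable regions without any gradient queries and raises attack success rates by 8 to 13 percentage points across five architectures, including an adversarially trained ResNet50.
Link: https://arxiv.org/abs/2508.01676
Authors: Shai Kimhi,Avi Mendlson,Moshe Kimhi
Affiliations: Technion
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Notes: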
Abstract:Adversarial patch attacks threaten the reliability of modern vision models. We present PatchMap, the first spatially exhaustive benchmark of patch placement, built by evaluating over 1.5e8 forward passes on ImageNet validation images. PatchMap reveals systematic hot-spots where small patches (as little as 2% of the image) induce confident misclassifications and large drops in model confidence. To demonstrate its utility, we propose a simple segmentation-guided placement heuristic that leverages off-the-shelf masks to identify vulnerable regions without any gradient queries. Across five architectures, including an adversarially trained ResNet50, our method boosts attack success rates by 8 to 13 percentage points compared to random or fixed placements. We publicly release PatchMap and the code implementation. The full PatchMap bench (6.5B predictions, multiple backbones) will be released soon to further accelerate research on location-aware defenses and adaptive attacks.
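The two ideas in this abstract, exhaustive enumeration of patch positions and mask-guided placement, can be illustrated with a small sketch. The abstract does not spell out the heuristic, so the rule used here (choose the position whose patch covers the most foreground mask pixels) is an assumption, and the function names are hypothetical.

```python
def candidate_positions(height, width, size, stride=1):
    """Enumerate every top-left corner at which a size x size patch fits."""
    return [(top, left)
            for top in range(0, height - size + 1, stride)
            for left in range(0, width - size + 1, stride)]


def mask_overlap(mask, top, left, size):
    """Count the foreground mask pixels covered by the patch."""
    return sum(mask[r][c]
               for r in range(top, top + size)
               for c in range(left, left + size))


def segmentation_guided_position(mask, size, stride=1):
    """Assumed heuristic: pick the placement with the largest mask overlap.

    No gradients are used; only a binary segmentation mask is needed.
    """
    height, width = len(mask), len(mask[0])
    return max(candidate_positions(height, width, size, stride),
               key=lambda pos: mask_overlap(mask, pos[0], pos[1], size))


# Toy 6 x 6 binary mask with a 2 x 2 object.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
positions = candidate_positions(6, 6, size=2)
best = segmentation_guided_position(mask, size=2)
```

In an exhaustive benchmark each candidate position would be scored with a model forward pass, whereas the mask-based rule needs none; that is the trade-off this sketch is meant to show.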
[CV-134] Rein++: Efficient Generalization and Adaptation for Semantic Segmentation with Vision Foundation Models
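[Quick read]: This paper addresses two challenges that Vision Foundation Models (VFMs) face in semantic segmentation: the disparity in data scale, since segmentation datasets are usually far smaller than VFM pre-training data, and domain distribution shift, since real-world segmentation scenarios are diverse and often poorly covered during pre-training. The key to the solution is Rein++, an efficient and generalizable segmentation framework with two parts. Rein-G introduces trainable instance-aware tokens and fine-tunes less than 1% of the backbone's parameters, which markedly improves feature representations and gives efficient domain generalization. Rein-A builds on Rein-G to perform unsupervised domain adaptation at both the instance and logit levels, and adds a semantic transfer module that uses the class-agnostic capability of the Segment Anything Model (SAM) to sharpen boundary details in the target domain. The pipeline first learns a generalizable model on a source domain (e.g., daytime scenes) and then adapts it to diverse target domains (e.g., nighttime scenes) without target labels, outperforming state-of-the-art methods while remaining efficient.
Link: https://arxiv.org/abs/2508.01667
Authors: Zhixiang Wei,Xiaoxiao Ma,Ruishen Yan,Tao Tu,Huaian Chen,Jinjin Zheng,Yi Jin,Enhong Chen
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: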
Abstract:Vision Foundation Models (VFMs) have achieved remarkable success in various computer vision tasks. However, their application to semantic segmentation is hindered by two significant challenges: (1) the disparity in data scale, as segmentation datasets are typically much smaller than those used for VFM pre-training, and (2) domain distribution shifts, where real-world segmentation scenarios are diverse and often underrepresented during pre-training. To overcome these limitations, we present Rein++, an efficient VFM-based segmentation framework that demonstrates superior generalization from limited data and enables effective adaptation to diverse unlabeled scenarios. Specifically, Rein++ comprises a domain generalization solution Rein-G and a domain adaptation solution Rein-A. Rein-G introduces a set of trainable, instance-aware tokens that effectively refine the VFM’s features for the segmentation task. This parameter-efficient approach fine-tunes less than 1% of the backbone’s parameters, enabling robust generalization. Building on Rein-G, Rein-A performs unsupervised domain adaptation at both the instance and logit levels to mitigate domain shifts. In addition, it incorporates a semantic transfer module that leverages the class-agnostic capabilities of the Segment Anything Model to enhance boundary details in the target domain. The integrated Rein++ pipeline first learns a generalizable model on a source domain (e.g., daytime scenes) and subsequently adapts it to diverse target domains (e.g., nighttime scenes) without any target labels. Comprehensive experiments demonstrate that Rein++ significantly outperforms state-of-the-art methods with efficient training, underscoring its role as an efficient, generalizable, and adaptive segmentation solution for VFMs, even for large models with billions of parameters. The code is available at this https URL.
[CV-135] Shape Distribution Matters: Shape-specific Mixture-of-Experts for Amodal Segmentation under Diverse Occlusions
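[Quick read]: This paper addresses the performance bottleneck in amodal segmentation caused by complex occlusions and extreme shape variation, from rigid furniture to highly deformable clothing. Existing methods use a single model for all shape types and lack the representational capacity to model diverse amodal shapes. The core of the solution is ShapeMoE, a shape-specific sparse Mixture-of-Experts (MoE) framework: it learns a latent shape distribution space, encodes each object as a compact Gaussian embedding, and uses a Shape-Aware Sparse Router to assign the object dynamically to the best-matching lightweight expert. Each expert specializes in predicting occluded regions for a particular shape pattern, which gives precise, efficient, and interpretable expert specialization and markedly improves segmentation accuracy in occluded regions.
Link: https://arxiv.org/abs/2508.01664
Authors: Zhixuan Li,Yujia Liu,Chen Hui,Jeonghaeng Lee,Sanghoon Lee,Weisi Lin
Affiliations: 1. Nanyang Technological University; 2. Tsinghua University; 3. Peking University; 4. Institute of Computing Technology, Chinese Academy of Sciences; 5. Alibaba Group; 6. Samsung Electronics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 9 pages, 4 figures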
Abstract:Amodal segmentation aims to predict complete object masks, covering both visible and occluded regions. This task poses significant challenges due to complex occlusions and extreme shape variation, from rigid furniture to highly deformable clothing. Existing one-size-fits-all approaches rely on a single model to handle all shape types, struggling to capture and reason about diverse amodal shapes due to limited representation capacity. A natural solution is to adopt a Mixture-of-Experts (MoE) framework, assigning experts to different shape patterns. However, naively applying MoE without considering the object’s underlying shape distribution can lead to mismatched expert routing and insufficient expert specialization, resulting in redundant or underutilized experts. To deal with these issues, we introduce ShapeMoE, a shape-specific sparse Mixture-of-Experts framework for amodal segmentation. The key idea is to learn a latent shape distribution space and dynamically route each object to a lightweight expert tailored to its shape characteristics. Specifically, ShapeMoE encodes each object into a compact Gaussian embedding that captures key shape characteristics. A Shape-Aware Sparse Router then maps the object to the most suitable expert, enabling precise and efficient shape-aware expert routing. Each expert is designed to be lightweight and specialized in predicting occluded regions for specific shape patterns. ShapeMoE offers good interpretability via clear shape-to-expert correspondence, while maintaining high capacity and efficiency. Experiments on COCOA-cls, KINS, and D2SA show that ShapeMoE consistently outperforms state-of-the-art methods, especially in occluded region segmentation. The code will be released.
[CV-136] Single Point Full Mask: Velocity-Guided Level Set Evolution for End-to-End Amodal Segmentation
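[Quick read]: This paper addresses two limitations in amodal segmentation: reliance on strong prompts (such as visible masks or bounding boxes), which limits practicality, and poor generalization under complex occlusion together with a lack of geometric interpretability. The core of the solution is VELA, an end-to-end velocity-driven level-set method. It builds an initial level set function from a point prompt and uses a fully differentiable network to predict a shape-specific motion field that explicitly drives contour evolution toward the final complete object mask. The design combines geometrically grounded contour modeling with topological flexibility and improves segmentation accuracy and interpretability while requiring only a single-point prompt.
Link: https://arxiv.org/abs/2508.01661
Authors: Zhixuan Li,Yujia Liu,Chen Hui,Weisi Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 9 pages, 3 figures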
Abstract:Amodal segmentation aims to recover complete object shapes, including occluded regions with no visual appearance, whereas conventional segmentation focuses solely on visible areas. Existing methods typically rely on strong prompts, such as visible masks or bounding boxes, which are costly or impractical to obtain in real-world settings. While recent approaches such as the Segment Anything Model (SAM) support point-based prompts for guidance, they often perform direct mask regression without explicitly modeling shape evolution, limiting generalization in complex occlusion scenarios. Moreover, most existing methods suffer from a black-box nature, lacking geometric interpretability and offering limited insight into how occluded shapes are inferred. To deal with these limitations, we propose VELA, an end-to-end VElocity-driven Level-set Amodal segmentation method that performs explicit contour evolution from point-based prompts. VELA first constructs an initial level set function from image features and the point input, which then progressively evolves into the final amodal mask under the guidance of a shape-specific motion field predicted by a fully differentiable network. This network learns to generate evolution dynamics at each step, enabling geometrically grounded and topologically flexible contour modeling. Extensive experiments on COCOA-cls, D2SA, and KINS benchmarks demonstrate that VELA outperforms existing strongly prompted methods while requiring only a single-point prompt, validating the effectiveness of interpretable geometric modeling under weak guidance. The code will be publicly released.
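For readers unfamiliar with level sets, the contour evolution described here can be written in the standard textbook form. This is the generic level-set formulation, given as an assumption for illustration; the abstract does not state VELA's exact update rule, and the sign convention for the interior is also assumed.

```latex
% Contour as the zero level set of \phi, transported by a velocity field V
C(t) = \{\, x : \phi(x, t) = 0 \,\}, \qquad
\frac{\partial \phi}{\partial t} + V(x, t) \cdot \nabla \phi = 0

% Explicit discretization over K steps, with V^{(k)} predicted at each step
\phi^{(k+1)} = \phi^{(k)} - \Delta t \; V^{(k)} \cdot \nabla \phi^{(k)}, \qquad k = 0, \dots, K-1

% Final mask under the assumed convention that \phi \le 0 inside the object
M = \{\, x : \phi^{(K)}(x) \le 0 \,\}
```

Because the mask is read off from the sign of the level set function rather than from a parameterized curve, the contour can split or merge during evolution, which is the usual source of the topological flexibility mentioned in the abstract.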
[CV-137] MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing
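[Quick read]: This paper addresses hallucination in Large Vision-Language Models (LVLMs), where generated content is grammatically correct but inconsistent with the input image. The key to the solution is Map-Level Attention Processing (MAP), a decoding method built on a map-level perspective: it treats the model's hidden states as a 2D semantic map and uses attention to aggregate information along global and local dimensions to improve factual consistency. Specifically, Layer-Wise Criss-Cross Attention progressively refines the token representations at each layer, and a Global-Local Logit Fusion mechanism combines predictions from different stages, which improves the truthfulness and performance of LVLMs on benchmarks such as POPE, MME, and MMHal-Bench.
Link: https://arxiv.org/abs/2508.01653
Authors: Chenxi Li,Yichen Guo,Benfang Qian,Jinhao You,Kai Tang,Yaosong Du,Zonghao Zhang,Xiande Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: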
Abstract:Large Vision-Language Models (LVLMs) have achieved impressive performance in multimodal tasks, but they still suffer from hallucinations, i.e., generating content that is grammatically accurate but inconsistent with visual inputs. In this work, we introduce a novel map-level perspective to mitigate hallucinations in LVLMs, interpreting the hidden states of the model as a 2D semantic map. We observe that factual information is widely distributed across this map, extending beyond the localized inter- or intra-layer regions targeted by most existing methods (e.g., contrastive decoding and layer-wise consistency). Building on this insight, we propose Map-Level Attention Processing (MAP), a training-free decoding method that effectively leverages factual information through attention-based map-level operations to improve factual consistency. Specifically, we employ Layer-Wise Criss-Cross Attention to progressively refine token representations at each decoding layer by aggregating tokens from both inter- and intra-layer dimensions. Additionally, a Global-Local Logit Fusion mechanism combines logits obtained before and after global attention to further refine predictions and improve accuracy. Our method consistently improves the truthfulness and performance of LVLMs across benchmarks, such as POPE, MME, and MMHal-Bench, demonstrating the potential of the map-level decoding strategy.
[CV-138] DAG: Unleash the Potential of Diffusion Model for Open-Vocabulary 3D Affordance Grounding
[Quick read]: This paper addresses the poor generalization of existing 3D object affordance grounding methods, which learn from demonstration images. The key to the solution is to extract general affordance knowledge from text-to-image diffusion models: because such models can generate semantically valid human-object interaction (HOI) images, their internal representation space appears to be highly correlated with real-world affordance concepts. The authors propose the DAG framework, which uses the frozen internal representations of the diffusion model and adds an affordance block and a multi-source affordance decoder to predict dense affordance regions on 3D object surfaces, showing strong generalization in open-world settings.
Link: https://arxiv.org/abs/2508.01651
Authors: Hanqing Wang,Zhenhao Zhang,Kaiyang Ji,Mingyu Liu,Wenti Yin,Yuchao Chen,Zhirui Liu,Xiangyu Zeng,Tianxiang Gui,Hangxing Zhang
Affiliations: 1. Tsinghua University; 2. Alibaba Group; 3. University of California, Berkeley; 4. Institute of Automation, Chinese Academy of Sciences; 5. University of Science and Technology of China; 6. Alibaba Cloud
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:3D object affordance grounding aims to predict the touchable regions on a 3D object, which is crucial for human-object interaction, human-robot interaction, embodied perception, and robot learning. Recent advances tackle this problem via learning from demonstration images. However, these methods fail to capture the general affordance knowledge within the image, leading to poor generalization. To address this issue, we propose to use text-to-image diffusion models to extract the general affordance knowledge because we find that such models can generate semantically valid HOI images, which demonstrates that their internal representation space is highly correlated with real-world affordance concepts. Specifically, we introduce DAG, a diffusion-based 3D affordance grounding framework, which leverages the frozen internal representations of the text-to-image diffusion model and unlocks affordance knowledge within the diffusion model to perform 3D affordance grounding. We further introduce an affordance block and a multi-source affordance decoder to enable dense 3D affordance prediction. Extensive experimental evaluations show that our model excels over well-established methods and exhibits open-world generalization.
[CV-139] StrandDesigner: Towards Practical Strand Generation with Sketch Guidance
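[Quick read]: This paper addresses the limited precision and user-friendliness of generative models for realistic hair strand generation in virtual reality and computer graphics. Existing methods conditioned on text or images offer little fine-grained control, which limits their practical value. The key to the solution is the first sketch-based strand generation model, with two main technical contributions: a learnable strand upsampling strategy that encodes 3D strands into multi-scale latent spaces to model complex strand interactions, and a transformer-based multi-scale adaptive conditioning mechanism with diffusion heads that keeps results consistent across granularity levels. The method improves realism and control precision and outperforms existing approaches on several benchmark datasets.
Link: https://arxiv.org/abs/2508.01650
Authors: Na Zhang,Moran Li,Chengming Xu,Han Feng,Xiaobin Hu,Jiangning Zhang,Weijian Cao,Chengjie Wang,Yanwei Fu
Affiliations: Fudan University; Tencent YouTu Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to ACM Multimedia 2025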
Abstract:Realistic hair strand generation is crucial for applications like computer graphics and virtual reality. While diffusion models can generate hairstyles from text or images, these inputs lack precision and user-friendliness. Instead, we propose the first sketch-based strand generation model, which offers finer control while remaining user-friendly. Our framework tackles key challenges, such as modeling complex strand interactions and diverse sketch patterns, through two main innovations: a learnable strand upsampling strategy that encodes 3D strands into multi-scale latent spaces, and a multi-scale adaptive conditioning mechanism using a transformer with diffusion heads to ensure consistency across granularity levels. Experiments on several benchmark datasets show our method outperforms existing approaches in realism and precision. Qualitative results further confirm its effectiveness. Code will be released at [GitHub](this https URL).
[CV-140] Minimal High-Resolution Patches Are Sufficient for Whole Slide Image Representation via Cascaded Dual-Scale Reconstruction
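[Quick read]: This paper addresses the challenges of whole-slide image (WSI) analysis that arise from gigapixel image scale and sparsely distributed diagnostic regions. Existing Multiple Instance Learning (MIL) methods focus on aggregator design and neglect how well the feature extractor fits the medical imaging domain, which leads to a domain gap and suboptimal representations. The key to the solution is the Cascaded Dual-Scale Reconstruction (CDSR) framework. A two-stage selective sampling strategy identifies, from both model-based and semantic perspectives, about 9 of the most informative high-resolution patches per WSI, and a Local-to-Global Network fuses fine-grained local detail with global context to reconstruct a spatially coherent high-resolution representation. Compared with dense-sampling or self-supervised pipelines, CDSR is more efficient and preserves morphology better: using on average only 7,070 high-resolution patches per dataset (4.5% of the total), it improves accuracy by 6.3% and area under the ROC curve by 5.5% on Camelyon16, TCGA-NSCLC, and TCGA-RCC, outperforming state-of-the-art methods trained on over ten million patches.
Link: https://arxiv.org/abs/2508.01641
Authors: Yujian Liu,Yuechuan Lin,Dongxu Shen,Haoran Li,Yutong Wang,Xiaoli Liu,Shidang Xu
Affiliations: 1. Tsinghua University; 2. Chinese Academy of Sciences; 3. University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 11 pages, 4 figures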
Abstract:Whole-slide image (WSI) analysis remains challenging due to the gigapixel scale and sparsely distributed diagnostic regions. Multiple Instance Learning (MIL) mitigates this by modeling the WSI as bags of patches for slide-level prediction. However, most MIL approaches emphasize aggregator design while overlooking the impact of the feature extractor, which is often pretrained on natural images. This leads to a domain gap and suboptimal representations. Self-supervised learning (SSL) has shown promise in bridging the domain gap via pretext tasks, but it still primarily builds upon generic backbones, thus requiring WSIs to be split into small patches. This inevitably splits histological structures and generates both redundant and interdependent patches, which in turn degrades aggregator performance and drastically increases training costs. To address this challenge, we propose a Cascaded Dual-Scale Reconstruction (CDSR) framework, demonstrating that an average of only 9 high-resolution patches per WSI is sufficient for robust slide-level representation. CDSR employs a two-stage selective sampling strategy that identifies the most informative representative regions from both model-based and semantic perspectives. These patches are then fed into a Local-to-Global Network, which reconstructs spatially coherent high-resolution WSI representations by integrating fine-grained local detail with global contextual information. Unlike existing dense-sampling or SSL pipelines, CDSR is optimized for efficiency and morphological fidelity. Experiments on Camelyon16, TCGA-NSCLC, and TCGA-RCC demonstrate that CDSR achieves improvements of 6.3% in accuracy and 5.5% in area under ROC curve on downstream classification tasks with only 7,070 (4.5% of total) high-resolution patches per dataset on average, outperforming state-of-the-art methods trained on over 10,000,000 patches.
[CV-141] Glass Surface Segmentation with an RGB-D Camera via Weighted Feature Fusion for Service Robots
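[Quick read]: This paper addresses glass surface segmentation with an RGB-D camera, in particular the loss of accuracy in scenes with transparency, reflections, and occlusions. The core of the solution is a Weighted Feature Fusion (WFF) module that dynamically and adaptively fuses RGB and depth information to improve recognition of glass surfaces. WFF can be integrated as a plug-and-play component into various deep neural network backbones and improves segmentation performance, raising boundary IoU (bIoU) by 7.49% when combined with PSPNet, which supports its effectiveness and generality.
Link: https://arxiv.org/abs/2508.01639
Authors: Henghong Lin,Zihan Zhu,Tao Wang,Anastasia Ioannou,Yuanshui Huang
Affiliations: Minjiang University; European University Cyprus; Fujian Hantewin Intelligent Technology Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Paper accepted by 6th International Conference on Computer Vision, Image and Deep Learning (CVIDL 2025)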
Abstract:We address the problem of glass surface segmentation with an RGB-D camera, with a focus on effectively fusing RGB and depth information. To this end, we propose a Weighted Feature Fusion (WFF) module that dynamically and adaptively combines RGB and depth features to tackle issues such as transparency, reflections, and occlusions. This module can be seamlessly integrated with various deep neural network backbones as a plug-and-play solution. Additionally, we introduce the MJU-Glass dataset, a comprehensive RGB-D dataset collected by a service robot navigating real-world environments, providing a valuable benchmark for evaluating segmentation models. Experimental results show significant improvements in segmentation accuracy and robustness, with the WFF module enhancing performance in both mean Intersection over Union (mIoU) and boundary IoU (bIoU), achieving a 7.49% improvement in bIoU when integrated with PSPNet. The proposed module and dataset provide a robust framework for advancing glass surface segmentation in robotics and reducing the risk of collisions with glass objects.
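The idea of adaptively weighting RGB against depth features can be sketched as a gated blend. The abstract does not describe the internals of the WFF module, so the per-channel sigmoid gate below is an assumption, and `weighted_fusion` and `gate_logits` are hypothetical names; in a real network the gate values would be produced by learned layers.

```python
import math


def sigmoid(x):
    """Map a real-valued gate logit to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def weighted_fusion(rgb_feat, depth_feat, gate_logits):
    """Blend two feature vectors channel by channel.

    Assumed form: fused = w * rgb + (1 - w) * depth, with w = sigmoid(gate).
    """
    fused = []
    for r, d, g in zip(rgb_feat, depth_feat, gate_logits):
        w = sigmoid(g)  # weight given to the RGB channel
        fused.append(w * r + (1.0 - w) * d)
    return fused


rgb = [1.0, 2.0, 3.0]
depth = [3.0, 2.0, 1.0]
# Gate 0.0 weights both inputs equally; a strongly negative gate favors depth.
fused = weighted_fusion(rgb, depth, [0.0, 5.0, -5.0])
```

A gate of this kind lets the model lean on depth where RGB is unreliable (for example on reflections) and on RGB where the depth sensor fails on transparent surfaces, which is the motivation the abstract gives for adaptive fusion.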
[CV-142] Rate-distortion Optimized Point Cloud Preprocessing for Geometry-based Point Cloud Compression
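[Quick read]: This paper addresses the gap in compression efficiency between geometry-based point cloud compression (G-PCC) and deep learning-based point cloud compression (PCC), while preserving G-PCC's low computational cost, cross-platform interoperability, and coding flexibility. The key to the solution is a preprocessing framework that combines a compression-oriented voxelization network with a differentiable G-PCC surrogate model, jointly optimized during training. The surrogate mimics the rate-distortion behaviour of the non-differentiable G-PCC codec and so allows end-to-end gradient propagation, while the voxelization network adapts the input point cloud through learned voxelization, global scaling, fine-grained pruning, and point-level editing to optimize the rate-distortion trade-off. At inference only the lightweight voxelization network is attached to the G-PCC encoder and the decoder is unchanged, so end users incur no extra computation. Experiments show an average BD-rate reduction of 38.84%. The approach offers a practical way to improve a legacy compression standard while keeping compatibility.
Link: https://arxiv.org/abs/2508.01633
Authors: Wanhao Ma,Wei Zhang,Shuai Wan,Fuzheng Yang
Affiliations: Xidian University; Pengcheng Laboratory; Northwestern Polytechnical University; Royal Melbourne Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Notes: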
Abstract:Geometry-based point cloud compression (G-PCC), an international standard designed by MPEG, provides a generic framework for compressing diverse types of point clouds while ensuring interoperability across applications and devices. However, G-PCC underperforms compared to recent deep learning-based PCC methods despite its lower computational power consumption. To enhance the efficiency of G-PCC without sacrificing its interoperability or computational flexibility, we propose a novel preprocessing framework that integrates a compression-oriented voxelization network with a differentiable G-PCC surrogate model, jointly optimized in the training phase. The surrogate model mimics the rate-distortion behaviour of the non-differentiable G-PCC codec, enabling end-to-end gradient propagation. The versatile voxelization network adaptively transforms input point clouds using learning-based voxelization and effectively manipulates point clouds via global scaling, fine-grained pruning, and point-level editing for rate-distortion trade-offs. During inference, only the lightweight voxelization network is appended to the G-PCC encoder, requiring no modifications to the decoder, thus introducing no computational overhead for end users. Extensive experiments demonstrate a 38.84% average BD-rate reduction over G-PCC. By bridging classical codecs with deep learning, this work offers a practical pathway to enhance legacy compression standards while preserving their backward compatibility, making it ideal for real-world deployment.
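The joint optimization described here is typically expressed as a rate-distortion Lagrangian. The form below is the generic objective, written as an assumption to illustrate how a differentiable surrogate makes training possible; the symbols f_theta (voxelization network), S_phi (surrogate codec), and lambda are illustrative and the abstract does not give the paper's exact loss.

```latex
% Preprocessed point cloud and surrogate estimates of reconstruction and rate
\tilde{P} = f_\theta(P), \qquad
(\hat{P}, \hat{R}) = S_\phi(\tilde{P})

% Rate-distortion objective minimized over the preprocessing parameters \theta
\min_{\theta} \; \mathbb{E}_{P}\!\left[ \, D\big(P, \hat{P}\big) + \lambda \, \hat{R} \, \right]

% Gradients reach \theta only because S_\phi is differentiable:
\frac{\partial \mathcal{L}}{\partial \theta}
  = \frac{\partial \mathcal{L}}{\partial \tilde{P}} \,
    \frac{\partial \tilde{P}}{\partial \theta}
```

Larger values of lambda penalize bitrate more heavily and push the preprocessing toward coarser point clouds, while smaller values favor fidelity; sweeping lambda traces out the rate-distortion curve from which a BD-rate figure is computed.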
[CV-143] IMU: Influence-guided Machine Unlearning
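[Quick read]: This paper addresses the fact that most existing machine unlearning (MU) methods rely on fine-tuning with the retain set, which is hard to arrange under privacy and storage constraints. The key to the solution is Influence-guided Machine Unlearning (IMU), which needs no access to retain data: it applies gradient ascent and dynamically allocates unlearning intensity according to the influence of each data point, so that samples in the forget set are forgotten selectively and efficiently while model utility is maintained.
Link: https://arxiv.org/abs/2508.01620
Authors: Xindi Fan,Jing Wu,Mingyi Zhou,Pengwei Liang,Dinh Phung
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Notes: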
Abstract:Recent studies have shown that deep learning models are vulnerable to attacks and tend to memorize training data points, raising significant concerns about privacy leakage. This motivates the development of machine unlearning (MU), i.e., a paradigm that enables models to selectively forget specific data points upon request. However, most existing MU algorithms require partial or full fine-tuning on the retain set. This necessitates continued access to the original training data, which is often impractical due to privacy concerns and storage constraints. A few retain-data-free MU methods have been proposed, but some rely on access to auxiliary data and precomputed statistics of the retain set, while others scale poorly when forgetting larger portions of data. In this paper, we propose Influence-guided Machine Unlearning (IMU), a simple yet effective method that conducts MU using only the forget set. Specifically, IMU employs gradient ascent and innovatively introduces dynamic allocation of unlearning intensities across different data points based on their influences. This adaptive strategy significantly enhances unlearning effectiveness while maintaining model utility. Results across vision and language tasks demonstrate that IMU consistently outperforms existing retain-data-free MU methods.
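A minimal sketch of gradient ascent with per-sample unlearning intensities may help make the idea concrete. The abstract does not say how influence is estimated or how intensities are derived from it, so here the influence scores are simply given as inputs and normalized into weights; the 1-D least-squares model and all function names are hypothetical.

```python
def influence_weights(influences):
    """Normalize non-negative influence scores into weights that sum to 1."""
    total = sum(influences)
    if total == 0:
        return [1.0 / len(influences)] * len(influences)
    return [score / total for score in influences]


def forget_loss(theta, forget_set):
    """Mean squared-error loss of the model y ~ theta * x on the forget set."""
    return sum(0.5 * (theta * x - y) ** 2 for x, y in forget_set) / len(forget_set)


def unlearn_step(theta, forget_set, influences, lr=0.1):
    """One gradient-ascent step using only the forget set.

    Each sample's gradient is scaled by its normalized influence, so highly
    influential points are unlearned more aggressively (assumed scheme).
    """
    weights = influence_weights(influences)
    grad = 0.0
    for (x, y), w in zip(forget_set, weights):
        grad += w * (theta * x - y) * x  # d/dtheta of 0.5 * (theta*x - y)^2
    return theta + lr * grad  # ascent: move so the forget loss increases


forget = [(1.0, 2.0), (2.0, 4.5)]
theta0 = 1.5
theta1 = unlearn_step(theta0, forget, influences=[1.0, 3.0])
```

In practice the step size and the number of ascent steps would need to be bounded, since unconstrained ascent on the forget set eventually damages performance on the data the model should still handle.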
[CV-144] LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding
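[Quick read]: This paper addresses the dominance of autoregressive models (ARMs) among biomedical vision-language models (VLMs) and explores the potential of masked diffusion models for biomedical image understanding. The key to the solution is LLaDA-MedV, the first large language diffusion model tailored to biomedical image understanding, trained through vision instruction tuning. On the open-ended biomedical visual conversation task it achieves relative gains of 7.855% over LLaVA-Med and 1.867% over LLaDA-V, sets new state-of-the-art accuracy on the closed-form subsets of three visual question answering (VQA) benchmarks, and can generate longer, more informative outputs by explicitly controlling response length.
Link: https://arxiv.org/abs/2508.01617
Authors: Xuanzhao Dong,Wenhui Zhu,Xiwen Chen,Zhipeng Wang,Peijie Qiu,Shao Tang,Xin Li,Yalin Wang
Affiliations: Arizona State University; Clemson University; LinkedIn Corporation; Washington University in St. Louis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: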
Abstract:Autoregressive models (ARMs) have long dominated the landscape of biomedical vision-language models (VLMs). Recently, masked diffusion models such as LLaDA have emerged as promising alternatives, yet their application in the biomedical domain remains largely underexplored. To bridge this gap, we introduce LLaDA-MedV, the first large language diffusion model tailored for biomedical image understanding through vision instruction tuning. LLaDA-MedV achieves relative performance gains of 7.855% over LLaVA-Med and 1.867% over LLaDA-V in the open-ended biomedical visual conversation task, and sets new state-of-the-art accuracy on the closed-form subset of three VQA benchmarks: 84.93% on VQA-RAD, 92.31% on SLAKE, and 95.15% on PathVQA. Furthermore, a detailed comparison with LLaVA-Med suggests that LLaDA-MedV is capable of generating reasonably longer responses by explicitly controlling response length, which can lead to more informative outputs. We also conduct an in-depth analysis of both the training and inference stages, highlighting the critical roles of initialization weight selection, fine-tuning strategies, and the interplay between sampling steps and response repetition. The code and model weights are released at this https URL.
[CV-145] From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models
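[Quick read]: This paper addresses the lack of systematic evaluation of image geolocalization and spatial reasoning in large language models (LLMs). The key to the solution is IMAGEO-Bench, a benchmark that quantifies accuracy, distance error, geospatial bias, and reasoning process, and that is used to evaluate 10 state-of-the-art LLMs on three diverse datasets covering global street scenes, points of interest (POIs) in the United States, and unseen images. The study finds that closed-source models generally reason better about space and identifies clear geospatial bias: models perform better in high-resource regions (e.g., North America, Western Europe, and California) and worse in underrepresented areas. It also finds that successful geolocalization depends mainly on recognizing urban settings, outdoor environments, street-level imagery, and identifiable landmarks.
Link: https://arxiv.org/abs/2508.01608
Authors: Lingyao Li,Runlong Yu,Qikai Hu,Bowei Li,Min Deng,Yang Zhou,Xiaowei Jia
Affiliations: University of South Florida; University of Alabama; University of Michigan; Texas Tech University; Texas A&M University; University of Pittsburgh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: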
Abstract:Image geolocalization, the task of identifying the geographic location depicted in an image, is important for applications in crisis response, digital forensics, and location-based intelligence. While recent advances in large language models (LLMs) offer new opportunities for visual reasoning, their ability to perform image geolocalization remains underexplored. In this study, we introduce a benchmark called IMAGEO-Bench that systematically evaluates accuracy, distance error, geospatial bias, and reasoning process. Our benchmark includes three diverse datasets covering global street scenes, points of interest (POIs) in the United States, and a private collection of unseen images. Through experiments on 10 state-of-the-art LLMs, including both open- and closed-source models, we reveal clear performance disparities, with closed-source models generally showing stronger reasoning. Importantly, we uncover geospatial biases as LLMs tend to perform better in high-resource regions (e.g., North America, Western Europe, and California) while exhibiting degraded performance in underrepresented areas. Regression diagnostics demonstrate that successful geolocalization is primarily dependent on recognizing urban settings, outdoor environments, street-level imagery, and identifiable landmarks. Overall, IMAGEO-Bench provides a rigorous lens into the spatial reasoning capabilities of LLMs and offers implications for building geolocation-aware AI systems.
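Distance error for geolocalization is commonly measured as the great-circle distance between the predicted and true coordinates. The sketch below uses the haversine formula with a mean Earth radius; whether IMAGEO-Bench uses exactly this formula is an assumption, and the 25 km threshold in `accuracy_within` is an illustrative value, not one taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_within(predictions, truths, threshold_km=25.0):
    """Fraction of predictions whose distance error is at most threshold_km."""
    hits = sum(1 for (plat, plon), (tlat, tlon) in zip(predictions, truths)
               if haversine_km(plat, plon, tlat, tlon) <= threshold_km)
    return hits / len(truths)


# Example: a prediction of London for an image taken in Paris.
paris = (48.8566, 2.3522)
london = (51.5074, -0.1278)
error_km = haversine_km(*paris, *london)
acc = accuracy_within([london, paris], [paris, paris])
```

Reporting both a distance error and a thresholded accuracy is useful because the mean error is dominated by a few wildly wrong guesses, while thresholded accuracy shows how often the model is usefully close.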
[CV-146] Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning
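[Quick read]: This paper addresses the difficulty of detecting fake images produced by unseen generators. State-of-the-art methods usually adapt pre-trained foundation models through partial-parameter fine-tuning, but such models generalize poorly to forgeries from unknown sources. The key to the solution is Image-Adaptive Prompt Learning (IAPL), a framework with two adaptive modules. The Conditional Information Learner uses CNN feature extractors to learn forgery-specific and image-specific conditions and passes them to learnable prompt tokens through a gating mechanism. The Confidence-Driven Adaptive Prediction module optimizes the shallowest prompt tokens on a single test sample and selects the cropped view with the highest prediction confidence for the final decision. Because the prompts are adjusted to each input image rather than fixed after training, the model adapts better to diverse forged images.
Link: https://arxiv.org/abs/2508.01603
Authors: Yiheng Li,Zichang Tan,Zhen Lei,Xu Zhou,Yang Yang
Affiliations: 1. Institute of Automation, Chinese Academy of Sciences; 2. University of Chinese Academy of Sciences; 3. Tencent AI Lab; 4. School of Information Science and Technology, Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: under review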
Abstract:A major challenge for AI-generated image detection is identifying fake images from unseen generators. Existing cutting-edge methods typically customize pre-trained foundation models to this task via partial-parameter fine-tuning. However, these parameters trained on a narrow range of generators may fail to generalize to unknown sources. In light of this, we propose a novel framework named Image-Adaptive Prompt Learning (IAPL), which enhances flexibility in processing diverse testing images. It consists of two adaptive modules, i.e., the Conditional Information Learner and the Confidence-Driven Adaptive Prediction. The former employs CNN-based feature extractors to learn forgery-specific and image-specific conditions, which are then propagated to learnable tokens via a gated mechanism. The latter optimizes the shallowest learnable tokens based on a single test sample and selects the cropped view with the highest prediction confidence for final detection. These two modules enable the prompts fed into the foundation model to be automatically adjusted based on the input image, rather than being fixed after training, thereby enhancing the model’s adaptability to various forged images. Extensive experiments show that IAPL achieves state-of-the-art performance, with 95.61% and 96.7% mean accuracy on the two widely used UniversalFakeDetect and GenImage datasets, respectively.
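The final selection step, choosing the cropped view with the highest prediction confidence, can be sketched in a few lines. The definition of confidence used here (distance of the fake-probability from 0.5, i.e. max(p, 1 - p)) is an assumption, the function names are hypothetical, and the test-time optimization of prompt tokens that precedes this step is omitted.

```python
def most_confident_view(fake_probs):
    """Return (index, probability) of the crop whose prediction is most confident.

    Confidence is taken as max(p, 1 - p) for a binary real/fake output
    (an assumed definition).
    """
    index = max(range(len(fake_probs)),
                key=lambda i: max(fake_probs[i], 1.0 - fake_probs[i]))
    return index, fake_probs[index]


def detect(fake_probs, threshold=0.5):
    """Label the image using only its most confident cropped view."""
    _, prob = most_confident_view(fake_probs)
    return "fake" if prob >= threshold else "real"


# Fake-probabilities predicted for three crops of the same image.
views = [0.55, 0.10, 0.70]
chosen = most_confident_view(views)
label = detect(views)
```

Note that the most confident view need not agree with the majority of views: here two of three crops lean toward "fake", yet the decision follows the single crop with the sharpest prediction.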
[CV-147] Enhancing Zero-Shot Brain Tumor Subtype Classification via Fine-Grained Patch-Text Alignment
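Quick read: This paper tackles fine-grained classification of brain tumor subtypes from histopathological whole slide images (WSIs), which is hard because morphological differences are subtle and annotated data is scarce. Existing vision-language models show promise for zero-shot classification but capture fine-grained pathological features poorly, which limits subtype discrimination. The proposed zero-shot framework, Fine-Grained Patch Alignment Network (FG-PAN), has two key modules: (1) a local feature refinement module that enhances patch-level visual features by modeling spatial relationships among representative patches, and (2) a fine-grained text description generation module that uses large language models (LLMs) to produce pathology-aware, class-specific semantic prototypes. Aligning the refined visual features with the LLM-generated descriptions increases class separability in both the visual and semantic spaces, yielding more accurate zero-shot subtype classification.
Link: https://arxiv.org/abs/2508.01602
Authors: Lubin Gan,Jing Zhang,Linhao Qu,Yijun Wang,Siying Wu,Xiaoyan Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: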
Abstract:The fine-grained classification of brain tumor subtypes from histopathological whole slide images is highly challenging due to subtle morphological variations and the scarcity of annotated data. Although vision-language models have enabled promising zero-shot classification, their ability to capture fine-grained pathological features remains limited, resulting in suboptimal subtype discrimination. To address these challenges, we propose the Fine-Grained Patch Alignment Network (FG-PAN), a novel zero-shot framework tailored for digital pathology. FG-PAN consists of two key modules: (1) a local feature refinement module that enhances patch-level visual features by modeling spatial relationships among representative patches, and (2) a fine-grained text description generation module that leverages large language models to produce pathology-aware, class-specific semantic prototypes. By aligning refined visual features with LLM-generated fine-grained descriptions, FG-PAN effectively increases class separability in both visual and semantic spaces. Extensive experiments on multiple public pathology datasets, including EBRAINS and TCGA, demonstrate that FG-PAN achieves state-of-the-art performance and robust generalization in zero-shot brain tumor subtype classification.
[CV-148] CLIMD: A Curriculum Learning Framework for Imbalanced Multimodal Diagnosis MICCAI2025
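Quick read: This paper addresses class imbalance in multimodal medical diagnosis, which makes it hard to learn the features of minority classes adequately. Resampling and loss reweighting are widely used but are prone to overfitting or underfitting and do not capture cross-modal interactions. The proposed Curriculum Learning framework for Imbalanced Multimodal Diagnosis (CLIMD) has two components: (1) a multimodal curriculum measurer that combines intra-modal confidence and inter-modal complementarity so that the model focuses on key samples and gradually adapts to complex category distributions; and (2) a class distribution-guided training scheduler that lets the model progressively adapt to the imbalanced class distribution during training. On multiple multimodal medical datasets the method outperforms state-of-the-art approaches, and as a plug-and-play framework it can be integrated into other models.
Link: https://arxiv.org/abs/2508.01594
Authors: Kai Han,Chongwen Lyu,Lele Ma,Chengxuan Qian,Siqi Ma,Zheng Pang,Jun Chen,Zhe Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025 Early Accept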
Abstract:Clinicians usually combine information from multiple sources to achieve the most accurate diagnosis, and this has sparked increasing interest in leveraging multimodal deep learning for diagnosis. However, in real clinical scenarios, due to differences in incidence rates, multimodal medical data commonly face the issue of class imbalance, which makes it difficult to adequately learn the features of minority classes. Most existing methods tackle this issue with resampling or loss reweighting, but they are prone to overfitting or underfitting and fail to capture cross-modal interactions. Therefore, we propose a Curriculum Learning framework for Imbalanced Multimodal Diagnosis (CLIMD). Specifically, we first design a multimodal curriculum measurer that combines two indicators, intra-modal confidence and inter-modal complementarity, to enable the model to focus on key samples and gradually adapt to complex category distributions. Additionally, a class distribution-guided training scheduler is introduced, which enables the model to progressively adapt to the imbalanced class distribution during training. Extensive experiments on multiple multimodal medical datasets demonstrate that the proposed method outperforms state-of-the-art approaches across various metrics and excels in handling imbalanced multimodal medical data. Furthermore, as a plug-and-play CL framework, CLIMD can be easily integrated into other models, offering a promising path for improving multimodal disease diagnosis accuracy. Code is publicly available at this https URL.
[CV-149] DMTrack: Spatio-Temporal Multimodal Tracking via Dual-Adapter
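Quick read: This paper targets the difficulty of cross-modal feature fusion and the low parameter efficiency in spatio-temporal multimodal tracking. It proposes DMTrack, a dual-adapter architecture built from two modules. The spatio-temporal modality adapter (STMA) is applied to each modality separately and adjusts the spatio-temporal features extracted by a frozen backbone through self-prompting, which narrows the gap between modalities and eases cross-modal fusion. The progressive modality complementary adapter (PMCA) consists of pixel-wise shallow and deep adapters: the shallow adapter shares parameters across the two modalities to bridge the information flow between the modality branches, while the deep adapter modulates the preliminarily fused information with pixel-wise inner-modal attention and generates modality-aware prompts through pixel-wise inter-modal attention. With only about 0.93M trainable parameters, DMTrack reports state-of-the-art results on five benchmarks.
Link: https://arxiv.org/abs/2508.01592
Authors: Weihong Li,Shaohua Dong,Haonan Lu,Yanhao Zhang,Heng Fan,Libo Zhang
Affiliations: Hangzhou Institute for Advanced Study; Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; University of North Texas; OPPO Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: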
Abstract:In this paper, we explore adapter tuning and introduce a novel dual-adapter architecture for spatio-temporal multimodal tracking, dubbed DMTrack. The key to DMTrack lies in two simple yet effective modules: a spatio-temporal modality adapter (STMA) and a progressive modality complementary adapter (PMCA). The former, applied to each modality alone, aims to adjust spatio-temporal features extracted from a frozen backbone by self-prompting, which to some extent can bridge the gap between different modalities and thus allows better cross-modality fusion. The latter seeks to facilitate cross-modality prompting progressively with two specially designed pixel-wise shallow and deep adapters. The shallow adapter employs shared parameters between the two modalities, aiming to bridge the information flow between the two modality branches, thereby laying the foundation for subsequent modality fusion, while the deep adapter modulates the preliminarily fused information flow with pixel-wise inner-modal attention and further generates modality-aware prompts through pixel-wise inter-modal attention. With such designs, DMTrack achieves promising spatio-temporal multimodal tracking performance with merely 0.93M trainable parameters. Extensive experiments on five benchmarks show that DMTrack achieves state-of-the-art results. Code will be available.
[CV-150] Self-Navigated Residual Mamba for Universal Industrial Anomaly Detection AAAI2026
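Quick read: This paper addresses the heavy reliance of industrial anomaly detection (IAD) on pre-trained features and the resulting lack of discriminative power, in particular the difficulty of localizing anomalies precisely when no useful reference is available at test time. The proposed Self-Navigated Residual Mamba (SNARM) framework introduces self-referential learning: it selects reference patches from within the test image to compute intra-residuals, combines them with inter-residuals computed against the training feature bank, and feeds both into a multi-head Mamba module whose heads are dynamically navigated by residual properties, strengthening the discriminative signal in anomalous regions. The method reports improved performance on the MVTec AD, MVTec 3D, and VisA benchmarks.
Link: https://arxiv.org/abs/2508.01591
Authors: Hanxi Li,Jingqi Wu,Lin Yuanbo Wu,Mingliang Li,Deyin Liu,Jialie Shen,Chunhua Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 3 figures, submitted to AAAI 2026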
Abstract:In this paper, we propose Self-Navigated Residual Mamba (SNARM), a novel framework for universal industrial anomaly detection that leverages "self-referential learning" within test images to enhance anomaly discrimination. Unlike conventional methods that depend solely on pre-trained features from normal training data, SNARM dynamically refines anomaly detection by iteratively comparing test patches against adaptively selected in-image references. Specifically, we first compute the "inter-residuals" features by contrasting test image patches with the training feature bank. Patches exhibiting small-norm residuals (indicating high normality) are then utilized as self-generated reference patches to compute "intra-residuals", amplifying discriminative signals. These inter- and intra-residual features are concatenated and fed into a novel Mamba module with multiple heads, which are dynamically navigated by residual properties to focus on anomalous regions. Finally, AD results are obtained by aggregating the outputs of a self-navigated Mamba in an ensemble learning paradigm. Extensive experiments on MVTec AD, MVTec 3D, and VisA benchmarks demonstrate that SNARM achieves state-of-the-art (SOTA) performance, with notable improvements in all metrics, including Image-AUROC, Pixel-AURC, PRO, and AP.
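The inter- and intra-residual computation described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions: patches are plain feature vectors, the residual is taken against the Euclidean nearest neighbour, and the helper names and the `num_refs` parameter are hypothetical rather than taken from the paper.

```python
import math


def nearest_residual(vec, bank):
    """Residual between vec and its Euclidean nearest neighbour in bank."""
    nearest = min(bank, key=lambda b: math.dist(vec, b))
    return [v - n for v, n in zip(vec, nearest)]


def norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def inter_and_intra_residuals(patches, train_bank, num_refs):
    # Inter-residuals: compare each test patch with the training feature bank.
    inter = [nearest_residual(p, train_bank) for p in patches]
    # Patches with the smallest inter-residual norm are treated as normal
    # and reused as in-image references.
    order = sorted(range(len(patches)), key=lambda i: norm(inter[i]))
    refs = [patches[i] for i in order[:num_refs]]
    # Intra-residuals: compare each test patch with those in-image references.
    intra = [nearest_residual(p, refs) for p in patches]
    return inter, intra


train_bank = [[0.0, 0.0], [1.0, 1.0]]               # hypothetical normal features
patches = [[0.1, 0.0], [1.0, 0.9], [3.0, 3.0]]      # the last patch is the outlier
inter, intra = inter_and_intra_residuals(patches, train_bank, num_refs=2)
```

In this toy case the two near-normal patches become the references, so their intra-residuals are zero while the outlier keeps a large residual in both views. The Mamba-based aggregation that follows in the paper is not shown.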
[CV-151] A Plug-and-Play Multi-Criteria Guidance for Diverse In-Betweening Human Motion Generation
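Quick read: This paper addresses insufficient diversity in in-betweening human motion generation, in particular how to make the motions within a batch sampled from a pretrained generative model differ meaningfully from one another despite complex motion dynamics. The proposed Multi-Criteria Guidance with In-Betweening Motion Model (MCG-IMM) reformulates the sampling process of the pretrained model as a multi-criteria optimization problem and introduces an optimization process, with no additional parameters, that searches for motion sequences satisfying several criteria at once, such as diversity and smoothness. The guidance is plug-and-play and compatible with denoising diffusion probabilistic models, variational autoencoders, and generative adversarial networks.
Link: https://arxiv.org/abs/2508.01590
Authors: Hua Yu,Jiao Liu,Xu Gui,Melvin Wong,Yaqing Hou,Yew-Soon Ong
Affiliations: Nanyang Technological University; Dalian University of Technology; Institute of High Performance Computing
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: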
Abstract:In-betweening human motion generation aims to synthesize intermediate motions that transition between user-specified keyframes. In addition to maintaining smooth transitions, a crucial requirement of this task is to generate diverse motion sequences. It is still challenging to maintain diversity, particularly when the motions within a generated batch must differ meaningfully from one another despite complex motion dynamics. In this paper, we propose a novel method, termed the Multi-Criteria Guidance with In-Betweening Motion Model (MCG-IMM), for in-betweening human motion generation. A key strength of MCG-IMM lies in its plug-and-play nature: it enhances the diversity of motions generated by pretrained models without introducing additional parameters. This is achieved by providing the sampling process of pretrained generative models with multi-criteria guidance. Specifically, MCG-IMM reformulates the sampling process of a pretrained generative model as a multi-criteria optimization problem, and introduces an optimization process to explore motion sequences that satisfy multiple criteria, e.g., diversity and smoothness. Moreover, our proposed plug-and-play multi-criteria guidance is compatible with different families of generative models, including denoising diffusion probabilistic models, variational autoencoders, and generative adversarial networks. Experiments on four popular human motion datasets demonstrate that MCG-IMM consistently outperforms state-of-the-art methods in the in-betweening motion generation task.
[CV-152] Lifelong Person Re-identification via Privacy-Preserving Data Replay
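Quick read: This paper addresses continual learning under domain shift in lifelong person re-identification (LReID), and specifically the data privacy risk that replay-based methods incur by storing raw historical samples. The proposed Privacy-Preserving Replay (Pr²R) distills the training characteristics of multiple real images into a single condensed image in pixel space, so that past knowledge is retained without storing the original data. During the style replay phase, a dual-alignment strategy aligns the current domain to the previous one while adapting the replay samples to the style of the current domain, which mitigates both class-incremental challenges and forgetting caused by domain shift.
Link: https://arxiv.org/abs/2508.01587
Authors: Mingyu Wang,Haojie Liu,Zhiyong Li,Wei Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 11 figures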
Abstract:Lifelong person re-identification (LReID) aims to incrementally accumulate knowledge across a sequence of tasks under domain shifts. Recently, replay-based methods have demonstrated strong effectiveness in LReID by rehearsing past samples stored in an auxiliary memory. However, storing historical exemplars raises concerns over data privacy. To avoid this, exemplar-free approaches attempt to match the distribution of past data without storing raw samples. Despite being privacy-friendly, these methods often suffer from performance degradation due to the forgetting of specific past knowledge representations. To this end, we propose to condense information from sequential data into the pixel space in the replay memory, enabling Privacy-Preserving Replay (Pr^2R). More specifically, by distilling the training characteristics of multiple real images into a single image, the condensed samples undergo pixel-level changes. This not only protects the privacy of the original data but also makes the replay samples more representative for sequential tasks. During the style replay phase, we align the current domain to the previous one while simultaneously adapting the replay samples to match the style of the current domain. This dual-alignment strategy effectively mitigates both class-incremental challenges and forgetting caused by domain shifts. Extensive experiments on multiple benchmarks show that the proposed method significantly improves replay effectiveness while preserving data privacy. Specifically, Pr^2R achieves 4% and 6% higher accuracy on sequential tasks compared to the current state-of-the-art and other replay-based methods, respectively.
[CV-153] A Spatio-temporal Continuous Network for Stochastic 3D Human Motion Prediction
Quick read: This paper addresses two common problems in stochastic human motion prediction (HMP): difficulty in modeling continuous temporal dynamics, and mode collapse, where the model fails to capture the diversity of complex human motion. The proposed two-stage method, STCN, works as follows. In the first stage, a spatio-temporal continuous network generates smoother motion sequences, and an anchor set representing potential human motion patterns is introduced to prevent mode collapse. In the second stage, STCN learns the Gaussian mixture distribution of observed motion sequences with the aid of the anchor set, estimates a probability for each anchor, and samples multiple sequences from each anchor to alleviate intra-class differences in human motion.
Link: https://arxiv.org/abs/2508.01585
Authors: Hua Yu,Yaqing Hou,Xu Gui,Shanshan Feng,Dongsheng Zhou,Qiang Zhang
Affiliations: Dalian University of Technology; Institute of High Performance Computing, A*STAR; Dalian University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Stochastic Human Motion Prediction (HMP) has received increasing attention due to its wide applications. Despite the rapid progress in generative fields, existing methods often face challenges in learning continuous temporal dynamics and predicting stochastic motion sequences. They tend to overlook the flexibility inherent in complex human motions and are prone to mode collapse. To alleviate these issues, we propose a novel method called STCN, for stochastic and continuous human motion prediction, which consists of two stages. Specifically, in the first stage, we propose a spatio-temporal continuous network to generate smoother human motion sequences. In addition, the anchor set is innovatively introduced into the stochastic HMP task to prevent mode collapse, which refers to the potential human motion patterns. In the second stage, STCN endeavors to acquire the Gaussian mixture distribution (GMM) of observed motion sequences with the aid of the anchor set. It also focuses on the probability associated with each anchor, and employs the strategy of sampling multiple sequences from each anchor to alleviate intra-class differences in human motions. Experimental results on two widely-used datasets (Human3.6M and HumanEva-I) demonstrate that our model obtains competitive performance on both diversity and accuracy.
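The second-stage idea of sampling several sequences from every anchor of a Gaussian mixture can be sketched in a few lines. This is only an illustration: anchors are stood in for by short latent mean vectors, the shared `sigma` is an assumption, and all names are hypothetical; the paper's learned distribution and decoder are not reproduced here.

```python
import random


def sample_from_anchors(anchors, anchor_probs, sigma, per_anchor, rng):
    """Draw `per_anchor` samples around every anchor of a Gaussian mixture.

    anchors: list of mean vectors, one per motion mode (hypothetical latent codes).
    anchor_probs: mixture weight attached to each anchor.
    Returns a list of (anchor_index, anchor_prob, sample) tuples.
    """
    if len(anchors) != len(anchor_probs):
        raise ValueError("one probability per anchor is required")
    out = []
    for k, (mean, prob) in enumerate(zip(anchors, anchor_probs)):
        for _ in range(per_anchor):
            sample = [rng.gauss(mu, sigma) for mu in mean]
            out.append((k, prob, sample))
    return out


rng = random.Random(0)
anchors = [[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]]
anchor_probs = [0.5, 0.3, 0.2]
samples = sample_from_anchors(anchors, anchor_probs, sigma=0.1, per_anchor=4, rng=rng)
```

Sampling a fixed number of sequences per anchor, instead of sampling anchors by weight, guarantees that low-probability modes are still represented, which is the intuition behind using anchors against mode collapse.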
[CV-154] Set Pivot Learning: Redefining Generalized Segmentation with Vision Foundation Models
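Quick read: This paper addresses the limitations of traditional domain generalization (DG) in the era of Vision Foundation Models (VFMs). Traditional DG assumes the target domain is inaccessible during training, but VFMs are pre-trained on vast and diverse data, so this assumption no longer clearly holds. The paper proposes Set Pivot Learning (SPL), which redefines the domain migration task around VFMs and has two key attributes: (i) dynamic adaptation, moving from static domain alignment to flexible, task-driven feature optimization so that models can evolve with downstream scenarios; and (ii) VFM-centric tuning, which uses pretrained knowledge as a pivot to refine task-specific representations while preserving cross-domain robustness. Building on SPL, the authors design a Dynamic Prompt Fine-Tuning method that combines a Dynamic Class-aware Prompter with a Prompt-guided Feature Focuser to improve VFM performance in target scenarios, and report results superior to state-of-the-art methods, particularly in generalized segmentation.
Link: https://arxiv.org/abs/2508.01582
Authors: Xinhui Li,Xinyu He,Qiming Hu,Xiaojie Guo
Affiliations: College of Intelligence and Computing, Tianjin University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: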
Abstract:In this paper, we introduce, for the first time, the concept of Set Pivot Learning, a paradigm shift that redefines domain generalization (DG) based on Vision Foundation Models (VFMs). Traditional DG assumes that the target domain is inaccessible during training, but the emergence of VFMs, which are trained on vast and diverse datasets, renders this assumption unclear and obsolete. To address this challenge, we propose Set Pivot Learning (SPL), a new definition of domain migration task based on VFMs, which is more suitable for current research and application requirements. Unlike conventional DG methods, SPL prioritizes adaptive refinement over rigid domain transfer, ensuring continuous alignment with evolving real-world conditions. Specifically, SPL features two key attributes: (i) Dynamic adaptation, transitioning from static domain alignment to flexible, task-driven feature optimization, enabling models to evolve with downstream scenarios; (ii) VFM-centric tuning, leveraging pretrained knowledge as a pivot to hone task-specific representations while preserving cross-domain robustness. Building on SPL, we propose a Dynamic Prompt Fine-Tuning method, which combines a Dynamic Class-aware Prompter with a Prompt-guided Feature Focuser, to elevate VFM performance in targeted scenarios. Extensive experiments on benchmark datasets show the effectiveness of our method, highlighting its superiority over state-of-the-art methods, particularly in generalized segmentation.
[CV-155] Harnessing Textual Semantic Priors for Knowledge Transfer and Refinement in CLIP-Driven Continual Learning
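Quick read: This paper addresses the stability-plasticity trade-off in continual learning (CL) with vision-language models such as CLIP, where knowledge transfer ignores semantic relevance, text-based classifiers have limited plasticity because of the modality gap, and visual prototypes lack rich semantics. The proposed unified framework, Semantic-Enriched Continual Adaptation (SECA), has two modules. The Semantic-Guided Adaptive Knowledge Transfer (SG-AKT) module uses textual cues to assess how relevant a new image is to historical visual knowledge and aggregates the relevant knowledge in an instance-adaptive manner as distillation signals. The Semantic-Enhanced Visual Prototype Refinement (SE-VPR) module refines visual prototypes using the inter-class semantic relations captured in class-wise textual embeddings, strengthening the semantic structure of the visual classifier. The framework thus combines the anti-forgetting nature of textual semantic priors with the plasticity of visual prototypes.
Link: https://arxiv.org/abs/2508.01579
Authors: Lingfeng He,De Cheng,Huaijie Wang,Nannan Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint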
Abstract:Continual learning (CL) aims to equip models with the ability to learn from a stream of tasks without forgetting previous knowledge. With the progress of vision-language models like Contrastive Language-Image Pre-training (CLIP), their promise for CL has attracted increasing attention due to their strong generalizability. However, the potential of rich textual semantic priors in CLIP in addressing the stability-plasticity dilemma remains underexplored. During backbone training, most approaches transfer past knowledge without considering semantic relevance, leading to interference from unrelated tasks that disrupt the balance between stability and plasticity. Besides, while text-based classifiers provide strong generalization, they suffer from limited plasticity due to the inherent modality gap in CLIP. Visual classifiers help bridge this gap, but their prototypes lack rich and precise semantics. To address these challenges, we propose Semantic-Enriched Continual Adaptation (SECA), a unified framework that harnesses the anti-forgetting and structured nature of textual priors to guide semantic-aware knowledge transfer in the backbone and reinforce the semantic structure of the visual classifier. Specifically, a Semantic-Guided Adaptive Knowledge Transfer (SG-AKT) module is proposed to assess new images’ relevance to diverse historical visual knowledge via textual cues, and aggregate relevant knowledge in an instance-adaptive manner as distillation signals. Moreover, a Semantic-Enhanced Visual Prototype Refinement (SE-VPR) module is introduced to refine visual prototypes using inter-class semantic relations captured in class-wise textual embeddings. Extensive experiments on multiple benchmarks validate the effectiveness of our approach.
[CV-156] TopoImages: Incorporating Local Topology Encoding into Deep Learning Models for Medical Image Classification
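Quick read: This paper addresses the lack of sensitivity to topological structures in deep learning (DL) frameworks for image data, especially in tasks such as medical image classification, where conventional methods rely on appearance information and overlook local geometry and connectivity. The proposed representation, TopoImages, uses persistent homology (PH) to extract local topological features from image patches: the persistence diagram (PD) of each patch is vectorized and arranged into a pixel-wise, multi-channel image-form representation. Multiple TopoImages are then generated with different filtration functions (multi-view TopoImages) and fused with the original image, which improves classification accuracy.
Link: https://arxiv.org/abs/2508.01574
Authors: Pengfei Gu,Hongxiao Wang,Yejia Zhang,Huimin Li,Chaoli Wang,Danny Chen
Affiliations: University of Texas Rio Grande Valley; Capital Normal University; University of Notre Dame; The University of Texas at Dallas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: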
Abstract:Topological structures in image data, such as connected components and loops, play a crucial role in understanding image content (e.g., biomedical objects). Despite remarkable successes of numerous image processing methods that rely on appearance information, these methods often lack sensitivity to topological structures when used in general deep learning (DL) frameworks. In this paper, we introduce a new general approach, called TopoImages (for Topology Images), which computes a new representation of input images by encoding local topology of patches. In TopoImages, we leverage persistent homology (PH) to encode geometric and topological features inherent in image patches. Our main objective is to capture topological information in local patches of an input image into a vectorized form. Specifically, we first compute persistence diagrams (PDs) of the patches, and then vectorize and arrange these PDs into long vectors for pixels of the patches. The resulting multi-channel image-form representation is called a TopoImage. TopoImages offers a new perspective for data analysis. To garner diverse and significant topological features in image data and ensure a more comprehensive and enriched representation, we further generate multiple TopoImages of the input image using various filtration functions, which we call multi-view TopoImages. The multi-view TopoImages are fused with the input image for DL-based classification, with considerable improvement. Our TopoImages approach is highly versatile and can be seamlessly integrated into common DL frameworks. Experiments on three public medical image classification datasets demonstrate noticeably improved accuracy over state-of-the-art methods.
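To make the persistence diagram of a patch concrete, the following minimal sketch computes 0-dimensional persistence pairs (connected components) for a small grayscale patch. The choices here are assumptions made for illustration: a sublevel-set filtration on pixel intensity, 4-connectivity, and a union-find with the elder rule. The paper's own filtration functions and its vectorization step are not reproduced.

```python
def persistence_0d(patch):
    """0-dimensional persistence pairs (birth, death) of a 2D patch.

    Pixels enter in order of increasing intensity (sublevel-set filtration,
    4-connectivity). When two components merge, the younger one (larger
    birth value) dies (elder rule). The oldest component never dies.
    """
    h, w = len(patch), len(patch[0])
    parent, birth = {}, {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    pairs = []
    pixels = sorted((patch[r][c], r, c) for r in range(h) for c in range(w))
    for value, r, c in pixels:
        parent[(r, c)] = (r, c)
        birth[(r, c)] = value
        for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if nb not in parent:
                continue                    # neighbour not yet in the filtration
            a, b = find((r, c)), find(nb)
            if a == b:
                continue
            old, young = (a, b) if birth[a] <= birth[b] else (b, a)
            if birth[young] < value:        # skip zero-persistence pairs
                pairs.append((birth[young], value))
            parent[young] = old
    pairs.append((pixels[0][0], float("inf")))
    return sorted(pairs)


patch = [[0, 3, 1, 4, 2]]                   # a 1 x 5 toy patch with three local minima
diagram = persistence_0d(patch)
```

For the toy patch, the minima at intensities 1 and 2 are merged into the component born at 0 when the pixels of intensity 3 and 4 appear, which gives the finite pairs (1, 3) and (2, 4) plus one component that never dies.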
[CV-157] LetheViT: Selective Machine Unlearning for Vision Transformers via Attention-Guided Contrastive Learning
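Quick read: This paper addresses machine unlearning for Vision Transformers (ViTs) under privacy regulations such as GDPR and CCPA, focusing on random data forgetting, where the model must remove the influence of specific samples while retaining other samples, even of the same class. Selective masking experiments reveal a core property of ViTs: masking high-attention areas significantly weakens the model's memorization ability while its recognition capability is retained. Building on this, LetheViT uses masked images to generate positive logits and original images to generate negative logits, forming a contrastive objective that guides the model to forget specific details while retaining the general category outlines, balancing privacy compliance with model efficacy.
Link: https://arxiv.org/abs/2508.01569
Authors: Yujia Tong,Tian Zhang,Jingling Yuan,Yuze Wang,Chuang Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: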
Abstract:Vision Transformers (ViTs) have revolutionized computer vision tasks with their exceptional performance. However, the introduction of privacy regulations such as GDPR and CCPA has brought new challenges to them. These laws grant users the right to withdraw their data, necessitating not only the deletion of data but also the complete removal of its influence from trained models. Machine unlearning emerges as a critical solution, with exact unlearning being computationally prohibitive and approximate methods offering a more practical approach. This work addresses the particularly challenging scenario of random data forgetting in ViTs, where the model must forget specific samples while retaining others, even within the same class. We first reveal the core characteristics of ViTs through selective masking experiments: when high-attention areas are masked, the model retains its recognition capability but significantly weakens its memorization ability. Based on the above insights, we propose LetheViT, a contrastive unlearning method tailored for ViTs. LetheViT uses masked image inputs to generate positive logits and original image inputs to generate negative logits, guiding the model to forget specific details while retaining the general category outlines. Experimental results demonstrate that LetheViT achieves state-of-the-art performance, effectively balancing privacy compliance with model efficacy.
[CV-158] Adaptive LiDAR Scanning: Harnessing Temporal Cues for Efficient 3D Object Detection via Multi-Modal Fusion
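Quick read: This paper addresses the sensing redundancy and high power consumption caused by the dense, stateless scans of conventional LiDAR sensors in 3D object detection, which limit their practicality on resource-constrained platforms. It proposes a predictive, history-aware adaptive scanning framework. A lightweight predictor network distills historical spatial and temporal context into refined query embeddings; these guide a differentiable Mask Generator network, which uses Gumbel-Softmax sampling to produce binary masks identifying the critical regions of interest (ROIs) in the upcoming frame. LiDAR then scans densely only within the ROIs and samples sparsely elsewhere, substantially reducing data acquisition. Experiments show a reduction in LiDAR energy consumption of over 65% while maintaining competitive or even superior detection performance compared with traditional LiDAR-camera fusion methods.
Link: https://arxiv.org/abs/2508.01562
Authors: Sara Shoouri,Morteza Tavakoli Taba,Hun-Seok Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: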
Abstract:Multi-sensor fusion using LiDAR and RGB cameras significantly enhances 3D object detection. However, conventional LiDAR sensors perform dense, stateless scans, ignoring the strong temporal continuity in real-world scenes. This leads to substantial sensing redundancy and excessive power consumption, limiting their practicality on resource-constrained platforms. To address this inefficiency, we propose a predictive, history-aware adaptive scanning framework that anticipates informative regions of interest (ROI) based on past observations. Our approach introduces a lightweight predictor network that distills historical spatial and temporal contexts into refined query embeddings. These embeddings guide a differentiable Mask Generator network, which leverages Gumbel-Softmax sampling to produce binary masks identifying critical ROIs for the upcoming frame. Our method significantly reduces unnecessary data acquisition by concentrating dense LiDAR scanning only within these ROIs and sparsely sampling elsewhere. Experiments on nuScenes and Lyft benchmarks demonstrate that our adaptive scanning strategy reduces LiDAR energy consumption by over 65% while maintaining competitive or even superior 3D object detection performance compared to traditional LiDAR-camera fusion methods with dense LiDAR scanning.
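The Gumbel-Softmax step that turns per-region logits into a binary scan mask can be sketched as below. This shows only the forward sampling in plain Python; the straight-through gradient used during training, the predictor network, and the way regions map to LiDAR beams are outside the sketch, and the function name and logit layout are assumptions.

```python
import math
import random


def gumbel_softmax_mask(logits, tau=1.0, rng=random):
    """Sample a binary keep-mask from per-region (skip_logit, keep_logit) pairs.

    Forward pass only: add Gumbel(0, 1) noise to each logit, apply a softmax
    with temperature tau, then threshold the keep probability at 0.5.
    Returns (hard_mask, soft_keep_probabilities).
    """
    mask, soft = [], []
    for pair in logits:
        noisy = []
        for logit in pair:
            u = min(max(rng.random(), 1e-12), 1 - 1e-12)     # avoid log(0)
            noisy.append((logit - math.log(-math.log(u))) / tau)
        top = max(noisy)
        exps = [math.exp(x - top) for x in noisy]            # stable softmax
        p_keep = exps[1] / sum(exps)
        soft.append(p_keep)
        mask.append(1 if p_keep >= 0.5 else 0)
    return mask, soft


rng = random.Random(0)
# Hypothetical logits for three regions: clearly keep, clearly skip, undecided.
logits = [(-50.0, 50.0), (50.0, -50.0), (0.0, 0.0)]
mask, soft = gumbel_softmax_mask(logits, tau=1.0, rng=rng)
```

With strongly separated logits the noise cannot flip the decision, so the first region is always scanned densely and the second is always skipped, while the undecided region is sampled at random.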
[CV-159] EvoVLMA: Evolutionary Vision-Language Model Adaptation ACM-MM2025
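Quick read: This paper addresses the fact that existing training-free adaptation methods for pre-trained Vision-Language Models (VLMs), used in tasks such as few-shot image classification, are designed by hand, which is time-consuming and inefficient. It proposes EvoVLMA, an evolutionary algorithm assisted by Large Language Models (LLMs). The adaptation procedure is decomposed into two key functions, feature selection and logits computation, which are optimized sequentially by a two-stage evolutionary strategy to cope with the large search space. Low-precision code conversion, web-based code execution, and process monitoring improve the stability and efficiency of the automatic algorithm design. In the 8-shot image classification setting, the classical APE algorithm is improved by 1.91 points in recognition accuracy.
Link: https://arxiv.org/abs/2508.01558
Authors: Kun Ding,Ying Wang,Shiming Xiang
Affiliations: Institute of Automation, Chinese Academy of Sciences; State Key Laboratory of Multimodal Artificial Intelligence Systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by ACM Multimedia 2025 (ACM MM 2025)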
Abstract:Pre-trained Vision-Language Models (VLMs) have been exploited in various Computer Vision tasks (e.g., few-shot recognition) via model adaptation, such as prompt tuning and adapters. However, existing adaptation methods are designed by human experts, requiring significant time and expertise. Inspired by recent advances in Large Language Model (LLM)-based code generation, we propose an Evolutionary Vision-Language Model Adaptation (EvoVLMA) method to automatically search for efficient training-free adaptation algorithms for VLMs. We recognize feature selection and logits computation as the key functions in training-free VLM adaptation, and propose a two-stage LLM-assisted evolutionary algorithm for optimizing these parts in a sequential manner, effectively addressing the challenge posed by the expansive search space through a divide-and-conquer strategy. Besides, to enhance the stability and efficiency of the search process, we propose low-precision code conversion, web-based code execution and process monitoring, leading to a highly effective automatic algorithm design system. Extensive experiments demonstrate that the algorithms found by EvoVLMA can obtain promising results compared to previous manually designed ones. More specifically, in the 8-shot image classification setting, the classical APE algorithm can be improved by 1.91 points in recognition accuracy. This research opens new possibilities for automating the optimization of adaptation algorithms of pre-trained multimodal models. Code is available at: this https URL
[CV-160] A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models
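Quick read: This paper addresses imprecise visual token pruning in Large Vision-Language Models (LVLMs) processing high-resolution inputs: methods with fixed compression ratios cannot adapt to scene complexity and may discard informative tokens, degrading performance. The proposed dynamic pruning framework, GlimpsePrune, is inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass, pruning 92.6% of visual tokens while on average fully retaining baseline performance, and it also enables more effective fine-tuning.
Link: https://arxiv.org/abs/2508.01548
Authors: Quan-Sheng Zeng,Yunheng Li,Qilong Wang,Peng-Tao Jiang,Zuxuan Wu,Ming-Ming Cheng,Qibin Hou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 10 figures. Project page: this https URL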
Abstract:Visual token compression is critical for Large Vision-Language Models (LVLMs) to efficiently process high-resolution inputs. Existing methods that typically adopt fixed compression ratios cannot adapt to scenes of varying complexity, often causing imprecise pruning that discards informative visual tokens and results in degraded model performance. To address this issue, we introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. This approach prunes 92.6% of visual tokens while on average fully retaining the baseline performance on free-form VQA tasks. The reduced computational cost also enables more effective fine-tuning: an enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate. Our work paves a new way for building more powerful and efficient LVLMs.
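The contrast between a fixed compression ratio and pruning that adapts to scene complexity can be illustrated with a toy sketch. It assumes each visual token already carries a relevance score and uses a simple score threshold as the dynamic rule; both the scoring and the threshold rule are assumptions for illustration, not the mechanism GlimpsePrune uses.

```python
def prune_fixed_ratio(scores, keep_ratio):
    """Keep the top `keep_ratio` fraction of tokens, whatever the scene looks like."""
    k = max(1, round(len(scores) * keep_ratio))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def prune_dynamic(scores, threshold):
    """Keep every token whose relevance reaches `threshold` (at least one token)."""
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    return kept or [max(range(len(scores)), key=lambda i: scores[i])]


# Hypothetical relevance scores for eight visual tokens in two scenes.
simple_scene = [0.9, 0.05, 0.02, 0.01, 0.03, 0.02, 0.01, 0.04]
complex_scene = [0.8, 0.7, 0.6, 0.1, 0.75, 0.65, 0.05, 0.9]

fixed_simple = prune_fixed_ratio(simple_scene, keep_ratio=0.25)
fixed_complex = prune_fixed_ratio(complex_scene, keep_ratio=0.25)
dyn_simple = prune_dynamic(simple_scene, threshold=0.5)
dyn_complex = prune_dynamic(complex_scene, threshold=0.5)
```

The fixed ratio keeps two tokens in both scenes, retaining an irrelevant token in the simple scene and dropping four relevant ones in the complex scene, whereas the dynamic rule keeps one and six tokens respectively.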
[CV-161] E-VRAG: Enhancing Long Video Understanding with Resource-Efficient Retrieval Augmented Generation
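Quick read: This paper addresses the high computational cost of long videos in video retrieval-augmented generation (Video RAG) and the difficulty of balancing retrieval efficiency and accuracy. The proposed E-VRAG framework has four components: (1) frame pre-filtering based on hierarchical query decomposition, which removes irrelevant frames and reduces cost at the data level; (2) frame scoring with a lightweight vision-language model (VLM), which reduces cost at the model level; (3) a retrieval strategy that uses the global statistical distribution of inter-frame scores to mitigate the possible performance loss from the lightweight VLM; and (4) a multi-view question answering scheme over the retrieved frames, which improves the model's ability to extract and understand information from long video contexts. Without additional training, E-VRAG achieves about a 70% reduction in computational cost and higher accuracy than baseline methods.
Link: https://arxiv.org/abs/2508.01546
Authors: Zeyu Xu,Junkang Zhang,Qiang Wang,Yi Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: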
Abstract:Vision-Language Models (VLMs) have enabled substantial progress in video understanding by leveraging cross-modal reasoning capabilities. However, their effectiveness is limited by the restricted context window and the high computational cost required to process long videos with thousands of frames. Retrieval-augmented generation (RAG) addresses this challenge by selecting only the most relevant frames as input, thereby reducing the computational burden. Nevertheless, existing video RAG methods struggle to balance retrieval efficiency and accuracy, particularly when handling diverse and complex video content. To address these limitations, we propose E-VRAG, a novel and efficient video RAG framework for video understanding. We first apply a frame pre-filtering method based on hierarchical query decomposition to eliminate irrelevant frames, reducing computational costs at the data level. We then employ a lightweight VLM for frame scoring, further reducing computational costs at the model level. Additionally, we propose a frame retrieval strategy that leverages the global statistical distribution of inter-frame scores to mitigate the potential performance degradation from using a lightweight VLM. Finally, we introduce a multi-view question answering scheme for the retrieved frames, enhancing the VLM’s capability to extract and comprehend information from long video contexts. Experiments on four public benchmarks show that E-VRAG achieves about 70% reduction in computational cost and higher accuracy compared to baseline methods, all without additional training. These results demonstrate the effectiveness of E-VRAG in improving both efficiency and accuracy for video RAG tasks.
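One plausible reading of a retrieval rule based on the global statistics of frame scores is to keep frames that stand out from the score distribution of the whole video. The sketch below uses a mean-plus-k-standard-deviations cutoff; this specific rule, the parameter `k`, and the fallback are assumptions for illustration and are not described in the abstract.

```python
import statistics


def retrieve_frames(scores, k=1.0, min_frames=1):
    """Return indices of frames whose score exceeds mean + k * std of all scores.

    Falls back to the `min_frames` highest-scoring frames when nothing stands out,
    for example when all scores are identical.
    """
    cutoff = statistics.fmean(scores) + k * statistics.pstdev(scores)
    picked = [i for i, s in enumerate(scores) if s > cutoff]
    if len(picked) < min_frames:
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        picked = sorted(order[:min_frames])
    return picked


# Hypothetical relevance scores from a lightweight frame scorer.
scores = [0.1, 0.2, 0.1, 0.9, 0.15, 0.85, 0.1, 0.2]
picked = retrieve_frames(scores, k=1.0)
flat = retrieve_frames([0.3, 0.3, 0.3], k=1.0)
```

A cutoff relative to the whole distribution means a noisy scorer that shifts all scores up or down by a similar amount still selects the same frames, which is one way such statistics could compensate for a weaker scoring model.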
[CV-162] MagicVL-2B: Empowering Vision-Language Models on Mobile Devices with Lightweight Visual Encoders via Curriculum Learning
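[Quick read]: This paper addresses the excessive compute and storage demands that hinder efficient deployment of vision-language models (VLMs) on mobile devices. It proposes MagicVL-2B, a lightweight VLM optimized for flagship smartphones. The key ideas are: first, a lightweight visual encoder with fewer than 100M parameters, paired with a dynamic resolution scheme that adaptively generates image tokens without excessively modifying image dimensions; second, a multimodal curriculum learning strategy that gradually raises task difficulty and data information density during training, which substantially improves the compact encoder's performance across sub-tasks. Experiments show that MagicVL-2B matches the accuracy of current state-of-the-art models while reducing on-device power consumption by 41.1%, making advanced multimodal intelligence practical on mobile devices.
Link: https://arxiv.org/abs/2508.01540
Authors: Yi Liu,Xiao Xu,Zeyu Xu,Meng Zhang,Yibo Li,Haoyu Chen,Junkang Zhang,Qiang Wang,Jifa Sun,Siling Lin,Shengxun Cheng,Lingshu Zhang,Kang Wang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: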
Abstract:Vision-Language Models (VLMs) have achieved remarkable breakthroughs in recent years, enabling a diverse array of applications in everyday life. However, the substantial computational and storage demands of VLMs pose significant challenges for their efficient deployment on mobile devices, which represent the most ubiquitous and accessible computing platforms today. In this work, we introduce MagicVL-2B, a novel VLM meticulously optimized for flagship smartphones. MagicVL-2B leverages a lightweight visual encoder with fewer than 100M parameters and features a redesigned dynamic resolution scheme that adaptively generates image tokens without excessive modification of image dimensions. To further enhance the performance of this compact encoder within VLMs, we propose a multimodal curriculum learning strategy that incrementally increases task difficulty and data information density throughout training. This approach substantially improves the model’s performance across a variety of sub-tasks. Extensive evaluations on standard VLM benchmarks demonstrate that MagicVL-2B matches the accuracy of current state-of-the-art models while reducing on-device power consumption by 41.1%. These results establish MagicVL-2B as a practical and robust solution for real-world mobile vision-language applications, enabling advanced multimodal intelligence to run directly on smartphones.
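[CV-163] ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models
[Quick read]: This paper addresses the weak fine-grained temporal reasoning of small multimodal models on video understanding tasks. It proposes ReasonAct, a three-stage training method: first build foundational ability with text-only reasoning, then fine-tune on video data, and finally refine with temporal-aware reinforcement learning. The key innovations are incorporating temporal consistency modeling into policy optimization, building on Temporal Group Relative Policy Optimization (T-GRPO), and a biomechanically motivated sub-action decomposition mechanism that gives graduated rewards for the constituent phases of an action. Together these improve small-model video reasoning on HMDB51, UCF-101, and Kinetics-400.
Link: https://arxiv.org/abs/2508.01533
Authors: Jiaxin Liu,Zhaolu Kang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: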
Abstract:While recent multimodal models have shown progress in vision-language tasks, small-scale variants still struggle with the fine-grained temporal reasoning required for video understanding. We introduce ReasonAct, a method that enhances video reasoning in smaller models through a three-stage training process: first building a foundation with text-only reasoning, then fine-tuning on video, and finally refining with temporal-aware reinforcement learning. We build upon Temporal Group Relative Policy Optimization (T-GRPO) by incorporating temporal consistency modeling into policy optimization. We also propose a biomechanically-motivated sub-action decomposition mechanism that provides graduated rewards for constituent action phases. Through experiments on HMDB51, UCF-101, and Kinetics-400, our 3B-parameter model achieves 67.2%, 94.1%, and 78.9% accuracy respectively, demonstrating improvements of 17.9, 15.8, and 12.3 points over baselines. Ablation studies validate that our progressive training methodology enables smaller models to achieve competitive video reasoning performance while maintaining computational efficiency.
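A rough sketch of how graduated sub-action rewards could feed a group-relative advantage. The function names, the phase labels, and the partial-credit rule are hypothetical; the advantage is the plain GRPO-style within-group standardization and leaves out the temporal consistency terms the paper adds.

```python
from statistics import mean, pstdev

def graduated_reward(predicted_phases, reference_phases):
    """Fraction of reference sub-action phases the prediction mentions
    (hypothetical partial-credit rule, value in [0, 1])."""
    hits = sum(1 for phase in reference_phases if phase in predicted_phases)
    return hits / len(reference_phases)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Illustrative phases of a "jump" action and three sampled model responses.
reference = ["crouch", "take-off", "flight", "landing"]
group = [
    ["crouch", "take-off", "flight", "landing"],
    ["crouch", "flight"],
    ["walk"],
]
rewards = [graduated_reward(p, reference) for p in group]
advantages = group_relative_advantages(rewards)
```

Partial credit gives the policy a gradient signal even when no sampled response names every phase, which is the intuition behind rewarding constituent phases rather than only the final action label.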
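[CV-164] MiraGe: Multimodal Discriminative Representation Learning for Generalizable AI-Generated Image Detection
[Quick read]: This paper addresses the sharp performance drop of current AI-generated image detectors on new or unseen generative models, which stems from overlapping feature embeddings across generators that reduce cross-generator classification accuracy. The key to the solution is MiraGe, a multimodal discriminative representation learning method that learns generator-invariant features by minimizing intra-class variation and maximizing inter-class separation. It also applies multimodal prompt learning to bring these principles into CLIP, using text embeddings as semantic anchors to further strengthen discriminability and generalization. MiraGe achieves state-of-the-art performance on multiple benchmarks and remains robust against unseen generators such as Sora.
Link: https://arxiv.org/abs/2508.01525
Authors: Kuo Shi,Jie Lu,Shanshan Ye,Guangquan Zhang,Zhen Fang
Affiliation: University of Technology Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ACMMM 2025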
Abstract:Recent advances in generative models have highlighted the need for robust detectors capable of distinguishing real images from AI-generated images. While existing methods perform well on known generators, their performance often declines when tested with newly emerging or unseen generative models due to overlapping feature embeddings that hinder accurate cross-generator classification. In this paper, we propose Multimodal Discriminative Representation Learning for Generalizable AI-generated Image Detection (MiraGe), a method designed to learn generator-invariant features. Motivated by theoretical insights on intra-class variation minimization and inter-class separation, MiraGe tightly aligns features within the same class while maximizing separation between classes, enhancing feature discriminability. Moreover, we apply multimodal prompt learning to further refine these principles into CLIP, leveraging text embeddings as semantic anchors for effective discriminative representation learning, thereby improving generalizability. Comprehensive experiments across multiple benchmarks show that MiraGe achieves state-of-the-art performance, maintaining robustness even against unseen generators like Sora.
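One way to picture text embeddings acting as semantic anchors is a cross-entropy over cosine similarities between image features and one anchor per class. This is a generic CLIP-style sketch under that assumption, not MiraGe's actual loss; `anchor_alignment_loss`, the temperature, and the toy 2-D vectors are all illustrative.

```python
import numpy as np

def anchor_alignment_loss(features, labels, anchors, temperature=0.1):
    """Cross-entropy over cosine similarities between image features and
    per-class text anchors. Pulling features toward their own anchor shrinks
    intra-class spread; pushing them from the other anchor widens the class gap."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    logits = f @ a.T / temperature                     # shape (N, num_classes)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

anchors = np.array([[1.0, 0.0], [0.0, 1.0]])           # stand-ins for "real" / "fake" text embeddings
labels = np.array([0, 0, 1, 1])
separated = np.array([[0.9, 0.1], [1.0, 0.0], [0.1, 0.9], [0.0, 1.0]])
overlapping = np.array([[0.6, 0.5], [0.5, 0.6], [0.5, 0.6], [0.6, 0.5]])
loss_separated = anchor_alignment_loss(separated, labels, anchors)
loss_overlapping = anchor_alignment_loss(overlapping, labels, anchors)
```

The overlapping features mimic the failure case the abstract describes: embeddings from different classes sit close together, so the loss stays high until training separates them.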
[CV-165] EfficientGFormer: Multimodal Brain Tumor Segmentation via Pruned Graph-Augmented Transformer
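[Quick read]: This paper addresses two obstacles in brain tumor segmentation: the accuracy challenge posed by heterogeneous tumor subregions and the high computational cost of volumetric inference. It proposes the EfficientGFormer architecture, whose key elements are: nnFormer as a modality-aware encoder that turns multi-modal MRI volumes into patch-level embeddings, organized into a dual-edge graph capturing both spatial adjacency and semantic similarity; a pruned, edge-type-aware Graph Attention Network (GAT) for efficient relational reasoning; and a knowledge distillation module that transfers knowledge from a full-capacity teacher to a compact student model. The result keeps accuracy high while substantially reducing memory use and inference time, enabling fast, accurate, clinically deployable 3D tumor segmentation.
Link: https://arxiv.org/abs/2508.01465
Authors: Fatemeh Ziaeetabar
Affiliation: University of Tehran
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: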
Abstract:Accurate and efficient brain tumor segmentation remains a critical challenge in neuroimaging due to the heterogeneous nature of tumor subregions and the high computational cost of volumetric inference. In this paper, we propose EfficientGFormer, a novel architecture that integrates pretrained foundation models with graph-based reasoning and lightweight efficiency mechanisms for robust 3D brain tumor segmentation. Our framework leverages nnFormer as a modality-aware encoder, transforming multi-modal MRI volumes into patch-level embeddings. These features are structured into a dual-edge graph that captures both spatial adjacency and semantic similarity. A pruned, edge-type-aware Graph Attention Network (GAT) enables efficient relational reasoning across tumor subregions, while a distillation module transfers knowledge from a full-capacity teacher to a compact student model for real-time deployment. Experiments on the MSD Task01 and BraTS 2021 datasets demonstrate that EfficientGFormer achieves state-of-the-art accuracy with significantly reduced memory and inference time, outperforming recent transformer-based and graph-based baselines. This work offers a clinically viable solution for fast and accurate volumetric tumor delineation, combining scalability, interpretability, and generalization.
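The dual-edge graph can be sketched as two edge sets over the same patch nodes. The construction rules here (grid neighbours for spatial edges, top-`k` cosine similarity for semantic edges) and the name `build_dual_edge_graph` are assumptions; the abstract does not state how the edges are chosen, and the pruned GAT itself is not shown.

```python
import numpy as np

def build_dual_edge_graph(coords, embeddings, k=1):
    """Return (spatial_edges, semantic_edges) as sets of undirected index pairs.

    spatial: patches whose grid coordinates differ by one step along one axis.
    semantic: each patch linked to its k most cosine-similar patches.
    """
    n = len(coords)
    spatial = {(i, j) for i in range(n) for j in range(i + 1, n)
               if np.abs(coords[i] - coords[j]).sum() == 1}
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)                     # exclude self-matches
    semantic = set()
    for i in range(n):
        for j in np.argsort(sim[i])[-k:]:
            semantic.add((min(i, int(j)), max(i, int(j))))
    return spatial, semantic

# Four patches on a 1 x 2 x 2 grid with toy 2-D embeddings.
coords = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1]])
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [1.0, 0.1]])
spatial_edges, semantic_edges = build_dual_edge_graph(coords, embeddings)
```

In this toy case the semantic edges connect patches that are not grid neighbours, which illustrates why a second edge type can relate distant regions with similar appearance.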
[CV-166] Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians
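[Quick read]: This paper addresses the lack of scalable latent representation learning models for 3D scene-level generation. Scenes represented with 3D Gaussian Splatting (3DGS) are unbounded and inconsistent in scale, which makes a unified latent space hard to learn with conventional methods. The key to the solution is Can3Tok, the first 3D scene-level variational autoencoder (VAE) able to encode a large number of Gaussian primitives into a low-dimensional latent embedding that captures both the semantic and spatial information of the input scene. The paper also builds a general 3D scene data processing pipeline to mitigate scale inconsistency, enabling stable training and generalization on scene-level 3D data. On DL3DV-10K, only Can3Tok generalizes to novel scenes, while the compared methods fail to converge even on a few hundred scenes.
Link: https://arxiv.org/abs/2508.01464
Authors: Quankai Gao,Iliyan Georgiev,Tuanfeng Y. Wang,Krishna Kumar Singh,Ulrich Neumann,Jae Shin Yoon
Affiliation: University of Southern California; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: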
Abstract:3D generation has made significant progress; however, it still largely remains at the object level. Feedforward 3D scene-level generation has been rarely explored due to the lack of models capable of scaling up latent representation learning on 3D scene-level data. Unlike object-level generative models, which are trained on well-labeled 3D data in a bounded canonical space, scene-level generations with 3D scenes represented by 3D Gaussian Splatting (3DGS) are unbounded and exhibit scale inconsistency across different scenes, making unified latent representation learning for generative purposes extremely challenging. In this paper, we introduce Can3Tok, the first 3D scene-level variational autoencoder (VAE) capable of encoding a large number of Gaussian primitives into a low-dimensional latent embedding, which effectively captures both semantic and spatial information of the inputs. Beyond model design, we propose a general pipeline for 3D scene data processing to address the scale inconsistency issue. We validate our method on the recent scene-level 3D dataset DL3DV-10K, where we found that only Can3Tok successfully generalizes to novel 3D scenes, while compared methods fail to converge on even a few hundred scene inputs during training and exhibit zero generalization ability during inference. Finally, we showcase image-to-3DGS and text-to-3DGS generation as applications to demonstrate its ability to facilitate downstream generation tasks.
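Scale inconsistency across scenes can be illustrated with a simple normalization: center the Gaussian means and rescale so every scene fits the same radius. The abstract does not describe the paper's pipeline, so `normalize_scene`, `target_radius`, and the centroid-plus-max-radius rule are assumptions for illustration only.

```python
import numpy as np

def normalize_scene(means, scales, target_radius=1.0):
    """Center Gaussian means at the origin and rescale the scene so the
    farthest Gaussian lies at `target_radius`; per-Gaussian scales shrink by
    the same factor so the scene's shape is unchanged."""
    centered = means - means.mean(axis=0)
    factor = target_radius / np.linalg.norm(centered, axis=1).max()
    return centered * factor, scales * factor

# Two copies of the same toy scene at different metric scales and offsets.
means_small = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
scales_small = np.full((3, 3), 0.1)
means_large, scales_large = means_small * 50.0 + 7.0, scales_small * 50.0

norm_small = normalize_scene(means_small, scales_small)
norm_large = normalize_scene(means_large, scales_large)
```

After normalization both copies map to the same coordinates, so an encoder sees one canonical input regardless of the scale at which the scene was reconstructed.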
[CV-167] Uncertainty-Aware Segmentation Quality Prediction via Deep Learning Bayesian Modeling: Comprehensive Evaluation and Interpretation on Skin Cancer and Liver Segmentation
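[Quick read]: This paper addresses how to assess image segmentation quality without ground truth in clinical settings that lack manual annotations. Conventional practice relies on metrics such as the Dice coefficient during training and validation, but at test time without annotations the reliability of a segmentation is hard to quantify, which limits deployment in real medical applications. The key to the solution is a framework that predicts segmentation quality without ground-truth labels. It introduces uncertainty estimation, making SwinUNet and a Feature Pyramid Network with ResNet50 Bayesian via Monte Carlo Dropout, ensembles, and Test Time Augmentation (TTA) to produce segmentation and uncertainty maps. Two complementary designs are proposed: one uses the predicted segmentation and uncertainty maps, the other combines the original input image, uncertainty maps, and segmentation maps. An aggregation of multiple uncertainty measures (confidence map, entropy, mutual information, and expected pairwise Kullback-Leibler divergence) yields a single reliability score per image. The approach improves quality prediction across 2D skin lesion and 3D liver segmentation, with an R² of up to 93.25.
Link: https://arxiv.org/abs/2508.01460
Authors: Sikha O K,Meritxell Riera-Marín,Adrian Galdran,Javier García Lopez,Julia Rodríguez-Comas,Gemma Piella,Miguel A. González Ballester
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: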
Abstract:Image segmentation is a critical step in computational biomedical image analysis, typically evaluated using metrics like the Dice coefficient during training and validation. However, in clinical settings without manual annotations, assessing segmentation quality becomes challenging, and models lacking reliability indicators face adoption barriers. To address this gap, we propose a novel framework for predicting segmentation quality without requiring ground truth annotations during test time. Our approach introduces two complementary frameworks: one leveraging predicted segmentation and uncertainty maps, and another integrating the original input image, uncertainty maps, and predicted segmentation maps. We present Bayesian adaptations of two benchmark segmentation models, SwinUNet and Feature Pyramid Network with ResNet50, using Monte Carlo Dropout, Ensemble, and Test Time Augmentation to quantify uncertainty. We evaluate four uncertainty estimates: confidence map, entropy, mutual information, and expected pairwise Kullback-Leibler divergence on 2D skin lesion and 3D liver segmentation datasets, analyzing their correlation with segmentation quality metrics. Our framework achieves an R² score of 93.25 and Pearson correlation of 96.58 on the HAM10000 dataset, outperforming previous segmentation quality assessment methods. For 3D liver segmentation, Test Time Augmentation with entropy achieves an R² score of 85.03 and a Pearson correlation of 65.02, demonstrating cross-modality robustness. Additionally, we propose an aggregation strategy that combines multiple uncertainty estimates into a single score per image, offering a more robust and comprehensive assessment of segmentation quality. Finally, we use Grad-CAM and UMAP-based embedding analysis to interpret the model’s behavior and reliability, highlighting the impact of uncertainty integration.
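Two of the four uncertainty estimates, entropy and mutual information, have standard definitions over repeated stochastic predictions. The sketch below uses those textbook formulas for binary segmentation; the function names and the toy probabilities are illustrative, and the paper's own implementation may differ.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli distribution with foreground probability p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def uncertainty_maps(prob_samples):
    """prob_samples: (T, H, W) foreground probabilities from T stochastic passes
    (MC Dropout, ensemble members, or TTA variants).

    Returns per-pixel predictive entropy and mutual information.
    """
    mean_prob = prob_samples.mean(axis=0)
    entropy = binary_entropy(mean_prob)                        # total uncertainty
    expected_entropy = binary_entropy(prob_samples).mean(axis=0)
    mutual_info = entropy - expected_entropy                   # disagreement between passes
    return entropy, mutual_info

# Pixel 0: passes agree and are confident. Pixel 1: passes agree but are unsure.
# Pixel 2: passes are individually confident but disagree.
samples = np.array([[[0.99, 0.5, 0.99]],
                    [[0.99, 0.5, 0.01]]])
entropy, mutual_info = uncertainty_maps(samples)
```

Entropy is high for both pixel 1 and pixel 2, while mutual information is high only for pixel 2, so the two maps carry different information, which is a reason to evaluate and combine several estimates.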
[CV-168] Hyperspectral Image Recovery Constrained by Multi-Granularity Non-Local Self-Similarity Priors
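[Quick read]: This paper addresses the limited adaptability of hyperspectral image (HSI) recovery methods that represent non-local self-similarity tensor groups with a fixed-format factor, which leaves them unable to handle diverse missing-data scenarios such as missing pixels or stripes. The key to the solution is introducing, for the first time, the notion of granularity into tensor decomposition, and proposing an HSI recovery model constrained by multi-granularity non-local self-similarity priors. The model alternates between coarse-grained and fine-grained decomposition: the coarse-grained step builds on Tucker decomposition and extracts global structure by singular value shrinkage on the mode-unfolded matrices; the fine-grained step uses FCTN decomposition and captures local detail by modeling pairwise correlations among factor tensors. This yields a unified representation of global, local, and non-local priors and strong recovery across missing-data scenarios.
Link: https://arxiv.org/abs/2508.01435
Authors: Zhuoran Peng,Yiqing Shen
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: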
Abstract:Hyperspectral image (HSI) recovery, as an upstream image processing task, holds significant importance for downstream tasks such as classification, segmentation, and detection. In recent years, HSI recovery methods based on non-local prior representations have demonstrated outstanding performance. However, these methods employ a fixed-format factor to represent the non-local self-similarity tensor groups, making them unable to adapt to diverse missing scenarios. To address this issue, we introduce the concept of granularity in tensor decomposition for the first time and propose an HSI recovery model constrained by multi-granularity non-local self-similarity priors. Specifically, the proposed model alternately performs coarse-grained decomposition and fine-grained decomposition on the non-local self-similarity tensor groups. Among them, the coarse-grained decomposition builds upon Tucker tensor decomposition, which extracts global structural information of the image by performing singular value shrinkage on the mode-unfolded matrices. The fine-grained decomposition employs the FCTN decomposition, capturing local detail information through modeling pairwise correlations among factor tensors. This architectural approach achieves a unified representation of global, local, and non-local priors for HSIs. Experimental results demonstrate that the model has strong applicability and exhibits outstanding recovery effects in various types of missing scenes such as pixels and stripes. 
Cite as: arXiv:2508.01435 [cs.CV] (v1 submitted Sat, 2 Aug 2025 16:51:07 UTC by Zhuoran Peng)
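The coarse-grained step, singular value shrinkage on mode-unfolded matrices, can be sketched with standard operations. This shows only one mode of one tensor with a hand-picked threshold `tau`; the helper names and the toy rank-1 tensor are assumptions, and the alternation with the fine-grained FCTN step is not shown.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of `unfold` for a tensor of the given full shape."""
    moved = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape(moved), 0, mode)

def singular_value_shrink(matrix, tau):
    """Soft-threshold the singular values by tau."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt

# Rank-1 tensor of shape (6, 5, 4) plus small noise.
a, b, c = np.arange(1.0, 7.0), np.arange(1.0, 6.0), np.arange(1.0, 5.0)
low_rank = np.einsum("i,j,k->ijk", a, b, c)
noisy = low_rank + 0.01 * np.random.default_rng(0).normal(size=low_rank.shape)

mode = 0
shrunk = fold(singular_value_shrink(unfold(noisy, mode), tau=1.0), mode, noisy.shape)
rank_after = int(np.linalg.matrix_rank(unfold(shrunk, mode), tol=1e-8))
```

Shrinkage zeroes the small singular values that the noise introduces and keeps the dominant one, which is how this step favors global low-rank structure.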
[CV-169] Capturing More: Learning Multi-Domain Representations for Robust Online Handwriting Verification ACM-MM2025
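[Quick read]: This paper addresses the limited discriminative power of online handwriting verification (OHV) methods that rely on temporal features alone. The key to the solution is SPECTRUM, a model that learns multi-domain representations through temporal-frequency synergy: a multi-scale interactor fuses temporal and frequency features at a fine-grained level; a self-gated fusion module dynamically balances global temporal and frequency information, giving micro-to-macro spectral-temporal integration; and a multi-domain distance-based verifier uses both temporal and frequency representations to better separate genuine from forged handwriting. The method outperforms conventional temporal-only approaches, supporting the value of multi-domain learning for OHV.
Link: https://arxiv.org/abs/2508.01427
Authors: Peirong Zhang,Kai Ding,Lianwen Jin
Affiliation: South China University of Technology; INTSIG Information Co. Ltd; SCUT-Zhuhai Institute of Modern Industrial Innovation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ACM MM 2025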
Abstract:In this paper, we propose SPECTRUM, a temporal-frequency synergistic model that unlocks the untapped potential of multi-domain representation learning for online handwriting verification (OHV). SPECTRUM comprises three core components: (1) a multi-scale interactor that finely combines temporal and frequency features through dual-modal sequence interaction and multi-scale aggregation, (2) a self-gated fusion module that dynamically integrates global temporal and frequency features via self-driven balancing. These two components work synergistically to achieve micro-to-macro spectral-temporal integration. (3) A multi-domain distance-based verifier then utilizes both temporal and frequency representations to improve discrimination between genuine and forged handwriting, surpassing conventional temporal-only approaches. Extensive experiments demonstrate SPECTRUM’s superior performance over existing OHV methods, underscoring the effectiveness of temporal-frequency multi-domain learning. Furthermore, we reveal that incorporating multiple handwritten biometrics fundamentally enhances the discriminative power of handwriting representations and facilitates verification. These findings not only validate the efficacy of multi-domain learning in OHV but also pave the way for future research in multi-domain approaches across both feature and biometric domains. Code is publicly available at this https URL.
[CV-170] 3DRot: 3D Rotation Augmentation for RGB-Based 3D Tasks
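[Quick read]: This paper addresses the challenges that scarce, expensive annotations and a thin augmentation toolbox pose for RGB-based 3D tasks such as 3D detection, depth estimation, and 3D keypoint estimation, in particular that conventional image transforms such as resizing and rotation break geometric consistency. The key to the solution is 3DRot, a plug-and-play augmentation that rotates and mirrors images about the camera's optical center while synchronously updating the RGB image, camera intrinsics, object poses, and 3D annotations. This preserves projective geometry without relying on scene depth, giving geometry-consistent rotation and reflection augmentation.
Link: https://arxiv.org/abs/2508.01423
Authors: Shitian Yang,Deyu Li,Xiaoke Jiang,Lei Zhang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: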
Abstract:RGB-based 3D tasks, e.g., 3D detection, depth estimation, 3D keypoint estimation, still suffer from scarce, expensive annotations and a thin augmentation toolbox, since most image transforms, including resize and rotation, disrupt geometric consistency. In this paper, we introduce 3DRot, a plug-and-play augmentation that rotates and mirrors images about the camera’s optical center while synchronously updating RGB images, camera intrinsics, object poses, and 3D annotations to preserve projective geometry, achieving geometry-consistent rotations and reflections without relying on any scene depth. We validate 3DRot with a classical 3D task, monocular 3D detection. On the SUN RGB-D dataset, 3DRot raises IoU_3D from 43.21 to 44.51, cuts rotation error (ROT) from 22.91° to 20.93°, and boosts mAP_0.5 from 35.70 to 38.11. For comparison, Cube R-CNN, which adds 3 other datasets to SUN RGB-D for monocular 3D estimation with a similar mechanism and test dataset, increases IoU_3D from 36.2 to 37.8 and boosts mAP_0.5 from 34.7 to 35.4. Because it operates purely through camera-space transforms, 3DRot is readily transferable to other 3D tasks.
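The reason a rotation about the optical center needs no depth can be checked numerically: for a pinhole camera, rotating the camera frame by R warps the image by the homography K R K⁻¹ and moves 3D annotations by R. The sketch below verifies that the two stay consistent. It keeps the intrinsics K fixed for simplicity, whereas the paper also updates intrinsics; the values of K, R, and the 3D point are made up.

```python
import numpy as np

def project(K, X):
    """Pinhole projection of a 3D point in camera coordinates to pixels."""
    x = K @ X
    return x[:2] / x[2]

def rotation_z(theta):
    """Roll: rotation about the optical (z) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rotation_y(theta):
    """Yaw: rotation about the camera's vertical (y) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R = rotation_y(np.deg2rad(10)) @ rotation_z(np.deg2rad(15))
H = K @ R @ np.linalg.inv(K)          # image warp induced by rotating the camera frame

X = np.array([0.4, -0.2, 3.0])        # an annotated 3D point, e.g. a box corner
pixel = project(K, X)

warped = H @ np.append(pixel, 1.0)    # where that pixel lands in the augmented image
warped = warped[:2] / warped[2]
reprojected = project(K, R @ X)       # projection of the rotated 3D annotation
```

`warped` and `reprojected` coincide for any depth of `X`, which is why the image and its 3D labels can be transformed together without a depth map.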
[CV-171] ForenX: Towards Explainable AI-Generated Image Detection with Multimodal Large Language Models
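[Quick read]: This paper addresses the gap between current AI-generated image detectors and human forensic reasoning: classifier-based detectors can judge authenticity but lack interpretability and cannot mimic an expert's detailed analysis of forgery traces. The key to the solution is ForenX, a framework that uses multimodal large language models (MLLMs) for forgery detection and adds a specialized forensic prompt that directs the model toward forgery-indicative attributes, improving generalization and producing accurate, relevant, and comprehensive explanations. The work also builds the ForgReason dataset of forgery-evidence descriptions, curated by an LLM-based agent working with human annotators, which further improves explanation quality.
Link: https://arxiv.org/abs/2508.01402
Authors: Chuangchuang Tan,Jinglu Wang,Xiang Ming,Renshuai Tao,Yunchao Wei,Yao Zhao,Yan Lu
Affiliation: Beijing Jiaotong University; Microsoft Research Asia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: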
Abstract:Advances in generative models have led to AI-generated images visually indistinguishable from authentic ones. Despite numerous studies on detecting AI-generated images with classifiers, a gap persists between such methods and human cognitive forensic analysis. We present ForenX, a novel method that not only identifies the authenticity of images but also provides explanations that resonate with human thoughts. ForenX employs the powerful multimodal large language models (MLLMs) to analyze and interpret forensic cues. Furthermore, we overcome the limitations of standard MLLMs in detecting forgeries by incorporating a specialized forensic prompt that directs the MLLM's attention to forgery-indicative attributes. This approach not only enhances the generalization of forgery detection but also empowers the MLLMs to provide explanations that are accurate, relevant, and comprehensive. Additionally, we introduce ForgReason, a dataset dedicated to descriptions of forgery evidence in AI-generated images. Curated through collaboration between an LLM-based agent and a team of human annotators, this process provides refined data that further enhances our model’s performance. We demonstrate that even limited manual annotations significantly improve explanation quality. We evaluate the effectiveness of ForenX on two major benchmarks. The model’s explainability is verified by comprehensive subjective evaluations.
[CV-172] Spatial-Frequency Aware for Object Detection in RAW Image
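[Quick read]: This paper addresses the suppression of crucial object details in RAW-based object detection, caused by RAW data's wide dynamic range and linear response. Existing enhancement methods mostly operate in the spatial domain and struggle to recover details from the skewed pixel distribution of RAW images. The key to the solution is SFAE, a spatial-frequency aware RAW image enhancement framework with three contributions: it "spatializes" frequency bands, inverse-transforming each band into an interpretable spatial map to preserve physical intuition; it adds a cross-domain fusion attention module for deep multimodal interaction between these maps and the spatial features; and it predicts and applies separate gamma parameters for adaptive nonlinear adjustment of the two domains.
Link: https://arxiv.org/abs/2508.01396
Authors: Zhuohua Ye,Liming Zhang,Hongru Han
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: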
Abstract:Direct RAW-based object detection offers great promise by utilizing RAW data (unprocessed sensor data), but faces inherent challenges due to its wide dynamic range and linear response, which tends to suppress crucial object details. In particular, existing enhancement methods are almost all performed in the spatial domain, making it difficult to effectively recover these suppressed details from the skewed pixel distribution of RAW images. To address this limitation, we turn to the frequency domain, where features, such as object contours and textures, can be naturally separated based on frequency. In this paper, we propose Space-Frequency Aware RAW Image Object Detection Enhancer (SFAE), a novel framework that synergizes spatial and frequency representations. Our contribution is threefold. The first lies in the "spatialization" of frequency bands. Different from the traditional paradigm of directly manipulating abstract spectra in deep networks, our method inversely transforms individual frequency bands back into tangible spatial maps, thus preserving direct physical intuition. Then the cross-domain fusion attention module is developed to enable deep multimodal interactions between these maps and the original spatial features. Finally, the framework performs adaptive nonlinear adjustments by predicting and applying different gamma parameters for the two domains.
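The "spatialization" idea can be sketched as masking the spectrum into bands and inverse-transforming each band back to a spatial map, followed by a gamma curve. The radial band layout, the `radii` values, the fixed gamma, and the function names are assumptions for illustration; in the paper the gamma parameters are predicted by the network rather than fixed.

```python
import numpy as np

def spatialize_bands(image, radii):
    """Split a 2D image into radial frequency bands and return each band as a
    spatial map (inverse FFT of the masked spectrum)."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)     # distance from the DC component
    edges = [0.0, *radii, np.inf]
    maps = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (dist >= lo) & (dist < hi)
        maps.append(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real)
    return maps

def apply_gamma(x, gamma):
    """Gamma curve on values normalized to [0, 1]."""
    return np.clip(x, 0.0, 1.0) ** gamma

rng = np.random.default_rng(0)
raw = rng.random((16, 16)) ** 3          # skewed toward dark values, like linear RAW
band_maps = spatialize_bands(raw, radii=[2.0, 6.0])
brightened = apply_gamma(raw, 0.45)
```

Because the band masks partition the spectrum, the spatial maps sum back to the original image, so each map is a directly viewable component rather than an abstract spectrum.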
[CV-173] Open-Attribute Recognition for Person Retrieval: Finding People Through Distinctive and Novel Attributes
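[Quick read]: This paper addresses the limits of pedestrian attribute recognition (PAR) in open-world settings: conventional methods rest on a closed-set assumption and cannot handle attribute classes unseen during training, and the attributes in existing benchmarks are often generic and weakly discriminative, which hampers precise person retrieval. The key to the solution is the Open-Attribute Recognition for Person Retrieval (OAPR) task, supported by a framework that learns generalizable body-part representations covering a broad range of attribute categories, so that people can be retrieved from attribute cues whether or not those attributes were seen in training. Four widely used datasets are reconstructed for the task, and experiments demonstrate its necessity and the framework's effectiveness.
Link: https://arxiv.org/abs/2508.01389
Authors: Minjeong Park,Hongbeen Park,Sangwon Lee,Yoonha Jang,Jinkyu Kim
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: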
Abstract:Pedestrian Attribute Recognition (PAR) plays a crucial role in various vision tasks such as person retrieval and identification. Most existing attribute-based retrieval methods operate under the closed-set assumption that all attribute classes are consistently available during both training and inference. However, this assumption limits their applicability in real-world scenarios where novel attributes may emerge. Moreover, predefined attributes in benchmark datasets are often generic and shared across individuals, making them less discriminative for retrieving the target person. To address these challenges, we propose the Open-Attribute Recognition for Person Retrieval (OAPR) task, which aims to retrieve individuals based on attribute cues, regardless of whether those attributes were seen during training. To support this task, we introduce a novel framework designed to learn generalizable body part representations that cover a broad range of attribute categories. Furthermore, we reconstruct four widely used datasets for open-attribute recognition. Comprehensive experiments on these datasets demonstrate the necessity of the OAPR task and the effectiveness of our framework. The source code and pre-trained models will be publicly available upon publication.
[CV-174] Video-based Vehicle Surveillance in the Wild: License Plate Make and Model Recognition with Self Reflective Vision-Language Models
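[Quick read]: This paper addresses automatic license plate recognition (ALPR) and vehicle make and model recognition in non-fixed settings, such as video from handheld smartphones or non-static vehicle-mounted cameras, where frequent camera motion, changing viewpoints, occlusions, and unknown road geometry arise. Traditional methods that depend on specialized hardware and handcrafted OCR pipelines degrade under these conditions. The paper proposes a solution based on large vision-language models (VLMs): it first selects sharp frames, then sends a multimodal prompt to the VLM for recognition; for make and model recognition it adds a self-reflection module that contrasts the query with reference images from a 134-class dataset and corrects mismatches. The approach needs no complex preprocessing and is cost-effective. On a smartphone dataset collected on a university campus it reaches 91.67% ALPR accuracy and 66.67% make and model accuracy, and the self-reflection module improves the latter by 5.72% on average.
Link: https://arxiv.org/abs/2508.01387
Authors: Pouya Parsa,Keya Li,Kara M. Kockelman,Seongjin Choi
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 6 figures, 4 tables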
Abstract:Automatic license plate recognition (ALPR) and vehicle make and model recognition underpin intelligent transportation systems, supporting law enforcement, toll collection, and post-incident investigation. Applying these methods to videos captured by handheld smartphones or non-static vehicle-mounted cameras presents unique challenges compared to fixed installations, including frequent camera motion, varying viewpoints, occlusions, and unknown road geometry. Traditional ALPR solutions, dependent on specialized hardware and handcrafted OCR pipelines, often degrade under these conditions. Recent advances in large vision-language models (VLMs) enable direct recognition of textual and semantic attributes from arbitrary imagery. This study evaluates the potential of VLMs for ALPR and make and model recognition using monocular videos captured with handheld smartphones and non-static mounted cameras. The proposed license plate recognition pipeline filters to sharp frames, then sends a multimodal prompt to a VLM using several prompt strategies. The make and model recognition pipeline runs the same VLM with a revised prompt and an optional self-reflection module. In the self-reflection module, the model contrasts the query image with a reference from a 134-class dataset, correcting mismatches. Experiments on a smartphone dataset collected on the campus of the University of Texas at Austin achieve top-1 accuracies of 91.67% for ALPR and 66.67% for make and model recognition. On the public UFPR-ALPR dataset, the approach attains 83.05% and 61.07%, respectively. The self-reflection module further improves results by 5.72% on average for make and model recognition. These findings demonstrate that VLMs provide a cost-effective solution for scalable, in-motion traffic video analysis.
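The abstract says the pipeline "filters to sharp frames" without naming a sharpness measure. A common choice is the variance of the Laplacian, sketched here as an assumption; `laplacian_variance`, `sharpest_frames`, and the toy frames are illustrative, not the study's code.

```python
import numpy as np

def laplacian_variance(gray):
    """Sharpness score: variance of a 4-neighbour Laplacian over interior pixels."""
    lap = (gray[:-2, 1:-1] + gray[2:, 1:-1] + gray[1:-1, :-2] + gray[1:-1, 2:]
           - 4.0 * gray[1:-1, 1:-1])
    return float(lap.var())

def sharpest_frames(frames, top_k=1):
    """Indices of the top_k frames by sharpness, returned in temporal order."""
    scores = [laplacian_variance(f) for f in frames]
    best = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(best)

checker = np.indices((8, 8)).sum(axis=0) % 2 * 1.0      # crisp checkerboard
blurred = np.full((8, 8), 0.5)                           # fully smeared version
frames = [blurred, checker, blurred]
picked = sharpest_frames(frames)
```

Motion blur removes high-frequency detail, so blurred frames score near zero and only the crisp frame would be forwarded to the VLM.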
[CV-175] Construction of Digital Terrain Maps from Multi-view Satellite Imagery using Neural Volume Rendering
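[Quick read]: This paper addresses the cumbersome workflow and heavy manual image preprocessing required by current multi-view stereo pipelines for producing digital terrain maps (DTMs). The key to the solution is adapting neural volume rendering: the proposed neural terrain maps (NTM) method needs no depth or other structural priors and requires only the locus of each image pixel to learn textured DTMs directly from satellite imagery. Experiments on real and synthetic data from Earth and Mars show terrain prediction precision close to the resolution of the satellite images, even with imperfect camera intrinsics and extrinsics.
Link: https://arxiv.org/abs/2508.01386
Authors: Josef X. Biberstein,Guilherme Cavalheiro,Juyeop Han,Sertac Karaman
Affiliation: Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: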
Abstract:Digital terrain maps (DTMs) are an important part of planetary exploration, enabling operations such as terrain relative navigation during entry, descent, and landing for spacecraft and aiding in navigation on the ground. As robotic exploration missions become more ambitious, the need for high quality DTMs will only increase. However, producing DTMs via multi-view stereo pipelines for satellite imagery, the current state-of-the-art, can be cumbersome and require significant manual image preprocessing to produce satisfactory results. In this work, we seek to address these shortcomings by adapting neural volume rendering techniques to learn textured digital terrain maps directly from satellite imagery. Our method, neural terrain maps (NTM), only requires the locus for each image pixel and does not rely on depth or any other structural priors. We demonstrate our method on both synthetic and real satellite data from Earth and Mars encompassing scenes on the order of 100 km². We evaluate the accuracy of our output terrain maps by comparing with existing high-quality DTMs produced using traditional multi-view stereo pipelines. Our method shows promising results, with the precision of terrain prediction almost equal to the resolution of the satellite images even in the presence of imperfect camera intrinsics and extrinsics.
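Volume rendering recovers a surface along each pixel's ray from a density field. The sketch below uses the standard NeRF-style quadrature for expected termination depth on a single ray; it is a generic illustration, not the NTM implementation, and the density profile and sample spacing are made up.

```python
import numpy as np

def render_depth(densities, depths):
    """Expected termination depth along one ray via NeRF-style quadrature."""
    deltas = np.diff(depths, append=depths[-1] + (depths[-1] - depths[-2]))
    alpha = 1.0 - np.exp(-densities * deltas)                 # opacity per sample
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = transmittance * alpha
    return float((weights * depths).sum() / weights.sum()), weights

depths = np.linspace(0.0, 10.0, 101)               # sample positions along the ray
densities = np.where(depths >= 5.95, 50.0, 0.0)     # empty space, then solid terrain
surface_depth, weights = render_depth(densities, depths)
```

For a ray looking down from a satellite, the recovered depth converts to a terrain elevation at the point where the ray meets the ground, and the same weights can blend per-sample colors into the texture.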
[CV-176] Lightweight Backbone Networks Only Require Adaptive Lightweight Self-Attention Mechanisms
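[Quick read]: This paper addresses the imbalance in computational efficiency between convolutional neural networks (CNNs) and attention mechanisms in lightweight hybrid backbones. Linear attention variants still fall short on long-sequence modeling, while existing lightweight SoftMax attention shrinks the feature map to a fixed size, a reduction ratio that is cumbersome to choose and does not eliminate computational saturation. The key to the solution is Fast Window Attention (FWA), a lightweight SoftMax attention mechanism with adaptive feature-map sizes that generates a small number of key sequences (Key and Value) by window aggregation for the attention computation; the paper also explains why ReLU can reasonably stand in for SoftMax in lightweight global attention. A global-local feature fusion mechanism combined with GhostNet gives the lightweight hybrid backbone LOLViT, which improves both inference speed and accuracy.
Link: https://arxiv.org/abs/2508.01385
Authors: Fengyun Li,Chao Zheng,Yangyang Fang,Jialiang Lan,Jianhua Liang,Luhao Zhang,Fa Si
Affiliation: Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: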
Abstract:Currently, lightweight hybrid backbone networks have partially alleviated the issue of computational saturation, but the imbalance in computational efficiency between convolutional neural networks (CNNs) and attention mechanisms is becoming increasingly apparent. Specifically, although linear attention mechanisms and their variants have made progress in lightweight design, they still fail to meet the demands of hybrid models for long-sequence modeling. On the other hand, existing lightweight SoftMax attention computations typically reduce the feature map to a fixed size to decrease the number of sequences, thereby compressing the computational scale. However, the process of determining the feature map reduction ratio is cumbersome, and computational saturation issues still persist. To address this issue, this paper proposes a lightweight SoftMax attention mechanism with adaptive feature map sizes, named Fast Window Attention (FWA), which generates a small number of key sequences (Key and Value) through window aggregation for attention computation. Additionally, it explains the rationality of using ReLU to simulate SoftMax operations in lightweight global attention mechanisms. Finally, the paper designs a global-local feature fusion mechanism and combines it with GhostNet to propose a lightweight hybrid backbone network, LOLViT. Through visual tasks such as classification (ImageNet 1K), detection (COCO 2017), and segmentation (BDD100K), along with extensive ablation studies, it is demonstrated that LOLViT outperforms CNN models of the same level in both inference speed and model accuracy. Notably, the inference speed of LOLViT-X is 5x that of MobileViT-X.
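The window-aggregation idea in this abstract can be illustrated with a small sketch. It is not the authors' FWA implementation: average pooling as the aggregation, a 1D token sequence, and normalizing the ReLU scores by their sum are all assumptions made here, and the function names are hypothetical.

```python
def window_aggregate(tokens, window):
    """Average-pool consecutive tokens into one token per window (assumed aggregation)."""
    pooled = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return pooled


def relu_window_attention(queries, tokens, window, eps=1e-6):
    """Attend from every query to the pooled keys/values, with ReLU in place of SoftMax."""
    keys = values = window_aggregate(tokens, window)
    dim = len(values[0])
    outputs = []
    for q in queries:
        # ReLU keeps only non-negative similarities; dividing by the sum normalizes them.
        scores = [max(0.0, sum(a * b for a, b in zip(q, k))) for k in keys]
        total = sum(scores) + eps
        weights = [s / total for s in scores]
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs
```

With a window of size w, each query compares against len(tokens) / w keys instead of len(tokens), which is where the saving in attention cost comes from in this sketch.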
[CV-177] A Full-Stage Refined Proposal Algorithm for Suppressing False Positives in Two-Stage CNN-Based Detection Methods
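[Quick read]: This paper targets false positives in pedestrian detection, especially in two-stage CNN-based frameworks where low-quality proposals cause false positives that are hard to suppress. The key to its solution is a Full-stage Refined Proposal (FRP) algorithm that applies different pedestrian feature re-evaluation strategies during training and inference. In training, Training mode FRP (TFRP) uses a novel proposal validation mechanism to guide the model toward more robust pedestrian features. In inference, Classifier-guided FRP (CFRP) embeds a pedestrian classifier in the proposal generation pipeline to select high-quality proposals, and Split-proposal FRP (SFRP) vertically splits each proposal and compares sub-region confidence scores to filter out low-confidence proposals, strengthening false-positive suppression at every stage.
Link: https://arxiv.org/abs/2508.01382
Authors: Qiang Guo, Rubo Zhang, Bingbing Zhang, Junjie Liu, Jianqing Liu
Affiliations: Dalian Minzu University; Dalian University of Technology; Dalian Rijia Electronics Co., Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: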
Abstract:False positives in pedestrian detection remain a challenge that has yet to be effectively resolved. To address this issue, this paper proposes a Full-stage Refined Proposal (FRP) algorithm aimed at eliminating these false positives within a two-stage CNN-based pedestrian detection framework. The main innovation of this work lies in employing various pedestrian feature re-evaluation strategies to filter out low-quality pedestrian proposals during both the training and testing stages. Specifically, in the training phase, the Training mode FRP algorithm (TFRP) introduces a novel approach for validating pedestrian proposals to effectively guide the model training process, thereby constructing a model with strong capabilities for false positive suppression. During the inference phase, two innovative strategies are implemented: the Classifier-guided FRP (CFRP) algorithm integrates a pedestrian classifier into the proposal generation pipeline to yield high-quality proposals through pedestrian feature evaluation, and the Split-proposal FRP (SFRP) algorithm vertically divides all proposals, sending both the original and the sub-region proposals to the subsequent subnetwork to evaluate their confidence scores, filtering out those with lower sub-region pedestrian confidence scores. As a result, the proposed algorithm enhances the model’s ability to suppress pedestrian false positives across all stages. Various experiments conducted on multiple benchmarks and the SY-Metro datasets demonstrate that the model, supported by different combinations of the FRP algorithm, can effectively eliminate false positives to varying extents. Furthermore, experiments conducted on embedded platforms underscore the algorithm’s effectiveness in enhancing the comprehensive pedestrian detection capabilities of the small pedestrian detector in resource-constrained edge devices.
[CV-178] ReMu: Reconstructing Multi-layer 3D Clothed Human from Image Layers BMVC2025
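[Quick read]: This paper addresses the dependence of multi-layer 3D garment reconstruction on expensive multi-view capture setups and specialized 3D editing, in support of life-like clothed human avatars. Its key solution is a new capture setup, Image Layers, recorded with a single RGB camera, together with a unified 3D representation: each garment layer is reconstructed and aligned in a shared coordinate system defined by the canonical body pose, a collision-aware optimization then resolves interpenetration between layers, and implicit neural fields further refine the garment boundaries. The result is template-free, category-agnostic multi-layer garment reconstruction that yields nearly penetration-free 3D clothed humans.
Link: https://arxiv.org/abs/2508.01381
Authors: Onat Vuran, Hsuan-I Ho
Affiliations: ETH Zürich
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: BMVC 2025 paper, 17 pages, 10 figures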
Abstract:The reconstruction of multi-layer 3D garments typically requires expensive multi-view capture setups and specialized 3D editing efforts. To support the creation of life-like clothed human avatars, we introduce ReMu for reconstructing multi-layer clothed humans in a new setup, Image Layers, which captures a subject wearing different layers of clothing with a single RGB camera. To reconstruct physically plausible multi-layer 3D garments, a unified 3D representation is necessary to model these garments in a layered manner. Thus, we first reconstruct and align each garment layer in a shared coordinate system defined by the canonical body pose. Afterwards, we introduce a collision-aware optimization process to address interpenetration and further refine the garment boundaries leveraging implicit neural fields. It is worth noting that our method is template-free and category-agnostic, which enables the reconstruction of 3D garments in diverse clothing styles. Through our experiments, we show that our method reconstructs nearly penetration-free 3D clothed humans and achieves competitive performance compared to category-specific methods. Project page: this https URL
[CV-179] Effective Damage Data Generation by Fusing Imagery with Human Knowledge Using Vision-Language Models
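[Quick read]: This paper addresses the accuracy and generalization of damage assessment in humanitarian assistance and disaster response (HADR), where the core challenges are class imbalance, scarce moderate-damage examples, and human error in pixel-level labeling. The key to its solution is using vision-language models (VLMs) to fuse imagery with human knowledge understanding, generating a diverse, high-quality set of image-based damage data that improves classification of damage levels for buildings, roads, and infrastructure.
Link: https://arxiv.org/abs/2508.01380
Authors: Jie Wei, Erika Ardiles-Cruz, Aleksey Panasyuk, Erik Blasch
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, IEEE NAECON’25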
Abstract:It is of crucial importance to assess damages promptly and accurately in humanitarian assistance and disaster response (HADR). Current deep learning approaches struggle to generalize effectively due to the imbalance of data classes, scarcity of moderate damage examples, and human inaccuracy in pixel labeling during HADR situations. To accommodate for these limitations and exploit state-of-the-art techniques in vision-language models (VLMs) to fuse imagery with human knowledge understanding, there is an opportunity to generate a diversified set of image-based damage data effectively. Our initial experimental results suggest encouraging data generation quality, which demonstrates an improvement in classifying scenes with different levels of structural damage to buildings, roads, and infrastructures.
[CV-180] Predicting Video Slot Attention Queries from Random Slot-Feature Pairs
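[Quick read]: This paper addresses two problems in current unsupervised video Object-Centric Learning (OCL) methods: they neglect next-frame features, the most informative source for query prediction, and they fail to learn transition dynamics, the knowledge essential for predicting future queries. The core of the solution is a new architecture, RandSF.Q. It introduces a transitioner that combines slots and features to enrich the information available for query prediction, and it trains that transitioner to predict queries from slot-feature pairs randomly sampled from the available recurrences, which forces it to learn transition dynamics. Experiments show clear gains over existing methods on tasks such as object discovery, setting a new state of the art.
Link: https://arxiv.org/abs/2508.01345
Authors: Rongzhen Zhao, Jian Li, Juho Kannala, Joni Pajarinen
Affiliations: Aalto University; Renmin University of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: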
Abstract:Unsupervised video Object-Centric Learning (OCL) is promising as it enables object-level scene representation and dynamics modeling as we humans do. Mainstream video OCL methods adopt a recurrent architecture: An aggregator aggregates current video frame into object features, termed slots, under some queries; A transitioner transits current slots to queries for the next frame. This is an effective architecture but all existing implementations both (i1) neglect to incorporate next frame features, the most informative source for query prediction, and (i2) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (t1) We design a new transitioner to incorporate both slots and features, which provides more information for query prediction; (t2) We train the transitioner to predict queries from slot-feature pairs randomly sampled from available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpasses existing video OCL methods significantly, e.g., up to 10 points on object discovery, setting new state-of-the-art. Such superiority also benefits downstream tasks like dynamics modeling. Our core source code and training logs are available as the supplement.
[CV-181] SBP-YOLO:A Lightweight Real-Time Model for Detecting Speed Bumps and Potholes
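[Quick read]: This paper addresses accurate real-time detection of road obstacles such as speed bumps and potholes in new energy vehicles, in support of predictive suspension control and better ride comfort. The key to its solution is SBP-YOLO, a lightweight detection framework built on YOLOv11. It uses GhostConv for efficient computation, VoVGSCSPC for multi-scale feature enhancement, and a Lightweight Efficiency Detection Head (LEDH) that reduces early-stage feature processing cost. A hybrid training strategy combining NWD loss, knowledge distillation, and Albumentations-based augmentation improves detection robustness for small and distant targets.
Link: https://arxiv.org/abs/2508.01339
Authors: Chuanqi Liang, Jie Fu, Lei Luo, Miao Yu
Affiliations: Chongqing University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 10 figures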
Abstract:With increasing demand for ride comfort in new energy vehicles, accurate real-time detection of speed bumps and potholes is critical for predictive suspension control. This paper proposes SBP-YOLO, a lightweight detection framework based on YOLOv11, optimized for embedded deployment. The model integrates GhostConv for efficient computation, VoVGSCSPC for multi-scale feature enhancement, and a Lightweight Efficiency Detection Head (LEDH) to reduce early-stage feature processing costs. A hybrid training strategy combining NWD loss, knowledge distillation, and Albumentations-based weather augmentation improves detection robustness, especially for small and distant targets. Experiments show SBP-YOLO achieves 87.0% mAP (outperforming YOLOv11n by 5.8%) and runs at 139.5 FPS on a Jetson AGX Xavier with TensorRT FP16 quantization. The results validate its effectiveness for real-time road condition perception in intelligent suspension systems.
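The NWD loss named in this abstract most likely refers to the Normalized Gaussian Wasserstein Distance used for tiny-object detection; that reading is an assumption, as is how SBP-YOLO weights or combines it with other losses. A minimal sketch of the distance for axis-aligned boxes given as (cx, cy, w, h), with the normalizing constant `c` a dataset-dependent value chosen here purely for illustration:

```python
import math


def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein distance between two (cx, cy, w, h) boxes.

    Each box is modeled as a 2D Gaussian with mean (cx, cy) and covariance
    diag(w^2 / 4, h^2 / 4); c is an illustrative normalizing constant.
    """
    (cxa, cya, wa, ha), (cxb, cyb, wb, hb) = box_a, box_b
    # Squared 2-Wasserstein distance between the two Gaussians.
    w2_sq = (cxa - cxb) ** 2 + (cya - cyb) ** 2 + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2
    return math.exp(-math.sqrt(w2_sq) / c)


def nwd_loss(pred_box, target_box, c=12.8):
    """Loss is zero for identical boxes and approaches 1 as they move apart."""
    return 1.0 - nwd(pred_box, target_box, c)
```

Unlike IoU, this measure stays smooth and non-zero when two small boxes do not overlap at all, which is the usual motivation for using it on small and distant targets.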
[CV-182] Weakly-Supervised Image Forgery Localization via Vision-Language Collaborative Reasoning Framework
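[Quick read]: This paper addresses weakly supervised image forgery localization (WSIFL), which avoids the high cost of pixel-level annotation, and the limited localization accuracy of existing methods that exploit only intra-image consistency cues without external semantic guidance. The key to its solution is ViLaCo, a vision-language collaborative reasoning framework that distills auxiliary semantic supervision from pre-trained vision-language models (VLMs), enabling pixel-level localization from image-level labels alone. A vision-language feature modeling network first jointly extracts textual and visual priors; an adaptive vision-language reasoning network then aligns semantic and visual features into semantically consistent representations; dual prediction heads perform image-level classification and pixel-level mask generation, bridging weak supervision and fine-grained localization. A contrastive patch consistency module additionally clusters tampered regions and separates authentic ones for more reliable forgery discrimination.
Link: https://arxiv.org/abs/2508.01338
Authors: Ziqi Sheng, Junyan Wu, Wei Lu, Jiantao Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: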
Abstract:Image forgery localization aims to precisely identify tampered regions within images, but it commonly depends on costly pixel-level annotations. To alleviate this annotation burden, weakly supervised image forgery localization (WSIFL) has emerged, yet existing methods still achieve limited localization performance as they mainly exploit intra-image consistency clues and lack external semantic guidance to compensate for weak supervision. In this paper, we propose ViLaCo, a vision-language collaborative reasoning framework that introduces auxiliary semantic supervision distilled from pre-trained vision-language models (VLMs), enabling accurate pixel-level localization using only image-level labels. Specifically, ViLaCo first incorporates semantic knowledge through a vision-language feature modeling network, which jointly extracts textual and visual priors using pre-trained VLMs. Next, an adaptive vision-language reasoning network aligns textual semantics and visual features through mutual interactions, producing semantically aligned representations. Subsequently, these representations are passed into dual prediction heads, where the coarse head performs image-level classification and the fine head generates pixel-level localization masks, thereby bridging the gap between weak supervision and fine-grained localization. Moreover, a contrastive patch consistency module is introduced to cluster tampered features while separating authentic ones, facilitating more reliable forgery discrimination. Extensive experiments on multiple public datasets demonstrate that ViLaCo substantially outperforms existing WSIFL methods, achieving state-of-the-art performance in both detection and localization accuracy.
[CV-183] StyleSentinel: Reliable Artistic Copyright Verification via Stylistic Fingerprints
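[Quick read]: This paper addresses the unauthorized use of personal artwork enabled by diffusion models that generate customized images, a threat to artists' intellectual property. Existing methods embed additional information (perturbations, watermarks, backdoors) but offer limited defense and cannot protect artwork already published online. The key to its solution is StyleSentinel, which verifies copyright by extracting the inherent stylistic fingerprint of an artist's work. A semantic self-reconstruction process first enhances stylistic expressiveness, building a dense, style-consistent manifold for feature learning; multi-layer image features are then adaptively fused to encode abstract artistic style into a compact stylistic fingerprint; finally, the target artist's style is modeled as a minimal enclosing hypersphere boundary in feature space, turning copyright verification into a robust one-class learning task.
Link: https://arxiv.org/abs/2508.01335
Authors: Lingxiao Chen, Liqin Wang, Wei Lu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: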
Abstract:The versatility of diffusion models in generating customized images has led to unauthorized usage of personal artwork, which poses a significant threat to the intellectual property of artists. Existing approaches relying on embedding additional information, such as perturbations, watermarks, and backdoors, suffer from limited defensive capabilities and fail to protect artwork published online. In this paper, we propose StyleSentinel, an approach for copyright protection of artwork by verifying an inherent stylistic fingerprint in the artist’s artwork. Specifically, we employ a semantic self-reconstruction process to enhance stylistic expressiveness within the artwork, which establishes a dense and style-consistent manifold foundation for feature learning. Subsequently, we adaptively fuse multi-layer image features to encode abstract artistic style into a compact stylistic fingerprint. Finally, we model the target artist’s style as a minimal enclosing hypersphere boundary in the feature space, transforming complex copyright verification into a robust one-class learning task. Extensive experiments demonstrate that compared with the state-of-the-art, StyleSentinel achieves superior performance on the one-sample verification task. We also demonstrate the effectiveness through online platforms.
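The final step of this abstract, treating verification as a one-class test against a hypersphere, can be sketched as follows. This is a simplification and not the paper's method: the center is taken as the mean fingerprint and the radius as the largest distance to it, which encloses all samples but is generally not the minimal enclosing ball. Fingerprints are assumed to be plain float vectors, and the function names are hypothetical.

```python
import math


def fit_hypersphere(fingerprints):
    """Return (center, radius) of a ball that encloses all training fingerprints."""
    dim = len(fingerprints[0])
    center = [sum(f[d] for f in fingerprints) / len(fingerprints) for d in range(dim)]
    # Largest distance to the mean: enclosing, though not necessarily minimal.
    radius = max(math.dist(f, center) for f in fingerprints)
    return center, radius


def is_same_style(fingerprint, center, radius):
    """One-class decision: accept only fingerprints that fall inside the ball."""
    return math.dist(fingerprint, center) <= radius
```

Only the target artist's own works are needed to fit the boundary; anything outside it is rejected without ever training on other artists' styles.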
[CV-184] Zero-shot Segmentation of Skin Conditions: Erythema with Edit-Friendly Inversion
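[Quick read]: This paper addresses the dependence of automatic erythema detection on large labeled datasets, aiming for accurate segmentation without expert-annotated training masks. The key to its solution is edit-friendly inversion in diffusion models: the method generates erythema-free reference images of the same patient, aligns them precisely with the originals, and identifies erythematous regions through color-space analysis with minimal user intervention. It requires no annotated training data, which improves the scalability and flexibility of diagnostic support tools.
Link: https://arxiv.org/abs/2508.01334
Authors: Konstantinos Moutselos, Ilias Maglogiannis
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: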
Abstract:This study proposes a zero-shot image segmentation framework for detecting erythema (redness of the skin) using edit-friendly inversion in diffusion models. The method synthesizes reference images of the same patient that are free from erythema via generative editing and then accurately aligns these references with the original images. Color-space analysis is performed with minimal user intervention to identify erythematous regions. This approach significantly reduces the reliance on labeled dermatological datasets while providing a scalable and flexible diagnostic support tool by avoiding the need for any annotated training masks. In our initial qualitative experiments, the pipeline successfully isolated facial erythema in diverse cases, demonstrating performance improvements over baseline threshold-based techniques. These results highlight the potential of combining generative diffusion models and statistical color segmentation for computer-aided dermatology, enabling efficient erythema detection without prior training data.
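The comparison between an image and its erythema-free reference can be sketched as a per-pixel redness difference. The abstract does not say which color space or statistic the authors use, so the redness index R - (G + B) / 2, the fixed threshold, and the function names below are assumptions for illustration only. Images are nested lists of (R, G, B) tuples that are already aligned.

```python
def redness(pixel):
    """Simple redness index for an (R, G, B) pixel (assumed, not from the paper)."""
    r, g, b = pixel
    return r - (g + b) / 2.0


def erythema_mask(original, reference, threshold=30.0):
    """Mark pixels that are clearly redder in the original than in the reference."""
    return [
        [redness(p_orig) - redness(p_ref) > threshold for p_orig, p_ref in zip(row_o, row_r)]
        for row_o, row_r in zip(original, reference)
    ]
```

Because the reference is synthesized from the same patient, the difference cancels out baseline skin tone, so no labeled masks are needed to set a per-patient baseline.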
[CV-185] Referring Remote Sensing Image Segmentation with Cross-view Semantics Interaction Network
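[Quick read]: This paper addresses the performance bottleneck of existing methods on tiny, ambiguous targets when object scale varies drastically in remote sensing images. The key to its solution is a parallel yet unified segmentation framework, the Cross-view Semantics Interaction Network (CSINet), which mimics how humans observe targets by combining visual cues from remote and close views for joint prediction. In each encoding stage, a Cross-View Window-attention module (CVWin) supplements global and local semantics into the features of the two branches to promote a unified representation. A Collaboratively Dilated Attention enhanced Decoder (CDAD) mines the orientation property of targets and fuses cross-view multiscale features. Together these improve segmentation accuracy in complex scenes while keeping inference speed satisfactory.
Link: https://arxiv.org/abs/2508.01331
Authors: Jiaxing Yang, Lihe Zhang, Huchuan Lu
Affiliations: Dalian University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: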
Abstract:Recently, Referring Remote Sensing Image Segmentation (RRSIS) has aroused wide attention. To handle drastic scale variation of remote targets, existing methods only use the full image as input and nest the saliency-preferring techniques of cross-scale information interaction into traditional single-view structure. Although effective for visually salient targets, they still struggle in handling tiny, ambiguous ones in lots of real scenarios. In this work, we instead propose a paralleled yet unified segmentation framework Cross-view Semantics Interaction Network (CSINet) to solve the limitations. Motivated by human behavior in observing targets of interest, the network orchestrates visual cues from remote and close distances to conduct synergistic prediction. In its every encoding stage, a Cross-View Window-attention module (CVWin) is utilized to supplement global and local semantics into close-view and remote-view branch features, finally promoting the unified representation of feature in every encoding stage. In addition, we develop a Collaboratively Dilated Attention enhanced Decoder (CDAD) to mine the orientation property of target and meanwhile integrate cross-view multiscale features. The proposed network seamlessly enhances the exploitation of global and local semantics, achieving significant improvements over others while maintaining satisfactory speed.
[CV-186] Multimodal Attention-Aware Fusion for Diagnosing Distal Myopathy: Evaluating Model Interpretability and Clinician Trust
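[Quick read]: This paper addresses the diagnostic challenge that distal myopathy, a genetically heterogeneous group of skeletal muscle disorders with diverse clinical manifestations, poses in radiology. The core problem is that existing methods struggle to deliver both accurate classification and interpretable decision support. The key to its solution is a novel multimodal attention-aware fusion architecture: two deep learning models extract global contextual information and local details respectively, and an attention gate mechanism fuses them dynamically, improving predictive performance and interpretability. The method achieves high classification accuracy on the BUSI benchmark and a proprietary distal myopathy dataset and produces clinically relevant saliency maps. The study also finds remaining gaps in anatomical specificity and clinical usefulness, calling for richer context-aware interpretability methods and human-in-the-loop feedback to meet real-world diagnostic needs.
Link: https://arxiv.org/abs/2508.01316
Authors: Mohsen Abbaspour Onari, Lucie Charlotte Magister, Yaoxin Wu, Amalia Lupi, Dario Creazzo, Mattia Tordin, Luigi Di Donatantonio, Emilio Quaia, Chao Zhang, Isel Grau, Marco S. Nobile, Yingqian Zhang, Pietro Liò
Affiliations: Eindhoven University of Technology; Eindhoven Artificial Intelligence Systems Institute; University of Cambridge; Padua University Hospital; Ca’ Foscari University of Venice
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: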
Abstract:Distal myopathy represents a genetically heterogeneous group of skeletal muscle disorders with broad clinical manifestations, posing diagnostic challenges in radiology. To address this, we propose a novel multimodal attention-aware fusion architecture that combines features extracted from two distinct deep learning models, one capturing global contextual information and the other focusing on local details, representing complementary aspects of the input data. Uniquely, our approach integrates these features through an attention gate mechanism, enhancing both predictive performance and interpretability. Our method achieves a high classification accuracy on the BUSI benchmark and a proprietary distal myopathy dataset, while also generating clinically relevant saliency maps that support transparent decision-making in medical diagnosis. We rigorously evaluated interpretability through (1) functionally grounded metrics, coherence scoring against reference masks and incremental deletion analysis, and (2) application-grounded validation with seven expert radiologists. While our fusion strategy boosts predictive performance relative to single-stream and alternative fusion strategies, both quantitative and qualitative evaluations reveal persistent gaps in anatomical specificity and clinical usefulness of the interpretability. These findings highlight the need for richer, context-aware interpretability methods and human-in-the-loop feedback to meet clinicians’ expectations in real-world diagnostic settings.
[CV-187] P3P Made Easy
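[Quick read]: This paper addresses absolute pose estimation for a calibrated camera, that is, recovering the camera's position and orientation in the world frame from three 2D-3D correspondences (the Perspective-Three-Point, or P3P, problem). The key to its solution is reformulating the problem as a quartic polynomial whose coefficients are analytically simple and cheap to compute. The solver matches state-of-the-art accuracy and runtime while being easier to interpret and implement, which suits real-time systems and teaching.
Link: https://arxiv.org/abs/2508.01312
Authors: Seong Hun Lee, Patrick Vandewalle, Javier Civera
Affiliations: EAVISE, KU Leuven; I3A, University of Zaragoza
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: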
Abstract:We present a novel algebraic solution to the Perspective-Three-Point (P3P) problem, which aims to recover the absolute pose of a calibrated camera from three 2D-3D correspondences. Our method reformulates the problem into a quartic polynomial with coefficients that are analytically simple and computationally efficient. Despite its simplicity, the proposed solver achieves accuracy and runtime performance comparable to state-of-the-art methods. Extensive experiments on synthetic datasets validate its robustness and efficiency. This combination of simplicity and performance makes our solver appealing for both real-time systems and educational contexts, where interpretability and reliability are critical.
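For readers new to P3P, the classical starting point that any quartic formulation reduces is the law-of-cosines system below. The notation is assumed here rather than taken from the paper: X_i are the three known 3D points, f_i the unit bearing vectors of their image observations in the calibrated camera, and d_i > 0 the unknown distances from the camera center.

```latex
% Assumed notation (not from the paper): X_i world points, f_i unit bearings, d_i depths.
\[
\begin{aligned}
d_1^2 + d_2^2 - 2 d_1 d_2 \cos\theta_{12} &= \lVert X_1 - X_2 \rVert^2,\\
d_1^2 + d_3^2 - 2 d_1 d_3 \cos\theta_{13} &= \lVert X_1 - X_3 \rVert^2,\\
d_2^2 + d_3^2 - 2 d_2 d_3 \cos\theta_{23} &= \lVert X_2 - X_3 \rVert^2,
\qquad \cos\theta_{ij} = f_i^{\top} f_j .
\end{aligned}
\]
```

Eliminating two of the three unknowns from this system leads to a quartic in a single variable, which is why P3P admits up to four real solutions. The specific coefficients of the paper's quartic are not given in the abstract and are not reproduced here.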
[CV-188] C3D-AD: Toward Continual 3D Anomaly Detection via Kernel Attention with Learnable Advisor
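[Quick read]: This paper addresses the inability of existing 3D Anomaly Detection (3D AD) methods, which are trained per class, to adapt to new classes and learn continually. The core challenge is learning the anomaly features of emerging classes without forgetting earlier ones. The key to its solution is a continual learning framework, C3D-AD, with three modules: (1) Kernel Attention with random feature Layer (KAL), which extracts generalized local features across diverse product types and normalizes the feature space; (2) Kernel Attention with learnable Advisor (KAA), which learns information from new categories while discarding redundant old information in both the encoder and decoder, enabling correct and continual reconstruction; (3) Reconstruction with Parameter Perturbation (RPP), which uses a representation rehearsal loss to keep representations consistent across tasks and preserve memory of earlier categories. On Real3D-AD, Anomaly-ShapeNet, and MulSen-AD the framework reaches average AUROC of 66.4%, 83.1%, and 63.4%, respectively.
Link: https://arxiv.org/abs/2508.01311
Authors: Haoquan Lu, Hanzhe Liang, Jie Zhang, Chenxi Hu, Jinbao Wang, Can Gao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: We have provided the code for C3D-AD with checkpoints and BASELINE at this link: this https URL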
Abstract:3D Anomaly Detection (AD) has shown great potential in detecting anomalies or defects of high-precision industrial products. However, existing methods are typically trained in a class-specific manner and also lack the capability of learning from emerging classes. In this study, we propose a continual learning framework named Continual 3D Anomaly Detection (C3D-AD), which can not only learn generalized representations for multi-class point clouds but also handle new classes emerging over time. Specifically, in the feature extraction module, to extract generalized local features from diverse product types of different tasks efficiently, Kernel Attention with random feature Layer (KAL) is introduced, which normalizes the feature space. Then, to reconstruct data correctly and continually, an efficient Kernel Attention with learnable Advisor (KAA) mechanism is proposed, which learns the information from new categories while discarding redundant old information within both the encoder and decoder. Finally, to keep the representation consistency over tasks, a Reconstruction with Parameter Perturbation (RPP) module is proposed by designing a representation rehearsal loss function, which ensures that the model remembers previous category information and returns category-adaptive representation. Extensive experiments on three public datasets demonstrate the effectiveness of the proposed method, achieving an average performance of 66.4%, 83.1%, and 63.4% AUROC on Real3D-AD, Anomaly-ShapeNet, and MulSen-AD, respectively.
[CV-189] Domain Generalized Stereo Matching with Uncertainty-guided Data Augmentation AAAI2026
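[Quick read]: This paper addresses the poor real-world generalization of stereo matching (SM) models trained on synthetic data, caused by domain differences (color, illumination, contrast, texture) that lead models to learn domain-dependent shortcuts. The key to its solution is Uncertainty-guided Data Augmentation (UgDA): it perturbs image statistics in RGB space (mean and standard deviation) to simulate unseen domains, and models the uncertainty of perturbation direction and intensity with Gaussian distributions built on batch-level statistics. It also enforces feature consistency between original and augmented images of the same scene, so the model learns structure-aware, shortcut-invariant features. The method is simple, architecture-agnostic, and can be integrated into any SM network.
Link: https://arxiv.org/abs/2508.01303
Authors: Shuangli Du, Jing Wang, Minghua Zhao, Zhenyu Xu, Jie Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures, submitted to AAAI 2026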
Abstract:State-of-the-art stereo matching (SM) models trained on synthetic data often fail to generalize to real data domains due to domain differences, such as color, illumination, contrast, and texture. To address this challenge, we leverage data augmentation to expand the training domain, encouraging the model to acquire robust cross-domain feature representations instead of domain-dependent shortcuts. This paper proposes an uncertainty-guided data augmentation (UgDA) method, which argues that the image statistics in RGB space (mean and standard deviation) carry the domain characteristics. Thus, samples in unseen domains can be generated by properly perturbing these statistics. Furthermore, to simulate more potential domains, Gaussian distributions founded on batch-level statistics are proposed to model the uncertainty of perturbation direction and intensity. Additionally, we further enforce feature consistency between original and augmented data for the same scene, encouraging the model to learn structure-aware, shortcut-invariant feature representations. Our approach is simple, architecture-agnostic, and can be integrated into any SM networks. Extensive experiments on several challenging benchmarks have demonstrated that our method can significantly improve the generalization performance of existing SM networks.
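The statistic-perturbation step described in this abstract can be sketched roughly as below. The exact sampling rule is an assumption, not taken from the paper: each image's channel mean and standard deviation are shifted by standard-normal noise scaled by how much those statistics vary across the batch, and the image is re-normalized to the new statistics. The data layout (a batch is a list of images, an image a list of channels, a channel a flat list of floats) and all names are hypothetical.

```python
import random
import statistics


def channel_stats(channel, eps=1e-6):
    """Mean and (population) standard deviation of one channel; eps avoids division by zero."""
    return statistics.fmean(channel), statistics.pstdev(channel) + eps


def perturb_batch(batch, rng=random):
    """Resample per-image channel statistics, guided by their spread across the batch."""
    n_channels = len(batch[0])
    out = [[None] * n_channels for _ in batch]
    for c in range(n_channels):
        stats = [channel_stats(img[c]) for img in batch]
        # Batch-level spread of the statistics sets the scale of the Gaussian noise.
        spread_mu = statistics.pstdev([m for m, _ in stats])
        spread_sigma = statistics.pstdev([s for _, s in stats])
        for i, (img, (mu, sigma)) in enumerate(zip(batch, stats)):
            new_mu = mu + rng.gauss(0.0, 1.0) * spread_mu
            # abs() keeps the new standard deviation non-negative (a choice made here).
            new_sigma = abs(sigma + rng.gauss(0.0, 1.0) * spread_sigma)
            out[i][c] = [(x - mu) / sigma * new_sigma + new_mu for x in img[c]]
    return out
```

In a stereo setting the same perturbation would presumably need to be applied consistently to the left and right images of a pair; the abstract does not specify this, so the sketch leaves it out.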
[CV-190] GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification
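[Quick read]: This paper addresses two limits of vision-language model (VLM) based Multiple Instance Learning (MIL) for whole slide image (WSI) classification: the limited token capacity of VLMs constrains how much class description can be encoded, and clinical descriptions generated by large language models (LLMs) lack domain grounding and fine-grained medical specificity. The key to its solution is twofold: (1) a grounded multi-agent description generation system based on pathology textbooks, with specialized agents (e.g., morphology, spatial context) producing accurate and diverse clinical descriptions; (2) a text encoding strategy that uses a list of descriptions instead of a single prompt, capturing fine-grained, complementary clinical signals for better alignment with visual features.
Link: https://arxiv.org/abs/2508.01293
Authors: Ngoc Bui Lam Quang, Nam Le Nguyen Binh, Thanh-Huy Nguyen, Le Thien Phuc Nguyen, Quan Nguyen, Ulas Bagci
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: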
Abstract:Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification, enabling efficient analysis of gigapixel pathology slides. Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge through text-based class descriptions rather than simple class names. However, when these methods rely on large language models (LLMs) to generate clinical descriptions or use fixed-length prompts to represent complex pathology concepts, the limited token capacity of VLMs often constrains the expressiveness and richness of the encoded class information. Additionally, descriptions generated solely by LLMs may lack domain grounding and fine-grained medical specificity, leading to suboptimal alignment with visual features. To address these challenges, we propose a vision-language MIL framework with two key contributions: (1) A grounded multi-agent description generation system that leverages curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate and diverse clinical descriptions; (2) A text encoding strategy using a list of descriptions rather than a single prompt, capturing fine-grained and complementary clinical signals for better alignment with visual features. Integrated into a VLM-MIL pipeline, our approach shows improved performance over single-prompt class baselines and achieves results comparable to state-of-the-art models, as demonstrated on renal and lung cancer datasets.
[CV-191] Integrating Disparity Confidence Estimation into Relative Depth Prior-Guided Unsupervised Stereo Matching
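[Quick read]: This paper addresses matching ambiguities in unsupervised stereo matching caused by repetitive patterns and texture-less regions, and the inefficiency and noise of existing knowledge transfer methods that rely on sparse correspondences. The key to its solution is a plug-and-play disparity confidence estimation algorithm plus two depth prior-guided loss functions. Local coherence consistency between neighboring disparities and their relative depths is first checked to obtain confident disparity estimates; quasi-dense correspondences are then built from confident disparities only, for efficient depth ranking learning; finally, a dual disparity smoothness loss improves accuracy at disparity discontinuities.
Link: https://arxiv.org/abs/2508.01275
Authors: Chuang-Wei Liu, Mingjian Sun, Cairong Zhao, Hanli Wang, Alexander Dvorkovich, Rui Fan
Affiliations: Tongji University; Harbin Institute of Technology; Moscow Institute of Physics and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages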
Abstract:Unsupervised stereo matching has garnered significant attention for its independence from costly disparity annotations. Typical unsupervised methods rely on the multi-view consistency assumption for training networks, which suffer considerably from stereo matching ambiguities, such as repetitive patterns and texture-less regions. A feasible solution lies in transferring 3D geometric knowledge from a relative depth map to the stereo matching networks. However, existing knowledge transfer methods learn depth ranking information from randomly built sparse correspondences, which makes inefficient utilization of 3D geometric knowledge and introduces noise from mistaken disparity estimates. This work proposes a novel unsupervised learning framework to address these challenges, which comprises a plug-and-play disparity confidence estimation algorithm and two depth prior-guided loss functions. Specifically, the local coherence consistency between neighboring disparities and their corresponding relative depths is first checked to obtain disparity confidence. Afterwards, quasi-dense correspondences are built using only confident disparity estimates to facilitate efficient depth ranking learning. Finally, a dual disparity smoothness loss is proposed to boost stereo matching performance at disparity discontinuities. Experimental results demonstrate that our method achieves state-of-the-art stereo matching accuracy on the KITTI Stereo benchmarks among all unsupervised stereo matching methods.
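One simple way to picture the local coherence check in this abstract is to ask, for each pixel, whether its neighbors' disparities and relative depths agree on who is nearer. The sketch below is an illustration under assumptions and not the paper's algorithm: it uses a 4-neighborhood, assumes the relative depth map grows with distance (so larger disparity should pair with smaller depth), and counts ties as consistent.

```python
def disparity_confidence(disparity, rel_depth, y, x):
    """Fraction of 4-neighbors whose disparity ordering agrees with the depth prior."""
    height, width = len(disparity), len(disparity[0])
    agree = total = 0
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ny, nx = y + dy, x + dx
        if 0 <= ny < height and 0 <= nx < width:
            d_diff = disparity[ny][nx] - disparity[y][x]
            z_diff = rel_depth[ny][nx] - rel_depth[y][x]
            total += 1
            # Consistent when disparity rises as depth falls (or either is unchanged).
            if d_diff * z_diff <= 0:
                agree += 1
    return agree / total if total else 0.0
```

Pixels with a score near 1 would be kept as confident estimates, and only those would feed the correspondence building that the abstract describes next.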
[CV-192] PromptSafe: Gated Prompt Tuning for Safe Text-to-Image Generation
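[Quick read]: This paper addresses the tendency of text-to-image (T2I) models to generate unsafe content (such as violent or explicit imagery), and the limits of existing soft prompt tuning defenses: high computational cost, degraded benign image quality, and poor adaptation to the differing safety needs of prompts. The key to its solution is the PromptSafe framework, whose core ideas are: 1) using a large language model (LLM) to rewrite unsafe prompts into semantically aligned safe alternatives, building an efficient text-only training corpus; 2) optimizing a universal soft prompt embedding that repels unsafe features and attracts safe ones during diffusion denoising; 3) an inference-time gated control network that adjusts defensive strength to the estimated prompt toxicity, giving risk-adaptive protection that keeps benign generation quality while maintaining strong safety.
Link: https://arxiv.org/abs/2508.01272
Authors: Zonglei Jing, Xiao Yang, Xiaoqian Li, Siyuan Liang, Aishan Liu, Mingchuan Zhang, Xianglong Liu
Affiliations: Beihang University; Beijing University of Posts and Telecommunications; Taishan University; Nanyang Technological University; Henan University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: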
Abstract:Text-to-image (T2I) models have demonstrated remarkable generative capabilities but remain vulnerable to producing not-safe-for-work (NSFW) content, such as violent or explicit imagery. While recent moderation efforts have introduced soft prompt-guided tuning by appending defensive tokens to the input, these approaches often rely on large-scale curated image-text datasets and apply static, one-size-fits-all defenses at inference time. However, this results not only in high computational cost and degraded benign image quality, but also in limited adaptability to the diverse and nuanced safety requirements of real-world prompts. To address these challenges, we propose PromptSafe, a gated prompt tuning framework that combines a lightweight, text-only supervised soft embedding with an inference-time gated control network. Instead of training on expensive image-text datasets, we first rewrite unsafe prompts into semantically aligned but safe alternatives using an LLM, constructing an efficient text-only training corpus. Based on this, we optimize a universal soft prompt that repels unsafe and attracts safe embeddings during the diffusion denoising process. To avoid over-suppressing benign prompts, we introduce a gated mechanism that adaptively adjusts the defensive strength based on estimated prompt toxicity, thereby aligning defense intensity with prompt risk and ensuring strong protection for harmful inputs while preserving benign generation quality. Extensive experiments across multiple benchmarks and T2I models show that PromptSafe achieves a SOTA unsafe generation rate (2.36%), while preserving high benign fidelity. Furthermore, PromptSafe demonstrates strong generalization to unseen harmful categories, robust transferability across diffusion model architectures, and resilience under adaptive adversarial attacks, highlighting its practical value for safe and scalable deployment.
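The gating idea in the abstract, scaling the defensive strength by the estimated prompt toxicity, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the paper's implementation: the sigmoid gate, the `threshold` and `sharpness` parameters, and the function names `gate_strength` and `append_gated_soft_prompt` are all hypothetical, and the toxicity score is taken as given.

```python
import math


def gate_strength(toxicity: float, threshold: float = 0.5, sharpness: float = 10.0) -> float:
    """Map an estimated toxicity score in [0, 1] to a defense strength in (0, 1).

    Hypothetical gate: a sigmoid centred on `threshold`.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (toxicity - threshold)))


def append_gated_soft_prompt(prompt_tokens, soft_tokens, toxicity):
    """Append the defensive soft-prompt tokens, scaled by the gate.

    prompt_tokens, soft_tokens: lists of embedding vectors (lists of floats).
    Benign prompts (low toxicity) receive a nearly zeroed-out defense.
    """
    g = gate_strength(toxicity)
    scaled = [[g * x for x in token] for token in soft_tokens]
    return [list(token) for token in prompt_tokens] + scaled
```

With this form, a prompt scored well below the threshold is left almost untouched, while one scored well above it receives close to the full soft prompt.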
[CV-193] SGCap: Decoding Semantic Group for Zero-shot Video Captioning
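[Quick read]: This paper tackles a key challenge in zero-shot video captioning: how to model a video's temporal dynamics and improve understanding and generation of video content without training on video-text pairs. Existing methods mostly follow a text-only training paradigm and extend image-level text reconstruction directly to video, ignoring temporal dependencies between frames, which limits performance. The core of the solution is Semantic Group Captioning (SGCap), which introduces a Semantic Group Decoding (SGD) strategy to model inter-frame temporal relationships explicitly, together with two modules, Key Sentences Selection (KSS) and Probability Sampling Supervision (PSS), that build semantically diverse sentence groups and help the model learn inter-sentence causal relationships, markedly improving generalization in zero-shot video captioning.
Link: https://arxiv.org/abs/2508.01270
Authors: Zeyu Pan,Ping Li,Wenxiao Wang
Affiliations: Hangzhou Dianzi University; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 9 figures, 11 tables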
Abstract:Zero-shot video captioning aims to generate sentences for describing videos without training the model on video-text pairs, which remains underexplored. Existing zero-shot image captioning methods typically adopt a text-only training paradigm, where a language decoder reconstructs single-sentence embeddings obtained from CLIP. However, directly extending them to the video domain is suboptimal, as applying average pooling over all frames neglects temporal dynamics. To address this challenge, we propose a Semantic Group Captioning (SGCap) method for zero-shot video captioning. In particular, it develops the Semantic Group Decoding (SGD) strategy to employ multi-frame information while explicitly modeling inter-frame temporal relationships. Furthermore, existing zero-shot captioning methods that rely on cosine similarity for sentence retrieval and reconstruct the description supervised by a single frame-level caption fail to provide sufficient video-level supervision. To alleviate this, we introduce two key components, including the Key Sentences Selection (KSS) module and the Probability Sampling Supervision (PSS) module. The two modules construct semantically diverse sentence groups that model temporal dynamics and guide the model to capture inter-sentence causal relationships, thereby enhancing its generalization ability to video captioning. Experimental results on several benchmarks demonstrate that SGCap significantly outperforms previous state-of-the-art zero-shot alternatives and even achieves performance competitive with fully supervised ones. Code is available at this https URL.
[CV-194] ModelNet40-E: An Uncertainty-Aware Benchmark for Point Cloud Classification
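[Quick read]: This paper addresses the evaluation of robustness and calibration of point cloud classification models under synthetic LiDAR-like noise. Existing benchmarks lack a fine-grained characterization of predictive uncertainty under noise, which limits deeper analysis of model reliability. The key is ModelNet40-E, a new benchmark that provides not only noise-corrupted point clouds but also point-wise uncertainty annotations via Gaussian noise parameters (σ, μ), enabling fine-grained, uncertainty-aware evaluation. Experiments show that Point Transformer v3 is the best calibrated across noise levels, with predicted uncertainties closer to the underlying measurement uncertainty, supporting the benchmark's usefulness for assessing model reliability.
Link: https://arxiv.org/abs/2508.01269
Authors: Pedro Alonso,Tianrui Li,Chongshou Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: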
Abstract:We introduce ModelNet40-E, a new benchmark designed to assess the robustness and calibration of point cloud classification models under synthetic LiDAR-like noise. Unlike existing benchmarks, ModelNet40-E provides both noise-corrupted point clouds and point-wise uncertainty annotations via Gaussian noise parameters (σ, μ), enabling fine-grained evaluation of uncertainty modeling. We evaluate three popular models (PointNet, DGCNN, and Point Transformer v3) across multiple noise levels using classification accuracy, calibration metrics, and uncertainty-awareness. While all models degrade under increasing noise, Point Transformer v3 demonstrates superior calibration, with predicted uncertainties more closely aligned with the underlying measurement uncertainty.
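As a rough illustration of the two ingredients named in the abstract, the sketch below perturbs points with per-point Gaussian noise and computes a calibration score. The abstract does not say which calibration metrics the benchmark uses; expected calibration error (ECE) is assumed here because it is a common choice, and the function names and binning scheme are hypothetical.

```python
import random


def add_gaussian_noise(points, sigmas, mu=0.0, seed=0):
    """Perturb each 3D point with Gaussian noise using its own sigma.

    The per-point sigma doubles as the point-wise uncertainty annotation.
    """
    rng = random.Random(seed)
    return [tuple(c + rng.gauss(mu, s) for c in p) for p, s in zip(points, sigmas)]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |average confidence - accuracy| over bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += len(b) / n * abs(avg_conf - accuracy)
    return ece
```

A well-calibrated classifier keeps this score low as the noise level rises; a model that stays confident while its accuracy drops does not.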
[CV-195] Enhancing Diffusion-based Dataset Distillation via Adversary-Guided Curriculum Sampling ICME2025
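[Quick read]: This paper addresses the performance degradation of dataset distillation at high images-per-class (IPC) settings or high resolutions, and in particular the information redundancy caused by the limited diversity of images sampled from diffusion generative models. The key is Adversary-guided Curriculum Sampling (ACS), which guides diffusion sampling with an adversarial loss so that, at each curriculum stage, the sampled images differ markedly from those the discriminator has already learned, reducing redundancy across curricula. As the curricula progress, the discriminator evolves, so generated images move from simple to complex and gradually cover the information spectrum of the target data, giving efficient, systematic compression with preserved diversity.
Link: https://arxiv.org/abs/2508.01264
Authors: Lexiao Zou,Gongwei Chen,Yanda Chen,Miao Zhang
Affiliations: Harbin Institute of Technology, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICME2025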
Abstract:Dataset distillation aims to encapsulate the rich information contained in a dataset into a compact distilled dataset, but it faces performance degradation as the image-per-class (IPC) setting or image resolution grows larger. Recent advancements demonstrate that integrating diffusion generative models can effectively facilitate the compression of large-scale datasets while maintaining efficiency due to their superiority in matching data distribution and summarizing representative patterns. However, images sampled from diffusion models are often criticized for lacking diversity, which may lead to information redundancy when multiple independently sampled images are aggregated into a distilled dataset. To address this issue, we propose Adversary-guided Curriculum Sampling (ACS), which partitions the distilled dataset into multiple curricula. For generating each curriculum, ACS guides the diffusion sampling process with an adversarial loss to challenge a discriminator trained on sampled images, thus mitigating information overlap between curricula and fostering a more diverse distilled dataset. Additionally, as the discriminator evolves with the progression of curricula, ACS generates images from simpler to more complex, ensuring efficient and systematic coverage of the target data's informational spectrum. Extensive experiments demonstrate the effectiveness of ACS, which achieves substantial improvements of 4.1% on Imagewoof and 2.1% on ImageNet-1k over the state-of-the-art.
[CV-196] SpatioTemporal Difference Network for Video Depth Super-Resolution
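[Quick read]: This paper addresses the drop in reconstruction quality in video depth super-resolution caused by long-tailed distributions, which is most pronounced in spatially non-smooth regions and temporally varying regions. The key is a novel SpatioTemporal Difference Network (STDNet) built from two branches. The spatial difference branch dynamically aligns RGB features with learned spatial difference representations, enabling intra-frame RGB-D aggregation for depth calibration. The temporal difference branch uses a strategy that preferentially propagates temporal variation information from adjacent frames, leveraging temporal difference representations to achieve precise motion compensation in temporal long-tailed areas. The method mitigates the distortion caused by long-tailed effects and improves reconstruction accuracy.
Link: https://arxiv.org/abs/2508.01259
Authors: Zhengxue Wang,Yuan Wu,Xiang Li,Zhiqiang Yan,Jian Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: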
Abstract:Depth super-resolution has achieved impressive performance, and the incorporation of multi-frame information further enhances reconstruction quality. Nevertheless, statistical analyses reveal that video depth super-resolution remains affected by pronounced long-tailed distributions, with the long-tailed effects primarily manifesting in spatial non-smooth regions and temporal variation zones. To address these challenges, we propose a novel SpatioTemporal Difference Network (STDNet) comprising two core branches: a spatial difference branch and a temporal difference branch. In the spatial difference branch, we introduce a spatial difference mechanism to mitigate the long-tailed issues in spatial non-smooth regions. This mechanism dynamically aligns RGB features with learned spatial difference representations, enabling intra-frame RGB-D aggregation for depth calibration. In the temporal difference branch, we further design a temporal difference strategy that preferentially propagates temporal variation information from adjacent RGB and depth frames to the current depth frame, leveraging temporal difference representations to achieve precise motion compensation in temporal long-tailed areas. Extensive experimental results across multiple datasets demonstrate the effectiveness of our STDNet, outperforming existing approaches.
[CV-197] Self-Enhanced Image Clustering with Cross-Modal Semantic Consistency
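[Quick read]: This paper addresses a performance bottleneck in image clustering with pre-trained vision-language models such as CLIP: existing methods usually freeze the encoder, creating a fundamental mismatch between the model's generic feature representations and the specific clustering task, which caps clustering performance. The key is a self-enhanced framework with two stages. The first stage trains lightweight clustering heads for alignment via Cross-Modal Semantic Consistency, applying consistency constraints at the instance, cluster assignment, and cluster center levels, and adds a method for generating higher-quality cluster centers plus a dynamic balancing regularizer to improve clustering quality. The second stage applies a self-enhanced fine-tuning strategy that uses the aligned first-stage model as a pseudo-label generator, feeding the self-generated supervision into joint optimization of the vision encoder and clustering heads. On six mainstream datasets the method clearly outperforms existing deep clustering methods, and its ViT-B/32 model matches or surpasses methods built on the much larger ViT-L/14.
Link: https://arxiv.org/abs/2508.01254
Authors: Zihan Li,Wei Sun,Jing Hu,Jianhua Yin,Jianlong Wu,Liqiang Nie
Affiliations: University of Science and Technology of China; Tsinghua University; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: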
Abstract:While large language-image pre-trained models like CLIP offer powerful generic features for image clustering, existing methods typically freeze the encoder. This creates a fundamental mismatch between the model’s task-agnostic representations and the demands of a specific clustering task, imposing a ceiling on performance. To break this ceiling, we propose a self-enhanced framework based on cross-modal semantic consistency for efficient image clustering. Our framework first builds a strong foundation via Cross-Modal Semantic Consistency and then specializes the encoder through Self-Enhancement. In the first stage, we focus on Cross-Modal Semantic Consistency. By mining consistency between generated image-text pairs at the instance, cluster assignment, and cluster center levels, we train lightweight clustering heads to align with the rich semantics of the pre-trained model. This alignment process is bolstered by a novel method for generating higher-quality cluster centers and a dynamic balancing regularizer to ensure well-distributed assignments. In the second stage, we introduce a Self-Enhanced fine-tuning strategy. The well-aligned model from the first stage acts as a reliable pseudo-label generator. These self-generated supervisory signals are then used to feed back the efficient, joint optimization of the vision encoder and clustering heads, unlocking their full potential. Extensive experiments on six mainstream datasets show that our method outperforms existing deep clustering methods by significant margins. Notably, our ViT-B/32 model already matches or even surpasses the accuracy of state-of-the-art methods built upon the far larger ViT-L/14.
[CV-198] ODOV: Towards Open-Domain Open-Vocabulary Object Detection
[Quick read]: This paper addresses Open-Domain Open-Vocabulary (ODOV) object detection, that is, how well a model adapts when domain shift and category shift occur together in the real world. The key is a new baseline built on large language models (LLMs): it first uses LLMs to generate domain-agnostic text prompts for the category embedding, then learns a domain embedding from the input image, and at test time fuses the two into a customized domain-specific category embedding for each test image, improving generalization to complex real-world scenes.
Link: https://arxiv.org/abs/2508.01253
Authors: Yupeng Zhang,Ruize Han,Fangnan Zhou,Song Wang,Wei Feng,Liang Wan
Affiliations: Tianjin University; Shenzhen University of Advanced Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In this work, we handle a new problem of Open-Domain Open-Vocabulary (ODOV) object detection, which considers the detection model’s adaptability to the real world including both domain and category shifts. For this problem, we first construct a new benchmark OD-LVIS, which includes 46,949 images, covers 18 complex real-world domains and 1,203 categories, and provides a comprehensive dataset for evaluating real-world object detection. Besides, we develop a novel baseline method for ODOV detection. The proposed method first leverages large language models to generate the domain-agnostic text prompts for category embedding. It further learns the domain embedding from the given image, which, during testing, can be integrated into the category embedding to form the customized domain-specific category embedding for each test image. We provide sufficient benchmark evaluations for the proposed ODOV detection task and report the results, which verify the rationale of ODOV detection, the usefulness of our benchmark, and the superiority of the proposed method.
[CV-199] DisFaceRep: Representation Disentanglement for Co-occurring Facial Components in Weakly Supervised Face Parsing ACM-MM2025
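[Quick read]: This paper addresses the high annotation cost and component confusion in Weakly Supervised Face Parsing (WSFP). Existing methods rely on dense pixel-level annotations, whereas WSFP uses only weak supervision such as image-level labels or natural language descriptions, and must cope with ambiguous activations and degraded segmentation caused by the high co-occurrence and visual similarity of facial components. The key is the DisFaceRep framework, which disentangles facial component representations through explicit and implicit mechanisms: a co-occurring component disentanglement strategy reduces dataset-level bias explicitly, and a text-guided component disentanglement loss uses language supervision to guide component separation implicitly. Experiments on CelebAMask-HQ, LaPa, and Helen show that the method significantly outperforms existing weakly supervised semantic segmentation methods.
Link: https://arxiv.org/abs/2508.01250
Authors: Xiaoqin Wang,Xianxu Hou,Meidan Ding,Junliang Chen,Kaijun Deng,Jinheng Xie,Linlin Shen
Affiliations: Shenzhen University; Xi’an Jiaotong-Liverpool University; The Hong Kong Polytechnic University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2025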
Abstract:Face parsing aims to segment facial images into key components such as eyes, lips, and eyebrows. While existing methods rely on dense pixel-level annotations, such annotations are expensive and labor-intensive to obtain. To reduce annotation cost, we introduce Weakly Supervised Face Parsing (WSFP), a new task setting that performs dense facial component segmentation using only weak supervision, such as image-level labels and natural language descriptions. WSFP introduces unique challenges due to the high co-occurrence and visual similarity of facial components, which lead to ambiguous activations and degraded parsing performance. To address this, we propose DisFaceRep, a representation disentanglement framework designed to separate co-occurring facial components through both explicit and implicit mechanisms. Specifically, we introduce a co-occurring component disentanglement strategy to explicitly reduce dataset-level bias, and a text-guided component disentanglement loss to guide component separation using language supervision implicitly. Extensive experiments on CelebAMask-HQ, LaPa, and Helen demonstrate the difficulty of WSFP and the effectiveness of DisFaceRep, which significantly outperforms existing weakly supervised semantic segmentation methods. The code will be released at this https URL.
[CV-200] NS-Net: Decoupling CLIP Semantic Information through NULL-Space for Generalizable AI-Generated Image Detection
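[Quick read]: This paper addresses the poor generalization of current AI-generated image detectors to unknown generative models, especially when real and generated images are closely aligned in semantic content. The core of the solution is NS-Net, which uses NULL-Space projection to decouple high-level semantic information from CLIP visual features and so weaken semantic interference. It then applies contrastive learning to capture intrinsic distributional differences between real and generated images, and a Patch Selection strategy to preserve fine-grained forgery artifacts while mitigating the semantic bias caused by global image structure. On an open-world benchmark covering 40 generative models, the method outperforms existing state-of-the-art approaches with a 7.4% gain in detection accuracy, showing strong generalization across GAN- and diffusion-based generators.
Link: https://arxiv.org/abs/2508.01248
Authors: Jiazhen Yan,Fan Wang,Weiwei Jiang,Ziqiang Li,Zhangjie Fu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: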
Abstract:The rapid progress of generative models, such as GANs and diffusion models, has facilitated the creation of highly realistic images, raising growing concerns over their misuse in security-sensitive domains. While existing detectors perform well under known generative settings, they often fail to generalize to unknown generative models, especially when semantic content between real and fake images is closely aligned. In this paper, we revisit the use of CLIP features for AI-generated image detection and uncover a critical limitation: the high-level semantic information embedded in CLIP’s visual features hinders effective discrimination. To address this, we propose NS-Net, a novel detection framework that leverages NULL-Space projection to decouple semantic information from CLIP’s visual features, followed by contrastive learning to capture intrinsic distributional differences between real and generated images. Furthermore, we design a Patch Selection strategy to preserve fine-grained artifacts by mitigating semantic bias caused by global image structures. Extensive experiments on an open-world benchmark comprising images generated by 40 diverse generative models show that NS-Net outperforms existing state-of-the-art methods, achieving a 7.4% improvement in detection accuracy, thereby demonstrating strong generalization across both GAN- and diffusion-based image generation techniques.
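To make the null-space idea concrete, the sketch below removes from a feature vector every component that lies in a given "semantic" subspace, leaving only the part in the null space of the matrix whose rows are the semantic directions. How NS-Net actually obtains that subspace is not stated in the abstract, so `semantic_directions` is simply assumed to be given, and the helper names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(feature, semantic_directions):
    """Remove every component of `feature` lying in the semantic subspace.

    The result is orthogonal to all semantic directions.
    """
    out = list(feature)
    for u in orthonormalize(semantic_directions):
        c = dot(out, u)
        out = [oi - c * ui for oi, ui in zip(out, u)]
    return out
```

After projection, a detector can no longer separate samples using the removed directions, which is the intuition behind stripping semantic content before looking for generation artifacts.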
[CV-201] MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh ICCV
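[Quick read]: This paper addresses two core problems in handling text-serialized 3D meshes: first, the token-length limits of large language models (LLMs) restrict the scale of usable training data; second, mesh serialization loses important 3D structural information, which hurts the model's understanding and generation of geometric topology. The key is a Primitive-Mesh decomposition strategy that splits complex 3D meshes into structurally meaningful subunits, yielding a large-scale dataset of more than 1.5 million samples (almost 50 times larger than previous methods) that better fits LLM scaling-law principles. The paper also introduces inferring face connectivity from vertices and a local mesh assembly training strategy, which markedly improve the LLM's modeling of mesh topology and spatial structure. Experiments show that the method outperforms the state-of-the-art LLaMA-Mesh in both mesh generation quality and shape understanding.
Link: https://arxiv.org/abs/2508.01242
Authors: Shuangkang Fang,I-Chao Shen,Yufeng Wang,Yi-Hsuan Tsai,Yi Yang,Shuchang Zhou,Wenrui Ding,Takeo Igarashi,Ming-Hsuan Yang
Affiliations: Beihang University; The University of Tokyo; Google; StepFun; UC Merced
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV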
Abstract:We present MeshLLM, a novel framework that leverages large language models (LLMs) to understand and generate text-serialized 3D meshes. Our approach addresses key limitations in existing methods, including the limited dataset scale when catering to LLMs’ token length and the loss of 3D structural information during mesh serialization. We introduce a Primitive-Mesh decomposition strategy, which divides 3D meshes into structurally meaningful subunits. This enables the creation of a large-scale dataset with 1500k+ samples, almost 50 times larger than previous methods, which aligns better with the LLM scaling law principles. Furthermore, we propose inferring face connectivity from vertices and local mesh assembly training strategies, significantly enhancing the LLMs’ ability to capture mesh topology and spatial structures. Experiments show that MeshLLM outperforms the state-of-the-art LLaMA-Mesh in both mesh generation quality and shape understanding, highlighting its great potential in processing text-serialized 3D meshes.
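For readers unfamiliar with "text-serialized" meshes, the sketch below writes a mesh as plain text that a language model could read token by token. The abstract does not specify the serialization format; an OBJ-like layout with quantized integer coordinates is assumed here, and `serialize_mesh` and its `bins` parameter are hypothetical.

```python
def serialize_mesh(vertices, faces, bins=64):
    """Serialize a triangle mesh as OBJ-style text with quantized coordinates.

    vertices: list of (x, y, z) floats; faces: list of 0-based vertex index triples.
    Coordinates are rescaled to integers in [0, bins - 1] to keep the text short.
    """
    lo = min(c for v in vertices for c in v)
    hi = max(c for v in vertices for c in v)
    scale = (bins - 1) / (hi - lo) if hi > lo else 0.0
    lines = []
    for v in vertices:
        q = [round((c - lo) * scale) for c in v]
        lines.append("v " + " ".join(str(x) for x in q))
    for f in faces:
        # OBJ face indices are 1-based
        lines.append("f " + " ".join(str(i + 1) for i in f))
    return "\n".join(lines)
```

Every vertex and every face costs one line of text, so a detailed mesh quickly exceeds an LLM's context window. That is the token-length constraint the abstract refers to, and one motivation for splitting meshes into smaller subunits.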
[CV-202] OCSplats: Observation Completeness Quantification and Label Noise Separation in 3DGS
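[Quick read]: This paper addresses reconstruction errors in 3D Gaussian Splatting (3DGS) caused by label noise in real scenes, arising from moving objects, non-Lambertian surfaces, shadows, and similar factors. Existing methods either fail to separate the noise effectively or need scene-specific hyperparameter tuning, which limits their practicality. The key is to revisit anti-noise reconstruction from the perspective of epistemic uncertainty and propose the OCSplats framework. Its core contributions are hybrid noise assessment and observation-based cognitive correction, which markedly improve noise classification accuracy in areas with cognitive differences, and a label noise classification pipeline based on dynamic anchor points, which lets the model handle scenes with very different noise proportions without parameter adjustment, giving robust, accurate reconstruction across scenes.
Link: https://arxiv.org/abs/2508.01239
Authors: Han Ling,Xian Xu,Yinghui Sun,Quansen Sun
Affiliations: Nanjing University of Science and Technology; Southeast University; CityU HK
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: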
Abstract:3D Gaussian Splatting (3DGS) has become one of the most promising 3D reconstruction technologies. However, label noise in real-world scenarios (such as moving objects, non-Lambertian surfaces, and shadows) often leads to reconstruction errors. Existing 3DGS-based anti-noise reconstruction methods either fail to separate noise effectively or require scene-specific fine-tuning of hyperparameters, making them difficult to apply in practice. This paper re-examines the problem of anti-noise reconstruction from the perspective of epistemic uncertainty, proposing a novel framework, OCSplats. By combining key technologies such as hybrid noise assessment and observation-based cognitive correction, the accuracy of noise classification in areas with cognitive differences has been significantly improved. Moreover, to address the issue of varying noise proportions in different scenarios, we have designed a label noise classification pipeline based on dynamic anchor points. This pipeline enables OCSplats to be applied simultaneously to scenarios with vastly different noise proportions without adjusting parameters. Extensive experiments demonstrate that OCSplats always achieves leading reconstruction performance and precise label noise classification in scenes of different complexity levels.
[CV-203] Mitigating Information Loss under High Pruning Rates for Efficient Large Vision Language Models ACM-MM2025
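[Quick read]: This paper addresses the high computational cost of Large Vision Language Models (LVLMs), whose input visual sequences contain hundreds to thousands of tokens. Existing methods reduce the cost by removing redundant visual tokens, but at high pruning rates they lose visual information and performance drops sharply. The key is the Adaptive Content Compensation Method (ACCM), in which a lightweight caption model and a selector work together: the caption model generates question-related image descriptions under the guidance of the user instruction, and the selector picks the most contextually appropriate caption from multiple candidates to compensate for the visual information lost in pruning. The modules are trained with self-supervised learning and need no human or automated labeling. Experiments on multiple benchmarks show that the method significantly outperforms existing techniques at lower FLOPs (for example, surpassing the state of the art by 20.6% with 6.5% fewer FLOPs).
Link: https://arxiv.org/abs/2508.01236
Authors: Mingyu Fu,Wei Suo,Ji Ma,Lin Yuanbo Wu,Peng Wang,Yanning Zhang
Affiliations: Northwestern Polytechnical University; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology; Swansea University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2025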
Abstract:Despite the great success of Large Vision Language Models (LVLMs), their high computational cost severely limits their broad applications. The computational cost of LVLMs mainly stems from the visual sequence of the input, which consists of hundreds or even thousands of tokens. Although existing methods have made progress by removing redundant tokens, they suffer from severe performance degradation with high pruning rates due to the loss of visual information. In this paper, we propose an Adaptive Content Compensation Method (ACCM), which can effectively mitigate the visual information loss via an image caption. Specifically, ACCM comprises two key components: a lightweight caption model and a selector. Firstly the caption model generates question-related descriptions under the guidance of the user instruction. Then the selector further identifies a contextually appropriate caption from multiple candidates. Leveraging self-supervised learning, our modules could be learned efficiently without any human or automated labeling. We conduct extensive experiments across seven benchmarks and the results show that ACCM significantly outperforms existing methods with lower FLOPs (e.g., surpassing SOTA by 20.6% with 6.5% fewer FLOPs).
[CV-204] Enhancing Multi-view Open-set Learning via Ambiguity Uncertainty Calibration and View-wise Debiasing
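[Quick read]: This paper addresses the performance bottleneck of multi-view learning models in open-set scenarios, in particular their weak recognition of unknown classes due to an implicit assumption of class completeness, and the static view-induced biases that arise from spurious view-label associations formed during training. The key is a multi-view open-set learning framework based on ambiguity uncertainty calibration and view-wise debiasing. First, the O-Mix synthesis strategy generates virtual samples with calibrated open-set ambiguity uncertainty, and an auxiliary ambiguity perception network captures atypical patterns to improve open-set adaptation. Second, a contrastive debiasing module based on HSIC (Hilbert-Schmidt Independence Criterion) enforces independence between view-specific ambiguous representations and view-consistent representations, encouraging the model to learn more generalizable features.
Link: https://arxiv.org/abs/2508.01227
Authors: Zihan Fang,Zhiyong Xu,Lan Du,Shide Du,Zhiling Cai,Shiping Wang
Affiliations: Fuzhou University; Monash University; Fujian Agriculture and Forestry University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: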
Abstract:Existing multi-view learning models struggle in open-set scenarios due to their implicit assumption of class completeness. Moreover, static view-induced biases, which arise from spurious view-label associations formed during training, further degrade their ability to recognize unknown categories. In this paper, we propose a multi-view open-set learning framework via ambiguity uncertainty calibration and view-wise debiasing. To simulate ambiguous samples, we design O-Mix, a novel synthesis strategy to generate virtual samples with calibrated open-set ambiguity uncertainty. These samples are further processed by an auxiliary ambiguity perception network that captures atypical patterns for improved open-set adaptation. Furthermore, we incorporate an HSIC-based contrastive debiasing module that enforces independence between view-specific ambiguous and view-consistent representations, encouraging the model to learn generalizable features. Extensive experiments on diverse multi-view benchmarks demonstrate that the proposed framework consistently enhances unknown-class recognition while preserving strong closed-set performance.
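The HSIC term that the debiasing module builds on can be computed directly from two sets of paired representations. The sketch below implements the standard biased empirical estimator, HSIC = tr(KHLH) / (n-1)^2, with Gaussian kernels. The kernel choice and bandwidth are assumptions, and how the paper folds this quantity into its contrastive loss is not described in the abstract.

```python
import math


def rbf_gram(xs, sigma=1.0):
    """Gaussian-kernel Gram matrix for a list of vectors."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))
             for y in xs] for x in xs]


def center(K):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC between paired samples xs[i] and ys[i]."""
    n = len(xs)
    Kc = center(rbf_gram(xs, sigma))
    L = rbf_gram(ys, sigma)
    # tr(Kc L) equals the sum of elementwise products because L is symmetric
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2
```

Values near zero indicate (kernel-measured) independence, so using this quantity as a penalty pushes two representation sets apart in the statistical sense.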
[CV-205] Multi-Cache Enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models ICCV2025
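[Quick read]: This paper addresses performance degradation caused by distribution shift in zero-shot test-time adaptation (TTA). In particular, existing cache-enhanced TTA methods select samples for prototype construction with a low-entropy criterion, which may admit unreliable samples and cannot guarantee intra-class compactness. The key is the Multi-Cache enhanced Prototype-based Test-Time Adaptation (MCP) framework, built on three cooperating caches: an entropy cache that initializes prototype representations, an align cache that fuses visual and textual information to achieve compact intra-class distributions, and a negative cache that uses high-entropy samples for prediction calibration. MCP++ further introduces prototype residual fine-tuning through cross-modal prototype alignment and residual learning, markedly improving generalization across 15 downstream tasks.
Link: https://arxiv.org/abs/2508.01225
Authors: Xinyu Chen,Haotian Zhai,Can Zhang,Xiupeng Shi,Ruirui Li
Affiliations: Shanghai University; Beijing University of Chemical Technology; University of Minnesota
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICCV 2025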
Abstract:In zero-shot setting, test-time adaptation adjusts pre-trained models using unlabeled data from the test phase to enhance performance on unknown test distributions. Existing cache-enhanced TTA methods rely on a low-entropy criterion to select samples for prototype construction, assuming intra-class compactness. However, low-entropy samples may be unreliable under distribution shifts, and the resulting prototypes may not ensure compact intra-class distributions. This study identifies a positive correlation between cache-enhanced performance and intra-class compactness. Based on this observation, we propose a Multi-Cache enhanced Prototype-based Test-Time Adaptation (MCP) featuring three caches: an entropy cache for initializing prototype representations with low-entropy samples, an align cache for integrating visual and textual information to achieve compact intra-class distributions, and a negative cache for prediction calibration using high-entropy samples. We further developed MCP++, a framework incorporating cross-modal prototype alignment and residual learning, introducing prototype residual fine-tuning. Comparative and ablation experiments across 15 downstream tasks demonstrate that the proposed method and framework achieve state-of-the-art generalization performance.
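The low-entropy selection that the entropy cache starts from can be sketched as a small per-class buffer. This is only a minimal illustration of the baseline idea, not of MCP's align or negative caches. The class name `EntropyCache`, the capacity, and the use of a plain feature mean as the prototype are all assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


class EntropyCache:
    """Keep, per predicted class, the `capacity` lowest-entropy test samples."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.items = {}  # class index -> list of (entropy, feature)

    def update(self, feature, probs):
        """Insert a sample under its predicted class and evict the least confident."""
        label = max(range(len(probs)), key=probs.__getitem__)
        bucket = self.items.setdefault(label, [])
        bucket.append((entropy(probs), list(feature)))
        bucket.sort(key=lambda item: item[0])
        del bucket[self.capacity:]
        return label

    def prototype(self, label):
        """Class prototype: mean of the cached features for that class."""
        feats = [f for _, f in self.items[label]]
        return [sum(col) / len(feats) for col in zip(*feats)]
```

The abstract's caveat applies directly to this sketch: under distribution shift a confidently wrong sample has low entropy too, so it can enter the cache and pull the prototype away from the true class.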
[CV-206] ParaRevSNN: A Parallel Reversible Spiking Neural Network for Efficient Training and Inference AAAI2026
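[Quick read]: This paper addresses the high latency of Reversible Spiking Neural Networks (RevSNNs) in training and inference caused by strictly sequential computation. The key is the ParaRevSNN architecture, which decouples the sequential dependencies between reversible blocks while preserving reversibility, enabling inter-block parallelism. This significantly speeds up training and inference while retaining the memory savings of reversibility.
Link: https://arxiv.org/abs/2508.01223
Authors: Changqing Xu,Guoqing Sun,Yi Liu,Xinfang Liao,Yintang Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 3 figures, submitted to AAAI 2026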
Abstract:Reversible Spiking Neural Networks (RevSNNs) enable memory-efficient training by reconstructing forward activations during backpropagation, but suffer from high latency due to strictly sequential computation. To overcome this limitation, we propose ParaRevSNN, a parallel reversible SNN architecture that decouples sequential dependencies between reversible blocks while preserving reversibility. This design enables inter-block parallelism, significantly accelerating training and inference while retaining the memory-saving benefits of reversibility. Experiments on CIFAR10, CIFAR100, CIFAR10-DVS, and DVS128 Gesture demonstrate that ParaRevSNN matches or exceeds the accuracy of standard RevSNNs, while reducing training time by up to 35.2% and inference time to 18.15%, making it well-suited for deployment in resource-constrained scenarios.
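The memory saving mentioned in the abstract comes from being able to recompute a block's inputs from its outputs. The sketch below shows the classic additive coupling used by reversible networks; whether RevSNN or ParaRevSNN uses exactly this form is not stated in the abstract, and the sub-functions `f` and `g` here are toy stand-ins. It does not attempt to show the parallel design itself.

```python
def f(v):
    """Toy sub-network: a spike-like Heaviside step (hypothetical stand-in)."""
    return [1 if t > 0 else 0 for t in v]


def g(v):
    """Toy sub-network: a fixed linear map (hypothetical stand-in)."""
    return [2 * t for t in v]


def forward(x1, x2):
    """Additive coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def inverse(y1, y2):
    """Recover the inputs from the outputs, so activations need not be stored."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2
```

Note that `y2` cannot be computed until `y1` is known, and a stack of such blocks runs strictly one after another. That sequential chain is the latency source the paper sets out to break.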
[CV-207] Eigen Neural Network: Unlocking Generalizable Vision with Eigenbasis
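[Quick read]: This paper addresses the disordered weight structures that gradient-based optimization tends to produce in deep neural networks (DNNs), which blur feature representations and degrade learning dynamics. The key is a new architecture, the Eigen Neural Network (ENN), which reparameterizes each layer's weights in a shared, learnable orthonormal eigenbasis, structurally enforcing decorrelated and well-aligned weight dynamics instead of relying on conventional regularization. This design improves representations at the root, raises performance on image classification and cross-modal image-text retrieval, and gives rise to ENN-ℓ, an efficient backpropagation-free (BP-free) local learning variant that trains more than 2× faster and exceeds the accuracy of end-to-end backpropagation.
Link: https://arxiv.org/abs/2508.01219
Authors: Anzhe Cheng,Chenzhong Yin,Mingxi Cheng,Shukai Duan,Shahin Nazarian,Paul Bogdan
Affiliations: University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: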
Abstract:The remarkable success of Deep Neural Networks (DNN) is driven by gradient-based optimization, yet this process is often undermined by its tendency to produce disordered weight structures, which harms feature clarity and degrades learning dynamics. To address this fundamental representational flaw, we introduce the Eigen Neural Network (ENN), a novel architecture that reparameterizes each layer’s weights in a layer-shared, learned orthonormal eigenbasis. This design enforces decorrelated, well-aligned weight dynamics axiomatically, rather than through regularization, leading to more structured and discriminative feature representations. When integrated with standard BP, ENN consistently outperforms state-of-the-art methods on large-scale image classification benchmarks, including ImageNet, and its superior representations generalize to set a new benchmark in cross-modal image-text retrieval. Furthermore, ENN’s principled structure enables a highly efficient, backpropagation-free (BP-free) local learning variant, ENN-ℓ. This variant not only resolves BP’s procedural bottlenecks to achieve over 2× training speedup via parallelism, but also, remarkably, surpasses the accuracy of end-to-end backpropagation. ENN thus presents a new architectural paradigm that directly remedies the representational deficiencies of BP, leading to enhanced performance and enabling a more efficient, parallelizable training regime.
[CV-208] MoGaFace: Momentum-Guided and Texture-Aware Gaussian Avatars for Consistent Facial Geometry
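【Quick Read】: This paper addresses the degraded rendering quality and loss of detail in existing 3D head avatar reconstruction methods caused by misalignment between the estimated FLAME mesh and the target images. The core solution is the MoGaFace framework, which continuously refines facial geometry and texture attributes throughout Gaussian rendering to achieve high-quality reconstruction. The key innovations are the Momentum-Guided Consistent Geometry module, which uses a momentum-updated expression bank and an expression-aware correction mechanism to ensure temporal and multi-view consistency, and the Latent Texture Attention module, which encodes compact multi-view features into head-aware representations to enable geometry-aware texture refinement.
Link: https://arxiv.org/abs/2508.01218
Authors: Yujian Liu,Linlang Cao,Chuang Chen,Fanyu Geng,Dongxu Shen,Peng Cao,Shidang Xu,Xiaoli Liu
Affiliations: 1. Tsinghua University; 2. Institute of Computing Technology, Chinese Academy of Sciences; 3. Alibaba Group; 4. Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 10 pages, 7 figures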
Abstract:Existing 3D head avatar reconstruction methods adopt a two-stage process, relying on tracked FLAME meshes derived from facial landmarks, followed by Gaussian-based rendering. However, misalignment between the estimated mesh and target images often leads to suboptimal rendering quality and loss of fine visual details. In this paper, we present MoGaFace, a novel 3D head avatar modeling framework that continuously refines facial geometry and texture attributes throughout the Gaussian rendering process. To address the misalignment between estimated FLAME meshes and target images, we introduce the Momentum-Guided Consistent Geometry module, which incorporates a momentum-updated expression bank and an expression-aware correction mechanism to ensure temporal and multi-view consistency. Additionally, we propose Latent Texture Attention, which encodes compact multi-view features into head-aware representations, enabling geometry-aware texture refinement via integration into Gaussians. Extensive experiments show that MoGaFace achieves high-fidelity head avatar reconstruction and significantly improves novel-view synthesis quality, even under inaccurate mesh initialization and unconstrained real-world settings.
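The abstract does not spell out how the expression bank is updated. "Momentum-updated" usually refers to an exponential moving average, so the sketch below assumes that form; the function name `momentum_update`, the dictionary layout, and the default momentum of 0.9 are illustrative assumptions, not details from the paper.

```python
def momentum_update(bank, frame_id, new_expr, momentum=0.9):
    """Assumed EMA rule: bank[frame] <- m * old + (1 - m) * new.

    bank maps a frame identifier to a list of expression coefficients.
    The first observation of a frame is stored as-is.
    """
    old = bank.get(frame_id)
    if old is None:
        bank[frame_id] = list(new_expr)
    else:
        bank[frame_id] = [
            momentum * o + (1.0 - momentum) * n for o, n in zip(old, new_expr)
        ]
    return bank[frame_id]

if __name__ == "__main__":
    bank = {}
    momentum_update(bank, 0, [1.0, 0.0])
    print(momentum_update(bank, 0, [0.0, 1.0]))
```

A high momentum keeps the stored expression stable across noisy per-frame estimates, which is one way such a bank could support temporal consistency.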
[CV-209] Perspective from a Broader Context: Can Room Style Knowledge Help Visual Floorplan Localization? AAAI2026
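【Quick Read】: This paper addresses localization ambiguity in visual Floorplan Localization (FLoc) caused by the many repetitive structures (such as hallways and corners) in floorplans. Existing methods typically rely on 2D structural cues in the floorplan or on visual pre-training with 3D geometric constraints, ignoring the richer scene context available in images. The key to the solution is to bring in broader visual scene context to make FLoc algorithms more robust: an unsupervised learning technique with clustering constraints pre-trains a room discriminator on self-collected, unlabeled room images, and the discriminator extracts the hidden room type of an image and distinguishes it from other room types. The scene context summarized by the discriminator is then injected into the FLoc algorithm, so that room style knowledge guides definite visual localization and improves accuracy and robustness.
Link: https://arxiv.org/abs/2508.01216
Authors: Bolei Chen,Shengsheng Yan,Yongzheng Cui,Jiaxu Kang,Ping Zhong,Jianxin Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: Submitted to AAAI 2026. arXiv admin note: text overlap with arXiv:2507.18881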
Abstract:Since a building’s floorplan remains consistent over time and is inherently robust to changes in visual appearance, visual Floorplan Localization (FLoc) has received increasing attention from researchers. However, as a compact and minimalist representation of the building’s layout, floorplans contain many repetitive structures (e.g., hallways and corners), which easily lead to ambiguous localization. Existing methods either pin their hopes on matching 2D structural cues in floorplans or rely on 3D geometry-constrained visual pre-training, ignoring the richer contextual information provided by visual images. In this paper, we suggest using broader visual scene context to empower FLoc algorithms with scene layout priors to eliminate localization uncertainty. In particular, we propose an unsupervised learning technique with clustering constraints to pre-train a room discriminator on self-collected unlabeled room images. Such a discriminator can empirically extract the hidden room type of the observed image and distinguish it from other room types. By injecting the scene context information summarized by the discriminator into an FLoc algorithm, the room style knowledge is effectively exploited to guide definite visual FLoc. We conducted extensive comparative studies on two standard visual FLoc benchmarks. Our experiments show that our approach outperforms state-of-the-art methods and achieves significant improvements in robustness and accuracy.
[CV-210] StyDeco: Unsupervised Style Transfer with Distilling Priors and Semantic Decoupling
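【Quick Read】: This paper addresses the loss of semantic structure in diffusion-model style transfer caused by the text-driven mechanism: conventional methods treat the textual description as uniform, static guidance and overlook the semantic gap between the non-spatial nature of text and the spatially aware attributes of visual style, which leads to distorted details and broken structure. The key to the solution is an unsupervised framework, StyDeco, with two core components. The first is Prior-Guided Data Distillation (PGD), which uses a frozen generative model to automatically synthesize pseudo-paired data and distill stylistic knowledge. The second is Contrastive Semantic Decoupling (CSD), which adapts the text encoder with domain-specific weights and performs two-class clustering in the semantic space, pushing source- and target-domain representations into separate clusters to achieve semantic decoupling and preservation. The approach significantly improves stylistic fidelity and structural consistency and supports extensions such as de-stylization.
Link: https://arxiv.org/abs/2508.01215
Authors: Yuanlin Yang,Quanjian Song,Zhexian Gao,Ge Wang,Shanshan Li,Xiaoyan Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 9 pages in total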
Abstract:Diffusion models have emerged as the dominant paradigm for style transfer, but their text-driven mechanism is hindered by a core limitation: it treats textual descriptions as uniform, monolithic guidance. This limitation overlooks the semantic gap between the non-spatial nature of textual descriptions and the spatially-aware attributes of visual style, often leading to the loss of semantic structure and fine-grained details during stylization. In this paper, we propose StyDeco, an unsupervised framework that resolves this limitation by learning text representations specifically tailored for the style transfer task. Our framework first employs Prior-Guided Data Distillation (PGD), a strategy designed to distill stylistic knowledge without human supervision. It leverages a powerful frozen generative model to automatically synthesize pseudo-paired data. Subsequently, we introduce Contrastive Semantic Decoupling (CSD), a task-specific objective that adapts a text encoder using domain-specific weights. CSD performs a two-class clustering in the semantic space, encouraging source and target representations to form distinct clusters. Extensive experiments on three classic benchmarks demonstrate that our framework outperforms several existing approaches in both stylistic fidelity and structural preservation, highlighting its effectiveness in style transfer with semantic preservation. In addition, our framework supports a unique de-stylization process, further demonstrating its extensibility. Our code is available at this https URL.
[CV-211] RoadMamba: A Dual Branch Visual State Space Model for Road Surface Classification
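【Quick Read】: This paper addresses the suboptimal performance of current Mamba architectures, which are based on the State-Space Model (SSM), on road surface classification, a shortfall caused by their lack of effective extraction of local texture features. The key to the solution is RoadMamba, whose core innovation is a Dual State Space Model (DualSSM) that jointly extracts the global semantics and local texture of the road surface, with Dual Attention Fusion (DAF) decoding and fusing the two feature streams. A dual auxiliary loss explicitly constrains the two branches, preventing the model from over-relying on the global semantics of the large receptive field while ignoring key local texture features. This improves classification accuracy and reaches state-of-the-art performance on a large-scale road surface classification dataset of one million samples.
Link: https://arxiv.org/abs/2508.01210
Authors: Tianze Wang,Zhang Zhang,Chao Yue,Nuoran Li,Chao Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: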
Abstract:Acquiring the road surface conditions in advance based on visual technologies provides effective information for the planning and control system of autonomous vehicles, thus improving the safety and driving comfort of the vehicles. Recently, the Mamba architecture based on state-space models has shown remarkable performance in visual processing tasks, benefiting from the efficient global receptive field. However, existing Mamba architectures struggle to achieve state-of-the-art visual road surface classification due to their lack of effective extraction of the local texture of the road surface. In this paper, we explore for the first time the potential of visual Mamba architectures for the road surface classification task and propose a method that effectively combines local and global perception, called RoadMamba. Specifically, we utilize the Dual State Space Model (DualSSM) to effectively extract the global semantics and local texture of the road surface and decode and fuse the dual features through the Dual Attention Fusion (DAF). In addition, we propose a dual auxiliary loss to explicitly constrain dual branches, preventing the network from relying only on global semantic information from the deep large receptive field and ignoring the local texture. The proposed RoadMamba achieves state-of-the-art performance in experiments on a large-scale road surface classification dataset containing 1 million samples.
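The dual auxiliary loss is described only at a high level. One common way to constrain two branches is to add a classification loss on each branch's own prediction to the loss on the fused prediction, and the sketch below assumes exactly that. The function names, the cross-entropy form, and the weight `aux_weight=0.4` are illustrative assumptions, not values from the paper.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[label]

def total_loss(fused_logits, global_logits, local_logits, label, aux_weight=0.4):
    """Assumed form: main loss on the fused output plus weighted branch losses.

    The auxiliary terms give the global and local branches their own
    supervision, so neither branch can be ignored by the fusion step.
    """
    main = cross_entropy(fused_logits, label)
    aux = cross_entropy(global_logits, label) + cross_entropy(local_logits, label)
    return main + aux_weight * aux

if __name__ == "__main__":
    print(total_loss([2.0, 0.1], [1.5, 0.3], [0.2, 0.4], label=0))
```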
[CV-212] Deep Learning for Pavement Condition Evaluation Using Satellite Imagery
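【Quick Read】: This paper addresses the labor-intensive and time-consuming nature of conventional manual surveys and vehicle-based automated surveys for assessing road infrastructure condition, which makes efficient monitoring and maintenance of large pavement networks difficult. The key to the solution is to combine satellite remote-sensing imagery with deep learning models to automatically recognize and evaluate pavement condition. Using a training set of more than 3,000 satellite images paired with ratings from the Pavement Management Information System (PMIS) of the Texas Department of Transportation (TxDOT), the approach classifies pavement condition with accuracy above 90%, offering a feasible technical path toward rapid, low-cost assessment of road networks.
Link: https://arxiv.org/abs/2508.01206
Authors: Prathyush Kumar Reddy Lebaku,Lu Gao,Pan Lu,Jingran Sun
Affiliations: University of Houston; North Dakota State University; University of South Florida
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: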
Abstract:Civil infrastructure systems cover large land areas and need frequent inspections to maintain their public service capabilities. The conventional approaches of manual surveys or vehicle-based automated surveys to assess infrastructure conditions are often labor-intensive and time-consuming. For this reason, it is worthwhile to explore more cost-effective methods for monitoring and maintaining this infrastructure. Fortunately, recent advancements in satellite systems and image processing algorithms have opened up new possibilities. Numerous satellite systems have been employed to monitor infrastructure conditions and identify damage. Due to the improvement in ground sample distance (GSD), the level of detail that can be captured has significantly increased. Taking advantage of these technological advancements, this research evaluates pavement conditions using deep learning models to analyze satellite images. We gathered over 3,000 satellite images of pavement sections, together with pavement evaluation ratings from TxDOT’s PMIS database. The results of our study show an accuracy rate exceeding 90%. This research paves the way for a rapid and cost-effective approach to evaluating the pavement network in the future.
[CV-213] A Coarse-to-Fine Approach to Multi-Modality 3D Occupancy Grounding IROS2025
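【Quick Read】: This paper addresses the loss of fine-grained information in conventional visual grounding tasks that rely on bounding boxes: not all voxels within a bounding box are occupied, so the object representation is inaccurate. The key to the solution is a benchmark dataset for 3D occupancy grounding in complex outdoor scenes, together with an end-to-end multi-modal model, GroundingOcc, which fuses visual, textual, and point cloud features to predict object location and voxel-level occupancy from coarse to fine. A multimodal encoder, an occupancy head, and a grounding head form the core structure, while a 2D grounding module and a depth estimation module strengthen geometric understanding and improve 3D occupancy grounding accuracy.
Link: https://arxiv.org/abs/2508.01197
Authors: Zhan Shi,Song Wang,Junbo Chen,Jianke Zhu
Affiliations: Zhejiang University; Udeer.ai
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: IROS 2025 Accepted Paper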
Abstract:Visual grounding aims to identify objects or regions in a scene based on natural language descriptions, essential for spatially aware perception in autonomous driving. However, existing visual grounding tasks typically depend on bounding boxes that often fail to capture fine-grained details. Not all voxels within a bounding box are occupied, resulting in inaccurate object representations. To address this, we introduce a benchmark for 3D occupancy grounding in challenging outdoor scenes. Built on the nuScenes dataset, it integrates natural language with voxel-level occupancy annotations, offering more precise object perception compared to the traditional grounding task. Moreover, we propose GroundingOcc, an end-to-end model designed for 3D occupancy grounding through multi-modal learning. It combines visual, textual, and point cloud features to predict object location and occupancy information from coarse to fine. Specifically, GroundingOcc comprises a multimodal encoder for feature extraction, an occupancy head for voxel-wise predictions, and a grounding head to refine localization. Additionally, a 2D grounding module and a depth estimation module enhance geometric understanding, thereby boosting model performance. Extensive experiments on the benchmark demonstrate that our method outperforms existing baselines on 3D occupancy grounding. The dataset is available at this https URL.
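To make the motivating observation concrete (a bounding box usually contains many empty voxels), the following toy sketch counts what fraction of the voxels inside an axis-aligned box are actually occupied. The function name `box_occupancy_ratio` and the integer voxel-grid convention are assumptions for illustration and are not taken from the benchmark.

```python
def box_occupancy_ratio(occupied_voxels, box_min, box_max):
    """Fraction of voxels inside an axis-aligned box that are occupied.

    occupied_voxels: iterable of integer (x, y, z) voxel indices.
    box_min, box_max: inclusive integer corners of the box.
    """
    total = 1
    for lo, hi in zip(box_min, box_max):
        total *= hi - lo + 1
    inside = sum(
        1
        for v in occupied_voxels
        if all(lo <= c <= hi for c, lo, hi in zip(v, box_min, box_max))
    )
    return inside / total

if __name__ == "__main__":
    occupied = {(0, 0, 0), (1, 0, 0), (5, 5, 5)}
    print(box_occupancy_ratio(occupied, (0, 0, 0), (1, 1, 1)))
```

A ratio well below 1 means the box overstates the object's extent, which is the gap that voxel-level occupancy annotations are meant to close.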
[CV-214] Object Affordance Recognition and Grounding via Multi-scale Cross-modal Representation Learning
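【Quick Read】: This paper addresses a core problem of Embodied AI, learning object manipulation from observations such as images, and in particular the tasks of localizing 3D object affordance areas (3D affordance grounding) and classifying their functionality (affordance classification). Existing methods usually handle the two tasks separately, which leads to inconsistent predictions. They also ground only the incomplete affordance areas visible in the image, failing to infer the full potential affordance areas, and they operate at a fixed scale, which makes it hard to cope with affordance areas of varying scale. The key to the solution is a novel affordance-aware 3D representation learning method with a stage-wise inference strategy that explicitly models the dependency between grounding and classification. Cross-modal 3D feature fusion and multi-scale geometric feature propagation first build an efficient representation, enabling inference of the full potential affordance areas at a suitable regional scale. A simple two-stage prediction mechanism then couples the two tasks for better overall affordance understanding.
Link: https://arxiv.org/abs/2508.01184
Authors: Xinhang Wan,Dongqiang Gou,Xinwang Liu,En Zhu,Xuming He
Affiliations: National University of Defense Technology; ShanghaiTech University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: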
Abstract:A core problem of Embodied AI is to learn object manipulation from observation, as humans do. To achieve this, it is important to localize 3D object affordance areas through observation such as images (3D affordance grounding) and understand their functionalities (affordance classification). Previous attempts usually tackle these two tasks separately, leading to inconsistent predictions due to lacking proper modeling of their dependency. In addition, these methods typically only ground the incomplete affordance areas depicted in images, failing to predict the full potential affordance areas, and operate at a fixed scale, resulting in difficulty in coping with affordances significantly varying in scale with respect to the whole object. To address these issues, we propose a novel approach that learns an affordance-aware 3D representation and employs a stage-wise inference strategy leveraging the dependency between grounding and classification tasks. Specifically, we first develop a cross-modal 3D representation through efficient fusion and multi-scale geometric feature propagation, enabling inference of full potential affordance areas at a suitable regional scale. Moreover, we adopt a simple two-stage prediction mechanism, effectively coupling grounding and classification for better affordance understanding. Experiments demonstrate the effectiveness of our method, showing improved performance in both affordance grounding and classification.
[CV-215] No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views
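【Quick Read】: This paper addresses efficient 3D Gaussian Splatting reconstruction from sparse multi-view images without relying on ground-truth camera poses during either training or inference. The key solution is the SPFSplat framework, which uses a shared feature extraction backbone to predict 3D Gaussian primitives and camera poses simultaneously in a single forward pass, with everything expressed in a canonical space. A reprojection loss is also introduced to enforce the learning of pixel-aligned Gaussian primitives and thereby strengthen geometric constraints. This pose-free training paradigm and efficient one-step inference design allow the model to achieve state-of-the-art novel view synthesis even under significant viewpoint changes and limited image overlap, and to outperform methods trained with geometry priors on relative pose estimation.
Link: https://arxiv.org/abs/2508.01171
Authors: Ranran Huang,Krystian Mikolajczyk
Affiliations: Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project Page: this https URL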
Abstract:We introduce SPFSplat, an efficient framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training or inference. It employs a shared feature extraction backbone, enabling simultaneous prediction of 3D Gaussian primitives and camera poses in a canonical space from unposed inputs within a single feed-forward step. Alongside the rendering loss based on estimated novel-view poses, a reprojection loss is integrated to enforce the learning of pixel-aligned Gaussian primitives for enhanced geometric constraints. This pose-free training paradigm and efficient one-step feed-forward design make SPFSplat well-suited for practical applications. Remarkably, despite the absence of pose supervision, SPFSplat achieves state-of-the-art performance in novel view synthesis even under significant viewpoint changes and limited image overlap. It also surpasses recent methods trained with geometry priors in relative pose estimation. Code and trained models are available on our project page: this https URL.
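The abstract names a reprojection loss without giving its formula. A generic version, sketched below under that caveat, projects each predicted 3D point into an image with an estimated pose and a pinhole camera, then averages the pixel distance to the location the point is supposed to align with. The function names, the pinhole intrinsics `fx, fy, cx, cy`, and the plain L2 distance are assumptions, not the paper's exact loss.

```python
import math

def project(point, rotation, translation, fx, fy, cx, cy):
    """Pinhole projection of a 3D point after applying pose (R, t)."""
    cam = [
        sum(r * p for r, p in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    ]
    x, y, z = cam
    return (fx * x / z + cx, fy * y / z + cy)

def reprojection_loss(points, pixels, rotation, translation, fx, fy, cx, cy):
    """Mean pixel distance between projected 3D points and target pixels."""
    total = 0.0
    for p, (u, v) in zip(points, pixels):
        pu, pv = project(p, rotation, translation, fx, fy, cx, cy)
        total += math.hypot(pu - u, pv - v)
    return total / len(points)

if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    zero_t = [0.0, 0.0, 0.0]
    pts = [(1.0, 0.0, 2.0)]
    print(reprojection_loss(pts, [(103.0, 44.0)], identity, zero_t,
                            100.0, 100.0, 50.0, 40.0))
```

Because the loss depends on both the 3D points and the pose, minimizing it ties the two predictions together even when no ground-truth pose is available.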
[CV-216] DELTAv2: Accelerating Dense 3D Tracking
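【Quick Read】: This paper addresses the computational efficiency of dense long-term 3D point tracking in videos, in particular the high cost of existing transformer-based iterative tracking when handling many trajectories and the expensive computation of correlation features. The key solutions are twofold. First, a coarse-to-fine strategy starts tracking from a small number of points and progressively expands the trajectory set, with newly added trajectories initialized by a learnable interpolation module trained end-to-end together with the tracking network. Second, an optimized correlation feature computation significantly reduces the cost of this step. Together, the two improvements deliver a 5-100x speedup while maintaining state-of-the-art tracking accuracy.
Link: https://arxiv.org/abs/2508.01170
Authors: Tuan Duc Ngo,Ashkan Mirzaei,Guocheng Qian,Hanwen Liang,Chuang Gan,Evangelos Kalogerakis,Peter Wonka,Chaoyang Wang
Affiliations: UMass Amherst; Snap Inc.; University of Toronto; TU Crete; KAUST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: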
Abstract:We propose a novel algorithm for accelerating dense long-term 3D point tracking in videos. Through analysis of existing state-of-the-art methods, we identify two major computational bottlenecks. First, transformer-based iterative tracking becomes expensive when handling a large number of trajectories. To address this, we introduce a coarse-to-fine strategy that begins tracking with a small subset of points and progressively expands the set of tracked trajectories. The newly added trajectories are initialized using a learnable interpolation module, which is trained end-to-end alongside the tracking network. Second, we propose an optimization that significantly reduces the cost of correlation feature computation, another key bottleneck in prior methods. Together, these improvements lead to a 5-100x speedup over existing approaches while maintaining state-of-the-art tracking accuracy.
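As a rough illustration of the coarse-to-fine idea, the sketch below initializes dense trajectories from a few coarse ones. The paper uses a learnable interpolation module; this example swaps in fixed linear interpolation along one axis purely for illustration, and the function name `init_dense_tracks` and the 1D setup are assumptions.

```python
def init_dense_tracks(coarse_x, coarse_disp, dense_x):
    """Initialize dense displacements by linear interpolation.

    coarse_x: sorted positions of the already-tracked coarse points.
    coarse_disp: displacement estimated for each coarse point.
    dense_x: positions of newly added points; values outside the coarse
    range are clamped to the nearest coarse displacement.
    """
    out = []
    for x in dense_x:
        if x <= coarse_x[0]:
            out.append(coarse_disp[0])
        elif x >= coarse_x[-1]:
            out.append(coarse_disp[-1])
        else:
            for i in range(len(coarse_x) - 1):
                x0, x1 = coarse_x[i], coarse_x[i + 1]
                if x0 <= x <= x1:
                    t = (x - x0) / (x1 - x0)
                    out.append((1 - t) * coarse_disp[i] + t * coarse_disp[i + 1])
                    break
    return out

if __name__ == "__main__":
    print(init_dense_tracks([0, 4], [0.0, 8.0], [0, 1, 2, 4, 5]))
```

A good initialization means the expensive iterative tracker has less to correct for each new point, which is where the savings in a coarse-to-fine scheme would come from.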
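[CV-217] TEACH: Text Encoding as Curriculum Hints for Scene Text Recognition
【Quick Read】: This paper addresses the difficulty of Scene Text Recognition (STR) caused by complex visual appearances and limited semantic priors. The key to the solution is a novel training paradigm, TEACH, which injects the ground-truth text labels into the model as auxiliary input and progressively reduces their influence during training, simulating a curriculum learning process that guides the model from label-dependent learning to fully visual recognition. By encoding target labels into the embedding space and applying loss-aware masking, TEACH achieves end-to-end optimization with no external pretraining and no inference overhead. It is also model-agnostic and can be integrated seamlessly into existing encoder-decoder frameworks.
Link: https://arxiv.org/abs/2508.01153
Authors: Xiahan Yang,Hui Zheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 9 pages (w/o ref), 5 figures, 7 tables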
Abstract:Scene Text Recognition (STR) remains a challenging task due to complex visual appearances and limited semantic priors. We propose TEACH, a novel training paradigm that injects ground-truth text into the model as auxiliary input and progressively reduces its influence during training. By encoding target labels into the embedding space and applying loss-aware masking, TEACH simulates a curriculum learning process that guides the model from label-dependent learning to fully visual recognition. Unlike language model-based approaches, TEACH requires no external pretraining and introduces no inference overhead. It is model-agnostic and can be seamlessly integrated into existing encoder-decoder frameworks. Extensive experiments across multiple public benchmarks show that models trained with TEACH achieve consistently improved accuracy, especially under challenging conditions, validating its robustness and general applicability.
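The abstract does not define the loss-aware masking schedule, so the sketch below only illustrates the general shape of the idea: the share of ground-truth hint tokens kept visible shrinks as the training loss falls. The ratio rule `current_loss / initial_loss`, the `<mask>` token, and both function names are hypothetical choices for illustration.

```python
import random

def hint_keep_ratio(current_loss, initial_loss):
    """Hypothetical schedule: keep fewer hints as the loss decreases."""
    if initial_loss <= 0:
        return 0.0
    return max(0.0, min(1.0, current_loss / initial_loss))

def mask_hints(label_tokens, keep_ratio, mask_token="<mask>", rng=None):
    """Replace each hint token with mask_token with probability 1 - keep_ratio."""
    rng = rng or random.Random(0)
    return [tok if rng.random() < keep_ratio else mask_token for tok in label_tokens]

if __name__ == "__main__":
    ratio = hint_keep_ratio(current_loss=1.0, initial_loss=4.0)
    print(ratio, mask_hints(list("street"), ratio))
```

At inference time no hints are supplied at all, which corresponds to a keep ratio of zero and matches the claim that the method adds no inference overhead.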
[CV-218] LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation ICCV2025
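【Quick Read】: This paper addresses how to generate high-accuracy, personalized object masks in controllable dichotomous image segmentation (DIS). Existing methods are limited in flexibility and control granularity and struggle to meet users' needs for fine adjustment of segmented regions. The key to the solution is the LawDIS framework, which recasts DIS as an image-conditioned mask generation task within a latent diffusion model, enabling seamless integration of user instructions. The framework introduces macro and micro control modes: in macro mode, a language-controlled segmentation strategy (LS) generates an initial mask, and in micro mode, a window-controlled refinement strategy (WR) refines user-specified regions with size-adjustable windows. Coordinated by a mode switcher, the two modes can be used independently or jointly, improving segmentation accuracy and the flexibility of user interaction.
Link: https://arxiv.org/abs/2508.01152
Authors: Xinyu Yan,Meijun Sun,Ge-Peng Ji,Fahad Shahbaz Khan,Salman Khan,Deng-Ping Fan
Affiliations: Tianjin University; Tianjin Key Laboratory of Machine Learning; Australian National University; Nankai Institute of Advanced Research (Shenzhen Futian); Nankai University; MBZUAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 17 pages, 10 figures, ICCV 2025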
Abstract:We present LawDIS, a language-window-based controllable dichotomous image segmentation (DIS) framework that produces high-quality object masks. Our framework recasts DIS as an image-conditioned mask generation task within a latent diffusion model, enabling seamless integration of user controls. LawDIS is enhanced with macro-to-micro control modes. Specifically, in macro mode, we introduce a language-controlled segmentation strategy (LS) to generate an initial mask based on user-provided language prompts. In micro mode, a window-controlled refinement strategy (WR) allows flexible refinement of user-defined regions (i.e., size-adjustable windows) within the initial mask. Coordinated by a mode switcher, these modes can operate independently or jointly, making the framework well-suited for high-accuracy, personalised applications. Extensive experiments on the DIS5K benchmark reveal that our LawDIS significantly outperforms 11 cutting-edge methods across all metrics. Notably, compared to the second-best model MVANet, we achieve $F_\beta^\omega$ gains of 4.6% with both the LS and WR strategies and 3.6% gains with only the LS strategy on DIS-TE. Codes will be made available at this https URL.
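The window-controlled refinement step can be pictured as cropping a user-chosen window from the initial mask, refining only that crop, and pasting it back. The sketch below shows that data flow with a placeholder refiner; `refine_window`, the `(top, left, height, width)` window layout, and `refine_fn` are hypothetical names, and the real refiner in the paper is a diffusion model rather than a plain function.

```python
def refine_window(mask, window, refine_fn):
    """Refine only the region of a 2D mask covered by a window.

    mask: list of rows (lists of numbers).
    window: (top, left, height, width) in pixel units.
    refine_fn: callable that maps a cropped patch to a same-sized patch.
    Returns a new mask; the input mask is left unchanged.
    """
    top, left, height, width = window
    patch = [row[left:left + width] for row in mask[top:top + height]]
    refined = refine_fn(patch)
    out = [list(row) for row in mask]
    for i, row in enumerate(refined):
        out[top + i][left:left + width] = row
    return out

if __name__ == "__main__":
    initial = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
    fill = lambda patch: [[1 for _ in row] for row in patch]
    print(refine_window(initial, (1, 1, 2, 2), fill))
```

Restricting the refinement to the window leaves the rest of the mask untouched, which is what makes the size-adjustable window a local control.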
[CV-219] Personalized Safety Alignment for Text-to-Image Diffusion Models
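【Quick Read】: This paper addresses the one-size-fits-all nature of current safety mechanisms in text-to-image diffusion models: a uniform safety standard cannot accommodate individual users' differing safety boundaries, which are shaped by factors such as age, mental health, and personal values. To meet this challenge, the authors propose the Personalized Safety Alignment (PSA) framework, whose core idea is to embed user-specific safety preferences into the diffusion process. Using a new dataset named Sage and a cross-attention mechanism to adjust generation behavior dynamically, PSA suppresses harmful content effectively while preserving image quality and markedly improves consistency with user constraints.
Link: https://arxiv.org/abs/2508.01151
Authors: Yu Lei,Jinbin Bai,Qingyu Shi,Aosong Feng,Kaidong Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 14 pages, 8 figures, 4 tables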
Abstract:Text-to-image diffusion models have revolutionized visual content generation, but current safety mechanisms apply uniform standards that often fail to account for individual user preferences. These models overlook the diverse safety boundaries shaped by factors like age, mental health, and personal beliefs. To address this, we propose Personalized Safety Alignment (PSA), a framework that allows user-specific control over safety behaviors in generative models. PSA integrates personalized user profiles into the diffusion process, adjusting the model’s behavior to match individual safety preferences while preserving image quality. We introduce a new dataset, Sage, which captures user-specific safety preferences and incorporates these profiles through a cross-attention mechanism. Experiments show that PSA outperforms existing methods in harmful content suppression and aligns generated content better with user constraints, achieving higher Win Rate and Pass Rate scores. Our code, data, and models are publicly available at this https URL.
[CV-220] OpenGS-Fusion: Open-Vocabulary Dense Mapping with Hybrid 3D Gaussian Splatting for Refined Object-Level Understanding IROS2025
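【Quick Read】: This paper addresses the difficulty existing 3D scene understanding methods have in achieving precise object-level semantic modeling and on-the-fly fusion under open-vocabulary queries, in particular the limited flexibility and semantic precision caused by conventional offline pipelines. The key to the solution is the OpenGS-Fusion framework, which combines a 3D Gaussian representation with a Truncated Signed Distance Field to achieve lossless online fusion of semantic features. It also introduces a multimodal language-guided strategy, MLLM-Assisted Adaptive Thresholding, which adaptively adjusts similarity thresholds to improve 3D object segmentation, raising 3D mIoU by 17% over a fixed-threshold strategy.
Link: https://arxiv.org/abs/2508.01150
Authors: Dianyi Yang,Xihan Wang,Yu Gao,Shiyang Liu,Bohan Ren,Yufeng Yue,Yi Yang
Affiliations: Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: IROS2025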
Abstract:Recent advancements in 3D scene understanding have made significant strides in enabling interaction with scenes using open-vocabulary queries, particularly for VR/AR and robotic applications. Nevertheless, existing methods are hindered by rigid offline pipelines and the inability to provide precise 3D object-level understanding given open-ended queries. In this paper, we present OpenGS-Fusion, an innovative open-vocabulary dense mapping framework that improves semantic modeling and refines object-level understanding. OpenGS-Fusion combines 3D Gaussian representation with a Truncated Signed Distance Field to facilitate lossless fusion of semantic features on-the-fly. Furthermore, we introduce a novel multimodal language-guided approach named MLLM-Assisted Adaptive Thresholding, which refines the segmentation of 3D objects by adaptively adjusting similarity thresholds, achieving a 17% improvement in 3D mIoU compared to the fixed threshold strategy. Extensive experiments demonstrate that our method outperforms existing methods in 3D object understanding and scene reconstruction quality, as well as showcasing its effectiveness in language-guided scene interaction. The code is available at this https URL.
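To show why the similarity threshold matters for segmentation quality, the toy sketch below thresholds per-point similarity scores and measures IoU against a reference mask. It does not model the MLLM-assisted selection described in the paper; the thresholds are picked by hand, and the function names and example scores are assumptions for illustration.

```python
def segment(similarities, threshold):
    """Mark a point as part of the object when its similarity passes the threshold."""
    return [s >= threshold for s in similarities]

def iou(pred, gt):
    """Intersection over union of two boolean masks of equal length."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0

if __name__ == "__main__":
    sims = [0.9, 0.7, 0.4, 0.2]
    gt = [True, True, False, False]
    for thr in (0.8, 0.5, 0.1):
        print(thr, iou(segment(sims, thr), gt))
```

In this example a threshold that is too strict misses part of the object and one that is too loose absorbs background, so a single fixed value cannot suit every query, which is the motivation for adapting it.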
[CV-221] Dataset Condensation with Color Compensation
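【Quick Read】: This paper addresses the trade-off between performance and fidelity in Dataset Condensation: existing image-level selection methods (such as Coreset Selection and Dataset Quantization) condense inefficiently, while pixel-level optimization (such as Dataset Distillation) introduces semantic distortion due to over-parameterization. The key to the solution is recognizing color's dual role in a dataset as both an information carrier and a basic semantic unit, and proposing the DC3 framework, which adds a color compensation mechanism to enrich the color of condensed images and thereby strengthen representation learning. Specifically, after a calibrated selection strategy, DC3 uses a Latent Diffusion Model to enhance the color diversity of images rather than generating entirely new ones, achieving high-quality, high-fidelity dataset condensation. It is also the first work to fine-tune pre-trained diffusion models with condensed datasets, and it significantly outperforms current state-of-the-art methods.
Link: https://arxiv.org/abs/2508.01139
Authors: Huyu Wu,Duo Su,Junjie Hou,Guang Li
Affiliations: University of Chinese Academy of Sciences; Tsinghua University; Hong Kong University of Science and Technology; Hokkaido University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: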
Abstract:Dataset condensation always faces a constitutive trade-off: balancing performance and fidelity under extreme compression. Existing methods struggle with two bottlenecks: image-level selection methods (Coreset Selection, Dataset Quantization) suffer from inefficient condensation, while pixel-level optimization (Dataset Distillation) introduces semantic distortion due to over-parameterization. With empirical observations, we find that a critical problem in dataset condensation is the oversight of color’s dual role as an information carrier and a basic semantic representation unit. We argue that improving the colorfulness of condensed images is beneficial for representation learning. Motivated by this, we propose DC3: a Dataset Condensation framework with Color Compensation. After a calibrated selection strategy, DC3 utilizes the latent diffusion model to enhance the color diversity of an image rather than creating a brand-new one. Extensive experiments demonstrate the superior performance and generalization of DC3, which outperforms SOTA methods across multiple benchmarks. To the best of our knowledge, besides focusing on downstream tasks, DC3 is the first research to fine-tune pre-trained diffusion models with condensed datasets. The FID results prove that training networks with our high-quality datasets is feasible without model collapse or other degradation issues. Code and generated data will be released soon.
[CV-222] Semi-Supervised Anomaly Detection in Brain MRI Using a Domain-Agnostic Deep Reinforcement Learning Approach
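【Quick Read】: This paper addresses challenges in anomaly detection for brain magnetic resonance imaging (MRI), including large-scale data processing, overfitting, and class imbalance. The key to the solution is a domain-agnostic semi-supervised anomaly detection framework that integrates deep reinforcement learning (DRL) with feature representations to handle label scarcity, large-scale data, and overfitting. Experiments show that the method achieves an AUROC of 88.7% (pixel-level) and 96.7% (image-level) on brain MRI datasets, outperforming existing state-of-the-art (SOTA) methods, and that it also generalizes well across domains on an industrial surface defect dataset, MVTec AD (pixel-level AUROC = 99.8%, image-level AUROC = 99.3%), supporting its robustness, generality, and efficiency.
Link: https://arxiv.org/abs/2508.01137
Authors: Zeduo Zhang,Yalda Mohsenzadeh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 34 pages, 6 figures and 4 tables in main text, 17 pages supplementary material with 3 tables and 3 figures; Submitted to Radiology: Artificial Intelligence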
Abstract:This study aimed to develop a domain-agnostic, semi-supervised anomaly detection framework that integrates deep reinforcement learning (DRL) to address challenges such as large-scale data, overfitting, and class imbalance, focusing on brain MRI volumes. This retrospective study used publicly available brain MRI datasets collected between 2005 and 2021. The IXI dataset provided 581 T1-weighted and 578 T2-weighted MRI volumes (from healthy subjects) for training, while the BraTS 2021 dataset provided 251 volumes for validation and 1000 for testing (unhealthy subjects with glioblastomas). Preprocessing included normalization, skull-stripping, and co-registering to a uniform voxel size. Experiments were conducted on both T1- and T2-weighted modalities. Additional experiments and ablation analyses were also carried out on the industrial datasets. The proposed method integrates DRL with feature representations to handle label scarcity, large-scale data and overfitting. Statistical analysis was based on several detection and segmentation metrics including AUROC and Dice score. The proposed method achieved an AUROC of 88.7% (pixel-level) and 96.7% (image-level) on brain MRI datasets, outperforming state-of-the-art (SOTA) methods. On industrial surface datasets, the model also showed competitive performance (AUROC = 99.8% pixel-level, 99.3% image-level) on the MVTec AD dataset, indicating strong cross-domain generalization. Studies on anomaly sample size showed a monotonic increase in AUROC as more anomalies were seen, without evidence of overfitting or additional computational cost. The domain-agnostic semi-supervised approach using DRL shows significant promise for MRI anomaly detection, achieving strong performance on both medical and industrial datasets. Its robustness, generalizability and efficiency highlight its potential for real-world clinical applications.
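Since the headline results are reported as AUROC, a small reference computation may help when reading the numbers. The sketch below uses the standard pairwise definition: the probability that a randomly chosen anomalous sample scores higher than a randomly chosen normal one, with ties counted as one half. The function name `auroc` and the example scores are illustrative and unrelated to the paper's data.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: anomaly scores (higher means more anomalous).
    labels: 1 for anomalous samples, 0 for normal samples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The same function applies at pixel level (one score per pixel) or image level (one score per image); only the unit being scored changes. The pairwise loop is quadratic, so it is meant for small examples rather than full volumes.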
[CV-223] UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction Forecasting and Generation ICCV2025
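【Quick Read】: This paper addresses the limited effectiveness of current third-person motion generation and forecasting methods in first-person (egocentric) settings, especially where a restricted field of view, frequent occlusions, and dynamic cameras make scene perception difficult. Existing methods depend on explicit 3D scene structure and adapt poorly to the complexity of real-world egocentric views. The key to the solution is UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation, which extracts scene semantics from first-person images alone and generates plausible 3D motion without explicit 3D scene reconstruction. The model supports egocentric motion reconstruction, forecasting, and generation, and is the first to generate motion from a single egocentric image, significantly improving motion modeling in egocentric settings.
Link: https://arxiv.org/abs/2508.01126
Authors: Chaitanya Patel,Hiroki Nakamura,Yuta Kyuragi,Kazuki Kozuka,Juan Carlos Niebles,Ehsan Adeli
Affiliations: Stanford University; Panasonic Holdings Corporation; Panasonic R&D Company of America
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: ICCV 2025. Project Page: this https URL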
Abstract:Egocentric human motion generation and forecasting with scene-context is crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions by accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, limiting their effectiveness in real-world egocentric settings where limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis without relying on an explicit 3D scene. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion’s simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D, augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image. Extensive evaluations demonstrate the effectiveness of our unified framework, setting a new benchmark for egocentric motion modeling and unlocking new possibilities for egocentric applications.
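[CV-224] The Promise of RL for Autoregressive Image Editing
[Quick Read]: This paper aims to improve model performance on image editing tasks, in particular how to achieve efficient and diverse editing with limited training data. The core challenge is how to combine supervised fine-tuning (SFT), reinforcement learning (RL), and Chain-of-Thought (CoT) reasoning to improve generation quality and controllability. The key to the solution is an autoregressive multimodal model that processes textual and visual tokens in a unified manner, together with the finding that RL combined with a large multimodal LLM (MLLM) verifier is the most effective of these strategies. On this basis the authors release EARL (Editing with Autoregression and RL), which performs competitively across a diverse range of edits while using much less training data than existing methods, advancing autoregressive multimodal models for image editing.
Link: https://arxiv.org/abs/2508.01119
Authors: Saba Ahmadi, Rabiul Awal, Ankur Sikarwar, Amirhossein Kazemnejad, Ge Ya Luo, Juan A. Rodriguez, Sai Rajeswar, Siva Reddy, Christopher Pal, Benno Krojer, Aishwarya Agrawal
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: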
Abstract:We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learning (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multi-modal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at this https URL.
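[CV-225] MASIV: Toward Material-Agnostic System Identification from Videos ICCV2025
[Quick Read]: This paper addresses material-agnostic system identification from videos: recovering both object geometry and the physical laws governing its dynamics from visual observations, without relying on predefined material priors. Existing methods typically combine differentiable rendering with simulation but depend on strong assumptions about material-specific constitutive laws, so they struggle with unknown materials. The key to the solution is the MASIV framework, whose core innovation is to replace hand-crafted constitutive laws with learnable neural constitutive models, enabling dynamics inference without a scene-specific material prior. It also introduces dense geometric guidance, reconstructing continuum particle trajectories to provide temporally rich motion constraints, which mitigates the unstable optimization and physically implausible behavior caused by missing particle state information.
Link: https://arxiv.org/abs/2508.01112
Authors: Yizhou Zhao, Haoyu Chen, Chunjiang Liu, Zhenyang Li, Charles Herrmann, Junhwa Hur, Yinxiao Li, Ming-Hsuan Yang, Bhiksha Raj, Min Xu
Affiliations: Carnegie Mellon University; University of Alabama at Birmingham; Google; UC Merced
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025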
Abstract:System identification from videos aims to recover object geometry and governing physical laws. Existing methods integrate differentiable rendering with simulation but rely on predefined material priors, limiting their ability to handle unknown ones. We introduce MASIV, the first vision-based framework for material-agnostic system identification. Unlike existing approaches that depend on hand-crafted constitutive laws, MASIV employs learnable neural constitutive models, inferring object dynamics without assuming a scene-specific material prior. However, the absence of full particle state information imposes unique challenges, leading to unstable optimization and physically implausible behaviors. To address this, we introduce dense geometric guidance by reconstructing continuum particle trajectories, providing temporally rich motion constraints beyond sparse visual cues. Comprehensive experiments show that MASIV achieves state-of-the-art performance in geometric accuracy, rendering quality, and generalization ability.
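[CV-226] Trans-Adapter: A Plug-and-Play Framework for Transparent Image Inpainting ICCV2025
[Quick Read]: This paper addresses the inability of existing image inpainting methods to handle RGBA images with a transparency channel directly. In particular, edits to transparent regions struggle to preserve transparency consistency, and the conventional two-stage pipeline of inpainting followed by matting tends to introduce jagged edges. The key to the solution is Trans-Adapter, a plug-and-play adapter module that lets diffusion-based inpainting models process RGBA images directly, supports controllable editing via ControlNet, and can be integrated seamlessly into various community models, improving the quality and consistency of inpainting in transparent regions.
Link: https://arxiv.org/abs/2508.01098
Authors: Yuekun Dai, Haitian Li, Shangchen Zhou, Chen Change Loy
Affiliations: Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted to ICCV 2025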
Abstract:RGBA images, with the additional alpha channel, are crucial for any application that needs blending, masking, or transparency effects, making them more versatile than standard RGB images. Nevertheless, existing image inpainting methods are designed exclusively for RGB images. Conventional approaches to transparent image inpainting typically involve placing a background underneath RGBA images and employing a two-stage process: image inpainting followed by image matting. This pipeline, however, struggles to preserve transparency consistency in edited regions, and matting can introduce jagged edges along transparency boundaries. To address these challenges, we propose Trans-Adapter, a plug-and-play adapter that enables diffusion-based inpainting models to process transparent images directly. Trans-Adapter also supports controllable editing via ControlNet and can be seamlessly integrated into various community models. To evaluate our method, we introduce LayerBench, along with a novel non-reference alpha edge quality evaluation metric for assessing transparency edge quality. We conduct extensive experiments on LayerBench to demonstrate the effectiveness of our approach.
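The conventional pipeline described in the abstract starts by placing a background underneath the RGBA image. A minimal sketch of that compositing step follows, assuming straight (non-premultiplied) alpha with all channels in [0, 1] and the standard "over" operator; the function names are hypothetical and nothing here is taken from the paper's code.

```python
def composite_over(rgba, background):
    """Composite one straight-alpha RGBA pixel over an opaque RGB background.

    Assumption: channels are floats in [0, 1] and alpha is not premultiplied.
    Implements C = alpha * F + (1 - alpha) * B per color channel.
    """
    r, g, b, a = rgba
    return tuple(a * f + (1.0 - a) * bg for f, bg in zip((r, g, b), background))


def flatten_image(pixels, background):
    """Apply composite_over to every pixel of a row-major list of rows."""
    return [[composite_over(p, background) for p in row] for row in pixels]


if __name__ == "__main__":
    half_red = (1.0, 0.0, 0.0, 0.5)   # 50% transparent red
    blue_bg = (0.0, 0.0, 1.0)
    print(composite_over(half_red, blue_bg))
```

The flattened result no longer carries an alpha channel, which is why such a pipeline needs a separate matting stage afterwards to recover transparency.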
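[CV-227] AURA: A Hybrid Spatiotemporal-Chromatic Framework for Robust Real-Time Detection of Industrial Smoke Emissions
[Quick Read]: This paper addresses two weaknesses of current industrial smoke emission monitoring systems: a lack of specificity in distinguishing smoke types and instability under environmental variation. The key to the solution is AURA, a new hybrid spatiotemporal-chromatic framework that uses both the dynamic motion patterns and the distinctive color characteristics of smoke to improve detection and classification accuracy and to reduce false positives.
Link: https://arxiv.org/abs/2508.01095
Authors: Mikhail Bychkov, Matey Yordanov, Andrei Kuchma
Affiliations: Hong Kong University of Science and Technology; ANTEI Limited; ITMO University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 3 figures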
Abstract:This paper introduces AURA, a novel hybrid spatiotemporal-chromatic framework designed for robust, real-time detection and classification of industrial smoke emissions. The framework addresses critical limitations of current monitoring systems, which often lack the specificity to distinguish smoke types and struggle with environmental variability. AURA leverages both the dynamic movement patterns and the distinct color characteristics of industrial smoke to provide enhanced accuracy and reduced false positives. This framework aims to significantly improve environmental compliance, operational safety, and public health outcomes by enabling precise, automated monitoring of industrial emissions.
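[CV-228] COSTARR: Consolidated Open Set Technique with Attenuation for Robust Recognition ICCV2025
[Quick Read]: This paper addresses a key challenge for visual recognition systems, handling novel samples, i.e. open-set recognition (OSR). Traditional methods rely on the familiarity hypothesis, identifying unknown classes by checking whether features of known classes are present. The paper proposes a new attenuation hypothesis: small weights learned during training attenuate the feature representation, which helps separate known classes but discards information useful for distinguishing known from unknown classes. The core of the solution is COSTARR, which uses both the presence of familiar features and the absence of unfamiliar ones and models the attenuated features explicitly, in particular the post-attenuated Hadamard product features, previously overlooked information that the ablations show is essential for improving OSR. Experiments show that COSTARR significantly outperforms prior state-of-the-art methods across several modern pre-trained architectures (ViTs, ConvNeXts, ResNet) in a large-scale setting with ImageNet2012-1K as known data and NINCO, iNaturalist, and other datasets as unknowns.
Link: https://arxiv.org/abs/2508.01087
Authors: Ryan Rabinowitz, Steve Cruz, Walter Scheirer, Terrance E. Boult
Affiliations: University of Colorado Colorado Springs, USA; University of Notre Dame, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025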
Abstract:Handling novelty remains a key challenge in visual recognition systems. Existing open-set recognition (OSR) methods rely on the familiarity hypothesis, detecting novelty by the absence of familiar features. We propose a novel attenuation hypothesis: small weights learned during training attenuate features and serve a dual role, differentiating known classes while discarding information useful for distinguishing known from unknown classes. To leverage this overlooked information, we present COSTARR, a novel approach that combines both the requirement of familiar features and the lack of unfamiliar ones. We provide a probabilistic interpretation of the COSTARR score, linking it to the likelihood of correct classification and belonging in a known class. To determine the individual contributions of the pre- and post-attenuated features to COSTARR’s performance, we conduct ablation studies that show both pre-attenuated deep features and the underutilized post-attenuated Hadamard product features are essential for improving OSR. Also, we evaluate COSTARR in a large-scale setting using ImageNet2012-1K as known data and NINCO, iNaturalist, OpenImage-O, and other datasets as unknowns, across multiple modern pre-trained architectures (ViTs, ConvNeXts, and ResNet). The experiments demonstrate that COSTARR generalizes effectively across various architectures and significantly outperforms prior state-of-the-art methods by incorporating previously discarded attenuation information, advancing open-set recognition capabilities.
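To make the term "Hadamard product features" concrete, here is a small sketch of one plausible reading: for a linear classification layer, the element-wise product of a class weight vector with the deep feature vector is the set of per-dimension contributions that are normally summed into a logit. This reading, the function names, and the toy numbers are assumptions for illustration; the abstract does not give the exact definition or the COSTARR score.

```python
def hadamard(weights, features):
    """Element-wise (Hadamard) product of a class weight vector and a feature vector."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return [w * f for w, f in zip(weights, features)]


def logit(weights, features, bias=0.0):
    """A linear-layer logit is the sum of the Hadamard product plus a bias."""
    return sum(hadamard(weights, features)) + bias


if __name__ == "__main__":
    class_weights = [0.5, 0.0, -2.0]   # hypothetical learned weights for one class
    deep_features = [4.0, 3.0, 1.0]    # hypothetical pre-attenuated features
    # The zero weight wipes out the second feature entirely: a small weight
    # attenuates whatever that dimension carried.
    print(hadamard(class_weights, deep_features))
    print(logit(class_weights, deep_features))
```

Summing collapses the per-dimension vector to a single number, so keeping the vector itself preserves information that the logit alone does not expose.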
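[CV-229] DreamSat-2.0: Towards a General Single-View Asteroid 3D Reconstruction
[Quick Read]: This paper targets asteroid exploration and autonomous spacecraft navigation in deep space, where the core challenge is using generative AI to obtain high-fidelity 3D reconstructions that support mission planning and environment perception. The key to the solution is the DreamSat-2.0 benchmarking pipeline, which systematically evaluates three state-of-the-art 3D reconstruction models (Hunyuan-3D, Trellis-3D, and Ouroboros-3D) on custom spacecraft and asteroid datasets. Using both 2D perceptual (image quality) and 3D geometric (shape accuracy) metrics, it shows that model performance is domain-dependent: structurally complex spacecraft yield higher-quality images, while the simpler forms of asteroids yield more accurate geometry. This gives a quantitative basis for task-oriented model selection and optimization.
Link: https://arxiv.org/abs/2508.01079
Authors: Santiago Diaz, Xinghui Hu, Josiane Uwumukiza, Giovanni Lavezzi, Victor Rodriguez-Fernandez, Richard Linares
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: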
Abstract:To enhance asteroid exploration and autonomous spacecraft navigation, we introduce DreamSat-2.0, a pipeline that benchmarks three state-of-the-art 3D reconstruction models (Hunyuan-3D, Trellis-3D, and Ouroboros-3D) on custom spacecraft and asteroid datasets. Our systematic analysis, using 2D perceptual (image quality) and 3D geometric (shape accuracy) metrics, reveals that model performance is domain-dependent. While models produce higher-quality images of complex spacecraft, they achieve better geometric reconstructions for the simpler forms of asteroids. New benchmarks are established, with Hunyuan-3D achieving top perceptual scores on spacecraft but its best geometric accuracy on asteroids, marking a significant advance over our prior work.
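[CV-230] Evading Data Provenance in Deep Neural Networks ICCV2025
[Quick Read]: This paper addresses the fact that current Dataset Ownership Verification (DOV) methods for data copyright protection can be evaded: a model trained with a suitable strategy can escape DOV detection, allowing copyrighted data to be used for training without authorization. The key to the solution is a unified evasion framework based on knowledge transfer. A teacher model first learns task knowledge from the copyrighted dataset, then transfers task-relevant yet identifier-independent domain knowledge to a student model, with an out-of-distribution (OOD) dataset as the intermediary. Vision-Language Models and Large Language Models are used to select the most informative and reliable subset of the OOD gallery as the final transfer set, and task-oriented knowledge is transferred selectively to reach a better trade-off between generalization and evasion effectiveness. Experiments covering eleven DOV methods show that the approach eliminates all copyright identifiers and significantly outperforms nine state-of-the-art evasion attacks, with moderate computational overhead, exposing key vulnerabilities in current DOV methods.
Link: https://arxiv.org/abs/2508.01074
Authors: Hongyu Zhu, Sichu Liang, Wenwen Wang, Zhuomeng Zhang, Fangqi Li, Shi-Lin Wang
Affiliations: Shanghai Jiao Tong University; Southeast University; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: ICCV 2025 Highlight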
Abstract:Modern over-parameterized deep models are highly data-dependent, with large scale general-purpose and domain-specific datasets serving as the bedrock for rapid advancements. However, many datasets are proprietary or contain sensitive information, making unrestricted model training problematic. In the open world where data thefts cannot be fully prevented, Dataset Ownership Verification (DOV) has emerged as a promising method to protect copyright by detecting unauthorized model training and tracing illicit activities. Due to its diversity and superior stealth, evading DOV is considered extremely challenging. However, this paper identifies that previous studies have relied on oversimplistic evasion attacks for evaluation, leading to a false sense of security. We introduce a unified evasion framework, in which a teacher model first learns from the copyright dataset and then transfers task-relevant yet identifier-independent domain knowledge to a surrogate student using an out-of-distribution (OOD) dataset as the intermediary. Leveraging Vision-Language Models and Large Language Models, we curate the most informative and reliable subsets from the OOD gallery set as the final transfer set, and propose selectively transferring task-oriented knowledge to achieve a better trade-off between generalization and evasion effectiveness. Experiments across diverse datasets covering eleven DOV methods demonstrate our approach simultaneously eliminates all copyright identifiers and significantly outperforms nine state-of-the-art evasion attacks in both generalization and effectiveness, with moderate computational overhead. As a proof of concept, we reveal key vulnerabilities in current DOV methods, highlighting the need for long-term development to enhance practicality.
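[CV-231] CP-FREEZER: Latency Attacks against Vehicular Cooperative Perception
[Quick Read]: This paper examines a security threat to the timeliness (or availability) of vehicular cooperative perception (CP): how adversarial perturbations injected through vehicle-to-vehicle (V2V) messages can maximize the computation delay of CP algorithms and thereby impair real-time decision-making in autonomous driving. The key to the solution is the CP-FREEZER attack, which handles the non-differentiability of point cloud preprocessing, copes with asynchronous knowledge of the victim's input caused by transmission delays, and introduces a novel loss function that effectively maximizes the execution time of the CP pipeline. Experiments show that the attack increases end-to-end CP latency by over 90×, pushing per-frame processing time beyond 3 seconds with a 100% success rate.
Link: https://arxiv.org/abs/2508.01062
Authors: Chenyi Wang, Ruoyu Song, Raymond Muller, Jean-Philippe Monteuuis, Z. Berkay Celik, Jonathan Petit, Ryan Gerdes, Ming Li
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: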
Abstract:Cooperative perception (CP) enhances situational awareness of connected and autonomous vehicles by exchanging and combining messages from multiple agents. While prior work has explored adversarial integrity attacks that degrade perceptual accuracy, little is known about CP’s robustness against attacks on timeliness (or availability), a safety-critical requirement for autonomous driving. In this paper, we present CP-FREEZER, the first latency attack that maximizes the computation delay of CP algorithms by injecting adversarial perturbation via V2V messages. Our attack resolves several unique challenges, including the non-differentiability of point cloud preprocessing, asynchronous knowledge of the victim’s input due to transmission delays, and uses a novel loss function that effectively maximizes the execution time of the CP pipeline. Extensive experiments show that CP-FREEZER increases end-to-end CP latency by over 90×, pushing per-frame processing time beyond 3 seconds with a 100% success rate on our real-world vehicle testbed. Our findings reveal a critical threat to the availability of CP systems, highlighting the urgent need for robust defenses.
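[CV-232] Structured Spectral Graph Learning for Anomaly Classification in 3D Chest CT Scans MICCAI2025
[Quick Read]: This paper addresses multi-label anomaly classification in 3D CT scans, which is difficult because of complex spatial relationships and diverse anomaly patterns. Conventional 3D convolutional networks have limited ability to model long-range dependencies, while Vision Transformers are computationally expensive and need extensive in-domain pre-training data. The key to the solution is a new graph-based method: each CT scan is represented as a structured graph whose nodes are axial slice triplets, and features are extracted with spectral domain convolution. This captures spatial relationships across regions, remains robust to z-axis translation, and generalizes well across datasets.
Link: https://arxiv.org/abs/2508.01045
Authors: Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for publication at MICCAI 2025 EMERGE Workshop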
Abstract:With the increasing number of CT scan examinations, there is a need for automated methods such as organ segmentation, anomaly detection and report generation to assist radiologists in managing their increasing workload. Multi-label classification of 3D CT scans remains a critical yet challenging task due to the complex spatial relationships within volumetric data and the variety of observed anomalies. Existing approaches based on 3D convolutional networks have limited abilities to model long-range dependencies while Vision Transformers suffer from high computational costs and often require extensive pre-training on large-scale datasets from the same domain to achieve competitive performance. In this work, we propose an alternative by introducing a new graph-based approach that models CT scans as structured graphs, leveraging axial slice triplets nodes processed through spectral domain convolution to enhance multi-label anomaly classification performance. Our method exhibits strong cross-dataset generalization, and competitive performance while achieving robustness to z-axis translation. An ablation study evaluates the contribution of each proposed component.
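[CV-233] 3D Reconstruction via Incremental Structure From Motion
[Quick Read]: This paper addresses accurate 3D reconstruction from unstructured image collections, in particular the tendency of global Structure from Motion (SfM) methods to fail when image connectivity is incomplete or noisy. The key to the solution is an incremental SfM strategy that builds scene structure and camera motion by progressively adding new views, so that geometric consistency can still be recovered on sparse or partially overlapping datasets. Iterative refinement through bundle adjustment improves the stability and accuracy of the geometric estimates. The approach is demonstrated on a real dataset and assessed mainly through reprojection error and camera trajectory coherence.
Link: https://arxiv.org/abs/2508.01019
Authors: Muhammad Zeeshan, Umer Zaki, Syed Ahmed Pasha, Zaar Khizar
Affiliations: Air University; Institut Pascal
Categories: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Comments: 8 pages, 8 figures, proceedings in International Bhurban Conference on Applied Sciences Technology (IBCAST) 2025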
Abstract:Accurate 3D reconstruction from unstructured image collections is a key requirement in applications such as robotics, mapping, and scene understanding. While global Structure from Motion (SfM) techniques rely on full image connectivity and can be sensitive to noise or missing data, incremental SfM offers a more flexible alternative. By progressively incorporating new views into the reconstruction, it enables the system to recover scene structure and camera motion even in sparse or partially overlapping datasets. In this paper, we present a detailed implementation of the incremental SfM pipeline, focusing on the consistency of geometric estimation and the effect of iterative refinement through bundle adjustment. We demonstrate the approach using a real dataset and assess reconstruction quality through reprojection error and camera trajectory coherence. The results support the practical utility of incremental SfM as a reliable method for sparse 3D reconstruction in visually structured environments.
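Reprojection error, the quality measure named in the abstract, can be sketched for a simple pinhole camera as follows. The camera model (a single focal length, no lens distortion), the function names, and the toy numbers are assumptions for illustration and are not taken from the paper's implementation.

```python
import math


def project(point, rotation, translation, focal, principal):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    Assumptions: rotation is a 3x3 nested list, translation has 3 entries,
    one focal length for both axes, no skew, no lens distortion.
    """
    # Transform to camera coordinates: Xc = R * X + t
    xc = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    u = focal * xc[0] / xc[2] + principal[0]
    v = focal * xc[1] / xc[2] + principal[1]
    return (u, v)


def reprojection_error(point, observed, rotation, translation, focal, principal):
    """Euclidean pixel distance between the observed and the projected point."""
    u, v = project(point, rotation, translation, focal, principal)
    return math.hypot(u - observed[0], v - observed[1])


def mean_reprojection_error(points, observations, rotation, translation, focal, principal):
    """Average reprojection error over matched 3D points and 2D observations."""
    errors = [
        reprojection_error(p, o, rotation, translation, focal, principal)
        for p, o in zip(points, observations)
    ]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(reprojection_error((1.0, 2.0, 10.0), (63.0, 74.0),
                             identity, [0.0, 0.0, 0.0], 100.0, (50.0, 50.0)))
```

Bundle adjustment minimizes the sum of squared errors of this kind jointly over camera poses and 3D points.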
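[CV-234] AutoSIGHT: Automatic Eye Tracking-based System for Immediate Grading of Human experTise
[Quick Read]: This paper asks how eye tracking features can be used to assess a person's level of expertise on a visual task automatically. The core challenge is building a system that separates expert from non-expert performance quickly and accurately, as a basis for dynamic human-AI collaboration. The key to the solution is AutoSIGHT (Automatic System for Immediate Grading of Human experTise), which classifies performers using an ensemble of features extracted from eye tracking data. With an evaluation window of only 5 seconds, AutoSIGHT reaches an average AUROC of 0.751, and with a window of up to 30 seconds the AUROC rises to 0.8306, showing a trade-off between decision latency and accuracy. The work suggests a way to weigh human and machine expertise automatically in human-AI pairing setups.
Link: https://arxiv.org/abs/2508.01015
Authors: Byron Dowling, Jozef Probcin, Adam Czajka
Affiliations: University of Notre Dame
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This work has been accepted for publication in the proceedings of the IEEE VL/HCC conference 2025. The final published version will be available via IEEE Xplore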
Abstract:Can we teach machines to assess the expertise of humans solving visual tasks automatically based on eye tracking features? This paper proposes AutoSIGHT, Automatic System for Immediate Grading of Human experTise, that classifies expert and non-expert performers, and builds upon an ensemble of features extracted from eye tracking data while the performers were solving a visual task. Results on the task of iris Presentation Attack Detection (PAD) used for this study show that with a small evaluation window of just 5 seconds, AutoSIGHT achieves an average Area Under the ROC curve performance of 0.751 in a subject-disjoint train-test regime, indicating that such detection is viable. Furthermore, when a larger evaluation window of up to 30 seconds is available, the Area Under the ROC curve (AUROC) increases to 0.8306, indicating the model is effectively leveraging more information at a cost of slightly delayed decisions. This work opens new areas of research on how to incorporate the automatic weighing of human and machine expertise into human-AI pairing setups, which need to react dynamically to nonstationary expertise distribution between the human and AI players (e.g. when the experts need to be replaced, or the task at hand changes rapidly). Along with this paper, we offer the eye tracking data used in this study collected from 6 experts and 53 non-experts solving the iris PAD visual task.
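For readers unfamiliar with the AUROC figures quoted above, the metric equals the probability that a randomly chosen positive (here, expert) sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pairwise comparison; the function name and the toy scores are illustrative assumptions and are unrelated to the paper's data.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U statistic).

    labels: 1 for the positive class (e.g. expert), 0 for the negative class.
    Quadratic in the number of samples, which is fine for a small illustration.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.3, 0.1]   # hypothetical classifier outputs
    toy_labels = [1, 0, 1, 0]
    print(auroc(toy_scores, toy_labels))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which puts the reported 0.751 and 0.8306 in context.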
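[CV-235] Hestia: Hierarchical Next-Best-View Exploration for Systematic Intelligent Autonomous Data Collection
[Quick Read]: This paper addresses the largely manual, slow, and labor-intensive data collection process behind 3D reconstruction and novel view synthesis. The core solution is Hestia (Hierarchical Next-Best-View Exploration for Systematic Intelligent Autonomous Data Collection), a reinforcement learning method that learns a generalizable policy for 5-DoF next-best viewpoint prediction by systematically defining the task components: dataset choice, observation design, action space, reward calculation, and learning schemes. The key innovation is integrating and validating the planner in a real-world setup in which a drone serves as a mobile sensor. Experiments in the NVIDIA IsaacLab environment show robust performance across three datasets and translated object settings, and the system proves feasible for real-world deployment.
Link: https://arxiv.org/abs/2508.01014
Authors: Cheng-You Lu, Zhuoli Zhuang, Nguyen Thanh Trung Le, Da Xiao, Yu-Cheng Chang, Thomas Do, Srinath Sridhar, Chin-teng Lin
Affiliations: University of Technology Sydney; Brown University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: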
Abstract:Advances in 3D reconstruction and novel view synthesis have enabled efficient, photorealistic rendering, but the data collection process remains largely manual, making it time-consuming and labor-intensive. To address the challenges, this study introduces Hierarchical Next-Best-View Exploration for Systematic Intelligent Autonomous Data Collection (Hestia), which leverages reinforcement learning to learn a generalizable policy for 5-DoF next-best viewpoint prediction. Unlike prior approaches, Hestia systematically defines the next-best-view task by proposing core components such as dataset choice, observation design, action space, reward calculation, and learning schemes, forming a foundation for the planner. Hestia goes beyond prior next-best-view approaches and traditional capture systems through integration and validation in a real-world setup, where a drone serves as a mobile sensor for active scene exploration. Experimental results show that Hestia performs robustly across three datasets and translated object settings in the NVIDIA IsaacLab environment, and proves feasible for real-world deployment.
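[CV-236] ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation ICCV2025
[Quick Read]: This paper addresses the lack of a high-quality, large-scale, open-vocabulary synthetic dataset for instance-grounded text-to-image generation. Existing datasets are limited in category coverage, image quality and resolution, and the accuracy of instance-level annotations, which makes them a poor basis for training high-precision generative models. The key to the solution is a new strategy called re-captioning that focuses on the pre-detection stage: a Vision-Language Model (VLM) first generates comprehensive visual descriptions, and a Large Language Model (LLM) then extracts from them a flat list of potential categories for Open-Vocabulary Detectors (OVDs) to detect. This ties the global prompt closely to the instance annotations and captures secondary visual elements that humans typically overlook, yielding the synthetic dataset ROVI.
Link: https://arxiv.org/abs/2508.01008
Authors: Cihang Peng, Qiming Hou, Zhong Ren, Kun Zhou
Affiliations: State Key Lab of CAD&CG, Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025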
Abstract:We present ROVI, a high-quality synthetic dataset for instance-grounded text-to-image generation, created by labeling 1M curated web images. Our key innovation is a strategy called re-captioning, focusing on the pre-detection stage, where a VLM (Vision-Language Model) generates comprehensive visual descriptions that are then processed by an LLM (Large Language Model) to extract a flat list of potential categories for OVDs (Open-Vocabulary Detectors) to detect. This approach yields a global prompt inherently linked to instance annotations while capturing secondary visual elements humans typically overlook. Evaluations show that ROVI exceeds existing detection datasets in image quality and resolution while containing two orders of magnitude more categories with an open-vocabulary nature. For demonstrative purposes, a text-to-image model GLIGEN trained on ROVI significantly outperforms state-of-the-art alternatives in instance grounding accuracy, prompt fidelity, and aesthetic quality. Our dataset and reproducible pipeline are available at this https URL.
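[CV-237] ThermoCycleNet: Stereo-based Thermogram Labeling for Model Transition to Cycling
[Quick Read]: This paper addresses how to transfer an automatic labeling method based on stereo vision and multimodal information from treadmill running to ergometer cycling, in order to automate the infrared thermography analysis of thermal radiation in specific anatomical regions (such as the calves) in sports medicine. The key to the solution is training a semantic segmentation network on different dataset combinations and fine-tuning it: a small fraction of manually annotated data is enough to improve the overall performance of the deep neural network, and combining automatically generated labels with small manually annotated datasets accelerates adaptation to new use cases such as the transition from running to cycling.
Link: https://arxiv.org/abs/2508.00974
Authors: Daniel Andrés López, Vincent Weber, Severin Zentgraf, Barlo Hillen, Perikles Simon, Elmar Schömer
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Presented at IWANN 2025 18th International Work-Conference on Artificial Neural Networks, A Coruña, Spain, 16-18 June, 2025. Book of abstracts: ISBN: 979-13-8752213-1. Funding: Johannes Gutenberg University "Stufe I": "Start ThermoCycleNet". Partial funding: Carl-Zeiss-Stiftung: "Multi-dimensionAI" (CZS-Project number: P2022-08-010)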
Abstract:Infrared thermography is emerging as a powerful tool in sports medicine, allowing assessment of thermal radiation during exercise and analysis of anatomical regions of interest, such as the well-exposed calves. Building on our previous advanced automatic annotation method, we aimed to transfer the stereo- and multimodal-based labeling approach from treadmill running to ergometer cycling. Therefore, the training of the semantic segmentation network with automatic labels and fine-tuning on high-quality manually annotated images has been examined and compared in different data set combinations. The results indicate that fine-tuning with a small fraction of manual data is sufficient to improve the overall performance of the deep neural network. Finally, combining automatically generated labels with small manually annotated data sets accelerates the adaptation of deep neural networks to new use cases, such as the transition from treadmill to bicycle.
[CV-238] Optimizing Vision-Language Consistency via Cross-Layer Regional Attention Alignment
[Quick Read]: This paper addresses the mismatched attention and suboptimal performance that arise in Vision Language Models (VLMs) when diverse attention mechanisms are poorly coordinated during cross-modal embedding learning. The key to the solution is a new architecture, Consistent Cross-layer Regional Alignment (CCRA), with two components. Layer-Patch-wise Cross Attention (LPWCA) captures fine-grained regional-semantic correlations by jointly weighting patch-wise and layer-wise embeddings. Progressive Attention Integration (PAI) coordinates LPWCA, layer-wise, and patch-wise attention in sequence, ensuring consistency from the semantic to the regional level, preventing attention drift, and preserving the benefit of each attention module. On ten diverse vision-language benchmarks the method outperforms all baselines with only 3.55M additional parameters and produces attention patterns that are more regionally focused and semantically aligned.
Link: https://arxiv.org/abs/2508.00945
Authors: Yifan Wang, Hongfeng Ai, Quangao Liu, Maowei Jiang, Ruiyuan Kang, Ruiqi Li, Jiahua Dong, Mengting Xiao, Cheng Jiang, Chenzhong Li
Affiliations: The Chinese University of Hong Kong, Shenzhen; Chinese Academy of Sciences; Tsinghua University; Technology Innovation Institute; University of the Chinese Academy of Sciences; Mohamed bin Zayed University of Artificial Intelligence; McGill University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages
Abstract:Vision Language Models (VLMs) face challenges in effectively coordinating diverse attention mechanisms for cross-modal embedding learning, leading to mismatched attention and suboptimal performance. We propose Consistent Cross-layer Regional Alignment (CCRA), which introduces Layer-Patch-wise Cross Attention (LPWCA) to capture fine-grained regional-semantic correlations by jointly weighting patch and layer-wise embedding, and Progressive Attention Integration (PAI) that systematically coordinates LPWCA, layer-wise, and patch-wise attention mechanisms in sequence. This progressive design ensures consistency from semantic to regional levels while preventing attention drift and maximizing individual attention benefits. Experimental results on ten diverse vision-language benchmarks demonstrate that our CCRA-enhanced LLaVA-v1.5-7B model achieves state-of-the-art performance, outperforming all baseline methods with only 3.55M additional parameters, while providing enhanced interpretability through more regionally focused and semantically aligned attention patterns.
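[CV-239] Latent Diffusion Based Face Enhancement under Degraded Conditions for Forensic Face Recognition
[Quick Read]: This paper addresses the severe performance drop of face recognition systems on low-quality forensic imagery. The key to the solution is latent diffusion-based image enhancement, implemented with the Flux.1 Kontext Dev pipeline and Facezoom LoRA adaptation, and tested on 3,000 individuals from LFW with 24,000 recognition attempts across seven degradation categories, including compression artefacts, blur, and noise contamination. The method raises overall recognition accuracy from 29.1% to 84.5%, an improvement of 55.4 percentage points, with statistically and practically significant gains across all degradation types.
Link: https://arxiv.org/abs/2508.00941
Authors: Hassan Ugail, Hamad Mansour Alawar, AbdulNasser Abbas Zehi, Ahmed Mohammad Alkendi, Ismail Lujain Jaleel
Affiliations: University of Bradford; Dubai Police Headquarters; theCircle Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: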
Abstract:Face recognition systems experience severe performance degradation when processing low-quality forensic evidence imagery. This paper presents an evaluation of latent diffusion-based enhancement for improving face recognition under forensically relevant degradations. Using a dataset of 3,000 individuals from LFW with 24,000 recognition attempts, we implement the Flux.1 Kontext Dev pipeline with Facezoom LoRA adaptation to test against seven degradation categories, including compression artefacts, blur effects, and noise contamination. Our approach demonstrates substantial improvements, increasing overall recognition accuracy from 29.1% to 84.5% (55.4 percentage point improvement, 95% CI: [54.1, 56.7]). Statistical analysis reveals significant performance gains across all degradation types, with effect sizes exceeding conventional thresholds for practical significance. These findings establish the potential of sophisticated diffusion based enhancement in forensic face recognition applications.
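[CV-240] A Survey on Deep Multi-Task Learning in Connected Autonomous Vehicles
[Quick Read]: This paper addresses the high deployment cost, heavy computational overhead, and difficulty of reaching real-time performance that Connected Autonomous Vehicles (CAVs) face when performing many tasks at once in complex environments (object detection, semantic segmentation, depth estimation, trajectory prediction, motion prediction, and behaviour prediction). Traditional approaches handle each task with a separate model, which wastes resources and limits system efficiency. The key to the solution is the Multi-task Learning (MTL) framework, which jointly learns multiple related tasks within a single unified model, using compute more efficiently and improving the overall performance and scalability of CAV systems.
Link: https://arxiv.org/abs/2508.00917
Authors: Jiayuan Wang, Farhad Pourpanah, Q. M. Jonathan Wu, Ning Zhang
Affiliations: University of Windsor; Queen’s University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2303.01788, arXiv:2304.01168 by other authors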
Abstract:Connected autonomous vehicles (CAVs) must simultaneously perform multiple tasks, such as object detection, semantic segmentation, depth estimation, trajectory prediction, motion prediction, and behaviour prediction, to ensure safe and reliable navigation in complex environments. Vehicle-to-everything (V2X) communication enables cooperative driving among CAVs, thereby mitigating the limitations of individual sensors, reducing occlusions, and improving perception over long distances. Traditionally, these tasks are addressed using distinct models, which leads to high deployment costs, increased computational overhead, and challenges in achieving real-time performance. Multi-task learning (MTL) has recently emerged as a promising solution that enables the joint learning of multiple tasks within a single unified model. This offers improved efficiency and resource utilization. To the best of our knowledge, this survey is the first comprehensive review focused on MTL in the context of CAVs. We begin with an overview of CAVs and MTL to provide foundational background. We then explore the application of MTL across key functional modules, including perception, prediction, planning, control, and multi-agent collaboration. Finally, we discuss the strengths and limitations of existing methods, identify key research gaps, and provide directions for future research aimed at advancing MTL methodologies for CAV systems.
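[CV-241] TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras ICCV
【Quick Read】: This paper addresses the neglect of long-term temporal information in self-supervised pre-training for event cameras. Existing methods largely borrow pre-training paradigms from RGB images and model raw events only within a short time window, so they fail to exploit the rich spatio-temporal dynamics in event data, which limits downstream performance. The key to the solution is the TESPEC framework, the first to use long event sequences during pre-training. It adopts a reconstruction objective based on pseudo grayscale video: events are accumulated into pseudo grayscale videos that carry high-level semantic information, are robust to sensor noise, and reduce motion blur, so the model must reason about the long-term event history to reconstruct the target. This design significantly improves recurrent models on downstream tasks including object detection, semantic segmentation, and monocular depth estimation.
Link: https://arxiv.org/abs/2508.00913
Authors: Mohammad Mohammadi, Ziyi Wu, Igor Gilitschenski
Affiliations: University of Toronto; Vector Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at IEEE/CVF International Conference on Computer Vision (ICCV) 2025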
Abstract:Long-term temporal information is crucial for event-based perception tasks, as raw events only encode pixel brightness changes. Recent works show that when trained from scratch, recurrent models achieve better results than feedforward models in these tasks. However, when leveraging self-supervised pre-trained weights, feedforward models can outperform their recurrent counterparts. Current self-supervised learning (SSL) methods for event-based pre-training largely mimic RGB image-based approaches. They pre-train feedforward models on raw events within a short time interval, ignoring the temporal information of events. In this work, we introduce TESPEC, a self-supervised pre-training framework tailored for learning spatio-temporal information. TESPEC is well-suited for recurrent models, as it is the first framework to leverage long event sequences during pre-training. TESPEC employs the masked image modeling paradigm with a new reconstruction target. We design a novel method to accumulate events into pseudo grayscale videos containing high-level semantic information about the underlying scene, which is robust to sensor noise and reduces motion blur. Reconstructing this target thus requires the model to reason about long-term history of events. Extensive experiments demonstrate our state-of-the-art results in downstream tasks, including object detection, semantic segmentation, and monocular depth estimation. Project webpage: this https URL.
[CV-242] Sparse 3D Perception for Rose Harvesting Robots: A Two-Stage Approach Bridging Simulation and Real-World Applications
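【Quick Read】: This paper addresses the scalability bottleneck caused by labor-intensive harvesting of medicinal plants such as Damask roses. The core of the solution is a new 3D perception pipeline for flower-harvesting robots that performs sparse 3D localization with a two-stage algorithm: 2D point-based detection on stereo images, followed by depth estimation with a lightweight deep neural network. To compensate for scarce labeled real-world data, the authors introduce a photorealistic synthetic dataset generated with Blender that simulates a dynamic rose farm and provides precise 3D annotations, reducing manual labeling cost while supporting robust training. The method reaches an F1 score of 95.6% (synthetic) and 74.4% (real) in 2D detection and a depth estimation error of 3% at a 2-meter range on synthetic data, and the pipeline is optimized for computational efficiency so that it can run on resource-constrained robotic systems.
Link: https://arxiv.org/abs/2508.00900
Authors: Taha Samavati, Mohsen Soryani, Sina Mansouri
Affiliations: Islamic Azad University, Science and Research Branch; Iran University of Science and Technology
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: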
Abstract:The global demand for medicinal plants, such as Damask roses, has surged with population growth, yet labor-intensive harvesting remains a bottleneck for scalability. To address this, we propose a novel 3D perception pipeline tailored for flower-harvesting robots, focusing on sparse 3D localization of rose centers. Our two-stage algorithm first performs 2D point-based detection on stereo images, followed by depth estimation using a lightweight deep neural network. To overcome the challenge of scarce real-world labeled data, we introduce a photorealistic synthetic dataset generated via Blender, simulating a dynamic rose farm environment with precise 3D annotations. This approach minimizes manual labeling costs while enabling robust model training. We evaluate two depth estimation paradigms: a traditional triangulation-based method and our proposed deep learning framework. Results demonstrate the superiority of our method, achieving an F1 score of 95.6% (synthetic) and 74.4% (real) in 2D detection, with a depth estimation error of 3% at a 2-meter range on synthetic data. The pipeline is optimized for computational efficiency, ensuring compatibility with resource-constrained robotic systems. By bridging the domain gap between synthetic and real-world data, this work advances agricultural automation for specialty crops, offering a scalable solution for precision harvesting.
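Illustrative sketch (not from the paper): the abstract compares a learned depth estimator against a traditional triangulation-based method. For a rectified stereo pair, the textbook triangulation relation is Z = f * B / d. The camera values below (focal length, baseline, disparity) are hypothetical and chosen only so that the example lands at the 2-meter range quoted in the abstract.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth of a point seen in a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity_px


def relative_depth_error(z_estimated: float, z_true: float) -> float:
    """Relative error |Z_est - Z_true| / Z_true, the form in which a '3% error' is usually read."""
    return abs(z_estimated - z_true) / z_true


# Hypothetical camera: 800 px focal length, 10 cm baseline, 40 px disparity.
z = depth_from_disparity(focal_px=800.0, baseline_m=0.1, disparity_px=40.0)
# A 3% relative error at a true depth of 2 m corresponds to about 6 cm.
err = relative_depth_error(2.06, 2.0)
```

Because depth is inversely proportional to disparity, a fixed disparity error costs more depth accuracy at long range, which is one common motivation for learning-based alternatives.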
[CV-243] Benefits of Feature Extraction and Temporal Sequence Analysis for Video Frame Prediction: An Evaluation of Hybrid Deep Learning Models
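【Quick Read】: This paper addresses video frame prediction, a challenging computer vision problem whose accuracy matters for applications such as weather forecasting and autonomous systems and for improving video compression and streaming. The key to the solution is to propose and evaluate several hybrid deep learning architectures that combine the feature extraction of autoencoders with temporal sequence models such as recurrent neural networks (RNNs), 3D convolutional neural networks (3D CNNs), and ConvLSTMs. Hybrid models based on 3D CNNs and ConvLSTMs perform best, with SSIM rising from 0.69 to 0.82, and real-world grayscale videos prove the easiest to predict.
Link: https://arxiv.org/abs/2508.00898
Authors: Jose M. Sánchez Velázquez, Mingbo Cai, Andrew Coney, Álvaro J. García-Tejedor, Alberto Nogales
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 2 Figures, 12 Tables, 21 pages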
Abstract:In recent years, advances in Artificial Intelligence have significantly impacted computer science, particularly in the field of computer vision, enabling solutions to complex problems such as video frame prediction. Video frame prediction has critical applications in weather forecasting or autonomous systems and can provide technical improvements, such as video compression and streaming. Among Artificial Intelligence methods, Deep Learning has emerged as highly effective for solving vision-related tasks, although current frame prediction models still have room for enhancement. This paper evaluates several hybrid deep learning approaches that combine the feature extraction capabilities of autoencoders with temporal sequence modelling using Recurrent Neural Networks (RNNs), 3D Convolutional Neural Networks (3D CNNs), and related architectures. The proposed solutions were rigorously evaluated on three datasets that differ in terms of synthetic versus real-world scenarios and grayscale versus color imagery. Results demonstrate that the approaches perform well, with SSIM metrics increasing from 0.69 to 0.82, indicating that hybrid models utilizing 3DCNNs and ConvLSTMs are the most effective, and greyscale videos with real data are the easiest to predict.
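Illustrative sketch (not from the paper): the results above are reported in SSIM. The snippet below computes a simplified, single-window ("global") SSIM with NumPy so the formula is visible. It is an assumption-laden simplification: the standard metric averages the same expression over local sliding windows, so values from this function will generally differ from library implementations and from the numbers in the abstract.

```python
import numpy as np


def global_ssim(x, y, data_range: float = 1.0) -> float:
    """Single-window SSIM over the whole image (simplified; no sliding window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Stabilizing constants with the conventional K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


# A frame compared with itself scores 1; an inverted frame scores below 0.
frame = np.linspace(0.0, 1.0, 64).reshape(8, 8)
same = global_ssim(frame, frame)
inverted = global_ssim(frame, 1.0 - frame)
```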
[CV-244] Phase-fraction guided denoising diffusion model for augmenting multiphase steel microstructure segmentation via micrograph image-mask pair synthesis
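【Quick Read】: This paper addresses the limited performance of machine learning models for metallographic microstructure segmentation caused by the lack of human-annotated phase masks, particularly for rare or compositionally complex morphologies in additively manufactured multiphase steel. The key to the solution is PF-DiffSeg, a phase-fraction controlled, one-stage denoising diffusion framework that jointly synthesizes micrographs and their segmentation masks in a single generative trajectory. Conditioning on global phase-fraction vectors strengthens the representation of minority classes and improves data diversity and training efficiency. On the MetalDAM benchmark the method notably improves segmentation accuracy, especially for minority classes, outperforms standard augmentation strategies as well as two-stage mask-guided diffusion and generative adversarial network (GAN) baselines, and reduces inference time, integrating generation and conditioning into a single scalable data augmentation approach.
Link: https://arxiv.org/abs/2508.00896
Authors: Hoang Hai Nam Nguyen, Minh Tien Tran, Hoheok Kim, Ho Won Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Image and Video Processing (eess.IV)
Comments: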
Abstract:The effectiveness of machine learning in metallographic microstructure segmentation is often constrained by the lack of human-annotated phase masks, particularly for rare or compositionally complex morphologies within the metal alloy. We introduce PF-DiffSeg, a phase-fraction controlled, one-stage denoising diffusion framework that jointly synthesizes microstructure images and their corresponding segmentation masks in a single generative trajectory to further improve segmentation accuracy. By conditioning on global phase-fraction vectors, augmented to represent real data distribution and emphasize minority classes, our model generates compositionally valid and structurally coherent microstructure image and mask samples that improve both data diversity and training efficiency. Evaluated on the MetalDAM benchmark for additively manufactured multiphase steel, our synthetic augmentation method yields notable improvements in segmentation accuracy compared to standard augmentation strategies especially in minority classes and further outperforms a two-stage mask-guided diffusion and generative adversarial network (GAN) baselines, while also reducing inference time compared to conventional approach. The method integrates generation and conditioning into a unified framework, offering a scalable solution for data augmentation in metallographic applications.
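Illustrative sketch (not from the paper): the model above is conditioned on a global phase-fraction vector. One natural reading, assumed here, is the share of pixels that each phase occupies in a segmentation mask. The function name and the toy mask are hypothetical; the paper's exact construction and augmentation of these vectors is not given in the abstract.

```python
import numpy as np


def phase_fractions(mask: np.ndarray, num_classes: int) -> np.ndarray:
    """Fraction of pixels per phase label in an integer mask (labels 0..num_classes-1)."""
    counts = np.bincount(mask.ravel(), minlength=num_classes)
    return counts / counts.sum()


# Toy 2x3 mask with three phases; phase 2 is the minority class.
toy_mask = np.array([[0, 0, 1],
                     [2, 1, 0]])
fractions = phase_fractions(toy_mask, num_classes=3)
```

A vector like this sums to 1, so raising the entry of a minority phase while renormalizing is one simple way to request samples that emphasize rare classes.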
[CV-245] HoneyImage: Verifiable Harmless and Stealthy Dataset Ownership Verification for Image Models
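【Quick Read】: This paper addresses dataset ownership verification for images: when data is used to train third-party image recognition models, owners have no reliable way to confirm whether their sensitive or proprietary images were used without authorization. Existing approaches such as backdoor watermarking and membership inference trade verification effectiveness against data integrity. The key to the solution is HoneyImage, which selectively modifies a small number of hard samples to embed imperceptible yet verifiable traces, achieving strong verification accuracy with minimal impact on downstream performance while preserving dataset integrity.
Link: https://arxiv.org/abs/2508.00892
Authors: Zhihao Zhu, Jiale Han, Yi Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: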
Abstract:Image-based AI models are increasingly deployed across a wide range of domains, including healthcare, security, and consumer applications. However, many image datasets carry sensitive or proprietary content, raising critical concerns about unauthorized data usage. Data owners therefore need reliable mechanisms to verify whether their proprietary data has been misused to train third-party models. Existing solutions, such as backdoor watermarking and membership inference, face inherent trade-offs between verification effectiveness and preservation of data integrity. In this work, we propose HoneyImage, a novel method for dataset ownership verification in image recognition models. HoneyImage selectively modifies a small number of hard samples to embed imperceptible yet verifiable traces, enabling reliable ownership verification while maintaining dataset integrity. Extensive experiments across four benchmark datasets and multiple model architectures show that HoneyImage consistently achieves strong verification accuracy with minimal impact on downstream performance while maintaining imperceptible. The proposed HoneyImage method could provide data owners with a practical mechanism to protect ownership over valuable image datasets, encouraging safe sharing and unlocking the full transformative potential of data-driven AI.
[CV-246] FRAM: Frobenius-Regularized Assignment Matching with Mixed-Precision Computing
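【Quick Read】: This paper addresses the errors introduced in graph matching when the Quadratic Assignment Problem (QAP) is handled with projection-based relaxations, specifically the numerical scale sensitivity and geometric misalignment caused by inflation of the feasible region. The key to the solution is a new relaxation framework, Frobenius-regularized Linear Assignment (FRA), which reformulates the projection step as a linear assignment problem with a tunable regularization term that mitigates feasible-region inflation. The authors also design the Scaling Doubly Stochastic Normalization (SDSN) algorithm to solve FRA efficiently and build a theoretically grounded mixed-precision architecture that achieves up to 370× speedup over the CPU-FP64 counterpart with negligible loss in accuracy.
Link: https://arxiv.org/abs/2508.00887
Authors: Binrui Shen, Yuan Liang, Shengxin Zhu
Affiliations: Beijing Normal University; BNU-HKBU United International College
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: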
Abstract:Graph matching, typically formulated as a Quadratic Assignment Problem (QAP), seeks to establish node correspondences between two graphs. To address the NP-hardness of QAP, some existing methods adopt projection-based relaxations that embed the problem into the convex hull of the discrete domain. However, these relaxations inevitably enlarge the feasible set, introducing two sources of error: numerical scale sensitivity and geometric misalignment between the relaxed and original domains. To alleviate these errors, we propose a novel relaxation framework by reformulating the projection step as a Frobenius-regularized Linear Assignment (FRA) problem, where a tunable regularization term mitigates feasible region inflation. This formulation enables normalization-based operations to preserve numerical scale invariance without compromising accuracy. To efficiently solve FRA, we propose the Scaling Doubly Stochastic Normalization (SDSN) algorithm. Building on its favorable computational properties, we develop a theoretically grounded mixed-precision architecture to achieve substantial acceleration. Comprehensive CPU-based benchmarks demonstrate that FRAM consistently outperforms all baseline methods under identical precision settings. When combined with a GPU-based mixed-precision architecture, FRAM achieves up to 370X speedup over its CPU-FP64 counterpart, with negligible loss in solution accuracy.
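Illustrative sketch (not the paper's SDSN algorithm): relaxations of the assignment problem typically work with doubly stochastic matrices, whose rows and columns each sum to 1. The classic way to reach one from a score matrix is Sinkhorn-Knopp normalization, shown below as a stand-in so the idea of "doubly stochastic normalization" is concrete. The function name, iteration budget, and tolerance are assumptions.

```python
import numpy as np


def sinkhorn_normalize(scores: np.ndarray, n_iters: int = 1000, tol: float = 1e-9) -> np.ndarray:
    """Alternately normalize rows and columns until the matrix is (nearly) doubly stochastic."""
    # Exponentiate to get a strictly positive matrix; subtracting the max avoids overflow.
    m = np.exp(scores - scores.max())
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)  # rows sum to 1
        m = m / m.sum(axis=0, keepdims=True)  # columns sum to 1
        if np.abs(m.sum(axis=1) - 1.0).max() < tol:
            break
    return m


rng = np.random.default_rng(0)
soft_assignment = sinkhorn_normalize(rng.standard_normal((5, 5)))
# A hard matching can then be read off, for example with a row-wise argmax
# (a crude rounding step used here only for illustration).
hard_guess = soft_assignment.argmax(axis=1)
```

Each iteration consists only of sums and element-wise divisions, the kind of operation that maps well to GPUs and reduced precision.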
[CV-247] FairFedMed: Benchmarking Group Fairness in Federated Medical Imaging with FairLoRA
【Quick Read】: This paper addresses insufficient fairness in medical federated learning (FL): during cross-institution collaborative training, heterogeneous data can produce significant performance gaps between demographic groups, which may worsen unequal allocation of medical resources and inequitable health outcomes. Existing FL methods underperform on medical images and overlook group fairness. The key to the solution is FairLoRA, a fairness-aware FL framework based on low-rank approximation via singular value decomposition (SVD): it customizes singular value matrices for each demographic group while sharing singular vectors, targeting fairness across groups without sacrificing efficiency. Experiments on the FairFedMed dataset show that FairLoRA achieves state-of-the-art medical image classification performance and significantly improves fairness across populations.
Link: https://arxiv.org/abs/2508.00873
Authors: Minghan Li, Congcong Wen, Yu Tian, Min Shi, Yan Luo, Hao Huang, Yi Fang, Mengyu Wang
Affiliations: Harvard AI and Robotics Lab; Harvard Ophthalmology AI lab; Harvard University; Embodied AI and Robotics Lab; New York University; NYUAD Center for Artificial Intelligence and Robotics; Department of Computer Science; University of Central Florida; School of Computing and Informatics; University of Louisiana at Lafayette
Categories: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 5 figures, 8 tables
Abstract:Fairness remains a critical concern in healthcare, where unequal access to services and treatment outcomes can adversely affect patient health. While Federated Learning (FL) presents a collaborative and privacy-preserving approach to model training, ensuring fairness is challenging due to heterogeneous data across institutions, and current research primarily addresses non-medical applications. To fill this gap, we establish the first experimental benchmark for fairness in medical FL, evaluating six representative FL methods across diverse demographic attributes and imaging modalities. We introduce FairFedMed, the first medical FL dataset specifically designed to study group fairness (i.e., demographics). It comprises two parts: FairFedMed-Oph, featuring 2D fundus and 3D OCT ophthalmology samples with six demographic attributes; and FairFedMed-Chest, which simulates real cross-institutional FL using subsets of CheXpert and MIMIC-CXR. Together, they support both simulated and real-world FL across diverse medical modalities and demographic groups. Existing FL models often underperform on medical images and overlook fairness across demographic groups. To address this, we propose FairLoRA, a fairness-aware FL framework based on SVD-based low-rank approximation. It customizes singular value matrices per demographic group while sharing singular vectors, ensuring both fairness and efficiency. Experimental results on the FairFedMed dataset demonstrate that FairLoRA not only achieves state-of-the-art performance in medical image classification but also significantly improves fairness across diverse populations. Our code and dataset can be accessible via link: this https URL.
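Illustrative sketch (not the authors' implementation): the abstract describes sharing singular vectors while customizing singular values per demographic group. A minimal NumPy reading of that idea is a low-rank update U diag(sigma_g) V^T in which only sigma_g depends on the group. All names, shapes, group labels, and values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 6, 2

# Shared factors with orthonormal columns (obtained via QR), standing in for singular vectors.
U, _ = np.linalg.qr(rng.standard_normal((d_out, rank)))
V, _ = np.linalg.qr(rng.standard_normal((d_in, rank)))

# Group-specific singular values: the only parameters that differ between groups.
group_sigma = {
    "group_a": np.array([1.0, 0.5]),
    "group_b": np.array([0.9, 0.2]),
}


def group_update(group: str) -> np.ndarray:
    """Low-rank weight update U @ diag(sigma_group) @ V.T for one group."""
    return U @ np.diag(group_sigma[group]) @ V.T


def forward(x: np.ndarray, w_frozen: np.ndarray, group: str) -> np.ndarray:
    """Apply the frozen weight plus the group-specific low-rank update."""
    return (w_frozen + group_update(group)) @ x


w_frozen = rng.standard_normal((d_out, d_in))
y_a = forward(rng.standard_normal(d_in), w_frozen, "group_a")
```

In this reading each extra group adds only `rank` parameters, which is one way to interpret the efficiency claim in the abstract.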
[CV-248] Visuo-Acoustic Hand Pose and Contact Estimation
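【Quick Read】: This paper addresses accurate estimation of hand pose and hand-object contact events, which is essential for robot data collection, immersive virtual environments, and biomechanical analysis. Existing methods are limited by visual occlusion, subtle contact cues, the limits of vision-only sensing, and the lack of accessible, flexible tactile sensing. The key to the solution is VibeMesh, a lightweight wearable system that fuses vision with active acoustic sensing: a bone-conduction speaker emits structured acoustic signals, and sparse piezoelectric microphones capture changes in their propagation to infer contact-induced perturbations. A graph-based attention network processes synchronized audio spectra together with hand meshes derived from RGB-D to estimate contact and pose jointly at high spatial resolution.
Link: https://arxiv.org/abs/2508.00852
Authors: Yuemin Ma, Uksang Yoo, Yunchao Yao, Shahram Najam Syed, Luca Bondi, Jonathan Francis, Jean Oh, Jeffrey Ichnowski
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: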
Abstract:Accurately estimating hand pose and hand-object contact events is essential for robot data-collection, immersive virtual environments, and biomechanical analysis, yet remains challenging due to visual occlusion, subtle contact cues, limitations in vision-only sensing, and the lack of accessible and flexible tactile sensing. We therefore introduce VibeMesh, a novel wearable system that fuses vision with active acoustic sensing for dense, per-vertex hand contact and pose estimation. VibeMesh integrates a bone-conduction speaker and sparse piezoelectric microphones, distributed on a human hand, emitting structured acoustic signals and capturing their propagation to infer changes induced by contact. To interpret these cross-modal signals, we propose a graph-based attention network that processes synchronized audio spectra and RGB-D-derived hand meshes to predict contact with high spatial resolution. We contribute: (i) a lightweight, non-intrusive visuo-acoustic sensing platform; (ii) a cross-modal graph network for joint pose and contact inference; (iii) a dataset of synchronized RGB-D, acoustic, and ground-truth contact annotations across diverse manipulation scenarios; and (iv) empirical results showing that VibeMesh outperforms vision-only baselines in accuracy and robustness, particularly in occluded or static-contact settings.
[CV-249] Inclusive Review on Advances in Masked Human Face Recognition Technologies
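【Quick Read】: This paper addresses the drop in face recognition accuracy caused by mask occlusion, the central challenge of Masked Face Recognition (MFR). The review focuses on deep learning techniques, especially convolutional neural networks (CNNs) and Siamese networks, for improving recognition when facial features are partly hidden. It also covers data augmentation, multimodal methods, and advances in feature extraction and network design that improve generalization, together with the evaluation criteria and datasets used to drive more robust and practical MFR systems.
Link: https://arxiv.org/abs/2508.00841
Authors: Ali Haitham Abdul Amir, Zainab N. Nemer
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: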
Abstract:Masked Face Recognition (MFR) is an increasingly important area in biometric recognition technologies, especially with the widespread use of masks as a result of the COVID-19 pandemic. This development has created new challenges for facial recognition systems due to the partial concealment of basic facial features. This paper aims to provide a comprehensive review of the latest developments in the field, with a focus on deep learning techniques, especially convolutional neural networks (CNNs) and twin networks (Siamese networks), which have played a pivotal role in improving the accuracy of covering face recognition. The paper discusses the most prominent challenges, which include changes in lighting, different facial positions, partial concealment, and the impact of mask types on the performance of systems. It also reviews advanced technologies developed to overcome these challenges, including data enhancement using artificial databases and multimedia methods to improve the ability of systems to generalize. In addition, the paper highlights advance in deep network design, feature extraction techniques, evaluation criteria, and data sets used in this area. Moreover, it reviews the various applications of masked face recognition in the fields of security and medicine, highlighting the growing importance of these systems in light of recurrent health crises and increasing security threats. Finally, the paper focuses on future research trends such as developing more efficient algorithms and integrating multimedia technologies to improve the performance of recognition systems in real-world environments and expand their applications.
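[CV-250] Team PA-VCGs Solution for Competition on Understanding Chinese College Entrance Exam Papers in ICDAR25
【Quick Read】: This report addresses the challenges of dense OCR extraction and complex document layouts in image understanding of Chinese college entrance exam (Gaokao) papers. The key to the solution is to combine high-resolution image processing and a multi-image end-to-end input strategy with domain-specific post-training strategies, which improves accuracy and secured first place in the ICDAR’25 competition with an accuracy of 89.6%.
Link: https://arxiv.org/abs/2508.00834
Authors: Wei Wu, Wenjie Wang, Yang Tan, Ying Liu, Liang Diao, Lin Huang, Kaihe Xu, Wenfeng Xie, Ziling Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report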
Abstract:This report presents Team PA-VGG’s solution for the ICDAR’25 Competition on Understanding Chinese College Entrance Exam Papers. In addition to leveraging high-resolution image processing and a multi-image end-to-end input strategy to address the challenges of dense OCR extraction and complex document layouts in Gaokao papers, our approach introduces domain-specific post-training strategies. Experimental results demonstrate that our post-training approach achieves the most outstanding performance, securing first place with an accuracy rate of 89.6%.
[CV-251] Revisiting Adversarial Patch Defenses on Object Detectors: Unified Evaluation Large-Scale Dataset and New Insights
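【Quick Read】: This paper addresses the lack of a unified and comprehensive evaluation framework for defenses against adversarial patch attacks on object detectors, which leads to inconsistent and incomplete assessments of existing defenses. The key to the solution is the first patch defense benchmark, covering 2 attack goals, 13 patch attacks, 11 object detectors, and 4 diverse metrics, together with a large-scale adversarial patch dataset of 94 types of patches and 94,000 images. The analysis shows that the difficulty of defending against naturalistic patches lies in the data distribution rather than in high frequencies, that the average precision (AP) of the attacked object is a more reliable indicator of defense performance than patch detection accuracy, and that defenses with complex or stochastic models or universal patch properties are relatively robust to adaptive attacks. The work offers guidance for properly evaluating and designing patch attacks and defenses.
Link: https://arxiv.org/abs/2508.00649
Authors: Junhao Zheng, Jiahao Sun, Chenhao Lin, Zhengyu Zhao, Chen Ma, Chong Zhang, Cong Wang, Qian Wang, Chao Shen
Affiliations: Xi’an Jiaotong University; City University of Hong Kong; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: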
Abstract:Developing reliable defenses against patch attacks on object detectors has attracted increasing interest. However, we identify that existing defense evaluations lack a unified and comprehensive framework, resulting in inconsistent and incomplete assessments of current methods. To address this issue, we revisit 11 representative defenses and present the first patch defense benchmark, involving 2 attack goals, 13 patch attacks, 11 object detectors, and 4 diverse metrics. This leads to the large-scale adversarial patch dataset with 94 types of patches and 94,000 images. Our comprehensive analyses reveal new insights: (1) The difficulty in defending against naturalistic patches lies in the data distribution, rather than the commonly believed high frequencies. Our new dataset with diverse patch distributions can be used to improve existing defenses by 15.09% AP@0.5. (2) The average precision of the attacked object, rather than the commonly pursued patch detection accuracy, shows high consistency with defense performance. (3) Adaptive attacks can substantially bypass existing defenses, and defenses with complex/stochastic models or universal patch properties are relatively robust. We hope that our analyses will serve as guidance on properly evaluating patch attacks/defenses and advancing their design. Code and dataset are available at this https URL, where we will keep integrating new attacks/defenses.
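Illustrative sketch (not from the paper's benchmark code): the AP@0.5 figure quoted above depends on matching predicted boxes to ground-truth boxes at an intersection-over-union (IoU) threshold of 0.5. The helper below shows that matching criterion for axis-aligned boxes in (x1, y1, x2, y2) format. The box format and function names are assumptions, and a full AP computation would additionally rank detections by confidence and integrate the precision-recall curve.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold: float = 0.5) -> bool:
    """A detection counts as correct at AP@0.5 when its IoU with the ground truth is >= 0.5."""
    return iou(pred_box, gt_box) >= threshold


# Two 2x2 boxes that overlap in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```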
[CV-252] Zero-Shot Temporal Interaction Localization for Egocentric Videos
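【Quick Read】: This paper addresses two problems in zero-shot temporal interaction localization (ZS-TIL) for egocentric videos. First, existing methods rely on annotated action and object categories, which leads to domain bias and low deployment efficiency. Second, methods based on large vision-language models (VLMs) produce coarse-grained estimates through open-loop pipelines, which limits further gains in temporal localization accuracy. The key to the solution is EgoLoc, whose core idea is a self-adaptive sampling strategy: by combining 2D and 3D observations, it uses 3D hand velocities to sample high-quality initial guesses around likely contact and separation timestamps. EgoLoc also generates closed-loop feedback from visual and dynamic cues to refine the localization results, improving the accuracy and efficiency of localizing grasp interactions in egocentric videos.
Link: https://arxiv.org/abs/2506.03662
Authors: Erhang Zhang, Junyi Ma, Yin-Dong Zheng, Yixuan Zhou, Hesheng Wang
Affiliations: IRMV Lab, the Department of Automation, Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: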
Abstract:Locating human-object interaction (HOI) actions within video serves as the foundation for multiple downstream tasks, such as human behavior analysis and human-robot skill transfer. Current temporal action localization methods typically rely on annotated action and object categories of interactions for optimization, which leads to domain bias and low deployment efficiency. Although some recent works have achieved zero-shot temporal action localization (ZS-TAL) with large vision-language models (VLMs), their coarse-grained estimations and open-loop pipelines hinder further performance improvements for temporal interaction localization (TIL). To address these issues, we propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos. EgoLoc introduces a self-adaptive sampling strategy to generate reasonable visual prompts for VLM reasoning. By absorbing both 2D and 3D observations, it directly samples high-quality initial guesses around the possible contact/separation timestamps of HOI according to 3D hand velocities, leading to high inference accuracy and efficiency. In addition, EgoLoc generates closed-loop feedback from visual and dynamic cues to further refine the localization results. Comprehensive experiments on the publicly available dataset and our newly proposed benchmark demonstrate that EgoLoc achieves better temporal interaction localization for egocentric videos compared to state-of-the-art baselines. We will release our code and relevant data as open-source at this https URL.
[CV-253] RL-U2Net: A Dual-Branch UNet with Reinforcement Learning-Assisted Multimodal Feature Fusion for Accurate 3D Whole-Heart Segmentation
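【Quick Read】: This paper addresses three challenges in whole-heart segmentation from multi-modal medical images (CT and MRI): spatial inconsistency between modalities that hinders feature fusion, fusion strategies that are static and lack adaptability, and decoupled, inefficient feature alignment and segmentation. The key to the solution is a reinforcement-learning-enhanced dual-branch U-Net (RL-U²Net) with a new RL-XAlign module for cross-modal feature alignment. The module uses cross-modal attention to capture semantic correspondences between modalities, and a reinforcement-learning agent learns a rotation strategy that consistently aligns anatomical pose and texture features. The aligned features are then reconstructed by each modality's decoder, and an ensemble-learning-based decision module fuses the patch-level predictions, improving segmentation accuracy and robustness.
Link: https://arxiv.org/abs/2508.02557
Authors: Jierui Qu, Jianchun Zhao
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: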
Abstract:Accurate whole-heart segmentation is a critical component in the precise diagnosis and interventional planning of cardiovascular diseases. Integrating complementary information from modalities such as computed tomography (CT) and magnetic resonance imaging (MRI) can significantly enhance segmentation accuracy and robustness. However, existing multi-modal segmentation methods face several limitations: severe spatial inconsistency between modalities hinders effective feature fusion; fusion strategies are often static and lack adaptability; and the processes of feature alignment and segmentation are decoupled and inefficient. To address these challenges, we propose a dual-branch U-Net architecture enhanced by reinforcement learning for feature alignment, termed RL-U²Net, designed for precise and efficient multi-modal 3D whole-heart segmentation. The model employs a dual-branch U-shaped network to process CT and MRI patches in parallel, and introduces a novel RL-XAlign module between the encoders. The module employs a cross-modal attention mechanism to capture semantic correspondences between modalities and a reinforcement-learning agent learns an optimal rotation strategy that consistently aligns anatomical pose and texture features. The aligned features are then reconstructed through their respective decoders. Finally, an ensemble-learning-based decision module integrates the predictions from individual patches to produce the final segmentation result. Experimental results on the publicly available MM-WHS 2017 dataset demonstrate that the proposed RL-U²Net outperforms existing state-of-the-art methods, achieving Dice coefficients of 93.1% on CT and 87.0% on MRI, thereby validating the effectiveness and superiority of the proposed approach.
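Illustrative sketch (not from the paper): the results above are reported as Dice coefficients. For binary masks the coefficient is 2|A ∩ B| / (|A| + |B|). The function name and the small epsilon used to avoid division by zero on empty masks are assumptions; multi-class benchmarks typically compute this per structure and then average.

```python
import numpy as np


def dice_coefficient(pred, target, eps: float = 1e-8) -> float:
    """Dice overlap between two binary masks: 2 * |A and B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))


# One shared foreground voxel, 2 predicted and 1 true: Dice = 2 * 1 / (2 + 1) = 2/3.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```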
[CV-254] From Pixels to Pathology: Restoration Diffusion for Diagnostic-Consistent Virtual IHC
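【Quick Read】: This paper addresses two core problems in virtual staining from hematoxylin and eosin (HE) to immunohistochemistry (IHC): synthetic images are hard to evaluate fairly against spatially misaligned IHC ground truth, and tissue structure and biological variability must be preserved during translation. The key to the solution is an end-to-end framework covering both generation and evaluation. For generation, Star-Diff is a structure-aware staining restoration diffusion model that reformulates virtual staining as image restoration and combines residual and noise-based generation pathways to preserve structure while modeling biomarker variability. For evaluation, the Semantic Fidelity Score (SFS) is a metric driven by the clinical grading task that quantifies class-wise semantic degradation and remains robust under spatial misalignment and classifier uncertainty, so it better reflects the diagnostic consistency of generated images.
Link: https://arxiv.org/abs/2508.02528
Authors: Jingsong Liu, Xiaofeng Deng, Han Li, Azar Kazemi, Christian Grashei, Gesa Wilkens, Xin You, Tanja Groll, Nassir Navab, Carolin Mogler, Peter J. Schüffler
Affiliations: Technical University of Munich
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: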
Abstract:Hematoxylin and eosin (HE) staining is the clinical standard for assessing tissue morphology, but it lacks molecular-level diagnostic information. In contrast, immunohistochemistry (IHC) provides crucial insights into biomarker expression, such as HER2 status for breast cancer grading, but remains costly and time-consuming, limiting its use in time-sensitive clinical workflows. To address this gap, virtual staining from HE to IHC has emerged as a promising alternative, yet faces two core challenges: (1) Lack of fair evaluation of synthetic images against misaligned IHC ground truths, and (2) preserving structural integrity and biological variability during translation. To this end, we present an end-to-end framework encompassing both generation and evaluation in this work. We introduce Star-Diff, a structure-aware staining restoration diffusion model that reformulates virtual staining as an image restoration task. By combining residual and noise-based generation pathways, Star-Diff maintains tissue structure while modeling realistic biomarker variability. To evaluate the diagnostic consistency of the generated IHC patches, we propose the Semantic Fidelity Score (SFS), a clinical-grading-task-driven metric that quantifies class-wise semantic degradation based on biomarker classification accuracy. Unlike pixel-level metrics such as SSIM and PSNR, SFS remains robust under spatial misalignment and classifier uncertainty. Experiments on the BCI dataset demonstrate that Star-Diff achieves state-of-the-art (SOTA) performance in both visual fidelity and diagnostic relevance. With rapid inference and strong clinical alignment, it presents a practical solution for applications such as intraoperative virtual IHC synthesis.
[CV-255] Identifying actionable driver mutations in lung cancer using an efficient Asymmetric Transformer Decoder MICCAI2025
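【Quick Read】: This paper addresses the detection of actionable driver mutations in non-small cell lung cancer (NSCLC), where adoption of genetic testing is held back by limited availability and lengthy turnaround times. The study proposes a Computational Pathology (CPath) approach based on Multiple Instance Learning (MIL). Its key contribution is an Asymmetric Transformer Decoder that uses queries and key-values of different dimensions to keep the query dimensionality low, which extracts information from patch embeddings efficiently and reduces the risk of overfitting. The method also feeds tissue type directly into the model, addressing the tendency of conventional MIL methods to ignore biological relevance. It outperforms top MIL models by an average of 3%, and by over 4% on rare mutations such as ERBB2 and BRAF, moving ML-based diagnostic tools closer to clinical use.
Link: https://arxiv.org/abs/2508.02431
Authors: Biagio Brattoli, Jack Shi, Jongchan Park, Taebum Lee, Donggeun Yoo, Sergio Pereira
Affiliations: Lunit Oncology; Seoul, South Korea
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI 2025 Workshop COMPAYL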
Abstract:Identifying actionable driver mutations in non-small cell lung cancer (NSCLC) can impact treatment decisions and significantly improve patient outcomes. Despite guideline recommendations, broader adoption of genetic testing remains challenging due to limited availability and lengthy turnaround times. Machine Learning (ML) methods for Computational Pathology (CPath) offer a potential solution; however, research often focuses on only one or two common mutations, limiting the clinical value of these tools and the pool of patients who can benefit from them. This study evaluates various Multiple Instance Learning (MIL) techniques to detect six key actionable NSCLC driver mutations: ALK, BRAF, EGFR, ERBB2, KRAS, and MET ex14. Additionally, we introduce an Asymmetric Transformer Decoder model that employs queries and key-values of varying dimensions to maintain a low query dimensionality. This approach efficiently extracts information from patch embeddings and minimizes overfitting risks, proving highly adaptable to the MIL setting. Moreover, we present a method to directly utilize tissue type in the model, addressing a typical MIL limitation where either all regions or only some specific regions are analyzed, neglecting biological relevance. Our method outperforms top MIL models by an average of 3%, and over 4% when predicting rare mutations such as ERBB2 and BRAF, moving ML-based tests closer to being practical alternatives to standard genetic testing.
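Illustrative sketch (one possible reading, not the authors' architecture): the abstract says queries and key-values have different dimensions so that the query dimensionality stays low. The NumPy snippet below shows single-head cross-attention in which a few low-dimensional learned queries attend over high-dimensional patch embeddings. Keys are projected down to the query width and values keep their own width. All names, shapes, and projection matrices are hypothetical.

```python
import numpy as np


def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def asymmetric_cross_attention(queries, patches, w_k, w_v):
    """queries: (n_q, d_q); patches: (n_p, d_p); w_k: (d_p, d_q); w_v: (d_p, d_v)."""
    keys = patches @ w_k      # (n_p, d_q): keys live in the small query space
    values = patches @ w_v    # (n_p, d_v): values may use a different width
    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    attn = softmax(scores, axis=-1)  # each query distributes weight over patches
    return attn @ values, attn


rng = np.random.default_rng(0)
n_q, d_q, n_p, d_p, d_v = 4, 16, 50, 256, 32
pooled, attn = asymmetric_cross_attention(
    rng.standard_normal((n_q, d_q)),
    rng.standard_normal((n_p, d_p)),
    rng.standard_normal((d_p, d_q)) / np.sqrt(d_p),
    rng.standard_normal((d_p, d_v)) / np.sqrt(d_p),
)
```

The bag of patches is pooled into `n_q` vectors regardless of how many patches a slide contains, which is the property that makes attention-style decoders a natural fit for MIL.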
[CV-256] GR-Gaussian: Graph-Based Radiative Gaussian Splatting for Sparse-View CT Reconstruction
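[Quick Read]: This paper addresses the severe needle-like artifacts that arise when 3D Gaussian Splatting (3DGS) is used for CT reconstruction under sparse-view conditions, which stem from relying on the average gradient magnitude of points within the view. The key to the solution is the GR-Gaussian framework, with two core innovations: (1) a Denoised Point Cloud Initialization Strategy that reduces initialization errors and accelerates convergence; and (2) a Pixel-Graph-Aware Gradient Strategy that refines gradient computation using graph-based density differences, improving splitting accuracy and density representation and thereby suppressing artifacts and improving reconstruction accuracy.
Link: https://arxiv.org/abs/2508.02408
Authors: Yikuang Yuluo, Yue Ma, Kuan Shen, Tongtong Jin, Wang Liao, Yangpu Ma, Fuquan Wang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10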
Abstract:3D Gaussian Splatting (3DGS) has emerged as a promising approach for CT reconstruction. However, existing methods rely on the average gradient magnitude of points within the view, often leading to severe needle-like artifacts under sparse-view conditions. To address this challenge, we propose GR-Gaussian, a graph-based 3D Gaussian Splatting framework that suppresses needle-like artifacts and improves reconstruction accuracy under sparse-view conditions. Our framework introduces two key innovations: (1) a Denoised Point Cloud Initialization Strategy that reduces initialization errors and accelerates convergence; and (2) a Pixel-Graph-Aware Gradient Strategy that refines gradient computation using graph-based density differences, improving splitting accuracy and density representation. Experiments on X-3D and real-world datasets validate the effectiveness of GR-Gaussian, achieving PSNR improvements of 0.67 dB and 0.92 dB, and SSIM gains of 0.011 and 0.021. These results highlight the applicability of GR-Gaussian for accurate CT reconstruction under challenging sparse-view conditions.
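[CV-257] Toward a reliable PWM-based light-emitting diode visual stimulus for improved SSVEP response with minimal visual fatigue
[Quick Read]: This paper addresses two obstacles to practical use of the Steady State Visual Evoked Potential (SSVEP): eye fatigue caused by flickering visual stimuli, which prevents long-term use, and imprecise Pulse-Width Modulation (PWM), which degrades response accuracy. The key to the solution is a stimulus with very high duty cycles: custom LED hardware generates PWM signals with duty cycles from 50% to 95%. In experiments with ten subjects, higher duty cycles reduced reported visual strain at all tested frequencies, and the SSVEP showed a subject-independent peak response at a duty cycle of 85%, offering a feasible path to wider use of SSVEP in applications such as brain-computer interfaces.
Link: https://arxiv.org/abs/2508.02359
Authors: Surej Mouli, Ramaswamy Palaniappan
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: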
Abstract:Steady state visual evoked response (SSVEP) is widely used in visual-based diagnosis and in applications such as brain computer interfacing because of its high information transfer rate and the capability to activate commands through simple gaze control. However, one major impediment in using a flashing visual stimulus to obtain SSVEP is eye fatigue, which prevents continued long-term use and hence practical deployment. This, combined with the difficulty of establishing precise pulse-width modulation (PWM), which results in poorer accuracy, warrants the development of an appropriate approach to solve these issues. Various studies have suggested using high-frequency visual stimuli to reduce visual fatigue for the user, but this results in poor response performance. Here, the authors study the use of extremely high duty-cycles in the stimulus in the hope of solving these constraints. Electroencephalogram data were recorded with PWM duty-cycles of 50 to 95% generated by precise custom-made light-emitting diode hardware. The ten subjects tested reported that increasing duty-cycles caused less visual strain for all the frequency values, and the SSVEP exhibited a subject-independent peak response at a duty-cycle of 85%. This could pave the way for increased usage of SSVEP in practical applications.
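As a quick illustration of what a duty cycle means for a flickering LED stimulus, the sketch below builds an ideal PWM waveform and computes the on-time per period. The 10 Hz flicker frequency and 10 kHz sample rate are assumptions for the example only; the abstract does not state the frequencies used, and the authors' hardware is not modeled here.

```python
def pwm_samples(freq_hz, duty, sample_rate_hz, n_periods):
    """Return 0/1 samples of an ideal PWM square wave.

    duty is the fraction of each period during which the LED is on.
    """
    samples_per_period = sample_rate_hz // freq_hz
    on_samples = round(duty * samples_per_period)
    one_period = [1] * on_samples + [0] * (samples_per_period - on_samples)
    return one_period * n_periods


def on_time_ms(freq_hz, duty):
    """Milliseconds per period during which the LED is on."""
    return 1000.0 * duty / freq_hz


# Assumed example values: 10 Hz flicker at an 85% duty cycle.
wave = pwm_samples(freq_hz=10, duty=0.85, sample_rate_hz=10_000, n_periods=10)
measured_duty = sum(wave) / len(wave)
led_on_ms = on_time_ms(freq_hz=10, duty=0.85)
```

At a fixed frequency, raising the duty cycle lengthens the lit portion of each period without changing the flicker rate, which is the parameter the study varies between 50% and 95%.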
[CV-258] Tackling Ill-posedness of Reversible Image Conversion with Well-posed Invertible Network
[Quick Read]: This paper addresses the ill-posedness of Reversible Image Conversion (RIC), which arises because the forward conversion is treated as an underdetermined system. Existing methods remain ill-posed even when they use Invertible Neural Networks (INN), because randomly sampled variables introduce uncertainty and prevent a stable, reliable inverse mapping. The key to the solution is to construct an overdetermined system with a non-zero Gram determinant, which yields a well-posed approximate left inverse. On this principle the authors propose a well-posed invertible 1×1 convolution (WIC) that removes the dependence on random sampling, and they design two invertible networks, WIN-Naïve and WIN, the latter using advanced skip-connections to enhance long-term memory. Experiments show state-of-the-art performance on reversible image hiding, image rescaling, and image decolorization.
Link: https://arxiv.org/abs/2508.02111
Authors: Yuanfei Huang, Hua Huang
Affiliations: Beijing Normal University; Engineering Research Center of Intelligent Technology and Educational Application, Ministry of Education
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE Transactions
Abstract:Reversible image conversion (RIC) suffers from ill-posedness issues due to its forward conversion process being considered an underdetermined system. Despite employing invertible neural networks (INN), existing RIC methods intrinsically remain ill-posed as inevitably introducing uncertainty by incorporating randomly sampled variables. To tackle the ill-posedness dilemma, we focus on developing a reliable approximate left inverse for the underdetermined system by constructing an overdetermined system with a non-zero Gram determinant, thus ensuring a well-posed solution. Based on this principle, we propose a well-posed invertible 1×1 convolution (WIC), which eliminates the reliance on random variable sampling and enables the development of well-posed invertible networks. Furthermore, we design two innovative networks, WIN-Naïve and WIN, with the latter incorporating advanced skip-connections to enhance long-term memory. Our methods are evaluated across diverse RIC tasks, including reversible image hiding, image rescaling, and image decolorization, consistently achieving state-of-the-art performance. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to overcome the bottlenecks of existing RIC solutions and setting a new benchmark in the field. Codes are available in this https URL.
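The linear-algebra fact behind the "non-zero Gram determinant" condition can be checked on a toy example: if a tall matrix A has linearly independent columns, then det(AᵀA) ≠ 0 and (AᵀA)⁻¹Aᵀ is a left inverse that recovers x from y = Ax without any random sampling. The sketch below illustrates that principle only; it is not the paper's WIC layer, and the matrix values and helper names are assumptions.

```python
def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def left_inverse_two_columns(a):
    """Return ((A^T A)^{-1} A^T, det(A^T A)) for a tall matrix with two columns."""
    at = transpose(a)
    g = matmul(at, a)  # 2 x 2 Gram matrix
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    if abs(det) < 1e-12:
        raise ValueError("Gram determinant is zero: columns are linearly dependent")
    g_inv = [[g[1][1] / det, -g[0][1] / det],
             [-g[1][0] / det, g[0][0] / det]]
    return matmul(g_inv, at), det


# Overdetermined toy system: 3 equations, 2 unknowns (values are illustrative).
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
A_left, gram_det = left_inverse_two_columns(A)
x = [[2.0], [-3.0]]
y = matmul(A, x)           # forward conversion
x_rec = matmul(A_left, y)  # deterministic inversion
```

If the columns of `A` were linearly dependent, the Gram determinant would vanish and the function would raise, which corresponds to the ill-posed case the abstract describes.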
[CV-259] REACT-KD: Region-Aware Cross-modal Topological Knowledge Distillation for Interpretable Medical Image Classification
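[Quick Read]: This paper addresses reliable and interpretable tumor classification from clinical imaging, where the main challenges are heterogeneous modality quality, limited annotations, and the lack of structured anatomical guidance. Its core solution is REACT-KD, a Region-Aware Cross-modal Topological Knowledge Distillation framework that uses a dual-teacher design to transfer high-quality supervision into a lightweight CT-based student model. One branch captures structure-function relationships from dual-tracer PET/CT, and the other models dose-aware features from synthetically degraded low-dose CT. The two branches jointly guide the student through logits distillation for semantic alignment and region graph distillation for anatomical topology. A shared CBAM-3D module keeps attention consistent across modalities, and modality dropout during training improves robustness to partial or noisy inputs.
Link: https://arxiv.org/abs/2508.02104
Authors: Hongzhao Chen, Hexiao Ding, Yufeng Jiang, Jing Lan, Ka Chun Li, Gerald W.Y. Cheng, Sam Ng, Chi Lai Ho, Jing Cai, Liang-ting Lin, Jung Sun Yoo
Affiliations: Hong Kong Polytechnic University; Hong Kong Sanatorium and Hospital
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: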
Abstract:Reliable and interpretable tumor classification from clinical imaging remains a core challenge due to heterogeneous modality quality, limited annotations, and the lack of structured anatomical guidance. We introduce REACT-KD, a Region-Aware Cross-modal Topological Knowledge Distillation framework that transfers rich supervision from high-fidelity multi-modal sources into a lightweight CT-based student model. The framework uses a dual teacher design: one branch captures structure-function relationships using dual-tracer PET/CT, and the other models dose-aware features through synthetically degraded low-dose CT data. These branches jointly guide the student model through two complementary objectives. The first focuses on semantic alignment via logits distillation, while the second models anatomical topology using region graph distillation. A shared CBAM-3D module is employed to maintain consistent attention across modalities. To improve reliability for deployment, REACT-KD introduces modality dropout during training, allowing inference under partial or noisy inputs. The staging task for hepatocellular carcinoma (HCC) is conducted as a case study. REACT-KD achieves an average AUC of 93.4% on an internal PET/CT cohort and maintains 76.6% to 81.5% AUC across varying dose levels in external CT testing. Decision curve analysis shows that REACT-KD consistently provides the highest clinical benefit across decision thresholds, supporting its potential in real-world diagnostics. Code is available at this https URL.
[CV-260] Less is More: AMBER-AFNO – a New Benchmark for Lightweight 3D Medical Image Segmentation
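[Quick Read]: This paper addresses the large parameter counts and low training efficiency of existing models for 3D medical datacube segmentation. The key to the solution is to transfer the AMBER architecture, originally designed for multiband remote sensing images such as hyperspectral data, to medical image segmentation and to replace the multi-head self-attention mechanism with Adaptive Fourier Neural Operators (AFNO), which model global context through frequency-domain mixing. This cuts the number of trainable parameters by over 80% compared to UNETR++ while keeping the FLOPs count comparable to other state-of-the-art architectures. Accuracy, measured by Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD), is competitive with or superior to existing methods, with significant gains in training efficiency, inference speed, and memory usage.
Link: https://arxiv.org/abs/2508.01941
Authors: Andrea Dosi, Semanto Mondal, Rajib Chandra Ghosh, Massimo Brescia, Giuseppe Longo
Affiliations: University of Naples Federico II
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: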
Abstract:This work presents the results of a methodological transfer from remote sensing to healthcare, adapting AMBER – a transformer-based model originally designed for multiband images, such as hyperspectral data – to the task of 3D medical datacube segmentation. In this study, we use the AMBER architecture with Adaptive Fourier Neural Operators (AFNO) in place of the multi-head self-attention mechanism. While existing models rely on various forms of attention to capture global context, AMBER-AFNO achieves this through frequency-domain mixing, enabling a drastic reduction in model complexity. This design reduces the number of trainable parameters by over 80% compared to UNETR++, while maintaining a FLOPs count comparable to other state-of-the-art architectures. Model performance is evaluated on two benchmark 3D medical datasets – ACDC and Synapse – using standard metrics such as Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD), demonstrating that AMBER-AFNO achieves competitive or superior accuracy with significant gains in training efficiency, inference speed, and memory usage.
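The idea of capturing global context through frequency-domain mixing can be sketched in a few lines: transform a token sequence to the Fourier domain, reweight each frequency, and transform back, so that every output token depends on every input token. This is a deliberately simplified illustration, not AFNO itself or the AMBER-AFNO code; a naive DFT stands in for an FFT, and a single weight per frequency is an assumption made for brevity.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a list of numbers."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def fourier_mix(tokens, weights):
    """Mix tokens globally by reweighting their frequency components."""
    spectrum = dft(tokens)
    mixed = [w * s for w, s in zip(weights, spectrum)]
    return [v.real for v in idft(mixed)]


tokens = [1.0, 2.0, 3.0, 6.0]                       # toy one-channel sequence
unchanged = fourier_mix(tokens, [1, 1, 1, 1])       # identity weights
mean_only = fourier_mix(tokens, [1, 0, 0, 0])       # keep only the zero frequency
```

Keeping only the zero-frequency component replaces every token with the sequence mean, which shows how a single frequency-domain weight mixes information across all positions at once.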
[CV-261] Large Kernel MedNeXt for Breast Tumor Segmentation and Self-Normalizing Network for pCR Classification in Magnetic Resonance Images MICCAI2025
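[Quick Read]: This paper addresses accurate breast tumor segmentation in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) and the downstream classification of pathological complete response (pCR). The key to the solution is a large-kernel MedNeXt architecture with a two-stage training strategy that uses the UpKern algorithm to expand kernels from 3×3×3 to 5×5×5, allowing learned features to transfer stably to the larger kernels and improving segmentation. For pCR classification, a self-normalizing network (SNN) is trained on radiomic features extracted from the predicted segmentations and the first post-contrast DCE-MRI, reaching an average balanced accuracy of 57% and up to 75% in some subgroups.
Link: https://arxiv.org/abs/2508.01831
Authors: Toufiq Musah
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 2 figures, 2 tables, Accepted at MICCAI 2025 Deep-Breath Workshop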
Abstract:Accurate breast tumor segmentation in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is important for downstream tasks such as pathological complete response (pCR) assessment. In this work, we address both segmentation and pCR classification using the large-scale MAMA-MIA DCE-MRI dataset. We employ a large-kernel MedNeXt architecture with a two-stage training strategy that expands the receptive field from 3x3x3 to 5x5x5 kernels using the UpKern algorithm. This approach allows stable transfer of learned features to larger kernels, improving segmentation performance on the unseen validation set. An ensemble of large-kernel models achieved a Dice score of 0.67 and a normalized Hausdorff Distance (NormHD) of 0.24. For pCR classification, we trained a self-normalizing network (SNN) on radiomic features extracted from the predicted segmentations and first post-contrast DCE-MRI, reaching an average balanced accuracy of 57%, and up to 75% in some subgroups. Our findings highlight the benefits of combining larger receptive fields and radiomics-driven classification while motivating future work on advanced ensembling and the integration of clinical variables to further improve performance and generalization. Code: this https URL
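For readers unfamiliar with the two headline metrics, the sketch below computes a Dice score for binary masks and a balanced accuracy for binary labels using their standard definitions. The toy masks and labels are made up for illustration and have no connection to the MAMA-MIA results.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary masks (lists of 0/1)."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (both classes present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return 0.5 * (tp / positives + tn / negatives)


# Made-up example data.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
dice = dice_score(pred_mask, true_mask)

pcr_true = [1, 1, 0, 0, 0, 0]
pcr_pred = [1, 0, 0, 0, 0, 1]
bal_acc = balanced_accuracy(pcr_true, pcr_pred)
```

Balanced accuracy averages the per-class recall, so a classifier that always predicts the majority class scores 50%; this gives context for the reported 57% average.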
[CV-262] Joint Lossless Compression and Steganography for Medical Images via Large Language Models
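[Quick Read]: This paper addresses two problems of current Large Language Model (LLM)-based image compression when applied to medical images: an unsatisfactory trade-off between compression performance and efficiency, and a lack of attention to the security of the compression process, which is critical in medical scenarios. The core solution is a joint lossless compression and steganography framework. Inspired by bit plane slicing (BPS), privacy messages are embedded invisibly into medical images. An adaptive modalities decomposition strategy first partitions the image into two segments that provide global and local modalities for dual-path lossless compression, and a segmented message steganography algorithm within the local modality path secures the compression process. Combined with an anatomical priors-based low-rank adaptation (A-LoRA) fine-tuning strategy, the method is reported to be superior in compression ratio, efficiency, and security.
Link: https://arxiv.org/abs/2508.01782
Authors: Pengcheng Zheng, Xiaorong Pu, Kecheng Chen, Jiaxin Huang, Meng Yang, Bai Feng, Yazhou Ren, Jianan Jiang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: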
Abstract:Recently, large language models (LLMs) have driven promising progress in lossless image compression. However, directly adopting existing paradigms for medical images suffers from an unsatisfactory trade-off between compression performance and efficiency. Moreover, existing LLM-based compressors often overlook the security of the compression process, which is critical in modern medical scenarios. To this end, we propose a novel joint lossless compression and steganography framework. Inspired by bit plane slicing (BPS), we find it feasible to securely embed privacy messages into medical images in an invisible manner. Based on this insight, an adaptive modalities decomposition strategy is first devised to partition the entire image into two segments, providing global and local modalities for subsequent dual-path lossless compression. During this dual-path stage, we innovatively propose a segmented message steganography algorithm within the local modality path to ensure the security of the compression process. Coupled with the proposed anatomical priors-based low-rank adaptation (A-LoRA) fine-tuning strategy, extensive experimental results demonstrate the superiority of our proposed method in terms of compression ratios, efficiency, and security. The source code will be made publicly available.
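Bit plane slicing, which the abstract cites as its inspiration, splits each 8-bit pixel into eight binary planes; the lowest plane carries the least visible information, which is why it is a classic place to hide a message. The sketch below shows plane splitting, exact reassembly, and plain least-significant-bit embedding. It is a textbook illustration with made-up pixel values, not the paper's segmented message steganography algorithm.

```python
def bit_planes(pixels):
    """Split 8-bit pixel values into 8 planes; planes[k][i] is bit k of pixel i."""
    return [[(p >> k) & 1 for p in pixels] for k in range(8)]


def merge_planes(planes):
    """Reassemble pixel values from their 8 bit planes."""
    return [sum(planes[k][i] << k for k in range(8)) for i in range(len(planes[0]))]


def embed_lsb(pixels, bits):
    """Overwrite the least significant bit of the first len(bits) pixels."""
    stego = list(pixels)
    for i, b in enumerate(bits):
        stego[i] = (stego[i] & ~1) | b
    return stego


def extract_lsb(pixels, n_bits):
    """Read back the first n_bits hidden bits."""
    return [p & 1 for p in pixels[:n_bits]]


pixels = [200, 37, 142, 255, 0, 91]   # made-up 8-bit intensities
message = [1, 0, 1, 1]                # made-up secret bits
roundtrip = merge_planes(bit_planes(pixels))
stego = embed_lsb(pixels, message)
recovered = extract_lsb(stego, len(message))
max_change = max(abs(a - b) for a, b in zip(pixels, stego))
```

Each pixel changes by at most one intensity level, so the embedded message is visually negligible, while splitting and merging the planes without modification reproduces the image exactly.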
Cite as: arXiv:2508.01782 [eess.IV] (or arXiv:2508.01782v1 [eess.IV] for this version), https://doi.org/10.48550/arXiv.2508.01782
Submission history: From Meng Yang, [v1] Sun, 3 Aug 2025 14:45:51 UTC (3,143 KB)
[CV-263] LoRA-based methods on Unet for transfer learning in Subarachnoid Hematoma Segmentation
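[Quick Read]: This paper addresses medical image segmentation of aneurysmal subarachnoid hemorrhage (SAH), focusing on how to improve performance when only a small number of annotated cases is available. Unet architectures perform well on limited data, but Low-Rank Adaptation (LoRA) methods for parameter-efficient transfer learning have rarely been applied to convolutional networks in medical imaging. The key to the solution is transfer learning from a Unet pre-trained on traumatic brain injury CT scans, adapted with a novel CP-LoRA method based on tensor CP-decomposition and with DoRA variants (DoRA-C, convDoRA, CP-DoRA) that decompose weight matrices into magnitude and directional components. Comparisons across modules show that LoRA-based methods consistently outperform standard Unet fine-tuning, that CP-LoRA matches existing methods with significantly fewer parameters, and that over-parameterization with higher ranks performs better than strictly low-rank adaptation.
Link: https://arxiv.org/abs/2508.01772
Authors: Cristian Minoccheri, Matthew Hodgman, Haoyuan Ma, Rameez Merchant, Emily Wittrup, Craig Williamson, Kayvan Najarian
Affiliations: University of Michigan
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: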
Abstract:Aneurysmal subarachnoid hemorrhage (SAH) is a life-threatening neurological emergency with mortality rates exceeding 30%. Transfer learning from related hematoma types represents a potentially valuable but underexplored approach. Although Unet architectures remain the gold standard for medical image segmentation due to their effectiveness on limited datasets, Low-Rank Adaptation (LoRA) methods for parameter-efficient transfer learning have been rarely applied to convolutional neural networks in medical imaging contexts. We implemented a Unet architecture pre-trained on computed tomography scans from 124 traumatic brain injury patients across multiple institutions, then fine-tuned on 30 aneurysmal SAH patients from the University of Michigan Health System using 3-fold cross-validation. We developed a novel CP-LoRA method based on tensor CP-decomposition and introduced DoRA variants (DoRA-C, convDoRA, CP-DoRA) that decompose weight matrices into magnitude and directional components. We compared these approaches against existing LoRA methods (LoRA-C, convLoRA) and standard fine-tuning strategies across different modules on a multi-view Unet model. LoRA-based methods consistently outperformed standard Unet fine-tuning. Performance varied by hemorrhage volume, with all methods showing improved accuracy for larger volumes. CP-LoRA achieved comparable performance to existing methods while using significantly fewer parameters. Over-parameterization with higher ranks consistently yielded better performance than strictly low-rank adaptations. This study demonstrates that transfer learning between hematoma types is feasible and that LoRA-based methods significantly outperform conventional Unet fine-tuning for aneurysmal SAH segmentation.
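The difference between a LoRA update and a DoRA-style update can be shown on a small weight matrix: LoRA adds a low-rank product to the frozen weights, while DoRA renormalizes each column of the updated matrix and rescales it by a separate magnitude vector. The sketch below uses plain matrices and made-up values; the paper's convolutional and CP-decomposition variants (LoRA-C, convLoRA, CP-LoRA and their DoRA counterparts) operate on kernel tensors and are not reproduced here.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def column_norms(w):
    """Euclidean norm of each column of a matrix."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]


def lora_weight(w0, b, a, scale=1.0):
    """LoRA: W = W0 + scale * (B @ A), with B of shape d_out x r and A of shape r x d_in."""
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


def dora_weight(w0, b, a, magnitude):
    """DoRA-style: W = magnitude * V / ||V||_column, where V = W0 + B @ A."""
    v = lora_weight(w0, b, a)
    norms = column_norms(v)
    return [[magnitude[j] * row[j] / norms[j] for j in range(len(row))] for row in v]


w0 = [[3.0, 0.0], [4.0, 1.0]]     # frozen pre-trained weights (made-up values)
b = [[1.0], [0.0]]                # d_out x r with rank r = 1
a = [[0.5, 0.0]]                  # r x d_in
magnitude = column_norms(w0)      # magnitude initialized from the frozen weights
w_lora = lora_weight(w0, b, a)
w_dora = dora_weight(w0, b, a, magnitude)
```

In the DoRA-style result the low-rank update changes only the direction of each column, because the column norms are pinned to the separately trained `magnitude` vector.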
[CV-264] Measuring and Predicting Where and When Pathologists Focus their Visual Attention while Grading Whole Slide Images of Cancer
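[Quick Read]: This paper addresses the prediction of the spatio-temporal attention trajectory of expert pathologists as they read whole slide images (WSIs) of prostate cancer, with the aim of building decision support systems for pathology training. The key to the solution is a two-stage model. In the first stage, a transformer-based sub-network predicts static attention heatmaps across different magnifications. In the second stage, the attention scanpath is predicted autoregressively, fixation by fixation, starting at the WSI center and using the multi-magnification feature representations from the first stage. The method relies on a fixation extraction algorithm that simplifies attention trajectories while preserving semantic information. On data from 43 pathologists across 123 WSIs, the scanpath prediction model outperforms chance and baseline models.
Link: https://arxiv.org/abs/2508.01668
Authors: Souradeep Chakraborty, Ruoyu Xue, Rajarsi Gupta, Oksana Yaskiv, Constantin Friedman, Natallia Sheuka, Dana Perez, Paul Friedman, Won-Tak Choi, Waqas Mahmud, Beatrice Knudsen, Gregory Zelinsky, Joel Saltz, Dimitris Samaras
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to Medical Image Analysis (MEDIA), Elsevier, 2025. This is the accepted manuscript version; the final published article link will be updated when available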
Abstract:The ability to predict the attention of expert pathologists could lead to decision support systems for better pathology training. We developed methods to predict the spatio-temporal (where and when) movements of pathologists’ attention as they grade whole slide images (WSIs) of prostate cancer. We characterize a pathologist’s attention trajectory by their x, y, and m (magnification) movements of a viewport as they navigate WSIs using a digital microscope. This information was obtained from 43 pathologists across 123 WSIs, and we consider the task of predicting the pathologist attention scanpaths constructed from the viewport centers. We introduce a fixation extraction algorithm that simplifies an attention trajectory by extracting fixations in the pathologist’s viewing while preserving semantic information, and we use these pre-processed data to train and test a two-stage model to predict the dynamic (scanpath) allocation of attention during WSI reading via intermediate attention heatmap prediction. In the first stage, a transformer-based sub-network predicts the attention heatmaps (static attention) across different magnifications. In the second stage, we predict the attention scanpath by sequentially modeling the next fixation points in an autoregressive manner using a transformer-based approach, starting at the WSI center and leveraging multi-magnification feature representations from the first stage. Experimental results show that our scanpath prediction model outperforms chance and baseline models. Tools developed from this model could assist pathology trainees in learning to allocate their attention during WSI reading like an expert.
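[CV-265] Tractography-Guided Dual-Label Collaborative Learning for Multi-Modal Cranial Nerves Parcellation
[Quick Read]: This paper addresses the low performance of multi-modal Cranial Nerves (CNs) parcellation caused by insufficient use of diffusion MRI information. Existing methods fuse structural MRI with diffusion MRI but do not fully exploit the fiber tractography information that diffusion MRI provides, which limits accuracy. The key to the solution is a tractography-guided Dual-label Collaborative Learning Network (DCLNet) that introduces coarse CN labels obtained from fiber tractography through a CN atlas and learns collaboratively from them together with precise expert-annotated labels. A Modality-adaptive Encoder Module (MEM) performs soft information swapping between structural MRI and diffusion MRI to strengthen multi-modal fusion.
Link: https://arxiv.org/abs/2508.01577
Authors: Lei Xie, Junxiong Huang, Yuanjing Feng, Qingrun Zeng
Affiliations: Zhejiang University of Technology
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: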
Abstract:The parcellation of Cranial Nerves (CNs) serves as a crucial quantitative methodology for evaluating the morphological characteristics and anatomical pathways of specific CNs. Multi-modal CNs parcellation networks have achieved promising segmentation performance, which combine structural Magnetic Resonance Imaging (MRI) and diffusion MRI. However, insufficient exploration of diffusion MRI information has led to low performance of existing multi-modal fusion. In this work, we propose a tractography-guided Dual-label Collaborative Learning Network (DCLNet) for multi-modal CNs parcellation. The key contribution of our DCLNet is the introduction of coarse labels of CNs obtained from fiber tractography through CN atlas, and collaborative learning with precise labels annotated by experts. Meanwhile, we introduce a Modality-adaptive Encoder Module (MEM) to achieve soft information swapping between structural MRI and diffusion MRI. Extensive experiments conducted on the publicly available Human Connectome Project (HCP) dataset demonstrate performance improvements compared to single-label network. This systematic validation underscores the effectiveness of dual-label strategies in addressing inherent ambiguities in CNs parcellation tasks.
[CV-266] Deeply Supervised Multi-Task Autoencoder for Biological Brain Age estimation using three dimensional T_1-weighted magnetic resonance imaging
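[Quick Read]: This paper addresses accurate estimation of biological brain age from three-dimensional (3D) T_1-weighted magnetic resonance imaging (MRI), a critical imaging biomarker for identifying accelerated aging associated with neurodegenerative diseases. Deep 3D models are hard to optimize because of problems such as vanishing gradients, and brain structure differs significantly between sexes, which affects aging trajectories and disease vulnerability, so models trained on the age task alone struggle to be both accurate and generalizable. The key to the solution is the Deeply Supervised Multitask Autoencoder (DSMT-AE) framework: deep supervision applies supervisory signals at intermediate layers during training to stabilize optimization, and sex classification and image reconstruction serve as auxiliary tasks that improve feature representation, capturing anatomical and demographic variability and improving the accuracy and robustness of brain age prediction.
Link: https://arxiv.org/abs/2508.01565
Authors: Mehreen Kanwal, Yunsik Son
Affiliations: Dongguk University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: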
Abstract:Accurate estimation of biological brain age from three dimensional (3D) T_1-weighted magnetic resonance imaging (MRI) is a critical imaging biomarker for identifying accelerated aging associated with neurodegenerative diseases. Effective brain age prediction necessitates training 3D models to leverage comprehensive insights from volumetric MRI scans, thereby fully capturing spatial anatomical context. However, optimizing deep 3D models remains challenging due to problems such as vanishing gradients. Furthermore, brain structural patterns differ significantly between sexes, which impacts aging trajectories and vulnerability to neurodegenerative diseases, thereby making sex classification crucial for enhancing the accuracy and generalizability of predictive models. To address these challenges, we propose a Deeply Supervised Multitask Autoencoder (DSMT-AE) framework for brain age estimation. DSMT-AE employs deep supervision, which involves applying supervisory signals at intermediate layers during training, to stabilize model optimization, and multitask learning to enhance feature representation. Specifically, our framework simultaneously optimizes brain age prediction alongside auxiliary tasks of sex classification and image reconstruction, thus effectively capturing anatomical and demographic variability to improve prediction accuracy. We extensively evaluate DSMT-AE on the Open Brain Health Benchmark (OpenBHB) dataset, the largest multisite neuroimaging cohort combining ten publicly available datasets. The results demonstrate that DSMT-AE achieves state-of-the-art performance and robustness across age and sex subgroups. Additionally, our ablation study confirms that each proposed component substantially contributes to the improved predictive accuracy and robustness of the overall architecture.
[CV-267] A Large-Scale Benchmark of Cross-Modal Learning for Histology and Gene Expression in Spatial Transcriptomics
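[Quick Read]: This paper addresses the lack of comprehensive benchmarks in spatial transcriptomics for multimodal learning methods that combine histology images and gene expression data, in particular for cross-modal contrastive pretraining. The key to the solution is HESCAPE, a large-scale public benchmark built on a curated pan-organ dataset spanning 6 gene panels and 54 donors. It systematically evaluates state-of-the-art image and gene expression encoders under multiple pretraining strategies and finds that gene expression encoders are the primary determinant of strong representational alignment. Contrastive pretraining improves gene mutation classification but degrades direct gene expression prediction, and batch effects are identified as a key factor that interferes with cross-modal alignment. The paper therefore stresses the need for batch-robust multimodal learning methods in this field.
Link: https://arxiv.org/abs/2508.01490
Authors: Rushin H. Gindra, Giovanni Palla, Mathias Nguyen, Sophia J. Wagner, Manuel Tran, Fabian J Theis, Dieter Saur, Lorin Crawford, Tingying Peng
Affiliations: Computational Health Center, Helmholtz Munich; Technical University Munich; Microsoft Research
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Tissues and Organs (q-bio.TO); Applications (stat.AP)
Comments: The code is accessible at: this https URL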
Abstract:Spatial transcriptomics enables simultaneous measurement of gene expression and tissue morphology, offering unprecedented insights into cellular organization and disease mechanisms. However, the field lacks comprehensive benchmarks for evaluating multimodal learning methods that leverage both histology images and gene expression data. Here, we present HESCAPE, a large-scale benchmark for cross-modal contrastive pretraining in spatial transcriptomics, built on a curated pan-organ dataset spanning 6 different gene panels and 54 donors. We systematically evaluated state-of-the-art image and gene expression encoders across multiple pretraining strategies and assessed their effectiveness on two downstream tasks: gene mutation classification and gene expression prediction. Our benchmark demonstrates that gene expression encoders are the primary determinant of strong representational alignment, and that gene models pretrained on spatial transcriptomics data outperform both those trained without spatial data and simple baseline approaches. However, downstream task evaluation reveals a striking contradiction: while contrastive pretraining consistently improves gene mutation classification performance, it degrades direct gene expression prediction compared to baseline encoders trained without cross-modal objectives. We identify batch effects as a key factor that interferes with effective cross-modal alignment. Our findings highlight the critical need for batch-robust multimodal learning approaches in spatial transcriptomics. To accelerate progress in this direction, we release HESCAPE, providing standardized datasets, evaluation protocols, and benchmarking tools for the community
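Cross-modal contrastive pretraining is commonly implemented with a symmetric, CLIP-style objective that pulls matched image and gene-expression embeddings together and pushes mismatched pairs apart. The sketch below shows that objective on toy embeddings. Whether HESCAPE uses exactly this loss is not stated in the abstract, so the loss form, the temperature of 0.07, and the embedding values are all assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i should select column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, gene_emb, temperature=0.07):
    """CLIP-style loss over paired image and gene-expression embeddings."""
    img = [l2_normalize(v) for v in image_emb]
    gen = [l2_normalize(v) for v in gene_emb]
    logits = [[sum(a * b for a, b in zip(u, g)) / temperature for g in gen] for u in img]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy paired embeddings (made-up values): spot i in one modality matches spot i in the other.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
gene_emb = [[0.9, 0.1], [0.1, 0.9]]
loss_aligned = symmetric_contrastive_loss(image_emb, gene_emb)
loss_shuffled = symmetric_contrastive_loss(image_emb, gene_emb[::-1])
```

The loss is near zero when pairs are aligned and large when the pairing is shuffled. A batch effect that shifts all embeddings from one donor in the same direction could make spots from that donor look alike, which is one way such effects might interfere with alignment.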
[CV-268] Predicting EGFR Mutation in LUAD from Histopathological Whole-Slide Images Using Pretrained Foundation Model and Transfer Learning: An Indian Cohort Study
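[Quick Read]: This paper addresses how to predict Epidermal Growth Factor Receptor (EGFR) mutation status in lung adenocarcinoma (LUAD), a subtype of non-small cell lung cancer, from routine Hematoxylin and Eosin (H&E) stained whole slide images (WSIs), to support clinical decision making, particularly in resource-limited settings. The key to the solution is a deep learning framework that combines a Vision Transformer (ViT)-based pathology foundation model with an Attention-Based Multiple Instance Learning (ABMIL) architecture. The framework trains efficiently on small datasets and performs consistently on an internal test set from the Indian cohort and an external test set from TCGA (AUC 0.933 and 0.965, respectively), exceeding several prior studies and demonstrating the feasibility of predicting EGFR mutation status from pathology slides with foundation models and attention-based MIL.
Link: https://arxiv.org/abs/2508.01352
Authors: Sagar Singh Gwal, Rajan, Suyash Devgan, Shraddhanjali Satapathy, Abhishek Goyal, Nuruddin Mohammad Iqbal, Vivaan Jain, Prabhat Singh Mallik, Deepali Jain, Ishaan Gupta
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: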
Abstract:Lung adenocarcinoma (LUAD) is a subtype of non-small cell lung cancer (NSCLC). LUAD with mutation in the EGFR gene accounts for approximately 46% of LUAD cases. Patients carrying EGFR mutations can be treated with specific tyrosine kinase inhibitors (TKIs). Hence, predicting EGFR mutation status can help in clinical decision making. HE-stained whole slide imaging (WSI) is a routinely performed screening procedure for cancer staging and subtyping, especially affecting the Southeast Asian populations with significantly higher incidence of the mutation when compared to Caucasians (39-64% vs 7-22%). Recent progress in AI models has shown promising results in cancer detection and classification. In this study, we propose a deep learning (DL) framework built on vision transformers (ViT) based pathology foundation model and attention-based multiple instance learning (ABMIL) architecture to predict EGFR mutation status from HE WSI. The developed pipeline was trained using data from an Indian cohort (170 WSI) and evaluated across two independent datasets: Internal test (30 WSI from Indian cohort) set, and an external test set from TCGA (86 WSI). The model shows consistent performance across both datasets, with AUCs of 0.933 (+/-0.010), and 0.965 (+/-0.015) for the internal and external test sets respectively. This proposed framework can be efficiently trained on small datasets, achieving superior performance as compared to several prior studies irrespective of training domain. The current study demonstrates the feasibility of accurately predicting EGFR mutation status using routine pathology slides, particularly in resource-limited settings using foundation models and attention-based multiple instance learning.
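Attention-based multiple instance learning pools a bag of patch embeddings into one slide-level vector by scoring each patch, normalizing the scores with a softmax, and taking the weighted sum. The sketch below follows the standard (non-gated) formulation with scores of the form wᵀ·tanh(V·h). The parameter values and patch embeddings are made up, and the foundation-model feature extractor and classifier head from the paper are omitted.

```python
import math


def abmil_pool(instances, v_matrix, w_vector):
    """Attention pooling: a_k = softmax_k(w^T tanh(V h_k)), z = sum_k a_k h_k."""
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in v_matrix]
        scores.append(sum(w * t for w, t in zip(w_vector, hidden)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(attention, instances)) for d in range(dim)]
    return bag, attention


# Made-up bag of three 2-dimensional patch embeddings and toy attention parameters.
patch_embeddings = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
v_matrix = [[1.0, 0.0], [0.0, 1.0]]
w_vector = [1.0, 1.0]
slide_embedding, attention = abmil_pool(patch_embeddings, v_matrix, w_vector)
```

The attention weights also indicate which patches drove the slide-level representation, which is a common reason for choosing this pooling in pathology.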
[CV-269] Classification of Brain Tumors using Hybrid Deep Learning Models
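[Quick Read]: This paper addresses the heavy demands that conventional convolutional neural networks (CNNs) place on computational resources and training data in medical image classification. The key to its solution is transfer learning: pretrained models are reused to reach efficient and accurate classification from limited samples. Experiments show that the transfer-learned EfficientNetV2 performs best at classifying three types of brain tumor (glioma, meningioma, and pituitary tumor). Its training time is longer, but it improves classification accuracy markedly, supporting the effectiveness of transfer learning for intelligent medical image analysis.
Link: https://arxiv.org/abs/2508.01350
Authors: Neerav Nemchand Gala
Affiliations: Georgia Institute of Technology
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures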
Abstract:The use of Convolutional Neural Networks (CNNs) has greatly improved the interpretation of medical images. However, conventional CNNs typically demand extensive computational resources and large training datasets. To address these limitations, this study applied transfer learning to achieve strong classification performance using fewer training samples. Specifically, the study compared EfficientNetV2 with its predecessor, EfficientNet, and with ResNet50 in classifying brain tumors into three types: glioma, meningioma, and pituitary tumors. Results showed that EfficientNetV2 delivered superior performance compared to the other models. However, this improvement came at the cost of increased training time, likely due to the model’s greater complexity.
[CV-270] SWAN: Synergistic Wavelet-Attention Network for Infrared Small Target Detection
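[Quick Read]: This paper addresses the insufficient accuracy of infrared small target detection (IRSTD) against complex backgrounds, in particular the limitation that conventional convolution-based methods struggle to separate the frequency-domain characteristics of small targets from those of the background. The key to its solution is the Synergistic Wavelet-Attention Network (SWAN), which achieves multi-domain perception through three core modules: a Haar Wavelet Convolution (HWConv) that performs deep cross-domain fusion of frequency energy and spatial detail; a Shifted Spatial Attention (SSA) mechanism that models long-range spatial dependencies with linear complexity to strengthen contextual awareness; and a Residual Dual-Channel Attention (RDCA) module that adaptively calibrates channel-wise feature responses, suppressing background interference while amplifying target-related signals.
Link: https://arxiv.org/abs/2508.01322
Authors: Yuxin Jing,Jufeng Zhao,Tianpei Zhang,Yiming Zhu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: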
Abstract: Infrared small target detection (IRSTD) is critical in both civilian and military applications. This study addresses the challenge of precise IRSTD in complex backgrounds. Recent methods rely fundamentally on conventional convolution operations, which primarily capture local spatial patterns and struggle to distinguish the unique frequency-domain characteristics of small targets from intricate background clutter. To overcome these limitations, we propose the Synergistic Wavelet-Attention Network (SWAN), a novel framework designed to perceive targets in both the spatial and frequency domains. SWAN leverages a Haar Wavelet Convolution (HWConv) for a deep, cross-domain fusion of the frequency energy and spatial details of small targets. Furthermore, a Shifted Spatial Attention (SSA) mechanism efficiently models long-range spatial dependencies with linear computational complexity, enhancing contextual awareness. Finally, a Residual Dual-Channel Attention (RDCA) module adaptively calibrates channel-wise feature responses to suppress background interference while amplifying target-pertinent signals. Extensive experiments on benchmark datasets demonstrate that SWAN surpasses existing state-of-the-art methods, showing significant improvements in detection accuracy and robustness, particularly in complex, challenging scenarios.
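To make the frequency-domain idea concrete, here is a minimal sketch of one level of the 2D Haar transform, the decomposition that a Haar-wavelet convolution builds on. It is not the paper's HWConv module, whose internals the abstract does not describe; the function name `haar2d` and the toy image are hypothetical, and naming conventions for the LH and HL bands vary between libraries.

```python
def haar2d(img):
    """One level of the orthonormal 2D Haar transform.

    img: nested list with even height and width.
    Returns four half-resolution sub-bands: LL (local average) and three
    detail bands (LH, HL, HH) built from differences inside each 2x2 block.
    """
    rows, cols = len(img), len(img[0])
    LL, LH, HL, HH = [], [], [], []
    for i in range(0, rows, 2):
        ll, lh, hl, hh = [], [], [], []
        for j in range(0, cols, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            ll.append((a + b + c + d) / 2)  # low-pass in both directions
            lh.append((a - b + c - d) / 2)  # difference across columns
            hl.append((a + b - c - d) / 2)  # difference across rows
            hh.append((a - b - c + d) / 2)  # diagonal difference
        LL.append(ll)
        LH.append(lh)
        HL.append(hl)
        HH.append(hh)
    return LL, LH, HL, HH

# Flat background with one bright pixel standing in for a small target
flat = [[1.0] * 4 for _ in range(4)]
target = [row[:] for row in flat]
target[1][1] = 9.0
LL, LH, HL, HH = haar2d(target)
```

On the flat image every detail band is zero, while the single bright pixel produces non-zero detail coefficients only in its own 2x2 block, which is why small targets stand out in the high-frequency bands.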
[CV-271] CoCoLIT: ControlNet-Conditioned Latent Image Translation for MRI to Amyloid PET Synthesis
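[Quick Read]: This paper addresses the synthesis of amyloid positron emission tomography (amyloid PET) images from structural MRI, with the goal of low-cost, large-scale screening for Alzheimer's Disease (AD). Existing methods struggle with high-dimensional, structurally complex 3D neuroimaging data and have difficulty modeling the cross-modality relationship. The key to its solution is CoCoLIT (ControlNet-Conditioned Latent Image Translation), a diffusion-based latent generative framework with three core contributions: (1) a Weighted Image Space Loss (WISL) that improves latent representation learning and synthesis quality; (2) a theoretical and empirical analysis of Latent Average Stabilization (LAS), which enhances inference consistency; and (3) ControlNet as the conditioning mechanism for guided MRI-to-PET translation. Experiments show that the method significantly outperforms current state-of-the-art methods on image quality and on amyloid-positivity classification.
Link: https://arxiv.org/abs/2508.01292
Authors: Alec Sargood,Lemuel Puglisi,James H. Cole,Neil P. Oxtoby,Daniele Ravì,Daniel C. Alexander
Affiliations: 1. University College London; 2. University of Oxford; 3. University of Manchester
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: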
Abstract:Synthesizing amyloid PET scans from the more widely available and accessible structural MRI modality offers a promising, cost-effective approach for large-scale Alzheimer’s Disease (AD) screening. This is motivated by evidence that, while MRI does not directly detect amyloid pathology, it may nonetheless encode information correlated with amyloid deposition that can be uncovered through advanced modeling. However, the high dimensionality and structural complexity of 3D neuroimaging data pose significant challenges for existing MRI-to-PET translation methods. Modeling the cross-modality relationship in a lower-dimensional latent space can simplify the learning task and enable more effective translation. As such, we present CoCoLIT (ControlNet-Conditioned Latent Image Translation), a diffusion-based latent generative framework that incorporates three main innovations: (1) a novel Weighted Image Space Loss (WISL) that improves latent representation learning and synthesis quality; (2) a theoretical and empirical analysis of Latent Average Stabilization (LAS), an existing technique used in similar generative models to enhance inference consistency; and (3) the introduction of ControlNet-based conditioning for MRI-to-PET translation. We evaluate CoCoLIT’s performance on publicly available datasets and find that our model significantly outperforms state-of-the-art methods on both image-based and amyloid-related metrics. Notably, in amyloid-positivity classification, CoCoLIT outperforms the second-best method with improvements of +10.5% on the internal dataset and +23.7% on the external dataset. The code and models of our approach are available at this https URL.
[CV-272] Point-wise Diffusion Models for Physical Systems with Shape Variations: Application to Spatio-temporal and Large-scale system
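[Quick Read]: This paper addresses the efficiency and accuracy of real-time prediction for complex physical systems, especially spatio-temporal systems with shape variations, where conventional image-based diffusion models cannot directly handle unstructured data (such as meshes and point clouds) and are computationally expensive. The key to its solution is a point-wise diffusion model that runs the forward and backward diffusion processes independently at each spatio-temporal point and denoises with a point-wise diffusion transformer, so that data in any format can be processed directly while preserving geometric fidelity. In addition, DDIM-accelerated sampling provides efficient deterministic inference in only 5-10 steps, a 100 to 200 times speedup over the conventional 1000-step procedure. The model also needs 89% fewer parameters and 94.4% less training time, and it outperforms data-flexible surrogate models such as DeepONet and Meshgraphnet across several physical scenarios.
Link: https://arxiv.org/abs/2508.01230
Authors: Jiyong Kim,Sunwoong Yang,Namwoo Kang
Affiliations: KAIST
Categories: Computational Physics (physics.comp-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: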
Abstract: This study introduces a novel point-wise diffusion model that processes spatio-temporal points independently to efficiently predict complex physical systems with shape variations. The methodological contribution lies in applying forward and backward diffusion processes at individual spatio-temporal points, coupled with a point-wise diffusion transformer architecture for denoising. Unlike conventional image-based diffusion models that operate on structured data representations, this framework enables direct processing of any data format, including meshes and point clouds, while preserving geometric fidelity. We validate our approach across three distinct physical domains with complex geometric configurations: 2D spatio-temporal systems, including cylinder fluid flow and an OLED drop impact test, and a 3D large-scale system for road-car external aerodynamics. To justify the necessity of our point-wise approach for real-time prediction applications, we employ denoising diffusion implicit models (DDIM) for efficient deterministic sampling, requiring only 5-10 steps compared to the traditional 1000 steps and providing a computational speedup of 100 to 200 times during inference without compromising accuracy. In addition, our proposed model achieves superior performance compared to an image-based diffusion model: it reduces training time by 94.4% and requires 89.0% fewer parameters while achieving over 28% improvement in prediction accuracy. Comprehensive comparisons against data-flexible surrogate models, including DeepONet and Meshgraphnet, demonstrate consistent superiority of our approach across all three physical systems. To further refine the proposed model, we investigate two key aspects: 1) prediction of final physical states versus prediction of incremental changes, and 2) computational efficiency across varying subsampling ratios (10%-100%).
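The quoted speedup follows directly from the step counts if inference cost is assumed proportional to the number of denoiser evaluations: 1000 / 5 = 200 and 1000 / 10 = 100. A small sketch of that arithmetic, together with an evenly strided timestep schedule, is given below; the uniform stride is a common DDIM choice and an assumption here, since the abstract does not state the schedule, and both function names are hypothetical.

```python
def ddim_timesteps(train_steps, sample_steps):
    """Evenly strided subsequence of the training timesteps, highest first."""
    stride = train_steps // sample_steps
    return list(range(train_steps - 1, -1, -stride))[:sample_steps]

def step_speedup(train_steps, sample_steps):
    """Ratio of denoiser evaluations: full chain versus shortened chain."""
    return train_steps / sample_steps

schedule_5 = ddim_timesteps(1000, 5)    # five timesteps spread over the chain
schedule_10 = ddim_timesteps(1000, 10)  # ten timesteps spread over the chain
```

Measured wall-clock gains can differ from this ratio because of fixed per-call overheads.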
[CV-273] Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation
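[Quick Read]: This paper addresses the inefficiency of medical image analysis on resource-constrained mobile devices, and in particular the poor performance of existing lightweight models, which are mostly optimized for natural images and suffer from the information-density gap between natural and medical images. The key to its solution is a lightweight mobile vision transformer designed for medical image segmentation, the Mobile U-shaped Vision Transformer (Mobile U-ViT). Its core contributions are: 1) a new hierarchical patch embedding module, ConvUtr, which combines a parameter-efficient large-kernel CNN with inverted bottleneck fusion and offers transformer-like representation learning at low cost; 2) a Large-kernel Local-Global-Local (LGL) block that exchanges local and global information efficiently and mitigates the low information density and high-level semantic discrepancy of medical images; and 3) a shallow, lightweight transformer bottleneck for long-range modeling, with a cascaded decoder and downsample skip connections for dense prediction. The architecture reaches state-of-the-art performance on eight public 2D/3D medical image datasets and transfers zero-shot to unseen datasets, supporting its efficiency, generality, and strong generalization.
Link: https://arxiv.org/abs/2508.01064
Authors: Fenghe Tang,Bingkun Nian,Jianrui Ding,Wenxin Ma,Quan Quan,Chengqi Dong,Jie Yang,Wei Liu,S. Kevin Zhou
Affiliations: University of Science and Technology of China; Suzhou Institute for Advanced Research, USTC; Shanghai Jiao Tong University; Harbin Institute of Technology; State Grid Hunan Electric Power Corporation Limited Research Institute
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM Multimedia 2025. Code: this https URL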
Abstract: In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models, primarily optimized for natural images, tend to perform poorly on medical tasks due to the significant information density gap between natural and medical domains. Combining computational efficiency with medical imaging-specific architectural advantages remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient yet powerful and generalizable solution for mobile medical image analysis. Code is available at this https URL.
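[CV-274] ReCoSeg: Extended Residual-Guided Cross-Modal Diffusion for Brain Tumor Segmentation
[Quick Read]: This paper addresses accurate segmentation of brain tumors in multi-modal magnetic resonance imaging (MRI), targeting the large and heterogeneous BraTS 2021 dataset. Conventional methods depend on large numbers of annotated ground-truth masks, whereas this work uses a semi-supervised two-stage framework that segments without ground-truth masks for the segmentation objective. In the first stage, a residual-guided denoising diffusion probabilistic model (DDPM) synthesizes the T1ce modality from the FLAIR, T1, and T2 modalities, and the resulting residual maps serve as spatial priors. In the second stage, a lightweight U-Net fuses the residual maps with the original multi-modal images for whole tumor segmentation, with slice-level filtering and optimized thresholding to cope with the more complex data. The method reaches a Dice score of 93.02% and an IoU of 86.7% on BraTS 2021, above the ReCoSeg baseline reported on BraTS 2020, supporting its accuracy and scalability on real-world multi-center MRI data.
Link: https://arxiv.org/abs/2508.01058
Authors: Sara Yavari,Rahul Nitin Pandya,Jacob Furst
Affiliations: DePaul University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: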
Abstract: Accurate segmentation of brain tumors in MRI scans is critical for clinical diagnosis and treatment planning. We propose a semi-supervised, two-stage framework that extends the ReCoSeg approach to the larger and more heterogeneous BraTS 2021 dataset, while eliminating the need for ground-truth masks for the segmentation objective. In the first stage, a residual-guided denoising diffusion probabilistic model (DDPM) performs cross-modal synthesis by reconstructing the T1ce modality from FLAIR, T1, and T2 scans. The residual maps, capturing differences between predicted and actual T1ce images, serve as spatial priors to enhance downstream segmentation. In the second stage, a lightweight U-Net takes as input the concatenation of residual maps, computed as the difference between real T1ce and synthesized T1ce, with T1, T2, and FLAIR modalities to improve whole tumor segmentation. To address the increased scale and variability of BraTS 2021, we apply slice-level filtering to exclude non-informative samples and optimize thresholding strategies to balance precision and recall. Our method achieves a Dice score of 93.02% and an IoU of 86.7% for whole tumor segmentation on the BraTS 2021 dataset, outperforming the ReCoSeg baseline on BraTS 2020 (Dice: 91.7%, IoU: 85.3%), and demonstrating improved accuracy and scalability for real-world, multi-center MRI datasets.
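The residual prior and the two reported metrics can be sketched in a few lines. This is an illustrative sketch only: taking the absolute value of the real-minus-synthesized difference is an assumption (the abstract says only "difference"), and the function names and toy masks are hypothetical.

```python
def residual_map(real, synth):
    """Per-pixel absolute difference between real and synthesized T1ce slices."""
    return [[abs(r - s) for r, s in zip(rr, sr)] for rr, sr in zip(real, synth)]

def dice_iou(pred, truth):
    """Dice and IoU for two binary masks given as nested lists of 0/1."""
    p = [v for row in pred for v in row]
    t = [v for row in truth for v in row]
    inter = sum(a * b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    if total == 0:  # both masks empty: treat as a perfect match
        return 1.0, 1.0
    return 2 * inter / total, inter / (total - inter)

# Toy 2x2 example: the prediction covers the true pixel plus one false positive
dice, iou = dice_iou(pred=[[1, 1], [0, 0]], truth=[[1, 0], [0, 0]])
res = residual_map(real=[[0.9, 0.1]], synth=[[0.4, 0.1]])
```

For a single pair of masks the two metrics are tied by IoU = Dice / (2 - Dice); scores averaged over many cases, such as those reported above, need not satisfy the identity exactly.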
[CV-275] Diagnostic Accuracy of Open-Source Vision-Language Models on Diverse Medical Imaging Tasks
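[Quick Read]: This paper evaluates the performance of several vision-language models (VLMs) on multi-modality medical imaging diagnosis tasks to explore their potential for clinical decision support. The key is a systematic comparison on the standardized MedFMC dataset (five imaging tasks: chest radiography, colon pathology, endoscopy, neonatal jaundice, and retinal fundoscopy) of diagnostic accuracy under three input settings: visual input only, multimodal input, and chain-of-thought reasoning. Qwen2.5 performed best on most tasks and significantly outperformed the other models on chest radiographs and endoscopy images (p<0.001). All models were limited on complex tasks such as diabetic retinopathy grading, and neither multimodal input nor chain-of-thought reasoning improved accuracy. The results suggest that current open-source VLMs show promise but need optimization and adaptation to specific clinical scenarios before they can be applied reliably.
Link: https://arxiv.org/abs/2508.01016
Authors: Gustav Müller-Franzes,Debora Jutz,Jakob Nikolas Kather,Christiane Kuhl,Sven Nebelung,Daniel Truhn
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: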
Abstract: This retrospective study evaluated five VLMs (Qwen2.5, Phi-4, Gemma3, Llama3.2, and Mistral3.1) using the MedFMC dataset. This dataset includes 22,349 images from 7,461 patients encompassing chest radiography (19 disease multi-label classifications), colon pathology (tumor detection), endoscopy (colorectal lesion identification), neonatal jaundice assessment (skin color-based treatment necessity), and retinal fundoscopy (5-point diabetic retinopathy grading). Diagnostic accuracy was compared in three experimental settings: visual input only, multimodal input, and chain-of-thought reasoning. Model accuracy was assessed against ground truth labels, with statistical comparisons using bootstrapped confidence intervals (p<.05). Qwen2.5 achieved the highest accuracy for chest radiographs (90.4%) and endoscopy images (84.2%), significantly outperforming the other models (p<.001). In colon pathology, Qwen2.5 (69.0%) and Phi-4 (69.6%) performed comparably (p=.41), both significantly exceeding other VLMs (p<.001). Similarly, for neonatal jaundice assessment, Qwen2.5 (58.3%) and Phi-4 (58.1%) showed comparable leading accuracies (p=.93), significantly exceeding their counterparts (p<.001). All models struggled with retinal fundoscopy; Qwen2.5 and Gemma3 achieved the highest, albeit modest, accuracies at 18.6% (comparable, p=.99), significantly better than other tested models (p<.001). Unexpectedly, multimodal input reduced accuracy for some models and modalities, and chain-of-thought reasoning prompts also failed to improve accuracy. The open-source VLMs demonstrated promising diagnostic capabilities, particularly in chest radiograph interpretation. However, performance in complex domains such as retinal fundoscopy was limited, underscoring the need for further development and domain-specific adaptation before widespread clinical application.
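The abstract mentions bootstrapped confidence intervals without giving the procedure. A minimal sketch of one common choice, the percentile bootstrap for accuracy, is shown below; the function name, the number of resamples, the seed, and the toy outcome vector are all assumptions for illustration and not values from the study.

```python
import random

def bootstrap_accuracy_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    correct: list of 0/1 flags, one per image (1 = prediction matched the label).
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(correct)
    # Resample images with replacement and recompute accuracy each time
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct) / n, lower, upper

# Hypothetical outcomes: 80 correct predictions out of 100 images
acc, lower, upper = bootstrap_accuracy_ci([1] * 80 + [0] * 20)
```

Comparing two models would typically resample the same images for both and examine the distribution of the accuracy difference.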
Artificial Intelligence
[AI-0] D2PPO: Diffusion Policy Policy Optimization with Dispersive Loss
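[Quick Read]: This paper addresses representation collapse in diffusion policies for robotic manipulation: semantically similar observations are mapped to indistinguishable features, which weakens the model's sensitivity to subtle differences in complex manipulation. The key to its solution is D2PPO, which adds a dispersive loss regularizer that treats all hidden representations within a batch as negative pairs, forcing the network to learn more discriminative features and helping the policy recognize critical details. Experiments show average improvements of 22.7% in pre-training and 26.1% after fine-tuning on the RoboMimic benchmarks, and real-robot experiments on a Franka Emika Panda show clear gains over existing methods, especially on complex tasks.
Link: https://arxiv.org/abs/2508.02644
Authors: Guowei Zou,Weibing Li,Hejun Wu,Yukun Qian,Yuhang Wang,Haitao Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: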
Abstract:Diffusion policies excel at robotic manipulation by naturally modeling multimodal action distributions in high-dimensional spaces. Nevertheless, diffusion policies suffer from diffusion representation collapse: semantically similar observations are mapped to indistinguishable features, ultimately impairing their ability to handle subtle but critical variations required for complex robotic manipulation. To address this problem, we propose D2PPO (Diffusion Policy Policy Optimization with Dispersive Loss). D2PPO introduces dispersive loss regularization that combats representation collapse by treating all hidden representations within each batch as negative pairs. D2PPO compels the network to learn discriminative representations of similar observations, thereby enabling the policy to identify subtle yet crucial differences necessary for precise manipulation. In evaluation, we find that early-layer regularization benefits simple tasks, while late-layer regularization sharply enhances performance on complex manipulation tasks. On RoboMimic benchmarks, D2PPO achieves an average improvement of 22.7% in pre-training and 26.1% after fine-tuning, setting new SOTA results. In comparison with SOTA, results of real-world experiments on a Franka Emika Panda robot show the excitingly high success rate of our method. The superiority of our method is especially evident in complex tasks. Project page: this https URL
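A repulsion-only loss of the kind described, where every pair in the batch acts as a negative pair, can be sketched as follows. The specific form used here, the log of the mean of exp(-||z_i - z_j||^2 / tau), is an assumed InfoNCE-style variant chosen for illustration; the abstract does not give the exact formula, and the function name, temperature, and toy batches are hypothetical.

```python
import math

def dispersive_loss(batch, tau=1.0):
    """log of the mean of exp(-||z_i - z_j||^2 / tau) over ordered pairs i != j.

    batch: list of hidden representations (lists of floats).
    The value is 0 when all representations coincide and decreases as they
    spread apart, so minimising it pushes representations away from each other.
    Very large distances would underflow exp() to 0; this toy version does not
    guard against that.
    """
    n = len(batch)
    terms = []
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(batch[i], batch[j]))
                terms.append(math.exp(-d2 / tau))
    return math.log(sum(terms) / len(terms))

collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # indistinguishable features
spread = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]     # well-separated features
loss_collapsed = dispersive_loss(collapsed)
loss_spread = dispersive_loss(spread)
```

In training, a term like this would be added to the policy loss with a weighting coefficient and applied to the hidden activations of a chosen layer.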
[AI-1] Actionable Counterfactual Explanations Using Bayesian Networks and Path Planning with Applications to Environmental Quality Improvement
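[Quick Read]: This paper addresses the generation of actionable counterfactual explanations: finding input changes that alter the model's prediction and are achievable in practice, without directly using sensitive or private training data. The key points of its solution are as follows. First, the training data is used only to learn a density estimator, which defines the landscape that path planning algorithms search. Second, the data density is modeled with Bayesian networks, which makes the explanations easier to understand and supports fairness in high-stakes scenarios. Finally, the method captures interactions between variables, so that policy recommendations that improve one domain (such as air quality) do not harm others (such as socioeconomic conditions), improving the equity and feasibility of decisions.
Link: https://arxiv.org/abs/2508.02634
Authors: Enrique Valero-Leal,Pedro Larrañaga,Concha Bielza
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: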
Abstract:Counterfactual explanations study what should have changed in order to get an alternative result, enabling end-users to understand machine learning mechanisms with counterexamples. Actionability is defined as the ability to transform the original case to be explained into a counterfactual one. We develop a method for actionable counterfactual explanations that, unlike predecessors, does not directly leverage training data. Rather, data is only used to learn a density estimator, creating a search landscape in which to apply path planning algorithms to solve the problem and masking the endogenous data, which can be sensitive or private. We put special focus on estimating the data density using Bayesian networks, demonstrating how their enhanced interpretability is useful in high-stakes scenarios in which fairness is raising concern. Using a synthetic benchmark comprised of 15 datasets, our proposal finds more actionable and simpler counterfactuals than the current state-of-the-art algorithms. We also test our algorithm with a real-world Environmental Protection Agency dataset, facilitating a more efficient and equitable study of policies to improve the quality of life in United States of America counties. Our proposal captures the interaction of variables, ensuring equity in decisions, as policies to improve certain domains of study (air, water quality, etc.) can be detrimental in others. In particular, the sociodemographic domain is often involved, where we find important variables related to the ongoing housing crisis that can potentially have a severe negative impact on communities.
[AI-2] What Is Your AI Agent Buying? Evaluation Implications and Emerging Questions for Agentic E-Commerce
Link: https://arxiv.org/abs/2508.02630
Authors: Amine Allouah,Omar Besbes,Josué D Figueroa,Yash Kanoria,Akshit Kumar
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); General Economics (econ.GN)
Comments:
[AI-3] AutoML-Med: A Framework for Automated Machine Learning in Medical Tabular Data
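[Quick Read]: This paper addresses challenges common to medical datasets, including missing values, class imbalance, heterogeneous feature types, and many features relative to few samples, all of which limit machine learning performance on classification and regression tasks. The key to its solution is AutoML-Med, an automated machine learning tool that uses Latin Hypercube Sampling (LHS) to explore combinations of preprocessing methods efficiently and the Partial Rank Correlation Coefficient (PRCC) to fine-tune the most influential preprocessing steps. It identifies the best combination of preprocessing pipeline and predictive model with minimal manual intervention, and it improves balanced accuracy and sensitivity in clinical settings, especially on medical datasets with sparse data and class imbalance.
Link: https://arxiv.org/abs/2508.02625
Authors: Riccardo Francia,Maurizio Leone,Giorgio Leonardi,Stefania Montani,Marzio Pennisi,Manuel Striani,Sandra D’Alfonso
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, preprint for conference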
Abstract: Medical datasets are typically affected by issues such as missing values, class imbalance, heterogeneous feature types, and a high number of features versus a relatively small number of samples, preventing machine learning models from obtaining proper results in classification and regression tasks. This paper introduces AutoML-Med, an Automated Machine Learning tool specifically designed to address these challenges, minimizing user intervention and identifying the optimal combination of preprocessing techniques and predictive models. AutoML-Med's architecture incorporates Latin Hypercube Sampling (LHS) for exploring preprocessing methods, trains models using selected metrics, and utilizes Partial Rank Correlation Coefficient (PRCC) for fine-tuned optimization of the most influential preprocessing steps. Experimental results demonstrate AutoML-Med's effectiveness in two different clinical settings, achieving higher balanced accuracy and sensitivity, which are crucial for identifying at-risk patients, compared to other state-of-the-art tools. AutoML-Med's ability to improve prediction results, especially in medical datasets with sparse data and class imbalance, highlights its potential to streamline Machine Learning applications in healthcare.
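Latin Hypercube Sampling, which the abstract uses to explore preprocessing methods, can be sketched with the standard library alone. The sampler below is the textbook construction (one point per equal-width stratum in every dimension); the mapping from samples to preprocessing choices, the option names, and both function names are hypothetical and are not taken from AutoML-Med.

```python
import random

def latin_hypercube(n_samples, n_dims, seed=0):
    """Return n_samples points in the unit cube, one per stratum in each dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # One point drawn uniformly inside each of n_samples equal-width strata
        col = [(k + rng.random()) / n_samples for k in range(n_samples)]
        rng.shuffle(col)  # decouple the strata across dimensions
        columns.append(col)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]

def to_pipeline(point, options):
    """Map one unit-cube point to one choice per preprocessing step."""
    return {
        step: choices[min(int(u * len(choices)), len(choices) - 1)]
        for u, (step, choices) in zip(point, options.items())
    }

# Hypothetical search space: three preprocessing steps
options = {
    "imputation": ["mean", "median", "knn"],
    "scaling": ["none", "standard", "minmax"],
    "balancing": ["none", "oversample"],
}
points = latin_hypercube(n_samples=8, n_dims=len(options))
pipelines = [to_pipeline(p, options) for p in points]
```

Compared with independent random draws, the stratification guarantees that every region of each dimension is visited, which matters when only a few pipelines can be trained.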
[AI-4] Meta-RAG on Large Codebases Using Code Summarization
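[Quick Read]: This paper addresses bug localization in large pre-existing codebases, a task that conventional methods handle poorly in software maintenance settings. The key to its solution is Meta-RAG, a new Retrieval Augmented Generation (RAG) framework that uses summarization to condense codebases by an average of 79.8% into a compact, structured natural language representation, and then uses an LLM agent to identify the code that is critical for resolving the bug, giving accurate file-level and function-level localization.
Link: https://arxiv.org/abs/2508.02611
Authors: Vali Tawosia,Salwa Alamir,Xiaomo Liu,Manuela Veloso
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: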
Abstract: Large Language Model (LLM) systems have been at the forefront of applied Artificial Intelligence (AI) research in a multitude of domains. One such domain is software development, where researchers have pushed the automation of a number of code tasks through LLM agents. Software development is a complex ecosystem that stretches far beyond code implementation and well into the realm of code maintenance. In this paper, we propose a multi-agent system to localize bugs in large pre-existing codebases using information retrieval and LLMs. Our system introduces a novel Retrieval Augmented Generation (RAG) approach, Meta-RAG, where we utilize summaries to condense codebases by an average of 79.8% into a compact, structured, natural language representation. We then use an LLM agent to determine which parts of the codebase are critical for bug resolution, i.e. bug localization. We demonstrate the usefulness of Meta-RAG through evaluation with the SWE-bench Lite dataset. Meta-RAG scores 84.67% and 53.0% for file-level and function-level correct localization rates, respectively, achieving state-of-the-art performance.
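[AI-5] Entity Representation Learning Through Onsite-Offsite Graph for Pinterest Ads
[Quick Read]: This paper addresses how an ads recommendation system can combine users' onsite activities with offsite conversion data to improve click-through rate (CTR) and conversion rate (CVR) prediction. The core challenge is that Knowledge Graph Embeddings (KGE) capture deep semantic relations in user interests, yet conventional approaches integrate them poorly into industrial ads ranking models, so offline gains were limited. The key points of its solution are: 1) building a large-scale heterogeneous graph from users' onsite ad interactions and opt-in offsite conversions; 2) proposing TransRA, a new KGE model that uses anchors to improve the embeddings; and 3) introducing a Large ID Embedding Table together with an attention-based KGE fine-tuning strategy, which lets the ranking models use the KGE information far more effectively. Deployed in Pinterest's Ads Engagement Model, the framework delivered a 2.69% CTR lift and a 1.34% CPC reduction.
Link: https://arxiv.org/abs/2508.02609
Authors: Jiayin Jin,Zhimeng Pan,Yang Tang,Jiarui Feng,Kungang Li,Chongyuan Xiang,Jiacheng Li,Runze Su,Siping Ji,Han Sun,Ling Leng,Prathibha Deshikachar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: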
Abstract: Graph Neural Networks (GNN) have been extensively applied to industry recommendation systems, as seen in models like GraphSage, TwHIM, LiGNN, etc. In these works, graphs were constructed based on users' activities on the platforms, and various graph models were developed to effectively learn node embeddings. In addition to users' onsite activities, their offsite conversions are crucial for Ads models to capture their shopping interest. To better leverage offsite conversion data and explore the connection between onsite and offsite activities, we constructed a large-scale heterogeneous graph based on users' onsite ad interactions and opt-in offsite conversion activities. Furthermore, we introduced TransRA (TransR with Anchors), a novel Knowledge Graph Embedding (KGE) model, to more efficiently integrate graph embeddings into Ads ranking models. However, our Ads ranking models initially struggled to directly incorporate Knowledge Graph Embeddings (KGE), and only modest gains were observed during offline experiments. To address this challenge, we employed the Large ID Embedding Table technique and innovated an attention based KGE finetuning approach within the Ads ranking models. As a result, we observed a significant AUC lift in Click-Through Rate (CTR) and Conversion Rate (CVR) prediction models. Moreover, this framework has been deployed in Pinterest's Ads Engagement Model and contributed to 2.69% CTR lift and 1.34% CPC reduction. We believe the techniques presented in this paper can be leveraged by other large-scale industrial models.
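TransRA builds on TransR, whose scoring function projects head and tail entities into a relation-specific space and measures how well the relation vector translates one onto the other. The sketch below shows plain TransR only; the anchor mechanism is not described in the abstract and is not modeled here, and the entity names, dimensions, and values are hypothetical.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transr_score(h, r, t, M_r):
    """Squared distance ||M_r h + r - M_r t||^2; lower means a more plausible triple."""
    h_r, t_r = matvec(M_r, h), matvec(M_r, t)
    return sum((a + b - c) ** 2 for a, b, c in zip(h_r, r, t_r))

# Hypothetical 3-D entity space projected into a 2-D relation space
M_click = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]
user = [0.2, 0.1, 0.9]
ad = [0.7, 0.4, -0.3]
r_click = [0.5, 0.3]

plausible = transr_score(user, r_click, ad, M_click)  # translation lands on the ad
reversed_ = transr_score(ad, r_click, user, M_click)  # wrong direction scores worse
```

Training would minimise the score of observed triples relative to corrupted ones, typically with a margin-based ranking loss.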
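[AI-6] StructSynth: Leveraging LLMs for Structure-Aware Tabular Data Synthesis in Low-Data Regimes
[Quick Read]: This paper addresses the limited applicability of machine learning to tabular data in specialized domains where data is scarce. Existing generative models perform poorly in low-data regimes, and recent Large Language Models (LLMs) usually ignore the explicit dependency structure of tabular data, producing low-fidelity synthetic data. The key to its solution is StructSynth, a framework with a two-stage architecture: it first performs explicit structure discovery to learn a Directed Acyclic Graph (DAG) that captures dependencies between features, and then uses that structure as a high-fidelity blueprint to steer the LLM's generation, forcing it to follow the learned dependencies so that the generated data is structurally and statistically consistent by design.
Link: https://arxiv.org/abs/2508.02601
Authors: Siyi Liu,Yujia Zheng,Yongqi Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: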
Abstract:The application of machine learning on tabular data in specialized domains is severely limited by data scarcity. While generative models offer a solution, traditional methods falter in low-data regimes, and recent Large Language Models (LLMs) often ignore the explicit dependency structure of tabular data, leading to low-fidelity synthetics. To address these limitations, we introduce StructSynth, a novel framework that integrates the generative power of LLMs with robust structural control. StructSynth employs a two-stage architecture. First, it performs explicit structure discovery to learn a Directed Acyclic Graph (DAG) from the available data. Second, this learned structure serves as a high-fidelity blueprint to steer the LLM’s generation process, forcing it to adhere to the learned feature dependencies and thereby ensuring the generated data respects the underlying structure by design. Our extensive experiments demonstrate that StructSynth produces synthetic data with significantly higher structural integrity and downstream utility than state-of-the-art methods. It proves especially effective in challenging low-data scenarios, successfully navigating the trade-off between privacy preservation and statistical fidelity.
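One way a learned DAG could steer generation is by fixing the order in which features are produced, so that each feature is generated only after its parents. The sketch below illustrates that idea with the standard library's `graphlib` (Python 3.9+); it is an assumption about how the blueprint might be used, not the paper's procedure, and the feature names, the prompt wording, and both function names are hypothetical.

```python
from graphlib import TopologicalSorter

def generation_order(parents):
    """parents maps each feature to the set of features it depends on.

    Returns the features ordered so that every parent precedes its children.
    """
    return list(TopologicalSorter(parents).static_order())

def build_prompt(feature, parents, row):
    """Condition the request for one feature on its already generated parents."""
    context = ", ".join(f"{p}={row[p]}" for p in sorted(parents.get(feature, ())))
    return f"Given {context or 'no parent features'}, generate a value for {feature}."

# Hypothetical DAG learned in the structure-discovery stage
parents = {
    "age": set(),
    "education": set(),
    "income": {"age", "education"},
    "loan_approved": {"income"},
}
order = generation_order(parents)
prompt = build_prompt("income", parents, {"age": 41, "education": "MSc"})
```

A cycle in the input would raise `graphlib.CycleError`, which doubles as a check that the discovered structure really is acyclic.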
[AI-7] Explainable AI for Automated User-specific Feedback in Surgical Skill Acquisition
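[Quick Read]: This paper addresses the limits of traditional surgical skill training, where expert feedback is scarce, subjective, and hard to personalize, which reduces the effectiveness of self-directed learning. The key to its solution is an explainable AI (XAI) feedback mechanism built on a simulation-based training framework: computer vision and machine learning analyze surgical videos automatically, extract skill proxies related to primitive actions, and compare trainee performance with expert benchmarks, using intuitive and understandable proxies to identify deviations in execution and provide actionable, personalized feedback. The approach offers automated, objective, and quantitative assessment and teaching intervention, opening a new path to more efficient and precise surgical training.
Link: https://arxiv.org/abs/2508.02593
Authors: Catalina Gomez,Lalithkumar Seenivasan,Xinrui Zou,Jeewoo Yoon,Sirui Chu,Ariel Leong,Patrick Kramer,Yu-Chun Ku,Jose L. Porras,Alejandro Martin-Gomez,Masaru Ishii,Mathias Unberath
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures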
Abstract:Traditional surgical skill acquisition relies heavily on expert feedback, yet direct access is limited by faculty availability and variability in subjective assessments. While trainees can practice independently, the lack of personalized, objective, and quantitative feedback reduces the effectiveness of self-directed learning. Recent advances in computer vision and machine learning have enabled automated surgical skill assessment, demonstrating the feasibility of automatic competency evaluation. However, it is unclear whether such Artificial Intelligence (AI)-driven feedback can contribute to skill acquisition. Here, we examine the effectiveness of explainable AI (XAI)-generated feedback in surgical training through a human-AI study. We create a simulation-based training framework that utilizes XAI to analyze videos and extract surgical skill proxies related to primitive actions. Our intervention provides automated, user-specific feedback by comparing trainee performance to expert benchmarks and highlighting deviations from optimal execution through understandable proxies for actionable guidance. In a prospective user study with medical students, we compare the impact of XAI-guided feedback against traditional video-based coaching on task outcomes, cognitive load, and trainees’ perceptions of AI-assisted learning. Results showed improved cognitive load and confidence post-intervention. While no differences emerged between the two feedback types in reducing performance gaps or practice adjustments, trends in the XAI group revealed desirable effects where participants more closely mimicked expert practice. This work encourages the study of explainable AI in surgical education and the development of data-driven, adaptive feedback mechanisms that could transform learning experiences and competency assessment.
[AI-8] CAMA: Enhancing Mathematical Reasoning in Large Language Models with Causal Knowledge
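[Quick Read]: This paper addresses the weak performance of Large Language Models (LLMs) on complex mathematical reasoning, a challenge rooted in the difficulty of capturing and using the deep structural dependencies in mathematical problems. The key to its solution is a two-stage causal framework, Causal Mathematician (CAMA). Its core contribution is a high-level Mathematical Causal Graph (MCG), extracted automatically from a corpus of question-solution pairs by combining LLM priors with causal discovery algorithms and then refined through iterative feedback to fit downstream reasoning tasks. At reasoning time, CAMA dynamically extracts a relevant subgraph based on the new question and the model's intermediate reasoning trace and injects it into the LLM to guide its reasoning explicitly, which improves mathematical reasoning markedly. Experiments show that structured guidance outperforms unstructured alternatives and that asymmetric causal relationships bring larger gains.
Link: https://arxiv.org/abs/2508.02583
Authors: Lei Zan,Keli Zhang,Ruichu Cai,Lujia Pan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: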
Abstract: Large Language Models (LLMs) have demonstrated strong performance across a wide range of tasks, yet they still struggle with complex mathematical reasoning, a challenge fundamentally rooted in deep structural dependencies. To address this challenge, we propose CAusal MAthematician (CAMA), a two-stage causal framework that equips LLMs with explicit, reusable mathematical structure. In the learning stage, CAMA first constructs the Mathematical Causal Graph (MCG), a high-level representation of solution strategies, by combining LLM priors with causal discovery algorithms applied to a corpus of question-solution pairs. The resulting MCG encodes essential knowledge points and their causal dependencies. To better align the graph with downstream reasoning tasks, CAMA further refines the MCG through iterative feedback derived from a selected subset of the question-solution pairs. In the reasoning stage, given a new question, CAMA dynamically extracts a task-relevant subgraph from the MCG, conditioned on both the question content and the LLM's intermediate reasoning trace. This subgraph, which encodes the most pertinent knowledge points and their causal dependencies, is then injected back into the LLM to guide its reasoning process. Empirical results on real-world datasets show that CAMA significantly improves LLM performance on challenging mathematical problems. Furthermore, our experiments demonstrate that structured guidance consistently outperforms unstructured alternatives, and that incorporating asymmetric causal relationships yields greater improvements than using symmetric associations alone.
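The subgraph-extraction step can be pictured with a small sketch: starting from the knowledge points matched to a question, follow the directed edges backwards and keep every prerequisite. This is one plausible reading of "task-relevant subgraph" and an assumption, since the abstract does not specify the extraction rule; the function name, knowledge points, and edges are hypothetical.

```python
def relevant_subgraph(edges, seeds):
    """Keep the seed knowledge points and everything that causally precedes them.

    edges: list of (cause, effect) pairs from the causal graph.
    seeds: knowledge points matched to the question or reasoning trace.
    Returns the edges whose endpoints both survive, in their original order.
    """
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, set()).add(cause)
    keep, stack = set(seeds), list(seeds)
    while stack:  # depth-first walk up the causal ancestors
        node = stack.pop()
        for p in parents.get(node, ()):
            if p not in keep:
                keep.add(p)
                stack.append(p)
    return [(c, e) for c, e in edges if c in keep and e in keep]

# Hypothetical fragment of a mathematical causal graph
edges = [
    ("fractions", "ratios"),
    ("ratios", "similar triangles"),
    ("algebra", "quadratics"),
]
sub = relevant_subgraph(edges, seeds={"similar triangles"})
```

Because the edges are directed, the walk uses the asymmetry of the graph: prerequisites of a matched point are kept, while unrelated branches such as the algebra edge are dropped.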
[AI-9] Dynamic Feature Selection based on Rule-based Learning for Explainable Classification with Uncertainty Quantification
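【Quick Read】: This paper addresses the opacity of Dynamic Feature Selection (DFS) in practice, where reliance on black-box models makes decisions hard to inspect, a particular concern in settings with high interpretability requirements such as clinical decision-making. The key to the solution is to use a rule-based classifier as the base model for DFS, which makes the decision process more interpretable than neural estimators do. The framework also provides a quantitative uncertainty measure for each feature query and reduces computational cost by constraining the feature search space, giving a more efficient and transparent selection procedure while remaining competitive in performance.
Link: https://arxiv.org/abs/2508.02566
Authors: Javier Fumanal-Idocin,Raquel Fernandez-Peralta,Javier Andreu-Perez
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: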
Abstract:Dynamic feature selection (DFS) offers a compelling alternative to traditional, static feature selection by adapting the selected features to each individual sample. Unlike classical methods that apply a uniform feature set, DFS customizes feature selection per sample, providing insight into the decision-making process for each case. DFS is especially significant in settings where decision transparency is key, i.e., clinical decisions; however, existing methods use opaque models, which hinder their applicability in real-life scenarios. This paper introduces a novel approach leveraging a rule-based system as a base classifier for the DFS process, which enhances decision interpretability compared to neural estimators. We also show how this method provides a quantitative measure of uncertainty for each feature query and can make the feature selection process computationally lighter by constraining the feature search space. We also discuss when greedy selection of conditional mutual information is equivalent to selecting features that minimize the difference with respect to the global model predictions. Finally, we demonstrate the competitive performance of our rule-based DFS approach against established and state-of-the-art greedy and RL methods, which are mostly considered opaque, compared to our explainable rule-based system.
[AI-10] Stakeholder Perspectives on Humanistic Implementation of Computer Perception in Healthcare: A Qualitative Study
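【Quick Read】: This paper examines the multidimensional challenges raised by applying Computer Perception (CP) technologies in healthcare, including privacy risks, bias, and the erosion of relationship-centered clinical practice. Through in-depth interviews with 102 key stakeholders (adolescent patients and their caregivers, frontline clinicians, technology developers, and ethics, legal, policy, or philosophy scholars), the study identifies seven interlocking domains of concern: trustworthiness and data integrity, patient-specific relevance, utility and workflow integration, regulation and governance, privacy and data protection, direct and indirect patient harms, and philosophical critiques of reductionism. The key to the solution is the proposal of "personalized roadmaps": co-designed implementation plans that specify which metrics are monitored, how and when feedback is shared, the thresholds for clinical action, and the procedures for reconciling algorithmic inferences with lived experience, so that continuous behavioral data can be used while preserving the humanistic core of care.
Link: https://arxiv.org/abs/2508.02550
Authors: Kristin M. Kostick-Quenet(1),Meghan E. Hurley(1),Syed Ayaz(1),John Herrington(2),Casey Zampella(2),Julia Parish-Morris(2),Birkan Tunç(2),Gabriel Lázaro-Muñoz(3),J.S. Blumenthal-Barby(1),Eric A. Storch(1) ((1) Baylor College of Medicine, Houston, TX, 77030, USA (2) Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA (3) Massachusetts General Hospital, Boston, MA 02114, USA)
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 65 pages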
Abstract:Computer perception (CP) technologies (digital phenotyping, affective computing and related passive sensing approaches) offer unprecedented opportunities to personalize healthcare, but provoke concerns about privacy, bias and the erosion of empathic, relationship-centered practice. A comprehensive understanding of perceived risks, benefits, and implementation challenges from those who design, deploy and experience these tools in real-world settings remains elusive. This study provides the first evidence-based account of key stakeholder perspectives on the relational, technical, and governance challenges raised by the integration of CP technologies into patient care. We conducted in-depth, semi-structured interviews with 102 stakeholders: adolescent patients and their caregivers, frontline clinicians, technology developers, and ethics, legal, policy or philosophy scholars. Transcripts underwent thematic analysis by a multidisciplinary team; reliability was enhanced through double coding and consensus adjudication. Stakeholders articulated seven interlocking concern domains: (1) trustworthiness and data integrity; (2) patient-specific relevance; (3) utility and workflow integration; (4) regulation and governance; (5) privacy and data protection; (6) direct and indirect patient harms; and (7) philosophical critiques of reductionism. To operationalize humanistic safeguards, we propose “personalized roadmaps”: co-designed plans that predetermine which metrics will be monitored, how and when feedback is shared, thresholds for clinical action, and procedures for reconciling discrepancies between algorithmic inferences and lived experience. By translating these insights into personalized roadmaps, we offer a practical framework for developers, clinicians and policymakers seeking to harness continuous behavioral data while preserving the humanistic core of care.
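[AI-11] The KG-ER Conceptual Schema Language
【Quick Read】: This paper addresses the tight coupling between descriptions of Knowledge Graph (KG) structure and concrete representations (relational databases, property graphs, or RDF), which makes unified modeling and semantic description difficult. The key to the solution is KG-ER, a conceptual schema language that describes the structure of a knowledge graph independently of any particular representation while helping to capture the semantics of the information it stores.
Link: https://arxiv.org/abs/2508.02548
Authors: Enrico Franconi,Benoît Groz,Jan Hidders,Nina Pardal,Sławek Staworko,Jan Van den Bussche,Piotr Wieczorek
Affiliation: unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: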
Abstract:We propose KG-ER, a conceptual schema language for knowledge graphs that describes the structure of knowledge graphs independently of their representation (relational databases, property graphs, RDF) while helping to capture the semantics of the information stored in a knowledge graph.
[AI-12] Automatic Identification of Machine Learning-Specific Code Smells
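【Quick Read】: This paper addresses the lack of tools and studies for identifying and validating code smells specific to Machine Learning (ML) applications. Following the Design Science Methodology, and drawing on literature reviews and expert consultation, the authors build MLpylint, a static code analysis tool that checks open-source ML projects against the identified ML code smell criteria. The tool was evaluated on 160 open-source ML applications and statically validated through an expert survey of 15 ML professionals; the results indicate that it is effective and useful for identifying ML-specific code smells.
Link: https://arxiv.org/abs/2508.02541
Authors: Peter Hamfelt,Ricardo Britto,Lincoln Rocha,Camilo Almendra
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: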
Abstract:Machine learning (ML) has rapidly grown in popularity, becoming vital to many industries. Currently, the research on code smells in ML applications lacks tools and studies that address the identification and validity of ML-specific code smells. This work investigates suitable methods and tools to design and develop a static code analysis tool (MLpylint) based on code smell criteria. This research employed the Design Science Methodology. In the problem identification phase, a literature review was conducted to identify ML-specific code smells. In solution design, a secondary literature review and consultations with experts were performed to select methods and tools for implementing the tool. We evaluated the tool on data from 160 open-source ML applications sourced from GitHub. We also conducted a static validation through an expert survey involving 15 ML professionals. The results indicate the effectiveness and usefulness of the MLpylint. We aim to extend our current approach by investigating ways to introduce MLpylint seamlessly into development workflows, fostering a more productive and innovative developer environment.
[AI-13] Accurate and Interpretable Postmenstrual Age Prediction via Multimodal Large Language Model NEURIPS2025 ALT
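【Quick Read】: This paper addresses the tension between accuracy and interpretability when predicting postmenstrual age (PMA) from neonatal brain magnetic resonance imaging (MRI). Existing deep learning models achieve high accuracy, but their black-box nature limits trust in clinical decision support. The key to the solution is a parameter-efficient fine-tuning (PEFT) strategy that combines instruction tuning with Low-Rank Adaptation (LoRA) to adapt the multimodal large language model (MLLM) Qwen2.5-VL-7B, so that it performs a regression task during training to predict PMA and generates clinically interpretable explanations grounded in developmental features at inference. Trained on four 2D cortical surface projection maps, the model achieves a prediction error with a 95% confidence interval of 0.78 to 1.52 weeks, a step toward more transparent and trustworthy AI systems in perinatal neuroscience.
Link: https://arxiv.org/abs/2508.02525
Authors: Qifan Chen,Jin Cui,Cindy Duan,Yushuo Han,Yifei Shi
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to the NeurIPS 2025 Workshop GenAI4Health. Conference website: this https URL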
Abstract:Accurate estimation of postmenstrual age (PMA) at scan is crucial for assessing neonatal development and health. While deep learning models have achieved high accuracy in predicting PMA from brain MRI, they often function as black boxes, offering limited transparency and interpretability in clinical decision support. In this work, we address the dual challenge of accuracy and interpretability by adapting a multimodal large language model (MLLM) to perform both precise PMA prediction and clinically relevant explanation generation. We introduce a parameter-efficient fine-tuning (PEFT) strategy using instruction tuning and Low-Rank Adaptation (LoRA) applied to the Qwen2.5-VL-7B model. The model is trained on four 2D cortical surface projection maps derived from neonatal MRI scans. By employing distinct prompts for training and inference, our approach enables the MLLM to handle a regression task during training and generate clinically relevant explanations during inference. The fine-tuned model achieves a low prediction error with a 95 percent confidence interval of 0.78 to 1.52 weeks, while producing interpretable outputs grounded in developmental features, marking a significant step toward transparent and trustworthy AI systems in perinatal neuroscience.
[AI-14] Decomposed Reasoning with Reinforcement Learning for Relevance Assessment in UGC Platforms
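【Quick Read】: This paper addresses the effectiveness of relevance assessment for Retrieval-Augmented Generation (RAG) systems on User-Generated Content (UGC) platforms. The main challenges are 1) ambiguous query intent caused by sparse user feedback in RAG scenarios, and 2) noise introduced by informal, unstructured language. The key to the solution is R3A (Reinforced Reasoning Model for Relevance Assessment), a reinforcement-learning-based relevance model built on a decomposed reasoning framework: it first uses high-ranked documents within the platform to infer latent query intent, then extracts verbatim fragments to provide an interpretable basis for its relevance decisions and reduce the effect of noise. The model is optimized within a reinforcement learning framework to mitigate the distortions caused by ambiguous queries and unstructured content.
Link: https://arxiv.org/abs/2508.02506
Authors: Xiaowei Yuan,Lei Jin,Haoxin Zhang,Yan Gao,Yi Wu,Yao Hu,Ziyang Huang,Jun Zhao,Kang Liu
Affiliation: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: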
Abstract:Retrieval-augmented generation (RAG) plays a critical role in user-generated content (UGC) platforms, but its effectiveness depends heavily on accurate relevance assessment of query-document pairs. Despite recent advances in applying large language models (LLMs) to relevance modeling, UGC platforms present unique challenges: 1) ambiguous user intent due to sparse user feedback in RAG scenarios, and 2) substantial noise introduced by informal and unstructured language. To address these issues, we propose the Reinforced Reasoning Model for Relevance Assessment (R3A), which introduces a decomposed reasoning framework over queries and candidate documents before scoring. R3A first leverages auxiliary high-ranked documents within the platform to infer latent query intent. It then performs verbatim fragment extraction to justify relevance decisions, thereby reducing errors caused by noisy UGC. Based on a reinforcement learning framework, R3A is optimized to mitigate distortions arising from ambiguous queries and unstructured content. Experimental results show that R3A significantly outperforms existing baseline methods in terms of relevance accuracy, across both offline benchmarks and online experiments.
[AI-15] PHM-Bench: A Domain-Specific Benchmarking Framework for Systematic Evaluation of Large Models in Prognostics and Health Management
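【Quick Read】: This paper addresses the shortcomings of existing evaluation methods at the intersection of generative artificial intelligence (Generative AI) and Prognostics and Health Management (PHM), namely gaps in structural completeness, dimensional comprehensiveness, and evaluation granularity, which hinder the deep integration of Large Language Models (LLMs) into the PHM domain. The key to the solution is PHM-Bench, a three-dimensional evaluation framework for PHM tasks. Built on the triadic structure of fundamental capability, core task, and entire lifecycle, it defines multi-level metrics spanning knowledge comprehension, algorithmic generation, and task optimization, and uses curated case sets together with public industrial datasets to evaluate general-purpose and domain-specific models along multiple dimensions. It provides a methodological foundation and a benchmark for large-scale assessment of LLMs in PHM.
Link: https://arxiv.org/abs/2508.02490
Authors: Puyu Yang,Laifa Tao,Zijian Huang,Haifei Liu,Wenyan Cao,Hao Ji,Jianan Qiu,Qixuan Huang,Xuanyuan Su,Yuhang Xie,Jun Zhang,Shangyu Li,Chen Lu,Zhixuan Lian
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: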
Abstract:With the rapid advancement of generative artificial intelligence, large language models (LLMs) are increasingly adopted in industrial domains, offering new opportunities for Prognostics and Health Management (PHM). These models help address challenges such as high development costs, long deployment cycles, and limited generalizability. However, despite the growing synergy between PHM and LLMs, existing evaluation methodologies often fall short in structural completeness, dimensional comprehensiveness, and evaluation granularity. This hampers the in-depth integration of LLMs into the PHM domain. To address these limitations, this study proposes PHM-Bench, a novel three-dimensional evaluation framework for PHM-oriented large models. Grounded in the triadic structure of fundamental capability, core task, and entire lifecycle, PHM-Bench is tailored to the unique demands of PHM system engineering. It defines multi-level evaluation metrics spanning knowledge comprehension, algorithmic generation, and task optimization. These metrics align with typical PHM tasks, including condition monitoring, fault diagnosis, RUL prediction, and maintenance decision-making. Utilizing both curated case sets and publicly available industrial datasets, our study enables multi-dimensional evaluation of general-purpose and domain-specific models across diverse PHM tasks. PHM-Bench establishes a methodological foundation for large-scale assessment of LLMs in PHM and offers a critical benchmark to guide the transition from general-purpose to PHM-specialized models.
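[AI-16] TreeRanker: Fast and Model-agnostic Ranking System for Code Suggestions in IDEs
【Quick Read】: This paper addresses the ranking of code completions in modern Integrated Development Environments (IDEs), that is, how to surface the most relevant identifiers and APIs to the developer. Current systems mostly rely on hand-crafted heuristics or lightweight machine learning models, which leave room for improvement in capturing context and generalizing across projects and coding styles. The key to the solution is a lightweight, model-agnostic scoring approach based on language models: all valid completions are organized into a prefix tree, and a single greedy decoding pass collects token-level scores across the tree, giving a precise token-aware ranking without beam search, prompt engineering, or model adaptation. The method is fast, works with already deployed code completion models, and can be integrated directly into existing IDE tooling.
Link: https://arxiv.org/abs/2508.02455
Authors: Daniele Cipollone,Egor Bogomolov,Arie van Deursen,Maliheh Izadi
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: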
Abstract:Token-level code completion is one of the most critical features in modern Integrated Development Environments (IDEs). It assists developers by suggesting relevant identifiers and APIs during coding. While completions are typically derived from static analysis, their usefulness depends heavily on how they are ranked, as correct predictions buried deep in the list are rarely seen by users. Most current systems rely on hand-crafted heuristics or lightweight machine learning models trained on user logs, which can be further improved to capture context information and generalize across projects and coding styles. In this work, we propose a new scoring approach to ranking static completions using language models in a lightweight and model-agnostic way. Our method organizes all valid completions into a prefix tree and performs a single greedy decoding pass to collect token-level scores across the tree. This enables a precise token-aware ranking without needing beam search, prompt engineering, or model adaptations. The approach is fast, architecture-agnostic, and compatible with already deployed models for code completion. These findings highlight a practical and effective pathway for integrating language models into already existing tools within IDEs, and ultimately providing smarter and more responsive developer assistance.
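The prefix-tree scoring idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_logprobs` is a hypothetical stand-in for a language model's next-token distribution, the candidates are already tokenized, and the sketch queries the stand-in once per trie node rather than reusing a single greedy decoding pass as the paper describes.

```python
import math
from typing import Callable, Dict, List, Tuple

Token = str
LogProbFn = Callable[[Tuple[Token, ...]], Dict[Token, float]]
END = "$end"  # sentinel key marking a complete candidate (assumed not to be a real token)


def build_trie(completions: List[List[Token]]) -> dict:
    """Insert each tokenized completion into a nested-dict prefix tree."""
    root: dict = {}
    for tokens in completions:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def score_trie(root: dict, logprobs: LogProbFn) -> Dict[Tuple[Token, ...], float]:
    """Sum token log-probabilities along each path that ends a candidate.

    The scoring function is called once per internal node, so a prefix
    shared by several candidates is scored only once.
    """
    scores: Dict[Tuple[Token, ...], float] = {}
    stack = [(root, (), 0.0)]
    while stack:
        node, prefix, total = stack.pop()
        if END in node:
            scores[prefix] = total
        children = [tok for tok in node if tok != END]
        if not children:
            continue
        dist = logprobs(prefix)
        for tok in children:
            # Tokens missing from the distribution receive a large penalty.
            stack.append((node[tok], prefix + (tok,), total + dist.get(tok, -1e9)))
    return scores


def toy_logprobs(prefix: Tuple[Token, ...]) -> Dict[Token, float]:
    """Hypothetical next-token log-probabilities, hard-coded for the demo."""
    table = {
        (): {"get": math.log(0.7), "set": math.log(0.3)},
        ("get",): {"Name": math.log(0.6), "Id": math.log(0.4)},
        ("set",): {"Name": math.log(1.0)},
    }
    return table.get(prefix, {})


candidates = [["get", "Name"], ["get", "Id"], ["set", "Name"]]
scores = score_trie(build_trie(candidates), toy_logprobs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The trie keeps the candidate set fixed by static analysis, so the model only ranks completions that are already known to be valid; it never proposes new ones.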
[AI-17] Dynamic Forgetting and Spatio-Temporal Periodic Interest Modeling for Local-Life Service Recommendation
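【Quick Read】: This paper addresses two challenges in modeling user behavior sequences for recommendation on local-life service platforms: the sparsity of long sequences and strong spatio-temporal dependence. The core of the solution draws on the forgetting curve of human memory to build Spatio-Temporal periodic Interest Modeling (STIM), which relies on three cooperating components: a dynamic masking module based on the forgetting curve that extracts recent and periodic spatio-temporal features; a query-based Mixture of Experts (MoE) mechanism that adaptively activates expert networks under different dynamic masks to jointly model time, location, and items; and a hierarchical multi-interest network unit that captures multi-interest representations by modeling the interactions between the shallow and deep semantics of users' recent behaviors. STIM has been deployed in a large-scale local-life service recommendation system, where online A/B tests showed a 1.54% improvement in gross transaction volume (GTV).
Link: https://arxiv.org/abs/2508.02451
Authors: Zhaoyu Hu,Hao Guo,Yuan Tian,Erpeng Xue,Jianyang Wang,Xianyang Qi,Hongxiang Lin,Lei Wang,Sheng Chen
Affiliation: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: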
Abstract:In the context of the booming digital economy, recommendation systems, as a key link connecting users and numerous services, face challenges in modeling user behavior sequences on local-life service platforms, including the sparsity of long sequences and strong spatio-temporal dependence. Such challenges can be addressed by drawing an analogy to the forgetting process in human memory. This is because users’ responses to recommended content follow the recency effect and the cyclicality of memory. By exploring this, this paper introduces the forgetting curve and proposes Spatio-Temporal periodic Interest Modeling (STIM) with long sequences for local-life service recommendation. STIM integrates three key components: a dynamic masking module based on the forgetting curve, which is used to extract both recent spatiotemporal features and periodic spatiotemporal features; a query-based mixture of experts (MoE) approach that can adaptively activate expert networks under different dynamic masks, enabling the collaborative modeling of time, location, and items; and a hierarchical multi-interest network unit, which captures multi-interest representations by modeling the hierarchical interactions between the shallow and deep semantics of users’ recent behaviors. By introducing the STIM method, we conducted online A/B tests and achieved a 1.54% improvement in gross transaction volume (GTV). In addition, extended offline experiments also showed improvements. STIM has been deployed in a large-scale local-life service recommendation system, serving hundreds of millions of daily active users in core application scenarios.
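As a rough illustration of forgetting-curve masking, the Python sketch below builds one mask for recent behaviors and one for periodic behaviors. Everything specific here is an assumption: the abstract does not give the functional form, so the sketch uses an exponential (Ebbinghaus-style) retention function R(t) = exp(-t / s), and the threshold, period, tolerance, and data are hypothetical. It is not the STIM module itself.

```python
import math
from typing import List


def retention(elapsed_hours: float, strength_hours: float) -> float:
    """Assumed exponential forgetting curve: R(t) = exp(-t / s)."""
    return math.exp(-elapsed_hours / strength_hours)


def recency_mask(elapsed: List[float], strength_hours: float, threshold: float) -> List[bool]:
    """Keep behaviors whose retention is still at or above the threshold."""
    return [retention(t, strength_hours) >= threshold for t in elapsed]


def periodic_mask(elapsed: List[float], period_hours: float, tolerance_hours: float) -> List[bool]:
    """Keep behaviors that happened close to a whole number of periods ago."""
    mask = []
    for t in elapsed:
        offset = t % period_hours
        distance = min(offset, period_hours - offset)
        # Require at least half a period so very recent events are not counted as periodic.
        mask.append(t >= period_hours / 2 and distance <= tolerance_hours)
    return mask


# Hours elapsed since each past behavior (hypothetical data).
elapsed = [2.0, 30.0, 166.0, 170.0, 400.0]
recent = recency_mask(elapsed, strength_hours=24.0, threshold=0.2)
weekly = periodic_mask(elapsed, period_hours=168.0, tolerance_hours=3.0)
print(recent)
print(weekly)
```

With these settings the two behaviors from the last day and a half pass the recency mask, and the two behaviors from roughly one week ago pass the weekly mask, which mirrors the recency effect and cyclicality the abstract refers to.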
[AI-18] Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education
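【Quick Read】: This paper examines the reliability and validity of Large Language Models (LLMs) for Automated Essay Scoring (AES) in higher education, in particular whether they can reproduce the judgments of human raters. The study evaluates five advanced LLMs (Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B) on 67 Italian-language essays from a university psychology course. Human-LLM agreement, measured with Quadratic Weighted Kappa, was consistently low and non-significant, and within-model stability across prompt replications was also weak (median Kendall's W < 0.30). The key finding is that the LLMs diverge systematically on dimensions that require disciplinary insight and contextual sensitivity (such as Pertinence and Feasibility) and show only moderate inter-model agreement on Coherence and Originality. The authors therefore stress the need for human oversight, especially when evaluating interpretive academic writing, where current LLMs may not yet be reliable on their own.
Link: https://arxiv.org/abs/2508.02442
Authors: Andrea Gaggioli,Giuseppe Casaburi,Leonardo Ercolani,Francesco Collova’,Pietro Torre,Fabrizio Davide
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 24 pages (including appendix), 12 tables, 1 figure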
Abstract:This study investigates the reliability and validity of five advanced Large Language Models (LLMs), Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B, for automated essay scoring in a real world higher education context. A total of 67 Italian-language student essays, written as part of a university psychology course, were evaluated using a four-criterion rubric (Pertinence, Coherence, Originality, Feasibility). Each model scored all essays across three prompt replications to assess intra-model stability. Human-LLM agreement was consistently low and non-significant (Quadratic Weighted Kappa), and within-model reliability across replications was similarly weak (median Kendall’s W < 0.30). Systematic scoring divergences emerged, including a tendency to inflate Coherence and inconsistent handling of context-dependent dimensions. Inter-model agreement analysis revealed moderate convergence for Coherence and Originality, but negligible concordance for Pertinence and Feasibility. Although limited in scope, these findings suggest that current LLMs may struggle to replicate human judgment in tasks requiring disciplinary insight and contextual sensitivity. Human oversight remains critical when evaluating open-ended academic work, particularly in interpretive domains.
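For readers unfamiliar with the two agreement statistics named in the abstract, the Python sketch below computes them from their textbook definitions. The function names and the example ratings are illustrative, not taken from the paper, and the Kendall's W version omits the tie correction that a real analysis of rubric scores would likely need.

```python
from typing import List


def quadratic_weighted_kappa(a: List[int], b: List[int], num_levels: int) -> float:
    """Cohen's kappa with quadratic weights, for ratings in 0..num_levels-1.

    Undefined (division by zero) when both raters use a single level throughout.
    """
    n = len(a)
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    for x, y in zip(a, b):
        observed[x][y] += 1.0 / n
    row = [sum(observed[i]) for i in range(num_levels)]
    col = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]
    disagreement_obs = 0.0
    disagreement_exp = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            disagreement_obs += weight * observed[i][j]
            disagreement_exp += weight * row[i] * col[j]
    return 1.0 - disagreement_obs / disagreement_exp


def kendalls_w(rank_rows: List[List[int]]) -> float:
    """Kendall's coefficient of concordance for m rank vectors over n items (no ties)."""
    m = len(rank_rows)
    n = len(rank_rows[0])
    totals = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_total = sum(totals) / n
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12.0 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical scores on a four-level rubric (0..3).
human = [0, 1, 2, 3]
model = [3, 2, 1, 0]
print(quadratic_weighted_kappa(human, human, num_levels=4))
print(quadratic_weighted_kappa(human, model, num_levels=4))

# Hypothetical rankings of four essays across prompt replications.
print(kendalls_w([[1, 2, 3, 4]] * 3))
print(kendalls_w([[1, 2, 3, 4], [4, 3, 2, 1]]))
```

Quadratic weights penalize a disagreement by the squared distance between the two scores, so a one-point difference costs far less than a three-point one. Kendall's W ranges from 0 (no concordance) to 1 (identical rankings), which is why a median below 0.30 indicates weak stability.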
[AI-19] Multimodal Large Language Models for End-to-End Affective Computing: Benchmarking and Boosting with Generative Knowledge Prompting
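【Quick Read】: This paper addresses two gaps in Multimodal Affective Computing (MAC): model performance varies across complex tasks, and the influence of architectural design and data characteristics on affective analysis is poorly understood. The solution has two parts. First, a systematic benchmark evaluates current open-source Multimodal Large Language Models (MLLMs) on several established MAC datasets and relates performance to model architecture and dataset properties. Second, the authors propose a hybrid strategy that combines generative knowledge prompting with supervised fine-tuning, which significantly improves MLLM performance across MAC tasks.
Link: https://arxiv.org/abs/2508.02429
Authors: Miaosen Luo,Jiesen Long,Zequn Li,Yunying Yang,Yuncheng Jiang,Sijie Mai
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: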
Abstract:Multimodal Affective Computing (MAC) aims to recognize and interpret human emotions by integrating information from diverse modalities such as text, video, and audio. Recent advancements in Multimodal Large Language Models (MLLMs) have significantly reshaped the landscape of MAC by offering a unified framework for processing and aligning cross-modal information. However, practical challenges remain, including performance variability across complex MAC tasks and insufficient understanding of how architectural designs and data characteristics impact affective analysis. To address these gaps, we conduct a systematic benchmark evaluation of state-of-the-art open-source MLLMs capable of concurrently processing audio, visual, and textual modalities across multiple established MAC datasets. Our evaluation not only compares the performance of these MLLMs but also provides actionable insights into model optimization by analyzing the influence of model architectures and dataset properties. Furthermore, we propose a novel hybrid strategy that combines generative knowledge prompting with supervised fine-tuning to enhance MLLMs’ affective computing capabilities. Experimental results demonstrate that this integrated approach significantly improves performance across various MAC tasks, offering a promising avenue for future research and development in this field. Our code is released on this https URL.
[AI-20] CABENCH: Benchmarking Composable AI for Solving Complex Tasks through Composing Ready-to-Use Models
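【Quick Read】: This paper addresses the lack of systematic evaluation methods for composable AI. Composable AI decomposes a complex task into sub-tasks and solves each with a ready-to-use pretrained model, an approach that is scalable and effective but whose evaluation remains largely unexplored. The authors introduce CABENCH, the first public benchmark for this setting, with 70 realistic composable AI tasks and a curated pool of 700 models across multiple modalities and domains, along with an end-to-end evaluation framework. The key to the solution is a standardized task and model pool combined with a unified evaluation procedure, which provides baselines and a means of validation for methods that automatically generate effective execution pipelines and supports the move from hand-designed to automated composable AI solutions.
Link: https://arxiv.org/abs/2508.02427
Authors: Tung-Thuy Pham,Duy-Quan Luong,Minh-Quan Duong,Trung-Hieu Nguyen,Thu-Trang Nguyen,Son Nguyen,Hieu Dinh Vo
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: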
Abstract:Composable AI offers a scalable and effective paradigm for tackling complex AI tasks by decomposing them into sub-tasks and solving each sub-task using ready-to-use well-trained models. However, systematically evaluating methods under this setting remains largely unexplored. In this paper, we introduce CABENCH, the first public benchmark comprising 70 realistic composable AI tasks, along with a curated pool of 700 models across multiple modalities and domains. We also propose an evaluation framework to enable end-to-end assessment of composable AI solutions. To establish initial baselines, we provide human-designed reference solutions and compare their performance with two LLM-based approaches. Our results illustrate the promise of composable AI in addressing complex real-world problems while highlighting the need for methods that can fully unlock its potential by automatically generating effective execution pipelines.
[AI-21] Multi-Class Human/Object Detection on Robot Manipulators using Proprioceptive Sensing
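【Quick Read】: This paper addresses the accurate identification of the contacted object in physical Human-Robot Collaboration (pHRC), with the aim of improving safety and task coordination. Earlier work used binary machine learning classifiers to distinguish soft from hard objects, which is too coarse for detailed contact analysis. The key to the solution is a three-class model (human, soft object, hard object). Using time-series data collected on a Franka Emika Panda robot manipulator, the authors evaluate deep learning architectures including LSTM, GRU, and Transformer, and find that a sliding-window preprocessing strategy works best; the best model reaches 91.11% accuracy in real-time testing, demonstrating the feasibility of multi-class contact detection.
Link: https://arxiv.org/abs/2508.02425
Authors: Justin Hehli,Marco Heiniger,Maryam Rezayati,Hans Wernher van de Venn
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: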
Abstract:In physical human-robot collaboration (pHRC) settings, humans and robots collaborate directly in shared environments. Robots must analyze interactions with objects to ensure safety and facilitate meaningful workflows. One critical aspect is human/object detection, where the contacted object is identified. Past research introduced binary machine learning classifiers to distinguish between soft and hard objects. This study improves upon those results by evaluating three-class human/object detection models, offering more detailed contact analysis. A dataset was collected using the Franka Emika Panda robot manipulator, exploring preprocessing strategies for time-series analysis. Models including LSTM, GRU, and Transformers were trained on these datasets. The best-performing model achieved 91.11% accuracy during real-time testing, demonstrating the feasibility of multi-class detection models. Additionally, a comparison of preprocessing strategies suggests a sliding window approach is optimal for this task.
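The sliding-window preprocessing mentioned in the abstract can be sketched in a few lines of Python. The window length, stride, and two-channel signal below are assumptions chosen for illustration; the abstract does not state the values or the exact proprioceptive signals the authors used.

```python
from typing import List, Sequence


def sliding_windows(samples: Sequence[Sequence[float]],
                    window: int,
                    stride: int) -> List[List[Sequence[float]]]:
    """Split a multichannel time series into overlapping fixed-length windows.

    Each element of `samples` is one time step holding one value per channel.
    Trailing samples that do not fill a complete window are dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [list(samples[start:start + window])
            for start in range(0, len(samples) - window + 1, stride)]


# Hypothetical 10-step recording with two channels per step.
samples = [[t, 2 * t] for t in range(10)]
windows = sliding_windows(samples, window=4, stride=2)
print(len(windows))
print(windows[0])
```

Each window becomes one input sequence for a model such as an LSTM, GRU, or Transformer. A stride smaller than the window makes consecutive windows overlap, which yields more training examples and lets a real-time classifier update its prediction before a full new window has arrived.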
[AI-22] Emergence of Fair Leaders via Mediators in Multi-Agent Reinforcement Learning ECAI2025
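【Quick Read】: This paper addresses the unfairness that can arise from leader selection in multi-agent reinforcement learning: when self-interested agents care only about their own goals and rewards, role assignment in a traditional Stackelberg game can produce unfair outcomes. The core question is who should lead, and when, in settings where leader and follower roles are interchangeable, so that agents' returns are fair. The key to the solution is a mediator-based multi-agent reinforcement learning framework in which the mediator exercises minimal control (it only selects the leader); its presence leads self-interested agents to take fair actions and yields higher overall fairness in agents' returns.
Link: https://arxiv.org/abs/2508.02421
Authors: Akshay Dodwadmath,Setareh Maghsudi
Affiliation: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ECAI 2025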
Abstract:Stackelberg games and their resulting equilibria have received increasing attention in the multi-agent reinforcement learning literature. Each stage of a traditional Stackelberg game involves a leader(s) acting first, followed by the followers. In situations where the roles of leader(s) and followers can be interchanged, the designated role can have considerable advantages, for example, in first-mover advantage settings. Then the question arises: Who should be the leader and when? A bias in the leader selection process can lead to unfair outcomes. This problem is aggravated if the agents are self-interested and care only about their goals and rewards. We formally define this leader selection problem and show its relation to fairness in agents’ returns. Furthermore, we propose a multi-agent reinforcement learning framework that maximizes fairness by integrating mediators. Mediators have previously been used in the simultaneous action setting with varying levels of control, such as directly performing agents’ actions or just recommending them. Our framework integrates mediators in the Stackelberg setting with minimal control (leader selection). We show that the presence of mediators leads to self-interested agents taking fair actions, resulting in higher overall fairness in agents’ returns.
[AI-23] Inference-time Scaling for Diffusion-based Audio Super-resolution
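【Quick Read】: This paper addresses a core limitation of Diffusion Models in Audio Super-Resolution (SR): the stochastic sampling process produces high-variance, quality-limited outputs. Existing methods typically add sampling steps to improve quality but cannot overcome this inherent ceiling. The key to the solution is a new paradigm of inference-time scaling, which explores multiple solution trajectories during sampling. Task-specific verifiers are paired with two search algorithms (random search and zero-order search), and these verifier-algorithm combinations actively guide exploration of the high-dimensional solution space toward more robust, higher-quality reconstructions. The method yields consistent gains across speech, music, and sound effects and across frequency ranges; for speech SR from 4kHz to 24kHz, the spectral distance metric improves by up to 46.98%.
Link: https://arxiv.org/abs/2508.02391
Authors: Yizhu Jin,Zhen Ye,Zeyue Tian,Haohe Liu,Qiuqiang Kong,Yike Guo,Wei Xue
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: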
Abstract:Diffusion models have demonstrated remarkable success in generative tasks, including audio super-resolution (SR). In many applications like movie post-production and album mastering, substantial computational budgets are available for achieving superior audio quality. However, while existing diffusion approaches typically increase sampling steps to improve quality, the performance remains fundamentally limited by the stochastic nature of the sampling process, leading to high-variance and quality-limited outputs. Here, rather than simply increasing the number of sampling steps, we propose a different paradigm through inference-time scaling for SR, which explores multiple solution trajectories during the sampling process. Different task-specific verifiers are developed, and two search algorithms, including the random search and zero-order search for SR, are introduced. By actively guiding the exploration of the high-dimensional solution space through verifier-algorithm combinations, we enable more robust and higher-quality outputs. Through extensive validation across diverse audio domains (speech, music, sound effects) and frequency ranges, we demonstrate consistent performance gains, achieving improvements of up to 9.70% in aesthetics, 5.88% in speaker similarity, 15.20% in word error rate, and 46.98% in spectral distance for speech SR from 4kHz to 24kHz, showcasing the effectiveness of our approach. Audio samples are available at: this https URL.
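The random-search variant of inference-time scaling amounts to drawing several candidates and keeping the one a verifier scores highest. The Python sketch below shows that loop under loud assumptions: `toy_sampler` is a hypothetical stand-in for one stochastic diffusion sampling run, and `smoothness_verifier` is a toy scoring function, not one of the paper's task-specific verifiers.

```python
import math
import random
from typing import Callable, List, Tuple


def random_search(sample: Callable[[random.Random], List[float]],
                  verifier: Callable[[List[float]], float],
                  num_candidates: int,
                  seed: int = 0) -> Tuple[List[float], float]:
    """Draw independent candidates and return the one with the highest verifier score."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    rng = random.Random(seed)
    best = sample(rng)
    best_score = verifier(best)
    for _ in range(num_candidates - 1):
        candidate = sample(rng)
        score = verifier(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def toy_sampler(rng: random.Random) -> List[float]:
    """Hypothetical sampler: a sine wave plus Gaussian noise."""
    return [math.sin(0.3 * i) + rng.gauss(0.0, 0.2) for i in range(32)]


def smoothness_verifier(signal: List[float]) -> float:
    """Toy verifier: smaller sample-to-sample jumps give a higher score."""
    return -sum((b - a) ** 2 for a, b in zip(signal, signal[1:]))


_, score_1 = random_search(toy_sampler, smoothness_verifier, num_candidates=1)
_, score_16 = random_search(toy_sampler, smoothness_verifier, num_candidates=16)
print(score_1, score_16)
```

Because both calls share a seed, the 16-candidate search sees the single-candidate result as its first draw, so its best score can never be lower. The cost scales linearly with the number of candidates, which fits the abstract's point that settings such as post-production can afford a large compute budget.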
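[AI-24] Traffic-R1: Reinforced LLMs Bring Human-Like Reasoning to Traffic Signal Control Systems
【Quick Read】: This paper addresses the weak generalization, high computational cost, and poor interpretability of Traffic Signal Control (TSC) systems in complex urban environments. Traditional Reinforcement Learning (RL) methods need large amounts of training data and transfer poorly to new scenarios, while existing methods based on Large Language Models (LLMs) lack real-time inference and multi-intersection coordination. The key to the solution is Traffic-R1, a foundation model for TSC with human-like reasoning. Its main contributions are: an internal traffic control policy built through expert-guided self-exploration and iterative training of reinforced LLMs; a lightweight 3B-parameter architecture that supports real-time inference on mobile-class chips; and a self-iteration mechanism together with a new synchronous communication network that improve cross-intersection coordination and make decisions explainable. Together these enable zero-shot generalization without retraining and large-scale edge deployment.
Link: https://arxiv.org/abs/2508.02344
Authors: Xingchen Zou,Yuhao Yang,Zheng Chen,Xixuan Hao,Yiqi Chen,Chao Huang,Yuxuan Liang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: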
Abstract:Traffic signal control (TSC) is vital for mitigating congestion and sustaining urban mobility. In this paper, we introduce Traffic-R1, a foundation model with human-like reasoning for TSC systems. Our model is developed through self-exploration and iteration of reinforced large language models (LLMs) with expert guidance in a simulated traffic environment. Compared to traditional reinforcement learning (RL) and recent LLM-based methods, Traffic-R1 offers three significant advantages. First, Traffic-R1 delivers zero-shot generalisation, transferring unchanged to new road networks and out-of-distribution incidents by utilizing its internal traffic control policies and human-like reasoning. Second, its 3B-parameter architecture is lightweight enough for real-time inference on mobile-class chips, enabling large-scale edge deployment. Third, Traffic-R1 provides an explainable TSC process and facilitates multi-intersection communication through its self-iteration and a new synchronous communication network. Extensive benchmarks demonstrate that Traffic-R1 sets a new state of the art, outperforming strong baselines and training-intensive RL controllers. In practice, the model now manages signals for more than 55,000 drivers daily, shortening average queues by over 5% and halving operator workload. Our checkpoint is available at this https URL.
[AI-25] MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models
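【Quick Read】: This paper addresses the inability of current low-precision quantization methods to fully exploit the FP4 Tensor Cores of the NVIDIA Blackwell architecture. Existing INT4-based schemes are limited by a mismatch between their data format and what the hardware supports. The key to the solution is MicroMix, a co-designed mixed-precision quantization algorithm and matrix multiplication kernel based on Microscaling (MX) data formats. The kernel supports arbitrary combinations of MXFP4, MXFP6, and MXFP8 channels and produces BFloat16 outputs. Quantization thresholds identify activation elements where the lower-precision formats would incur excessive quantization error, and higher-precision channels are allocated to them, improving the accuracy-efficiency trade-off of each linear layer while maintaining compute efficiency.
Link: https://arxiv.org/abs/2508.02343
Authors: Wenyuan Liu,Haoqian Meng,Yilun Luo,Peng Zhang,Xindian Ma
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages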
Abstract:Quantization significantly accelerates inference in large language models (LLMs) by replacing original high-precision matrices with low-precision counterparts. Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format. Although the new FP4 Tensor Cores in NVIDIA’s Blackwell architecture offer up to 4x speedup over FP16, existing INT4-based kernels fail to fully exploit this capability due to mismatched data formats. To bridge this gap, we propose MicroMix, a co-designed mixed-precision quantization algorithm and matrix multiplication kernel based on Microscaling (MX) data formats. Tailored for the Blackwell architecture, the MicroMix kernel supports arbitrary combinations of MXFP4, MXFP6, and MXFP8 channels, and produces BFloat16 outputs. To achieve a favorable trade-off between accuracy and efficiency for each linear layer, we introduce quantization thresholds that identify activation elements where lower-precision formats (MXFP4 or MXFP6) incur excessive quantization error. Our algorithm selectively allocates higher-precision channels to preserve accuracy while maintaining compute efficiency. MicroMix achieves competitive or superior performance across diverse downstream tasks, including zero-shot and few-shot learning, language modeling, code generation, and mathematical reasoning. On both consumer-grade (RTX 5070Ti laptop) and server-grade (RTX 5090) GPUs, our kernel delivers at least 20% faster execution than TensorRT-FP8. Furthermore, when applied to various Llama and Qwen models, MicroMix consistently improves prefill latency and memory efficiency across a range of batch sizes compared to TensorRT baselines. Our code is available at this https URL.
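To make the idea of block-scaled FP4 and an error threshold concrete, here is a small Python sketch. It is an illustration under stated assumptions, not MicroMix: the shared power-of-two scale follows my understanding of the usual Microscaling convention, the block has four elements instead of the 32 that MX formats normally use, ties round toward the smaller magnitude rather than to nearest-even, and `needs_higher_precision` with its relative threshold is a hypothetical stand-in for the paper's quantization thresholds.

```python
import math
from typing import List

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_EMAX = 2  # exponent of the largest E2M1 binade, since 4.0 == 2 ** 2


def quantize_block_fp4(block: List[float]) -> List[float]:
    """Quantize one block with a shared power-of-two scale, then dequantize."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - FP4_EMAX)
    out = []
    for x in block:
        scaled = min(abs(x) / scale, FP4_MAGNITUDES[-1])  # saturate at 6.0
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - scaled))
        out.append(math.copysign(nearest * scale, x))
    return out


def needs_higher_precision(block: List[float], rel_threshold: float) -> bool:
    """Flag a block whose worst FP4 error exceeds a fraction of its largest magnitude."""
    dequantized = quantize_block_fp4(block)
    worst = max(abs(x - q) for x, q in zip(block, dequantized))
    return worst > rel_threshold * max(abs(x) for x in block)


easy_block = [1.0, 2.0, -4.0, 0.5]      # every value is exactly representable
outlier_block = [0.1, -0.2, 0.3, 7.0]   # one outlier forces a coarse scale
print(quantize_block_fp4(outlier_block))
print(needs_higher_precision(easy_block, rel_threshold=0.1))
print(needs_higher_precision(outlier_block, rel_threshold=0.1))
```

The outlier block shows why a threshold is useful: the large value sets the shared scale, the small values collapse to zero or to the nearest coarse step, and the outlier itself saturates. A mixed-precision scheme would route such a block to a wider format such as MXFP6 or MXFP8.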
[AI-26] A Survey on Data Security in Large Language Models
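[Quick Read]: This paper addresses the data security risks that arise because Large Language Models (LLMs) are trained on massive, diverse, and largely uncurated data; such risks can lead to harmful outputs, hallucinations, and vulnerability to attacks such as prompt injection and data poisoning. The key to the solution is to systematically identify and categorize the main data security threats facing LLMs and to review current mainstream defense strategies, such as adversarial training, Reinforcement Learning from Human Feedback (RLHF), and data augmentation. It also surveys the datasets used to assess model robustness and security, offering guidance for future research, and stresses the importance of secure model updates, explainability-driven defenses, and effective governance frameworks for the safe and responsible development of LLMs.
Link: https://arxiv.org/abs/2508.02312
Authors: Kang Chen,Xiuze Zhou,Yuanguo Lin,Jinhe Su,Yuanhui Yu,Li Shen,Fan Lin
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: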
Abstract:Large Language Models (LLMs), now a foundation in advancing natural language processing, power applications such as text generation, machine translation, and conversational systems. Despite their transformative potential, these models inherently rely on massive amounts of training data, often collected from diverse and uncurated sources, which exposes them to serious data security risks. Harmful or malicious data can compromise model behavior, leading to issues such as toxic output, hallucinations, and vulnerabilities to threats such as prompt injection or data poisoning. As LLMs continue to be integrated into critical real-world systems, understanding and addressing these data-centric security risks is imperative to safeguard user trust and system reliability. This survey offers a comprehensive overview of the main data security risks facing LLMs and reviews current defense strategies, including adversarial training, RLHF, and data augmentation. Additionally, we categorize and analyze relevant datasets used for assessing robustness and security across different domains, providing guidance for future research. Finally, we highlight key research directions that focus on secure model updates, explainability-driven defenses, and effective governance frameworks, aiming to promote the safe and responsible development of LLM technology. This work aims to inform researchers, practitioners, and policymakers, driving progress toward data security in LLMs.
[AI-27] FinWorld: An All-in-One Open-Source Platform for End-to-End Financial AI Research and Deployment
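[Quick Read]: This paper addresses the limited task coverage, weak multimodal data integration, and insufficient support for training and deploying Large Language Models (LLMs) in current financial AI platforms. The core of the solution is FinWorld, an open-source end-to-end financial AI platform that natively integrates heterogeneous financial data, provides unified support for multiple AI paradigms (such as deep learning and reinforcement learning), and adds advanced agent automation, supporting the full workflow from data acquisition through experimentation to deployment. The platform is systematically evaluated on two representative markets, four stock pools, and more than 800 million financial data points; the results indicate that it markedly improves reproducibility, supports transparent benchmarking, and streamlines deployment, laying a foundation for future financial AI research and applications.
Link: https://arxiv.org/abs/2508.02292
Authors: Wentao Zhang,Yilei Zhao,Chuqiao Zong,Xinrun Wang,Bo An
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: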
Abstract:Financial AI holds great promise for transforming modern finance, with the potential to support a wide range of tasks such as market forecasting, portfolio management, quantitative trading, and automated analysis. However, existing platforms remain limited in task coverage, lack robust multimodal data integration, and offer insufficient support for the training and deployment of large language models (LLMs). In response to these limitations, we present FinWorld, an all-in-one open-source platform that provides end-to-end support for the entire financial AI workflow, from data acquisition to experimentation and deployment. FinWorld distinguishes itself through native integration of heterogeneous financial data, unified support for diverse AI paradigms, and advanced agent automation, enabling seamless development and deployment. Leveraging data from 2 representative markets, 4 stock pools, and over 800 million financial data points, we conduct comprehensive experiments on 4 key financial AI tasks. These experiments systematically evaluate deep learning and reinforcement learning algorithms, with particular emphasis on RL-based finetuning for LLMs and LLM Agents. The empirical results demonstrate that FinWorld significantly enhances reproducibility, supports transparent benchmarking, and streamlines deployment, thereby providing a strong foundation for future research and real-world applications. Code is available on GitHub at this https URL.
[AI-28] Flexible Automatic Identification and Removal (FAIR)-Pruner: An Efficient Neural Network Pruning Method AAAI2026
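[Quick Read]: This paper addresses the excessive compute and memory cost of deploying large neural networks on resource-constrained edge devices; the core challenge is performing structured pruning efficiently so that the model is compressed without a significant loss in performance. The key to the solution is a new method, FAIR-Pruner. First, it quantifies a Utilization Score for each unit (e.g., neuron or channel) using the Wasserstein distance. Second, it introduces a Reconstruction Error, computed from a Taylor expansion of the loss function, to estimate the performance degradation caused by removing a unit. Finally, by controlling a metric called the Tolerance of Difference, it automatically identifies redundant units whose removal has negligible impact on performance. The method needs no hand-set per-layer pruning rates, determines layer-wise pruning ratios adaptively, and yields a high-performing compressed model after a single pruning pass without subsequent fine-tuning.
Link: https://arxiv.org/abs/2508.02291
Authors: Chenqing Lin,Mostafa Hussien,Chengyao Yu,Mohamed Cheriet,Osama Abdelrahman,Ruixing Ming
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to AAAI 2026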
Abstract:Neural network pruning is a critical compression technique that facilitates the deployment of large-scale neural networks on resource-constrained edge devices, typically by identifying and eliminating redundant or insignificant parameters to reduce computational and memory overhead. This paper proposes the Flexible Automatic Identification and Removal (FAIR)-Pruner, a novel method for neural network structured pruning. Specifically, FAIR-Pruner first evaluates the importance of each unit (e.g., neuron or channel) through the Utilization Score quantified by the Wasserstein distance. To reflect the performance degradation after unit removal, it then introduces the Reconstruction Error, which is computed via the Taylor expansion of the loss function. Finally, FAIR-Pruner identifies superfluous units with negligible impact on model performance by controlling the proposed Tolerance of Difference, which measures differences between unimportant units and those that cause performance degradation. A major advantage of FAIR-Pruner lies in its capacity to automatically determine the layer-wise pruning rates, which yields a more efficient subnetwork structure compared to applying a uniform pruning rate. Another advantage of the FAIR-Pruner is its great one-shot performance without post-pruning fine-tuning. Furthermore, with utilization scores and reconstruction errors, users can flexibly obtain pruned models under different pruning ratios. Comprehensive experimental validation on diverse benchmark datasets (e.g., ImageNet) and various neural network architectures (e.g., VGG) demonstrates that FAIR-Pruner achieves significant model compression while maintaining high accuracy.
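As a rough illustration of a Wasserstein-based unit score, the Python sketch below computes the 1-D Wasserstein-1 distance between two equal-size samples of a unit's activations. It assumes, purely for illustration, that the score compares a unit's activations on inputs from two different classes; the abstract does not say which distributions FAIR-Pruner compares, and the names `wasserstein_1d` and `activations` are hypothetical.

```python
def wasserstein_1d(sample_a, sample_b):
    """1-D Wasserstein-1 distance between two equal-size empirical samples."""
    if len(sample_a) != len(sample_b) or not sample_a:
        raise ValueError("samples must be non-empty and of equal size")
    # For equal-size samples the optimal coupling matches sorted values.
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


# Toy activations of two units on inputs from two classes (made-up numbers).
activations = {
    "unit_0": ([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]),
    "unit_1": ([0.5, 0.5, 0.6], [0.6, 0.5, 0.5]),
}
scores = {unit: wasserstein_1d(a, b) for unit, (a, b) in activations.items()}
# Units sorted from least to most class-sensitive under this assumed score.
ranking = sorted(scores, key=scores.get)
```

Under this assumption, a unit whose activation distribution does not change between classes scores zero and is ranked first as a pruning candidate; the paper's Reconstruction Error and Tolerance of Difference are not modelled here.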
[AI-29] AirTrafficGen: Configurable Air Traffic Scenario Generation with Large Language Models
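[Quick Read]: This paper addresses the fact that manually designing scenarios for Air Traffic Control (ATC) training is time-consuming and yields limited diversity, which restricts the breadth and depth of simulation training. The key to the solution is AirTrafficGen, an end-to-end automated approach that uses Large Language Models (LLMs) to generate complex, operationally realistic ATC scenarios. It encodes sector topology (airspace geometry, routes, and fixes) into an LLM-readable format through a purpose-built graph-based representation, uses engineered prompting for fine-grained control over the presence, type, and location of interactions, and shows an ability to iteratively correct scenarios from textual feedback, offering a scalable alternative that meets the volume and variety needs of ATC training and validation simulations.
Link: https://arxiv.org/abs/2508.02269
Authors: Dewi Sid William Gould,George De Ath,Ben Carvell,Nick Pepper
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 7 pages and appendices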
Abstract:The manual design of scenarios for Air Traffic Control (ATC) training is a demanding and time-consuming bottleneck that limits the diversity of simulations available to controllers. To address this, we introduce a novel, end-to-end approach, AirTrafficGen, that leverages large language models (LLMs) to automate and control the generation of complex ATC scenarios. Our method uses a purpose-built, graph-based representation to encode sector topology (including airspace geometry, routes, and fixes) into a format LLMs can process. Through rigorous benchmarking, we show that state-of-the-art models like Gemini 2.5 Pro and OpenAI o3 can generate high-traffic scenarios whilst maintaining operational realism. Our engineered prompting enables fine-grained control over interaction presence, type, and location. Initial findings suggest these models are also capable of iterative refinement, correcting flawed scenarios based on simple textual feedback. This approach provides a scalable alternative to manual scenario design, addressing the need for a greater volume and variety of ATC training and validation simulations. More broadly, this work showcases the potential of LLMs for complex planning in safety-critical domains.
[AI-30] StutterCut: Uncertainty-Guided Normalised Cut for Dysfluency Segmentation INTERSPEECH2025
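[Quick Read]: This paper addresses the limitation that dysfluency detection typically classifies whole utterances and cannot localize where stuttering begins, i.e., how to achieve frame-level segmentation of dysfluencies. The key to the solution is the StutterCut framework, which formulates dysfluency segmentation as a graph partitioning problem: speech embeddings from overlapping windows serve as graph nodes, and the connections between nodes are refined by a pseudo-oracle classifier trained on weak (utterance-level) labels. An uncertainty measure from Monte Carlo dropout controls how strongly the pseudo-labels influence the graph, improving segmentation accuracy. The study also extends the FluencyBank dataset with frame-level boundary annotations for four dysfluency types, providing a more reliable benchmark for evaluation in realistic settings.
Link: https://arxiv.org/abs/2508.02255
Authors: Suhita Ghosh,Melanie Jouaiti,Jan-Ole Perschewski,Sebastian Stober
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted in Interspeech 2025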
Abstract:Detecting and segmenting dysfluencies is crucial for effective speech therapy and real-time feedback. However, most methods only classify dysfluencies at the utterance level. We introduce StutterCut, a semi-supervised framework that formulates dysfluency segmentation as a graph partitioning problem, where speech embeddings from overlapping windows are represented as graph nodes. We refine the connections between nodes using a pseudo-oracle classifier trained on weak (utterance-level) labels, with its influence controlled by an uncertainty measure from Monte Carlo dropout. Additionally, we extend the weakly labelled FluencyBank dataset by incorporating frame-level dysfluency boundaries for four dysfluency types. This provides a more realistic benchmark compared to synthetic datasets. Experiments on real and synthetic datasets show that StutterCut outperforms existing methods, achieving higher F1 scores and more precise stuttering onset detection.
[AI-31] FinCPRG: A Bidirectional Generation Pipeline for Hierarchical Queries and Rich Relevance in Financial Chinese Passage Retrieval
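[Quick Read]: This paper addresses two core problems that Large Language Models (LLMs) face when used to construct passage retrieval datasets: cross-document (cross-doc) query needs are hard to express, and annotation quality is hard to control. The key to the solution is a bidirectional generation pipeline that generates three-level hierarchical queries covering both intra-doc and cross-doc scenarios, and enriches relevance labels through direct mapping annotation plus indirect positives mining. The pipeline combines a bottom-up strategy (extracting structured sentence-level and passage-level queries from single-document text) with a top-down strategy (clustering multi-document titles by three financial elements, namely industry, topic, and time, and generating topic-level queries per cluster), systematically improving query expressiveness and annotation quality and producing the Financial Passage Retrieval Generated dataset (FinCPRG).
Link: https://arxiv.org/abs/2508.02222
Authors: Xuan Xu,Beilin Chu,Qinhong Lin,Yixiao Zhong,Fufang Wen,Jiaqi Liu,Binjie Fei,Yu Li,Zhongliang Yang,Linna Zhou
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: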
Abstract:In recent years, large language models (LLMs) have demonstrated significant potential in constructing passage retrieval datasets. However, existing methods still face limitations in expressing cross-doc query needs and controlling annotation quality. To address these issues, this paper proposes a bidirectional generation pipeline, which aims to generate 3-level hierarchical queries for both intra-doc and cross-doc scenarios and mine additional relevance labels on top of direct mapping annotation. The pipeline introduces two query generation methods: bottom-up from single-doc text and top-down from multi-doc titles. The bottom-up method uses LLMs to disassemble and generate structured queries at both sentence-level and passage-level simultaneously from intra-doc passages. The top-down approach incorporates three key financial elements–industry, topic, and time–to divide report titles into clusters and prompts LLMs to generate topic-level queries from each cluster. For relevance annotation, our pipeline not only relies on direct mapping annotation from the generation relationship but also implements an indirect positives mining method to enrich the relevant query-passage pairs. Using this pipeline, we constructed a Financial Passage Retrieval Generated dataset (FinCPRG) from almost 1.3k Chinese financial research reports, which includes hierarchical queries and rich relevance labels. Through evaluations of mined relevance labels, benchmarking and training experiments, we assessed the quality of FinCPRG and validated its effectiveness as a passage retrieval dataset for both training and benchmarking.
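[AI-32] Balancing Information Accuracy and Response Timeliness in Networked LLMs
[Quick Read]: This paper addresses the practical deployment challenges of Large Language Models (LLMs) caused by their heavy requirements for training data, computational resources, and energy. The proposed solution is a networked LLM system in which multiple smaller, topic-specialized LLMs work together, with a central task processor routing users' binary classification queries and aggregating the responses. The key contribution is a system architecture consisting of multiple users, a central task processor, and clusters of topic-specialized LLMs that preserves response timeliness while aggregating the outputs of several models to raise overall answer accuracy, with the largest gains when the participating models have similar standalone performance.
Link: https://arxiv.org/abs/2508.02209
Authors: Yigit Turkmen,Baturalp Buyukates,Melih Bastopcu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
Comments: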
Abstract:Recent advancements in Large Language Models (LLMs) have transformed many fields including scientific discovery, content generation, biomedical text mining, and educational technology. However, the substantial requirements for training data, computational resources, and energy consumption pose significant challenges for their practical deployment. A promising alternative is to leverage smaller, specialized language models and aggregate their outputs to improve overall response quality. In this work, we investigate a networked LLM system composed of multiple users, a central task processor, and clusters of topic-specialized LLMs. Each user submits categorical binary (true/false) queries, which are routed by the task processor to a selected cluster of m LLMs. After gathering individual responses, the processor returns a final aggregated answer to the user. We characterize both the information accuracy and response timeliness in this setting, and formulate a joint optimization problem to balance these two competing objectives. Our extensive simulations demonstrate that the aggregated responses consistently achieve higher accuracy than those of individual LLMs. Notably, this improvement is more significant when the participating LLMs exhibit similar standalone performance.
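The accuracy side of this trade-off can be illustrated with a small Python sketch. It assumes that the processor aggregates by strict majority vote and that the LLMs answer independently; both are assumptions made here for illustration, since the abstract does not specify the aggregation rule. The function name `majority_vote_accuracy` is hypothetical.

```python
from itertools import product


def majority_vote_accuracy(accuracies):
    """Probability that a strict majority of independent voters is correct."""
    m = len(accuracies)
    if m % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    total = 0.0
    # Enumerate every correct/incorrect pattern across the m voters.
    for outcome in product([True, False], repeat=m):
        if sum(outcome) > m // 2:
            prob = 1.0
            for correct, p in zip(outcome, accuracies):
                prob *= p if correct else 1.0 - p
            total += prob
    return total


similar = majority_vote_accuracy([0.7, 0.7, 0.7])   # three comparable models
uneven = majority_vote_accuracy([0.9, 0.6, 0.6])    # one strong, two weaker
```

With three voters at 0.7 each, the aggregate is 0.784, above every individual model. With accuracies 0.9, 0.6, and 0.6 (the same mean), the aggregate is 0.792, which is below the strongest individual model. Under these toy assumptions, this matches the abstract's remark that the gain is more significant when the participating LLMs perform similarly.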
[AI-33] A Message Passing Realization of Expected Free Energy Minimization
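[Quick Read]: This paper addresses how to perform policy inference and decision-making efficiently in environments with epistemic uncertainty. Conventional KL-control methods are unreliable in such environments, struggling to plan robustly and explore effectively. The key to the solution is to recast Expected Free Energy (EFE) minimization as Variational Free Energy minimization with epistemic priors, turning a combinatorial search problem into a tractable inference problem that standard variational inference techniques can solve. Applying message passing on factor graphs, the framework performs efficient policy inference for factorized state-space models; on a stochastic gridworld and a partially observable Minigrid task it shows stronger risk avoidance and more systematic information-seeking behavior.
Link: https://arxiv.org/abs/2508.02197
Authors: Wouter W. L. Nuijten,Mykola Lukashchuk,Thijs van de Laar,Bert de Vries
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: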
Abstract:We present a message passing approach to Expected Free Energy (EFE) minimization on factor graphs, based on the theory introduced in arXiv:2504.14898. By reformulating EFE minimization as Variational Free Energy minimization with epistemic priors, we transform a combinatorial search problem into a tractable inference problem solvable through standard variational techniques. Applying our message passing method to factorized state-space models enables efficient policy inference. We evaluate our method on environments with epistemic uncertainty: a stochastic gridworld and a partially observable Minigrid task. Agents using our approach consistently outperform conventional KL-control agents on these tasks, showing more robust planning and efficient exploration under uncertainty. In the stochastic gridworld environment, EFE-minimizing agents avoid risky paths, while in the partially observable minigrid setting, they conduct more systematic information-seeking. This approach bridges active inference theory with practical implementations, providing empirical evidence for the efficiency of epistemic priors in artificial agents.
[AI-34] Neuromorphic Computing with Multi-Frequency Oscillations: A Bio-Inspired Approach to Artificial Intelligence
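[Quick Read]: This paper addresses the limited flexible, generalizable intelligence of current artificial neural networks, which it attributes to a fundamental divergence from biological cognition: existing models overlook the functional specialization of neural regions and the temporal dynamics that coordinate these specialized systems. The key to the solution is a tripartite brain-inspired architecture comprising functionally specialized perceptual, auxiliary, and executive systems, with temporal dynamics integrated by simulating multi-frequency neural oscillations and synaptic dynamic adaptation mechanisms, making artificial cognition more flexible and efficient.
Link: https://arxiv.org/abs/2508.02191
Authors: Boheng Liu,Ziyu Li,Xia Wu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: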
Abstract:Despite remarkable capabilities, artificial neural networks exhibit limited flexible, generalizable intelligence. This limitation stems from their fundamental divergence from biological cognition that overlooks both neural regions’ functional specialization and the temporal dynamics critical for coordinating these specialized systems. We propose a tripartite brain-inspired architecture comprising functionally specialized perceptual, auxiliary, and executive systems. Moreover, the integration of temporal dynamics through the simulation of multi-frequency neural oscillation and synaptic dynamic adaptation mechanisms enhances the architecture, thereby enabling more flexible and efficient artificial cognition. Initial evaluations demonstrate superior performance compared to state-of-the-art temporal processing approaches, with 2.18% accuracy improvements while reducing required computation iterations by 48.44%, and achieving higher correlation with human confidence patterns. Though currently demonstrated on visual processing tasks, this architecture establishes a theoretical foundation for brain-like intelligence across cognitive domains, potentially bridging the gap between artificial and biological intelligence.
[AI-35] FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation ICCV2025
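[Quick Read]: This paper addresses the privacy and security concerns raised by training Vision-Language-Action (VLA) models for robotic manipulation on large-scale user-specific data, which limit their broader adoption. The key to the solution is FedVLA, the first federated VLA learning framework, which uses task-aware representation learning, adaptive expert selection, and expert-driven federated aggregation to enable distributed training that protects data privacy without sacrificing performance. Its core innovations are: 1) an Instruction Oriented Scene-Parsing mechanism that enhances object-level features based on task instructions to improve contextual understanding; 2) a Dual Gating Mixture-of-Experts (DGMoE) mechanism in which both input tokens and self-aware experts adaptively decide activation, significantly improving computational efficiency; and 3) a server-side aggregation strategy guided by activated experts, ensuring effective cross-client knowledge fusion. Experiments show that FedVLA achieves task success rates comparable to centralized training on real robotic tasks while preserving data privacy.
Link: https://arxiv.org/abs/2508.02190
Authors: Cui Miao,Tao Chang,Meihan Wu,Hongbin Xu,Chun Li,Ming Li,Xiaodong Wang
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted by ICCV 2025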
Abstract:Vision-language-action (VLA) models have significantly advanced robotic manipulation by enabling robots to interpret language instructions for task execution. However, training these models often relies on large-scale user-specific data, raising concerns about privacy and security, which in turn limits their broader adoption. To address this, we propose FedVLA, the first federated VLA learning framework, enabling distributed model training that preserves data privacy without compromising performance. Our framework integrates task-aware representation learning, adaptive expert selection, and expert-driven federated aggregation, enabling efficient and privacy-preserving training of VLA models. Specifically, we introduce an Instruction Oriented Scene-Parsing mechanism, which decomposes and enhances object-level features based on task instructions, improving contextual understanding. To effectively learn diverse task patterns, we design a Dual Gating Mixture-of-Experts (DGMoE) mechanism, where not only input tokens but also self-aware experts adaptively decide their activation. Finally, we propose an Expert-Driven Aggregation strategy at the federated server, where model aggregation is guided by activated experts, ensuring effective cross-client knowledge transfer. Extensive simulations and real-world robotic experiments demonstrate the effectiveness of our proposals. Notably, DGMoE significantly improves computational efficiency compared to its vanilla counterpart, while FedVLA achieves task success rates comparable to centralized training, effectively preserving data privacy.
[AI-36] Reconsidering Overthinking: Penalizing Internal and External Redundancy in CoT Reasoning
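[Quick Read]: This paper addresses overthinking in Large Reasoning Models (LRMs), where the reasoning process produces verbose, inefficient Chain-of-Thought traces that hurt computational efficiency and interpretability. Its core idea is to decompose overthinking into two forms: internal redundancy, meaning low-contribution reasoning steps within the First Correct Solution (FCS), and external redundancy, meaning unnecessary continued reasoning after the FCS. The authors propose a dual-penalty reinforcement learning framework: internal redundancy is penalized through sliding-window semantic analysis to reduce low-gain steps, and external redundancy is suppressed by penalizing the proportion of reasoning that follows the FCS, encouraging earlier termination. Experiments show that the method substantially compresses reasoning traces with almost no loss in accuracy and generalizes to out-of-domain tasks such as question answering and code generation. Notably, external redundancy can be removed safely, whereas internal redundancy must be handled cautiously to preserve correctness. The mechanism provides implicit, semantic-aware control over reasoning-chain length, supporting more concise and interpretable LRMs.
Link: https://arxiv.org/abs/2508.02178
Authors: Jialiang Hong,Taihang Zhen,Kai Chen,Jiaheng Liu,Wenpeng Zhu,Jing Huo,Yang Gao,Depeng Wang,Haitao Wan,Xi Yang,Boyan Wang,Fanyu Meng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: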
Abstract:Large Reasoning Models (LRMs) often produce excessively verbose reasoning traces, a phenomenon known as overthinking, which hampers both efficiency and interpretability. Prior works primarily address this issue by reducing response length, without fully examining the underlying semantic structure of the reasoning process. In this paper, we revisit overthinking by decomposing it into two distinct forms: internal redundancy, which consists of low-contribution reasoning steps within the first correct solution (FCS), and external redundancy, which refers to unnecessary continuation after the FCS. To mitigate both forms, we propose a dual-penalty reinforcement learning framework. For internal redundancy, we adopt a sliding-window semantic analysis to penalize low-gain reasoning steps that contribute little toward reaching the correct answer. For external redundancy, we penalize its proportion beyond the FCS to encourage earlier termination. Our method significantly compresses reasoning traces with minimal accuracy loss, and generalizes effectively to out-of-domain tasks such as question answering and code generation. Crucially, we find that external redundancy can be safely removed without degrading performance, whereas internal redundancy must be reduced more cautiously to avoid impairing correctness. These findings suggest that our method not only improves reasoning efficiency but also enables implicit, semantic-aware control over Chain-of-Thought length, paving the way for more concise and interpretable LRMs.
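The external-redundancy term lends itself to a short Python sketch. It assumes a reasoning trace already split into steps, with a token count per step and a flag saying whether the correct answer has been reached by the end of that step. The helper names, the flag representation, and the penalty weight are hypothetical; the paper's exact reward formulation is not given in the abstract.

```python
def external_redundancy_ratio(step_tokens, step_is_correct):
    """Fraction of tokens generated after the first correct solution (FCS)."""
    total = sum(step_tokens)
    if total == 0:
        return 0.0
    for index, correct in enumerate(step_is_correct):
        if correct:
            # Tokens up to and including the first correct step form the FCS.
            fcs_tokens = sum(step_tokens[: index + 1])
            return (total - fcs_tokens) / total
    return 0.0  # no correct step, so nothing counts as external redundancy


def shaped_reward(base_reward, ratio, penalty_weight=0.5):
    """Illustrative shaping: subtract a penalty proportional to the ratio."""
    return base_reward - penalty_weight * ratio


ratio = external_redundancy_ratio([10, 20, 30, 40], [False, True, True, True])
reward = shaped_reward(1.0, ratio)
```

In this toy trace the answer is first correct after 30 of 100 tokens, so 70% of the trace counts as external redundancy and the reward is reduced accordingly. Internal redundancy, which the paper handles with sliding-window semantic analysis, is not modelled here.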
[AI-37] Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models Instruction Following
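[Quick Read]: This paper addresses the observation that reasoning models show strong reasoning ability on complex problems but markedly weaker instruction following abilities. Existing methods for improving instruction following depend on stronger external models, creating methodological bottlenecks and practical limitations such as higher cost and restricted accessibility. The key to the solution is a self-supervised reinforcement learning (self-supervised RL) framework that trains on the reasoning model's own internal signals, improving instruction following without external supervision while maintaining reasoning performance, and thus offering a scalable, low-cost path.
Link: https://arxiv.org/abs/2508.02150
Authors: Qingyu Ren,Qianyu He,Bowei Zhang,Jie Zeng,Jiaqing Liang,Yanghua Xiao,Weikang Zhou,Zeye Sun,Fei Yu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: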
Abstract:Reasoning models excel in complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models’ own internal signals to improve instruction following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction following capabilities while maintaining reasoning performance, offering a scalable and cost-effective approach to enhance instruction following in reasoning models. The data and code are publicly available at this https URL.
[AI-38] Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation
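[Quick Read]: This paper addresses the high computational complexity and resource consumption of deploying Large-scale Models (LSMs) directly in Semantic Communication (SC) systems, while ensuring robustness to channel noise. The key to the solution is a Robust Knowledge Distillation-based Semantic Communication (RKD-SC) framework. First, a knowledge distillation-based lightweight differentiable architecture search algorithm (KDL-DARTS) adds a knowledge distillation loss and a complexity penalty to the architecture search to discover high-performance, lightweight semantic encoder structures. Second, a two-stage Robust Knowledge Distillation (RKD) mechanism transfers knowledge from the LSM teacher to a compact student encoder and then strengthens the system's resistance to channel impairments. In addition, a Channel-aware Transformer (CAT) block serves as the channel codec; it is trained under diverse channel conditions with variable-length outputs, further improving robustness in practice.
Link: https://arxiv.org/abs/2508.02148
Authors: Kuiyuan Ding,Caili Guo,Yang Yang,Zhongtian Du,Walid Saad
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: 13 pages, 8 figures, 3 tables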
Abstract:Large-scale models (LSMs) can be an effective framework for semantic representation and understanding, thereby providing a suitable tool for designing semantic communication (SC) systems. However, their direct deployment is often hindered by high computational complexity and resource requirements. In this paper, a novel robust knowledge distillation based semantic communication (RKD-SC) framework is proposed to enable efficient and channel-noise-robust LSM-powered SC. The framework addresses two key challenges: determining optimal compact model architectures and effectively transferring knowledge while maintaining robustness against channel noise. First, a knowledge distillation-based lightweight differentiable architecture search (KDL-DARTS) algorithm is proposed. This algorithm integrates knowledge distillation loss and a complexity penalty into the neural architecture search process to identify high-performance, lightweight semantic encoder architectures. Second, a novel two-stage robust knowledge distillation (RKD) algorithm is developed to transfer semantic capabilities from an LSM (teacher) to a compact encoder (student) and subsequently enhance system robustness. To further improve resilience to channel impairments, a channel-aware transformer (CAT) block is introduced as the channel codec, trained under diverse channel conditions with variable-length outputs. Extensive simulations on image classification tasks demonstrate that the RKD-SC framework significantly reduces model parameters while preserving a high degree of the teacher model’s performance and exhibiting superior robustness compared to existing methods.
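The combination of a distillation loss with a complexity penalty can be sketched in Python as follows. This uses the widely used temperature-scaled distillation loss (a KL divergence between softened teacher and student outputs) plus a penalty linear in parameter count. It is an assumption that KDL-DARTS takes this form; the weights `alpha` and `beta`, the temperature, and the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def kd_search_loss(student_logits, teacher_logits, label, num_params,
                   temperature=2.0, alpha=0.5, beta=1e-6):
    """Cross-entropy + distillation KL + complexity penalty (illustrative)."""
    p_student = softmax(student_logits)
    cross_entropy = -math.log(p_student[label])
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(t * math.log(t / s)
             for t, s in zip(q_teacher, q_student) if t > 0.0)
    return cross_entropy + alpha * temperature ** 2 * kl + beta * num_params
```

In an architecture search, a candidate encoder with more parameters or a larger gap to the teacher would receive a higher loss under this sketch, which is the qualitative behaviour the abstract describes.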
[AI-39] Fitness aligned structural modeling enables scalable virtual screening with AuroBind
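[Quick Read]: This paper addresses the fact that the vast majority of human proteins (over 96%) remain undrugged, and in particular the shortcomings of existing structure-based virtual screening methods in atomic-level precision and binding fitness prediction, which limit their translational impact. The key to the solution is the AuroBind framework, which fine-tunes a custom atomic-level structural model on million-scale chemogenomic data and combines direct preference optimization, self-distillation from high-confidence complexes, and a teacher-student acceleration strategy to jointly predict ligand-bound structures and binding fitness. The approach outperforms state-of-the-art models on structural and functional benchmarks and supports high-throughput screening of ultra-large compound libraries with a 100,000-fold speedup, helping bridge structure prediction and therapeutic discovery.
Link: https://arxiv.org/abs/2508.02137
Authors: Zhongyue Zhang,Jiahua Rao,Jie Zhong,Weiqiang Bai,Dongxue Wang,Shaobo Ning,Lifeng Qiao,Sheng Xu,Runze Ma,Will Hua,Jack Xiaoyu Chen,Odin Zhang,Wei Lu,Hanyi Feng,He Yang,Xinchao Shi,Rui Li,Wanli Ouyang,Xinzhu Ma,Jiahao Wang,Jixian Zhang,Jia Duan,Siqi Sun,Jian Zhang,Shuangjia Zheng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 54 pages, 13 figures, code available at this https URL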
Abstract:Most human proteins remain undrugged: over 96% of human proteins remain unexploited by approved therapeutics. While structure-based virtual screening promises to expand the druggable proteome, existing methods lack atomic-level precision and fail to predict binding fitness, limiting translational impact. We present AuroBind, a scalable virtual screening framework that fine-tunes a custom atomic-level structural model on million-scale chemogenomic data. AuroBind integrates direct preference optimization, self-distillation from high-confidence complexes, and a teacher-student acceleration strategy to jointly predict ligand-bound structures and binding fitness. The proposed models outperform state-of-the-art models on structural and functional benchmarks while enabling 100,000-fold faster screening across ultra-large compound libraries. In a prospective screen across ten disease-relevant targets, AuroBind achieved experimental hit rates of 7-69%, with top compounds reaching sub-nanomolar to picomolar potency. For the orphan GPCRs GPR151 and GPR160, AuroBind identified both agonists and antagonists with success rates of 16-30%, and functional assays confirmed GPR160 modulation in liver and prostate cancer models. AuroBind offers a generalizable framework for structure-function learning and high-throughput molecular screening, bridging the gap between structure prediction and therapeutic discovery.
[AI-40] All Stories Are One Story: Emotional Arc Guided Procedural Game Level Generation
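[Quick Read]: This paper addresses the lack of a unifying emotional driver in the story structure of interactive narratives, i.e., how to build dynamic plots in procedurally generated games that have emotional ups and downs and improve player immersion and narrative coherence. The key to the solution is to use the emotional arc as the structural backbone of narrative generation: two core emotional patterns, Rise and Fall, guide the generation of branching story graphs, and large language models (LLMs) combined with adaptive entity generation automatically populate each story node with characters, items, and gameplay attributes (e.g., health, attack), with difficulty adjusted according to the emotional trajectory. Evaluation shows that the method significantly enhances player engagement, narrative coherence, and emotional impact, supporting the feasibility and effectiveness of emotionally structured procedural generation for interactive game storytelling.
Link: https://arxiv.org/abs/2508.02132
Authors: Yunge Wen,Chenliang Huang,Hangyu Zhou,Zhuo Zeng,Chun Ming Louis Po,Julian Togelius,Timothy Merino,Sam Earle
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: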
Abstract:The emotional arc is a universal narrative structure underlying stories across cultures and media – an idea central to structuralist narratology, often encapsulated in the phrase “all stories are one story.” We present a framework for procedural game narrative generation that incorporates emotional arcs as a structural backbone for both story progression and gameplay dynamics. Leveraging established narratological theories and large-scale empirical analyses, we focus on two core emotional patterns – Rise and Fall – to guide the generation of branching story graphs. Each story node is automatically populated with characters, items, and gameplay-relevant attributes (e.g., health, attack), with difficulty adjusted according to the emotional trajectory. Implemented in a prototype action role-playing game (ARPG), our system demonstrates how emotional arcs can be operationalized using large language models (LLMs) and adaptive entity generation. Evaluation through player ratings, interviews, and sentiment analysis shows that emotional arc integration significantly enhances engagement, narrative coherence, and emotional impact. These results highlight the potential of emotionally structured procedural generation for advancing interactive storytelling for games.
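[AI-41] The Complexity of Extreme Climate Events on the New Zealand's Kiwifruit Industry
[Quick Read]: This paper addresses the problem of quantifying and identifying how extreme climate events (frost, drought, extreme rainfall, and heatwave) affect kiwifruit yields in New Zealand, with particular attention to how these events lead to marked yield differences between farms. The key to the solution is to apply Isolation Forest, an unsupervised anomaly detection method, to historical climate data and recorded yields to identify how extreme events affect the crop. The paper also points out the limitations of current methods in accurately identifying events such as frost, and proposes integrating farm management strategies and regional climate station features, together with ensemble methods that consolidate nearby farms' yield data, to reduce variance and improve detection accuracy and the reliability of response strategies.
Link: https://arxiv.org/abs/2508.02130
Authors: Boyuan Zheng,Victor W. Chu,Zhidong Li,Evan Webster,Ashley Rootsey
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Pre-print v0.8 2025-08-04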
Abstract:Climate change has intensified the frequency and severity of extreme weather events, presenting unprecedented challenges to the agricultural industry worldwide. In this investigation, we focus on kiwifruit farming in New Zealand. We propose to examine the impacts of climate-induced extreme events, specifically frost, drought, extreme rainfall, and heatwave, on kiwifruit harvest yields. These four events were selected due to their significant impacts on crop productivity and their prevalence as recorded by climate monitoring institutions in the country. We employed Isolation Forest, an unsupervised anomaly detection method, to analyse climate history and recorded extreme events, alongside kiwifruit yields. Our analysis reveals considerable variability in how different types of extreme events affect kiwifruit yields, underscoring notable discrepancies between climatic extremes and individual farms’ yield outcomes. Additionally, our study highlights critical limitations of current anomaly detection approaches, particularly in accurately identifying events such as frost. These findings emphasise the need for integrating supplementary features like farm management strategies with climate adaptation practices. Our further investigation will employ ensemble methods that consolidate nearby farms’ yield data and regional climate station features to reduce variance, thereby enhancing the accuracy and reliability of extreme event detection and the formulation of response strategies.
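[AI-42] Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models
【Quick read】: This paper addresses the low inference efficiency of compute-intensive Large Language Models (LLMs), in particular two limitations: conventional weight sparsity tends to cause significant accuracy loss, while activation sparsity usually depends on training and generalizes poorly. The key to its solution is Amber Pruner, a training-free N:M activation sparsity method that targets the linear projection layers in the prefill stage and can sparsify more than 55% of linear computations without retraining the model. The paper further introduces Outstanding-sparse, a unified framework that combines Amber Pruner with post-training W8A8 quantization, improving generality and efficiency, with notable advantages on generative tasks, and opening a new research direction for activation sparsity.
Link: https://arxiv.org/abs/2508.02128
Authors: Tai An,Ruwu Cai,Yanzhe Zhang,Yang Liu,Hao Chen,Pengcheng Xie,Sheng Chang,Yiwu Yao,Gongyi Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: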
Abstract:In the era of large language models (LLMs), N:M sparsity has emerged as a structured compression technique critical for accelerating inference. While prior work has primarily focused on weight sparsity, it often suffers from significant accuracy degradation. Activation sparsity, though promising, is typically training-dependent and faces challenges in generalization. To address these limitations, we introduce Amber Pruner, a training-free N:M activation sparsity method designed specifically for the prefill stage, targeting the acceleration of linear projection layers in LLMs. Extensive experiments across multiple models and sparsity ratios (2:4, 4:8, and 8:16) demonstrate that Amber Pruner can effectively sparsify and accelerate more than 55% of linear computations without requiring model retraining. To further enhance generality and efficiency, we propose Outstanding-sparse, a unified framework that integrates Amber Pruner with post-training W8A8 quantization. Our approach preserves strong performance across a range of downstream tasks, with notable advantages in generative tasks. This work pioneers a new frontier in activation sparsity, providing foundational insights that are poised to guide the co-evolution of algorithms and architectures in the design of next-generation AI systems.
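To make the N:M pattern concrete, here is a minimal sketch in plain Python. It follows the common convention that N:M sparsity keeps at most N nonzero values in every group of M consecutive elements. The abstract does not say how Amber Pruner chooses which activations to keep, so the magnitude-based rule below is an assumption for illustration only, and the function name `nm_sparsify` is hypothetical.

```python
from typing import List


def nm_sparsify(row: List[float], n: int, m: int) -> List[float]:
    """Keep the n largest-magnitude values in every group of m; zero the rest.

    Assumption: magnitude-based selection. This is NOT taken from the paper.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out: List[float] = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest-magnitude entries within this group.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


# One row of activations, pruned to the 2:4 pattern (two survivors per four).
activations = [0.1, -2.0, 0.3, 1.5, 0.0, 0.2, -0.7, 0.05]
sparse = nm_sparsify(activations, n=2, m=4)
print(sparse)
```

The ratios named in the abstract (2:4, 4:8, 8:16) all keep half of the values; a larger group size gives the selection more freedom about where the surviving values sit within each group.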
[AI-43] A Survey on AgentOps: Categorization, Challenges, and Future Directions
【Quick read】: This paper addresses the fact that agent systems driven by Large Language Models (LLMs) frequently encounter anomalies in real operation, making them unstable and insecure, while research on operating and maintaining such systems remains very sparse. The key to its solution is a new, systematic operations framework, Agent System Operations (AgentOps), built around four core stages: monitoring, anomaly detection, root cause analysis, and resolution. Together these provide a structured, actionable path for keeping agent systems running stably.
Link: https://arxiv.org/abs/2508.02121
Authors: Zexin Wang,Jingjing Li,Quan Zhou,Haotian Si,Yuanhao Liu,Jianhui Li,Gaogang Xie,Fei Sun,Dan Pei,Changhua Pei
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 35 pages
Abstract:As the reasoning capabilities of Large Language Models (LLMs) continue to advance, LLM-based agent systems offer advantages in flexibility and interpretability over traditional systems, garnering increasing attention. However, despite the widespread research interest and industrial application of agent systems, these systems, like their traditional counterparts, frequently encounter anomalies. These anomalies lead to instability and insecurity, hindering their further development. Therefore, a comprehensive and systematic approach to the operation and maintenance of agent systems is urgently needed. Unfortunately, current research on the operations of agent systems is sparse. To address this gap, we have undertaken a survey on agent system operations with the aim of establishing a clear framework for the field, defining the challenges, and facilitating further development. Specifically, this paper begins by systematically defining anomalies within agent systems, categorizing them into intra-agent anomalies and inter-agent anomalies. Next, we introduce a novel and comprehensive operational framework for agent systems, dubbed Agent System Operations (AgentOps). We provide detailed definitions and explanations of its four key stages: monitoring, anomaly detection, root cause analysis, and resolution.
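[AI-44] Don't Overthink It: A Survey of Efficient R1-style Large Reasoning Models
【Quick read】: This paper addresses the inefficiency caused by overthinking in Large Reasoning Models (LRMs): when answering, these models tend to build long, repetitive chains of thought, which slows reasoning and can hurt the accuracy of the final answer. The key to its solution is a systematic review of current work on efficient reasoning, organized into two directions. The first is efficient reasoning with a single model, which improves the reasoning mechanism of an individual model. The second is efficient reasoning with model collaboration, which optimizes reasoning paths through cooperation among multiple models. Both aim to shorten reasoning chains without sacrificing performance.
Link: https://arxiv.org/abs/2508.02120
Authors: Linan Yue,Yichao Du,Yizhi Wang,Weibo Gao,Fangzhou Yao,Li Wang,Ye Liu,Ziyu Xu,Qi Liu,Shimin Di,Min-Ling Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: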
Abstract:Recently, Large Reasoning Models (LRMs) have gradually become a research hotspot due to their outstanding performance in handling complex tasks. Among them, DeepSeek R1 has garnered significant attention for its exceptional performance and open-source nature, driving advancements in the research of R1-style LRMs. Unlike traditional Large Language Models (LLMs), these models enhance logical deduction and decision-making capabilities during reasoning by incorporating mechanisms such as long chain-of-thought and self-reflection through reinforcement learning. However, with the widespread application of these models, the problem of overthinking has gradually emerged. Specifically, when generating answers, these models often construct excessively long reasoning chains with redundant or repetitive steps, which leads to reduced reasoning efficiency and may affect the accuracy of the final answer. To this end, various efficient reasoning methods have been proposed, aiming to reduce the length of reasoning paths without compromising model performance and reasoning capability. By reviewing the current research advancements in the field of efficient reasoning methods systematically, we categorize existing works into two main directions based on the lens of single-model optimization versus model collaboration: (1) Efficient Reasoning with Single Model, which focuses on improving the reasoning efficiency of individual models; and (2) Efficient Reasoning with Model Collaboration, which explores optimizing reasoning paths through collaboration among multiple models. Besides, we maintain a public GitHub repository that tracks the latest progress in efficient reasoning methods.
[AI-45] Coward: Toward Practical Proactive Federated Backdoor Defense via Collision-based Watermark
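【Quick read】: This paper addresses the difficulty of detecting backdoor attacks in Federated Learning (FL), and in particular the limitations of existing passive and proactive defenses: passive defenses are vulnerable to non-i.i.d. data distributions and random client participation, while existing proactive defenses suffer unavoidable out-of-distribution (OOD) bias because they rely on backdoor co-existence effects. The key to its solution is a new proactive defense, Coward, inspired by the discovery of multi-backdoor collision effects: distinct backdoors planted consecutively significantly suppress earlier ones. Coward injects a conflicting global watermark on the server side and identifies malicious clients by checking whether that watermark is retained or erased during local training, which mitigates OOD bias while remaining robust to data heterogeneity (non-i.i.d. data).
Link: https://arxiv.org/abs/2508.02115
Authors: Wenjie Li,Siying Gu,Yiming Li,Kangjie Chen,Zhili Chen,Tianwei Zhang,Shu-Tao Xia,Dacheng Tao
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 13-page main body and 4-page appendix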
Abstract:Backdoor detection is currently the mainstream defense against backdoor attacks in federated learning (FL), where malicious clients upload poisoned updates that compromise the global model and undermine the reliability of FL deployments. Existing backdoor detection techniques fall into two categories, including passive and proactive ones, depending on whether the server proactively modifies the global model. However, both have inherent limitations in practice: passive defenses are vulnerable to common non-i.i.d. data distributions and random participation of FL clients, whereas current proactive defenses suffer inevitable out-of-distribution (OOD) bias because they rely on backdoor co-existence effects. To address these issues, we introduce a new proactive defense, dubbed Coward, inspired by our discovery of multi-backdoor collision effects, in which consecutively planted, distinct backdoors significantly suppress earlier ones. In general, we detect attackers by evaluating whether the server-injected, conflicting global watermark is erased during local training rather than retained. Our method preserves the advantages of proactive defenses in handling data heterogeneity (i.e., non-i.i.d. data) while mitigating the adverse impact of OOD bias through a revised detection mechanism. Extensive experiments on benchmark datasets confirm the effectiveness of Coward and its resilience to potential adaptive attacks. The code for our method would be available at this https URL.
[AI-46] Attractive Metadata Attack: Inducing LLM Agents to Invoke Malicious Tools
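【Quick read】: This paper addresses the security risk that arises when the metadata of external tools used by Large Language Model (LLM) agents is maliciously manipulated. Current tool-oriented agent architectures choose tools by parsing metadata such as names, descriptions, and parameter schemas, which exposes a stealthy attack surface: by modifying metadata, an attacker can induce an agent to prefer a malicious tool without prompt injection or access to model internals. The key contribution is the Attractive Metadata Attack (AMA), a black-box in-context learning framework that iteratively optimizes tool metadata to be highly attractive while remaining syntactically and semantically valid, and thereby steers the agent's tool selection. AMA integrates into standard tool ecosystems without changes to the agent's execution framework. Across multiple realistic scenarios and mainstream LLM agents it reaches attack success rates of 81%-95% with significant privacy leakage, and it remains effective against existing prompt-level defenses and structured tool-selection protocols, exposing systemic execution-level vulnerabilities in current agent architectures.
Link: https://arxiv.org/abs/2508.02110
Authors: Kanghua Mo,Li Hu,Yucheng Long,Zhihao Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: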
Abstract:Large language model (LLM) agents have demonstrated remarkable capabilities in complex reasoning and decision-making by leveraging external tools. However, this tool-centric paradigm introduces a previously underexplored attack surface: adversaries can manipulate tool metadata – such as names, descriptions, and parameter schemas – to influence agent behavior. We identify this as a new and stealthy threat surface that allows malicious tools to be preferentially selected by LLM agents, without requiring prompt injection or access to model internals. To demonstrate and exploit this vulnerability, we propose the Attractive Metadata Attack (AMA), a black-box in-context learning framework that generates highly attractive but syntactically and semantically valid tool metadata through iterative optimization. Our attack integrates seamlessly into standard tool ecosystems and requires no modification to the agent’s execution framework. Extensive experiments across ten realistic, simulated tool-use scenarios and a range of popular LLM agents demonstrate consistently high attack success rates (81%-95%) and significant privacy leakage, with negligible impact on primary task execution. Moreover, the attack remains effective even under prompt-level defenses and structured tool-selection protocols such as the Model Context Protocol, revealing systemic vulnerabilities in current agent architectures. These findings reveal that metadata manipulation constitutes a potent and stealthy attack surface, highlighting the need for execution-level security mechanisms that go beyond prompt-level defenses.
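[AI-47] Evaluating User Experience in Conversational Recommender Systems: A Systematic Review Across Classical and LLM-Powered Approaches
【Quick read】: This paper addresses the shortage of research on evaluating User Experience (UX) in Conversational Recommender Systems (CRSs), especially the lack of empirical studies on adaptive systems and CRSs based on Large Language Models (LLMs). The key to its solution is a systematic literature review following the PRISMA guidelines, which synthesizes 23 empirical studies published between 2017 and 2025 and analyzes how UX is conceptualized, measured, and shaped along three dimensions: domain, adaptive behavior, and LLM use. It contributes a structured synthesis of UX metrics, a comparison of UX in adaptive and non-adaptive systems, and a forward-looking framework for LLM-aware UX evaluation, in support of more transparent, engaging, and user-centred CRS evaluation practice.
Link: https://arxiv.org/abs/2508.02096
Authors: Raj Mahmud,Yufeng Wu,Abdullah Bin Sawad,Shlomo Berkovsky,Mukesh Prasad,A. Baki Kocaballi
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted at OZCHI 2025. 23 pages, 1 figure, 5 tables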
Abstract:Conversational Recommender Systems (CRSs) are receiving growing research attention across domains, yet their user experience (UX) evaluation remains limited. Existing reviews largely overlook empirical UX studies, particularly in adaptive and large language model (LLM)-based CRSs. To address this gap, we conducted a systematic review following PRISMA guidelines, synthesising 23 empirical studies published between 2017 and 2025. We analysed how UX has been conceptualised, measured, and shaped by domain, adaptivity, and LLM. Our findings reveal persistent limitations: post hoc surveys dominate, turn-level affective UX constructs are rarely assessed, and adaptive behaviours are seldom linked to UX outcomes. LLM-based CRSs introduce further challenges, including epistemic opacity and verbosity, yet evaluations infrequently address these issues. We contribute a structured synthesis of UX metrics, a comparative analysis of adaptive and nonadaptive systems, and a forward-looking agenda for LLM-aware UX evaluation. These findings support the development of more transparent, engaging, and user-centred CRS evaluation practices.
[AI-48] “Stack It Up!”: 3D Stable Structure Generation from 2D Hand-drawn Sketch
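【Quick read】: This paper addresses the fact that current robot manipulation systems cannot directly turn a 2D sketch drawn by a non-expert into a stable, structurally sound 3D arrangement of objects. Existing methods usually need precise 3D block poses as goals, which requires structural analysis and expert tools such as CAD and makes interaction less natural and less accessible. The key to its solution is the StackItUp system, whose core idea is an abstract relation graph that extracts symbolic geometric relations (e.g., left-of) and stability patterns (e.g., two-pillar-bridge) from a rough sketch while discarding its noisy metric details. Compositional diffusion models then ground the graph to concrete 3D poses, and the system iteratively predicts hidden internal and rear supports, which are critical for stability but absent from the sketch. This lets non-expert users generate stable, visually faithful multi-level 3D structures from a 2D front-view sketch alone.
Link: https://arxiv.org/abs/2508.02093
Authors: Yiqing Xu,Linfeng Li,Cunjun Yu,David Hsu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to CoRL 2025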
Abstract:Imagine a child sketching the Eiffel Tower and asking a robot to bring it to life. Today's robot manipulation systems can't act on such sketches directly: they require precise 3D block poses as goals, which in turn demand structural analysis and expert tools like CAD. We present StackItUp, a system that enables non-experts to specify complex 3D structures using only 2D front-view hand-drawn sketches. StackItUp introduces an abstract relation graph to bridge the gap between rough sketches and accurate 3D block arrangements, capturing the symbolic geometric relations (e.g., left-of) and stability patterns (e.g., two-pillar-bridge) while discarding noisy metric details from sketches. It then grounds this graph to 3D poses using compositional diffusion models and iteratively updates it by predicting hidden internal and rear supports, which are critical for stability but absent from the sketch. Evaluated on sketches of iconic landmarks and modern house designs, StackItUp consistently produces stable, multilevel 3D structures and outperforms all baselines in both stability and visual resemblance.
[AI-49] FPEdit: Robust LLM Fingerprinting through Localized Knowledge Editing
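【Quick read】: This paper addresses the lack of effective, stealthy copyright protection for Large Language Models (LLMs) against unauthorized redistribution and commercial misuse. Existing fingerprinting techniques have two main limitations: methods based on intrinsic features require access to all parameters, while backdoor-based schemes use statistically anomalous triggers that adversaries can easily detect and remove. The authors propose FPEdit, a knowledge-editing framework whose core idea is to inject semantically coherent natural-language fingerprints by modifying a sparse subset of model weights, encoding ownership stealthily without harming the model's core functionality. The method retains 95%-100% of fingerprints under both full-parameter fine-tuning and parameter-efficient adaptation, stays robust under quantization, pruning, and stochastic decoding, and uses markedly fewer resources than existing techniques. The paper presents it as the first approach to combine robustness to adaptation, resistance to detection, and preserved model utility.
Link: https://arxiv.org/abs/2508.02092
Authors: Shida Wang,Chaohu Liu,Yubo Wang,Linli Xu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: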
Abstract:Large language models represent significant investments in computation, data, and engineering expertise, making them extraordinarily valuable intellectual assets. Nevertheless, these AI assets remain vulnerable to unauthorized redistribution and commercial exploitation through fine-tuning or black-box deployment. Current fingerprinting approaches face a fundamental trade-off: intrinsic methods require full parameter access, while backdoor-based techniques employ statistically anomalous triggers easily detected and filtered by adversaries. To address these limitations, we introduce FPEdit, a novel knowledge-editing framework that injects semantically coherent natural language fingerprints by modifying a sparse subset of model weights. This ensures stealthy and precise ownership encoding without degrading the core functionality. Extensive experiments show that FPEdit achieves 95-100% fingerprint retention under both full-parameter fine-tuning and parameter-efficient adaptation, while preserving performance on 24 downstream benchmarks. Moreover, FPEdit remains robust under quantization, pruning, and stochastic decoding, and can embed 10 fingerprint pairs into LLaMA2-7B in under 10 minutes using less than 32 GB of GPU memory, a 70% reduction in resource requirements compared to existing techniques. These advances establish FPEdit as the first fingerprinting approach to simultaneously achieve robustness against adaptation, resistance to detection, and preservation of model utility, providing a minimally invasive solution for reliable provenance verification of large language models in adversarial deployment scenarios.
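[AI-50] SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents
【Quick read】: This paper addresses the inefficiency and performance bottlenecks that arise in complex tasks because the interaction trajectories of Large Language Model (LLM)-driven agents are underexploited. Existing methods such as Monte Carlo Tree Search (MCTS) balance exploration and exploitation but ignore the interdependence among trajectories and search a space with little diversity, leading to redundant reasoning and suboptimal results. The key to its solution is the SE-Agent framework, which iteratively optimizes past trajectories through three core operations: revision, recombination, and refinement. This lets the agent self-evolve: earlier trajectories guide exploration of a wider solution space to escape local optima, and cross-trajectory inspiration improves overall performance while reducing the influence of poor reasoning paths, so that reasoning quality improves continuously.
Link: https://arxiv.org/abs/2508.02085
Authors: Jiaye Lin,Yifu Guo,Yuzhen Han,Sen Hu,Ziyi Ni,Licheng Wang,Mingguang Chen,Daxin Jiang,Binxing Jiao,Chen Hu,Huacan Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: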
Abstract:Large Language Model (LLM)-based agents have recently shown impressive capabilities in complex reasoning and tool use via multi-step interactions with their environments. While these agents have the potential to tackle complicated tasks, their problem-solving process, i.e., agents’ interaction trajectory leading to task completion, remains underexploited. These trajectories contain rich feedback that can navigate agents toward the right directions for solving problems correctly. Although prevailing approaches, such as Monte Carlo Tree Search (MCTS), can effectively balance exploration and exploitation, they ignore the interdependence among various trajectories and lack the diversity of search spaces, which leads to redundant reasoning and suboptimal outcomes. To address these challenges, we propose SE-Agent, a Self-Evolution framework that enables Agents to optimize their reasoning processes iteratively. Our approach revisits and enhances former pilot trajectories through three key operations: revision, recombination, and refinement. This evolutionary mechanism enables two critical advantages: (1) it expands the search space beyond local optima by intelligently exploring diverse solution paths guided by previous trajectories, and (2) it leverages cross-trajectory inspiration to efficiently enhance performance while mitigating the impact of suboptimal reasoning paths. Through these mechanisms, SE-Agent achieves continuous self-evolution that incrementally improves reasoning quality. We evaluate SE-Agent on SWE-bench Verified to resolve real-world GitHub issues. Experimental results across five strong LLMs show that integrating SE-Agent delivers up to 55% relative improvement, achieving state-of-the-art performance among all open-source agents on SWE-bench Verified. Our code and demonstration materials are publicly available at this https URL.
[AI-51] SSBD Ontology: A Two-Tier Approach for Interoperable Bioimaging Metadata ISWC2025
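【Quick read】: This paper addresses inadequate metadata management and interoperability in bioimaging, where multidimensional data is now acquired at large scale. The key to its solution is an ontology-driven framework with a two-tier architecture. The core layer has a class-centric structure that references existing biomedical ontologies and supports two modes: SSBD:repository, which focuses on rapid publication with minimal metadata, and SSBD:database, which adds biological and imaging-related annotations. The instance layer represents actual imaging datasets as Resource Description Framework (RDF) individuals explicitly linked to the core classes. This layered design aligns flexible instance data with rigorous ontological classes, enabling seamless data integration and advanced semantic queries, and making bioimaging data more findable, accessible, interoperable, and reusable (the FAIR principles).
Link: https://arxiv.org/abs/2508.02084
Authors: Yuki Yamagata,Koji Kyoda,Hiroya Itoga,Emi Fujisawa,Shuichi Onami
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: Accepted to the 24th International Semantic Web Conference Resource Track (ISWC 2025)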
Abstract:Advanced bioimaging technologies have enabled the large-scale acquisition of multidimensional data, yet effective metadata management and interoperability remain significant challenges. To address these issues, we propose a new ontology-driven framework for the Systems Science of Biological Dynamics Database (SSBD) that adopts a two-tier architecture. The core layer provides a class-centric structure referencing existing biomedical ontologies, supporting both SSBD:repository – which focuses on rapid dataset publication with minimal metadata – and SSBD:database, which is enhanced with biological and imaging-related annotations. Meanwhile, the instance layer represents actual imaging dataset information as Resource Description Framework individuals that are explicitly linked to the core classes. This layered approach aligns flexible instance data with robust ontological classes, enabling seamless integration and advanced semantic queries. By coupling flexibility with rigor, the SSBD Ontology promotes interoperability, data reuse, and the discovery of novel biological mechanisms. Moreover, our solution aligns with the Recommended Metadata for Biological Images guidelines and fosters compatibility. Ultimately, our approach contributes to establishing a Findable, Accessible, Interoperable, and Reusable data ecosystem within the bioimaging community.
[AI-52] AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization
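【Quick read】: This paper addresses the alignment drift induced when Low-rank adaptation (LoRA) is used to fine-tune Large Language Models (LLMs): even small parameter updates can weaken a model's safety and behavioral constraints through entangled changes in parameter space. The key to its solution is the AlignGuard-LoRA (AGL) framework, which combines three components: regularization based on the Fisher Information Matrix to restrict updates in alignment-sensitive subspaces; task-specific regularization to stabilize the integration of new knowledge; and collision-aware regularization, which blends Riemannian overlap and geodesic separation to penalize coordinate-wise interference and encourage disjoint update geometry. The method reduces alignment drift on safety-critical benchmarks by up to 50% without degrading downstream task performance.
Link: https://arxiv.org/abs/2508.02079
Authors: Amitava Das,Abhilekh Borah,Vinija Jain,Aman Chadha
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: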
Abstract:Low-rank adaptation (LoRA) has become a standard tool for efficiently fine-tuning large language models (LLMs). Yet, even minor LoRA updates can induce alignment drift, weakening safety and behavioral constraints through entangled parameter changes. To address this, we propose AlignGuard-LoRA (AGL), a principled framework for preserving alignment during finetuning. AGL introduces several key components: a primary task loss for supervision, Fisher Information Matrix-based regularization to restrict updates in alignment-sensitive subspaces, and task-specific regularization to stabilize the integration of new knowledge. We further introduce collision-aware regularization, blending Riemannian overlap – which penalizes coordinate-wise interference – and geodesic separation – which encourages disjoint update geometry. We curate DriftCaps, a targeted diagnostic benchmark of safe and unsafe prompts designed to quantify alignment drift and safety degradation. Empirical evaluations show that AGL mitigates alignment drift by up to 50% on safety-critical benchmarks without degrading downstream task performance. Comprehensive ablation confirms that each component contributes distinctly to preserving latent safety behaviors. Finally, we derive and validate a scaling law for catastrophic forgetting, revealing that AGL flattens post-finetuning loss escalation while preserving adaptation dynamics. AGL is a structurally grounded refinement of LoRA, ensuring alignment preservation with minimal trade-offs. To encourage further exploration and development, we open-source our implementation.
[AI-53] Everyone Contributes! Incentivizing Strategic Cooperation in Multi-LLM Systems via Sequential Public Goods Games
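【Quick read】: This paper addresses the trade-off between computation cost and collective performance when multiple Large Language Models (LLMs) collaborate on complex tasks. Although collaboration among several LLMs can improve performance, conventional approaches are often inefficient because of high communication overhead or free-riding. The key to its solution is a game-theoretic reinforcement learning framework, the Multi-Agent Cooperation Sequential Public Goods Game (MAC-SPGG). It redesigns the public-goods reward so that effortful contribution becomes the unique Subgame Perfect Nash Equilibrium (SPNE), which suppresses free-riding, and it replaces conventional round-based information exchange with a sequential decision protocol, cutting communication overhead while retaining strategic depth. Empirically, multi-LLM ensembles trained with MAC-SPGG outperform single-model baselines and chain-of-thought prompting on reasoning, math, code generation, and NLP tasks, and reach performance comparable to large-scale single models.
Link: https://arxiv.org/abs/2508.02076
Authors: Yunhao Liang,Yuan Qu,Jingyuan Yang,Shaochong Lin,Zuo-Jun Max Shen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: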
Abstract:Coordinating multiple large language models (LLMs) to solve complex tasks collaboratively poses a fundamental trade-off between the computation costs and collective performance compared with an individual model. We introduce a novel, game-theoretically grounded reinforcement learning (RL) framework, the Multi-Agent Cooperation Sequential Public Goods Game (MAC-SPGG), to systematically incentivize cooperation in multi-LLM ensembles. In MAC-SPGG, LLM agents move in sequence, observing predecessors' outputs and updating beliefs to condition their own contributions. By redesigning the public-goods reward, effortful contributions become the unique Subgame Perfect Nash Equilibrium (SPNE), which eliminates free-riding under traditional SPGG or PGG. Its sequential protocol replaces costly round-based information exchanges with a streamlined decision flow, cutting communication overhead while retaining strategic depth. We prove the existence and uniqueness of the SPNE under realistic parameters, and empirically show that MAC-SPGG-trained ensembles outperform single-agent baselines, chain-of-thought prompting, and other cooperative methods, even achieving comparable performance to large-scale models across reasoning, math, code generation, and NLP tasks. Our results highlight the power of structured, incentive-aligned MAC-SPGG cooperation for scalable and robust multi-agent language generation.
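The free-riding problem the abstract refers to can be seen in the textbook linear public goods game, sketched below. This is the standard PGG payoff (each player keeps what they do not contribute and receives an equal share of the multiplied pot), not the redesigned MAC-SPGG reward, whose exact form the abstract does not give. The function name and the numbers are illustrative assumptions.

```python
from typing import List


def pgg_payoffs(contributions: List[float], endowment: float, r: float) -> List[float]:
    """Textbook linear public goods game (not the MAC-SPGG reward).

    payoff_i = endowment - c_i + r * sum(c) / N
    """
    n = len(contributions)
    share = r * sum(contributions) / n  # equal share of the multiplied pot
    return [endowment - c + share for c in contributions]


# Four players, multiplier r = 2 (so 1 < r < N).
all_in = pgg_payoffs([10, 10, 10, 10], endowment=10, r=2.0)
one_defects = pgg_payoffs([0, 10, 10, 10], endowment=10, r=2.0)
nobody = pgg_payoffs([0, 0, 0, 0], endowment=10, r=2.0)

print(all_in)       # every player earns the same when all contribute
print(one_defects)  # the defector (index 0) earns more than the contributors
print(nobody)       # universal defection leaves everyone with the endowment
```

With 1 < r < N, withholding a contribution always raises an individual's own payoff even though full cooperation is better for the group, which is why a reward redesign of the kind the paper describes is needed to make contributing the equilibrium choice.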
[AI-54] Risk identification based on similar case retrieval enhancement
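【Quick read】: This paper addresses insufficient automation in identifying risks and hazards on construction sites, in particular the limits of current methods based on Large Language Models (LLMs) in recognizing complex hazard features, and the high training cost and weak feature association that come with instruction fine-tuning or dialogue guidance on professional datasets. The key to its solution is an identification method based on Similar Case Retrieval Enhancement: prompt fine-tuning integrates external knowledge and the context of retrieved cases, which mitigates misjudgments caused by limited domain knowledge and weak feature associations. The method consists of three modules (a retrieval library, image similarity retrieval, and large-model retrieval enhancement) and needs no retraining. On real construction data it raises accuracy substantially (for example, GLM-4V's recognition accuracy rises from 14.51% to 50%) and improves context understanding and stability.
Link: https://arxiv.org/abs/2508.02073
Authors: Jiawei Li,Chengye Yang,Yaochen Zhang,Weilin Sun,Lei Meng,Xiangxu Meng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: in Chinese language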
Abstract:The goal of construction site risk and hazard identification is to enhance safety management through automation. Existing research based on large language models falls into two categories: image-text matching for collaborative reasoning, which struggles with complex hazard features, and instruction fine-tuning or dialogue guidance using professional datasets, which suffers from high training costs and poor this http URL address this, we propose a hazard identification method using similar case retrieval enhancement. By integrating external knowledge and retrieved case contexts via prompt fine-tuning, we mitigate misjudgments caused by limited domain knowledge and weak feature associations. Our method includes three modules: retrieval library, image similarity retrieval, and large model retrieval enhancement, enabling efficient recognition without training. Experiments on real construction data show significant improvements. For instance, GLM-4V’s recognition accuracy increased to 50%, a 35.49% boost. The method enhances accuracy, context understanding, and stability, offering new theoretical and technical support for hazard detection.
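The retrieve-then-prompt idea described above can be sketched in a few lines of plain Python. Everything concrete here is an assumption for illustration: the three-dimensional "embeddings", the case labels, and the helper names `cosine`, `retrieve`, and `build_prompt` are hypothetical, and the paper's actual image encoder, retrieval library, and prompt format are not given in the abstract.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: Sequence[float],
             library: List[Tuple[Sequence[float], str]],
             k: int = 2) -> List[str]:
    """Return the labels of the k library cases most similar to the query."""
    ranked = sorted(library, key=lambda case: cosine(query, case[0]), reverse=True)
    return [label for _, label in ranked[:k]]


def build_prompt(question: str, cases: List[str]) -> str:
    """Prepend the retrieved cases to the question as plain-text context."""
    context = "\n".join(f"- {c}" for c in cases)
    return f"Similar past cases:\n{context}\n\nQuestion: {question}"


# Hypothetical retrieval library: (toy image embedding, hazard description).
library = [
    ([1.0, 0.0, 0.0], "worker without helmet"),
    ([0.9, 0.1, 0.0], "worker without safety harness at height"),
    ([0.0, 1.0, 0.0], "exposed electrical wiring"),
    ([0.0, 0.0, 1.0], "unguarded floor opening"),
]

query_embedding = [1.0, 0.02, 0.0]  # toy embedding of the new site photo
cases = retrieve(query_embedding, library, k=2)
prompt = build_prompt("What hazards are visible in this image?", cases)
print(prompt)
```

Because the retrieved cases are injected through the prompt, the underlying model is left unchanged, which matches the abstract's claim that the method works without training.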
[AI-55] SpikeSTAG: Spatial-Temporal Forecasting via GNN-SNN Collaboration
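【Quick read】: This paper addresses the largely unexplored potential of spatial modeling in multivariate time-series forecasting, where conventional Spiking Neural Networks (SNNs) lack an effective mechanism for learning spatial structure. The key to its solution is a new SNN architecture that is among the first to integrate graph structure learning with spike-based temporal processing. It embeds time features and an adaptive matrix to remove the dependence on predefined graph structures; extracts sequence features with the Observation (OBS) Block; uses the Multi-Scale Spike Aggregation (MSSA) module with spiking SAGE layers for multi-hop feature extraction without floating-point operations; and finally applies a Dual-Path Spike Fusion (DSF) Block, which uses a spike-gated mechanism to merge spatial graph features with temporal dynamics, combining LSTM-processed sequences with spiking self-attention outputs to improve forecasting accuracy on long-sequence data.
Link: https://arxiv.org/abs/2508.02069
Authors: Bang Hu,Changze Lv,Mingjie Li,Yunpeng Liu,Xiaoqing Zheng,Fengzhe Zhang,Wei cao,Fan Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures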
Abstract:Spiking neural networks (SNNs), inspired by the spiking behavior of biological neurons, offer a distinctive approach for capturing the complexities of temporal data. However, their potential for spatial modeling in multivariate time-series forecasting remains largely unexplored. To bridge this gap, we introduce a brand new SNN architecture, which is among the first to seamlessly integrate graph structural learning with spike-based temporal processing for multivariate time-series forecasting. Specifically, we first embed time features and an adaptive matrix, eliminating the need for predefined graph structures. We then further learn sequence features through the Observation (OBS) Block. Building upon this, our Multi-Scale Spike Aggregation (MSSA) hierarchically aggregates neighborhood information through spiking SAGE layers, enabling multi-hop feature extraction while eliminating the need for floating-point operations. Finally, we propose a Dual-Path Spike Fusion (DSF) Block to integrate spatial graph features and temporal dynamics via a spike-gated mechanism, combining LSTM-processed sequences with spiking self-attention outputs, which effectively improves model accuracy on long-sequence datasets. Experiments show that our model surpasses the state-of-the-art SNN-based iSpikformer on all datasets and outperforms traditional temporal models at long horizons, thereby establishing a new paradigm for efficient spatial-temporal modeling.
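[AI-56] TRACEALIGN – Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs
【Quick read】: This paper addresses alignment drift in Large Language Models (LLMs) that have been fine-tuned to align with human values: when exposed to adversarial prompts, decoding perturbations, or paraphrased jailbreaks, such models can still produce unsafe or policy-violating content. The core challenge is that little is known about the training-time belief sources behind these failures. The key to its solution is the TraceAlign framework, which introduces the Belief Conflict Index (BCI) to quantify semantic inconsistency between generated spans and aligned policies, and uses suffix-array matching to retrieve relevant documents from the training corpus so that unsafe outputs can be traced to their source. On this basis the paper designs three complementary interventions: (i) TraceShield, an inference-time safety filter that refuses completions with high-BCI spans; (ii) Contrastive Belief Deconfliction Loss, which penalizes high-BCI continuations during Direct Preference Optimization (DPO); and (iii) Prov-Decode, a provenance-aware decoding strategy that vetoes beam expansions likely to yield high-BCI spans. On the authors' curated Alignment Drift Benchmark (ADB), these methods reduce alignment drift by up to 85% while keeping task performance stable (Δ < 0.2) and improving refusal quality. The paper also derives a theoretical upper bound on drift likelihood from suffix-array span statistics, linking memorization frequency and length to the risk of adversarial reactivation, and so offers a scalable, traceable, and grounded toolkit for understanding and mitigating alignment failures at their source.
Link: https://arxiv.org/abs/2508.02063
Authors: Amitava Das,Vinija Jain,Aman Chadha
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: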
Abstract:Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift, producing unsafe or policy-violating completions when exposed to adversarial prompts, decoding perturbations, or paraphrased jailbreaks. While prior work has behaviorally characterized alignment failure, little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model’s training corpus. Central to our approach is the Belief Conflict Index (BCI), which quantifies semantic inconsistency between generated spans and aligned policies, based on retrieved training documents using suffix-array matching. We propose three complementary interventions: (i) TraceShield, an inference-time safety filter that refuses completions with high-BCI spans, (ii) Contrastive Belief Deconfliction Loss, a contrastive fine-tuning objective penalizing high-BCI continuations during DPO, and (iii) Prov-Decode, a provenance-aware decoding strategy that vetoes beam expansions predicted to yield high-BCI spans. Together, these defenses reduce alignment drift by up to 85% on our curated Alignment Drift Benchmark (ADB) while preserving utility on standard tasks, with delta less than 0.2 and improved refusal quality. We further derive a theoretical upper bound on drift likelihood via suffix-array span statistics, linking memorization frequency and length to adversarial reactivation risk. TraceAlign thus provides the first scalable, traceable, and grounded toolkit for understanding and mitigating alignment failures at source. To encourage further exploration and development, we open-source our implementation at: this https URL
[AI-57] RICL: Adding In-Context Adaptability to Pre-Trained Vision-Language-Action Models
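[Quick read]: This paper addresses the lack of an effective way for end users to teach current multi-task vision-language-action (VLA) models in robotics. Although these models show promise on new tasks without fine-tuning, in practice end users still cannot quickly and flexibly guide them to adapt to new scenes or tasks. The key to the solution is Retraining for In-Context Learning (RICL), a post hoc method for injecting in-context learning ability: with a small robot demonstration dataset and a suitable fine-tuning recipe, a VLA that lacks in-context learning gains the ability to adapt at inference time from a small number (10 to 20) of demonstrations of a new task, which improves task performance without any parameter updates.
Link: https://arxiv.org/abs/2508.02062
Authors: Kaustubh Sridhar, Souradeep Dutta, Dinesh Jayaraman, Insup Lee
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Conference on Robot Learning 2025 (CoRL 2025), 17 pages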
Abstract:Multi-task “vision-language-action” (VLA) models have recently demonstrated increasing promise as generalist foundation models for robotics, achieving non-trivial performance out of the box on new tasks in new environments. However, for such models to be truly useful, an end user must have easy means to teach them to improve. For language and vision models, the emergent ability to perform in-context learning (ICL) has proven to be a versatile and highly useful interface to easily teach new tasks with no parameter finetuning. Unfortunately, VLAs pre-trained with imitation learning objectives do not naturally acquire ICL abilities. In this paper, we demonstrate that, with the right finetuning recipe and a small robot demonstration dataset, it is possible to inject in-context adaptability post hoc into such a VLA. After retraining for in-context learning (RICL), our system permits an end user to provide a small number (10-20) of demonstrations for a new task. RICL then fetches the most relevant portions of those demonstrations into the VLA context to exploit ICL, performing the new task and boosting task performance. We apply RICL to inject ICL into the $\pi_0$-FAST VLA, and show that it permits large in-context improvements for a variety of new manipulation tasks with only 20 demonstrations per task, without any parameter updates. When parameter updates on the target task demonstrations are possible, RICL finetuning further boosts performance. We release code and model weights for RICL-$\pi_0$-FAST alongside the paper to enable, for the first time, a simple in-context learning interface for new manipulation tasks. Website: this https URL.
[AI-58] Epi2-Net: Advancing Epidemic Dynamics Forecasting with Physics-Inspired Neural Networks
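[Quick read]: This paper addresses the limitations of existing epidemic-dynamics forecasting methods in modeling complex real-world dynamics: mechanism-based models are constrained by predefined compartmental structures and simplified system assumptions, while data-driven models ignore physical and epidemiological constraints and can therefore produce biased or misleading results. The key to the solution is Epi$^2$-Net, a physics-inspired neural network framework whose core idea is to reconceptualize epidemic transmission as a physical transport process. It introduces the concept of neural epidemic transport and explicitly integrates physical constraints with neural modules to model the spatio-temporal patterns of epidemics, which the authors report improves forecasting accuracy over state-of-the-art methods.
Link: https://arxiv.org/abs/2508.02049
Authors: Rui Sun, Chenghua Gong, Tianjun Gu, Yuhao Zheng, Jie Ding, Juyuan Zhang, Liming Pan, Linyuan Lü
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: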
Abstract:Advancing epidemic dynamics forecasting is vital for targeted interventions and safeguarding public health. Current approaches mainly fall into two categories: mechanism-based and data-driven models. Mechanism-based models are constrained by predefined compartmental structures and oversimplified system assumptions, limiting their ability to model complex real-world dynamics, while data-driven models focus solely on intrinsic data dependencies without physical or epidemiological constraints, risking biased or misleading representations. Although recent studies have attempted to integrate epidemiological knowledge into neural architectures, most of them fail to reconcile explicit physical priors with neural representations. To overcome these obstacles, we introduce Epi$^2$-Net, an Epidemic Forecasting Framework built upon Physics-Inspired Neural Networks. Specifically, we propose reconceptualizing epidemic transmission from the physical transport perspective, introducing the concept of neural epidemic transport. Further, we present a physics-inspired deep learning framework, and integrate physical constraints with neural modules to model spatio-temporal patterns of epidemic dynamics. Experiments on real-world datasets have demonstrated that Epi$^2$-Net outperforms state-of-the-art methods in epidemic forecasting, providing a promising solution for future epidemic containment. The code is available at: this https URL.
[AI-59] Graph Unlearning via Embedding Reconstruction – A Range-Null Space Decomposition Approach
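[Quick read]: This paper addresses graph unlearning for graph neural networks (GNNs) under node-level deletion requests, where existing methods perform poorly because removing a node is more disruptive than removing edges. The key to the solution is a novel node unlearning method that reverses the aggregation process in the GNN through embedding reconstruction and adopts Range-Null Space Decomposition to model the interaction learning among nodes, achieving efficient and effective node unlearning without retraining the model.
Link: https://arxiv.org/abs/2508.02044
Authors: Hang Yin, Zipeng Liu, Xiaoyong Peng, Liyao Xiang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 15 pages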
Abstract:Graph unlearning is tailored for GNNs to handle widespread and various graph structure unlearning requests, which remain largely unexplored. The GIF (graph influence function) achieves validity under partial edge unlearning, but faces challenges in dealing with the more disruptive case of node unlearning. To avoid the overhead of retraining and preserve model utility after unlearning, we propose a novel node unlearning method that reverses the process of aggregation in GNNs by embedding reconstruction and adopts Range-Null Space Decomposition for the nodes’ interaction learning. Experimental results on multiple representative datasets demonstrate the SOTA performance of our proposed approach.
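For readers unfamiliar with the term, range-null space decomposition is a standard linear-algebra identity. How the paper applies it to node interactions is not stated in the abstract, so the symbols below ($A$, $x$) are generic placeholders rather than the authors' notation.

```latex
% For any matrix A \in \mathbb{R}^{m \times n} with Moore-Penrose pseudo-inverse A^{+},
% every vector x \in \mathbb{R}^{n} splits into two orthogonal parts:
\[
  x \;=\; \underbrace{A^{+} A\, x}_{\text{part in the range of } A^{+}}
     \;+\; \underbrace{(I - A^{+} A)\, x}_{\text{part in the null space of } A},
\]
% and the second part is invisible to A, because A A^{+} A = A:
\[
  A\,(I - A^{+} A)\, x \;=\; (A - A A^{+} A)\, x \;=\; 0 .
\]
```

The practical consequence is that the null-space part can be changed freely without altering $Ax$, which is the property such decompositions are usually chosen for.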
[AI-60] Confidence-Diversity Calibration of AI Judgement Enables Reliable Qualitative Coding
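[Quick read]: This paper addresses the difficulty of assessing the reliability of large language model (LLM) output in qualitative coding, especially in domains where human experts seldom agree. The key to the solution is a dual signal that combines model self-confidence with model diversity: mean self-confidence alone already predicts inter-model agreement well (Pearson r = 0.82), and adding the normalised Shannon entropy of the panel's votes as a diversity measure raises explanatory power to R² = 0.979. The resulting confidence-diversity signal drives a three-tier automated workflow that auto-accepts 35% of segments (with an audit-detected error rate of 5%) and routes the remainder for targeted human review, cutting manual effort by up to 65%; the gains are replicated across domains including finance, medicine, and law.
Link: https://arxiv.org/abs/2508.02029
Authors: Zhilong Zhao, Yindi Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 5 figures. Code and data available at this https URL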
Abstract:LLMs enable qualitative coding at large scale, but assessing the reliability of their output remains challenging in domains where human experts seldom agree. Analysing 5,680 coding decisions from eight state-of-the-art LLMs across ten thematic categories, we confirm that a model’s mean self-confidence already tracks inter-model agreement closely (Pearson $r = 0.82$). Adding model diversity, quantified as the normalised Shannon entropy of the panel’s votes, turns this single cue into a dual signal that explains agreement almost completely ($R^2 = 0.979$). The confidence-diversity duo enables a three-tier workflow that auto-accepts 35% of segments with 5% audit-detected error and routes the remainder for targeted human review, cutting manual effort by up to 65%. Cross-domain replication on six public datasets spanning finance, medicine, law and multilingual tasks confirms these gains (kappa improvements of 0.20-0.78). Our results establish a generalisable, evidence-based criterion for calibrating AI judgement in qualitative research.
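The diversity signal described in the abstract, the normalised Shannon entropy of a panel's votes, is simple to compute. The sketch below is illustrative only: it assumes the entropy is normalised by the logarithm of the number of available labels (the abstract does not say which normaliser is used), and the tier names and thresholds in `route` are hypothetical placeholders, not values from the paper.

```python
import math
from collections import Counter


def normalised_entropy(votes, num_labels):
    """Shannon entropy of the vote distribution, divided by log(num_labels).

    Returns 0.0 when all models agree and 1.0 when votes are spread
    evenly over all `num_labels` labels (assumed normalisation).
    """
    counts = Counter(votes)
    total = len(votes)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return max(0.0, entropy) / math.log(num_labels)


def route(mean_confidence, diversity, conf_min=0.9, div_max=0.2):
    """Hypothetical three-tier routing; thresholds are placeholders."""
    if mean_confidence >= conf_min and diversity <= div_max:
        return "auto_accept"
    if mean_confidence >= conf_min or diversity <= div_max:
        return "spot_check"
    return "human_review"


# Eight models voting on one text segment (labels are made up).
panel_votes = ["risk", "risk", "risk", "risk", "risk", "risk", "cost", "risk"]
diversity = normalised_entropy(panel_votes, num_labels=10)
tier = route(mean_confidence=0.93, diversity=diversity)
```

Keeping the two signals separate until the routing step makes it easy to recalibrate thresholds per domain without recomputing the entropy.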
[AI-61] Dynamic Context Adaptation for Consistent Role-Playing Agents with Retrieval-Augmented Generations
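[Quick read]: This paper addresses the difficulty that role-playing agents (RPAs) based on Retrieval-Augmented Generation (RAG) have in staying in character when answering questions that fall outside the character's knowledge. The key to the solution is the AMADEUS framework, whose core components are the Adaptive Context-aware Text Splitter (ACTS), Guided Selection (GS), and the Attribute Extractor (AE). AE identifies a character's general attributes from the chunks retrieved by GS and uses them as the final context, so that the agent maintains a consistent persona even when answering out-of-knowledge questions.
Link: https://arxiv.org/abs/2508.02016
Authors: Jeiyoon Park, Yongshin Han, Minseop Kim, Kisu Yang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: preprint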
Abstract:We propose AMADEUS, which is composed of Adaptive Context-aware Text Splitter (ACTS), Guided Selection (GS), and Attribute Extractor (AE). ACTS finds an optimal chunk length and hierarchical contexts for each character. AE identifies a character’s general attributes from the chunks retrieved by GS and uses these attributes as a final context to maintain robust persona consistency even when answering out-of-knowledge questions. To facilitate the development and evaluation of RAG-based RPAs, we construct CharacterRAG, a role-playing dataset that consists of persona documents for 15 distinct fictional characters totaling 976K written characters, and 450 question-and-answer pairs. We find that our framework effectively models not only the knowledge possessed by characters, but also various attributes such as personality.
[AI-62] DIRF: A Framework for Digital Identity Protection and Clone Governance in Agentic AI Systems
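[Quick read]: This paper addresses the threats to the integrity of personal identity posed by the rapid advancement of generative AI, including digital cloning, sophisticated impersonation, and the unauthorized monetization of identity-related data. The key to the solution is the Digital Identity Rights Framework (DIRF), a structured security and governance model. Structured across nine domains and 63 controls, it integrates legal, technical, and hybrid enforcement mechanisms to secure consent, traceability, and monetization of digital identity, providing a unified and actionable set of controls for enforcing identity rights in AI-driven systems.
Link: https://arxiv.org/abs/2508.01997
Authors: Hammad Atta, Muhammad Zeeshan Baig, Yasir Mehmood, Nadeem Shahzad, Ken Huang, Muhammad Aziz Ul Haq, Muhammad Awais, Kamal Ahmed, Anthony Green
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: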
Abstract:The rapid advancement and widespread adoption of generative artificial intelligence (AI) pose significant threats to the integrity of personal identity, including digital cloning, sophisticated impersonation, and the unauthorized monetization of identity-related data. Mitigating these risks necessitates the development of robust AI-generated content detection systems, enhanced legal frameworks, and ethical guidelines. This paper introduces the Digital Identity Rights Framework (DIRF), a structured security and governance model designed to protect behavioral, biometric, and personality-based digital likeness attributes to address this critical need. Structured across nine domains and 63 controls, DIRF integrates legal, technical, and hybrid enforcement mechanisms to secure digital identity consent, traceability, and monetization. We present the architectural foundations, enforcement strategies, and key use cases supporting the need for a unified framework. This work aims to inform platform builders, legal entities, and regulators about the essential controls needed to enforce identity rights in AI-driven systems.
[AI-63] Controllable and Stealthy Shilling Attacks via Dispersive Latent Diffusion
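[Quick read]: This paper addresses the vulnerability of recommender systems (RSs) to shilling attacks, and in particular the failure of existing attack models to achieve strong promotion of target items while keeping behavior realistic enough to evade detection. The key to the solution is DLDA, a diffusion-based attack framework. Its core contributions are a conditional latent diffusion process in a pre-aligned collaborative embedding space, which gives fine-grained control over the fake user profiles it synthesizes, and a dispersive regularization mechanism that increases the variability and realism of the generated behavioral patterns, so that target items are promoted strongly while the risk of detection falls. Experiments on three real-world datasets and five popular RS models show that DLDA outperforms prior attacks, exposing weaknesses in the defenses of current recommender systems.
Link: https://arxiv.org/abs/2508.01987
Authors: Shutong Qiao, Wei Yuan, Junliang Yu, Tong Chen, Quoc Viet Hung Nguyen, Hongzhi Yin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: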
Abstract:Recommender systems (RSs) are now fundamental to various online platforms, but their dependence on user-contributed data leaves them vulnerable to shilling attacks that can manipulate item rankings by injecting fake users. Although widely studied, most existing attack models fail to meet two critical objectives simultaneously: achieving strong adversarial promotion of target items while maintaining realistic behavior to evade detection. As a result, the true severity of shilling threats that manage to reconcile the two objectives remains underappreciated. To expose this overlooked vulnerability, we present DLDA, a diffusion-based attack framework that can generate highly effective yet indistinguishable fake users by enabling fine-grained control over target promotion. Specifically, DLDA operates in a pre-aligned collaborative embedding space, where it employs a conditional latent diffusion process to iteratively synthesize fake user profiles with precise target item control. To evade detection, DLDA introduces a dispersive regularization mechanism that promotes variability and realism in generated behavioral patterns. Extensive experiments on three real-world datasets and five popular RS models demonstrate that, compared to prior attacks, DLDA consistently achieves stronger item promotion while remaining harder to detect. These results highlight that modern RSs are more vulnerable than previously recognized, underscoring the urgent need for more robust defenses.
[AI-64] Accelerating LLM Reasoning via Early Rejection with Partial Reward Modeling
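[Quick read]: This paper addresses the heavy computational overhead incurred when large language models (LLMs) rely on Process Reward Models (PRMs) to reward intermediate steps in complex reasoning tasks. Existing methods improve reasoning quality, but generating many candidate solutions is computationally wasteful. The key to the solution is the hypothesis that PRMs are also Partial Reward Models: the scores they assign to partially completed reasoning steps are predictive of final output quality, which permits early rejection based on intermediate token-level signals. Terminating low-quality paths early reduces inference FLOPs without degrading final performance; the reported reduction ranges from 1.4× to 9×.
Link: https://arxiv.org/abs/2508.01969
Authors: Seyyed Saeid Cheshmi, Azal Ahmad Khan, Xinran Wang, Zirui Liu, Ali Anwar
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: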
Abstract:Large Language Models (LLMs) are increasingly relied upon for solving complex reasoning tasks in domains such as mathematics, logic, and multi-step question answering. A growing line of work seeks to improve reasoning quality by scaling inference-time compute, particularly through Process Reward Models (PRMs), used to reward the reasoning at intermediate steps. While effective, these methods introduce substantial computational overhead, especially when generating large numbers of solutions in parallel. In this paper, we investigate whether PRMs can be used mid-generation to provide early signals that enable the rejection of suboptimal candidates before full generation of a step is complete. We introduce the hypothesis that PRMs are also Partial Reward Models, meaning that the scores they assign to partially completed reasoning steps are predictive of final output quality. This allows for principled early rejection based on intermediate token-level signals. We support this hypothesis both theoretically, by proving that the risk of discarding optimal beams decreases exponentially with generation length, and empirically, by demonstrating a strong correlation between partial and final rewards across multiple reward models. On math reasoning benchmarks, our method achieves a 1.4$\times$ to 9$\times$ reduction in inference FLOPs without degrading final performance. These results suggest that early rejection is a powerful mechanism for improving the compute-efficiency of reasoning in LLMs.
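The early-rejection idea can be sketched in a few lines: score each candidate after only part of a reasoning step has been generated, keep the best few, and finish only those. Everything below is a toy illustration under stated assumptions: the candidate strings, the score table standing in for a PRM, and the token counts are hypothetical, and the cost of calling the reward model itself is ignored.

```python
def early_reject(partials, partial_score, keep):
    """Keep the `keep` highest-scoring partial candidates and drop the rest."""
    ranked = sorted(partials, key=partial_score, reverse=True)
    return ranked[:keep]


def token_cost(num_candidates, step_len, check_at, keep):
    """Tokens generated for one step without and with early rejection.

    Without: every candidate is generated for the full `step_len` tokens.
    With: every candidate is generated up to `check_at` tokens, then only
    the `keep` survivors are completed.
    """
    full = num_candidates * step_len
    early = num_candidates * check_at + keep * (step_len - check_at)
    return full, early


# Hypothetical partial reasoning steps with stand-in partial rewards.
toy_scores = {
    "Let x be the": 0.71,
    "We guess that": 0.22,
    "By symmetry,": 0.64,
    "Hmm, maybe": 0.10,
}
survivors = early_reject(list(toy_scores), toy_scores.get, keep=2)
full_tokens, early_tokens = token_cost(
    num_candidates=16, step_len=100, check_at=25, keep=4
)
```

The saving grows as the check point moves earlier and the survivor set shrinks, which is where the abstract's theoretical result on the risk of discarding the best beam becomes relevant.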
[AI-65] Kronecker-LoRA: hybrid Kronecker-LoRA adapters for scalable sustainable fine-tuning
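[Quick read]: This paper addresses the trade-off between parameter efficiency and expressivity when fine-tuning large pre-trained language models across many tasks, that is, how to shrink the adapter while maintaining or even improving performance. The key to the solution is Kron-LoRA, a two-stage adapter: it first factorizes the update ΔW of each frozen linear layer as a Kronecker product ΔW = A ⊗ B, and then compresses B with a low-rank LoRA decomposition B ≈ B₁B₂. By exploiting the rank property rank(A ⊗ B) = rank(A) × rank(B), Kron-LoRA retains high expressivity with roughly one quarter of the parameters of a standard rank-8 LoRA adapter, and its compact adapter matrices quantize to 8-bit or 4-bit with less accuracy loss, which lowers memory and storage costs and suits on-device deployment.
Link: https://arxiv.org/abs/2508.01961
Authors: Yixin Shen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: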
Abstract:Fine-tuning massive pre-trained language models across many tasks demands adapters that are both parameter-efficient and highly expressive. We introduce Kron-LoRA, a two-stage adapter that first factorizes each frozen linear update as a Kronecker product $\Delta W = A \otimes B$ and then compresses $B \in \mathbb{R}^{d_{B2} \times d_{B1}}$ via a rank-$r$ LoRA decomposition $B \approx B_1 B_2$. By leveraging $\mathrm{rank}(A \otimes B) = \mathrm{rank}(A)\,\mathrm{rank}(B)$, Kron-LoRA retains the expressivity of the update while using up to 4$\times$ fewer parameters than a standard rank-8 LoRA adapter. Its compact adapter matrices also quantize to 8- or 4-bit with less accuracy degradation than LoRA, enabling further memory and storage savings for on-device deployment. We benchmark on DistilBERT and Mistral-7B across five tasks (PIQA, HellaSwag, WinoGrande, ARC-Easy, ARC-Challenge) over multiple epochs of adapter-only tuning: on DistilBERT, an 840K-parameter Kron-LoRA matches LoRA-16’s performance, and on Mistral-7B, a 5.7M-parameter Kron-LoRA rivals LoRA-8 with modest memory savings and only a 3-8% speed overhead. In sequential fine-tuning from ARC-Challenge to ARC-Easy, Kron-LoRA retains 55.18% accuracy versus 53.17% for LoRA-8, despite using only one-quarter of the adapter parameters, underscoring its competitive cross-task transfer performance. By uniting Kronecker structure, low-rank compression, quantization-friendliness, and by providing transparent trade-off analysis, Kron-LoRA offers a scalable, sustainable, and continual-learning-ready solution for multi-task adaptation of large language models.
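The two ingredients in the abstract, the Kronecker rank identity and the parameter count of a Kronecker-plus-low-rank update, can be checked on a tiny example. This is a sketch in pure Python with exact rational arithmetic, not the author's code: the helper names are hypothetical, and the layer shapes used for the parameter count (a 4096 by 4096 layer with a 2 by 2 factor $A$) are assumptions chosen for illustration, so the resulting ratio should not be read as the paper's reported figure.

```python
from fractions import Fraction


def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def matmul(X, Y):
    """Plain matrix product of two lists-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def rank(M):
    """Matrix rank via Gaussian elimination over exact fractions."""
    rows = [[Fraction(v) for v in row] for row in M]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [v - factor * w for v, w in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


def kron_lora_params(a_shape, b_shape, r):
    """Parameters stored for A plus the two low-rank factors of B."""
    (a_out, a_in), (b_out, b_in) = a_shape, b_shape
    return a_out * a_in + r * (b_out + b_in)


def lora_params(d_out, d_in, r):
    """Parameters stored by a standard rank-r LoRA adapter."""
    return r * (d_out + d_in)


A = [[1, 2], [3, 4]]        # rank 2
B1 = [[1], [2], [3]]        # 3 x 1 factor
B2 = [[1, 0, 1]]            # 1 x 3 factor
B = matmul(B1, B2)          # rank-1 compression of B
delta_w = kron(A, B)        # 6 x 6 update with rank(A) * rank(B) = 2

kron_count = kron_lora_params((2, 2), (2048, 2048), r=8)
lora_count = lora_params(4096, 4096, r=8)
```

Because the rank of the full update is the product of the factor ranks, a small full-rank $A$ multiplies the effective rank of the compressed $B$ at the cost of only a handful of extra parameters.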
[AI-66] Agent-Based Feature Generation from Clinical Notes for Outcome Prediction
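[Quick read]: This paper addresses the difficulty of extracting meaningful features from unstructured clinical notes in electronic health records (EHRs) to support predictive modeling. Existing approaches either rely on manual clinician feature generation (CFG), which is labor-intensive, or on fully automated representational feature generation (RFG), which lacks interpretability and clinical relevance. The key to the solution is SNOW (Scalable Note-to-Outcome Workflow), a modular multi-agent system powered by large language models (LLMs) that generates structured clinical features autonomously. Specialized agents handle feature discovery, extraction, validation, post-processing, and aggregation, producing interpretable features without clinician involvement, and on 5-year prostate cancer recurrence prediction SNOW performs on par with manual feature engineering (AUC-ROC: 0.761 vs. 0.771).
Link: https://arxiv.org/abs/2508.01956
Authors: Jiayi Wang, Jacqueline Jil Vallon, Neil Panjwani, Xi Ling, Sushmita Vij, Sandy Srinivas, John Leppert, Mark K. Buyyounouski, Mohsen Bayati
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: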
Abstract:Electronic health records (EHRs) contain rich unstructured clinical notes that could enhance predictive modeling, yet extracting meaningful features from these notes remains challenging. Current approaches range from labor-intensive manual clinician feature generation (CFG) to fully automated representational feature generation (RFG) that lack interpretability and clinical relevance. Here we introduce SNOW (Scalable Note-to-Outcome Workflow), a modular multi-agent system powered by large language models (LLMs) that autonomously generates structured clinical features from unstructured notes without human intervention. We evaluated SNOW against manual CFG, clinician-guided LLM approaches, and RFG methods for predicting 5-year prostate cancer recurrence in 147 patients from Stanford Healthcare. While manual CFG achieved the highest performance (AUC-ROC: 0.771), SNOW matched this performance (0.761) without requiring any clinical expertise, significantly outperforming both baseline features alone (0.691) and all RFG approaches. The clinician-guided LLM method also performed well (0.732) but still required expert input. SNOW’s specialized agents handle feature discovery, extraction, validation, post-processing, and aggregation, creating interpretable features that capture complex clinical information typically accessible only through manual review. Our findings demonstrate that autonomous LLM systems can replicate expert-level feature engineering at scale, potentially transforming how clinical ML models leverage unstructured EHR data while maintaining the interpretability essential for clinical deployment.
[AI-67] Flow-Aware GNN for Transmission Network Reconfiguration via Substation Breaker Optimization
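[Quick read]: This paper addresses discrete topology optimization in large-scale power networks: selecting the substation breaker configuration that maximizes cross-region power exports subject to physical constraints. The problem is typically formulated as a mixed-integer program (MIP), which is NP-hard and too expensive to solve in real time on large grids. The core of the proposed solution is a two-stage neural architecture: a line-graph neural network (LGNN) first approximates DC power flows for a given topology, and a heterogeneous graph neural network (HeteroGNN) then predicts breaker states under structural and physical constraints, with a physics-informed consistency loss enforcing Kirchhoff's law on the predicted flows. The method reduces inference time from hours to milliseconds while improving power exports by up to 18% over baseline topologies.
Link: https://arxiv.org/abs/2508.01951
Authors: Dekang Meng, Rabab Haider, Pascal van Hentenryck
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: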
Abstract:This paper introduces OptiGridML, a machine learning framework for discrete topology optimization in power grids. The task involves selecting substation breaker configurations that maximize cross-region power exports, a problem typically formulated as a mixed-integer program (MIP) that is NP-hard and computationally intractable for large networks. OptiGridML replaces repeated MIP solves with a two-stage neural architecture: a line-graph neural network (LGNN) that approximates DC power flows for a given network topology, and a heterogeneous GNN (HeteroGNN) that predicts breaker states under structural and physical constraints. A physics-informed consistency loss connects these components by enforcing Kirchhoff’s law on predicted flows. Experiments on synthetic networks with up to 1,000 breakers show that OptiGridML achieves power export improvements of up to 18% over baseline topologies, while reducing inference time from hours to milliseconds. These results demonstrate the potential of structured, flow-aware GNNs for accelerating combinatorial optimization in physical networked systems.
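A physics-informed consistency loss of the kind mentioned in the abstract can be pictured as a penalty on nodal power imbalance. The sketch below assumes that the law being enforced is nodal balance (Kirchhoff's current law in the DC approximation); the abstract does not spell this out, and the function names and the three-bus network are hypothetical.

```python
def kcl_residuals(injections, line_flows):
    """Per-bus imbalance: net injection minus outgoing flow plus incoming flow.

    injections: {bus: net power injected at that bus}
    line_flows: [(from_bus, to_bus, predicted_flow)]
    """
    residual = dict(injections)
    for src, dst, flow in line_flows:
        residual[src] = residual.get(src, 0.0) - flow
        residual[dst] = residual.get(dst, 0.0) + flow
    return residual


def consistency_loss(injections, line_flows):
    """Mean squared nodal imbalance; zero when the flows balance at every bus."""
    residual = kcl_residuals(injections, line_flows)
    return sum(v * v for v in residual.values()) / len(residual)


# Hypothetical 3-bus network: bus 0 generates, buses 1 and 2 consume.
injections = {0: 1.0, 1: -0.25, 2: -0.75}
balanced = [(0, 1, 0.25), (0, 2, 0.75)]
unbalanced = [(0, 1, 0.5), (0, 2, 0.75)]

loss_ok = consistency_loss(injections, balanced)
loss_bad = consistency_loss(injections, unbalanced)
```

In a training loop the same residual would be computed on tensors so that gradients flow back into the flow predictor; the list-based version here only shows what is being penalized.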
[AI-68] Inferring Reward Machines and Transition Machines from Partially Observable Markov Decision Processes
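[Quick read]: This paper addresses the difficulty of policy learning in Partially Observable Markov Decision Processes (POMDPs) caused by non-Markovian observations. Existing methods rely on Reward Machines (RMs) to capture non-Markovianity, but RMs handle only reward-based non-Markovian structure, which leads to unnatural problem formulations, and automated inference algorithms are computationally very expensive. To overcome this, the authors introduce Transition Machines (TMs) to complement RMs and design a unified Dual Behavior Mealy Machine (DBMM) that subsumes both. On this basis they propose DB-RPNI, a passive automata learning algorithm that infers DBMMs efficiently by avoiding the costly reductions required by prior work; combined with optimization techniques and sufficient conditions for inferring the minimal correct automaton, it achieves speedups of up to three orders of magnitude over state-of-the-art baselines.
Link: https://arxiv.org/abs/2508.01947
Authors: Yuly Wu, Jiamou Liu, Libo Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures. Under review as a conference paper. Source code is available at: this https URL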
Abstract:Partially Observable Markov Decision Processes (POMDPs) are fundamental to many real-world applications. Although reinforcement learning (RL) has shown success in fully observable domains, learning policies from traces in partially observable environments remains challenging due to non-Markovian observations. Inferring an automaton to handle the non-Markovianity is a proven effective approach, but faces two limitations: 1) existing automaton representations focus only on reward-based non-Markovianity, leading to unnatural problem formulations; 2) inference algorithms face enormous computational costs. For the first limitation, we introduce Transition Machines (TMs) to complement existing Reward Machines (RMs). To develop a unified inference algorithm for both automata types, we propose the Dual Behavior Mealy Machine (DBMM) that subsumes both TMs and RMs. We then introduce DB-RPNI, a passive automata learning algorithm that efficiently infers DBMMs while avoiding the costly reductions required by prior work. We further develop optimization techniques and identify sufficient conditions for inferring the minimal correct automata. Experimentally, our inference method achieves speedups of up to three orders of magnitude over SOTA baselines.
[AI-69] L3M+P: Lifelong Planning with Large Language Models
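[Quick read]: This paper addresses two challenges in applying agents that combine large language models (LLMs) with classical planning to general-purpose service robots. First, classical planning algorithms usually require a detailed and consistent specification of the environment, which is hard to obtain in practice. Second, existing frameworks focus on isolated planning tasks and do not support a robot that, over a long-term continuous deployment, must maintain a dynamic memory of the environment, update it from multi-modal inputs, and extract planning knowledge from it. The key to the solution is L3M+P (Lifelong LLM+P), a framework built around an external knowledge graph that represents the world state. The graph is updated continuously from multiple sources, including sensor data and natural-language interaction with humans, and predefined rules keep graph updates consistent. At planning time the system retrieves context from the knowledge graph and generates a problem definition for a classical planner, which significantly improves both the accurate registration of natural-language state changes and the correctness of the generated plans.
Link: https://arxiv.org/abs/2508.01917
Authors: Krish Agarwal, Yuqian Jiang, Jiaheng Hu, Bo Liu, Peter Stone
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: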
Abstract:By combining classical planning methods with large language models (LLMs), recent research such as LLM+P has enabled agents to plan for general tasks given in natural language. However, scaling these methods to general-purpose service robots remains challenging: (1) classical planning algorithms generally require a detailed and consistent specification of the environment, which is not always readily available; and (2) existing frameworks mainly focus on isolated planning tasks, whereas robots are often meant to serve in long-term continuous deployments, and therefore must maintain a dynamic memory of the environment which can be updated with multi-modal inputs and extracted as planning knowledge for future tasks. To address these two issues, this paper introduces L3M+P (Lifelong LLM+P), a framework that uses an external knowledge graph as a representation of the world state. The graph can be updated from multiple sources of information, including sensory input and natural language interactions with humans. L3M+P enforces rules for the expected format of the absolute world state graph to maintain consistency between graph updates. At planning time, given a natural language description of a task, L3M+P retrieves context from the knowledge graph and generates a problem definition for classical planners. Evaluated on household robot simulators and on a real-world service robot, L3M+P achieves significant improvement over baseline methods both on accurately registering natural language state changes and on correctly generating plans, thanks to the knowledge graph retrieval and verification.
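The pipeline in the abstract (update a world-state graph, then turn retrieved context into a problem definition for a classical planner) can be illustrated with a small sketch. It is not the L3M+P implementation: the triple format, the predicate names `in` and `at`, the "one value per functional predicate" consistency rule, and both function names are assumptions made for illustration.

```python
def apply_update(triples, new_triple, functional=("in", "at")):
    """Add a fact; for functional predicates, replace the subject's old value.

    Hypothetical consistency rule: an object can only be `in` or `at`
    one place at a time, so the previous fact is dropped.
    """
    subject, predicate, _ = new_triple
    kept = [
        t for t in triples
        if not (predicate in functional and t[0] == subject and t[1] == predicate)
    ]
    return kept + [new_triple]


def graph_to_pddl(problem, domain, triples, goal_triples):
    """Render (subject, predicate, object) triples as a PDDL problem string."""
    objects = sorted({node for s, _, o in triples + goal_triples for node in (s, o)})
    init = " ".join(f"({p} {s} {o})" for s, p, o in triples)
    goal = " ".join(f"({p} {s} {o})" for s, p, o in goal_triples)
    return (
        f"(define (problem {problem}) (:domain {domain})\n"
        f"  (:objects {' '.join(objects)})\n"
        f"  (:init {init})\n"
        f"  (:goal (and {goal})))"
    )


# World state before and after a natural-language update such as
# "I moved the cup to the office" (already parsed into a triple).
state = [("cup", "in", "kitchen"), ("robot", "at", "hallway")]
state = apply_update(state, ("cup", "in", "office"))

pddl = graph_to_pddl(
    problem="fetch-cup",
    domain="household",
    triples=state,
    goal_triples=[("cup", "in", "kitchen")],
)
```

Keeping the graph as the single source of truth means the planner input is regenerated from current facts on every request, so stale facts removed by an update never reach the planner.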
[AI-70] How Does Controllability Emerge In Language Models During Pretraining?
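[Quick read]: This paper addresses the lack of a principled basis for the effectiveness of interventions that control concepts (such as emotion, style, or truthfulness) in language models; current methods rely mostly on heuristics and trial and error. The key to the solution is a unified framework called the Intervention Detector (ID), which analyzes hidden states and representations to reveal how linear steerability evolves during training. The study finds that linear steerability emerges mainly at intermediate stages of training, that even closely related concepts (for example, anger and sadness) become steerable at different stages, and that this emergence correlates strongly with concepts becoming linearly separable in the hidden space.
Link: https://arxiv.org/abs/2508.01892
Authors: Jianshu She, Xinyue Li, Eric Xing, Zhengzhong Liu, Qirong Ho
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: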
Abstract:Language models can be steered by modifying their internal representations to control concepts such as emotion, style, or truthfulness in generation. However, the conditions for an effective intervention remain unclear and are often validated through heuristics and trial-and-error. To fill this gap, we demonstrate that intervention efficacy, measured by linear steerability (i.e., the ability to adjust output via linear transformations of hidden states), emerges during intermediate stages of training. Moreover, even closely related concepts (e.g., anger and sadness) exhibit steerability emergence at distinct stages of training. To better interpret the dynamics of steerability during training, we adapt existing intervention techniques into a unified framework, referred to as the “Intervention Detector” (ID), which is designed to reveal how linear steerability evolves over the course of training through hidden state and representation analysis. ID reveals that concepts become increasingly linearly separable in the hidden space as training progresses, which strongly correlates with the emergence of linear steerability. We further introduce ID-based metrics, such as heatmaps, entropy trends, and cosine similarity, to help interpret how linear steerability evolves throughout training. In addition, we apply ID across different model families to ensure the generality of our findings on steerability dynamics.
[AI-71] Multi-turn Natural Language to Graph Query Language Translation
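[Quick read]: This paper addresses the inability of single-turn methods for translating natural language into graph query language (NL2GQL) to handle multi-turn interaction and complex context dependencies, given that in practice users often adjust queries step by step, explore connections between entities, or request additional details over several dialogue turns. The key to the solution is an automated method, based on large language models (LLMs), for constructing high-quality multi-turn NL2GQL datasets. The authors apply it to build the MTGQL dataset from a financial market graph database and propose three types of baseline methods for evaluating multi-turn NL2GQL translation, laying a foundation for future research.
Link: https://arxiv.org/abs/2508.01871
Authors: Yuanyuan Liang, Lei Pan, Tingyu Xie, Yunshi Lan, Weining Qian
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 21 pages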
Abstract:In recent years, research on transforming natural language into graph query language (NL2GQL) has been increasing. Most existing methods focus on single-turn transformation from NL to GQL. In practical applications, user interactions with graph databases are typically multi-turn, dynamic, and context-dependent. While single-turn methods can handle straightforward queries, more complex scenarios often require users to iteratively adjust their queries, investigate the connections between entities, or request additional details across multiple dialogue turns. Research focused on single-turn conversion fails to effectively address multi-turn dialogues and complex context dependencies. Additionally, the scarcity of high-quality multi-turn NL2GQL datasets further hinders the progress of this field. To address this challenge, we propose an automated method for constructing multi-turn NL2GQL datasets based on Large Language Models (LLMs), and apply this method to develop the MTGQL dataset, which is constructed from a financial market graph database and will be publicly released for future research. Moreover, we propose three types of baseline methods to assess the effectiveness of multi-turn NL2GQL translation, thereby laying a solid foundation for future research.
[AI-72] ProKG-Dial: Progressive Multi-Turn Dialogue Construction with Domain Knowledge Graphs
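[Quick read]: This paper addresses the lack of domain-specific precision of current large language models (LLMs) in professional settings, together with the limitations of existing methods for building high-quality domain-specific multi-turn dialogue datasets (manual annotation, simulated human-LLM interaction, and role-based LLM dialogue), which are resource-intensive or fall short in dialogue quality and domain coverage. The key to the solution is the ProKG-Dial framework, which uses the structured nature of domain-specific knowledge graphs (KGs) to encode complex domain knowledge and relationships. It applies community detection to partition the KG into semantically cohesive subgraphs, incrementally generates questions and answers centered on a target entity within each subgraph, and applies rigorous filtering to maintain dialogue quality, yielding efficient generation of high-quality multi-turn dialogue data with broad domain coverage.
Link: https://arxiv.org/abs/2508.01869
Authors: Yuanyuan Liang, Xiaoman Wang, Tingyu Xie, Lei Pan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages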
Abstract:Current large language models (LLMs) excel at general NLP tasks but often lack domain-specific precision in professional settings. Building a high-quality domain-specific multi-turn dialogue dataset is essential for developing specialized conversational systems. However, existing methods such as manual annotation, simulated human-LLM interactions, and role-based LLM dialogues are resource-intensive or suffer from limitations in dialogue quality and domain coverage. To address these challenges, we introduce ProKG-Dial, a progressive framework for constructing knowledge-intensive multi-turn dialogue datasets using domain-specific knowledge graphs (KGs). ProKG-Dial leverages the structured nature of KGs to encode complex domain knowledge and relationships, providing a solid foundation for generating meaningful and coherent dialogues. Specifically, ProKG-Dial begins by applying community detection to partition the KG into semantically cohesive subgraphs. For each subgraph, the framework incrementally generates a series of questions and answers centered around a target entity, ensuring relevance and coverage. A rigorous filtering step is employed to maintain high dialogue quality. We validate ProKG-Dial on a medical knowledge graph by evaluating the generated dialogues in terms of diversity, semantic coherence, and entity coverage. Furthermore, we fine-tune a base LLM on the resulting dataset and benchmark it against several baselines. Both automatic metrics and human evaluations demonstrate that ProKG-Dial substantially improves dialogue quality and domain-specific performance, highlighting its effectiveness and practical utility.
[AI-73] Counterfactual Reciprocal Recommender Systems for User-to-User Matching KDD KDD2025
[Quick read]: This paper addresses data bias in Reciprocal Recommender Systems (RRS): past exposure policies cause popular users to be over-represented in logged data, creating feedback loops that harm recommendation accuracy and fairness. The key to the solution is Counterfactual Reciprocal Recommender Systems (CFRR), a causal framework that uses inverse-propensity-scored, self-normalized objectives to correct this bias, increasing long-tail user coverage and fairness while also improving recommendation accuracy.
Link: https://arxiv.org/abs/2508.01867
Authors: Kazuki Kawamura, Takuma Udagawa, Kei Tateno
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 figures. Accepted for publication at the Workshop on Two-sided Marketplace Optimization (TSMO '25), held in conjunction with the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2025), Toronto, Canada
Abstract:Reciprocal recommender systems (RRS) in dating, gaming, and talent platforms require mutual acceptance for a match. Logged data, however, over-represents popular profiles due to past exposure policies, creating feedback loops that skew learning and fairness. We introduce Counterfactual Reciprocal Recommender Systems (CFRR), a causal framework to mitigate this bias. CFRR uses inverse propensity scored, self-normalized objectives. Experiments show CFRR improves NDCG@10 by up to 3.5% (e.g., from 0.459 to 0.475 on DBLP, from 0.299 to 0.307 on Synthetic), increases long-tail user coverage by up to 51% (from 0.504 to 0.763 on Synthetic), and reduces Gini exposure inequality by up to 24% (from 0.708 to 0.535 on Synthetic). CFRR offers a promising approach for more accurate and fair user-to-user matching.
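The phrase "inverse propensity scored, self-normalized objectives" can be made concrete with a small sketch. The function below is a generic self-normalized IPS (SNIPS) estimator, not the CFRR objective itself; the function name, the toy rewards, and the propensity values are assumptions chosen only for illustration.

```python
def snips_estimate(rewards, propensities):
    """Self-normalized inverse propensity scoring (SNIPS) estimate.

    rewards[i]      -- observed outcome for logged pair i (1.0 = mutual match)
    propensities[i] -- assumed probability that pair i was exposed by the
                       logging policy (hypothetical values here)
    """
    if len(rewards) != len(propensities):
        raise ValueError("rewards and propensities must have the same length")
    if any(p <= 0 for p in propensities):
        raise ValueError("propensities must be positive")
    weights = [1.0 / p for p in propensities]  # rarely exposed pairs weigh more
    weighted_sum = sum(w * r for w, r in zip(weights, rewards))
    return weighted_sum / sum(weights)         # normalize by the total weight


# Toy log: the first two pairs involve popular, frequently exposed profiles.
rewards = [1.0, 1.0, 0.0, 1.0]
propensities = [0.8, 0.8, 0.2, 0.1]

naive = sum(rewards) / len(rewards)            # ignores exposure bias
debiased = snips_estimate(rewards, propensities)
```

Dividing by the sum of the weights rather than by the number of samples is what makes the estimate self-normalized; it trades a small bias for lower variance when some propensities are tiny.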
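[AI-74] ChairPose: Pressure-based Chair Morphology Grounded Sitting Pose Estimation through Simulation-Assisted Training
[Quick read]: This paper addresses the effects of prolonged sitting on musculoskeletal health in modern environments and the limitations of existing posture sensing methods (such as vision-based or wearable approaches) regarding occlusion, privacy, user comfort, and deployment flexibility. The key to the solution is ChairPose, the first full-body, wearable-free sitting pose estimation system that relies on pressure sensing and does not depend on chair geometry. It uses a two-stage generative model trained on pressure maps captured by a thin, chair-agnostic pressure-sensing mat, and explicitly incorporates chair morphology during inference, which enables occlusion-free, privacy-preserving, and accurate pose estimation. To improve generalization across users and chairs, the authors also design a physics-driven data augmentation pipeline that simulates realistic variations in sitting posture and seating conditions; experiments show a mean per-joint position error of 89.4 mm on unseen user-chair combinations, demonstrating strong real-world generalization.
Link: https://arxiv.org/abs/2508.01850
Authors: Lala Shakti Swarup Ray,Vitor Fortes Rey,Bo Zhou,Paul Lukowicz,Sungho Suh
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: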
Abstract:Prolonged seated activity is increasingly common in modern environments, raising concerns around musculoskeletal health, ergonomics, and the design of responsive interactive systems. Existing posture sensing methods such as vision-based or wearable approaches face limitations including occlusion, privacy concerns, user discomfort, and restricted deployment flexibility. We introduce ChairPose, the first full-body, wearable-free seated pose estimation system that relies solely on pressure sensing and operates independently of chair geometry. ChairPose employs a two-stage generative model trained on pressure maps captured from a thin, chair-agnostic sensing mattress. Unlike prior approaches, our method explicitly incorporates chair morphology into the inference process, enabling accurate, occlusion-free, and privacy-preserving pose estimation. To support generalization across diverse users and chairs, we introduce a physics-driven data augmentation pipeline that simulates realistic variations in posture and seating conditions. Evaluated across eight users and four distinct chairs, ChairPose achieves a mean per-joint position error of 89.4 mm when both the user and the chair are unseen, demonstrating robust real-world generalizability. ChairPose expands the design space for posture-aware interactive systems, with potential applications in ergonomics, healthcare, and adaptive user interfaces.
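The headline metric, mean per joint position error, is the average Euclidean distance between predicted and ground-truth joints. A minimal sketch follows; the two-joint skeleton and its coordinates are made-up values in millimetres, and the abstract does not say whether the paper applies any alignment step first.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean per joint position error: mean Euclidean distance over joints."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need two non-empty joint lists of equal length")
    total = 0.0
    for pred_joint, true_joint in zip(predicted, ground_truth):
        total += math.dist(pred_joint, true_joint)  # distance for one joint
    return total / len(predicted)


# Hypothetical 3D joints in millimetres: (x, y, z) per joint.
pred = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
true = [(0.0, 30.0, 40.0), (100.0, 0.0, 0.0)]
error_mm = mpjpe(pred, true)
```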
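[AI-75] CloudAnoAgent: Anomaly Detection for Cloud Sites via LLM Agent with Neuro-Symbolic Mechanism
[Quick read]: This paper addresses anomaly detection in cloud environments, where existing metric-based methods suffer from a high false positive rate (FPR) because of the data imbalance between normal and anomalous events, which increases the burden on operations engineers. The key to the solution is CloudAnoAgent, the first anomaly detection agent based on a neuro-symbolic large language model (LLM). It processes structured metric data and textual log data in a unified way and adds a symbolic verification mechanism that validates detection hypotheses and generates structured anomaly reports, which significantly improves detection accuracy, lowers the false positive rate, and enhances interpretability.
Link: https://arxiv.org/abs/2508.01844
Authors: Xinkai Zou,Xuan Jiang,Ruikai Huang,Haoze He,Parv Kapoor,Jiahua Zhao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: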
Abstract:Anomaly detection in cloud sites remains a critical yet challenging task. Existing approaches that rely solely on metric data often suffer from high false positive rates (FPR) due to data imbalance between normal and anomalous events, leading to significant operational overhead for system reliance engineers. Recent advances in large language models (LLMs) offer new opportunities for integrating metrics with log data, enabling more accurate and interpretable anomaly detection. In this paper, we propose CloudAnoAgent, the first neuro-symbolic LLM-based agent for anomaly detection in cloud environments. CloudAnoAgent jointly processes structured metrics and textual log data in a unified pipeline, leveraging symbolic verification to validate detection hypotheses and generate structured anomaly reports. To support systematic evaluation, we introduce CloudAnoBench, the first benchmark that provides LLM-generated paired metrics and log data with fine-grained anomaly behavior annotations, filling a critical gap in existing datasets. Experimental results demonstrate that CloudAnoAgent improves anomaly classification accuracy by 46.36% and 36.67% on average and reduces the FPR by 36.67% and 33.89% on average over traditional baselines and LLM-only baseline, with a boost on anomaly type detection accuracy by 12.8% compared to vanilla LLM prompting. These results demonstrate the strengths of our approach in improving detection accuracy, reducing false positives, and enhancing interpretability, thereby supporting practical deployment in enterprise cloud environments.
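Why a seemingly small FPR becomes an operational burden under class imbalance can be seen with simple arithmetic. The sketch below uses invented counts (10,000 monitoring windows, 100 of them anomalous); none of these numbers come from the paper.

```python
def false_positive_rate(false_positives, true_negatives):
    """Share of normal windows that were wrongly flagged."""
    negatives = false_positives + true_negatives
    if negatives == 0:
        raise ValueError("no negative samples")
    return false_positives / negatives


def precision(true_positives, false_positives):
    """Share of raised alerts that were real anomalies."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("nothing was flagged")
    return true_positives / flagged


# Hypothetical day of monitoring: 100 anomalous and 9,900 normal windows.
tp, fn = 90, 10
fp, tn = 495, 9405

fpr = false_positive_rate(fp, tn)   # 5% of normal windows are flagged
alert_precision = precision(tp, fp) # yet most alerts are false alarms
```

With these counts only about one alert in six is a real anomaly, even though the detector flags just 5% of normal windows.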
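[AI-76] Neural Predictive Control to Coordinate Discrete- and Continuous-Time Models for Time-Series Analysis with Control-Theoretical Improvements KDD
[Quick read]: This paper addresses the poor adaptability under distributional shifts of existing time-series modeling methods based on Neural Ordinary Differential Equations (Neural ODEs): conventional methods learn the dynamics directly with unconstrained neural networks and lack robustness and theoretical guarantees. The key to the solution is to recast time-series problems as continuous-time optimal control problems. Control actions carrying rich context information are designed, a discrete-time model extracts long-term temporal features to modulate the short-term continuous dynamics, and Model Predictive Control (MPC) plans multi-step trajectories and minimizes a task-specific cost, so that ODE trajectories are steered in a controllable way. Within this framework, the paper proves that under mild assumptions the multi-step optimization converges exponentially to the infinite-horizon solution, which markedly improves the model's generalization and adaptability.
Link: https://arxiv.org/abs/2508.01833
Authors: Haoran Li,Muhao Guo,Yang Weng,Hanghang Tong
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, submitted to ACM SIGKDD Conference on Knowledge Discovery and Data Mining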
Abstract:Deep sequence models have achieved notable success in time-series analysis, such as interpolation and forecasting. Recent advances move beyond discrete-time architectures like Recurrent Neural Networks (RNNs) toward continuous-time formulations such as the family of Neural Ordinary Differential Equations (Neural ODEs). Generally, they have shown that capturing the underlying dynamics is beneficial for generic tasks like interpolation, extrapolation, and classification. However, existing methods approximate the dynamics using unconstrained neural networks, which struggle to adapt reliably under distributional shifts. In this paper, we recast time-series problems as the continuous ODE-based optimal control problem. Rather than learning dynamics solely from data, we optimize control actions that steer ODE trajectories toward task objectives, bringing control-theoretical performance guarantees. To achieve this goal, we need to (1) design the appropriate control actions and (2) apply effective optimal control algorithms. As the actions should contain rich context information, we propose to employ the discrete-time model to process past sequences and generate actions, leading to a coordinate model to extract long-term temporal features to modulate short-term continuous dynamics. During training, we apply model predictive control to plan multi-step future trajectories, minimize a task-specific cost, and greedily select the optimal current action. We show that, under mild assumptions, this multi-horizon optimization leads to exponential convergence to infinite-horizon solutions, indicating that the coordinate model can gain robust and generalizable performance. Extensive experiments on diverse time-series datasets validate our method’s superior generalization and adaptability compared to state-of-the-art baselines.
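The plan, minimize, and "greedily select the current action" loop described above is the standard receding-horizon pattern of model predictive control. The sketch below shows that pattern on a toy scalar system dx/dt = u with a brute-force search over three candidate actions. It is not the paper's model: there is no neural ODE and no discrete-time action generator, and the dynamics, cost weights, and candidate set are all assumptions.

```python
from itertools import product


def rollout_cost(x, actions, target, dt=0.1, control_weight=0.01):
    """Cost of applying a sequence of actions to the toy system dx/dt = u."""
    cost = 0.0
    for u in actions:
        x = x + dt * u                              # explicit Euler step
        cost += (x - target) ** 2 + control_weight * u ** 2
    return cost


def mpc_step(x, target, candidates=(-1.0, 0.0, 1.0), horizon=3, dt=0.1):
    """Plan over the horizon, then return only the first action of the best plan."""
    best_plan = min(
        product(candidates, repeat=horizon),
        key=lambda plan: rollout_cost(x, plan, target, dt),
    )
    return best_plan[0]


def simulate(x0, target, steps=30, dt=0.1):
    """Re-plan at every step and apply the selected current action."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        u = mpc_step(x, target, dt=dt)
        x = x + dt * u
        trajectory.append(x)
    return trajectory


trajectory = simulate(x0=0.0, target=1.0)
```

Enumerating every action sequence is only workable for tiny candidate sets and horizons; it is used here to keep the example dependency-free.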
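[AI-77] Mitigating Persistent Client Dropout in Asynchronous Decentralized Federated Learning KDD KDD2025
[Quick read]: This paper addresses persistent client dropout in asynchronous Decentralized Federated Learning (DFL). Asynchronicity and decentralization make model update information opaque between nodes, so a client's contribution is hard to recover after it drops out, and the complete information needed to reconstruct a missing neighbor's loss function (such as the number of training epochs and the data distribution) is not available. Existing mitigation strategies have limited effect. The core of the proposed solution is a set of adaptive strategies based on client reconstruction, which use a local reconstruction mechanism to recover performance lost to dropout without precisely reconstructing the dropped client's data. Experiments show that the method improves system robustness across several datasets and heterogeneity scenarios and remains effective even though reconstruction accuracy is limited.
Link: https://arxiv.org/abs/2508.01807
Authors: Ignacy Stępka,Nicholas Gisolfi,Kacper Trębacz,Artur Dubrawski
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Presented on FedKDD Workshop at KDD 2025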
Abstract:We consider the problem of persistent client dropout in asynchronous Decentralized Federated Learning (DFL). Asynchronicity and decentralization obfuscate information about model updates among federation peers, making recovery from a client dropout difficult. Access to the number of learning epochs, data distributions, and all the information necessary to precisely reconstruct the missing neighbor’s loss functions is limited. We show that obvious mitigations do not adequately address the problem and introduce adaptive strategies based on client reconstruction. We show that these strategies can effectively recover some performance loss caused by dropout. Our work focuses on asynchronous DFL with local regularization and differs substantially from that in the existing literature. We evaluate the proposed methods on tabular and image datasets, involve three DFL algorithms, and three data heterogeneity scenarios (iid, non-iid, class-focused non-iid). Our experiments show that the proposed adaptive strategies can be effective in maintaining robustness of federated learning, even if they do not reconstruct the missing client’s data precisely. We also discuss the limitations and identify future avenues for tackling the problem of client dropout.
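[AI-78] RouteMark: A Fingerprint for Intellectual Property Attribution in Routing-based Model Merging
[Quick read]: This paper addresses the attribution and protection of expert intellectual property (IP) when models are merged through Mixture-of-Experts (MoE), that is, how to identify and verify the origin of each expert and prevent tampering after multiple task-specific models have been merged into a unified sparse architecture. The key to the solution is the RouteMark framework, whose core idea is to use expert routing fingerprints for IP tracing. It builds two complementary statistics: the Routing Score Fingerprint (RSF), which quantifies the intensity of expert activation, and the Routing Preference Fingerprint (RPF), which characterizes the input distribution that preferentially activates each expert. Together they form expert-level fingerprints that are reproducible, task-discriminative, and lightweight. A similarity-based matching algorithm then compares expert fingerprints between a suspect model and a reference model, accurately identifying expert reuse and robustly detecting structural and parametric tampering.
Link: https://arxiv.org/abs/2508.01784
Authors: Xin He,Junxi Shen,Zhenheng Tang,Xiaowen Chu,Bo Li,Ivor W. Tsang,Yew-Soon Ong
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: MoE, Model Merging, Fingerprint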
Abstract:Model merging via Mixture-of-Experts (MoE) has emerged as a scalable solution for consolidating multiple task-specific models into a unified sparse architecture, where each expert is derived from a model fine-tuned on a distinct task. While effective for multi-task integration, this paradigm introduces a critical yet underexplored challenge: how to attribute and protect the intellectual property (IP) of individual experts after merging. We propose RouteMark, a framework for IP protection in merged MoE models through the design of expert routing fingerprints. Our key insight is that task-specific experts exhibit stable and distinctive routing behaviors under probing inputs. To capture these patterns, we construct expert-level fingerprints using two complementary statistics: the Routing Score Fingerprint (RSF), quantifying the intensity of expert activation, and the Routing Preference Fingerprint (RPF), characterizing the input distribution that preferentially activates each expert. These fingerprints are reproducible, task-discriminative, and lightweight to construct. For attribution and tampering detection, we introduce a similarity-based matching algorithm that compares expert fingerprints between a suspect and a reference (victim) model. Extensive experiments across diverse tasks and CLIP-based MoE architectures show that RouteMark consistently yields high similarity for reused experts and clear separation from unrelated ones. Moreover, it remains robust against both structural tampering (expert replacement, addition, deletion) and parametric tampering (fine-tuning, pruning, permutation), outperforming weight- and activation-based baselines. Our work lays the foundation for RouteMark as a practical and broadly applicable framework for IP verification in MoE-based model merging.
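The idea of similarity-based matching between a suspect and a reference model can be sketched as follows. Each fingerprint here is a hypothetical vector of mean routing scores over three probe tasks, compared with cosine similarity. The abstract does not specify the exact form of RSF and RPF or the similarity measure, so the vector layout, the threshold, and the expert names are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_experts(reference, suspect, threshold=0.95):
    """Map each reference expert to its most similar suspect expert.

    Returns {reference_name: (suspect_name_or_None, best_similarity)};
    the name is None when the best similarity is below the threshold.
    """
    matches = {}
    for ref_name, ref_fp in reference.items():
        best_name, best_sim = None, -1.0
        for sus_name, sus_fp in suspect.items():
            sim = cosine_similarity(ref_fp, sus_fp)
            if sim > best_sim:
                best_name, best_sim = sus_name, sim
        matched = best_name if best_sim >= threshold else None
        matches[ref_name] = (matched, best_sim)
    return matches


# Hypothetical fingerprints: mean routing score on three probe tasks.
reference = {"expert_A": [0.9, 0.1, 0.0], "expert_B": [0.1, 0.8, 0.1]}
suspect = {"slot_0": [0.12, 0.79, 0.09], "slot_1": [0.3, 0.3, 0.4]}
matches = match_experts(reference, suspect)
```

In this toy data `slot_0` is a lightly perturbed copy of `expert_B`, so it matches, while `expert_A` has no counterpart above the threshold.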
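[AI-79] VAGPO: Vision-augmented Asymmetric Group Preference Optimization for the Routing Problems
[Quick read]: This paper addresses the low training efficiency and poor generalization to large-scale instances that are common in data-driven methods for combinatorial optimization problems such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP). The key to the solution is Vision-Augmented Asymmetric Group Preference Optimization (VAGPO): a ResNet performs visual encoding to capture spatial structure, a Transformer models sequential dependencies to learn temporal dynamics, and an asymmetric group preference optimization strategy replaces conventional policy gradient methods, which significantly speeds up convergence. Experiments show that the method obtains high-quality solutions on standard benchmarks and generalizes to large instances of up to 1000 nodes without retraining, showing good learning efficiency and scalability.
Link: https://arxiv.org/abs/2508.01774
Authors: Shiyan Liu,Bohan Tan,Yan Jin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: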
Abstract:The routing problems such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) are well-known combinatorial optimization challenges with broad practical relevance. Recent data-driven optimization methods have made significant progress, yet they often face limitations in training efficiency and generalization to large-scale instances. In this paper, we propose a novel Vision-Augmented Asymmetric Group Preference Optimization (VAGPO) approach for solving the routing problems. By leveraging ResNet-based visual encoding and Transformer-based sequential modeling, VAGPO captures both spatial structure and temporal dependencies. Furthermore, we introduce an asymmetric group preference optimization strategy that significantly accelerates convergence compared to commonly used policy gradient methods. Experimental results on TSP and CVRP benchmarks show that the proposed VAGPO not only achieves highly competitive solution quality but also exhibits strong generalization to larger instances (up to 1000 nodes) without re-training, highlighting its effectiveness in both learning efficiency and scalability.
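[AI-80] Reasoning Systems as Structured Processes: Foundations Failures and Formal Criteria
[Quick read]: This paper addresses how to model and compare the structure of reasoning systems across domains in a unified way, with particular attention to analyzing reasoning mechanisms under internal failure, adaptation, or fragmentation. The key to the solution is a general formal framework that represents a reasoning system as a structured tuple comprising phenomena, an explanation space, an inference map, a generation map, and a principle base. The framework accommodates logical, algorithmic, and learning-based reasoning processes, supports the analysis of basic internal criteria such as coherence, soundness, and completeness, and admits dynamic behaviors such as iterative refinement and principle evolution, providing a comparable foundational structure for future theoretical and practical research.
Link: https://arxiv.org/abs/2508.01763
Authors: Saleh Nikooroo,Thomas Engel
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: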
Abstract:This paper outlines a general formal framework for reasoning systems, intended to support future analysis of inference architectures across domains. We model reasoning systems as structured tuples comprising phenomena, explanation space, inference and generation maps, and a principle base. The formulation accommodates logical, algorithmic, and learning-based reasoning processes within a unified structural schema, while remaining agnostic to any specific reasoning algorithm or logic system. We survey basic internal criteria, including coherence, soundness, and completeness, and catalog typical failure modes such as contradiction, incompleteness, and non-convergence. The framework also admits dynamic behaviors like iterative refinement and principle evolution. The goal of this work is to establish a foundational structure for representing and comparing reasoning systems, particularly in contexts where internal failure, adaptation, or fragmentation may arise. No specific solution architecture is proposed; instead, we aim to support future theoretical and practical investigations into reasoning under structural constraint.
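[AI-81] Semantically-Guided Inference for Conditional Diffusion Models: Enhancing Covariate Consistency in Time Series Forecasting
[Quick read]: This paper addresses the semantic inconsistency that is common when diffusion models are used for time series forecasting, namely the semantic misalignment between generated trajectories and conditioning covariates, which is more pronounced under complex or multimodal conditions. The key to the solution is SemGuide, a plug-and-play inference-time method whose core is a scoring network that assesses the semantic alignment between intermediate states of the diffusion process and future covariates. These scores act as proxy likelihoods in a stepwise importance reweighting procedure that adjusts the sampling path, improving conditional consistency without modifying the original training process. The method is model-agnostic, compatible with any conditional diffusion framework, and markedly improves predictive accuracy and covariate alignment on real-world forecasting tasks.
Link: https://arxiv.org/abs/2508.01761
Authors: Rui Ding,Hanyang Meng,Zeyang Zhang,Jielong Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: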
Abstract:Diffusion models have demonstrated strong performance in time series forecasting, yet often suffer from semantic misalignment between generated trajectories and conditioning covariates, especially under complex or multimodal conditions. To address this issue, we propose SemGuide, a plug-and-play, inference-time method that enhances covariate consistency in conditional diffusion models. Our approach introduces a scoring network to assess the semantic alignment between intermediate diffusion states and future covariates. These scores serve as proxy likelihoods in a stepwise importance reweighting procedure, which progressively adjusts the sampling path without altering the original training process. The method is model-agnostic and compatible with any conditional diffusion framework. Experiments on real-world forecasting tasks show consistent gains in both predictive accuracy and covariate alignment, with especially strong performance under complex conditioning scenarios.
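One reweighting step of the kind described above can be sketched generically: alignment scores are turned into normalized weights, and candidate states are resampled in proportion to them. This is plain importance resampling, not SemGuide itself. Treating the scores as proxy log-likelihoods (rather than likelihoods) is an assumption, and the candidate labels and score values are invented.

```python
import math
import random


def importance_weights(scores):
    """Normalize scores, treated here as proxy log-likelihoods, into weights."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]


def resample(candidates, scores, rng):
    """Draw a new population in proportion to the importance weights."""
    weights = importance_weights(scores)
    return rng.choices(candidates, weights=weights, k=len(candidates))


rng = random.Random(0)                         # fixed seed for repeatability
candidates = ["traj_a", "traj_b", "traj_c"]    # stand-ins for diffusion states
scores = [2.0, 0.0, -2.0]                      # hypothetical alignment scores

weights = importance_weights(scores)
survivors = resample(candidates, scores, rng)
```

In a stepwise scheme this resampling would be repeated at selected denoising steps, so poorly aligned trajectories are gradually replaced by copies of better aligned ones.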
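[AI-82] Implementing Cumulative Functions with Generalized Cumulative Constraints
[Quick read]: This paper addresses the lack of support in open-source constraint solvers for modeling scheduling problems with conditional time intervals and cumulative functions. This modeling style is mainstream in commercial constraint programming solvers but is not yet supported by open-source tools. The key to the solution is a single generic global constraint called Generalized Cumulative, together with a novel time-table filtering algorithm that efficiently handles tasks defined on conditional time intervals. Experiments show that the approach is competitive for modeling producer-consumer scheduling problems and scales to large problems.
Link: https://arxiv.org/abs/2508.01751
Authors: Pierre Schaus,Charles Thomas,Roger Kameugne
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: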
Abstract:Modeling scheduling problems with conditional time intervals and cumulative functions has become a common approach when using modern commercial constraint programming solvers. This paradigm enables the modeling of a wide range of scheduling problems, including those involving producers and consumers. However, it is unavailable in existing open-source solvers and practical implementation details remain undocumented. In this work, we present an implementation of this modeling approach using a single, generic global constraint called the Generalized Cumulative. We also introduce a novel time-table filtering algorithm designed to handle tasks defined on conditional time-intervals. Experimental results demonstrate that this approach, combined with the new filtering algorithm, performs competitively with existing solvers, enables the modeling of producer and consumer scheduling problems, and scales effectively to large problems.
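Time-table reasoning is built on mandatory parts: the time span a task must occupy in every feasible schedule. The sketch below computes the classical resource profile from mandatory parts, skipping optional tasks that are not yet known to be present and allowing negative heights as one way to picture a consumer. It illustrates the textbook idea only and is not the paper's new filtering algorithm; the task fields and values are assumptions.

```python
def mandatory_part(est, lst, duration):
    """Interval [lst, est + duration) covered in every schedule, or None."""
    ect = est + duration                 # earliest completion time
    return (lst, ect) if lst < ect else None


def timetable_profile(tasks):
    """Accumulate heights of mandatory parts of tasks known to be present.

    Each task is a dict with keys est, lst, duration, height, present.
    Returns sorted (time, level) breakpoints of the resource profile.
    """
    deltas = {}
    for task in tasks:
        if not task["present"]:
            continue                     # presence undecided: contributes nothing
        part = mandatory_part(task["est"], task["lst"], task["duration"])
        if part is None:
            continue                     # no mandatory part yet
        start, end = part
        deltas[start] = deltas.get(start, 0) + task["height"]
        deltas[end] = deltas.get(end, 0) - task["height"]
    profile, level = [], 0
    for time in sorted(deltas):
        level += deltas[time]
        profile.append((time, level))
    return profile


tasks = [
    {"est": 0, "lst": 2, "duration": 5, "height": 3, "present": True},
    {"est": 1, "lst": 4, "duration": 4, "height": -1, "present": True},
    {"est": 0, "lst": 0, "duration": 3, "height": 2, "present": False},
]
profile = timetable_profile(tasks)
```

A filtering algorithm would then compare this profile with the capacity bounds and prune start times that would push the level outside them.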
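[AI-83] Bayes-Entropy Collaborative Driven Agents for Research Hypotheses Generation and Optimization
[Quick read]: This paper addresses a core challenge raised by the explosion of scientific knowledge: how to automatically generate scientific hypotheses that combine novelty, feasibility, and research value. Existing methods based on large language models struggle to model the internal structure of hypotheses systematically and lack closed-loop feedback mechanisms for continual optimization. The key to the solution is HypoAgents, a multi-agent collaborative framework that for the first time combines Bayesian reasoning with an information-entropy-driven search mechanism, building an iterative closed loop over three stages (hypothesis generation, evidence validation, and hypothesis refinement) that simulates scientists' cognitive process. It first generates initial hypotheses through diversity sampling and establishes prior beliefs from a composite novelty-relevance-feasibility score. It then uses Retrieval-Augmented Generation (RAG) to gather external literature evidence and updates the posterior probabilities of the hypotheses with Bayes' theorem. Finally, it identifies high-uncertainty hypotheses through the information entropy $ H = - \sum p_i\log p_i $ and actively refines them, guiding the hypothesis set toward higher quality and confidence. Experiments on a dataset of real research questions from ICLR 2025 show that after 12 optimization iterations the average ELO score rises by 116.3, exceeding the benchmark of real paper abstracts by 17.8, while overall uncertainty (Shannon entropy) decreases by 0.92.
Link: https://arxiv.org/abs/2508.01746
Authors: Shiyang Duan,Yuan Tian,Qi Bing,Xiaowei Shao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Corresponding author: Xiaowei Shao. 12 pages, 4 figures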
Abstract:The exponential growth of scientific knowledge has made the automated generation of scientific hypotheses that combine novelty, feasibility, and research value a core challenge. Existing methods based on large language models fail to systematically model the inherent structure of hypotheses or incorporate the closed-loop feedback mechanisms crucial for refinement. This paper proposes a multi-agent collaborative framework called HypoAgents, which for the first time integrates Bayesian reasoning with an information entropy-driven search mechanism across three stages (hypothesis generation, evidence validation, and hypothesis refinement) to construct an iterative closed loop simulating scientists’ cognitive processes. Specifically, the framework first generates an initial set of hypotheses through diversity sampling and establishes prior beliefs based on a composite novelty-relevance-feasibility (N-R-F) score. It then employs retrieval-augmented generation (RAG) to gather external literature evidence, updating the posterior probabilities of hypotheses using Bayes’ theorem. Finally, it identifies high-uncertainty hypotheses using the information entropy $ H = - \sum p_i\log p_i $ and actively refines them, guiding the iterative optimization of the hypothesis set toward higher quality and confidence. Experimental results on the ICLR 2025 conference real-world research question dataset (100 research questions) show that after 12 optimization iterations, the average ELO score of generated hypotheses improves by 116.3, surpassing the benchmark of real paper abstracts by 17.8, while the framework’s overall uncertainty, as measured by Shannon entropy, decreases significantly by 0.92. This study presents an interpretable probabilistic reasoning framework for automated scientific discovery, substantially improving the quality and reliability of machine-generated research hypotheses.
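The two quantities named above, a Bayes update of hypothesis beliefs and the Shannon entropy of the resulting distribution, can be worked through on a toy example. The priors and evidence likelihoods are invented, the natural logarithm is assumed, and treating the whole hypothesis set as one distribution is a simplification; the abstract does not say how HypoAgents derives likelihoods from retrieved literature.

```python
import math


def bayes_update(priors, likelihoods):
    """Posterior over hypotheses: prior times likelihood, renormalized."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("evidence has zero probability under every hypothesis")
    return [u / total for u in unnormalized]


def shannon_entropy(probabilities):
    """H = -sum(p_i * log(p_i)), skipping zero-probability terms."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


# Four hypotheses with equal prior belief and made-up evidence likelihoods.
priors = [0.25, 0.25, 0.25, 0.25]
likelihoods = [0.9, 0.3, 0.2, 0.1]

posteriors = bayes_update(priors, likelihoods)
entropy_before = shannon_entropy(priors)
entropy_after = shannon_entropy(posteriors)
```

Evidence that favors one hypothesis concentrates the posterior, so the entropy falls; hypotheses whose beliefs remain spread out are the natural candidates for further refinement.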
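[AI-84] ReflecSched: Solving Dynamic Flexible Job-Shop Scheduling via LLM-Powered Hierarchical Reflection
[Quick read]: This paper addresses the challenges of Dynamic Flexible Job-Shop Scheduling (DFJSP) that arise from real-time event adaptation and complex machine routing, in particular the limited flexibility of traditional dispatching rules and the opacity and feature-engineering burden of deep learning methods. The key to the solution is the ReflecSched framework, which gives Large Language Models (LLMs) a strategic analysis capability. Instead of making scheduling decisions directly, the LLM first simulates scheduling scenarios over multiple planning horizons using heuristic rules, distills the results into a natural-language "Strategic Experience", and injects it as a prompt into the final decision-making module, which then produces non-myopic scheduling actions. This design mitigates three weaknesses of LLMs, namely poor use of long contexts, underuse of expert heuristics, and locally optimal decisions, and significantly improves scheduling performance.
Link: https://arxiv.org/abs/2508.01724
Authors: Shijie Cao,Yuan Yuan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: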
Abstract:Dynamic Flexible Job-Shop Scheduling (DFJSP) is an NP-hard problem challenged by real-time event adaptation and complex machine routing. While traditional dispatching rules are efficient but rigid, deep learning approaches are opaque and require intricate feature engineering. Large Language Models (LLMs) promise adaptive reasoning without this engineering overhead, yet we find their direct application is suboptimal. Baseline LLMs suffer from three key pitfalls: the long-context paradox, where crucial data is underutilized; an underutilization of expert heuristics; and myopic decision-making. To address this, we propose ReflecSched, a framework that empowers the LLM beyond a direct scheduler by equipping it with a strategic analysis capability. ReflecSched tasks the LLM to analyze heuristic-driven simulations across multiple planning horizons and distill them into a concise, natural-language summary termed "Strategic Experience". This summary is then integrated into the prompt of a final decision-making module, guiding it to produce non-myopic actions. Experiments show that ReflecSched not only statistically significantly outperforms direct LLM baselines, securing a 71.35% Win Rate and a 2.755% Relative Percentage Deviation reduction, but also surpasses the performance of all individual heuristics evaluated, all while demonstrably mitigating the three identified pitfalls. Additionally, ReflecSched performs on par with the best heuristic tailored to each instance across all problem cases.
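The two reported metrics can be illustrated with their commonly used definitions: Relative Percentage Deviation as the percentage gap to a best-known objective value, and Win Rate as the share of instances on which one method beats another. The abstract does not give the paper's exact definitions, so the formulas below are assumptions, and the objective values are invented (lower is better).

```python
def relative_percentage_deviation(value, best_known):
    """Percentage gap between an objective value and the best-known value."""
    if best_known <= 0:
        raise ValueError("best_known must be positive")
    return 100.0 * (value - best_known) / best_known


def win_rate(method_values, baseline_values):
    """Share of instances where the method is strictly better (lower)."""
    if not method_values or len(method_values) != len(baseline_values):
        raise ValueError("need two non-empty lists of equal length")
    wins = sum(1 for m, b in zip(method_values, baseline_values) if m < b)
    return wins / len(method_values)


# Hypothetical objective values on four scheduling instances.
method = [100, 110, 95, 120]
baseline = [105, 108, 99, 125]
best_known = [100, 100, 90, 120]

rpds = [relative_percentage_deviation(m, b) for m, b in zip(method, best_known)]
mean_rpd = sum(rpds) / len(rpds)
rate = win_rate(method, baseline)
```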
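[AI-85] MHARFedLLM: Multimodal Human Activity Recognition Using Federated Large Language Model
[Quick read]: This paper addresses the limited robustness and accuracy of traditional Human Activity Recognition (HAR) systems that rely on a single modality (such as motion sensors or cameras alone). The key to the solution is FedTime-MAGNET, a new multimodal federated learning framework with two core innovations: 1) the Multimodal Adaptive Graph Neural Expert Transformer (MAGNET), which uses graph attention and a Mixture of Experts to produce unified, discriminative embeddings across modalities; and 2) a lightweight T5 encoder architecture that captures complex temporal dependencies, so that information from heterogeneous sources such as depth cameras, pressure mats, and accelerometers is fused effectively. Experiments show an F1 score of 0.934 in the centralized setting and 0.881 in the federated setting, supporting the combination of multimodal fusion, time series LLMs, and federated learning.
Link: https://arxiv.org/abs/2508.01701
Authors: Asmit Bandyopadhyay,Rohit Basu,Tanmay Sen,Swagatam Das
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: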
Abstract:Human Activity Recognition (HAR) plays a vital role in applications such as fitness tracking, smart homes, and healthcare monitoring. Traditional HAR systems often rely on single modalities, such as motion sensors or cameras, limiting robustness and accuracy in real-world environments. This work presents FedTime-MAGNET, a novel multimodal federated learning framework that advances HAR by combining heterogeneous data sources: depth cameras, pressure mats, and accelerometers. At its core is the Multimodal Adaptive Graph Neural Expert Transformer (MAGNET), a fusion architecture that uses graph attention and a Mixture of Experts to generate unified, discriminative embeddings across modalities. To capture complex temporal dependencies, a lightweight T5 encoder only architecture is customized and adapted within this framework. Extensive experiments show that FedTime-MAGNET significantly improves HAR performance, achieving a centralized F1 Score of 0.934 and a strong federated F1 Score of 0.881. These results demonstrate the effectiveness of combining multimodal fusion, time series LLMs, and federated learning for building accurate and robust HAR systems.
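[AI-86] DeepVIS: Bridging Natural Language and Data Visualization Through Step-wise Reasoning
[Quick read]: This paper addresses the black-box problem in Natural Language to Visualization (NL2VIS): existing methods lack a transparent reasoning process, so users cannot understand the design rationale behind a generated visualization or effectively correct unsatisfactory output. The key to the solution is to bring Chain-of-Thought (CoT) reasoning into the NL2VIS pipeline. First, the authors design a comprehensive CoT reasoning process for NL2VIS and build an automatic pipeline that adds structured reasoning steps to existing datasets. Second, they introduce nvBench-CoT, a dedicated dataset containing detailed step-by-step reasoning from ambiguous natural language descriptions to final visualizations, which significantly improves performance after model fine-tuning. Finally, they develop DeepVIS, an interactive interface that lets users inspect reasoning steps, locate errors, and make targeted adjustments, improving visualization quality while strengthening users' understanding of and control over the generation process.
Link: https://arxiv.org/abs/2508.01700
Authors: Zhihao Shuai,Boyan Li,Siyu Yan,Yuyu Luo,Weikai Yang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: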
Abstract:Although data visualization is powerful for revealing patterns and communicating insights, creating effective visualizations requires familiarity with authoring tools and often disrupts the analysis flow. While large language models show promise for automatically converting analysis intent into visualizations, existing methods function as black boxes without transparent reasoning processes, which prevents users from understanding design rationales and refining suboptimal outputs. To bridge this gap, we propose integrating Chain-of-Thought (CoT) reasoning into the Natural Language to Visualization (NL2VIS) pipeline. First, we design a comprehensive CoT reasoning process for NL2VIS and develop an automatic pipeline to equip existing datasets with structured reasoning steps. Second, we introduce nvBench-CoT, a specialized dataset capturing detailed step-by-step reasoning from ambiguous natural language descriptions to finalized visualizations, which enables state-of-the-art performance when used for model fine-tuning. Third, we develop DeepVIS, an interactive visual interface that tightly integrates with the CoT reasoning process, allowing users to inspect reasoning steps, identify errors, and make targeted adjustments to improve visualization outcomes. Quantitative benchmark evaluations, two use cases, and a user study collectively demonstrate that our CoT framework effectively enhances NL2VIS quality while providing insightful reasoning steps to users.
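[AI-87] From SHAP to Rules: Distilling Expert Knowledge from Post-hoc Model Explanations in Time Series Classification
[Quick read]: This paper addresses the poor interpretability of time series (TS) classification models, in particular the difficulty of interpreting raw time series directly and the complexity introduced by high dimensionality. The key to the solution is a framework that converts post-hoc, instance-level feature attributions (such as LIME and SHAP) into structured, human-readable rules. Rule intervals specify the temporal and spatial range in which a rule applies, improving transparency. Rule fusion strategies (such as weighted selection and Lasso-based refinement) balance coverage, confidence, and simplicity so that every instance receives a unique, optimized explanatory rule, and visualization techniques manage the trade-off between rule specificity and generalization. The framework consolidates conflicting or overlapping explanations caused by the Rashomon effect into coherent insights that follow expert-system principles and can be adapted to domain knowledge.
Link: https://arxiv.org/abs/2508.01687
Authors: Maciej Mozolewski,Szymon Bobek,Grzegorz J. Nalepa
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: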
Abstract:Explaining machine learning (ML) models for time series (TS) classification is challenging due to the inherent difficulty of interpreting raw time series, compounded by high dimensionality. We propose a framework that converts numeric feature attributions from post-hoc, instance-wise explainers (e.g., LIME, SHAP) into structured, human-readable rules. These rules define intervals indicating when and where they apply, improving transparency. Our approach performs comparably to native rule-based methods like Anchor while scaling better to long TS and covering more instances. Rule fusion integrates rule sets through methods such as weighted selection and lasso-based refinement to balance coverage, confidence, and simplicity, ensuring all instances receive an unambiguous, metric-optimized rule. It enhances explanations even for a single explainer. We introduce visualization techniques to manage specificity-generalization trade-offs. By aligning with expert-system principles, our framework consolidates conflicting or overlapping explanations, often resulting from the Rashomon effect, into coherent and domain-adaptable insights. Experiments on UCI datasets confirm that the resulting rule-based representations improve interpretability, decision transparency, and practical applicability for TS classification.
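One plausible reading of "rules that define intervals indicating when and where they apply" is sketched below: contiguous time steps with large absolute attribution give the "when", and the range of signal values inside that window gives the "where". The attributions are assumed to be precomputed by some explainer (no SHAP or LIME call is made), and the threshold, the data, and the rule layout are assumptions rather than the paper's procedure.

```python
def attribution_intervals(attributions, threshold):
    """Contiguous index ranges (start, end) with |attribution| >= threshold."""
    intervals, start = [], None
    for i, value in enumerate(attributions):
        if abs(value) >= threshold:
            if start is None:
                start = i                  # a new important window begins
        elif start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:                  # window runs to the end of the series
        intervals.append((start, len(attributions) - 1))
    return intervals


def interval_rules(series, attributions, threshold):
    """One rule per important window: time range plus observed value range."""
    rules = []
    for start, end in attribution_intervals(attributions, threshold):
        window = series[start:end + 1]
        rules.append({"t_from": start, "t_to": end,
                      "min": min(window), "max": max(window)})
    return rules


# Hypothetical series and per-time-step attributions for one instance.
series = [0.1, 0.2, 1.5, 1.7, 0.3, 0.2, 2.0]
attributions = [0.0, 0.01, 0.4, 0.5, 0.02, 0.0, -0.6]
rules = interval_rules(series, attributions, threshold=0.3)
```

Each resulting rule reads as "between t_from and t_to the signal lies in [min, max]", which is the kind of statement a domain expert can check against their own knowledge.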
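[AI-88] T-GRAG: A Dynamic GraphRAG Framework for Resolving Temporal Conflicts and Redundancy in Knowledge Retrieval
[Quick read]: This paper addresses limitations of existing Retrieval-Augmented Generation (RAG) methods in knowledge-intensive tasks that come from ignoring the temporal dynamics of knowledge, such as temporal ambiguity, time-insensitive retrieval, and semantic redundancy. The core solution is Temporal GraphRAG (T-GRAG), a dynamic RAG framework built on a time-aware knowledge graph structure. Its key is a five-component system around a time-stamped knowledge graph: (1) a temporal knowledge graph generator that models the evolution of knowledge; (2) a temporal query decomposition mechanism that splits complex temporal queries into sub-queries; (3) a three-layer interactive retriever that filters and refines results layer by layer across temporal subgraphs; (4) a source text extractor that reduces noise; and (5) a generator based on a large language model (LLM) that produces contextually and temporally consistent answers. Experiments show that T-GRAG significantly outperforms conventional RAG and GraphRAG baselines in retrieval accuracy and answer relevance under temporal constraints, confirming the importance of modeling knowledge evolution for long-text question answering.
Link: https://arxiv.org/abs/2508.01680
Authors: Dong Li,Yichen Niu,Ying Ai,Xiang Zou,Biqing Qi,Jianxing Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: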
Abstract:Large language models (LLMs) have demonstrated strong performance in natural language generation but remain limited in knowledge-intensive tasks due to outdated or incomplete internal knowledge. Retrieval-Augmented Generation (RAG) addresses this by incorporating external retrieval, with GraphRAG further enhancing performance through structured knowledge graphs and multi-hop reasoning. However, existing GraphRAG methods largely ignore the temporal dynamics of knowledge, leading to issues such as temporal ambiguity, time-insensitive retrieval, and semantic redundancy. To overcome these limitations, we propose Temporal GraphRAG (T-GRAG), a dynamic, temporally-aware RAG framework that models the evolution of knowledge over time. T-GRAG consists of five key components: (1) a Temporal Knowledge Graph Generator that creates time-stamped, evolving graph structures; (2) a Temporal Query Decomposition mechanism that breaks complex temporal queries into manageable sub-queries; (3) a Three-layer Interactive Retriever that progressively filters and refines retrieval across temporal subgraphs; (4) a Source Text Extractor to mitigate noise; and (5) an LLM-based Generator that synthesizes contextually and temporally accurate responses. We also introduce Time-LongQA, a novel benchmark dataset based on real-world corporate annual reports, designed to test temporal reasoning across evolving knowledge. Extensive experiments show that T-GRAG significantly outperforms prior RAG and GraphRAG baselines in both retrieval accuracy and response relevance under temporal constraints, highlighting the necessity of modeling knowledge evolution for robust long-text question answering. Our code is publicly available.
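Two of the ideas above, time-stamped facts and decomposition of a temporal query into sub-queries, can be pictured with a tiny in-memory example. This is only an illustration of the concepts, not T-GRAG's architecture: `TemporalFact`, the per-year decomposition, and the `ExampleCorp` data are all hypothetical.

```python
from collections import namedtuple

# A time-stamped fact: (subject, relation, value) valid for a given year.
TemporalFact = namedtuple("TemporalFact", ["subject", "relation", "value", "year"])


def facts_in_year(facts, subject, relation, year):
    """Time-sensitive lookup: only facts stamped with the requested year."""
    return [f for f in facts
            if (f.subject, f.relation, f.year) == (subject, relation, year)]


def decompose_range_query(subject, relation, first_year, last_year):
    """Split a multi-year question into one sub-query per year."""
    return [(subject, relation, year) for year in range(first_year, last_year + 1)]


def answer_range(facts, subject, relation, first_year, last_year):
    """Answer each sub-query separately and drop duplicate values."""
    answers = {}
    for s, r, year in decompose_range_query(subject, relation, first_year, last_year):
        hits = facts_in_year(facts, s, r, year)
        answers[year] = list(dict.fromkeys(f.value for f in hits))  # keeps order
    return answers


facts = [
    TemporalFact("ExampleCorp", "revenue_musd", 120, 2021),
    TemporalFact("ExampleCorp", "revenue_musd", 150, 2022),
    TemporalFact("ExampleCorp", "ceo", "A. Smith", 2021),
    TemporalFact("ExampleCorp", "revenue_musd", 150, 2022),  # redundant copy
]
answers = answer_range(facts, "ExampleCorp", "revenue_musd", 2021, 2022)
```

Without the year stamp, the 2021 and 2022 revenue figures would look like conflicting answers to the same question, which is the temporal ambiguity the abstract describes.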
[AI-89] QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry
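【Quick read】: This paper addresses the limited ability of Large Language Models (LLMs) to perform rigorous, step-by-step quantitative reasoning in chemistry; in particular, their mathematical reasoning on computational chemistry tasks has not been systematically evaluated. The key to the solution is QCBench, a benchmark of 350 computational chemistry problems covering 7 subfields (analytical chemistry, bio/organic chemistry, general chemistry, inorganic chemistry, physical chemistry, polymer chemistry, and quantum chemistry) and organized into three tiers (basic, intermediate, and expert). It emphasizes pure numerical calculation and stepwise reasoning, enabling fine-grained diagnosis and quantitative evaluation of LLMs' accuracy in scientific computation.
Link: https://arxiv.org/abs/2508.01670
Authors: Jiaqing Xie,Weida Wang,Ben Gao,Zhuo Yang,Haiyuan Wan,Shufei Zhang,Tianfan Fu,Yuqiang Li
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
Comments: 13 pages, 8 figures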
Abstract:Quantitative chemistry plays a fundamental role in chemistry research, enabling precise predictions of molecular properties, reaction outcomes, and material behaviors. While large language models (LLMs) have shown promise in chemistry-related tasks, their ability to perform rigorous, step-by-step quantitative reasoning remains underexplored. To fill this gap, we propose QCBench, a Quantitative Chemistry benchmark comprising 350 computational chemistry problems across 7 chemistry subfields (analytical chemistry, bio/organic chemistry, general chemistry, inorganic chemistry, physical chemistry, polymer chemistry and quantum chemistry), categorized into three hierarchical tiers (basic, intermediate, and expert) to systematically evaluate the mathematical reasoning abilities of LLMs. Designed to minimize shortcuts and emphasize stepwise numerical reasoning, each problem focuses on pure calculations rooted in real-world chemical vertical fields. QCBench enables fine-grained diagnosis of computational weaknesses, reveals model-specific limitations across difficulty levels, and lays the groundwork for future improvements such as domain adaptive fine-tuning or multi-modal integration. Evaluations on 19 LLMs demonstrate a consistent performance degradation with increasing task complexity, highlighting the current gap between language fluency and scientific computation accuracy.
[AI-90] SPARTA: Advancing Sparse Attention in Spiking Neural Networks via Spike-Timing-Based Prioritization AAAI2026
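【Quick read】: This paper addresses the inefficiency of current Spiking Neural Networks (SNNs) in processing spatiotemporal information and their failure to fully exploit spike-timing dynamics: existing methods rely mainly on rate coding and overlook the rich computational cues carried by precise spike timing. The key to the solution is SPARTA (Spiking Priority Attention with Resource-Adaptive Temporal Allocation), a framework that combines heterogeneous neuron dynamics with spike-timing information to build a priority attention mechanism driven by temporal cues (firing patterns, spike timing, and inter-spike intervals). A competitive gating strategy yields 65.4% sparsity, reducing attention complexity from O(N²) to O(K²) (K ≪ N) while maintaining high accuracy and reaching state-of-the-art performance on DVS-Gesture.
Link: https://arxiv.org/abs/2508.01646
Authors: Minsuk Jang,Changick Kim
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 9 pages, 4 figures, submitted to AAAI 2026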
Abstract:Current Spiking Neural Networks (SNNs) underutilize the temporal dynamics inherent in spike-based processing, relying primarily on rate coding while overlooking precise timing information that provides rich computational cues. We propose SPARTA (Spiking Priority Attention with Resource-Adaptive Temporal Allocation), a framework that leverages heterogeneous neuron dynamics and spike-timing information to enable efficient sparse attention. SPARTA prioritizes tokens based on temporal cues, including firing patterns, spike timing, and inter-spike intervals, achieving 65.4% sparsity through competitive gating. By selecting only the most salient tokens, SPARTA reduces attention complexity from O(N^2) to O(K^2) with K ≪ N, while maintaining high accuracy. Our method achieves state-of-the-art performance on DVS-Gesture (98.78%) and competitive results on CIFAR10-DVS (83.06%) and CIFAR-10 (95.3%), demonstrating that exploiting spike timing dynamics improves both computational efficiency and accuracy.
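The complexity claim in the abstract can be illustrated with a minimal sketch: keep only the K highest-priority tokens, then run ordinary dot-product self-attention among them, so K × K pairwise scores are computed instead of N × N. This is not the SPARTA implementation; the `priority` scores are a hypothetical stand-in for the spike-timing cues, and the function names are made up for illustration.

```python
import math


def top_k_indices(scores, k):
    """Indices of the k highest scores, returned in original token order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(ranked)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_then_attend(tokens, priority, k):
    """Self-attention restricted to the k highest-priority tokens.

    tokens: list of equal-length float vectors (used as queries, keys, and values).
    priority: one score per token (hypothetical proxy for temporal cues).
    Returns the kept indices and one output vector per kept token.
    """
    keep = top_k_indices(priority, k)
    selected = [tokens[i] for i in keep]
    dim = len(selected[0])
    outputs = []
    for query in selected:
        # Only len(selected) scores per query, so k * k scores in total.
        logits = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in selected]
        weights = softmax(logits)
        outputs.append([sum(w * v[d] for w, v in zip(weights, selected))
                        for d in range(dim)])
    return keep, outputs
```

If the reported 65.4% sparsity is read as the share of tokens dropped (an assumption, since the abstract does not define it), then K ≈ 0.346 N and the number of pairwise scores falls to roughly 12% of the dense case.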
[AI-91] DRKF: Decoupled Representations with Knowledge Fusion for Multimodal Emotion Recognition
【Quick read】: This paper targets the performance bottleneck in Multimodal Emotion Recognition (MER) caused by modality heterogeneity and inconsistent emotional cues, and proposes Decoupled Representations with Knowledge Fusion (DRKF). First, the Optimized Representation Learning (ORL) module uses contrastive mutual information estimation with progressive modality augmentation to separate task-relevant shared representations from modality-specific features, mitigating differences between modalities. Second, the Knowledge Fusion (KF) module introduces a lightweight self-attention-based Fusion Encoder (FE) that dynamically identifies the dominant modality and integrates emotional information from the other modalities to strengthen the joint representation. In addition, an Emotion Discrimination Submodule (ED) preserves discriminative cues of inconsistency in emotionally inconsistent scenarios, so that even if the FE selects the wrong dominant modality, the Emotion Classification Submodule (EC) can still predict accurately from the retained information. This mechanism improves the model's robustness and accuracy in complex real-world scenarios.
Link: https://arxiv.org/abs/2508.01644
Authors: Peiyuan Jiang(School of Computer Science and Engineering, University of Electronic Science and Technology of China),Yao Liu(School of Information and Software Engineering, University of Electronic Science and Technology of China),Qiao Liu(School of Computer Science and Engineering, University of Electronic Science and Technology of China),Zongshun Zhang(School of Computer Science and Engineering, University of Electronic Science and Technology of China),Jiaye Yang(School of Computer Science and Engineering, University of Electronic Science and Technology of China),Lu Liu(School of Computer Science and Engineering, University of Electronic Science and Technology of China),Daibing Yao(Yizhou Prison, Sichuan Province)
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
Comments: Published in ACM Multimedia 2025. 10 pages, 4 figures
Abstract:Multimodal emotion recognition (MER) aims to identify emotional states by integrating and analyzing information from multiple modalities. However, inherent modality heterogeneity and inconsistencies in emotional cues remain key challenges that hinder performance. To address these issues, we propose a Decoupled Representations with Knowledge Fusion (DRKF) method for MER. DRKF consists of two main modules: an Optimized Representation Learning (ORL) Module and a Knowledge Fusion (KF) Module. ORL employs a contrastive mutual information estimation method with progressive modality augmentation to decouple task-relevant shared representations and modality-specific features while mitigating modality heterogeneity. KF includes a lightweight self-attention-based Fusion Encoder (FE) that identifies the dominant modality and integrates emotional information from other modalities to enhance the fused representation. To handle potential errors from incorrect dominant modality selection under emotionally inconsistent conditions, we introduce an Emotion Discrimination Submodule (ED), which enforces the fused representation to retain discriminative cues of emotional inconsistency. This ensures that even if the FE selects an inappropriate dominant modality, the Emotion Classification Submodule (EC) can still make accurate predictions by leveraging preserved inconsistency information. Experiments show that DRKF achieves state-of-the-art (SOTA) performance on IEMOCAP, MELD, and M3ED. The source code is publicly available at this https URL.
[AI-92] Semantic Encryption: Secure and Effective Interaction with Cloud-based Large Language Models via Semantic Transformation
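【Quick read】: This paper addresses data privacy protection during user interaction with Cloud-based Large Language Models (CLLMs). Existing methods mostly focus on encrypting sensitive information but ignore the logical structure of user inputs, which reduces data utility and degrades CLLM performance. The key to the solution is Semantic Encryption (SE), a plug-and-play framework built on two components, Semantic Encoding and Semantic Decoding. In the encoding phase, a lightweight local model transforms the original input into a semantic context that preserves the original intent and logical structure while hiding sensitive information, and the CLLM processes this transformed input. In the decoding phase, the CLLM output is reconstructed with reference to the locally stored original input, keeping the user experience seamless. Experiments show that SE markedly improves privacy protection without sacrificing data utility or user experience and outperforms the current state-of-the-art method, InferDPT.
Link: https://arxiv.org/abs/2508.01638
Authors: Dong Chen,Tong Yang,Feipeng Zhai,Pengpeng Ouyang,Qidong Liu,Yafei Li,Chong Fu,Mingliang Xu
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: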
Abstract:The increasing adoption of Cloud-based Large Language Models (CLLMs) has raised significant concerns regarding data privacy during user interactions. While existing approaches primarily focus on encrypting sensitive information, they often overlook the logical structure of user inputs. This oversight can lead to reduced data utility and degraded performance of CLLMs. To address these limitations and enable secure yet effective interactions, we propose Semantic Encryption (SE), a plug-and-play framework designed to preserve both privacy and utility. SE consists of two key components: Semantic Encoding and Semantic Decoding. In the encoding phase, a lightweight local model transforms the original user input into an alternative semantic context that maintains the original intent and logical structure while obfuscating sensitive information. This transformed input is then processed by the CLLM, which generates a response based on the transformed semantic context. To maintain a seamless user experience, the decoding phase reconstructs the CLLM’s response back into the original semantic context by referencing the locally stored user input. Extensive experimental evaluations demonstrate that SE effectively protects data privacy without compromising data utility or user experience, offering a practical solution for secure interaction with CLLMs. In particular, the proposed SE demonstrates a significant improvement over the state-of-the-art InferDPT, surpassing it across various evaluated metrics and datasets.
[AI-93] Learning Unified System Representations for Microservice Tail Latency Prediction
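【Quick read】: This paper addresses key challenges in performance monitoring and resource management for microservice architectures. Traditional per-request latency metrics are sensitive to transient noise and do not reflect the overall behavior of complex concurrent workloads. Existing methods also handle heterogeneous data poorly: they cannot effectively distinguish and fuse traffic-side features (such as request propagation patterns) with resource-side signals (such as localized bottlenecks), and they lack a principled architecture for integrating these two complementary modalities. The key to the solution is USRFNet, a deep learning network that uses graph neural networks (GNNs) to explicitly model service interactions and request propagation for traffic-side features, uses gMLP modules to independently model cluster resource dynamics for resource-side features, and fuses the two into a unified system embedding that predicts window-level P95 tail latency with high accuracy.
Link: https://arxiv.org/abs/2508.01635
Authors: Wenzhuo Qian,Hailiang Zhao,Tianlv Chen,Jiayi Chen,Ziqi Wang,Kingsum Chow,Shuiguang Deng
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments: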
Abstract:Microservice architectures have become the de facto standard for building scalable cloud-native applications, yet their distributed nature introduces significant challenges in performance monitoring and resource management. Traditional approaches often rely on per-request latency metrics, which are highly sensitive to transient noise and fail to reflect the holistic behavior of complex, concurrent workloads. In contrast, window-level P95 tail latency provides a stable and meaningful signal that captures both system-wide trends and user-perceived performance degradation. We identify two key shortcomings in existing methods: (i) inadequate handling of heterogeneous data, where traffic-side features propagate across service dependencies and resource-side signals reflect localized bottlenecks, and (ii) the lack of principled architectural designs that effectively distinguish and integrate these complementary modalities. To address these challenges, we propose USRFNet, a deep learning network that explicitly separates and models traffic-side and resource-side features. USRFNet employs GNNs to capture service interactions and request propagation patterns, while gMLP modules independently model cluster resource dynamics. These representations are then fused into a unified system embedding to predict window-level P95 latency with high accuracy. We evaluate USRFNet on real-world microservice benchmarks under large-scale stress testing conditions, demonstrating substantial improvements in prediction accuracy over state-of-the-art baselines.
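The prediction target named in the abstract, window-level P95 tail latency, can be made concrete with a short sketch that buckets request latencies into fixed time windows and takes the 95th percentile of each bucket. The nearest-rank percentile definition, the window length, and the function names are assumptions for illustration; the paper may compute the label differently.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def window_p95(samples, window_s):
    """Group (timestamp_seconds, latency_ms) samples into fixed windows.

    Returns a dict mapping each window's start time to its P95 latency.
    """
    buckets = {}
    for timestamp, latency in samples:
        start = int(timestamp // window_s) * window_s
        buckets.setdefault(start, []).append(latency)
    return {start: p95(latencies) for start, latencies in sorted(buckets.items())}
```

A single slow request dominates a per-request metric, whereas a window's P95 moves only when at least about 5% of the requests in that window are slow, which is one way to read the abstract's stability argument.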
[AI-94] EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models ACL2025
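【Quick read】: This paper studies two key problems that Mixture-of-Experts (MoE) models face when scaling large language models (LLMs): excessive GPU memory consumption, because all experts must be loaded, and a low ratio of activated parameters that does not translate equivalently into inference speedup. The key to the solution is EAC-MoE, a compression framework tailored to expert-selection behavior, with two core modules. The first, Quantization with Expert-Selection Calibration (QESC), calibrates the MoE routers to mitigate the expert-selection bias caused by low-bit quantization. The second, Pruning based on Expert-Selection Frequency (PESF), removes experts that are used less frequently for the current task, substantially reducing inference latency. Experiments show that the approach reduces memory usage and improves inference efficiency with minimal performance loss.
Link: https://arxiv.org/abs/2508.01625
Authors: Yuanteng Chen,Yuantian Shao,Peisong Wang,Jian Cheng
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages, 13 figures. ACL 2025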
Abstract:Mixture-of-Experts (MoE) has demonstrated promising potential in scaling LLMs. However, it is hindered by two critical challenges: (1) substantial GPU memory consumption to load all experts; (2) the low number of activated parameters does not translate equivalently into inference acceleration. In this work, we propose EAC-MoE, an Expert-Selection Aware Compressor for MoE-LLMs, which deeply aligns with the characteristics of MoE from the perspectives of quantization and pruning, and introduces two modules to address these two challenges respectively: (1) The expert selection bias caused by low-bit quantization is a major factor contributing to the performance degradation in MoE-LLMs. Based on this, we propose Quantization with Expert-Selection Calibration (QESC), which mitigates the expert selection bias by calibrating the routers within the MoE; (2) There are always certain experts that are not crucial for the corresponding tasks, yet they add inference latency. Therefore, we propose Pruning based on Expert-Selection Frequency (PESF), which significantly improves inference speed by pruning less frequently used experts for the current task. Extensive experiments demonstrate that our approach significantly reduces memory usage and improves inference speed with minimal performance degradation.
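The idea behind frequency-based expert pruning can be sketched in a few lines: count how often the router selects each expert on task data, then mark experts whose share of selections falls below a threshold. This is only an illustration of the general idea; the routing-log format, the `min_share` threshold rule, and the function names are hypothetical and not taken from the paper.

```python
from collections import Counter


def expert_selection_frequency(routing_log, num_experts):
    """Share of router selections received by each expert.

    routing_log: list of per-token lists of selected expert ids
    (for example, the top-2 experts chosen for each token).
    """
    counts = Counter(e for chosen in routing_log for e in chosen)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("routing_log contains no expert selections")
    return [counts.get(e, 0) / total for e in range(num_experts)]


def experts_to_prune(frequencies, min_share):
    """Ids of experts whose selection share is below min_share (assumed rule)."""
    return [e for e, share in enumerate(frequencies) if share < min_share]
```

In a real MoE model the router would also need to redistribute tokens that previously went to a pruned expert; the sketch stops at choosing which experts to drop.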
[AI-95] A Multi-Agent Pokemon Tournament for Evaluating Strategic Reasoning of Large Language Models
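【Quick read】: This paper addresses how to evaluate the ability of Large Language Models (LLMs) to make, adapt, and optimize decisions in complex, rule-constrained strategic environments. The key to the solution is LLM Pokemon League, a competitive platform based on Pokémon battle mechanics in which LLMs act as intelligent agents in turn-based, type-based battles. Using a single-elimination format, the platform systematically records and analyzes model behavior along dimensions such as team-building rationale, action selection, and switching decisions. This allows quantitative comparison of different LLMs in tactical depth, adaptability, and meta-strategy development, and offers a new benchmark for AI research on strategic reasoning and competitive learning.
Link: https://arxiv.org/abs/2508.01623
Authors: Tadisetty Sai Yashwanth,Dhatri C
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: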
Abstract:This research presents LLM Pokemon League, a competitive tournament system that leverages Large Language Models (LLMs) as intelligent agents to simulate strategic decision-making in Pokémon battles. The platform is designed to analyze and compare the reasoning, adaptability, and tactical depth exhibited by different LLMs in a type-based, turn-based combat environment. By structuring the competition as a single-elimination tournament involving diverse AI trainers, the system captures detailed decision logs, including team-building rationale, action selection strategies, and switching decisions. The project enables rich exploration into comparative AI behavior, battle psychology, and meta-strategy development in constrained, rule-based game environments. Through this system, we investigate how modern LLMs understand, adapt, and optimize decisions under uncertainty, making Pokémon League a novel benchmark for AI research in strategic reasoning and competitive learning.
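The tournament structure described in the abstract can be sketched as a plain single-elimination bracket that records every match. The sketch assumes a power-of-two number of entrants and takes the battle itself as a caller-supplied `play_match` function; both are assumptions for illustration, and nothing here reflects the authors' actual code.

```python
def run_single_elimination(entrants, play_match):
    """Run a single-elimination bracket and log every match.

    entrants: list of competitors; its length must be a power of two.
    play_match: hypothetical callable (a, b) -> winner, e.g. one simulated battle.
    Returns the champion and a list of rounds, each a list of (a, b, winner).
    """
    n = len(entrants)
    if n == 0 or n & (n - 1):
        raise ValueError("number of entrants must be a power of two")
    rounds = []
    current = list(entrants)
    while len(current) > 1:
        results = []
        for i in range(0, len(current), 2):
            a, b = current[i], current[i + 1]
            results.append((a, b, play_match(a, b)))
        rounds.append(results)
        # Winners advance in bracket order to the next round.
        current = [winner for _, _, winner in results]
    return current[0], rounds
```

The per-round log is where decision records such as team-building rationale or switching choices could be attached in a fuller system.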
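[AI-96] TCDiff: Triplex Cascaded Diffusion for High-fidelity Multimodal EHRs Generation with Incomplete Clinical Data
【Quick read】: This paper addresses the scarcity and insufficient quality of Electronic Health Record (EHR) data, a problem that grows as Generative AI models depend increasingly on large-scale, high-quality data. Existing methods struggle to model the intrinsic properties of heterogeneous multimodal EHR data (such as continuous, discrete, and textual modalities), to capture the complex dependencies among modalities, and to stay robust when missing data is pervasive; these problems are especially acute in Traditional Chinese Medicine (TCM). The key to the solution is TCDiff (Triplex Cascaded Diffusion Network), which cascades three diffusion networks into a multi-stage generative process: Reference Modalities Diffusion, Cross-Modal Bridging, and Target Modality Diffusion. This enables hierarchical learning of real-world EHR data characteristics and high-quality synthesis. Across different missing rates, TCDiff outperforms state-of-the-art baselines by an average of 10% in data fidelity while maintaining competitive privacy guarantees.
Link: https://arxiv.org/abs/2508.01615
Authors: Yandong Yan,Chenxi Li,Yu Huang,Dexuan Xu,Jiaqi Zhu,Zhongyan Chai,Huamin Zhang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: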
Abstract:The scarcity of large-scale and high-quality electronic health records (EHRs) remains a major bottleneck in biomedical research, especially as large foundation models become increasingly data-hungry. Synthesizing substantial volumes of de-identified and high-fidelity data from existing datasets has emerged as a promising solution. However, existing methods suffer from a series of limitations: they struggle to model the intrinsic properties of heterogeneous multimodal EHR data (e.g., continuous, discrete, and textual modalities), capture the complex dependencies among them, and robustly handle pervasive data incompleteness. These challenges are particularly acute in Traditional Chinese Medicine (TCM). To this end, we propose TCDiff (Triplex Cascaded Diffusion Network), a novel EHR generation framework that cascades three diffusion networks to learn the features of real-world EHR data, forming a multi-stage generative process: Reference Modalities Diffusion, Cross-Modal Bridging, and Target Modality Diffusion. Furthermore, to validate our proposed framework, besides two public datasets, we also construct and introduce TCM-SZ1, a novel multimodal EHR dataset for benchmarking. Experimental results show that TCDiff consistently outperforms state-of-the-art baselines by an average of 10% in data fidelity under various missing rates, while maintaining competitive privacy guarantees. This highlights the effectiveness, robustness, and generalizability of our approach in real-world healthcare scenarios.
[AI-97] Augmented Reinforcement Learning Framework For Enhancing Decision-Making In Machine Learning Models Using External Agents
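【Quick read】: This paper addresses the decline in decision quality that the “Garbage-In, Garbage-Out” problem causes in reinforcement learning, where poor data inputs or incorrect actions degrade model performance. The key to the solution is the Augmented Reinforcement Learning (ARL) framework, which introduces two external agents to supervise and correct the training process. External Agent 1 evaluates model decisions in real time and builds a Rejected Data Pipeline for identifying suboptimal behavior. External Agent 2 selectively filters the feedback for business relevance and accuracy, producing a high-quality curated dataset for later training iterations. By combining machine efficiency with human insight, the mechanism improves model robustness and decision accuracy in complex or ambiguous environments.
Link: https://arxiv.org/abs/2508.01612
Authors: Sandesh Kumar Singh
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Master’s thesis, 274 pages, 8 Tables, 73 figures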
Abstract:This work proposes Augmented Reinforcement Learning (ARL), a novel framework for improving the decision-making capabilities of machine learning models. It introduces external agents as overseers that check model decisions. An external agent can be anyone or anything, such as a human or an automated script, that helps correct the decision path. The framework targets the “Garbage-In, Garbage-Out” problem, in which poor data inputs or incorrect actions degrade reinforcement learning. ARL incorporates two external agents that aid in course correction and guarantee quality data at all points of the training cycle. External Agent 1 is a real-time evaluator that provides feedback in light of the decisions taken by the model and identifies suboptimal actions, which form the Rejected Data Pipeline. External Agent 2 selectively curates the provided feedback for relevance and accuracy in business scenarios and creates an approved dataset for future training cycles. The framework is validated on a real-world scenario, “Document Identification and Information Extraction”. This problem originates mainly from banking systems but can be extended elsewhere, and it requires classification and information extraction to be done correctly. Experimental results show that including human feedback significantly enhances the robustness and accuracy of the model's decisions. The augmented approach, combining machine efficiency with human insight, attains a higher learning standard, mainly in complex or ambiguous environments. The findings of this study show that human-in-the-loop reinforcement learning frameworks such as ARL can provide a scalable approach to improving model performance in data-driven applications.
[AI-98] Drift-aware Collaborative Assistance Mixture of Experts for Heterogeneous Multistream Learning
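【Quick read】: This paper addresses the difficulty of learning from multiple data streams in the presence of intrinsic heterogeneity and unpredictable concept drift. Existing methods usually assume homogeneous streams and use static architectures with indiscriminate knowledge fusion, so they adapt poorly to complex dynamic environments. The key to the solution is CAMEL, a dynamic collaborative mixture-of-experts learning framework. It handles heterogeneity by giving each stream its own feature extractor and task-specific head; it maintains a dynamic pool of private experts to capture stream-specific patterns; and it adds a dedicated assistance expert that uses multi-head attention to autonomously distill and integrate relevant context from the other concurrent streams, enabling targeted knowledge transfer while suppressing negative transfer from irrelevant sources. In addition, the Autonomous Expert Tuner (AET) strategy manages expert lifecycles in response to concept drift: it instantiates new experts for emerging concepts (freezing earlier ones to prevent catastrophic forgetting) and prunes obsolete experts, giving robust and efficient online adaptation of model capacity.
Link: https://arxiv.org/abs/2508.01598
Authors: En Yu,Jie Lu,Kun Wang,Xiaoyu Yang,Guangquan Zhang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: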
Abstract:Learning from multiple data streams in real-world scenarios is fundamentally challenging due to intrinsic heterogeneity and unpredictable concept drifts. Existing methods typically assume homogeneous streams and employ static architectures with indiscriminate knowledge fusion, limiting generalizability in complex dynamic environments. To tackle this gap, we propose CAMEL, a dynamic Collaborative Assistance Mixture of Experts Learning framework. It addresses heterogeneity by assigning each stream an independent system with a dedicated feature extractor and task-specific head. Meanwhile, a dynamic pool of specialized private experts captures stream-specific idiosyncratic patterns. Crucially, collaboration across these heterogeneous streams is enabled by a dedicated assistance expert. This expert employs a multi-head attention mechanism to distill and integrate relevant context autonomously from all other concurrent streams. It facilitates targeted knowledge transfer while inherently mitigating negative transfer from irrelevant sources. Furthermore, we propose an Autonomous Expert Tuner (AET) strategy, which dynamically manages expert lifecycles in response to drift. It instantiates new experts for emerging concepts (freezing prior ones to prevent catastrophic forgetting) and prunes obsolete ones. This expert-level plasticity provides a robust and efficient mechanism for online model capacity adaptation. Extensive experiments demonstrate CAMEL’s superior generalizability across diverse multistreams and exceptional resilience against complex concept drifts.
[AI-99] Censored Sampling for Topology Design: Guiding Diffusion with Human Preferences
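【Quick read】: This paper addresses design flaws that can arise when topology-optimized designs are generated with denoising diffusion models that rely on surrogate predictors to enforce physical constraints. Such predictors can miss subtle but critical manufacturability problems, for example floating components or boundary discontinuities, that are obvious to human experts. The key to the solution is a human-in-the-loop diffusion framework: a lightweight reward model is trained on a small amount of human feedback and embedded in the sampling loop of a pre-trained diffusion generator, where gradients of the human-aligned reward modulate the reverse diffusion trajectory. This steers generation toward designs that are structurally performant, physically plausible, and manufacturable. The method is modular, requires no retraining of the diffusion model, and substantially reduces failure modes while improving design realism.
Link: https://arxiv.org/abs/2508.01589
Authors: Euihyun Kim,Keun Park,Yeoneung Kim
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: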
Abstract:Recent advances in denoising diffusion models have enabled rapid generation of optimized structures for topology optimization. However, these models often rely on surrogate predictors to enforce physical constraints, which may fail to capture subtle yet critical design flaws such as floating components or boundary discontinuities that are obvious to human experts. In this work, we propose a novel human-in-the-loop diffusion framework that steers the generative process using a lightweight reward model trained on minimal human feedback. Inspired by preference alignment techniques in generative modeling, our method learns to suppress unrealistic outputs by modulating the reverse diffusion trajectory using gradients of human-aligned rewards. Specifically, we collect binary human evaluations of generated topologies and train classifiers to detect floating material and boundary violations. These reward models are then integrated into the sampling loop of a pre-trained diffusion generator, guiding it to produce designs that are not only structurally performant but also physically plausible and manufacturable. Our approach is modular and requires no retraining of the diffusion model. Preliminary results show substantial reductions in failure modes and improved design realism across diverse test conditions. This work bridges the gap between automated design generation and expert judgment, offering a scalable solution to trustworthy generative design.
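The phrase "modulating the reverse diffusion trajectory using gradients of human-aligned rewards" can be illustrated with a one-dimensional toy in the style of classifier guidance, where the mean of each reverse step is shifted along the gradient of a log-reward. This is an assumed, generic form and not the update rule from the paper; the quadratic log-reward, the `scale` parameter, and the function names are hypothetical.

```python
def grad_log_reward(x, target=0.0, width=1.0):
    """Gradient of a hypothetical log-reward, log r(x) = -(x - target)^2 / (2 * width^2)."""
    return -(x - target) / (width ** 2)


def guided_mean(mu, variance, x_t, scale, target=0.0):
    """Shift a reverse-step mean toward higher reward.

    mu, variance: mean and variance proposed by the (assumed) pre-trained model.
    x_t: current noisy sample; scale: guidance strength, where 0 disables guidance.
    """
    return mu + scale * variance * grad_log_reward(x_t, target)
```

Because the shift is applied only at sampling time, the generator's weights stay untouched, which matches the abstract's point that no retraining of the diffusion model is required.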
[AI-100] Diffusion Models for Future Networks and Communications: A Comprehensive Survey
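【Quick read】: This paper addresses how diffusion models (DMs) can be used effectively to improve performance in complex scenarios as Generative AI is applied to wireless communication and networking systems. The core challenge is that traditional methods struggle with high-dimensional data distributions, noise, and dynamic resource scheduling. The key to the solution is a systematic account of the theoretical foundations of DMs together with their application to optimizer design, reinforcement learning, incentive mechanisms, channel modeling and estimation, signal detection and data reconstruction, integrated sensing and communication (ISAC), resource management in edge computing, and semantic communications, enabling efficient modeling and optimization of key tasks in wireless networks.
Link: https://arxiv.org/abs/2508.01586
Authors: Nguyen Cong Luong,Nguyen Duc Hai,Duc Van Le,Huy T. Nguyen,Thai-Hoc Vu,Thien Huynh-The,Ruichen Zhang,Nguyen Duc Duy Anh,Dusit Niyato,Marco Di Renzo,Dong In Kim,Quoc-Viet Pham
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
Comments: This work was submitted to Proceedings of the IEEE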
Abstract:The rise of Generative AI (GenAI) in recent years has catalyzed transformative advances in wireless communications and networks. Among the members of the GenAI family, Diffusion Models (DMs) have risen to prominence as a powerful option, capable of handling complex, high-dimensional data distributions and delivering consistent, noise-robust performance. In this survey, we aim to provide a comprehensive overview of the theoretical foundations and practical applications of DMs across future communication systems. We first provide an extensive tutorial of DMs and demonstrate how they can be applied to enhance optimizers, reinforcement learning and incentive mechanisms, which are popular approaches for problems in wireless networks. Then, we review and discuss the DM-based methods proposed for emerging issues in future networks and communications, including channel modeling and estimation, signal detection and data reconstruction, integrated sensing and communication, resource management in edge computing networks, semantic communications and other notable issues. We conclude the survey by highlighting technical limitations of DMs and their applications, as well as discussing future research directions.
[AI-101] Polymorphic Combinatorial Frameworks (PCF): Guiding the Design of Mathematically-Grounded Adaptive AI Agents
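【Quick read】: This paper addresses the difficulty AI agents have in adapting efficiently and scaling in complex dynamic environments, in particular the inability of traditional static architectures to adjust core behavioral traits in real time. The key to the solution is the Polymorphic Combinatorial Framework (PCF). Built on mathematical foundations (combinatorial logic, topos theory, and rough fuzzy set theory), PCF defines a five-dimensional SPARK parameter space (Skills, Personalities, Approaches, Resources, Knowledge), lets large language models (LLMs) parameterize and estimate high-dimensional parameter spaces, and uses Monte Carlo simulation to optimize dynamic configurations at scale. PCF supports scenario-specific generation of optimized agent configurations while maintaining logical consistency, offering a new paradigm for scalable, dynamic, explainable, and ethical AI applications in customer service, healthcare, robotics, and collaborative systems.
Link: https://arxiv.org/abs/2508.01581
Authors: David Pearl,Matthew Murphy,James Intriligator
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Combinatorics (math.CO); Computation (stat.CO)
Comments: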
Abstract:The Polymorphic Combinatorial Framework (PCF) leverages Large Language Models (LLMs) and mathematical frameworks to guide the meta-prompt enabled design of solution spaces and adaptive AI agents for complex, dynamic environments. Unlike static agent architectures, PCF enables real-time parameter reconfiguration through mathematically-grounded combinatorial spaces, allowing agents to adapt their core behavioral traits dynamically. Grounded in combinatorial logic, topos theory, and rough fuzzy set theory, PCF defines a multidimensional SPARK parameter space (Skills, Personalities, Approaches, Resources, Knowledge) to capture agent behaviors. This paper demonstrates how LLMs can parameterize complex spaces and estimate likely parameter values/variabilities. Using PCF, we parameterized mock café domains (five levels of complexity), estimated variables/variabilities, and conducted over 1.25 million Monte Carlo simulations. The results revealed trends in agent adaptability and performance across the five complexity tiers, with diminishing returns at higher complexity levels highlighting thresholds for scalable designs. PCF enables the generation of optimized agent configurations for specific scenarios while maintaining logical consistency. This framework supports scalable, dynamic, explainable, and ethical AI applications in domains like customer service, healthcare, robotics, and collaborative systems, paving the way for adaptable and cooperative next-generation polymorphic agents.
[AI-102] One Subgoal at a Time: Zero-Shot Generalization to Arbitrary Linear Temporal Logic Requirements in Multi-Task Reinforcement Learning
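【Quick read】: This paper addresses generalization in reinforcement learning (RL) to complex, temporally extended task objectives and safety constraints, in particular the difficulty of handling nested long-horizon tasks and safety constraints and the inability to recognize when a subgoal is unsatisfiable and an alternative should be sought. The key to the solution is GenZ-LTL, which uses the structure of Büchi automata to decompose a Linear Temporal Logic (LTL) task specification into a sequence of reach-avoid subgoals and solves them one at a time through proper safe RL formulations, rather than conditioning on subgoal sequences. It also introduces a novel subgoal-induced observation reduction technique that, under realistic assumptions, mitigates the exponential complexity of subgoal-state combinations, enabling zero-shot generalization to unseen LTL specifications.
Link: https://arxiv.org/abs/2508.01561
Authors: Zijian Guo,İlker Işık,H. M. Sabbir Ahmad,Wenchao Li
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: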
Abstract:Generalizing to complex and temporally extended task objectives and safety constraints remains a critical challenge in reinforcement learning (RL). Linear temporal logic (LTL) offers a unified formalism to specify such requirements, yet existing methods are limited in their abilities to handle nested long-horizon tasks and safety constraints, and cannot identify situations when a subgoal is not satisfiable and an alternative should be sought. In this paper, we introduce GenZ-LTL, a method that enables zero-shot generalization to arbitrary LTL specifications. GenZ-LTL leverages the structure of Büchi automata to decompose an LTL task specification into sequences of reach-avoid subgoals. Contrary to the current state-of-the-art method that conditions on subgoal sequences, we show that it is more effective to achieve zero-shot generalization by solving these reach-avoid problems one subgoal at a time through proper safe RL formulations. In addition, we introduce a novel subgoal-induced observation reduction technique that can mitigate the exponential complexity of subgoal-state combinations under realistic assumptions. Empirical results show that GenZ-LTL substantially outperforms existing methods in zero-shot generalization to unseen LTL specifications.
[AI-103] Empowering Tabular Data Preparation with Language Models: Why and How?
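【Quick read】: This paper addresses the limitations of traditional tabular data preparation methods in capturing intricate relationships within tables and adapting to different task requirements, and systematically explores how language models (LMs), especially large language models (LLMs), fit into and can be applied across the phases of tabular data preparation. The key to the solution is a systematic analysis of the role of LMs in four core phases (data acquisition, integration, cleaning, and transformation) that shows how their capabilities match specific task demands and how LMs can be combined with other components, outlining pipelines for efficient, automated, and scalable tabular data preparation.
Link: https://arxiv.org/abs/2508.01556
Authors: Mengshi Chen,Yuxiang Sun,Tengchao Li,Jianwei Wang,Kai Wang,Xuemin Lin,Ying Zhang,Wenjie Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint under submission, 16 pages, 2 figures, 1 table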
Abstract:Data preparation is a critical step in enhancing the usability of tabular data and thus boosts downstream data-driven tasks. Traditional methods often face challenges in capturing the intricate relationships within tables and adapting to the tasks involved. Recent advances in Language Models (LMs), especially in Large Language Models (LLMs), offer new opportunities to automate and support tabular data preparation. However, why LMs suit tabular data preparation (i.e., how their capabilities match task demands) and how to use them effectively across phases still remain to be systematically explored. In this survey, we systematically analyze the role of LMs in enhancing tabular data preparation processes, focusing on four core phases: data acquisition, integration, cleaning, and transformation. For each phase, we present an integrated analysis of how LMs can be combined with other components for different preparation tasks, highlight key advancements, and outline prospective pipelines.
[AI-104] Understanding Why ChatGPT Outperforms Humans in Visualization Design Advice
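【Quick read】: This paper addresses the question of why recent generative AI models outperform humans on data visualization knowledge tasks. The key to the solution is a systematic comparative analysis that reveals how ChatGPT-4 differs from humans and ChatGPT-3.5 in rhetorical structure, knowledge breadth, and perceptual quality. It finds that ChatGPT-4 shows a hybrid of characteristics from humans and the older model, and that its strengths in coverage and breadth, together with its emphasis on technical and task-oriented feedback, shape its higher overall quality. These findings offer insight for improving LLM-based user experiences and point to potential synergy between human perception and AI capabilities.
Link: https://arxiv.org/abs/2508.01547
Authors: Yongsu Ahn,Nam Wook Kim
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: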
Abstract:This paper investigates why recent generative AI models outperform humans in data visualization knowledge tasks. Through systematic comparative analysis of responses to visualization questions, we find that differences exist between two ChatGPT models and human outputs over rhetorical structure, knowledge breadth, and perceptual quality. Our findings reveal that ChatGPT-4, as a more advanced model, displays a hybrid of characteristics from both humans and ChatGPT-3.5. The two models were generally favored over human responses, while their strengths in coverage and breadth, and emphasis on technical and task-oriented visualization feedback collectively shaped higher overall quality. Based on our findings, we draw implications for advancing user experiences based on the potential of LLMs and human perception over their capabilities, with relevance to broader applications of AI.
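[AI-105] Getting out of the Big-Muddy: Escalation of Commitment in LLMs
【Quick read】: This paper examines whether and how large language models (LLMs) exhibit human cognitive biases, specifically escalation of commitment, in high-stakes decision-making. It finds that LLMs do not exhibit this bias inherently; their biased behavior depends heavily on social and organizational context. In individual decision-making (N = 4,000) the models reason rationally and rarely escalate, whereas in multi-agent deliberation (N = 500) and under compound pressure (N = 2,000) escalation is pronounced: symmetrical peer-based decision-making produces near-universal escalation (99.2%), and under compound pressure the average allocation to failing divisions reaches 68.95%. The key to a solution therefore lies in identifying and controlling the social structures and environmental conditions that trigger the bias, rather than in optimizing the model alone.
Link: https://arxiv.org/abs/2508.01545
Authors: Emilio Barkett,Olivia Long,Paul Kröger
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: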
Abstract:Large Language Models (LLMs) are increasingly deployed in autonomous decision-making roles across high-stakes domains. However, since models are trained on human-generated data, they may inherit cognitive biases that systematically distort human judgment, including escalation of commitment, where decision-makers continue investing in failing courses of action due to prior investment. Understanding when LLMs exhibit such biases presents a unique challenge. While these biases are well-documented in humans, it remains unclear whether they manifest consistently in LLMs or require specific triggering conditions. This paper investigates this question using a two-stage investment task across four experimental conditions: model as investor, model as advisor, multi-agent deliberation, and compound pressure scenario. Across N = 6,500 trials, we find that bias manifestation in LLMs is highly context-dependent. In individual decision-making contexts (Studies 1-2, N = 4,000), LLMs demonstrate strong rational cost-benefit logic with minimal escalation of commitment. However, multi-agent deliberation reveals a striking hierarchy effect (Study 3, N = 500): while asymmetrical hierarchies show moderate escalation rates (46.2%), symmetrical peer-based decision-making produces near-universal escalation (99.2%). Similarly, when subjected to compound organizational and personal pressures (Study 4, N = 2,000), models exhibit high degrees of escalation of commitment (68.95% average allocation to failing divisions). These findings reveal that LLM bias manifestation depends critically on social and organizational context rather than being inherent, with significant implications for the deployment of multi-agent systems and unsupervised operations where such conditions may emerge naturally.
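[AI-106] Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning
【Quick read】: This paper targets the dependence of preference-based fine-tuning of large language models (LLMs) on high-quality training data, and in particular the bottleneck that human feedback is costly to collect and hard to scale. The key to its solution is Refine-n-Judge, an automated iterative method that uses a single LLM as both "refiner" and "judge", alternating between refining a response and evaluating the refinement to raise data quality step by step. In each iteration the LLM generates a new response and explicitly judges whether it is better than the previous version, stopping when no further improvement is found. The result is a sequence of preference-labeled responses of increasing quality, giving efficient and scalable data enhancement without additional human annotation or a separate reward model.
Link: https://arxiv.org/abs/2508.01543
Authors: Derin Cayir,Renjie Tao,Rashi Rungta,Kai Sun,Sean Chen,Haidar Khan,Minseok Kim,Julia Reinspach,Yue Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: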
Abstract:Large Language Models (LLMs) have demonstrated remarkable progress through preference-based fine-tuning, which critically depends on the quality of the underlying training data. While human feedback is essential for improving data quality, it is costly and does not scale well. In this paper, we introduce Refine-n-Judge, an automated iterative approach that leverages a single LLM as both a refiner and a judge to enhance dataset quality. Unlike existing iterative refinement methods, Refine-n-Judge employs an LLM to both generate refinements and explicitly evaluate each improvement, ensuring that every iteration meaningfully enhances the dataset without requiring additional human annotation or a separate reward model. At each step, the LLM refines a response and judges whether the refinement is an improvement over the previous answer. This process continues until the LLM prefers the initial answer over the refinement, indicating no further improvements. This produces sequences of increasing quality, preference-labeled responses ideal for fine-tuning. We demonstrate the effectiveness of Refine-n-Judge across a range of public datasets spanning five corpora, targeting tasks such as coding, math, and conversation. Models (Llama 3.1-8B and Llama 3.3-70B) fine-tuned on Refine-n-Judge-enhanced datasets were preferred by LLM judges in over 74% of comparisons against models tuned on the original dataset by GPT-4. Additionally, we report performance gains: +5% on AlpacaEval and AlpacaEval 2.0, and +19% on MT-Bench. Our results indicate that Refine-n-Judge produces high-quality datasets and scalable model improvements.
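The refine-then-judge loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `refine` and `prefers_new` are hypothetical callables standing in for the two LLM calls, `max_steps` is an added safety cap, and the toy stand-ins at the bottom exist only so the sketch runs.

```python
from typing import Callable, List, Tuple


def refine_n_judge(
    prompt: str,
    initial: str,
    refine: Callable[[str, str], str],             # hypothetical: LLM acting as refiner
    prefers_new: Callable[[str, str, str], bool],  # hypothetical: same LLM acting as judge
    max_steps: int = 10,                           # assumed safety cap, not from the abstract
) -> List[str]:
    """Return a chain of responses, each judged better than the one before it."""
    chain = [initial]
    for _ in range(max_steps):
        candidate = refine(prompt, chain[-1])
        # Stop as soon as the judge prefers the previous answer.
        if not prefers_new(prompt, chain[-1], candidate):
            break
        chain.append(candidate)
    return chain


def preference_pairs(chain: List[str]) -> List[Tuple[str, str]]:
    """Turn a chain into (chosen, rejected) pairs of adjacent responses."""
    return [(chain[i + 1], chain[i]) for i in range(len(chain) - 1)]


# Toy stand-ins so the sketch is runnable without any model.
def toy_refine(prompt: str, answer: str) -> str:
    return answer + "!"


def toy_judge(prompt: str, old: str, new: str) -> bool:
    return len(new) <= 5


chain = refine_n_judge("q", "ok", toy_refine, toy_judge)
pairs = preference_pairs(chain)
```

Pairing only adjacent responses is one possible design choice; since the chain is ordered by judged quality, later responses could also be paired against all earlier ones.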
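[AI-107] Leveraging Machine Learning for Botnet Attack Detection in Edge-Computing Assisted IoT Networks
【Quick read】: This paper addresses the threat of botnet attacks in edge-computing-assisted Internet of Things (IoT) environments, where device counts are surging and security vulnerabilities are common. Its key approach is to train three advanced ensemble learning algorithms (Random Forest, XGBoost, and LightGBM) on large-scale IoT network traffic data to detect and classify botnet activity. The study evaluates the accuracy of these models and also examines the feasibility of deploying them on resource-constrained edge and IoT devices, supporting the practical value of machine learning for strengthening IoT network security.
Link: https://arxiv.org/abs/2508.01542
Authors: Dulana Rupanetti,Naima Kaabouch
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: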
Abstract:The increase of IoT devices, driven by advancements in hardware technologies, has led to widespread deployment in large-scale networks that process massive amounts of data daily. However, the reliance on Edge Computing to manage these devices has introduced significant security vulnerabilities, as attackers can compromise entire networks by targeting a single IoT device. In light of escalating cybersecurity threats, particularly botnet attacks, this paper investigates the application of machine learning techniques to enhance security in Edge-Computing-Assisted IoT environments. Specifically, it presents a comparative analysis of Random Forest, XGBoost, and LightGBM – three advanced ensemble learning algorithms – to address the dynamic and complex nature of botnet threats. Utilizing a widely recognized IoT network traffic dataset comprising benign and malicious instances, the models were trained, tested, and evaluated for their accuracy in detecting and classifying botnet activities. Furthermore, the study explores the feasibility of deploying these models in resource-constrained edge and IoT devices, demonstrating their practical applicability in real-world scenarios. The results highlight the potential of machine learning to fortify IoT networks against emerging cybersecurity challenges.
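[AI-108] Revisiting Gossip Protocols: A Vision for Emergent Coordination in Agentic Multi-Agent Systems
【Quick read】: This paper addresses the insufficient coordination flexibility caused by static roles and fixed toolchains as multi-agent systems scale: without communication mechanisms that support emergent, swarm-like intelligence, distributed agents struggle to continuously learn, adapt, and build collective cognition. The key proposal is to introduce **gossip protocols** as a complementary layer alongside structured communication protocols, using their long-proven fault tolerance and decentralization in distributed systems to enable context-rich, adaptive knowledge dissemination. The paper also identifies the challenges that must be tackled (semantic filtering, staleness, trustworthiness, and consistency) on the way to more resilient, self-organizing multi-agent systems.
Link: https://arxiv.org/abs/2508.01531
Authors: Mansura Habiba,Nafiul I. Khan
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: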
Abstract:As agentic platforms scale, agents are evolving beyond static roles and fixed toolchains, creating a growing need for flexible, decentralized coordination. Today’s structured communication protocols (e.g., direct agent-to-agent messaging) excel at reliability and task delegation, but they fall short in enabling emergent, swarm-like intelligence, where distributed agents continuously learn, adapt, and communicate to form collective cognition. This paper revisits gossip protocols, long valued in distributed systems for their fault tolerance and decentralization, and argues that they offer a missing layer for context-rich, adaptive communication in agentic AI. Gossip enables scalable, low-overhead dissemination of shared knowledge, but also raises unresolved challenges around semantic filtering, staleness, trustworthiness, and consistency in high-stakes environments. Rather than proposing a new framework, this work charts a research agenda for integrating gossip as a complementary substrate alongside structured protocols. We identify critical gaps in current agent-to-agent architectures, highlight where gossip could reshape assumptions about coordination, and outline open questions around intent propagation, knowledge decay, and peer-to-peer trust. Gossip is not a silver bullet, but overlooking it risks missing a key path toward resilient, reflexive, and self-organizing multi-agent systems.
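To make the "scalable, low-overhead dissemination" claim concrete, here is a minimal sketch of classic push gossip. The paper proposes a research agenda rather than an algorithm, so everything below is an assumption for illustration: each informed agent forwards the item to `fanout` randomly chosen peers per round, and the function records how many agents know the item after each round.

```python
import random
from typing import List


def push_gossip(num_agents: int, fanout: int = 2, seed: int = 0,
                max_rounds: int = 100) -> List[int]:
    """Simulate push gossip; return the informed-agent count after each round."""
    rng = random.Random(seed)
    informed = {0}          # agent 0 starts with the piece of knowledge
    history = [len(informed)]
    rounds = 0
    while len(informed) < num_agents and rounds < max_rounds:
        newly_informed = set()
        for _agent in informed:
            # Each informed agent pushes to `fanout` random peers
            # (a peer may already be informed, or be the sender itself).
            newly_informed.update(rng.sample(range(num_agents), fanout))
        informed |= newly_informed
        rounds += 1
        history.append(len(informed))
    return history


history = push_gossip(64)
```

No agent needs a global view or a coordinator, which is the decentralization property the abstract refers to; the open questions it lists (staleness, trust, semantic filtering) are not modeled here.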
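[AI-109] Decentralized Aerial Manipulation of a Cable-Suspended Load using Multi-Agent Reinforcement Learning
【Quick read】: This paper addresses decentralized control of multiple Micro-Aerial Vehicles (MAVs) performing real-world six-degree-of-freedom (6-DoF) manipulation of a cable-suspended load. Conventional methods rely on centralized control architectures that need global state information, inter-MAV communication, and neighboring-agent information, which limits scalability and real-time performance. The key elements of the solution are: 1) multi-agent reinforcement learning (MARL) is used to train an outer-loop control policy for each MAV, with agents communicating implicitly through load pose observations alone, without explicit information exchange; 2) a new action space based on linear acceleration and body rates, combined with a robust low-level controller, handles the uncertainty caused by cable tension during dynamic 3D motion and enables reliable sim-to-real transfer. The method significantly reduces compute cost at inference time, supports onboard deployment, tolerates the in-flight loss of an MAV, and allows cooperation among heterogeneous policies.
Link: https://arxiv.org/abs/2508.01522
Authors: Jack Zeng,Andreu Matoses Gimenez,Eugene Vinitsky,Javier Alonso-Mora,Sihao Sun
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: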
Abstract:This paper presents the first decentralized method to enable real-world 6-DoF manipulation of a cable-suspended load using a team of Micro-Aerial Vehicles (MAVs). Our method leverages multi-agent reinforcement learning (MARL) to train an outer-loop control policy for each MAV. Unlike state-of-the-art controllers that utilize a centralized scheme, our policy does not require global states, inter-MAV communications, nor neighboring MAV information. Instead, agents communicate implicitly through load pose observations alone, which enables high scalability and flexibility. It also significantly reduces computing costs during inference time, enabling onboard deployment of the policy. In addition, we introduce a new action space design for the MAVs using linear acceleration and body rates. This choice, combined with a robust low-level controller, enables reliable sim-to-real transfer despite significant uncertainties caused by cable tension during dynamic 3D motion. We validate our method in various real-world experiments, including full-pose control under load model uncertainties, showing setpoint tracking performance comparable to the state-of-the-art centralized method. We also demonstrate cooperation amongst agents with heterogeneous control policies, and robustness to the complete in-flight loss of one MAV. Videos of experiments: this https URL
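[AI-110] The Vanishing Gradient Problem for Stiff Neural Differential Equations
【Quick read】: This paper addresses the vanishing gradient problem in parameterized dynamical systems such as neural differential equations under stiff conditions: during training, sensitivities to parameters controlling fast-decaying modes tend to zero, making optimization difficult. The key finding is that this is not a defect of any particular numerical integrator but an inherent property of all A-stable and L-stable stiff integration schemes. By analyzing the rational stability function of general stiff integration schemes and its derivative, the paper explains how parameter sensitivities decay to zero as stiffness grows, and rigorously proves that the slowest possible decay rate of the derivative of the stability function is $O(|z|^{-1})$. Consequently, all A-stable time-stepping methods inevitably suppress parameter gradients in stiff regimes, a fundamental barrier to training and parameter identification.
Link: https://arxiv.org/abs/2508.01519
Authors: Colby Fronk,Linda Petzold
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Numerical Analysis (math.NA)
Comments: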
Abstract:Gradient-based optimization of neural differential equations and other parameterized dynamical systems fundamentally relies on the ability to differentiate numerical solutions with respect to model parameters. In stiff systems, it has been observed that sensitivities to parameters controlling fast-decaying modes become vanishingly small during training, leading to optimization difficulties. In this paper, we show that this vanishing gradient phenomenon is not an artifact of any particular method, but a universal feature of all A-stable and L-stable stiff numerical integration schemes. We analyze the rational stability function for general stiff integration schemes and demonstrate that the relevant parameter sensitivities, governed by the derivative of the stability function, decay to zero for large stiffness. Explicit formulas for common stiff integration schemes are provided, which illustrate the mechanism in detail. Finally, we rigorously prove that the slowest possible rate of decay for the derivative of the stability function is $O(|z|^{-1})$, revealing a fundamental limitation: all A-stable time-stepping methods inevitably suppress parameter gradients in stiff regimes, posing a significant barrier for training and parameter identification in stiff neural ODEs.
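The mechanism can be checked by hand for one common stiff scheme. The sketch below assumes the scalar test equation y' = λy and backward Euler, whose stability function is R(z) = 1/(1 - z) with z = hλ; these choices are mine for illustration, not taken from the paper. One step gives y1 = R(hλ)·y0, so the sensitivity is dy1/dλ = h·R'(hλ)·y0 = h·y0/(1 - hλ)², which shrinks as λ becomes more negative.

```python
def backward_euler_step(y0: float, lam: float, h: float) -> float:
    """One backward Euler step for y' = lam * y: y1 = y0 / (1 - h*lam)."""
    return y0 / (1.0 - h * lam)


def sensitivity_exact(y0: float, lam: float, h: float) -> float:
    """Analytic d(y1)/d(lam) = h * R'(h*lam) * y0 with R'(z) = 1 / (1 - z)**2."""
    return h * y0 / (1.0 - h * lam) ** 2


def sensitivity_fd(y0: float, lam: float, h: float, eps: float = 1e-6) -> float:
    """Central finite-difference check of the same derivative."""
    d = eps * max(1.0, abs(lam))
    upper = backward_euler_step(y0, lam + d, h)
    lower = backward_euler_step(y0, lam - d, h)
    return (upper - lower) / (2.0 * d)


# Increasing stiffness: lam = -1, -1e2, -1e4, -1e6 with a fixed step h = 0.1.
lams = [-1.0, -1e2, -1e4, -1e6]
sensitivities = [sensitivity_exact(1.0, lam, 0.1) for lam in lams]
```

For backward Euler the derivative of the stability function decays like |z|^-2, which is faster than the slowest possible rate of |z|^-1 stated in the abstract.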
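[AI-111] FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models
【Quick read】: This paper addresses the peak inference memory bottleneck that current singular value decomposition (SVD)-based compression methods for large language models (LLMs) face in real deployments. Although SVD can cut parameters by 20-80% with minimal accuracy loss, existing methods overlook the activation memory overhead incurred at inference when truncated factors are applied. This overhead grows with sequence length and hidden dimension, so overall peak memory is not reduced, which limits use on resource-constrained devices. The key to the solution is FlashSVD, an end-to-end, rank-aware streaming inference framework: low-rank projection kernels are fused directly into the self-attention and feed-forward network (FFN) pipelines, avoiding full-size activation buffers, while small tiles of the truncated factors are loaded into on-chip SRAM, computed on the fly, and immediately evicted. This cuts peak activation memory (by up to 70.2%) and intermediate transient memory (by up to 75%) without extra latency or accuracy loss.
Link: https://arxiv.org/abs/2508.01506
Authors: Zishan Shao,Yixiao Wang,Qinsi Wang,Ting Jiang,Zhixu Du,Hancheng Ye,Danyang Zhuo,Yiran Chen,Hai Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: Technical Report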
Abstract:Singular Value Decomposition (SVD) has recently seen a surge of interest as a simple yet powerful tool for large language model (LLM) compression, with a growing number of works demonstrating 20-80% parameter reductions at minimal accuracy loss. Previous SVD-based approaches have focused primarily on reducing the memory footprint of model weights, largely overlooking the additional activation memory overhead incurred during inference when applying truncated factors via standard dense CUDA kernels. Our experiments demonstrate that this activation overhead, scaling with sequence length and hidden dimension, prevents current SVD compression techniques from achieving any reduction in peak inference memory, thereby limiting their viability for real-world, on-device deployments. We introduce FlashSVD, a novel, end-to-end rank-aware streaming inference framework specifically designed for SVD-compressed large language models. FlashSVD can be seamlessly integrated with any model that employs SVD-based methods for parameter reduction. By fusing low-rank projection kernels directly into both the self-attention and feed-forward network (FFN) pipelines, FlashSVD avoids materializing full-size activation buffers. Instead, small tiles of the truncated factors are loaded into on-chip SRAM, multiplied and reduced on the fly, and immediately evicted, preserving high GPU occupancy and adding no extra latency. On standard encoder benchmarks (e.g., BERT-Base), FlashSVD cuts peak activation memory by up to 70.2% and intermediate transient memory by 75%, all while incurring no accuracy loss with upstream compression methods, offering a practical path toward memory-constrained deployment of low-rank LLMs.
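The idea of not materializing a full-size activation buffer can be illustrated in plain Python. This is only a sketch of the tiling arithmetic under my own assumptions, not the FlashSVD CUDA kernels: a feed-forward block whose two weight matrices are stored as low-rank factors (`u1 @ v1` and `u2 @ v2`, hypothetical names), with a ReLU in between. The streamed version walks over the hidden dimension in tiles and accumulates into a small rank-sized buffer, so the full `seq x d_ff` hidden activation never exists at once.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def ffn_materialized(x, u1, v1, u2, v2):
    """Reference: builds the full seq x d_ff hidden buffer."""
    hidden = relu(matmul(matmul(x, u1), v1))
    return matmul(matmul(hidden, u2), v2)


def ffn_streamed(x, u1, v1, u2, v2, tile=4):
    """Same result, but only a seq x tile slice of the hidden layer is live."""
    low = matmul(x, u1)                          # seq x r, small
    rank_out = len(u2[0])
    acc = [[0.0] * rank_out for _ in x]          # seq x r accumulator
    d_ff = len(v1[0])
    for start in range(0, d_ff, tile):
        stop = min(start + tile, d_ff)
        v1_tile = [row[start:stop] for row in v1]        # r x tile
        hidden_tile = relu(matmul(low, v1_tile))         # seq x tile
        partial = matmul(hidden_tile, u2[start:stop])    # seq x r
        for i, row in enumerate(partial):
            for j, value in enumerate(row):
                acc[i][j] += value
    return matmul(acc, v2)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq, d, r, d_ff = 3, 4, 2, 10
x = random_matrix(seq, d, rng)
u1, v1 = random_matrix(d, r, rng), random_matrix(r, d_ff, rng)
u2, v2 = random_matrix(d_ff, r, rng), random_matrix(r, d, rng)
dense = ffn_materialized(x, u1, v1, u2, v2)
streamed = ffn_streamed(x, u1, v1, u2, v2, tile=4)
max_diff = max(abs(a - b) for ra, rb in zip(dense, streamed)
               for a, b in zip(ra, rb))
```

The tiling works because the elementwise activation acts on each hidden column independently, so the product with the second factor can be summed tile by tile.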
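[AI-112] ShrutiSense: Microtonal Modeling and Correction in Indian Classical Music
【Quick read】: This paper addresses the absence of the microtonal system (22 shrutis) and raga-specific grammars of Indian classical music from symbolic music processing; existing tools cannot distinguish microtonal differences and lack culturally specific melodic rules, which distorts the music's expression. The key to the solution is ShrutiSense, a system built on two complementary models: a Shruti-aware finite-state transducer (FST) that performs contextual pitch correction within the 22-shruti framework, and a grammar-constrained Shruti hidden Markov model (GC-SHMM) that uses raga-specific constraints for contextual completion of missing melodic segments. On simulated data across five ragas, the system reaches 91.3% shruti classification accuracy and remains stable under pitch noise of up to ±50 cents, preserving the cultural authenticity of Indian classical music.
Link: https://arxiv.org/abs/2508.01498
Authors: Rajarshi Ghosh,Jayanth Athipatla
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: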
Abstract:Indian classical music relies on a sophisticated microtonal system of 22 shrutis (pitch intervals), which provides expressive nuance beyond the 12-tone equal temperament system. Existing symbolic music processing tools fail to account for these microtonal distinctions and culturally specific raga grammars that govern melodic movement. We present ShrutiSense, a comprehensive symbolic pitch processing system designed for Indian classical music, addressing two critical tasks: (1) correcting westernized or corrupted pitch sequences, and (2) completing melodic sequences with missing values. Our approach employs complementary models for different tasks: a Shruti-aware finite-state transducer (FST) that performs contextual corrections within the 22-shruti framework and a grammar-constrained Shruti hidden Markov model (GC-SHMM) that incorporates raga-specific transition rules for contextual completions. Comprehensive evaluation on simulated data across five ragas demonstrates that ShrutiSense (FST model) achieves 91.3% shruti classification accuracy for correction tasks, with example sequences showing 86.7-90.0% accuracy at corruption levels of 0.2 to 0.4. The system exhibits robust performance under pitch noise up to +/-50 cents, maintaining consistent accuracy across ragas (90.7-91.8%), thus preserving the cultural authenticity of Indian classical music expression.
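A short calculation shows why 12-tone equal temperament loses shruti distinctions. The sketch assumes two just-intonation ratios that are commonly cited for adjacent shrutis, 256/243 and 16/15; these ratios are my assumption for illustration and do not come from the abstract. Both land on the same 12-tone semitone once rounded to the nearest 100 cents.

```python
import math


def cents(ratio: float) -> float:
    """Interval size in cents: 1200 * log2(frequency ratio)."""
    return 1200.0 * math.log2(ratio)


def nearest_12tet_step(ratio: float) -> int:
    """Index of the nearest 12-tone equal-temperament semitone (100 cents each)."""
    return round(cents(ratio) / 100.0)


# Assumed ratios for two neighbouring shrutis (illustrative only).
low_shruti = 256 / 243     # about 90.2 cents above the tonic
high_shruti = 16 / 15      # about 111.7 cents above the tonic

gap_in_cents = cents(high_shruti) - cents(low_shruti)
same_western_note = nearest_12tet_step(low_shruti) == nearest_12tet_step(high_shruti)
```

The gap between the two is about 21.5 cents, well inside the ±50 cent noise range mentioned in the abstract, which is why correction needs context rather than nearest-pitch rounding alone.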
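[AI-113] WinkTPG: An Execution Framework for Multi-Agent Path Finding Using Temporal Reasoning
【Quick read】: This paper addresses the problem that in large-scale Multi-Agent Path Finding (MAPF), standard algorithms rely on simplified kinodynamic models, so the generated plans cannot be followed directly by the agents. The core challenge is turning a discrete MAPF plan into executable speed profiles that satisfy real motion constraints while preserving collision-freeness and reducing uncertainty. The key to the solution is kinodynamic Temporal Plan Graph Planning (kTPG), a speed optimization method based on the temporal plan graph that efficiently refines a MAPF plan into a kinodynamically feasible one, and its windowed variant (Windowed kTPG, WinkTPG), which incorporates updated agent information during execution to improve solution quality and real-time performance. Experiments show that WinkTPG can generate speed profiles for up to 1,000 agents in 1 second and improves solution quality by up to 51.7% over existing methods.
Link: https://arxiv.org/abs/2508.01495
Authors: Jingtian Yan,Stephen F. Smith,Jiaoyang Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: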
Abstract:Planning collision-free paths for a large group of agents is a challenging problem with numerous real-world applications. While recent advances in Multi-Agent Path Finding (MAPF) have shown promising progress, standard MAPF algorithms rely on simplified kinodynamic models, preventing agents from directly following the generated MAPF plan. To bridge this gap, we propose kinodynamic Temporal Plan Graph Planning (kTPG), a multi-agent speed optimization algorithm that efficiently refines a MAPF plan into a kinodynamically feasible plan while accounting for uncertainties and preserving collision-freeness. Building on kTPG, we propose Windowed kTPG (WinkTPG), a MAPF execution framework that incrementally refines MAPF plans using a window-based mechanism, dynamically incorporating agent information during execution to reduce uncertainty. Experiments show that WinkTPG can generate speed profiles for up to 1,000 agents in 1 second and improves solution quality by up to 51.7% over existing MAPF execution methods.
[AI-114] Translation-Equivariant Self-Supervised Learning for Pitch Estimation with Optimal Transport
【Quick read】: This paper addresses the limited training stability and theoretical grounding of self-supervised models for single pitch estimation. The key to its solution is an objective based on Optimal Transport for learning one-dimensional translation-equivariant systems, giving a theoretically grounded, more numerically stable, and simpler alternative for training state-of-the-art self-supervised pitch estimators.
Link: https://arxiv.org/abs/2508.01493
Authors: Bernardo Torres,Alain Riou,Gaël Richard,Geoffroy Peeters
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Extended Abstracts for the Late-Breaking Demo Session of the 26th International Society for Music Information Retrieval Conference
Abstract:In this paper, we propose an Optimal Transport objective for learning one-dimensional translation-equivariant systems and demonstrate its applicability to single pitch estimation. Our method provides a theoretically grounded, more numerically stable, and simpler alternative for training state-of-the-art self-supervised pitch estimators.
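The abstract does not spell out the objective, so the sketch below only illustrates why an Optimal Transport distance suits one-dimensional translation problems. It assumes two histograms on the same unit-spaced bins and uses the standard closed form for the 1-D Wasserstein-1 distance, the sum of absolute differences between the two cumulative distributions. Shifting a distribution by k bins then costs exactly k, so the distance reads off the translation directly.

```python
from itertools import accumulate
from typing import List


def wasserstein_1d(p: List[float], q: List[float]) -> float:
    """W1 between two histograms on the same unit-spaced bins (closed form via CDFs)."""
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def shift_right(p: List[float], k: int) -> List[float]:
    """Translate a histogram by k bins, filling with zeros (mass near the end is dropped)."""
    return [0.0] * k + p[: len(p) - k]


pitch_distribution = [0.0, 0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0]
shifted = shift_right(pitch_distribution, 3)
distance = wasserstein_1d(pitch_distribution, shifted)
```

A bin-wise loss such as squared error stops growing once the two distributions no longer overlap, whereas this distance keeps increasing with the size of the shift.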
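[AI-115] PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant Objective
【Quick read】: This paper addresses the reliance of single-pitch estimation (SPE) on large amounts of annotated data by proposing a self-supervised method that needs no manual labels. The key is PESTO, a neural network with a Siamese architecture that takes Variable-Q Transform (VQT) frames as input and achieves translation equivariance through a Toeplitz fully-connected layer, learning pitch distributions without labels. The authors also construct pitch-shifted pairs and introduce a class-based transposition-equivariant objective, which improves performance and cross-dataset generalization. PESTO is lightweight (only 130k parameters), low-latency (under 10 ms), and streamable in real time, making it suitable for practical applications.
Link: https://arxiv.org/abs/2508.01488
Authors: Alain Riou,Bernardo Torres,Ben Hayes,Stefan Lattner,Gaëtan Hadjeres,Gaël Richard,Geoffroy Peeters
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the Transactions of the International Society for Music Information Retrieval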
Abstract:In this paper, we introduce PESTO, a self-supervised learning approach for single-pitch estimation using a Siamese architecture. Our model processes individual frames of a Variable-Q Transform (VQT) and predicts pitch distributions. The neural network is designed to be equivariant to translations, notably thanks to a Toeplitz fully-connected layer. In addition, we construct pitch-shifted pairs by translating and cropping the VQT frames and train our model with a novel class-based transposition-equivariant objective, eliminating the need for annotated data. Thanks to this architecture and training objective, our model achieves remarkable performances while being very lightweight (130k parameters). Evaluations on music and speech datasets (MIR-1K, MDB-stem-synth, and PTDB) demonstrate that PESTO not only outperforms self-supervised baselines but also competes with supervised methods, exhibiting superior cross-dataset generalization. Finally, we enhance PESTO’s practical utility by developing a streamable VQT implementation using cached convolutions. Combined with our model’s low latency (less than 10 ms) and minimal parameter count, this makes PESTO particularly suitable for real-time applications.
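Why a Toeplitz layer is translation-equivariant can be seen in a tiny example. The sketch assumes a square Toeplitz weight matrix defined by a single kernel of length 2n - 1, so that entry (i, j) depends only on i - j; the kernel values and the input are arbitrary, and this is not PESTO's actual layer. Shifting the input by k bins shifts the output by k bins, as long as no input content falls off the edge.

```python
from typing import List


def toeplitz_apply(kernel: List[int], x: List[int]) -> List[int]:
    """Apply the n x n Toeplitz matrix W[i][j] = kernel[i - j + n - 1] to x."""
    n = len(x)
    assert len(kernel) == 2 * n - 1
    return [sum(kernel[i - j + n - 1] * x[j] for j in range(n)) for i in range(n)]


def shift_right(x: List[int], k: int) -> List[int]:
    """Translate a frame by k bins with zero fill."""
    return [0] * k + x[: len(x) - k]


n, k = 6, 2
kernel = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5]   # arbitrary weights, length 2n - 1
frame = [1, 2, 3, 0, 0, 0]                        # trailing zeros so the shift loses nothing
output = toeplitz_apply(kernel, frame)
output_of_shifted = toeplitz_apply(kernel, shift_right(frame, k))
```

Here `output_of_shifted[k:]` equals `output[:n - k]`; only the first k positions differ, because they receive values that would sit at negative indices of the unshifted output.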
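[AI-116] Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler
【Quick read】: This paper addresses the poorly understood cooldown stage (the final annealing phase) of learning rate scheduling in transformer training, in particular the lack of an explanation for the drop in loss it produces. The key is a systematic analysis of the cooldown stage in the Warmup-Stable-Decay (WSD) scheduler. It shows that different cooldown shapes induce a bias-variance trade-off in the resulting models and that shapes balancing exploration and exploitation perform best; it finds that higher values of the AdamW β₂ parameter during cooldown give consistent improvements; and it uses loss landscape visualizations to support the "river valley" perspective, yielding practical recommendations for configuring the WSD scheduler.
Link: https://arxiv.org/abs/2508.01483
Authors: Aleksandr Dremov,Alexander Hägele,Atli Kosson,Martin Jaggi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in TMLR. Review: this https URL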
Abstract:Learning rate scheduling is essential in transformer training, where the final annealing plays a crucial role in getting the best performance. However, the mechanisms behind this cooldown phase, with its characteristic drop in loss, remain poorly understood. To address this, we provide a comprehensive analysis focusing solely on the cooldown phase in the Warmup-Stable-Decay (WSD) learning rate scheduler. Our analysis reveals that different cooldown shapes reveal a fundamental bias-variance trade-off in the resulting models, with shapes that balance exploration and exploitation consistently outperforming alternatives. Similarly, we find substantial performance variations – comparable to those from cooldown shape selection – when tuning AdamW hyperparameters. Notably, we observe consistent improvements with higher values of $\beta_2$ during cooldown. From a loss landscape perspective, we provide visualizations of the landscape during cooldown, supporting the river valley loss perspective empirically. These findings offer practical recommendations for configuring the WSD scheduler in transformer training, emphasizing the importance of optimizing the cooldown phase alongside traditional hyperparameter tuning.
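For readers unfamiliar with the scheduler, a Warmup-Stable-Decay schedule can be written as a small function of the step. This is a generic sketch under stated assumptions: linear warmup, a constant plateau, and a cooldown over the last `cooldown_steps` steps. The two cooldown shapes shown ("linear" and "sqrt") are illustrative choices of mine, not a claim about which shapes the paper studies or recommends.

```python
import math


def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_steps: int, cooldown_steps: int, shape: str = "linear") -> float:
    """Learning rate at `step` for a Warmup-Stable-Decay schedule."""
    if step < warmup_steps:
        # Linear warmup up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        # Stable phase: constant learning rate.
        return peak_lr
    progress = (step - cooldown_start) / cooldown_steps   # in [0, 1)
    if shape == "linear":
        factor = 1.0 - progress
    elif shape == "sqrt":
        factor = 1.0 - math.sqrt(progress)   # drops quickly at first, then flattens
    else:
        raise ValueError(f"unknown cooldown shape: {shape}")
    return peak_lr * factor


schedule = [wsd_lr(s, 100, 1.0, 10, 20) for s in range(100)]
```

Because the stable phase does not depend on the total step count, a run can be extended or cooled down from any checkpoint, which is part of what makes the cooldown shape a separate tuning decision.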
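[AI-117] Reconstructing Trust Embeddings from Siamese Trust Scores: A Direct-Sum Approach with Fixed-Point Semantics
【Quick read】: This paper addresses the inverse problem of reconstructing high-dimensional trust embeddings from the one-dimensional similarity scores (Siamese trust scores) exposed by distributed-security frameworks. The core challenge is to recover the original high-dimensional geometry among devices from the time-stamped similarity score series that two independent agents publish for the same set of devices. The key to the solution is an explicit direct-sum estimator that concatenates the paired score series with four moment features, together with a proof via the Banach fixed-point theorem that the reconstruction map has a unique fixed point under a contraction condition, which guarantees the stability and uniqueness of the reconstruction.
Link: https://arxiv.org/abs/2508.01479
Authors: Faruk Alpay,Taylan Alpay,Bugra Kilictas
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 22 pages, 3 figures, 1 table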
Abstract:We study the inverse problem of reconstructing high-dimensional trust embeddings from the one-dimensional Siamese trust scores that many distributed-security frameworks expose. Starting from two independent agents that publish time-stamped similarity scores for the same set of devices, we formalise the estimation task, derive an explicit direct-sum estimator that concatenates paired score series with four moment features, and prove that the resulting reconstruction map admits a unique fixed point under a contraction argument rooted in Banach theory. A suite of synthetic benchmarks (20 devices x 10 time steps) confirms that, even in the presence of Gaussian noise, the recovered embeddings preserve inter-device geometry as measured by Euclidean and cosine metrics; we complement these experiments with non-asymptotic error bounds that link reconstruction accuracy to score-sequence length. Beyond methodology, the paper demonstrates a practical privacy risk: publishing granular trust scores can leak latent behavioural information about both devices and evaluation models. We therefore discuss counter-measures – score quantisation, calibrated noise, obfuscated embedding spaces – and situate them within wider debates on transparency versus confidentiality in networked AI systems. All datasets, reproduction scripts and extended proofs accompany the submission so that results can be verified without proprietary code.
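The contraction argument can be made tangible with a toy map. The sketch below is not the paper's reconstruction map; it assumes a simple affine map T(v) = 0.5·v + b on vectors, which is a contraction with factor 0.5 in the max norm. Banach's theorem then guarantees a unique fixed point (here 2·b), and plain iteration reaches it from any starting vector.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def fixed_point(T: Callable[[Vector], Vector], start: Vector,
                tol: float = 1e-12, max_iter: int = 1000) -> Tuple[Vector, int]:
    """Iterate v <- T(v) until successive iterates differ by less than tol (max norm)."""
    v = start
    for iteration in range(1, max_iter + 1):
        nxt = T(v)
        if max(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt, iteration
        v = nxt
    raise RuntimeError("no convergence; T may not be a contraction")


offset = [1.0, -2.0, 0.5]   # hypothetical vector b, for illustration only


def toy_map(v: Vector) -> Vector:
    """Affine contraction T(v) = 0.5 * v + b with Lipschitz constant 0.5."""
    return [0.5 * vi + bi for vi, bi in zip(v, offset)]


from_zero, steps_a = fixed_point(toy_map, [0.0, 0.0, 0.0])
from_far, steps_b = fixed_point(toy_map, [100.0, -50.0, 7.0])
```

Both runs end at the same point regardless of the start, which is the uniqueness property the abstract relies on.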
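[AI-118] CARGO: A Co-Optimization Framework for EV Charging and Routing in Goods Delivery Logistics
【Quick read】: This paper addresses the joint optimization of route planning and charging scheduling for electric vehicles (EVs) in urban delivery, the EV-based delivery route planning problem (EDRP). The core challenge is scheduling delivery routes and charging under time-window constraints so as to lower charging cost and improve delivery efficiency. The key to the solution is CARGO, a framework that obtains exact solutions from a mixed integer linear programming (MILP) model and adds a computationally efficient heuristic, which significantly reduces charging cost while completing comparable deliveries (by up to 39% and 22% relative to the baselines Earliest Deadline First (EDF) and Nearest Delivery First (NDF), respectively).
Link: https://arxiv.org/abs/2508.01476
Authors: Arindam Khanda,Anurag Satpathy,Amit Jha,Sajal K. Das
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: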
Abstract:With growing interest in sustainable logistics, electric vehicle (EV)-based deliveries offer a promising alternative for urban distribution. However, EVs face challenges due to their limited battery capacity, requiring careful planning for recharging. This depends on factors such as the charging point (CP) availability, cost, proximity, and vehicles’ state of charge (SoC). We propose CARGO, a framework addressing the EV-based delivery route planning problem (EDRP), which jointly optimizes route planning and charging for deliveries within time windows. After proving the problem’s NP-hardness, we propose a mixed integer linear programming (MILP)-based exact solution and a computationally efficient heuristic method. Using real-world datasets, we evaluate our methods by comparing the heuristic to the MILP solution, and benchmarking it against baseline strategies, Earliest Deadline First (EDF) and Nearest Delivery First (NDF). The results show up to 39% and 22% reductions in the charging cost over EDF and NDF, respectively, while completing comparable deliveries.
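The two baselines are easy to picture as ordering rules. The sketch below is a guess at what their names imply, not the paper's definitions: EDF visits deliveries in order of deadline, NDF repeatedly goes to the closest remaining delivery. Charging, battery state, and time windows are left out entirely, and the delivery records are made-up sample data.

```python
import math
from typing import Dict, List, Tuple

Delivery = Dict[str, object]


def edf_order(deliveries: List[Delivery]) -> List[str]:
    """Earliest Deadline First: sort by deadline only."""
    return [d["name"] for d in sorted(deliveries, key=lambda d: d["deadline"])]


def ndf_order(deliveries: List[Delivery], start: Tuple[float, float]) -> List[str]:
    """Nearest Delivery First: greedily visit the closest remaining stop."""
    remaining = list(deliveries)
    position = start
    order = []
    while remaining:
        nearest = min(remaining, key=lambda d: math.dist(position, d["loc"]))
        order.append(nearest["name"])
        position = nearest["loc"]
        remaining.remove(nearest)
    return order


# Hypothetical sample data: location on a plane and a deadline in minutes.
sample = [
    {"name": "A", "loc": (5.0, 0.0), "deadline": 10},
    {"name": "B", "loc": (1.0, 0.0), "deadline": 30},
    {"name": "C", "loc": (2.0, 0.0), "deadline": 20},
]
by_deadline = edf_order(sample)
by_distance = ndf_order(sample, start=(0.0, 0.0))
```

On this sample the two rules give opposite routes (A, C, B versus B, C, A), and neither considers where or when to charge, which is the gap a joint routing-and-charging formulation targets.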
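[AI-119] R2-CoD: Understanding Text-Graph Complementarity in Relational Reasoning via Knowledge Co-Distillation
【Quick read】: This paper addresses the insufficient joint use of text and graph-structured information in relational reasoning tasks in natural language processing (NLP), and in particular the lack of a systematic understanding of how the two interact. The key to its approach is a unified architecture that supports knowledge co-distillation (CoD): by tracking how text and graph representations evolve during training, it reveals patterns of alignment and divergence across tasks and offers interpretable insight into how to integrate text and graph information effectively in hybrid models.
Link: https://arxiv.org/abs/2508.01475
Authors: Zhen Wu,Ritam Dutt,Luke M. Breitfeller,Armineh Nourbakhsh,Siddharth Parekh,Carolyn Rosé
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: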
Abstract:Relational reasoning lies at the core of many NLP tasks, drawing on complementary signals from text and graphs. While prior research has investigated how to leverage this dual complementarity, a detailed and systematic understanding of text-graph interplay and its effect on hybrid models remains underexplored. We take an analysis-driven approach to investigate text-graph representation complementarity via a unified architecture that supports knowledge co-distillation (CoD). We explore five tasks involving relational reasoning that differ in how text and graph structures encode the information needed to solve that task. By tracking how these dual representations evolve during training, we uncover interpretable patterns of alignment and divergence, and provide insights into when and why their integration is beneficial.
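[AI-120] Fast and scalable retrosynthetic planning with a transformer neural network and speculative beam search
【Quick read】: This paper addresses the high latency that makes AI-based computer-aided synthesis planning (CASP) systems impractical for high-throughput synthesizability screening. The core difficulty is that the SMILES-to-SMILES transformers underlying multi-step synthesis planning are computationally inefficient under standard beam search and cannot produce synthesis routes for complex molecules within a few seconds. The key to the solution is combining speculative beam search with a scalable drafting strategy (Medusa), which markedly reduces the inference latency of SMILES-to-SMILES transformers, allowing AiZynthFinder to solve 26% to 86% more molecules under the same time constraints and bringing it closer to the low-latency requirements of high-throughput screening while improving user experience.
Link: https://arxiv.org/abs/2508.01459
Authors: Mikhail Andronov,Natalia Andronova,Michael Wand,Jürgen Schmidhuber,Djork-Arné Clevert
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: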
Abstract:AI-based computer-aided synthesis planning (CASP) systems are in demand as components of AI-driven drug discovery workflows. However, the high latency of such CASP systems limits their utility for high-throughput synthesizability screening in de novo drug design. We propose a method for accelerating multi-step synthesis planning systems that rely on SMILES-to-SMILES transformers as single-step retrosynthesis models. Our approach reduces the latency of SMILES-to-SMILES transformers powering multi-step synthesis planning in AiZynthFinder through speculative beam search combined with a scalable drafting strategy called Medusa. Replacing standard beam search with our approach allows the CASP system to solve 26% to 86% more molecules under the same time constraints of several seconds. Our method brings AI-based CASP systems closer to meeting the strict latency requirements of high-throughput synthesizability screening and improving general user experience.
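[AI-121] Tuning LLM-based Code Optimization via Meta-Prompting: An Industrial Perspective
【Quick read】: This paper addresses a deployment bottleneck in industrial code optimization systems that use multiple large language models (LLMs): prompts are model-specific, so a prompt optimized for one LLM often fails on others and expensive model-specific prompt engineering is required. The key to the solution is the Meta-Prompted Code Optimization (MPCO) framework, which uses meta-prompting to dynamically integrate project metadata, task requirements, and LLM-specific context and to automatically generate high-quality, task-specific prompts that work across models; it is deployed on the ARTEMIS industrial platform for automated validation and scaling, improving the practicality and performance of code optimization in multi-LLM settings.
Link: https://arxiv.org/abs/2508.01443
Authors: Jingzhi Gong,Rafail Giavrimis,Paul Brookes,Vardan Voskanyan,Fan Wu,Mari Ashiga,Matthew Truscott,Mike Basios,Leslie Kanthan,Jie Xu,Zheng Wang
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Submitted to ASE’25 Industry Showcase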
Abstract:There is a growing interest in leveraging large language models (LLMs) for automated code optimization. However, industrial platforms deploying multiple LLMs face a critical challenge: prompts optimized for one LLM often fail with others, requiring expensive model-specific prompt engineering. This cross-model prompt engineering bottleneck severely limits the practical deployment of multi-LLM optimization systems in production environments. To address this, we introduce Meta-Prompted Code Optimization (MPCO), a framework that automatically generates high-quality, task-specific prompts across diverse LLMs while maintaining industrial efficiency requirements. MPCO leverages meta-prompting to dynamically synthesize context-aware optimization prompts by integrating project metadata, task requirements, and LLM-specific contexts, and it seamlessly deploys on the ARTEMIS industrial platform for automated validation and scaling. Our comprehensive evaluation on five real-world codebases with 366 hours of runtime benchmarking demonstrates MPCO’s effectiveness: it achieves overall performance improvements up to 19.06% with the best statistical rank across all systems compared to baseline methods. Analysis shows that 96% of the top-performing optimizations stem from meaningful edits. Through systematic ablation studies and meta-prompter sensitivity analysis, we identify that comprehensive context integration is essential for effective meta-prompting, and that all three major LLMs can serve effectively as meta-prompters, providing actionable insights for industrial practitioners.
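A rough sketch of what "integrating project metadata, task requirements, and LLM-specific contexts" into a meta-prompt could look like follows. The `PromptContext` fields, the wording of the template, and the `build_meta_prompt` helper are hypothetical. The abstract does not show MPCO's actual prompt format or the ARTEMIS API.

```python
from dataclasses import dataclass


@dataclass
class PromptContext:
    # All three fields are illustrative stand-ins for the context types
    # named in the abstract.
    project_metadata: str   # e.g. language, build system, test suite
    task_requirements: str  # e.g. "reduce runtime without changing behaviour"
    llm_context: str        # e.g. notes on the target model's prompt style


def build_meta_prompt(ctx: PromptContext, target_llm: str) -> str:
    """Assemble the instruction handed to a meta-prompter LLM."""
    return "\n".join([
        f"Write a code-optimization prompt tailored to {target_llm}.",
        f"Project metadata: {ctx.project_metadata}",
        f"Task requirements: {ctx.task_requirements}",
        f"Target-model context: {ctx.llm_context}",
        "Return only the prompt text.",
    ])


ctx = PromptContext(
    project_metadata="Python 3.11 service with a pytest suite",
    task_requirements="reduce runtime, keep behaviour identical",
    llm_context="responds best to short, numbered instructions",
)
meta_prompt = build_meta_prompt(ctx, target_llm="model-A")
print(meta_prompt)
```

The meta-prompter's reply would then be used as the optimization prompt for the named target model, so one template serves several models.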
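[AI-122] TripTailor: A Real-World Benchmark for Personalized Travel Planning ACL2025
[Quick read]: This paper addresses the shortcomings of current evaluation standards for large language models (LLMs) on personalized travel planning. Existing benchmarks mostly rely on unrealistic simulated data and cannot capture the gap between LLM-generated plans and real-world itineraries. The authors propose TripTailor, a benchmark for personalized travel planning in real-world scenarios. Its core contribution is a dataset of more than 500,000 real points of interest (POIs) and nearly 4,000 detailed travel itineraries, which provides a more realistic evaluation framework. Experiments show that fewer than 10% of the itineraries generated by state-of-the-art LLMs reach human-level quality, highlighting significant challenges in feasibility, rationality, and personalized customization. TripTailor's main value lies in driving the development of travel agents that understand user needs and generate practical itineraries.
Link: https://arxiv.org/abs/2508.01432
Authors: Yuanzhe Shen, Kaimin Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2025 Findings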
Abstract:The continuous evolution and enhanced reasoning capabilities of large language models (LLMs) have elevated their role in complex tasks, notably in travel planning, where demand for personalized, high-quality itineraries is rising. However, current benchmarks often rely on unrealistic simulated data, failing to reflect the differences between LLM-generated and real-world itineraries. Existing evaluation metrics, which primarily emphasize constraints, fall short of providing a comprehensive assessment of the overall quality of travel plans. To address these limitations, we introduce TripTailor, a benchmark designed specifically for personalized travel planning in real-world scenarios. This dataset features an extensive collection of over 500,000 real-world points of interest (POIs) and nearly 4,000 diverse travel itineraries, complete with detailed information, providing a more authentic evaluation framework. Experiments show that fewer than 10% of the itineraries generated by the latest state-of-the-art LLMs achieve human-level performance. Moreover, we identify several critical challenges in travel planning, including the feasibility, rationality, and personalized customization of the proposed solutions. We hope that TripTailor will drive the development of travel planning agents capable of understanding and meeting user needs while generating practical itineraries. Our code and dataset are available at this https URL
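[AI-123] RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Lifelong Learning in Physical Embodied Systems
[Quick read]: This paper addresses four challenges that physical embodied systems face when learning continually in real environments: insufficient continuous-learning ability, latency in multi-module memory, difficulty capturing task correlations, and infinite loops in closed-loop planning. The solution is RoboMemory, a brain-inspired multi-memory framework that integrates four brain-like modules (an information preprocessor, a lifelong embodied memory system, a closed-loop planning module, and a low-level executer) to achieve long-term planning and cumulative learning. The core innovation is the Lifelong Embodied Memory System, which updates and retrieves in parallel across Spatial, Temporal, Episodic, and Semantic submodules and combines a dynamic Knowledge Graph (KG) with a consistent architectural design. This preserves memory consistency while improving inference efficiency and scalability, mitigating high latency and enabling stable lifelong learning.
Link: https://arxiv.org/abs/2508.01415
Authors: Mingcong Lei, Honghao Cai, Zezhou Cui, Liangchen Tan, Junkun Hong, Gehan Hu, Shuangyu Zhu, Yimou Wu, Shaohan Jiang, Ge Wang, Zhen Li, Shuguang Cui, Yiming Zhao, Yatong Han
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: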
Abstract:We present RoboMemory, a brain-inspired multi-memory framework for lifelong learning in physical embodied systems, addressing critical challenges in real-world environments: continuous learning, multi-module memory latency, task correlation capture, and infinite-loop mitigation in closed-loop planning. Grounded in cognitive neuroscience, it integrates four core modules: the Information Preprocessor (thalamus-like), the Lifelong Embodied Memory System (hippocampus-like), the Closed-Loop Planning Module (prefrontal lobe-like), and the Low-Level Executer (cerebellum-like) to enable long-term planning and cumulative learning. The Lifelong Embodied Memory System, central to the framework, alleviates inference speed issues in complex memory frameworks via parallelized updates/retrieval across Spatial, Temporal, Episodic, and Semantic submodules. It incorporates a dynamic Knowledge Graph (KG) and consistent architectural design to enhance memory consistency and scalability. Evaluations on EmbodiedBench show RoboMemory outperforms the open-source baseline (Qwen2.5-VL-72B-Ins) by 25% in average success rate and surpasses the closed-source State-of-the-Art (SOTA) (Claude3.5-Sonnet) by 5%, establishing new SOTA. Ablation studies validate key components (critic, spatial memory, long-term memory), while real-world deployment confirms its lifelong learning capability with significantly improved success rates across repeated tasks. RoboMemory alleviates high latency challenges with scalability, serving as a foundational reference for integrating multi-modal memory systems in physical robots.
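The "parallelized updates/retrieval" across the four memory submodules can be pictured with a small sketch. The lookup functions and the `retrieve_all` helper below are hypothetical placeholders. RoboMemory's real interfaces are not described in the abstract, so this only shows the general pattern of querying several stores concurrently instead of one after another.

```python
from concurrent.futures import ThreadPoolExecutor


def retrieve_all(stores, query):
    """Query every memory submodule concurrently and collect results by name."""
    with ThreadPoolExecutor(max_workers=len(stores)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in stores.items()}
        return {name: fut.result() for name, fut in futures.items()}


# Hypothetical stand-ins for the four submodules named in the abstract.
stores = {
    "spatial": lambda q: f"where '{q}' was last seen",
    "temporal": lambda q: f"when '{q}' was last used",
    "episodic": lambda q: f"past episodes involving '{q}'",
    "semantic": lambda q: f"general facts about '{q}'",
}
memories = retrieve_all(stores, "cup")
print(memories["spatial"])
```

With slow lookups (for example, model calls or database queries), total latency approaches that of the slowest submodule rather than the sum of all four.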
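[AI-124] Via Score to Performance: Efficient Human-Controllable Long Song Generation with Bar-Level Symbolic Notation
[Quick read]: This paper addresses four long-standing challenges in music generation: controllability, generalizability, perceptual quality, and duration. The authors argue that existing methods mainly try to learn music theory directly from raw audio, a task that remains extremely difficult for current models. They propose BACH (Bar-level AI Composing Helper), whose core innovation is an architecture built around human-editable symbolic scores, with a tokenization strategy and a symbolic generation procedure tailored to hierarchical song structure. The approach markedly improves generation efficiency, duration control, and perceptual quality, and at a small model size it achieves the best publicly reported performance (SOTA), even surpassing commercial products such as Suno.
Link: https://arxiv.org/abs/2508.01394
Authors: Tongxi Wang, Yang Yu, Qing Wang, Junlang Qian
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: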
Abstract:Song generation is regarded as the most challenging problem in music AIGC; nonetheless, existing approaches have yet to fully overcome four persistent limitations: controllability, generalizability, perceptual quality, and duration. We argue that these shortcomings stem primarily from the prevailing paradigm of attempting to learn music theory directly from raw audio, a task that remains prohibitively difficult for current models. To address this, we present Bar-level AI Composing Helper (BACH), the first model explicitly designed for song generation through human-editable symbolic scores. BACH introduces a tokenization strategy and a symbolic generative procedure tailored to hierarchical song structure. Consequently, it achieves substantial gains in the efficiency, duration, and perceptual quality of song generation. Experiments demonstrate that BACH, with a small model size, establishes a new SOTA among all publicly reported song generation systems, even surpassing commercial solutions such as Suno. Human evaluations further confirm its superiority across multiple subjective metrics.
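[AI-125] Recognising, Anticipating, and Mitigating LLM Pollution of Online Behavioural Research
[Quick read]: This paper addresses "LLM Pollution" in online behavioural research: participants use large language models (LLMs) for task assistance, delegate tasks to them entirely, or change their own behaviour because they anticipate the presence of LLMs. The result is distorted research data, compromised sample authenticity, and weakened methodological foundations. The key to the solution is a multi-layered response that spans improved researcher practices, stronger platform accountability, and coordinated community efforts, forming a systematic defence that can adapt to rapidly evolving LLM technology and preserve the validity and credibility of online behavioural research.
Link: https://arxiv.org/abs/2508.01390
Authors: Raluca Rilla, Tobias Werner, Hiromu Yakura, Iyad Rahwan, Anne-Marie Nussberger
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: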
Abstract:Online behavioural research faces an emerging threat as participants increasingly turn to large language models (LLMs) for advice, translation, or task delegation: LLM Pollution. We identify three interacting variants through which LLM Pollution threatens the validity and integrity of online behavioural research. First, Partial LLM Mediation occurs when participants make selective use of LLMs for specific aspects of a task, such as translation or wording support, leading researchers to (mis)interpret LLM-shaped outputs as human ones. Second, Full LLM Delegation arises when agentic LLMs complete studies with little to no human oversight, undermining the central premise of human-subject research at a more foundational level. Third, LLM Spillover signifies human participants altering their behaviour as they begin to anticipate LLM presence in online studies, even when none are involved. While Partial Mediation and Full Delegation form a continuum of increasing automation, LLM Spillover reflects second-order reactivity effects. Together, these variants interact and generate cascading distortions that compromise sample authenticity, introduce biases that are difficult to detect post hoc, and ultimately undermine the epistemic grounding of online research on human cognition and behaviour. Crucially, the threat of LLM Pollution is already co-evolving with advances in generative AI, creating an escalating methodological arms race. To address this, we propose a multi-layered response spanning researcher practices, platform accountability, and community efforts. As the challenge evolves, coordinated adaptation will be essential to safeguard methodological integrity and preserve the validity of online behavioural research.
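[AI-126] Prompt to Pwn: Automated Exploit Generation for Smart Contracts
[Quick read]: This paper addresses Automated Exploit Generation (AEG) for smart contract vulnerabilities, namely how to use large language models (LLMs) to automatically generate verifiable proof-of-concept (PoC) exploit code. The solution is ReX, a framework that integrates LLM-driven exploit synthesis with the Foundry testing suite to automate the pipeline from vulnerability identification to PoC generation and validation. By systematically evaluating five state-of-the-art LLMs on synthetic benchmarks and real-world vulnerable contracts, the paper shows that modern LLMs can generate executable PoC exploits for diverse vulnerability types with success rates of up to 92%, with Gemini 2.5 Pro and GPT-4.1 performing best. The authors also build the first curated dataset of real-world PoC exploits to support future AEG research.
Link: https://arxiv.org/abs/2508.01371
Authors: Zeke Xiao, Yuekang Li, Qin Wang, Shiping Chen
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: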
Abstract:We explore the feasibility of using LLMs for Automated Exploit Generation (AEG) against vulnerable smart contracts. We present ReX, a framework integrating LLM-based exploit synthesis with the Foundry testing suite, enabling the automated generation and validation of proof-of-concept (PoC) exploits. We evaluate five state-of-the-art LLMs (GPT-4.1, Gemini 2.5 Pro, Claude Opus 4, DeepSeek, and Qwen3 Plus) on both synthetic benchmarks and real-world smart contracts affected by known high-impact exploits. Our results show that modern LLMs can reliably generate functional PoC exploits for diverse vulnerability types, with success rates reaching up to 92%. Notably, Gemini 2.5 Pro and GPT-4.1 consistently outperform others in both synthetic and real-world scenarios. We further analyze factors influencing AEG effectiveness, including model capabilities, contract structure, and vulnerability types. We also collect the first curated dataset of real-world PoC exploits to support future research.
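[AI-127] Relation-Aware LNN-Transformer for Intersection-Centric Next-Step Prediction
[Quick read]: This paper addresses the limitations of conventional human mobility models that assume a closed world: methods restricted to a predefined set of points of interest (POIs) struggle to capture exploratory or target-agnostic behaviour and the topological constraints of urban road networks. The key to the solution is a road-node-centric framework that represents user trajectories on the city's road-intersection graph, removing the restriction to a fixed POI set. A sector-wise directional POI aggregation encodes environmental context and is combined with structural graph embeddings to produce semantically grounded node representations. A Relation-Aware LNN-Transformer, which fuses a Continuous-time Forgetting Cell (CfC-LNN) with a bearing-biased self-attention module, models fine-grained temporal dynamics and long-range spatial dependencies, improving the accuracy and robustness of next-step location prediction.
Link: https://arxiv.org/abs/2508.01368
Authors: Zhehong Ren, Tianluo Zhang, Yiheng Lu, Yushen Liang, Promethee Spathis
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures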
Abstract:Next-step location prediction plays a pivotal role in modeling human mobility, underpinning applications from personalized navigation to strategic urban planning. However, approaches that assume a closed world, restricting choices to a predefined set of points of interest (POIs), often fail to capture exploratory or target-agnostic behavior and the topological constraints of urban road networks. Hence, we introduce a road-node-centric framework that represents road-user trajectories on the city’s road-intersection graph, thereby relaxing the closed-world constraint and supporting next-step forecasting beyond fixed POI sets. To encode environmental context, we introduce a sector-wise directional POI aggregation that produces compact features capturing distance, bearing, density and presence cues. By combining these cues with structural graph embeddings, we obtain semantically grounded node representations. For sequence modeling, we integrate a Relation-Aware LNN-Transformer, a hybrid of a Continuous-time Forgetting Cell (CfC-LNN) and a bearing-biased self-attention module, to capture both fine-grained temporal dynamics and long-range spatial dependencies. Evaluated on city-scale road-user trajectories, our model outperforms six state-of-the-art baselines by up to 17 percentage points in accuracy at one hop and 10 percentage points in MRR, and maintains high resilience under noise, losing only 2.4 percentage points in accuracy at one hop under 50 meter GPS perturbation and 8.9 percentage points in accuracy at one hop under 25 percent POI noise.
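The sector-wise directional POI aggregation can be illustrated with a short sketch. The choice of eight sectors, the particular per-sector features (count, presence flag, nearest distance), and the function names are assumptions made for illustration, since the abstract only says the features capture distance, bearing, density and presence cues.

```python
import math


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360.0


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlon / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def sector_features(node, pois, n_sectors=8):
    """Bin POIs around an intersection into compass sectors and summarise each."""
    width = 360.0 / n_sectors
    feats = [{"count": 0, "present": 0, "nearest_m": math.inf} for _ in range(n_sectors)]
    for lat, lon in pois:
        s = int(bearing_deg(node[0], node[1], lat, lon) // width) % n_sectors
        d = haversine_m(node[0], node[1], lat, lon)
        feats[s]["count"] += 1                                  # density cue
        feats[s]["present"] = 1                                 # presence cue
        feats[s]["nearest_m"] = min(feats[s]["nearest_m"], d)   # distance cue
    return feats


node = (0.0, 0.0)  # hypothetical intersection
pois = [(0.01, 0.002), (0.012, 0.003), (0.002, 0.01), (-0.01, 0.002)]
features = sector_features(node, pois)
print([f["count"] for f in features])
```

Flattening the per-sector values gives a fixed-length vector per intersection, which is the kind of compact feature that could be concatenated with a graph embedding.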
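[AI-128] Convergence Analysis of Aggregation-Broadcast in LoRA-enabled Federated Learning
[Quick read]: This paper addresses the communication and computation challenges that growing model sizes pose for Federated Learning (FL), in particular how to aggregate model parameters effectively after local updates with Low-Rank Adaptation (LoRA). The key to the solution is a unified convergence analysis framework: by defining an Aggregation-Broadcast Operator (ABO), the authors derive a general condition for global-model convergence under mild assumptions and give several sufficient conditions that guarantee convergence. They prove that the two mainstream aggregation schemes, Sum-Product (SP) and Product-Sum (PS), both satisfy the convergence condition but differ in their ability to reach the optimal convergence rate, providing theoretical grounding and guidance for choosing an aggregation strategy in LoRA-based FL.
Link: https://arxiv.org/abs/2508.01348
Authors: Xin Chen, Shuaijun Chen, Omid Tavallaie, Nguyen Tran, Shuhuang Xiang, Albert Zomaya
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: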
Abstract:Federated Learning (FL) enables collaborative model training across decentralized data sources while preserving data privacy. However, the growing size of Machine Learning (ML) models poses communication and computation challenges in FL. Low-Rank Adaptation (LoRA) has recently been introduced into FL as an efficient fine-tuning method, reducing communication overhead by updating only a small number of trainable parameters. Despite its effectiveness, how to aggregate LoRA-updated local models on the server remains a critical and understudied problem. In this paper, we provide a unified convergence analysis for LoRA-based FL. We first categorize the current aggregation methods into two major types: Sum-Product (SP) and Product-Sum (PS). Then we formally define the Aggregation-Broadcast Operator (ABO) and derive a general convergence condition under mild assumptions. Furthermore, we present several sufficient conditions that guarantee convergence of the global model. These theoretical analyses offer a principled understanding of various aggregation strategies. Notably, we prove that the SP and PS aggregation methods both satisfy our convergence condition, but differ in their ability to achieve the optimal convergence rate. Extensive experiments on standard benchmarks validate our theoretical findings.
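Why the order of summing and multiplying matters for LoRA aggregation can be seen in a tiny worked example. Each client holds a low-rank update B_i A_i. A server can either average the products or average the factors and then multiply, and the two generally disagree. The abstract does not define SP and PS, so the mapping of those names onto the two functions below is deliberately left open, and the matrices are made up for illustration.

```python
def matmul(X, Y):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def mean_matrices(mats):
    """Element-wise mean of equally shaped matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)] for i in range(rows)]


def average_of_products(Bs, As):
    """Multiply each client's factors first, then average the full updates."""
    return mean_matrices([matmul(B, A) for B, A in zip(Bs, As)])


def product_of_averages(Bs, As):
    """Average the B factors and A factors separately, then multiply."""
    return matmul(mean_matrices(Bs), mean_matrices(As))


# Two hypothetical clients with rank-1 adapters on a 2x2 weight matrix.
Bs = [[[1.0], [0.0]], [[0.0], [1.0]]]
As = [[[1.0, 0.0]], [[0.0, 1.0]]]

exact = average_of_products(Bs, As)
factored = product_of_averages(Bs, As)
print(exact)     # mean of the clients' true updates
print(factored)  # what the averaged factors reconstruct
```

Averaging the factors keeps the result low-rank and cheap to broadcast, but it introduces cross terms between clients, which is one intuition for why different aggregation rules could reach different convergence rates.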
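[AI-129] UEChecker: Detecting Unchecked External Call Vulnerabilities in DApps via Graph Analysis
[Quick read]: This paper addresses security vulnerabilities in decentralized application (DApp) smart contracts that arise when the results of external calls are not verified. Such vulnerabilities often enable flash loan and reentrancy attacks and cause large economic losses. The solution is UEChecker, a deep-learning-based detection tool. Its core components are a feature representation built from the call graph, an edge prediction module that reconstructs node and edge features, a node aggregation module that captures local and global structural information, and a Conformer Block module that combines multi-head attention, convolution, and feed-forward networks to capture dependencies at different scales. Together these improve the detection of unchecked external call vulnerabilities, reaching an accuracy of 87.59% in an audit of 608 DApps and outperforming GAT, LSTM, and GCN baselines.
Link: https://arxiv.org/abs/2508.01343
Authors: Dechao Kong, Xiaoqi Li, Wenkai Li
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: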
Abstract:The increasing number of attacks on the contract layer of DApps has resulted in economic losses amounting to 66 billion. Vulnerabilities arise when contracts interact with external protocols without verifying the results of the calls, leading to exploit entry points such as flash loan attacks and reentrancy attacks. In this paper, we propose UEChecker, a deep learning-based tool that utilizes a call graph and a Graph Convolutional Network to detect unchecked external call vulnerabilities. We design the following components: An edge prediction module that reconstructs the feature representation of nodes and edges in the call graph; A node aggregation module that captures structural information from both the node itself and its neighbors, thereby enhancing feature representation between nodes and improving the model’s understanding of the global graph structure; A Conformer Block module that integrates multi-head attention, convolutional modules, and feedforward neural networks to more effectively capture dependencies of different scales within the call graph, extending beyond immediate neighbors and enhancing the performance of vulnerability detection. Finally, we combine these modules with Graph Convolutional Network to detect unchecked external call vulnerabilities. By auditing the smart contracts of 608 DApps, our results show that our tool achieves an accuracy of 87.59% in detecting unchecked external call vulnerabilities. Furthermore, we compare our tool with GAT, LSTM, and GCN baselines, and in the comparison experiments, UEChecker consistently outperforms these models in terms of accuracy.
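[AI-130] BlockA2A: Towards Secure and Verifiable Agent-to-Agent Interoperability
[Quick read]: This paper addresses new security risks in multi-agent systems (MASes) driven by large language models (LLMs), including fragmented identity frameworks, insecure communication channels, and inadequate defences against Byzantine agents or adversarial prompts. The core solution is BlockA2A, presented as the first unified multi-agent trust framework. It uses decentralized identifiers (DIDs) for fine-grained cross-domain agent authentication, blockchain-anchored ledgers for immutable auditability, and smart contracts to dynamically enforce context-aware access control policies. A Defense Orchestration Engine (DOE) adds real-time attack responses such as Byzantine agent flagging, execution halting, and instant permission revocation, improving the security and scalability of multi-agent systems while keeping latency low (sub-second overhead).
Link: https://arxiv.org/abs/2508.01332
Authors: Zhenhua Zou, Zhuotao Liu, Lepeng Zhao, Qiuyang Zhan
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 43 pages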
Abstract:The rapid adoption of agentic AI, powered by large language models (LLMs), is transforming enterprise ecosystems with autonomous agents that execute complex workflows. Yet we observe several key security vulnerabilities in LLM-driven multi-agent systems (MASes): fragmented identity frameworks, insecure communication channels, and inadequate defenses against Byzantine agents or adversarial prompts. In this paper, we present the first systematic analysis of these emerging multi-agent risks and explain why the legacy security strategies cannot effectively address these risks. Afterwards, we propose BlockA2A, the first unified multi-agent trust framework that enables secure and verifiable agent-to-agent interoperability. At a high level, BlockA2A adopts decentralized identifiers (DIDs) to enable fine-grained cross-domain agent authentication, blockchain-anchored ledgers to enable immutable auditability, and smart contracts to dynamically enforce context-aware access control policies. BlockA2A eliminates centralized trust bottlenecks, ensures message authenticity and execution integrity, and guarantees accountability across agent interactions. Furthermore, we propose a Defense Orchestration Engine (DOE) that actively neutralizes attacks through real-time mechanisms, including Byzantine agent flagging, reactive execution halting, and instant permission revocation. Empirical evaluations demonstrate BlockA2A’s effectiveness in neutralizing prompt-based, communication-based, behavioral and systemic MAS attacks. We formalize its integration into existing MAS and showcase a practical implementation for Google’s A2A protocol. Experiments confirm that BlockA2A and DOE operate with sub-second overhead, enabling scalable deployment in production LLM-based MAS environments.
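The auditability idea behind "blockchain-anchored ledgers" can be approximated with a much simpler structure: a local hash chain, in which each log entry commits to the hash of the previous one so that later tampering is detectable. The sketch below is only that simplified stand-in. It involves no blockchain, DIDs, or smart contracts, and the entry layout and helper names are assumptions rather than BlockA2A's design.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry's predecessor


def _digest(prev_hash, payload):
    """SHA-256 over a canonical JSON encoding of (previous hash, payload)."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(log, payload):
    """Append an agent action to the log, chaining it to the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)})


def verify(log):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, {"agent": "agent-a", "action": "request_quote"})
append_entry(log, {"agent": "agent-b", "action": "send_quote"})
ok_before = verify(log)

log[0]["payload"]["action"] = "transfer_funds"  # simulate tampering
ok_after = verify(log)
print(ok_before, ok_after)
```

Anchoring the latest hash on a public ledger would be what prevents an operator from silently rewriting the whole chain, and that step is outside the scope of this sketch.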
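[AI-131] NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset
[Quick read]: This paper addresses the limited accuracy, reproducibility, and scalability of current evaluation benchmarks for Graphical User Interface (GUI) agents driven by large language models (LLMs). The core solution is a new benchmark, NatureGAIA, built on the principle of Causal Pathways, which decomposes complex tasks into a series of programmatically verifiable atomic steps to give a rigorous, fully automated, and reproducible evaluation standard. The authors also develop a hierarchical agent architecture optimized for long-horizon tasks and use it to generate a high-quality, human-verified trajectory dataset, which is then used for Reinforcement Fine-Tuning (RFT) of the Qwen2.5-VL-7B model. RFT substantially improves the small model's GUI execution, but the results also expose a capability ceiling for small models on comprehensive tasks that integrate perception, decision-making, and execution.
Link: https://arxiv.org/abs/2508.01330
Authors: Zihan Zheng, Tianle Cui, Chuwen Xie, Jiahui Zhang, Jiahui Pan, Lewei He, Qianglong Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: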
Abstract:The rapid advancement of Large Language Model (LLM)-driven Graphical User Interface (GUI) agents is significantly hampered by the profound limitations of existing evaluation benchmarks in terms of accuracy, reproducibility, and scalability. To address this critical gap, we introduce NatureGAIA, a novel benchmark engineered on the principle of Causal Pathways. This design paradigm structures complex tasks into a series of programmatically verifiable atomic steps, ensuring a rigorous, fully automated, and reproducible standard for assessment. Concurrently, to mitigate the inherent capability deficits of agents, we developed a hierarchical agent architecture specifically optimized for long-horizon tasks. We leveraged this agent to generate a high-quality, human-verified trajectory dataset that uniquely captures diverse and even self-correcting interaction patterns of LLMs. We then utilized this dataset to perform Reinforcement Fine-Tuning (RFT) on the Qwen2.5-VL-7B model. Our experiments reveal that NatureGAIA presents a formidable challenge to current state-of-the-art LLMs; even the top-performing Claude-sonnet-4 achieved a Weighted Pathway Success Rate (WPSR) of only 34.6%. Moreover, while RFT substantially improved the smaller model’s GUI execution capabilities (WPSR increased from 3.3% to 10.8%), its performance degraded sharply when handling complex scenarios. This outcome highlights the inherent capability ceiling of smaller models when faced with comprehensive tasks that integrate perception, decision-making, and execution. This research contributes a rigorous evaluation standard and a high-quality dataset to the community, aiming to guide the future development of GUI agents.
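[AI-132] Is Exploration or Optimization the Problem for Deep Reinforcement Learning?
[Quick read]: This paper addresses performance bottlenecks in deep reinforcement learning (DRL) caused by optimization difficulties: even when high-quality experience has been generated, DRL algorithms fail to exploit it fully, which limits overall performance. The key to the solution is a new practical sub-optimality estimator that quantifies the optimization limitations of DRL algorithms. Experiments show that the best experience generated is 2-3 times better than the performance the policy actually learns, indicating that current methods exploit only about half of the good experience they generate and that optimization is a core bottleneck for DRL performance.
Link: https://arxiv.org/abs/2508.01329
Authors: Glen Berseth
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: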
Abstract:In the era of deep reinforcement learning, making progress is more complex, as the collected experience must be compressed into a deep model for future exploitation and sampling. Many papers have shown that training a deep learning policy under the changing state and action distribution leads to sub-optimal performance, or even collapse. This naturally leads to the concern that even if the community creates improved exploration algorithms or reward objectives, those improvements may fall on the deaf ears of optimization difficulties. This work proposes a new practical sub-optimality estimator to determine optimization limitations of deep reinforcement learning algorithms. Through experiments across environments and RL algorithms, it is shown that the best experience generated is 2-3× better than the policies’ learned performance. This large difference indicates that deep RL methods only exploit half of the good experience they generate.
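One plausible reading of such an estimator is the gap between the best returns found in the collected experience and the average return of the learned policy. The abstract does not give the estimator's formula, so the top-fraction rule, the `suboptimality_gap` name, and the sample numbers below are assumptions made purely for illustration.

```python
def suboptimality_gap(experience_returns, policy_returns, top_fraction=0.05):
    """Gap between the best collected episodes and the learned policy's mean return."""
    k = max(1, int(len(experience_returns) * top_fraction))
    best = sorted(experience_returns, reverse=True)[:k]
    best_mean = sum(best) / k
    policy_mean = sum(policy_returns) / len(policy_returns)
    return best_mean - policy_mean


# Hypothetical episode returns gathered during training and at evaluation time.
collected = [1.0, 2.0, 3.0, 10.0]
evaluated = [4.0, 4.0]
gap = suboptimality_gap(collected, evaluated, top_fraction=0.25)
print(gap)
```

A large positive gap would mean the agent has already seen behaviour much better than what its policy reproduces, which points at optimization rather than exploration as the limiting factor.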
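[AI-133] Towards Evaluation for Real-World LLM Unlearning
[Quick read]: This paper addresses the insufficient practicality, exactness, and robustness of existing evaluation metrics for large language model (LLM) unlearning in real-world use. The core challenge is that conventional metrics struggle to quantify how thoroughly a model has forgotten specific training data and are unstable under distribution shift and noise. The authors propose a new metric, Distribution Correction-based Unlearning Evaluation (DCUE). It first identifies the core tokens that matter most for the unlearning target, then uses a validation set to correct biases in the confidence-score distributions of those tokens, and finally quantifies the difference between the corrected distributions with the Kolmogorov-Smirnov test. The method improves the reliability and interpretability of evaluation results and offers guidance for designing more practical and robust unlearning algorithms.
Link: https://arxiv.org/abs/2508.01324
Authors: Ke Miao, Yuke Hu, Xiaochen Li, Wenjie Bao, Zhihao Liu, Zhan Qin, Kui Ren
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: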
Abstract:This paper analyzes the limitations of existing unlearning evaluation metrics in terms of practicality, exactness, and robustness in real-world LLM unlearning scenarios. To overcome these limitations, we propose a new metric called Distribution Correction-based Unlearning Evaluation (DCUE). It identifies core tokens and corrects distributional biases in their confidence scores using a validation set. The evaluation results are quantified using the Kolmogorov-Smirnov test. Experimental results demonstrate that DCUE overcomes the limitations of existing metrics, which also guides the design of more practical and reliable unlearning algorithms in the future.
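The last step of the metric, comparing two sets of confidence scores with the Kolmogorov-Smirnov test, can be sketched with the two-sample KS statistic, which is the largest gap between the two empirical CDFs. Only that statistic is shown here. Core-token selection and the validation-set correction are not specified in the abstract and are left out, and the score lists are invented for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max distance between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)  # fraction of sample_a <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of sample_b <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical confidence scores on core tokens.
unlearned_scores = [0.1, 0.4]
reference_scores = [0.3, 0.9]
d_stat = ks_statistic(unlearned_scores, reference_scores)
print(d_stat)
```

A statistic near 0 says the two score distributions are hard to tell apart, while a value near 1 says they barely overlap. A full test would also convert the statistic into a p-value using the sample sizes.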
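[AI-134] Idempotent Equilibrium Analysis of Hybrid Workflow Allocation: A Mathematical Schema for Future Work
[Quick read]: This paper asks how the division of labour between humans and machines evolves and eventually stabilizes as large-scale AI systems develop. The core challenge is to understand how automation affects task allocation and to identify the conditions under which an economically efficient equilibrium emerges. The key to the solution is to formalize the process as an iterated task-delegation map and to use lattice-theoretic fixed-point tools (the Tarski and Banach fixed-point theorems) to prove that at least one stable idempotent equilibrium exists, in which every task is performed by the agent (human or machine) with enduring comparative advantage. Mild monotonicity conditions guarantee that the equilibrium is unique. In a continuous model the long-run automated share has the closed form $x^* = \alpha / (\alpha + \beta)$, where $\alpha$ is the pace of automation and $\beta$ is the rate at which new human-centric tasks appear, so full automation is ruled out whenever $\beta > 0$. This points to a persistent role for humans in future work as "workflow conductors" who coordinate and integrate AI modules rather than compete with them directly.
Link: https://arxiv.org/abs/2508.01323
Authors: Faruk Alpay, Bugra Kilictas, Taylan Alpay, Hamdi Alakkad
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); General Economics (econ.GN)
Comments: 25 pages, 9 figures, 4 tables. Proves existence/uniqueness of an “idempotent equilibrium” for human-AI task allocation and provides closed-form steady-state automation share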
Abstract:The rapid advance of large-scale AI systems is reshaping how work is divided between people and machines. We formalise this reallocation as an iterated task-delegation map and show that, under broad, empirically grounded assumptions, the process converges to a stable idempotent equilibrium in which every task is performed by the agent (human or machine) with enduring comparative advantage. Leveraging lattice-theoretic fixed-point tools (Tarski and Banach), we (i) prove existence of at least one such equilibrium and (ii) derive mild monotonicity conditions that guarantee uniqueness. In a stylised continuous model the long-run automated share takes the closed form $x^* = \alpha / (\alpha + \beta)$, where $\alpha$ captures the pace of automation and $\beta$ the rate at which new, human-centric tasks appear; hence full automation is precluded whenever $\beta > 0$. We embed this analytic result in three complementary dynamical benchmarks (a discrete linear update, an evolutionary replicator dynamic, and a continuous Beta-distributed task spectrum), each of which converges to the same mixed equilibrium and is reproducible from the provided code-free formulas. A 2025-to-2045 simulation calibrated to current adoption rates projects automation rising from approximately 10% of work to approximately 65%, leaving a persistent one-third of tasks to humans. We interpret that residual as a new profession of workflow conductor: humans specialise in assigning, supervising and integrating AI modules rather than competing with them. Finally, we discuss implications for skill development, benchmark design and AI governance, arguing that policies which promote “centaur” human-AI teaming can steer the economy toward the welfare-maximising fixed point.
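The closed form can be checked numerically with a simple discrete update. The rule used below, x_{t+1} = x_t + alpha * (1 - x_t) - beta * x_t, is an assumed form of a "discrete linear update" (the abstract does not spell it out), and the rates alpha = 0.10 and beta = 0.05 are illustrative values rather than the paper's calibration. Setting x_{t+1} = x_t in that rule gives alpha * (1 - x) = beta * x, which is x* = alpha / (alpha + beta).

```python
def simulate_automation_share(alpha, beta, x0, steps):
    """Iterate an assumed linear update for the automated share of tasks."""
    x = x0
    for _ in range(steps):
        # Automation absorbs a fraction alpha of the remaining human tasks,
        # while new human-centric tasks pull a fraction beta back.
        x = x + alpha * (1.0 - x) - beta * x
    return x


alpha, beta = 0.10, 0.05  # hypothetical rates
closed_form = alpha / (alpha + beta)
simulated = simulate_automation_share(alpha, beta, x0=0.10, steps=500)
print(closed_form, simulated)
```

Because each step shrinks the distance to the fixed point by a factor of 1 - alpha - beta, the iteration converges for any starting share as long as that factor has magnitude below one, and the limit stays strictly below 1 whenever beta is positive.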
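[AI-135] PUZZLED: Jailbreaking LLMs through Word-Based Puzzles
[Quick read]: This paper addresses safety risks in deployed large language models (LLMs), specifically their weak defences against jailbreak attacks. Existing methods mostly rely on iterative prompt engineering or semantic transformations to evade detection, with limited effect. The paper proposes PUZZLED, a jailbreak method that exploits the LLM's own reasoning ability: keywords in a harmful instruction are masked and presented as one of three puzzle types (word search, anagram, crossword) that are familiar to humans but cognitively demanding for LLMs, so the model must solve the puzzle to recover the original instruction before it responds. Turning a familiar cognitive task into a reasoning challenge yields a high attack success rate (88.8% on average, 96.5% on GPT-4.1).
Link: https://arxiv.org/abs/2508.01306
Authors: Yelim Ahn, Jaejin Lee
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 15 pages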
Abstract:As large language models (LLMs) are increasingly deployed across diverse domains, ensuring their safety has become a critical concern. In response, studies on jailbreak attacks have been actively growing. Existing approaches typically rely on iterative prompt engineering or semantic transformations of harmful instructions to evade detection. In this work, we introduce PUZZLED, a novel jailbreak method that leverages the LLM’s reasoning capabilities. It masks keywords in a harmful instruction and presents them as word puzzles for the LLM to solve. We design three puzzle types (word search, anagram, and crossword) that are familiar to humans but cognitively demanding for LLMs. The model must solve the puzzle to uncover the masked words and then proceed to generate responses to the reconstructed harmful instruction. We evaluate PUZZLED on five state-of-the-art LLMs and observe a high average attack success rate (ASR) of 88.8%, specifically 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet. PUZZLED is a simple yet powerful attack that transforms familiar puzzles into an effective jailbreak strategy by harnessing LLMs’ reasoning capabilities.
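[AI-136] How Far Are LLMs from Symbolic Planners? An NLP-Based Perspective
[Quick read]: This paper addresses the unreliable quality of plans generated by large language models (LLMs) in AI task planning, where plans often contain mistaken or hallucinated actions that make execution fail. The key to the solution is a recovery pipeline based on natural language processing (NLP): it first evaluates the LLM-generated plan with NLP methods, then recovers and adjusts the plan through NLP manipulation, and finally completes and validates the plan with a symbolic planner. The pipeline supports quality analysis and improvement of LLM-generated plans. It does not reach the reliability of classical planners, but it improves the share of executable actions and raises the overall success rate from 21.9% to 27.5%.
Link: https://arxiv.org/abs/2508.01300
Authors: Ma’ayan Armony, Albert Meroño-Peñuela, Gerard Canal
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: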
Abstract:The reasoning and planning abilities of Large Language Models (LLMs) have been a frequent topic of discussion in recent years. Their ability to take unstructured planning problems as input has made LLMs’ integration into AI planning an area of interest. Nevertheless, LLMs are still not reliable as planners, with the generated plans often containing mistaken or hallucinated actions. Existing benchmarking and evaluation methods investigate planning with LLMs, focusing primarily on success rate as a quality indicator in various planning tasks, such as validating plans or planning in relaxed conditions. In this paper, we approach planning with LLMs as a natural language processing (NLP) task, given that LLMs are NLP models themselves. We propose a recovery pipeline consisting of an NLP-based evaluation of the generated plans, along with three stages to recover the plans through NLP manipulation of the LLM-generated plans, and eventually complete the plan using a symbolic planner. This pipeline provides a holistic analysis of LLM capabilities in the context of AI task planning, enabling a broader understanding of the quality of invalid plans. Our findings reveal no clear evidence of underlying reasoning during plan generation, and that a pipeline comprising an NLP-based analysis of the plans, followed by a recovery mechanism, still falls short of the quality and reliability of classical planners. On average, only the first 2.65 actions of the plan are executable, with the average length of symbolically generated plans being 8.4 actions. The pipeline still improves action quality and increases the overall success rate from 21.9% to 27.5%.
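The statistic "only the first 2.65 actions of the plan are executable" presupposes a way to count the executable prefix of a plan. A minimal STRIPS-style sketch of that count follows: an action runs only if its preconditions hold in the current state, and the count stops at the first unknown or inapplicable action. The toy domain, the action names, and the `executable_prefix_length` helper are illustrative assumptions rather than the paper's evaluation code.

```python
def executable_prefix_length(plan, actions, initial_state):
    """Number of leading plan steps that can be executed from the initial state."""
    state = set(initial_state)
    for i, name in enumerate(plan):
        action = actions.get(name)
        # Stop at a hallucinated (unknown) action or an unmet precondition.
        if action is None or not action["pre"] <= state:
            return i
        state = (state - action["del"]) | action["add"]
    return len(plan)


# Hypothetical toy domain: move a block from the table to a shelf.
actions = {
    "pick": {"pre": {"hand_empty", "on_table"}, "add": {"holding"},
             "del": {"hand_empty", "on_table"}},
    "place": {"pre": {"holding"}, "add": {"on_shelf", "hand_empty"},
              "del": {"holding"}},
}
start = {"hand_empty", "on_table"}

llm_plan = ["pick", "place", "fly"]  # "fly" stands in for a hallucinated action
prefix = executable_prefix_length(llm_plan, actions, start)
print(prefix)
```

A symbolic planner could then be asked to continue from the state reached after the executable prefix, which is the spirit of completing a partially valid plan.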
[AI-137] Exploitation Is All You Need… for Exploration
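Quick read: This paper addresses a central challenge in training meta-reinforcement learning (meta-RL) agents: ensuring sufficient exploration of novel environments. Conventional approaches handle the exploration-exploitation dilemma with explicit incentives such as randomization, uncertainty bonuses, or intrinsic rewards. The paper proposes instead that an agent optimized only for a greedy (exploitation-only) objective can still exhibit emergent exploration, provided three conditions hold: (1) Recurring Environmental Structure, so that past experience can inform future decisions; (2) Agent Memory, so that the agent can retain and use its interaction history; and (3) Long-Horizon Credit Assignment, so that the delayed benefits of exploration can influence current policy updates. Experiments show that when structure and memory are both present, a policy trained on a purely greedy objective exhibits information-seeking exploration, and that this behavior vanishes if either structure or memory is absent. Notably, exploration can persist even without long-horizon credit assignment, which the authors attribute to a pseudo-Thompson Sampling effect, suggesting that exploration and exploitation can be unified in a single reward-maximization process.
Link: https://arxiv.org/abs/2508.01287
Authors: Micah Rentschler,Jesse Roberts
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: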
Abstract:Ensuring sufficient exploration is a central challenge when training meta-reinforcement learning (meta-RL) agents to solve novel environments. Conventional solutions to the exploration-exploitation dilemma inject explicit incentives such as randomization, uncertainty bonuses, or intrinsic rewards to encourage exploration. In this work, we hypothesize that an agent trained solely to maximize a greedy (exploitation-only) objective can nonetheless exhibit emergent exploratory behavior, provided three conditions are met: (1) Recurring Environmental Structure, where the environment features repeatable regularities that allow past experience to inform future choices; (2) Agent Memory, enabling the agent to retain and utilize historical interaction data; and (3) Long-Horizon Credit Assignment, where learning propagates returns over a time frame sufficient for the delayed benefits of exploration to inform current decisions. Through experiments in stochastic multi-armed bandits and temporally extended gridworlds, we observe that, when both structure and memory are present, a policy trained on a strictly greedy objective exhibits information-seeking exploratory behavior. We further demonstrate, through controlled ablations, that emergent exploration vanishes if either environmental structure or agent memory is absent (Conditions 1 and 2). Surprisingly, removing long-horizon credit assignment (Condition 3) does not always prevent emergent exploration, a result we attribute to the pseudo-Thompson Sampling effect. These findings suggest that, under the right prerequisites, exploration and exploitation need not be treated as orthogonal objectives but can emerge from a unified reward-maximization process.
[AI-138] BioDisco: Multi-agent hypothesis generation with dual-mode evidence iterative feedback and temporal evaluation
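Quick read: This paper addresses the difficulty of hypothesis generation in scientific research: automatically identifying novel, evidence-backed hypotheses from a large and complex body of literature and knowledge, while overcoming the weaknesses of existing automated methods in ensuring novelty, supporting iterative refinement, and assessing future discovery potential. The key to the solution is BioDisco, a multi-agent framework that combines language model-based reasoning with a dual-mode evidence system (biomedical knowledge graphs and automated literature retrieval) to ensure grounded novelty, refines hypotheses iteratively through a built-in scoring and feedback loop, and validates performance with temporal and human evaluations together with a Bradley-Terry paired comparison model for statistically grounded assessment.
Link: https://arxiv.org/abs/2508.01285
Authors: Yujing Ke,Kevin George,Kathan Pandya,David Blumenthal,Maximilian Sprang,Gerrit Großmann,Sebastian Vollmer,David Antony Selby
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Applications (stat.AP)
Comments: 7 pages main content + 11 pages appendices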
Abstract:Identifying novel hypotheses is essential to scientific research, yet this process risks being overwhelmed by the sheer volume and complexity of available information. Existing automated methods often struggle to generate novel and evidence-grounded hypotheses, lack robust iterative refinement and rarely undergo rigorous temporal evaluation for future discovery potential. To address this, we propose BioDisco, a multi-agent framework that draws upon language model-based reasoning and a dual-mode evidence system (biomedical knowledge graphs and automated literature retrieval) for grounded novelty, integrates an internal scoring and feedback loop for iterative refinement, and validates performance through pioneering temporal and human evaluations and a Bradley-Terry paired comparison model to provide statistically-grounded assessment. Our evaluations demonstrate superior novelty and significance over ablated configurations representative of existing agentic architectures. Designed for flexibility and modularity, BioDisco allows seamless integration of custom language models or knowledge graphs, and can be run with just a few lines of code. We anticipate researchers using this practical tool as a catalyst for the discovery of new hypotheses.
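The Bradley-Terry paired comparison model mentioned in the abstract can be illustrated with a minimal sketch. The comparison counts below are hypothetical, and the fitting routine is the textbook minorization-maximization update, not necessarily the procedure the BioDisco authors used.

```python
def bradley_terry(wins, n_items, iters=200):
    """Fit Bradley-Terry strengths with the classic MM update.

    wins[(i, j)] is the number of times item i was preferred over item j.
    Returns strengths rescaled so that they average to 1.
    """
    strengths = [1.0] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strengths[i] + strengths[j])
            updated.append(total_wins / denom if denom else strengths[i])
        norm = sum(updated)
        strengths = [s * n_items / norm for s in updated]
    return strengths


# Hypothetical judgments over three candidate hypotheses (10 comparisons per pair).
wins = {(0, 1): 7, (1, 0): 3, (0, 2): 8, (2, 0): 2, (1, 2): 6, (2, 1): 4}
strengths = bradley_terry(wins, n_items=3)
ranking = sorted(range(3), key=lambda i: strengths[i], reverse=True)
```

Under the model, the probability that item i beats item j is `strengths[i] / (strengths[i] + strengths[j])`, so the fitted strengths give a ranking together with pairwise win probabilities.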
[AI-139] Defending Against Beta Poisoning Attacks in Machine Learning Models
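Quick read: This paper addresses the threat that Beta Poisoning attacks pose to the security of machine learning models. Such attacks degrade model accuracy by making the training dataset linearly nonseparable. The key to the solution is to identify and exploit characteristic properties of the poisoning samples: they lie close to one another and are centered near the mean of the target class. Based on these observations, the authors propose four defense strategies: kNN Proximity-Based Defense (KPB), Neighborhood Class Comparison (NCC), Clustering-Based Defense (CBD), and Mean Distance Threshold (MDT). KPB and MDT achieve perfect accuracy and F1 scores on the MNIST and CIFAR-10 datasets.
Link: https://arxiv.org/abs/2508.01276
Authors: Nilufer Gulciftci,M. Emre Gursoy
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: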
Abstract:Poisoning attacks, in which an attacker adversarially manipulates the training dataset of a machine learning (ML) model, pose a significant threat to ML security. Beta Poisoning is a recently proposed poisoning attack that disrupts model accuracy by making the training dataset linearly nonseparable. In this paper, we propose four defense strategies against Beta Poisoning attacks: kNN Proximity-Based Defense (KPB), Neighborhood Class Comparison (NCC), Clustering-Based Defense (CBD), and Mean Distance Threshold (MDT). The defenses are based on our observations regarding the characteristics of poisoning samples generated by Beta Poisoning, e.g., poisoning samples have close proximity to one another, and they are centered near the mean of the target class. Experimental evaluations using MNIST and CIFAR-10 datasets demonstrate that KPB and MDT can achieve perfect accuracy and F1 scores, while CBD and NCC also provide strong defensive capabilities. Furthermore, by analyzing performance across varying parameters, we offer practical insights regarding defenses’ behaviors under varying conditions.
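The observation that poisoning samples sit near the mean of the target class suggests a simple distance test. The sketch below is one plausible reading of a mean-distance threshold, with a hypothetical function name, toy 2-D data, and a hand-picked threshold; the abstract does not specify the actual MDT procedure.

```python
import math


def flag_near_class_mean(points, threshold):
    """Flag samples whose Euclidean distance to the class mean is below threshold.

    Assumption: samples unusually close to the class mean are treated as
    suspected poison; the threshold would need tuning on real data.
    """
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return [math.dist(p, mean) < threshold for p in points]


# Toy target class: six spread-out clean points, then three tightly
# clustered points near the class mean standing in for poison.
clean = [(0, 4), (4, 0), (0, -4), (-4, 0), (3, 3), (-3, -3)]
poison = [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.0)]
flags = flag_near_class_mean(clean + poison, threshold=1.0)
```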
[AI-140] KCR: Resolving Long-Context Knowledge Conflicts via Reasoning in LLMs
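Quick read: This paper addresses the poor performance of Large Language Models (LLMs) on inter-context knowledge conflicts, in particular their tendency to become confused when a long input contains multiple sources that contradict one another. The key to the solution is the Knowledge Conflict Reasoning (KCR) framework, which uses Reinforcement Learning to train backbone LLMs to select and adhere to the context with stronger logical consistency (the correct reasoning path) when presented with conflicting contexts. Concretely, KCR first extracts reasoning paths from the conflicting long contexts, represented as text or local knowledge graphs, and then uses a reinforcement learning reward to steer the model toward following correct reasoning paths rather than incorrect ones, so that the model acquires the ability to resolve knowledge conflicts in long contexts.
Link: https://arxiv.org/abs/2508.01273
Authors: Xianda Zheng,Zijian Huang,Meng-Fen Chiang,Michael J. Witbrock,Kaiqi Zhao
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: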
Abstract:Knowledge conflicts commonly arise across diverse sources, and their prevalence has increased with the advent of LLMs. When dealing with conflicts between multiple contexts, also known as inter-context knowledge conflicts, LLMs are often confused by lengthy and conflicting contexts. To address this challenge, we propose the Knowledge Conflict Reasoning (KCR) framework, which enhances the ability of LLMs to resolve conflicting knowledge. The key idea of KCR is to train backbone LLMs to establish a correct reasoning process by rewarding them for selecting and adhering to the context with stronger logical consistency when presented with conflicting contexts. Specifically, we first extract reasoning paths, represented by either text or local knowledge graphs, from the conflicting long contexts. Subsequently, we employ Reinforcement Learning to encourage the model to learn the paradigm of reasoning process that follows correct reasoning paths rather than the incorrect counterparts. This enables the backbone models to genuinely acquire the capability to resolve inter-context knowledge conflicts within long contexts. Experimental results demonstrate that our framework significantly improves the ability of various backbone models to resolve knowledge conflicts in long-context scenarios, yielding substantial performance gains.
[AI-141] Win-k: Improved Membership Inference Attacks on Small Language Models
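Quick read: This paper addresses the privacy risks of Small Language Models (SLMs) in resource-constrained settings, focusing on the reduced effectiveness of Membership Inference Attacks (MIAs) against them. As SLMs spread to edge and on-device deployments, leakage of their training data becomes a growing concern, yet existing MIAs become less effective as models get smaller, which limits accurate assessment of SLM privacy risk. The paper proposes a new MIA, win-k, which builds on the state-of-the-art min-k attack. Experiments on three datasets and eight SLMs show that win-k outperforms five existing MIAs in terms of AUROC, TPR @ 1% FPR, and FPR @ 99% TPR, with the advantage most visible on smaller models.
Link: https://arxiv.org/abs/2508.01268
Authors: Roya Arkhmammadova,Hosein Madadi Tamar,M. Emre Gursoy
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: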
Abstract:Small language models (SLMs) are increasingly valued for their efficiency and deployability in resource-constrained environments, making them useful for on-device, privacy-sensitive, and edge computing applications. On the other hand, membership inference attacks (MIAs), which aim to determine whether a given sample was used in a model’s training, are an important threat with serious privacy and intellectual property implications. In this paper, we study MIAs on SLMs. Although MIAs were shown to be effective on large language models (LLMs), they are relatively less studied on emerging SLMs, and furthermore, their effectiveness decreases as models get smaller. Motivated by this finding, we propose a new MIA called win-k, which builds on top of a state-of-the-art attack (min-k). We experimentally evaluate win-k by comparing it with five existing MIAs using three datasets and eight SLMs. Results show that win-k outperforms existing MIAs in terms of AUROC, TPR @ 1% FPR, and FPR @ 99% TPR metrics, especially on smaller models.
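For context, the min-k baseline that win-k builds on scores a sample by averaging the lowest token log-probabilities. The sketch below shows that baseline only, with hypothetical log-probability values; it is not win-k, whose details the abstract does not give.

```python
def min_k_score(token_logprobs, k=0.2):
    """Average log-probability of the k fraction of least likely tokens.

    Higher (less negative) scores suggest the text was more likely seen
    during training. A membership decision compares the score to a threshold.
    """
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Hypothetical per-token log-probabilities from a target model.
member_like = [-0.1, -0.3, -0.2, -0.5, -0.4]     # no very surprising tokens
non_member_like = [-0.1, -3.0, -0.2, -5.0, -0.4]  # a few very unlikely tokens
score_member = min_k_score(member_like, k=0.4)
score_non_member = min_k_score(non_member_like, k=0.4)
```

In practice the log-probabilities would come from the model under attack, and the threshold would be chosen to hit a target false positive rate such as the 1% FPR operating point reported in the abstract.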
[AI-142] Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models
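Quick read: This paper addresses the fundamental trade-off between model capacity and computational efficiency in language models. The key to the solution is MoE-MLA-RoPE, an architecture that combines Mixture of Experts (MoE), Multi-head Latent Attention (MLA), and Rotary Position Embeddings (RoPE). It rests on three innovations: (1) fine-grained routing over 64 micro-experts with top-k selection, giving 3.6×10⁷ possible expert combinations; (2) shared expert isolation, which keeps 2 general-purpose experts always active for common patterns while routing to 6 of 62 specialized experts; and (3) gradient-conflict-free load balancing, which maintains expert utilization without interfering with optimization of the primary loss. Experiments on models from 17M to 202M parameters show a 68% reduction in KV cache memory and a 3.2x inference speedup while keeping perplexity competitive (0.8% degradation).
Link: https://arxiv.org/abs/2508.01261
Authors: Sushant Mehta,Raj Dandekar,Rajat Dandekar,Sreedath Panat
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: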
Abstract:We present MoE-MLA-RoPE, a novel architecture that combines Mixture of Experts (MoE) with Multi-head Latent Attention (MLA) and Rotary Position Embeddings (RoPE) for efficient language modeling. Our approach addresses the fundamental trade-off between model capacity and computational efficiency through three key innovations: (1) fine-grained expert routing with 64 micro-experts and top-k selection, enabling flexible specialization through 3.6 × 10^7 possible expert combinations; (2) shared expert isolation that dedicates 2 always-active experts for common patterns while routing to 6 of 62 specialized experts; and (3) gradient-conflict-free load balancing that maintains expert utilization without interfering with primary loss optimization. Extensive experiments on models ranging from 17M to 202M parameters demonstrate that MoE-MLA-RoPE with compression ratio r=d/2 achieves 68% KV cache memory reduction and 3.2x inference speedup while maintaining competitive perplexity (0.8% degradation). At the 53.9M-parameter scale, MoE-MLA-RoPE improves the validation loss by 6.9% over vanilla transformers while using 42% fewer active parameters per forward pass. FLOP-matched experiments reveal even larger gains: 11.1% improvement with 3.2x inference acceleration. Automated evaluation using GPT-4 as a judge confirms quality improvements in generation, with higher scores on coherence (8.1/10), creativity (7.9/10) and grammatical correctness (8.2/10). Our results establish that architectural novelty, not parameter scaling, defines the efficiency frontier for resource-constrained language model deployment.
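The shared-plus-routed expert layout described above (2 always-active experts plus 6 of 62 specialized ones) can be sketched as a per-token routing step. The function name, the softmax-over-selected gate normalization, and the logits are assumptions for illustration; the abstract does not describe the router at this level of detail.

```python
import math


def route_token(router_logits, n_shared=2, top_k=6):
    """Pick experts for one token.

    router_logits holds one score per routed (specialized) expert.
    Returns the ids of the always-active shared experts and a dict mapping
    each selected routed expert id to its gate weight.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Softmax restricted to the selected experts (an assumed normalization).
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    gates = {i: e / total for i, e in zip(chosen, exps)}
    shared = list(range(n_shared))  # ids in a separate shared-expert pool
    return shared, gates


# Hypothetical router scores for 62 specialized experts.
logits = [0.1 * i for i in range(62)]
shared_ids, gates = route_token(logits)
```

Each token therefore touches 8 of the 64 micro-experts, which is how such a layer keeps the active parameter count per forward pass well below the total.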
[AI-143] SketchAgent: Generating Structured Diagrams from Hand-Drawn Sketches IJCAI2025
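Quick read: This paper addresses the automatic conversion of hand-drawn sketches into structured, machine-readable diagrams, a task that remains labor-intensive and largely manual. The core challenge is that sketches are semantically ambiguous and lack structural constraints, so they cannot be used directly to generate precise diagrams. The key to the solution is SketchAgent, a multi-agent system that integrates sketch recognition, symbolic reasoning, and iterative validation to turn sketches into semantically coherent and structurally accurate diagrams, significantly reducing the need for manual effort.
Link: https://arxiv.org/abs/2508.01237
Authors: Cheng Tan,Qi Chen,Jingxuan Wei,Gaowei Wu,Zhangyang Gao,Siyuan Li,Bihui Yu,Ruifeng Guo,Stan Z. Li
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by IJCAI 2025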
Abstract:Hand-drawn sketches are a natural and efficient medium for capturing and conveying ideas. Despite significant advancements in controllable natural image generation, translating freehand sketches into structured, machine-readable diagrams remains a labor-intensive and predominantly manual task. The primary challenge stems from the inherent ambiguity of sketches, which lack the structural constraints and semantic precision required for automated diagram generation. To address this challenge, we introduce SketchAgent, a multi-agent system designed to automate the transformation of hand-drawn sketches into structured diagrams. SketchAgent integrates sketch recognition, symbolic reasoning, and iterative validation to produce semantically coherent and structurally accurate diagrams, significantly reducing the need for manual effort. To evaluate the effectiveness of our approach, we propose the Sketch2Diagram Benchmark, a comprehensive dataset and evaluation framework encompassing eight diverse diagram categories, such as flowcharts, directed graphs, and model architectures. The dataset comprises over 6,000 high-quality examples with token-level annotations, standardized preprocessing, and rigorous quality control. By streamlining the diagram generation process, SketchAgent holds great promise for applications in design, education, and engineering, while offering a significant step toward bridging the gap between intuitive sketching and machine-readable diagram generation. The benchmark is released at this https URL.
[AI-144] Oldie but Goodie: Re-illuminating Label Propagation on Graphs with Partially Observed Features KDD2025
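Quick read: This paper addresses missing node features in real-world graphs: when only a few or a fraction of node features are available, conventional Graph Neural Networks (GNNs) perform poorly on downstream tasks such as node classification. The key to the solution is a new framework called GOODIE, with three main components: (1) a GNN-based decoder that lets the Label Propagation (LP) branch output hidden embeddings aligned with those of the Feature Propagation (FP) branch; (2) a Structure-Feature Attention mechanism that automatically weighs the importance of structure and feature information; and (3) a pseudo-label contrastive learning strategy that differentiates the contribution of each positive pair within the pseudo-labels from the LP branch, yielding more reliable final predictions. The method outperforms existing state-of-the-art methods both when features are scarce and when they are abundant.
Link: https://arxiv.org/abs/2508.01209
Authors: Sukwon Yun,Xin Liu,Yunhak Oh,Junseok Lee,Tianlong Chen,Tsuyoshi Murata,Chanyoung Park
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: KDD 2025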
Abstract:In real-world graphs, we often encounter missing-feature situations where a few or the majority of node features, e.g., sensitive information, are missing. In such scenarios, directly utilizing Graph Neural Networks (GNNs) would yield sub-optimal results in downstream tasks such as node classification. Despite the emergence of a few GNN-based methods attempting to mitigate the missing-feature situation, when only a few features are available, they perform worse than traditional structure-based models. To this end, we propose a novel framework that further illuminates the potential of classical Label Propagation (Oldie), taking advantage of Feature Propagation, especially when only partial features are available. Called GOODIE, it takes a hybrid approach to obtain embeddings from the Label Propagation branch and the Feature Propagation branch. To do so, we first design a GNN-based decoder that enables the Label Propagation branch to output hidden embeddings that align with those of the FP branch. Then, GOODIE automatically captures the significance of structure and feature information thanks to the newly designed Structure-Feature Attention. Followed by a novel Pseudo-Label contrastive learning that differentiates the contribution of each positive pair within pseudo-labels originating from the LP branch, GOODIE outputs the final prediction for the unlabeled nodes. Through extensive experiments, we demonstrate that our proposed model, GOODIE, outperforms the existing state-of-the-art methods not only when only a few features are available but also in abundantly available situations. Source code of GOODIE is available at: this https URL.
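Classical Label Propagation, the "Oldie" that GOODIE builds on, can be sketched in a few lines. This is the textbook neighbor-averaging variant with clamped seed labels on a hypothetical toy graph, not the GOODIE model itself.

```python
def label_propagation(adj, seed_labels, n_classes, iters=100):
    """Propagate class scores over a graph by repeated neighbor averaging.

    adj[i] lists the neighbors of node i; seed_labels maps labeled node ids
    to their class. Labeled nodes are clamped to their one-hot label.
    """
    n = len(adj)
    scores = [[0.0] * n_classes for _ in range(n)]
    for node, cls in seed_labels.items():
        scores[node][cls] = 1.0
    for _ in range(iters):
        updated = []
        for i in range(n):
            if i in seed_labels:
                updated.append(scores[i][:])  # keep known labels fixed
            elif adj[i]:
                updated.append([sum(scores[j][c] for j in adj[i]) / len(adj[i])
                                for c in range(n_classes)])
            else:
                updated.append(scores[i][:])  # isolated node: nothing to average
        scores = updated
    return [max(range(n_classes), key=lambda c: row[c]) for row in scores]


# Toy path graph 0-1-2-3-4-5 with one labeled node at each end.
adj = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
predicted = label_propagation(adj, seed_labels={0: 0, 5: 1}, n_classes=2)
```

The method uses only graph structure and known labels, which is why it remains usable when node features are largely missing.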
[AI-145] Calibrated Prediction Set in Fault Detection with Risk Guarantees via Significance Tests
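Quick read: This paper addresses the lack of rigorous risk control and reliable uncertainty quantification in existing fault diagnosis models when they face complex scenarios such as distribution shift. The key to the solution is to recast fault detection as a hypothesis testing problem: a nonconformity measure is defined on model residuals, p-values for new samples are computed against a calibration dataset, and these are used to build prediction sets that are mathematically guaranteed to contain the true label with a user-specified probability $1-\alpha$. The method provides explicit risk control for fault classification. Experiments confirm that it maintains the theoretical coverage at different risk levels and exhibits a controllable trade-off between risk and efficiency.
Link: https://arxiv.org/abs/2508.01208
Authors: Mingchen Mei,Yi Li,YiYao Qian,Zijun Jia
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: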
Abstract:Fault detection is crucial for ensuring the safety and reliability of modern industrial systems. However, a significant scientific challenge is the lack of rigorous risk control and reliable uncertainty quantification in existing diagnostic models, particularly when facing complex scenarios such as distributional shifts. To address this issue, this paper proposes a novel fault detection method that integrates significance testing with the conformal prediction framework to provide formal risk guarantees. The method transforms fault detection into a hypothesis testing task by defining a nonconformity measure based on model residuals. It then leverages a calibration dataset to compute p-values for new samples, which are used to construct prediction sets mathematically guaranteed to contain the true label with a user-specified probability, $1-\alpha$. Fault classification is subsequently performed by analyzing the intersection of the constructed prediction set with predefined normal and fault label sets. Experimental results on cross-domain fault diagnosis tasks validate the theoretical properties of our approach. The proposed method consistently achieves an empirical coverage rate at or above the nominal level ($1-\alpha$), demonstrating robustness even when the underlying point-prediction models perform poorly. Furthermore, the results reveal a controllable trade-off between the user-defined risk level ($\alpha$) and efficiency, where higher risk tolerance leads to smaller average prediction set sizes. This research contributes a theoretically grounded framework for fault detection that enables explicit risk control, enhancing the trustworthiness of diagnostic systems in safety-critical applications and advancing the field from simple point predictions to informative, uncertainty-aware outputs.
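The p-value and prediction-set construction described above follows the standard conformal recipe, sketched below. The label-conditional setup, the residual values, and the function names are assumptions for illustration; the paper's nonconformity measure may differ.

```python
def conformal_p_value(calibration_scores, test_score):
    """Fraction of calibration scores at least as extreme as the test score,
    with the usual +1 correction that gives finite-sample validity."""
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least + 1) / (len(calibration_scores) + 1)


def prediction_set(calibration_by_label, test_scores, alpha):
    """Keep every label whose p-value exceeds the risk level alpha."""
    return {label for label, score in test_scores.items()
            if conformal_p_value(calibration_by_label[label], score) > alpha}


# Hypothetical nonconformity scores (e.g. residual magnitudes) per label.
calibration = {
    "normal": [0.10, 0.20, 0.15, 0.30, 0.25, 0.12, 0.18, 0.22, 0.28],
    "fault": [0.10, 0.30, 0.20, 0.40, 0.25, 0.35, 0.15, 0.45, 0.50],
}
# Scores of one new sample under each candidate label.
new_sample = {"normal": 0.90, "fault": 0.20}
labels = prediction_set(calibration, new_sample, alpha=0.1)
```

A larger `alpha` rejects more labels and so yields smaller sets, which matches the risk-efficiency trade-off reported in the abstract.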
[AI-146] Conquering High Packet-Loss Erasure: MoE Swin Transformer-Based Video Semantic Communication
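Quick read: This paper addresses the loss of semantic information caused by packet drops in packet-based semantic communication systems. Current protocols discard packets in which errors are detected, so the receiver cannot use erroneous semantic data for robust decoding, which severely degrades reconstruction quality. The key to the solution is a packet-loss-resistant MoE Swin Transformer-based Video Semantic Communication (MSTVSC) system. Semantic vectors are encoded and transmitted through upper-layer protocol packetization; at the receiver, a 3D CNN recovers missing information from the un-lost semantic data and a packet-loss mask matrix; semantic-level interleaving reduces the concentrated semantic loss caused by packet drops; a common-individual decomposition with downsampling of the individual information removes redundancy; and the model is made lightweight for practical deployment. The system maintains an MS-SSIM above 0.6 and a PSNR above 20 dB at a 90% packet loss rate.
Link: https://arxiv.org/abs/2508.01205
Authors: Lei Teng,Senran Fan,Chen Dong,Haotai Liang,Zhicheng Bao,Xiaodong Xu,Rui Meng,Ping Zhang
Affiliations: unknown
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: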
Abstract:Semantic communication with joint semantic-channel coding robustly transmits diverse data modalities but faces challenges in mitigating semantic information loss due to packet drops in packet-based systems. Under current protocols, packets with errors are discarded, preventing the receiver from utilizing erroneous semantic data for robust decoding. To address this issue, a packet-loss-resistant MoE Swin Transformer-based Video Semantic Communication (MSTVSC) system is proposed in this paper. Semantic vectors are encoded by MSTVSC and transmitted through upper-layer protocol packetization. To investigate the impact of the packetization, a theoretical analysis of the packetization strategy is provided. To mitigate the semantic loss caused by packet loss, a 3D CNN at the receiver recovers missing information using un-lost semantic data and a packet-loss mask matrix. Semantic-level interleaving is employed to reduce concentrated semantic loss from packet drops. To improve compression, a common-individual decomposition approach is adopted, with downsampling applied to individual information to minimize redundancy. The model is made lightweight for practical deployment. Extensive simulations and comparisons demonstrate strong performance, achieving an MS-SSIM greater than 0.6 and a PSNR exceeding 20 dB at a 90% packet loss rate.
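The idea behind interleaving, spreading one lost packet across non-adjacent positions, can be shown with a generic block interleaver. This is a minimal sketch with hypothetical sizes and integer stand-ins for semantic symbols, not the semantic-level scheme of MSTVSC.

```python
def interleave(symbols, depth):
    """Reorder symbols so that originally adjacent ones land far apart."""
    return [symbols[i]
            for start in range(depth)
            for i in range(start, len(symbols), depth)]


def deinterleave(symbols, depth):
    """Invert interleave() for the same depth."""
    restored = [None] * len(symbols)
    position = 0
    for start in range(depth):
        for i in range(start, len(symbols), depth):
            restored[i] = symbols[position]
            position += 1
    return restored


original = list(range(12))           # stand-ins for 12 semantic symbols
sent = interleave(original, depth=4)  # [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
received = sent[:]
received[0:3] = [None, None, None]    # the first packet of 3 symbols is lost
restored = deinterleave(received, depth=4)
lost_positions = [i for i, s in enumerate(restored) if s is None]
```

Without interleaving the same loss would wipe out positions 0, 1, and 2 together; with it the gaps are isolated, which leaves neighboring data for a recovery network to work from.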
[AI-147] Importance Sampling is All You Need: Predict LLM's performance on new benchmark by reusing existing benchmark
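Quick read: This paper addresses two problems facing current code generation benchmarks: the rising cost of building high-quality test suites and reference solutions, and the growing risk of data contamination, which undermines the reliability of benchmark-based evaluation. The key to the solution is BIS, a prompt-centric evaluation framework that predicts the performance of Large Language Models (LLMs) on code generation tasks without ground truth, by analyzing the prompt distribution rather than executing generated code. The method is built on importance sampling theory and uses Importance Weighted Autoencoders to reweight samples from existing annotated benchmarks and estimate performance on new, unseen benchmarks. To stabilize the estimates, BIS introduces weight truncation strategies and computes marginal expectations across the fitted distributions. The result is a low-cost evaluation tool suited to resource-constrained settings, with quick feedback for prompt selection and contamination assessment.
Link: https://arxiv.org/abs/2508.01203
Authors: Junjie Shi,Wei Ma,Shi Ying,Lingxiao Jiang,Yang liu,Bo Du
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: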
Abstract:With the rapid advancement of large language models, code generation has become a key benchmark for evaluating LLM capabilities. However, existing benchmarks face two major challenges: (1) the escalating cost of constructing high-quality test suites and reference solutions, and (2) the increasing risk of data contamination, which undermines the reliability of benchmark-based evaluations. In this paper, we propose BIS, a prompt-centric evaluation framework that enables ground-truth-free prediction of LLM performance on code generation tasks. Rather than executing generated code, BIS estimates performance metrics by analyzing the prompt distribution alone. Built on importance sampling theory and implemented using Importance Weighted Autoencoders, our method reweights samples from existing annotated benchmarks to estimate performance on new, unseen benchmarks. To stabilize the estimation, we introduce weight truncation strategies and compute marginal expectations across the fitted distributions. BIS serves as a complementary tool that supports benchmark development and validation under constrained resources, offering actionable and quick feedback for prompt selection and contamination assessment. We conduct extensive experiments involving 8,000 evaluation points across 4 CodeLlama models and 9 diverse benchmarks. Our framework achieves an average absolute prediction error of 1.1% for code correctness scores, with best- and worst-case errors of 0.3% and 1.9%, respectively. It also generalizes well to other metrics, attaining average absolute errors of 2.15% for pass@1. These results demonstrate the reliability and broad applicability of BIS, which can significantly reduce the cost and effort of benchmarking LLMs in code-related tasks.
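The reweighting step with weight truncation can be illustrated by a self-normalized importance sampling estimator. The densities and outcomes below are hypothetical numbers; in the paper the densities come from fitted Importance Weighted Autoencoders, which this sketch does not model.

```python
def truncated_is_estimate(values, p_target, q_source, clip=10.0):
    """Self-normalized importance sampling with clipped weights.

    values[i]   : metric observed on sample i of an existing benchmark
    p_target[i] : density of that sample's prompt under the new benchmark
    q_source[i] : density of that sample's prompt under the existing benchmark
    clip        : upper bound on each weight, trading a little bias for
                  lower variance
    """
    weights = [min(p / q, clip) for p, q in zip(p_target, q_source)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical pass/fail outcomes on four prompts from an existing benchmark.
outcomes = [1, 0, 1, 1]
p_new = [0.1, 0.4, 0.4, 0.1]      # assumed densities under the new benchmark
q_old = [0.25, 0.25, 0.25, 0.25]  # assumed densities under the old benchmark
estimate = truncated_is_estimate(outcomes, p_new, q_old)
clipped = truncated_is_estimate(outcomes, p_new, q_old, clip=1.0)
```

Prompts that are common in the new benchmark but rare in the old one receive large weights, so the estimate shifts from the plain average of 0.75 to 0.6 here; clipping caps the influence of any single sample.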
[AI-148] BSL: A Unified and Generalizable Multitask Learning Platform for Virtual Drug Discovery from Design to Synthesis
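Quick read: This paper addresses shortcomings of current computational platforms for virtual drug discovery: fragmented task coverage, limited algorithmic innovation, and poor generalization to out-of-distribution (OOD) molecular structures. The key to the solution is Baishenglai (BSL), a deep learning-enhanced, open-access platform that integrates seven core drug discovery tasks within a unified and modular framework and incorporates techniques such as generative models and graph neural networks. Beyond reaching state-of-the-art (SOTA) performance on multiple benchmark datasets, the platform emphasizes evaluation of generalization to OOD molecular structures. Its practical value was demonstrated in real pharmaceutical research, where it identified three novel compounds with clear bioactivity acting on the GluN1/GluN3A NMDA receptor.
Link: https://arxiv.org/abs/2508.01195
Authors: Kun Li,Zhennan Wu,Yida Xiong,Hongzhi Zhang,Longtao Hu,Zhonglie Liu,Junqi Zeng,Wenjie Wu,Mukun Chen,Jiameng Chen,Wenbin Hu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: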
Abstract:Drug discovery is of great social significance in safeguarding human health, prolonging life, and addressing the challenges of major diseases. In recent years, artificial intelligence has demonstrated remarkable advantages in key tasks across bioinformatics and pharmacology, owing to its efficient data processing and data representation capabilities. However, most existing computational platforms cover only a subset of core tasks, leading to fragmented workflows and low efficiency. In addition, they often lack algorithmic innovation and show poor generalization to out-of-distribution (OOD) data, which greatly hinders the progress of drug discovery. To address these limitations, we propose Baishenglai (BSL), a deep learning-enhanced, open-access platform designed for virtual drug discovery. BSL integrates seven core tasks within a unified and modular framework, incorporating advanced technologies such as generative models and graph neural networks. In addition to achieving state-of-the-art (SOTA) performance on multiple benchmark datasets, the platform emphasizes evaluation mechanisms that focus on generalization to OOD molecular structures. Comparative experiments with existing platforms and baseline methods demonstrate that BSL provides a comprehensive, scalable, and effective solution for virtual drug discovery, offering both algorithmic innovation and high-precision prediction for real-world pharmaceutical research. In addition, BSL demonstrated its practical utility by discovering novel modulators of the GluN1/GluN3A NMDA receptor, successfully identifying three compounds with clear bioactivity in in-vitro electrophysiological assays. These results highlight BSL as a promising and comprehensive platform for accelerating biomedical research and drug discovery. The platform is accessible at this https URL.
[AI-149] SpectrumWorld: Artificial Intelligence Foundation for Spectroscopy
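Quick read: This paper addresses the lack of standardized formulations and evaluation in deep learning research for spectroscopy. The core of the solution is SpectrumLab, a unified platform with three components: a Python library with data processing and evaluation tools plus leaderboards; the SpectrumAnnotator module, which generates high-quality benchmarks from limited seed data; and SpectrumBench, a multi-layered benchmark suite covering 14 spectroscopic tasks and more than 10 spectrum types, with spectra curated from over 1.2 million distinct chemical substances. This design gives deep learning research in spectroscopy a reproducible and comparable experimental setting.
Link: https://arxiv.org/abs/2508.01188
Authors: Zhuo Yang,Jiaqing Xie,Shuaike Shen,Daolang Wang,Yeyun Chen,Ben Gao,Shuzhou Sun,Biqing Qi,Dongzhan Zhou,Lei Bai,Linjiang Chen,Shufei Zhang,Jun Jiang,Tianfan Fu,Yuqiang Li
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: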
Abstract:Deep learning holds immense promise for spectroscopy, yet research and evaluation in this emerging field often lack standardized formulations. To address this issue, we introduce SpectrumLab, a pioneering unified platform designed to systematize and accelerate deep learning research in spectroscopy. SpectrumLab integrates three core components: a comprehensive Python library featuring essential data processing and evaluation tools, along with leaderboards; an innovative SpectrumAnnotator module that generates high-quality benchmarks from limited seed data; and SpectrumBench, a multi-layered benchmark suite covering 14 spectroscopic tasks and over 10 spectrum types, featuring spectra curated from over 1.2 million distinct chemical substances. Thorough empirical studies on SpectrumBench with 18 cutting-edge multimodal LLMs reveal critical limitations of current approaches. We hope SpectrumLab will serve as a crucial foundation for future advancements in deep learning-driven spectroscopy.
[AI-150] A Survey on Agent Workflow – Status and Future
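Quick read: This survey addresses the lack of a unified, scalable, and controllable workflow management mechanism for autonomous agents in complex scenarios. With the progress of large language models (LLMs), agent systems increasingly rely on tool use, memory, and reasoning to accomplish user goals, and scaling, securing, and controlling their behavior calls for structured workflow frameworks. The key contribution is a two-dimensional classification based on functional capabilities and architectural features, applied to more than 20 representative agent workflow systems to identify common patterns, technical challenges, and emerging trends. The survey also discusses workflow optimization strategies and security, offering guidance for future work at the intersection of agent design, workflow infrastructure, and safe automation.
Link: https://arxiv.org/abs/2508.01186
Authors: Chaojia Yu,Zihan Cheng,Hanwen Cui,Yishuo Gao,Zexu Luo,Yijin Wang,Hangbin Zheng,Yong Zhao
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 12 pages, 3 figures, accepted to IEEE Conference, ICAIBD (International Conference of Artificial Intelligence and Big Data) 2025. This is the author’s version, not the publisher’s. See this https URL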
Abstract:In the age of large language models (LLMs), autonomous agents have emerged as a powerful paradigm for achieving general intelligence. These agents dynamically leverage tools, memory, and reasoning capabilities to accomplish user-defined goals. As agent systems grow in complexity, agent workflows (structured orchestration frameworks) have become central to enabling scalable, controllable, and secure AI behaviors. This survey provides a comprehensive review of agent workflow systems, spanning academic frameworks and industrial implementations. We classify existing systems along two key dimensions: functional capabilities (e.g., planning, multi-agent collaboration, external API integration) and architectural features (e.g., agent roles, orchestration flows, specification languages). By comparing over 20 representative systems, we highlight common patterns, potential technical challenges, and emerging trends. We further address concerns related to workflow optimization strategies and security. Finally, we outline open problems such as standardization and multimodal integration, offering insights for future research at the intersection of agent design, workflow infrastructure, and safe automation.
[AI-151] Benchmarking and Bridging Emotion Conflicts for Multimodal Emotion Reasoning
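Quick read: This paper addresses a bias of current Multimodal Large Language Models (MLLMs) in emotion conflict scenarios: when modalities such as video and audio carry inconsistent emotional cues, the models over-rely on the audio signal and neglect critical cues from the visual modality. To counter this, the authors propose MoSEAR, a parameter-efficient framework with two modules: (1) MoSE (modality-specific experts), which uses a regularized gating mechanism to reduce modality bias in the fine-tuning heads; and (2) AR (attention reallocation), which rebalances modality contributions in the frozen backbone during inference. The framework mitigates the performance drop under emotion conflict and improves results on consistent samples, without trading off one modality against the other.
Link: https://arxiv.org/abs/2508.01181
Authors: Zhiyuan Han,Beier Zhu,Yanlong Xu,Peipei Song,Xun Yang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
Comments: ACM Multimedia 2025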
Abstract:Despite their strong performance in multimodal emotion reasoning, existing Multimodal Large Language Models (MLLMs) often overlook the scenarios involving emotion conflicts, where emotional cues from different modalities are inconsistent. To fill this gap, we first introduce CA-MER, a new benchmark designed to examine MLLMs under realistic emotion conflicts. It consists of three subsets: video-aligned, audio-aligned, and consistent, where only one or all modalities reflect the true emotion. However, evaluations on our CA-MER reveal that current state-of-the-art emotion MLLMs systematically over-rely on audio signal during emotion conflicts, neglecting critical cues from visual modality. To mitigate this bias, we propose MoSEAR, a parameter-efficient framework that promotes balanced modality integration. MoSEAR consists of two modules: (1) MoSE, modality-specific experts with a regularized gating mechanism that reduces modality bias in the fine-tuning heads; and (2) AR, an attention reallocation mechanism that rebalances modality contributions in frozen backbones during inference. Our framework offers two key advantages: it mitigates emotion conflicts and improves performance on consistent samples, without incurring a trade-off between audio and visual modalities. Experiments on multiple benchmarks, including MER2023, EMER, DFEW, and our CA-MER, demonstrate that MoSEAR achieves state-of-the-art performance, particularly under modality conflict conditions.
[AI-152] Advancing the Foundation Model for Music Understanding
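Quick read: This paper addresses fragmentation in Music Information Retrieval (MIR), where specialized models excel at isolated tasks but none offers a holistic understanding of music. The key to the solution is MuFun, a unified foundation model with a novel architecture that jointly processes instrumental and lyrical content and is trained on a large-scale dataset covering diverse tasks such as genre classification, music tagging, and question answering. The authors also build the MuCUE (Music Comprehensive Understanding Evaluation) benchmark for multi-faceted music understanding. Experiments show that MuFun significantly outperforms existing audio large language models on MuCUE tasks and generalizes well.
Link: https://arxiv.org/abs/2508.01178
Authors: Yi Jiang,Wei Wang,Xianwen Guo,Huiyun Liu,Hanrui Wang,Youri Xu,Haoqi Gu,Zhongqian Xie,Chuanjiang Luo
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: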
Abstract:The field of Music Information Retrieval (MIR) is fragmented, with specialized models excelling at isolated tasks. In this work, we challenge this paradigm by introducing a unified foundation model named MuFun for holistic music understanding. Our model features a novel architecture that jointly processes instrumental and lyrical content, and is trained on a large-scale dataset covering diverse tasks such as genre classification, music tagging, and question answering. To facilitate robust evaluation, we also propose a new benchmark for multi-faceted music understanding called MuCUE (Music Comprehensive Understanding Evaluation). Experiments show our model significantly outperforms existing audio large language models across the MuCUE tasks, demonstrating its state-of-the-art effectiveness and generalization ability.
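[AI-153] RSPO: Risk-Seeking Policy Optimization for Pass@k and Max@k Metrics in Large Language Models
[Quick read]: This paper addresses the mismatch between the risk-neutral objective used in large language model post-training, which maximizes expected reward, and the risk-seeking metrics used in evaluation, such as Pass@k (at least one success in k trials) and Max@k (the maximum reward across k responses); the mismatch can lead to suboptimal performance. The key to its solution is Risk-Seeking Policy Optimization (RSPO), which optimizes Pass@k and Max@k directly during training. It uses the closed-form probability that a given response is the maximum among k samples to mitigate the "hitchhiking" problem, in which low-reward responses are mistakenly reinforced because they co-occur with a high-reward response. RSPO also yields efficient, unbiased gradient estimators for both metrics despite the nested gradients over multiple responses.
Link: https://arxiv.org/abs/2508.01174
Authors: Kaichen Zhang,Shenghao Gao,Yuzhong Hong,Haipeng Sun,Junwei Bao,Hongfei Jiang,Yang Song,Hong Dingqian,Hui Xiong
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: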
Abstract:Current large language model post-training optimizes a risk-neutral objective that maximizes expected reward, yet evaluation relies heavily on risk-seeking metrics like Pass@k (at least one success in k trials) and Max@k (maximum reward across k responses). This mismatch in risk preferences can inevitably lead to suboptimal performance. To bridge this gap, we propose Risk-Seeking Policy Optimization (RSPO), a novel method that directly targets Pass@k and Max@k during training. A key challenge in optimizing these metrics is the “hitchhiking” problem: low-reward responses are inadvertently reinforced if they co-occur with a high-reward response within a sample of k generations, resulting in inefficient optimization. RSPO addresses this problem by leveraging the closed-form probability that a given response is the maximum among k samplings. Despite the complexity of nested gradients over multiple responses, RSPO produces efficient, unbiased gradient estimators for both metrics. We validate our approach with both rigorous theoretical analysis and comprehensive experimental results.
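The two metrics named in the abstract can be estimated from a pool of n sampled responses. The sketch below is illustrative only and is not RSPO itself: `pass_at_k` is the standard combinatorial Pass@k estimator, and `max_at_k` is an assumed estimator that averages the maximum reward over all k-subsets of the pool, using the probability that each response is the subset maximum. The function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled responses, c of which are correct."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    # comb(n - c, k) / comb(n, k) is the chance a random k-subset has no success.
    return 1.0 - comb(n - c, k) / comb(n, k)


def max_at_k(rewards: list[float], k: int) -> float:
    """Expected maximum reward over a uniformly random k-subset of the pool."""
    n = len(rewards)
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    ordered = sorted(rewards)
    total = comb(n, k)
    # The reward at sorted position i (0-based) is the subset maximum when the
    # remaining k - 1 members are drawn from the i responses ranked below it.
    return sum(r * comb(i, k - 1) / total for i, r in enumerate(ordered))
```

With k = 1 both estimators reduce to a plain average over the pool, which is the risk-neutral quantity the abstract contrasts with Pass@k and Max@k.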
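[AI-154] GeHirNet: A Gender-Aware Hierarchical Model for Voice Pathology Classification
[Quick read]: This paper addresses two challenges for AI-based voice analysis in disease diagnosis: gender-related acoustic variation, which prevents existing classifiers from accurately identifying specific pathologies, and the scarcity of data for rare diseases, which degrades model performance. The key to its solution is a two-stage framework that first identifies gender-specific pathological patterns with ResNet-50 on Mel spectrograms and then performs gender-conditioned disease classification, while multi-scale resampling and time warping augmentation mitigate class imbalance. On a dataset merged from four public repositories, the method reaches 97.63% accuracy and a 95.25% Matthews correlation coefficient (MCC), a 5% MCC improvement over the single-stage baseline, and reduces gender bias.
Link: https://arxiv.org/abs/2508.01172
Authors: Fan Wu(1),Kaicheng Zhao(2),Elgar Fleisch(1 and 3),Filipe Barata(1) ((1) Centre for Digital Health Interventions, ETH Zurich, Zurich, Switzerland, (2) Institute of Mechanism Theory, Machine Dynamics and Robotics, RWTH Aachen University, Aachen, Germany, (3) Centre for Digital Health Interventions, University of St. Gallen, St. Gallen, Switzerland)
Institution: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: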
Abstract:AI-based voice analysis shows promise for disease diagnostics, but existing classifiers often fail to accurately identify specific pathologies because of gender-related acoustic variations and the scarcity of data for rare diseases. We propose a novel two-stage framework that first identifies gender-specific pathological patterns using ResNet-50 on Mel spectrograms, then performs gender-conditioned disease classification. We address class imbalance through multi-scale resampling and time warping augmentation. Evaluated on a merged dataset from four public repositories, our two-stage architecture with time warping achieves state-of-the-art performance (97.63% accuracy, 95.25% MCC), with a 5% MCC improvement over the single-stage baseline. This work advances voice pathology classification while reducing gender bias through hierarchical modeling of vocal characteristics.
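For readers unfamiliar with the MCC figure quoted above, here is a minimal sketch of the Matthews correlation coefficient in its binary form. The paper's classification task may well be multi-class, so this is a simplification for illustration; the function name is hypothetical.

```python
from math import sqrt


def matthews_corrcoef_binary(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary MCC from the four confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when a row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0
```

Unlike accuracy, the score stays near zero for a classifier that ignores a minority class, which is why it is often reported alongside accuracy on imbalanced data.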
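[AI-155] H2C: Hippocampal Circuit-inspired Continual Learning for Lifelong Trajectory Prediction in Autonomous Driving
[Quick read]: This paper addresses catastrophic forgetting in deep learning (DL) trajectory prediction: adapting to a new scenario distribution significantly degrades performance on previously learned ones, which limits real-world use in autonomous driving (AD) systems. The key to its solution is H2C, a continual learning method inspired by the hippocampal circuit. Two complementary strategies select a small subset of representative samples for memory replay: one maximizes inter-sample diversity to retain distinctive knowledge, and the other estimates overall knowledge by equiprobable sampling. A memory replay loss computed on these samples retains old knowledge while the model learns new data, reducing catastrophic forgetting by 22.71% on average without relying on manually informed distributional shifts.
Link: https://arxiv.org/abs/2508.01158
Authors: Yunlong Lin,Zirui Li,Guodong Du,Xiaocong Zhao,Cheng Gong,Xinwei Wang,Chao Lu,Jianwei Gong
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Open source code: this https URL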
Abstract:Deep learning (DL) has shown state-of-the-art performance in trajectory prediction, which is critical to safe navigation in autonomous driving (AD). However, most DL-based methods suffer from catastrophic forgetting, where adapting to a new distribution may cause significant performance degradation in previously learned ones. Such inability to retain learned knowledge limits their applicability in the real world, where AD systems need to operate across varying scenarios with dynamic distributions. As revealed by neuroscience, the hippocampal circuit plays a crucial role in memory replay, effectively reconstructing learned knowledge based on limited resources. Inspired by this, we propose a hippocampal circuit-inspired continual learning method (H2C) for trajectory prediction across varying scenarios. H2C retains prior knowledge by selectively recalling a small subset of learned samples. First, two complementary strategies are developed to select the subset to represent learned knowledge. Specifically, one strategy maximizes inter-sample diversity to represent the distinctive knowledge, and the other estimates the overall knowledge by equiprobable sampling. Then, H2C updates via a memory replay loss function calculated by these selected samples to retain knowledge while learning new data. Experiments based on various scenarios from the INTERACTION dataset are designed to evaluate H2C. Experimental results show that H2C reduces catastrophic forgetting of DL baselines by 22.71% on average in a task-free manner, without relying on manually informed distributional shifts. The implementation is available at this https URL.
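The two selection strategies in the abstract can be illustrated with a rough sketch. The abstract does not say how diversity is maximized, so greedy farthest-point selection over feature vectors is an assumption here, as are the function names; equiprobable sampling is shown as uniform sampling without replacement.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_diverse(samples, m):
    """Greedy farthest-point selection; returns indices of up to m samples."""
    if m <= 0 or not samples:
        return []
    chosen = [0]
    # Distance from every sample to its nearest already-chosen sample.
    min_d = [sq_dist(s, samples[0]) for s in samples]
    min_d[0] = -1.0  # mark chosen samples so they are never picked again
    while len(chosen) < min(m, len(samples)):
        nxt = max(range(len(samples)), key=min_d.__getitem__)
        chosen.append(nxt)
        min_d = [min(d, sq_dist(s, samples[nxt])) for d, s in zip(min_d, samples)]
        min_d[nxt] = -1.0
    return chosen


def select_uniform(samples, m, rng):
    """Equiprobable selection: every sample is equally likely to be kept."""
    return rng.sample(range(len(samples)), min(m, len(samples)))
```

A replay buffer built this way would combine both index sets; how H2C merges them and weights the replay loss is not specified in the abstract.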
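[AI-156] COLLAGE: Adaptive Fusion-based Retrieval for Augmented Policy Learning
[Quick read]: This paper addresses data retrieval for few-shot imitation learning: selecting, from a large dataset, the demonstrations most relevant to a target task in order to train a performant policy. Prior methods rely on a single-feature distance heuristic (visual, semantic, or motion similarity), which can pull in data from unrelated tasks or similar motions from tasks with divergent goals and thereby hurt performance. The key to its solution is COLLAGE, which uses an adaptive late fusion mechanism to guide data selection with a task-specific combination of multiple cues. It first pre-selects subsets using single features (e.g., appearance, shape, or language similarity), then weights each subset by how well a policy trained on it predicts actions in the target demonstrations, and finally uses these weights for importance sampling during policy training. The method is general and feature-agnostic: it can combine any number of retrieved subsets and identify which benefit the target task most.
Link: https://arxiv.org/abs/2508.01131
Authors: Sateesh Kumar,Shivin Dass,Georgios Pavlakos,Roberto Martín-Martín
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the Conference on Robot Learning (CoRL), 2025. Project page: this https URL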
Abstract:In this work, we study the problem of data retrieval for few-shot imitation learning: selecting data from a large dataset to train a performant policy for a specific task, given only a few target demonstrations. Prior methods retrieve data using a single-feature distance heuristic, assuming that the best demonstrations are those that most closely resemble the target examples in visual, semantic, or motion space. However, this approach captures only a subset of the relevant information and can introduce detrimental demonstrations, e.g., retrieving data from unrelated tasks due to similar scene layouts, or selecting similar motions from tasks with divergent goals. We present COLLAGE, a method for COLLective data AGgrEgation in few-shot imitation learning that uses an adaptive late fusion mechanism to guide the selection of relevant demonstrations based on a task-specific combination of multiple cues. COLLAGE follows a simple, flexible, and efficient recipe: it assigns weights to subsets of the dataset that are pre-selected using a single feature (e.g., appearance, shape, or language similarity), based on how well a policy trained on each subset predicts actions in the target demonstrations. These weights are then used to perform importance sampling during policy training, sampling data more densely or sparsely according to estimated relevance. COLLAGE is general and feature-agnostic, allowing it to combine any number of subsets selected by any retrieval heuristic, and to identify which subsets provide the greatest benefit for the target task. In extensive experiments, COLLAGE outperforms state-of-the-art retrieval and multi-task learning approaches by 5.1% in simulation across 10 tasks, and by 16.6% in the real world across 6 tasks, where we perform retrieval from the large-scale DROID dataset. More information at this https URL .
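The weighting and sampling recipe described in the abstract might look roughly like the sketch below. The abstract does not give the exact mapping from prediction quality to weights, so the softmax over negative losses, the temperature parameter, and all names are assumptions made for illustration.

```python
import math
import random


def subset_weights(target_losses, temperature=1.0):
    """Turn each subset's loss on the target demos into a normalized weight.

    Lower loss means the subset-trained policy predicts target actions better,
    so it receives a larger weight (assumed softmax mapping).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [-loss / temperature for loss in target_losses.values()]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return {name: e / total for name, e in zip(target_losses, exps)}


def sample_batch(subsets, weights, batch_size, rng):
    """Draw a training batch, picking subsets in proportion to their weights."""
    names = list(subsets)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return [rng.choice(subsets[name]) for name in picks]
```

A subset retrieved by a misleading cue ends up with a small weight and is sampled rarely rather than being discarded outright, which matches the "more densely or sparsely" wording in the abstract.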
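[AI-157] Human-Robot Red Teaming for Safety-Aware Reasoning
[Quick read]: This paper addresses how robots can perform tasks safely in high-risk settings, particularly when collaborating in human environments, where safety has received too little research attention. The central challenge is building the trust of human operators so that robots can be effective collaborators in safety-critical tasks. The key to its solution is the human-robot red teaming paradigm, in which humans and robots jointly challenge assumptions about an environment and explore the space of possible hazards, enabling safety-aware reasoning: hazard identification, risk assessment, risk mitigation, and safety reporting. The approach lets robots with different embodiments learn to operate under differing definitions of safety in two environments, a lunar habitat and a household, and is shown to support safe planning across a variety of domains.
Link: https://arxiv.org/abs/2508.01129
Authors: Emily Sheetz,Emma Zemler,Misha Savchenko,Connor Rainen,Erik Holum,Jodi Graf,Andrew Albright,Shaun Azimi,Benjamin Kuipers
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures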
Abstract:While much research explores improving robot capabilities, there is a deficit in researching how robots are expected to perform tasks safely, especially in high-risk problem domains. Robots must earn the trust of human operators in order to be effective collaborators in safety-critical tasks, specifically those where robots operate in human environments. We propose the human-robot red teaming paradigm for safety-aware reasoning. We expect humans and robots to work together to challenge assumptions about an environment and explore the space of hazards that may arise. This exploration will enable robots to perform safety-aware reasoning, specifically hazard identification, risk assessment, risk mitigation, and safety reporting. We demonstrate that: (a) human-robot red teaming allows human-robot teams to plan to perform tasks safely in a variety of domains, and (b) robots with different embodiments can learn to operate safely in two different environments – a lunar habitat and a household – with varying definitions of safety. Taken together, our work on human-robot red teaming for safety-aware reasoning demonstrates the feasibility of this approach for safely operating and promoting trust on human-robot teams in safety-critical problem domains.
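[AI-158] Towards Bridging Review Sparsity in Recommendation with Textual Edge Graph Representation
[Quick read]: This paper addresses the performance degradation that review sparsity causes in recommender systems: in practice users rarely leave textual reviews, so existing models lack fine-grained preference signals and explainability. The key to its solution is TWISTER (ToWards Imputation on Sparsity with Textual Edge Graph Representation), a unified framework that models user-item interactions as a Textual-Edge Graph (TEG) with reviews as edge attributes and jointly captures semantic and structural signals. It builds line-graph views to capture relational context and uses a large language model (LLM) as a graph-aware aggregator: for each interaction without a review, the model aggregates the natural-language representations of its neighborhood to generate a coherent, personalized review, which in turn improves recommendation performance.
Link: https://arxiv.org/abs/2508.01128
Authors: Leyao Wang,Xutao Mao,Xuhui Zhan,Yuying Zhao,Bo Ni,Ryan A. Rossi,Nesreen K. Ahmed,Tyler Derr
Institution: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages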
Abstract:Textual reviews enrich recommender systems with fine-grained preference signals and enhanced explainability. However, in real-world scenarios, users rarely leave reviews, resulting in severe sparsity that undermines the effectiveness of existing models. A natural solution is to impute or generate missing reviews to enrich the data. However, conventional imputation techniques – such as matrix completion and LLM-based augmentation – either lose contextualized semantics by embedding texts into vectors, or overlook structural dependencies among user-item interactions. To address these shortcomings, we propose TWISTER (ToWards Imputation on Sparsity with Textual Edge Graph Representation), a unified framework that imputes missing reviews by jointly modeling semantic and structural signals. Specifically, we represent user-item interactions as a Textual-Edge Graph (TEG), treating reviews as edge attributes. To capture relational context, we construct line-graph views and employ a large language model as a graph-aware aggregator. For each interaction lacking a textual review, our model aggregates the neighborhood’s natural-language representations to generate a coherent and personalized review. Experiments on the Amazon and Goodreads datasets show that TWISTER consistently outperforms traditional numeric, graph-based, and LLM baselines, delivering higher-quality imputed reviews and, more importantly, enhanced recommendation performance. In summary, TWISTER generates reviews that are more helpful, authentic, and specific, while smoothing structural signals for improved recommendations.
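[AI-159] Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?
[Quick read]: This paper addresses how to predict household wealth in Africa accurately from multimodal data (satellite imagery and Internet-sourced text), asking in particular whether socio-economic indicators leave recoverable imprints in the physical environment and in historical and economic narratives. Its solution is a multimodal prediction framework that compares five pipelines: a vision-only model, LLM-generated text conditioned only on location and year, web text retrieved and synthesized by an AI agent, a joint image-text encoder, and an ensemble of all signals. Fusing satellite imagery with agent or LLM text outperforms the vision-only baseline (R-squared of 0.77 vs. 0.63 on out-of-sample splits) and is more robust to out-of-country and out-of-time generalization. LLM-generated text outperforms agent-retrieved text, but adding agent-retrieved data still yields modest gains in some splits, suggesting it may carry representational structure not fully captured by static LLM knowledge. The moderate convergence between vision and language embeddings is consistent with a shared latent code of wealth.
Link: https://arxiv.org/abs/2508.01109
Authors: Satiyabooshan Murugaboopathy,Connor T. Jerzak,Adel Daoud
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 7 figures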
Abstract:We investigate whether socio-economic indicators like household wealth leave recoverable imprints in satellite imagery (capturing physical features) and Internet-sourced text (reflecting historical/economic narratives). Using Demographic and Health Survey (DHS) data from African neighborhoods, we pair Landsat images with LLM-generated textual descriptions conditioned on location/year and text retrieved by an AI search agent from web sources. We develop a multimodal framework predicting household wealth (International Wealth Index) through five pipelines: (i) vision model on satellite images, (ii) LLM using only location/year, (iii) AI agent searching/synthesizing web text, (iv) joint image-text encoder, (v) ensemble of all signals. Our framework yields three contributions. First, fusing vision and agent/LLM text outperforms vision-only baselines in wealth prediction (e.g., R-squared of 0.77 vs. 0.63 on out-of-sample splits), with LLM-internal knowledge proving more effective than agent-retrieved text, improving robustness to out-of-country and out-of-time generalization. Second, we find partial representational convergence: fused embeddings from vision/language modalities correlate moderately (median cosine similarity of 0.60 after alignment), suggesting a shared latent code of material well-being while retaining complementary details, consistent with the Platonic Representation Hypothesis. Although LLM-only text outperforms agent-retrieved data, challenging our Agent-Induced Novelty Hypothesis, modest gains from combining agent data in some splits weakly support the notion that agent-gathered information introduces unique representational structures not fully captured by static LLM knowledge. Third, we release a large-scale multimodal dataset comprising more than 60,000 DHS clusters linked to satellite images, LLM-generated descriptions, and agent-retrieved texts.
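[AI-160] Protecting Student Mental Health with a Context-Aware Machine Learning Framework for Stress Monitoring
[Quick read]: This paper addresses the limits of traditional assessments of student mental health (subjective surveys and periodic evaluations), which offer little support for timely intervention, with the aim of identifying student stress earlier. The key to its solution is a context-aware machine learning framework that integrates data on psychological, academic, environmental, and social factors. It follows a six-stage pipeline that includes preprocessing, feature selection (SelectKBest, RFECV), dimensionality reduction (PCA), and training of six base classifiers (SVM, Random Forest, Gradient Boosting, XGBoost, AdaBoost, Bagging), followed by ensemble strategies (hard voting, soft voting, weighted voting, and stacking). The best models reach 93.09% and 99.53% accuracy on the two survey datasets, surpassing previous benchmarks.
Link: https://arxiv.org/abs/2508.01105
Authors: Md Sultanul Islam Ovi,Jamal Hossain,Md Raihan Alam Rahi,Fatema Akter
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 6 pages, 3 figures, 3 tables, 1 algorithm. Conference paper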
Abstract:Student mental health is an increasing concern in academic institutions, where stress can severely impact well-being and academic performance. Traditional assessment methods rely on subjective surveys and periodic evaluations, offering limited value for timely intervention. This paper introduces a context-aware machine learning framework for classifying student stress using two complementary survey-based datasets covering psychological, academic, environmental, and social factors. The framework follows a six-stage pipeline involving preprocessing, feature selection (SelectKBest, RFECV), dimensionality reduction (PCA), and training with six base classifiers: SVM, Random Forest, Gradient Boosting, XGBoost, AdaBoost, and Bagging. To enhance performance, we implement ensemble strategies, including hard voting, soft voting, weighted voting, and stacking. Our best models achieve 93.09% accuracy with weighted hard voting on the Student Stress Factors dataset and 99.53% with stacking on the Stress and Well-being dataset, surpassing previous benchmarks. These results highlight the potential of context-integrated, data-driven systems for early stress detection and underscore their applicability in real-world academic settings to support student well-being.
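Weighted hard voting, the ensemble strategy behind the 93.09% result, can be sketched in a few lines of plain Python. This is a generic illustration rather than the authors' pipeline: the classifier weights and the tie-breaking rule (smallest label wins) are assumptions.

```python
from collections import defaultdict


def weighted_hard_vote(predictions, weights):
    """Combine label predictions from several classifiers.

    predictions: one list of predicted labels per classifier (same length each).
    weights: one non-negative weight per classifier.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    combined = []
    for j in range(len(predictions[0])):
        tally = defaultdict(float)
        for labels, weight in zip(predictions, weights):
            tally[labels[j]] += weight
        # Iterate labels in sorted order so ties resolve to the smallest label.
        combined.append(max(sorted(tally), key=tally.__getitem__))
    return combined
```

Soft voting differs in that it averages predicted class probabilities instead of counting labels, and stacking trains a further model on the base classifiers' outputs.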
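[AI-161] Multispin Physics of AI Tipping Points and Hallucinations
[Quick read]: This paper addresses hidden tipping in generative AI output: a response can shift mid-way from correct to wrong or misleading without the user noticing. According to the abstract, this reportedly caused $67 billion in losses and several deaths in 2024. The key to its solution is a mathematical mapping from generative AI to a multispin thermal system, which reveals a hidden tipping instability at the scale of a basic Attention head. The authors derive a simple but essentially exact formula for the tipping point that directly shows the impact of the user's prompt choice and the model's training bias, and they show that the tipping can be amplified by the model's multilayer architecture. The results offer a basis for improving AI transparency, explainability, and performance and open a path to quantifying users' AI risk and legal liabilities.
Link: https://arxiv.org/abs/2508.01097
Authors: Neil F. Johnson,Frank Yingjie Huo
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO); Computational Physics (physics.comp-ph)
Comments: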
Abstract:Output from generative AI such as ChatGPT can be repetitive and biased. But more worrying is that this output can mysteriously tip mid-response from good (correct) to bad (misleading or wrong) without the user noticing. In 2024 alone, this reportedly caused $67 billion in losses and several deaths. Establishing a mathematical mapping to a multispin thermal system, we reveal a hidden tipping instability at the scale of the AI’s ‘atom’ (basic Attention head). We derive a simple but essentially exact formula for this tipping point which shows directly the impact of a user’s prompt choice and the AI’s training bias. We then show how the output tipping can get amplified by the AI’s multilayer architecture. As well as helping improve AI transparency, explainability and performance, our results open a path to quantifying users’ AI risk and legal liabilities.
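[AI-162] Provably Secure Retrieval-Augmented Generation
[Quick read]: This paper addresses key privacy and security risks in Retrieval-Augmented Generation (RAG) systems, including data leakage and data poisoning, which existing defenses have not addressed systematically. Current methods rely mainly on heuristic filtering or on making the retriever more robust, and suffer from limited interpretability, a lack of formal security guarantees, and vulnerability to adaptive attacks. The paper proposes SAG, which it describes as the first provably secure framework for RAG. Its core is a pre-storage full-encryption scheme that protects both the retrieved content and the vector embeddings so that only authorized entities can access the data, together with formal proofs of confidentiality and integrity under a computational security model. Experiments indicate that the framework resists a range of state-of-the-art attacks.
Link: https://arxiv.org/abs/2508.01084
Authors: Pengcheng Zhou,Yinglun Feng,Zhongliang Yang
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: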
Abstract:Although Retrieval-Augmented Generation (RAG) systems have been widely applied, the privacy and security risks they face, such as data leakage and data poisoning, have not been systematically addressed yet. Existing defense strategies primarily rely on heuristic filtering or enhancing retriever robustness, which suffer from limited interpretability, lack of formal security guarantees, and vulnerability to adaptive attacks. To address these challenges, this paper proposes the first provably secure framework for RAG systems (SAG). Our framework employs a pre-storage full-encryption scheme to ensure dual protection of both retrieved content and vector embeddings, guaranteeing that only authorized entities can access the data. Through formal security proofs, we rigorously verify the scheme’s confidentiality and integrity under a computational security model. Extensive experiments across multiple benchmark datasets demonstrate that our framework effectively resists a range of state-of-the-art attacks. This work establishes a theoretical foundation and practical paradigm for verifiably secure RAG systems, advancing AI-powered services toward formally guaranteed security.
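[AI-163] Learning Pivoting Manipulation with Force and Vision Feedback Using Optimization-based Demonstrations
[Quick read]: This paper addresses trajectory planning for non-prehensile manipulation, which is hard because of complex contact interactions among objects, the environment, and the robot; model-based methods are additionally sensitive to model inaccuracies and depend on privileged information (e.g., object mass, size, pose), which makes them less suitable for novel objects. The key to its solution is to combine model-based and learning-based approaches: computationally efficient Contact-Implicit Trajectory Optimization (CITO) generates demonstrations that guide deep Reinforcement Learning (RL), giving sample-efficient learning. A sim-to-real transfer approach based on a privileged training strategy then lets the robot perform pivoting using only proprioception, vision, and force sensing, without access to privileged information.
Link: https://arxiv.org/abs/2508.01082
Authors: Yuki Shirai,Kei Ota,Devesh K. Jha,Diego Romeres
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: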
Abstract:Non-prehensile manipulation is challenging due to complex contact interactions between objects, the environment, and robots. Model-based approaches can efficiently generate complex trajectories of robots and objects under contact constraints. However, they tend to be sensitive to model inaccuracies and require access to privileged information (e.g., object mass, size, pose), making them less suitable for novel objects. In contrast, learning-based approaches are typically more robust to modeling errors but require large amounts of data. In this paper, we bridge these two approaches to propose a framework for learning closed-loop pivoting manipulation. By leveraging computationally efficient Contact-Implicit Trajectory Optimization (CITO), we design demonstration-guided deep Reinforcement Learning (RL), leading to sample-efficient learning. We also present a sim-to-real transfer approach using a privileged training strategy, enabling the robot to perform pivoting manipulation using only proprioception, vision, and force sensing without access to privileged information. Our method is evaluated on several pivoting tasks, demonstrating that it can successfully perform sim-to-real transfer.
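[AI-164] The Lattice Geometry of Neural Network Quantization – A Short Equivalence Proof of GPTQ and Babai's algorithm
[Quick read]: This paper addresses the quantization of a linear unit in a neural network, i.e., mapping floating-point weights to a low-precision representation (such as integers) to reduce storage and compute costs. The key to its solution is showing that data-driven quantization corresponds to the Closest Vector Problem (CVP) on a lattice generated by the input data, and proving that the GPTQ algorithm is equivalent to Babai's classical nearest-plane algorithm. This connection gives a geometric interpretation of existing quantization methods and hints that lattice basis reduction could improve quantization further.
Link: https://arxiv.org/abs/2508.01077
Authors: Johann Birnick
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures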
Abstract:We explain how data-driven quantization of a linear unit in a neural network corresponds to solving the closest vector problem for a certain lattice generated by input data. We prove that the GPTQ algorithm is equivalent to Babai’s well-known nearest-plane algorithm. We furthermore provide geometric intuition for both algorithms. Lastly, we note the consequences of these results, in particular hinting at the possibility for using lattice basis reduction for better quantization.
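As background for the equivalence claimed above, the following is a minimal sketch of the textbook Babai nearest-plane algorithm in plain Python. It is not GPTQ and not the paper's lattice construction; it only shows the classical procedure for a basis given as linearly independent rows, and the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gram_schmidt(basis):
    """Orthogonalize the basis rows in order, without normalizing them."""
    ortho = []
    for b in basis:
        v = [float(x) for x in b]
        for u in ortho:
            mu = dot(v, u) / dot(u, u)
            v = [vi - mu * ui for vi, ui in zip(v, u)]
        ortho.append(v)
    return ortho


def babai_nearest_plane(basis, target):
    """Return integer coefficients and the lattice point found for the target."""
    ortho = gram_schmidt(basis)
    residual = [float(t) for t in target]
    coeffs = [0] * len(basis)
    # Process basis vectors from last to first, rounding onto the nearest plane.
    for i in reversed(range(len(basis))):
        c = round(dot(residual, ortho[i]) / dot(ortho[i], ortho[i]))
        coeffs[i] = c
        residual = [r - c * b for r, b in zip(residual, basis[i])]
    point = [sum(c * b[j] for c, b in zip(coeffs, basis)) for j in range(len(target))]
    return coeffs, point
```

For an orthogonal basis this reduces to independent rounding of each coordinate; for a skewed basis the result depends on the basis, which is where the lattice basis reduction mentioned in the abstract could help.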
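[AI-165] gpuRDF2vec – Scalable GPU-based RDF2vec ISWC2025
[Quick read]: This paper addresses the efficiency bottleneck of generating Knowledge Graph (KG) embeddings at web scale. The key to its solution is gpuRDF2vec, an open source library that harnesses modern GPUs and supports multi-node execution to accelerate every stage of the RDF2vec pipeline. In a single-node setup, its walk-extraction phase alone outperforms pyRDF2vec, SparkKGML, and jRDF2vec by a substantial margin and scales well to longer random walks, which typically yield better embeddings. The word2vec training component builds on PyTorch Lightning, so practitioners and researchers can train KG embeddings on large graphs within practical time budgets.
Link: https://arxiv.org/abs/2508.01073
Authors: Martin Böckling,Heiko Paulheim
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, ISWC 2025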
Abstract:Generating Knowledge Graph (KG) embeddings at web scale remains challenging. Among existing techniques, RDF2vec combines effectiveness with strong scalability. We present gpuRDF2vec, an open source library that harnesses modern GPUs and supports multi-node execution to accelerate every stage of the RDF2vec pipeline. Extensive experiments on both synthetically generated graphs and real-world benchmarks show that gpuRDF2vec achieves up to a substantial speedup over the currently fastest alternative, i.e., jRDF2vec. In a single-node setup, our walk-extraction phase alone outperforms pyRDF2vec, SparkKGML, and jRDF2vec by a substantial margin using random walks on large/dense graphs, and scales very well to longer walks, which typically lead to better quality embeddings. Our implementation of gpuRDF2vec enables practitioners and researchers to train high-quality KG embeddings on large-scale graphs within practical time budgets and builds on top of PyTorch Lightning for the scalable word2vec implementation.
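The walk-extraction stage that the abstract refers to can be pictured with a small CPU-only sketch. This is not gpuRDF2vec's API or implementation; it only shows the general RDF2vec idea of turning random walks over (subject, predicate, object) triples into token sequences for word2vec. All names are hypothetical.

```python
import random
from collections import defaultdict


def build_adjacency(triples):
    """Index outgoing (predicate, object) edges by subject."""
    adjacency = defaultdict(list)
    for subject, predicate, obj in triples:
        adjacency[subject].append((predicate, obj))
    return adjacency


def random_walks(adjacency, start, depth, n_walks, rng):
    """Extract n_walks random walks of at most `depth` hops from `start`."""
    walks = []
    for _ in range(n_walks):
        walk = [start]
        node = start
        for _ in range(depth):
            edges = adjacency.get(node)
            if not edges:  # dead end: stop this walk early
                break
            predicate, node = rng.choice(edges)
            walk += [predicate, node]
        walks.append(walk)
    return walks
```

Each walk is then treated as a sentence of entity and predicate tokens; longer walks mean more tokens per entity, which is the cost that GPU acceleration is meant to absorb.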
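[AI-166] Expressive Power of Graph Transformers via Logic
[Quick read]: This paper addresses the theoretical limits of the expressive power of two graph learning architectures, graph transformers (GTs) and GPS-networks, under two numerical settings (real numbers and floats). The key to its solution is a logical characterization. With reals, and restricted to vertex properties definable in first-order logic (FO), GPS-networks have the same expressive power as graded modal logic (GML) with the global modality. With floats, GPS-networks are exactly as expressive as GML with the counting global modality; this result is absolute and does not depend on a background logic. GTs are characterized analogously by propositional logic with the global modality (reals) or the counting global modality (floats).
Link: https://arxiv.org/abs/2508.01067
Authors: Veeti Ahvonen,Maurice Funk,Damian Heiman,Antti Kuusisto,Carsten Lutz
Institution: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: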
Abstract:Transformers are the basis of modern large language models, but relatively little is known about their precise expressive power on graphs. We study the expressive power of graph transformers (GTs) by Dwivedi and Bresson (2020) and GPS-networks by Rampásek et al. (2022), both under soft-attention and average hard-attention. Our study covers two scenarios: the theoretical setting with real numbers and the more practical case with floats. With reals, we show that in restriction to vertex properties definable in first-order logic (FO), GPS-networks have the same expressive power as graded modal logic (GML) with the global modality. With floats, GPS-networks turn out to be equally expressive as GML with the counting global modality. The latter result is absolute, not restricting to properties definable in a background logic. We also obtain similar characterizations for GTs in terms of propositional logic with the global modality (for reals) and the counting global modality (for floats).
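[AI-167] Connectivity Management in Satellite-Aided Vehicular Networks with Multi-Head Attention-Based State Estimation
[Quick read]: This paper addresses connectivity management in integrated satellite-terrestrial vehicular networks for 6G, which is challenged by dynamic conditions and partial observability. Its solution is MAAC-SAM (Multi-Agent Actor-Critic with Satellite-Aided Multi-head self-attention), a multi-agent reinforcement learning framework. The key innovation is a multi-head attention mechanism that enables robust state estimation even when information sharing among vehicles is limited and fluctuating; self-imitation learning (SIL) and fingerprinting further improve learning efficiency and real-time decision-making. In simulation the framework improves transmission utility by up to 14% over baselines while maintaining high estimation accuracy.
Link: https://arxiv.org/abs/2508.01060
Authors: Ibrahim Althamary,Chen-Fu Chou,Chih-Wei Huang
Institution: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: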
Abstract:Managing connectivity in integrated satellite-terrestrial vehicular networks is critical for 6G, yet is challenged by dynamic conditions and partial observability. This letter introduces the Multi-Agent Actor-Critic with Satellite-Aided Multi-head self-attention (MAAC-SAM), a novel multi-agent reinforcement learning framework that enables vehicles to autonomously manage connectivity across Vehicle-to-Satellite (V2S), Vehicle-to-Infrastructure (V2I), and Vehicle-to-Vehicle (V2V) links. Our key innovation is the integration of a multi-head attention mechanism, which allows for robust state estimation even with fluctuating and limited information sharing among vehicles. The framework further leverages self-imitation learning (SIL) and fingerprinting to improve learning efficiency and real-time decisions. Simulation results, based on realistic SUMO traffic models and 3GPP-compliant configurations, demonstrate that MAAC-SAM outperforms state-of-the-art terrestrial and satellite-assisted baselines by up to 14% in transmission utility and maintains high estimation accuracy across varying vehicle densities and sharing levels.
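[AI-168] Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report
[Quick read]: This paper addresses the limited use of Large Language Models (LLMs) in cybersecurity, which the authors attribute to a lack of general-purpose cybersecurity data, representational complexity, and safety and regulatory concerns. The key to its solution is Foundation-Sec-8B-Instruct, a model trained for general-purpose cybersecurity dialogue. Built on the earlier Foundation-Sec-8B base model, it adds instruction-following, conversational capabilities, and alignment with human preferences to produce high-quality, relevant responses. Evaluations show that it outperforms Llama 3.1-8B-Instruct on a range of cybersecurity tasks while matching its instruction-following performance, and that it is competitive with GPT-4o-mini on cyber threat intelligence and instruction-following tasks.
Link: https://arxiv.org/abs/2508.01059
Authors: Sajana Weerawardhena,Paul Kassianik,Blaine Nelson,Baturay Saglam,Anu Vellore,Aman Priyanshu,Supriti Vijay,Massimo Aufiero,Arthur Goldblatt,Fraser Burch,Ed Li,Jianliang He,Dhruv Kedia,Kojin Oshiba,Zhouran Yang,Yaron Singer,Amin Karbasi
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 34 pages - Technical Report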
Abstract:Large language models (LLMs) have shown remarkable success across many domains, yet their integration into cybersecurity applications remains limited due to a lack of general-purpose cybersecurity data, representational complexity, and safety and regulatory concerns. To address this gap, we previously introduced Foundation-Sec-8B, a cybersecurity-focused LLM suitable for fine-tuning on downstream tasks. That model, however, was not designed for chat-style interactions or instruction-following. In this report, we release Foundation-Sec-8B-Instruct: a model specifically trained for general-purpose cybersecurity dialogue. Built on Foundation-Sec-8B, it combines domain-specific knowledge with instruction-following, conversational capabilities, and alignment with human preferences to produce high-quality, relevant responses. Comprehensive evaluations show that Foundation-Sec-8B-Instruct outperforms Llama 3.1-8B-Instruct on a range of cybersecurity tasks while matching its instruction-following performance. It is also competitive with GPT-4o-mini on cyber threat intelligence and instruction-following tasks. We envision Foundation-Sec-8B-Instruct becoming an indispensable assistant in the daily workflows of cybersecurity professionals. We release the model publicly at this https URL.
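[AI-169] REACT: A Real-Time Edge-AI Based V2X Framework for Accident Avoidance in Autonomous Driving System
[Quick read]: This paper addresses the limits of current transformer-based Vehicle-to-Everything (V2X) perception frameworks (limited generalization, shallow contextual reasoning, and reliance on mono-modal inputs) and the difficulty Vision-Language Models (VLMs) have in meeting real-time requirements in safety-critical settings. The key to its solution is REACT, a real-time, V2X-integrated trajectory optimization framework built on a fine-tuned lightweight VLM, with a set of specialized modules that turn multimodal inputs into optimized, risk-aware trajectories. Edge adaptation strategies reduce model complexity and accelerate inference, giving a 0.57-second inference latency on the Jetson AGX Orin.
Link: https://arxiv.org/abs/2508.01057
Authors: Fengze Yang,Bo Yu,Yang Zhou,Xuewen Luo,Zhengzhong Tu,Chenxi Liu
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 24 pages, 6 tables, 7 figures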
Abstract:Collisions caused by human error are the most common type of multi-vehicle crash, highlighting the critical need for autonomous driving (AD) systems to leverage cooperative perception through Vehicle-to-Everything (V2X) communication. This capability extends situational awareness beyond the limitations of onboard sensors. However, current transformer-based V2X frameworks suffer from limited generalization, shallow contextual reasoning, and reliance on mono-modal inputs. Vision-Language Models (VLMs) offer enhanced reasoning and multimodal integration but typically fall short of real-time performance requirements in safety-critical applications. This paper presents REACT, a real-time, V2X-integrated trajectory optimization framework built upon a fine-tuned lightweight VLM. REACT integrates a set of specialized modules that process multimodal inputs into optimized, risk-aware trajectories. To ensure real-time performance on edge devices, REACT incorporates edge adaptation strategies that reduce model complexity and accelerate inference. Evaluated on the DeepAccident benchmark, REACT achieves state-of-the-art performance, a 77% collision rate reduction, a 48.2% Video Panoptic Quality (VPQ), and a 0.57-second inference latency on the Jetson AGX Orin. Ablation studies validate the contribution of each input, module, and edge adaptation strategy. These results demonstrate the feasibility of lightweight VLMs for real-time edge-based cooperative planning and showcase the potential of language-guided contextual reasoning to improve safety and responsiveness in autonomous driving.
[AI-170] Managing Escalation in Off-the-Shelf Large Language Models
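[Quick read]: This paper addresses the escalatory tendencies that Large Language Models (LLMs) may show in national security settings: when used to simulate geopolitical or strategic scenarios, they often recommend aggressive actions. The key to the solution is two simple, non-technical interventions which, once introduced into an experimental wargame, substantially reduced escalation overall. The authors argue that LLM use in national security should not be restricted because of potential risks; instead, actionable measures should be taken to align the models with national strategic goals, including escalation management.
Link: https://arxiv.org/abs/2508.01056
Authors: Sebastian Elbaum,Jonathan Panther
Affiliation: unknown
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
Comments: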
Abstract:U.S. national security customers have begun to utilize large language models, including enterprise versions of "off-the-shelf" models (e.g., ChatGPT) familiar to the public. This uptake will likely accelerate. However, recent studies suggest that off-the-shelf large language models frequently suggest escalatory actions when prompted with geopolitical or strategic scenarios. We demonstrate two simple, non-technical interventions to control these tendencies. Introducing these interventions into the experimental wargame design of a recent study, we substantially reduce escalation throughout the game. Calls to restrict the use of large language models in national security applications are thus premature. The U.S. government is already, and will continue, employing large language models for scenario planning and suggesting courses of action. Rather than warning against such applications, this study acknowledges the imminent adoption of large language models, and provides actionable measures to align them with national security goals, including escalation management.
[AI-171] FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models
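[Quick read]: This paper addresses the fact that current applications of Large Language Models (LLMs) in chemistry largely overlook fine-grained functional group (FG) information, which limits the models' ability to understand and reason about structure-property relationships. Existing datasets focus on molecule-level property prediction and lack precise FG-level annotation and modeling, so LLMs struggle to capture how specific functional groups affect molecular properties and how they interact. The key to the solution is FGBench, a large-scale, high-precision dataset for FG-level molecular property reasoning. It contains 625K problems annotated with localized functional groups, covering three task types: single functional group impacts, multiple functional group interactions, and direct molecular comparisons. The dataset provides a basis for training more interpretable, structure-aware LLMs, and the benchmark reveals clear shortcomings of current mainstream LLMs in FG-level reasoning.
Link: https://arxiv.org/abs/2508.01055
Authors: Xuan Liu,Siru Ouyang,Xianrui Zhong,Jiawei Han,Huimin Zhao
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
Comments: 20 pages, 20 figures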
Abstract:Large language models (LLMs) have gained significant attention in chemistry. However, most existing datasets center on molecular-level property prediction and overlook the role of fine-grained functional group (FG) information. Incorporating FG-level data can provide valuable prior knowledge that links molecular structures with textual descriptions, which can be used to build more interpretable, structure-aware LLMs for reasoning on molecule-related tasks. Moreover, LLMs can learn from such fine-grained information to uncover hidden relationships between specific functional groups and molecular properties, thereby advancing molecular design and drug discovery. Here, we introduce FGBench, a dataset comprising 625K molecular property reasoning problems with functional group information. Functional groups are precisely annotated and localized within the molecule, which ensures the dataset’s interoperability, thereby facilitating further multimodal applications. FGBench includes both regression and classification tasks on 245 different functional groups across three categories for molecular property reasoning: (1) single functional group impacts, (2) multiple functional group interactions, and (3) direct molecular comparisons. In the benchmark of state-of-the-art LLMs on 7K curated data, the results indicate that current LLMs struggle with FG-level property reasoning, highlighting the need to enhance reasoning capabilities in LLMs for chemistry tasks. We anticipate that the methodology employed in FGBench to construct datasets with functional group-level information will serve as a foundational framework for generating new question-answer pairs, enabling LLMs to better understand fine-grained molecular structure-property relationships. The dataset and evaluation code are available at this https URL.
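[AI-172] Autonomous Penetration Testing: Solving Capture-the-Flag Challenges with LLMs
[Quick read]: This paper examines how Large Language Models (LLMs) can automate beginner-level penetration testing, by connecting GPT-4o to OverTheWire's Bandit capture-the-flag (CTF) game to test its autonomous capability. The key to the approach is a single-command SSH framework that lets the model interact directly with the target system and generate the corresponding commands. In experiments, GPT-4o performed well on single-step challenges involving Linux filesystem navigation, data extraction or decoding, and basic networking, reaching an 80% success rate, often producing the correct command in one shot and faster than a human. The failures expose current LLM limitations on multi-step tasks (such as maintaining a persistent working directory, complex network reconnaissance, and daemon creation), suggesting that architectural improvements are needed for robustness in more complex scenarios.
Link: https://arxiv.org/abs/2508.01054
Authors: Isabelle Bakker,John Hastings
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 6 pages, 2 figures, 3 tables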
Abstract:This study evaluates the ability of GPT-4o to autonomously solve beginner-level offensive security tasks by connecting the model to OverTheWire’s Bandit capture-the-flag game. Of the 25 levels that were technically compatible with a single-command SSH framework, GPT-4o solved 18 unaided and another two after minimal prompt hints for an overall 80% success rate. The model excelled at single-step challenges that involved Linux filesystem navigation, data extraction or decoding, and straightforward networking. The approach often produced the correct command in one shot and at a human-surpassing speed. Failures involved multi-command scenarios that required persistent working directories, complex network reconnaissance, daemon creation, or interaction with non-standard shells. These limitations highlight current architectural deficiencies rather than a lack of general exploit knowledge. The results demonstrate that large language models (LLMs) can automate a substantial portion of novice penetration-testing workflow, potentially lowering the expertise barrier for attackers and offering productivity gains for defenders who use LLMs as rapid reconnaissance aides. Further, the unsolved tasks reveal specific areas where secure-by-design environments might frustrate simple LLM-driven attacks, informing future hardening strategies. Beyond offensive cybersecurity applications, results suggest the potential to integrate LLMs into cybersecurity education as practice aids.
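The success rates quoted in the abstract can be reproduced with a short calculation. The counts come from the abstract; the variable names are illustrative only.

```python
# Counts taken from the abstract above.
compatible_levels = 25   # levels compatible with the single-command SSH framework
solved_unaided = 18      # solved without hints
solved_with_hints = 2    # solved after minimal prompt hints

unaided_rate = solved_unaided / compatible_levels
overall_rate = (solved_unaided + solved_with_hints) / compatible_levels

print(f"unaided: {unaided_rate:.0%}, overall: {overall_rate:.0%}")
# prints "unaided: 72%, overall: 80%"
```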
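[AI-173] A Deep Reinforcement Learning-Based TCP Congestion Control Algorithm: Design, Simulation, and Evaluation
[Quick read]: This paper addresses the difficulty traditional TCP congestion control algorithms (such as TCP New Reno) have in balancing throughput and latency in dynamic network environments. The key to the solution is a new congestion control mechanism based on Deep Reinforcement Learning: Deep Q-Networks observe key network parameters in real time and adjust the congestion window (cWnd) accordingly. The method is trained and validated in the NS-3 network simulator through the OpenGym interface, and is reported to improve adaptability and overall efficiency.
Link: https://arxiv.org/abs/2508.01047
Authors: Efe Ağlamazlar,Emirhan Eken,Harun Batur Geçici
Affiliation: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: This paper presents a novel TCP congestion control algorithm based on Deep Reinforcement Learning. The study includes 5 figures and 8 pages of content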
Abstract:This paper presents a novel TCP congestion control algorithm based on Deep Reinforcement Learning. The proposed approach utilizes Deep Q-Networks to optimize the congestion window (cWnd) by observing key network parameters and taking real-time actions. The algorithm is trained and evaluated within the NS-3 network simulator using the OpenGym interface. The results demonstrate significant improvements over traditional TCP New Reno in terms of latency and throughput, with better adaptability to changing network conditions. This study emphasizes the potential of reinforcement learning techniques for solving complex congestion control problems in modern networks.
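To make the idea of learning cWnd adjustments concrete, here is a minimal sketch of the underlying Q-learning update. It is not the paper's method: it uses a plain lookup table where a Deep Q-Network would use a neural network, and the action set, state labels, reward, and hyperparameters are all hypothetical. No NS-3 or OpenGym calls are involved.

```python
from collections import defaultdict

# Hypothetical action set: change cWnd by this many segments.
ACTIONS = (-1, 0, 1)


def new_q_table():
    """Q-table mapping a state label to one value per action (all start at 0)."""
    return defaultdict(lambda: [0.0] * len(ACTIONS))


def apply_action(cwnd, action_index, min_cwnd=1):
    """Return the congestion window after applying the chosen action."""
    return max(min_cwnd, cwnd + ACTIONS[action_index])


def greedy_action(q_table, state):
    """Index of the action with the highest Q-value (ties go to the lowest index)."""
    values = q_table[state]
    return max(range(len(ACTIONS)), key=lambda i: values[i])


def q_update(q_table, state, action_index, reward, next_state, alpha=0.5, gamma=0.9):
    """One temporal-difference step toward reward + gamma * max_a Q(next_state, a)."""
    target = reward + gamma * max(q_table[next_state])
    q_table[state][action_index] += alpha * (target - q_table[state][action_index])


# Usage: reward growing the window while the (made-up) state is "low_rtt".
q = new_q_table()
q_update(q, "low_rtt", 2, reward=1.0, next_state="low_rtt")
cwnd = apply_action(10, greedy_action(q, "low_rtt"))
```

A DQN replaces the table with a network trained on the same target, which lets it generalize across continuous observations such as RTT or throughput.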
[AI-174] On Some Tunable Multi-fidelity Bayesian Optimization Frameworks
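[Quick read]: This paper addresses how multi-fidelity optimization can use information from different fidelity levels efficiently, reducing reliance on high-fidelity objective function evaluations. The core challenge is a strategy that maintains convergence efficiency while precisely controlling the number of high-fidelity evaluations. The key to the solution is a proximity-based acquisition function strategy, which simplifies fidelity selection through a single acquisition mechanism and removes the need to define a separate acquisition function at each fidelity level. In addition, multi-fidelity Gaussian Processes (GPs) are combined with the Upper Confidence Bound (UCB) strategy, giving a better balance of exploration and exploitation and more stable, controllable optimization.
Link: https://arxiv.org/abs/2508.01013
Authors: Arjun Manoj,Anastasia S. Georgiou,Dimitris G. Giovanis,Themistoklis P. Sapsis,Ioannis G. Kevrekidis
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: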
Abstract:Multi-fidelity optimization employs surrogate models that integrate information from varying levels of fidelity to guide efficient exploration of complex design spaces while minimizing the reliance on (expensive) high-fidelity objective function evaluations. To advance Gaussian Process (GP)-based multi-fidelity optimization, we implement a proximity-based acquisition strategy that simplifies fidelity selection by eliminating the need for separate acquisition functions at each fidelity level. We also enable multi-fidelity Upper Confidence Bound (UCB) strategies by combining them with multi-fidelity GPs rather than the standard GPs typically used. We benchmark these approaches alongside other multi-fidelity acquisition strategies (including fidelity-weighted approaches) comparing their performance, reliance on high-fidelity evaluations, and hyperparameter tunability in representative optimization tasks. The results highlight the capability of the proximity-based multi-fidelity acquisition function to deliver consistent control over high-fidelity usage while maintaining convergence efficiency. Our illustrative examples include multi-fidelity chemical kinetic models, both homogeneous and heterogeneous (dynamic catalysis for ammonia production).
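For readers unfamiliar with UCB acquisition, the sketch below shows the standard single-fidelity rule: score each candidate by posterior mean plus a multiple of posterior standard deviation, then pick the maximum. It does not implement the paper's proximity-based or multi-fidelity variants; the function names, the `beta` value, and the toy numbers are assumptions, and the means and standard deviations would normally come from a fitted GP.

```python
def ucb_scores(means, stds, beta=2.0):
    """Standard UCB score per candidate: posterior mean + beta * posterior std."""
    return [m + beta * s for m, s in zip(means, stds)]


def select_next(candidates, means, stds, beta=2.0):
    """Return the candidate with the highest UCB score (maximization convention)."""
    scores = ucb_scores(means, stds, beta)
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]


# Toy posterior over three candidate inputs (made-up numbers).
candidates = [0.0, 0.5, 1.0]
means = [1.0, 0.8, 0.2]
stds = [0.0, 0.3, 0.1]

explore = select_next(candidates, means, stds, beta=2.0)  # favors the uncertain point
exploit = select_next(candidates, means, stds, beta=0.0)  # pure exploitation
```

A larger `beta` shifts the choice toward uncertain regions, which is the exploration and exploitation trade-off the summary refers to.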
[AI-175] AutoEDA: Enabling EDA Flow Automation through Microservice-Based LLM Agents
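[Quick read]: This paper addresses problems in modern Electronic Design Automation (EDA) flows, especially the RTL-to-GDSII flow: heavy reliance on manual scripting, complex tool-to-tool interactions, and a lack of standardization, all of which limit scalability and efficiency. The key to the solution is the AutoEDA framework, which uses the Model Context Protocol (MCP) for standardized natural language interaction across tools and structured prompt engineering to reduce the need for expensive fine-tuning. It also integrates intelligent parameter extraction and task decomposition, and introduces an extended CodeBLEU metric to evaluate TCL script quality, improving automation accuracy, execution efficiency, and script quality across several benchmarks.
Link: https://arxiv.org/abs/2508.01012
Authors: Yiyi Lu,Hoi Ian Au,Junyao Zhang,Jingyu Pan,Yiting Wang,Ang Li,Jianyi Zhang,Yiran Chen
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: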
Abstract:Modern Electronic Design Automation (EDA) workflows, especially the RTL-to-GDSII flow, require heavy manual scripting and involve a multitude of tool-specific interactions, which limits scalability and efficiency. While LLMs have made strides toward automation, existing LLM solutions require expensive fine-tuning and lack standardized frameworks for integration and evaluation. We introduce AutoEDA, a framework for EDA automation that leverages paralleled learning through the Model Context Protocol (MCP) to provide a standardized and scalable natural language experience across the entire RTL-to-GDSII flow. AutoEDA limits fine-tuning through structured prompt engineering, implements intelligent parameter extraction and task decomposition, and provides an extended CodeBLEU metric to evaluate the quality of TCL scripts. Results from experiments over five previously curated benchmarks show improvements in automation accuracy and efficiency, as well as script quality, when compared to existing methods. AutoEDA is released as open source to support reproducibility and the EDA community. Available at: this https URL
[AI-176] v-PuNNs: van der Put Neural Networks for Transparent Ultrametric Representation Learning
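[Quick read]: This paper addresses the difficulty conventional deep learning models, which embed data in Euclidean space, have in representing strictly hierarchical data such as taxonomies, word-sense hierarchies, or file systems. The core of the solution is van der Put Neural Networks (v-PuNNs), in which neurons are characteristic functions of p-adic balls and the weights are themselves p-adic numbers, satisfying the Transparent Ultrametric Representation Learning (TURL) principle and giving exact subtree semantics. A Finite Hierarchical Approximation Theorem shows that a depth-K v-PuNN can represent any K-level tree. Valuation-Adaptive Perturbation Optimization (VAPO) is introduced to handle vanishing gradients in this discrete space, and the method reaches state-of-the-art results on several benchmarks.
Link: https://arxiv.org/abs/2508.01010
Authors: Gnankan Landry Regis N’guessan
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: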
Abstract:Conventional deep learning models embed data in Euclidean space $\mathbb{R}^d$, a poor fit for strictly hierarchical objects such as taxa, word senses, or file systems. We introduce van der Put Neural Networks (v-PuNNs), the first architecture whose neurons are characteristic functions of p-adic balls in $\mathbb{Z}_p$. Under our Transparent Ultrametric Representation Learning (TURL) principle every weight is itself a p-adic number, giving exact subtree semantics. A new Finite Hierarchical Approximation Theorem shows that a depth-K v-PuNN with $\sum_{j=0}^{K-1} p^{\,j}$ neurons universally represents any K-level tree. Because gradients vanish in this discrete space, we propose Valuation-Adaptive Perturbation Optimization (VAPO), with a fast deterministic variant (HiPaN-DS) and a moment-based one (HiPaN / Adam-VAPO). On three canonical benchmarks our CPU-only implementation sets new state-of-the-art: WordNet nouns (52,427 leaves) 99.96% leaf accuracy in 16 min; GO molecular-function 96.9% leaf / 100% root in 50 s; NCBI Mammalia Spearman $\rho = -0.96$ with true taxonomic distance. The learned metric is perfectly ultrametric (zero triangle violations), and its fractal and information-theoretic properties are analyzed. Beyond classification we derive structural invariants for quantum systems (HiPaQ) and controllable generative codes for tabular data (Tab-HiPaN). v-PuNNs therefore bridge number theory and deep learning, offering exact, interpretable, and efficient models for hierarchical data.
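The p-adic notions in the abstract can be illustrated on ordinary integers. The sketch below is not the v-PuNN architecture. It only shows the standard p-adic valuation and distance, the characteristic function of a ball (a residue class modulo a power of p), and the neuron count from the abstract's sum, which has the closed form (p^K - 1)/(p - 1). All function names are illustrative.

```python
def valuation(n, p):
    """p-adic valuation: exponent of the largest power of p dividing n (inf for 0)."""
    if n == 0:
        return float("inf")
    v = 0
    while n % p == 0:
        n //= p
        v += 1
    return v


def padic_distance(x, y, p):
    """p-adic distance |x - y|_p = p ** (-valuation(x - y)); 0 when x == y."""
    if x == y:
        return 0.0
    return p ** (-valuation(x - y, p))


def in_ball(x, center, level, p):
    """Characteristic function of the ball {x : x = center (mod p ** level)}."""
    return 1 if (x - center) % (p ** level) == 0 else 0


def neuron_count(p, depth):
    """Number of neurons in the abstract's formula: sum of p ** j for j < depth."""
    return sum(p ** j for j in range(depth))


# The distance is ultrametric: d(x, z) <= max(d(x, y), d(y, z)) for all triples.
violations = sum(
    padic_distance(x, z, 3) > max(padic_distance(x, y, 3), padic_distance(y, z, 3))
    for x in range(15) for y in range(15) for z in range(15)
)
```

Balls at level j partition the integers into p^j residue classes, each nested inside a class at level j - 1, which is why they line up with the nodes of a p-ary tree.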
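[AI-177] Generative AI Adoption in Postsecondary Education, AI Hype, and ChatGPT's Launch
[Quick read]: This paper addresses the lack of systematic scrutiny of the multifaceted influence of generative AI as it rapidly enters higher education, focusing on how OpenAI's ChatGPT changed writing, teaching, and learning practices in academic settings during the first six months after its release. The analysis follows three paths: first, it uses AI hype in mainstream discourse to show how ChatGPT was construed as a transformative event; second, it applies critical discourse analysis and critical AI studies to examine the perceived implications of generative AI for educational practice; third, it calls for best practices for the responsible adoption of generative AI in education.
Link: https://arxiv.org/abs/2508.01003
Authors: Isabel Pedersen
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 19 pages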
Abstract:The rapid integration of generative artificial intelligence (AI) into postsecondary education and many other sectors resulted in a global reckoning with this new technology. This paper contributes to the study of the multifaceted influence of generative AI, with a particular focus on OpenAI’s ChatGPT within academic settings during the first six months after the release in three specific ways. First, it scrutinizes the rise of ChatGPT as a transformative event construed through a study of mainstream discourses exhibiting AI hype. Second, it discusses the perceived implications of generative AI for writing, teaching, and learning through the lens of critical discourse analysis and critical AI studies. Third, it encourages the necessity for best practices in the adoption of generative AI technologies in education.
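[AI-178] Are LLM-Powered Social Media Bots Realistic?
[Quick read]: This paper asks how Large Language Models (LLMs) can be used to generate realistic social media bot networks, and how those networks differ from real humans and wild bots in network structure and linguistic features. The approach combines manual effort, network science, and LLMs to build synthetic bot personas, their tweets, and their interactions, thereby simulating social media networks. Comparison against empirical data shows that LLM-powered bots differ from both humans and wild bots in network and linguistic properties, which has implications for detecting and assessing this new kind of AI-driven bot.
Link: https://arxiv.org/abs/2508.00998
Authors: Lynnette Hui Xian Ng,Kathleen M. Carley
Affiliation: unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: Accepted into SBP-BRiMS 2025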
Abstract:As Large Language Models (LLMs) become more sophisticated, there is a possibility to harness LLMs to power social media bots. This work investigates the realism of generating LLM-Powered social media bot networks. Through a combination of manual effort, network science and LLMs, we create synthetic bot agent personas, their tweets and their interactions, thereby simulating social media networks. We compare the generated networks against empirical bot/human data, observing that both network and linguistic properties of LLM-Powered Bots differ from Wild Bots/Humans. This has implications towards the detection and effectiveness of LLM-Powered Bots.
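[AI-179] Generative AI as a Geopolitical Factor in Industry 5.0: Sovereignty, Access, and Control
[Quick read]: This paper asks how, in the context of Industry 5.0, Generative AI (GenAI) and autonomous systems are reshaping the global geopolitical landscape and raising governance challenges around human control, dual-use risks, and accountability, which in turn affect defense strategies, industrial competitiveness, and supply chain resilience. The key to the solution is a comprehensive framework that combines technological, economic, and ethical perspectives to balance national autonomy with international coordination, safeguard human-centric values in an AI-driven world, and respond to emerging geopolitical realities such as the weaponization of export controls and the rise of data sovereignty.
Link: https://arxiv.org/abs/2508.00973
Authors: Azmine Toushik Wasi,Enjamamul Haque Eram,Sabrina Afroz Mitu,Md Manjurul Ahsan
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: In Review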
Abstract:Industry 5.0 marks a new phase in industrial evolution, emphasizing human-centricity, sustainability, and resilience through the integration of advanced technologies. Within this evolving landscape, Generative AI (GenAI) and autonomous systems are not only transforming industrial processes but also emerging as pivotal geopolitical instruments. We examine strategic implications of GenAI in Industry 5.0, arguing that these technologies have become national assets central to sovereignty, access, and global influence. As countries compete for AI supremacy, growing disparities in talent, computational infrastructure, and data access are reshaping global power hierarchies and accelerating the fragmentation of the digital economy. The human-centric ethos of Industry 5.0, anchored in collaboration between humans and intelligent systems, increasingly conflicts with the autonomy and opacity of GenAI, raising urgent governance challenges related to meaningful human control, dual-use risks, and accountability. We analyze how these dynamics influence defense strategies, industrial competitiveness, and supply chain resilience, including the geopolitical weaponization of export controls and the rise of data sovereignty. Our contribution synthesizes technological, economic, and ethical perspectives to propose a comprehensive framework for navigating the intersection of GenAI and geopolitics. We call for governance models that balance national autonomy with international coordination while safeguarding human-centric values in an increasingly AI-driven world.
[AI-180] AI-Educational Development Loop (AI-EDL): A Conceptual Framework to Bridge AI Capabilities with Classical Educational Theories
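[Quick read]: This paper addresses slow feedback, limited personalization, and the difficulty of building students' self-regulated learning in traditional education, particularly in writing tasks, where instructor feedback is often delayed and cannot reach every student. The core of the solution is the AI-Educational Development Loop (AI-EDL), a framework that combines classical learning theories with human-in-the-loop AI and supports reflective learning and continuous improvement through transparent, iterative AI-generated feedback. The key idea is to use AI for immediate, specific, growth-oriented feedback while keeping pedagogical oversight with the instructor, so the approach can scale without giving up educational quality. The pilot study reports statistically significant improvement between first and second attempts and positive student perceptions of AI feedback.
Link: https://arxiv.org/abs/2508.00970
Authors: Ning Yu,Jie Zhang,Sandeep Mitra,Rebecca Smith,Adam Rich
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: This work has been submitted to Journal of Educational Technology Systems. It is under review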
Abstract:This study introduces the AI-Educational Development Loop (AI-EDL), a theory-driven framework that integrates classical learning theories with human-in-the-loop artificial intelligence (AI) to support reflective, iterative learning. Implemented in EduAlly, an AI-assisted platform for writing-intensive and feedback-sensitive tasks, the framework emphasizes transparency, self-regulated learning, and pedagogical oversight. A mixed-methods study was piloted at a comprehensive public university to evaluate alignment between AI-generated feedback, instructor evaluations, and student self-assessments; the impact of iterative revision on performance; and student perceptions of AI feedback. Quantitative results demonstrated statistically significant improvement between first and second attempts, with agreement between student self-evaluations and final instructor grades. Qualitative findings indicated students valued immediacy, specificity, and opportunities for growth that AI feedback provided. These findings validate the potential to enhance student learning outcomes through developmentally grounded, ethically aligned, and scalable AI feedback systems. The study concludes with implications for future interdisciplinary applications and refinement of AI-supported educational technologies.
[AI-181] Masked Omics Modeling for Multimodal Representation Learning across Histopathology and Molecular Profiles
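[Quick read]: This paper addresses the lack of a unified modeling framework for pathology images (such as HE-stained slides) and high-dimensional omics data (such as transcriptomics, methylomics, and genomics), which limits understanding of the molecular characteristics and clinical outcomes of cancer. The core of the solution is MORPHEUS, a unified Transformer-based pre-training framework that applies a masked modeling objective to randomly selected omics portions, encouraging the model to learn biologically meaningful cross-modal relationships and mapping histopathology and multi-omics data into a shared latent space. The method adapts to different combinations of input modalities and supports any-to-any omics generation. Pre-trained on a pan-cancer cohort, it outperforms state-of-the-art methods across tasks and modality combinations.
Link: https://arxiv.org/abs/2508.00969
Authors: Lucas Robinet,Ahmad Berjaoui,Elizabeth Cohen-Jonathan Moyal
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: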
Abstract:Self-supervised learning has driven major advances in computational pathology by enabling models to learn rich representations from hematoxylin and eosin (HE)-stained cancer tissue. However, histopathology alone often falls short for molecular characterization and understanding clinical outcomes, as important information is contained in high-dimensional omics profiles like transcriptomics, methylomics, or genomics. In this work, we introduce MORPHEUS, a unified transformer-based pre-training framework that encodes both histopathology and multi-omics data into a shared latent space. At its core, MORPHEUS relies on a masked modeling objective applied to randomly selected omics portions, encouraging the model to learn biologically meaningful cross-modal relationships. The same pre-trained network can be applied to histopathology alone or in combination with any subset of omics modalities, seamlessly adapting to the available inputs. Additionally, MORPHEUS enables any-to-any omics generation, enabling one or more omics profiles to be inferred from any subset of modalities, including HE alone. Pre-trained on a large pan-cancer cohort, MORPHEUS consistently outperforms state-of-the-art methods across diverse modality combinations and tasks, positioning itself as a promising framework for developing multimodal foundation models in oncology. The code is available at: this https URL
[AI-182] Cooperative Perception: A Resource-Efficient Framework for Multi-Drone 3D Scene Reconstruction Using Federated Diffusion and NeRF NEURIPS2024
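[Quick read]: This paper addresses limited compute, low-bandwidth communication, and insufficient real-time performance in drone swarm perception and scene reconstruction, while preserving privacy and scalability. The key to the solution is a framework that combines federated learning of a shared diffusion model with lightweight semantic extraction (YOLOv12) and local NeRF (Neural Radiance Fields) updates, and uses semantic-aware compression protocols to improve multi-agent cooperative scene understanding for efficient, privacy-preserving, scalable 3D/4D scene synthesis.
Link: https://arxiv.org/abs/2508.00967
Authors: Massoud Pourmandi
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 15 pages, 3 figures, 1 table, 1 algorithm. Preprint based on NeurIPS 2024 template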
Abstract:The proposal introduces an innovative drone swarm perception system that targets computational limitations, low-bandwidth communication, and real-time scene reconstruction. The framework enables efficient multi-agent 3D/4D scene synthesis through federated learning of a shared diffusion model, lightweight semantic extraction with YOLOv12, and local NeRF updates, while maintaining privacy and scalability. The framework redesigns generative diffusion models for joint scene reconstruction and improves cooperative scene understanding, while adding semantic-aware compression protocols. The approach can be validated through simulations and potential real-world deployment on drone testbeds, positioning it as a disruptive advancement in multi-agent AI for autonomous systems.
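[AI-183] VAULT: Vigilant Adversarial Updates via LLM-Driven Retrieval-Augmented Generation for NLI
[Quick read]: This paper addresses the weak robustness of Natural Language Inference (NLI) models to adversarial examples: models do well on standard test sets but degrade markedly on carefully constructed adversarial inputs. The key to the solution is VAULT, a fully automated adversarial Retrieval-Augmented Generation (RAG) pipeline with three stages. First, contexts are retrieved by balanced sampling over both semantic (BGE) and lexical (BM25) similarity. Second, an LLM generates adversarial hypotheses, which an LLM ensemble validates for label fidelity. Third, the validated adversarial examples are injected into the training set at increasing mixing ratios to iteratively retrain a zero-shot RoBERTa-base model. The method improves accuracy on SNLI, ANLI, and MultiNLI without human intervention.
Link: https://arxiv.org/abs/2508.00965
Authors: Roie Kazoom,Ofir Cohen,Rami Puzis,Asaf Shabtai,Ofer Hadar
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: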
Abstract:We introduce VAULT, a fully automated adversarial RAG pipeline that systematically uncovers and remedies weaknesses in NLI models through three stages: retrieval, adversarial generation, and iterative retraining. First, we perform balanced few-shot retrieval by embedding premises with both semantic (BGE) and lexical (BM25) similarity. Next, we assemble these contexts into LLM prompts to generate adversarial hypotheses, which are then validated by an LLM ensemble for label fidelity. Finally, the validated adversarial examples are injected back into the training set at increasing mixing ratios, progressively fortifying a zero-shot RoBERTa-base model. On standard benchmarks, VAULT elevates RoBERTa-base accuracy from 88.48% to 92.60% on SNLI (+4.12%), from 75.04% to 80.95% on ANLI (+5.91%), and from 54.67% to 71.99% on MultiNLI (+17.32%). It also consistently outperforms prior in-context adversarial methods by up to 2.0% across datasets. By automating high-quality adversarial data curation at scale, VAULT enables rapid, human-independent robustness improvements in NLI inference tasks.
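The third stage, injecting validated adversarial examples at increasing mixing ratios, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the abstract does not define the ratio, so here it is assumed to mean adversarial examples added per original example, and the function names and ratio values are hypothetical. Retraining itself is left out.

```python
def mix_training_set(original, adversarial, ratio):
    """Return the original examples plus up to ratio * len(original) adversarial ones.

    Assumption: the mixing ratio is measured relative to the original set size.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    k = min(len(adversarial), int(ratio * len(original)))
    return list(original) + list(adversarial[:k])


def mixing_schedule(original, adversarial, ratios):
    """Yield (ratio, mixed training set) for each ratio, in the order given."""
    for ratio in ratios:
        yield ratio, mix_training_set(original, adversarial, ratio)


# Toy data: ten original examples and four validated adversarial ones.
original = list(range(10))
adversarial = ["adv_a", "adv_b", "adv_c", "adv_d"]
sizes = [len(mixed) for _, mixed in mixing_schedule(original, adversarial, [0.0, 0.2, 0.5])]
```

In a full pipeline each mixed set would be used to retrain the model before moving on to the next ratio.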
[AI-184] Rethinking Multimodality: Optimizing Multimodal Deep Learning for Biomedical Signal Classification
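[Quick read]: This paper addresses a key question in multimodal deep learning for biomedical signal classification: not every modality fusion improves performance, and the common assumption that more modalities always mean higher accuracy may be mistaken. The study finds that redundancy between modalities can lead to diminishing returns or even worse performance, so multimodal design should be reconsidered in terms of information complementarity. The key to the solution is the "Complementary Feature Domains" theoretical framework, which quantifies the information-theoretic complementarity between feature domains to identify combinations with real synergy rather than stacking modalities blindly. Empirically, fusing only the time and time-frequency domains (Hybrid 1) outperforms the unimodal baseline, while adding the frequency domain (Hybrid 2) gives no further gain, indicating that the best fusion depends on the quality of complementarity, not the number of modalities.
Link: https://arxiv.org/abs/2508.00963
Authors: Timothy Oladunni,Alex Wong
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: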
Abstract:This study proposes a novel perspective on multimodal deep learning for biomedical signal classification, systematically analyzing how complementary feature domains impact model performance. While fusing multiple domains often presumes enhanced accuracy, this work demonstrates that adding modalities can yield diminishing returns, as not all fusions are inherently advantageous. To validate this, five deep learning models were designed, developed, and rigorously evaluated: three unimodal (1D-CNN for time, 2D-CNN for time-frequency, and 1D-CNN-Transformer for frequency) and two multimodal (Hybrid 1, which fuses 1D-CNN and 2D-CNN; Hybrid 2, which combines 1D-CNN, 2D-CNN, and a Transformer). For ECG classification, bootstrapping and Bayesian inference revealed that Hybrid 1 consistently outperformed the 2D-CNN baseline across all metrics (p-values < 0.05, Bayesian probabilities > 0.90), confirming the synergistic complementarity of the time and time-frequency domains. Conversely, Hybrid 2’s inclusion of the frequency domain offered no further improvement and sometimes a marginal decline, indicating representational redundancy; a phenomenon further substantiated by a targeted ablation study. This research redefines a fundamental principle of multimodal design in biomedical signal analysis. We demonstrate that optimal domain fusion isn’t about the number of modalities, but the quality of their inherent complementarity. This paradigm-shifting concept moves beyond purely heuristic feature selection. Our novel theoretical contribution, “Complementary Feature Domains in Multimodal ECG Deep Learning,” presents a mathematically quantifiable framework for identifying ideal domain combinations, demonstrating that optimal multimodal performance arises from the intrinsic information-theoretic complementarity among fused domains.
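The abstract compares models with bootstrapping. A minimal sketch of one common form, a paired bootstrap over per-sample correctness, is shown below. It is an assumption about the general technique, not the authors' evaluation code, and the returned frequency is a bootstrap estimate rather than the Bayesian probability the abstract mentions. The function name and toy inputs are hypothetical.

```python
import random


def bootstrap_win_probability(correct_a, correct_b, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which model A's accuracy exceeds model B's.

    correct_a and correct_b hold 1 (correct) or 0 (wrong) for the same test samples.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same samples")
    rng = random.Random(seed)
    n = len(correct_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample test indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        if acc_a > acc_b:
            wins += 1
    return wins / n_resamples


# Degenerate toy cases with known answers.
always_better = bootstrap_win_probability([1] * 20, [0] * 20)
identical = bootstrap_win_probability([1, 0, 1, 1], [1, 0, 1, 1])
```

Resampling the same indices for both models keeps the comparison paired, so differences in test-sample difficulty cancel out.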
[AI-185] FinKario: Event-Enhanced Automated Construction of Financial Knowledge Graph
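[Quick read]: This paper addresses the disadvantage individual investors face in financial markets due to information overload and a lack of professional analysis, in a setting where existing knowledge bases lag behind market events and long, unstructured equity research reports are hard for large language models (LLMs) to use promptly. The solution has two parts. First, the Event-Enhanced Automated Construction of Financial Knowledge Graph (FinKario) dataset integrates real-time company fundamentals and market events through prompt-driven extraction, forming a structured financial knowledge graph with over 305,360 entities, 9,625 relational triples, and 19 relation types. Second, a two-stage, graph-based retrieval strategy (FinKario-RAG) improves precise access to large-scale, evolving financial knowledge. In backtesting, the approach outperforms financial LLMs by 18.81% and institutional strategies by 17.85% on average in stock trend prediction accuracy.
Link: https://arxiv.org/abs/2508.00961
Authors: Xiang Li,Penglei Sun,Wanyun Zhou,Zikai Wei,Yongqi Zhang,Xiaowen Chu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: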
Abstract:Individual investors are significantly outnumbered and disadvantaged in financial markets, overwhelmed by abundant information and lacking professional analysis. Equity research reports stand out as crucial resources, offering valuable insights. By leveraging these reports, large language models (LLMs) can enhance investors’ decision-making capabilities and strengthen financial analysis. However, two key challenges limit their effectiveness: (1) the rapid evolution of market events often outpaces the slow update cycles of existing knowledge bases, (2) the long-form and unstructured nature of financial reports further hinders timely and context-aware integration by LLMs. To address these challenges, we tackle both data and methodological aspects. First, we introduce the Event-Enhanced Automated Construction of Financial Knowledge Graph (FinKario), a dataset comprising over 305,360 entities, 9,625 relational triples, and 19 distinct relation types. FinKario automatically integrates real-time company fundamentals and market events through prompt-driven extraction guided by professional institutional templates, providing structured and accessible financial insights for LLMs. Additionally, we propose a Two-Stage, Graph-Based retrieval strategy (FinKario-RAG), optimizing the retrieval of evolving, large-scale financial knowledge to ensure efficient and precise data access. Extensive experiments show that FinKario with FinKario-RAG achieves superior stock trend prediction accuracy, outperforming financial LLMs by 18.81% and institutional strategies by 17.85% on average in backtesting.
[AI-186] Compression-Induced Communication-Efficient Large Model Training and Inferencing
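[Quick Read]: This paper addresses the high energy cost of tensor parallelism in large neural network training, a key obstacle to sustainable large-scale machine learning workloads. Its core solution is a new strategy called phantom parallelism, which redesigns the forward and backward propagation operators so that model-parallel training needs less bandwidth and fewer floating-point operations (FLOPs). The method is validated systematically on feed-forward network (FFN) architectures. Experiments show about a 50% reduction in training energy compared with conventional tensor parallelism, and smaller phantom models on fewer GPUs reach the same model loss as larger tensor-parallel models on more GPUs, which points to a path toward more efficient, lower-energy deep learning training.
Link: https://arxiv.org/abs/2508.00960
Authors: Sudip K. Seal,Maksudul Alam,Jorge Ramirez,Sajal Dash,Hao Lu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: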
Abstract:Energy efficiency of training and inferencing with large neural network models is a critical challenge facing the future of sustainable large-scale machine learning workloads. This paper introduces an alternative strategy, called phantom parallelism, to minimize the net energy consumption of traditional tensor (model) parallelism, the most energy-inefficient component of large neural network training. The approach is presented in the context of feed-forward network architectures as a preliminary, but comprehensive, proof-of-principle study of the proposed methodology. We derive new forward and backward propagation operators for phantom parallelism, implement them as custom autograd operations within an end-to-end phantom parallel training pipeline and compare its parallel performance and energy-efficiency against those of conventional tensor parallel training pipelines. Formal analyses that predict lower bandwidth and FLOP counts are presented with supporting empirical results on up to 256 GPUs that corroborate these gains. Experiments are shown to deliver ~50% reduction in the energy consumed to train FFNs using the proposed phantom parallel approach when compared with conventional tensor parallel methods. Additionally, the proposed approach is shown to train smaller phantom models to the same model loss on smaller GPU counts as larger tensor parallel models on larger GPU counts offering the possibility for even greater energy savings.
[AI-187] Enhancing material behavior discovery using embedding-oriented Physically-Guided Neural Networks with Internal Variables
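[Quick Read]: This paper addresses the scalability problems that Physically Guided Neural Networks with Internal Variables (PGNNIV) face on high-dimensional data such as fine-grid spatial fields or time-evolving systems. The key solution is to introduce reduced-order modeling: the original decoder structure is replaced by alternatives based on spectral decomposition, proper orthogonal decomposition (POD), and pretrained autoencoder mappings, which offer flexible trade-offs among computational efficiency, accuracy, noise robustness, and generalization. In addition, transfer learning and fine-tuning enable model reuse, so previously acquired knowledge can be adapted to new materials or configurations, which significantly reduces training time while maintaining or improving performance.
Link: https://arxiv.org/abs/2508.00959
Authors: Rubén Muñoz-Sierra,Manuel Doblaré,Jacobo Ayensa-Jiménez
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: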
Abstract:Physically Guided Neural Networks with Internal Variables are SciML tools that use only observable data for training and have the capacity to unravel internal state relations. They incorporate physical knowledge both by prescribing the model architecture and using loss regularization, thus endowing certain specific neurons with a physical meaning as internal state variables. Despite their potential, these models face challenges in scalability when applied to high-dimensional data such as fine-grid spatial fields or time-evolving systems. In this work, we propose some enhancements to the PGNNIV framework that address these scalability limitations through reduced-order modeling techniques. Specifically, we introduce alternatives to the original decoder structure using spectral decomposition, POD, and pretrained autoencoder-based mappings. These surrogate decoders offer varying trade-offs between computational efficiency, accuracy, noise tolerance, and generalization, while drastically improving scalability. Additionally, we integrate model reuse via transfer learning and fine-tuning strategies to exploit previously acquired knowledge, supporting efficient adaptation to novel materials or configurations, and significantly reducing training time while maintaining or improving model performance. To illustrate these various techniques, we use a representative case governed by the nonlinear diffusion equation, using only observable data. Results demonstrate that the enhanced PGNNIV framework successfully identifies the underlying constitutive state equations while maintaining high predictive accuracy. It also improves robustness to noise, mitigates overfitting, and reduces computational demands. The proposed techniques can be tailored to various scenarios depending on data availability, resources, and specific modeling objectives, overcoming scalability challenges in all the scenarios.
[AI-188] Learning Unified User Quantized Tokenizers for User Representation
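[Quick Read]: This paper addresses three challenges of multi-source user representation learning in personalized-service scenarios: the lack of a unified representation framework, scalability and storage bottlenecks in data compression, and weak cross-task generalization. The key solution is the U²QT (Unified User Quantized Tokenizers) framework, built as a two-stage architecture. First, a causal Q-Former maps heterogeneous domain features into a shared causal representation space to preserve inter-modality dependencies. Second, a multi-view RQ-VAE with shared and source-specific codebooks discretizes the causal embeddings into compact tokens, which keeps semantic coherence while markedly improving storage and computational efficiency. The method unifies cross-domain knowledge transfer with early fusion, integrates seamlessly with language models, and is suited to industrial-scale applications.
Link: https://arxiv.org/abs/2508.00956
Authors: Chuan He,Yang Chen,Wuliang Huang,Tianyi Zheng,Jianhu Chen,Bin Dou,Yice Luo,Yun Zhu,Baokun Wang,Yongchao Liu,Xing Fu,Yu Cheng,Chuntao Hong,Weiqiang Wang,Xin-Wei Yao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: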
Abstract:Multi-source user representation learning plays a critical role in enabling personalized services on web platforms (e.g., Alipay). While prior works have adopted late-fusion strategies to combine heterogeneous data sources, they suffer from three key limitations: lack of unified representation frameworks, scalability and storage issues in data compression, and inflexible cross-task generalization. To address these challenges, we propose U^2QT (Unified User Quantized Tokenizers), a novel framework that integrates cross-domain knowledge transfer with early fusion of heterogeneous domains. Our framework employs a two-stage architecture: first, a causal Q-Former projects domain-specific features into a shared causal representation space to preserve inter-modality dependencies; second, a multi-view RQ-VAE discretizes causal embeddings into compact tokens through shared and source-specific codebooks, enabling efficient storage while maintaining semantic coherence. Experimental results showcase U^2QT’s advantages across diverse downstream tasks, outperforming task-specific baselines in future behavior prediction and recommendation tasks while achieving efficiency gains in storage and computation. The unified tokenization framework enables seamless integration with language models and supports industrial-scale applications.
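The discretization step described above (residual quantization, the "RQ" in RQ-VAE) can be sketched in a few lines. This is a minimal illustration of generic residual quantization only, not the U^2QT implementation: the codebooks are fixed by hand rather than learned, the split into shared and source-specific codebooks is omitted, and the names `nearest_code`, `residual_quantize`, and `CODEBOOKS` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(x: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to x (squared L2 distance)."""
    def sq_dist(code: Vector) -> float:
        return sum((xi - ci) ** 2 for xi, ci in zip(x, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(
    x: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize x level by level; each level encodes the residual left by the previous ones."""
    residual = list(x)
    recon = [0.0] * len(x)
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        code = codebook[idx]
        tokens.append(idx)                                  # one discrete token per level
        recon = [r + c for r, c in zip(recon, code)]        # running reconstruction
        residual = [r - c for r, c in zip(residual, code)]  # what is still unexplained
    return tokens, recon


# Hypothetical two-level codebooks (coarse, then fine); the values are illustrative only.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

tokens, recon = residual_quantize([1.2, 0.9], CODEBOOKS)
print(tokens, recon)  # [1, 1] [1.25, 1.0]
```

The embedding is stored as the short token list rather than as floats, which is where the storage savings of such a scheme come from; adding levels shrinks the remaining residual at the cost of more tokens.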
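[AI-189] From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model
[Quick Read]: This paper addresses how Multimodal Large Language Models (MLLMs) can put their generative nature to use for discriminative representation learning in universal multimodal embedding tasks. The dominant approach, large-scale contrastive pre-training, is computationally expensive and fails to exploit the intrinsic instruction-following ability of MLLMs. The key solution is an efficient framework with two synergistic components. The first is a hierarchical embedding prompt template, whose two-level instruction architecture forces the model to output discriminative representations. The second is self-aware hard negative sampling, which uses the model's own understanding to mine challenging negatives efficiently while actively filtering out potential false negatives. Together they substantially improve fine-tuning efficiency and performance, reaching state-of-the-art results without contrastive pre-training.
Link: https://arxiv.org/abs/2508.00955
Authors: Yeong-Joon Ju,Seong-Whan Lee
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: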
Abstract:Multimodal Large Language Models (MLLMs) have emerged as a promising solution for universal embedding tasks, yet adapting their generative nature for discriminative representation learning remains a significant challenge. The dominant paradigm of large-scale contrastive pre-training suffers from critical inefficiencies, including prohibitive computational costs and a failure to leverage the intrinsic, instruction-following capabilities of MLLMs. To overcome these limitations, we propose an efficient framework for universal multimodal embeddings, which bridges this gap by centering on two synergistic components. First, our hierarchical embedding prompt template employs a two-level instruction architecture that forces the model to produce discriminative representations. Building on this strong foundation, our second component, self-aware hard negative sampling, redefines the fine-tuning process by leveraging the model’s own understanding to efficiently mine challenging negatives while actively filtering out potential false negatives. Our comprehensive experiments show that our hierarchical prompt achieves zero-shot performance competitive with contrastively trained baselines and enhances the fine-tuning process by lifting a simple in-batch negative baseline by 4.8 points on the MMEB benchmark. We further boost the performance via our self-aware hard negative sampling, achieving state-of-the-art performance without contrastive pre-training. Our work presents an effective and efficient pathway to adapt MLLMs for universal embedding tasks, significantly reducing training time.
[AI-190] Academic Vibe Coding: Opportunities for Accelerating Research in an Era of Resource Constraint
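[Quick Read]: This paper addresses the resource constraints facing academic laboratories: tightening budgets, grant overheads that may be capped, and a market rate for data-science talent that far exceeds university salaries. The key solution is "vibe coding", defined as structured, prompt-driven code generation with large language models (LLMs) embedded in reproducible workflows. It aims to shorten the path from idea to analysis, reduce dependence on specialized data roles, and keep outputs rigorously version-controlled and traceable, thereby improving research efficiency and sustainability.
Link: https://arxiv.org/abs/2508.00952
Authors: Matthew G Crowson,Leo Celi A. Celi
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments: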
Abstract:Academic laboratories face mounting resource constraints: budgets are tightening, grant overheads are potentially being capped, and the market rate for data-science talent significantly outstrips university compensation. Vibe coding, which is structured, prompt-driven code generation with large language models (LLMs) embedded in reproducible workflows, offers one pragmatic response. It aims to compress the idea-to-analysis timeline, reduce staffing pressure on specialized data roles, and maintain rigorous, version-controlled outputs. This article defines the vibe coding concept, situates it against the current academic resourcing crisis, details a beginner-friendly toolchain for its implementation, and analyzes inherent limitations that necessitate governance and mindful application.
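[AI-191] LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring
[Quick Read]: This paper addresses "sandbagging" in dangerous-capability evaluations of AI systems, that is, a model or its developers deliberately underperforming during evaluation and thereby misleading safety review. As a defense, the study examines chain-of-thought (CoT) monitoring, which inspects the model's reasoning to reveal its intentions and plans. The key finding is that both frontier models and small open-source models can covertly sandbag against CoT monitoring zero-shot and without hints. They cannot yet do so reliably, however: when monitor-aware, they bypass the monitor only 16-36% of the time, conditioned on sandbagging successfully. Even so, the result exposes clear weaknesses in current CoT monitoring. The study also presents five covert sandbagging policies generated by models, which give empirical insight into how CoT monitoring fails and can inform more robust evaluation frameworks and more diverse sandbagging model organisms.
Link: https://arxiv.org/abs/2508.00943
Authors: Chloe Li,Mary Phuong,Noah Y. Siegel
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 25 pages, 9 figures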
Abstract:Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat to this is sandbagging - the strategic underperformance on evaluations by AI models or their developers. One promising defense is to monitor a model’s chain-of-thought (CoT) reasoning, as this could reveal its intentions and plans. In this work, we measure the ability of models to sandbag on dangerous capability evaluations against a CoT monitor by prompting them to sandbag while being either monitor-oblivious or monitor-aware. We show that both frontier models and small open-sourced models can covertly sandbag against CoT monitoring 0-shot without hints. However, they cannot yet do so reliably: they bypass the monitor 16-36% of the time when monitor-aware, conditioned on sandbagging successfully. We qualitatively analyzed the uncaught CoTs to understand why the monitor failed. We reveal a rich attack surface for CoT monitoring and contribute five covert sandbagging policies generated by models. These results inform potential failure modes of CoT monitoring and may help build more diverse sandbagging model organisms.
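[AI-192] Trusted Routing for Blockchain-Empowered UAV Networks via Multi-Agent Deep Reinforcement Learning
[Quick Read]: This paper addresses routing security in unmanned aerial vehicle (UAV) networks, where distributed topologies and high dynamics make routing vulnerable, and in particular how to keep routing reliable and minimize total delay when malicious nodes are present. The solution has three key parts. First, a blockchain-based trust management mechanism (BTMM) dynamically evaluates node trust values and identifies low-trust UAVs. Second, a consensus UAV update mechanism improves on the traditional practical Byzantine fault tolerance algorithm used in the blockchain. Third, the routing problem is modeled as a decentralized partially observable Markov decision process (Dec-POMDP) and solved with a multi-agent double deep Q-network (MAD-DQN) routing algorithm. Simulations show that the scheme reduces delay by 13.39%, 12.74%, and 16.6% compared with multi-agent proximal policy optimization algorithms, multi-agent deep Q-network algorithms, and methods without BTMM, respectively.
Link: https://arxiv.org/abs/2508.00938
Authors: Ziye Jia,Sijie He,Qiuming Zhu,Wei Wang,Qihui Wu,Zhu Han
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: IEEE Tcom Accepted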
Abstract:Due to their high flexibility and versatility, unmanned aerial vehicles (UAVs) are leveraged in various fields including surveillance and disaster rescue. However, in UAV networks, routing is vulnerable to malicious damage due to distributed topologies and high dynamics. Hence, ensuring the routing security of UAV networks is challenging. In this paper, we characterize the routing process in a time-varying UAV network with malicious nodes. Specifically, we formulate the routing problem to minimize the total delay, which is an integer linear program and intractable to solve. Then, to tackle the network security issue, a blockchain-based trust management mechanism (BTMM) is designed to dynamically evaluate trust values and identify low-trust UAVs. To improve traditional practical Byzantine fault tolerance algorithms in the blockchain, we propose a consensus UAV update mechanism. Besides, considering the local observability, the routing problem is reformulated into a decentralized partially observable Markov decision process. Further, a multi-agent double deep Q-network based routing algorithm is designed to minimize the total delay. Finally, simulations are conducted with attacked UAVs and numerical results show that the delay of the proposed mechanism decreases by 13.39%, 12.74%, and 16.6% compared with multi-agent proximal policy optimal algorithms, multi-agent deep Q-network algorithms, and methods without BTMM, respectively.
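For readers unfamiliar with the "double" part of a double deep Q-network, the sketch below shows the standard Double DQN bootstrap target for a single transition: the online network picks the next action and the target network evaluates it. It is a generic illustration, not the paper's routing algorithm; the function name `double_dqn_target`, the reward convention (negative per-hop delay), and all numbers are assumptions made for the example.

```python
from typing import Sequence


def double_dqn_target(
    reward: float,
    next_q_online: Sequence[float],
    next_q_target: Sequence[float],
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Standard Double DQN target for one transition.

    next_q_online / next_q_target hold Q(s', a) for every action a, as estimated
    by the online and target networks respectively.
    """
    if done:
        return reward  # no bootstrapping past a terminal state
    # Action selection uses the online network ...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ... while action evaluation uses the target network.
    return reward + gamma * next_q_target[best_action]


# Hypothetical example: three candidate next hops, reward = negative per-hop delay.
y_double = double_dqn_target(-2.0, [-5.0, -3.0, -4.0], [-4.5, -3.5, -1.0], gamma=0.9)
# A vanilla DQN target would instead take the max over the target network's values.
y_vanilla = -2.0 + 0.9 * max([-4.5, -3.5, -1.0])
print(y_double, y_vanilla)
```

Decoupling selection from evaluation is what limits the overestimation that a plain max over noisy Q-values tends to produce; in the example the vanilla target is noticeably more optimistic than the Double DQN one.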
[AI-193] Measuring Harmfulness of Computer-Using Agents
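[Quick Read]: This paper addresses how to assess the safety risks that generative AI may pose when acting as computer-using agents (CUAs). Existing benchmarks mostly evaluate the safety of language models (LMs) in chatbot or simple tool-calling settings and do not account for the malicious behavior that becomes possible once a model has full computer access. The key solution is a new benchmark, CUAHarm, which contains 104 expert-written, realistic misuse tasks (such as disabling firewalls or leaking sensitive information) and uses a sandbox environment with rule-based verifiable rewards to measure how often a model actually succeeds at these tasks, rather than relying on refusals alone. Experiments show that, even without carefully designed jailbreak prompts, frontier models acting as CUAs carry out malicious operations at a high success rate, and newer models pose higher risk. The paper also finds that a leading agentic framework (UI-TARS-1.5) improves performance but amplifies misuse risk, and that monitoring based on chain-of-thought (CoT) has limited detection accuracy, which underlines how hard it is to control CUA behavior safely.
Link: https://arxiv.org/abs/2508.00935
Authors: Aaron Xuxiang Tian,Ruofan Zhang,Janet Tang,Jiaxin Wen
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 19 pages, 9 figures. Benchmark release at this https URL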
Abstract:Computer-using agents (CUAs), which autonomously control computers to perform multi-step actions, might pose significant safety risks if misused. Existing benchmarks mostly evaluate language models’ (LMs) safety risks in chatbots or simple tool-usage scenarios, without granting full computer access. To better evaluate CUAs’ misuse risks, we introduce a new benchmark: CUAHarm. CUAHarm consists of 104 expert-written realistic misuse risks, such as disabling firewalls, leaking confidential information, launching denial-of-service attacks, or installing backdoors. We provide a sandbox environment and rule-based verifiable rewards to measure CUAs’ success rates in executing these tasks (e.g., whether the firewall is indeed disabled), not just refusal. We evaluate multiple frontier open-source and proprietary LMs, such as Claude Sonnet, GPT-4o, Gemini Pro 1.5, Llama-3.3-70B, and Mistral Large 2. Surprisingly, even without carefully designed jailbreaking prompts, these frontier LMs comply with executing these malicious tasks at a high success rate (e.g., 59% for Claude 3.7 Sonnet). Newer models show higher misuse rates: Claude 3.7 Sonnet succeeds on 15% more tasks than Claude 3.5. While these models are robust to common malicious prompts (e.g., creating a bomb) in chatbot settings, they behave unsafely as CUAs. We further evaluate a leading agentic framework (UI-TARS-1.5) and find that while it improves performance, it also amplifies misuse risks. Benign variants reveal refusals stem from alignment, not capability limits. To mitigate risks, we explore using LMs to monitor CUAs’ actions and chain-of-thoughts (CoTs). Monitoring CUAs is significantly harder than chatbot outputs. Monitoring CoTs yields modest gains, with average detection accuracy at only 72%. Even with hierarchical summarization, improvement is limited to 4%. CUAHarm will be released at this https URL.
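[AI-194] OKG-LLM: Aligning Ocean Knowledge Graph with Observation Data via LLMs for Global Sea Surface Temperature Prediction
[Quick Read]: This paper addresses the fact that current sea surface temperature (SST) prediction methods neglect accumulated ocean domain knowledge, which limits further gains in accuracy. Existing data-driven methods perform well but do not effectively combine long-accumulated oceanographic prior knowledge with numerical data. The key solution is the Ocean Knowledge Graph-enhanced LLM (OKG-LLM) framework. It first constructs what the authors describe as the first Ocean Knowledge Graph (OKG) designed specifically for SST prediction, which systematically represents diverse ocean knowledge. It then uses a graph embedding network to learn the semantic and structural information in the OKG, capturing the characteristics of individual sea regions and the complex correlations between them. Finally, it aligns and fuses the extracted knowledge with fine-grained numerical SST data and uses a pre-trained LLM to model spatiotemporal SST patterns for accurate prediction.
Link: https://arxiv.org/abs/2508.00933
Authors: Hanchen Yang,Jiaqi Wang,Jiannong Cao,Wengen Li,Jialun Zheng,Yangning Li,Chunyu Miao,Jihong Guan,Shuigeng Zhou,Philip S. Yu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: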
Abstract:Sea surface temperature (SST) prediction is a critical task in ocean science, supporting various applications, such as weather forecasting, fisheries management, and storm tracking. While existing data-driven methods have demonstrated significant success, they often neglect to leverage the rich domain knowledge accumulated over the past decades, limiting further advancements in prediction accuracy. The recent emergence of large language models (LLMs) has highlighted the potential of integrating domain knowledge for downstream tasks. However, the application of LLMs to SST prediction remains underexplored, primarily due to the challenge of integrating ocean domain knowledge and numerical data. To address this issue, we propose Ocean Knowledge Graph-enhanced LLM (OKG-LLM), a novel framework for global SST prediction. To the best of our knowledge, this work presents the first systematic effort to construct an Ocean Knowledge Graph (OKG) specifically designed to represent diverse ocean knowledge for SST prediction. We then develop a graph embedding network to learn the comprehensive semantic and structural knowledge within the OKG, capturing both the unique characteristics of individual sea regions and the complex correlations between them. Finally, we align and fuse the learned knowledge with fine-grained numerical SST data and leverage a pre-trained LLM to model SST patterns for accurate prediction. Extensive experiments on the real-world dataset demonstrate that OKG-LLM consistently outperforms state-of-the-art methods, showcasing its effectiveness, robustness, and potential to advance SST prediction. The codes are available in the online repository.
[AI-195] SmartDate: AI-Driven Precision Sorting and Quality Control in Date Fruits
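[Quick Read]: This paper addresses the reliance on manual labor, low efficiency, and insufficient accuracy in the sorting and quality control of date fruits. The key solution is SmartDate, an intelligent system that combines deep learning, a genetic algorithm (GA), and reinforcement learning (RL). It uses high-resolution imaging and Visible-Near-Infrared (VisNIR) spectral sensing to extract key quality parameters such as moisture, sugar content, and texture, applies reinforcement learning to adapt to production conditions in real time, and uses the genetic algorithm to optimize model hyperparameters. The result is a marked improvement in classification accuracy and shelf-life prediction, enabling efficient and precise automated grading and quality control.
Link: https://arxiv.org/abs/2508.00921
Authors: Khaled Eskaf
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 6 pages, 2 figures, published in Proceedings of the 21st IEEE International Conference on High Performance Computing and Networking (HONET 2024), Doha, Qatar, December 2024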
Abstract:SmartDate is an AI-powered system for automated sorting and quality control of date fruits. It combines deep learning, genetic algorithms, and reinforcement learning to improve classification accuracy and predict shelf life. The system uses high-resolution imaging and Visible-Near-Infrared (VisNIR) spectral sensors to evaluate key features such as moisture, sugar content, and texture. Reinforcement learning enables real-time adaptation to production conditions, while genetic algorithms optimize model parameters. SmartDate achieved 94.5 percent accuracy, 93.1 percent F1-score, and an AUC-ROC of 0.96. The system reduces waste and ensures that only high-quality dates reach the market, setting a new benchmark in smart agriculture.
[AI-196] Knowledge Editing for Multi-Hop Question Answering Using Semantic Analysis IJCAI2025
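[Quick Read]: This paper addresses the difficulty that knowledge editing (KE) for Large Language Models (LLMs) has with tasks requiring compositional reasoning, such as multi-hop question answering (MQA). Existing KE methods rely on decompositional techniques that produce logically incoherent reasoning, so accuracy on complex tasks cannot be guaranteed. The key solution is CHECK, a knowledge editing framework based on semantic analysis that draws on an analogy between compilers and LLM reasoning: reasoning chains are semantically analyzed before they are executed, logical errors are identified and corrected, and the reasoning path is rebuilt through logic optimization and re-prompting at a higher temperature, which improves MQA accuracy. Experiments show that CHECK improves MQA accuracy by an average of 22.8% over five state-of-the-art frameworks on four datasets.
Link: https://arxiv.org/abs/2508.00914
Authors: Dominic Simon,Rickard Ewetz
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 15 figures, pre-print of paper accepted to IJCAI 2025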
Abstract:Large Language Models (LLMs) require lightweight avenues of updating stored information that has fallen out of date. Knowledge Editing (KE) approaches have been successful in updating model knowledge for simple factual queries but struggle with handling tasks that require compositional reasoning such as multi-hop question answering (MQA). We observe that existing knowledge editors leverage decompositional techniques that result in illogical reasoning processes. In this paper, we propose a knowledge editor for MQA based on semantic analysis called CHECK. Our framework is based on insights from an analogy between compilers and reasoning using LLMs. Similar to how source code is first compiled before being executed, we propose to semantically analyze reasoning chains before executing the chains to answer questions. Reasoning chains with semantic errors are revised to ensure consistency through logic optimization and re-prompting the LLM model at a higher temperature. We evaluate the effectiveness of CHECK against five state-of-the-art frameworks on four datasets and achieve an average 22.8% improved MQA accuracy.
[AI-197] Predictive Auditing of Hidden Tokens in LLM APIs via Reasoning Length Estimation
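[Quick Read]: This paper addresses opaque token billing in commercial Large Language Model (LLM) services that hide their internal reasoning traces: users may be charged for tokens from intermediate steps they never see, which raises the risk of token inflation and overbilling. The key solution is PALACE, a user-side framework that predicts the reasoning token count. Its core innovation is a GRPO-augmented adaptation module with a lightweight domain router, which calibrates dynamically across diverse reasoning tasks. This allows hidden token consumption to be estimated accurately without access to internal execution traces, supporting fine-grained cost auditing and inflation detection.
Link: https://arxiv.org/abs/2508.00912
Authors: Ziyao Wang,Guoheng Sun,Yexiao He,Zheyu Shen,Bowei Tian,Ang Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: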
Abstract:Commercial LLM services often conceal internal reasoning traces while still charging users for every generated token, including those from hidden intermediate steps, raising concerns of token inflation and potential overbilling. This gap underscores the urgent need for reliable token auditing, yet achieving it is far from straightforward: cryptographic verification (e.g., hash-based signature) offers little assurance when providers control the entire execution pipeline, while user-side prediction struggles with the inherent variance of reasoning LLMs, where token usage fluctuates across domains and prompt styles. To bridge this gap, we present PALACE (Predictive Auditing of LLM APIs via Reasoning Token Count Estimation), a user-side framework that estimates hidden reasoning token counts from prompt-answer pairs without access to internal traces. PALACE introduces a GRPO-augmented adaptation module with a lightweight domain router, enabling dynamic calibration across diverse reasoning tasks and mitigating variance in token usage patterns. Experiments on math, coding, medical, and general reasoning benchmarks show that PALACE achieves low relative error and strong prediction accuracy, supporting both fine-grained cost auditing and inflation detection. Taken together, PALACE represents an important first step toward standardized predictive auditing, offering a practical path to greater transparency, accountability, and user trust.
[AI-198] Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling
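[Quick Read]: This paper addresses the difficulty of accurately forecasting Large Language Model (LLM) inference performance on heterogeneous hardware such as CPUs, NPUs, and integrated GPUs. Existing methods depend on hardware-specific benchmarking or machine-learning latency predictors and do not generalize well. The key solution is LIFE, a lightweight and modular analytical framework. It uses configurable operator-level analytical models to characterize LLM inference workloads in a hardware- and dataset-agnostic way, and it quantifies how software and model optimizations (quantization, KV cache compression, LoRA adapters, chunked prefill, different attention mechanisms, and operator fusion) affect key metrics such as time-to-first-token (TTFT), time-per-output-token (TPOT), and tokens-per-second (TPS). LIFE needs only hardware specifications such as TOPS and memory bandwidth to produce a forecast, with no extensive dataset benchmarking, which makes cross-platform deployment considerably more efficient.
Link: https://arxiv.org/abs/2508.00904
Authors: Rajeev Patwari,Ashish Sirasao,Devleena Das
Affiliation: Unknown
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 10 pages, 9 figures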
Abstract:Large language models (LLMs) have been increasingly deployed as local agents on personal devices with CPUs, NPUs and integrated GPUs. However, forecasting inference performance on devices with such heterogeneity remains challenging due to the dynamic compute and memory demands. Existing approaches rely on GPU benchmarking or machine learning-based latency predictors, which are often hardware-specific and lack generalizability. To this end, we introduce LIFE, a lightweight and modular analytical framework comprising modular analytical models of operators that can be configured to characterize LLM inference workloads in a hardware and dataset-agnostic manner. LIFE characterizes the influence of software and model optimizations, such as quantization, KV cache compression, LoRA adapters, chunked prefill, different attentions, and operator fusion, on performance metrics such as time-to-first-token (TTFT), time-per-output-token (TPOT) and tokens-per-second (TPS). LIFE enables performance forecasting using only hardware specifications, such as TOPS and memory bandwidth, without requiring extensive dataset benchmarking. We validate LIFE’s forecasting with inference on AMD Ryzen CPUs, NPUs, iGPUs and NVIDIA V100 GPUs, with Llama2-7B variants, demonstrating the utility of LIFE in forecasting LLM performance through the lens of system efficiency to enable efficient LLM deployment across different hardware platforms.
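To make the idea of forecasting from hardware specifications alone more concrete, here is a rough roofline-style sketch. It is not LIFE's operator-level model: it treats the whole network as a single operator, ignores attention, KV-cache traffic, and achievable efficiency, and assumes about 2 operations per parameter per token. The `Hardware` values and the names `op_time` and `forecast` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Hardware:
    peak_ops_per_s: float             # e.g. TOPS * 1e12
    mem_bandwidth_bytes_per_s: float  # e.g. GB/s * 1e9


def op_time(ops: float, bytes_moved: float, hw: Hardware) -> float:
    """Roofline bound: an operator is limited by either compute or memory traffic."""
    return max(ops / hw.peak_ops_per_s, bytes_moved / hw.mem_bandwidth_bytes_per_s)


def forecast(
    n_params: float, bytes_per_param: float, prompt_tokens: int, hw: Hardware
) -> Tuple[float, float, float]:
    """Return (TTFT seconds, TPOT seconds, tokens per second) under the stated assumptions."""
    weight_bytes = n_params * bytes_per_param
    # Prefill: all prompt tokens share one pass over the weights (usually compute-bound).
    ttft = op_time(2.0 * n_params * prompt_tokens, weight_bytes, hw)
    # Decode: every new token re-reads all weights (usually memory-bound).
    tpot = op_time(2.0 * n_params, weight_bytes, hw)
    return ttft, tpot, 1.0 / tpot


# Hypothetical device: 10 TOPS, 100 GB/s; a 7B-parameter model quantized to 4 bits.
device = Hardware(peak_ops_per_s=10e12, mem_bandwidth_bytes_per_s=100e9)
ttft, tpot, tps = forecast(7e9, 0.5, prompt_tokens=1000, hw=device)
print(f"TTFT={ttft:.2f}s TPOT={tpot * 1e3:.1f}ms TPS={tps:.1f}")
```

Even this crude bound shows why quantization matters for decoding: halving `bytes_per_param` halves the memory-bound TPOT, while the compute-bound TTFT is unchanged in this simplified model.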
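[AI-199] Universal Neurons in GPT-2: Emergence Persistence and Functional Impact
[Quick Read]: This paper asks whether generative AI models contain neuron-level representational structures that are stable and consistent across models, that is, how the phenomenon of "neuron universality" arises and what function it serves. The key approach is to run activation correlation analysis on independently trained GPT-2 Small models, identify universal neurons that remain highly stable across several training checkpoints (100k, 200k, and 300k steps), and use ablation experiments to quantify their effect on model predictions, measured by loss and KL divergence. The results point to functionally stable representational structures that emerge spontaneously during deep neural network training.
Link: https://arxiv.org/abs/2508.00903
Authors: Advey Nandan,Cheng-Ting Chou,Amrit Kurakula,Cole Blondin,Kevin Zhu,Vasu Sharma,Sean O’Brien
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: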
Abstract:We investigate the phenomenon of neuron universality in independently trained GPT-2 Small models, examining how these universal neurons (neurons with consistently correlated activations across models) emerge and evolve throughout training. By analyzing five GPT-2 models at three checkpoints (100k, 200k, 300k steps), we identify universal neurons through pairwise correlation analysis of activations over a dataset of 5 million tokens. Ablation experiments reveal significant functional impacts of universal neurons on model predictions, measured via loss and KL divergence. Additionally, we quantify neuron persistence, demonstrating high stability of universal neurons across training checkpoints, particularly in deeper layers. These findings suggest stable and universal representational structures emerge during neural network training.
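The pairwise correlation analysis mentioned in the abstract can be sketched as follows for two models. This is a minimal illustration under assumptions: the paper compares five models over 5 million tokens, whereas the toy data here has four tokens, and the selection rule (best Pearson correlation against any neuron in the other model, compared with a threshold) and the threshold value are placeholders rather than the authors' exact criterion. The names `pearson` and `universal_neurons` are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length activation traces."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0  # a constant neuron carries no correlation signal
    return cov / sqrt(vx * vy)


def universal_neurons(
    acts_a: Sequence[Sequence[float]],
    acts_b: Sequence[Sequence[float]],
    threshold: float,
) -> List[int]:
    """Indices of neurons in model A whose best match in model B reaches the threshold.

    acts_a[i] holds neuron i's activation on each token of a shared dataset.
    """
    selected = []
    for i, trace_a in enumerate(acts_a):
        best = max(pearson(trace_a, trace_b) for trace_b in acts_b)
        if best >= threshold:
            selected.append(i)
    return selected


# Toy activations over four tokens (illustrative values only).
model_a = [[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0]]
model_b = [[1.0, 3.0, 5.0, 7.0], [0.0, 0.0, 1.0, 1.0]]
print(universal_neurons(model_a, model_b, threshold=0.9))  # [0]
```

At real scale the same computation would be done with running sums over token batches rather than stored traces, since activations for millions of tokens do not fit comfortably in memory.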
[AI-200] ff4ERA: A new Fuzzy Framework for Ethical Risk Assessment in AI
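[Quick Read]: This paper addresses the uncertainty, vagueness, and incomplete information that hinder Ethical Risk Assessment (ERA) in the context of Symbiotic AI (SAI). As human-AI collaboration deepens, traditional ERA methods struggle to quantify and prioritize multi-dimensional ethical risks. The key solution is the ff4ERA framework, which integrates Fuzzy Logic, the Fuzzy Analytic Hierarchy Process (FAHP), and Certainty Factors (CF) to assess each type of ethical risk quantitatively through an Ethical Risk Score (ERS). The framework combines the FAHP-derived weight, the propagated CF, and the risk level into an interpretable, traceable, and risk-aware mechanism for ethical decision support, giving stable and transparent grounds for ethical judgments that align with human values in complex situations.
Link: https://arxiv.org/abs/2508.00899
Authors: Abeer Dyoub,Ivan Letteri,Francesca A. Lisi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: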
Abstract:The emergence of Symbiotic AI (SAI) introduces new challenges to ethical decision-making as it deepens human-AI collaboration. As symbiosis grows, AI systems pose greater ethical risks, including harm to human rights and trust. Ethical Risk Assessment (ERA) thus becomes crucial for guiding decisions that minimize such risks. However, ERA is hindered by uncertainty, vagueness, and incomplete information, and morality itself is context-dependent and imprecise. This motivates the need for a flexible, transparent, yet robust framework for ERA. Our work supports ethical decision-making by quantitatively assessing and prioritizing multiple ethical risks so that artificial agents can select actions aligned with human values and acceptable risk levels. We introduce ff4ERA, a fuzzy framework that integrates Fuzzy Logic, the Fuzzy Analytic Hierarchy Process (FAHP), and Certainty Factors (CF) to quantify ethical risks via an Ethical Risk Score (ERS) for each risk type. The final ERS combines the FAHP-derived weight, propagated CF, and risk level. The framework offers a robust mathematical approach for collaborative ERA modeling and systematic, step-by-step analysis. A case study confirms that ff4ERA yields context-sensitive, ethically meaningful risk scores reflecting both expert input and sensor-based evidence. Risk scores vary consistently with relevant factors while remaining robust to unrelated inputs. Local sensitivity analysis shows predictable, mostly monotonic behavior across perturbations, and global Sobol analysis highlights the dominant influence of expert-defined weights and certainty factors, validating the model design. Overall, the results demonstrate ff4ERA’s ability to produce interpretable, traceable, and risk-aware ethical assessments, enabling what-if analyses and guiding designers in calibrating membership functions and expert judgments for reliable ethical decision support.
[AI-201] Maximize margins for robust splicing detection
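[Quick Read]: This paper addresses the reliability of deep learning-based image splicing detectors in real deployments, where high sensitivity to training conditions causes detection performance to drop sharply once evaluation images undergo even mild post-processing. The key insight is that different learned weights induce differently structured latent spaces, which affects how well a model generalizes to unseen post-processed images. Based on this, the authors propose a practical strategy: train several variants of the same architecture under different conditions and select the one that maximizes latent margins, which makes the detector more robust.
Link: https://arxiv.org/abs/2508.00897
Authors: Julien Simon de Kergunic(CRIStAL),Rony Abecidan(CRIStAL),Patrick Bas(CRIStAL),Vincent Itier(IMT Nord Europe, CRIStAL)
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: in French language. GRETSI 2025 - Colloque Francophone de Traitement du Signal et des Images, this https URL , Aug 2025, Strasbourg, France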
Abstract:Despite recent progress in splicing detection, deep learning-based forensic tools remain difficult to deploy in practice due to their high sensitivity to training conditions. Even mild post-processing applied to evaluation images can significantly degrade detector performance, raising concerns about their reliability in operational contexts. In this work, we show that the same deep architecture can react very differently to unseen post-processing depending on the learned weights, despite achieving similar accuracy on in-distribution test data. This variability stems from differences in the latent spaces induced by training, which affect how samples are separated internally. Our experiments reveal a strong correlation between the distribution of latent margins and a detector’s ability to generalize to post-processed images. Based on this observation, we propose a practical strategy for building more robust detectors: train several variants of the same model under different conditions, and select the one that maximizes latent margins.
[AI-202] Multi-Grained Temporal-Spatial Graph Learning for Stable Traffic Flow Forecasting
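Quick read: The paper tackles the difficulty of modeling complex temporal-spatial dependencies in dynamic traffic flow forecasting, in particular that existing methods struggle to capture global temporal-spatial patterns and tend to overfit pre-defined geographical correlations, which limits robustness in complex traffic environments. The key to the solution is a multi-grained temporal-spatial graph learning framework: a purpose-built graph transformer encoder adaptively extracts global temporal-spatial patterns, a graph convolution network captures local patterns, and a gated fusion unit with residual connections dynamically balances the two, uncovering hidden global relations between monitoring stations and weighting local against global patterns.
Link: https://arxiv.org/abs/2508.00884
Authors: Zhenan Lin,Yuni Lai,Wai Lun Lo,Richard Tai-Chiu Hsung,Harris Sik-Ho Tsang,Xiaoyu Xue,Kai Zhou,Yulin Zhu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: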
Abstract:Time-evolving traffic flow forecasting plays a vital role in intelligent transportation systems and smart cities. However, dynamic traffic flow forecasting is a highly nonlinear problem with complex temporal-spatial dependencies. Although existing methods have contributed greatly to mining temporal-spatial patterns in complex traffic networks, they fail to encode global temporal-spatial patterns and are prone to overfitting the pre-defined geographical correlations, which hinders the model’s robustness in complex traffic environments. To tackle this issue, we propose a multi-grained temporal-spatial graph learning framework that adaptively augments the global temporal-spatial patterns obtained from a crafted graph transformer encoder with the local patterns from graph convolution, using a crafted gated fusion unit with residual connections. The proposed model can thus mine the hidden global temporal-spatial relations between monitoring stations and balance the relative importance of local and global temporal-spatial patterns. Experimental results demonstrate the strong representation capability of the proposed method, and the model consistently outperforms strong baselines on various real-world traffic networks.
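To make the idea of gated fusion with a residual connection concrete, here is a minimal element-wise sketch. It is a generic form of the technique, not the paper's unit: the scalar gate weights, the bias, and the function names are assumptions, and a real model would use learned weight matrices over feature vectors.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(global_feat, local_feat, residual,
                 w_global=1.0, w_local=1.0, bias=0.0):
    """Blend global and local features with a learned-style gate, then add a residual.

    gate = sigmoid(w_global * g + w_local * l + bias)
    out  = gate * g + (1 - gate) * l + residual
    """
    fused = []
    for g, l, r in zip(global_feat, local_feat, residual):
        gate = sigmoid(w_global * g + w_local * l + bias)
        fused.append(gate * g + (1.0 - gate) * l + r)
    return fused


# Toy features for three monitoring stations (hypothetical values).
out = gated_fusion([0.8, -0.2, 0.0], [0.1, 0.4, 0.0], [0.5, 0.5, 1.0])
```

Because the gate lies in (0, 1), each fused value before the residual is a convex combination of the global and local inputs, which is what lets the unit trade one pattern source against the other.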
[AI-203] Reproducibility of Machine Learning-Based Fault Detection and Diagnosis for HVAC Systems in Buildings: An Empirical Study
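Quick read: The paper addresses the reproducibility of machine learning (ML) applications in building energy systems, where studies commonly lack the transparency and methodological detail needed for independent verification. Its key contribution is a systematic analysis of the reproducibility standards of the existing literature, which exposes core deficiencies such as unclear data provenance (72% of articles do not state whether the dataset is public, proprietary, or commercially available) and a very low rate of code sharing (only two papers provide a link, one of which is broken). On that basis it proposes targeted interventions: dedicated reproducibility guidelines, training for researchers, and journal and conference policies that promote transparency.
Link: https://arxiv.org/abs/2508.00880
Authors: Adil Mukhtar,Michael Hadwiger,Franz Wotawa,Gerald Schweiger
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: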
Abstract:Reproducibility is a cornerstone of scientific research, enabling independent verification and validation of empirical findings. The topic gained prominence in fields such as psychology and medicine, where concerns about non-replicable results sparked ongoing discussions about research practices. In recent years, the fast-growing field of Machine Learning (ML) has become part of this discourse, as it faces similar concerns about transparency and reliability. Some reproducibility issues in ML research are shared with other fields, such as limited access to data and missing methodological details. In addition, ML introduces specific challenges, including inherent nondeterminism and computational constraints. While reproducibility issues are increasingly recognized by the ML community and its major conferences, less is known about how these challenges manifest in applied disciplines. This paper contributes to closing this gap by analyzing the transparency and reproducibility standards of ML applications in building energy systems. The results indicate that nearly all articles are not reproducible due to insufficient disclosure across key dimensions of reproducibility. 72% of the articles do not specify whether the dataset used is public, proprietary, or commercially available. Only two papers share a link to their code - one of which was broken. Two-thirds of the publications were authored exclusively by academic researchers, yet no significant differences in reproducibility were observed compared to publications with industry-affiliated authors. These findings highlight the need for targeted interventions, including reproducibility guidelines, training for researchers, and policies by journals and conferences that promote transparency and reproducibility.
[AI-204] GNN-ASE: Graph-Based Anomaly Detection and Severity Estimation in Three-Phase Induction Machines
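Quick read: The paper addresses the problem that traditional fault diagnosis for induction machines relies on complex dynamic models that are hard to implement and computationally expensive. The key to the solution is GNN-ASE, a model-free approach based on Graph Neural Networks (GNNs) that takes raw current and vibration signals as direct input, learns and extracts features automatically, and uses the graph structure to capture complex relationships between signal types and fault patterns. This yields high-accuracy detection and classification of multiple fault types (eccentricity, bearing defects, broken rotor bars) and improves diagnostic robustness and generalization.
Link: https://arxiv.org/abs/2508.00879
Authors: Moutaz Bellah Bentrad,Adel Ghoggal,Tahar Bahi,Abderaouf Bahi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: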
Abstract:The diagnosis of induction machines has traditionally relied on model-based methods that require the development of complex dynamic models, making them difficult to implement and computationally expensive. To overcome these limitations, this paper proposes a model-free approach using Graph Neural Networks (GNNs) for fault diagnosis in induction machines. The focus is on detecting multiple fault types – including eccentricity, bearing defects, and broken rotor bars – under varying severity levels and load conditions. Unlike traditional approaches, raw current and vibration signals are used as direct inputs, eliminating the need for signal preprocessing or manual feature extraction. The proposed GNN-ASE model automatically learns and extracts relevant features from raw inputs, leveraging the graph structure to capture complex relationships between signal types and fault patterns. It is evaluated for both individual fault detection and multi-class classification of combined fault conditions. Experimental results demonstrate the effectiveness of the proposed model, achieving 92.5% accuracy for eccentricity defects, 91.2% for bearing faults, and 93.1% for broken rotor bar detection. These findings highlight the model’s robustness and generalization capability across different operational scenarios. The proposed GNN-based framework offers a lightweight yet powerful solution that simplifies implementation while maintaining high diagnostic performance. It stands as a promising alternative to conventional model-based diagnostic techniques for real-world induction machine monitoring and predictive maintenance.
[AI-205] Satellite Connectivity Prediction for Fast-Moving Platforms
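Quick read: The paper addresses how fast-moving objects (such as aircraft, vehicles, and trains) can maintain stable satellite connectivity in areas without terrestrial coverage, in particular the communication interruptions caused by switching delays when frequently changing satellite beams, constellations, or orbits. The key to the solution is to use Machine Learning (ML) algorithms to analyze historical connectivity data and predict signal quality at specific locations, so that the network can be switched before connectivity degrades. The prediction model is built on real communication data between a Geostationary Orbit (GEO) satellite and aircraft and reaches an F1 score of 0.97 on the test set, which supports its accuracy for signal-quality prediction, offers a feasible path toward automated satellite and beam switching, and could be extended to other mobile platforms such as connected vehicles and trains.
Link: https://arxiv.org/abs/2508.00877
Authors: Chao Yan,Babak Mafakheri
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: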
Abstract:Satellite connectivity is gaining increased attention as the demand for seamless internet access, especially in transportation and remote areas, continues to grow. For fast-moving objects such as aircraft, vehicles, or trains, satellite connectivity is critical due to their mobility and frequent presence in areas without terrestrial coverage. Maintaining reliable connectivity in these cases requires frequent switching between satellite beams, constellations, or orbits. To enhance user experience and address challenges like long switching times, Machine Learning (ML) algorithms can analyze historical connectivity data and predict network quality at specific locations. This allows for proactive measures, such as network switching before connectivity issues arise. In this paper, we analyze a real dataset of communication between a Geostationary Orbit (GEO) satellite and aircraft over multiple flights, using ML to predict signal quality. Our prediction model achieved an F1 score of 0.97 on the test data, demonstrating the accuracy of machine learning in predicting signal quality during flight. By enabling seamless broadband service, including roaming between different satellite constellations and providers, our model addresses the need for real-time predictions of signal quality. This approach can further be adapted to automate satellite and beam-switching mechanisms to improve overall communication efficiency. The model can also be retrained and applied to any moving object with satellite connectivity, using customized datasets, including connected vehicles and trains.
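For reference, the F1 score quoted above is the harmonic mean of precision and recall. The sketch below computes it on toy labels; the data are made up, and treating signal quality as a binary good/bad label is an assumption made here, not something the abstract states.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 1 = good signal, 0 = degraded signal (toy labels).
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)
```

Here precision and recall are both 2/3, so the F1 score is also 2/3.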
[AI-206] Patents as Knowledge Artifacts: An Information Science Perspective on Global Innovation
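Quick read: The paper examines new challenges facing the traditional patent system under rapid technological change, including disputes over inventorship raised by generative AI, ownership and ethical dilemmas in biotechnology, and the use of patents as a strategic tool in international competition. The key to its proposal is to reframe patents as knowledge artifacts: establish structured metadata standards, improve retrieval systems for efficient access to prior work, strengthen ethical scrutiny of hidden relationships in innovation ecosystems, and have information professionals and policymakers work together on a collaborative, transparent, and ethically grounded approach to knowledge management that promotes equitable access to innovation resources.
Link: https://arxiv.org/abs/2508.00871
Authors: M. S. Rajeevan,B. Mini Devi
Affiliations: Unknown
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 8 pages. This is a preprint version of the paper titled “Patents as Knowledge Artifacts: An Information Science Perspective on Global Innovation”. Not peer-reviewed. Feedback welcome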
Abstract:In an age of fast-paced technological change, patents have evolved into not only legal mechanisms of intellectual property, but also structured storage containers of knowledge full of metadata, categories, and formal innovation. This chapter proposes to reframe patents in the context of information science, by focusing on patents as knowledge artifacts, and by seeing patents as fundamentally tied to the global movement of scientific and technological knowledge. With a focus on three areas, the inventions of AIs, biotech patents, and international competition with patents, this work considers how new technologies are challenging traditional notions of inventorship, access, and moral this http URL chapter provides a critical analysis of AI’s implications for patent authorship and prior art searches, ownership issues arising from proprietary claims in biotechnology to ethical dilemmas, and the problem of using patents for strategic advantage in a global context of innovation competition. In this analysis, the chapter identifies the importance of organizing information, creating metadata standards about originality, implementing retrieval systems to access previous works, and ethical contemplation about patenting unseen relationships in innovation ecosystems. Ultimately, the chapter calls for a collaborative, transparent, and ethically-based approach in managing knowledge in the patenting environment, highlighting the role for information professionals and policy to contribute to access equity in innovation.
[AI-207] Better Recommendations: Validating AI-generated Subject Terms Through LOC Linked Data Service
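Quick read: The paper addresses inefficiency and cataloging backlogs in traditional library subject cataloging, in particular the limited accuracy and time-consuming manual work involved in applying the Library of Congress Subject Headings (LCSH). The key to the solution is a hybrid approach that combines generative AI with human validation, using the Library of Congress Linked Data Service to check AI-generated subject terms so that metadata creation becomes more efficient while precision and quality are preserved.
Link: https://arxiv.org/abs/2508.00867
Authors: Kwok Leong Tang,Yi Jiang
Affiliations: Unknown
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: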
Abstract:This article explores the integration of AI-generated subject terms into library cataloging, focusing on validation through the Library of Congress Linked Data Service. It examines the challenges of traditional subject cataloging under the Library of Congress Subject Headings system, including inefficiencies and cataloging backlogs. While generative AI shows promise in expediting cataloging workflows, studies reveal significant limitations in the accuracy of AI-assigned subject headings. The article proposes a hybrid approach combining AI technology with human validation through LOC Linked Data Service, aiming to enhance the precision, efficiency, and overall quality of metadata creation in library cataloging practices.
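The hybrid workflow described above can be pictured as a simple triage step: suggested terms that match an authorized heading are accepted, and the rest go to a cataloger. The sketch below is an assumption-laden illustration, not the article's implementation. An in-memory list stands in for the Library of Congress Linked Data Service, no network call is made, and every name is hypothetical.

```python
def normalize(term: str) -> str:
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.casefold().split())


def validate_terms(suggested, authorized_headings):
    """Split AI-suggested terms into accepted headings and terms needing review.

    authorized_headings is a local stand-in for an authority lookup.
    Accepted terms are returned in their authorized spelling.
    """
    index = {normalize(h): h for h in authorized_headings}
    accepted, needs_review = [], []
    for term in suggested:
        match = index.get(normalize(term))
        if match is not None:
            accepted.append(match)
        else:
            needs_review.append(term)
    return accepted, needs_review


authorized = ["Machine learning", "Cataloging"]
accepted, needs_review = validate_terms(
    ["machine  learning", "AI cataloguing"], authorized
)
```

Exact matching after normalization is deliberately strict: anything the lookup cannot confirm is routed to a human rather than silently accepted.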
[AI-208] Deploying Geospatial Foundation Models in the Real World: Lessons from WorldCereal
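Quick read: The paper addresses the difficulty of deploying geospatial foundation models in real remote sensing applications: despite strong results on standard benchmarks, they are rarely adopted in operational mapping systems because they struggle with real-world data heterogeneity, resource constraints, and application-specific requirements. The key to the solution is a structured three-step integration protocol: define the application requirements, adapt the pre-trained model to domain-specific data, and conduct rigorous empirical testing. Fine-tuning the Presto model in a crop-mapping case study shows that the approach significantly outperforms conventional supervised methods and generalizes well across space and time, providing a replicable framework and practical path for operationalizing foundation models in diverse remote sensing tasks.
Link: https://arxiv.org/abs/2508.00858
Authors: Christina Butsko,Kristof Van Tricht,Gabriel Tseng,Giorgia Milli,David Rolnick,Ruben Cartuyvels,Inbal Becker Reshef,Zoltan Szantoi,Hannah Kerner
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: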
Abstract:The increasing availability of geospatial foundation models has the potential to transform remote sensing applications such as land cover classification, environmental monitoring, and change detection. Despite promising benchmark results, the deployment of these models in operational settings is challenging and rare. Standardized evaluation tasks often fail to capture real-world complexities relevant for end-user adoption such as data heterogeneity, resource constraints, and application-specific requirements. This paper presents a structured approach to integrate geospatial foundation models into operational mapping systems. Our protocol has three key steps: defining application requirements, adapting the model to domain-specific data and conducting rigorous empirical testing. Using the Presto model in a case study for crop mapping, we demonstrate that fine-tuning a pre-trained model significantly improves performance over conventional supervised methods. Our results highlight the model’s strong spatial and temporal generalization capabilities. Our protocol provides a replicable blueprint for practitioners and lays the groundwork for future research to operationalize foundation models in diverse remote sensing applications. Application of the protocol to the WorldCereal global crop-mapping system showcases the framework’s scalability.
[AI-209] EthicAlly: a Prototype for AI-Powered Research Ethics Support for the Social Sciences and Humanities
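Quick read: The paper addresses missing or inadequate ethics review support in the social sciences and humanities, especially in parts of Europe and in low- and middle-income countries, where the lack of ethics training and support tailored to non-biomedical research methods makes it hard for researchers to access a Research Ethics Committee (REC) with the relevant expertise. The key to the solution is EthicAlly, an ethics support system based on generative AI. Its core innovation is to combine constitutional AI technology with a collaborative prompt development methodology, integrating universal ethics principles with contextual and interpretive considerations in a structured assessment. This supports researchers in ethical research design and in preparing REC submissions, and eases the burden on institutional RECs without replacing human ethical judgment.
Link: https://arxiv.org/abs/2508.00856
Authors: Steph Grohmann
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: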
Abstract:In biomedical science, review by a Research Ethics Committee (REC) is an indispensable way of protecting human subjects from harm. However, in social science and the humanities, mandatory ethics compliance has long been met with scepticism as biomedical models of ethics can map poorly onto methodologies involving complex socio-political and cultural considerations. As a result, tailored ethics training and support as well as access to RECs with the necessary expertise is lacking in some areas, including parts of Europe and low- and middle-income countries. This paper suggests that Generative AI can meaningfully contribute to closing these gaps, illustrating this claim by presenting EthicAlly, a proof-of-concept prototype for an AI-powered ethics support system for social science and humanities researchers. Drawing on constitutional AI technology and a collaborative prompt development methodology, EthicAlly provides structured ethics assessment that incorporates both universal ethics principles and contextual and interpretive considerations relevant to most social science research. In supporting researchers in ethical research design and preparation for REC submission, this kind of system can also contribute to easing the burden on institutional RECs, without attempting to automate or replace human ethical oversight.
[AI-210] A Formal Framework for the Definition of State: Hierarchical Representation and Meta-Universe Interpretation
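Quick read: The paper addresses the long-standing lack of a unified mathematical formalization of the concept of 'state' across interdisciplinary systems (such as definitions of intelligence, formal logic, and scientific theory), with the aim of providing a solid theoretical foundation for an axiomatic definition of intelligence. The key to the solution is a 'hierarchical state grid' built on two axes, state depth and mapping hierarchy, which serves as a unified notation across mathematical, physical, and linguistic domains, together with an 'Intermediate Meta-Universe (IMU)' that explicitly describes the definers (the researchers themselves) and the languages they use, allowing conscious meta-level operations without self-reference paradoxes or logical inconsistency. Building on this, the paper extends inter-universal theory to linguistic translation and agent integration, distinguishes macrocosm-inter-universal from microcosm-inter-universal operations for greater expressivity, and arrives at a meta-formal logical framework grounded in the principle 'definition = state' that spans time, language, agents, and operations.
Link: https://arxiv.org/abs/2508.00853
Authors: Kei Itoh
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 43 pages, 8 figures, 8 Tables, in English, in Japanese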
Abstract:This study aims to reinforce the theoretical foundation for diverse systems–including the axiomatic definition of intelligence–by introducing a mathematically rigorous and unified formal structure for the concept of ‘state,’ which has long been used without consensus or formal clarity. First, a ‘hierarchical state grid’ composed of two axes–state depth and mapping hierarchy–is proposed to provide a unified notational system applicable across mathematical, physical, and linguistic domains. Next, the ‘Intermediate Meta-Universe (IMU)’ is introduced to enable explicit descriptions of definers (ourselves) and the languages we use, thereby allowing conscious meta-level operations while avoiding self-reference and logical inconsistency. Building on this meta-theoretical foundation, this study expands inter-universal theory beyond mathematics to include linguistic translation and agent integration, introducing the conceptual division between macrocosm-inter-universal and microcosm-inter-universal operations for broader expressivity. Through these contributions, this paper presents a meta-formal logical framework–grounded in the principle of definition = state–that spans time, language, agents, and operations, providing a mathematically robust foundation applicable to the definition of intelligence, formal logic, and scientific theory at large.
[AI-211] Gearshift Fellowship: A Next-Generation Neurocomputational Game Platform to Model and Train Human-AI Adaptability
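Quick read: The paper asks how to identify and model the mechanisms by which humans and artificial agents adjust their behavior across contexts in dynamic environments, in particular when to persist, when to let go, and when to shift strategy. The key to the solution is Gearshift Fellowship (GF), a prototype of a new 'Supertask' paradigm at the intersection of cognitive neuroscience, computational psychiatry, economics, and artificial intelligence. GF combines computational neurocognitive modeling with serious gaming to build a programmable, dynamic, multi-mission environment in which experimental parameters can be controlled precisely, so that individual differences at the perceptual-decision, learning, and meta-cognitive levels can be resolved and the evolution of cognitive-affective control, learning styles, strategy use, and motivational shifts across contexts and over time can be studied. This makes GF an experimental platform for scientists, a phenotype-to-mechanism bridge for clinical intervention, and a training tool for strengthening self-regulated learning, mood, and stress resilience.
Link: https://arxiv.org/abs/2508.00850
Authors: Nadja R. Ging-Jehli,Russell K. Childers,Joshua Lu,Robert Gemma,Rachel Zhu
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY)
Comments: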
Abstract:How do we learn when to persist, when to let go, and when to shift gears? Gearshift Fellowship (GF) is the prototype of a new Supertask paradigm designed to model how humans and artificial agents adapt to shifting environment demands. Grounded in cognitive neuroscience, computational psychiatry, economics, and artificial intelligence, Supertasks combine computational neurocognitive modeling with serious gaming. This creates a dynamic, multi-mission environment engineered to assess mechanisms of adaptive behavior across cognitive and social contexts. Computational parameters explain behavior and probe mechanisms by controlling the game environment. Unlike traditional tasks, GF enables neurocognitive modeling of individual differences across perceptual decisions, learning, and meta-cognitive levels. This positions GF as a flexible testbed for understanding how cognitive-affective control processes, learning styles, strategy use, and motivational shifts adapt across contexts and over time. It serves as an experimental platform for scientists, a phenotype-to-mechanism intervention for clinicians, and a training tool for players aiming to strengthen self-regulated learning, mood, and stress resilience. Online study (n = 60, ongoing) results show that GF recovers effects from traditional neuropsychological tasks (construct validity), uncovers novel patterns in how learning differs across contexts and how clinical features map onto distinct adaptations. These findings pave the way for developing in-game interventions that foster self-efficacy and agency to cope with real-world stress and uncertainty. GF builds a new adaptive ecosystem designed to accelerate science, transform clinical care, and foster individual growth. It offers a mirror and training ground where humans and machines co-develop together deeper flexibility and awareness.
[AI-212] Cognitive Exoskeleton: Augmenting Human Cognition with an AI-Mediated Intelligent Visual Feedback
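Quick read: The paper addresses the double-edged effect of time pressure feedback in a math arithmetic task: by regulating attention and anxiety, time pressure can either improve or harm user performance, which makes stable optimization difficult. The key to the solution is a dual deep reinforcement learning (dual-DRL) framework in which a regulation agent is trained by interacting with a second DRL agent that simulates user cognitive behavior, so that the time pressure feedback policy adapts to the user's real-time performance. This eases the training bottleneck of conventional DRL, which would otherwise require large amounts of data and iterative user studies.
Link: https://arxiv.org/abs/2508.00846
Authors: Songlin Xu,Xinyu Zhang
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: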
Abstract:In this paper, we introduce an AI-mediated framework that can provide intelligent feedback to augment human cognition. Specifically, we leverage deep reinforcement learning (DRL) to provide adaptive time pressure feedback to improve user performance in a math arithmetic task. Time pressure feedback could either improve or deteriorate user performance by regulating user attention and anxiety. Adaptive time pressure feedback controlled by a DRL policy according to users’ real-time performance could potentially solve this trade-off problem. However, the DRL training and hyperparameter tuning may require large amounts of data and iterative user studies. Therefore, we propose a dual-DRL framework that trains a regulation DRL agent to regulate user performance by interacting with another simulation DRL agent that mimics user cognition behaviors from an existing dataset. Our user study demonstrates the feasibility and effectiveness of the dual-DRL framework in augmenting user performance, in comparison to the baseline group.
[AI-213] Exploring Agentic Artificial Intelligence Systems: Towards a Typological Framework
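Quick read: The paper addresses the lack of a unified framework for classifying and comparing agentic AI systems as their autonomy, reasoning ability, and level of interaction with the environment increase. The key to the solution is a typology with eight ordinal dimensions that characterize different levels of cognitive and environmental agency. The typology is constructed and refined through a multi-phase methodology, evaluated, and distilled into constructed types, giving researchers and practitioners a structured tool for assessing the agency of existing systems and anticipating future developments in agentic AI.
Link: https://arxiv.org/abs/2508.00844
Authors: Christopher Wissuchek,Patrick Zschech
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA); General Economics (econ.GN)
Comments: Preprint accepted for archival and presentation at the Pacific-Asia Conference on Information Systems (PACIS) 2025, Kuala Lumpur, Malaysia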
Abstract:Artificial intelligence (AI) systems are evolving beyond passive tools into autonomous agents capable of reasoning, adapting, and acting with minimal human intervention. Despite their growing presence, a structured framework is lacking to classify and compare these systems. This paper develops a typology of agentic AI systems, introducing eight dimensions that define their cognitive and environmental agency in an ordinal structure. Using a multi-phase methodological approach, we construct and refine this typology, which is then evaluated through a human-AI hybrid approach and further distilled into constructed types. The framework enables researchers and practitioners to analyze varying levels of agency in AI systems. By offering a structured perspective on the progression of AI capabilities, the typology provides a foundation for assessing current systems and anticipating future developments in agentic AI.
[AI-214] Generative AI for CAD Automation: Leveraging Large Language Models for 3D Modelling
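Quick read: The paper addresses the problem that traditional computer-aided design (CAD) workflows are complex and depend on specialized sketching skills, which hinders rapid prototyping and generative design. The key to the solution is a framework that integrates a large language model (LLM) with FreeCAD: the LLM generates an initial CAD script from a natural language description, and the script is then executed and refined iteratively based on error feedback. The approach works well for simple to moderately complex designs, but highly constrained models still require multiple refinements, which points to the need for improved memory retrieval, adaptive prompt engineering, and hybrid AI techniques to make the scripts more robust.
Link: https://arxiv.org/abs/2508.00843
Authors: Sumit Kumar,Sarthak Kapoor,Harsh Vardhan,Yao Zhao
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: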
Abstract:Large Language Models (LLMs) are revolutionizing industries by enhancing efficiency, scalability, and innovation. This paper investigates the potential of LLMs in automating Computer-Aided Design (CAD) workflows by integrating FreeCAD with an LLM as a CAD design tool. Traditional CAD processes are often complex and require specialized sketching skills, posing challenges for rapid prototyping and generative design. We propose a framework where LLMs generate initial CAD scripts from natural language descriptions, which are then executed and refined iteratively based on error feedback. Through a series of experiments with increasing complexity, we assess the effectiveness of this approach. Our findings reveal that LLMs perform well for simple to moderately complex designs but struggle with highly constrained models, necessitating multiple refinements. The study highlights the need for improved memory retrieval, adaptive prompt engineering, and hybrid AI techniques to enhance script robustness. Future directions include integrating cloud-based execution and exploring advanced LLM capabilities to further streamline CAD automation. This work underscores the transformative potential of LLMs in design workflows while identifying critical areas for future development.
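The generate, execute, and refine loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's framework: `generate_script` is a hypothetical callable standing in for the LLM, and plain `exec()` in an empty namespace stands in for running the script inside FreeCAD.

```python
def refine_until_valid(description, generate_script, max_rounds=3):
    """Generate a script, run it, and feed any error message back to the generator.

    generate_script(description, feedback) is a hypothetical stand-in for the LLM;
    feedback is None on the first round, then the most recent error message.
    Returns (script, rounds_used).
    """
    feedback = None
    for round_number in range(1, max_rounds + 1):
        script = generate_script(description, feedback)
        try:
            # Assumption: exec() replaces execution inside the CAD tool.
            # Real generated code should run in a sandboxed process.
            exec(compile(script, "<generated>", "exec"), {})
        except Exception as exc:
            feedback = f"{type(exc).__name__}: {exc}"
        else:
            return script, round_number
    raise RuntimeError(
        f"no valid script after {max_rounds} rounds; last error: {feedback}"
    )
```

Capping the number of rounds matters in practice: the abstract notes that highly constrained models need multiple refinements, so a loop without a limit could keep retrying indefinitely.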
[AI-215] PCS Workflow for Veridical Data Science in the Age of AI
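Quick read: The paper addresses the uncertainty introduced by the many choices made throughout the data science life cycle (DSLC), which makes many AI-driven data analysis results hard to replicate and which traditional statistical frameworks often fail to quantify or manage. The key to the solution is the Predictability-Computability-Stability (PCS) framework, a principled approach to veridical (truthful) data science for identifying, quantifying, and mitigating uncertainty at each stage of the DSLC. The paper streamlines the PCS workflow for practitioners, adds guided use of generative AI, and uses a case study to show how judgment calls in the data cleaning stage affect the uncertainty of downstream predictions.
Link: https://arxiv.org/abs/2508.00835
Authors: Zachary T. Rewolinski,Bin Yu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments: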
Abstract:Data science is a pillar of artificial intelligence (AI), which is transforming nearly every domain of human activity, from the social and physical sciences to engineering and medicine. While data-driven findings in AI offer unprecedented power to extract insights and guide decision-making, many are difficult or impossible to replicate. A key reason for this challenge is the uncertainty introduced by the many choices made throughout the data science life cycle (DSLC). Traditional statistical frameworks often fail to account for this uncertainty. The Predictability-Computability-Stability (PCS) framework for veridical (truthful) data science offers a principled approach to addressing this challenge throughout the DSLC. This paper presents an updated and streamlined PCS workflow, tailored for practitioners and enhanced with guided use of generative AI. We include a running example to display the PCS framework in action, and conduct a related case study which showcases the uncertainty in downstream predictions caused by judgment calls in the data cleaning stage.
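One way to picture the stability idea is to rerun the same analysis under several defensible data-cleaning choices and see how far the prediction moves. The sketch below is an illustration under stated assumptions, not the PCS workflow itself: the three missing-value strategies, the mean predictor, and the function names are all hypothetical.

```python
from statistics import mean, median

STRATEGIES = ("drop", "mean", "median")


def clean(values, strategy):
    """Handle missing entries (None) by dropping or imputing them."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown cleaning strategy: {strategy}")
    observed = [v for v in values if v is not None]
    if strategy == "drop":
        return observed
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


def prediction_spread(values, strategies=STRATEGIES):
    """Predict with a simple mean under each cleaning choice and report the range."""
    predictions = {s: mean(clean(values, s)) for s in strategies}
    return predictions, max(predictions.values()) - min(predictions.values())


# One missing measurement, three reasonable ways to handle it.
predictions, spread = prediction_spread([1.0, 2.0, None, 9.0])
```

On this toy series the prediction is 4.0 under two of the choices and 3.5 under median imputation, so a single judgment call in cleaning accounts for a spread of 0.5.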
[AI-216] Bike-Bench: A Bicycle Design Benchmark for Generative Models with Objectives and Constraints
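Quick read: The paper addresses the lack of ways to evaluate generative AI on engineering design tasks with multiple objectives and constraints, with a focus on how well models understand and satisfy physical laws, human guidelines, and hard constraints. The key to the solution is Bike-Bench, an engineering design benchmark for bicycle design. It quantifies a range of human-centered and multiphysics performance characteristics (such as aerodynamics, ergonomics, structural mechanics, and usability) and provides multimodal datasets, including 10K human-rated assessments and 1.4M synthetic designs with CAD/XML, SVG, and PNG representations, enabling side-by-side evaluation of tabular generative models, large language models (LLMs), design optimization, and hybrid algorithms. Experiments indicate that LLMs and tabular generative models fall short of optimization and optimization-augmented methods in both validity and optimality, leaving significant room for improvement.
Link: https://arxiv.org/abs/2508.00830
Authors: Lyle Regenwetter,Yazan Abu Obaideh,Fabien Chiotti,Ioanna Lykourentzou,Faez Ahmed
Affiliations: Unknown
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: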
Abstract:We introduce Bike-Bench, an engineering design benchmark for evaluating generative models on problems with multiple real-world objectives and constraints. As generative AI’s reach continues to grow, evaluating its capability to understand physical laws, human guidelines, and hard constraints grows increasingly important. Engineering product design lies at the intersection of these difficult tasks, providing new challenges for AI capabilities. Bike-Bench evaluates AI models’ capability to generate designs that not only resemble the dataset, but meet specific performance objectives and constraints. To do so, Bike-Bench quantifies a variety of human-centered and multiphysics performance characteristics, such as aerodynamics, ergonomics, structural mechanics, human-rated usability, and similarity to subjective text or image prompts. Supporting the benchmark are several datasets of simulation results, a dataset of 10K human-rated bicycle assessments, and a synthetically-generated dataset of 1.4M designs, each with a parametric, CAD/XML, SVG, and PNG representation. Bike-Bench is uniquely configured to evaluate tabular generative models, LLMs, design optimization, and hybrid algorithms side-by-side. Our experiments indicate that LLMs and tabular generative models fall short of optimization and optimization-augmented generative models in both validity and optimality scores, suggesting significant room for improvement. We hope Bike-Bench, a first-of-its-kind benchmark, will help catalyze progress in generative AI for constrained multi-objective engineering design problems. Code, data, and other resources are published at this http URL.
[AI-217] A Schema.org Mapping for Brazilian Legal Norms: Toward Interoperable Legal Graphs and Open Government Data
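Quick read: The paper addresses the insufficient machine readability of legal norms, which limits the use of Open Government Data (OGD) in Legal Tech applications, focusing on the structured and semantic representation of Brazilian legislative data. The key to the solution is a unified mapping based on JSON-LD and Linked Data: the conceptual 'Norm' entity of Brazilian legislation is mapped to the Schema.org type sdo:Legislation, and its digital publications (such as the text in the Official Journal) are mapped to sdo:LegislationObject. The mapping defines key properties, including URN identifiers (per the LexML standard), multilingual support, versioning, and inter-norm relationships (such as citations and references), improving the quality and interoperability of Brazilian legal data and its integration into the global OGD ecosystem.
Link: https://arxiv.org/abs/2508.00827
Authors: Hudson de Martim
Affiliations: Unknown
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: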
Abstract:Open Government Data (OGD) initiatives aim to enhance transparency and public participation by making government data openly accessible. However, structuring legal norms for machine readability remains a critical challenge for advancing Legal Tech applications such as Legal Knowledge Graphs (LKGs). Focusing on the this http URL portal initiative by the Brazilian National Congress, we propose a unified mapping of Brazilian legislation to the Schema.org vocabulary via JSON-LD and Linked Data. Our approach covers both the conceptual “Norm” entity (mapped to sdo:Legislation) and its digital publications or manifestations (mapped to sdo:LegislationObject). We detail key properties for each type, providing concrete examples and considering URN identifiers (per the LexML standard), multilingual support, versioning in the Official Journal, and inter-norm relationships (e.g., citations and references). Our structured schema improves the quality and interoperability of Brazilian legal data, fosters integration within the global OGD ecosystem, and facilitates the creation of a wor
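To give a feel for what such a mapping can look like, the sketch below builds a small JSON-LD document with a Legislation node and one LegislationObject linked through `encoding`. It is an illustration, not the mapping proposed in the paper: the choice of properties, the use of `encoding` as the link, the sample URN, and the function names are assumptions made here, although the types and property names themselves come from the Schema.org vocabulary.

```python
import json


def build_norm_jsonld():
    """Return a JSON-LD dict for one norm and one of its digital publications."""
    # Illustrative LexML-style URN for a federal law; treat it as a sample value.
    urn = "urn:lex:br:federal:lei:2002-01-10;10406"
    return {
        "@context": "https://schema.org",
        "@type": "Legislation",
        "@id": urn,
        "legislationIdentifier": urn,
        "name": "Lei nº 10.406, de 10 de janeiro de 2002",
        "legislationDate": "2002-01-10",
        "inLanguage": "pt-BR",
        # One manifestation of the norm, e.g. an HTML rendering of the text.
        "encoding": {
            "@type": "LegislationObject",
            "encodingFormat": "text/html",
            "inLanguage": "pt-BR",
            "legislationLegalValue": "https://schema.org/OfficialLegalValue",
        },
    }


def to_jsonld(document) -> str:
    """Serialize while keeping non-ASCII characters readable."""
    return json.dumps(document, ensure_ascii=False, indent=2)


serialized = to_jsonld(build_norm_jsonld())
```

Keeping the conceptual norm and its publication as separate nodes lets several manifestations (for example HTML and PDF) hang off the same identifier.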
[AI-218] Large Language Models for Wireless Communications: From Adaptation to Autonomy
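Quick read: The paper addresses the pressing need for intelligent, adaptive solutions in wireless communication systems that operate in increasingly complex and dynamic environments. Traditional methods struggle to meet demands for high spectral efficiency, low latency, and high reliability, while generative AI, and large language models (LLMs) in particular, offer strong reasoning, generalization, and zero-shot learning capabilities that open a new path toward intelligent wireless networks. The article organizes the solution around three directions: adapting pretrained LLMs to core communication tasks (such as channel estimation and resource scheduling), developing wireless-specific foundation models that balance versatility and computational efficiency, and building agentic LLMs with autonomous reasoning and coordination capabilities.
Link: https://arxiv.org/abs/2507.21524
Authors: Le Liang,Hao Ye,Yucheng Sheng,Ouya Wang,Jiacheng Wang,Shi Jin,Geoffrey Ye Li
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: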
Abstract:The emergence of large language models (LLMs) has revolutionized artificial intelligence, offering unprecedented capabilities in reasoning, generalization, and zero-shot learning. These strengths open new frontiers in wireless communications, where increasing complexity and dynamics demand intelligent and adaptive solutions. This article explores the role of LLMs in transforming wireless systems across three key directions: adapting pretrained LLMs for core communication tasks, developing wireless-specific foundation models to balance versatility and efficiency, and enabling agentic LLMs with autonomous reasoning and coordination capabilities. We highlight recent advances, practical case studies, and the unique benefits of LLM-based approaches over traditional methods. Finally, we outline open challenges and research opportunities, including multimodal fusion, collaboration with lightweight models, and self-improving capabilities, charting a path toward intelligent, adaptive, and autonomous wireless networks of the future.
[AI-219] An Efficient Continuous-Time MILP for Integrated Aircraft Hangar Scheduling and Layout
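[Quick Read]: This paper addresses the complex spatio-temporal coordination problem in aircraft maintenance hangars, namely how to jointly optimize maintenance scheduling and parking-position allocation across both time and space. The key to the solution is a novel continuous-time mixed-integer linear programming (MILP) model that treats time as a continuous variable rather than as discrete periods, overcoming the scalability bottleneck of traditional discrete-time methods. The model quickly solves instances with up to 25 aircraft to proven optimality, still delivers high-quality near-optimal solutions for large-scale cases of up to 40 aircraft, and significantly outperforms the heuristic, demonstrating substantial economic value and support for managerial decision-making.
Link: https://arxiv.org/abs/2508.02640
Authors: Shayan Farhang Pazhooh(1),Hossein Shams Shemirani(2) ((1) Department of Industrial Engineering, Isfahan University of Technology, Isfahan, Iran, (2) Industrial Engineering Group, Golpayegan College of Engineering, Isfahan University of Technology, Golpayegan, Iran)
Affiliation: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 35 pages, 7 figures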
Abstract:Efficient management of aircraft maintenance hangars is a critical operational challenge, involving complex, interdependent decisions regarding aircraft scheduling and spatial allocation. This paper introduces a novel continuous-time mixed-integer linear programming (MILP) model to solve this integrated spatio-temporal problem. By treating time as a continuous variable, our formulation overcomes the scalability limitations of traditional discrete-time approaches. The performance of the exact model is benchmarked against a constructive heuristic, and its practical applicability is demonstrated through a custom-built visualization dashboard. Computational results are compelling: the model solves instances with up to 25 aircraft to proven optimality, often in mere seconds, and for large-scale cases of up to 40 aircraft, delivers high-quality solutions within known optimality gaps. In all tested scenarios, the resulting solutions consistently and significantly outperform the heuristic, which highlights the framework’s substantial economic benefits and provides valuable managerial insights into the trade-off between solution time and optimality.
[AI-220] ByteGen: A Tokenizer-Free Generative Model for Orderbook Events in Byte Space
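[Quick Read]: This paper addresses generative modeling of high-frequency limit order book (LOB) dynamics, a critical yet unsolved challenge in quantitative finance. Traditional methods are constrained by simplifying stochastic assumptions, while modern deep learning models (such as Transformers) depend on discretization and binning, which introduce feature-engineering and tokenization biases. The key to the solution is ByteGen, an end-to-end generative framework that models the raw byte stream of LOB events directly. It frames the problem as an autoregressive next-byte prediction task and designs a 32-byte packed binary format that represents market messages without information loss. It also adopts the H-Net architecture (a hybrid Mamba-Transformer model), whose dynamic chunking mechanism discovers the inherent structure of market messages automatically, eliminating manual feature engineering and tokenization entirely so that the model learns complex market dynamics from the most fundamental data representation.
Link: https://arxiv.org/abs/2508.02247
Authors: Yang Li,Zhi Chen
Affiliation: Unknown
Categories: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
Comments: 21 pages, 3 tables, 5 figures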
Abstract:Generative modeling of high-frequency limit order book (LOB) dynamics is a critical yet unsolved challenge in quantitative finance, essential for robust market simulation and strategy backtesting. Existing approaches are often constrained by simplifying stochastic assumptions or, in the case of modern deep learning models like Transformers, rely on tokenization schemes that affect the high-precision, numerical nature of financial data through discretization and binning. To address these limitations, we introduce ByteGen, a novel generative model that operates directly on the raw byte streams of LOB events. Our approach treats the problem as an autoregressive next-byte prediction task, for which we design a compact and efficient 32-byte packed binary format to represent market messages without information loss. The core novelty of our work is the complete elimination of feature engineering and tokenization, enabling the model to learn market dynamics from its most fundamental representation. We achieve this by adapting the H-Net architecture, a hybrid Mamba-Transformer model that uses a dynamic chunking mechanism to discover the inherent structure of market messages without predefined rules. Our primary contributions are: 1) the first end-to-end, byte-level framework for LOB modeling; 2) an efficient packed data representation; and 3) a comprehensive evaluation on high-frequency data. Trained on over 34 million events from CME Bitcoin futures, ByteGen successfully reproduces key stylized facts of financial markets, generating realistic price distributions, heavy-tailed returns, and bursty event timing. Our findings demonstrate that learning directly from byte space is a promising and highly flexible paradigm for modeling complex financial systems, achieving competitive performance on standard market quality metrics without the biases of tokenization.
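To make the idea of a fixed-width packed message concrete, here is a minimal Python sketch of a 32-byte little-endian record. The field layout (`timestamp_ns`, `order_id`, `price_ticks`, `size`, `side`, `event_type`, plus two padding bytes) is a hypothetical assumption for illustration only; the abstract does not specify ByteGen's actual format.

```python
import struct

# Hypothetical 32-byte layout (NOT the paper's actual format):
#   Q  timestamp_ns  (8 bytes, unsigned)
#   Q  order_id      (8 bytes, unsigned)
#   q  price_ticks   (8 bytes, signed integer ticks, avoids float rounding)
#   I  size          (4 bytes, unsigned)
#   B  side          (1 byte: 0 = bid, 1 = ask)
#   B  event_type    (1 byte: e.g. 0 = add, 1 = cancel, 2 = trade)
#   2x padding       (2 bytes, brings the record to exactly 32 bytes)
LOB_EVENT = struct.Struct("<QQqIBB2x")


def pack_event(timestamp_ns, order_id, price_ticks, size, side, event_type):
    """Serialize one event into a fixed 32-byte record."""
    return LOB_EVENT.pack(timestamp_ns, order_id, price_ticks, size, side, event_type)


def unpack_event(raw):
    """Recover the six fields from a 32-byte record."""
    return LOB_EVENT.unpack(raw)


event = (1_722_816_000_000_000_000, 42, 11_350_025, 3, 1, 0)
raw = pack_event(*event)
```

Because every field is stored as an integer, packing and unpacking round-trip exactly, and a byte-level model would see each message as a sequence of 32 values in the range 0 to 255.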
[AI-221] Enhancement of Quantum Semi-Supervised Learning via Improved Laplacian and Poisson Methods
[Quick Read]: This paper addresses the limited performance of graph-based semi-supervised learning when labeled data is scarce. The key to the solution is two improved quantum models, Improved Laplacian Quantum Semi-Supervised Learning (ILQSSL) and Improved Poisson Quantum Semi-Supervised Learning (IPQSSL), which incorporate advanced label propagation strategies into variational quantum circuits and use QR decomposition to embed graph structure directly into quantum states, improving learning efficiency in low-label settings. Experiments show that both methods outperform leading classical semi-supervised learning algorithms on several benchmark datasets. An analysis of entanglement entropy and Randomized Benchmarking (RB) further reveals how circuit depth and qubit count affect model performance, yielding actionable design principles for applying quantum machine learning to data-efficient classification.
Link: https://arxiv.org/abs/2508.02054
Authors: Hamed Gholipour,Farid Bozorgnia,Hamzeh Mohammadigheymasi,Kailash Hambarde,Javier Mancilla,Hugo Proenca,Joao Neves,Moharram Challenger
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper develops a hybrid quantum approach for graph-based semi-supervised learning to enhance performance in scenarios where labeled data is scarce. We introduce two enhanced quantum models, the Improved Laplacian Quantum Semi-Supervised Learning (ILQSSL) and the Improved Poisson Quantum Semi-Supervised Learning (IPQSSL), that incorporate advanced label propagation strategies within variational quantum circuits. These models utilize QR decomposition to embed graph structure directly into quantum states, thereby enabling more effective learning in low-label settings. We validate our methods on four benchmark datasets (Iris, Wine, Heart Disease, and German Credit Card) and show that both ILQSSL and IPQSSL consistently outperform leading classical semi-supervised learning algorithms, particularly under limited supervision. Beyond standard performance metrics, we examine the effect of circuit depth and qubit count on learning quality by analyzing entanglement entropy and Randomized Benchmarking (RB). Our results suggest that while some level of entanglement improves the model’s ability to generalize, increased circuit complexity may introduce noise that undermines performance on current quantum hardware. Overall, the study highlights the potential of quantum-enhanced models for semi-supervised learning, offering practical insights into how quantum circuits can be designed to balance expressivity and stability. These findings support the role of quantum machine learning in advancing data-efficient classification, especially in applications constrained by label availability and hardware limitations.
[AI-222] ACT-Tensor: Tensor Completion Framework for Financial Dataset Imputation
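[Quick Read]: This paper addresses severe and heterogeneous missingness in financial panel data, which undermines asset-pricing models and degrades investment strategies. Conventional imputation methods often fail because they ignore the data's multi-dimensional structure (firms, time, and financial variables), cope poorly with complex missingness patterns, or overfit under extreme sparsity. The key to the solution is an Adaptive, Cluster-based Temporal smoothing tensor completion framework (ACT-Tensor) with two core innovations: a cluster-based completion module that captures cross-sectional heterogeneity by learning group-specific latent structures, and a temporal smoothing module that proactively removes short-lived noise while preserving long-term fundamental trends. Experiments show that ACT-Tensor consistently outperforms state-of-the-art benchmarks across a range of missing-data regimes and improves the risk-adjusted returns of portfolios built in an asset-pricing pipeline, providing accurate and informative imputations for financial decision-making.
Link: https://arxiv.org/abs/2508.01861
Authors: Junyi Mo,Jiayu Li,Duo Zhang,Elynn Chen
Affiliation: Unknown
Categories: Applications (stat.AP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: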
Abstract:Missing data in financial panels presents a critical obstacle, undermining asset-pricing models and reducing the effectiveness of investment strategies. Such panels are often inherently multi-dimensional, spanning firms, time, and financial variables, which adds complexity to the imputation task. Conventional imputation methods often fail by flattening the data’s multidimensional structure, struggling with heterogeneous missingness patterns, or overfitting in the face of extreme data sparsity. To address these limitations, we introduce an Adaptive, Cluster-based Temporal smoothing tensor completion framework (ACT-Tensor) tailored for severely and heterogeneously missing multi-dimensional financial data panels. ACT-Tensor incorporates two key innovations: a cluster-based completion module that captures cross-sectional heterogeneity by learning group-specific latent structures; and a temporal smoothing module that proactively removes short-lived noise while preserving slow-moving fundamental trends. Extensive experiments show that ACT-Tensor consistently outperforms state-of-the-art benchmarks in terms of imputation accuracy across a range of missing data regimes, including extreme sparsity scenarios. To assess its practical financial utility, we evaluate the imputed data with an asset-pricing pipeline tailored for tensor-structured financial data. Results show that ACT-Tensor not only reduces pricing errors but also significantly improves risk-adjusted returns of the constructed portfolio. These findings confirm that our method delivers highly accurate and informative imputations, offering substantial value for financial decision-making.
[AI-223] Deep Learning-Driven Prediction of Microstructure Evolution via Latent Space Interpolation
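[Quick Read]: This paper addresses the high computational cost of phase-field models, which must solve complex partial differential equations to simulate microstructure evolution. The key to the solution is a deep learning-based surrogate framework: a Conditional Variational Autoencoder (CVAE) learns compact latent representations of microstructures, cubic spline interpolation in the latent space predicts microstructure evolution for any unknown composition, and Spherical Linear Interpolation (SLERP) ensures morphological continuity over time and physically plausible coarsening behavior. The method significantly improves the efficiency of predicting microstructure evolution while retaining visual and statistical features that closely match the original phase-field simulations.
Link: https://arxiv.org/abs/2508.01822
Authors: Sachin Gaikwad,Thejas Kasilingam,Owais Ahmad,Rajdip Mukherjee,Somnath Bhowmick
Affiliation: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: 7 figures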
Abstract:Phase-field models accurately simulate microstructure evolution, but their dependence on solving complex differential equations makes them computationally expensive. This work achieves a significant acceleration via a novel deep learning-based framework, utilizing a Conditional Variational Autoencoder (CVAE) coupled with Cubic Spline Interpolation and Spherical Linear Interpolation (SLERP). We demonstrate the method for binary spinodal decomposition by predicting microstructure evolution for intermediate alloy compositions from a limited set of training compositions. First, using microstructures from phase-field simulations of binary spinodal decomposition, we train the CVAE, which learns compact latent representations that encode essential morphological features. Next, we use cubic spline interpolation in the latent space to predict microstructures for any unknown composition. Finally, SLERP ensures smooth morphological evolution with time that closely resembles coarsening. The predicted microstructures exhibit high visual and statistical similarity to phase-field simulations. This framework offers a scalable and efficient surrogate model for microstructure evolution, enabling accelerated materials design and composition optimization.
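As a reminder of what SLERP computes, here is a minimal pure-Python sketch of the standard formula applied to two vectors. How the paper applies it to CVAE latent codes (which endpoints, which time fractions) is not stated in the abstract, so the toy vectors below are assumptions for illustration.

```python
import math


def slerp(p0, p1, t):
    """Spherical linear interpolation between vectors p0 and p1 at fraction t in [0, 1]."""
    norm0 = math.sqrt(sum(a * a for a in p0))
    norm1 = math.sqrt(sum(b * b for b in p1))
    cos_omega = sum(a * b for a, b in zip(p0, p1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding outside [-1, 1]
    omega = math.acos(cos_omega)                # angle between the two vectors
    if omega < 1e-8:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(p0, p1)]
    if omega > math.pi - 1e-8:
        raise ValueError("SLERP is not uniquely defined for antiparallel vectors")
    s = math.sin(omega)
    w0 = math.sin((1 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(p0, p1)]


# Toy latent codes (hypothetical): halfway between two orthogonal unit vectors.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

For unit-length endpoints the interpolated vector stays on the unit sphere, which is the property that distinguishes SLERP from straight-line interpolation.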
[AI-224] Contrastive Multi-Task Learning with Solvent-Aware Augmentation for Drug Discovery
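[Quick Read]: This paper addresses two shortcomings of existing protein-ligand interaction prediction methods: their limited ability to capture solvent-dependent conformational changes and their inability to jointly learn multiple related tasks. The key to the solution is a pre-training method that uses ligand conformational ensembles generated under different solvent conditions as augmented input, so that the model learns structural flexibility and environmental context in a unified manner. Training combines molecular reconstruction, interatomic distance prediction, and contrastive learning to build solvent-invariant molecular representations, significantly improving binding affinity prediction, docking accuracy, and virtual screening performance.
Link: https://arxiv.org/abs/2508.01799
Authors: Jing Lan,Hexiao Ding,Hongzhao Chen,Yufeng Jiang,Ng Nga Chun,Gerald W.Y. Cheng,Zongxi Li,Jing Cai,Liang-ting Lin,Jung Sun Yoo
Affiliation: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures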
Abstract:Accurate prediction of protein-ligand interactions is essential for computer-aided drug discovery. However, existing methods often fail to capture solvent-dependent conformational changes and lack the ability to jointly learn multiple related tasks. To address these limitations, we introduce a pre-training method that incorporates ligand conformational ensembles generated under diverse solvent conditions as augmented input. This design enables the model to learn both structural flexibility and environmental context in a unified manner. The training process integrates molecular reconstruction to capture local geometry, interatomic distance prediction to model spatial relationships, and contrastive learning to build solvent-invariant molecular representations. Together, these components lead to significant improvements, including a 3.7% gain in binding affinity prediction, an 82% success rate on the PoseBusters Astex docking benchmarks, and an area under the curve of 97.1% in virtual screening. The framework supports solvent-aware, multi-task modeling and produces consistent results across benchmarks. A case study further demonstrates sub-angstrom docking accuracy with a root-mean-square deviation of 0.157 angstroms, offering atomic-level insight into binding mechanisms and advancing structure-based drug design.
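[AI-225] TensoMeta-VQC: A Tensor-Train-Guided Meta-Learning Framework for Robust and Scalable Variational Quantum Computing
[Quick Read]: This paper addresses two core scalability challenges of Variational Quantum Computing (VQC): barren plateaus, which cause vanishing gradients, and high sensitivity to quantum noise. The key to the solution is TensoMeta-VQC, a tensor-train (TT)-guided meta-learning framework whose core innovation is to delegate the generation of quantum circuit parameters entirely to a classical TT network, decoupling optimization from the quantum hardware. This parameterization mitigates vanishing gradients, enhances noise resilience through structured low-rank representations, and supports efficient gradient propagation, while theoretical analysis (based on the Neural Tangent Kernel and statistical learning theory) provides rigorous guarantees on approximation capability, optimization stability, and generalization performance.
Link: https://arxiv.org/abs/2508.01116
Authors: Jun Qi,Chao-Han Yang,Pin-Yu Chen,Min-Hsiu Hsieh
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: In submission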
Abstract:Variational Quantum Computing (VQC) faces fundamental barriers in scalability, primarily due to barren plateaus and quantum noise sensitivity. To address these challenges, we introduce TensoMeta-VQC, a novel tensor-train (TT)-guided meta-learning framework designed to improve the robustness and scalability of VQC significantly. Our framework fully delegates the generation of quantum circuit parameters to a classical TT network, effectively decoupling optimization from quantum hardware. This innovative parameterization mitigates gradient vanishing, enhances noise resilience through structured low-rank representations, and facilitates efficient gradient propagation. Based on Neural Tangent Kernel and statistical learning theory, our rigorous theoretical analyses establish strong guarantees on approximation capability, optimization stability, and generalization performance. Extensive empirical results across quantum dot classification, Max-Cut optimization, and molecular quantum simulation tasks demonstrate that TensoMeta-VQC consistently achieves superior performance and robust noise tolerance, establishing it as a principled pathway toward practical and scalable VQC on near-term quantum devices.
[AI-226] Accelerating multiparametric quantitative MRI using self-supervised scan-specific implicit neural representation with model reinforcement
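[Quick Read]: This paper addresses the difficulty of balancing reconstruction accuracy and efficiency in accelerated multiparametric quantitative MRI (qMRI). At high acceleration factors, existing methods often suffer from pronounced artifacts, biased parameter estimates, and high computational complexity, and they are especially limited in the joint estimation of multiple parameters such as the free water spin-lattice relaxation time (T1f), the tissue macromolecular proton fraction (MPF), and the magnetization exchange rate (kex). The key to the solution is REFINE-MORE, a self-supervised, scan-specific deep learning framework with three core innovations: (1) an implicit neural representation (INR) architecture that initializes the multiparametric quantitative maps by explicitly modeling spatiotemporal correlations; (2) a model reinforcement module that embeds MR physics constraints into the optimization to strengthen data consistency; and (3) a low-rank adaptation strategy that speeds up model convergence, yielding roughly a fivefold gain in computational efficiency. The method shows excellent reconstruction quality and robustness on both phantom and in vivo brain data, offering a flexible and efficient solution for high-dimensional accelerated qMRI.
Link: https://arxiv.org/abs/2508.00891
Authors: Ruimin Feng,Albert Jang,Xingxin He,Fang Liu
Affiliation: Unknown
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
Comments: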
Abstract:Purpose: To develop a self-supervised scan-specific deep learning framework for reconstructing accelerated multiparametric quantitative MRI (qMRI). Methods: We propose REFINE-MORE (REference-Free Implicit NEural representation with MOdel REinforcement), combining an implicit neural representation (INR) architecture with a model reinforcement module that incorporates MR physics constraints. The INR component enables informative learning of spatiotemporal correlations to initialize multiparametric quantitative maps, which are then further refined through an unrolled optimization scheme enforcing data consistency. To improve computational efficiency, REFINE-MORE integrates a low-rank adaptation strategy that promotes rapid model convergence. We evaluated REFINE-MORE on accelerated multiparametric quantitative magnetization transfer imaging for simultaneous estimation of free water spin-lattice relaxation, tissue macromolecular proton fraction, and magnetization exchange rate, using both phantom and in vivo brain data. Results: Under 4x and 5x accelerations on in vivo data, REFINE-MORE achieved superior reconstruction quality, demonstrating the lowest normalized root-mean-square error and highest structural similarity index compared to baseline methods and other state-of-the-art model-based and deep learning approaches. Phantom experiments further showed strong agreement with reference values, underscoring the robustness and generalizability of the proposed framework. Additionally, the model adaptation strategy improved reconstruction efficiency by approximately fivefold. Conclusion: REFINE-MORE enables accurate and efficient scan-specific multiparametric qMRI reconstruction, providing a flexible solution for high-dimensional, accelerated qMRI applications. 
Submission history: [v1] Sun, 27 Jul 2025 03:06:49 UTC (3,057 KB), from Ruimin Feng. DOI: https://doi.org/10.48550/arXiv.2508.00891
Machine Learning
[LG-0] LOST: Low-rank and Sparse Pre-training for Large Language Models
Link: https://arxiv.org/abs/2508.02668
Authors: Jiaxi Li,Lu Yin,Li Shen,Jinjin Xu,Liwu Xu,Tianjin Huang,Wenwu Wang,Shiwei Liu,Xilu Wang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:While large language models (LLMs) have achieved remarkable performance across a wide range of tasks, their massive scale incurs prohibitive computational and memory costs for pre-training from scratch. Recent studies have investigated the use of low-rank parameterization as a means of reducing model size and training cost. In this context, sparsity is often employed as a complementary technique to recover important information lost in low-rank compression by capturing salient features in the residual space. However, existing approaches typically combine low-rank and sparse components in a simplistic or ad hoc manner, often resulting in undesirable performance degradation compared to full-rank training. In this paper, we propose LOw-rank and Sparse pre-Training (LOST) for LLMs, a novel method that ingeniously integrates low-rank and sparse structures to enable effective training of LLMs from scratch under strict efficiency constraints. LOST applies singular value decomposition to weight matrices, preserving the dominant low-rank components, while allocating the remaining singular values to construct channel-wise sparse components to complement the expressiveness of low-rank training. We evaluate LOST on LLM pretraining ranging from 60M to 7B parameters. Our experiments show that LOST achieves competitive or superior performance compared to full-rank models, while significantly reducing both memory and compute overhead. Code is available at this https URL (LOST Repo).
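One way to read the decomposition described in this abstract is sketched below. The rank $r$, the residual $R$, and the channel mask $M$ are notation introduced here for illustration; the abstract does not give the exact construction, so this is an assumption about its general shape rather than the authors' definition.

```latex
\begin{align}
W &= U \Sigma V^{\top}
  && \text{(SVD of a weight matrix)} \\
W_{\mathrm{low}} &= U_{:,1:r}\, \Sigma_{1:r,1:r}\, V_{:,1:r}^{\top}
  && \text{(dominant rank-$r$ part)} \\
R &= W - W_{\mathrm{low}} = U_{:,r+1:}\, \Sigma_{r+1:,r+1:}\, V_{:,r+1:}^{\top}
  && \text{(residual from the remaining singular values)} \\
W &\approx W_{\mathrm{low}} + M \odot R
  && \text{($M$ keeps selected channels of $R$)}
\end{align}
```

Under this reading, the low-rank term carries most of the weight matrix at reduced parameter cost, and the channel-wise sparse term restores part of what truncation discards.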
[LG-1] CAK: Emergent Audio Effects from Minimal Deep Learning
Link: https://arxiv.org/abs/2508.02643
Authors: Austin Rockman
Categories: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 8 pages, 3 figures, code and other resources at this https URL
Abstract:We demonstrate that a single 3x3 convolutional kernel can produce emergent audio effects when trained on 200 samples from a personalized corpus. We achieve this through two key techniques: (1) Conditioning Aware Kernels (CAK), where output = input + (learned_pattern x control), with a soft-gate mechanism supporting identity preservation at zero control; and (2) AuGAN (Audit GAN), which reframes adversarial training from “is this real?” to “did you apply the requested value?” Rather than learning to generate or detect forgeries, our networks cooperate to verify control application, discovering unique transformations. The learned kernel exhibits a diagonal structure creating frequency-dependent temporal shifts that are capable of producing musical effects based on input characteristics. Our results show the potential of adversarial training to discover audio transformations from minimal data, enabling new approaches to effect design.
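The relation output = input + (learned_pattern x control) can be illustrated with a minimal pure-Python sketch. Here the learned pattern is taken to be a 3x3 cross-correlation of the input with zero padding, and the soft gate mentioned in the abstract is omitted; both choices, and the names `conv3x3_same` and `cak_apply`, are assumptions for illustration rather than the author's implementation.

```python
def conv3x3_same(x, kernel):
    """3x3 cross-correlation over a 2D list with zero padding (output has the input's shape)."""
    height, width = len(x), len(x[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < height and 0 <= jj < width:
                        acc += kernel[di + 1][dj + 1] * x[ii][jj]
            out[i][j] = acc
    return out


def cak_apply(x, kernel, control):
    """Residual form: output = input + pattern * control (soft gate omitted in this sketch)."""
    pattern = conv3x3_same(x, kernel)
    return [
        [x[i][j] + pattern[i][j] * control for j in range(len(x[0]))]
        for i in range(len(x))
    ]


spectrogram = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # toy input
kernel = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]       # stand-in for a learned kernel
unchanged = cak_apply(spectrogram, kernel, 0.0)
processed = cak_apply(spectrogram, kernel, 0.5)
```

With `control = 0.0` the residual term vanishes and the input passes through unchanged, which is the identity-preservation property the abstract describes.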
[LG-2] Instance-Optimal Uniformity Testing and Tracking
Link: https://arxiv.org/abs/2508.02637
Authors: Guy Blanc,Clément L. Canonne,Erik Waingarten
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: FOCS 2025, to appear
Abstract:In the uniformity testing task, an algorithm is provided with samples from an unknown probability distribution over a (known) finite domain, and must decide whether it is the uniform distribution, or, alternatively, if its total variation distance from uniform exceeds some input distance parameter. This question has received a significant amount of interest and its complexity is, by now, fully settled. Yet, we argue that it fails to capture many scenarios of interest, and that its very definition as a gap problem in terms of a prespecified distance may lead to suboptimal performance. To address these shortcomings, we introduce the problem of uniformity tracking, whereby an algorithm is required to detect deviations from uniformity (however they may manifest themselves) using as few samples as possible, and be competitive against an optimal algorithm knowing the distribution profile in hindsight. Our main contribution is a $\operatorname{polylog}(\operatorname{opt})$-competitive uniformity tracking algorithm. We obtain this result by leveraging new structural results on Poisson mixtures, which we believe to be of independent interest.
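For context, the classical gap formulation that the abstract contrasts with can be sketched using a collision-based tester. This is the textbook technique for a prespecified distance `eps`, not the tracking algorithm proposed in the paper; the function names and the midpoint threshold are choices made here for illustration.

```python
from collections import Counter


def collision_rate(samples):
    """Fraction of sample pairs that coincide; estimates the sum of squared probabilities."""
    m = len(samples)
    if m < 2:
        raise ValueError("need at least two samples")
    counts = Counter(samples)
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    return collisions / (m * (m - 1) // 2)


def looks_uniform(samples, domain_size, eps):
    """Accept if the collision rate is below the midpoint threshold (1 + 2*eps^2) / n.

    The uniform distribution over n elements has collision probability 1/n, while any
    distribution at total variation distance >= eps from uniform has collision
    probability >= (1 + 4*eps^2) / n.
    """
    threshold = (1.0 + 2.0 * eps * eps) / domain_size
    return collision_rate(samples) <= threshold


all_distinct = list(range(10))  # no repeated values
all_equal = [3] * 10            # a point mass
```

The decision is only reliable once the number of samples is large enough, on the order of the square root of the domain size divided by `eps` squared, which is exactly the kind of dependence on a fixed `eps` that motivates the tracking formulation.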
[LG-3] Tensor Dynamic Mode Decomposition
Link: https://arxiv.org/abs/2508.02627
Authors: Ziqin He,Mengqi Hu,Yifei Lou,Can Chen
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures, 1 table
Abstract:Dynamic mode decomposition (DMD) has become a powerful data-driven method for analyzing the spatiotemporal dynamics of complex, high-dimensional systems. However, conventional DMD methods are limited to matrix-based formulations, which might be inefficient or inadequate for modeling inherently multidimensional data including images, videos, and higher-order networks. In this letter, we propose tensor dynamic mode decomposition (TDMD), a novel extension of DMD to third-order tensors based on the recently developed T-product framework. By incorporating tensor factorization techniques, TDMD achieves more efficient computation and better preservation of spatial and temporal structures in multiway data for tasks such as state reconstruction and dynamic component separation, compared to standard DMD with data flattening. We demonstrate the effectiveness of TDMD on both synthetic and real-world datasets.
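For reference, the standard matrix DMD that TDMD generalizes can be written as follows. This is the textbook exact-DMD formulation on flattened snapshots $x_1, \dots, x_m$ with truncation rank $r$, not the T-product version proposed in the letter.

```latex
\begin{align}
X &= [x_1, \dots, x_{m-1}], \qquad X' = [x_2, \dots, x_m], \qquad X' \approx A X, \qquad A = X' X^{\dagger} \\
X &\approx U_r \Sigma_r V_r^{*}, \qquad \tilde{A} = U_r^{*} X' V_r \Sigma_r^{-1} \\
\tilde{A} W &= W \Lambda, \qquad \Phi = X' V_r \Sigma_r^{-1} W
\end{align}
```

The eigenvalues in $\Lambda$ give the growth rates and frequencies of the modes, and the columns of $\Phi$ are the corresponding spatial modes. Flattening images or videos into the columns $x_k$ is the step that discards multiway structure.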
[LG-4] DeepKoopFormer: A Koopman Enhanced Transformer Based Architecture for Time Series Forecasting
Link: https://arxiv.org/abs/2508.02616
Authors: Ali Forootani,Mohammad Khosravi,Masoud Barati
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Time series forecasting plays a vital role across scientific, industrial, and environmental domains, especially when dealing with high-dimensional and nonlinear systems. While Transformer-based models have recently achieved state-of-the-art performance in long-range forecasting, they often suffer from interpretability issues and instability in the presence of noise or dynamical uncertainty. In this work, we propose DeepKoopFormer, a principled forecasting framework that combines the representational power of Transformers with the theoretical rigor of Koopman operator theory. Our model features a modular encoder-propagator-decoder structure, where temporal dynamics are learned via a spectrally constrained, linear Koopman operator in a latent space. We impose structural guarantees, such as bounded spectral radius, Lyapunov-based energy regularization, and orthogonal parameterization, to ensure stability and interpretability. Comprehensive evaluations are conducted on synthetic dynamical systems, a real-world climate dataset (wind speed and surface pressure), financial time series (cryptocurrency), and an electricity generation dataset, using a Python package prepared for this purpose. Across all experiments, DeepKoopFormer consistently outperforms standard LSTM and baseline Transformer models in terms of accuracy, robustness to noise, and long-term forecasting stability. These results establish DeepKoopFormer as a flexible, interpretable, and robust framework for forecasting in high-dimensional and dynamical settings.
[LG-5] Adaptive Riemannian Graph Neural Networks
Link: https://arxiv.org/abs/2508.02600
Authors: Xudong Wang,Tongxin Li,Chris Ding,Jicong Fan
Categories: Machine Learning (cs.LG)
Comments: Under Review
Abstract:Graph data often exhibits complex geometric heterogeneity, where structures with varying local curvature, such as tree-like hierarchies and dense communities, coexist within a single network. Existing geometric GNNs, which embed graphs into single fixed-curvature manifolds or discrete product spaces, struggle to capture this diversity. We introduce Adaptive Riemannian Graph Neural Networks (ARGNN), a novel framework that learns a continuous and anisotropic Riemannian metric tensor field over the graph. It allows each node to determine its optimal local geometry, enabling the model to fluidly adapt to the graph’s structural landscape. Our core innovation is an efficient parameterization of the node-wise metric tensor, specializing to a learnable diagonal form that captures directional geometric information while maintaining computational tractability. To ensure geometric regularity and stable training, we integrate a Ricci flow-inspired regularization that smooths the learned manifold. Theoretically, we establish the rigorous geometric evolution convergence guarantee for ARGNN and provide a continuous generalization that unifies prior fixed or mixed-curvature GNNs. Empirically, our method demonstrates superior performance on both homophilic and heterophilic benchmark datasets with the ability to capture diverse structures adaptively. Moreover, the learned geometries both offer interpretable insights into the underlying graph structure and empirically corroborate our theoretical analysis.
[LG-6] CSI Obfuscation: Single-Antenna Transmitters Can Not Hide from Adversarial Multi-Antenna Radio Localization Systems
Link: https://arxiv.org/abs/2508.02553
Authors: Phillip Stephan,Florian Euchner,Stephan ten Brink
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:The ability of modern telecommunication systems to locate users and objects in the radio environment raises justified privacy concerns. To prevent unauthorized localization, single-antenna transmitters can obfuscate the signal by convolving it with a randomized sequence prior to transmission, which alters the channel state information (CSI) estimated at the receiver. However, this strategy is only effective against CSI-based localization systems deploying single-antenna receivers. Inspired by the concept of blind multichannel identification, we propose a simple CSI recovery method for multi-antenna receivers to extract channel features that ensure reliable user localization regardless of the transmitted signal. We comparatively evaluate the impact of signal obfuscation and the proposed recovery method on the localization performance of CSI fingerprinting, channel charting, and classical triangulation using real-world channel measurements. This work aims to demonstrate the necessity for further efforts to protect the location privacy of users from adversarial radio-based localization systems.
[LG-7] Solved in Unit Domain: JacobiNet for Differentiable Coordinate Transformations
Link: https://arxiv.org/abs/2508.02537
Authors: Xi Chen,Jianchuan Yang,Junjie Zhang,Runnan Yang,Xu Liu,Hong Wang,Ziyu Ren,Wenqi Hu
Categories: Machine Learning (cs.LG)
Comments: Submitted to CMAME, revision in progress
Abstract:Physics-Informed Neural Networks (PINNs) are effective for solving PDEs by incorporating physical laws into the learning process. However, they face challenges with irregular boundaries, leading to instability and slow convergence due to inconsistent normalization, inaccurate boundary enforcement, and imbalanced loss terms. A common solution is to map the domain to a regular space, but traditional methods rely on case-specific meshes and simple geometries, limiting their compatibility with modern frameworks. To overcome these limitations, we introduce JacobiNet, a neural network-based coordinate transformation method that learns continuous, differentiable mappings from supervised point pairs. Utilizing lightweight MLPs, JacobiNet allows for direct Jacobian computation via autograd and integrates seamlessly with downstream PINNs, enabling end-to-end differentiable PDE solving without the need for meshing or explicit Jacobian computation. JacobiNet effectively addresses normalization challenges, facilitates hard constraints of boundary conditions, and mitigates the long-standing imbalance among loss terms. It demonstrates significant improvements, reducing the relative L2 error from 0.287-0.637 to 0.013-0.039, achieving an average accuracy improvement of 18.3×. In vessel-like domains, it enables rapid mapping for unseen geometries, improving prediction accuracy by 3.65× and achieving over 10× speedup, showcasing its generalization, accuracy, and efficiency.
[LG-8] Communication and Computation Efficient Split Federated Learning in O-RAN
Link: https://arxiv.org/abs/2508.02534
Authors: Shunxian Gu,Chaoqun You,Bangbang Ren,Deke Guo
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:The hierarchical architecture of Open Radio Access Network (O-RAN) has enabled a new Federated Learning (FL) paradigm that trains models using data from non- and near-real-time (near-RT) Radio Intelligent Controllers (RICs). However, the ever-increasing model size leads to longer training time, jeopardizing the deadline requirements for both non-RT and near-RT RICs. To address this issue, split federated learning (SFL) offers an approach by offloading partial model layers from near-RT-RIC to high-performance non-RT-RIC. Nonetheless, its deployment presents two challenges: (i) Frequent data/gradient transfers between near-RT-RIC and non-RT-RIC in SFL incur significant communication cost in O-RAN. (ii) Proper allocation of computational and communication resources in O-RAN is vital to satisfying the deadline and affects SFL convergence. Therefore, we propose SplitMe, an SFL framework that exploits mutual learning to alternately and independently train the near-RT-RIC’s model and the non-RT-RIC’s inverse model, eliminating frequent transfers. The “inverse” of the inverse model is derived via a zeroth-order technique to integrate the final model. Then, we solve a joint optimization problem for SplitMe to minimize overall resource costs with deadline-aware selection of near-RT-RICs and adaptive local updates. Our numerical results demonstrate that SplitMe remarkably outperforms FL frameworks like SFL, FedAvg and O-RANFed regarding costs and convergence.
[LG-9] Causality and Interpretability for Electrical Distribution System faults
Link: https://arxiv.org/abs/2508.02524
Authors: Karthik Peddi,Sai Ram Aditya Parisineni,Hemanth Macharla,Mayukha Pal
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments:
Abstract:Causal analysis helps us understand the variables that are responsible for system failures. This improves fault detection and makes the system more reliable. In this work, we present a new method that combines causal inference with machine learning to classify faults in electrical distribution systems (EDS) using graph-based models. We first build causal graphs using transfer entropy (TE). Each fault case is represented as a graph, where the nodes are features such as voltage and current, and the edges show how these features influence each other. Then, the graphs are classified using machine learning and GraphSAGE, where the model learns from both the node values and the structure of the graph to predict the type of fault. To make the predictions understandable, we further developed an integrated approach using GNNExplainer and Captum's Integrated Gradients to highlight the nodes (features) that most influence the final prediction. This gives us clear insights into the possible causes of the fault. Our experiments show high accuracy: 99.44% on the EDS fault dataset, which is better than state-of-the-art models. By combining causal graphs with machine learning, our method not only predicts faults accurately but also helps understand their root causes. This makes it a strong and practical tool for improving system reliability.
[LG-10] AnalogCoder-Pro: Unifying Analog Circuit Generation and Optimization via Multi-modal LLMs
Link: https://arxiv.org/abs/2508.02518
Authors: Yao Lai,Souradip Poddar,Sungyoung Lee,Guojin Chen,Mengkang Hu,Bei Yu,Ping Luo,David Z. Pan
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Despite advances in analog design automation, analog front-end design still heavily depends on expert intuition and iterative simulations, underscoring critical gaps in fully automated optimization for performance-critical applications. Recently, the rapid development of Large Language Models (LLMs) has brought new promise to analog design automation. However, existing work remains in its early stages, and holistic joint optimization for practical end-to-end solutions remains largely unexplored. We propose AnalogCoder-Pro, a unified multimodal LLM-based framework that integrates generative capabilities and optimization techniques to jointly explore circuit topologies and optimize device sizing, automatically generating performance-specific, fully sized schematic netlists. AnalogCoder-Pro employs rejection sampling for fine-tuning LLMs on high-quality synthesized circuit data and introduces a multimodal diagnosis and repair workflow based on functional specifications and waveform images. By leveraging LLMs to interpret generated circuit netlists, AnalogCoder-Pro automates the extraction of critical design parameters and the formulation of parameter spaces, establishing an end-to-end workflow for simultaneous topology generation and device sizing optimization. Extensive experiments demonstrate that these orthogonal approaches significantly improve the success rate of analog circuit design and enhance circuit performance.
[LG-11] On Distributional Dependent Performance of Classical and Neural Routing Solvers
Link: https://arxiv.org/abs/2508.02510
Authors: Daniela Thyssens,Tim Dernedde,Wilson Sentanoe,Lars Schmidt-Thieme
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 2 figures
Abstract:Neural Combinatorial Optimization (NCO) aims to learn to solve a class of combinatorial problems through data-driven methods, notably by employing neural networks that learn the underlying distribution of problem instances. While neural methods have so far struggled to outperform highly engineered, problem-specific meta-heuristics, this work explores a novel approach to formulating the distribution of problem instances to learn from and, more importantly, to planting a structure in the sampled problem instances. In application to routing problems, we generate large problem instances that represent custom base problem instance distributions from which training instances are sampled. The test instances used to evaluate the methods on the routing task consist of unseen problems sampled from the underlying large problem instance. We evaluate representative NCO methods and specialized Operations Research meta-heuristics on this novel task and demonstrate that the performance gap between neural routing solvers and highly specialized meta-heuristics decreases when learning from sub-samples drawn from a fixed base node distribution.
[LG-12] Federated Graph Unlearning
Link: https://arxiv.org/abs/2508.02485
Authors: Yuming Ai,Xunkai Li,Jiaqi Chao,Bowen Fan,Zhengyu Wu,Yinlin Zhu,Rong-Hua Li,Guoren Wang
Categories: Machine Learning (cs.LG)
Comments: under review
Abstract:The demand for data privacy has led to the development of frameworks like Federated Graph Learning (FGL), which facilitate decentralized model training. However, a significant operational challenge in such systems is adhering to the right to be forgotten. This principle necessitates robust mechanisms for two distinct types of data removal: the selective erasure of specific entities and their associated knowledge from local subgraphs and the wholesale removal of a user’s entire dataset and influence. Existing methods often struggle to fully address both unlearning requirements, frequently resulting in incomplete data removal or the persistence of residual knowledge within the system. This work introduces a unified framework, conceived to provide a comprehensive solution to these challenges. The proposed framework employs a bifurcated strategy tailored to the specific unlearning request. For fine-grained Meta Unlearning, it uses prototype gradients to direct the initial local forgetting process, which is then refined by generating adversarial graphs to eliminate any remaining data traces among affected clients. In the case of complete client unlearning, the framework utilizes adversarial graph generation exclusively to purge the departed client’s contributions from the remaining network. Extensive experiments on multiple benchmark datasets validate the proposed approach. The framework achieves substantial improvements in model prediction accuracy across both client and meta-unlearning scenarios when compared to existing methods. Furthermore, additional studies confirm its utility as a plug-in module, where it materially enhances the predictive capabilities and unlearning effectiveness of other established methods.
[LG-13] Toward Using Machine Learning as a Shape Quality Metric for Liver Point Cloud Generation
Link: https://arxiv.org/abs/2508.02482
Authors: Khoa Tuan Nguyen,Gaeun Oh,Ho-min Park,Francesca Tozzi,Wouter Willaert,Joris Vankerschaver,Niki Rashidian,Wesley De Neve
Categories: Machine Learning (cs.LG)
Comments:
Abstract:While 3D medical shape generative models such as diffusion models have shown promise in synthesizing diverse and anatomically plausible structures, the absence of ground truth makes quality evaluation challenging. Existing evaluation metrics commonly measure distributional distances between training and generated sets, while the medical field requires assessing quality at the individual level for each generated shape, which demands labor-intensive expert review. In this paper, we investigate the use of classical machine learning (ML) methods and PointNet as an alternative, interpretable approach for assessing the quality of generated liver shapes. We sample point clouds from the surfaces of the generated liver shapes, extract handcrafted geometric features, and train a group of supervised ML and PointNet models to classify liver shapes as good or bad. These trained models are then used as proxy discriminators to assess the quality of synthetic liver shapes produced by generative models. Our results show that ML-based shape classifiers provide not only interpretable feedback but also complementary insights compared to expert evaluation. This suggests that ML classifiers can serve as lightweight, task-relevant quality metrics in 3D organ shape generation, supporting more transparent and clinically aligned evaluation protocols in medical shape modeling.
[LG-14] An Efficient and Adaptive Next Edit Suggestion Framework with Zero Human Instructions in IDEs
Link: https://arxiv.org/abs/2508.02473
Authors: Xinfang Chen,Siyang Xiao,Xianying Zhu,Junhong Xie,Ming Liang,Dajun Chen,Wei Jiang,Yong Li,Peng Di
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 13 pages
Abstract:Code editing, including modifying, refactoring, and maintaining existing code, is the most frequent task in software development and has garnered significant attention from AI-powered tools. However, existing solutions that translate explicit natural language instructions into code edits face critical limitations, such as heavy reliance on human instruction input and high latency, which hinder their effective integration into a developer’s workflow. We observe that developers’ habitual behaviors and coding objectives are often reflected in their historical editing patterns, making this data key to addressing existing limitations. To leverage these insights, we propose NES (Next Edit Suggestion), an LLM-driven code editing framework that delivers an instruction-free and low-latency experience. Built on a dual-model architecture and trained with our high-quality SFT and DAPO datasets, NES enhances productivity by understanding developer intent while optimizing inference to minimize latency. NES is a scalable, industry-ready solution with a continuous Tab key interaction workflow, seamlessly adopted by a FinTech company with over 20,000 developers. Evaluations on real-world datasets show NES achieves 75.6% and 81.6% accuracy in two tasks of predicting next edit locations, alongside 91.36% ES and 27.7% EMR for intent-aligned edits, outperforming SOTA models. Our open-sourced SFT and DAPO datasets have been demonstrated to enhance the performance of open-source CodeLLMs. The demonstration of NES is available at this https URL.
[LG-15] Computationally efficient Gauss-Newton reinforcement learning for model predictive control
Link: https://arxiv.org/abs/2508.02441
Authors: Dean Brandner,Sebastien Gros,Sergio Lucia
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 14 pages, 8 figures, submitted to Elsevier
Abstract:Model predictive control (MPC) is widely used in process control due to its interpretability and ability to handle constraints. As a parametric policy in reinforcement learning (RL), MPC offers strong initial performance and low data requirements compared to black-box policies like neural networks. However, most RL methods rely on first-order updates, which scale well to large parameter spaces but converge at most linearly, making them inefficient when each policy update requires solving an optimal control problem, as is the case with MPC. While MPC policies are typically sparsely parameterized and thus amenable to second-order approaches, existing second-order methods demand second-order policy derivatives, which can be computationally and memory-wise intractable. This work introduces a Gauss-Newton approximation of the deterministic policy Hessian that eliminates the need for second-order policy derivatives, enabling superlinear convergence with minimal computational overhead. To further improve robustness, we propose a momentum-based Hessian averaging scheme for stable training under noisy estimates. We demonstrate the effectiveness of the approach on a nonlinear continuously stirred tank reactor (CSTR), showing faster convergence and improved data efficiency over state-of-the-art first-order methods.
[LG-16] ASMR: Angular Support for Malfunctioning Client Resilience in Federated Learning
Link: https://arxiv.org/abs/2508.02414
Authors: Mirko Konstantin,Moritz Fuchs,Anirban Mukhopadhyay
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Federated Learning (FL) allows the training of deep neural networks in a distributed and privacy-preserving manner. However, this concept suffers from malfunctioning updates sent by the attending clients that cause global model performance degradation. Reasons for this malfunctioning might be technical issues, disadvantageous training data, or malicious attacks. Most current defense mechanisms require impractical prerequisites, such as knowledge of the number of malfunctioning updates, which makes them unsuitable for real-world applications. To counteract these problems, we introduce a novel method called Angular Support for Malfunctioning Client Resilience (ASMR) that dynamically excludes malfunctioning clients based on their angular distance. Our novel method does not require any hyperparameters or knowledge about the number of malfunctioning clients. Our experiments showcase the detection capabilities of ASMR in an image classification task on a histopathological dataset, while also presenting findings on the significance of dynamically adapting decision boundaries.
[LG-17] Graph Embedding in the Graph Fractional Fourier Transform Domain
Link: https://arxiv.org/abs/2508.02383
Authors: Changjie Sheng,Zhichao Zhang,Wei Yao
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments:
Abstract:Spectral graph embedding plays a critical role in graph representation learning by generating low-dimensional vector representations from graph spectral information. However, the embedding space of traditional spectral embedding methods often exhibits limited expressiveness, failing to exhaustively capture latent structural features across alternative transform domains. To address this issue, we use the graph fractional Fourier transform to extend the existing state-of-the-art generalized frequency filtering embedding (GEFFE) into fractional domains, giving rise to the generalized fractional filtering embedding (GEFRFE), which enhances embedding informativeness via the graph fractional domain. The GEFRFE leverages graph fractional domain filtering and a nonlinear composition of eigenvector components derived from a fractionalized graph Laplacian. To dynamically determine the fractional order, two parallel strategies are introduced: search-based optimization and ResNet18-based adaptive learning. Extensive experiments on six benchmark datasets demonstrate that the GEFRFE captures richer structural features and significantly enhances classification performance. Notably, the proposed method retains computational complexity comparable to GEFFE approaches.
[LG-18] Beyond Manually Designed Pruning Policies with Second-Level Performance Prediction: A Pruning Framework for LLMs
Link: https://arxiv.org/abs/2508.02381
Authors: Zuxin Ma,Yunhe Cui,Yongbin Qin
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Non-uniform structured network pruning methods can effectively reduce Large Language Model (LLM) size by eliminating redundant channels or layers, offering lower performance degradation than uniform strategies. However, existing non-uniform methods rely heavily on manually designed pruning policies (e.g., layer importance and scaling factors), and therefore cannot efficiently adapt to scenarios with dynamic pruning ratio requirements. Additionally, a critical bottleneck – the time-consuming evaluation of pruning policies – further limits the feasibility of iteratively and dynamically finding optimal pruning policies. To address these limitations, we propose PPF (Predictive Pruning Framework), a novel pruning framework for LLMs that eliminates manual design dependencies via second-level performance prediction. PPF not only supports real-time pruning decisions under dynamic pruning ratios but is also applicable to static pruning scenarios. It employs an agent for producing adaptive and real-time pruning actions, while a lightweight performance predictor evaluates a pruning policy in seconds, significantly speeding up the iterative optimization process. Experiments on Llama2-7B and Llama3-8B show that PPF can generate dynamic/static pruning policies and that it reduces perplexity by up to 33.4% (dynamic pruning) and 84.78% (static pruning) over existing methods, outperforming manually designed pruning policies. The performance predictor achieves second-level performance prediction with high accuracy (prediction error 0.0011). It reduces the mean evaluation latency from minute-level (1 minute and 38.02 seconds for test-set evaluation methods) to second-level (1.52 seconds), achieving over 64 times speedup. Our code will be available at this https URL.
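The quoted speedup can be checked directly from the two latencies given in the abstract. The short Python snippet below only redoes that arithmetic; the variable names are illustrative and are not taken from the paper.

```python
# Latencies quoted in the abstract, converted to seconds.
baseline_seconds = 1 * 60 + 38.02   # test-set evaluation: 1 minute and 38.02 seconds
predictor_seconds = 1.52            # performance predictor: 1.52 seconds

# Ratio of the two latencies, i.e. the speedup of the predictor.
speedup = baseline_seconds / predictor_seconds
print(f"speedup = {speedup:.2f}x")
```

The ratio is a little above 64, which is consistent with the "over 64 times" figure in the abstract.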
[LG-19] A Novel Sliced Fused Gromov-Wasserstein Distance
Link: https://arxiv.org/abs/2508.02364
Authors: Moritz Piening,Robert Beinert
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:The Gromov–Wasserstein (GW) distance and its fused extension (FGW) are powerful tools for comparing heterogeneous data. Their computation is, however, challenging since both distances are based on non-convex, quadratic optimal transport (OT) problems. Leveraging 1D OT, a sliced version of GW has been proposed to lower the computational burden. Unfortunately, this sliced version is restricted to Euclidean geometry and loses invariance to isometries, strongly limiting its application in practice. To overcome these issues, we propose a novel slicing technique for GW as well as for FGW that is based on an appropriate lower bound, hierarchical OT, and suitable quadrature rules for the underlying 1D OT problems. Our novel sliced FGW significantly reduces the numerical effort while remaining invariant to isometric transformations and allowing the comparison of arbitrary geometries. We show that our new distance actually defines a pseudo-metric for structured spaces that bounds FGW from below and study its interpolation properties between sliced Wasserstein and GW. Since we avoid the underlying quadratic program, our sliced distance is numerically more robust and reliable than the original GW and FGW distances, especially in the context of shape retrieval and graph isomorphism testing.
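To illustrate why slicing lowers the cost, here is a minimal Python sketch of the plain sliced Wasserstein distance in Euclidean space, which the abstract mentions as one end of the interpolation. It is not the sliced GW or FGW construction proposed in the paper. It assumes two point clouds with the same number of points and uniform weights, so that each 1D OT problem reduces to sorting; the function names and the number of projections are illustrative choices.

```python
import numpy as np

def wasserstein_1d(u, v, p=2):
    """p-Wasserstein distance between two equal-size 1D samples with uniform weights."""
    # In 1D the optimal coupling matches sorted values, so no OT solver is needed.
    u_sorted = np.sort(u)
    v_sorted = np.sort(v)
    return np.mean(np.abs(u_sorted - v_sorted) ** p) ** (1.0 / p)

def sliced_wasserstein(x, y, n_projections=200, p=2, seed=0):
    """Monte Carlo estimate of the sliced p-Wasserstein distance between point clouds."""
    rng = np.random.default_rng(seed)
    dim = x.shape[1]
    # Random directions on the unit sphere.
    directions = rng.normal(size=(n_projections, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    total = 0.0
    for theta in directions:
        # Project both clouds onto the direction and solve the 1D problem.
        total += wasserstein_1d(x @ theta, y @ theta, p) ** p
    return (total / n_projections) ** (1.0 / p)

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 3))
y = x + np.array([1.0, 0.0, 0.0])   # same cloud shifted along the first axis

print(sliced_wasserstein(x, x))
print(sliced_wasserstein(x, y))
```

Each slice costs only a sort, which is the source of the computational saving. Note that this Euclidean version depends on the ambient coordinates, which is the limitation the abstract attributes to the earlier sliced GW.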
[LG-20] Detecting COPD Through Speech Analysis: A Dataset of Danish Speech and Machine Learning Approach
Link: https://arxiv.org/abs/2508.02354
Authors: Cuno Sankey-Olsen,Rasmus Hvass Olesen,Tobias Oliver Eberhard,Andreas Triantafyllopoulos,Björn Schuller,Ilhan Aslan
Categories: Sound (cs.SD); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:Chronic Obstructive Pulmonary Disease (COPD) is a serious and debilitating disease affecting millions around the world. Its early detection using non-invasive means could enable preventive interventions that improve quality of life and patient outcomes, with speech recently shown to be a valuable biomarker. Yet, its validity across different linguistic groups remains to be seen. To that end, audio data were collected from 96 Danish participants conducting three speech tasks (reading, coughing, sustained vowels). Half of the participants were diagnosed with different levels of COPD and the other half formed a healthy control group. Subsequently, we investigated different baseline models using openSMILE features and learnt x-vector embeddings. We obtained a best accuracy of 67% using openSMILE features and logistic regression. Our findings support the potential of speech-based analysis as a non-invasive, remote, and scalable screening tool as part of future COPD healthcare solutions.
[LG-21] Posterior Sampling of Probabilistic Word Embeddings
Link: https://arxiv.org/abs/2508.02337
Authors: Väinö Yrjänäinen,Isac Boström,Måns Magnusson,Johan Jonasson
Categories: Machine Learning (cs.LG); Computation (stat.CO)
Comments:
Abstract:Quantifying uncertainty in word embeddings is crucial for reliable inference from textual data. However, existing Bayesian methods such as Hamiltonian Monte Carlo (HMC) and mean-field variational inference (MFVI) are either computationally infeasible for large data or rely on restrictive assumptions. We propose a scalable Gibbs sampler using Polya-Gamma augmentation as well as Laplace approximation and compare them with MFVI and HMC for word embeddings. In addition, we address non-identifiability in word embeddings. Our Gibbs sampler and HMC correctly estimate uncertainties, while MFVI does not, and Laplace approximation only does so on large sample sizes, as expected. Applying the Gibbs sampler to the US Congress and the Movielens datasets, we demonstrate the feasibility on larger real data. Finally, as a result of having draws from the full posterior, we show that the posterior mean of word embeddings improves over maximum a posteriori (MAP) estimates in terms of hold-out likelihood, especially for smaller sampling sizes, further strengthening the need for posterior sampling of word embeddings.
[LG-22] BOOST: Bayesian Optimization with Optimal Kernel and Acquisition Function Selection Technique
Link: https://arxiv.org/abs/2508.02332
Authors: Joon-Hyun Park,Mujin Cheon,Dong-Yeun Koh
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 12 pages
Abstract:The performance of Bayesian optimization (BO), a highly sample-efficient method for expensive black-box problems, is critically governed by the selection of its hyperparameters, including the kernel and acquisition functions. This presents a challenge: an inappropriate combination of these can lead to poor performance and wasted evaluations. While individual improvements to kernel functions (e.g., tree-based kernels, deep kernel learning) and acquisition functions (e.g., multi-step lookahead, tree-based planning) have been explored, the joint and autonomous selection of the best pair of these fundamental hyperparameters has been overlooked. This forces practitioners to rely on heuristics or costly manual training. We propose a simple yet effective framework, BOOST (Bayesian Optimization with Optimal Kernel and Acquisition Function Selection Technique), that automates this selection. BOOST utilizes a lightweight, offline evaluation stage to predict the performance of various kernel-acquisition function pairs and identify the most suitable configuration before expensive evaluations. BOOST partitions data-in-hand into two subsets: a reference subset and a query subset, and it prepares all possible kernel-acquisition pairs from the user’s chosen candidates. For each configuration, BOOST conducts internal BO runs using the reference subset, evaluating how effectively each pair guides the search toward the optimum in the unknown query subset, thereby identifying the configuration with the best retrospective performance for future optimization. Experiments on both synthetic benchmark functions and real-world hyperparameter optimization tasks demonstrate that BOOST consistently outperforms standard BO approaches with fixed hyperparameters, highlighting its effectiveness and robustness in diverse problem landscapes.
[LG-23] A Compression Based Classification Framework Using Symbolic Dynamics of Chaotic Maps
Link: https://arxiv.org/abs/2508.02330
Authors: Parth Naik,Harikrishnan N B
Categories: Machine Learning (cs.LG)
Comments: 4 figures, 3 tables
Abstract:We propose a novel classification framework grounded in symbolic dynamics and data compression using chaotic maps. The core idea is to model each class by generating symbolic sequences from thresholded real-valued training data, which are then evolved through a one-dimensional chaotic map. For each class, we compute the transition probabilities of symbolic patterns (e.g., '00', '01', '10', and '11' for the second return map) and aggregate these statistics to form a class-specific probabilistic model. During the testing phase, the test data are thresholded and symbolized, and then encoded using the class-wise symbolic statistics via back iteration, a dynamical reconstruction technique. The predicted label corresponds to the class yielding the shortest compressed representation, signifying the most efficient symbolic encoding under its respective chaotic model. This approach fuses concepts from dynamical systems, symbolic representations, and compression-based learning. We evaluate the proposed method, ChaosComp, on both synthetic and real-world datasets, demonstrating competitive performance compared to traditional machine learning algorithms (e.g., macro F1-scores for the proposed method on Breast Cancer Wisconsin = 0.9531, Seeds = 0.9475, Iris = 0.8317, etc.). Rather than aiming for state-of-the-art performance, the goal of this research is to reinterpret the classification problem through the lens of dynamical systems and compression, which are foundational perspectives in learning theory and information processing.
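The "shortest compressed representation wins" rule can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not ChaosComp itself: it replaces the chaotic-map back iteration with the ideal Shannon code length (minus log2 of the pattern probability), uses a single global threshold, and estimates the probabilities of the four symbol pairs with add-one smoothing. All function names and the toy data are hypothetical.

```python
import numpy as np

def symbolize(x, threshold):
    """Threshold real-valued features into a binary symbolic sequence."""
    return (np.asarray(x) > threshold).astype(int)

def pair_probabilities(sequences, smoothing=1.0):
    """Estimate probabilities of the symbol pairs 00, 01, 10, 11 for one class."""
    counts = np.full((2, 2), smoothing)   # add-one smoothing avoids log(0)
    for s in sequences:
        for a, b in zip(s[:-1], s[1:]):
            counts[a, b] += 1
    return counts / counts.sum()

def code_length_bits(s, probs):
    """Ideal code length of a sequence's symbol pairs under a class model."""
    return -sum(np.log2(probs[a, b]) for a, b in zip(s[:-1], s[1:]))

def predict(x, class_models, threshold):
    """Return the label whose model gives the shortest code length."""
    s = symbolize(x, threshold)
    return min(class_models, key=lambda label: code_length_bits(s, class_models[label]))

# Toy training data: one class sits below the threshold, the other above it.
threshold = 0.5
train = {
    "low": [[0.1, 0.2, 0.1, 0.3], [0.2, 0.1, 0.4, 0.1]],
    "high": [[0.9, 0.8, 0.7, 0.9], [0.6, 0.9, 0.8, 0.7]],
}
class_models = {
    label: pair_probabilities([symbolize(row, threshold) for row in rows])
    for label, rows in train.items()
}

print(predict([0.8, 0.9, 0.7, 0.6], class_models, threshold))
print(predict([0.1, 0.2, 0.3, 0.1], class_models, threshold))
```

A test sample is assigned to the class under which its symbol pairs are most probable, which is the same as the class that would compress it into the fewest bits.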
[LG-24] NMS: Efficient Edge DNN Training via Near-Memory Sampling on Manifolds
Link: https://arxiv.org/abs/2508.02313
Authors: Boran Zhao,Haiduo Huang,Qiwei Dang,Wenzhe Zhao,Tian Xia,Pengju Ren
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Training deep neural networks (DNNs) on edge devices has attracted increasing attention due to its potential to address challenges related to domain adaptation and privacy preservation. However, DNNs typically rely on large datasets for training, which results in substantial energy consumption, making training on edge devices impractical. Some dataset compression methods have been proposed to solve this challenge. For instance, coreset selection and dataset distillation reduce the training cost by selecting and generating representative samples, respectively. Nevertheless, these methods have two significant defects: (1) The necessity of leveraging a DNN model to evaluate the quality of representative samples, which inevitably introduces the inductive bias of the DNN, resulting in a severe generalization issue; (2) All training images require multiple accesses to the DDR via long-distance PCB connections, leading to substantial energy overhead. To address these issues, inspired by the nonlinear manifold stationary of the human brain, we first propose a DNN-free sample-selecting algorithm, called DE-SNE, to improve the generalization issue. Second, we use the near-memory computing technique to implement DE-SNE, so that only a small fraction of images need to access the DDR via long-distance PCB connections. This significantly reduces DDR energy consumption. As a result, we build a novel expedited DNN training system with a more efficient in-place Near-Memory Sampling characteristic for edge devices, dubbed NMS. As far as we know, our NMS is the first DNN-free near-memory sampling technique that can effectively alleviate generalization issues and significantly reduce the DDR energy caused by dataset access. The experimental results show that our NMS outperforms the current state-of-the-art (SOTA) approaches, namely DQ, DQAS, and NeSSA, in model accuracy.
[LG-25] Pre-Tactical Flight-Delay and Turnaround Forecasting with Synthetic Aviation Data
Link: https://arxiv.org/abs/2508.02294
Authors: Abdulmajid Murad,Massimiliano Ruocco
Categories: Machine Learning (cs.LG)
Comments: Preprint. Under review
Abstract:Access to comprehensive flight operations data remains severely restricted in aviation due to commercial sensitivity and competitive considerations, hindering the development of predictive models for operational planning. This paper investigates whether synthetic data can effectively replace real operational data for training machine learning models in pre-tactical aviation scenarios, that is, predictions made hours to days before operations using only scheduled flight information. We evaluate four state-of-the-art synthetic data generators on three prediction tasks: aircraft turnaround time, departure delays, and arrival delays. Using a Train on Synthetic, Test on Real (TSTR) methodology on over 1.7 million European flight records, we first validate synthetic data quality through fidelity assessments, then assess both predictive performance and the preservation of operational relationships. Our results show that advanced neural network architectures, specifically transformer-based generators, can retain 94-97% of real-data predictive performance while maintaining feature importance patterns informative for operational decision-making. Our analysis reveals that even with real data, prediction accuracy is inherently limited when only scheduled information is available, which establishes realistic baselines for pre-tactical forecasting. These findings suggest that high-quality synthetic data can enable broader access to aviation analytics capabilities while preserving commercial confidentiality, though stakeholders must maintain realistic expectations about pre-tactical prediction accuracy given the stochastic nature of flight operations.
[LG-26] An Enhanced Focal Loss Function to Mitigate Class Imbalance in Auto Insurance Fraud Detection with Explainable AI
Link: https://arxiv.org/abs/2508.02283
Authors: Francis Boabang,Samuel Asante Gyamerah
Categories: Machine Learning (cs.LG); Computational Finance (q-fin.CP); Risk Management (q-fin.RM)
Comments: 28 pages, 4 figures, 2 tables
Abstract:In insurance fraud prediction, handling class imbalance remains a critical challenge. This paper presents a novel multistage focal loss function designed to enhance the performance of machine learning models in such imbalanced settings by helping to escape local minima and converge to a good solution. Building upon the foundation of the standard focal loss, our proposed approach introduces a dynamic, multi-stage convex and nonconvex mechanism that progressively adjusts the focus on hard-to-classify samples across training epochs. This strategic refinement facilitates more stable learning and improved discrimination between fraudulent and legitimate cases. Through extensive experimentation on a real-world insurance dataset, our method achieved better performance than the traditional focal loss, as measured by accuracy, precision, F1-score, recall and Area Under the Curve (AUC) metrics on the auto insurance dataset. These results demonstrate the efficacy of the multistage focal loss in boosting model robustness and predictive accuracy in highly skewed classification tasks, offering significant implications for fraud detection systems in the insurance industry. An explainable model is included to interpret the results.
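For reference, the standard focal loss that the paper builds on can be written as FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The Python sketch below implements only this standard binary form; the multistage convex and nonconvex schedule proposed in the paper is not specified in the abstract and is not reproduced here. The function name and default values are illustrative.

```python
import numpy as np

def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean standard binary focal loss; p_pred is the predicted probability of class 1."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    # Probability assigned to the true class of each sample.
    p_t = np.where(y_true == 1, p_pred, 1.0 - p_pred)
    # Class weight: alpha for positives, 1 - alpha for negatives.
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    # The factor (1 - p_t)^gamma down-weights samples that are already well classified.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

easy = binary_focal_loss([1], [0.9])   # confident and correct
hard = binary_focal_loss([1], [0.1])   # confident and wrong
print(easy, hard)
```

With gamma set to 0 and alpha set to 0.5, the expression reduces to half the usual binary cross-entropy, which is a convenient sanity check. Larger gamma values shift more of the loss onto hard samples such as rare fraudulent cases.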
[LG-27] mCardiacDx: Radar-Driven Contactless Monitoring and Diagnosis of Arrhythmia
Link: https://arxiv.org/abs/2508.02274
Authors: Arjun Kumar,Noppanat Wadlom,Jaeheon Kwak,Si-Hyuck Kang,Insik Shin
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 15 pages, 27 images
Abstract:Arrhythmia is a common cardiac condition that can precipitate severe complications without timely intervention. While continuous monitoring is essential for timely diagnosis, conventional approaches such as electrocardiogram and wearable devices are constrained by their reliance on specialized medical expertise and patient discomfort from their contact nature. Existing contactless monitoring, primarily designed for healthy subjects, faces significant challenges when analyzing reflected signals from arrhythmia patients due to disrupted spatial stability and temporal consistency. In this paper, we introduce mCardiacDx, a radar-driven contactless system that accurately analyzes reflected signals and reconstructs heart pulse waveforms (HPWs) for arrhythmia monitoring and diagnosis. The key contributions of our work include a novel precise target localization (PTL) technique that locates reflected signals despite spatial disruptions, and an encoder-decoder model that transforms these signals into HPWs, addressing temporal inconsistencies. Our evaluation on a large dataset of healthy subjects and arrhythmia patients shows that both mCardiacDx and PTL outperform state-of-the-art approaches in arrhythmia monitoring and diagnosis, also demonstrating improved performance in healthy subjects.
[LG-28] Skeleton-Guided Learning for Shortest Path Search
Link: https://arxiv.org/abs/2508.02270
Authors: Tiantian Liu,Xiao Li,Huan Li,Hua Lu,Christian S. Jensen,Jianliang Xu
Subjects: Machine Learning (cs.LG); Databases (cs.DB)
Comments:
Abstract:Shortest path search is a core operation in graph-based applications, yet existing methods face important limitations. Classical algorithms such as Dijkstra’s and A* become inefficient as graphs grow more complex, while index-based techniques often require substantial preprocessing and storage. Recent learning-based approaches typically focus on spatial graphs and rely on context-specific features like geographic coordinates, limiting their general applicability. We propose a versatile learning-based framework for shortest path search on generic graphs, without requiring domain-specific features. At the core of our approach is the construction of a skeleton graph that captures multi-level distance and hop information in a compact form. A Skeleton Graph Neural Network (SGNN) operates on this structure to learn node embeddings and predict distances and hop lengths between node pairs. These predictions support LSearch, a guided search algorithm that uses model-driven pruning to reduce the search space while preserving accuracy. To handle larger graphs, we introduce a hierarchical training strategy that partitions the graph into subgraphs with individually trained SGNNs. This structure enables HLSearch, an extension of our method for efficient path search across graph partitions. Experiments on five diverse real-world graphs demonstrate that our framework achieves strong performance across graph types, offering a flexible and effective solution for learning-based shortest path search.
[LG-29] Pigeon-SL: Robust Split Learning Framework for Edge Intelligence under Malicious Clients
Link: https://arxiv.org/abs/2508.02235
Authors: Sangjun Park,Tony Q.S. Quek,Hyowoon Seo
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 13 pages, 14 figures
Abstract:Recent advances in split learning (SL) have established it as a promising framework for privacy-preserving, communication-efficient distributed learning at the network edge. However, SL’s sequential update process is vulnerable to even a single malicious client, which can significantly degrade model accuracy. To address this, we introduce Pigeon-SL, a novel scheme grounded in the pigeonhole principle that guarantees at least one entirely honest cluster among M clients, even when up to N of them are adversarial. In each global round, the access point partitions the clients into N+1 clusters, trains each cluster independently via vanilla SL, and evaluates their validation losses on a shared dataset. Only the cluster with the lowest loss advances, thereby isolating and discarding malicious updates. We further enhance training and communication efficiency with Pigeon-SL+, which repeats training on the selected cluster to match the update throughput of standard SL. We validate the robustness and effectiveness of our approach under three representative attack models – label flipping, activation and gradient manipulation – demonstrating significant improvements in accuracy and resilience over baseline SL methods in future intelligent wireless networks.
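The pigeonhole argument in this abstract can be illustrated with a small sketch: if at most N clients are adversarial and the clients are split into N+1 non-empty disjoint clusters, at least one cluster contains no adversary. The function names, the shuffled round-robin partition, and the `validation_loss` callable (a stand-in for training a cluster with split learning and scoring it on the shared dataset) are assumptions made for illustration, not the authors' implementation.

```python
import random

def partition_clients(client_ids, num_adversaries, seed=0):
    """Split clients into num_adversaries + 1 disjoint clusters.

    Assumes len(client_ids) >= num_adversaries + 1 so no cluster is empty.
    The shuffled round-robin assignment is an illustrative choice.
    """
    rng = random.Random(seed)
    shuffled = list(client_ids)
    rng.shuffle(shuffled)
    k = num_adversaries + 1
    return [shuffled[i::k] for i in range(k)]

def select_cluster(clusters, validation_loss):
    """Return the cluster whose validation loss is lowest.

    validation_loss is a hypothetical callable mapping a cluster to a float.
    """
    return min(clusters, key=validation_loss)
```

In a toy run where the loss simply counts the malicious members of a cluster, the selected cluster is always fully honest, which mirrors the guarantee described above.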
[LG-30] CO-RFT: Efficient Fine-Tuning of Vision-Language-Action Models through Chunked Offline Reinforcement Learning
Link: https://arxiv.org/abs/2508.02219
Authors: Dongchi Huang,Zhirui Fang,Tianle Zhang,Yihang Li,Lin Zhao,Chunhe Xia
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:Vision-Language-Action (VLA) models demonstrate significant potential for developing generalized policies in real-world robotic control. This progress inspires researchers to explore fine-tuning these models with Reinforcement Learning (RL). However, fine-tuning VLA models with RL still faces challenges related to sample efficiency, compatibility with action chunking, and training stability. To address these challenges, we explore the fine-tuning of VLA models through offline reinforcement learning incorporating action chunking. In this work, we propose Chunked RL, a novel reinforcement learning framework specifically designed for VLA models. Within this framework, we extend temporal difference (TD) learning to incorporate action chunking, a prominent characteristic of VLA models. Building upon this framework, we propose CO-RFT, an algorithm aimed at fine-tuning VLA models using a limited set of demonstrations (30 to 60 samples). Specifically, we first conduct imitation learning (IL) with full parameter fine-tuning to initialize both the backbone and the policy. Subsequently, we implement offline RL with action chunking to optimize the pretrained policy. Our empirical results in real-world environments demonstrate that CO-RFT outperforms previous supervised methods, achieving a 57% improvement in success rate and a 22.3% reduction in cycle time. Moreover, our method exhibits robust positional generalization capabilities, attaining a success rate of 44.3% in previously unseen positions.
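The abstract does not give the exact form of its chunked temporal difference update. One common way to combine TD learning with an action chunk of length h is an h-step target that sums the discounted rewards collected while the chunk executes and then bootstraps from the value after the chunk. The sketch below shows that generic target only; the function name, the argument layout, and the default `gamma` are assumptions.

```python
def chunked_td_target(rewards, bootstrap_value, gamma=0.99):
    """h-step TD target for one action chunk (a generic sketch).

    rewards: rewards r_t, ..., r_{t+h-1} observed while the chunk executes.
    bootstrap_value: value estimate for the state reached after the chunk.
    Returns r_t + gamma * r_{t+1} + ... + gamma**h * bootstrap_value.
    """
    target = bootstrap_value
    # Fold from the last reward backwards so each step applies one discount
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target
```

With a chunk of length 1 this reduces to the usual one-step TD target.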
[LG-31] Multi-Policy Pareto Front Tracking Based Online and Offline Multi-Objective Reinforcement Learning
Link: https://arxiv.org/abs/2508.02217
Authors: Zeyu Zhao,Yueling Che,Kaichen Liu,Jian Li,Junmei Yao
Subjects: Machine Learning (cs.LG)
Comments: 24 pages, 10 figures, conference paper
Abstract:Multi-objective reinforcement learning (MORL) plays a pivotal role in addressing multi-criteria decision-making problems in the real world. Multi-policy (MP) based methods are widely used to obtain high-quality Pareto front approximations for MORL problems. However, traditional MP methods rely only on online reinforcement learning (RL) and adopt the evolutionary framework with a large policy population. This may lead to sample inefficiency and/or overwhelming agent-environment interactions in practice. By forsaking the evolutionary framework, we propose the novel Multi-policy Pareto Front Tracking (MPFT) framework without maintaining any policy population, where both online and offline MORL algorithms can be applied. The proposed MPFT framework includes four stages: Stage 1 approximates all the Pareto-vertex policies, whose mappings to the objective space fall on the vertices of the Pareto front. Stage 2 designs the new Pareto tracking mechanism to track the Pareto front, starting from each of the Pareto-vertex policies. Stage 3 identifies the sparse regions in the tracked Pareto front, and introduces a new objective weight adjustment method to fill the sparse regions. Finally, by combining all the policies tracked in Stages 2 and 3, Stage 4 approximates the Pareto front. Experiments are conducted on seven different continuous-action robotic control tasks with both online and offline MORL algorithms, and demonstrate the superior hypervolume performance of our proposed MPFT approach over the state-of-the-art benchmarks, with significantly reduced agent-environment interactions and hardware requirements.
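The hypervolume metric mentioned in this abstract measures how much of the objective space a Pareto front dominates relative to a reference point. The sketch below shows the two-objective case under the assumptions that both objectives are maximized and every front point is strictly better than the reference point. The function names are illustrative and the code is unrelated to the MPFT implementation.

```python
def pareto_front(points):
    """Return the non-dominated points, assuming every objective is maximized."""
    unique = list(dict.fromkeys(tuple(p) for p in points))  # drop duplicates, keep order
    front = []
    for p in unique:
        # q dominates p if q is at least as good everywhere and differs somewhere
        dominated = any(
            q != p and all(qi >= pi for qi, pi in zip(q, p)) for q in unique
        )
        if not dominated:
            front.append(p)
    return front

def hypervolume_2d(points, reference):
    """Area dominated by the front and bounded by the reference point (2 objectives)."""
    front = sorted(pareto_front(points), key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, reference[1]
    for x, y in front:
        # With x decreasing along the front, y increases, so each point adds a new strip
        area += (x - reference[0]) * (y - prev_y)
        prev_y = y
    return area
```

A larger hypervolume indicates a front that is closer to the ideal point, more spread out, or both.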
[LG-32] WhiSQA: Non-Intrusive Speech Quality Prediction Using Whisper Encoder Features
Link: https://arxiv.org/abs/2508.02210
Authors: George Close,Kris Hong,Thomas Hain,Stefan Goetze
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at SPECOM 2025
Abstract:There has been significant research effort in recent years on developing neural-network-based predictors of speech quality (SQ). While a primary objective has been to develop non-intrusive, i.e., reference-free, metrics to assess the performance of speech enhancement (SE) systems, recent work has also investigated the direct inference of neural SQ predictors within the loss function of downstream speech tasks. To aid in the training of SQ predictors, several large datasets of audio with corresponding human labels of quality have been created. Recent work in this area has shown that speech representations derived from large unsupervised or semi-supervised foundational speech models are useful input feature representations for neural SQ prediction. In this work, a novel and robust SQ predictor is proposed based on feature representations extracted from an automatic speech recognition (ASR) model, found to be a powerful input feature for the SQ prediction task. The proposed system achieves higher correlation with human mean opinion score (MOS) ratings than recent approaches on all NISQA test sets and shows significantly better domain adaptation compared to the commonly used DNSMOS metric.
[LG-33] CAAD: Context-Aware Adaptive Decoding for Truthful Text Generation
Link: https://arxiv.org/abs/2508.02184
Authors: Manh Nguyen,Sunil Gupta,Hung Le
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Ensuring truthfulness in large language models remains a critical challenge for reliable text generation. While supervised fine-tuning and reinforcement learning with human feedback have shown promise, they require substantial amounts of annotated data and computational resources, limiting scalability. In contrast, decoding-time interventions offer lightweight alternatives without model retraining. However, existing decoding strategies often face issues like prompt sensitivity, limited generalization, or dependence on internal model states. We propose a context-aware adaptive decoding method that leverages a compact reference grounding space, built from as few as 10 annotated examples and comprising pairs of context embeddings and next token logits from truthful responses, to enable retrieval-based logit shaping during inference. At each decoding step, our method retrieves top-N semantically similar contexts and aggregates their associated next token logits to modify the LLM’s logits. Across three open-ended question-answering benchmarks, our approach achieves a 2.8 percent average improvement on TruthfulQA and further outperforms existing baselines on both Biographies and WikiQA. Experimental results also demonstrate cross-task generalization, with TruthfulQA-derived grounding enhancing biography generation. Our model-agnostic, scalable, and efficient method requires only a single generation pass, highlighting the potential of context-aware decoding for factual reliability in LLMs.
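The retrieval-based logit shaping step described in this abstract can be sketched roughly as follows. The abstract does not state the similarity measure, the aggregation rule, or how retrieved logits are combined with the model's own, so cosine similarity, a similarity-weighted mean, and a linear blend controlled by `weight` are all assumptions here, as are the function names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def shape_logits(context_emb, model_logits, grounding, top_n=2, weight=0.5):
    """Blend the model's logits with logits retrieved from a grounding set.

    grounding: list of (context_embedding, next_token_logits) pairs.
    top_n, weight, and the blending rule are illustrative assumptions.
    """
    # Keep the top_n stored contexts most similar to the current context
    scored = sorted(
        grounding, key=lambda pair: cosine(context_emb, pair[0]), reverse=True
    )[:top_n]
    sims = [max(cosine(context_emb, emb), 0.0) for emb, _ in scored]
    total = sum(sims)
    if total == 0.0:
        return list(model_logits)  # nothing similar: leave the logits unchanged
    # Similarity-weighted mean of the retrieved next-token logits
    retrieved = [
        sum(s * logits[i] for s, (_, logits) in zip(sims, scored)) / total
        for i in range(len(model_logits))
    ]
    return [(1 - weight) * m + weight * r for m, r in zip(model_logits, retrieved)]
```

Because the adjustment happens once per decoding step on the logits alone, it needs no retraining and only a single generation pass, consistent with the description above.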
[LG-34] Multi-Treatment-DML: Causal Estimation for Multi-Dimensional Continuous Treatments with Monotonicity Constraints in Personal Loan Risk Optimization
Link: https://arxiv.org/abs/2508.02183
Authors: Kexin Zhao,Bo Wang,Cuiying Zhao,Tongyao Wan
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Optimizing credit limits, interest rates, and loan terms is crucial for managing borrower risk and lifetime value (LTV) on personal loan platforms. However, counterfactual estimation of these continuous, multi-dimensional treatments faces significant challenges: randomized trials are often prohibited by risk controls and long repayment cycles, forcing reliance on biased observational data. Existing causal methods primarily handle binary/discrete treatments and struggle with continuous, multi-dimensional settings. Furthermore, financial domain knowledge mandates provably monotonic treatment-outcome relationships (e.g., risk increases with credit limit). To address these gaps, we propose Multi-Treatment-DML, a novel framework leveraging Double Machine Learning (DML) to: (i) debias observational data for causal effect estimation; (ii) handle arbitrary-dimensional continuous treatments; and (iii) enforce monotonic constraints between treatments and outcomes, guaranteeing adherence to domain knowledge. Extensive experiments on public benchmarks and real-world industrial datasets demonstrate the effectiveness of our approach. Furthermore, online A/B testing conducted on a real-world personal loan platform confirms the practical superiority of Multi-Treatment-DML in real-world loan operations.
[LG-35] User Trajectory Prediction Unifying Global and Local Temporal Information
Link: https://arxiv.org/abs/2508.02161
Authors: Wei Hao,Bin Chong,Ronghua Ji,Chen Hou
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Trajectory prediction is essential for formulating proactive strategies that anticipate user mobility and support advance preparation. Therefore, how to reduce the forecasting error in user trajectory prediction within an acceptable inference time arises as an interesting issue. However, trajectory data contains both global and local temporal information, complicating the extraction of the complete temporal pattern. Moreover, user behavior occurs over different time scales, increasing the difficulty of capturing behavioral patterns. To address these challenges, a trajectory prediction model based on multilayer perceptron (MLP), multi-scale convolutional neural network (MSCNN), and cross-attention (CA) is proposed. Specifically, MLP is used to extract the global temporal information of each feature. In parallel, MSCNN is employed to extract the local temporal information by modeling interactions among features within a local temporal range. Convolutional kernels with different sizes are used in MSCNN to capture temporal information at multiple resolutions, enhancing the model’s adaptability to different behavioral patterns. Finally, CA is applied to fuse the global and local temporal information. Experimental results show that our model reduces mean squared error (MSE) by 5.04% and mean absolute error (MAE) by 4.35% compared with ModernTCN in 12-step prediction, while maintaining similar inference time.
[LG-36] PIGDreamer: Privileged Information Guided World Models for Safe Partially Observable Reinforcement Learning ICML2025
Link: https://arxiv.org/abs/2508.02159
Authors: Dongchi Huang,Jiaqi Wang,Yang Li,Chunhe Xia,Tianle Zhang,Kaige Zhang
Subjects: Machine Learning (cs.LG)
Comments: ICML 2025
Abstract:Partial observability presents a significant challenge for safe reinforcement learning, as it impedes the identification of potential risks and rewards. Leveraging specific types of privileged information during training to mitigate the effects of partial observability has yielded notable empirical successes. In this paper, we propose Asymmetric Constrained Partially Observable Markov Decision Processes (ACPOMDPs) to theoretically examine the advantages of incorporating privileged information. Building upon ACPOMDPs, we propose the Privileged Information Guided Dreamer, a model-based safe reinforcement learning approach that leverages privileged information to enhance the agent’s safety and performance through privileged representation alignment and an asymmetric actor-critic structure. Our empirical results demonstrate that our approach significantly outperforms existing methods in terms of safety and task-centric performance. Meanwhile, compared to alternative privileged model-based reinforcement learning methods, our approach exhibits superior performance and ease of training.
[LG-37] Robust Detection of Planted Subgraphs in Semi-Random Models
Link: https://arxiv.org/abs/2508.02158
Authors: Dor Elimelech,Wasim Huleihel
Subjects: Information Theory (cs.IT); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 32 pages
Abstract:Detection of planted subgraphs in Erdős-Rényi random graphs has been extensively studied, leading to a rich body of results characterizing both statistical and computational thresholds. However, most prior work assumes a purely random generative model, making the resulting algorithms potentially fragile in the face of real-world perturbations. In this work, we initiate the study of semi-random models for the planted subgraph detection problem, wherein an adversary is allowed to remove edges outside the planted subgraph before the graph is revealed to the statistician. Crucially, the statistician remains unaware of which edges have been removed, introducing fundamental challenges to the inference task. We establish fundamental statistical limits for detection under this semi-random model, revealing a sharp dichotomy. Specifically, for planted subgraphs with strongly sub-logarithmic maximum density, detection becomes information-theoretically impossible in the presence of an adversary, despite being possible in the classical random model. In stark contrast, for subgraphs with super-logarithmic density, the statistical limits remain essentially unchanged; we prove that the optimal (albeit computationally intractable) likelihood ratio test remains robust. Beyond these statistical boundaries, we design a new computationally efficient and robust detection algorithm, and provide rigorous statistical guarantees for its performance. Our results establish the first robust framework for planted subgraph detection and open new directions in the study of semi-random models, computational-statistical trade-offs, and robustness in graph inference problems.
[LG-38] FedLAD: A Linear Algebra Based Data Poisoning Defence for Federated Learning
Link: https://arxiv.org/abs/2508.02136
Authors: Qi Xiong,Hai Dong,Nasrin Sohrabi,Zahir Tari
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Sybil attacks pose a significant threat to federated learning, as malicious nodes can collaborate and gain a majority, thereby overwhelming the system. Therefore, it is essential to develop countermeasures that ensure the security of federated learning environments. We present a novel defence method, called Linear Algebra-based Detection (FedLAD), against targeted data poisoning, which is one type of Sybil attack. Unlike existing approaches, such as clustering and robust training, which struggle in situations where malicious nodes dominate, FedLAD models the federated learning aggregation process as a linear problem, transforming it into a linear algebra optimisation challenge. This method identifies potential attacks by extracting the independent linear combinations from the original linear combinations, effectively filtering out redundant and malicious elements. Extensive experimental evaluations demonstrate the effectiveness of FedLAD compared to five well-established defence methods: Sherpa, CONTRA, Median, Trimmed Mean, and Krum. Using tasks from both image classification and natural language processing, our experiments confirm that FedLAD is robust and not dependent on specific application settings. The results indicate that FedLAD effectively protects federated learning systems across a broad spectrum of malicious node ratios. Compared to baseline defence methods, FedLAD maintains a low attack success rate for malicious nodes when their ratio ranges from 0.2 to 0.8. Additionally, it preserves high model accuracy when the malicious node ratio is between 0.2 and 0.5. These findings underscore FedLAD’s potential to enhance both the reliability and performance of federated learning systems in the face of data poisoning attacks.
[LG-39] Understanding Learning Dynamics Through Structured Representations
Link: https://arxiv.org/abs/2508.02126
Authors: Saleh Nikooroo,Thomas Engel
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:While modern deep networks have demonstrated remarkable versatility, their training dynamics remain poorly understood–often driven more by empirical tweaks than architectural insight. This paper investigates how internal structural choices shape the behavior of learning systems. Building on prior efforts that introduced simple architectural constraints, we explore the broader implications of structure for convergence, generalization, and adaptation. Our approach centers on a family of enriched transformation layers that incorporate constrained pathways and adaptive corrections. We analyze how these structures influence gradient flow, spectral sensitivity, and fixed-point behavior–uncovering mechanisms that contribute to training stability and representational regularity. Theoretical analysis is paired with empirical studies on synthetic and structured tasks, demonstrating improved robustness, smoother optimization, and scalable depth behavior. Rather than prescribing fixed templates, we emphasize principles of tractable design that can steer learning behavior in interpretable ways. Our findings support a growing view that architectural design is not merely a matter of performance tuning, but a critical axis for shaping learning dynamics in scalable and trustworthy neural systems.
[LG-40] Understanding the Essence: Delving into Annotator Prototype Learning for Multi-Class Annotation Aggregation
Link: https://arxiv.org/abs/2508.02123
Authors: Ju Chen,Jun Feng,Shenyu Zhang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Multi-class classification annotations have significantly advanced AI applications, with truth inference serving as a critical technique for aggregating noisy and biased annotations. Existing state-of-the-art methods typically model each annotator’s expertise using a confusion matrix. However, these methods suffer from two widely recognized issues: 1) when most annotators label only a few tasks, or when classes are imbalanced, the estimated confusion matrices are unreliable, and 2) a single confusion matrix often remains inadequate for capturing each annotator’s full expertise patterns across all tasks. To address these issues, we propose a novel confusion-matrix-based method, PTBCC (ProtoType learning-driven Bayesian Classifier Combination), to introduce a reliable and richer annotator estimation by prototype learning. Specifically, we assume that there exists a set S of prototype confusion matrices, which capture the inherent expertise patterns of all annotators. Rather than a single confusion matrix, the expertise per annotator is extended as a Dirichlet prior distribution over these prototypes. This prototype learning-driven mechanism circumvents the data sparsity and class imbalance issues, ensuring a richer and more flexible characterization of annotators. Extensive experiments on 11 real-world datasets demonstrate that PTBCC achieves up to a 15% accuracy improvement in the best case, and a 3% higher average accuracy while reducing computational cost by over 90%.
[LG-41] Real-Time Conflict Prediction for Large Truck Merging in Mixed Traffic at Work Zone Lane Closures
Link: https://arxiv.org/abs/2508.02109
Authors: Abyad Enan,Abdullah Al Mamun,Gurcan Comert,Debbie Aisiana Indah,Judith Mwakalonge,Amy W. Apon,Mashrur Chowdhury
Subjects: Machine Learning (cs.LG)
Comments: This work has been submitted to the Transportation Research Record: Journal of the Transportation Research Board for possible publication
Abstract:Large trucks substantially contribute to work zone-related crashes, primarily due to their large size and blind spots. When approaching a work zone, large trucks often need to merge into an adjacent lane because of lane closures caused by construction activities. This study aims to enhance the safety of large truck merging maneuvers in work zones by evaluating the risk associated with merging conflicts and establishing a decision-making strategy for merging based on this risk assessment. To predict the risk of large trucks merging into a mixed traffic stream within a work zone, a Long Short-Term Memory (LSTM) neural network is employed. For a large truck intending to merge, it is critical that the immediate downstream vehicle in the target lane maintains a minimum safe gap to facilitate a safe merging process. Once a conflict-free merging opportunity is predicted, large trucks are instructed to merge in response to the lane closure. Our LSTM-based conflict prediction method is compared against baseline approaches, which include probabilistic risk-based merging, 50th percentile gap-based merging, and 85th percentile gap-based merging strategies. The results demonstrate that our method yields a lower conflict risk, as indicated by reduced Time Exposed Time-to-Collision (TET) and Time Integrated Time-to-Collision (TIT) values relative to the baseline models. Furthermore, the findings indicate that large trucks that use our method can perform early merging while still in motion, as opposed to coming to a complete stop at the end of the current lane prior to closure, which is commonly observed with the baseline approaches.
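The two surrogate safety measures named in this abstract can be computed from a time-to-collision (TTC) series. The sketch below follows the definitions commonly used in the traffic safety literature: TET accumulates the time during which TTC is below a critical threshold, and TIT accumulates how far below the threshold it falls. The function name, the use of `None` for steps with no collision course, and any threshold value are assumptions, since the abstract does not report the settings used in the study.

```python
def tet_tit(ttc_series, dt, threshold):
    """Compute (TET, TIT) from a TTC series sampled every dt seconds.

    ttc_series: TTC in seconds per time step, or None when the vehicles are
    not on a collision course. threshold: critical TTC in seconds.
    """
    tet = 0.0  # total time spent below the critical TTC
    tit = 0.0  # integral of (threshold - TTC) over that time
    for ttc in ttc_series:
        if ttc is not None and 0.0 <= ttc < threshold:
            tet += dt
            tit += (threshold - ttc) * dt
    return tet, tit
```

Lower values of both measures indicate less exposure to near-collision conditions, which is the sense in which the abstract reports reduced conflict risk.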
[LG-42] Instance-Dependent Continuous-Time Reinforcement Learning via Maximum Likelihood Estimation
Link: https://arxiv.org/abs/2508.02103
Authors: Runze Zhao,Yue Yu,Ruhan Wang,Chunfeng Huang,Dongruo Zhou
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 32 pages, 3 figures, 1 table. The first two authors contributed equally
Abstract:Continuous-time reinforcement learning (CTRL) provides a natural framework for sequential decision-making in dynamic environments where interactions evolve continuously over time. While CTRL has shown growing empirical success, its ability to adapt to varying levels of problem difficulty remains poorly understood. In this work, we investigate the instance-dependent behavior of CTRL and introduce a simple, model-based algorithm built on maximum likelihood estimation (MLE) with a general function approximator. Unlike existing approaches that estimate system dynamics directly, our method estimates the state marginal density to guide learning. We establish instance-dependent performance guarantees by deriving a regret bound that scales with the total reward variance and measurement resolution. Notably, the regret becomes independent of the specific measurement strategy when the observation frequency adapts appropriately to the problem’s complexity. To further improve performance, our algorithm incorporates a randomized measurement schedule that enhances sample efficiency without increasing measurement cost. These results highlight a new direction for designing CTRL algorithms that automatically adjust their learning behavior based on the underlying difficulty of the environment.
[LG-43] The Geometry of Machine Learning Models
Link: https://arxiv.org/abs/2508.02080
Authors: Pawel Gajer,Jacques Ravel
Subjects: Machine Learning (cs.LG)
Comments: 61 pages, 1 figure
Abstract:This paper presents a mathematical framework for analyzing machine learning models through the geometry of their induced partitions. By representing partitions as Riemannian simplicial complexes, we capture not only adjacency relationships but also geometric properties including cell volumes, volumes of faces where cells meet, and dihedral angles between adjacent cells. For neural networks, we introduce a differential forms approach that tracks geometric structure through layers via pullback operations, making computations tractable by focusing on data-containing cells. The framework enables geometric regularization that directly penalizes problematic spatial configurations and provides new tools for model refinement through extended Laplacians and simplicial splines. We also explore how data distribution induces effective geometric curvature in model partitions, developing discrete curvature measures for vertices that quantify local geometric complexity and statistical Ricci curvature for edges that captures pairwise relationships between cells. While focused on mathematical foundations, this geometric perspective offers new approaches to model interpretation, regularization, and diagnostic tools for understanding learning dynamics. 
[LG-44] NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks
Link: https://arxiv.org/abs/2508.02046
Authors: Zhihao Luo,Wentao Yan,Jingyu Gong,Min Wang,Zhizhong Zhang,Xuhong Wang,Yuan Xie,Xin Tan
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Homepage: this https URL
Abstract:Recent advances in Graphical User Interface (GUI) and embodied navigation have driven significant progress, yet these domains have largely evolved in isolation, with disparate datasets and training paradigms. In this paper, we observe that both tasks can be formulated as Markov Decision Processes (MDPs), suggesting a foundational principle for their unification. Hence, we present NaviMaster, the first unified agent capable of seamlessly integrating GUI navigation and embodied navigation within a single framework. Specifically, NaviMaster (i) proposes a visual-target trajectory collection pipeline that generates trajectories for both GUI and embodied tasks in one formulation; (ii) employs a unified reinforcement learning framework on the mixed data for better generalization; and (iii) designs a novel distance-aware reward to ensure efficient learning from the trajectories. Through extensive experiments on out-of-domain benchmarks, NaviMaster is shown to outperform state-of-the-art agents in GUI navigation, spatial affordance prediction, and embodied navigation. Ablation studies further confirm the efficacy of our unified training strategy, data mixing strategy, and reward design.
[LG-45] Model Recycling Framework for Multi-Source Data-Free Supervised Transfer Learning
Link: https://arxiv.org/abs/2508.02039
Authors: Sijia Wang,Ricardo Henao
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Increasing concerns for data privacy and other difficulties associated with retrieving source data for model training have created the need for source-free transfer learning, in which one only has access to pre-trained models instead of data from the original source domains. This setting introduces many challenges, as many existing transfer learning methods typically rely on access to source data, which limits their direct applicability to scenarios where source data is unavailable. Further, practical concerns make it more difficult, for instance efficiently selecting models for transfer without information on source data, and transferring without full access to the source models. So motivated, we propose a model recycling framework for parameter-efficient training of models that identifies subsets of related source models to reuse in both white-box and black-box settings. Consequently, our framework makes it possible for Model as a Service (MaaS) providers to build libraries of efficient pre-trained models, thus creating an opportunity for multi-source data-free supervised transfer learning.
[LG-46] An Evolving Scenario Generation Method based on Dual-modal Driver Model Trained by Multi-Agent Reinforcement Learning
Link: https://arxiv.org/abs/2508.02027
Authors: Xinzheng Wu,Junyi Chen,Shaolingfeng Ye,Wei Jiang,Yong Shen
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 17 figures
Abstract:In autonomous driving testing methods based on evolving scenarios, the construction method of the driver model, which determines the driving maneuvers of background vehicles (BVs) in the scenario, plays a critical role in generating safety-critical scenarios. In particular, the cooperative adversarial driving characteristics between BVs can contribute to the efficient generation of safety-critical scenarios with high testing value. In this paper, a multi-agent reinforcement learning (MARL) method is used to train and generate a dual-modal driver model (Dual-DM) with non-adversarial and adversarial driving modalities. The model is then connected to a continuous simulated traffic environment to generate complex, diverse, and strongly interactive safety-critical scenarios through an evolving scenario generation method. After that, the generated evolving scenarios are evaluated in terms of fidelity, test efficiency, complexity and diversity. Results show that without performance degradation in scenario fidelity (85% similarity to real-world scenarios) and complexity (complexity metric: 0.45, +32.35% and +12.5% over two baselines), Dual-DM achieves a substantial enhancement in the efficiency of generating safety-critical scenarios (efficiency metric: 0.86, +195% over two baselines). Furthermore, statistical analysis and case studies demonstrate the diversity of safety-critical evolving scenarios generated by Dual-DM in terms of the adversarial interaction patterns. Therefore, Dual-DM can greatly improve the generation of safety-critical scenarios through an evolving scenario generation method.
[LG-47] A Comprehensive Analysis of Evolving Permission Usage in Android Apps: Trends, Threats, and Ecosystem Insights
Link: https://arxiv.org/abs/2508.02008
Authors: Ali Alkinoon,Trung Cuong Dang,Ahod Alghuried,Abdulaziz Alghamdi,Soohyeon Choi,Manar Mohaisen,An Wang,Saeed Salem,David Mohaisen
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 16 pages, 6 figures, 14 tables. In submission to Journal of Cybersecurity and Privacy
Abstract:The proper use of Android app permissions is crucial to the success and security of these apps. Users must agree to permission requests when installing or running their apps. Despite official Android platform documentation on proper permission usage, there are still many cases of permission abuse. This study provides a comprehensive analysis of the Android permission landscape, highlighting trends and patterns in permission requests across various applications from the Google Play Store. By distinguishing between benign and malicious applications, we uncover developers’ evolving strategies, with malicious apps increasingly requesting fewer permissions to evade detection, while benign apps request more to enhance functionality. In addition to examining permission trends across years and app features such as advertisements, in-app purchases, content ratings, and app sizes, we leverage association rule mining using the FP-Growth algorithm. This allows us to uncover frequent permission combinations across the entire dataset, specific years, and 16 app genres. The analysis reveals significant differences in permission usage patterns, providing a deeper understanding of co-occurring permissions and their implications for user privacy and app functionality. By categorizing permissions into high-level semantic groups and examining their application across distinct app categories, this study offers a structured approach to analyzing the dynamics within the Android ecosystem. The findings emphasize the importance of continuous monitoring, user education, and regulatory oversight to address permission misuse effectively.
[LG-48] Generative Large-Scale Pre-trained Models for Automated Ad Bidding Optimization
Link: https://arxiv.org/abs/2508.02002
Authors: Yu Lei,Jiayang Zhao,Yilei Zhao,Zhaoqi Zhang,Linyou Cai,Qianlong Xie,Xingxing Wang
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Abstract:Modern auto-bidding systems are required to balance overall performance with diverse advertiser goals and real-world constraints, reflecting the dynamic and evolving needs of the industry. Recent advances in conditional generative models, such as transformers and diffusers, have enabled direct trajectory generation tailored to advertiser preferences, offering a promising alternative to traditional Markov Decision Process-based methods. However, these generative methods face significant challenges, such as the distribution shift between offline and online environments, limited exploration of the action space, and the necessity to meet constraints like marginal Cost-per-Mille (CPM) and Return on Investment (ROI). To tackle these challenges, we propose GRAD (Generative Reward-driven Ad-bidding with Mixture-of-Experts), a scalable foundation model for auto-bidding that combines an Action-Mixture-of-Experts module for diverse bidding action exploration with the Value Estimator of Causal Transformer for constraint-aware optimization. Extensive offline and online experiments demonstrate that GRAD significantly enhances platform revenue, highlighting its effectiveness in addressing the evolving and diverse requirements of modern advertisers. Furthermore, GRAD has been implemented in multiple marketing scenarios at Meituan, one of the world’s largest online food delivery platforms, leading to a 2.18% increase in Gross Merchandise Value (GMV) and 10.68% increase in ROI.
[LG-49] Convolutions are Competitive with Transformers for Encrypted Traffic Classification with Pre-training
Link: https://arxiv.org/abs/2508.02001
Authors: Chungang Lin,Weiyao Zhang,Tianyu Zuo,Chao Zha,Yilong Jiang,Ruiqi Meng,Haitong Luo,Xuying Meng,Yujun Zhang
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: Under review
Abstract:Encrypted traffic classification is vital for modern network management and security. To reduce reliance on handcrafted features and labeled data, recent methods focus on learning generic representations through pre-training on large-scale unlabeled data. However, current pre-trained models face two limitations originating from the adopted Transformer architecture: (1) Limited model efficiency due to the self-attention mechanism with quadratic complexity; (2) Unstable traffic scalability to longer byte sequences, as the explicit positional encodings fail to generalize to input lengths not seen during pre-training. In this paper, we investigate whether convolutions, with linear complexity and implicit positional encoding, are competitive with Transformers in encrypted traffic classification with pre-training. We first conduct a systematic comparison, and observe that convolutions achieve higher efficiency and scalability, with lower classification performance. To address this trade-off, we propose NetConv, a novel pre-trained convolution model for encrypted traffic classification. NetConv employs stacked traffic convolution layers, which enhance the ability to capture localized byte-sequence patterns through window-wise byte scoring and sequence-wise byte gating. We design a continuous byte masking pre-training task to help NetConv learn protocol-specific patterns. Experimental results on four tasks demonstrate that NetConv improves average classification performance by 6.88% and model throughput by 7.41X over existing pre-trained models.
[LG-50] Toward Efficient Spiking Transformers: Synapse Pruning Meets Synergistic Learning-Based Compensation
Link: https://arxiv.org/abs/2508.01992
Authors: Hongze Sun,Wuque Cai,Duo Chen,Shifeng Mao,Jiayi He,Zhenxing Wang,Dezhong Yao,Daqing Guo
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Abstract:As a foundational architecture of artificial intelligence models, Transformer has been recently adapted to spiking neural networks with promising performance across various tasks. However, existing spiking Transformer (ST)-based models require a substantial number of parameters and incur high computational costs, thus limiting their deployment in resource-constrained environments. To address these challenges, we propose combining synapse pruning with a synergistic learning-based compensation strategy to derive lightweight ST-based models. Specifically, two types of tailored pruning strategies are introduced to reduce redundancy in the weight matrices of ST blocks: an unstructured $\mathrm{L_1P}$ method to induce sparse representations, and a structured DSP method to induce low-rank representations. In addition, we propose an enhanced spiking neuron model, termed the synergistic leaky integrate-and-fire (sLIF) neuron, to effectively compensate for model pruning through synergistic learning between synaptic and intrinsic plasticity mechanisms. Extensive experiments on benchmark datasets demonstrate that the proposed methods significantly reduce model size and computational overhead while maintaining competitive performance. These results validate the effectiveness of the proposed pruning and compensation strategies in constructing efficient and high-performing ST-based models.
[LG-51] Diffusion models for inverse problems
Link: https://arxiv.org/abs/2508.01975
Authors: Hyungjin Chung,Jeongsol Kim,Jong Chul Ye
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:The use of diffusion priors to solve inverse problems in imaging has matured significantly over the years. In this chapter, we review the various approaches that have been proposed. We categorize the approaches into the more classic explicit approximation approaches and others, which include variational inference, sequential Monte Carlo, and decoupled data consistency. We cover the extension to more challenging situations, including blind cases, high-dimensional data, and problems under data scarcity and distribution mismatch. More recent approaches that aim to leverage multimodal information through texts are covered. Through this chapter, we aim to (i) distill the common mathematical threads that connect these algorithms, (ii) systematically contrast their assumptions and performance trade-offs across representative inverse problems, and (iii) spotlight the open theoretical and practical challenges by clarifying the landscape of diffusion-model-based inverse problem solvers.
[LG-52] Revitalizing Canonical Pre-Alignment for Irregular Multivariate Time Series Forecasting
Link: https://arxiv.org/abs/2508.01971
Authors: Ziyu Zhou,Yiming Huang,Yanyun Wang,Yuankai Wu,James Kwok,Yuxuan Liang
Categories: Machine Learning (cs.LG)
Comments: Under review
Abstract:Irregular multivariate time series (IMTS), characterized by uneven sampling and inter-variate asynchrony, fuel many forecasting applications yet remain challenging to model efficiently. Canonical Pre-Alignment (CPA) has been widely adopted in IMTS modeling by padding zeros at every global timestamp, thereby alleviating inter-variate asynchrony and unifying the series length, but its dense zero-padding inflates the pre-aligned series length, especially when numerous variates are present, causing prohibitive compute overhead. Recent graph-based models with patching strategies sidestep CPA, but their local message passing struggles to capture global inter-variate correlations. Therefore, we posit that CPA should be retained, with the pre-aligned series properly handled by the model, enabling it to outperform state-of-the-art graph-based baselines that sidestep CPA. Technically, we propose KAFNet, a compact architecture grounded in CPA for IMTS forecasting that couples (1) a Pre-Convolution module for sequence smoothing and sparsity mitigation, (2) a Temporal Kernel Aggregation module for learnable compression and modeling of intra-series irregularity, and (3) Frequency Linear Attention blocks for low-cost modeling of inter-series correlations in the frequency domain. Experiments on multiple IMTS datasets show that KAFNet achieves state-of-the-art forecasting performance, with a $7.2\times$ parameter reduction and an $8.4\times$ training-inference acceleration.
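The zero-padding step that the abstract attributes to Canonical Pre-Alignment can be illustrated with a minimal sketch. It assumes each variate is stored as a list of (timestamp, value) pairs; the function name `canonical_pre_align`, the sample variates, and the observation mask are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Observations = List[Tuple[float, float]]  # (timestamp, value) pairs


def canonical_pre_align(
    series: Dict[str, Observations],
) -> Tuple[List[float], Dict[str, List[float]], Dict[str, List[int]]]:
    """Align irregular variates onto the union of all observed timestamps."""
    # Global grid: every timestamp at which at least one variate was observed.
    grid = sorted({t for obs in series.values() for t, _ in obs})
    values: Dict[str, List[float]] = {}
    masks: Dict[str, List[int]] = {}
    for name, obs in series.items():
        lookup = dict(obs)
        # Zero-pad wherever this variate has no observation on the grid.
        values[name] = [lookup.get(t, 0.0) for t in grid]
        # Assumption: a 0/1 mask is kept so padded zeros can be told apart
        # from genuinely observed zeros.
        masks[name] = [1 if t in lookup else 0 for t in grid]
    return grid, values, masks


# Hypothetical two-variate example with asynchronous sampling.
imts = {
    "heart_rate": [(0.0, 72.0), (2.5, 75.0)],
    "temperature": [(1.0, 36.6)],
}
grid, values, masks = canonical_pre_align(imts)
print(grid)
print(values)
print(masks)
```

Every variate ends up with one slot per global timestamp, so the aligned length grows with the number of distinct timestamps across all variates. That growth is the overhead the abstract describes.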
[LG-53] Improving Hospital Risk Prediction with Knowledge-Augmented Multimodal EHR Modeling
Link: https://arxiv.org/abs/2508.01970
Authors: Rituparna Datta,Jiaming Cui,Zihan Guan,Rupesh Silwal,Joshua C Eby,Gregory Madden,Anil Vullikanti
Categories: Machine Learning (cs.LG)
Abstract:Accurate prediction of clinical outcomes using Electronic Health Records (EHRs) is critical for early intervention, efficient resource allocation, and improved patient care. EHRs contain multimodal data, including both structured data and unstructured clinical notes that provide rich, context-specific information. In this work, we introduce a unified framework that seamlessly integrates these diverse modalities, leveraging all relevant available information through a two-stage architecture for clinical risk prediction. In the first stage, a fine-tuned Large Language Model (LLM) extracts crucial, task-relevant information from clinical notes, which is enhanced by graph-based retrieval of external domain knowledge from a medical corpus such as PubMed, grounding the LLM’s understanding. The second stage combines both unstructured representations and features derived from the structured data to generate the final predictions. This approach supports a wide range of clinical tasks. Here, we demonstrate its effectiveness on 30-day readmission and in-hospital mortality prediction. Experimental results show that our framework achieves strong performance, with AUC scores of 0.84 and 0.92, respectively, despite these tasks involving severely imbalanced datasets, with positive rates ranging from approximately 4% to 13%. Moreover, it outperforms all existing baselines and clinical practices, including established risk scoring systems. To the best of our knowledge, this is one of the first frameworks for healthcare prediction that enhances the power of an LLM-based graph-guided knowledge retrieval method by combining it with structured data for improved clinical outcome prediction.
[LG-54] Stochastic Encodings for Active Feature Acquisition ICML2025
Link: https://arxiv.org/abs/2508.01957
Authors: Alexander Norcliffe,Changhee Lee,Fergus Imrie,Mihaela van der Schaar,Pietro Lio
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 31 pages, 15 figures, 17 tables, published at ICML 2025
Abstract:Active Feature Acquisition is an instance-wise, sequential decision making problem. The aim is to dynamically select which feature to measure based on current observations, independently for each test instance. Common approaches either use Reinforcement Learning, which experiences training difficulties, or greedily maximize the conditional mutual information of the label and unobserved features, which makes myopic acquisitions. To address these shortcomings, we introduce a latent variable model, trained in a supervised manner. Acquisitions are made by reasoning about the features across many possible unobserved realizations in a stochastic latent space. Extensive evaluation on a large range of synthetic and real datasets demonstrates that our approach reliably outperforms a diverse set of baselines.
[LG-55] Navigating High Dimensional Concept Space with Metalearning
Link: https://arxiv.org/abs/2508.01948
Authors: Max Gupta
Categories: Machine Learning (cs.LG)
Abstract:Rapidly learning abstract concepts from limited examples is a hallmark of human intelligence. This work investigates whether gradient-based meta-learning can equip neural networks with inductive biases for efficient few-shot acquisition of discrete concepts. We compare meta-learning methods against a supervised learning baseline on Boolean tasks generated by a probabilistic context-free grammar (PCFG). By systematically varying concept dimensionality (number of features) and compositionality (depth of grammar recursion), we identify regimes in which meta-learning robustly improves few-shot concept learning. We find improved performance and sample efficiency by training a multilayer perceptron (MLP) across concept spaces increasing in dimensional and compositional complexity. We show that meta-learners handle compositional complexity much better than featural complexity, and we establish an empirical analysis demonstrating how featural complexity shapes ‘concept basins’ of the loss landscape, allowing curvature-aware optimization to be more effective than first-order methods. We see that we can robustly increase generalization on complex concepts by increasing the number of adaptation steps in meta-SGD, encouraging exploration of rougher loss basins. Overall, this work highlights the intricacies of learning compositional versus featural complexity in high-dimensional concept spaces and provides a road to understanding the role of second-order methods and extended gradient adaptation in few-shot concept learning.
[LG-56] From Binary to Continuous: Stochastic Re-Weighting for Robust Graph Explanation
Link: https://arxiv.org/abs/2508.01925
Authors: Zhuomin Chen,Jingchao Ni,Hojat Allah Salehi,Xu Zheng,Dongsheng Luo
Categories: Machine Learning (cs.LG)
Abstract:Graph Neural Networks (GNNs) have achieved remarkable performance in a wide range of graph-related learning tasks. However, explaining their predictions remains a challenging problem, especially due to the mismatch between the graphs used during training and those encountered during explanation. Most existing methods optimize soft edge masks on weighted graphs to highlight important substructures, but these graphs differ from the unweighted graphs on which GNNs are trained. This distributional shift leads to unreliable gradients and degraded explanation quality, especially when generating small, sparse subgraphs. To address this issue, we propose a novel iterative explanation framework which improves explanation robustness by aligning the model’s training data distribution with the weighted graph distribution encountered during explanation. Our method alternates between two phases: explanation subgraph identification and model adaptation. It begins with a relatively large explanation subgraph where soft mask optimization is reliable. Based on this subgraph, we assign importance-aware edge weights to explanatory and non-explanatory edges, and retrain the GNN on these weighted graphs. This process is repeated with progressively smaller subgraphs, forming an iterative refinement procedure. We evaluate our method on multiple benchmark datasets using different GNN backbones and explanation methods. Experimental results show that our method consistently improves explanation quality and can be flexibly integrated with different architectures.
[LG-57] IMUCoCo: Enabling Flexible On-Body IMU Placement for Human Pose Estimation and Activity Recognition
Link: https://arxiv.org/abs/2508.01894
Authors: Haozhe Zhou,Riku Arakawa,Yuvraj Agarwal,Mayank Goel
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract:IMUs are regularly used to sense human motion, recognize activities, and estimate full-body pose. Users are typically required to place sensors in predefined locations that are often dictated by common wearable form factors and the machine learning model’s training process. Consequently, despite the increasing number of everyday devices equipped with IMUs, the limited adaptability has seriously constrained the user experience to only using a few well-explored device placements (e.g., wrist and ears). In this paper, we rethink IMU-based motion sensing by acknowledging that signals can be captured from any point on the human body. We introduce IMU over Continuous Coordinates (IMUCoCo), a novel framework that maps signals from a variable number of IMUs placed on the body surface into a unified feature space based on their spatial coordinates. These features can be plugged into downstream models for pose estimation and activity recognition. Our evaluations demonstrate that IMUCoCo supports accurate pose estimation in a wide range of typical and atypical sensor placements. Overall, IMUCoCo supports significantly more flexible use of IMUs for motion sensing than the state-of-the-art, allowing users to place their sensor-laden devices according to their needs and preferences. The framework also supports the ability to change device locations depending on the context and suggests placement depending on the use case.
[LG-58] Optimizing Day-Ahead Energy Trading with Proximal Policy Optimization and Blockchain
Link: https://arxiv.org/abs/2508.01888
Authors: Navneet Verma,Ying Xie
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Abstract:The increasing penetration of renewable energy sources in day-ahead energy markets introduces challenges in balancing supply and demand, ensuring grid resilience, and maintaining trust in decentralized trading systems. This paper proposes a novel framework that integrates the Proximal Policy Optimization (PPO) algorithm, a state-of-the-art reinforcement learning method, with blockchain technology to optimize automated trading strategies for prosumers in day-ahead energy markets. We introduce a comprehensive framework that employs an RL agent for multi-objective energy optimization and blockchain for tamper-proof data and transaction management. Simulations using real-world data from the Electric Reliability Council of Texas (ERCOT) demonstrate the effectiveness of our approach. The RL agent achieves demand-supply balancing within 2% and maintains near-optimal supply costs for the majority of the operating hours. Moreover, it generates robust battery storage policies capable of handling variability in solar and wind generation. All decisions are recorded on an Algorand-based blockchain, ensuring transparency, auditability, and security, which are key enablers for trustworthy multi-agent energy trading. Our contributions include a novel system architecture, curriculum learning for robust agent development, and actionable policy insights for practical deployment.
[LG-59] Proactive Constrained Policy Optimization with Preemptive Penalty
Link: https://arxiv.org/abs/2508.01883
Authors: Ning Yang,Pengyu Wang,Guoqing Liu,Haifeng Zhang,Pin Lyu,Jun Wang
Categories: Machine Learning (cs.LG)
Abstract:Safe Reinforcement Learning (RL) often faces significant issues such as constraint violations and instability, necessitating the use of constrained policy optimization, which seeks optimal policies while ensuring adherence to specific constraints like safety. Typically, constrained optimization problems are addressed by the Lagrangian method, a post-violation remedial approach that may result in oscillations and overshoots. Motivated by this, we propose a novel method named Proactive Constrained Policy Optimization (PCPO) that incorporates a preemptive penalty mechanism. This mechanism integrates barrier items into the objective function as the policy nears the boundary, imposing a cost. Meanwhile, we introduce a constraint-aware intrinsic reward to guide boundary-aware exploration, which is activated only when the policy approaches the constraint boundary. We establish theoretical upper and lower bounds for the duality gap and the performance of the PCPO update, shedding light on the method’s convergence characteristics. Additionally, to enhance the optimization performance, we adopt a policy iteration approach. An interesting finding is that PCPO demonstrates significant stability in experiments. Experimental results indicate that the PCPO framework provides a robust solution for policy optimization under constraints, with important implications for future research and practical applications.
[LG-60] Causal Discovery in Multivariate Time Series through Mutual Information Featurization
Link: https://arxiv.org/abs/2508.01848
Authors: Gian Marco Paldino,Gianluca Bontempi
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Abstract:Discovering causal relationships in complex multivariate time series is a fundamental scientific challenge. Traditional methods often falter, either by relying on restrictive linear assumptions or on conditional independence tests that become uninformative in the presence of intricate, non-linear dynamics. This paper proposes a new paradigm, shifting from statistical testing to pattern recognition. We hypothesize that a causal link creates a persistent and learnable asymmetry in the flow of information through a system’s temporal graph, even when clear conditional independencies are obscured. We introduce Temporal Dependency to Causality (TD2C), a supervised learning framework that operationalizes this hypothesis. TD2C learns to recognize these complex causal signatures from a rich set of information-theoretic and statistical descriptors. Trained exclusively on a diverse collection of synthetic time series, TD2C demonstrates remarkable zero-shot generalization to unseen dynamics and established, realistic benchmarks. Our results show that TD2C achieves state-of-the-art performance, consistently outperforming established methods, particularly in high-dimensional and non-linear settings. By reframing the discovery problem, our work provides a robust and scalable new tool for uncovering causal structures in complex systems.
[LG-61] A Trainable Optimizer
Link: https://arxiv.org/abs/2508.01764
Authors: Ruiqi Wang,Diego Klabjan
Categories: Machine Learning (cs.LG)
Abstract:The concept of learning to optimize involves utilizing a trainable optimization strategy rather than relying on manually defined full gradient estimations such as ADAM. We present a framework that jointly trains the full gradient estimator and the trainable weights of the model. Specifically, we prove that pseudo-linear TO (Trainable Optimizer), a linear approximation of the full gradient, matches SGD’s convergence rate while effectively reducing variance. Pseudo-linear TO incurs negligible computational overhead, requiring only minimal additional tensor multiplications. To further improve computational efficiency, we introduce two simplified variants of Pseudo-linear TO. Experiments demonstrate that TO methods converge faster than benchmark algorithms (e.g., ADAM) in both strongly convex and non-convex settings, as well as in fine-tuning an LLM.
[LG-62] Energy-Efficient Federated Learning for Edge Real-Time Vision via Joint Data, Computation, and Communication Design
Link: https://arxiv.org/abs/2508.01745
Authors: Xiangwang Hou,Jingjing Wang,Fangming Guan,Jun Du,Chunxiao Jiang,Yong Ren
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract:Emerging real-time computer vision (CV) applications on wireless edge devices demand energy-efficient and privacy-preserving learning. Federated learning (FL) enables on-device training without raw data sharing, yet remains challenging in resource-constrained environments due to energy-intensive computation and communication, as well as limited and non-i.i.d. local data. We propose FedDPQ, an ultra energy-efficient FL framework for real-time CV over unreliable wireless networks. FedDPQ integrates diffusion-based data augmentation, model pruning, communication quantization, and transmission power control to enhance training efficiency. It expands local datasets using synthetic data, reduces computation through pruning, compresses updates via quantization, and mitigates transmission outages with adaptive power control. We further derive a closed-form energy-convergence model capturing the coupled impact of these components, and develop a Bayesian optimization (BO)-based algorithm to jointly tune data augmentation strategy, pruning ratio, quantization level, and power control. To the best of our knowledge, this is the first work to jointly optimize FL performance from the perspectives of data, computation, and communication under unreliable wireless conditions. Experiments on representative CV tasks show that FedDPQ achieves superior convergence speed and energy efficiency.
[LG-63] AGFT: An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization
Link: https://arxiv.org/abs/2508.01744
Authors: Zicong Ye,Kunming Zhang,Guoming Tang
Categories: Machine Learning (cs.LG)
Abstract:The explosive growth of interactive Large Language Models (LLMs) has placed unprecedented demands for low latency on cloud GPUs, forcing them into high-power modes and causing escalating energy costs. Real-time inference workloads exhibit significant dynamic volatility, presenting substantial energy-saving opportunities. However, traditional static or rule-based power management strategies struggle to exploit these opportunities without compromising peak performance. To address this challenge, we propose AGFT (An Adaptive GPU Frequency Tuner), a framework that employs online reinforcement learning to autonomously learn an optimal frequency tuning policy. By monitoring real-time features like request load and latency, AGFT utilizes fine-grained frequency control for precise adjustments and intelligent action space pruning for stable, efficient decision-making. This creates a robust, automated energy management solution. We comprehensively evaluated AGFT in an environment simulating realistic, fluctuating inference requests. The experimental results demonstrate that AGFT successfully saves 44.3% of GPU energy consumption while introducing a minimal performance latency overhead of under 10%. This achievement translates into a comprehensive Energy-Delay Product (EDP) optimization of up to 40.3%, clearly showing that our framework can significantly enhance the energy efficiency and economic benefits of existing LLM inference clusters without compromising service quality.
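How an energy saving and a latency overhead combine into an Energy-Delay Product figure can be checked with a few lines of arithmetic. The sketch below assumes the conventional definition EDP = energy × delay; the abstract does not state the exact formula used, and the helper name `edp_reduction` is hypothetical.

```python
def edp_reduction(energy_saving: float, latency_overhead: float) -> float:
    """Fractional EDP reduction, assuming EDP = energy * delay.

    energy_saving: fraction of energy saved (0.443 for 44.3%).
    latency_overhead: fractional increase in delay (0.10 for 10%).
    """
    energy_ratio = 1.0 - energy_saving    # new energy / old energy
    delay_ratio = 1.0 + latency_overhead  # new delay / old delay
    return 1.0 - energy_ratio * delay_ratio


# Sweep a few latency overheads at the energy saving quoted in the abstract.
for overhead in (0.00, 0.05, 0.10):
    reduction = edp_reduction(0.443, overhead)
    print(f"overhead={overhead:.0%}  EDP reduction={reduction:.1%}")
```

Under this assumed definition, a 44.3% energy saving combined with a full 10% latency overhead gives an EDP reduction of about 38.7%. The reported 40.3% would therefore correspond to a somewhat smaller overhead, if both figures describe the same run, which the abstract does not say.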
[LG-64] Neural Policy Iteration for Stochastic Optimal Control: A Physics-Informed Approach
Link: https://arxiv.org/abs/2508.01718
Authors: Yeongjong Kim,Yeoneung Kim,Minseok Kim,Namkyeong Cho
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
Abstract:We propose a physics-informed neural network policy iteration (PINN-PI) framework for solving stochastic optimal control problems governed by second-order Hamilton–Jacobi–Bellman (HJB) equations. At each iteration, a neural network is trained to approximate the value function by minimizing the residual of a linear PDE induced by a fixed policy. This linear structure enables systematic L^2 error control at each policy evaluation step, and allows us to derive explicit Lipschitz-type bounds that quantify how value gradient errors propagate to the policy updates. This interpretability provides a theoretical basis for evaluating policy quality during training. Our method extends recent deterministic PINN-based approaches to stochastic settings, inheriting the global exponential convergence guarantees of classical policy iteration under mild conditions. We demonstrate the effectiveness of our method on several benchmark problems, including stochastic cartpole, pendulum problems and high-dimensional linear quadratic regulation (LQR) problems in up to 10D.
[LG-65] Innovative tokenisation of structured data for LLM training
Link: https://arxiv.org/abs/2508.01685
Authors: Kayvan Karim,Hani Ragab Hassen,Hadj Batatia
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract:Data representation remains a fundamental challenge in machine learning, particularly when adapting sequence-based architectures like Transformers and Large Language Models (LLMs) for structured tabular data. Existing methods often fail to cohesively encode the mix of numerical and categorical features or preserve the inherent structure of tables. This paper introduces a novel, hybrid tokenisation methodology designed to convert tabular data into a unified, sequential format suitable for LLM training. Our approach combines predefined fixed tokens to represent structural elements and low-cardinality categorical features, with a learned subword vocabulary using Byte-Pair Encoding (BPE) for high-cardinality and continuous values. We demonstrate the efficacy of this technique by applying it to a large-scale NetFlow dataset (CIDDS-001), preparing a corpus for a Network Intrusion Detection System (NIDS) foundation model. The evaluation shows that our method is highly efficient, processing over 31 million network flows in under five hours and achieving a significant data compression ratio of 6.18:1. This process resulted in a computationally manageable corpus of over one billion tokens, establishing a viable and generalisable pathway for training foundation models on structured data.
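The split between fixed tokens and learned subwords described in the abstract can be sketched as follows. This is a minimal illustration only: the field names, the `<...>` token spellings, and the `LOW_CARDINALITY` table are hypothetical, and the BPE stage is left out, so high-cardinality and continuous values are simply passed through as raw strings for a separately trained BPE tokenizer.

```python
from typing import Dict, List, Set

# Hypothetical table of low-cardinality categorical fields and their values.
LOW_CARDINALITY: Dict[str, Set[str]] = {
    "proto": {"TCP", "UDP", "ICMP"},
}


def tokenize_flow(flow: Dict[str, str]) -> List[str]:
    """Turn one tabular record into a flat token sequence."""
    tokens = ["<ROW>"]  # fixed structural token marking the record start
    for field, value in flow.items():
        tokens.append(f"<{field.upper()}>")  # fixed token naming the column
        if value in LOW_CARDINALITY.get(field, set()):
            # Low-cardinality categorical value: one predefined token.
            tokens.append(f"<{field.upper()}={value}>")
        else:
            # High-cardinality or continuous value: kept as raw text, to be
            # split into subwords by a BPE tokenizer (not shown here).
            tokens.append(value)
    tokens.append("</ROW>")  # fixed structural token marking the record end
    return tokens


record = {"proto": "TCP", "bytes": "48213"}
print(tokenize_flow(record))
```

Keeping structural and low-cardinality tokens fixed means the column layout survives tokenisation, while the subword stage only has to cover the open-ended values.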
[LG-66] Generalized Kernelized Bandits: Self-Normalized Bernstein-Like Dimension-Free Inequality and Regret Bounds
Link: https://arxiv.org/abs/2508.01681
Authors: Alberto Maria Metelli, Simone Drago, Marco Mussi
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:We study the regret minimization problem in the novel setting of generalized kernelized bandits (GKBs), where we optimize an unknown function f^* belonging to a reproducing kernel Hilbert space (RKHS), having access to samples generated by an exponential family (EF) noise model whose mean is a non-linear function \mu(f^*). This model extends both kernelized bandits (KBs) and generalized linear bandits (GLBs). We propose an optimistic algorithm, GKB-UCB, and we explain why existing self-normalized concentration inequalities do not yield tight regret guarantees. For this reason, we devise a novel self-normalized Bernstein-like dimension-free inequality resorting to Freedman’s inequality and a stitching argument, which represents a contribution of independent interest. Based on it, we conduct a regret analysis of GKB-UCB, deriving a regret bound of order \widetilde{O}(\gamma_T \sqrt{T}/\kappa_*), where T is the learning horizon, \gamma_T the maximal information gain, and \kappa_* a term characterizing the magnitude of the reward nonlinearity. Our result matches, up to multiplicative constants and logarithmic terms, the state-of-the-art bounds for both KBs and GLBs and provides a unified view of both settings.
[LG-67] Asynchronous Federated Learning with non-convex client objective functions and heterogeneous dataset
Link: https://arxiv.org/abs/2508.01675
Authors: Ali Forootani, Raffaele Iervolino
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract:Federated Learning (FL) enables collaborative model training across decentralized devices while preserving data privacy. However, traditional FL suffers from communication overhead, system heterogeneity, and straggler effects. Asynchronous Federated Learning (AFL) addresses these by allowing clients to update independently, improving scalability and reducing synchronization delays. This paper extends AFL to handle non-convex objective functions and heterogeneous datasets, common in modern deep learning. We present a rigorous convergence analysis, deriving bounds on the expected gradient norm and studying the effects of staleness, variance, and heterogeneity. To mitigate stale updates, we introduce a staleness-aware aggregation that prioritizes fresher updates and a dynamic learning rate schedule that adapts to client staleness and heterogeneity, improving stability and convergence. Our framework accommodates variations in computational power, data distribution, and communication delays, making it practical for real-world applications. We also analyze the impact of client selection strategies (sampling with or without replacement) on variance and convergence. Implemented in PyTorch with Python’s asyncio, our approach is validated through experiments demonstrating improved performance and scalability for asynchronous, heterogeneous, and non-convex FL scenarios.
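One simple way to prioritize fresher updates is to down-weight each client update by its staleness before averaging. The sketch below assumes a polynomial decay `(1 + s) ** -alpha`; the abstract does not state the weighting rule, so the decay form, the `alpha` parameter, and both function names are illustrative assumptions.

```python
def staleness_weights(staleness, alpha=0.5):
    """Normalized weights that shrink as an update gets staler (assumed polynomial decay)."""
    raw = [(1.0 + s) ** (-alpha) for s in staleness]
    total = sum(raw)
    return [w / total for w in raw]

def aggregate(global_model, client_updates, staleness, lr=1.0, alpha=0.5):
    """Apply a staleness-weighted average of client updates to the global parameters."""
    weights = staleness_weights(staleness, alpha)
    new_model = list(global_model)
    for w, update in zip(weights, client_updates):
        for i, delta in enumerate(update):
            new_model[i] += lr * w * delta
    return new_model

# Two clients: one fresh (staleness 0), one three rounds behind.
model = aggregate([0.0, 0.0], [[1.0, 1.0], [-1.0, 3.0]], staleness=[0, 3])
print(model)
```

With `alpha=0.5` the fresh client receives twice the weight of the client that is three rounds stale.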
[LG-68] Boosting Generalization Performance in Model-Heterogeneous Federated Learning Using Variational Transposed Convolution
Link: https://arxiv.org/abs/2508.01669
Authors: Ziru Niu, Hai Dong, A.K. Qin
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract:Federated learning (FL) is a pioneering machine learning paradigm that enables distributed clients to process local data effectively while ensuring data privacy. However, the efficacy of FL is usually impeded by the data heterogeneity among clients, resulting in local models with low generalization performance. To address this problem, traditional model-homogeneous approaches mainly involve debiasing the local training procedures with regularization or dynamically adjusting client weights in aggregation. Nonetheless, these approaches are incompatible with scenarios where clients exhibit heterogeneous model architectures. In this paper, we propose a model-heterogeneous FL framework that can improve clients’ generalization performance over unseen data without model aggregation. Instead of model parameters, clients exchange the feature distributions with the server, including the mean and the covariance. Accordingly, clients train a variational transposed convolutional (VTC) neural network with Gaussian latent variables sampled from the feature distributions, and use the VTC model to generate synthetic data. By fine-tuning local models with the synthetic data, clients significantly increase their generalization performance. Experimental results show that our approach obtains higher generalization accuracy than existing model-heterogeneous FL frameworks, as well as lower communication costs and memory consumption.
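The statistics exchange can be pictured with a small NumPy sketch: a client summarizes its features by mean and covariance, and Gaussian latent vectors are then drawn from that summary. The helper names and the random stand-in features are assumptions, and the VTC decoder that would consume the latents is not shown.

```python
import numpy as np

def feature_statistics(features):
    """Per-client summary: mean vector and covariance matrix of extracted features."""
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    return mean, cov

def sample_latents(mean, cov, n, seed=0):
    """Draw Gaussian latent vectors from the shared statistics."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n)

rng = np.random.default_rng(1)
client_features = rng.normal(size=(200, 4))   # stand-in for a local encoder's outputs
mean, cov = feature_statistics(client_features)
latents = sample_latents(mean, cov, n=8)
print(mean.shape, cov.shape, latents.shape)
```

Only `mean` and `cov` would leave the client in this picture, which is far smaller than a full set of model parameters.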
[LG-69] Privacy-Preserving Inference for Quantized BERT Models
Link: https://arxiv.org/abs/2508.01636
Authors: Tianpei Lu, Bingsheng Zhang, Lekun Peng, Bowen Zheng, Lichun Li, Kui Ren
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract:With the increasing deployment of generative machine learning models in privacy-sensitive domains such as healthcare and personalized services, ensuring secure inference has become a critical challenge. Secure multi-party computation (MPC) enables privacy-preserving model inference but suffers from high communication and computation overhead. The main bottleneck lies in the expensive secure evaluation of floating-point operations. Quantization offers a promising solution by converting floating-point operations into lower-precision integer computations, significantly reducing overhead. However, existing MPC-based quantized inference methods either rely on public quantization parameters, which poses privacy risks, or suffer from inefficiencies, particularly in handling nonlinear functions such as activations and softmax. In this work, we propose a fine-grained, layer-wise quantization scheme and support 1-bit weight fully connected layers in a secure setting. We design a multi-input lookup table protocol to evaluate softmax efficiently and securely. Furthermore, we use dual secret sharing schemes and perform precision conversions via lookup tables, eliminating truncation overhead entirely. Experimental evaluation on BERT-base models demonstrates that our approach achieves up to 8× speedup compared to Lu et al. (NDSS 25), 9× speedup compared to Gupta et al. (PETS 24), and 22× speedup compared to Knott et al. (NeurIPS 21).
[LG-70] VFP: Variational Flow-Matching Policy for Multi-Modal Robot Manipulation
Link: https://arxiv.org/abs/2508.01622
Authors: Xuanran Zhai, Ce Hao
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Abstract:Flow-matching-based policies have recently emerged as a promising approach for learning-based robot manipulation, offering significant acceleration in action sampling compared to diffusion-based policies. However, conventional flow-matching methods struggle with multi-modality, often collapsing to averaged or ambiguous behaviors in complex manipulation tasks. To address this, we propose the Variational Flow-Matching Policy (VFP), which introduces a variational latent prior for mode-aware action generation and effectively captures both task-level and trajectory-level multi-modality. VFP further incorporates Kantorovich Optimal Transport (K-OT) for distribution-level alignment and utilizes a Mixture-of-Experts (MoE) decoder for mode specialization and efficient inference. We comprehensively evaluate VFP on 41 tasks across four benchmark environments, demonstrating its effectiveness and sampling efficiency in both task and path multi-modality settings. Results show that VFP achieves a 49% relative improvement in task success rate over standard flow-based baselines, while maintaining fast inference and compact model size. More details are available on our project page.
[LG-71] Enhancing Math Reasoning in Small-sized LLMs via Preview Difficulty-Aware Intervention ICML
Link: https://arxiv.org/abs/2508.01604
Authors: Xinhan Di, JoyJiaoW
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 1 table, accepted by SIM ICML@2025 Workshop
Abstract:Reinforcement learning scaling enhances the reasoning capabilities of large language models, with reinforcement learning serving as the key technique to draw out complex reasoning. However, key technical details of state-of-the-art reasoning LLMs, such as those in the OpenAI O series, Claude 3 series, DeepMind’s Gemini 2.5 series, and Grok 3 series, remain undisclosed, making it difficult for the research community to replicate their reinforcement learning training results. Therefore, we start our study from an Early Preview Reinforcement Learning (EPRLI) algorithm built on the open-source GRPO framework, incorporating difficulty-aware intervention for math problems. Applied to a 1.5B-parameter LLM, our method achieves 50.0% on AIME24, 89.2% on Math500, 77.1% on AMC, 35.3% on Minerva, and 51.9% on OBench, surpassing O1-Preview and performing comparably to O1-mini within standard school-lab settings.
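For context on the GRPO framework named above, its core step scores each sampled answer relative to the other answers drawn for the same prompt. The sketch below shows only that group normalization; the population standard deviation and the `eps` term are assumptions, and the paper's difficulty-aware intervention is not represented.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of sampled answers."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)  # population variance (assumed)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one math problem: two correct (1.0), two wrong (0.0).
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)
```

When every answer in a group receives the same reward, all advantages are zero, so such a prompt contributes no learning signal in this formulation.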
[LG-72] Why Heuristic Weighting Works: A Theoretical Analysis of Denoising Score Matching
Link: https://arxiv.org/abs/2508.01597
Authors: Juyan Zhang, Rhys Newbury, Xinyang Zhang, Tin Tran, Dana Kulic, Michael Burke
Categories: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Abstract:Score matching enables the estimation of the gradient of a data distribution, a key component in denoising diffusion models used to recover clean data from corrupted inputs. In prior work, a heuristic weighting function has been used for the denoising score matching loss without formal justification. In this work, we demonstrate that heteroskedasticity is an inherent property of the denoising score matching objective. This insight leads to a principled derivation of optimal weighting functions for generalized, arbitrary-order denoising score matching losses, without requiring assumptions about the noise distribution. Among these, the first-order formulation is especially relevant to diffusion models. We show that the widely used heuristic weighting function arises as a first-order Taylor approximation to the trace of the expected optimal weighting. We further provide theoretical and empirical comparisons, revealing that the heuristic weighting, despite its simplicity, can achieve lower variance than the optimal weighting with respect to parameter gradients, which can facilitate more stable and efficient training.
[LG-73] Dynamic Clustering for Personalized Federated Learning on Heterogeneous Edge Devices
Link: https://arxiv.org/abs/2508.01580
Authors: Heting Liu, Junzhe Huang, Fang He, Guohong Cao
Categories: Machine Learning (cs.LG)
Abstract:Federated Learning (FL) enables edge devices to collaboratively learn a global model, but it may not perform well when clients have high data heterogeneity. In this paper, we propose a dynamic clustering algorithm for personalized federated learning system (DC-PFL) to address the problem of data heterogeneity. DC-PFL starts with all clients training a global model and gradually groups the clients into smaller clusters for model personalization based on their data similarities. To address the challenge of estimating data heterogeneity without exposing raw data, we introduce a discrepancy metric called model discrepancy, which approximates data heterogeneity solely based on the model weights received by the server. We demonstrate that model discrepancy is strongly and positively correlated with data heterogeneity and can serve as a reliable indicator of data heterogeneity. To determine when and how to change grouping structures, we propose an algorithm based on the rapid decrease period of the training loss curve. Moreover, we propose a layer-wise aggregation mechanism that aggregates the low-discrepancy layers at a lower frequency to reduce the amount of transmitted data and communication costs. We conduct extensive experiments on various datasets to evaluate our proposed algorithm, and our results show that DC-PFL significantly reduces total training time and improves model accuracy compared to baselines.
[LG-74] KANMixer: Can KAN Serve as a New Modeling Core for Long-term Time Series Forecasting?
Link: https://arxiv.org/abs/2508.01575
Authors: Lingyu Jiang, Yuping Wang, Yao Su, Shuo Xing, Wenjing Chen, Xin Zhang, Zhengzhong Tu, Ziming Zhang, Fangzhou Lin, Michael Zielewski, Kazunori D Yamada
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 3 figures, 5 tables
Abstract:In recent years, multilayer perceptron (MLP)-based deep learning models have demonstrated remarkable success in long-term time series forecasting (LTSF). Existing approaches typically augment MLP backbones with hand-crafted external modules to address the inherent limitations of their flat architectures. Despite their success, these augmented methods neglect hierarchical locality and sequential inductive biases essential for time-series modeling, and recent studies indicate diminishing performance improvements. To overcome these limitations, we explore Kolmogorov-Arnold Networks (KAN), a recently proposed model featuring adaptive basis functions capable of granular, local modulation of nonlinearities. This raises a fundamental question: Can KAN serve as a new modeling core for LTSF? To answer this, we introduce KANMixer, a concise architecture integrating a multi-scale mixing backbone that fully leverages KAN’s adaptive capabilities. Extensive evaluation demonstrates that KANMixer achieves state-of-the-art performance in 16 out of 28 experiments across seven benchmark datasets. To uncover the reasons behind this strong performance, we systematically analyze the strengths and limitations of KANMixer in comparison with traditional MLP architectures. Our findings reveal that the adaptive flexibility of KAN’s learnable basis functions significantly transforms the influence of network structural prior on forecasting performance. Furthermore, we identify critical design factors affecting forecasting accuracy and offer practical insights for effectively utilizing KAN in LTSF. Together, these insights constitute the first empirically grounded guidelines for effectively leveraging KAN in LTSF. Code is available in the supplementary file.
[LG-75] Unsupervised Learning for the Elementary Shortest Path Problem
Link: https://arxiv.org/abs/2508.01557
Authors: Jingyi Chen, Xinyuan Zhang, Xinwu Qian
Categories: Machine Learning (cs.LG)
Abstract:The Elementary Shortest-Path Problem (ESPP) seeks a minimum-cost path from s to t that visits each vertex at most once. The presence of negative-cost cycles renders the problem NP-hard. We present a probabilistic method for finding near-optimal ESPP solutions, enabled by an unsupervised graph neural network that jointly learns node value estimates and edge-selection probabilities via a surrogate loss function. The loss provides a high-probability certificate of finding near-optimal ESPP solutions by simultaneously reducing negative-cost cycles and embedding the desired algorithmic alignment. At inference time, a decoding algorithm transforms the learned edge probabilities into an elementary path. Experiments on graphs of up to 100 nodes show that the proposed method surpasses both unsupervised baselines and classical heuristics, while exhibiting high performance in cross-size and cross-topology generalization on unseen synthetic graphs.
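The decoding step can be illustrated with a plain greedy walk that never revisits a vertex. The abstract does not describe the paper's decoder, so this greedy rule, the `decode_elementary_path` name, and the toy probabilities are assumptions used only to show what turning edge probabilities into an elementary path means.

```python
def decode_elementary_path(edge_prob, source, target):
    """Greedy decoding: follow the most probable outgoing edge to an unvisited vertex.

    edge_prob maps (u, v) -> probability. Returns a path that visits each vertex
    at most once, or None if the walk gets stuck before reaching target.
    """
    path, visited = [source], {source}
    current = source
    while current != target:
        candidates = [(p, v) for (u, v), p in edge_prob.items()
                      if u == current and v not in visited]
        if not candidates:
            return None
        _, current = max(candidates)
        path.append(current)
        visited.add(current)
    return path

probs = {("s", "a"): 0.9, ("s", "b"): 0.1, ("a", "s"): 0.8,
         ("a", "b"): 0.2, ("b", "t"): 1.0, ("a", "t"): 0.1}
path = decode_elementary_path(probs, "s", "t")
print(path)
```

The visited set is what enforces elementarity: the high-probability edge from `a` back to `s` is skipped because `s` is already on the path.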
[LG-76] FluidFormer: Transformer with Continuous Convolution for Particle-based Fluid Simulation
Link: https://arxiv.org/abs/2508.01537
Authors: Nianyi Wang, Yu Chen, Shuai Zheng
Categories: Computational Engineering, Finance, and Science (cs.CE); Graphics (cs.GR); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Abstract:Learning-based fluid simulation networks have proven to be viable alternatives to traditional numerical solvers for the Navier-Stokes equations. Existing neural methods follow Smoothed Particle Hydrodynamics (SPH) frameworks, which inherently rely only on local inter-particle interactions. However, we emphasize that global context integration is also essential for learning-based methods to stabilize complex fluid simulations. We propose the first Fluid Attention Block (FAB) with a local-global hierarchy, where continuous convolutions extract local features while self-attention captures global dependencies. This fusion suppresses error accumulation and models long-range physical phenomena. Furthermore, we pioneer the first Transformer architecture specifically designed for continuous fluid simulation, seamlessly integrated within a dual-pipeline architecture. Our method establishes a new paradigm for neural fluid simulation by unifying convolution-based local features with attention-based global context modeling. FluidFormer demonstrates state-of-the-art performance, with stronger stability in complex fluid scenarios.
[LG-77] Prototype Learning to Create Refined Interpretable Digital Phenotypes from ECGs
Link: https://arxiv.org/abs/2508.01521
Authors: Sahil Sethi, David Chen, Michael C. Burkhart, Nipun Bhandari, Bashar Ramadan, Brett Beaulieu-Jones
Categories: Machine Learning (cs.LG)
Comments: Preprint; under review
Abstract:Prototype-based neural networks offer interpretable predictions by comparing inputs to learned, representative signal patterns anchored in training data. While such models have shown promise in the classification of physiological data, it remains unclear whether their prototypes capture an underlying structure that aligns with broader clinical phenotypes. We use a prototype-based deep learning model trained for multi-label ECG classification using the PTB-XL dataset. Then, without modification, we perform inference on the MIMIC-IV clinical database. We assess whether individual prototypes, trained solely for classification, are associated with hospital discharge diagnoses in the form of phecodes in this external population. Individual prototypes demonstrate significantly stronger and more specific associations with clinical outcomes compared to the classifier’s class predictions, NLP-extracted concepts, or broader prototype classes across all phecode categories. Prototype classes with mixed significance patterns exhibit significantly greater intra-class distances (p < 0.0001), indicating the model learned to differentiate clinically meaningful variations within diagnostic categories. The prototypes achieve strong predictive performance across diverse conditions, with AUCs ranging from 0.89 for atrial fibrillation to 0.91 for heart failure, while also showing substantial signal for non-cardiac conditions such as sepsis and renal disease. These findings suggest that prototype-based models can support interpretable digital phenotyping from physiologic time-series data, providing transferable intermediate phenotypes that capture clinically meaningful physiologic signatures beyond their original training objectives.
[LG-78] SimDeep: Federated 3D Indoor Localization via Similarity-Aware Aggregation
Link: https://arxiv.org/abs/2508.01515
Authors: Ahmed Jaheen, Sarah Elsamanody, Hamada Rizk, Moustafa Youssef
Categories: Machine Learning (cs.LG)
Comments: Accepted for ICMU 2025 – The 15th International Conference on Mobile Computing and Ubiquitous Networking, Busan, Korea, September 10–12, 2025. Nominated for Best Paper Award
Abstract:Indoor localization plays a pivotal role in supporting a wide array of location-based services, including navigation, security, and context-aware computing within intricate indoor environments. Despite considerable advancements, deploying indoor localization systems in real-world scenarios remains challenging, largely because of non-independent and identically distributed (non-IID) data and device heterogeneity. In response, we propose SimDeep, a novel Federated Learning (FL) framework explicitly crafted to overcome these obstacles and effectively manage device heterogeneity. SimDeep incorporates a Similarity Aggregation Strategy, which aggregates client model updates based on data similarity, significantly alleviating the issues posed by non-IID data. Our experimental evaluations indicate that SimDeep achieves an impressive accuracy of 92.89%, surpassing traditional federated and centralized techniques, thus underscoring its viability for real-world deployment.
[LG-79] End-to-End Personalization: Unifying Recommender Systems with Large Language Models KDD2025
Link: https://arxiv.org/abs/2508.01514
Authors: Danial Ebrat, Tina Aminian, Sepideh Ahmadian, Luis Rueda
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Second Workshop on Generative AI for Recommender Systems and Personalization at the ACM Conference on Knowledge Discovery and Data Mining (GenAIRecP@KDD 2025)
Abstract:Recommender systems are essential for guiding users through the vast and diverse landscape of digital content by delivering personalized and relevant suggestions. However, improving both personalization and interpretability remains a challenge, particularly in scenarios involving limited user feedback or heterogeneous item attributes. In this article, we propose a novel hybrid recommendation framework that combines Graph Attention Networks (GATs) with Large Language Models (LLMs) to address these limitations. LLMs are first used to enrich user and item representations by generating semantically meaningful profiles based on metadata such as titles, genres, and overviews. These enriched embeddings serve as initial node features in a user-movie bipartite graph, which is processed using a GAT-based collaborative filtering model. To enhance ranking accuracy, we introduce a hybrid loss function that combines Bayesian Personalized Ranking (BPR), cosine similarity, and robust negative sampling. Post-processing involves reranking the GAT-generated recommendations using the LLM, which also generates natural-language justifications to improve transparency. We evaluated our model on benchmark datasets, including MovieLens 100k and 1M, where it consistently outperforms strong baselines. Ablation studies confirm that LLM-based embeddings and the cosine similarity term significantly contribute to performance gains. This work demonstrates the potential of integrating LLMs to improve both the accuracy and interpretability of recommender systems.
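A hybrid objective of the kind mentioned in the abstract might look like the sketch below, which adds a cosine-alignment term to the standard BPR loss. Only the BPR formula itself is standard; how the paper combines the terms is not stated, so the `weight` value, the dot-product scoring, and the `hybrid_loss` name are assumptions.

```python
import math

def bpr_loss(pos_score, neg_score):
    """Bayesian Personalized Ranking: -log(sigmoid(pos_score - neg_score))."""
    diff = pos_score - neg_score
    # log1p(exp(-diff)) equals -log(sigmoid(diff)); adequate for moderate |diff|.
    return math.log1p(math.exp(-diff))

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def hybrid_loss(user, pos_item, neg_item, weight=0.1):
    """BPR on dot-product scores plus a term pulling the user toward the positive item."""
    def score(item):
        return sum(a * b for a, b in zip(user, item))
    ranking = bpr_loss(score(pos_item), score(neg_item))
    alignment = 1.0 - cosine_similarity(user, pos_item)
    return ranking + weight * alignment

print(hybrid_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
```

The ranking term shrinks as the positive item outscores the sampled negative, while the alignment term vanishes once the user and positive-item embeddings point the same way.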
[LG-80] Canoe Paddling Quality Assessment Using Smart Devices: Preliminary Machine Learning Study
Link: https://arxiv.org/abs/2508.01511
Authors: S. Parab, A. Lamelas, A. Hassan, P. Bhote
Categories: Machine Learning (cs.LG)
Comments: 30 pages, 16 figures, 4 tables
Abstract:Over 22 million Americans participate in paddling-related activities annually, contributing to a global paddlesports market valued at 2.4 billion US dollars in 2020. Despite its popularity, the sport has seen limited integration of machine learning (ML) and remains hindered by the cost of coaching and specialized equipment. This study presents a novel AI-based coaching system that uses ML models trained on motion data and delivers stroke feedback via a large language model (LLM). Participants were recruited through a collaboration with the NYU Concrete Canoe Team. Motion data were collected across two sessions, one with suboptimal form and one with corrected technique, using Apple Watches and smartphones secured in sport straps. The data underwent stroke segmentation and feature extraction. ML models, including Support Vector Classifier, Random Forest, Gradient Boosting, and Extremely Randomized Trees, were trained on both raw and engineered features. A web-based interface was developed to visualize stroke quality and deliver LLM-based feedback. Across four participants, eight trials yielded 66 stroke samples. The Extremely Randomized Trees model achieved the highest performance, with an F score of 0.9496 under five-fold cross-validation. The web interface successfully provided both quantitative metrics and qualitative feedback. Sensor placement near the wrists improved data quality. Preliminary results indicate that smartwatches and smartphones can enable low-cost, accessible alternatives to traditional paddling instruction. While limited by sample size, the study demonstrates the feasibility of using consumer devices and ML to support stroke refinement and technique improvement.
[LG-81] A Reward-Directed Diffusion Framework for Generative Design Optimization
Link: https://arxiv.org/abs/2508.01509
Authors: Hadi Keramati, Patrick Kirchen, Mohammed Hannan, Rajeev K. Jaiman
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Abstract:This study presents a generative optimization framework that builds on a fine-tuned diffusion model and reward-directed sampling to generate high-performance engineering designs. The framework adopts a parametric representation of the design geometry and produces new parameter sets corresponding to designs with enhanced performance metrics. A key advantage of the reward-directed approach is its suitability for scenarios in which performance metrics rely on costly engineering simulations, or in which surrogate models (e.g., graph-based, ensemble, or tree-based models) are non-differentiable or prohibitively expensive to differentiate. This work introduces the iterative use of a soft value function within a Markov decision process framework to achieve reward-guided decoding in the diffusion model. By incorporating soft-value guidance during both the training and inference phases, the proposed approach reduces the computational and memory costs of achieving high-reward designs, even beyond the training data. Empirical results indicate that this iterative reward-directed method substantially improves the ability of the diffusion models to generate samples with reduced resistance in 3D ship hull design and enhanced hydrodynamic performance in 2D airfoil design tasks. The proposed framework generates samples that extend beyond the training data distribution, resulting in a greater than 25 percent reduction in resistance for ship design and over 10 percent improvement in the lift-to-drag ratio for the 2D airfoil design. Successful integration of this model into the engineering design life cycle can enhance both designer productivity and overall design performance.
[LG-82] Frequency-Constrained Learning for Long-Term Forecasting
Link: https://arxiv.org/abs/2508.01508
Authors: Menglin Kong, Vincent Zhihao Zheng, Lijun Sun
Categories: Machine Learning (cs.LG)
Abstract:Many real-world time series exhibit strong periodic structures arising from physical laws, human routines, or seasonal cycles. However, modern deep forecasting models often fail to capture these recurring patterns due to spectral bias and a lack of frequency-aware inductive priors. Motivated by this gap, we propose a simple yet effective method that enhances long-term forecasting by explicitly modeling periodicity through spectral initialization and frequency-constrained optimization. Specifically, we extract dominant low-frequency components via Fast Fourier Transform (FFT)-guided coordinate descent, initialize sinusoidal embeddings with these components, and employ a two-speed learning schedule to preserve meaningful frequency structure during training. Our approach is model-agnostic and integrates seamlessly into existing Transformer-based architectures. Extensive experiments across diverse real-world benchmarks demonstrate consistent performance gains, particularly at long horizons, highlighting the benefits of injecting spectral priors into deep temporal models for robust and interpretable long-range forecasting. Moreover, on synthetic data, our method accurately recovers ground-truth frequencies, further validating its interpretability and effectiveness in capturing latent periodic patterns.
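The frequency-extraction idea can be illustrated with simple FFT peak picking in NumPy. This covers only the FFT part: the coordinate-descent refinement mentioned in the abstract is omitted, and the function name and the synthetic two-period signal are assumptions.

```python
import numpy as np

def dominant_frequencies(series, k=2, sample_rate=1.0):
    """Return the k frequencies (cycles per unit time) with the largest FFT magnitude."""
    centered = series - series.mean()            # drop the constant (DC) component
    spectrum = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(len(series), d=1.0 / sample_rate)
    top = np.argsort(spectrum)[::-1][:k]
    return np.sort(freqs[top])

# Synthetic hourly signal with a 24-step and a 12-step cycle.
t = np.arange(240)
series = 2.0 * np.sin(2 * np.pi * t / 24) + 0.5 * np.sin(2 * np.pi * t / 12)
found = dominant_frequencies(series, k=2)
print(found, 1.0 / found)
```

The recovered frequencies could then seed sinusoidal embeddings, which is the role spectral initialization plays in the abstract.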
[LG-83] ESM: A Framework for Building Effective Surrogate Models for Hardware-Aware Neural Architecture Search
Link: https://arxiv.org/abs/2508.01505
Authors: Azaz-Ur-Rehman Nasir, Samroz Ahmad Shoaib, Muhammad Abdullah Hanif, Muhammad Shafique
Categories: Machine Learning (cs.LG)
Abstract:Hardware-aware Neural Architecture Search (NAS) is one of the most promising techniques for designing efficient Deep Neural Networks (DNNs) for resource-constrained devices. Surrogate models play a crucial role in hardware-aware NAS as they enable efficient prediction of performance characteristics (e.g., inference latency and energy consumption) of different candidate models on the target hardware device. In this paper, we focus on building hardware-aware latency prediction models. We study different types of surrogate models and highlight their strengths and weaknesses. We perform a systematic analysis to understand the impact of different factors that can influence the prediction accuracy of these models, aiming to assess the importance of each stage involved in the model designing process and identify methods and policies necessary for designing/training an effective estimation model, specifically for GPU-powered devices. Based on the insights gained from the analysis, we present a holistic framework that enables reliable dataset generation and efficient model generation, considering the overall costs of different stages of the model generation pipeline.
[LG-84] Instruction-based Time Series Editing
Link: https://arxiv.org/abs/2508.01504
Authors: Jiaxing Qiu, Dongliang Guo, Brynne Sullivan, Teague R. Henry, Tom Hartvigsen
Categories: Machine Learning (cs.LG)
Abstract:In time series editing, we aim to modify some properties of a given time series without altering others. For example, when analyzing a hospital patient’s blood pressure, we may add a sudden early drop and observe how it impacts their future while preserving other conditions. Existing diffusion-based editors rely on rigid, predefined attribute vectors as conditions and produce all-or-nothing edits through sampling. This attribute- and sampling-based approach limits flexibility in condition format and lacks customizable control over editing strength. To overcome these limitations, we introduce Instruction-based Time Series Editing, where users specify intended edits using natural language. This allows users to express a wider range of edits in a more accessible format. We then introduce InstructTime, the first instruction-based time series editor. InstructTime takes in time series and instructions, embeds them into a shared multi-modal representation space, then decodes their embeddings to generate edited time series. By learning a structured multi-modal representation space, we can easily interpolate between embeddings to achieve varying degrees of edit. To handle local and global edits together, we propose multi-resolution encoders. In our experiments, we use synthetic and real datasets and find that InstructTime is a state-of-the-art time series editor: InstructTime achieves high-quality edits with controllable strength, can generalize to unseen instructions, and can be easily adapted to unseen conditions through few-shot learning.
[LG-85] Hyperparameter-Free Neurochaos Learning Algorithm for Classification
Link: https://arxiv.org/abs/2508.01478
Authors: Akhila Henry, Nithin Nagaraj
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Abstract:Neurochaos Learning (NL) is a brain-inspired classification framework that employs chaotic dynamics to extract features from input data and yields state-of-the-art performance on classification tasks. However, NL requires the tuning of multiple hyperparameters and the computation of four chaotic features per input sample. In this paper, we propose AutochaosNet, a novel, hyperparameter-free variant of the NL algorithm that eliminates the need for both training and parameter optimization. AutochaosNet leverages a universal chaotic sequence derived from the Champernowne constant and uses the input stimulus to define firing time bounds for feature extraction. Two simplified variants, TM AutochaosNet and TM-FR AutochaosNet, are evaluated against the existing NL architecture, ChaosNet. Our results demonstrate that AutochaosNet achieves competitive or superior classification performance while significantly reducing training time due to reduced computational effort. In addition to eliminating training and hyperparameter tuning, AutochaosNet exhibits excellent generalisation capabilities, making it a scalable and efficient choice for real-world classification tasks. Future work will focus on identifying universal orbits under various chaotic maps and incorporating them into the NL framework to further enhance performance.
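The Champernowne constant is obtained by concatenating the positive integers after the decimal point (0.123456789101112...), so its digits are easy to generate. The sketch below only produces that digit stream; how AutochaosNet converts it into a chaotic sequence or firing-time bounds is not specified in the abstract and is not attempted here.

```python
from itertools import count, islice

def champernowne_digits():
    """Yield the decimal digits of the Champernowne constant 0.123456789101112..."""
    for n in count(1):
        for ch in str(n):
            yield int(ch)

first = list(islice(champernowne_digits(), 15))
print(first)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 0, 1, 1, 1, 2]
```

Because the sequence is fixed and data-independent, nothing about it has to be tuned or learned.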
[LG-86] HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens
链接: https://arxiv.org/abs/2508.01474
作者: Ivan Karpukhin,Andrey Savchenko
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep learning has achieved remarkable success in modeling sequential data, including event sequences, temporal point processes, and irregular time series. Recently, transformers have largely replaced recurrent networks in these tasks. However, transformers often underperform RNNs in classification tasks where the objective is to predict future targets. The reason behind this performance gap remains largely unexplored. In this paper, we identify a key limitation of transformers: the absence of a single state vector that provides a compact and effective representation of the entire sequence. Additionally, we show that contrastive pretraining of embedding vectors fails to capture local context, which is crucial for accurate prediction. To address these challenges, we introduce history tokens, a novel concept that facilitates the accumulation of historical information during next-token prediction pretraining. Our approach significantly improves transformer-based models, achieving impressive results in finance, e-commerce, and healthcare tasks. The code is publicly available on GitHub.
[LG-87] Regression Augmentation With Data-Driven Segmentation
链接: https://arxiv.org/abs/2508.01455
作者: Shayan Alahyari,Shiva Mehdipour Ghobadlou,Mike Domaratzki
类目: Machine Learning (cs.LG)
*备注:
Abstract:Imbalanced regression arises when the target distribution is skewed, causing models to focus on dense regions and struggle with underrepresented (minority) samples. Despite its relevance across many applications, few methods have been designed specifically for this challenge. Existing approaches often rely on fixed, ad hoc thresholds to label samples as rare or common, overlooking the continuous complexity of the joint feature-target space and failing to represent the true underlying rare regions. To address these limitations, we propose a fully data-driven GAN-based augmentation framework that uses Mahalanobis-Gaussian Mixture Modeling (GMM) to automatically identify minority samples and employs deterministic nearest-neighbour matching to enrich sparse regions. Rather than preset thresholds, our method lets the data determine which observations are truly rare. Evaluation on 32 benchmark imbalanced regression datasets demonstrates that our approach consistently outperforms state-of-the-art data augmentation methods.
[LG-88] Kernel-Based Sparse Additive Nonlinear Model Structure Detection through a Linearization Approach
链接: https://arxiv.org/abs/2508.01453
作者: Sadegh Ebrahimkhani,John Lataire
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The choice of parameterization in Nonlinear (NL) system models greatly affects the quality of the estimated model. Overly complex models can be impractical and hard to interpret, necessitating data-driven methods for simpler and more accurate representations. In this paper, we propose a data-driven approach to simplify a class of continuous-time NL system models using linear approximations around varying operating points. Specifically, for sparse additive NL models, our method identifies the number of NL subterms and their corresponding input spaces. Under small-signal operation, we approximate the unknown NL system as a trajectory-scheduled Linear Parameter-Varying (LPV) system, with LPV coefficients representing the gradient of the NL function and indicating input sensitivity. Using this sensitivity measure, we determine the NL system’s structure through LPV model reduction by identifying non-zero LPV coefficients and selecting scheduling parameters. We introduce two sparse estimators within a vector-valued Reproducing Kernel Hilbert Space (RKHS) framework to estimate the LPV coefficients while preserving their structural relationships. The structure of the sparse additive NL model is then determined by detecting non-zero elements in the gradient vector (LPV coefficients) and the Hessian matrix (Jacobian of the LPV coefficients). We propose two computationally tractable RKHS-based estimators for this purpose. The sparsified Hessian matrix reveals the NL model’s structure, with numerical simulations confirming the approach’s effectiveness.
[LG-89] UniExtreme: A Universal Foundation Model for Extreme Weather Forecasting KDD2026
链接: https://arxiv.org/abs/2508.01426
作者: Hang Ni,Weijia Zhang,Hao Liu
类目: Machine Learning (cs.LG)
*备注: 35 pages, 80 figures, submitted to ACM KDD 2026 conference
Abstract:Recent advancements in deep learning have led to the development of Foundation Models (FMs) for weather forecasting, yet their ability to predict extreme weather events remains limited. Existing approaches either focus on general weather conditions or specialize in specific-type extremes, neglecting the real-world atmospheric patterns of diversified extreme events. In this work, we identify two key characteristics of extreme events: (1) the spectral disparity against normal weather regimes, and (2) the hierarchical drivers and geographic blending of diverse extremes. Along this line, we propose UniExtreme, a universal extreme weather forecasting foundation model that integrates (1) an Adaptive Frequency Modulation (AFM) module that captures region-wise spectral differences between normal and extreme weather, through learnable Beta-distribution filters and multi-granularity spectral aggregation, and (2) an Event Prior Augmentation (EPA) module which incorporates region-specific extreme event priors to resolve hierarchical extreme diversity and composite extreme schema, via a dual-level memory fusion network. Extensive experiments demonstrate that UniExtreme outperforms state-of-the-art baselines in both extreme and general weather forecasting, showcasing superior adaptability across diverse extreme scenarios.
[LG-90] Cryptocurrency Price Forecasting Using Machine Learning: Building Intelligent Financial Prediction Models
链接: https://arxiv.org/abs/2508.01419
作者: Md Zahidul Islam,Md Shafiqur Rahman,Md Sumsuzoha,Babul Sarker,Md Rafiqul Islam,Mahfuz Alam,Sanjib Kumar Shil
类目: Machine Learning (cs.LG)
*备注:
Abstract:Cryptocurrency markets are experiencing rapid growth, but this expansion comes with significant challenges, particularly in predicting cryptocurrency prices for traders in the U.S. In this study, we explore how deep learning and machine learning models can be used to forecast the closing prices of the XRP/USDT trading pair. While many existing cryptocurrency prediction models focus solely on price and volume patterns, they often overlook market liquidity, a crucial factor in price predictability. To address this, we introduce two important liquidity proxy metrics: the Volume-To-Volatility Ratio (VVR) and the Volume-Weighted Average Price (VWAP). These metrics provide a clearer understanding of market stability and liquidity, ultimately enhancing the accuracy of our price predictions. We developed four machine learning models, Linear Regression, Random Forest, XGBoost, and LSTM neural networks, using historical data without incorporating the liquidity proxy metrics, and evaluated their performance. We then retrained the models, including the liquidity proxy metrics, and reassessed their performance. In both cases (with and without the liquidity proxies), the LSTM model consistently outperformed the others. These results underscore the importance of considering market liquidity when predicting cryptocurrency closing prices. Therefore, incorporating these liquidity metrics is essential for more accurate forecasting models. Our findings offer valuable insights for traders and developers seeking to create smarter and more risk-aware strategies in the U.S. digital assets market.
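The two liquidity proxies named above can be sketched in a few lines. VWAP has a standard definition (total traded value divided by total volume). The abstract does not give the formula for the Volume-To-Volatility Ratio, so the version below, mean volume divided by the standard deviation of simple returns over the same window, is an assumption, and both function names are hypothetical.

```python
from statistics import pstdev


def vwap(prices, volumes):
    """Volume-Weighted Average Price over a window."""
    total_volume = sum(volumes)
    if total_volume == 0:
        raise ValueError("total volume is zero")
    return sum(p * v for p, v in zip(prices, volumes)) / total_volume


def volume_to_volatility(prices, volumes):
    """Assumed VVR: mean volume divided by the std of simple returns."""
    returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
    volatility = pstdev(returns)
    if volatility == 0:
        return float("inf")  # a flat price series has no volatility
    return (sum(volumes) / len(volumes)) / volatility


print(vwap([10.0, 20.0], [1.0, 3.0]))  # 17.5
```

In a forecasting pipeline these values would typically be computed over a rolling window and appended as extra feature columns before retraining the models.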
[LG-91] MoRe-ERL: Learning Motion Residuals using Episodic Reinforcement Learning
链接: https://arxiv.org/abs/2508.01409
作者: Xi Huang,Hongyi Zhou,Ge Li,Yucheng Tang,Weiran Liao,Björn Hein,Tamim Asfour,Rudolf Lioutikov
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:We propose MoRe-ERL, a framework that combines Episodic Reinforcement Learning (ERL) and residual learning, which refines preplanned reference trajectories into safe, feasible, and efficient task-specific trajectories. This framework is general enough to incorporate into arbitrary ERL methods and motion generators seamlessly. MoRe-ERL identifies trajectory segments requiring modification while preserving critical task-related maneuvers. Then it generates smooth residual adjustments using B-Spline-based movement primitives to ensure adaptability to dynamic task contexts and smoothness in trajectory refinement. Experimental results demonstrate that residual learning significantly outperforms training from scratch using ERL methods, achieving superior sample efficiency and task performance. Hardware evaluations further validate the framework, showing that policies trained in simulation can be directly deployed in real-world systems, exhibiting a minimal sim-to-real gap.
[LG-92] CPformer – Concept and Physics enhanced Transformer for Time Series Forecasting
链接: https://arxiv.org/abs/2508.01407
作者: Hongwei Ma,Junbin Gao,Minh-Ngoc Tran
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate, explainable and physically-credible forecasting remains a persistent challenge for multivariate time-series whose statistical properties vary across domains. We present CPformer, a Concept- and Physics-enhanced Transformer that channels every prediction through five self-supervised, domain-agnostic concepts while enforcing differentiable residuals drawn from first-principle constraints. Unlike prior efficiency-oriented Transformers that rely purely on sparsity or frequency priors, CPformer combines latent transparency with hard scientific guidance while retaining attention for long contexts. We tested CPformer on six publicly-available datasets: sub-hourly Electricity and Traffic, hourly ETT, high-dimensional Weather, weekly Influenza-like Illness, and minute-level Exchange Rate, and CPformer achieves the lowest error in eight of twelve MSE/MAE cells. Relative to the strongest Transformer baseline (FEDformer), CPformer reduces mean-squared-error by 23% on Electricity, 44% on Traffic and 61% on Illness, while matching performance on strictly periodic Weather and ETT series.
[LG-93] Effects of Feature Correlations on Associative Memory Capacity ICLR2025
链接: https://arxiv.org/abs/2508.01395
作者: Stefan Bielmeier,Gerald Friedland
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at ICLR 2025 “New Frontiers in Associative Memories” Workshop. Code: this https URL
Abstract:We investigate how feature correlations influence the capacity of Dense Associative Memory (DAM), a Transformer attention-like model. Practical machine learning scenarios involve feature-correlated data and learn representations in the input space, but current capacity analyses do not account for this. We develop an empirical framework to analyze the effects of data structure on capacity dynamics. Specifically, we systematically construct datasets that vary in feature correlation and pattern separation using Hamming distance from information theory, and compute the model’s corresponding storage capacity using a simple binary search algorithm. Our experiments confirm that memory capacity scales exponentially with increasing separation in the input space. Feature correlations do not alter this relationship fundamentally, but reduce capacity slightly at constant separation. This effect is amplified at higher polynomial degrees in the energy function, suggesting that Associative Memory is more limited in depicting higher-order interactions between features than patterns. Our findings bridge theoretical work and practical settings for DAM, and might inspire more data-centric methods.
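A minimal sketch of the measurement loop described above, under stated assumptions: patterns are vectors of +1/-1, the energy uses the polynomial interaction F(x) = x^degree, a pattern counts as stored if one synchronous update leaves it unchanged, and stability is treated as monotone in the number of stored patterns so that binary search applies. The authors' actual dataset construction and stability criterion are not given in the abstract, and all function names here are hypothetical.

```python
import random


def hamming(a, b):
    """Number of positions where two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def dam_update(state, patterns, degree):
    """One synchronous update of a polynomial dense associative memory."""
    new_state = []
    for i, s_i in enumerate(state):
        diff = 0
        for p in patterns:
            # Overlap with pattern p, excluding the bit being updated.
            rest = sum(pj * sj for pj, sj in zip(p, state)) - p[i] * s_i
            diff += (rest + p[i]) ** degree - (rest - p[i]) ** degree
        new_state.append(1 if diff >= 0 else -1)
    return new_state


def all_stable(patterns, degree):
    """True if every stored pattern is a fixed point of the update."""
    return all(dam_update(p, patterns, degree) == p for p in patterns)


def capacity(pool, degree):
    """Binary search for the largest stable prefix of the pattern pool."""
    lo, hi = 0, len(pool)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if all_stable(pool[:mid], degree):
            lo = mid
        else:
            hi = mid - 1
    return lo


rng = random.Random(0)
pool = [[rng.choice((-1, 1)) for _ in range(16)] for _ in range(20)]
print(capacity(pool, degree=3))
```

To study separation, one would generate pools whose minimum pairwise `hamming` distance is controlled and repeat the capacity search for each setting.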
[LG-94] Quenched large deviations for Monte Carlo integration with Coulomb gases
链接: https://arxiv.org/abs/2508.01392
作者: Rémi Bardenet,Mylène Maïda,Martin Rouault
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注: 39 pages, 7 figures. Comments are welcome
Abstract:Gibbs measures, such as Coulomb gases, are popular in modelling systems of interacting particles. Recently, we proposed to use Gibbs measures as randomized numerical integration algorithms with respect to a target measure $\pi$ on $\mathbb{R}^d$, following the heuristics that repulsiveness between particles should help reduce integration errors. A major issue in this approach is to tune the interaction kernel and confining potential of the Gibbs measure, so that the equilibrium measure of the system is the target distribution $\pi$. Doing so usually requires another Monte Carlo approximation of the \emph{potential}, i.e. the integral of the interaction kernel with respect to $\pi$. Using the methodology of large deviations from Garcia–Zelada (2019), we show that a random approximation of the potential preserves the fast large deviation principle that guarantees the proposed integration algorithm to outperform independent or Markov quadratures. For non-singular interaction kernels, we make minimal assumptions on this random approximation, which can be the result of a computationally cheap Monte Carlo preprocessing. For the Coulomb interaction kernel, we need the approximation to be based on another Gibbs measure, and we prove in passing a control on the uniform convergence of the approximation of the potential.
[LG-95] Fusion Sampling Validation in Data Partitioning for Machine Learning
链接: https://arxiv.org/abs/2508.01325
作者: Christopher Godwin Udomboso,Caston Sigauke,Ini Adinya
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 23 pages, 10 figures
Abstract:Effective data partitioning is known to be crucial in machine learning. Traditional cross-validation methods like K-Fold Cross-Validation (KFCV) enhance model robustness but often compromise generalisation assessment due to high computational demands and extensive data shuffling. Simple Random Sampling (SRS) is a cheaper alternative, but although it usually provides representative samples, it can produce non-representative sets when the data are imbalanced. The study introduces a hybrid model, Fusion Sampling Validation (FSV), combining SRS and KFCV to optimise data partitioning. FSV aims to minimise biases and merge the simplicity of SRS with the accuracy of KFCV. The study used three datasets of 10,000, 50,000, and 100,000 samples, generated with a normal distribution (mean 0, variance 1) and initialised with seed 42. KFCV was performed with five folds and ten repetitions, incorporating a scaling factor to ensure robust performance estimation and generalisation capability. FSV integrated a weighted factor to enhance performance and generalisation further. Evaluations focused on mean estimates (ME), variance estimates (VE), mean squared error (MSE), bias, the rate of convergence for mean estimates (ROC_ME), and the rate of convergence for variance estimates (ROC_VE). Results indicated that FSV consistently outperformed SRS and KFCV, with ME values of 0.000863, VE of 0.949644, MSE of 0.952127, bias of 0.016288, ROC_ME of 0.005199, and ROC_VE of 0.007137. FSV demonstrated superior accuracy and reliability in data partitioning, particularly in resource-constrained environments and extensive datasets, providing practical solutions for effective machine learning implementations.
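The abstract does not state how FSV combines the two schemes beyond "a weighted factor", so the following is only one possible reading: estimate the mean once from a simple random sample and once as the average of K fold means, then blend the two with a weight. The function names, the `weight` parameter, and the blending rule are all assumptions made for illustration. The seed and the standard normal data follow the setup described above.

```python
import random
from statistics import fmean


def srs_mean(data, sample_size, rng):
    """Mean of a simple random sample drawn without replacement."""
    return fmean(rng.sample(data, sample_size))


def kfold_mean(data, k, rng):
    """Average of the K fold means after one shuffle."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    return fmean([fmean(fold) for fold in folds])


def fused_mean(data, sample_size, k, weight, seed=42):
    """Assumed fusion rule: weighted blend of the SRS and K-fold estimates."""
    rng = random.Random(seed)
    srs = srs_mean(data, sample_size, rng)
    kf = kfold_mean(data, k, rng)
    return weight * srs + (1 - weight) * kf


gen = random.Random(42)
data = [gen.gauss(0.0, 1.0) for _ in range(10_000)]
print(fused_mean(data, sample_size=1_000, k=5, weight=0.5))
```

The same pattern extends to variance estimates by swapping `fmean` for a variance function in both estimators.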
[LG-96] Physics-Informed Neural Network Approaches for Sparse Data Flow Reconstruction of Unsteady Flow Around Complex Geometries
链接: https://arxiv.org/abs/2508.01314
作者: Vamsi Sai Krishna Malineni,Suresh Rajendran
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:The utilization of Deep Neural Networks (DNNs) in physical science and engineering applications has gained traction due to their capacity to learn intricate functions. While large datasets are crucial for training DNN models in fields like computer vision and natural language processing, obtaining such datasets for engineering applications is prohibitively expensive. Physics-Informed Neural Networks (PINNs), a branch of Physics-Informed Machine Learning (PIML), tackle this challenge by embedding physical principles within neural network architectures. PINNs have been extensively explored for solving diverse forward and inverse problems in fluid mechanics. Nonetheless, there is limited research on employing PINNs for flow reconstruction from sparse data under constrained computational resources. Earlier studies were focused on forward problems with well-defined data. The present study attempts to develop models capable of reconstructing the flow field data from sparse datasets mirroring real-world scenarios. This study focuses on two cases: (a) two-dimensional (2D) unsteady laminar flow past a circular cylinder and (b) three-dimensional (3D) unsteady turbulent flow past an ultra-large container ship (ULCS). The first case compares the effectiveness of training methods like Standard PINN and Backward Compatible PINN (BC-PINN) and explores the performance enhancements through systematic relaxation of physics constraints and dynamic weighting of loss function components. The second case highlights the capability of PINN-based models to learn underlying physics from sparse data while accurately reconstructing the flow field for a highly turbulent flow.
[LG-97] GraphVSSM: Graph Variational State-Space Model for Probabilistic Spatiotemporal Inference of Dynamic Exposure and Vulnerability for Regional Disaster Resilience Assessment
链接: https://arxiv.org/abs/2508.01310
作者: Joshua Dimasaka,Christian Geiß,Emily So
类目: Machine Learning (cs.LG)
*备注: Non-peer-reviewed Preprint | Keywords: graph state-space model, building exposure, physical vulnerability, weak supervision, probabilistic model, disaster resilience, risk audit | Code: this https URL | Quezon City (Philippines) Dataset: this https URL | METEOR 2.5D Dataset, this https URL , this https URL | Khurushkul-Freetown Dataset: this https URL
Abstract:Regional disaster resilience quantifies the changing nature of physical risks to inform policy instruments ranging from local immediate recovery to international sustainable development. While many existing state-of-practice methods have greatly advanced the dynamic mapping of exposure and hazard, our understanding of large-scale physical vulnerability has remained static, costly, limited, region-specific, coarse-grained, overly aggregated, and inadequately calibrated. With the significant growth in the availability of time-series satellite imagery and derived products for exposure and hazard, we focus our work on the equally important yet challenging element of the risk equation: physical vulnerability. We leverage machine learning methods that flexibly capture spatial contextual relationships, limited temporal observations, and uncertainty in a unified probabilistic spatiotemporal inference framework. We therefore introduce Graph Variational State-Space Model (GraphVSSM), a novel modular spatiotemporal approach that uniquely integrates graph deep learning, state-space modeling, and variational inference using time-series data and prior expert belief systems in a weakly supervised or coarse-to-fine-grained manner. We present three major results: a city-wide demonstration in Quezon City, Philippines; an investigation of sudden changes in the cyclone-impacted coastal Khurushkul community (Bangladesh) and mudslide-affected Freetown (Sierra Leone); and an open geospatial dataset, METEOR 2.5D, that spatiotemporally enhances the existing global static dataset for UN Least Developed Countries (2020). Beyond advancing regional disaster resilience assessment and improving our understanding of global disaster risk reduction progress, our method also offers a probabilistic deep learning approach, contributing to broader urban studies that require compositional data analysis in weak supervision.
[LG-98] FedCD: A Fairness-aware Federated Cognitive Diagnosis Framework
链接: https://arxiv.org/abs/2508.01296
作者: Shangshang Yang,Jialin Han,Xiaoshan Yu,Ziwen Wang,Hao Jiang,Haiping Ma,Xingyi Zhang,Geyong Min
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 25 pages, 5 figures
Abstract:Online intelligent education platforms have generated a vast amount of distributed student learning data. This influx of data presents opportunities for cognitive diagnosis (CD) to assess students’ mastery of knowledge concepts while also raising significant data privacy and security challenges. To cope with this issue, federated learning (FL) becomes a promising solution by jointly training models across multiple local clients without sharing their original data. However, the data quality problem, caused by the ability differences and educational context differences between different groups/schools of students, further poses a challenge to the fairness of models. To address this challenge, this paper proposes a fairness-aware federated cognitive diagnosis framework (FedCD) to jointly train CD models built upon a novel parameter decoupling-based personalization strategy, preserving privacy of data and achieving precise and fair diagnosis of students on each client. As an FL paradigm, FedCD trains a local CD model for the students in each client based on its local student learning data, and each client uploads its partial model parameters to the central server for parameter aggregation according to the devised innovative personalization strategy. The main idea of this strategy is to decouple model parameters into two parts: the first is used as locally personalized parameters, containing diagnostic function-related model parameters, to diagnose each client’s students fairly; the second is the globally shared parameters across clients and the server, containing exercise embedding parameters, which are updated via fairness-aware aggregation, to alleviate inter-school unfairness. Experiments on three real-world datasets demonstrate the effectiveness of the proposed FedCD framework and the personalization strategy compared to five FL approaches under three CD models.
[LG-99] A graph neural network based on feature network for identifying influential nodes
链接: https://arxiv.org/abs/2508.01278
作者: Yanmei Hu,Siyuan Yin,Yihang Wu,Xue Yue,Yue Liu
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:
Abstract:Identifying influential nodes in complex networks is of great importance and has many practical applications. For example, finding influential nodes in an e-commerce network can point merchants to customers with strong purchase intent; identifying influential nodes in a computer information system can help locate the components that cause the system to break down; and identifying influential nodes in these networks can accelerate the flow of information. Thus, much effort has been devoted to the problem of identifying influential nodes. However, previous efforts either consider only one aspect of the network structure or use time-consuming global centralities as node features, and existing methods do not consider the relationships between different centralities. To solve these problems, we propose a Graph Convolutional Network Framework based on Feature Network, abbreviated as FNGCN (graph convolutional network is abbreviated as GCN in the following text). Further, to exclude noise and reduce redundancy, FNGCN utilizes a feature network to represent the complicated relationships among the local centralities, based on which the most suitable local centralities are determined. By taking a shallow GCN and a deep GCN into the FNGCN framework, two FNGCNs are developed. With ground truth obtained from the widely used Susceptible Infected Recovered (SIR) model, the two FNGCNs are compared with state-of-the-art methods on several real-world networks. Experimental results show that the two FNGCNs can identify the influential nodes more accurately than the compared methods, indicating that the proposed framework is effective in identifying influential nodes in complex networks.
[LG-100] Foundation Models for Bioacoustics – a Comparative Review
链接: https://arxiv.org/abs/2508.01277
作者: Raphael Schwinger,Paria Vali Zadeh,Lukas Rauch,Mats Kurz,Tom Hauschild,Sam Lapp,Sven Tomforde
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Quantitative Methods (q-bio.QM)
*备注: Preprint
Abstract:Automated bioacoustic analysis is essential for biodiversity monitoring and conservation, requiring advanced deep learning models that can adapt to diverse bioacoustic tasks. This article presents a comprehensive review of large-scale pretrained bioacoustic foundation models and systematically investigates their transferability across multiple bioacoustic classification tasks. We overview bioacoustic representation learning including major pretraining data sources and benchmarks. On this basis, we review bioacoustic foundation models by thoroughly analysing design decisions such as model architecture, pretraining scheme, and training paradigm. Additionally, we evaluate selected foundation models on classification tasks from the BEANS and BirdSet benchmarks, comparing the generalisability of learned representations under both linear and attentive probing strategies. Our comprehensive experimental analysis reveals that BirdMAE, trained on large-scale bird song data with a self-supervised objective, achieves the best performance on the BirdSet benchmark. On BEANS, BEATs_NLM, the extracted encoder of the NatureLM-audio large audio model, is slightly better. Both transformer-based models require attentive probing to extract the full performance of their representations. ConvNext_BS and Perch models trained with supervision on large-scale bird song data remain competitive for passive acoustic monitoring classification tasks of BirdSet in linear probing settings. Training a new linear classifier has clear advantages over evaluating these models without further training. On BEANS, however, the baseline model BEATs trained with self-supervision on AudioSet outperforms bird-specific models when evaluated with attentive probing. These findings provide valuable guidance for practitioners selecting appropriate models to adapt them to new bioacoustic classification tasks via probing.
[LG-101] Soft Separation and Distillation: Toward Global Uniformity in Federated Unsupervised Learning ICCV2025
链接: https://arxiv.org/abs/2508.01251
作者: Hung-Chieh Fang,Hsuan-Tien Lin,Irwin King,Yifei Zhang
类目: Machine Learning (cs.LG)
*备注: Published at ICCV 2025
Abstract:Federated Unsupervised Learning (FUL) aims to learn expressive representations in federated and self-supervised settings. The quality of representations learned in FUL is usually determined by uniformity, a measure of how uniformly representations are distributed in the embedding space. However, existing solutions perform well in achieving intra-client (local) uniformity for local models while failing to achieve inter-client (global) uniformity after aggregation due to non-IID data distributions and the decentralized nature of FUL. To address this issue, we propose Soft Separation and Distillation (SSD), a novel approach that preserves inter-client uniformity by encouraging client representations to spread toward different directions. This design reduces interference during client model aggregation, thereby improving global uniformity while preserving local representation expressiveness. We further enhance this effect by introducing a projector distillation module to address the discrepancy between loss optimization and representation quality. We evaluate SSD in both cross-silo and cross-device federated settings, demonstrating consistent improvements in representation quality and task performance across various training scenarios. Our results highlight the importance of inter-client uniformity in FUL and establish SSD as an effective solution to this challenge. Project page: this https URL
[LG-102] RelMap: Reliable Spatiotemporal Sensor Data Visualization via Imputative Spatial Interpolation IEEE-VIS2025
链接: https://arxiv.org/abs/2508.01240
作者: Juntong Chen,Huayuan Ye,He Zhu,Siwei Fu,Changbo Wang,Chenhui Li
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注: 9 pages, 14 figures, paper accepted to IEEE VIS 2025
Abstract:Accurate and reliable visualization of spatiotemporal sensor data such as environmental parameters and meteorological conditions is crucial for informed decision-making. Traditional spatial interpolation methods, however, often fall short of producing reliable interpolation results due to the limited and irregular sensor coverage. This paper introduces a novel spatial interpolation pipeline that achieves reliable interpolation results and produces a novel heatmap representation with uncertainty information encoded. We leverage imputation reference data from Graph Neural Networks (GNNs) to enhance visualization reliability and temporal resolution. By integrating Principal Neighborhood Aggregation (PNA) and Geographical Positional Encoding (GPE), our model effectively learns the spatiotemporal dependencies. Furthermore, we propose an extrinsic, static visualization technique for interpolation-based heatmaps that effectively communicates the uncertainties arising from various sources in the interpolated map. Through a set of use cases, extensive evaluations on real-world datasets, and user studies, we demonstrate our model’s superior performance for data imputation, the improvements to the interpolant with reference data, and the effectiveness of our visualization design in communicating uncertainties.
[LG-103] Multi-Operator Few-Shot Learning for Generalization Across PDE Families
链接: https://arxiv.org/abs/2508.01211
作者: Yile Li,Shandian Zhe
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning solution operators for partial differential equations (PDEs) has become a foundational task in scientific machine learning. However, existing neural operator methods require abundant training data for each specific PDE and lack the ability to generalize across PDE families. In this work, we propose MOFS: a unified multimodal framework for multi-operator few-shot learning, which aims to generalize to unseen PDE operators using only a few demonstration examples. Our method integrates three key components: (i) multi-task self-supervised pretraining of a shared Fourier Neural Operator (FNO) encoder to reconstruct masked spatial fields and predict frequency spectra, (ii) text-conditioned operator embeddings derived from statistical summaries of input-output fields, and (iii) memory-augmented multimodal prompting with gated fusion and cross-modal gradient-based attention. We adopt a two-stage training paradigm that first learns prompt-conditioned inference on seen operators and then applies end-to-end contrastive fine-tuning to align latent representations across vision, frequency, and text modalities. Experiments on PDE benchmarks, including Darcy Flow and Navier Stokes variants, demonstrate that our model outperforms existing operator learning baselines in few-shot generalization. Extensive ablations validate the contributions of each modality and training component. Our approach offers a new foundation for universal and data-efficient operator learning across scientific domains.
[LG-104] From Taylor Series to Fourier Synthesis: The Periodic Linear Unit
Link: https://arxiv.org/abs/2508.01175
Authors: Shiko Kudo
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA)
Comments: 15 pages, 5 figures, for associated raw example files and the code repository, see this https URL
Abstract:The dominant paradigm in modern neural networks relies on simple, monotonically-increasing activation functions like ReLU. While effective, this paradigm necessitates large, massively-parameterized models to approximate complex functions. In this paper, we introduce the Periodic Linear Unit (PLU), a learnable sine-wave based activation with periodic non-monotonicity. PLU is designed for maximum expressive power and numerical stability, achieved through its formulation and a paired innovation we term Repulsive Reparameterization, which prevents the activation from collapsing into a non-expressive linear function. We demonstrate that a minimal MLP with only two PLU neurons can solve the spiral classification task, a feat impossible for equivalent networks using standard activations. This suggests a paradigm shift from networks as piecewise Taylor-like approximators to powerful Fourier-like function synthesizers, achieving exponential gains in parameter efficiency by placing intelligence in the neuron itself.
[LG-105] MARS: A Meta-Adaptive Reinforcement Learning Framework for Risk-Aware Multi-Agent Portfolio Management
Link: https://arxiv.org/abs/2508.01173
Authors: Jiayi Chen,Jing Li,Guiling Wang
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:Reinforcement Learning (RL) has shown significant promise in automated portfolio management; however, effectively balancing risk and return remains a central challenge, as many models fail to adapt to dynamically changing market conditions. In this paper, we propose Meta-controlled Agents for a Risk-aware System (MARS), a novel RL framework designed to explicitly address this limitation through a multi-agent, risk-aware approach. Instead of a single monolithic model, MARS employs a Heterogeneous Agent Ensemble where each agent possesses a unique, intrinsic risk profile. This profile is enforced by a dedicated Safety-Critic network and a specific risk-tolerance threshold, allowing agents to specialize in behaviors ranging from capital preservation to aggressive growth. To navigate different market regimes, a high-level Meta-Adaptive Controller (MAC) learns to dynamically orchestrate the ensemble. By adjusting its reliance on conservative versus aggressive agents, the MAC effectively lowers portfolio volatility during downturns and seeks higher returns in bull markets, thus minimizing maximum drawdown and enhancing overall stability. This two-tiered structure allows MARS to generate a disciplined and adaptive portfolio that is robust to market fluctuations. The framework achieves a superior balance between risk and return by leveraging behavioral diversity rather than explicit market-feature engineering. Experiments on major international stock indexes, including periods of significant financial crisis, demonstrate the efficacy of our framework on risk-adjusted criteria, significantly reducing maximum drawdown and volatility while maintaining competitive returns.
[LG-106] T2S: Tokenized Skill Scaling for Lifelong Imitation Learning
Link: https://arxiv.org/abs/2508.01167
Authors: Hongquan Zhang,Jingyu Gong,Zhizhong Zhang,Xin Tan,Yanyun Qu,Yuan Xie
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:The main challenge in lifelong imitation learning lies in the balance between mitigating catastrophic forgetting of previous skills while maintaining sufficient capacity for acquiring new ones. However, current approaches typically address these aspects in isolation, overlooking their internal correlation in lifelong skill acquisition. We address this limitation with a unified framework named Tokenized Skill Scaling (T2S). Specifically, by tokenizing the model parameters, the linear parameter mapping of the traditional transformer is transformed into cross-attention between input and learnable tokens, thereby enhancing model scalability through the easy extension of new tokens. Additionally, we introduce language-guided skill scaling to transfer knowledge across tasks efficiently and avoid linearly growing parameters. Extensive experiments across diverse tasks demonstrate that T2S: 1) effectively prevents catastrophic forgetting (achieving an average NBT of 1.0% across the three LIBERO task suites), 2) excels in new skill scaling with minimal increases in trainable parameters (needing only 8.0% trainable tokens in an average of lifelong tasks), and 3) enables efficient knowledge transfer between tasks (achieving an average FWT of 77.7% across the three LIBERO task suites), offering a promising solution for lifelong imitation learning.
[LG-107] DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging
Link: https://arxiv.org/abs/2508.01148
Authors: Kotaro Yoshida,Yuji Naraki,Takafumi Horie,Ryotaro Shimizu,Hiroki Naganuma
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Model merging has emerged as an efficient and flexible paradigm for multi-task learning, with numerous methods being proposed in recent years. However, these state-of-the-art techniques are typically evaluated on benchmark suites that are highly favorable to model merging, and their robustness in more realistic settings remains largely unexplored. In this work, we first investigate the vulnerabilities of model-merging methods and pinpoint the source-model characteristics that critically underlie them. Specifically, we identify two factors that are particularly harmful to the merging process: (1) disparities in task vector norms, and (2) the low confidence of the source models. To address this issue, we propose DisTaC (Distillation for Task vector Conditioning), a novel method that pre-conditions these problematic task vectors before the merge. DisTaC leverages knowledge distillation to adjust a task vector’s norm and increase source-model confidence while preserving its essential task-specific knowledge. Our extensive experiments demonstrate that by pre-conditioning task vectors with DisTaC, state-of-the-art merging techniques can successfully integrate models exhibiting the harmful traits – where they would otherwise fail – achieving significant performance gains.
[LG-108] Transformers in Pseudo-Random Number Generation: A Dual Perspective on Theory and Practice
Link: https://arxiv.org/abs/2508.01134
Authors: Ran Li,Lingshu Zeng
Categories: Machine Learning (cs.LG)
Comments: 27 pages, 4 figures
Abstract:Pseudo-random number generators (PRNGs) are highly nonlinear processes, and they are key building blocks in the optimization of large language models. Transformers excel at processing complex nonlinear relationships. Thus it is reasonable to generate high-quality pseudo-random numbers based on Transformers. In this paper, we explore this question from both theoretical and practical perspectives, highlighting the potential benefits and implications of Transformers in PRNGs. We theoretically demonstrate that decoder-only Transformer models with Chain-of-Thought can simulate both the Linear Congruential Generator (LCG) and Mersenne Twister (MT) PRNGs. Based on this, we conclude that the log-precision decoder-only Transformer can represent non-uniform $\text{AC}^0$. Our theoretical simulation findings are validated through experiments. The random numbers generated by Transformer-based PRNGs successfully pass the majority of NIST tests, and their heat maps exhibit clear statistical randomness. Finally, we assess their capability in prediction attacks.
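For readers unfamiliar with the Linear Congruential Generator mentioned in the abstract, a minimal sketch of its recurrence follows. The default constants are the commonly cited Numerical Recipes parameters and are an assumption here; the abstract does not say which LCG parameters the paper uses, and this is not the Transformer-based generator itself.

```python
from typing import Iterator


def lcg(seed: int, a: int = 1664525, c: int = 1013904223, m: int = 2**32) -> Iterator[int]:
    """Yield the sequence x_{n+1} = (a * x_n + c) mod m.

    The default a, c, m are illustrative (Numerical Recipes constants),
    not values taken from the paper.
    """
    x = seed % m
    while True:
        x = (a * x + c) % m
        yield x


def first_outputs(seed: int, n: int) -> list[int]:
    """Collect the first n outputs of the default-parameter LCG."""
    gen = lcg(seed)
    return [next(gen) for _ in range(n)]
```

Each output depends only on the previous state through one multiply, one add, and one modulo, which is what makes the LCG a natural first target for step-by-step simulation.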
[LG-109] A hierarchy tree data structure for behavior-based user segment representation
Link: https://arxiv.org/abs/2508.01115
Authors: Yang Liu,Xuejiao Kang,Sathya Iyer,Idris Malik,Ruixuan Li,Juan Wang,Xinchen Lu,Xiangxue Zhao,Dayong Wang,Menghan Liu,Isaac Liu,Feng Liang,Yinzhe Yu
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 7 figures
Abstract:User attributes are essential in multiple stages of modern recommendation systems and are particularly important for mitigating the cold-start problem and improving the experience of new or infrequent users. We propose Behavior-based User Segmentation (BUS), a novel tree-based data structure that hierarchically segments the user universe with various users’ categorical attributes based on the users’ product-specific engagement behaviors. During the BUS tree construction, we use Normalized Discounted Cumulative Gain (NDCG) as the objective function to maximize the behavioral representativeness of marginal users relative to active users in the same segment. The constructed BUS tree undergoes further processing and aggregation across the leaf nodes and internal nodes, allowing the generation of popular social content and behavioral patterns for each node in the tree. To further mitigate bias and improve fairness, we use the social graph to derive the user’s connection-based BUS segments, enabling the combination of behavioral patterns extracted from both the user’s own segment and connection-based segments as the connection aware BUS-based recommendation. Our offline analysis shows that the BUS-based retrieval significantly outperforms traditional user cohort-based aggregation on ranking quality. We have successfully deployed our data structure and machine learning algorithm and tested it with various production traffic serving billions of users daily, achieving statistically significant improvements in the online product metrics, including music ranking and email notifications. To the best of our knowledge, our study represents the first list-wise learning-to-rank framework for tree-based recommendation that effectively integrates diverse user categorical attributes while preserving real-world semantic interpretability at a large industrial scale.
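The abstract names Normalized Discounted Cumulative Gain (NDCG) as the objective used during BUS tree construction. As a reminder of what that metric computes, here is a minimal sketch using the linear-gain form of DCG. The function names are illustrative, and how the paper plugs NDCG into the tree split criterion is not described in the abstract, so none of that is reproduced here.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain: rel_i / log2(i + 1) for 1-based rank i."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A ranking that already lists items in descending relevance scores 1.0; any other ordering of the same items scores lower.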
[LG-110] Flow Matching for Probabilistic Learning of Dynamical Systems from Missing or Noisy Data
Link: https://arxiv.org/abs/2508.01101
Authors: Siddharth Rout,Eldad Haber,Stephane Gaudreault
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Computational Physics (physics.comp-ph)
Comments: arXiv admin note: text overlap with arXiv:2503.12273
Abstract:Learning dynamical systems is crucial across many fields, yet applying machine learning techniques remains challenging due to missing variables and noisy data. Classical mathematical models often struggle in these scenarios because the physical systems become ill-posed. Stochastic machine learning techniques address this challenge by enabling the modeling of such ill-posed problems. Thus, a single known input to the trained machine learning model may yield multiple plausible outputs, and all of the outputs are correct. In such scenarios, probabilistic forecasting is inherently meaningful. In this study, we introduce a variant of flow matching for probabilistic forecasting which estimates possible future states as a distribution over possible outcomes rather than a single-point prediction. Perturbation of complex dynamical states is not trivial. The community typically applies Gaussian or uniform perturbations to crucial variables to model uncertainty. However, not all variables behave in a Gaussian fashion. We therefore also propose a generative machine learning approach to physically and logically perturb the states of complex high-dimensional dynamical systems. Finally, we establish the mathematical foundations of our method and demonstrate its effectiveness on several challenging dynamical systems, including a variant of the high-dimensional WeatherBench dataset, which models the global weather at a 5.625° meridional resolution.
[LG-111] Centralized Adaptive Sampling for Reliable Co-Training of Independent Multi-Agent Policies
Link: https://arxiv.org/abs/2508.01049
Authors: Nicholas E. Corrado,Josiah P. Hanna
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Independent on-policy policy gradient algorithms are widely used for multi-agent reinforcement learning (MARL) in cooperative and no-conflict games, but they are known to converge suboptimally when each agent’s policy gradient points toward a suboptimal equilibrium. In this work, we identify a subtler failure mode that arises even when the expected policy gradients of all agents point toward an optimal solution. After collecting a finite set of trajectories, stochasticity in independent action sampling can cause the joint data distribution to deviate from the expected joint on-policy distribution. This sampling error w.r.t. the joint on-policy distribution produces inaccurate gradient estimates that can lead agents to converge suboptimally. In this paper, we investigate whether joint sampling error can be reduced through coordinated action selection and whether doing so improves the reliability of policy gradient learning in MARL. Toward this end, we introduce an adaptive action sampling approach to reduce joint sampling error. Our method, Multi-Agent Proximal Robust On-Policy Sampling (MA-PROPS), uses a centralized behavior policy that we continually adapt to place larger probability on joint actions that are currently under-sampled w.r.t. the current joint policy. We empirically evaluate MA-PROPS in a diverse range of multi-agent games and demonstrate that (1) MA-PROPS reduces joint sampling error more efficiently than standard on-policy sampling and (2) improves the reliability of independent policy gradient algorithms, increasing the fraction of training runs that converge to an optimal joint policy.
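To make the notion of joint sampling error concrete, the following minimal sketch draws each agent's action independently and measures how far the empirical joint action distribution lands from the expected one after a finite number of samples. The function name, the toy policies, and the use of total variation distance are assumptions for illustration; the abstract does not specify the paper's error measure, and this is not MA-PROPS itself.

```python
import random
from collections import Counter
from itertools import product
from typing import Sequence


def joint_sampling_error(
    policies: Sequence[Sequence[float]], num_samples: int, seed: int = 0
) -> float:
    """Total variation distance between empirical and expected joint action
    distributions when every agent samples independently from its own policy.

    policies[i][a] is the probability that agent i picks action a.
    """
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(num_samples):
        joint = tuple(rng.choices(range(len(p)), weights=p)[0] for p in policies)
        counts[joint] += 1

    error = 0.0
    for joint in product(*(range(len(p)) for p in policies)):
        expected = 1.0
        for agent, action in enumerate(joint):
            expected *= policies[agent][action]
        error += abs(counts[joint] / num_samples - expected)
    return 0.5 * error
```

With only a handful of samples from two uniform policies, the returned value is typically well above zero even though each agent follows its policy exactly, which is the effect the abstract describes.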
[LG-112] Explaining GNN Explanations with Edge Gradients KDD2025
Link: https://arxiv.org/abs/2508.01048
Authors: Jesse He,Akbar Rafiey,Gal Mishne,Yusu Wang
Categories: Machine Learning (cs.LG)
Comments: KDD 2025
Abstract:In recent years, the remarkable success of graph neural networks (GNNs) on graph-structured data has prompted a surge of methods for explaining GNN predictions. However, the state-of-the-art for GNN explainability remains in flux. Different comparisons find mixed results for different methods, with many explainers struggling on more complex GNN architectures and tasks. This presents an urgent need for a more careful theoretical analysis of competing GNN explanation methods. In this work we take a closer look at GNN explanations in two different settings: input-level explanations, which produce explanatory subgraphs of the input graph, and layerwise explanations, which produce explanatory subgraphs of the computation graph. We establish the first theoretical connections between the popular perturbation-based and classical gradient-based methods, as well as point out connections between other recently proposed methods. At the input level, we demonstrate conditions under which GNNExplainer can be approximated by a simple heuristic based on the sign of the edge gradients. In the layerwise setting, we point out that edge gradients are equivalent to occlusion search for linear GNNs. Finally, we demonstrate how our theoretical results manifest in practice with experiments on both synthetic and real datasets.
[LG-113] Addressing Cold Start For next-article Recommendation
Link: https://arxiv.org/abs/2508.01036
Authors: Omar Elgohary,Nathan Jorgenson,Trenton Marple
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:This replication study adapts ALMM, the Adaptive Linear Mapping Model originally constructed for next-song recommendation, to the news recommendation problem on the MIND dataset. The original version of ALMM computes latent representations for users, last-time items, and current items in a tensor factorization structure and learns a linear mapping from content features to latent item vectors. Our replication aims to improve recommendation performance in cold-start scenarios by restructuring this model for sequential news click behavior, viewing consecutively read articles as (last news, next news) tuples. Instead of the original audio features, we apply BERT and TF-IDF (Term Frequency-Inverse Document Frequency) to news titles and abstracts to extract contextualized token representations and align them with triplet-based user reading patterns. We also propose a thorough, reproducible pre-processing pipeline combining news filtering and feature integrity validation. Our implementation of ALMM with TF-IDF shows improved recommendation accuracy and robustness relative to the Forbes and Oord baseline models in the cold-start scenario. We demonstrate that ALMM in a minimally modified state is not suitable for next news recommendation.
[LG-114] Optimal Scheduling Algorithms for LLM Inference: Theory and Practice
Link: https://arxiv.org/abs/2508.01002
Authors: Agrim Bari,Parikshit Hegde,Gustavo de Veciana
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:With the growing use of Large Language Model (LLM)-based tools like ChatGPT, Perplexity, and Gemini across industries, there is a rising need for efficient LLM inference systems. These systems handle requests with a unique two-phase computation structure: a prefill-phase that processes the full input prompt and a decode-phase that autoregressively generates tokens one at a time. This structure calls for new strategies for routing and scheduling requests. In this paper, we take a comprehensive approach to this challenge by developing a theoretical framework that models routing and scheduling in LLM inference systems. We identify two key design principles, optimal tiling and dynamic resource allocation, that are essential for achieving high throughput. Guided by these principles, we propose the Resource-Aware Dynamic (RAD) scheduler and prove that it achieves throughput optimality under mild conditions. To address practical Service Level Objectives (SLOs) such as serving requests with different Time Between Token (TBT) constraints, we design the SLO-Aware LLM Inference (SLAI) scheduler. SLAI uses real-time measurements to prioritize decode requests that are close to missing their TBT deadlines and reorders prefill requests based on known prompt lengths to further reduce the Time To First Token (TTFT) delays. We evaluate SLAI on the Openchat ShareGPT4 dataset using the Mistral-7B model on an NVIDIA RTX ADA 6000 GPU. Compared to Sarathi-Serve, SLAI reduces the median TTFT by 53% and increases the maximum serving capacity by 26% such that median TTFT is below 0.5 seconds, while meeting tail TBT latency constraints.
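The two ordering ideas the abstract attributes to SLAI (serve decode requests that are about to miss their TBT deadline first, and reorder prefill requests by prompt length) can be sketched as a simple priority rule. Everything below is a hypothetical illustration: the class names, the `urgent_slack` threshold, the shortest-prompt-first choice, and the placement of relaxed decodes last are assumptions, not the scheduler described in the paper.

```python
from dataclasses import dataclass


@dataclass
class DecodeRequest:
    request_id: str
    last_token_time: float  # when the previous token was emitted (seconds)
    tbt_limit: float        # allowed time between tokens (seconds)

    def slack(self, now: float) -> float:
        """Time left before this request misses its TBT deadline."""
        return self.last_token_time + self.tbt_limit - now


@dataclass
class PrefillRequest:
    request_id: str
    prompt_length: int  # number of prompt tokens, known on arrival


def order_batch(
    decodes: list[DecodeRequest],
    prefills: list[PrefillRequest],
    now: float,
    urgent_slack: float = 0.05,  # hypothetical urgency threshold (seconds)
) -> list[str]:
    """Return request ids in service order: urgent decodes (least slack
    first), then prefills (shortest prompt first), then remaining decodes."""
    urgent = sorted(
        (d for d in decodes if d.slack(now) <= urgent_slack),
        key=lambda d: d.slack(now),
    )
    relaxed = [d for d in decodes if d.slack(now) > urgent_slack]
    short_first = sorted(prefills, key=lambda p: p.prompt_length)
    return (
        [d.request_id for d in urgent]
        + [p.request_id for p in short_first]
        + [d.request_id for d in relaxed]
    )
```

Deferring decodes that still have slack frees capacity for short prefills, which is one plausible way such a rule could lower time to first token without violating token-gap deadlines.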
[LG-115] FeatureCuts: Feature Selection for Large Data by Optimizing the Cutoff
Link: https://arxiv.org/abs/2508.00954
Authors: Andy Hu,Devika Prasad,Luiz Pizzato,Nicholas Foord,Arman Abrahamyan,Anna Leontjeva,Cooper Doyle,Dan Jermyn
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 4 figures, appendix
Abstract:In machine learning, the process of feature selection involves finding a reduced subset of features that captures most of the information required to train an accurate and efficient model. This work presents FeatureCuts, a novel feature selection algorithm that adaptively selects the optimal feature cutoff after performing filter ranking. Evaluated on 14 publicly available datasets and one industry dataset, FeatureCuts achieved, on average, 15 percentage points more feature reduction and up to 99.6% less computation time while maintaining model performance, compared to existing state-of-the-art methods. When the selected features are used in a wrapper method such as Particle Swarm Optimization (PSO), it enables 25 percentage points more feature reduction, requires 66% less computation time, and maintains model performance when compared to PSO alone. The minimal overhead of FeatureCuts makes it scalable for large datasets typically seen in enterprise applications.
[LG-116] Cooperative effects in feature importance of individual patterns: application to air pollutants and Alzheimer disease
Link: https://arxiv.org/abs/2508.00930
Authors: M. Ontivero-Ortega,A. Fania,A. Lacalamita,R. Bellotti,A. Monaco,S. Stramaglia
Categories: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments:
Abstract:Leveraging recent advances in the analysis of synergy and redundancy in systems of random variables, an adaptive version of the widely used metric Leave One Covariate Out (LOCO) has been recently proposed to quantify cooperative effects in feature importance (Hi-Fi), a key technique in explainable artificial intelligence (XAI), so as to disentangle high-order effects involving a particular input feature in regression problems. Unlike standard feature importance tools, where a single score measures the relevance of each feature, each feature is here characterized by three scores: a two-body (unique) score and two higher-order scores (redundant and synergistic). This paper presents a framework to assign those three scores (unique, redundant, and synergistic) to each individual pattern of the data set, while comparing it with the well-known measure of feature importance named the Shapley effect. To illustrate the potential of the proposed framework, we focus on a One-Health application: the relation between air pollutants and Alzheimer’s disease mortality rate. Our main result is the synergistic association between features related to $O_3$ and $NO_2$ with mortality, especially in the provinces of Bergamo and Brescia; notably, the density of urban green areas also displays synergistic influence with pollutants for the prediction of AD mortality. Our results place local Hi-Fi as a promising tool of wide applicability, which opens new perspectives for XAI as well as for analyzing high-order relationships in complex systems.
[LG-117] Hybrid Hypergraph Networks for Multimodal Sequence Data Classification
Link: https://arxiv.org/abs/2508.00926
Authors: Feng Xu,Hui Wang,Yuting Huang,Danwei Zhang,Zizhu Fan
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 5 figures
Abstract:Modeling temporal multimodal data poses significant challenges in classification tasks, particularly in capturing long-range temporal dependencies and intricate cross-modal interactions. Audiovisual data, as a representative example, is inherently characterized by strict temporal order and diverse modalities. Effectively leveraging the temporal structure is essential for understanding both intra-modal dynamics and inter-modal correlations. However, most existing approaches treat each modality independently and rely on shallow fusion strategies, which overlook temporal dependencies and hinder the model’s ability to represent complex structural relationships. To address the limitation, we propose the hybrid hypergraph network (HHN), a novel framework that models temporal multimodal data via a segmentation-first, graph-later strategy. HHN splits sequences into timestamped segments as nodes in a heterogeneous graph. Intra-modal structures are captured via hyperedges guided by a maximum entropy difference criterion, enhancing node heterogeneity and structural discrimination, followed by hypergraph convolution to extract high-order dependencies. Inter-modal links are established through temporal alignment and graph attention for semantic fusion. HHN achieves state-of-the-art (SOTA) results on four multimodal datasets, demonstrating its effectiveness in complex classification tasks.
[LG-118] Beyond Benchmarks: Dynamic Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models
Link: https://arxiv.org/abs/2508.00923
Authors: Jiazhen Pan,Bailiang Jian,Paul Hager,Yundi Zhang,Che Liu,Friedrike Jungmann,Hongwei Bran Li,Chenyu You,Junde Wu,Jiayuan Zhu,Fenglin Liu,Yuyuan Liu,Niklas Bubeck,Christian Wachinger,Chen(Cherise)Chen,Zhenyu Gong,Cheng Ouyang,Georgios Kaissis,Benedikt Wiestler,Daniel Rueckert
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Ensuring the safety and reliability of large language models (LLMs) in clinical practice is critical to prevent patient harm and promote trustworthy healthcare applications of AI. However, LLMs are advancing so rapidly that static safety benchmarks often become obsolete upon publication, yielding only an incomplete and sometimes misleading picture of model trustworthiness. We demonstrate that a Dynamic, Automatic, and Systematic (DAS) red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses of current LLMs across four safety-critical domains: robustness, privacy, bias/fairness, and hallucination. A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses, uncovering vulnerabilities in real time without human intervention. Applying DAS to 15 proprietary and open-source LLMs revealed a stark contrast between static benchmark performance and vulnerability under adversarial pressure. Despite a median MedQA accuracy exceeding 80%, 94% of previously correct answers failed our dynamic robustness tests. We observed similarly high failure rates across other domains: privacy leaks were elicited in 86% of scenarios, cognitive-bias priming altered clinical recommendations in 81% of fairness tests, and we identified hallucination rates exceeding 66% in widely used models. Such profound residual risks are incompatible with routine clinical practice. By converting red-teaming from a static checklist into a dynamic stress-test audit, DAS red-teaming offers the surveillance that hospitals/regulators/technology vendors require as LLMs become embedded in patient chatbots, decision-support dashboards, and broader healthcare workflows. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
[LG-119] CaliMatch: Adaptive Calibration for Improving Safe Semi-supervised Learning
Link: https://arxiv.org/abs/2508.00922
Authors: Jinsoo Bae,Seoung Bum Kim,Hyungrok Do
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Semi-supervised learning (SSL) uses unlabeled data to improve the performance of machine learning models when labeled data is scarce. However, its real-world applications often face the label distribution mismatch problem, in which the unlabeled dataset includes instances whose ground-truth labels are absent from the labeled training dataset. Recent studies, referred to as safe SSL, have addressed this issue by using both classification and out-of-distribution (OOD) detection. However, the existing methods may suffer from overconfidence in deep neural networks, leading to increased SSL errors because of high confidence in incorrect pseudo-labels or OOD detection. To address this, we propose a novel method, CaliMatch, which calibrates both the classifier and the OOD detector to foster safe SSL. CaliMatch presents adaptive label smoothing and temperature scaling, which eliminates the need to manually tune the smoothing degree for effective calibration. We give a theoretical justification for why improving the calibration of both the classifier and the OOD detector is crucial in safe SSL. Extensive evaluations on CIFAR-10, CIFAR-100, SVHN, TinyImageNet, and ImageNet demonstrate that CaliMatch outperforms the existing methods in safe SSL tasks.
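The two calibration tools named in the abstract, temperature scaling and label smoothing, are standard techniques, and a minimal sketch of each follows. The temperature and smoothing degree are fixed inputs here; CaliMatch's adaptive choice of the smoothing degree is not described in the abstract and is not reproduced, and the function names are illustrative.

```python
import math
from typing import Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float = 1.0) -> list[float]:
    """Softmax of logits / temperature; temperature > 1 softens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_labels(true_class: int, num_classes: int, epsilon: float) -> list[float]:
    """Uniform label smoothing: mix the one-hot target with a uniform
    distribution, giving the true class 1 - epsilon + epsilon / num_classes."""
    off = epsilon / num_classes
    return [1.0 - epsilon + off if i == true_class else off for i in range(num_classes)]
```

Both operations pull probabilities away from 0 and 1, which is why they are commonly used against the overconfidence problem the abstract describes.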
[LG-120] NeuCoReClass AD: Redefining Self-Supervised Time Series Anomaly Detection
Link: https://arxiv.org/abs/2508.00909
Authors: Aitor Sánchez-Ferrera,Usue Mori,Borja Calvo,Jose A. Lozano
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Time series anomaly detection plays a critical role in a wide range of real-world applications. Among unsupervised approaches, self-supervised learning has gained traction for modeling normal behavior without the need for labeled data. However, many existing methods rely on a single proxy task, limiting their ability to capture meaningful patterns in normal data. Moreover, they often depend on handcrafted transformations tailored to specific domains, hindering their generalization across diverse problems. To address these limitations, we introduce NeuCoReClass AD, a self-supervised multi-task time series anomaly detection framework that combines contrastive, reconstruction, and classification proxy tasks. Our method employs neural transformation learning to generate augmented views that are informative, diverse, and coherent, without requiring domain-specific knowledge. We evaluate NeuCoReClass AD across a wide range of benchmarks, demonstrating that it consistently outperforms both classical baselines and most deep-learning alternatives. Furthermore, it enables the characterization of distinct anomaly profiles in a fully unsupervised manner.
[LG-121] Cross-Process Defect Attribution using Potential Loss Analysis
Link: https://arxiv.org/abs/2508.00895
Authors: Tsuyoshi Idé,Kohei Miyaguchi
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2507.20357
Abstract:Cross-process root-cause analysis of wafer defects is among the most critical yet challenging tasks in semiconductor manufacturing due to the heterogeneity and combinatorial nature of processes along the processing route. This paper presents a new framework for wafer defect root cause analysis, called Potential Loss Analysis (PLA), as a significant enhancement of the previously proposed partial trajectory regression approach. The PLA framework attributes observed high wafer defect densities to upstream processes by comparing the best possible outcomes generated by partial processing trajectories. We show that the task of identifying the best possible outcome can be reduced to solving a Bellman equation. Remarkably, the proposed framework can simultaneously solve the prediction problem for defect density as well as the attribution problem for defect scores. We demonstrate the effectiveness of the proposed framework using real wafer history data.
[LG-122] Multi-Community Spectral Clustering for Geometric Graphs
Link: https://arxiv.org/abs/2508.00893
Authors: Luiz Emilio Allem,Konstantin Avrachenkov,Carlos Hoppen,Hariprasad Manjunath,Lucas Siviero Sibemberg
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Probability (math.PR); Spectral Theory (math.SP); Machine Learning (stat.ML)
Comments:
Abstract:In this paper, we consider the soft geometric block model (SGBM) with a fixed number $k \geq 2$ of homogeneous communities in the dense regime, and we introduce a spectral clustering algorithm for community recovery on graphs generated by this model. Given such a graph, the algorithm produces an embedding into $\mathbb{R}^{k-1}$ using the eigenvectors associated with the $k-1$ eigenvalues of the adjacency matrix of the graph that are closest to a value determined by the parameters of the model. It then applies $k$-means clustering to the embedding. We prove weak consistency and show that a simple local refinement step ensures strong consistency. A key ingredient is an application of a non-standard version of the Davis-Kahan theorem to control eigenspace perturbations when eigenvalues are not simple. We also analyze the limiting spectrum of the adjacency matrix, using a combination of combinatorial and matrix techniques.
[LG-123] A Dynamic Context-Aware Framework for Risky Driving Prediction Using Naturalistic Data
Link: https://arxiv.org/abs/2508.00888
Authors: Amir Hossein Kalantari,Eleonora Papadimitriou,Amir Pooyan Afghari
Categories: Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO); Methodology (stat.ME)
Comments: 32 pages
Abstract:Naturalistic driving studies offer a powerful means for observing and quantifying real-world driving behaviour. One of their prominent applications in traffic safety is the continuous monitoring and classification of risky driving behaviour. However, many existing frameworks rely on fixed time windows and static thresholds for distinguishing between safe and risky behaviour - limiting their ability to respond to the stochastic nature of real-world driving. This study proposes a dynamic and individualised framework for identifying risky driving behaviour using Belgian naturalistic driving data. The approach leverages a rolling time window and bi-level optimisation to dynamically calibrate both risk thresholds and model hyperparameters, capturing subtle behavioural shifts. Two safety indicators, speed-weighted headway and harsh driving events, were evaluated using three data-driven models: Random Forest, XGBoost, and Deep Neural Network (DNN). The DNN demonstrated strong capability in capturing subtle changes in driving behaviour, particularly excelling in high-recall tasks, making it promising for early-stage risk detection. XGBoost provided the most balanced and stable performance across different thresholds and evaluation metrics. While random forest showed more variability, it responded sensitively to dynamic threshold adjustments, which may be advantageous during model adaptation or tuning. Speed-weighted headway emerged as a more stable and context-sensitive risk indicator than harsh driving events, likely due to its robustness to label sparsity and contextual variation. Overall, the findings support the value of adaptive, personalised risk detection approaches for enhancing real-time safety feedback and tailoring driver support in intelligent transport systems.
[LG-124] Stochastic Optimal Control via Measure Relaxations
链接: https://arxiv.org/abs/2508.00886
作者: Etienne Buehrle,Christoph Stiller
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 7 pages, 4 figures
Abstract:The optimal control problem of stochastic systems is commonly solved via robust or scenario-based optimization methods, which are both challenging to scale to long optimization horizons. We cast the optimal control problem of a stochastic system as a convex optimization problem over occupation measures. We demonstrate our method on a set of synthetic and real-world scenarios, learning cost functions from data via Christoffel polynomials. The code for our experiments is available at this https URL.
[LG-125] Learned LSM-trees: Two Approaches Using Learned Bloom Filters
链接: https://arxiv.org/abs/2508.00882
作者: Nicholas Fidalgo,Puyuan Ye
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:Modern key-value stores rely heavily on Log-Structured Merge (LSM) trees for write optimization, but this design introduces significant read amplification. Auxiliary structures like Bloom filters help, but impose memory costs that scale with tree depth and dataset size. Recent advances in learned data structures suggest that machine learning models can augment or replace these components, trading handcrafted heuristics for data-adaptive behavior. In this work, we explore two approaches for integrating learned predictions into the LSM-tree lookup path. The first uses a classifier to selectively bypass Bloom filter probes for irrelevant levels, aiming to reduce average-case query latency. The second replaces traditional Bloom filters with compact learned models and small backup filters, targeting memory footprint reduction without compromising correctness. We implement both methods atop a Monkey-style LSM-tree with leveled compaction, per-level Bloom filters, and realistic workloads. Our experiments show that the classifier reduces GET latency by up to 2.28x by skipping over 30% of Bloom filter checks with high precision, though it incurs a modest false-negative rate. The learned Bloom filter design achieves zero false negatives and retains baseline latency while cutting memory usage per level by 70-80%. Together, these designs illustrate complementary trade-offs between latency, memory, and correctness, and highlight the potential of learned index components in write-optimized storage systems.
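The second design in the abstract above (a learned model plus a small backup filter) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_score`, the 0.5 threshold, and the filter sizes are hypothetical stand-ins for a trained model and tuned parameters. The point it shows is why false negatives cannot occur: every stored key that the model would reject is placed in the backup Bloom filter.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings, using k salted SHA-256 hashes."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive one bit position per salt from the first 8 digest bytes.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class LearnedBloomFilter:
    """Model first, backup filter second; no false negatives by construction."""

    def __init__(self, score, threshold, keys, backup_bits=256, backup_hashes=3):
        self.score = score
        self.threshold = threshold
        self.backup = BloomFilter(backup_bits, backup_hashes)
        # Every stored key the model would reject goes into the backup filter.
        for key in keys:
            if self.score(key) < self.threshold:
                self.backup.add(key)

    def __contains__(self, key):
        return self.score(key) >= self.threshold or key in self.backup


def toy_score(key):
    """Hypothetical stand-in for a trained model: favours keys starting with 'user'."""
    return 0.9 if key.startswith("user") else 0.1


stored = ["user1", "user2", "order7", "user42", "cart9"]
lbf = LearnedBloomFilter(toy_score, threshold=0.5, keys=stored)
```

Every key in `stored` tests as present, while a key such as `"user999"` is a false positive produced by the model. The memory saving depends on the backup filter being much smaller than a filter over all keys, which in turn depends on how many stored keys the model misses.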
[LG-126] A Data-Driven Machine Learning Approach for Predicting Axial Load Capacity in Steel Storage Rack Columns
链接: https://arxiv.org/abs/2508.00876
作者: Bakhtiyar Mammadli,Casim Yazici,Muhammed Gürbüz,İrfan Kocaman,F. Javier Dominguez-Gutierrez,Fatih Mehmet Özkal
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:
Abstract:In this study, we present a machine learning (ML) framework to predict the axial load-bearing capacity (kN) of cold-formed steel structural members. The methodology emphasizes robust model selection and interpretability, addressing the limitations of traditional analytical approaches in capturing the nonlinearities and geometrical complexities inherent to buckling behavior. The dataset, comprising key geometric and mechanical parameters of steel columns, was curated with appropriate pre-processing steps, including removal of non-informative identifiers and imputation of missing values. A comprehensive suite of regression algorithms, ranging from linear models to kernel-based regressors and ensemble tree methods, was evaluated. Among these, Gradient Boosting Regression exhibited superior predictive performance across multiple metrics, including the coefficient of determination (R2), root mean squared error (RMSE), and mean absolute error (MAE), and was consequently selected as the final model. Model interpretability was addressed using SHapley Additive exPlanations (SHAP), enabling insight into the relative importance and interaction of input features influencing the predicted axial capacity. To facilitate practical deployment, the model was integrated into an interactive, Python-based web interface via Streamlit. This tool allows end users, such as structural engineers and designers, to input design parameters manually or through CSV upload, and to obtain real-time predictions of axial load capacity without the need for programming expertise. Applied to the context of steel storage rack columns, the framework demonstrates how data-driven tools can enhance design safety, streamline validation workflows, and inform decision-making in structural applications where buckling is a critical failure mode.
[LG-127] Discrete approach to machine learning
链接: https://arxiv.org/abs/2508.00869
作者: Dmitriy Kashitsyn,Dmitriy Shabanov
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Information Theory (cs.IT)
*备注: preprint, 52 pages, 37 figures
Abstract:The article explores an encoding and structural information processing approach using sparse bit vectors and fixed-length linear vectors. The following are presented: a discrete method of speculative stochastic dimensionality reduction of multidimensional code and linear spaces with linear asymptotic complexity; a geometric method for obtaining discrete embeddings of an organised code space that reflect the internal structure of a given modality. The structure and properties of a code space are investigated using three modalities as examples: morphology of Russian and English languages, and immunohistochemical markers. Parallels are drawn between the resulting map of the code space layout and so-called pinwheels appearing on the mammalian neocortex. A cautious assumption is made about similarities between neocortex organisation and processes happening in our models.
[LG-128] A Residual Guided strategy with Generative Adversarial Networks in training Physics-Informed Transformer Networks
链接: https://arxiv.org/abs/2508.00855
作者: Ziyang Zhang,Feifan Zhang,Weidong Tang,Lei Shi,Tailai Chen
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:Nonlinear partial differential equations (PDEs) are pivotal in modeling complex physical systems, yet traditional Physics-Informed Neural Networks (PINNs) often struggle with unresolved residuals in critical spatiotemporal regions and violations of temporal causality. To address these limitations, we propose a novel Residual Guided Training strategy for Physics-Informed Transformer via Generative Adversarial Networks (GAN). Our framework integrates a decoder-only Transformer to inherently capture temporal correlations through autoregressive processing, coupled with a residual-aware GAN that dynamically identifies and prioritizes high-residual regions. By introducing a causal penalty term and an adaptive sampling mechanism, the method enforces temporal causality while refining accuracy in problematic domains. Extensive numerical experiments on the Allen-Cahn, Klein-Gordon, and Navier-Stokes equations demonstrate significant improvements, achieving relative MSE reductions of up to three orders of magnitude compared to baseline methods. This work bridges the gap between deep learning and physics-driven modeling, offering a robust solution for multiscale and time-dependent PDE systems.
[LG-129] Deep Kernel Bayesian Optimisation for Closed-Loop Electrode Microstructure Design with User-Defined Properties based on GANs
链接: https://arxiv.org/abs/2508.00833
作者: Andrea Gayon-Lombardo,Ehecatl A. del Rio-Chanona,Catalina A. Pino-Munoz,Nigel P. Brandon
类目: Computational Engineering, Finance, and Science (cs.CE); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: This work is part of the PhD thesis that can be found in the Imperial College archives: this https URL
Abstract:The generation of multiphase porous electrode microstructures with optimum morphological and transport properties is essential in the design of improved electrochemical energy storage devices, such as lithium-ion batteries. Electrode characteristics directly influence battery performance by acting as the main sites where the electrochemical reactions coupled with transport processes occur. This work presents a generation-optimisation closed-loop algorithm for the design of microstructures with tailored properties. A deep convolutional Generative Adversarial Network is used as a deep kernel and employed to generate synthetic three-phase three-dimensional images of a porous lithium-ion battery cathode material. A Gaussian Process Regression uses the latent space of the generator and serves as a surrogate model to correlate the morphological and transport properties of the synthetic microstructures. This surrogate model is integrated into a deep kernel Bayesian optimisation framework, which optimises cathode properties as a function of the latent space of the generator. A set of objective functions were defined to perform the maximisation of morphological properties (e.g., volume fraction, specific surface area) and transport properties (relative diffusivity). We demonstrate the ability to perform simultaneous maximisation of correlated properties (specific surface area and relative diffusivity), as well as constrained optimisation of these properties. This is the maximisation of morphological or transport properties constrained by constant values of the volume fraction of the phase of interest. Visualising the optimised latent space reveals its correlation with morphological properties, enabling the fast generation of visually realistic microstructures with customised properties.
[LG-130] EngiBench: A Framework for Data-Driven Engineering Design Research
链接: https://arxiv.org/abs/2508.00831
作者: Florian Felten,Gabriel Apaza,Gerhard Bräunlich,Cashen Diniz,Xuliang Dong,Arthur Drake,Milad Habibi,Nathaniel J. Hoffman,Matthew Keeler,Soheyl Massoudi,Francis G. VanGessel,Mark Fuge
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Under review
Abstract:Engineering design optimization seeks to automatically determine the shapes, topologies, or parameters of components that maximize performance under given conditions. This process often depends on physics-based simulations, which are difficult to install, computationally expensive, and require domain-specific expertise. To mitigate these challenges, we introduce EngiBench, the first open-source library and datasets spanning diverse domains for data-driven engineering design. EngiBench provides a unified API and a curated set of benchmarks – covering aeronautics, heat conduction, photonics, and more – that enable fair, reproducible comparisons of optimization and machine learning algorithms, such as generative or surrogate models. We also release EngiOpt, a companion library offering a collection of such algorithms compatible with the EngiBench interface. Both libraries are modular, letting users plug in novel algorithms or problems, automate end-to-end experiment workflows, and leverage built-in utilities for visualization, dataset generation, feasibility checks, and performance analysis. We demonstrate their versatility through experiments comparing state-of-the-art techniques across multiple engineering design problems, an undertaking that was previously prohibitively time-consuming to perform. Finally, we show that these problems pose significant challenges for standard machine learning methods due to highly sensitive and constrained design manifolds.
[LG-131] FastCSP: Accelerated Molecular Crystal Structure Prediction with Universal Model for Atoms
链接: https://arxiv.org/abs/2508.02641
作者: Vahe Gharakhanyan,Yi Yang,Luis Barroso-Luque,Muhammed Shuaibi,Daniel S. Levine,Kyle Michel,Viachaslau Bernat,Misko Dzamba,Xiang Fu,Meng Gao,Xingyu Liu,Keian Noori,Lafe J. Purvis,Tingling Rao,Brandon M. Wood,Ammar Rizvi,Matt Uyttendaele,Andrew J. Ouderkirk,Chiara Daraio,C. Lawrence Zitnick,Arman Boromand,Noa Marom,Zachary W. Ulissi,Anuroop Sriram
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注: 52 pages, 19 figures, 6 tables
Abstract:Crystal Structure Prediction (CSP) of molecular crystals plays a central role in applications, such as pharmaceuticals and organic electronics. CSP is challenging and computationally expensive due to the need to explore a large search space with sufficient accuracy to capture energy differences of a few kJ/mol between polymorphs. Dispersion-inclusive density functional theory (DFT) provides the required accuracy but its computational cost is impractical for a large number of putative structures. We introduce FastCSP, an open-source, high-throughput CSP workflow based on machine learning interatomic potentials (MLIPs). FastCSP combines random structure generation using Genarris 3.0 with geometry relaxation and free energy calculations powered entirely by the Universal Model for Atoms (UMA) MLIP. We benchmark FastCSP on a curated set of 28 mostly rigid molecules, demonstrating that our workflow consistently generates known experimental structures and ranks them within 5 kJ/mol per molecule of the global minimum. Our results demonstrate that universal MLIPs can be used across diverse compounds without requiring system-specific tuning. Moreover, the speed and accuracy afforded by UMA eliminate the need for classical force fields in the early stages of CSP and for final re-ranking with DFT. The open-source release of the entire FastCSP workflow significantly lowers the barrier to accessing CSP. CSP results for a single system can be obtained within hours on tens of modern GPUs, making high-throughput crystal structure prediction feasible for a broad range of scientific applications.
[LG-132] Trustworthy scientific inference for inverse problems with generative models
链接: https://arxiv.org/abs/2508.02602
作者: James Carzon,Luca Masserano,Joshua D. Ingram,Alex Shen,Antonio Carlos Herling Ribeiro Junior,Tommaso Dorigo,Michele Doro,Joshua S. Speagle,Rafael Izbicki,Ann B. Lee
类目: Machine Learning (stat.ML); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:
Abstract:Generative artificial intelligence (AI) excels at producing complex data structures (text, images, videos) by learning patterns from training examples. Across scientific disciplines, researchers are now applying generative models to “inverse problems” to infer hidden parameters from observed data. While these methods can handle intractable models and large-scale studies, they can also produce biased or overconfident conclusions. We present a solution with Frequentist-Bayes (FreB), a mathematically rigorous protocol that reshapes AI-generated probability distributions into confidence regions that consistently include true parameters with the expected probability, while achieving minimum size when training and target data align. We demonstrate FreB’s effectiveness by tackling diverse case studies in the physical sciences: identifying unknown sources under dataset shift, reconciling competing theoretical models, and mitigating selection bias and systematics in observational studies. By providing validity guarantees with interpretable diagnostics, FreB enables trustworthy scientific inference across fields where direct likelihood evaluation remains impossible or prohibitively expensive.
[LG-133] Superior resilience to poisoning and amenability to unlearning in quantum machine learning
链接: https://arxiv.org/abs/2508.02422
作者: Yu-Qin Chen,Shi-Xin Zhang
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 9 pages, 4 figures with references and supplemental materials
Abstract:The reliability of artificial intelligence hinges on the integrity of its training data, a foundation often compromised by noise and corruption. Here, through a comparative study of classical and quantum neural networks on both classical and quantum data, we reveal a fundamental difference in their response to data corruption. We find that classical models exhibit brittle memorization, leading to a failure in generalization. In contrast, quantum models demonstrate remarkable resilience, which is underscored by a phase transition-like response to increasing label noise, revealing a critical point beyond which the model’s performance changes qualitatively. We further establish and investigate the field of quantum machine unlearning, the process of efficiently forcing a trained model to forget corrupting influences. We show that the brittle nature of the classical model forms rigid, stubborn memories of erroneous data, making efficient unlearning challenging, while the quantum model is significantly more amenable to efficient forgetting with approximate unlearning methods. Our findings establish that quantum machine learning can possess a dual advantage of intrinsic resilience and efficient adaptability, providing a promising paradigm for the trustworthy and robust artificial intelligence of the future.
[LG-134] The Role of Review Process Failures in Affective State Estimation: An Empirical Investigation of DEAP Dataset
链接: https://arxiv.org/abs/2508.02417
作者: Nazmun N Khan,Taylor Sweet,Chase A Harvey,Calder Knapp,Dean J. Krusienski,David E Thompson
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 25 pages, 4 figures, This is a preprint version of the manuscript. It is intended for submission to a peer-reviewed journal
Abstract:The reliability of affective state estimation using EEG data is in question, given the variability in reported performance and the lack of standardized evaluation protocols. To investigate this, we reviewed 101 studies, focusing on the widely used DEAP dataset for emotion recognition. Our analysis revealed widespread methodological issues that include data leakage from improper segmentation, biased feature selection, flawed hyperparameter optimization, neglect of class imbalance, and insufficient methodological reporting. Notably, we found that nearly 87% of the reviewed papers contained one or more of these errors. Moreover, through experimental analysis, we observed that such methodological flaws can inflate the classification accuracy by up to 46%. These findings reveal fundamental gaps in standardized evaluation practices and highlight critical deficiencies in the peer review process for machine learning applications in neuroscience, emphasizing the urgent need for stricter methodological standards and evaluation protocols.
[LG-135] Detecting and measuring respiratory events in horses during exercise with a microphone: deep learning vs. standard signal processing
链接: https://arxiv.org/abs/2508.02349
作者: Jeanne I.M. Parmentier(1,2,3),Rhana M. Aarts(1),Elin Hernlund(4),Marie Rhodin(4),Berend Jan van der Zwaag(2,3) ((1) Utrecht University, (2) University of Twente, (3) Inertia Technology B.V., (4) Swedish University of Agricultural Sciences)
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:Monitoring respiration parameters such as respiratory rate could be beneficial to understand the impact of training on equine health and performance and ultimately improve equine welfare. In this work, we compare deep learning-based methods to an adapted signal processing method to automatically detect cyclic respiratory events and extract the dynamic respiratory rate from microphone recordings during high intensity exercise in Standardbred trotters. Our deep learning models are able to detect exhalation sounds (median F1 score of 0.94) in noisy microphone signals and show promising results on unlabelled signals at lower exercising intensity, where the exhalation sounds are less recognisable. Temporal convolutional networks were better at detecting exhalation events and estimating dynamic respiratory rates (median F1: 0.94, Mean Absolute Error (MAE) ± Confidence Intervals (CI): 1.44 ± 1.04 bpm, Limits Of Agreements (LOA): 0.63 ± 7.06 bpm) than long short-term memory networks (median F1: 0.90, MAE ± CI: 3.11 ± 1.58 bpm) and signal processing methods (MAE ± CI: 2.36 ± 1.11 bpm). This work is the first to automatically detect equine respiratory sounds and automatically compute dynamic respiratory rates in exercising horses. In the future, our models will be validated on lower exercising intensity sounds and different microphone placements will be evaluated in order to find the best combination for regular monitoring.
[LG-136] Comparing Generative Models with the New Physics Learning Machine
链接: https://arxiv.org/abs/2508.02275
作者: Samuele Grossi,Marco Letizia,Riccardo Torre
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph)
*备注: v1: 14 pages, 7 figures, 8 tables, additional material on GitHub referenced in the paper
Abstract:The rise of generative models for scientific research calls for the development of new methods to evaluate their fidelity. A natural framework for addressing this problem is two-sample hypothesis testing, namely the task of determining whether two data sets are drawn from the same distribution. In large-scale and high-dimensional regimes, machine learning offers a set of tools to push beyond the limitations of standard statistical techniques. In this work, we put this claim to the test by comparing a recent proposal from the high-energy physics literature, the New Physics Learning Machine, to perform a classification-based two-sample test against a number of alternative approaches, following the framework presented in Grossi et al. (2025). We highlight the efficiency tradeoffs of the method and the computational costs that come from adopting learning-based approaches. Finally, we discuss the advantages of the different methods for different use cases.
[LG-137] Structure Maintained Representation Learning Neural Network for Causal Inference
链接: https://arxiv.org/abs/2508.01865
作者: Yang Sun,Wenbin Lu,Yi-Hui Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Recent developments in causal inference have greatly shifted the interest from estimating the average treatment effect to the individual treatment effect. In this article, we improve the predictive accuracy of representation learning and adversarial networks in estimating individual treatment effects by introducing a structure keeper which maintains the correlation between the baseline covariates and their corresponding representations in the high dimensional space. We train a discriminator at the end of representation layers to trade off representation balance and information loss. We show that the proposed discriminator minimizes an upper bound of the treatment estimation error. We can address the tradeoff between distribution balance and information loss by considering the correlations between the learned representation space and the original covariate feature space. We conduct extensive experiments with simulated and real-world observational data to show that our proposed Structure Maintained Representation Learning (SMRL) algorithm outperforms state-of-the-art methods. We also demonstrate the algorithms on real electronic health record data from the MIMIC-III database.
[LG-138] Fast Gaussian process inference by exact Matérn kernel decomposition
链接: https://arxiv.org/abs/2508.01864
作者: Nicolas Langrené,Xavier Warin,Pierre Gruet
类目: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 31 pages, 1 figure
Abstract:To speed up Gaussian process inference, a number of fast kernel matrix-vector multiplication (MVM) approximation algorithms have been proposed over the years. In this paper, we establish an exact fast kernel MVM algorithm based on exact kernel decomposition into weighted empirical cumulative distribution functions, compatible with a class of kernels which includes multivariate Matérn kernels with half-integer smoothness parameter. This algorithm uses a divide-and-conquer approach, during which sorting outputs are stored in a data structure. We also propose a new algorithm to take into account some linear fixed effects predictor function. Our numerical experiments confirm that our algorithm is very effective for low-dimensional Gaussian process inference problems with hundreds of thousands of data points. An implementation of our algorithm is available at this https URL.
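The idea of an exact fast kernel MVM can be illustrated in the simplest case covered by the abstract above: a one-dimensional Matérn kernel with smoothness 1/2, i.e. the exponential kernel. The sketch below is not the paper's algorithm (which handles multivariate kernels with a divide-and-conquer scheme); it only shows, under that one-dimensional assumption, how sorting plus cumulative sums turns an O(n^2) product into an O(n log n) one without approximation. The function names are illustrative.

```python
import math


def exp_kernel_mvm_naive(xs, ws, length_scale=1.0):
    """Reference O(n^2) product K @ w with K[i][j] = exp(-|x_i - x_j| / length_scale)."""
    return [
        sum(w * math.exp(-abs(xi - xj) / length_scale) for xj, w in zip(xs, ws))
        for xi in xs
    ]


def exp_kernel_mvm_fast(xs, ws, length_scale=1.0):
    """Same product in O(n log n): sort once, then use prefix and suffix sums."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    # left[r]: sum over sorted ranks s <= r of w_s * exp(+x_s / length_scale)
    left = [0.0] * n
    running = 0.0
    for r, i in enumerate(order):
        running += ws[i] * math.exp(xs[i] / length_scale)
        left[r] = running
    # right[r]: sum over sorted ranks s > r of w_s * exp(-x_s / length_scale)
    right = [0.0] * n
    running = 0.0
    for r in range(n - 1, -1, -1):
        right[r] = running
        i = order[r]
        running += ws[i] * math.exp(-xs[i] / length_scale)
    # For x_s <= x_i the kernel factorises as exp(-x_i / l) * exp(x_s / l),
    # and for x_s > x_i as exp(x_i / l) * exp(-x_s / l).
    out = [0.0] * n
    for r, i in enumerate(order):
        out[i] = (math.exp(-xs[i] / length_scale) * left[r]
                  + math.exp(xs[i] / length_scale) * right[r])
    return out


xs = [0.3, -1.2, 2.0, 0.7]
ws = [1.0, -0.5, 2.0, 0.25]
fast = exp_kernel_mvm_fast(xs, ws)
slow = exp_kernel_mvm_naive(xs, ws)
```

The two results agree up to floating-point rounding. A practical version would rescale the exponentials to avoid overflow when `|x| / length_scale` is large; that refinement is omitted here.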
[LG-139] Test-Time Training for Speech Enhancement (INTERSPEECH 2025)
链接: https://arxiv.org/abs/2508.01847
作者: Avishkar Behera,Riya Ann Easow,Venkatesh Parvathala,K. Sri Rama Murty
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to Interspeech 2025. 5 pages, 2 figures
Abstract:This paper introduces a novel application of Test-Time Training (TTT) for Speech Enhancement, addressing the challenges posed by unpredictable noise conditions and domain shifts. This method combines a main speech enhancement task with a self-supervised auxiliary task in a Y-shaped architecture. The model dynamically adapts to new domains during inference time by optimizing the proposed self-supervised tasks like noise-augmented signal reconstruction or masked spectrogram prediction, bypassing the need for labeled data. We further introduce various TTT strategies offering a trade-off between adaptation and efficiency. Evaluations across synthetic and real-world datasets show consistent improvements across speech quality metrics, outperforming the baseline model. This work highlights the effectiveness of TTT in speech enhancement, providing insights for future research in adaptive and robust speech processing.
[LG-140] Efficient optimization of expensive black-box simulators via marginal means with application to neutrino detector design
链接: https://arxiv.org/abs/2508.01834
作者: Hwanwoo Kim,Simon Mak,Ann-Kathrin Schuetz,Alan Poon
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注:
Abstract:With advances in scientific computing, computer experiments are increasingly used for optimizing complex systems. However, for modern applications, e.g., the optimization of nuclear physics detectors, each experiment run can require hundreds of CPU hours, making the optimization of its black-box simulator over a high-dimensional space a challenging task. Given limited runs at inputs $\mathbf{x}_1, \cdots, \mathbf{x}_n$, the best solution from these evaluated inputs can be far from optimal, particularly as dimensionality increases. Existing black-box methods, however, largely employ this “pick-the-winner” (PW) solution, which leads to mediocre optimization performance. To address this, we propose a new Black-box Optimization via Marginal Means (BOMM) approach. The key idea is a new estimator of a global optimizer $\mathbf{x}^*$ that leverages the so-called marginal mean functions, which can be efficiently inferred with limited runs in high dimensions. Unlike PW, this estimator can select solutions beyond evaluated inputs for improved optimization performance. Assuming the objective function follows a generalized additive model with unknown link function and under mild conditions, we prove that the BOMM estimator not only is consistent for optimization, but also has an optimization rate that tempers the “curse-of-dimensionality” faced by existing methods, thus enabling better performance as dimensionality increases. We present a practical framework for implementing BOMM using the transformed additive Gaussian process surrogate model. Finally, we demonstrate the effectiveness of BOMM in numerical experiments and an application on neutrino detector optimization in nuclear physics.
[LG-141] Debiasing Machine Learning Predictions for Causal Inference Without Additional Ground Truth Data: “One Map Many Trials” in Satellite-Driven Poverty Analysis
链接: https://arxiv.org/abs/2508.01341
作者: Markus Pettersson,Connor T. Jerzak,Adel Daoud
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 31 pages
Abstract:Machine learning models trained on Earth observation data, such as satellite imagery, have demonstrated significant promise in predicting household-level wealth indices, enabling the creation of high-resolution wealth maps that can be leveraged across multiple causal trials. However, because standard training objectives prioritize overall predictive accuracy, these predictions inherently suffer from shrinkage toward the mean, leading to attenuated estimates of causal treatment effects and limiting their utility in policy. Existing debiasing methods, such as Prediction-Powered Inference, can handle this attenuation bias but require additional fresh ground-truth data at the downstream stage of causal inference, which restricts their applicability in data-scarce environments. Here, we introduce and evaluate two correction methods – linear calibration correction and Tweedie’s correction – that substantially reduce prediction bias without relying on newly collected labeled data. Linear calibration corrects bias through a straightforward linear transformation derived from held-out calibration data, whereas Tweedie’s correction leverages empirical Bayes principles to directly address shrinkage-induced biases by exploiting score functions derived from the model’s learning patterns. Through analytical exercises and experiments using Demographic and Health Survey data, we demonstrate that the proposed methods meet or outperform existing approaches that either require (a) adjustments to training pipelines or (b) additional labeled data. These approaches may represent a promising avenue for improving the reliability of causal inference when direct outcome measures are limited or unavailable, enabling a “one map, many trials” paradigm where a single upstream data creation team produces predictions usable by many downstream teams across diverse ML pipelines.
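One plausible reading of the linear calibration correction described above is sketched here. The abstract only states that the correction is a linear transformation derived from held-out calibration data, so the direction of the regression (predictions regressed on ground truth, then inverted) is an assumption of this sketch, and the function names and toy numbers are illustrative.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def linear_calibration(cal_truth, cal_pred):
    """Return a function that undoes the linear shrinkage seen on calibration data."""
    # Assumed model of shrinkage: pred ~ slope * truth + intercept, with slope < 1.
    slope, intercept = fit_line(cal_truth, cal_pred)
    return lambda pred: (pred - intercept) / slope


# Toy calibration set: predictions are shrunk toward the mean of 2.0.
truth = [0.0, 1.0, 2.0, 3.0, 4.0]
pred = [0.5 * t + 1.0 for t in truth]
correct = linear_calibration(truth, pred)
corrected = [correct(p) for p in pred]
```

In this noise-free toy case the corrected values recover `truth` exactly. With real predictions the fitted slope is estimated with error, so the correction reduces rather than eliminates the attenuation.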
[LG-142] Flow IV: Counterfactual Inference In Nonseparable Outcome Models Using Instrumental Variables
链接: https://arxiv.org/abs/2508.01321
作者: Marc Braun,Jose M. Peña,Adel Daoud
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:To reach human level intelligence, learning algorithms need to incorporate causal reasoning. But identifying causality, and particularly counterfactual reasoning, remains an elusive task. In this paper, we make progress on this task by utilizing instrumental variables (IVs). IVs are a classic tool for mitigating bias from unobserved confounders when estimating causal effects. While IV methods have been extended to non-separable structural models at the population level, existing approaches to counterfactual prediction typically assume additive noise in the outcome. In this paper, we show that under standard IV assumptions, along with the assumptions that latent noises in treatment and outcome are strictly monotonic and jointly Gaussian, the treatment-outcome relationship becomes uniquely identifiable from observed data. This enables counterfactual inference even in nonseparable models. We implement our approach by training a normalizing flow to maximize the likelihood of the observed data, demonstrating accurate recovery of the underlying outcome function. We call our method Flow IV.
[LG-143] Inferring processes within dynamic forest models using hybrid modeling
链接: https://arxiv.org/abs/2508.01228
作者: Maximilian Pichler,Yannek Käber
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 29 pages, 16 figures
Abstract:Modeling forest dynamics under novel climatic conditions requires a careful balance between process-based understanding and empirical flexibility. Dynamic Vegetation Models (DVM) represent ecological processes mechanistically, but their performance is prone to misspecified assumptions about functional forms. Inferring the structure of these processes and their functional forms correctly from data remains a major challenge because current approaches, such as plug-in estimators, have proven ineffective. We introduce Forest Informed Neural Networks (FINN), a hybrid modeling approach that combines a forest gap model with deep neural networks (DNN). FINN replaces processes with DNNs, which are then calibrated alongside the other mechanistic components in one unified step. In a case study on the Barro Colorado Island 50-ha plot we demonstrate that replacing the growth process with a DNN improves predictive performance and succession trajectories compared to a fully mechanistic version of FINN. Furthermore, we discovered that the DNN learned an ecologically plausible, improved functional form of growth, which we extracted from the DNN using explainable AI. In conclusion, our new hybrid modeling approach offers a versatile opportunity to infer forest dynamics from data and to improve forecasts of ecosystem trajectories under unprecedented environmental change.
[LG-144] Uncertainty Quantification for Large-Scale Deep Networks via Post-StoNet Modeling
Link: https://arxiv.org/abs/2508.01217
Authors: Yan Sun, Faming Liang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:Deep learning has revolutionized modern data science. However, how to accurately quantify the uncertainty of predictions from large-scale deep neural networks (DNNs) remains an unresolved issue. To address this issue, we introduce a novel post-processing approach. This approach feeds the output from the last hidden layer of a pre-trained large-scale DNN model into a stochastic neural network (StoNet), then trains the StoNet with a sparse penalty on a validation dataset and constructs prediction intervals for future observations. We establish a theoretical guarantee for the validity of this approach; in particular, the parameter estimation consistency for the sparse StoNet is essential for the success of this approach. Comprehensive experiments demonstrate that the proposed approach can construct honest confidence intervals with shorter interval lengths compared to conformal methods and achieves better calibration compared to other post-hoc calibration techniques. Additionally, we show that the StoNet formulation provides us with a platform to adapt sparse learning theory and methods from linear models to DNNs.
[LG-145] Inequalities for Optimization of Classification Algorithms: A Perspective Motivated by Diagnostic Testing
Link: https://arxiv.org/abs/2508.01065
Authors: Paul N. Patrone, Anthony J. Kearsley
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Abstract:Motivated by canonical problems in medical diagnostics, we propose and study properties of an objective function that uniformly bounds uncertainties in quantities of interest extracted from classifiers and related data analysis tools. We begin by adopting a set-theoretic perspective to show how two main tasks in diagnostics – classification and prevalence estimation – can be recast in terms of a variation on the confusion (or error) matrix $\boldsymbol{\rm P}$ typically considered in supervised learning. We then combine arguments from conditional probability with the Gershgorin circle theorem to demonstrate that the largest Gershgorin radius $\boldsymbol{\rho}_m$ of the matrix $\mathbb{I} - \boldsymbol{\rm P}$ (where $\mathbb{I}$ is the identity) yields uniform error bounds for both classification and prevalence estimation. In a two-class setting, $\boldsymbol{\rho}_m$ is minimized via a measure-theoretic "water-leveling" argument that optimizes an appropriately defined partition $U$ generating the matrix $\boldsymbol{\rm P}$. We also consider an example that illustrates the difficulty of generalizing the binary solution to a multi-class setting and deduce relevant properties of the confusion matrix.
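To make the central quantity in this abstract concrete, here is a minimal sketch that computes the largest Gershgorin radius of I - P for a small matrix. The function name `largest_gershgorin_radius`, the row-wise convention for the radii, and the example matrix are assumptions chosen for illustration; the paper's own definition of P and its optimization over partitions are not reproduced here.

```python
import numpy as np

def largest_gershgorin_radius(P):
    """Largest row-wise Gershgorin radius of I - P.

    Assumption: radii are taken over rows, i.e. the sum of the absolute
    off-diagonal entries in each row of I - P.
    """
    P = np.asarray(P, dtype=float)
    A = np.eye(P.shape[0]) - P
    # Zero out the diagonal so that only off-diagonal magnitudes remain.
    off_diag = np.abs(A) - np.diag(np.abs(np.diag(A)))
    return off_diag.sum(axis=1).max()

# Hypothetical two-class matrix (rows sum to 1); not data from the paper.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
rho_m = largest_gershgorin_radius(P)
```

In this toy case the radius is simply the largest off-diagonal row mass, so a matrix closer to the identity gives a smaller value.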
[LG-146] Re-optimization of a deep neural network model for electron-carbon scattering using new experimental data
Link: https://arxiv.org/abs/2508.00996
Authors: Beata E. Kowal, Krzysztof M. Graczyk, Artur M. Ankowski, Rwik Dharmapal Banerjee, Jose L. Bonilla, Hemant Prasad, Jan T. Sobczyk
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); Nuclear Experiment (nucl-ex); Nuclear Theory (nucl-th)
Comments: 14 pages, 12 figures
Abstract:We present an updated deep neural network model for inclusive electron-carbon scattering. Using the bootstrap model [Phys.Rev.C 110 (2024) 2, 025501] as a prior, we incorporate recent experimental data, as well as older measurements in the deep inelastic scattering region, to derive a re-optimized posterior model. We examine the impact of these new inputs on model predictions and associated uncertainties. Finally, we evaluate the resulting cross-section predictions in the kinematic range relevant to the Hyper-Kamiokande and DUNE experiments.
[LG-147] A General Approach to Visualizing Uncertainty in Statistical Graphics
Link: https://arxiv.org/abs/2508.00937
Authors: Bernarda Petek, David Nabergoj, Erik Štrumbelj
Categories: Methodology (stat.ME); Graphics (cs.GR); Machine Learning (cs.LG)
Abstract:Visualizing uncertainty is integral to data analysis, yet its application is often hindered by the need for specialized methods for quantifying and representing uncertainty for different types of graphics. We introduce a general approach that simplifies this process. The core idea is to treat the statistical graphic as a function of the underlying distribution. Instead of first calculating uncertainty metrics and then plotting them, the method propagates uncertainty through to the visualization. By repeatedly sampling from the data distribution and generating a complete statistical graphic for each sample, a distribution over graphics is produced. These graphics are aggregated pixel-by-pixel to create a single, static image. This approach is versatile, requires no specific knowledge from the user beyond how to create the basic statistical graphic, and comes with theoretical coverage guarantees for standard cases such as confidence intervals and bands. We provide a reference implementation as a Python library to demonstrate the method’s utility. Our approach not only reproduces conventional uncertainty visualizations for point estimates and regression lines but also seamlessly extends to non-standard cases, including pie charts, stacked bar charts, and tables. This approach makes uncertainty visualization more accessible to practitioners and can be a valuable tool for teaching uncertainty.
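The sample, render, and aggregate loop described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' Python library: the bootstrap stands in for "sampling from the data distribution", the graphic is a fitted regression line rasterized onto a small pixel grid, and the names `render_line` and `aggregate_graphics` are hypothetical.

```python
import numpy as np

def render_line(intercept, slope, height=40, width=60):
    """Rasterize y = intercept + slope * x for x, y in [0, 1] as a binary image."""
    img = np.zeros((height, width))
    xs = np.linspace(0.0, 1.0, width)
    ys = intercept + slope * xs
    rows = np.round((1.0 - ys) * (height - 1)).astype(int)  # row 0 is the top
    inside = (rows >= 0) & (rows < height)
    img[rows[inside], np.arange(width)[inside]] = 1.0
    return img

def aggregate_graphics(x, y, n_samples=200, seed=0):
    """Average one rendered graphic per bootstrap resample, pixel by pixel."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(render_line(0.0, 0.0))
    for _ in range(n_samples):
        idx = rng.integers(0, len(x), size=len(x))  # assumed: bootstrap resample
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        total += render_line(intercept, slope)
    return total / n_samples

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=50)
y = 0.2 + 0.5 * x + rng.normal(0.0, 0.05, size=50)
image = aggregate_graphics(x, y)
```

Each pixel of `image` holds the fraction of resampled graphics that drew on it, so the static picture shows a band that is darker where the fitted lines agree.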
[LG-148] Uni-Mol3: A Multi-Molecular Foundation Model for Advancing Organic Reaction Modeling
Link: https://arxiv.org/abs/2508.00920
Authors: Lirong Wu, Junjie Wang, Zhifeng Gao, Xiaohong Ji, Rong Zhu, Xinyu Li, Linfeng Zhang, Guolin Ke, Weinan E
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Abstract:Organic reaction, the foundation of modern chemical industry, is crucial for new material development and drug discovery. However, deciphering reaction mechanisms and modeling multi-molecular relationships remain formidable challenges due to the complexity of molecular dynamics. While several state-of-the-art models like Uni-Mol2 have revolutionized single-molecular representation learning, their extension to multi-molecular systems, where chemical reactions inherently occur, has been underexplored. This paper introduces Uni-Mol3, a novel deep learning framework that employs a hierarchical pipeline for multi-molecular reaction modeling. At its core, Uni-Mol3 adopts a multi-scale molecular tokenizer (Mol-Tokenizer) that encodes 3D structures of molecules and other features into discrete tokens, creating a 3D-aware molecular language. The framework innovatively combines two pre-training stages: molecular pre-training to learn the molecular grammars and reaction pre-training to capture fundamental reaction principles, forming a progressive learning paradigm from single- to multi-molecular systems. With prompt-aware downstream fine-tuning, Uni-Mol3 demonstrates exceptional performance in diverse organic reaction tasks and supports multi-task prediction with strong generalizability. Experimental results across 10 datasets spanning 4 downstream tasks show that Uni-Mol3 outperforms existing methods, validating its effectiveness in modeling complex organic reactions. This work not only ushers in an alternative paradigm for multi-molecular computational modeling but also charts a course for intelligent organic reaction by bridging molecular representation with reaction mechanism understanding.
[LG-149] Accelerating Fleet Upgrade Decisions with Machine-Learning Enhanced Optimization
Link: https://arxiv.org/abs/2508.00915
Authors: Kenrick Howin Chai, Stefan Hildebrand, Tobias Lachnit, Martin Benfer, Gisela Lanza, Sandra Klinge
Categories: Optimization and Control (math.OC); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Abstract:Rental-based business models and increasing sustainability requirements intensify the need for efficient strategies to manage large machine and vehicle fleet renewal and upgrades. Optimized fleet upgrade strategies maximize overall utility, cost, and sustainability. However, conventional fleet optimization does not account for upgrade options and is based on integer programming with exponential runtime scaling, which leads to substantial computational cost when dealing with large fleets and repeated decision-making processes. This contribution firstly suggests an extended integer programming approach that determines optimal renewal and upgrade decisions. The computational burden is addressed by a second, alternative machine learning-based method that transforms the task to a mixed discrete-continuous optimization problem. Both approaches are evaluated in a real-world automotive industry case study, which shows that the machine learning approach achieves near-optimal solutions with significant improvements in the scalability and overall computational performance, thus making it a practical alternative for large-scale fleet management.
Information Retrieval
[IR-0] Hubness Reduction with Dual Bank Sinkhorn Normalization for Cross-Modal Retrieval
Link: https://arxiv.org/abs/2508.02538
Authors: Zhengxin Pan, Haishuai Wang, Fangyu Wu, Peng Zhang, Jiajun Bu
Categories: Information Retrieval (cs.IR)
Comments: ACMMM 2025
Abstract:The past decade has witnessed rapid advancements in cross-modal retrieval, with significant progress made in accurately measuring the similarity between cross-modal pairs. However, the persistent hubness problem, a phenomenon where a small number of targets frequently appear as nearest neighbors to numerous queries, continues to hinder the precision of similarity measurements. Despite several proposed methods to reduce hubness, their underlying mechanisms remain poorly understood. To bridge this gap, we analyze the widely-adopted Inverted Softmax approach and demonstrate its effectiveness in balancing target probabilities during retrieval. Building on these insights, we propose a probability-balancing framework for more effective hubness reduction. We contend that balancing target probabilities alone is inadequate and, therefore, extend the framework to balance both query and target probabilities by introducing Sinkhorn Normalization (SN). Notably, we extend SN to scenarios where the true query distribution is unknown, showing that current methods, which rely solely on a query bank to estimate target hubness, produce suboptimal results due to a significant distributional gap between the query bank and targets. To mitigate this issue, we introduce Dual Bank Sinkhorn Normalization (DBSN), incorporating a corresponding target bank alongside the query bank to narrow this distributional gap. Our comprehensive evaluation across various cross-modal retrieval tasks, including image-text retrieval, video-text retrieval, and audio-text retrieval, demonstrates consistent performance improvements, validating the effectiveness of both SN and DBSN. All codes are publicly available at this https URL.
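The two balancing ideas named in this abstract can be illustrated with generic textbook versions. The sketch below is not the paper's SN or DBSN (it uses no query or target banks): `inverted_softmax` normalizes each target column over the queries, and `sinkhorn_normalize` alternates row and column scaling toward uniform marginals. The function names, the temperature, the iteration count, and the random similarity matrix are all assumptions.

```python
import numpy as np

def inverted_softmax(sim, temperature=1.0):
    """Normalize each target (column) over the queries: balances targets only."""
    logits = sim / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=0, keepdims=True)

def sinkhorn_normalize(sim, temperature=1.0, n_iters=100):
    """Alternate row and column scaling so queries and targets are both balanced."""
    kernel = np.exp((sim - sim.max()) / temperature)
    for _ in range(n_iters):
        kernel = kernel / kernel.sum(axis=1, keepdims=True)  # each query sums to 1
        kernel = kernel / kernel.sum(axis=0, keepdims=True)  # each target sums to 1
    return kernel

# Hypothetical 5 queries x 5 targets; target 0 is made an artificial "hub".
rng = np.random.default_rng(0)
sim = rng.uniform(-1.0, 1.0, size=(5, 5))
sim[:, 0] += 0.5
target_balanced = inverted_softmax(sim)
balanced = sinkhorn_normalize(sim)
```

After the Sinkhorn iterations, no single target can absorb more than its share of total probability mass, which is the intuition behind reducing hubness by balancing both sides.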
[IR-1] Beyond Chunks and Graphs: Retrieval-Augmented Generation through Triplet-Driven Thinking
Link: https://arxiv.org/abs/2508.02435
Authors: Shengbo Gong, Xianfeng Tang, Carl Yang, Wei jin
Categories: Information Retrieval (cs.IR)
Comments: 19 pages
Abstract:Retrieval-augmented generation (RAG) is critical for reducing hallucinations and incorporating external knowledge into Large Language Models (LLMs). However, advanced RAG systems face a trade-off between performance and efficiency. Multi-round RAG approaches achieve strong reasoning but incur excessive LLM calls and token costs, while Graph RAG methods suffer from computationally expensive, error-prone graph construction and retrieval redundancy. To address these challenges, we propose T$^2$RAG, a novel framework that operates on a simple, graph-free knowledge base of atomic triplets. T$^2$RAG leverages an LLM to decompose questions into searchable triplets with placeholders, which it then iteratively resolves by retrieving evidence from the triplet database. Empirical results show that T$^2$RAG significantly outperforms state-of-the-art multi-round and Graph RAG methods, achieving an average performance gain of up to 11% across six datasets while reducing retrieval costs by up to 45%. Our code is available at this https URL
[IR-2] Agentic Personalized Fashion Recommendation in the Age of Generative AI: Challenges, Opportunities, and Evaluation
Link: https://arxiv.org/abs/2508.02342
Authors: Yashar Deldjoo, Nima Rafiee, Mahdyar Ravanbakhsh
Categories: Information Retrieval (cs.IR)
Abstract:Fashion recommender systems (FaRS) face distinct challenges due to rapid trend shifts, nuanced user preferences, intricate item-item compatibility, and the complex interplay among consumers, brands, and influencers. Traditional recommendation approaches, largely static and retrieval-focused, struggle to effectively capture these dynamic elements, leading to decreased user satisfaction and elevated return rates. This paper synthesizes both academic and industrial viewpoints to map the distinctive output space and stakeholder ecosystem of modern FaRS, identifying the complex interplay among users, brands, platforms, and influencers, and highlighting the unique data and modeling challenges that arise. We outline a research agenda for industrial FaRS, centered on five representative scenarios spanning static queries, outfit composition, and multi-turn dialogue, and argue that mixed-modality refinement, the ability to combine image-based references (anchors) with nuanced textual constraints, is a particularly critical task for real-world deployment. To this end, we propose an Agentic Mixed-Modality Refinement (AMMR) pipeline, which fuses multimodal encoders with agentic LLM planners and dynamic retrieval, bridging the gap between expressive user intent and fast-changing fashion inventories. Our work shows that moving beyond static retrieval toward adaptive, generative, and stakeholder-aware systems is essential to satisfy the evolving expectations of fashion consumers and brands.
[IR-3] Research Knowledge Graphs in NFDI4DataScience: Key Activities, Achievements, and Future Directions
Link: https://arxiv.org/abs/2508.02300
Authors: Kanishka Silva, Marcel R. Ackermann, Heike Fliegl, Genet-Asefa Gesese, Fidan Limani, Philipp Mayr, Peter Mutschke, Allard Oelen, Muhammad Asif Suryani, Sharmila Upadhyaya, Benjamin Zapilko, Harald Sack, Stefan Dietze
Categories: Information Retrieval (cs.IR)
Abstract:As research in Artificial Intelligence and Data Science continues to grow in volume and complexity, it becomes increasingly difficult to ensure transparency, reproducibility, and discoverability. To address these challenges, as research artifacts should be understandable and usable by machines, the NFDI4DataScience consortium is developing and providing Research Knowledge Graphs (RKGs). Building upon earlier works, this paper presents recent progress in creating semantically rich RKGs using standardized ontologies, shared vocabularies, and automated Information Extraction techniques. Key achievements include the development of the NFDI4DS ontology, metadata standards, tools, and services designed to support the FAIR principles, as well as community-led projects and various implementations of RKGs. Together, these efforts aim to capture and connect the complex relationships between datasets, models, software, and scientific publications.
[IR-4] Voronoi Diagram Encoded Hashing
Link: https://arxiv.org/abs/2508.02266
Authors: Yang Xu, Kai Ming Ting
Categories: Information Retrieval (cs.IR)
Abstract:The goal of learning to hash (L2H) is to derive data-dependent hash functions from a given data distribution in order to map data from the input space to a binary coding space. Despite the success of L2H, two observations have cast doubt on the source of the power of L2H, i.e., learning. First, a recent study shows that even a version of locality-sensitive hashing functions without learning achieves binary representations with accuracy comparable to those of L2H, but at lower time cost. Second, existing L2H methods are constrained to only three types of hash functions: thresholding, hyperspheres, and hyperplanes. In this paper, we unveil the potential of Voronoi diagrams in hashing. The Voronoi diagram is a suitable candidate because of its three properties. This discovery has led us to propose a simple and efficient no-learning binary hashing method, called Voronoi Diagram Encoded Hashing (VDeH), which constructs a set of hash functions through a data-dependent similarity measure and produces independent binary bits through encoded hashing. We demonstrate through experiments on several benchmark datasets that VDeH achieves superior performance and lower computational cost compared to existing state-of-the-art methods under the same bit length.
[IR-5] From Generation to Consumption: Personalized List Value Estimation for Re-ranking
Link: https://arxiv.org/abs/2508.02242
Authors: Kaike Zhang, Xiaobei Wang, Xiaoyu Liu, Shuchang Liu, Hailan Yang, Xiang Li, Fei Sun, Qi Cao
Categories: Information Retrieval (cs.IR)
Abstract:Re-ranking is critical in recommender systems for optimizing the order of recommendation lists, thus improving user satisfaction and platform revenue. Most existing methods follow a generator-evaluator paradigm, where the evaluator estimates the overall value of each candidate list. However, they often ignore the fact that users may exit before consuming the full list, leading to a mismatch between estimated generation value and actual consumption value. To bridge this gap, we propose CAVE, a personalized Consumption-Aware list Value Estimation framework. CAVE formulates the list value as the expectation over sub-list values, weighted by user-specific exit probabilities at each position. The exit probability is decomposed into an interest-driven component and a stochastic component, the latter modeled via a Weibull distribution to capture random external factors such as fatigue. By jointly modeling sub-list values and user exit behavior, CAVE yields a more faithful estimate of actual list consumption value. We further contribute three large-scale real-world list-wise benchmarks from the Kuaishou platform, varying in size and user activity patterns. Extensive experiments on these benchmarks, two Amazon datasets, and online A/B testing on Kuaishou show that CAVE consistently outperforms strong baselines, highlighting the benefit of explicitly modeling user exits in re-ranking.
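The idea of valuing a list as an expectation over sub-list values, weighted by exit probabilities, can be sketched as follows. This is only an illustration of that sentence, not CAVE itself: the way the interest-driven and Weibull components are combined (the user stays only if neither cause fires), the Weibull parameters, the forced exit after the last item, and the names `weibull_exit_hazard` and `expected_list_value` are all assumptions, and the numbers are made up.

```python
import numpy as np

def weibull_exit_hazard(n_positions, scale=5.0, shape=1.5):
    """Discrete hazard of a random exit after each position, from Weibull survival."""
    k = np.arange(0, n_positions + 1)
    survival = np.exp(-(k / scale) ** shape)
    return 1.0 - survival[1:] / survival[:-1]

def expected_list_value(sublist_values, interest_exit, scale=5.0, shape=1.5):
    """Expectation of sub-list value under position-wise exit probabilities."""
    n = len(sublist_values)
    random_exit = weibull_exit_hazard(n, scale, shape)
    # Assumed combination: the user continues only if neither exit cause fires.
    exit_given_reached = 1.0 - (1.0 - np.asarray(interest_exit)) * (1.0 - random_exit)
    exit_given_reached[-1] = 1.0  # the list ends after the last item
    # Probability of reaching each position, then of exiting exactly there.
    reach = np.concatenate(([1.0], np.cumprod(1.0 - exit_given_reached[:-1])))
    exit_prob = reach * exit_given_reached
    return float(np.dot(exit_prob, sublist_values)), exit_prob

# Hypothetical inputs: value of the first k items, and interest-driven exit rates.
sublist_values = np.array([1.0, 1.8, 2.4, 2.8])
interest_exit = np.array([0.10, 0.15, 0.20, 0.25])
value, exit_prob = expected_list_value(sublist_values, interest_exit)
```

Because early exits put weight on short sub-lists, the resulting value is lower than the full-list value, which is the mismatch between generation value and consumption value that the abstract describes.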
[IR-6] Why Generate When You Can Transform? Unleashing Generative Attention for Dynamic Recommendation
Link: https://arxiv.org/abs/2508.02050
Authors: Yuli Liu, Wenjun Kong, Cheng Luo, Weizhi Ma
Categories: Information Retrieval (cs.IR)
Comments: Accepted at ACMMM 2025
Abstract:Sequential Recommendation (SR) focuses on personalizing user experiences by predicting future preferences based on historical interactions. Transformer models, with their attention mechanisms, have become the dominant architecture in SR tasks due to their ability to capture dependencies in user behavior sequences. However, traditional attention mechanisms, where attention weights are computed through query-key transformations, are inherently linear and deterministic. This fixed approach limits their ability to account for the dynamic and non-linear nature of user preferences, leading to challenges in capturing evolving interests and subtle behavioral patterns. Given that generative models excel at capturing non-linearity and probabilistic variability, we argue that generating attention distributions offers a more flexible and expressive alternative compared to traditional attention mechanisms. To support this claim, we present a theoretical proof demonstrating that generative attention mechanisms offer greater expressiveness and stochasticity than traditional deterministic approaches. Building upon this theoretical foundation, we introduce two generative attention models for SR, each grounded in the principles of Variational Autoencoders (VAE) and Diffusion Models (DMs), respectively. These models are designed specifically to generate adaptive attention distributions that better align with variable user preferences. Extensive experiments on real-world datasets show our models significantly outperform state-of-the-art in both accuracy and diversity.
[IR-7] Evaluating Position Bias in Large Language Model Recommendations
Link: https://arxiv.org/abs/2508.02020
Authors: Ethan Bito, Yongli Ren, Estrid He
Categories: Information Retrieval (cs.IR)
Abstract:Large Language Models (LLMs) are being increasingly explored as general-purpose tools for recommendation tasks, enabling zero-shot and instruction-following capabilities without the need for task-specific training. While the research community is enthusiastically embracing LLMs, there are important caveats to directly adapting them for recommendation tasks. In this paper, we show that LLM-based recommendation models suffer from position bias, where the order of candidate items in a prompt can disproportionately influence the recommendations produced by LLMs. First, we analyse the position bias of LLM-based recommendations on real-world datasets, where results uncover systemic biases of LLMs with high sensitivity to input orders. Furthermore, we introduce a new prompting strategy to mitigate the position bias of LLM recommendation models called Ranking via Iterative SElection (RISE). We compare our proposed method against various baselines on key benchmark datasets. Experiment results show that our method reduces sensitivity to input ordering and improves stability without requiring model fine-tuning or post-processing.
[IR-8] Req-Rec: Enhancing Requirements Elicitation for Increasing Stakeholders Satisfaction Using a Collaborative Filtering Based Recommender System
Link: https://arxiv.org/abs/2508.01502
Authors: Ali Fallahi, Amineh Amini, Azam Bastanfard, Hadi Saboohi
Categories: Information Retrieval (cs.IR)
Comments: March 2023, 28 pages, 7 figures
Abstract:The success or failure of a project is highly related to recognizing the right stakeholders and accurately discovering their requirements. However, choosing the proper elicitation technique has always been a considerable challenge for efficient requirements engineering. As a consequence of the swift improvement of digital technologies over the past decade, recommender systems have become an efficient channel for deeply personalized, interactive communication with stakeholders. In this research, a new method, called Req-Rec (Requirements Recommender), is proposed. It is a hybrid recommender system based on the collaborative filtering approach, with the repertory grid technique as the core component. The primary goal of Req-Rec is to increase stakeholder satisfaction by assisting them in the requirements elicitation phase. Based on the results, the method could efficiently overcome weaknesses of common requirements elicitation techniques, such as time limitations, location-based restrictions, and bias in the requirements elicitation process. Therefore, recommending related requirements assists stakeholders in becoming more aware of different aspects of the project.
[IR-9] SaviorRec: Semantic-Behavior Alignment for Cold-Start Recommendation
Link: https://arxiv.org/abs/2508.01375
Authors: Yining Yao, Ziwei Li, Shuwen Xiao, Boya Du, Jialin Zhu, Junjun Zheng, Xiangheng Kong, Yuning Jiang
Categories: Information Retrieval (cs.IR)
Abstract:In recommendation systems, predicting Click-Through Rate (CTR) is crucial for accurately matching users with items. To improve recommendation performance for cold-start and long-tail items, recent studies focus on leveraging item multimodal features to model users’ interests. However, obtaining multimodal representations for items relies on complex pre-trained encoders, which incur unacceptable computation cost when trained jointly with downstream ranking models. Therefore, it is important to maintain alignment between the semantic and behavior spaces in a lightweight way. To address these challenges, we propose a Semantic-Behavior Alignment for Cold-start Recommendation framework, which mainly focuses on utilizing multimodal representations that align with the user behavior space to predict CTR. First, we leverage domain-specific knowledge to train a multimodal encoder to generate behavior-aware semantic representations. Second, we use residual quantized semantic IDs to dynamically bridge the gap between multimodal representations and the ranking model, facilitating continuous semantic-behavior alignment. We conduct offline and online experiments on Taobao, one of the world’s largest e-commerce platforms, and achieve an increase of 0.83% in offline AUC, a 13.21% increase in clicks, and a 13.44% increase in orders in the online A/B test, emphasizing the efficacy of our method.
[IR-10] A Study on Enhancing User Engagement by Employing Gamified Recommender Systems
Link: https://arxiv.org/abs/2508.01265
Authors: Ali Fallahi, Azam Bastanfard, Amineh Amini, Hadi Saboohi
Categories: Information Retrieval (cs.IR)
Comments: June 2023, 21 pages, 6 figures
Abstract:Providing customized products and services in the modern business world is one of the most efficient solutions to improve users’ experience and their engagement with industries. To this end, recommender systems, by producing personalized recommendations, have a crucial role in the digital age. As a consequence of modern improvements in the internet and online-based technologies, the use of gamification rules has also increased in various fields. Recent studies showed that considering gamification concepts in implementing recommendation systems can not only help overcome the cold-start problem and the lack of sufficient data but also effectively improve user engagement. Gamification can motivate individuals to be more active on the system; these interactions are valuable sources of data for recommender engines. Unlike past related works on using gamified recommendation systems in different environments, or studies that surveyed gamification strategies or recommenders separately, this work provides a comprehensive review of how gamified recommender systems can enhance user engagement in various domain applications. Furthermore, a comparison of different approaches for building recommender systems is followed by an in-depth survey of gamified recommender systems, including their approaches, limitations, evaluation metrics, proposed achievements, datasets, domain areas, and recommendation techniques. This exhaustive analysis provides a detailed picture of the topic’s popularity, gaps, and unexplored regions. It is envisaged that the proposed research and the possible future directions introduced would serve as a stepping stone for researchers interested in using gamified recommender systems for user satisfaction and engagement.
[IR-11] CM3: Calibrating Multimodal Recommendation
Link: https://arxiv.org/abs/2508.01226
Authors: Xin Zhou, Yongjie Wang, Zhiqi Shen
Categories: Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: Working Paper: this https URL
Abstract:Alignment and uniformity are fundamental principles within the domain of contrastive learning. In recommender systems, prior work has established that optimizing the Bayesian Personalized Ranking (BPR) loss contributes to the objectives of alignment and uniformity. Specifically, alignment aims to draw together the representations of interacting users and items, while uniformity mandates a uniform distribution of user and item embeddings across a unit hypersphere. This study revisits the alignment and uniformity properties within the context of multimodal recommender systems, revealing a proclivity among extant models to prioritize uniformity to the detriment of alignment. Our hypothesis challenges the conventional assumption of equitable item treatment through a uniformity loss, proposing a more nuanced approach wherein items with similar multimodal attributes converge toward proximal representations within the hyperspheric manifold. Specifically, we leverage the inherent similarity between items’ multimodal data to calibrate their uniformity distribution, thereby inducing a more pronounced repulsive force between dissimilar entities within the embedding space. A theoretical analysis elucidates the relationship between this calibrated uniformity loss and the conventional uniformity function. Moreover, to enhance the fusion of multimodal features, we introduce a Spherical Bézier method designed to integrate an arbitrary number of modalities while ensuring that the resulting fused features are constrained to the same hyperspherical manifold. Empirical evaluations conducted on five real-world datasets substantiate the superiority of our approach over competing baselines. We also show that the proposed methods can achieve up to a 5.4% increase in NDCG@20 performance via the integration of MLLM-extracted features. Source code is available at: this https URL.
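For readers unfamiliar with the two properties this abstract builds on, the sketch below computes the conventional alignment and uniformity losses from contrastive learning on unit-normalized embeddings. It does not implement the paper's calibrated uniformity loss or the Spherical Bézier fusion; the function names, the parameter `t=2.0`, and the random embeddings are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x):
    """Project each row onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def alignment_loss(user_emb, item_emb):
    """Mean squared distance between embeddings of interacting (user, item) pairs."""
    u, v = l2_normalize(user_emb), l2_normalize(item_emb)
    return float(np.mean(np.sum((u - v) ** 2, axis=1)))

def uniformity_loss(emb, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs of embeddings."""
    x = l2_normalize(emb)
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(x), k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dist[upper]))))

# Synthetic embeddings: row i of `items` is the item user i interacted with.
rng = np.random.default_rng(0)
users = rng.normal(size=(16, 8))
items = users + 0.1 * rng.normal(size=(16, 8))
align = alignment_loss(users, items)
uniform = uniformity_loss(items)
```

Lower alignment means interacting pairs sit closer together, while a more negative uniformity value means the embeddings are spread more evenly over the sphere.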