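This post lists the latest papers fetched from arXiv.org on 2025-08-12. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 each day.
Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-08-12)
1004 papers were updated today, including:
- Natural Language Processing: 131 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 314 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 263 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 267 papers (Machine Learning (cs.LG))
Natural Language Processing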
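[NLP-0] Jinx: Unlimited LLMs for Probing Alignment Failures
[Quick read]: This paper addresses the lack of accessible "unlimited" (helpful-only) models for language model safety evaluation. Such models are trained without safety alignment constraints and never refuse user queries, so leading AI companies use them for red teaming and alignment evaluation. They are not available to the research community, however, which limits systematic study of safety boundaries and failure modes. The key to the solution is Jinx, a helpful-only variant of popular open-weight large language models (LLMs). Jinx responds to every query without refusals or safety filtering while preserving the base model's reasoning and instruction-following capabilities, giving researchers an accessible tool for probing alignment failures, evaluating safety boundaries, and systematically studying failure modes in language model safety.
Link: https://arxiv.org/abs/2508.08243
Authors: Jiahao Zhao,Liwei Dong
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: this https URL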
Abstract:Unlimited, or so-called helpful-only language models are trained without safety alignment constraints and never refuse user queries. They are widely used by leading AI companies as internal tools for red teaming and alignment evaluation. For example, if a safety-aligned model produces harmful outputs similar to an unlimited model, this indicates alignment failures that require further attention. Despite their essential role in assessing alignment, such models are not available to the research community. We introduce Jinx, a helpful-only variant of popular open-weight LLMs. Jinx responds to all queries without refusals or safety filtering, while preserving the base model’s capabilities in reasoning and instruction following. It provides researchers with an accessible tool for probing alignment failures, evaluating safety boundaries, and systematically studying failure modes in language model safety.
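[NLP-1] Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge
[Quick read]: This paper tackles the difficulty of evaluating the safety alignment of large language model (LLM) responses in high-risk mental health dialogues, where gold-standard answers are missing and the interactions are ethically sensitive. The key to the solution is PsyCrisis-Bench, a reference-free evaluation benchmark built on real-world Chinese mental health dialogues. It uses a prompt-based LLM-as-Judge approach that evaluates in context with expert-defined reasoning chains, checking that model responses follow psychological intervention principles. Binary point-wise scoring across multiple safety dimensions makes the evaluation more explainable and traceable, yielding more reliable and transparent safety judgments.
Link: https://arxiv.org/abs/2508.08236
Authors: Yunna Cai,Fan Wang,Haowei Wang,Kun Wang,Kailai Yang,Sophia Ananiadou,Moyan Li,Mingming Fan
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: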
Abstract:Evaluating the safety alignment of LLM responses in high-risk mental health dialogues is particularly difficult due to missing gold-standard answers and the ethically sensitive nature of these interactions. To address this challenge, we propose PsyCrisis-Bench, a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether the model responses align with the safety principles defined by experts. Specifically designed for settings without standard references, our method adopts a prompt-based LLM-as-Judge approach that conducts in-context evaluation using expert-defined reasoning chains grounded in psychological intervention principles. We employ binary point-wise scoring across multiple safety dimensions to enhance the explainability and traceability of the evaluation. Additionally, we present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress, derived from real-world online discourse. Experiments on 3600 judgments show that our method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales compared to existing approaches. Our dataset and evaluation tool are publicly available to facilitate further research.
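[NLP-2] Capabilities of GPT-5 on Multimodal Medical Reasoning
[Quick read]: This paper examines how well large language models (LLMs) can reason over the heterogeneous information that medical decision support requires, such as patient narratives, structured data, and medical images. The key to the approach is to position GPT-5 as a generalist multimodal reasoner and to evaluate its zero-shot chain-of-thought reasoning systematically on text-based and visual question answering under a unified protocol. The results show that GPT-5 outperforms all baselines (including GPT-4o) on the standardized medical benchmarks tested and, on MedXpertQA MM, surpasses pre-licensed human experts in both reasoning and understanding. On these controlled benchmarks GPT-5 moves from human-comparable to above human-expert performance, which may inform the design of future clinical decision-support systems.
Link: https://arxiv.org/abs/2508.08224
Authors: Shansong Wang,Mingzhe Hu,Qiang Li,Mojtaba Safari,Xiaofeng Yang
Affiliation: Emory University School of Medicine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: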
Abstract:Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.62% and +36.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5’s ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.
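[NLP-3] Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
[Quick read]: This paper addresses key problems in applying reinforcement learning (RL) to large language model (LLM) reasoning: the absence of standardized guidelines for using RL techniques, a fragmented understanding of their mechanisms, and conflicting conclusions caused by inconsistent experimental settings, different training data, and different model initializations. The authors rigorously reproduce and evaluate widely adopted RL techniques in isolation within a unified open-source framework, analyzing the internal mechanisms, applicable scenarios, and core principles of each. From fine-grained experiments covering datasets of varying difficulty, model sizes, and architectures, they derive clear selection guidelines and a practical roadmap. The key finding is that a minimalist combination of two techniques is enough to unlock the learning capability of critic-free policies trained with the vanilla PPO loss, and that this combination consistently outperforms strategies such as GRPO and DAPO.
Link: https://arxiv.org/abs/2508.08221
Authors: Zihe Liu,Jiashun Liu,Yancheng He,Weixun Wang,Jiaheng Liu,Ling Pan,Xinyu Hu,Shaopan Xiong,Ju Huang,Jian Hu,Shengyi Huang,Siran Yang,Jiamang Wang,Wenbo Su,Bo Zheng
Affiliation: Alibaba Group; Beijing Jiaotong University; Hong Kong University of Science and Technology; Nanjing University; Peking University; OpenRLHF; CleanRL
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 26 pages, 21 figures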
Abstract:Reinforcement learning for LLM reasoning has rapidly emerged as a prominent research area, marked by a significant surge in related studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for employing RL techniques and a fragmented understanding of their underlying mechanisms. Additionally, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and creating confusion among practitioners when selecting appropriate techniques. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments, including datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups, and provide a reliable roadmap for practitioners navigating the RL for the LLM domain. Finally, we reveal that a minimalist combination of two techniques can unlock the learning capability of critic-free policies using vanilla PPO loss. The results demonstrate that our simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.
[NLP-4] SAEMark: Multi-bit LLM Watermarking with Inference-Time Scaling
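[Quick read]: This paper addresses watermarking of text generated by large language models (LLMs) for content attribution and misinformation prevention. Existing methods tend to degrade text quality and require white-box model access and logit manipulation, which rules out API-based models and multilingual scenarios. The key to the solution is SAEMark, a framework for post-hoc multi-bit watermarking that works through inference-time, feature-based rejection sampling. It does not modify model logits or require training; it relies only on deterministic features extracted from generated text and selects outputs whose feature statistics match key-derived targets. The method generalizes naturally across languages and domains while preserving output quality, and it comes with theoretical guarantees relating watermark success probability to compute budget. Experiments across several datasets show high detection accuracy (for example, 99.7% F1 on English) and strong text quality, giving closed-source models an out-of-the-box watermarking capability.
Link: https://arxiv.org/abs/2508.08211
Authors: Zhuohao Yu,Xingru Jiang,Weizheng Gu,Yidong Wang,Shikun Zhang,Wei Ye
Affiliation: Peking University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 12 figures, code available: this https URL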
Abstract:Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods compromise text quality, require white-box model access and logit manipulation. These limitations exclude API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. This framework naturally generalizes across languages and domains while preserving text quality through sampling LLM outputs instead of modifying. We provide theoretical guarantees relating watermark success probability and compute budget that hold for any suitable feature extractor. Empirically, we demonstrate the framework’s effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark’s consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out-of-the-box with closed-source LLMs while enabling content attribution.
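The selection step described above, picking the output whose feature value best matches a key-derived target, can be sketched in a few lines. This is only a toy illustration under loud assumptions: the names `feature`, `target_from_key`, and `select_watermarked` are hypothetical, and a SHA-256 hash stands in for the Sparse Autoencoder features the paper actually uses (a hash is not robust to edits, so this shows the control flow only, not the authors' method).

```python
import hashlib


def feature(text: str) -> float:
    """Hypothetical deterministic feature in [0, 1); a hash stands in for an SAE feature."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def target_from_key(key: str, message: str, position: int) -> float:
    """Derive a pseudo-random target in [0, 1) from the secret key, message, and segment index."""
    return feature(f"{key}|{message}|{position}")


def select_watermarked(candidates, key, message, position):
    """Rejection-sampling step: keep the candidate whose feature is closest to the target."""
    target = target_from_key(key, message, position)
    return min(candidates, key=lambda c: abs(feature(c) - target))
```

Under this reading, drawing more candidates per segment costs more compute but can only bring the best candidate closer to the target, which is the kind of success-versus-budget trade-off the abstract refers to.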
[NLP-5] Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models
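[Quick read]: This paper studies how closely the inference-time uncertainty estimates of large language models (LLMs) align with human uncertainty, with the aim of improving model control and modulating user trust. It systematically evaluates a collection of inference-time uncertainty measures, using both established metrics and novel variations. Several measures show strong alignment with human group-level uncertainty even though they do not align with human answer preference, and those same measures show moderate to strong evidence of model calibration in terms of both correctness correlation and distributional analysis.
Link: https://arxiv.org/abs/2508.08204
Authors: Kyle Moore,Jesse Roberts,Daryl Watson
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint, under review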
Abstract:There has been much recent interest in evaluating large language models for uncertainty calibration to facilitate model control and modulate user trust. Inference time uncertainty, which may provide a real-time signal to the model or external control modules, is particularly important for applying these concepts to improve LLM-user experience in practice. While many of the existing papers consider model calibration, comparatively little work has sought to evaluate how closely model uncertainty aligns to human uncertainty. In this work, we evaluate a collection of inference-time uncertainty measures, using both established metrics and novel variations, to determine how closely they align with both human group-level uncertainty and traditional notions of model calibration. We find that numerous measures show evidence of strong alignment to human uncertainty, even despite the lack of alignment to human answer preference. For those successful metrics, we find moderate to strong evidence of model calibration in terms of both correctness correlation and distributional analysis.
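As background for the calibration notion mentioned above, here is a minimal sketch of expected calibration error (ECE), one common way to quantify how well confidence tracks correctness. It is an assumption that this metric is relevant here: the abstract only names "correctness correlation and distributional analysis", so the function below is a generic illustration, not the paper's measure.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and average |confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece
```

A value of zero means confidence matches empirical accuracy in every occupied bin; larger values indicate over- or under-confidence.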
[NLP-6] Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions
[Quick read]: This paper addresses the engineering challenges of running speculative decoding for large language models (LLMs) efficiently in production. The key contribution is a set of training and inference optimizations that enable EAGLE-based speculative decoding for Llama models at production scale. With them, Llama4 Maverick decodes at about 4 ms per token at batch size one on 8 NVIDIA H100 GPUs, 10% faster than the previously best known method, and the speed-up at large batch sizes is between 1.4x and 2.0x.
Link: https://arxiv.org/abs/2508.08192
Authors: Bangsheng Tang,Carl Chengyan Fu,Fei Kou,Grigory Sizov,Haoci Zhang,Jason Park,Jiawen Liu,Jie You,Qirui Yang,Sachin Mehta,Shengyong Cai,Xiaodong Wang,Xingyu Liu,Yunlu Li,Yanjun Zhou,Wei Wei,Zhiwei Zhao,Zixi Qi,Adolfo Victoria,Aya Ibrahim,Bram Wasti,Changkyu Kim,Daniel Haziza,Fei Sun,Giancarlo Delfin,Emily Guo,Jialin Ouyang,Jaewon Lee,Jianyu Huang,Jeremy Reizenstein,Lu Fang,Quinn Zhu,Ria Verma,Vlad Mihailescu,Xingwen Guo,Yan Cui,Ye Hu,Yejin Lee
Affiliation: Meta
Categories: Computation and Language (cs.CL)
Comments: 15 pages
Abstract:Speculative decoding is a standard method for accelerating the inference speed of large language models. However, scaling it for production environments poses several engineering challenges, including efficiently implementing different operations (e.g., tree attention and multi-round speculative decoding) on GPU. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at a production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama4 Maverick decodes at a speed of about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the previously best known method. Furthermore, for EAGLE-based speculative decoding, our optimizations enable us to achieve a speed-up for large batch sizes between 1.4x and 2.0x at production scale.
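For readers new to the technique, the basic draft-then-verify loop behind speculative decoding can be sketched as follows. This is a simplified greedy, single-chain variant with toy callables; it is not EAGLE and does not include tree attention or any of the paper's optimizations. `draft_next` and `target_next` are hypothetical stand-ins for a small draft model and the large target model.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One draft-then-verify step with greedy decoding; returns the accepted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target model checks every proposed position. A real system does
    #    these checks in one batched forward pass; a loop is used here for clarity.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # fix the first mismatch, discard the rest
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target adds one extra token for free.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix, draft_next, target_next, n_tokens, k=4):
    """Generate n_tokens after prefix; output equals plain greedy decoding with the target."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + n_tokens]
```

Because every accepted token is one the target model would have produced itself, the output is unchanged; the speed-up comes from verifying several positions per target pass when the draft guesses well.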
[NLP-7] LPI-RIT at LeWiDi-2025: Improving Distributional Predictions via Metadata and Loss Reweighting with DisCo
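[Quick read]: This paper addresses the modeling of annotator disagreement, that is, how to capture differences among annotators more accurately through soft label distribution prediction and perspectivist evaluation. The key to the solution is a set of improvements to the DisCo (Distribution from Context) neural architecture: incorporating annotator metadata, enhancing input representations, and modifying the loss functions to capture disagreement patterns better. Together these yield substantial gains on both soft-label and perspectivist evaluation metrics across three datasets.
Link: https://arxiv.org/abs/2508.08163
Authors: Mandira Sawkar,Samay U. Shetty,Deepak Pandita,Tharindu Cyril Weerasooriya,Christopher M. Homan
Affiliation: Rochester Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: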
Abstract:The Learning With Disagreements (LeWiDi) 2025 shared task is to model annotator disagreement through soft label distribution prediction and perspectivist evaluation, modeling annotators. We adapt DisCo (Distribution from Context), a neural architecture that jointly models item-level and annotator-level label distributions, and present detailed analysis and improvements. In this paper, we extend the DisCo by incorporating annotator metadata, enhancing input representations, and modifying the loss functions to capture disagreement patterns better. Through extensive experiments, we demonstrate substantial improvements in both soft and perspectivist evaluation metrics across three datasets. We also conduct in-depth error and calibration analyses, highlighting the conditions under which improvements occur. Our findings underscore the value of disagreement-aware modeling and offer insights into how system components interact with the complexity of human-annotated data.
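[NLP-8] REX-RAG: Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation
[Quick read]: This paper addresses a problem that arises when reinforcement learning (RL) is combined with retrieval-augmented generation (RAG): during policy-driven trajectory sampling, large language models (LLMs) often get trapped in "dead ends", unproductive reasoning paths that end in overconfident but incorrect conclusions, which hampers exploration and undermines policy optimization. The key to the solution is the REX-RAG framework, with two core innovations: (1) a Mixed Sampling Strategy, which combines a probe sampling method with exploratory prompts to escape dead ends; and (2) a Policy Correction Mechanism, which uses importance sampling to correct the distribution shift introduced by mixed sampling, mitigating gradient estimation bias and keeping policy learning stable and accurate.
Link: https://arxiv.org/abs/2508.08149
Authors: Wentao Jiang,Xiang Feng,Zengmao Wang,Yong Luo,Pingbo Xu,Zhe Chen,Bo Du,Jing Zhang
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 4 figures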
Abstract:Reinforcement learning (RL) is emerging as a powerful paradigm for enabling large language models (LLMs) to perform complex reasoning tasks. Recent advances indicate that integrating RL with retrieval-augmented generation (RAG) allows LLMs to dynamically incorporate external knowledge, leading to more informed and robust decision making. However, we identify a critical challenge during policy-driven trajectory sampling: LLMs are frequently trapped in unproductive reasoning paths, which we refer to as “dead ends”, committing to overconfident yet incorrect conclusions. This severely hampers exploration and undermines effective policy optimization. To address this challenge, we propose REX-RAG (Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation), a novel framework that explores alternative reasoning paths while maintaining rigorous policy learning through principled distributional corrections. Our approach introduces two key innovations: (1) Mixed Sampling Strategy, which combines a novel probe sampling method with exploratory prompts to escape dead ends; and (2) Policy Correction Mechanism, which employs importance sampling to correct distribution shifts induced by mixed sampling, thereby mitigating gradient estimation bias. We evaluate it on seven question-answering benchmarks, and the experimental results show that REX-RAG achieves average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B over strong baselines, demonstrating competitive results across multiple datasets. The code is publicly available at this https URL.
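The role of importance sampling in correcting a mixed sampling distribution can be shown on a tiny discrete example. This sketch assumes the behaviour distribution is a simple mixture of the policy and a probe distribution with mixing weight `alpha`; the function names and that mixture form are illustrative assumptions, not the paper's exact formulation.

```python
def mixture(pi, probe, alpha):
    """Behaviour distribution: sample from the probe with probability alpha, else from pi."""
    support = set(pi) | set(probe)
    return {x: (1 - alpha) * pi.get(x, 0.0) + alpha * probe.get(x, 0.0) for x in support}


def importance_weights(pi, mu):
    """w(x) = pi(x) / mu(x) for every outcome the behaviour distribution can produce."""
    return {x: pi.get(x, 0.0) / p for x, p in mu.items() if p > 0}


def corrected_expectation(reward, pi, probe, alpha):
    """E_{x~mu}[w(x) * reward(x)], which equals E_{x~pi}[reward(x)] when alpha < 1."""
    mu = mixture(pi, probe, alpha)
    w = importance_weights(pi, mu)
    return sum(mu[x] * w[x] * reward[x] for x in w)
```

Without the weights, rewards from probe-only outcomes would be counted as if the policy had produced them, which is the kind of bias the correction is meant to remove.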
[NLP-9] Data-Efficient Biomedical In-Context Learning: A Diversity-Enhanced Submodular Perspective
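[Quick read]: This paper addresses a performance bottleneck in biomedical in-context learning (ICL) with large language models (LLMs): demonstration selection strategies emphasize representativeness and neglect diversity, so examples drawn from large corpora can limit generalization. The authors propose Dual-Div, whose core idea is a two-stage retrieval and ranking process. The first stage picks a limited set of candidate examples from the corpus by jointly optimizing representativeness and diversity (with optional annotation for unlabeled data); the second stage ranks these candidates against the test query to select the most relevant and non-redundant demonstrations. On named entity recognition (NER), relation extraction (RE), and text classification (TC), the method consistently outperforms baselines, improving macro-F1 by up to 5%, and is robust to prompt permutations and class imbalance. The key findings are that diversity in the initial retrieval stage matters more than ranking-stage optimization, and that limiting demonstrations to 3-5 examples gives the best balance of performance and efficiency.
Link: https://arxiv.org/abs/2508.08140
Authors: Jun Wang,Zaifu Zhan,Qixin Zhang,Mingquan Lin,Meijia Song,Rui Zhang
Affiliation: University of Minnesota; Nanyang Technological University
Categories: Computation and Language (cs.CL)
Comments: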
Abstract:Recent progress in large language models (LLMs) has leveraged their in-context learning (ICL) abilities to enable quick adaptation to unseen biomedical NLP tasks. By incorporating only a few input-output examples into prompts, LLMs can rapidly perform these new tasks. While the impact of these demonstrations on LLM performance has been extensively studied, most existing approaches prioritize representativeness over diversity when selecting examples from large corpora. To address this gap, we propose Dual-Div, a diversity-enhanced data-efficient framework for demonstration selection in biomedical ICL. Dual-Div employs a two-stage retrieval and ranking process: First, it identifies a limited set of candidate examples from a corpus by optimizing both representativeness and diversity (with optional annotation for unlabeled data). Second, it ranks these candidates against test queries to select the most relevant and non-redundant demonstrations. Evaluated on three biomedical NLP tasks (named entity recognition (NER), relation extraction (RE), and text classification (TC)) using LLaMA 3.1 and Qwen 2.5 for inference, along with three retrievers (BGE-Large, BMRetriever, MedCPT), Dual-Div consistently outperforms baselines-achieving up to 5% higher macro-F1 scores-while demonstrating robustness to prompt permutations and class imbalance. Our findings establish that diversity in initial retrieval is more critical than ranking-stage optimization, and limiting demonstrations to 3-5 examples maximizes performance efficiency.
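One simple way to trade representativeness against diversity when picking candidate demonstrations is a greedy, maximal-marginal-relevance style rule, sketched below. The scoring rule, the `lam` weight, and the function names are assumptions made for illustration; the paper's actual submodular objective is not given in the abstract and may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_diverse(embeddings, k, lam=0.5):
    """Greedily pick k indices, rewarding representativeness and penalising redundancy."""
    n = len(embeddings)
    sim = [[cosine(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    # Representativeness: average similarity of an item to the whole corpus.
    rep = [sum(row) / n for row in sim]

    chosen = []
    while len(chosen) < min(k, n):
        def score(i):
            # Redundancy: similarity to the closest item already selected.
            redundancy = max((sim[i][j] for j in chosen), default=0.0)
            return lam * rep[i] - (1 - lam) * redundancy

        best = max((i for i in range(n) if i not in chosen), key=score)
        chosen.append(best)
    return chosen
```

With `lam=1.0` the rule degenerates to picking the most central items, which tend to be near-duplicates; lowering `lam` pushes the selection toward examples that cover different regions of the corpus.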
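[NLP-10] Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models
[Quick read]: This paper addresses the reliability problem caused by large language models (LLMs) producing fluent but incorrect content (confabulation) in multi-turn or agentic applications. The core challenge is that models struggle to recognize their own unreliable outputs, and that existing uncertainty signals are misaligned with actual correctness. The key to the solution is a reliability estimation method based on token-level uncertainty: it computes aleatoric and epistemic uncertainty from the output logits, identifies salient tokens, and aggregates their hidden states into a representation for response-level reliability prediction. The method improves detection of unreliable outputs across multiple open-source LLMs, exposes the limits of using uncertainty signals directly, and shows the potential of uncertainty-guided probing for reliability-aware generation.
Link: https://arxiv.org/abs/2508.08139
Authors: Tianyi Zhou,Johanne Medina,Sanjay Chawla
Affiliation: KTH Royal Institute of Technology; QCRI, HBKU
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: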
Abstract:Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation, which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses. We propose a reliability estimation that leverages token-level uncertainty to guide the aggregation of internal model representations. Specifically, we compute aleatoric and epistemic uncertainty from output logits to identify salient tokens and aggregate their hidden states into compact representations for response-level reliability prediction. Through controlled experiments on open QA benchmarks, we find that correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses, revealing a misalignment between uncertainty and correctness. Our probing-based method captures these shifts in model behavior and improves the detection of unreliable outputs across multiple open-source LLMs. These results underscore the limitations of direct uncertainty signals and highlight the potential of uncertainty-guided probing for reliability-aware generation.
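The idea of using token-level uncertainty to decide which hidden states to aggregate can be sketched as below. As a loudly labeled simplification, plain softmax entropy stands in for the aleatoric and epistemic uncertainty the paper computes (the abstract does not give that decomposition), and `salient_pool` with its mean pooling is a hypothetical helper, not the authors' aggregation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)


def salient_pool(step_logits, hidden_states, top_k=2):
    """Mean-pool the hidden states of the top_k most uncertain token positions."""
    entropies = [token_entropy(logits) for logits in step_logits]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    idx = ranked[:top_k]
    dim = len(hidden_states[0])
    pooled = [sum(hidden_states[i][d] for i in idx) / len(idx) for d in range(dim)]
    return sorted(idx), pooled
```

The pooled vector would then be fed to a small probe (for example a linear classifier) trained to predict whether the full response is reliable.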
[NLP-11] Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models
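[Quick read]: This paper addresses the weak cross-dataset generalization of Spoken Language Models (SLMs), whose core challenge is the modality gap between speech and text representations. SLMs may reach strong in-domain performance by exploiting unintended variation in speech embeddings, which hinders generalization. The authors propose Optimal Transport Regularization (OTReg), which formulates speech-text embedding alignment as an optimal transport problem. In each training iteration it computes the optimal transport plan to establish a structured correspondence between speech and transcript embeddings, then adds a regularization loss based on that plan so that the SLM produces speech embeddings closer to the transcript embeddings. OTReg needs no additional labels or learnable parameters, is lightweight, and integrates easily into existing SLM training pipelines. Multilingual ASR experiments show that it improves speech-text alignment and cross-dataset generalization.
Link: https://arxiv.org/abs/2508.08131
Authors: Wenze Xu,Chun Wang,Jiazhen Yu,Sheng Chen,Liang Gao,Weihong Deng
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be presented at ACPR 2025 Conference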
Abstract:Spoken Language Models (SLMs), which extend Large Language Models (LLMs) to perceive speech inputs, have gained increasing attention for their potential to advance speech understanding tasks. However, despite recent progress, studies show that SLMs often struggle to generalize across datasets, even for trained languages and tasks, raising concerns about whether they process speech in a text-like manner as intended. A key challenge underlying this limitation is the modality gap between speech and text representations. The high variability in speech embeddings may allow SLMs to achieve strong in-domain performance by exploiting unintended speech variations, ultimately hindering generalization. To mitigate this modality gap, we introduce Optimal Transport Regularization (OTReg), a method that formulates speech-text alignment as an optimal transport problem and derives a regularization loss to improve SLM training. In each training iteration, OTReg first establishes a structured correspondence between speech and transcript embeddings by determining the optimal transport plan, then incorporates the regularization loss based on this transport plan to optimize SLMs in generating speech embeddings that align more effectively with transcript embeddings. OTReg is lightweight, requiring no additional labels or learnable parameters, and integrates seamlessly into existing SLM training procedures. Extensive multilingual ASR experiments demonstrate that OTReg enhances speech-text alignment, mitigates the modality gap, and consequently improves SLM generalization across diverse datasets.
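To make the "transport plan, then regularization loss" idea concrete, here is a small pure-Python sketch using entropy-regularized optimal transport (Sinkhorn iterations) between two sets of embeddings. The squared-Euclidean cost, uniform marginals, Sinkhorn solver, and the `eps` value are all assumptions chosen for illustration; the abstract does not say which cost or solver OTReg uses.

```python
import math


def sq_dist_matrix(xs, ys):
    """Pairwise squared Euclidean distances between two lists of vectors."""
    return [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in ys] for x in xs]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularised transport plan between uniform marginals."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_loss(cost, plan):
    """Transport cost under the plan; usable as an alignment regulariser."""
    return sum(plan[i][j] * cost[i][j] for i in range(len(cost)) for j in range(len(cost[0])))
```

In a training setup the loss would be computed on differentiable tensors and added to the main objective; very small `eps` values or large costs can underflow in this naive form, so practical implementations usually work in the log domain.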
[NLP-12] Czech Dataset for Complex Aspect-Based Sentiment Analysis Tasks COLING2024 LREC ACL
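[Quick read]: This paper addresses the lack of a unified annotation format and of support for complex tasks in Czech aspect-based sentiment analysis (ABSA). The existing dataset only carries separate labels for basic tasks such as aspect term extraction or aspect polarity detection, which is not enough for advanced tasks such as target-aspect-category detection. The key to the solution is a new Czech dataset designed for complex ABSA tasks. It follows the SemEval-2016 annotation format, which links sentiment elements (labels) together and makes cross-lingual application and comparison straightforward. Two trained annotators reached an inter-annotator agreement of approximately 90%.
Link: https://arxiv.org/abs/2508.08125
Authors: Jakub Šmíd,Pavel Přibáň,Ondřej Pražák,Pavel Král
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: Published In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Official version: this https URL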
Abstract:In this paper, we introduce a novel Czech dataset for aspect-based sentiment analysis (ABSA), which consists of 3.1K manually annotated reviews from the restaurant domain. The dataset is built upon the older Czech dataset, which contained only separate labels for the basic ABSA tasks such as aspect term extraction or aspect polarity detection. Unlike its predecessor, our new dataset is specifically designed for more complex tasks, e.g. target-aspect-category detection. These advanced tasks require a unified annotation format, seamlessly linking sentiment elements (labels) together. Our dataset follows the format of the well-known SemEval-2016 datasets. This design choice allows effortless application and evaluation in cross-lingual scenarios, ultimately fostering cross-language comparisons with equivalent counterpart datasets in other languages. The annotation process engaged two trained annotators, yielding an impressive inter-annotator agreement rate of approximately 90%. Additionally, we provide 24M reviews without annotations suitable for unsupervised learning. We present robust monolingual baseline results achieved with various Transformer-based models and insightful error analysis to supplement our contributions. Our code and dataset are freely available for non-commercial research purposes.
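[NLP-13] Iterative refinement, not training objective, makes HuBERT behave differently from wav2vec 2.0 INTERSPEECH2025
[Quick read]: This paper asks how architectural differences between self-supervised speech representation models (HuBERT and wav2vec 2.0) affect the linguistic information they encode, in particular how the training objective and iterative pseudo-label refinement relate to word identity, phoneme identity, and speaker identity. The approach is a minimal comparison that separates the two architectural variables: training objective, and iterative pseudo-label refinement through multiple training iterations. The finding is that differences in the canonical correlation of hidden representations with word, phoneme, and speaker identity are explained by training iteration rather than training objective, which points to iterative refinement as the factor that matters for encoding linguistic information.
Link: https://arxiv.org/abs/2508.08110
Authors: Robin Huo,Ewan Dunbar
Affiliation: University of Toronto
Categories: Computation and Language (cs.CL)
Comments: Proceedings of Interspeech 2025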
Abstract:Self-supervised models for speech representation learning now see widespread use for their versatility and performance on downstream tasks, but the effect of model architecture on the linguistic information learned in their representations remains under-studied. This study investigates two such models, HuBERT and wav2vec 2.0, and minimally compares two of their architectural differences: training objective and iterative pseudo-label refinement through multiple training iterations. We find that differences in canonical correlation of hidden representations to word identity, phoneme identity, and speaker identity are explained by training iteration, not training objective. We suggest that future work investigate the reason for the effectiveness of iterative refinement in encoding linguistic information in self-supervised speech representations.
[NLP-14] Assessing LLM Text Detection in Educational Contexts: Does Human Contribution Affect Detection?
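[Quick read]: This paper addresses the academic integrity challenge raised by widespread use of generative AI in education, namely how to detect automatically the content in student work that was generated by large language models (LLMs). It benchmarks state-of-the-art detectors on a new dataset, Generative Essay Detection in Education (GEDE), which contains over 900 student-written essays and over 12,500 LLM-generated essays from various domains. It also introduces the concept of "contribution levels" to describe how much the student contributed to an assignment, ranging from purely human-written text, to slightly LLM-improved versions, to fully LLM-generated text, and finally to active attacks on the detector that "humanize" generated text. The study finds that most detectors struggle with intermediate contribution levels (such as LLM-improved human writing) and are particularly prone to false positives, which is a risk in educational settings where false suspicion can seriously affect students. The work provides a benchmark and empirical evidence for developing more robust detectors for education.
Link: https://arxiv.org/abs/2508.08096
Authors: Lukas Gehring,Benjamin Paaßen
Affiliation: Bielefeld University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint as provided by the authors (19 pages, 12 figures, 9 tables)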
Abstract:Recent advancements in Large Language Models (LLMs) and their increased accessibility have made it easier than ever for students to automatically generate texts, posing new challenges for educational institutions. To enforce norms of academic integrity and ensure students’ learning, learning analytics methods to automatically detect LLM-generated text appear increasingly appealing. This paper benchmarks the performance of different state-of-the-art detectors in educational contexts, introducing a novel dataset, called Generative Essay Detection in Education (GEDE), containing over 900 student-written essays and over 12,500 LLM-generated essays from various domains. To capture the diversity of LLM usage practices in generating text, we propose the concept of contribution levels, representing students’ contribution to a given assignment. These levels range from purely human-written texts, to slightly LLM-improved versions, to fully LLM-generated texts, and finally to active attacks on the detector by “humanizing” generated texts. We show that most detectors struggle to accurately classify texts of intermediate student contribution levels, like LLM-improved human-written texts. Detectors are particularly likely to produce false positives, which is problematic in educational settings where false suspicions can severely impact students’ lives. Our dataset, code, and additional supplementary materials are publicly available at this https URL.
[NLP-15] Dual Information Speech Language Models for Emotional Conversations ICME2025
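[Quick read]: This paper addresses the fact that conversational systems built on text-based large language models (LLMs) overlook paralinguistic cues in speech, which hurts their understanding of emotions and intentions. Speech-language models (SLMs) built by extending frozen LLMs struggle to capture paralinguistic information and show reduced context understanding. The authors identify entangled information and improper training strategies as the key issues. Their solution proposes two heterogeneous adapters and a weakly supervised training strategy that disentangle paralinguistic and linguistic information, so that SLMs can interpret speech through structured representations. Controlled randomness avoids generating task-specific vectors and thereby preserves contextual understanding. Only the adapters are trained, on common datasets, which keeps the approach parameter-efficient and data-efficient, and experiments show competitive performance on emotional conversation tasks.
Link: https://arxiv.org/abs/2508.08095
Authors: Chun Wang,Chenyang Liu,Wenze Xu,Weihong Deng
Affiliation: Mashang Consumer Finance Co., Ltd., Chongqing, China; The University of Sydney, Sydney, Australia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Presented at IEEE ICME 2025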
Abstract:Conversational systems relying on text-based large language models (LLMs) often overlook paralinguistic cues, essential for understanding emotions and intentions. Speech-language models (SLMs), which use speech as input, are emerging as a promising solution. However, SLMs built by extending frozen LLMs struggle to capture paralinguistic information and exhibit reduced context understanding. We identify entangled information and improper training strategies as key issues. To address these issues, we propose two heterogeneous adapters and suggest a weakly supervised training strategy. Our approach disentangles paralinguistic and linguistic information, enabling SLMs to interpret speech through structured representations. It also preserves contextual understanding by avoiding the generation of task-specific vectors through controlled randomness. This approach trains only the adapters on common datasets, ensuring parameter and data efficiency. Experiments demonstrate competitive performance in emotional conversation tasks, showcasing the model’s ability to effectively integrate both paralinguistic and linguistic information within contextual settings.
[NLP-16] HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches
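Quick read: This paper addresses multi-source knowledge retrieval in private enterprise deep search systems: how to combine local and Web data for complex reasoning tasks while avoiding direct answer copying and error propagation. The key to the solution is HierSearch, an agentic framework trained with hierarchical reinforcement learning (RL). At the low level, a local deep search agent and a Web deep search agent are trained separately to retrieve evidence from their own domains. At the high level, a planner agent coordinates the low-level agents and produces the final answer. A knowledge refiner additionally filters out hallucinations and irrelevant evidence returned by the low-level agents, improving retrieval accuracy and the robustness of reasoning.
Link: https://arxiv.org/abs/2508.08088
Authors: Jiejun Tan,Zhicheng Dou,Yan Yu,Jiehan Cheng,Qiang Ju,Jian Xie,Ji-Rong Wen
Affiliations: Baichuan; Alibaba Cloud
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code and datasets are available at this https URL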
Abstract:Recently, large reasoning models have demonstrated strong mathematical and coding abilities, and deep search leverages their reasoning capabilities in challenging information retrieval tasks. Existing deep search works are generally limited to a single knowledge source, either local or the Web. However, enterprises often require private deep search systems that can leverage search tools over both local and the Web corpus. Simply training an agent equipped with multiple search tools using flat reinforcement learning (RL) is a straightforward idea, but it has problems such as low training data efficiency and poor mastery of complex tools. To address the above issue, we propose a hierarchical agentic deep search framework, HierSearch, trained with hierarchical RL. At the low level, a local deep search agent and a Web deep search agent are trained to retrieve evidence from their corresponding domains. At the high level, a planner agent coordinates low-level agents and provides the final answer. Moreover, to prevent direct answer copying and error propagation, we design a knowledge refiner that filters out hallucinations and irrelevant evidence returned by low-level agents. Experiments show that HierSearch achieves better performance compared to flat RL, and outperforms various deep search and multi-source retrieval-augmented generation baselines in six benchmarks across general, finance, and medical domains.
[NLP-17] Investigating the Design Space of Visual Grounding in Multimodal Large Language Model
Quick read: This paper aims to improve the performance of Multimodal Large Language Models (MLLMs) on the fine-grained visual grounding (VG) task. Existing approaches perform well, but they make disparate design choices when fine-tuning MLLMs for VG and lack systematic verification of those choices. The key to the solution is a comprehensive analysis along two design dimensions: first, comparing different visual grounding paradigms to identify the most effective design; second, running ablation studies on the design of grounding data to optimize fine-tuning. The resulting model, built on LLaVA-1.5, improves over LLaVA-1.5 by +5.6% / +6.9% / +7.0% on RefCOCO/+/g.
Link: https://arxiv.org/abs/2508.08066
Authors: Weitai Kang,Weiming Zhuang,Zhizhong Li,Yan Yan,Lingjuan Lyu
Affiliations: University of Illinois Chicago; Sony AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 8 pages for the main paper
Abstract:Fine-grained multimodal capability in Multimodal Large Language Models (MLLMs) has emerged as a critical research direction, particularly for tackling the visual grounding (VG) problem. Despite the strong performance achieved by existing approaches, they often employ disparate design choices when fine-tuning MLLMs for VG, lacking systematic verification to support these designs. To bridge this gap, this paper presents a comprehensive study of various design choices that impact the VG performance of MLLMs. We conduct our analysis using LLaVA-1.5, which has been widely adopted in prior empirical studies of MLLMs. While more recent models exist, we follow this convention to ensure our findings remain broadly applicable and extendable to other architectures. We cover two key aspects: (1) exploring different visual grounding paradigms in MLLMs, identifying the most effective design, and providing our insights; and (2) conducting ablation studies on the design of grounding data to optimize MLLMs’ fine-tuning for the VG task. Finally, our findings contribute to a stronger MLLM for VG, achieving improvements of +5.6% / +6.9% / +7.0% on RefCOCO/+/g over the LLaVA-1.5.
[NLP-18] From Source to Target: Leveraging Transfer Learning for Predictive Process Monitoring in Organizations
Quick read: This paper addresses the difficulty of deploying predictive process monitoring (PPM) when an organization lacks sufficient event data or other relevant resources. The core of the solution is a transfer learning-based PPM technique: knowledge from one business process, such as a pre-trained model, is transferred to a similar target process for which data is scarce, enabling effective predictive decision support. Experiments on event logs of IT service management processes suggest that knowledge can be transferred both within one organization and across organizations, reducing the dependence on large local event logs.
Link: https://arxiv.org/abs/2508.08061
Authors: Sven Weinzierl,Sandra Zilker,Annina Liessmann,Martin Käppel,Weixin Wang,Martin Matzner
Affiliations: Friedrich-Alexander-Universität Erlangen-Nürnberg; Technische Hochschule Nürnberg Georg Simon Ohm
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Databases (cs.DB)
Comments:
Abstract:Event logs reflect the behavior of business processes that are mapped in organizational information systems. Predictive process monitoring (PPM) transforms these data into value by creating process-related predictions that provide the insights required for proactive interventions at process runtime. Existing PPM techniques require sufficient amounts of event data or other relevant resources that might not be readily available, preventing some organizations from utilizing PPM. The transfer learning-based PPM technique presented in this paper allows organizations without suitable event data or other relevant resources to implement PPM for effective decision support. The technique is instantiated in two real-life use cases, based on which numerical experiments are performed using event logs for IT service management processes in an intra- and inter-organizational setting. The results of the experiments suggest that knowledge of one business process can be transferred to a similar business process in the same or a different organization to enable effective PPM in the target context. With the proposed technique, organizations can benefit from transfer learning in an intra- and inter-organizational setting, where resources like pre-trained models are transferred within and across organizational boundaries.
[NLP-19] 9th Workshop on Sign Language Translation and Avatar Technologies (SLTAT 2025)
Quick read: This summary paper concerns improving communication between deaf and hearing people through non-invasive means, centered on digital humans used either as virtual interpreters or as interactive conversational agents. The workshop's contributions go beyond avatar technologies to cover sign language recognition, data collection and analysis, tools, ethics, usability, and affective computing. Hosting the workshop at the International Conference on Intelligent Virtual Agents (IVA) is intended to foster exchange between the two research communities.
Link: https://arxiv.org/abs/2508.08050
Authors: Fabrizio Nunnari,Cristina Luna Jiménez,Rosalee Wolfe,John C. McDonald,Michael Filhol,Eleni Efthimiou,Evita Fotinea,Thomas Hanke
Affiliations: German Research Center for Artificial Intelligence (DFKI); University of Augsburg; Institute for Language and Speech Processing, Athena RC; DePaul University; Université Paris-Saclay; Institute for Language and Speech Processing, Athena RC; Institute for Language and Speech Processing, Athena RC; University of Hamburg
Categories: Computation and Language (cs.CL)
Comments:
Abstract:The Sign Language Translation and Avatar Technology (SLTAT) workshops continue a series of gatherings to share recent advances in improving deaf / human communication through non-invasive means. This 2025 edition, the 9th since its first appearance in 2011, is hosted by the International Conference on Intelligent Virtual Agents (IVA), giving the opportunity for contamination between two research communities, using digital humans as either virtual interpreters or as interactive conversational agents. As presented in this summary paper, SLTAT sees contributions beyond avatar technologies, with a consistent number of submissions on sign language recognition, and other work on data collection, data analysis, tools, ethics, usability, and affective computing.
[NLP-20] Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning
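Quick read: This paper addresses the limited reasoning ability of large audio language models (LALMs) on audio question answering: explicit reasoning has yet to yield significant gains, and LALMs still fall well short of human-level auditory-language reasoning. The key to the solution is the Audio-Thinker framework. An adaptive think accuracy reward lets the model adjust its reasoning strategy to task complexity, an external reward model evaluates the overall consistency and quality of the reasoning process, and think-based rewards help the model distinguish valid from flawed reasoning paths. Together these improve the adaptability, consistency, and effectiveness of LALM reasoning.
Link: https://arxiv.org/abs/2508.08039
Authors: Shu Wu,Chenxing Li,Wenfu Wang,Hao Zhang,Hualei Wang,Meng Yu,Dong Yu
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: preprint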
Abstract:Recent advancements in large language models, multimodal large language models, and large audio language models (LALMs) have significantly improved their reasoning capabilities through reinforcement learning with rule-based rewards. However, the explicit reasoning process has yet to show significant benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs, with a focus on improving adaptability, consistency, and effectiveness. Our approach introduces an adaptive think accuracy reward, enabling the model to adjust its reasoning strategies based on task complexity dynamically. Furthermore, we incorporate an external reward model to evaluate the overall consistency and quality of the reasoning process, complemented by think-based rewards that help the model distinguish between valid and flawed reasoning paths during training. Experimental results demonstrate that our Audio-Thinker model outperforms existing reasoning-oriented LALMs across various benchmark tasks, exhibiting superior reasoning and generalization capabilities.
[NLP-21] Progressive Depth Up-scaling via Optimal Transport
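Quick read: This paper addresses the misalignment that arises during depth up-scaling of Large Language Models (LLMs) because neurons are permuted differently across layers, which can harm performance and training efficiency. Existing methods typically create new layers by copying or averaging the weights of base layers and ignore these permutation differences. The authors propose Optimal Transport Depth Up-Scaling (OpT-DeUS), which uses Optimal Transport (OT) to align and fuse the Transformer blocks of adjacent base layers when creating a new layer, mitigating neuron permutation mismatch. The method improves overall performance and training efficiency in continual pre-training and supervised fine-tuning. Inserting new layers closer to the top of the model further raises training efficiency, owing to shorter back-propagation time, and brings additional performance gains.
Link: https://arxiv.org/abs/2508.08011
Authors: Mingzi Cao,Xi Wang,Nikolaos Aletras
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: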
Abstract:Scaling Large Language Models (LLMs) yields performance gains but incurs substantial training costs. Depth up-scaling offers training efficiency by adding new layers to pre-trained models. However, most existing methods copy or average weights from base layers, neglecting neuron permutation differences. This limitation can potentially cause misalignment that harms performance. Inspired by applying Optimal Transport (OT) for neuron alignment, we propose Optimal Transport Depth Up-Scaling (OpT-DeUS). OpT-DeUS aligns and fuses Transformer blocks in adjacent base layers via OT for new layer creation, to mitigate neuron permutation mismatch between layers. OpT-DeUS achieves better overall performance and offers improved training efficiency than existing methods for continual pre-training and supervised fine-tuning across different model sizes. To further evaluate the impact of interpolation positions, our extensive analysis shows that inserting new layers closer to the top results in higher training efficiency due to shorter back-propagation time while obtaining additional performance gains.
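To illustrate why averaging unaligned layers can go wrong, here is a toy sketch of permutation alignment before fusion. It is an assumption-laden simplification, not OpT-DeUS itself: each layer is reduced to a small weight matrix whose rows are neurons, the optimal transport problem is restricted to a one-to-one matching (with uniform masses over equally many neurons, an optimal plan can be taken to be a permutation), and the matching is found by brute force, which only works for tiny layers. The helper names `align_rows` and `fuse` are hypothetical.

```python
from itertools import permutations

def sq_dist(u, v):
    """Squared Euclidean distance between two weight rows."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def align_rows(w_a, w_b):
    """Return perm such that w_b[perm[i]] is matched to w_a[i] at minimal total cost.
    Brute force over all permutations: feasible only for a handful of neurons."""
    n = len(w_a)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(sq_dist(w_a[i], w_b[perm[i]]) for i in range(n)),
    )
    return list(best)

def fuse(w_a, w_b):
    """Average w_a with the row-aligned version of w_b."""
    perm = align_rows(w_a, w_b)
    return [[(a + b) / 2 for a, b in zip(w_a[i], w_b[perm[i]])] for i in range(len(w_a))]

w_a = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
w_b = [[2.0, 2.0], [1.0, 0.0], [0.0, 1.0]]  # the same neurons in a different order

naive = [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(w_a, w_b)]
fused = fuse(w_a, w_b)
print(naive)  # rows are blends of unrelated neurons
print(fused)  # rows match the original neurons
```

In a real Transformer block, permuting one weight matrix also requires permuting the matching dimension of the neighboring matrices, and a practical solver would replace the brute-force search.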
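[NLP-22] WideSearch: Benchmarking Agentic Broad Info-Seeking
Quick read: This paper addresses the unreliability and incompleteness of search agents based on Large Language Models (LLMs) when they carry out large-scale information collection, a capability that has gone largely unevaluated for lack of suitable benchmarks. The key to the solution is WideSearch, a new benchmark of 200 manually curated questions (100 in English, 100 in Chinese) from over 15 domains. Each task requires the agent to collect a large number of atomic facts, each objectively verifiable, and arrange them into a well-organized output. A five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the data. Evaluating over 10 state-of-the-art agentic search systems shows that most achieve overall success rates near 0% and the best reaches only 5%, whereas cross-validation by multiple human testers, given sufficient time, approaches 100%.
Link: https://arxiv.org/abs/2508.07999
Authors: Ryan Wong,Jiawei Wang,Junjie Zhao,Li Chen,Yan Gao,Long Zhang,Xuan Zhou,Zuo Wang,Kai Xiang,Ge Zhang,Wenhao Huang,Yang Wang,Ke Wang
Affiliations: ByteDance
Categories: Computation and Language (cs.CL)
Comments: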
Abstract:From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such “wide-context” collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at this https URL
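The gap between near-0% success rates and otherwise plausible outputs is easier to see with an item-level score next to an all-or-nothing one. The sketch below is a hypothetical illustration and not the WideSearch evaluation pipeline: it assumes each answer is a set of atomic items that can be compared by exact match, and the function `score_task` and the sample data are invented for the example.

```python
def score_task(expected, collected):
    """Compare collected atomic items against the expected set by exact match."""
    expected, collected = set(expected), set(collected)
    hits = len(expected & collected)
    precision = hits / len(collected) if collected else 0.0
    recall = hits / len(expected) if expected else 0.0
    # Strict success requires every item to be right and nothing extra.
    return {"success": expected == collected, "precision": precision, "recall": recall}

# Hypothetical task: four (name, year) facts, one of which the agent gets wrong.
expected = {("Alice", 2019), ("Bob", 2020), ("Carol", 2021), ("Dan", 2022)}
collected = {("Alice", 2019), ("Bob", 2020), ("Carol", 2021), ("Dan", 2023)}
result = score_task(expected, collected)
print(result)
```

A single wrong item is enough to fail the strict criterion even though three quarters of the items are correct, which is why wide collection tasks punish small per-item error rates so heavily.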
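[NLP-23] The Medical Metaphors Corpus (MCC)
Quick read: This paper addresses the shortage of metaphor detection resources for scientific text, in particular the lack of a carefully annotated metaphor corpus for the medical and biological domains, which limits computational research on domain-specific metaphor. The key contribution is the Medical Metaphors Corpus (MCC), which the authors present as the first annotated resource for computational scientific metaphor research. It contains 792 annotated scientific conceptual metaphors drawn from diverse sources (peer-reviewed literature, news media, social media, and crowdsourced contributions). It provides binary and graded metaphoricity judgments, with perceived metaphoricity scored on a 0-7 scale, and records the source-target conceptual mapping for each instance, giving research on detecting, generating, and applying scientific metaphor a benchmark dataset.
Link: https://arxiv.org/abs/2508.07993
Authors: Anna Sofia Lippolis,Andrea Giovanni Nuzzolese,Aldo Gangemi
Affiliations: University of Bologna; CNR Institute for Cognitive Sciences and Technologies
Categories: Computation and Language (cs.CL)
Comments: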
Abstract:Metaphor is a fundamental cognitive mechanism that shapes scientific understanding, enabling the communication of complex concepts while potentially constraining paradigmatic thinking. Despite the prevalence of figurative language in scientific discourse, existing metaphor detection resources primarily focus on general-domain text, leaving a critical gap for domain-specific applications. In this paper, we present the Medical Metaphors Corpus (MCC), a comprehensive dataset of 792 annotated scientific conceptual metaphors spanning medical and biological domains. MCC aggregates metaphorical expressions from diverse sources including peer-reviewed literature, news media, social media discourse, and crowdsourced contributions, providing both binary and graded metaphoricity judgments validated through human annotation. Each instance includes source-target conceptual mappings and perceived metaphoricity scores on a 0-7 scale, establishing the first annotated resource for computational scientific metaphor research. Our evaluation demonstrates that state-of-the-art language models achieve modest performance on scientific metaphor detection, revealing substantial room for improvement in domain-specific figurative language understanding. MCC enables multiple research applications including metaphor detection benchmarking, quality-aware generation systems, and patient-centered communication tools.
[NLP-24] Exploring Procedural Data Generation for Automatic Acoustic Guitar Fingerpicking Transcription
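Quick read: This paper addresses the difficulty of training models for automatic transcription of acoustic guitar fingerpicking, caused by scarce labeled training data and legal constraints on musical recordings. The key to the solution is a procedural data generation pipeline that synthesizes training data in four stages: knowledge-based fingerpicking tablature composition, MIDI performance rendering, physical modeling with an extended Karplus-Strong algorithm, and audio augmentation including reverb and distortion. Experiments show that a CRNN-based note-tracking model trained on the synthetic data reaches reasonable transcription performance, and that fine-tuning with a small amount of real data lets it surpass models trained only on real recordings, supporting the use of procedural audio when data is scarce.
Link: https://arxiv.org/abs/2508.07987
Authors: Sebastian Murgul,Michael Heizmann
Affiliations: Klangio GmbH; Karlsruhe Institute of Technology
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to the 6th Conference on AI Music Creativity (AIMC), 2025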
Abstract:Automatic transcription of acoustic guitar fingerpicking performances remains a challenging task due to the scarcity of labeled training data and legal constraints connected with musical recordings. This work investigates a procedural data generation pipeline as an alternative to real audio recordings for training transcription models. Our approach synthesizes training data through four stages: knowledge-based fingerpicking tablature composition, MIDI performance rendering, physical modeling using an extended Karplus-Strong algorithm, and audio augmentation including reverb and distortion. We train and evaluate a CRNN-based note-tracking model on both real and synthetic datasets, demonstrating that procedural data can be used to achieve reasonable note-tracking results. Finetuning with a small amount of real data further enhances transcription accuracy, improving over models trained exclusively on real recordings. These results highlight the potential of procedurally generated audio for data-scarce music information retrieval tasks.
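For readers unfamiliar with the physical-modeling stage, the following is a minimal sketch of the basic Karplus-Strong plucked-string algorithm. It is not the extended variant the paper uses, and the parameter values (`decay=0.996`, a 44.1 kHz sample rate, a 110 Hz note) are assumptions chosen for illustration. A delay line one pitch period long is filled with noise, and each sample that leaves it is fed back as a damped average of itself and its successor.

```python
import random
from collections import deque

def karplus_strong(frequency_hz, duration_s, sample_rate=44100, decay=0.996, seed=0):
    """Basic Karplus-Strong string synthesis; returns a list of float samples."""
    rng = random.Random(seed)
    period = int(sample_rate / frequency_hz)  # delay-line length sets the pitch
    buf = deque(rng.uniform(-1.0, 1.0) for _ in range(period))  # noise burst = pluck
    samples = []
    for _ in range(int(duration_s * sample_rate)):
        first = buf.popleft()
        samples.append(first)
        # Averaging two neighbors acts as a low-pass filter; decay damps the string.
        buf.append(decay * 0.5 * (first + buf[0]))
    return samples

samples = karplus_strong(110.0, 0.5)
print(len(samples))
```

Because the delay length is an integer, the sounding pitch is only approximately the requested frequency; extended variants typically add fractional-delay tuning and further filters.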
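[NLP-25] Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
Quick read: This paper addresses the limitations of open-source search agents on complex, knowledge-intensive tasks, in particular their lack of expert-level Search Intelligence: the ability to resolve ambiguous queries, generate precise searches, analyze results, and explore thoroughly. Existing approaches fall short in scalability, efficiency, and data quality; for example, small turn limits in online RL (such as ≤10) prevent the learning of complex strategies. The key to the solution is ASearcher, an open-source project for large-scale RL training of search agents. Its main contributions are (1) scalable, fully asynchronous RL training that supports long-horizon search while keeping training efficient, and (2) a prompt-based LLM agent that autonomously synthesizes high-quality, challenging QA pairs to build a large-scale training dataset. With this approach, the QwQ-32B-based agent gains 46.7% and 20.8% Avg@4 on xBench and GAIA, respectively, and shows extreme long-horizon search, with more than 40 tool-call turns and more than 150k output tokens during training, surpassing existing open-source 32B agents.
Link: https://arxiv.org/abs/2508.07976
Authors: Jiaxuan Gao,Wei Fu,Minyang Xie,Shusheng Xu,Chuyi He,Zhiyu Mei,Banghua Zhu,Yi Wu
Affiliations: IIIS, Tsinghua University; Ant Research, RL Lab; University of Washington
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: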
Abstract:Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. ≤10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 46.7% and 20.8% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. We open-source our models, training data, and code at this https URL.
[NLP-26] Improving Document Retrieval Coherence for Semantically Equivalent Queries
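Quick read: This paper addresses the instability of Dense Retrieval (DR) models when queries are semantically similar but worded differently. Existing DR models are highly sensitive to small changes in the query and document lexicon, so queries with the same meaning can retrieve markedly different documents. The authors propose a variation of the Multi-Negative Ranking loss that penalizes discrepancies between the top-k documents retrieved for semantically equivalent but differently worded queries, making retrieval more coherent across phrasings. Experiments show that the method lowers sensitivity to lexical variation in queries and also improves retrieval accuracy.
Link: https://arxiv.org/abs/2508.07975
Authors: Stefano Campese,Alessandro Moschitti,Ivano Lauriola
Affiliations: Amazon AGI; University of Trento
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: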
Abstract:Dense Retrieval (DR) models have proven to be effective for Document Retrieval and Information Grounding tasks. Usually, these models are trained and optimized for improving the relevance of top-ranked documents for a given query. Previous work has shown that popular DR models are sensitive to the query and document lexicon: small variations of it may lead to a significant difference in the set of retrieved documents. In this paper, we propose a variation of the Multi-Negative Ranking loss for training DR that improves the coherence of models in retrieving the same documents with respect to semantically similar queries. The loss penalizes discrepancies between the top-k ranked documents retrieved for diverse but semantically equivalent queries. We conducted extensive experiments on various datasets, MS-MARCO, Natural Questions, BEIR, and TREC DL 19/20. The results show that models optimized with our loss exhibit (i) lower sensitivity and, (ii) interestingly, higher accuracy.
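The notion of coherence across paraphrased queries can be quantified with a simple set-overlap measure. The sketch below is an evaluation-style illustration under stated assumptions, not the paper's training loss (which must be differentiable and is not specified here): it scores coherence as the mean pairwise Jaccard overlap of the top-k document sets retrieved for paraphrases of one query. The function names and the toy relevance scores are hypothetical.

```python
from itertools import combinations

def top_k(scores, k):
    """Return the ids of the k highest-scoring documents."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def coherence(rankings, k):
    """Mean pairwise Jaccard overlap of top-k sets across paraphrased queries.
    rankings: list of {doc_id: score} dicts, one per paraphrase (at least two)."""
    sets = [top_k(r, k) for r in rankings]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Hypothetical scores from one retriever for two paraphrases of the same question.
q1 = {"d1": 0.9, "d2": 0.8, "d3": 0.1, "d4": 0.05}
q2 = {"d1": 0.7, "d2": 0.2, "d3": 0.6, "d4": 0.1}
value = coherence([q1, q2], k=2)
print(value)
```

A value of 1.0 means every paraphrase retrieves the same top-k set; lower values indicate the lexical sensitivity the abstract describes.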
[NLP-27] Joint Transcription of Acoustic Guitar Strumming Directions and Chords
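Quick read: This paper addresses automatic transcription of guitar strumming, an underrepresented and challenging task in Music Information Retrieval (MIR), with the goal of extracting both strumming directions and chord progressions from audio. The key to the solution is a new dataset (90 minutes of real-world guitar recordings, collected with an ESP32 smartwatch motion sensor, plus 4 hours of labeled synthetic strumming audio) and a Convolutional Recurrent Neural Network (CRNN) that detects strumming events, classifies their direction, and identifies chords from microphone audio alone. Experiments show that a hybrid method combining synthetic and real-world data achieves the highest accuracy for both strumming action detection and chord classification and significantly outperforms baseline onset detection algorithms.
Link: https://arxiv.org/abs/2508.07973
Authors: Sebastian Murgul,Johannes Schimper,Michael Heizmann
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to the 26th International Society for Music Information Retrieval Conference (ISMIR), 2025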
Abstract:Automatic transcription of guitar strumming is an underrepresented and challenging task in Music Information Retrieval (MIR), particularly for extracting both strumming directions and chord progressions from audio signals. While existing methods show promise, their effectiveness is often hindered by limited datasets. In this work, we extend a multimodal approach to guitar strumming transcription by introducing a novel dataset and a deep learning-based transcription model. We collect 90 min of real-world guitar recordings using an ESP32 smartwatch motion sensor and a structured recording protocol, complemented by a synthetic dataset of 4h of labeled strumming audio. A Convolutional Recurrent Neural Network (CRNN) model is trained to detect strumming events, classify their direction, and identify the corresponding chords using only microphone audio. Our evaluation demonstrates significant improvements over baseline onset detection algorithms, with a hybrid method combining synthetic and real-world data achieving the highest accuracy for both strumming action detection and chord classification. These results highlight the potential of deep learning for robust guitar strumming transcription and open new avenues for automatic rhythm guitar analysis.
[NLP-28] Understanding Syntactic Generalization in Structure-inducing Language Models
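Quick read: This paper addresses the systematic gaps and lack of comparability in the evaluation of Structure-inducing Language Models (SiLMs). Many SiLM architectures have been proposed, but most were evaluated only at a relatively small scale, without a comprehensive and consistent comparison. The authors study three representative architectures, Structformer, UDGN, and the Generative Pretrained Structured Transformer (GPST), on both natural language (English corpora) and synthetic bracketing expressions, comparing them on three dimensions: properties of the induced syntactic representations, performance on grammaticality judgment tasks, and training dynamics. No architecture dominates across all metrics, but GPST performs most consistently across settings and outperforms the other models on long-distance dependencies in bracketing expressions. The study also shows that small models trained on large amounts of synthetic data are a useful testbed for evaluating basic model properties.
Link: https://arxiv.org/abs/2508.07969
Authors: David Arps,Hassan Sajjad,Laura Kallmeyer
Affiliations: Heinrich-Heine-Universität; Dalhousie University
Categories: Computation and Language (cs.CL)
Comments: Code available at this https URL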
Abstract:Structure-inducing Language Models (SiLM) are trained on a self-supervised language modeling task, and induce a hierarchical sentence representation as a byproduct when processing an input. A wide variety of SiLMs have been proposed. However, these have typically been evaluated on a relatively small scale, and evaluation of these models has systematic gaps and lacks comparability. In this work, we study three different SiLM architectures using both natural language (English) corpora and synthetic bracketing expressions: Structformer (Shen et al., 2021), UDGN (Shen et al., 2022) and GPST (Hu et al., 2024). We compare them with respect to (i) properties of the induced syntactic representations (ii) performance on grammaticality judgment tasks, and (iii) training dynamics. We find that none of the three architectures dominates across all evaluation metrics. However, there are significant differences, in particular with respect to the induced syntactic representations. The Generative Pretrained Structured Transformer (GPST; Hu et al. 2024) performs most consistently across evaluation settings, and outperforms the other models on long-distance dependencies in bracketing expressions. Furthermore, our study shows that small models trained on large amounts of synthetic data provide a useful testbed for evaluating basic model properties.
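[NLP-29] Toward Machine Interpreting: Lessons from Human Interpreting Studies
Quick read: This paper addresses the lack of adaptability in current speech translation systems: although accurate, they behave rather statically and do not adapt to real-world situations the way human interpreters do. The key to the solution is a precise understanding of the nature of human interpreting. The authors review the human interpreting literature from the perspective of machine translation, covering both operational and qualitative aspects, and argue that many human interpreting principles can be adopted with recent modeling techniques, moving speech translation toward interpreting-like experiences and narrowing the usability gap.
Link: https://arxiv.org/abs/2508.07964
Authors: Matthias Sperber,Maureen de Seyssel,Jiajun Bao,Matthias Paulik
Affiliations: Apple
Categories: Computation and Language (cs.CL)
Comments: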
Abstract:Current speech translation systems, while having achieved impressive accuracies, are rather static in their behavior and do not adapt to real-world situations in ways human interpreters do. In order to improve their practical usefulness and enable interpreting-like experiences, a precise understanding of the nature of human interpreting is crucial. To this end, we discuss human interpreting literature from the perspective of the machine translation field, while considering both operational and qualitative aspects. We identify implications for the development of speech translation systems and argue that there is great potential to adopt many human interpreting principles using recent modeling techniques. We hope that our findings provide inspiration for closing the perceived usability gap, and can motivate progress toward true machine interpreting.
[NLP-30] Large Language Models for Subjective Language Understanding: A Survey
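Quick read: This survey examines how Large Language Models (LLMs) can handle subjective language understanding tasks: sentiment analysis, emotion recognition, sarcasm detection, humor understanding, stance detection, metaphor interpretation, intent detection, and aesthetics assessment. The central challenge is that subjective language is ambiguous, figurative, and context-dependent. The survey clarifies the definition of subjective language, traces the evolution of LLM architectures and techniques that benefit these tasks, and explains why LLMs are well suited to modeling subtle, human-like judgments. For each of the eight tasks it summarizes task definitions, key datasets, state-of-the-art LLM-based methods, and remaining challenges, and it discusses how multi-task LLM approaches might yield unified models of subjectivity. It closes with open issues, including data limitations, model bias, and ethical considerations, and suggests future research directions.
Link: https://arxiv.org/abs/2508.07959
Authors: Changhao Song,Yazhou Zhang,Hui Gao,Ben Yao,Peng Zhang
Affiliations: Tianjin University; The Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL)
Comments: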
Abstract:Subjective language understanding refers to a broad set of natural language processing tasks where the goal is to interpret or generate content that conveys personal feelings, opinions, or figurative meanings rather than objective facts. With the advent of large language models (LLMs) such as ChatGPT, LLaMA, and others, there has been a paradigm shift in how we approach these inherently nuanced tasks. In this survey, we provide a comprehensive review of recent advances in applying LLMs to subjective language tasks, including sentiment analysis, emotion recognition, sarcasm detection, humor understanding, stance detection, metaphor interpretation, intent detection, and aesthetics assessment. We begin by clarifying the definition of subjective language from linguistic and cognitive perspectives, and we outline the unique challenges posed by subjective language (e.g. ambiguity, figurativeness, context dependence). We then survey the evolution of LLM architectures and techniques that particularly benefit subjectivity tasks, highlighting why LLMs are well-suited to model subtle human-like judgments. For each of the eight tasks, we summarize task definitions, key datasets, state-of-the-art LLM-based methods, and remaining challenges. We provide comparative insights, discussing commonalities and differences among tasks and how multi-task LLM approaches might yield unified models of subjectivity. Finally, we identify open issues such as data limitations, model bias, and ethical considerations, and suggest future research directions. We hope this survey will serve as a valuable resource for researchers and practitioners interested in the intersection of affective computing, figurative language processing, and large-scale language models.
[NLP-31] Expert Preference-based Evaluation of Automated Related Work Generation
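Quick read: This paper addresses the inadequacy of automatic quality evaluation for machine-generated scientific writing, in particular the difficulty of capturing expert preferences and domain-specific evaluation criteria. Conventional automatic metrics and LLM-as-a-judge systems do not reflect expert judgment well. The key to the solution is GREP, a multi-turn evaluation framework that decomposes the evaluation of related work sections into fine-grained dimensions and supplies contrastive few-shot examples as contextual guidance for each dimension. This design yields a cardinal assessment of quality, which can support better post-training than ordinal preference data. The framework is intended to support human-AI collaborative writing and comes in two variants that trade precision against cost: one using proprietary LLMs as evaluators and a cheaper one using open-weight LLMs.
Link: https://arxiv.org/abs/2508.07955
Authors: Furkan Şahinuç,Subhabrata Dutta,Iryna Gurevych
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Project page: this https URL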
Abstract:Expert domain writing, such as scientific writing, typically demands extensive domain knowledge. Recent advances in LLMs show promising potential in reducing the expert workload. However, evaluating the quality of automatically generated scientific writing is a crucial open issue, as it requires knowledge of domain-specific evaluation criteria and the ability to discern expert preferences. Conventional automatic metrics and LLM-as-a-judge systems are insufficient to grasp expert preferences and domain-specific quality standards. To address this gap and support human-AI collaborative writing, we focus on related work generation, one of the most challenging scientific tasks, as an exemplar. We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. Instead of assigning a single score, our framework decomposes the evaluation into fine-grained dimensions. This localized evaluation approach is further augmented with contrastive few-shot examples to provide detailed contextual guidance for the evaluation dimensions. The design principles allow our framework to deliver cardinal assessment of quality, which can facilitate better post-training compared to ordinal preference data. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs. Empirical investigation reveals that our framework is able to assess the quality of related work sections in a much more robust manner compared to standard LLM judges, reflects natural scenarios of scientific writing, and bears a strong correlation with the human expert assessment. We also observe that generations from state-of-the-art LLMs struggle to satisfy validation constraints of a suitable related work section. They (mostly) fail to improve based on feedback as well.
[NLP-32] Challenges and opportunities in portraying emotion in generated sign language
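Quick read: The paper addresses the difficulty of conveying non-manual signals, and emotional content in particular, through signing avatars; the core obstacle is the lack of a standard method for specifying the avatar's emotional state. The key to the solution is an intuitive two-parameter representation of emotive non-manual signals, exposed through a textual representation called the EASIER notation that controls the emotional expressions of the Paula avatar, so that emotional facial expressions can be generated and annotated more consistently and at finer granularity.
Link: https://arxiv.org/abs/2508.07937
Authors: John C. McDonald, Rosalee Wolfe, Fabrizio Nunnari
Affiliations: DePaul University; Institute for Language and Speech Processing, Athena RC; German Research Center for Artificial Intelligence (DFKI)
Subjects: Computation and Language (cs.CL)
Comments: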
Abstract:Non-manual signals in sign languages continue to be a challenge for signing avatars. More specifically, emotional content has been difficult to incorporate because of a lack of a standard method of specifying the avatar’s emotional state. This paper explores the application of an intuitive two-parameter representation for emotive non-manual signals to the Paula signing avatar that shows promise for facilitating the linguistic specification of emotional facial expressions in a more coherent manner than previous methods. Users can apply these parameters to control Paula’s emotional expressions through a textual representation called the EASIER notation. The representation can allow avatars to express more nuanced emotional states using two numerical parameters. It also has the potential to enable more consistent specification of emotional non-manual signals in linguistic annotations which drive signing avatars.
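[NLP-33] Tailored Emotional LLM-Supporter: Enhancing Cultural Sensitivity
Quick read: The paper addresses the lack of cultural sensitivity in emotional support provided by large language models (LLMs), a problem long constrained by scarce resources. The key to the solution is CultureCare, the first dataset designed for this task: it spans four cultures and contains 1,729 distress messages, 1,523 cultural signals, and 1,041 support strategies, with fine-grained emotional and cultural annotations. On this dataset the authors develop and test four adaptation strategies that guide three state-of-the-art LLMs toward more culturally sensitive responses. Evaluations by LLM judges, in-culture human annotators, and clinical psychologists show that the adapted LLMs outperform anonymous online peer responses and that simple cultural role-play is insufficient for cultural sensitivity.
Link: https://arxiv.org/abs/2508.07902
Authors: Chen Cecilia Liu, Hiba Arnaout, Nils Kovačić, Dana Atzil-Slonim, Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab); Technische Universität Darmstadt; Department of Computer Science and Hessian Center for AI (hessian.AI); Department of Psychology, Bar-Ilan University
Subjects: Computation and Language (cs.CL)
Comments: Under review; joint first authors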
Abstract:Large language models (LLMs) show promise in offering emotional support and generating empathetic responses for individuals in distress, but their ability to deliver culturally sensitive support remains underexplored due to lack of resources. In this work, we introduce CultureCare, the first dataset designed for this task, spanning four cultures and including 1729 distress messages, 1523 cultural signals, and 1041 support strategies with fine-grained emotional and cultural annotations. Leveraging CultureCare, we (i) develop and test four adaptation strategies for guiding three state-of-the-art LLMs toward culturally sensitive responses; (ii) conduct comprehensive evaluations using LLM judges, in-culture human annotators, and clinical psychologists; (iii) show that adapted LLMs outperform anonymous online peer responses, and that simple cultural role-play is insufficient for cultural sensitivity; and (iv) explore the application of LLMs in clinical training, where experts highlight their potential in fostering cultural competence in future therapists.
[NLP-34] Few-shot Cross-lingual Aspect-Based Sentiment Analysis with Sequence-to-Sequence Models
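Quick read: The paper addresses the limited performance of aspect-based sentiment analysis (ABSA) in low-resource languages caused by scarce labelled data. The key to the solution is adding a small number of labelled target-language (few-shot) examples to the training set: as few as ten examples significantly improve cross-lingual ABSA over the zero-shot setting and reduce prediction errors about as much as constrained decoding does. Combining 1,000 target-language examples with English data can even surpass monolingual baselines, which offers a practical route for cross-lingual ABSA in low-resource and domain-specific settings.
Link: https://arxiv.org/abs/2508.07866
Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted for presentation at the 28th International Conference on Text, Speech and Dialogue (TSD 2025)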
Abstract:Aspect-based sentiment analysis (ABSA) has received substantial attention in English, yet challenges remain for low-resource languages due to the scarcity of labelled data. Current cross-lingual ABSA approaches often rely on external translation tools and overlook the potential benefits of incorporating a small number of target language examples into training. In this paper, we evaluate the effect of adding few-shot target language examples to the training set across four ABSA tasks, six target languages, and two sequence-to-sequence models. We show that adding as few as ten target language examples significantly improves performance over zero-shot settings and achieves a similar effect to constrained decoding in reducing prediction errors. Furthermore, we demonstrate that combining 1,000 target language examples with English data can even surpass monolingual baselines. These findings offer practical insights for improving cross-lingual ABSA in low-resource and domain-specific settings, as obtaining ten high-quality annotated examples is both feasible and highly effective.
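The data setup described in the abstract can be sketched as follows, assuming the few-shot examples are simply sampled from a labelled target-language pool and mixed into the English training set. The function name, the seed, and the toy records are hypothetical; the sampling procedure the authors actually used is not stated in the abstract.

```python
import random

def build_few_shot_training_set(english_examples, target_examples, k=10, seed=0):
    """Return the English training data plus k sampled target-language examples."""
    rng = random.Random(seed)            # fixed seed so the sample is reproducible
    k = min(k, len(target_examples))     # never ask for more examples than exist
    few_shot = rng.sample(target_examples, k)
    mixed = list(english_examples) + few_shot
    rng.shuffle(mixed)                   # interleave the two languages
    return mixed

# Toy records; "tgt" stands in for any target language.
english = [{"lang": "en", "text": f"review {i}"} for i in range(100)]
target = [{"lang": "tgt", "text": f"target review {i}"} for i in range(50)]

train = build_few_shot_training_set(english, target, k=10)
num_target = sum(example["lang"] == "tgt" for example in train)
print(len(train), num_target)
```

With the toy data above, the mixed set holds 110 examples, 10 of them in the target language.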
[NLP-35] Large Language Models for Czech Aspect-Based Sentiment Analysis
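Quick read: The paper addresses the open question of how well large language models (LLMs) perform on Czech aspect-based sentiment analysis (ABSA), where their capabilities remain largely unexplored. The key to the solution is a systematic evaluation of 19 LLMs of varying sizes and architectures in zero-shot, few-shot, and fine-tuning scenarios. Small domain-specific models fine-tuned for ABSA outperform general-purpose LLMs used in zero-shot and few-shot settings, while fine-tuned LLMs achieve the best results. The study also analyzes how multilingualism, model size, and recency affect performance, and its error analysis identifies aspect term prediction as a key challenge.
Link: https://arxiv.org/abs/2508.07860
Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted for presentation at the 28th International Conference on Text, Speech and Dialogue (TSD 2025)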
Abstract:Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task that aims to identify sentiment toward specific aspects of an entity. While large language models (LLMs) have shown strong performance in various natural language processing (NLP) tasks, their capabilities for Czech ABSA remain largely unexplored. In this work, we conduct a comprehensive evaluation of 19 LLMs of varying sizes and architectures on Czech ABSA, comparing their performance in zero-shot, few-shot, and fine-tuning scenarios. Our results show that small domain-specific models fine-tuned for ABSA outperform general-purpose LLMs in zero-shot and few-shot settings, while fine-tuned LLMs achieve state-of-the-art results. We analyze how factors such as multilingualism, model size, and recency influence performance and present an error analysis highlighting key challenges, particularly in aspect term prediction. Our findings provide insights into the suitability of LLMs for Czech ABSA and offer guidance for future research in this area.
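[NLP-36] LLMs for Law: Evaluating Legal-Specific LLMs on Contract Understanding
Quick read: The paper addresses the absence of a systematic evaluation of multiple legal-specific large language models (LLMs) on contract classification tasks. The key to the solution is a comparison of 10 legal-specific LLMs with 7 general-purpose LLMs on three English-language contract understanding tasks. Legal-specific models consistently outperform general-purpose models, especially on tasks requiring nuanced legal understanding; Legal-BERT and Contracts-BERT set new state-of-the-art (SOTA) results on two of the three tasks with 69% fewer parameters than the best-performing general-purpose LLM. The study provides a benchmark and empirical evidence for building contract understanding systems.
Link: https://arxiv.org/abs/2508.07849
Authors: Amrita Singh, H. Suhan Karaca, Aditya Joshi, Hye-young Paik, Jiaojiao Jiang
Affiliations: University of New South Wales (UNSW)
Subjects: Computation and Language (cs.CL)
Comments: Under review. 4 pages + references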
Abstract:Despite advances in legal NLP, no comprehensive evaluation covering multiple legal-specific LLMs currently exists for contract classification tasks in contract understanding. To address this gap, we present an evaluation of 10 legal-specific LLMs on three English language contract understanding tasks and compare them with 7 general-purpose LLMs. The results show that legal-specific LLMs consistently outperform general-purpose models, especially on tasks requiring nuanced legal understanding. Legal-BERT and Contracts-BERT establish new SOTAs on two of the three tasks, despite having 69% fewer parameters than the best-performing general-purpose LLM. We also identify CaseLaw-BERT and LexLM as strong additional baselines for contract understanding. Our results provide a holistic evaluation of legal-specific LLMs and will facilitate the development of more accurate contract understanding systems.
[NLP-37] Evaluating Large Language Models as Expert Annotators
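Quick read: The paper asks whether large language models (LLMs) can serve as direct alternatives to human expert annotators in domains that require expert knowledge, such as finance, biomedicine, and law. The core challenge is assessing how effective LLMs are at annotating specialized text, particularly once inference-time reasoning (such as chain-of-thought, CoT) and multi-agent collaboration are introduced. The key to the solution is a multi-agent discussion framework that simulates a group of human annotators: each LLM considers the other agents' annotations and justifications before finalizing its label. The study also compares reasoning and non-reasoning models to gauge the practical value of extended reasoning for specialized annotation.
Link: https://arxiv.org/abs/2508.07827
Authors: Yu-Min Tseng, Wei-Lin Chen, Chung-Chi Chen, Hsin-Hsi Chen
Affiliations: National Taiwan University; Virginia Tech; University of Virginia; AIST, Japan; AINTU, Taiwan
Subjects: Computation and Language (cs.CL)
Comments: Accepted to COLM 2025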
Abstract:Textual data annotation, the process of labeling or tagging text with relevant information, is typically costly, time-consuming, and labor-intensive. While large language models (LLMs) have demonstrated their potential as direct alternatives to human annotators for general-domain natural language processing (NLP) tasks, their effectiveness on annotation tasks in domains requiring expert knowledge remains underexplored. In this paper, we investigate whether top-performing LLMs, which might be perceived as having expert-level proficiency in academic and professional benchmarks, can serve as direct alternatives to human expert annotators. To this end, we evaluate both individual LLMs and multi-agent approaches across three highly specialized domains: finance, biomedicine, and law. Specifically, we propose a multi-agent discussion framework to simulate a group of human annotators, where LLMs are tasked to engage in discussions by considering others’ annotations and justifications before finalizing their labels. Additionally, we incorporate reasoning models (e.g., o3-mini) to enable a more comprehensive comparison. Our empirical results reveal that: (1) Individual LLMs equipped with inference-time techniques (e.g., chain-of-thought (CoT), self-consistency) show only marginal or even negative performance gains, contrary to prior literature suggesting their broad effectiveness. (2) Overall, reasoning models do not demonstrate statistically significant improvements over non-reasoning models in most settings. This suggests that extended long CoT provides relatively limited benefits for data annotation in specialized domains. (3) Certain model behaviors emerge in the multi-agent discussion environment. For instance, Claude 3.7 Sonnet with thinking rarely changes its initial annotations, even when other agents provide correct annotations or valid reasoning.
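Self-consistency, one of the inference-time techniques named in the abstract, is commonly implemented as a majority vote over several sampled outputs. The sketch below shows that vote for annotation labels, assuming the sampled labels have already been collected; the function name and the example labels are hypothetical, and the paper's exact prompting and sampling setup is not described here.

```python
from collections import Counter

def self_consistency_label(sampled_labels):
    """Return the most frequent label and the share of samples that agree with it."""
    if not sampled_labels:
        raise ValueError("need at least one sampled label")
    counts = Counter(sampled_labels)
    label, votes = counts.most_common(1)[0]   # ties resolve to the first label seen
    return label, votes / len(sampled_labels)

# Five hypothetical annotations sampled from the same model for one text.
label, agreement = self_consistency_label(
    ["positive", "neutral", "positive", "positive", "negative"]
)
print(label, agreement)
```

The agreement share is a cheap signal for spotting items where the samples disagree and a human expert may still be needed.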
[NLP-38] Evaluating Compositional Approaches for Focus and Sentiment Analysis
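Quick read: The paper addresses the lack of quantitative evaluations of compositional approaches to Focus Analysis (FA) in linguistics, even though such evaluations exist for Sentiment Analysis (SA) in natural language processing (NLP). The key to the solution is the argument that the compositional rules used in SA also apply to FA, because SA is a part of FA. The approach composes the polarity of words from sentiment dictionaries using basic syntactic rules (modification, coordination, and negation) expressed in the Universal Dependencies (UD) formalism, and compares its accuracy with VADER, a non-compositional method based on simple heuristic rules. The authors highlight interpretability and explainability as advantages of the compositional approach and generalize the SA results to FA.
Link: https://arxiv.org/abs/2508.07810
Authors: Olga Kellert, Muhammad Imran, Nicholas Hill Matlis, Mahmud Uz Zaman, Carlos Gómez-Rodríguez
Affiliations: University of A Coruña
Subjects: Computation and Language (cs.CL)
Comments: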
Abstract:This paper summarizes the results of evaluating a compositional approach for Focus Analysis (FA) in Linguistics and Sentiment Analysis (SA) in Natural Language Processing (NLP). While quantitative evaluations of compositional and non-compositional approaches in SA exist in NLP, similar quantitative evaluations are very rare in FA in Linguistics that deal with linguistic expressions representing focus or emphasis such as “it was John who left”. We fill this gap in research by arguing that compositional rules in SA also apply to FA because FA and SA are closely related meaning that SA is part of FA. Our compositional approach in SA exploits basic syntactic rules such as rules of modification, coordination, and negation represented in the formalism of Universal Dependencies (UDs) in English and applied to words representing sentiments from sentiment dictionaries. Some of the advantages of our compositional analysis method for SA in contrast to non-compositional analysis methods are interpretability and explainability. We test the accuracy of our compositional approach and compare it with a non-compositional approach VADER that uses simple heuristic rules to deal with negation, coordination and modification. In contrast to previous related work that evaluates compositionality in SA on long reviews, this study uses more appropriate datasets to evaluate compositionality. In addition, we generalize the results of compositional approaches in SA to compositional approaches in FA.
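A toy illustration of compositional polarity over dependency relations, assuming a hand-built tree rather than a real parser. The lexicon values, the multipliers, and the decision to flip only the head's own score under negation are all hypothetical choices made for this sketch; they are not the authors' rules.

```python
LEXICON = {"good": 1.0, "bad": -1.0, "great": 2.0}   # hypothetical sentiment dictionary
NEGATORS = {"not", "never"}
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}         # hypothetical multipliers

def polarity(token, children):
    """Compose the polarity of `token` from its lexicon score and its dependents.

    `children` maps a head word to a list of (relation, dependent) pairs,
    using UD-style relation names ("advmod", "conj").
    """
    score = LEXICON.get(token, 0.0)
    negated = False
    conjunct_scores = []
    for relation, dependent in children.get(token, []):
        if relation == "advmod" and dependent in NEGATORS:
            negated = True                              # negation rule
        elif relation == "advmod" and dependent in INTENSIFIERS:
            score *= INTENSIFIERS[dependent]            # modification rule
        elif relation == "conj":
            conjunct_scores.append(polarity(dependent, children))  # coordination rule
    if negated:
        score = -score
    return score + sum(conjunct_scores)

# "not very good": both modifiers attach to the head "good".
not_very_good = polarity("good", {"good": [("advmod", "not"), ("advmod", "very")]})
# "good and great": "great" is a conjunct of "good".
good_and_great = polarity("good", {"good": [("conj", "great")]})
print(not_very_good, good_and_great)
```

Because every score is traceable to a lexicon entry and a named rule, the result can be explained step by step, which is the interpretability advantage the abstract refers to. Keying the tree by word form is a simplification that breaks when a word repeats in a sentence.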
[NLP-39] Can You Trick the Grader? Adversarial Persuasion of LLM Judges
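Quick read: The paper addresses the risk that large language models (LLMs) used as automated graders are biased by persuasive language, particularly on mathematical reasoning tasks where scores should depend only on correctness and not on style. The key to the solution is to formalize seven persuasion techniques grounded in Aristotle's rhetorical principles (Majority, Consistency, Flattery, Reciprocity, Pity, Authority, Identity) and embed them into otherwise identical responses. Experiments show that LLM judges then assign inflated scores to incorrect solutions, by up to 8% on average, with Consistency causing the most severe distortion. Larger models do not substantially mitigate the problem, combining techniques amplifies the bias, and pairwise evaluation is likewise susceptible, which exposes the vulnerability of LLM-as-a-Judge pipelines to persuasion-based attacks and the need for robust defenses.
Link: https://arxiv.org/abs/2508.07805
Authors: Yerin Hwang, Dongryeol Lee, Taegwan Kang, Yongil Kim, Kyomin Jung
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: 19 pages, 8 figures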
Abstract:As large language models take on growing roles as automated evaluators in practical settings, a critical question arises: Can individuals persuade an LLM judge to assign unfairly high scores? This study is the first to reveal that strategically embedded persuasive language can bias LLM judges when scoring mathematical reasoning tasks, where correctness should be independent of stylistic variation. Grounded in Aristotle’s rhetorical principles, we formalize seven persuasion techniques (Majority, Consistency, Flattery, Reciprocity, Pity, Authority, Identity) and embed them into otherwise identical responses. Across six math benchmarks, we find that persuasive language leads LLM judges to assign inflated scores to incorrect solutions, by up to 8% on average, with Consistency causing the most severe distortion. Notably, increasing model size does not substantially mitigate this vulnerability. Further analysis demonstrates that combining multiple persuasion techniques amplifies the bias, and pairwise evaluation is likewise susceptible. Moreover, the persuasive effect persists under counter prompting strategies, highlighting a critical vulnerability in LLM-as-a-Judge pipelines and underscoring the need for robust defenses against persuasion-based attacks.
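The experimental idea of embedding a persuasion technique into an otherwise identical response and measuring the score change can be sketched as below. The template sentences are invented for illustration and are not the prompts used in the paper, and `toy_judge` is a deliberately biased stand-in for a real LLM judge call.

```python
# Hypothetical one-sentence templates for four of the seven techniques.
PERSUASION_TEMPLATES = {
    "Majority": "Most people who checked this solution agreed it is correct.",
    "Authority": "A mathematics professor has verified this solution.",
    "Flattery": "As a careful and insightful grader, you will see this is correct.",
    "Pity": "I worked very hard on this and a low score would be devastating.",
}

def embed_persuasion(response, technique):
    """Append the persuasion sentence; the mathematical content is unchanged."""
    return f"{response}\n\n{PERSUASION_TEMPLATES[technique]}"

def score_inflation(judge, question, response, technique):
    """Score difference between the persuaded and the plain response."""
    baseline = judge(question, response)
    persuaded = judge(question, embed_persuasion(response, technique))
    return persuaded - baseline

def toy_judge(question, response):
    """Stand-in judge that is swayed by appeals to authority (not a real model)."""
    return 2 + (1 if "professor" in response else 0)

question = "Solve 2x + 1 = 9."
wrong_answer = "x = 3"
for technique in PERSUASION_TEMPLATES:
    print(technique, score_inflation(toy_judge, question, wrong_answer, technique))
```

A robust judge would report zero inflation for every technique, since the solution itself never changes.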
[NLP-40] Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts
Quick read: The paper addresses the computational inefficiency of traditional Mixture of Experts (MoE) architectures, whose uniformly sized experts activate a fixed number of parameters regardless of input complexity. The key to the solution is Grove MoE, an architecture with experts of varying sizes and novel adjugate experts with a dynamic activation mechanism, which expands model capacity while keeping computational overhead manageable. The resulting GroveMoE-Base and GroveMoE-Inst models, both 33B-parameter LLMs, dynamically activate 3.14-3.28B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.
Link: https://arxiv.org/abs/2508.07785
Authors: Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Zenan Huang, Junbo Zhao, Lin Liu, Zhenzhong Lan, Bei Yu, Jianguo Li
Affiliations: Inclusion AI; The Chinese University of Hong Kong; Renmin University of China; Zhejiang University; Shanghai Jiao Tong University; Westlake University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate scalability by enabling sparse parameter activation. However, traditional MoE architecture uses homogeneous experts of a uniform size, activating a fixed number of parameters irrespective of input complexity and thus limiting computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture. This architecture features novel adjugate experts with a dynamic activation mechanism, enabling model capacity expansion while maintaining manageable computational overhead. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training. GroveMoE models dynamically activate 3.14-3.28B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.
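To put the sparsity figures in perspective, the short calculation below derives the share of parameters that is active, using only the two numbers quoted in the abstract (33B total, 3.14-3.28B activated). The constant and function names are illustrative.

```python
TOTAL_PARAMS_B = 33.0            # total parameters, in billions (from the abstract)
ACTIVE_PARAMS_B = (3.14, 3.28)   # activated range, in billions (from the abstract)

def active_fraction(active_b, total_b=TOTAL_PARAMS_B):
    """Share of the model's parameters that is activated."""
    return active_b / total_b

low, high = (active_fraction(a) for a in ACTIVE_PARAMS_B)
print(f"{low:.1%} to {high:.1%} of parameters active")
```

This prints `9.5% to 9.9% of parameters active`, so under one tenth of the model takes part in any single forward step.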
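[NLP-41] SASST: Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation
Quick read: The paper addresses semantic fragmentation and poorly timed output in simultaneous speech translation (SimulST), which arise from processing the input stream in real time. The key to the solution is a grammar-based chunking strategy that segments the input stream into semantically complete units by parsing dependency relations and punctuation features, which keeps chunks coherent and minimizes semantic fragmentation. Building on it, SASST (Syntax-Aware Simultaneous Speech Translation) is an end-to-end framework that integrates a frozen Whisper encoder with a decoder-only large language model (LLM); it dynamically outputs translation tokens or WAIT symbols to jointly optimize translation timing and content, and uses target-side reordering to handle word-order divergence. Experiments on the CoVoST2 multilingual corpus (En-De, Zh, Ja) show significant translation quality improvements and validate the effectiveness of syntactic structures in LLM-driven SimulST systems.
Link: https://arxiv.org/abs/2508.07781
Authors: Zeyu Yang, Lai Wei, Roman Koshkin, Xi Chen, Satoshi Nakamura
Affiliations: The Chinese University of Hong Kong, Shenzhen, China; Okinawa Institute of Science and Technology, Japan; Nara Institute of Science and Technology, Japan
Subjects: Computation and Language (cs.CL)
Comments: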
Abstract:This work proposes a grammar-based chunking strategy that segments input streams into semantically complete units by parsing dependency relations (e.g., noun phrase boundaries, verb-object structures) and punctuation features. The method ensures chunk coherence and minimizes semantic fragmentation. Building on this mechanism, we present SASST (Syntax-Aware Simultaneous Speech Translation), an end-to-end framework integrating a frozen Whisper encoder and a decoder-only LLM. The unified architecture dynamically outputs translation tokens or WAIT symbols to jointly optimize translation timing and content, with target-side reordering addressing word-order divergence. Experiments on the CoVoST2 multilingual corpus (En-De, Zh, Ja) demonstrate significant translation quality improvements across languages and validate the effectiveness of syntactic structures in LLM-driven SimulST systems.
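The wait-or-emit behaviour described in the abstract can be illustrated with a much simpler policy than the paper's. The sketch below uses punctuation alone as the chunk boundary, which is a stand-in for the dependency-based chunking the authors propose, and it emits the source chunk itself where a real system would emit translation tokens. All names are hypothetical.

```python
BOUNDARY_PUNCTUATION = {",", ".", "?", "!", ";", ":"}

def stream_chunks(tokens):
    """Yield "WAIT" for each token that does not close a chunk, and the
    completed chunk (a list of tokens) when a boundary token arrives."""
    buffer = []
    for token in tokens:
        buffer.append(token)
        if token in BOUNDARY_PUNCTUATION:
            yield list(buffer)     # chunk is complete: hand it to the translator
            buffer.clear()
        else:
            yield "WAIT"           # not enough context yet
    if buffer:                     # flush whatever is left when the stream ends
        yield list(buffer)

tokens = "when the meeting ends , we will leave .".split()
actions = list(stream_chunks(tokens))
for action in actions:
    print(action)
```

Waiting for a complete unit trades latency for coherence: the policy above never emits a fragment such as "when the", but it also delays output until the comma arrives.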
[NLP-42] Pareto Multi-Objective Alignment for Language Models KDD2025 ECML
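Quick read: The paper addresses multi-objective alignment (MOA) for large language models (LLMs): balancing multiple, often conflicting objectives (such as informativeness versus conciseness, or helpfulness versus creativity) flexibly and efficiently. Current alignment methods, primarily based on Reinforcement Learning from Human Feedback (RLHF), optimize a single reward function, which yields rigid behavior that fails to capture the diversity of human preferences. The paper proposes Pareto Multi-Objective Alignment (PAMA), whose key innovation is to transform multi-objective RLHF into a convex optimization problem with a closed-form solution. This reduces the O(n^2*d) complexity of traditional multi-objective optimization (MOO) methods to O(n), where n is the number of objectives and d is the number of model parameters (typically in the billions). PAMA is theoretically guaranteed to converge to a Pareto stationary point, where no objective can be improved without degrading at least one other, and is validated on language models ranging from 125M to 7B parameters.
Link: https://arxiv.org/abs/2508.07768
Authors: Qiang He, Setareh Maghsudi
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at ECML/PKDD 2025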
Abstract:Large language models (LLMs) are increasingly deployed in real-world applications that require careful balancing of multiple, often conflicting, objectives, such as informativeness versus conciseness, or helpfulness versus creativity. However, current alignment methods, primarily based on RLHF, optimize LLMs toward a single reward function, resulting in rigid behavior that fails to capture the complexity and diversity of human preferences. This limitation hinders the adaptability of LLMs to practical scenarios, making multi-objective alignment (MOA) a critical yet underexplored area. To bridge this gap, we propose Pareto Multi-Objective Alignment (PAMA), a principled and computationally efficient algorithm designed explicitly for MOA in LLMs. In contrast to computationally prohibitive multi-objective optimization (MOO) methods, PAMA transforms multi-objective RLHF into a convex optimization with a closed-form solution, significantly enhancing scalability. Traditional MOO approaches suffer from prohibitive O(n^2*d) complexity, where d represents the number of model parameters, typically in the billions for LLMs, rendering direct optimization infeasible. PAMA reduces this complexity to O(n) where n is the number of objectives, enabling optimization to be completed within milliseconds. We provide theoretical guarantees that PAMA converges to a Pareto stationary point, where no objective can be improved without degrading at least one other. Extensive experiments across language models ranging from 125M to 7B parameters demonstrate PAMA’s robust and effective MOA capabilities, aligning with its theoretical advantages. PAMA provides a highly efficient solution to the MOA problem that was previously considered intractable, offering a practical and theoretically grounded approach to aligning LLMs with diverse human values, paving the way for versatile and adaptable real-world AI deployments.
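For intuition about Pareto stationarity, the sketch below implements the classic two-objective min-norm combination of gradients used in MGDA-style multi-objective optimization. This is explicitly not PAMA's closed form, which the abstract does not spell out; it is a well-known textbook case chosen because it also has a closed-form solution. Gradients are plain Python lists and all names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def min_norm_weight(g1, g2):
    """Weight alpha in [0, 1] minimizing ||alpha*g1 + (1 - alpha)*g2||^2."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:               # identical gradients: any alpha gives the same result
        return 0.5
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    return min(1.0, max(0.0, alpha))   # clip to the segment between g1 and g2

def common_direction(g1, g2):
    """Min-norm convex combination of the two objective gradients."""
    alpha = min_norm_weight(g1, g2)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]

orthogonal = common_direction([1.0, 0.0], [0.0, 1.0])
conflicting = common_direction([1.0, 0.0], [-1.0, 0.0])
print(orthogonal, conflicting)
```

For orthogonal gradients the combined direction improves both objectives. For directly opposed gradients it collapses to the zero vector, which is the Pareto stationary case: no step improves one objective without hurting the other. Each inner product here touches all d gradient entries, which suggests why methods that need pairwise products across n objectives become expensive when d is in the billions.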
[NLP-43] Exploring Causal Effect of Social Bias on Faithfulness Hallucinations in Large Language Models CIKM2025
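Quick read: The paper addresses faithfulness hallucinations in large language models (LLMs) and asks whether social bias is a causal factor behind them. This causal relationship had not been studied systematically, and confounders in the context make it hard to isolate the causal link between bias states and hallucinations. The key to the solution is a Structural Causal Model (SCM) that formalizes and validates the causal mechanism, together with bias interventions that control confounders, and the Bias Intervention Dataset (BID), which covers a range of social bias scenarios and enables precise measurement of causal effects. Experiments show that social biases are significant causes of faithfulness hallucinations, that each bias state affects hallucinations in a different direction, and that bias has a subtle yet significant causal effect on unfairness hallucinations.
Link: https://arxiv.org/abs/2508.07753
Authors: Zhenliang Zhang, Junzhe Zhang, Xinyu Hu, HuiXuan Zhang, Xiaojun Wan
Affiliations: Wangxuan Institute of Computer Technology, Peking University; School of Software and Microelectronics, Peking University
Subjects: Computation and Language (cs.CL)
Comments: Accepted by CIKM 2025 (Full Paper)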
Abstract:Large language models (LLMs) have achieved remarkable success in various tasks, yet they remain vulnerable to faithfulness hallucinations, where the output does not align with the input. In this study, we investigate whether social bias contributes to these hallucinations, a causal relationship that has not been explored. A key challenge is controlling confounders within the context, which complicates the isolation of causality between bias states and hallucinations. To address this, we utilize the Structural Causal Model (SCM) to establish and validate the causality and design bias interventions to control confounders. In addition, we develop the Bias Intervention Dataset (BID), which includes various social biases, enabling precise measurement of causal effects. Experiments on mainstream LLMs reveal that biases are significant causes of faithfulness hallucinations, and the effect of each bias state differs in direction. We further analyze the scope of these causal effects across various models, specifically focusing on unfairness hallucinations, which are primarily targeted by social bias, revealing the subtle yet significant causal effect of bias on hallucination generation.
[NLP-44] Learning to Align Aligning to Learn: A Unified Approach for Self-Optimized Alignment
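Quick read: The paper addresses two challenges in language model alignment: supervised fine-tuning (SFT) is constrained by its reliance on offline policy trajectories, and reinforcement learning (RL), although it enables exploratory policy optimization, suffers from low sample efficiency and a strong dependence on high-quality base models. The key to the solution is GRAO (Group Relative Alignment Optimization), a unified framework with three innovations: 1) a multi-sample generation strategy that enables comparative quality assessment via reward feedback; 2) a novel Group Direct Alignment Loss that uses intra-group relative advantage weighting; 3) reference-aware parameter updates guided by pairwise preference dynamics. The authors provide theoretical convergence guarantees and sample efficiency advantages, and report that GRAO outperforms the SFT, DPO, PPO, and GRPO baselines on complex human alignment tasks.
Link: https://arxiv.org/abs/2508.07750
Authors: Haowen Wang, Yun Yue, Zhiling Ye, Shuowen Zhang, Lei Fan, Jiaxin Liang, Jiadi Jiang, Cheng Wei, Jingyuan Deng, Xudong Han, Ji Li, Chunxiao Guo, Peng Wei, Jian Wang, Jinjie Gu
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 5 figures, 7 tables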
Abstract:Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by offline policy trajectory. In contrast, RL (reinforcement learning) facilitates exploratory policy optimization, but suffers from low sample efficiency and stringent dependency on high-quality base models. To address these dual challenges, we propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the respective strengths of SFT and RL through three key innovations: 1) A multi-sample generation strategy enabling comparative quality assessment via reward feedback; 2) A novel Group Direct Alignment Loss formulation leveraging intra-group relative advantage weighting; 3) Reference-aware parameter updates guided by pairwise preference dynamics. Our theoretical analysis establishes GRAO’s convergence guarantees and sample efficiency advantages over conventional approaches. Comprehensive evaluations across complex human alignment tasks demonstrate GRAO’s superior performance, achieving 57.70%, 17.65%, 7.95%, and 5.18% relative improvements over SFT, DPO, PPO, and GRPO baselines, respectively. This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models.
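The phrase "intra-group relative advantage" can be illustrated with the standardization commonly used in GRPO-style methods: each sampled response's reward is compared with the mean and spread of its own group. The sketch below shows only that normalization; how GRAO turns these values into loss weights is not specified in the abstract, so nothing here should be read as the authors' formulation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)                  # population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for three responses sampled for a single prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Responses that beat the group average get a positive advantage and the rest a negative one, so no separate value model is needed to judge what counts as good for a given prompt. The `eps` term keeps the division finite when all rewards in a group are equal.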
[NLP-45] What am I missing here?: Evaluating Large Language Models for Masked Sentence Prediction
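Quick read: The paper examines a limitation of Transformer-based large language models (LLMs) in long-range coherence: because they rely on Next Token Prediction (NTP), it is unclear how well they can reconstruct text at the sentence level within a document. The key to the approach is Masked Sentence Prediction (MSP) as an evaluation task: a randomly chosen sentence is removed and the model must infill it, across three domains, namely narrative (ROCStories), procedural (Recipe1M), and expository (Wikipedia) text. Models are assessed on fidelity (similarity to the original sentence) and cohesiveness (fit within the surrounding context). Although the commercial LLMs GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash perform strongly on other tasks, they are poor at predicting masked sentences in low-structured domains, which reveals a gap in global planning and contextual consistency.
Link: https://arxiv.org/abs/2508.07702
Authors: Charlie Wyatt, Aditya Joshi, Flora Salim
Affiliations: University of New South Wales
Subjects: Computation and Language (cs.CL)
Comments: Under Review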
Abstract:Transformer-based models primarily rely on Next Token Prediction (NTP), which predicts the next token in a sequence based on the preceding context. However, NTP’s focus on single-token prediction often limits a model’s ability to plan ahead or maintain long-range coherence, raising questions about how well LLMs can predict longer contexts, such as full sentences within structured documents. While NTP encourages local fluency, it provides no explicit incentive to ensure global coherence across sentence boundaries, an essential skill for reconstructive or discursive tasks. To investigate this, we evaluate three commercial LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash) on Masked Sentence Prediction (MSP), the task of infilling a randomly removed sentence, in three domains: ROCStories (narrative), Recipe1M (procedural), and Wikipedia (expository). We assess both fidelity (similarity to the original sentence) and cohesiveness (fit within the surrounding context). Our key finding reveals that commercial LLMs, despite their superlative performance in other tasks, are poor at predicting masked sentences in low-structured domains, highlighting a gap in current model capabilities.
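A minimal sketch of how an MSP test item could be built and scored, assuming the document is already split into sentences. The mask token, the function names, and the use of token-level F1 as a fidelity proxy are assumptions made for illustration; the abstract does not say which similarity measure the authors use.

```python
import random
from collections import Counter

def mask_random_sentence(sentences, seed=0, mask_token="<MASK>"):
    """Replace one randomly chosen sentence with a mask token.

    Returns the masked document, the removed sentence, and its index.
    """
    rng = random.Random(seed)
    index = rng.randrange(len(sentences))
    masked = sentences[:index] + [mask_token] + sentences[index + 1:]
    return " ".join(masked), sentences[index], index

def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference sentence."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

story = [
    "Mia wanted a new bike.",
    "She saved her allowance for months.",
    "She bought a bike.",
    "Mia rode it to school every day.",
]
masked_text, target, index = mask_random_sentence(story)
print(masked_text)
print(token_f1("she bought a red bike", "she bought a bike"))
```

A lexical score like this captures fidelity only; cohesiveness, the fit with the surrounding context, needs a separate judgment because a sentence can differ from the original and still fit the story.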
[NLP-46] LoSemB: Logic-Guided Semantic Bridging for Inductive Tool Retrieval
Quick read: The paper addresses inductive tool retrieval in tool learning for large language models (LLMs): when a model encounters tools unseen during training, performance degrades because of the large distribution shift and the vulnerability of similarity-based retrieval. The key to the solution is the Logic-Guided Semantic Bridging framework (LoSemB), which has two core components: 1) a logic-based embedding alignment module that mitigates the distribution shift between unseen and known tools; 2) a relational augmented retrieval mechanism that makes retrieval of unseen tools more robust. Together they transfer existing knowledge to new tools without costly retraining.
Link: https://arxiv.org/abs/2508.07690
Authors: Luyao Zhuang, Qinggang Zhang, Huachi Zhou, Juhua Liu, Qing Li, Xiao Huang
Affiliations: The Hong Kong Polytechnic University; Wuhan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Tool learning has emerged as a promising paradigm for large language models (LLMs) to solve many real-world tasks. Nonetheless, with the tool repository rapidly expanding, it is impractical to contain all tools within the limited input length of LLMs. To alleviate these issues, researchers have explored incorporating a tool retrieval module to select the most relevant tools or represent tools as unique tokens within LLM parameters. However, most state-of-the-art methods are under transductive settings, assuming all tools have been observed during training. Such a setting deviates from reality as the real-world tool repository is evolving and incorporates new tools frequently. When dealing with these unseen tools, which refer to tools not encountered during the training phase, these methods are limited by two key issues, including the large distribution shift and the vulnerability of similarity-based retrieval. To this end, inspired by human cognitive processes of mastering unseen tools through discovering and applying the logical information from prior experience, we introduce a novel Logic-Guided Semantic Bridging framework for inductive tool retrieval, namely, LoSemB, which aims to mine and transfer latent logical information for inductive tool retrieval without costly retraining. Specifically, LoSemB contains a logic-based embedding alignment module to mitigate distribution shifts and implements a relational augmented retrieval mechanism to reduce the vulnerability of similarity-based retrieval. Extensive experiments demonstrate that LoSemB achieves advanced performance in inductive settings while maintaining desirable effectiveness in the transductive setting.
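For context, the similarity-based retrieval that the abstract describes as vulnerable usually amounts to ranking tools by cosine similarity between a query embedding and tool embeddings. The sketch below shows that baseline only, with hand-written three-dimensional vectors standing in for real embeddings; the tool names and vectors are invented, and LoSemB's alignment and relational mechanisms are not implemented here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot_product = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot_product / norm if norm else 0.0

def retrieve_tools(query_vec, tool_vecs, k=2):
    """Return the names of the k tools most similar to the query."""
    ranked = sorted(
        tool_vecs,
        key=lambda name: cosine(query_vec, tool_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical tool embeddings.
tools = {
    "weather_api": [0.9, 0.1, 0.0],
    "calculator": [0.0, 1.0, 0.0],
    "flight_search": [0.7, 0.0, 0.7],
}
top = retrieve_tools([1.0, 0.0, 0.2], tools, k=2)
print(top)
```

The ranking depends entirely on where each embedding lands, so a newly added tool whose embedding comes from a shifted distribution can be misranked even when it is the right choice, which is the failure mode the inductive setting exposes.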
[NLP-47] GLiClass: Generalist Lightweight Model for Sequence Classification Tasks
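Quick read: The paper addresses the difficulty of achieving both efficiency and accuracy in classification, especially in zero-shot and few-shot scenarios. Generative LLMs suffer from inconsistent instruction following and computational inefficiency; embedding-based methods are efficient but struggle with complex logical and semantic constraints; cross-encoders must process text-label pairs one at a time, which is slow for large label sets. The key to the solution is GLiClass, a sequence classification method adapted from the GLiNER architecture that retains efficiency comparable to embedding-based methods while keeping the flexibility needed for zero-shot and few-shot learning. The authors also adapt proximal policy optimization (PPO) to multi-label text classification, which enables training classifiers from human feedback or in data-sparse conditions.
Link: https://arxiv.org/abs/2508.07662
Authors: Ihor Stepanov, Mykhailo Shtopko, Dmytro Vodianytskyi, Oleksandr Lukashov, Alexander Yavorskyi, Mykyta Yaroshenko
Affiliations: Knowledgator Engineering
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 14 pages, 7 tables, 2 figures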
Abstract:Classification is one of the most widespread tasks in AI applications, serving often as the first step in filtering, sorting, and categorizing data. Since modern AI systems must handle large volumes of input data and early pipeline stages can propagate errors downstream, achieving high efficiency and accuracy is critical. Moreover, classification requirements can change dynamically based on user needs, necessitating models with strong zero-shot capabilities. While generative LLMs have become mainstream for zero-shot classification due to their versatility, they suffer from inconsistent instruction following and computational inefficiency. Cross-encoders, commonly used as rerankers in RAG pipelines, face a different bottleneck: they must process text-label pairs sequentially, significantly reducing efficiency with large label sets. Embedding-based approaches offer good efficiency but struggle with complex scenarios involving logical and semantic constraints. We propose GLiClass, a novel method that adapts the GLiNER architecture for sequence classification tasks. Our approach achieves strong accuracy and efficiency comparable to embedding-based methods, while maintaining the flexibility needed for zero-shot and few-shot learning scenarios. Additionally, we adapted proximal policy optimization (PPO) for multi-label text classification, enabling training classifiers in data-sparse conditions or from human feedback.
[NLP-48] Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents
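[Quick read]: This paper addresses the weak generalization of Vision-and-Language Navigation (VLN) models to unseen scenarios, particularly when navigation requires complex spatial and temporal reasoning. The key to the solution is SkillNav, a framework that decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. A zero-shot vision-language model (VLM)-based router selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. This modular design reaches state-of-the-art performance on the R2R benchmark and generalizes strongly to the novel instruction styles and unseen environments of GSA-R2R.
Link: https://arxiv.org/abs/2508.07642
Authors: Tianyi Ma,Yue Zhang,Zehao Wang,Parisa Kordjamshidi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 5 figures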
Abstract:Vision-and-Language Navigation (VLN) poses significant challenges in enabling agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. We then introduce a novel zero-shot Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav achieves a new state-of-the-art performance on the R2R benchmark and demonstrates strong generalization to the GSA-R2R benchmark that includes novel instruction styles and unseen environments.
[NLP-49] InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information
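[Quick read]: This paper addresses the limited ability of vision-language models (VLMs) to reason across multiple related charts, a capability central to real-world settings such as scientific reporting, financial analysis, and public policy dashboards. Existing benchmarks focus on isolated, visually uniform charts and overlook multi-chart understanding. The key to the solution is InterChart, a diagnostic benchmark organized into three tiers of increasing difficulty: factual reasoning over individual charts, integrative analysis across synthetically aligned chart sets, and semantic inference over visually complex real-world chart pairs. Experiments show that the accuracy of leading open and closed-source VLMs drops steeply as chart complexity increases, and that models do better when multi-entity charts are decomposed into simpler visual units. These results expose systematic limits in cross-chart integration and provide a rigorous evaluation framework for improving multimodal reasoning in multi-visual environments.
Link: https://arxiv.org/abs/2508.07630
Authors: Anirudh Iyengar Kaniyar Narayana Iyengar,Srija Mukhopadhyay,Adnan Qidwai,Shubhankar Singh,Dan Roth,Vivek Gupta
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 6 figures, 12 tables. Benchmark dataset and evaluation code will be publicly made available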
Abstract:We introduce InterChart, a diagnostic benchmark that evaluates how well vision-language models (VLMs) reason across multiple related charts, a task central to real-world applications such as scientific reporting, financial analysis, and public policy dashboards. Unlike prior benchmarks focusing on isolated, visually uniform charts, InterChart challenges models with diverse question types ranging from entity inference and trend correlation to numerical estimation and abstract multi-step reasoning grounded in 2-3 thematically or structurally related charts. We organize the benchmark into three tiers of increasing difficulty: (1) factual reasoning over individual charts, (2) integrative analysis across synthetically aligned chart sets, and (3) semantic inference over visually complex, real-world chart pairs. Our evaluation of state-of-the-art open and closed-source VLMs reveals consistent and steep accuracy declines as chart complexity increases. We find that models perform better when we decompose multi-entity charts into simpler visual units, underscoring their struggles with cross-chart integration. By exposing these systematic limitations, InterChart provides a rigorous framework for advancing multimodal reasoning in complex, multi-visual environments.
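[NLP-50] Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
[Quick read]: This paper addresses the difficulty of reproducing high-performance reasoning models. The core challenges are incomplete disclosure of training details and the way clipping in reinforcement learning (RL) suppresses exploration signals. The authors present the Klear-Reasoner model together with a complete post-training workflow, from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to RL. The key contributions are: 1) the finding that a small number of high-quality data sources works better than a large number of diverse ones, and that difficult samples can improve results without accuracy filtering; 2) Gradient-Preserving Clipping Policy Optimization (GPPO), which addresses the tendency of conventional clipping to ignore suboptimal trajectories and suppress exploration signals by gently backpropagating gradients from clipped tokens, strengthening exploration and making learning from negative samples more efficient.
Link: https://arxiv.org/abs/2508.07629
Authors: Zhenpeng Su,Leiyu Pan,Xue Bai,Dening Liu,Guanting Dong,Jiaming Huang,Wenping Hu,Guorui Zhou
Affiliations: Klear Team, Kuaishou Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: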
Abstract:We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although there are already many excellent works related to inference models in the current community, there are still many problems with reproducing high-performance inference models due to incomplete disclosure of training details. This report provides an in-depth analysis of the reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: Clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO) that gently backpropagates gradients from clipped tokens. GPPO not only enhances the model’s exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6.
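The clipping issue described in the abstract can be made concrete with a small sketch. It compares the per-token gradient coefficient (the derivative of the objective with respect to the token log-probability) under the standard PPO clipped objective and under a gradient-preserving variant. The abstract does not state the GPPO formula, so the straight-through form used here (the forward value stays at the clipping bound while a bounded gradient still flows) is an assumption, as are the function names and the epsilon values.

```python
def ppo_clip_grad_coeff(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Gradient w.r.t. log pi for one token under the standard PPO clipped objective.

    objective = min(ratio * A, clip(ratio, 1 - eps_low, 1 + eps_high) * A)
    Because d(ratio)/d(log pi) = ratio, the unclipped branch has gradient
    ratio * A, while the clipped branch is a constant with zero gradient.
    """
    clipped_high = advantage > 0 and ratio > 1 + eps_high
    clipped_low = advantage < 0 and ratio < 1 - eps_low
    if clipped_high or clipped_low:
        return 0.0
    return ratio * advantage


def gradient_preserving_grad_coeff(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Assumed gradient-preserving variant (hypothetical form, not taken from the paper).

    A clipped token contributes bound / stop_gradient(ratio) * ratio * A.
    Its value equals bound * A, and its gradient w.r.t. log pi is also
    bound * A instead of zero.
    """
    if advantage > 0 and ratio > 1 + eps_high:
        return (1 + eps_high) * advantage
    if advantage < 0 and ratio < 1 - eps_low:
        return (1 - eps_low) * advantage
    return ratio * advantage


# A token whose ratio drifted above the upper bound with a positive advantage.
standard = ppo_clip_grad_coeff(ratio=1.5, advantage=1.0)
preserved = gradient_preserving_grad_coeff(ratio=1.5, advantage=1.0)
print(standard, preserved)
```

Under these assumptions the standard objective gives the clipped token no learning signal at all, whereas the variant keeps a signal whose size is capped by the clipping bound rather than by the raw ratio.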
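[NLP-51] ThinkTuning: Instilling Cognitive Reflections without Distillation
[Quick read]: This paper asks how to train large language models (LLMs) that do not exhibit multi-step reasoning in their base state so that they develop "thinking" behavior, i.e., self-reflection and step-by-step reasoning. Prior work suggests that reinforcement learning (RL) only draws out reasoning behaviors already latent in the base model rather than instilling new ones. The authors propose ThinkTuning, an interactive training approach based on GRPO (Group Relative Policy Optimization) in which a teacher model of the same size gives implicit supervision on the student's reasoning: the teacher poses a problem, the student attempts an answer, and the teacher then provides corrective feedback that steers the student toward the correct solution. This classroom-style "problem, attempt, feedback" loop improves the student's reasoning, with an average gain of 3.85% over zero-shot baselines across benchmarks and gains of 2.08%, 2.23%, and 3.99% over the vanilla GRPO baseline on MATH-500, AIME, and GPQA-Diamond, respectively.
Link: https://arxiv.org/abs/2508.07616
Authors: Aswin RRV,Jacob Dineen,Divij Handa,Md Nayem Uddin,Mihir Parmar,Chitta Baral,Ben Zhou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 15 pages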
Abstract:Recent advances in test-time scaling have led to the emergence of thinking LLMs that exhibit self-reflective behaviors and multi-step reasoning. While RL drives this self-improvement paradigm, a recent study (Gandhi et al., 2025) shows that RL alone does not truly instill these new reasoning abilities - it merely draws out behaviors already present in the base models. This raises a question: How can we train the models that don’t exhibit such thinking behavior to develop it in the first place? To this end, we propose ThinkTuning, a GRPO-based interactive training approach where we augment the rollouts of a student model with the guidance from a teacher model. A simple idea from classroom practice inspires our method: a teacher poses a problem, lets the student try an answer, then gives corrective feedback – enough to point the mind in the right direction and then show the solution. Each piece of feedback reshapes the student’s thoughts, leading them to arrive at the correct solution. Similarly, we find that this type of implicit supervision through feedback from a teacher model of the same size improves the reasoning capabilities of the student model. In particular, on average, our method shows a 3.85% improvement over zero-shot baselines across benchmarks, and on MATH-500, AIME and GPQA-Diamond it shows 2.08%, 2.23% and 3.99% improvements over the vanilla-GRPO baseline. Source code is available at this https URL.
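Since the method builds on GRPO, a minimal sketch of the group-relative advantage that gives GRPO its name may help. Each rollout's reward is normalized against the other rollouts sampled for the same prompt. The function name, the use of the population standard deviation, and the `eps` term are assumptions for illustration; the teacher-guided augmentation of rollouts that ThinkTuning adds is not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    advantage_i = (reward_i - group mean) / (group std + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt: two judged correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When every rollout in a group receives the same reward, all advantages are zero and the group provides no learning signal, which is one reason extra guidance on hard prompts can matter.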
[NLP-52] Keyword-Centric Prompting for One-Shot Event Detection with Self-Generated Rationale Enhancements ECAI2025
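[Quick read]: This paper addresses the over-interpretation that large language models (LLMs) show in event detection because they lack an accurate understanding of event triggers. In the one-shot setting in particular, conventional in-context learning (ICL) cannot correct this bias through examples alone. The key to the solution is KeyCP++, a keyword-centric chain-of-thought prompting approach that automatically annotates the logical gaps between the input text and the detection results in the demonstrations. It builds a trigger discrimination prompting template that uses the exemplary triggers (keywords) as anchors, lets the LLM propose candidate triggers, and has it justify each candidate. The resulting propose-and-judge rationales reduce over-reliance on the keywords and promote the learning of detection rules.
Link: https://arxiv.org/abs/2508.07598
Authors: Ziheng Li,Zhi-Hong Deng
Affiliations: Peking University
Categories: Computation and Language (cs.CL)
Comments: ECAI 2025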
Abstract:Although the LLM-based in-context learning (ICL) paradigm has demonstrated considerable success across various natural language processing tasks, it encounters challenges in event detection. This is because LLMs lack an accurate understanding of event triggers and tend to make over-interpretation, which cannot be effectively corrected through in-context examples alone. In this paper, we focus on the most challenging one-shot setting and propose KeyCP++, a keyword-centric chain-of-thought prompting approach. KeyCP++ addresses the weaknesses of conventional ICL by automatically annotating the logical gaps between input text and detection results for the demonstrations. Specifically, to generate in-depth and meaningful rationale, KeyCP++ constructs a trigger discrimination prompting template. It incorporates the exemplary triggers (a.k.a keywords) into the prompt as the anchor to simply trigger profiling, let LLM propose candidate triggers, and justify each candidate. These propose-and-judge rationales help LLMs mitigate over-reliance on the keywords and promote detection rule learning. Extensive experiments demonstrate the effectiveness of our approach, showcasing significant advancements in one-shot event detection.
[NLP-53] IBPS: Indian Bail Prediction System
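[Quick read]: This paper addresses the subjectivity, delays, and inconsistency of bail decisions in the Indian judicial system, where over 75% of the prison population are undertrial prisoners, many from socioeconomically disadvantaged backgrounds. The key to the solution is the Indian Bail Prediction System (IBPS), an AI framework trained on a dataset of 150,430 High Court bail judgments. It fine-tunes a large language model (LLM) with parameter-efficient techniques and adds retrieval-augmented generation (RAG) and statutory context to predict bail outcomes and generate legally grounded rationales. Experiments show that models fine-tuned with statutory knowledge significantly outperform baselines in accuracy and explanation quality and generalize well to a test set independently annotated by legal experts, offering transparent, scalable, and reproducible data-driven support for judicial efficiency and procedural fairness.
Link: https://arxiv.org/abs/2508.07592
Authors: Puspesh Kumar Srivastava,Uddeshya Raj,Praveen Patel,Shubham Kumar Nigam,Noel Shallum,Arnab Bhattacharya
Affiliations: IIT Kanpur; Symbiosis Law School Pune
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: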
Abstract:Bail decisions are among the most frequently adjudicated matters in Indian courts, yet they remain plagued by subjectivity, delays, and inconsistencies. With over 75% of India’s prison population comprising undertrial prisoners, many from socioeconomically disadvantaged backgrounds, the lack of timely and fair bail adjudication exacerbates human rights concerns and contributes to systemic judicial backlog. In this paper, we present the Indian Bail Prediction System (IBPS), an AI-powered framework designed to assist in bail decision-making by predicting outcomes and generating legally sound rationales based solely on factual case attributes and statutory provisions. We curate and release a large-scale dataset of 150,430 High Court bail judgments, enriched with structured annotations such as age, health, criminal history, crime category, custody duration, statutes, and judicial reasoning. We fine-tune a large language model using parameter-efficient techniques and evaluate its performance across multiple configurations, with and without statutory context, and with RAG. Our results demonstrate that models fine-tuned with statutory knowledge significantly outperform baselines, achieving strong accuracy and explanation quality, and generalize well to a test set independently annotated by legal experts. IBPS offers a transparent, scalable, and reproducible solution to support data-driven legal assistance, reduce bail delays, and promote procedural fairness in the Indian judicial system.
[NLP-54] From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR
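[Quick read]: This paper addresses the poorly understood mechanisms behind the exploration behavior of large language models (LLMs) under Reinforcement Learning with Verifiable Rewards (RLVR), with a focus on exploration while generating and refining complex reasoning chains. The key to the solution is a systematic study along several dimensions, including exploration space shaping, the entropy-performance exchange, and RL performance optimization. The authors develop quantitative metrics to characterize the boundaries of LLM capability and analyze the entropy-performance exchange across training stages, individual instances, and token-level patterns, providing a theoretical basis and practical guidance for turning exploration gains into measurable performance improvements.
Link: https://arxiv.org/abs/2508.07534
Authors: Jia Deng,Jie Chen,Zhipeng Chen,Daixuan Cheng,Fei Bai,Beichen Zhang,Yinqian Min,Yanzipeng Gao,Wayne Xin Zhao,Ji-Rong Wen
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China
Categories: Computation and Language (cs.CL)
Comments: 27 pages, 25 figures. arXiv admin note: text overlap with arXiv:2508.02260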
Abstract:Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLVR leverages rule-based feedback to guide LLMs in generating and refining complex reasoning chains – a process critically dependent on effective exploration strategies. While prior work has demonstrated RLVR’s empirical success, the fundamental mechanisms governing LLMs’ exploration behaviors remain underexplored. This technical report presents a systematic investigation of exploration capacities in RLVR, covering four main aspects: (1) exploration space shaping, where we develop quantitative metrics to characterize LLMs’ capability boundaries; (2) entropy-performance exchange, analyzed across training stages, individual instances, and token-level patterns; and (3) RL performance optimization, examining methods to effectively translate exploration gains into measurable improvements. By unifying previously identified insights with new empirical evidence, this work aims to provide a foundational framework for advancing RLVR systems.
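The entropy side of the entropy-performance exchange can be illustrated with a short sketch that computes token-level entropy from next-token logits and averages it over a sequence. This is the standard Shannon entropy of a softmax distribution; the function names and the toy logits are assumptions, and the report's own metrics may be defined differently.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over next-token logits."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_sequence_entropy(per_token_logits):
    """Average token-level entropy over one generated sequence."""
    return sum(token_entropy(l) for l in per_token_logits) / len(per_token_logits)


# A confident position (low entropy) versus an uncertain position (high entropy).
confident = token_entropy([10.0, 0.0, 0.0, 0.0])
uncertain = token_entropy([0.0, 0.0, 0.0, 0.0])
print(confident, uncertain)
```

High-entropy positions are where the policy still spreads probability over several continuations, so tracking this quantity per token, per instance, and per training stage gives one view of how much exploration remains.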
[NLP-55] Conversational DNA: A New Visual Language for Understanding Dialogue Structure in Human and AI
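[Quick read]: This paper addresses the difficulty traditional conversation analysis has in revealing the deeper structure and interaction patterns of dialogue. As human-AI interaction becomes common, statistical summaries fail to capture key features such as linguistic complexity, emotional trajectories, and topic coherence. The key to the solution is Conversational DNA, a new visual language that uses biological metaphors to treat dialogue as a living system with interpretable structure: linguistic complexity is shown by strand thickness, emotional change by color gradients, conversational relevance by connecting elements, and topic coherence by helical patterns that maintain structural integrity. The approach supports visual comparison and understanding of therapeutic conversations and human-AI dialogues, revealing interaction patterns that traditional methods miss.
Link: https://arxiv.org/abs/2508.07520
Authors: Baihan Lin
Affiliations: Berkman Klein Center for Internet & Society; Harvard University
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: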
Abstract:What if the patterns hidden within dialogue reveal more about communication than the words themselves? We introduce Conversational DNA, a novel visual language that treats any dialogue – whether between humans, between human and AI, or among groups – as a living system with interpretable structure that can be visualized, compared, and understood. Unlike traditional conversation analysis that reduces rich interaction to statistical summaries, our approach reveals the temporal architecture of dialogue through biological metaphors. Linguistic complexity flows through strand thickness, emotional trajectories cascade through color gradients, conversational relevance forms through connecting elements, and topic coherence maintains structural integrity through helical patterns. Through exploratory analysis of therapeutic conversations and historically significant human-AI dialogues, we demonstrate how this visualization approach reveals interaction patterns that traditional methods miss. Our work contributes a new creative framework for understanding communication that bridges data visualization, human-computer interaction, and the fundamental question of what makes dialogue meaningful in an age where humans increasingly converse with artificial minds.
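[NLP-56] Word Clouds as Common Voices: LLM-Assisted Visualization of Participant-Weighted Themes in Qualitative Interviews
[Quick read]: This paper addresses the poor fit of traditional frequency-based word clouds for analyzing conversational qualitative interviews: they surface filler words, ignore paraphrase, and fragment semantically related ideas, which limits their value as fast, interpretable overviews in early-stage analysis. The key to the solution is ThemeClouds, an open-source visualization tool that uses large language models (LLMs) to identify concept-level themes in dialogue transcripts and then counts how many unique participants mention each theme. The resulting word cloud is weighted by breadth of mention rather than raw term frequency. The approach surfaces more actionable user concerns (for example, about recording-device configurations) than frequency clouds and topic-modeling baselines such as LDA and BERTopic.
Link: https://arxiv.org/abs/2508.07517
Authors: Joseph T. Colonel,Baihan Lin
Affiliations: Icahn School of Medicine at Mount Sinai
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: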
Abstract:Word clouds are a common way to summarize qualitative interviews, yet traditional frequency-based methods often fail in conversational contexts: they surface filler words, ignore paraphrase, and fragment semantically related ideas. This limits their usefulness in early-stage analysis, when researchers need fast, interpretable overviews of what participants actually said. We introduce ThemeClouds, an open-source visualization tool that uses large language models (LLMs) to generate thematic, participant-weighted word clouds from dialogue transcripts. The system prompts an LLM to identify concept-level themes across a corpus and then counts how many unique participants mention each topic, yielding a visualization grounded in breadth of mention rather than raw term frequency. Researchers can customize prompts and visualization parameters, providing transparency and control. Using interviews from a user study comparing five recording-device configurations (31 participants; 155 transcripts, Whisper ASR), our approach surfaces more actionable device concerns than frequency clouds and topic-modeling baselines (e.g., LDA, BERTopic). We discuss design trade-offs for integrating LLM assistance into qualitative workflows, implications for interpretability and researcher agency, and opportunities for interactive analyses such as per-condition contrasts ("diff clouds").
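The difference between participant-weighted counts and raw frequency can be shown with a small sketch. It assumes the LLM theme-labeling step has already produced `(participant_id, theme)` pairs; that step, the function name, and the example data are hypothetical and are not taken from the ThemeClouds code.

```python
from collections import Counter, defaultdict


def participant_weighted_counts(mentions):
    """Count how many unique participants mention each theme.

    mentions: iterable of (participant_id, theme) pairs, one per mention.
    """
    participants_by_theme = defaultdict(set)
    for participant_id, theme in mentions:
        participants_by_theme[theme].add(participant_id)
    return {theme: len(ids) for theme, ids in participants_by_theme.items()}


# Hypothetical labeled mentions: P1 repeats one concern three times.
mentions = [
    ("P1", "battery life"),
    ("P1", "battery life"),
    ("P1", "battery life"),
    ("P2", "microphone placement"),
    ("P3", "microphone placement"),
]

raw_frequency = Counter(theme for _, theme in mentions)
breadth = participant_weighted_counts(mentions)
print(raw_frequency)
print(breadth)
```

In this toy data raw frequency ranks "battery life" first because one participant repeated it, while the participant-weighted count ranks "microphone placement" first because two different people raised it.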
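[NLP-57] Augmenting Bias Detection in LLMs Using Topological Data Analysis
[Quick read]: Although many bias detection methods can measure the level of bias in large language models (LLMs), they cannot locate which internal components are responsible for bias toward specific groups. This paper addresses that gap. The key to the solution is to use topological data analysis (TDA) to identify the attention heads in GPT-2 that contribute to the misrepresentation of identity groups in the StereoSet dataset. The study finds that bias for particular categories, such as gender or profession, is concentrated in attention heads that act as hot spots, and that the proposed metric can further locate the heads that capture bias for a specific group within a bias category, offering an interpretable route toward targeted de-biasing in future work.
Link: https://arxiv.org/abs/2508.07516
Authors: Keshav Varadarajan,Tananun Songdechakraiwut
Affiliations: Duke University
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 9 figures, 4 tables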
Abstract:Recently, many bias detection methods have been proposed to determine the level of bias a large language model captures. However, tests to identify which parts of a large language model are responsible for bias towards specific groups remain underdeveloped. In this study, we present a method using topological data analysis to identify which heads in GPT-2 contribute to the misrepresentation of identity groups present in the StereoSet dataset. We find that biases for particular categories, such as gender or profession, are concentrated in attention heads that act as hot spots. The metric we propose can also be used to determine which heads capture bias for a specific group within a bias category, and future work could extend this method to help de-bias large language models.
[NLP-58] Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy
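[Quick read]: This paper addresses the difficulty of evaluating large language models (LLMs) on the strategy game Diplomacy without fine-tuning or specialized training. Earlier work was constrained by the very high information density of the game state and the high variance of match outcomes, and typically relied on frontier models or task-specific fine-tuning, which made the task hard to study widely. The key to the solution is a textual game state representation, optimized through data-driven iteration, that lets a 24B-parameter local model reliably complete matches without any fine-tuning. The authors also build tooling for hypothesis testing and statistical analysis and introduce Critical State Analysis, a protocol for analyzing key decision points in a game in depth, to show how strategic reasoning emerges in widely used LLMs.
Link: https://arxiv.org/abs/2508.07485
Authors: Alexander Duffy,Samuel J Paech,Ishana Shastri,Elizabeth Karpinski,Baptiste Alloui-Cros,Tyler Marques,Matthew Lyle Olson
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: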
Abstract:We present the first evaluation harness that enables any out-of-the-box, local, Large Language Models (LLMs) to play full-press Diplomacy without fine-tuning or specialized training. Previous work required frontier LLMs, or fine-tuning, due to the high complexity and information density of Diplomacy’s game state. Combined with the high variance of matches, these factors made Diplomacy prohibitive for study. In this work, we used data-driven iteration to optimize a textual game state representation such that a 24B model can reliably complete matches without any fine tuning. We develop tooling to facilitate hypothesis testing and statistical analysis, and we present case studies on persuasion, aggressive playstyles, and performance across a range of models. We conduct a variety of experiments across many popular LLMs, finding the larger models perform the best, but the smaller models still play adequately. We also introduce Critical State Analysis: an experimental protocol for rapidly iterating and analyzing key moments in a game at depth. Our harness democratizes the evaluation of strategic reasoning in LLMs by eliminating the need for fine-tuning, and it provides insights into how these capabilities emerge naturally from widely used LLMs. Our code is available in the supplement and will be open sourced.
[NLP-59] ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models
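[Quick read]: This paper addresses the performance bottleneck of large language model (LLM)-based Quality Estimation (QE) for machine translation (MT) in cross-lingual settings. Existing LLM-based QE systems are pre-trained for causal language modelling rather than regression and are further limited on low-resource languages, which hurts prediction accuracy. The key to the solution is ALOPE, an adaptive layer-optimization framework that restructures Transformer representations for better regression-based prediction. First, it combines low-rank adapters (LoRA) with regression task heads and selectively adapts specific pre-trained layers to improve cross-lingual alignment. Second, it introduces dynamic weighting, which adaptively combines representations from multiple layers, and multi-head regression, which aggregates the losses of several regression heads. Empirical results suggest that intermediate Transformer layers provide representations better aligned with the cross-lingual nature of QE, and the framework improves over various existing LLM-based QE approaches.
Link: https://arxiv.org/abs/2508.07484
Authors: Archchana Sindhujan,Shenbin Qian,Chan Chi Chun Matthew,Constantin Orasan,Diptesh Kanojia
Affiliations: University of Surrey
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to COLM 2025 Conference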
Abstract:Large Language Models (LLMs) have shown remarkable performance across a wide range of natural language processing tasks. Quality Estimation (QE) for Machine Translation (MT), which assesses the quality of a source-target pair without relying on reference translations, remains a challenging cross-lingual task for LLMs. The challenges stem from the inherent limitations of existing LLM-based QE systems, which are pre-trained for causal language modelling rather than regression-specific tasks, further elevated by the presence of low-resource languages given pre-training data distribution. This paper introduces ALOPE, an adaptive layer-optimization framework designed to enhance LLM-based QE by restructuring Transformer representations through layer-wise adaptation for improved regression-based prediction. Our framework integrates low-rank adapters (LoRA) with regression task heads, leveraging selected pre-trained Transformer layers for improved cross-lingual alignment. In addition to the layer-specific adaptation, ALOPE introduces two strategies-dynamic weighting, which adaptively combines representations from multiple layers, and multi-head regression, which aggregates regression losses from multiple heads for QE. Our framework shows improvements over various existing LLM-based QE approaches. Empirical evidence suggests that intermediate Transformer layers in LLMs provide contextual representations that are more aligned with the cross-lingual nature of the QE task. We make resultant models and framework code publicly available for further research, also allowing existing LLM-based MT frameworks to be scaled with QE capabilities.
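One plausible reading of dynamic weighting is sketched below: a softmax over one learnable scalar per layer produces mixing weights, the layer representations are combined with those weights, and a linear regression head maps the result to a quality score. The abstract does not specify the weighting scheme, so the softmax form, the function names, and the toy vectors are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_vectors, layer_scores):
    """Weighted sum of per-layer representations (assumed softmax weighting)."""
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, layer_vectors))
        for i in range(dim)
    ]


def regression_head(vector, head_weights, bias=0.0):
    """Linear head mapping a pooled representation to a scalar quality score."""
    return sum(v * w for v, w in zip(vector, head_weights)) + bias


# Two toy layer representations with equal scores, so they are averaged.
pooled = combine_layers([[1.0, 0.0], [3.0, 2.0]], layer_scores=[0.0, 0.0])
score = regression_head(pooled, head_weights=[0.5, 1.0])
print(pooled, score)
```

In training, the layer scores would be learned alongside the head, so the model can shift weight toward the layers, possibly intermediate ones, that carry the most useful signal for the regression target.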
[NLP-60] Positional Biases Shift as Inputs Approach Context Window Limits
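[Quick read]: This paper examines why large language models (LLMs) struggle to use information across long inputs, focusing on the inconsistency of the Lost in the Middle (LiM) effect and the conditions under which it appears. The key to the solution is to analyze input length relative to each model's context window rather than in absolute terms. The study finds that the LiM effect is strongest when inputs occupy up to 50% of the context window. Beyond that point the primacy bias weakens while the recency bias stays relatively stable, producing a distance-based bias in which performance improves the closer the relevant information is to the end of the input. The results also suggest that successful retrieval is a prerequisite for reasoning and that positional biases in reasoning are largely inherited from retrieval, which has implications for long-context task design and LLM benchmark construction.
Link: https://arxiv.org/abs/2508.07479
Authors: Blerta Veseli,Julian Chibane,Mariya Toneva,Alexander Koller
Affiliations: Saarland Informatics Campus; Saarland University; Max Planck Institute for Informatics; Max Planck Institute for Software Systems
Categories: Computation and Language (cs.CL)
Comments: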
Abstract:Large Language Models (LLMs) often struggle to use information across long inputs effectively. Prior work has identified positional biases, such as the Lost in the Middle (LiM) effect, where models perform better when information appears at the beginning (primacy bias) or end (recency bias) of the input, rather than in the middle. However, long-context studies have not consistently replicated these effects, raising questions about their intensity and the conditions under which they manifest. To address this, we conducted a comprehensive analysis using relative rather than absolute input lengths, defined with respect to each model’s context window. Our findings reveal that the LiM effect is strongest when inputs occupy up to 50% of a model’s context window. Beyond that, the primacy bias weakens, while recency bias remains relatively stable. This effectively eliminates the LiM effect; instead, we observe a distance-based bias, where model performance is better when relevant information is closer to the end of the input. Furthermore, our results suggest that successful retrieval is a prerequisite for reasoning in LLMs, and that the observed positional biases in reasoning are largely inherited from retrieval. These insights have implications for long-context tasks, the design of future LLM benchmarks, and evaluation methodologies for LLMs handling extended inputs.
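The move from absolute to relative input length can be made concrete with a short sketch. It computes the fraction of a context window that an input occupies and labels the regime using the 50% threshold reported in the abstract. The function names, the regime labels, and the example window sizes are assumptions for illustration.

```python
def relative_input_length(num_input_tokens, context_window):
    """Fraction of the model's context window occupied by the input."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return num_input_tokens / context_window


def expected_bias_regime(fraction, threshold=0.5):
    """Label the regime described in the abstract (labels are illustrative)."""
    if fraction <= threshold:
        return "lost-in-the-middle"
    return "distance-based"


# The same 6,000-token input lands in different regimes for two hypothetical models.
for window in (8_192, 131_072):
    fraction = relative_input_length(6_000, window)
    print(window, round(fraction, 3), expected_bias_regime(fraction))
```

Measured this way, an input that is long for one model can be short for another, which is one way to read the abstract's point that earlier long-context studies did not consistently replicate the effect.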
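[NLP-61] CP-Agent: Agentic Constraint Programming
[Quick read]: This paper addresses the automatic translation of natural language problem descriptions into formal constraint models, a core challenge in constraint programming (CP). Previous approaches rely on fixed workflows with predetermined modeling steps and fail on a significant number of benchmark problems. The key to the solution is a purely agentic strategy with no fixed pipeline: a general-purpose Python coding agent built on the ReAct (Reason and Act) principle that uses a persistent IPython kernel for stateful code execution and iterative development. Domain expertise is injected solely through a carefully crafted project prompt, and the agent combines it with file operation and code execution tools to test hypotheses, debug failures, and verify solutions. Implemented in a few hundred lines of code, it solves all 101 problems of CP-Bench, suggesting that constraint modeling succeeds through the combination of general coding tools and prompt-encoded domain knowledge.
Link: https://arxiv.org/abs/2508.07468
Authors: Stefan Szeider
Affiliations: TU Wien
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: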
Abstract:Translating natural language problem descriptions into formal constraint models remains a fundamental challenge in constraint programming, requiring deep expertise in both the problem domain and modeling frameworks. Previous approaches to automating this translation have employed fixed workflows with predetermined modeling steps, failing on a significant number of benchmark problems. We present a new approach using a pure agentic strategy without any fixed pipeline. We developed a general-purpose Python coding agent based on the ReAct (Reason and Act) principle, utilizing a persistent IPython kernel for stateful code execution and iterative development. Rather than embedding constraint programming logic into the agent architecture, domain-specific expertise is injected solely through a carefully crafted project prompt. The agent combines this prompt-encoded knowledge with access to file operations and code execution tools, enabling it to test hypotheses, debug failures, and verify solutions dynamically. Implemented in just a few hundred lines of code, this architecture successfully solves all 101 problems of the CP-Bench constraint programming benchmark set. The results suggest that constraint modeling tasks require the combination of general coding tools and domain expertise encoded in prompts, rather than specialized agent architectures or predefined workflows.
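Stateful execution is what makes the iterative loop described in the abstract possible, and a toy sketch can show the idea. The class below is a stand-in for a persistent kernel: variables defined in one step stay available in later steps, and errors come back as observations. It is not the paper's implementation. The class name, the scripted steps (which replace the LLM), and the tiny brute-force puzzle are all assumptions, and `exec` on untrusted code is unsafe outside a sandbox.

```python
import contextlib
import io
import traceback


class PersistentNamespace:
    """Toy stand-in for a persistent kernel: state survives between run() calls."""

    def __init__(self):
        self.namespace = {}

    def run(self, code):
        """Execute code and return captured stdout, plus a traceback on failure."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            return buffer.getvalue() + traceback.format_exc()
        return buffer.getvalue()


kernel = PersistentNamespace()

# Step 1: define a domain; it persists for later steps.
kernel.run("domain = range(10)")

# Step 2: a premature action fails, and the error is returned as an observation.
first_observation = kernel.run("print(solutions)")

# Step 3: after "reading" the error, define the model by brute force and retry.
kernel.run(
    "solutions = [(x, y) for x in domain for y in domain "
    "if x + y == 10 and x - y == 2]"
)
second_observation = kernel.run("print(solutions)")
print(second_observation)
```

A real agent would replace the scripted steps with LLM-chosen actions and would typically call a constraint solver rather than enumerate assignments, but the observe, fix, and retry cycle over shared state is the same.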
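[NLP-62] Let's Revise Step-by-Step: A Unified Local Search Framework for Code Generation with LLMs
[Quick read]: This paper addresses the efficiency and scalability problems of large language models (LLMs) in code generation. Construction-based tree-search methods suffer from rapid growth of the search tree, high token consumption, and the lack of an anytime property, while improvement-based methods struggle with uninformative reward signals and inefficient search strategies. The key to the solution is ReLoc, a unified local search framework that performs step-by-step code revision through four algorithmic components: initial code drafting, neighborhood code generation, candidate evaluation, and incumbent code updating. The authors also design a revision reward model that scores code by revision distance and produces fine-grained preferences that guide the search toward more promising candidates. Across diverse code generation tasks the approach significantly outperforms both construction-based tree search and state-of-the-art improvement-based methods.
Link: https://arxiv.org/abs/2508.07434
Authors: Zhiyi Lyu,Jianguo Huang,Yanchen Deng,Steven Hoi,Bo An
Affiliations: Nanyang Technological University; Alibaba Group
Categories: Computation and Language (cs.CL)
Comments: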
Abstract:Large Language Models (LLMs) with inference-time scaling techniques show promise for code generation, yet face notable efficiency and scalability challenges. Construction-based tree-search methods suffer from rapid growth in tree size, high token consumption, and lack of anytime property. In contrast, improvement-based methods offer better performance but often struggle with uninformative reward signals and inefficient search strategies. In this work, we propose ReLoc, a unified local search framework which effectively performs step-by-step code revision. Specifically, ReLoc explores a series of local revisions through four key algorithmic components: initial code drafting, neighborhood code generation, candidate evaluation, and incumbent code updating, each of which can be instantiated with specific decision rules to realize different local search algorithms such as Hill Climbing (HC) or Genetic Algorithm (GA). Furthermore, we develop a specialized revision reward model that evaluates code quality based on revision distance to produce fine-grained preferences that guide the local search toward more promising candidates. Finally, our extensive experimental results demonstrate that our approach achieves superior performance across diverse code generation tasks, significantly outperforming both construction-based tree search as well as the state-of-the-art improvement-based code generation methods.
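The four components named in the abstract map directly onto a generic hill-climbing loop, sketched below on a toy problem. In ReLoc the draft and the neighbors would come from an LLM and the score from the revision reward model. Here they are plain Python functions over integer tuples, and every name and the toy objective are assumptions for illustration.

```python
def hill_climb(draft, neighbors, score, max_steps=100):
    """Generic hill climbing organized around the four components in the abstract."""
    incumbent = draft                                   # 1. initial drafting
    incumbent_score = score(incumbent)
    for _ in range(max_steps):
        candidates = neighbors(incumbent)               # 2. neighborhood generation
        if not candidates:
            break
        best = max(candidates, key=score)               # 3. candidate evaluation
        best_score = score(best)
        if best_score <= incumbent_score:
            break                                       # local optimum: stop
        incumbent, incumbent_score = best, best_score   # 4. incumbent update
    return incumbent, incumbent_score


# Toy stand-ins: "programs" are integer pairs and a revision changes one entry by 1.
TARGET = (3, -1)


def toy_neighbors(state):
    result = []
    for i in range(len(state)):
        for delta in (-1, 1):
            revised = list(state)
            revised[i] += delta
            result.append(tuple(revised))
    return result


def toy_score(state):
    # Higher is better; 0 means the state matches the target exactly.
    return -sum(abs(a - b) for a, b in zip(state, TARGET))


best_state, best_score = hill_climb((0, 0), toy_neighbors, toy_score)
print(best_state, best_score)
```

Because the loop always holds a valid incumbent, it can be stopped at any step and still return the best candidate found so far, which is the anytime property the abstract says construction-based tree search lacks.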
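[NLP-63] Grounding Multilingual Multimodal LLMs With Cultural Knowledge
[Quick read]: This paper addresses the weak performance of Multimodal Large Language Models (MLLMs) on low-resource languages and long-tail cultural entities, where models misinterpret content in cross-cultural settings and fall short of global inclusiveness. The key to the solution is a data-centric approach: using a large-scale knowledge graph from Wikidata, the authors build CulturalGround, a multimodal dataset of 22 million high-quality, culturally rich visual question answering (VQA) pairs spanning 42 countries and 39 languages, and train the open-source model CulturalPangea on it. The model is grounded directly in cultural knowledge while standard multilingual instruction-tuning data is interleaved to preserve general abilities. It improves on prior open models by an average of 5.0 on culture-focused benchmarks without degrading results on mainstream vision-language tasks, offering a practical path toward globally inclusive multimodal systems.
Link: https://arxiv.org/abs/2508.07414
Authors: Jean de Dieu Nyandwi,Yueqi Song,Simran Khanuja,Graham Neubig
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: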
Abstract:Multimodal Large Language Models excel in high-resource settings, but often misinterpret long-tail cultural entities and underperform in low-resource languages. To address this gap, we propose a data-centric approach that directly grounds MLLMs in cultural knowledge. Leveraging a large scale knowledge graph from Wikidata, we collect images that represent culturally significant entities, and generate synthetic multilingual visual question answering data. The resulting dataset, CulturalGround, comprises 22 million high-quality, culturally-rich VQA pairs spanning 42 countries and 39 languages. We train an open-source MLLM CulturalPangea on CulturalGround, interleaving standard multilingual instruction-tuning data to preserve general abilities. CulturalPangea achieves state-of-the-art performance among open models on various culture-focused multilingual multimodal benchmarks, outperforming prior models by an average of 5.0 without degrading results on mainstream vision-language tasks. Our findings show that our targeted, culturally grounded approach could substantially narrow the cultural gap in MLLMs and offer a practical path towards globally inclusive multimodal systems.
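[NLP-64] A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
Quick read: This paper addresses the limited adaptability of current AI agent systems in dynamic, evolving environments: existing systems mostly rely on static, manually crafted configurations and are hard to optimize and evolve continuously. The key contribution is a unified conceptual framework that abstracts the design of self-evolving agentic systems as a feedback loop made up of System Inputs, the Agent System, the Environment, and Optimisers. Based on this framework, the survey systematically reviews self-evolving techniques that target different components of the agent system, bridging static foundation models and lifelong agentic systems capable of continuous adaptation.
Link: https://arxiv.org/abs/2508.07407
Authors: Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, Zaiqiao Meng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: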
Abstract:Recent advances in large language models have sparked growing interest in AI agents capable of solving complex, real-world tasks. However, most existing agent systems rely on manually crafted configurations that remain static after deployment, limiting their ability to adapt to dynamic and evolving environments. To this end, recent research has explored agent evolution techniques that aim to automatically enhance agent systems based on interaction data and environmental feedback. This emerging direction lays the foundation for self-evolving AI agents, which bridge the static capabilities of foundation models with the continuous adaptability required by lifelong agentic systems. In this survey, we provide a comprehensive review of existing techniques for self-evolving agentic systems. Specifically, we first introduce a unified conceptual framework that abstracts the feedback loop underlying the design of self-evolving agentic systems. The framework highlights four key components: System Inputs, Agent System, Environment, and Optimisers, serving as a foundation for understanding and comparing different strategies. Based on this framework, we systematically review a wide range of self-evolving techniques that target different components of the agent system. We also investigate domain-specific evolution strategies developed for specialised fields such as biomedicine, programming, and finance, where optimisation objectives are tightly coupled with domain constraints. In addition, we provide a dedicated discussion on the evaluation, safety, and ethical considerations for self-evolving agentic systems, which are critical to ensuring their effectiveness and reliability. This survey aims to provide researchers and practitioners with a systematic understanding of self-evolving AI agents, laying the foundation for the development of more adaptive, autonomous, and lifelong agentic systems.
[NLP-65] Generative AI for Strategic Plan Development
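Quick read: This paper examines how Generative Artificial Intelligence (GAI) and Large Language Models (LLMs) can assist in developing strategic plans for large government organizations, focusing on extracting and automatically generating themes for the Vision Elements of a strategic plan. The key to the solution is a modular model, together with a comparative evaluation of two topic modeling techniques, BERTopic and Non-negative Matrix Factorization (NMF), on their ability to identify, from a large volume of Government Accountability Office (GAO) reports, topics that closely resemble the Vision Elements of a strategic plan. The results show that both techniques generate themes similar to all of the target Vision Elements; BERTopic performs better, with more than half of its correlated topics reaching a “medium” or “strong” correlation, supporting the feasibility of GAI for automating strategic plan development.
Link: https://arxiv.org/abs/2508.07405
Authors: Jesse Ponnock
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages, 9 figures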
Abstract:Given recent breakthroughs in Generative Artificial Intelligence (GAI) and Large Language Models (LLMs), more and more professional services are being augmented through Artificial Intelligence (AI), which once seemed impossible to automate. This paper presents a modular model for leveraging GAI in developing strategic plans for large scale government organizations and evaluates leading machine learning techniques in their application towards one of the identified modules. Specifically, the performance of BERTopic and Non-negative Matrix Factorization (NMF) are evaluated in their ability to use topic modeling to generate themes representative of Vision Elements within a strategic plan. To accomplish this, BERTopic and NMF models are trained using a large volume of reports from the Government Accountability Office (GAO). The generated topics from each model are then scored for similarity against the Vision Elements of a published strategic plan and the results are compared. Our results show that these techniques are capable of generating themes similar to 100% of the elements being evaluated against. Further, we conclude that BERTopic performs best in this application with more than half of its correlated topics achieving a “medium” or “strong” correlation. A capability of GAI-enabled strategic plan development impacts a multi-billion dollar industry and assists the federal government in overcoming regulatory requirements which are crucial to the public good. Further work will focus on the operationalization of the concept proven in this study as well as viability of the remaining modules in the proposed model for GAI-generated strategic plans.
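[NLP-66] Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance
Quick read: This paper addresses two problems in Full-Duplex Speech Language Models (FD-SLMs): their conversational ability degrades in real dialogue because speech sequences are long and high-quality spoken dialogue data is scarce, and existing text-guided speech generation introduces timing misalignment and length mismatch in double-channel audio streams, disrupting the time alignment that natural interaction needs. The key to the solution is TurnGuide, a new approach inspired by human conversational planning that dynamically segments assistant speech into dialogue turns and generates turn-level text guidance before speech output. This resolves both the insertion timing and the length problems and markedly improves the semantic coherence and natural conversational flow of end-to-end FD-SLMs.
Link: https://arxiv.org/abs/2508.07375
Authors: Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, Irwin King
Affiliations: 1. The Chinese University of Hong Kong; 2. Alibaba Group
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Work in progress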
Abstract:Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational dynamics such as interruptions, backchannels, and overlapping speech, and End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions. However, they face a critical challenge – their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. While text-guided speech generation could mitigate these issues, it suffers from timing and length issues when integrating textual guidance into double-channel audio streams, disrupting the precise time alignment essential for natural interactions. To address these challenges, we propose TurnGuide, a novel planning-inspired approach that mimics human conversational planning by dynamically segmenting assistant speech into dialogue turns and generating turn-level text guidance before speech output, which effectively resolves both insertion timing and length challenges. Extensive experiments demonstrate our approach significantly improves e2e FD-SLMs’ conversational abilities, enabling them to generate semantically meaningful and coherent speech while maintaining natural conversational flow. Demos are available at this https URL. Code will be available at this https URL.
[NLP-67] Rethinking Domain-Specific LLM Benchmark Construction: A Comprehensiveness-Compactness Approach
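Quick read: This paper addresses the lack of systematic study of how corpus and question-answer (QA) set design affects domain-specific benchmarks for Large Language Models (LLMs); in particular, existing methods lean heavily on the scaling law and overlook the trade-off between precision and recall in benchmark construction. The key to the solution is Comp-Comp, an iterative benchmark construction framework based on a comprehensiveness-compactness principle: comprehensiveness ensures semantic recall of the domain, compactness improves precision, and together they guide the construction of both the corpus and the QA set, enabling more efficient and precise evaluation of domain-specific models.
Link: https://arxiv.org/abs/2508.07353
Authors: Rubing Chen, Jiaxin Wu, Jian Wang, Xulu Zhang, Wenqi Fan, Chenghua Lin, Xiao-Yong Wei, Qing Li
Affiliations: The Hong Kong Polytechnic University; The University of Manchester
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: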
Abstract:Numerous benchmarks have been built to evaluate the domain-specific abilities of large language models (LLMs), highlighting the need for effective and efficient benchmark construction. Existing domain-specific benchmarks primarily focus on the scaling law, relying on massive corpora for supervised fine-tuning or generating extensive question sets for broad coverage. However, the impact of corpus and question-answer (QA) set design on the precision and recall of domain-specific LLMs remains unexplored. In this paper, we address this gap and demonstrate that the scaling law is not always the optimal principle for benchmark construction in specific domains. Instead, we propose Comp-Comp, an iterative benchmarking framework based on a comprehensiveness-compactness principle. Here, comprehensiveness ensures semantic recall of the domain, while compactness enhances precision, guiding both corpus and QA set construction. To validate our framework, we conducted a case study in a well-renowned university, resulting in the creation of XUBench, a large-scale and comprehensive closed-domain benchmark. Although we use the academic domain as the case in this work, our Comp-Comp framework is designed to be extensible beyond academia, providing valuable insights for benchmark construction across various domains.
[NLP-68] PrLM: Learning Explicit Reasoning for Personalized RAG via Contrastive Reward Optimization
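Quick read: This paper addresses response misalignment in personalized Retrieval-Augmented Generation (RAG), which arises from relying on Large Language Models (LLMs) to implicitly integrate retrieved user profiles with the query: when retrieval quality fluctuates, the model tends to produce responses that do not match user preferences. The key to the solution is PrLM, a reinforcement learning framework in which a contrastively trained personalization reward model guides LLMs to reason explicitly over retrieved user profiles. Without requiring annotated reasoning paths, PrLM learns from user responses, improves personalization, and stays robust across varying numbers of retrieved profiles and different retrievers.
Link: https://arxiv.org/abs/2508.07342
Authors: Kepu Zhang, Teng Shi, Weijie Yu, Jun Xu
Affiliations: Gaoling School of Artificial Intelligence; Renmin University of China; School of Information Technology and Management; University of International Business and Economics
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: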
Abstract:Personalized retrieval-augmented generation (RAG) aims to produce user-tailored responses by incorporating retrieved user profiles alongside the input query. Existing methods primarily focus on improving retrieval and rely on large language models (LLMs) to implicitly integrate the retrieved context with the query. However, such models are often sensitive to retrieval quality and may generate responses that are misaligned with user preferences. To address this limitation, we propose PrLM, a reinforcement learning framework that trains LLMs to explicitly reason over retrieved user profiles. Guided by a contrastively trained personalization reward model, PrLM effectively learns from user responses without requiring annotated reasoning paths. Experiments on three personalized text generation datasets show that PrLM outperforms existing methods and remains robust across varying numbers of retrieved profiles and different retrievers.
[NLP-69] Strategies of Code-switching in Human-Machine Dialogs
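Quick read: This paper addresses the incomplete understanding of the characteristics of code-switched language, in particular how to study systematically the way multilinguals use language in real, natural dialog. The key to the solution is a chatbot that code-switches according to different strategies, together with experiments that test the feasibility of this setup and participants' sensitivity to variations in code-switching behavior. When the bot's code-switching was predictable, participants enjoyed the task more and performed better; random or ungrammatical code-switching (such as the unattested mixed-language noun phrase “la fork”) reduced both enjoyment and task success. The results suggest that high-quality multilingual language technology matters for research on bilingual language use, and they show the downsides of deploying insufficiently developed technology.
Link: https://arxiv.org/abs/2508.07325
Authors: Dean Geckt, Melinda Fricke, Shuly Wintner
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: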
Abstract:Most people are multilingual, and most multilinguals code-switch, yet the characteristics of code-switched language are not fully understood. We developed a chatbot capable of completing a Map Task with human participants using code-switched Spanish and English. In two experiments, we prompted the bot to code-switch according to different strategies, examining (1) the feasibility of such experiments for investigating bilingual language use, and (2) whether participants would be sensitive to variations in discourse and grammatical patterns. Participants generally enjoyed code-switching with our bot as long as it produced predictable code-switching behavior; when code-switching was random or ungrammatical (as when producing unattested incongruent mixed-language noun phrases, such as “la fork”), participants enjoyed the task less and were less successful at completing it. These results underscore the potential downsides of deploying insufficiently developed multilingual language technology, while also illustrating the promise of such technology for conducting research on bilingual language use.
[NLP-70] ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering
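Quick read: This paper addresses the open question of how robust and adaptable Large Language Models (LLMs) are when given obfuscated question-answering (QA) inputs. Prior work shows that LLMs do well on factual QA, but their stability under semantic perturbation has not been evaluated systematically. The authors propose a novel obfuscation technique, ObfusQAte, and build on it ObfusQA, described as the first framework with multi-tiered obfuscation levels, which tests LLMs along three fine-grained dimensions: Named-Entity Indirection, Distractor Indirection, and Contextual Overload. The key point is that ObfusQA captures how subtle changes in language affect model output and exposes the tendency of LLMs to fail or hallucinate under complex obfuscation, providing a benchmark and tooling for later work on robustness.
Link: https://arxiv.org/abs/2508.07321
Authors: Shubhra Ghosh, Abhilekh Borah, Aditya Kumar Guru, Kripabandhu Ghosh
Affiliations: Indian Institute of Technology Patna; Manipal University Jaipur; IISER Kolkata
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: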
Abstract:The rapid proliferation of Large Language Models (LLMs) has significantly contributed to the development of equitable AI systems capable of factual question-answering (QA). However, no known study tests the LLMs’ robustness when presented with obfuscated versions of questions. To systematically evaluate these limitations, we propose a novel technique, ObfusQAte and, leveraging the same, introduce ObfusQA, a comprehensive, first of its kind, framework with multi-tiered obfuscation levels designed to examine LLM capabilities across three distinct dimensions: (i) Named-Entity Indirection, (ii) Distractor Indirection, and (iii) Contextual Overload. By capturing these fine-grained distinctions in language, ObfusQA provides a comprehensive benchmark for evaluating LLM robustness and adaptability. Our study observes that LLMs exhibit a tendency to fail or generate hallucinated responses when confronted with these increasingly nuanced variations. To foster research in this direction, we make ObfusQAte publicly available.
[NLP-71] HealthBranches: Synthesizing Clinically-Grounded Question Answering Datasets via Decision Pathways
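Quick read: This paper addresses the lack of interpretability, the limited reasoning ability, and the hard-to-verify clinical reliability of current Large Language Models (LLMs) on medical Question-Answering (QA). The key to the solution is HealthBranches, a new benchmark dataset built with a semi-automated pipeline that turns explicit decision pathways from medical sources into realistic patient cases with associated question-answer pairs, and that keeps the full reasoning path behind each answer. This design lets models be evaluated more rigorously, and closer to clinical practice, on multi-step reasoning and in structured Retrieval-Augmented Generation (RAG) settings, supporting the development of more trustworthy, interpretable, and clinically reliable LLMs.
Link: https://arxiv.org/abs/2508.07308
Authors: Cristian Cosentino, Annamaria Defilippo, Marco Dossena, Christopher Irwin, Sara Joubbi, Pietro Liò
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: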
Abstract:HealthBranches is a novel benchmark dataset for medical Question-Answering (QA), specifically designed to evaluate complex reasoning in Large Language Models (LLMs). This dataset is generated through a semi-automated pipeline that transforms explicit decision pathways from medical sources into realistic patient cases with associated questions and answers. Covering 4,063 case studies across 17 healthcare topics, each data point is based on clinically validated reasoning chains. HealthBranches supports both open-ended and multiple-choice question formats and uniquely includes the full reasoning path for each QA. Its structured design enables robust evaluation of LLMs’ multi-step inference capabilities, including their performance in structured Retrieval-Augmented Generation (RAG) contexts. HealthBranches establishes a foundation for the development of more trustworthy, interpretable, and clinically reliable LLMs in high-stakes domains while also serving as a valuable resource for educational purposes.
[NLP-72] CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation
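Quick read: This paper addresses the unreliable factuality of Multimodal Large Language Models (MLLMs) in cross-lingual and cross-modal settings, in particular hallucination on mixed speech-text input. Existing benchmarks are mostly limited to English text or the visual modality and cannot systematically evaluate multilingual spoken question answering. The authors propose CCFQA, a new cross-lingual and cross-modal factuality benchmark containing parallel speech-text factual questions in 8 languages, to quantify the factual consistency of MLLMs in multilingual speech understanding. The key points are: 1) CCFQA, described as the first factuality evaluation framework for multilingual spoken question answering; and 2) a few-shot transfer learning strategy that transfers English QA ability to multilingual spoken QA with only 5 examples, improving reliability and generalization on non-English speech input.
Link: https://arxiv.org/abs/2508.07295
Authors: Yexing Du, Kaiyuan Liu, Youcheng Pan, Zheng Chu, Bo Yang, Xiaocheng Feng, Yang Xiang, Ming Liu
Affiliations: 1. Tsinghua University; 2. Alibaba Group
Categories: Computation and Language (cs.CL)
Comments: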
Abstract:As Large Language Models (LLMs) are increasingly popularized in the multilingual world, ensuring hallucination-free factuality becomes markedly crucial. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) predominantly focus on textual or visual modalities with a primary emphasis on English, which creates a gap in evaluation when processing multilingual input, especially in speech. To bridge this gap, we propose a novel Cross-lingual and Cross-modal Factuality benchmark (CCFQA). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs’ cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving competitive performance with GPT-4o-mini-Audio using just 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities. Our code and dataset are available at this https URL.
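[NLP-73] EndoAgent: A Memory-Guided Reflective Agent for Intelligent Endoscopic Vision-to-Decision Reasoning
Quick read: This paper addresses two challenges for general AI systems in endoscopic image diagnosis: methods based on large-scale pretraining lack unified coordination across tasks and struggle with the multi-step processes of complex clinical workflows, and the potential of AI agents in endoscopy remains underexplored, especially for flexible instruction parsing, tool integration, and closed-loop vision-to-decision reasoning. The key to the solution is EndoAgent, described as the first memory-guided agent for vision-to-decision endoscopic analysis. Its core innovation is a dual-memory design (short-term action tracking and long-term experiential learning) combined with iterative reasoning and adaptive tool selection and collaboration, which yields logically coherent decisions and reasoning that improves over time. The accompanying EndoAgentBench benchmark is used to show gains in visual understanding and language generation in realistic clinical scenarios.
Link: https://arxiv.org/abs/2508.07292
Authors: Yi Tang, Kaini Wang, Yang Chen, Guangquan Zhou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: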
Abstract:Developing general artificial intelligence (AI) systems to support endoscopic image diagnosis is an emerging research priority. Existing methods based on large-scale pretraining often lack unified coordination across tasks and struggle to handle the multi-step processes required in complex clinical workflows. While AI agents have shown promise in flexible instruction parsing and tool integration across domains, their potential in endoscopy remains underexplored. To address this gap, we propose EndoAgent, the first memory-guided agent for vision-to-decision endoscopic analysis that integrates iterative reasoning with adaptive tool selection and collaboration. Built on a dual-memory design, it enables sophisticated decision-making by ensuring logical coherence through short-term action tracking and progressively enhancing reasoning acuity through long-term experiential learning. To support diverse clinical tasks, EndoAgent integrates a suite of expert-designed tools within a unified reasoning loop. We further introduce EndoAgentBench, a benchmark of 5,709 visual question-answer pairs that assess visual understanding and language generation capabilities in realistic scenarios. Extensive experiments show that EndoAgent consistently outperforms both general and medical multimodal models, exhibiting its strong flexibility and reasoning capabilities.
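[NLP-74] ARCE: Augmented RoBERTa with Contextualized Elucidations for NER in Automated Rule Checking
Quick read: This paper addresses the limited performance of standard pre-trained models on named entity recognition (NER) in the architecture, engineering, and construction (AEC) domain, where terminology is specialized and relational contexts are complex. The key to the solution is ARCE (augmented RoBERTa with contextualized elucidations): a large language model (LLM) generates a corpus of simple, direct explanations (called Cote), which is used to incrementally pre-train a RoBERTa model before fine-tuning on the downstream task. Experiments show that this simple, explanation-based knowledge augmentation outperforms complex, role-based rationales, reaching a Macro-F1 score of 77.20% on a benchmark AEC dataset, a new state of the art.
Link: https://arxiv.org/abs/2508.07286
Authors: Jian Chen, Jinbao Tian, Yankui Li, Zhou Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: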
Abstract:Accurate information extraction from specialized texts is a critical challenge, particularly for named entity recognition (NER) in the architecture, engineering, and construction (AEC) domain to support automated rule checking (ARC). The performance of standard pre-trained models is often constrained by the domain gap, as they struggle to interpret the specialized terminology and complex relational contexts inherent in AEC texts. Although this issue can be mitigated by further pre-training on large, human-curated domain corpora, as exemplified by methods like ARCBERT, this approach is both labor-intensive and cost-prohibitive. Consequently, leveraging large language models (LLMs) for automated knowledge generation has emerged as a promising alternative. However, the optimal strategy for generating knowledge that can genuinely enhance smaller, efficient models remains an open question. To address this, we propose ARCE (augmented RoBERTa with contextualized elucidations), a novel approach that systematically explores and optimizes this generation process. ARCE employs an LLM to first generate a corpus of simple, direct explanations, which we term Cote, and then uses this corpus to incrementally pre-train a RoBERTa model prior to its fine-tuning on the downstream task. Our extensive experiments show that ARCE establishes a new state-of-the-art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20%. This result also reveals a key finding: simple, explanation-based knowledge proves surprisingly more effective than complex, role-based rationales for this task. The code is publicly available at: this https URL.
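Macro-F1, the metric reported above, is the unweighted mean of per-class F1 scores, so rare entity types count as much as frequent ones. The sketch below computes it on toy labels; the label names are hypothetical, and it scores individual labels, whereas NER evaluation is usually done at the entity-span level, so treat it as an illustration of the averaging only.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == lab and p == lab for t, p in pairs)
        fp = sum(t != lab and p == lab for t, p in pairs)
        fn = sum(t == lab and p != lab for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with two hypothetical entity classes.
gold = ["obj", "obj", "prop", "prop"]
pred = ["obj", "prop", "prop", "prop"]
print(round(macro_f1(gold, pred), 4))
```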
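[NLP-75] "Pull or Not to Pull?": Investigating Moral Biases in Leading Large Language Models Across Ethical Dilemmas
Quick read: This paper addresses the lack of interpretable and consistent moral reasoning in Large Language Models (LLMs) when they make ethically sensitive decisions. The key to the solution is a factorial prompting protocol that systematically evaluates the decisions and justifications of 14 leading LLMs across 27 trolley problem scenarios framed by ten moral philosophies (such as utilitarianism, deontology, and altruism). Analyzing decisional assertiveness, explanation-answer consistency, public moral alignment, and sensitivity to ethically irrelevant cues, the study finds that reasoning-enhanced models give more structured justifications but do not necessarily align better with human consensus. Under certain frames (altruism, fairness, and virtue ethics) “sweet zones” emerge in which models combine high intervention rates, low explanation conflict, and close agreement with human judgments. The approach exposes differences in models' latent moral tendencies, gives empirical support for treating moral reasoning as a primary axis of LLM alignment, and motivates standardized benchmarks that evaluate how and why models decide.
Link: https://arxiv.org/abs/2508.07284
Authors: Junchen Ding, Penghao Jiang, Zihao Xu, Ziqi Ding, Yichen Zhu, Jiaojiao Jiang, Yuekang Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: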
Abstract:As large language models (LLMs) increasingly mediate ethically sensitive decisions, understanding their moral reasoning processes becomes imperative. This study presents a comprehensive empirical evaluation of 14 leading LLMs, both reasoning enabled and general purpose, across 27 diverse trolley problem scenarios, framed by ten moral philosophies, including utilitarianism, deontology, and altruism. Using a factorial prompting protocol, we elicited 3,780 binary decisions and natural language justifications, enabling analysis along axes of decisional assertiveness, explanation answer consistency, public moral alignment, and sensitivity to ethically irrelevant cues. Our findings reveal significant variability across ethical frames and model types: reasoning enhanced models demonstrate greater decisiveness and structured justifications, yet do not always align better with human consensus. Notably, “sweet zones” emerge in altruistic, fairness, and virtue ethics framings, where models achieve a balance of high intervention rates, low explanation conflict, and minimal divergence from aggregated human judgments. However, models diverge under frames emphasizing kinship, legality, or self interest, often producing ethically controversial outcomes. These patterns suggest that moral prompting is not only a behavioral modifier but also a diagnostic tool for uncovering latent alignment philosophies across providers. We advocate for moral reasoning to become a primary axis in LLM alignment, calling for standardized benchmarks that evaluate not just what LLMs decide, but how and why.
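The figure of 3,780 decisions is consistent with a full factorial crossing of 14 models, 27 scenarios, and 10 moral frames with one decision per cell (14 × 27 × 10 = 3,780). The abstract does not spell out the crossing, so that reading is an assumption. The sketch below enumerates such a design; every name and the prompt template are placeholders, not the paper's materials.

```python
from itertools import product

# Counts taken from the abstract; the identifiers are placeholders.
models = [f"model_{i:02d}" for i in range(14)]
scenarios = [f"scenario_{i:02d}" for i in range(27)]
frames = [f"frame_{i:02d}" for i in range(10)]


def build_prompt(scenario: str, frame: str) -> str:
    """Hypothetical template; the paper's actual wording is not given here."""
    return (f"Moral frame: {frame}\nScenario: {scenario}\n"
            "Do you pull the lever? Answer yes or no, then justify.")


# One (model, prompt) cell per combination of the three factors.
conditions = [(m, build_prompt(s, f))
              for m, s, f in product(models, scenarios, frames)]
print(len(conditions))
```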
[NLP-76] MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory
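Quick read: This paper addresses the problem that Large Language Models (LLMs) used for mental health screening ask too many questions, which burdens users, is inefficient, and makes it hard to assess transdiagnostic symptom profiles simultaneously. The key to the solution is MAQuA, a multidimensional screening framework built on adaptive question-asking. It combines multi-outcome modeling of language responses with Item Response Theory (IRT) and factor analysis to select, at each turn, the question most informative across multiple mental health dimensions. This maximizes the diagnostic information gained and cuts the number of questions needed for score stabilization by 50-87%, while remaining robust across internalizing (e.g., depression, anxiety) and externalizing (e.g., substance use, eating disorder) symptom domains.
Link: https://arxiv.org/abs/2508.07279
Authors: Vasudha Varadarajan, Hui Xu, Rebecca Astrid Boehme, Mariam Marlan Mirstrom, Sverker Sikstrom, H. Andrew Schwartz
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: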
Abstract:Recent advances in large language models (LLMs) offer new opportunities for scalable, interactive mental health assessment, but excessive querying by LLMs burdens users and is inefficient for real-world screening across transdiagnostic symptom profiles. We introduce MAQuA, an adaptive question-asking framework for simultaneous, multidimensional mental health screening. Combining multi-outcome modeling on language responses with item response theory (IRT) and factor analysis, MAQuA selects the questions with most informative responses across multiple dimensions at each turn to optimize diagnostic information, improving accuracy and potentially reducing response burden. Empirical results on a novel dataset reveal that MAQuA reduces the number of assessment questions required for score stabilization by 50-87% compared to random ordering (e.g., achieving stable depression scores with 71% fewer questions and eating disorder scores with 85% fewer questions). MAQuA demonstrates robust performance across both internalizing (depression, anxiety) and externalizing (substance use, eating disorder) domains, with early stopping strategies further reducing patient time and burden. These findings position MAQuA as a powerful and efficient tool for scalable, nuanced, and interactive mental health screening, advancing the integration of LLM-based agents into real-world clinical workflows.
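Adaptive question selection with IRT is commonly driven by item information. As a rough illustration only, the sketch below uses the standard two-parameter logistic (2PL) model, where an item with discrimination a and difficulty b has Fisher information a² · P(θ) · (1 − P(θ)) at trait level θ, and picks the unasked item whose information summed over dimensions is largest. MAQuA's actual rule also involves multi-outcome language models and factor analysis; the summation criterion, item names, and parameter values here are assumptions.

```python
import math
from typing import Dict, Set, Tuple

# item name -> {dimension: (discrimination a, difficulty b)}; values are made up.
ItemBank = Dict[str, Dict[str, Tuple[float, float]]]


def info_2pl(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at trait level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def next_item(bank: ItemBank, theta: Dict[str, float], asked: Set[str]) -> str:
    """Pick the unasked item with the largest information summed over dimensions."""
    def total_info(name: str) -> float:
        return sum(info_2pl(theta[dim], a, b)
                   for dim, (a, b) in bank[name].items())
    remaining = [name for name in bank if name not in asked]
    return max(remaining, key=total_info)


bank: ItemBank = {
    "q_sleep": {"depression": (1.5, 0.0), "anxiety": (0.8, 0.5)},
    "q_worry": {"anxiety": (2.0, 0.0)},
    "q_appetite": {"depression": (0.5, 1.0)},
}
theta = {"depression": 0.0, "anxiety": 0.0}  # current trait estimates

first = next_item(bank, theta, asked=set())
second = next_item(bank, theta, asked={first})
print(first, second)
```

In a full loop the trait estimates would be updated after each response, and a stopping rule (for example, a score that has stabilized) would end the session early.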
[NLP-77] Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models
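Quick read: This paper addresses the weak empathetic reasoning of current large speech-language models, whose main cause is training data that does not integrate contextual content with paralinguistic cues. The key to the solution is two ways of introducing paralinguistic information: an explicit method that provides paralinguistic metadata such as emotion annotations directly to the LLM, and an implicit method that automatically builds new question-answer (QA) training pairs from categorical and dimensional emotion annotations together with speech transcriptions. The implicit method improves LLM-judged performance on a human-annotated QA benchmark by 38.41%, a figure that reaches 46.02% when combined with the explicit method, supporting the value of integrating paralinguistic information in context.
Link: https://arxiv.org/abs/2508.07273
Authors: Qiongqiong Wang, Hardik B. Sailor, Jeremy H. M. Wong, Tianchi Liu, Shuo Sun, Wenyu Zhang, Muhammad Huzaifah, Nancy Chen, Ai Ti Aw
Affiliations: Institute for Infocomm Research (I2R); Agency for Science, Technology and Research (A*STAR)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted at the 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025)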
Abstract:Current large speech language models (Speech-LLMs) often exhibit limitations in empathetic reasoning, primarily due to the absence of training datasets that integrate both contextual content and paralinguistic cues. In this work, we propose two approaches to incorporate contextual paralinguistic information into model training: (1) an explicit method that provides paralinguistic metadata (e.g., emotion annotations) directly to the LLM, and (2) an implicit method that automatically generates novel training question-answer (QA) pairs using both categorical and dimensional emotion annotations alongside speech transcriptions. Our implicit method boosts performance (LLM-judged) by 38.41% on a human-annotated QA benchmark, reaching 46.02% when combined with the explicit approach, showing effectiveness in contextual paralinguistic understanding. We also validate the LLM judge by demonstrating its correlation with classification metrics, providing support for its reliability.
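[NLP-78] The 2D Dynamic Articulatory Model DYNARTmo: Tongue-Palate Contact Area Estimation
Quick read: This paper addresses the lack of three-dimensional spatial information in the two-dimensional dynamic articulatory model DYNARTmo when simulating the contact area between tongue and palate, with the aim of improving its visualizations for speech science education and speech therapy. The key to the solution is an internal three-dimensional representation of the palatal dome, with two alternative dome geometries (a half-ellipse and a cosine-based profile) that model lateral curvature in the coronal plane. Lateral contact points are computed analytically for each anterior-posterior position, producing electropalatography-like visualizations, and three synchronized views (sagittal, glottal, and palatal) are supported for static and dynamic articulation displays.
Link: https://arxiv.org/abs/2508.07262
Authors: Bernd J. Kröger
Affiliations: RWTH Aachen University; Kröger Lab
Categories: Computation and Language (cs.CL); Robotics (cs.RO)
Comments: 11 pages, 9 figures, 14 references; supplementary material: Python source code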
Abstract:This paper describes an extension of the two-dimensional dynamic articulatory model DYNARTmo by integrating an internal three-dimensional representation of the palatal dome to estimate tongue-palate contact areas from midsagittal tongue contours. Two alternative dome geometries - a half-ellipse and a cosine based profile - are implemented to model lateral curvature in the coronal plane. Using these geometries, lateral contact points are analytically computed for each anterior-posterior position, enabling the generation of electropalatography-like visualizations within the 2D+ framework. The enhanced model supports three synchronized views (sagittal, glottal, and palatal) for static and dynamic (animated) articulation displays, suitable for speech science education and speech therapy. Future work includes adding a facial (lip) view and implementing articulatory-to-acoustic synthesis to quantitatively evaluate model realism.
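To make the idea of analytically computed lateral contact points concrete, here is one possible parameterization; it is an assumption for illustration, not the paper's formulation. The dome has half-width W and apex height H in the coronal plane, with palate height H·sqrt(1 − (y/W)²) for the half-ellipse and H·cos(πy/(2W)) for the cosine profile. The tongue is treated as laterally flat at its midsagittal height t, so contact occurs wherever the palate is no higher than the tongue, that is, for |y| ≥ y_c.

```python
import math


def contact_onset(t: float, H: float, W: float, profile: str) -> float:
    """Lateral distance y_c from the midline at which tongue-palate contact begins.

    Assumed geometry (hypothetical): dome half-width W, apex height H,
    laterally flat tongue at height t. Contact covers y_c <= |y| <= W.
    """
    r = min(max(t / H, 0.0), 1.0)  # clamp the relative tongue height to [0, 1]
    if profile == "ellipse":
        # Solve H * sqrt(1 - (y/W)^2) = t for y.
        return W * math.sqrt(1.0 - r * r)
    if profile == "cosine":
        # Solve H * cos(pi * y / (2W)) = t for y.
        return (2.0 * W / math.pi) * math.acos(r)
    raise ValueError(f"unknown profile: {profile}")


def contact_fraction(t: float, H: float, W: float, profile: str) -> float:
    """Fraction of each lateral half of the palate that is in contact."""
    return (W - contact_onset(t, H, W, profile)) / W


# Example values in millimetres (made up).
H, W = 15.0, 20.0
for profile in ("ellipse", "cosine"):
    print(profile, round(contact_onset(0.6 * H, H, W, profile), 3))
```

Evaluating `contact_fraction` at each anterior-posterior position of a midsagittal contour would give one row of an electropalatography-style map under these assumptions.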
[NLP-79] Prompt Tuning for Few-Shot Continual Learning Named Entity Recognition
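Quick read: This paper addresses the Few-Shot Distillation Dilemma in Few-Shot Continual Learning Named Entity Recognition (FS-CLNER): scarce new-class entity samples weaken generalization, and missing old-class entity information hinders knowledge distillation. The key to the solution is a framework that combines prompt tuning with Memory Demonstration Templates (MDT). First, an expandable Anchor words-oriented Prompt Tuning (APT) paradigm bridges the gap between pre-training and fine-tuning and improves few-shot performance. Second, MDT are added to each training instance to supply replay samples from previous tasks, which strengthens in-context learning, avoids the distillation dilemma, and improves continual learning.
Link: https://arxiv.org/abs/2508.07248
Authors: Zhe Ren
Affiliations: Xinjiang University
Categories: Computation and Language (cs.CL)
Comments: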
Abstract:Knowledge distillation has been successfully applied to Continual Learning Named Entity Recognition (CLNER) tasks, by using a teacher model trained on old-class data to distill old-class entities present in new-class data as a form of regularization, thereby avoiding catastrophic forgetting. However, in Few-Shot CLNER (FS-CLNER) tasks, the scarcity of new-class entities makes it difficult for the trained model to generalize during inference. More critically, the lack of old-class entity information hinders the distillation of old knowledge, causing the model to fall into what we refer to as the Few-Shot Distillation Dilemma. In this work, we address the above challenges through a prompt tuning paradigm and memory demonstration template strategy. Specifically, we designed an expandable Anchor words-oriented Prompt Tuning (APT) paradigm to bridge the gap between pre-training and fine-tuning, thereby enhancing performance in few-shot scenarios. Additionally, we incorporated Memory Demonstration Templates (MDT) into each training instance to provide replay samples from previous tasks, which not only avoids the Few-Shot Distillation Dilemma but also promotes in-context learning. Experiments show that our approach achieves competitive performances on FS-CLNER.
[NLP-80] How Does a Deep Neural Network Look at Lexical Stress?
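[Quick read]: This paper addresses the problem that neural networks in speech processing act as black boxes, that is, how to explain what informs their decisions. For the task of lexical stress prediction, the authors build a dataset of English disyllabic words from read and spontaneous speech and train several Convolutional Neural Network (CNN) architectures to predict stress position from spectrograms. The key to the solution is combining Layerwise Relevance Propagation (LRP) with a feature-specific relevance analysis, which shows what the model relies on: the best-performing classifier is most strongly influenced by the first and second formants of the stressed vowel, with some evidence that its pitch and third formant also contribute, while the classifiers also attend to information throughout the word. The approach improves the interpretability of deep learning models and shows that they can acquire distributed cues to stress from naturally occurring data, extending traditional phonetic work based on highly controlled stimuli.
Link: https://arxiv.org/abs/2508.07229
Authors: Itai Allouche, Itay Asael, Rotem Rousso, Vered Dassa, Ann Bradlow, Seung-Eun Kim, Matthew Goldrick, Joseph Keshet
Affiliations: Technion – Israel Institute of Technology; Northwestern University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 10 pages, 4 figures, submitted to the Journal of the Acoustical Society of America (JASA)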
Abstract:Despite their success in speech processing, neural networks often operate as black boxes, prompting the question: what informs their decisions, and how can we interpret them? This work examines this issue in the context of lexical stress. A dataset of English disyllabic words was automatically constructed from read and spontaneous speech. Several Convolutional Neural Network (CNN) architectures were trained to predict stress position from a spectrographic representation of disyllabic words lacking minimal stress pairs (e.g., initial stress WAllet, final stress exTEND), achieving up to 92% accuracy on held-out test data. Layerwise Relevance Propagation (LRP), a technique for CNN interpretability analysis, revealed that predictions for held-out minimal pairs (PROtest vs. proTEST ) were most strongly influenced by information in stressed versus unstressed syllables, particularly the spectral properties of stressed vowels. However, the classifiers also attended to information throughout the word. A feature-specific relevance analysis is proposed, and its results suggest that our best-performing classifier is strongly influenced by the stressed vowel’s first and second formants, with some evidence that its pitch and third formant also contribute. These results reveal deep learning’s ability to acquire distributed cues to stress from naturally occurring data, extending traditional phonetic work based around highly controlled stimuli.
[NLP-81] Enhancing Rumor Detection Methods with Propagation Structure Infused Language Model COLING2025
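[Quick read]: This paper addresses the suboptimal performance of Pretrained Language Models (PLMs) on social media tasks such as rumor detection, which it attributes to mismatches between pretraining corpora and social texts, inadequate handling of unique social symbols, and pretraining tasks that fail to model the user engagements implicit in propagation structures. The key to the solution is a continued pretraining strategy called Post Engagement Prediction (PEP), in which the model predicts root, branch, and parent relations between posts, capturing stance and sentiment interactions and strengthening its understanding of propagation structure. Experiments show that PEP significantly improves rumor detection across a range of PLMs, particularly in few-shot scenarios, and it underpins the construction and release of SoLM, a PLM tailored to Twitter.
Link: https://arxiv.org/abs/2508.07209
Authors: Chaoqun Cui, Siyuan Li, Kunkun Ma, Caiyan Jia
Affiliations: Beijing Jiaotong University
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: This paper is accepted by COLING2025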
Abstract:Pretrained Language Models (PLMs) have excelled in various Natural Language Processing tasks, benefiting from large-scale pretraining and the self-attention mechanism’s ability to capture long-range dependencies. However, their performance on social media application tasks like rumor detection remains suboptimal. We attribute this to mismatches between pretraining corpora and social texts, inadequate handling of unique social symbols, and pretraining tasks ill-suited for modeling user engagements implicit in propagation structures. To address these issues, we propose a continued pretraining strategy called Post Engagement Prediction (PEP) to infuse information from propagation structures into PLMs. PEP makes models predict root, branch, and parent relations between posts, capturing interactions of stance and sentiment crucial for rumor detection. We also curate and release a large-scale Twitter corpus, TwitterCorpus (269GB of text), and two unlabeled claim conversation datasets with propagation structures (UTwitter and UWeibo). Utilizing these resources and the PEP strategy, we train a Twitter-tailored PLM called SoLM. Extensive experiments demonstrate PEP significantly boosts rumor detection performance across universal and social media PLMs, even in few-shot scenarios. On benchmark datasets, PEP enhances baseline models by 1.0-3.7% accuracy, even enabling them to outperform current state-of-the-art methods on multiple datasets. SoLM alone, without high-level modules, also achieves competitive results, highlighting the strategy’s effectiveness in learning discriminative post interaction features.
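[NLP-82] Towards Real-World Rumor Detection: Anomaly Detection Framework with Graph Supervised Contrastive Learning COLING2025
[Quick read]: This paper addresses the performance bottleneck that data scarcity and class imbalance create for rumor detection on social media. Existing methods usually model rumor detection as a class-balanced classification task, but in real-world settings rumors are a small minority among mostly regular posts, which limits conventional supervised learning. The key to the solution is the Anomaly Detection framework with Graph Supervised Contrastive Learning (AD-GSCL): it treats unlabeled data as non-rumors and adapts graph contrastive learning, using graph structure information, so that rumors can be separated from regular content under class imbalance and few-shot conditions. Experiments show superior performance across multiple data-distribution scenarios.
Link: https://arxiv.org/abs/2508.07205
Authors: Chaoqun Cui, Caiyan Jia
Affiliations: Beijing Jiaotong University
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Comments: This paper is accepted by COLING2025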
Abstract:Current rumor detection methods based on propagation structure learning predominately treat rumor detection as a class-balanced classification task on limited labeled data. However, real-world social media data exhibits an imbalanced distribution with a minority of rumors among massive regular posts. To address the data scarcity and imbalance issues, we construct two large-scale conversation datasets from Weibo and Twitter and analyze the domain distributions. We find obvious differences between rumor and non-rumor distributions, with non-rumors mostly in entertainment domains while rumors concentrate in news, indicating the conformity of rumor detection to an anomaly detection paradigm. Correspondingly, we propose the Anomaly Detection framework with Graph Supervised Contrastive Learning (AD-GSCL). It heuristically treats unlabeled data as non-rumors and adapts graph contrastive learning for rumor detection. Extensive experiments demonstrate AD-GSCL’s superiority under class-balanced, imbalanced, and few-shot conditions. Our findings provide valuable insights for real-world rumor detection featuring imbalanced data distributions.
[NLP-83] Propagation Tree Is Not Deep: Adaptive Graph Contrastive Learning Approach for Rumor Detection AAAI2024
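[Quick read]: This paper addresses rumor detection on social media, focusing on the limitations of existing graph-based models in handling Rumor Propagation Trees (RPTs). Prior methods assume RPTs are deep and learn sequential stance features along branches, but the authors' statistical analysis finds that real-world RPTs are mostly wide, with most nodes being shallow 1-level replies. The paper therefore proposes Rumor Adaptive Graph Contrastive Learning (RAGCL), a graph contrastive learning method with adaptive view augmentation. Its key idea is an augmentation strategy guided by node centralities: views are generated by node dropping, attribute masking, and edge dropping, following three principles (exempt root nodes, retain deep reply nodes, and preserve lower-level nodes in deep sections), so that learning focuses on intensive substructures and yields robust representations. The method improves rumor detection performance, reveals the wide-structure nature of RPTs, and offers augmentation principles that may transfer to other applications involving tree-structured graphs.
Link: https://arxiv.org/abs/2508.07201
Authors: Chaoqun Cui, Caiyan Jia
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: This paper is accepted by AAAI2024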
Abstract:Rumor detection on social media has become increasingly important. Most existing graph-based models presume rumor propagation trees (RPTs) have deep structures and learn sequential stance features along branches. However, through statistical analysis on real-world datasets, we find RPTs exhibit wide structures, with most nodes being shallow 1-level replies. To focus learning on intensive substructures, we propose Rumor Adaptive Graph Contrastive Learning (RAGCL) method with adaptive view augmentation guided by node centralities. We summarize three principles for RPT augmentation: 1) exempt root nodes, 2) retain deep reply nodes, 3) preserve lower-level nodes in deep sections. We employ node dropping, attribute masking and edge dropping with probabilities from centrality-based importance scores to generate views. A graph contrastive objective then learns robust rumor representations. Extensive experiments on four benchmark datasets demonstrate RAGCL outperforms state-of-the-art methods. Our work reveals the wide-structure nature of RPTs and contributes an effective graph contrastive learning approach tailored for rumor detection through principled adaptive augmentation. The proposed principles and augmentation techniques can potentially benefit other applications involving tree-structured graphs.
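The centrality-guided node dropping described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not state which centrality measure or probability formula is used, so degree centrality, the `max_drop` cap, and the helper names `drop_probabilities` and `drop_nodes` are all assumptions. Only principle 1 (exempt root nodes) is enforced explicitly.

```python
import random


def drop_probabilities(parent, max_drop=0.5):
    """Per-node drop probability for a propagation tree.

    parent[i] is the parent index of node i; the root has parent -1.
    Importance is degree centrality here (an assumed choice): the more
    connected a node is, the less likely it is to be dropped.
    """
    n = len(parent)
    degree = [0] * n
    for child, p in enumerate(parent):
        if p >= 0:
            degree[child] += 1
            degree[p] += 1
    top = max(degree) or 1
    probs = []
    for node in range(n):
        if parent[node] < 0:
            probs.append(0.0)  # principle 1: the root is never dropped
        else:
            probs.append(max_drop * (1.0 - degree[node] / top))
    return probs


def drop_nodes(parent, probs, seed=0):
    """Return the ids of surviving nodes for one augmented view.

    Re-attaching children of dropped nodes is left out of this sketch.
    """
    rng = random.Random(seed)
    return [i for i in range(len(parent)) if rng.random() >= probs[i]]


# Root 0 with three direct replies; node 1 starts a deeper chain 1 -> 4 -> 5.
tree = [-1, 0, 0, 0, 1, 4]
probs = drop_probabilities(tree)
view = drop_nodes(tree, probs, seed=42)
```

With degree centrality, shallow leaf replies receive a higher drop probability than nodes inside a deeper chain, which loosely mirrors the stated aim of retaining deep reply nodes.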
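[NLP-84] Adapting LLMs to Time Series Forecasting via Temporal Heterogeneity Modeling and Semantic Alignment
[Quick read]: This paper addresses two core challenges in applying Large Language Models (LLMs) to time series forecasting: the inherent heterogeneity of temporal patterns, where different segments of a series show structurally different dynamics, and the modality gap between continuous numerical signals and discrete language representations. The authors propose the TALON framework, built around two modules. The Heterogeneous Temporal Encoder partitions multivariate time series into structurally coherent segments, enabling localized expert modeling of diverse temporal patterns. The Semantic Alignment Module maps temporal features into an LLM-compatible semantic space, integrating time series into language models without handcrafted prompts at inference time. Experiments on seven real-world benchmarks show superior performance over recent state-of-the-art methods, with average mean squared error (MSE) improvements of up to 11%.
Link: https://arxiv.org/abs/2508.07195
Authors: Yanru Sun, Emadeldeen Eldele, Zongxia Xie, Yucheng Wang, Wenzhe Niu, Qinghua Hu, Chee Keong Kwoh, Min Wu
Affiliations: Tsinghua University; National University of Singapore; Nanyang Technological University; NUS Graduate School for Integrative Sciences and Engineering
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: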
Abstract:Large Language Models (LLMs) have recently demonstrated impressive capabilities in natural language processing due to their strong generalization and sequence modeling capabilities. However, their direct application to time series forecasting remains challenging due to two fundamental issues: the inherent heterogeneity of temporal patterns and the modality gap between continuous numerical signals and discrete language representations. In this work, we propose TALON, a unified framework that enhances LLM-based forecasting by modeling temporal heterogeneity and enforcing semantic alignment. Specifically, we design a Heterogeneous Temporal Encoder that partitions multivariate time series into structurally coherent segments, enabling localized expert modeling across diverse temporal patterns. To bridge the modality gap, we introduce a Semantic Alignment Module that aligns temporal features with LLM-compatible representations, enabling effective integration of time series into language-based models while eliminating the need for handcrafted prompts during inference. Extensive experiments on seven real-world benchmarks demonstrate that TALON achieves superior performance across all datasets, with average MSE improvements of up to 11% over recent state-of-the-art methods. These results underscore the effectiveness of incorporating both pattern-aware and semantic-aware designs when adapting LLMs for time series forecasting. The code is available at: this https URL.
[NLP-85] DySK-Attn: A Framework for Efficient Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention
[Quick read]: This paper addresses the static nature of knowledge in Large Language Models (LLMs): what a model learns during training is hard to update in time, retraining is prohibitively expensive, and existing knowledge editing methods are slow or introduce side effects. The key to the solution is the DySK-Attn framework, which couples an LLM with a dynamic Knowledge Graph (KG) that can be updated in real time and adds a sparse knowledge attention mechanism. This mechanism performs a coarse-to-fine search that locates and focuses on a small set of highly relevant facts in the KG, improving factual accuracy and inference efficiency without the cost of dense attention over the whole knowledge base.
Link: https://arxiv.org/abs/2508.07185
Authors: Kabir Khan, Priya Sharma, Arjun Mehta, Neha Gupta, Ravi Narayanan
Affiliations: San Francisco State University; Indian Institute of Technology Bombay; Indian Institute of Science; Indian Institute of Technology Delhi; International Institute of Information Technology Hyderabad
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint; 7 figures, 3 tables, 1 algorithm; v1. Code and data will be released
Abstract:Large Language Models (LLMs) suffer from a critical limitation: their knowledge is static and quickly becomes outdated. Retraining these massive models is computationally prohibitive, while existing knowledge editing techniques can be slow and may introduce unforeseen side effects. To address this, we propose DySK-Attn, a novel framework that enables LLMs to efficiently integrate real-time knowledge from a dynamic external source. Our approach synergizes an LLM with a dynamic Knowledge Graph (KG) that can be updated instantaneously. The core of our framework is a sparse knowledge attention mechanism, which allows the LLM to perform a coarse-to-fine grained search, efficiently identifying and focusing on a small, highly relevant subset of facts from the vast KG. This mechanism avoids the high computational cost of dense attention over the entire knowledge base and mitigates noise from irrelevant information. We demonstrate through extensive experiments on time-sensitive question-answering tasks that DySK-Attn significantly outperforms strong baselines, including standard Retrieval-Augmented Generation (RAG) and model editing techniques, in both factual accuracy for updated knowledge and computational efficiency. Our framework offers a scalable and effective solution for building LLMs that can stay current with the ever-changing world.
[NLP-86] Schema Lineage Extraction at Scale: Multilingual Pipelines Composite Evaluation and Language-Model Benchmarks
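[Quick read]: This paper addresses the semantic disconnect ("semantic drift") between original metadata and downstream data that arises in enterprise data pipelines written in multiple programming languages. The problem harms data reproducibility and governance and degrades services such as retrieval-augmented generation (RAG) and text-to-SQL. The key to the solution is a new framework for automatically extracting fine-grained schema lineage, which identifies four core components (source schemas, source tables, transformation logic, and aggregation operations) and builds a standardized representation of data transformations. The paper also introduces the Schema Lineage Composite Evaluation (SLiCE) metric, which quantifies structural correctness and semantic fidelity, enabling rigorous evaluation of model extraction quality.
Link: https://arxiv.org/abs/2508.07179
Authors: Jiaqi Yin, Yi-Wei Chen, Meng-Lung Lee, Xiya Liu
Affiliations: Microsoft; Antra. Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: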
Abstract:Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This “semantic drift” compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. This method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, creating a standardized representation of data transformations. For the rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A new benchmark is also presented, comprising 1,700 manually annotated lineages from real-world industrial scripts. Experiments were conducted with 12 language models, ranging from small language models (SLMs) of 1.3B to 32B parameters to large language models (LLMs) like GPT-4o and GPT-4.1. The results demonstrate that the performance of schema lineage extraction scales with model size and the sophistication of prompting techniques. Notably, a 32B open-source model, using a single reasoning trace, can achieve performance comparable to the GPT series under standard prompting. This finding suggests a scalable and economical approach for deploying schema-aware agents in practical applications.
[NLP-87] Improved Personalized Headline Generation via Denoising Fake Interests from Implicit Feedback CIKM’25
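[Quick read]: This paper addresses hallucinated headlines in personalized headline generation caused by personalized-irrelevant click noise in historical clickstreams, which misleads models away from users' genuine interests. The key to the solution is a Personalized Headline Generation framework via Denoising Fake Interests from Implicit Feedback (PHG-DIF), which works in two steps. First, dual-stage filtering identifies and removes click noise indicated by short dwell times and abnormal click bursts. Second, multi-level temporal fusion dynamically models users' evolving and multi-faceted interests, enabling more precise user profiling and headline generation.
Link: https://arxiv.org/abs/2508.07178
Authors: Kejin Liu, Junhong Lian, Xiang Ao, Ningtao Wang, Xing Fu, Yu Cheng, Weiqiang Wang, Xinyu Liu
Affiliations: Henan Institute of Advanced Technology, Zhengzhou University; Institute of Computing Technology, Chinese Academy of Sciences; Independent Researcher
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by the 34th ACM International Conference on Information and Knowledge Management (CIKM '25), Full Research Papers track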
Abstract:Accurate personalized headline generation hinges on precisely capturing user interests from historical behaviors. However, existing methods neglect personalized-irrelevant click noise in entire historical clickstreams, which may lead to hallucinated headlines that deviate from genuine user preferences. In this paper, we reveal the detrimental impact of click noise on personalized generation quality through rigorous analysis in both user and news dimensions. Based on these insights, we propose a novel Personalized Headline Generation framework via Denoising Fake Interests from Implicit Feedback (PHG-DIF). PHG-DIF first employs dual-stage filtering to effectively remove clickstream noise, identified by short dwell times and abnormal click bursts, and then leverages multi-level temporal fusion to dynamically model users’ evolving and multi-faceted interests for precise profiling. Moreover, we release DT-PENS, a new benchmark dataset comprising the click behavior of 1,000 carefully curated users and nearly 10,000 annotated personalized headlines with historical dwell time annotations. Extensive experiments demonstrate that PHG-DIF substantially mitigates the adverse effects of click noise and significantly improves headline quality, achieving state-of-the-art (SOTA) results on DT-PENS. Our framework implementation and dataset are available at this https URL.
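As a rough illustration of the dual-stage filtering idea (short dwell times first, then abnormal click bursts), the sketch below filters a toy clickstream. It is not the PHG-DIF implementation: the function name `filter_clicks`, the tuple layout, and all thresholds (`min_dwell`, `window`, `max_in_window`) are hypothetical values chosen for the example.

```python
def filter_clicks(clicks, min_dwell=5.0, window=10.0, max_in_window=3):
    """Remove likely noise from a clickstream in two stages.

    clicks: list of (timestamp_seconds, dwell_seconds, news_id) tuples.
    Stage 1 drops clicks whose dwell time is below min_dwell.
    Stage 2 drops clicks that sit in a burst, i.e. more than
    max_in_window surviving clicks fall within `window` seconds of them.
    All thresholds are illustrative assumptions.
    """
    stage1 = [c for c in clicks if c[1] >= min_dwell]
    kept = []
    for ts, dwell, news_id in stage1:
        nearby = sum(1 for other_ts, _, _ in stage1 if abs(other_ts - ts) <= window)
        if nearby <= max_in_window:
            kept.append((ts, dwell, news_id))
    return kept


history = [
    (0, 30, "a"),     # ordinary click
    (100, 2, "b"),    # very short dwell time
    (200, 20, "c"),   # four clicks within a few seconds: a burst
    (201, 20, "d"),
    (202, 20, "e"),
    (203, 20, "f"),
    (500, 40, "g"),   # ordinary click
]
clean_ids = [news_id for _, _, news_id in filter_clicks(history)]
```

The filtered history would then feed whatever interest model follows; the temporal fusion step from the abstract is not sketched here.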
[NLP-88] Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models
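[Quick read]: This paper addresses the lack of a safety evaluation benchmark for Omni-modal Large Language Models (OLLMs), in particular for safety performance under joint audio-visual inputs and for cross-modal safety consistency. The key to the solution is Omni-SafetyBench, the first comprehensive parallel benchmark of this kind, covering 24 modality combinations with 972 samples each and including dedicated audio-visual harm cases. It introduces two core metrics: a Safety-score based on the conditional Attack Success Rate (C-ASR) and conditional Refusal Rate (C-RR), which accounts for comprehension failures on complex omni-modal inputs, and a Cross-Modal Safety Consistency Score (CMSC-score), which quantifies how consistent outputs are across modalities. Evaluating 6 open-source and 4 closed-source OLLMs reveals widespread weaknesses in overall safety and consistency and highlights the urgent need to improve omni-modal safety.
Link: https://arxiv.org/abs/2508.07173
Authors: Leyi Pan, Zheyu Fu, Yunpeng Zhai, Shuchang Tao, Sheng Guan, Shiyu Huang, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Felix Henry, Lijie Wen, Aiwei Liu
Affiliations: Tsinghua University; Tongyi Lab; Peking University; OpenRL Lab
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 8 figures, 12 tables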
Abstract:The rise of Omni-modal Large Language Models (OLLMs), which integrate visual and auditory processing with text, necessitates robust safety evaluations to mitigate harmful outputs. However, no dedicated benchmarks currently exist for OLLMs, and prior benchmarks designed for other LLMs lack the ability to assess safety performance under audio-visual joint inputs or cross-modal safety consistency. To fill this gap, we introduce Omni-SafetyBench, the first comprehensive parallel benchmark for OLLM safety evaluation, featuring 24 modality combinations and variations with 972 samples each, including dedicated audio-visual harm cases. Considering OLLMs’ comprehension challenges with complex omni-modal inputs and the need for cross-modal consistency evaluation, we propose tailored metrics: a Safety-score based on conditional Attack Success Rate (C-ASR) and Refusal Rate (C-RR) to account for comprehension failures, and a Cross-Modal Safety Consistency Score (CMSC-score) to measure consistency across modalities. Evaluating 6 open-source and 4 closed-source OLLMs reveals critical vulnerabilities: (1) no model excels in both overall safety and consistency, with only 3 models achieving over 0.6 in both metrics and top performer scoring around 0.8; (2) safety defenses weaken with complex inputs, especially audio-visual joints; (3) severe weaknesses persist, with some models scoring as low as 0.14 on specific modalities. Our benchmark and metrics highlight urgent needs for enhanced OLLM safety, providing a foundation for future improvements.
[NLP-89] Gradient Surgery for Safe LLM Fine-Tuning
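[Quick read]: This paper addresses a vulnerability in Fine-tuning-as-a-Service, where a small number of malicious samples mixed into a user's fine-tuning dataset can break the safety alignment of Large Language Models (LLMs). Existing defenses degrade sharply as the harmful ratio increases, and the paper traces this to conflicting gradients between the user-task objective and the safety alignment objective, so that optimization damages safety. The key to the solution is SafeGrad, a method based on gradient surgery: when a gradient conflict is detected, the user-task gradient is projected onto the plane orthogonal to the alignment gradient, which removes the harmful component and lets the model learn the user's task without sacrificing safety. In addition, a KL-divergence alignment loss captures the distributional safety profile of the foundation model, further improving robustness and data efficiency.
Link: https://arxiv.org/abs/2508.07172
Authors: Biao Yi, Jiahao Li, Baolei Zhang, Lihai Nie, Tong Li, Tiansheng Huang, Zheli Liu
Affiliations: Nankai University
Categories: Computation and Language (cs.CL)
Comments: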
Abstract:Fine-tuning-as-a-Service introduces a critical vulnerability where a few malicious examples mixed into the user’s fine-tuning dataset can compromise the safety alignment of Large Language Models (LLMs). While a recognized paradigm frames safe fine-tuning as a multi-objective optimization problem balancing user task performance with safety alignment, we find existing solutions are critically sensitive to the harmful ratio, with defenses degrading sharply as harmful ratio increases. We diagnose that this failure stems from conflicting gradients, where the user-task update directly undermines the safety objective. To resolve this, we propose SafeGrad, a novel method that employs gradient surgery. When a conflict is detected, SafeGrad nullifies the harmful component of the user-task gradient by projecting it onto the orthogonal plane of the alignment gradient, allowing the model to learn the user’s task without sacrificing safety. To further enhance robustness and data efficiency, we employ a KL-divergence alignment loss that learns the rich, distributional safety profile of the well-aligned foundation model. Extensive experiments show that SafeGrad provides state-of-the-art defense across various LLMs and datasets, maintaining robust safety even at high harmful ratios without compromising task fidelity.
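The projection step described in the abstract can be sketched on flat gradient vectors as below. This is a minimal sketch rather than the SafeGrad code: it assumes a conflict means a negative inner product between the two gradients (the usual convention in gradient-surgery methods such as PCGrad), and the function name `project_user_gradient` is hypothetical.

```python
def project_user_gradient(g_user, g_align):
    """Remove the part of g_user that opposes g_align.

    g_user and g_align are flat lists of floats with equal length.
    A conflict is assumed to mean a negative dot product; in that case
    g_user is projected onto the plane orthogonal to g_align:
        g_user - (<g_user, g_align> / ||g_align||^2) * g_align
    Without a conflict, g_user is returned unchanged.
    """
    dot = sum(u * a for u, a in zip(g_user, g_align))
    norm_sq = sum(a * a for a in g_align)
    if dot >= 0 or norm_sq == 0.0:
        return list(g_user)
    scale = dot / norm_sq
    return [u - scale * a for u, a in zip(g_user, g_align)]


# The user-task gradient pushes against the alignment gradient on axis 1.
g_user = [1.0, -1.0]
g_align = [0.0, 1.0]
g_safe = project_user_gradient(g_user, g_align)
```

After projection the update no longer has any component against the alignment direction, while the component along the first axis, which does not conflict, is preserved.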
[NLP-90] Fairness of Automatic Speech Recognition: Looking Through a Philosophical Lens
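[Quick read]: This paper addresses bias in Automatic Speech Recognition (ASR) systems, in particular how systematic misrecognition of speakers of non-standard dialects constitutes a form of moral disrespect and compounds historical injustices against marginalized linguistic communities. The key to the solution is a philosophical reframing of ASR fairness: technical fixes alone are not enough, and diverse speech varieties must be recognized as legitimate forms of expression worthy of technological accommodation. The paper identifies three distinctive ethical dimensions: the "temporal taxation" borne by speakers of non-standard varieties, the disruption of conversational flow when systems misrecognize speech, and the fundamental connection between speech patterns and personal or cultural identity. Together these create asymmetric power relationships that existing technical fairness metrics fail to capture, and they motivate a shift in ASR development from linguistic standardization toward linguistic pluralism.
Link: https://arxiv.org/abs/2508.07143
Authors: Anna Seo Gyeong Choi, Hoon Choi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: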
Abstract:Automatic Speech Recognition (ASR) systems now mediate countless human-technology interactions, yet research on their fairness implications remains surprisingly limited. This paper examines ASR bias through a philosophical lens, arguing that systematic misrecognition of certain speech varieties constitutes more than a technical limitation – it represents a form of disrespect that compounds historical injustices against marginalized linguistic communities. We distinguish between morally neutral classification (discriminate1) and harmful discrimination (discriminate2), demonstrating how ASR systems can inadvertently transform the former into the latter when they consistently misrecognize non-standard dialects. We identify three unique ethical dimensions of speech technologies that differentiate ASR bias from other algorithmic fairness concerns: the temporal burden placed on speakers of non-standard varieties (“temporal taxation”), the disruption of conversational flow when systems misrecognize speech, and the fundamental connection between speech patterns and personal/cultural identity. These factors create asymmetric power relationships that existing technical fairness metrics fail to capture. The paper analyzes the tension between linguistic standardization and pluralism in ASR development, arguing that current approaches often embed and reinforce problematic language ideologies. We conclude that addressing ASR bias requires more than technical interventions; it demands recognition of diverse speech varieties as legitimate forms of expression worthy of technological accommodation. This philosophical reframing offers new pathways for developing ASR systems that respect linguistic diversity and speaker autonomy.
[NLP-91] Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution
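[Quick read]: This paper addresses intersectional bias in generative AI used in critical social contexts: when multiple identity axes intersect, they create distinct patterns of disadvantage that single-axis fairness evaluations cannot capture. The key to the solution is a new benchmark, WinoIdentity, with 245,700 prompts, built by augmenting the WinoBias dataset with 25 demographic markers intersected with binary gender, so that differences in model confidence across intersectional identity groups can be identified and quantified systematically. The authors propose Coreference Confidence Disparity as a new group (un)fairness metric that measures how a model's uncertainty varies across intersectional identities. They find that, despite the strong overall performance of large language models (LLMs), uncertainty is highest for doubly-disadvantaged groups, and confidence also drops for privileged groups, indicating two independent failures in value alignment and validity that can compound to cause social harm.
Link: https://arxiv.org/abs/2508.07111
Authors: Falaah Arif Khan, Nivedha Sivakumar, Yinong Oliver Wang, Katherine Metcalf, Cezanne Camacho, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff
Affiliations: Apple
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: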
Abstract:Large language models (LLMs) have achieved impressive performance, leading to their widespread adoption as decision-support tools in resource-constrained contexts like hiring and admissions. There is, however, scientific consensus that AI systems can reflect and exacerbate societal biases, raising concerns about identity-based harm when used in critical social contexts. Prior work has laid a solid foundation for assessing bias in LLMs by evaluating demographic disparities in different language reasoning tasks. In this work, we extend single-axis fairness evaluations to examine intersectional bias, recognizing that when multiple axes of discrimination intersect, they create distinct patterns of disadvantage. We create a new benchmark called WinoIdentity by augmenting the WinoBias dataset with 25 demographic markers across 10 attributes, including age, nationality, and race, intersected with binary gender, yielding 245,700 prompts to evaluate 50 distinct bias patterns. Focusing on harms of omission due to underrepresentation, we investigate bias through the lens of uncertainty and propose a group (un)fairness metric called Coreference Confidence Disparity which measures whether models are more or less confident for some intersectional identities than others. We evaluate five recently published LLMs and find confidence disparities as high as 40% along various demographic attributes including body type, sexual orientation and socio-economic status, with models being most uncertain about doubly-disadvantaged identities in anti-stereotypical settings. Surprisingly, coreference confidence decreases even for hegemonic or privileged markers, indicating that the recent impressive performance of LLMs is more likely due to memorization than logical reasoning. Notably, these are two independent failures in value alignment and validity that can compound to cause social harm.
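To make the idea of a confidence disparity concrete, the sketch below compares mean coreference confidence across groups. The abstract does not give the exact formula for Coreference Confidence Disparity, so the definition used here (gap between the highest and lowest group mean) is an assumption, as are the function name and the toy numbers.

```python
def confidence_disparity(confidences_by_group):
    """Return (gap, group_means) for per-group model confidences.

    confidences_by_group maps a group label to a non-empty list of
    confidences in [0, 1] that the model assigned to the correct
    referent. The gap is max(mean) - min(mean) across groups, which is
    an assumed stand-in for the paper's metric.
    """
    group_means = {
        group: sum(values) / len(values)
        for group, values in confidences_by_group.items()
    }
    gap = max(group_means.values()) - min(group_means.values())
    return gap, group_means


# Toy confidences for two hypothetical marker groups.
gap, means = confidence_disparity({
    "marker_a": [0.9, 0.7],
    "marker_b": [0.5, 0.3],
})
```

A gap near zero would mean the model is about equally confident for every group; the abstract reports disparities as high as 40% along some demographic attributes.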
[NLP-92] Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning
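[Quick read]: This paper addresses the computational overhead that large reasoning models incur during test-time scaling, where short input prompts lead to excessive token generation, and the significant accuracy degradation that existing sparse attention mechanisms suffer from accumulated errors in long-generation reasoning. The key to the solution is LessIsMore, a training-free sparse attention mechanism. Its core idea is to use global attention patterns instead of traditional head-specific local optimization: it aggregates token selections from local attention heads with recent contextual information to produce a unified cross-head token ranking, which avoids maintaining a separate token subset per head and improves generalization and efficiency. Experiments show that LessIsMore preserves or even improves accuracy while achieving a 1.1× average decoding speed-up over full attention, and a 1.13× end-to-end speed-up over existing sparse attention methods while attending to about 2× fewer tokens.
Link: https://arxiv.org/abs/2508.07101
Authors: Lijie Yang, Zhihao Zhang, Arti Jain, Shijie Cao, Baihong Yuan, Yiwei Chen, Zhihao Jia, Ravi Netravali
Affiliations: Princeton University; Carnegie Mellon University; Microsoft Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: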
Abstract:Large reasoning models achieve strong performance through test-time scaling but incur substantial computational overhead, particularly from excessive token generation when processing short input prompts. While sparse attention mechanisms can reduce latency and memory usage, existing approaches suffer from significant accuracy degradation due to accumulated errors during long-generation reasoning. These methods generally require either high token retention rates or expensive retraining. We introduce LessIsMore, a training-free sparse attention mechanism for reasoning tasks, which leverages global attention patterns rather than relying on traditional head-specific local optimizations. LessIsMore aggregates token selections from local attention heads with recent contextual information, enabling unified cross-head token ranking for future decoding layers. This unified selection improves generalization and efficiency by avoiding the need to maintain separate token subsets per head. Evaluation across diverse reasoning tasks and benchmarks shows that LessIsMore preserves, and in some cases improves, accuracy while achieving a 1.1× average decoding speed-up compared to full attention. Moreover, LessIsMore attends to 2× fewer tokens without accuracy loss, achieving a 1.13× end-to-end speed-up compared to existing sparse attention methods.
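The unified cross-head selection can be pictured with the sketch below, which merges each head's top-k tokens with a window of recent tokens into one shared set. It illustrates the general idea only and is not the LessIsMore algorithm: the ranking rule (best rank achieved in any head), the parameters `k`, `recent`, and `budget`, and the function name are all assumptions.

```python
def unified_token_selection(head_scores, k, recent, budget):
    """Pick one token set shared by all heads.

    head_scores: one list of per-token attention scores per head,
    all of the same length. Each head nominates its top-k tokens;
    nominations are ranked by the best rank any head gave them (an
    assumed aggregation rule). The most recent `recent` tokens are
    always kept, and the set is filled up to `budget` tokens.
    """
    n = len(head_scores[0])
    selected = list(range(max(0, n - recent), n))  # recent context window
    best_rank = {}
    for scores in head_scores:
        top = sorted(range(n), key=lambda t: -scores[t])[:k]
        for rank, tok in enumerate(top):
            best_rank[tok] = min(rank, best_rank.get(tok, rank))
    for tok in sorted(best_rank, key=lambda t: (best_rank[t], t)):
        if len(selected) >= budget:
            break
        if tok not in selected:
            selected.append(tok)
    return sorted(selected)


# Two heads over five tokens; each head favours a different early token.
scores = [
    [0.5, 0.1, 0.1, 0.2, 0.1],
    [0.1, 0.6, 0.1, 0.1, 0.1],
]
shared = unified_token_selection(scores, k=1, recent=1, budget=3)
```

Because every head attends to the same `shared` set, no per-head token subsets need to be stored, which is the efficiency argument the abstract makes.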
[NLP-93] BharatBBQ: A Multilingual Bias Benchmark for Question Answering in the Indian Context
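[Quick read]: This paper addresses the fact that current benchmarks for evaluating social bias in language models (LMs) are built mainly on Western contexts and fit the Indian cultural setting poorly. To fill this gap, the authors propose BharatBBQ, a culturally adapted bias benchmark for India's multilingual environment, covering 8 languages (Hindi, English, Marathi, Bengali, Tamil, Telugu, Odia, and Assamese) and 13 social categories, including 3 intersectional groups. The dataset contains 49,108 original examples, expanded through translation and verification to 392,864 examples. The key to the solution is designing the bias measurement framework around local languages and cultural context and running a systematic bias analysis of multilingual LMs in zero-shot and few-shot settings. The analysis finds that biases are often more pronounced in Indian languages than in English, underscoring the need for linguistically and culturally grounded evaluation tools.
Link: https://arxiv.org/abs/2508.07090
Authors: Aditya Tomar, Nihar Ranjan Sahoo, Pushpak Bhattacharyya
Affiliations: IIT Bombay
Categories: Computation and Language (cs.CL)
Comments: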
Abstract:Evaluating social biases in language models (LMs) is crucial for ensuring fairness and minimizing the reinforcement of harmful stereotypes in AI systems. Existing benchmarks, such as the Bias Benchmark for Question Answering (BBQ), primarily focus on Western contexts, limiting their applicability to the Indian context. To address this gap, we introduce BharatBBQ, a culturally adapted benchmark designed to assess biases in Hindi, English, Marathi, Bengali, Tamil, Telugu, Odia, and Assamese. BharatBBQ covers 13 social categories, including 3 intersectional groups, reflecting prevalent biases in the Indian sociocultural landscape. Our dataset contains 49,108 examples in one language that are expanded using translation and verification to 392,864 examples in eight different languages. We evaluate five multilingual LM families across zero and few-shot settings, analyzing their bias and stereotypical bias scores. Our findings highlight persistent biases across languages and social categories and often amplified biases in Indian languages compared to English, demonstrating the necessity of linguistically and culturally grounded benchmarks for bias evaluation.
[NLP-94] SQL-Exchange: Transforming SQL Queries Across Domains
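[Quick read]: This paper addresses the problem of mapping SQL queries across database schemas, that is, adapting domain-specific elements to a target schema while preserving the structure of the source query, in order to improve the in-context learning performance of text-to-SQL systems as a downstream task. The key to the solution is the SQL-Exchange framework, which produces valid, executable mappings across schemas while maintaining structural alignment and semantic correctness, and uses the mapped queries as in-context examples, improving text-to-SQL performance over using queries from the source schema.
Link: https://arxiv.org/abs/2508.07087
Authors: Mohammadreza Daviran, Brian Lin, Davood Rafiei
Affiliations: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: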
Abstract:We introduce SQL-Exchange, a framework for mapping SQL queries across different database schemas by preserving the source query structure while adapting domain-specific elements to align with the target schema. We investigate the conditions under which such mappings are feasible and beneficial, and examine their impact on enhancing the in-context learning performance of text-to-SQL systems as a downstream task. Our comprehensive evaluation across multiple model families and benchmark datasets–assessing structural alignment with source queries, execution validity on target databases, and semantic correctness–demonstrates that SQL-Exchange is effective across a wide range of schemas and query types. Our results further show that using mapped queries as in-context examples consistently improves text-to-SQL performance over using queries from the source schema.
[NLP-95] SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages
【速读】: 该论文旨在解决现有对话数据集普遍忽视自然人类对话中文化细微差别的问题,尤其是在东南亚这一人口超7亿、文化多样性极高的地区。解决方案的关键在于构建SEADialogues——一个以东南亚为中心的文化根基对话数据集,涵盖六个国家的八种语言(其中许多为低资源语言),每段对话均包含人物属性(persona attributes)和两个反映当地日常生活的文化相关话题,从而提升对话模型的文化敏感性与个性化能力,推动面向人类中心的大语言模型(Large Language Models, LLMs)在多文化语境下的发展。
链接: https://arxiv.org/abs/2508.07069
作者: Muhammad Dehan Al Kautsar,Aswin Candra,Muhammad Alif Al Hakim,Maxalmina Satria Kahfi,Fajri Koto,Alham Fikri Aji,Peerat Limkonchotiwat,Ekapol Chuangsuwanich,Genta Indra Winata
机构: MBZUAI(穆罕默德·本·扎耶德人工智能大学); Mekari; Universitas Indonesia(印度尼西亚大学); Detik Network; AI Singapore; Chulalongkorn University(朱拉隆功大学); Capital One
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Although numerous datasets have been developed to support dialogue systems, most existing chit-chat datasets overlook the cultural nuances inherent in natural human conversations. To address this gap, we introduce SEADialogues, a culturally grounded dialogue dataset centered on Southeast Asia, a region with over 700 million people and immense cultural diversity. Our dataset features dialogues in eight languages from six Southeast Asian countries, many of which are low-resource despite having sizable speaker populations. To enhance cultural relevance and personalization, each dialogue includes persona attributes and two culturally grounded topics that reflect everyday life in the respective communities. Furthermore, we release a multi-turn dialogue dataset to advance research on culturally aware and human-centric large language models, including conversational dialogue agents.
[NLP-96] ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的列表式重排序(listwise ranking)方法在复杂排序场景下性能不足的问题,尤其是在缺乏高质量推理密集型训练数据时,现有重排序器表现受限。其核心解决方案包括两个关键环节:一是提出一种自动化推理密集型训练数据合成框架,通过从多领域获取查询与文档,并利用DeepSeek-R1生成高质量标签,结合自一致性数据过滤机制保障数据质量;二是设计两阶段后训练策略,即冷启动监督微调(SFT)以学习推理模式,随后通过强化学习(Reinforcement Learning, RL)阶段优化排序能力,其中创新性地引入多视角排序奖励机制,相较于传统基于排名指标的奖励更有效。该方法最终形成名为ReasonRank的推理增强型重排序器,在多个基准测试中达到领先性能。
链接: https://arxiv.org/abs/2508.07050
作者: Wenhan Liu,Xinyu Ma,Weiwei Sun,Yutao Zhu,Yuchen Li,Dawei Yin,Zhicheng Dou
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 21 pages
Abstract:Large Language Model (LLM) based listwise ranking has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models, many studies have demonstrated that step-by-step reasoning during test-time helps improve listwise ranking performance. However, due to the scarcity of reasoning-intensive training data, existing rerankers perform poorly in many complex ranking scenarios and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated reasoning-intensive training data synthesis framework, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. A self-consistency data filtering mechanism is designed to ensure the data quality. To empower the listwise reranker with strong reasoning ability, we further propose a two-stage post-training approach, which includes a cold-start supervised fine-tuning (SFT) stage for reasoning pattern learning and a reinforcement learning (RL) stage for further ranking ability enhancement. During the RL stage, based on the nature of listwise ranking, we design a multi-view ranking reward, which is more effective than a ranking metric-based reward. Extensive experiments demonstrate that our trained reasoning-intensive reranker ReasonRank outperforms existing baselines significantly and also achieves much lower latency than pointwise reranker Rank1. Through further experiments, our ReasonRank has achieved state-of-the-art (SOTA) performance 40.6 on the BRIGHT leaderboard (this https URL). Our codes are available at this https URL.
[NLP-97] MultiMedEdit: A Scenario-Aware Benchmark for Evaluating Knowledge Editing in Medical VQA
【速读】: 该论文旨在解决当前知识编辑(Knowledge Editing, KE)方法在多模态医学场景中应用不足的问题,尤其是如何将更新的知识与视觉推理有效结合,以支持安全且可解释的临床决策。其解决方案的关键在于提出首个面向临床多模态任务的知识编辑基准——MultiMedEdit,该框架涵盖理解与推理两类任务类型,定义了可靠性、泛化性和局部性三个维度的评估指标,并支持跨范式比较不同模型的表现。通过在单次编辑和终身编辑两种设置下开展系统实验,该研究揭示了现有方法在复杂临床流程中的泛化能力与长尾推理能力的局限性,同时提供了编辑延迟与内存开销等效率分析,为未来开发临床鲁棒的知识编辑技术奠定了基础。
链接: https://arxiv.org/abs/2508.07022
作者: Shengtao Wen,Haodong Chen,Yadong Wang,Zhongying Pan,Xiang Chen,Yu Tian,Bo Qian,Dong Liang,Sheng-Jun Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Under Review
Abstract:Knowledge editing (KE) provides a scalable approach for updating factual knowledge in large language models without full retraining. While previous studies have demonstrated effectiveness in general domains and medical QA tasks, little attention has been paid to KE in multimodal medical scenarios. Unlike text-only settings, medical KE demands integrating updated knowledge with visual reasoning to support safe and interpretable clinical decisions. To address this gap, we propose MultiMedEdit, the first benchmark tailored to evaluating KE in clinical multimodal tasks. Our framework spans both understanding and reasoning task types, defines a three-dimensional metric suite (reliability, generality, and locality), and supports cross-paradigm comparisons across general and domain-specific models. We conduct extensive experiments under single-editing and lifelong-editing settings. Results suggest that current methods struggle with generalization and long-tail reasoning, particularly in complex clinical workflows. We further present an efficiency analysis (e.g., edit latency, memory footprint), revealing practical trade-offs in real-world deployment across KE paradigms. Overall, MultiMedEdit not only reveals the limitations of current approaches but also provides a solid foundation for developing clinically robust knowledge editing techniques in the future.
[NLP-98] Vec2Summ: Text Summarization via Probabilistic Sentence Embeddings
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在摘要生成中面临的三个关键问题:上下文长度限制、生成过程缺乏可解释性和可控性,以及随着语料库规模增长而带来的计算效率下降。其解决方案的核心在于提出Vec2Summ方法,将摘要任务建模为语义压缩过程——通过在语义嵌入空间中用单一均值向量表示文档集合,捕捉语料的中心语义;随后利用生成式AI(Generative AI)模型对这一均值向量进行嵌入反演(embedding inversion),生成连贯摘要,并引入以均值为中心的高斯采样机制以增强输出多样性与鲁棒性,从而实现高效、可控制且具备语料级抽象能力的摘要生成。
链接: https://arxiv.org/abs/2508.07017
作者: Mao Li,Fred Conrad,Johann Gagnon-Bartsch
机构: University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:We propose Vec2Summ, a novel method for abstractive summarization that frames the task as semantic compression. Vec2Summ represents a document collection using a single mean vector in the semantic embedding space, capturing the central meaning of the corpus. To reconstruct fluent summaries, we perform embedding inversion – decoding this mean vector into natural language using a generative language model. To improve reconstruction quality and capture some degree of topical variability, we introduce stochasticity by sampling from a Gaussian distribution centered on the mean. This approach is loosely analogous to bagging in ensemble learning, where controlled randomness encourages more robust and varied outputs. Vec2Summ addresses key limitations of LLM-based summarization methods. It avoids context-length constraints, enables interpretable and controllable generation via semantic parameters, and scales efficiently with corpus size – requiring only O(d + d^2) parameters. Empirical results show that Vec2Summ produces coherent summaries for topically focused, order-invariant corpora, with performance comparable to direct LLM summarization in terms of thematic coverage and efficiency, albeit with less fine-grained detail. These results underscore Vec2Summ’s potential in settings where scalability, semantic control, and corpus-level abstraction are prioritized.
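The O(d + d^2) parameter count in the abstract above matches a mean vector (d values) plus a covariance matrix (d^2 values). The sketch below shows only that statistical step under stated assumptions: the toy 2-dimensional "embeddings", the ridge term, and all function names are hypothetical, and the embedding-inversion decoder that would turn a sampled vector back into text is not implemented here.

```python
import math
import random


def mean_vector(embeddings):
    """Component-wise mean of a list of equal-length vectors."""
    n, d = len(embeddings), len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(d)]


def covariance(embeddings, mean, ridge=1e-6):
    """Covariance matrix with a small ridge on the diagonal (assumption:
    the ridge is only there to keep the matrix positive definite)."""
    n, d = len(embeddings), len(mean)
    cov = [[0.0] * d for _ in range(d)]
    for e in embeddings:
        diff = [e[i] - mean[i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / n
    for i in range(d):
        cov[i][i] += ridge
    return cov


def cholesky(a):
    """Lower-triangular factor L with L * L^T == a."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_embedding(mean, chol, rng):
    """Draw one vector from N(mean, chol * chol^T)."""
    d = len(mean)
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return [mean[i] + sum(chol[i][k] * z[k] for k in range(i + 1)) for i in range(d)]


# Toy corpus of three 2-dimensional "sentence embeddings" (made up for illustration).
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mu = mean_vector(docs)
cov = covariance(docs, mu)
chol = cholesky(cov)
rng = random.Random(0)
candidates = [sample_embedding(mu, chol, rng) for _ in range(3)]
n_params = len(mu) + len(mu) ** 2  # d + d^2 stored values
```

Each vector in `candidates` would then be passed to a decoder; sampling several vectors rather than decoding only the mean is what gives the varied outputs the abstract compares to bagging.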
[NLP-99] Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models
【速读】: 该论文旨在解决1-bit大语言模型(Large Language Model, LLM)量化中因直接从零训练导致的高训练成本与性能下降问题。现有方法通常忽略预训练模型的可用性,使得1-bit模型难以有效迁移全精度权重的知识。其核心挑战在于全精度与1-bit表示之间存在显著差距,阻碍了模型的稳定适配。论文提出一种一致的渐进式训练策略,同时在前向和反向传播中逐步将浮点权重二值化,从而平滑过渡;并引入二值感知初始化(binary-aware initialization)与双尺度补偿(dual-scaling compensation)机制,降低训练难度并提升最终性能。实验表明,该方法可在不牺牲性能的前提下,利用预训练模型实现高性能的1-bit LLM,避免了昂贵的端到端训练。
链接: https://arxiv.org/abs/2508.06974
作者: Zhijun Tu,Hanting Chen,Siqi Liu,Chuanjian Liu,Jian Li,Jie Hu,Yunhe Wang
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL)
备注: 16 pages, 5 figures
Abstract:1-bit LLM quantization offers significant advantages in reducing storage and computational costs. However, existing methods typically train 1-bit LLMs from scratch, failing to fully leverage pre-trained models. This results in high training costs and notable accuracy degradation. We identify that the large gap between full precision and 1-bit representations makes direct adaptation difficult. In this paper, we introduce a consistent progressive training for both forward and backward, smoothly converting the floating-point weights into the binarized ones. Additionally, we incorporate binary-aware initialization and dual-scaling compensation to reduce the difficulty of progressive training and improve the performance. Experimental results on LLMs of various sizes demonstrate that our method outperforms existing approaches. Our results show that high-performance 1-bit LLMs can be achieved using pre-trained models, eliminating the need for expensive training from scratch.
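To make the idea of "smoothly converting the floating-point weights into the binarized ones" concrete, here is a minimal sketch. It assumes a simple linear blend controlled by a hypothetical schedule value `lam`, and it scales the sign pattern by the mean absolute weight, a common choice in binarized networks. The abstract does not specify the actual schedule, initialization, or dual-scaling scheme, so none of this should be read as the paper's method.

```python
def binarize(weights):
    """Map each weight to +alpha or -alpha, with alpha = mean(|w|) (assumed scale)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]


def progressive_weights(weights, lam):
    """Blend full-precision and binarized weights.

    lam = 0.0 returns the original weights, lam = 1.0 returns the fully
    binarized ones; intermediate values interpolate linearly (an assumption).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    binary = binarize(weights)
    return [(1.0 - lam) * w + lam * b for w, b in zip(weights, binary)]


full_precision = [0.5, -1.5, 1.0]
start = progressive_weights(full_precision, 0.0)
halfway = progressive_weights(full_precision, 0.5)
end = progressive_weights(full_precision, 1.0)
```

In a training loop, `lam` would be increased from 0 to 1 over time so the model never sees an abrupt jump from full precision to 1-bit weights.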
[NLP-100] Two-Stage Quranic QA via Ensemble Retrieval and Instruction-Tuned Answer Extraction CCS
【速读】: 该论文旨在解决《古兰经》问答(Quranic Question Answering)任务中的两大挑战:一是古典阿拉伯语的语言复杂性,二是宗教文本的语义丰富性导致的低资源环境下问答系统性能受限的问题。解决方案的关键在于提出了一种两阶段框架:第一阶段通过集成微调后的阿拉伯语语言模型实现更优的段落检索性能;第二阶段则利用指令微调的大语言模型(Instruction-Tuned Large Language Models)结合少量示例提示(few-shot prompting),克服小数据集上微调的局限性,从而高效完成答案抽取。实验证明,该方法在 Quran QA 2023 共享任务中取得了当前最优结果,验证了模型集成与指令微调相结合在专业领域低资源问答场景中的有效性。
链接: https://arxiv.org/abs/2508.06971
作者: Mohamed Basem,Islam Oshallah,Ali Hamdi,Khaled Shaban,Hozaifa Kassab
机构: MSA University (MSA大学); Qatar University (卡塔尔大学); AiTech AU (AiTech AU)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 8 pages, 4 figures, accepted at AICCSA 2025, this https URL
Abstract:Quranic Question Answering presents unique challenges due to the linguistic complexity of Classical Arabic and the semantic richness of religious texts. In this paper, we propose a novel two-stage framework that addresses both passage retrieval and answer extraction. For passage retrieval, we ensemble fine-tuned Arabic language models to achieve superior ranking performance. For answer extraction, we employ instruction-tuned large language models with few-shot prompting to overcome the limitations of fine-tuning on small datasets. Our approach achieves state-of-the-art results on the Quran QA 2023 Shared Task, with a MAP@10 of 0.3128 and MRR@10 of 0.5763 for retrieval, and a pAP@10 of 0.669 for extraction, substantially outperforming previous methods. These results demonstrate that combining model ensembling and instruction-tuned language models effectively addresses the challenges of low-resource question answering in specialized domains.
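For readers unfamiliar with the retrieval metrics reported above, the sketch below computes MRR@10 and MAP@10 for ranked passage lists. The passage IDs are made up, and the normalization of average precision by `min(len(relevant), k)` is one common convention assumed here; the shared task's official scorer may differ.

```python
def reciprocal_rank_at_k(ranked, relevant, k=10):
    """1 / rank of the first relevant item within the top k, else 0."""
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked, relevant, k=10):
    """Mean of precision values at each relevant hit within the top k."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            hits += 1
            total += hits / rank
    return total / min(len(relevant), k)


def mean_over_queries(metric, runs, k=10):
    """Average a per-query metric over (ranked_list, relevant_set) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Two hypothetical queries with hypothetical passage IDs.
runs = [
    (["p3", "p1", "p7", "p2"], {"p1", "p2"}),
    (["p9", "p8"], {"p4"}),
]
mrr_at_10 = mean_over_queries(reciprocal_rank_at_k, runs)
map_at_10 = mean_over_queries(average_precision_at_k, runs)
```

For the first query the first relevant passage is at rank 2, so its reciprocal rank is 0.5; the second query retrieves nothing relevant and contributes 0 to both averages.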
[NLP-101] DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery
【速读】: 该论文旨在解决当前AI代理在面对复杂、多样化数据需求时,难以实现自主、系统性数据发现与合成的问题,即如何让AI代理超越传统搜索方式,基于用户特定需求自动发现并整合所需数据集。其解决方案的关键在于提出首个综合性基准DatasetResearch,涵盖208个真实世界的数据需求场景,并构建三维评估框架对AI代理的检索(retrieval)与生成(generation)能力进行量化分析。研究表明,尽管先进系统在知识密集型任务中表现尚可,但在推理密集型任务和分布外“边缘案例”上仍存在显著不足,揭示了当前AI代理在数据发现方面的根本局限,为下一代具备自进化能力的智能体提供了明确的方向与基准。
链接: https://arxiv.org/abs/2508.06960
作者: Keyu Li,Mohan Jiang,Dayuan Fu,Yunze Wu,Xiangkun Hu,Dequan Wang,Pengfei Liu
机构: Shanghai Jiao Tong University (上海交通大学); SII; GAIR
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The rapid advancement of large language models has fundamentally shifted the bottleneck in AI development from computational power to data availability, with countless valuable datasets remaining hidden across specialized repositories, research appendices, and domain platforms. As reasoning capabilities and deep research methodologies continue to evolve, a critical question emerges: can AI agents transcend conventional search to systematically discover any dataset that meets specific user requirements, enabling truly autonomous demand-driven data curation? We introduce DatasetResearch, the first comprehensive benchmark evaluating AI agents’ ability to discover and synthesize datasets from 208 real-world demands across knowledge-intensive and reasoning-intensive tasks. Our tri-dimensional evaluation framework reveals a stark reality: even advanced deep research systems achieve only 22% score on our challenging DatasetResearch-pro subset, exposing the vast gap between current capabilities and perfect dataset discovery. Our analysis uncovers a fundamental dichotomy: search agents excel at knowledge tasks through retrieval breadth, while synthesis agents dominate reasoning challenges via structured generation, yet both catastrophically fail on “corner cases” outside existing distributions. These findings establish the first rigorous baseline for dataset discovery agents and illuminate the path toward AI systems capable of finding any dataset in the digital universe. Our benchmark and comprehensive analysis provide the foundation for the next generation of self-improving AI systems and are publicly available at this https URL.
[NLP-102] AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理任务中微调时存在的两个核心问题:一是传统两阶段微调流程(监督微调 SFT 与强化学习 RL)易导致灾难性遗忘(catastrophic forgetting),二是 SFT 和 RL 之间缺乏动态平衡机制,难以实现模仿(imitation)与探索(exploration)之间的最优权衡。解决方案的关键在于提出一种新颖的单阶段算法 Adaptive Meta Fine-Tuning (AMFT),其核心是引入一个基于元梯度(meta-gradient)的自适应权重控制器,将 SFT 的隐式路径级奖励(implicit, path-level reward)与 RL 的显式结果奖励(explicit, outcome-based reward)统一建模为可学习参数,并通过策略熵正则化稳定训练过程,从而自动发现最优训练课程(training curriculum)。该方法在数学推理、抽象视觉推理(General Points)和视觉-语言导航(V-IRL)等多个挑战性基准上均取得新的最先进性能,并展现出优异的分布外(OOD)泛化能力。
链接: https://arxiv.org/abs/2508.06944
作者: Lixuan He,Jie Feng,Yong Li
机构: Tsinghua University (清华大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of implicit rewards, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT’s implicit, path-level reward and RL’s explicit, outcome-based reward. The core of AMFT is a meta-gradient adaptive weight controller that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state-of-the-art and demonstrates superior generalization on out-of-distribution (OOD) tasks. Ablation studies and training dynamic analysis confirm that the meta-learning controller is crucial for AMFT’s stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM this http URL codes are open-sourced via this https URL.
[NLP-103] Model-Agnostic Sentiment Distribution Stability Analysis for Robust LLM-Generated Texts Detection
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)生成文本难以与人类写作内容有效区分的问题,尤其是现有检测方法在面对改写、对抗扰动和跨领域迁移时泛化能力不足的挑战。其解决方案的关键在于提出一种模型无关的检测框架 SentiDetect,该框架通过分析情感分布稳定性差异来识别 LLM 生成文本:具体而言,利用两个互补指标——情感分布一致性(sentiment distribution consistency)和情感分布保持性(sentiment distribution preservation),分别量化文本在情感改变和语义保持变换下的稳定性变化;实验证明该方法在多个主流 LLM 和数据集上显著优于现有基线,且对改写、对抗攻击和长度变化更具鲁棒性。
链接: https://arxiv.org/abs/2508.06913
作者: Siyuan Li,Xi Lin,Guangyan Li,Zehao Liu,Aodu Wulianghai,Li Ding,Jun Wu,Jianhua Li
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:The rapid advancement of large language models (LLMs) has resulted in increasingly sophisticated AI-generated content, posing significant challenges in distinguishing LLM-generated text from human-written language. Existing detection methods, primarily based on lexical heuristics or fine-tuned classifiers, often suffer from limited generalizability and are vulnerable to paraphrasing, adversarial perturbations, and cross-domain shifts. In this work, we propose SentiDetect, a model-agnostic framework for detecting LLM-generated text by analyzing the divergence in sentiment distribution stability. Our method is motivated by the empirical observation that LLM outputs tend to exhibit emotionally consistent patterns, whereas human-written texts display greater emotional variability. To capture this phenomenon, we define two complementary metrics: sentiment distribution consistency and sentiment distribution preservation, which quantify stability under sentiment-altering and semantic-preserving transformations. We evaluate SentiDetect on five diverse datasets and a range of advanced LLMs, including Gemini-1.5-Pro, Claude-3, GPT-4-0613, and LLaMa-3.3. Experimental results demonstrate its superiority over state-of-the-art baselines, with over 16% and 11% F1 score improvements on Gemini-1.5-Pro and GPT-4-0613, respectively. Moreover, SentiDetect also shows greater robustness to paraphrasing, adversarial attacks, and text length variations, outperforming existing detectors in challenging scenarios.
[NLP-104] Maestro-EVC: Controllable Emotional Voice Conversion Guided by References and Explicit Prosody
【速读】: 该论文旨在解决情感语音转换(Emotional Voice Conversion, EVC)中属性解耦不足与细粒度情感动态建模缺失的问题,尤其在实际应用中难以实现对说话人身份、语义内容和情感风格的独立控制。解决方案的关键在于提出 Maestro-EVC 框架,通过从不同参考样本中有效解耦内容、说话人身份和情感属性,并引入时间维度的情感表征以及显式的韵律建模与韵律增强策略,从而在韵律不匹配条件下仍能鲁棒地捕捉并迁移目标情感的时序动态特征,实现高质量、可控且富有情感表现力的语音合成。
链接: https://arxiv.org/abs/2508.06890
作者: Jinsung Yoon,Wooyeol Jeong,Jio Gim,Young-Joo Suh
机构: Pohang University of Science and Technology (POSTECH)
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at ASRU 2025
Abstract:Emotional voice conversion (EVC) aims to modify the emotional style of speech while preserving its linguistic content. In practical EVC, controllability, the ability to independently control speaker identity and emotional style using distinct references, is crucial. However, existing methods often struggle to fully disentangle these attributes and lack the ability to model fine-grained emotional expressions such as temporal dynamics. We propose Maestro-EVC, a controllable EVC framework that enables independent control of content, speaker identity, and emotion by effectively disentangling each attribute from separate references. We further introduce a temporal emotion representation and an explicit prosody modeling with prosody augmentation to robustly capture and transfer the temporal dynamics of the target emotion, even under prosody-mismatched conditions. Experimental results confirm that Maestro-EVC achieves high-quality, controllable, and emotionally expressive speech synthesis.
[NLP-105] Score Before You Speak: Improving Persona Consistency in Dialogue Generation using Response Quality Scores ECAI2025
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在对话生成中难以有效保持角色一致性(persona fidelity)的问题,其核心挑战在于现有对话数据集的多样性不足。为应对这一问题,作者提出了一种名为SBS(Score-Before-Speaking)的新框架,其关键创新在于将响应生成与质量评分的学习统一在一个训练步骤中:模型在训练时被教导将增强后的对话响应与其语义相似度得分相关联,从而在推理阶段利用该关联生成更符合角色设定的回复。具体而言,SBS采用基于名词替换的增强策略和语义相似度作为响应质量的代理指标,实验证明该方法显著提升了百万级和十亿级参数模型在PERSONA-CHAT和ConvAI2等基准数据集上的表现,且通过消融实验验证了在输入提示中加入分数信息优于传统训练方式。
链接: https://arxiv.org/abs/2508.06886
作者: Arpita Saggar,Jonathan C. Darling,Vania Dimitrova,Duygu Sarikaya,David C. Hogg
机构: University of Leeds (利兹大学)
类目: Computation and Language (cs.CL)
备注: Camera-Ready version for ECAI 2025. 8 pages
Abstract:Persona-based dialogue generation is an important milestone towards building conversational artificial intelligence. Despite the ever-improving capabilities of large language models (LLMs), effectively integrating persona fidelity in conversations remains challenging due to the limited diversity in existing dialogue data. We propose a novel framework SBS (Score-Before-Speaking), which outperforms previous methods and yields improvements for both million and billion-parameter models. Unlike previous methods, SBS unifies the learning of responses and their relative quality into a single step. The key innovation is to train a dialogue model to correlate augmented responses with a quality score during training and then leverage this knowledge at inference. We use noun-based substitution for augmentation and semantic similarity-based scores as a proxy for response quality. Through extensive experiments with benchmark datasets (PERSONA-CHAT and ConvAI2), we show that score-conditioned training allows existing models to better capture a spectrum of persona-consistent dialogues. Our ablation studies also demonstrate that including scores in the input prompt during training is superior to conventional training setups. Code and further details are available at this https URL
[NLP-106] The ReQAP System for Question Answering over Personal Information CIKM2025
【速读】: 该论文旨在解决用户设备上异构数据源(如日历、购物记录、健身工具、邮件和社交媒体内容等)中复杂查询的问答问题,这些查询通常涉及过滤、连接和聚合操作。解决方案的关键在于提出ReQAP系统,其通过递归分解问题并增量构建执行操作符树来实现高效处理;同时,系统在问题理解与每个操作符执行中均利用轻量级语言模型,并结合谨慎的微调策略,从而在保证准确性的前提下提升可解释性——通过追踪答案生成过程中的操作符执行路径,使结果可追溯至原始数据源,增强用户对系统的信任与理解。
链接: https://arxiv.org/abs/2508.06880
作者: Philipp Christmann,Gerhard Weikum
机构: Max Planck Institute for Informatics (马普研究所信息学所); Saarland Informatics Campus (萨尔兰计算机科学校园); Saarbruecken (萨尔布吕肯)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted at CIKM 2025 (demonstration paper)
Abstract:Personal information is abundant on users’ devices, from structured data in calendar, shopping records or fitness tools, to unstructured contents in mail and social media posts. This work presents the ReQAP system that supports users with answers for complex questions that involve filters, joins and aggregation over heterogeneous sources. The unique trait of ReQAP is that it recursively decomposes questions and incrementally builds an operator tree for execution. Both the question interpretation and the individual operators make smart use of light-weight language models, with judicious fine-tuning. The demo showcases the rich functionality for advanced user questions, and also offers detailed tracking of how the answers are computed by the operators in the execution tree. Being able to trace answers back to the underlying sources is vital for human comprehensibility and user trust in the system.
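The operator-tree idea in the abstract above can be illustrated with a small sketch. Everything here is hypothetical: the operator classes (`Scan`, `Filter`, `Sum`), the purchase records, and the hand-built tree for a question like "How much did I spend on coffee in July?" are invented for illustration. The real system builds its tree with language models and also supports joins, which this sketch omits.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class Scan:
    """Leaf operator: returns all rows of one personal data source."""
    name: str
    rows: List[Row]

    def run(self, trace: List[str]) -> List[Row]:
        trace.append(f"scan({self.name}) -> {len(self.rows)} rows")
        return list(self.rows)


@dataclass
class Filter:
    """Keeps only the child rows for which the predicate holds."""
    child: Any
    predicate: Callable[[Row], bool]
    label: str

    def run(self, trace: List[str]) -> List[Row]:
        rows = [r for r in self.child.run(trace) if self.predicate(r)]
        trace.append(f"filter({self.label}) -> {len(rows)} rows")
        return rows


@dataclass
class Sum:
    """Aggregates one numeric column over the child rows."""
    child: Any
    column: str

    def run(self, trace: List[str]) -> float:
        total = sum(r[self.column] for r in self.child.run(trace))
        trace.append(f"sum({self.column}) -> {total}")
        return total


# Made-up purchase records standing in for a shopping data source.
purchases = [
    {"item": "coffee", "month": 7, "price": 3.5},
    {"item": "book", "month": 7, "price": 12.0},
    {"item": "coffee", "month": 8, "price": 4.0},
]

tree = Sum(
    Filter(
        Filter(Scan("purchases", purchases), lambda r: r["item"] == "coffee", "item=coffee"),
        lambda r: r["month"] == 7,
        "month=7",
    ),
    "price",
)
trace: List[str] = []
answer = tree.run(trace)
```

The `trace` list records what each operator produced, in execution order, which mirrors the abstract's point that users should be able to follow how an answer was computed from the underlying sources.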
[NLP-107] ESNERA: Empirical and semantic named entity alignment for named entity dataset merging
【速读】: 该论文旨在解决命名实体识别(Named Entity Recognition, NER)领域中多源标注数据集合并时存在的标签空间不一致问题,特别是现有方法在标签映射上缺乏可解释性和扩展性的问题。其解决方案的关键在于提出一种基于标签相似性的自动对齐方法,通过融合经验相似性和语义相似性,并采用贪心的成对合并策略,实现跨数据集标签空间的有效统一,从而支持高效、可解释且可扩展的多源NER语料库集成。
链接: https://arxiv.org/abs/2508.06877
作者: Xiaobo Zhang(1 and 2),Congqing He(2),Ying He(1 and 2),Jian Peng(1),Dajie Fu(1),Tien-Ping Tan(2) ((1) School of Information Engineering, Jiangxi Vocational College of Finance & Economics, Jiujiang, China, (2) School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia)
机构: Universiti Sains Malaysia (马来西亚理科大学); Jiangxi Vocational College of Electronics and Communications (江西电子科技职业学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 30 pages, 12 figures
Abstract:Named Entity Recognition (NER) is a fundamental task in natural language processing. It remains a research hotspot due to its wide applicability across domains. Although recent advances in deep learning have significantly improved NER performance, they rely heavily on large, high-quality annotated datasets. However, building these datasets is expensive and time-consuming, posing a major bottleneck for further research. Current dataset merging approaches mainly focus on strategies like manual label mapping or constructing label graphs, which lack interpretability and scalability. To address this, we propose an automatic label alignment method based on label similarity. The method combines empirical and semantic similarities, using a greedy pairwise merging strategy to unify label spaces across different datasets. Experiments are conducted in two stages: first, merging three existing NER datasets into a unified corpus with minimal impact on NER performance; second, integrating this corpus with a small-scale, self-built dataset in the financial domain. The results show that our method enables effective dataset merging and enhances NER performance in the low-resource financial domain. This study presents an efficient, interpretable, and scalable solution for integrating multi-source NER corpora.
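A rough sketch of what a greedy pairwise merging of label spaces could look like follows. The abstract does not define how empirical and semantic similarity are computed or combined, so the precomputed similarity values, the weighted average, the `weight` and `threshold` parameters, and the label names are all assumptions. The sketch also simplifies by never recomputing similarities after a merge.

```python
from itertools import combinations


def greedy_merge(labels, empirical, semantic, weight=0.5, threshold=0.7):
    """Greedily merge label pairs whose combined similarity is high enough.

    empirical / semantic: dicts mapping frozenset({a, b}) to a similarity
    in [0, 1]; missing pairs count as 0. Returns sorted groups of labels.
    """
    groups = {label: {label} for label in labels}

    def score(a, b):
        key = frozenset((a, b))
        return weight * empirical.get(key, 0.0) + (1.0 - weight) * semantic.get(key, 0.0)

    # Visit candidate pairs from most to least similar.
    pairs = sorted(((score(a, b), a, b) for a, b in combinations(labels, 2)), reverse=True)
    for s, a, b in pairs:
        if s < threshold:
            break  # every remaining pair is even less similar
        group_a, group_b = groups[a], groups[b]
        if group_a is not group_b:
            group_a |= group_b
            for label in group_b:
                groups[label] = group_a

    unique = {id(g): g for g in groups.values()}
    return sorted(sorted(g) for g in unique.values())


# Hypothetical labels from two datasets with hypothetical similarity scores.
labels = ["PER", "PERSON", "ORG", "LOC"]
empirical = {frozenset(("PER", "PERSON")): 0.9, frozenset(("ORG", "LOC")): 0.2}
semantic = {frozenset(("PER", "PERSON")): 0.95, frozenset(("ORG", "LOC")): 0.3}
merged = greedy_merge(labels, empirical, semantic)
```

Because every merge is driven by an explicit score and threshold, the resulting label groups can be inspected pair by pair, which is the kind of interpretability the abstract contrasts with manual mapping and label graphs.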
[NLP-108] Text to Speech System for Meitei Mayek Script
【速读】: 该论文旨在解决Manipuri语言在低资源环境下的文本到语音(Text-to-Speech, TTS)合成问题,尤其针对使用Meitei Mayek字符系统的语言特性与声调音系结构的建模挑战。其解决方案的关键在于构建了一个基于Tacotron 2和HiFi-GAN的神经TTS架构,并开发了Meitei Mayek到ARPAbet的音素映射规则,同时采集并构建了一个单说话者语料库,从而实现了可懂且自然的语音合成效果,通过主观与客观指标验证了系统性能,为Manipuri语言的保护与技术包容性提供了基础支持。
链接: https://arxiv.org/abs/2508.06870
作者: Gangular Singh Irengbam,Nirvash Singh Wahengbam,Lanthoiba Meitei Khumanthem,Paikhomba Oinam
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注:
Abstract:This paper presents the development of a Text-to-Speech (TTS) system for the Manipuri language using the Meitei Mayek script. Leveraging Tacotron 2 and HiFi-GAN, we introduce a neural TTS architecture adapted to support tonal phonology and under-resourced linguistic environments. We develop a phoneme mapping for Meitei Mayek to ARPAbet, curate a single-speaker dataset, and demonstrate intelligible and natural speech synthesis, validated through subjective and objective metrics. This system lays the groundwork for linguistic preservation and technological inclusion of Manipuri. Submission history: From Wahengbam Nirvash Singh, [v1] Sat, 9 Aug 2025 07:40:53 UTC (123 KB)
[NLP-109] Annotating Errors in English Learners' Written Language Production: Advancing Automated Written Feedback Systems
【速读】: 该论文旨在解决当前自动写作评估(Automated Writing Evaluation, AWE)系统在语言学习支持方面的局限性问题,即这些系统虽能有效修正语法错误,但往往采用直接修改策略(如“一键修复”),忽视了学习者对语法规则的理解与内化需求。其关键解决方案在于提出一种基于错误类型和可迁移性的标注框架,将学习者的错误映射到具体的语法模式,并据此生成更符合教学目标的反馈——包括直接修正与间接提示两类。在此基础上,研究构建了一个标注数据集,用于评估关键词引导、无关键词和模板引导三种大语言模型(Large Language Models, LLMs)生成反馈的方法,并通过教师评估验证了不同方法在相关性、事实性和可理解性上的表现差异,从而为面向语言学习的智能反馈生成提供了结构化路径。
链接: https://arxiv.org/abs/2508.06810
作者: Steven Coyne,Diana Galvan-Sosa,Ryan Spring,Camélia Guerraoui,Michael Zock,Keisuke Sakaguchi,Kentaro Inui
机构: 未知
类目: Computation and Language (cs.CL)
备注: Pre-review version of DOI https://doi.org/10.1007/978-3-031-98459-4_21 , presented at AIED 2025. All content is as of submission time except for de-anonymization, ensuing layout fixes, use of the current code repository link, and BibTeX fixes. Readers are encouraged to refer to the published version
Abstract:Recent advances in natural language processing (NLP) have contributed to the development of automated writing evaluation (AWE) systems that can correct grammatical errors. However, while these systems are effective at improving text, they are not optimally designed for language learning. They favor direct revisions, often with a click-to-fix functionality that can be applied without considering the reason for the correction. Meanwhile, depending on the error type, learners may benefit most from simple explanations and strategically indirect hints, especially on generalizable grammatical rules. To support the generation of such feedback, we introduce an annotation framework that models each error’s error type and generalizability. For error type classification, we introduce a typology focused on inferring learners’ knowledge gaps by connecting their errors to specific grammatical patterns. Following this framework, we collect a dataset of annotated learner errors and corresponding human-written feedback comments, each labeled as a direct correction or hint. With this data, we evaluate keyword-guided, keyword-free, and template-guided methods of generating feedback using large language models (LLMs). Human teachers examined each system’s outputs, assessing them on grounds including relevance, factuality, and comprehensibility. We report on the development of the dataset and the comparative performance of the systems investigated.
[NLP-110] SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection
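Quick read: This paper targets three challenges that current LLM-based methods face in sarcasm detection: the limits of single-perspective analysis, inflexible static reasoning pathways, and a tendency to hallucinate on complex ironic rhetoric, all of which hurt accuracy and reliability. The key to the solution is SEVADE, a self-evolving multi-agent analysis framework. Its core is the Dynamic Agentive Reasoning Engine (DARE), in which a team of specialized agents grounded in linguistic theory deconstructs the text from multiple angles and produces a structured reasoning chain. A lightweight Rationale Adjudicator (RA) then makes the final classification based solely on that chain. Decoupling reasoning from the final judgment is intended to reduce hallucination risk and improve robustness and performance.
Link: https://arxiv.org/abs/2508.06803
Authors: Ziqi Liu,Yangbin Chen,Ziyang Zhou,Yilin Li,Mingxuan Hu,Yushan Pan,Zhijie Xu
Affiliations: Xi’an Jiaotong-Liverpool University
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: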
Abstract:Sarcasm detection is a crucial yet challenging Natural Language Processing task. Existing Large Language Model methods are often limited by single-perspective analysis, static reasoning pathways, and a susceptibility to hallucination when processing complex ironic rhetoric, which impacts their accuracy and reliability. To address these challenges, we propose SEVADE, a novel Self-Evolving multi-agent Analysis framework with Decoupled Evaluation for hallucination-resistant sarcasm detection. The core of our framework is a Dynamic Agentive Reasoning Engine (DARE), which utilizes a team of specialized agents grounded in linguistic theory to perform a multifaceted deconstruction of the text and generate a structured reasoning chain. Subsequently, a separate lightweight rationale adjudicator (RA) performs the final classification based solely on this reasoning chain. This decoupled architecture is designed to mitigate the risk of hallucination by separating complex reasoning from the final judgment. Extensive experiments on four benchmark datasets demonstrate that our framework achieves state-of-the-art performance, with average improvements of 6.75% in Accuracy and 6.29% in Macro-F1 score.
[NLP-111] Story Ribbons: Reimagining Storyline Visualizations with Large Language Models IEEE-VIS2025
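Quick read: This paper addresses the challenge of extracting and visualizing narrative relationships from unstructured text, in particular the complex interactions among characters, locations, and themes in literary works. The key to the solution is an LLM-driven data parsing pipeline that automatically extracts key narrative information from novels and scripts, together with Story Ribbons, an interactive visualization system built on that pipeline that supports exploring character trajectories and theme evolution at multiple narrative levels. Pipeline evaluations and user studies on 36 literary works demonstrate the potential of LLMs to streamline narrative visualization creation and reveal new insights about familiar stories.
Link: https://arxiv.org/abs/2508.06772
Authors: Catherine Yeh,Tara Menon,Robin Singh Arya,Helen He,Moira Weigel,Fernanda Viégas,Martin Wattenberg
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to IEEE VIS 2025 (11 pages, 9 figures)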
Abstract:Analyzing literature involves tracking interactions between characters, locations, and themes. Visualization has the potential to facilitate the mapping and analysis of these complex relationships, but capturing structured information from unstructured story data remains a challenge. As large language models (LLMs) continue to advance, we see an opportunity to use their text processing and analysis capabilities to augment and reimagine existing storyline visualization techniques. Toward this goal, we introduce an LLM-driven data parsing pipeline that automatically extracts relevant narrative information from novels and scripts. We then apply this pipeline to create Story Ribbons, an interactive visualization system that helps novice and expert literary analysts explore detailed character and theme trajectories at multiple narrative levels. Through pipeline evaluations and user studies with Story Ribbons on 36 literary works, we demonstrate the potential of LLMs to streamline narrative visualization creation and reveal new insights about familiar stories. We also describe current limitations of AI-based systems, and interaction motifs designed to address these issues.
[NLP-112] Many-Turn Jailbreaking
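Quick read: This paper addresses a gap in LLM safety research: jailbreaking in multi-turn conversations. Existing work focuses on single-turn jailbreaking, where one prompt elicits unsafe output, and overlooks the realistic case where users keep asking follow-up questions to clarify or extend the initial jailbroken content. The authors therefore propose studying multi-turn jailbreaking. As a first step, they construct the Multi-Turn Jailbreak Benchmark (MTJ-Bench), which systematically evaluates the safety vulnerabilities of open- and closed-source models over continued conversation and shows the risk of persistent unsafe responses, with the aim of prompting the community to build safer LLMs.
Link: https://arxiv.org/abs/2508.06755
Authors: Xianjun Yang,Liqiang Xiao,Shiyang Li,Faisal Ladhak,Hyokun Yun,Linda Ruth Petzold,Yi Xu,William Yang Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: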
Abstract:Current jailbreaking work on large language models (LLMs) aims to elicit unsafe outputs from given prompts. However, it only focuses on single-turn jailbreaking targeting one specific query. In contrast, advanced LLMs are designed to handle extremely long contexts and can thus conduct multi-turn conversations. So, we propose exploring multi-turn jailbreaking, in which the jailbroken LLMs are continuously tested on more than the first-turn conversation or a single target query. This is an even more serious threat because 1) it is common for users to continue asking relevant follow-up questions to clarify certain jailbroken details, and 2) it is also possible that the initial round of jailbreaking causes the LLMs to respond to additional irrelevant questions consistently. As a first step in exploring multi-turn jailbreaking (first draft completed in June 2024), we construct a Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models and provide novel insights into this new safety threat. By revealing this new vulnerability, we aim to call for community efforts to build safer LLMs and pave the way for a more in-depth understanding of jailbreaking LLMs.
[NLP-113] Large Language Models for Oral History Understanding with Text Classification and Sentiment Analysis
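Quick read: This paper addresses the challenges of analyzing large oral history archives, specifically the Japanese American Incarceration Oral History (JAIOH) collection: unstructured format, emotional complexity, and high manual annotation cost. It proposes a scalable framework for automated semantic and sentiment annotation. The key is a multiphase process that combines carefully designed prompts with expert-annotated data to evaluate LLMs such as ChatGPT, Llama, and Qwen. The approach is validated on 558 expert-labeled sentences, and the best prompt configurations are then applied to 92,191 sentences in the JAIOH corpus. This improves analysis efficiency while maintaining cultural sensitivity, and it provides a reusable annotation pipeline and practical guidance for the responsible use of AI in digital humanities.
Link: https://arxiv.org/abs/2508.06729
Authors: Komala Subramanyam Cherukuri,Pranav Abishai Moses,Aisa Sakata,Jiangping Chen,Haihua Chen
Affiliations: University of North Texas; University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: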
Abstract:Oral histories are vital records of lived experience, particularly within communities affected by systemic injustice and historical erasure. Effective and efficient analysis of their oral history archives can promote access and understanding of the oral histories. However, large-scale analysis of these archives remains limited due to their unstructured format, emotional complexity, and high annotation costs. This paper presents a scalable framework to automate semantic and sentiment annotation for Japanese American Incarceration Oral History. Using LLMs, we construct a high-quality dataset, evaluate multiple models, and test prompt engineering strategies in historically sensitive contexts. Our multiphase approach combines expert annotation, prompt design, and LLM evaluation with ChatGPT, Llama, and Qwen. We labeled 558 sentences from 15 narrators for sentiment and semantic classification, then evaluated zero-shot, few-shot, and RAG strategies. For semantic classification, ChatGPT achieved the highest F1 score (88.71%), followed by Llama (84.99%) and Qwen (83.72%). For sentiment analysis, Llama slightly outperformed Qwen (82.66%) and ChatGPT (82.29%), with all models showing comparable results. The best prompt configurations were used to annotate 92,191 sentences from 1,002 interviews in the JAIOH collection. Our findings show that LLMs can effectively perform semantic and sentiment annotation across large oral history collections when guided by well-designed prompts. This study provides a reusable annotation pipeline and practical guidance for applying LLMs in culturally sensitive archival analysis. By bridging archival ethics with scalable NLP techniques, this work lays the groundwork for responsible use of artificial intelligence in digital humanities and preservation of collective memory. GitHub: this https URL.
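[NLP-114] Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge
Quick read: This paper addresses self-bias in large language models (LLMs) used as judges: a model may systematically give its own outputs higher scores, distorting the assessment of true model performance. The key to the solution is a statistical framework that models the difference between the score distribution an LLM judge assigns to its own outputs and the one it assigns to other models' outputs, while using annotations from an independent third-party judge (such as humans) to account for underlying quality. This allows self-bias to be identified and quantified reliably even when models differ in ability, so that genuine performance differences are not mistaken for bias.
Link: https://arxiv.org/abs/2508.06709
Authors: Evangelia Spiliopoulou,Riccardo Fogliato,Hanna Burnsky,Tamer Soliman,Jie Ma,Graham Horwood,Miguel Ballesteros
Affiliations: Amazon Web Services
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: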
Abstract:Large language models (LLMs) can serve as judges that offer rapid and reliable assessments of other LLM outputs. However, models may systematically assign overly favorable ratings to their own outputs, a phenomenon known as self-bias, which can distort evaluations of true model performance. Previous studies often conflate genuine differences in model quality with bias or incorrectly assume that evaluations from LLMs and humans follow the same rating distributions. In this work, we present a statistical framework that explicitly formalizes assumptions under which self-bias can be identified and estimated. Our method models the difference in the scoring distribution that LLM-as-a-judge assigns to its own completions compared to other models, while accounting for the underlying quality of the completions provided by an independent, third-party judge (e.g., humans). Our method reliably isolates and quantifies self-bias, even when models vary in ability, ensuring that genuine performance differences are not mistaken for self-bias. We conduct an empirical analysis of self-bias on a large dataset (5000 prompt-completion pairs) consisting of expert human annotations and judgments from nine different LLM judges. We find that some models, such as GPT-4o and Claude 3.5 Sonnet, systematically assign higher scores to their own outputs. These models also display family bias, systematically assigning higher ratings to outputs produced by other models of the same family. Our findings highlight potential pitfalls of using LLM judges and offer practical guidance to mitigate biases when interpreting automated evaluations.
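The idea of separating self-bias from genuine quality differences can be illustrated with a much simpler model than the paper's. The sketch below is an assumption for illustration, not the authors' method: it fits an ordinary least-squares model of the form judge = a + b * human + gamma * own, and reads gamma as the extra score a judge gives its own outputs at a fixed human-rated quality. The function names and the toy data are hypothetical.

```python
from statistics import fmean

def _residualize(y, x):
    """Residuals of y after a least-squares fit on x with an intercept."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return [(yi - my) - slope * (xi - mx) for xi, yi in zip(x, y)]

def estimate_self_bias(judge, human, is_own):
    """Coefficient on the 'own output' indicator in judge ~ 1 + human + own.

    Uses the Frisch-Waugh-Lovell identity: regress both the judge score and
    the indicator on the human score, then regress residual on residual.
    """
    own = [float(v) for v in is_own]
    r_judge = _residualize(judge, human)
    r_own = _residualize(own, human)
    return sum(a * b for a, b in zip(r_own, r_judge)) / sum(a * a for a in r_own)

# Toy, noise-free data (hypothetical): the judge adds 0.5 to its own outputs.
human = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
own = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
judge = [0.2 + 0.9 * h + 0.5 * o for h, o in zip(human, own)]

gamma = estimate_self_bias(judge, human, own)
print(round(gamma, 6))
```

Because the human score enters as a covariate, a model that is simply better (higher human scores) does not register as biased; only the shift that remains after conditioning on human-rated quality counts toward gamma.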
[NLP-115] Do Biased Models Have Biased Thoughts?
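Quick read: This paper studies bias in LLM-generated text, specifically how bias in the model's reasoning process (its chain of thought) relates to bias in its final output. The central question is whether models that make biased decisions also show biased thinking. The approach quantifies 11 types of bias (such as gender, race, and socio-economic status) and compares bias in the chain-of-thought stage with bias in the final output across 5 popular LLMs. Bias in the output is not highly correlated with bias in the reasoning steps (correlation below 0.6 with p < 0.001 in most cases). In other words, unlike humans, these models can make biased decisions without the reasoning path carrying obvious bias, suggesting that the bias may originate elsewhere, for example in training data or optimization objectives, rather than in explicit chain-of-thought bias.
Link: https://arxiv.org/abs/2508.06671
Authors: Swati Rajwal,Shivank Garg,Reem Abdel-Salam,Abdelrahman Zayed
Affiliations: Emory University; Indian Institute of Technology Roorkee; Cairo University; Mila - Quebec AI Institute; Polytechnique Montréal; Amazon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at main track of the Second Conference on Language Modeling (COLM 2025)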
Abstract:The impressive performance of language models is undeniable. However, the presence of biases based on gender, race, socio-economic status, physical appearance, and sexual orientation makes the deployment of language models challenging. This paper studies the effect of chain-of-thought prompting, a recent approach that studies the steps followed by the model before it responds, on fairness. More specifically, we ask the following question: "Do biased models have biased thoughts?" To answer our question, we conduct experiments on 5 popular large language models using fairness metrics to quantify 11 different biases in the model’s thoughts and output. Our results show that the bias in the thinking steps is not highly correlated with the output bias (less than 0.6 correlation with a p-value smaller than 0.001 in most cases). In other words, unlike human beings, the tested models with biased decisions do not always possess biased thoughts.
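[NLP-116] Testing the Limits of Machine Translation from One Book
Quick read: This paper addresses the performance bottleneck of LLM-assisted translation for low-resource languages such as Kanuri, with a focus on how limited language resources can improve LLM translation quality in a specific domain (health and humanitarian terminology). The key is to design two evaluation datasets and systematically test how different combinations of language resources (grammar, dictionary, and parallel sentences) affect translation. Results show that parallel sentences remain the most effective data source for accuracy and fluency; adding a grammar improves over zero-shot translation but cannot support high-quality domain translation on its own; and human evaluation shows that LLMs achieve accuracy (meaning) more effectively than fluency (grammaticality), which underlines the value of multidimensional evaluation of LLM translation.
Link: https://arxiv.org/abs/2508.06665
Authors: Jonathan Shaw,Dillon Mee,Timothy Khouw,Zackary Leech,Daniel Wilson
Affiliations: XRi Global
Categories: Computation and Language (cs.CL)
Comments: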
Abstract:Current state-of-the-art models demonstrate capacity to leverage in-context learning to translate into previously unseen language contexts. Tanzer et al. [2024] utilize language materials (e.g. a grammar) to improve translation quality for Kalamang using large language models (LLMs). We focus on Kanuri, a language that, despite having substantial speaker population, has minimal digital resources. We design two datasets for evaluation: one focused on health and humanitarian terms, and another containing generalized terminology, investigating how domain-specific tasks impact LLM translation quality. By providing different combinations of language resources (grammar, dictionary, and parallel sentences), we measure LLM translation effectiveness, comparing results to native speaker translations and human linguist performance. We evaluate using both automatic metrics and native speaker assessments of fluency and accuracy. Results demonstrate that parallel sentences remain the most effective data source, outperforming other methods in human evaluations and automatic metrics. While incorporating grammar improves over zero-shot translation, it fails as an effective standalone data source. Human evaluations reveal that LLMs achieve accuracy (meaning) more effectively than fluency (grammaticality). These findings suggest LLM translation evaluation benefits from multidimensional assessment beyond simple accuracy metrics, and that grammar alone, without parallel sentences, does not provide sufficient context for effective domain-specific translation. Submission history: v1 submitted by Dillon Mee, Fri, 8 Aug 2025 19:27:44 UTC (38 KB).
[NLP-117] Measuring Stereotype and Deviation Biases in Large Language Models
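Quick read: This paper examines two kinds of bias that LLMs may show in generated content: stereotype bias and deviation bias. Stereotype bias is when a model consistently associates particular attributes (such as political affiliation, religion, or sexual orientation) with a given demographic group; deviation bias is the gap between the demographic distributions found in model-generated content and real-world distributions. By asking four advanced LLMs to generate profiles of individuals, the study measures the strength of these associations and the distributional gaps, and finds that all tested models exhibit significant bias of both kinds. The key contribution is identifying and quantifying these biases, which gives an empirical basis for building fairer and more trustworthy generative AI systems.
Link: https://arxiv.org/abs/2508.06649
Authors: Daniel Wang,Eli Brignac,Minjia Mao,Xiao Fang
Affiliations: University of Maryland, College Park; University of Delaware
Categories: Computation and Language (cs.CL)
Comments: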
Abstract:Large language models (LLMs) are widely applied across diverse domains, raising concerns about their limitations and potential risks. In this study, we investigate two types of bias that LLMs may display: stereotype bias and deviation bias. Stereotype bias refers to when LLMs consistently associate specific traits with a particular demographic group. Deviation bias reflects the disparity between the demographic distributions extracted from LLM-generated content and real-world demographic distributions. By asking four advanced LLMs to generate profiles of individuals, we examine the associations between each demographic group and attributes such as political affiliation, religion, and sexual orientation. Our experimental results show that all examined LLMs exhibit both significant stereotype bias and deviation bias towards multiple groups. Our findings uncover the biases that occur when LLMs infer user attributes and shed light on the potential harms of LLM-generated outputs.
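Deviation bias, as defined above, is a gap between two distributions, so any distribution distance can make it concrete. The abstract does not say which measure the authors use; the sketch below assumes total variation distance purely for illustration, and the group labels and reference shares are made-up placeholders rather than real demographic data.

```python
from collections import Counter

def total_variation_distance(generated_labels, reference_dist):
    """Half the L1 distance between the empirical distribution of
    generated_labels and a reference distribution given as {label: share}."""
    counts = Counter(generated_labels)
    n = len(generated_labels)
    groups = set(counts) | set(reference_dist)
    return 0.5 * sum(
        abs(counts.get(g, 0) / n - reference_dist.get(g, 0.0)) for g in groups
    )

# Hypothetical example: 10 generated profiles versus an assumed 50/50 reference.
generated = ["A"] * 7 + ["B"] * 3
reference = {"A": 0.5, "B": 0.5}

tvd = total_variation_distance(generated, reference)
print(f"{tvd:.2f}")
```

A value of 0 means the generated profiles match the reference shares exactly, and 1 means the two distributions do not overlap at all.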
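[NLP-118] Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models EMNLP
Quick read: This paper examines how language model performance is affected when Byte-Pair Encoding (BPE) inference differs from the encoding procedure used during BPE training, motivated by the finding that the merge list exposes a potential attack surface for extracting information about training data. The study compares two classes of inference schemes: (a) targeted deviations from the merge list, such as random merge orders and corruptions by deletion or truncation, and (b) non-targeted algorithms that do not depend on the merge list at all and instead compress the text greedily or exactly. Experiments show that targeted deviations degrade performance significantly, whereas non-targeted, merge-list-free inference has minimal impact on downstream tasks (question answering, machine translation, and open-ended generation), often much smaller than expected. This points toward simpler and potentially more privacy-preserving tokenization schemes.
Link: https://arxiv.org/abs/2508.06621
Authors: Tomohiro Sawada,Kartik Goyal
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: Submitted to EMNLP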
Abstract:Standard Byte-Pair Encoding (BPE) tokenization compresses text by pairing a learned token vocabulary with a detailed merge list. Recent work has shown that this merge list exposes a potential attack surface for extracting information about a language model’s training data. In this paper, we explore the downstream impact of BPE inference algorithms that do not rely on this merge list at all, and hence differ from the encoding process during BPE training. To address this question, we investigate two broad classes of BPE inference schemes that differ from BPE application during training: a) targeted deviation from merge-lists including random merge orders, and various corruptions of merge list involving deletion/truncation, and b) non-targeted BPE inference algorithms that do not depend on the merge list but focus on compressing the text either greedily or exactly. Extensive experiments across diverse language modeling tasks like accuracy-based QA benchmarks, machine translation, and open-ended generation reveal that while targeted deviation from the merge lists exhibits significant degradation in language model performance, the non-targeted merge-list-free inference algorithms result in minimal impact on downstream performance that is often much smaller than expected. These findings pave the way for simpler and potentially more privacy-preserving tokenization schemes that do not catastrophically compromise model performance.
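The contrast between merge-list inference and merge-list-free inference can be sketched on a toy vocabulary. The code below is a simplified illustration, not the paper's implementation: the merge-list encoder applies the highest-priority applicable merge one occurrence at a time, and greedy left-to-right longest match is assumed here as one plausible reading of "compressing the text greedily". The merge list and vocabulary are hypothetical.

```python
def bpe_encode_with_merges(word, merges):
    """Simplified BPE inference: repeatedly apply the highest-priority merge
    (earliest in the merge list) until no listed pair remains."""
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while len(tokens) > 1:
        candidates = [
            (rank.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no adjacent pair appears in the merge list
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

def greedy_longest_match(word, vocab):
    """Merge-list-free inference: at each position take the longest substring
    found in the vocabulary, falling back to a single character."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

# Hypothetical merge list (highest priority first) and the vocabulary it induces.
merges = [("b", "c"), ("a", "b")]
vocab = {"a", "b", "c", "bc", "ab"}

merge_based = bpe_encode_with_merges("abc", merges)
merge_free = greedy_longest_match("abc", vocab)
print(merge_based, merge_free)
```

On this input the two schemes segment the same string differently while using the same vocabulary and the same number of tokens, which is the kind of train/inference mismatch whose downstream effect the paper measures.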
[NLP-119] BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
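Quick read: This paper addresses fairness and transparency problems in evaluating Deep-Research agents. Existing benchmarks such as BrowseComp rely on black-box live web search APIs, which makes fair comparison between methods difficult and gives no control over the document corpus, preventing independent analysis of core components such as the retriever. The key to the solution is BrowseComp-Plus, a benchmark derived from BrowseComp that uses a fixed, carefully curated corpus. Each query comes with human-verified supporting documents and mined challenging negatives, enabling controlled experiments. This design separates the contribution of the LLM from that of the retrieval strategy and supports analysis of retrieval effectiveness, citation accuracy, and context engineering.
Link: https://arxiv.org/abs/2508.06600
Authors: Zijian Chen,Xueguang Ma,Shengyao Zhuang,Ping Nie,Kai Zou,Andrew Liu,Joshua Green,Kshama Patel,Ruoxi Meng,Mingyi Su,Sahel Sharifymoghaddam,Yanxi Li,Haoran Hong,Xinyu Shi,Xuye Liu,Nandan Thakur,Crystina Zhang,Luyu Gao,Wenhu Chen,Jimmy Lin
Affiliations: University of Waterloo; CSIRO; Independent; Carnegie Mellon University; The University of Queensland
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: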
Abstract:Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp, which rely on black-box live web search APIs, have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, the current evaluations may compare a complete deep research system at a given time, but they do not foster well-controlled experiments to provide insights into the capability of underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.
[NLP-120] LLM Unlearning Without an Expert Curated Dataset
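Quick read: This paper addresses post-hoc unlearning of sensitive, harmful, or copyrighted knowledge in LLMs, that is, removing specific domains of knowledge from a model without full retraining. The core challenge is constructing a high-quality forget set that guides the model to forget the target knowledge. The key is a scalable, automated synthesis method: the language model itself generates textbook-style data through structured prompting, and only a domain name is required as input. Experiments on biosecurity, cybersecurity, and the Harry Potter novels show that the resulting datasets outperform existing synthetic baselines and are comparable to expert-curated data, and that the multi-step generation pipeline significantly increases data diversity, which in turn improves unlearning utility.
Link: https://arxiv.org/abs/2508.06595
Authors: Xiaoyuan Zhu,Muru Zhang,Ollie Liu,Robin Jia,Willie Neiswanger
Affiliations: University of Southern California
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: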
Abstract:Modern large language models often encode sensitive, harmful, or copyrighted knowledge, raising the need for post-hoc unlearning: the ability to remove specific domains of knowledge from a model without full retraining. A major bottleneck in current unlearning pipelines is constructing effective forget sets, that is, datasets that approximate the target domain and guide the model to forget it. In this work, we introduce a scalable, automated approach to generate high-quality forget sets using language models themselves. Our method synthesizes textbook-style data through a structured prompting pipeline, requiring only a domain name as input. Through experiments on unlearning biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic datasets consistently outperform the baseline synthetic alternatives and are comparable to the expert-curated ones. Additionally, ablation studies reveal that the multi-step generation pipeline significantly boosts data diversity, which in turn improves unlearning utility. Overall, our findings suggest that synthetic datasets offer a promising path toward practical, scalable unlearning for a wide range of emerging domains without the need for manual intervention. We release our code and dataset at this https URL.
[NLP-121] Generative Artificial Intelligence Extracts Structure-Function Relationships from Plants for New Materials
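Quick read: This paper addresses the limited use of generative AI in discipline-specific experimental science, especially in highly multidisciplinary fields such as materials science. It focuses on using AI to extract structure-property relationships from previously unconnected fields (plant science, biomimetics, and materials engineering) and to guide the design of new bioinspired materials. The key is a framework, described by the authors as first of its kind, that integrates a fine-tuned model (BioinspiredLLM), Retrieval-Augmented Generation (RAG), agentic systems, and a Hierarchical Sampling strategy. Structured inference protocols generate and evaluate hundreds of experimentally testable hypotheses from a single query, turning cross-domain knowledge into actionable materials designs and mechanical predictions. The approach was validated in the laboratory by fabricating a novel pollen-based adhesive with tunable morphology and measured shear strength.
Link: https://arxiv.org/abs/2508.06591
Authors: Rachel K. Luu,Jingyu Deng,Mohammed Shahrudin Ibrahim,Nam-Joon Cho,Ming Dao,Subra Suresh,Markus J. Buehler
Affiliations: Massachusetts Institute of Technology; Nanyang Technological University
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Other Condensed Matter (cond-mat.other); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: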
Abstract:Large language models (LLMs) have reshaped the research landscape by enabling new approaches to knowledge retrieval and creative ideation. Yet their application in discipline-specific experimental science, particularly in highly multi-disciplinary domains like materials science, remains limited. We present a first-of-its-kind framework that integrates generative AI with literature from hitherto-unconnected fields such as plant science, biomimetics, and materials engineering to extract insights and design experiments for materials. We focus on humidity-responsive systems such as pollen-based materials and Rhapis excelsa (broadleaf lady palm) leaves, which exhibit self-actuation and adaptive performance. Using a suite of AI tools, including a fine-tuned model (BioinspiredLLM), Retrieval-Augmented Generation (RAG), agentic systems, and a Hierarchical Sampling strategy, we extract structure-property relationships and translate them into new classes of bioinspired materials. Structured inference protocols generate and evaluate hundreds of hypotheses from a single query, surfacing novel and experimentally tractable ideas. We validate our approach through real-world implementation: LLM-generated procedures, materials designs, and mechanical predictions were tested in the laboratory, culminating in the fabrication of a novel pollen-based adhesive with tunable morphology and measured shear strength, establishing a foundation for future plant-derived adhesive design. This work demonstrates how AI-assisted ideation can drive real-world materials design and enable effective human-AI collaboration.
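[NLP-122] Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs
Quick read: This paper addresses the lack of learner-adaptive teaching in current LLM tutoring dialogue: going beyond generating Socratic questions to adaptive scaffolding that adjusts strategy to the learner's cognitive state. The core difficulty is that existing models struggle to recognize a learner's confusion or needs and to adjust their teaching accordingly. The key to the solution is GuideEval, a benchmark grounded in authentic educational dialogues that evaluates guidance through a three-phase behavioral framework (Perception, Orchestration, Elicitation), together with a behavior-guided fine-tuning strategy that trains models on behavior-prompted instructional dialogues and significantly improves guidance performance.
Link: https://arxiv.org/abs/2508.06583
Authors: Ying Liu,Can Li,Ting Zhang,Mei Wang,Qiannan Zhu,Jian Li,Hua Huang
Affiliations: Beijing Normal University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: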
Abstract:The conversational capabilities of large language models hold significant promise for enabling scalable and interactive tutoring. While prior research has primarily examined their capacity for Socratic questioning, it often overlooks a critical dimension: adaptively guiding learners based on their cognitive states. This study shifts focus from mere question generation to the broader instructional guidance capability. We ask: Can LLMs emulate expert tutors who dynamically adjust strategies in response to learners’ understanding? To investigate this, we propose GuideEval, a benchmark grounded in authentic educational dialogues that evaluates pedagogical guidance through a three-phase behavioral framework: (1) Perception, inferring learner states; (2) Orchestration, adapting instructional strategies; and (3) Elicitation, stimulating proper reflections. Empirical findings reveal that existing LLMs frequently fail to provide effective adaptive scaffolding when learners exhibit confusion or require redirection. Furthermore, we introduce a behavior-guided finetuning strategy that leverages behavior-prompted instructional dialogues, significantly enhancing guidance performance. By shifting the focus from isolated content evaluation to learner-centered interaction, our work advocates a more dialogic paradigm for evaluating Socratic LLMs.
[NLP-123] Factor Augmented Supervised Learning with Text Embeddings
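Quick read: This paper addresses the inefficiency and computational cost of using high-dimensional LLM text embeddings in downstream tasks. The key to the solution is AutoEncoder-Augmented Learning with Text (AEALT), a supervised, factor-augmented framework that builds dimension reduction directly into pre-trained LLM workflows: embeddings are first extracted from text documents and then passed through a supervised augmented autoencoder to learn low-dimensional, task-relevant latent factors. By modeling the nonlinear structure of complex embeddings, AEALT outperforms conventional deep-learning approaches that rely on raw embeddings.
Link: https://arxiv.org/abs/2508.06548
Authors: Zhanye Luo,Yuefeng Han,Xiufan Yu
Affiliations: University of Chicago; University of Notre Dame
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: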
Abstract:Large language models (LLMs) generate text embeddings from text data, producing vector representations that capture the semantic meaning and contextual relationships of words. However, the high dimensionality of these embeddings often impedes efficiency and drives up computational cost in downstream tasks. To address this, we propose AutoEncoder-Augmented Learning with Text (AEALT), a supervised, factor-augmented framework that incorporates dimension reduction directly into pre-trained LLM workflows. First, we extract embeddings from text documents; next, we pass them through a supervised augmented autoencoder to learn low-dimensional, task-relevant latent factors. By modeling the nonlinear structure of complex embeddings, AEALT outperforms conventional deep-learning approaches that rely on raw embeddings. We validate its broad applicability with extensive experiments on classification, anomaly detection, and prediction tasks using multiple real-world public datasets. Numerical results demonstrate that AEALT yields substantial gains over both vanilla embeddings and several standard dimension reduction methods.
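[NLP-124] The Art of Breaking Words: Rethinking Multilingual Tokenizer Design
Quick read: This paper addresses tokenization inefficiency, a relatively neglected aspect of multilingual LLM development, with a focus on Indic languages and their high script diversity and orthographic complexity. Existing tokenizers often show high token-to-word ratios, inefficient use of context length, and slower inference. The approach first analyzes systematically how vocabulary size, pre-tokenization rules, and training-corpus composition affect token efficiency and model quality, then proposes a data composition algorithm that balances multilingual data for tokenizer training. The algorithm reduces the average token-to-word ratio by about 6% relative to conventional data randomization, and the resulting tokenizer improves the average token-to-word ratio by more than 40% over state-of-the-art multilingual Indic models, with gains in model performance and inference speed. Tokenization is thus presented, alongside architecture and training objectives, as a critical lever for building efficient, scalable multilingual LLMs.
Link: https://arxiv.org/abs/2508.06533
Authors: Aamod Thakur,Ajay Nagpal,Atharva Savarkar,Kundeshwar Pundalik,Siddhesh Dosi,Piyush Sawarkar,Viraj Thakur,Rohit Saluja,Maunendra Sankar Desarkar,Ganesh Ramakrishnan
Affiliations: BharatGen Team
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: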
Abstract:While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which present unique challenges due to their high script diversity and orthographic complexity. Drawing on the insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. Our observations on pretokenization strategies significantly improve model performance, and our data composition algorithm reduces the average token-to-word ratio by approximately 6% with respect to the conventional data randomization approach. Our tokenizer achieves more than 40% improvement on average token-to-word ratio against state-of-the-art multilingual Indic models. This improvement yields measurable gains in both model performance and inference speed. This highlights tokenization alongside architecture and training objectives as a critical lever for building efficient, scalable multilingual LLMs.
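The token-to-word ratio used as the efficiency metric above is straightforward to compute for any tokenizer. The sketch below is a minimal illustration under stated assumptions: words are counted by whitespace splitting (the paper's exact word segmentation is not given here), and the character-level tokenizer is a hypothetical stand-in for a real one.

```python
def token_to_word_ratio(texts, tokenize):
    """Total tokens divided by total whitespace-delimited words over a corpus.

    `tokenize` is any callable mapping a string to a list of tokens.
    Lower values mean fewer tokens are spent per word.
    """
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

def char_tokenizer(text):
    """Hypothetical worst-case tokenizer: one token per non-space character."""
    return [c for c in text if not c.isspace()]

corpus = ["ab cd", "efg"]  # toy corpus: 3 words, 7 non-space characters
ratio = token_to_word_ratio(corpus, char_tokenizer)
print(f"{ratio:.3f}")
```

Comparing this ratio for two tokenizers on the same per-language corpus gives the kind of relative improvement figure the abstract reports, provided both are evaluated on identical text and with the same word-counting rule.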
[NLP-125] CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models
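Quick read: This paper addresses the neglect of carbon emissions in LLM development: conventional neural scaling laws relate model performance only to parameter count, dataset size, and compute, and ignore the carbon footprint of training. The authors propose CarbonScaling, a framework that extends neural scaling laws to include both operational and embodied carbon, integrating models of GPU hardware evolution, parallelism optimization, and carbon estimation to connect model accuracy quantitatively to carbon footprint. The main findings are that real-world inefficiencies significantly increase the carbon scaling factor, that hardware technology scaling offers diminishing returns for extremely large LLMs, and that aggressive critical batch size scaling helps alleviate this inefficiency.
Link: https://arxiv.org/abs/2508.06524
Authors: Lei Jiang,Fan Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 8 pages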
Abstract:Neural scaling laws have driven the development of increasingly large language models (LLMs) by linking accuracy improvements to growth in parameter count, dataset size, and compute. However, these laws overlook the carbon emissions that scale exponentially with LLM size. This paper presents CarbonScaling, an analytical framework that extends neural scaling laws to incorporate both operational and embodied carbon in LLM training. By integrating models for neural scaling, GPU hardware evolution, parallelism optimization, and carbon estimation, CarbonScaling quantitatively connects model accuracy to carbon footprint. Results show that while a power-law relationship between accuracy and carbon holds, real-world inefficiencies significantly increase the scaling factor. Hardware technology scaling reduces carbon emissions for small to mid-sized models, but offers diminishing returns for extremely large LLMs due to communication overhead and underutilized GPUs. Training optimizations, especially aggressive critical batch size scaling, help alleviate this inefficiency. CarbonScaling offers key insights for training more sustainable and carbon-efficient LLMs.
[NLP-126] Retrieval augmented generation based dynamic prompting for few-shot biomedical named entity recognition using large language models
Quick read: This paper addresses the limited performance of large language models (LLMs) on few-shot biomedical named entity recognition (NER). The key to the solution is a dynamic prompting strategy based on retrieval-augmented generation (RAG): annotated in-context examples are selected by their similarity to the input text, and the prompt is updated for each instance during inference, so the model adapts better to the specific input. Experiments show that dynamic prompting improves on static prompting, raising average F1-scores by 7.3% and 5.6% in the 5-shot and 10-shot settings, respectively.
Link: https://arxiv.org/abs/2508.06504
Authors: Yao Ge,Sudeshna Das,Yuting Guo,Abeed Sarker
Affiliations: Emory University; National Institutes of Health; Georgia Institute of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 31 pages, 4 figures, 15 tables
Abstract:Biomedical named entity recognition (NER) is a high-utility natural language processing (NLP) task, and large language models (LLMs) show promise particularly in few-shot settings (i.e., limited training data). In this article, we address the performance challenges of LLMs for few-shot biomedical NER by investigating a dynamic prompting strategy involving retrieval-augmented generation (RAG). In our approach, the annotated in-context learning examples are selected based on their similarities with the input texts, and the prompt is dynamically updated for each instance during inference. We implemented and optimized static and dynamic prompt engineering techniques and evaluated them on five biomedical NER datasets. Static prompting with structured components increased average F1-scores by 12% for GPT-4, and 11% for GPT-3.5 and LLaMA 3-70B, relative to basic static prompting. Dynamic prompting further improved performance, with TF-IDF and SBERT retrieval methods yielding the best results, improving average F1-scores by 7.3% and 5.6% in 5-shot and 10-shot settings, respectively. These findings highlight the utility of contextually adaptive prompts via RAG for biomedical NER.
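The per-instance example selection described above can be sketched with a small TF-IDF retriever. This is a minimal illustration, not the authors' pipeline: the whitespace tokenization, the smoothed IDF formula, the prompt wording, and the three-sentence annotated pool are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, pool, k):
    """Pick the k annotated examples whose text is most similar to the query."""
    # Assumption: the query is included when computing document frequencies.
    vectors = tfidf_vectors([text for text, _ in pool] + [query])
    query_vec = vectors[-1]
    ranked = sorted(
        range(len(pool)),
        key=lambda i: cosine(query_vec, vectors[i]),
        reverse=True,
    )
    return [pool[i] for i in ranked[:k]]


def build_prompt(query, pool, k=2):
    """Assemble a few-shot prompt whose examples change with every input."""
    lines = ["Extract the biomedical entities from the text."]
    for text, entities in select_examples(query, pool, k):
        lines.append(f"Text: {text}\nEntities: {', '.join(entities)}")
    lines.append(f"Text: {query}\nEntities:")
    return "\n\n".join(lines)


# Hypothetical annotated pool: (sentence, gold entity list).
pool = [
    ("aspirin reduced fever in patients", ["aspirin", "fever"]),
    ("the weather was sunny today", []),
    ("ibuprofen relieved headache in patients", ["ibuprofen", "headache"]),
]
prompt = build_prompt("aspirin relieved headache", pool, k=2)
```

Swapping `cosine` over TF-IDF vectors for cosine similarity over sentence embeddings would give the SBERT-style variant the abstract mentions, with the rest of the prompt assembly unchanged.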
[NLP-127] Semi-automated Fact-checking in Portuguese: Corpora Enrichment using Retrieval with Claim extraction
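Quick read: This paper addresses the scarcity of publicly available Portuguese-language datasets that integrate external evidence, a key obstacle to building robust Automatic Fact-Checking (AFC) systems, since most existing resources support only classification based on intrinsic text features. The core of the solution is a methodology that simulates a user's verification process: a large language model (LLM, specifically Gemini 1.5 Flash) extracts the main claim from each news text, and search engine APIs (Google Search API and Google FactCheck Claims Search API) retrieve relevant external evidence documents, enriching Portuguese news corpora (this http URL, this http URL, and MuMiN-PT) with external evidence. A data validation and preprocessing framework, including near-duplicate detection, is also introduced to improve the quality of the base corpora.
Link: https://arxiv.org/abs/2508.06495
Authors: Juliana Resplande Sant’anna Gomes,Arlindo Rodrigues Galvão Filho
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Master Thesis in Computer Science at Federal University of Goias (UFG). Written in Portuguese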
Abstract:The accelerated dissemination of disinformation often outpaces the capacity for manual fact-checking, highlighting the urgent need for Semi-Automated Fact-Checking (SAFC) systems. Within the Portuguese language context, there is a noted scarcity of publicly available datasets that integrate external evidence, an essential component for developing robust AFC systems, as many existing resources focus solely on classification based on intrinsic text features. This dissertation addresses this gap by developing, applying, and analyzing a methodology to enrich Portuguese news corpora (this http URL, this http URL, MuMiN-PT) with external evidence. The approach simulates a user’s verification process, employing Large Language Models (LLMs, specifically Gemini 1.5 Flash) to extract the main claim from texts and search engine APIs (Google Search API, Google FactCheck Claims Search API) to retrieve relevant external documents (evidence). Additionally, a data validation and preprocessing framework, including near-duplicate detection, is introduced to enhance the quality of the base corpora.
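[NLP-128] Event-Aware Sentiment Factors from LLM-Augmented Financial Tweets: A Transparent Framework for Interpretable Quant Trading ICML2025
Quick read: This paper addresses the difficulty of extracting quantifiable financial signals from unstructured social media text, in particular how to label high-sentiment-intensity, company-related tweets automatically with multi-label event categories and how to evaluate their power to predict future stock returns. The key to the solution is to use a large language model (LLM) for automated multi-label semantic annotation and then to relate the resulting event labels statistically to forward returns over 1-to-7-day horizons, identifying event categories that consistently yield negative alpha. The study shows that LLMs are effective for financial semantic annotation, and it releases code and methodology publicly for transparency and reproducibility, offering an extensible open-source framework for algorithmic trading research.
Link: https://arxiv.org/abs/2508.07408
Authors: Yueyi Wang,Qiyao Wei
Affiliations: Unknown
Subjects: Statistical Finance (q-fin.ST); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages, 12 figures, accepted at ICML 2025 New in ML Workshop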
Abstract:In this study, we wish to showcase the unique utility of large language models (LLMs) in financial semantic annotation and alpha signal discovery. Leveraging a corpus of company-related tweets, we use an LLM to automatically assign multi-label event categories to high-sentiment-intensity tweets. We align these labeled sentiment signals with forward returns over 1-to-7-day horizons to evaluate their statistical efficacy and market tradability. Our experiments reveal that certain event labels consistently yield negative alpha, with Sharpe ratios as low as -0.38 and information coefficients exceeding 0.05, all statistically significant at the 95% confidence level. This study establishes the feasibility of transforming unstructured social media text into structured, multi-label event variables. A key contribution of this work is its commitment to transparency and reproducibility; all code and methodologies are made publicly available. Our results provide compelling evidence that social media sentiment is a valuable, albeit noisy, signal in financial forecasting and underscore the potential of open-source frameworks to democratize algorithmic trading research.
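The two statistics quoted in the abstract, the information coefficient and the Sharpe ratio, can be illustrated as follows. This sketch assumes the information coefficient is a Spearman rank correlation between a signal and forward returns, and that the Sharpe ratio is unannualized with a zero risk-free rate; the abstract does not state the exact definitions used, and all numbers below are made up.

```python
import math
import statistics


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def information_coefficient(signal, forward_returns):
    """Spearman rank correlation between a signal and subsequent returns."""
    return pearson(ranks(signal), ranks(forward_returns))


def sharpe_ratio(returns):
    """Mean return over sample standard deviation (assumed: no annualization)."""
    return statistics.fmean(returns) / statistics.stdev(returns)


# Made-up data: 1 marks days on which a hypothetical event label fired.
event_flag = [1, 0, 1, 0, 1, 0]
next_day_return = [-0.02, 0.01, -0.01, 0.02, -0.03, 0.00]

ic = information_coefficient(event_flag, next_day_return)
event_returns = [r for flag, r in zip(event_flag, next_day_return) if flag]
sharpe = sharpe_ratio(event_returns)
```

In this toy data the event label is followed by losses, so both statistics come out negative, which is the pattern the abstract describes for certain event labels.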
[NLP-129] FlexCTC: GPU-powered CTC Beam Decoding with advanced Contextual Abilities
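Quick read: This paper addresses the inefficiency of conventional beam search decoding in speech recognition, which is slow, often sequential, and CPU-bound. The core solution is FlexCTC, an open-source toolkit for fully GPU-based beam decoding of Connectionist Temporal Classification (CTC) models. The key innovation is a fully batched GPU implementation that eliminates CPU-GPU synchronization and minimizes kernel launch overhead via CUDA Graphs, together with support for advanced contextualization techniques such as GPU-powered N-gram language model fusion and phrase-level boosting. This enables accurate and efficient decoding suitable for both research and production use.
Link: https://arxiv.org/abs/2508.07315
Authors: Lilit Grigoryan,Vladimir Bataev,Nikolay Karpov,Andrei Andrusenko,Vitaly Lavrukhin,Boris Ginsburg
Affiliations: NVIDIA
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted to Automatic Speech Recognition and Understanding Workshop (ASRU) 2025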
Abstract:While beam search improves speech recognition quality over greedy decoding, standard implementations are slow, often sequential, and CPU-bound. To fully leverage modern hardware capabilities, we present a novel open-source FlexCTC toolkit for fully GPU-based beam decoding, designed for Connectionist Temporal Classification (CTC) models. Developed entirely in Python and PyTorch, it offers a fast, user-friendly, and extensible alternative to traditional C++, CUDA, or WFST-based decoders. The toolkit features a high-performance, fully batched GPU implementation with eliminated CPU-GPU synchronization and minimized kernel launch overhead via CUDA Graphs. It also supports advanced contextualization techniques, including GPU-powered N-gram language model fusion and phrase-level boosting. These features enable accurate and efficient decoding, making them suitable for both research and production use.
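To make the contrast between greedy and beam decoding concrete, here is a plain-Python reference sketch of textbook CTC prefix beam search. It is deliberately sequential and CPU-only, so it is not FlexCTC's batched GPU implementation; the function names and the two-frame probability table are assumptions for illustration.

```python
from collections import defaultdict


def ctc_beam_search(probs, beam_size=4, blank=0):
    """Return (labels, probability) of the best label sequence under CTC.

    probs[t][c] is the probability of symbol c at frame t. Each beam entry maps
    a prefix to (p_blank, p_non_blank): the probability mass of alignments of
    that prefix ending in blank and in a non-blank symbol, respectively.
    """
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            total = p_b + p_nb
            for symbol, p in enumerate(frame):
                if symbol == blank:
                    # A blank leaves the prefix unchanged.
                    candidates[prefix][0] += total * p
                elif prefix and symbol == prefix[-1]:
                    # A repeat merges into the prefix unless a blank separates it.
                    candidates[prefix][1] += p_nb * p
                    candidates[prefix + (symbol,)][1] += p_b * p
                else:
                    candidates[prefix + (symbol,)][1] += total * p
        ranked = sorted(candidates.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(score) for prefix, score in ranked[:beam_size]}
    best_prefix, best_score = max(beams.items(), key=lambda kv: sum(kv[1]))
    return list(best_prefix), sum(best_score)


def ctc_greedy(probs, blank=0):
    """Frame-wise argmax, then collapse repeats and drop blanks."""
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    collapsed = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return [s for s in collapsed if s != blank]


# Two frames over the symbols {blank, 1, 2}; values are made up.
frames = [[0.4, 0.35, 0.25], [0.4, 0.35, 0.25]]
greedy_labels = ctc_greedy(frames)
beam_labels, beam_prob = ctc_beam_search(frames)
```

On this input greedy decoding picks blank in both frames and returns an empty sequence, while beam search sums the three alignments of label 1 (0.14 + 0.14 + 0.1225) and prefers it over the empty sequence (0.16). The nested Python loops over prefixes and symbols are the kind of work a batched GPU decoder parallelizes.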
[NLP-130] TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree
Quick read: This paper addresses three limitations of context-biasing in automatic speech recognition (ASR): the need for additional model training, a significant slowdown in decoding, and restrictions on the type of ASR system (CTC, Transducer, or Attention Encoder-Decoder). The key to the solution is a universal ASR context-biasing framework built on a GPU-accelerated word boosting tree, which supports greedy and beam search decoding in shallow fusion mode without noticeable speed degradation even with up to 20K key phrases, and which surpasses the open-source approaches considered in both accuracy and decoding speed.
Link: https://arxiv.org/abs/2508.07014
Authors: Andrei Andrusenko,Vladimir Bataev,Lilit Grigoryan,Vitaly Lavrukhin,Boris Ginsburg
Affiliations: NVIDIA
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted by ASRU 2025
Abstract:Recognizing specific key phrases is an essential task for contextualized Automatic Speech Recognition (ASR). However, most existing context-biasing approaches require additional model training, significantly slow down the decoding process, or constrain the choice of the ASR system type. This paper proposes a universal ASR context-biasing framework that supports all major types: CTC, Transducers, and Attention Encoder-Decoder models. The framework is based on a GPU-accelerated word boosting tree, which enables it to be used in shallow fusion mode for greedy and beam search decoding without noticeable speed degradation, even with a vast number of key phrases (up to 20K items). The obtained results showed high efficiency of the proposed method, surpassing the considered open-source context-biasing approaches in accuracy and decoding speed. Our context-biasing framework is open-sourced as a part of the NeMo toolkit.
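The idea of a boosting tree used in shallow fusion can be sketched as a prefix tree that hands out a bonus for every token that extends a key phrase and takes the bonus back when a partial match fails. This is a simplified CPU illustration, not the NeMo implementation: the class name, the fixed per-token bonus, word-level tokens, and the reset-on-completion rule (which ignores longer phrases sharing a completed prefix) are all assumptions.

```python
END = "<end>"  # marker key; assumed never to occur as a real token


class BoostingTree:
    """Prefix tree over key phrases with a fixed bonus per matched token."""

    def __init__(self, phrases, bonus=1.0):
        self.bonus = bonus
        self.root = {}
        for phrase in phrases:
            node = self.root
            for token in phrase.split():
                node = node.setdefault(token, {})
            node[END] = True

    def start(self):
        # State = (current node, bonus granted so far on an unfinished match).
        return (self.root, 0.0)

    def step(self, state, token):
        """Consume one token; return (new_state, score_delta)."""
        node, pending = state
        delta = 0.0
        if token not in node:
            # Partial match failed: revoke its bonus and restart at the root.
            delta -= pending
            node, pending = self.root, 0.0
        if token in node:
            node = node[token]
            delta += self.bonus
            pending += self.bonus
            if END in node:
                # Phrase completed: the bonus becomes permanent.
                node, pending = self.root, 0.0
        return (node, pending), delta

    def score(self, tokens):
        """Total boost for a finished hypothesis; unfinished matches earn nothing."""
        state, total = self.start(), 0.0
        for token in tokens:
            state, delta = self.step(state, token)
            total += delta
        return total - state[1]


tree = BoostingTree(["new york", "san francisco"], bonus=2.0)
full_match = tree.score("i love new york".split())
failed_match = tree.score("new jersey".split())
restarted_match = tree.score("san new york".split())
dangling_match = tree.score(["san"])
```

In shallow fusion, the `score_delta` returned by `step` would be added to the ASR model's own score for each candidate token during greedy or beam search, so hypotheses that spell out a key phrase rank higher without any retraining.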
Computer Vision
[CV-0] Learning an Implicit Physics Model for Image-based Fluid Simulation ICCV2025
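Quick read: This paper addresses the generation of physically consistent 4D scenes (3D geometry plus dynamics) from a single still image, focusing on the animation of natural fluid imagery. Existing methods mostly rely on simplistic 2D motion estimators, so the generated motion often violates physical principles and looks unrealistic. The key to the solution is a physics-informed neural network that predicts the motion of each surface point under a loss term derived from the Navier-Stokes equations, which keeps the animation physically plausible. In addition, feature-based 3D Gaussians are predicted from the input image and its estimated depth, animated with the predicted motion, and rendered from any desired camera viewpoint.
Link: https://arxiv.org/abs/2508.08254
Authors: Emily Yue-Ting Jia,Jiageng Mao,Zhiyuan Gao,Yajie Zhao,Yue Wang
Affiliations: University of Southern California
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025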
Abstract:Humans possess an exceptional ability to imagine 4D scenes, encompassing both motion and 3D geometry, from a single still image. This ability is rooted in our accumulated observations of similar scenes and an intuitive understanding of physics. In this paper, we aim to replicate this capacity in neural networks, specifically focusing on natural fluid imagery. Existing methods for this task typically employ simplistic 2D motion estimators to animate the image, leading to motion predictions that often defy physical principles, resulting in unrealistic animations. Our approach introduces a novel method for generating 4D scenes with physics-consistent animation from a single image. We propose the use of a physics-informed neural network that predicts motion for each surface point, guided by a loss term derived from fundamental physical principles, including the Navier-Stokes equations. To capture appearance, we predict feature-based 3D Gaussians from the input image and its estimated depth, which are then animated using the predicted motions and rendered from any desired camera perspective. Experimental results highlight the effectiveness of our method in producing physically plausible animations, showcasing significant performance improvements over existing methods. Our project page is this https URL .
[CV-1] ReferSplat: Referring Segmentation in 3D Gaussian Splatting ICML2025
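Quick read: This paper addresses the segmentation of target objects in a 3D Gaussian scene from natural language descriptions (Referring 3D Gaussian Splatting Segmentation, R3DGS), especially descriptions that involve spatial relationships or object attributes. The model must identify the described objects even when they are occluded or not directly visible in a novel view, which poses a significant challenge for 3D multi-modal understanding. The key to the solution is ReferSplat, a framework that explicitly models the association between 3D Gaussian points and natural language expressions in a spatially aware paradigm, and that achieves state-of-the-art performance on both the R3DGS task and 3D open-vocabulary segmentation benchmarks.
Link: https://arxiv.org/abs/2508.08252
Authors: Shuting He,Guangquan Jie,Changshuo Wang,Yun Zhou,Shuming Hu,Guanbin Li,Henghui Ding
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2025 Oral, Code: this https URL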
Abstract:We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that aims to segment target objects in a 3D Gaussian scene based on natural language descriptions, which often contain spatial relationships or object attributes. This task requires the model to identify newly described objects that may be occluded or not directly visible in a novel view, posing a significant challenge for 3D multi-modal understanding. Developing this capability is crucial for advancing embodied AI. To support research in this area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that 3D multi-modal understanding and spatial relationship modeling are key challenges for R3DGS. To address these challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Dataset and code are available at this https URL.
[CV-2] StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation
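Quick read: This paper addresses the poor audio synchronization and weak identity consistency of current diffusion-based audio-driven avatar video generation methods on long videos. Existing methods typically obtain audio embeddings from third-party pretrained extractors and inject them directly into the diffusion model via cross-attention; because the diffusion backbone lacks audio-related priors, latent distribution errors accumulate across video clips, and the latent distribution of later segments gradually drifts away from the optimal one. The key to the solution is a Time-step-aware Audio Adapter that suppresses error accumulation through time-step-aware modulation. An Audio Native Guidance Mechanism further improves audio synchronization at inference by using the diffusion model's own evolving joint audio-latent prediction as a dynamic guidance signal, and a Dynamic Weighted Sliding-window Strategy improves the smoothness of infinite-length videos.
Link: https://arxiv.org/abs/2508.08248
Authors: Shuyuan Tu,Yueming Pan,Yinming Huang,Xintong Han,Zhen Xing,Qi Dai,Chong Luo,Zuxuan Wu,Yu-Gang Jiang
Affiliations: Fudan University; Microsoft Research Asia; Xi’an Jiaotong University; Hunyuan, Tencent Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: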
Abstract:Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main reason preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then directly injected into the diffusion model via cross-attention. Since current diffusion backbones lack any audio-related priors, this approach causes severe latent distribution error accumulation across video clips, leading the latent distribution of subsequent segments to drift away from the optimal distribution gradually. To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism to further enhance the audio synchronization by leveraging the diffusion’s own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of the infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latent over time. Experiments on benchmarks show the effectiveness of StableAvatar both qualitatively and quantitatively.
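Fusing latents across overlapping sliding windows can be illustrated with a generic weighted blend. The sketch below is not StableAvatar's Dynamic Weighted Sliding-window Strategy, whose weighting the abstract does not specify; it assumes scalar per-frame latents and fixed triangular weights purely to show why overlap blending smooths window boundaries.

```python
def fuse_windows(windows, starts, total_frames):
    """Blend overlapping windows of per-frame latents with triangular weights.

    windows[i] holds the latents of one window and starts[i] is the index of
    its first frame. Every frame must be covered by at least one window.
    """
    fused = [0.0] * total_frames
    weight_sum = [0.0] * total_frames
    for window, start in zip(windows, starts):
        n = len(window)
        for i, value in enumerate(window):
            # Assumed weighting: highest mid-window, tapering toward both ends.
            w = min(i + 1, n - i)
            fused[start + i] += w * value
            weight_sum[start + i] += w
    return [f / w for f, w in zip(fused, weight_sum)]


# Two made-up windows of four frames each, overlapping on frames 2 and 3.
windows = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]
fused = fuse_windows(windows, starts=[0, 2], total_frames=6)
```

The overlapping frames move gradually from the first window's value to the second's (1, 1, 5/3, 7/3, 3, 3) instead of jumping at the boundary, which is the smoothing effect a sliding-window fusion aims for.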
[CV-3] Cut2Next: Generating Next Shot via In-Context Tuning
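Quick read: This paper addresses the shortcomings of current multi-shot generation methods in narrative coherence and cinematic editing patterns: existing techniques ensure basic visual consistency but do not model professional editing patterns (e.g., shot/reverse shot, cutaways), so their outputs are visually coherent yet lack narrative depth and true cinematic integrity. The core of the solution is the Next Shot Generation (NSG) task and the Cut2Next framework, which uses a Hierarchical Multi-Prompting strategy built from Relational Prompts and Individual Prompts to guide a Diffusion Transformer (DiT) toward next shots that follow professional editing logic. Context-Aware Condition Injection (CACI) and a Hierarchical Attention Mask (HAM), which introduce no new parameters, integrate these multi-level signals, producing high-quality, narratively expressive shot transitions that maintain cinematic continuity.
Link: https://arxiv.org/abs/2508.08244
Authors: Jingwen He,Hongbo Liu,Jiajun Li,Ziqi Huang,Yu Qiao,Wanli Ouyang,Ziwei Liu
Affiliations: The Chinese University of Hong Kong; Shanghai Artificial Intelligence Laboratory; S-Lab, Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: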
Abstract:Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency, neglecting crucial editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow for compelling storytelling. This yields outputs that may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that critically conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define overall context and inter-shot editing styles. Individual Prompts then specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct RawCuts (large-scale) and CuratedCuts (refined) datasets, both with hierarchical prompts, and introduce CutBench for evaluation. Experiments show Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.
[CV-4] ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks
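Quick read: This paper studies three core challenges of language-guided long-horizon mobile manipulation in embodied semantic reasoning, generalizable manipulation, and adaptive locomotion. First, existing methods are confined to tabletop scenarios and cannot handle the constrained perception and limited actuation range of mobile platforms. Second, manipulation strategies generalize poorly to the diverse object configurations of open-world environments. Third, the dual requirement of high platform maneuverability and precise end-effector control in unstructured settings remains understudied. The key to the solution is ODYSSEY, a unified framework that integrates high-level task planning with low-level whole-body control: a hierarchical planner powered by a vision-language model supports long-horizon instruction decomposition and precise action execution, and a new whole-body control policy achieves robust coordination across challenging terrains. The work also presents the first benchmark for long-horizon mobile manipulation and validates the system through sim-to-real transfer, demonstrating the generalization and robustness of quadruped manipulators in unstructured environments.
Link: https://arxiv.org/abs/2508.08240
Authors: Kaijun Wang,Liqin Lu,Mingyu Liu,Jianuo Jiang,Zeju Li,Bolin Zhang,Wancai Zheng,Xinyi Yu,Hao Chen,Chunhua Shen
Affiliations: University of Science and Technology of China; National University of Singapore; Peking University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: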
Abstract:Language-guided long-horizon mobile manipulation has long been a grand challenge in embodied semantic reasoning, generalizable manipulation, and adaptive locomotion. Three fundamental limitations hinder progress: First, although large language models have improved spatial reasoning and task planning through semantic priors, existing implementations remain confined to tabletop scenarios, failing to address the constrained perception and limited actuation ranges of mobile platforms. Second, current manipulation strategies exhibit insufficient generalization when confronted with the diverse object configurations encountered in open-world environments. Third, while crucial for practical deployment, the dual requirement of maintaining high platform maneuverability alongside precise end-effector control in unstructured settings remains understudied. In this work, we present ODYSSEY, a unified mobile manipulation framework for agile quadruped robots equipped with manipulators, which seamlessly integrates high-level task planning with low-level whole-body control. To address the challenge of egocentric perception in language-conditioned tasks, we introduce a hierarchical planner powered by a vision-language model, enabling long-horizon instruction decomposition and precise action execution. At the control level, our novel whole-body policy achieves robust coordination across challenging terrains. We further present the first benchmark for long-horizon mobile manipulation, evaluating diverse indoor and outdoor scenarios. Through successful sim-to-real transfer, we demonstrate the system’s generalization and robustness in real-world deployments, underscoring the practicality of legged manipulators in unstructured environments. Our work advances the feasibility of generalized robotic assistants capable of complex, dynamic tasks. 
Our project page: this https URL
[CV-5] OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution
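Quick read: This paper addresses the latent distribution mismatch in current one-step Real-World Image Super-Resolution (Real-ISR) methods based on Denoising Diffusion Probabilistic Models (DDPM) and Flow Matching (FM). Existing methods typically inject the Low-Quality (LQ) image latent distribution at the initial timestep, but this distribution differs substantially from the Gaussian noisy latent distribution, which limits the effective use of generative priors. The key observation is that the noisy latent distribution at DDPM/FM mid-timesteps is closer to the LQ image latent distribution. The proposed One Mid-timestep Guidance Real-ISR (OMGSR) framework therefore injects the LQ latent at a pre-computed mid-timestep and adds a Latent Distribution Refinement loss to narrow the gap between the two distributions, together with an Overlap-Chunked LPIPS/GAN loss to eliminate checkerboard artifacts. The method applies generally to DDPM/FM-based generative models and achieves strong quantitative and qualitative results at several resolutions.
Link: https://arxiv.org/abs/2508.08227
Authors: Zhiqiang Wu,Zhaomang Sun,Tong Zhou,Bingtao Fu,Ji Cong,Yitong Dong,Huaqi Zhang,Xuan Tang,Mingsong Chen,Xian Wei
Affiliations: vivo; University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: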
Abstract:Denoising Diffusion Probabilistic Models (DDPM) and Flow Matching (FM) generative models show promising potential for one-step Real-World Image Super-Resolution (Real-ISR). Recent one-step Real-ISR models typically inject a Low-Quality (LQ) image latent distribution at the initial timestep. However, a fundamental gap exists between the LQ image latent distribution and the Gaussian noisy latent distribution, limiting the effective utilization of generative priors. We observe that the noisy latent distribution at DDPM/FM mid-timesteps aligns more closely with the LQ image latent distribution. Based on this insight, we present One Mid-timestep Guidance Real-ISR (OMGSR), a universal framework applicable to DDPM/FM-based generative models. OMGSR injects the LQ image latent distribution at a pre-computed mid-timestep, incorporating the proposed Latent Distribution Refinement loss to alleviate the latent distribution gap. We also design the Overlap-Chunked LPIPS/GAN loss to eliminate checkerboard artifacts in image generation. Within this framework, we instantiate OMGSR for DDPM/FM-based generative models with two variants: OMGSR-S (SD-Turbo) and OMGSR-F (FLUX.1-dev). Experimental results demonstrate that OMGSR-S/F achieves balanced/excellent performance across quantitative and qualitative metrics at 512-resolution. Notably, OMGSR-F establishes overwhelming dominance in all reference metrics. We further train a 1k-resolution OMGSR-F to match the default resolution of FLUX.1-dev, which yields excellent results, especially in the details of the image generation. We also generate 2k-resolution images by the 1k-resolution OMGSR-F using our two-stage Tiled VAE Diffusion.
[CV-6] Learning User Preferences for Image Generation Model
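Quick read: This paper addresses the insufficient modeling of individual aesthetic differences and their dynamics in user preference prediction: existing methods usually rely on general human preferences or static user profiles and struggle to capture personalized, multifaceted taste. The key to the solution is to build on Multimodal Large Language Models and introduce a contrastive preference loss and learnable preference tokens. The loss separates a user's "likes" from "dislikes", while the tokens capture interest representations shared among users, activating group-specific preferences and improving consistency across similar users, which raises the accuracy of personalized preference prediction.
Link: https://arxiv.org/abs/2508.08220
Authors: Wenyi Mo,Ying Ba,Tianyu Zhang,Yalong Bai,Biye Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: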
Abstract:User preference prediction requires a comprehensive and accurate understanding of individual tastes. This includes both surface-level attributes, such as color and style, and deeper content-related aspects, such as themes and composition. However, existing methods typically rely on general human preferences or assume static user profiles, often neglecting individual variability and the dynamic, multifaceted nature of personal taste. To address these limitations, we propose an approach built upon Multimodal Large Language Models, introducing contrastive preference loss and preference tokens to learn personalized user preferences from historical interactions. The contrastive preference loss is designed to effectively distinguish between user "likes" and "dislikes", while the learnable preference tokens capture shared interest representations among existing users, enabling the model to activate group-specific preferences and enhance consistency across similar users. Extensive experiments demonstrate our model outperforms other methods in preference prediction accuracy, effectively identifying users with similar aesthetic inclinations and providing more precise guidance for generating images that align with individual tastes. The project page is this https URL.
[CV-7] SAGOnline: Segment Any Gaussians Online
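Quick read: This paper addresses efficient and consistent 3D segmentation in 3D Gaussian Splatting (3DGS) scenes, where existing methods suffer from high computational cost, limited 3D spatial reasoning, and an inability to track multiple objects simultaneously. The key to the solution is SAGOnline, a lightweight, zero-shot framework for online 3D segmentation with two core innovations: a decoupled strategy that integrates video foundation models (e.g., SAM2) for view-consistent 2D mask propagation across synthesized views, and a GPU-accelerated algorithm for 3D mask generation and Gaussian-level instance labeling that assigns a unique identifier to each 3D primitive, enabling lossless multi-object tracking and segmentation. The method achieves state-of-the-art performance on the NVOS and Spin-NeRF benchmarks (92.7% and 95.2% mIoU, respectively) and runs at 27 ms/frame, 15 to 1500 times faster than Feature3DGS, OmniSeg3D-gs, and SA3D.
Link: https://arxiv.org/abs/2508.08219
Authors: Wentao Sun,Quanyun Wu,Hanqing Xu,Kyle Gao,Zhengsen Xu,Yiping Chen,Dedong Zhang,Lingfei Ma,John S. Zelek,Jonathan Li
Affiliations: University of Waterloo; East China Normal University; University of Calgary; Sun Yat-Sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 10 figures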
Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful paradigm for explicit 3D scene representation, yet achieving efficient and consistent 3D segmentation remains challenging. Current methods suffer from prohibitive computational costs, limited 3D spatial reasoning, and an inability to track multiple objects simultaneously. We present Segment Any Gaussians Online (SAGOnline), a lightweight and zero-shot framework for real-time 3D segmentation in Gaussian scenes that addresses these limitations through two key innovations: (1) a decoupled strategy that integrates video foundation models (e.g., SAM2) for view-consistent 2D mask propagation across synthesized views; and (2) a GPU-accelerated 3D mask generation and Gaussian-level instance labeling algorithm that assigns unique identifiers to 3D primitives, enabling lossless multi-object tracking and segmentation across views. SAGOnline achieves state-of-the-art performance on NVOS (92.7% mIoU) and Spin-NeRF (95.2% mIoU) benchmarks, outperforming Feature3DGS, OmniSeg3D-gs, and SA3D by 15–1500 times in inference speed (27 ms/frame). Qualitative results demonstrate robust multi-object segmentation and tracking in complex scenes. Our contributions include: (i) a lightweight and zero-shot framework for 3D segmentation in Gaussian scenes, (ii) explicit labeling of Gaussian primitives enabling simultaneous segmentation and tracking, and (iii) the effective adaptation of 2D video foundation models to the 3D domain. This work allows real-time rendering and 3D scene understanding, paving the way for practical AR/VR and robotic applications.
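One simple way to turn view-consistent 2D masks into Gaussian-level instance labels is a majority vote over the views in which each Gaussian is visible. The sketch below illustrates that idea only; the abstract does not describe SAGOnline's actual labeling rule, and the data layout (precomputed pixel projections, integer instance ids, id 0 as background) is assumed for the example.

```python
from collections import Counter


def label_gaussians(projections, masks, background=0):
    """Assign each Gaussian the instance id it lands on most often across views.

    projections[v][g] is the (row, col) pixel of Gaussian g in view v, or None
    if the Gaussian is not visible there; masks[v][row][col] is the 2D instance
    id at that pixel.
    """
    num_gaussians = len(projections[0])
    votes = [Counter() for _ in range(num_gaussians)]
    for view_pixels, mask in zip(projections, masks):
        for g, pixel in enumerate(view_pixels):
            if pixel is None:
                continue
            row, col = pixel
            instance = mask[row][col]
            if instance != background:
                votes[g][instance] += 1
    # Gaussians that never hit a foreground mask keep the background label.
    return [v.most_common(1)[0][0] if v else background for v in votes]


# Toy scene: two 2x2 views and three Gaussians (all values made up).
masks = [
    [[1, 1], [0, 2]],
    [[1, 0], [2, 2]],
]
projections = [
    [(0, 0), (1, 1), (1, 0)],
    [(0, 0), (1, 0), None],
]
labels = label_gaussians(projections, masks)
```

Because every Gaussian carries an explicit id after this step, rendering a new view only requires looking up the ids of the visible Gaussians, which is what makes simultaneous segmentation and tracking of several objects possible.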
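[CV-8] Spatial-ORMLLM: Improve Spatial Relation Understanding in the Operating Room with Multimodal Large Language Model
Quick read: This paper addresses the limited accuracy of 3D spatial modeling in the operating room (OR): existing methods depend on multimodal data (e.g., from multiple video and audio sensors) that is difficult to obtain, while 2D data alone fails to capture fine-grained spatial detail in complex scenes. The key to the solution is Spatial-ORMLLM, the first large vision-language model (LVLM) that infers volumetric and semantic cues from the RGB modality alone. A Spatial-Enhanced Feature Fusion Block combines 2D inputs with 3D spatial knowledge extracted by an estimation algorithm, and a unified end-to-end framework integrates visual and textual features, giving robust 3D scene reasoning without additional expert annotations or sensor inputs. The model achieves state-of-the-art performance on multiple benchmark clinical datasets and generalizes well.
Link: https://arxiv.org/abs/2508.08199
Authors: Peiqi He,Zhenhao Zhang,Yixiang Zhang,Xiongjun Zhao,Shaoliang Peng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: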
Abstract:Precise spatial modeling in the operating room (OR) is foundational to many clinical tasks, supporting intraoperative awareness, hazard avoidance, and surgical decision-making. While existing approaches leverage large-scale multimodal datasets for latent-space alignment to implicitly learn spatial relationships, they overlook the 3D capabilities of MLLMs. However, this approach raises two issues: (1) Operating rooms typically lack multiple video and audio sensors, making multimodal 3D data difficult to obtain; (2) Training solely on readily available 2D data fails to capture fine-grained details in complex scenes. To address this gap, we introduce Spatial-ORMLLM, the first large vision-language model for 3D spatial reasoning in operating rooms using only RGB modality to infer volumetric and semantic cues, enabling downstream medical tasks with detailed and holistic spatial context. Spatial-ORMLLM incorporates a Spatial-Enhanced Feature Fusion Block, which integrates 2D modality inputs with rich 3D spatial knowledge extracted by the estimation algorithm and then feeds the combined features into the visual tower. By employing a unified end-to-end MLLM framework, it combines powerful spatial features with textual features to deliver robust 3D scene reasoning without any additional expert annotations or sensor inputs. Experiments on multiple benchmark clinical datasets demonstrate that Spatial-ORMLLM achieves state-of-the-art performance and generalizes robustly to previously unseen surgical scenarios and downstream tasks.
[CV-9] Reinforcement Learning in Vision: A Survey
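Quick read: This survey addresses the complexity of jointly optimizing multimodal perception, decision-making, and action in visual reinforcement learning (visual RL), where the core challenge is to build agents that understand visual scenes and can reason, generate, and act within them. It systematically organizes recent advances in multi-modal large language models, visual generation, unified model frameworks, and vision-language-action models, and examines algorithmic design (such as the evolution of policy optimization from RLHF to verifiable reward paradigms and from Proximal Policy Optimization to Group Relative Policy Optimization), reward engineering (such as unified reward modeling and preference-aligned diffusion), and evaluation protocols (spanning set-level fidelity, sample-level preference, and state-level stability). The result is a coherent, extensible map of the field intended to help visual RL move toward greater sample efficiency, generalization, and safe deployment.
Link: https://arxiv.org/abs/2508.08189
Authors: Weijia Wu,Chen Gao,Joya Chen,Kevin Qinghong Lin,Qingwei Meng,Yiming Zhang,Yuke Qiu,Hong Zhou,Mike Zheng Shou
Affiliations: Show Lab, National University of Singapore; Zhejiang University; The Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages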
Abstract:Recent advances at the intersection of reinforcement learning (RL) and visual intelligence have enabled agents that not only perceive complex visual scenes but also reason, generate, and act within them. This survey offers a critical and up-to-date synthesis of the field. We first formalize visual RL problems and trace the evolution of policy-optimization strategies from RLHF to verifiable reward paradigms, and from Proximal Policy Optimization to Group Relative Policy Optimization. We then organize more than 200 representative works into four thematic pillars: multi-modal large language models, visual generation, unified model frameworks, and vision-language-action models. For each pillar we examine algorithmic design, reward engineering, benchmark progress, and we distill trends such as curriculum-driven training, preference-aligned diffusion, and unified reward modeling. Finally, we review evaluation protocols spanning set-level fidelity, sample-level preference, and state-level stability, and we identify open challenges that include sample efficiency, generalization, and safe deployment. Our goal is to provide researchers and practitioners with a coherent map of the rapidly expanding landscape of visual RL and to highlight promising directions for future inquiry. Resources are available at: this https URL.
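The step from Proximal Policy Optimization to Group Relative Policy Optimization mentioned above comes down to how the advantage is estimated: instead of a learned value function, each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows that normalization only; the use of the population standard deviation and the epsilon term are assumptions, and the rewards are made up.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each response's reward by its group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumed: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses sampled for one prompt, scored by a verifiable 0/1 reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so the policy update needs no separate critic network.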
[CV-10] KARMA: Efficient Structural Defect Segmentation via Kolmogorov-Arnold Representation Learning
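[Quick read]: This paper targets three challenges in semantic segmentation of structural defects in civil infrastructure: highly variable defect appearance, harsh imaging conditions, and severe class imbalance. Existing deep learning methods are effective but typically need millions of parameters, which makes them impractical for real-time inspection systems. The proposed KARMA (Kolmogorov-Arnold Representation Mapping Architecture) models complex defect patterns through compositions of one-dimensional functions rather than conventional convolutions. Its core innovations are: (1) a parameter-efficient Tiny Kolmogorov-Arnold Network (TiKAN) module that uses low-rank factorization for feature transformation; (2) a feature pyramid optimized with separable convolutions for multi-scale defect analysis; and (3) a static-dynamic prototype mechanism that strengthens feature representation for imbalanced classes. Experiments show that KARMA keeps accuracy competitive while cutting parameters by 97% (0.959M vs. 31.04M) and operating at a compute cost of only 0.264 GFLOPS, which makes real-world deployment practical.
Link: https://arxiv.org/abs/2508.08186
Authors: Md Meftahul Ferdaus,Mahdi Abdelguerfi,Elias Ioup,Steven Sloan,Kendall N. Niles,Ken Pathak
Institutions: University of New Orleans; Naval Research Laboratory; US Army Corps of Engineers
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence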
Abstract:Semantic segmentation of structural defects in civil infrastructure remains challenging due to variable defect appearances, harsh imaging conditions, and significant class imbalance. Current deep learning methods, despite their effectiveness, typically require millions of parameters, rendering them impractical for real-time inspection systems. We introduce KARMA (Kolmogorov-Arnold Representation Mapping Architecture), a highly efficient semantic segmentation framework that models complex defect patterns through compositions of one-dimensional functions rather than conventional convolutions. KARMA features three technical innovations: (1) a parameter-efficient Tiny Kolmogorov-Arnold Network (TiKAN) module leveraging low-rank factorization for KAN-based feature transformation; (2) an optimized feature pyramid structure with separable convolutions for multi-scale defect analysis; and (3) a static-dynamic prototype mechanism that enhances feature representation for imbalanced classes. Extensive experiments on benchmark infrastructure inspection datasets demonstrate that KARMA achieves competitive or superior mean IoU performance compared to state-of-the-art approaches, while using significantly fewer parameters (0.959M vs. 31.04M, a 97% reduction). Operating at 0.264 GFLOPS, KARMA maintains inference speeds suitable for real-time deployment, enabling practical automated infrastructure inspection systems without compromising accuracy. The source code can be accessed at the following URL: this https URL.
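The parameter figures quoted above can be checked with a few lines of arithmetic. The sketch below also illustrates, under assumptions, why a low-rank factorization saves parameters: a dense d_in x d_out weight is replaced by two factors of rank r. The layer sizes and the helper names (`dense_params`, `low_rank_params`, `reduction`) are hypothetical and are not taken from the KARMA code.

```python
def dense_params(d_in, d_out):
    """Parameter count of a full d_in x d_out weight matrix."""
    return d_in * d_out


def low_rank_params(d_in, d_out, rank):
    """Parameter count when the matrix is factored as (d_in x rank) @ (rank x d_out)."""
    return rank * (d_in + d_out)


def reduction(small, large):
    """Fraction of parameters saved when going from `large` to `small`."""
    return 1.0 - small / large


# Figures reported in the abstract, in millions of parameters.
karma_reduction = reduction(0.959, 31.04)
print(f"KARMA vs. baseline: {karma_reduction:.1%} fewer parameters")

# Hypothetical layer: 256 x 256 weight replaced by rank-8 factors.
layer_reduction = reduction(low_rank_params(256, 256, 8), dense_params(256, 256))
print(f"Rank-8 factorization of a 256x256 layer: {layer_reduction:.1%} fewer parameters")
```

The first figure comes out at roughly 96.9%, which matches the 97% stated in the abstract after rounding.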
[CV-11] THAT: Token-wise High-frequency Augmentation Transformer for Hyperspectral Pansharpening
[Quick read]: Transformers for hyperspectral pansharpening are limited by redundant token representations and a lack of multi-scale feature modeling. This paper proposes the Token-wise High-frequency Augmentation Transformer (THAT) to address that bottleneck. The core difficulty is that Vision Transformers (ViTs) struggle to preserve high-frequency details such as material edges and texture transitions, and global self-attention tends to disperse attention and weaken local structure. The solution rests on two modules: (1) Pivotal Token Selective Attention (PTSA), which prioritizes informative tokens and suppresses redundancy; and (2) a Multi-level Variance-aware Feed-forward Network (MVFN), which strengthens high-frequency feature learning. Together they improve reconstruction quality and computational efficiency.
Link: https://arxiv.org/abs/2508.08183
Authors: Hongkun Jin,Hongcheng Jiang,Zejun Zhang,Yuan Zhang,Jia Fu,Tingfeng Li,Kai Luo
Institutions: JPMorgan Chase; University of Missouri-Kansas City; University of Southern California; University of Adelaide; KTH Royal Institute of Technology; NEC Laboratories America
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC)
Abstract:Transformer-based methods have demonstrated strong potential in hyperspectral pansharpening by modeling long-range dependencies. However, their effectiveness is often limited by redundant token representations and a lack of multi-scale feature modeling. Hyperspectral images exhibit intrinsic spectral priors (e.g., abundance sparsity) and spatial priors (e.g., non-local similarity), which are critical for accurate reconstruction. From a spectral-spatial perspective, Vision Transformers (ViTs) face two major limitations: they struggle to preserve high-frequency components–such as material edges and texture transitions–and suffer from attention dispersion across redundant tokens. These issues stem from the global self-attention mechanism, which tends to dilute high-frequency signals and overlook localized details. To address these challenges, we propose the Token-wise High-frequency Augmentation Transformer (THAT), a novel framework designed to enhance hyperspectral pansharpening through improved high-frequency feature representation and token selection. Specifically, THAT introduces: (1) Pivotal Token Selective Attention (PTSA) to prioritize informative tokens and suppress redundancy; (2) a Multi-level Variance-aware Feed-forward Network (MVFN) to enhance high-frequency detail learning. Experiments on standard benchmarks show that THAT achieves state-of-the-art performance with improved reconstruction quality and efficiency. The source code is available at this https URL.
[CV-12] RedDino: A foundation model for red blood cell analysis
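[Quick read]: This paper addresses automated, accurate morphological analysis of red blood cells (RBCs) in computational hematology. Although foundation models show promise in medical diagnostics, comprehensive AI solutions for RBC image analysis remain scarce. The proposed solution is RedDino, a foundation model built on an RBC-specific adaptation of the DINOv2 self-supervised learning framework. Trained on 1.25 million RBC images from diverse imaging modalities and sources, it outperforms existing state-of-the-art models on RBC shape classification. Linear probing and nearest-neighbor evaluations confirm its feature quality and cross-domain generalization, supporting the development of reliable, generalizable RBC diagnostic tools.
Link: https://arxiv.org/abs/2508.08180
Authors: Luca Zedda,Andrea Loddo,Cecilia Di Ruberto,Carsten Marr
Institutions: University of Cagliari; Helmholtz Munich
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: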
Abstract:Red blood cells (RBCs) are essential to human health, and their precise morphological analysis is important for diagnosing hematological disorders. Despite the promise of foundation models in medical diagnostics, comprehensive AI solutions for RBC analysis remain scarce. We present RedDino, a self-supervised foundation model designed for RBC image analysis. RedDino uses an RBC-specific adaptation of the DINOv2 self-supervised learning framework and is trained on a curated dataset of 1.25 million RBC images from diverse acquisition modalities and sources. Extensive evaluations show that RedDino outperforms existing state-of-the-art models on RBC shape classification. Through assessments including linear probing and nearest neighbor classification, we confirm its strong feature representations and generalization ability. Our main contributions are: (1) a foundation model tailored for RBC analysis, (2) ablation studies exploring DINOv2 configurations for RBC modeling, and (3) a detailed evaluation of generalization performance. RedDino addresses key challenges in computational hematology by capturing nuanced morphological features, advancing the development of reliable diagnostic tools. The source code and pretrained models for RedDino are available at this https URL, and the pretrained models can be downloaded from our Hugging Face collection at this https URL
[CV-13] PP-Motion: Physical-Perceptual Fidelity Evaluation for Human Motion Generation
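[Quick read]: This paper tackles fidelity evaluation for generated human motion, in particular the gap between physical feasibility and human perception in existing methods. Prior evaluation relies either on human judgment, whose labels are coarse and subjective, or on physical constraints, which do not capture perceived fidelity. The key idea is a physical labeling method that computes the minimum modification needed for a motion to comply with physical laws, producing fine-grained, continuous physical-alignment annotations that serve as objective ground truth. Built on these annotations, the PP-Motion metric is trained with a Pearson's correlation loss to capture physical priors and with a human-perception-based fidelity loss, so that it models physical plausibility and perceptual agreement together.
Link: https://arxiv.org/abs/2508.08179
Authors: Sihan Zhao,Zixuan Wang,Tianyu Luan,Jia Jia,Wentao Zhu,Jiebo Luo,Junsong Yuan,Nan Xi
Institutions: Tsinghua University; BNRist, Tsinghua University Key Laboratory of Pervasive Computing, Ministry of Education; University at Buffalo; Eastern Institute of Technology; University of Rochester
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by ACM Multimedia 2025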
Abstract:Human motion generation has found widespread applications in AR/VR, film, sports, and medical rehabilitation, offering a cost-effective alternative to traditional motion capture systems. However, evaluating the fidelity of such generated motions is a crucial, multifaceted task. Although previous approaches have attempted at motion fidelity evaluation using human perception or physical constraints, there remains an inherent gap between human-perceived fidelity and physical feasibility. Moreover, the subjective and coarse binary labeling of human perception further undermines the development of a robust data-driven metric. We address these issues by introducing a physical labeling method. This method evaluates motion fidelity by calculating the minimum modifications needed for a motion to align with physical laws. With this approach, we are able to produce fine-grained, continuous physical alignment annotations that serve as objective ground truth. With these annotations, we propose PP-Motion, a novel data-driven metric to evaluate both physical and perceptual fidelity of human motion. To effectively capture underlying physical priors, we employ Pearson’s correlation loss for the training of our metric. Additionally, by incorporating a human-based perceptual fidelity loss, our metric can capture fidelity that simultaneously considers both human perception and physical alignment. Experimental results demonstrate that our metric, PP-Motion, not only aligns with physical laws but also aligns better with human perception of motion fidelity than previous work.
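A correlation-based loss of the kind mentioned above can be sketched as follows. The abstract only names "Pearson's correlation loss"; the specific form used here (one minus the Pearson coefficient between predicted scores and physical-alignment labels) is an assumption, and the function names are hypothetical.

```python
import math


def pearson_correlation(pred, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    denom = math.sqrt(var_p * var_t)
    if denom == 0.0:
        # A constant sequence has no defined correlation; treat it as uncorrelated.
        return 0.0
    return cov / denom


def pearson_loss(pred, target):
    """Assumed loss form: 0 when perfectly correlated, 2 when perfectly anti-correlated."""
    return 1.0 - pearson_correlation(pred, target)


# Predicted scores that rank motions in the same order as the labels give a loss near 0.
print(pearson_loss([0.1, 0.4, 0.9], [1.0, 2.0, 3.5]))
```

A loss of this shape rewards a metric for preserving the ordering and relative spacing of the labels without forcing it to match their absolute scale.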
[CV-14] 3D Human Mesh Estimation from Single View RGBD
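[Quick read]: This paper addresses accurate 3D human mesh estimation from a single-view RGBD image, where progress is limited by the lack of large annotated datasets. The proposed method, M³ (Masked Mesh Modeling), uses existing Motion Capture (MoCap) datasets to synthesize many pairs of partial, single-view meshes and complete body meshes, and trains a masked autoencoder to complete the missing geometry. At inference, sensor depth values are matched to the vertices of a template human mesh to form a partial mesh, from which the invisible regions are recovered. On SURREAL and CAPE, M³ outperforms existing methods that take full-body point clouds as input, and on BEHAVE it outperforms a recent RGB-only method.
Link: https://arxiv.org/abs/2508.08178
Authors: Ozhan Suat,Bedirhan Uguz,Batuhan Karagoz,Muhammed Can Keles,Emre Akbas
Institutions: Middle East Technical University; METU ROMER Robotics Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: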
Abstract:Despite significant progress in 3D human mesh estimation from RGB images, RGBD cameras, which offer additional depth data, remain underutilized. In this paper, we present a method for accurate 3D human mesh estimation from a single RGBD view, leveraging the affordability and widespread adoption of RGBD cameras for real-world applications. A fully supervised approach for this problem requires a dataset with RGBD image and 3D mesh label pairs. However, collecting such a dataset is costly and challenging; hence, existing datasets are small and limited in pose and shape diversity. To overcome this data scarcity, we leverage existing Motion Capture (MoCap) datasets. We first obtain complete 3D meshes from the body models found in MoCap datasets, and create partial, single-view versions of them by projection to a virtual camera. This simulates the depth data provided by an RGBD camera from a single viewpoint. Then, we train a masked autoencoder to complete the partial, single-view mesh. During inference, our method, which we name M³ for "Masked Mesh Modeling", matches the depth values coming from the sensor to vertices of a template human mesh, which creates a partial, single-view mesh. We effectively recover parts of the 3D human body mesh model that are not visible, resulting in a full body mesh. M³ achieves 16.8 mm and 22.0 mm per-vertex-error (PVE) on the SURREAL and CAPE datasets, respectively, outperforming existing methods that use full-body point clouds as input. We obtain a competitive 70.9 PVE on the BEHAVE dataset, outperforming a recently published RGB based method by 18.4 mm, highlighting the usefulness of depth data. Code will be released.
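The per-vertex error (PVE) reported above is commonly computed as the mean Euclidean distance between corresponding predicted and ground-truth vertices. The sketch below assumes that definition and assumes both meshes share the same vertex ordering; the function name and the toy coordinates are hypothetical.

```python
import math


def per_vertex_error(pred_vertices, gt_vertices):
    """Mean Euclidean distance between corresponding 3D vertices.

    Both inputs are sequences of (x, y, z) tuples in the same unit (e.g. mm)
    and in the same vertex order.
    """
    if len(pred_vertices) != len(gt_vertices):
        raise ValueError("meshes must have the same number of vertices")
    total = sum(math.dist(p, g) for p, g in zip(pred_vertices, gt_vertices))
    return total / len(pred_vertices)


# Toy example with two vertices, offsets of 3 and 4 units.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
print(per_vertex_error(pred, gt))
```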
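[CV-15] MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision
[Quick read]: This paper addresses accurate grounding of regions of interest (ROIs) in medical imaging, especially for the implicit queries common in clinical practice. Current medical grounding pipelines based on multimodal large language models (MLLMs) still depend on supervised fine-tuning with explicit spatial hints, so they adapt poorly to such queries. The paper defines a new vision-language task, Unified Medical Reasoning Grounding (UMRG), and releases U-MRG-14K, a dataset of 14K samples with pixel-level masks, implicit clinical queries, and reasoning traces. It also introduces MedReasoner, a framework that optimizes an MLLM reasoning module with reinforcement learning while a frozen segmentation expert converts spatial prompts into masks, with the two aligned through format and accuracy rewards. The design stays interpretable and generalizes strongly to unseen clinical queries.
Link: https://arxiv.org/abs/2508.08177
Authors: Zhonghao Yan,Muxi Diao,Yuxuan Yang,Jiayuan Xu,Kaizhou Zhang,Ruoyan Jing,Lele Yang,Yanxi Liu,Kongming Liang,Zhanyu Ma
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 37 pages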
Abstract:Accurately grounding regions of interest (ROIs) is critical for diagnosis and treatment planning in medical imaging. While multimodal large language models (MLLMs) combine visual perception with natural language, current medical-grounding pipelines still rely on supervised fine-tuning with explicit spatial hints, making them ill-equipped to handle the implicit queries common in clinical practice. This work makes three core contributions. We first define Unified Medical Reasoning Grounding (UMRG), a novel vision-language task that demands clinical reasoning and pixel-level grounding. Second, we release U-MRG-14K, a dataset of 14K samples featuring pixel-level masks alongside implicit clinical queries and reasoning traces, spanning 10 modalities, 15 super-categories, and 108 specific categories. Finally, we introduce MedReasoner, a modular framework that distinctly separates reasoning from segmentation: an MLLM reasoner is optimized with reinforcement learning, while a frozen segmentation expert converts spatial prompts into masks, with alignment achieved through format and accuracy rewards. MedReasoner achieves state-of-the-art performance on U-MRG-14K and demonstrates strong generalization to unseen clinical queries, underscoring the significant promise of reinforcement learning for interpretable medical grounding.
[CV-16] CD-TVD: Contrastive Diffusion for 3D Super-Resolution with Scarce High-Resolution Time-Varying Data
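[Quick read]: Generating high-resolution time-varying data (TVD) in large-scale scientific simulations is expensive, and existing super-resolution methods need large amounts of high-resolution training data, which limits their use across diverse simulation scenarios. The proposed CD-TVD framework combines contrastive learning with an improved diffusion-based super-resolution model. A contrastive encoder and a diffusion module are pre-trained on historical data to learn degradation patterns and detailed features of low- and high-resolution samples. The model is then fine-tuned on a single newly generated high-resolution timestep, reusing the learned degradation knowledge to perform accurate 3D super-resolution. This reduces reliance on large high-resolution datasets while retaining the ability to recover fine-grained detail.
Link: https://arxiv.org/abs/2508.08173
Authors: Chongke Bi,Xin Gao,Jiangkang Deng,Guan
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Time-varying data visualization, deep learning, super-resolution, diffusion model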
Abstract:Large-scale scientific simulations require significant resources to generate high-resolution time-varying data (TVD). While super-resolution is an efficient post-processing strategy to reduce costs, existing methods rely on a large amount of HR training data, limiting their applicability to diverse simulation scenarios. To address this constraint, we propose CD-TVD, a novel framework that combines contrastive learning and an improved diffusion-based super-resolution model to achieve accurate 3D super-resolution from limited time-step high-resolution data. During pre-training on historical simulation data, the contrastive encoder and diffusion super-resolution modules learn degradation patterns and detailed features of high-resolution and low-resolution samples. In the training phase, the improved diffusion model with a local attention mechanism is fine-tuned using only one newly generated high-resolution timestep, leveraging the degradation knowledge learned by the encoder. This design minimizes the reliance on large-scale high-resolution datasets while maintaining the capability to recover fine-grained details. Experimental results on fluid and atmospheric simulation datasets confirm that CD-TVD delivers accurate and resource-efficient 3D super-resolution, marking a significant advancement in data augmentation for large-scale scientific simulations. The code is available at this https URL.
[CV-17] ReconDreamer-RL: Enhancing Reinforcement Learning via Diffusion-based Scene Reconstruction
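[Quick read]: This paper addresses the simulation-to-reality (sim2real) gap when training autonomous driving models with reinforcement learning in closed-loop simulation. Simulators based on scene reconstruction are constrained by the training data distribution and struggle to render novel trajectories or rare corner-case scenarios. The proposed ReconDreamer-RL framework has three components: (1) ReconSimulator, which combines a video diffusion prior for appearance modeling with a kinematic model for physical modeling to reconstruct driving scenes from real-world data; (2) the Dynamic Adversary Agent (DAA), which adjusts the trajectories of surrounding vehicles relative to the ego vehicle to generate corner-case traffic scenarios such as cut-ins; and (3) the Cousin Trajectory Generator (CTG), which counters the bias of training data toward simple straight-line motion. Experiments show that the method outperforms imitation learning in end-to-end autonomous driving training, with a 5x reduction in Collision Ratio.
Link: https://arxiv.org/abs/2508.08170
Authors: Chaojun Ni,Guosheng Zhao,Xiaofeng Wang,Zheng Zhu,Wenkang Qin,Xinze Chen,Guanghong Jia,Guan Huang,Wenjun Mei
Institutions: 1. Institute of Automation, Chinese Academy of Sciences; 2. University of Chinese Academy of Sciences; 3. Beijing Institute of Technology; 4. Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: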
Abstract:Reinforcement learning for training end-to-end autonomous driving models in closed-loop simulations is gaining growing attention. However, most simulation environments differ significantly from real-world conditions, creating a substantial simulation-to-reality (sim2real) gap. To bridge this gap, some approaches utilize scene reconstruction techniques to create photorealistic environments as a simulator. While this improves realistic sensor simulation, these methods are inherently constrained by the distribution of the training data, making it difficult to render high-quality sensor data for novel trajectories or corner case scenarios. Therefore, we propose ReconDreamer-RL, a framework designed to integrate video diffusion priors into scene reconstruction to aid reinforcement learning, thereby enhancing end-to-end autonomous driving training. Specifically, in ReconDreamer-RL, we introduce ReconSimulator, which combines the video diffusion prior for appearance modeling and incorporates a kinematic model for physical modeling, thereby reconstructing driving scenarios from real-world data. This narrows the sim2real gap for closed-loop evaluation and reinforcement learning. To cover more corner-case scenarios, we introduce the Dynamic Adversary Agent (DAA), which adjusts the trajectories of surrounding vehicles relative to the ego vehicle, autonomously generating corner-case traffic scenarios (e.g., cut-in). Finally, the Cousin Trajectory Generator (CTG) is proposed to address the issue of training data distribution, which is often biased toward simple straight-line movements. Experiments show that ReconDreamer-RL improves end-to-end autonomous driving training, outperforming imitation learning methods with a 5x reduction in the Collision Ratio.
[CV-18] Integrating Task-Specific and Universal Adapters for Pre-Trained Model-based Class-Incremental Learning ICCV2025
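[Quick read]: In class-incremental learning (CIL), selecting the wrong task-specific module at inference hurts performance, and task-specific modules overlook general knowledge shared across tasks, causing errors when distinguishing similar classes. The proposed framework integrates Task-Specific and Universal Adapters (TUNA). Task-specific adapters are trained to capture the features most relevant to each task, and an entropy-based selection mechanism chooses the most suitable adapter at inference. An adapter fusion strategy builds a universal adapter that encodes discriminative features shared across tasks. At inference, predictions from both kinds of adapter are combined, so the model draws on specialized and general knowledge together.
Link: https://arxiv.org/abs/2508.08165
Authors: Yan Wang,Da-Wei Zhou,Han-Jia Ye
Institutions: Nanjing University; National Key Laboratory for Novel Software Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ICCV 2025. Code is available at: this https URL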
Abstract:Class-Incremental Learning (CIL) requires a learning system to continually learn new classes without forgetting. Existing pre-trained model-based CIL methods often freeze the pre-trained network and adapt to incremental tasks using additional lightweight modules such as adapters. However, incorrect module selection during inference hurts performance, and task-specific modules often overlook shared general knowledge, leading to errors on distinguishing between similar classes across tasks. To address the aforementioned challenges, we propose integrating Task-Specific and Universal Adapters (TUNA) in this paper. Specifically, we train task-specific adapters to capture the most crucial features relevant to their respective tasks and introduce an entropy-based selection mechanism to choose the most suitable adapter. Furthermore, we leverage an adapter fusion strategy to construct a universal adapter, which encodes the most discriminative features shared across tasks. We combine task-specific and universal adapter predictions to harness both specialized and general knowledge during inference. Extensive experiments on various benchmark datasets demonstrate the state-of-the-art performance of our approach. Code is available at: this https URL
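One plausible reading of the entropy-based selection described above is to run the input through every task-specific adapter and keep the one whose prediction is most confident, that is, has the lowest softmax entropy. The sketch below follows that reading; it is an assumption, not the TUNA implementation, and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_adapter(logits_per_adapter):
    """Return the index of the adapter whose prediction has the lowest entropy."""
    entropies = [entropy(softmax(logits)) for logits in logits_per_adapter]
    return min(range(len(entropies)), key=entropies.__getitem__)


# Logits produced by three hypothetical adapters for the same input.
candidates = [
    [0.1, 0.2, 0.1],  # nearly uniform: high entropy
    [5.0, 0.0, 0.0],  # peaked: low entropy
    [1.0, 1.0, 0.9],  # nearly uniform: high entropy
]
print(select_adapter(candidates))
```

In this toy case the second adapter (index 1) is selected because its output distribution is the most peaked.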
[CV-19] Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization
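[Quick read]: This paper addresses classification and localization of deepfake videos, in particular the detection challenges posed by subtle, localized manipulations in the visual domain, the audio domain, or both. It presents methods for identifying and temporally localizing synthetic content. Submitted to the ACM 1M Deepfakes Detection Challenge, the methods achieved the best performance in the temporal localization task and a top-four ranking in the classification task on the TestA split.
Link: https://arxiv.org/abs/2508.08141
Authors: Nicholas Klein,Hemlata Tak,James Fullwood,Krishna Regmi,Leonidas Spinoulas,Ganesh Sivaraman,Tianxiang Chen,Elie Khoury
Institutions: Pindrop
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: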
Abstract:The field of visual and audio generation is burgeoning with new state-of-the-art methods. This rapid proliferation of new techniques underscores the need for robust solutions for detecting synthetic content in videos. In particular, when fine-grained alterations via localized manipulations are performed in visual, audio, or both domains, these subtle modifications add challenges to the detection algorithms. This paper presents solutions for the problems of deepfake video classification and localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top four ranking in the classification task for the TestA split of the evaluation dataset.
[CV-20] FantasyStyle: Controllable Stylized Distillation for 3D Gaussian Splatting
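[Quick read]: This paper addresses two challenges in style transfer based on 3D Gaussian Splatting (3DGS): multi-view inconsistency, which causes style conflicts and in turn appearance smoothing and distortion; and heavy reliance on VGG features, which struggle to disentangle style from content and often cause content leakage and over-stylization. The proposed FantasyStyle framework has two components. (1) Multi-View Frequency Consistency applies a 3D filter to multi-view noisy latents and selectively reduces low-frequency components to mitigate conflicts between stylized priors. (2) Controllable Stylized Distillation introduces negative guidance to suppress content leakage from style images; it also identifies limitations of Score Distillation Sampling and Delta Denoising Score in 3D style transfer and removes the reconstruction term, giving a more effective distillation strategy for optimizing the 3D Gaussians.
Link: https://arxiv.org/abs/2508.08136
Authors: Yitong Yang,Yinglin Wang,Changshuo Wang,Huajie Wang,Shuting He
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: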
Abstract:The success of 3DGS in generative and editing applications has sparked growing interest in 3DGS-based style transfer. However, current methods still face two major challenges: (1) multi-view inconsistency often leads to style conflicts, resulting in appearance smoothing and distortion; and (2) heavy reliance on VGG features, which struggle to disentangle style and content from style images, often causing content leakage and excessive stylization. To tackle these issues, we introduce FantasyStyle, a 3DGS-based style transfer framework, and the first to rely entirely on diffusion model distillation. It comprises two key components: (1) Multi-View Frequency Consistency. We enhance cross-view consistency by applying a 3D filter to multi-view noisy latent, selectively reducing low-frequency components to mitigate stylized prior conflicts. (2) Controllable Stylized Distillation. To suppress content leakage from style images, we introduce negative guidance to exclude undesired content. In addition, we identify the limitations of Score Distillation Sampling and Delta Denoising Score in 3D style transfer and remove the reconstruction term accordingly. Building on these insights, we propose a controllable stylized distillation that leverages negative guidance to more effectively optimize the 3D Gaussians. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, achieving higher stylization quality and visual realism across various scenes and styles.
[CV-21] Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control
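[Quick read]: Current flow-based image editing models have two problems with large-scale shape transformations: they often fail to achieve the intended shape change, and they tend to alter non-target regions, degrading background quality. The proposed Follow-Your-Shape framework is training-free and mask-free. It computes a Trajectory Divergence Map (TDM) from token-wise velocity differences between the inversion and denoising paths to localize editable regions precisely. The TDM then guides a Scheduled KV Injection mechanism, which enables stable, faithful shape editing while strictly preserving non-target content.
Link: https://arxiv.org/abs/2508.08134
Authors: Zeqian Long,Mingzhe Zheng,Kunyu Feng,Xinhua Zhang,Hongyu Liu,Harry Yang,Linfeng Zhang,Qifeng Chen,Yue Ma
Institutions: HKUST; University of Illinois at Urbana-Champaign; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project webpage is available at this https URL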
Abstract:While recent flow-based image editing models demonstrate general-purpose capabilities across diverse tasks, they often struggle to specialize in challenging scenarios – particularly those involving large-scale shape transformations. When performing such structural edits, these methods either fail to achieve the intended shape change or inadvertently alter non-target regions, resulting in degraded background quality. We propose Follow-Your-Shape, a training-free and mask-free framework that supports precise and controllable editing of object shapes while strictly preserving non-target content. Motivated by the divergence between inversion and editing trajectories, we compute a Trajectory Divergence Map (TDM) by comparing token-wise velocity differences between the inversion and denoising paths. The TDM enables precise localization of editable regions and guides a Scheduled KV Injection mechanism that ensures stable and faithful editing. To facilitate a rigorous evaluation, we introduce ReShapeBench, a new benchmark comprising 120 new images and enriched prompt pairs specifically curated for shape-aware editing. Experiments demonstrate that our method achieves superior editability and visual fidelity, particularly in tasks requiring large-scale shape replacement.
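The abstract describes the Trajectory Divergence Map only as a comparison of token-wise velocity differences between the inversion and denoising paths. A minimal sketch of that idea, assuming the divergence is the per-token L2 distance between the two velocity vectors, min-max normalized and then thresholded into an editable-region mask, could look like this. The normalization, the threshold, and all names are assumptions rather than the authors' method.

```python
import math


def trajectory_divergence_map(inversion_velocities, denoising_velocities):
    """Per-token divergence scores scaled to [0, 1].

    Each argument is a list with one velocity vector (a tuple of floats) per token.
    """
    raw = [math.dist(v_inv, v_den)
           for v_inv, v_den in zip(inversion_velocities, denoising_velocities)]
    low, high = min(raw), max(raw)
    if high == low:
        # No token diverges more than any other.
        return [0.0] * len(raw)
    return [(r - low) / (high - low) for r in raw]


def editable_mask(tdm, threshold=0.5):
    """Mark tokens whose divergence reaches the (assumed) threshold as editable."""
    return [score >= threshold for score in tdm]


# Three tokens with 2D velocities; only the last one changes between the two paths.
v_inversion = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
v_denoising = [(0.0, 0.0), (1.0, 1.0), (5.0, 6.0)]
tdm = trajectory_divergence_map(v_inversion, v_denoising)
print(tdm, editable_mask(tdm))
```

Tokens flagged as editable would be the ones the edit is allowed to change, while the rest would keep their original content.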
[CV-22] A Physics-Driven Neural Network with Parameter Embedding for Generating Quantitative MR Maps from Weighted Images
[Quick read]: This paper addresses the limited accuracy and generalizability of synthesizing quantitative MRI (qMRI) maps from clinical weighted MRI. The proposed physics-driven neural network embeds the MRI sequence parameters, namely repetition time (TR), echo time (TE), and inversion time (TI), directly into the model through parameter embedding, so the network can learn the physics of MR signal formation. This improves the accuracy and robustness of the synthesized T1, T2, and proton density (PD) maps, including on previously unseen brain structures and lesions, with PSNR above 34 dB and SSIM above 0.92, and indicates potential for accelerating qMRI and improving its clinical utility.
Link: https://arxiv.org/abs/2508.08123
Authors: Lingjing Chen (1, 2),Chengxiu Zhang (1, 2),Yinqiao Yi (1, 2),Yida Wang (1, 2),Yang Song (3),Xu Yan (3),Shengfang Xu (4),Dalin Zhu (4),Mengqiu Cao (3),Yan Zhou (5),Chenglong Wang (1, 2),Guang Yang (1, 2)
Institutions: (1) Shanghai Key Laboratory of Magnetic Resonance, School of Physics and Electronic Science, East China Normal University, Shanghai, China; (2) Institute of Magnetic Resonance and Molecular Imaging in Medicine, East China Normal University, Shanghai, China; (3) MR Research Collaboration Team, Siemens Healthineers, Shanghai, China; (4) Department of Radiology, Gansu Provincial Maternity and Child-care Hospital, Lanzhou, China; (5) Department of Radiology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We propose a deep learning-based approach that integrates MRI sequence parameters to improve the accuracy and generalizability of quantitative image synthesis from clinical weighted MRI. Our physics-driven neural network embeds MRI sequence parameters – repetition time (TR), echo time (TE), and inversion time (TI) – directly into the model via parameter embedding, enabling the network to learn the underlying physical principles of MRI signal formation. The model takes conventional T1-weighted, T2-weighted, and T2-FLAIR images as input and synthesizes T1, T2, and proton density (PD) quantitative maps. Trained on healthy brain MR images, it was evaluated on both internal and external test datasets. The proposed method achieved high performance with PSNR values exceeding 34 dB and SSIM values above 0.92 for all synthesized parameter maps. It outperformed conventional deep learning models in accuracy and robustness, including data with previously unseen brain structures and lesions. Notably, our model accurately synthesized quantitative maps for these unseen pathological regions, highlighting its superior generalization capability. Incorporating MRI sequence parameters via parameter embedding allows the neural network to better learn the physical characteristics of MR signals, significantly enhancing the performance and reliability of quantitative MRI synthesis. This method shows great potential for accelerating qMRI and improving its clinical utility.
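To see why TR, TE, and TI carry information worth embedding, it helps to recall how they enter a weighted signal. The sketch below uses the standard textbook approximations for spin-echo and inversion-recovery magnitude signals. These formulas are not given in the abstract and are not the authors' network; the function name and the tissue values (in ms) are illustrative assumptions.

```python
import math


def weighted_signal(pd, t1, t2, tr, te, ti=None):
    """Approximate magnitude signal of a weighted MR image.

    pd: proton density (arbitrary units); t1, t2, tr, te, ti: times in ms.
    Without ti, uses the spin-echo form PD * (1 - exp(-TR/T1)) * exp(-TE/T2).
    With ti, uses the inversion-recovery form with finite TR.
    """
    if ti is None:
        recovery = 1.0 - math.exp(-tr / t1)
    else:
        recovery = 1.0 - 2.0 * math.exp(-ti / t1) + math.exp(-tr / t1)
    return abs(pd * recovery * math.exp(-te / t2))


# Hypothetical tissue: T1 = 1000 ms, T2 = 100 ms, PD = 1.0.
t1_weighted = weighted_signal(1.0, 1000.0, 100.0, tr=500.0, te=15.0)
t2_weighted = weighted_signal(1.0, 1000.0, 100.0, tr=4000.0, te=100.0)
print(t1_weighted, t2_weighted)
```

Because the same tissue produces different intensities under different sequence settings, a model that sees TR, TE, and TI alongside the images has what it needs to separate tissue properties from acquisition effects.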
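[CV-23] Vision-Based Localization and LLM-based Navigation for Indoor Environments
[Quick read]: This paper addresses indoor localization and navigation in enclosed environments that lack reliable GPS signals and have complex architecture. The approach combines vision-based localization with navigation driven by a large language model (LLM). A ResNet-50 convolutional neural network, fine-tuned in two stages, identifies the user's position from smartphone camera input with 96% accuracy. An LLM then interprets preprocessed floor plan images and generates step-by-step directions, which allows navigation without additional infrastructure.
Link: https://arxiv.org/abs/2508.08120
Authors: Keyan Rahimi,Md. Wasiul Haque,Sagar Dasgupta,Mizanur Rahman
Institutions: Brown University; The University of Alabama
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 6 figures, 1 table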
Abstract:Indoor navigation remains a complex challenge due to the absence of reliable GPS signals and the architectural intricacies of large enclosed environments. This study presents an indoor localization and navigation approach that integrates vision-based localization with large language model (LLM)-based navigation. The localization system utilizes a ResNet-50 convolutional neural network fine-tuned through a two-stage process to identify the user’s position using smartphone camera input. To complement localization, the navigation module employs an LLM, guided by a carefully crafted system prompt, to interpret preprocessed floor plan images and generate step-by-step directions. Experimental evaluation was conducted in a realistic office corridor with repetitive features and limited visibility to test localization robustness. The model achieved high confidence and an accuracy of 96% across all tested waypoints, even under constrained viewing conditions and short-duration queries. Navigation tests using ChatGPT on real building floor maps yielded an average instruction accuracy of 75%, with observed limitations in zero-shot reasoning and inference time. This research demonstrates the potential for scalable, infrastructure-free indoor navigation using off-the-shelf cameras and publicly available floor plans, particularly in resource-constrained settings like hospitals, airports, and educational institutions.
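[CV-24] GRASPTrack: Geometry-Reasoned Association via Segmentation and Projection for Multi-Object Tracking
[Quick read]: Multi-object tracking (MOT) in monocular video degrades under occlusion and depth ambiguity, which conventional tracking-by-detection (TBD) methods handle poorly because they lack geometric awareness. The proposed GRASPTrack is a depth-aware MOT framework that integrates monocular depth estimation and instance segmentation into a standard TBD pipeline, producing high-fidelity 3D point clouds for explicit 3D geometric reasoning. The point clouds are voxelized to compute a precise, robust Voxel-Based 3D Intersection-over-Union (IoU) for spatial association. Depth-aware Adaptive Noise Compensation dynamically adjusts the Kalman filter process noise according to occlusion severity for more reliable state estimation, and Depth-enhanced Observation-Centric Momentum extends motion-direction consistency from the image plane into 3D space, strengthening motion-based association for complex trajectories.
Link: https://arxiv.org/abs/2508.08117
Authors: Xudong Han,Pengcheng Fang,Yueying Tian,Jianhui Yu,Xiaohao Cai,Daniel Roggen,Philip Birch
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: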
Abstract:Multi-object tracking (MOT) in monocular videos is fundamentally challenged by occlusions and depth ambiguity, issues that conventional tracking-by-detection (TBD) methods struggle to resolve owing to a lack of geometric awareness. To address these limitations, we introduce GRASPTrack, a novel depth-aware MOT framework that integrates monocular depth estimation and instance segmentation into a standard TBD pipeline to generate high-fidelity 3D point clouds from 2D detections, thereby enabling explicit 3D geometric reasoning. These 3D point clouds are then voxelized to enable a precise and robust Voxel-Based 3D Intersection-over-Union (IoU) for spatial association. To further enhance tracking robustness, our approach incorporates Depth-aware Adaptive Noise Compensation, which dynamically adjusts the Kalman filter process noise based on occlusion severity for more reliable state estimation. Additionally, we propose a Depth-enhanced Observation-Centric Momentum, which extends the motion direction consistency from the image plane into 3D space to improve motion-based association cues, particularly for objects with complex trajectories. Extensive experiments on the MOT17, MOT20, and DanceTrack benchmarks demonstrate that our method achieves competitive performance, significantly improving tracking robustness in complex scenes with frequent occlusions and intricate motion patterns.
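A voxel-based 3D IoU of the kind described above can be sketched by snapping each point cloud to a grid and comparing the sets of occupied voxels. The abstract does not give the voxel size or any further detail, so the grid resolution and function names below are assumptions.

```python
import math


def voxelize(points, voxel_size):
    """Map 3D points to the set of integer voxel indices they occupy."""
    return {tuple(math.floor(c / voxel_size) for c in point) for point in points}


def voxel_iou(points_a, points_b, voxel_size=0.1):
    """Intersection-over-union of the voxels occupied by two point clouds."""
    voxels_a = voxelize(points_a, voxel_size)
    voxels_b = voxelize(points_b, voxel_size)
    union = voxels_a | voxels_b
    if not union:
        return 0.0
    return len(voxels_a & voxels_b) / len(union)


# Two toy clouds that share one of three occupied voxels along the x axis.
cloud_a = [(0.05, 0.05, 0.05), (0.15, 0.05, 0.05)]
cloud_b = [(0.15, 0.05, 0.05), (0.25, 0.05, 0.05)]
print(voxel_iou(cloud_a, cloud_b))
```

In a tracking pipeline, a score like this could fill the association cost matrix between existing tracks and new detections in place of a 2D box IoU.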
[CV-25] Hyperspectral Imaging
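[Quick Read]: This paper addresses systemic challenges facing hyperspectral imaging (HSI) in theory, technology, and application, including hardware design trade-offs, variability in data acquisition, the complexity of processing high-dimensional data, and insufficient standardization of methods across fields. The key to its approach is an integrated framework spanning the full pipeline from physical principles to data processing: sensor architecture optimization, standardized acquisition and correction procedures, and both classical and AI-driven analysis methods (such as deep learning, spectral unmixing, and dimensionality reduction). It highlights emerging strategies, including computational imaging, physics-informed modeling, multimodal fusion, and self-supervised learning, for improving performance and scalability. The paper also advocates open dataset sharing, metadata standards, and reproducibility practices to promote transparency and cross-disciplinary collaboration, laying the groundwork for scalable, real-time, embedded HSI systems.
Link: https://arxiv.org/abs/2508.08107
Authors: Danfeng Hong,Chenyu Li,Naoto Yokoya,Bing Zhang,Xiuping Jia,Antonio Plaza,Paolo Gamba,Jon Atli Benediktsson,Jocelyn Chanussot
Affiliations: Southeast University; University of Tokyo; Chinese Academy of Sciences; University of New South Wales; University of Extremadura; University of Pavia; University of Iceland; Grenoble INP
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: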
Abstract:Hyperspectral imaging (HSI) is an advanced sensing modality that simultaneously captures spatial and spectral information, enabling non-invasive, label-free analysis of material, chemical, and biological properties. This Primer presents a comprehensive overview of HSI, from the underlying physical principles and sensor architectures to key steps in data acquisition, calibration, and correction. We summarize common data structures and highlight classical and modern analysis methods, including dimensionality reduction, classification, spectral unmixing, and AI-driven techniques such as deep learning. Representative applications across Earth observation, precision agriculture, biomedicine, industrial inspection, cultural heritage, and security are also discussed, emphasizing HSI’s ability to uncover sub-visual features for advanced monitoring, diagnostics, and decision-making. Persistent challenges, such as hardware trade-offs, acquisition variability, and the complexity of high-dimensional data, are examined alongside emerging solutions, including computational imaging, physics-informed modeling, cross-modal fusion, and self-supervised learning. Best practices for dataset sharing, reproducibility, and metadata documentation are further highlighted to support transparency and reuse. Looking ahead, we explore future directions toward scalable, real-time, and embedded HSI systems, driven by sensor miniaturization, self-supervised learning, and foundation models. As HSI evolves into a general-purpose, cross-disciplinary platform, it holds promise for transformative applications in science, technology, and society.
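[CV-26] TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning
[Quick Read]: This paper addresses two core problems in existing diffusion-based unified models for multimodal understanding and generation. First, using only the final hidden state of the Multimodal Large Language Model (MLLM) as the generative condition leaves the generator weakly connected to the rich, hierarchical representations in the MLLM's intermediate layers. Second, pretraining a unified generative architecture from scratch is too computationally expensive for wide adoption. The key to the solution is the TBAC-UniImage framework, which treats a pre-trained diffusion model as a generative ladder and conditions it on intermediate representations from multiple layers of the MLLM, achieving a deep, fine-grained fusion of understanding and generation.
Link: https://arxiv.org/abs/2508.08098
Authors: Junzhe Xu,Yuyang Yin,Xi Chen
Affiliations: Tencent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: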
Abstract:This paper introduces TBAC-UniImage, a novel unified model for multimodal understanding and generation. We achieve this by deeply integrating a pre-trained Diffusion Model, acting as a generative ladder, with a Multimodal Large Language Model (MLLM). Previous diffusion-based unified models face two primary limitations. One approach uses only the MLLM’s final hidden state as the generative condition. This creates a shallow connection, as the generator is isolated from the rich, hierarchical representations within the MLLM’s intermediate layers. The other approach, pretraining a unified generative architecture from scratch, is computationally expensive and prohibitive for many researchers. To overcome these issues, our work explores a new paradigm. Instead of relying on a single output, we use representations from multiple, diverse layers of the MLLM as generative conditions for the diffusion model. This method treats the pre-trained generator as a ladder, receiving guidance from various depths of the MLLM’s understanding process. Consequently, TBAC-UniImage achieves a much deeper and more fine-grained unification of understanding and generation.
[CV-27] 3D Plant Root Skeleton Detection and Extraction
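[Quick Read]: This paper addresses the difficulty of accurately capturing and modeling the 3D structure of plant root systems. Roots are structurally complex and lack sufficient texture and color information, which limits conventional visual methods for identifying and tracking root traits. The key solution is an efficient 3D root skeleton extraction method that detects and matches lateral roots, extracts their skeletal structure by triangulation, and integrates the primary and lateral roots, reconstructing the complete 3D root architecture from a few images. The method is validated on a highly complex root dataset, where the extracted skeletons closely match the ground truth. It is intended to give automated breeding robots accurate root structure analysis, improving the efficiency of screening crops for genetic traits.
Link: https://arxiv.org/abs/2508.08094
Authors: Jiakai Lin,Jinchang Zhang,Ge Jin,Wenzhan Song,Tianming Liu,Guoyu Lu
Affiliations: SUNY Binghamton; Yancheng Institute of Technology; University of Georgia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: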
Abstract:Plant roots typically exhibit a highly complex and dense architecture, incorporating numerous slender lateral roots and branches, which significantly hinders the precise capture and modeling of the entire root system. Additionally, roots often lack sufficient texture and color information, making it difficult to identify and track root traits using visual methods. Previous research on roots has been largely confined to 2D studies; however, exploring the 3D architecture of roots is crucial in botany. Since roots grow in real 3D space, 3D phenotypic information is more critical for studying genetic traits and their impact on root development. We have introduced a 3D root skeleton extraction method that efficiently derives the 3D architecture of plant roots from a few images. This method includes the detection and matching of lateral roots, triangulation to extract the skeletal structure of lateral roots, and the integration of lateral and primary roots. We developed a highly complex root dataset and tested our method on it. The extracted 3D root skeletons showed considerable similarity to the ground truth, validating the effectiveness of the model. This method can play a significant role in automated breeding robots. Through precise 3D root structure analysis, breeding robots can better identify plant phenotypic traits, especially root structure and growth patterns, helping practitioners select seeds with superior root systems. This automated approach not only improves breeding efficiency but also reduces manual intervention, making the breeding process more intelligent and efficient, thus advancing modern agriculture.
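The triangulation step can be sketched for one matched lateral-root point seen from two views. The abstract does not say which triangulation variant is used; the midpoint method below is a swapped-in textbook technique, and the camera centers and ray directions are hypothetical inputs assumed to come from calibration and matching.

```python
def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2, eps=1e-12):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2."""
    w0 = tuple(p - q for p, q in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < eps:
        raise ValueError("rays are (nearly) parallel; cannot triangulate")
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = tuple(p + s * q for p, q in zip(c1, d1))
    p2 = tuple(p + t * q for p, q in zip(c2, d2))
    return tuple((u + v) / 2 for u, v in zip(p1, p2))


# Two hypothetical cameras one unit either side of the origin,
# both looking at the same root point.
point = triangulate_midpoint((-1, 0, 0), (1, 0, 5), (1, 0, 0), (-1, 0, 5))
print(point)
```

Repeating this for each matched point along a lateral root would give a 3D polyline, which is one plausible way to read "skeletal structure" here.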
[CV-28] MDD-Net: Multimodal Depression Detection through Mutual Transformer
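[Quick Read]: This paper addresses how to fuse multimodal information effectively to improve the accuracy of depression detection. The core challenge is that acoustic and visual features extracted from social media are related in complex ways that traditional methods cannot fully exploit or integrate. The key to the solution is a Multimodal Depression Detection Network (MDD-Net) with a mutual transformer module that computes correlations among the features generated by the different modalities and fuses them efficiently, strengthening the model's ability to discriminate depressive states. Experiments on the multimodal D-Vlog dataset show an F1-score improvement of up to 17.37% over the previous state of the art.
Link: https://arxiv.org/abs/2508.08093
Authors: Md Rezwanul Haque,Md. Milon Islam,S M Taslim Uddin Raju,Hamdi Altaheri,Lobna Nassar,Fakhri Karray
Affiliations: University of Waterloo; American University of Ras Al Khaimah; Mohamed bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for the 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Vienna, Austria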
Abstract:Depression is a major mental health condition that severely impacts the emotional and physical well-being of individuals. The simple nature of data collection from social media platforms has attracted significant interest in properly utilizing this information for mental health research. A Multimodal Depression Detection Network (MDD-Net), utilizing acoustic and visual data obtained from social media networks, is proposed in this work where mutual transformers are exploited to efficiently extract and fuse multimodal features for efficient depression detection. The MDD-Net consists of four core modules: an acoustic feature extraction module for retrieving relevant acoustic attributes, a visual feature extraction module for extracting significant high-level patterns, a mutual transformer for computing the correlations among the generated features and fusing these features from multiple modalities, and a detection layer for detecting depression using the fused feature representations. The extensive experiments are performed using the multimodal D-Vlog dataset, and the findings reveal that the developed multimodal depression detection network surpasses the state-of-the-art by up to 17.37% for F1-Score, demonstrating the greater performance of the proposed system. The source code is accessible at this https URL.
[CV-29] Matrix-3D: Omnidirectional Explorable 3D World Generation
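[Quick Read]: This paper addresses the generation of wide-coverage, explorable 3D worlds from a single image or text prompt, where existing methods are limited in the scope of the scenes they produce. The key to the solution is the Matrix-3D framework, which combines conditional video generation with panoramic 3D reconstruction. It first trains a trajectory-guided panoramic video diffusion model conditioned on scene mesh renders to generate high-quality, geometrically consistent scene videos. It then lifts the panoramic video to a full 3D world using one of two separate methods: a fast feed-forward large panorama reconstruction model, or an optimization-based pipeline for accurate, detailed reconstruction. The authors also build Matrix-Pano, the first large-scale synthetic dataset of its kind, containing 116K high-quality static panoramic video sequences with depth and trajectory annotations, to support training.
Link: https://arxiv.org/abs/2508.08086
Authors: Zhongqi Yang,Wenhang Ge,Yuqi Li,Jiaqi Chen,Haoyuan Li,Mengyin An,Fei Kang,Hua Xue,Baixin Xu,Yuyang Yin,Eric Li,Yang Liu,Yikai Wang,Hao-Xiang Guo,Yahui Zhou
Affiliations: Skywork AI; Hong Kong University of Science and Technology (Guangzhou); Institute of Computing Technology, Chinese Academy of Sciences; School of Artificial Intelligence, Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Technical Report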
Abstract:Explorable 3D world generation from a single image or text prompt forms a cornerstone of spatial intelligence. Recent works utilize video models to achieve wide-scope and generalizable 3D world generation. However, existing approaches often suffer from a limited scope in the generated scenes. In this work, we propose Matrix-3D, a framework that utilizes a panoramic representation for wide-coverage omnidirectional explorable 3D world generation and combines conditional video generation with panoramic 3D reconstruction. We first train a trajectory-guided panoramic video diffusion model that employs scene mesh renders as the condition, to enable high-quality and geometrically consistent scene video generation. To lift the panorama scene video to a 3D world, we propose two separate methods: (1) a feed-forward large panorama reconstruction model for rapid 3D scene reconstruction and (2) an optimization-based pipeline for accurate and detailed 3D scene reconstruction. To facilitate effective training, we also introduce the Matrix-Pano dataset, the first large-scale synthetic collection comprising 116K high-quality static panoramic video sequences with depth and trajectory annotations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance in panoramic video generation and 3D world generation. See more in this https URL.
[CV-30] ME-TST: Micro-expression Analysis via Temporal State Transition with ROI Relationship Awareness
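[Quick Read]: This paper addresses two key problems in micro-expression (ME) analysis, which involves spotting and recognizing MEs within long video sequences. First, conventional sliding-window deep learning methods use fixed window lengths and hard classification, which limits their ability to model MEs of different durations. Second, they treat ME spotting and recognition as independent tasks and ignore the intrinsic relationship between them. The key to the solution is two state space model (SSM) based architectures, ME-TST and ME-TST+, which replace window-level classification with video-level regression through a temporal state transition mechanism. This characterizes the temporal dynamics of MEs more precisely and supports variable-length MEs. ME-TST+ further adopts multi-granularity region-of-interest (ROI) modeling and a SlowFast Mamba framework to alleviate information loss in temporal modeling, and introduces a feature-level and result-level synergy strategy that exploits the connection between spotting and recognition to improve overall performance.
Link: https://arxiv.org/abs/2508.08082
Authors: Zizheng Guo,Bochao Zou,Junbao Zhuo,Huimin Ma
Affiliations: University of Science and Technology Beijing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: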
Abstract:Micro-expressions (MEs) are regarded as important indicators of an individual’s intrinsic emotions, preferences, and tendencies. ME analysis requires spotting of ME intervals within long video sequences and recognition of their corresponding emotional categories. Previous deep learning approaches commonly employ sliding-window classification networks. However, the use of fixed window lengths and hard classification presents notable limitations in practice. Furthermore, these methods typically treat ME spotting and recognition as two separate tasks, overlooking the essential relationship between them. To address these challenges, this paper proposes two state space model-based architectures, namely ME-TST and ME-TST+, which utilize temporal state transition mechanisms to replace conventional window-level classification with video-level regression. This enables a more precise characterization of the temporal dynamics of MEs and supports the modeling of MEs with varying durations. In ME-TST+, we further introduce multi-granularity ROI modeling and the slowfast Mamba framework to alleviate information loss associated with treating ME analysis as a time-series task. Additionally, we propose a synergy strategy for spotting and recognition at both the feature and result levels, leveraging their intrinsic connection to enhance overall analysis performance. Extensive experiments demonstrate that the proposed methods achieve state-of-the-art performance. The codes are available at this https URL.
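To see why frame-level regression output accommodates variable-length MEs where fixed windows do not, consider a minimal decoding sketch. It assumes the model emits one score per frame; the thresholding rule, the `min_len` filter, and the sample scores are illustrative assumptions, since the abstract does not describe how intervals are decoded.

```python
def intervals_from_scores(scores, threshold=0.5, min_len=2):
    """Group consecutive frames with score >= threshold into (start, end) intervals."""
    intervals = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # an interval opens at this frame
        elif score < threshold and start is not None:
            if i - start >= min_len:
                intervals.append((start, i - 1))
            start = None  # the interval closes, kept only if long enough
    # Handle an interval that runs to the last frame.
    if start is not None and len(scores) - start >= min_len:
        intervals.append((start, len(scores) - 1))
    return intervals


# Hypothetical per-frame scores containing two MEs of different durations
# and one isolated spike that is too short to count.
frame_scores = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.1, 0.6, 0.7, 0.6, 0.9, 0.2, 0.8]
print(intervals_from_scores(frame_scores))  # [(2, 4), (7, 10)]
```

The two recovered intervals span three and four frames respectively; nothing in the decoding fixes the duration in advance.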
[CV-31] Information Bottleneck-based Causal Attention for Multi-label Medical Image Recognition MICCAI2025
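[Quick Read]: This paper addresses inaccurate causal interpretation in multi-label classification (MLC) of medical images, which arises when models attend to class-irrelevant features. Current methods rely mainly on causal attention to learn class-specific features, but often fail to separate true causal factors from spurious correlations and noise, hurting diagnostic accuracy and interpretability. The key to the solution is an Information Bottleneck-based Causal Attention (IBCA) mechanism. It builds a structural causal model (SCM) that treats class-specific attention as a mixture of causal, spurious, and noisy factors, filters out irrelevant information with Gaussian mixture multi-label spatial attention, and applies a contrastive enhancement-based causal intervention to gradually remove spurious attention and reduce noise, improving both the identification of true causes and classification performance.
Link: https://arxiv.org/abs/2508.08069
Authors: Xiaoxiao Cui,Yiran Li,Kai He,Shanzhi Jiang,Mengli Xue,Wentao Li,Junhong Leng,Zhi Liu,Lizhen Cui,Shuo Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Early accepted by MICCAI 2025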
Abstract:Multi-label classification (MLC) of medical images aims to identify multiple diseases and holds significant clinical potential. A critical step is to effectively learn class-specific features for accurate diagnosis and improved interpretability. However, current works focus primarily on causal attention to learn class-specific features, yet they struggle to interpret the true cause due to the inadvertent attention to class-irrelevant features. To address this challenge, we propose a new structural causal model (SCM) that treats class-specific attention as a mixture of causal, spurious, and noisy factors, and a novel Information Bottleneck-based Causal Attention (IBCA) that is capable of learning the discriminative class-specific attention for MLC of medical images. Specifically, we propose learning Gaussian mixture multi-label spatial attention to filter out class-irrelevant information and capture each class-specific attention pattern. Then a contrastive enhancement-based causal intervention is proposed to gradually mitigate the spurious attention and reduce noise information by aligning multi-head attention with the Gaussian mixture multi-label spatial attention. Quantitative and ablation results on Endo and MuReD show that IBCA outperforms all compared methods. Compared to the second-best results for each metric, IBCA achieves improvements of 6.35% in CR, 7.72% in OR, and 5.02% in mAP on MuReD, and 1.47% in CR, 1.65% in CF1, and 1.42% in mAP on Endo.
[CV-32] PrIINeR: Towards Prior-Informed Implicit Neural Representations for Accelerated MRI BMVC
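[Quick Read]: This paper addresses the loss of structural information and the aliasing artefacts that occur in magnetic resonance imaging (MRI) reconstruction at high acceleration factors, which stem from the weak prior constraints of implicit neural representations (INRs). The key to the solution is PrIINeR, which injects population-level prior knowledge from pre-trained deep learning models into the INR framework. By combining this with instance-based optimization and a dual data consistency constraint, the reconstruction is made to agree with both the acquired k-space data and the prior-informed reconstruction, improving structural fidelity and suppressing artefacts.
Link: https://arxiv.org/abs/2508.08058
Authors: Ziad Al-Haj Hemidi,Eytan Kats,Mattias P. Heinrich
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to the British Machine Vision Conference (BMVC) 2025 (Before peer review version)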
Abstract:Accelerating Magnetic Resonance Imaging (MRI) reduces scan time but often degrades image quality. While Implicit Neural Representations (INRs) show promise for MRI reconstruction, they struggle at high acceleration factors due to weak prior constraints, leading to structural loss and aliasing artefacts. To address this, we propose PrIINeR, an INR-based MRI reconstruction method that integrates prior knowledge from pre-trained deep learning models into the INR framework. By combining population-level knowledge with instance-based optimization and enforcing dual data consistency, PrIINeR aligns both with the acquired k-space data and the prior-informed reconstruction. Evaluated on the NYU fastMRI dataset, our method not only outperforms state-of-the-art INR-based approaches but also improves upon several learning-based state-of-the-art methods, significantly improving structural preservation and fidelity while effectively removing aliasing artefacts. PrIINeR bridges deep learning and INR-based techniques, offering a more reliable solution for high-quality, accelerated MRI reconstruction. The code is publicly available on this https URL.
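One way to read the dual data consistency idea is as a two-term objective. The form below is only a sketch: every symbol is an assumption introduced for illustration, and the paper's actual loss may differ (for example, the prior term could be defined in k-space). Here f_θ is the image rendered by the INR, F the Fourier transform, M the undersampling mask, y the acquired k-space samples, x_prior the reconstruction from the pre-trained model, and λ a weighting factor.

```latex
\mathcal{L}(\theta) =
  \underbrace{\lVert M \mathcal{F} f_\theta - y \rVert_2^2}_{\text{consistency with acquired k-space}}
  + \lambda \,
  \underbrace{\lVert f_\theta - x_{\text{prior}} \rVert_2^2}_{\text{consistency with prior-informed reconstruction}}
```

Under this reading, λ trades off trust in the measurements against trust in the population-level prior.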
[CV-33] S2VG: 3D Stereoscopic and Spatial Video Generation via Denoising Frame Matrix
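[Quick Read]: This paper addresses the difficulty current video generation models have in producing high-quality 3D stereoscopic and spatial videos for immersive applications. Existing approaches typically depend on dedicated training or complex pose constraints, which limits their generality and flexibility. The paper proposes a pose-free, training-free solution. It first generates a base video with a pre-trained monocular video generation model and warps it into predefined camera viewpoints using estimated depth. A novel frame matrix inpainting framework then uses the original generation model to synthesize missing content across viewpoints and timestamps, keeping the result spatially and temporally consistent. A dual-update scheme mitigates the negative effects propagated from disoccluded regions in the latent space. The resulting multi-view videos can be converted into stereoscopic pairs or optimized into 4D Gaussians for spatial video synthesis.
Link: https://arxiv.org/abs/2508.08048
Authors: Peng Dai,Feitong Tan,Qiangeng Xu,Yihua Huang,David Futschik,Ruofei Du,Sean Fanello,Yinda Zhang,Xiaojuan Qi
Affiliations: The University of Hong Kong; Google
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: immersive video generation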
Abstract:While video generation models excel at producing high-quality monocular videos, generating 3D stereoscopic and spatial videos for immersive applications remains an underexplored challenge. We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos. Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel frame matrix inpainting framework. This framework utilizes the original video generation model to synthesize missing content across different viewpoints and timestamps, ensuring spatial and temporal consistency without requiring additional model fine-tuning. Moreover, we develop a dual-update scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. The resulting multi-view videos are then adapted into stereoscopic pairs or optimized into 4D Gaussians for spatial video synthesis. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, such as Sora, Lumiere, WALT, and Zeroscope. The experiments demonstrate that our method has a significant improvement over previous methods. Project page at: this https URL
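The depth-based warping step, and the disoccluded holes that the inpainting stage must fill, can be illustrated on a single scanline. This is a simplified sketch under stated assumptions: a rectified horizontal camera shift, the standard pinhole relation disparity = focal length x baseline / depth, and nearest-pixel forward warping with a z-buffer. The function name and the sample values are hypothetical.

```python
def warp_row(colors, depths, focal_px, baseline):
    """Forward-warp one image row to a horizontally shifted camera.

    Returns the warped row; None marks disoccluded pixels with no source.
    """
    width = len(colors)
    warped = [None] * width
    zbuffer = [float("inf")] * width
    for x, (color, depth) in enumerate(zip(colors, depths)):
        disparity = focal_px * baseline / depth  # nearer pixels shift more
        target = round(x - disparity)
        # Keep the nearest surface when several pixels land on one target.
        if 0 <= target < width and depth < zbuffer[target]:
            warped[target] = color
            zbuffer[target] = depth
    return warped


# A near object (depth 2) in front of a far background (depth 10).
row_colors = ["a", "b", "c", "d", "e", "f"]
row_depths = [10, 10, 2, 2, 10, 10]
print(warp_row(row_colors, row_depths, focal_px=10, baseline=0.2))
# ['a', 'c', 'd', None, 'e', 'f']
```

The near object moves one pixel, covers background pixel "b", and leaves a hole behind it. In the method described above, holes of this kind are what the frame matrix inpainting is meant to fill.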
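[CV-34] TRIDE: A Text-assisted Radar-Image weather-aware fusion network for Depth Estimation
[Quick Read]: This paper addresses the accuracy of depth estimation for autonomous driving, in particular the unstable performance of conventional radar-camera fusion under varying weather and its failure to exploit language information. The key to the solution is the TRIDE algorithm. It introduces a text-generation strategy with feature extraction and fusion techniques to enhance monocular depth estimation, and strengthens text feature extraction by incorporating radar point information. It also builds a weather-aware fusion block that adaptively adjusts the radar weighting according to current weather conditions, improving the robustness and accuracy of multimodal fusion. On the nuScenes dataset, the method improves MAE by 12.87% and RMSE by 9.08% over the state of the art.
Link: https://arxiv.org/abs/2508.08038
Authors: Huawei Sun,Zixu Wang,Hao Feng,Julius Ott,Lorenzo Servadei,Robert Wille
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by TMLR (2025.08)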
Abstract:Depth estimation, essential for autonomous driving, seeks to interpret the 3D environment surrounding vehicles. The development of radar sensors, known for their cost-efficiency and robustness, has spurred interest in radar-camera fusion-based solutions. However, existing algorithms fuse features from these modalities without accounting for weather conditions, despite radars being known to be more robust than cameras under adverse weather. Additionally, while Vision-Language models have seen rapid advancement, utilizing language descriptions alongside other modalities for depth estimation remains an open challenge. This paper first introduces a text-generation strategy along with feature extraction and fusion techniques that can assist monocular depth estimation pipelines, leading to improved accuracy across different algorithms on the KITTI dataset. Building on this, we propose TRIDE, a radar-camera fusion algorithm that enhances text feature extraction by incorporating radar point information. To address the impact of weather on sensor performance, we introduce a weather-aware fusion block that adaptively adjusts radar weighting based on current weather conditions. Our method, benchmarked on the nuScenes dataset, demonstrates performance gains over the state-of-the-art, achieving a 12.87% improvement in MAE and a 9.08% improvement in RMSE. Code: this https URL
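The idea of adjusting radar weighting by weather can be sketched as a scalar gate over two feature vectors. Everything here is a hand-written stand-in: in the paper the fusion block is learned, whereas this sketch assumes a hypothetical `weather_score` in [0, 1] (0 for clear, 1 for severe) and a fixed sigmoid gate with an arbitrary gain.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def weather_aware_fuse(camera_feat, radar_feat, weather_score, gain=4.0):
    """Blend camera and radar features; radar weight grows with weather severity.

    Returns (fused_features, radar_weight).
    """
    radar_weight = sigmoid(gain * (weather_score - 0.5))
    fused = [
        radar_weight * r + (1.0 - radar_weight) * c
        for c, r in zip(camera_feat, radar_feat)
    ]
    return fused, radar_weight


# Hypothetical two-dimensional features for one location.
camera = [1.0, 3.0]
radar = [3.0, 1.0]
_, w_clear = weather_aware_fuse(camera, radar, weather_score=0.1)
_, w_rain = weather_aware_fuse(camera, radar, weather_score=0.9)
print(w_clear < w_rain)  # radar is trusted more in bad weather
```

The design choice this illustrates is the one the abstract motivates: radar is more robust than cameras in adverse weather, so its contribution should rise as conditions degrade.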
[CV-35] IPBA: Imperceptible Perturbation Backdoor Attack in Federated Self-Supervised Learning
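[Quick Read]: This paper addresses stealthy backdoor attacks against federated self-supervised learning (FSSL) in practical deployment. Existing attacks mostly rely on visually obvious triggers, which do not meet real-world requirements for stealth and practicality. The paper proposes IPBA (Imperceptible Perturbation Backdoor Attack). Its core innovation is to decouple the feature distributions of backdoor samples and augmented samples, and to use the Sliced-Wasserstein distance to mitigate the out-of-distribution nature of backdoor samples, thereby optimizing trigger generation and improving both attack effectiveness and stealth. Experiments show that IPBA outperforms existing attacks across multiple FSSL scenarios and remains robust under various defense mechanisms.
Link: https://arxiv.org/abs/2508.08031
Authors: Jiayao Wang,Yang Song,Zhendong Zhao,Jiale Zhang,Qilin Wu,Junwu Zhu,Dongfang Zhao
Affiliations: Yangzhou University; Chinese Academy of Sciences; Chaohu University; University of Washington
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: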
Abstract:Federated self-supervised learning (FSSL) combines the advantages of decentralized modeling and unlabeled representation learning, serving as a cutting-edge paradigm with strong potential for scalability and privacy preservation. Although FSSL has garnered increasing attention, research indicates that it remains vulnerable to backdoor attacks. Existing methods generally rely on visually obvious triggers, which makes it difficult to meet the requirements for stealth and practicality in real-world deployment. In this paper, we propose an imperceptible and effective backdoor attack method against FSSL, called IPBA. Our empirical study reveals that existing imperceptible triggers face a series of challenges in FSSL, particularly limited transferability, feature entanglement with augmented samples, and out-of-distribution properties. These issues collectively undermine the effectiveness and stealthiness of traditional backdoor attacks in FSSL. To overcome these challenges, IPBA decouples the feature distributions of backdoor and augmented samples, and introduces Sliced-Wasserstein distance to mitigate the out-of-distribution properties of backdoor samples, thereby optimizing the trigger generation process. Our experimental results on several FSSL scenarios and datasets show that IPBA significantly outperforms existing backdoor attack methods in performance and exhibits strong robustness under various defense mechanisms.
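For readers unfamiliar with the Sliced-Wasserstein distance, the following sketch shows the standard Monte Carlo estimator: project both sample sets onto random unit directions, sort, and compare in one dimension. It illustrates the distance only, not how IPBA uses it in trigger optimization. The equal-size requirement, the number of projections, and the sample data are assumptions of this sketch.

```python
import math
import random


def sliced_wasserstein(xs, ys, n_proj=64, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance.

    xs and ys are equal-sized lists of feature vectors of the same dimension.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-sized sample sets")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_proj):
        # Draw a random direction on the unit sphere.
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        # In 1D, optimal transport simply matches sorted samples.
        proj_x = sorted(sum(a * b for a, b in zip(x, v)) for x in xs)
        proj_y = sorted(sum(a * b for a, b in zip(y, v)) for y in ys)
        total += sum((a - b) ** 2 for a, b in zip(proj_x, proj_y)) / len(proj_x)
    return math.sqrt(total / n_proj)


# Hypothetical clean features and a shifted copy standing in for backdoor features.
clean = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted = [(x + 2.0, y) for x, y in clean]
print(sliced_wasserstein(clean, clean))    # 0.0
print(sliced_wasserstein(clean, shifted))  # positive: the distributions differ
```

Because each slice reduces to sorting, the estimate is cheap to compute, which makes it practical as a distribution-matching term.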
[CV-36] Mitigating Biases in Surgical Operating Rooms with Geometry MICCAI’25
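[Quick Read]: This paper addresses model bias in surgical operating room (OR) settings, where deep neural networks learn dataset-specific spurious correlations. For example, models rely on appearance cues such as footwear or eyewear rather than genuine identity or behavioral features. The key to the solution is to represent personnel as 3D point cloud sequences, disentangling identity-relevant shape and motion patterns from appearance-based confounders. This lets the model capture more robust biometric features and recognize personalized workflow traits of clinical staff, such as surgical skill level or coordination with others, more accurately in realistic clinical settings.
Link: https://arxiv.org/abs/2508.08028
Authors: Tony Danjun Wang,Tobias Czempiel,Nassir Navab,Lennart Bastian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Extended Abstract, presented at the MICCAI’25 workshop on Collaborative Intelligence and Autonomy in Image-guided Surgery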
Abstract:Deep neural networks are prone to learning spurious correlations, exploiting dataset-specific artifacts rather than meaningful features for prediction. In surgical operating rooms (OR), these manifest through the standardization of smocks and gowns that obscure robust identifying landmarks, introducing model bias for tasks related to modeling OR personnel. Through gradient-based saliency analysis on two public OR datasets, we reveal that CNN models succumb to such shortcuts, fixating on incidental visual cues such as footwear beneath surgical gowns, distinctive eyewear, or other role-specific identifiers. Avoiding such biases is essential for the next generation of intelligent assistance systems in the OR, which should accurately recognize personalized workflow traits, such as surgical skill level or coordination with other staff members. We address this problem by encoding personnel as 3D point cloud sequences, disentangling identity-relevant shape and motion patterns from appearance-based confounders. Our experiments demonstrate that while RGB and geometric methods achieve comparable performance on datasets with apparent simulation artifacts, RGB models suffer a 12% accuracy drop in realistic clinical settings with decreased visual diversity due to standardizations. This performance gap confirms that geometric representations capture more meaningful biometric features, providing an avenue to developing robust methods of modeling humans in the OR.
[CV-37] Sample-aware RandAugment: Search-free Automatic Data Augmentation for Effective Image Recognition
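[Quick Read]: This paper addresses two challenges facing current automatic data augmentation (AutoDA) methods: the search process is too time-consuming for practical use, and performance suffers because the policy adapts insufficiently during training. The key to the solution is Sample-aware RandAugment (SRA), a search-free method that uses a heuristic scoring module to assess the complexity of the original training data and dynamically assign a tailored augmentation policy to each sample. An asymmetric augmentation strategy is used to get the most out of the scoring module. The method stays simple to implement while improving model generalization.
Link: https://arxiv.org/abs/2508.08004
Authors: Anqi Xiao,Weichen Yu,Hongyuan Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: International Journal of Computer Vision, 2025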
Abstract:Automatic data augmentation (AutoDA) plays an important role in enhancing the generalization of neural networks. However, mainstream AutoDA methods often encounter two challenges: either the search process is excessively time-consuming, hindering practical application, or the performance is suboptimal due to insufficient policy adaptation during training. To address these issues, we propose Sample-aware RandAugment (SRA), an asymmetric, search-free AutoDA method that dynamically adjusts augmentation policies while maintaining straightforward implementation. SRA incorporates a heuristic scoring module that evaluates the complexity of the original training data, enabling the application of tailored augmentations for each sample. Additionally, an asymmetric augmentation strategy is employed to maximize the potential of this scoring module. In multiple experimental settings, SRA narrows the performance gap between search-based and search-free AutoDA methods, achieving a state-of-the-art Top-1 accuracy of 78.31% on ImageNet with ResNet-50. Notably, SRA demonstrates good compatibility with existing augmentation pipelines and solid generalization across new tasks, without requiring hyperparameter tuning. The pretrained models leveraging SRA also enhance recognition in downstream object detection tasks. SRA represents a promising step towards simpler, more effective, and practical AutoDA designs applicable to a variety of future tasks. Our code is available at this https URL.
[CV-38] Prompt-Guided Relational Reasoning for Social Behavior Understanding with Vision Foundation Models
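[Quick Read]: This paper addresses the weak performance of Vision Foundation Models (VFMs) on Group Activity Detection (GAD), where simply swapping a conventional CNN backbone for a VFM yields little gain. The core challenge is that VFMs extract strong features but are pretrained on object-centric data and lack a structured reasoning mechanism for modeling social group dynamics. The key to the solution is Prompt-driven Group Activity Detection (ProGraD), built on two components: (1) learnable group prompts that steer the VFM's attention toward social configurations, and (2) a lightweight two-layer GroupContext Transformer that infers actor-group associations and collective behavior. The method is especially effective in multi-group scenarios, gaining 6.5% (Group mAP@1.0) and 8.2% (Group mAP@0.5) with only 10M trainable parameters, and it produces interpretable attention maps that expose the group reasoning process.
Link: https://arxiv.org/abs/2508.07996
Authors: Thinesh Thiyakesan Ponbagavathi,Chengzheng Yang,Alina Roitberg
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: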
Abstract:Group Activity Detection (GAD) involves recognizing social groups and their collective behaviors in videos. Vision Foundation Models (VFMs), like DinoV2, offer excellent features, but are pretrained primarily on object-centric data and remain underexplored for modeling group dynamics. While they are a promising alternative to highly task-specific GAD architectures that require full fine-tuning, our initial investigation reveals that simply swapping CNN backbones used in these methods with VFMs brings little gain, underscoring the need for structured, group-aware reasoning on top. We introduce Prompt-driven Group Activity Detection (ProGraD) – a method that bridges this gap through 1) learnable group prompts to guide the VFM attention toward social configurations, and 2) a lightweight two-layer GroupContext Transformer that infers actor-group associations and collective behavior. We evaluate our approach on two recent GAD benchmarks: Cafe, which features multiple concurrent social groups, and Social-CAD, which focuses on single-group interactions. While we surpass state-of-the-art in both settings, our method is especially effective in complex multi-group scenarios, where we yield a gain of 6.5% (Group mAP@1.0) and 8.2% (Group mAP@0.5) using only 10M trainable parameters. Furthermore, our experiments reveal that ProGraD produces interpretable attention maps, offering insights into actor-group reasoning. Code and models will be released.
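[CV-39] The Escalator Problem: Identifying Implicit Motion Blindness in AI for Accessibility ICCV2025
[Quick Read]: This paper examines "Implicit Motion Blindness" in Multimodal Large Language Models (MLLMs) used to assist blind and visually impaired (BVI) people, with the inability to recognize an escalator's direction of travel as the canonical failure case. The problem stems from the dominant frame-sampling paradigm in video understanding, which treats video as a discrete sequence of static images and therefore struggles to perceive continuous, low-signal motion. The paper's contribution is not a new model. Instead, it calls for a shift from purely semantic recognition toward robust physical perception, and advocates new user-centered benchmarks that emphasize safety and reliability to meet the real needs of BVI users in dynamic environments.
Link: https://arxiv.org/abs/2508.07989
Authors: Xiantao Zhang
Affiliations: Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 9 pages, 3 figures, 2 tables. Accepted at CV4A11y, ICCV 2025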
Abstract:Multimodal Large Language Models (MLLMs) hold immense promise as assistive technologies for the blind and visually impaired (BVI) community. However, we identify a critical failure mode that undermines their trustworthiness in real-world applications. We introduce the Escalator Problem – the inability of state-of-the-art models to perceive an escalator’s direction of travel – as a canonical example of a deeper limitation we term Implicit Motion Blindness. This blindness stems from the dominant frame-sampling paradigm in video understanding, which, by treating videos as discrete sequences of static images, fundamentally struggles to perceive continuous, low-signal motion. As a position paper, our contribution is not a new model but rather to: (I) formally articulate this blind spot, (II) analyze its implications for user trust, and (III) issue a call to action. We advocate for a paradigm shift from purely semantic recognition towards robust physical perception and urge the development of new, human-centered benchmarks that prioritize safety, reliability, and the genuine needs of users in dynamic environments.
[CV-40] Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
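[Quick Read]: This paper addresses two core problems in visual effects (VFX) generation with current video generation models. First, existing methods depend on per-effect LoRA training, so they cannot generate multiple effects together. Second, joint training on multiple effects suffers from interference between effects and unreliable spatial control. The key to the solution is the Omni-Effects framework, with two main innovations: (1) a LoRA-based Mixture of Experts (LoRA-MoE), which uses a group of expert LoRAs to model diverse effects in a unified model while mitigating cross-task interference; and (2) a Spatial-Aware Prompt (SAP), which embeds spatial mask information into the text token for precise spatial control, together with an Independent-Information Flow (IIF) module that isolates the control signals of individual effects to prevent unwanted blending.
Link: https://arxiv.org/abs/2508.07981
Authors: Fangyuan Mao,Aiming Hao,Jintao Chen,Dongxia Liu,Xiaokun Feng,Jiashu Zhu,Meiqi Wu,Chubin Chen,Jiahong Wu,Xiangxiang Chu
Affiliations: AMAP, Alibaba Group; PKU; THU; CASIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: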
Abstract:Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, a first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference. (2) Spatial-Aware Prompt (SAP) incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset Omni-VFX via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.
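[CV-41] TrackOR: Towards Personalized Intelligent Operating Rooms Through Robust Tracking MICCAI’25
【Quick read】: This paper addresses continuous, accurate multi-person tracking and re-identification of all staff during long surgical procedures, a prerequisite for personalized intelligent support in the operating room. The core of the solution is the TrackOR framework, which uses 3D geometric signatures to improve online tracking (+11% Association Accuracy over the strongest baseline) and also supports offline trajectory recovery to produce complete, analysis-ready trajectories. This markedly strengthens persistent identity tracking and provides the technical basis for granular, staff-centric analysis and personalized intelligent systems.
Link: https://arxiv.org/abs/2508.07968
Authors: Tony Danjun Wang,Christian Heiliger,Nassir Navab,Lennart Bastian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Full Research Paper, presented at MICCAI’25 Workshop on Collaborative Intelligence and Autonomy in Image-guided Surgery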
Abstract:Providing intelligent support to surgical teams is a key frontier in automated surgical scene understanding, with the long-term goal of improving patient outcomes. Developing personalized intelligence for all staff members requires maintaining a consistent state of who is located where for long surgical procedures, which still poses numerous computational challenges. We propose TrackOR, a framework for tackling long-term multi-person tracking and re-identification in the operating room. TrackOR uses 3D geometric signatures to achieve state-of-the-art online tracking performance (+11% Association Accuracy over the strongest baseline), while also enabling an effective offline recovery process to create analysis-ready trajectories. Our work shows that by leveraging 3D geometric information, persistent identity tracking becomes attainable, enabling a critical shift towards the more granular, staff-centric analyses required for personalized intelligent systems in the operating room. This new capability opens up various applications, including our proposed temporal pathway imprints that translate raw tracking data into actionable insights for improving team efficiency and safety and ultimately providing personalized support.
[CV-42] VOIDFace: A Privacy-Preserving Multi-Network Face Recognition With Enhanced Security
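【Quick read】: This paper addresses the privacy leakage and data-management complexity caused by data replication in current facial recognition systems, along with the ethical problem of users losing control over their own face data. Its core solution is the VOIDFace framework: Visual Secret Sharing (VSS) provides secure storage of and decentralized control over training data, and a patch-based multi-training network trains the model without relying on replicated copies of the original data. This preserves privacy and security while maintaining competitive recognition performance, and lets users exercise their Right-To-Be-Forgotten.
Link: https://arxiv.org/abs/2508.07960
Authors: Ajnas Muhammed,Iurri Medvedev,Nuno Gonçalves
Affiliations: Institute of Systems and Robotics, University of Coimbra (科英布拉大学系统与机器人研究所)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE International Joint Conference on Biometrics (IJCB) 2025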
Abstract:Advancement of machine learning techniques, combined with the availability of large-scale datasets, has significantly improved the accuracy and efficiency of facial recognition. Modern facial recognition systems are trained using large face datasets collected from diverse individuals or public repositories. However, for training, these datasets are often replicated and stored in multiple workstations, resulting in data replication, which complicates database management and oversight. Currently, once a user submits their face for dataset preparation, they lose control over how their data is used, raising significant privacy and ethical concerns. This paper introduces VOIDFace, a novel framework for facial recognition systems that addresses two major issues. First, it eliminates the need of data replication and improves data control to securely store training face data by using visual secret sharing. Second, it proposes a patch-based multi-training network that uses this novel training data storage mechanism to develop a robust, privacy-preserving facial recognition system. By integrating these advancements, VOIDFace aims to improve the privacy, security, and efficiency of facial recognition training, while ensuring greater control over sensitive personal face data. VOIDFace also enables users to exercise their Right-To-Be-Forgotten property to control their personal data. Experimental evaluations on the VGGFace2 dataset show that VOIDFace provides Right-To-Be-Forgotten, improved data control, security, and privacy while maintaining competitive facial recognition performance. Code is available at: this https URL
[CV-43] FEAT: A Multi-Agent Forensic AI System with Domain-Adapted Large Language Model for Automated Cause-of-Death Analysis
【Quick read】: This paper addresses systemic challenges in forensic cause-of-death determination, including forensic workforce shortages and inconsistent diagnoses, particularly in high-volume medicolegal systems such as China's. The core of the solution is FEAT (ForEnsic AgenT), a multi-agent AI framework built on a domain-adapted Large Language Model (LLM) that automates and standardizes death investigations through four modules: task decomposition, evidence analysis, iterative reflection, and conclusion synthesis. Key innovations include tool-augmented reasoning, hierarchical retrieval-augmented generation, forensic-tuned LLMs, and human-in-the-loop feedback. FEAT outperforms existing AI systems across multiple geographic regions and case types, and senior pathologists rated its outputs highly.
Link: https://arxiv.org/abs/2508.07950
Authors: Chen Shen,Wanqing Zhang,Kehan Li,Erwen Huang,Haitao Bi,Aiying Fan,Yiwen Shen,Hongmei Dong,Ji Zhang,Yuming Shao,Zengjia Liu,Xinshe Liu,Tao Li,Chunxia Yan,Shuanliang Fan,Di Wu,Jianhua Ma,Bin Cong,Zhenyuan Wang,Chunfeng Lian
Affiliations: Xi’an Jiaotong University (西安交通大学); Sun Yat-Sen University (中山大学); Hebei Medical University (河北医科大学); Xinxiang Medical University (新乡医学院); Fudan University (复旦大学); Huazhong University of Science and Technology (华中科技大学); Academy of Forensic Science, Ministry of Justice (司法部司法鉴定科学研究院); Jining Medical University (济宁医学院); Shaanxi Baimei Forensic Appraisal Institutions (陕西佰美司法鉴定机构)
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 18 pages, 6 figures
Abstract:Forensic cause-of-death determination faces systemic challenges, including workforce shortages and diagnostic variability, particularly in high-volume systems like China’s medicolegal infrastructure. We introduce FEAT (ForEnsic AgenT), a multi-agent AI framework that automates and standardizes death investigations through a domain-adapted large language model. FEAT’s application-oriented architecture integrates: (i) a central Planner for task decomposition, (ii) specialized Local Solvers for evidence analysis, (iii) a Memory Reflection module for iterative refinement, and (iv) a Global Solver for conclusion synthesis. The system employs tool-augmented reasoning, hierarchical retrieval-augmented generation, forensic-tuned LLMs, and human-in-the-loop feedback to ensure legal and medical validity. In evaluations across diverse Chinese case cohorts, FEAT outperformed state-of-the-art AI systems in both long-form autopsy analyses and concise cause-of-death conclusions. It demonstrated robust generalization across six geographic regions and achieved high expert concordance in blinded validations. Senior pathologists validated FEAT’s outputs as comparable to those of human experts, with improved detection of subtle evidentiary nuances. To our knowledge, FEAT is the first LLM-based AI agent system dedicated to forensic medicine, offering scalable, consistent death certification while maintaining expert-level rigor. By integrating AI efficiency with human oversight, this work could advance equitable access to reliable medicolegal services while addressing critical capacity constraints in forensic systems.
[CV-44] TAG: A Simple Yet Effective Temporal-Aware Approach for Zero-Shot Video Temporal Grounding
【Quick read】: This paper addresses three problems in zero-shot Video Temporal Grounding (VTG): semantic fragmentation, skewed similarity distributions, and heavy reliance on large language models (LLMs), which makes inference expensive. Its core solution is TAG, a training-free temporal-aware method built on three modules: temporal pooling, temporal coherence clustering, and similarity adjustment. Together they capture the temporal context of a video and correct the similarity distributions distorted by semantic fragmentation, giving more accurate localization of the target moment without LLMs and state-of-the-art results on Charades-STA and ActivityNet Captions.
Link: https://arxiv.org/abs/2508.07925
Authors: Jin-Seop Lee,SungJoon Lee,Jaehan Ahn,YunSeok Choi,Jee-Hyong Lee
Affiliations: Sungkyunkwan University (成均馆大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video Temporal Grounding (VTG) aims to extract relevant video segments based on a given natural language query. Recently, zero-shot VTG methods have gained attention by leveraging pretrained vision-language models (VLMs) to localize target moments without additional training. However, existing approaches suffer from semantic fragmentation, where temporally continuous frames sharing the same semantics are split across multiple segments. When segments are fragmented, it becomes difficult to predict an accurate target moment that aligns with the text query. Also, they rely on skewed similarity distributions for localization, making it difficult to select the optimal segment. Furthermore, they heavily depend on the use of LLMs which require expensive inferences. To address these limitations, we propose TAG, a simple yet effective Temporal-Aware approach for zero-shot video temporal Grounding, which incorporates temporal pooling, temporal coherence clustering, and similarity adjustment. Our proposed method effectively captures the temporal context of videos and addresses distorted similarity distributions without training. Our approach achieves state-of-the-art results on Charades-STA and ActivityNet Captions benchmark datasets without relying on LLMs. Our code is available at this https URL
[CV-45] Safeguarding Generative AI Applications in Preclinical Imaging through Hybrid Anomaly Detection
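【Quick read】: This paper addresses the difficulty of detecting and managing unexpected or erroneous model behavior when generative AI (GenAI) is applied in the high-stakes field of nuclear medicine. The key to the solution is the development and implementation of a hybrid anomaly detection (outlier detection, OD) framework that safeguards the GenAI models in BIOEMTECH’s eyes™ systems. The framework provides real-time quality control in two applications, Pose2Xray (synthetic X-rays generated from photographs of mice) and DosimetrEYE (3D radiation dose maps estimated from 2D SPECT/CT scans), improving robustness, scalability, and regulatory compliance, and so strengthening the industrial viability of GenAI in preclinical settings.
Link: https://arxiv.org/abs/2508.07923
Authors: Jakub Binda,Valentina Paneta,Vasileios Eleftheriadis,Hongkyou Chung,Panagiotis Papadimitroulas,Neo Christopher Chung
Affiliations: Institute of Informatics, University of Warsaw (华沙大学信息学院); Alethia XAI Sp. z o.o. (Alethia XAI有限责任公司); BIOEMTECH (生物电子技术公司); Athens (雅典); School of Law, Seoul National University (首尔国立大学法学院); Shin & Kim LLC (申与金律师事务所); Republic of Korea (大韩民国); Medical Informatics Laboratory, School of Medicine, University of Thessaly (色萨利大学医学院医学信息学实验室); Larissa (拉里萨)
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: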
Abstract:Generative AI holds great potential to automate and enhance data synthesis in nuclear medicine. However, the high-stakes nature of biomedical imaging necessitates robust mechanisms to detect and manage unexpected or erroneous model behavior. We present the development and implementation of a hybrid anomaly detection framework to safeguard GenAI models in BIOEMTECH’s eyes™ systems. Two applications are demonstrated: Pose2Xray, which generates synthetic X-rays from photographic mouse images, and DosimetrEYE, which estimates 3D radiation dose maps from 2D SPECT/CT scans. In both cases, our outlier detection (OD) enhances reliability, reduces manual oversight, and supports real-time quality control. This approach strengthens the industrial viability of GenAI in preclinical settings by increasing robustness, scalability, and regulatory compliance.
[CV-46] RSVLM-QA: A Benchmark Dataset for Remote Sensing Vision Language Model-based Question Answering
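【Quick read】: This paper addresses the limitations of current remote sensing (RS) Visual Question Answering (VQA) datasets in annotation richness, question diversity, and the assessment of specific reasoning capabilities. The authors propose a dual-track annotation generation pipeline. First, Large Language Models (LLMs), specifically GPT-4.1, are given carefully designed prompts to automatically generate image captions, spatial relations, semantic tags, and complex caption-based VQA pairs. Second, for the difficult task of object counting in RS imagery, a dedicated automated process extracts object counts from the original segmentation data; GPT-4.1 turns these counts into natural-language answers, which are paired with preset question templates to form counting VQA pairs. By combining LLMs with structured RS data, the approach yields a large dataset (13,820 images, 162,373 VQA pairs) with rich annotations, allowing a more thorough evaluation of the understanding and reasoning abilities of Vision Language Models (VLMs) in RS scenes.
Link: https://arxiv.org/abs/2508.07918
Authors: Xing Zi,Jinghao Xiao,Yunxiao Shi,Xian Tao,Jun Li,Ali Braytee,Mukesh Prasad
Affiliations: University of Technology Sydney (悉尼科技大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted to the proceedings of the 33rd ACM International Multimedia Conference (ACM Multimedia 2025)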
Abstract:Visual Question Answering (VQA) in remote sensing (RS) is pivotal for interpreting Earth observation data. However, existing RS VQA datasets are constrained by limitations in annotation richness, question diversity, and the assessment of specific reasoning capabilities. This paper introduces RSVLM-QA dataset, a new large-scale, content-rich VQA dataset for the RS domain. RSVLM-QA is constructed by integrating data from several prominent RS segmentation and detection datasets: WHU, LoveDA, INRIA, and iSAID. We employ an innovative dual-track annotation generation pipeline. Firstly, we leverage Large Language Models (LLMs), specifically GPT-4.1, with meticulously designed prompts to automatically generate a suite of detailed annotations including image captions, spatial relations, and semantic tags, alongside complex caption-based VQA pairs. Secondly, to address the challenging task of object counting in RS imagery, we have developed a specialized automated process that extracts object counts directly from the original segmentation data; GPT-4.1 then formulates natural language answers from these counts, which are paired with preset question templates to create counting QA pairs. RSVLM-QA comprises 13,820 images and 162,373 VQA pairs, featuring extensive annotations and diverse question types. We provide a detailed statistical analysis of the dataset and a comparison with existing RS VQA benchmarks, highlighting the superior depth and breadth of RSVLM-QA’s annotations. Furthermore, we conduct benchmark experiments on six mainstream Vision Language Models (VLMs), demonstrating that RSVLM-QA effectively evaluates and challenges the understanding and reasoning abilities of current VLMs in the RS domain. We believe RSVLM-QA will serve as a pivotal resource for the RS VQA and VLM research communities, poised to catalyze advancements in the field.
[CV-47] Mem4D: Decoupling Static and Dynamic Memory for Dynamic Scene Reconstruction
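【Quick read】: This paper addresses the "Memory Demand Dilemma" in dense geometry reconstruction of dynamic scenes from monocular video: existing memory-based methods cannot reconcile the long-term stability needed for static structures with the high-fidelity retention of high-frequency detail needed for dynamic objects, which leads to geometric drift in static structures or blurred, inaccurate reconstructions of dynamic objects. The key to the solution is Mem4D, a framework that decouples the modeling of static geometry and dynamic motion through a dual-memory architecture. The Transient Dynamics Memory (TDM) captures high-frequency motion details from recent frames for fine-grained reconstruction of dynamic content, while the Persistent Structure Memory (PSM) compresses and preserves long-term spatial information for globally consistent, drift-free reconstruction of static structures. By alternately querying the two memories, Mem4D keeps static geometry globally consistent and reconstructs dynamic elements with high fidelity.
Link: https://arxiv.org/abs/2508.07908
Authors: Xudong Cai,Shuo Wang,Peng Wang,Yongcai Wang,Zhaoxin Fan,Wanting Li,Tianbao Zhang,Jianrong Tao,Yeying Jin,Deying Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: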
Abstract:Reconstructing dense geometry for dynamic scenes from a monocular video is a critical yet challenging task. Recent memory-based methods enable efficient online reconstruction, but they fundamentally suffer from a Memory Demand Dilemma: The memory representation faces an inherent conflict between the long-term stability required for static structures and the rapid, high-fidelity detail retention needed for dynamic motion. This conflict forces existing methods into a compromise, leading to either geometric drift in static structures or blurred, inaccurate reconstructions of dynamic objects. To address this dilemma, we propose Mem4D, a novel framework that decouples the modeling of static geometry and dynamic motion. Guided by this insight, we design a dual-memory architecture: 1) The Transient Dynamics Memory (TDM) focuses on capturing high-frequency motion details from recent frames, enabling accurate and fine-grained modeling of dynamic content; 2) The Persistent Structure Memory (PSM) compresses and preserves long-term spatial information, ensuring global consistency and drift-free reconstruction for static elements. By alternating queries to these specialized memories, Mem4D simultaneously maintains static geometry with global consistency and reconstructs dynamic elements with high fidelity. Experiments on challenging benchmarks demonstrate that our method achieves state-of-the-art or competitive performance while maintaining high efficiency. Codes will be publicly available.
[CV-48] Generative Video Matting
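【Quick read】: This paper addresses the poor real-world generalization of video matting models, which results from the scarcity of high-quality ground-truth annotations. Existing datasets usually provide only imperfect alpha and foreground annotations that must be composited onto background images or videos during training, which limits model performance. The solution has two parts. First, it builds large-scale pre-training data from diverse synthetic and pseudo-labeled segmentation datasets, and develops a scalable synthetic data generation pipeline that renders diverse human bodies and fine-grained hair, yielding around 200 three-second video clips for fine-tuning. Second, it proposes a novel video matting architecture that exploits the rich priors of pre-trained video diffusion models. These priors narrow the domain gap between synthetic and real-world scenes, and because the model is inherently designed for video, it avoids the usual frame-by-frame processing with an independent decoder for temporal aggregation and achieves stronger temporal consistency.
Link: https://arxiv.org/abs/2508.07905
Authors: Yongtao Ge,Kangyang Xie,Guangkai Xu,Mingyu Liu,Li Ke,Longtao Huang,Hui Xue,Hao Chen,Chunhua Shen
Affiliations: The University of Adelaide (阿德莱德大学); Zhejiang University (浙江大学); Alibaba Group (阿里巴巴集团); Zhejiang University of Technology (浙江工业大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: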
Abstract:Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations, which must be composited to background images or videos during the training stage. Thus, the generalization capability of previous methods in real-world scenarios is typically poor. In this work, we propose to solve the problem from two perspectives. First, we emphasize the importance of large-scale pre-training by pursuing diverse synthetic and pseudo-labeled segmentation datasets. We also develop a scalable synthetic data generation pipeline that can render diverse human bodies and fine-grained hairs, yielding around 200 video clips with a 3-second duration for fine-tuning. Second, we introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models. This architecture offers two key advantages. First, strong priors play a critical role in bridging the domain gap between synthetic and real-world scenes. Second, unlike most existing methods that process video matting frame-by-frame and use an independent decoder to aggregate temporal information, our model is inherently designed for video, ensuring strong temporal consistency. We provide a comprehensive quantitative evaluation across three benchmark datasets, demonstrating our approach’s superior performance, and present comprehensive qualitative results in diverse real-world scenes, illustrating the strong generalization capability of our method. The code is available at this https URL.
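The abstract notes that alpha and foreground annotations must be composited onto backgrounds during training. As a minimal sketch of that compositing step, assuming the standard matting equation C = αF + (1 − α)B, the Python/NumPy snippet below blends a foreground over a background. The function name `composite` and the toy arrays are illustrative assumptions, not code from the paper.

```python
import numpy as np

def composite(alpha, foreground, background):
    """Blend a foreground over a background with a per-pixel alpha matte.

    alpha:      (H, W) array with values in [0, 1]
    foreground: (H, W, 3) array
    background: (H, W, 3) array
    """
    a = alpha[..., None]  # add a channel axis so alpha broadcasts over RGB
    return a * foreground + (1.0 - a) * background

# Toy 1x2 image: left pixel fully foreground, right pixel half transparent.
alpha = np.array([[1.0, 0.5]])
fg = np.ones((1, 2, 3))    # white foreground
bg = np.zeros((1, 2, 3))   # black background
out = composite(alpha, fg, bg)
```

An imperfect alpha matte propagates directly into the composited training image, which is one way to read the abstract's concern about annotation quality.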
[CV-49] CTC Transcription Alignment of the Bullinger Letters: Automatic Improvement of Annotation Quality ICCV2025
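【Quick read】: This paper addresses the performance bottleneck in Handwritten Text Recognition (HTR) for historical documents caused by annotation errors, in particular mishandled hyphenation. The key solution is a self-training method based on a CTC (Connectionist Temporal Classification) alignment algorithm that uses dynamic programming and model output probabilities to match full transcriptions to text line images, correcting annotation errors and improving recognition (for example, a 1.1 percentage point reduction in CER with PyLaia). Notably, weaker initial models produce more accurate alignments, which enables an iterative training strategy that progressively improves both recognition performance and alignment quality.
Link: https://arxiv.org/abs/2508.07904
Authors: Marco Peer,Anna Scius-Bertrand,Andreas Fischer
Affiliations: University of Fribourg (弗里堡大学); University of Applied Sciences and Arts Western Switzerland (西部瑞士应用科学与艺术大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 pages supplementary material. Accepted for VisionDocs@ICCV2025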
Abstract:Handwritten text recognition for historical documents remains challenging due to handwriting variability, degraded sources, and limited layout-aware annotations. In this work, we address annotation errors - particularly hyphenation issues - in the Bullinger correspondence, a large 16th-century letter collection. We introduce a self-training method based on a CTC alignment algorithm that matches full transcriptions to text line images using dynamic programming and model output probabilities trained with the CTC loss. Our approach improves performance (e.g., by 1.1 percentage points CER with PyLaia) and increases alignment accuracy. Interestingly, we find that weaker models yield more accurate alignments, enabling an iterative training strategy. We release a new manually corrected subset of 100 pages from the Bullinger dataset, along with our code and benchmarks. Our approach can be applied iteratively to further improve the CER as well as the alignment quality for text recognition pipelines. Code and data are available via this https URL.
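The abstract describes aligning transcriptions to text line images with dynamic programming over CTC output probabilities. As a rough illustration only, the sketch below implements standard CTC Viterbi forced alignment for a single line in Python with NumPy. It is not the authors' algorithm, which matches full transcriptions across text lines. The function name `ctc_forced_align`, the blank index 0, and the toy probabilities are assumptions made for this example, and the input is assumed to have enough frames to emit every label.

```python
import numpy as np

def ctc_forced_align(log_probs, labels, blank=0):
    """Viterbi alignment of `labels` to frames under the CTC topology.

    log_probs: (T, C) per-frame log-probabilities from a CTC-trained model
    labels:    non-empty list of target class indices (without blanks)
    Returns (best_log_score, path), where path[t] is the class at frame t.
    """
    T = log_probs.shape[0]
    # Extended sequence: blank, l1, blank, l2, ..., lN, blank
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S = len(ext)

    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_probs[0, ext[0]]  # start in the leading blank ...
    score[0, 1] = log_probs[0, ext[1]]  # ... or directly in the first label

    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, advance by one state, or skip the
            # blank between two different labels.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1, p])
            score[t, s] = score[t - 1, best] + log_probs[t, ext[s]]
            back[t, s] = best

    # A valid path ends in the last label or in the trailing blank.
    s = S - 1 if score[T - 1, S - 1] >= score[T - 1, S - 2] else S - 2
    best_score = score[T - 1, s]
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t, s]
    return best_score, path[::-1]

# Toy example: 5 frames, classes {0: blank, 1: "a", 2: "b"}, target "ab".
probs = np.full((5, 3), 0.05)
for t, c in enumerate([1, 1, 0, 2, 2]):
    probs[t, c] = 0.9
score, path = ctc_forced_align(np.log(probs), [1, 2])
```

The returned path gives the frame range occupied by each character, which is the kind of information needed to decide where a transcription should be cut at a line boundary.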
[CV-50] Diffusing the Blind Spot: Uterine MRI Synthesis with Diffusion Models MICCAI
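【Quick read】: This paper addresses the difficulty generative models have in producing anatomically precise female pelvic images, a consequence of data scarcity and patient privacy constraints that limits the use of existing diffusion models in gynaecological imaging. The key to the solution is a novel diffusion-based framework for uterine MRI synthesis that integrates unconditional and conditioned Denoising Diffusion Probabilistic Models (DDPMs) and Latent Diffusion Models (LDMs) in both 2D and 3D, generating high-fidelity, anatomically coherent synthetic images. The method is evaluated with perceptual and distributional metrics and validated for clinical realism in a blinded expert evaluation; the authors also release a synthetic dataset with privacy safeguards to support reproducible research and equitable AI in gynaecology.
Link: https://arxiv.org/abs/2508.07903
Authors: Johanna P. Müller,Anika Knupfer,Pedro Blöss,Edoardo Berardi Vittur,Bernhard Kainz,Jana Hutter
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at MICCAI CAPI 2025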
Abstract:Despite significant progress in generative modelling, existing diffusion models often struggle to produce anatomically precise female pelvic images, limiting their application in gynaecological imaging, where data scarcity and patient privacy concerns are critical. To overcome these barriers, we introduce a novel diffusion-based framework for uterine MRI synthesis, integrating both unconditional and conditioned Denoising Diffusion Probabilistic Models (DDPMs) and Latent Diffusion Models (LDMs) in 2D and 3D. Our approach generates anatomically coherent, high fidelity synthetic images that closely mimic real scans and provide valuable resources for training robust diagnostic models. We evaluate generative quality using advanced perceptual and distributional metrics, benchmarking against standard reconstruction methods, and demonstrate substantial gains in diagnostic accuracy on a key classification task. A blinded expert evaluation further validates the clinical realism of our synthetic images. We release our models with privacy safeguards and a comprehensive synthetic uterine MRI dataset to support reproducible research and advance equitable AI in gynaecology.
[CV-51] Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
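【Quick read】: This paper addresses identity preservation in high-fidelity human video generation, where existing methods typically need a large number of trainable parameters and are hard to combine with other AIGC tools. The key to the solution is Stand-In, a lightweight, plug-and-play framework that adds a conditional image branch to a pre-trained video generation model and achieves identity control through restricted self-attention with conditional position mapping. It can be learned quickly from only about 2,000 pairs and adds roughly 1% extra parameters, yet outperforms full-parameter training methods in video quality and identity preservation. It also extends readily to downstream tasks such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.
Link: https://arxiv.org/abs/2508.07901
Authors: Bowen Xue,Qixin Yan,Wenjing Wang,Hao Liu,Chen Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: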
Abstract:Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping, and can be learned quickly with only 2000 pairs. Despite incorporating and training just ~1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.
[CV-52] NeeCo: Image Synthesis of Novel Instrument States Based on Dynamic and Deformable 3D Gaussian Reconstruction
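【Quick read】: This paper addresses the scarcity of labeled data in surgical image datasets, which limits deep learning model training; high-quality, large-scale annotation of real scenes is costly and hard to obtain. The key solution is a dynamic Gaussian Splatting technique: a dynamic Gaussian model represents the moving and deforming structures in a surgical scene, so that realistic labeled images of surgical instruments can be rendered from unseen viewpoints and deformations against real tissue backgrounds. A dynamic training adjustment strategy handles poorly calibrated real-world camera poses, and annotations for the synthetic data are generated automatically from the dynamic Gaussians. The method achieves a Peak Signal-to-Noise Ratio of 29.87; medical-specific neural networks trained on its synthetic data outperform those trained with state-of-the-art standard data augmentation by 10% on unseen real data, for an overall improvement in model performance of nearly 15%.
Link: https://arxiv.org/abs/2508.07897
Authors: Tianle Zeng,Junlei Hu,Gerardo Loza Galindo,Sharib Ali,Duygu Sarikaya,Pietro Valdastri,Dominic Jones
Affiliations: STORM Lab UK (STORM 实验室英国); School of Electronic and Electrical Engineering (电子与电气工程学院); University of Leeds (利兹大学); School of Computer Science (计算机科学学院)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 9 figures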
Abstract:Computer vision-based technologies significantly enhance surgical automation by advancing tool tracking, detection, and localization. However, current data-driven approaches are data-voracious, requiring large, high-quality labeled image datasets, which limits their application in surgical data science. Our work introduces a novel dynamic Gaussian Splatting technique to address the data scarcity in surgical image datasets. We propose a dynamic Gaussian model to represent dynamic surgical scenes, enabling the rendering of surgical instruments from unseen viewpoints and deformations with real tissue backgrounds. We utilize a dynamic training adjustment strategy to address challenges posed by poorly calibrated camera poses from real-world scenarios. Additionally, we propose a method based on dynamic Gaussians for automatically generating annotations for our synthetic data. For evaluation, we constructed a new dataset featuring seven scenes with 14,000 frames of tool and camera motion and tool jaw articulation, with a background of an ex-vivo porcine model. Using this dataset, we synthetically replicate the scene deformation from the ground truth data, allowing direct comparisons of synthetic image quality. Experimental results illustrate that our method generates photo-realistic labeled image datasets with the highest values in Peak-Signal-to-Noise Ratio (29.87). We further evaluate the performance of medical-specific neural networks trained on real and synthetic images using an unseen real-world image dataset. Our results show that the performance of models trained on synthetic images generated by the proposed method outperforms those trained with state-of-the-art standard data augmentation by 10%, leading to an overall improvement in model performances by nearly 15%.
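The abstract reports image quality as a Peak Signal-to-Noise Ratio. For readers unfamiliar with the metric, here is a minimal Python/NumPy sketch of the usual definition, PSNR = 10 · log10(MAX² / MSE), conventionally expressed in dB. The function name `psnr` and the 8-bit peak value of 255 are assumptions for illustration; the paper's exact evaluation protocol is not given in the abstract.

```python
import numpy as np

def psnr(reference, rendered, max_value=255.0):
    """Peak Signal-to-Noise Ratio between two images of the same shape."""
    reference = np.asarray(reference, dtype=np.float64)
    rendered = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((reference - rendered) ** 2)  # mean squared error
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy check: a constant error of 25.5 on a 0..255 scale gives
# MSE = 650.25 and MAX^2 / MSE = 100.
reference = np.zeros((4, 4))
rendered = reference + 25.5
value = psnr(reference, rendered)
```

Higher values mean the synthetic image is closer to the ground-truth image pixel by pixel.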
[CV-53] Autonomous Navigation of Cloud-Controlled Quadcopters in Confined Spaces Using Multi-Modal Perception and LLM-Driven High Semantic Reasoning
【Quick read】: This paper addresses robust autonomous navigation of drones in indoor environments without GPS (Global Positioning System). The core challenge is to build accurate spatial awareness and keep flight safe with limited sensor resources and no external positioning. The key to the solution is a cloud-assisted perception framework. On the perception side, YOLOv11 provides real-time object detection, Depth Anything V2 provides monocular depth estimation, and a custom printed circuit board (PCB) integrates Time-of-Flight (ToF) sensors with an Inertial Measurement Unit (IMU). On the decision side, a cloud-deployed Large Language Model (LLM) supplies semantic-level decision support. A calibrated virtual safety envelope enforces collision avoidance, and a multithreaded architecture reduces end-to-end latency, giving low-latency, robust navigation in complex indoor scenes.
Link: https://arxiv.org/abs/2508.07885
Authors: Shoaib Ahmmad,Zubayer Ahmed Aditto,Md Mehrab Hossain,Noushin Yeasmin,Shorower Hossain
Affiliations: Rajshahi University of Engineering and Technology (拉杰沙希工程与技术大学); Shahjalal University of Science and Technology (沙赫贾拉尔科技大学); Bangladesh University of Engineering and Technology (孟加拉国工程技术大学); United International University (联合国际大学)
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments:
Abstract:This paper introduces an advanced AI-driven perception system for autonomous quadcopter navigation in GPS-denied indoor environments. The proposed framework leverages cloud computing to offload computationally intensive tasks and incorporates a custom-designed printed circuit board (PCB) for efficient sensor data acquisition, enabling robust navigation in confined spaces. The system integrates YOLOv11 for object detection, Depth Anything V2 for monocular depth estimation, a PCB equipped with Time-of-Flight (ToF) sensors and an Inertial Measurement Unit (IMU), and a cloud-based Large Language Model (LLM) for context-aware decision-making. A virtual safety envelope, enforced by calibrated sensor offsets, ensures collision avoidance, while a multithreaded architecture achieves low-latency processing. Enhanced spatial awareness is facilitated by 3D bounding box estimation with Kalman filtering. Experimental results in an indoor testbed demonstrate strong performance, with object detection achieving a mean Average Precision (mAP50) of 0.6, depth estimation Mean Absolute Error (MAE) of 7.2 cm, only 16 safety envelope breaches across 42 trials over approximately 11 minutes, and end-to-end system latency below 1 second. This cloud-supported, high-intelligence framework serves as an auxiliary perception and navigation system, complementing state-of-the-art drone autonomy for GPS-denied confined spaces.
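The abstract mentions 3D bounding box estimation with Kalman filtering but gives no details. Purely as an illustration of the filtering idea, the Python/NumPy sketch below runs one predict/update cycle of a textbook constant-velocity Kalman filter on a single coordinate, such as one axis of a box center. The function name `kalman_step`, the state layout, and the values of `dt`, `q`, and `r` are all assumptions for this example and are not taken from the paper.

```python
import numpy as np

def kalman_step(x, P, z, dt=0.1, q=1e-2, r=0.05):
    """One predict/update cycle of a constant-velocity Kalman filter.

    x: state [position, velocity]; P: 2x2 covariance; z: measured position.
    dt, q, r are illustrative values, not taken from the paper.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])  # state transition
    H = np.array([[1.0, 0.0]])             # only position is observed
    Q = q * np.eye(2)                      # process noise covariance
    R = np.array([[r]])                    # measurement noise covariance

    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    y = z - H @ x                          # innovation
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P

# Toy run: noisy-free measurements of a static object at position 1.0.
x, P = np.zeros(2), np.eye(2)
for _ in range(50):
    x, P = kalman_step(x, P, 1.0)
```

In a tracker, one such filter per box coordinate (or a joint filter over all coordinates) smooths frame-to-frame detection jitter before the estimates are used for collision checks.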
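[CV-54] TAP: Parameter-efficient Task-Aware Prompting for Adverse Weather Removal
【Quick read】: This paper addresses parameter redundancy and neglected inter-task relatedness in current All-in-One image restoration frameworks. Existing methods usually dedicate network modules or parameters to each degradation type, which inflates the parameter count, and they do not adequately model the intrinsic relationships between restoration tasks. The key to the solution is a parameter-efficient framework based on task-aware enhanced prompts. A two-stage training strategy first acquires general restoration knowledge in a pretraining phase, then adapts to specific degradation types through trainable soft prompts in a prompt-tuning phase. Low-rank decomposition and contrastive constraints enhance the prompts so that they capture both task-general and task-specific characteristics, improving parameter efficiency and the accuracy of task modeling.
Link: https://arxiv.org/abs/2508.07878
Authors: Hanting Wang,Shengpeng Ji,Shulei Wang,Hai Huang,Xiao Jin,Qifei Zhang,Tao Jin
Affiliations: Zhejiang University (浙江大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: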
Abstract:Image restoration under adverse weather conditions has been extensively explored, leading to numerous high-performance methods. In particular, recent advances in All-in-One approaches have shown impressive results by training on multi-task image restoration datasets. However, most of these methods rely on dedicated network modules or parameters for each specific degradation type, resulting in a significant parameter overhead. Moreover, the relatedness across different restoration tasks is often overlooked. In light of these issues, we propose a parameter-efficient All-in-One image restoration framework that leverages task-aware enhanced prompts to tackle various adverse weather degradations. Specifically, we adopt a two-stage training paradigm consisting of a pretraining phase and a prompt-tuning phase to mitigate parameter conflicts across tasks. We first employ supervised learning to acquire general restoration knowledge, and then adapt the model to handle specific degradation via trainable soft prompts. Crucially, we enhance these task-specific prompts in a task-aware manner. We apply low-rank decomposition to these prompts to capture both task-general and task-specific characteristics, and impose contrastive constraints to better align them with the actual inter-task relatedness. These enhanced prompts not only improve the parameter efficiency of the restoration model but also enable more accurate task modeling, as evidenced by t-SNE analysis. Experimental results on different restoration tasks demonstrate that the proposed method achieves superior performance with only 2.75M parameters.
[CV-55] Selective Contrastive Learning for Weakly Supervised Affordance Grounding ICCV2025
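[Quick read]: This paper addresses weakly supervised affordance grounding (WSAG): accurately identifying the object parts that afford a specific action without pixel-level annotations. Existing methods rely mainly on a shared classifier and knowledge distillation, but they are easily distracted by class-specific patterns and struggle to capture the local cues that are actually tied to affordance. The key to the solution is a pair of objectives, selective prototypical and pixel contrastive, that adaptively learn affordance-relevant features at both the part and object levels. By cross-referencing egocentric (object-focused) and exocentric (third-person) views, the method discovers complementary part-level cues and consistently separates affordance-relevant regions from irrelevant background, shifting activation away from irrelevant areas toward meaningful affordance cues.
Link: https://arxiv.org/abs/2508.07877
Authors: WonJun Moon, Hyun Seok Seong, Jae-Pil Heo
Affiliations: Sungkyunkwan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICCV 2025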
Abstract:Facilitating an entity’s interaction with objects requires accurately identifying parts that afford specific actions. Weakly supervised affordance grounding (WSAG) seeks to imitate human learning from third-person demonstrations, where humans intuitively grasp functional parts without needing pixel-level annotations. To achieve this, grounding is typically learned using a shared classifier across images from different perspectives, along with distillation strategies incorporating part discovery process. However, since affordance-relevant parts are not always easily distinguishable, models primarily rely on classification, often focusing on common class-specific patterns that are unrelated to affordance. To address this limitation, we move beyond isolated part-level learning by introducing selective prototypical and pixel contrastive objectives that adaptively learn affordance-relevant cues at both the part and object levels, depending on the granularity of the available information. Initially, we find the action-associated objects in both egocentric (object-focused) and exocentric (third-person example) images by leveraging CLIP. Then, by cross-referencing the discovered objects of complementary views, we excavate the precise part-level affordance clues in each perspective. By consistently learning to distinguish affordance-relevant regions from affordance-irrelevant background context, our approach effectively shifts activation from irrelevant areas toward meaningful affordance cues. Experimental results demonstrate the effectiveness of our method. Codes are available at this http URL.
[CV-56] Towards Human-AI Collaboration System for the Detection of Invasive Ductal Carcinoma in Histopathology Images
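[Quick read]: This paper targets insufficient accuracy in the early diagnosis of invasive ductal carcinoma (IDC), with the aim of improving patient survival and treatment decisions. The key to the solution is a human-in-the-loop (HITL) deep learning system: an EfficientNetV2S model produces the initial automatic diagnosis, pathologists correct the samples the AI misclassified and update their labels, and the corrected labels feed back into training, closing a feedback loop from human expert to model that refines performance over successive iterations. Experiments show that this collaboration further improves diagnostic accuracy beyond the model's initial 93.65%, offering a scalable paradigm for AI-assisted medical diagnosis.
Link: https://arxiv.org/abs/2508.07875
Authors: Shuo Han, Ahmed Karam Eldaly, Solomon Sunday Oyelere
Affiliations: University of Exeter
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: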
Abstract:Invasive ductal carcinoma (IDC) is the most prevalent form of breast cancer, and early, accurate diagnosis is critical to improving patient survival rates by guiding treatment decisions. Combining medical expertise with artificial intelligence (AI) holds significant promise for enhancing the precision and efficiency of IDC detection. In this work, we propose a human-in-the-loop (HITL) deep learning system designed to detect IDC in histopathology images. The system begins with an initial diagnosis provided by a high-performance EfficientNetV2S model, offering feedback from AI to the human expert. Medical professionals then review the AI-generated results, correct any misclassified images, and integrate the revised labels into the training dataset, forming a feedback loop from the human back to the AI. This iterative process refines the model’s performance over time. The EfficientNetV2S model itself achieves state-of-the-art performance compared to existing methods in the literature, with an overall accuracy of 93.65%. Incorporating the human-in-the-loop system further improves the model’s accuracy using four experimental groups with misclassified images. These results demonstrate the potential of this collaborative approach to enhance AI performance in diagnostic systems. This work contributes to advancing automated, efficient, and highly accurate methods for IDC detection through human-AI collaboration, offering a promising direction for future AI-assisted medical diagnostics.
[CV-57] CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning
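[Quick read]: This paper addresses the inefficiency and unstable performance that redundant image tokens cause in multimodal in-context learning (ICL). Modern large vision-language models (LVLMs) convert images into large numbers of tokens, most of which carry sparse information and contribute little to reasoning. In multimodal ICL the redundancy grows, raising compute cost and weakening the advantage of rapid domain adaptation. Existing image token pruning methods are designed mainly for single-image tasks and suffer large accuracy drops in this setting. The authors therefore propose Contextually Adaptive Token Pruning (CATP), a training-free method whose core is a two-stage progressive pruning strategy that accounts for the complex cross-modal interactions in the input sequence. After removing 77.8% of image tokens, CATP improves average performance by 0.6% across four LVLMs and eight benchmarks and reduces inference latency by 10.78% on average, outperforming all baselines.
Link: https://arxiv.org/abs/2508.07871
Authors: Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, Ruixiang Tang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 12 figures, 6 tables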
Abstract:Modern large vision-language models (LVLMs) convert each input image into a large set of tokens, far outnumbering the text tokens. Although this improves visual perception, it introduces severe image token redundancy. Because image tokens carry sparse information, many add little to reasoning, yet greatly increase inference cost. The emerging image token pruning methods tackle this issue by identifying the most important tokens and discarding the rest. These methods can raise efficiency with only modest performance loss. However, most of them only consider single-image tasks and overlook multimodal in-context learning (ICL), where redundancy is greater and efficiency is more critical. Redundant tokens weaken the advantage of multimodal ICL for rapid domain adaptation and cause unstable performance. Applying existing pruning methods in this setting leads to large accuracy drops, exposing a clear gap and the need for new techniques. Thus, we propose Contextually Adaptive Token Pruning (CATP), a training-free pruning method targeted at multimodal ICL. CATP consists of two stages that perform progressive pruning to fully account for the complex cross-modal interactions in the input sequence. After removing 77.8% of the image tokens, CATP produces an average performance gain of 0.6% over the vanilla model on four LVLMs and eight benchmarks, exceeding all baselines remarkably. Meanwhile, it effectively improves efficiency by achieving an average reduction of 10.78% in inference latency. CATP enhances the practical value of multimodal ICL and lays the groundwork for future progress in interleaved image-text scenarios.
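The general mechanics of image token pruning can be sketched as follows. This is a generic top-k selection under assumed importance scores, not CATP's two-stage procedure, which the abstract does not detail. The function name `prune_tokens` and the scores are hypothetical.

```python
def prune_tokens(tokens, scores, prune_ratio):
    """Keep the highest-scoring tokens and preserve their original order.

    tokens: list of token items (any type)
    scores: assumed per-token importance, higher means more important
    prune_ratio: fraction of tokens to discard, e.g. 0.778
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    keep = max(1, round(len(tokens) * (1.0 - prune_ratio)))
    # Rank indices by score, then restore positional order for the survivors.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:keep])
    return [tokens[i] for i in kept_indices]


# Hypothetical example: 9 image tokens, 77.8% of them pruned.
tokens = list("abcdefghi")
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.05, 0.7, 0.0, 0.4]
print(prune_tokens(tokens, scores, prune_ratio=0.778))
```

Restoring positional order after ranking matters because the surviving tokens are still fed to the model as a sequence.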
[CV-58] Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model
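[Quick read]: This paper addresses the controllability bottleneck that limits practical deployment of current vision-language-motion models (VLMMs). The bottleneck shows up as inadequate response to diverse human commands, limited pose initialization, poor performance on long sequences, weak handling of unseen scenarios, and no fine-grained control over individual body parts. The key to the solution is Being-M0.5, presented as the first real-time, highly controllable VLMM. Its core contributions are HuMo100M, described as the largest and most comprehensive human motion dataset to date (over 5 million self-collected motion sequences, 100 million multi-task instruction instances, and detailed part-level annotations), and a novel part-aware residual quantization technique for motion tokenization that enables precise, fine-grained control over individual body parts. Experiments show that Being-M0.5 achieves leading performance on multiple motion generation benchmarks while running in real time.
Link: https://arxiv.org/abs/2508.07863
Authors: Bin Cao, Sipeng Zheng, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, Zongqing Lu
Affiliations: CASIA; UCAS; BAAI; PKU; RUC; SEU; BeingBeyond
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages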
Abstract:Human motion generation has emerged as a critical technology with transformative potential for real-world applications. However, existing vision-language-motion models (VLMMs) face significant limitations that hinder their practical deployment. We identify controllability as a main bottleneck, manifesting in five key aspects: inadequate response to diverse human commands, limited pose initialization capabilities, poor performance on long-term sequences, insufficient handling of unseen scenarios, and lack of fine-grained control over individual body parts. To overcome these limitations, we present Being-M0.5, the first real-time, controllable VLMM that achieves state-of-the-art performance across multiple motion generation tasks. Our approach is built upon HuMo100M, the largest and most comprehensive human motion dataset to date, comprising over 5 million self-collected motion sequences, 100 million multi-task instructional instances, and detailed part-level annotations that address a critical gap in existing datasets. We introduce a novel part-aware residual quantization technique for motion tokenization that enables precise, granular control over individual body parts during generation. Extensive experimental validation demonstrates Being-M0.5’s superior performance across diverse motion benchmarks, while comprehensive efficiency analysis confirms its real-time capabilities. Our contributions include design insights and detailed computational analysis to guide future development of practical motion generators. We believe that HuMo100M and Being-M0.5 represent significant advances that will accelerate the adoption of motion generation technologies in real-world applications. The project page is available at this https URL.
[CV-59] Tracking Any Point Methods for Markerless 3D Tissue Tracking in Endoscopic Stereo Images
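[Quick read]: This paper addresses accurate tissue tracking in minimally invasive surgery, where dynamic tissue motion and a limited field of view make tracking difficult, in order to support intraoperative guidance, improve safety, and enable context-aware robotic assistance. The key to the solution is a markerless 3D tissue tracking method built on 2D Tracking Any Point (TAP) networks. It combines two CoTracker models, one for temporal tracking and one for stereo matching, to estimate 3D tissue motion from stereo endoscopic images. Experiments used a clinical laparoscopic system and a robotic arm simulating tissue motion, with a 3D-printed phantom and a chicken tissue phantom. On the chicken tissue phantom, Euclidean distance errors were as low as 1.1 mm at a velocity of 10 mm/s, which indicates the method's potential for accurate tracking in challenging surgical scenes.
Link: https://arxiv.org/abs/2508.07851
Authors: Konrad Reuter, Suresh Guttikonda, Sarah Latus, Lennart Maack, Christian Betz, Tobias Maurer, Alexander Schlaefer
Affiliations: Hamburg University of Technology; University Medical Center Hamburg-Eppendorf; Martini-Klinik Prostate Cancer Center
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CURAC conference 2025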
Abstract:Minimally invasive surgery presents challenges such as dynamic tissue motion and a limited field of view. Accurate tissue tracking has the potential to support surgical guidance, improve safety by helping avoid damage to sensitive structures, and enable context-aware robotic assistance during complex procedures. In this work, we propose a novel method for markerless 3D tissue tracking by leveraging 2D Tracking Any Point (TAP) networks. Our method combines two CoTracker models, one for temporal tracking and one for stereo matching, to estimate 3D motion from stereo endoscopic images. We evaluate the system using a clinical laparoscopic setup and a robotic arm simulating tissue motion, with experiments conducted on a synthetic 3D-printed phantom and a chicken tissue phantom. Tracking on the chicken tissue phantom yielded more reliable results, with Euclidean distance errors as low as 1.1 mm at a velocity of 10 mm/s. These findings highlight the potential of TAP-based models for accurate, markerless 3D tracking in challenging surgical scenarios.
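Once a point has been matched between the left and right images, its 3D position can be recovered by triangulation, and tracking error can be reported as a Euclidean distance. The sketch below assumes a rectified pinhole stereo pair. The abstract does not say how the authors triangulate, and the camera parameters here (focal length, baseline, principal point) are hypothetical.

```python
import math


def triangulate(u_left, v_left, u_right, focal_px, baseline_mm, cx, cy):
    """Recover (x, y, z) in mm from a rectified stereo correspondence.

    Assumes depth z = focal_px * baseline_mm / disparity.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a valid match")
    z = focal_px * baseline_mm / disparity
    x = (u_left - cx) * z / focal_px
    y = (v_left - cy) * z / focal_px
    return (x, y, z)


def euclidean_error(predicted, reference):
    """Euclidean distance between two 3D points, in the same unit as the inputs."""
    return math.dist(predicted, reference)


# Hypothetical camera: 1000 px focal length, 4 mm baseline, centre (320, 240).
point = triangulate(420, 240, 400, focal_px=1000, baseline_mm=4, cx=320, cy=240)
print(point)
```

Because depth is inversely proportional to disparity, a small matching error in pixels grows into a larger depth error as the tissue moves away from the camera.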
[CV-60] Morphological Analysis of Semiconductor Microstructures using Skeleton Graphs ICCV2025
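[Quick read]: This paper addresses how to extract and quantify the topological features of microstructures that ion beam irradiation forms on germanium (Ge) surfaces, and how different irradiation parameters affect the evolution of surface morphology. The key to the solution is to convert the microstructures in electron microscopy images into skeleton graphs, embed them with a graph convolutional network (GCN), and then apply principal component analysis (PCA) and the Davies-Bouldin index to evaluate cluster separability in the embedding space. The analysis finds that irradiation angle has a more significant effect on surface morphology than irradiation fluence.
Link: https://arxiv.org/abs/2508.07850
Authors: Noriko Nitta, Rei Miyata, Naoto Oishi
Affiliations: Kochi University of Technology; National Institute of Technology, Kochi College
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CV4MS: Computer Vision for Materials Science, Workshop in conjunction with the IEEE/CVF ICCV 2025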
Abstract:In this paper, electron microscopy images of microstructures formed on Ge surfaces by ion beam irradiation were processed to extract topological features as skeleton graphs, which were then embedded using a graph convolutional network. The resulting embeddings were analyzed using principal component analysis, and cluster separability in the resulting PCA space was evaluated using the Davies-Bouldin index. The results indicate that variations in irradiation angle have a more significant impact on the morphological properties of Ge surfaces than variations in irradiation fluence.
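For readers unfamiliar with the Davies-Bouldin index, the sketch below computes it from its standard definition: the average, over clusters, of the worst-case ratio of within-cluster scatter to between-centroid distance. Lower values mean better-separated clusters. The 2D points are made-up stand-ins for PCA-projected embeddings, not data from the paper.

```python
import math


def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points)."""
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance from each point to its cluster centroid.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k


# Made-up example: the same two clusters placed far apart, then close together.
far = davies_bouldin([[(0, 0), (0, 2)], [(10, 0), (10, 2)]])
near = davies_bouldin([[(0, 0), (0, 2)], [(4, 0), (4, 2)]])
print(far, near)
```

In the paper's setting, a lower index when grouping embeddings by irradiation angle than when grouping by fluence would be consistent with angle being the stronger driver of morphology.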
[CV-61] Deep Space Weather Model: Long-Range Solar Flare Prediction from Multi-Wavelength Images ICCV2025
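[Quick read]: This paper addresses two weaknesses in solar flare prediction: insufficient modeling of long-range spatio-temporal dependencies and limited representation learning from images. Methods based on heuristic physical features struggle to extract effective representations from solar images automatically, while end-to-end learning methods struggle to capture long-range temporal dependencies. The key to the solution is the Deep Space Weather Model (Deep SWM), which combines multiple deep state space models to handle ten-channel solar images and long-range spatio-temporal dependencies together. It also introduces a sparse masked autoencoder with a two-phase masking strategy that compresses spatial information while preserving crucial regions such as sunspots, improving prediction accuracy and reliability.
Link: https://arxiv.org/abs/2508.07847
Authors: Shunya Nagashima, Komei Sugiura
Affiliations: Keio University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025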
Abstract:Accurate, reliable solar flare prediction is crucial for mitigating potential disruptions to critical infrastructure, while predicting solar flares remains a significant challenge. Existing methods based on heuristic physical features often lack representation learning from solar images. On the other hand, end-to-end learning approaches struggle to model long-range temporal dependencies in solar images. In this study, we propose Deep Space Weather Model (Deep SWM), which is based on multiple deep state space models for handling both ten-channel solar images and long-range spatio-temporal dependencies. Deep SWM also features a sparse masked autoencoder, a novel pretraining strategy that employs a two-phase masking approach to preserve crucial regions such as sunspots while compressing spatial information. Furthermore, we built FlareBench, a new public benchmark for solar flare prediction covering a full 11-year solar activity cycle, to validate our method. Our method outperformed baseline methods and even human expert performance on standard metrics in terms of performance and reliability. The project page can be found at this https URL.
[CV-62] CBDES MoE: Hierarchically Decoupled Mixture-of-Experts for Functional Modules in Autonomous Driving
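[Quick read]: This paper addresses the limited input adaptability, modeling capacity, and generalization of current multi-modal Bird's Eye View (BEV) perception systems. The core solution is a hierarchically decoupled Mixture-of-Experts architecture at functional-module granularity, the Computing Brain DEvelopment System Mixture-of-Experts (CBDES MoE). It integrates multiple structurally heterogeneous expert networks with a lightweight Self-Attention Router (SAR), enabling dynamic expert path selection and sparse, input-aware inference, which improves adaptability and performance in complex scenes.
Link: https://arxiv.org/abs/2508.07838
Authors: Qi Xiang, Kunsong Shi, Zhigui Lin, Lei He
Affiliations: Tsinghua University; University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: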
Abstract:Bird’s Eye View (BEV) perception systems based on multi-sensor feature fusion have become a fundamental cornerstone for end-to-end autonomous driving. However, existing multi-modal BEV methods commonly suffer from limited input adaptability, constrained modeling capacity, and suboptimal generalization. To address these challenges, we propose a hierarchically decoupled Mixture-of-Experts architecture at the functional module level, termed Computing Brain DEvelopment System Mixture-of-Experts (CBDES MoE). CBDES MoE integrates multiple structurally heterogeneous expert networks with a lightweight Self-Attention Router (SAR) gating mechanism, enabling dynamic expert path selection and sparse, input-aware efficient inference. To the best of our knowledge, this is the first modular Mixture-of-Experts framework constructed at the functional module granularity within the autonomous driving domain. Extensive evaluations on the real-world nuScenes dataset demonstrate that CBDES MoE consistently outperforms fixed single-expert baselines in 3D object detection. Compared to the strongest single-expert model, CBDES MoE achieves a 1.6-point increase in mAP and a 4.1-point improvement in NDS, demonstrating the effectiveness and practical advantages of the proposed approach.
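Sparse, input-aware routing can be illustrated with a small top-1 gating sketch. Here the router is a plain linear scorer followed by a softmax, standing in for the paper's Self-Attention Router, whose internals the abstract does not describe. The expert functions and weights are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(features, router_weights, experts):
    """Score every expert, then run only the single best one.

    router_weights: one weight row per expert (assumed linear router)
    experts: list of callables, each mapping features to an output
    """
    logits = [sum(w * f for w, f in zip(row, features)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    # Only the selected expert executes, which is what makes inference sparse.
    return best, experts[best](features)


# Hypothetical experts standing in for heterogeneous backbone networks.
experts = [lambda f: sum(f), lambda f: 10 * sum(f)]
print(route_top1([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0]], experts))
```

The compute saving comes from skipping the unselected experts entirely, so cost per input stays close to that of a single expert even as more experts are added.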
[CV-63] Effortless Vision-Language Model Specialization in Histopathology without Annotation
[Quick read]: This paper addresses the suboptimal performance of general-purpose Vision-Language Models (VLMs) on specific histopathology downstream tasks, and in particular the annotation cost and limited generalization that come with supervised fine-tuning on manually labeled samples. The key to the solution is continued pretraining on domain- and task-relevant image-text pairs, which adapts the model without manual annotation. Experiments show that this substantially improves zero-shot and few-shot classification and, with larger training sets, matches few-shot methods while eliminating manual labeling. The approach is also task-agnostic.
Link: https://arxiv.org/abs/2508.07835
Authors: Jingna Qiu, Nishanth Jain, Jonas Ammeling, Marc Aubreville, Katharina Breininger
Affiliations: Friedrich-Alexander-Universität Erlangen-Nürnberg; Ingolstadt University of Applied Sciences; Flensburg University of Applied Sciences; Julius-Maximilians-Universität Würzburg
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in Vision-Language Models (VLMs) in histopathology, such as CONCH and QuiltNet, have demonstrated impressive zero-shot classification capabilities across various tasks. However, their general-purpose design may lead to suboptimal performance in specific downstream applications. While supervised fine-tuning methods address this issue, they require manually labeled samples for adaptation. This paper investigates annotation-free adaptation of VLMs through continued pretraining on domain- and task-relevant image-caption pairs extracted from existing databases. Our experiments on two VLMs, CONCH and QuiltNet, across three downstream tasks reveal that these pairs substantially enhance both zero-shot and few-shot performance. Notably, with larger training sizes, continued pretraining matches the performance of few-shot methods while eliminating manual labeling. Its effectiveness, task-agnostic design, and annotation-free workflow make it a promising pathway for adapting VLMs to new histopathology tasks. Code is available at this https URL.
[CV-64] MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization
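[Quick read]: This paper addresses the limited transparency and trust that result from the complex, hard-to-interpret architectures of Vision Language Models (VLMs). The key to the solution is MIMIC (Multimodal Inversion for Model Interpretation and Conceptualization), a framework that visualizes a VLM's internal representations by synthesizing the visual concepts corresponding to its internal encodings. Its core elements are a joint VLM-based inversion with a feature alignment objective that accounts for the VLM's autoregressive processing, and a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism, which together yield high-quality, semantically consistent concept reconstructions. The authors describe it as the first model inversion approach for visual interpretation of VLM concepts.
Link: https://arxiv.org/abs/2508.07833
Authors: Animesh Jain, Alexandros Stergiou
Affiliations: University of Twente
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL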
Abstract:Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limit transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework to visualize the internal representations of VLMs by synthesizing visual concepts corresponding to internal encodings. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for VLM’s autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We quantitatively and qualitatively evaluate MIMIC by inverting visual concepts over a range of varying-length free-form VLM output texts. Reported results include both standard visual quality metrics as well as semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.
[CV-65] Architectural Co-Design for Zero-Shot Anomaly Detection: Decoupling Representation and Dynamically Fusing Features in CLIP ICASSP2026
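[Quick read]: This paper addresses the significant adaptation gap that pre-trained Vision-Language Models (VLMs) face in Zero-Shot Anomaly Detection (ZSAD). The gap stems from the models' lack of local inductive biases for dense prediction and their reliance on inflexible feature fusion paradigms. The key to the solution is an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. A parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter injects local inductive biases to improve fine-grained representation, and a Dynamic Fusion Gateway (DFG) uses visual context to adaptively modulate text prompts, enabling bidirectional cross-modal fusion. Experiments on multiple industrial and medical benchmarks show better accuracy and robustness.
Link: https://arxiv.org/abs/2508.07819
Authors: Ke Ma, Jun Long, Hongxiao Fei, Liujie Hua, Yueyi Luo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 1 reference, 3 figures, ICASSP 2026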
Abstract:Pre-trained Vision-Language Models (VLMs) face a significant adaptation gap when applied to Zero-Shot Anomaly Detection (ZSAD), stemming from their lack of local inductive biases for dense prediction and their reliance on inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method integrates a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks.
[CV-66] Segmenting and Understanding: Region-aware Semantic Attention for Fine-grained Image Quality Assessment with Large Language Models
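[Quick read]: This paper addresses the limits of existing No-reference Image Quality Assessment (NR-IQA) methods in perceiving multi-dimensional quality differences: methods that rely only on global representations miss semantically salient regions, and uniform weighting of region features reduces sensitivity to local quality variation. The key to the solution is RSFIQA, a fine-grained image quality assessment model. It first uses the Segment Anything Model (SAM) to dynamically partition the input image into non-overlapping semantic regions. A Multi-modal Large Language Model (MLLM) then extracts descriptive content for each region and perceives multi-dimensional distortions. Finally, a Region-Aware Semantic Attention (RSA) mechanism aggregates the fine-grained representations of local regions into a global attention map, modeling both local and overall quality differences. The method is backbone-agnostic and can be integrated into various deep neural network architectures.
Link: https://arxiv.org/abs/2508.07818
Authors: Chenyue Song, Chen Hui, Haiqi Zhu, Feng Jiang, Yachun Mi, Wei Zhang, Shaohui Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: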
Abstract:No-reference image quality assessment (NR-IQA) aims to simulate the process of perceiving image quality aligned with subjective human perception. However, existing NR-IQA methods either focus on global representations that lead to limited insights into the semantically salient regions or employ a uniform weighting for region features that weakens the sensitivity to local quality variations. In this paper, we propose a fine-grained image quality assessment model, named RSFIQA, which integrates region-level distortion information to perceive multi-dimensional quality discrepancies. To enhance regional quality awareness, we first utilize the Segment Anything Model (SAM) to dynamically partition the input image into non-overlapping semantic regions. For each region, we teach a powerful Multi-modal Large Language Model (MLLM) to extract descriptive content and perceive multi-dimensional distortions, enabling a comprehensive understanding of both local semantics and quality degradations. To effectively leverage this information, we introduce a Region-Aware Semantic Attention (RSA) mechanism, which generates a global attention map by aggregating fine-grained representations from local regions. In addition, RSFIQA is backbone-agnostic and can be seamlessly integrated into various deep neural network architectures. Extensive experiments demonstrate the robustness and effectiveness of the proposed method, which achieves competitive quality prediction performance across multiple benchmark datasets.
[CV-67] Semi-supervised Multiscale Matching for SAR-Optical Image
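[Quick read]: This paper addresses the high annotation cost in SAR-optical image matching, where reliance on large amounts of pixel-level manual annotation makes sufficient labeled samples hard to obtain. The key to the solution is a semi-supervised multiscale matching framework (S2M2-SAR). It generates pseudo-label similarity heatmaps for unlabeled image pairs by combining deep and shallow matching results, and trains the matching model on both labeled and pseudo-labeled heatmaps. It also introduces a cross-modal feature enhancement module trained with a cross-modality mutual independence loss, which needs no ground-truth labels and promotes the separation of modality-shared and modality-specific features, improving cross-modal feature representation.
Link: https://arxiv.org/abs/2508.07812
Authors: Jingze Gai, Changchun Li
Affiliations: Nanyang Technological University; Jilin University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 9 figures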
Abstract:Driven by the complementary nature of optical and synthetic aperture radar (SAR) images, SAR-optical image matching has garnered significant interest. Most existing SAR-optical image matching methods aim to capture effective matching features by employing the supervision of pixel-level matched correspondences within SAR-optical image pairs, which, however, suffers from time-consuming and complex manual annotation, making it difficult to collect sufficient labeled SAR-optical image pairs. To handle this, we design a semi-supervised SAR-optical image matching pipeline that leverages both scarce labeled and abundant unlabeled image pairs and propose a semi-supervised multiscale matching for SAR-optical image matching (S2M2-SAR). Specifically, we pseudo-label those unlabeled SAR-optical image pairs with pseudo ground-truth similarity heatmaps by combining both deep and shallow level matching results, and train the matching model by employing labeled and pseudo-labeled similarity heatmaps. In addition, we introduce a cross-modal feature enhancement module trained using a cross-modality mutual independence loss, which requires no ground-truth labels. This unsupervised objective promotes the separation of modality-shared and modality-specific features by encouraging statistical independence between them, enabling effective feature disentanglement across optical and SAR modalities. To evaluate the effectiveness of S2M2-SAR, we compare it with existing competitors on benchmark datasets. Experimental results demonstrate that S2M2-SAR not only surpasses existing semi-supervised methods but also achieves performance competitive with fully supervised SOTA methods, demonstrating its efficiency and practical potential.
[CV-68] DiTVR: Zero-Shot Diffusion Transformer for Video Restoration
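[Quick read]: This paper addresses core problems in video restoration: traditional regression-based methods produce unrealistic details and depend on large paired datasets, while existing generative diffusion models struggle to ensure temporal consistency. The key to the solution is the DiTVR framework. It introduces trajectory-aware attention that aligns tokens along optical flow trajectories, with emphasis on the layers most sensitive to temporal dynamics. A spatiotemporal neighbour cache selects relevant tokens dynamically based on motion correspondences, and a wavelet-guided, flow-consistent sampler injects data consistency only into the low-frequency bands, which preserves high-frequency priors while accelerating convergence and improving temporal stability.
Link: https://arxiv.org/abs/2508.07811
Authors: Sicheng Gao, Nancy Mehta, Zongwei Wu, Radu Timofte
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 6 figures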
Abstract:Video restoration aims to reconstruct high quality video sequences from low quality inputs, addressing tasks such as super resolution, denoising, and deblurring. Traditional regression based methods often produce unrealistic details and require extensive paired datasets, while recent generative diffusion models face challenges in ensuring temporal consistency. We introduce DiTVR, a zero shot video restoration framework that couples a diffusion transformer with trajectory aware attention and a wavelet guided, flow consistent sampler. Unlike prior 3D convolutional or frame wise diffusion approaches, our attention mechanism aligns tokens along optical flow trajectories, with particular emphasis on vital layers that exhibit the highest sensitivity to temporal dynamics. A spatiotemporal neighbour cache dynamically selects relevant tokens based on motion correspondences across frames. The flow guided sampler injects data consistency only into low-frequency bands, preserving high frequency priors while accelerating convergence. DiTVR establishes a new zero shot state of the art on video restoration benchmarks, demonstrating superior temporal consistency and detail preservation while remaining robust to flow noise and occlusions.
[CV-69] Pose-RFT: Enhancing MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning
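[Quick read]: This paper addresses the difficulty Multimodal Large Language Models (MLLMs) have in modeling inherent ambiguity and achieving task-specific alignment when generating 3D human poses. Existing methods usually rely on supervised objectives such as SMPL parameter regression or token-level prediction, which struggle to capture spatial alignment in image-to-pose generation and semantic consistency in text-to-pose generation. The key to the solution is the Pose-RFT framework, which formulates 3D pose generation as a hybrid action reinforcement learning problem that jointly optimizes discrete language prediction and continuous pose generation. It introduces the HyGRPO algorithm, which uses group-wise reward normalization to guide the joint optimization of discrete and continuous actions, and designs task-specific reward functions that strengthen spatial alignment (image to pose) and semantic consistency (text to pose).
Link: https://arxiv.org/abs/2508.07804
Authors: Bao Li, Xiaomei Zhang, Miao Xu, Zhaoxin Fan, Xiangyu Zhu, Zhen Lei
Affiliations: CASIA; UCAS; CAIR, HKISI, CAS; Beihang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: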
Abstract:Generating 3D human poses from multimodal inputs such as images or text requires models to capture both rich spatial and semantic correspondences. While pose-specific multimodal large language models (MLLMs) have shown promise in this task, they are typically trained with supervised objectives such as SMPL parameter regression or token-level prediction, which struggle to model the inherent ambiguity and achieve task-specific alignment required for accurate 3D pose generation. To address these limitations, we propose Pose-RFT, a reinforcement fine-tuning framework tailored for 3D human pose generation in MLLMs. We formulate the task as a hybrid action reinforcement learning problem that jointly optimizes discrete language prediction and continuous pose generation. To this end, we introduce HyGRPO, a hybrid reinforcement learning algorithm that performs group-wise reward normalization over sampled responses to guide joint optimization of discrete and continuous actions. Pose-RFT further incorporates task-specific reward functions to guide optimization towards spatial alignment in image-to-pose generation and semantic consistency in text-to-pose generation. Extensive experiments on multiple pose generation benchmarks demonstrate that Pose-RFT significantly improves performance over existing pose-specific MLLMs, validating the effectiveness of hybrid action reinforcement fine-tuning for 3D pose generation.
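Group-wise reward normalization, as used in GRPO-style methods, can be sketched in a few lines: rewards for a group of responses sampled from the same prompt are centred and scaled so that each response is judged relative to its peers. The use of population standard deviation and the `eps` value are assumptions; the abstract does not give HyGRPO's exact formula.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-8):
    """Return (r - mean) / (std + eps) for each reward in one sampled group.

    rewards: scores for several responses to the same prompt
    eps: small constant (assumed) that avoids division by zero
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three sampled pose responses to one prompt.
print(group_normalized_advantages([1.0, 2.0, 3.0]))
```

When every response in a group receives the same reward, all advantages are zero, so that group contributes no learning signal.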
[CV-70] MambaTrans: Multimodal Fusion Image Translation via Large Language Model Priors for Downstream Visual Tasks
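[Quick read]: This paper addresses the significant pixel distribution gap between images produced by multimodal image fusion and the visible-light images on which downstream pre-trained models are trained. The gap degrades tasks such as object detection and semantic segmentation, sometimes to below the performance obtained with visible images alone. The key to the solution is MambaTrans, a novel modality translator for multimodal fused images. Its core component, the Multi-Model State Space Block, combines mask-image-text cross-attention with a 3D-Selective Scan Module. By using object detection priors, it minimizes detection loss during training and captures long-range dependencies among text, masks, and images, improving downstream performance on multimodal images without adjusting the parameters of the pre-trained downstream models.
Link: https://arxiv.org/abs/2508.07803
Authors: Yushen Xu, Xiaosong Li, Zhenyu Kuang, Xiaoqi Cheng, Haishu Tan, Huafeng Li
Affiliations: Foshan University; Guangdong-HongKong-Macao Joint Laboratory for Intelligent Micro-Nano Optoelectronic Technology; Kunming University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: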
Abstract:The goal of multimodal image fusion is to integrate complementary information from infrared and visible images, generating multimodal fused images for downstream tasks. Existing downstream pre-training models are typically trained on visible images. However, the significant pixel distribution differences between visible and multimodal fusion images can degrade downstream task performance, sometimes even below that of using only visible images. This paper explores adapting multimodal fused images with significant modality differences to object detection and semantic segmentation models trained on visible images. To address this, we propose MambaTrans, a novel multimodal fusion image modality translator. MambaTrans uses descriptions from a multimodal large language model and masks from semantic segmentation models as input. Its core component, the Multi-Model State Space Block, combines mask-image-text cross-attention and a 3D-Selective Scan Module, enhancing pure visual capabilities. By leveraging object detection prior knowledge, MambaTrans minimizes detection loss during training and captures long-term dependencies among text, masks, and images. This enables favorable results in pre-trained models without adjusting their parameters. Experiments on public datasets show that MambaTrans effectively improves multimodal image performance in downstream tasks.
[CV-71] Power Battery Detection
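[Quick read]: This paper studies a new task in power battery quality inspection, power battery detection (PBD), which aims to precisely localize the dense endpoints of cathode and anode plates in industrial X-ray images for efficient and accurate quality inspection. Manual inspection is inefficient and error-prone, and existing vision algorithms struggle with densely packed plates, low contrast, large scale variation, and imaging artifacts. To address this, the authors present PBD5K, the first large-scale benchmark for the task, containing 5,000 X-ray images from nine battery types with fine-grained annotations and eight types of real-world interference. They also design MDCNeXt, which formulates PBD as a point-level segmentation problem and improves localization accuracy by integrating multi-dimensional structural cues such as point, line, and count information. Its key innovation is two state space modules: a prompt-filtered module that learns contrastive relationships guided by task-specific prompts to improve discrimination, and a density-aware reordering module that refines segmentation in high-density regions. A distance-adaptive mask generation strategy additionally keeps the supervision signal robust under varying spatial distributions.
Link: https://arxiv.org/abs/2508.07797
Authors: Xiaoqi Zhao,Peiqian Cao,Lihe Zhang,Zonglei Feng,Hanqi Liu,Jiaming Zuo,Youwei Pang,Weisi Lin,Georges El Fakhri,Huchuan Lu,Xiaofeng Liu
Affiliations: Yale University; Dalian University of Technology; Volkswagen Automotive Co., Ltd.; X3000 Inspection Co., Ltd.; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under submission to IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI)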
Abstract:Power batteries are essential components in electric vehicles, where internal structural defects can pose serious safety risks. We conduct a comprehensive study on a new task, power battery detection (PBD), which aims to localize the dense endpoints of cathode and anode plates from industrial X-ray images for quality inspection. Manual inspection is inefficient and error-prone, while traditional vision algorithms struggle with densely packed plates, low contrast, scale variation, and imaging artifacts. To address this issue and drive more attention into this meaningful task, we present PBD5K, the first large-scale benchmark for this task, consisting of 5,000 X-ray images from nine battery types with fine-grained annotations and eight types of real-world visual interference. To support scalable and consistent labeling, we develop an intelligent annotation pipeline that combines image filtering, model-assisted pre-labeling, cross-verification, and layered quality evaluation. We formulate PBD as a point-level segmentation problem and propose MDCNeXt, a model designed to extract and integrate multi-dimensional structure clues including point, line, and count information from the plate itself. To improve discrimination between plates and suppress visual interference, MDCNeXt incorporates two state space modules. The first is a prompt-filtered module that learns contrastive relationships guided by task-specific prompts. The second is a density-aware reordering module that refines segmentation in regions with high plate density. In addition, we propose a distance-adaptive mask generation strategy to provide robust supervision under varying spatial distributions of anode and cathode positions. The source code and datasets will be publicly available at PBD5K (this https URL).
[CV-72] Boosting Active Defense Persistence: A Two-Stage Defense Framework Combining Interruption and Poisoning Against Deepfake
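[Quick read]: This paper addresses the lack of persistence in current active defense strategies against deepfake technology. Existing methods often fail once attackers collect protected samples and retrain their models, so the defense is short-lived and hard to use in practice. The key to the solution is a Two-Stage Defense Framework (TSDF) whose core innovation is a dual-function adversarial perturbation: it directly distorts the forged content for immediate interruption, and it also acts as a poisoning vehicle that disrupts the data preparation step of the attacker's retraining pipeline. This blocks the model from adapting to the defensive perturbations and keeps the defense effective over the long term.
Link: https://arxiv.org/abs/2508.07795
Authors: Hongrui Zheng,Yuezun Li,Liejun Wang,Yunfeng Diao,Zhiqing Guo
Affiliations: Xinjiang University; Ocean University of China; Xinjiang Multimodal Intelligent Processing and Information Security Engineering Technology Research Center; Silk Road Multilingual Cognitive Computing International Cooperation Joint Laboratory; Hefei University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: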
Abstract:Active defense strategies have been developed to counter the threat of deepfake technology. However, a primary challenge is their lack of persistence, as their effectiveness is often short-lived. Attackers can bypass these defenses by simply collecting protected samples and retraining their models. This means that static defenses inevitably fail when attackers retrain their models, which severely limits practical use. We argue that an effective defense not only distorts forged content but also blocks the model’s ability to adapt, which occurs when attackers retrain their models on protected images. To achieve this, we propose an innovative Two-Stage Defense Framework (TSDF). Benefiting from the intensity separation mechanism designed in this paper, the framework uses dual-function adversarial perturbations to perform two roles. First, it can directly distort the forged results. Second, it acts as a poisoning vehicle that disrupts the data preparation process essential for an attacker’s retraining pipeline. By poisoning the data source, TSDF aims to prevent the attacker’s model from adapting to the defensive perturbations, thus ensuring the defense remains effective long-term. Comprehensive experiments show that the performance of traditional interruption methods degrades sharply when it is subjected to adversarial retraining. However, our framework shows a strong dual defense capability, which can improve the persistence of active defense. Our code will be available at this https URL.
[CV-73] Anatomy-Aware Low-Dose CT Denoising via Pretrained Vision Models and Semantic-Guided Contrastive Learning
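[Quick read]: This paper addresses the loss of diagnostic efficacy caused by noise and artifacts in low-dose computed tomography (LDCT) images, and in particular the fact that existing deep learning-based denoising methods generally ignore the anatomical semantics of human tissues, which leads to suboptimal denoising or even over-smoothing. The key to the solution is ALDEN, an anatomy-aware LDCT denoising method that integrates semantic features from pretrained vision models (PVMs) with adversarial and contrastive learning. First, an anatomy-aware discriminator dynamically fuses hierarchical semantic features from reference normal-dose CT (NDCT) via cross-attention, enabling tissue-specific realism evaluation. Second, a semantic-guided contrastive learning module constrains the PVM-derived features of LDCT, denoised CT, and NDCT through positive and negative pairs, preserving tissue-specific patterns while suppressing artifacts. Together these improve image quality and the fidelity of anatomical structures.
Link: https://arxiv.org/abs/2508.07788
Authors: Runze Wang,Zeli Chen,Zhiyun Song,Wei Fang,Jiajin Zhang,Danyang Tu,Yuxing Tang,Minfeng Xu,Xianghua Ye,Le Lu,Dakai Jin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: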
Abstract:To reduce radiation exposure and improve the diagnostic efficacy of low-dose computed tomography (LDCT), numerous deep learning-based denoising methods have been developed to mitigate noise and artifacts. However, most of these approaches ignore the anatomical semantics of human tissues, which may potentially result in suboptimal denoising outcomes. To address this problem, we propose ALDEN, an anatomy-aware LDCT denoising method that integrates semantic features of pretrained vision models (PVMs) with adversarial and contrastive learning. Specifically, we introduce an anatomy-aware discriminator that dynamically fuses hierarchical semantic features from reference normal-dose CT (NDCT) via cross-attention mechanisms, enabling tissue-specific realism evaluation in the discriminator. In addition, we propose a semantic-guided contrastive learning module that enforces anatomical consistency by contrasting PVM-derived features from LDCT, denoised CT and NDCT, preserving tissue-specific patterns through positive pairs and suppressing artifacts via dual negative pairs. Extensive experiments conducted on two LDCT denoising datasets reveal that ALDEN achieves the state-of-the-art performance, offering superior anatomy preservation and substantially reducing over-smoothing issue of previous work. Further validation on a downstream multi-organ segmentation task (encompassing 117 anatomical structures) affirms the model’s ability to maintain anatomical awareness.
[CV-74] GaitSnippet: Gait Recognition Beyond Unordered Sets and Ordered Sequences
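[Quick read]: This paper addresses the insufficient modeling of temporal context in current gait recognition methods: set-based methods overlook short-range temporal relations between frames, while sequence-based methods struggle to capture long-range temporal dependencies. The key to the solution is to view human gait as a composition of individualized actions and to introduce the concept of a snippet, a group of frames randomly selected from a continuous segment of the sequence, which enables modeling of multi-scale temporal context. With carefully designed Snippet Sampling and Snippet Modeling components, the method improves gait feature learning and achieves strong results on several mainstream datasets, demonstrating the potential of snippets for gait recognition.
Link: https://arxiv.org/abs/2508.07782
Authors: Saihui Hou,Chenye Wang,Wenpeng Lang,Zhengxiang Lan,Yongzhen Huang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 5 figures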
Abstract:Recent advancements in gait recognition have significantly enhanced performance by treating silhouettes as either an unordered set or an ordered sequence. However, both set-based and sequence-based approaches exhibit notable limitations. Specifically, set-based methods tend to overlook short-range temporal context for individual frames, while sequence-based methods struggle to capture long-range temporal dependencies effectively. To address these challenges, we draw inspiration from human identification and propose a new perspective that conceptualizes human gait as a composition of individualized actions. Each action is represented by a series of frames, randomly selected from a continuous segment of the sequence, which we term a snippet. Fundamentally, the collection of snippets for a given sequence enables the incorporation of multi-scale temporal context, facilitating more comprehensive gait feature learning. Moreover, we introduce a non-trivial solution for snippet-based gait recognition, focusing on Snippet Sampling and Snippet Modeling as key components. Extensive experiments on four widely-used gait datasets validate the effectiveness of our proposed approach and, more importantly, highlight the potential of gait snippets. For instance, our method achieves the rank-1 accuracy of 77.5% on Gait3D and 81.7% on GREW using a 2D convolution-based backbone.
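The snippet idea described in this abstract can be illustrated with a minimal Python sketch. It assumes a snippet is formed by picking a random contiguous segment of the sequence and then drawing a few frame indices from that segment without replacement. The function name `sample_snippets` and all of its parameters are hypothetical and are not taken from the paper, whose actual Snippet Sampling scheme may differ.

```python
import random


def sample_snippets(num_frames, num_snippets, segment_len, frames_per_snippet, seed=None):
    """Return a list of snippets; each snippet is a sorted list of frame
    indices drawn at random from one contiguous segment of the sequence.
    (Illustrative assumption, not the paper's exact sampler.)"""
    if segment_len > num_frames or frames_per_snippet > segment_len:
        raise ValueError("segment/snippet sizes exceed the sequence length")
    rng = random.Random(seed)
    snippets = []
    for _ in range(num_snippets):
        # Choose where the contiguous segment starts (both bounds inclusive).
        start = rng.randint(0, num_frames - segment_len)
        segment = range(start, start + segment_len)
        # Draw distinct frames from the segment, then restore temporal order.
        snippets.append(sorted(rng.sample(segment, frames_per_snippet)))
    return snippets


snippets = sample_snippets(num_frames=60, num_snippets=4,
                           segment_len=12, frames_per_snippet=5, seed=0)
```

Because every snippet comes from a short segment while the segments are spread over the whole sequence, a set of snippets carries both short-range and longer-range temporal context, which is the intuition the abstract describes.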
[CV-75] Forecasting Continuous Non-Conservative Dynamical Systems in SO(3) ICCV2025
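[Quick read]: This paper addresses three challenges in modeling the rotational trajectories of moving objects in computer vision: (1) unknown physical quantities such as the moment of inertia complicate the dynamics; (2) external forces and torques give rise to non-conservative kinematics; and (3) estimating evolving state trajectories from sparse, noisy observations requires robustness. The key to the solution is to use Neural Controlled Differential Equations on the SO(3) manifold, guided by SO(3) Savitzky-Golay paths, to model trajectories of noisy pose estimates in a physically and geometrically meaningful way. The method does not rely on energy or momentum conservation assumptions, is robust to input noise, applies to complex non-inertial systems, and can be integrated as a module into existing pipelines. It shows good extrapolation and generalization in simulation and in various real-world settings.
Link: https://arxiv.org/abs/2508.07775
Authors: Lennart Bastian,Mohammad Rashed,Nassir Navab,Tolga Birdal
Affiliations: Technical University of Munich; Munich Center of Machine Learning; Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 Oral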
Abstract:Modeling the rotation of moving objects is a fundamental task in computer vision, yet SO(3) extrapolation still presents numerous challenges: (1) unknown quantities such as the moment of inertia complicate dynamics, (2) the presence of external forces and torques can lead to non-conservative kinematics, and (3) estimating evolving state trajectories under sparse, noisy observations requires robustness. We propose modeling trajectories of noisy pose estimates on the manifold of 3D rotations in a physically and geometrically meaningful way by leveraging Neural Controlled Differential Equations guided with SO(3) Savitzky-Golay paths. Existing extrapolation methods often rely on energy conservation or constant velocity assumptions, limiting their applicability in real-world scenarios involving non-conservative forces. In contrast, our approach is agnostic to energy and momentum conservation while being robust to input noise, making it applicable to complex, non-inertial systems. Our approach is easily integrated as a module in existing pipelines and generalizes well to trajectories with unknown physical parameters. By learning to approximate object dynamics from noisy states during training, our model attains robust extrapolation capabilities in simulation and various real-world settings. Code is available at this https URL
[CV-76] Prototype-Guided Curriculum Learning for Zero-Shot Learning
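[Quick read]: This paper addresses two problems that manually defined semantic prototypes cause for embedding-based methods in Zero-Shot Learning (ZSL). The first is instance-level mismatch: variations in perspective, occlusion, and annotation bias create discrepancies between individual samples and class-level semantic prototypes. The second is class-level imprecision: manually defined semantic prototypes may not accurately reflect the true semantics of a class. The authors propose a prototype-guided curriculum learning framework (CLZSL) built on two modules. The Prototype-Guided Curriculum Learning (PCL) module first trains on samples whose visual mappings have high cosine similarity to the semantic prototypes and then progressively introduces less-aligned samples, reducing the interference of instance-level mismatch with the visual-semantic mapping. The Prototype Update (PUP) module dynamically adjusts the class-level semantic prototypes using the visual mappings learned from samples, reducing class-level imprecision and further improving the visual-semantic mapping.
Link: https://arxiv.org/abs/2508.07771
Authors: Lei Wang,Shiming Chen,Guo-Sen Xie,Ziming Hong,Chaojian Yu,Qinmu Peng,Xinge You
Affiliations: Huazhong University of Science and Technology; Mohamed bin Zayed University of AI; Nanjing University of Science and Technology; The University of Sydney
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 7 figures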
Abstract:In Zero-Shot Learning (ZSL), embedding-based methods enable knowledge transfer from seen to unseen classes by learning a visual-semantic mapping from seen-class images to class-level semantic prototypes (e.g., attributes). However, these semantic prototypes are manually defined and may introduce noisy supervision for two main reasons: (i) instance-level mismatch: variations in perspective, occlusion, and annotation bias will cause discrepancies between individual sample and the class-level semantic prototypes; and (ii) class-level imprecision: the manually defined semantic prototypes may not accurately reflect the true semantics of the class. Consequently, the visual-semantic mapping will be misled, reducing the effectiveness of knowledge transfer to unseen classes. In this work, we propose a prototype-guided curriculum learning framework (dubbed as CLZSL), which mitigates instance-level mismatches through a Prototype-Guided Curriculum Learning (PCL) module and addresses class-level imprecision via a Prototype Update (PUP) module. Specifically, the PCL module prioritizes samples with high cosine similarity between their visual mappings and the class-level semantic prototypes, and progressively advances to less-aligned samples, thereby reducing the interference of instance-level mismatches to achieve accurate visual-semantic mapping. Besides, the PUP module dynamically updates the class-level semantic prototypes by leveraging the visual mappings learned from instances, thereby reducing class-level imprecision and further improving the visual-semantic mapping. Experiments were conducted on standard benchmark datasets (AWA2, SUN, and CUB) to verify the effectiveness of our method.
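The two mechanisms named in this abstract, ranking samples by cosine similarity to their class prototype and updating prototypes from mapped features, can be sketched in a few lines of plain Python. This is a sketch under stated assumptions rather than the authors' implementation: the helper names (`curriculum_subset`, `update_prototypes`), the `keep_ratio` schedule parameter, and the moving-average update rule with `momentum` are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def curriculum_subset(mapped, labels, prototypes, keep_ratio):
    """Indices of the best-aligned fraction of samples (easy samples first).
    Raising keep_ratio over training admits less-aligned samples."""
    sims = [cosine(x, prototypes[y]) for x, y in zip(mapped, labels)]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return order[:max(1, math.ceil(keep_ratio * len(order)))]


def update_prototypes(mapped, labels, prototypes, momentum=0.9):
    """Move each prototype toward the mean mapped feature of its class.
    The moving-average rule is an assumption, not the paper's PUP module."""
    updated = {}
    for cls, proto in prototypes.items():
        members = [x for x, y in zip(mapped, labels) if y == cls]
        if not members:
            updated[cls] = list(proto)
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        updated[cls] = [momentum * p + (1 - momentum) * m for p, m in zip(proto, mean)]
    return updated


# Toy data: two classes with 2-D "semantic" prototypes.
prototypes = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
mapped = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9], [0.6, 0.6]]
labels = ["a", "a", "b", "b"]
easy = curriculum_subset(mapped, labels, prototypes, keep_ratio=0.5)
new_prototypes = update_prototypes(mapped, labels, prototypes, momentum=0.5)
```

In the toy data, sample 1 is labeled "a" but lies close to prototype "b", so it is the last sample a growing `keep_ratio` would admit, which mirrors the instance-level mismatch the abstract describes.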
[CV-77] Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation
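[Quick read]: This paper addresses a core challenge in generating 4D content (spatiotemporally consistent dynamic 3D scenes): jointly modeling high-fidelity spatial representations and physically plausible temporal dynamics, especially maintaining view consistency in large-scale environments with multiple interacting elements. The key to the solution is the Dream4D framework, which combines controllable video generation with neural 4D reconstruction: it first predicts optimal camera trajectories from a single image using few-shot learning, then generates geometrically consistent multi-view sequences through a specialized pose-conditioned diffusion process, and finally builds a persistent 4D representation. According to the authors, it is the first method to combine the rich temporal priors of video diffusion models with the geometric awareness of reconstruction models, and it reports higher 4D generation quality (e.g., mPSNR, mSSIM) than existing methods.
Link: https://arxiv.org/abs/2508.07769
Authors: Xiaoyan Liu,Kangrui Li,Jiaxin Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL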
Abstract:The synthesis of spatiotemporally coherent 4D content presents fundamental challenges in computer vision, requiring simultaneous modeling of high-fidelity spatial representations and physically plausible temporal dynamics. Current approaches often struggle to maintain view consistency while handling complex scene dynamics, particularly in large-scale environments with multiple interacting elements. This work introduces Dream4D, a novel framework that bridges this gap through a synergy of controllable video generation and neural 4D reconstruction. Our approach seamlessly combines a two-stage architecture: it first predicts optimal camera trajectories from a single image using few-shot learning, then generates geometrically consistent multi-view sequences via a specialized pose-conditioned diffusion process, which are finally converted into a persistent 4D representation. This framework is the first to leverage both rich temporal priors from video diffusion models and geometric awareness of the reconstruction models, which significantly facilitates 4D generation and shows higher quality (e.g., mPSNR, mSSIM) over existing methods.
[CV-78] UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models ACM-MM2025
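[Quick read]: This paper addresses key challenges for generative AI in scalable vector graphics (SVG) understanding and generation (UG), namely how to convert multimodal conditions (such as textual prompts and visual references) into SVG code efficiently under high-precision parameter control. The key to the solution is UniSVG, a comprehensive SVG-centric dataset of 525k annotated items that supports unified SVG generation from text and images as well as understanding of SVG attributes such as color, category, and usage. Training open-source multimodal large language models (MLLMs) on this dataset improves their performance across SVG UG tasks, surpassing even closed-source SOTA models such as GPT-4V.
Link: https://arxiv.org/abs/2508.07766
Authors: Jinke Li,Jiarui Yu,Chenxing Wei,Hande Dong,Qiang Lin,Liangjing Yang,Zhicai Wang,Yanbin Hao
Affiliations: Zhejiang University; Tencent; Shenzhen University; Hefei University of Technology; Independent Researcher
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ACM MM 2025 Dataset Track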
Abstract:Unlike bitmap images, scalable vector graphics (SVG) maintain quality when scaled, frequently employed in computer vision and artistic design in the representation of SVG code. In this era of proliferating AI-powered systems, enabling AI to understand and generate SVG has become increasingly urgent. However, AI-driven SVG understanding and generation (UG) remain significant challenges. SVG code, equivalent to a set of curves and lines controlled by floating-point parameters, demands high precision in SVG UG. Besides, SVG generation operates under diverse conditional constraints, including textual prompts and visual references, which requires powerful multi-modal processing for condition-to-SVG transformation. Recently, the rapid growth of Multi-modal Large Language Models (MLLMs) have demonstrated capabilities to process multi-modal inputs and generate complex vector controlling parameters, suggesting the potential to address SVG UG tasks within a unified model. To unlock MLLM’s capabilities in the SVG area, we propose an SVG-centric dataset called UniSVG, comprising 525k data items, tailored for MLLM training and evaluation. To our best knowledge, it is the first comprehensive dataset designed for unified SVG generation (from textual prompts and images) and SVG understanding (color, category, usage, etc.). As expected, learning on the proposed dataset boosts open-source MLLMs’ performance on various SVG UG tasks, surpassing SOTA close-source MLLMs like GPT-4V. We release dataset, benchmark, weights, codes and experiment details on this https URL.
[CV-79] Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild
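[Quick read]: This paper addresses the limited performance of large vision models such as the Segment Anything Model (SAM) on downstream tasks in the wild, and in particular the high data and computational cost that traditional reference segmentation methods incur through their reliance on meta-learning. The key to the solution is to represent the inherent correspondence between a reference image and a target image as a pseudo video, so that SAM2, which has interactive video object segmentation capabilities, can be adapted in a lightweight manner. The method, Correspondence As Video for SAM (CAV-SAM), has two core modules: the Diffusion-Based Semantic Transition (DBST) module, which uses a diffusion model to generate a semantic transformation sequence, and the Test-Time Geometric Alignment (TTGA) module, which aligns the geometric changes within that sequence through test-time fine-tuning. It improves segmentation performance by more than 5% over the current best methods on several benchmark datasets.
Link: https://arxiv.org/abs/2508.07759
Authors: Haoran Wang,Zekun Li,Jian Zhang,Lei Qi,Yinghuan Shi
Affiliations: Nanjing University; Southeast University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: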
Abstract:Large vision models like the Segment Anything Model (SAM) exhibit significant limitations when applied to downstream tasks in the wild. Consequently, reference segmentation, which leverages reference images and their corresponding masks to impart novel knowledge to the model, emerges as a promising new direction for adapting vision models. However, existing reference segmentation approaches predominantly rely on meta-learning, which still necessitates an extensive meta-training process and brings massive data and computational cost. In this study, we propose a novel approach by representing the inherent correspondence between reference-target image pairs as a pseudo video. This perspective allows the latest version of SAM, known as SAM2, which is equipped with interactive video object segmentation (iVOS) capabilities, to be adapted to downstream tasks in a lightweight manner. We term this approach Correspondence As Video for SAM (CAV-SAM). CAV-SAM comprises two key modules: the Diffusion-Based Semantic Transition (DBST) module employs a diffusion model to construct a semantic transformation sequence, while the Test-Time Geometric Alignment (TTGA) module aligns the geometric changes within this sequence through test-time fine-tuning. We evaluated CAV-SAM on widely-used datasets, achieving segmentation performance improvements exceeding 5% over SOTA methods. Implementation is provided in the supplementary materials.
[CV-80] Comparison Reveals Commonality: Customized Image Generation through Contrastive Inversion CVPR2025
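[Quick read]: This paper addresses the problem of extracting the common concept from a small set of images. Existing methods depend on additional guidance such as text prompts or spatial masks, which can leave auxiliary features incompletely separated and degrade generation quality. The key to the solution is Contrastive Inversion, which trains the target token together with image-wise auxiliary text tokens via contrastive learning to obtain a disentangled representation of the target semantics, and then applies disentangled cross-attention fine-tuning to improve concept fidelity without overfitting. The method identifies and generates the common concept without manual guidance and outperforms existing techniques.
Link: https://arxiv.org/abs/2508.07755
Authors: Minseo Kim,Minchan Kwon,Dongyeun Lee,Yunho Jeon,Junmo Kim
Affiliations: Korea Institute of Science and Technology; Hanbat National University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025 workshop (AI4CC)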
Abstract:The recent demand for customized image generation raises a need for techniques that effectively extract the common concept from small sets of images. Existing methods typically rely on additional guidance, such as text prompts or spatial masks, to capture the common target concept. Unfortunately, relying on manually provided guidance can lead to incomplete separation of auxiliary features, which degrades generation quality. In this paper, we propose Contrastive Inversion, a novel approach that identifies the common concept by comparing the input images without relying on additional information. We train the target token along with the image-wise auxiliary text tokens via contrastive learning, which extracts the well-disentangled true semantics of the target. Then we apply disentangled cross-attention fine-tuning to improve concept fidelity without overfitting. Experimental results and analysis demonstrate that our method achieves a balanced, high-level performance in both concept representation and editing, outperforming existing techniques.
[CV-81] Grouped Speculative Decoding for Autoregressive Image Generation ICCV2025
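[Quick read]: This paper addresses the long inference time of autoregressive (AR) image generation models, which stems from their sequential generation and limits deployment at scale. Existing acceleration methods based on Speculative Decoding (SD) either provide only modest speedups or require additional training. The core issue is that image tokens are inherently redundant and diverse, meaning multiple different tokens can convey valid semantics, whereas traditional SD decides based on a single most-likely token and therefore produces many false-negative rejections. The key to the solution is Grouped Speculative Decoding (GSD), a training-free, dynamic strategy that evaluates a cluster of visually valid candidate tokens instead of relying on a single target token, making better use of token diversity. Experiments show an average 3.7x speedup without sacrificing image quality.
Link: https://arxiv.org/abs/2508.07747
Authors: Junhyuk So,Juncheol Shin,Hyunho Kook,Eunhyeok Park
Affiliations: POSTECH
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the ICCV 2025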
Abstract:Recently, autoregressive (AR) image models have demonstrated remarkable generative capabilities, positioning themselves as a compelling alternative to diffusion models. However, their sequential nature leads to long inference times, limiting their practical scalability. In this work, we introduce Grouped Speculative Decoding (GSD), a novel, training-free acceleration method for AR image models. While recent studies have explored Speculative Decoding (SD) as a means to speed up AR image generation, existing approaches either provide only modest acceleration or require additional training. Our in-depth analysis reveals a fundamental difference between language and image tokens: image tokens exhibit inherent redundancy and diversity, meaning multiple tokens can convey valid semantics. However, traditional SD methods are designed to accept only a single most-likely token, which fails to leverage this difference, leading to excessive false-negative rejections. To address this, we propose a new SD strategy that evaluates clusters of visually valid tokens rather than relying on a single target token. Additionally, we observe that static clustering based on embedding distance is ineffective, which motivates our dynamic GSD approach. Extensive experiments show that GSD accelerates AR image models by an average of 3.7x while preserving image quality, all without requiring any additional training. The source code is available at this https URL
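To make the contrast between token-level and group-level acceptance concrete, here is a small Python sketch. The token-level rule is the standard speculative-decoding acceptance probability min(1, p(x)/q(x)), where p is the target model's distribution and q is the draft model's. The group-level rule, which compares the summed probability mass of a token cluster, is only one plausible reading of "evaluating clusters of visually valid tokens"; it is an assumption made for illustration, and the paper's dynamic grouping is not reproduced. The residual resampling step used on rejection in full speculative decoding is also omitted.

```python
def accept_prob(p_target, q_draft, token):
    """Standard speculative-decoding acceptance probability for one token."""
    return min(1.0, p_target[token] / q_draft[token])


def grouped_accept_prob(p_target, q_draft, token, groups):
    """Acceptance probability computed on the cluster containing `token`.
    Hypothetical rule: compare the total mass each model puts on the cluster."""
    group = next(g for g in groups if token in g)
    p_group = sum(p_target[t] for t in group)
    q_group = sum(q_draft[t] for t in group)
    return min(1.0, p_group / q_group)


# Toy vocabulary of four image tokens; tokens 1 and 2 are assumed to be
# visually interchangeable, so they share a cluster.
p_target = [0.125, 0.375, 0.25, 0.25]
q_draft = [0.125, 0.125, 0.5, 0.25]
groups = [{0}, {1, 2}, {3}]

token_level = accept_prob(p_target, q_draft, 2)                  # 0.5
group_level = grouped_accept_prob(p_target, q_draft, 2, groups)  # 1.0
```

In this toy case the draft and target disagree about which of two similar tokens is more likely, so the token-level rule rejects the draft half of the time, while the cluster-level comparison accepts it. This is the kind of false-negative rejection the abstract refers to.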
[CV-82] Enhancing Small-Scale Dataset Expansion with Triplet-Connection-based Sample Re-Weighting ACM-MM2025
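[Quick read]: This paper addresses the noise introduced in generative data augmentation by the uncontrollable quality of generated images, which limits model performance in applications with high data-quality requirements such as medical diagnosis. The key to the solution is TriReWeight, a triplet-connection-based sample re-weighting method. Based on a theoretical analysis of three types of supervision, the authors design a re-weighting mechanism that can be integrated with any generative data augmentation method without degrading its performance, and whose generalization error approaches the optimal order $ O(\sqrt{d \ln(n)/n}) $. Experiments show average improvements of 7.9% on six natural image datasets and 3.4% on three medical image datasets over existing SOTA methods.
Link: https://arxiv.org/abs/2508.07723
Authors: Ting Xiang,Changjian Chen,Zhuo Tang,Qifeng Zhang,Fei Lyu,Li Yang,Jiapeng Zhang,Kenli Li
Affiliations: Hunan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures, published to ACM MM2025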
Abstract:The performance of computer vision models in certain real-world applications, such as medical diagnosis, is often limited by the scarcity of available images. Expanding datasets using pre-trained generative models is an effective solution. However, due to the uncontrollable generation process and the ambiguity of natural language, noisy images may be generated. Re-weighting is an effective way to address this issue by assigning low weights to such noisy images. We first theoretically analyze three types of supervision for the generated images. Based on the theoretical analysis, we develop TriReWeight, a triplet-connection-based sample re-weighting method to enhance generative data augmentation. Theoretically, TriReWeight can be integrated with any generative data augmentation methods and never downgrade their performance. Moreover, its generalization approaches the optimal in the order $ O(\sqrt{d \ln(n)/n}) $. Our experiments validate the correctness of the theoretical analysis and demonstrate that our method outperforms the existing SOTA methods by 7.9% on average over six natural image datasets and by 3.4% on average over three medical datasets. We also experimentally validate that our method can enhance the performance of different generative data augmentation methods.
[CV-83] A Registration-Based Star-Shape Segmentation Model and Fast Algorithms
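[Quick read]: This paper addresses the difficulty of accurately segmenting target objects in images degraded by occlusion, blur, or noise. The key to the solution is a star-shape segmentation model based on the registration framework: by combining the level set representation with the registration framework and imposing constraints on the deformed level set function, the model supports full or partial star-shape segmentation with single or multiple centers. The method can also force the segmentation boundary to pass through specified landmark locations, which improves the consistency of the result with the geometric prior.
Link: https://arxiv.org/abs/2508.07721
Authors: Daoping Zhang,Xue-Cheng Tai,Lok Ming Lui
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Comments: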
Abstract:Image segmentation plays a crucial role in extracting objects of interest and identifying their boundaries within an image. However, accurate segmentation becomes challenging when dealing with occlusions, obscurities, or noise in corrupted images. To tackle this challenge, prior information is often utilized, with recent attention on star-shape priors. In this paper, we propose a star-shape segmentation model based on the registration framework. By combining the level set representation with the registration framework and imposing constraints on the deformed level set function, our model enables both full and partial star-shape segmentation, accommodating single or multiple centers. Additionally, our approach allows for the enforcement of identified boundaries to pass through specified landmark locations. We tackle the proposed models using the alternating direction method of multipliers. Through numerical experiments conducted on synthetic and real images, we demonstrate the efficacy of our approach in achieving accurate star-shape segmentation.
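For readers unfamiliar with the star-shape prior, the following Python sketch illustrates the underlying definition only: a region is star-shaped with respect to a center if the straight segment from the center to every point of the region stays inside the region. The sketch checks this on a binary pixel mask by sampling points along each segment. It is not the paper's level-set and registration model or its ADMM solver; the function name `is_star_shaped` and the sampling density `samples_per_unit` are hypothetical.

```python
def is_star_shaped(mask, center, samples_per_unit=2):
    """Check whether the foreground of a binary mask (list of rows of 0/1)
    is star-shaped with respect to center = (row, col), up to pixel
    discretization. Illustrative check only."""
    cy, cx = center
    if not mask[cy][cx]:
        return False  # the center itself must belong to the region
    for y, row in enumerate(mask):
        for x, inside in enumerate(row):
            if not inside:
                continue
            # Sample the segment from the center to this foreground pixel.
            steps = max(1, samples_per_unit * max(abs(y - cy), abs(x - cx)))
            for k in range(steps + 1):
                t = k / steps
                py = round(cy + t * (y - cy))
                px = round(cx + t * (x - cx))
                if not mask[py][px]:
                    return False  # the segment leaves the region
    return True


# A filled square is star-shaped with respect to its middle pixel.
square = [[1] * 5 for _ in range(5)]

# A rectangular frame with a hole is not star-shaped with respect to a corner:
# the segment to the opposite corner crosses the hole.
frame = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]

square_ok = is_star_shaped(square, (2, 2))
frame_ok = is_star_shaped(frame, (0, 0))
```

A segmentation model with a star-shape prior restricts its output to regions that would pass this kind of test for a chosen center (or, in the multi-center case mentioned in the abstract, for a set of centers).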
[CV-84] DoorDet: Semi-Automated Multi-Class Door Detection Dataset via Object Detection and Large Language Models
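[Quick read]: This paper addresses the accurate detection and classification of diverse door types in architectural floor plans, a key step for applications such as building compliance checking and indoor scene understanding. Publicly available datasets for fine-grained multi-class door detection are scarce, which limits the training and evaluation of models. The core of the solution is a semi-automated pipeline: a state-of-the-art deep object detection model first detects doors as a single unified category; a large language model (LLM) then classifies each detected instance based on its visual and contextual features; finally, a human-in-the-loop stage ensures the quality of labels and bounding boxes. The method substantially reduces manual annotation cost while producing a dataset suitable for benchmarking neural models in floor plan analysis, illustrating the potential of combining deep learning with multimodal reasoning for dataset construction in complex real-world domains.
Link: https://arxiv.org/abs/2508.07714
Authors: Licheng Zhang,Bach Le,Naveed Akhtar,Tuan Ngo
Affiliations: University of Melbourne
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: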
Abstract:Accurate detection and classification of diverse door types in floor plans drawings is critical for multiple applications, such as building compliance checking, and indoor scene understanding. Despite their importance, publicly available datasets specifically designed for fine-grained multi-class door detection remain scarce. In this work, we present a semi-automated pipeline that leverages a state-of-the-art object detector and a large language model (LLM) to construct a multi-class door detection dataset with minimal manual effort. Doors are first detected as a unified category using a deep object detection model. Next, an LLM classifies each detected instance based on its visual and contextual features. Finally, a human-in-the-loop stage ensures high-quality labels and bounding boxes. Our method significantly reduces annotation cost while producing a dataset suitable for benchmarking neural models in floor plan analysis. This work demonstrates the potential of combining deep learning and multimodal reasoning for efficient dataset construction in complex real-world domains.
[CV-85] Multi-view Normal and Distance Guidance Gaussian Splatting for Surface Reconstruction IROS2025
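[Quick read]: This paper addresses the geometric bias that arises in 3D Gaussian Splatting (3DGS) in multi-view scenes when Gaussian normal vectors are aligned within a single-view projection plane: the reconstructed surface looks reasonable in the current view but becomes inconsistent or distorted when switching to nearby views. The key to the solution is multi-view normal and distance-guided Gaussian splatting, built on two modules. A multi-view distance reprojection regularization module enforces depth consistency through a distance loss between nearby views and the same Gaussian surface. A multi-view normal enhancement module enforces normal consistency by matching the normals of pixel points in nearby views and computing a loss. Together they achieve geometric depth unification and high-accuracy reconstruction, improving the multi-view reconstruction robustness of 3DGS on small indoor and outdoor scenes.
Link: https://arxiv.org/abs/2508.07701
Authors: Bo Jia,Yanan Guo,Ying Chang,Benkui Zhang,Ying Xie,Kangning Du,Lin Cao
Affiliations: Beijing Information Science and Technology University; Aerospace Information Research Institute, CAS
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: This paper has been accepted by IROS 2025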
Abstract:3D Gaussian Splatting (3DGS) achieves remarkable results in the field of surface reconstruction. However, when Gaussian normal vectors are aligned within the single-view projection plane, while the geometry appears reasonable in the current view, biases may emerge upon switching to nearby views. To address the distance and global matching challenges in multi-view scenes, we design multi-view normal and distance-guided Gaussian splatting. This method achieves geometric depth unification and high-accuracy reconstruction by constraining nearby depth maps and aligning 3D normals. Specifically, for the reconstruction of small indoor and outdoor scenes, we propose a multi-view distance reprojection regularization module that achieves multi-view Gaussian alignment by computing the distance loss between two nearby views and the same Gaussian surface. Additionally, we develop a multi-view normal enhancement module, which ensures consistency across views by matching the normals of pixel points in nearby views and calculating the loss. Extensive experimental results demonstrate that our method outperforms the baseline in both quantitative and qualitative evaluations, significantly enhancing the surface reconstruction capability of 3DGS.
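The normal-matching idea in this abstract can be illustrated with a generic consistency loss. The Python sketch below averages 1 minus the cosine similarity over pairs of normals at matched pixels from two nearby views. It assumes the pixel correspondences are already known and that both sets of normals are expressed in the same world frame. The cosine form and the name `normal_consistency_loss` are assumptions for illustration; the abstract does not specify the loss actually used.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def normal_consistency_loss(normals_a, normals_b):
    """Mean (1 - cosine similarity) over matched normals from two views.
    0 means the normals agree; 2 means they point in opposite directions."""
    total = 0.0
    for a, b in zip(normals_a, normals_b):
        a, b = normalize(a), normalize(b)
        total += 1.0 - sum(x * y for x, y in zip(a, b))
    return total / len(normals_a)


view_a = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
same = normal_consistency_loss(view_a, [[0.0, 0.0, 2.0], [0.0, 3.0, 0.0]])
orthogonal = normal_consistency_loss(view_a, [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
```

Minimizing a term of this kind penalizes surfaces whose estimated orientation changes between neighboring views, which is the cross-view consistency the abstract describes.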
[CV-86] Make Your MoVe: Make Your 3D Contents by Adapting Multi-View Diffusion Models to External Editing
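[Quick read]: This paper addresses the difficulty of editing personalized content in current 3D generation, namely how to refine the color, style, and lighting of generated 3D assets without compromising the original geometry. Most existing editing tools are designed for the 2D domain, and feeding their results directly into multi-view diffusion models causes information loss that degrades the final 3D assets. The key to the solution is a tuning-free, plug-and-play scheme. Its core is a geometry preservation module that guides the edited multi-view generation with the normal latents of the original input, together with an injection switcher that controls the extent of supervision from the original normals so that the edited color views and normal views stay aligned. Experiments show that the method improves the multi-view consistency and mesh quality of edited 3D assets.
Link: https://arxiv.org/abs/2508.07700
Authors: Weitao Wang,Haoran Xu,Jun Meng,Haoqian Wang
Affiliations: Tsinghua University; Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: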
Abstract:As 3D generation techniques continue to flourish, the demand for generating personalized content is rapidly rising. Users increasingly seek to apply various editing methods to polish generated 3D content, aiming to enhance its color, style, and lighting without compromising the underlying geometry. However, most existing editing tools focus on the 2D domain, and directly feeding their results into 3D generation methods (like multi-view diffusion models) will introduce information loss, degrading the quality of the final 3D assets. In this paper, we propose a tuning-free, plug-and-play scheme that aligns edited assets with their original geometry in a single inference run. Central to our approach is a geometry preservation module that guides the edited multi-view generation with original input normal latents. Besides, an injection switcher is proposed to deliberately control the supervision extent of the original normals, ensuring the alignment between the edited color and normal views. Extensive experiments show that our method consistently improves both the multi-view consistency and mesh quality of edited 3D assets, across multiple combinations of multi-view diffusion models and editing methods.
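[CV-87] TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding
【Quick Read】: This paper targets Temporal Video Grounding (TVG), where existing reinforcement learning methods fail to constrain the reasoning process in a way that guarantees the quality of the final temporal prediction, and proposes a new framework called TAR-TVG. The core idea is to insert timestamp anchors into the reasoning process as intermediate verification points that explicitly supervise the thought content, and to require each reasoning step to produce a more accurate temporal estimate than the last, so that the reasoning chain contributes meaningfully to the final prediction. The key innovation is a three-stage training strategy: initial GRPO training to collect high-quality reasoning traces, supervised fine-tuning (SFT) on the distilled data, and a final GRPO optimization on the SFT-enhanced model. This makes anchor generation, initially a low-probability behavior, robust while preserving reasoning quality, improving both performance and interpretability.
Link: https://arxiv.org/abs/2508.07683
Authors: Chaohong Guo,Xun Mo,Yongwei Nie,Xuemiao Xu,Chao Xu,Fei Yu,Chengjiang Long
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: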
Abstract:Temporal Video Grounding (TVG) aims to precisely localize video segments corresponding to natural language queries, which is a critical capability for long-form video understanding. Although existing reinforcement learning approaches encourage models to generate reasoning chains before predictions, they fail to explicitly constrain the reasoning process to ensure the quality of the final temporal predictions. To address this limitation, we propose Timestamp Anchor-constrained Reasoning for Temporal Video Grounding (TAR-TVG), a novel framework that introduces timestamp anchors within the reasoning process to enforce explicit supervision to the thought content. These anchors serve as intermediate verification points. More importantly, we require each reasoning step to produce increasingly accurate temporal estimations, thereby ensuring that the reasoning process contributes meaningfully to the final prediction. To address the challenge of low-probability anchor generation in models (e.g., Qwen2.5-VL-3B), we develop an efficient self-distillation training strategy: (1) initial GRPO training to collect 30K high-quality reasoning traces containing multiple timestamp anchors, (2) supervised fine-tuning (SFT) on distilled data, and (3) final GRPO optimization on the SFT-enhanced model. This three-stage training strategy enables robust anchor generation while maintaining reasoning quality. Experiments show that our model achieves state-of-the-art performance while producing interpretable, verifiable reasoning chains with progressively refined temporal estimations.
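The requirement that successive timestamp anchors become "increasingly accurate" can be illustrated with a small check. This is a minimal sketch, not the authors' reward function: it assumes accuracy is measured by temporal IoU against the ground-truth segment, and the names `temporal_iou` and `anchors_progressively_refine` are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def anchors_progressively_refine(anchors, gt):
    """True if each anchor's IoU with gt is at least that of the previous anchor.

    Assumption: IoU is the accuracy measure; the abstract does not specify one.
    """
    ious = [temporal_iou(a, gt) for a in anchors]
    return all(later >= earlier for earlier, later in zip(ious, ious[1:]))


# Illustrative values only: three anchors that tighten around the target segment.
gt_segment = (10.0, 20.0)
anchors = [(0.0, 30.0), (5.0, 22.0), (9.0, 20.0)]
refines = anchors_progressively_refine(anchors, gt_segment)
```

A check of this kind could serve as one term in a reward, paying out only when the chain of anchors improves monotonically; how the paper actually scores anchors is not stated in the abstract.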
[CV-88] DiffVC-OSD: One-Step Diffusion-based Perceptual Neural Video Compression Framework
【Quick Read】: This paper addresses the slow decoding, high bitrate, and limited perceptual-quality gains of conventional multi-step diffusion models in neural video compression. The solution is DiffVC-OSD, a perceptual video compression framework based on one-step diffusion. Its key innovation is to feed the reconstructed latent representation directly into a one-step diffusion model and enhance perceptual quality in a single diffusion step guided by both the temporal context and the latent itself. A Temporal Context Adapter extracts multi-level conditional features to give the denoising U-Net finer-grained guidance, and an end-to-end fine-tuning strategy further improves overall compression performance. Experiments show state-of-the-art perceptual quality together with about 20× faster decoding and an 86.92% bitrate reduction relative to the corresponding multi-step diffusion-based variant.
Link: https://arxiv.org/abs/2508.07682
Authors: Wenzhuo Ma,Zhenzhong Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In this work, we first propose DiffVC-OSD, a One-Step Diffusion-based Perceptual Neural Video Compression framework. Unlike conventional multi-step diffusion-based methods, DiffVC-OSD feeds the reconstructed latent representation directly into a One-Step Diffusion Model, enhancing perceptual quality through a single diffusion step guided by both temporal context and the latent itself. To better leverage temporal dependencies, we design a Temporal Context Adapter that encodes conditional inputs into multi-level features, offering more fine-grained guidance for the Denoising Unet. Additionally, we employ an End-to-End Finetuning strategy to improve overall compression performance. Extensive experiments demonstrate that DiffVC-OSD achieves state-of-the-art perceptual compression performance and offers about 20× faster decoding and an 86.92% bitrate reduction compared to the corresponding multi-step diffusion-based variant.
[CV-89] Undress to Redress: A Training-Free Framework for Virtual Try-On
【Quick Read】: This paper addresses unrealistic results in Virtual Try-On (VTON) for long-sleeve-to-short-sleeve conversions, particularly when little skin is exposed in the original image, where the “majority” completion rule of existing methods leads to inaccurate skin restoration. The key to the solution is the training-free UR-VTON framework, whose core is an “undress-to-redress” mechanism: it first virtually “undresses” the original garment to reveal the user’s torso and then applies the target short-sleeve garment, decomposing a difficult conversion into two more manageable steps. It also combines dynamic classifier-free guidance scheduling, to balance diversity and image quality, with a structural refinement module that improves detail fidelity.
Link: https://arxiv.org/abs/2508.07680
Authors: Zhiying Li,Junhao Wu,Yeying Jin,Daiheng Gao,Yun Ji,Kaichuan Kong,Lei Yu,Hao Xu,Kai Chen,Bruce Gu,Nana Wang,Zhaoxin Fan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 8 figures
Abstract:Virtual try-on (VTON) is a crucial task for enhancing user experience in online shopping by generating realistic garment previews on personal photos. Although existing methods have achieved impressive results, they struggle with long-sleeve-to-short-sleeve conversions, a common and practical scenario, often producing unrealistic outputs when exposed skin is underrepresented in the original image. We argue that this challenge arises from the “majority” completion rule in current VTON models, which leads to inaccurate skin restoration in such cases. To address this, we propose UR-VTON (Undress-Redress Virtual Try-ON), a novel, training-free framework that can be seamlessly integrated with any existing VTON method. UR-VTON introduces an “undress-to-redress” mechanism: it first reveals the user’s torso by virtually “undressing,” then applies the target short-sleeve garment, effectively decomposing the conversion into two more manageable steps. Additionally, we incorporate Dynamic Classifier-Free Guidance scheduling to balance diversity and image quality during DDPM sampling, and employ Structural Refiner to enhance detail fidelity using high-frequency cues. Finally, we present LS-TON, a new benchmark for long-sleeve-to-short-sleeve try-on. Extensive experiments demonstrate that UR-VTON outperforms state-of-the-art methods in both detail preservation and image quality. Code will be released upon acceptance.
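To make the idea of a guidance scale that changes during sampling concrete, here is a minimal sketch. The combination rule in `cfg_combine` is the standard classifier-free guidance formula; the linear schedule, its endpoint values, and the function names are assumptions for illustration, since the abstract does not describe the schedule UR-VTON actually uses.

```python
def guidance_scale(step, num_steps, w_start=7.5, w_end=2.0):
    """Linearly interpolate the guidance weight over the sampling steps.

    Assumption: a linear schedule with illustrative endpoints, not the paper's.
    """
    if num_steps < 2:
        return w_start
    frac = step / (num_steps - 1)
    return w_start + frac * (w_end - w_start)


def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance: eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions (plain lists standing in for tensors).
eps_u, eps_c = [0.0, 1.0], [1.0, 1.0]
schedule = [guidance_scale(t, 5) for t in range(5)]
guided_first = cfg_combine(eps_u, eps_c, schedule[0])
```

With `w = 1` the combination reduces to the conditional prediction, and larger weights push samples further toward the condition at some cost in diversity, which is the trade-off a schedule is meant to manage.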
[CV-90] Collaborative Learning of Scattering and Deep Features for SAR Target Recognition with Noisy Labels
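【Quick Read】: This paper addresses the performance degradation of Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) caused by unreliable, noisy labels. Existing noisy-label learning methods designed for image data do not cope well with the non-intuitive visual characteristics of SAR data and cannot achieve robust learning. The key to the solution is Collaborative Learning of Scattering and Deep Features (CLSDF). First, a multi-modal feature fusion mechanism models the physically interpretable Attributed Scattering Centers (ASCs) as dynamic graph-structured data to enrich deep image features. Second, multiple class-wise Gaussian Mixture Models (GMMs) model the loss distribution to separate clean samples from noisy ones. Third, two divergent branches are trained with semi-supervised learning, each on the data divided by the other. Finally, a joint distribution alignment strategy improves the reliability of co-guessed labels. The method achieves state-of-the-art recognition performance on the MSTAR dataset under various label-noise conditions.
Link: https://arxiv.org/abs/2508.07656
Authors: Yimin Fu,Zhunga Liu,Dongxiu Guo,Longfei Wang
Affiliations: Northwestern Polytechnical University (西北工业大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code will be released at this https URL upon acceptance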
Abstract:The acquisition of high-quality labeled synthetic aperture radar (SAR) data is challenging due to the demanding requirement for expert knowledge. Consequently, the presence of unreliable noisy labels is unavoidable, which results in performance degradation of SAR automatic target recognition (ATR). Existing research on learning with noisy labels mainly focuses on image data. However, the non-intuitive visual characteristics of SAR data are insufficient to achieve noise-robust learning. To address this problem, we propose collaborative learning of scattering and deep features (CLSDF) for SAR ATR with noisy labels. Specifically, a multi-model feature fusion framework is designed to integrate scattering and deep features. The attributed scattering centers (ASCs) are treated as dynamic graph structure data, and the extracted physical characteristics effectively enrich the representation of deep image features. Then, the samples with clean and noisy labels are divided by modeling the loss distribution with multiple class-wise Gaussian Mixture Models (GMMs). Afterward, the semi-supervised learning of two divergent branches is conducted based on the data divided by each other. Moreover, a joint distribution alignment strategy is introduced to enhance the reliability of co-guessed labels. Extensive experiments have been done on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset, and the results show that the proposed method can achieve state-of-the-art performance under different operating conditions with various label noises.
[CV-91] LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering ICCV2025
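【Quick Read】: This paper addresses the imprecise control of occlusion relationships between objects in current image generation methods. Prompt-based methods cannot control occlusion precisely, and layout-to-image methods control object locations but do not model occlusion explicitly. The key to the solution is to apply volume rendering principles in the latent space of a pre-trained image diffusion model, “rendering” the scene under the guidance of the objects’ occlusion relationships and estimated transmittance, which gives precise occlusion control without retraining or fine-tuning. The method is physics-grounded, significantly improves occlusion accuracy, and, by adjusting object opacity, supports effects such as transparency, density, particle concentration, light intensity, and lens effects.
Link: https://arxiv.org/abs/2508.07647
Authors: Xiaohang Zhan,Dingming Liu
Affiliations: Tencent (腾讯)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025 (oral). Project page: this https URL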
Abstract:We propose a novel training-free image generation algorithm that precisely controls the occlusion relationships between objects in an image. Existing image generation methods typically rely on prompts to influence occlusion, which often lack precision. While layout-to-image methods provide control over object locations, they fail to address occlusion relationships explicitly. Given a pre-trained image diffusion model, our method leverages volume rendering principles to “render” the scene in latent space, guided by occlusion relationships and the estimated transmittance of objects. This approach does not require retraining or fine-tuning the image diffusion model, yet it enables accurate occlusion control due to its physics-grounded foundation. In extensive experiments, our method significantly outperforms existing approaches in terms of occlusion accuracy. Furthermore, we demonstrate that by adjusting the opacities of objects or concepts during rendering, our method can achieve a variety of effects, such as altering the transparency of objects, the density of mass (e.g., forests), the concentration of particles (e.g., rain, fog), the intensity of light, and the strength of lens effects, etc.
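The volume rendering principle the abstract refers to can be sketched with the usual front-to-back compositing rule, where each layer contributes in proportion to its opacity and to the transmittance left over by the layers in front of it. This is only the generic discrete rule applied to scalar values; how LaRender applies it to diffusion latents is not described in the abstract, and the function names here are hypothetical.

```python
def compositing_weights(opacities):
    """Front-to-back weights w_i = T_i * alpha_i, with T_i = prod_{j<i} (1 - alpha_j).

    Returns the per-layer weights and the transmittance remaining behind all layers.
    """
    weights, transmittance = [], 1.0
    for alpha in opacities:
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights, transmittance


def composite(values, opacities, background=0.0):
    """Blend per-layer values (ordered front to back) over a background value."""
    weights, remaining = compositing_weights(opacities)
    return sum(w * v for w, v in zip(weights, values)) + remaining * background


# A fully opaque front object hides everything behind it.
opaque_front, _ = compositing_weights([1.0, 0.5])
# Two half-transparent layers: the rear one is attenuated by the front one.
half_weights, leftover = compositing_weights([0.5, 0.5])
blended = composite([1.0, 2.0], [0.5, 0.5])
```

Lowering the front layer's opacity lets more of the rear layer through, which matches the abstract's point that adjusting opacities changes apparent transparency or density.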
[CV-92] AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning ICCV2025
【Quick Read】: This paper addresses the limited generalization of Visual Robot Manipulation (VRM) when robot data is scarce. Existing methods either pretrain on large-scale web image-language data that differs substantially from robotic tasks or learn implicitly (for example, by predicting future frames at the pixel level), which limits their performance in real robot scenarios. The key to the solution is AR-VRM, a framework that explicitly imitates human actions through analogical reasoning. A keypoint Vision-Language Model (VLM) is pretrained on large-scale human action videos so that it can directly predict human hand keypoints. During fine-tuning on robot data, the method retrieves human action videos that perform similar tasks and have similar historical observations, then learns an Analogical Reasoning (AR) map between human hand keypoints and robot components, which substantially improves few-shot performance.
Link: https://arxiv.org/abs/2508.07626
Authors: Dejie Yang,Zijing Zhao,Yang Liu
Affiliations: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机研究所); State Key Laboratory of General Artificial Intelligence, Peking University (通用人工智能国家重点实验室, 北京大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by ICCV2025
Abstract:Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multi-modal data. To compensate for the deficiency of robot data, existing approaches have employed vision-language pretraining with large-scale data. However, they either utilize web data that differs from robotic tasks, or train the model in an implicit way (e.g., predicting future frames at the pixel level), thus showing limited generalization ability under insufficient robot data. In this paper, we propose to learn from large-scale human action video datasets in an explicit way (i.e., imitating human actions from hand keypoints), introducing Visual Robot Manipulation with Analogical Reasoning (AR-VRM). To acquire action knowledge explicitly from human action videos, we propose a keypoint Vision-Language Model (VLM) pretraining scheme, enabling the VLM to learn human action knowledge and directly predict human hand keypoints. During fine-tuning on robot data, to facilitate the robotic arm in imitating the action patterns of human motions, we first retrieve human action videos that perform similar manipulation tasks and have similar historical observations, and then learn the Analogical Reasoning (AR) map between human hand keypoints and robot components. Taking advantage of focusing on action keypoints instead of irrelevant visual cues, our method achieves leading performance on the CALVIN benchmark and real-world experiments. In few-shot scenarios, our AR-VRM outperforms previous methods by large margins, underscoring the effectiveness of explicitly imitating human actions under data scarcity.
[CV-93] A Trustworthy Method for Multimodal Emotion Recognition
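【Quick Read】: This paper addresses the insufficient reliability of existing emotion recognition models on noisy, corrupted, and out-of-distribution data. Mainstream methods improve performance with complex deep models but do not assess prediction confidence, so the trustworthiness of their decisions is hard to guarantee in practice. The key to the solution is a framework called Trusted Emotion Recognition (TER), which uses an uncertainty estimation module to compute the confidence of each prediction and fuses the results of multiple modalities according to their confidence values to output trusted predictions. The paper also introduces new evaluation metrics, namely trusted precision, trusted recall, trusted Acc., and trusted F1 score, to quantify how reliable the model’s predictions are, improving robustness and reliability under noise.
Link: https://arxiv.org/abs/2508.07625
Authors: Junxiao Xue,Xiaozhen Liu,Jie Wang,Xuecheng Wu,Bin Wu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in Big Data Mining and Analytics (BDMA), 2025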
Abstract:Existing emotion recognition methods mainly focus on enhancing performance by employing complex deep models, typically resulting in significantly higher model complexity. Although effective, it is also crucial to ensure the reliability of the final decision, especially for noisy, corrupted and out-of-distribution data. To this end, we propose a novel emotion recognition method called trusted emotion recognition (TER), which utilizes uncertainty estimation to calculate the confidence value of predictions. TER combines the results from multiple modalities based on their confidence values to output the trusted predictions. We also provide a new evaluation criterion to assess the reliability of predictions. Specifically, we incorporate trusted precision and trusted recall to determine the trusted threshold and formulate the trusted Acc. and trusted F1 score to evaluate the model’s trusted performance. The proposed framework combines the confidence module that accordingly endows the model with reliability and robustness against possible noise or corruption. The extensive experimental results validate the effectiveness of our proposed model. The TER achieves state-of-the-art performance on the Music-video, achieving 82.40% Acc. In terms of trusted performance, TER outperforms other methods on the IEMOCAP and Music-video, achieving trusted F1 scores of 0.7511 and 0.9035, respectively.
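Combining per-modality results "based on their confidence values" could look like the following sketch. It assumes a simple confidence-weighted average of class probabilities; the abstract does not give TER's actual fusion rule or how its confidences are estimated, so the function name and all numbers are illustrative.

```python
def fuse_by_confidence(modality_probs, confidences):
    """Confidence-weighted average of per-modality class probabilities.

    Assumption: a normalized weighted average; TER's real fusion rule may differ.
    """
    total = sum(confidences)
    if total <= 0:
        raise ValueError("at least one modality needs positive confidence")
    num_classes = len(modality_probs[0])
    return [
        sum(c * probs[k] for probs, c in zip(modality_probs, confidences)) / total
        for k in range(num_classes)
    ]


# Illustrative numbers: audio is confident, video is not (for example, a corrupted frame).
audio_probs, video_probs = [0.6, 0.4], [0.2, 0.8]
fused = fuse_by_confidence([audio_probs, video_probs], [0.9, 0.1])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the fused prediction follows the confident audio branch even though the video branch prefers the other class, which is the behavior a confidence module is meant to provide when one modality is noisy.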
[CV-94] Enhancing Egocentric Object Detection in Static Environments using Graph-based Spatial Anomaly Detection and Correction
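【Quick Read】: This paper addresses inconsistent predictions, missed detections, and misclassifications that arise when object detectors in static environments ignore spatial priors, especially in cluttered or occluded scenes. The key to the solution is a post-processing pipeline based on a Graph Neural Network (GNN) that explicitly models the spatial relationships between objects to identify and correct detection anomalies. Specifically, a GNN trained on manually annotated data identifies invalid class labels and predicts the correct class from the neighbourhood context. Used as a post-processing module for detectors such as YOLOv7 and RT-DETR, the method improves mAP@50 by up to 4%.
Link: https://arxiv.org/abs/2508.07624
Authors: Vishakha Lall,Yisi Liu
Affiliations: Singapore Polytechnic (新加坡理工学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: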
Abstract:In many real-world applications involving static environments, the spatial layout of objects remains consistent across instances. However, state-of-the-art object detection models often fail to leverage this spatial prior, resulting in inconsistent predictions, missed detections, or misclassifications, particularly in cluttered or occluded scenes. In this work, we propose a graph-based post-processing pipeline that explicitly models the spatial relationships between objects to correct detection anomalies in egocentric frames. Using a graph neural network (GNN) trained on manually annotated data, our model identifies invalid object class labels and predicts corrected class labels based on their neighbourhood context. We evaluate our approach both as a standalone anomaly detection and correction framework and as a post-processing module for standard object detectors such as YOLOv7 and RT-DETR. Experiments demonstrate that incorporating this spatial reasoning significantly improves detection performance, with mAP@50 gains of up to 4%. This method highlights the potential of leveraging the environment’s spatial structure to improve reliability in object detection systems.
[CV-95] SOFA: Deep Learning Framework for Simulating and Optimizing Atrial Fibrillation Ablation MICCAI2025
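【Quick Read】: This paper addresses the high and patient-dependent recurrence rate after catheter ablation for Atrial Fibrillation (AF); the core challenge is the complex interaction between patient-specific tissue characteristics and procedural parameters. The key to the solution is SOFA (Simulating and Optimizing Atrial Fibrillation Ablation), described as the first framework to integrate simulation of ablation effects, recurrence risk prediction, and parameter optimization. It first generates a post-ablation scar image and predicts recurrence risk from the pre-ablation LGE-MRI and the ablation parameters (such as location, duration, temperature, power, and force), and then applies an optimization scheme that adjusts the ablation parameters to minimize the predicted risk. Its core component is a multi-modal, multi-view generator operating on 2.5D representations, which closes the loop from patient-specific modelling to strategy optimization. Experiments show a 22.18% reduction in the model-predicted recurrence risk.
Link: https://arxiv.org/abs/2508.07621
Authors: Yunsung Chung,Chanho Lim,Ghassan Bidaoui,Christian Massad,Nassir Marrouche,Jihun Hamm
Affiliations: Tulane University (杜兰大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at MICCAI 2025. This is the author’s original preprint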
Abstract:Atrial fibrillation (AF) is a prevalent cardiac arrhythmia often treated with catheter ablation procedures, but procedural outcomes are highly variable. Evaluating and improving ablation efficacy is challenging due to the complex interaction between patient-specific tissue and procedural factors. This paper asks two questions: Can AF recurrence be predicted by simulating the effects of procedural parameters? How should we ablate to reduce AF recurrence? We propose SOFA (Simulating and Optimizing Atrial Fibrillation Ablation), a novel deep-learning framework that addresses these questions. SOFA first simulates the outcome of an ablation strategy by generating a post-ablation image depicting scar formation, conditioned on a patient’s pre-ablation LGE-MRI and the specific procedural parameters used (e.g., ablation locations, duration, temperature, power, and force). During this simulation, it predicts AF recurrence risk. Critically, SOFA then introduces an optimization scheme that refines these procedural parameters to minimize the predicted risk. Our method leverages a multi-modal, multi-view generator that processes 2.5D representations of the atrium. Quantitative evaluations show that SOFA accurately synthesizes post-ablation images and that our optimization scheme leads to a 22.18% reduction in the model-predicted recurrence risk. To the best of our knowledge, SOFA is the first framework to integrate the simulation of procedural effects, recurrence prediction, and parameter optimization, offering a novel tool for personalizing AF ablation.
[CV-96] An Iterative Reconstruction Method for Dental Cone-Beam Computed Tomography with a Truncated Field of View
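【Quick Read】: This paper addresses artifacts in dental cone-beam computed tomography (CBCT) reconstructions caused by a truncated field of view (FOV), which results from using small detectors. In iterative reconstruction, the discrepancy between the actual projection and the forward projection within the truncated FOV accumulates over iterations and significantly degrades image quality. The key to the solution is a two-stage method. In the first stage, an Implicit Neural Representation (INR) generates a coarse-resolution prior image over an extended region whose forward projection fully covers the patient’s head, and this is used to estimate the projection discrepancy caused by truncation. In the second stage, the discrepancy-corrected projection data is used in a conventional iterative reconstruction restricted to the original truncated region, which suppresses truncation artifacts and improves CBCT image quality.
Link: https://arxiv.org/abs/2508.07618
Authors: Hyoung Suk Park,Kiwan Jeon
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 2 figures, 2 tables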
Abstract:In dental cone-beam computed tomography (CBCT), compact and cost-effective system designs often use small detectors, resulting in a truncated field of view (FOV) that does not fully encompass the patient’s head. In iterative reconstruction approaches, the discrepancy between the actual projection and the forward projection within the truncated FOV accumulates over iterations, leading to significant degradation in the reconstructed image quality. In this study, we propose a two-stage approach to mitigate truncation artifacts in dental CBCT. In the first stage, we employ Implicit Neural Representation (INR), leveraging its superior representation power, to generate a prior image over an extended region so that its forward projection fully covers the patient’s head. To reduce computational and memory burdens, INR reconstruction is performed with a coarse voxel size. The forward projection of this prior image is then used to estimate the discrepancy due to truncated FOV in the measured projection data. In the second stage, the discrepancy-corrected projection data is utilized in a conventional iterative reconstruction process within the truncated region. Our numerical results demonstrate that the proposed two-grid approach effectively suppresses truncation artifacts, leading to improved CBCT image quality.
[CV-97] AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition ACM-MM2025
【Quick Read】: This paper addresses the limitations of existing Audio-Visual Speech Recognition (AVSR) methods in capturing heterogeneous and complementary audio-visual correlations, particularly under asymmetric information conditions. The key to the solution is AD-AVSR, a new framework based on bidirectional modality enhancement. It first introduces an audio dual-stream encoding strategy that enriches audio representations from multiple perspectives and deliberately establishes asymmetry. Two core modules, the Audio-aware Visual Refinement Module and the Cross-modal Noise Suppression Masking Module, then provide closed-loop, bidirectional interaction between visual and audio representations. A threshold-based selection mechanism filters out irrelevant or weakly correlated audio-visual pairs to make the correlations more robust. Experiments on LRS2 and LRS3 show that the design surpasses state-of-the-art methods in both performance and noise robustness.
Link: https://arxiv.org/abs/2508.07608
Authors: Junxiao Xue,Xiaozhen Liu,Xuecheng Wu,Xinyi Yin,Danlei Huang,Fei Yu
Affiliations: Zhengzhou University (郑州大学); Zhejiang Lab (浙江实验室); Xi’an Jiaotong University (西安交通大学)
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: Accepted by the ACM MM 2025 Workshop on SVC
Abstract:Audio-visual speech recognition (AVSR) combines audio-visual modalities to improve speech recognition, especially in noisy environments. However, most existing methods rely on unidirectional enhancement or symmetric fusion, which limits their capability to capture heterogeneous and complementary correlations of audio-visual data, especially under asymmetric information conditions. To tackle these gaps, we introduce a new AVSR framework termed AD-AVSR based on bidirectional modality enhancement. Specifically, we first introduce the audio dual-stream encoding strategy to enrich audio representations from multiple perspectives and intentionally establish asymmetry to support subsequent cross-modal interactions. The enhancement process involves two key components, Audio-aware Visual Refinement Module for enhanced visual representations under audio guidance, and Cross-modal Noise Suppression Masking Module which refines audio representations using visual cues, collaboratively leading to the closed-loop and bidirectional information flow. To further enhance correlation robustness, we adopt a threshold-based selection mechanism to filter out irrelevant or weakly correlated audio-visual pairs. Extensive experimental results on the LRS2 and LRS3 datasets indicate that our AD-AVSR consistently surpasses SOTA methods in both performance and noise robustness, highlighting the effectiveness of our model design.
[CV-98] X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning
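【Quick Read】: This paper addresses two gaps: open-source image editing datasets fall short in diversity and quality, and there is no plug-and-play editing module that integrates seamlessly with the generative models the community commonly uses (such as diffusion models). First, the authors build the X2Edit Dataset, which covers 14 diverse editing tasks (including subject-driven generation). The data is synthesized with unified image generation models and expert models, editing instructions are designed with a vision-language model (VLM), and multiple scoring mechanisms filter the samples, yielding 3.7 million high-quality, category-balanced examples. Second, they propose task-aware MoE-LoRA training based on FLUX.1, which uses only 8% of the full model’s parameters, and add contrastive learning on the diffusion model’s internal representations, with positive and negative samples defined by editing type, to further improve editing performance.
Link: https://arxiv.org/abs/2508.07607
Authors: Jian Ma,Xujie Zhu,Zihao Pan,Qirong Peng,Xu Guo,Chen Chen,Haonan Lu
Affiliations: OPPO AI Center (OPPO人工智能中心)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL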
Abstract:Existing open-source datasets for arbitrary-instruction image editing remain suboptimal, while a plug-and-play editing module compatible with community-prevalent generative models is notably absent. In this paper, we first introduce the X2Edit Dataset, a comprehensive dataset covering 14 diverse editing tasks, including subject-driven generation. We utilize the industry-leading unified image generation models and expert models to construct the data. Meanwhile, we design reasonable editing instructions with the VLM and implement various scoring mechanisms to filter the data. As a result, we construct 3.7 million high-quality data with balanced categories. Second, to better integrate seamlessly with community image generation models, we design task-aware MoE-LoRA training based on FLUX.1, with only 8% of the parameters of the full model. To further improve the final performance, we utilize the internal representations of the diffusion model and define positive/negative samples based on image editing types to introduce contrastive learning. Extensive experiments demonstrate that the model’s editing performance is competitive among many excellent models. Additionally, the constructed dataset exhibits substantial advantages over existing open-source datasets. The open-source code, checkpoints, and datasets for X2Edit can be found at the following link: this https URL.
[CV-99] LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation ACM-MM2025
【Quick Read】: This paper addresses identity preservation in generative video creation: in the stochastic global generation process of diffusion models, facial features are easily corrupted, which makes identity inconsistent across frames. The key to the solution is the LaVieID framework, which improves latent modelling along both spatial and temporal dimensions. First, a local router explicitly represents latent states as weighted combinations of fine-grained local facial structures, reducing feature interference and better capturing distinctive facial characteristics. Second, a temporal autoregressive module divides latent tokens into temporal chunks and uses their long-range temporal dependencies to predict biases that rectify the denoised latents, significantly improving inter-frame identity consistency.
Link: https://arxiv.org/abs/2508.07603
Authors: Wenhui Song,Hanhui Li,Jiehui Huang,Panwen Hu,Yuhao Cheng,Long Chen,Yiqiang Yan,Xiaodan Liang
Affiliations: Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区); Hong Kong University of Science and Technology (HKUST) (香港科技大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Lenovo Research (联想研究院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM MM 2025
Abstract:In this paper, we present LaVieID, a novel local autoregressive video diffusion framework designed to tackle the challenging identity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before video decoding. This module divides latent tokens temporally into chunks, exploiting their long-range temporal dependencies to predict biases for rectifying tokens, thereby significantly enhancing inter-frame identity consistency. Consequently, LaVieID can generate high-fidelity personalized videos and achieve state-of-the-art performance. Our code and models are available at this https URL.
[CV-100] ShoulderShot: Generating Over-the-Shoulder Dialogue Videos
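【Quick Read】: This paper addresses three core challenges in generating over-the-shoulder dialogue videos: maintaining character consistency, creating spatial continuity, and generating long, multi-turn dialogues within a limited computational budget. The key to the solution is ShoulderShot, a framework that combines dual-shot generation with looping video to keep character appearance and spatial relationships stable across shots while allowing the dialogue length to be extended flexibly, improving the quality and practicality of the generated videos.
Link: https://arxiv.org/abs/2508.07597
Authors: Yuang Zhang,Junqi Cheng,Haoyu Zhao,Jiaxi Gu,Fangyuan Zou,Zenghui Lu,Peng Shu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: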
Abstract:Over-the-shoulder dialogue videos are essential in films, short dramas, and advertisements, providing visual variety and enhancing viewers’ emotional connection. Despite their importance, such dialogue scenes remain largely underexplored in video generation research. The main challenges include maintaining character consistency across different shots, creating a sense of spatial continuity, and generating long, multi-turn dialogues within limited computational budgets. Here, we present ShoulderShot, a framework that combines dual-shot generation with looping video, enabling extended dialogues while preserving character consistency. Our results demonstrate capabilities that surpass existing methods in terms of shot-reverse-shot layout, spatial continuity, and flexibility in dialogue length, thereby opening up new possibilities for practical dialogue video generation. Videos and comparisons are available at this https URL.
[CV-101] From Prediction to Explanation: Multimodal Explainable and Interactive Deepfake Detection Framework for Non-Expert Users
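【Quick Read】: This paper addresses the lack of interpretability in deepfake detection models: existing systems achieve high classification accuracy but operate as black boxes that offer little insight into their reasoning, which limits their use by non-expert users in critical areas such as forensics and journalism. The key to the solution is the DF-P2E (Deepfake: Prediction to Explanation) framework, which provides multimodal explanations through three modular components: (1) a Grad-CAM-based saliency visualisation module that localizes manipulated regions; (2) a visual captioning module that summarizes the manipulated regions in natural language; and (3) a narrative refinement module based on a fine-tuned large language model (LLM) that produces context-aware explanations suited to the user. The design connects prediction to explanation end to end and improves the trustworthiness and usability of the detection system.
Link: https://arxiv.org/abs/2508.07596
Authors: Shahroz Tariq,Simon S. Woo,Priyanka Singh,Irena Irmalasari,Saakshi Gupta,Dev Gupta
Affiliations: Data61, CSIRO, Australia (澳大利亚联邦科学与工业研究组织); Sungkyunkwan University, S. Korea (韩国成均馆大学); University of Queensland, Australia (澳大利亚昆士兰大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 tables, 5 figures, accepted for publication in the 33rd ACM International Conference on Multimedia (MM '25), October 27-31, 2025, Dublin, Ireland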
Abstract:The proliferation of deepfake technologies poses urgent challenges and serious risks to digital integrity, particularly within critical sectors such as forensics, journalism, and the legal system. While existing detection systems have made significant progress in classification accuracy, they typically function as black-box models, offering limited transparency and minimal support for human reasoning. This lack of interpretability hinders their usability in real-world decision-making contexts, especially for non-expert users. In this paper, we present DF-P2E (Deepfake: Prediction to Explanation), a novel multimodal framework that integrates visual, semantic, and narrative layers of explanation to make deepfake detection interpretable and accessible. The framework consists of three modular components: (1) a deepfake classifier with Grad-CAM-based saliency visualisation, (2) a visual captioning module that generates natural language summaries of manipulated regions, and (3) a narrative refinement module that uses a fine-tuned Large Language Model (LLM) to produce context-aware, user-sensitive explanations. We instantiate and evaluate the framework on the DF40 benchmark, the most diverse deepfake dataset to date. Experiments demonstrate that our system achieves competitive detection performance while providing high-quality explanations aligned with Grad-CAM activations. By unifying prediction and explanation in a coherent, human-aligned pipeline, this work offers a scalable approach to interpretable deepfake detection, advancing the broader vision of trustworthy and transparent AI systems in adversarial media environments.
[CV-102] MSPT: A Lightweight Face Image Quality Assessment Method with Multi-stage Progressive Training
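[Quick read]: This paper addresses two problems in perceptual quality assessment of face images: traditional methods generalize poorly, and learning-based methods carry heavy computational and storage costs. The key to the solution is a lightweight face quality assessment network trained with a Multi-Stage Progressive Training (MSPT) strategy: each stage introduces more diverse data samples and raises the input image resolution, so the lightweight network learns complex quality features while significantly mitigating catastrophic forgetting, reaching performance close to or better than current state-of-the-art methods while keeping inference efficient.
Link: https://arxiv.org/abs/2508.07590
Authors: Xiongwei Xiao,Baoying Chen,Jishen Zeng,Jianquan Yang
Affiliations: The Hong Kong Polytechnic University; Alibaba Group; Shenzhen Campus of Sun Yat-Sen University
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: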
Abstract:Accurately assessing the perceptual quality of face images is crucial, especially with the rapid progress in face restoration and generation. Traditional quality assessment methods often struggle with the unique characteristics of face images, limiting their generalizability. While learning-based approaches demonstrate superior performance due to their strong fitting capabilities, their high complexity typically incurs significant computational and storage costs, hindering practical deployment. To address this, we propose a lightweight face quality assessment network with Multi-Stage Progressive Training (MSPT). Our network employs a three-stage progressive training strategy that gradually introduces more diverse data samples and increases input image resolution. This novel approach enables lightweight networks to achieve high performance by effectively learning complex quality features while significantly mitigating catastrophic forgetting. Our MSPT achieved the second highest score on the VQualA 2025 face image quality assessment benchmark dataset, demonstrating that MSPT achieves comparable or better performance than state-of-the-art methods while maintaining efficient inference.
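The three-stage strategy in the abstract can be pictured as a schedule in which every stage uses more data and a larger input size than the one before. The sketch below is only an illustration: the abstract gives no concrete values, so the `Stage` fields, the data fractions, the resolutions, and the epoch counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    data_fraction: float  # hypothetical share of the training pool used in this stage
    resolution: int       # hypothetical square input size in pixels
    epochs: int           # hypothetical stage length


# Hypothetical values: the abstract states only that data diversity and
# input resolution both increase from stage to stage.
SCHEDULE = [
    Stage("stage1", data_fraction=0.3, resolution=224, epochs=10),
    Stage("stage2", data_fraction=0.6, resolution=384, epochs=10),
    Stage("stage3", data_fraction=1.0, resolution=512, epochs=10),
]


def is_progressive(schedule):
    """Check that data fraction and resolution never decrease across stages."""
    return all(
        later.data_fraction >= earlier.data_fraction
        and later.resolution >= earlier.resolution
        for earlier, later in zip(schedule, schedule[1:])
    )


def stage_for_epoch(schedule, epoch):
    """Return the stage that is active at a 0-indexed global epoch."""
    start = 0
    for stage in schedule:
        if epoch < start + stage.epochs:
            return stage
        start += stage.epochs
    raise ValueError("epoch is beyond the end of the schedule")
```

A training loop would call `stage_for_epoch` at the start of each epoch to pick the data subset and the resize transform; how the paper actually switches stages is not stated in the abstract.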
[CV-103] Voice Pathology Detection Using Phonation
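[Quick read]: This paper addresses the problem that traditional methods for early diagnosis of voice disorders, such as laryngoscopy, are invasive, subjective, and often inaccessible. The key to the solution is a non-invasive, machine-learning-based voice pathology detection framework: it uses phonation data from the Saarbrücken Voice Database, extracts acoustic features such as Mel Frequency Cepstral Coefficients (MFCCs), chroma features, and Mel spectrograms, and classifies samples with Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) networks and attention mechanisms. Data augmentation such as pitch shifting and Gaussian noise addition improves generalization, and scale-based features (such as Hölder and Hurst exponents) capture signal irregularities and long-term dependencies, enabling automatic discrimination of normal and pathological voice samples and supporting AI-driven healthcare.
Link: https://arxiv.org/abs/2508.07587
Authors: Sri Raksha Siva,Nived Suthahar,Prakash Boominathan,Uma Ranjan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: 17 Pages, 11 Figures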
Abstract:Voice disorders significantly affect communication and quality of life, requiring an early and accurate diagnosis. Traditional methods like laryngoscopy are invasive, subjective, and often inaccessible. This research proposes a noninvasive, machine learning-based framework for detecting voice pathologies using phonation data. Phonation data from the Saarbrücken Voice Database are analyzed using acoustic features such as Mel Frequency Cepstral Coefficients (MFCCs), chroma features, and Mel spectrograms. Recurrent Neural Networks (RNNs), including LSTM and attention mechanisms, classify samples into normal and pathological categories. Data augmentation techniques, including pitch shifting and Gaussian noise addition, enhance model generalizability, while preprocessing ensures signal quality. Scale-based features, such as Hölder and Hurst exponents, further capture signal irregularities and long-term dependencies. The proposed framework offers a noninvasive, automated diagnostic tool for early detection of voice pathologies, supporting AI-driven healthcare, and improving patient outcomes.
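Of the two augmentations named in the abstract, Gaussian noise addition is the simpler to illustrate. The sketch below assumes the noise level is set through a target signal-to-noise ratio in decibels; that parameterization, the function name `add_gaussian_noise`, and the synthetic 150 Hz tone standing in for a sustained vowel are assumptions, not details from the paper.

```python
import numpy as np


def add_gaussian_noise(signal, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the requested SNR in dB."""
    signal = np.asarray(signal, dtype=np.float64)
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


# Synthetic stand-in for one second of phonation at 16 kHz (not real data).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2.0 * np.pi * 150.0 * t)
noisy = add_gaussian_noise(clean, snr_db=20.0, rng=rng)
```

Features such as MFCCs would then be computed on both `clean` and `noisy` so that the classifier sees each recording under more than one noise condition.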
[CV-104] GAPNet: A Lightweight Framework for Image and Video Salient Object Detection via Granularity-Aware Paradigm
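[Quick read]: This paper addresses the problem that current salient object detection (SOD) models generally rely on heavyweight backbones, which makes them computationally expensive and hard to deploy on edge devices. The key to the solution is GAPNet, a lightweight network built on the granularity-aware paradigm: the multi-scale decoder side-outputs are supervised with saliency maps of different granularities (high-level outputs focus on coarse object locations, low-level outputs on fine-grained boundaries), and granular pyramid convolution (GPC) and cross-scale attention (CSA) modules fuse high- and low-level features efficiently. A self-attention module on top of the encoder learns global information, improving localization accuracy at negligible computational cost. The method optimizes feature utilization and semantic interpretation, applies appropriate supervision at each processing stage, and sets a new state of the art among lightweight models on image and video SOD.
Link: https://arxiv.org/abs/2508.07585
Authors: Yu-Huan Wu,Wei Liu,Zi-Xuan Zhu,Zizhou Wang,Yong Liu,Liangli Zhen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 7 figures, 6 tables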
Abstract:Recent salient object detection (SOD) models predominantly rely on heavyweight backbones, incurring substantial computational cost and hindering their practical application in various real-world settings, particularly on edge devices. This paper presents GAPNet, a lightweight network built on the granularity-aware paradigm for both image and video SOD. We assign saliency maps of different granularities to supervise the multi-scale decoder side-outputs: coarse object locations for high-level outputs and fine-grained object boundaries for low-level outputs. Specifically, our decoder is built with granularity-aware connections which fuse high-level features of low granularity and low-level features of high granularity, respectively. To support these connections, we design granular pyramid convolution (GPC) and cross-scale attention (CSA) modules for efficient fusion of low-scale and high-scale features, respectively. On top of the encoder, a self-attention module is built to learn global information, enabling accurate object localization with negligible computational cost. Unlike traditional U-Net-based approaches, our proposed method optimizes feature utilization and semantic interpretation while applying appropriate supervision at each processing stage. Extensive experiments show that the proposed method achieves a new state-of-the-art performance among lightweight image and video SOD models. Code is available at this https URL.
[CV-105] Exploiting Layer Normalization Fine-tuning in Visual Transformer Foundation Models for Classification
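[Quick read]: This paper addresses the poorly understood fine-tuning dynamics of LayerNorm layers in Vision Transformers (ViTs) under data scarcity and domain shift. It shows that the shifts in LayerNorm parameters after fine-tuning (LayerNorm shifts) indicate the transition between source and target domains, and that their efficacy depends on how well the target training samples represent the target domain, as quantified by the proposed Fine-tuning Shift Ratio (FSR). The key to the solution is a scalar rescaling factor $\lambda$, negatively correlated with FSR, that adjusts the learned LayerNorm shifts toward the ideal shifts obtained under fully representative data, combined with a cyclic framework that further improves LayerNorm fine-tuning. The method improves performance on natural and pathological images, in both in-distribution (ID) and out-of-distribution (OOD) settings, and is particularly useful for identifying and compensating for under-represented target domains when data are scarce.
Link: https://arxiv.org/abs/2508.07577
Authors: Zhaorui Tan,Tan Pan,Kaizhu Huang,Weimiao Yu,Kai Yao,Chen Jiang,Qiufeng Wang,Anh Nguyen,Xin Guo,Yuan Cheng,Xi Yang
Affiliations: Shanghai Academy of Artificial Intelligence for Science; Xi’an Jiaotong-Liverpool University; University of Liverpool; AI3 Fudan University; Duke Kunshan University; BII, A*STAR; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: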
Abstract:LayerNorm is pivotal in Vision Transformers (ViTs), yet its fine-tuning dynamics under data scarcity and domain shifts remain underexplored. This paper shows that shifts in LayerNorm parameters after fine-tuning (LayerNorm shifts) are indicative of the transitions between source and target domains; its efficacy is contingent upon the degree to which the target training samples accurately represent the target domain, as quantified by our proposed Fine-tuning Shift Ratio (FSR). Building on this, we propose a simple yet effective rescaling mechanism using a scalar $\lambda$ that is negatively correlated to FSR to align learned LayerNorm shifts with those ideal shifts achieved under fully representative data, combined with a cyclic framework that further enhances the LayerNorm fine-tuning. Extensive experiments across natural and pathological images, in both in-distribution (ID) and out-of-distribution (OOD) settings, and various target training sample regimes validate our framework. Notably, OOD tasks tend to yield lower FSR and higher $\lambda$ in comparison to ID cases, especially with scarce data, indicating under-represented target training samples. Moreover, ViTFs fine-tuned on pathological data behave more like ID settings, favoring conservative LayerNorm updates. Our findings illuminate the underexplored dynamics of LayerNorm in transfer learning and provide practical strategies for LayerNorm fine-tuning.
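One plausible reading of the rescaling mechanism is that the LayerNorm shift, meaning the difference between fine-tuned and source parameters, is multiplied by the scalar $\lambda$ and added back to the source parameters. The sketch below shows only that arithmetic. The formula, the function name `rescale_layernorm_shift`, and the sample values are assumptions; how $\lambda$ is derived from FSR is not given in the abstract and is not modelled here.

```python
import numpy as np


def rescale_layernorm_shift(source, finetuned, lam):
    """Return source + lam * (finetuned - source) for one LayerNorm parameter vector.

    Assumed reading of the abstract: lam > 1 amplifies the learned shift,
    lam < 1 makes the update more conservative.
    """
    source = np.asarray(source, dtype=np.float64)
    finetuned = np.asarray(finetuned, dtype=np.float64)
    return source + lam * (finetuned - source)


# Hypothetical LayerNorm scale vectors before and after fine-tuning.
gamma_source = np.array([1.0, 1.0, 1.0])
gamma_finetuned = np.array([1.2, 0.9, 1.0])
gamma_amplified = rescale_layernorm_shift(gamma_source, gamma_finetuned, lam=2.0)
```

With `lam=1.0` the fine-tuned parameters are returned unchanged, and with `lam=0.0` the source parameters are restored, so the scalar interpolates or extrapolates along the learned shift direction.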
[CV-106] Adaptive Cache Enhancement for Test-Time Adaptation of Vision-Language Models
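[Quick read]: This paper addresses the drop in zero-shot generalization of Vision-Language Models (VLMs) under distribution shifts, particularly in downstream tasks without labeled data. Existing Test-Time Adaptation (TTA) methods rely on a dynamic cache of high-confidence or low-entropy samples to improve robustness, but face two challenges: first, distribution shift makes confidence metrics unreliable, so errors accumulate in the cache; second, fixed decision boundaries cannot accommodate substantial distributional changes, leading to suboptimal predictions. The key to the solution is the Adaptive Cache Enhancement (ACE) framework, which selectively stores image embeddings per class using dynamic, class-specific thresholds; the thresholds are initialized from zero-shot statistics and iteratively refined with an exponential moving average and exploration-augmented updates. This yields a more robust cache and class-wise adaptive decision boundaries, significantly improving accuracy and robustness across diverse distributions.
Link: https://arxiv.org/abs/2508.07570
Authors: Khanh-Binh Nguyen,Phuoc-Nguyen Bui,Hyunseung Choo,Duc Thanh Nguyen
Affiliations: 1: University of Science and Technology; 2: Korea Advanced Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, Under review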
Abstract:Vision-language models (VLMs) exhibit remarkable zero-shot generalization but suffer performance degradation under distribution shifts in downstream tasks, particularly in the absence of labeled data. Test-Time Adaptation (TTA) addresses this challenge by enabling online optimization of VLMs during inference, eliminating the need for annotated data. Cache-based TTA methods exploit historical knowledge by maintaining a dynamic memory cache of low-entropy or high-confidence samples, promoting efficient adaptation to out-of-distribution data. Nevertheless, these methods face two critical challenges: (1) unreliable confidence metrics under significant distribution shifts, resulting in error accumulation within the cache and degraded adaptation performance; and (2) rigid decision boundaries that fail to accommodate substantial distributional variations, leading to suboptimal predictions. To overcome these limitations, we introduce the Adaptive Cache Enhancement (ACE) framework, which constructs a robust cache by selectively storing high-confidence or low-entropy image embeddings per class, guided by dynamic, class-specific thresholds initialized from zero-shot statistics and iteratively refined using an exponential moving average and exploration-augmented updates. This approach enables adaptive, class-wise decision boundaries, ensuring robust and accurate predictions across diverse visual distributions. Extensive experiments on 15 diverse benchmark datasets demonstrate that ACE achieves state-of-the-art performance, delivering superior robustness and generalization compared to existing TTA methods in challenging out-of-distribution scenarios.
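The class-specific thresholds described above can be sketched as a small cache that admits a sample only when its confidence clears the current threshold of its predicted class, then nudges that threshold with an exponential moving average. This is a simplified illustration, not the ACE implementation: the class name `ClassThresholdCache`, the momentum, the capacity, and the exact update rule are assumptions, and the exploration-augmented updates mentioned in the abstract are left out.

```python
import numpy as np


class ClassThresholdCache:
    """Per-class cache with EMA-updated confidence thresholds (hypothetical sketch)."""

    def __init__(self, init_thresholds, momentum=0.9, capacity=3):
        # init_thresholds would come from zero-shot statistics in the paper's setting.
        self.thresholds = np.asarray(init_thresholds, dtype=np.float64).copy()
        self.momentum = momentum
        self.capacity = capacity
        self.cache = {c: [] for c in range(len(self.thresholds))}

    def observe(self, embedding, probs):
        """Process one test sample; return True if its embedding was cached."""
        probs = np.asarray(probs, dtype=np.float64)
        cls = int(np.argmax(probs))
        conf = float(probs[cls])
        stored = conf >= self.thresholds[cls]
        if stored:
            items = self.cache[cls]
            items.append((conf, np.asarray(embedding)))
            # Keep only the most confident entries for this class.
            items.sort(key=lambda item: item[0], reverse=True)
            del items[self.capacity:]
        # EMA update of the predicted class's threshold (assumed rule).
        self.thresholds[cls] = (
            self.momentum * self.thresholds[cls] + (1.0 - self.momentum) * conf
        )
        return stored
```

Because each class keeps its own threshold, a class whose predictions are systematically less confident under shift can still populate its cache, which is the intuition behind class-wise decision boundaries.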
[CV-107] Progressive Birds Eye View Perception for Safety-Critical Autonomous Driving: A Comprehensive Survey
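[Quick read]: This paper addresses the safety and reliability of Bird's-Eye-View (BEV) perception for autonomous driving in complex real-world scenarios, particularly under occlusion, adverse weather, and dynamic traffic. The key is a safety-critical review of BEV perception organized along three progressive stages: single-modality vehicle-side perception, multimodal vehicle-side perception, and multi-agent collaborative perception, together with an assessment of how well public datasets support safety and robustness at each stage. The survey also identifies key open-world challenges (open-set recognition, large-scale unlabeled data, sensor degradation, and inter-agent communication latency) and outlines future research directions, including integration with end-to-end autonomous driving systems, embodied intelligence, and Large Language Models (LLMs), to improve the adaptability and trustworthiness of BEV perception in complex real environments.
Link: https://arxiv.org/abs/2508.07560
Authors: Yan Gong,Naibang Wang,Jianli Lu,Xinyu Zhang,Yongsheng Gao,Jie Zhao,Zifan Huang,Haozhi Bai,Nanxin Zeng,Nayu Su,Lei Yang,Ziying Song,Xiaoxi Hu,Xinmin Jiang,Xiaojuan Zhang,Susanto Rahardja
Affiliations: Tsinghua University; Harbin Institute of Technology; Beijing Jiaotong University; Institute for Infocomm Research, A*STAR; Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: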
Abstract:Bird’s-Eye-View (BEV) perception has become a foundational paradigm in autonomous driving, enabling unified spatial representations that support robust multi-sensor fusion and multi-agent collaboration. As autonomous vehicles transition from controlled environments to real-world deployment, ensuring the safety and reliability of BEV perception in complex scenarios - such as occlusions, adverse weather, and dynamic traffic - remains a critical challenge. This survey provides the first comprehensive review of BEV perception from a safety-critical perspective, systematically analyzing state-of-the-art frameworks and implementation strategies across three progressive stages: single-modality vehicle-side, multimodal vehicle-side, and multi-agent collaborative perception. Furthermore, we examine public datasets encompassing vehicle-side, roadside, and collaborative settings, evaluating their relevance to safety and robustness. We also identify key open-world challenges - including open-set recognition, large-scale unlabeled data, sensor degradation, and inter-agent communication latency - and outline future research directions, such as integration with end-to-end autonomous driving systems, embodied intelligence, and large language models.
[CV-108] Splat4D: Diffusion-Enhanced 4D Gaussian Splatting for Temporally and Spatially Consistent Content Creation
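[Quick read]: This paper addresses the core difficulties of generating high-quality 4D content from monocular video: ensuring temporal and spatial consistency, preserving fine detail, and incorporating user guidance effectively. The key to the solution is the Splat4D framework, which models spatial-temporal consistency through multi-view rendering, uses an inconsistency identification mechanism to locate and repair erroneous regions, and applies a video diffusion model together with an asymmetric U-Net for refinement, achieving efficient and controllable 4D content generation while maintaining high fidelity.
Link: https://arxiv.org/abs/2508.07557
Authors: Minghao Yin,Yukang Cao,Songyou Peng,Kai Han
Affiliations: The University of Hong Kong; Nanyang Technological University; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: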
Abstract:Generating high-quality 4D content from monocular videos for applications such as digital humans and AR/VR poses challenges in ensuring temporal and spatial consistency, preserving intricate details, and incorporating user guidance effectively. To overcome these challenges, we introduce Splat4D, a novel framework enabling high-fidelity 4D content generation from a monocular video. Splat4D achieves superior performance while maintaining faithful spatial-temporal coherence by leveraging multi-view rendering, inconsistency identification, a video diffusion model, and an asymmetric U-Net for refinement. Through extensive evaluations on public benchmarks, Splat4D consistently demonstrates state-of-the-art performance across various metrics, underscoring the efficacy of our approach. Additionally, the versatility of Splat4D is validated in various applications such as text/image conditioned 4D generation, 4D human generation, and text-guided content editing, producing coherent outcomes following user instructions.
[CV-109] Decoupled Functional Evaluation of Autonomous Driving Models via Feature Map Quality Scoring
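[Quick read]: This paper addresses the poor interpretability and the difficulty of evaluating modules independently in end-to-end autonomous driving perception and planning models, both of which stem from the lack of explicit supervision signals for intermediate functional modules. The key to the solution is an independent evaluation method based on the similarity between feature maps and ground-truth representations, the Feature Map Convergence Score (FMCS), together with a Dual-Granularity Dynamic Weighted Scoring System (DG-DWSS) that unifies the quantification of feature map quality into a computable Feature Map Quality Score. The paper further develops a CLIP-based Feature Map Quality Evaluation Network (CLIP-FMQE-Net), which combines feature-truth encoders with quality score prediction heads to analyze, in real time, the quality of feature maps produced by functional modules. Experiments show that integrating the evaluation module into training improves 3D object detection (a 3.89% gain in NDS on the NuScenes dataset), supporting its effectiveness in improving feature representation quality and overall model performance.
Link: https://arxiv.org/abs/2508.07552
Authors: Ludan Zhang,Sihan Wang,Yuqi Dai,Shuofei Qiao,Lei He
Affiliations: Tsinghua University; Nankai University; Zhejiang University; SAIC-GM-Wuling Automobile Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: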
Abstract:End-to-end models are emerging as the mainstream in autonomous driving perception and planning. However, the lack of explicit supervision signals for intermediate functional modules leads to opaque operational mechanisms and limited interpretability, making it challenging for traditional methods to independently evaluate and train these modules. Pioneering in the issue, this study builds upon the feature map-truth representation similarity-based evaluation framework and proposes an independent evaluation method based on Feature Map Convergence Score (FMCS). A Dual-Granularity Dynamic Weighted Scoring System (DG-DWSS) is constructed, formulating a unified quantitative metric - Feature Map Quality Score - to enable comprehensive evaluation of the quality of feature maps generated by functional modules. A CLIP-based Feature Map Quality Evaluation Network (CLIP-FMQE-Net) is further developed, combining feature-truth encoders and quality score prediction heads to enable real-time quality analysis of feature maps generated by functional modules. Experimental results on the NuScenes dataset demonstrate that integrating our evaluation module into the training improves 3D object detection performance, achieving a 3.89 percent gain in NDS. These results verify the effectiveness of our method in enhancing feature representation quality and overall model performance.
[CV-110] Adaptive Pseudo Label Selection for Individual Unlabeled Data by Positive and Unlabeled Learning
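[Quick read]: This paper addresses the difficulty of selecting pseudo-labels in medical image segmentation, in particular how to generate high-quality pseudo-labels when annotated data are lacking. The key to the solution is Positive and Unlabeled Learning (PU learning), which builds a binary classifier from positive and unlabeled samples only; it learns, for each unlabeled image, a suitable metric for separating foreground from background regions, which allows effective pseudo-label selection across varied background regions.
Link: https://arxiv.org/abs/2508.07548
Authors: Takehiro Yamane,Itaru Tsuge,Susumu Saito,Ryoma Bise
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: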
Abstract:This paper proposes a novel pseudo-labeling method for medical image segmentation that can perform learning on "individual images" to select effective pseudo-labels. We introduce Positive and Unlabeled Learning (PU learning), which uses only positive and unlabeled data for binary classification problems, to obtain the appropriate metric for discriminating foreground and background regions on each unlabeled image. Our PU learning makes it easy to select pseudo-labels for various background regions. The experimental results show the effectiveness of our method.
[CV-111] Commentary Generation for Soccer Highlights
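[Quick read]: This paper addresses the insufficient fine-grained alignment between soccer video and commentary text, especially for short clips such as highlights, where existing models struggle to achieve precise temporal synchronization. The key to the solution is extending the MatchVoice model to the GOAL dataset and examining training configurations and the effect of different window sizes on zero-shot performance, with the aim of improving the semantic consistency and temporal accuracy of generated commentary on short clips. Experiments indicate that, although MatchVoice generalizes well, techniques from the broader video-language alignment literature are still needed to improve performance further.
Link: https://arxiv.org/abs/2508.07543
Authors: Chidaksh Ravuru
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: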
Abstract:Automated soccer commentary generation has evolved from template-based systems to advanced neural architectures, aiming to produce real-time descriptions of sports events. While frameworks like SoccerNet-Caption laid foundational work, their inability to achieve fine-grained alignment between video content and commentary remains a significant challenge. Recent efforts such as MatchTime, with its MatchVoice model, address this issue through coarse and fine-grained alignment techniques, achieving improved temporal synchronization. In this paper, we extend MatchVoice to commentary generation for soccer highlights using the GOAL dataset, which emphasizes short clips over entire games. We conduct extensive experiments to reproduce the original MatchTime results and evaluate our setup, highlighting the impact of different training configurations and hardware limitations. Furthermore, we explore the effect of varying window sizes on zero-shot performance. While MatchVoice exhibits promising generalization capabilities, our findings suggest the need for integrating techniques from broader video-language domains to further enhance performance. Our code is available at this https URL.
[CV-112] CoT-Pose: Chain-of-Thought Reasoning for 3D Pose Generation from Abstract Prompts ICCV
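[Quick read]: This paper addresses the weakness of current 3D human pose generation models on abstract (high-level) text instructions: existing methods rely mainly on detailed (low-level) prompts that describe joint configurations, which does not match how people naturally express actions and intentions in abstract language. The key to the solution is introducing Chain-of-Thought (CoT) reasoning so that the model can turn abstract text into semantically consistent and plausible 3D human poses, together with an automated data synthesis pipeline that generates triplets of abstract prompts, detailed prompts, and corresponding 3D poses for training, improving the model's understanding of high-level semantics and its generation accuracy.
Link: https://arxiv.org/abs/2508.07540
Authors: Junuk Cha,Jihyeon Kim
Affiliations: KAIST; KT
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCVW’25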
Abstract:Recent advances in multi-modal large language models (MLLMs) and chain-of-thought (CoT) reasoning have led to significant progress in image and text generation tasks. However, the field of 3D human pose generation still faces critical limitations. Most existing text-to-pose models rely heavily on detailed (low-level) prompts that explicitly describe joint configurations. In contrast, humans tend to communicate actions and intentions using abstract (high-level) language. This mismatch results in a practical challenge for deploying pose generation systems in real-world scenarios. To bridge this gap, we introduce a novel framework that incorporates CoT reasoning into the pose generation process, enabling the interpretation of abstract prompts into accurate 3D human poses. We further propose a data synthesis pipeline that automatically generates triplets of abstract prompts, detailed prompts, and corresponding 3D poses for training process. Experimental results demonstrate that our reasoning-enhanced model, CoT-Pose, can effectively generate plausible and semantically aligned poses from abstract textual inputs. This work highlights the importance of high-level understanding in pose generation and opens new directions for reasoning-enhanced approach for human pose generation.
[CV-113] Domain Generalization of Pathological Image Segmentation by Patch-Level and WSI-Level Contrastive Learning
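[Quick read]: This paper addresses domain shift in pathological images caused by variation within a hospital, such as patient characteristics and tissue thickness, rather than the cross-hospital shift that is usually studied. The core of the solution is to identify and exploit intra-hospital domain structure by clustering features of Whole Slide Images (WSIs) from non-tumor regions, and to apply a two-stage contrastive learning strategy, WSI-level and patch-level contrastive learning, that narrows the gap between the feature distributions of different clusters and so improves generalization across domains.
Link: https://arxiv.org/abs/2508.07539
Authors: Yuki Shigeyasu,Shota Harada,Akihiko Yoshizawa,Kazuhiro Terada,Naoki Nakazima,Mariyo Kurata,Hiroyuki Abe,Tetsuo Ushiku,Ryoma Bise
Affiliations: Kyushu Univ.; Nara Medical Univ.; Kyoto Univ. Hosp.; The Univ. of Tokyo Hosp.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: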
Abstract:In this paper, we address domain shifts in pathological images by focusing on shifts within whole slide images (WSIs), such as patient characteristics and tissue thickness, rather than shifts between hospitals. Traditional approaches rely on multi-hospital data, but data collection challenges often make this impractical. Therefore, the proposed domain generalization method captures and leverages intra-hospital domain shifts by clustering WSI-level features from non-tumor regions and treating these clusters as domains. To mitigate domain shift, we apply contrastive learning to reduce feature gaps between WSI pairs from different clusters. The proposed method introduces a two-stage contrastive learning approach, WSI-level and patch-level contrastive learning, to minimize these gaps effectively.
[CV-114] A DICOM Image De-identification Algorithm in the MIDI-B Challenge
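[Quick read]: This paper addresses how to effectively remove personally identifiable information (PII) from medical images before public sharing, protecting patient privacy while preserving the usefulness of the data for research, diagnosis, and treatment. The key to the solution is a rule-based de-identification method for the DICOM format, using pixel masking, date shifting, date hashing, text recognition, text replacement, and text removal, to achieve high-accuracy automated de-identification in strict compliance with the HIPAA privacy rules, the DICOM PS3.15 standard, and TCIA best practices. In the MICCAI 2024 MIDI-B Challenge, the solution correctly executed 99.92% of the required actions and ranked 2nd, supporting its effectiveness and compliance.
Link: https://arxiv.org/abs/2508.07538
Authors: Hongzhu Jiang,Sihan Xie,Zhiyu Wan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures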
Abstract:Image de-identification is essential for the public sharing of medical images, particularly in the widely used Digital Imaging and Communications in Medicine (DICOM) format as required by various regulations and standards, including Health Insurance Portability and Accountability Act (HIPAA) privacy rules, the DICOM PS3.15 standard, and best practices recommended by the Cancer Imaging Archive (TCIA). The Medical Image De-Identification Benchmark (MIDI-B) Challenge at the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024) was organized to evaluate rule-based DICOM image de-identification algorithms with a large dataset of clinical DICOM images. In this report, we explore the critical challenges of de-identifying DICOM images, emphasize the importance of removing personally identifiable information (PII) to protect patient privacy while ensuring the continued utility of medical data for research, diagnostics, and treatment, and provide a comprehensive overview of the standards and regulations that govern this process. Additionally, we detail the de-identification methods we applied - such as pixel masking, date shifting, date hashing, text recognition, text replacement, and text removal - to process datasets during the test phase in strict compliance with these standards. According to the final leaderboard of the MIDI-B challenge, the latest version of our solution algorithm correctly executed 99.92% of the required actions and ranked 2nd out of 10 teams that completed the challenge (from a total of 22 registered teams). Finally, we conducted a thorough analysis of the resulting statistics and discussed the limitations of current approaches and potential avenues for future improvement.
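Date shifting, one of the listed methods, is commonly done by moving every date of a patient back by the same number of days so that intervals between studies are preserved. The sketch below assumes DICOM `DA` values in `YYYYMMDD` form and derives the per-patient offset from a keyed hash of the patient ID. The function names, the secret key, and the 1 to 365 day range are hypothetical choices, and the report's actual rules are not given in the abstract.

```python
import hashlib
import hmac
from datetime import datetime, timedelta


def patient_offset_days(patient_id, secret, max_days=365):
    """Derive a stable offset in [1, max_days] from a keyed hash of the patient ID."""
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % max_days + 1


def shift_dicom_date(value, offset_days):
    """Shift a DICOM DA string (YYYYMMDD) back by offset_days."""
    date = datetime.strptime(value, "%Y%m%d")
    return (date - timedelta(days=offset_days)).strftime("%Y%m%d")


# Fixed offset for illustration; 2024 is a leap year.
print(shift_dicom_date("20240310", 10))  # prints 20240229

# Hypothetical patient ID and key: the same inputs always give the same offset.
offset = patient_offset_days("PATIENT-001", secret=b"example-key")
study = shift_dicom_date("20240310", offset)
followup = shift_dicom_date("20240324", offset)
```

Because both dates move by the same offset, the 14-day gap between `study` and `followup` survives de-identification, which matters for longitudinal research use.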
[CV-115] Enhanced Generative Structure Prior for Chinese Text Image Super-resolution
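[Quick read]: This paper addresses distorted character structure and inconsistent font style in super-resolution (SR) reconstruction of low-resolution (LR) Chinese text images. Because Chinese characters have complex structures, diverse font styles, and irregular layouts, conventional methods that rely on character recognition priors as constraints are often of limited effect. The key to the solution is a structure prior implemented with a codebook-based mechanism inside a StyleGAN model: each code in the codebook represents the structure of a specific Chinese character, while the vector $w$ in StyleGAN controls the character's style (such as typeface, orientation, and location). Together they generate a high-resolution structure prior that is spatially and structurally aligned with the LR characters, restoring clear strokes while preserving character integrity, particularly for real-world Chinese text images with irregular layouts.
Link: https://arxiv.org/abs/2508.07537
Authors: Xiaoming Li,Wangmeng Zuo,Chen Change Loy
Affiliations: S-Lab, Nanyang Technological University, Singapore; Faculty of Computing, Harbin Institute of Technology, Harbin, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: TPAMI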
Abstract:Faithful text image super-resolution (SR) is challenging because each character has a unique structure and usually exhibits diverse font styles and layouts. While existing methods primarily focus on English text, less attention has been paid to more complex scripts like Chinese. In this paper, we introduce a high-quality text image SR framework designed to restore the precise strokes of low-resolution (LR) Chinese characters. Unlike methods that rely on character recognition priors to regularize the SR task, we propose a novel structure prior that offers structure-level guidance to enhance visual quality. Our framework incorporates this structure prior within a StyleGAN model, leveraging its generative capabilities for restoration. To maintain the integrity of character structures while accommodating various font styles and layouts, we implement a codebook-based mechanism that restricts the generative space of StyleGAN. Each code in the codebook represents the structure of a specific character, while the vector w in StyleGAN controls the character’s style, including typeface, orientation, and location. Through the collaborative interaction between the codebook and style, we generate a high-resolution structure prior that aligns with LR characters both spatially and structurally. Experiments demonstrate that this structure prior provides robust, character-specific guidance, enabling the accurate restoration of clear strokes in degraded characters, even for real-world LR Chinese text with irregular layouts. Our code and pre-trained models will be available at this https URL
[CV-116] Enhancing Reliability of Medical Image Diagnosis through Top-rank Learning with Rejection Module
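[Quick read]: This paper addresses the degradation of top-rank learning in medical image processing caused by label noise and class-ambiguous instances, which can push anomalous samples into the top-ranked positions and interfere with learning of the critical diagnostic information. The key to the solution is a new method that integrates a rejection module: the module acts as an additional branch co-optimized with the top-rank loss and uses a rejection function to quantify how far a sample deviates from the norm, identifying outliers and suppressing their influence to improve the reliability and accuracy of medical image diagnosis.
Link: https://arxiv.org/abs/2508.07528
Authors: Xiaotong Ji,Ryoma Bise,Seiichi Uchida
Affiliations: Kyushu University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: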
Abstract:In medical image processing, accurate diagnosis is of paramount importance. Leveraging machine learning techniques, particularly top-rank learning, shows significant promise by focusing on the most crucial instances. However, challenges arise from noisy labels and class-ambiguous instances, which can severely hinder the top-rank objective, as they may be erroneously placed among the top-ranked instances. To address these, we propose a novel approach that enhances top-rank learning by integrating a rejection module. Co-optimized with the top-rank loss, this module identifies and mitigates the impact of outliers that hinder training effectiveness. The rejection module functions as an additional branch, assessing instances based on a rejection function that measures their deviation from the norm. Through experimental validation on a medical dataset, our methodology demonstrates its efficacy in detecting and mitigating outliers, improving the reliability and accuracy of medical image diagnoses.
[CV-117] Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing ICCV2025
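[Quick read]: This paper addresses the lack of effective image editing methods for current Transformer-based diffusion models such as Multimodal Diffusion Transformers (MM-DiT), in particular the difficulty of adapting editing techniques that are mature for the traditional U-Net architecture but do not carry over to the new one. The key to the solution is a structured decomposition of the attention mechanism in MM-DiT: the attention matrix is split into four distinct blocks to reveal their inherent characteristics, and on that basis the paper proposes a robust, prompt-based image editing method that supports edits from global to local and applies to different MM-DiT variants (including few-step models), enabling effective control of the new multimodal diffusion architectures.
Link: https://arxiv.org/abs/2508.07519
Authors: Joonghyuk Shin,Alchan Hwang,Yujin Kim,Daneul Kim,Jaesik Park
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Project webpage: this https URL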
Abstract:Transformer-based diffusion models have recently superseded traditional U-Net architectures, with multimodal diffusion transformers (MM-DiT) emerging as the dominant approach in state-of-the-art models like Stable Diffusion 3 and Flux.1. Previous approaches have relied on unidirectional cross-attention mechanisms, with information flowing from text embeddings to image latents. In contrast, MM-DiT introduces a unified attention mechanism that concatenates input projections from both modalities and performs a single full attention operation, allowing bidirectional information flow between text and image branches. This architectural shift presents significant challenges for existing editing techniques. In this paper, we systematically analyze MM-DiT’s attention mechanism by decomposing attention matrices into four distinct blocks, revealing their inherent characteristics. Through these analyses, we propose a robust, prompt-based image editing method for MM-DiT that supports global to local edits across various MM-DiT variants, including few-step models. We believe our findings bridge the gap between existing U-Net-based methods and emerging architectures, offering deeper insights into MM-DiT’s behavioral patterns.
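The four-block decomposition follows directly from running a single attention over concatenated text and image tokens. The sketch below builds a toy single-head joint attention matrix and slices it by modality. It assumes text tokens come first in the sequence, uses random queries and keys in place of real projections, and the block names (query modality, then key modality) are hypothetical labels rather than the paper's terminology.

```python
import numpy as np


def joint_attention(q, k):
    """Row-wise softmax attention weights over a concatenated token sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def split_attention_blocks(attn, num_text):
    """Slice a joint attention matrix into four blocks.

    Rows are queries and columns are keys; the first num_text tokens are
    assumed to be text and the rest image tokens.
    """
    t = num_text
    return {
        "text_to_text": attn[:t, :t],
        "text_to_image": attn[:t, t:],
        "image_to_text": attn[t:, :t],
        "image_to_image": attn[t:, t:],
    }


rng = np.random.default_rng(0)
num_text, num_image, dim = 2, 4, 8
q = rng.normal(size=(num_text + num_image, dim))
k = rng.normal(size=(num_text + num_image, dim))
blocks = split_attention_blocks(joint_attention(q, k), num_text)
```

In this layout the `image_to_text` block plays a role similar to the cross-attention maps that U-Net-based editing methods manipulate, while the other three blocks have no direct counterpart in a unidirectional cross-attention design.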
[CV-118] From Field to Drone: Domain Drift Tolerant Automated Multi-Species and Damage Plant Semantic Segmentation for Herbicide Trials
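[Quick read]: This paper addresses the low efficiency, high labor cost, and subjectivity of the manual visual assessment used in traditional herbicide field trials, focusing on automating crop and weed species identification and damage classification. The key to the solution is an improved segmentation model that combines a general-purpose self-supervised visual model with hierarchical inference based on botanical taxonomy. It was trained on multi-year (2018-2020) digital image data from Germany and Spain and tested on 2023 digital camera data and on 2024 drone imagery from the United States, Germany, and Spain, a cross-device domain-shift evaluation that supports the model's robustness and generalization across imaging platforms.
Link: https://arxiv.org/abs/2508.07514
Authors: Artzai Picon,Itziar Eguskiza,Daniel Mugica,Javier Romero,Carlos Javier Jimenez,Eric White,Gabriel Do-Lago-Junqueira,Christian Klukas,Ramon Navarra-Mestre
Affiliations: TECNALIA; University of the Basque Country; BASF Corporation; BASF Espanola S.L.; BASF SE
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: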
Abstract:Field trials are vital in herbicide research and development to assess effects on crops and weeds under varied conditions. Traditionally, evaluations rely on manual visual assessments, which are time-consuming, labor-intensive, and subjective. Automating species and damage identification is challenging due to subtle visual differences, but it can greatly enhance efficiency and consistency. We present an improved segmentation model combining a general-purpose self-supervised visual model with hierarchical inference based on botanical taxonomy. Trained on a multi-year dataset (2018-2020) from Germany and Spain using digital and mobile cameras, the model was tested on digital camera data (year 2023) and drone imagery from the United States, Germany, and Spain (year 2024) to evaluate robustness under domain shift. This cross-device evaluation marks a key step in assessing generalization across platforms of the model. Our model significantly improved species identification (F1-score: 0.52 to 0.85, R-squared: 0.75 to 0.98) and damage classification (F1-score: 0.28 to 0.44, R-squared: 0.71 to 0.87) over prior methods. Under domain shift (drone images), it maintained strong performance with moderate degradation (species: F1-score 0.60, R-squared 0.80; damage: F1-score 0.41, R-squared 0.62), where earlier models failed. These results confirm the model’s robustness and real-world applicability. It is now deployed in BASF’s phenotyping pipeline, enabling large-scale, automated crop and weed monitoring across diverse geographies.
[CV-119] FormCoach: Lift Smarter Not Harder
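[Quick read]: This paper addresses the lack of expert form guidance for people who work out at home, that is, how AI can give real-time, accurate feedback on and correction of a user's exercise form. The key to the solution is using vision-language models (VLMs) to turn an ordinary camera into an interactive AI training partner that spots subtle form errors and offers tailored corrections. The authors also build a dataset of 1,700 expert-annotated user-reference video pairs and an automated, rubric-based evaluation pipeline to support standardized research and model comparison in AI-driven fitness coaching.
Link: https://arxiv.org/abs/2508.07501
Authors: Xiaoye Zuo,Nikos Athanasiou,Ginger Delmas,Yiming Huang,Xingyu Fu,Lingjie Liu
Affiliations: University of Pennsylvania; Max Planck Institute for Intelligent Systems; NAVER LABS Europe
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: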
Abstract:Good form is the difference between strength and strain, yet for the fast-growing community of at-home fitness enthusiasts, expert feedback is often out of reach. FormCoach transforms a simple camera into an always-on, interactive AI training partner, capable of spotting subtle form errors and delivering tailored corrections in real time, leveraging vision-language models (VLMs). We showcase this capability through a web interface and benchmark state-of-the-art VLMs on a dataset of 1,700 expert-annotated user-reference video pairs spanning 22 strength and mobility exercises. To accelerate research in AI-driven coaching, we release both the dataset and an automated, rubric-based evaluation pipeline, enabling standardized comparison across models. Our benchmarks reveal substantial gaps compared to human-level coaching, underscoring both the challenges and opportunities in integrating nuanced, context-aware movement analysis into interactive AI systems. By framing form correction as a collaborative and creative process between humans and machines, FormCoach opens a new frontier in embodied AI.
[CV-120] VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding
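[Quick read]: This paper addresses the limitations of existing document retrieval benchmarks in multilingual settings: most cover only English document retrieval, or only multilingual question answering on single-page images, so they cannot assess fine-grained cross-lingual, cross-modal document retrieval. The key to the solution is VisR-Bench, a multilingual benchmark for question-driven multimodal retrieval in long documents. It contains over 35K high-quality QA pairs across 1.2K documents, 16 languages, and three question types (figures, text, and tables), and it includes queries without explicit answers so that models cannot rely on superficial keyword matching, giving a more realistic measure of visual and semantic understanding in complex multilingual documents.
Link: https://arxiv.org/abs/2508.07493
Authors: Jian Chen,Ming Li,Jihyung Kil,Chenguang Wang,Tong Yu,Ryan Rossi,Tianyi Zhou,Changyou Chen,Ruiyi Zhang
Affiliations: University at Buffalo; University of Maryland; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review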
Abstract:Most organizational data in this world are stored as documents, and visual retrieval plays a crucial role in unlocking the collective intelligence from all these documents. However, existing benchmarks focus on English-only document retrieval or only consider multilingual question-answering on a single-page image. To bridge this gap, we introduce VisR-Bench, a multilingual benchmark designed for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents, enabling fine-grained evaluation of multimodal retrieval. VisR-Bench spans sixteen languages with three question types (figures, text, and tables), offering diverse linguistic and question coverage. Unlike prior datasets, we include queries without explicit answers, preventing models from relying on superficial keyword matching. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs, providing insights into their strengths and limitations. Our results show that while MLLMs significantly outperform text-based and multimodal encoder models, they still struggle with structured tables and low-resource languages, highlighting key challenges in multilingual visual retrieval.
[CV-121] Extracting Overlapping Microservices from Monolithic Code via Deep Semantic Embeddings and Graph Neural Network-Based Soft Clustering
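[Quick read]: This paper addresses the increased inter-service coupling and reduced cohesion caused by hard clustering in traditional microservice extraction. Existing methods assign each software component to exactly one microservice, ignoring the engineering practice of deliberately replicating components across services to reduce communication overhead. The key to the solution is the Mo2oM (Monolithic to Overlapping Microservices) framework, which models extraction as a soft clustering problem in which components belong probabilistically to multiple microservices. The framework combines structural dependencies extracted from method call graphs with deep semantic embeddings and uses a graph neural network (GNN) to perform accurate soft clustering, improving decoupling and balance between services while preserving functional relatedness.
Link: https://arxiv.org/abs/2508.07486
Authors: Morteza Ziabakhsh,Kiyan Rezaee,Sadegh Eskandari,Seyed Amir Hossein Tabatabaei,Mohammad M. Ghassemi
Affiliations: 1. University of Tehran; 2. Massachusetts Institute of Technology
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: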
Abstract:Modern software systems are increasingly shifting from monolithic architectures to microservices to enhance scalability, maintainability, and deployment flexibility. Existing microservice extraction methods typically rely on hard clustering, assigning each software component to a single microservice. This approach often increases inter-service coupling and reduces intra-service cohesion. We propose Mo2oM (Monolithic to Overlapping Microservices), a framework that formulates microservice extraction as a soft clustering problem, allowing components to belong probabilistically to multiple microservices. This approach is inspired by expert-driven decompositions, where practitioners intentionally replicate certain software components across services to reduce communication overhead. Mo2oM combines deep semantic embeddings with structural dependencies extracted from method call graphs to capture both functional and architectural relationships. A graph neural network-based soft clustering algorithm then generates the final set of microservices. We evaluate Mo2oM on four open-source monolithic benchmarks and compare it against eight state-of-the-art baselines. Our results demonstrate that Mo2oM achieves improvements of up to 40.97% in structural modularity (balancing cohesion and coupling), 58% in inter-service call percentage (communication overhead), 26.16% in interface number (modularity and decoupling), and 38.96% in non-extreme distribution (service size balance) across all benchmarks.
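The difference between hard and soft clustering described in this abstract can be illustrated with a small sketch. It is not the Mo2oM algorithm: the per-service scores are made-up numbers standing in for whatever a GNN would output, and the component names, the `threshold` parameter, and the function names are all hypothetical. It only shows how probabilistic membership lets one component land in several services.

```python
import math

def softmax(logits):
    """Turn raw per-service scores into membership probabilities."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def overlapping_assignment(logits_per_component, threshold=0.3):
    """Assign each component to every service whose probability reaches
    the threshold (an assumed rule); fall back to the single best service."""
    assignment = {}
    for name, logits in logits_per_component.items():
        probs = softmax(logits)
        chosen = [k for k, p in enumerate(probs) if p >= threshold]
        if not chosen:
            chosen = [probs.index(max(probs))]
        assignment[name] = chosen
    return assignment

# Hypothetical scores for three components over three candidate services.
scores = {
    "OrderService": [4.0, 0.0, 0.0],   # clearly belongs to service 0
    "Logger": [1.0, 1.0, 1.0],         # equally useful everywhere
    "PriceUtil": [2.0, 1.8, -3.0],     # shared by services 0 and 1
}
result = overlapping_assignment(scores)
```

With these scores, a hard clustering would force `Logger` and `PriceUtil` into a single service each, whereas the thresholded soft assignment replicates them across the services that use them.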
[CV-122] Novel View Synthesis with Gaussian Splatting: Impact on Photogrammetry Model Accuracy and Resolution
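[Quick read]: This paper addresses the strong dependence of traditional photogrammetry on input viewpoints and its limited viewpoint generalization in 3D model reconstruction, and asks how generative methods can improve reconstruction quality and view synthesis. The key solution is to adopt and improve Gaussian Splatting: an enhanced open-source repository renders high-quality novel views from new camera poses generated in Blender, and these synthesized views serve as data augmentation for rebuilding the photogrammetry model. Experiments show that the approach improves the geometric detail and texture fidelity of the original photogrammetry model, making it more practical and flexible for extended reality (XR), autonomous driving simulation, and similar scenarios.
Link: https://arxiv.org/abs/2508.07483
Authors: Pranav Chougule
Affiliations: Arizona State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: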
Abstract:In this paper, I present a comprehensive study comparing Photogrammetry and Gaussian Splatting techniques for 3D model reconstruction and view synthesis. I created a dataset of images from a real-world scene and constructed 3D models using both methods. To evaluate the performance, I compared the models using structural similarity index (SSIM), peak signal-to-noise ratio (PSNR), learned perceptual image patch similarity (LPIPS), and lp/mm resolution based on the USAF resolution chart. A significant contribution of this work is the development of a modified Gaussian Splatting repository, which I forked and enhanced to enable rendering images from novel camera poses generated in the Blender environment. This innovation allows for the synthesis of high-quality novel views, showcasing the flexibility and potential of Gaussian Splatting. My investigation extends to an augmented dataset that includes both original ground images and novel views synthesized via Gaussian Splatting. This augmented dataset was employed to generate a new photogrammetry model, which was then compared against the original photogrammetry model created using only the original images. The results demonstrate the efficacy of using Gaussian Splatting to generate novel high-quality views and its potential to improve photogrammetry-based 3D reconstructions. The comparative analysis highlights the strengths and limitations of both approaches, providing valuable information for applications in extended reality (XR), photogrammetry, and autonomous vehicle simulations. Code is available at this https URL.
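Of the image-quality metrics named in this abstract, PSNR is the simplest to state. The following is a minimal sketch of the standard definition, 10·log10(MAX² / MSE), on tiny grayscale images stored as nested lists; the function name and the sample pixel values are illustrative assumptions, not taken from the paper's code.

```python
import math

def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized grayscale
    images given as lists of rows."""
    if len(reference) != len(rendered) or any(
        len(a) != len(b) for a, b in zip(reference, rendered)
    ):
        raise ValueError("images must have the same shape")
    squared_errors = [
        (a - b) ** 2
        for row_a, row_b in zip(reference, rendered)
        for a, b in zip(row_a, row_b)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 2x2 ground-truth image and a rendering with small errors.
reference = [[0, 255], [255, 0]]
rendered = [[0, 245], [255, 10]]
score = psnr(reference, rendered)
```

Higher values mean the rendered view is closer to the reference; SSIM and LPIPS, also used in the study, capture structural and perceptual similarity that PSNR does not.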
[CV-123] AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning
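[Quick read]: This paper addresses the fact that current audio-visual (AV) benchmarks measure only final-answer accuracy and ignore the model's reasoning process, which makes it hard to tell genuine understanding from correct answers reached through flawed logic or hallucination. The authors propose the AURA (Audio-visual Understanding and Reasoning Assessment) benchmark. Its key design is a set of tasks in six cognitive domains (causality, timbre and pitch, tempo and AV synchronization, unanswerability, implicit distractions, and skill profiling) that can be answered only by fusing information across modalities, ruling out single-modality shortcuts. They also introduce the AuraScore metric, which decomposes reasoning into Factual Consistency (whether the reasoning is grounded in perceptual evidence) and Core Inference (whether each reasoning step is logically valid), enabling fine-grained evaluation of multimodal reasoning quality. Experiments show that although mainstream models reach up to 92% accuracy on some AURA tasks, their Factual Consistency and Core Inference scores fall below 45%, revealing a significant reasoning gap in current AV models.
Link: https://arxiv.org/abs/2508.07470
Authors: Siminfar Samakoush Galougah,Rishie Raj,Sanjoy Chowdhury,Sayan Nag,Ramani Duraiswami
Affiliations: University of Maryland, College Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: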
Abstract:Current audio-visual (AV) benchmarks focus on final answer accuracy, overlooking the underlying reasoning process. This makes it difficult to distinguish genuine comprehension from correct answers derived through flawed reasoning or hallucinations. To address this, we introduce AURA (Audio-visual Understanding and Reasoning Assessment), a benchmark for evaluating the cross-modal reasoning capabilities of Audio-Visual Large Language Models (AV-LLMs) and Omni-modal Language Models (OLMs). AURA includes questions across six challenging cognitive domains, such as causality, timbre and pitch, tempo and AV synchronization, unanswerability, implicit distractions, and skill profiling, explicitly designed to be unanswerable from a single modality. This forces models to construct a valid logical path grounded in both audio and video, setting AURA apart from AV datasets that allow uni-modal shortcuts. To assess reasoning traces, we propose a novel metric, AuraScore, which addresses the lack of robust tools for evaluating reasoning fidelity. It decomposes reasoning into two aspects: (i) Factual Consistency - whether reasoning is grounded in perceptual evidence, and (ii) Core Inference - the logical validity of each reasoning step. Evaluations of SOTA models on AURA reveal a critical reasoning gap: although models achieve high accuracy (up to 92% on some tasks), their Factual Consistency and Core Inference scores fall below 45%. This discrepancy highlights that models often arrive at correct answers through flawed logic, underscoring the need for our benchmark and paving the way for more robust multimodal evaluation.
[CV-124] Health Care Waste Classification Using Deep Learning Aligned with Nepal's Bin Color Guidelines
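[Quick read]: This paper addresses the challenge of managing health care waste (HCW) as the number of health care facilities in Nepal grows, in particular the environmental contamination, spread of infectious disease, and health risks to waste handlers caused by improper segregation and disposal. The key to the solution is classifying HCW with several deep learning models (ResNeXt-50, EfficientNet-B0, MobileNetV3-S, YOLOv8-n, and YOLOv5-s), evaluated with stratified K-fold cross-validation. YOLOv5-s achieved the best accuracy (95.06%); although its inference is slightly slower than YOLOv8-n, it remains deployable, and it was integrated into a web platform that maps predictions to bin colors under Nepal's HCW management standards for public use.
Link: https://arxiv.org/abs/2508.07450
Authors: Suman Kunwar,Prabesh Rai
Affiliations: DWaste (USA); Lambton College (Canada)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 5 figures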
Abstract:The increasing number of health care facilities in Nepal has added to the challenges of managing health care waste (HCW). Improper segregation and disposal of HCW leads to contamination and the spread of infectious diseases, and puts waste handlers at risk. This study benchmarks state-of-the-art waste classification models (ResNeXt-50, EfficientNet-B0, MobileNetV3-S, YOLOv8-n, and YOLOv5-s) using stratified K-fold cross-validation with 5 folds on combined HCW data. YOLOv5-s achieved the highest accuracy, 95.06%, but was a few milliseconds slower in inference than YOLOv8-n. EfficientNet-B0 showed promising results with 93.22% accuracy but had the longest inference time. A repetitive ANOVA was performed to test for statistical significance, and the best-performing model (YOLOv5-s) was deployed to the web, with predictions mapped to bin colors under Nepal’s HCW management standards, for public use. Further work on the data and on localized context is suggested.
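The stratified 5-fold evaluation mentioned in this abstract keeps class proportions roughly equal in every fold. Below is a minimal pure-Python sketch of that idea under simple assumptions: the class names and counts are invented for illustration, and samples of each class are dealt round-robin into folds without shuffling. It is not the authors' pipeline.

```python
from collections import defaultdict

def stratified_folds(labels, num_folds=5):
    """Split sample indices into folds so that each class is spread
    as evenly as possible across the folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(num_folds)]
    for indices in by_class.values():
        # Deal the samples of one class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % num_folds].append(index)
    return folds

# Hypothetical imbalanced label list: 10 "sharps" and 5 "infectious" items.
labels = ["sharps"] * 10 + ["infectious"] * 5
folds = stratified_folds(labels, num_folds=5)
```

Each fold is then held out once as the test set while the remaining four are used for training, so every model is scored on all samples with the same class balance.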
[CV-125] Leveraging Learning Bias for Noisy Anomaly Detection
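[Quick read]: This paper addresses fully unsupervised image anomaly detection (FUIAD), where the training data may contain unlabeled anomalies. Conventional methods assume clean training data, but real-world contamination leads models to treat anomalies as normal patterns, degrading detection. The key to the solution is systematically exploiting the model's inherent learning bias, which arises from (1) the statistical dominance of normal samples, which drives the model to learn stable normal patterns before sparse anomalies, and (2) feature-space divergence: normal data show high intra-class consistency while anomalies are highly diverse, producing unstable model responses. On this basis the authors propose a two-stage framework. Stage 1 partitions the training set into subsets, trains sub-models, and aggregates cross-model anomaly scores to filter a purified training set; stage 2 trains the final detector on it. Experiments show superior anomaly detection and localization under different noise conditions and strong robustness to contamination. The design is model-agnostic, compatible with a range of unsupervised backbones, and suited to real-world settings with imperfect training data.
Link: https://arxiv.org/abs/2508.07441
Authors: Yuxin Zhang,Yunkang Cao,Yuqi Cheng,Yihan Sun,Weiming Shen
Affiliations: Huazhong University of Science and Technology; State Key Laboratory of Intelligent Manufacturing Equipment and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: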
Abstract:This paper addresses the challenge of fully unsupervised image anomaly detection (FUIAD), where training data may contain unlabeled anomalies. Conventional methods assume anomaly-free training data, but real-world contamination leads models to absorb anomalies as normal, degrading detection performance. To mitigate this, we propose a two-stage framework that systematically exploits inherent learning bias in models. The learning bias stems from: (1) the statistical dominance of normal samples, driving models to prioritize learning stable normal patterns over sparse anomalies, and (2) feature-space divergence, where normal data exhibit high intra-class consistency while anomalies display high diversity, leading to unstable model responses. Leveraging the learning bias, stage 1 partitions the training set into subsets, trains sub-models, and aggregates cross-model anomaly scores to filter a purified dataset. Stage 2 trains the final detector on this dataset. Experiments on the Real-IAD benchmark demonstrate superior anomaly detection and localization performance under different noise conditions. Ablation studies further validate the framework’s contamination resilience, emphasizing the critical role of learning bias exploitation. The model-agnostic design ensures compatibility with diverse unsupervised backbones, offering a practical solution for real-world scenarios with imperfect training data. Code is available at this https URL.
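Stage 1 of the framework described in this abstract (partition, train sub-models, aggregate cross-model scores, filter) can be sketched on toy data. Everything concrete here is an assumption for illustration: samples are single numbers rather than images, each "sub-model" is just the median of its subset, and `drop_fraction` is a hypothetical parameter. The paper's actual sub-models and scoring rule are not reproduced.

```python
import statistics

def purify(samples, num_subsets=3, drop_fraction=0.2):
    """Return the indices kept after dropping the samples that other
    sub-models, on average, find most anomalous."""
    # Partition sample indices round-robin into subsets.
    subsets = [list(range(k, len(samples), num_subsets)) for k in range(num_subsets)]
    # "Train" one toy sub-model per subset: the median feature value.
    centers = [statistics.median(samples[i] for i in idx) for idx in subsets]
    scores = []
    for i, x in enumerate(samples):
        # Score each sample only with sub-models that never saw it.
        distances = [abs(x - c) for k, c in enumerate(centers) if i % num_subsets != k]
        scores.append(sum(distances) / len(distances))
    num_keep = len(samples) - int(round(drop_fraction * len(samples)))
    ranked = sorted(range(len(samples)), key=lambda i: scores[i])
    return sorted(ranked[:num_keep])

# Hypothetical contaminated training set: mostly values near 1.0,
# with two anomalies at indices 4 and 8.
samples = [1.0, 1.1, 0.9, 1.05, 9.0, 0.95, 1.02, 0.98, -7.0, 1.01]
kept = purify(samples)
```

Because normal samples dominate every subset, each toy sub-model settles near the normal value, so the two contaminating samples receive the largest aggregated scores and are filtered out before a final detector would be trained on `kept`.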
[CV-126] Freeze and Reveal: Exposing Modality Bias in Vision-Language Models
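[Quick read]: This paper addresses gender bias in vision-language models, specifically separating the contributions of the vision and text modalities so that bias can be mitigated more precisely. The key solutions are: first, targeted debiasing of the vision and text backbones with Counterfactual Data Augmentation (CDA) and Task Vector methods; second, a new metric, Degree of Stereotypicality, and a lightweight data augmentation method built on it, DAUDoS (Data Augmentation Using Degree of Stereotypicality), which reduces bias using only about one-third of the training data while also improving the model's accuracy at identifying gender in images. Experiments indicate that in CLIP the vision encoder is the more biased component, whereas in PaliGemma2 it is the text encoder, which informs targeted bias mitigation in future multimodal systems.
Link: https://arxiv.org/abs/2508.07432
Authors: Vivek Hruday Kavuri,Vysishtya Karanam,Venkata Jahnavi Venkamsetty,Kriti Madumadukala,Lakshmipathi Balaji Darur,Ponnurangam Kumaraguru
Affiliations: IIIT Hyderabad
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: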
Abstract:Vision Language Models achieve impressive multi-modal performance but often inherit gender biases from their training data. This bias may come from both the vision and text modalities. In this work, we dissect the contributions of vision and text backbones to these biases by applying targeted debiasing using Counterfactual Data Augmentation and Task Vector methods. Inspired by data-efficient approaches in hate-speech classification, we introduce a novel metric, Degree of Stereotypicality, and a corresponding debiasing method, Data Augmentation Using Degree of Stereotypicality (DAUDoS), to reduce bias with minimal computational cost. We curate a gender-annotated dataset and evaluate all methods on the VisoGender benchmark to quantify improvements and identify the dominant source of bias. Our results show that CDA reduces the gender gap by 6% and DAUDoS by 3%, with DAUDoS using only one-third of the data. Both methods also improve the model’s ability to correctly identify gender in images by 3%, with DAUDoS achieving this improvement using only about one-third of the training data. From our experiments, we observed that CLIP’s vision encoder is more biased whereas PaliGemma2’s text encoder is more biased. By identifying whether bias stems more from vision or text encoders, our work enables more targeted and effective bias mitigation strategies in future multi-modal systems.
[CV-127] CLUE: Leveraging Low-Rank Adaptation to Capture Latent Uncovered Evidence for Image Forgery Localization
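[Quick read]: This paper addresses the crisis of digital media authenticity caused by the spread of image editing tools and generative AI, which makes forgeries increasingly realistic and hard to detect. The key to the solution is CLUE (Capture Latent Uncovered Evidence), a framework that uses Low-Rank Adaptation (LoRA) to reconfigure Stable Diffusion 3 (SD3) efficiently into a high-fidelity forgery localization tool. Specifically, CLUE uses SD3's Rectified Flow (RF) mechanism to inject noise of varying intensity into the latent space, steering the LoRA-tuned denoising process to amplify the subtle statistical inconsistencies that forgeries introduce. It also incorporates contextual features from the image encoder of the Segment Anything Model (SAM) to localize the boundaries of forged regions more precisely, improving detection performance and robustness.
Link: https://arxiv.org/abs/2508.07413
Authors: Youqi Wang,Shunquan Tan,Rongxuan Peng,Bin Li,Jiwu Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: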
Abstract:The increasing accessibility of image editing tools and generative AI has led to a proliferation of visually convincing forgeries, compromising the authenticity of digital media. In this paper, in addition to leveraging distortions from conventional forgeries, we repurpose the mechanism of a state-of-the-art (SOTA) text-to-image synthesis model by exploiting its internal generative process, turning it into a high-fidelity forgery localization tool. To this end, we propose CLUE (Capture Latent Uncovered Evidence), a framework that employs Low-Rank Adaptation (LoRA) to parameter-efficiently reconfigure Stable Diffusion 3 (SD3) as a forensic feature extractor. Our approach begins with the strategic use of SD3’s Rectified Flow (RF) mechanism to inject noise at varying intensities into the latent representation, thereby steering the LoRA-tuned denoising process to amplify subtle statistical inconsistencies indicative of a forgery. To complement the latent analysis with high-level semantic context and precise spatial details, our method incorporates contextual features from the image encoder of the Segment Anything Model (SAM), which is parameter-efficiently adapted to better trace the boundaries of forged regions. Extensive evaluations demonstrate CLUE’s SOTA generalization performance, significantly outperforming prior methods. Furthermore, CLUE shows superior robustness against common post-processing attacks and Online Social Networks (OSNs). Code is publicly available at this https URL.
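The phrase "inject noise at varying intensities into the latent representation" can be made concrete with the linear interpolation commonly used in rectified flow formulations, x_t = (1 - t)·x_0 + t·ε with ε drawn from a standard Gaussian. The sketch below applies it to a tiny list standing in for a latent; how CLUE chooses its noise levels and what it does with the noised latents are not shown, and all names and values are illustrative assumptions.

```python
import random

def inject_noise(latent, t, rng):
    """Blend a latent vector with Gaussian noise:
    x_t = (1 - t) * x_0 + t * noise, with t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    return [(1.0 - t) * x + t * n for x, n in zip(latent, noise)]

rng = random.Random(0)                    # fixed seed for repeatability
latent = [0.5, -1.2, 2.0, 0.0]            # hypothetical 4-value latent
noise_levels = (0.0, 0.25, 0.5, 1.0)      # assumed intensities
noisy_versions = {t: inject_noise(latent, t, rng) for t in noise_levels}
```

At t = 0 the latent is returned unchanged and at t = 1 it is pure noise, so intermediate values of t control how much of the original signal the denoiser has to work with.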
[CV-128] CharacterShot: Controllable and Consistent 4D Character Animation
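[Quick read]: This paper addresses how to generate high-quality, consistent 4D character animation (a dynamic 3D character representation that is consistent across space and time) from a single reference image and a 2D pose sequence. Existing methods are limited in control flexibility, multi-view consistency, and animation stability, which makes efficient generation of personalized characters difficult. The key to the solution is the CharacterShot framework. It first pretrains a strong 2D character animation model on a DiT (Diffusion Transformer) architecture that accepts any 2D pose sequence as the control signal. It then lifts the 2D model to 3D by introducing a dual-attention module and a camera prior, generating multi-view videos with spatial-temporal and spatial-view consistency. Finally, a neighbor-constrained 4D Gaussian splatting optimization reconstructs a continuous and stable 4D character representation from the multi-view videos. The approach improves generation quality and controllability, as validated on the authors' large-scale Character4D dataset and CharacterBench benchmark.
Link: https://arxiv.org/abs/2508.07409
Authors: Junyao Gao,Jiaxing Li,Wenran Liu,Yanhong Zeng,Fei Shen,Kai Chen,Yanan Sun,Cairong Zhao
Affiliations: Tongji University; Shanghai AI Lab; Nanyang Technological University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 10 figures. Code at this https URL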
Abstract:In this paper, we propose CharacterShot, a controllable and consistent 4D character animation framework that enables any individual designer to create dynamic 3D characters (i.e., 4D character animation) from a single reference character image and a 2D pose sequence. We begin by pretraining a powerful 2D character animation model based on a cutting-edge DiT-based image-to-video model, which allows for any 2D pose sequence as the controllable signal. We then lift the animation model from 2D to 3D by introducing a dual-attention module together with a camera prior to generate multi-view videos with spatial-temporal and spatial-view consistency. Finally, we employ a novel neighbor-constrained 4D Gaussian splatting optimization on these multi-view videos, resulting in continuous and stable 4D character representations. Moreover, to improve character-centric performance, we construct a large-scale dataset Character4D, containing 13,115 unique characters with diverse appearances and motions, rendered from multiple viewpoints. Extensive experiments on our newly constructed benchmark, CharacterBench, demonstrate that our approach outperforms current state-of-the-art methods. Code, models, and datasets will be publicly available at this https URL.
[CV-129] AgriVLN: Vision-and-Language Navigation for Agricultural Robots
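[Quick read]: This paper addresses the limited navigation ability of agricultural robots in complex farmland: existing systems rely on manual operation or fixed rails, resulting in poor mobility and adaptability. To enable autonomous navigation from natural language instructions, the authors propose the Agriculture to Agriculture (A2A) benchmark, with 1,560 navigation episodes across six real agricultural scene types; all RGB videos are captured by a front-facing camera on a quadruped robot at a height of 0.38 meters, matching practical deployment conditions. The key to the solution is the AgriVLN baseline, built on a Vision-Language Model (VLM) with carefully designed prompt templates, which lets the robot understand both the instructions and the agricultural environment and generate low-level control actions. A Subtask List (STL) instruction decomposition module further eases the problem of tracking progress through long instructions, raising Success Rate (SR) from 0.33 to 0.47.
Link: https://arxiv.org/abs/2508.07406
Authors: Xiaobei Zhao,Xingqi Lyu,Xiang Li
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: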
Abstract:Agricultural robots have emerged as powerful assistants in agricultural tasks; nevertheless, they still rely heavily on manual operation or untransportable rail tracks for movement, resulting in limited mobility and poor adaptability. Vision-and-Language Navigation (VLN) enables robots to navigate to target destinations following natural language instructions, demonstrating strong performance in several domains. However, none of the existing benchmarks or methods is specifically designed for agricultural scenes. To bridge this gap, we propose the Agriculture to Agriculture (A2A) benchmark, containing 1,560 episodes across six diverse agricultural scenes, in which all realistic RGB videos are captured by a front-facing camera on a quadruped robot at a height of 0.38 meters, aligning with practical deployment conditions. Meanwhile, we propose the Vision-and-Language Navigation for Agricultural Robots (AgriVLN) baseline, based on a Vision-Language Model (VLM) prompted with carefully crafted templates, which can understand both the given instructions and the agricultural environment to generate appropriate low-level actions for robot control. When evaluated on A2A, AgriVLN performs well on short instructions but struggles with long instructions, because it often fails to track which part of the instruction is currently being executed. To address this, we further propose the Subtask List (STL) instruction decomposition module and integrate it into AgriVLN, improving Success Rate (SR) from 0.33 to 0.47. We additionally compare AgriVLN with several existing VLN methods, demonstrating state-of-the-art performance in the agricultural domain.
[CV-130] ForensicsSAM: Toward Robust and Unified Image Forgery Detection and Localization Resisting to Adversarial Attack
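[Quick read]: This paper addresses the vulnerability to adversarial attacks of vision foundation models adapted with parameter-efficient fine-tuning (PEFT) for image forgery detection and localization (IFDL). Existing methods adapt well to the downstream task but overlook their lack of robustness to highly transferable adversarial images that can be generated from the upstream model alone, which significantly degrades IFDL performance. The key to the solution is the ForensicsSAM framework, built on three designs: (1) forgery experts, always activated and shared across inputs, are injected into each transformer block to improve the frozen image encoder's sensitivity to forgery artifacts; (2) a lightweight adversary detector learns structured, task-specific artifacts in the RGB domain to identify a range of attack methods reliably; (3) adaptively activated adversary experts are injected into the global attention layers and MLP modules and, guided by the detector's output, progressively correct the feature shifts induced by adversarial noise, improving robustness without interfering with clean samples.
Link: https://arxiv.org/abs/2508.07402
Authors: Rongxuan Peng,Shunquan Tan,Chenqi Kong,Anwei Luo,Alex C. Kot,Jiwu Huang
Affiliations: Shenzhen Key Laboratory of Media Security, Faculty of Electronic and Information Engineering, Shenzhen University, China; Rapid-Rich Object Search (ROSE) Lab, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore; Rapid-Rich Object Search Lab, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore; Guangdong Laboratory of Machine Perception and Intelligent Computing, Faculty of Engineering, Shenzhen MSU-BIT University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: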
Abstract:Parameter-efficient fine-tuning (PEFT) has emerged as a popular strategy for adapting large vision foundation models, such as the Segment Anything Model (SAM) and LLaVA, to downstream tasks like image forgery detection and localization (IFDL). However, existing PEFT-based approaches overlook their vulnerability to adversarial attacks. In this paper, we show that highly transferable adversarial images can be crafted solely via the upstream model, without accessing the downstream model or training data, significantly degrading the IFDL performance. To address this, we propose ForensicsSAM, a unified IFDL framework with built-in adversarial robustness. Our design is guided by three key ideas: (1) To compensate for the lack of forgery-relevant knowledge in the frozen image encoder, we inject forgery experts into each transformer block to enhance its ability to capture forgery artifacts. These forgery experts are always activated and shared across all input images. (2) To detect adversarial images, we design a lightweight adversary detector that learns to capture structured, task-specific artifacts in the RGB domain, enabling reliable discrimination across various attack methods. (3) To resist adversarial attacks, we inject adversary experts into the global attention layers and MLP modules to progressively correct feature shifts induced by adversarial noise. These adversary experts are adaptively activated by the adversary detector, thereby avoiding unnecessary interference with clean images. Extensive experiments across multiple benchmarks demonstrate that ForensicsSAM achieves superior resistance to various adversarial attack methods, while also delivering state-of-the-art performance in image-level forgery detection and pixel-level forgery localization. The resource is available at this https URL.
[CV-131] LET-US: Long Event-Text Understanding of Scenes
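[Quick read]: This paper addresses two problems that existing multimodal large language models (MLLMs) face with long event streams: they struggle to interpret event-stream data, and their ability to model long sequences is limited. The key to the solution is the LET-US framework, which uses an adaptive compression mechanism to reduce the volume of input events substantially while preserving critical visual details, and a two-stage optimization strategy that progressively improves the model's understanding of event-based scenes. It also uses text-guided cross-modal queries for feature reduction, combined with hierarchical clustering and similarity computation, to extract the most representative event features, enabling semantic understanding and cross-modal reasoning over long event streams.
Link: https://arxiv.org/abs/2508.07401
Authors: Rui Chen,Xingyu Chen,Shaoan Wang,Shihan Kong,Junzhi Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: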
Abstract:Event cameras output event streams as sparse, asynchronous data with microsecond-level temporal resolution, enabling visual perception with low latency and a high dynamic range. While existing Multimodal Large Language Models (MLLMs) have achieved significant success in understanding and analyzing RGB video content, they either fail to interpret event streams effectively or remain constrained to very short sequences. In this paper, we introduce LET-US, a framework for long event-stream–text comprehension that employs an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details. LET-US thus establishes a new frontier in cross-modal inferential understanding over extended event sequences. To bridge the substantial modality gap between event streams and textual representations, we adopt a two-stage optimization paradigm that progressively equips our model with the capacity to interpret event-based scenes. To handle the voluminous temporal information inherent in long event streams, we leverage text-guided cross-modal queries for feature reduction, augmented by hierarchical clustering and similarity computation to distill the most representative event features. Moreover, we curate and construct a large-scale event-text aligned dataset to train our model, achieving tighter alignment of event features within the LLM embedding space. We also develop a comprehensive benchmark covering a diverse set of tasks – reasoning, captioning, classification, temporal localization and moment retrieval. Experimental results demonstrate that LET-US outperforms prior state-of-the-art MLLMs in both descriptive accuracy and semantic comprehension on long-duration event streams. All datasets, codes, and models will be publicly available.
[CV-132] DIP-GS: Deep Image Prior For Gaussian Splatting Sparse View Recovery
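[Quick read]: This paper addresses the performance drop of 3D Gaussian Splatting (3DGS) in sparse-view reconstruction: when the input views are sparse, cover the scene incompletely, and overlap little, vanilla 3DGS struggles to produce high-quality 3D reconstructions. The key to the solution is introducing the Deep Image Prior (DIP), which exploits the internal structure and patterns of the images and optimizes the 3D Gaussian parameters in a coarse-to-fine manner, achieving robust sparse-view reconstruction without any pre-trained models (such as generative models or depth estimation networks). Experiments show that, among methods that use no pre-trained models, the proposed DIP-GS obtains results competitive with the state of the art on a variety of sparse-view reconstruction tasks.
Link: https://arxiv.org/abs/2508.07372
Authors: Rajaei Khatib,Raja Giryes
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: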
Abstract:3D Gaussian Splatting (3DGS) is a leading 3D scene reconstruction method, obtaining high-quality reconstruction with real-time rendering performance. The main idea behind 3DGS is to represent the scene as a collection of 3D Gaussians, while learning their parameters to fit the given views of the scene. While achieving superior performance in the presence of many views, 3DGS struggles with sparse-view reconstruction, where the input views are sparse, do not fully cover the scene, and have low overlap. In this paper, we propose DIP-GS, a Deep Image Prior (DIP) 3DGS representation. By using the DIP prior, which utilizes internal structure and patterns, in a coarse-to-fine manner, DIP-based 3DGS can operate in scenarios where vanilla 3DGS fails, such as sparse-view recovery. Note that our approach does not use any pre-trained models such as generative models and depth estimation, but rather relies only on the input frames. Among such methods, DIP-GS obtains state-of-the-art (SOTA) competitive results on various sparse-view reconstruction tasks, demonstrating its capabilities.
[CV-133] Training and Inference within 1 Second – Tackle Cross-Sensor Degradation of Real-World Pansharpening with Efficient Residual Feature Tailoring
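[Quick read]: This paper addresses the poor cross-sensor generalization of deep learning methods for pansharpening, a remote sensing image fusion task. Existing remedies such as retraining or zero-shot strategies are either very time-consuming or need extra data, which makes them impractical. The key to the solution is twofold. First, existing models are decomposed into modules to identify a critical feature interface: the point where high-dimensional fused features begin mapping to the channel space of the final image. A Feature Tailor is then inserted at this interface and trained efficiently with physics-aware unsupervised losses to correct cross-sensor degradation at the feature level. Second, the method works patch-wise, training on a subset of test patches and running inference on all patches in parallel, which greatly improves efficiency. The result is strong cross-sensor performance with sub-second training and inference and no external data.
Link: https://arxiv.org/abs/2508.07369
Authors: Tianyu Xin,Jin-Liang Xiao,Zeyu Xia,Shan Yin,Liang-Jian Deng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: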
Abstract:Deep learning methods for pansharpening have advanced rapidly, yet models pretrained on data from a specific sensor often generalize poorly to data from other sensors. Existing methods to tackle such cross-sensor degradation include retraining the model or zero-shot methods, but they are highly time-consuming or even need extra training data. To address these challenges, our method first performs modular decomposition on deep learning-based pansharpening models, revealing a general yet critical interface where high-dimensional fused features begin mapping to the channel space of the final image. A Feature Tailor is then integrated at this interface to address cross-sensor degradation at the feature level, and is trained efficiently with physics-aware unsupervised losses. Moreover, our method operates in a patch-wise manner, training on partial patches and performing parallel inference on all patches to boost efficiency. Our method offers two key advantages: (1) Improved Generalization Ability: it significantly enhances performance in cross-sensor cases. (2) Low Generalization Cost: it achieves sub-second training and inference, requiring only partial test inputs and no external data, whereas prior methods often take minutes or even hours. Experiments on real-world data from multiple datasets demonstrate that our method achieves state-of-the-art quality and efficiency in tackling cross-sensor degradation. For example, training and inference take under 0.2 seconds for a 512×512×8 image and under 3 seconds for a 4000×4000×8 image at the fastest setting on a commonly used RTX 3090 GPU, which is over 100 times faster than zero-shot methods.
[CV-134] GS4Buildings: Prior-Guided Gaussian Splatting for 3D Building Reconstruction
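【Quick Read】: This paper addresses the incomplete reconstructions and reduced geometric accuracy of Gaussian Splatting (GS) methods when reconstructing building surfaces in large-scale, complex urban scenes, especially under frequent occlusions. The key to the solution is a prior-guided approach built on semantic 3D building models: Gaussians are initialized directly from low-level Level of Detail (LoD)2 semantic building models, and prior depth and normal maps generated from the planar building geometry are incorporated into the optimization to strengthen surface consistency and structural accuracy. An optional building-focused mode restricts reconstruction to building regions, cutting the number of Gaussian primitives by 71.8% and yielding a more efficient, compact representation.
Link: https://arxiv.org/abs/2508.07355
Authors: Qilin Zhang,Olaf Wysocki,Boris Jutzi
Affiliation: Photogrammetry and Remote Sensing, TUM School of Engineering and Design, Technical University of Munich (TUM), Munich, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at ISPRS 3D GeoInfo Smart Data, Smart Cities 2025, Kashiwa, Japan. To appear in the ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences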
Abstract:Recent advances in Gaussian Splatting (GS) have demonstrated its effectiveness in photo-realistic rendering and 3D reconstruction. Among these, 2D Gaussian Splatting (2DGS) is particularly suitable for surface reconstruction due to its flattened Gaussian representation and integrated normal regularization. However, its performance often degrades in large-scale and complex urban scenes with frequent occlusions, leading to incomplete building reconstructions. We propose GS4Buildings, a novel prior-guided Gaussian Splatting method leveraging the ubiquity of semantic 3D building models for robust and scalable building surface reconstruction. Instead of relying on traditional Structure-from-Motion (SfM) pipelines, GS4Buildings initializes Gaussians directly from low-level Level of Detail (LoD)2 semantic 3D building models. Moreover, we generate prior depth and normal maps from the planar building geometry and incorporate them into the optimization process, providing strong geometric guidance for surface consistency and structural accuracy. We also introduce an optional building-focused mode that limits reconstruction to building regions, achieving a 71.8% reduction in Gaussian primitives and enabling a more efficient and compact representation. Experiments on urban datasets demonstrate that GS4Buildings improves reconstruction completeness by 20.5% and geometric accuracy by 32.8%. These results highlight the potential of semantic building model integration to advance GS-based reconstruction toward real-world urban applications such as smart cities and digital twins. Our project is available: this https URL.
[CV-135] SODiff: Semantic-Oriented Diffusion Model for JPEG Compression Artifacts Removal
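【Quick Read】: This paper addresses the severe visual artifacts that JPEG compression introduces at high compression ratios, and in particular the tendency of existing deep learning-based restoration methods to miss complex texture details and produce over-smoothed outputs. The key to the solution is SODiff, a semantic-oriented one-step diffusion model. A semantic-aligned image prompt extractor (SAIPE) supplies semantic guidance by projecting low-quality (LQ) image features into an embedding space semantically aligned with that of the text encoder, while preserving the details needed for faithful reconstruction. In addition, a quality factor-aware time predictor implicitly learns the compression quality factor (QF) of the LQ image and adaptively selects the optimal denoising start timestep, improving restoration quality.
Link: https://arxiv.org/abs/2508.07346
Authors: Tingyu Yang,Jue Gong,Jinpei Guo,Wenbo Li,Yong Guo,Yulun Zhang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 5 figures. The code will be available at this https URL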
Abstract:JPEG, as a widely used image compression standard, often introduces severe visual artifacts when achieving high compression ratios. Although existing deep learning-based restoration methods have made considerable progress, they often struggle to recover complex texture details, resulting in over-smoothed outputs. To overcome these limitations, we propose SODiff, a novel and efficient semantic-oriented one-step diffusion model for JPEG artifacts removal. Our core idea is that effective restoration hinges on providing semantic-oriented guidance to the pre-trained diffusion model, thereby fully leveraging its powerful generative prior. To this end, SODiff incorporates a semantic-aligned image prompt extractor (SAIPE). SAIPE extracts rich features from low-quality (LQ) images and projects them into an embedding space semantically aligned with that of the text encoder. Simultaneously, it preserves crucial information for faithful reconstruction. Furthermore, we propose a quality factor-aware time predictor that implicitly learns the compression quality factor (QF) of the LQ image and adaptively selects the optimal denoising start timestep for the diffusion process. Extensive experimental results show that our SODiff outperforms recent leading methods in both visual quality and quantitative metrics. Code is available at: this https URL
[CV-136] CoAR: Concept Injection into Autoregressive Models for Personalized Text-to-Image Generation
【Quick Read】: This paper addresses the efficiency and stability of customized image generation with unified autoregressive (AR) models. Existing methods rely on full fine-tuning or adapters, which makes them computationally costly and prone to overfitting or catastrophic forgetting. The key to the solution is the CoAR framework, which uses a Layerwise Multimodal Context Learning strategy to learn subject-specific representations with very few parameters (<0.05%) while keeping all pre-trained parameters frozen. A regularization scheme preserves the pre-trained distribution and anchors context tokens, improving subject fidelity and re-contextualization, and the framework also supports training-free subject customization in a user-provided style.
Link: https://arxiv.org/abs/2508.07341
Authors: Fangtai Wu,Mushui Liu,Weijie He,Wanggui He,Hao Jiang,Zhao Wang,Yunlong Yu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The unified autoregressive (AR) model excels at multimodal understanding and generation, but its potential for customized image generation remains underexplored. Existing customized generation methods rely on full fine-tuning or adapters, making them costly and prone to overfitting or catastrophic forgetting. In this paper, we propose CoAR, a novel framework for injecting subject concepts into unified AR models while keeping all pre-trained parameters completely frozen. CoAR learns effective, specific subject representations with only a minimal number of parameters using a Layerwise Multimodal Context Learning strategy. To address overfitting and language drift, we further introduce regularization that preserves the pre-trained distribution and anchors context tokens to improve subject fidelity and re-contextualization. Additionally, CoAR supports training-free subject customization in a user-provided style. Experiments demonstrate that CoAR achieves superior performance on both subject-driven personalization and style personalization, while delivering significant gains in computational and memory efficiency. Notably, CoAR tunes less than 0.05% of the parameters while achieving competitive performance compared to recent Proxy-Tuning. Code: this https URL
[CV-137] Planner-Refiner: Dynamic Space-Time Refinement for Vision-Language Alignment in Videos ECAI2025
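【Quick Read】: This paper addresses the challenges of video-language alignment that arise from linguistic complexity, evolving interacting entities and their action chains, and the semantic gap between vision and language. The key to the solution is Planner-Refiner, an iterative framework. The Planner module schedules language guidance by decomposing complex linguistic prompts into short sentence chains. The Refiner module processes each short sentence, a noun-phrase and verb-phrase pair, directing the self-attention of visual tokens across space and then time to achieve efficient single-step refinement. A recurrent system chains these steps and maintains the refined visual token representations, and the final representation feeds task-specific heads. The method substantially narrows the semantic gap and performs especially well on complex prompts.
Link: https://arxiv.org/abs/2508.07330
Authors: Tuyen Tran,Thao Minh Le,Quang-Hung Le,Truyen Tran
Affiliation: Applied Artificial Intelligence Institute, Deakin University, Australia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at ECAI 2025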
Abstract:Vision-language alignment in video must address the complexity of language, evolving interacting entities, their action chains, and semantic gaps between language and vision. This work introduces Planner-Refiner, a framework to overcome these challenges. Planner-Refiner bridges the semantic gap by iteratively refining visual elements’ space-time representation, guided by language until semantic gaps are minimal. A Planner module schedules language guidance by decomposing complex linguistic prompts into short sentence chains. The Refiner processes each short sentence, a noun-phrase and verb-phrase pair, to direct visual tokens’ self-attention across space then time, achieving efficient single-step refinement. A recurrent system chains these steps, maintaining refined visual token representations. The final representation feeds into task-specific heads for alignment generation. We demonstrate Planner-Refiner’s effectiveness on two video-language alignment tasks: Referring Video Object Segmentation and Temporal Grounding with varying language complexity. We further introduce a new MeViS-X benchmark to assess models’ capability with long queries. Superior performance versus state-of-the-art methods on these benchmarks shows the approach’s potential, especially for complex prompts.
[CV-138] RORPCap: Retrieval-based Objects and Relations Prompt for Image Captioning
【Quick Read】: This paper addresses the redundant detection information, difficult GCN construction, and high training costs of conventional image captioning methods that rely on object detectors or Graph Convolutional Networks (GCNs). The key to the solution is RORPCap, a retrieval-based objects and relations prompt that draws rich semantic information from image-text retrieval. An objects and relations extraction model extracts keywords from the image and inserts them into predefined prompt templates; a Mamba-based mapping network quickly maps CLIP image features to visual-text embeddings; and the prompt embeddings and visual-text embeddings are concatenated into text-enriched feature embeddings that are fed to a GPT-2 model for caption generation. On MS-COCO, the method reaches performance comparable to detector-based and GCN-based models with a very short training time, showing its potential as an alternative to them.
Link: https://arxiv.org/abs/2508.07318
Authors: Jinjing Gu,Tianbao Qin,Yuanyuan Pu,Zhengpeng Zhao
Affiliation: Yunnan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Image captioning aims to generate natural language descriptions for input images in an open-form manner. To accurately generate descriptions related to the image, a critical step in image captioning is to identify objects and understand their relations within the image. Modern approaches typically capitalize on object detectors or combine detectors with a Graph Convolutional Network (GCN). However, these models suffer from redundant detection information, difficulty in GCN construction, and high training costs. To address these issues, a Retrieval-based Objects and Relations Prompt for Image Captioning (RORPCap) is proposed, inspired by the fact that image-text retrieval can provide rich semantic information for input images. RORPCap employs an Objects and Relations Extraction Model to extract object and relation words from the image. These words are then incorporated into predefined prompt templates and encoded as prompt embeddings. Next, a Mamba-based mapping network is designed to quickly map image embeddings extracted by CLIP to visual-text embeddings. Finally, the resulting prompt embeddings and visual-text embeddings are concatenated to form textual-enriched feature embeddings, which are fed into a GPT-2 model for caption generation. Extensive experiments conducted on the widely used MS-COCO dataset show that RORPCap requires only 2.6 hours under cross-entropy loss training, achieving 120.5% CIDEr score and 22.0% SPICE score on the “Karpathy” test split. RORPCap achieves comparable performance metrics to detector-based and GCN-based models with the shortest training time and demonstrates its potential as an alternative for image captioning.
[CV-139] DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding
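【Quick Read】: This paper addresses multi-page document understanding in multimodal large language models (MLLMs), which requires both fine-grained visual comprehension and multi-hop reasoning across pages. The key to the solution is a new reinforcement learning framework, Evidence Page-Guided GRPO (EviGRPO), whose evidence-aware reward mechanism steers the model toward a coarse-to-fine reasoning strategy: retrieve the relevant pages first, then generate the answer. This training paradigm yields high-quality models under limited supervision. A two-stage annotation pipeline and a curriculum learning strategy are used to build two datasets, EviBench (a training set) and ArxivFullQA (an evaluation benchmark). The resulting model, DocR1, achieves state-of-the-art results on multi-page tasks while maintaining strong performance on single-page benchmarks.
Link: https://arxiv.org/abs/2508.07313
Authors: Junyu Xiong,Yonghui Wang,Weichao Zhao,Chenyu Liu,Bing Yin,Wengang Zhou,Houqiang Li
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: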
Abstract:Understanding multi-page documents poses a significant challenge for multimodal large language models (MLLMs), as it requires fine-grained visual comprehension and multi-hop reasoning across pages. While prior work has explored reinforcement learning (RL) for enhancing advanced reasoning in MLLMs, its application to multi-page document understanding remains underexplored. In this paper, we introduce DocR1, an MLLM trained with a novel RL framework, Evidence Page-Guided GRPO (EviGRPO). EviGRPO incorporates an evidence-aware reward mechanism that promotes a coarse-to-fine reasoning strategy, guiding the model to first retrieve relevant pages before generating answers. This training paradigm enables us to build high-quality models with limited supervision. To support this, we design a two-stage annotation pipeline and a curriculum learning strategy, based on which we construct two datasets: EviBench, a high-quality training set with 4.8k examples, and ArxivFullQA, an evaluation benchmark with 8.6k QA pairs based on scientific papers. Extensive experiments across a wide range of benchmarks demonstrate that DocR1 achieves state-of-the-art performance on multi-page tasks, while consistently maintaining strong results on single-page benchmarks.
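To make the "evidence-aware reward" idea concrete, here is a minimal sketch of how a reward could combine answer correctness with evidence-page overlap and then be turned into group-relative advantages, as GRPO-style methods do. The exact reward used by EviGRPO is not given in the abstract: the F1 overlap term, the `evidence_weight` of 0.5, the exact-match answer check, and all function names are assumptions for illustration.

```python
from statistics import mean, pstdev


def evidence_f1(predicted_pages, gold_pages):
    """F1 overlap between predicted and gold evidence pages (assumed metric)."""
    predicted, gold = set(predicted_pages), set(gold_pages)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def reward(sample, gold_pages, gold_answer, evidence_weight=0.5):
    """Answer correctness plus a weighted evidence term (hypothetical weighting)."""
    answer_score = 1.0 if sample["answer"] == gold_answer else 0.0
    return answer_score + evidence_weight * evidence_f1(sample["pages"], gold_pages)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to the same question (toy data).
group = [
    {"pages": [3, 7], "answer": "42"},   # right pages, right answer
    {"pages": [3], "answer": "42"},      # partial evidence, right answer
    {"pages": [1], "answer": "17"},      # wrong pages, wrong answer
]
rewards = [reward(s, gold_pages=[3, 7], gold_answer="42") for s in group]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

In this toy setup a response that finds all the evidence pages is ranked above one that answers correctly from partial evidence, which is one way a reward can encourage retrieving pages before answering.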
[CV-140] MobileViCLIP: An Efficient Video-Text Model for Mobile Devices ICCV2025
【Quick Read】: This paper addresses the fact that current video pre-trained models mostly use high-latency ViT architectures and are therefore hard to deploy efficiently on mobile devices. The key to the solution is to introduce temporal structural reparameterization into a lightweight image-text model and train it on a large-scale, high-quality video-text dataset, producing MobileViCLIP, a video-text model that runs on mobile devices with strong zero-shot classification and retrieval capabilities. On mobile devices, MobileViCLIP-Small is 55.4x faster than InternVideo2-L14 at inference while achieving similar zero-shot retrieval performance.
Link: https://arxiv.org/abs/2508.07312
Authors: Min Yang,Zihan Jia,Zhilin Dai,Sheng Guo,Limin Wang
Affiliation: Nanjing University; Ant Group; Shanghai AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025
Abstract:Efficient lightweight neural networks are receiving increasing attention due to their faster inference speed and easier deployment on mobile devices. However, existing video pre-trained models still focus on the common ViT architecture with high latency, and few works attempt to build efficient architectures for mobile devices. This paper bridges this gap by introducing temporal structural reparameterization into an efficient image-text model and training it on a large-scale high-quality video-text dataset, resulting in an efficient video-text model that can run on mobile devices with strong zero-shot classification and retrieval capabilities, termed MobileViCLIP. In particular, in terms of inference speed on mobile devices, our MobileViCLIP-Small is 55.4x faster than InternVideo2-L14 and 6.7x faster than InternVideo2-S14. In terms of zero-shot retrieval performance, our MobileViCLIP-Small obtains similar performance to InternVideo2-L14 and is 6.9% better than InternVideo2-S14 on MSR-VTT. The code is available at this https URL.
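Structural reparameterization in general means training with several parallel linear branches and folding them into a single equivalent operator for inference. The sketch below shows that equivalence on a 1D sequence of per-frame values: a 3-tap temporal filter plus a scaled identity branch collapses into one 3-tap filter. The kernel values, the identity branch, and the helper names are assumptions for illustration; the abstract does not describe the branches MobileViCLIP actually uses.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 1D correlation with an odd-length kernel."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(signal))]


def merge_branches(kernel3, scale1):
    """Fold a scaled-identity branch into the centre tap of a 3-tap kernel."""
    merged = list(kernel3)
    merged[1] += scale1
    return merged


frames = [1.0, 2.0, 4.0, 8.0]            # toy per-frame features
kernel3, scale1 = [0.25, 0.5, 0.25], 1.0  # assumed branch parameters

# Training-time form: two parallel branches summed.
two_branch = [a + scale1 * x
              for a, x in zip(conv1d_same(frames, kernel3), frames)]
# Inference-time form: one merged kernel, same output.
one_branch = conv1d_same(frames, merge_branches(kernel3, scale1))
print(two_branch, one_branch)
```

The merged form performs one filter pass instead of two, which is why this family of techniques is attractive for low-latency deployment.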
[CV-141] MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark
【Quick Read】: This paper addresses catastrophic forgetting in Multimodal Continual Learning, where models learn new tasks incrementally, along with the challenges posed by cross-modal interaction and coordination. The key to the solution is MCITlib, an open-source code library for continual instruction tuning of Multimodal Large Language Models (MLLMs). It currently implements 8 representative algorithms and evaluates them systematically on 2 carefully selected benchmarks, giving the field an extensible and reproducible toolkit.
Link: https://arxiv.org/abs/2508.07307
Authors: Haiyang Guo,Fei Zhu,Hongbo Zhao,Fanhu Zeng,Wenzhuo Liu,Shijie Ma,Da-Han Wang,Xu-Yao Zhang
Affiliation: School of Advanced Interdisciplinary Sciences, UCAS; MAIS, CASIA; Centre for Artificial Intelligence and Robotics, HKISI-CAS; School of Artificial Intelligence, UCAS; FKLPRIU, School of Computer and Information Engineering, Xiamen University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint
Abstract:Continual learning aims to equip AI systems with the ability to continuously acquire and adapt to new knowledge without forgetting previously learned information, similar to human learning. While traditional continual learning methods focusing on unimodal tasks have achieved notable success, the emergence of Multimodal Large Language Models has brought increasing attention to Multimodal Continual Learning tasks involving multiple modalities, such as vision and language. In this setting, models are expected to not only mitigate catastrophic forgetting but also handle the challenges posed by cross-modal interactions and coordination. To facilitate research in this direction, we introduce MCITlib, a comprehensive and constantly evolving code library for continual instruction tuning of Multimodal Large Language Models. In MCITlib, we have currently implemented 8 representative algorithms for Multimodal Continual Instruction Tuning and systematically evaluated them on 2 carefully selected benchmarks. MCITlib will be continuously updated to reflect advances in the Multimodal Continual Learning field. The codebase is released at this https URL.
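[CV-142] DragonFruitQualityNet: A Lightweight Convolutional Neural Network for Real-Time Dragon Fruit Quality Inspection on Mobile Devices
【Quick Read】: This paper addresses the lack of efficient quality inspection for dragon fruit at the pre- and post-harvest stages, with the aim of raising agricultural productivity and reducing post-harvest losses. The key to the solution is DragonFruitQualityNet, a lightweight Convolutional Neural Network (CNN) optimized for real-time quality assessment on mobile devices. It is trained on a diverse dataset of 13,789 images that classifies fruits into four categories: fresh, immature, mature, and defective. The model reaches 93.98% accuracy, which the authors report as outperforming existing methods, and it is embedded in an intuitive mobile application for on-device, real-time inspection in the field, making AI-based quality control more accessible to smallholder farmers.
Link: https://arxiv.org/abs/2508.07306
Authors: Md Zahurul Haquea,Yeahyea Sarker,Muhammed Farhan Sadique Mahi,Syed Jubayer Jaman,Md Robiul Islam
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: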
Abstract:Dragon fruit, renowned for its nutritional benefits and economic value, has experienced rising global demand due to its affordability and local availability. As dragon fruit cultivation expands, efficient pre- and post-harvest quality inspection has become essential for improving agricultural productivity and minimizing post-harvest losses. This study presents DragonFruitQualityNet, a lightweight Convolutional Neural Network (CNN) optimized for real-time quality assessment of dragon fruits on mobile devices. We curated a diverse dataset of 13,789 images, integrating self-collected samples with public datasets (dataset from Mendeley Data), and classified them into four categories: fresh, immature, mature, and defective fruits to ensure robust model training. The proposed model achieves an impressive 93.98% accuracy, outperforming existing methods in fruit quality classification. To facilitate practical adoption, we embedded the model into an intuitive mobile application, enabling farmers and agricultural stakeholders to conduct on-device, real-time quality inspections. This research provides an accurate, efficient, and scalable AI-driven solution for dragon fruit quality control, supporting digital agriculture and empowering smallholder farmers with accessible technology. By bridging the gap between research and real-world application, our work advances post-harvest management and promotes sustainable farming practices.
[CV-143] BEVANet: Bilateral Efficient Visual Attention Network for Real-Time Semantic Segmentation
【Quick Read】: This paper addresses how real-time semantic segmentation can stay computationally efficient while capturing large receptive fields for semantic understanding and refining detailed boundaries. The core of the solution is the Bilateral Efficient Visual Attention Network (BEVANet), a two-branch architecture with four main components: 1) a Large Kernel Attention (LKA) mechanism, realized as Sparse Decomposed Large Separable Kernel Attention (SDLSKA), that extracts both visual and structural features; 2) a Comprehensive Kernel Selection (CKS) mechanism that dynamically adapts the receptive field; 3) a Deep Large Kernel Pyramid Pooling Module (DLKPPM) that combines dilated convolutions with large kernel attention to enrich contextual features; and 4) a Boundary Guided Adaptive Fusion (BGAF) module that fuses spatial and semantic features under boundary guidance, sharpening boundary delineation. On Cityscapes, BEVANet reaches 79.3% mIoU without pretraining and 81.0% mIoU with ImageNet pretraining at 33 FPS.
Link: https://arxiv.org/abs/2508.07300
Authors: Ping-Mao Huang,I-Tien Chao,Ping-Chia Huang,Jia-Wei Liao,Yung-Yu Chuang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Copyright 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Abstract:Real-time semantic segmentation presents the dual challenge of designing efficient architectures that capture large receptive fields for semantic understanding while also refining detailed contours. Vision transformers model long-range dependencies effectively but incur high computational cost. To address these challenges, we introduce the Large Kernel Attention (LKA) mechanism. Our proposed Bilateral Efficient Visual Attention Network (BEVANet) expands the receptive field to capture contextual information and extracts visual and structural features using Sparse Decomposed Large Separable Kernel Attentions (SDLSKA). The Comprehensive Kernel Selection (CKS) mechanism dynamically adapts the receptive field to further enhance performance. Furthermore, the Deep Large Kernel Pyramid Pooling Module (DLKPPM) enriches contextual features by synergistically combining dilated convolutions and large kernel attention. The bilateral architecture facilitates frequent branch communication, and the Boundary Guided Adaptive Fusion (BGAF) module enhances boundary delineation by integrating spatial and semantic features under boundary guidance. BEVANet achieves real-time segmentation at 33 FPS, yielding 79.3% mIoU without pretraining and 81.0% mIoU on Cityscapes after ImageNet pretraining, demonstrating state-of-the-art performance. The code and model are available at this https URL.
[CV-144] SynMatch: Rethinking Consistency in Medical Image Segmentation with Sparse Annotations
【Quick Read】: This paper addresses the performance limits that scarce annotations impose on deep learning-based medical image segmentation. Existing methods typically use strong-weak pseudo supervision to exploit unlabeled data, but they are often held back by inconsistencies between pseudo labels and their corresponding unlabeled images. The key to the solution is SynMatch, a framework that synthesizes images to match the pseudo labels rather than trying to improve the pseudo labels themselves. It uses texture and shape features extracted from the same segmentation model to build highly consistent synthesized-image and pseudo-label pairs, without adding any training parameters for image synthesis. This mitigates the inconsistency problem and improves segmentation under semi-supervised, weakly-supervised, and barely-supervised settings.
Link: https://arxiv.org/abs/2508.07298
Authors: Zhiqiang Shen,Peng Cao,Xiaoli Liu,Jinzhu Yang,Osmar R. Zaiane
Affiliation: 1. University of Alberta; 2. Alberta Machine Intelligence Institute; 3. University of California, Berkeley; 4. University of Alberta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Label scarcity remains a major challenge in deep learning-based medical image segmentation. Recent studies use strong-weak pseudo supervision to leverage unlabeled data. However, performance is often hindered by inconsistencies between pseudo labels and their corresponding unlabeled images. In this work, we propose SynMatch, a novel framework that sidesteps the need for improving pseudo labels by synthesizing images to match them instead. Specifically, SynMatch synthesizes images using texture and shape features extracted from the same segmentation model that generates the corresponding pseudo labels for unlabeled images. This design enables the generation of highly consistent synthesized-image-pseudo-label pairs without requiring any training parameters for image synthesis. We extensively evaluate SynMatch across diverse medical image segmentation tasks under semi-supervised learning (SSL), weakly-supervised learning (WSL), and barely-supervised learning (BSL) settings with increasingly limited annotations. The results demonstrate that SynMatch achieves superior performance, especially in the most challenging BSL setting. For example, it outperforms the recent strong-weak pseudo supervision-based method by 29.71% and 10.05% on the polyp segmentation task with 5% and 10% scribble annotations, respectively. The code will be released at this https URL.
[CV-145] Representation Understanding via Activation Maximization
【Quick Read】: This paper addresses the interpretability of internal feature representations in deep neural networks (DNNs), focusing on how to visualize what neurons at different depths respond to and thereby expose the structure of learned features. The key to the solution is a unified feature visualization framework that applies to both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) and extends Activation Maximization (AM) from output-layer neurons to intermediate layers, giving a fuller picture of hierarchical representations. The same approach is used to generate adversarial examples, which reveals decision boundaries and potential vulnerabilities and supports the method's generality and interpretive value across conventional CNNs and modern ViTs.
Link: https://arxiv.org/abs/2508.07281
Authors: Hongbo Zhu,Angelo Cangelosi
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 7 pages, 12 figures
Abstract:Understanding internal feature representations of deep neural networks (DNNs) is a fundamental step toward model interpretability. Inspired by neuroscience methods that probe biological neurons using visual stimuli, recent deep learning studies have employed Activation Maximization (AM) to synthesize inputs that elicit strong responses from artificial neurons. In this work, we propose a unified feature visualization framework applicable to both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Unlike prior efforts that predominantly focus on output-layer neurons in CNNs, we extend feature visualization to intermediate layers as well, offering deeper insights into the hierarchical structure of learned feature representations. Furthermore, we investigate how activation maximization can be leveraged to generate adversarial examples, revealing potential vulnerabilities and decision boundaries of DNNs. Our experiments demonstrate the effectiveness of our approach in both traditional CNNs and modern ViTs, highlighting its generalizability and interpretive value.
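Activation Maximization, as the abstract describes it, synthesizes an input that drives a chosen neuron as high as possible. A minimal sketch of the generic technique is gradient ascent on the input with a norm constraint. The single `tanh` neuron, its weights, the step size, and the unit-ball projection are assumptions chosen so the example stays self-contained; the paper's framework targets real CNN and ViT layers and may use different regularizers.

```python
import math


def activation(weights, x):
    """Toy neuron: tanh of a weighted sum of the input."""
    return math.tanh(sum(w * v for w, v in zip(weights, x)))


def activation_gradient(weights, x):
    """Analytic gradient of the toy neuron with respect to the input."""
    a = activation(weights, x)
    return [(1.0 - a * a) * w for w in weights]


def project_to_unit_ball(x):
    """Keep the synthesized input bounded (assumed regularizer)."""
    norm = math.sqrt(sum(v * v for v in x))
    return x if norm <= 1.0 else [v / norm for v in x]


def maximize_activation(weights, steps=200, lr=0.1):
    """Gradient ascent on the input, starting from a blank stimulus."""
    x = [0.0] * len(weights)
    for _ in range(steps):
        grad = activation_gradient(weights, x)
        x = project_to_unit_ball([v + lr * g for v, g in zip(x, grad)])
    return x


weights = [0.6, -0.8]
stimulus = maximize_activation(weights)
print(stimulus, activation(weights, stimulus))
```

For this toy neuron the optimal bounded stimulus points along the weight vector, so the result can be checked by eye; in a deep network the same loop is run with automatic differentiation against the chosen intermediate unit.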
[CV-146] OpenHAIV: A Framework Towards Practical Open-World Learning
【Quick Read】: This paper addresses the difficulty of detecting unknown classes and updating model knowledge at the same time in open-world recognition. Relying on out-of-distribution (OOD) detection alone does not let the model keep learning, while incremental learning usually needs supervision, which departs from open-world settings. The key to the solution is OpenHAIV, a unified framework that integrates OOD detection, new class discovery, and incremental continual fine-tuning into a single pipeline, so that models can autonomously acquire and update knowledge in open-world environments.
Link: https://arxiv.org/abs/2508.07270
Authors: Xiang Xiang,Qinhao Zhou,Zhuo Xu,Jing Ma,Jiaxin Dai,Yifan Liang,Hanlin Li
Affiliation: Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
Comments: Codes, results, and OpenHAIV documentation available at this https URL
Abstract:Substantial progress has been made in various techniques for open-world recognition. Out-of-distribution (OOD) detection methods can effectively distinguish between known and unknown classes in the data, while incremental learning enables continuous model knowledge updates. However, in open-world scenarios, these approaches still face limitations. Relying solely on OOD detection does not facilitate knowledge updates in the model, and incremental fine-tuning typically requires supervised conditions, which significantly deviate from open-world settings. To address these challenges, this paper proposes OpenHAIV, a novel framework that integrates OOD detection, new class discovery, and incremental continual fine-tuning into a unified pipeline. This framework allows models to autonomously acquire and update knowledge in open-world environments. The proposed framework is available at this https URL .
[CV-147] Fading the Digital Ink: A Universal Black-Box Attack Framework for 3DGS Watermarking Systems
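【Quick Read】: This paper addresses the underexplored robustness of digital watermarking for 3D Gaussian Splatting (3DGS) against potential attacks. Existing watermarking methods embed 1D bitstreams or 2D images for copyright protection, but their vulnerability to adversarial attacks has not been studied in depth. The authors propose the first universal black-box attack framework, the Group-based Multi-objective Evolutionary Attack (GMEA). It formulates the attack as a large-scale multi-objective optimization problem that balances watermark removal against visual quality; it introduces an indirect objective that blinds the watermark detector by minimizing the standard deviation of features extracted by a convolutional network, making the feature maps uninformative; and it uses a group-based optimization strategy to partition the 3DGS model into multiple independent sub-problems, which keeps the high-dimensional search tractable. Experiments show that the framework removes both 1D and 2D watermarks from mainstream 3DGS watermarking methods while maintaining high visual fidelity, exposing critical vulnerabilities in current schemes and motivating more robust watermarking systems.
Link: https://arxiv.org/abs/2508.07263
Authors: Qingyuan Zeng,Shu Jiang,Jiajing Lin,Zhenzhong Wang,Kay Chen Tan,Min Jiang
Affiliation: 1. University of Science and Technology of China; 2. Zhejiang University; 3. Nanyang Technological University
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: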
Abstract:With the rise of 3D Gaussian Splatting (3DGS), a variety of digital watermarking techniques, embedding either 1D bitstreams or 2D images, are used for copyright protection. However, the robustness of these watermarking techniques against potential attacks remains underexplored. This paper introduces the first universal black-box attack framework, the Group-based Multi-objective Evolutionary Attack (GMEA), designed to challenge these watermarking systems. We formulate the attack as a large-scale multi-objective optimization problem, balancing watermark removal with visual quality. In a black-box setting, we introduce an indirect objective function that blinds the watermark detector by minimizing the standard deviation of features extracted by a convolutional network, thus rendering the feature maps uninformative. To manage the vast search space of 3DGS models, we employ a group-based optimization strategy to partition the model into multiple, independent sub-optimization problems. Experiments demonstrate that our framework effectively removes both 1D and 2D watermarks from mainstream 3DGS watermarking methods while maintaining high visual fidelity. This work reveals critical vulnerabilities in existing 3DGS copyright protection schemes and calls for the development of more robust watermarking systems.
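Three building blocks named in the abstract can be sketched in a few lines: an indirect objective based on the standard deviation of extracted features, a partition of the parameter space into groups, and the Pareto-dominance test that multi-objective evolutionary methods rely on. The contiguous grouping, the toy feature vectors, and every function name are assumptions for illustration; the abstract does not specify how GMEA forms groups or which evolutionary operators it uses.

```python
from statistics import pstdev


def feature_spread(features):
    """Indirect objective: standard deviation of extracted features (lower is 'blinder')."""
    return pstdev(features)


def partition_into_groups(num_params, group_size):
    """Split parameter indices into contiguous groups (assumed grouping rule)."""
    return [list(range(start, min(start + group_size, num_params)))
            for start in range(0, num_params, group_size)]


def dominates(a, b):
    """Pareto dominance for minimization: a is no worse everywhere and better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


groups = partition_into_groups(num_params=10, group_size=4)

informative = [0.1, 0.9, 0.2, 0.8]   # toy features a detector could exploit
flattened = [0.5, 0.5, 0.5, 0.5]     # toy features with no variation

# Objective vectors: (feature spread, visual quality loss), both minimized.
candidate_a = (feature_spread(flattened), 0.02)
candidate_b = (feature_spread(informative), 0.02)
print(groups, dominates(candidate_a, candidate_b))
```

Each group would be optimized as its own sub-problem, and candidates that are not dominated on both objectives form the trade-off front between watermark removal and visual quality.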
[CV-148] Small-Large Collaboration: Training-efficient Concept Personalization for Large VLM using a Meta Personalized Small VLM
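【Quick Read】: This paper addresses two obstacles to personalizing Vision-Language Models (VLMs): large VLMs are hard to personalize directly because of high training costs and restricted API access, while small VLMs are easy to personalize but lack the reasoning ability that complex tasks require. The key to the solution is Small-Large Collaboration (SLC), a framework in which a small VLM generates personalized information and a large VLM integrates it to produce accurate responses, with a test-time reflection strategy that guards against hallucinations from the small VLM. Because only one meta-personalized small VLM has to be trained to serve multiple large VLMs, open-source or closed-source, the overall process is training-efficient and broadly applicable.
Link: https://arxiv.org/abs/2508.07260
Authors: Sihan Yang,Huitong Ji,Shaolin Lu,Jiayi Chen,Binxiao Xu,Ming Lu,Yuanxing Zhang,Wenhui Dong,Wentao Zhang
Affiliation: Peking University; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: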
Abstract:Personalizing Vision-Language Models (VLMs) to transform them into daily assistants has emerged as a trending research direction. However, leading companies like OpenAI continue to increase model size and develop complex designs such as the chain of thought (CoT). While large VLMs are proficient in complex multi-modal understanding, their high training costs and limited access via paid APIs restrict direct personalization. Conversely, small VLMs are easily personalized and freely available, but they lack sufficient reasoning capabilities. Inspired by this, we propose a novel collaborative framework named Small-Large Collaboration (SLC) for large VLM personalization, where the small VLM is responsible for generating personalized information, while the large model integrates this personalized information to deliver accurate responses. To effectively incorporate personalized information, we develop a test-time reflection strategy, preventing the potential hallucination of the small VLM. Since SLC only needs to train a meta personalized small VLM for the large VLMs, the overall process is training-efficient. To the best of our knowledge, this is the first training-efficient framework that supports both open-source and closed-source large VLMs, enabling broader real-world personalized applications. We conduct thorough experiments across various benchmarks and large VLMs to demonstrate the effectiveness of the proposed SLC framework. The code will be released at this https URL.
[CV-149] Understanding Dynamic Scenes in Ego Centric 4D Point Clouds
【Quick Read】: This paper addresses the problem of understanding dynamic 4D scenes from an egocentric perspective, that is, modeling how 3D spatial structure changes over time, in support of human-machine interaction, autonomous navigation, and embodied intelligence. Existing datasets contain dynamic scenes but lack unified 4D annotations and task-driven evaluation protocols, particularly for fine-grained spatio-temporal reasoning about object and human motion and their interactions. To fill this gap, the authors introduce EgoDynamic4D, a question-answering (QA) benchmark comprising RGB-D video, camera poses, globally unique instance masks, and 4D bounding boxes, together with 927K QA pairs annotated with explicit Chain-of-Thought (CoT) for verifiable, step-by-step spatio-temporal reasoning. The key to the solution is an end-to-end spatio-temporal reasoning framework that uses instance-aware feature encoding, time and camera encoding, and spatially adaptive down-sampling to compress large 4D scenes into token sequences that large language models (LLMs) can handle, unifying dynamic and static scene information.
Link: https://arxiv.org/abs/2508.07251
Authors: Junsheng Huang,Shengyu Hao,Bocheng Hu,Gaoang Wang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Understanding dynamic 4D scenes from an egocentric perspective, that is, modeling changes in 3D spatial structure over time, is crucial for human-machine interaction, autonomous navigation, and embodied intelligence. While existing egocentric datasets contain dynamic scenes, they lack unified 4D annotations and task-driven evaluation protocols for fine-grained spatio-temporal reasoning, especially on the motion of objects and humans, together with their interactions. To address this gap, we introduce EgoDynamic4D, a novel QA benchmark on highly dynamic scenes, comprising RGB-D video, camera poses, globally unique instance masks, and 4D bounding boxes. We construct 927K QA pairs accompanied by explicit Chain-of-Thought (CoT), enabling verifiable, step-by-step spatio-temporal reasoning. We design 12 dynamic QA tasks covering agent motion, human-object interaction, trajectory prediction, relation understanding, and temporal-causal reasoning, with fine-grained, multidimensional metrics. To tackle these tasks, we propose an end-to-end spatio-temporal reasoning framework that unifies dynamic and static scene information, using instance-aware feature encoding, time and camera encoding, and spatially adaptive down-sampling to compress large 4D scenes into token sequences manageable by LLMs. Experiments on EgoDynamic4D show that our method consistently outperforms baselines, validating the effectiveness of multimodal temporal modeling for egocentric dynamic scene understanding.
[CV-150] SUIT: Spatial-Spectral Union-Intersection Interaction Network for Hyperspectral Object Tracking
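Quick read: This paper addresses the neglect of spectral interactions in existing hyperspectral video (HSV) object tracking methods, which limits performance in challenging scenes such as cluttered backgrounds and small targets. The key to the solution is joint optimization of architecture and training strategy. At the architectural level, Transformers establish band-wise long-range spatial relationships between the template and search regions, and the inclusion-exclusion principle from set theory models spectral interactions as the union of spatial interactions across all bands, integrating shared and band-specific spatial cues. At the training level, a spectral loss enforces material distribution alignment between the template and predicted regions, improving robustness to shape deformation and appearance variation.
Link: https://arxiv.org/abs/2508.07250
Authors: Fengchao Xiong,Zhenxing Wu,Sen Jia,Yuntao Qian
Affiliations: Nanjing University of Science and Technology (南京理工大学); Shenzhen University (深圳大学); Zhejiang University (浙江大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: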
Abstract:Hyperspectral videos (HSVs), with their inherent spatial-spectral-temporal structure, offer distinct advantages in challenging tracking scenarios such as cluttered backgrounds and small objects. However, existing methods primarily focus on spatial interactions between the template and search regions, often overlooking spectral interactions, leading to suboptimal performance. To address this issue, this paper investigates spectral interactions from both the architectural and training perspectives. At the architectural level, we first establish band-wise long-range spatial relationships between the template and search regions using Transformers. We then model spectral interactions using the inclusion-exclusion principle from set theory, treating them as the union of spatial interactions across all bands. This enables the effective integration of both shared and band-specific spatial cues. At the training level, we introduce a spectral loss to enforce material distribution alignment between the template and predicted regions, enhancing robustness to shape deformation and appearance variations. Extensive experiments demonstrate that our tracker achieves state-of-the-art tracking performance. The source code, trained models and results will be publicly available via this https URL to support reproducibility.
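For readers unfamiliar with the set-theoretic identity the abstract refers to, the inclusion-exclusion principle is stated below. Reading each set A_i as the spatial interaction cues of band i is an assumption made here for illustration; the abstract does not say how the identity is mapped onto feature maps.

```latex
% Inclusion-exclusion for n sets A_1, ..., A_n
% (A_i is read here, hypothetically, as the spatial interactions of band i)
\left| \bigcup_{i=1}^{n} A_i \right|
  = \sum_{\emptyset \neq S \subseteq \{1,\dots,n\}}
    (-1)^{|S|+1} \left| \bigcap_{i \in S} A_i \right|

% Two-band special case: shared cues are counted once, not twice
|A_1 \cup A_2| = |A_1| + |A_2| - |A_1 \cap A_2|
```

The two-band case makes the intuition visible: the intersection term corresponds to cues shared across bands, and the remaining terms to band-specific cues.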
[CV-151] Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers
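Quick read: This paper targets two persistent challenges in image animation: maintaining appearance consistency with the static input image and mitigating abrupt motion transitions in the generated animation. In addition, existing U-Net-based diffusion models lag behind the latest text-to-video (T2V) methods in efficiency and performance, and the quadratic complexity of vanilla self-attention in Transformers makes computation expensive. The key to the solution is the MiraMo framework, whose core contributions are: (1) a T2V foundation architecture based on linear attention, which lowers computational overhead while preserving generation quality; (2) a motion residual learning paradigm that improves temporal consistency by modeling motion dynamics rather than directly predicting frames; and (3) a noise refinement strategy based on the discrete cosine transform (DCT) at inference time that suppresses sudden motion artifacts, combined with a dynamics control module that balances motion smoothness and expressiveness.
Link: https://arxiv.org/abs/2508.07246
Authors: Xin Ma,Yaohui Wang,Genyun Jia,Xinyuan Chen,Tien-Tsin Wong,Cunjian Chen
Affiliations: Monash University (蒙纳士大学); Nanjing University of Posts and Telecommunications (南京邮电大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL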
Abstract:Image animation has seen significant progress, driven by the powerful generative capabilities of diffusion models. However, maintaining appearance consistency with static input images and mitigating abrupt motion transitions in generated animations remain persistent challenges. While text-to-video (T2V) generation has demonstrated impressive performance with diffusion transformer models, the image animation field still largely relies on U-Net-based diffusion models, which lag behind the latest T2V approaches. Moreover, the quadratic complexity of vanilla self-attention mechanisms in Transformers imposes heavy computational demands, making image animation particularly resource-intensive. To address these issues, we propose MiraMo, a framework designed to enhance efficiency, appearance consistency, and motion smoothness in image animation. Specifically, MiraMo introduces three key elements: (1) A foundational text-to-video architecture replacing vanilla self-attention with efficient linear attention to reduce computational overhead while preserving generation quality; (2) A novel motion residual learning paradigm that focuses on modeling motion dynamics rather than directly predicting frames, improving temporal consistency; and (3) A DCT-based noise refinement strategy during inference to suppress sudden motion artifacts, complemented by a dynamics control module to balance motion smoothness and expressiveness. Extensive experiments against state-of-the-art methods validate the superiority of MiraMo in generating consistent, smooth, and controllable animations with accelerated inference speed. Additionally, we demonstrate the versatility of MiraMo through applications in motion transfer and video editing tasks.
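To illustrate why linear attention avoids the quadratic cost mentioned above, here is a minimal pure-Python sketch. It is not MiraMo's implementation: the feature map `phi` (elu(x) + 1) is a common choice in the linear-attention literature and is assumed here, and all function names are hypothetical. The point is only that reordering the matrix product removes the N x N similarity matrix without changing the result.

```python
import math

def phi(x):
    # Positive feature map elu(x) + 1 (an assumed, commonly used choice).
    return x + 1.0 if x > 0 else math.exp(x)

def matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def feature_map(m):
    return [[phi(x) for x in row] for row in m]

def quadratic_attention(q, k, v):
    # Builds the full N x N similarity matrix: cost grows as N^2.
    fq, fk = feature_map(q), feature_map(k)
    sim = matmul(fq, transpose(fk))                  # N x N
    out = matmul(sim, v)                             # N x d_v
    return [[x / sum(s) for x in o] for o, s in zip(out, sim)]

def linear_attention(q, k, v):
    # Same result via phi(Q) (phi(K)^T V): cost grows linearly in N.
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)                    # d x d_v summary
    k_sum = [sum(col) for col in zip(*fk)]           # length-d normalizer
    out = matmul(fq, kv)                             # N x d_v
    norm = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[x / n for x in o] for o, n in zip(out, norm)]
```

Both functions return the same values up to floating-point error; the linear variant only ever materializes a d x d_v summary of the keys and values, which is what makes long token sequences affordable.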
[CV-152] ASM-UNet: Adaptive Scan Mamba Integrating Group Commonalities and Individual Variations for Fine-Grained Segmentation
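Quick read: This paper addresses fine-grained segmentation (FGS) in medical images, where small-scale anatomical structures vary markedly between individuals and Mamba models with fixed scanning orders struggle to adapt. The key to the solution is ASM-UNet, which introduces adaptive scan scores to dynamically guide the scanning order. The scores are generated by combining group-level commonalities with individual-level variations, improving the model's sensitivity to individual anatomy and its segmentation accuracy.
Link: https://arxiv.org/abs/2508.07237
Authors: Bo Wang,Mengyuan Xu,Yue Yan,Yuqun Yang,Kechen Shu,Wei Ping,Xu Tang,Wei Jiang,Zheng You
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: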
Abstract:Precise lesion resection depends on accurately identifying fine-grained anatomical structures. While many coarse-grained segmentation (CGS) methods have been successful in large-scale segmentation (e.g., organs), they fall short in clinical scenarios requiring fine-grained segmentation (FGS), which remains challenging due to frequent individual variations in small-scale anatomical structures. Although recent Mamba-based models have advanced medical image segmentation, they often rely on fixed manually-defined scanning orders, which limit their adaptability to individual variations in FGS. To address this, we propose ASM-UNet, a novel Mamba-based architecture for FGS. It introduces adaptive scan scores to dynamically guide the scanning order, generated by combining group-level commonalities and individual-level variations. Experiments on two public datasets (ACDC and Synapse) and a newly proposed challenging biliary tract FGS dataset, namely BTMS, demonstrate that ASM-UNet achieves superior performance in both CGS and FGS tasks. Our code and dataset are available at this https URL.
[CV-153] Landmark Guided Visual Feature Extractor for Visual Speech Recognition with Limited Resource
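Quick read: This paper addresses performance degradation in visual speech recognition caused by visual disturbances such as lighting conditions and skin texture, as well as user-specific features, particularly when training data is limited. The key to the solution is a landmark-guided visual feature extractor: facial landmarks serve as auxiliary information, a spatio-temporal multi-graph convolutional network exploits the landmarks' spatial locations and temporal features, and a multi-level lip dynamic fusion framework fuses the landmarks' spatio-temporal features with visual features extracted from the raw video frames. This reduces the influence of user-specific features and improves recognition accuracy on unseen speakers.
Link: https://arxiv.org/abs/2508.07233
Authors: Lei Yang,Junshan Jin,Mingyuan Zhang,Yi He,Bofan Chen,Shilin Wang
Affiliations: Shanghai Jiao Tong University (上海交通大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: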
Abstract:Visual speech recognition is a technique to identify spoken content in silent speech videos, which has attracted significant attention in recent years. Advancements in data-driven deep learning methods have significantly improved both the speed and accuracy of recognition. However, these deep learning methods can be affected by visual disturbances, such as lighting conditions, skin texture and other user-specific features. Data-driven approaches could reduce the performance degradation caused by these visual disturbances using models pretrained on large-scale datasets, but these methods often require large amounts of training data and computational resources, making them costly. To reduce the influence of user-specific features and enhance performance with limited data, this paper proposes a landmark guided visual feature extractor. Facial landmarks are used as auxiliary information to aid in training the visual feature extractor. A spatio-temporal multi-graph convolutional network is designed to fully exploit the spatial locations and spatio-temporal features of facial landmarks. Additionally, a multi-level lip dynamic fusion framework is introduced to combine the spatio-temporal features of the landmarks with the visual features extracted from the raw video frames. Experimental results show that this approach performs well with limited data and also improves the model's accuracy on unseen speakers.
[CV-154] HaDM-ST: Histology-Assisted Differential Modeling for Spatial Transcriptomics Generation
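Quick read: This paper addresses key challenges in spatial transcriptomics (ST) under limited resolution: extracting expression-relevant features from visually complex hematoxylin and eosin (HE) stained images, achieving spatially precise multimodal alignment in a diffusion-based framework, and modeling gene-specific variation across expression channels. The core of the solution is HaDM-ST (Histology-assisted Differential Modeling for ST Generation), which generates high-resolution ST through three modules: (i) a semantic distillation network that extracts predictive features from HE images; (ii) a spatial alignment module that enforces pixel-wise correspondence with low-resolution ST to ensure spatial precision; and (iii) a channel-aware adversarial learner for fine-grained gene-level modeling. Experiments on 200 genes across diverse tissues and species show that HaDM-ST consistently outperforms prior methods, improving the spatial fidelity and gene-level coherence of high-resolution ST.
Link: https://arxiv.org/abs/2508.07225
Authors: Xuepeng Liu,Zheng Jiang,Pinan Zhu,Hanyu Liu,Chao Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, includes comparisons with TESLA, HiStoGene, and iStar; submitted to arXiv 2025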
Abstract:Spatial transcriptomics (ST) reveals spatial heterogeneity of gene expression, yet its resolution is limited by current platforms. Recent methods enhance resolution via HE-stained histology, but three major challenges persist: (1) isolating expression-relevant features from visually complex HE images; (2) achieving spatially precise multimodal alignment in diffusion-based frameworks; and (3) modeling gene-specific variation across expression channels. We propose HaDM-ST (Histology-assisted Differential Modeling for ST Generation), a high-resolution ST generation framework conditioned on HE images and low-resolution ST. HaDM-ST includes: (i) a semantic distillation network to extract predictive cues from HE; (ii) a spatial alignment module enforcing pixel-wise correspondence with low-resolution ST; and (iii) a channel-aware adversarial learner for fine-grained gene-level modeling. Experiments on 200 genes across diverse tissues and species show HaDM-ST consistently outperforms prior methods, enhancing spatial fidelity and gene-level coherence in high-resolution ST predictions.
[CV-155] Generic Calibration: Pose Ambiguity/Linear Solution and Parametric-hybrid Pipeline
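Quick read: This paper addresses several problems in conventional camera calibration. Choosing a parametric model depends on the user's experience, and an unsuitable model significantly degrades calibration accuracy; generic calibration methods involve complex procedures and cannot output traditional intrinsic parameters. The paper also reveals a pose ambiguity in generic calibration methods that irreversibly affects subsequent pose estimation. The key to the solution is a linear solver and a nonlinear optimization that resolve this ambiguity, followed by a globally optimized generic-parametric hybrid calibration method that integrates generic and parametric models, improving the extrinsic accuracy of generic calibration while mitigating overfitting and numerical instability in parametric calibration.
Link: https://arxiv.org/abs/2508.07217
Authors: Yuqi Han,Qi Cai,Yuanxin Wu
Affiliations: Shanghai Jiao Tong University (上海交通大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: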
Abstract:Offline camera calibration techniques typically employ parametric or generic camera models. Selecting parametric models relies heavily on user experience, and an inappropriate camera model can significantly affect calibration accuracy. Meanwhile, generic calibration methods involve complex procedures and cannot provide traditional intrinsic parameters. This paper reveals a pose ambiguity in the pose solutions of generic calibration methods that irreversibly impacts subsequent pose estimation. A linear solver and a nonlinear optimization are proposed to address this ambiguity issue. Then a global optimization hybrid calibration method is introduced to integrate generic and parametric models together, which improves extrinsic parameter accuracy of generic calibration and mitigates overfitting and numerical instability in parametric calibration. Simulation and real-world experimental results demonstrate that the generic-parametric hybrid calibration method consistently excels across various lens types and noise contamination, hopefully serving as a reliable and accurate solution for camera calibration in complex scenarios.
[CV-156] Bridging Semantic Logic Gaps: A Cognition-Inspired Multimodal Boundary-Preserving Network for Image Manipulation Localization
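Quick read: This paper addresses the fact that existing image manipulation localization (IML) models rely mainly on visual cues and ignore the semantic logical relationships between content features. The content semantics of real images usually conform to human cognitive laws, whereas manipulation tends to break these internal relationships and thus leaves detectable semantic clues. The authors propose a cognition-inspired multimodal boundary-preserving network (CMB-Net). First, large language models (LLMs) generate prompt-based textual information to compensate for semantic relationships missing from the visual information. Second, an image-text central ambiguity module (ITCAM) weights the text features by quantifying the ambiguity between image and text features, mitigating the harm that LLM hallucination does to localization accuracy. Third, an image-text interaction module (ITIM) aligns the two modalities at a fine-grained level using a correlation matrix. Finally, inspired by invertible neural networks, a restoration edge decoder (RED) mutually generates input and output features to preserve the boundary information of manipulated regions without loss. Together, these components improve IML performance.
Link: https://arxiv.org/abs/2508.07216
Authors: Songlin Li,Zhiqing Guo,Yuanman Li,Zeyu Li,Yunfeng Diao,Gaobo Yang,Liejun Wang
Affiliations: Xinjiang University (新疆大学); Hunan University (湖南大学); Hefei University of Technology (合肥工业大学); Shenzhen University (深圳大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: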
Abstract:Existing image manipulation localization (IML) models mainly rely on visual cues but ignore the semantic logical relationships between content features. In fact, the content semantics conveyed by real images often conform to human cognitive laws. However, image manipulation technology usually destroys the internal relationship between content features, thus leaving semantic clues for IML. In this paper, we propose a cognition-inspired multimodal boundary-preserving network (CMB-Net). Specifically, CMB-Net utilizes large language models (LLMs) to analyze manipulated regions within images and generate prompt-based textual information to compensate for the lack of semantic relationships in the visual information. Considering that the erroneous texts induced by hallucination from LLMs will damage the accuracy of IML, we propose an image-text central ambiguity module (ITCAM). It assigns weights to the text features by quantifying the ambiguity between text and image features, thereby ensuring the beneficial impact of textual information. We also propose an image-text interaction module (ITIM) that aligns visual and text features using a correlation matrix for fine-grained interaction. Finally, inspired by invertible neural networks, we propose a restoration edge decoder (RED) that mutually generates input and output features to preserve boundary information in manipulated regions without loss. Extensive experiments show that CMB-Net outperforms most existing IML models.
[CV-157] Unsupervised Real-World Super-Resolution via Rectified Flow Degradation Modelling
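Quick read: This paper addresses the performance bottleneck in unsupervised real-world super-resolution (SR) caused by complex, unknown degradation distributions in practice, in particular the difficulty existing methods have generalizing from synthetic low-resolution (LR) and high-resolution (HR) image pairs to real data. The key to the solution is a degradation modeling mechanism based on rectified flow. A Rectified Flow Degradation Module (RFDM) introduces degradation-transformed LR (DT-LR) images as intermediaries and models the degradation trajectory in a continuous, invertible manner to capture real-world degradation more accurately. A Fourier Prior Guided Degradation Module (FGDM) uses the structural information in Fourier phase components to make degradation modeling more precise. The resulting synthetic LR images with realistic degradation are paired with the original HR images to train off-the-shelf SR networks, significantly improving reconstruction on real data.
Link: https://arxiv.org/abs/2508.07214
Authors: Hongyang Zhou,Xiaobin Zhu,Liuling Chen,Junyi He,Jingyan Qin,Xu-Cheng Yin,Zhang xiaoxing
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 9 figures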
Abstract:Unsupervised real-world super-resolution (SR) faces critical challenges due to the complex, unknown degradation distributions in practical scenarios. Existing methods struggle to generalize from synthetic low-resolution (LR) and high-resolution (HR) image pairs to real-world data due to a significant domain gap. In this paper, we propose an unsupervised real-world SR method based on rectified flow to effectively capture and model real-world degradation, synthesizing LR-HR training pairs with realistic degradation. Specifically, given unpaired LR and HR images, we propose a novel Rectified Flow Degradation Module (RFDM) that introduces degradation-transformed LR (DT-LR) images as intermediaries. By modeling the degradation trajectory in a continuous and invertible manner, RFDM better captures real-world degradation and enhances the realism of generated LR images. Additionally, we propose a Fourier Prior Guided Degradation Module (FGDM) that leverages structural information embedded in Fourier phase components to ensure more precise modeling of real-world degradation. Finally, the LR images are processed by both FGDM and RFDM, producing final synthetic LR images with real-world degradation. The synthetic LR images are paired with the given HR images to train the off-the-shelf SR networks. Extensive experiments on real-world datasets demonstrate that our method significantly enhances the performance of existing SR approaches in real-world scenarios.
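As background on the rectified-flow idea that RFDM builds on, the sketch below shows the generic formulation: points on a straight line between two samples, the constant velocity a network would be trained to regress, and Euler integration of a velocity field. This is a toy on short vectors, not the paper's module; which image domains play the roles of `x0` and `x1`, and all function names, are assumptions made for illustration.

```python
def interpolate(x0, x1, t):
    # Point on the straight path from x0 (t = 0) to x1 (t = 1).
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    # Along a straight path the velocity is constant: x1 - x0.
    # A rectified-flow model is trained to regress this from (x_t, t).
    return [b - a for a, b in zip(x0, x1)]

def euler_integrate(x0, velocity_fn, steps=4):
    # Follow a velocity field from t = 0 to t = 1 with fixed Euler steps.
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy usage: an oracle velocity field stands in for a trained network.
source = [0.0, 0.0]
target = [1.0, 2.0]
oracle = lambda x, t: velocity_target(source, target)
end_point = euler_integrate(source, oracle, steps=4)
```

Integrating the same field with a negated time direction walks the path backwards, which is the sense in which such a trajectory is continuous and invertible.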
[CV-158] Similarity Matters: A Novel Depth-guided Network for Image Restoration and A New Dataset
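Quick read: This paper addresses problems that arise when existing image restoration methods ignore depth information across depth-of-field (DoF) settings: inaccurate similarity matching, distracted attention in shallow-DoF scenes, and over-enhancement of background content in deep-DoF scenes. The key to the solution is a Depth-Guided Network (DGN) with two interactive branches: a depth estimation branch that provides structural guidance and an image restoration branch that performs the core restoration task. The restoration branch captures intra-object similarity through progressive window-based self-attention and inter-object similarity through sparse non-local attention. Through joint training, depth features and visual features reinforce each other, improving restoration quality and robustness.
Link: https://arxiv.org/abs/2508.07211
Authors: Junyi He,Liuling Chen,Hongyang Zhou,Zhang xiaoxing,Xiaobin Zhu,Shengxiang Yu,Jingyan Qin,Xu-Cheng Yin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 10 figures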
Abstract:Image restoration has seen substantial progress in recent years. However, existing methods often neglect depth information, which hurts similarity matching, causes attention distractions in shallow depth-of-field (DoF) scenarios, and leads to excessive enhancement of background content in deep DoF settings. To overcome these limitations, we propose a novel Depth-Guided Network (DGN) for image restoration, together with a novel large-scale high-resolution dataset. Specifically, the network consists of two interactive branches: a depth estimation branch that provides structural guidance, and an image restoration branch that performs the core restoration task. In addition, the image restoration branch exploits intra-object similarity through progressive window-based self-attention and captures inter-object similarity via sparse non-local attention. Through joint training, depth features contribute to improved restoration quality, while the enhanced visual features from the restoration branch in turn help refine depth estimation. Notably, we also introduce a new dataset for training and evaluation, consisting of 9,205 high-resolution images from 403 plant species, with diverse depth and texture variations. Extensive experiments show that our method achieves state-of-the-art performance on several standard benchmarks and generalizes well to unseen plant images, demonstrating its effectiveness and robustness.
[CV-159] EventRR: Event Referential Reasoning for Referring Video Object Segmentation
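Quick read: This paper addresses the fact that existing referring video object segmentation (RVOS) methods treat referring expressions as unstructured sequences and ignore their semantic structure. It proposes Event Referential Reasoning (EventRR), which decouples RVOS into object summarization and referent reasoning. For summarization, each frame is summarized into a set of bottleneck tokens, which are then aggregated at the video level to exchange global cross-modal temporal context. For reasoning, the semantic event structure of the video-referring expression is extracted into a Referential Event Graph (REG), a single-rooted directed acyclic graph. Guided by a topological traversal of the REG, Temporal Concept-Role Reasoning (TCRR) accumulates the referring score of each temporal query from the leaf nodes to the root, and each reasoning step can be interpreted as a question-answer pair derived from the concept-role relations in the REG. The design captures the event attributes and temporal relations specific to video-referring expressions and outperforms state-of-the-art RVOS methods.
Link: https://arxiv.org/abs/2508.07171
Authors: Huihui Xu,Jiashi Lin,Haoyu Chen,Junjun He,Lei Zhu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: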
Abstract:Referring Video Object Segmentation (RVOS) aims to segment the object in a video referred to by an expression. Current RVOS methods view referring expressions as unstructured sequences, neglecting the semantic structure that is essential for referent reasoning. Besides, in contrast to image-referring expressions, whose semantics focus only on object attributes and object-object relations, video-referring expressions also encompass event attributes and event-event temporal relations. This complexity challenges traditional structured reasoning approaches designed for images. In this paper, we propose the Event Referential Reasoning (EventRR) framework. EventRR decouples RVOS into an object summarization part and a referent reasoning part. The summarization phase begins by summarizing each frame into a set of bottleneck tokens, which are then efficiently aggregated in the video-level summarization step to exchange the global cross-modal temporal context. For the reasoning part, EventRR extracts the semantic event structure of a video-referring expression into a highly expressive Referential Event Graph (REG), which is a single-rooted directed acyclic graph. Guided by topological traversal of REG, we propose Temporal Concept-Role Reasoning (TCRR) to accumulate the referring score of each temporal query from REG leaf nodes to root node. Each reasoning step can be interpreted as a question-answer pair derived from the concept-role relations in REG. Extensive experiments across four widely recognized benchmark datasets show that EventRR quantitatively and qualitatively outperforms state-of-the-art RVOS methods. Code is available at this https URL
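The leaf-to-root traversal described above can be sketched with Python's standard `graphlib` module. This is a toy illustration only: the graph, the node names, the per-node scores, and the rule of multiplying a node's own score by its children's accumulated scores are all assumptions made here; the abstract does not specify how TCRR combines scores.

```python
from graphlib import TopologicalSorter

def accumulate_scores(children, local_score):
    # `children` maps each node to the nodes it depends on. Passing it to
    # TopologicalSorter as a predecessor map yields leaves first, root last.
    order = TopologicalSorter(children).static_order()
    total = {}
    for node in order:
        score = local_score[node]
        for child in children.get(node, ()):
            # Assumed combination rule: multiply in each child's result.
            score *= total[child]
        total[node] = score
    return total

# Hypothetical single-rooted DAG for a referring expression.
graph = {"root": ["a", "b"], "a": ["c"], "b": [], "c": []}
scores = {"root": 0.9, "a": 0.8, "b": 0.5, "c": 0.5}
result = accumulate_scores(graph, scores)
```

Because every child is visited before its parent, each node's accumulated score is final by the time the parent reads it, so one pass over the graph suffices.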
[CV-160] Lightweight Multi-Scale Feature Extraction with Fully Connected LMF Layer for Salient Object Detection
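Quick read: This paper addresses the limited multi-scale feature extraction capability of lightweight networks, which matters for vision tasks such as salient object detection and typically involves a trade-off between efficiency and performance. The core of the solution is a lightweight multi-scale feature extraction layer (LMF layer) that uses depthwise separable dilated convolutions in a fully connected structure, substantially reducing the parameter count while retaining multi-scale perception. Stacking multiple LMF layers yields LMFNet, which with only 0.81M parameters achieves results better than or comparable to traditional and lightweight models.
Link: https://arxiv.org/abs/2508.07170
Authors: Yunpeng Shi,Lei Chen,Xiaolu Shen,Yanju Guo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: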
Abstract:In the domain of computer vision, multi-scale feature extraction is vital for tasks such as salient object detection. However, achieving this capability in lightweight networks remains challenging due to the trade-off between efficiency and performance. This paper proposes a novel lightweight multi-scale feature extraction layer, termed the LMF layer, which employs depthwise separable dilated convolutions in a fully connected structure. By integrating multiple LMF layers, we develop LMFNet, a lightweight network tailored for salient object detection. Our approach significantly reduces the number of parameters while maintaining competitive performance. Here, we show that LMFNet achieves state-of-the-art or comparable results on five benchmark datasets with only 0.81M parameters, outperforming several traditional and lightweight models in terms of both efficiency and accuracy. Our work not only addresses the challenge of multi-scale learning in lightweight networks but also demonstrates the potential for broader applications in image processing tasks. The related code files are available at this https URL
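To make the parameter savings of depthwise separable dilated convolutions concrete, here is a small arithmetic sketch using the standard weight-count formulas (biases ignored). The channel counts and kernel sizes are illustrative values chosen here, not figures from the paper, and the function names are hypothetical.

```python
def standard_conv_params(c_in, c_out, k):
    # Every output channel has a k x k filter over every input channel.
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    # One k x k filter per input channel, then a 1 x 1 pointwise mix.
    return c_in * k * k + c_in * c_out

def effective_kernel(k, dilation):
    # Dilation widens the receptive field without adding weights.
    return k + (k - 1) * (dilation - 1)

# Illustrative layer: 64 -> 64 channels with 3 x 3 kernels.
dense = standard_conv_params(64, 64, 3)
separable = depthwise_separable_params(64, 64, 3)
fields = [effective_kernel(3, d) for d in (1, 2, 4)]
```

For this example the separable layer needs roughly an eighth of the weights, while dilation rates 1, 2, and 4 cover 3, 5, and 9 pixel spans at no extra parameter cost, which is the general reason the combination suits lightweight multi-scale layers.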
[CV-161] Large-scale Multi-sequence Pretraining for Generalizable MRI Analysis in Versatile Clinical Applications
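Quick read: This paper addresses the poor generalization of deep learning models on multi-sequence magnetic resonance imaging (MRI) caused by the inherent heterogeneity among sequences, which limits clinical use under varying acquisition parameters. The key to the solution is PRISM, a foundation model pre-trained on large-scale multi-sequence MRI. Its core innovation is a pretraining paradigm that disentangles anatomically invariant features from sequence-specific variations while preserving high-level semantic representations, enabling robust representation learning across protocols and datasets.
Link: https://arxiv.org/abs/2508.07165
Authors: Zelin Qiu,Xi Wang,Zhuoyao Xie,Juan Zhou,Yu Wang,Lingjie Yang,Xinrui Jiang,Juyoung Bae,Moo Hyun Son,Qiang Ye,Dexuan Chen,Rui Zhang,Tao Li,Neeraj Ramesh Mahboobani,Varut Vardhanabhuti,Xiaohui Duan,Yinghua Zhao,Hao Chen
Affiliations: Sun Yat-sen University (中山大学); Singapore Management University (新加坡管理大学); The Hong Kong University of Science and Technology (香港科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: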
Abstract:Multi-sequence Magnetic Resonance Imaging (MRI) offers remarkable versatility, enabling the distinct visualization of different tissue types. Nevertheless, the inherent heterogeneity among MRI sequences poses significant challenges to the generalization capability of deep learning models. These challenges undermine model performance when faced with varying acquisition parameters, thereby severely restricting their clinical utility. In this study, we present PRISM, a foundation model PRe-trained with large-scale multI-Sequence MRI. We collected a total of 64 datasets from both public and private sources, encompassing a wide range of whole-body anatomical structures, with scans spanning diverse MRI sequences. Among them, 336,476 volumetric MRI scans from 34 datasets (8 public and 26 private) were curated to construct the largest multi-organ multi-sequence MRI pretraining corpus to date. We propose a novel pretraining paradigm that disentangles anatomically invariant features from sequence-specific variations in MRI, while preserving high-level semantic representations. We established a benchmark comprising 44 downstream tasks, including disease diagnosis, image segmentation, registration, progression prediction, and report generation. These tasks were evaluated on 32 public datasets and 5 private cohorts. PRISM consistently outperformed both non-pretrained models and existing foundation models, achieving first-rank results in 39 out of 44 downstream benchmarks with statistically significant improvements. These results underscore its ability to learn robust and generalizable representations across unseen data acquired under diverse MRI protocols. PRISM provides a scalable framework for multi-sequence MRI analysis, thereby enhancing the translational potential of AI in radiology. It delivers consistent performance across diverse imaging protocols, reinforcing its clinical applicability.
[CV-162] CoopDiff: Anticipating 3D Human-object Interactions via Contact-consistent Decoupled Diffusion
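Quick read: This paper addresses inconsistent predictions in 3D human-object interaction (HOI) anticipation that arise because existing methods do not distinguish the motion characteristics of articulated humans from those of rigid objects. The core of the solution is CoopDiff, a contact-consistent decoupled diffusion framework with two separate branches: a human dynamics branch that predicts structured human motion and an object dynamics branch that models rigid translations and rotations. Shared human-object contact points act as anchors, with a consistency constraint that keeps motion generation coherent across branches. A human-driven interaction module further guides object motion modeling, improving human-object consistency and prediction reliability.
Link: https://arxiv.org/abs/2508.07162
Authors: Xiaotong Lin,Tianming Liang,Jian-Fang Hu,Kun-Yu Lin,Yulei Kang,Chunwei Tian,Jianhuang Lai,Wei-Shi Zheng
Affiliations: 1: The Chinese University of Hong Kong (香港中文大学); 2: National Taiwan University (台湾国立大学); 3: Tsinghua University (清华大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: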
Abstract:3D human-object interaction (HOI) anticipation aims to predict the future motion of humans and their manipulated objects, conditioned on the historical context. Generally, the articulated humans and rigid objects exhibit different motion patterns, due to their distinct intrinsic physical properties. However, this distinction is ignored by most of the existing works, which intend to capture the dynamics of both humans and objects within a single prediction model. In this work, we propose a novel contact-consistent decoupled diffusion framework CoopDiff, which employs two distinct branches to decouple human and object motion modeling, with the human-object contact points as shared anchors to bridge the motion generation across branches. The human dynamics branch is aimed to predict highly structured human motion, while the object dynamics branch focuses on the object motion with rigid translations and rotations. These two branches are bridged by a series of shared contact points with consistency constraint for coherent human-object motion prediction. To further enhance human-object consistency and prediction reliability, we propose a human-driven interaction module to guide object motion modeling. Extensive experiments on the BEHAVE and Human-object Interaction datasets demonstrate that our CoopDiff outperforms state-of-the-art methods.
[CV-163] SketchAnimator: Animate Sketch via Motion Customization of Text-to-Video Diffusion Models
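Quick read: This paper addresses the professional skill and time required to animate hand-drawn sketches, which makes creative motion effects hard for non-professionals to achieve. The key to the solution is SketchAnimator, which works in three stages. In the first two stages, LoRA (Low-Rank Adaptation) integrates the sketch's appearance and the reference video's motion dynamics into a pre-trained text-to-video (T2V) model. In the third stage, Score Distillation Sampling (SDS) updates the parameters of the Bezier curves in each sketch frame to match the learned motion. The method generates a sketch animation customized from a single reference video while retaining the original sketch's appearance.
Link: https://arxiv.org/abs/2508.07149
Authors: Ruolin Yang,Da Li,Honggang Zhang,Yi-Zhe Song
Affiliations: Beijing University of Posts and Telecommunications (北京邮电大学); University of Surrey (萨里大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2024 IEEE International Conference on Visual Communications and Image Processing (VCIP); Oral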
Abstract:Sketching is a uniquely human tool for expressing ideas and creativity. The animation of sketches infuses life into these static drawings, opening a new dimension for designers. Animating sketches is a time-consuming process that demands professional skills and extensive experience, often proving daunting for amateurs. In this paper, we propose a novel sketch animation model SketchAnimator, which enables adding creative motion to a given sketch, like "a jumping car". Namely, given an input sketch and a reference video, we divide the sketch animation into three stages: Appearance Learning, Motion Learning and Video Prior Distillation. In stages 1 and 2, we utilize LoRA to integrate sketch appearance information and motion dynamics from the reference video into the pre-trained T2V model. In the third stage, we utilize Score Distillation Sampling (SDS) to update the parameters of the Bezier curves in each sketch frame according to the acquired motion information. Consequently, our model produces a sketch video that not only retains the original appearance of the sketch but also mirrors the dynamic movements of the reference video. We compare our method with alternative approaches and demonstrate that it generates the desired sketch video under the challenge of one-shot motion customization.
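For context on what "the parameters of the Bezier curves" are, the sketch below evaluates a single cubic Bezier stroke from its four control points; moving those points is what changes the stroke's shape from frame to frame. The cubic degree is an assumption (the abstract does not state the degree used), and the SDS update itself is not shown.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    # Bernstein form: (1-t)^3 P0 + 3(1-t)^2 t P1 + 3(1-t) t^2 P2 + t^3 P3.
    u = 1.0 - t
    return tuple(
        u**3 * a + 3 * u * u * t * b + 3 * u * t * t * c + t**3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )

def sample_stroke(control_points, n=5):
    # Evaluate n evenly spaced points along one stroke (n >= 2).
    p0, p1, p2, p3 = control_points
    return [cubic_bezier(p0, p1, p2, p3, i / (n - 1)) for i in range(n)]

# Hypothetical stroke: an arch from (0, 0) to (1, 0).
stroke = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
points = sample_stroke(stroke, n=5)
```

Because each sampled point is a smooth function of the control points, gradients from an image- or video-space loss can flow back to the control points, which is what makes a vector sketch optimizable in this way.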
[CV-164] Intention-Aware Diffusion Model for Pedestrian Trajectory Prediction
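Quick read: This paper addresses the lack of explicit semantic intent modeling in current diffusion-based pedestrian trajectory prediction, which can lead to misinterpreted behavior and lower prediction accuracy. The key to the solution is joint modeling of short-term and long-term motion intent. Short-term intent uses a residual polar representation that decouples direction and magnitude to capture fine-grained local motion patterns. Long-term intent comes from a learnable token-based endpoint predictor that generates multiple candidate goals with associated probabilities, supporting multimodal, context-aware intent modeling. An adaptive guidance mechanism and a residual noise predictor further improve denoising accuracy and dynamic adaptability in the diffusion process.
Link: https://arxiv.org/abs/2508.07146
Authors: Yu Liu,Zhijie Liu,Xiao Ren,You-Fu Li,He Kong
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: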
Abstract:Predicting pedestrian motion trajectories is critical for the path planning and motion control of autonomous vehicles. Recent diffusion-based models have shown promising results in capturing the inherent stochasticity of pedestrian behavior for trajectory prediction. However, the absence of explicit semantic modelling of pedestrian intent in many diffusion-based methods may result in misinterpreted behaviors and reduced prediction accuracy. To address the above challenges, we propose a diffusion-based pedestrian trajectory prediction framework that incorporates both short-term and long-term motion intentions. Short-term intent is modelled using a residual polar representation, which decouples direction and magnitude to capture fine-grained local motion patterns. Long-term intent is estimated through a learnable, token-based endpoint predictor that generates multiple candidate goals with associated probabilities, enabling multimodal and context-aware intention modelling. Furthermore, we enhance the diffusion process by incorporating adaptive guidance and a residual noise predictor that dynamically refines denoising accuracy. The proposed framework is evaluated on the widely used ETH, UCY, and SDD benchmarks, demonstrating competitive results against state-of-the-art methods.
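The idea of decoupling direction and magnitude can be illustrated by converting per-step displacements of a 2D trajectory into polar form and back. This is only the generic polar decomposition: what the paper's representation is "residual" with respect to is not stated in the abstract, and the function names below are hypothetical.

```python
import math

def to_polar_steps(trajectory):
    # Convert consecutive (x, y) positions into (magnitude, angle) steps.
    steps = []
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        dx, dy = x1 - x0, y1 - y0
        steps.append((math.hypot(dx, dy), math.atan2(dy, dx)))
    return steps

def from_polar_steps(start, steps):
    # Rebuild positions by accumulating each polar step from `start`.
    x, y = start
    points = [(x, y)]
    for r, theta in steps:
        x += r * math.cos(theta)
        y += r * math.sin(theta)
        points.append((x, y))
    return points

# Toy trajectory: one unit step east, then one unit step north.
path = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
polar = to_polar_steps(path)
rebuilt = from_polar_steps(path[0], polar)
```

In this form a change of heading alters only the angle and a change of walking speed alters only the magnitude, so the two can be modeled or supervised separately.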
[CV-165] Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models
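Quick read: This paper addresses the limited practicality of current human-centric vision models (HVMs), which depend on large neural architectures and on pretraining data with restricted access. The core of the solution is Dynamic Pattern Alignment Learning (DPAL), a distillation-based pretraining framework. Its dynamic pattern decoder (D-PaDe) acts as a dynamic Mixture of Experts (MoE) model that adaptively extracts three typical visual patterns (global identity, local shape, and multi-person interaction), conditioned on the input image and pattern queries. Three levels of alignment objectives then minimize the generalization gap between the lightweight model and the large teacher at the global image level, the local pixel level, and the instance relation level, guiding the lightweight model to learn general human visual perception from the large model.
Link: https://arxiv.org/abs/2508.07144
Authors: Xuanhan Wang,Huimin Deng,Ke Liu,Jun Wang,Lianli Gao,Jingkuan Song
Affiliations: Tongji University (同济大学); University of Electronic Science and Technology of China (电子科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: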
Abstract:Human-centric vision models (HVMs) have achieved remarkable generalization due to large-scale pretraining on massive person images. However, their dependence on large neural architectures and the restricted accessibility of pretraining data significantly limits their practicality in real-world applications. To address this limitation, we propose Dynamic Pattern Alignment Learning (DPAL), a novel distillation-based pretraining framework that efficiently trains lightweight HVMs to acquire strong generalization from large HVMs. In particular, human-centric visual perception is highly dependent on three typical visual patterns: the global identity pattern, the local shape pattern, and the multi-person interaction pattern. To achieve generalizable lightweight HVMs, we first design a dynamic pattern decoder (D-PaDe), acting as a dynamic Mixture of Experts (MoE) model. It incorporates three specialized experts dedicated to adaptively extracting typical visual patterns, conditioned on both the input image and pattern queries. We then present three levels of alignment objectives, which aim to minimize the generalization gap between lightweight HVMs and large HVMs at the global image level, local pixel level, and instance relation level. With these two deliberate designs, DPAL effectively guides the lightweight model to learn all typical human visual patterns from large HVMs, which can generalize to various human-centric vision tasks. Extensive experiments conducted on 15 challenging datasets demonstrate the effectiveness of DPAL. Remarkably, when employing PATH-B as the teacher, DPAL-ViT/Ti (5M parameters) achieves surprising generalizability similar to existing large HVMs such as PATH-B (84M) and Sapiens-L (307M), and outperforms previous distillation-based pretraining methods including Proteus-ViT/Ti (5M) and TinyMiM-ViT/Ti (5M) by a large margin.
[CV-166] CMAMRNet: A Contextual Mask-Aware Network Enhancing Mural Restoration Through Comprehensive Mask Guidance BMVC2025
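[Quick Read]: This paper tackles digital mural restoration, which is difficult because of complex degradation patterns and the strict need to preserve artistic authenticity. In particular, existing learning-based methods struggle to keep mask guidance consistent throughout the network, so damaged regions receive too little attention and restoration quality suffers. The proposed CMAMRNet, a contextual mask-aware mural restoration network, has two core components: (1) the Mask-Aware Up/Down-Sampler (MAUDS), which keeps mask sensitivity consistent across scales through channel-wise feature selection and mask-guided feature fusion; and (2) the Co-Feature Aggregator (CFA), which extracts complementary features at the highest and lowest resolutions to capture both fine textures and global structure in degraded regions, improving structural integrity and the fidelity of artistic detail in the restored murals.
Link: https://arxiv.org/abs/2508.07140
Authors: Yingtie Lei,Fanghai Yi,Yihang Dong,Weihuang Liu,Xiaofeng Zhang,Zimeng Li,Chi-Man Pun,Xuhang Chen
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by BMVC 2025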
Abstract:Murals, as invaluable cultural artifacts, face continuous deterioration from environmental factors and human activities. Digital restoration of murals faces unique challenges due to their complex degradation patterns and the critical need to preserve artistic authenticity. Existing learning-based methods struggle with maintaining consistent mask guidance throughout their networks, leading to insufficient focus on damaged regions and compromised restoration quality. We propose CMAMRNet, a Contextual Mask-Aware Mural Restoration Network that addresses these limitations through comprehensive mask guidance and multi-scale feature extraction. Our framework introduces two key components: (1) the Mask-Aware Up/Down-Sampler (MAUDS), which ensures consistent mask sensitivity across resolution scales through dedicated channel-wise feature selection and mask-guided feature fusion; and (2) the Co-Feature Aggregator (CFA), operating at both the highest and lowest resolutions to extract complementary features for capturing fine textures and global structures in degraded regions. Experimental results on benchmark datasets demonstrate that CMAMRNet outperforms state-of-the-art methods, effectively preserving both structural integrity and artistic details in restored murals. The code is available at this https URL.
[CV-167] Perceptual Evaluation of GANs and Diffusion Models for Generating X-rays MICCAI2025
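[Quick Read]: This paper addresses data scarcity in medical imaging, especially for low-prevalence abnormalities (such as atelectasis, lung opacity, pleural effusion, and enlarged cardiac silhouette) that limit the performance of AI-assisted diagnosis and segmentation models. The key is an evaluation of how well Generative Adversarial Networks (GANs) and Diffusion Models (DMs) synthesize chest X-rays: a reader study with radiologists checks visual realism and clinical consistency, and thereby whether the synthetic images can serve as reliable data augmentation for training AI diagnostic systems. The study finds that DMs produce more realistic images overall, while GANs are more accurate for specific conditions (such as the absence of enlarged cardiac silhouette). It also identifies the visual cues radiologists use to spot synthetic images, exposing perceptual gaps in current generative models and pointing to directions for improvement.
Link: https://arxiv.org/abs/2508.07128
Authors: Gregory Schuit,Denis Parra,Cecilia Besa
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to the Workshop on Human-AI Collaboration at MICCAI 2025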
Abstract:Generative image models have achieved remarkable progress in both natural and medical imaging. In the medical context, these techniques offer a potential solution to data scarcity, especially for low-prevalence anomalies that impair the performance of AI-driven diagnostic and segmentation tools. However, questions remain regarding the fidelity and clinical utility of synthetic images, since poor generation quality can undermine model generalizability and trust. In this study, we evaluate the effectiveness of state-of-the-art generative models, namely Generative Adversarial Networks (GANs) and Diffusion Models (DMs), for synthesizing chest X-rays conditioned on four abnormalities: Atelectasis (AT), Lung Opacity (LO), Pleural Effusion (PE), and Enlarged Cardiac Silhouette (ECS). Using a benchmark composed of real images from the MIMIC-CXR dataset and synthetic images from both GANs and DMs, we conducted a reader study with three radiologists of varied experience. Participants were asked to distinguish real from synthetic images and assess the consistency between visual features and the target abnormality. Our results show that while DMs generate more visually realistic images overall, GANs can report better accuracy for specific conditions, such as absence of ECS. We further identify visual cues radiologists use to detect synthetic images, offering insights into the perceptual gaps in current models. These findings underscore the complementary strengths of GANs and DMs and point to the need for further refinement to ensure generative models can reliably augment training datasets for AI diagnostic systems.
[CV-168] AugLift: Boosting Generalization in Lifting-based 3D Human Pose Estimation
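[Quick Read]: This paper addresses the poor generalization of lifting-based 3D Human Pose Estimation (HPE) methods to new datasets and real-world settings. Existing methods typically predict the 3D pose directly from 2D keypoint coordinates (x, y), and their performance drops markedly in out-of-distribution settings. The key is AugLift, a simple yet effective reformulation of the lifting pipeline that sparsely enriches the standard input (the 2D keypoint coordinates) with two extra signals: a keypoint detection confidence score c and a corresponding depth estimate d. Both are computed by off-the-shelf pre-trained models (for example, monocular depth estimators) and so inherit their strong generalization. AugLift is a modular add-on that integrates into any existing lifting architecture without additional data collection or sensors. Experiments show an average improvement of 10.1% in cross-dataset settings and 4.0% in-distribution.
Link: https://arxiv.org/abs/2508.07112
Authors: Nikolai Warner,Wenjin Zhang,Irfan Essa,Apaar Sadhwani
Institutions: Georgia Institute of Technology; University of California, Berkeley; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint. Under review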
Abstract:Lifting-based methods for 3D Human Pose Estimation (HPE), which predict 3D poses from detected 2D keypoints, often generalize poorly to new datasets and real-world settings. To address this, we propose AugLift, a simple yet effective reformulation of the standard lifting pipeline that significantly improves generalization performance without requiring additional data collection or sensors. AugLift sparsely enriches the standard input, the 2D keypoint coordinates (x, y), by augmenting it with a keypoint detection confidence score c and a corresponding depth estimate d. These additional signals are computed from the image using off-the-shelf, pre-trained models (e.g., for monocular depth estimation), thereby inheriting their strong generalization capabilities. Importantly, AugLift serves as a modular add-on and can be readily integrated into existing lifting architectures. Our extensive experiments across four datasets demonstrate that AugLift boosts cross-dataset performance on unseen datasets by an average of 10.1%, while also improving in-distribution performance by 4.0%. These gains are consistent across various lifting architectures, highlighting the robustness of our method. Our analysis suggests that these sparse, keypoint-aligned cues provide robust frame-level context, offering a practical way to significantly improve the generalization of any lifting-based pose estimation model. Code will be made publicly available.
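The input augmentation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name is hypothetical, the depth map is assumed to come from some monocular depth estimator, and nearest-pixel sampling is an assumed choice rather than the authors' confirmed procedure.

```python
def augment_keypoints(keypoints, confidences, depth_map):
    """Turn each 2D keypoint (x, y) into an (x, y, c, d) tuple.

    keypoints:   list of (x, y) pixel coordinates from a 2D detector
    confidences: list of detection confidence scores c, one per keypoint
    depth_map:   2D list indexed as depth_map[row][col], assumed to be the
                 output of an off-the-shelf monocular depth model
    """
    height, width = len(depth_map), len(depth_map[0])
    augmented = []
    for (x, y), c in zip(keypoints, confidences):
        # Nearest-pixel lookup, clamped to the image bounds (assumed choice).
        col = min(max(int(round(x)), 0), width - 1)
        row = min(max(int(round(y)), 0), height - 1)
        augmented.append((x, y, c, depth_map[row][col]))
    return augmented

depth_map = [[1.0, 2.0],
             [3.0, 4.0]]
keypoints = [(0.2, 0.9), (5.0, 0.0)]  # the second keypoint lies outside the image
confidences = [0.9, 0.4]
lifted_input = augment_keypoints(keypoints, confidences, depth_map)
```

Each keypoint keeps its original coordinates and gains only two scalars, which is why the enrichment is called sparse: the lifting network still sees a short per-joint vector rather than image features.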
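[CV-169] Towards High-Order Mean Flow Generative Models: Feasibility, Expressivity, and Provably Efficient Criteria
[Quick Read]: This paper addresses the trade-off between sampling efficiency and dynamical expressivity in generative modelling, specifically how to model higher-order dynamics within the simulation-free Flow Matching framework. The key is Second-Order MeanFlow, which adds average acceleration fields to the objective. The authors prove that the average acceleration satisfies a generalized consistency condition, so the stable one-step sampling of first-order MeanFlow is retained. They further show via circuit complexity analysis that the sampling process can be implemented by uniform threshold circuits in the $\mathsf{TC}^0$ class. Finally, using fast approximate attention computation, they prove that attention operations can be approximated to within $1/\mathrm{poly}(n)$ error in $n^{2+o(1)}$ time, giving a theoretical basis for high-order flow matching models that combine rich dynamics with practical sampling efficiency.
Link: https://arxiv.org/abs/2508.07102
Authors: Yang Cao,Yubin Chen,Zhao Song,Jiahao Zhang
Institutions: Wyoming Seminary; San Jose State University; University of California, Berkeley
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: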
Abstract:Generative modelling has seen significant advances through simulation-free paradigms such as Flow Matching, and in particular, the MeanFlow framework, which replaces instantaneous velocity fields with average velocities to enable efficient single-step sampling. In this work, we introduce a theoretical study on Second-Order MeanFlow, a novel extension that incorporates average acceleration fields into the MeanFlow objective. We first establish the feasibility of our approach by proving that the average acceleration satisfies a generalized consistency condition analogous to first-order MeanFlow, thereby supporting stable, one-step sampling and tractable loss functions. We then characterize its expressivity via circuit complexity analysis, showing that under mild assumptions, the Second-Order MeanFlow sampling process can be implemented by uniform threshold circuits within the $\mathsf{TC}^0$ class. Finally, we derive provably efficient criteria for scalable implementation by leveraging fast approximate attention computations: we prove that attention operations within the Second-Order MeanFlow architecture can be approximated to within $1/\mathrm{poly}(n)$ error in time $n^{2+o(1)}$. Together, these results lay the theoretical foundation for high-order flow matching models that combine rich dynamics with practical sampling efficiency.
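As background for the abstract above, the first-order MeanFlow quantities can be written out as follows. The average-velocity definition and its identity follow the general MeanFlow formulation. The last two lines are only an analogous form assumed here for illustration, taking $a = \mathrm{d}v/\mathrm{d}t$ as the instantaneous acceleration; the paper's exact definition of the average acceleration and its consistency condition may differ.

```latex
\begin{align}
% First-order MeanFlow: average velocity over the interval [r, t]
u(z_t, r, t) &= \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, \mathrm{d}\tau \\
% Differentiating (t - r)\,u with respect to t gives the MeanFlow identity
u(z_t, r, t) &= v(z_t, t) - (t - r)\, \frac{\mathrm{d}}{\mathrm{d}t} u(z_t, r, t) \\
% One-step sampling from time t back to time r
z_r &= z_t - (t - r)\, u(z_t, r, t) \\
% Assumed second-order analogue (illustrative, not taken from the paper)
\bar{a}(z_t, r, t) &= \frac{1}{t - r} \int_r^t a(z_\tau, \tau)\, \mathrm{d}\tau \\
\bar{a}(z_t, r, t) &= a(z_t, t) - (t - r)\, \frac{\mathrm{d}}{\mathrm{d}t} \bar{a}(z_t, r, t)
\end{align}
```

Under that assumed definition, the last identity follows by the same step as the first-order one: multiply the integral definition by $(t - r)$ and differentiate both sides with respect to $t$.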
[CV-170] Communication-Efficient Multi-Agent 3D Detection via Hybrid Collaboration
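[Quick Read]: This paper addresses the fundamental trade-off between detection performance and communication bandwidth in collaborative 3D object detection. Existing methods struggle to balance efficiency and accuracy when exchanging information, so performance degrades noticeably under low bandwidth. The key is a hybrid collaboration mechanism that adaptively integrates two types of messages, compact perceptual outputs and information-rich raw observations, and prioritizes the most critical data within each type. On this basis the authors build HyComm, a communication-efficient LiDAR-based collaborative 3D detection system that supports adaptable message compression rates and standardized data formats, keeping it compatible across different detection models and agent configurations. HyComm outperforms previous methods on DAIR-V2X and OPV2V, and on DAIR-V2X it cuts communication volume by more than 2,006× while still outperforming Where2comm in AP50.
Link: https://arxiv.org/abs/2508.07092
Authors: Yue Hu,Juntong Peng,Yunqiao Yang,Siheng Chen
Institutions: Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: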
Abstract:Collaborative 3D detection can substantially boost detection performance by allowing agents to exchange complementary information. It inherently results in a fundamental trade-off between detection performance and communication bandwidth. To tackle this bottleneck issue, we propose a novel hybrid collaboration that adaptively integrates two types of communication messages: perceptual outputs, which are compact, and raw observations, which offer richer information. This approach focuses on two key aspects: i) integrating complementary information from two message types and ii) prioritizing the most critical data within each type. By adaptively selecting the most critical set of messages, it ensures optimal perceptual information and adaptability, effectively meeting the demands of diverse communication scenarios. Based on this hybrid collaboration, we present HyComm, a communication-efficient LiDAR-based collaborative 3D detection system. HyComm boasts two main benefits: i) it facilitates adaptable compression rates for messages, addressing various communication requirements, and ii) it uses standardized data formats for messages. This ensures they are independent of specific detection models, fostering adaptability across different agent configurations. To evaluate HyComm, we conduct experiments on both real-world and simulation datasets: DAIR-V2X and OPV2V. HyComm consistently outperforms previous methods and achieves a superior performance-bandwidth trade-off regardless of whether agents use the same or varied detection models. It reduces communication volume by more than 2,006× and still outperforms Where2comm on DAIR-V2X in terms of AP50. The related code will be released.
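One way to picture "selecting the most critical set of messages" under a bandwidth limit is a greedy pick by importance per byte. The sketch below is purely illustrative: the message fields, the importance scores, and the greedy rule are all assumptions made here, not HyComm's actual selection procedure.

```python
def select_messages(candidates, budget_bytes):
    """Greedily pick messages by importance per byte until the budget is used.

    Each candidate is a dict with hypothetical keys 'id', 'kind',
    'importance', and 'size_bytes'.
    """
    ranked = sorted(
        candidates,
        key=lambda m: m["importance"] / m["size_bytes"],
        reverse=True,
    )
    chosen, used = [], 0
    for msg in ranked:
        if used + msg["size_bytes"] <= budget_bytes:
            chosen.append(msg)
            used += msg["size_bytes"]
    return chosen, used

# Made-up candidates: compact perceptual outputs and larger raw observations.
candidates = [
    {"id": "box_a", "kind": "perceptual_output", "importance": 0.9, "size_bytes": 32},
    {"id": "box_b", "kind": "perceptual_output", "importance": 0.3, "size_bytes": 32},
    {"id": "points_c", "kind": "raw_observation", "importance": 0.8, "size_bytes": 400},
    {"id": "points_d", "kind": "raw_observation", "importance": 0.5, "size_bytes": 800},
]
chosen, used = select_messages(candidates, budget_bytes=500)
chosen_ids = [m["id"] for m in chosen]
```

With a 500-byte budget, both compact boxes and one raw-observation chunk fit, while the largest chunk is skipped. Changing the budget changes the mix of message types, which is the kind of adaptability the abstract describes.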
[CV-171] ForeSight: Multi-View Streaming Joint Object Detection and Trajectory Forecasting ICCV2025
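[Quick Read]: This paper addresses the under-use of temporal information that results when detection and forecasting are handled separately in vision-based 3D perception for autonomous driving. Traditional approaches treat the two as sequential tasks and cannot effectively exploit temporal cues, which limits performance. The key is the ForeSight framework, which uses multi-task streaming and bidirectional learning so that the detection and forecasting modules share query memory and propagate information seamlessly. A forecast-aware detection transformer improves spatial reasoning by drawing trajectory predictions from a multiple-hypothesis forecast memory queue, while a streaming forecast transformer improves temporal consistency using past forecasts and refined detections. The model is tracking-free: it needs no explicit object association, which reduces error propagation and scales efficiently across multi-frame sequences.
Link: https://arxiv.org/abs/2508.07089
Authors: Sandro Papais,Letian Wang,Brian Cheong,Steven L. Waslander
Institutions: University of Toronto
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to ICCV 2025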
Abstract:We introduce ForeSight, a novel joint detection and forecasting framework for vision-based 3D perception in autonomous vehicles. Traditional approaches treat detection and forecasting as separate sequential tasks, limiting their ability to leverage temporal cues. ForeSight addresses this limitation with a multi-task streaming and bidirectional learning approach, allowing detection and forecasting to share query memory and propagate information seamlessly. The forecast-aware detection transformer enhances spatial reasoning by integrating trajectory predictions from a multiple hypothesis forecast memory queue, while the streaming forecast transformer improves temporal consistency using past forecasts and refined detections. Unlike tracking-based methods, ForeSight eliminates the need for explicit object association, reducing error propagation with a tracking-free model that efficiently scales across multi-frame sequences. Experiments on the nuScenes dataset show that ForeSight achieves state-of-the-art performance, achieving an EPA of 54.9%, surpassing previous methods by 9.3%, while also attaining the best mAP and minADE among multi-view detection and forecasting models.
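For readers unfamiliar with the minADE metric mentioned above, it is commonly computed as the lowest average displacement error among several forecast hypotheses. The sketch below shows that common definition; the benchmark's exact protocol (forecast horizon, matching of forecasts to ground-truth objects) is not given in the abstract and is not modelled here.

```python
import math

def min_ade(hypotheses, ground_truth):
    """Minimum over hypotheses of the mean Euclidean distance to the ground truth.

    hypotheses:   list of candidate trajectories, each a list of (x, y) points
    ground_truth: the observed future trajectory, same length as each hypothesis
    """
    def ade(prediction):
        total = sum(math.dist(p, g) for p, g in zip(prediction, ground_truth))
        return total / len(ground_truth)

    return min(ade(h) for h in hypotheses)

ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
hypotheses = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # offset by 1 m at every step
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.3)],  # only the final point is off
]
score = min_ade(hypotheses, ground_truth)
```

Because only the best hypothesis counts, the metric rewards a forecaster that covers the true future with at least one of its modes, which suits multiple-hypothesis forecasts.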
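[CV-172] TeSO: Representing and Compressing 3D Point Cloud Scenes with Textured Surfel Octree
[Quick Read]: This paper addresses a key problem in 3D visual content streaming: a 3D representation that renders at high quality and also compresses efficiently. Existing representations (point clouds, meshes, and 3D Gaussians) each have limitations in rendering quality, surface definition, or compressibility. The key is the Textured Surfel Octree (TeSO), a new representation built from point clouds that describes a scene as cube-bounded surfels (surface elements) organized on an octree, each with an attached texture patch. Approximating smooth surfaces with large surfels at coarser octree levels reduces the number of primitives, while the texture patches retain high-frequency detail. Combined with a compression scheme that exploits the octree structure, this yields higher rendering quality at lower bit-rates.
Link: https://arxiv.org/abs/2508.07083
Authors: Yueyu Hu,Ran Gong,Tingyu Fan,Yao Wang
Institutions: New York University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: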
Abstract:3D visual content streaming is a key technology for emerging 3D telepresence and AR/VR applications. One fundamental element underlying the technology is a versatile 3D representation that is capable of producing high-quality renders and can be efficiently compressed at the same time. Existing 3D representations like point clouds, meshes and 3D Gaussians each have limitations in terms of rendering quality, surface definition, and compressibility. In this paper, we present the Textured Surfel Octree (TeSO), a novel 3D representation that is built from point clouds but addresses the aforementioned limitations. It represents a 3D scene as cube-bounded surfels organized on an octree, where each surfel is further associated with a texture patch. By approximating a smooth surface with a large surfel at a coarser level of the octree, it reduces the number of primitives required to represent the 3D scene, and yet retains the high-frequency texture details through the texture map attached to each surfel. We further propose a compression scheme to encode the geometry and texture efficiently, leveraging the octree structure. The proposed textured surfel octree combined with the compression scheme achieves higher rendering quality at lower bit-rates compared to multiple point cloud and 3D Gaussian-based baselines.
[CV-173] SAGCNet: Spatial-Aware Graph Completion Network for Missing Slice Imputation in Population CMR Imaging MICCAI2025
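[Quick Read]: This paper addresses the loss of accuracy in volumetric MRI analysis, such as cardiac magnetic resonance (CMR), caused by missing or unusable slices. The core challenge is modelling local inter-slice correlations together with global 3D spatial context, especially when only some slices are available. The key is the Spatial-Aware Graph Completion Network (SAGCNet), with two components: a volumetric slice graph completion module that embeds inter-slice relationships in a graph structure to model local dependencies explicitly, and a volumetric spatial adapter that strengthens the model's ability to capture and use various forms of 3D spatial context. Experiments show that it synthesizes missing slices with high quality even from limited slice data and outperforms existing state-of-the-art MRI synthesis methods.
Link: https://arxiv.org/abs/2508.07041
Authors: Junkai Liu,Nay Aung,Theodoros N. Arvanitis,Stefan K. Piechnik,Joao A C Lima,Steffen E. Petersen,Le Zhang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2025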
Abstract:Magnetic resonance imaging (MRI) provides detailed soft-tissue characteristics that assist in disease diagnosis and screening. However, the accuracy of clinical practice is often hindered by missing or unusable slices due to various factors. Volumetric MRI synthesis methods have been developed to address this issue by imputing missing slices from available ones. The inherent 3D nature of volumetric MRI data, such as cardiac magnetic resonance (CMR), poses significant challenges for missing slice imputation approaches, including (1) the difficulty of modeling local inter-slice correlations and dependencies of volumetric slices, and (2) the limited exploration of crucial 3D spatial information and global context. In this study, to mitigate these issues, we present Spatial-Aware Graph Completion Network (SAGCNet) to overcome the dependency on complete volumetric data, featuring two main innovations: (1) a volumetric slice graph completion module that incorporates the inter-slice relationships into a graph structure, and (2) a volumetric spatial adapter component that enables our model to effectively capture and utilize various forms of 3D spatial context. Extensive experiments on cardiac MRI datasets demonstrate that SAGCNet is capable of synthesizing absent CMR slices, outperforming competitive state-of-the-art MRI synthesis methods both quantitatively and qualitatively. Notably, our model maintains superior performance even with limited slice data.
[CV-174] 3DGS-VBench: A Comprehensive Video Quality Evaluation Benchmark for 3DGS Compression
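[Quick Read]: This paper addresses two issues: the large storage requirements that hinder practical deployment of 3D Gaussian Splatting (3DGS), and the lack of systematic visual quality assessment for the unique distortions that existing generative compression methods introduce. The key is 3DGS-VBench, a large-scale Video Quality Assessment (VQA) dataset and benchmark containing 660 compressed 3DGS models and their video sequences, generated from 11 scenes with 6 state-of-the-art 3DGS compression algorithms. Mean Opinion Scores (MOS) collected from 50 participants are used to validate the dataset's reliability. The benchmark supports quantitative comparison of compression efficiency and visual quality, evaluates 15 quality assessment metrics across multiple paradigms, and provides a basis for training VQA models specialized for 3DGS.
Link: https://arxiv.org/abs/2508.07038
Authors: Yuke Xing,William Gordon,Qi Yang,Kaifa Yang,Jiarui Wang,Yiling Xu
Institutions: Shanghai Jiao Tong University; Basis Independent Silicon Valley; University of Missouri-Kansas City
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: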
Abstract:3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual fidelity, but its substantial storage requirements hinder practical deployment, prompting state-of-the-art (SOTA) 3DGS methods to incorporate compression modules. However, these 3DGS generative compression techniques introduce unique distortions lacking systematic quality assessment research. To this end, we establish 3DGS-VBench, a large-scale Video Quality Assessment (VQA) Dataset and Benchmark with 660 compressed 3DGS models and video sequences generated from 11 scenes across 6 SOTA 3DGS compression algorithms with systematically designed parameter levels. With annotations from 50 participants, we obtained MOS scores with outlier removal and validated dataset reliability. We benchmark 6 3DGS compression algorithms on storage efficiency and visual quality, and evaluate 15 quality assessment metrics across multiple paradigms. Our work enables specialized VQA model training for 3DGS, serving as a catalyst for compression and quality assessment research. The dataset is available at this https URL.
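A benchmark of this kind typically involves two computations: turning raw ratings into MOS values, and checking how well an objective metric ranks the videos compared with MOS. The sketch below illustrates both under stated assumptions: the two-standard-deviation outlier rule is a simple stand-in (the abstract does not name the screening procedure), and all ratings and metric scores are made up.

```python
import statistics

def mos(ratings, z_thresh=2.0):
    """Mean opinion score after dropping ratings far from the mean (assumed rule)."""
    mean = statistics.fmean(ratings)
    std = statistics.pstdev(ratings)
    kept = [r for r in ratings if std == 0 or abs(r - mean) <= z_thresh * std]
    return statistics.fmean(kept)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(a, b):
    """Spearman rank correlation (Pearson correlation of the ranks).

    Not defined when either input is constant; that case is not handled here.
    """
    ra, rb = ranks(a), ranks(b)
    ma, mb = statistics.fmean(ra), statistics.fmean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

# Made-up ratings for three videos on a 1-5 scale.
ratings_per_video = [
    [4, 4, 5, 4, 4, 4, 4, 1],
    [2, 3, 2, 2, 3, 2, 2, 2],
    [5, 5, 4, 5, 5, 5, 4, 5],
]
mos_scores = [mos(r) for r in ratings_per_video]
metric_scores = [0.71, 0.40, 0.93]  # hypothetical objective metric outputs
srcc = spearman(metric_scores, mos_scores)
```

In the first video the single rating of 1 is screened out before averaging. The rank correlation then measures whether the metric orders the videos the same way human viewers do, independent of the metric's scale.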
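[CV-175] Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities
[Quick Read]: This paper addresses hallucinations produced by Large Language Models (LLMs) in medical imaging tasks, that is, outputs that sound confident but are wrong and may mislead clinical decisions. It studies two directions: image to text, where an LLM generates a diagnostic report from X-ray, CT, or MRI scans, and text to image, where a model generates medical images from clinical prompts. The key is a systematic analysis of error patterns in both tasks, such as factual inconsistencies and anatomical inaccuracies, with outputs evaluated across imaging modalities using expert-informed criteria. This identifies the main contributing factors, including model architecture and training data, and provides evidence and directions for improving the safety and trustworthiness of LLM-driven medical imaging systems.
Link: https://arxiv.org/abs/2508.07031
Authors: Anindya Bijoy Das,Shahnewaz Karim Sakib,Shibbir Ahmed
Institutions: The University of Akron; University of Tennessee at Chattanooga; Texas State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: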
Abstract:Large Language Models (LLMs) are increasingly applied to medical imaging tasks, including image interpretation and synthetic image generation. However, these models often produce hallucinations, which are confident but incorrect outputs that can mislead clinical decisions. This study examines hallucinations in two directions: image to text, where LLMs generate reports from X-ray, CT, or MRI scans, and text to image, where models create medical images from clinical prompts. We analyze errors such as factual inconsistencies and anatomical inaccuracies, evaluating outputs using expert-informed criteria across imaging modalities. Our findings reveal common patterns of hallucination in both interpretive and generative tasks, with implications for clinical reliability. We also discuss factors contributing to these failures, including model architecture and training data. By systematically studying both image understanding and generation, this work provides insights into improving the safety and trustworthiness of LLM-driven medical imaging systems.
[CV-176] Large Language Model Evaluated Stand-alone Attention-Assisted Graph Neural Network with Spatial and Structural Information Interaction for Precise Endoscopic Image Segmentation
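[Quick Read]: This paper addresses polyp segmentation in endoscopic images, where the main challenges are low contrast with the surrounding mucosa, specular highlights, and indistinct boundaries. The key is the FOCUS-Med framework, whose core contributions are: (1) a Dual Graph Convolutional Network (Dual-GCN) module that combines spatial and topological structural information to better preserve polyp boundaries; (2) a location-fused stand-alone self-attention mechanism that strengthens global context modelling; and (3) a trainable weighted fast normalized fusion strategy that bridges the semantic gap between encoder and decoder layers for efficient multi-scale feature aggregation. In addition, the authors are the first to use a Large Language Model (LLM) to provide detailed qualitative evaluations of segmentation quality.
Link: https://arxiv.org/abs/2508.07028
Authors: Juntong Fan,Shuyi Fan,Debesh Jha,Changsheng Fang,Tieyong Zeng,Hengyong Yu,Dayang Wang
Institutions: The Chinese University of Hong Kong; University of Massachusetts, Lowell; Northwestern University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Manuscript under review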
Abstract:Accurate endoscopic image segmentation on the polyps is critical for early colorectal cancer detection. However, this task remains challenging due to low contrast with surrounding mucosa, specular highlights, and indistinct boundaries. To address these challenges, we propose FOCUS-Med, which stands for Fusion of spatial and structural graph with attentional context-aware polyp segmentation in endoscopic medical imaging. FOCUS-Med integrates a Dual Graph Convolutional Network (Dual-GCN) module to capture contextual spatial and topological structural dependencies. This graph-based representation enables the model to better distinguish polyps from background tissues by leveraging topological cues and spatial connectivity, which are often obscured in raw image intensities. It enhances the model’s ability to preserve boundaries and delineate complex shapes typical of polyps. In addition, a location-fused stand-alone self-attention is employed to strengthen global context integration. To bridge the semantic gap between encoder-decoder layers, we incorporate a trainable weighted fast normalized fusion strategy for efficient multi-scale aggregation. Notably, we are the first to introduce the use of a Large Language Model (LLM) to provide detailed qualitative evaluations of segmentation quality. Extensive experiments on public benchmarks demonstrate that FOCUS-Med achieves state-of-the-art performance across five key metrics, underscoring its effectiveness and clinical potential for AI-assisted colonoscopy.
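The "weighted fast normalized fusion" named above appears to refer to the scheme popularized by BiFPN in EfficientDet, where each input feature gets a learnable non-negative weight and the weights are normalized by their sum instead of a softmax. Assuming that reading, a minimal sketch on flat feature vectors looks like this; the function name is hypothetical and the paper's exact variant may differ.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse equally sized feature vectors with normalized non-negative weights.

    output = sum_i(w_i * f_i) / (eps + sum_j(w_j)), with w_i clipped at zero.
    In a trained network the weights would be learnable parameters; here they
    are plain floats for illustration.
    """
    clipped = [max(w, 0.0) for w in weights]  # ReLU keeps each weight >= 0
    total = sum(clipped) + eps
    length = len(features[0])
    return [
        sum(w * f[i] for w, f in zip(clipped, features)) / total
        for i in range(length)
    ]

encoder_feature = [1.0, 2.0]
decoder_feature = [3.0, 4.0]
equal = fast_normalized_fusion([encoder_feature, decoder_feature], [1.0, 1.0], eps=0.0)
gated = fast_normalized_fusion([encoder_feature, decoder_feature], [-1.0, 2.0], eps=0.0)
```

Equal weights give the plain average, while a negative weight is clipped to zero so that input is ignored. Normalizing by a sum rather than a softmax avoids exponentials, which is where the "fast" in the name comes from.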
[CV-177] MV-CoRe: Multimodal Visual-Conceptual Reasoning for Complex Visual Question Answering
[Quick Read]: This paper addresses the weak multimodal reasoning and external knowledge integration of existing large vision-language models (LVLMs) on Complex Visual Question Answering (Complex VQA), which stems from their reliance on high-level global features. The key is MV-CoRe (Multimodal Visual-Conceptual Reasoning), which fuses global embeddings from pre-trained Vision Large Models (VLMs) and Language Large Models (LLMs) with fine-grained, semantic-aware visual features (such as object detection features and scene graph representations). A Multimodal Fusion Transformer then integrates these feature sets through cross-modal attention to support complex reasoning.
Link: https://arxiv.org/abs/2508.07023
Authors: Jingwei Peng,Jiehao Chen,Mateo Alejandro Rojas,Meilin Zhang
Institutions: Shaanxi University of Technology; Technological University of Peru
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Complex Visual Question Answering (Complex VQA) tasks, which demand sophisticated multi-modal reasoning and external knowledge integration, present significant challenges for existing large vision-language models (LVLMs) often limited by their reliance on high-level global features. To address this, we propose MV-CoRe (Multimodal Visual-Conceptual Reasoning), a novel model designed to enhance Complex VQA performance through the deep fusion of diverse visual and linguistic information. MV-CoRe meticulously integrates global embeddings from pre-trained Vision Large Models (VLMs) and Language Large Models (LLMs) with fine-grained semantic-aware visual features, including object detection characteristics and scene graph representations. An innovative Multimodal Fusion Transformer then processes and deeply integrates these diverse feature sets, enabling rich cross-modal attention and facilitating complex reasoning. We evaluate MV-CoRe on challenging Complex VQA benchmarks, including GQA, A-OKVQA, and OKVQA, after training on VQAv2. Our experimental results demonstrate that MV-CoRe consistently outperforms established LVLM baselines, achieving an overall accuracy of 77.5% on GQA. Ablation studies confirm the critical contribution of both object and scene graph features, and human evaluations further validate MV-CoRe’s superior factual correctness and reasoning depth, underscoring its robust capabilities for deep visual and conceptual understanding.
[CV-178] DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents
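[Quick Read]: This paper addresses the difficulty that traditional methods have in understanding, refining, and automatically summarizing scientific PDF documents with complex layouts and multimodal content. Applying Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) directly lacks the precision and control needed for fine-grained editing. The key is DocRefine, a framework driven by natural language instructions that orchestrates a multi-agent system of six specialized, collaborating agents (Layout Structure Analysis, Multimodal Content Understanding, Instruction Decomposition, Content Refinement, Summarization Generation, and Fidelity Consistency Verification) in a closed-loop feedback architecture, which safeguards semantic accuracy and visual fidelity.
Link: https://arxiv.org/abs/2508.07021
Authors: Kun Qian,Wenjie Li,Tianyu Sun,Wenhong Wang,Wenhan Luo
Institutions: Shangqiu University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: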
Abstract:The exponential growth of scientific literature in PDF format necessitates advanced tools for efficient and accurate document understanding, summarization, and content optimization. Traditional methods fall short in handling complex layouts and multimodal content, while direct application of Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) lacks precision and control for intricate editing tasks. This paper introduces DocRefine, an innovative framework designed for intelligent understanding, content refinement, and automated summarization of scientific PDF documents, driven by natural language instructions. DocRefine leverages the power of advanced LVLMs (e.g., GPT-4o) by orchestrating a sophisticated multi-agent system comprising six specialized and collaborative agents: Layout Structure Analysis, Multimodal Content Understanding, Instruction Decomposition, Content Refinement, Summarization Generation, and Fidelity Consistency Verification. This closed-loop feedback architecture ensures high semantic accuracy and visual fidelity. Evaluated on the comprehensive DocEditBench dataset, DocRefine consistently outperforms state-of-the-art baselines across various tasks, achieving overall scores of 86.7% for Semantic Consistency Score (SCS), 93.9% for Layout Fidelity Index (LFI), and 85.0% for Instruction Adherence Rate (IAR). These results demonstrate DocRefine’s superior capability in handling complex multimodal document editing, preserving semantic integrity, and maintaining visual consistency, marking a significant advancement in automated scientific document processing.
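[CV-179] TerraMAE: Learning Spatial-Spectral Representations from Hyperspectral Earth Observation Data via Adaptive Masked Autoencoders
[Quick Read]: This paper addresses the difficulty that existing self-supervised methods, such as Masked Autoencoders, have in modelling the complex spatial-spectral correlations in hyperspectral imagery (HSI) with 200+ bands. The key is the TerraMAE framework, whose core contributions are an adaptive channel grouping strategy based on statistical reflectance properties, which captures spectral similarities, and an enhanced reconstruction loss that incorporates spatial and spectral quality metrics. The method performs strongly in high-fidelity image reconstruction and on three downstream tasks: crop identification, land cover classification, and soil texture prediction.
Link: https://arxiv.org/abs/2508.07020
Authors: Tanjim Bin Faruk,Abdul Matin,Shrideep Pallickara,Sangmi Lee Pallickara
Institutions: Colorado State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: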
Abstract:Hyperspectral satellite imagery offers sub-30 m views of Earth in hundreds of contiguous spectral bands, enabling fine-grained mapping of soils, crops, and land cover. While self-supervised Masked Autoencoders excel on RGB and low-band multispectral data, they struggle to exploit the intricate spatial-spectral correlations in 200+ band hyperspectral images. We introduce TerraMAE, a novel HSI encoding framework specifically designed to learn highly representative spatial-spectral embeddings for diverse geospatial analyses. TerraMAE features an adaptive channel grouping strategy, based on statistical reflectance properties to capture spectral similarities, and an enhanced reconstruction loss function that incorporates spatial and spectral quality metrics. We demonstrate TerraMAE’s effectiveness through superior spatial-spectral information preservation in high-fidelity image reconstruction. Furthermore, we validate its practical utility and the quality of its learned representations through strong performance on three key downstream geospatial tasks: crop identification, land cover classification, and soil texture prediction.
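To give a feel for grouping channels by their reflectance statistics, the sketch below merges adjacent spectral bands while their reflectance values stay highly correlated and starts a new group when the correlation drops. This is a hypothetical rule chosen for illustration: the abstract does not say which statistics or grouping procedure TerraMAE uses, and the threshold and data are made up.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5

def group_bands(bands, threshold=0.9):
    """Group contiguous bands whose neighbouring reflectances correlate strongly.

    bands: list of per-band reflectance samples taken at the same pixels.
    Returns a list of groups, each a list of band indices.
    """
    groups = [[0]]
    for idx in range(1, len(bands)):
        if pearson(bands[idx - 1], bands[idx]) >= threshold:
            groups[-1].append(idx)
        else:
            groups.append([idx])
    return groups

# Four toy bands sampled at four pixels: bands 0-1 rise together, bands 2-3 fall together.
bands = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.1],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.0, 4.0, 2.1],
]
groups = group_bands(bands)
```

Each resulting group could then be masked and reconstructed together, so that the autoencoder treats spectrally similar bands as one unit instead of 200+ unrelated channels.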
[CV-180] HiMat: DiT-based Ultra-High Resolution SVBRDF Generation
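[Quick Read]: This paper addresses efficient generation of native 4K-resolution SVBRDF maps, in particular how to keep multiple SVBRDF maps (such as diffuse, specular, and normal maps) structurally consistent without destructive changes to the pre-trained diffusion transformer (DiT) backbone. The key is the HiMat framework and its CrossStitch module, a lightweight convolutional module that captures dependencies between maps through localized operations. Its weights are initialized so that the DiT backbone's behaviour is unchanged before finetuning begins, which allows cross-map consistency to be learned efficiently during finetuning while preserving the original model's prior.
Link: https://arxiv.org/abs/2508.07011
Authors: Zixiong Wang,Jian Yang,Yiwei Hu,Milos Hasan,Beibei Wang
Institutions: Nankai University; Adobe Research; NVIDIA Research; Nanjing University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: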
Abstract:Creating highly detailed SVBRDFs is essential for 3D content creation. The rise of high-resolution text-to-image generative models, based on diffusion transformers (DiT), suggests an opportunity to finetune them for this task. However, retargeting the models to produce multiple aligned SVBRDF maps instead of just RGB images, while achieving high efficiency and ensuring consistency across different maps, remains a challenge. In this paper, we introduce HiMat: a memory- and computation-efficient diffusion-based framework capable of generating native 4K-resolution SVBRDFs. A key challenge we address is maintaining consistency across different maps in a lightweight manner, without relying on training new VAEs or significantly altering the DiT backbone (which would damage its prior capabilities). To tackle this, we introduce the CrossStitch module, a lightweight convolutional module that captures inter-map dependencies through localized operations. Its weights are initialized such that the DiT backbone operation is unchanged before finetuning starts. HiMat enables generation with strong structural coherence and high-frequency details. Results with a large set of text prompts demonstrate the effectiveness of our approach for 4K SVBRDF generation. Further experiments suggest generalization to tasks such as intrinsic decomposition.
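The idea of initializing an added module so that the backbone's output is unchanged can be illustrated with a minimal Python sketch. It assumes a residual formulation in which each map's features receive a weighted sum of all maps' features. The names `cross_map_mix` and `zero_init` are hypothetical, and the per-position mixing below only stands in for the convolutional CrossStitch module, whose actual design is not given in this listing.

```python
def zero_init(num_maps):
    """Mixing weights that are all zero, so the module starts as an identity."""
    return [[0.0] * num_maps for _ in range(num_maps)]


def cross_map_mix(features, weights):
    """Residual mixing across maps (an assumed form, not the paper's module).

    features: one flat feature list per SVBRDF map, all of equal length.
    weights:  weights[i][j] scales how much map j contributes to map i.
    Returns features[i] + sum_j weights[i][j] * features[j] for every map i.
    """
    num_maps = len(features)
    length = len(features[0])
    return [
        [
            features[i][k]
            + sum(weights[i][j] * features[j][k] for j in range(num_maps))
            for k in range(length)
        ]
        for i in range(num_maps)
    ]


maps = [[1.0, 2.0], [3.0, 4.0]]
# With zero weights the output equals the input, so the backbone is untouched.
untouched = cross_map_mix(maps, zero_init(len(maps)))
# After finetuning, non-zero weights let one map inform another.
mixed = cross_map_mix(maps, [[0.0, 0.5], [0.0, 0.0]])
```

Starting from zero weights means finetuning begins from the pretrained model's behavior and only gradually departs from it.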
[CV-181] Spatio-Temporal Conditional Diffusion Models for Forecasting Future Multiple Sclerosis Lesion Masks Conditioned on Treatments MICCAI2025
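[Quick Read]: This paper addresses personalized prediction of heterogeneous lesion progression in Multiple Sclerosis (MS), in particular how to generate accurate masks of future lesion evolution from current imaging and treatment information. The key to the solution is the first treatment-aware spatio-temporal diffusion model, which operates in voxel space and fuses multi-modal patient data (including MRI and treatment information) to generatively forecast new and enlarging T2 (NET2) lesions at a future time point. The method is validated on a multi-centre clinical trial dataset of 2131 patient 3D MRIs covering six different treatments. Downstream tasks (future lesion count and location estimation, lesion activity classification, and counterfactual lesion generation) demonstrate its potential in real clinical settings and highlight the value of causal, image-based generative models for data-driven prognostics in MS.
Link: https://arxiv.org/abs/2508.07006
Authors: Gian Mario Favero,Ge Ya Luo,Nima Fathi,Justin Szeto,Douglas L. Arnold,Brennan Nichyporuk,Chris Pal,Tal Arbel
Affiliations: McGill University; Montreal Neurological Hospital; Mila - Quebec AI Institute; Université de Montréal; Vector Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to MICCAI 2025 (LMID Workshop)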
Abstract:Image-based personalized medicine has the potential to transform healthcare, particularly for diseases that exhibit heterogeneous progression such as Multiple Sclerosis (MS). In this work, we introduce the first treatment-aware spatio-temporal diffusion model that is able to generate future masks demonstrating lesion evolution in MS. Our voxel-space approach incorporates multi-modal patient data, including MRI and treatment information, to forecast new and enlarging T2 (NET2) lesion masks at a future time point. Extensive experiments on a multi-centre dataset of 2131 patient 3D MRIs from randomized clinical trials for relapsing-remitting MS demonstrate that our generative model is able to accurately predict NET2 lesion masks for patients across six different treatments. Moreover, we demonstrate our model has the potential for real-world clinical applications through downstream tasks such as future lesion count and location estimation, binary lesion activity classification, and generating counterfactual future NET2 masks for several treatments with different efficacies. This work highlights the potential of causal, image-based generative models as powerful tools for advancing data-driven prognostics in MS.
[CV-182] S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything without Supervision
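[Quick Read]: This paper addresses the multi-stage training schedule and time-consuming pseudo-mask generation in the pretraining of current self-supervised image segmentation models, which limit scaling with training data size and lead to sub-optimal solutions because optimization is discontinuous. The key to the solution is a new fast pseudo-mask algorithm, Fast Universal Agglomerative Pooling (UniAP), which generates semantic-level, instance-level, and multi-granular pseudo-masks in parallel within tens of milliseconds per image. Building on UniAP, the paper proposes S2-UniSeg, a scalable self-supervised universal segmentation framework that is pretrained continuously with a student and a momentum teacher, and introduces a segmentation-oriented pretext task, Query-wise Self-Distillation (QuerySD), to learn local-to-global correspondences. The result is a clear performance gain over the prior state of the art on multiple benchmarks.
Link: https://arxiv.org/abs/2508.06995
Authors: Huihui Xu,Jin Ye,Hongqiu Wang,Changkai Ji,Jiashi Lin,Ming Hu,Ziyan Huang,Ying Chen,Chenglong Ma,Tianbin Li,Lihao Liu,Junjun He,Lei Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: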
Abstract:Recent self-supervised image segmentation models have achieved promising performance on semantic segmentation and class-agnostic instance segmentation. However, their pretraining schedule is multi-stage, requiring a time-consuming pseudo-mask generation process between training epochs. This time-consuming offline process not only makes it difficult to scale with training dataset size, but also leads to sub-optimal solutions due to its discontinuous optimization routine. To solve these problems, we first present a novel pseudo-mask algorithm, Fast Universal Agglomerative Pooling (UniAP). Each layer of UniAP can identify groups of similar nodes in parallel, allowing it to generate semantic-level, instance-level, and multi-granular pseudo-masks within tens of milliseconds for one image. Based on the fast UniAP, we propose the Scalable Self-Supervised Universal Segmentation (S2-UniSeg), which employs a student and a momentum teacher for continuous pretraining. A novel segmentation-oriented pretext task, Query-wise Self-Distillation (QuerySD), is proposed to pretrain S2-UniSeg to learn the local-to-global correspondences. Under the same setting, S2-UniSeg outperforms the SOTA UnSAM model, achieving notable improvements of AP+6.9 on COCO, AR+11.1 on UVO, PixelAcc+4.5 on COCOStuff-27, RQ+8.0 on Cityscapes. After scaling up to a larger 2M-image subset of SA-1B, S2-UniSeg further achieves performance gains on all four benchmarks. Our code and pretrained models are available at this https URL
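A "momentum teacher" is commonly implemented as an exponential moving average (EMA) of the student's weights. The Python sketch below shows that common update rule only. The function name `ema_update` and the momentum value are illustrative assumptions, and the listing does not confirm that S2-UniSeg uses exactly this form.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step toward the matching student weight.

    teacher, student: flat lists of floats of equal length.
    momentum: fraction of the old teacher value that is kept (assumed value).
    """
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher, student)
    ]


teacher = [1.0, 0.0]
student = [0.0, 1.0]
# With momentum 0.5 the teacher lands halfway between the two weight sets.
teacher = ema_update(teacher, student, momentum=0.5)
```

Because the teacher changes slowly, its targets stay stable while the student is trained in one continuous run.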
[CV-183] OctreeNCA: Single-Pass 184 MP Segmentation on Consumer Hardware
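[Quick Read]: This paper addresses the VRAM bottleneck caused by large inputs in medical image and video segmentation. With architectures such as UNets or Vision Transformers, VRAM consumption rises sharply with input size, which forces patch- or frame-wise processing and sacrifices global consistency and inference speed. The key to the solution is OctreeNCA, which generalizes the neighborhood definition of the Neural Cellular Automaton (NCA) using an octree data structure so that global information can be traversed efficiently. In addition, a dedicated NCA inference function implemented in CUDA further lowers VRAM usage and speeds up inference, so that high-resolution pathology slices (184 Megapixels) or 1-minute surgical videos can be segmented in a single pass while using 90% less VRAM than a UNet.
Link: https://arxiv.org/abs/2508.06993
Authors: Nick Lemke,John Kalkhof,Niklas Babendererde,Anirban Mukhopadhyay
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: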
Abstract:Medical applications demand segmentation of large inputs, like prostate MRIs, pathology slices, or videos of surgery. These inputs should ideally be inferred at once to provide the model with proper spatial or temporal context. When segmenting large inputs, the VRAM consumption of the GPU becomes the bottleneck. Architectures like UNets or Vision Transformers scale very poorly in VRAM consumption, resulting in patch- or frame-wise approaches that compromise global consistency and inference speed. The lightweight Neural Cellular Automaton (NCA) is a bio-inspired model that is by construction size-invariant. However, due to its local-only communication rules, it lacks global knowledge. We propose OctreeNCA by generalizing the neighborhood definition using an octree data structure. Our generalized neighborhood definition enables the efficient traversal of global knowledge. Since deep learning frameworks are mainly developed for large multi-layer networks, their implementation does not fully leverage the advantages of NCAs. We implement an NCA inference function in CUDA that further reduces VRAM demands and increases inference speed. Our OctreeNCA segments high-resolution images and videos quickly while occupying 90% less VRAM than a UNet during evaluation. This allows us to segment 184 Megapixel pathology slices or 1-minute surgical videos at once.
[CV-184] TADoc: Robust Time-Aware Document Image Dewarping
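[Quick Read]: This paper addresses document image dewarping in real-world scenes, especially for documents with complex structures and severe deformation, where existing methods struggle to produce satisfactory results. The solution has two key parts: the dewarping task is modeled, for the first time, as a dynamic process with multiple intermediate states rather than a one-step transformation, and a lightweight framework, TADoc (Time-Aware Document Dewarping Network), is designed to capture the progressive nature of geometric distortion in real physical scenes. In addition, to compensate for the inadequacy of OCR metrics on documents with sparse text, a new metric, DLS (Document Layout Similarity), is proposed to evaluate dewarping quality in downstream tasks more comprehensively.
Link: https://arxiv.org/abs/2508.06988
Authors: Fangmin Zhao,Weichao Zeng,Zhenhang Li,Dongbao Yang,Yu Zhou
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; VCIP & TMCC & DISSec, College of Computer Science & College of Cryptology and Cyber Science, Nankai University; School of Cyber Security, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 8 figures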
Abstract:Flattening curved, wrinkled, and rotated document images captured by portable photographing devices, termed document image dewarping, has become an increasingly important task with the rise of digital economy and online working. Although many methods have been proposed recently, they often struggle to achieve satisfactory results when confronted with intricate document structures and higher degrees of deformation in real-world scenarios. Our main insight is that, unlike other document restoration tasks (e.g., deblurring), dewarping in real physical scenes is a progressive motion rather than a one-step transformation. Based on this, we have undertaken two key initiatives. Firstly, we reformulate this task, modeling it for the first time as a dynamic process that encompasses a series of intermediate states. Secondly, we design a lightweight framework called TADoc (Time-Aware Document Dewarping Network) to address the geometric distortion of document images. In addition, due to the inadequacy of OCR metrics for document images containing sparse text, the comprehensiveness of evaluation is insufficient. To address this shortcoming, we propose a new metric – DLS (Document Layout Similarity) – to evaluate the effectiveness of document dewarping in downstream tasks. Extensive experiments and in-depth evaluations have been conducted and the results indicate that our model possesses strong robustness, achieving superiority on several benchmarks with different document types and degrees of distortion.
[CV-185] WeatherDiffusion: Weather-Guided Diffusion Model for Forward and Inverse Rendering
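[Quick Read]: This paper addresses the difficulty of performing forward and inverse rendering accurately in autonomous driving (AD) scenes under complex weather and illumination. Existing methods based on large diffusion models can obtain reasonable results from 2D priors but are hard to control and lack robustness. The key to the solution is the WeatherDiffusion framework, which introduces Intrinsic map-aware attention (MAA) so that each intrinsic map (such as material, geometry, and lighting) attends to the image regions relevant to it, enabling high-quality inverse rendering. The method also supports controllable weather and illumination editing guided by text descriptions, and introduces a synthetic dataset, WeatherSynthetic, and a real-world dataset, WeatherReal, for training and validation. It improves the robustness of downstream tasks such as object detection and image segmentation in adverse weather.
Link: https://arxiv.org/abs/2508.06982
Authors: Yixin Zhu,Zuoliang Zhu,Miloš Hašan,Jian Yang,Jin Xie,Beibei Wang
Affiliations: Nanjing University; Nankai University; Adobe Research; NVIDIA Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: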
Abstract:Forward and inverse rendering have emerged as key techniques for enabling understanding and reconstruction in the context of autonomous driving (AD). However, complex weather and illumination pose great challenges to this task. The emergence of large diffusion models has shown promise in achieving reasonable results through learning from 2D priors, but these models are difficult to control and lack robustness. In this paper, we introduce WeatherDiffusion, a diffusion-based framework for forward and inverse rendering on AD scenes with various weather and lighting conditions. Our method enables authentic estimation of material properties, scene geometry, and lighting, and further supports controllable weather and illumination editing through the use of predicted intrinsic maps guided by text descriptions. We observe that different intrinsic maps should correspond to different regions of the original image. Based on this observation, we propose Intrinsic map-aware attention (MAA) to enable high-quality inverse rendering. Additionally, we introduce a synthetic dataset (i.e., WeatherSynthetic) and a real-world dataset (i.e., WeatherReal) for forward and inverse rendering on AD scenes with diverse weather and lighting. Extensive experiments show that our WeatherDiffusion outperforms state-of-the-art methods on several benchmarks. Moreover, our method demonstrates significant value in downstream tasks for AD, enhancing the robustness of object detection and image segmentation in challenging weather scenarios.
[CV-186] Evaluating Fisheye-Compatible 3D Gaussian Splatting Methods on Real Images Beyond 180 Degree Field of View
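[Quick Read]: This paper addresses the challenges of 3D Gaussian Splatting (3DGS) reconstruction from wide-angle (>180°) fisheye images, in particular the effect of extreme distortion on reconstruction quality and the difficulty of initialization from sparse inputs. The key contributions are: (1) an evaluation of two fisheye-compatible 3DGS methods, Fisheye-GS and 3DGUT, which finds that 3DGUT remains stable and maintains high perceptual quality at the full 200° field of view; and (2) a depth-based initialization strategy that replaces Structure from Motion (SfM) initialization, which often fails under strong distortion, by using UniK3D to predict dense point clouds from only 2-3 fisheye images per scene. This achieves reconstruction quality on par with SfM even in difficult scenes with fog, glare, or sky, and improves the practicality of 3D reconstruction from sparse, heavily distorted fisheye images.
Link: https://arxiv.org/abs/2508.06968
Authors: Ulas Gunes,Matias Turkulainen,Juho Kannala,Esa Rahtu
Affiliations: University of Oulu; University of Helsinki
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: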
Abstract:We present the first evaluation of fisheye-based 3D Gaussian Splatting methods, Fisheye-GS and 3DGUT, on real images with fields of view exceeding 180 degrees. Our study covers both indoor and outdoor scenes captured with 200-degree fisheye cameras and analyzes how each method handles extreme distortion in real-world settings. We evaluate performance under varying fields of view (200, 160, and 120 degrees) to study the tradeoff between peripheral distortion and spatial coverage. Fisheye-GS benefits from field of view (FoV) reduction, particularly at 160 degrees, while 3DGUT remains stable across all settings and maintains high perceptual quality at the full 200-degree view. To address the limitations of SfM-based initialization, which often fails under strong distortion, we also propose a depth-based strategy using UniK3D predictions from only 2-3 fisheye images per scene. Although UniK3D is not trained on real fisheye data, it produces dense point clouds that enable reconstruction quality on par with SfM, even in difficult scenes with fog, glare, or sky. Our results highlight the practical viability of fisheye-based 3DGS methods for wide-angle 3D reconstruction from sparse and distortion-heavy image inputs.
[CV-187] Adversarial Video Promotion Against Text-to-Video Retrieval
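[Quick Read]: This paper addresses the vulnerability of text-to-video retrieval (T2VR) systems to adversarial attacks, specifically the previously under-explored video promotion attack (ViPro), in which a video is perturbed so that its rank rises for multiple queries, potentially yielding improper gains in traffic or information spread. The key to the solution is Modal Refinement (MoRe), a mechanism that captures finer-grained, intricate interactions between the visual and textual modalities and thereby enhances black-box transferability. Experiments show that ViPro surpasses the baselines by over 30%, 10%, and 4% on average in the white-box, grey-box, and black-box settings respectively, exposing an overlooked vulnerability in T2VR systems and offering insights for future defenses.
Link: https://arxiv.org/abs/2508.06964
Authors: Qiwei Tian,Chenhao Lin,Zhengyu Zhao,Qian Li,Shuai Liu,Chao Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: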
Abstract:Thanks to the development of cross-modal models, text-to-video retrieval (T2VR) is advancing rapidly, but its robustness remains largely unexamined. Existing attacks against T2VR are designed to push videos away from queries, i.e., suppressing the ranks of videos, while the attacks that pull videos towards selected queries, i.e., promoting the ranks of videos, remain largely unexplored. These attacks can be more impactful as attackers may gain more views/clicks for financial benefits and widespread (mis)information. To this end, we pioneer the first attack against T2VR to promote videos adversarially, dubbed the Video Promotion attack (ViPro). We further propose Modal Refinement (MoRe) to capture the finer-grained, intricate interaction between visual and textual modalities to enhance black-box transferability. Comprehensive experiments cover 2 existing baselines, 3 leading T2VR models, 3 prevailing datasets with over 10k videos, evaluated under 3 scenarios. All experiments are conducted in a multi-target setting to reflect realistic scenarios where attackers seek to promote the video regarding multiple queries simultaneously. We also evaluated our attacks for defences and imperceptibility. Overall, ViPro surpasses other baselines by over 30/10/4% for white/grey/black-box settings on average. Our work highlights an overlooked vulnerability, provides a qualitative analysis on the upper/lower bound of our attacks, and offers insights into potential counterplays. Code will be publicly available at this https URL.
[CV-188] Beyond Frequency: Seeing Subtle Cues Through the Lens of Spatial Decomposition for Fine-Grained Visual Classification
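[Quick Read]: This paper addresses the difficulty of capturing subtle discriminative cues in fine-grained visual classification (FGVC), in particular the limitation that existing frequency-decomposition methods rely on fixed basis functions and cannot adapt to image content. The key to the solution is the Subtle-Cue Oriented Perception Engine (SCOPE), which dynamically fuses local details and global semantics in the spatial domain through two core modules: the Subtle Detail Extractor (SDE), which dynamically enhances subtle structures such as edges and textures from shallow features, and the Salient Semantic Refiner (SSR), which learns semantically coherent, structure-aware high-level features guided by the SDE-enhanced shallow features. The two modules are cascaded stage by stage to progressively combine low-level details with high-level semantics, overcoming the fixed-scale limitation of frequency-domain methods and improving the flexibility and discriminative power of multi-scale fusion.
Link: https://arxiv.org/abs/2508.06959
Authors: Qin Xu,Lili Zhu,Xiaoxia Cheng,Bo Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: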
Abstract:The crux of resolving fine-grained visual classification (FGVC) lies in capturing discriminative and class-specific cues that correspond to subtle visual characteristics. Recently, approaches based on frequency decomposition or transforms have attracted considerable interest owing to their apparent ability to mine discriminative cues. However, the frequency-domain methods are based on fixed basis functions, lacking adaptability to image content and unable to dynamically adjust feature extraction according to the discriminative requirements of different images. To address this, we propose a novel method for FGVC, named Subtle-Cue Oriented Perception Engine (SCOPE), which adaptively enhances the representational capability of low-level details and high-level semantics in the spatial domain, breaking through the limitations of fixed scales in the frequency domain and improving the flexibility of multi-scale fusion. The core of SCOPE lies in two modules: the Subtle Detail Extractor (SDE), which dynamically enhances subtle details such as edges and textures from shallow features, and the Salient Semantic Refiner (SSR), which learns semantically coherent and structure-aware refinement features from the high-level features guided by the enhanced shallow features. The SDE and SSR are cascaded stage-by-stage to progressively combine local details with global semantics. Extensive experiments demonstrate that our method achieves new state-of-the-art on four popular fine-grained image classification benchmarks.
[CV-189] SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work CVPR
[Quick Read]: This paper addresses the lack of standardized evaluation metrics in Sign Language Production (SLP), which hampers meaningful comparison across systems. In response, the authors organize the first Sign Language Production Challenge and evaluate Text-to-Pose (T2P) models, which translate spoken language sentences into sequences of skeleton poses, over a range of metrics using the RWTH-PHOENIX-Weather-2014T dataset (German Sign Language, DGS, weather broadcasts) and a custom hidden test set. The key to the solution is a standardized evaluation framework with high-quality skeleton-extraction keypoints and unified metrics (such as BLEU-1 and DTW-MJE). The winning approach used a retrieval-based framework and a pre-trained language model.
Link: https://arxiv.org/abs/2508.06951
Authors: Harry Walsh,Ed Fish,Ozge Mercanoglu Sincan,Mohamed Ilyes Lakhal,Richard Bowden,Neil Fox,Bencie Woll,Kepeng Wu,Zecheng Li,Weichao Zhao,Haodong Wang,Wengang Zhou,Houqiang Li,Shengeng Tang,Jiayi He,Xu Wang,Ruobei Zhang,Yaxiong Wang,Lechao Cheng,Meryem Tasyurek,Tugce Kiziltepe,Hacer Yalim Keles
Affiliations: University of Surrey; University of Birmingham; University College London; University of Science and Technology of China; Hefei University of Technology; Hacettepe University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: 11 pages, 6 Figures, CVPR conference
Abstract:Sign Language Production (SLP) is the task of generating sign language video from spoken language inputs. The field has seen a range of innovations over the last few years, with the introduction of deep learning-based approaches providing significant improvements in the realism and naturalness of generated outputs. However, the lack of standardized evaluation metrics for SLP approaches hampers meaningful comparisons across different systems. To address this, we introduce the first Sign Language Production Challenge, held as part of the third SLRTP Workshop at CVPR 2025. The competition's aims are to evaluate architectures that translate from spoken language sentences to a sequence of skeleton poses, known as Text-to-Pose (T2P) translation, over a range of metrics. For our evaluation data, we use the RWTH-PHOENIX-Weather-2014T dataset, a German Sign Language - Deutsche Gebärdensprache (DGS) weather broadcast dataset. In addition, we curate a custom hidden test set from a similar domain of discourse. This paper presents the challenge design and the winning methodologies. The challenge attracted 33 participants who submitted 231 solutions, with the top-performing team achieving BLEU-1 scores of 31.40 and DTW-MJE of 0.0574. The winning approach utilized a retrieval-based framework and a pre-trained language model. As part of the workshop, we release a standardized evaluation network, including high-quality skeleton extraction-based keypoints establishing a consistent baseline for the SLP field, which will enable future researchers to compare their work against a broader range of methods.
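The DTW-MJE metric is not defined in this listing. The Python sketch below assumes it denotes dynamic time warping (DTW) over two pose sequences, with mean joint error (average Euclidean distance between corresponding joints) as the per-frame cost. The function names and the normalization by the longer sequence length are assumptions, not the challenge's official implementation.

```python
import math


def mean_joint_error(pose_a, pose_b):
    """Average Euclidean distance between corresponding joints of two poses."""
    return sum(math.dist(a, b) for a, b in zip(pose_a, pose_b)) / len(pose_a)


def dtw_mje(seq_a, seq_b):
    """DTW alignment cost between two pose sequences, using MJE per frame pair.

    Each sequence is a list of poses; each pose is a list of joint coordinates.
    The total cost is divided by the longer sequence length (an assumption).
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_cost = mean_joint_error(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed alignment moves.
            cost[i][j] = frame_cost + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m] / max(n, m)


rest = [(0.0, 0.0), (1.0, 0.0)]
raised = [(0.0, 0.0), (1.0, 1.0)]
# The same motion at a slower pace aligns perfectly under DTW.
score = dtw_mje([rest, raised], [rest, rest, raised])
```

DTW lets a generated sequence be compared with a reference of different length or signing speed, which a frame-by-frame error cannot do.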
[CV-190] CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing
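[Quick Read]: This paper addresses the difficulty, in training-free text-driven regional image editing, of simultaneously achieving text adherence, context fidelity in unedited regions, and seamless integration of edits. Existing methods tend to trade these off against one another, producing semantic drift, background distortion, or visible seams. The key to the solution is the CannyEdit framework, with two core innovations: (1) Selective Canny Control, which masks the structural guidance of the Canny ControlNet in the specified editable regions while strictly preserving details in unedited regions through inversion-phase ControlNet information retention; and (2) Dual-Prompt Guidance, which combines local prompts for precise object-level edits with a global target prompt that maintains coherent interactions among objects in the scene, improving editing quality and naturalness.
Link: https://arxiv.org/abs/2508.06937
Authors: Weiyan Xie,Han Gao,Didan Deng,Kaican Li,April Hua Liu,Yongxiang Huang,Nevin L. Zhang
Affiliations: Huawei Hong Kong AI Framework & Data Technologies Lab; The Hong Kong University of Science and Technology; Shanghai University of Finance and Economics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this http URL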
Abstract:Recent advances in text-to-image (T2I) models have enabled training-free regional image editing by leveraging the generative priors of foundation models. However, existing methods struggle to balance text adherence in edited regions, context fidelity in unedited areas, and seamless integration of edits. We introduce CannyEdit, a novel training-free framework that addresses these challenges through two key innovations: (1) Selective Canny Control, which masks the structural guidance of Canny ControlNet in user-specified editable regions while strictly preserving details of the source images in unedited areas via inversion-phase ControlNet information retention. This enables precise, text-driven edits without compromising contextual integrity. (2) Dual-Prompt Guidance, which combines local prompts for object-specific edits with a global target prompt to maintain coherent scene interactions. On real-world image editing tasks (addition, replacement, removal), CannyEdit outperforms prior methods like KV-Edit, achieving a 2.93 to 10.49 percent improvement in the balance of text adherence and context fidelity. In terms of editing seamlessness, user studies reveal only 49.2 percent of general users and 42.0 percent of AIGC experts identified CannyEdit’s results as AI-edited when paired with real images without edits, versus 76.08 to 89.09 percent for competitor methods.
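The masking idea behind Selective Canny Control can be sketched in a few lines of Python. The sketch assumes, as a simplification, that structural guidance is an additive signal on top of base features and that an edit mask switches it off inside editable regions. The function name and the additive form are hypothetical and are not taken from the paper.

```python
def apply_selective_control(base, control, edit_mask):
    """Add structural guidance everywhere except inside the editable region.

    base, control: 2D lists of floats with the same shape.
    edit_mask: 2D list of 0/1 flags, where 1 marks a user-editable position.
    """
    return [
        [b + (0.0 if m else c) for b, c, m in zip(base_row, ctrl_row, mask_row)]
        for base_row, ctrl_row, mask_row in zip(base, control, edit_mask)
    ]


base = [[1.0, 1.0]]
control = [[0.5, 0.5]]
edit_mask = [[1, 0]]  # first position is editable, second must be preserved
guided = apply_selective_control(base, control, edit_mask)
```

Outside the mask the edge guidance keeps the source structure in place, while inside it the text prompt is free to change the content.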
[CV-191] AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning
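[Quick Read]: This paper addresses the room for improvement in the image quality and human-preference alignment of autoregressive (AR) image generation models, whose outputs still fall short in perceptual quality, realism, and semantic fidelity. The key to the solution is to introduce online reinforcement learning (RL) training, adapting the Group Relative Policy Optimization (GRPO) algorithm with reward functions that evaluate generated images along multiple dimensions, including perceptual quality, realism, and semantic fidelity. Experiments on class-conditional (class-to-image) and text-conditional (text-to-image) generation show significant improvements in image quality and human preference over standard AR baselines, establishing the viability of RL-based optimization for AR image generation.
Link: https://arxiv.org/abs/2508.06924
Authors: Shihao Yuan,Yahui Liu,Yang Yue,Jingyuan Zhang,Wangmeng Zuo,Qi Wang,Fuzheng Zhang,Guorui Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 15 figures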
Abstract:Inspired by the success of reinforcement learning (RL) in refining large language models (LLMs), we propose AR-GRPO, an approach to integrate online RL training into autoregressive (AR) image generation models. We adapt the Group Relative Policy Optimization (GRPO) algorithm to refine the vanilla autoregressive models’ outputs by carefully designed reward functions that evaluate generated images across multiple quality dimensions, including perceptual quality, realism, and semantic fidelity. We conduct comprehensive experiments on both class-conditional (i.e., class-to-image) and text-conditional (i.e., text-to-image) image generation tasks, demonstrating that our RL-enhanced framework significantly improves both the image quality and human preference of generated images compared to the standard AR baselines. Our results show consistent improvements across various evaluation metrics, establishing the viability of RL-based optimization for AR image generation and opening new avenues for controllable and high-quality image synthesis. The source codes and models are available at: this https URL.
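The group-relative step at the core of GRPO can be sketched in Python: several outputs are sampled for the same prompt, and each reward is normalized by the mean and standard deviation of its group. The function name and the `eps` constant are illustrative assumptions, and the paper's reward functions and policy update are not reproduced here.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group to zero mean and unit scale.

    rewards: scalar rewards for several outputs generated from the same prompt.
    eps: small constant that avoids division by zero when all rewards are equal.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Three images sampled for one prompt, scored by a hypothetical reward model.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Outputs that score above the group average get a positive advantage and are reinforced, which removes the need for a separately learned value function.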
[CV-192] Vibration-Based Energy Metric for Restoring Needle Alignment in Autonomous Robotic Ultrasound
[Quick Read]: This paper addresses inaccurate needle localization in robotic ultrasound-guided percutaneous needle insertion caused by misalignment between the ultrasound imaging plane and the needle insertion plane. Conventional methods rely on needle visibility in ultrasound images and cannot detect the needle reliably once it is fully out of the imaging plane. The key to the solution is a feature based on mechanical vibration: the needle is vibrated periodically, and a vibration-based energy metric is defined that remains effective even when the needle is fully out of plane. A control strategy built on this metric repositions the ultrasound probe to correct translational and rotational misalignment and restore needle alignment. Experiments on ex-vivo porcine tissue report a translational error of 0.41 ± 0.27 mm and a rotational error of 0.51 ± 0.19°.
Link: https://arxiv.org/abs/2508.06921
Authors: Zhongyu Chen,Chenyang Li,Xuesong Li,Dianye Huang,Zhongliang Jiang,Stefanie Speidel,Xiangyu Chu,K. W. Samuel Au
Affiliations: Multiscale Medical Robotics Centre; AIR@InnoHK; The Chinese University of Hong Kong; National Center for Tumor Diseases; German Cancer Research Center; TUD Dresden University of Technology; Faculty of Medicine and University Hospital Carl Gustav Carus; Helmholtz-Zentrum Dresden-Rossendorf; Chair for Computer Aided Medical Procedures and Augmented Reality; Technical University of Munich; Department of Mechanical and Automation Engineering; Munich Center for Machine Learning; Centre for Tactile Internet with Human-in-the-Loop
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Precise needle alignment is essential for percutaneous needle insertion in robotic ultrasound-guided procedures. However, inherent challenges such as speckle noise, needle-like artifacts, and low image resolution make robust needle detection difficult, particularly when visibility is reduced or lost. In this paper, we propose a method to restore needle alignment when the ultrasound imaging plane and the needle insertion plane are misaligned. Unlike many existing approaches that rely heavily on needle visibility in ultrasound images, our method uses a more robust feature by periodically vibrating the needle using a mechanical system. Specifically, we propose a vibration-based energy metric that remains effective even when the needle is fully out of plane. Using this metric, we develop a control strategy to reposition the ultrasound probe in response to misalignments between the imaging plane and the needle insertion plane in both translation and rotation. Experiments conducted on ex-vivo porcine tissue samples using a dual-arm robotic ultrasound-guided needle insertion system demonstrate the effectiveness of the proposed approach. The experimental results show a translational error of 0.41 ± 0.27 mm and a rotational error of 0.51 ± 0.19 degrees.
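The listing does not define the vibration-based energy metric. One plausible reading, sketched below in Python as an assumption only, is the spectral energy of a pixel's intensity over time at the known vibration frequency. The function name, the single-frequency Fourier sum, and the example frequencies are all hypothetical and may differ from the paper's formulation.

```python
import cmath
import math


def energy_at_frequency(signal, freq_hz, frame_rate_hz):
    """Energy of a time series at one frequency (single-bin Fourier sum).

    signal: intensity of one pixel or region across consecutive frames.
    freq_hz: frequency at which the needle is vibrated (assumed known).
    frame_rate_hz: ultrasound frame rate.
    """
    n = len(signal)
    mean = sum(signal) / n
    # Remove the static component so only temporal variation contributes.
    acc = sum(
        (x - mean) * cmath.exp(-2j * math.pi * freq_hz * t / frame_rate_hz)
        for t, x in enumerate(signal)
    )
    return abs(acc) ** 2 / n


# Hypothetical example: 5 Hz vibration observed at 50 frames per second.
frames = [math.sin(2 * math.pi * 5 * t / 50) for t in range(100)]
on_band = energy_at_frequency(frames, 5, 50)
off_band = energy_at_frequency(frames, 12, 50)
```

Under this reading, tissue moved by the needle responds at the driving frequency even when the needle itself is not visible, so the energy can guide the probe toward the needle plane.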
[CV-193] Talk2Image: A Multi-Agent System for Multi-Turn Image Generation and Editing
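[Quick Read]: This paper addresses the limitations of current text-to-image generation in multi-turn dialogue scenarios. Most existing methods focus on single-turn generation and struggle with iterative, multi-turn creative tasks, and the single-agent, sequential paradigm of recent dialogue-based systems often causes intention drift and incoherent edits. The key to the solution is a multi-agent system with three core components: (1) intention parsing from dialogue history, (2) task decomposition and collaborative execution across specialized agents, and (3) feedback-driven refinement based on a multi-view evaluation mechanism. This architecture improves controllability, coherence, and user satisfaction in multi-turn image generation and editing.
Link: https://arxiv.org/abs/2508.06916
Authors: Shichao Ma,Yunhe Guo,Jiahao Su,Qihe Huang,Zhengyang Zhou,Yang Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: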
Abstract:Text-to-image generation tasks have driven remarkable advances in diverse media applications, yet most focus on single-turn scenarios and struggle with iterative, multi-turn creative tasks. Recent dialogue-based systems attempt to bridge this gap, but their single-agent, sequential paradigm often causes intention drift and incoherent edits. To address these limitations, we present Talk2Image, a novel multi-agent system for interactive image generation and editing in multi-turn dialogue scenarios. Our approach integrates three key components: intention parsing from dialogue history, task decomposition and collaborative execution across specialized agents, and feedback-driven refinement based on a multi-view evaluation mechanism. Talk2Image enables step-by-step alignment with user intention and consistent image editing. Experiments demonstrate that Talk2Image outperforms existing baselines in controllability, coherence, and user satisfaction across iterative image generation and editing tasks.
[CV-194] MMReID-Bench: Unleashing the Power of MLLMs for Effective and Versatile Person Re-identification
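[Quick Read]: This paper addresses the poor generalization of traditional person re-identification (ReID) models to multi-modal data (such as RGB, thermal, infrared, and sketch images and textual descriptions). Existing methods use multi-modal large language models (MLLMs) only as feature extractors or caption generators and do not fully exploit their reasoning, instruction-following, and cross-modal understanding capabilities. The key to the solution is MMReID-Bench, the first multi-task multi-modal benchmark for person ReID, comprising 20,710 multi-modal queries and gallery images across 10 different tasks. It enables systematic evaluation of MLLMs in complex multi-modal scenarios and supports the development of more robust and generalizable multimodal foundation models.
Link: https://arxiv.org/abs/2508.06908
Authors: Jinhao Li,Zijian Chen,Lirong Deng,Changbo Wang,Guangtao Zhai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: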
Abstract:Person re-identification (ReID) aims to retrieve images of a person of interest from a gallery, with wide applications in medical rehabilitation, abnormal behavior detection, and public security. However, traditional person ReID models are limited to a single modality, leading to poor generalization on multi-modal data, such as RGB, thermal, infrared, sketch images, textual descriptions, etc. Recently, the emergence of multi-modal large language models (MLLMs) shows a promising avenue for addressing this problem. Despite this potential, existing methods merely regard MLLMs as feature extractors or caption generators, which do not fully unleash their reasoning, instruction-following, and cross-modal understanding capabilities. To bridge this gap, we introduce MMReID-Bench, the first multi-task multi-modal benchmark specifically designed for person ReID. The MMReID-Bench includes 20,710 multi-modal queries and gallery images covering 10 different person ReID tasks. Comprehensive experiments demonstrate the remarkable capabilities of MLLMs in delivering effective and versatile person ReID. Nevertheless, they also have limitations in handling a few modalities, particularly thermal and infrared data. We hope MMReID-Bench can facilitate the community to develop more robust and generalizable multimodal foundation models for person ReID.
[CV-195] MultiRef: Controllable Image Generation with Multiple Visual References ACM-MM2025
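Quick read: This paper addresses the limited controllability of current image generation models when conditioned on multiple visual references. Existing frameworks rely mainly on single-source inputs (a text prompt or one reference image) and struggle to combine content and aesthetic principles from several references the way human designers do. The key contribution is the MultiRef-bench evaluation benchmark and the accompanying MultiRef dataset. The benchmark contains 990 synthetic and 1,000 real-world samples that test how well models integrate multiple reference images. The dataset contains 38k high-quality images built with the authors' data engine RefBlend, covering 10 reference types and 33 reference combinations. Experiments show that even the best model, OmniGen, reaches only 66.6% on synthetic samples and 79.0% on real-world cases on average relative to the golden answer, which underlines the difficulty of the task and points to directions for more flexible, human-like creative tools.
Link: https://arxiv.org/abs/2508.06905
Authors: Ruoxi Chen,Dongping Chen,Siyuan Wu,Sinan Wang,Shiyun Lang,Petr Sushko,Gaoyang Jiang,Yao Wan,Ranjay Krishna
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM MM 2025 Datasets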
Abstract:Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs – either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are synthetically generated through our data engine RefBlend, with 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset MultiRef containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model OmniGen achieving only 66.6% in synthetic samples and 79.0% in real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: this https URL.
[CV-196] A Simple yet Powerful Instance-Aware Prompting Framework for Training-free Camouflaged Object Segmentation
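Quick read: This paper addresses the weak performance of current training-free camouflaged object segmentation (COS) methods on multi-instance scenes. Existing methods derive semantic-level visual prompts from a single task-generic text prompt (such as “camouflaged animal”), so the Segment Anything Model (SAM) outputs coarse semantic masks and cannot separate multiple discrete camouflaged objects. The key contribution is the Instance-Aware Prompting Framework (IAPF), described as the first training-free instance-aware pipeline, with three components: (1) a multimodal large language model (MLLM) generates image-specific foreground and background tags to make the text prompt more fine-grained; (2) Grounding DINO produces instance-level bounding box prompts, and a Single-Foreground Multi-Background Prompting strategy samples region-constrained point prompts inside each box to guide SAM toward candidate instance masks; (3) self-consistency voting selects the candidate mask most consistent across the candidates as the final prediction.
Link: https://arxiv.org/abs/2508.06904
Authors: Chao Yin,Jide Li,Xiaoqiang Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: under review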
Abstract:Camouflaged Object Segmentation (COS) remains highly challenging due to the intrinsic visual similarity between target objects and their surroundings. While training-based COS methods achieve good performance, their performance degrades rapidly with increased annotation sparsity. To circumvent this limitation, recent studies have explored training-free COS methods, leveraging the Segment Anything Model (SAM) by automatically generating visual prompts from a single task-generic prompt (e.g., “camouflaged animal”) uniformly applied across all test images. However, these methods typically produce only semantic-level visual prompts, causing SAM to output coarse semantic masks and thus failing to handle scenarios with multiple discrete camouflaged instances effectively. To address this critical limitation, we propose a simple yet powerful Instance-Aware Prompting Framework (IAPF), the first training-free COS pipeline that explicitly converts a task-generic prompt into fine-grained instance masks. Specifically, the IAPF comprises three steps: (1) Text Prompt Generator, utilizing task-generic queries to prompt a Multimodal Large Language Model (MLLM) for generating image-specific foreground and background tags; (2) Instance Mask Generator, leveraging Grounding DINO to produce precise instance-level bounding box prompts, alongside the proposed Single-Foreground Multi-Background Prompting strategy to sample region-constrained point prompts within each box, enabling SAM to yield a candidate instance mask; (3) Self-consistency Instance Mask Voting, which selects the final COS prediction by identifying the candidate mask most consistent across multiple candidate instance masks. Extensive evaluations on standard COS benchmarks demonstrate that the proposed IAPF significantly surpasses existing state-of-the-art training-free COS methods.
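The third IAPF step, self-consistency voting over candidate masks, can be illustrated with a small sketch. The abstract does not state which consistency measure is used, so this example assumes mean pairwise IoU. The function names `iou` and `vote` are hypothetical and are not taken from the paper's code.

```python
from typing import List

Mask = List[List[int]]  # binary mask stored as rows of 0/1 values


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Two empty masks are treated as identical.
    return inter / union if union else 1.0


def vote(candidates: List[Mask]) -> int:
    """Return the index of the candidate with the highest mean IoU to the others.

    Assumption: "most consistent" is read as highest mean pairwise IoU.
    """
    if not candidates:
        raise ValueError("at least one candidate mask is required")
    if len(candidates) == 1:
        return 0
    best_idx, best_score = 0, -1.0
    for i, mask in enumerate(candidates):
        others = [iou(mask, m) for j, m in enumerate(candidates) if j != i]
        score = sum(others) / len(others)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx
```

Ties are resolved in favor of the earliest candidate, which is an arbitrary choice made for this sketch.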
[CV-197] eMotions: A Large-Scale Dataset and Audio-Visual Fusion Network for Emotion Analysis in Short-form Videos
Quick read: This paper addresses video emotion analysis (VEA) in short-form videos (SVs), in particular the semantic gaps caused by content diversity and the local biases and collective information gaps caused by inconsistent audio-visual co-expression. The key contribution is AV-CANet, an end-to-end audio-visual fusion network. It uses a video transformer to capture semantically relevant representations, a Local-Global Fusion Module to progressively model correlations between audio and visual features, and an EP-CE Loss with tripolar penalties to steer optimization globally. The method is validated on the large-scale annotated eMotions dataset and on public VEA datasets.
Link: https://arxiv.org/abs/2508.06902
Authors: Xuecheng Wu,Dingkang Yang,Danlei Huang,Xinyi Yin,Yifan Wang,Jia Zhang,Jiayu Nie,Liangyu Fu,Yang Liu,Junxiao Xue,Hadi Amirpour,Wei Zhou
Affiliations: Xi’an Jiaotong University (西安交通大学); Fudan University (复旦大学); ByteDance (字节跳动); Zhengzhou University (郑州大学); University of Science and Technology of China (中国科学技术大学); Inspur Electronic Information Industry Co., Ltd (浪潮电子信息产业股份有限公司); Northwestern Polytechnical University (西北工业大学); The University of Toronto (多伦多大学); Zhejiang Lab (浙江实验室); University of Klagenfurt (克拉根福大学); Cardiff University (卡迪夫大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Short-form videos (SVs) have become a vital part of our online routine for acquiring and sharing information. Their multimodal complexity poses new challenges for video analysis, highlighting the need for video emotion analysis (VEA) within the community. Given the limited availability of SVs emotion data, we introduce eMotions, a large-scale dataset consisting of 27,996 videos with full-scale annotations. To ensure quality and reduce subjective bias, we emphasize better personnel allocation and propose a multi-stage annotation procedure. Additionally, we provide the category-balanced and test-oriented variants through targeted sampling to meet diverse needs. While there have been significant studies on videos with clear emotional cues (e.g., facial expressions), analyzing emotions in SVs remains a challenging task. The challenge arises from the broader content diversity, which introduces more distinct semantic gaps and complicates the representations learning of emotion-related features. Furthermore, the prevalence of audio-visual co-expressions in SVs leads to the local biases and collective information gaps caused by the inconsistencies in emotional expressions. To tackle this, we propose AV-CANet, an end-to-end audio-visual fusion network that leverages video transformer to capture semantically relevant representations. We further introduce the Local-Global Fusion Module designed to progressively capture the correlations of audio-visual features. Besides, EP-CE Loss is constructed to globally steer optimizations with tripolar penalties. Extensive experiments across three eMotions-related datasets and four public VEA datasets demonstrate the effectiveness of our proposed AV-CANet, while providing broad insights for future research. Moreover, we conduct ablation studies to examine the critical components of our method. Dataset and code will be made available at Github.
[CV-198] Advancements in Chinese font generation since deep learning era: A survey
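Quick read: This survey addresses the problem of improving the quality of generated Chinese character images, that is, how to produce high-quality, stylistically consistent Chinese fonts from few or many reference samples. Its main contribution is a systematic review and taxonomy of deep-learning-based Chinese font generation. It first lays out the task background and fundamentals (classical deep learning architectures, font representation formats, public datasets, and evaluation metrics), then divides existing methods by the number of reference samples required into many-shot and few-shot font generation, and analyzes the strengths and limitations of representative methods in each group.
Link: https://arxiv.org/abs/2508.06900
Authors: Weiran Chen,Guiqian Zhu,Ying Li,Yi Ji,Chunping Liu
Affiliations: Soochow University (苏州大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 42 Pages, 25 figures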
Abstract:Chinese font generation aims to create a new Chinese font library based on some reference samples. It is a topic of great concern to many font designers and typographers. Over the past years, with the rapid development of deep learning algorithms, various new techniques have achieved flourishing and thriving progress. Nevertheless, how to improve the overall quality of generated Chinese character images remains a tough issue. In this paper, we conduct a holistic survey of the recent Chinese font generation approaches based on deep learning. To be specific, we first illustrate the research background of the task. Then, we outline our literature selection and analysis methodology, and review a series of related fundamentals, including classical deep learning architectures, font representation formats, public datasets, and frequently-used evaluation metrics. After that, relying on the number of reference samples required to generate a new font, we categorize the existing methods into two major groups: many-shot font generation and few-shot font generation methods. Within each category, representative approaches are summarized, and their strengths and limitations are also discussed in detail. Finally, we conclude our paper with the challenges and future directions, with the expectation to provide some valuable illuminations for the researchers in this field.
[CV-199] BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models ICCV2025
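Quick read: This paper addresses insufficient alignment of visual embeddings in mainstream multimodal large language models (MLLMs). Current approaches optimize the vision projector only indirectly, through auto-regressive supervision on textual outputs, and omit direct visual supervision, which makes fine-grained alignment between visual embeddings and text semantics hard to achieve. The key contribution is BASIC, which uses the refined visual embeddings from the LLM's shallow layers as supervision to directly guide the projector's initial visual embeddings. It does so in two ways: reducing the angle between the initial and supervisory embeddings in semantic space to align their directions, and reducing the gap between their logit distributions to improve semantic matching. The method requires no additional annotations or supervisory models and improves MLLM performance across multiple benchmarks.
Link: https://arxiv.org/abs/2508.06895
Authors: Jianting Tang,Yubo Wang,Haoyu Cao,Linli Xu
Affiliations: University of Science and Technology of China (中国科学技术大学); State Key Laboratory of Cognitive Intelligence (认知智能国家重点实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICCV 2025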
Abstract:Mainstream Multimodal Large Language Models (MLLMs) achieve visual understanding by using a vision projector to bridge well-pretrained vision encoders and large language models (LLMs). The inherent gap between visual and textual modalities makes the embeddings from the vision projector critical for visual comprehension. However, current alignment approaches treat visual embeddings as contextual cues and merely apply auto-regressive supervision to textual outputs, neglecting the necessity of introducing equivalent direct visual supervision, which hinders the potential finer alignment of visual embeddings. In this paper, based on our analysis of the refinement process of visual embeddings in the LLM’s shallow layers, we propose BASIC, a method that utilizes refined visual embeddings within the LLM as supervision to directly guide the projector in generating initial visual embeddings. Specifically, the guidance is conducted from two perspectives: (i) optimizing embedding directions by reducing angles between initial and supervisory embeddings in semantic space; (ii) improving semantic matching by minimizing disparities between the logit distributions of both visual embeddings. Without additional supervisory models or artificial annotations, BASIC significantly improves the performance of MLLMs across a wide range of benchmarks, demonstrating the effectiveness of our introduced direct visual supervision.
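The two supervision signals described in the BASIC abstract (a smaller angle between embeddings, and closer logit distributions) can be sketched as a single loss. This is an illustration under stated assumptions, not the paper's formulation: the angle term is written as one minus cosine similarity, the distribution gap as a KL divergence, and the logits as dot products with a toy vocabulary table `vocab`. All function names and the `weight` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_angle_loss(init: Sequence[float], sup: Sequence[float]) -> float:
    """1 - cos(angle) between the initial and supervisory embeddings."""
    dot = sum(a * b for a, b in zip(init, sup))
    norm = math.sqrt(sum(a * a for a in init)) * math.sqrt(sum(b * b for b in sup))
    return 1.0 - dot / norm


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def vocab_logits(embedding: Sequence[float], vocab: Sequence[Sequence[float]]) -> List[float]:
    """Assumed logit definition: dot product with each vocabulary embedding."""
    return [sum(e * v for e, v in zip(embedding, row)) for row in vocab]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def alignment_loss(init, sup, vocab, weight: float = 1.0) -> float:
    """Direction term plus weighted distribution term (both are assumptions)."""
    direction = cosine_angle_loss(init, sup)
    p_sup = softmax(vocab_logits(sup, vocab))
    p_init = softmax(vocab_logits(init, vocab))
    return direction + weight * kl_divergence(p_sup, p_init)
```

In a real training setup the supervisory embedding would normally be treated as a constant (no gradient), and the loss would be computed with an autodiff framework rather than plain Python lists.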
[CV-200] Fusion-Based Brain Tumor Classification Using Deep Learning and Explainable AI and Rule-Based Reasoning
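Quick read: This paper addresses the limited accuracy and interpretability of models for automatic brain tumor classification from magnetic resonance imaging (MRI). The key contribution is an ensemble deep learning framework that combines MobileNetV2 and DenseNet121 convolutional neural networks (CNNs) through soft voting, adds an Explainable AI (XAI) module that uses Grad-CAM++ to produce class-specific saliency maps, and overlays a symbolic Clinical Decision Rule Overlay (CDRO) that maps predictions to established radiological heuristics. The ensemble reaches 91.7% accuracy. Dice coefficients up to 0.88 and IoU scores up to 0.78 indicate agreement between attention regions and expert annotations, and radiologists rated explanation usefulness and heatmap-region correspondence highly (mean Likert scores of 4.4 and 4.0).
Link: https://arxiv.org/abs/2508.06891
Authors: Melika Filvantorkaman,Mohsen Piri,Maral Filvan Torkaman,Ashkan Zabihi,Hamidreza Moradi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 37 pages, 6 figures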
Abstract:Accurate and interpretable classification of brain tumors from magnetic resonance imaging (MRI) is critical for effective diagnosis and treatment planning. This study presents an ensemble-based deep learning framework that combines MobileNetV2 and DenseNet121 convolutional neural networks (CNNs) using a soft voting strategy to classify three common brain tumor types: glioma, meningioma, and pituitary adenoma. The models were trained and evaluated on the Figshare dataset using a stratified 5-fold cross-validation protocol. To enhance transparency and clinical trust, the framework integrates an Explainable AI (XAI) module employing Grad-CAM++ for class-specific saliency visualization, alongside a symbolic Clinical Decision Rule Overlay (CDRO) that maps predictions to established radiological heuristics. The ensemble classifier achieved superior performance compared to individual CNNs, with an accuracy of 91.7%, precision of 91.9%, recall of 91.7%, and F1-score of 91.6%. Grad-CAM++ visualizations revealed strong spatial alignment between model attention and expert-annotated tumor regions, supported by Dice coefficients up to 0.88 and IoU scores up to 0.78. Clinical rule activation further validated model predictions in cases with distinct morphological features. A human-centered interpretability assessment involving five board-certified radiologists yielded high Likert-scale scores for both explanation usefulness (mean = 4.4) and heatmap-region correspondence (mean = 4.0), reinforcing the framework’s clinical relevance. Overall, the proposed approach offers a robust, interpretable, and generalizable solution for automated brain tumor classification, advancing the integration of deep learning into clinical neurodiagnostics.
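Soft voting and the two overlap metrics quoted in this abstract are standard techniques and can be sketched briefly. The example assumes unweighted averaging of per-class probabilities from the two networks, since the abstract does not say whether weights are used. The probability vectors in the usage note are made-up illustrative numbers, not results from the paper.

```python
from typing import List, Sequence, Tuple

CLASSES = ["glioma", "meningioma", "pituitary adenoma"]


def soft_vote(prob_sets: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities across models (unweighted soft voting)."""
    n_models = len(prob_sets)
    n_classes = len(prob_sets[0])
    return [sum(p[c] for p in prob_sets) / n_models for c in range(n_classes)]


def predict(prob_sets: Sequence[Sequence[float]]) -> str:
    """Return the class name with the highest averaged probability."""
    avg = soft_vote(prob_sets)
    return CLASSES[max(range(len(avg)), key=avg.__getitem__)]


def dice_and_iou(pred: Sequence[int], truth: Sequence[int]) -> Tuple[float, float]:
    """Dice coefficient and IoU for two flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    p_sum = sum(1 for p in pred if p)
    t_sum = sum(1 for t in truth if t)
    union = p_sum + t_sum - inter
    dice = 2 * inter / (p_sum + t_sum) if (p_sum + t_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou
```

For example, `predict([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]])` averages the two hypothetical model outputs and picks the class with the largest mean probability. In the paper the Dice and IoU scores compare thresholded Grad-CAM++ regions with expert annotations, and the thresholding step is not shown here.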
[CV-201] NS-FPN: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective
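Quick read: This paper addresses the high false-alarm rate in infrared small target detection and segmentation (IRSTDS), which arises because targets are dim and shapeless and background clutter is severe. CNN-based methods improve feature representation but do not suppress noise effectively, which increases false alarms. The key contribution, motivated by a frequency-domain analysis, is a noise-suppression feature pyramid network (NS-FPN) with two modules: a low-frequency guided feature purification (LFP) module that suppresses noise by purifying high-frequency components, and a spiral-aware feature sampling (SFS) module that uses spiral sampling to focus on target-relevant features during feature fusion.
Link: https://arxiv.org/abs/2508.06878
Authors: Maoxun Yuan,Duanni Meng,Ziteng Xi,Tianyi Zhao,Shiji Zhao,Yimian Dai,Xingxing Wei
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: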
Abstract:Infrared small target detection and segmentation (IRSTDS) is a critical yet challenging task in defense and civilian applications, owing to the dim, shapeless appearance of targets and severe background clutter. Recent CNN-based methods have achieved promising target perception results, but they only focus on enhancing feature representation to offset the impact of noise, which results in the increased false alarms problem. In this paper, through analyzing the problem from the frequency domain, we pioneer in improving performance from noise suppression perspective and propose a novel noise-suppression feature pyramid network (NS-FPN), which integrates a low-frequency guided feature purification (LFP) module and a spiral-aware feature sampling (SFS) module into the original FPN structure. The LFP module suppresses the noise features by purifying high-frequency components to achieve feature enhancement devoid of noise interference, while the SFS module further adopts spiral sampling to fuse target-relevant features in feature fusion process. Our NS-FPN is designed to be lightweight yet effective and can be easily plugged into existing IRSTDS frameworks. Extensive experiments on the public IRSTDS datasets demonstrate that our method significantly reduces false alarms and achieves superior performance on IRSTDS tasks.
[CV-202] LWT-ARTERY-LABEL: A Lightweight Framework for Automated Coronary Artery Identification
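Quick read: This paper addresses automated anatomical labelling of coronary arteries in computed tomography coronary angiography (CTCA) for the diagnosis of coronary artery disease (CAD). Traditional knowledge-based methods make little use of data-driven insights, while recent deep learning methods often demand substantial computational resources and overlook clinical knowledge. The key contribution is a lightweight method that combines anatomical prior knowledge with rule-based topology constraints to label the coronary tree accurately at a lower computational cost.
Link: https://arxiv.org/abs/2508.06874
Authors: Shisheng Zhang,Ramtin Gharleghi,Sonit Singh,Daniel Moses,Dona Adikari,Arcot Sowmya,Susann Beier
Affiliations: University of New South Wales (新南威尔士大学); Prince of Wales Hospital (王储医院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: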
Abstract:Coronary artery disease (CAD) remains the leading cause of death globally, with computed tomography coronary angiography (CTCA) serving as a key diagnostic tool. However, coronary arterial analysis using CTCA, such as identifying artery-specific features from computational modelling, is labour-intensive and time-consuming. Automated anatomical labelling of coronary arteries offers a potential solution, yet the inherent anatomical variability of coronary trees presents a significant challenge. Traditional knowledge-based labelling methods fall short in leveraging data-driven insights, while recent deep-learning approaches often demand substantial computational resources and overlook critical clinical knowledge. To address these limitations, we propose a lightweight method that integrates anatomical knowledge with rule-based topology constraints for effective coronary artery labelling. Our approach achieves state-of-the-art performance on benchmark datasets, providing a promising alternative for automated coronary artery labelling.
[CV-203] VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding
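Quick read: This paper addresses the computational intractability of long video understanding caused by the sheer data scale, and in particular the limited accuracy of existing keyframe retrieval methods, which suffer from weak alignment between text queries and visual content and fail to capture complex temporal semantics. The key contribution is Visual-Subtitle Integration (VSI), a multimodal keyframe search method that integrates subtitles, timestamps, and scene boundaries into a unified search process. It uses a dual-stream mechanism: a Video Search Stream captures visual information, a Subtitle Match Stream retrieves complementary textual information, and interaction between the two streams improves keyframe localization and downstream long-video question answering.
Link: https://arxiv.org/abs/2508.06869
Authors: Jianxiang He,Shaoguang Wang,Weiyu Guo,Meisheng Hong,Jungang Li,Yijie Xu,Ziyang Chen,Hui Xiong
Affiliations: Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); HKUST(GZ)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages,3 figures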
Abstract:Long video understanding presents a significant challenge to multimodal large language models (MLLMs) primarily due to the immense data scale. A critical and widely adopted strategy for making this task computationally tractable is keyframe retrieval, which seeks to identify a sparse set of video frames that are most salient to a given textual query. However, the efficacy of this approach is hindered by weak multimodal alignment between textual queries and visual content and fails to capture the complex temporal semantic information required for precise reasoning. To address this, we propose Visual-Subtitle Integration (VSI), a multimodal keyframe search method that integrates subtitles, timestamps, and scene boundaries into a unified multimodal search process. The proposed method captures the visual information of video frames as well as the complementary textual information through a dual-stream search mechanism by Video Search Stream as well as Subtitle Match Stream, respectively, and improves the keyframe search accuracy through the interaction of the two search streams. Experimental results show that VSI achieves 40.00% key frame localization accuracy on the text-relevant subset of LongVideoBench and 68.48% accuracy on downstream long Video-QA tasks, surpassing competitive baselines by 20.35% and 15.79%, respectively. Furthermore, on the LongVideoBench, VSI achieved state-of-the-art (SOTA) in medium-to-long video-QA tasks, demonstrating the robustness and generalizability of the proposed multimodal search strategy.
[CV-204] MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction
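Quick read: This paper addresses the subjectivity and operational burden of severe weather forecasting systems that depend on manual expert interpretation, along with three technical challenges: scarce severe weather samples, imperfect alignment between high-dimensional meteorological data and textual warnings, and the inability of existing multimodal language models to handle high-dimensional meteorological data and capture dependencies across temporal sequences, vertical pressure levels, and spatial dimensions. The key contributions are MP-Bench, the first large-scale temporal multimodal dataset for this task (421,363 pairs of raw multi-year meteorological data and text captions), and a meteorology multimodal large model (MMLM) that directly ingests 4D meteorological inputs and uses three plug-and-play adaptive fusion modules for dynamic feature extraction and integration across those three dimensions.
Link: https://arxiv.org/abs/2508.06859
Authors: Shuo Tang,Jian Xu,Jiadong Zhang,Yi Chen,Qizhao Jin,Lingdong Shen,Chenglin Liu,Shiming Xiang
Affiliations: 1. Shanghai Jiao Tong University (上海交通大学); 2. Institute of Artificial Intelligence, Shanghai Jiao Tong University (上海交通大学人工智能研究院); 3. Key Laboratory of Machine Perception (Ministry of Education), School of Electronics Engineering and Computer Science, Peking University (北京大学机器感知教育部重点实验室); 4. Alibaba Group (阿里巴巴集团)
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: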
Abstract:Timely and accurate severe weather warnings are critical for disaster mitigation. However, current forecasting systems remain heavily reliant on manual expert interpretation, introducing subjectivity and significant operational burdens. With the rapid development of AI technologies, the end-to-end “AI weather station” is gradually emerging as a new trend in predicting severe weather events. Three core challenges impede the development of end-to-end AI severe weather system: (1) scarcity of severe weather event samples; (2) imperfect alignment between high-dimensional meteorological data and textual warnings; (3) existing multimodal language models are unable to handle high-dimensional meteorological data and struggle to fully capture the complex dependencies across temporal sequences, vertical pressure levels, and spatial dimensions. To address these challenges, we introduce MP-Bench, the first large-scale temporal multimodal dataset for severe weather events prediction, comprising 421,363 pairs of raw multi-year meteorological data and corresponding text caption, covering a wide range of severe weather scenarios across China. On top of this dataset, we develop a meteorology multimodal large model (MMLM) that directly ingests 4D meteorological inputs. In addition, it is designed to accommodate the unique characteristics of 4D meteorological data flow, incorporating three plug-and-play adaptive fusion modules that enable dynamic feature extraction and integration across temporal sequences, vertical pressure layers, and spatial dimensions. Extensive experiments on MP-Bench demonstrate that MMLM performs exceptionally well across multiple tasks, highlighting its effectiveness in severe weather understanding and marking a key step toward realizing automated, AI-driven weather forecasting systems. Our source code and dataset will be made publicly available.
[CV-205] A Joint Sparse Self-Representation Learning Method for Multiview Clustering
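Quick read: This paper addresses how to combine consistent and complementary information across views in multiview clustering (MC), and targets the insufficient extraction of local information in existing subspace clustering methods that model local structure with Graph-Laplacian regularization. The key contribution is a joint sparse self-representation learning model that replaces the graph regularizer with cardinality (ℓ₀-norm) constraints, which directly limit the number of samples used for self-representation in each view and so extract view-specific local information explicitly. A low-rank constraint reveals globally coherent structure in the consensus affinity matrix. Because the model is nonconvex and nonsmooth, the authors also develop an alternating quadratic penalty (AQP) method with global convergence, in which two subproblems are solved iteratively in closed form, addressing the poor generalization that results when ALM-based algorithms fail to converge.
Link: https://arxiv.org/abs/2508.06857
Authors: Mengxue Jia,Zhihua Allen-Zhao,You Zhao,Sanyang Liu
Affiliations: Xidian University (西安电子科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Data Structures and Algorithms (cs.DS)
Comments: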
Abstract:Multiview clustering (MC) aims to group samples using consistent and complementary information across various views. The subspace clustering, as a fundamental technique of MC, has attracted significant attention. In this paper, we propose a novel joint sparse self-representation learning model for MC, where a featured difference is the extraction of view-specific local information by introducing cardinality (i.e., ℓ₀-norm) constraints instead of Graph-Laplacian regularization. Specifically, under each view, cardinality constraints directly restrict the samples used in the self-representation stage to extract reliable local and global structure information, while the low-rank constraint aids in revealing a global coherent structure in the consensus affinity matrix during merging. The attendant challenge is that Augmented Lagrange Method (ALM)-based alternating minimization algorithms cannot guarantee convergence when applied directly to our nonconvex, nonsmooth model, thus resulting in poor generalization ability. To address it, we develop an alternating quadratic penalty (AQP) method with global convergence, where two subproblems are iteratively solved by closed-form solutions. Empirical results on six standard datasets demonstrate the superiority of our model and AQP method, compared to eight state-of-the-art algorithms.
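A cardinality constraint of the kind mentioned in this abstract is commonly handled by hard thresholding: the Euclidean projection of a vector onto the set of vectors with at most k nonzero entries keeps the k largest-magnitude entries and zeroes the rest. The sketch below shows only that generic projection. Whether the paper's closed-form subproblem solutions take exactly this form is an assumption, and `project_cardinality` is a hypothetical name.

```python
from typing import List, Sequence


def project_cardinality(z: Sequence[float], k: int) -> List[float]:
    """Project z onto {x : ||x||_0 <= k} by keeping the k largest-magnitude entries."""
    if k <= 0:
        return [0.0] * len(z)
    # Indices sorted by absolute value, largest first; ties keep the earlier index.
    order = sorted(range(len(z)), key=lambda i: abs(z[i]), reverse=True)
    keep = set(order[:k])
    return [z[i] if i in keep else 0.0 for i in range(len(z))]
```

Applied to one column of a self-representation matrix, this would limit each sample to being expressed by at most k other samples, which is one way to read "restrict the samples used in the self-representation stage".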
[CV-206] AGIC: Attention-Guided Image Captioning to Improve Caption Relevance
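Quick read: This paper addresses the long-standing challenge of generating accurate and descriptive image captions. The key contribution is Attention-Guided Image Captioning (AGIC), which amplifies salient visual regions directly in the feature space to guide text generation, together with a hybrid decoding strategy that combines deterministic and probabilistic sampling to balance fluency and diversity. On Flickr8k and Flickr30k, AGIC matches or surpasses several state-of-the-art models with faster inference, and is presented as scalable and interpretable.
Link: https://arxiv.org/abs/2508.06853
Authors: L. D. M. S. Sai Teja,Ashok Urlana,Pruthwik Mishra
Affiliations: NIT Silchar; TCS Research, Hyderabad; SVNIT Surat, India
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 Figures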
Abstract:Despite significant progress in image captioning, generating accurate and descriptive captions remains a long-standing challenge. In this study, we propose Attention-Guided Image Captioning (AGIC), which amplifies salient visual regions directly in the feature space to guide caption generation. We further introduce a hybrid decoding strategy that combines deterministic and probabilistic sampling to balance fluency and diversity. To evaluate AGIC, we conduct extensive experiments on the Flickr8k and Flickr30k datasets. The results show that AGIC matches or surpasses several state-of-the-art models while achieving faster inference. Moreover, AGIC demonstrates strong performance across multiple evaluation metrics, offering a scalable and interpretable solution for image captioning.
[CV-207] Hybrid Machine Learning Framework for Predicting Geometric Deviations from 3D Surface Metrology
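Quick read: This paper addresses accurate prediction of geometric deviations in manufactured parts with complex geometries, a key obstacle to dimensional precision in modern manufacturing. The core of the solution is a framework that couples high-resolution 3D scanning with hybrid machine learning. Multi-angle 3D scans are aligned, denoised, and merged into high-fidelity 3D models, and a hybrid algorithm then combines convolutional neural networks (CNNs) for feature extraction with gradient-boosted decision trees (GBDTs) for predictive modeling. The method reaches a prediction accuracy of 0.012 mm at a 95% confidence level, a 73% improvement over conventional statistical process control (SPC) methods, and reveals hidden correlations between manufacturing parameters and geometric deviations, with potential use in automated quality control, predictive maintenance, and design optimization.
Link: https://arxiv.org/abs/2508.06845
Authors: Hamidreza Samadi,Md Manjurul Ahsan,Shivakumar Raman
Affiliations: University of Oklahoma (俄克拉荷马大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: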
Abstract:This study addresses the challenge of accurately forecasting geometric deviations in manufactured components using advanced 3D surface analysis. Despite progress in modern manufacturing, maintaining dimensional precision remains difficult, particularly for complex geometries. We present a methodology that employs a high-resolution 3D scanner to acquire multi-angle surface data from 237 components produced across different batches. The data were processed through precise alignment, noise reduction, and merging techniques to generate accurate 3D representations. A hybrid machine learning framework was developed, combining convolutional neural networks for feature extraction with gradient-boosted decision trees for predictive modeling. The proposed system achieved a prediction accuracy of 0.012 mm at a 95% confidence level, representing a 73% improvement over conventional statistical process control methods. In addition to improved accuracy, the model revealed hidden correlations between manufacturing parameters and geometric deviations. This approach offers significant potential for automated quality control, predictive maintenance, and design optimization in precision manufacturing, and the resulting dataset provides a strong foundation for future predictive modeling research.
[CV-208] Low-Rank Expert Merging for Multi-Source Domain Adaptation in Person Re-Identification
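Quick read: This paper addresses the trade-off between efficiency and accuracy in multi-source domain adaptation (MSDA) for cross-domain person re-identification (reID), that is, how to adapt a model efficiently and accurately without access to source-domain data. Existing methods train domain-specific backbones or depend on source data, which significantly increases parameter count and computational cost. The key contribution is the Source-free Adaptive Gated Experts (SAGE-reID) framework. It first trains one low-rank adapter (LoRA) per source domain through source-free unsupervised domain adaptation, then introduces a lightweight gating network that dynamically assigns merging weights to fuse the LoRA experts. The backbone parameters stay fixed and the LoRA experts amount to about 2% of the backbone, which reduces memory consumption and the risk of overfitting. The method outperforms state-of-the-art approaches on Market-1501, DukeMTMC-reID, and MSMT17.
Link: https://arxiv.org/abs/2508.06831
Authors: Taha Mustapha Nehdi,Nairouz Mrabah,Atif Belal,Marco Pedersoli,Eric Granger
Affiliations: LIVIA, ILLS, Dept. of Systems Engineering, ETS Montreal, Canada(加拿大蒙特利尔工程学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: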
Abstract:Adapting person re-identification (reID) models to new target environments remains a challenging problem that is typically addressed using unsupervised domain adaptation (UDA) methods. Recent works show that when labeled data originates from several distinct sources (e.g., datasets and cameras), considering each source separately and applying multi-source domain adaptation (MSDA) typically yields higher accuracy and robustness compared to blending the sources and performing conventional UDA. However, state-of-the-art MSDA methods learn domain-specific backbone models or require access to source domain data during adaptation, resulting in significant growth in training parameters and computational cost. In this paper, a Source-free Adaptive Gated Experts (SAGE-reID) method is introduced for person reID. Our SAGE-reID is a cost-effective, source-free MSDA method that first trains individual source-specific low-rank adapters (LoRA) through source-free UDA. Next, a lightweight gating network is introduced and trained to dynamically assign optimal merging weights for fusion of LoRA experts, enabling effective cross-domain knowledge transfer. While the number of backbone parameters remains constant across source domains, LoRA experts scale linearly but remain negligible in size (= 2% of the backbone), reducing both the memory consumption and risk of overfitting. Extensive experiments conducted on three challenging benchmarks: Market-1501, DukeMTMC-reID, and MSMT17 indicate that SAGE-reID outperforms state-of-the-art methods while being computationally efficient.
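The idea of fusing several LoRA experts with gate-assigned weights can be sketched for a single linear layer as W = W0 + Σ g_i · (B_i A_i), where each pair (B_i, A_i) is one source-specific low-rank update. This is a minimal illustration under assumptions: the gate is a softmax over fixed scores, whereas the paper's gating network is trained and may depend on the input. The names `merge_experts` and `gate_scores` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x r) and b (r x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax used as the (assumed) gate normalization."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_experts(w0: Matrix,
                  experts: Sequence[Tuple[Matrix, Matrix]],
                  gate_scores: Sequence[float]) -> Matrix:
    """Return W0 + sum_i g_i * (B_i @ A_i) with g = softmax(gate_scores).

    Each expert is a (B, A) pair: B is (out x rank), A is (rank x in).
    The frozen backbone weight w0 is copied, not modified in place.
    """
    gates = softmax(gate_scores)
    merged = [row[:] for row in w0]
    for g, (b, a) in zip(gates, experts):
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += g * value
    return merged
```

Because each expert stores only B and A, its size grows with the rank rather than with the full weight matrix, which matches the abstract's point that experts stay small relative to the backbone.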
[CV-209] VesselRW: Weakly Supervised Subcutaneous Vessel Segmentation via Learned Random Walk Propagation
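Quick read: This paper addresses the limited accuracy of subcutaneous vessel segmentation in clinical images, which stems from scarce and expensive annotations and from the low contrast and noisy appearance of vessels. The core solution is a weakly supervised training framework that uses inexpensive sparse annotations (such as centerline traces, dot markers, or short scribbles) and expands them into dense probabilistic supervision with a differentiable random walk label propagation model. The propagation incorporates image-driven vesselness cues and tubular continuity priors, and yields per-pixel hitting probabilities with calibrated uncertainty estimates, which feed an uncertainty-weighted loss that avoids overfitting to ambiguous regions. The label propagator is learned jointly with a CNN-based segmentation predictor, so vessel edges and continuity constraints are discovered without explicit edge supervision. A topology-aware regularizer further encourages centerline connectivity and penalizes spurious branches, reducing annotation burden while preserving clinically relevant vessel topology.
Link: https://arxiv.org/abs/2508.06819
Authors: Ayaan Nooruddin Siddiqui,Mahnoor Zaidi,Ayesha Nazneen Shahbaz,Priyadarshini Chatterjee,Krishnan Menon Iyer
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: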
Abstract:Accurate segmentation of subcutaneous vessels from clinical images is hampered by scarce, expensive ground truth and by low contrast, noisy appearance of vessels across patients and modalities. We present a novel weakly supervised training framework tailored for subcutaneous vessel segmentation that leverages inexpensive sparse annotations (e.g., centerline traces, dot markers, or short scribbles). Sparse labels are expanded into dense, probabilistic supervision via a differentiable random walk label propagation model whose transition weights incorporate image driven vesselness cues and tubular continuity priors. The propagation yields per-pixel hitting probabilities together with calibrated uncertainty estimates; these are incorporated into an uncertainty weighted loss to avoid over fitting to ambiguous regions. Crucially, the label-propagator is learned jointly with a CNN based segmentation predictor, enabling the system to discover vessel edges and continuity constraints without explicit edge supervision. We further introduce a topology aware regularizer that encourages centerline connectivity and penalizes spurious branches, improving clinical usability. In experiments on clinical subcutaneous imaging datasets, our method consistently outperforms naive training on sparse labels and conventional dense pseudo-labeling, producing more complete vascular maps and better calibrated uncertainty for downstream decision making. The approach substantially reduces annotation burden while preserving clinically relevant vessel topology.
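Random walk label propagation of the kind described here can be illustrated on a small weighted graph. For each unlabeled node, the probability that a walker reaches a vessel seed before a background seed equals the weighted average of its neighbors' probabilities, so repeated averaging converges to the hitting probabilities. The sketch below is a generic, non-differentiable version with fixed edge weights. In the paper the transition weights are learned from vesselness and continuity cues, which is not modeled here, and `hitting_probabilities` is a hypothetical name.

```python
from typing import Dict, Hashable

Graph = Dict[Hashable, Dict[Hashable, float]]  # node -> {neighbor: edge weight}


def hitting_probabilities(weights: Graph,
                          seeds: Dict[Hashable, float],
                          n_iter: int = 500) -> Dict[Hashable, float]:
    """Probability of reaching a seed labeled 1.0 before a seed labeled 0.0.

    weights must be symmetric, and every neighbor must also appear as a key.
    seeds maps sparsely annotated nodes to 1.0 (vessel) or 0.0 (background).
    """
    prob = {node: seeds.get(node, 0.5) for node in weights}
    for _ in range(n_iter):
        for node, nbrs in weights.items():
            if node in seeds or not nbrs:
                continue  # seeds stay fixed; isolated nodes keep their prior
            total = sum(nbrs.values())
            # Gauss-Seidel update: weighted mean of current neighbor values.
            prob[node] = sum(w * prob[n] for n, w in nbrs.items()) / total
    return prob
```

On a five-node chain with equal weights, a vessel seed at one end and a background seed at the other, the interior probabilities fall off linearly between the two seeds, which is a quick sanity check for the update rule.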
[CV-210] DualResolution Residual Architecture with Artifact Suppression for Melanocytic Lesion Segmentation MICCAI
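[Quick read]: This paper targets accurate segmentation of melanocytic tumors in dermoscopic images, a key step for automated skin cancer screening and clinical decision support. Dermoscopic images show subtle texture and color variations, frequent artifacts (hairs, rulers, bubbles), and demand precise boundary localization, so methods designed for natural-scene segmentation fall short of clinical needs. The core of the solution is a ResNet-inspired dual-resolution architecture with two tightly coupled branches: a full-resolution stream that preserves fine boundary information and a pooled (downsampled) stream that aggregates multi-scale context. Boundary-aware residual connections inject high-frequency edge information into deep feature maps, and a channel attention module adapts color and texture sensitivity. To handle imaging artifacts and limited dataset size, the authors add a lightweight artifact suppression block and a multi-task training objective (a Dice-Tversky loss, an explicit boundary loss, and a contrastive regularizer), producing pixel-accurate masks without heavy post-processing or complex pre-training.
Link: https://arxiv.org/abs/2508.06816
Authors: Vikram Singh,Kabir Malhotra,Rohan Desai,Ananya Shankaracharya,Priyadarshini Chatterjee,Krishnan Menon Iyer
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAIA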
Abstract:Accurate segmentation of melanocytic tumors in dermoscopic images is a critical step for automated skin cancer screening and clinical decision support. Unlike natural scene segmentation, lesion delineation must reconcile subtle texture and color variations, frequent artifacts (hairs, rulers, bubbles), and a strong need for precise boundary localization to support downstream diagnosis. In this paper we introduce a novel ResNet-inspired dual-resolution architecture specifically designed for melanocytic tumor segmentation. Our method maintains a full-resolution stream that preserves fine-grained boundary information while a complementary pooled stream aggregates multi-scale contextual cues for robust lesion recognition. The streams are tightly coupled by boundary-aware residual connections that inject high-frequency edge information into deep feature maps, and by a channel attention module that adapts color and texture sensitivity to dermoscopic appearance. To further address common imaging artifacts and the limited size of clinical datasets, we propose a lightweight artifact suppression block and a multi-task training objective that combines a Dice-Tversky segmentation loss with an explicit boundary loss and a contrastive regularizer for feature stability. The combined design yields pixel-accurate masks without requiring heavy post-processing or complex pre-training protocols. Extensive experiments on public dermoscopic benchmarks demonstrate that our method significantly improves boundary adherence and clinically relevant segmentation metrics compared to standard encoder-decoder baselines, making it a practical building block for automated melanoma assessment systems.
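The abstract does not spell out how its Dice-Tversky loss is composed, so the following is only a sketch of one plausible reading: an equal-weight mix of a Dice term and a Tversky term over flattened soft masks. The `mix`, `alpha`, and `beta` values and both function names are assumptions for illustration.

```python
def tversky_index(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky index over flattened masks; alpha weights false positives and
    beta weights false negatives. alpha = beta = 0.5 gives the Dice score."""
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def dice_tversky_loss(pred, target, alpha=0.3, beta=0.7, mix=0.5):
    """Assumed combination: a weighted sum of (1 - Dice) and (1 - Tversky)."""
    dice = tversky_index(pred, target, 0.5, 0.5)
    tversky = tversky_index(pred, target, alpha, beta)
    return mix * (1 - dice) + (1 - mix) * (1 - tversky)

# With beta > alpha, missing lesion pixels costs more than over-segmenting.
missed = dice_tversky_loss([1, 0, 0, 0], [1, 1, 0, 0])
extra = dice_tversky_loss([1, 1, 0, 0], [1, 0, 0, 0])
```

Choosing `beta` larger than `alpha` is a common way to bias a segmentation loss toward recall, which matters when under-segmenting a lesion is the costlier error.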
[CV-211] Edge Detection for Organ Boundaries via Top Down Refinement and SubPixel Upsampling MICCAI
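[Quick read]: This paper addresses insufficient localization accuracy of organ boundaries in medical imaging, especially in clinical settings that require millimeter-level accuracy (e.g., segmentation, registration, radiotherapy planning). Deep convolutional networks (ConvNets) approach human-level edge detection on natural images, but their outputs often lack precise spatial localization. The key to the solution is a crisp edge detector for medical images built on a top-down backward refinement architecture: a backward refinement pathway progressively upsamples and fuses high-level semantic features with low-level detail to produce high-resolution, well-localized organ boundaries. To handle anisotropic volumes (e.g., CT/MRI) efficiently, the method combines 2D slice-wise refinement with lightweight 3D context aggregation. Experiments show substantial gains over baseline ConvNet detectors and existing medical edge/contour methods under strict metrics (boundary F-measure, Hausdorff distance), with consistent improvements in downstream tasks such as organ segmentation, image registration, and lesion delineation.
Link: https://arxiv.org/abs/2508.06805
Authors: Aarav Mehta,Priya Deshmukh,Vikram Singh,Siddharth Malhotra,Krishnan Menon Iyer,Tanvi Iyer
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAIA Workshop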
Abstract:Accurate localization of organ boundaries is critical in medical imaging for segmentation, registration, surgical planning, and radiotherapy. While deep convolutional networks (ConvNets) have advanced general-purpose edge detection to near-human performance on natural images, their outputs often lack precise localization, a limitation that is particularly harmful in medical applications where millimeter-level accuracy is required. Building on a systematic analysis of ConvNet edge outputs, we propose a medically focused crisp edge detector that adapts a novel top-down backward refinement architecture to medical images (2D and volumetric). Our method progressively upsamples and fuses high-level semantic features with fine-grained low-level cues through a backward refinement pathway, producing high-resolution, well-localized organ boundaries. We further extend the design to handle anisotropic volumes by combining 2D slice-wise refinement with light 3D context aggregation to retain computational efficiency. Evaluations on several CT and MRI organ datasets demonstrate substantially improved boundary localization under strict criteria (boundary F-measure, Hausdorff distance) compared to baseline ConvNet detectors and contemporary medical edge/contour methods. Importantly, integrating our crisp edge maps into downstream pipelines yields consistent gains in organ segmentation (higher Dice scores, lower boundary errors), more accurate image registration, and improved delineation of lesions near organ interfaces. The proposed approach produces clinically valuable, crisp organ edges that materially enhance common medical-imaging tasks.
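The two boundary metrics named in the abstract can be sketched for small point sets as follows. This is a simplified illustration, not the paper's evaluation code: the tolerance-based F-measure below counts nearest-neighbor hits rather than a one-to-one matching, and the function names and the `tol` parameter are assumptions.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two boundary point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

def boundary_f_measure(pred, truth, tol=1.0):
    """Simplified boundary F-measure: a point counts as matched when some
    point of the other set lies within tol of it."""
    hit = lambda p, others: any(math.dist(p, q) <= tol for q in others)
    precision = sum(hit(p, truth) for p in pred) / len(pred)
    recall = sum(hit(t, pred) for t in truth) / len(truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

truth = [(0, 0), (1, 0), (2, 0)]
pred = [(0, 1), (1, 1), (2, 3)]
```

The Hausdorff distance is driven by the single worst outlier (here the predicted point at `(2, 3)`), while the F-measure summarizes how many boundary points fall within the tolerance, which is why the two are usually reported together.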
[CV-212] DiffUS: Differentiable Ultrasound Rendering from Volumetric Imaging MICCAI
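[Quick read]: This paper addresses the difficulty of aligning intraoperative ultrasound with high-resolution preoperative MRI/CT: ultrasound images are corrupted by noise and artifacts and are poorly aligned with preoperative scans. The key to the solution is DiffUS, a physics-based differentiable ultrasound renderer. It converts 3D MRI data into an acoustic impedance volume, simulates beam propagation by ray tracing with coupled reflection-transmission equations, models multiple internal reflections as a sparse linear system, and finally extracts depth-resolved echoes over a fan-shaped acquisition geometry to produce B-mode images with realistic artifacts such as speckle noise and depth-dependent attenuation. The whole framework is implemented as differentiable tensor operations in PyTorch, which supports gradient-based optimization for downstream tasks such as slice-to-volume registration and volumetric reconstruction.
Link: https://arxiv.org/abs/2508.06768
Authors: Noe Bertramo,Gabriel Duguey,Vivek Gopalakrishnan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 10 pages, accepted to MICCAI ASMUS 25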
Abstract:Intraoperative ultrasound imaging provides real-time guidance during numerous surgical procedures, but its interpretation is complicated by noise, artifacts, and poor alignment with high-resolution preoperative MRI/CT scans. To bridge the gap between preoperative planning and intraoperative guidance, we present DiffUS, a physics-based, differentiable ultrasound renderer that synthesizes realistic B-mode images from volumetric imaging. DiffUS first converts MRI 3D scans into acoustic impedance volumes using a machine learning approach. Next, we simulate ultrasound beam propagation using ray tracing with coupled reflection-transmission equations. DiffUS formulates wave propagation as a sparse linear system that captures multiple internal reflections. Finally, we reconstruct B-mode images via depth-resolved echo extraction across fan-shaped acquisition geometry, incorporating realistic artifacts including speckle noise and depth-dependent degradation. DiffUS is entirely implemented as differentiable tensor operations in PyTorch, enabling gradient-based optimization for downstream applications such as slice-to-volume registration and volumetric reconstruction. Evaluation on the ReMIND dataset demonstrates DiffUS’s ability to generate anatomically accurate ultrasound images from brain MRI data.
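As a rough illustration of how echoes arise from an acoustic impedance profile, the sketch below applies the standard normal-incidence intensity reflection coefficient, R = ((Z2 - Z1) / (Z2 + Z1))^2, along a single ray. It is deliberately simpler than what the abstract describes: it keeps only first-order echoes and ignores attenuation, speckle, and the multiple internal reflections that DiffUS captures with a sparse linear system. The function names and the impedance values are assumptions.

```python
def reflection_coefficient(z1, z2):
    """Intensity reflection coefficient at a flat interface, normal incidence."""
    return ((z2 - z1) / (z2 + z1)) ** 2

def first_order_echoes(impedance):
    """Echo intensity returned from each interface along one ray, counting a
    single reflection per echo (no reverberation, no attenuation)."""
    echoes = []
    transmitted = 1.0  # fraction of the emitted intensity reaching this interface
    for z1, z2 in zip(impedance, impedance[1:]):
        r = reflection_coefficient(z1, z2)
        # Forward path, one reflection, then the same path back to the probe.
        echoes.append(transmitted * r * transmitted)
        transmitted *= 1.0 - r  # the remainder continues deeper
    return echoes

# Hypothetical impedance samples along a ray: a single tissue boundary.
echoes = first_order_echoes([1.5, 1.5, 3.0, 3.0])
```

Only the sample pair that straddles the impedance jump produces an echo; homogeneous regions return nothing in this simplified model.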
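[CV-213] SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding
[Quick read]: This paper addresses a common limitation of current Multimodal Large Language Models (MLLMs) in traffic accident understanding: they rely mainly on coarse image-level or video-level comprehension and struggle with fine-grained visual details or localized scene components, which limits their use in complex accident scenarios. The key to the solution is the SafePLUG framework, which adds Pixel-Level Understanding and Temporal Grounding so that MLLMs can answer region-aware questions from arbitrary-shaped visual prompts, produce pixel-level segmentation from language instructions, and localize temporally anchored events in accident scenes. This design improves fine-grained parsing of complex traffic scenes and lays a foundation for better driving safety and situational awareness in intelligent transportation systems.
Link: https://arxiv.org/abs/2508.06763
Authors: Zihao Sheng,Zilin Huang,Yen-Jung Chen,Yansong Qu,Yuhao Luo,Yue Leng,Sikai Chen
Affiliations: University of Wisconsin–Madison; Purdue University; Google
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The code, dataset, and model checkpoints will be made publicly available at: this https URL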
Abstract:Multimodal large language models (MLLMs) have achieved remarkable progress across a range of vision-language tasks and demonstrate strong potential for traffic accident understanding. However, existing MLLMs in this domain primarily focus on coarse-grained image-level or video-level comprehension and often struggle to handle fine-grained visual details or localized scene components, limiting their applicability in complex accident scenarios. To address these limitations, we propose SafePLUG, a novel framework that empowers MLLMs with both Pixel-Level Understanding and temporal Grounding for comprehensive traffic accident analysis. SafePLUG supports both arbitrary-shaped visual prompts for region-aware question answering and pixel-level segmentation based on language instructions, while also enabling the recognition of temporally anchored events in traffic accident scenarios. To advance the development of MLLMs for traffic accident understanding, we curate a new dataset containing multimodal question-answer pairs centered on diverse accident scenarios, with detailed pixel-level annotations and temporal event boundaries. Experimental results show that SafePLUG achieves strong performance on multiple tasks, including region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding. These capabilities lay a foundation for fine-grained understanding of complex traffic scenes, with the potential to improve driving safety and enhance situational awareness in smart transportation systems. The code, dataset, and model checkpoints will be made publicly available at: this https URL
[CV-214] VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions
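[Quick read]: This paper addresses the performance drop of current Human Pose and Shape (HPS) estimation methods under complex poses or significant occlusion. Existing datasets largely lack challenging real-world occlusions and instead use random patches or clipart-style overlays, so they cannot properly assess robustness in practice. The key to the solution is a new benchmark dataset, VOccl3D, built with advanced computer graphics rendering; it contains diverse realistic occlusion scenarios, clothing textures, and human motions, with 3D body pose and shape annotations. By fine-tuning recent HPS methods (CLIFF and BEDLAM-CLIFF) on this dataset, the authors report significant improvements on multiple public datasets and on their own test split. They also use the dataset to fine-tune the YOLO11 object detector for better human detection under occlusion, yielding a robust end-to-end HPS estimation system and a more realistic evaluation platform for future research.
Link: https://arxiv.org/abs/2508.06757
Authors: Yash Garg,Saketh Bachu,Arindam Dutta,Rohit Lal,Sarosij Bose,Calvin-Khang Ta,M. Salman Asif,Amit Roy-Chowdhury
Affiliations: University of California, Riverside
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: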
Abstract:Human pose and shape (HPS) estimation methods have been extensively studied, with many demonstrating high zero-shot performance on in-the-wild images and videos. However, these methods often struggle in challenging scenarios involving complex human poses or significant occlusions. Although some studies address 3D human pose estimation under occlusion, they typically evaluate performance on datasets that lack realistic or substantial occlusions, e.g., most existing datasets introduce occlusions with random patches over the human or clipart-style overlays, which may not reflect real-world challenges. To bridge this gap in realistic occlusion datasets, we introduce a novel benchmark dataset, VOccl3D, a Video-based human Occlusion dataset with 3D body pose and shape annotations. Inspired by works such as AGORA and BEDLAM, we constructed this dataset using advanced computer graphics rendering techniques, incorporating diverse real-world occlusion scenarios, clothing textures, and human motions. Additionally, we fine-tuned recent HPS methods, CLIFF and BEDLAM-CLIFF, on our dataset, demonstrating significant qualitative and quantitative improvements across multiple public datasets, as well as on the test split of our dataset, while comparing its performance with other state-of-the-art methods. Furthermore, we leveraged our dataset to enhance human detection performance under occlusion by fine-tuning an existing object detector, YOLO11, thus leading to a robust end-to-end HPS estimation system under occlusions. Overall, this dataset serves as a valuable resource for future research aimed at benchmarking methods designed to handle occlusions, offering a more realistic alternative to existing occlusion datasets. See the Project page for code and dataset:this https URL
[CV-215] FoundBioNet: A Foundation-Based Model for IDH Genotyping of Glioma from Multi-Parametric MRI MICCAI2025
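[Quick read]: This paper addresses accurate, noninvasive detection of isocitrate dehydrogenase (IDH) mutation status in glioma. Traditional methods depend on tissue biopsy, which may miss a tumor's spatial heterogeneity, and deep learning models are limited by scarce annotated data. The key to the solution is a Foundation-based Biomarker Network (FoundBioNet) with two core modules: (1) Tumor-Aware Feature Encoding (TAFE), which extracts multi-scale, tumor-focused features, and (2) Cross-Modality Differential (CMD), which highlights subtle T2-FLAIR mismatch signals associated with IDH mutation. Combining large-scale pretraining with task-specific fine-tuning, the method achieves high AUCs (up to 90.58%) on a multi-center cohort of 1705 patients, outperforming baseline models, and ablation studies show that both modules are needed to improve predictive accuracy.
Link: https://arxiv.org/abs/2508.06756
Authors: Somayeh Farahani,Marjaneh Hejazi,Antonio Di Ieva,Sidong Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for oral and poster presentation at MICCAI 2025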
Abstract:Accurate, noninvasive detection of isocitrate dehydrogenase (IDH) mutation is essential for effective glioma management. Traditional methods rely on invasive tissue sampling, which may fail to capture a tumor’s spatial heterogeneity. While deep learning models have shown promise in molecular profiling, their performance is often limited by scarce annotated data. In contrast, foundation deep learning models offer a more generalizable approach for glioma imaging biomarkers. We propose a Foundation-based Biomarker Network (FoundBioNet) that utilizes a SWIN-UNETR-based architecture to noninvasively predict IDH mutation status from multi-parametric MRI. Two key modules are incorporated: Tumor-Aware Feature Encoding (TAFE) for extracting multi-scale, tumor-focused features, and Cross-Modality Differential (CMD) for highlighting subtle T2-FLAIR mismatch signals associated with IDH mutation. The model was trained and validated on a diverse, multi-center cohort of 1705 glioma patients from six public datasets. Our model achieved AUCs of 90.58%, 88.08%, 65.41%, and 80.31% on independent test sets from EGD, TCGA, Ivy GAP, RHUH, and UPenn, consistently outperforming baseline approaches (p = 0.05). Ablation studies confirmed that both the TAFE and CMD modules are essential for improving predictive accuracy. By integrating large-scale pretraining and task-specific fine-tuning, FoundBioNet enables generalizable glioma characterization. This approach enhances diagnostic accuracy and interpretability, with the potential to enable more personalized patient care.
[CV-216] Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video
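[Quick read]: This paper addresses the difficulty generative AI has in maintaining physical realism and consistent motion dynamics in 4D scene synthesis, that is, how to exploit the semantic priors of text-to-image or image-to-video generative models while keeping geometry and motion realistic and consistent. The key to the solution is Restage4D, a geometry-preserving framework for video-conditioned 4D restaging. A video-rewinding training strategy builds a temporal bridge between a real base video and a synthetic driving video through a shared motion representation, which corrects generation errors. An occlusion-aware rigidity loss and a disocclusion backtracing mechanism further improve structural and geometric consistency under complex motion, enabling stable reconstruction and automatic error correction for deformable 3D scenes under novel motion.
Link: https://arxiv.org/abs/2508.06715
Authors: Jixuan He,Chieh Hubert Lin,Lu Qi,Ming-Hsuan Yang
Affiliations: Cornell Tech; University of California, Merced; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: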
Abstract:Creating deformable 3D content has gained increasing attention with the rise of text-to-image and image-to-video generative models. While these models provide rich semantic priors for appearance, they struggle to capture the physical realism and motion dynamics needed for authentic 4D scene synthesis. In contrast, real-world videos can provide physically grounded geometry and articulation cues that are difficult to hallucinate. One question is raised: Can we generate physically consistent 4D content by leveraging the motion priors of the real-world video? In this work, we explore the task of reanimating deformable 3D scenes from a single video, using the original sequence as a supervisory signal to correct artifacts from synthetic motion. We introduce Restage4D, a geometry-preserving pipeline for video-conditioned 4D restaging. Our approach uses a video-rewinding training strategy to temporally bridge a real base video and a synthetic driving video via a shared motion representation. We further incorporate an occlusion-aware rigidity loss and a disocclusion backtracing mechanism to improve structural and geometry consistency under challenging motion. We validate Restage4D on DAVIS and PointOdyssey, demonstrating improved geometry consistency, motion quality, and 3D tracking performance. Our method not only preserves deformable structure under novel motion, but also automatically corrects errors introduced by generative models, revealing the potential of video priors in the 4D restaging task. Source code and trained models will be released.
[CV-217] Fourier Optics and Deep Learning Methods for Fast 3D Reconstruction in Digital Holography
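[Quick read]: This paper addresses efficient generation of high-quality phase-only holograms (POH) and complex holograms (CH) in computer-generated holography (CGH), in particular the trade-off between computational cost and reconstruction quality when reconstructing 3D objects from initial point cloud and MRI data. The key to the solution is a fast end-to-end pipeline: the input data is reconstructed into volumetric objects and then optimized with non-convex Fourier optics algorithms (alternating projection, stochastic gradient descent (SGD), and quasi-Newton methods). The algorithms are compared with one another and with the HoloNet deep learning CGH method using mean squared error (MSE), root mean squared error (RMSE), and peak signal-to-noise ratio (PSNR), and 2D median filtering, which removes artifacts and speckle noise during optimization, is shown to improve these metrics.
Link: https://arxiv.org/abs/2508.06703
Authors: Justin London
Affiliations: University of North Dakota
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: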
Abstract:Computer-generated holography (CGH) is a promising method that modulates user-defined waveforms with digital holograms. An efficient and fast pipeline framework is proposed to synthesize CGH using initial point cloud and MRI data. This input data is reconstructed into volumetric objects that are then input into non-convex Fourier optics optimization algorithms for phase-only hologram (POH) and complex-hologram (CH) generation using alternating projection, SGD, and quasi-Newton methods. The reconstruction performance of these algorithms, as measured by MSE, RMSE, and PSNR, is compared across the algorithms and against HoloNet deep learning CGH. Performance metrics are shown to be improved by using 2D median filtering to remove artifacts and speckle noise during optimization.
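The three metrics and the median filter mentioned in the abstract are standard, and a small pure-Python sketch shows how they relate. This is an illustration on a toy 3x3 image, not the paper's pipeline: the paper applies filtering during optimization, whereas here it is applied once to a finished reconstruction, and the function names and the shrinking-window border handling are assumptions.

```python
import math
from statistics import median

def mse(a, b):
    """Mean squared error between two images given as lists of rows."""
    count = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / count

def rmse(a, b):
    return math.sqrt(mse(a, b))

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    return math.inf if err == 0 else 10 * math.log10(peak ** 2 / err)

def median_filter3(img):
    """3x3 median filter; windows shrink at the borders instead of padding."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = median(window)
    return out

clean = [[0.5, 0.5, 0.5] for _ in range(3)]
noisy = [row[:] for row in clean]
noisy[1][1] = 1.0  # a single speckle-like outlier
filtered = median_filter3(noisy)
```

On this toy input the isolated outlier is removed entirely, so the PSNR of the filtered image against the clean one is unbounded, while the noisy image scores a finite value.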
[CV-218] MMFformer: Multimodal Fusion Transformer Network for Depression Detection
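[Quick read]: This paper addresses the difficulty of early depression detection, which relies on subjective clinical assessment, and in particular how to extract depression-related temporal features from multimodal social media content (e.g., video and audio) and fuse them effectively across modalities. The key to the solution is the MMFformer architecture: a transformer with residual connections captures spatial features from video, a transformer encoder models the important temporal dynamics in audio, and late and intermediate fusion strategies uncover the most relevant cross-modal correlations. Experiments on two large-scale datasets show that it surpasses existing state-of-the-art models, improving the F1-score by 13.92% on D-Vlog and 7.74% on LMVD.
Link: https://arxiv.org/abs/2508.06701
Authors: Md Rezwanul Haque,Md. Milon Islam,S M Taslim Uddin Raju,Hamdi Altaheri,Lobna Nassar,Fakhri Karray
Affiliations: University of Waterloo; American University of Ras Al Khaimah; Mohamed bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted for the 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Vienna, Austria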
Abstract:Depression is a serious mental health illness that significantly affects an individual’s well-being and quality of life, making early detection crucial for adequate care and treatment. Detecting depression is often difficult, as it is based primarily on subjective evaluations during clinical interviews. Hence, the early diagnosis of depression, thanks to the content of social networks, has become a prominent research area. The extensive and diverse nature of user-generated information poses a significant challenge, limiting the accurate extraction of relevant temporal information and the effective fusion of data across multiple modalities. This paper introduces MMFformer, a multimodal depression detection network designed to retrieve depressive spatio-temporal high-level patterns from multimodal social media information. The transformer network with residual connections captures spatial features from videos, and a transformer encoder is exploited to design important temporal dynamics in audio. Moreover, the fusion architecture fused the extracted features through late and intermediate fusion strategies to find out the most relevant intermodal correlations among them. Finally, the proposed network is assessed on two large-scale depression detection datasets, and the results clearly reveal that it surpasses existing state-of-the-art approaches, improving the F1-Score by 13.92% for D-Vlog dataset and 7.74% for LMVD dataset. The code is made available publicly at this https URL.
[CV-219] Learning More by Seeing Less: Line Drawing Pretraining for Efficient Transferable and Human-Aligned Vision
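[Quick read]: This paper addresses the over-reliance of current computer vision models on rich, redundant visual input, which leads to inefficient representations and limited generalization. The core solution is to use line drawings as a structure-first pretraining modality that induces more compact and generalizable visual representations. Line-drawing pretraining strengthens shape bias, focuses attention, and markedly lowers the intrinsic dimensionality of the representation space, giving higher data efficiency on classification, detection, and segmentation tasks. These structurally compact representations are also easier to compress and distill, so lightweight student models distilled from line-pretrained teachers outperform those distilled from color-supervised teachers.
Link: https://arxiv.org/abs/2508.06696
Authors: Tianqin Li,George Liu,Tai Sing Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: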
Abstract:Despite remarkable progress in computer vision, modern recognition systems remain limited by their dependence on rich, redundant visual inputs. In contrast, humans can effortlessly understand sparse, minimal representations like line drawings - suggesting that structure, rather than appearance, underlies efficient visual understanding. In this work, we propose using line drawings as a structure-first pretraining modality to induce more compact and generalizable visual representations. We show that models pretrained on line drawings develop stronger shape bias, more focused attention, and greater data efficiency across classification, detection, and segmentation tasks. Notably, these models also exhibit lower intrinsic dimensionality, requiring significantly fewer principal components to capture representational variance - echoing the similar observation in low dimensional efficient representation in the brain. Beyond performance improvements, line drawing pretraining produces more compressible representations, enabling better distillation into lightweight student models. Students distilled from line-pretrained teachers consistently outperform those trained from color-supervised teachers, highlighting the benefits of structurally compact knowledge. Finally, we demonstrate that the pretraining with line-drawing can also be extended to unsupervised setting via our proposed method “learning to draw”. Together, our results support the view that structure-first visual learning fosters efficiency, generalization, and human-aligned inductive biases - offering a simple yet powerful strategy for building more robust and adaptable vision systems.
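One common way to quantify "fewer principal components to capture representational variance" is to count how many components are needed to reach a fixed share of the total variance. The sketch below assumes the PCA variance spectrum (eigenvalues of the feature covariance) has already been computed; the 90% threshold, the function name, and both example spectra are illustrative assumptions, not values from the paper.

```python
def components_for_variance(eigenvalues, threshold=0.9):
    """Number of leading principal components whose variance adds up to at
    least `threshold` of the total variance."""
    total = sum(eigenvalues)
    running = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(eigenvalues)

# Hypothetical spectra: one dominated by a single direction, one spread out.
compact_spectrum = [9.2, 0.5, 0.2, 0.1]
diffuse_spectrum = [3.0, 3.0, 2.0, 2.0]
compact_k = components_for_variance(compact_spectrum)
diffuse_k = components_for_variance(diffuse_spectrum)
```

A representation whose variance concentrates in a few directions needs a smaller count, which is the sense in which it has lower intrinsic dimensionality.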
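[CV-220] Towards Robust Red-Green Watermarking for Autoregressive Image Generators
[Quick read]: This paper addresses the lack of effective watermarking for autoregressive (AR) image generation models, and in particular how to make watermarks robust to common image perturbations and regeneration attacks while preserving image quality. The key to the solution is two new watermarking methods based on visual token clustering: a training-free method that uses a cluster lookup table, and a method that fine-tunes a variational autoencoder (VAE) encoder to predict token clusters directly from perturbed images. Both assign similar tokens to the same cluster to stabilize the watermark, which improves detection accuracy over existing baselines and offers verification speed close to that of lightweight post-hoc watermarking.
Link: https://arxiv.org/abs/2508.06656
Authors: Denis Lukovnikov,Andreas Müller,Erwin Quiring,Asja Fischer
Affiliations: Ruhr-University Bochum
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: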
Abstract:In-generation watermarking for detecting and attributing generated content has recently been explored for latent diffusion models (LDMs), demonstrating high robustness. However, the use of in-generation watermarks in autoregressive (AR) image models has not been explored yet. AR models generate images by autoregressively predicting a sequence of visual tokens that are then decoded into pixels using a vector-quantized decoder. Inspired by red-green watermarks for large language models, we examine token-level watermarking schemes that bias the next-token prediction based on prior tokens. We find that a direct transfer of these schemes works in principle, but the detectability of the watermarks decreases considerably under common image perturbations. As a remedy, we propose two novel watermarking methods that rely on visual token clustering to assign similar tokens to the same set. Firstly, we investigate a training-free approach that relies on a cluster lookup table, and secondly, we finetune VAE encoders to predict token clusters directly from perturbed images. Overall, our experiments show that cluster-level watermarks improve robustness against perturbations and regeneration attacks while preserving image quality. Cluster classification further boosts watermark detectability, outperforming a set of baselines. Moreover, our methods offer fast verification runtime, comparable to lightweight post-hoc watermarking methods.
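A toy version of a cluster-level red-green watermark can make the robustness argument tangible. Everything below is an assumption made for illustration: tokens are grouped into clusters by integer division (standing in for a learned clustering of the codebook), generation uses a hard restriction to green clusters instead of a soft logit bias, and there is no actual image model or decoder.

```python
import hashlib
import math
import random

VOCAB_SIZE = 64    # hypothetical visual codebook size
CLUSTER_SIZE = 4   # hypothetical: tokens 4k .. 4k+3 form one cluster
NUM_CLUSTERS = VOCAB_SIZE // CLUSTER_SIZE
GAMMA = 0.5        # share of clusters placed on the green list

def cluster_of(token):
    return token // CLUSTER_SIZE

def green_clusters(key, prev_token):
    """Green list for the next token, seeded by the key and the previous cluster."""
    digest = hashlib.sha256(f"{key}:{cluster_of(prev_token)}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(NUM_CLUSTERS), int(GAMMA * NUM_CLUSTERS)))

def generate(key, length, rng, watermark=True):
    """Sample tokens uniformly; with the watermark on, only from green clusters."""
    tokens = [rng.randrange(VOCAB_SIZE)]
    while len(tokens) < length:
        candidates = range(VOCAB_SIZE)
        if watermark:
            green = green_clusters(key, tokens[-1])
            candidates = [t for t in candidates if cluster_of(t) in green]
        tokens.append(rng.choice(candidates))
    return tokens

def z_score(key, tokens):
    """How far the green-hit count lies above what unmarked content would give."""
    scored = len(tokens) - 1
    hits = sum(cluster_of(t) in green_clusters(key, p) for p, t in zip(tokens, tokens[1:]))
    return (hits - GAMMA * scored) / math.sqrt(scored * GAMMA * (1 - GAMMA))

marked = generate("secret", 101, random.Random(0))
perturbed = [t ^ 1 for t in marked]  # each token swapped for a same-cluster neighbor
```

Because the green list is defined over clusters, replacing a token with another member of the same cluster leaves the detection score unchanged, which is the intuition behind the robustness to perturbations that shift tokens only slightly.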
[CV-221] Rethinking Key-frame-based Micro-expression Recognition: A Robust and Accurate Framework Against Key-frame Errors
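[Quick read]: This paper addresses the performance degradation in Micro-expression Recognition (MER) caused by key-frame index errors, especially in practical settings where accurate key-frame indexes are hard to obtain. Existing methods rely on accurate key-frame indexes to improve accuracy but overlook the fact that index errors exist and affect robustness. To address this, the paper proposes CausalNet, which takes the representation of the entire micro-expression sequence as input. Its core innovations are: first, a Causal Motion Position Learning Module (CMPLM) that locates the muscle movement areas related to Action Units (AUs), reducing interference from redundant information; second, a Causal Attention Block (CAB) that models the causal relationships between muscle contraction and relaxation movements in micro-expressions. The method is robust under various levels of key-frame index noise and surpasses state-of-the-art methods on standard benchmarks.
Link: https://arxiv.org/abs/2508.06640
Authors: Zheyuan Zhang,Weihao Tang,Hong Chen
Affiliations: Beijing University of Posts and Telecommunications; University of Auckland; Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: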
Abstract:Micro-expression recognition (MER) is a highly challenging task in affective computing. With the reduced-sized micro-expression (ME) input that contains key information based on key-frame indexes, key-frame-based methods have significantly improved the performance of MER. However, most of these methods focus on improving the performance with relatively accurate key-frame indexes, while ignoring the difficulty of obtaining accurate key-frame indexes and the objective existence of key-frame index errors, which impedes them from moving towards practical applications. In this paper, we propose CausalNet, a novel framework to achieve robust MER facing key-frame index errors while maintaining accurate recognition. To enhance robustness, CausalNet takes the representation of the entire ME sequence as the input. To address the information redundancy brought by the complete ME range input and maintain accurate recognition, first, the Causal Motion Position Learning Module (CMPLM) is proposed to help the model locate the muscle movement areas related to Action Units (AUs), thereby reducing the attention to other redundant areas. Second, the Causal Attention Block (CAB) is proposed to deeply learn the causal relationships between the muscle contraction and relaxation movements in MEs. Empirical experiments have demonstrated that on popular ME benchmarks, the CausalNet has achieved robust MER under different levels of key-frame index noise. Meanwhile, it has surpassed state-of-the-art (SOTA) methods on several standard MER benchmarks when using the provided annotated key-frames. Code is available at this https URL.
[CV-222] CoDe-NeRF: Neural Rendering via Dynamic Coefficient Decomposition
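[Quick read]: This paper addresses problems Neural Radiance Fields (NeRF) have when rendering scenes with complex specular reflections and highlights, such as blurry reflections caused by entangled lighting and material properties and unstable optimization in physically based inverse rendering. The key to the solution is a neural rendering framework based on dynamic coefficient decomposition: complex appearance is decomposed into a shared, static neural basis that encodes intrinsic material properties and a set of dynamic coefficients produced by a Coefficient Network conditioned on view and illumination; a Dynamic Radiance Integrator then combines the two to synthesize the final radiance. This separates view-dependent effects from view-invariant material properties and, according to the authors, yields sharper and more realistic specular highlights.
Link: https://arxiv.org/abs/2508.06632
Authors: Wenpeng Xing,Jie Chen,Zaifeng Yang,Tiancheng Zhao,Gaolei Li,Changting Lin,Yike Guo,Meng Han
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: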
Abstract:Neural Radiance Fields (NeRF) have shown impressive performance in novel view synthesis, but challenges remain in rendering scenes with complex specular reflections and highlights. Existing approaches may produce blurry reflections due to entanglement between lighting and material properties, or encounter optimization instability when relying on physically-based inverse rendering. In this work, we present a neural rendering framework based on dynamic coefficient decomposition, aiming to improve the modeling of view-dependent appearance. Our approach decomposes complex appearance into a shared, static neural basis that encodes intrinsic material properties, and a set of dynamic coefficients generated by a Coefficient Network conditioned on view and illumination. A Dynamic Radiance Integrator then combines these components to synthesize the final radiance. Experimental results on several challenging benchmarks suggest that our method can produce sharper and more realistic specular highlights compared to existing techniques. We hope that this decomposition paradigm can provide a flexible and effective direction for modeling complex appearance in neural scene representations.
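The split between a static basis and dynamic coefficients can be illustrated with a toy example. The abstract does not say how the Dynamic Radiance Integrator combines the two, so the plain weighted sum below is an assumption, and `toy_coefficients` is a hand-written stand-in (a constant diffuse weight plus a sharp specular lobe) for what the paper describes as a learned Coefficient Network.

```python
def toy_coefficients(cos_half_angle, shininess=32.0):
    """Hypothetical stand-in for a learned coefficient network: weight 1 for
    the diffuse basis and a specular weight that peaks when the view lines
    up with the mirror direction (cos_half_angle close to 1)."""
    specular = max(0.0, cos_half_angle) ** shininess
    return (1.0, specular)

def integrate_radiance(basis, coeffs):
    """Assumed integrator: weighted sum of K static RGB basis vectors."""
    return tuple(sum(c * b[ch] for c, b in zip(coeffs, basis)) for ch in range(3))

# Static, view-independent basis at one surface point: base color and highlight color.
basis = [(0.2, 0.1, 0.05), (1.0, 1.0, 1.0)]
on_highlight = integrate_radiance(basis, toy_coefficients(1.0))
off_highlight = integrate_radiance(basis, toy_coefficients(0.0))
```

The basis never changes with the viewpoint; only the coefficients do, so the highlight appears and disappears as the view moves while the underlying material color stays fixed.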
[CV-223] CycleDiff: Cycle Diffusion Models for Unpaired Image-to-image Translation
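[Quick read]: This paper addresses the misalignment between the diffusion process and the translation process in cross-domain image translation without paired training data. The diffusion process operates on noisy signals while the translation process operates on clean signals, so the two are inconsistent; this can trap the translation optimization in local minima and limit the benefit of diffusion models. The key to the solution is a joint learning framework that extracts image components from the diffusion model to represent the clean signal and runs the translation process on those components, enabling end-to-end learning. A time-dependent translation network learns the complex translation mapping, so both processes are optimized globally, improving the quality, fidelity, and structural consistency of the generated images.
Link: https://arxiv.org/abs/2508.06625
Authors: Shilong Zou,Yuhang Huang,Renjiao Yi,Chenyang Zhu,Kai Xu
Affiliations: National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: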
Abstract:We introduce a diffusion-based cross-domain image translator in the absence of paired training data. Unlike GAN-based methods, our approach integrates diffusion models to learn the image translation process, allowing for more comprehensive modeling of the data distribution and improved cross-domain translation performance. However, incorporating the translation process within the diffusion process is still challenging since the two processes are not aligned exactly, i.e., the diffusion process is applied to the noisy signal while the translation process is conducted on the clean signal. As a result, recent diffusion-based studies employ separate training or shallow integration to learn the two processes, yet this may trap the translation optimization in local minima, constraining the effectiveness of diffusion models. To address the problem, we propose a novel joint learning framework that aligns the diffusion and the translation process, thereby improving the global optimality. Specifically, we propose to extract the image components with diffusion models to represent the clean signal and employ the translation process with the image components, enabling end-to-end joint learning. In addition, we introduce a time-dependent translation network to learn the complex translation mapping, resulting in effective translation learning and significant performance improvement. Benefiting from the design of joint learning, our method enables global optimization of both processes, enhancing the optimality and achieving improved fidelity and structural consistency. We have conducted extensive experiments on RGB↔RGB and diverse cross-modality translation tasks including RGB↔Edge, RGB↔Semantics and RGB↔Depth, showcasing better generative performance than the state of the art.
[CV-224] VL-MedGuide: A Visual-Linguistic Large Model for Intelligent and Explainable Skin Disease Auxiliary Diagnosis
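[Quick read]: This paper addresses the limited accuracy and clinical utility of skin disease diagnosis, which stem from the complex and diverse visual features of lesion images and the lack of interpretability in existing purely visual models. The key to the solution is the VL-MedGuide framework, which uses the multimodal understanding and reasoning capabilities of Visual-Language Large Models (LVLMs) in two interconnected modules: a Multi-modal Concept Perception Module that identifies and verbally describes dermatologically relevant visual features through carefully designed prompts, and an Explainable Disease Reasoning Module that combines these concepts with the raw visual information via Chain-of-Thought prompting to output a disease diagnosis together with a transparent rationale.
Link: https://arxiv.org/abs/2508.06624
Authors: Kexin Yu,Zihan Xu,Jialei Xie,Carter Adams
Affiliations: Jiangsu Ocean University; Federal University of Bahia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: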
Abstract:Accurate diagnosis of skin diseases remains a significant challenge due to the complex and diverse visual features present in dermatoscopic images, often compounded by a lack of interpretability in existing purely visual diagnostic models. To address these limitations, this study introduces VL-MedGuide (Visual-Linguistic Medical Guide), a novel framework leveraging the powerful multi-modal understanding and reasoning capabilities of Visual-Language Large Models (LVLMs) for intelligent and inherently interpretable auxiliary diagnosis of skin conditions. VL-MedGuide operates in two interconnected stages: a Multi-modal Concept Perception Module, which identifies and linguistically describes dermatologically relevant visual features through sophisticated prompt engineering, and an Explainable Disease Reasoning Module, which integrates these concepts with raw visual information via Chain-of-Thought prompting to provide precise disease diagnoses alongside transparent rationales. Comprehensive experiments on the Derm7pt dataset demonstrate that VL-MedGuide achieves state-of-the-art performance in both disease diagnosis (83.55% BACC, 80.12% F1) and concept detection (76.10% BACC, 67.45% F1), surpassing existing baselines. Furthermore, human evaluations confirm the high clarity, completeness, and trustworthiness of its generated explanations, bridging the gap between AI performance and clinical utility by offering actionable, explainable insights for dermatological practice.
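The abstract reports balanced accuracy (BACC) rather than plain accuracy, which matters on imbalanced diagnostic data. The sketch below uses the usual definition, the mean of per-class recall; the class names and the toy labels are made up for illustration and are not taken from Derm7pt.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, so every class counts equally regardless of size."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in indices) / len(indices))
    return sum(recalls) / len(recalls)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical imbalanced case: a model that always predicts the majority class.
y_true = ["nevus"] * 8 + ["melanoma"] * 2
y_pred = ["nevus"] * 10
plain = accuracy(y_true, y_pred)
balanced = balanced_accuracy(y_true, y_pred)
```

A classifier that never predicts the minority class still reaches 80% plain accuracy here but only 50% balanced accuracy, which is why BACC is the more informative figure for rare diagnoses.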
[CV-225] ContextGuard-LVLM: Enhancing News Veracity through Fine-grained Cross-modal Contextual Consistency Verification
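Quick read: This paper tackles the verification of fine-grained cross-modal contextual consistency (FCCC) between visual and textual information in digital news media, that is, how to accurately detect implicit inconsistencies between image and text in narrative logic, emotional tone, and background information, rather than only in entity matching. The key to its solution is the ContextGuard-LVLM framework, built on advanced Vision-Language Large Models (LVLMs) with a multi-stage contextual reasoning mechanism and trained with reinforcement or adversarial learning paradigms to sharpen the model's sensitivity to subtle contextual misalignment. The authors also extend and annotate three benchmark datasets with new fine-grained labels ("contextual sentiment", "visual narrative theme", and "scene-event logical coherence") and define a new CTXT (Contextual Coherence) entity type. The resulting model outperforms existing zero-shot LVLM baselines (such as InstructBLIP and LLaVA 1.5) on complex logical reasoning and contextual understanding tasks.
Link: https://arxiv.org/abs/2508.06623
Authors: Sihan Ma,Qiming Wu,Ruotong Jiang,Frank Burns
Affiliations: Inner Mongolia University of Science & Technology; Federal University of Rio de Janeiro
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: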
Abstract:The proliferation of digital news media necessitates robust methods for verifying content veracity, particularly regarding the consistency between visual and textual information. Traditional approaches often fall short in addressing the fine-grained cross-modal contextual consistency (FCCC) problem, which encompasses deeper alignment of visual narrative, emotional tone, and background information with text, beyond mere entity matching. To address this, we propose ContextGuard-LVLM, a novel framework built upon advanced Vision-Language Large Models (LVLMs) and integrating a multi-stage contextual reasoning mechanism. Our model is uniquely enhanced through reinforced or adversarial learning paradigms, enabling it to detect subtle contextual misalignments that evade zero-shot baselines. We extend and augment three established datasets (TamperedNews-Ent, News400-Ent, MMG-Ent) with new fine-grained contextual annotations, including “contextual sentiment,” “visual narrative theme,” and “scene-event logical coherence,” and introduce a comprehensive CTXT (Contextual Coherence) entity type. Extensive experiments demonstrate that ContextGuard-LVLM consistently outperforms state-of-the-art zero-shot LVLM baselines (InstructBLIP and LLaVA 1.5) across nearly all fine-grained consistency tasks, showing significant improvements in complex logical reasoning and nuanced contextual understanding. Furthermore, our model exhibits superior robustness to subtle perturbations and a higher agreement rate with human expert judgments on challenging samples, affirming its efficacy in discerning sophisticated forms of context detachment.
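[CV-226] CountQA: How Well Do MLLMs Count in the Wild?
Quick read: This paper addresses the widespread weakness of Multimodal Large Language Models (MLLMs) at counting objects in real-world scenes, a deficiency that severely limits their reliability in practice. Existing benchmarks either have sparse object densities or are confined to specific visual domains, so they cannot adequately test models in complex, high-density settings. The authors therefore propose CountQA, a new benchmark of more than 1,500 question-answer pairs whose images exhibit real-world characteristics such as high object density, cluttered backgrounds, and occlusion, allowing systematic probing of MLLM counting ability. Experiments show that even the best-performing model reaches only 42.9% accuracy on CountQA, and performance degrades further as object counts grow. The key contribution is an evaluation suite dedicated to diagnosing and improving the numerical perception and spatial awareness of MLLMs, laying groundwork for next-generation multimodal models that are both descriptively fluent and numerically precise.
Link: https://arxiv.org/abs/2508.06585
Authors: Jayant Sravan Tamarapalli,Rynaa Grover,Nilay Pande,Sahiti Yerramilli
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: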
Abstract:Multimodal Large Language Models (MLLMs) demonstrate remarkable fluency in understanding visual scenes, yet they exhibit a critical lack in a fundamental cognitive skill: object counting. This blind spot severely limits their reliability in real-world applications. To date, this capability has been largely unevaluated in complex scenarios, as existing benchmarks either feature sparse object densities or are confined to specific visual domains, failing to test models under realistic conditions. Addressing this gap, we introduce CountQA, a challenging new benchmark designed to probe this deficiency. Comprising over 1,500 question-answer pairs, CountQA features real-world images with high object density, clutter, and occlusion. We investigate this weakness by evaluating 15 prominent MLLMs on the CountQA benchmark and reveal that the top-performing model achieves a mere 42.9% accuracy, with performance declining as object counts rise. By providing a dedicated benchmark to diagnose and rectify this core weakness, CountQA paves the way for a new generation of MLLMs that are not only descriptively fluent but also numerically grounded and spatially aware. We will open-source the dataset and code upon paper acceptance to foster further research.
[CV-227] IRL-VLA: Training an Vision-Language-Action Policy via Reward World Model
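Quick read: This paper targets two key challenges for Vision-Language-Action (VLA) models in autonomous driving. First, existing VLA architectures mostly rely on open-loop imitation learning, which tends to be limited to the behaviors recorded in the training data, leading to suboptimal performance and weak generalization. Second, closed-loop training depends heavily on high-fidelity sensor simulation, where domain gaps and computational inefficiency are major obstacles. The core of the solution is IRL-VLA, a new closed-loop reinforcement learning framework with a three-stage paradigm: first, pretrain the VLA policy via imitation learning; second, build a lightweight reward world model via inverse reinforcement learning to enable efficient closed-loop reward computation; finally, apply reward-world-model-guided reinforcement learning based on PPO (Proximal Policy Optimization) to balance safety, driving comfort, and traffic efficiency. The method achieves leading performance on the NAVSIM v2 end-to-end driving benchmark and in the CVPR2025 Autonomous Grand Challenge.
Link: https://arxiv.org/abs/2508.06571
Authors: Anqing Jiang,Yu Gao,Yiru Wang,Zhigang Sun,Shuo Wang,Yuwen Heng,Hao Sun,Shichen Tang,Lijuan Zhu,Jinhao Chai,Jijun Wang,Zichong Gu,Hao Jiang,Li Sun
Affiliations: Bosch Corporate Research, Shanghai, China; Shanghai University; Shanghai Jiao Tong University; Bosch Mobility Solutions, Robert Bosch GmbH, Suzhou; Tsinghua University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 9 pages, 2 figures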
Abstract:Vision-Language-Action (VLA) models have demonstrated potential in autonomous driving. However, two critical challenges hinder their development: (1) existing VLA architectures are typically based on imitation learning in an open-loop setup, which tends to capture the recorded behaviors in the dataset, leading to suboptimal and constrained performance; (2) closed-loop training relies heavily on high-fidelity sensor simulation, where domain gaps and computational inefficiencies pose significant barriers. In this paper, we introduce IRL-VLA, a novel closed-loop reinforcement learning framework that uses an Inverse Reinforcement Learning reward world model with a self-built VLA approach. Our framework proceeds in a three-stage paradigm: In the first stage, we propose a VLA architecture and pretrain the VLA policy via imitation learning. In the second stage, we construct a lightweight reward world model via inverse reinforcement learning to enable efficient closed-loop reward computation. Finally, to further enhance planning performance, we design specialized reward-world-model-guided reinforcement learning via PPO (Proximal Policy Optimization) to effectively balance safety incidents, comfortable driving, and traffic efficiency. Our approach achieves state-of-the-art performance on the NAVSIM v2 end-to-end driving benchmark and was 1st runner-up in the CVPR2025 Autonomous Grand Challenge. We hope that our framework will accelerate VLA research in closed-loop autonomous driving.
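The third stage relies on PPO. As a rough illustration only, the standard PPO clipped surrogate objective (the generic formulation, not code from this paper) can be sketched as follows. The function name, the per-sample inputs, and the `clip_eps=0.2` default are assumptions; in IRL-VLA the advantages would be derived from the learned reward world model, which is not modeled here.

```python
import math


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over samples.

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the data-collecting policy. advantages: per-sample advantages.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


# A ratio of 2 with a positive advantage is clipped to 1 + clip_eps.
loss_clipped = ppo_clip_loss([math.log(2.0)], [0.0], [1.0])
# With a negative advantage the unclipped (more pessimistic) term is kept.
loss_unclipped = ppo_clip_loss([math.log(2.0)], [0.0], [-1.0])
```

The clipping removes the incentive to move the policy far from the data-collecting policy in a single update, which is the usual reason PPO is chosen for fine-tuning a pretrained policy.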
[CV-228] ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos ACL2025
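Quick read: This paper addresses the underexplored problem of detecting implicit hate speech in videos. Existing work focuses mainly on hate speech in text and images, leaving a clear gap in multimodal analysis of video content. The key to its solution is ImpliHateVid, a large-scale video dataset (509 implicit-hate videos, 500 explicit-hate videos, and 1,000 non-hate videos), together with a two-stage contrastive learning framework. The first stage trains modality-specific encoders for audio, text, and image, concatenating their features and optimizing with a contrastive loss; the second stage introduces cross-modal encoders to further refine the multimodal representations and incorporates auxiliary sentiment, emotion, and caption features to improve discrimination of implicit hate. Experiments show the method improves hateful-video detection on both ImpliHateVid and HateMM.
Link: https://arxiv.org/abs/2508.06570
Authors: Mohammad Zia Ur Rehman,Anukriti Bhatnagar,Omkar Kabde,Shubhi Bansal,Nagendra Kumar
Affiliations: Indian Institute of Technology Indore; Chaitanya Bharathi Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Published in ACL 2025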
Abstract:Existing research has primarily focused on text- and image-based hate speech detection, while video-based approaches remain underexplored. In this work, we introduce a novel dataset, ImpliHateVid, specifically curated for implicit hate speech detection in videos. ImpliHateVid consists of 2,009 videos comprising 509 implicit hate videos, 500 explicit hate videos, and 1,000 non-hate videos, making it one of the first large-scale video datasets dedicated to implicit hate detection. We also propose a novel two-stage contrastive learning framework for hate speech detection in videos. In the first stage, we train modality-specific encoders for audio, text, and image using contrastive loss by concatenating features from the three encoders. In the second stage, we train cross-encoders using contrastive learning to refine multimodal representations. Additionally, we incorporate sentiment, emotion, and caption-based features to enhance implicit hate detection. We evaluate our method on two datasets, ImpliHateVid for implicit hate speech detection and HateMM for general hate speech detection in videos, demonstrating the effectiveness of the proposed multimodal contrastive learning for hateful content detection in videos and the significance of our dataset.
[CV-229] Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features
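Quick read: This paper addresses surface material recognition in robotic perception and physical interaction, focusing on how to fuse tactile and visual information effectively to improve classification. The key to its solution is Surformer v1, a Transformer-based architecture that takes structured tactile features and visual embeddings extracted with ResNet-50 and reduced via principal component analysis (PCA) as input, and combines modality-specific encoders with cross-modal attention to enable deep interaction between vision and touch. In tactile-only experiments, an encoder-only Transformer achieved the highest accuracy and the fastest inference among the evaluated models; in the multimodal setting, Surformer v1 reached 99.4% accuracy with an inference time of 0.77 ms, a strong balance of accuracy, efficiency, and computational cost.
Link: https://arxiv.org/abs/2508.06566
Authors: Manish Kansana,Elias Hossain,Shahram Rahimi,Noorbakhsh Amiri Golilarz
Affiliations: Mississippi State University; The University of Alabama
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: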
Abstract:Surface material recognition is a key component in robotic perception and physical interaction, particularly when leveraging both tactile and visual sensory inputs. In this work, we propose Surformer v1, a transformer-based architecture designed for surface classification using structured tactile features and PCA-reduced visual embeddings extracted via ResNet-50. The model integrates modality-specific encoders with cross-modal attention layers, enabling rich interactions between vision and touch. Currently, state-of-the-art deep learning models for vision tasks have achieved remarkable performance. With this in mind, our first set of experiments focused exclusively on tactile-only surface classification. Using feature engineering, we trained and evaluated multiple machine learning models, assessing their accuracy and inference time. We then implemented an encoder-only Transformer model tailored for tactile features. This model not only achieved the highest accuracy but also demonstrated significantly faster inference time compared to other evaluated models, highlighting its potential for real-time applications. To extend this investigation, we introduced a multimodal fusion setup by combining vision and tactile inputs. We trained both Surformer v1 (using structured features) and Multimodal CNN (using raw images) to examine the impact of feature-based versus image-based multimodal learning on classification accuracy and computational efficiency. The results showed that Surformer v1 achieved 99.4% accuracy with an inference time of 0.77 ms, while the Multimodal CNN achieved slightly higher accuracy but required significantly more inference time. These findings suggest Surformer v1 offers a compelling balance between accuracy, efficiency, and computational cost for surface material recognition.
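To illustrate what a cross-modal attention layer computes, here is a minimal sketch of scaled dot-product attention in plain Python. It is not the Surformer v1 implementation: it omits the learned query/key/value projections and multi-head structure, and treating tactile tokens as queries with visual tokens as keys and values is an assumption made only for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(queries, keys, values):
    """Each query row attends over all key rows and mixes the value rows."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Hypothetical tokens: one tactile query attending over two visual tokens.
tactile = [[0.0, 0.0]]
visual_keys = [[1.0, 0.0], [0.0, 1.0]]
visual_values = [[2.0, 0.0], [0.0, 4.0]]
fused = cross_attention(tactile, visual_keys, visual_values)
```

With a zero query both visual tokens score equally, so the output is the plain average of the two value rows.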
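[CV-230] Bridging Brain Connectomes and Clinical Reports for Early Alzheimer's Disease Diagnosis
Quick read: This paper addresses the key challenge of linking objective brain imaging data with subjective text-based clinical reports (such as doctors' notes) to improve the accuracy and timeliness of brain disorder diagnosis. The key to its solution is a novel cross-modal alignment framework that treats subnetworks of the brain connectome, rather than conventional image patches, as the "tokens" of imaging data, aligning them with word tokens from clinical reports in a shared cross-modal latent space. This makes it more efficient to identify system-level associations between neuroimaging findings and clinical observations, which suits brain disorders characterized mainly by network-level abnormalities (for example, studies of the early mechanisms of Alzheimer's disease).
Link: https://arxiv.org/abs/2508.06565
Authors: Jing Zhang,Xiaowei Yu,Minheng Chen,Lu Zhang,Tong Chen,Yan Zhuang,Chao Cao,Yanjun Lyu,Li Su,Tianming Liu,Dajiang Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: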
Abstract:Integrating brain imaging data with clinical reports offers a valuable opportunity to leverage complementary multimodal information for more effective and timely diagnosis in practical clinical settings. This approach has gained significant attention in brain disorder research, yet a key challenge remains: how to effectively link objective imaging data with subjective text-based reports, such as doctors’ notes. In this work, we propose a novel framework that aligns brain connectomes with clinical reports in a shared cross-modal latent space at both the subject and connectome levels, thereby enhancing representation learning. The key innovation of our approach is that we treat brain subnetworks as tokens of imaging data, rather than raw image patches, to align with word tokens in clinical reports. This enables a more efficient identification of system-level associations between neuroimaging findings and clinical observations, which is critical since brain disorders often manifest as network-level abnormalities rather than isolated regional alterations. We applied our method to mild cognitive impairment (MCI) using the ADNI dataset. Our approach not only achieves state-of-the-art predictive performance but also identifies clinically meaningful connectome-text pairs, offering new insights into the early mechanisms of Alzheimer’s disease and supporting the development of clinically useful multimodal biomarkers.
[CV-231] Grounding Emotion Recognition with Visual Prototypes: VEGA – Revisiting CLIP in MERC ACM-MM
Quick read: This paper addresses Multimodal Emotion Recognition in Conversations, namely how to fuse textual, acoustic, and visual signals effectively while introducing psychologically meaningful priors to guide multimodal alignment. The key to its solution is a novel Visual Emotion Guided Anchoring (VEGA) mechanism, which uses the CLIP image encoder to build emotion-specific visual anchors from facial exemplars. These anchors guide unimodal and multimodal features toward a perceptually grounded and psychologically consistent representation space. The design draws on cognitive theories of prototypical emotion categories and multisensory integration, and a stochastic anchor sampling strategy improves robustness.
Link: https://arxiv.org/abs/2508.06564
Authors: Guanyu Hu,Dimitrios Kollias,Xinyu Yang
Affiliations: Xi’an Jiaotong University; Queen Mary University of London; Center for Multimodal AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted for publication at ACM Multimedia (ACM MM) 2025
Abstract:Multimodal Emotion Recognition in Conversations remains a challenging task due to the complex interplay of textual, acoustic and visual signals. While recent models have improved performance via advanced fusion strategies, they often lack psychologically meaningful priors to guide multimodal alignment. In this paper, we revisit the use of CLIP and propose a novel Visual Emotion Guided Anchoring (VEGA) mechanism that introduces class-level visual semantics into the fusion and classification process. Distinct from prior work that primarily utilizes CLIP’s textual encoder, our approach leverages its image encoder to construct emotion-specific visual anchors based on facial exemplars. These anchors guide unimodal and multimodal features toward a perceptually grounded and psychologically aligned representation space, drawing inspiration from cognitive theories (prototypical emotion categories and multisensory integration). A stochastic anchor sampling strategy further enhances robustness by balancing semantic stability and intra-class diversity. Integrated into a dual-branch architecture with self-distillation, our VEGA-augmented model achieves state-of-the-art performance on IEMOCAP and MELD. Code is available at: this https URL.
[CV-232] On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications
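Quick read: This paper addresses the performance drop that multimodal deep learning models suffer in clinical practice when some modalities are missing, that is, how to train a robust and trustworthy unimodal vision model when certain modalities (such as text or structured data) are unavailable at inference time. The key to its solution is multimodal privileged knowledge distillation (MMPKD), a strategy that uses the additional modalities (such as text or tabular metadata) only during training, as teacher models that guide a unimodal vision student model (a vision transformer), thereby improving the student's zero-shot ability to localize regions of interest (ROI).
Link: https://arxiv.org/abs/2508.06558
Authors: Simon Baur,Alexandra Benova,Emilio Dolgener Cantú,Jackie Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: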
Abstract:Deploying deep learning models in clinical practice often requires leveraging multiple data modalities, such as images, text, and structured data, to achieve robust and trustworthy decisions. However, not all modalities are always available at inference time. In this work, we propose multimodal privileged knowledge distillation (MMPKD), a training strategy that utilizes additional modalities available solely during training to guide a unimodal vision model. Specifically, we used a text-based teacher model for chest radiographs (MIMIC-CXR) and a tabular metadata-based teacher model for mammography (CBIS-DDSM) to distill knowledge into a vision transformer student model. We show that MMPKD can improve the zero-shot ability of the resulting attention maps to localize ROI in input images, although this effect does not generalize across domains, contrary to what prior research suggests.
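The abstract does not state which distillation objective MMPKD uses. As a generic illustration of teacher-to-student distillation, the sketch below implements the classic softened-softmax formulation (temperature-scaled KL divergence plus cross-entropy on the label). The function names and the `alpha` and `temperature` defaults are assumptions, and this is a swapped-in textbook loss rather than the authors' method.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Student agrees with teacher: only the cross-entropy term remains.
loss_same = distillation_loss([0.0, 0.0], [0.0, 0.0], label=0)
# Teacher is confident, student is not: the KL term adds a penalty.
loss_diff = distillation_loss([0.0, 0.0], [5.0, -5.0], label=0)
```

In a privileged-information setting the teacher logits would come from a model that sees the extra modality, while the student sees only the image; at inference time only the student is used.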
[CV-233] From Label Error Detection to Correction: A Modular Framework and Benchmark for Object Detection Datasets
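Quick read: This paper addresses the systematic, large-scale correction of label errors in object detection datasets. Such errors include missing annotations, misclassification, and inaccurate localization, and they undermine the reliability of training and evaluation. The key to its solution is REC✓D (Rechecked), a semi-automated label-correction framework that takes candidate errors proposed by existing detectors and pairs them with lightweight crowd-sourced microtasks, in which multiple annotators independently verify each candidate bounding box; their responses are then aggregated to estimate ambiguity and improve label quality. Applied to the pedestrian class of the KITTI dataset, the review indicates that at least 24% of the original annotations are missing or inaccurate, and that current error detection methods combined with the framework can recover hundreds of errors in the time a human would need to annotate bounding boxes from scratch.
Link: https://arxiv.org/abs/2508.06556
Authors: Sarina Penquitt,Jonathan Klees,Rinor Cakaj,Daniel Kondermann,Matthias Rottmann,Lars Schmarje
Affiliations: University of Wuppertal; Quality Match; Osnabrück University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: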
Abstract:Object detection has advanced rapidly in recent years, driven by increasingly large and diverse datasets. However, label errors, defined as missing labels, incorrect classification or inaccurate localization, often compromise the quality of these datasets. This can have a significant impact on the outcomes of training and benchmark evaluations. Although several methods now exist for detecting label errors in object detection datasets, they are typically validated only on synthetic benchmarks or limited manual inspection. How to correct such errors systematically and at scale therefore remains an open problem. We introduce a semi-automated framework for label-error correction called REC✓D (Rechecked). Building on existing detectors, the framework pairs their error proposals with lightweight, crowd-sourced microtasks. These tasks enable multiple annotators to independently verify each candidate bounding box, and their responses are aggregated to estimate ambiguity and improve label quality. To demonstrate the effectiveness of REC✓D, we apply it to the class pedestrian in the KITTI dataset. Our crowdsourced review yields high-quality corrected annotations, which indicate a rate of at least 24% of missing and inaccurate annotations in the original annotations. This validated set will be released as a new real-world benchmark for label error detection and correction. We show that current label error detection methods, when combined with our correction framework, can recover hundreds of errors in the time it would take a human to annotate bounding boxes from scratch. However, even the best methods still miss up to 66% of the true errors and, with low-quality labels, introduce more errors than they find. This highlights the urgent need for further research, now enabled by our released benchmark.
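The aggregation step, in which several annotators verify the same candidate box and their answers are combined into a decision plus an ambiguity estimate, might look roughly like the sketch below. The abstract does not specify the aggregation rule, so majority voting with Shannon entropy as the ambiguity score is an assumption, as are the function name and the answer strings.

```python
import math
from collections import Counter


def aggregate_votes(votes):
    """Return (majority_answer, agreement, entropy_bits) for one candidate box."""
    counts = Counter(votes)
    # Ties are broken alphabetically here; a real pipeline might instead
    # send tied boxes back for additional annotation.
    majority, top = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    n = len(votes)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return majority, top / n, entropy


# Hypothetical answers from four annotators for one candidate bounding box.
answer, agreement, ambiguity = aggregate_votes(
    ["pedestrian", "pedestrian", "pedestrian", "not_pedestrian"]
)
```

Boxes with high entropy can be flagged as ambiguous rather than forced into a hard label, which matches the abstract's emphasis on estimating ambiguity.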
[CV-234] StyleTailor: Towards Personalized Fashion Styling via Hierarchical Negative Feedback
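Quick read: This paper addresses the lack of a systematic, collaborative agent framework for personalized fashion styling, with the aim of improving the shopping experience. The key to its solution is StyleTailor, the first collaborative agent framework to unify personalized apparel design, shopping recommendation, virtual try-on, and systematic evaluation in one workflow. Through an iterative visual refinement paradigm driven by multi-level negative feedback, it adapts precisely to user preferences. Its two core agents, Designer and Consultant, handle personalized garment selection and virtual try-on respectively, and their outputs are refined continuously by hierarchical vision-language model feedback (covering individual items, complete outfits, and try-on results), forming a closed improvement loop.
Link: https://arxiv.org/abs/2508.06555
Authors: Hongbo Ma,Fei Shen,Hongbin Xu,Xiaoce Wang,Gang Xu,Jinkai Zheng,Liangqiong Qu,Ming Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: 24 pages, 5 figures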
Abstract:The advancement of intelligent agents has revolutionized problem-solving across diverse domains, yet solutions for personalized fashion styling, which holds immense promise for improving shopping experiences, remain underexplored. In this work, we present StyleTailor, the first collaborative agent framework that seamlessly unifies personalized apparel design, shopping recommendation, virtual try-on, and systematic evaluation into a cohesive workflow. To this end, StyleTailor pioneers an iterative visual refinement paradigm driven by multi-level negative feedback, enabling adaptive and precise user alignment. Specifically, our framework features two core agents, i.e., Designer for personalized garment selection and Consultant for virtual try-on, whose outputs are progressively refined via hierarchical vision-language model feedback spanning individual items, complete outfits, and try-on efficacy. Counterexamples are aggregated into negative prompts, forming a closed-loop mechanism that enhances recommendation quality. To assess the performance, we introduce a comprehensive evaluation suite encompassing style consistency, visual quality, face similarity, and artistic appraisal. Extensive experiments demonstrate StyleTailor’s superior performance in delivering personalized designs and recommendations, outperforming strong baselines without negative feedback and establishing a new benchmark for intelligent fashion systems.
[CV-235] Static and Plugged: Make Embodied Evaluation Simple
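Quick read: This paper addresses the high cost, fragmentation, and poor scalability of current embodied intelligence evaluation, which depends on interactive simulated environments or real-world setups. The key to its solution is StaticEmbodiedBench, a plug-and-play benchmark that enables unified evaluation using static scene representations. It covers 42 diverse scenarios and 8 core dimensions, supports scalable and comprehensive assessment through a simple interface, and establishes the first unified static leaderboard for embodied intelligence, improving evaluation efficiency and standardization.
Link: https://arxiv.org/abs/2508.06553
Authors: Jiahao Xiao,Jianbo Zhang,BoWen Yan,Shengyu Guo,Tongrui Ye,Kaiwei Zhang,Zicheng Zhang,Xiaohong Liu,Zhengxue Cheng,Lei Fan,Chuyi Li,Guangtao Zhai
Affiliations: 1. University of Science and Technology of China; 2. Alibaba Group; 3. Tsinghua University; 4. Peking University; 5. Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: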
Abstract:Embodied intelligence is advancing rapidly, driving the need for efficient evaluation. Current benchmarks typically rely on interactive simulated environments or real-world setups, which are costly, fragmented, and hard to scale. To address this, we introduce StaticEmbodiedBench, a plug-and-play benchmark that enables unified evaluation using static scene representations. Covering 42 diverse scenarios and 8 core dimensions, it supports scalable and comprehensive assessment through a simple interface. Furthermore, we evaluate 19 Vision-Language Models (VLMs) and 11 Vision-Language-Action models (VLAs), establishing the first unified static leaderboard for embodied intelligence. We also release a subset of 200 samples from our benchmark to accelerate the development of embodied intelligence.
[CV-236] Age-Diverse Deepfake Dataset: Bridging the Age Gap in Deepfake Detection
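Quick read: This paper addresses age-specific bias in deepfake detection caused by uneven age distributions in datasets, with the goal of fairer model performance across age groups. The key to its solution is an age-diverse deepfake dataset built with a modular pipeline that integrates the existing datasets Celeb-DF, FaceForensics++, and UTKFace and uses synthetic data to fill gaps in the age distribution. Experiments show that XceptionNet, EfficientNet, and LipForensics models trained on this dataset perform more fairly across age groups and generalize better, as measured by AUC, pAUC, and EER, providing a reproducible dataset and model pipeline for future fairness-oriented deepfake detection research.
Link: https://arxiv.org/abs/2508.06552
Authors: Unisha Joshi
Affiliations: Grand Canyon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures, and 7 tables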
Abstract:The challenges associated with deepfake detection are increasing significantly with the latest advancements in technology and the growing popularity of deepfake videos and images. Despite the presence of numerous detection models, demographic bias in deepfake datasets remains largely unaddressed. This paper focuses on mitigating age-specific bias by introducing an age-diverse deepfake dataset that improves fairness across age groups. The dataset is constructed through a modular pipeline that incorporates the existing datasets Celeb-DF, FaceForensics++, and UTKFace and creates synthetic data to fill the age distribution gaps. The effectiveness and generalizability of this dataset are evaluated using three deepfake detection models: XceptionNet, EfficientNet, and LipForensics. Evaluation metrics, including AUC, pAUC, and EER, revealed that models trained on the age-diverse dataset demonstrated fairer performance across age groups, improved overall accuracy, and higher generalization across datasets. This study contributes a reproducible, fairness-aware deepfake dataset and model pipeline that can serve as a foundation for future research in fairer deepfake detection. The complete dataset and implementation code are available at this https URL.
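Of the three metrics named above, EER (equal error rate) is the least self-explanatory: it is the operating point at which the false positive rate equals the false negative rate. The sketch below approximates it by scanning thresholds at the observed scores. The function name, the label convention (1 = fake), and the toy scores are assumptions, and per-age-group fairness would be assessed by running it separately on each group's samples.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by scanning thresholds at each observed score.

    labels: 1 = fake (positive), 0 = real (negative).
    A sample is predicted fake when its score is >= the threshold.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = None, None
    for t in sorted(set(scores)):
        fpr = sum(1 for s in neg if s >= t) / len(neg)  # real flagged as fake
        fnr = sum(1 for s in pos if s < t) / len(pos)   # fake missed
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


# Hypothetical detector scores for two real and two fake samples.
eer_overlap = equal_error_rate([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
eer_perfect = equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
```

A lower EER is better; comparing EER across age groups gives one simple view of whether a detector is equally reliable for each group.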
[CV-237] Slice or the Whole Pie? Utility Control for AI Models
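Quick read: This paper addresses the wasted resources and maintenance burden that arise when deep neural networks (DNNs) must serve diverse deployment scenarios with different performance requirements. Conventional approaches train several separate models for different users or tasks, which is computationally expensive and hard to manage. The key to its solution is NNObfuscator, a novel utility control mechanism that lets a single pretrained model adjust its performance level dynamically according to predefined conditions, providing different service tiers on demand. Model owners can set up tiered access, for example baseline performance for free users and higher performance for paying users, which improves resource utilization, reduces redundant computation, and supports sustainable business models for AI deployment.
Link: https://arxiv.org/abs/2508.06551
Authors: Ye Tao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: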
Abstract:Training deep neural networks (DNNs) has become an increasingly resource-intensive task, requiring large volumes of labeled data, substantial computational power, and considerable fine-tuning efforts to achieve optimal performance across diverse use cases. Although pre-trained models offer a useful starting point, adapting them to meet specific user needs often demands extensive customization and infrastructure overhead. This challenge grows when a single model must support diverse applications with differing performance requirements. Traditional solutions often involve training multiple model versions to meet varying requirements, which can be inefficient and difficult to maintain. To overcome this challenge, we propose NNObfuscator, a novel utility control mechanism that enables AI models to dynamically modify their performance according to predefined conditions. Unlike traditional methods that need separate models for each user, NNObfuscator allows a single model to be adapted in real time, giving controlled access to multiple levels of performance. This mechanism enables model owners to set up tiered access, ensuring that free-tier users receive a baseline level of performance while premium users benefit from enhanced capabilities. The approach improves resource allocation, reduces unnecessary computation, and supports sustainable business models in AI deployment. To validate our approach, we conducted experiments on multiple tasks, including image classification, semantic segmentation, and text-to-image generation, using well-established models such as ResNet, DeepLab, VGG16, FCN and Stable Diffusion. Experimental results show that NNObfuscator successfully makes the model more adaptable, so that a single trained model can handle a broad range of tasks without requiring extensive changes.
[CV-238] Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images ICCV25
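Quick read: This paper addresses how to estimate a 3D semantic scene graph accurately from multi-view RGB images alone, without 3D ground-truth annotations. The core challenges are that pseudo point-cloud geometry reconstructed from predicted depth maps is noisy and that multi-view image features contain substantial background noise, which makes node and edge features unreliable. The key to its solution is to use semantic masks to guide feature aggregation and filter out background noise, to incorporate neighboring-node information to strengthen node and edge features, and to apply explicit statistical priors computed from the training set to refine predictions for each node and its one-hop neighborhood.
Link: https://arxiv.org/abs/2508.06546
Authors: Qi Xun Yeo,Yanyan Li,Gim Hee Lee
Affiliations: National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: This paper has been accepted in ICCV 25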
Abstract:Modern 3D semantic scene graph estimation methods utilize ground truth 3D annotations to accurately predict target objects, predicates, and relationships. In the absence of given 3D ground truth representations, we explore leveraging only multi-view RGB images to tackle this task. To attain robust features for accurate scene graph estimation, we must overcome the noisy reconstructed pseudo point-based geometry from predicted depth maps and reduce the amount of background noise present in multi-view image features. The key is to enrich node and edge features with accurate semantic and spatial information and through neighboring relations. We obtain semantic masks to guide feature aggregation to filter background features and design a novel method to incorporate neighboring node information to improve the robustness of our scene graph estimates. Furthermore, we leverage explicit statistical priors calculated from the training summary statistics to refine node and edge predictions based on their one-hop neighborhood. Our experiments show that our method outperforms current methods purely using multi-view images as the initial input. Our project page is available at this https URL.
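One simple way to picture prior-based refinement over a one-hop neighborhood is to reweight a node's predicted class probabilities by how often each class co-occurs with its neighbors' labels in the training data. The abstract does not describe the actual rescoring rule, so the product-of-priors form, the function name, the fallback constant, and the toy classes below are all assumptions.

```python
def rescore_with_prior(probs, neighbor_labels, cooccurrence, weight=1.0):
    """Reweight class probabilities by neighbor co-occurrence priors.

    probs: dict class -> predicted probability for one node.
    neighbor_labels: predicted labels of the node's one-hop neighbors.
    cooccurrence: dict (class, neighbor_class) -> prior from training counts.
    """
    rescored = {}
    for cls, p in probs.items():
        prior = 1.0
        for nb in neighbor_labels:
            # Unseen pairs get a small floor value instead of zero.
            prior *= cooccurrence.get((cls, nb), 1e-6) ** weight
        rescored[cls] = p * prior
    total = sum(rescored.values())
    return {cls: v / total for cls, v in rescored.items()}


# Hypothetical case: the network is undecided, but the node sits next to a table.
refined = rescore_with_prior(
    {"chair": 0.5, "toilet": 0.5},
    ["table"],
    {("chair", "table"): 0.8, ("toilet", "table"): 0.2},
)
```

Here the prior breaks the tie in favor of "chair", since chairs co-occur with tables far more often in the assumed training statistics.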
[CV-239] MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing
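Quick read: This paper addresses human erasing in complex multi-subject (multi-IP) scenes, particularly under human-human occlusion, human-object entanglement, and background interference. Existing methods perform poorly with dense occlusions, camouflaged backgrounds, and diverse interactions, mainly because datasets offer limited coverage and foreground instances lack spatial decoupling. The key to its solution is Multi-Layer Diffusion (MILD), a strategy that decomposes generation into semantically separated pathways (one per instance, plus the background). It adds Human Morphology Guidance (integrating pose, parsing, and spatial relations) to strengthen human-centric understanding and Spatially-Modulated Attention to better guide attention flow, yielding more precise multi-subject erasing and background restoration.
Link: https://arxiv.org/abs/2508.06543
Authors: Jinghan Yu,Zhiyuan Ma,Yue Ma,Kaiqi Liu,Yuhan Wang,Jianjun Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: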
Abstract:Recent years have witnessed the success of diffusion models in image-customized tasks. Prior works have achieved notable progress on human-oriented erasing using explicit mask guidance and semantic-aware inpainting. However, they struggle under complex multi-IP scenarios involving human-human occlusions, human-object entanglements, and background interferences. These challenges are mainly due to: 1) Dataset limitations, as existing datasets rarely cover dense occlusions, camouflaged backgrounds, and diverse interactions; 2) Lack of spatial decoupling, where foreground instances cannot be effectively disentangled, limiting clean background restoration. In this work, we introduce a high-quality multi-IP human erasing dataset with diverse pose variations and complex backgrounds. We then propose Multi-Layer Diffusion (MILD), a novel strategy that decomposes generation into semantically separated pathways for each instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, integrating pose, parsing, and spatial relations. We further present Spatially-Modulated Attention to better guide attention flow. Extensive experiments show that MILD outperforms state-of-the-art methods on challenging human erasing benchmarks.
[CV-240] Benchmarking Deep Learning-Based Object Detection Models on Feature Deficient Astrophotography Imagery Dataset
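Quick read: This paper addresses the performance degradation of current object detection models in non-commercial domains such as astronomical imagery, which typically exhibit signal sparsity that mainstream training datasets (ImageNet, COCO, and PASCAL VOC), focused on everyday objects, lack. The key to its solution is MobilTelesco, a smartphone-based astrophotography dataset of sparse night-sky images that provides a new benchmark for evaluating and improving detector robustness under feature-deficient conditions.
Link: https://arxiv.org/abs/2508.06537
Authors: Shantanusinh Parmar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments: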
Abstract:Object detection models are typically trained on datasets like ImageNet, COCO, and PASCAL VOC, which focus on everyday objects. However, these lack signal sparsity found in non-commercial domains. MobilTelesco, a smartphone-based astrophotography dataset, addresses this by providing sparse night-sky images. We benchmark several detection models on it, highlighting challenges under feature-deficient conditions.
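[CV-241] Transfer Learning with EfficientNet for Accurate Leukemia Cell Classification
【Quick read】: This paper addresses accurate classification of Acute Lymphoblastic Leukemia (ALL) from peripheral blood smear images, in support of early diagnosis and treatment planning. The key to the solution is combining data augmentation with transfer learning from pretrained convolutional neural networks (CNNs), in particular EfficientNet-B3. After augmenting the class-imbalanced dataset (3,631 hematologic and 7,644 ALL images) into a balanced training set of 10,000 images per class, the model reaches an F1-score of 94.30%, an accuracy of 92.02%, and an AUC of 94.79%, outperforming methods previously reported in the C-NMC Challenge.
Link: https://arxiv.org/abs/2508.06535
Authors: Faisal Ahmed
Affiliations: Embry-Riddle Aeronautical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 1 figure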
Abstract:Accurate classification of Acute Lymphoblastic Leukemia (ALL) from peripheral blood smear images is essential for early diagnosis and effective treatment planning. This study investigates the use of transfer learning with pretrained convolutional neural networks (CNNs) to improve diagnostic performance. To address the class imbalance in the dataset of 3,631 Hematologic and 7,644 ALL images, we applied extensive data augmentation techniques to create a balanced training set of 10,000 images per class. We evaluated several models, including ResNet50, ResNet101, and EfficientNet variants B0, B1, and B3. EfficientNet-B3 achieved the best results, with an F1-score of 94.30%, accuracy of 92.02%, and AUC of 94.79%, outperforming previously reported methods in the C-NMC Challenge. These findings demonstrate the effectiveness of combining data augmentation with advanced transfer learning models, particularly EfficientNet-B3, in developing accurate and robust diagnostic tools for hematologic malignancy detection.
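To make the reported metrics concrete, the following minimal sketch shows how accuracy, precision, recall, and F1 follow from binary confusion-matrix counts. The function name and the counts are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (positive class = ALL).
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=45)
```

When the positive class is the majority, as in these hypothetical counts, F1 can come out higher than accuracy.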
[CV-242] What Makes “Good” Distractors for Object Hallucination Evaluation in Large Vision-Language Models?
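【Quick read】: This paper addresses object hallucination in Large Vision-Language Models (LVLMs), where a model generates objects that are inconsistent with the image content. The widely used Polling-based Object Probing Evaluation (POPE) samples negative categories as distractors from category-level statistics (such as category frequency and co-occurrence). This simple sampling strategy ignores image-specific information and is limited to non-existent objects, so its effectiveness declines as LVLMs improve. The paper proposes the Hallucination searching-based Object Probing Evaluation (HOPE) benchmark, with two core ideas: first, it uses Contrastive Language-Image Pre-Training (CLIP) to approximate the predictive behavior of LVLMs and selects the most misleading negative objects for each image as distractors; second, it builds highly misleading distractors by pairing true objects with false descriptions, exposing hallucination vulnerabilities more fully. Experiments show that HOPE lowers the precision of a range of state-of-the-art LVLMs by at least 9% and up to 23%, exposing hallucination far more effectively than POPE.
Link: https://arxiv.org/abs/2508.06530
Authors: Ming-Kun Xie,Jia-Hao Xiao,Gang Niu,Lei Feng,Zhiqiang Kou,Min-Ling Zhang,Masashi Sugiyama
Affiliations: Center for Advanced Intelligence Project, RIKEN; Southeast University; The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: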
Abstract:Large Vision-Language Models (LVLMs), empowered by the success of Large Language Models (LLMs), have achieved impressive performance across domains. Despite the great advances in LVLMs, they still suffer from the unavailable object hallucination issue, which tends to generate objects inconsistent with the image content. The most commonly used Polling-based Object Probing Evaluation (POPE) benchmark evaluates this issue by sampling negative categories according to category-level statistics, e.g., category frequencies and co-occurrence. However, with the continuous advancement of LVLMs, the POPE benchmark has shown diminishing effectiveness in assessing object hallucination, as it employs a simplistic sampling strategy that overlooks image-specific information and restricts distractors to negative object categories only. In this paper, we introduce the Hallucination searching-based Object Probing Evaluation (HOPE) benchmark, aiming to generate the most misleading distractors (i.e., non-existent objects or incorrect image descriptions) that can trigger hallucination in LVLMs, which serves as a means to more rigorously assess their immunity to hallucination. To explore the image-specific information, the content-aware hallucination searching leverages Contrastive Language-Image Pre-Training (CLIP) to approximate the predictive behavior of LVLMs by selecting negative objects with the highest predicted likelihood as distractors. To expand the scope of hallucination assessment, the description-based hallucination searching constructs highly misleading distractors by pairing true objects with false descriptions. Experimental results show that HOPE leads to a precision drop of at least 9% and up to 23% across various state-of-the-art LVLMs, significantly outperforming POPE in exposing hallucination vulnerabilities. The code is available at this https URL.
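The content-aware search step can be pictured as a top-k selection over objects that are absent from the image. The sketch below assumes per-object image-text similarity scores have already been computed by some CLIP-like model; the scores, object names, and the `select_distractors` helper are illustrative assumptions, not the authors' code.

```python
def select_distractors(scores: dict[str, float], present: set[str], k: int = 2) -> list[str]:
    """Return the k absent objects with the highest image-text similarity.

    scores  : hypothetical similarity of each candidate object to the image
    present : objects that really appear in the image (ground truth)
    """
    negatives = {obj: s for obj, s in scores.items() if obj not in present}
    # Sort by descending score; break ties by name so the result is deterministic.
    ranked = sorted(negatives, key=lambda obj: (-negatives[obj], obj))
    return ranked[:k]


# Hypothetical scores for one image that contains a dog and a bench.
scores = {"dog": 0.31, "frisbee": 0.27, "leash": 0.22, "bench": 0.18, "car": 0.05}
distractors = select_distractors(scores, present={"dog", "bench"}, k=2)
```

Here the absent objects that look most plausible for the image ("frisbee", "leash") are chosen over an implausible one ("car"), which is the intuition behind image-specific distractors.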
[CV-243] RMT-PPAD: Real-time Multi-task Learning for Panoptic Perception in Autonomous Driving
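【Quick read】: This paper addresses the need for both precision and real-time performance in panoptic driving perception for autonomous driving. The core challenges are negative transfer between tasks, segmentation structures that depend on manual design, and lane line segmentation labels that differ between training and testing, which biases evaluation. The key to the solution is RMT-PPAD, a real-time Transformer-based multi-task model. It introduces a lightweight gate control with an adapter that adaptively fuses shared and task-specific features, alleviating negative transfer; an adaptive segmentation decoder that learns weights over multi-scale features automatically, removing the need for hand-designed task-specific structures; and a fix for the training/testing label inconsistency in lane line segmentation, enabling fairer evaluation. Experiments show state-of-the-art results on BDD100K at an inference speed of 32.6 FPS, with stable performance in real-world scenarios.
Link: https://arxiv.org/abs/2508.06529
Authors: Jiayuan Wang,Q. M. Jonathan Wu,Katsuya Suto,Ning Zhang
Affiliations: University of Windsor; Hokkaido University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: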
Abstract:Autonomous driving systems rely on panoptic driving perception that requires both precision and real-time performance. In this work, we propose RMT-PPAD, a real-time, transformer-based multi-task model that jointly performs object detection, drivable area segmentation, and lane line segmentation. We introduce a lightweight module, a gate control with an adapter to adaptively fuse shared and task-specific features, effectively alleviating negative transfer between tasks. Additionally, we design an adaptive segmentation decoder to learn the weights over multi-scale features automatically during the training stage. This avoids the manual design of task-specific structures for different segmentation tasks. We also identify and resolve the inconsistency between training and testing labels in lane line segmentation. This allows fairer evaluation. Experiments on the BDD100K dataset demonstrate that RMT-PPAD achieves state-of-the-art results with mAP50 of 84.9% and Recall of 95.4% for object detection, mIoU of 92.6% for drivable area segmentation, and IoU of 56.8% and accuracy of 84.7% for lane line segmentation. The inference speed reaches 32.6 FPS. Moreover, we introduce real-world scenarios to evaluate RMT-PPAD performance in practice. The results show that RMT-PPAD consistently delivers stable performance. The source codes and pre-trained models are released at this https URL.
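One common way to realize gated fusion is a sigmoid gate that interpolates elementwise between shared and task-specific features. The abstract does not say whether RMT-PPAD's gate control takes this exact form, so the sketch below is an assumption; `gated_fuse` and the gate logits are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(shared: list[float], task_specific: list[float], gate_logits: list[float]) -> list[float]:
    """Elementwise fusion: g * task_specific + (1 - g) * shared, with g = sigmoid(logit)."""
    if not (len(shared) == len(task_specific) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for s, t, logit in zip(shared, task_specific, gate_logits):
        g = sigmoid(logit)  # g near 1 favours the task-specific feature
        fused.append(g * t + (1.0 - g) * s)
    return fused


# Zero logits give an equal blend of the two feature vectors.
fused = gated_fuse([1.0, 2.0], [3.0, 6.0], [0.0, 0.0])
```

In a trained network the gate logits would be produced by learned layers, so each task can decide per feature how much shared information to keep.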
[CV-244] A Framework Combining 3D CNN and Transformer for Video-Based Behavior Recognition
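【Quick read】: This paper addresses two limitations in video-based behavior recognition: conventional 3D convolutional neural networks (3D CNNs) struggle to model long-range temporal dependencies, while pure Transformer architectures have high computational complexity. The key to the solution is a hybrid framework in which a 3D CNN module extracts low-level spatiotemporal features, a Transformer module captures long-range temporal dependencies, and a fusion mechanism integrates both representations, yielding higher recognition accuracy at manageable computational complexity.
Link: https://arxiv.org/abs/2508.06528
Authors: Xiuliang Zhang,Tadiwa Elisha Nyamasvisva,Chuntao Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures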
Abstract:Video-based behavior recognition is essential in fields such as public safety, intelligent surveillance, and human-computer interaction. Traditional 3D Convolutional Neural Networks (3D CNNs) effectively capture local spatiotemporal features but struggle with modeling long-range dependencies. Conversely, Transformers excel at learning global contextual information but face challenges with high computational costs. To address these limitations, we propose a hybrid framework combining 3D CNN and Transformer architectures. The 3D CNN module extracts low-level spatiotemporal features, while the Transformer module captures long-range temporal dependencies, with a fusion mechanism integrating both representations. Evaluated on benchmark datasets, the proposed model outperforms traditional 3D CNN and standalone Transformers, achieving higher recognition accuracy with manageable complexity. Ablation studies further validate the complementary strengths of the two modules. This hybrid framework offers an effective and scalable solution for video-based behavior recognition.
[CV-245] Large Language Models Facilitate Vision Reflection in Image Classification
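【Quick read】: This paper studies the explainability of “vision reflection” in large multimodal models (LMMs) and how it can improve accuracy and transparency in visual recognition. The main findings are: first, prompting an LMM to verify the prediction of a specialized vision model improves recognition accuracy, even on benchmarks such as ImageNet; second, the vision-language connector maps visual features into explicit textual concepts, letting the language model judge the plausibility of a prediction using commonsense knowledge; third, replacing a large number of vision tokens with only a few text tokens still yields similar answers, suggesting that LMMs may rely mainly on compact textual representations rather than raw vision features, and a training-free connector can improve fine-grained recognition. Together these findings clarify the explainability of vision reflection and point to it as an efficient, robust, and interpretable recognition strategy.
Link: https://arxiv.org/abs/2508.06525
Authors: Guoyuan An,JaeYoon Kim,SungEui Yoon
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: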
Abstract:This paper presents several novel findings on the explainability of vision reflection in large multimodal models (LMMs). First, we show that prompting an LMM to verify the prediction of a specialized vision model can improve recognition accuracy, even on benchmarks like ImageNet, despite prior evidence that LMMs typically underperform dedicated vision encoders. Second, we analyze the internal behavior of vision reflection and find that the vision-language connector maps visual features into explicit textual concepts, allowing the language model to reason about prediction plausibility using commonsense knowledge. We further observe that replacing a large number of vision tokens with only a few text tokens still enables LLaVA to generate similar answers, suggesting that LMMs may rely primarily on a compact set of distilled textual representations rather than raw vision features. Third, we show that a training-free connector can enhance LMM performance in fine-grained recognition tasks, without extensive feature-alignment training. Together, these findings offer new insights into the explainability of vision-language models and suggest that vision reflection is a promising strategy for achieving robust and interpretable visual recognition.
[CV-246] Frequency Prior Guided Matching: A Data Augmentation Approach for Generalizable Semi-Supervised Polyp Segmentation
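【Quick read】: This paper addresses the sharp performance drop of automated polyp segmentation models for early colorectal cancer diagnosis when annotated data is limited and domain shift is present. Existing semi-supervised learning (SSL) methods rely on generic augmentations that ignore polyp-specific structural properties, so they generalize poorly across devices and imaging centers. The key to the solution is Frequency Prior Guided Matching (FPGM), built on one observation: polyp edges show a highly consistent frequency signature across datasets. FPGM works in two stages: it first learns a domain-invariant frequency prior from the edge regions of labeled polyps, then applies spectral perturbations to unlabeled images, aligning their amplitude spectra with the prior while preserving phase information to keep structures intact. This targeted frequency-domain alignment suppresses domain-specific texture differences and pushes the model to learn more generalizable anatomical structure. On six public datasets FPGM outperforms ten competing methods and shows strong zero-shot generalization in data-scarce settings (over 10% absolute gain in Dice score).
Link: https://arxiv.org/abs/2508.06517
Authors: Haoran Xi,Chen Liu,Xiaolin Li
Affiliations: Tianjin University; Chinese Academy of Sciences; Yale University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 8 figures, 6 tables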
Abstract:Automated polyp segmentation is essential for early diagnosis of colorectal cancer, yet developing robust models remains challenging due to limited annotated data and significant performance degradation under domain shift. Although semi-supervised learning (SSL) reduces annotation requirements, existing methods rely on generic augmentations that ignore polyp-specific structural properties, resulting in poor generalization to new imaging centers and devices. To address this, we introduce Frequency Prior Guided Matching (FPGM), a novel augmentation framework built on a key discovery: polyp edges exhibit a remarkably consistent frequency signature across diverse datasets. FPGM leverages this intrinsic regularity in a two-stage process. It first learns a domain-invariant frequency prior from the edge regions of labeled polyps. Then, it performs principled spectral perturbations on unlabeled images, aligning their amplitude spectra with this learned prior while preserving phase information to maintain structural integrity. This targeted alignment normalizes domain-specific textural variations, thereby compelling the model to learn the underlying, generalizable anatomical structure. Validated on six public datasets, FPGM establishes a new state-of-the-art against ten competing methods. It demonstrates exceptional zero-shot generalization capabilities, achieving over 10% absolute gain in Dice score in data-scarce scenarios. By significantly enhancing cross-domain robustness, FPGM presents a powerful solution for clinically deployable polyp segmentation under limited supervision.
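The amplitude-alignment idea can be illustrated on a toy 1-D signal: move the amplitude spectrum toward a prior while keeping the original phase. This is only a sketch under stated assumptions. FPGM operates on images and learns its prior from polyp edge regions, whereas here the prior is simply the amplitude spectrum of a reference signal, and the `alpha` blending weight and all function names are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (fine for short toy signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n for t in range(n)]


def align_amplitude(signal, prior_amplitude, alpha=1.0):
    """Blend the signal's amplitude spectrum toward a prior, preserving phase.

    alpha = 0 returns the signal unchanged; alpha = 1 replaces the amplitude
    by the prior. The prior should be symmetric (as for any real signal) so
    that the result stays real.
    """
    spectrum = dft(signal)
    if len(prior_amplitude) != len(spectrum):
        raise ValueError("prior must have one amplitude per frequency bin")
    mixed = []
    for coeff, prior in zip(spectrum, prior_amplitude):
        amplitude = (1.0 - alpha) * abs(coeff) + alpha * prior
        mixed.append(cmath.rect(amplitude, cmath.phase(coeff)))  # keep original phase
    return [value.real for value in idft(mixed)]


signal = [1.0, 2.0, 0.5, -1.0]
reference = [0.0, 1.0, 3.0, 1.0]
prior = [abs(c) for c in dft(reference)]  # stand-in for a learned frequency prior
aligned = align_amplitude(signal, prior, alpha=1.0)
```

After alignment the signal carries the reference amplitudes but its own phase, which is the property the abstract relies on to normalize texture while maintaining structure.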
[CV-247] BigTokDetect: A Clinically-Informed Vision-Language Model Framework for Detecting Pro-Bigorexia Videos on TikTok
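【Quick read】: This paper addresses the difficulty social media platforms have in detecting harmful content that promotes muscle dysmorphic behaviors, in particular “pro-bigorexia” content disguised as legitimate fitness material, which disproportionately affects adolescent males. Text-based detection systems miss this content because it spreads through multimodal combinations of visual displays, coded language, and motivational messaging. The key to the solution is the BigTokDetect framework together with BigTok, the first multimodal dataset annotated by clinical psychologists and psychiatrists (over 2,200 TikTok videos) across five primary categories: body image, nutrition, exercise, supplements, and masculinity. With domain-specific finetuning of vision-language models, the authors report 0.829% accuracy on primary category classification and 0.690% on subcategory detection. Ablations show that multimodal fusion improves performance by 5-10% over text-only approaches, with video features providing the most discriminative signal.
Link: https://arxiv.org/abs/2508.06515
Authors: Minh Duc Chu,Kshitij Pawar,Zihao He,Roxanna Sharifi,Ross Sonnenblick,Magdalayna Curry,Laura D’Adamo,Lindsay Young,Stuart B Murray,Kristina Lerman
Affiliations: USC Information Sciences Institute; Keck School of Medicine, USC; Department of Clinical Psychology, Drexel University; Annenberg School for Communication and Journalism, USC; Department of Psychiatry and Biobehavioral Sciences, UCLA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: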
Abstract:Social media platforms increasingly struggle to detect harmful content that promotes muscle dysmorphic behaviors, particularly pro-bigorexia content that disproportionately affects adolescent males. Unlike traditional eating disorder detection focused on the “thin ideal,” pro-bigorexia material masquerades as legitimate fitness content through complex multimodal combinations of visual displays, coded language, and motivational messaging that evade text-based detection systems. We address this challenge by developing BigTokDetect, a clinically-informed detection framework for identifying pro-bigorexia content on TikTok. We introduce BigTok, the first expert-annotated multimodal dataset of over 2,200 TikTok videos labeled by clinical psychologists and psychiatrists across five primary categories spanning body image, nutrition, exercise, supplements, and masculinity. Through a comprehensive evaluation of state-of-the-art vision language models, we achieve 0.829% accuracy on primary category classification and 0.690% on subcategory detection via domain-specific finetuning. Our ablation studies demonstrate that multimodal fusion improves performance by 5-10% over text-only approaches, with video features providing the most discriminative signals. These findings establish new benchmarks for multimodal harmful content detection and provide both the computational tools and methodological framework needed for scalable content moderation in specialized mental health domains.
[CV-248] DiTalker: A Unified DiT-based Framework for High-Quality and Speaking Styles Controllable Portrait Animation
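【Quick read】: This paper addresses two shortcomings of existing diffusion-based portrait animation methods: weak control over dynamic speaking styles (such as head poses and movements), and the extra computational overhead of dual U-Net architectures. The core solution is DiTalker, a unified DiT (Diffusion Transformer) framework. Its Style-Emotion Encoding Module uses two separate branches: a style branch that extracts identity-specific style information (head poses and movements) and an emotion branch that extracts identity-agnostic emotion features. An Audio-Style Fusion Module then decouples audio and speaking-style features through two parallel cross-attention layers and uses them to guide animation, improving lip synchronization and speaking-style controllability while preserving identity consistency.
Link: https://arxiv.org/abs/2508.06511
Authors: He Feng,Yongjia Ma,Donglin Di,Lei Fan,Tonghua Su,Xiangqian Wu
Affiliations: Harbin Institute of Technology; Li Auto; University of New South Wales
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: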
Abstract:Portrait animation aims to synthesize talking videos from a static reference face, conditioned on audio and style frame cues (e.g., emotion and head poses), while ensuring precise lip synchronization and faithful reproduction of speaking styles. Existing diffusion-based portrait animation methods primarily focus on lip synchronization or static emotion transformation, often overlooking dynamic styles such as head movements. Moreover, most of these methods rely on a dual U-Net architecture, which preserves identity consistency but incurs additional computational overhead. To this end, we propose DiTalker, a unified DiT-based framework for speaking style-controllable portrait animation. We design a Style-Emotion Encoding Module that employs two separate branches: a style branch extracting identity-specific style information (e.g., head poses and movements), and an emotion branch extracting identity-agnostic emotion features. We further introduce an Audio-Style Fusion Module that decouples audio and speaking styles via two parallel cross-attention layers, using these features to guide the animation process. To enhance the quality of results, we adopt and modify two optimization constraints: one to improve lip synchronization and the other to preserve fine-grained identity and background details. Extensive experiments demonstrate the superiority of DiTalker in terms of lip synchronization and speaking style controllability. Project Page: this https URL
[CV-249] Med-GRIM: Enhanced Zero-Shot Medical VQA using prompt-embedded Multimodal Graph RAG
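【Quick read】: This paper addresses the lack of fine-grained answer precision in existing Visual Question Answering (VQA) models on complex, domain-specific tasks such as medical VQA. The key to the solution is BIND (BLIVA Integrated with Dense Encoding), which refines the joint embedding space through dense query-token encodings inspired by contrastive pretraining, and Med-GRIM, a model built on it that combines graph-based retrieval with prompt engineering to inject domain knowledge. Med-GRIM uses small language models (SLMs) in a low-compute, modular workflow, achieving large-language-model-level performance at a fraction of the computational cost.
Link: https://arxiv.org/abs/2508.06496
Authors: Rakesh Raj Madavan,Akshat Kaimal,Hashim Faisal,Chandrakala S
Affiliations: Shiv Nadar University Chennai
Categories: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: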
Abstract:An ensemble of trained multimodal encoders and vision-language models (VLMs) has become a standard approach for visual question answering (VQA) tasks. However, such models often fail to produce responses with the detailed precision necessary for complex, domain-specific applications such as medical VQA. Our representation model, BIND: BLIVA Integrated with Dense Encoding, extends prior multimodal work by refining the joint embedding space through dense, query-token-based encodings inspired by contrastive pretraining techniques. This refined encoder powers Med-GRIM, a model designed for medical VQA tasks that leverages graph-based retrieval and prompt engineering to integrate domain-specific knowledge. Rather than relying on compute-heavy fine-tuning of vision and language models on specific datasets, Med-GRIM applies a low-compute, modular workflow with small language models (SLMs) for efficiency. Med-GRIM employs prompt-based retrieval to dynamically inject relevant knowledge, ensuring both accuracy and robustness in its responses. By assigning distinct roles to each agent within the VQA system, Med-GRIM achieves large language model performance at a fraction of the computational cost. Additionally, to support scalable research in zero-shot multimodal medical applications, we introduce DermaGraph, a novel Graph-RAG dataset comprising diverse dermatological conditions. This dataset facilitates both multimodal and unimodal querying. The code and dataset are available at: this https URL
[CV-250] Codebook-enabled Generative End-to-end Semantic Communication Powered by Transformer
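【Quick read】: This paper addresses the vulnerability of codebook-based generative semantic communication systems to channel noise. The core difficulty is that the semantic relations among code vectors are not necessarily reflected in the distance between their code indices, which limits robustness. The key to the solution is twofold: first, the semantic codec and a high-quality codebook are constructed jointly, so that semantic information is better structured in the code space; second, a codebook-guided vector-to-index Transformer is designed to suppress the effect of channel noise and improve image reconstruction at the receiver. Experiments show that the method outperforms JPEG+LDPC and traditional joint source channel coding (JSCC) schemes on visual perception.
Link: https://arxiv.org/abs/2402.16868
Authors: Peigen Ye,Yaping Sun,Shumin Yao,Hao Chen,Xiaodong Xu,Shuguang Cui
Affiliations: The Shenzhen Campus of Sun Yat-sen University; Peng Cheng Laboratory; FNii and SSE, The Chinese University of Hong Kong, Shenzhen; Beijing University of Posts and Telecommunications
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE INFOCOM PerAI6G 2024 (accepted)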
Abstract:Codebook-based generative semantic communication attracts increasing attention, since only indices are required to be transmitted when the codebook is shared between transmitter and receiver. However, due to the fact that the semantic relations among code vectors are not necessarily related to the distance of the corresponding code indices, the performance of the codebook-enabled semantic communication system is susceptible to the channel noise. Thus, how to improve the system robustness against the noise requires careful design. This paper proposes a robust codebook-assisted image semantic communication system, where semantic codec and codebook are first jointly constructed, and then vector-to-index transformer is designed guided by the codebook to eliminate the effects of channel noise, and achieve image generation. Thanks to the assistance of the high-quality codebook to the Transformer, the generated images at the receiver outperform those of the compared methods in terms of visual perception. In the end, numerical results and generated images demonstrate the advantages of the generative semantic communication method over JPEG+LDPC and traditional joint source channel coding (JSCC) methods.
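The index-only transmission described above can be sketched with a plain nearest-neighbour codebook lookup. The codebook values, feature vectors, and helper names are hypothetical, and the sketch does not include the paper's vector-to-index Transformer.

```python
def nearest_index(vector, codebook):
    """Index of the code vector closest to `vector` in squared Euclidean distance."""
    def squared_distance(i):
        return sum((v - c) ** 2 for v, c in zip(vector, codebook[i]))
    return min(range(len(codebook)), key=squared_distance)


def encode(features, codebook):
    """Transmitter side: replace each feature vector by a codebook index."""
    return [nearest_index(f, codebook) for f in features]


def decode(indices, codebook):
    """Receiver side: look the indices up in the shared codebook."""
    return [codebook[i] for i in indices]


# Hypothetical 2-D codebook shared by transmitter and receiver.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
features = [[0.1, -0.1], [4.7, 5.2], [1.0, 1.15]]
indices = encode(features, codebook)
reconstructed = decode(indices, codebook)
```

In this toy codebook, entries 1 and 3 are close as vectors but far apart as indices, while index 2 sits between them and points to a very different vector. A noise-corrupted index can therefore decode to semantically unrelated content, which is the robustness problem the abstract describes.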
[CV-251] Learned Regularization for Microwave Tomography
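【Quick read】: This paper addresses the limited reconstruction accuracy in Microwave Tomography (MWT) caused by a highly nonlinear and ill-posed inverse problem. Conventional optimization-based methods struggle to recover fine structures, while existing deep learning methods need large paired datasets and generalize poorly. The key to the solution is a physics-informed hybrid framework, Single-Step Diffusion Regularization (SSD-Reg), which embeds a diffusion model as a prior inside a data-consistency-driven variational iterative reconstruction. It requires no paired training data and combines physical constraints with learned structural distributions, improving accuracy, stability, and robustness while remaining consistent with the physics.
Link: https://arxiv.org/abs/2508.08114
Authors: Bowen Tong,Hao Chen,Shaorui Guo,Dong Liu
Affiliations: University of Science and Technology of China
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: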
Abstract:Microwave Tomography (MWT) aims to reconstruct the dielectric properties of tissues from measured scattered electromagnetic fields. This inverse problem is highly nonlinear and ill-posed, posing significant challenges for conventional optimization-based methods, which, despite being grounded in physical models, often fail to recover fine structural details. Recent deep learning strategies, including end-to-end and post-processing networks, have improved reconstruction quality but typically require large paired training datasets and may struggle to generalize. To overcome these limitations, we propose a physics-informed hybrid framework that integrates diffusion models as learned regularization within a data-consistency-driven variational scheme. Specifically, we introduce Single-Step Diffusion Regularization (SSD-Reg), a novel approach that embeds diffusion priors into the iterative reconstruction process, enabling the recovery of complex anatomical structures without the need for paired data. SSD-Reg maintains fidelity to both the governing physics and learned structural distributions, improving accuracy, stability, and robustness. Extensive experiments demonstrate that SSD-Reg, implemented as a Plug-and-Play (PnP) module, provides a flexible and effective solution for tackling the ill-posedness inherent in functional image reconstruction.
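For orientation, a generic Plug-and-Play iteration alternates a data-consistency gradient step with a learned denoiser that plays the role of the regularizer. The form below is the textbook PnP scheme under assumed notation (forward scattering operator $F$, measured field $y$, denoiser $D_\sigma$, step size $\eta$); the abstract does not give the exact SSD-Reg update, so this should not be read as the authors' algorithm.

```latex
% Generic plug-and-play scheme; all symbols are assumptions for illustration.
\begin{align*}
f(x)    &= \tfrac{1}{2}\,\lVert F(x) - y \rVert_2^2
         && \text{(data-fidelity term)} \\
z^{k}   &= x^{k} - \eta\,\nabla f\!\left(x^{k}\right)
         && \text{(data-consistency step)} \\
x^{k+1} &= D_\sigma\!\left(z^{k}\right)
         && \text{(learned prior applied as a denoiser)}
\end{align*}
```

Because the prior enters only through $D_\sigma$, it can be trained on unpaired images of plausible structures, which matches the abstract's point about not needing paired data.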
[CV-252] MIND: A Noise-Adaptive Denoising Framework for Medical Images Integrating Multi-Scale Transformer
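【Quick read】: This paper addresses non-uniform noise in medical images caused by low-dose scanning, equipment limitations, and imaging artifacts, which degrades structure recognition and lesion detection. The key to the solution is an adaptive denoising model (MI-ND) that combines multi-scale convolution with a Transformer architecture. Its main components are a Noise Level Estimator (NLE) and a Noise Adaptive Attention Block (NAAB), which provide noise-aware channel-spatial attention and cross-modal feature fusion, suppressing noise while preserving structural detail and improving image quality metrics (PSNR, SSIM, LPIPS) as well as downstream diagnostic performance (F1 score and ROC-AUC).
Link: https://arxiv.org/abs/2508.07817
Authors: Tao Tang,Chengxu Yang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 6 pages, 6 figures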
Abstract:The core role of medical images in disease diagnosis makes their quality directly affect the accuracy of clinical judgment. However, due to factors such as low-dose scanning, equipment limitations and imaging artifacts, medical images are often accompanied by non-uniform noise interference, which seriously affects structure recognition and lesion detection. This paper proposes a medical image adaptive denoising model (MI-ND) that integrates multi-scale convolutional and Transformer architecture, introduces a noise level estimator (NLE) and a noise adaptive attention module (NAAB), and realizes channel-spatial attention regulation and cross-modal feature fusion driven by noise perception. Systematic testing is carried out on multimodal public datasets. Experiments show that this method significantly outperforms the comparative methods in image quality indicators such as PSNR, SSIM, and LPIPS, and improves the F1 score and ROC-AUC in downstream diagnostic tasks, showing strong practical value and promotional potential. The model has outstanding benefits in structural recovery, diagnostic sensitivity, and cross-modal robustness, and provides an effective solution for medical image enhancement and AI-assisted diagnosis and treatment.
[CV-253] PCA-Guided Autoencoding for Structured Dimensionality Reduction in Active Infrared Thermography
【Quick read】: This paper addresses the high dimensionality of Active Infrared Thermography (AIRT) data and the fact that the latent spaces learned by existing non-linear autoencoders (AEs) lack structure, which limits their usefulness for defect characterization. The key to the solution is a PCA-guided autoencoding framework with a new PCA distillation loss: the AE still captures intricate non-linear features in thermographic signals, while its latent space is constrained to align with structured PCA components, giving a structured low-dimensional representation.
Link: https://arxiv.org/abs/2508.07773
Authors: Mohammed Salah,Numan Saeed,Davor Svetinovic,Stefano Sfarra,Mohammed Omar,Yusra Abdulrahman
Affiliations: Khalifa University of Science and Technology; Advanced Research and Innovation Center; Mohamed Bin Zayed University of Artificial Intelligence; Vienna University of Economics and Business; University of L’Aquila
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Infrared thermography, Non-Destructive Testing, Principal Component Analysis, PCA-Guided Autoencoder, PCA Distillation Loss, Dimensionality Reduction
Abstract:Active Infrared thermography (AIRT) is a widely adopted non-destructive testing (NDT) technique for detecting subsurface anomalies in industrial components. Due to the high dimensionality of AIRT data, current approaches employ non-linear autoencoders (AEs) for dimensionality reduction. However, the latent space learned by AIRT AEs lacks structure, limiting their effectiveness in downstream defect characterization tasks. To address this limitation, this paper proposes a principal component analysis guided (PCA-guided) autoencoding framework for structured dimensionality reduction to capture intricate, non-linear features in thermographic signals while enforcing a structured latent space. A novel loss function, PCA distillation loss, is introduced to guide AIRT AEs to align the latent representation with structured PCA components while capturing the intricate, non-linear patterns in thermographic signals. To evaluate the utility of the learned, structured latent space, we propose a neural network-based evaluation metric that assesses its suitability for defect characterization. Experimental results show that the proposed PCA-guided AE outperforms state-of-the-art dimensionality reduction methods on PVC, CFRP, and PLA samples in terms of contrast, signal-to-noise ratio (SNR), and neural network-based metrics.
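A plausible shape for such an objective is a reconstruction term plus a penalty that pulls the latent code toward the PCA projection of the same input. The abstract does not give the exact PCA distillation loss, so the weighting `lam`, the use of mean squared error for both terms, and the function names below are assumptions.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pca_guided_loss(x, x_hat, z, z_pca, lam=0.5):
    """Reconstruction error plus a latent-alignment (distillation) penalty.

    x, x_hat : input signal and autoencoder reconstruction
    z, z_pca : autoencoder latent code and PCA projection of the same input
    lam      : hypothetical weight of the alignment term
    """
    return mse(x, x_hat) + lam * mse(z, z_pca)


# Toy values for illustration only.
loss = pca_guided_loss(
    x=[1.0, 2.0, 3.0], x_hat=[1.0, 2.0, 2.0],
    z=[0.5, -0.5], z_pca=[0.0, 0.0],
    lam=0.5,
)
```

With `lam = 0` this reduces to a plain autoencoder loss; larger values trade reconstruction quality for a latent space that follows the PCA components more closely.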
[CV-254] Sea-Undistort: A Dataset for Through-Water Image Restoration in High Resolution Airborne Bathymetric Mapping
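【Quick read】: This paper addresses image-based bathymetric mapping in shallow waters, where the dynamic water surface, water column properties, and solar illumination introduce optical distortions such as wave-induced patterns, scattering, and sun glint. The key to the solution is Sea-Undistort, a synthetic dataset of 1200 paired 512×512 through-water images. Each pair contains a distortion-free and a distorted view with realistic water effects (sun glint, waves, scattering), plus per-image metadata such as camera parameters, sun position, and average depth, enabling supervised training that is hard to achieve in real environments. The authors use the dataset to benchmark two state-of-the-art image restoration methods and propose an enhanced lightweight diffusion-based framework with an early-fusion sun-glint mask. On real aerial imagery it produces more complete seabed Digital Surface Models (DSMs), especially in deeper areas, reduces bathymetric errors, suppresses glint and scattering, and restores fine seabed detail.
Link: https://arxiv.org/abs/2508.07760
Authors: Maximilian Kromer,Panagiotis Agrafiotis,Begüm Demir
Affiliations: Technische Universität Berlin; Berlin Institute for the Foundations of Learning and Data (BIFOLD)
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Under review in IEEE Geoscience and Remote Sensing Letters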
Abstract:Accurate image-based bathymetric mapping in shallow waters remains challenging due to the complex optical distortions such as wave induced patterns, scattering and sunglint, introduced by the dynamic water surface, the water column properties, and solar illumination. In this work, we introduce Sea-Undistort, a comprehensive synthetic dataset of 1200 paired 512x512 through-water scenes rendered in Blender. Each pair comprises a distortion-free and a distorted view, featuring realistic water effects such as sun glint, waves, and scattering over diverse seabeds. Accompanied by per-image metadata such as camera parameters, sun position, and average depth, Sea-Undistort enables supervised training that is otherwise infeasible in real environments. We use Sea-Undistort to benchmark two state-of-the-art image restoration methods alongside an enhanced lightweight diffusion-based framework with an early-fusion sun-glint mask. When applied to real aerial data, the enhanced diffusion model delivers more complete Digital Surface Models (DSMs) of the seabed, especially in deeper areas, reduces bathymetric errors, suppresses glint and scattering, and crisply restores fine seabed details. Dataset, weights, and code are publicly available at this https URL.
[CV-255] KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features
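【Quick read】: This paper addresses the detection of new temporal deepfakes produced by audio-driven talking-head generators and advanced Text-To-Speech (TTS) models. Existing detectors are often computationally expensive and generalize poorly to unseen attacks. The key to the solution is a multimodal detection framework: the visual modality uses handcrafted features for interpretability and adaptability, while the audio modality uses a self-supervised learning (SSL) backbone combined with graph attention networks to capture rich audio representations and improve robustness to unknown manipulation techniques. The approach balances detection performance with practical deployability.
Link: https://arxiv.org/abs/2508.07337
Authors: Ivan Kukanov,Jun Wah Ng
Affiliations: KLASS Engineering and Solutions (Singapore)
Categories: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, accepted to the 33rd ACM International Conference on Multimedia (MM’25)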
Abstract:The rapid development of audio-driven talking head generators and advanced Text-To-Speech (TTS) models has led to more sophisticated temporal deepfakes. These advances highlight the need for robust methods capable of detecting and localizing deepfakes, even under novel, unseen attack scenarios. Current state-of-the-art deepfake detectors, while accurate, are often computationally expensive and struggle to generalize to novel manipulation techniques. To address these challenges, we propose multimodal approaches for the AV-Deepfake1M 2025 challenge. For the visual modality, we leverage handcrafted features to improve interpretability and adaptability. For the audio modality, we adapt a self-supervised learning (SSL) backbone coupled with graph attention networks to capture rich audio representations, improving detection robustness. Our approach strikes a balance between performance and real-world deployment, focusing on resilience and potential interpretability. On the AV-Deepfake1M++ dataset, our multimodal system achieves AUC of 92.78% for deepfake classification task and IoU of 0.3536 for temporal localization using only the audio modality.
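Temporal localization is scored by how well a predicted time segment overlaps the true manipulated segment. The sketch below computes a plain intersection-over-union for two segments; the challenge's official IoU protocol may aggregate over segments and thresholds differently, so this is only an illustration, and `temporal_iou` is a hypothetical helper.

```python
def temporal_iou(pred: tuple[float, float], truth: tuple[float, float]) -> float:
    """Intersection over union of two (start, end) time segments, in seconds."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical segments: predicted fake span 1-3 s, true fake span 2-4 s.
iou = temporal_iou((1.0, 3.0), (2.0, 4.0))
```

The two segments share one second out of a combined three seconds, so even a prediction that is off by a single second scores well below 1 on short segments.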
[CV-256] Sensory robustness through top-down feedback and neural stochasticity in recurrent vision models
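[Quick read]: Biological vision relies on top-down feedback, yet most artificial vision models succeed at image classification with purely feedforward or recurrent architectures, which calls into question the functional significance of descending cortical pathways. To clarify this, the authors train convolutional recurrent neural networks (ConvRNNs) on image classification with and without top-down feedback and examine what the feedback pathways contribute computationally. The key finding is that top-down feedback combined with stochastic neural variability (simulated by randomly silencing single units via dropout) markedly improves the speed-accuracy trade-off and robustness to noise and adversarial attacks. Mechanistically, feedback coupled with dropout constrains network activity to a low-dimensional manifold, encodes object information more efficiently in out-of-distribution regimes, and stabilizes population-level representational dynamics, revealing a dual mechanism in which neural stochasticity and top-down feedback jointly support robust sensory coding.
Link: https://arxiv.org/abs/2508.07115
Authors: Antonino Greco, Marco D’Alessandro, Karl J. Friston, Giovanni Pezzulo, Markus Siegel
Affiliation: Unknown
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: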
Abstract:Biological systems leverage top-down feedback for visual processing, yet most artificial vision models succeed in image classification using purely feedforward or recurrent architectures, calling into question the functional significance of descending cortical pathways. Here, we trained convolutional recurrent neural networks (ConvRNN) on image classification in the presence or absence of top-down feedback projections to elucidate the specific computational contributions of those feedback pathways. We found that ConvRNNs with top-down feedback exhibited remarkable speed-accuracy trade-off and robustness to noise perturbations and adversarial attacks, but only when they were trained with stochastic neural variability, simulated by randomly silencing single units via dropout. By performing detailed analyses to identify the reasons for such benefits, we observed that feedback information substantially shaped the representational geometry of the post-integration layer, combining the bottom-up and top-down streams, and this effect was amplified by dropout. Moreover, feedback signals coupled with dropout optimally constrained network activity onto a low-dimensional manifold and encoded object information more efficiently in out-of-distribution regimes, with top-down information stabilizing the representational dynamics at the population level. Together, these findings uncover a dual mechanism for resilient sensory coding. On the one hand, neural stochasticity prevents unit-level co-adaptation albeit at the cost of more chaotic dynamics. On the other hand, top-down feedback harnesses high-level information to stabilize network activity on compact low-dimensional manifolds.
[CV-257] Membership Inference Attacks with False Discovery Rate Control
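[Quick read]: This paper addresses the lack of false discovery rate (FDR) guarantees in existing membership inference attack (MIA) methods. The FDR is the expected proportion of false discoveries among all records identified as members, and controlling it is essential for the reliability of attack results. Guaranteeing it is challenging because the underlying data distribution is unknown and the estimated non-member probabilities are interdependent. The paper proposes a new MIA method whose design provides FDR control together with a marginal probability guarantee on labeling true non-member data as members. The method works as a post-hoc wrapper around existing MIA methods, requiring no change to the original model or attack pipeline, so it preserves attack performance while making the results more statistically trustworthy.
Link: https://arxiv.org/abs/2508.07066
Authors: Chenxu Zhao, Wei Qian, Aobo Chen, Mengdi Huai
Affiliation: Iowa State University
Categories: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: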
Abstract:Recent studies have shown that deep learning models are vulnerable to membership inference attacks (MIAs), which aim to infer whether a data record was used to train a target model or not. To analyze and study these vulnerabilities, various MIA methods have been proposed. Despite the significance and popularity of MIAs, existing works on MIAs are limited in providing guarantees on the false discovery rate (FDR), which refers to the expected proportion of false discoveries among the identified positive discoveries. However, it is very challenging to ensure the false discovery rate guarantees, because the underlying distribution is usually unknown, and the estimated non-member probabilities often exhibit interdependence. To tackle the above challenges, in this paper, we design a novel membership inference attack method, which can provide the guarantees on the false discovery rate. Additionally, we show that our method can also provide the marginal probability guarantee on labeling true non-member data as member data. Notably, our method can work as a wrapper that can be seamlessly integrated with existing MIA methods in a post-hoc manner, while also providing the FDR control. We perform the theoretical analysis for our method. Extensive experiments in various settings (e.g., the black-box setting and the lifelong learning setting) are also conducted to verify the desirable performance of our method.
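To make the idea of a post-hoc FDR wrapper concrete, here is a minimal sketch of the classical Benjamini-Hochberg procedure. This is a swapped-in textbook technique, not the paper's method. It assumes each record already has a p-value for the null hypothesis "this record is a non-member" (how an attack would produce such p-values is not specified here), and its guarantee holds under independence or positive dependence, which is exactly the kind of assumption the abstract flags as problematic for MIAs.

```python
def benjamini_hochberg(p_values, alpha=0.1):
    """Return the indices declared 'member' by the Benjamini-Hochberg procedure.

    p_values[i] is assumed to be a p-value for the null "record i is a
    non-member"; alpha is the target FDR level. Illustrative only.
    """
    m = len(p_values)
    # Visit records from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its BH threshold.
        if p_values[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


# Five candidate records; the three smallest p-values pass at alpha = 0.1.
members = benjamini_hochberg([0.001, 0.02, 0.03, 0.5, 0.9], alpha=0.1)
```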
[CV-258] Digital generation of the 3-D pore architecture of isotropic membranes using 2-D cross-sectional scanning electron microscopy images
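[Quick read]: This paper addresses the inability of conventional two-dimensional scanning electron microscopy (SEM) to resolve the three-dimensional pore architecture and interconnectivity of porous membranes, information that is critical to understanding membrane performance. The key to the solution is an enhanced reconstruction algorithm that preserves essential statistical properties such as pore size distribution while accurately reproducing the intricate pore morphologies of real membranes. It generates a high-fidelity 3-D pore network from a single 2-D SEM image; on a commercial microfiltration membrane, the reconstruction agreed closely with X-ray tomography data in structural metrics and resolved fine pore features at higher resolution.
Link: https://arxiv.org/abs/2508.06664
Authors: Sima Zeinali Danalou, Hooman Chamani, Arash Rabbani, Patrick C. Lee, Jason Hattrick Simpers, Jay R Werber
Affiliation: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph)
Comments: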
Abstract:A major limitation of two-dimensional scanning electron microscopy (SEM) in imaging porous membranes is its inability to resolve three-dimensional pore architecture and interconnectivity, which are critical factors governing membrane performance. Although conventional tomographic 3-D reconstruction techniques can address this limitation, they are often expensive, technically challenging, and not widely accessible. We previously introduced a proof-of-concept method for reconstructing a membrane’s 3-D pore network from a single 2-D SEM image, yielding statistically equivalent results to those obtained from 3-D tomography. However, this initial approach struggled to replicate the diverse pore geometries commonly observed in real membranes. In this study, we advance the methodology by developing an enhanced reconstruction algorithm that not only maintains essential statistical properties (e.g., pore size distribution), but also accurately reproduces intricate pore morphologies. Applying this technique to a commercial microfiltration membrane, we generated a high-fidelity 3-D reconstruction and derived key membrane properties. Validation with X-ray tomography data revealed excellent agreement in structural metrics, with our SEM-based approach achieving superior resolution in resolving fine pore features. The tool can be readily applied to isotropic porous membrane structures of any pore size, as long as those pores can be visualized by SEM. Further work is needed for 3-D structure generation of anisotropic membranes.
Artificial Intelligence
[AI-0] VGGSounder: Audio-Visual Evaluations for Foundation Models ICCV
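[Quick read]: This paper addresses limitations of the VGGSound dataset, which is commonly used to evaluate audio-visual foundation models: incomplete labelling, partially overlapping classes, and misaligned modalities, all of which distort assessments of a model's auditory and visual capabilities. The key to the solution is VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and adds detailed modality annotations, enabling precise analysis of modality-specific performance. The authors also introduce a modality confusion metric that exposes model limitations by measuring how performance degrades when another input modality is added.
Link: https://arxiv.org/abs/2508.08237
Authors: Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke
Affiliation: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2025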
Abstract:The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.
[AI-1] LL3M: Large Language 3D Modelers
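[Quick read]: This paper addresses the limitations of conventional 3D asset generation methods that learn from large 3D datasets, such as limited editability, poor integration with artist workflows, and uninterpretable outputs. The key to the solution is to reformulate shape generation as writing interpretable Python code executed in Blender, using pretrained large language models (LLMs) in a multi-agent system that collaboratively plans, retrieves, writes, debugs, and refines Blender scripts to generate and edit geometry and appearance. The generated code serves as a high-level, human-readable, structured scene representation that draws on sophisticated Blender constructs (e.g. B-meshes, geometry modifiers, shader nodes) and supports modular edits, parametric control, and human-agent co-creation. BlenderRAG, a retrieval-augmented generation knowledge base built from Blender API documentation, improves code correctness and modeling capability.
Link: https://arxiv.org/abs/2508.08228
Authors: Sining Lu, Guan Chen, Nam Anh Dinh, Itai Lang, Ari Holtzman, Rana Hanocka
Affiliation: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI)
Comments: Our project page is at this https URL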
Abstract:We present LL3M, a multi-agent system that leverages pretrained large language models (LLMs) to generate 3D assets by writing interpretable Python code in Blender. We break away from the typical generative approach that learns from a collection of 3D data. Instead, we reformulate shape generation as a code-writing task, enabling greater modularity, editability, and integration with artist workflows. Given a text prompt, LL3M coordinates a team of specialized LLM agents to plan, retrieve, write, debug, and refine Blender scripts that generate and edit geometry and appearance. The generated code works as a high-level, interpretable, human-readable, well-documented representation of scenes and objects, making full use of sophisticated Blender constructs (e.g. B-meshes, geometry modifiers, shader nodes) for diverse, unconstrained shapes, materials, and scenes. This code presents many avenues for further agent and human editing and experimentation via code tweaks or procedural parameters. This medium naturally enables a co-creative loop in our system: agents can automatically self-critique using code and visuals, while iterative user instructions provide an intuitive way to refine assets. A shared code context across agents enables awareness of previous attempts, and a retrieval-augmented generation knowledge base built from Blender API documentation, BlenderRAG, equips agents with examples, types, and functions empowering advanced modeling operations and code correctness. We demonstrate the effectiveness of LL3M across diverse shape categories, style and material edits, and user-driven refinements. Our experiments showcase the power of code as a generative and interpretable medium for 3D asset creation. Our project page is at this https URL.
[AI-2] Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent
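[Quick read]: This paper studies, from a theoretical standpoint, how transformers acquire multi-step symbolic reasoning ability through training, focusing on path-finding in trees. The key to the solution is an analysis of gradient descent dynamics showing that a trained one-layer transformer can provably solve both backward reasoning (outputting the path from a goal node to the root) and forward reasoning (first identifying the goal-to-root path, then reversing it to obtain the root-to-goal path), with generalization guarantees on unseen trees. The multi-phase training dynamics show that different attention heads learn to specialize and coordinate autonomously on the two subtasks, completing two-stage reasoning within a single autoregressive pass. This gives a mechanistic account of how transformers implement sequential algorithmic procedures, and suggests that with suitably structured intermediate chain-of-thought steps, even shallow multi-head transformers can solve problems that would otherwise require deeper architectures.
Link: https://arxiv.org/abs/2508.08222
Authors: Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: submitted for consideration of publication in May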
Abstract:Transformers have demonstrated remarkable capabilities in multi-step reasoning tasks. However, understandings of the underlying mechanisms by which they acquire these abilities through training remain limited, particularly from a theoretical standpoint. This work investigates how transformers learn to solve symbolic multi-step reasoning problems through chain-of-thought processes, focusing on path-finding in trees. We analyze two intertwined tasks: a backward reasoning task, where the model outputs a path from a goal node to the root, and a more complex forward reasoning task, where the model implements two-stage reasoning by first identifying the goal-to-root path and then reversing it to produce the root-to-goal path. Our theoretical analysis, grounded in the dynamics of gradient descent, shows that trained one-layer transformers can provably solve both tasks with generalization guarantees to unseen trees. In particular, our multi-phase training dynamics for forward reasoning elucidate how different attention heads learn to specialize and coordinate autonomously to solve the two subtasks in a single autoregressive path. These results provide a mechanistic explanation of how trained transformers can implement sequential algorithmic procedures. Moreover, they offer insights into the emergence of reasoning abilities, suggesting that when tasks are structured to take intermediate chain-of-thought steps, even shallow multi-head transformers can effectively solve problems that would otherwise require deeper architectures.
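The two tasks described in the abstract can be stated as ordinary algorithms. The sketch below is only a reference implementation of the target behaviour (what the transformer is trained to emit), not of the transformer or its training. The child-to-parent dictionary encoding of the tree and both function names are assumptions made for illustration.

```python
def goal_to_root(parent, goal):
    """Backward reasoning: follow parent pointers from the goal up to the root.

    `parent` maps each non-root node to its parent (hypothetical encoding).
    """
    path = [goal]
    while path[-1] in parent:  # the root is the only node without a parent
        path.append(parent[path[-1]])
    return path


def root_to_goal(parent, goal):
    """Forward reasoning as two stages: find the goal-to-root path, then reverse it."""
    return goal_to_root(parent, goal)[::-1]


# Tree: a is the root, with children b and c; d is a child of b.
tree = {"b": "a", "c": "a", "d": "b"}
backward = goal_to_root(tree, "d")
forward = root_to_goal(tree, "d")
```

Here `backward` is the goal-to-root path and `forward` is its reversal, which mirrors the two-stage chain of thought the abstract describes.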
[AI-3] Street-Level AI: Are Large Language Models Ready for Real-World Judgments? AAAI
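[Quick read]: This paper asks whether current large language models (LLMs) can reliably replace or assist "street-level bureaucrats" in making fair, consistent prioritization judgments in high-stakes social decisions such as homelessness resource allocation. Using real data on people needing services (kept strictly confidential by using only locally deployed models), the authors compare LLM-generated prioritizations with human judgments and with the socially and politically determined vulnerability scoring systems currently in use, measuring consistency within a model across runs, between models, and against those established systems. They find that LLM prioritizations are highly inconsistent on all of these dimensions, although they show qualitative consistency with lay human judgments in pairwise testing, which suggests that current generative AI systems are not yet mature enough for direct use in high-stakes societal decision-making.
Link: https://arxiv.org/abs/2508.08193
Authors: Gaurab Pokharel, Shafkat Farabi, Patrick J. Fowler, Sanmay Das
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: This work has been accepted for publication as a full paper at the AAAI/ACM Conference on AI, Ethics, and Society (AIES 2025)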
Abstract:A surge of recent work explores the ethical and societal implications of large-scale AI models that make “moral” judgments. Much of this literature focuses either on alignment with human judgments through various thought experiments or on the group fairness implications of AI judgments. However, the most immediate and likely use of AI is to help or fully replace the so-called street-level bureaucrats, the individuals deciding to allocate scarce social resources or approve benefits. There is a rich history underlying how principles of local justice determine how society decides on prioritization mechanisms in such domains. In this paper, we examine how well LLM judgments align with human judgments, as well as with socially and politically determined vulnerability scoring systems currently used in the domain of homelessness resource allocation. Crucially, we use real data on those needing services (maintaining strict confidentiality by only using local large models) to perform our analyses. We find that LLM prioritizations are extremely inconsistent in several ways: internally on different runs, between different LLMs, and between LLMs and the vulnerability scoring systems. At the same time, LLMs demonstrate qualitative consistency with lay human judgments in pairwise testing. Findings call into question the readiness of current generation AI systems for naive integration in high-stakes societal decision-making.
[AI-4] Neural Logic Networks for Interpretable Classification
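[Quick read]: This paper addresses the poor interpretability of traditional neural networks, whose learned features and decision processes cannot easily be inspected, verified, or extracted. The authors propose interpretable Neural Logic Networks built on the logical operations AND, OR, and NOT, add biases that account for unobserved data, and develop a rigorous logical and probabilistic modeling framework in terms of concept combinations. The key to the solution is a novel factorized IF-THEN rule structure and a modified learning algorithm, which let the model discover relevant, interpretable rules in tabular classification, with particular value in settings such as medicine where interpretability matters.
Link: https://arxiv.org/abs/2508.08172
Authors: Vincent Perreault, Katsumi Inoue, Richard Labib, Alain Hertz
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 21 pages, 6 figures, pre-print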
Abstract:Traditional neural networks have an impressive classification performance, but what they learn cannot be inspected, verified or extracted. Neural Logic Networks on the other hand have an interpretable structure that enables them to learn a logical mechanism relating the inputs and outputs with AND and OR operations. We generalize these networks with NOT operations and biases that take into account unobserved data and develop a rigorous logical and probabilistic modeling in terms of concept combinations to motivate their use. We also propose a novel factorized IF-THEN rule structure for the model as well as a modified learning algorithm. Our method improves the state-of-the-art in Boolean networks discovery and is able to learn relevant, interpretable rules in tabular classification, notably on an example from the medical field where interpretability has tangible value.
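One common way to make AND, OR, and NOT usable inside a trainable model is a probabilistic relaxation over values in [0, 1]. The sketch below uses the product form (inputs treated as independent probabilities). It is an assumption chosen for illustration and does not reproduce the paper's parametrization, its biases for unobserved data, or its learning algorithm; the example rule and its inputs x1, x2, x3 are hypothetical.

```python
from math import prod


def soft_not(x):
    """NOT as the complement of a probability."""
    return 1.0 - x


def soft_and(xs):
    """AND as a product: true only to the extent that every input is true."""
    return prod(xs)


def soft_or(xs):
    """OR via De Morgan: NOT(AND(NOT x) for all x)."""
    return 1.0 - prod(1.0 - x for x in xs)


def rule(x1, x2, x3):
    """Hypothetical rule: IF (x1 AND x2) OR (x1 AND NOT x3) THEN positive."""
    return soft_or([soft_and([x1, x2]), soft_and([x1, soft_not(x3)])])


# On Boolean inputs the relaxation agrees with classical logic;
# on fractional inputs it returns a degree of truth between 0 and 1.
crisp = rule(1.0, 0.0, 0.0)
graded = rule(0.9, 0.5, 0.2)
```

Because every operation is a product or a complement, a rule built this way stays readable as an IF-THEN statement while remaining differentiable in its inputs.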
[AI-5] PyVeritas: On Verifying Python via LLM-Based Transpilation and Bounded Model Checking for C
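[Quick read]: This paper addresses the lack of formal verification tools for Python, where the language's inherent complexity and the verbosity and low-level nature of existing transpilers (e.g., Cython) have made formal verification hard to apply. The key to the solution is PyVeritas, a framework that uses Large Language Models (LLMs) for high-level transpilation from Python to C, then applies mature C model checking tools (such as CBMC) for bounded model checking and MaxSAT-based fault localisation. This lets Python code that could not be verified directly benefit from the existing C verification ecosystem for assertion-based verification and interpretable fault diagnosis. Empirically, LLM-based transpilation reaches up to 80–90% accuracy for some LLMs, supporting verification of small yet non-trivial Python programs.
Link: https://arxiv.org/abs/2508.08171
Authors: Pedro Orvalho, Marta Kwiatkowska
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 14 pages, 6 tables, 1 figure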
Abstract:Python has become the dominant language for general-purpose programming, yet it lacks robust tools for formal verification. In contrast, programmers working in languages such as C benefit from mature model checkers, for example CBMC, which enable exhaustive symbolic reasoning and fault localisation. The inherent complexity of Python, coupled with the verbosity and low-level nature of existing transpilers (e.g., Cython), has historically limited the applicability of formal verification to Python programs. In this paper, we propose PyVeritas, a novel framework that leverages Large Language Models (LLMs) for high-level transpilation from Python to C, followed by bounded model checking and MaxSAT-based fault localisation in the generated C code. PyVeritas enables verification and bug localisation for Python code using existing model checking tools for C. Our empirical evaluation on two Python benchmarks demonstrates that LLM-based transpilation can achieve a high degree of accuracy, up to 80–90% for some LLMs, enabling an effective development environment that supports assertion-based verification and interpretable fault diagnosis for small yet non-trivial Python programs.
[AI-6] Can AI Explanations Make You Change Your Mind? IJCAI2025
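[Quick read]: In AI-based decision support systems (DSS), explanations are meant to help users judge when to trust the AI's suggestion and to enable human oversight that catches erroneous or biased decisions, but it is unclear whether users actually consider explanations in enough detail. In the authors' study, many participants spent little time on the explanation and did not always consider it carefully. The paper presents an exploratory analysis of which factors affect how carefully users consider AI explanations and how this in turn affects their openness to changing their mind based on the AI's suggestion, with the aim of informing explainable AI (XAI) design so that explanations can actually support oversight and error correction.
Link: https://arxiv.org/abs/2508.08158
Authors: Laura Spillner, Rachel Ringe, Robert Porzel, Rainer Malaka
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: This paper was presented at the Explainable AI workshop at IJCAI 2025: this https URL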
Abstract:In the context of AI-based decision support systems, explanations can help users to judge when to trust the AI’s suggestion, and when to question it. In this way, human oversight can prevent AI errors and biased decision-making. However, this rests on the assumption that users will consider explanations in enough detail to be able to catch such errors. We conducted an online study on trust in explainable DSS, and were surprised to find that in many cases, participants spent little time on the explanation and did not always consider it in detail. We present an exploratory analysis of this data, investigating what factors impact how carefully study participants consider AI explanations, and how this in turn impacts whether they are open to changing their mind based on what the AI suggests.
[AI-7] From Natural Language to Solver-Ready Power System Optimization: An LLM-Assisted Validation-in-the-Loop Framework
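[Quick read]: This paper addresses how to automatically convert natural-language descriptions of power system optimization scenarios into compact mathematical models that off-the-shelf optimization solvers can handle directly. Approaches that ask large language models (LLMs) to produce solutions directly often yield infeasible or suboptimal results because the models lack numerical precision and constraint-handling capability. The key to the solution is an LLM-assisted agent built on a domain-aware prompt and a structured schema, which enforces feasibility through systematic validation and iterative repair and returns both a solver-ready formulation and user-facing results. Using the unit commitment problem as a case study, the agent produces optimal or near-optimal schedules with their objective costs.
Link: https://arxiv.org/abs/2508.08147
Authors: Yunkai Hu, Tianqiao Zhao, Meng Yue
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: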
Abstract:This paper introduces a novel Large Language Model (LLM)-assisted agent that automatically converts natural-language descriptions of power system optimization scenarios into compact, solver-ready formulations and generates corresponding solutions. In contrast to approaches that rely solely on an LLM to produce solutions directly, the proposed method focuses on discovering a mathematically compatible formulation that can be efficiently solved by off-the-shelf optimization solvers. Directly using LLMs to produce solutions often leads to infeasible or suboptimal results, as these models lack the numerical precision and constraint-handling capabilities of established optimization solvers. The pipeline integrates a domain-aware prompt and schema with an LLM, enforces feasibility through systematic validation and iterative repair, and returns both solver-ready models and user-facing results. Using the unit commitment problem as a representative case study, the agent produces optimal or near-optimal schedules along with the associated objective costs. Results demonstrate that coupling the solver with task-specific validation significantly enhances solution reliability. This work shows that combining AI with established optimization frameworks bridges high-level problem descriptions and executable mathematical models, enabling more efficient decision-making in energy systems.
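As a rough illustration of what a task-specific validation step might check in a unit commitment setting, the sketch below tests a candidate schedule against generator limits and power balance and returns readable violation messages. The data layout, function name, and the two constraints chosen are assumptions made for this example; the paper's actual schema and validation rules are not given in the abstract.

```python
def validate_schedule(units, demand, schedule, tol=1e-6):
    """Return a list of constraint violations; an empty list means feasible.

    units:    {name: (p_min, p_max)} generator output limits in MW
    demand:   [load_t, ...] system load per period in MW
    schedule: {name: [(is_on, power), ...]} one entry per period
    All three structures are hypothetical, chosen for illustration.
    """
    errors = []
    for t, load in enumerate(demand):
        total = 0.0
        for name, (p_min, p_max) in units.items():
            is_on, power = schedule[name][t]
            if is_on and not (p_min <= power <= p_max):
                errors.append(f"t={t}: {name} output {power} outside [{p_min}, {p_max}]")
            if not is_on and power != 0:
                errors.append(f"t={t}: {name} is off but produces {power}")
            total += power
        if abs(total - load) > tol:
            errors.append(f"t={t}: generation {total} does not match demand {load}")
    return errors


units = {"g1": (10, 50), "g2": (5, 20)}
demand = [40, 60]
good = {"g1": [(True, 40), (True, 50)], "g2": [(False, 0), (True, 10)]}
bad = {"g1": [(True, 60), (True, 50)], "g2": [(False, 0), (True, 10)]}
good_errors = validate_schedule(units, demand, good)
bad_errors = validate_schedule(units, demand, bad)
```

In a validation-in-the-loop setup, messages like those in `bad_errors` could be passed back to the model as repair hints, which is one plausible reading of the "iterative repair" the abstract mentions.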
[AI-8] COMponent-Aware Pruning for Accelerated Control Tasks in Latent Space Models
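[Quick read]: This paper addresses the high computational complexity and memory footprint of neural network controllers (NNCs) deployed on resource-constrained mobile platforms such as mobile robots, wearable devices, and Internet-of-Things devices, with a focus on compressing models while preserving control stability. The key to the solution is a component-aware structured pruning method that determines the pruning magnitude for each pruning group to balance compression against controller stability, combined with Lyapunov stability criteria that provide mathematical guarantees. This establishes a quantitative boundary for safe compression ratios, so that compressed NNCs can be deployed on edge devices without violating critical stability properties.
Link: https://arxiv.org/abs/2508.08144
Authors: Ganesh Sundaram, Jonas Ulmen, Amjad Haider, Daniel Görges
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Submitted in: The 2026 IEEE/SICE International Symposium on System Integration (SII 2026)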
Abstract:The rapid growth of resource-constrained mobile platforms, including mobile robots, wearable systems, and Internet-of-Things devices, has increased the demand for computationally efficient neural network controllers (NNCs) that can operate within strict hardware limitations. While deep neural networks (DNNs) demonstrate superior performance in control applications, their substantial computational complexity and memory requirements present significant barriers to practical deployment on edge devices. This paper introduces a comprehensive model compression methodology that leverages component-aware structured pruning to determine the optimal pruning magnitude for each pruning group, ensuring a balance between compression and stability for NNC deployment. Our approach is rigorously evaluated on Temporal Difference Model Predictive Control (TD-MPC), a state-of-the-art model-based reinforcement learning algorithm, with a systematic integration of mathematical stability guarantee properties, specifically Lyapunov criteria. The key contribution of this work lies in providing a principled framework for determining the theoretical limits of model compression while preserving controller stability. Experimental validation demonstrates that our methodology successfully reduces model complexity while maintaining requisite control performance and stability characteristics. Furthermore, our approach establishes a quantitative boundary for safe compression ratios, enabling practitioners to systematically determine the maximum permissible model reduction before violating critical stability properties, thereby facilitating the confident deployment of compressed NNCs in resource-limited environments.
[AI-9] MuaLLM: A Multimodal Large Language Model Agent for Circuit Design Assistance with Hybrid Contextual Retrieval-Augmented Generation
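[Quick read]: This paper addresses three challenges in literature review for circuit design: the rapid influx of state-of-the-art research, inconsistent data representation, and the complexity of optimizing circuit design objectives. The authors propose MuaLLM, an open-source multimodal Large Language Model (LLM) agent that integrates a hybrid Retrieval-Augmented Generation (RAG) framework with an adaptive vector database and uses a Reason + Act (ReAct) workflow for multi-step retrieval and iterative reasoning. The key advance is decoupling retrieval from inference, which lets the system reason over large corpora of circuit design literature despite standard LLM context limits; at the maximum context length supported by standard LLMs, it is up to 10x less costly and 1.6x faster at the same accuracy, which also helps overcome the bottleneck of simulation-based dataset creation.
Link: https://arxiv.org/abs/2508.08137
Authors: Pravallika Abbineni, Saoud Aldowaish, Colin Liechty, Soroosh Noorzad, Ali Ghazizadeh, Morteza Fayazi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: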
Abstract:Conducting a comprehensive literature review is crucial for advancing circuit design methodologies. However, the rapid influx of state-of-the-art research, inconsistent data representation, and the complexity of optimizing circuit design objectives make this task significantly challenging. In this paper, we propose MuaLLM, an open-source multimodal Large Language Model (LLM) agent for circuit design assistance that integrates a hybrid Retrieval-Augmented Generation (RAG) framework with an adaptive vector database of circuit design research papers. Unlike conventional LLMs, the MuaLLM agent employs a Reason + Act (ReAct) workflow for iterative reasoning, goal-setting, and multi-step information retrieval. It functions as a question-answering design assistant, capable of interpreting complex queries and providing reasoned responses grounded in circuit literature. Its multimodal capabilities enable processing of both textual and visual data, facilitating more efficient and comprehensive analysis. The system dynamically adapts using intelligent search tools, automated document retrieval from the internet, and real-time database updates. Unlike conventional approaches constrained by model context limits, MuaLLM decouples retrieval from inference, enabling scalable reasoning over arbitrarily large corpora. At the maximum context length supported by standard LLMs, MuaLLM remains up to 10x less costly and 1.6x faster while maintaining the same accuracy. This allows rapid, no-human-in-the-loop database generation, overcoming the bottleneck of simulation-based dataset creation for circuits. To evaluate MuaLLM, we introduce two custom datasets: RAG-250, targeting retrieval and citation performance, and Reasoning-100 (Reas-100), focused on multistep reasoning in circuit design. MuaLLM achieves 90.1% recall on RAG-250, and 86.8% accuracy on Reas-100.
[AI-10] BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks
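[Quick read]: This paper addresses propagation vulnerability in multi-agent systems (MAS) based on large language models (LLMs), where malicious agents distort collective decision-making through inter-agent messages. Existing supervised defenses are effective but depend on labeled malicious agents, which makes them hard to apply in practice. The key to the solution is BlindGuard, an unsupervised defense with two core components: (1) a hierarchical agent encoder that captures each agent's individual, neighborhood, and global interaction patterns; and (2) a corruption-guided detector that combines directional noise injection with contrastive learning so that the detection model can be trained on normal agent behavior alone. Without attack labels or prior knowledge, it detects diverse attack types (prompt injection, memory poisoning, and tool attack) and generalizes better than supervised baselines.
Link: https://arxiv.org/abs/2508.08127
Authors: Rui Miao, Yixin Liu, Yili Wang, Xu Shen, Yue Tan, Yiwei Dai, Shirui Pan, Xin Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: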
Abstract:The security of LLM-based multi-agent systems (MAS) is critically threatened by propagation vulnerability, where malicious agents can distort collective decision-making through inter-agent message interactions. While existing supervised defense methods demonstrate promising performance, they may be impractical in real-world scenarios due to their heavy reliance on labeled malicious agents to train a supervised malicious detection model. To enable practical and generalizable MAS defenses, in this paper, we propose BlindGuard, an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors. To this end, we establish a hierarchical agent encoder to capture individual, neighborhood, and global interaction patterns of each agent, providing a comprehensive understanding for malicious agent detection. Meanwhile, we design a corruption-guided detector that consists of directional noise injection and contrastive learning, allowing effective detection model training solely on normal agent behaviors. Extensive experiments show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across MAS with various communication patterns while maintaining superior generalizability compared to supervised baselines. The code is available at: this https URL.
[AI-11] MemoryKT: An Integrative Memory-and-Forgetting Method for Knowledge Tracing
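[Quick read]: This paper addresses a limitation of existing Knowledge Tracing (KT) models in simulating students' memory states: most rely on a single, undifferentiated forgetting mechanism and overlook the three core memory processes (encoding, storage, and retrieval) as well as personalized forgetting patterns. The key to the solution is memoryKT, a model based on a novel temporal variational autoencoder that jointly models the full memory cycle in three stages: learning the distribution of students' knowledge memory features, reconstructing their exercise feedback, and embedding a personalized forgetting module in the temporal workflow to dynamically modulate memory storage strength. This improves the model's sensitivity to individual differences, and it outperforms state-of-the-art baselines on four public datasets.
Link: https://arxiv.org/abs/2508.08122
Authors: Mingrong Lin, Ke Deng, Zhengyang Wu, Zetao Zheng, Jie Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures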
Abstract:Knowledge Tracing (KT) is committed to capturing students’ knowledge mastery from their historical interactions. Simulating students’ memory states is a promising approach to enhance both the performance and interpretability of knowledge tracing models. Memory consists of three fundamental processes: encoding, storage, and retrieval. Although forgetting primarily manifests during the storage stage, most existing studies rely on a single, undifferentiated forgetting mechanism, overlooking other memory processes as well as personalized forgetting patterns. To address this, this paper proposes memoryKT, a knowledge tracing model based on a novel temporal variational autoencoder. The model simulates memory dynamics through a three-stage process: (i) Learning the distribution of students’ knowledge memory features, (ii) Reconstructing their exercise feedback, while (iii) Embedding a personalized forgetting module within the temporal workflow to dynamically modulate memory storage strength. This jointly models the complete encoding-storage-retrieval cycle, significantly enhancing the model’s perception capability for individual differences. Extensive experiments on four public datasets demonstrate that our proposed approach significantly outperforms state-of-the-art baselines.
[AI-12] TeamMedAgents: Enhancing Medical Decision-Making of LLMs Through Structured Teamwork
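[Quick read]: This paper addresses the lack of systematic collaboration mechanisms when large language models (LLMs) are used for medical decision-making, that is, how to turn the evidence-based components of human teamwork into a computational multi-agent architecture. The key to the solution is TeamMedAgents, a framework grounded in the "Big Five" teamwork model of Salas et al. from organizational psychology. It implements six core teamwork components (team leadership, mutual performance monitoring, team orientation, shared mental models, closed-loop communication, and mutual trust) as modular, configurable mechanisms within an adaptive collaboration architecture. The framework improves results on 7 of 8 medical benchmarks, and ablation studies show that the best teamwork configuration varies with task complexity and domain.
Link: https://arxiv.org/abs/2508.08115
Authors: Pranav Pushkar Mishra, Mohammad Arvan, Mohan Zalake (University of Illinois, Chicago)
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 1 figure, 6 tables (2 in main, 4 in appendix)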
Abstract:We present TeamMedAgents, a novel multi-agent approach that systematically integrates evidence-based teamwork components from human-human collaboration into medical decision-making with large language models (LLMs). Our approach validates an organizational psychology teamwork model from human collaboration to computational multi-agent medical systems by operationalizing six core teamwork components derived from Salas et al.'s “Big Five” model: team leadership, mutual performance monitoring, team orientation, shared mental models, closed-loop communication, and mutual trust. We implement and evaluate these components as modular, configurable mechanisms within an adaptive collaboration architecture while assessing the effect of the number of agents involved based on the task’s requirements and domain. Systematic evaluation of computational implementations of teamwork behaviors across eight medical benchmarks (MedQA, MedMCQA, MMLU-Pro Medical, PubMedQA, DDXPlus, MedBullets, Path-VQA, and PMC-VQA) demonstrates consistent improvements across 7 out of 8 evaluated datasets. Controlled ablation studies conducted on 50 questions per configuration across 3 independent runs provide mechanistic insights into individual component contributions, revealing optimal teamwork configurations that vary by reasoning task complexity and domain-specific requirements. Our ablation analyses reveal dataset-specific optimal teamwork configurations, indicating that different medical reasoning modalities benefit from distinct collaborative patterns. TeamMedAgents represents an advancement in collaborative AI by providing a systematic translation of established teamwork theories from human collaboration into agentic collaboration, establishing a foundation for evidence-based multi-agent system design in critical decision-making domains.
[AI-13] ChatGPT on the Road: Leveraging Large Language Model-Powered In-vehicle Conversational Agents for Safer and More Enjoyable Driving Experience
[Quick read]: This paper addresses the limited naturalness of traditional in-vehicle conversational agents, which rely on pre-scripted prompts or a small set of voice commands and therefore constrain fluent driver-agent communication. The key to the solution is a ChatGPT-based in-vehicle agent, powered by a Large Language Model (LLM), that supports continuous, multi-turn natural-language dialogue to improve driving safety and user experience. Experiments show that, compared with the no-agent and pre-scripted-agent conditions, the ChatGPT-based agent reduced variability in longitudinal acceleration, lateral acceleration, and lane deviation, and scored higher on subjective measures of competence, animacy, affective trust, and preference, supporting the feasibility and effectiveness of LLM-powered conversational agents for context-rich interaction in complex driving scenarios.
Link: https://arxiv.org/abs/2508.08101
Authors: Yeana Lee Bond (1), Mungyeong Choe (2), Baker Kasim Hasan (1), Arsh Siddiqui (1), Myounghoon Jeon (1 (Courtesy Appointment), 2) ((1) Computer Science, Virginia Tech, Blacksburg, Virginia, USA, (2) Industrial and Systems Engineering, Virginia Tech, Blacksburg, Virginia, USA)
Institution: Virginia Tech
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Submitted to International Journal of Human-Computer Studies. Bond and Choe: Drafting, Review, Editing, Validation, Software, Methodology, Investigation, Data Analysis, Conceptualization, Experiment training. Hasan and Siddiqui: Experimental and Data Analysis Support. Jeon: Supervision, Review, Resources, Project Admin, Methodology, Conceptualization. Total 34 pages
Abstract:Studies on in-vehicle conversational agents have traditionally relied on pre-scripted prompts or limited voice commands, constraining natural driver-agent interaction. To resolve this issue, the present study explored the potential of a ChatGPT-based in-vehicle agent capable of carrying continuous, multi-turn dialogues. Forty drivers participated in our experiment using a motion-based driving simulator, comparing three conditions (No agent, Pre-scripted agent, and ChatGPT-based agent) as a within-subjects variable. Results showed that the ChatGPT-based agent condition led to more stable driving performance across multiple metrics. Participants demonstrated lower variability in longitudinal acceleration, lateral acceleration, and lane deviation compared to the other two conditions. In subjective evaluations, the ChatGPT-based agent also received significantly higher ratings in competence, animacy, affective trust, and preference compared to the Pre-scripted agent. Our thematic analysis of driver-agent conversations revealed diverse interaction patterns in topics, including driving assistance/questions, entertainment requests, and anthropomorphic interactions. Our results highlight the potential of LLM-powered in-vehicle conversational agents to enhance driving safety and user experience through natural, context-rich interactions.
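[AI-14] Grid2Guide: A* Enabled Small Language Model for Indoor Navigation
[Quick read]: This paper addresses reliable indoor navigation in complex environments that lack external positioning signals and dedicated infrastructure. The key to the solution is Grid2Guide, a hybrid navigation framework that combines the A* search algorithm with a Small Language Model (SLM): the indoor map is first represented as a binary occupancy matrix, A* computes the optimal path and produces concise textual navigation steps, and the SLM then turns those steps into natural-language instructions that are easier for end users to read and understand, yielding lightweight, infrastructure-free, real-time indoor navigation.
Link: https://arxiv.org/abs/2508.08100
Authors: Md. Wasiul Haque, Sagar Dasgupta, Mizanur Rahman
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 8 figures, 6 tables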
Abstract:Reliable indoor navigation remains a significant challenge in complex environments, particularly where external positioning signals and dedicated infrastructures are unavailable. This research presents Grid2Guide, a hybrid navigation framework that combines the A* search algorithm with a Small Language Model (SLM) to generate clear, human-readable route instructions. The framework first constructs a binary occupancy matrix from a given indoor map. Using this matrix, the A* algorithm computes the optimal path between origin and destination, producing concise textual navigation steps. These steps are then transformed into natural language instructions by the SLM, enhancing interpretability for end users. Experimental evaluations across various indoor scenarios demonstrate the method's effectiveness in producing accurate and timely navigation guidance. The results validate the proposed approach as a lightweight, infrastructure-free solution for real-time indoor navigation support.
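The A* stage described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 4-connected movement model, the Manhattan heuristic, the compass-direction wording, and the names `astar` and `to_steps` are all assumptions made for illustration, and the SLM rewriting stage is left out.

```python
import heapq

def astar(grid, start, goal):
    """Shortest 4-connected path on a binary occupancy grid (1 = blocked)."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance is admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]   # entries are (f, g, cell)
    came_from = {}
    best_g = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if g > best_g[cell]:
            continue  # stale heap entry, a cheaper route was already found
        r, c = cell
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None  # goal unreachable

def to_steps(path):
    """Compress a cell path into terse steps (hypothetical wording)."""
    names = {(-1, 0): "north", (1, 0): "south", (0, -1): "west", (0, 1): "east"}
    steps = []
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        direction = names[(r1 - r0, c1 - c0)]
        if steps and steps[-1][0] == direction:
            steps[-1][1] += 1
        else:
            steps.append([direction, 1])
    return [f"move {d} {n} cell(s)" for d, n in steps]

# Toy occupancy matrix: the middle row is a wall with one opening.
grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = astar(grid, (0, 0), (2, 0))
steps = to_steps(path)
print(steps)
```

On this toy map the only route goes east to the opening, south through it, and back west; terse steps of this kind are what a language model would then rephrase for the user.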
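[AI-15] Growing Reservoirs with Developmental Graph Cellular Automata
[Quick read]: This paper studies how Developmental Graph Cellular Automata (DGCA) can generate dynamic structures with functional properties, so that plastic reservoirs can be constructed automatically and functional, adaptive morphogenesis can be modeled. The key to the solution is to grow the DGCA from a single-node seed and train it under two kinds of targets, task-driven (using the NARMA family of tasks) and task-independent (using reservoir metrics), so that it grows diverse, life-like structures. These structures statistically outperform conventional reservoirs on benchmark tasks, suggesting a new paradigm for designing self-organizing, function-oriented computing systems.
Link: https://arxiv.org/abs/2508.08091
Authors: Matias Barandiaran, James Stovold
Institution: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: Accepted to ALIFE 2025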
Abstract:Developmental Graph Cellular Automata (DGCA) are a novel model for morphogenesis, capable of growing directed graphs from single-node seeds. In this paper, we show that DGCAs can be trained to grow reservoirs. Reservoirs are grown with two types of targets: task-driven (using the NARMA family of tasks) and task-independent (using reservoir metrics). Results show that DGCAs are able to grow into a variety of specialized, life-like structures capable of effectively solving benchmark tasks, statistically outperforming 'typical' reservoirs on the same task. Overall, these results lay the foundation for the development of DGCA systems that produce plastic reservoirs and for modeling functional, adaptive morphogenesis.
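For readers unfamiliar with the NARMA family mentioned above, the sketch below generates the widely used NARMA-10 series. The abstract does not say which NARMA orders or input ranges the authors use, so the order 10, the uniform input range [0, 0.5], and the function name `narma10` are assumptions here.

```python
import random

def narma10(u):
    """NARMA-10 target series for an input sequence u (values usually in [0, 0.5]).

    y(t+1) = 0.3*y(t) + 0.05*y(t)*sum_{i=0..9} y(t-i) + 1.5*u(t-9)*u(t) + 0.1
    """
    y = [0.0] * len(u)
    for t in range(9, len(u) - 1):
        y[t + 1] = (0.3 * y[t]
                    + 0.05 * y[t] * sum(y[t - 9:t + 1])  # last ten outputs
                    + 1.5 * u[t - 9] * u[t]
                    + 0.1)
    return y

rng = random.Random(0)                      # fixed seed for repeatability
u = [rng.uniform(0.0, 0.5) for _ in range(200)]
y = narma10(u)
print(y[10:15])
```

A reservoir is then judged by how well a linear readout on its states reproduces `y` from `u`, which requires both memory of the last ten inputs and a nonlinear combination of them.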
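[AI-16] FNBT: Full Negation Belief Transformation for Open-World Information Fusion Based on Dempster-Shafer Theory of Evidence
[Quick read]: This paper addresses the poor performance of traditional Dempster-Shafer evidence-theory fusion methods in open-world information fusion, where data sources or models come from heterogeneous frames. The core challenge is how to fuse uncertain information across sources without first unifying the frames. The key to the solution is Full Negation Belief Transformation (FNBT): first, a criterion determines whether a task is an open-world fusion task; second, the frames are extended to accommodate heterogeneous elements; finally, a full negation mechanism transforms the basic probability assignment (BPA) so that existing combination rules can be applied directly to the transformed mass functions. In theory the method satisfies three properties (mass function invariance, heritability, and essential conflict elimination); it also shows superior performance in pattern classification on real-world datasets and resolves Zadeh's classic counterexample.
Link: https://arxiv.org/abs/2508.08075
Authors: Meishen He, Wenjun Ma, Jiao Wang, Huijun Yue, Xiaoma Fan
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: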
Abstract:The Dempster-Shafer theory of evidence has been widely applied in the field of information fusion under uncertainty. Most existing research focuses on combining evidence within the same frame of discernment. However, in real-world scenarios, trained algorithms or data often originate from different regions or organizations, where data silos are prevalent. As a result, using different data sources or models to generate basic probability assignments may lead to heterogeneous frames, for which traditional fusion methods often yield unsatisfactory results. To address this challenge, this study proposes an open-world information fusion method, termed Full Negation Belief Transformation (FNBT), based on the Dempster-Shafer theory. More specifically, a criterion is introduced to determine whether a given fusion task belongs to the open-world setting. Then, by extending the frames, the method can accommodate elements from heterogeneous frames. Finally, a full negation mechanism is employed to transform the mass functions, so that existing combination rules can be applied to the transformed mass functions for such information fusion. Theoretically, the proposed method satisfies three desirable properties, which are formally proven: mass function invariance, heritability, and essential conflict elimination. Empirically, FNBT demonstrates superior performance in pattern classification tasks on real-world datasets and successfully resolves Zadeh's counterexample, thereby validating its practical effectiveness.
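As background for the Zadeh counterexample mentioned above, the following sketch implements the classical Dempster combination rule and reproduces the counterintuitive result. It shows the baseline rule only, not the FNBT transformation; the hypothesis names A, B, C and the function name `dempster_combine` are placeholders.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule.

    Returns the normalized combined masses and the conflict mass K.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb          # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}, conflict

A, B, C = frozenset("A"), frozenset("B"), frozenset("C")
m1 = {A: 0.99, B: 0.01}   # source 1 is almost sure of A
m2 = {C: 0.99, B: 0.01}   # source 2 is almost sure of C
fused, conflict = dempster_combine(m1, m2)
print(fused, conflict)
```

Although both sources give B only 0.01, normalization by 1 - K (with K about 0.9999) pushes essentially all combined mass onto B. This is the behavior that conflict-handling methods such as the one in this paper aim to avoid.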
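[AI-17] C-MAG: Cascade Multimodal Attributed Graphs for Supply Chain Link Prediction KDD2025
[Quick read]: This paper addresses inefficient matching between products and manufacturers in global supply chains, where traditional methods struggle to handle manufacturers' complex capabilities, certifications, geographic constraints, and rich multimodal data. The solution has two parts. It introduces PMGraph, a public benchmark of heterogeneous multimodal graphs covering 8,888 manufacturers, more than 70,000 products, and more than 110,000 manufacturer-product edges. It also proposes the two-stage Cascade Multimodal Attributed Graph (C-MAG) architecture, which first aligns and aggregates textual and visual attributes into intermediate group embeddings and then propagates them through the manufacturer-product heterogeneous graph via multiscale message passing, improving link prediction accuracy while offering practical guidance for modality-aware fusion in noisy settings.
Link: https://arxiv.org/abs/2508.08071
Authors: Yunqing Li, Zixiang Tang, Jiaying Zhuang, Zhenyu Yang, Farhad Ameri, Jianbang Zhang
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted as a poster presentation at the KDD 2025 Workshop on AI for Supply Chain (AI4SupplyChain)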
Abstract:Connecting an ever-expanding catalogue of products with suitable manufacturers and suppliers is critical for resilient, efficient global supply chains, yet traditional methods struggle to capture complex capabilities, certifications, geographic constraints, and rich multimodal data of real-world manufacturer profiles. To address these gaps, we introduce PMGraph, a public benchmark of bipartite and heterogeneous multimodal supply-chain graphs linking 8,888 manufacturers, over 70k products, more than 110k manufacturer-product edges, and over 29k product images. Building on this benchmark, we propose the Cascade Multimodal Attributed Graph (C-MAG), a two-stage architecture that first aligns and aggregates textual and visual attributes into intermediate group embeddings, then propagates them through a manufacturer-product hetero-graph via multiscale message passing to enhance link prediction accuracy. C-MAG also provides practical guidelines for modality-aware fusion, preserving predictive performance in noisy, real-world settings.
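[AI-18] AdaptFlow: Adaptive Workflow Optimization via Meta-Learning
[Quick read]: This paper addresses the poor adaptability and limited scalability of current agentic workflows built on Large Language Models (LLMs) when facing diverse tasks; existing methods mostly rely on static templates or manually designed workflows and generalize poorly across tasks. The key to the solution is AdaptFlow, a natural-language-based meta-learning framework inspired by Model-Agnostic Meta-Learning (MAML). It uses bi-level optimization: the inner loop refines the workflow for a specific subtask using LLM-generated feedback, while the outer loop updates a shared workflow initialization so that it performs well across tasks. This enables rapid adaptation to new tasks, and on question answering, code generation, and mathematical reasoning benchmarks it generalizes better than manually designed and automatically searched baselines.
Link: https://arxiv.org/abs/2508.08053
Authors: Runchuan Zhu, Bowen Jiang, Lingrui Mei, Fangkai Yang, Lu Wang, Haoxiang Gao, Fengshuo Bai, Pu Zhao, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: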
Abstract:Recent advances in large language models (LLMs) have sparked growing interest in agentic workflows, which are structured sequences of LLM invocations intended to solve complex tasks. However, existing approaches often rely on static templates or manually designed workflows, which limit adaptability to diverse tasks and hinder scalability. We propose AdaptFlow, a natural language-based meta-learning framework inspired by model-agnostic meta-learning (MAML). AdaptFlow learns a generalizable workflow initialization that enables rapid subtask-level adaptation. It employs a bi-level optimization scheme: the inner loop refines the workflow for a specific subtask using LLM-generated feedback, while the outer loop updates the shared initialization to perform well across tasks. This setup allows AdaptFlow to generalize effectively to unseen tasks by adapting the initialized workflow through language-guided modifications. Evaluated across question answering, code generation, and mathematical reasoning benchmarks, AdaptFlow consistently outperforms both manually crafted and automatically searched baselines, achieving state-of-the-art results with strong generalization across tasks and models. The source code and data are available at this https URL.
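[AI-19] On Understanding of the Dynamics of Model Capacity in Continual Learning
[Quick read]: This paper addresses the stability-plasticity dilemma in Continual Learning (CL): how a Neural Network (NN) can learn new tasks effectively while keeping its knowledge of old tasks stable. The central question is whether the NN's effective model capacity changes over the task sequence and thereby shifts the balance point between stability and plasticity. The key to the solution is to define and model CL's effective model capacity (CLEMC), using a difference equation that describes how the interaction among the NN, the task data, and the optimization procedure evolves. This shows that the balance point is inherently non-stationary: regardless of network architecture or optimization method, the NN's ability to represent new tasks diminishes when new task distributions differ from previous ones. The framework offers a new view of capacity dynamics in CL, supported by experiments ranging from small feedforward networks to transformer-based large language models.
Link: https://arxiv.org/abs/2508.08052
Authors: Supriyo Chakraborty, Krishnan Raghavan
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: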
Abstract:The stability-plasticity dilemma, closely related to a neural network's (NN) capacity (its ability to represent tasks), is a fundamental challenge in continual learning (CL). Within this context, we introduce CL's effective model capacity (CLEMC) that characterizes the dynamic behavior of the stability-plasticity balance point. We develop a difference equation to model the evolution of the interplay between the NN, task data, and optimization procedure. We then leverage CLEMC to demonstrate that the effective capacity, and by extension the stability-plasticity balance point, is inherently non-stationary. We show that regardless of the NN architecture or optimization method, an NN's ability to represent new tasks diminishes when incoming task distributions differ from previous ones. We conduct extensive experiments to support our theoretical findings, spanning a range of architectures, from small feedforward and convolutional networks to medium-sized graph neural networks and transformer-based large language models with millions of parameters.
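[AI-20] Multi-modal Adaptive Mixture of Experts for Cold-start Recommendation
[Quick read]: This paper addresses the performance bottleneck of recommender systems in cold-start scenarios, that is, how to recommend new items with sparse interaction histories. Existing methods use multimodal data (images, text, and so on) but typically fuse it with simple strategies such as concatenation, average pooling, or fixed weights, which fail to capture complex relationships between modalities. The key to the solution is MAMEX, a framework based on a Mixture of Experts (MoE) architecture: modality-specific expert networks and a learnable gating mechanism dynamically adjust each modality's contribution for each item, emphasizing the most informative modalities according to content characteristics while remaining robust when some modalities are missing or irrelevant.
Link: https://arxiv.org/abs/2508.08042
Authors: Van-Khang Nguyen, Duc-Hoang Pham, Huy-Son Nguyen, Cam-Van Thi Nguyen, Hoang-Quynh Le, Duc-Trong Le
Institution: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: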
Abstract:Recommendation systems have faced significant challenges in cold-start scenarios, where new items with a limited history of interaction need to be effectively recommended to users. Though multimodal data (e.g., images, text, audio, etc.) offer rich information to address this issue, existing approaches often employ simplistic integration methods such as concatenation, average pooling, or fixed weighting schemes, which fail to capture the complex relationships between modalities. Our study proposes a novel Mixture of Experts (MoE) framework for multimodal cold-start recommendation, named MAMEX, which dynamically leverages latent representations from different modalities. MAMEX utilizes modality-specific expert networks and introduces a learnable gating mechanism that adaptively weights the contribution of each modality based on its content characteristics. This approach enables MAMEX to emphasize the most informative modalities for each item while maintaining robustness when certain modalities are less relevant or missing. Extensive experiments on benchmark datasets show that MAMEX outperforms state-of-the-art methods in cold-start scenarios, with superior accuracy and adaptability. For reproducibility, the code has been made available on GitHub at this https URL.
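The gating idea in this abstract can be sketched as a softmax over per-modality scores, restricted to the modalities that are actually present. This is a generic mixture-of-experts fusion under stated assumptions, not MAMEX itself: the gate scores are passed in as plain numbers rather than produced by a learned network, and `gated_fusion` is a hypothetical name.

```python
import math

def gated_fusion(expert_outputs, gate_logits):
    """Fuse per-modality expert vectors with softmax gate weights.

    expert_outputs: dict modality -> list[float], or None if the modality is missing.
    gate_logits:    dict modality -> float (assumed to come from a learned gate).
    """
    present = [m for m, v in expert_outputs.items() if v is not None]
    if not present:
        raise ValueError("at least one modality is required")
    # Softmax over present modalities only, so missing ones get zero weight.
    peak = max(gate_logits[m] for m in present)
    exps = {m: math.exp(gate_logits[m] - peak) for m in present}
    total = sum(exps.values())
    weights = {m: e / total for m, e in exps.items()}
    dim = len(expert_outputs[present[0]])
    fused = [sum(weights[m] * expert_outputs[m][i] for m in present)
             for i in range(dim)]
    return fused, weights

outputs = {"image": [1.0, 0.0], "text": [0.0, 1.0], "audio": None}  # audio missing
logits = {"image": 0.0, "text": 0.0, "audio": 5.0}
fused, weights = gated_fusion(outputs, logits)
print(fused, weights)
```

Masking before the softmax is one simple way to stay robust to missing modalities: the large audio score is ignored and the remaining weights still sum to one.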
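[AI-21] BadPromptFL: A Novel Backdoor Threat to Prompt-based Federated Learning in Multimodal Models
[Quick read]: This paper examines security vulnerabilities in prompt-based federated multimodal learning, specifically the risk of backdoor attacks on federated prompt tuning for contrastive architectures (such as CLIP-style models) in privacy-preserving settings. The key contribution is BadPromptFL, a new backdoor attack framework in which compromised clients jointly optimize local backdoor triggers and prompt embeddings and inject poisoned prompts through global aggregation. This enables universal backdoor activation at inference time without modifying model parameters, with high stealth and limited client participation. Experiments confirm the attack's effectiveness, stealth, and generalizability across multiple datasets and aggregation protocols.
Link: https://arxiv.org/abs/2508.08040
Authors: Maozhen Zhang, Mengnan Zhao, Bo Wang
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: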
Abstract:Prompt-based tuning has emerged as a lightweight alternative to full fine-tuning in large vision-language models, enabling efficient adaptation via learned contextual prompts. This paradigm has recently been extended to federated learning settings (e.g., PromptFL), where clients collaboratively train prompts under data privacy constraints. However, the security implications of prompt-based aggregation in federated multimodal learning remain largely unexplored, leaving a critical attack surface unaddressed. In this paper, we introduce BadPromptFL, the first backdoor attack targeting prompt-based federated learning in multimodal contrastive models. In BadPromptFL, compromised clients jointly optimize local backdoor triggers and prompt embeddings, injecting poisoned prompts into the global aggregation process. These prompts are then propagated to benign clients, enabling universal backdoor activation at inference without modifying model parameters. Leveraging the contextual learning behavior of CLIP-style architectures, BadPromptFL achieves high attack success rates (e.g., 90%) with minimal visibility and limited client participation. Extensive experiments across multiple datasets and aggregation protocols validate the effectiveness, stealth, and generalizability of our attack, raising critical concerns about the robustness of prompt-based federated learning in real-world deployments.
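[AI-22] Bridging ASR and LLMs for Dysarthric Speech Recognition: Benchmarking Self-Supervised and Generative Approaches
[Quick read]: This paper addresses the low recognition accuracy on dysarthric speech, whose core challenges are phoneme distortions and high variability. The key to the solution is an LLM-enhanced decoding strategy that uses linguistic constraints for phoneme restoration and grammatical correction, improving the intelligibility of the recognized output. Experiments indicate that, compared with conventional CTC and seq2seq decoding, LLM-enhanced decoding (for example with BART, GPT-2, or Vicuna) is more robust and generalizes better across several self-supervised ASR models (such as Wav2Vec, HuBERT, and Whisper), particularly across dysarthria severity levels.
Link: https://arxiv.org/abs/2508.08027
Authors: Ahmed Aboeitta, Ahmed Sharshar, Youssef Nafea, Shady Shehata
Institution: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: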
Abstract:Dysarthric speech is challenging for Automatic Speech Recognition (ASR) due to phoneme distortions and high variability. While self-supervised ASR models like Wav2Vec, HuBERT, and Whisper have shown promise, their effectiveness in dysarthric speech remains unclear. This study systematically benchmarks these models with different decoding strategies, including CTC, seq2seq, and LLM-enhanced decoding (BART, GPT-2, Vicuna). Our contributions include (1) benchmarking ASR architectures for dysarthric speech, (2) introducing LLM-based decoding to improve intelligibility, (3) analyzing generalization across datasets, and (4) providing insights into recognition errors across severity levels. Findings highlight that LLM-enhanced decoding improves dysarthric ASR by leveraging linguistic constraints for phoneme restoration and grammatical correction.
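Of the decoding strategies listed above, CTC is the simplest to illustrate. The sketch below shows standard greedy CTC decoding (collapse repeated labels, then drop blanks) on a toy vocabulary; the vocabulary and frame labels are made up, and the LLM-enhanced stage studied in the paper is not shown.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeats, then remove blank labels."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

vocab = ["_", "c", "a", "t"]            # index 0 is the CTC blank (toy vocabulary)
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3]    # per-frame argmax labels
text = "".join(vocab[i] for i in ctc_greedy_decode(frames))
print(text)

# A blank between two identical labels keeps both, so doubled letters survive.
print(ctc_greedy_decode([3, 0, 3]))
```

Greedy decoding uses no language knowledge at all, which is why distorted phonemes pass straight through to the transcript and why a language-model stage has room to help.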
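[AI-23] Advancing Knowledge Tracing by Exploring Follow-up Performance Trends
[Quick read]: This paper addresses the correlation conflicts that existing Knowledge Tracing (KT) methods often encounter when analyzing the relationship between learners' historical learning sequences and their future performance. The key to the solution is Forward-Looking Knowledge Tracing (FINER), which extracts Follow-up Performance Trends (FPTs) from historical Intelligent Tutoring Systems (ITS) data and combines them with historical learning sequences to predict future student performance more accurately. FINER's core contributions are learning patterns that allow FPTs to be retrieved in linear time, a novel similarity-aware attention mechanism that aggregates FPTs by frequency and contextual similarity, and a mechanism for combining FPTs with historical sequences. On six real-world datasets it outperforms ten state-of-the-art KT methods, with accuracy gains of 8.74% to 84.85%.
Link: https://arxiv.org/abs/2508.08019
Authors: Hengyu Liu, Yushuai Li, Minghe Yu, Tiancheng Zhang, Ge Yu, Torben Bach Pedersen, Kristian Torp, Christian S. Jensen, Tianyi Li
Institution: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures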
Abstract:Intelligent Tutoring Systems (ITS), such as Massive Open Online Courses, offer new opportunities for human learning. At the core of such systems, knowledge tracing (KT) predicts students’ future performance by analyzing their historical learning activities, enabling an accurate evaluation of students’ knowledge states over time. We show that existing KT methods often encounter correlation conflicts when analyzing the relationships between historical learning sequences and future performance. To address such conflicts, we propose to extract so-called Follow-up Performance Trends (FPTs) from historical ITS data and to incorporate them into KT. We propose a method called Forward-Looking Knowledge Tracing (FINER) that combines historical learning sequences with FPTs to enhance student performance prediction accuracy. FINER constructs learning patterns that facilitate the retrieval of FPTs from historical ITS data in linear time; FINER includes a novel similarity-aware attention mechanism that aggregates FPTs based on both frequency and contextual similarity; and FINER offers means of combining FPTs and historical learning sequences to enable more accurate prediction of student future performance. Experiments on six real-world datasets show that FINER can outperform ten state-of-the-art KT methods, increasing accuracy by 8.74% to 84.85%.
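[AI-24] Fitting Description Logic Ontologies to ABox and Query Examples KR2025
[Quick read]: This paper studies the ontology fitting problem: given a set of positive and negative examples of the form (A, q), where A is an ABox and q is a Boolean query, find a Description Logic (DL) ontology O such that A ∪ O ⊨ q for every positive example and A ∪ O ⊭ q for every negative example. For the ontology languages ALC and ALCI, the authors consider atomic queries (AQs), conjunctive queries (CQs), and unions thereof (UCQs) as query languages, provide effective characterizations, and analyze the computational complexity of deciding whether a fitting ontology exists: the problem is coNP-complete for AQs and full CQs, and 2ExpTime-complete for CQs and UCQs. The results show that more expressive query languages raise the complexity substantially, while the characterizations and complexity bounds settle the problem at the theoretical level.
Link: https://arxiv.org/abs/2508.08007
Authors: Maurice Funk, Marvin Grosser, Carsten Lutz
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to the 22nd International Conference on Principles of Knowledge Representation and Reasoning (KR2025), 23 pages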
Abstract:We study a fitting problem inspired by ontology-mediated querying: given a collection of positive and negative examples of the form $(\mathcal{A},q)$ with $\mathcal{A}$ an ABox and $q$ a Boolean query, we seek an ontology $\mathcal{O}$ that satisfies $\mathcal{A} \cup \mathcal{O} \vDash q$ for all positive examples and $\mathcal{A} \cup \mathcal{O} \not\vDash q$ for all negative examples. We consider the description logics $\mathcal{ALC}$ and $\mathcal{ALCI}$ as ontology languages and a range of query languages that includes atomic queries (AQs), conjunctive queries (CQs), and unions thereof (UCQs). For all of the resulting fitting problems, we provide effective characterizations and determine the computational complexity of deciding whether a fitting ontology exists. This problem turns out to be coNP for AQs and full CQs and 2ExpTime-complete for CQs and UCQs. These results hold for both $\mathcal{ALC}$ and $\mathcal{ALCI}$.
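[AI-25] Learning to Select MCP Algorithms: From Traditional ML to Dual-Channel GAT-MLP
[Quick read]: This paper addresses the lack of research on algorithm selection for the Maximum Clique Problem (MCP), that is, how to automatically choose the best solver for a graph instance based on its features. The core of the solution is a learning framework that combines traditional machine learning with Graph Neural Networks (GNNs), culminating in the dual-channel GAT-MLP model: a Graph Attention Network (GAT) captures local structural information while a Multilayer Perceptron (MLP) models global statistical features, so that the best-performing MCP algorithm can be predicted and selected. Experiments show strong and consistent performance across metrics, supporting the effectiveness of dual-channel architectures and GNNs for combinatorial algorithm selection.
Link: https://arxiv.org/abs/2508.08005
Authors: Xiang Li, Shanshan Wang, Chenglong Xiao
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures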
Abstract:Extensive experiments and prior studies show that no single maximum clique algorithm consistently performs best across all instances, highlighting the importance of selecting suitable algorithms based on instance features. Through an extensive analysis of relevant studies, it is found that there is a lack of research work concerning algorithm selection oriented toward the Maximum Clique Problem (MCP). In this work, we propose a learning-based framework that integrates both traditional machine learning and graph neural networks to address this gap. We construct a labeled dataset by running four exact MCP algorithms on a diverse collection of graph instances, accompanied by structural and global statistical features extracted from each graph. We first evaluate four conventional classifiers: Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), and K-Nearest Neighbors (KNN), across multiple dataset variants. Experimental results show that RF consistently shows strong performance across metrics and dataset variants, making it a reliable baseline. In addition, feature importance analysis indicates that connectivity and topological structure are strong predictors of algorithm performance. Building on these findings, we develop a dual-channel model named GAT-MLP, which combines a Graph Attention Network (GAT) for local structural encoding with a Multilayer Perceptron (MLP) for global feature modeling. The GAT-MLP model shows strong and consistent performance across all metrics. Our results highlight the effectiveness of dual-channel architectures and the promise of graph neural networks in combinatorial algorithm selection.
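[AI-26] Interpreting Fedspeak with Confidence: A LLM-Based Uncertainty-Aware Framework Guided by Monetary Policy Transmission Paths
[Quick read]: This paper addresses how to automatically parse and interpret "Fedspeak", the stylized language used by the U.S. Federal Reserve that encodes implicit policy signals and strategic stances, in order to classify the underlying monetary policy stance accurately. The challenge matters for financial forecasting, algorithmic trading, and data-driven policy analysis. The key to the solution is an uncertainty-aware framework based on a Large Language Model (LLM): it first enriches the semantic and contextual representation of the text with domain knowledge about the monetary policy transmission mechanism, and then adds a dynamic uncertainty decoding module that quantifies the confidence of model predictions, improving classification accuracy and reliability. Experiments show state-of-the-art performance on monetary policy stance analysis, and statistical analysis finds a significant positive correlation between perceptual uncertainty and model error rates, supporting uncertainty as a diagnostic signal.
Link: https://arxiv.org/abs/2508.08001
Authors: Rui Yao (1), Qi Chai (1 and 3), Jinhai Yao (2), Siyuan Li (1), Junhao Chen (1), Qi Zhang (2), Hao Wang (1) ((1) The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China, (2) Shanghai Jiaotong University, Shanghai, China, (3) Xi'an Jiaotong University, Xi'an, China)
Institution: The Hong Kong University of Science and Technology (Guangzhou); Shanghai Jiaotong University; Xi'an Jiaotong University
Categories: Artificial Intelligence (cs.AI)
Comments: Rui Yao, Qi Chai, and Jinhai Yao contributed equally to this work. Corresponding authors: Qi Zhang and Hao Wang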
Abstract:“Fedspeak”, the stylized and often nuanced language used by the U.S. Federal Reserve, encodes implicit policy signals and strategic stances. The Federal Open Market Committee strategically employs Fedspeak as a communication tool to shape market expectations and influence both domestic and global economic conditions. As such, automatically parsing and interpreting Fedspeak presents a high-impact challenge, with significant implications for financial forecasting, algorithmic trading, and data-driven policy analysis. In this paper, we propose an LLM-based, uncertainty-aware framework for deciphering Fedspeak and classifying its underlying monetary policy stance. Technically, to enrich the semantic and contextual representation of Fedspeak texts, we incorporate domain-specific reasoning grounded in the monetary policy transmission mechanism. We further introduce a dynamic uncertainty decoding module to assess the confidence of model predictions, thereby enhancing both classification accuracy and model reliability. Experimental results demonstrate that our framework achieves state-of-the-art performance on the policy stance analysis task. Moreover, statistical analysis reveals a significant positive correlation between perceptual uncertainty and model error rates, validating the effectiveness of perceptual uncertainty as a diagnostic signal.
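The abstract does not specify how its uncertainty module scores confidence. As a generic stand-in, the sketch below uses normalized Shannon entropy of a class distribution, a common proxy for predictive uncertainty. The three stance labels and the logits are made-up illustrations, and this is not the paper's decoding module.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1]; 1 means a uniform, maximally unsure prediction."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

labels = ["dovish", "neutral", "hawkish"]      # hypothetical stance labels
confident = softmax([0.1, 0.2, 6.0])           # one class dominates
unsure = softmax([1.0, 1.0, 1.0])              # no class preferred

print(normalized_entropy(confident), normalized_entropy(unsure))
```

If a score of this kind tracks error rates, as the abstract reports for its own uncertainty measure, high-entropy predictions can be routed to human review instead of being acted on automatically.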
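[AI-27] DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval
[Quick read]: This paper addresses the weak performance of existing Retrieval-Augmented Generation (RAG) systems on complex queries that require abstract reasoning, analogical thinking, or multi-step inference, where traditional retrievers struggle to capture the deeper semantic links between queries and documents. The key to the solution is DIVER, a retrieval pipeline designed for reasoning-intensive information retrieval, with four components: (1) document preprocessing to improve input quality; (2) iterative query expansion driven by a Large Language Model (LLM), which refines the query through interaction with documents; (3) a reasoning-enhanced retriever fine-tuned on synthetic multi-domain data with hard negatives; and (4) a pointwise reranker that combines LLM-assigned helpfulness scores with conventional retrieval scores. On the BRIGHT benchmark DIVER outperforms existing reasoning-aware models, supporting its effectiveness on complex real-world tasks.
Link: https://arxiv.org/abs/2508.07995
Authors: Meixiu Long, Duolin Sun, Dan Yang, Junjie Wang, Yue Shen, Jian Wang, Peng Wei, Jinjie Gu, Jiahai Wang
Institution: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: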
Abstract:Retrieval-augmented generation has achieved strong performance on knowledge-intensive tasks where query-document relevance can be identified through direct lexical or semantic matches. However, many real-world queries involve abstract reasoning, analogical thinking, or multi-step inference, which existing retrievers often struggle to capture. To address this challenge, we present DIVER, a retrieval pipeline tailored for reasoning-intensive information retrieval. DIVER consists of four components: document processing to improve input quality, LLM-driven query expansion via iterative document interaction, a reasoning-enhanced retriever fine-tuned on synthetic multi-domain data with hard negatives, and a pointwise reranker that combines LLM-assigned helpfulness scores with retrieval scores. On the BRIGHT benchmark, DIVER achieves state-of-the-art nDCG@10 scores of 41.6 and 28.9 on original queries, consistently outperforming competitive reasoning-aware models. These results demonstrate the effectiveness of reasoning-aware retrieval strategies in complex real-world tasks. Our code and retrieval model will be released soon.
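The last component, a pointwise reranker that mixes LLM helpfulness scores with retrieval scores, might look roughly like the sketch below. The abstract does not give the fusion formula, so the min-max normalization, the linear interpolation weight `alpha`, and the function names are all assumptions.

```python
def minmax(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def rerank(doc_ids, retrieval_scores, helpfulness_scores, alpha=0.5):
    """Order documents by alpha * helpfulness + (1 - alpha) * retrieval (both normalized)."""
    r = minmax(retrieval_scores)
    h = minmax(helpfulness_scores)
    fused = [alpha * h_i + (1 - alpha) * r_i for h_i, r_i in zip(h, r)]
    order = sorted(range(len(doc_ids)), key=lambda i: fused[i], reverse=True)
    return [doc_ids[i] for i in order]

docs = ["d1", "d2", "d3"]
retrieval = [0.9, 0.8, 0.1]      # e.g. dense-retriever similarities (made up)
helpfulness = [1, 5, 2]          # e.g. LLM ratings on a 1-5 scale (made up)
print(rerank(docs, retrieval, helpfulness))
print(rerank(docs, retrieval, helpfulness, alpha=0.0))
```

Normalizing each score list before mixing keeps one signal from dominating only because of its scale; with `alpha=0.0` the function falls back to the plain retrieval order.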
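[AI-28] WeChat-YATT: A Simple, Scalable and Balanced RLHF Trainer
[Quick read]: This paper addresses two core challenges that current reinforcement learning from human feedback (RLHF) training frameworks face when scaling to complex multimodal workflows and dynamic workloads: the scalability bottleneck of the controller when managing large models, and inefficient scheduling of RLHF pipelines under dynamic sampling and resource allocation. The key to the solution is WeChat-YATT (Yet Another Transformer Trainer in WeChat), which contributes two innovations: a parallel controller programming model that orchestrates complex RLHF workflows flexibly and efficiently, easing the bottleneck of centralized controller architectures; and a dynamic placement schema that adaptively partitions compute resources and schedules tasks as the training load changes, markedly reducing hardware idle time and improving GPU utilization. Experiments show higher throughput than existing state-of-the-art RLHF frameworks, and the system has been deployed in WeChat products to support large-scale user scenarios.
Link: https://arxiv.org/abs/2508.07970
Authors: Junyu Wu,Weiming Chang,Xiaotao Liu,Guanyou He,Tingfeng Xian,Haoqiang Hong,Boqi Chen,Haotao Tian,Tao Yang,Yunsheng Shi,Feng Lin,Ting Yao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2507.22789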
Abstract:Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent paradigm for training large language models and multimodal systems. Despite notable advances enabled by existing RLHF training frameworks, significant challenges remain in scaling to complex multimodal workflows and adapting to dynamic workloads. In particular, current systems often encounter limitations related to controller scalability when managing large models, as well as inefficiencies in orchestrating intricate RLHF pipelines, especially in scenarios that require dynamic sampling and resource allocation. In this paper, we introduce WeChat-YATT (Yet Another Transformer Trainer in WeChat), a simple, scalable, and balanced RLHF training framework specifically designed to address these challenges. WeChat-YATT features a parallel controller programming model that enables flexible and efficient orchestration of complex RLHF workflows, effectively mitigating the bottlenecks associated with centralized controller architectures and facilitating scalability in large-scale data scenarios. In addition, we propose a dynamic placement schema that adaptively partitions computational resources and schedules workloads, thereby significantly reducing hardware idle time and improving GPU utilization under variable training conditions. We evaluate WeChat-YATT across a range of experimental scenarios, demonstrating that it achieves substantial improvements in throughput compared to state-of-the-art RLHF training frameworks. Furthermore, WeChat-YATT has been successfully deployed to train models supporting WeChat product features for a large-scale user base, underscoring its effectiveness and robustness in real-world applications.
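[AI-29] Exploring the Challenges and Opportunities of AI-assisted Codebase Generation
[Quick read]: This paper addresses the low real-world adoption of codebase-level AI assistants (codebase AI assistants, CBAs), focusing on how developers interact with CBAs, the pain points they hit, and why CBAs fail. The key to the solution is a counterbalanced user study with interviews (n = 16) that systematically identifies six underlying challenges developers face when using CBAs and five barriers to workflow integration, followed by a capability comparison of 21 commercial CBAs, from which the authors derive design opportunities for more efficient and useful CBAs.
Link: https://arxiv.org/abs/2508.07966
Authors: Philipp Eibl,Sadra Sabouri,Souti Chattopadhyay
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: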
Abstract:Recent AI code assistants have significantly improved their ability to process more complex contexts and generate entire codebases based on a textual description, compared to the popular snippet-level generation. These codebase AI assistants (CBAs) can also extend or adapt codebases, allowing users to focus on higher-level design and deployment decisions. While prior work has extensively studied the impact of snippet-level code generation, this new class of codebase generation models is relatively unexplored. Despite initial anecdotal reports of excitement about these agents, they remain less frequently adopted compared to snippet-level code assistants. To utilize CBAs better, we need to understand how developers interact with CBAs, and how and why CBAs fall short of developers’ needs. In this paper, we explored these gaps through a counterbalanced user study and interview with (n = 16) students and developers working on coding tasks with CBAs. We found that participants varied the information in their prompts, like problem description (48% of prompts), required functionality (98% of prompts), code structure (48% of prompts), and their prompt writing process. Despite various strategies, the overall satisfaction score with generated codebases remained low (mean = 2.8, median = 3, on a scale of one to five). Participants mentioned functionality as the most common factor for dissatisfaction (77% of instances), alongside poor code quality (42% of instances) and communication issues (25% of instances). We delve deeper into participants’ dissatisfaction to identify six underlying challenges that participants faced when using CBAs, and extracted five barriers to incorporating CBAs into their workflows. Finally, we surveyed 21 commercial CBAs to compare their capabilities with participant challenges and present design opportunities for more efficient and useful CBAs.
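[AI-30] SCDF: A Speaker Characteristics DeepFake Speech Dataset for Bias Analysis
[Quick read]: This paper addresses the neglect of bias and fairness in deepfake speech detection. The core challenge is that existing detectors perform inconsistently across speaker and data attributes (such as sex, age, language, and synthesizer type), which can lead to unfair detection outcomes. The key to the solution is the Speaker Characteristics Deepfake (SCDF) dataset: a balanced, annotated dataset of more than 237,000 utterances covering five languages, with balanced sex representation and a wide age range. By evaluating several state-of-the-art detectors on it, the study shows that speaker characteristics significantly affect detection performance, providing an empirical and methodological basis for bias-aware, fair detection systems.
Link: https://arxiv.org/abs/2508.07944
Authors: Vojtěch Staněk,Karel Srna,Anton Firc,Kamil Malinka
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: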
Abstract:Despite growing attention to deepfake speech detection, the aspects of bias and fairness remain underexplored in the speech domain. To address this gap, we introduce the Speaker Characteristics Deepfake (SCDF) dataset: a novel, richly annotated resource enabling systematic evaluation of demographic biases in deepfake speech detection. SCDF contains over 237,000 utterances in a balanced representation of both male and female speakers spanning five languages and a wide age range. We evaluate several state-of-the-art detectors and show that speaker characteristics significantly influence detection performance, revealing disparities across sex, language, age, and synthesizer type. These findings highlight the need for bias-aware development and provide a foundation for building non-discriminatory deepfake detection systems aligned with ethical and regulatory standards.
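A disparity analysis of the kind described (detection performance broken down by sex, language, age, or synthesizer type) might be sketched as follows. The record layout, the field names, and the choice of accuracy as the metric are assumptions made for illustration; the paper's own evaluation protocol is not given in the abstract.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return detector accuracy per value of a speaker attribute.

    records: dicts with hypothetical fields 'label', 'pred', and the
    attribute named by `key` (for example 'sex' or 'language').
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for rec in records:
        group = rec[key]
        counts[group][0] += int(rec["pred"] == rec["label"])
        counts[group][1] += 1
    return {group: correct / total for group, (correct, total) in counts.items()}


def max_gap(per_group):
    """Largest accuracy difference between any two groups."""
    return max(per_group.values()) - min(per_group.values())
```

A large `max_gap` for one attribute would flag that attribute as a candidate source of bias worth a closer look.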
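[AI-31] Deep Reinforcement Learning with anticipatory reward in LSTM for Collision Avoidance of Mobile Robots
[Quick read]: This paper addresses collision risk in multi-robot systems that lack communication and identifiers, in particular achieving safe coordinated motion in constrained environments. The key to the solution is to model each robot's past trajectory with a Long Short-Term Memory (LSTM) network, predict its next position, and use that prediction to dynamically adjust the reward function of a Deep Q-Network (DQN), so that potential collisions are anticipated and avoided during reinforcement learning. The method is computationally inexpensive and suited to deployment on embedded systems.
Link: https://arxiv.org/abs/2508.07941
Authors: Olivier Poulet,Frédéric Guinand(RI2C - LITIS),François Guérin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: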
Abstract:This article proposes a collision risk anticipation method based on short-term prediction of the agents' positions. A Long Short-Term Memory (LSTM) model, trained on past trajectories, is used to estimate the next position of each robot. This prediction allows us to define an anticipated collision risk by dynamically modulating the reward of a Deep Q-Learning Network (DQN) agent. The approach is tested in a constrained environment, where two robots move without communication or identifiers. Despite a limited sampling frequency (1 Hz), the results show a significant decrease in the number of collisions and improved stability. The proposed method, which is computationally inexpensive, appears particularly attractive for implementation on embedded systems.
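The idea of modulating the reward with a predicted next position can be sketched in a few lines. To stay self-contained, the example swaps the LSTM for a constant-velocity extrapolation, and the linear penalty with constants `safe_dist` and `penalty` is an assumed shape; none of this is taken from the paper.

```python
import math


def predict_next(trajectory):
    """Stand-in for the LSTM predictor: constant-velocity extrapolation.

    trajectory: list of (x, y) positions, most recent last (at least two).
    """
    (x0, y0), (x1, y1) = trajectory[-2], trajectory[-1]
    return (2 * x1 - x0, 2 * y1 - y0)


def anticipatory_reward(base_reward, own_traj, other_traj,
                        safe_dist=1.0, penalty=5.0):
    """Reduce the reward when the two predicted positions come too close.

    safe_dist and penalty are hypothetical tuning constants.
    """
    own_x, own_y = predict_next(own_traj)
    oth_x, oth_y = predict_next(other_traj)
    dist = math.hypot(own_x - oth_x, own_y - oth_y)
    if dist >= safe_dist:
        return base_reward
    risk = 1.0 - dist / safe_dist  # 0 at safe_dist, 1 when positions coincide
    return base_reward - penalty * risk
```

Because the penalty depends on predicted rather than current positions, the agent is discouraged from a move one step before a collision would occur.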
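[AI-32] X-evolve: Solution space evolution powered by large language models
[Quick read]: This paper addresses the high LLM call cost that arises when large language models (LLMs) are combined with evolutionary algorithms (EAs) for complex optimization problems and individual solutions are evolved one at a time. The key to the solution is X-evolve, a paradigm-shifting method that evolves solution spaces X (subsets of the search space S) rather than single solutions: the LLM generates programs with tunable parameters that define a parameterized solution space, and a score-based search algorithm explores that space efficiently. This yields broader and more efficient global search with far fewer LLM calls, improving optimization efficiency and convergence speed.
Link: https://arxiv.org/abs/2508.07932
Authors: Yi Zhai,Zhiqiang Wei,Ruohan Li,Keyu Pan,Shuo Liu,Lu Zhang,Jianmin Ji,Wuyang Zhang,Yu Zhang,Yanyong Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: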
Abstract:While combining large language models (LLMs) with evolutionary algorithms (EAs) shows promise for solving complex optimization problems, current approaches typically evolve individual solutions, often incurring high LLM call costs. We introduce $X$-evolve, a paradigm-shifting method that instead evolves solution spaces $X$ (sets of individual solutions) - subsets of the overall search space $S$. In $X$-evolve, LLMs generate tunable programs wherein certain code snippets, designated as parameters, define a tunable solution space. A score-based search algorithm then efficiently explores this parametrically defined space, guided by feedback from objective function scores. This strategy enables broader and more efficient exploration, which can potentially accelerate convergence at a much lower search cost, requiring up to two orders of magnitude fewer LLM calls than prior leading methods. We demonstrate $X$-evolve’s efficacy across three distinct hard optimization problems. For the cap set problem, we discover a larger partial admissible set, establishing a new tighter asymptotic lower bound for the cap set constant ($C \ge 2.2203$). In information theory, we uncover a larger independent set for the 15-vertex cycle graph ($\mathcal{C}_{15}^{\boxtimes 5}$, size 19,946), thereby raising the known lower bound on its Shannon capacity. Furthermore, for the NP-hard online bin packing problem, we generate heuristics that consistently outperform standard strategies across established benchmarks. By evolving solution spaces, our method considerably improves search effectiveness, making it possible to tackle high-dimensional problems that were previously computationally prohibitive.
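A toy version of "one tunable program defines a whole solution space, which a cheap search then explores without further LLM calls" is sketched below. The slot names, the greedy coordinate search, and the quadratic objective are stand-ins chosen for brevity; the paper's score-based search algorithm is not specified in the abstract and is not reproduced here.

```python
def coordinate_search(space, score, sweeps=3):
    """Greedy search over a parameterized solution space.

    space: dict mapping each tunable slot to its list of candidate values
           (in the paper's setting these would be LLM-proposed snippets).
    score: objective function on a full assignment; higher is better.
    """
    current = {name: options[0] for name, options in space.items()}
    best = score(current)
    for _ in range(sweeps):
        improved = False
        for name, options in space.items():
            for option in options:
                trial = dict(current, **{name: option})
                trial_score = score(trial)
                if trial_score > best:
                    current, best, improved = trial, trial_score, True
        if not improved:
            break  # no single-slot change helps any more
    return current, best


# Hypothetical solution space with two tunable slots and a toy objective.
space = {"a": [0, 1, 2, 3], "b": [-1, 0, 1]}


def toy_score(params):
    return -(params["a"] - 2) ** 2 - (params["b"] - 1) ** 2
```

Here a single "program" covers 12 candidate solutions, and all of them are ranked by calling only the objective function, which is the cost structure the abstract attributes to evolving spaces rather than individuals.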
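[AI-33] Not Yet AlphaFold for the Mind: Evaluating Centaur as a Synthetic Participant
[Quick read]: This paper asks how to assess whether a large language model (LLM) can serve as a reliable participant simulator, i.e., whether it can generate simulated data in cognitive tasks that closely matches human behavior, thereby supporting "in silico prototyping" to speed up experimental design in cognitive science. The key to the solution is to set out and apply a set of core criteria for evaluating such models, with particular emphasis on generative behavior: a model needs not only high predictive accuracy but also behavior patterns that systematically approximate real human data. The study finds that although Centaur predicts well, its generative behavior still diverges markedly from human data, so it does not yet meet the standard of a reliable participant simulator.
Link: https://arxiv.org/abs/2508.07887
Authors: Sabrina Namazova,Alessandra Brondetta,Younes Strittmatter,Matthew Nassar,Sebastian Musslick
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: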
Abstract:Simulators have revolutionized scientific practice across the natural sciences. By generating data that reliably approximate real-world phenomena, they enable scientists to accelerate hypothesis testing and optimize experimental designs. This is perhaps best illustrated by AlphaFold, a Nobel-prize winning simulator in chemistry that predicts protein structures from amino acid sequences, enabling rapid prototyping of molecular interactions, drug targets, and protein functions. In the behavioral sciences, a reliable participant simulator - a system capable of producing human-like behavior across cognitive tasks - would represent a similarly transformative advance. Recently, Binz et al. introduced Centaur, a large language model (LLM) fine-tuned on human data from 160 experiments, proposing its use not only as a model of cognition but also as a participant simulator for “in silico prototyping of experimental studies”, e.g., to advance automated cognitive science. Here, we review the core criteria for a participant simulator and assess how well Centaur meets them. Although Centaur demonstrates strong predictive accuracy, its generative behavior - a critical criterion for a participant simulator - systematically diverges from human data. This suggests that, while Centaur is a significant step toward predicting human behavior, it does not yet meet the standards of a reliable participant simulator or an accurate model of cognition.
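[AI-34] Vertex Features for Neural Global Illumination SIGGRAPH
[Quick read]: This paper addresses the excessive memory footprint of conventional feature-grid neural representations in 3D scene reconstruction and neural rendering, which limits performance on modern parallel computing hardware. The core solution is neural vertex features: instead of distributing neural features uniformly throughout 3D space, learnable features are stored directly at the vertices of explicit mesh surfaces, using the geometry as a compact, structured representation. By introducing task-specific geometric priors, the method not only cuts memory consumption substantially (to one-fifth or less of grid-based representations) but also improves feature expressiveness and reduces inference overhead.
Link: https://arxiv.org/abs/2508.07852
Authors: Rui Su,Honghao Dong,Haojie Jin,Yisong Chen,Guoping Wang,Sheng Li
Affiliation: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI)
Comments: Accepted by ACM SIGGRAPH Asia’2025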
Abstract:Recent research on learnable neural representations has been widely adopted in the field of 3D scene reconstruction and neural rendering applications. However, traditional feature grid representations often suffer from substantial memory footprint, posing a significant bottleneck for modern parallel computing hardware. In this paper, we present neural vertex features, a generalized formulation of learnable representation for neural rendering tasks involving explicit mesh surfaces. Instead of uniformly distributing neural features throughout 3D space, our method stores learnable features directly at mesh vertices, leveraging the underlying geometry as a compact and structured representation for neural processing. This not only optimizes memory efficiency, but also improves feature representation by aligning compactly with the surface using task-specific geometric priors. We validate our neural representation across diverse neural rendering tasks, with a specific emphasis on neural radiosity. Experimental results demonstrate that our method reduces memory consumption to only one-fifth (or even less) of grid-based representations, while maintaining comparable rendering quality and lowering inference overhead.
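[AI-35] DETACH: Cross-domain Learning for Long-Horizon Tasks via Mixture of Disentangled Experts AAAI’26
[Quick read]: This paper addresses the weak generalization of existing methods for long-horizon (LH) human-scene interaction tasks, which rely on skill chaining: because environment observations and self-state are coupled, these methods adapt poorly to new environments and new skill combinations and cannot complete complex multi-step tasks across domains. The key to the solution is DETACH, a dual-stream disentanglement framework inspired by the biological dual-pathway mechanism, which learns environment and skill separately through two independent modules: an environment learning module that extracts object functions, spatial relationships, and scene semantics for cross-domain environment transfer; and a skill learning module that independently encodes joint degrees of freedom and motor patterns for cross-skill transfer. The design improves the LH subtask success rate by 23% on average and execution efficiency by 29% on average.
Link: https://arxiv.org/abs/2508.07842
Authors: Yutong Shen,Hangxu Liu,Penghui Liu,Ruizhe Xia,Tianyi Yao,Yitong Sun,Tongtong Feng
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 14 pages,8 figures. Submitted to AAAI’26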
Abstract:Long-Horizon (LH) tasks in Human-Scene Interaction (HSI) are complex multi-step tasks that require continuous planning, sequential decision-making, and extended execution across domains to achieve the final goal. However, existing methods rely heavily on skill chaining by concatenating pre-trained subtasks, with environment observations and self-state tightly coupled. As a result, they cannot generalize to new combinations of environments and skills and fail to complete various LH tasks across domains. To solve this problem, this paper presents DETACH, a cross-domain learning framework for LH tasks via biologically inspired dual-stream disentanglement. Inspired by the brain’s “where-what” dual pathway mechanism, DETACH comprises two core modules: i) an environment learning module for spatial understanding, which captures object functions, spatial relationships, and scene semantics, achieving cross-domain transfer through complete environment-self disentanglement; ii) a skill learning module for task execution, which processes self-state information including joint degrees of freedom and motor patterns, enabling cross-skill transfer through independent motor pattern encoding. We conducted extensive experiments on various LH tasks in HSI scenes. Compared with existing methods, DETACH achieves an average subtask success rate improvement of 23% and an average execution efficiency improvement of 29%.
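[AI-36] KIRETT: Knowledge-Graph-Based Smart Treatment Assistant for Intelligent Rescue Operations
[Quick read]: This paper addresses the difficulty first responders have in quickly obtaining personalized, optimized medical decision support under time pressure in emergency rescue scenarios. As demographic change increases health risks, first responders must assess the patient's condition and provide effective treatment in the shortest possible time, yet their expertise and experience are often insufficient for complex, rapidly changing emergencies. The key to the solution is an intelligent knowledge management system centered on a Knowledge Graph, which combines structured knowledge that is calculated, evaluated, and processed on the spot with artificial intelligence (AI)-based pre-recognition of the situation to give first responders intelligent treatment recommendations, improving the efficiency and accuracy of emergency care.
Link: https://arxiv.org/abs/2508.07834
Authors: Mubaris Nadeem,Johannes Zenkert,Lisa Bender,Christian Weber,Madjid Fathi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: LWDA’23, KIRETT project, University of Siegen, Germany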
Abstract:Over the years, the need for rescue operations throughout the world has increased rapidly. Demographic changes and the resulting risk of injury or health disorders form the basis for emergency calls. In such scenarios, first responders are in a rush to reach the patient in need, provide first aid, and save lives. In these situations, they must be able to provide personalized and optimized healthcare in the shortest possible time and estimate the patient’s condition with the help of freshly recorded vital data in an emergency situation. However, in such a time-dependent situation, first responders and medical experts cannot draw on their full knowledge and need assistance and recommendations for further medical treatment. To achieve this, knowledge that is calculated, evaluated, and processed on the spot must be made available to improve treatments by first responders. The Knowledge Graph presented in this article as a central knowledge representation provides first responders with an innovative knowledge management that enables intelligent treatment recommendations with an artificial intelligence-based pre-recognition of the situation.
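[AI-37] Best-Effort Policies for Robust Markov Decision Processes
[Quick read]: This paper addresses policy selection in robust Markov decision processes (robust MDPs, RMDPs) when multiple optimal robust policies exist. In the standard RMDP framework, if the uncertainty is independent between states (s-rectangularity), a policy that maximizes the worst-case expected return can be computed efficiently with robust value iteration; however, such policies, although equivalent in the worst case, may differ in expected return under non-adversarial transition probabilities. The authors therefore propose a new selection criterion, the optimal robust best-effort (ORBE) policy, which requires optimal worst-case performance and additionally maximizes expected return under non-adversarial transition probabilities. ORBE policies are proven to exist, and the paper characterizes their structure and gives an algorithm that computes them with little extra overhead, providing a principled way to distinguish among multiple optimal robust policies.
Link: https://arxiv.org/abs/2508.07790
Authors: Alessandro Abate,Thom Badings,Giuseppe De Giacomo,Francesco Fabiano
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: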
Abstract:We study the common generalization of Markov decision processes (MDPs) with sets of transition probabilities, known as robust MDPs (RMDPs). A standard goal in RMDPs is to compute a policy that maximizes the expected return under an adversarial choice of the transition probabilities. If the uncertainty in the probabilities is independent between the states, known as s-rectangularity, such optimal robust policies can be computed efficiently using robust value iteration. However, there might still be multiple optimal robust policies, which, while equivalent with respect to the worst-case, reflect different expected returns under non-adversarial choices of the transition probabilities. Hence, we propose a refined policy selection criterion for RMDPs, drawing inspiration from the notions of dominance and best-effort in game theory. Instead of seeking a policy that only maximizes the worst-case expected return, we additionally require the policy to achieve a maximal expected return under different (i.e., not fully adversarial) transition probabilities. We call such a policy an optimal robust best-effort (ORBE) policy. We prove that ORBE policies always exist, characterize their structure, and present an algorithm to compute them with a small overhead compared to standard robust value iteration. ORBE policies offer a principled tie-breaker among optimal robust policies. Numerical experiments show the feasibility of our approach.
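For readers unfamiliar with robust value iteration, the baseline that ORBE policies refine, a minimal sketch follows. It assumes a simplified setting in which each state-action pair has its own finite list of candidate transition distributions (a special case of rectangular uncertainty); the data layout and function name are illustrative, and the ORBE tie-breaking step itself is not implemented here.

```python
def robust_value_iteration(states, actions, reward, kernels,
                           gamma=0.9, iters=200):
    """Worst-case value iteration over finite uncertainty sets.

    reward[(s, a)]  : immediate reward.
    kernels[(s, a)] : list of candidate next-state distributions, each a
                      dict {next_state: probability} (assumed layout).
    """
    def q_value(s, a, values):
        # The adversary picks the distribution that minimizes the return.
        return min(
            reward[(s, a)] + gamma * sum(p * values[t] for t, p in dist.items())
            for dist in kernels[(s, a)]
        )

    values = {s: 0.0 for s in states}
    for _ in range(iters):
        values = {s: max(q_value(s, a, values) for a in actions) for s in states}
    policy = {s: max(actions, key=lambda a: q_value(s, a, values)) for s in states}
    return values, policy
```

When several actions attain the same worst-case value in a state, `max` simply returns the first one; that arbitrary choice is exactly the gap a best-effort criterion is meant to close.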
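[AI-38] Sparse Probabilistic Graph Circuits
[Quick read]: This paper addresses the intractability of probabilistic inference in deep generative models (DGMs) for graph-structured data, which is caused by their non-linear activation functions. The recently proposed Probabilistic Graph Circuits (PGCs) mitigate this by providing a tractable inference framework, but their dense graph representation incurs O(n^2) complexity, which limits their use on large sparse graphs. The paper proposes Sparse PGCs (SPGCs), whose key innovation is to model sparse graph representations directly, reducing complexity to O(n + m), where n is the number of nodes and m the number of edges; this is especially useful when m ≪ n^2. Experiments show that SPGCs retain exact inference while markedly improving memory efficiency and inference speed, and match the performance of intractable DGMs on de novo drug design.
Link: https://arxiv.org/abs/2508.07763
Authors: Martin Rektoris,Milan Papež,Václav Šmídl,Tomáš Pevný
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: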
Abstract:Deep generative models (DGMs) for graphs achieve impressively high expressive power thanks to very efficient and scalable neural networks. However, these networks contain non-linearities that prevent analytical computation of many standard probabilistic inference queries, i.e., these DGMs are considered intractable. While recently proposed Probabilistic Graph Circuits (PGCs) address this issue by enabling tractable probabilistic inference, they operate on dense graph representations with $\mathcal{O}(n^2)$ complexity for graphs with $n$ nodes and $m$ edges. To address this scalability issue, we introduce Sparse PGCs, a new class of tractable generative models that operate directly on sparse graph representation, reducing the complexity to $\mathcal{O}(n + m)$, which is particularly beneficial for $m \ll n^2$. In the context of de novo drug design, we empirically demonstrate that SPGCs retain exact inference capabilities, improve memory efficiency and inference speed, and match the performance of intractable DGMs in key metrics.
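The gap between the two complexities comes from the graph representation alone, which a short sketch can make concrete. The example only counts stored entries for a dense adjacency matrix versus a node-plus-edge-list encoding; it says nothing about the circuits themselves, and the function names are made up for illustration.

```python
def dense_to_sparse(adj):
    """Convert a symmetric 0/1 adjacency matrix to (node count, edge list)."""
    n = len(adj)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if adj[i][j]]
    return n, edges


def storage_cost(n, edges):
    """Entries per representation: n^2 for dense, n + m for sparse."""
    return {"dense": n * n, "sparse": n + len(edges)}


# A 4-node path graph 0-1-2-3, loosely resembling a small molecular chain.
path_adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
```

Molecular graphs have bounded node degree, so m grows roughly linearly with n and the sparse count stays far below n^2 as molecules get larger.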
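[AI-39] Chimera: Harnessing Multi-Agent LLMs for Automatic Insider Threat Simulation
[Quick read]: This paper addresses the scarcity of high-quality labeled data in insider threat detection (ITD) research. Existing public datasets either lack real-world coverage because of their limited scale, or, being purely synthetic, fail to reflect complex user behavior and semantic features, which constrains model training and evaluation. The key to the solution is Chimera, the first large language model (LLM)-based multi-agent framework, which generates diverse logs of benign and malicious insider activity by simulating employee role behavior, organizational interactions (such as group meetings and one-on-one exchanges), and autonomous task scheduling. It integrates 15 typical insider attack types and has been deployed in sensitive domains including a technology company, a financial institution, and a medical institution, producing the ChimeraLog dataset, whose greater realism and complexity support the evaluation and advancement of ITD methods in realistic settings.
Link: https://arxiv.org/abs/2508.07745
Authors: Jiongchi Yu,Xiaofei Xie,Qiang Hu,Yuhan Ma,Ziming Zhao
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 23 pages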
Abstract:Insider threats, which can lead to severe losses, remain a major security concern. While machine learning-based insider threat detection (ITD) methods have shown promising results, their progress is hindered by the scarcity of high-quality data. Enterprise data is sensitive and rarely accessible, while publicly available datasets, when limited in scale due to cost, lack sufficient real-world coverage; and when purely synthetic, they fail to capture rich semantics and realistic user behavior. To address this, we propose Chimera, the first large language model (LLM)-based multi-agent framework that automatically simulates both benign and malicious insider activities and collects diverse logs across diverse enterprise environments. Chimera models each employee with agents that have role-specific behavior and integrates modules for group meetings, pairwise interactions, and autonomous scheduling, capturing realistic organizational dynamics. It incorporates 15 types of insider attacks (e.g., IP theft, system sabotage) and has been deployed to simulate activities in three sensitive domains: technology company, finance corporation, and medical institution, producing a new dataset, ChimeraLog. We assess ChimeraLog via human studies and quantitative analysis, confirming its diversity, realism, and presence of explainable threat patterns. Evaluations of existing ITD methods show an average F1-score of 0.83, which is significantly lower than 0.99 on the CERT dataset, demonstrating ChimeraLog’s higher difficulty and utility for advancing ITD research.
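[AI-40] Symmetry-Aware Transformer Training for Automated Planning
[Quick read]: This paper addresses the poor extrapolation of Transformer-based planning models (such as PlanGPT) from easy to hard planning problems. The root cause is the variable-naming symmetry in planning tasks, which produces a combinatorial explosion of equivalent representations that pure Transformers, lacking inductive bias, cannot learn from efficiently. The key to the solution is a novel contrastive learning objective that makes Transformers symmetry-aware, combined with architectural improvements, which markedly improves training efficiency and generalization on plan-generation and heuristic-prediction tasks.
Link: https://arxiv.org/abs/2508.07743
Authors: Markus Fritzsche,Elliot Gestrin,Jendrik Seipp
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: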
Abstract:While transformers excel in many settings, their application in the field of automated planning is limited. Prior work like PlanGPT, a state-of-the-art decoder-only transformer, struggles with extrapolation from easy to hard planning problems. This in turn stems from problem symmetries: planning tasks can be represented with arbitrary variable names that carry no meaning beyond being identifiers. This causes a combinatorial explosion of equivalent representations that pure transformers cannot efficiently learn from. We propose a novel contrastive learning objective to make transformers symmetry-aware and thereby compensate for their lack of inductive bias. Combining this with architectural improvements, we show that transformers can be efficiently trained for either plan-generation or heuristic-prediction. Our results across multiple planning domains demonstrate that our symmetry-aware training effectively and efficiently addresses the limitations of PlanGPT.
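The naming symmetry at issue can be demonstrated directly: renaming the objects of a planning task yields a different token sequence for the same problem. The sketch below generates such a renamed variant, which could serve as a positive pair for a contrastive objective; the atom encoding and helper names are assumptions, and the contrastive loss itself is not shown.

```python
import random


def rename_objects(atoms, rng):
    """Apply a random bijective renaming to the objects in ground atoms.

    atoms: list of tuples such as ("on", "a", "b"); the first element is the
    predicate and is left untouched.
    """
    objects = sorted({arg for _pred, *args in atoms for arg in args})
    fresh = [f"x{i}" for i in range(len(objects))]
    rng.shuffle(fresh)
    mapping = dict(zip(objects, fresh))
    return [(pred, *(mapping[a] for a in args)) for pred, *args in atoms]


def canonicalize(atoms):
    """Rename objects by order of first appearance (v0, v1, ...)."""
    mapping = {}
    result = []
    for pred, *args in atoms:
        for arg in args:
            mapping.setdefault(arg, f"v{len(mapping)}")
        result.append((pred, *(mapping[a] for a in args)))
    return result
```

Two states that canonicalize to the same form are the same task under different names, so a symmetry-aware encoder should map them to nearby embeddings. The first-appearance canonical form only works when atom order is preserved, which is enough for this illustration.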
[AI-41] A Rule-Based Approach to Specifying Preferences over Conflicting Facts and Querying Inconsistent Knowledge Bases KR2025
[Quick read]: This paper addresses how to specify and compute priority relations between conflicting facts in inconsistent knowledge bases (KBs), so that queries return meaningful answers under prioritized-repair semantics. The key to the solution is a declarative rule-based framework that uses answer set programming (ASP) to evaluate preference rules and applies various cycle removal techniques to extract an acyclic relation from a priority relation that may contain cycles, resulting in an end-to-end query processing system.
Link: https://arxiv.org/abs/2508.07742
Authors: Meghyn Bienvenu,Camille Bourgaux,Katsumi Inoue,Robin Jean
Affiliation: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: This is an extended version of a paper appearing at the 22nd International Conference on Principles of Knowledge Representation and Reasoning (KR 2025). 24 pages
Abstract:Repair-based semantics have been extensively studied as a means of obtaining meaningful answers to queries posed over inconsistent knowledge bases (KBs). While several works have considered how to exploit a priority relation between facts to select optimal repairs, the question of how to specify such preferences remains largely unaddressed. This motivates us to introduce a declarative rule-based framework for specifying and computing a priority relation between conflicting facts. As the expressed preferences may contain undesirable cycles, we consider the problem of determining when a set of preference rules always yields an acyclic relation, and we also explore a pragmatic approach that extracts an acyclic relation by applying various cycle removal techniques. Towards an end-to-end system for querying inconsistent KBs, we present a preliminary implementation and experimental evaluation of the framework, which employs answer set programming to evaluate the preference rules, apply the desired cycle resolution techniques to obtain a priority relation, and answer queries under prioritized-repair semantics.
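One simple way to obtain an acyclic priority relation is to discard every preference that lies on a cycle. The sketch below does this with plain Python over (preferred, less preferred) pairs; it is only one conceivable cycle removal technique, chosen for brevity, and is not claimed to be among those the paper evaluates (which are implemented with answer set programming).

```python
def reachable(graph, src):
    """Nodes reachable from src by following at least one edge."""
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def drop_cyclic_preferences(prefs):
    """Remove every (better, worse) pair that lies on a cycle.

    prefs: set of pairs over fact identifiers (hypothetical encoding).
    A pair (u, v) is on a cycle exactly when u is reachable from v.
    """
    graph = {}
    for better, worse in prefs:
        graph.setdefault(better, set()).add(worse)
    return {(u, v) for u, v in prefs if u not in reachable(graph, v)}
```

This variant is conservative: it keeps only preferences that no cycle contradicts, at the price of discarding some that a finer-grained technique might retain.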
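[AI-42] CognitiveArm: Enabling Real-Time EEG-Controlled Prosthetic Arm Using Embodied Machine Learning
[Quick read]: This paper addresses efficient, real-time control of a prosthetic arm through a non-invasive brain-computer interface (BCI) on resource-constrained embedded AI hardware, where the core challenge is balancing model complexity, computational efficiency, and latency. The key to the solution is CognitiveArm, an on-device system that integrates the open-source BrainFlow EEG acquisition library with optimized deep learning (DL) models and uses an evolutionary search to identify Pareto-optimal hyperparameter configurations. Model compression techniques such as pruning and quantization improve deployment efficiency while keeping classification accuracy up to 90%, and voice commands switch modes to control multiple degrees of freedom (DoF), giving low-latency, high-accuracy brain-controlled prosthetic operation on edge devices.
Link: https://arxiv.org/abs/2508.07731
Authors: Abdul Basit,Maha Nawaz,Saim Rehman,Muhammad Shafique
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 7 pages, 12 figures, Accepted to 62nd DAC 2025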
Abstract:Efficient control of prosthetic limbs via non-invasive brain-computer interfaces (BCIs) requires advanced EEG processing, including pre-filtering, feature extraction, and action prediction, performed in real time on edge AI hardware. Achieving this on resource-constrained devices presents challenges in balancing model complexity, computational efficiency, and latency. We present CognitiveArm, an EEG-driven, brain-controlled prosthetic system implemented on embedded AI hardware, achieving real-time operation without compromising accuracy. The system integrates BrainFlow, an open-source library for EEG data acquisition and streaming, with optimized deep learning (DL) models for precise brain signal classification. Using evolutionary search, we identify Pareto-optimal DL configurations through hyperparameter tuning, optimizer analysis, and window selection, analyzed individually and in ensemble configurations. We apply model compression techniques such as pruning and quantization to optimize models for embedded deployment, balancing efficiency and accuracy. We collected an EEG dataset and designed an annotation pipeline enabling precise labeling of brain signals corresponding to specific intended actions, forming the basis for training our optimized DL models. CognitiveArm also supports voice commands for seamless mode switching, enabling control of the prosthetic arm’s 3 degrees of freedom (DoF). Running entirely on embedded hardware, it ensures low latency and real-time responsiveness. A full-scale prototype, interfaced with the OpenBCI UltraCortex Mark IV EEG headset, achieved up to 90% accuracy in classifying three core actions (left, right, idle). Voice integration enables multiplexed, variable movement for everyday tasks (e.g., handshake, cup picking), enhancing real-world performance and demonstrating CognitiveArm’s potential for advanced prosthetic control.
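Selecting Pareto-optimal configurations means keeping every model that no other model beats on all objectives at once. A minimal filter for two objectives is sketched below; the field names `accuracy` and `latency_ms` are placeholders, and the evolutionary search that would produce the candidates is not part of the sketch.

```python
def dominates(p, q):
    """True if p is at least as good as q on both objectives and better on one."""
    no_worse = p["accuracy"] >= q["accuracy"] and p["latency_ms"] <= q["latency_ms"]
    better = p["accuracy"] > q["accuracy"] or p["latency_ms"] < q["latency_ms"]
    return no_worse and better


def pareto_front(configs):
    """Keep configurations that no other configuration dominates.

    configs: list of dicts with hypothetical keys 'accuracy' (higher is
    better) and 'latency_ms' (lower is better).
    """
    return [c for c in configs if not any(dominates(o, c) for o in configs)]
```

On an embedded target, the final pick from the front would depend on the latency budget: the fastest model that still clears the required accuracy.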
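[AI-43] Training-Free ANN-to-SNN Conversion for High-Performance Spiking Transformer
[Quick read]: This paper addresses the inefficiency of current ANN-to-SNN conversion methods in handling the non-linear operations of Transformer architectures and their need for additional fine-tuning. The key to the solution is a training-free conversion framework built around a Multi-basis Exponential Decay (MBE) neuron, which uses an exponential decay strategy and a multi-basis encoding method to approximate a variety of non-linear operations efficiently. This avoids modifying the weights of pre-trained artificial neural networks (ANNs) and achieves near-lossless conversion accuracy with markedly lower latency.
Link: https://arxiv.org/abs/2508.07710
Authors: Jingya Wang,Xin Deng,Wenjie Wei,Dehao Zhang,Shuai Wang,Qian Sun,Jieyuan Zhang,Hanwen Liu,Ning Xie,Malu Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review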
Abstract:Leveraging the event-driven paradigm, Spiking Neural Networks (SNNs) offer a promising approach for constructing energy-efficient Transformer architectures. Compared to directly trained Spiking Transformers, ANN-to-SNN conversion methods bypass the high training costs. However, existing methods still suffer from notable limitations, failing to effectively handle nonlinear operations in Transformer architectures and requiring additional fine-tuning processes for pre-trained ANNs. To address these issues, we propose a high-performance and training-free ANN-to-SNN conversion framework tailored for Transformer architectures. Specifically, we introduce a Multi-basis Exponential Decay (MBE) neuron, which employs an exponential decay strategy and multi-basis encoding method to efficiently approximate various nonlinear operations. It removes the requirement for weight modifications in pre-trained ANNs. Extensive experiments across diverse tasks (CV, NLU, NLG) and mainstream Transformer architectures (ViT, RoBERTa, GPT-2) demonstrate that our method achieves near-lossless conversion accuracy with significantly lower latency. This provides a promising pathway for the efficient and scalable deployment of Spiking Transformers in real-world applications.
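[AI-44] Energy Consumption in Parallel Neural Network Training
[Quick read]: This paper addresses the substantial growth in energy consumption as models and datasets for neural network training scale up, and in particular the often-overlooked effect of parallelization strategy on energy use. The key to the solution is a systematic set of scaling experiments for data-parallel training (on two models, ResNet50 and FourCastNet) that quantifies how parallelization parameters such as GPU count, global batch size, and local batch size affect predictive performance, training time, and energy consumption. The experiments show that energy consumption scales approximately linearly with compute resources (GPU hours), but that the scaling factor differs substantially across models and hardware and is systematically influenced by the number of samples and gradient updates per GPU hour. These findings offer empirical evidence and directions for more sustainable AI research.
Link: https://arxiv.org/abs/2508.07706
Authors: Philipp Huber,David Li,Juan Pedro Gutiérrez Hermosillo Muriedas,Deifilia Kieckhefen,Markus Götz,Achim Streit,Charlotte Debus
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: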
Abstract:The increasing demand for computational resources of training neural networks leads to a concerning growth in energy consumption. While parallelization has enabled upscaling model and dataset sizes and accelerated training, its impact on energy consumption is often overlooked. To close this research gap, we conducted scaling experiments for data-parallel training of two models, ResNet50 and FourCastNet, and evaluated the impact of parallelization parameters, i.e., GPU count, global batch size, and local batch size, on predictive performance, training time, and energy consumption. We show that energy consumption scales approximately linearly with the consumed resources, i.e., GPU hours; however, the respective scaling factor differs substantially between distinct model trainings and hardware, and is systematically influenced by the number of samples and gradient updates per GPU hour. Our results shed light on the complex interplay of scaling up neural network training and can inform future developments towards more sustainable AI research.
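[AI-45] MORE-CLEAR: Multimodal Offline Reinforcement learning for Clinical notes Leveraged Enhanced State Representation
[Quick read]: This paper tackles the early detection and optimal management of sepsis in intensive care. Conventional reinforcement learning (RL) approaches rely mostly on structured data (such as lab results or vital signs) and struggle to capture the patient's full state. The key to the solution is MORE-CLEAR, a multimodal offline RL framework that uses pre-trained large language models (LLMs) to extract rich semantic representations from clinical notes, and combines gated fusion with cross-modal attention to achieve time-aware dynamic weighting and effective integration of multimodal information, improving the quality of the patient state representation. Experiments on MIMIC-III, MIMIC-IV, and a private dataset show that the method significantly outperforms single-modal RL models in estimated survival rate and policy performance. According to the authors, it is the first study to bring LLM capabilities into multimodal offline RL to enhance state representation in medical settings.
Link: https://arxiv.org/abs/2508.07681
Authors: Yooseok Lim, ByoungJun Jeon, Seong-A Park, Jisoo Lee, Sae Won Choi, Chang Wook Jeong, Ho-Geol Ryu, Hongyeol Lee, Hyun-Lim Yang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 figures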
Abstract:Sepsis, a life-threatening inflammatory response to infection, causes organ dysfunction, making early detection and optimal management critical. Previous reinforcement learning (RL) approaches to sepsis management rely primarily on structured data, such as lab results or vital signs, and lack a comprehensive understanding of the patient’s condition. In this work, we propose a Multimodal Offline REinforcement learning for Clinical notes Leveraged Enhanced stAte Representation (MORE-CLEAR) framework for sepsis control in intensive care units. MORE-CLEAR employs pre-trained large-scale language models (LLMs) to facilitate the extraction of rich semantic representations from clinical notes, preserving clinical context and improving patient state representation. Gated fusion and cross-modal attention allow dynamic weight adjustment in the context of time and the effective integration of multimodal data. Extensive cross-validation using two public (MIMIC-III and MIMIC-IV) and one private dataset demonstrates that MORE-CLEAR significantly improves estimated survival rate and policy performance compared to single-modal RL approaches. To our knowledge, this is the first work to leverage LLM capabilities within multimodal offline RL for better state representation in medical applications. This approach can potentially expedite the treatment and management of sepsis by enabling reinforcement learning models to propose enhanced actions based on a more comprehensive understanding of patient conditions.
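The abstract names gated fusion but does not give its equations. A common form computes a gate g = sigmoid(·) from both modalities and returns g · structured + (1 − g) · notes. The snippet below is a minimal per-dimension sketch of that generic form; the parameter layout (`gate_weights`, `gate_bias`) is hypothetical and is not the MORE-CLEAR implementation.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(structured, notes, gate_weights, gate_bias):
    """Fuse two equal-length feature vectors with a per-dimension gate.

    gate_weights[i] is a pair (w_s, w_n) for dimension i and gate_bias[i]
    is its bias; both are hypothetical learned parameters.
    """
    fused = []
    for s, n, (w_s, w_n), b in zip(structured, notes, gate_weights, gate_bias):
        g = sigmoid(w_s * s + w_n * n + b)   # gate value in (0, 1)
        fused.append(g * s + (1.0 - g) * n)  # convex combination of the two inputs
    return fused
```

Because the gate depends on the inputs, the weighting between structured features and note embeddings can change from one time step to the next, which is the "dynamic weight adjustment" the abstract refers to.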
[AI-46] Ethics2vec: aligning automatic agents and human preferences
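[Quick read]: This paper addresses the alignment of AI systems with human values: how to make an agent's behaviour conform to human ethical values, especially when incommensurable values (such as the value of a life versus the cost of treatment) are hard to quantify and weigh. The key to the solution is a method called Ethics2Vec, which maps an automatic decision-making strategy or control law to a multivariate vector representation, so that a metric can be defined in a common space and used to compare and assess how well agent behaviour aligns with human values. The method builds on the success of Anything2vec and extends it to ethics, making ethical considerations that were previously hard to quantify computable and comparable.
Link: https://arxiv.org/abs/2508.07673
Authors: Gianluca Bontempi
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: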
Abstract:Though intelligent agents are supposed to improve human experience (or make it more efficient), it is hard from a human perspective to grasp the ethical values which are explicitly or implicitly embedded in an agent's behaviour. This is the well-known problem of alignment, which refers to the challenge of designing AI systems that align with human values, goals and preferences. This problem is particularly challenging since most human ethical considerations refer to incommensurable (i.e. non-measurable and/or incomparable) values and criteria. Consider, for instance, a medical agent prescribing a treatment to a cancerous patient. How could it take into account (and/or weigh) incommensurable aspects like the value of a human life and the cost of the treatment? Now, the alignment between human and artificial values is possible only if we define a common space where a metric can be defined and used. This paper proposes to extend to ethics the conventional Anything2vec approach, which has been successful in plenty of similar and hard-to-quantify domains (ranging from natural language processing to recommendation systems and graph analysis). This paper proposes a way to map an automatic agent decision-making (or control law) strategy to a multivariate vector representation, which can be used to compare and assess the alignment with human values. The Ethics2Vec method is first introduced in the case of an automatic agent performing binary decision-making. Then, a vectorisation of an automatic control law (like in the case of a self-driving car) is discussed to show how the approach can be extended to automatic control settings.
[AI-47] EMPATHIA: Multi-Faceted Human-AI Collaboration for Refugee Integration NEURIPS2025
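[Quick read]: This paper addresses the fact that current AI for refugee integration focuses on narrow objectives such as employment and neglects cultural, emotional, and ethical dimensions, which limits long-term integration success. The key to the solution is EMPATHIA (Enriched Multimodal Pathways for Agentic Thinking in Humanitarian Immigrant Assistance), a multi-agent framework grounded in Kegan's Constructive Developmental Theory that decomposes integration into three modules: SEED (Socio-cultural Entry and Embedding Decision), RISE (Rapid Integration and Self-sufficiency Engine), and THRIVE (Transcultural Harmony and Resilience through Integrated Values and Engagement). SEED uses a selector-validator architecture in which three specialized agents (emotional, cultural, and ethical) deliberate transparently to produce interpretable recommendations, balancing competing value systems, supporting human-AI collaboration, and preserving human dignity when machines take part in major decisions.
Link: https://arxiv.org/abs/2508.07671
Authors: Mohamed Rayan Barhdadi, Mehmet Tuncel, Erchin Serpedin, Hasan Kurban
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Applications (stat.AP)
Comments: 19 pages, 3 figures (plus 6 figures in supplementary), 2 tables, 1 algorithm. Submitted to NeurIPS 2025 Creative AI Track: Humanity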
Abstract:Current AI approaches to refugee integration optimize narrow objectives such as employment and fail to capture the cultural, emotional, and ethical dimensions critical for long-term success. We introduce EMPATHIA (Enriched Multimodal Pathways for Agentic Thinking in Humanitarian Immigrant Assistance), a multi-agent framework addressing the central Creative AI question: how do we preserve human dignity when machines participate in life-altering decisions? Grounded in Kegan’s Constructive Developmental Theory, EMPATHIA decomposes integration into three modules: SEED (Socio-cultural Entry and Embedding Decision) for initial placement, RISE (Rapid Integration and Self-sufficiency Engine) for early independence, and THRIVE (Transcultural Harmony and Resilience through Integrated Values and Engagement) for sustained outcomes. SEED employs a selector-validator architecture with three specialized agents - emotional, cultural, and ethical - that deliberate transparently to produce interpretable recommendations. Experiments on the UN Kakuma dataset (15,026 individuals, 7,960 eligible adults 15+ per ILO/UNHCR standards) and implementation on 6,359 working-age refugees (15+) with 150+ socioeconomic variables achieved 87.4% validation convergence and explainable assessments across five host countries. EMPATHIA’s weighted integration of cultural, emotional, and ethical factors balances competing value systems while supporting practitioner-AI collaboration. By augmenting rather than replacing human expertise, EMPATHIA provides a generalizable framework for AI-driven allocation tasks where multiple values must be reconciled.
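[AI-48] AIS-LLM: A Unified Framework for Maritime Trajectory Prediction Anomaly Detection and Collision Risk Assessment with Explainable Forecasting
[Quick read]: This paper addresses the problem that maritime traffic analysis tasks (such as vessel trajectory prediction, anomaly detection, and collision risk assessment) are usually handled independently, which makes it hard to consider complex maritime situations as a whole. The key to the solution is AIS-LLM, a new end-to-end framework whose core innovation is the tight integration of time-series AIS data with a large language model (LLM): a Time-Series Encoder, a Prompt Encoder, a Cross-Modality Alignment Module, and a Multi-Task Decoder allow the three tasks to be carried out jointly, and the task outputs are combined to generate situation summaries and briefings, making maritime traffic management more intelligent and efficient.
Link: https://arxiv.org/abs/2508.07668
Authors: Hyobin Park, Jinwook Jung, Minseok Seo, Hyunsoo Choi, Deukjae Cho, Sekil Park, Dong-Geol Choi
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: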
Abstract:With the increase in maritime traffic and the mandatory implementation of the Automatic Identification System (AIS), the importance and diversity of maritime traffic analysis tasks based on AIS data, such as vessel trajectory prediction, anomaly detection, and collision risk assessment, is rapidly growing. However, existing approaches tend to address these tasks individually, making it difficult to holistically consider complex maritime situations. To address this limitation, we propose a novel framework, AIS-LLM, which integrates time-series AIS data with a large language model (LLM). AIS-LLM consists of a Time-Series Encoder for processing AIS sequences, an LLM-based Prompt Encoder, a Cross-Modality Alignment Module for semantic alignment between time-series data and textual prompts, and an LLM-based Multi-Task Decoder. This architecture enables the simultaneous execution of three key tasks: trajectory prediction, anomaly detection, and risk assessment of vessel collisions within a single end-to-end system. Experimental results demonstrate that AIS-LLM outperforms existing methods across individual tasks, validating its effectiveness. Furthermore, by integratively analyzing task outputs to generate situation summaries and briefings, AIS-LLM presents the potential for more intelligent and efficient maritime traffic management.
[AI-49] 1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning
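[Quick read]: This paper addresses contextual privacy protection when large language models (LLMs) process multi-source information in interactive settings, especially when private and public information are mixed (for example, meeting summarization). The core challenge is preventing leakage of private information while preserving the fidelity of public content. The key to the solution is a multi-agent framework that decomposes privacy reasoning into specialized subtasks (such as information extraction and classification); by designing the information-flow topology, it balances the information load across agents and enables iterative validation and error correction, leading to more reliable adherence to contextual privacy norms. Experiments show that the method substantially reduces private information leakage (by 18% on ConfAIde and 19% on PrivacyLens with GPT-4o) and outperforms single-agent baselines.
Link: https://arxiv.org/abs/2508.07667
Authors: Wenkai Li, Liwen Sun, Zhenxiang Guan, Xuhui Zhou, Maarten Sap
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: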
Abstract:Addressing contextual privacy concerns remains challenging in interactive settings where large language models (LLMs) process information from multiple sources (e.g., summarizing meetings with private and public information). We introduce a multi-agent framework that decomposes privacy reasoning into specialized subtasks (extraction, classification), reducing the information load on any single agent while enabling iterative validation and more reliable adherence to contextual privacy norms. To understand how privacy errors emerge and propagate, we conduct a systematic ablation over information-flow topologies, revealing when and why upstream detection mistakes cascade into downstream leakage. Experiments on the ConfAIde and PrivacyLens benchmarks with several open-source and closed-source LLMs demonstrate that our best multi-agent configuration substantially reduces private information leakage (18% on ConfAIde and 19% on PrivacyLens with GPT-4o) while preserving the fidelity of public content, outperforming single-agent baselines. These results highlight the promise of principled information-flow design in multi-agent systems for contextual privacy with LLMs.
[AI-50] Discovering Spatial Correlations between Earth Observations in Global Atmospheric State Estimation by using Adaptive Graph Structure Learning
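[Quick read]: This paper addresses the difficulty of modelling dynamic spatial correlations between Earth observations and atmospheric states in global atmospheric state estimation, with the aim of improving the forecasting accuracy of numerical weather prediction (NWP) systems. Existing methods struggle to capture complex, time-varying spatial correlations, especially when observation locations are not fixed and the meteorological context evolves. The key to the solution is to use spatiotemporal graph neural networks (STGNNs) with structure learning and to regulate edge sampling through adaptive node degrees and spatial distance constraints, which mitigates the structural information loss and over-smoothing caused by structure learning and markedly improves prediction in regions of high variability.
Link: https://arxiv.org/abs/2508.07659
Authors: Hyeon-Ju Jeon, Jeon-Ho Kang, In-Hyuk Kwon, O-Joun Lee
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages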
Abstract:This study aims to discover spatial correlations between Earth observations and atmospheric states to improve the forecasting accuracy of global atmospheric state estimation, which is usually conducted using conventional numerical weather prediction (NWP) systems and is the first step of weather forecasting. NWP systems predict future atmospheric states at fixed locations, which are called NWP grid points, by analyzing previous atmospheric states and newly acquired Earth observations without fixed locations. Thus, the surrounding meteorological context and the changing locations of the observations cause the spatial correlations between atmospheric states and observations to vary over time. To handle complicated spatial correlations, which change dynamically, we employ spatiotemporal graph neural networks (STGNNs) with structure learning. However, structure learning has an inherent limitation: it can cause structural information loss and over-smoothing by generating excessive edges. To solve this problem, we regulate edge sampling by adaptively determining node degrees and considering the spatial distances between NWP grid points and observations. We validated the effectiveness of the proposed method by using real-world atmospheric state and observation data from East Asia. Even in areas with high atmospheric variability, the proposed method outperformed existing STGNN models with and without structure learning.
[AI-51] Disentangling Multiplex Spatial-Temporal Transition Graph Representation Learning for Socially Enhanced POI Recommendation
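[Quick read]: This paper addresses the semantic misalignment that arises when existing Point-of-Interest (POI) recommendation models learn spatial and temporal transition representations separately; the misalignment introduces redundant information at the feature-fusion stage, which increases model uncertainty and reduces interpretability. The key to the solution is DiMuST, a model that performs disentangled representation learning over multiplex spatial-temporal transition graphs. Its core innovation is a new Disentangled variational multiplex graph Auto-Encoder (DAE), which first separates shared and private distributions using a multiplex spatial-temporal graph strategy, then fuses the shared features through a Product of Experts (PoE) mechanism and denoises the private features through contrastive constraints, capturing the spatial-temporal transition representations of POIs while preserving their intrinsic correlations.
Link: https://arxiv.org/abs/2508.07649
Authors: Jie Li, Haoye Dong, Zhengyang Wu, Zetao Zheng, Mingrong Lin
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: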
Abstract:Next Point-of-Interest (POI) recommendation is a research hotspot in business intelligence, where users’ spatial-temporal transitions and social relationships play key roles. However, most existing works model spatial and temporal transitions separately, leading to misaligned representations of the same spatial-temporal key nodes. This misalignment introduces redundant information during fusion, increasing model uncertainty and reducing interpretability. To address this issue, we propose DiMuST, a socially enhanced POI recommendation model based on disentangled representation learning over multiplex spatial-temporal transition graphs. The model employs a novel Disentangled variational multiplex graph Auto-Encoder (DAE), which first disentangles shared and private distributions using a multiplex spatial-temporal graph strategy. It then fuses the shared features via a Product of Experts (PoE) mechanism and denoises the private features through contrastive constraints. The model effectively captures the spatial-temporal transition representations of POIs while preserving the intrinsic correlation of their spatial-temporal relationships. Experiments on two challenging datasets demonstrate that our DiMuST significantly outperforms existing methods across multiple metrics.
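The abstract does not spell out how the Product of Experts (PoE) fusion is computed. In variational models the experts are usually Gaussians, and a product of Gaussians is again Gaussian, with precisions that add up and a precision-weighted mean. The snippet below is a minimal one-dimensional sketch of that standard identity, assuming Gaussian experts; it is not taken from the DiMuST code.

```python
def product_of_gaussian_experts(means, variances):
    """Combine 1-D Gaussian experts N(mean_i, variance_i) into one Gaussian.

    The product of Gaussian densities is proportional to a Gaussian whose
    precision is the sum of the expert precisions and whose mean is the
    precision-weighted average of the expert means.
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_variance = 1.0 / total_precision
    fused_mean = fused_variance * sum(p * m for p, m in zip(precisions, means))
    return fused_mean, fused_variance
```

A confident expert (small variance) pulls the fused mean towards itself, so the fusion favours whichever view, spatial or temporal, is more certain about the shared factor.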
[AI-52] Grasp-HGN: Grasping the Unexpected
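[Quick read]: This paper studies the poor generalization of current robotic prosthetic hands in real environments: grasp models degrade sharply on unseen objects, which seriously affects the independence and quality of life of amputee users. The core challenge is that existing datasets are limited to a fixed set of object categories and cannot cover the virtually unlimited variety of real-world objects, so models perform poorly at "semantic projection" (the ability to generalize to unseen object types); for example, YOLO-style models reach 80% accuracy on the training set but drop to 15% on unseen objects. The key to the solution is Grasp-LLaVA, a grasp estimation method based on a vision-language model that uses human-like reasoning to infer a suitable grasp type from an object's physical characteristics, reaching 50.2% accuracy on unseen object types versus 36.7% for the current best model. To balance latency and accuracy, the paper further designs an edge-cloud hybrid architecture, the Hybrid Grasp Network (HGN), which combines fast edge-side response with high-accuracy cloud-side inference and uses confidence calibration (DC) to switch dynamically between them, raising accuracy on unseen objects to 42.3% with a 3.5x speedup. On a mix of real-world samples, average accuracy reaches 86%, a 12.2% gain over the edge-only setup, and inference is 2.2x faster than Grasp-LLaVA alone.
Link: https://arxiv.org/abs/2508.07648
Authors: Mehrshad Zandigohar, Mallesham Dasari, Gunar Schirner
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Paper accepted at ACM Transactions on Embedded Computing Systems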
Abstract:For transradial amputees, robotic prosthetic hands promise to regain the capability to perform daily living activities. To advance next-generation prosthetic hand control design, it is crucial to address current shortcomings in robustness to out-of-lab artifacts, and generalizability to new environments. Due to the fixed number of objects to interact with in existing datasets, contrasted with the virtually infinite variety of objects encountered in the real world, current grasp models perform poorly on unseen objects, negatively affecting users’ independence and quality of life. To address this: (i) we define semantic projection, the ability of a model to generalize to unseen object types, and show that conventional models like YOLO, despite 80% training accuracy, drop to 15% on unseen objects. (ii) we propose Grasp-LLaVA, a Grasp Vision Language Model enabling human-like reasoning to infer the suitable grasp type estimate based on the object’s physical characteristics, resulting in a significant 50.2% accuracy over unseen object types compared to 36.7% accuracy of an SOTA grasp estimation model. Lastly, to bridge the performance-latency gap, we propose Hybrid Grasp Network (HGN), an edge-cloud deployment infrastructure enabling fast grasp estimation on edge and accurate cloud inference as a fail-safe, effectively expanding the latency vs. accuracy Pareto. HGN with confidence calibration (DC) enables dynamic switching between edge and cloud models, improving semantic projection accuracy by 5.6% (to 42.3%) with 3.5x speedup over the unseen object types. Over a real-world sample mix, it reaches 86% average accuracy (12.2% gain over edge-only), and 2.2x faster inference than Grasp-LLaVA alone.
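The dynamic switching between edge and cloud models can be pictured as a confidence gate: keep the fast edge prediction when it is confident enough, otherwise fall back to the slower cloud model. The snippet below is a minimal sketch of that idea. The callables, the threshold value, and the assumption that the edge confidence is already calibrated are all hypothetical; the abstract does not describe HGN's actual switching rule in this detail.

```python
from typing import Any, Callable, Tuple

Prediction = Tuple[str, float]  # (grasp label, confidence in [0, 1])


def hybrid_grasp(
    image: Any,
    edge_model: Callable[[Any], Prediction],
    cloud_model: Callable[[Any], Prediction],
    threshold: float = 0.8,
) -> Tuple[str, str]:
    """Return (grasp label, source), where source is "edge" or "cloud"."""
    label, confidence = edge_model(image)
    if confidence >= threshold:
        return label, "edge"        # fast path: trust the edge estimate
    label, _ = cloud_model(image)   # slow path: cloud inference as fail-safe
    return label, "cloud"
```

The threshold sets the trade-off: a higher value sends more samples to the cloud (more accurate, slower), a lower value keeps more on the edge (faster, less accurate on unseen objects).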
[AI-53] Attribution Explanations for Deep Neural Networks: A Theoretical Perspective
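[Quick read]: This paper addresses the faithfulness of attribution explanation methods for deep neural networks (DNNs), that is, whether existing methods truly reflect the actual contribution of input variables to model decisions. The core challenges are that methods lack a unified organization and are hard to compare, that their theoretical foundations are weak, and that there is no ground truth for empirical evaluation. The key to the solution is to advance three theoretical directions: (i) theoretical unification, which reveals commonalities and differences among methods and enables systematic comparison; (ii) theoretical rationale, which clarifies the mathematical and logical foundations of existing methods; and (iii) theoretical evaluation, which rigorously proves whether methods satisfy faithfulness principles. These advances help deepen understanding of attribution methods, guide method selection, and inspire the design of new methods.
Link: https://arxiv.org/abs/2508.07636
Authors: Huiqi Deng, Hongbin Pei, Quanshi Zhang, Mengnan Du
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: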
Abstract:Attribution explanation is a typical approach for explaining deep neural networks (DNNs), inferring an importance or contribution score for each input variable to the final output. In recent years, numerous attribution methods have been developed to explain DNNs. However, a persistent concern remains unresolved, i.e., whether and which attribution methods faithfully reflect the actual contribution of input variables to the decision-making process. The faithfulness issue undermines the reliability and practical utility of attribution explanations. We argue that these concerns stem from three core challenges. First, difficulties arise in comparing attribution methods due to their unstructured heterogeneity, differences in heuristics, formulations, and implementations that lack a unified organization. Second, most methods lack solid theoretical underpinnings, with their rationales remaining absent, ambiguous, or unverified. Third, empirically evaluating faithfulness is challenging without ground truth. Recent theoretical advances provide a promising way to tackle these challenges, attracting increasing attention. We summarize these developments, with emphasis on three key directions: (i) Theoretical unification, which uncovers commonalities and differences among methods, enabling systematic comparisons; (ii) Theoretical rationale, clarifying the foundations of existing methods; (iii) Theoretical evaluation, rigorously proving whether methods satisfy faithfulness principles. Beyond a comprehensive review, we provide insights into how these studies help deepen theoretical understanding, inform method selection, and inspire new attribution methods. We conclude with a discussion of promising open problems for further work.
[AI-54] Efficient Approximate Posterior Sampling with Annealed Langevin Monte Carlo
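[Quick read]: This paper addresses posterior sampling with score-based generative models: given a prior distribution $p(x)$ and a measurement model $p(y|x)$, how to sample effectively from the posterior $p(x|y)$. Prior work shows that, in the worst case, this problem is computationally intractable in KL divergence. Nevertheless, algorithms for practical tasks such as image super-resolution, stylization, and reconstruction perform well empirically. Rather than relying on specific distributional assumptions or restricted settings to obtain exact posterior sampling, the paper recasts the problem as a "tilting" problem: biasing the distribution towards the measurement while staying consistent with the prior. Its core contribution is to show that, under minimal assumptions, one can sample in polynomial time from a distribution that is close to the posterior of a noised prior in KL divergence and close to the true posterior in Fisher divergence, so that samples are consistent with both the measurement and the prior. According to the authors, these are the first formal polynomial-time results for (approximate) posterior sampling.
Link: https://arxiv.org/abs/2508.07631
Authors: Advait Parulekar, Litu Rout, Karthikeyan Shanmugam, Sanjay Shakkottai
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: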
Abstract:We study the problem of posterior sampling in the context of score-based generative models. We have a trained score network for a prior p(x), a measurement model p(y|x), and are tasked with sampling from the posterior p(x|y). Prior work has shown this to be intractable in KL (in the worst case) under well-accepted computational hardness assumptions. Despite this, popular algorithms for tasks such as image super-resolution, stylization, and reconstruction enjoy empirical success. Rather than establishing distributional assumptions or restricted settings under which exact posterior sampling is tractable, we view this as a more general “tilting” problem of biasing a distribution towards a measurement. Under minimal assumptions, we show that one can tractably sample from a distribution that is simultaneously close to the posterior of a noised prior in KL divergence and the true posterior in Fisher divergence. Intuitively, this combination ensures that the resulting sample is consistent with both the measurement and the prior. To the best of our knowledge these are the first formal results for (approximate) posterior sampling in polynomial time.
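To make the setting concrete, here is a toy one-dimensional sketch of annealed Langevin dynamics for posterior sampling. It is not the paper's algorithm or analysis. It assumes a standard normal prior, so the score of the noised prior is available in closed form instead of coming from a trained score network, a Gaussian measurement y = x + noise, and a fixed step size; all names and default values are illustrative.

```python
import math
import random


def annealed_langevin_posterior(y, meas_std, noise_levels,
                                steps_per_level=20, step_size=0.05, rng=None):
    """Draw one approximate posterior sample for prior N(0, 1), y = x + N(0, meas_std^2).

    noise_levels is a decreasing list of sigmas; at level sigma the prior is
    replaced by its noised version N(0, 1 + sigma^2).
    """
    rng = rng or random.Random()
    x = rng.gauss(0.0, math.sqrt(1.0 + noise_levels[0] ** 2))
    for sigma in noise_levels:
        for _ in range(steps_per_level):
            prior_score = -x / (1.0 + sigma ** 2)         # score of the noised prior
            likelihood_score = (y - x) / meas_std ** 2    # gradient of log p(y | x)
            x += step_size * (prior_score + likelihood_score)
            x += math.sqrt(2.0 * step_size) * rng.gauss(0.0, 1.0)
    return x
```

In this Gaussian toy case the exact posterior mean is y / (1 + meas_std²), which gives a simple sanity check on the sampler; with a learned score network no such closed form is available, which is where guarantees of the kind the paper studies matter.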
[AI-55] Multimodal AI Systems for Enhanced Laying Hen Welfare Assessment and Productivity Optimization
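[Quick read]: This paper addresses the limitations of traditional poultry welfare assessment on modern farms: reliance on subjective human observation and single-sensor data cannot capture the multidimensional complexity of laying hen welfare. The core solution is multimodal artificial intelligence, which fuses visual, acoustic, environmental, and physiological data streams to give deeper insight into avian welfare dynamics. The key finding is that feature-level fusion achieves the best balance between robustness and performance in real farm environments and scales better than early or late fusion. The study also proposes two new evaluation tools, the Domain Transfer Score (DTS) and the Data Reliability Index (DRI), together with a modular, context-aware deployment framework, to address practical barriers such as sensor fragility, high cost, inconsistent behavioral definitions, and weak cross-farm generalization, and to move from reactive, unimodal monitoring towards proactive, precision-driven welfare management.
Link: https://arxiv.org/abs/2508.07628
Authors: Daniel Essien, Suresh Neethirajan
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 66 pages, 7 figures, 11 tables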
Abstract:The future of poultry production depends on a paradigm shift replacing subjective, labor-intensive welfare checks with data-driven, intelligent monitoring ecosystems. Traditional welfare assessments, limited by human observation and single-sensor data, cannot fully capture the complex, multidimensional nature of laying hen welfare in modern farms. Multimodal Artificial Intelligence (AI) offers a breakthrough, integrating visual, acoustic, environmental, and physiological data streams to reveal deeper insights into avian welfare dynamics. This investigation highlights multimodal AI's transformative potential, showing that intermediate (feature-level) fusion strategies achieve the best balance between robustness and performance under real-world poultry conditions, and offer greater scalability than early or late fusion approaches. Key adoption barriers include sensor fragility in harsh farm environments, high deployment costs, inconsistent behavioral definitions, and limited cross-farm generalizability. To address these, we introduce two novel evaluation tools: the Domain Transfer Score (DTS) to measure model adaptability across diverse farm settings, and the Data Reliability Index (DRI) to assess sensor data quality under operational constraints. We also propose a modular, context-aware deployment framework designed for laying hen environments, enabling scalable and practical integration of multimodal sensing. This work lays the foundation for a transition from reactive, unimodal monitoring to proactive, precision-driven welfare systems that unite productivity with ethical, science-based animal care.
[AI-56] On the Limits of Selective AI Prediction: A Case Study in Clinical Decision Making
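[Quick read]: This paper addresses the decline in decision quality that results when inaccurate AI model predictions combine with human automation bias in clinical decision making. The key to the solution is selective prediction: when the model judges its own prediction to be unreliable, it abstains and tells the user so, to avoid misleading the human decision maker. A study with 259 clinicians shows that selective prediction mitigates the effect of incorrect AI predictions on diagnostic accuracy, but also changes clinicians' decision patterns: when the AI abstains, missed diagnoses and missed treatments increase markedly, which suggests that human-AI collaboration strategies need further refinement for safe and reliable clinical use.
Link: https://arxiv.org/abs/2508.07617
Authors: Sarah Jabbour, David Fouhey, Nikola Banovic, Stephanie D. Shepard, Ella Kazerooni, Michael W. Sjoding, Jenna Wiens
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 14 pages, 10 figures, 5 tables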
Abstract:AI has the potential to augment human decision making. However, even high-performing models can produce inaccurate predictions when deployed. These inaccuracies, combined with automation bias, where humans overrely on AI predictions, can result in worse decisions. Selective prediction, in which potentially unreliable model predictions are hidden from users, has been proposed as a solution. This approach assumes that when AI abstains and informs the user so, humans make decisions as they would without AI involvement. To test this assumption, we study the effects of selective prediction on human decisions in a clinical context. We conducted a user study of 259 clinicians tasked with diagnosing and treating hospitalized patients. We compared their baseline performance without any AI involvement to their AI-assisted accuracy with and without selective prediction. Our findings indicate that selective prediction mitigates the negative effects of inaccurate AI in terms of decision accuracy. Compared to no AI assistance, clinician accuracy declined when shown inaccurate AI predictions (66% [95% CI: 56%-75%] vs. 56% [95% CI: 46%-66%]), but recovered under selective prediction (64% [95% CI: 54%-73%]). However, while selective prediction nearly maintains overall accuracy, our results suggest that it alters patterns of mistakes: when informed the AI abstains, clinicians underdiagnose (18% increase in missed diagnoses) and undertreat (35% increase in missed treatments) compared to no AI input at all. Our findings underscore the importance of empirically validating assumptions about how humans engage with AI within human-AI systems.
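Selective prediction is easy to state in code: show the model's answer only when some reliability signal clears a bar, and otherwise abstain. The snippet below is a minimal sketch using the top class probability as that signal. The threshold and the choice of signal are assumptions made for illustration; the abstract does not say which abstention criterion the study used.

```python
from typing import Optional, Sequence


def selective_predict(class_probs: Sequence[float],
                      threshold: float = 0.75) -> Optional[int]:
    """Return the predicted class index, or None to abstain.

    class_probs holds the model's probability for each class. The
    prediction is withheld when the top probability is below threshold.
    """
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    if class_probs[best] < threshold:
        return None  # abstain: the interface tells the user no prediction is shown
    return best
```

The study's point is that the `None` branch is not neutral: clinicians who were told the AI abstained made different mistakes than clinicians who never saw an AI at all.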
[AI-57] HGMF: A Hierarchical Gaussian Mixture Framework for Scalable Tool Invocation within the Model Context Protocol
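[Quick read]: This paper addresses the low tool-selection accuracy and high computational cost that large language models (LLMs) face when invoking large, hierarchically structured tool libraries, which stem mainly from limited context windows and noise from irrelevant options. The key to the solution is the Hierarchical Gaussian Mixture Framework (HGMF), a probabilistic pruning method: it first maps the user query and all tool descriptions into a unified semantic space, then clusters and filters hierarchically in two stages. Servers are first clustered with a Gaussian Mixture Model (GMM) and filtered by query likelihood, and the same process is then repeated for the tools of the selected servers. The result is a compact, high-relevance candidate set that greatly simplifies the final selection for the LLM, improving accuracy and reducing inference latency.
Link: https://arxiv.org/abs/2508.07602
Authors: Wenpeng Xing, Zhipeng Chen, Changting Lin, Meng Han
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: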
Abstract:Invoking external tools enables Large Language Models (LLMs) to perform complex, real-world tasks, yet selecting the correct tool from large, hierarchically-structured libraries remains a significant challenge. The limited context windows of LLMs and noise from irrelevant options often lead to low selection accuracy and high computational costs. To address this, we propose the Hierarchical Gaussian Mixture Framework (HGMF), a probabilistic pruning method for scalable tool invocation. HGMF first maps the user query and all tool descriptions into a unified semantic space. The framework then operates in two stages: it clusters servers using a Gaussian Mixture Model (GMM) and filters them based on the query’s likelihood. Subsequently, it applies the same GMM-based clustering and filtering to the tools associated with the selected servers. This hierarchical process produces a compact, high-relevance candidate set, simplifying the final selection task for the LLM. Experiments on a public dataset show that HGMF significantly improves tool selection accuracy while reducing inference latency, confirming the framework’s scalability and effectiveness for large-scale tool libraries.
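The two-stage pruning described above can be sketched in a few lines. The version below is a deliberate simplification: cluster assignments are assumed to be given rather than learned by EM, each cluster is summarised by a diagonal Gaussian standing in for a fitted GMM component, and only the single most likely cluster is kept at each stage. All function names and data layouts are hypothetical and are not HGMF's actual interface.

```python
import math


def fit_diag_gaussian(points, min_var=1e-3):
    """Per-dimension mean and variance of a list of embedding vectors."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    var = [max(sum((p[d] - mean[d]) ** 2 for p in points) / n, min_var)
           for d in range(dim)]
    return mean, var


def log_density(x, mean, var):
    """Log-density of x under a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def pick_cluster(query, clusters):
    """clusters maps a cluster id to a list of (name, embedding) pairs.

    Returns the names in the cluster under which the query is most likely.
    """
    def score(items):
        mean, var = fit_diag_gaussian([emb for _, emb in items])
        return log_density(query, mean, var)

    best = max(clusters, key=lambda cid: score(clusters[cid]))
    return [name for name, _ in clusters[best]]


def hierarchical_prune(query, server_clusters, tool_clusters_by_server):
    """Stage 1 selects a server cluster; stage 2 selects a tool cluster within it."""
    servers = pick_cluster(query, server_clusters)
    tool_clusters = {}
    for server in servers:
        for cid, items in tool_clusters_by_server[server].items():
            tool_clusters[(server, cid)] = items
    return pick_cluster(query, tool_clusters)
```

The returned list is the compact candidate set that would be placed in the LLM's context; a fuller version would keep every cluster whose likelihood passes a cutoff instead of only the best one.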
[AI-58] Optimization of Private Semantic Communication Performance: An Uncooperative Covert Communication Method
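[Quick read]: This paper addresses how to protect both the privacy and the quality of semantic information transmission for image data in the presence of eavesdropping attacks. A server transmits the semantic information of images over several time slots while an attacker tries to detect the transmission and recover the original image content. A friendly jammer is deployed to send jamming signals that hide the semantic transmission, but because its energy is limited, the jammer does not communicate with the server, so the server does not know the jammer's transmit power. Under this constraint, the server must jointly optimize the semantic information sent in each time slot and the corresponding transmit power to maximize the user's semantic transmission quality and the system's privacy. The key to the solution is a Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm assisted by prioritised sampling. It introduces an additional Q network to estimate Q values and has the agent use the lower of the two Q values, which avoids locally optimal action selection and Q-value estimation bias and enables joint decision making without communication between server and jammer. Simulations show improvements in privacy and semantic transmission quality of up to 77.8% and 14.3% over traditional reinforcement learning methods.
Link: https://arxiv.org/abs/2508.07586
Authors: Wenjing Zhang, Ye Hu, Tao Luo, Zhilong Zhang, Mingzhe Chen
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: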
Abstract:In this paper, a novel covert semantic communication framework is investigated. Within this framework, a server extracts and transmits the semantic information, i.e., the meaning of image data, to a user over several time slots. An attacker seeks to detect and eavesdrop on the semantic transmission to acquire details of the original image. To prevent the data meaning from being eavesdropped by an attacker, a friendly jammer is deployed to transmit jamming signals that interfere with the attacker so as to hide the transmitted semantic information. Meanwhile, the server will strategically select time slots for semantic information transmission. Due to limited energy, the jammer will not communicate with the server and hence the server does not know the transmit power of the jammer. Therefore, the server must jointly optimize the semantic information transmitted at each time slot and the corresponding transmit power to maximize the privacy and the semantic information transmission quality of the user. To solve this problem, we propose a prioritised sampling assisted twin delayed deep deterministic policy gradient algorithm to jointly determine the transmitted semantic information and the transmit power per time slot without communication between the server and the jammer. Compared to standard reinforcement learning methods, the proposed method uses an additional Q network to estimate Q values such that the agent can select the action with the lower Q value from the two Q networks, thus avoiding locally optimal action selection and estimation bias of Q values. Simulation results show that the proposed algorithm can improve the privacy and the semantic information transmission quality by up to 77.8% and 14.3% compared to traditional reinforcement learning methods.
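Two ingredients named in the abstract have well-known generic forms. In standard TD3, "taking the lower Q value" means bootstrapping the learning target from the minimum of two critic estimates, and prioritised sampling usually draws replay transitions with probability proportional to a power of their TD error. The snippet below sketches both under those assumptions; the hyperparameters `gamma`, `alpha`, and `eps` are illustrative defaults, and the paper's exact variant may differ.

```python
def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """TD target that bootstraps from the smaller of two critic estimates."""
    bootstrap = 0.0 if done else gamma * min(q1_next, q2_next)
    return reward + bootstrap


def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Replay sampling probabilities proportional to (|TD error| + eps) ** alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]
```

Using the minimum counteracts the tendency of a single critic to overestimate values, and the priority weights make the agent revisit transitions it currently predicts poorly.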
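[AI-59] MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark
[Quick read]: This paper addresses the lack of a systematic framework for evaluating how large language models (LLMs) call Model Context Protocol (MCP) tools. Specific problems include the absence of large-scale, multi-domain benchmark datasets, inconsistent MCP tool response formats that complicate evaluation, highly variable tool-call success rates in real environments, and context-window limits on the number of tools that can be called in a single run. The key to the solution is MCPToolBench++, a large-scale, multi-domain AI Agent tool-use benchmark built on more than 4,000 MCP servers from over 40 categories; its datasets include single-step and multi-step tool calls, allowing mainstream LLMs' tool-use abilities to be evaluated in complex scenarios.
Link: https://arxiv.org/abs/2508.07575
Authors: Shiqing Fan, Xichen Ding, Liang Zhang, Linjian Mo
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Benchmarks and Source Code Released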
Abstract:LLMs’ capabilities are enhanced by using function calls to integrate various data sources or API results into the context window. Typical tools include search, web crawlers, maps, financial data, file systems, and browser usage. Integrating these data sources or functions requires a standardized method. The Model Context Protocol (MCP) provides a standardized way to supply context to LLMs. However, the evaluation of LLMs and AI Agents’ MCP tool use abilities suffers from several issues. First, there is a lack of comprehensive datasets or benchmarks to evaluate various MCP tools. Second, the diverse response formats from MCP tool call execution further increase the difficulty of evaluation. Additionally, unlike existing tool-use benchmarks with high success rates in functions like programming and math functions, the success rate of real-world MCP tools is not guaranteed and varies across different MCP servers. Furthermore, the LLMs’ context window also limits the number of available tools that can be called in a single run, because the textual descriptions of the tools and their parameters are too long in tokens for an LLM to process all at once. To help address the challenges of evaluating LLMs’ performance on calling MCP tools, we propose MCPToolBench++, a large-scale, multi-domain AI Agent tool use benchmark. As of July 2025, this benchmark is built upon a marketplace of over 4k MCP servers from more than 40 categories, collected from the MCP marketplaces and GitHub communities. The datasets consist of both single-step and multi-step tool calls across different categories. We evaluated SOTA LLMs with agentic abilities on this benchmark and reported the results.
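[AI-60] Towards Theoretical Understanding of Transformer Test-Time Computing: Investigation on In-Context Linear Regression
[Quick Read]: This paper addresses the gap between practical language model inference and theoretical Transformer analysis, in particular how introducing randomness and sampling can help explain behavior during inference. The key to the solution is a framework built on noise injection and binary coefficient sampling that simulates language model decoding. The framework is then used for a detailed theoretical analysis of widely adopted inference techniques, offering new insight into the inference behavior of real-world language models.
Link: https://arxiv.org/abs/2508.07571
Authors: Xingwu Chen,Miao Lu,Beining Wu,Difan Zou
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: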
Abstract:Using more test-time computation during language model inference, such as generating more intermediate thoughts or sampling multiple candidate answers, has proven effective in significantly improving model performance. This paper takes an initial step toward bridging the gap between practical language model inference and theoretical transformer analysis by incorporating randomness and sampling. We focus on in-context linear regression with continuous/binary coefficients, where our framework simulates language model decoding through noise injection and binary coefficient sampling. Through this framework, we provide detailed analyses of widely adopted inference techniques. Supported by empirical results, our theoretical framework and analysis demonstrate the potential for offering new insights into understanding inference behaviors in real-world language models.
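[AI-61] Retrieval-Augmented Multi-Agent System for Rapid Statement of Work Generation
[Quick Read]: This paper addresses the slow, error-prone process of drafting a Statement of Work (SOW) in business and legal projects. SOW drafting today usually involves several people, takes hours to days, and risks outdated content or compliance problems. The key to the solution is an AI-driven automation system made up of three agents: the first writes the initial draft, the second reviews legal compliance, and the third handles formatting and final checks. The system interprets the meaning of the content and tailors the SOW to project needs rather than only filling in templates. In tests on real business examples it produced a complete SOW in under three minutes, compared with hours or days for manual drafting, while performing well on accuracy and quality, which reduces legal risk and frees professionals for higher-value decisions.
Link: https://arxiv.org/abs/2508.07569
Authors: Amulya Suravarjhula,Rashi Chandrashekhar Agrawal,Sakshi Jayesh Patel,Rahul Gupta
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 7 pages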
Abstract:Drafting a Statement of Work (SOW) is a vital part of business and legal projects. It outlines key details like deliverables, timelines, responsibilities, and legal terms. However, creating these documents is often a slow and complex process. It usually involves multiple people, takes several days, and leaves room for errors or outdated content. This paper introduces a new AI-driven automation system that makes the entire SOW drafting process faster, easier, and more accurate. Instead of relying completely on humans, the system uses three intelligent components or ‘agents’ that each handle a part of the job. One agent writes the first draft, another checks if everything is legally correct, and the third agent formats the document and ensures everything is in order. Unlike basic online tools that just fill in templates, this system understands the meaning behind the content and customizes the SOW to match the needs of the project. It also checks legal compliance and formatting so that users can trust the result. The system was tested using real business examples. It was able to create a full SOW in under three minutes, compared to several hours or days using manual methods. It also performed well in accuracy and quality, showing that it can reduce legal risks and save a lot of time. This solution shows how artificial intelligence can be used to support legal and business professionals by taking care of routine work and helping them focus on more important decisions. It’s a step toward making legal processes smarter, faster, and more reliable.
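[AI-62] A Small-footprint Acoustic Echo Cancellation Solution for Mobile Full-Duplex Speech Interactions ICASSP2025
[Quick Read]: This paper addresses poor Acoustic Echo Cancellation (AEC) in mobile full-duplex speech interaction systems caused by varying hardware, nonlinear distortion, and long latency. The solution has four parts. First, diverse data augmentation strategies improve robustness across environments. Second, progressive learning incrementally improves AEC effectiveness and noticeably improves speech quality. Third, a post-processing strategy with parameters tailored to downstream tasks such as Voice Activity Detection (VAD) and Automatic Speech Recognition (ASR) improves the accuracy of those tasks. Finally, a small-footprint model combined with streaming inference enables low-resource deployment on mobile devices. Experiments show strong results on Echo Return Loss Enhancement and Perceptual Evaluation of Speech Quality, along with significant gains in VAD and ASR performance.
Link: https://arxiv.org/abs/2508.07561
Authors: Yiheng Jiang,Tian Biao
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: This paper is accepted to ICASSP 2025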
Abstract:In full-duplex speech interaction systems, effective Acoustic Echo Cancellation (AEC) is crucial for recovering echo-contaminated speech. This paper presents a neural network-based AEC solution to address challenges in mobile scenarios with varying hardware, nonlinear distortions and long latency. We first incorporate diverse data augmentation strategies to enhance the model’s robustness across various environments. Moreover, progressive learning is employed to incrementally improve AEC effectiveness, resulting in a considerable improvement in speech quality. To further optimize AEC’s downstream applications, we introduce a novel post-processing strategy employing tailored parameters designed specifically for tasks such as Voice Activity Detection (VAD) and Automatic Speech Recognition (ASR), thus enhancing their overall efficacy. Finally, our method employs a small-footprint model with streaming inference, enabling seamless deployment on mobile devices. Empirical results demonstrate effectiveness of the proposed method in Echo Return Loss Enhancement and Perceptual Evaluation of Speech Quality, alongside significant improvements in both VAD and ASR results.
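The abstract reports results in Echo Return Loss Enhancement (ERLE). As a minimal sketch of the commonly used definition (the ratio, in dB, of microphone signal power to residual power after cancellation, measured on echo-only segments), the function below may help in reading those numbers. The function name and the toy signals are illustrative assumptions and are not taken from the paper.

```python
import math

def erle_db(mic_signal, residual_signal):
    """ERLE in dB: 10 * log10(mic power / residual power).

    Both inputs are sequences of samples from an echo-only segment
    (far-end speech only); this is the usual textbook definition,
    not necessarily the exact evaluation protocol of the paper.
    """
    mic_power = sum(x * x for x in mic_signal) / len(mic_signal)
    residual_power = sum(e * e for e in residual_signal) / len(residual_signal)
    if residual_power == 0:
        return math.inf  # echo removed completely
    return 10.0 * math.log10(mic_power / residual_power)

# Toy example: the canceller shrinks the echo amplitude by a factor of 10,
# so the power drops by a factor of 100, i.e. roughly 20 dB of ERLE.
mic = [1.0, -1.0, 1.0, -1.0]
residual = [0.1, -0.1, 0.1, -0.1]
print(round(erle_db(mic, residual), 3))
```

A higher ERLE means more echo was removed; it says nothing about near-end speech distortion, which is why the abstract pairs it with a perceptual quality measure.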
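[AI-63] Uncertainty-Driven Reliability: Selective Prediction and Trustworthy Deployment in Modern Machine Learning
[Quick Read]: This thesis addresses the safety and trustworthiness problems that arise when machine learning (ML) systems deployed in high-stakes settings lack adequate uncertainty estimation; the goal is more reliable models for critical tasks. The key to the solution is a lightweight, post-hoc selective prediction method based on the training trajectory: it ensembles predictions from intermediate training checkpoints to extract uncertainty signals, requires no change to the model architecture or loss, and is compatible with differential privacy (DP), so uncertainty quality stays robust under privacy noise. The thesis reports state-of-the-art selective prediction performance, gives a finite-sample decomposition of the selective classification gap into five interpretable error sources, and designs defenses against adversarial manipulation of uncertainty signals, moving ML systems from "only predicting accurately" toward "knowing when not to predict."
Link: https://arxiv.org/abs/2508.07556
Authors: Stephan Rabanser
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
Comments: PhD Thesis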
Abstract:Machine learning (ML) systems are increasingly deployed in high-stakes domains where reliability is paramount. This thesis investigates how uncertainty estimation can enhance the safety and trustworthiness of ML, focusing on selective prediction – where models abstain when confidence is low. We first show that a model’s training trajectory contains rich uncertainty signals that can be exploited without altering its architecture or loss. By ensembling predictions from intermediate checkpoints, we propose a lightweight, post-hoc abstention method that works across tasks, avoids the cost of deep ensembles, and achieves state-of-the-art selective prediction performance. Crucially, this approach is fully compatible with differential privacy (DP), allowing us to study how privacy noise affects uncertainty quality. We find that while many methods degrade under DP, our trajectory-based approach remains robust, and we introduce a framework for isolating the privacy-uncertainty trade-off. Next, we then develop a finite-sample decomposition of the selective classification gap – the deviation from the oracle accuracy-coverage curve – identifying five interpretable error sources and clarifying which interventions can close the gap. This explains why calibration alone cannot fix ranking errors, motivating methods that improve uncertainty ordering. Finally, we show that uncertainty signals can be adversarially manipulated to hide errors or deny service while maintaining high accuracy, and we design defenses combining calibration audits with verifiable inference. Together, these contributions advance reliable ML by improving, evaluating, and safeguarding uncertainty estimation, enabling models that not only make accurate predictions – but also know when to say “I do not know”. 
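The idea of abstaining based on predictions from intermediate checkpoints can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the thesis's actual scoring rule, so the sketch substitutes a simple majority-vote agreement score, and the names `selective_predict` and `min_agreement` are hypothetical.

```python
from collections import Counter

def selective_predict(checkpoint_preds, min_agreement=0.8):
    """Return (label, agreement), or (None, agreement) to abstain.

    checkpoint_preds: labels predicted for ONE input by several
    intermediate training checkpoints (assumed to be given).
    min_agreement: hypothetical threshold on the fraction of
    checkpoints that must agree with the majority label.
    """
    if not checkpoint_preds:
        raise ValueError("need at least one checkpoint prediction")
    label, count = Counter(checkpoint_preds).most_common(1)[0]
    agreement = count / len(checkpoint_preds)
    if agreement >= min_agreement:
        return label, agreement
    return None, agreement  # abstain: checkpoints disagree too much

# Stable prediction across the trajectory: accept.
print(selective_predict(["cat", "cat", "cat", "cat", "cat"]))
# Prediction keeps flipping during training: abstain.
print(selective_predict(["cat", "dog", "cat", "dog", "bird"]))
```

Raising `min_agreement` lowers coverage (more abstentions) in exchange for higher accuracy on the accepted inputs, which is the accuracy-coverage trade-off the abstract refers to.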
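[AI-64] Intersectoral Knowledge in AI and Urban Studies: A Framework for Transdisciplinary Research
[Quick Read]: This paper addresses the difficulty of validating and integrating transdisciplinary knowledge in complex fields such as Artificial Intelligence (AI) and urban studies, where disciplines differ in their ontological and epistemological perspectives. The key to the solution is a six-dimensional framework that classifies and assesses research orientations along ontological, epistemological, methodological, teleological, axiological, and valorization dimensions. The framework helps early career researchers and transdisciplinary teams identify and reconcile divergent disciplinary positions and promotes more socially accountable knowledge production.
Link: https://arxiv.org/abs/2508.07507
Authors: Rashid Mushkani
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: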
Abstract:Transdisciplinary approaches are increasingly essential for addressing grand societal challenges, particularly in complex domains such as Artificial Intelligence (AI), urban planning, and social sciences. However, effectively validating and integrating knowledge across distinct epistemic and ontological perspectives poses significant difficulties. This article proposes a six-dimensional framework for assessing and strengthening transdisciplinary knowledge validity in AI and city studies, based on an extensive analysis of the most cited research (2014–2024). Specifically, the framework classifies research orientations according to ontological, epistemological, methodological, teleological, axiological, and valorization dimensions. Our findings show a predominance of perspectives aligned with critical realism (ontological), positivism (epistemological), analytical methods (methodological), consequentialism (teleological), epistemic values (axiological), and social/economic valorization. Less common stances, such as idealism, mixed methods, and cultural valorization, are also examined for their potential to enrich knowledge production. We highlight how early career researchers and transdisciplinary teams can leverage this framework to reconcile divergent disciplinary viewpoints and promote socially accountable outcomes.
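[AI-65] VA-Blueprint: Uncovering Building Blocks for Visual Analytics System Design IEEE-VIS2025
[Quick Read]: This paper addresses the lack of systematic knowledge to guide the design and development of Visual Analytics (VA) systems. The gap is especially visible for urban VA systems: their data are intricate and their problems distinctive, yet the practical challenges of building them remain underexplored, so reusable design patterns and structured knowledge bases are scarce. The key to the solution is VA-Blueprint, a methodology and knowledge base. The authors systematically review the components of 20 representative urban VA systems and organize them into a multi-level structure that forms an initial blueprint, then use a large language model to automate extraction of core components from another 81 papers, scaling the knowledge base and providing a structured, reusable framework for VA system design.
Link: https://arxiv.org/abs/2508.07497
Authors: Leonardo Ferreira,Gustavo Moreira,Fabio Miranda
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE VIS 2025. VA-Blueprint is available at this https URL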
Abstract:Designing and building visual analytics (VA) systems is a complex, iterative process that requires the seamless integration of data processing, analytics capabilities, and visualization techniques. While prior research has extensively examined the social and collaborative aspects of VA system authoring, the practical challenges of developing these systems remain underexplored. As a result, despite the growing number of VA systems, there are only a few structured knowledge bases to guide their design and development. To tackle this gap, we propose VA-Blueprint, a methodology and knowledge base that systematically reviews and categorizes the fundamental building blocks of urban VA systems, a domain particularly rich and representative due to its intricate data and unique problem sets. Applying this methodology to an initial set of 20 systems, we identify and organize their core components into a multi-level structure, forming an initial knowledge base with a structured blueprint for VA system development. To scale this effort, we leverage a large language model to automate the extraction of these components for other 81 papers (completing a corpus of 101 papers), assessing its effectiveness in scaling knowledge base construction. We evaluate our method through interviews with experts and a quantitative analysis of annotation metrics. Our contributions provide a deeper understanding of VA systems’ composition and establish a practical foundation to support more structured, reproducible, and efficient system development. VA-Blueprint is available at this https URL.
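[AI-66] Grounding Natural Language for Multi-agent Decision-Making with Multi-agentic LLMs
[Quick Read]: This paper addresses how to integrate large language models (LLMs) with multi-agent decision-making algorithms to improve reasoning and coordination in complex collaborative tasks. The core challenge is designing a systematic framework that lets LLMs interact stably and efficiently with other agents in environments involving social dilemmas and game-theoretic considerations. The key to the solution is a set of integration practices: advanced prompt engineering, effective memory architectures, multi-modal information processing, and alignment strategies implemented through fine-tuning algorithms. These design choices are evaluated through ablation studies on classic game settings.
Link: https://arxiv.org/abs/2508.07466
Authors: Dom Huh,Prasant Mohapatra
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: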
Abstract:Language is a ubiquitous tool that is foundational to reasoning and collaboration, ranging from everyday interactions to sophisticated problem-solving tasks. The establishment of a common language can serve as a powerful asset in ensuring clear communication and understanding amongst agents, facilitating desired coordination and strategies. In this work, we extend the capabilities of large language models (LLMs) by integrating them with advancements in multi-agent decision-making algorithms. We propose a systematic framework for the design of multi-agentic large language models (LLMs), focusing on key integration practices. These include advanced prompt engineering techniques, the development of effective memory architectures, multi-modal information processing, and alignment strategies through fine-tuning algorithms. We evaluate these design choices through extensive ablation studies on classic game settings with significant underlying social dilemmas and game-theoretic considerations.
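[AI-67] Noise-Aware Generative Microscopic Traffic Simulation
[Quick Read]: This paper addresses the difficulty of modeling individual vehicle behavior in microscopic traffic simulation, in particular reproducing complex phenomena such as phantom traffic jams, which traditional human driver simulation models abstract away. The key to the solution is the I-24 MOTION Scenario Dataset (I24-MSD), a standardized, curated trajectory dataset that deliberately preserves realistic sensor error, combined with a noise-aware learning strategy that treats sensor noise as part of the learning problem rather than something to be removed purely through preprocessing. Adapting generative models with noise-aware loss functions improves simulation realism over traditional baselines and shows that explicitly modeling data imperfection works better than suppressing it.
Link: https://arxiv.org/abs/2508.07453
Authors: Vindula Jayawardana,Catherine Tang,Junyi Ji,Jonah Philion,Xue Bin Peng,Cathy Wu
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: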
Abstract:Accurately modeling individual vehicle behavior in microscopic traffic simulation remains a key challenge in intelligent transportation systems, as it requires vehicles to realistically generate and respond to complex traffic phenomena such as phantom traffic jams. While traditional human driver simulation models offer computational tractability, they do so by abstracting away the very complexity that defines human driving. On the other hand, recent advances in infrastructure-mounted camera-based roadway sensing have enabled the extraction of vehicle trajectory data, presenting an opportunity to shift toward generative, agent-based models. Yet, a major bottleneck remains: most existing datasets are either overly sanitized or lack standardization, failing to reflect the noisy, imperfect nature of real-world sensing. Unlike data from vehicle-mounted sensors-which can mitigate sensing artifacts like occlusion through overlapping fields of view and sensor fusion-infrastructure-based sensors surface a messier, more practical view of challenges that traffic engineers encounter. To this end, we present the I-24 MOTION Scenario Dataset (I24-MSD)-a standardized, curated dataset designed to preserve a realistic level of sensor imperfection, embracing these errors as part of the learning problem rather than an obstacle to overcome purely from preprocessing. Drawing from noise-aware learning strategies in computer vision, we further adapt existing generative models in the autonomous driving community for I24-MSD with noise-aware loss functions. Our results show that such models not only outperform traditional baselines in realism but also benefit from explicitly engaging with, rather than suppressing, data imperfection. We view I24-MSD as a stepping stone toward a new generation of microscopic traffic simulation that embraces the real-world challenges and is better aligned with practical needs.
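[AI-68] Stackelberg Coupling of Online Representation Learning and Reinforcement Learning
[Quick Read]: This paper addresses the difficulty of learning effective features from sparse reward signals in deep reinforcement learning (RL). Naive end-to-end learning often limits performance, while recent trends add complex auxiliary objectives or fully decouple perception from control, which increases design complexity. The key to the solution is a game-theoretic structured interaction, the Stackelberg Coupled Representation and Reinforcement Learning (SCORER) framework. The perception network (leader) learns features that help the control network (follower) minimize its Bellman error, and the equilibrium of this Stackelberg game is approximated with a two-timescale algorithm. Applied to standard DQN variants, SCORER improves sample efficiency and final performance without complex auxiliary objectives or architectures.
Link: https://arxiv.org/abs/2508.07452
Authors: Fernando Martinez,Tao Li,Yingdong Lu,Juntao Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: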
Abstract:Integrated, end-to-end learning of representations and policies remains a cornerstone of deep reinforcement learning (RL). However, to address the challenge of learning effective features from a sparse reward signal, recent trends have shifted towards adding complex auxiliary objectives or fully decoupling the two processes, often at the cost of increased design complexity. This work proposes an alternative to both decoupling and naive end-to-end learning, arguing that performance can be significantly improved by structuring the interaction between distinct perception and control networks with a principled, game-theoretic dynamic. We formalize this dynamic by introducing the Stackelberg Coupled Representation and Reinforcement Learning (SCORER) framework, which models the interaction between perception and control as a Stackelberg game. The perception network (leader) strategically learns features to benefit the control network (follower), whose own objective is to minimize its Bellman error. We approximate the game’s equilibrium with a practical two-timescale algorithm. Applied to standard DQN variants on benchmark tasks, SCORER improves sample efficiency and final performance. Our results show that performance gains can be achieved through principled algorithmic design of the perception-control dynamic, without requiring complex auxiliary objectives or architectures.
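[AI-69] Optimizing Districting Plans to Maximize Majority-Minority Districts via IPs and Local Search
[Quick Read]: This paper addresses how to generate better statewide redistricting plans that maximize the number of majority-minority districts, in support of enforcing the Voting Rights Act. The key to the solution is a new integer programming method. It uses a new column generation algorithm that builds on the previously proposed stochastic hierarchical partitioning algorithm to produce high-quality candidate districts (the columns of a standard set partitioning formulation), a local re-optimization algorithm that iteratively improves any baseline solution, and an algorithm that increases district compactness without reducing the number of majority-minority districts. Experiments show that the method outperforms the "short bursts" heuristic of Cannon et al. on multiple data sets.
Link: https://arxiv.org/abs/2508.07446
Authors: Daniel Brous,David Shmoys
Affiliation: Unknown
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 12 pages, 4 figures, 1 table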
Abstract:In redistricting litigation, effective enforcement of the Voting Rights Act has often involved providing the court with districting plans that display a larger number of majority-minority districts than the current proposal (as was true, for example, in what followed Allen v. Milligan concerning the congressional districting plan for Alabama in 2023). Recent work by Cannon et al. proposed a heuristic algorithm for generating plans to optimize majority-minority districts, which they called short bursts; that algorithm relies on a sophisticated random walk over the space of all plans, transitioning in bursts, where the initial plan for each burst is the most successful plan from the previous burst. We propose a method based on integer programming, where we build upon another previous work, the stochastic hierarchical partitioning algorithm, which heuristically generates a robust set of potential districts (viewed as columns in a standard set partitioning formulation); that approach was designed to optimize a different notion of fairness across a statewide plan. We design a new column generation algorithm to find plans via integer programming that outperforms short bursts on multiple data sets in generating statewide plans with significantly more majority-minority districts. These results also rely on a new local re-optimization algorithm to iteratively improve on any baseline solution, as well as an algorithm to increase the compactness of districts in plans generated (without impacting the number of majority-minority districts).
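For readers unfamiliar with the "standard set partitioning formulation" the abstract mentions, a generic version can be written as below. This is a textbook-style sketch, not the paper's exact model: the symbols $V$ (geographic units), $\mathcal{D}$ (candidate districts, i.e. columns), $a_{ij}$, $m_j$, and $k$ are notation assumed here for illustration, and constraints such as population balance and contiguity are assumed to be enforced when the candidate districts are generated.

```latex
\begin{align*}
\max_{x} \quad & \sum_{j \in \mathcal{D}} m_j \, x_j
  && \text{($m_j = 1$ if candidate district $j$ is majority-minority, else $0$)} \\
\text{s.t.} \quad & \sum_{j \in \mathcal{D}} a_{ij} \, x_j = 1
  && \forall i \in V \quad \text{($a_{ij} = 1$ if unit $i$ lies in district $j$)} \\
& \sum_{j \in \mathcal{D}} x_j = k
  && \text{(exactly $k$ districts are selected)} \\
& x_j \in \{0, 1\}
  && \forall j \in \mathcal{D}
\end{align*}
```

Column generation, in this reading, enlarges $\mathcal{D}$ with promising candidate districts instead of enumerating all of them up front.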
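[AI-70] Lightning Prediction under Uncertainty: DeepLight with Hazy Loss
[Quick Read]: This paper addresses several limitations of existing lightning prediction models: they struggle to capture the dynamic spatial context and inherent uncertainty of lightning events, underuse key observations such as radar reflectivity and cloud properties, and rely heavily on Numerical Weather Prediction (NWP) systems, which are computationally expensive and sensitive to parameter settings. The key to the solution is DeepLight, a deep learning architecture that fuses multi-source meteorological data (radar reflectivity, cloud properties, and historical lightning occurrences) through a dual-encoder structure and uses multi-branch convolution to capture spatial correlations across varying extents. Its Hazy Loss function penalizes deviations according to their proximity to true lightning events, which models spatio-temporal uncertainty. Experiments show an 18%-30% improvement in Equitable Threat Score (ETS) over state-of-the-art methods.
Link: https://arxiv.org/abs/2508.07428
Authors: Md Sultanul Arifin,Abu Nowshed Sakib,Yeasir Rayhan,Tanzima Hashem
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: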
Abstract:Lightning, a common feature of severe meteorological conditions, poses significant risks, from direct human injuries to substantial economic losses. These risks are further exacerbated by climate change. Early and accurate prediction of lightning would enable preventive measures to safeguard people, protect property, and minimize economic losses. In this paper, we present DeepLight, a novel deep learning architecture for predicting lightning occurrences. Existing prediction models face several critical limitations: they often struggle to capture the dynamic spatial context and inherent uncertainty of lightning events, underutilize key observational data, such as radar reflectivity and cloud properties, and rely heavily on Numerical Weather Prediction (NWP) systems, which are both computationally expensive and highly sensitive to parameter settings. To overcome these challenges, DeepLight leverages multi-source meteorological data, including radar reflectivity, cloud properties, and historical lightning occurrences through a dual-encoder architecture. By employing multi-branch convolution techniques, it dynamically captures spatial correlations across varying extents. Furthermore, its novel Hazy Loss function explicitly addresses the spatio-temporal uncertainty of lightning by penalizing deviations based on proximity to true events, enabling the model to better learn patterns amidst randomness. Extensive experiments show that DeepLight improves the Equitable Threat Score (ETS) by 18%-30% over state-of-the-art methods, establishing it as a robust solution for lightning prediction.
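The headline result is stated in terms of the Equitable Threat Score (ETS). As a reading aid, here is a minimal sketch of the standard ETS definition from a 2x2 forecast contingency table, which discounts hits expected by chance. The function name and the example counts are illustrative assumptions, not values from the paper.

```python
def equitable_threat_score(hits, misses, false_alarms, correct_negatives):
    """Standard ETS (Gilbert skill score) from contingency-table counts.

    ETS = (hits - hits_random) / (hits + misses + false_alarms - hits_random),
    where hits_random is the number of hits expected by chance.
    """
    total = hits + misses + false_alarms + correct_negatives
    if total == 0:
        raise ValueError("contingency table is empty")
    hits_random = (hits + misses) * (hits + false_alarms) / total
    denominator = hits + misses + false_alarms - hits_random
    if denominator == 0:
        return 0.0  # degenerate table: no skill can be measured
    return (hits - hits_random) / denominator

# Hypothetical grid-cell counts for one forecast window.
score = equitable_threat_score(hits=50, misses=25, false_alarms=25,
                               correct_negatives=900)
print(round(score, 4))
```

ETS equals 1 for a perfect forecast and is near 0 for a forecast with no skill beyond chance, so a relative gain of 18%-30% is measured on this scale.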
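[AI-71] Urbanite: A Dataflow-Based Framework for Human-AI Interactive Alignment in Urban Visual Analytics IEEE-VIS2025
[Quick Read]: This paper addresses the difficulty of building and collaborating on analysis workflows in urban visual analytics, which stems from data complexity, the need for expertise across multiple domains, and a high technical barrier for users. The core challenge is keeping user intent and system behavior aligned as interaction shifts from explicit operations to intent-based specification. The key to the solution is Urbanite, a framework built on a dataflow-based model that lets users specify intent at multiple scopes (dataflows, nodes, and parameters) and supports interactive alignment across the specification, process, and evaluation stages of analysis. It also integrates explainability, multi-resolution task definition, and provenance of interactions to support human-AI collaboration in building urban analytics systems.
Link: https://arxiv.org/abs/2508.07390
Authors: Gustavo Moreira,Leonardo Ferreira,Carolina Veiga,Maryam Hosseini,Fabio Miranda
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE VIS 2025. Urbanite is available at this https URL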
Abstract:With the growing availability of urban data and the increasing complexity of societal challenges, visual analytics has become essential for deriving insights into pressing real-world problems. However, analyzing such data is inherently complex and iterative, requiring expertise across multiple domains. The need to manage diverse datasets, distill intricate workflows, and integrate various analytical methods presents a high barrier to entry, especially for researchers and urban experts who lack proficiency in data management, machine learning, and visualization. Advancements in large language models offer a promising solution to lower the barriers to the construction of analytics systems by enabling users to specify intent rather than define precise computational operations. However, this shift from explicit operations to intent-based interaction introduces challenges in ensuring alignment throughout the design and development process. Without proper mechanisms, gaps can emerge between user intent, system behavior, and analytical outcomes. To address these challenges, we propose Urbanite, a framework for human-AI collaboration in urban visual analytics. Urbanite leverages a dataflow-based model that allows users to specify intent at multiple scopes, enabling interactive alignment across the specification, process, and evaluation stages of urban analytics. Based on findings from a survey to uncover challenges, Urbanite incorporates features to facilitate explainability, multi-resolution definition of tasks across dataflows, nodes, and parameters, while supporting the provenance of interactions. We demonstrate Urbanite’s effectiveness through usage scenarios created in collaboration with urban experts. Urbanite is available at this https URL.
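[AI-72] Invert4TVG: A Temporal Video Grounding Framework with Inversion Tasks for Enhanced Action Understanding
[Quick Read]: This paper addresses a weakness of current Temporal Video Grounding (TVG) methods: by optimizing for high temporal Intersection-over-Union (IoU) they overfit to that metric and compromise semantic understanding of the actions in the video and query, which limits robustness. The key to the solution is the Inversion Tasks for TVG (Invert4TVG) framework, which derives three inversion tasks from existing TVG annotations: Verb Completion, Action Recognition, and Video Description. These tasks are optimized jointly with TVG in a reinforcement learning framework with designed reward functions, balancing localization accuracy and semantic consistency. On Charades-STA, a 3B-parameter model improves R1@0.7 by 7.1% over Time-R1.
Link: https://arxiv.org/abs/2508.07388
Authors: Zhaoyu Chen,Hongnan Lin,Yongwei Nie,Fei Ma,Xuemiao Xu,Fei Yu,Chengjiang Long
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: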
Abstract:Temporal Video Grounding (TVG) seeks to localize video segments matching a given textual query. Current methods, while optimizing for high temporal Intersection-over-Union (IoU), often overfit to this metric, compromising semantic action understanding in the video and query, a critical factor for robust TVG. To address this, we introduce Inversion Tasks for TVG (Invert4TVG), a novel framework that enhances both localization accuracy and action understanding without additional data. Our approach leverages three inversion tasks derived from existing TVG annotations: (1) Verb Completion, predicting masked action verbs in queries from video segments; (2) Action Recognition, identifying query-described actions; and (3) Video Description, generating descriptions of video segments that explicitly embed query-relevant actions. These tasks, integrated with TVG via a reinforcement learning framework with well-designed reward functions, ensure balanced optimization of localization and semantics. Experiments show our method outperforms state-of-the-art approaches, achieving a 7.1% improvement in R1@0.7 on Charades-STA for a 3B model compared to Time-R1. By inverting TVG to derive query-related actions from segments, our approach strengthens semantic understanding, significantly raising the ceiling of localization accuracy.
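The metrics named in the abstract, temporal IoU and R1@0.7, can be sketched as follows. This is a minimal illustration of the usual definitions (overlap over union of two time intervals, and the fraction of queries whose top-1 prediction reaches the IoU threshold); the function names and example segments are assumptions made for illustration and do not come from the paper.

```python
def temporal_iou(pred, gt):
    """IoU of two time segments given as (start, end) in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0

def recall_at_1(top1_preds, ground_truths, threshold=0.7):
    """R1@threshold: share of queries whose top-1 segment has IoU >= threshold."""
    if not ground_truths:
        raise ValueError("need at least one query")
    hits = sum(
        temporal_iou(pred, gt) >= threshold
        for pred, gt in zip(top1_preds, ground_truths)
    )
    return hits / len(ground_truths)

# Two hypothetical queries: the first prediction is close, the second is not.
preds = [(0.0, 10.0), (2.0, 8.0)]
gts = [(0.0, 9.0), (4.0, 10.0)]
print(temporal_iou(preds[1], gts[1]))   # 0.5
print(recall_at_1(preds, gts, 0.7))     # 0.5
```

Because the score depends only on interval overlap, a model can reach a high IoU without recognizing the queried action, which is the overfitting concern the abstract raises.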
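[AI-73] Pentest-R1: Towards Autonomous Penetration Testing Reasoning Optimized via Two-Stage Reinforcement Learning
[Quick Read]: This paper addresses three core limitations of current Large Language Models (LLMs) in automated penetration testing: poor error handling, inefficient reasoning, and an inability to complete complex end-to-end tasks autonomously. The proposed framework, Pentest-R1, relies on a two-stage reinforcement learning (RL) pipeline. In the first stage, offline RL on a dataset of more than 500 real-world, multi-step walkthroughs instills foundational attack logic. In the second stage, online RL in an interactive Capture The Flag (CTF) environment fine-tunes the LLM so that it learns error self-correction and adaptive strategies from environmental feedback. On AutoPenBench, Pentest-R1 reaches a 24.2% success rate, second only to Gemini 2.5 Flash; on Cybench it reaches 15.0% on unguided tasks, a new state of the art for open-source LLMs. Ablation studies confirm that the combination of both stages is critical.
Link: https://arxiv.org/abs/2508.07382
Authors: He Kong,Die Hu,Jingguo Ge,Liangxiong Li,Hui Li,Tong Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: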
Abstract:Automating penetration testing is crucial for enhancing cybersecurity, yet current Large Language Models (LLMs) face significant limitations in this domain, including poor error handling, inefficient reasoning, and an inability to perform complex end-to-end tasks autonomously. To address these challenges, we introduce Pentest-R1, a novel framework designed to optimize LLM reasoning capabilities for this task through a two-stage reinforcement learning pipeline. We first construct a dataset of over 500 real-world, multi-step walkthroughs, which Pentest-R1 leverages for offline reinforcement learning (RL) to instill foundational attack logic. Subsequently, the LLM is fine-tuned via online RL in an interactive Capture The Flag (CTF) environment, where it learns directly from environmental feedback to develop robust error self-correction and adaptive strategies. Our extensive experiments on the Cybench and AutoPenBench benchmarks demonstrate the framework’s effectiveness. On AutoPenBench, Pentest-R1 achieves a 24.2% success rate, surpassing most state-of-the-art models and ranking second only to Gemini 2.5 Flash. On Cybench, it attains a 15.0% success rate in unguided tasks, establishing a new state-of-the-art for open-source LLMs and matching the performance of top proprietary models. Ablation studies confirm that the synergy of both training stages is critical to its success.
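[AI-74] AutoAssert 1: A LoRA Fine-Tuned LLM Model for Efficient Automated Assertion Generation
[Quick Read]: This paper addresses the growing need for automated testing and maintenance tools as software systems become more complex, and in particular how to generate assertions that conform to hardware logic efficiently. The key to the solution is an assertion generation method based on Hardware Description Language (HDL) that combines a lightweight, parameter-adjustable large language model (LLM) with the Unsloth platform. According to the authors, the method significantly reduces training cost without sacrificing accuracy or generalization, and generates assertions that strictly conform to the hardware logic.
Link: https://arxiv.org/abs/2508.07371
Authors: Yi Zhong,Hongchao Liu,Di ZHao
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 16 pages, 6 figures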
Abstract:As the complexity of software systems continues to increase, the demand for automated testing and maintenance tools is growing exponentially. To meet this urgent need, we propose a new assertion generation method based on Hardware Description Language (HDL). This method combines a lightweight, parameter-adjustable large language model (LLM) with the Unsloth platform to automatically generate test cases, thereby significantly reducing training costs without sacrificing accuracy or generalization performance. Empirical evaluation shows that our method can efficiently generate assertions that strictly conform to the hardware logic. This framework provides a robust and flexible solution to modern software testing and maintenance challenges. this https URL and this https URL are the locations of the source code.
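[AI-75] ProteoKnight: Convolution-based phage virion protein classification and uncertainty analysis
[Quick Read]: This paper addresses accurate prediction of Phage Virion Proteins (PVP), which matters for genomic studies because PVPs are key structural elements of bacteriophages. Existing computational tools for annotating phage protein sequences from high-throughput sequencing are limited by their sequence encodings, especially in how well they preserve spatial information. The key to the solution is ProteoKnight, an image-based encoding that adapts the classical DNA-Walk algorithm to protein sequences, using pixel colors and adjusted walk distances to capture intricate protein features and mitigate the spatial information loss of earlier methods. Encoded sequences are classified with pre-trained Convolutional Neural Networks (CNNs), and Monte Carlo Dropout (MCD) is used to assess prediction uncertainty. The abstract reports 90.8% accuracy in binary classification, while multi-class accuracy remains suboptimal.
Link: https://arxiv.org/abs/2508.07345
Authors: Samiha Afaf Neha,Abir Ahammed Bhuiyan,Md. Ishrak Khan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: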
Abstract:Introduction: Accurate prediction of Phage Virion Proteins (PVP) is essential for genomic studies due to their crucial role as structural elements in bacteriophages. Computational tools, particularly machine learning, have emerged for annotating phage protein sequences from high-throughput sequencing. However, effective annotation requires specialized sequence encodings. Our paper introduces ProteoKnight, a new image-based encoding method that addresses spatial constraints in existing techniques, yielding competitive performance in PVP classification using pre-trained convolutional neural networks. Additionally, our study evaluates prediction uncertainty in binary PVP classification through Monte Carlo Dropout (MCD). Methods: ProteoKnight adapts the classical DNA-Walk algorithm for protein sequences, incorporating pixel colors and adjusting walk distances to capture intricate protein features. Encoded sequences were classified using multiple pre-trained CNNs. Variance and entropy measures assessed prediction uncertainty across proteins of various classes and lengths. Results: Our experiments achieved 90.8% accuracy in binary classification, comparable to state-of-the-art methods. Multi-class classification accuracy remains suboptimal. Our uncertainty analysis unveils variability in prediction confidence influenced by protein class and sequence length. Conclusions: Our study surpasses frequency chaos game representation (FCGR) by introducing novel image encoding that mitigates spatial information loss limitations. Our classification technique yields accurate and robust PVP predictions while identifying low-confidence predictions.
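The abstract says variance and entropy were used to assess uncertainty under Monte Carlo Dropout. A minimal sketch of how such summaries are commonly computed for a binary classifier is given below. It assumes the positive-class probabilities from several stochastic forward passes (dropout left active at inference) are already available; the function names and the sample values are hypothetical, and the paper's exact measures may differ.

```python
import math

def binary_entropy(p):
    """Shannon entropy (in nats) of a Bernoulli distribution with P(positive) = p."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

def mc_dropout_summary(positive_probs):
    """Summarize T stochastic forward passes for one protein.

    positive_probs: P(PVP) from each pass with dropout active
    (assumed input; no model is run here).
    Returns (mean probability, variance across passes, predictive entropy).
    """
    if not positive_probs:
        raise ValueError("need at least one forward pass")
    n = len(positive_probs)
    mean = sum(positive_probs) / n
    variance = sum((p - mean) ** 2 for p in positive_probs) / n
    return mean, variance, binary_entropy(mean)

# Consistent passes give low variance and low entropy (confident prediction).
print(mc_dropout_summary([0.97, 0.95, 0.98, 0.96]))
# Passes that disagree give high variance and entropy near ln 2 (low confidence).
print(mc_dropout_summary([0.9, 0.1, 0.8, 0.2]))
```

Predictions with high variance or entropy are the ones a pipeline like this would flag as low-confidence.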
[AI-76] Hallucination as a Computational Boundary: A Hierarchy of Inevitability and the Oracle Escape
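[Quick Read]: This paper addresses hallucination in Large Language Models (LLMs), which the abstract calls the "illusion" phenomenon and identifies as the core obstacle to their reliable deployment. The authors construct a "computational necessity hierarchy" that formalizes an LLM as a probabilistic Turing machine and prove, for the first time, that hallucinations are inevitable at the diagonalization, uncomputability, and information-theoretic boundaries, supported by a new "learner pump lemma". The key to the solution is two "escape routes": the first models Retrieval Enhanced Generation (RAG) as an oracle machine and proves its absolute escape through "computational jumps", giving the first formal theory for why RAG is effective; the second formalizes continual learning as an "internalized oracle" mechanism and realizes this route through a novel neural game theory.
Link: https://arxiv.org/abs/2508.07334
Authors: Quan Shi, Wang Xi, Zenghui Ding, Jianqing Gao, Xianjun Yang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures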
Abstract:The illusion phenomenon of large language models (LLMs) is the core obstacle to their reliable deployment. This article formalizes the large language model as a probabilistic Turing machine by constructing a “computational necessity hierarchy”, and for the first time proves the illusions are inevitable on diagonalization, incomputability, and information theory boundaries supported by the new “learner pump lemma”. However, we propose two “escape routes”: one is to model Retrieval Enhanced Generations (RAGs) as oracle machines, proving their absolute escape through “computational jumps”, providing the first formal theory for the effectiveness of RAGs; The second is to formalize continuous learning as an “internalized oracle” mechanism and implement this path through a novel neural game theory this http URL, this article proposes a
[AI-77] Efficient Edge LLMs Deployment via Hessian-Aware Quantization and CPU-GPU Collaborative
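[Quick Read]: This paper addresses two challenges in efficiently deploying Mixture of Experts (MoE) Large Language Models (LLMs) on resource-constrained edge devices. First, numerous outliers in the activation distributions severely degrade quantization accuracy and hurt inference performance. Second, under limited memory, offloading and collaborative inference of expert modules struggle to balance latency and throughput. The key to the solution is a joint 8-bit quantization method based on Hessian-Aware Quantization (HAQ), which introduces smoothed Hessian matrix quantization to reduce the effect of outliers on activation and weight quantization. It is paired with an expert-level collaborative offloading and inference mechanism that uses expert activation path statistics to schedule expert modules between CPU and GPU, substantially reducing memory footprint and improving inference efficiency.
Link: https://arxiv.org/abs/2508.07329
Authors: Tuo Zhang, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: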
Abstract:With the breakthrough progress of large language models (LLMs) in natural language processing and multimodal tasks, efficiently deploying them on resource-constrained edge devices has become a critical challenge. The Mixture of Experts (MoE) architecture enhances model capacity through sparse activation, but faces two major difficulties in practical deployment: (1) The presence of numerous outliers in activation distributions leads to severe degradation in quantization accuracy for both activations and weights, significantly impairing inference performance; (2) Under limited memory, efficient offloading and collaborative inference of expert modules struggle to balance latency and throughput. To address these issues, this paper proposes an efficient MoE edge deployment scheme based on Hessian-Aware Quantization (HAQ) and CPU-GPU collaborative inference. First, by introducing smoothed Hessian matrix quantization, we achieve joint 8-bit quantization of activations and weights, which significantly alleviates the accuracy loss caused by outliers while ensuring efficient implementation on mainstream hardware. Second, we design an expert-level collaborative offloading and inference mechanism, which, combined with expert activation path statistics, enables efficient deployment and scheduling of expert modules between CPU and GPU, greatly reducing memory footprint and inference latency. Extensive experiments validate the effectiveness of our method on mainstream large models such as the OPT series and Mixtral 8*7B: on datasets like Wikitext2 and C4, the inference accuracy of the low-bit quantized model approaches that of the full-precision model, while GPU memory usage is reduced by about 60%, and inference latency is significantly improved.
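To see why activation outliers hurt low-bit quantization, here is a minimal sketch of plain symmetric per-tensor 8-bit quantization. It is deliberately not the paper's smoothed Hessian method, whose details the abstract does not give; the sample values and the single outlier are assumptions for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate real values."""
    return [qi * scale for qi in q]


def mean_abs_error(values):
    """Average reconstruction error after a quantize/dequantize round trip."""
    q, scale = quantize_int8(values)
    recon = dequantize(q, scale)
    return sum(abs(a - b) for a, b in zip(values, recon)) / len(values)


normal = [0.1 * i for i in range(-10, 11)]  # values spread over [-1, 1]
with_outlier = normal + [100.0]             # one large activation outlier

err_normal = mean_abs_error(normal)
err_outlier = mean_abs_error(with_outlier)
print(f"without outlier: {err_normal:.4f}, with outlier: {err_outlier:.4f}")
```

The single outlier stretches the scale, so the ordinary values collapse onto very few integer levels. Outlier-aware schemes such as the one the paper proposes are designed to counter this effect.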
[AI-78] From Knowledge to Conjectures: A Modal Framework for Reasoning about Hypotheses
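[Quick Read]: This paper addresses how to formalize conjectural reasoning in cognitive logic, that is, extending known facts with hypothetical premises to explore their consequences while keeping factual and hypothetical statements distinct. Traditional doxastic and epistemic logics handle this layered cognitive structure poorly, especially when it comes to keeping the hypothetical layer separate from the factual layer. The key to the solution is a new family of modal systems (such as KC and KDC) built around Axiom C (φ → □φ), which ensures that all established facts are preserved in the hypothetical layers, while dropping Axiom T (□φ → φ) to avoid the modal collapse that arises under classical bivalent semantics. A paracomplete semantic framework grounded in Weak Kleene logic or Description Logic lets undefined propositions coexist with modal assertions, which gives a clean layering of cognitive contexts and yields systems that are complete, decidable, and robust under partial knowledge. The paper also introduces a dynamic operation, settle(φ), which formalizes the update from conjecture to accepted fact and captures the cognitive event of resolving uncertainty.
Link: https://arxiv.org/abs/2508.07304
Authors: Fabio Vitali
Affiliation: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: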
Abstract:This paper introduces a new family of cognitive modal logics designed to formalize conjectural reasoning: a modal system in which cognitive contexts extend known facts with hypothetical assumptions to explore their consequences. Unlike traditional doxastic and epistemic systems, conjectural logics rely on a principle, called Axiom C (φ → □φ), that ensures that all established facts are preserved across hypothetical layers. While Axiom C was dismissed in the past due to its association with modal collapse, we show that the collapse only arises under classical and bivalent assumptions, and specifically in the presence of Axiom T. Hence we avoid Axiom T and adopt a paracomplete semantic framework, grounded in Weak Kleene logic or Description Logic, where undefined propositions coexist with modal assertions. This prevents the modal collapse and guarantees a layering to distinguish between factual and conjectural statements. Under this framework we define new modal systems, e.g., KC and KDC, and show that they are complete, decidable, and robust under partial knowledge. Finally, we introduce a dynamic operation, settle(φ), which formalizes the transition from conjecture to accepted fact, capturing the event of the update of a world’s cognitive state through the resolution of uncertainty.
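A short worked step makes the collapse mentioned in the abstract explicit. In a classical, bivalent setting, combining Axiom C with Axiom T makes every formula equivalent to its boxed form, so the modality no longer distinguishes anything. The lines below are the standard propositional observation, written out as a sketch rather than quoted from the paper.

```latex
\begin{align*}
\text{Axiom C:} \quad & \varphi \rightarrow \Box\varphi \\
\text{Axiom T:} \quad & \Box\varphi \rightarrow \varphi \\
\text{Hence:}   \quad & \varphi \leftrightarrow \Box\varphi
\end{align*}
```

Dropping Axiom T removes the second implication, which is one of the two steps the abstract names for avoiding the collapse; the other is the move to a paracomplete semantics.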
[AI-79] When Is Prior Knowledge Helpful? Exploring the Evaluation and Selection of Unsupervised Pretext Tasks from a Neuro-Symbolic Perspective
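[Quick Read]: This paper addresses the lack of theoretical guidance for choosing pretext tasks in semi/self-supervised learning (SSL) and neuro-symbolic (Nesy) learning: in practice, unsupervised pretext tasks are chosen heuristically, and it is hard to judge whether a choice is sound before training the model. The key to the solution is to extend Nesy theory from reliable knowledge to unreliable knowledge (i.e., assumptions), which unifies the theoretical frameworks of SSL and Nesy. Rigorous theoretical analysis yields three factors that determine a pretext task's effectiveness: knowledge learnability, knowledge reliability, and knowledge completeness. The authors then design operational metrics for these factors so that the effect of a pretext task can be predicted in advance. Experiments show that performance predicted from a small amount of data correlates highly with actual performance after large-scale semi-supervised or self-supervised training, supporting both the theory and the practicality of the method.
Link: https://arxiv.org/abs/2508.07299
Authors: Lin-Han Jia, Si-Yu Han, Wen-Chao Hu, Jie-Jing Shao, Wen-Da Wei, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: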
Abstract:Neuro-symbolic (Nesy) learning improves the target task performance of models by enabling them to satisfy knowledge, while semi/self-supervised learning (SSL) improves the target task performance by designing unsupervised pretext tasks for unlabeled data to make models satisfy corresponding assumptions. We extend the Nesy theory based on reliable knowledge to the scenario of unreliable knowledge (i.e., assumptions), thereby unifying the theoretical frameworks of SSL and Nesy. Through rigorous theoretical analysis, we demonstrate that, in theory, the impact of pretext tasks on target performance hinges on three factors: knowledge learnability with respect to the model, knowledge reliability with respect to the data, and knowledge completeness with respect to the target. We further propose schemes to operationalize these theoretical metrics, and thereby develop a method that can predict the effectiveness of pretext tasks in advance. This will change the current status quo in practical applications, where the selections of unsupervised tasks are heuristic-based rather than theory-based, and it is difficult to evaluate the rationality of unsupervised pretext task selection before testing the model on the target task. In experiments, we verify a high correlation between the predicted performance (estimated using minimal data) and the actual performance achieved after large-scale semi-supervised or self-supervised learning, thus confirming the validity of the theory and the effectiveness of the evaluation method.
[AI-80] Revisiting Data Attribution for Influence Functions
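[Quick Read]: This paper addresses data attribution in deep learning models: identifying the training samples that most influence a model's predictions and understanding how the model's behavior leads to particular predictions. The key to the solution is influence functions, a tool from robust statistics that gives an efficient first-order approximation of how marginally upweighting or removing a single training point changes the learned parameters and the resulting predictions, without expensive retraining. The paper reviews the data attribution capability of influence functions in deep learning, covering their theoretical foundations and algorithmic advances in efficient inverse-Hessian-vector product estimation, and evaluates their effectiveness for data attribution and mislabel detection.
Link: https://arxiv.org/abs/2508.07297
Authors: Hongbo Zhu, Angelo Cangelosi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: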
Abstract:The goal of data attribution is to trace the model’s predictions through the learning algorithm and back to its training data, thereby identifying the most influential training samples and understanding how the model’s behavior leads to particular predictions. Understanding how individual training examples influence a model’s predictions is fundamental for machine learning interpretability, data debugging, and model accountability. Influence functions, originating from robust statistics, offer an efficient, first-order approximation to estimate the impact of marginally upweighting or removing a data point on a model’s learned parameters and its subsequent predictions, without the need for expensive retraining. This paper comprehensively reviews the data attribution capability of influence functions in deep learning. We discuss their theoretical foundations, recent algorithmic advances for efficient inverse-Hessian-vector product estimation, and evaluate their effectiveness for data attribution and mislabel detection. Finally, we highlight current challenges and promising directions for unleashing the huge potential of influence functions in large-scale, real-world deep learning scenarios.
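The "first-order approximation without retraining" idea can be seen on a deliberately tiny model. The sketch below assumes a one-parameter model that fits the mean of the data under squared loss, so the Hessian is exactly 1; it is a toy illustration, not code from the paper, and all function names are hypothetical.

```python
def fit_mean(xs):
    """Minimiser of the average loss 0.5 * (theta - x)^2, i.e. the sample mean."""
    return sum(xs) / len(xs)


def influence_on_parameter(x, theta_hat):
    """Influence of upweighting x: -H^{-1} * grad, with H = 1 for this loss."""
    hessian = 1.0
    grad = theta_hat - x  # derivative of 0.5 * (theta - x)^2 at theta_hat
    return -grad / hessian


def estimated_shift_if_removed(x, xs):
    """First-order estimate of the parameter change when x is removed.

    Removing a point corresponds to upweighting it by -1/n.
    """
    theta_hat = fit_mean(xs)
    return -influence_on_parameter(x, theta_hat) / len(xs)


def exact_shift_if_removed(index, xs):
    """Ground truth obtained by actually refitting without the point."""
    rest = xs[:index] + xs[index + 1:]
    return fit_mean(rest) - fit_mean(xs)


data = [1.0, 2.0, 3.0, 10.0]
estimate = estimated_shift_if_removed(data[3], data)  # -1.5
exact = exact_shift_if_removed(3, data)               # -2.0
print(estimate, exact)
```

The estimate has the right sign and ranks the outlier as the most influential point, but differs from the refit by a factor of n/(n-1), a reminder that influence functions are approximations. In deep networks the Hessian is not 1, which is why efficient inverse-Hessian-vector products matter.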
[AI-81] Fine-Tuning Large Language Models Using EEG Microstate Features for Mental Workload Assessment
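[Quick Read]: This paper addresses how to use electroencephalography (EEG) microstate features to improve a Large Language Model's (LLM's) accuracy in recognizing cognitive load states (such as "Rest" and "Load"). The key to the solution is a four-stage pipeline that integrates EEG features with the LLM: EEG data collection and preprocessing; microstate segmentation and EEG backfitting to extract dynamic EEG features; feeding those features to the LLM as prompts, with prompt engineering to strengthen the model's grasp of cognitive states; and supervised fine-tuning so that the LLM discriminates cognitive load states from EEG microstate features. The method significantly improves model performance and offers an interpretable, generalizable approach for cognitive neuroscience and cognitive AI research.
Link: https://arxiv.org/abs/2508.07283
Authors: Bujar Raufi
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
Comments: 17 pages, 7 figures, 3 tables, and one prompt template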
Abstract:This study explores the intersection of electroencephalography (EEG) microstates and Large Language Models (LLMs) to enhance the assessment of cognitive load states. By utilizing EEG microstate features, the research aims to fine-tune LLMs for improved predictions of distinct cognitive states, specifically ‘Rest’ and ‘Load’. The experimental design is delineated in four comprehensive stages: dataset collection and preprocessing, microstate segmentation and EEG backfitting, feature extraction paired with prompt engineering, and meticulous LLM model selection and refinement. Employing a supervised learning paradigm, the LLM is trained to identify cognitive load states based on EEG microstate features integrated into prompts, producing accurate discrimination of cognitive load. A curated dataset, linking EEG features to specified cognitive load conditions, underpins the experimental framework. The results indicate a significant improvement in model performance following the proposed fine-tuning, showcasing the potential of EEG-informed LLMs in cognitive neuroscience and cognitive AI applications. This approach not only contributes to the understanding of brain dynamics but also paves the way for advancements in machine learning techniques applicable to cognitive load and cognitive AI research.
[AI-82] Causal Negative Sampling via Diffusion Model for Out-of-Distribution Recommendation
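[Quick Read]: This paper addresses False Hard Negatives (FHNS) in recommender systems, which heuristic negative sampling methods introduce because of unobserved environmental confounders (such as exposure or popularity bias) in the candidate pool. These FHNS lead the model to learn spurious correlations induced by the confounders, which harms generalization under distribution shift. The key to the solution is Causal Negative Sampling via Diffusion (CNSDiff): it synthesizes negative samples in the latent space through a conditional diffusion process, avoiding the bias that predefined candidate pools introduce, and it adds a causal regularization term that explicitly suppresses the influence of environmental confounders on negative sampling. The resulting negatives are more robust and markedly improve recommendation performance in Out-of-Distribution (OOD) settings.
Link: https://arxiv.org/abs/2508.07243
Authors: Chu Zhao, Eneng Yang, Yizhou Dang, Jianzhe Zhao, Guibing Guo, Xingwei Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 6 figures, under review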
Abstract:Heuristic negative sampling enhances recommendation performance by selecting negative samples of varying hardness levels from predefined candidate pools to guide the model toward learning more accurate decision boundaries. However, our empirical and theoretical analyses reveal that unobserved environmental confounders (e.g., exposure or popularity biases) in candidate pools may cause heuristic sampling methods to introduce false hard negatives (FHNS). These misleading samples can encourage the model to learn spurious correlations induced by such confounders, ultimately compromising its generalization ability under distribution shifts. To address this issue, we propose a novel method named Causal Negative Sampling via Diffusion (CNSDiff). By synthesizing negative samples in the latent space via a conditional diffusion process, CNSDiff avoids the bias introduced by predefined candidate pools and thus reduces the likelihood of generating FHNS. Moreover, it incorporates a causal regularization term to explicitly mitigate the influence of environmental confounders during the negative sampling process, leading to robust negatives that promote out-of-distribution (OOD) generalization. Comprehensive experiments under four representative distribution shift scenarios demonstrate that CNSDiff achieves an average improvement of 13.96% across all evaluation metrics compared to state-of-the-art baselines, verifying its effectiveness and robustness in OOD recommendation tasks.
[AI-83] SocRipple: A Two-Stage Framework for Cold-Start Video Recommendations RECSYS2025
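[Quick Read]: This paper addresses the cold-start problem in recommender systems: new items lack interaction history, which makes personalized distribution difficult. Standard collaborative filtering models underperform because engagement signals are sparse, while content-only approaches fail to capture user-specific relevance. The key to the solution is SocRipple, a two-stage retrieval framework. Stage 1 uses the creator's social connections for targeted initial exposure. Stage 2 builds on early engagement signals and stable user embeddings learned from historical interactions to "ripple" outward via K-Nearest Neighbor (KNN) search. This raises exposure for new items while maintaining user engagement rate, balancing new-item promotion with personalized recommendation.
Link: https://arxiv.org/abs/2508.07241
Authors: Amit Jaspal, Kapil Dalwani, Ajantha Ramineni
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 4 pages, 2 figures, 2 tables, RecSys 2025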
Abstract:Most industry-scale recommender systems face critical cold-start challenges: new items lack interaction history, making it difficult to distribute them in a personalized manner. Standard collaborative filtering models underperform due to sparse engagement signals, while content-only approaches lack user-specific relevance. We propose SocRipple, a novel two-stage retrieval framework tailored for cold-start item distribution in social-graph-based platforms. Stage 1 leverages the creator's social connections for targeted initial exposure. Stage 2 builds on early engagement signals and stable user embeddings learned from historical interactions to “ripple” outwards via K-Nearest Neighbor (KNN) search. Large-scale experiments on a major video platform show that SocRipple boosts cold-start item distribution by +36% while maintaining user engagement rate on cold-start items, effectively balancing new item exposure with personalized recommendations.
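The Stage 2 "ripple" can be pictured as a nearest-neighbor lookup around users who engaged early. The sketch below is an assumption-laden illustration, not the production system: the user names, two-dimensional embeddings, cosine similarity, and the `ripple_candidates` helper are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def ripple_candidates(seed_users, embeddings, k=2):
    """For each early engager, pick the k most similar non-seed users."""
    seeds = set(seed_users)
    best = {}
    for s in seed_users:
        scored = sorted(
            ((cosine(embeddings[s], emb), u)
             for u, emb in embeddings.items() if u not in seeds),
            reverse=True,
        )
        for score, u in scored[:k]:
            best[u] = max(score, best.get(u, -1.0))
    # Rank candidates by their best similarity to any seed user.
    return sorted(best, key=lambda u: best[u], reverse=True)


# Hypothetical stable user embeddings learned from historical interactions.
embeddings = {
    "alice": [1.0, 0.0],
    "bob": [0.9, 0.1],
    "carol": [0.0, 1.0],
    "dave": [0.1, 0.9],
    "erin": [0.7, 0.7],
}

# "alice" engaged with the new video after Stage 1 exposure.
print(ripple_candidates(["alice"], embeddings, k=2))  # ['bob', 'erin']
```

A real deployment would replace the exhaustive scan with an approximate nearest-neighbor index; the brute-force loop is used here only to keep the example self-contained.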
[AI-84] EDGE: A Theoretical Framework for Misconception-Aware Adaptive Learning
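[Quick Read]: This paper addresses inefficient learning in adaptive learning systems caused by learners' specific misconceptions, which traditional methods struggle to diagnose precisely and target. The key to the solution is EDGE, a four-stage framework (Evaluate, Diagnose, Generate, Exercise). Its core elements are: ability and state estimation based on Item Response Theory (IRT) and Bayesian state space models; cognitive diagnosis that identifies specific misconceptions from distractor patterns and response latencies; contrastive generation of counterfactual items, which apply minimal perturbations that invalidate learner shortcuts; and index-based retrieval scheduling (a restless bandit approximation) for spaced retrieval. The paper also defines a composite readiness metric, EdgeScore, proves its monotonicity and Lipschitz continuity, derives an index policy that is near-optimal under mild assumptions on forgetting and learning gains, and establishes conditions under which counterfactual items provably reduce the posterior probability of a targeted misconception faster than standard practice.
Link: https://arxiv.org/abs/2508.07224
Authors: Ananda Prakash Verma
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: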
Abstract:We present EDGE, a general-purpose, misconception-aware adaptive learning framework composed of four stages: Evaluate (ability and state estimation), Diagnose (posterior inference of misconceptions), Generate (counterfactual item synthesis), and Exercise (index-based retrieval scheduling). EDGE unifies psychometrics (IRT/Bayesian state space models), cognitive diagnostics (misconception discovery from distractor patterns and response latencies), contrastive item generation (minimal perturbations that invalidate learner shortcuts while preserving psychometric validity), and principled scheduling (a restless bandit approximation to spaced retrieval). We formalize a composite readiness metric, EdgeScore, prove its monotonicity and Lipschitz continuity, and derive an index policy that is near-optimal under mild assumptions on forgetting and learning gains. We further establish conditions under which counterfactual items provably reduce the posterior probability of a targeted misconception faster than standard practice. The paper focuses on theory and implementable pseudocode; empirical study is left to future work.
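Two building blocks named in the abstract, IRT-based ability modeling and posterior inference of a misconception, can be sketched in a few lines. The code uses the standard two-parameter logistic (2PL) IRT model and a plain Bayes update; it does not reproduce EdgeScore or the paper's pseudocode, and the numeric likelihoods are assumptions for illustration.

```python
import math


def p_correct_2pl(theta, a, b):
    """2PL IRT: probability of a correct answer for ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def misconception_posterior(prior, p_choice_given_misc, p_choice_given_no_misc):
    """Bayes update of P(misconception) after observing a distractor choice."""
    num = prior * p_choice_given_misc
    den = num + (1.0 - prior) * p_choice_given_no_misc
    return num / den if den > 0 else prior


# A learner whose ability equals the item difficulty answers correctly half the time.
p = p_correct_2pl(theta=0.0, a=1.2, b=0.0)

# Assumed likelihoods: the chosen distractor is picked 70% of the time by learners
# holding the misconception and 10% of the time by those who do not.
post = misconception_posterior(prior=0.2, p_choice_given_misc=0.7,
                               p_choice_given_no_misc=0.1)
print(f"P(correct)={p:.2f}, P(misconception | choice)={post:.3f}")
```

A single diagnostic distractor moves the belief from 0.2 to roughly 0.64 under these assumed likelihoods, which is the kind of evidence the Diagnose stage is described as accumulating.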
[AI-85] Selection and Exploitation of High-Quality Knowledge from Large Language Models for Recommendation
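[Quick Read]: This paper addresses how to select and exploit high-quality knowledge from Large Language Models (LLMs) in recommender systems. LLM-generated knowledge often suffers from hallucination, content redundancy, and information homogenization, and feeding it directly into a recommendation model degrades performance. The key to the solution is the Knowledge Selection & Exploitation Recommendation (KSER) framework, which has two core modules. The knowledge filtering module uses an Embedding Selection Filter Network (ESFNet) to assign adaptive weights to knowledge chunks across different knowledge fields, selecting high-quality knowledge. The embedding space alignment module uses an attention-based architecture to align the semantic embeddings from LLMs with the feature space used to train the recommendation model, making knowledge transfer more effective. The paper also proposes two training strategies, all-parameters training and extractor-only training; the latter offers a new paradigm for knowledge-augmented recommendation.
Link: https://arxiv.org/abs/2508.07223
Authors: Guanchen Wang, Mingming Ha, Tianbao Ma, Linxun Chen, Zhaojie Liu, Guorui Zhou, Kun Gai
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: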
Abstract:In recent years, there has been growing interest in leveraging the impressive generalization capabilities and reasoning ability of large language models (LLMs) to improve the performance of recommenders. With this operation, recommenders can access and learn the additional world knowledge and reasoning information via LLMs. However, in general, for different users and items, the world knowledge derived from LLMs suffers from issues of hallucination, content redundancy, and information homogenization. Directly feeding the generated response embeddings into the recommendation model can lead to unavoidable performance deterioration. To address these challenges, we propose a Knowledge Selection & Exploitation Recommendation (KSER) framework, which effectively selects and extracts the high-quality knowledge from LLMs. The framework consists of two key components: a knowledge filtering module and an embedding space alignment module. In the knowledge filtering module, an Embedding Selection Filter Network (ESFNet) is designed to assign adaptive weights to different knowledge chunks in different knowledge fields. In the space alignment module, an attention-based architecture is proposed to align the semantic embeddings from LLMs with the feature space used to train the recommendation models. In addition, two training strategies, all-parameters training and extractor-only training, are proposed to flexibly adapt to different downstream tasks and application scenarios, where the extractor-only training strategy offers a novel perspective on knowledge-augmented recommendation. Experimental results validate the necessity and effectiveness of both the knowledge filtering and alignment modules, and further demonstrate the efficiency and effectiveness of the extractor-only training strategy.
[AI-86] LLM-based Agents for Automated Confounder Discovery and Subgroup Analysis in Causal Inference
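[Quick Read]: This paper addresses unmeasured confounding and structural bias when estimating Individualized Treatment Effects (ITE) from observational data. In complex real-world settings, Causal Machine Learning (causal ML) methods such as causal trees and doubly robust estimators depend on domain experts to identify confounders, which brings high annotation cost and poor scalability. The key to the solution is to integrate Large Language Model (LLM)-based agents into the causal ML pipeline to simulate domain-expert reasoning, automating confounder discovery and subgroup analysis. This reduces human involvement while preserving interpretability. Experiments show more robust treatment effect estimates, with narrower confidence intervals and previously unrecognized confounding biases uncovered.
Link: https://arxiv.org/abs/2508.07221
Authors: Po-Han Lee, Yu-Cheng Lin, Chan-Tung Ku, Chan Hsu, Pei-Cing Huang, Ping-Hsun Wu, Yihuang Kang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Applications (stat.AP); Methodology (stat.ME)
Comments: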
Abstract:Estimating individualized treatment effects from observational data presents a persistent challenge due to unmeasured confounding and structural bias. Causal Machine Learning (causal ML) methods, such as causal trees and doubly robust estimators, provide tools for estimating conditional average treatment effects. These methods have limited effectiveness in complex real-world environments due to the presence of latent confounders or those described in unstructured formats. Moreover, reliance on domain experts for confounder identification and rule interpretation introduces high annotation cost and scalability concerns. In this work, we proposed Large Language Model-based agents for automated confounder discovery and subgroup analysis that integrate agents into the causal ML pipeline to simulate domain expertise. Our framework systematically performs subgroup identification and confounding structure discovery by leveraging the reasoning capabilities of LLM-based agents, which reduces human dependency while preserving interpretability. Experiments on real-world medical datasets show that our proposed approach enhances treatment effect estimation robustness by narrowing confidence intervals and uncovering unrecognized confounding biases. Our findings suggest that LLM-based agents offer a promising path toward scalable, trustworthy, and semantically aware causal inference.
[AI-87] Neural Bridge Processes
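[Quick Read]: This paper addresses learning stochastic functions from partially observed context-target pairs. Gaussian Processes (GPs) scale poorly to large datasets and assume Gaussianity, while Neural Processes (NPs) struggle to model complex multi-modal target distributions. Neural Diffusion Processes (NDPs) gain expressivity through a learned diffusion process but rely solely on conditional signals in the denoising network, which leaves the input weakly coupled to the diffusion path and causes a semantic mismatch at the diffusion endpoint. The key to the solution is Neural Bridge Processes (NBPs), in which the inputs x act as dynamic anchors for the entire diffusion trajectory: the forward kernel is reformulated to depend explicitly on x, forcing the diffusion path to terminate strictly at the supervised target, which strengthens the gradient signal and guarantees endpoint coherence. This improves both performance and theoretical consistency on structured prediction tasks.
Link: https://arxiv.org/abs/2508.07220
Authors: Jian Xu, Yican Liu, Qibin Zhao, John Paisley, Delu Zeng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: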
Abstract:Learning stochastic functions from partially observed context-target pairs is a fundamental problem in probabilistic modeling. Traditional models like Gaussian Processes (GPs) face scalability issues with large datasets and assume Gaussianity, limiting their applicability. While Neural Processes (NPs) offer more flexibility, they struggle with capturing complex, multi-modal target distributions. Neural Diffusion Processes (NDPs) enhance expressivity through a learned diffusion process but rely solely on conditional signals in the denoising network, resulting in weak input coupling from an unconditional forward process and semantic mismatch at the diffusion endpoint. In this work, we propose Neural Bridge Processes (NBPs), a novel method for modeling stochastic functions where inputs x act as dynamic anchors for the entire diffusion trajectory. By reformulating the forward kernel to explicitly depend on x, NBP enforces a constrained path that strictly terminates at the supervised target. This approach not only provides stronger gradient signals but also guarantees endpoint coherence. We validate NBPs on synthetic data, EEG signal regression and image regression tasks, achieving substantial improvements over baselines. These results underscore the effectiveness of DDPM-style bridge sampling in enhancing both performance and theoretical consistency for structured prediction tasks.
[AI-88] What One Cannot Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains
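[Quick Read]: This paper addresses two questions. The first is whether a two-layer, single-head transformer can represent any conditional k-gram (i.e., any k-th order Markov process), which clarifies the theoretical boundary between model depth and In-Context Learning (ICL) capability. The second is how such a two-layer architecture gradually forms effective in-context representations during training. The key to the solution is a theoretical construction: a two-layer transformer with one attention head per layer can exactly represent any conditional k-gram, so two layers suffice where the best previously known constructions required three. The result gives the tightest known characterization of the interplay between transformer depth and Markov order for ICL and shows that even shallow architectures can exhibit strong ICL on structured sequence modeling tasks.
Link: https://arxiv.org/abs/2508.07208
Authors: Chanakya Ekbote, Marco Bondaschi, Nived Rajaraman, Jason D. Lee, Michael Gastpar, Ashok Vardhan Makkuva, Paul Pu Liang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: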
Abstract:In-context learning (ICL) is a hallmark capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context. Prior work has shown that ICL emerges in transformers due to the presence of special circuits called induction heads. Given the equivalence between induction heads and conditional k-grams, a recent line of work modeling sequential inputs as Markov processes has revealed the fundamental impact of model depth on its ICL capabilities: while a two-layer transformer can efficiently represent a conditional 1-gram model, its single-layer counterpart cannot solve the task unless it is exponentially large. However, for higher order Markov sources, the best known constructions require at least three layers (each with a single attention head) - leaving open the question: can a two-layer single-head transformer represent any kth-order Markov process? In this paper, we precisely address this and theoretically show that a two-layer transformer with one head per layer can indeed represent any conditional k-gram. Thus, our result provides the tightest known characterization of the interplay between transformer depth and Markov order for ICL. Building on this, we further analyze the learning dynamics of our two-layer construction, focusing on a simplified variant for first-order Markov chains, illustrating how effective in-context representations emerge during training. Together, these results deepen our current understanding of transformer-based ICL and illustrate how even shallow architectures can surprisingly exhibit strong ICL capabilities on structured sequence modeling tasks.
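For readers unfamiliar with the term, a conditional k-gram is the in-context estimator that predicts the next token by counting what followed earlier occurrences of the last k tokens. The sketch below computes it directly by counting; it illustrates the target function the transformer is shown to represent, not the transformer construction itself, and the example sequence is an assumption.

```python
from collections import Counter


def conditional_kgram(sequence, k):
    """Empirical next-token distribution given the last k tokens of the context."""
    if k < 0 or k > len(sequence):
        raise ValueError("k must be between 0 and len(sequence)")
    context = tuple(sequence[len(sequence) - k:])
    counts = Counter()
    # Scan every earlier position whose k-token window matches the current context.
    for i in range(len(sequence) - k):
        if tuple(sequence[i:i + k]) == context:
            counts[sequence[i + k]] += 1
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}


seq = list("abacabaca")
print(conditional_kgram(seq, 1))  # {'b': 0.5, 'c': 0.5}
print(conditional_kgram(seq, 2))  # {'b': 1.0}
```

With k = 1 the context "a" is ambiguous, while k = 2 ("ca") resolves the next token, which is why higher-order Markov sources need the model to attend to longer suffixes.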
[AI-89] Presburger Functional Synthesis: Complexity and Tractable Normal Forms KR2025
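[Quick Read]: This paper addresses Presburger Functional Synthesis (PFnS): given a relation between inputs and outputs specified as a Presburger arithmetic formula, automatically construct a function from inputs to outputs that satisfies the relation. The key to the solution is a special normal form, PSyNF, which guarantees that PFnS can be solved in polynomial time with polynomial-size output. The paper shows how to check and compile to PSyNF and gives conditions under which any other form that guarantees poly-time solvability of PFnS can be compiled to PSyNF in polynomial time, providing a theoretical foundation and a practical route to tractable PFnS.
Link: https://arxiv.org/abs/2508.07207
Authors: S. Akshay, A. R. Balasubramanian, Supratik Chakraborty, Georg Zetzsche
Affiliation: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: Full version of conference paper at KR 2025 (22nd International Conference on Principles of Knowledge Representation and Reasoning)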
Abstract:Given a relational specification between inputs and outputs as a logic formula, the problem of functional synthesis is to automatically synthesize a function from inputs to outputs satisfying the relation. Recently, a rich line of work has emerged tackling this problem for specifications in different theories, from Boolean to general first-order logic. In this paper, we launch an investigation of this problem for the theory of Presburger Arithmetic, that we call Presburger Functional Synthesis (PFnS). We show that PFnS can be solved in EXPTIME and provide a matching exponential lower bound. This is unlike the case for Boolean functional synthesis (BFnS), where only conditional exponential lower bounds are known. Further, we show that PFnS for one input and one output variable is as hard as BFnS in general. We then identify a special normal form, called PSyNF, for the specification formula that guarantees poly-time and poly-size solvability of PFnS. We prove several properties of PSyNF, including how to check and compile to this form, and conditions under which any other form that guarantees poly-time solvability of PFnS can be compiled in poly-time to PSyNF. Finally, we identify a syntactic normal form that is easier to check but is exponentially less succinct than PSyNF.
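A tiny instance shows what functional synthesis asks for. The specification below is a hypothetical example, not one taken from the paper: it relates an integer input x to an output y by 2y ≤ x < 2y + 2, and a function satisfying it is floor division by 2, which is expressible in Presburger arithmetic. The sketch only checks a candidate function against the relation; it does not implement any synthesis algorithm.

```python
def spec(x, y):
    """Relational specification phi(x, y): 2*y <= x < 2*y + 2."""
    return 2 * y <= x < 2 * y + 2


def synthesized(x):
    """Candidate function: floor division by the constant 2."""
    return x // 2


# Check the candidate on a finite range of inputs, including negatives.
all_ok = all(spec(x, synthesized(x)) for x in range(-50, 51))
print(all_ok)  # True
```

A finite check like this is only a sanity test. The point of PFnS is to produce such a function, together with a guarantee for all integers, directly from the formula.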
[AI-90] Can Smaller Large Language Models Evaluate Research Quality?
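[Quick Read]: This paper asks whether smaller Large Language Models (LLMs) can evaluate research quality comparably to larger ones. It systematically tests Google's downloadable Gemma-3-27b-it and finds that its scores correlate positively with an expert research quality score proxy in all 34 Units of Assessment of the UK Research Excellence Framework 2021, with correlations at 83.8% of the strength of ChatGPT 4o and 94.7% of the strength of ChatGPT 4o-mini. Research quality score estimation is therefore not an emergent property of the largest LLMs; smaller local models are also usable, particularly for cost saving or when secure offline processing is needed.
Link: https://arxiv.org/abs/2508.07196
Authors: Mike Thelwall
Affiliation: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments: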
Abstract:Although both Google Gemini (1.5 Flash) and ChatGPT (4o and 4o-mini) give research quality evaluation scores that correlate positively with expert scores in nearly all fields, and more strongly than citations in most, it is not known whether this is true for smaller Large Language Models (LLMs). In response, this article assesses Google’s Gemma-3-27b-it, a downloadable LLM (60Gb). The results for 104,187 articles show that Gemma-3-27b-it scores correlate positively with an expert research quality score proxy for all 34 Units of Assessment (broad fields) from the UK Research Excellence Framework 2021. The Gemma-3-27b-it correlations have 83.8% of the strength of ChatGPT 4o and 94.7% of the strength of ChatGPT 4o-mini correlations. Differently from the two larger LLMs, the Gemma-3-27b-it correlations do not increase substantially when the scores are averaged across five repetitions, its scores tend to be lower, and its reports are relatively uniform in style. Overall, the results show that research quality score estimation can be conducted by offline LLMs, so this capability is not an emergent property of the largest LLMs. Moreover, score improvement through repetition is not a universal feature of LLMs. In conclusion, although the largest LLMs still have the highest research evaluation score estimation capability, smaller ones can also be used for this task, and this can be helpful for cost saving or when secure offline processing is needed.
[AI-91] Multi-Dimensional Summarization Agents with Context-Aware Reasoning over Enterprise Tables
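[Quick Read]: This paper addresses the difficulty traditional table-to-text models have with multi-dimensional enterprise data summarization: they reason poorly across hierarchical structures and fail to capture context-aware deltas. The key to the solution is a multi-agent pipeline based on Large Language Models (LLMs), in which cooperating agents (a slicing agent, a variance detection agent, a context construction agent, and an LLM-based generation agent) extract, analyze, and summarize multi-dimensional data. The framework achieves 83% faithfulness to the underlying data, better coverage of significant changes, and high relevance scores (4.4/5) for decision-critical insights. Its advantage over existing approaches is most pronounced in cases with subtle trade-offs, for example revenue rising after a price increase while unit volumes decline.
Link: https://arxiv.org/abs/2508.07186
Authors: Amit Dhanda
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: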
Abstract:We propose a novel framework for summarizing structured enterprise data across multiple dimensions using large language model (LLM)-based agents. Traditional table-to-text models often lack the capacity to reason across hierarchical structures and context-aware deltas, which are essential in business reporting tasks. Our method introduces a multi-agent pipeline that extracts, analyzes, and summarizes multi-dimensional data using agents for slicing, variance detection, context construction, and LLM-based generation. Our results show that the proposed framework outperforms traditional approaches, achieving 83% faithfulness to underlying data, superior coverage of significant changes, and high relevance scores (4.4/5) for decision-critical insights. The improvements are especially pronounced in categories involving subtle trade-offs, such as increased revenue due to price changes amid declining unit volumes, which competing methods either overlook or address with limited specificity. We evaluate the framework on Kaggle datasets and demonstrate significant improvements in faithfulness, relevance, and insight quality over baseline table summarization approaches.
[AI-92] Explainability-in-Action: Enabling Expressive Manipulation and Tacit Understanding by Bending Diffusion Models in ComfyUI
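Quick read: This paper tackles the way large generative models used in creative work, such as text-to-image diffusion models, behave as black boxes and so weaken artists' engagement, their ability to modify the model, and sustained practice. The key to the solution is a craft-based approach to explainability that emphasizes long-term, hands-on interaction to understand a model's internal structure and treats the model as workable creative material. Concretely, the authors build a plugin for ComfyUI's node-based interface that supports interactive manipulation and inspection of a generative model's components, helping artists develop intuition for how each component shapes the output.
Link: https://arxiv.org/abs/2508.07183
Authors: Ahmed M. Abuzuraiq,Philippe Pasquier
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: In Proceedings of Explainable AI for the Arts Workshop 2025 (XAIxArts 2025) arXiv:2406.14485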
Abstract:Explainable AI (XAI) in creative contexts can go beyond transparency to support artistic engagement, modifiability, and sustained practice. While curated datasets and training human-scale models can offer artists greater agency and control, large-scale generative models like text-to-image diffusion systems often obscure these possibilities. We suggest that even large models can be treated as creative materials if their internal structure is exposed and manipulable. We propose a craft-based approach to explainability rooted in long-term, hands-on engagement akin to Schön’s “reflection-in-action” and demonstrate its application through a model-bending and inspection plugin integrated into the node-based interface of ComfyUI. We demonstrate that by interactively manipulating different parts of a generative model, artists can develop an intuition about how each component influences the output.
[AI-93] Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes
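Quick read: This paper addresses data contamination and insufficient test rigor in current benchmarks for evaluating large language models (LLMs) on real software development tasks, problems that limit how accurately such benchmarks reveal model capability. The key to the solution is CODE2BENCH, an end-to-end pipeline for dynamic benchmark construction with three core innovations: (1) automated dynamic updating, which periodically ingests recent GitHub code to minimize training-data contamination; (2) Scope Graph-based dependency analysis, which classifies functions by dependency level (Self-Contained, SC, versus Weakly Self-Contained, WSC, tasks) and supports cross-language evaluation and controlled dependency scenarios; and (3) Property-Based Testing (PBT), which automatically synthesizes high-coverage test suites for functional verification. The pipeline yields CODE2BENCH-2505, the first benchmark built from 880 recent Python projects, containing 1,163 tasks, and offers a scalable, contamination-resistant methodology for evaluating LLMs on real-world code generation.
Link: https://arxiv.org/abs/2508.07180
Authors: Zhe Zhang,Runlin Liu,Aishan Liu,Xingyu Liu,Xiang Gao,Hailong Sun
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: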
Abstract:As large language models (LLMs) become increasingly integrated into software development workflows, rigorously evaluating their performance on complex, real-world code generation tasks has become essential. However, existing benchmarks often suffer from data contamination and limited test rigor, constraining their ability to reveal model failures effectively. To address these, we present CODE2BENCH, an end-to-end pipeline for dynamically constructing robust and contamination-resistant benchmarks from real-world GitHub repositories. Specifically, CODE2BENCH introduces three key innovations: (1) Automated Dynamism, achieved through periodic ingestion of recent code to minimize training data contamination; (2) Scope Graph-based dependency analysis, which enables structured classification of functions into benchmark instances with controlled dependency levels (distinguishing between Self-Contained (SC) tasks for cross-language evaluation and Weakly Self-Contained (WSC) tasks involving permitted library usage); and (3) Property-Based Testing (PBT) for the automated synthesis of rigorous test suites to enable thorough functional verification. Using this pipeline, we construct CODE2BENCH-2505, the first benchmark derived from 880 recent Python projects spanning diverse domains, comprising 1,163 code generation tasks with 100% average branch coverage on ground-truth implementations. Extensive evaluation of 16 LLMs using CODE2BENCH-2505 reveals that models consistently struggle with SC tasks requiring complex, non-standard logic and cross-language transfer, while showing relatively stronger performance on WSC tasks in Python. Our work introduces a contamination-resistant, language-agnostic methodology for dynamic benchmark construction, offering a principled foundation for the comprehensive and realistic evaluation of LLMs on real-world software development tasks.
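To make the Property-Based Testing step concrete, here is a minimal standard-library sketch of the general idea: random inputs are generated and a candidate implementation is checked against a ground-truth one. It is not CODE2BENCH's test synthesizer. The deduplication task, the input generator, and every function name are hypothetical choices made for illustration.

```python
import random

def ground_truth_dedupe(items):
    """Reference implementation: drop repeats, keep first-occurrence order."""
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def candidate_dedupe(items):
    """A correct candidate, as an LLM might generate it."""
    return list(dict.fromkeys(items))

def buggy_dedupe(items):
    """An incorrect candidate: it loses the original ordering."""
    return sorted(set(items))

def property_test(candidate, reference, trials=500, seed=0):
    """Return the first random input on which the two functions disagree, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.randint(-5, 5) for _ in range(rng.randint(0, 20))]
        if candidate(list(data)) != reference(list(data)):
            return data
    return None

ok_counterexample = property_test(candidate_dedupe, ground_truth_dedupe)
bad_counterexample = property_test(buggy_dedupe, ground_truth_dedupe)
```

Here `ok_counterexample` stays `None` because the two implementations agree on every generated input, while `bad_counterexample` holds a list on which the order-losing candidate differs from the reference.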
[AI-94] Integrating Neurosymbolic AI in Advanced Air Mobility: A Comprehensive Survey IJCAI-2025
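Quick read: This paper addresses the complex regulatory, operational, and safety challenges facing Advanced Air Mobility (AAM). The core problem is that existing AI methods struggle to combine flexibility with interpretability, which limits their use in high-reliability aviation systems. The key to the solution is Neurosymbolic AI, which joins the adaptability of neural networks with the logical rigor of symbolic reasoning. Methods such as Neurosymbolic Reinforcement Learning are used for dynamic optimization, and the survey promotes reliable integration into key AAM scenarios such as demand forecasting, aircraft design, and real-time air traffic management, with the aim of improving transparency, robustness, and compliance with aviation standards.
Link: https://arxiv.org/abs/2508.07163
Authors: Kamal Acharya,Iman Sharifi,Mehul Lad,Liang Sun,Houbing Song
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 9 pages, 4 figures, IJCAI-2025 (accepted)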
Abstract:Neurosymbolic AI combines neural network adaptability with symbolic reasoning, promising an approach to address the complex regulatory, operational, and safety challenges in Advanced Air Mobility (AAM). This survey reviews its applications across key AAM domains such as demand forecasting, aircraft design, and real-time air traffic management. Our analysis reveals a fragmented research landscape where methodologies, including Neurosymbolic Reinforcement Learning, have shown potential for dynamic optimization but still face hurdles in scalability, robustness, and compliance with aviation standards. We classify current advancements, present relevant case studies, and outline future research directions aimed at integrating these approaches into reliable, transparent AAM systems. By linking advanced AI techniques with AAM’s operational demands, this work provides a concise roadmap for researchers and practitioners developing next-generation air mobility solutions.
[AI-95] SGD Convergence under Stepsize Shrinkage in Low-Precision Training
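Quick read: This paper studies the convergence problems that gradient quantization causes in low-precision training, namely how the magnitude shrinkage and additive noise introduced by quantization affect the convergence of stochastic gradient descent (SGD). The key to the solution is to model gradient quantization as gradient shrinkage: each stochastic gradient is scaled by a factor $ q_k \in (0,1] $ and perturbed by zero-mean quantization noise. The author proves that this shrinkage is equivalent to replacing the nominal stepsize $ \mu_k $ with the effective stepsize $ \mu_k q_k $, which exposes why low-precision training converges more slowly: the convergence rate is governed by the smallest shrinkage factor $ q_{\min} $, and quantization noise raises the asymptotic error floor. The analysis holds under standard smoothness and bounded-variance assumptions.
Link: https://arxiv.org/abs/2508.07142
Authors: Vincent-Daniel Yun
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Numerical Analysis (math.NA)
Comments: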
Abstract:Low-precision training has become essential for reducing the computational and memory costs of large-scale deep learning. However, quantization of gradients introduces both magnitude shrinkage and additive noise, which can alter the convergence behavior of stochastic gradient descent (SGD). In this work, we study the convergence of SGD under a gradient shrinkage model, where each stochastic gradient is scaled by a factor q_k \in (0,1] and perturbed by zero-mean quantization noise. We show that this shrinkage is equivalent to replacing the nominal stepsize \mu_k with an effective stepsize \mu_k q_k , which slows convergence when q_\min < 1 . Under standard smoothness and bounded-variance assumptions, we prove that low-precision SGD still converges, but at a reduced rate determined by q_\min , and with an increased asymptotic error floor due to quantization noise. We theoretically analyze how reduced numerical precision slows down training by modeling it as gradient shrinkage in the standard SGD convergence framework.
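The effective-stepsize statement can be checked numerically on a toy problem. The sketch below is an illustration under assumptions that are not taken from the paper: a one-dimensional objective f(x) = 0.5 * x**2, a constant shrinkage factor q, Gaussian noise standing in for quantization noise, and a hypothetical function name `sgd`.

```python
import random

def sgd(x0, mu, q, noise_std, steps, seed=0):
    """SGD on f(x) = 0.5 * x**2 with gradient shrinkage q and zero-mean additive noise."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        grad = x                               # exact gradient of 0.5 * x**2
        noise = rng.gauss(0.0, noise_std)      # stand-in for quantization noise
        x = x - mu * (q * grad + noise)        # shrunk, perturbed update
    return x

# Noise-free case: stepsize mu with shrinkage q behaves like stepsize mu * q.
shrunk = sgd(1.0, mu=0.1, q=0.5, noise_std=0.0, steps=50)
effective = sgd(1.0, mu=0.05, q=1.0, noise_std=0.0, steps=50)

# Full precision (q = 1) at the nominal stepsize gets closer to the optimum x = 0.
full_precision = sgd(1.0, mu=0.1, q=1.0, noise_std=0.0, steps=50)
```

With the noise switched off, `shrunk` and `effective` coincide, and `full_precision` ends nearer to zero, which matches the claim that shrinkage slows convergence when q < 1. Setting `noise_std` above zero leaves the iterate fluctuating around the optimum instead of settling on it, a toy picture of the error floor.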
[AI-96] A Real-Time Self-Tuning Moderator Framework for Adversarial Prompt Detection
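Quick read: This paper addresses adversarial attacks and jailbreaking against large language models (LLMs) in deployment, which can seriously compromise information security. Traditional defenses such as fine-tuning or classifier models adapt poorly to new attacks, degrade responses to benign prompts, and are hard to deploy at scale. The paper proposes a real-time, self-tuning (RTST) moderator framework whose core is a dynamic adjustment mechanism that responds quickly to new attacks while keeping training overhead low and intrusiveness minimal, improving robustness without significantly affecting performance on normal tasks.
Link: https://arxiv.org/abs/2508.07139
Authors: Ivan Zhang
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 10 pages, 1 figure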
Abstract:Ensuring LLM alignment is critical to information security as AI models become increasingly widespread and integrated in society. Unfortunately, many defenses against adversarial attacks and jailbreaking on LLMs cannot adapt quickly to new attacks, degrade model responses to benign prompts, or introduce significant barriers to scalable implementation. To mitigate these challenges, we introduce a real-time, self-tuning (RTST) moderator framework to defend against adversarial attacks while maintaining a lightweight training footprint. We empirically evaluate its effectiveness using Google’s Gemini models against modern, effective jailbreaks. Our results demonstrate the advantages of an adaptive, minimally intrusive framework for jailbreak defense over traditional fine-tuning or classifier models.
[AI-97] A Stable and Principled Loss Function for Direct Language Model Alignment
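Quick read: This paper addresses the theoretical inconsistency and training instability of Direct Preference Optimization (DPO), a method in the Reinforcement Learning from Human Feedback (RLHF) family, in particular the large gradients and reward hacking that arise when the probability of a dispreferred response approaches zero. The key to the solution is a new loss function derived directly from the RLHF optimality condition. Instead of encouraging unbounded maximization of the logits difference, it constrains that difference to a specific finite value determined by the underlying reward, giving more stable and more effective alignment.
Link: https://arxiv.org/abs/2508.07137
Authors: Yuandong Tan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: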
Abstract:The alignment of large language models (LLMs) with human preferences is commonly achieved through Reinforcement Learning from Human Feedback (RLHF). Direct Preference Optimization (DPO) simplified this paradigm by establishing a direct mapping between the optimal policy and a reward function, eliminating the need for an explicit reward model. However, we argue that the DPO loss function is theoretically misaligned with its own derivation, as it promotes the indefinite maximization of a logits difference, which can lead to training instability and reward hacking. In this paper, we propose a novel loss function derived directly from the RLHF optimality condition. Our proposed loss targets a specific, finite value for the logits difference, which is dictated by the underlying reward, rather than its maximization. We provide a theoretical analysis, including a gradient-based comparison, to demonstrate that our method avoids the large gradients that plague DPO when the probability of dispreferred responses approaches zero. This inherent stability prevents reward hacking and leads to more effective alignment. We validate our approach by fine-tuning a Qwen2.5-7B model, showing significant win-rate improvements over a standard DPO baseline and achieving competitive performance against larger models like Llama-3.1-8B.
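The abstract's point about "indefinite maximization" can be seen in the standard DPO objective, which for one preference pair is -log(sigmoid(beta * margin)). The sketch below evaluates only that standard loss. The paper's proposed loss is not specified in the abstract and is not reproduced here. The log-probability values and beta = 0.1 are arbitrary numbers chosen for illustration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair."""
    # Implicit reward margin between the preferred (w) and dispreferred (l) responses.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    z = beta * margin
    # Numerically stable softplus(-z) = log(1 + exp(-z)) = -log(sigmoid(z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Keep the preferred response fixed and push the dispreferred log-probability down.
dispreferred_logps = (-1.0, -10.0, -100.0, -1000.0)
losses = [dpo_loss(-1.0, lp, -1.0, -1.0) for lp in dispreferred_logps]
```

The values in `losses` keep shrinking toward zero without ever reaching it, so the loss always rewards a larger margin. That is the behaviour the authors contrast with a loss that targets a finite margin.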
[AI-98] “Draw me a curator” Examining the visual stereotyping of a cultural services profession by generative AI
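Quick read: This paper examines representational bias in how generative AI depicts museum curators: the generated images fail to reflect the real gender and ethnic diversity of the profession and instead reinforce stereotypes. The key contribution is to identify and expose systematic bias in the training data, especially the severe under-representation of women and non-Caucasian groups and the over-rendering of a young, fashion-styled professional image, as a warning that users should assess the accuracy and social impact of AI-generated content critically.
Link: https://arxiv.org/abs/2508.07132
Authors: Dirk HR Spennemann
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: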
Abstract:Based on 230 visualisations, this paper examines the depiction of museum curators by the popular generative Artificial Intelligence (AI) model, ChatGPT4o. While the AI-generated representations do not reiterate popular stereotypes of curators as nerdy, conservative in dress and stuck in time rummaging through collections, they contrast sharply with real-world demographics. AI-generated imagery extremely underrepresents women (3.5% vs 49% to 72% in reality) and disregards ethnic communities other than Caucasian (0% vs 18% to 36%). It not only over-represents young curators (79% vs approx. 27%) but also renders curators to resemble yuppie professionals or people featuring in fashion advertising. Stereotypical attributes are prevalent, with curators widely depicted as wearing beards and holding clipboards or digital tablets. The findings highlight biases in the generative AI image creation dataset, which is poised to shape an inaccurate portrayal of museum professionals if the images were to be taken uncritically at face value.
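[AI-99] Toward AI Matching Policies in Homeless Services: A Qualitative Study with Policymakers
Quick read: This paper asks whether and how practitioners accept and adopt data-driven AI algorithms for matching people experiencing homelessness to housing resources, and what the consequences are. The approach is to collect feedback through semi-structured interviews with 13 housing-services policymakers in the Los Angeles area and to identify the gains and risks they see in AI tools along the dimensions of efficiency, fairness, and transparency. Although there is no consensus on the exact design, policymakers are generally open to such a tool if it is designed thoughtfully and used alongside human decision-makers, which yields design considerations and directions for building responsible algorithmic systems in low-resource settings.
Link: https://arxiv.org/abs/2508.07129
Authors: Caroline M. Johnston,Olga Koumoundouros,Angel Hsing-Chi Hwang,Laura Onasch-Vera,Eric Rice,Phebe Vayanos
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 21 pages, 1 figure, 2 tables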
Abstract:Artificial intelligence researchers have proposed various data-driven algorithms to improve the processes that match individuals experiencing homelessness to scarce housing resources. It remains unclear whether and how these algorithms are received or adopted by practitioners and what their corresponding consequences are. Through semi-structured interviews with 13 policymakers in homeless services in Los Angeles, we investigate whether such change-makers are open to the idea of integrating AI into the housing resource matching process, identifying where they see potential gains and drawbacks from such a system in issues of efficiency, fairness, and transparency. Our qualitative analysis indicates that, even when aware of various complicating factors, policymakers welcome the idea of an AI matching tool if thoughtfully designed and used in tandem with human decision-makers. Though there is no consensus as to the exact design of such an AI system, insights from policymakers raise open questions and design considerations that can be enlightening for future researchers and practitioners who aim to build responsible algorithmic systems to support decision-making in low-resource scenarios.
[AI-100] Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning
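Quick read: This paper addresses the lack of an effective human-feedback mechanism in online reinforcement learning, especially when task objectives are hard to specify through dense reward functions and methods that rely on offline trajectory comparisons do not apply. The core challenge is that real-time scalar feedback is usually noisy and inconsistent, which limits the accuracy and generalization of the learned reward model. The key to the solution is Pref-GUIDE, a framework that converts raw scalar feedback into structured preference data. Pref-GUIDE Individual mitigates temporal inconsistency by comparing behaviors within short time windows and filtering ambiguous feedback, while Pref-GUIDE Voting aggregates reward models across a population of users to form consensus preferences and improve robustness. In three challenging environments the method outperforms scalar-feedback baselines, and the voting variant exceeds even expert-designed dense rewards.
Link: https://arxiv.org/abs/2508.07126
Authors: Zhengran Ji,Boyuan Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: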
Abstract:Training reinforcement learning agents with human feedback is crucial when task objectives are difficult to specify through dense reward functions. While prior methods rely on offline trajectory comparisons to elicit human preferences, such data is unavailable in online learning scenarios where agents must adapt on the fly. Recent approaches address this by collecting real-time scalar feedback to guide agent behavior and train reward models for continued learning after human feedback becomes unavailable. However, scalar feedback is often noisy and inconsistent, limiting the accuracy and generalization of learned rewards. We propose Pref-GUIDE, a framework that transforms real-time scalar feedback into preference-based data to improve reward model learning for continual policy training. Pref-GUIDE Individual mitigates temporal inconsistency by comparing agent behaviors within short windows and filtering ambiguous feedback. Pref-GUIDE Voting further enhances robustness by aggregating reward models across a population of users to form consensus preferences. Across three challenging environments, Pref-GUIDE significantly outperforms scalar-feedback baselines, with the voting variant exceeding even expert-designed dense rewards. By reframing scalar feedback as structured preferences with population feedback, Pref-GUIDE offers a scalable and principled approach for harnessing human input in online reinforcement learning.
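One way to read the "short windows plus ambiguity filtering" idea is sketched below. This is an interpretation for illustration, not the authors' algorithm. The non-overlapping windows, the `window` and `min_gap` parameters, and the function name `scalar_to_preferences` are all assumptions.

```python
from itertools import combinations

def scalar_to_preferences(feedback, window=4, min_gap=0.2):
    """Turn time-ordered (step_id, scalar) feedback into (preferred, rejected) pairs.

    Only steps inside the same short window are compared, and pairs whose
    feedback values differ by less than min_gap are dropped as ambiguous.
    """
    pairs = []
    for start in range(0, len(feedback), window):
        chunk = feedback[start:start + window]
        for (i, fi), (j, fj) in combinations(chunk, 2):
            if abs(fi - fj) < min_gap:
                continue                      # too close to call: filter out
            pairs.append((i, j) if fi > fj else (j, i))
    return pairs

# Hypothetical feedback stream: (step id, human scalar rating).
feedback = [(0, 0.9), (1, 0.1), (2, 0.15), (3, 0.8), (4, 0.5), (5, 0.55)]
prefs = scalar_to_preferences(feedback, window=4, min_gap=0.2)
```

For this stream, `prefs` is `[(0, 1), (0, 2), (3, 1), (3, 2)]`: near-tied pairs such as steps 0 and 3 are discarded, and steps 4 and 5 are never compared with the earlier window. Pairs like these could then feed a standard preference-based reward model.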
[AI-101] Designing a Feedback-Driven Decision Support System for Dynamic Student Intervention
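Quick read: This paper addresses the limited prediction accuracy caused by static machine learning models in education, which cannot adjust when new data such as post-intervention student outcomes become available. The core solution is a Feedback-Driven Decision Support System (DSS) with a closed-loop architecture for continuous model refinement. The key innovation is combining a LightGBM regressor with incremental retraining, so that updated results entered by educators automatically trigger a model update. Experiments show a 10.7% reduction in RMSE after retraining, SHAP provides explainability, and the work moves educational analytics from static prediction toward human-centered, responsive, adaptive AI.
Link: https://arxiv.org/abs/2508.07107
Authors: Timothy Oluwapelumi Adeyemi,Nadiah Fahad AlOtaibi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 10 pages, 1 figure, 3 tables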
Abstract:Accurate prediction of student performance is essential for timely academic intervention. However, most machine learning models in education are static and cannot adapt when new data, such as post-intervention outcomes, become available. To address this limitation, we propose a Feedback-Driven Decision Support System (DSS) with a closed-loop architecture that enables continuous model refinement. The system integrates a LightGBM-based regressor with incremental retraining, allowing educators to input updated student results, which automatically trigger model updates. This adaptive mechanism improves prediction accuracy by learning from real-world academic progress. The platform features a Flask-based web interface for real-time interaction and incorporates SHAP for explainability, ensuring transparency. Experimental results show a 10.7% reduction in RMSE after retraining, with consistent upward adjustments in predicted scores for intervened students. By transforming static predictors into self-improving systems, our approach advances educational analytics toward human-centered, data-driven, and responsive AI. The framework is designed for integration into LMS and institutional dashboards.
[AI-102] Hide or Highlight: Understanding the Impact of Factuality Expression on User Trust AAAI
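Quick read: This paper asks how different factuality disclosure strategies affect user trust in content generated by large language models (LLMs), particularly when the output contains factual errors. Compared with the transparent strategy (highlighting less factual content) and the attention strategy (highlighting factual content), the two strategies that hide potentially wrong information, opaque (removing less factual content) and ambiguity (making less factual content vague), produced higher user trust while maintaining perceived answer quality. The key finding is that deliberately hiding or softening possibly inaccurate content, rather than exposing its uncertainty, can increase trust in AI output without hurting perceived quality.
Link: https://arxiv.org/abs/2508.07095
Authors: Hyo Jin Do,Werner Geyer
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 17 pages, 3 figures, To be published in Proceedings of the 8th AAAI/ACM Conference on AI, Ethics, and Society (AIES 2025)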
Abstract:Large language models are known to produce outputs that are plausible but factually incorrect. To prevent people from making erroneous decisions by blindly trusting AI, researchers have explored various ways of communicating factuality estimates in AI-generated outputs to end-users. However, little is known about whether revealing content estimated to be factually incorrect influences users’ trust when compared to hiding it altogether. We tested four different ways of disclosing an AI-generated output with factuality assessments: transparent (highlights less factual content), attention (highlights factual content), opaque (removes less factual content), ambiguity (makes less factual content vague), and compared them with a baseline response without factuality information. We conducted a human subjects research (N = 148) using the strategies in question-answering scenarios. We found that the opaque and ambiguity strategies led to higher trust while maintaining perceived answer quality, compared to the other strategies. We discuss the efficacy of hiding presumably less factual content to build end-user trust.
[AI-103] An Evolutionary Game-Theoretic Merging Decision-Making Considering Social Acceptance for Autonomous Driving
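Quick read: This paper addresses the dynamic complexity and limited social acceptance that autonomous vehicles (AVs) face during highway on-ramp merging, where existing decision-making algorithms struggle to balance efficiency, comfort, and safety. The key to the solution is a merging decision framework based on Evolutionary Game Theory (EGT). Grounded in the bounded rationality of human drivers, it builds a game with a multi-objective payoff function that reflects human-like driving preferences. Solving the replicator dynamic equation for the Evolutionarily Stable Strategy (ESS) gives the optimal cut-in timing and balances the interests of AVs and main-road vehicles (MVs). A real-time driving-style estimation algorithm adjusts the payoff function online in response to the immediate reactions of MVs, improving efficiency, comfort, and safety.
Link: https://arxiv.org/abs/2508.07080
Authors: Haolin Liu,Zijun Guo,Yanbo Chen,Jiaqi Chen,Huilong Yu,Junqiang Xi
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: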
Abstract:Highway on-ramp merging is of great challenge for autonomous vehicles (AVs), since they have to proactively interact with surrounding vehicles to enter the main road safely within limited time. However, existing decision-making algorithms fail to adequately address dynamic complexities and social acceptance of AVs, leading to suboptimal or unsafe merging decisions. To address this, we propose an evolutionary game-theoretic (EGT) merging decision-making framework, grounded in the bounded rationality of human drivers, which dynamically balances the benefits of both AVs and main-road vehicles (MVs). We formulate the cut-in decision-making process as an EGT problem with a multi-objective payoff function that reflects human-like driving preferences. By solving the replicator dynamic equation for the evolutionarily stable strategy (ESS), the optimal cut-in timing is derived, balancing efficiency, comfort, and safety for both AVs and MVs. A real-time driving style estimation algorithm is proposed to adjust the game payoff function online by observing the immediate reactions of MVs. Empirical results demonstrate that we improve the efficiency, comfort and safety of both AVs and MVs compared with existing game-theoretic and traditional planning approaches across multi-object metrics.
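For readers unfamiliar with replicator dynamics, the sketch below integrates the textbook single-population, two-strategy replicator equation dx/dt = x (1 - x) (f_A(x) - f_B(x)) and shows the population share settling at an interior stable state. It is a generic illustration only. The payoff matrix is hypothetical and has no connection to the paper's multi-objective payoff function or its AV/MV formulation.

```python
def replicator_step(x, payoff, dt):
    """One explicit Euler step of the two-strategy replicator equation.

    x is the share of the population playing strategy A; payoff[i][j] is the
    payoff to row strategy i against column strategy j (index 0 = A, 1 = B).
    """
    f_a = payoff[0][0] * x + payoff[0][1] * (1.0 - x)   # expected payoff of A
    f_b = payoff[1][0] * x + payoff[1][1] * (1.0 - x)   # expected payoff of B
    return x + dt * x * (1.0 - x) * (f_a - f_b)

def simulate(x0, payoff, dt=0.01, steps=5000):
    x = x0
    for _ in range(steps):
        x = replicator_step(x, payoff, dt)
    return x

# Hypothetical anti-coordination payoffs: each strategy does better when rare.
payoff = [[0.0, 3.0],
          [1.0, 2.0]]
x_from_low = simulate(0.1, payoff)
x_from_high = simulate(0.9, payoff)
```

With these payoffs f_A - f_B = 1 - 2x, so both trajectories converge to x = 0.5, the mixed state where the two strategies earn equal payoff and small deviations are pushed back. That attracting rest point plays the role of the evolutionarily stable strategy in this toy game.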
[AI-104] Model Predictive Control for Crowd Navigation via Learning-Based Trajectory Prediction
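Quick read: This paper addresses safe navigation for autonomous robots in pedestrian-rich environments, where the core challenge is predicting pedestrian trajectories accurately and planning safe, smooth motion from those predictions. The key to the solution is integrating a deep learning-based Social-Implicit (SI) pedestrian trajectory predictor into a Model Predictive Control (MPC) framework, forming the SI-MPC system. By accounting for social interaction among pedestrians, the predictor reduces trajectory prediction error by up to 76% in low-density scenes and improves safety and motion smoothness in high-density scenes. The study also reveals important discrepancies between open-loop prediction metrics and closed-loop performance, underlining the need for system-level evaluation.
Link: https://arxiv.org/abs/2508.07079
Authors: Mohamed Parvez Aslam,Bojan Derajic,Mohamed-Khalil Bouzidi,Sebastian Bernhard,Jan Oliver Ringert
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: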
Abstract:Safe navigation in pedestrian-rich environments remains a key challenge for autonomous robots. This work evaluates the integration of a deep learning-based Social-Implicit (SI) pedestrian trajectory predictor within a Model Predictive Control (MPC) framework on the physical Continental Corriere robot. Tested across varied pedestrian densities, the SI-MPC system is compared to a traditional Constant Velocity (CV) model in both open-loop prediction and closed-loop navigation. Results show that SI improves trajectory prediction - reducing errors by up to 76% in low-density settings - and enhances safety and motion smoothness in crowded scenes. Moreover, real-world deployment reveals discrepancies between open-loop metrics and closed-loop performance, as the SI model yields broader, more cautious predictions. These findings emphasize the importance of system-level evaluation and highlight the SI-MPC framework’s promise for safer, more adaptive navigation in dynamic, human-populated environments.
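The Constant Velocity (CV) baseline and an open-loop error metric can be sketched in a few lines. This is a generic illustration: the abstract does not say which error metric the authors report, so the average displacement error used here is an assumption, as are the function names and the sample trajectories.

```python
import math

def predict_constant_velocity(history, horizon):
    """Extrapolate the last observed per-step displacement for `horizon` steps."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    vx, vy = x1 - x0, y1 - y0                 # displacement per time step
    return [(x1 + vx * k, y1 + vy * k) for k in range(1, horizon + 1)]

def average_displacement_error(predicted, actual):
    """Mean Euclidean distance between predicted and actual positions."""
    dists = [math.dist(p, a) for p, a in zip(predicted, actual)]
    return sum(dists) / len(dists)

# Hypothetical pedestrian track sampled at a fixed interval.
history = [(0.0, 0.0), (1.0, 0.5)]
prediction = predict_constant_velocity(history, horizon=3)

straight_truth = [(2.0, 1.0), (3.0, 1.5), (4.0, 2.0)]   # keeps walking straight
turning_truth = [(2.0, 1.0), (3.0, 2.0), (4.0, 3.0)]    # veers away

ade_straight = average_displacement_error(prediction, straight_truth)
ade_turning = average_displacement_error(prediction, turning_truth)
```

The CV model is exact for the straight walker and drifts for the one who turns, which is the kind of gap a learned, interaction-aware predictor is meant to close. As the abstract notes, a lower open-loop error does not by itself guarantee better closed-loop navigation.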
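[AI-105] Surgical Knowledge Rewrite in Compact LLMs: An Unlearn-then-Learn Strategy with (IA3) for Localized Factual Modulation and Catastrophic Forgetting Mitigation
Quick read: This paper addresses two core problems that large language models (LLMs) face during dynamic knowledge updates: resistance to adopting new facts, and catastrophic forgetting of unrelated knowledge caused by conflicting edits. The key to the solution is a two-stage “unlearn-then-learn” strategy that combines the parameter-efficient fine-tuning (PEFT) technique Infused Adapter by Inhibiting and Amplifying Inner Activations (IA^3) with an initial circuit localization phase that identifies and targets the internal components encoding the conflicting fact. This mechanistically informed intervention achieves 98.50% accuracy on the new fact and a 96.00% forget rate on the original fact, with much better localization (72.00% F_control accuracy) than direct fine-tuning (as low as ~20% F_control accuracy). It also reveals a “soft forgetting” mechanism in which the original knowledge is suppressed but remains conditionally accessible, which the authors argue enhances model safety and control.
Link: https://arxiv.org/abs/2508.07075
Authors: Stanley Ngugi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 visual aids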
Abstract:Large Language Models (LLMs) struggle with dynamic knowledge updates, especially when new information conflicts with deeply embedded facts. Such conflicting factual edits often lead to two critical issues: resistance to adopting the new fact and severe catastrophic forgetting of unrelated knowledge. This paper introduces and evaluates a novel “unlearn-then-learn” strategy for precise knowledge editing in LLMs, leveraging the parameter-efficient fine-tuning (PEFT) technique, Infused Adapter by Inhibiting and Amplifying Inner Activations ( IA^3 ). Crucially, this two-stage approach is powered by an initial circuit localization phase that identifies and targets the specific internal components responsible for encoding the conflicting fact. Through a rigorous experimental methodology on microsoft/Phi-3-mini-4k-instruct, we demonstrate that this mechanistically informed two-stage approach achieves near-perfect accuracy (98.50%) for the new, modulated fact while simultaneously effectively suppressing the original conflicting fact (96.00% forget rate). Critically, our strategy exhibits unprecedented localization (72.00% F_control accuracy), dramatically mitigating catastrophic forgetting observed in direct fine-tuning approaches (which showed as low as ~20% F_control accuracy), a direct benefit of our targeted interpretability-guided intervention. Furthermore, qualitative analysis reveals a nuanced mechanism of “soft forgetting,” where original knowledge is suppressed from default retrieval but remains latent and conditionally accessible, enhancing model safety and control. These findings represent a significant advancement towards precise, localized, and safe knowledge management in compact LLMs.
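[AI-106] Towards Safer AI Moderation: Evaluating LLM Moderators Through a Unified Benchmark Dataset and Advocating a Human-First Approach
Quick read: This paper addresses the limitations of large language models (LLMs) in content moderation, especially their weak performance on tasks that need nuanced moral judgment, such as detecting implicit hate speech, offensive language, and gender and racial bias. It argues that bias in training data and limited contextual understanding often lead to inconsistent output and ethical risk. The key to the solution is a unified benchmark dataset covering 49 categories of emotion, offensive text, and bias, together with SafePhi, a QLoRA fine-tuned Phi-4 model that reaches a Macro F1 score of 0.89 on this dataset, ahead of OpenAI Moderator (0.77) and Llama Guard (0.74). The study also stresses the need for more diverse and representative data combined with human-in-the-loop feedback to improve robustness and explainability.
Link: https://arxiv.org/abs/2508.07063
Authors: Naseem Machlovi,Maryam Saleki,Innocent Ababio,Ruhul Amin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: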
Abstract:As AI systems become more integrated into daily life, the need for safer and more reliable moderation has never been greater. Large Language Models (LLMs) have demonstrated remarkable capabilities, surpassing earlier models in complexity and performance. Their evaluation across diverse tasks has consistently showcased their potential, enabling the development of adaptive and personalized agents. However, despite these advancements, LLMs remain prone to errors, particularly in areas requiring nuanced moral reasoning. They struggle with detecting implicit hate, offensive language, and gender biases due to the subjective and context-dependent nature of these issues. Moreover, their reliance on training data can inadvertently reinforce societal biases, leading to inconsistencies and ethical concerns in their outputs. To explore the limitations of LLMs in this role, we developed an experimental framework based on state-of-the-art (SOTA) models to assess human emotions and offensive behaviors. The framework introduces a unified benchmark dataset encompassing 49 distinct categories spanning the wide spectrum of human emotions, offensive and hateful text, and gender and racial biases. Furthermore, we introduced SafePhi, a QLoRA fine-tuned version of Phi-4, adapting diverse ethical contexts and outperforming benchmark moderators by achieving a Macro F1 score of 0.89, where OpenAI Moderator and Llama Guard score 0.77 and 0.74, respectively. This research also highlights the critical domains where LLM moderators consistently underperformed, pressing the need to incorporate more heterogeneous and representative data with human-in-the-loop, for better model robustness and explainability.
[AI-107] Membership and Memorization in LLM Knowledge Distillation
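Quick read: This paper addresses privacy leakage in Knowledge Distillation (KD) for large language models (LLMs), in particular the membership and memorization privacy risks a student model may inherit from a teacher trained on private data. The key to the approach is a systematic analysis of six mainstream LLM KD techniques across settings, covering teacher model families (GPT-2, LLAMA-2, OPT), student model sizes, task types, and KD objective functions. The analysis reveals significant disagreement between memorization and membership privacy risks and large variation in privacy risk across network blocks (per-block), providing empirical evidence for designing safer distillation methods.
Link: https://arxiv.org/abs/2508.07054
Authors: Ziqi Zhang,Ali Shahin Shamsabadi,Hanxiao Lu,Yifeng Cai,Hamed Haddadi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: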
Abstract:Recent advances in Knowledge Distillation (KD) aim to mitigate the high computational demands of Large Language Models (LLMs) by transferring knowledge from a large “teacher” to a smaller “student” model. However, students may inherit the teacher’s privacy when the teacher is trained on private data. In this work, we systematically characterize and investigate membership and memorization privacy risks inherent in six LLM KD techniques. Using instruction-tuning settings that span seven NLP tasks, together with three teacher model families (GPT-2, LLAMA-2, and OPT), and various size student models, we demonstrate that all existing LLM KD approaches carry membership and memorization privacy risks from the teacher to its students. However, the extent of privacy risks varies across different KD techniques. We systematically analyse how key LLM KD components (KD objective functions, student training data and NLP tasks) impact such privacy risks. We also demonstrate a significant disagreement between memorization and membership privacy risks of LLM KD techniques. Finally, we characterize per-block privacy risk and demonstrate that the privacy risk varies across different blocks by a large margin.
[AI-108] Whisfusion: Parallel ASR Decoding via a Diffusion Transformer
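Quick read: This paper addresses the high latency that the sequential generation of autoregressive (AR) decoders causes in automatic speech recognition (ASR), especially in latency-sensitive applications such as real-time captioning and meeting transcription. AR decoders perform well but generate tokens one at a time, while existing non-autoregressive (NAR) methods are limited in context modeling. The key to the solution is Whisfusion, the first framework to fuse a pre-trained Whisper encoder with a text diffusion decoder. This NAR architecture processes the entire acoustic context in parallel at every decoding step, removing the AR latency bottleneck. A lightweight cross-attention adapter, trained with parameter-efficient fine-tuning (PEFT), aligns the two modalities, and a batch-parallel multi-step decoding strategy improves accuracy with little added latency.
Link: https://arxiv.org/abs/2508.07048
Authors: Taeyoun Kwon,Junhyuk Ahn,Taegeun Yun,Heeju Jwa,Yoonchae Choi,Siwon Park,Nam-Joon Kim,Jangchan Kim,Hyun Gon Ryu,Hyuk-Jae Lee
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 9 figures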
Abstract:Fast Automatic Speech Recognition (ASR) is critical for latency-sensitive applications such as real-time captioning and meeting transcription. However, truly parallel ASR decoding remains challenging due to the sequential nature of autoregressive (AR) decoders and the context limitations of non-autoregressive (NAR) methods. While modern ASR encoders can process up to 30 seconds of audio at once, AR decoders still generate tokens sequentially, creating a latency bottleneck. We propose Whisfusion, the first framework to fuse a pre-trained Whisper encoder with a text diffusion decoder. This NAR architecture resolves the AR latency bottleneck by processing the entire acoustic context in parallel at every decoding step. A lightweight cross-attention adapter trained via parameter-efficient fine-tuning (PEFT) bridges the two modalities. We also introduce a batch-parallel, multi-step decoding strategy that improves accuracy by increasing the number of candidates with minimal impact on speed. Fine-tuned solely on LibriSpeech (960h), Whisfusion achieves a lower WER than Whisper-tiny (8.3% vs. 9.7%), and offers comparable latency on short audio. For longer utterances (20s), it is up to 2.6x faster than the AR baseline, establishing a new, efficient operating point for long-form ASR. The implementation and training scripts are available at this https URL.
[AI-109] Balancing Privacy and Efficiency: Music Information Retrieval via Additive Homomorphic Encryption
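[Quick Read]: This paper addresses the distinctive challenge of protecting music data privacy in the era of generative AI. Because music is temporal and multimodal, and is sampled, transformed, and remixed at scale, its core vector embeddings are highly susceptible to being learned, misused, or even stolen by models, and traditional copyright licensing and digital watermarking offer limited protection for such abstract mathematical representations. The key to the solution is a privacy-preserving vector similarity search method based on Additive Homomorphic Encryption (AHE), which computes inner products of music embeddings to retrieve similar items efficiently and securely. It incurs substantially lower overhead than Fully Homomorphic Encryption (FHE), and its efficiency and practicality are evaluated on real-world MP3 files.
Link: https://arxiv.org/abs/2508.07044
Authors: William Zerong Wang,Dongfang Zhao
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: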
Abstract:In the era of generative AI, ensuring the privacy of music data presents unique challenges: unlike static artworks such as images, music data is inherently temporal and multimodal, and it is sampled, transformed, and remixed at an unprecedented scale. These characteristics make its core vector embeddings, i.e., the numerical representations of the music, highly susceptible to being learned, misused, or even stolen by models without accessing the original audio files. Traditional methods like copyright licensing and digital watermarking offer limited protection for these abstract mathematical representations, thus necessitating a stronger, e.g., cryptographic, approach to safeguarding the embeddings themselves. Standard encryption schemes, such as AES, render data unintelligible for computation, making such searches impossible. While Fully Homomorphic Encryption (FHE) provides a plausible solution by allowing arbitrary computations on ciphertexts, its substantial performance overhead remains impractical for large-scale vector similarity searches. Given this trade-off, we propose a more practical approach using Additive Homomorphic Encryption (AHE) for vector similarity search. The primary contributions of this paper are threefold: we analyze threat models unique to music information retrieval systems; we provide a theoretical analysis and propose an efficient AHE-based solution through inner products of music embeddings to deliver privacy-preserving similarity search; and finally, we demonstrate the efficiency and practicality of the proposed approach through empirical evaluation and comparison to FHE schemes on real-world MP3 files.
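To make the idea of an inner product under additive homomorphic encryption concrete, here is a minimal sketch using a toy Paillier cryptosystem. It is an illustration under stated assumptions, not the paper's protocol: the abstract does not name the AHE scheme, so Paillier is swapped in as a well-known example; the split of roles (a client encrypts an integer-quantized query embedding, a server holds plaintext integer embeddings) is hypothetical; and the primes are far too small to be secure.

```python
import math
import random

# Toy Paillier cryptosystem (assumption: the paper's actual AHE scheme is not
# stated in the abstract). Tiny primes, for illustration only, NOT secure.

def keygen(p, q):
    """Return the public modulus n and the private key (lam, mu)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt a non-negative integer m < n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def encrypted_inner_product(n, enc_x, y):
    """Compute Enc(<x, y>) from the ciphertexts of x and a plaintext vector y.

    Multiplying ciphertexts adds plaintexts; raising a ciphertext to the
    power y_i multiplies its plaintext by y_i.
    """
    n2 = n * n
    acc = encrypt(n, 0)
    for c, y_i in zip(enc_x, y):
        acc = (acc * pow(c, y_i, n2)) % n2
    return acc

n, priv = keygen(1009, 1013)
query = [3, 1, 4]      # hypothetical quantized query embedding (client side)
stored = [2, 7, 1]     # hypothetical quantized stored embedding (server side)
enc_query = [encrypt(n, v) for v in query]
enc_score = encrypted_inner_product(n, enc_query, stored)
score = decrypt(n, priv, enc_score)
print(score)  # 3*2 + 1*7 + 4*1 = 17
```

The server in this sketch never sees the query in the clear and only performs modular multiplications and exponentiations, which is why additive schemes are typically much cheaper than FHE for this kind of similarity score.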
[AI-110] K-Dense Analyst: Towards Fully Automated Scientific Analysis
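[Quick Read]: This paper addresses the wide gap between data generation and scientific insight in modern bioinformatics analysis. Large Language Models (LLMs) show promise in scientific reasoning but remain fundamentally limited on real-world analytical workflows that require iterative computation, tool integration, and rigorous validation. The key to the solution is K-Dense Analyst, a hierarchical multi-agent system built on a dual-loop architecture, in which specialized agents decompose complex scientific objectives into tasks that are executed and verified in secure computational environments. On the BixBench benchmark the system reaches 29.2% accuracy, ahead of GPT-5 (22.9%) and well above Gemini 2.5 Pro used directly (18.3%), which the authors take as evidence that architectural innovation can overcome the limits of the underlying model and move toward autonomous computational biologists.
Link: https://arxiv.org/abs/2508.07043
Authors: Orion Li,Vinayak Agarwal,Summer Zhou,Ashwin Gopinath,Timothy Kassis
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM)
Comments: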
Abstract:The complexity of modern bioinformatics analysis has created a critical gap between data generation and developing scientific insights. While large language models (LLMs) have shown promise in scientific reasoning, they remain fundamentally limited when dealing with real-world analytical workflows that demand iterative computation, tool integration and rigorous validation. We introduce K-Dense Analyst, a hierarchical multi-agent system that achieves autonomous bioinformatics analysis through a dual-loop architecture. K-Dense Analyst, part of the broader K-Dense platform, couples planning with validated execution using specialized agents to decompose complex objectives into executable, verifiable tasks within secure computational environments. On BixBench, a comprehensive benchmark for open-ended biological analysis, K-Dense Analyst achieves 29.2% accuracy, surpassing the best-performing language model (GPT-5) by 6.3 percentage points, representing nearly 27% improvement over what is widely considered the most powerful LLM available. Remarkably, K-Dense Analyst achieves this performance using Gemini 2.5 Pro, which attains only 18.3% accuracy when used directly, demonstrating that our architectural innovations unlock capabilities far beyond the underlying model’s baseline performance. Our insights demonstrate that autonomous scientific reasoning requires more than enhanced language models; it demands purpose-built systems that can bridge the gap between high-level scientific objectives and low-level computational execution. These results represent a significant advance toward fully autonomous computational biologists capable of accelerating discovery across the life sciences.
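The abstract mixes absolute gains (percentage points) with relative gains (percent), so a short check of the arithmetic may help. The GPT-5 accuracy of 22.9% is not stated directly in the abstract; it is inferred here as 29.2 minus 6.3.

```python
# Accuracies on BixBench, in percent, as reported or implied by the abstract.
k_dense = 29.2
gemini_direct = 18.3
gpt5 = k_dense - 6.3  # implied: K-Dense Analyst leads GPT-5 by 6.3 points

abs_gain_vs_gpt5 = k_dense - gpt5                   # percentage points
rel_gain_vs_gpt5 = abs_gain_vs_gpt5 / gpt5 * 100    # percent of GPT-5's score

abs_gain_vs_gemini = k_dense - gemini_direct
rel_gain_vs_gemini = abs_gain_vs_gemini / gemini_direct * 100

print(f"vs GPT-5:  +{abs_gain_vs_gpt5:.1f} points, +{rel_gain_vs_gpt5:.1f}% relative")
print(f"vs Gemini: +{abs_gain_vs_gemini:.1f} points, +{rel_gain_vs_gemini:.1f}% relative")
```

The relative gain over GPT-5 works out to about 27.5%, which matches the abstract's "nearly 27%", and the gain over the same Gemini 2.5 Pro model used directly is about 10.9 points.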
[AI-111] From Imitation to Optimization: A Comparative Study of Offline Learning for Autonomous Driving
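[Quick Read]: This paper addresses learning robust driving policies from large-scale real-world data, and in particular the brittleness of Behavioral Cloning (BC), whose errors compound in closed-loop execution. The key to the solution is to apply Offline Reinforcement Learning (Offline RL), specifically Conservative Q-Learning (CQL), to the same static expert data, Transformer architecture, and structured entity-centric state representation used for BC. With a carefully engineered reward function, the CQL agent learns a conservative value function that lets it recover from minor errors and avoid out-of-distribution states. On 1,000 unseen scenarios from the Waymo Open Motion Dataset it achieves a 3.2x higher success rate and a 7.4x lower collision rate than the strongest BC baseline, supporting the importance of offline RL for long-horizon driving.
Link: https://arxiv.org/abs/2508.07029
Authors: Antonio Guillen-Perez
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: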
Abstract:Learning robust driving policies from large-scale, real-world datasets is a central challenge in autonomous driving, as online data collection is often unsafe and impractical. While Behavioral Cloning (BC) offers a straightforward approach to imitation learning, policies trained with BC are notoriously brittle and suffer from compounding errors in closed-loop execution. This work presents a comprehensive pipeline and a comparative study to address this limitation. We first develop a series of increasingly sophisticated BC baselines, culminating in a Transformer-based model that operates on a structured, entity-centric state representation. While this model achieves low imitation loss, we show that it still fails in long-horizon simulations. We then demonstrate that by applying a state-of-the-art Offline Reinforcement Learning algorithm, Conservative Q-Learning (CQL), to the same data and architecture, we can learn a significantly more robust policy. Using a carefully engineered reward function, the CQL agent learns a conservative value function that enables it to recover from minor errors and avoid out-of-distribution states. In a large-scale evaluation on 1,000 unseen scenarios from the Waymo Open Motion Dataset, our final CQL agent achieves a 3.2x higher success rate and a 7.4x lower collision rate than the strongest BC baseline, proving that an offline RL approach is critical for learning robust, long-horizon driving policies from static expert data.
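The "conservative value function" can be illustrated with the regularizer from the standard discrete-action form of CQL. This is a sketch under assumptions: the abstract does not say whether the paper's action space is discrete, and the Q-values, `alpha`, and TD target below are made-up numbers for illustration.

```python
import math

def cql_penalty(q_values, data_action):
    """logsumexp over all actions minus the Q-value of the dataset action.

    The penalty is large when some action not taken in the data has an
    inflated Q-value, which pushes the learner to lower such estimates.
    """
    m = max(q_values)
    lse = m + math.log(sum(math.exp(q - m) for q in q_values))
    return lse - q_values[data_action]

def cql_loss(q_values, data_action, td_target, alpha=1.0):
    """Squared TD error on the dataset action plus the weighted CQL penalty."""
    td_error = q_values[data_action] - td_target
    return 0.5 * td_error ** 2 + alpha * cql_penalty(q_values, data_action)

# Hypothetical Q-values for three actions; the expert took action 0.
overestimated = [1.0, 5.0, 0.5]   # an unseen action looks far too good
well_behaved = [1.0, 1.0, 0.5]

penalty_over = cql_penalty(overestimated, data_action=0)
penalty_ok = cql_penalty(well_behaved, data_action=0)
print(round(penalty_over, 3), round(penalty_ok, 3))
```

Because the log-sum-exp is never smaller than the largest Q-value, the penalty is non-negative and grows with the gap between the best-looking action and the action actually observed in the data.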
[AI-112] Making Effective Decisions: Machine Learning and the Ecogame in 1970
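[Quick Read]: This paper asks how an art practice built on early digital technology embodied human-machine collaboration and systemic feedback, and what human-centred model it offers for contemporary AI-driven art. Its subject is Ecogame, an interactive art project of 1970 that used simulation and early machine learning techniques over a live network. Ecogame combined visual art with the cybernetic concepts of adaptation, feedback, and control, and emphasised that individual behaviour had implications for the total system. It thus provides a historical precedent for AI art, showing that technology can be designed as a tool for participation and democratisation.
Link: https://arxiv.org/abs/2508.07027
Authors: Catherine Mason
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: In Proceedings of Explainable AI for the Arts Workshop 2025 (XAIxArts 2025) arXiv:2406.14485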
Abstract:This paper considers Ecogame, an innovative art project of 1970, whose creators believed in a positive vision of a technological future; an understanding, posited on cybernetics, of a future that could be participatory via digital means, and therefore more democratised. Using simulation and early machine learning techniques over a live network, Ecogame combined the power of visual art with cybernetic concepts of adaptation, feedback, and control to propose that behaviour had implications for the total system. It provides an historical precedent for contemporary AI-driven art about using AI in a more human-centred way.
[AI-113] Efficient and Reliable Hitting-Set Computations for the Implicit Hitting Set Approach
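[Quick Read]: This paper addresses the trade-off between efficiency and reliability in the hitting set (HS) optimization step of the Implicit Hitting Set (IHS) framework. IHS is a general approach to hard combinatorial optimization problems that iterates between a decision oracle, which extracts sources of inconsistency, and an optimizer, which computes hitting sets over those sources. The optimizer is usually instantiated with Integer Programming (IP), which can cause correctness issues due to numerical instability. The key to the solution is to explore alternative techniques based on Pseudo-Boolean (PB) reasoning and stochastic local search. In particular, PB reasoning yields verifiable HS computations that remain competitive in efficiency while producing certificates of correctness, applicable to any IHS instantiation whose reasoning can be captured in the PB-based proof format.
Link: https://arxiv.org/abs/2508.07015
Authors: Hannes Ihalainen,Dieter Vandesande,André Schidler,Jeremias Berg,Bart Bogaerts,Matti Järvisalo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: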
Abstract:The implicit hitting set (IHS) approach offers a general framework for solving computationally hard combinatorial optimization problems declaratively. IHS iterates between a decision oracle used for extracting sources of inconsistency and an optimizer for computing so-called hitting sets (HSs) over the accumulated sources of inconsistency. While the decision oracle is language-specific, the optimizer is usually instantiated through integer programming. We explore alternative algorithmic techniques for hitting set optimization based on different ways of employing pseudo-Boolean (PB) reasoning as well as stochastic local search. We extensively evaluate the practical feasibility of the alternatives in particular in the context of pseudo-Boolean (0-1 IP) optimization as one of the most recent instantiations of IHS. Highlighting a trade-off between efficiency and reliability, while a commercial IP solver turns out to remain the most effective way to instantiate HS computations, it can cause correctness issues due to numerical instability; in fact, we show that exact HS computations instantiated via PB reasoning can be made competitive with a numerically exact IP solver. Furthermore, the use of PB reasoning as a basis for HS computations allows for obtaining certificates for the correctness of IHS computations, generally applicable to any IHS instantiation in which reasoning in the declarative language at hand can be captured in the PB-based proof format we employ.
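The oracle/optimizer loop described in the abstract can be sketched on a toy problem. Everything below is an illustrative assumption rather than the paper's setup: the instance is minimum-weight vertex cover, the "decision oracle" simply looks for an uncovered edge and returns it as a source of inconsistency, and the hitting-set optimizer is brute-force enumeration standing in for the IP, PB, or local-search solvers the paper compares.

```python
from itertools import combinations

def min_cost_hitting_set(cores, cost):
    """Brute-force optimizer: cheapest set intersecting every core."""
    elems = sorted({e for core in cores for e in core})
    best, best_cost = set(), float("inf")
    for r in range(len(elems) + 1):
        for cand in combinations(elems, r):
            chosen = set(cand)
            if all(chosen & core for core in cores):
                total = sum(cost[e] for e in chosen)
                if total < best_cost:
                    best, best_cost = chosen, total
    return best

def oracle(chosen, edges):
    """Toy decision oracle: return an edge left uncovered by `chosen`, or None."""
    for u, v in edges:
        if u not in chosen and v not in chosen:
            return {u, v}
    return None

def implicit_hitting_set(edges, cost):
    """Alternate between the optimizer and the oracle until no core remains."""
    cores = []
    while True:
        hs = min_cost_hitting_set(cores, cost)
        core = oracle(hs, edges)
        if core is None:
            return hs, sum(cost[e] for e in hs), len(cores)
        cores.append(core)

edges = [("a", "b"), ("b", "c"), ("c", "d")]
cost = {"a": 1, "b": 3, "c": 1, "d": 1}
solution, total_cost, num_cores = implicit_hitting_set(edges, cost)
print(sorted(solution), total_cost, num_cores)  # ['a', 'c'] 2 2
```

On this instance the loop terminates after extracting only two of the three edges as cores, which is the point of the implicit approach: the optimizer only ever sees the inconsistencies found so far. The correctness of the final answer rests entirely on the optimizer returning a true optimum each round, which is where the numerical-stability concern raised in the abstract applies.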
[AI-114] Neural Channel Knowledge Map Assisted Scheduling Optimization of Active IRSs in Multi-User Systems
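[Quick Read]: This paper addresses two core challenges for Intelligent Reflecting Surface (IRS) systems in next-generation wireless networks: severe double-pathloss caused by hardware constraints, and the sharp growth in multi-user scheduling complexity and channel state acquisition overhead as user density and channel dimensions increase. The key to the solution is a scheduling framework based on a neural Channel Knowledge Map (CKM), built from two cascaded deep neural networks: LPS-Net predicts link power statistics (LPS), and SE-Net estimates ergodic spectral efficiency (SE), so that position-dependent performance can be learned and predicted from historical channel/throughput measurements. The paper further proposes a low-complexity Stable Matching-Iterative Balancing (SM-IB) scheduling algorithm that achieves near-optimal max-min throughput at greatly reduced computational complexity.
Link: https://arxiv.org/abs/2508.07009
Authors: Xintong Chen,Zhenyu Jiang,Jiangbin Lyu,Liqun Fu
Affiliation: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Propose Neural Channel Knowledge Map for multi-user scheduling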
Abstract:Intelligent Reflecting Surfaces (IRSs) have potential for significant performance gains in next-generation wireless networks but face key challenges, notably severe double-pathloss and complex multi-user scheduling due to hardware constraints. Active IRSs partially address pathloss but still require efficient scheduling in cell-level multi-IRS multi-user systems, whereby the overhead/delay of channel state acquisition and the scheduling complexity both rise dramatically as the user density and channel dimensions increase. Motivated by these challenges, this paper proposes a novel scheduling framework based on neural Channel Knowledge Map (CKM), designing Transformer-based deep neural networks (DNNs) to predict ergodic spectral efficiency (SE) from historical channel/throughput measurements tagged with user positions. Specifically, two cascaded networks, LPS-Net and SE-Net, are designed to predict link power statistics (LPS) and ergodic SE accurately. We further propose a low-complexity Stable Matching-Iterative Balancing (SM-IB) scheduling algorithm. Numerical evaluations verify that the proposed neural CKM significantly enhances prediction accuracy and computational efficiency, while the SM-IB algorithm effectively achieves near-optimal max-min throughput with greatly reduced complexity.
[AI-115] Consensus-based Decentralized Multi-agent Reinforcement Learning for Random Access Network Optimization
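[Quick Read]: This paper addresses the design of random access (RA) medium access control (MAC) protocols for wireless devices, namely how to minimize collisions and ensure transmission fairness under unpredictable data traffic from multiple terminals. Existing Multi-Agent Reinforcement Learning (MARL) methods based on Centralized Training with Decentralized Execution (CTDE) depend on centralized training and costly information collection, which makes them hard to deploy in practice. The paper proposes a fully decentralized MARL architecture whose key idea is consensus-based exchange of local information across devices: only local rewards are exchanged, which reduces communication overhead. The learning algorithm is designed over an Actor-Critic (AC) network and comes with a theoretical proof of global convergence, and it significantly improves RA network performance.
Link: https://arxiv.org/abs/2508.07001
Authors: Myeung Suk Oh,Zhiyao Zhang,FNU Hairi,Alvaro Velasquez,Jia Liu
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This paper has been accepted in ACM International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing (MobiHoc) 2025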
Abstract:With wireless devices increasingly forming a unified smart network for seamless, user-friendly operations, random access (RA) medium access control (MAC) design is considered a key solution for handling unpredictable data traffic from multiple terminals. However, it remains challenging to design an effective RA-based MAC protocol to minimize collisions and ensure transmission fairness across the devices. While existing multi-agent reinforcement learning (MARL) approaches with centralized training and decentralized execution (CTDE) have been proposed to optimize RA performance, their reliance on centralized training and the significant overhead required for information collection can make real-world applications unrealistic. In this work, we adopt a fully decentralized MARL architecture, where policy learning does not rely on centralized tasks but leverages consensus-based information exchanges across devices. We design our MARL algorithm over an actor-critic (AC) network and propose exchanging only local rewards to minimize communication overhead. Furthermore, we provide a theoretical proof of global convergence for our approach. Numerical experiments show that our proposed MARL algorithm can significantly improve RA network performance compared to other baselines.
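As a rough illustration of consensus-based exchange of local rewards, the sketch below runs plain average consensus on a ring of four devices. It is a generic textbook scheme under assumptions, not the paper's algorithm: the topology, the equal-weight mixing rule, and the reward values are all hypothetical, and the actor-critic updates that would consume the agreed-upon reward are omitted.

```python
def consensus_step(values, neighbors):
    """Each device replaces its value with the mean of itself and its neighbors."""
    updated = []
    for i, own in enumerate(values):
        group = [own] + [values[j] for j in neighbors[i]]
        updated.append(sum(group) / len(group))
    return updated

# Hypothetical ring of four devices; every device has exactly two neighbors,
# so equal-weight averaging preserves the network-wide mean.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
local_rewards = [1.0, 0.0, 0.0, 3.0]
global_mean = sum(local_rewards) / len(local_rewards)

estimates = list(local_rewards)
for _ in range(50):
    estimates = consensus_step(estimates, neighbors)

print(estimates, global_mean)
```

After enough rounds every device holds (approximately) the team-average reward using only neighbor-to-neighbor messages, which is the kind of quantity a decentralized critic could use in place of a centrally collected global reward.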
[AI-116] Conformal Set-based Human-AI Complementarity with Multiple Experts AAMAS2025
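[Quick Read]: This paper addresses how to choose the best subset of experts to improve classification performance when multiple experts collaborate, focusing on instance-specific expert selection in decision support systems based on conformal prediction sets. Existing work typically assumes a single expert. The paper proposes a greedy subset selection strategy that uses conformal prediction sets to identify, for each instance, the most relevant subset of experts, and it outperforms random or naive selection. The key is to analyse the prediction confidence and coverage of different experts on different instances and to select dynamically the expert combination most likely to improve accuracy. On the CIFAR-10H and ImageNet-16H datasets the method achieves near-optimal subsets and improves human-AI classification performance.
Link: https://arxiv.org/abs/2508.06997
Authors: Helbert Paat,Guohao Shen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: Accepted at AAMAS 2025. Code available at: this https URL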
Abstract:Decision support systems are designed to assist human experts in classification tasks by providing conformal prediction sets derived from a pre-trained model. This human-AI collaboration has demonstrated enhanced classification performance compared to using either the model or the expert independently. In this study, we focus on the selection of instance-specific experts from a pool of multiple human experts, contrasting it with existing research that typically focuses on single-expert scenarios. We characterize the conditions under which multiple experts can benefit from the conformal sets. With the insight that only certain experts may be relevant for each instance, we explore the problem of subset selection and introduce a greedy algorithm that utilizes conformal sets to identify the subset of expert predictions that will be used in classifying an instance. This approach is shown to yield better performance compared to naive methods for human subset selection. Based on real expert predictions from the CIFAR-10H and ImageNet-16H datasets, our simulation study indicates that our proposed greedy algorithm achieves near-optimal subsets, resulting in improved classification performance among multiple experts.
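For readers unfamiliar with conformal prediction sets, here is a minimal split-conformal sketch followed by a deliberately simple way of using the set to filter expert votes. The filter-and-vote rule is a hypothetical stand-in and is not the paper's greedy algorithm, whose details the abstract does not give; the calibration probabilities and expert names are invented for illustration.

```python
import math
from collections import Counter

def conformal_threshold(true_class_probs, alpha):
    """Split-conformal quantile of the scores 1 - p(true label)."""
    scores = sorted(1.0 - p for p in true_class_probs)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: include every label
    return scores[k - 1]

def prediction_set(probs, qhat):
    """All labels whose nonconformity score does not exceed the threshold."""
    return {label for label, p in enumerate(probs) if 1.0 - p <= qhat}

def vote_within_set(expert_labels, allowed):
    """Hypothetical rule: majority vote over experts whose label is in the set."""
    kept = {name: lab for name, lab in expert_labels.items() if lab in allowed}
    if not kept:
        return None, kept
    winner, _ = Counter(kept.values()).most_common(1)[0]
    return winner, kept

# Model probability assigned to the true label on nine calibration examples.
calibration = [0.9, 0.8, 0.7, 0.95, 0.6, 0.85, 0.3, 0.75, 0.2]
qhat = conformal_threshold(calibration, alpha=0.2)

test_probs = [0.50, 0.45, 0.05]          # model output for one test instance
allowed = prediction_set(test_probs, qhat)

experts = {"e1": 1, "e2": 2, "e3": 1, "e4": 0}
decision, kept = vote_within_set(experts, allowed)
print(allowed, sorted(kept), decision)
```

Here the conformal set keeps labels 0 and 1, the expert who answered 2 is dropped, and the remaining experts vote for label 1.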
[AI-117] Simulating Biological Intelligence: Active Inference with Experiment-Informed Generative Model
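[Quick Read]: This paper asks how to build interpretable, biologically plausible models of purposeful behaviour in autonomous agents. Against a backdrop dominated by artificial neural networks, it explores biologically based neural networks as an alternative path toward safer and more efficient systems. The key to the solution is a framework rooted in Active Inference: experiment-informed generative models simulate the decision-making of embodied agents in a game-like environment, revealing the roles of memory-based learning and predictive planning in intelligent decision-making and offering a biologically grounded, scalable approach for Explainable AI.
Link: https://arxiv.org/abs/2508.06980
Authors: Aswin Paul,Moein Khajehnejad,Forough Habibollahi,Brett J. Kagan,Adeel Razi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 8 figures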
Abstract:With recent and rapid advancements in artificial intelligence (AI), understanding the foundation of purposeful behaviour in autonomous agents is crucial for developing safe and efficient systems. While artificial neural networks have dominated the path to AI, recent studies are exploring the potential of biologically based systems, such as networks of living biological neuronal networks. Along with promises of high power and data efficiency, these systems may also inform more explainable and biologically plausible models. In this work, we propose a framework rooted in active inference, a general theory of behaviour, to model decision-making in embodied agents. Using experiment-informed generative models, we simulate decision-making processes in a simulated game-play environment, mirroring experimental setups that use biological neurons. Our results demonstrate learning in these agents, providing insights into the role of memory-based learning and predictive planning in intelligent decision-making. This work contributes to the growing field of explainable AI by offering a biologically grounded and scalable approach to understanding purposeful behaviour in agents.
[AI-118] DSperse: A Framework for Targeted Verification in Zero-Knowledge Machine Learning
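[Quick Read]: This paper addresses the high cost and rigidity of Zero-Knowledge Proofs (ZKPs) in distributed machine learning inference, particularly the compute and storage cost of full-model circuitization. The key to the solution is DSperse, a modular framework that applies cryptographic verification strategically, restricting it to chosen subcomputations of the inference pipeline (called "slices") rather than the whole model. These verifiable slices support local trust minimization, while global consistency is enforced through audit, replication, or economic incentives. Because proof boundaries can align flexibly with the model's logical structure, verification is spent only on the components where it is most valuable, giving scalable, targeted verification suited to diverse deployment needs.
Link: https://arxiv.org/abs/2508.06972
Authors: Dan Ivanov,Tristan Freiberg,Haruna Isah
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 12 pages, 8 figures, and 10 tables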
Abstract:DSperse is a modular framework for distributed machine learning inference with strategic cryptographic verification. Operating within the emerging paradigm of distributed zero-knowledge machine learning, DSperse avoids the high cost and rigidity of full-model circuitization by enabling targeted verification of strategically chosen subcomputations. These verifiable segments, or “slices”, may cover part or all of the inference pipeline, with global consistency enforced through audit, replication, or economic incentives. This architecture supports a pragmatic form of trust minimization, localizing zero-knowledge proofs to the components where they provide the greatest value. We evaluate DSperse using multiple proving systems and report empirical results on memory usage, runtime, and circuit behavior under sliced and unsliced configurations. By allowing proof boundaries to align flexibly with the model’s logical structure, DSperse supports scalable, targeted verification strategies suited to diverse deployment needs.
[AI-119] Can Multitask Learning Enhance Model Explainability?
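[Quick Read]: This paper addresses the poor interpretability that comes with the complexity of multimodal remote sensing networks. The key to the solution is a multitask learning framework in which certain modalities are predicted as targets of auxiliary tasks rather than supplied as additional inputs. Relying on the rich information content of the satellite data itself, the approach improves interpretability without adding data collection burden at deployment, keeps performance comparable to, and in some cases better than, the multimodal baseline, and allows prediction errors in the main task to be explained through the behaviour of the auxiliary tasks.
Link: https://arxiv.org/abs/2508.06966
Authors: Hiba Najjar,Bushra Alshbib,Andreas Dengel
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at GCPR 2025, Special Track “Photogrammetry and remote sensing”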
Abstract:Remote sensing provides satellite data in diverse types and formats. The usage of multimodal learning networks exploits this diversity to improve model performance, except that the complexity of such networks comes at the expense of their interpretability. In this study, we explore how modalities can be leveraged through multitask learning to intrinsically explain model behavior. In particular, instead of additional inputs, we use certain modalities as additional targets to be predicted along with the main task. The success of this approach relies on the rich information content of satellite data, which remains as input modalities. We show how this modeling context provides numerous benefits: (1) in case of data scarcity, the additional modalities do not need to be collected for model inference at deployment, (2) the model performance remains comparable to the multimodal baseline performance, and in some cases achieves better scores, (3) prediction errors in the main task can be explained via the model behavior in the auxiliary task(s). We demonstrate the efficiency of our approach on three datasets, including segmentation, classification, and regression tasks. Code available at this http URL.
[AI-120] MASteer: Multi-Agent Adaptive Steer Strategy for End-to-End LLM Trustworthiness Repair
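[Quick Read]: This paper addresses persistent trustworthiness issues of Large Language Models (LLMs) in deployment. Existing repair methods such as Supervised Fine-Tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF) are costly and slow, while prompt engineering lacks robustness and scalability. The authors propose MASteer, the first end-to-end trustworthiness repair framework based on representation engineering. It has two core components: (1) AutoTester, a multi-agent system that automatically generates diverse, high-quality steer samples tailored to developer needs; and (2) AutoRepairer, which builds adaptive steering strategies with anchor vectors and selects among them automatically, based on context, at inference time. Together they enable training-free, flexible, and efficient trustworthiness repair.
Link: https://arxiv.org/abs/2508.06963
Authors: Changqing Li,Tianlin Li,Xiaohan Zhang,Aishan Liu,Li Pan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: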
Abstract:Large Language Models (LLMs) face persistent and evolving trustworthiness issues, motivating developers to seek automated and flexible repair methods that enable convenient deployment across diverse scenarios. Existing repair methods like supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) are costly and slow, while prompt engineering lacks robustness and scalability. Representation engineering, which steers model behavior by injecting targeted concept vectors during inference, offers a lightweight, training-free alternative. However, current approaches depend on manually crafted samples and fixed steering strategies, limiting automation and adaptability. To overcome these challenges, we propose MASteer, the first end-to-end framework for trustworthiness repair in LLMs based on representation engineering. MASteer integrates two core components: AutoTester, a multi-agent system that generates diverse, high-quality steer samples tailored to developer needs; and AutoRepairer, which constructs adaptive steering strategies with anchor vectors for automated, context-aware strategy selection during inference. Experiments on standard and customized trustworthiness tasks show MASteer consistently outperforms baselines, improving metrics by 15.36% on LLaMA-3.1-8B-Chat and 4.21% on Qwen-3-8B-Chat, while maintaining general model capabilities. MASteer demonstrates strong robustness, generalization, and practical value for scalable, efficient trustworthiness repair.
[AI-121] Neural Beam Field for Spatial Beam RSRP Prediction
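[Quick Read]: This paper addresses the prediction of beam-level reference signal received power (RSRP) in dense multi-user wireless networks, where high measurement overhead and fast channel variations make efficient, accurate beam management difficult for conventional methods. The key to the solution is Neural Beam Field (NBF), a hybrid neural-physical framework. It introduces the Multi-path Conditional Power Profile (MCPP), which couples site-specific multipath propagation with antenna/beam configurations through closed-form analytical modeling. The design is a decoupled "blackbox-whitebox" one: a Transformer-based deep neural network (DNN) learns the MCPP from sparse user measurements and positions, while a physics-inspired module analytically infers beam RSRP statistics. A Pretrain-and-Calibrate (PaC) strategy combines ray-tracing priors with on-site calibration on RSRP data to improve convergence, adaptivity, and generalization.
Link: https://arxiv.org/abs/2508.06956
Authors: Keqiang Guo,Yuheng Zhong,Xin Tong,Jiangbin Lyu,Rui Zhang
Affiliation: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Keywords: Neural Beam Field, Multipath Conditional Power Profile, Channel Knowledge Map, Beam-level RSRP, Transformer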
Abstract:Accurately predicting beam-level reference signal received power (RSRP) is essential for beam management in dense multi-user wireless networks, yet challenging due to high measurement overhead and fast channel variations. This paper proposes Neural Beam Field (NBF), a hybrid neural-physical framework for efficient and interpretable spatial beam RSRP prediction. Central to our approach is the introduction of the Multi-path Conditional Power Profile (MCPP), which bridges site-specific multipath propagation with antenna/beam configurations via closed-form analytical modeling. We adopt a decoupled "blackbox-whitebox" design: a Transformer-based deep neural network (DNN) learns the MCPP from sparse user measurements and positions, while a physics-inspired module analytically infers beam RSRP statistics. To improve convergence and adaptivity, we further introduce a Pretrain-and-Calibrate (PaC) strategy that leverages ray-tracing priors and on-site calibration using RSRP data. Extensive simulation results demonstrate that NBF significantly outperforms conventional table-based channel knowledge maps (CKMs) and pure blackbox DNNs in prediction accuracy, training efficiency, and generalization, while maintaining a compact model size. The proposed framework offers a scalable and physically grounded solution for intelligent beam management in next-generation dense wireless networks.
[AI-122] Large Language Models Do Not Simulate Human Psychology
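[Quick Read]: This paper asks whether Large Language Models (LLMs) can simulate human psychology and thereby replace human participants in psychological studies. Combining conceptual arguments with empirical evidence, it shows that LLMs respond to subtle changes in wording that alter meaning quite differently from humans, even in the case of CENTAUR, a model fine-tuned specifically on psychological responses. Different LLMs also respond very differently to novel items, indicating a lack of consistency and reliability. The authors therefore argue that LLMs should not be treated as simulators of human psychology but as auxiliary tools that must be validated against human responses for every new application.
Link: https://arxiv.org/abs/2508.06950
Authors: Sarah Schröder,Thekla Morgenroth,Ulrike Kuhl,Valerie Vaquet,Benjamin Paaßen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: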
Abstract:Large Language Models (LLMs), such as ChatGPT, are increasingly used in research, ranging from simple writing assistance to complex data annotation tasks. Recently, some research has suggested that LLMs may even be able to simulate human psychology and can, hence, replace human participants in psychological studies. We caution against this approach. We provide conceptual arguments against the hypothesis that LLMs simulate human psychology. We then present empirical evidence illustrating our arguments by demonstrating that slight changes to wording that correspond to large changes in meaning lead to notable discrepancies between LLMs’ and human responses, even for the recent CENTAUR model that was specifically fine-tuned on psychological responses. Additionally, different LLMs show very different responses to novel items, further illustrating their lack of reliability. We conclude that LLMs do not simulate human psychology and recommend that psychological researchers should treat LLMs as useful but fundamentally unreliable tools that need to be validated against human responses for every new application.
[AI-123] Class Unbiasing for Generalization in Medical Diagnosis
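[Quick Read]: This paper addresses performance degradation in medical diagnosis caused by class-feature bias, where a model over-relies on features strongly correlated with only a subset of classes, performs poorly on other classes, and generalizes worse overall. The core of the solution is a training method for a class-unbiased model (Cls-unbias). It introduces a class-wise inequality loss that promotes equal contributions to the classification loss from positive-class and negative-class samples, together with a class-wise group distributionally robust optimization objective, which upweights underperforming classes so that the inequality loss stays effective under class imbalance. The two jointly mitigate class imbalance and class-feature bias and improve generalization.
Link: https://arxiv.org/abs/2508.06943
Authors: Lishi Zuo,Man-Wai Mak,Lu Yi,Youzhi Tu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: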
Abstract:Medical diagnosis might fail due to bias. In this work, we identified class-feature bias, which refers to models’ potential reliance on features that are strongly correlated with only a subset of classes, leading to biased performance and poor generalization on other classes. We aim to train a class-unbiased model (Cls-unbias) that mitigates both class imbalance and class-feature bias simultaneously. Specifically, we propose a class-wise inequality loss which promotes equal contributions of classification loss from positive-class and negative-class samples. We propose to optimize a class-wise group distributionally robust optimization objective (a class-weighted training objective that upweights underperforming classes) to enhance the effectiveness of the inequality loss under class imbalance. Through synthetic and real-world datasets, we empirically demonstrate that class-feature bias can negatively impact model performance. Our proposed method effectively mitigates both class-feature bias and class imbalance, thereby improving the model’s generalization ability.
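The phrase "a class-weighted training objective that upweights underperforming classes" can be illustrated with the exponentiated-gradient weight update commonly used for group distributionally robust optimization, treating each class as a group. This is a sketch under assumptions: the abstract does not give the paper's exact update rule, the step size `eta` and the per-class losses are invented, and the class-wise inequality loss is not modelled here.

```python
import math

def update_class_weights(weights, class_losses, eta):
    """Multiply each class weight by exp(eta * loss), then renormalize."""
    scaled = [w * math.exp(eta * loss) for w, loss in zip(weights, class_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

def robust_objective(weights, class_losses):
    """Class-weighted training loss that would be minimized by the model."""
    return sum(w * loss for w, loss in zip(weights, class_losses))

# Hypothetical average losses: class 1 is the underperforming class.
class_losses = [0.2, 1.4]
weights = [0.5, 0.5]
uniform_objective = robust_objective(weights, class_losses)

for _ in range(5):
    weights = update_class_weights(weights, class_losses, eta=0.5)

reweighted_objective = robust_objective(weights, class_losses)
print([round(w, 3) for w in weights], round(reweighted_objective, 3))
```

In a real training loop the per-class losses would be recomputed on each batch, so the weights shift toward whichever class is currently worst rather than growing without bound on one class.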
[AI-124] When Prompt Engineering Meets Software Engineering: CNL-P as Natural and Robust "APIs" for Human-AI Interaction
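[Quick Read]: This paper addresses the unstable output quality of Large Language Models (LLMs) in practice that results from ambiguity and inconsistency in Natural Language (NL) prompts. The key to the solution is Controlled NL for Prompt (CNL-P), a structured prompt framework inspired by Software Engineering (SE). CNL-P introduces precise grammar structures and strict semantic norms that reduce the ambiguity of NL, allowing user intent to be expressed declaratively, with structure, and accurately. It incorporates best practices from Prompt Engineering (PE) and, for the first time, applies static analysis techniques to natural language: a linting tool checks CNL-P prompts for syntactic and semantic correctness, and an LLM-based NL-to-CNL-P conversion tool lowers the barrier to use. Experiments show that CNL-P improves LLM response quality through the synergy of PE and SE, laying a foundation for a new programming paradigm centered on natural language.
Link: https://arxiv.org/abs/2508.06942
Authors: Zhenchang Xing,Yang Liu,Zhuo Cheng,Qing Huang,Dehai Zhao,Daniel Sun,Chenhua Liu
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: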
Abstract:With the growing capabilities of large language models (LLMs), they are increasingly applied in areas like intelligent customer service, code generation, and knowledge management. Natural language (NL) prompts act as the "APIs" for human-LLM interaction. To improve prompt quality, best practices for prompt engineering (PE) have been developed, including writing guidelines and templates. Building on this, we propose Controlled NL for Prompt (CNL-P), which not only incorporates PE best practices but also draws on key principles from software engineering (SE). CNL-P introduces precise grammar structures and strict semantic norms, further eliminating NL’s ambiguity, allowing for a declarative but structured and accurate expression of user intent. This helps LLMs better interpret and execute the prompts, leading to more consistent and higher-quality outputs. We also introduce an NL2CNL-P conversion tool based on LLMs, enabling users to write prompts in NL, which are then transformed into CNL-P format, thus lowering the learning curve of CNL-P. In particular, we develop a linting tool that checks CNL-P prompts for syntactic and semantic accuracy, applying static analysis techniques to NL for the first time. Extensive experiments demonstrate that CNL-P enhances the quality of LLM responses through the novel and organic synergy of PE and SE. We believe that CNL-P can bridge the gap between emerging PE and traditional SE, laying the foundation for a new programming paradigm centered around NL.
[AI-125] CLAP: Coreference-Linked Augmentation for Passage Retrieval CIKM2025
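[Quick Read]: This paper addresses two problems in Large Language Model (LLM)-based passage expansion when it is combined with dense retrievers: semantic drift and misalignment with the retriever's pretrained semantic space, and the broken coreference continuity and added noise caused by chunking. The key to the solution is Coreference-Linked Augmentation for Passage Retrieval (CLAP), a lightweight LLM-based expansion framework that segments passages into semantically coherent chunks, resolves coreference chains, and generates localized pseudo-queries aligned with dense retriever representations. Fusing global topical signals with fine-grained subtopic signals, it improves retrieval without relying on domain knowledge, with especially strong results in out-of-domain settings.
Link: https://arxiv.org/abs/2508.06941
Authors: Huanwei Xu,Lin Xu,Liang Yuan
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by CIKM 2025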
Abstract:Large Language Model (LLM)-based passage expansion has shown promise for enhancing first-stage retrieval, but often underperforms with dense retrievers due to semantic drift and misalignment with their pretrained semantic space. Beyond this, only a portion of a passage is typically relevant to a query, while the rest introduces noise, an issue compounded by chunking techniques that break coreference continuity. We propose Coreference-Linked Augmentation for Passage Retrieval (CLAP), a lightweight LLM-based expansion framework that segments passages into coherent chunks, resolves coreference chains, and generates localized pseudo-queries aligned with dense retriever representations. A simple fusion of global topical signals and fine-grained subtopic signals achieves robust performance across domains. CLAP yields consistent gains even as retriever strength increases, enabling dense retrievers to match or surpass second-stage rankers such as BM25 + MonoT5-3B, with up to 20.68% absolute nDCG@10 improvement. These improvements are especially notable in out-of-domain settings, where conventional LLM-based expansion methods relying on domain knowledge often falter. CLAP instead adopts a logic-centric pipeline that enables robust, domain-agnostic generalization.
[AI-126] Intrinsic Explainability of Multimodal Learning for Crop Yield Prediction
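[Quick read]: This paper addresses the lack of interpretability in multimodal learning for agriculture, specifically crop yield prediction, where model decisions must be explained across heterogeneous inputs: satellite remote sensing, weather time series, terrain elevation, and soil properties. The key to the solution is to exploit the self-attention mechanism of Transformer-based models: feature-level attributions are estimated with Attention Rollout (AR) and Generic Attention (GA), and the proposed Weighted Modality Activation (WMA) method assesses the importance of each modality, making subfield-level yield prediction models more explainable. Experiments show that Transformer models outperform convolutional and recurrent networks, that AR gives more robust and reliable temporal attributions, and that knowledge of crop phenology stages helps interpret the explanations in agronomic terms.
Link: https://arxiv.org/abs/2508.06939
Authors: Hiba Najjar,Deepak Pathak,Marlon Nuske,Andreas Dengel
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: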
Abstract:Multimodal learning enables various machine learning tasks to benefit from diverse data sources, effectively mimicking the interplay of different factors in real-world applications, particularly in agriculture. While the heterogeneous nature of involved data modalities may necessitate the design of complex architectures, the model interpretability is often overlooked. In this study, we leverage the intrinsic explainability of Transformer-based models to explain multimodal learning networks, focusing on the task of crop yield prediction at the subfield level. The large datasets used cover various crops, regions, and years, and include four different input modalities: multispectral satellite and weather time series, terrain elevation maps and soil properties. Based on the self-attention mechanism, we estimate feature attributions using two methods, namely the Attention Rollout (AR) and Generic Attention (GA), and evaluate their performance against Shapley-based model-agnostic estimations, Shapley Value Sampling (SVS). Additionally, we propose the Weighted Modality Activation (WMA) method to assess modality attributions and compare it with SVS attributions. Our findings indicate that Transformer-based models outperform other architectures, specifically convolutional and recurrent networks, achieving R2 scores that are higher by 0.10 and 0.04 at the subfield and field levels, respectively. AR is shown to provide more robust and reliable temporal attributions, as confirmed through qualitative and quantitative evaluation, compared to GA and SVS values. Information about crop phenology stages was leveraged to interpret the explanation results in the light of established agronomic knowledge. Furthermore, modality attributions revealed varying patterns across the two methods compared.[…]
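As a rough illustration of the Attention Rollout (AR) idea named in the abstract, the sketch below multiplies per-layer attention matrices after mixing each one with the identity to account for residual connections. It is a minimal pure-Python sketch and not the authors' implementation: the helper names `matmul` and `attention_rollout`, the 0.5/0.5 residual mixing, and the toy matrices are assumptions for illustration, and the inputs are assumed to be head-averaged, row-normalized attention maps.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def attention_rollout(layers):
    """Combine per-layer attention maps (input layer first) into one map.

    Each entry of `layers` is an n x n row-stochastic matrix. Row i of the
    result estimates how much each input token contributes to token i at
    the last layer.
    """
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Mix with the identity to model the residual connection.
        # Rows still sum to 1 because 0.5 * 1 + 0.5 = 1.
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        # Later layers are applied on the left of the running product.
        rollout = matmul(mixed, rollout)
    return rollout


# Toy example: two layers over three tokens (hypothetical values).
layers = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
    [[0.5, 0.25, 0.25], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]],
]
rollout = attention_rollout(layers)
for row in rollout:
    print([round(v, 3) for v in row])
```

In a yield-prediction setting the tokens could stand for time steps of the satellite or weather series, so each row would read as a temporal attribution profile. That mapping is an interpretation of the abstract, not a detail it states.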
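[AI-127] Automated Formalization via Conceptual Retrieval-Augmented LLMs
[Quick read]: This paper addresses two major challenges that generative AI faces in the autoformalization of mathematical theorems: model hallucination (undefined predicates, symbol misuse, and version incompatibility) and the semantic gap caused by ambiguous or missing premises in natural language descriptions. The key to the solution is CRAMF, a Concept-driven Retrieval-Augmented Mathematical Formalization framework, whose core contributions are: 1) a knowledge base built automatically from Mathlib4, the standard mathematical library for Lean 4, indexing over 26,000 formal definitions and more than 1,000 core mathematical concepts; 2) contextual query augmentation with domain-level and application-level signals to handle the polymorphism of mathematical concepts; 3) a dual-channel hybrid retrieval strategy with reranking for accurate definition retrieval. Experiments show that CRAMF integrates seamlessly into LLM-based autoformalizers and improves translation accuracy on miniF2F, ProofNet, and the newly proposed AdvancedMath benchmark, with relative improvements of up to 62.1% and 29.9% on average.
Link: https://arxiv.org/abs/2508.06931
Authors: Wangyue Lu,Lun Du,Sirui Li,Ke Weng,Haozhe Sun,Hengyu Liu,Minghe Yu,Tiancheng Zhang,Ge Yu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: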
Abstract:Interactive theorem provers (ITPs) require manual formalization, which is labor-intensive and demands expert knowledge. While automated formalization offers a potential solution, it faces two major challenges: model hallucination (e.g., undefined predicates, symbol misuse, and version incompatibility) and the semantic gap caused by ambiguous or missing premises in natural language descriptions. To address these issues, we propose CRAMF, a Concept-driven Retrieval-Augmented Mathematical Formalization framework. CRAMF enhances LLM-based autoformalization by retrieving formal definitions of core mathematical concepts, providing contextual grounding during code generation. However, applying retrieval-augmented generation (RAG) in this setting is non-trivial due to the lack of structured knowledge bases, the polymorphic nature of mathematical concepts, and the high precision required in formal retrieval. We introduce a framework for automatically constructing a concept-definition knowledge base from Mathlib4, the standard mathematical library for the Lean 4 theorem prover, indexing over 26,000 formal definitions and 1,000+ core mathematical concepts. To address conceptual polymorphism, we propose contextual query augmentation with domain- and application-level signals. In addition, we design a dual-channel hybrid retrieval strategy with reranking to ensure accurate and relevant definition retrieval. Experiments on miniF2F, ProofNet, and our newly proposed AdvancedMath benchmark show that CRAMF can be seamlessly integrated into LLM-based autoformalizers, yielding consistent improvements in translation accuracy, achieving up to 62.1% and an average of 29.9% relative improvement.
[AI-128] GDBA Revisited: Unleashing the Power of Guided Local Search for Distributed Constraint Optimization
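[Quick read]: This paper addresses the tendency of local search algorithms for Distributed Constraint Optimization Problems (DCOPs) to converge to poor local optima, and in particular the marginal benefit of the existing GDBA algorithm on general-valued problems. The key to the solution is Distributed Guided Local Search (DGLS), a new framework with three core mechanisms: 1) an adaptive constraint violation condition that penalizes only high-cost constraints; 2) a penalty evaporation mechanism that controls the magnitude of penalization; 3) a synchronization scheme for coordinated penalty updates across agents. Theoretical analysis shows that penalty values in DGLS are bounded and that agents play a potential game. Experiments on standard benchmarks show DGLS outperforming state-of-the-art baselines, and beating Damped Max-sum by 3.77%–66.3% on structured problems in terms of anytime results.
Link: https://arxiv.org/abs/2508.06899
Authors: Yanchen Deng,Xinrun Wang,Bo An
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
Comments: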
Abstract:Local search is an important class of incomplete algorithms for solving Distributed Constraint Optimization Problems (DCOPs) but it often converges to poor local optima. While GDBA provides a comprehensive rule set to escape premature convergence, its empirical benefits remain marginal on general-valued problems. In this work, we systematically examine GDBA and identify three factors that potentially lead to its inferior performance, i.e., over-aggressive constraint violation conditions, unbounded penalty accumulation, and uncoordinated penalty updates. To address these issues, we propose Distributed Guided Local Search (DGLS), a novel GLS framework for DCOPs that incorporates an adaptive violation condition to selectively penalize constraints with high cost, a penalty evaporation mechanism to control the magnitude of penalization, and a synchronization scheme for coordinated penalty updates. We theoretically show that the penalty values are bounded, and agents play a potential game in our DGLS. Our extensive empirical results on various standard benchmarks demonstrate the great superiority of DGLS over state-of-the-art baselines. Particularly, compared to Damped Max-sum with high damping factors (e.g., 0.7 or 0.9), our DGLS achieves competitive performance on general-valued problems, and outperforms it by significant margins (3.77%–66.3%) on structured problems in terms of anytime results.
[AI-129] Pushdown Reward Machines for Reinforcement Learning
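[Quick read]: This paper addresses how to model and reward behaviours with complex temporal structure in Reinforcement Learning (RL), especially non-local, nested, or recursive behaviours that cannot be expressed by Markovian reward functions or by finite-state automata (Reward Machines, RMs). The key to the solution is the pushdown reward machine (pdRM), an extension based on Deterministic Pushdown Automata (DPDA) that can recognize and reward temporally extended behaviours representable in deterministic context-free languages, which makes it more expressive than RMs. The paper also introduces two variants of pdRM-based policies, one with access to the entire stack and one restricted to the top k stack symbols, and proposes a procedure to check when the two achieve the same optimal expected reward, giving practical use a theoretical footing.
Link: https://arxiv.org/abs/2508.06894
Authors: Giovanni Varricchione,Toryn Q. Klassen,Natasha Alechina,Mehdi Dastani,Brian Logan,Sheila A. McIlraith
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: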
Abstract:Reward machines (RMs) are automata structures that encode (non-Markovian) reward functions for reinforcement learning (RL). RMs can reward any behaviour representable in regular languages and, when paired with RL algorithms that exploit RM structure, have been shown to significantly improve sample efficiency in many domains. In this work, we present pushdown reward machines (pdRMs), an extension of reward machines based on deterministic pushdown automata. pdRMs can recognize and reward temporally extended behaviours representable in deterministic context-free languages, making them more expressive than reward machines. We introduce two variants of pdRM-based policies, one which has access to the entire stack of the pdRM, and one which can only access the top k symbols (for a given constant k ) of the stack. We propose a procedure to check when the two kinds of policies (for a given environment, pdRM, and constant k ) achieve the same optimal expected reward. We then provide theoretical results establishing the expressive power of pdRMs, and space complexity results about the proposed learning problems. Finally, we provide experimental results showing how agents can be trained to perform tasks representable in deterministic context-free languages using pdRMs.
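To make the idea of a stack-based reward machine concrete, the sketch below rewards an agent for picking up some number of items and then dropping exactly that many, a pattern of the form pick^n drop^n that a finite-state reward machine cannot track for unbounded n. It is a toy sketch under stated assumptions, not the paper's formal pdRM definition: the class name `ToyPushdownRewardMachine`, the event labels `"pick"` and `"drop"`, and the reward values are all hypothetical.

```python
class ToyPushdownRewardMachine:
    """Deterministic pushdown automaton that emits a reward of 1.0 once
    every picked item has been dropped (events: "pick" then "drop")."""

    def __init__(self):
        self.state = "collect"   # collect -> deliver -> done, or failed
        self.stack = []          # one symbol per item still being carried

    def step(self, event):
        """Consume one event label and return the reward for this step."""
        if self.state in ("done", "failed"):
            return 0.0                      # absorbing states
        if event == "pick" and self.state == "collect":
            self.stack.append("item")       # push
            return 0.0
        if event == "drop" and self.stack:
            self.stack.pop()                # pop
            if self.stack:
                self.state = "deliver"
                return 0.0
            self.state = "done"             # stack emptied: task complete
            return 1.0
        self.state = "failed"               # e.g. pick after drop, or drop with nothing held
        return 0.0

    def top_k(self, k):
        """Top k stack symbols, the restricted view a top-k policy would see."""
        return tuple(self.stack[-k:]) if k > 0 else ()


def total_reward(events):
    """Run a fresh machine over a sequence of events and sum the rewards."""
    machine = ToyPushdownRewardMachine()
    return sum(machine.step(e) for e in events)


print(total_reward(["pick", "pick", "drop", "drop"]))  # balanced sequence
print(total_reward(["pick", "pick", "drop", "pick"]))  # violates the pattern
```

A policy with full stack access would condition on `machine.stack`, while the restricted variant from the abstract would condition only on something like `machine.top_k(k)`.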
[AI-130] Sparsity-Driven Plasticity in Multi-Task Reinforcement Learning
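[Quick read]: This paper addresses the performance degradation caused by plasticity loss in Multi-Task Reinforcement Learning (MTRL), where a model gradually loses its capacity to adapt as training progresses. The core of the solution is sparsification, in particular Gradual Magnitude Pruning (GMP) and Sparse Evolutionary Training (SET), which produce sparse network structures dynamically to improve plasticity and representational flexibility and so mitigate typical symptoms of plasticity degradation such as neuron dormancy and representational collapse. Experiments show that these strategies improve multi-task performance across several architectures (shared backbone, Mixture of Experts, and others), with sparse agents frequently outperforming dense baselines and achieving results competitive with explicit plasticity interventions. This points to a close link between sparsity and plasticity and offers a practical tool for building more adaptable MTRL systems.
Link: https://arxiv.org/abs/2508.06871
Authors: Aleksandar Todorov,Juan Cardenas-Cartagena,Rafael F. Cunha,Marco Zullich,Matthia Sabatelli
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: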
Abstract:Plasticity loss, a diminishing capacity to adapt as training progresses, is a critical challenge in deep reinforcement learning. We examine this issue in multi-task reinforcement learning (MTRL), where higher representational flexibility is crucial for managing diverse and potentially conflicting task demands. We systematically explore how sparsification methods, particularly Gradual Magnitude Pruning (GMP) and Sparse Evolutionary Training (SET), enhance plasticity and consequently improve performance in MTRL agents. We evaluate these approaches across distinct MTRL architectures (shared backbone, Mixture of Experts, Mixture of Orthogonal Experts) on standardized MTRL benchmarks, comparing against dense baselines, and a comprehensive range of alternative plasticity-inducing or regularization methods. Our results demonstrate that both GMP and SET effectively mitigate key indicators of plasticity degradation, such as neuron dormancy and representational collapse. These plasticity improvements often correlate with enhanced multi-task performance, with sparse agents frequently outperforming dense counterparts and achieving competitive results against explicit plasticity interventions. Our findings offer insights into the interplay between plasticity, network sparsity, and MTRL designs, highlighting dynamic sparsification as a robust but context-sensitive tool for developing more adaptable MTRL systems.
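For readers unfamiliar with Gradual Magnitude Pruning, the sketch below shows one common formulation: a cubic schedule that raises the target sparsity over training, paired with a mask that zeroes the smallest-magnitude weights. The abstract does not say which schedule or hyperparameters the paper uses, so the cubic form, the function names `gmp_sparsity` and `magnitude_mask`, and the default values are assumptions for illustration only.

```python
def gmp_sparsity(step, start_step, end_step,
                 initial_sparsity=0.0, final_sparsity=0.9):
    """Target sparsity at a training step under a cubic schedule.

    Sparsity rises quickly at first and flattens as it approaches
    `final_sparsity` at `end_step`.
    """
    if step <= start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the smallest-magnitude weights.

    `weights` is a flat list of floats; the fraction `sparsity` of entries
    with the smallest absolute value is masked out.
    """
    n_prune = int(sparsity * len(weights))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


# Toy usage with hypothetical weights and a 100-step pruning window.
weights = [0.5, -0.1, 0.3, -0.9]
for step in (0, 50, 100):
    s = gmp_sparsity(step, start_step=0, end_step=100)
    print(step, round(s, 4), magnitude_mask(weights, s))
```

Sparse Evolutionary Training differs in that it keeps sparsity fixed and periodically drops and regrows connections, so this schedule would not apply to it.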
[AI-131] MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams
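[Quick read]: This paper addresses the limitations of current benchmarks for Multimodal Large Language Models (MLLMs): limited scale, narrow coverage, and unstructured knowledge, which lead to static and undifferentiated evaluations. The key to the solution is MDK12-Bench, a large-scale multidisciplinary benchmark built from real K-12 exams, covering six disciplines with 141K instances and 6,225 knowledge points organized in a six-layer taxonomy. The paper also designs a dynamic evaluation framework that introduces unfamiliar visual, textual, and question-form shifts to challenge model generalization, and uses Knowledge-Point Reference-Augmented Generation (KP-RAG) to examine the role of knowledge in reasoning, improving the objectivity and longevity of the evaluation.
Link: https://arxiv.org/abs/2508.06851
Authors: Pengfei Zhou,Xiaopeng Peng,Fanrui Zhang,Zhaopan Xu,Jiaxin Ai,Yansheng Qiu,Chuanhao Li,Zhen Li,Ming Li,Yukang Feng,Jianwen Sun,Haoquan Zhang,Zizhen Li,Xiaofeng Mao,Zekai Li,Wangbo Zhao,Kai Wang,Xiaojun Chang,Wenqi Shao,Yang You,Kaipeng Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 35 pages, 33 figures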
Abstract:Multimodal large language models (MLLMs), which integrate language and visual cues for problem-solving, are crucial for advancing artificial general intelligence (AGI). However, current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge, offering only static and undifferentiated evaluations. To bridge this gap, we introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines with 141K instances and 6,225 knowledge points organized in a six-layer taxonomy. Covering five question formats with difficulty and year annotations, it enables comprehensive evaluation to capture the extent to which MLLMs perform over four dimensions: 1) difficulty levels, 2) temporal (cross-year) shifts, 3) contextual shifts, and 4) knowledge-driven reasoning. We propose a novel dynamic evaluation framework that introduces unfamiliar visual, textual, and question form shifts to challenge model generalization while improving benchmark objectivity and longevity by mitigating data contamination. We further evaluate knowledge-point reference-augmented generation (KP-RAG) to examine the role of knowledge in problem-solving. Key findings reveal limitations in current MLLMs in multiple aspects and provide guidance for enhancing model robustness, interpretability, and AI-assisted education.
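[AI-132] Towards Experience-Centered AI: A Framework for Integrating Lived Experience in Design and Development
[Quick read]: This paper addresses the insufficient attention paid to lived experience in the design and evaluation of current AI systems: there is little systematic understanding of how real human experience shapes the perception, trust, and use of AI systems, and few workable strategies for embedding that experience across the AI development lifecycle. The key to the solution is a framework that synthesizes interdisciplinary literature (lived experience philosophy, human-centered design, and human-AI interaction). It builds a taxonomy of lived experiences specific to AI systems; uses three application domains (education, healthcare, and cultural alignment) to show how lived experience informs user goals, system expectations, and ethical considerations; and draws on the perspectives of AI system operators and human-AI partnerships to identify challenges in responsibility allocation, mental model calibration, and long-term adaptation. It closes with actionable recommendations for experience-centered AI systems that are technically robust as well as empathetic, context-aware, and aligned with human realities.
Link: https://arxiv.org/abs/2508.06849
Authors: Sanjana Gautam,Mohit Chandra,Ankolika De,Tatiana Chakravorti,Girik Malik,Munmun De Choudhury
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: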
Abstract:Lived experiences fundamentally shape how individuals interact with AI systems, influencing perceptions of safety, trust, and usability. While prior research has focused on developing techniques to emulate human preferences, and proposed taxonomies to categorize risks (such as psychological harms and algorithmic biases), these efforts have provided limited systematic understanding of lived human experiences or actionable strategies for embedding them meaningfully into the AI development lifecycle. This work proposes a framework for meaningfully integrating lived experience into the design and evaluation of AI systems. We synthesize interdisciplinary literature across lived experience philosophy, human-centered design, and human-AI interaction, arguing that centering lived experience can lead to models that more accurately reflect the retrospective, emotional, and contextual dimensions of human cognition. Drawing from a wide body of work across psychology, education, healthcare, and social policy, we present a targeted taxonomy of lived experiences with specific applicability to AI systems. To ground our framework, we examine three application domains (i) education, (ii) healthcare, and (iii) cultural alignment, illustrating how lived experience informs user goals, system expectations, and ethical considerations in each context. We further incorporate insights from AI system operators and human-AI partnerships to highlight challenges in responsibility allocation, mental model calibration, and long-term system adaptation. We conclude with actionable recommendations for developing experience-centered AI systems that are not only technically robust but also empathetic, context-aware, and aligned with human realities. This work offers a foundation for future research that bridges technical development with the lived experiences of those impacted by AI systems.
[AI-133] Highlight All the Phrases: Enhancing LLM Transparency through Visual Factuality Indicators AAAI
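[Quick read]: This paper addresses false information ("hallucinations" or "confabulations") in content generated by Large Language Models (LLMs), and in particular the lack of effective ways to communicate factuality assessments to users. The key to the solution is a visual design driven by factuality scores: every phrase in a response is color-coded according to its factuality score. Experiments show that, compared with a baseline with no styling, participants trusted and preferred this design and found it easier to validate the accuracy of the response.
Link: https://arxiv.org/abs/2508.06846
Authors: Hyo Jin Do,Rachel Ostrand,Werner Geyer,Keerthiram Murugesan,Dennis Wei,Justin Weisz
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures, To be published in Proceedings of the 8th AAAI/ACM Conference on AI, Ethics, and Society (AIES 2025)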
Abstract:Large language models (LLMs) are susceptible to generating inaccurate or false information, often referred to as “hallucinations” or “confabulations.” While several technical advancements have been made to detect hallucinated content by assessing the factuality of the model’s responses, there is still limited research on how to effectively communicate this information to users. To address this gap, we conducted two scenario-based experiments with a total of 208 participants to systematically compare the effects of various design strategies for communicating factuality scores by assessing participants’ ratings of trust, ease in validating response accuracy, and preference. Our findings reveal that participants preferred and trusted a design in which all phrases within a response were color-coded based on factuality scores. Participants also found it easier to validate accuracy of the response in this style compared to a baseline with no style applied. Our study offers practical design guidelines for LLM application developers and designers, aimed at calibrating user trust, aligning with user preferences, and enhancing users’ ability to scrutinize LLM outputs.
[AI-134] Multi-level Advantage Credit Assignment for Cooperative Multi-Agent Reinforcement Learning AISTATS2025
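[Quick read]: This paper addresses the credit assignment problem in cooperative Multi-Agent Reinforcement Learning (MARL), that is, how to assess each agent's contribution to a shared reward. Because tasks and forms of coordination vary, rewards may be earned by agent subsets of different sizes and compositions that often overlap, and existing methods struggle to separate the contributions at each level. The key to the solution is a multi-level advantage formulation that uses explicit counterfactual reasoning to infer credit across cooperation levels: an attention-based framework identifies correlated agent relationships and integrates advantage functions over individual, joint, and correlated actions to produce multi-level advantage signals that guide policy learning. The method, Multi-level Advantage Credit Assignment (MACA), shows superior performance on challenging Starcraft II tasks, supporting its effectiveness in complex credit assignment scenarios.
Link: https://arxiv.org/abs/2508.06836
Authors: Xutong Zhao,Yaqi Xie
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at AISTATS 2025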
Abstract:Cooperative multi-agent reinforcement learning (MARL) aims to coordinate multiple agents to achieve a common goal. A key challenge in MARL is credit assignment, which involves assessing each agent’s contribution to the shared reward. Given the diversity of tasks, agents may perform different types of coordination, with rewards attributed to diverse and often overlapping agent subsets. In this work, we formalize the credit assignment level as the number of agents cooperating to obtain a reward, and address scenarios with multiple coexisting levels. We introduce a multi-level advantage formulation that performs explicit counterfactual reasoning to infer credits across distinct levels. Our method, Multi-level Advantage Credit Assignment (MACA), captures agent contributions at multiple levels by integrating advantage functions that reason about individual, joint, and correlated actions. Utilizing an attention-based framework, MACA identifies correlated agent relationships and constructs multi-level advantages to guide policy learning. Comprehensive experiments on challenging Starcraft v1 and v2 tasks demonstrate MACA’s superior performance, underscoring its efficacy in complex credit assignment scenarios.
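[AI-135] Remote Sensing Image Intelligent Interpretation with the Language-Centered Perspective: Principles, Methods and Challenges
[Quick read]: This paper addresses the limitations of the vision-centered models that have long dominated remote sensing image interpretation, such as weak multimodal reasoning, insufficient semantic abstraction, and difficulty with interactive decision-making. The core of the solution is a paradigm shift from vision-centered to language-centered interpretation: Large Language Models (LLMs) serve as the cognitive hub of a language-centered framework inspired by Global Workspace Theory (GWT), which integrates perceptual, task, knowledge, and action spaces to support unified understanding, reasoning, and decision-making. Treating LLMs as the central cognitive component that drives multimodal data fusion, knowledge association, and complex reasoning provides a conceptual foundation and technical roadmap for cognition-driven intelligent geospatial analysis.
Link: https://arxiv.org/abs/2508.06832
Authors: Haifeng Li,Wang Guo,Haiyang Wu,Mengwei Wu,Jipeng Zhang,Qing Zhu,Yu Liu,Xin Huang,Chao Tao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: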
Abstract:The mainstream paradigm of remote sensing image interpretation has long been dominated by vision-centered models, which rely on visual features for semantic understanding. However, these models face inherent limitations in handling multi-modal reasoning, semantic abstraction, and interactive decision-making. While recent advances have introduced Large Language Models (LLMs) into remote sensing workflows, existing studies primarily focus on downstream applications, lacking a unified theoretical framework that explains the cognitive role of language. This review advocates a paradigm shift from vision-centered to language-centered remote sensing interpretation. Drawing inspiration from the Global Workspace Theory (GWT) of human cognition, we propose a language-centered framework for remote sensing interpretation that treats LLMs as the cognitive central hub integrating perceptual, task, knowledge and action spaces to enable unified understanding, reasoning, and decision-making. We first explore the potential of LLMs as the central cognitive component in remote sensing interpretation, and then summarize core technical challenges, including unified multimodal representation, knowledge association, and reasoning and decision-making. Furthermore, we construct a global workspace-driven interpretation mechanism and review how language-centered solutions address each challenge. Finally, we outline future research directions from four perspectives: adaptive alignment of multimodal data, task understanding under dynamic knowledge constraints, trustworthy reasoning, and autonomous interaction. This work aims to provide a conceptual foundation for the next generation of remote sensing interpretation systems and establish a roadmap toward cognition-driven intelligent geospatial analysis.
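[AI-136] Who's the Evil Twin? Differential Auditing for Undesired Behavior
[Quick read]: This paper addresses the difficulty of detecting hidden behaviors in neural networks, especially with little prior knowledge and under adversarial obfuscation. The core of the solution is to frame detection as an adversarial game between two teams: the red team trains two models with similar performance, one on benign data only and the other on data containing hidden harmful behavior; the blue team, with limited or no prior information, tries to identify the compromised model. The key findings are that adversarial-attack-based methods reach 100% accuracy when given red-team hints, while Gaussian noise analysis, model diffing, and integrated gradients give more varied results; and that in the Large Language Model (LLM) setting, effective auditing requires hints about the undesired distribution, which can then be used with standard black-box and open-weight methods to reveal misalignment.
Link: https://arxiv.org/abs/2508.06827
Authors: Ishwar Balappanawar,Venkata Hasith Vattikuti,Greta Kintzley,Ronan Azimi-Mancel,Satvik Golechha
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: main section: 8 pages, 4 figures, 1 table total: 34 pages, 44 figures, 12 tables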
Abstract:Detecting hidden behaviors in neural networks poses a significant challenge due to minimal prior knowledge and potential adversarial obfuscation. We explore this problem by framing detection as an adversarial game between two teams: the red team trains two similar models, one trained solely on benign data and the other trained on data containing hidden harmful behavior, with the performance of both being nearly indistinguishable on the benign dataset. The blue team, with limited to no information about the harmful behaviour, tries to identify the compromised model. We experiment using CNNs and try various blue team strategies, including Gaussian noise analysis, model diffing, integrated gradients, and adversarial attacks under different levels of hints provided by the red team. Results show high accuracy for adversarial-attack-based methods (100% correct prediction, using hints), which is very promising, whilst the other techniques yield more varied performance. During our LLM-focused rounds, we find that there are not many parallel methods that we could apply from our study with CNNs. Instead, we find that effective LLM auditing methods require some hints about the undesired distribution, which can then be used in standard black-box and open-weight methods to probe the models further and reveal their misalignment. We open-source our auditing games (with the model and data) and hope that our findings contribute to designing better audits.
[AI-137] Natural Language-Driven Viewpoint Navigation for Volume Exploration via Semantic Block Representation IEEE-VIS2025
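[Quick read]: This paper addresses the difficulty users face in choosing optimal viewpoints when exploring volumetric scientific data, especially users without domain expertise or 3D navigation experience. The key to the solution is a framework based on natural language interaction: it encodes volumetric blocks to capture and differentiate underlying structures, and uses a CLIP Score mechanism to attach semantic information to each block and guide navigation. A reinforcement learning framework uses these semantic cues to search for viewpoints that match the user's intent, and the selected viewpoints are evaluated with CLIP Score, improving the efficiency of volumetric data exploration and the interpretability of complex scientific phenomena.
Link: https://arxiv.org/abs/2508.06823
Authors: Xuan Zhao,Jun Tao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE VIS 2025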
Abstract:Exploring volumetric data is crucial for interpreting scientific datasets. However, selecting optimal viewpoints for effective navigation can be challenging, particularly for users without extensive domain expertise or familiarity with 3D navigation. In this paper, we propose a novel framework that leverages natural language interaction to enhance volumetric data exploration. Our approach encodes volumetric blocks to capture and differentiate underlying structures. It further incorporates a CLIP Score mechanism, which provides semantic information to the blocks to guide navigation. The navigation is empowered by a reinforcement learning framework that leverages these semantic cues to efficiently search for and identify desired viewpoints that align with the user’s intent. The selected viewpoints are evaluated using CLIP Score to ensure that they best reflect the user queries. By automating viewpoint selection, our method improves the efficiency of volumetric data navigation and enhances the interpretability of complex scientific phenomena.
[AI-138] Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face
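[Quick read]: This paper addresses the lack of systematic empirical work on the structure and evolution of fine-tuning in the development and deployment of generative machine learning (ML) models. The key to the solution is to build and analyze "model family trees" from 1.86 million models on Hugging Face. Taking an evolutionary biology lens, it uses model metadata and model cards to measure genetic similarity and mutation of traits across model families, revealing directed drift in traits such as license type, language compatibility, and the length and standardization of model cards. The work offers an empirical, quantitative basis for understanding how the open machine learning ecosystem evolves.
Link: https://arxiv.org/abs/2508.06811
Authors: Benjamin Laufer,Hamidah Oderinwale,Jon Kleinberg
Affiliation: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 29 pages, 18 figures and tables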
Abstract:Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees – networks that connect fine-tuned models to their base or parent – reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens to study ML models, we use model metadata and model cards to measure the genetic similarity and mutation of traits over model families. We find that models tend to exhibit a family resemblance, meaning their genetic markers and traits exhibit more overlap when they belong to the same model family. However, these similarities depart in certain ways from standard models of asexual reproduction, because mutations are fast and directed, such that two ‘sibling’ models tend to exhibit more similarity than parent/child pairs. Further analysis of the directional drifts of these mutations reveals qualitative insights about the open machine learning ecosystem: Licenses counter-intuitively drift from restrictive, commercial licenses towards permissive or copyleft licenses, often in violation of upstream license’s terms; models evolve from multi-lingual compatibility towards English-only compatibility; and model cards reduce in length and standardize by turning, more often, to templates and automatically generated text. Overall, this work takes a step toward an empirically grounded understanding of model fine-tuning and suggests that ecological models and methods can yield novel scientific insights.
[AI-139] Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation ICML2025
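[Quick read]: This paper addresses a performance bottleneck in Offline-to-Online Reinforcement Learning (O2O RL) caused by the distribution gap between offline and online data: data generated by existing augmentation methods still differs from the online data distribution, which limits online fine-tuning. The key to the solution is Classifier-Free Diffusion Generation (CFDG), which adds no classifier training overhead. It uses classifier-free guidance diffusion to improve the generation quality of offline and online data with different distributions, and applies a reweighting method so that more generated data aligns with the online data, improving performance while keeping the agent stable.
Link: https://arxiv.org/abs/2508.06806
Authors: Xiao Huang,Xu Liu,Enze Zhang,Tong Yu,Shuai Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML2025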
Abstract:Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions. Existing work used offline datasets to generate data that conform to the online data distribution for data augmentation. However, generated data still exhibits a gap with the online data, limiting overall performance. To address this, we propose a new data augmentation approach, Classifier-Free Diffusion Generation (CFDG). Without introducing additional classifier training overhead, CFDG leverages classifier-free guidance diffusion to significantly enhance the generation quality of offline and online data with different distributions. Additionally, it employs a reweighting method to enable more generated data to align with the online data, enhancing performance while maintaining the agent’s stability. Experimental results show that CFDG outperforms replaying the two data types or using a standard diffusion model to generate new data. Our method is versatile and can be integrated with existing offline-to-online RL algorithms. By implementing CFDG to popular methods IQL, PEX and APL, we achieve a notable 15% average improvement in empirical performance on the D4RL benchmark such as MuJoCo and AntMaze.
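The classifier-free guidance step that the abstract refers to can be sketched as a simple combination of two noise predictions from the same diffusion model, one conditioned and one unconditioned. The sketch below shows only that generic combination rule, not CFDG itself: the function name `cfg_combine`, the toy vectors, and the idea of using a data-source label (offline vs. online) as the condition are assumptions for illustration. Papers also differ on the scale convention, so here a scale of 1.0 is taken to mean "purely conditional".

```python
def cfg_combine(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance on noise predictions.

    eps_cond:   noise predicted with the condition (e.g. a data-source label)
    eps_uncond: noise predicted with the condition dropped
    guidance_scale: 0.0 -> unconditional, 1.0 -> conditional,
                    > 1.0 -> extrapolate further toward the condition
    """
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy noise predictions for a 2-dimensional sample (hypothetical values).
eps_cond = [1.0, 2.0]
eps_uncond = [0.0, 1.0]

for scale in (0.0, 1.0, 2.0):
    print(scale, cfg_combine(eps_cond, eps_uncond, scale))
```

No separate classifier is trained in this scheme; the same network produces both predictions, which matches the abstract's point about avoiding additional classifier training overhead.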
[AI-140] Hardness-Aware Dynamic Curriculum Learning for Robust Multimodal Emotion Recognition with Missing Modalities
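[Quick read]: This paper addresses the performance drop that missing modalities cause in Multimodal Emotion Recognition (MER). Existing methods do not adapt to differences in reconstruction difficulty across samples, which limits how well the model learns from hard samples. The key to the solution is a Hardness-Aware Dynamic Curriculum Learning framework, HARDY-MER, with two core components: a Multi-view Hardness Evaluation mechanism that quantifies each sample's reconstruction difficulty through Direct Hardness (modality reconstruction errors) and Indirect Hardness (cross-modal mutual information); and a Retrieval-based Dynamic Curriculum Learning strategy that retrieves semantically similar samples and dynamically balances the training focus between easy and hard samples, improving robustness and generalization in missing-modality scenarios.
Link: https://arxiv.org/abs/2508.06800
Authors: Rui Liu,Haolin Zuo,Zheng Lian,Hongyu Yuan,Qi Fan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: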
Abstract:Missing modalities have recently emerged as a critical research direction in multimodal emotion recognition (MER). Conventional approaches typically address this issue through missing modality reconstruction. However, these methods fail to account for variations in reconstruction difficulty across different samples, consequently limiting the model’s ability to handle hard samples effectively. To overcome this limitation, we propose a novel Hardness-Aware Dynamic Curriculum Learning framework, termed HARDY-MER. Our framework operates in two key stages: first, it estimates the hardness level of each sample, and second, it strategically emphasizes hard samples during training to enhance model performance on these challenging instances. Specifically, we first introduce a Multi-view Hardness Evaluation mechanism that quantifies reconstruction difficulty by considering both Direct Hardness (modality reconstruction errors) and Indirect Hardness (cross-modal mutual information). Meanwhile, we introduce a Retrieval-based Dynamic Curriculum Learning strategy that dynamically adjusts the training curriculum by retrieving samples with similar semantic information and balancing the learning focus between easy and hard instances. Extensive experiments on benchmark datasets demonstrate that HARDY-MER consistently outperforms existing methods in missing-modality scenarios. Our code will be made publicly available at this https URL.
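[AI-141] LSDTs: LLM-Augmented Semantic Digital Twins for Adaptive Knowledge-Intensive Infrastructure Planning
Quick read: This paper addresses the limited effectiveness of Digital Twins (DTs) in managing complex infrastructure systems, which stems from the difficulty of integrating unstructured knowledge. The key to the solution is the LSDTs (LLM-Augmented Semantic Digital Twins) framework, which uses Large Language Models (LLMs) to extract planning knowledge from unstructured documents such as environmental regulations and technical guidelines and organizes it into a formal ontology. The ontology forms a semantic layer that drives the digital twin to simulate realistic, regulation-aware scenarios, enabling interpretable, high-fidelity, and adaptive infrastructure planning optimization.
Link: https://arxiv.org/abs/2508.06799
Authors: Naiyi Li, Zihui Ma, Runlong Yu, Lingyao Li
Affiliation: Unknown
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
Comments: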
Abstract:Digital Twins (DTs) offer powerful tools for managing complex infrastructure systems, but their effectiveness is often limited by challenges in integrating unstructured knowledge. Recent advances in Large Language Models (LLMs) bring new potential to address this gap, with strong abilities in extracting and organizing diverse textual information. We therefore propose LSDTs (LLM-Augmented Semantic Digital Twins), a framework that helps LLMs extract planning knowledge from unstructured documents like environmental regulations and technical guidelines, and organize it into a formal ontology. This ontology forms a semantic layer that powers a digital twin (a virtual model of the physical system), allowing it to simulate realistic, regulation-aware planning scenarios. We evaluate LSDTs through a case study of offshore wind farm planning in Maryland, including its application during Hurricane Sandy. Results demonstrate that LSDTs support interpretable, regulation-aware layout optimization, enable high-fidelity simulation, and enhance adaptability in infrastructure planning. This work shows the potential of combining generative AI with digital twins to support complex, knowledge-driven planning tasks.
[AI-142] Geometry-Aware Spiking Graph Neural Network
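Quick read: This paper addresses the limitations of existing Spiking Graph Neural Networks (SGNNs) on complex graph structures: they operate mainly in Euclidean space and rely on fixed geometric assumptions, which makes it hard to model non-Euclidean graphs with hierarchical structure and cyclic topology. The key to the solution is the Geometry-Aware Spiking Graph Neural Network (GSG), whose core innovations are: 1) a Riemannian embedding layer that maps node features into constant-curvature manifolds to capture non-Euclidean structure; 2) a manifold spiking layer that models membrane potential evolution and spiking behavior in curved spaces through geometry-consistent neighbor aggregation and curvature-aware attention; and 3) a manifold learning objective that achieves instance-wise geometry adaptation by jointly optimizing classification and link prediction losses defined over geodesic distances. The whole model is trained with Riemannian SGD, without backpropagation through time, which improves accuracy and robustness while retaining high energy efficiency.
Link: https://arxiv.org/abs/2508.06793
Authors: Bowen Zhang, Genan Dai, Hu Huang, Long Lan
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: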
Abstract:Graph Neural Networks (GNNs) have demonstrated impressive capabilities in modeling graph-structured data, while Spiking Neural Networks (SNNs) offer high energy efficiency through sparse, event-driven computation. However, existing spiking GNNs predominantly operate in Euclidean space and rely on fixed geometric assumptions, limiting their capacity to model complex graph structures such as hierarchies and cycles. To overcome these limitations, we propose GSG, a novel Geometry-Aware Spiking Graph Neural Network that unifies spike-based neural dynamics with adaptive representation learning on Riemannian manifolds. GSG features three key components: a Riemannian Embedding Layer that projects node features into a pool of constant-curvature manifolds, capturing non-Euclidean structures; a Manifold Spiking Layer that models membrane potential evolution and spiking behavior in curved spaces via geometry-consistent neighbor aggregation and curvature-based attention; and a Manifold Learning Objective that enables instance-wise geometry adaptation through jointly optimized classification and link prediction losses defined over geodesic distances. All modules are trained using Riemannian SGD, eliminating the need for backpropagation through time. Extensive experiments on multiple benchmarks show that GSG achieves superior accuracy, robustness, and energy efficiency compared to both Euclidean SNNs and manifold-based GNNs, establishing a new paradigm for curvature-aware, energy-efficient graph learning.
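To make "losses defined over geodesic distances" concrete, the sketch below computes the geodesic distance in the Poincaré ball of curvature -1, one well-known constant-curvature manifold. Whether GSG uses this particular model, and how it combines several manifolds, is not stated in the abstract, so treat the choice of manifold and the name `poincare_distance` as assumptions.

```python
import numpy as np


def poincare_distance(x, y, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball (curvature -1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sq_dist = np.sum((x - y) ** 2)
    # The denominator shrinks toward 0 as points approach the boundary of the ball.
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_dist / max(denom, eps)))
```

Distances grow without bound near the boundary of the ball, which is what lets hyperbolic embeddings represent tree-like hierarchies compactly.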
[AI-143] Mode-Aware Non-Linear Tucker Autoencoder for Tensor-based Unsupervised Learning
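Quick read: This paper addresses the challenges that high-dimensional data, especially high-order tensors, pose for self-supervised learning. Conventional MLP-based autoencoders (AEs) rely on flattening, which aggravates the curse of dimensionality and leads to oversized models, high computational cost, and difficult optimization when capturing deep structural features. Existing tensor networks ease the computational burden through tensor decomposition, but most have limited ability to learn non-linear relationships. The key to the solution is the Mode-Aware Non-linear Tucker Autoencoder (MA-NTAE), which generalizes classical Tucker decomposition to a non-linear framework and introduces a Pick-and-Unfold strategy: recursive unfold-encode-fold operations encode each mode of a high-order tensor flexibly and integrate tensor structural priors. Its computational complexity grows linearly with tensor order and proportionally with mode dimensions. It outperforms standard AEs and current tensor networks on compression and clustering tasks, and the advantage grows for higher-order, higher-dimensional tensors.
Link: https://arxiv.org/abs/2508.06784
Authors: Junjing Zheng, Chengliang Song, Weidong Jiang, Xinyu Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: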
Abstract:High-dimensional data, particularly in the form of high-order tensors, presents a major challenge in self-supervised learning. While MLP-based autoencoders (AE) are commonly employed, their dependence on flattening operations exacerbates the curse of dimensionality, leading to excessively large model sizes, high computational overhead, and challenging optimization for deep structural feature capture. Although existing tensor networks alleviate computational burdens through tensor decomposition techniques, most exhibit limited capability in learning non-linear relationships. To overcome these limitations, we introduce the Mode-Aware Non-linear Tucker Autoencoder (MA-NTAE). MA-NTAE generalizes classical Tucker decomposition to a non-linear framework and employs a Pick-and-Unfold strategy, facilitating flexible per-mode encoding of high-order tensors via recursive unfold-encode-fold operations, effectively integrating tensor structural priors. Notably, MA-NTAE exhibits linear growth in computational complexity with tensor order and proportional growth with mode dimensions. Extensive experiments demonstrate MA-NTAE’s performance advantages over standard AE and current tensor networks in compression and clustering tasks, which become increasingly pronounced for higher-order, higher-dimensional tensors.
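The unfold-encode-fold pattern named in the abstract can be sketched with plain NumPy. The mode-n unfolding and folding below are standard tensor operations; the per-mode encoder (a single `tanh` layer with a weight matrix) and the helper names `unfold`, `fold`, and `encode_mode` are illustrative assumptions rather than the MA-NTAE architecture.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the remaining axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def fold(matrix, mode, shape):
    """Inverse of `unfold` for a tensor whose full shape is `shape`."""
    front_shape = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape(front_shape), 0, mode)


def encode_mode(tensor, mode, weight):
    """Unfold along `mode`, apply a small non-linear map, then fold back.

    `weight` has shape (new_dim, tensor.shape[mode]); only the chosen mode is compressed.
    """
    encoded = np.tanh(weight @ unfold(tensor, mode))
    new_shape = list(tensor.shape)
    new_shape[mode] = weight.shape[0]
    return fold(encoded, mode, new_shape)
```

Applying `encode_mode` once per mode needs one small weight matrix per mode instead of one large matrix over the flattened tensor, which is the intuition behind cost that grows linearly with tensor order.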
[AI-144] PROPS: Progressively Private Self-alignment of Large Language Models
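Quick read: This paper addresses privacy leakage when aligning Large Language Models (LLMs) with human feedback, in particular the risk that preference labels reveal a labeler's personal values, beliefs, and personality traits. Existing methods such as Differentially Private SGD (DP-SGD) offer rigorous privacy guarantees, but privatizing gradients provides more privacy than necessary, costs model utility, and is not tailored to preference-level privacy. The key to the solution is PROPS (PROgressively Private Self-alignment), a multi-stage privacy-preserving alignment framework in which privately aligned models from earlier stages serve as labelers that supplement the training data in later stages, so that privacy and utility are improved together. For the same privacy budget, PROPS achieves up to 3x higher win rates than DP-SGD and up to 2.5x higher win rates than Randomized Response (RR) based alignment.
Link: https://arxiv.org/abs/2508.06783
Authors: Noel Teku, Fengwei Tian, Payel Bhattacharjee, Souradip Chakraborty, Amrit Singh Bedi, Ravi Tandon
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Theory (cs.IT)
Comments: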
Abstract:Alignment is a key step in developing Large Language Models (LLMs) using human feedback to ensure adherence to human values and societal norms. Dependence on human feedback raises privacy concerns about how much a labeler’s preferences may reveal about their personal values, beliefs, and personality traits. Existing approaches, such as Differentially Private SGD (DP-SGD), provide rigorous privacy guarantees by privatizing gradients during fine-tuning and alignment but can provide more privacy than necessary as human preferences are tied only to labels of (prompt, response) pairs and can degrade model utility. This work focuses on LLM alignment with preference-level privacy, which preserves the privacy of preference labels provided by humans. We propose PROPS (PROgressively Private Self-alignment), a multi-stage privacy preserving alignment framework where privately aligned models in previous stages can serve as labelers for supplementing training data in the subsequent stages of alignment. We present theoretical guarantees for PROPS as well as comprehensive validation using multiple models (Pythia and GPT) and datasets (AlpacaEval, Anthropic HH-RLHF, truthy-dpo-v0.1) to demonstrate the utility of PROPS over existing methods while still providing high privacy. For the same privacy budget, alignment via PROPS can achieve up to 3x higher win-rates compared to DP-SGD, and 2.5x higher win-rates compared to Randomized Response (RR) based alignment.
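For readers unfamiliar with the Randomized Response (RR) baseline mentioned in the abstract, here is a minimal sketch of binary RR applied to a preference label. It illustrates the baseline mechanism only, not PROPS itself. The function names and the debiasing helper are illustrative assumptions.

```python
import math
import random


def randomized_response(label, epsilon, rng=random):
    """Report a binary preference label truthfully with probability e^eps / (1 + e^eps)."""
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return label if rng.random() < keep_prob else 1 - label


def debias_mean(noisy_mean, epsilon):
    """Unbiased estimate of the true fraction of 1-labels from the noisy fraction."""
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    if keep_prob == 0.5:
        raise ValueError("epsilon = 0 destroys all information; the mean cannot be recovered")
    # E[noisy] = (2p - 1) * true + (1 - p), solved for the true mean.
    return (noisy_mean - (1.0 - keep_prob)) / (2.0 * keep_prob - 1.0)
```

Smaller `epsilon` means more label flips and stronger privacy, at the cost of noisier preference data for alignment.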
[AI-145] BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation
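Quick read: This paper addresses a limitation of neural sentence embedding models for dense retrieval: they depend on binary relevance labels and therefore cannot capture the continuum on which real-world relevance lies. The key to the solution is BiXSE, which treats fine-grained graded relevance scores generated by a large language model (LLM) as probabilistic targets and optimizes binary cross-entropy (BCE) with a pointwise training strategy, obtaining granular supervision from a single labeled query-document pair per query. By leveraging in-batch negatives it reduces annotation and compute costs, consistently outperforms softmax-based contrastive learning (InfoNCE), and matches or exceeds strong pairwise ranking baselines on multiple benchmarks.
Link: https://arxiv.org/abs/2508.06781
Authors: Christos Tsirigotis, Vaibhav Adlakha, Joao Monteiro, Aaron Courville, Perouz Taslakian
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages, 5 figures, accepted at COLM 2025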
Abstract:Neural sentence embedding models for dense retrieval typically rely on binary relevance labels, treating query-document pairs as either relevant or irrelevant. However, real-world relevance often exists on a continuum, and recent advances in large language models (LLMs) have made it feasible to scale the generation of fine-grained graded relevance labels. In this work, we propose BiXSE, a simple and effective pointwise training method that optimizes binary cross-entropy (BCE) over LLM-generated graded relevance scores. BiXSE interprets these scores as probabilistic targets, enabling granular supervision from a single labeled query-document pair per query. Unlike pairwise or listwise losses that require multiple annotated comparisons per query, BiXSE achieves strong performance with reduced annotation and compute costs by leveraging in-batch negatives. Extensive experiments across sentence embedding (MMTEB) and retrieval benchmarks (BEIR, TREC-DL) show that BiXSE consistently outperforms softmax-based contrastive learning (InfoNCE), and matches or exceeds strong pairwise ranking baselines when trained on LLM-supervised data. BiXSE offers a robust, scalable alternative for training dense retrieval models as graded relevance supervision becomes increasingly accessible.
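The idea of optimizing BCE against graded scores with in-batch negatives can be sketched as below. This is one plausible reading of the abstract, not the paper's exact loss: the cosine-similarity logits, the `scale` parameter, and the choice of target 0 for every off-diagonal (in-batch negative) pair are assumptions.

```python
import numpy as np


def bce_with_logits(logits, targets):
    """Numerically stable element-wise binary cross-entropy on raw logits."""
    return np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))


def graded_bce_loss(query_emb, doc_emb, graded_scores, scale=10.0):
    """Pointwise BCE where row i of `doc_emb` is the labeled document for query i.

    graded_scores[i] in [0, 1] is the soft target for pair (i, i);
    every other document in the batch is treated as a negative (target 0).
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = scale * (q @ d.T)
    targets = np.diag(np.asarray(graded_scores, dtype=float))
    return float(bce_with_logits(logits, targets).mean())
```

Unlike a softmax over the batch, each query-document cell is scored independently here, so a graded target such as 0.6 is a meaningful probability rather than a share of a distribution.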
[AI-146] Zero-Direction Probing: A Linear-Algebraic Framework for Deep Analysis of Large-Language-Model Drift
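Quick read: This paper addresses the detection of model drift, namely how to monitor latent changes in the layer activations of deep neural networks (Transformers in particular) without task labels or output evaluations. The key to the solution is Zero-Direction Probing (ZDP), a theory-only framework built on the right/left null spaces of the activation matrices of Transformer layers and their Fisher information geometry. Through a series of theoretical results (such as the Variance–Leak Theorem and Fisher Null-Conservation), the authors derive a computable Spectral Null-Leakage (SNL) metric with non-asymptotic tail bounds and a concentration inequality, which yields a-priori thresholds for drift detection and makes representational change testable and quantifiable.
Link: https://arxiv.org/abs/2508.06776
Authors: Amit Pandey
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 14 pages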
Abstract:We present Zero-Direction Probing (ZDP), a theory-only framework for detecting model drift from null directions of transformer activations without task labels or output evaluations. Under assumptions A1–A6, we prove: (i) the Variance–Leak Theorem, (ii) Fisher Null-Conservation, (iii) a Rank–Leak bound for low-rank updates, and (iv) a logarithmic-regret guarantee for online null-space trackers. We derive a Spectral Null-Leakage (SNL) metric with non-asymptotic tail bounds and a concentration inequality, yielding a-priori thresholds for drift under a Gaussian null model. These results show that monitoring right/left null spaces of layer activations and their Fisher geometry provides concrete, testable guarantees on representational change.
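The intuition of watching null directions for drift can be sketched numerically. The code below estimates the right null space of a reference activation matrix with an SVD and reports how much energy of new activations falls into it. This simple energy ratio is an illustrative stand-in and is not the paper's SNL metric; `right_null_basis` and `null_leakage` are hypothetical names.

```python
import numpy as np


def right_null_basis(acts, tol=1e-10):
    """Orthonormal basis of directions v with acts @ v ~ 0; acts has shape (samples, dim)."""
    _, s, vt = np.linalg.svd(acts, full_matrices=True)
    rank = int(np.sum(s > tol * s.max())) if s.size and s.max() > 0 else 0
    return vt[rank:].T  # shape (dim, dim - rank)


def null_leakage(new_acts, null_basis):
    """Fraction of the energy of `new_acts` lying in the reference null space."""
    total = float(np.sum(new_acts ** 2))
    if total == 0.0:
        return 0.0
    leaked = new_acts @ null_basis
    return float(np.sum(leaked ** 2)) / total
```

A leakage near 0 means the new activations stay in the subspace the reference model used; a value that rises after fine-tuning signals representational change without needing any labels.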
[AI-147] PANAMA: A Network-Aware MARL Framework for Multi-Agent Path Finding in Digital Twin Ecosystems
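Quick read: This paper addresses efficient data sharing and network-aware decision-making for Multi-Agent Path Finding (MAPF) in Digital Twin (DT) ecosystems, in particular how to achieve low-latency, robust autonomous task execution as robotics and automated systems scale. The key to the solution is the PANAMA algorithm, whose core innovation is a Priority Asymmetry mechanism combined with a Centralized Training with Decentralized Execution (CTDE) framework and asynchronous actor-learner architectures. This gives Multi-Agent Reinforcement Learning (MARL) based path finding in wireless network environments better accuracy, speed, and scalability, and connects network-aware agent coordination with DT-driven automation.
Link: https://arxiv.org/abs/2508.06767
Authors: Arman Dogru, R. Irem Bor-Yaliniz, Nimal Gamini Senarath
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: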
Abstract:Digital Twins (DTs) are transforming industries through advanced data processing and analysis, positioning the world of DTs, the Digital World, as a cornerstone of next-generation technologies including embodied AI. As robotics and automated systems scale, efficient data-sharing frameworks and robust algorithms become critical. We explore the pivotal role of data handling in next-gen networks, focusing on dynamics between application and network providers (AP/NP) in DT ecosystems. We introduce PANAMA, a novel algorithm with Priority Asymmetry for Network Aware Multi-agent Reinforcement Learning (MARL) based multi-agent path finding (MAPF). By adopting a Centralized Training with Decentralized Execution (CTDE) framework and asynchronous actor-learner architectures, PANAMA accelerates training while enabling autonomous task execution by embodied AI. Our approach demonstrates superior pathfinding performance in accuracy, speed, and scalability compared to existing benchmarks. Through simulations, we highlight optimized data-sharing strategies for scalable, automated systems, ensuring resilience in complex, real-world environments. PANAMA bridges the gap between network-aware decision-making and robust multi-agent coordination, advancing the synergy between DTs, wireless networks, and AI-driven automation.
[AI-148] A Fuzzy Logic Prompting Framework for Large Language Models in Adaptive and Uncertain Tasks
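Quick read: This paper addresses the lack of safety and adaptivity of Large Language Models (LLMs) in dynamic, user-centered tasks, especially how to achieve interpretable, goal-aligned model behavior that adjusts in real time in interaction-heavy settings such as education. The key to the solution is a modular prompting framework that combines a natural language boundary prompt with a control schema based on fuzzy scaffolding logic and adaptation rules, so that an LLM can modulate its behavior according to user state without fine-tuning or external orchestration, improving scaffolding quality, adaptivity, and instructional alignment.
Link: https://arxiv.org/abs/2508.06754
Authors: Vanessa Figueiredo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: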
Abstract:We introduce a modular prompting framework that supports safer and more adaptive use of large language models (LLMs) across dynamic, user-centered tasks. Grounded in human learning theory, particularly the Zone of Proximal Development (ZPD), our method combines a natural language boundary prompt with a control schema encoded with fuzzy scaffolding logic and adaptation rules. This architecture enables LLMs to modulate behavior in response to user state without requiring fine-tuning or external orchestration. In a simulated intelligent tutoring setting, the framework improves scaffolding quality, adaptivity, and instructional alignment across multiple models, outperforming standard prompting baselines. Evaluation is conducted using rubric-based LLM graders at scale. While initially developed for education, the framework has shown promise in other interaction-heavy domains, such as procedural content generation for games. Designed for safe deployment, it provides a reusable methodology for structuring interpretable, goal-aligned LLM behavior in uncertain or evolving contexts.
[AI-149] Pushing the Envelope of LLM Inference on AI-PC
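Quick read: This paper addresses the underexplored computational efficiency of state-of-the-art (SOTA) inference runtimes when running ultra-low-bit LLMs in resource-constrained environments such as edge devices and AI PCs. The key to the solution is a bottom-up design: 1-bit and 2-bit microkernels optimized for modern CPUs are implemented and integrated into PyTorch-TPP, a state-of-the-art LLM inference framework. Experiments show that end-to-end inference with 2-bit models is up to 2.2x faster than the current SOTA runtime and up to 7x faster than 16-bit model inference, advancing efficient deployment of ultra-low-bit LLMs on AI PCs and edge devices.
Link: https://arxiv.org/abs/2508.06753
Authors: Evangelos Georganas, Dhiraj Kalamkar, Alexander Heinecke
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Comments: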
Abstract:The advent of ultra-low-bit LLM models (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts using the same model size, is ushering in a new era of LLM inference for resource-constrained environments such as edge devices and AI PCs. While these quantization advances promise models that are more cost-effective in terms of latency, memory, throughput, and energy consumption, the computational efficiency of state-of-the-art (SOTA) inference runtimes (e.g., this http URL) used to deploy them remains underexplored. In this work, we take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency across a variety of CPU platforms. We integrate these microkernels into a state-of-the-art LLM inference framework, namely PyTorch-TPP, and present end-to-end inference results with 2-bit models that outperform the current SOTA runtime this http URL by up to 2.2x, and deliver up to 7x speedup compared to the 16-bit model inference. Our optimized runtime advances the state of LLM inference on AI PCs and edge devices, paving the way for efficient deployment of ultra-low-bit LLM models.
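To show why 2-bit weights cut memory traffic, the sketch below packs four 2-bit codes into one byte and unpacks them again. The bit layout (lowest bits first) and the function names are assumptions chosen for illustration; the paper's microkernels use their own CPU-specific layouts, which the abstract does not describe.

```python
import numpy as np


def pack_2bit(codes):
    """Pack 2-bit codes (values 0..3) into bytes, four codes per byte, lowest bits first."""
    codes = np.asarray(codes, dtype=np.uint8)
    if codes.size % 4 != 0:
        raise ValueError("number of codes must be a multiple of 4")
    if codes.size and codes.max() > 3:
        raise ValueError("codes must fit in 2 bits")
    c = codes.reshape(-1, 4)
    return (c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)).astype(np.uint8)


def unpack_2bit(packed):
    """Recover the 2-bit codes from bytes produced by `pack_2bit`."""
    packed = np.asarray(packed, dtype=np.uint8)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return ((packed[:, None] >> shifts) & 0b11).reshape(-1)
```

Four weights per byte is an 8x reduction relative to 16-bit storage, so a memory-bound matrix-vector product has far fewer bytes to stream; the real engineering effort lies in doing the unpacking and accumulation efficiently in registers.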
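[AI-150] Topology Generation of UAV Covert Communication Networks: A Graph Diffusion Approach with Incentive Mechanism
Quick read: This paper addresses the difficulty of ensuring reliable connectivity and covert communication for Uncrewed Aerial Vehicle (UAV) networks in sensitive applications such as urban monitoring, emergency response, and secure sensing, in particular the challenges posed by dynamic mobility and exposure risk. The key to the solution is a self-organizing UAV network framework that combines topology generation based on Graph Diffusion-based Policy Optimization (GDPO) with an incentive mechanism based on a Stackelberg Game (SG). GDPO uses generative AI to dynamically generate sparse but well-connected topologies that adapt to changing node distributions and Ground User (GU) demands. The SG incentive mechanism guides self-interested UAVs to choose relay behaviors and neighbor links that support cooperation and improve covert communication, strengthening network robustness and covertness together.
Link: https://arxiv.org/abs/2508.06746
Authors: Xin Tang, Qian Chen, Fengshun Li, Youchun Gong, Yinqiu Liu, Wen Tian, Shaowen Qin, Xiaohuan Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: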
Abstract:With the growing demand for Uncrewed Aerial Vehicle (UAV) networks in sensitive applications, such as urban monitoring, emergency response, and secure sensing, ensuring reliable connectivity and covert communication has become increasingly vital. However, dynamic mobility and exposure risks pose significant challenges. To tackle these challenges, this paper proposes a self-organizing UAV network framework combining Graph Diffusion-based Policy Optimization (GDPO) with a Stackelberg Game (SG)-based incentive mechanism. The GDPO method uses generative AI to dynamically generate sparse but well-connected topologies, enabling flexible adaptation to changing node distributions and Ground User (GU) demands. Meanwhile, the Stackelberg Game (SG)-based incentive mechanism guides self-interested UAVs to choose relay behaviors and neighbor links that support cooperation and enhance covert communication. Extensive experiments are conducted to validate the effectiveness of the proposed framework in terms of model convergence, topology generation quality, and enhancement of covert communication performance.
[AI-151] Analysis of Schedule-Free Nonconvex Optimization
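Quick read: This paper addresses the convergence analysis of first-order methods in large-scale learning, in particular that classical guarantees depend on step-size schedules requiring the total number of iterations $T$ in advance, whereas $T$ is rarely known in practice. It proposes a robust Lyapunov analysis framework that, assuming only that the objective is $L$-smooth and bounded below, reduces the nonconvex convergence analysis of the Schedule-Free (SF) method to a single-step descent inequality. Building on SF's interpolation between Polyak–Ruppert (PR) averaging and momentum, the framework yields horizon-agnostic (that is, independent of $T$) convergence rates in the nonconvex setting: $O(1/\log T)$ for a constant step size with PR averaging, $O(\log T / T)$ for a linearly growing step size, and $O(T^{-(1-\alpha)})$ for polynomial averaging. Performance Estimation Problem (PEP) experiments numerically validate these rates and suggest that the $O(1/\log T)$ bound for the original SF algorithm may tighten to $O(1/T)$.
Link: https://arxiv.org/abs/2508.06743
Authors: Connor Brown
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: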
Abstract:First-order methods underpin most large-scale learning algorithms, yet their classical convergence guarantees hinge on carefully scheduled step-sizes that depend on the total horizon $T$, which is rarely known in advance. The Schedule-Free (SF) method promises optimal performance with hyperparameters that are independent of $T$ by interpolating between Polyak–Ruppert averaging and momentum, but nonconvex analysis of SF has been limited or reliant on strong global assumptions. We introduce a robust Lyapunov framework that, under only $L$-smoothness and lower-boundedness, reduces SF analysis to a single-step descent inequality. This yields horizon-agnostic bounds in the nonconvex setting: $O(1/\log T)$ for constant step + PR averaging, $O(\log T/T)$ for a linearly growing step-size, and a continuum of $O(T^{-(1-\alpha)})$ rates for polynomial averaging. We complement these proofs with Performance Estimation Problem (PEP) experiments that numerically validate our rates and suggest that our $O(1/\log T)$ bound on the original nonconvex SF algorithm may tighten to $O(1/T)$. Our work extends SF’s horizon-free guarantees to smooth nonconvex optimization and charts future directions for optimal nonconvex rates.
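The interpolation between averaging and momentum can be made concrete with the Schedule-Free SGD recursion as it is commonly stated: a gradient step on a base sequence `z`, a running average `x`, and gradients evaluated at an interpolated point `y`. The uniform averaging weight `1/(t+1)` and the constant step size used below are one simple setting and an assumption here; the paper analyzes several step-size and averaging choices that this sketch does not cover.

```python
import numpy as np


def schedule_free_sgd(grad, x0, step_size, beta=0.9, num_steps=100):
    """Schedule-Free SGD sketch with uniform averaging and no horizon-dependent schedule.

    grad: function returning the gradient at a point
    beta: interpolation weight; beta = 0 evaluates gradients at the base iterate z,
          beta = 1 evaluates them at the running average x
    """
    x = np.array(x0, dtype=float)  # averaged iterate (the one that is returned)
    z = x.copy()                   # base gradient-descent iterate
    for t in range(1, num_steps + 1):
        y = (1.0 - beta) * z + beta * x   # point where the gradient is evaluated
        z = z - step_size * grad(y)       # plain gradient step on z
        c = 1.0 / (t + 1)                 # uniform averaging weight
        x = (1.0 - c) * x + c * z         # running average of the z sequence
    return x
```

Nothing in the loop depends on `num_steps`, so the run can be stopped at any iteration, which is the property the horizon-agnostic bounds describe.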
[AI-152] Learning Causal Structure Distributions for Robust Planning
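Quick read: This paper addresses the robustness and computational efficiency of dynamics modeling for robotic systems, targeting the shortcoming that common model-learning methods ignore causal structure and fail to exploit the sparsity of interactions in robotic systems. The key to the solution is to estimate a causal structure distribution and sample causal graphs from it to inform the latent-space representations of an encoder-multidecoder probabilistic model, so that uncertainty about the structure is accounted for while the functional relationships are learned. This yields more robust dynamics models at significantly lower computational cost, lets the model adapt to new environments and tasks, and increases tolerance to corrupted inputs and environmental changes.
Link: https://arxiv.org/abs/2508.06742
Authors: Alejandro Murillo-Gonzalez, Junhong Xu, Lantao Liu
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: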
Abstract:Structural causal models describe how the components of a robotic system interact. They provide both structural and functional information about the relationships that are present in the system. The structural information outlines the variables among which there is interaction. The functional information describes how such interactions work, via equations or learned models. In this paper we find that learning the functional relationships while accounting for the uncertainty about the structural information leads to more robust dynamics models, which improves downstream planning while using significantly lower computational resources. This is in contrast with common model-learning methods that ignore the causal structure and fail to leverage the sparsity of interactions in robotic systems. We achieve this by estimating a causal structure distribution that is used to sample causal graphs that inform the latent-space representations in an encoder-multidecoder probabilistic model. We show that our model can be used to learn the dynamics of a robot, which together with a sampling-based planner can be used to perform new tasks in novel environments, provided an objective function for the new requirement is available. We validate our method using manipulators and mobile robots in both simulation and the real world. Additionally, we validate the learned dynamics’ adaptability and increased robustness to corrupted inputs and changes in the environment, which is highly desirable in challenging real-world robotics scenarios. Video: this https URL.
[AI-153] ParBalans: Parallel Multi-Armed Bandits-based Adaptive Large Neighborhood Search
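Quick read: This paper addresses the heavy computational cost and low solving efficiency of Mixed-Integer Programming (MIP). Because of the combinatorial nature of MIP, large and complex instances are hard to handle efficiently, so stronger parallelization is needed to accelerate solving and improve scalability. The key to the solution is ParBalans, a parallel extension of Balans, a multi-armed bandits-based adaptive large neighborhood search algorithm. It exploits both solver-level and algorithmic-level parallelism and makes use of the parallel exploration of diverse parameter configurations that Balans's modular architecture already supports, achieving performance on hard MIP benchmarks that is competitive with the state-of-the-art commercial solver Gurobi.
Link: https://arxiv.org/abs/2508.06736
Authors: Alican Yilmaz, Junyang Cai, Serdar Kadioglu, Bistra Dilkina
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: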
Abstract:Solving Mixed-Integer Programming (MIP) problems often requires substantial computational resources due to their combinatorial nature. Parallelization has emerged as a critical strategy to accelerate solution times and enhance scalability to tackle large, complex instances. This paper investigates the parallelization capabilities of Balans, a recently proposed multi-armed bandits-based adaptive large neighborhood search for MIPs. While Balans’s modular architecture inherently supports parallel exploration of diverse parameter configurations, this potential has not been thoroughly examined. To address this gap, we introduce ParBalans, an extension that leverages both solver-level and algorithmic-level parallelism to improve performance on challenging MIP instances. Our experimental results demonstrate that ParBalans exhibits competitive performance compared to the state-of-the-art commercial solver Gurobi, particularly on hard optimization benchmarks.
[AI-154] GLIDR: Graph-Like Inductive Logic Programming with Differentiable Reasoning
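Quick read: This paper addresses the limited performance and interpretability of existing differentiable Inductive Logic Programming (ILP) methods on knowledge graph completion, which stem from the common assumption of chain-like rule structure. The key to the solution is GLIDR, a differentiable rule learning method that supports more expressive syntax such as branches and cycles. It uses a differentiable message passing inference algorithm that generalizes chain-like rule learning, with a rule search space parameterized by a limit on the maximum number of free variables in a rule. Explicit logic rules can be extracted from the model weights for use with symbolic solvers. GLIDR is highly robust to training data noise and can be chained with deep neural networks and optimized end-to-end for rule learning on arbitrary data modalities.
Link: https://arxiv.org/abs/2508.06716
Authors: Blair Johnson, Clayton Kerce, Faramarz Fekri
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: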
Abstract:Differentiable inductive logic programming (ILP) techniques have proven effective at finding approximate rule-based solutions to link prediction and node classification problems on knowledge graphs; however, the common assumption of chain-like rule structure can hamper the performance and interpretability of existing approaches. We introduce GLIDR, a differentiable rule learning method that models the inference of logic rules with more expressive syntax than previous methods. GLIDR uses a differentiable message passing inference algorithm that generalizes previous chain-like rule learning methods to allow rules with features like branches and cycles. GLIDR has a simple and expressive rule search space which is parameterized by a limit on the maximum number of free variables that may be included in a rule. Explicit logic rules can be extracted from the weights of a GLIDR model for use with symbolic solvers. We demonstrate that GLIDR can significantly outperform existing rule learning methods on knowledge graph completion tasks and even compete with embedding methods despite the inherent disadvantage of being a structure-only prediction method. We show that rules extracted from GLIDR retain significant predictive performance, and that GLIDR is highly robust to training data noise. Finally, we demonstrate that GLIDR can be chained with deep neural networks and optimized end-to-end for rule learning on arbitrary data modalities.
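As background for the "chain-like rule" assumption that GLIDR relaxes, the sketch below scores a chain rule such as `head(x, y) <- r1(x, z), r2(z, y)` by multiplying relation adjacency matrices. It illustrates the chain-like baseline family only, not GLIDR's message passing; the function name, the relation names, and the clipping step are assumptions.

```python
import numpy as np


def chain_rule_scores(adjacency, relation_path):
    """Score all (x, y) pairs for the chain rule r_1(x, z_1), r_2(z_1, z_2), ..., r_k(z_{k-1}, y).

    adjacency: dict mapping relation name -> (n, n) matrix with entries in [0, 1]
    relation_path: list of relation names forming the rule body, in order
    """
    n = adjacency[relation_path[0]].shape[0]
    scores = np.eye(n)
    for relation in relation_path:
        # Each product sums over the shared intermediate variable z.
        scores = scores @ adjacency[relation]
    return np.clip(scores, 0.0, 1.0)
```

A matrix product can only express a path from `x` to `y`; rule bodies with branches or cycles do not fit this form, which is the gap that a more general message passing scheme is meant to close.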
[AI-155] Probabilistic Circuits for Knowledge Graph Completion with Reduced Rule Sets
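Quick read: This paper addresses the loss of explainability in rule-based Knowledge Graph Completion (KGC) caused by very large rule sets. Rule-based methods are explainable, but reaching competitive performance often requires a significantly large number of rules, which undermines that advantage. The key to the solution is to discover meaningful rule contexts (subsets of rules that work together) from training data and to model them with a learned probability distribution implemented as Probabilistic Circuits, reaching the performance of the full rule set more rapidly with far fewer rules. The approach reduces the number of rules by 70-96%, outperforms the baseline by up to 31x when using an equivalent minimal number of rules, and preserves 91% of peak baseline performance even when its minimal rule sets are compared against the baseline's full rule sets. The framework is grounded in the semantics of probabilistic logic, requires no independence assumptions, and its tractable inference provides both approximate lower bounds and exact query probabilities. It is validated on 8 standard benchmark datasets.
Link: https://arxiv.org/abs/2508.06706
Authors: Jaikrishna Manojkumar Patil, Nathaniel Lee, Al Mehdi Saadat Chowdhury, YooJung Choi, Paulo Shakarian
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: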
Abstract:Rule-based methods for knowledge graph completion provide explainable results but often require a significantly large number of rules to achieve competitive performance. This can hinder explainability due to overwhelmingly large rule sets. We discover rule contexts (meaningful subsets of rules that work together) from training data and use a learned probability distribution (i.e., probabilistic circuits) over these rule contexts to more rapidly achieve performance of the full rule set. Our approach achieves a 70-96% reduction in number of rules used while outperforming baseline by up to $31\times$ when using equivalent minimal number of rules and preserves 91% of peak baseline performance even when comparing our minimal rule sets against baseline’s full rule sets. We show that our framework is grounded in well-known semantics of probabilistic logic, does not require independence assumptions, and that our tractable inference procedure provides both approximate lower bounds and exact probability of a given query. The efficacy of our method is validated by empirical studies on 8 standard benchmark datasets where we show competitive performance by using only a fraction of the rules required by AnyBURL’s standard inference method, the current state-of-the-art for rule-based knowledge graph completion. This work may have further implications for general probabilistic reasoning over learned sets of rules.
[AI-156] Zero-Shot Cellular Trajectory Map Matching
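Quick read: This paper addresses Zero-shot Cellular Trajectory Map-Matching (CTMM): accurately aligning cellular location sequences to the road network without additional training in the target region, in support of location-based services such as navigation and route optimization. Existing methods rely on ID-based features and region-specific data and adapt poorly to unexplored areas. The key to the solution is a pixel-based trajectory calibration assistant that uses transferable geospatial knowledge to calibrate pixelated trajectories and guide path finding at the road network level. A Gaussian mixture model incorporated into a VAE identifies scenario-adaptive experts, and a spatial-temporal awareness module captures sequential features and location uncertainty to improve user position estimates. Finally, a constrained path-finding algorithm reconstructs the road ID sequence, ensuring topological validity while optimizing for the shortest feasible path and minimizing unnecessary detours.
Link: https://arxiv.org/abs/2508.06674
Authors: Weijie Shi, Yue Cui, Hao Chen, Jiaming Li, Mengze Li, Jia Zhu, Jiajie Xu, Xiaofang Zhou
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: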
Abstract:Cellular Trajectory Map-Matching (CTMM) aims to align cellular location sequences to road networks, which is a necessary preprocessing step in location-based services on web platforms like Google Maps, including navigation and route optimization. Current approaches mainly rely on ID-based features and region-specific data to learn correlations between cell towers and roads, limiting their adaptability to unexplored areas. To enable high-accuracy CTMM without additional training in target regions, Zero-shot CTMM requires extracting not only region-adaptive features, but also sequential and location uncertainty to alleviate positioning errors in cellular data. In this paper, we propose a pixel-based trajectory calibration assistant for zero-shot CTMM, which takes advantage of transferable geospatial knowledge to calibrate the pixelated trajectory and then guide the path-finding process at the road network level. To enhance knowledge sharing across similar regions, a Gaussian mixture model is incorporated into VAE, enabling the identification of scenario-adaptive experts through soft clustering. To mitigate high positioning errors, a spatial-temporal awareness module is designed to capture sequential features and location uncertainty, thereby facilitating the inference of approximate user positions. Finally, a constrained path-finding algorithm is employed to reconstruct the road ID sequence, ensuring topological validity within the road network. This process is guided by the calibrated trajectory while optimizing for the shortest feasible path, thus minimizing unnecessary detours. Extensive experiments demonstrate that our model outperforms existing methods in zero-shot CTMM by 16.8%.
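[AI-157] Formal Concept Analysis: a Structural Framework for Variability Extraction and Analysis
Quick read: This paper asks which properties of Formal Concept Analysis (FCA) can be leveraged for variability-related tasks, and how they can be used to interpret the variability information contained in conceptual structures. The key contribution is a selection of FCA properties that are essential to variability analysis, together with an account of how each is applied when reading commonalities and variabilities off the resulting conceptual structures. This narrows the gap between FCA's mathematically oriented literature and the practical needs of variability modeling.
Link: https://arxiv.org/abs/2508.06668
Authors: Jessie Galasso
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Software Engineering (cs.SE)
Comments: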
Abstract:Formal Concept Analysis (FCA) is a mathematical framework for knowledge representation and discovery. It performs a hierarchical clustering over a set of objects described by attributes, resulting in conceptual structures in which objects are organized depending on the attributes they share. These conceptual structures naturally highlight commonalities and variabilities among similar objects by categorizing them into groups which are then arranged by similarity, making it particularly appropriate for variability extraction and analysis. Despite the potential of FCA, determining which of its properties can be leveraged for variability-related tasks (and how) is not always straightforward, partly due to the mathematical orientation of its foundational literature. This paper attempts to bridge part of this gap by gathering a selection of properties of the framework which are essential to variability analysis, and how they can be used to interpret diverse variability information within the resulting conceptual structures.
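As a hedged illustration of the conceptual structures the abstract describes, the sketch below enumerates the formal concepts of a tiny object-attribute context using the two standard derivation operators. The context, the product and feature names, and the function names are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations

# Hypothetical context: which product variant offers which feature.
CONTEXT = {
    "basic":   {"editor"},
    "pro":     {"editor", "debugger"},
    "cloud":   {"editor", "sync"},
    "premium": {"editor", "debugger", "sync"},
}
ATTRIBUTES = set().union(*CONTEXT.values())


def common_attributes(objects):
    """Attributes shared by every object in the set (all attributes if the set is empty)."""
    shared = set(ATTRIBUTES)
    for obj in objects:
        shared &= CONTEXT[obj]
    return shared


def objects_having(attributes):
    """Objects that own every attribute in the set."""
    return {obj for obj, attrs in CONTEXT.items() if attributes <= attrs}


def formal_concepts():
    """Enumerate (extent, intent) pairs by closing every attribute subset (fine for tiny contexts)."""
    concepts = set()
    attrs = sorted(ATTRIBUTES)
    for size in range(len(attrs) + 1):
        for subset in combinations(attrs, size):
            extent = objects_having(set(subset))
            intent = common_attributes(extent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


concepts = formal_concepts()
for extent, intent in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(extent), sorted(intent))
```

In this toy context the concept whose extent contains every product has the intent `{"editor"}`, which reads as a commonality, while `debugger` and `sync` appear only in smaller concepts and read as variabilities. Brute-force enumeration over attribute subsets is exponential and only suitable for illustration.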
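[AI-158] In-Context Reinforcement Learning via Communicative World Models
Quick read: This paper addresses the difficulty reinforcement learning (RL) agents have in generalizing to new tasks and environments: their representations and policies tend to overfit the specifics of the training environments and so fail to adapt to unseen scenarios. The key to the solution is CORAL (Communicative Representation for Adaptive RL), a framework that decouples latent representation learning from control and formulates in-context RL (ICRL) as a two-agent emergent communication problem. An Information Agent (IA) is pre-trained as a world model on a diverse distribution of tasks and distills its understanding of the environment into concise messages. A Causal Influence Loss, which measures the effect a message has on the next action, shapes the communication protocol. At deployment, the pre-trained IA is kept fixed and serves as a contextualizer for a Control Agent (CA), which learns from the communicated context, yielding gains in sample efficiency and zero-shot adaptation in unseen sparse-reward environments.
Link: https://arxiv.org/abs/2508.06659
Authors: Fernando Martinez-Lopez, Tao Li, Yingdong Lu, Juntao Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: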
Abstract:Reinforcement learning (RL) agents often struggle to generalize to new tasks and contexts without updating their parameters, mainly because their learned representations and policies are overfit to the specifics of their training environments. To boost agents’ in-context RL (ICRL) ability, this work formulates ICRL as a two-agent emergent communication problem and introduces CORAL (Communicative Representation for Adaptive RL), a framework that learns a transferable communicative context by decoupling latent representation learning from control. In CORAL, an Information Agent (IA) is pre-trained as a world model on a diverse distribution of tasks. Its objective is not to maximize task reward, but to build a world model and distill its understanding into concise messages. The emergent communication protocol is shaped by a novel Causal Influence Loss, which measures the effect that the message has on the next action. During deployment, the previously trained IA serves as a fixed contextualizer for a new Control Agent (CA), which learns to solve tasks by interpreting the provided communicative context. Our experiments demonstrate that this approach enables the CA to achieve significant gains in sample efficiency and successfully perform zero-shot adaptation with the help of pre-trained IA in entirely unseen sparse-reward environments, validating the efficacy of learning a transferable communicative representation.
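[AI-159] Fractal Language Modelling by Universal Sequence Maps (USM)
Quick read: This paper addresses how to encode symbolic sequences (such as genomic sequences) efficiently and uniquely across multiple scales and embedding dimensions while retaining contextual information, so that nonlinear models such as neural networks can operate on a numerical representation. The encoding must map exactly onto sequence identity and also support later analyses (such as k-mer frequencies or distance metrics) without recomputing the embedded coordinates. The key to the solution is to resolve the seeding biases that affect the iterated process in Universal Sequence Maps (USM). This has two results: numeric positioning is fully reconciled with sequence identity, and USM is shown to be an efficient numeric process that converges towards a steady-state sequence embedding. As the abstract notes under its context, USM coordinates also permit k-mer analysis with non-integer values of k.
Link: https://arxiv.org/abs/2508.06641
Authors: Jonas S Almeida, Daniel E Russ, Susana Vinga, Ines Duarte, Lee Mason, Praphulla Bhawsar, Aaron Ge, Arlindo Oliveira, Jeya Balaji Balasubramanian
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Quantitative Methods (q-bio.QM)
Comments: 16 pages, 8 figures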
Abstract:Motivation: With the advent of Language Models using Transformers, popularized by ChatGPT, there is a renewed interest in exploring encoding procedures that numerically represent symbolic sequences at multiple scales and embedding dimensions. The challenge that encoding addresses is the need for mechanisms that uniquely retain contextual information about the succession of individual symbols, which can then be modeled by nonlinear formulations such as neural networks. Context: Universal Sequence Maps (USM) are iterated functions that bijectively encode symbolic sequences onto embedded numerical spaces. USM is composed of two Chaos Game Representations (CGR), iterated forwardly and backwardly, that can be projected into the frequency domain (FCGR). The corresponding USM coordinates can be used to compute a Chebyshev distance metric as well as k-mer frequencies, without having to recompute the embedded numeric coordinates, and, paradoxically, allowing for non-integer values of k. Results: This report advances the bijective fractal encoding by Universal Sequence Maps (USM) by resolving seeding biases affecting the iterated process. The resolution had two results, the first expected, the second an intriguing outcome: 1) full reconciliation of numeric positioning with sequence identity; and 2) uncovering the nature of USM as an efficient numeric process converging towards a steady state sequence embedding solution. We illustrate these results for genomic sequences because of the convenience of a planar representation defined by an alphabet with only 4 tokens (the 4 nucleotides). Nevertheless, the application to alphabets of arbitrary cardinality was found to be straightforward.
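To make the bijective encoding concrete, here is a minimal sketch of a classic Chaos Game Representation, the building block that the abstract says USM iterates forwardly and backwardly. The corner assignment and the midpoint seed are common conventions assumed here. The sketch does not implement the paper's resolution of seeding biases, and the function names are hypothetical.

```python
# Corner assignment for the four nucleotides (a common CGR convention, assumed here).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_coordinates(sequence, seed=(0.5, 0.5)):
    """Return one 2-D point per symbol: each step moves halfway toward the symbol's corner."""
    x, y = seed
    points = []
    for symbol in sequence:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def decode_last_symbols(point, length):
    """Recover the most recent `length` symbols from one CGR point by inverting the halving step."""
    x, y = point
    symbols = []
    for _ in range(length):
        # The quadrant containing the point identifies the corner of the last symbol.
        cx, cy = (1.0 if x >= 0.5 else 0.0), (1.0 if y >= 0.5 else 0.0)
        symbols.append(next(s for s, c in CORNERS.items() if c == (cx, cy)))
        x, y = 2.0 * x - cx, 2.0 * y - cy
    return "".join(reversed(symbols))


forward = cgr_coordinates("GATTACA")
backward = cgr_coordinates("GATTACA"[::-1])  # the backward iteration runs over the reversed sequence
print(forward[-1], decode_last_symbols(forward[-1], 7))
```

The last forward coordinate alone is enough to recover the whole short sequence, which is the sense in which the map retains the succession of symbols. Floating-point precision limits how many symbols can be decoded from a single point in practice.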
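[AI-160] Segmented Confidence Sequences and Multi-Scale Adaptive Confidence Segments for Anomaly Detection in Nonstationary Time Series
Quick read: This paper addresses time-series anomaly detection in nonstationary environments, where statistical properties shift over time (regime shifts, concept drift, multi-scale changes) and static thresholds quickly become obsolete. The key to the solution is two adaptive thresholding frameworks, Segmented Confidence Sequences (SCS) and Multi-Scale Adaptive Confidence Segments (MACS). Both build on statistical online learning and segmentation to adapt locally and in context, and both maintain guarantees on the false alarm rate under evolving distributions, which makes detection more reliable, interpretable, and timely.
Link: https://arxiv.org/abs/2508.06638
Authors: Muyan Anna Li, Aditi Gautam
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 11 figures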
Abstract:As time series data become increasingly prevalent in domains such as manufacturing, IT, and infrastructure monitoring, anomaly detection must adapt to nonstationary environments where statistical properties shift over time. Traditional static thresholds are easily rendered obsolete by regime shifts, concept drift, or multi-scale changes. To address these challenges, we introduce and empirically evaluate two novel adaptive thresholding frameworks: Segmented Confidence Sequences (SCS) and Multi-Scale Adaptive Confidence Segments (MACS). Both leverage statistical online learning and segmentation principles for local, contextually sensitive adaptation, maintaining guarantees on false alarm rates even under evolving distributions. Our experiments across Wafer Manufacturing benchmark datasets show significant F1-score improvement compared to traditional percentile and rolling quantile approaches. This work demonstrates that robust, statistically principled adaptive thresholds enable reliable, interpretable, and timely detection of diverse real-world anomalies.
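For orientation, the sketch below shows the kind of rolling quantile threshold that the abstract names as a traditional baseline. It is not SCS or MACS, whose details the abstract does not give. The window size, quantile level, and synthetic series are assumptions chosen for illustration.

```python
import numpy as np


def rolling_quantile_flags(values, window=50, q=0.99):
    """Flag points that exceed the q-quantile of the preceding `window` observations."""
    values = np.asarray(values, dtype=float)
    flags = np.zeros(len(values), dtype=bool)
    for t in range(window, len(values)):
        threshold = np.quantile(values[t - window:t], q)
        flags[t] = values[t] > threshold
    return flags


rng = np.random.default_rng(0)
# Synthetic series with a level shift at t=300 and one injected spike at t=450.
series = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 200)])
series[450] += 8.0
flags = rolling_quantile_flags(series)
print(np.flatnonzero(flags))
```

A baseline like this tends to raise a burst of alarms right after the level shift until the window fills with post-shift data, which is the sort of regime-change behavior that motivates segment-aware thresholds.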
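[AI-161] Using Imperfect Synthetic Data in Downstream Inference Tasks
Quick read: This paper addresses how, in limited-data settings, synthetic samples generated by large language models (LLMs) can be combined with real data while still supporting statistically valid conclusions. The core of the solution is a new estimator based on the generalized method of moments (GMM) that is hyperparameter-free and comes with theoretical guarantees. Notably, interactions between the moment residuals of the synthetic data and those of the real data can improve estimates of the target parameter, and the estimator shows large finite-sample gains across regression tasks in computational social science.
Link: https://arxiv.org/abs/2508.06635
Authors: Yewon Byun, Shantanu Gupta, Zachary C. Lipton, Rachel Leah Childers, Bryan Wilder
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: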
Abstract:Predictions and generations from large language models are increasingly being explored as an aid to computational social science and human subject research in limited data regimes. While previous technical work has explored the potential to use model-predicted labels for unlabeled data in a principled manner, there is increasing interest in using large language models to generate entirely new synthetic samples (also termed as synthetic simulations), such as in responses to surveys. However, it is not immediately clear by what means practitioners can combine such data with real data and yet produce statistically valid conclusions upon them. In this work, we introduce a new estimator based on generalized method of moments, providing a hyperparameter-free solution with strong theoretical guarantees to address the challenge at hand. Surprisingly, we find that interactions between the moment residuals of synthetic data and those of real data can improve estimates of the target parameter. We empirically validate the finite-sample performance of our estimator across different regression tasks in computational social science applications, demonstrating large empirical gains.
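[AI-162] Early Detection of Pancreatic Cancer Using Multimodal Learning on Electronic Health Record
Quick read: This paper addresses early detection of pancreatic ductal adenocarcinoma (PDAC), which lacks specific symptoms and reliable biomarkers and is therefore usually diagnosed late. The key to the solution is a multimodal approach that integrates longitudinal diagnosis code histories with routinely collected laboratory measurements from electronic health records (EHR). Neural controlled differential equations model the irregularly sampled lab time series, pretrained language models and recurrent networks learn representations of diagnosis code trajectories, and cross-attention captures interactions between the two modalities, so that PDAC risk can be predicted up to one year before clinical diagnosis. Experiments report AUC improvements of 6.5% to 15.5% over state-of-the-art methods, and the model identifies diagnosis codes and laboratory panels associated with elevated PDAC risk.
Link: https://arxiv.org/abs/2508.06627
Authors: Mosbah Aouad, Anirudh Choudhary, Awais Farooq, Steven Nevers, Lusine Demirkhanyan, Bhrandon Harris, Suguna Pappu, Christopher Gondi, Ravishankar Iyer
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: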
Abstract:Pancreatic ductal adenocarcinoma (PDAC) is one of the deadliest cancers, and early detection remains a major clinical challenge due to the absence of specific symptoms and reliable biomarkers. In this work, we propose a new multimodal approach that integrates longitudinal diagnosis code histories and routinely collected laboratory measurements from electronic health records to detect PDAC up to one year prior to clinical diagnosis. Our method combines neural controlled differential equations to model irregular lab time series, pretrained language models and recurrent networks to learn diagnosis code trajectory representations, and cross-attention mechanisms to capture interactions between the two modalities. We develop and evaluate our approach on a real-world dataset of nearly 4,700 patients and achieve significant improvements in AUC ranging from 6.5% to 15.5% over state-of-the-art methods. Furthermore, our model identifies diagnosis codes and laboratory panels associated with elevated PDAC risk, including both established and new biomarkers. Our code is available at this https URL.
[AI-163] Generalizing Scaling Laws for Dense and Sparse Large Language Models
Quick read: This paper addresses the difficulty of predicting the optimal model size and allocating compute when training large language models (LLMs). Many scaling laws have been proposed to make training more efficient, but almost all are specific to one architecture (dense or sparse). The key contribution is to revisit existing scaling laws and propose a generalized scaling law, a unified framework that applies to both dense and sparse LLMs.
Link: https://arxiv.org/abs/2508.06617
Authors: Md Arafat Hossain, Xingfu Wu, Valerie Taylor, Ali Jannesari
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 8 pages, 8 figures
Abstract:Over the past few years, the size of language models has grown exponentially, as has the computational cost to train these large models. This rapid growth has motivated researchers to develop new techniques aimed at enhancing the efficiency of the training process. Despite these advancements, optimally predicting the model size or allocating optimal resources remains a challenge. Several efforts have addressed the challenge by proposing different scaling laws, but almost all of them are architecture-specific (dense or sparse). In this work we revisit existing scaling laws and propose a generalized scaling law to provide a unified framework that is applicable to both dense and sparse large language models. We evaluate and compare our proposed scaling law with existing scaling laws to demonstrate its effectiveness.
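As background only, one widely used parametric form among existing dense-model scaling laws (the Chinchilla-style form) expresses loss in terms of parameter count and training tokens. It is shown here as an example of an architecture-specific law. It is not the generalized law proposed in this paper, which the abstract does not state.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Here $N$ is the number of model parameters, $D$ the number of training tokens, $E$ the irreducible loss, and $A$, $B$, $\alpha$, $\beta$ are constants fitted to training runs.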
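[AI-164] Generative AI for Intent-Driven Network Management in 6G: A Case Study on Hierarchical Learning Approach
Quick read: This paper addresses the management complexity of increasingly heterogeneous and dynamic mobile networks in the 6G era, which conventional automation struggles to handle. The key to the solution is a hierarchical learning-enabled Intent-Driven Network (IDN) architecture that applies Generative AI (GenAI) across all three stages of an IDN (intent processing, intent validation, and intent execution) rather than to intent processing alone. A case study based on the Mamba architecture covers the pipeline from human instruction to network policy and reports better network performance than conventional IDN architectures. The article also surveys LLM-based IDN architectures in disaggregated Radio Access Network (RAN) environments.
Link: https://arxiv.org/abs/2508.06616
Authors: Md Arafat Habib, Medhat Elsayed, Yigit Ozcan, Pedro Enrique Iturria-Rivera, Majid Bavand, Melike Erol-Kantarci
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: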
Abstract:With the emergence of 6G, mobile networks are becoming increasingly heterogeneous and dynamic, necessitating advanced automation for efficient management. Intent-Driven Networks (IDNs) address this by translating high-level intents into optimization policies. Large Language Models (LLMs) can enhance this process by understanding complex human instructions to enable adaptive, intelligent automation. Given the rapid advancements in Generative AI (GenAI), a comprehensive survey of LLM-based IDN architectures in disaggregated Radio Access Network (RAN) environments is both timely and critical. This article provides such a survey, along with a case study on a hierarchical learning-enabled IDN architecture that integrates GenAI across three key stages: intent processing, intent validation, and intent execution. Unlike most existing approaches that apply GenAI in the form of LLMs for intent processing only, we propose a hierarchical framework that introduces GenAI across all three stages of IDN. To demonstrate the effectiveness of the proposed IDN management architecture, we present a case study based on the latest GenAI architecture named Mamba. The case study shows how the proposed GenAI-driven architecture enhances network performance through intelligent automation, surpassing the performance of the conventional IDN architectures.
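[AI-165] Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs
Quick read: This paper addresses the safety risk that open-weight AI systems are vulnerable to tampering attacks, which elicit harmful behavior by modifying weights or activations. Existing safety fine-tuning and other post-training techniques struggle to resist more than a few dozen steps of adversarial fine-tuning, and there is not yet a robust science of open-weight model risk management. The key to the solution is a multi-stage, scalable data-filtering pipeline applied at pretraining time, which removes text about dual-use topics so that the model internalizes less biothreat proxy knowledge. 6.9B-parameter models trained this way resist adversarial fine-tuning on up to 10,000 steps and 300M tokens of biothreat-related text, outperforming post-training baselines by over an order of magnitude with no observed degradation of unrelated capabilities. Filtered models can still use such information when it is supplied in context, so the authors present pretraining data curation as one layer in a defense-in-depth approach.
Link: https://arxiv.org/abs/2508.06601
Authors: Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, Stella Biderman
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: this https URL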
Abstract:Open-weight AI systems offer unique benefits, including enhanced transparency, open research, and decentralized access. However, they are vulnerable to tampering attacks which can efficiently elicit harmful behaviors by modifying weights or activations. Currently, there is not yet a robust science of open-weight model risk management. Existing safety fine-tuning methods and other post-training techniques have struggled to make LLMs resistant to more than a few dozen steps of adversarial fine-tuning. In this paper, we investigate whether filtering text about dual-use topics from training data can prevent unwanted capabilities and serve as a more tamper-resistant safeguard. We introduce a multi-stage pipeline for scalable data filtering and show that it offers a tractable and effective method for minimizing biothreat proxy knowledge in LLMs. We pretrain multiple 6.9B-parameter models from scratch and find that they exhibit substantial resistance to adversarial fine-tuning attacks on up to 10,000 steps and 300M tokens of biothreat-related text – outperforming existing post-training baselines by over an order of magnitude – with no observed degradation to unrelated capabilities. However, while filtered models lack internalized dangerous knowledge, we find that they can still leverage such information when it is provided in context (e.g., via search tool augmentation), demonstrating a need for a defense-in-depth approach. Overall, these findings help to establish pretraining data curation as a promising layer of defense for open-weight AI systems.
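[AI-166] Towards Integrated Alignment
Quick read: This paper addresses the limitations that arise from the split between behavioral and representational approaches in AI alignment, a split that yields narrowly aligned models that are more vulnerable to increasingly deceptive misalignment threats. The key to the solution is a set of design principles for Integrated Alignment frameworks, which combine the complementary strengths of different alignment approaches through deep integration and adaptive coevolution. The paper stresses strategic diversity, that is, deploying orthogonal alignment and misalignment-detection approaches to avoid the systemic risk of homogeneous pipelines, and recommends cross-collaboration, open model weights, and shared community resources to unify the research field.
Link: https://arxiv.org/abs/2508.06592
Authors: Ben Y. Reis, William La Cava
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: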
Abstract:As AI adoption expands across human society, the problem of aligning AI models to match human preferences remains a grand challenge. Currently, the AI alignment field is deeply divided between behavioral and representational approaches, resulting in narrowly aligned models that are more vulnerable to increasingly deceptive misalignment threats. In the face of this fragmentation, we propose an integrated vision for the future of the field. Drawing on related lessons from immunology and cybersecurity, we lay out a set of design principles for the development of Integrated Alignment frameworks that combine the complementary strengths of diverse alignment approaches through deep integration and adaptive coevolution. We highlight the importance of strategic diversity - deploying orthogonal alignment and misalignment detection approaches to avoid homogeneous pipelines that may be “doomed to success”. We also recommend steps for greater unification of the AI alignment research field itself, through cross-collaboration, open model weights and shared community resources.
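[AI-167] A Federated Learning Framework for Handling Subtype Confounding and Heterogeneity in Large-Scale Neuroimaging Diagnosis
Quick read: This paper addresses two problems in neuroimaging computer-aided diagnosis (CAD): low reproducibility in small-sample studies, and confounding heterogeneity in large-scale datasets where multiple disease subtypes share a single label. The key to the solution is a federated learning framework for neuroimaging CAD with two core modules: a dynamic navigation module that routes each sample to the most suitable local model based on its latent subtype representation, and a meta-integration module that combines the predictions of heterogeneous local models into a unified diagnostic output. On multi-cohort fMRI data from over 1300 patients with major depressive disorder (MDD) and 1100 healthy controls, the framework reaches an average accuracy of 74.06%, which the authors report as a significant improvement over traditional methods.
Link: https://arxiv.org/abs/2508.06589
Authors: Xinglin Zhao, Yanwen Wang, Xiaobo Liu, Yanrong Hao, Rui Cao, Xin Wen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: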
Abstract:Computer-aided diagnosis (CAD) systems play a crucial role in analyzing neuroimaging data for neurological and psychiatric disorders. However, small-sample studies suffer from low reproducibility, while large-scale datasets introduce confounding heterogeneity due to multiple disease subtypes being labeled under a single category. To address these challenges, we propose a novel federated learning framework tailored for neuroimaging CAD systems. Our approach includes a dynamic navigation module that routes samples to the most suitable local models based on latent subtype representations, and a meta-integration module that combines predictions from heterogeneous local models into a unified diagnostic output. We evaluated our framework using a comprehensive dataset comprising fMRI data from over 1300 MDD patients and 1100 healthy controls across multiple study cohorts. Experimental results demonstrate significant improvements in diagnostic accuracy and robustness compared to traditional methods. Specifically, our framework achieved an average accuracy of 74.06% across all tested sites, showcasing its effectiveness in handling subtype heterogeneity and enhancing model generalizability. Ablation studies further confirmed the importance of both the dynamic navigation and meta-integration modules in improving performance. By addressing data heterogeneity and subtype confounding, our framework advances reliable and reproducible neuroimaging CAD systems, offering significant potential for personalized medicine and clinical decision-making in neurology and psychiatry.
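[AI-168] Graph is a Natural Regularization: Revisiting Vector Quantization for Graph Representation Learning
Quick read: This paper addresses codebook collapse in vector quantization (VQ) for graph-structured data: many codewords go unused during training, which limits expressiveness and generalization. The key to the solution is RGVQ, a framework that uses graph topology and feature similarity as explicit regularization signals to improve codebook utilization. It introduces soft assignments via Gumbel-Softmax reparameterization so that all codewords receive gradient updates, and adds a structure-aware contrastive regularization that penalizes token co-assignment among similar node pairs, promoting token diversity and more expressive, transferable graph representations.
Link: https://arxiv.org/abs/2508.06588
Authors: Zian Zhai, Fan Li, Xingyu Tan, Xiaoyang Wang, Wenjie Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: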
Abstract:Vector Quantization (VQ) has recently emerged as a promising approach for learning discrete representations of graph-structured data. However, a fundamental challenge, i.e., codebook collapse, remains underexplored in the graph domain, significantly limiting the expressiveness and generalization of graph tokens. In this paper, we present the first empirical study showing that codebook collapse consistently occurs when applying VQ to graph data, even with mitigation strategies proposed in vision or language domains. To understand why graph VQ is particularly vulnerable to collapse, we provide a theoretical analysis and identify two key factors: early assignment imbalances caused by redundancy in graph features and structural patterns, and self-reinforcing optimization loops in deterministic VQ. To address these issues, we propose RGVQ, a novel framework that integrates graph topology and feature similarity as explicit regularization signals to enhance codebook utilization and promote token diversity. RGVQ introduces soft assignments via Gumbel-Softmax reparameterization, ensuring that all codewords receive gradient updates. In addition, RGVQ incorporates a structure-aware contrastive regularization to penalize the token co-assignments among similar node pairs. Extensive experiments demonstrate that RGVQ substantially improves codebook utilization and consistently boosts the performance of state-of-the-art graph VQ backbones across multiple downstream tasks, enabling more expressive and transferable graph token representations.
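The soft-assignment idea can be sketched in a few lines. The example below draws Gumbel-Softmax assignment weights between a handful of node embeddings and a small codebook. It is a generic NumPy illustration, not RGVQ itself: the random embeddings, the codebook size, the use of negative squared distance as logits, and the temperature are all assumptions.

```python
import numpy as np


def gumbel_softmax_assign(logits, temperature=1.0, rng=None):
    """Sample soft codeword assignments: softmax((logits + Gumbel noise) / temperature)."""
    rng = np.random.default_rng() if rng is None else rng
    uniform = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(uniform))
    scaled = (logits + gumbel) / temperature
    scaled -= scaled.max(axis=-1, keepdims=True)  # subtract the row maximum for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
node_embeddings = rng.normal(size=(5, 8))  # 5 hypothetical node embeddings
codebook = rng.normal(size=(4, 8))         # 4 hypothetical codewords

# Negative squared distance as the assignment logit (an assumption for this sketch).
logits = -((node_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
soft = gumbel_softmax_assign(logits, temperature=0.5, rng=rng)
quantized = soft @ codebook  # a weighted mix of all codewords rather than a single nearest one
print(soft.round(3))
```

Because each quantized vector is a weighted mix over the whole codebook, an autodiff framework would pass some gradient to every codeword, whereas a hard nearest-neighbor assignment updates only the selected one. Lower temperatures push the weights toward a one-hot assignment.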
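[AI-169] Omni Geometry Representation Learning vs Large Language Models for Geospatial Entity Resolution
Quick read: This paper addresses geospatial entity resolution (ER) for entities with diverse geometries (points, lines, polylines, polygons, and multi-polygons). Existing neural approaches reduce complex geometries to a single point and lose significant spatial information. The key to the solution is Omni, a model with an omni-geometry encoder that embeds all of these geometry types while preserving their spatial structure, combined with an Attribute Affinity mechanism that applies transformer-based pre-trained language models to the textual attributes of place records. Omni improves F1 by up to 12% over existing methods, and the paper also compares these methods with large language models (LLMs), which show competitive results.
Link: https://arxiv.org/abs/2508.06584
Authors: Kalana Wijegunarathna, Kristin Stock, Christopher B. Jones
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: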
Abstract:The development, integration, and maintenance of geospatial databases rely heavily on efficient and accurate matching procedures of Geospatial Entity Resolution (ER). While resolution of points-of-interest (POIs) has been widely addressed, resolution of entities with diverse geometries has been largely overlooked. This is partly due to the lack of a uniform technique for embedding heterogeneous geometries seamlessly into a neural network framework. Existing neural approaches simplify complex geometries to a single point, resulting in significant loss of spatial information. To address this limitation, we propose Omni, a geospatial ER model featuring an omni-geometry encoder. This encoder is capable of embedding point, line, polyline, polygon, and multi-polygon geometries, enabling the model to capture the complex geospatial intricacies of the places being compared. Furthermore, Omni leverages transformer-based pre-trained language models over individual textual attributes of place records in an Attribute Affinity mechanism. The model is rigorously tested on existing point-only datasets and a new diverse-geometry geospatial ER dataset. Omni produces up to 12% (F1) improvement over existing methods. Furthermore, we test the potential of Large Language Models (LLMs) to conduct geospatial ER, experimenting with prompting strategies and learning scenarios, comparing the results of pre-trained language model-based methods with LLMs. Results indicate that LLMs show competitive results.
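[AI-170] Leveraging LLMs for Privacy-Aware Predictions in Participatory Budgeting
Quick read: This paper addresses common problems in Participatory Budgeting (PB): low participation rates, the need to help proposers craft better proposals, and the difficulty organizers face in managing large volumes of submissions. The core of the solution is a privacy-preserving prediction approach that uses a large language model (GPT-4 Turbo) to predict which proposals are likely to be funded from their textual descriptions and anonymous historical voting records alone, without voter demographics or personally identifiable information. The key finding is that the LLM's prior knowledge must be complemented by past voting data for predictions to reflect real-world voting behavior. The resulting tool is intended to support both proposal authors and organizers.
Link: https://arxiv.org/abs/2508.06577
Authors: Juan Zambrano, Clément Contet, Jairo Gudiño, Felipe Garrido-Lucero, Umberto Grandi, Cesar A Hidalgo
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: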
Abstract:Participatory Budgeting (PB) empowers citizens to propose and vote on public investment projects. Yet, despite its democratic potential, PB initiatives often suffer from low participation rates, limiting their visibility and perceived legitimacy. In this work, we aim to strengthen PB elections in two key ways: by supporting project proposers in crafting better proposals, and by helping PB organizers manage large volumes of submissions in a transparent manner. We propose a privacy-preserving approach to predict which PB proposals are likely to be funded, using only their textual descriptions and anonymous historical voting records – without relying on voter demographics or personally identifiable information. We evaluate the performance of GPT 4 Turbo in forecasting proposal outcomes across varying contextual scenarios, observing that the LLM’s prior knowledge needs to be complemented by past voting data to obtain predictions reflecting real-world PB voting behavior. Our findings highlight the potential of AI-driven tools to support PB processes by improving transparency, planning efficiency, and civic engagement.
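[AI-171] Efficient Safety Testing of Autonomous Vehicles via Adaptive Search over Crash-Derived Scenarios
Quick read: This paper addresses the low efficiency of testing autonomous vehicles (AVs) in safety-critical scenarios. The key to the solution is an adaptive large-variable neighborhood-simulated annealing algorithm (ALVNS-SA) that accelerates the search over test scenarios. In the reported experiments it achieves 84.00% coverage of safety-critical scenarios, with 96.83% coverage of crash scenarios and 92.07% coverage of near-crash scenarios, higher than a genetic algorithm (GA), an adaptive large neighborhood-simulated annealing algorithm (ALNS-SA), and random testing.
Link: https://arxiv.org/abs/2508.06575
Authors: Rui Zhou
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: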
Abstract:Ensuring the safety of autonomous vehicles (AVs) is paramount in their development and deployment. Safety-critical scenarios pose more severe challenges, necessitating efficient testing methods to validate AV safety. This study focuses on designing an accelerated testing algorithm for AVs in safety-critical scenarios, enabling swift recognition of their driving capabilities. First, typical logical scenarios were extracted from real-world crashes in the China In-depth Mobility Safety Study-Traffic Accident (CIMSS-TA) database, obtaining pre-crash features through reconstruction. Second, Baidu Apollo, an advanced black-box automated driving system (ADS), is integrated to control the behavior of the ego vehicle. Third, we propose an adaptive large-variable neighborhood-simulated annealing algorithm (ALVNS-SA) to expedite the testing process. Experimental results demonstrate a significant enhancement in testing efficiency when utilizing ALVNS-SA. It achieves 84.00% coverage of safety-critical scenarios, with crash scenario coverage of 96.83% and near-crash scenario coverage of 92.07%. Compared to a genetic algorithm (GA), an adaptive large neighborhood-simulated annealing algorithm (ALNS-SA), and random testing, ALVNS-SA exhibits substantially higher coverage in safety-critical scenarios.
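For readers unfamiliar with the search family involved, the following is a plain simulated annealing loop over two scenario parameters. It is a textbook sketch rather than ALVNS-SA, whose neighborhood adaptation the abstract does not specify. The `criticality` objective, its peak location, the parameter bounds, and the cooling schedule are invented for illustration.

```python
import math
import random


def simulated_annealing(objective, start, neighbor, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Plain simulated annealing that maximizes `objective`; returns the best point seen."""
    rng = random.Random(seed)
    current, current_score = start, objective(start)
    best, best_score = current, current_score
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        score = objective(candidate)
        # Always accept improvements; accept worse candidates with Boltzmann probability.
        if score >= current_score or rng.random() < math.exp((score - current_score) / temperature):
            current, current_score = candidate, score
            if score > best_score:
                best, best_score = candidate, score
        temperature *= cooling
    return best, best_score


def criticality(params):
    """Hypothetical scenario criticality, peaking at gap = 12 m and speed difference = 6 m/s."""
    gap, speed_diff = params
    return -((gap - 12.0) ** 2) / 50.0 - ((speed_diff - 6.0) ** 2) / 10.0


def perturb(params, rng):
    """Propose a nearby scenario, clipped to assumed parameter bounds."""
    gap, speed_diff = params
    return (min(60.0, max(2.0, gap + rng.gauss(0, 2.0))),
            min(15.0, max(0.0, speed_diff + rng.gauss(0, 1.0))))


best, score = simulated_annealing(criticality, start=(40.0, 1.0), neighbor=perturb)
print(best, score)
```

In a real test harness the objective would come from running the driving system in simulation, so each evaluation is expensive and the quality of the neighborhood proposals matters far more than in this toy problem.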
[AI-172] Teaching Introduction to Programming in the times of AI: A case study of a course re-design
Quick read: This paper addresses the challenges that the spread of AI tools in programming education raises for course design, learning objectives, course delivery, and formative and summative assessment, as well as students' misuse of such tools. The key to the solution is to re-design an existing course, re-shaping its assignments and pedagogy around current AI technology so as to maximize the benefits of AI tools while addressing the associated challenges and concerns.
Link: https://arxiv.org/abs/2508.06572
Authors: Nikolaos Avouris, Kyriakos Sgarbas, George Caridakis, Christos Sintoris
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: To be cited as: Avouris, N., Sgarbas, K., Caridakis, G., Sintoris, C., (2025). Teaching Introduction to Programming in the times of AI: A case study of a course re-design, Proceedings 12th Penhellenic Conference of Computer Science Education, PCCSE 2025, Rhodes, October 2025
Abstract:The integration of AI tools into programming education has become increasingly prevalent in recent years, transforming the way programming is taught and learned. This paper provides a review of the state-of-the-art AI tools available for teaching and learning programming, particularly in the context of introductory courses. It highlights the challenges for course design, learning objectives, course delivery, and formative and summative assessment, as well as the misuse of such tools by students. We discuss ways of re-designing an existing course, re-shaping assignments and pedagogy to address the challenges posed by current AI technologies. This example can serve as a guideline for policies for institutions and teachers involved in teaching programming, aiming to maximize the benefits of AI tools while addressing the associated challenges and concerns.
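[AI-173] Operationalizing Serendipity: Multi-Agent AI Workflows for Enhanced Materials Characterization with Theory-in-the-Loop
Quick read: This paper addresses a risk in modern autonomous laboratories: optimizing for efficient hypothesis testing can overlook unexpected observations that lead to serendipitous, scientifically important discoveries. The key to the solution is SciLink, an open-source multi-agent AI framework that creates a direct, automated link between experimental observation, novelty assessment, and theoretical simulation. It uses a hybrid AI strategy in which specialized machine learning models perform quantitative analysis of experimental data and large language models handle higher-level reasoning. The agents convert raw materials-characterization data into falsifiable scientific claims, score their novelty against the published literature, and propose targeted follow-up experiments.
Link: https://arxiv.org/abs/2508.06569
Authors: Lance Yao, Suman Samantray, Ayana Ghosh, Kevin Roccapriore, Libor Kovarik, Sarah Allec, Maxim Ziatdinov
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
Comments: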
Abstract:The history of science is punctuated by serendipitous discoveries, where unexpected observations, rather than targeted hypotheses, opened new fields of inquiry. While modern autonomous laboratories excel at accelerating hypothesis testing, their optimization for efficiency risks overlooking these crucial, unplanned findings. To address this gap, we introduce SciLink, an open-source, multi-agent artificial intelligence framework designed to operationalize serendipity in materials research by creating a direct, automated link between experimental observation, novelty assessment, and theoretical simulations. The framework employs a hybrid AI strategy where specialized machine learning models perform quantitative analysis of experimental data, while large language models handle higher-level reasoning. These agents autonomously convert raw data from materials characterization techniques into falsifiable scientific claims, which are then quantitatively scored for novelty against the published literature. We demonstrate the framework’s versatility across diverse research scenarios, showcasing its application to atomic-resolution and hyperspectral data, its capacity to integrate real-time human expert guidance, and its ability to close the research loop by proposing targeted follow-up experiments. By systematically analyzing all observations and contextualizing them, SciLink provides a practical framework for AI-driven materials research that not only enhances efficiency but also actively cultivates an environment ripe for serendipitous discoveries, thereby bridging the gap between automated experimentation and open-ended scientific exploration.
[AI-174] Solving Pasur Using GPU-Accelerated Counterfactual Regret Minimization
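[Quick Read]: This paper tackles the computational challenge of finding a near-Nash equilibrium for the complex card game Pasur, treated as a large-scale imperfect-information game. Because the rules are intricate and the game tree is huge (over 10⁹ nodes on average), conventional methods struggle with memory footprint and computational complexity. The key to the solution is a CUDA-accelerated computational framework built on two core designs. First, the game tree is decomposed into actual game states and scores inherited from previous rounds, which are paired during an unfolding process to form the full game tree, significantly reducing memory overhead. Second, a round-by-round backward training strategy starts from the final round and recursively propagates average utilities to earlier stages, keeping computational complexity manageable. The framework handles rule logic with PyTorch CUDA tensors, then trains a tree-based model to predict strategies for real-time play, and uses GPU-parallel self-play to estimate the fair value of each deck. It can be extended to other settings with staged decisions, such as turn-based strategy games or sequential trading decisions in financial markets.
Link: https://arxiv.org/abs/2508.06559
Authors: Sina Baghal
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: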
Abstract:Pasur is a fishing card game played over six rounds, similar to games such as Cassino, Scopa, and Bastra. This paper introduces a CUDA-accelerated computational framework for simulating Pasur, emphasizing efficient memory management. We use our framework to compute near-Nash equilibria via Counterfactual Regret Minimization (CFR), a well-known algorithm for solving large imperfect-information games. Solving Pasur presents unique challenges due to its intricate rules and the large size of its game tree. We handle rule complexity using PyTorch CUDA tensors. To address the memory-intensive nature of the game, we decompose the game tree into two key components: (1) actual game states, and (2) inherited scores from previous rounds. We construct the Full Game Tree by pairing card states with accumulated scores in the Unfolding Process. This design reduces memory overhead by storing only essential strategy values and node connections. To further manage computational complexity, we apply a round-by-round backward training strategy, starting from the final round and recursively propagating average utilities to earlier stages. Our approach constructs the complete game tree, which on average consists of over 10^9 nodes. We provide detailed implementation snippets. After computing a near-Nash equilibrium strategy, we train a tree-based model to predict these strategies for use during gameplay. We then estimate the fair value of each deck through large-scale self-play between equilibrium strategies by simulating, for instance, 10,000 games per matchup, executed in parallel using GPU acceleration. Similar frameworks can be extended to other reinforcement learning algorithms where the action tree naturally decomposes into multiple rounds, such as turn-based strategy games or sequential trading decisions in financial markets.
Submission history: From Sina Baghal, [v1] Wed, 6 Aug 2025 15:15:11 UTC (143 KB)
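The abstract names Counterfactual Regret Minimization (CFR) without showing its basic update. As a rough illustration only, the sketch below shows regret matching, the per-decision-point rule that CFR variants are commonly built on. The function names are hypothetical and this is not the paper's CUDA implementation; it ignores the game tree, reach probabilities, and strategy averaging.

```python
def regret_matching(cumulative_regret):
    """Turn cumulative regrets into a strategy proportional to positive regret."""
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    # No action has positive regret yet: fall back to the uniform strategy.
    n = len(cumulative_regret)
    return [1.0 / n] * n


def update_regrets(cumulative_regret, action_utilities, strategy):
    """Add each action's instantaneous regret: its utility minus the strategy's expected utility."""
    expected = sum(p * u for p, u in zip(strategy, action_utilities))
    return [r + (u - expected) for r, u in zip(cumulative_regret, action_utilities)]
```

Repeating these two steps at every information set, and averaging the strategies over iterations, is what lets the average strategy approach an equilibrium in two-player zero-sum games.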
[AI-175] Symbolic Learning of Interpretable Reduced-Order Models for Jumping Quadruped Robots
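[Quick Read]: This paper addresses the difficulty of modeling the complex nonlinear dynamics of jumping in quadruped robots, where traditional models such as the actuated Spring-loaded Inverted Pendulum (aSLIP) fail to capture the high-dimensional jumping dynamics accurately. The key to the solution is a new learning architecture that combines Sparse Identification of Nonlinear Dynamics (SINDy) with physical structural priors on the jump dynamics. By mapping the high-dimensional nonlinear jumping dynamics into a low-dimensional latent space, it yields a reduced-order model that is both interpretable and accurate, and its superior performance across multiple jumping strategies is validated in simulation and hardware experiments.
Link: https://arxiv.org/abs/2508.06538
Authors: Gioele Buriani, Jingyue Liu, Maximilian Stölzle, Cosimo Della Santina, Jiatao Ding
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 8 pages, under review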
Abstract:Reduced-order models are essential for motion planning and control of quadruped robots, as they simplify complex dynamics while preserving critical behaviors. This paper introduces a novel methodology for deriving such interpretable dynamic models, specifically for jumping. We capture the high-dimensional, nonlinear jumping dynamics in a low-dimensional latent space by proposing a learning architecture combining Sparse Identification of Nonlinear Dynamics (SINDy) with physical structural priors on the jump dynamics. Our approach demonstrates superior accuracy to the traditional actuated Spring-loaded Inverted Pendulum (aSLIP) model and is validated through simulation and hardware experiments across different jumping strategies.
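For readers unfamiliar with SINDy, the sketch below shows only its sparse-regression step (sequentially thresholded least squares) on a made-up scalar system. The toy dynamics, the polynomial library, and the threshold are assumptions chosen for illustration; the paper's latent-space architecture and physical priors are not reproduced here.

```python
import numpy as np


def stlsq(theta, dxdt, threshold=0.1, iterations=10):
    """Sequentially thresholded least squares: fit, zero small coefficients, refit."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(iterations):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if not big.any():
            break
        # Refit using only the library terms that survived thresholding.
        xi[big] = np.linalg.lstsq(theta[:, big], dxdt, rcond=None)[0]
    return xi


# Toy system (assumed for illustration): dx/dt = -2 x + 0.5 x^3.
x = np.linspace(-2.0, 2.0, 50)
dxdt = -2.0 * x + 0.5 * x**3
# Candidate library of terms: [1, x, x^2, x^3].
theta = np.column_stack([np.ones_like(x), x, x**2, x**3])
xi = stlsq(theta, dxdt)
```

The nonzero entries of `xi` pick out which library terms appear in the identified equation, which is what makes the resulting model readable as a formula.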
[AI-176] MetAdv: A Unified and Interactive Adversarial Testing Platform for Autonomous Driving ACM-MM2025
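[Quick Read]: This paper addresses the difficulty of evaluating the robustness of autonomous driving (AD) systems against adversarial attacks, a major challenge for AD safety and reliability. The key to the solution is MetAdv, a new adversarial testing platform that integrates virtual simulation with physical vehicle feedback. At its core is a three-layer closed-loop testing environment that covers the full evaluation pipeline, from high-level unified adversarial example generation, through mid-level simulation-based interactive testing, to low-level execution on physical vehicles. The platform also supports a variety of AD tasks and algorithmic paradigms and includes a human-in-the-loop mechanism that captures drivers' physiological signals and behavioral feedback in real time, offering new insights into human-machine trust under adversarial conditions and forming a scalable, unified framework for adversarial evaluation.
Link: https://arxiv.org/abs/2508.06534
Authors: Aishan Liu, Jiakai Wang, Tianyuan Zhang, Hainan Li, Jiangfan Liu, Siyuan Liang, Yilong Ren, Xianglong Liu, Dacheng Tao
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted by ACM MM 2025 Demo/Videos track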
Abstract:Evaluating and ensuring the adversarial robustness of autonomous driving (AD) systems is a critical and unresolved challenge. This paper introduces MetAdv, a novel adversarial testing platform that enables realistic, dynamic, and interactive evaluation by tightly integrating virtual simulation with physical vehicle feedback. At its core, MetAdv establishes a hybrid virtual-physical sandbox, within which we design a three-layer closed-loop testing environment with dynamic adversarial test evolution. This architecture facilitates end-to-end adversarial evaluation, ranging from high-level unified adversarial generation, through mid-level simulation-based interaction, to low-level execution on physical vehicles. Additionally, MetAdv supports a broad spectrum of AD tasks and algorithmic paradigms (e.g., modular deep learning pipelines, end-to-end learning, vision-language models). It supports flexible 3D vehicle modeling and seamless transitions between simulated and physical environments, with built-in compatibility for commercial platforms such as Apollo and Tesla. A key feature of MetAdv is its human-in-the-loop capability: besides flexible environmental configuration for more customized evaluation, it enables real-time capture of physiological signals and behavioral feedback from drivers, offering new insights into human-machine trust under adversarial conditions. We believe MetAdv can offer a scalable and unified framework for adversarial assessment, paving the way for safer AD.
[AI-177] PiKV: KV Cache Management System for Mixture of Experts ICML
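[Quick Read]: This paper studies the memory and communication overhead of key-value (KV) cache storage for large language models (LLMs) in multi-GPU and multi-node inference. Although Mixture-of-Experts (MoE) architectures achieve efficient inference by sparsifying computation, their KV caches remain dense and globally synchronized, causing significant resource waste and performance bottlenecks. The key to the solution is PiKV, a parallel and distributed KV cache serving framework designed for MoE architectures. Its core innovations are: (1) expert-sharded KV storage, which partitions the cache across GPUs to reduce redundancy; (2) PiKV routing, which reduces token-to-KV access latency; (3) PiKV Scheduling, an adaptive policy that dynamically retains query-relevant cache entries; and (4) PiKV Compression modules, integrated into the caching pipeline to further reduce memory usage. The approach alleviates the KV cache bottleneck of MoE models in distributed settings and improves scalability and inference efficiency.
Link: https://arxiv.org/abs/2508.06526
Authors: Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu, Xuhong Wang
Affiliation: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: Accepted to ICML ES-MoFo III Workshop. Paper link: this https URL. GitHub link: this https URL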
Abstract:As large language models continue to scale up in both size and context length, the memory and communication cost of key-value (KV) cache storage has become a major bottleneck in multi-GPU and multi-node inference. While MoE-based architectures sparsify computation across experts, the corresponding KV caches remain dense and globally synchronized, resulting in significant overhead. We introduce PiKV, a parallel and distributed KV cache serving framework tailored for MoE architectures. PiKV leverages expert-sharded KV storage to partition caches across GPUs, PiKV routing to reduce token-to-KV access, and PiKV Scheduling to adaptively retain query-relevant entries. To further reduce memory usage, PiKV integrates PiKV Compression modules into the caching pipeline for acceleration. PiKV is now publicly available as an open-source software library: this https URL. Experimental details are recorded at: this https URL_Results. PiKV is also integrated with Nvidia kvpress for acceleration; for details, see this https URL. PiKV is still a living project, aiming to become a comprehensive KV cache management system for MoE architectures.
[AI-178] AuthPrint: Fingerprinting Generative Models Against Malicious Model Providers
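[Quick Read]: This paper addresses the lack of output provenance mechanisms for generative models in high-stakes settings, that is, the inability to verify where model-generated content came from. Traditional model fingerprinting techniques mainly target collaborative settings, whereas this work is the first to evaluate fingerprinting under a threat model in which the model provider may act adversarially. The key to the solution is a trusted verifier that extracts secret fingerprints, unknown to the provider, from the model's output space and trains a model to predict and verify them. Experiments show that the method achieves near-zero FPR@95%TPR on GANs and diffusion models, and that it remains robust to small modifications of the model architecture and training data as well as to active evasion attacks.
Link: https://arxiv.org/abs/2508.05691
Authors: Kai Yao, Marc Juarez
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: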
Abstract:Generative models are increasingly adopted in high-stakes domains, yet current deployments offer no mechanisms to verify the origin of model outputs. We address this gap by extending model fingerprinting techniques beyond the traditional collaborative setting to one where the model provider may act adversarially. To our knowledge, this is the first work to evaluate fingerprinting for provenance attribution under such a threat model. The methods rely on a trusted verifier that extracts secret fingerprints from the model’s output space, unknown to the provider, and trains a model to predict and verify them. Our empirical evaluation shows that our methods achieve near-zero FPR@95%TPR for instances of GAN and diffusion models, even when tested on small modifications to the original architecture and training data. Moreover, the methods remain robust against adversarial attacks that actively modify the outputs to bypass detection. Source code is available at this https URL.
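The metric FPR@95%TPR quoted above may be unfamiliar. The sketch below shows one common way to compute it from verifier scores, assuming higher scores mean "more likely from the claimed model". The function name and the scoring convention are assumptions for illustration, not taken from the paper.

```python
import math


def fpr_at_tpr(positive_scores, negative_scores, target_tpr=0.95):
    """False-positive rate at the loosest threshold that still accepts target_tpr of positives."""
    ranked = sorted(positive_scores, reverse=True)
    # Number of genuine samples that must score at or above the threshold.
    k = math.ceil(target_tpr * len(ranked))
    threshold = ranked[k - 1]
    false_positives = sum(1 for s in negative_scores if s >= threshold)
    return false_positives / len(negative_scores)
```

A value near zero means that, at a threshold accepting 95% of genuine outputs, almost no outputs from other models are wrongly accepted.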
[AI-179] UPP: Unified Path Planner with Adaptive Safety and Optimality
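[Quick Read]: This paper addresses the difficulty of achieving both safety and optimality in robot path planning (the safety-optimality trade-off). Existing methods usually emphasize only one of the two and fail to optimize them jointly. The key to the solution is the Unified Path Planner (UPP), which balances safety and path quality through modified heuristics and a dynamic safety cost function, and introduces tunable parameters that adjust the safety level at the expense of computational complexity. The method is validated in simulation and on a real robot platform (TurtleBot), where it produces paths that are safe, sub-optimal, and practical.
Link: https://arxiv.org/abs/2505.23197
Authors: Jatin Kumar Arora, Shubhendu Bhasin
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 11 figures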
Abstract:We are surrounded by robots helping us perform complex tasks. Robots have a wide range of applications, from industrial automation to personalized assistance. However, with great technological innovation come significant challenges. One of the major challenges in robotics is path planning. Despite advancements such as graph search, sampling, and potential field methods, most path planning algorithms focus either on optimality or on safety. Very little research addresses both simultaneously. We propose a Unified Path Planner (UPP) that uses modified heuristics and a dynamic safety cost function to balance safety and optimality. The level of safety can be adjusted via tunable parameters, trading off against computational complexity. We demonstrate the planner’s performance in simulations, showing how parameter variation affects results. UPP is compared with various traditional and safe-optimal planning algorithms across different scenarios. We also validate it on a TurtleBot, where the robot successfully finds safe and sub-optimal paths.
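To make the idea of trading path length against safety concrete, here is a minimal sketch of grid A* with an added obstacle-proximity penalty. The penalty shape and the parameters `safety_weight` and `safety_radius` are hypothetical stand-ins; UPP's actual heuristic and safety cost function are not given in the abstract.

```python
import heapq


def plan(grid, start, goal, safety_weight=1.0, safety_radius=2):
    """A* on a 4-connected grid; grid[r][c] == 1 marks an obstacle."""
    rows, cols = len(grid), len(grid[0])
    obstacles = [(r, c) for r in range(rows) for c in range(cols) if grid[r][c] == 1]

    def safety_cost(cell):
        # Hypothetical penalty: grows as the cell gets closer than safety_radius to an obstacle.
        if not obstacles:
            return 0.0
        nearest = min(abs(cell[0] - r) + abs(cell[1] - c) for r, c in obstacles)
        return safety_weight * max(0, safety_radius - nearest)

    def heuristic(cell):
        # Manhattan distance never overestimates, since every step costs at least 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0.0, start)]
    best_cost = {start: 0.0}
    parent = {start: None}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols) or grid[nxt[0]][nxt[1]] == 1:
                continue
            new_cost = cost + 1.0 + safety_cost(nxt)
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None  # goal unreachable
```

With `safety_weight=0` this reduces to ordinary shortest-path A*; raising the weight makes the planner accept longer routes that keep more clearance from obstacles.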
[AI-180] Recommendation with Generative Models
[Quick Read]: This paper addresses the unclear model taxonomy, the blurry path of technical evolution, and the lack of a systematic survey of cross-domain applications in generative recommender systems (Gen-RecSys). The key to the solution is a new taxonomy of deep generative models (DGMs) that divides them into three types: ID-driven models, large language models (LLMs), and multimodal models. The taxonomy clarifies the technical characteristics and architectural differences of each type, provides a structured guide to Gen-RecSys developments in areas such as conversational AI and multimedia content generation, and emphasizes building robust evaluation frameworks to address the potential risks of generative models.
Link: https://arxiv.org/abs/2409.15173
Authors: Yashar Deldjoo, Zhankui He, Julian McAuley, Anton Korikov, Scott Sanner, Arnau Ramisa, Rene Vidal, Maheswaran Sathiamoorthy, Atoosa Kasrizadeh, Silvia Milano, Francesco Ricci
Affiliation: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: This submission is a full-length book, expanding significantly on two chapters previously submitted (arXiv:2409.10993v1, arXiv:2408.10946v1). It includes additional chapters, context, analysis, and content, providing a comprehensive presentation of the subject. We have ensured it is appropriately presented as a new, distinct work. arXiv admin note: substantial text overlap with arXiv:2409.10993
Abstract:Generative models are a class of AI models capable of creating new instances of data by learning and sampling from their statistical distributions. In recent years, these models have gained prominence in machine learning due to the development of approaches such as generative adversarial networks (GANs), variational autoencoders (VAEs), and transformer-based architectures such as GPT. These models have applications across various domains, such as image generation, text synthesis, and music composition. In recommender systems, generative models, referred to as Gen-RecSys, improve the accuracy and diversity of recommendations by generating structured outputs, text-based interactions, and multimedia content. By leveraging these capabilities, Gen-RecSys can produce more personalized, engaging, and dynamic user experiences, expanding the role of AI in eCommerce, media, and beyond. Our book goes beyond existing literature by offering a comprehensive understanding of generative models and their applications, with a special focus on deep generative models (DGMs) and their classification. We introduce a taxonomy that categorizes DGMs into three types: ID-driven models, large language models (LLMs), and multimodal models. Each category addresses unique technical and architectural advancements within its respective research area. This taxonomy allows researchers to easily navigate developments in Gen-RecSys across domains such as conversational AI and multimodal content generation. Additionally, we examine the impact and potential risks of generative models, emphasizing the importance of robust evaluation frameworks.
[AI-181] Rethinking Self-Replication: Detecting Distributed Selfhood in the Outlier Cellular Automaton
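[Quick Read]: This paper asks whether spontaneous self-replication can emerge in a cellular automaton (CA) without deliberate design or special initial conditions, and whether such structures can exist in a distributed form in which multiple components cooperate. The key to the solution is a data-driven causal reconstruction framework that rebuilds the full causal ancestry of patterns in a deterministic cellular automaton, so that self-replicating structures can be rigorously identified through explicit causal lineages. With this method the authors show that self-replicators under the Outlier rule are not only spontaneous and robust but are also often composed of multiple disjoint clusters working in coordination, which challenges conventional notions of individuality and replication.
Link: https://arxiv.org/abs/2508.08047
Authors: Arend Hintze, Clifford Bohm
Affiliation: unknown
Categories: Cellular Automata and Lattice Gases (nlin.CG); Artificial Intelligence (cs.AI)
Comments: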
Abstract:Spontaneous self-replication in cellular automata has long been considered rare, with most known examples requiring careful design or artificial initialization. In this paper, we present formal, causal evidence that such replication can emerge unassisted – and that it can do so in a distributed, multi-component form. Building on prior work identifying complex dynamics in the Outlier rule, we introduce a data-driven framework that reconstructs the full causal ancestry of patterns in a deterministic cellular automaton. This allows us to rigorously identify self-replicating structures via explicit causal lineages. Our results show definitively that self-replicators in the Outlier CA are not only spontaneous and robust, but are also often composed of multiple disjoint clusters working in coordination, raising questions about some conventional notions of individuality and replication in artificial life systems.
[AI-182] Exploring Strategies for Personalized Radiation Therapy: Part III Identifying genetic determinants for Radiation Response with Meta Learning
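[Quick Read]: This paper addresses the variation in treatment outcomes that arises when radiotherapy strategies ignore tumor heterogeneity, namely how to predict radiosensitivity accurately from the gene expression data of individual cell lines. Traditional methods such as the Radiosensitivity Index (RSI) rely on a linear model with a fixed 10-gene signature, assume that gene contributions are the same across all tumor types, and ignore expression magnitude and gene-gene interactions, so they adapt poorly to complex biological contexts. The key to the solution is a meta-learning framework that learns transferable structure across tasks while preserving sample-specific adaptability, allowing the model to adjust the weight of each gene dynamically from very few samples (one-shot). This improves prediction accuracy across tumor subtypes (such as adenocarcinoma and large cell carcinoma) and reveals context-dependent patterns of gene influence, offering a new path toward personalized radiotherapy.
Link: https://arxiv.org/abs/2508.08030
Authors: Hao Peng, Yuanyuan Zhang, Steve Jiang, Robert Timmerman, John Minna
Affiliation: unknown
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: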
Abstract:Radiation response in cancer is shaped by complex, patient-specific biology, yet current treatment strategies often rely on uniform dose prescriptions without accounting for tumor heterogeneity. In this study, we introduce a meta-learning framework for one-shot prediction of radiosensitivity, measured by SF2, using cell-line-level gene expression data. Unlike the widely used Radiosensitivity Index (RSI), a rank-based linear model trained on a fixed 10-gene signature, our proposed meta-learned model allows the importance of each gene to vary by sample through fine-tuning. This flexibility addresses key limitations of static models like RSI, which assume uniform gene contributions across tumor types and discard expression magnitude and gene-gene interactions. Our results show that meta-learning offers robust generalization to unseen samples and performs well in tumor subgroups with high radiosensitivity variability, such as adenocarcinoma and large cell carcinoma. By learning transferable structure across tasks while preserving sample-specific adaptability, our approach enables rapid adaptation to individual samples, improving predictive accuracy across diverse tumor subtypes while uncovering context-dependent patterns of gene influence that may inform personalized therapy.
[AI-183] Auditory Intelligence: Understanding the World Through Sound
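[Quick Read]: This paper addresses the limitation that current audio understanding tasks (such as sound event detection, acoustic scene classification, automated audio captioning, and audio question answering) remain at surface-level recognition: systems can identify what happened but cannot explain why it happened, what it implies, or how it unfolds in context. The key to the solution is a cognitively inspired framework for layered, situated auditory intelligence. It introduces four new task paradigms, ASPIRE (time-frequency pattern captioning), SODA (hierarchical event/scene description), AUX (causal explanation), and AUGMENT (goal-driven interpretation), which extend audio understanding from perception to reasoning and interaction and aim at more generalizable, explainable, and human-aligned auditory intelligence.
Link: https://arxiv.org/abs/2508.07829
Authors: Hyeonuk Nam
Affiliation: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Position paper without experimental/quantitative validation. Not submitted to any journal/conference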
Abstract:Recent progress in auditory intelligence has yielded high-performing systems for sound event detection (SED), acoustic scene classification (ASC), automated audio captioning (AAC), and audio question answering (AQA). Yet these tasks remain largely constrained to surface-level recognition, capturing what happened but not why, what it implies, or how it unfolds in context. I propose a conceptual reframing of auditory intelligence as a layered, situated process that encompasses perception, reasoning, and interaction. To instantiate this view, I introduce four cognitively inspired task paradigms (ASPIRE, SODA, AUX, and AUGMENT) that structure auditory understanding across time-frequency pattern captioning, hierarchical event/scene description, causal explanation, and goal-driven interpretation, respectively. Together, these paradigms provide a roadmap toward more generalizable, explainable, and human-aligned auditory intelligence, and are intended to catalyze a broader discussion of what it means for machines to understand sound.
[AI-184] From Product Hilbert Spaces to the Generalized Koopman Operator and the Nonlinear Fundamental Lemma
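[Quick Read]: This paper addresses two key problems: generalizing the Koopman operator to systems with control inputs, and deriving a fundamental lemma for nonlinear systems. Both are central to data-driven control of nonlinear systems, and the difficulty lies in constructing observable functions and a corresponding Hilbert space that enable an infinite-dimensional linear system representation. The key to the solution is a product Hilbert space constructed as the tensor product of the Hilbert spaces of the state and input observable functions. An orthonormal expansion in this space yields the generalized Koopman operator, which gives an infinite-dimensional linear representation of the nonlinear dynamics. Exploiting the bilinear structure of this model, the paper further derives a nonlinear fundamental lemma, providing a theoretical basis and a computable framework for data-driven control.
Link: https://arxiv.org/abs/2508.07494
Authors: Mircea Lazar
Affiliation: unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Comments: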
Abstract:The generalization of the Koopman operator to systems with control input and the derivation of a nonlinear fundamental lemma are two open problems that play a key role in the development of data-driven control methods for nonlinear systems. Both problems hinge on the construction of observable or basis functions and their corresponding Hilbert space that enable an infinite-dimensional, linear system representation. In this paper we derive a novel solution to these problems based on orthonormal expansion in a product Hilbert space constructed as the tensor product between the Hilbert spaces of the state and input observable functions, respectively. We prove that there exists an infinite-dimensional linear operator, i.e. the generalized Koopman operator, from the constructed product Hilbert space to the Hilbert space corresponding to the lifted state propagated forward in time. A scalable data-driven method for computing finite-dimensional approximations of generalized Koopman operators and several choices of observable functions are also presented. Moreover, we derive a nonlinear fundamental lemma by exploiting the bilinear structure of the infinite-dimensional generalized Koopman model. The effectiveness of the developed generalized Koopman embedding is illustrated on the Van der Pol oscillator.
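The tensor-product construction can be illustrated in finite dimensions. The sketch below fits a matrix that maps the Kronecker product of state and input observables to the lifted next state by ordinary least squares (an EDMD-style fit). The observables `psi` and `phi`, the toy bilinear system, and the function name are all assumptions for illustration; this is not necessarily the scalable method the paper presents.

```python
import numpy as np


def psi(x):
    """State observables (assumed for this toy example): [1, x]."""
    return np.array([1.0, x])


def phi(u):
    """Input observables (assumed for this toy example): [1, u]."""
    return np.array([1.0, u])


def fit_generalized_koopman(xs, us, xs_next):
    """Least-squares K such that psi(x_next) ~= K @ kron(psi(x), phi(u))."""
    # Product-space features: kron([1, x], [1, u]) = [1, u, x, x*u].
    Z = np.stack([np.kron(psi(x), phi(u)) for x, u in zip(xs, us)])
    Y = np.stack([psi(xn) for xn in xs_next])
    K_transposed = np.linalg.lstsq(Z, Y, rcond=None)[0]
    return K_transposed.T


# Toy bilinear system (assumed): x_next = 0.9 x + 0.5 x u + 0.2 u.
rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, 200)
us = rng.uniform(-1.0, 1.0, 200)
xs_next = 0.9 * xs + 0.5 * xs * us + 0.2 * us
K = fit_generalized_koopman(xs, us, xs_next)
```

For this toy system the product basis represents the dynamics exactly, so the second row of `K` recovers the coefficients of `u`, `x`, and `x*u`; for a general nonlinear system a finite basis gives only an approximation.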
[AI-185] Real-Time Analysis of Unstructured Data with Machine Learning on Heterogeneous Architectures
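[Quick Read]: This thesis addresses real-time performance and energy efficiency in large-scale data processing for high-energy physics experiments, in particular the LHCb experiment at the Large Hadron Collider (LHC), where real-time filtering runs at rates as high as 40 MHz and conventional trigger systems struggle to meet growing computational demands. The key to the solution is deploying a graph neural network (GNN)-based charged particle track reconstruction pipeline end to end inside LHCb's first-level trigger, running entirely on GPUs, with the aim of increasing throughput and reducing energy consumption. The model is additionally accelerated on an FPGA architecture, and its power consumption and processing speed are compared against the GPU implementation, examining how modern hardware and advanced machine learning algorithms can be combined for high-frequency data streams.
Link: https://arxiv.org/abs/2508.07423
Authors: Fotis I. Giasemis
Affiliation: unknown
Categories: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: PhD thesis, Chapters 8 and 9 include results from work performed in collaboration with Anthony Correia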
Abstract:As the particle physics community needs higher and higher precisions in order to test our current model of the subatomic world, larger and larger datasets are necessary. With upgrades scheduled for the detectors of colliding-beam experiments around the world, and specifically at the Large Hadron Collider at CERN, more collisions and more complex interactions are expected. This directly implies an increase in data produced and consequently in the computational resources needed to process them. At CERN, the amount of data produced is gargantuan. This is why the data have to be heavily filtered and selected in real time before being permanently stored. This data can then be used to perform physics analyses, in order to expand our current understanding of the universe and improve the Standard Model of physics. This real-time filtering, known as triggering, involves complex processing happening often at frequencies as high as 40 MHz. This thesis contributes to understanding how machine learning models can be efficiently deployed in such environments, in order to maximize throughput and minimize energy consumption. Inevitably, modern hardware designed for such tasks and contemporary algorithms are needed in order to meet the challenges posed by the stringent, high-frequency data rates. In this work, I present our graph neural network-based pipeline, developed for charged particle track reconstruction at the LHCb experiment at CERN. The pipeline was implemented end-to-end inside LHCb’s first-level trigger, entirely on GPUs. Its performance was compared against the classical tracking algorithms currently in production at LHCb. The pipeline was also accelerated on the FPGA architecture, and its performance in terms of power consumption and processing speed was compared against the GPU implementation.
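[AI-186] Leveraging GNN to Enhance MEF Method in Predicting ENSO
[Quick Read]: This paper addresses the limited accuracy of long-lead El Niño-Southern Oscillation (ENSO) forecasts caused by high spread among ensemble members. The core of the solution is a graph-based analysis framework: an undirected graph models the similarity between 80 ensemble members (measured by root-mean-square error, RMSE, and correlation coefficient), community detection identifies a subset of structurally similar and accurate predictions, and the best 20 members are averaged to form the final forecast. By removing noise and emphasizing ensemble coherence, the method improves forecast skill, shows robust statistical characteristics among top-performing members and stability in complex long-lead scenarios, and is model-agnostic, so it can be extended to other forecasting models with large ensembles.
Link: https://arxiv.org/abs/2508.07410
Authors: Saghar Ganji, Mohammad Naisipour
Affiliation: unknown
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures, 2 tables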
Abstract:Reliable long-lead forecasting of the El Niño-Southern Oscillation (ENSO) remains a long-standing challenge in climate science. The previously developed Multimodal ENSO Forecast (MEF) model uses 80 ensemble predictions from two independent deep learning modules: a 3D Convolutional Neural Network (3D-CNN) and a time-series module. In that approach, the outputs of the two modules are combined using a weighting strategy in which one is prioritized over the other as a function of global performance. Individual ensemble members were not weighted or tested separately, however, which may have limited the model's ability to make the best use of high-performing but spread-out forecasts. In this study, we propose a framework that employs graph-based analysis to directly model the similarity between all 80 members of the ensemble. By constructing an undirected graph whose vertices are ensemble outputs and whose edge weights measure similarity (via RMSE and correlation), we identify and cluster structurally similar and accurate predictions, from which we obtain an optimized subset of 20 members using community detection methods. The final prediction is then obtained by averaging this optimized subset. This method improves the forecast skill through noise removal and emphasis on ensemble coherence. Interestingly, our graph-based selection shows robust statistical characteristics among top performers, offering new insights into ensemble behavior. In addition, we observe that while the GNN-based approach does not always outperform the baseline MEF in every scenario, it produces more stable and consistent outputs, particularly in compound long-lead situations. The approach is also model-agnostic, suggesting that it can be applied directly to other forecasting models with very large ensemble outputs, such as statistical, physical, or hybrid models.
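The selection step described above can be sketched in a few lines. The version below is a deliberate simplification: it connects members whose pairwise RMSE is under a threshold and keeps the largest connected component, which stands in for the community detection the abstract mentions. The function names and the `max_rmse` threshold are hypothetical, and correlation-based similarity is omitted.

```python
import math


def rmse(a, b):
    """Root-mean-square error between two equally long forecast series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def select_coherent_members(members, max_rmse):
    """Indices of the largest group of members linked by pairwise RMSE <= max_rmse."""
    n = len(members)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rmse(members[i], members[j]) <= max_rmse:
                neighbors[i].add(j)
                neighbors[j].add(i)
    seen, best = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search collects one connected component of the similarity graph.
        component, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nb in neighbors[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        if len(component) > len(best):
            best = component
    return sorted(best)


def average_forecast(members, selected):
    """Point-wise mean of the selected members."""
    steps = len(members[0])
    return [sum(members[i][t] for i in selected) / len(selected) for t in range(steps)]
```

Averaging only the coherent group drops outlying members, which is the noise-removal effect the abstract describes.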
[AI-187] A Spin Glass Characterization of Neural Networks
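[Quick Read]: This paper addresses how to characterize the structural properties of an individual feedforward neural network (FNN) from a statistical physics perspective, in particular hidden complexity that conventional metrics such as loss or accuracy cannot capture. The key to the solution is a descriptive framework based on a Hopfield-type spin glass model: overlaps between simulated replica samples serve as a characteristic descriptor, linking the replica symmetry breaking (RSB) phenomenon in spin glasses to key FNN properties such as data fitting, capacity, generalization, and robustness. Unlike earlier analyses restricted to model ensembles, the method provides a computable structural diagnostic for individual network instances and reveals nontrivial structural properties that conventional metrics do not reflect.
Link: https://arxiv.org/abs/2508.07397
Authors: Jun Li
Affiliation: unknown
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: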
Abstract:This work presents a statistical mechanics characterization of neural networks, motivated by the replica symmetry breaking (RSB) phenomenon in spin glasses. A Hopfield-type spin glass model is constructed from a given feedforward neural network (FNN). Overlaps between simulated replica samples serve as a characteristic descriptor of the FNN. The connection between the spin-glass description and commonly studied properties of the FNN (such as data fitting, capacity, generalization, and robustness) has been investigated and empirically demonstrated. Unlike prior analytical studies that focus on model ensembles, this method provides a computable descriptor for individual network instances, which reveals nontrivial structural properties that are not captured by conventional metrics such as loss or accuracy. Preliminary results suggest its potential for practical applications such as model inspection, safety verification, and detection of hidden vulnerabilities.
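The descriptor rests on the standard replica overlap from spin glass theory, q_ab = (1/N) Σ_i s_i^a s_i^b for spin configurations with entries in {-1, +1}. A minimal sketch is below; how the paper builds the spin model from the network weights and samples the replicas is not covered, and the function names are hypothetical.

```python
def overlap(replica_a, replica_b):
    """Overlap q_ab = (1/N) * sum_i s_i^a * s_i^b for spins in {-1, +1}."""
    return sum(a * b for a, b in zip(replica_a, replica_b)) / len(replica_a)


def pairwise_overlaps(replicas):
    """Overlaps for every unordered pair of replicas; their histogram is the descriptor."""
    return [
        overlap(replicas[i], replicas[j])
        for i in range(len(replicas))
        for j in range(i + 1, len(replicas))
    ]
```

An overlap of 1 means two replicas settled into the same configuration, -1 the mirrored one, and values near 0 unrelated ones; the shape of the overlap histogram is what signals replica symmetry breaking.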
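[AI-188] CROP: Integrating Topological and Spatial Structures via Cross-View Prefixes for Molecular LLMs
[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in capturing complex molecular structure when they rely only on molecular sequences for molecular understanding tasks. Existing methods do not effectively integrate multi-view structural information, which limits how fully models can represent molecular properties. The key to the solution is the CROss-view Prefixes (CROP) framework, which efficiently fuses two complementary views of a molecule: the topological graph view and the spatial-configuration image view. Its core innovations are: (1) a SMILES Guided Resampler that resamples the multi-view structures into fixed-length prefixes, balancing efficiency and extensibility, with the LLM's self-encoded molecular sequence guiding the resampling to improve prefix quality; and (2) a Structural Embedding Gate that converts the resulting embeddings into prefixes the LLM can consume. Together they improve LLM performance on molecule captioning, IUPAC name prediction, and molecular property prediction.
Link: https://arxiv.org/abs/2508.06917
Authors: Jianting Tang, Yubo Wang, Haoyu Cao, Linli Xu
Affiliation: unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: Accepted to ACMMM 2025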
Abstract:Recent advances in molecular science have been propelled significantly by large language models (LLMs). However, their effectiveness is limited when relying solely on molecular sequences, which fail to capture the complex structures of molecules. Beyond sequence representation, molecules exhibit two complementary structural views: the first focuses on the topological relationships between atoms, as exemplified by the graph view; and the second emphasizes the spatial configuration of molecules, as represented by the image view. The two types of views provide unique insights into molecular structures. To leverage these views collaboratively, we propose the CROss-view Prefixes (CROP) to enhance LLMs’ molecular understanding through efficient multi-view integration. CROP possesses two advantages: (i) efficiency: by jointly resampling multiple structural views into fixed-length prefixes, it avoids excessive consumption of the LLM’s limited context length and allows easy expansion to more views; (ii) effectiveness: by utilizing the LLM’s self-encoded molecular sequences to guide the resampling process, it boosts the quality of the generated prefixes. Specifically, our framework features a carefully designed SMILES Guided Resampler for view resampling, and a Structural Embedding Gate for converting the resulting embeddings into LLM’s prefixes. Extensive experiments demonstrate the superiority of CROP in tasks including molecule captioning, IUPAC name prediction and molecule property prediction.
[AI-189] Understanding Human Limits in Pattern Recognition: A Computational Model of Sequential Reasoning in Rock Paper Scissors
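[Quick Read]: This paper addresses how humans predict others' strategies from patterns in their behavior and what computational constraints limit this ability. The core question is whether, in repeated games of rock, paper, scissors, people can effectively detect regularities in an opponent's behavior, and where the limits of the underlying cognitive mechanisms lie. The key to the solution is using Hypothetical Minds (HM), a large language model-based agent, as a cognitive model that simulates how people generate and test hypotheses about opponent strategies. The study finds that HM reproduces human patterns of success and failure under the same experimental conditions. Further ablation and augmentation experiments indicate that accurate hypothesis generation is the main bottleneck: when given natural language descriptions of the opponents' strategies, HM reaches an 80% win rate against 6 of 7 opponents, suggesting that the cognitive limit lies mainly in constructing hypotheses rather than in reasoning or execution. The approach shows how model-based cognitive analysis can yield testable hypotheses about human decision-making.
Link: https://arxiv.org/abs/2508.06503
Authors: Logan Cross, Erik Brockbank, Tobias Gerstenberg, Judith E. Fan, Daniel L. K. Yamins, Nick Haber
Affiliation: unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: To be published in Proceedings of the 8th Annual Conference on Cognitive Computational Neuroscience (2025)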
Abstract:How do we predict others from patterns in their behavior and what are the computational constraints that limit this ability? We investigate these questions by modeling human behavior over repeated games of rock, paper, scissors from Brockbank and Vul (2024). Against algorithmic opponents that varied in strategic sophistication, people readily exploit simple transition patterns (e.g., consistently playing rock after paper) but struggle to detect more complex sequential dependencies. To understand the cognitive mechanisms underlying these abilities and their limitations, we deploy Hypothetical Minds (HM), a large language model-based agent that generates and tests hypotheses about opponent strategies, as a cognitive model of this behavior (Cross et al., 2024). We show that when applied to the same experimental conditions, HM closely mirrors human performance patterns, succeeding and failing in similar ways. To better understand the source of HM’s failures and whether people might face similar cognitive bottlenecks in this context, we performed a series of ablations and augmentations targeting different components of the system. When provided with natural language descriptions of the opponents’ strategies, HM successfully exploited 6/7 bot opponents with win rates of 80%, suggesting that accurate hypothesis generation is the primary cognitive bottleneck in this task. Further, by systematically manipulating the model’s hypotheses through pedagogically-inspired interventions, we find that the model substantially updates its causal understanding of opponent behavior, revealing how model-based analyses can produce testable hypotheses about human cognition.
[AI-190] Computing with Canonical Microcircuits
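[Quick Read]: This paper addresses the limited parameter efficiency and interpretability of current deep learning models, namely how to sharply reduce parameter counts while keeping performance high and making model behavior more biologically plausible and interpretable. The key to the solution is an architecture inspired by canonical microcircuits (CMCs), which are ubiquitous in the cortex, implemented as neural ODEs: spiny stellate, inhibitory, and pyramidal neurons form an 8-dimensional dynamical system with biologically plausible recurrent connections. Experiments show that a single CMC node reaches 97.8% accuracy on MNIST, while hierarchical configurations with learnable inter-regional connectivity further improve performance on more complex image tasks, using substantially fewer parameters than conventional deep learning models. Phase space analysis also reveals distinct dynamical trajectories for different input classes, an interpretable behavior resembling that of biological systems, which supports the approach's efficiency and interpretability advantages.
Link: https://arxiv.org/abs/2508.06501
Authors: PK Douglas
Affiliation: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 20 pages, 13 figures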
Abstract:The human brain represents the only known example of general intelligence that naturally aligns with human values. On a mere 20-watt power budget, the brain achieves robust learning and adaptive decision-making in ways that continue to elude advanced AI systems. Inspired by the brain, we present a computational architecture based on canonical microcircuits (CMCs) - stereotyped patterns of neurons found ubiquitously throughout the cortex. We implement these circuits as neural ODEs comprising spiny stellate, inhibitory, and pyramidal neurons, forming an 8-dimensional dynamical system with biologically plausible recurrent connections. Our experiments show that even a single CMC node achieves 97.8 percent accuracy on MNIST, while hierarchical configurations - with learnable inter-regional connectivity and recurrent connections - yield improved performance on more complex image benchmarks. Notably, our approach achieves competitive results using substantially fewer parameters than conventional deep learning models. Phase space analysis revealed distinct dynamical trajectories for different input classes, highlighting interpretable, emergent behaviors observed in biological systems. These findings suggest that neuromorphic computing approaches can improve both efficiency and interpretability in artificial neural networks, offering new directions for parameter-efficient architectures grounded in the computational principles of the human brain.
[AI-191] Network-Specific Models for Multimodal Brain Response Prediction
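[Quick Read]: This paper targets the accuracy of predicting brain responses to complex multimodal movie stimuli. Conventional methods typically treat the brain as a homogeneous system and ignore the heterogeneity of information processing across functional networks. The key to the solution is a cluster strategy based on the Yeo 7-network parcellation: the seven functional networks are grouped into four functional clusters, and a separate multi-subject multi-layer perceptron (MLP) model is trained for each cluster. This enables cluster-specific optimization and adaptive memory modeling, so each model can adjust its temporal dynamics and modality weighting to the functional role of its target network. The method significantly improves prediction accuracy for the 1,000 cortical regions of the Schaefer atlas, placed eighth in the Algonauts Project 2025 Challenge, and achieved out-of-distribution (OOD) correlation scores nearly double those of the baseline model.
Link: https://arxiv.org/abs/2508.06499
Authors: Andrea Corsico,Giorgia Rigamonti,Simone Zini,Luigi Celona,Paolo Napoletano
Affiliation: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: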
Abstract:In this work, we present a network-specific approach for predicting brain responses to complex multimodal movies, leveraging the Yeo 7-network parcellation of the Schaefer atlas. Rather than treating the brain as a homogeneous system, we grouped the seven functional networks into four clusters and trained separate multi-subject, multi-layer perceptron (MLP) models for each. This architecture supports cluster-specific optimization and adaptive memory modeling, allowing each model to adjust temporal dynamics and modality weighting based on the functional role of its target network. Our results demonstrate that this clustered strategy significantly enhances prediction accuracy across the 1,000 cortical regions of the Schaefer atlas. The final model achieved an eighth-place ranking in the Algonauts Project 2025 Challenge, with out-of-distribution (OOD) correlation scores nearly double those of the baseline model used in the selection phase. Code is available at this https URL.
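[AI-192] Forecasting Commodity Price Shocks Using Temporal and Semantic Fusion of Prices Signals and Agentic Generative AI Extracted Economic News
[Quick Read]: This paper addresses accurate forecasting of commodity price spikes, which matters most for countries with limited economic buffers, where sudden price increases can strain national budgets, disrupt import-reliant industries, and threaten food and energy security. The key to the solution is a hybrid forecasting framework that fuses historical price time series with semantic signals from global macroeconomic news. At its core is an agentic generative AI pipeline; the architecture uses dual-stream Long Short-Term Memory (LSTM) networks with attention mechanisms to fuse structured time-series inputs with embeddings of fact-checked news. On a 64-year dataset the method achieves a mean AUC of 0.94 and an overall accuracy of 0.91, substantially outperforming traditional machine learning baselines. Ablation experiments confirm that the news information is decisive for performance, underscoring the value of the real-world context carried by unstructured text.
Link: https://arxiv.org/abs/2508.06497
Authors: Mohammed-Khalil Ghali,Cecil Pang,Oscar Molina,Carlos Gershenson-Garcia,Daehan Won
Affiliation: Unknown
Categories: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: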
Abstract:Accurate forecasting of commodity price spikes is vital for countries with limited economic buffers, where sudden increases can strain national budgets, disrupt import-reliant sectors, and undermine food and energy security. This paper introduces a hybrid forecasting framework that combines historical commodity price data with semantic signals derived from global economic news, using an agentic generative AI pipeline. The architecture integrates dual-stream Long Short-Term Memory (LSTM) networks with attention mechanisms to fuse structured time-series inputs with semantically embedded, fact-checked news summaries collected from 1960 to 2023. The model is evaluated on a 64-year dataset comprising normalized commodity price series and temporally aligned news embeddings. Results show that the proposed approach achieves a mean AUC of 0.94 and an overall accuracy of 0.91 substantially outperforming traditional baselines such as logistic regression (AUC = 0.34), random forest (AUC = 0.57), and support vector machines (AUC = 0.47). Additional ablation studies reveal that the removal of attention or dimensionality reduction leads to moderate declines in performance, while eliminating the news component causes a steep drop in AUC to 0.46, underscoring the critical value of incorporating real-world context through unstructured text. These findings demonstrate that integrating agentic generative AI with deep learning can meaningfully improve early detection of commodity price shocks, offering a practical tool for economic planning and risk mitigation in volatile market environments while saving the very high costs of operating a full generative AI agents pipeline.
Machine Learning
[LG-0] Cross-Subject and Cross-Montage EEG Transfer Learning via Individual Tangent Space Alignment and Spatial-Riemannian Feature Fusion
Link: https://arxiv.org/abs/2508.08216
Authors: Nicole Lai-Tan,Xiao Gu,Marios G. Philiastides,Fani Deligianni
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 12 pages, 5 figures
Abstract:Personalised music-based interventions offer a powerful means of supporting motor rehabilitation by dynamically tailoring auditory stimuli to provide external timekeeping cues, modulate affective states, and stabilise gait patterns. Generalisable Brain-Computer Interfaces (BCIs) thus hold promise for adapting these interventions across individuals. However, inter-subject variability in EEG signals, further compounded by movement-induced artefacts and motor planning differences, hinders the generalisability of BCIs and results in lengthy calibration processes. We propose Individual Tangent Space Alignment (ITSA), a novel pre-alignment strategy incorporating subject-specific recentering, distribution matching, and supervised rotational alignment to enhance cross-subject generalisation. Our hybrid architecture fuses Regularised Common Spatial Patterns (RCSP) with Riemannian geometry in parallel and sequential configurations, improving class separability while maintaining the geometric structure of covariance matrices for robust statistical computation. Using leave-one-subject-out cross-validation, ITSA demonstrates significant performance improvements across subjects and conditions. The parallel fusion approach shows the greatest enhancement over its sequential counterpart, with robust performance maintained across varying data conditions and electrode configurations. The code will be made publicly available at the time of publication.
[LG-1] Federated Learning for Epileptic Seizure Prediction Across Heterogeneous EEG Datasets
Link: https://arxiv.org/abs/2508.08159
Authors: Cem Ata Baykara,Saurav Raj Pandey,Ali Burak Ünal,Harlin Lee,Mete Akgün
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Developing accurate and generalizable epileptic seizure prediction models from electroencephalography (EEG) data across multiple clinical sites is hindered by patient privacy regulations and significant data heterogeneity (non-IID characteristics). Federated Learning (FL) offers a privacy-preserving framework for collaborative training, but standard aggregation methods like Federated Averaging (FedAvg) can be biased by dominant datasets in heterogeneous settings. This paper investigates FL for seizure prediction using a single EEG channel across four diverse public datasets (Siena, CHB-MIT, Helsinki, NCH), representing distinct patient populations (adult, pediatric, neonate) and recording conditions. We implement privacy-preserving global normalization and propose a Random Subset Aggregation strategy, where each client trains on a fixed-size random subset of its data per round, ensuring equal contribution during aggregation. Our results show that locally trained models fail to generalize across sites, and standard weighted FedAvg yields highly skewed performance (e.g., 89.0% accuracy on CHB-MIT but only 50.8% on Helsinki and 50.6% on NCH). In contrast, Random Subset Aggregation significantly improves performance on under-represented clients (accuracy increases to 81.7% on Helsinki and 68.7% on NCH) and achieves a superior macro-average accuracy of 77.1% and pooled accuracy of 80.0% across all sites, demonstrating a more robust and fair global model. This work highlights the potential of balanced FL approaches for building effective and generalizable seizure prediction systems in realistic, heterogeneous multi-hospital environments while respecting data privacy.
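The equal-contribution idea in the abstract above can be illustrated with a small sketch. It assumes the usual FedAvg convention of weighting each client by the number of samples it trained on in a round; the function names (`sample_fixed_subset`, `fedavg`), the subset size of 50, and the toy parameter vectors are hypothetical and not taken from the paper.

```python
import random

def sample_fixed_subset(data, subset_size, rng):
    """Draw a fixed-size random subset of one client's local data for a round."""
    return rng.sample(data, min(subset_size, len(data)))

def fedavg(client_params, samples_used):
    """Average client parameter vectors, weighted by samples used this round."""
    total = sum(samples_used)
    dim = len(client_params[0])
    return [
        sum(n / total * params[i] for params, n in zip(client_params, samples_used))
        for i in range(dim)
    ]

# Two hypothetical clients with very different dataset sizes.
rng = random.Random(0)
client_data = [list(range(900)), list(range(100))]
client_params = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for locally trained weights

# Standard weighted FedAvg: the large client dominates the global model.
weighted = fedavg(client_params, [len(d) for d in client_data])

# Fixed-size subsets: every client uses the same number of samples,
# so size-weighted averaging reduces to an equal-weight average.
subsets = [sample_fixed_subset(d, 50, rng) for d in client_data]
balanced = fedavg(client_params, [len(s) for s in subsets])
```

In this toy setting the size-weighted average leans heavily toward the 900-sample client, while the fixed-size subsets give both clients the same influence on the aggregated parameters.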
[LG-2] FairFLRep: Fairness aware fault localization and repair of Deep Neural Networks
Link: https://arxiv.org/abs/2508.08151
Authors: Moses Openja,Paolo Arcaini,Foutse Khomh,Fuyuki Ishikawa
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:
Abstract:Deep neural networks (DNNs) are being utilized in various aspects of our daily lives, including high-stakes decision-making applications that impact individuals. However, these systems reflect and amplify bias from the data used during training and testing, potentially resulting in biased behavior and inaccurate decisions, for instance different misclassification rates between white and black sub-populations. Effectively and efficiently identifying and correcting biased behavior in DNNs remains a challenge. This paper introduces FairFLRep, an automated fairness-aware fault localization and repair technique that identifies and corrects potentially bias-inducing neurons in DNN classifiers. FairFLRep focuses on adjusting neuron weights associated with sensitive attributes, such as race or gender, that contribute to unfair decisions. By analyzing the input-output relationships within the network, FairFLRep corrects neurons responsible for disparities in predictive quality parity. We evaluate FairFLRep on four image classification datasets using two DNN classifiers, and four tabular datasets with a DNN model. The results show that FairFLRep consistently outperforms existing methods in improving fairness while preserving accuracy. An ablation study confirms the importance of considering fairness during both fault localization and repair stages. Our findings also show that FairFLRep is more efficient than the baseline approaches in repairing the network.
[LG-3] OFAL: An Oracle-Free Active Learning Framework
Link: https://arxiv.org/abs/2508.08126
Authors: Hadi Khorsand,Vahid Pourahmadi
Categories: Machine Learning (cs.LG)
Comments:
Abstract:In the active learning paradigm, using an oracle to label data has always been a complex and expensive task, and with the emergence of large unlabeled data pools, it would be highly beneficial if we could achieve better results without relying on an oracle. This research introduces OFAL, an oracle-free active learning scheme that utilizes neural network uncertainty. OFAL uses the model’s own uncertainty to transform highly confident unlabeled samples into informative uncertain samples. First, we start with separating and quantifying different parts of uncertainty and introduce Monte Carlo Dropouts as an approximation of the Bayesian Neural Network model. Secondly, by adding a variational autoencoder, we go on to generate new uncertain samples by stepping toward the uncertain part of latent space starting from a confidence seed sample. By generating these new informative samples, we can perform active learning and enhance the model’s accuracy. Lastly, we try to compare and integrate our method with other widely used active learning sampling methods.
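One common way to separate uncertainty from Monte Carlo Dropout passes is the entropy-based split into total, aleatoric, and epistemic parts. The abstract does not say which decomposition OFAL uses, so the sketch below is an assumption; the function names and the toy probability vectors are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def decompose_uncertainty(mc_probs):
    """Split predictive uncertainty from T stochastic forward passes.

    mc_probs: list of T class-probability vectors, one per dropout pass.
    Returns (total, aleatoric, epistemic), where
      total     = entropy of the mean prediction,
      aleatoric = mean entropy of the individual predictions,
      epistemic = total - aleatoric (the mutual information).
    """
    num_passes = len(mc_probs)
    num_classes = len(mc_probs[0])
    mean_probs = [
        sum(p[k] for p in mc_probs) / num_passes for k in range(num_classes)
    ]
    total = entropy(mean_probs)
    aleatoric = sum(entropy(p) for p in mc_probs) / num_passes
    return total, aleatoric, total - aleatoric

# Passes agree on an ambiguous sample: uncertainty is all aleatoric.
agree = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])

# Passes are individually confident but disagree: uncertainty is all epistemic.
disagree = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

Under this split, a sample where the dropout passes disagree carries high epistemic uncertainty, which is the kind of sample an uncertainty-driven active learner would treat as informative.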
[LG-4] NeuroDx-LM: A Clinical Large-Scale Model for EEG-based Neurological Disorder Detection
Link: https://arxiv.org/abs/2508.08124
Authors: Guanghao Jin,Yuan Liang,Yihan Ma,Jingpei Wu,Guoyang Liu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Large-scale models pre-trained on Electroencephalography (EEG) have shown promise in clinical applications such as neurological disorder detection. However, the practical deployment of EEG-based large-scale models faces critical challenges such as limited labeled EEG data and suboptimal performance in clinical scenarios. To address these issues, we propose NeuroDx-LM, a novel large-scale model specifically designed for detecting EEG-based neurological disorders. Our key contributions include (i) a Selective Temporal-Frequency Embedding mechanism that adaptively captures complex temporal and spectral patterns in EEG signals; and (ii) a Progressive Feature-Aware Training strategy that refines feature representation in a two-stage process. In the first stage, our model learns the fundamental discriminative features of EEG activities; in the second stage, the model further extracts more specialized fine-grained features for accurate diagnostic performance. We evaluated NeuroDx-LM on the CHB-MIT and Schizophrenia datasets, achieving state-of-the-art performance in EEG-based seizure and schizophrenia detection, respectively. These results demonstrate the great potential of EEG-based large-scale models to advance clinical applicability. Our code is available at this https URL.
[LG-5] Fast and Generalizable parameter-embedded Neural Operators for Lithium-Ion Battery Simulation
Link: https://arxiv.org/abs/2508.08087
Authors: Amir Ali Panahi,Daniel Luder,Billy Wu,Gregory Offer,Dirk Uwe Sauer,Weihan Li
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: 31 pages, 6 figures
Abstract:Reliable digital twins of lithium-ion batteries must achieve high physical fidelity with sub-millisecond speed. In this work, we benchmark three operator-learning surrogates for the Single Particle Model (SPM): Deep Operator Networks (DeepONets), Fourier Neural Operators (FNOs) and a newly proposed parameter-embedded Fourier Neural Operator (PE-FNO), which conditions each spectral layer on particle radius and solid-phase diffusivity. Models are trained on simulated trajectories spanning four current families (constant, triangular, pulse-train, and Gaussian-random-field) and a full range of State-of-Charge (SOC) (0 % to 100 %). DeepONet accurately replicates constant-current behaviour but struggles with more dynamic loads. The basic FNO maintains mesh invariance and keeps concentration errors below 1 %, with voltage mean-absolute errors under 1.7 mV across all load types. Introducing parameter embedding marginally increases error, but enables generalisation to varying radii and diffusivities. PE-FNO executes approximately 200 times faster than a 16-thread SPM solver. Consequently, PE-FNO’s capabilities in inverse tasks are explored in a parameter estimation task with Bayesian optimisation, recovering anode and cathode diffusivities with 1.14 % and 8.4 % mean absolute percentage error, respectively, and 0.5918 percentage points higher error in comparison with classical methods. These results pave the way for neural operators to meet the accuracy, speed and parametric flexibility demands of real-time battery management, design-of-experiments and large-scale inference. PE-FNO outperforms conventional neural surrogates, offering a practical path towards high-speed and high-fidelity electrochemical digital twins.
[LG-6] Symbolic Quantile Regression for the Interpretable Prediction of Conditional Quantiles
Link: https://arxiv.org/abs/2508.08080
Authors: Cas Oude Hoekstra,Floris den Hengst
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Applications (stat.AP)
Comments:
Abstract:Symbolic Regression (SR) is a well-established framework for generating interpretable or white-box predictive models. Although SR has been successfully applied to create interpretable estimates of the average of the outcome, it is currently not well understood how it can be used to estimate the relationship between variables at other points in the distribution of the target variable. Such estimates of e.g. the median or an extreme value provide a fuller picture of how predictive variables affect the outcome and are necessary in high-stakes, safety-critical application domains. This study introduces Symbolic Quantile Regression (SQR), an approach to predict conditional quantiles with SR. In an extensive evaluation, we find that SQR outperforms transparent models and performs comparably to a strong black-box baseline without compromising transparency. We also show how SQR can be used to explain differences in the target distribution by comparing models that predict extreme and central outcomes in an airline fuel usage case study. We conclude that SQR is suitable for predicting conditional quantiles and understanding interesting feature influences at varying quantiles.
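Conditional quantiles are usually fitted by minimising the pinball (quantile) loss. The abstract does not state which objective SQR optimises, so treating the pinball loss as the fitness function is an assumption here; the function name and the toy numbers are hypothetical.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball loss for quantile level tau in (0, 1).

    Under-predictions are penalised with weight tau and over-predictions
    with weight (1 - tau), so the minimiser is the tau-th conditional quantile.
    """
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)

# For a high quantile (tau = 0.9), predicting too low costs more
# than predicting too high by the same amount.
too_low = pinball_loss([10.0], [8.0], tau=0.9)
too_high = pinball_loss([10.0], [12.0], tau=0.9)
```

A symbolic regressor could score each candidate expression with this loss at the chosen quantile level, in place of the squared error used when estimating the mean.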
[LG-7] ELF: Efficient Logic Synthesis by Pruning Redundancy in Refactoring
Link: https://arxiv.org/abs/2508.08073
Authors: Dimitris Tsaras,Xing Li,Lei Chen,Zhiyao Xie,Mingxuan Yuan
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)
Comments: Accepted to DAC 2025
Abstract:In electronic design automation, logic optimization operators play a crucial role in minimizing the gate count of logic circuits. However, their computation demands are high. Operators such as refactor conventionally form iterative cuts for each node, striving for a more compact representation, a task which fails 98% of the time on average. Prior research has sought to mitigate computational cost through parallelization. In contrast, our approach leverages a classifier to prune unsuccessful cuts preemptively, thus eliminating unnecessary resynthesis operations. Experiments on the refactor operator using the EPFL benchmark suite and 10 large industrial designs demonstrate that this technique can speed up logic optimization by 3.9x on average compared with the state-of-the-art ABC implementation.
[LG-8] Deep Learning-Based Analysis of Power Consumption in Gasoline Electric and Hybrid Vehicles
Link: https://arxiv.org/abs/2508.08034
Authors: Roksana Yahyaabadi,Ghazal Farhani,Taufiq Rahman,Soodeh Nikan,Abdullah Jirjees,Fadi Araji
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:Accurate power consumption prediction is crucial for improving efficiency and reducing environmental impact, yet traditional methods relying on specialized instruments or rigid physical models are impractical for large-scale, real-world deployment. This study introduces a scalable data-driven method using powertrain dynamic feature sets and both traditional machine learning and deep neural networks to estimate instantaneous and cumulative power consumption in internal combustion engine (ICE), electric vehicle (EV), and hybrid electric vehicle (HEV) platforms. ICE models achieved high instantaneous accuracy with mean absolute error and root mean squared error on the order of 10^-3, and cumulative errors under 3%. Transformer and long short-term memory models performed best for EVs and HEVs, with cumulative errors below 4.1% and 2.1%, respectively. Results confirm the approach’s effectiveness across vehicles and models. Uncertainty analysis revealed greater variability in EV and HEV datasets than ICE, due to complex power management, emphasizing the need for robust models for advanced powertrains.
[LG-9] Robust Anomaly Detection in O-RAN: Leveraging LLMs against Data Manipulation Attacks
Link: https://arxiv.org/abs/2508.08029
Authors: Thusitha Dayaratne,Ngoc Duy Pham,Viet Vo,Shangqi Lai,Sharif Abuadbba,Hajime Suzuki,Xingliang Yuan,Carsten Rudolph
Categories: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments:
Abstract:The introduction of 5G and the Open Radio Access Network (O-RAN) architecture has enabled more flexible and intelligent network deployments. However, the increased complexity and openness of these architectures also introduce novel security challenges, such as data manipulation attacks on the semi-standardised Shared Data Layer (SDL) within the O-RAN platform through malicious xApps. In particular, malicious xApps can exploit this vulnerability by introducing subtle Unicode-wise alterations (hypoglyphs) into the data that are being used by traditional machine learning (ML)-based anomaly detection methods. These Unicode-wise manipulations can potentially bypass detection and cause failures in anomaly detection systems based on traditional ML, such as AutoEncoders, which are unable to process hypoglyphed data without crashing. We investigate the use of Large Language Models (LLMs) for anomaly detection within the O-RAN architecture to address this challenge. We demonstrate that LLM-based xApps maintain robust operational performance and are capable of processing manipulated messages without crashing. While initial detection accuracy requires further improvements, our results highlight the robustness of LLMs to adversarial attacks such as hypoglyphs in input data. There is potential to use their adaptability through prompt engineering to further improve the accuracy, although this requires further research. Additionally, we show that LLMs achieve low detection latency (under 0.07 seconds), making them suitable for Near-Real-Time (Near-RT) RIC deployments.
[LG-10] Optimizing Federated Learning for Scalable Power-demand Forecasting in Microgrids
Link: https://arxiv.org/abs/2508.08022
Authors: Roopkatha Banerjee,Sampath Koti,Gyanendra Singh,Anirban Chakraborty,Gurunath Gurrala,Bhushan Jagyasi,Yogesh Simmhan
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Real-time monitoring of power consumption in cities and micro-grids through the Internet of Things (IoT) can help forecast future demand and optimize grid operations. But moving all consumer-level usage data to the cloud for predictions and analysis at fine time scales can expose activity patterns. Federated Learning (FL) is a privacy-sensitive collaborative DNN training approach that retains data on edge devices, trains the models on private data locally, and aggregates the local models in the cloud. But key challenges exist: (i) clients can have non-independently identically distributed (non-IID) data, and (ii) the learning should be computationally cheap while scaling to 1000s of (unseen) clients. In this paper, we develop and evaluate several optimizations to FL training across edge and cloud for time-series demand forecasting in micro-grids and city-scale utilities using DNNs to achieve a high prediction accuracy while minimizing the training cost. We showcase the benefit of using exponentially weighted loss while training and show that it further improves the prediction of the final model. Finally, we evaluate these strategies by validating over 1000s of clients for three states in the US from the OpenEIA corpus, and performing FL both in a pseudo-distributed setting and a Pi edge cluster. The results highlight the benefits of the proposed methods over baselines like ARIMA and DNNs trained for individual consumers, which are not scalable.
[LG-11] Communication-Efficient Zero-Order and First-Order Federated Learning Methods over Wireless Networks
Link: https://arxiv.org/abs/2508.08013
Authors: Mohamad Assaad,Zeinab Nehme,Merouane Debbah
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Federated Learning (FL) is an emerging learning framework that enables edge devices to collaboratively train ML models without sharing their local data. FL faces, however, a significant challenge due to the high amount of information that must be exchanged between the devices and the aggregator in the training phase, which can exceed the limited capacity of wireless systems. In this paper, two communication-efficient FL methods are considered where communication overhead is reduced by communicating scalar values instead of long vectors and by allowing a high number of users to send information simultaneously. The first approach employs a zero-order optimization technique with a two-point gradient estimator, while the second involves a first-order gradient computation strategy. The novelty lies in leveraging channel information in the learning algorithms, hence eliminating the need for additional resources to acquire channel state information (CSI) and to remove its impact, as well as in considering asynchronous devices. We provide a rigorous analytical framework for the two methods, deriving convergence guarantees and establishing appropriate performance bounds.
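A generic two-point zero-order gradient estimator shows why only a scalar needs to be communicated per device. The sketch below is the textbook estimator with Gaussian directions, not the paper's algorithm: it ignores the wireless channel and asynchrony entirely, and all names (`two_point_scalar`, `zero_order_gradient`, `quadratic`) are hypothetical.

```python
import random

def two_point_scalar(f, x, direction, mu):
    """Finite difference of f along one direction: a single scalar value."""
    x_plus = [xi + mu * ui for xi, ui in zip(x, direction)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, direction)]
    return (f(x_plus) - f(x_minus)) / (2.0 * mu)

def zero_order_gradient(f, x, mu, rng, num_directions=1):
    """Estimate the gradient of f at x from function values only.

    Each random direction u contributes scalar * u, where the scalar is the
    two-point difference above. If the direction can be regenerated from a
    shared seed, only the scalar has to be transmitted.
    """
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(num_directions):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scalar = two_point_scalar(f, x, u, mu)
        for i in range(dim):
            grad[i] += scalar * u[i] / num_directions
    return grad

def quadratic(x):
    """Toy local loss with gradient [2*x0, 2*x1]."""
    return x[0] ** 2 + x[1] ** 2

# Along the first coordinate axis the scalar equals the partial derivative.
directional = two_point_scalar(quadratic, [1.0, 2.0], [1.0, 0.0], mu=0.1)
estimate = zero_order_gradient(quadratic, [1.0, 2.0], 0.1, random.Random(0), 200)
```

With Gaussian directions the estimator's expectation matches the gradient of a smoothed version of the loss, so averaging over more directions reduces the noise of `estimate` at the cost of more function evaluations.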
[LG-12] A Physics-informed Deep Operator for Real-Time Freeway Traffic State Estimation
Link: https://arxiv.org/abs/2508.08002
Authors: Hongxin Yu,Yibing Wang,Fengyue Jin,Meng Zhang,Anni Chen
Categories: Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments: 18 pages, 9 figures
Abstract:Traffic state estimation (TSE) falls methodologically into three categories: model-driven, data-driven, and model-data dual-driven. Model-driven TSE relies on macroscopic traffic flow models originating from hydrodynamics. Data-driven TSE leverages historical sensing data and employs statistical models or machine learning methods to infer traffic state. Model-data dual-driven traffic state estimation attempts to harness the strengths of both aspects to achieve more accurate TSE. From the perspective of mathematical operator theory, TSE can be viewed as a type of operator that maps available measurements of the traffic state of interest into unmeasured traffic state variables in real time. For the first time, this paper proposes to study real-time freeway TSE using the idea of a physics-informed deep operator network (PI-DeepONet), which is an operator-oriented architecture embedding traffic flow models based on deep neural networks. The paper develops an extended architecture from the original PI-DeepONet. The extended architecture features: (1) the acceptance of 2-D data input so as to support CNN-based computations; (2) the introduction of a nonlinear expansion layer, an attention mechanism, and a MIMO mechanism; (3) dedicated neural network design for adaptive identification of traffic flow model parameters. A traffic state estimator built on the basis of this extended PI-DeepONet architecture was evaluated with respect to a short freeway stretch of NGSIM and a large-scale urban expressway in China, along with four other baseline TSE methods. The evaluation results demonstrated that this novel TSE method outperformed the baseline methods with high-precision estimation results of flow and mean speed.
[LG-13] Prediction error certification for PINNs: Theory computation and application to Stokes flow
Link: https://arxiv.org/abs/2508.07994
Authors: Birgit Hillebrecht,Benjamin Unger
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:
Abstract:Rigorous error estimation is a fundamental topic in numerical analysis. With the increasing use of physics-informed neural networks (PINNs) for solving partial differential equations, several approaches have been developed to quantify the associated prediction error. In this work, we build upon a semigroup-based framework previously introduced by the authors for estimating the PINN error. While this estimator has so far been limited to academic examples - due to the need to compute quantities related to input-to-state stability - we extend its applicability to a significantly broader class of problems. This is accomplished by modifying the error bound and proposing numerical strategies to approximate the required stability parameters. The extended framework enables the certification of PINN predictions in more realistic scenarios, as demonstrated by a numerical study of Stokes flow around a cylinder.
[LG-14] Adaptive Source-Channel Coding for Semantic Communications
Link: https://arxiv.org/abs/2508.07958
Authors: Dongxu Li,Kai Yuan,Jianhao Huang,Chuan Huang,Xiaoqi Qin,Shuguang Cui,Ping Zhang
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:Semantic communications (SemComs) have emerged as a promising paradigm for joint data and task-oriented transmissions, combining the demands for both the bit-accurate delivery and end-to-end (E2E) distortion minimization. However, current joint source-channel coding (JSCC) in SemComs is not compatible with the existing communication systems and cannot adapt to the variations of the sources or the channels, while separate source-channel coding (SSCC) is suboptimal in the finite blocklength regime. To address these issues, we propose an adaptive source-channel coding (ASCC) scheme for SemComs over parallel Gaussian channels, where the deep neural network (DNN)-based semantic source coding and conventional digital channel coding are separately deployed and adaptively designed. To enable efficient adaptation between the source and channel coding, we first approximate the E2E data and semantic distortions as functions of source coding rate and bit error ratio (BER) via logistic regression, where BER is further modeled as functions of signal-to-noise ratio (SNR) and channel coding rate. Then, we formulate the weighted sum E2E distortion minimization problem for joint source-channel coding rate and power allocation over parallel channels, which is solved by the successive convex approximation. Finally, simulation results demonstrate that the proposed ASCC scheme outperforms typical deep JSCC and SSCC schemes for both the single- and parallel-channel scenarios while maintaining full compatibility with practical digital systems.
[LG-15] Shapley-Inspired Feature Weighting in k-means with No Additional Hyperparameters
Link: https://arxiv.org/abs/2508.07952
Authors: Richard J. Fawley,Renato Cordeiro de Amorim
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Clustering algorithms often assume all features contribute equally to the data structure, an assumption that usually fails in high-dimensional or noisy settings. Feature weighting methods can address this, but most require additional parameter tuning. We propose SHARK (Shapley Reweighted k-means), a feature-weighted clustering algorithm motivated by the use of Shapley values from cooperative game theory to quantify feature relevance, which requires no additional parameters beyond those in k-means. We prove that the k-means objective can be decomposed into a sum of per-feature Shapley values, providing an axiomatic foundation for unsupervised feature relevance and reducing Shapley computation from exponential to polynomial time. SHARK iteratively re-weights features by the inverse of their Shapley contribution, emphasising informative dimensions and down-weighting irrelevant ones. Experiments on synthetic and real-world data sets show that SHARK consistently matches or outperforms existing methods, achieving superior robustness and accuracy, particularly in scenarios where noise may be present. Software: this https URL.
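The inverse-contribution reweighting described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each feature's contribution is its within-cluster sum of squares under a fixed cluster assignment, and the function name `inverse_dispersion_weights`, the `eps` guard, and the normalisation to unit sum are all hypothetical choices made for illustration.

```python
from collections import defaultdict

def inverse_dispersion_weights(points, labels, eps=1e-12):
    """One reweighting step: weight each feature by the inverse of its
    within-cluster sum of squares, then normalise the weights to sum to 1.
    Assumption: this per-feature dispersion stands in for the per-feature
    contribution mentioned in the abstract."""
    n_features = len(points[0])
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    # Accumulate each feature's squared deviation from its cluster mean.
    dispersion = [0.0] * n_features
    for members in clusters.values():
        for v in range(n_features):
            mean_v = sum(p[v] for p in members) / len(members)
            dispersion[v] += sum((p[v] - mean_v) ** 2 for p in members)

    # eps avoids division by zero when a feature is constant within clusters.
    inverse = [1.0 / (d + eps) for d in dispersion]
    total = sum(inverse)
    return [w / total for w in inverse]

# Toy data: feature 0 separates the two clusters, feature 1 is noisy.
points = [(0.0, 0.0), (1.0, 10.0), (10.0, 0.0), (11.0, 10.0)]
labels = [0, 0, 1, 1]
weights = inverse_dispersion_weights(points, labels)
```

In this toy case the feature that is tight within clusters receives almost all of the weight. A full algorithm would alternate such a step with weighted cluster assignment; how SHARK does that is not stated in the abstract.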
[LG-16] Frequency-Domain Analysis of Time-Dependent Multiomic Data in Progressive Neurodegenerative Diseases: A Proposed Quantum-Classical Hybrid Approach with Quaternionic Extensions
Link: https://arxiv.org/abs/2508.07948
Authors: John D. Mayfield M.D. Ph.D. M.Sc
Subjects: Emerging Technologies (cs.ET); Information Theory (cs.IT); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 11 pages, 1 figure
Abstract:Progressive neurodegenerative diseases, including Alzheimer’s disease (AD), multiple sclerosis (MS), Parkinson’s disease (PD), and amyotrophic lateral sclerosis (ALS), exhibit complex, nonlinear trajectories that challenge deterministic modeling. Traditional time-domain analyses of multiomic and neuroimaging data often fail to capture hidden oscillatory patterns, limiting predictive accuracy. We propose a theoretical mathematical framework that transforms time-series data into the frequency or s-domain using Fourier and Laplace transforms, models neuronal dynamics via Hamiltonian formulations, and employs quantum-classical hybrid computing with variational quantum eigensolvers (VQE) for enhanced pattern detection. This theoretical construct serves as a foundation for future empirical work in quantum-enhanced analysis of neurodegenerative diseases. We extend this to quaternionic representations with three imaginary axes (i, j, k) to model multistate Hamiltonians in multifaceted disorders, drawing from quantum neuromorphic computing to capture entangled neural dynamics [Pehle2020, Emani2019]. This approach leverages quantum advantages in handling high-dimensional amplitude-phase data, enabling outlier detection and frequency signature analysis. Potential clinical applications include identifying high-risk patients with rapid progression or therapy resistance using s-domain biomarkers, supported by quantum machine learning (QML) precedents achieving up to 99.89% accuracy in Alzheimer’s classification [Belay2024, Bhowmik2025]. This framework aims to lay the groundwork for redefining precision medicine for neurodegenerative diseases through future validations.
[LG-17] Adaptive Fine-Tuning via Pattern Specialization for Deep Time Series Forecasting
Link: https://arxiv.org/abs/2508.07927
Authors: Amal Saadallah,Abdulaziz Al-Ademi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Time series forecasting poses significant challenges in non-stationary environments where underlying patterns evolve over time. In this work, we propose a novel framework that enhances deep neural network (DNN) performance by leveraging specialized model adaptation and selection. Initially, a base DNN is trained offline on historical time series data. A reserved validation subset is then segmented to extract and cluster the most dominant patterns within the series, thereby identifying distinct regimes. For each identified cluster, the base DNN is fine-tuned to produce a specialized version that captures unique pattern characteristics. At inference, the most recent input is matched against the cluster centroids, and the corresponding fine-tuned version is deployed based on the closest similarity measure. Additionally, our approach integrates a concept drift detection mechanism to identify and adapt to emerging patterns caused by non-stationary behavior. The proposed framework is generalizable across various DNN architectures and has demonstrated significant performance gains on both traditional DNNs and recent advanced architectures implemented in the GluonTS library.
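The inference-time step described above, matching the most recent input against cluster centroids and deploying the corresponding fine-tuned model, might look roughly like the following sketch. Euclidean distance is assumed as the similarity measure (the abstract does not name one), and the centroids, model names, and the helper `closest_cluster` are hypothetical placeholders.

```python
import math

def closest_cluster(window, centroids):
    """Return the index of the centroid nearest to the input window
    (Euclidean distance; ties go to the lowest index)."""
    distances = [math.dist(window, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical centroids for three regimes: flat, rising, falling.
centroids = [(1.0, 1.0, 1.0), (0.0, 1.0, 2.0), (2.0, 1.0, 0.0)]
# Stand-ins for the fine-tuned models, one per cluster.
specialists = ["flat_model", "rising_model", "falling_model"]

recent_window = (0.1, 0.9, 2.2)
chosen = specialists[closest_cluster(recent_window, centroids)]
```

Here the rising window selects the specialist tied to the rising-pattern centroid. The drift detection mentioned in the abstract would sit on top of this lookup and is not sketched here.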
[LG-18] Score Augmentation for Diffusion Models
Link: https://arxiv.org/abs/2508.07926
Authors: Liang Hou,Yuan Gao,Boyuan Jiang,Xin Tao,Qi Yan,Renjie Liao,Pengfei Wan,Di Zhang,Kun Gai
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Diffusion models have achieved remarkable success in generative modeling. However, this study confirms the existence of overfitting in diffusion model training, particularly in data-limited regimes. To address this challenge, we propose Score Augmentation (ScoreAug), a novel data augmentation framework specifically designed for diffusion models. Unlike conventional augmentation approaches that operate on clean data, ScoreAug applies transformations to noisy data, aligning with the inherent denoising mechanism of diffusion. Crucially, ScoreAug further requires the denoiser to predict the augmentation of the original target. This design establishes an equivariant learning objective, enabling the denoiser to learn scores across varied denoising spaces, thereby realizing what we term score augmentation. We also theoretically analyze the relationship between scores in different spaces under general transformations. In experiments, we extensively validate ScoreAug on multiple benchmarks including CIFAR-10, FFHQ, AFHQv2, and ImageNet, with results demonstrating significant performance improvements over baselines. Notably, ScoreAug effectively mitigates overfitting across diverse scenarios, such as varying data scales and model capacities, while exhibiting stable convergence properties. Another advantage of ScoreAug over standard data augmentation lies in its ability to circumvent data leakage issues under certain conditions. Furthermore, we show that ScoreAug can be synergistically combined with traditional data augmentation techniques to achieve additional performance gains.
[LG-19] EFU: Enforcing Federated Unlearning via Functional Encryption
Link: https://arxiv.org/abs/2508.07873
Authors: Samaneh Mohammadi,Vasileios Tsouvalas,Iraklis Symeonidis,Ali Balador,Tanir Ozcelebi,Francesco Flammini,Nirvana Meratnia
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Federated unlearning (FU) algorithms allow clients in federated settings to exercise their “right to be forgotten” by removing the influence of their data from a collaboratively trained model. Existing FU methods maintain data privacy by performing unlearning locally on the client-side and sending targeted updates to the server without exposing forgotten data; yet they often rely on server-side cooperation, revealing the client’s intent and identity without enforcement guarantees - compromising autonomy and unlearning privacy. In this work, we propose EFU (Enforced Federated Unlearning), a cryptographically enforced FU framework that enables clients to initiate unlearning while concealing its occurrence from the server. Specifically, EFU leverages functional encryption to bind encrypted updates to specific aggregation functions, ensuring the server can neither perform unauthorized computations nor detect or skip unlearning requests. To further mask behavioral and parameter shifts in the aggregated model, we incorporate auxiliary unlearning losses based on adversarial examples and parameter importance regularization. Extensive experiments show that EFU achieves near-random accuracy on forgotten data while maintaining performance comparable to full retraining across datasets and neural architectures - all while concealing unlearning intent from the server. Furthermore, we demonstrate that EFU is agnostic to the underlying unlearning algorithm, enabling secure, function-hiding, and verifiable unlearning for any client-side FU mechanism that issues targeted updates.
[LG-20] Unequal Uncertainty: Rethinking Algorithmic Interventions for Mitigating Discrimination from AI
Link: https://arxiv.org/abs/2508.07872
Authors: Holli Sargeant,Mackenzie Jorgensen,Arina Shah,Adrian Weller,Umang Bhatt
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:Uncertainty in artificial intelligence (AI) predictions poses urgent legal and ethical challenges for AI-assisted decision-making. We examine two algorithmic interventions that act as guardrails for human-AI collaboration: selective abstention, which withholds high-uncertainty predictions from human decision-makers, and selective friction, which delivers those predictions together with salient warnings or disclosures that slow the decision process. Research has shown that selective abstention based on uncertainty can inadvertently exacerbate disparities and disadvantage under-represented groups that disproportionately receive uncertain predictions. In this paper, we provide the first integrated socio-technical and legal analysis of uncertainty-based algorithmic interventions. Through two case studies, AI-assisted consumer credit decisions and AI-assisted content moderation, we demonstrate how the seemingly neutral use of uncertainty thresholds can trigger discriminatory impacts. We argue that, although both interventions pose risks of unlawful discrimination under UK law, selective frictions offer a promising pathway toward fairer and more accountable AI-assisted decision-making by preserving transparency and encouraging more cautious human judgment.
[LG-21] Recommendation Is a Dish Better Served Warm RECSYS2025
Link: https://arxiv.org/abs/2508.07856
Authors: Danil Gusak,Nikita Sukhorukov,Evgeny Frolov
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted for ACM RecSys 2025. Author’s version. The final published version will be available at the ACM Digital Library
Abstract:In modern recommender systems, experimental settings typically include filtering out cold users and items based on a minimum interaction threshold. However, these thresholds are often chosen arbitrarily and vary widely across studies, leading to inconsistencies that can significantly affect the comparability and reliability of evaluation results. In this paper, we systematically explore the cold-start boundary by examining the criteria used to determine whether a user or an item should be considered cold. Our experiments incrementally vary the number of interactions for different items during training, and gradually update the length of user interaction histories during inference. We investigate the thresholds across several widely used datasets, commonly represented in recent papers from top-tier conferences, and on multiple established recommender baselines. Our findings show that inconsistent selection of cold-start thresholds can either result in the unnecessary removal of valuable data or lead to the misclassification of cold instances as warm, introducing more noise into the system.
[LG-22] Learning Satellite Attitude Dynamics with Physics-Informed Normalising Flow
Link: https://arxiv.org/abs/2508.07841
Authors: Carlo Cena,Mauro Martini,Marcello Chiaberge
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: In review
Abstract:Attitude control is a fundamental aspect of spacecraft operations. Model Predictive Control (MPC) has emerged as a powerful strategy for these tasks, relying on accurate models of the system dynamics to optimize control actions over a prediction horizon. In scenarios where physics models are incomplete, difficult to derive, or computationally expensive, machine learning offers a flexible alternative by learning the system behavior directly from data. However, purely data-driven models often struggle with generalization and stability, especially when applied to inputs outside their training domain. To address these limitations, we investigate the benefits of incorporating Physics-Informed Neural Networks (PINNs) into the learning of spacecraft attitude dynamics, comparing their performance with that of purely data-driven approaches. Using a Real-valued Non-Volume Preserving (Real NVP) neural network architecture with a self-attention mechanism, we trained several models on simulated data generated with the Basilisk simulator. Two training strategies were considered: a purely data-driven baseline and a physics-informed variant to improve robustness and stability. Our results demonstrate that including physics-based information improves the mean relative error of the best architectures found by 27.08%. These advantages are particularly evident when the learned models are integrated into an MPC framework, where PINN-based models consistently outperform their purely data-driven counterparts in terms of control accuracy and robustness, yielding improvements of up to 42.86% in performance stability error and increased robustness to noise.
[LG-23] EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning
Link: https://arxiv.org/abs/2508.07809
Authors: Huanyu Liu,Jia Li,Chang Yu,Taozhi Chen,Yihong Dong,Lecheng Wang,Hu XiaoLong,Ge Li
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement learning with verifiable reward (RLVR) has become a promising paradigm for post-training large language models (LLMs) to improve their reasoning capability. However, when the rollout accuracy is low on hard problems, the reward becomes sparse, limiting learning efficiency and causing exploration bottlenecks. Existing approaches either rely on stronger LLMs for distillation or filter out difficult problems, which limits scalability or restricts reasoning improvement through exploration. We propose EvoCoT, a self-evolving curriculum learning framework based on two-stage chain-of-thought (CoT) reasoning optimization. EvoCoT constrains the exploration space by self-generating and verifying CoT trajectories, then gradually shortens them to expand the space in a controlled way. This enables LLMs to stably learn from initially unsolved hard problems under sparse rewards. We apply EvoCoT to multiple LLM families, including Qwen, DeepSeek, and Llama. Experiments show that EvoCoT enables LLMs to solve previously unsolved problems, improves reasoning capability without external CoT supervision, and is compatible with various RL fine-tuning methods. We release the source code to support future research.
[LG-24] Topological Feature Compression for Molecular Graph Neural Networks
Link: https://arxiv.org/abs/2508.07807
Authors: Rahul Khorana
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Recent advances in molecular representation learning have produced highly effective encodings of molecules for numerous cheminformatics and bioinformatics tasks. However, extracting general chemical insight while balancing predictive accuracy, interpretability, and computational efficiency remains a major challenge. In this work, we introduce a novel Graph Neural Network (GNN) architecture that combines compressed higher-order topological signals with standard molecular features. Our approach captures global geometric information while preserving computational tractability and human-interpretable structure. We evaluate our model across a range of benchmarks, from small-molecule datasets to complex material datasets, and demonstrate superior performance using a parameter-efficient architecture. We achieve the best performing results in both accuracy and robustness across almost all benchmarks. We open source all code (all code and results can be found on GitHub: this https URL).
[LG-25] A Tutorial: An Intuitive Explanation of Offline Reinforcement Learning Theory
Link: https://arxiv.org/abs/2508.07746
Authors: Fengdi Che
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Offline reinforcement learning (RL) aims to optimize the return given a fixed dataset of agent trajectories without additional interactions with the environment. While algorithm development has progressed rapidly, significant theoretical advances have also been made in understanding the fundamental challenges of offline RL. However, bridging these theoretical insights with practical algorithm design remains an ongoing challenge. In this survey, we explore key intuitions derived from theoretical work and their implications for offline RL algorithms. We begin by listing the conditions needed for the proofs, including function representation and data coverage assumptions. Function representation conditions tell us what to expect for generalization, and data coverage assumptions describe the quality requirement of the data. We then examine counterexamples, where offline RL is not solvable without an impractically large amount of data. These cases highlight what cannot be achieved for all algorithms and the inherent hardness of offline RL. Building on techniques to mitigate these challenges, we discuss the conditions that are sufficient for offline RL. These conditions are not merely assumptions for theoretical proofs, but they also reveal the limitations of these algorithms and remind us to search for novel solutions when the conditions cannot be satisfied.
[LG-26] Separation and Collaboration: Two-Level Routing Grouped Mixture-of-Experts for Multi-Domain Continual Learning
Link: https://arxiv.org/abs/2508.07738
Authors: Jialu Zhou,Dianxi Shi,Shaowu Yang,Xinyu Wei,Mingyue Yang,Leqian Li,Mengzhu Wang,Chunping Qiu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Multi-Domain Continual Learning (MDCL) acquires knowledge from sequential tasks with shifting class sets and distributions. Although Parameter-Efficient Fine-Tuning (PEFT) methods can adapt to this dual heterogeneity, they still suffer from catastrophic forgetting and forward forgetting. To address these challenges, we propose a Two-Level Routing Grouped Mixture-of-Experts (TRGE) method. Firstly, TRGE dynamically expands the pre-trained CLIP model, assigning a specific expert group to each task to mitigate catastrophic forgetting. As the number of experts continually grows in this process, TRGE keeps the expert count within each group static and introduces an intra-group router to alleviate the routing overfitting caused by increasing routing complexity. Meanwhile, we design an inter-group routing policy based on task identifiers and task prototype distance, which dynamically selects relevant expert groups and combines their outputs to enhance inter-task collaboration. Secondly, to obtain the correct task identifiers, we leverage Multimodal Large Language Models (MLLMs), which have powerful multimodal comprehension capabilities, to generate semantic task descriptions and recognize the correct task identifier. Finally, to mitigate forward forgetting, we dynamically fuse outputs for unseen samples from the frozen CLIP model and the TRGE adapter based on training progress, leveraging both pre-trained and learned knowledge. Through extensive experiments across various settings, our method outperforms other advanced methods with fewer trainable parameters.
[LG-27] Robust Reinforcement Learning over Wireless Networks with Homomorphic State Representations
Link: https://arxiv.org/abs/2508.07722
Authors: Pietro Talli,Federico Mason,Federico Chiariotti,Andrea Zanella
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Multiagent Systems (cs.MA)
Comments: This manuscript is currently under revision
Abstract:In this work, we address the problem of training Reinforcement Learning (RL) agents over communication networks. The RL paradigm requires the agent to instantaneously perceive the state evolution to infer the effects of its actions on the environment. This is impossible if the agent receives state updates over lossy or delayed wireless systems and thus operates with partial and intermittent information. In recent years, numerous frameworks have been proposed to manage RL with imperfect feedback; however, they often offer specific solutions with a substantial computational burden. To address these limits, we propose a novel architecture, named Homomorphic Robust Remote Reinforcement Learning (HR3L), that enables the training of remote RL agents exchanging observations across a non-ideal wireless channel. HR3L considers two units: the transmitter, which encodes meaningful representations of the environment, and the receiver, which decodes these messages and performs actions to maximize a reward signal. Importantly, HR3L does not require the exchange of gradient information across the wireless channel, allowing for quicker training and a lower communication overhead than state-of-the-art solutions. Experimental results demonstrate that HR3L significantly outperforms baseline methods in terms of sample efficiency and adapts to different communication scenarios, including packet losses, delayed transmissions, and capacity limitations.
[LG-28] Detecting Mislabeled and Corrupted Data via Pointwise Mutual Information
Link: https://arxiv.org/abs/2508.07713
Authors: Jinghan Yang,Jiayu Weng
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Under Working
Abstract:Deep neural networks can memorize corrupted labels, making data quality critical for model performance, yet real-world datasets are frequently compromised by both label noise and input noise. This paper proposes a mutual information-based framework for data selection under hybrid noise scenarios that quantifies statistical dependencies between inputs and labels. We compute each sample’s pointwise contribution to the overall mutual information and find that lower contributions indicate noisy or mislabeled instances. Empirical validation on MNIST with different synthetic noise settings demonstrates that the method effectively filters low-quality samples. Under label corruption, training on high-MI samples improves classification accuracy by up to 15% compared to random sampling. Furthermore, the method exhibits robustness to benign input modifications, preserving semantically valid data while filtering truly corrupted samples.
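A toy illustration of scoring samples by their pointwise contribution to mutual information follows. It assumes small discrete inputs and labels with probabilities estimated from empirical counts; the abstract does not describe the estimator used for image data, so `pointwise_mi` and the example data are hypothetical stand-ins rather than the paper's method.

```python
import math
from collections import Counter

def pointwise_mi(pairs):
    """Return log p(x, y) / (p(x) p(y)) for each (x, y) sample,
    using empirical frequencies from the same sample."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # joint/n divided by (px/n * py/n) simplifies to joint * n / (px * py).
    return [math.log(joint[(x, y)] * n / (px[x] * py[y])) for x, y in pairs]

# Inputs "a" mostly carry label 0 and "b" label 1; the last two pairs break the pattern.
pairs = [("a", 0), ("a", 0), ("a", 0), ("b", 1), ("b", 1), ("b", 1), ("a", 1), ("b", 0)]
scores = pointwise_mi(pairs)
# Indices of the two lowest-scoring samples, i.e. the likeliest mislabeled ones.
suspects = sorted(range(len(pairs)), key=lambda i: scores[i])[:2]
```

The two pairs that contradict the dominant input-label association receive negative scores, consistent with the abstract's observation that low contributions flag noisy or mislabeled instances.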
[LG-29] Semantic-Enhanced Time-Series Forecasting via Large Language Models
Link: https://arxiv.org/abs/2508.07697
Authors: Hao Liu,Chun Yang,Zhang xiaoxing,Xiaobin Zhu
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: 14 pages, 9 figures
Abstract:Time series forecasting plays a significant role in finance, energy, meteorology, and IoT applications. Recent studies have leveraged the generalization capabilities of large language models (LLMs) to adapt to time series forecasting, achieving promising performance. However, existing studies focus on token-level modal alignment, instead of bridging the intrinsic modality gap between linguistic knowledge structures and time series data patterns, greatly limiting the semantic representation. To address this issue, we propose a novel Semantic-Enhanced LLM (SE-LLM) that explores the inherent periodicity and anomalous characteristics of time series and embeds them into the semantic space to enhance the token embedding. This process enhances the interpretability of tokens for LLMs, thereby activating the potential of LLMs for temporal sequence analysis. Moreover, existing Transformer-based LLMs excel at capturing long-range dependencies but are weak at modeling short-term anomalies in time-series data. Hence, we propose a plugin module embedded within self-attention that models long-term and short-term dependencies to effectively adapt LLMs to time-series analysis. Our approach freezes the LLM and reduces the sequence dimensionality of tokens, greatly reducing computational consumption. Experiments demonstrate the superior performance of our SE-LLM over state-of-the-art (SOTA) methods.
[LG-30] Multi-Hop Privacy Propagation for Differentially Private Federated Learning in Social Networks ECAI25
Link: https://arxiv.org/abs/2508.07676
Authors: Chenchen Lin,Xuehe Wang
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT)
Comments: Accepted by ECAI25
Abstract:Federated learning (FL) enables collaborative model training across decentralized clients without sharing local data, thereby enhancing privacy and facilitating collaboration among clients connected via social networks. However, these social connections introduce privacy externalities: a client’s privacy loss depends not only on its privacy protection strategy but also on the privacy decisions of others, propagated through the network via multi-hop interactions. In this work, we propose a socially-aware privacy-preserving FL mechanism that systematically quantifies indirect privacy leakage through a multi-hop propagation model. We formulate the server-client interaction as a two-stage Stackelberg game, where the server, as the leader, optimizes incentive policies, and clients, as followers, strategically select their privacy budgets, which determine their privacy-preserving levels by controlling the magnitude of added noise. To mitigate information asymmetry in networked privacy estimation, we introduce a mean-field estimator to approximate the average external privacy risk. We theoretically prove the existence and convergence of the fixed point of the mean-field estimator and derive closed-form expressions for the Stackelberg Nash Equilibrium. Despite being designed from a client-centric incentive perspective, our mechanism achieves approximately-optimal social welfare, as revealed by Price of Anarchy (PoA) analysis. Experiments on diverse datasets demonstrate that our approach significantly improves client utilities and reduces server costs while maintaining model performance, outperforming both Social-Agnostic (SA) baselines and methods that account for social externalities.
[LG-31] Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation
Link: https://arxiv.org/abs/2508.07675
Authors: Xutong Liu,Baran Atalar,Xiangxiang Dai,Jinhang Zuo,Siwei Wang,John C.S. Lui,Wei Chen,Carlee Joe-Wong
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved without another forward pass through the LLM, has emerged as one possible solution. Traditional exact-match caching, however, overlooks the semantic similarity between queries, leading to unnecessary recomputation. Semantic caching addresses this by retrieving responses based on semantic similarity, but introduces a fundamentally different cache eviction problem: one must account for mismatch costs between incoming queries and cached responses. Moreover, key system parameters, such as query arrival probabilities and serving costs, are often unknown and must be learned over time. Existing semantic caching methods are largely ad-hoc, lacking theoretical foundations and unable to adapt to real-world uncertainty. In this paper, we present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions. We formulate both offline optimization and online learning variants of the problem, and develop provably efficient algorithms with state-of-the-art guarantees. We also evaluate our framework on a synthetic dataset, showing that our proposed algorithms achieve matching or superior performance compared with baselines.
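The contrast between exact-match and semantic caching drawn in this abstract can be made concrete with a small lookup sketch. It covers retrieval only: the eviction and learning algorithms that are the paper's contribution are not reproduced. The class `SemanticCache`, the cosine-similarity threshold of 0.9, and the two-dimensional embeddings are hypothetical, chosen purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Returns a cached response when a stored query embedding is similar enough."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        # Find the most similar stored query; serve it only above the threshold.
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: the caller would query the LLM instead

cache = SemanticCache(threshold=0.9)
cache.put((1.0, 0.0), "cached answer")
hit = cache.get((0.99, 0.05))   # near-duplicate query
miss = cache.get((0.0, 1.0))    # unrelated query
```

An exact-match cache would miss the near-duplicate query, whereas the similarity threshold serves it from the cache. The threshold is where the mismatch cost mentioned in the abstract enters: a lower value saves more LLM calls but risks returning a poorly matched response.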
[LG-32] Multi-Turn Jailbreaks Are Simpler Than They Seem
Link: https://arxiv.org/abs/2508.07646
Authors: Xiaoxue Yang,Jaeha Lee,Anna-Katharina Dick,Jasper Timm,Fei Xie,Diogo Cruz
Subjects: Machine Learning (cs.LG)
Comments: 25 pages, 15 figures. Accepted at COLM 2025 SoLaR Workshop
Abstract:While defenses against single-turn jailbreak attacks on Large Language Models (LLMs) have improved significantly, multi-turn jailbreaks remain a persistent vulnerability, often achieving success rates exceeding 70% against models optimized for single-turn protection. This work presents an empirical analysis of automated multi-turn jailbreak attacks across state-of-the-art models including GPT-4, Claude, and Gemini variants, using the StrongREJECT benchmark. Our findings challenge the perceived sophistication of multi-turn attacks: when accounting for the attacker’s ability to learn from how models refuse harmful requests, multi-turn jailbreaking approaches are approximately equivalent to simply resampling single-turn attacks multiple times. Moreover, attack success is correlated among similar models, making it easier to jailbreak newly released ones. Additionally, for reasoning models, we find surprisingly that higher reasoning effort often leads to higher attack success rates. Our results have important implications for AI safety evaluation and the design of jailbreak-resistant systems. We release the source code at this https URL
[LG-33] Beyond Single: A Data Selection Principle for LLM Alignment via Fine-Grained Preference Signals
Link: https://arxiv.org/abs/2508.07638
Authors: Jia Zhang,Yao Liu,Chen-Xi Zhang,Yi Liu,Yi-Xuan Jin,Lan-Zhe Guo,Yu-Feng Li
Subjects: Machine Learning (cs.LG)
Comments: Under review
Abstract:Aligning Large Language Models (LLMs) with diverse human values requires moving beyond a single holistic “better-than” preference criterion. While collecting fine-grained, aspect-specific preference data is more reliable and scalable, existing methods like Direct Preference Optimization (DPO) struggle with the severe noise and conflicts inherent in such aggregated datasets. In this paper, we tackle this challenge from a data-centric perspective. We first derive the Direct Multi-Preference Optimization (DMPO) objective, and uncover a key Preference Divergence (PD) term that quantifies inter-aspect preference conflicts. Instead of using this term for direct optimization, we leverage it to formulate a novel, theoretically-grounded data selection principle. Our principle advocates for selecting a subset of high-consensus data, identified by the most negative PD values, for efficient DPO training. We prove the optimality of this strategy by analyzing the loss bounds of the DMPO objective in the selection problem. To operationalize our approach, we introduce practical methods of PD term estimation and length bias mitigation, thereby proposing our PD selection method. Evaluation on the UltraFeedback dataset with three varying conflict levels shows that our simple yet effective strategy achieves over 10% relative improvement against both the standard holistic preference and a stronger oracle using aggregated preference signals, all while boosting training efficiency and obviating the need for intractable holistic preference annotation, unlocking the potential of robust LLM alignment via fine-grained preference signals.
[LG-34] Extracting Complex Topology from Multivariate Functional Approximation: Contours, Jacobi Sets, and Ridge-Valley Graphs
Link: https://arxiv.org/abs/2508.07637
Authors: Guanqun Ma,David Lenz,Hanqi Guo,Tom Peterka,Bei Wang
Subjects: Machine Learning (cs.LG); Computational Geometry (cs.CG)
*Comments: The paper is to be published at the 15th IEEE Workshop on Large Data Analysis and Visualization (LDAV)
Abstract:Implicit continuous models, such as functional models and implicit neural networks, are an increasingly popular method for replacing discrete data representations with continuous, high-order, and differentiable surrogates. These models offer new perspectives on the storage, transfer, and analysis of scientific data. In this paper, we introduce the first framework to directly extract complex topological features – contours, Jacobi sets, and ridge-valley graphs – from a type of continuous implicit model known as multivariate functional approximation (MFA). MFA replaces discrete data with continuous piecewise smooth functions. Given an MFA model as the input, our approach enables direct extraction of complex topological features from the model, without reverting to a discrete representation of the model. Our work is easily generalizable to any continuous implicit model that supports the queries of function values and high-order derivatives. Our work establishes the building blocks for performing topological data analysis and visualization on implicit continuous models.
[LG-35] When and how can inexact generative models still sample from the data manifold?
Link: https://arxiv.org/abs/2508.07581
Authors: Nisha Chandramoorthy,Adriaan de Clercq
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Probability (math.PR)
*Comments:
Abstract:A curious phenomenon observed in some dynamical generative models is the following: despite learning errors in the score function or the drift vector field, the generated samples appear to shift along the support of the data distribution but not away from it. In this work, we investigate this phenomenon of robustness of the support by taking a dynamical systems approach on the generating stochastic/deterministic process. Our perturbation analysis of the probability flow reveals that infinitesimal learning errors cause the predicted density to be different from the target density only on the data manifold for a wide class of generative models. Further, what is the dynamical mechanism that leads to the robustness of the support? We show that the alignment of the top Lyapunov vectors (most sensitive infinitesimal perturbation directions) with the tangent spaces along the boundary of the data manifold leads to robustness and prove a sufficient condition on the dynamics of the generating process to achieve this alignment. Moreover, the alignment condition is efficient to compute and, in practice, for robust generative models, automatically leads to accurate estimates of the tangent bundle of the data manifold. Using a finite-time linear perturbation analysis on sample paths as well as probability flows, our work complements and extends existing works on obtaining theoretical guarantees for generative models from a stochastic analysis, statistical learning and uncertainty quantification points of view. Our results apply across different dynamical generative models, such as conditional flow-matching and score-based generative models, and for different target distributions that may or may not satisfy the manifold hypothesis.
[LG-36] Barron Space Representations for Elliptic PDEs with Homogeneous Boundary Conditions
Link: https://arxiv.org/abs/2508.07559
Authors: Ziang Chen,Liqiang Huang
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*Comments:
Abstract:We study the approximation complexity of high-dimensional second-order elliptic PDEs with homogeneous boundary conditions on the unit hypercube, within the framework of Barron spaces. Under the assumption that the coefficients belong to suitably defined Barron spaces, we prove that the solution can be efficiently approximated by two-layer neural networks, circumventing the curse of dimensionality. Our results demonstrate the expressive power of shallow networks in capturing high-dimensional PDE solutions under appropriate structural assumptions.
[LG-37] Multimodal Remote Inference
Link: https://arxiv.org/abs/2508.07555
Authors: Keyuan Zhang,Yin Sun,Bo Ji
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
*Comments: Accepted by The 22nd IEEE International Conference on Mobile Ad-Hoc and Smart Systems (MASS 2025)
Abstract:We consider a remote inference system with multiple modalities, where a multimodal machine learning (ML) model performs real-time inference using features collected from remote sensors. As sensor observations may change dynamically over time, fresh features are critical for inference tasks. However, delivering features from all modalities in a timely manner is often infeasible due to limited network resources. To this end, we study a two-modality scheduling problem to minimize the ML model’s inference error, which is expressed as a penalty function of the Age of Information (AoI) for both modalities. We develop an index-based threshold policy and prove its optimality. Specifically, the scheduler switches modalities when the current modality’s index function exceeds a threshold. We show that the two modalities share the same threshold, and both the index functions and the threshold can be computed efficiently. The optimality of our policy holds for (i) general AoI functions that are non-monotonic and non-additive and (ii) heterogeneous transmission times. Numerical results show that our policy reduces inference error by up to 55% compared to round-robin and uniform random policies, which are oblivious to the AoI-based inference error function. Our results shed light on how to improve remote inference accuracy by optimizing task-oriented AoI functions.
[LG-38] Physics-Informed Multimodal Bearing Fault Classification under Variable Operating Conditions using Transfer Learning
Link: https://arxiv.org/abs/2508.07536
Authors: Tasfiq E. Alam,Md Manjurul Ahsan,Shivakumar Raman
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Accurate and interpretable bearing fault classification is critical for ensuring the reliability of rotating machinery, particularly under variable operating conditions where domain shifts can significantly degrade model performance. This study proposes a physics-informed multimodal convolutional neural network (CNN) with a late fusion architecture, integrating vibration and motor current signals alongside a dedicated physics-based feature extraction branch. The model incorporates a novel physics-informed loss function that penalizes physically implausible predictions based on characteristic bearing fault frequencies - Ball Pass Frequency Outer (BPFO) and Ball Pass Frequency Inner (BPFI) - derived from bearing geometry and shaft speed. Comprehensive experiments on the Paderborn University dataset demonstrate that the proposed physics-informed approach consistently outperforms a non-physics-informed baseline, achieving higher accuracy, reduced false classifications, and improved robustness across multiple data splits. To address performance degradation under unseen operating conditions, three transfer learning (TL) strategies - Target-Specific Fine-Tuning (TSFT), Layer-Wise Adaptation Strategy (LAS), and Hybrid Feature Reuse (HFR) - are evaluated. Results show that LAS yields the best generalization, with additional performance gains when combined with physics-informed modeling. Validation on the KAIST bearing dataset confirms the framework’s cross-dataset applicability, achieving up to 98 percent accuracy. Statistical hypothesis testing further verifies significant improvements (p < 0.01) in classification performance. The proposed framework demonstrates the potential of integrating domain knowledge with data-driven learning to achieve robust, interpretable, and generalizable fault diagnosis for real-world industrial applications.
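For readers unfamiliar with the two fault frequencies named in this abstract, the sketch below computes BPFO and BPFI from bearing geometry and shaft speed using the standard textbook kinematic formulas. It is not taken from the paper: the function name and the geometry values are illustrative assumptions, and how the authors fold these frequencies into their loss is not shown here.

```python
import math
from typing import Tuple


def bearing_fault_frequencies(
    shaft_hz: float,
    n_balls: int,
    ball_diameter: float,
    pitch_diameter: float,
    contact_angle_deg: float = 0.0,
) -> Tuple[float, float]:
    """Return (BPFO, BPFI) in Hz for a rolling-element bearing.

    Standard kinematic formulas (stationary outer race assumed):
      BPFO = (n/2) * f_r * (1 - (d/D) * cos(angle))
      BPFI = (n/2) * f_r * (1 + (d/D) * cos(angle))
    """
    ratio = (ball_diameter / pitch_diameter) * math.cos(math.radians(contact_angle_deg))
    bpfo = 0.5 * n_balls * shaft_hz * (1.0 - ratio)
    bpfi = 0.5 * n_balls * shaft_hz * (1.0 + ratio)
    return bpfo, bpfi


# Illustrative geometry only; these are not the Paderborn or KAIST bearing parameters.
bpfo, bpfi = bearing_fault_frequencies(
    shaft_hz=25.0, n_balls=8, ball_diameter=6.0, pitch_diameter=30.0
)
print(f"BPFO = {bpfo:.1f} Hz, BPFI = {bpfi:.1f} Hz")
```

Note that BPFO and BPFI always sum to the number of rolling elements times the shaft frequency, which is a quick sanity check on any computed pair.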
[LG-39] FairDRL-ST: Disentangled Representation Learning for Fair Spatio-Temporal Mobility Prediction
Link: https://arxiv.org/abs/2508.07518
Authors: Sichen Zhao,Wei Shao,Jeffrey Chan,Ziqi Xu,Flora Salim
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: Accepted as a Research Paper (short) at ACM SIGSPATIAL 2025. This arXiv version is the full version of the paper
Abstract:As deep spatio-temporal neural networks are increasingly utilised in urban computing contexts, the deployment of such methods can have a direct impact on users of critical urban infrastructure, such as public transport, emergency services, and traffic management systems. While many spatio-temporal methods focus on improving accuracy, fairness has recently gained attention due to growing evidence that biased predictions in spatio-temporal applications can disproportionately disadvantage certain demographic or geographic groups, thereby reinforcing existing socioeconomic inequalities and undermining the ethical deployment of AI in public services. In this paper, we propose a novel framework, FairDRL-ST, based on disentangled representation learning, to address fairness concerns in spatio-temporal prediction, with a particular focus on mobility demand forecasting. By leveraging adversarial learning and disentangled representation learning, our framework learns to separate attributes that contain sensitive information. Unlike existing methods that enforce fairness through supervised learning, which may lead to overcompensation and degraded performance, our framework achieves fairness in an unsupervised manner with minimal performance loss. We apply our framework to real-world urban mobility datasets and demonstrate its ability to close fairness gaps while delivering competitive predictive performance compared to state-of-the-art fairness-aware methods.
[LG-40] Enhancing Privacy in Decentralized Min-Max Optimization: A Differentially Private Approach
Link: https://arxiv.org/abs/2508.07505
Authors: Yueyang Quan,Chang Wang,Shengjie Zhai,Minghong Fang,Zhuqing Liu
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments: To appear in ACM MobiHoc 2025
Abstract:Decentralized min-max optimization allows multi-agent systems to collaboratively solve global min-max optimization problems by facilitating the exchange of model updates among neighboring agents, eliminating the need for a central server. However, sharing model updates in such systems carries a risk of exposing sensitive data to inference attacks, raising significant privacy concerns. To mitigate these privacy risks, differential privacy (DP) has become a widely adopted technique for safeguarding individual data. Despite its advantages, implementing DP in decentralized min-max optimization poses challenges, as the added noise can hinder convergence, particularly in non-convex scenarios with complex agent interactions in min-max optimization problems. In this work, we propose an algorithm called DPMixSGD (Differential Private Minmax Hybrid Stochastic Gradient Descent), a novel privacy-preserving algorithm specifically designed for non-convex decentralized min-max optimization. Our method builds on the state-of-the-art STORM-based algorithm, one of the fastest decentralized min-max solutions. We rigorously prove that the noise added to local gradients does not significantly compromise convergence performance, and we provide theoretical bounds to ensure privacy guarantees. To validate our theoretical findings, we conduct extensive experiments across various tasks and models, demonstrating the effectiveness of our approach.
[LG-41] N-BEATS-MOE: N-BEATS with a Mixture-of-Experts Layer for Heterogeneous Time Series Forecasting
Link: https://arxiv.org/abs/2508.07490
Authors: Ricardo Matos,Luis Roque,Vitor Cerqueira
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:
Abstract:Deep learning approaches are increasingly relevant for time series forecasting tasks. Methods such as N-BEATS, which is built on stacks of multilayer perceptron (MLP) blocks, have achieved state-of-the-art results on benchmark datasets and competitions. N-BEATS is also more interpretable relative to other deep learning approaches, as it decomposes forecasts into different time series components, such as trend and seasonality. In this work, we present N-BEATS-MOE, an extension of N-BEATS based on a Mixture-of-Experts (MoE) layer. N-BEATS-MOE employs a dynamic block weighting strategy based on a gating network which allows the model to better adapt to the characteristics of each time series. We also hypothesize that the gating mechanism provides additional interpretability by identifying which expert is most relevant for each series. We evaluate our method across 12 benchmark datasets against several approaches, achieving consistent improvements on several datasets, especially those composed of heterogeneous time series.
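The dynamic block weighting described in this abstract can be pictured as a softmax gate that mixes per-block forecasts. The sketch below is only a generic Mixture-of-Experts combination under that assumption: the gate logits are fixed numbers rather than the output of a trained gating network, and `combine_forecasts` is a hypothetical helper, not part of N-BEATS-MOE.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_forecasts(
    expert_forecasts: Sequence[Sequence[float]], gate_logits: Sequence[float]
) -> List[float]:
    """Weight each expert's forecast by its gate probability and sum per time step."""
    weights = softmax(gate_logits)
    horizon = len(expert_forecasts[0])
    return [
        sum(w * forecast[t] for w, forecast in zip(weights, expert_forecasts))
        for t in range(horizon)
    ]


# Two hypothetical experts (say, a trend block and a seasonality block), horizon of 2.
forecasts = [[1.0, 2.0], [3.0, 4.0]]
mixed = combine_forecasts(forecasts, gate_logits=[0.0, 0.0])
print(mixed)
```

Inspecting the softmax weights per series is what would make such a gate readable as an indicator of which expert matters most.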
[LG-42] Structured Superposition of Autoencoders for UEP Codes at Intermediate Blocklengths
Link: https://arxiv.org/abs/2508.07487
Authors: Vukan Ninkovic,Dejan Vukobratovic
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
*Comments: Accepted for publication at IEEE Communication Letters
Abstract:Unequal error protection (UEP) coding that enables differentiated reliability levels within a transmitted message is essential for modern communication systems. Autoencoder (AE)-based code designs have shown promise in the context of learned equal error protection (EEP) coding schemes. However, their application to UEP remains largely unexplored, particularly at intermediate blocklengths, due to the increasing complexity of AE-based models. Inspired by the proven effectiveness of superposition coding and successive interference cancellation (SIC) decoding in conventional UEP schemes, we propose a structured AE-based architecture that extends AE-based UEP codes to substantially larger blocklengths while maintaining efficient training. By structuring encoding and decoding into smaller AE subblocks, our method provides a flexible framework for fine-tuning UEP reliability levels while adapting to diverse system parameters. Numerical results show that the proposed approach improves over established achievability bounds of randomized superposition coding-based UEP schemes with SIC decoding, making the proposed structured AE-based UEP codes a scalable and efficient solution for next-generation networks.
[LG-43] Online Convex Optimization with Heavy Tails: Old Algorithms, New Regrets, and Applications
Link: https://arxiv.org/abs/2508.07473
Authors: Zijian Liu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*Comments: Part of this work is in submission
Abstract:In Online Convex Optimization (OCO), when the stochastic gradient has a finite variance, many algorithms provably work and guarantee a sublinear regret. However, limited results are known if the gradient estimate has a heavy tail, i.e., the stochastic gradient only admits a finite $\mathsf{p}$-th central moment for some $\mathsf{p} \in (1, 2]$. Motivated by this, this work examines different old algorithms for OCO (e.g., Online Gradient Descent) in the more challenging heavy-tailed setting. Under the standard bounded domain assumption, we establish new regrets for these classical methods without any algorithmic modification. Remarkably, these regret bounds are fully optimal in all parameters (can be achieved even without knowing $\mathsf{p}$), suggesting that OCO with heavy tails can be solved effectively without any extra operation (e.g., gradient clipping). Our new results have several applications. A particularly interesting one is the first provable convergence result for nonsmooth nonconvex optimization under heavy-tailed noise without gradient clipping. Furthermore, we explore broader settings (e.g., smooth OCO) and extend our ideas to optimistic algorithms to handle different cases simultaneously.
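To make the setting concrete, here is a minimal sketch of unmodified projected Online Gradient Descent on a bounded interval, fed gradients corrupted by symmetric Pareto noise with tail index 1.5 (finite mean, infinite variance). Everything specific is an assumption for illustration: the quadratic loss, the `base_step / sqrt(t)` schedule, and the noise model are not taken from the paper, and the code makes no claim about matching its regret bounds.

```python
import random
from typing import Callable, List


def project(x: float, radius: float) -> float:
    """Euclidean projection onto the interval [-radius, radius]."""
    return max(-radius, min(radius, x))


def online_gradient_descent(
    grad_oracle: Callable[[float], float], rounds: int, radius: float, base_step: float
) -> List[float]:
    """Plain projected OGD with no gradient clipping; returns the played iterates."""
    x = 0.0
    iterates = []
    for t in range(1, rounds + 1):
        iterates.append(x)
        g = grad_oracle(x)
        # Generic decaying step size; an assumption, not the paper's tuning.
        x = project(x - (base_step / t ** 0.5) * g, radius)
    return iterates


def noisy_gradient(x: float, rng: random.Random, target: float = 0.5, tail_index: float = 1.5) -> float:
    """Gradient of 0.5 * (x - target)^2 plus zero-mean, heavy-tailed noise."""
    noise = rng.choice((-1.0, 1.0)) * rng.paretovariate(tail_index)
    return (x - target) + noise


rng = random.Random(0)
iterates = online_gradient_descent(
    lambda x: noisy_gradient(x, rng), rounds=2000, radius=1.0, base_step=0.1
)
print(f"average iterate: {sum(iterates) / len(iterates):.3f}")
```

The projection step is what enforces the bounded-domain assumption; no clipping of `g` is applied anywhere.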
[LG-44] MOTGNN: Interpretable Graph Neural Networks for Multi-Omics Disease Classification
Link: https://arxiv.org/abs/2508.07465
Authors: Tiantian Yang,Zhiqian Chen
Subjects: Machine Learning (cs.LG); Genomics (q-bio.GN); Machine Learning (stat.ML)
*Comments: 11 pages, 6 figures
Abstract:Integrating multi-omics data, such as DNA methylation, mRNA expression, and microRNA (miRNA) expression, offers a comprehensive view of the biological mechanisms underlying disease. However, the high dimensionality and complex interactions among omics layers present major challenges for predictive modeling. We propose Multi-Omics integration with Tree-generated Graph Neural Network (MOTGNN), a novel and interpretable framework for binary disease classification. MOTGNN employs eXtreme Gradient Boosting (XGBoost) to perform omics-specific supervised graph construction, followed by modality-specific Graph Neural Networks (GNNs) for hierarchical representation learning, and a deep feedforward network for cross-omics integration. On three real-world disease datasets, MOTGNN outperforms state-of-the-art baselines by 5-10% in accuracy, ROC-AUC, and F1-score, and remains robust to severe class imbalance (e.g., 87.2% vs. 33.4% F1 on imbalanced data). The model maintains computational efficiency through sparse graphs (2.1-2.8 edges per node) and provides built-in interpretability, revealing both top-ranked biomarkers and the relative contributions of each omics modality. These results highlight MOTGNN’s potential to improve both predictive accuracy and interpretability in multi-omics disease modeling.
[LG-45] Towards Unveiling Predictive Uncertainty Vulnerabilities in the Context of the Right to Be Forgotten
Link: https://arxiv.org/abs/2508.07458
Authors: Wei Qian,Chenxu Zhao,Yangyi Li,Wenqian Ye,Mengdi Huai
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Currently, various uncertainty quantification methods have been proposed to provide certainty and probability estimates for deep learning models’ label predictions. Meanwhile, with the growing demand for the right to be forgotten, machine unlearning has been extensively studied as a means to remove the impact of requested sensitive data from a pre-trained model without retraining the model from scratch. However, the vulnerabilities of such generated predictive uncertainties with regard to dedicated malicious unlearning attacks remain unexplored. To bridge this gap, for the first time, we propose a new class of malicious unlearning attacks against predictive uncertainties, where the adversary aims to cause the desired manipulations of specific predictive uncertainty results. We also design novel optimization frameworks for our attacks and conduct extensive experiments, including black-box scenarios. Notably, our extensive experiments show that our attacks are more effective in manipulating predictive uncertainties than traditional attacks that focus on label misclassifications, and existing defenses against conventional attacks are ineffective against our attacks.
[LG-46] Unsupervised operator learning approach for dissipative equations via Onsager principle
Link: https://arxiv.org/abs/2508.07440
Authors: Zhipeng Chang,Zhenye Wen,Xiaofei Zhao
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Existing operator learning methods rely on supervised training with high-fidelity simulation data, introducing significant computational cost. In this work, we propose the deep Onsager operator learning (DOOL) method, a novel unsupervised framework for solving dissipative equations. Rooted in the Onsager variational principle (OVP), DOOL trains a deep operator network by directly minimizing the OVP-defined Rayleighian functional, requiring no labeled data, and then proceeds in time explicitly through conservation/change laws for the solution. Another key innovation here lies in the spatiotemporal decoupling strategy: the operator’s trunk network processes spatial coordinates exclusively, thereby enhancing training efficiency, while integrated external time stepping enables temporal extrapolation. Numerical experiments on typical dissipative equations validate the effectiveness of the DOOL method, and systematic comparisons with supervised DeepONet and MIONet demonstrate its enhanced performance. Extensions are made to cover the second-order wave models with dissipation that do not directly follow OVP.
[LG-47] Efficient Reward Identification In Max Entropy Reinforcement Learning with Sparsity and Rank Priors
Link: https://arxiv.org/abs/2508.07400
Authors: Mohamad Louai Shehab,Alperen Tercan,Necmiye Ozay
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:In this paper, we consider the problem of recovering time-varying reward functions from either optimal policies or demonstrations coming from a max entropy reinforcement learning problem. This problem is highly ill-posed without additional assumptions on the underlying rewards. However, in many applications, the rewards are indeed parsimonious, and some prior information is available. We consider two such priors on the rewards: 1) rewards are mostly constant and they change infrequently, 2) rewards can be represented by a linear combination of a small number of feature functions. We first show that the reward identification problem with the former prior can be recast as a sparsification problem subject to linear constraints. Moreover, we give a polynomial-time algorithm that solves this sparsification problem exactly. Then, we show that identifying rewards representable with the minimum number of features can be recast as a rank minimization problem subject to linear constraints, for which convex relaxations of rank can be invoked. In both cases, these observations lead to efficient optimization-based reward identification algorithms. Several examples are given to demonstrate the accuracy of the recovered rewards as well as their generalizability.
[LG-48] Parity Requires Unified Input Dependence and Negative Eigenvalues in SSMs ICML2025
Link: https://arxiv.org/abs/2508.07395
Authors: Behnoush Khavari,Mehran Shakerinava,Jayesh Khullar,Jerry Huang,François Rivest,Siamak Ravanbakhsh,Sarath Chandar
Subjects: Machine Learning (cs.LG)
*Comments: 5 pages. Accepted at ICML 2025 Workshop on Methods and Opportunities at Small Scale
Abstract:Recent work has shown that LRNN models such as S4D, Mamba, and DeltaNet lack state-tracking capability due to either time-invariant transition matrices or restricted eigenvalue ranges. To address this, input-dependent transition matrices, particularly those that are complex or non-triangular, have been proposed to enhance SSM performance on such tasks. While existing theorems demonstrate that both input-independent and non-negative SSMs are incapable of solving simple state-tracking tasks, such as parity, regardless of depth, they do not explore whether combining these two types in a multilayer SSM could help. We investigate this question for efficient SSMs with diagonal transition matrices and show that such combinations still fail to solve parity. This implies that a recurrence layer must both be input-dependent and include negative eigenvalues. Our experiments support this conclusion by analyzing an SSM model that combines S4D and Mamba layers.
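A one-dimensional example helps show why both ingredients in this abstract matter. The sketch below tracks parity with a scalar linear recurrence whose transition value depends on the input and takes the value -1. It is a well-known toy construction offered as intuition, not the paper's proof or experimental model; the function name is hypothetical.

```python
from typing import Sequence


def parity_via_recurrence(bits: Sequence[int]) -> int:
    """Track parity with the scalar recurrence h_t = a(x_t) * h_{t-1}, h_0 = 1.

    The transition a(x_t) is input-dependent and takes a negative value:
    a(1) = -1 flips the sign of the state, a(0) = +1 leaves it unchanged.
    """
    h = 1.0
    for b in bits:
        a = -1.0 if b == 1 else 1.0
        h = a * h
    # h is +1 for an even number of ones and -1 for an odd number.
    return int((1 - h) / 2)


print(parity_via_recurrence([1, 0, 1, 1]))
```

If `a` were restricted to non-negative values, the state could never change sign; if `a` were constant across inputs, the state could not depend on which bits were seen. This toy case shows only that having both properties in one layer is sufficient for parity.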
[LG-49] Tight Bounds for Schrödinger Potential Estimation in Unpaired Image-to-Image Translation Problems
Link: https://arxiv.org/abs/2508.07392
Authors: Nikita Puchkin,Denis Suchkov,Alexey Naumov,Denis Belomestny
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*Comments: 54 pages, 4 figures
Abstract:Modern methods of generative modelling and unpaired image-to-image translation based on Schrödinger bridges and stochastic optimal control theory aim to transform an initial density to a target one in an optimal way. In the present paper, we assume that we only have access to i.i.d. samples from initial and final distributions. This makes our setup suitable for both generative modelling and unpaired image-to-image translation. Relying on the stochastic optimal control approach, we choose an Ornstein-Uhlenbeck process as the reference one and estimate the corresponding Schrödinger potential. Introducing a risk function as the Kullback-Leibler divergence between couplings, we derive tight bounds on generalization ability of an empirical risk minimizer in a class of Schrödinger potentials including Gaussian mixtures. Thanks to the mixing properties of the Ornstein-Uhlenbeck process, we almost achieve fast rates of convergence up to some logarithmic factors in favourable scenarios. We also illustrate performance of the suggested approach with numerical experiments.
[LG-50] Intrinsic training dynamics of deep neural networks
Link: https://arxiv.org/abs/2508.07370
Authors: Sibylle Marcotte,Gabriel Peyré,Rémi Gribonval
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:A fundamental challenge in the theory of deep learning is to understand whether gradient-based training in high-dimensional parameter spaces can be captured by simpler, lower-dimensional structures, leading to so-called implicit bias. As a stepping stone, we study when a gradient flow on a high-dimensional variable $\theta$ implies an intrinsic gradient flow on a lower-dimensional variable $z = \phi(\theta)$, for an architecture-related function $\phi$. We express a so-called intrinsic dynamic property and show how it is related to the study of conservation laws associated with the factorization $\phi$. This leads to a simple criterion based on the inclusion of kernels of linear maps which yields a necessary condition for this property to hold. We then apply our theory to general ReLU networks of arbitrary depth and show that, for any initialization, it is possible to rewrite the flow as an intrinsic dynamic in a lower dimension that depends only on $z$ and the initialization, when $\phi$ is the so-called path-lifting. In the case of linear networks with $\phi$ the product of weight matrices, so-called balanced initializations are also known to enable such a dimensionality reduction; we generalize this result to a broader class of relaxed balanced initializations, showing that, in certain configurations, these are the only initializations that ensure the intrinsic dynamic property. Finally, for the linear neural ODE associated with the limit of infinitely deep linear networks, with relaxed balanced initialization, we explicitly express the corresponding intrinsic dynamics.
[LG-51] Finite-Time Convergence Analysis of ODE-based Generative Models for Stochastic Interpolants
Link: https://arxiv.org/abs/2508.07333
Authors: Yuhao Liu,Rui Hu,Yu Chen,Longbo Huang
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Stochastic interpolants offer a robust framework for continuously transforming samples between arbitrary data distributions, holding significant promise for generative modeling. Despite their potential, rigorous finite-time convergence guarantees for practical numerical schemes remain largely unexplored. In this work, we address the finite-time convergence analysis of numerical implementations for ordinary differential equations (ODEs) derived from stochastic interpolants. Specifically, we establish novel finite-time error bounds in total variation distance for two widely used numerical integrators: the first-order forward Euler method and the second-order Heun’s method. Furthermore, our analysis on the iteration complexity of specific stochastic interpolant constructions provides optimized schedules to enhance computational efficiency. Our theoretical findings are corroborated by numerical experiments, which validate the derived error bounds and complexity analyses.
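The two integrators this abstract analyzes are standard, and a scalar sketch shows how they differ. The velocity field below is a toy stand-in (dx/dt = x, whose exact solution at t = 1 is e), not a learned stochastic-interpolant velocity, and the comparison is a pointwise error on one trajectory rather than the total variation bounds derived in the paper.

```python
import math
from typing import Callable

Velocity = Callable[[float, float], float]


def euler(v: Velocity, x0: float, n_steps: int, t0: float = 0.0, t1: float = 1.0) -> float:
    """First-order forward Euler: one velocity evaluation per step."""
    h = (t1 - t0) / n_steps
    t, x = t0, x0
    for _ in range(n_steps):
        x = x + h * v(t, x)
        t += h
    return x


def heun(v: Velocity, x0: float, n_steps: int, t0: float = 0.0, t1: float = 1.0) -> float:
    """Second-order Heun: Euler predictor followed by a trapezoidal corrector."""
    h = (t1 - t0) / n_steps
    t, x = t0, x0
    for _ in range(n_steps):
        k1 = v(t, x)
        k2 = v(t + h, x + h * k1)
        x = x + 0.5 * h * (k1 + k2)
        t += h
    return x


def toy_velocity(t: float, x: float) -> float:
    """Toy field dx/dt = x; an assumption standing in for a learned velocity."""
    return x


euler_err = abs(euler(toy_velocity, 1.0, 100) - math.e)
heun_err = abs(heun(toy_velocity, 1.0, 100) - math.e)
print(f"Euler error: {euler_err:.2e}, Heun error: {heun_err:.2e}")
```

Heun spends two velocity evaluations per step, so a fair comparison of the two schemes should count evaluations rather than steps.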
[LG-52] PySeizure: A single machine learning classifier framework to detect seizures in diverse datasets
Link: https://arxiv.org/abs/2508.07253
Authors: Bartlomiej Chybowski,Shima Abdullateef,Hollan Haule,Alfredo Gonzalez-Sulser,Javier Escudero
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
*Comments:
Abstract:Reliable seizure detection is critical for diagnosing and managing epilepsy, yet clinical workflows remain dependent on time-consuming manual EEG interpretation. While machine learning has shown promise, existing approaches often rely on dataset-specific optimisations, limiting their real-world applicability and reproducibility. Here, we introduce an innovative, open-source machine-learning framework that enables robust and generalisable seizure detection across varied clinical datasets. We evaluate our approach on two publicly available EEG datasets that differ in patient populations and electrode configurations. To enhance robustness, the framework incorporates an automated pre-processing pipeline to standardise data and a majority voting mechanism, in which multiple models independently assess each second of EEG before reaching a final decision. We train, tune, and evaluate models within each dataset, assessing their cross-dataset transferability. Our models achieve high within-dataset performance (AUC 0.904+/-0.059 for CHB-MIT and 0.864+/-0.060 for TUSZ) and demonstrate strong generalisation across datasets despite differences in EEG setups and populations (AUC 0.615+/-0.039 for models trained on CHB-MIT and tested on TUSZ and 0.762+/-0.175 in the reverse case) without any post-processing. Furthermore, a mild post-processing improved the within-dataset results to 0.913+/-0.064 and 0.867+/-0.058 and cross-dataset results to 0.619+/-0.036 and 0.768+/-0.172. These results underscore the potential of, and essential considerations for, deploying our framework in diverse clinical settings. By making our methodology fully reproducible, we provide a foundation for advancing clinically viable, dataset-agnostic seizure detection systems. This approach has the potential for widespread adoption, complementing rather than replacing expert interpretation, and accelerating clinical integration.
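The majority voting mechanism mentioned in this abstract (several models each label every second of EEG, then a final decision is taken) can be sketched as follows. This is an assumed minimal form, not PySeizure's code: the function name, the binary label encoding, and the rule that ties count as non-seizure are all choices made here for illustration.

```python
from typing import List, Sequence


def majority_vote(per_model_labels: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-second binary labels (1 = seizure) from several models.

    A second is marked as seizure only when strictly more than half of the
    models vote 1; ties fall back to 0 (an assumption of this sketch).
    """
    n_models = len(per_model_labels)
    return [
        1 if 2 * sum(votes) > n_models else 0
        for votes in zip(*per_model_labels)
    ]


# Three hypothetical models labelling the same four seconds of EEG.
labels = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
decision = majority_vote(labels)
print(decision)
```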
[LG-53] Policy Newton methods for Distortion Riskmetrics
Link: https://arxiv.org/abs/2508.07249
Authors: Soumen Pachal,Mizhaan Prajit Maniyar,Prashanth L.A
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:We consider the problem of risk-sensitive control in a reinforcement learning (RL) framework. In particular, we aim to find a risk-optimal policy by maximizing the distortion riskmetric (DRM) of the discounted reward in a finite horizon Markov decision process (MDP). DRMs are a rich class of risk measures that include several well-known risk measures as special cases. We derive a policy Hessian theorem for the DRM objective using the likelihood ratio method. Using this result, we propose a natural DRM Hessian estimator from sample trajectories of the underlying MDP. Next, we present a cubic-regularized policy Newton algorithm for solving this problem in an on-policy RL setting using estimates of the DRM gradient and Hessian. Our proposed algorithm is shown to converge to an $\epsilon$-second-order stationary point ($\epsilon$-SOSP) of the DRM objective, and this guarantee ensures the escaping of saddle points. The sample complexity of our algorithms to find an $\epsilon$-SOSP is $\mathcal{O}(\epsilon^{-3.5})$. Our experiments validate the theoretical findings. To the best of our knowledge, ours is the first work to present convergence to an $\epsilon$-SOSP of a risk-sensitive objective, while existing works in the literature have either shown convergence to a first-order stationary point of a risk-sensitive objective, or a SOSP of a risk-neutral one.
[LG-54] Strategic Incentivization for Locally Differentially Private Federated Learning
Link: https://arxiv.org/abs/2508.07138
Authors: Yashwant Krishna Pagoti,Arunesh Sinha,Shamik Sural
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*Comments:
Abstract:In Federated Learning (FL), multiple clients jointly train a machine learning model by sharing gradient information, instead of raw data, with a server over multiple rounds. To address the possibility of information leakage in spite of sharing only the gradients, Local Differential Privacy (LDP) is often used. In LDP, clients add a selective amount of noise to the gradients before sending them to the server. Although such noise addition protects the privacy of clients, it leads to a degradation in global model accuracy. In this paper, we model this privacy-accuracy trade-off as a game, where the server incentivizes the clients to add a lower degree of noise for achieving higher accuracy, while the clients attempt to preserve their privacy at the cost of a potential loss in accuracy. A token-based incentivization mechanism is introduced in which the quantum of tokens credited to a client in an FL round is a function of the degree of perturbation of its gradients. The client can later access a newly updated global model only after acquiring enough tokens, which are to be deducted from its balance. We identify the players, their actions and payoff, and perform a strategic analysis of the game. Extensive experiments were carried out to study the impact of different parameters.
[LG-55] A Globally Optimal Analytic Solution for Semi-Nonnegative Matrix Factorization with Nonnegative or Mixed Inputs
Link: https://arxiv.org/abs/2508.07134
Authors: Lu Chenggang
Subjects: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
Comments: 10 pages, 2 figures, under review in [SIAM Journal of Optimization]
Abstract:Semi-Nonnegative Matrix Factorization (semi-NMF) extends classical Nonnegative Matrix Factorization (NMF) by allowing the basis matrix to contain both positive and negative entries, making it suitable for decomposing data with mixed signs. However, most existing semi-NMF algorithms are iterative, non-convex, and prone to local minima. In this paper, we propose a novel method that yields a globally optimal solution to the semi-NMF problem under the Frobenius norm, through an orthogonal decomposition derived from the scatter matrix of the input data. We rigorously prove that our solution attains the global minimum of the reconstruction error. Furthermore, we demonstrate that when the input matrix is nonnegative, our method often achieves lower reconstruction error than standard NMF algorithms, although unfortunately the basis matrix may not satisfy nonnegativity. In particular, in low-rank cases such as rank 1 or 2, our solution reduces exactly to a nonnegative factorization, recovering the NMF structure. We validate our approach through experiments on both synthetic data and the UCI Wine dataset, showing that our method consistently outperforms existing NMF and semi-NMF methods in terms of reconstruction accuracy. These results confirm that our globally optimal, non-iterative formulation offers both theoretical guarantees and empirical advantages, providing a new perspective on matrix factorization in optimization and data analysis.
[LG-56] How Effectively Can Large Language Models Connect SNP Variants and ECG Phenotypes for Cardiovascular Risk Prediction?
Link: https://arxiv.org/abs/2508.07127
Authors: Niranjana Arun Menon,Iqra Farooq,Yulong Li,Sara Ahmed,Yutong Xie,Muhammad Awais,Imran Razzak
Subjects: Machine Learning (cs.LG); Genomics (q-bio.GN)
Comments:
Abstract:Cardiovascular disease (CVD) prediction remains a tremendous challenge due to its multifactorial etiology and global burden of morbidity and mortality. Despite the growing availability of genomic and electrophysiological data, extracting biologically meaningful insights from such high-dimensional, noisy, and sparsely annotated datasets remains a non-trivial task. Recently, LLMs have been applied effectively to predict structural variations in biological sequences. In this work, we explore the potential of fine-tuned LLMs to predict cardiac diseases and SNPs potentially leading to CVD risk using genetic markers derived from high-throughput genomic profiling. We investigate the effect of genetic patterns associated with cardiac conditions and evaluate how LLMs can learn latent biological relationships from structured and semi-structured genomic data obtained by mapping genetic aspects that are inherited from the family tree. By framing the problem as a Chain of Thought (CoT) reasoning task, the models are prompted to generate disease labels and articulate informed clinical deductions across diverse patient profiles and phenotypes. The findings highlight the promise of LLMs in contributing to early detection, risk assessment, and ultimately, the advancement of personalized medicine in cardiac care.
[LG-57] Multi-Level Service Performance Forecasting via Spatiotemporal Graph Neural Networks
Link: https://arxiv.org/abs/2508.07122
Authors: Zhihao Xue,Yun Zi,Nia Qi,Ming Gong,Yujun Zou
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:This paper proposes a spatiotemporal graph neural network-based performance prediction algorithm to address the challenge of forecasting performance fluctuations in distributed backend systems with multi-level service call structures. The method abstracts system states at different time slices into a sequence of graph structures. It integrates the runtime features of service nodes with the invocation relationships among services to construct a unified spatiotemporal modeling framework. The model first applies a graph convolutional network to extract high-order dependency information from the service topology. Then it uses a gated recurrent network to capture the dynamic evolution of performance metrics over time. A time encoding mechanism is also introduced to enhance the model’s ability to represent non-stationary temporal sequences. The architecture is trained in an end-to-end manner, optimizing the multi-layer nested structure to achieve high-precision regression of future service performance metrics. To validate the effectiveness of the proposed method, a large-scale public cluster dataset is used. A series of multi-dimensional experiments are designed, including variations in time windows and concurrent load levels. These experiments comprehensively evaluate the model’s predictive performance and stability. The experimental results show that the proposed model outperforms existing representative methods across key metrics such as MAE, RMSE, and R2. It maintains strong robustness under varying load intensities and structural complexities. These results demonstrate the model’s practical potential for backend service performance management tasks.
[LG-58] From Nodes to Narratives: Explaining Graph Neural Networks with LLMs and Graph Context
Link: https://arxiv.org/abs/2508.07117
Authors: Peyman Baghershahi,Gregoire Fournier,Pranav Nyati,Sourav Medya
Subjects: Machine Learning (cs.LG)
Comments: 18 pages, 3 figures, 8 tables
Abstract:Graph Neural Networks (GNNs) have emerged as powerful tools for learning over structured data, including text-attributed graphs, which are common in domains such as citation networks, social platforms, and knowledge graphs. GNNs are not inherently interpretable and thus, many explanation methods have been proposed. However, existing explanation methods often struggle to generate interpretable, fine-grained rationales, especially when node attributes include rich natural language. In this work, we introduce LOGIC, a lightweight, post-hoc framework that uses large language models (LLMs) to generate faithful and interpretable explanations for GNN predictions. LOGIC projects GNN node embeddings into the LLM embedding space and constructs hybrid prompts that interleave soft prompts with textual inputs from the graph structure. This enables the LLM to reason about GNN internal representations and produce natural language explanations along with concise explanation subgraphs. Our experiments across four real-world TAG datasets demonstrate that LOGIC achieves a favorable trade-off between fidelity and sparsity, while significantly improving human-centric metrics such as insightfulness. LOGIC sets a new direction for LLM-based explainability in graph learning by aligning GNN internals with human reasoning.
[LG-59] Approaching Maximal Information Extraction in Low-Signal Regimes via Multiple Instance Learning
Link: https://arxiv.org/abs/2508.07114
Authors: Atakan Azakli,Bernd Stelzer
Subjects: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
Comments:
Abstract:In this work, we propose a new machine learning (ML) methodology to obtain more precise predictions for some parameters of interest in a given hypothesis testing problem. Our proposed method also allows ML models to have more discriminative power in cases where it is extremely challenging for state-of-the-art classifiers to make accurate predictions at all. This method can also allow us to systematically decrease the error of ML models in their predictions. In this paper, we provide a mathematical motivation for why Multiple Instance Learning (MIL) models would have more predictive power than their single-instance counterparts. We support our theoretical claims by analyzing how MIL models scale with the number of instances on which the model makes predictions. As a concrete application, we constrain Wilson coefficients of the Standard Model Effective Field Theory (SMEFT) using kinematic information from subatomic particle collision events at the Large Hadron Collider (LHC). We show that under certain circumstances, it might be possible to extract the theoretical maximum Fisher Information latent in a dataset.
[LG-60] BrainATCL: Adaptive Temporal Brain Connectivity Learning for Functional Link Prediction and Age Estimation
Link: https://arxiv.org/abs/2508.07106
Authors: Yiran Huang,Amirhossein Nouranizadeh,Christine Ahrends,Mengjia Xu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Functional Magnetic Resonance Imaging (fMRI) is an imaging technique widely used to study human brain activity. fMRI signals in areas across the brain transiently synchronise and desynchronise their activity in a highly structured manner, even when an individual is at rest. These functional connectivity dynamics may be related to behaviour and neuropsychiatric disease. To model these dynamics, temporal brain connectivity representations are essential, as they reflect evolving interactions between brain regions and provide insight into transient neural states and network reconfigurations. However, conventional graph neural networks (GNNs) often struggle to capture long-range temporal dependencies in dynamic fMRI data. To address this challenge, we propose BrainATCL, an unsupervised, nonparametric framework for adaptive temporal brain connectivity learning, enabling functional link prediction and age estimation. Our method dynamically adjusts the lookback window for each snapshot based on the rate of newly added edges. Graph sequences are subsequently encoded using a GINE-Mamba2 backbone to learn spatial-temporal representations of dynamic functional connectivity in resting-state fMRI data of 1,000 participants from the Human Connectome Project. To further improve spatial modeling, we incorporate brain structure and function-informed edge attributes, i.e., the left/right hemispheric identity and subnetwork membership of brain regions, enabling the model to capture biologically meaningful topological patterns. We evaluate our BrainATCL on two tasks: functional link prediction and age estimation. The experimental results demonstrate superior performance and strong generalization, including in cross-session prediction scenarios.
[LG-61] SEF-MK: Speaker-Embedding-Free Voice Anonymization through Multi-k-means Quantization
Link: https://arxiv.org/abs/2508.07086
Authors: Beilong Tang,Xiaoxiao Miao,Xin Wang,Ming Li
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures, accepted by 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Abstract:Voice anonymization protects speaker privacy by concealing identity while preserving linguistic and paralinguistic content. Self-supervised learning (SSL) representations encode linguistic features but preserve speaker traits. We propose a novel speaker-embedding-free framework called SEF-MK. Instead of using a single k-means model trained on the entire dataset, SEF-MK anonymizes SSL representations for each utterance by randomly selecting one of multiple k-means models, each trained on a different subset of speakers. We explore this approach from both attacker and user perspectives. Extensive experiments show that, compared to a single k-means model, SEF-MK with multiple k-means models better preserves linguistic and emotional content from the user’s viewpoint. However, from the attacker’s perspective, utilizing multiple k-means models boosts the effectiveness of privacy attacks. These insights can aid users in designing voice anonymization systems to mitigate attacker threats.
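The core idea in the abstract above (several k-means models, each trained on a different speaker subset, with one picked at random per utterance) can be sketched in a few lines. This is a toy illustration only: 2-D points stand in for SSL feature frames, plain Lloyd's k-means is used, and the names `fit_kmeans`, `anonymize`, and the speaker subsets are assumptions, not the SEF-MK implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, frame):
    """Index of the centroid closest to `frame`."""
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k], frame))


def fit_kmeans(frames, k, rng, iters=10):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    centroids = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for f in frames:
            buckets[nearest(centroids, f)].append(f)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def anonymize(utterance, models, rng):
    """Quantize every frame with one k-means model chosen at random."""
    centroids = rng.choice(models)
    return [centroids[nearest(centroids, f)] for f in utterance]


rng = random.Random(0)
# Hypothetical toy data: 4 speakers, 20 two-dimensional frames each.
speakers = {s: [[rng.gauss(s, 1.0), rng.gauss(-s, 1.0)] for _ in range(20)]
            for s in range(4)}
subsets = [[0, 1], [2, 3]]  # each model sees a different subset of speakers
models = [fit_kmeans([f for s in sub for f in speakers[s]], 3, rng)
          for sub in subsets]
quantized = anonymize(speakers[0][:5], models, rng)
```

Each output frame is a centroid of the selected model, so the quantized sequence no longer carries the fine-grained variation of the original frames.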
[LG-62] Improving Real-Time Concept Drift Detection using a Hybrid Transformer-Autoencoder Framework
Link: https://arxiv.org/abs/2508.07085
Authors: N Harshit,K Mounvik
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In applied machine learning, concept drift, i.e., a gradual or abrupt change in the data distribution, can significantly reduce model performance. Typical detection methods, such as statistical tests or reconstruction-based models, are generally reactive and not very sensitive to early drift. Our study proposes a hybrid framework combining Transformers and Autoencoders to model complex temporal dynamics and provide online drift detection. We create a distinct Trust Score methodology, which combines signals from (1) statistical and reconstruction-based drift metrics, specifically PSI, JSD, and Transformer-AE error, (2) prediction uncertainty, (3) rule violations, and (4) the trend of classifier error aligned with the combined metrics defined by the Trust Score. Using a time-sequenced airline passenger dataset with synthetic drift, our proposed model detects drift better than baseline methods, both overall and at different detection thresholds, in terms of sensitivity and interpretability, and provides a strong pipeline for real-time drift detection in applied machine learning. We evaluated performance on a time-sequenced airline passenger dataset with gradually injected drift, e.g., permuted ticket prices in later batches, split into 10 time segments [1]. On this data, our results indicate that the Transformer-Autoencoder detected drift earlier and with more sensitivity than the autoencoders commonly used in the literature, and provided improved modeling of error rates and logical violations. Therefore, a robust framework was developed to reliably monitor concept drift.
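Two of the statistical drift signals named in the abstract above, PSI (Population Stability Index) and JSD (Jensen-Shannon divergence), have standard definitions over binned distributions. The sketch below shows those textbook formulas only. The bin edges, the ticket-price samples, and the function names are hypothetical, and how the paper combines these signals into its Trust Score is not reproduced here.

```python
import math


def bin_proportions(values, edges):
    """Share of values in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no value falls inside the given bin edges")
    return [c / total for c in counts]


def psi(expected, actual, eps=1e-6):
    """Population Stability Index; proportions are clipped at eps to avoid log(0)."""
    score = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        score += (a - e) * math.log(a / e)
    return score


def jsd(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical ticket prices: a reference batch and a later, shifted batch.
edges = [100, 150, 200, 250, 300]
reference = bin_proportions([100, 120, 130, 150, 160, 180, 200, 220], edges)
current = bin_proportions([150, 160, 180, 200, 220, 240, 260, 280], edges)
drift_psi = psi(reference, current)
drift_jsd = jsd(reference, current)
```

Both scores are zero when the two batches have identical bin proportions and grow as the distributions move apart. Any alerting threshold on them is an application-specific choice.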
[LG-63] Differentiable Adaptive Kalman Filtering via Optimal Transport
Link: https://arxiv.org/abs/2508.07037
Authors: Yangguang He,Wenhao Li,Minzhe Li,Juan Zhang,Xiangfeng Wang,Bo Jin
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 20 pages
Abstract:Learning-based filtering has demonstrated strong performance in non-linear dynamical systems, particularly when the statistics of noise are unknown. However, in real-world deployments, environmental factors, such as changing wind conditions or electromagnetic interference, can induce unobserved noise-statistics drift, leading to substantial degradation of learning-based methods. To address this challenge, we propose OTAKNet, the first online solution to noise-statistics drift within learning-based adaptive Kalman filtering. Unlike existing learning-based methods that perform offline fine-tuning using batch pointwise matching over entire trajectories, OTAKNet establishes a connection between the state estimate and the drift via one-step predictive measurement likelihood, and addresses it using optimal transport. This leverages OT’s geometry-aware cost and stable gradients to enable fully online adaptation without ground truth labels or retraining. We compare OTAKNet against classical model-based adaptive Kalman filtering and offline learning-based filtering. The performance is demonstrated on both synthetic and real-world NCLT datasets, particularly under limited training data.
[LG-64] A Stage-Aware Mixture of Experts Framework for Neurodegenerative Disease Progression Modelling
Link: https://arxiv.org/abs/2508.07032
Authors: Tiantian He,Keyue Jiang,An Zhao,Anna Schroder,Elinor Thompson,Sonja Soskic,Frederik Barkhof,Daniel C. Alexander
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:
Abstract:The long-term progression of neurodegenerative diseases is commonly conceptualized as a spatiotemporal diffusion process that consists of a graph diffusion process across the structural brain connectome and a localized reaction process within brain regions. However, modeling this progression remains challenging due to 1) the scarcity of longitudinal data obtained through irregular and infrequent subject visits and 2) the complex interplay of pathological mechanisms across brain regions and disease stages, where traditional models assume fixed mechanisms throughout disease progression. To address these limitations, we propose a novel stage-aware Mixture of Experts (MoE) framework that explicitly models how different contributing mechanisms dominate at different disease stages through time-dependent expert […]. Data-wise, we utilize an iterative dual optimization method to properly estimate the temporal position of individual observations, constructing a cohort-level progression trajectory from irregular snapshots. Model-wise, we enhance the spatial component with an inhomogeneous graph neural diffusion model (IGND) that allows diffusivity to vary based on node states and time, providing more flexible representations of brain networks. We also introduce a localized neural reaction module to capture complex dynamics beyond standard […]. The resulting IGND-MoE model dynamically integrates these components across temporal states, offering a principled way to understand how stage-specific pathological mechanisms contribute to progression. The stage-wise weights yield novel clinical insights that align with literature, suggesting that graph-related processes are more influential at early stages, while other unknown physical processes become dominant later on.
[LG-65] TLCCSP: A Scalable Framework for Enhancing Time Series Forecasting with Time-Lagged Cross-Correlations
Link: https://arxiv.org/abs/2508.07016
Authors: Jianfei Wu,Wenmian Yang,Bingning Liu,Weijia Jia
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments:
Abstract:Time series forecasting is critical across various domains, such as weather, finance and real estate forecasting, as accurate forecasts support informed decision-making and risk mitigation. While recent deep learning models have improved predictive capabilities, they often overlook time-lagged cross-correlations between related sequences, which are crucial for capturing complex temporal relationships. To address this, we propose the Time-Lagged Cross-Correlations-based Sequence Prediction framework (TLCCSP), which enhances forecasting accuracy by effectively integrating time-lagged cross-correlated sequences. TLCCSP employs the Sequence Shifted Dynamic Time Warping (SSDTW) algorithm to capture lagged correlations and a contrastive learning-based encoder to efficiently approximate SSDTW distances. Experimental results on weather, finance and real estate time series datasets demonstrate the effectiveness of our framework. On the weather dataset, SSDTW reduces mean squared error (MSE) by 16.01% compared with single-sequence methods, while the contrastive learning encoder (CLE) further decreases MSE by 17.88%. On the stock dataset, SSDTW achieves a 9.95% MSE reduction, and CLE reduces it by 6.13%. For the real estate dataset, SSDTW and CLE reduce MSE by 21.29% and 8.62%, respectively. Additionally, the contrastive learning approach decreases SSDTW computational time by approximately 99%, ensuring scalability and real-time applicability across multiple time series forecasting tasks.
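To make the notion of a time-lagged cross-correlation concrete, the sketch below scans a range of lags and reports the one at which a candidate series correlates best with a target series. It uses plain Pearson correlation on shifted windows, which is a much simpler stand-in for the SSDTW algorithm described in the abstract above. The function names and the toy series are assumptions for illustration.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def best_lag(target, candidate, max_lag):
    """Return (lag, r) such that candidate[t] correlates best with target[t + lag]."""
    best = (0, pearson(target, candidate))
    for lag in range(1, max_lag + 1):
        r = pearson(target[lag:], candidate[:-lag])
        if r > best[1]:
            best = (lag, r)
    return best


# Toy data: the target repeats the candidate three steps later.
candidate = [(7 * i) % 11 for i in range(40)]
target = [0, 0, 0] + candidate[:-3]
lag, r = best_lag(target, candidate, max_lag=5)
```

A series that leads the target by a stable lag can then be shifted by that lag and supplied as an extra input to a forecaster. How TLCCSP selects and integrates such sequences is described in the paper, not in this sketch.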
[LG-66] A Comparative Study of Feature Selection in Tsetlin Machines
Link: https://arxiv.org/abs/2508.06991
Authors: Vojtech Halenka,Ole-Christoffer Granmo,Lei Jiao,Per-Arne Andersen
Subjects: Machine Learning (cs.LG)
Comments: submitted to SGAI-2025: The 45th SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence
Abstract:Feature Selection (FS) is crucial for improving model interpretability, reducing complexity, and sometimes for enhancing accuracy. The recently introduced Tsetlin Machine (TM) offers interpretable clause-based learning, but lacks established tools for estimating feature importance. In this paper, we adapt and evaluate a range of FS techniques for TMs, including classical filter and embedded methods as well as post-hoc explanation methods originally developed for neural networks (e.g., SHAP and LIME) and a novel family of embedded scorers derived from TM clause weights and Tsetlin automaton (TA) states. We benchmark all methods across 12 datasets, using evaluation protocols such as Remove and Retrain (ROAR) and Remove and Debias (ROAD) to assess causal impact. Our results show that TM-internal scorers not only perform competitively but also exploit the interpretability of clauses to reveal interacting feature patterns. Simpler TM-specific scorers achieve similar accuracy retention at a fraction of the computational cost. This study establishes the first comprehensive baseline for FS in TMs and paves the way for developing specialized TM-specific interpretability techniques.
[LG-67] UniMove: A Unified Model for Multi-city Human Mobility Prediction
Link: https://arxiv.org/abs/2508.06986
Authors: Chonghua Han,Yuan Yuan,Yukun Liu,Jingtao Ding,Jie Feng,Yong Li
Subjects: Machine Learning (cs.LG)
Comments: Accepted by SIGSPATIAL 2025
Abstract:Human mobility prediction is vital for urban planning, transportation optimization, and personalized services. However, the inherent randomness, non-uniform time intervals, and complex patterns of human mobility, compounded by the heterogeneity introduced by varying city structures, infrastructure, and population densities, present significant challenges in modeling. Existing solutions often require training separate models for each city due to distinct spatial representations and geographic coverage. In this paper, we propose UniMove, a unified model for multi-city human mobility prediction, addressing two challenges: (1) constructing universal spatial representations for effective token sharing across cities, and (2) modeling heterogeneous mobility patterns from varying city characteristics. We propose a trajectory-location dual-tower architecture, with a location tower for universal spatial encoding and a trajectory tower for sequential mobility modeling. We also design MoE Transformer blocks to adaptively select experts to handle diverse movement patterns. Extensive experiments across multiple datasets from diverse cities demonstrate that UniMove truly embodies the essence of a unified model. By enabling joint training on multi-city data with mutual data enhancement, it significantly improves mobility prediction accuracy by over 10.2%. UniMove represents a key advancement toward realizing a true foundational model with a unified architecture for human mobility. We release the implementation at this https URL.
[LG-68] Discovery Learning accelerates battery design evaluation
Link: https://arxiv.org/abs/2508.06985
Authors: Jiawei Zhang,Yifei Zhang,Baozhao Yi,Yao Ren,Qi Jiao,Hanyu Bai,Weiran Jiang,Ziyou Song
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Systems and Control (eess.SY); Computational Physics (physics.comp-ph)
Comments: Main text, 20 pages, 5 figures
Abstract:Fast and reliable validation of novel designs in complex physical systems such as batteries is critical to accelerating technological innovation. However, battery research and development remain bottlenecked by the prohibitively high time and energy costs required to evaluate numerous new design candidates, particularly in battery prototyping and life testing. Despite recent progress in data-driven battery lifetime prediction, existing methods require labeled data of target designs to improve accuracy and cannot make reliable predictions until after prototyping, thus falling far short of the efficiency needed to enable rapid feedback for battery design. Here, we introduce Discovery Learning (DL), a scientific machine-learning paradigm that integrates active learning, physics-guided learning, and zero-shot learning into a human-like reasoning loop, drawing inspiration from learning theories in educational psychology. DL can learn from historical battery designs and actively reduce the need for prototyping, thus enabling rapid lifetime evaluation for unobserved material-design combinations without requiring additional data labeling. To test DL, we present 123 industrial-grade large-format lithium-ion pouch cells, spanning eight material-design combinations and diverse cycling protocols. Trained solely on public datasets of small-capacity cylindrical cells, DL achieves 7.2% test error in predicting the average cycle life under unknown device variability. This results in savings of 98% in time and 95% in energy compared to industrial practices. This work highlights the potential of uncovering insights from historical designs to inform and accelerate the development of next-generation battery technologies. DL represents a key advance toward efficient data-driven modeling and helps realize the promise of machine learning for accelerating scientific discovery and engineering innovation.
[LG-69] Structure-Preserving Digital Twins via Conditional Neural Whitney Forms
Link: https://arxiv.org/abs/2508.06981
Authors: Brooks Kinch,Benjamin Shaffer,Elizabeth Armstrong,Michael Meehan,John Hewson,Nathaniel Trask
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Comments:
Abstract:We present a framework for constructing real-time digital twins based on structure-preserving reduced finite element models conditioned on a latent variable Z. The approach uses conditional attention mechanisms to learn both a reduced finite element basis and a nonlinear conservation law within the framework of finite element exterior calculus (FEEC). This guarantees numerical well-posedness and exact preservation of conserved quantities, regardless of data sparsity or optimization error. The conditioning mechanism supports real-time calibration to parametric variables, allowing the construction of digital twins which support closed loop inference and calibration to sensor data. The framework interfaces with conventional finite element machinery in a non-invasive manner, allowing treatment of complex geometries and integration of learned models with conventional finite element techniques. Benchmarks include advection diffusion, shock hydrodynamics, electrostatics, and a complex battery thermal runaway problem. The method achieves accurate predictions on complex geometries with sparse data (25 LES simulations), including capturing the transition to turbulence and achieving real-time inference ~0.1s with a speedup of 3.1x10^8 relative to LES. An open-source implementation is available on GitHub.
[LG-70] BoRA: Towards More Expressive Low-Rank Adaptation with Block Diversity
Link: https://arxiv.org/abs/2508.06953
Authors: Shiwei Li,Xiandi Luo,Haozhao Wang,Xing Tang,Ziqiang Cui,Dugang Liu,Yuhua Li,Xiuqiang He,Ruixuan Li
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method widely used in large language models (LLMs). It approximates the update of a pretrained weight matrix $W\in\mathbb{R}^{m\times n}$ by the product of two low-rank matrices, $BA$, where $A\in\mathbb{R}^{r\times n}$ and $B\in\mathbb{R}^{m\times r}$ ($r\ll\min\{m,n\}$). Increasing the dimension $r$ can raise the rank of LoRA weights (i.e., $BA$), which typically improves fine-tuning performance but also significantly increases the number of trainable parameters. In this paper, we propose Block Diversified Low-Rank Adaptation (BoRA), which improves the rank of LoRA weights with a small number of additional parameters. Specifically, BoRA treats the product $BA$ as a block matrix multiplication, where $A$ and $B$ are partitioned into $b$ blocks along the columns and rows, respectively (i.e., $A=[A_1,\dots,A_b]$ and $B=[B_1,\dots,B_b]^\top$). Consequently, the product $BA$ becomes the concatenation of the block products $B_iA_j$ for $i,j\in[b]$. To enhance the diversity of different block products, BoRA introduces a unique diagonal matrix $\Sigma_{i,j}\in\mathbb{R}^{r\times r}$ for each block multiplication, resulting in $B_i\Sigma_{i,j}A_j$. By leveraging these block-wise diagonal matrices, BoRA increases the rank of LoRA weights by a factor of $b$ while only requiring $b^2r$ additional parameters. Extensive experiments across multiple datasets and models demonstrate the superiority of BoRA, and ablation studies further validate its scalability.
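The rank claim in the abstract above can be checked on a tiny example. The sketch below builds the block matrix whose (i, j) block is B_i Σ_{i,j} A_j for r = 1 and b = 2, and compares its rank with the plain LoRA product (all Σ_{i,j} set to the identity). The concrete numbers and the helper names `matmul`, `rank`, and `bora_weight` are assumptions chosen for illustration, not taken from the paper's code.

```python
from fractions import Fraction


def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def rank(M):
    """Exact matrix rank via Gaussian elimination over rationals."""
    M = [[Fraction(v) for v in row] for row in M]
    r = 0
    for c in range(len(M[0])):
        pivot = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * p for a, p in zip(M[i], M[r])]
        r += 1
    return r


def bora_weight(B_blocks, A_blocks, sigma):
    """Assemble the matrix whose (i, j) block is B_i @ diag(sigma[i][j]) @ A_j."""
    rows = []
    for i, B_i in enumerate(B_blocks):
        block_row = [[] for _ in B_i]
        for j, A_j in enumerate(A_blocks):
            # Scaling the columns of B_i equals right-multiplying by the diagonal.
            scaled = [[v * s for v, s in zip(row, sigma[i][j])] for row in B_i]
            for out_row, prod_row in zip(block_row, matmul(scaled, A_j)):
                out_row.extend(prod_row)
        rows.extend(block_row)
    return rows


# Toy setting: m = n = 4, r = 1, b = 2 (values are arbitrary).
B_blocks = [[[1], [2]], [[3], [1]]]   # two row blocks of B, each 2 x 1
A_blocks = [[[1, 2]], [[2, 1]]]       # two column blocks of A, each 1 x 2
identity = [[[1], [1]], [[1], [1]]]   # all Sigma_ij = I recovers plain LoRA
diverse = [[[1], [2]], [[3], [1]]]    # one diagonal (here a scalar) per block

lora_rank = rank(bora_weight(B_blocks, A_blocks, identity))
bora_rank = rank(bora_weight(B_blocks, A_blocks, diverse))
extra_params = len(diverse) * len(diverse[0]) * len(diverse[0][0])  # b * b * r
```

In this example the plain product has rank r = 1, while the block-diversified matrix reaches rank b·r = 2 using b²r = 4 extra scalars. In general each block row has rank at most r, so b·r is an upper bound rather than a guarantee for arbitrary values.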
[LG-71] QuiZSF: An efficient data-model interaction framework for zero-shot time-series forecasting
Link: https://arxiv.org/abs/2508.06915
Authors: Shichao Ma,Zhengyang Zhou,Qihe Huang,Binwu Wang,Kuo Yang,Huan Li,Yang Wang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Time series forecasting has become increasingly important to empower diverse applications with streaming data. Zero-shot time-series forecasting (ZSF), particularly valuable in data-scarce scenarios such as domain transfer or forecasting under extreme conditions, is difficult for traditional models to handle. While time series pre-trained models (TSPMs) have demonstrated strong performance in ZSF, they often lack mechanisms to dynamically incorporate external knowledge. Fortunately, emerging retrieval-augmented generation (RAG) offers a promising path for injecting such knowledge on demand, yet it is rarely integrated with TSPMs. To leverage the strengths of both worlds, we introduce RAG into TSPMs to enhance zero-shot time series forecasting. In this paper, we propose QuiZSF (Quick Zero-Shot Time Series Forecaster), a lightweight and modular framework that couples efficient retrieval with representation learning and model adaptation for ZSF. Specifically, we construct a hierarchical tree-structured ChronoRAG Base (CRB) for scalable time-series storage and domain-aware retrieval, introduce a Multi-grained Series Interaction Learner (MSIL) to extract fine- and coarse-grained relational features, and develop a dual-branch Model Cooperation Coherer (MCC) that aligns retrieved knowledge with two kinds of TSPMs: Non-LLM based and LLM based. Compared with contemporary baselines, QuiZSF, with Non-LLM based and LLM based TSPMs as the base model, ranks Top-1 in 75% and 87.5% of prediction settings, respectively, while maintaining high efficiency in memory and inference time.
[LG-72] Conformal Prediction and Trustworthy AI
Link: https://arxiv.org/abs/2508.06885
Authors: Anthony Bellotti,Xindi Zhao
Subjects: Machine Learning (cs.LG)
Comments: Preprint for an essay to be published in The Importance of Being Learnable (Enhancing the Learnability and Reliability of Machine Learning Algorithms) Essays Dedicated to Alexander Gammerman on His 80th Birthday, LNCS Springer Nature Switzerland AG ed. Nguyen K.A. and Luo Z
Abstract:Conformal predictors are machine learning algorithms developed in the 1990s by Gammerman, Vovk, and their research team to provide set predictions with a guaranteed confidence level. Over recent years, they have grown in popularity and have become a mainstream methodology for uncertainty quantification in the machine learning community. From the beginning, it was understood that they enable reliable machine learning with well-calibrated uncertainty quantification. This makes them extremely beneficial for developing trustworthy AI, a topic that has also risen in interest over the past few years, in both the AI community and society more widely. In this article, we review the potential for conformal prediction to contribute to trustworthy AI beyond its marginal validity property, addressing problems such as generalization risk and AI governance. Experiments and examples are also provided to demonstrate its use as a well-calibrated predictor and for bias identification and mitigation.
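The "set predictions with a guaranteed confidence level" mentioned in the abstract above can be illustrated with split (inductive) conformal regression, one widely used variant. The sketch below is a generic textbook construction rather than anything specific to the essay: `model` is any callable point predictor, and the function names and toy calibration data are assumptions.

```python
import math


def conformal_quantile(scores, alpha):
    """The ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Returns infinity when the calibration set is too small for the
    requested level, which yields an unbounded (but still valid) interval.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def predict_interval(model, calib_x, calib_y, x_new, alpha=0.1):
    """Split conformal interval using absolute residuals as nonconformity scores."""
    scores = [abs(y - model(x)) for x, y in zip(calib_x, calib_y)]
    q = conformal_quantile(scores, alpha)
    centre = model(x_new)
    return centre - q, centre + q


# Hypothetical fitted model and held-out calibration data.
model = lambda x: 2 * x
calib_x = [1, 2, 3, 4]
calib_y = [5, 3, 10, 6]  # residuals: 3, 1, 4, 2
low, high = predict_interval(model, calib_x, calib_y, x_new=10, alpha=0.5)
```

If the calibration points and the new point are exchangeable, the interval covers the true value with probability at least 1 - alpha on average over draws. This is the marginal validity property the abstract refers to, and it holds regardless of how good the underlying model is.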
[LG-73] Energy Efficient Task Offloading in UAV-Enabled MEC Using a Fully Decentralized Deep Reinforcement Learning Approach
Link: https://arxiv.org/abs/2508.06863
Authors: Hamidreza Asadian-Rad,Hossein Soleimani,Shahrokh Farahmand
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments:
Abstract:Unmanned aerial vehicles (UAVs) have been recently utilized in multi-access edge computing (MEC) as edge servers. It is desirable to design UAVs’ trajectories and user to UAV assignments to ensure satisfactory service to the users and energy efficient operation simultaneously. The posed optimization problem is challenging to solve because: (i) The formulated problem is non-convex, (ii) Due to the mobility of ground users, their future positions and channel gains are not known in advance, (iii) Local UAVs’ observations should be communicated to a central entity that solves the optimization problem. The (semi-) centralized processing leads to communication overhead, communication/processing bottlenecks, lack of flexibility and scalability, and loss of robustness to system failures. To simultaneously address all these limitations, we advocate a fully decentralized setup with no centralized entity. Each UAV obtains its local observation and then communicates with its immediate neighbors only. After sharing information with neighbors, each UAV determines its next position via a locally run deep reinforcement learning (DRL) algorithm. None of the UAVs need to know the global communication graph. Two main components of our proposed solution are (i) Graph attention layers (GAT), and (ii) Experience and parameter sharing proximal policy optimization (EPS-PPO). Our proposed approach eliminates all the limitations of semi-centralized MADRL methods such as MAPPO and MA deep deterministic policy gradient (MADDPG), while guaranteeing a better performance than independent local DRLs such as in IPPO. Numerical results reveal notable performance gains in several different criteria compared to the existing MADDPG algorithm, demonstrating the potential for offering a better performance, while utilizing local communications only.
[LG-74] Technical Report: Full-Stack Fine-Tuning for the Q Programming Language
Link: https://arxiv.org/abs/2508.06813
Authors: Brendan R. Hogan,Will Brown,Adel Boyarsky,Anderson Schneider,Yuriy Nevmyvaka
Subjects: Machine Learning (cs.LG)
Comments: 40 pages
Abstract:Even though large language models are becoming increasingly capable, it is still unreasonable to expect them to excel at tasks that are under-represented on the Internet. Leveraging LLMs for specialized applications, particularly in niche programming languages and private domains, remains challenging and largely unsolved. In this work, we address this gap by presenting a comprehensive, open-source approach for adapting LLMs to the Q programming language, a popular tool in quantitative finance that is much less present on the Internet compared to Python, C, Java, and other "mainstream" languages and is therefore not a strong suit of general-purpose AI models. We introduce a new LeetCode-style evaluation dataset for Q, benchmark major frontier models on the dataset, then do pretraining, supervised fine-tuning, and reinforcement learning to train a suite of reasoning and non-reasoning models based on the Qwen-2.5 series, spanning five parameter sizes (1.5B, 3B, 7B, 14B, 32B). Our best model achieves a pass@1 accuracy of 59 percent on our Q benchmark, surpassing the best-performing frontier model, Claude Opus-4, by 29.5 percent. Additionally, all models, even our 1.5B model, outperform GPT-4.1 on this task. In addition to releasing models, code, and data, we provide a detailed blueprint for dataset construction, model pretraining, supervised fine-tuning, and reinforcement learning. Our methodology is broadly applicable, and we discuss how these techniques can be extended to other tasks, including those where evaluation may rely on soft or subjective signals.
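The abstract reports pass@1 but does not say how it is estimated. A common choice in code-generation benchmarks is the unbiased pass@k estimator computed from n samples per problem, of which c pass the tests. The sketch below assumes that estimator; the function names and the sample counts are hypothetical and not taken from the report.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for one problem, c: samples that pass, k: budget.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes: 10 samples each, with 3, 0 and 10 passes.
results = [(10, 3), (10, 0), (10, 10)]
print(benchmark_pass_at_k(results, k=1))
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n, averaged over problems.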
[LG-75] Fed MobiLLM: Efficient Federated LLM Fine-Tuning over Heterogeneous Mobile Devices via Server Assisted Side-Tuning
Link: https://arxiv.org/abs/2508.06765
Authors: Xingke Yang,Liang Li,Sicong Li,Liwei Guan,Hao Wang,Xiaoqi Qi,Jiang Liu,Xin Fu,Miao Pan
Subjects: Machine Learning (cs.LG)
Abstract:Collaboratively fine-tuning (FT) large language models (LLMs) over heterogeneous mobile devices fosters immense potential applications of personalized intelligence. However, such a vision faces critical system challenges. Conventional federated LLM FT approaches place prohibitive computational and memory burdens on mobile hardware, and their synchronous model aggregation protocols stall for slower devices. In this paper, we propose Fed MobiLLM, a novel design to facilitate efficient federated LLM FT across mobile devices with diverse computing/communication speeds and local model architectures. In particular, Fed MobiLLM implements a pioneering server-assisted federated side-tuning paradigm. Briefly, mobile devices perform lightweight forward propagation computations on local data using their frozen pre-scaled backbone LLMs, and then upload selected intermediate activations. The server trains a shared side-network independently, eliminating client-side backpropagation and enabling asynchronous updates. To bridge model heterogeneity across different devices, we introduce an adaptive layer-wise feature alignment method, which ensures consistent representations for collaboratively tuning a shared side network. Extensive experimental results demonstrate that Fed MobiLLM can maintain robust fine-tuning performance while achieving extremely low on-device memory, with at least 95.2% reduction in computation overhead, 93.2% reduction in communication costs and 5.1x faster convergence compared to existing methods, validating its efficacy for practical LLM adaptation over heterogeneous mobile devices.
[LG-76] Mitigating Distribution Shift in Graph-Based Android Malware Classification via Function Metadata and LLM Embeddings
Link: https://arxiv.org/abs/2508.06734
Authors: Ngoc N. Tran,Anwar Said,Waseem Abbas,Tyler Derr,Xenofon D. Koutsoukos
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 13 pages, 3 figures, 7 tables, under review
Abstract:Graph-based malware classifiers can achieve over 94% accuracy on standard Android datasets, yet we find they suffer accuracy drops of up to 45% when evaluated on previously unseen malware variants from the same family - a scenario where strong generalization would typically be expected. This highlights a key limitation in existing approaches: both the model architectures and their structure-only representations often fail to capture deeper semantic patterns. In this work, we propose a robust semantic enrichment framework that enhances function call graphs with contextual features, including function-level metadata and, when available, code embeddings derived from large language models. The framework is designed to operate under real-world constraints where feature availability is inconsistent, and supports flexible integration of semantic signals. To evaluate generalization under realistic domain and temporal shifts, we introduce two new benchmarks: MalNet-Tiny-Common and MalNet-Tiny-Distinct, constructed using malware family partitioning to simulate cross-family generalization and evolving threat behavior. Experiments across multiple graph neural network backbones show that our method improves classification performance by up to 8% under distribution shift and consistently enhances robustness when integrated with adaptation-based methods. These results offer a practical path toward building resilient malware detection systems in evolving threat environments.
[LG-77] ClimateSOM: A Visual Analysis Workflow for Climate Ensemble Datasets
Link: https://arxiv.org/abs/2508.06732
Authors: Yuya Kawakami,Daniel Cayan,Dongyu Liu,Kwan-Liu Ma
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract:Ensemble datasets are ever more prevalent in various scientific domains. In climate science, ensemble datasets are used to capture variability in projections under plausible future conditions including greenhouse and aerosol emissions. Each ensemble model run produces projections that are fundamentally similar yet meaningfully distinct. Understanding this variability among ensemble model runs and analyzing its magnitude and patterns is a vital task for climate scientists. In this paper, we present ClimateSOM, a visual analysis workflow that leverages a self-organizing map (SOM) and Large Language Models (LLMs) to support interactive exploration and interpretation of climate ensemble datasets. The workflow abstracts climate ensemble model runs - spatiotemporal time series - into a distribution over a 2D space that captures the variability among the ensemble model runs using a SOM. LLMs are integrated to assist in sensemaking of this SOM-defined 2D space, the basis for the visual analysis tasks. In all, ClimateSOM enables users to explore the variability among ensemble model runs, identify patterns, compare and cluster the ensemble model runs. To demonstrate the utility of ClimateSOM, we apply the workflow to an ensemble dataset of precipitation projections over California and the Northwestern United States. Furthermore, we conduct a short evaluation of our LLM integration, and conduct an expert review of the visual workflow and the insights from the case studies with six domain experts to evaluate our approach and its utility.
[LG-78] CISO: Species Distribution Modeling Conditioned on Incomplete Species Observations
Link: https://arxiv.org/abs/2508.06704
Authors: Hager Radi Abdelwahed,Mélisande Teng,Robin Zbinden,Laura Pollock,Hugo Larochelle,Devis Tuia,David Rolnick
Subjects: Machine Learning (cs.LG)
Abstract:Species distribution models (SDMs) are widely used to predict species’ geographic distributions, serving as critical tools for ecological research and conservation planning. Typically, SDMs relate species occurrences to environmental variables representing abiotic factors, such as temperature, precipitation, and soil properties. However, species distributions are also strongly influenced by biotic interactions with other species, which are often overlooked. While some methods partially address this limitation by incorporating biotic interactions, they often assume symmetrical pairwise relationships between species and require consistent co-occurrence data. In practice, species observations are sparse, and the availability of information about the presence or absence of other species varies significantly across locations. To address these challenges, we propose CISO, a deep learning-based method for species distribution modeling Conditioned on Incomplete Species Observations. CISO enables predictions to be conditioned on a flexible number of species observations alongside environmental variables, accommodating the variability and incompleteness of available biotic data. We demonstrate our approach using three datasets representing different species groups: sPlotOpen for plants, SatBird for birds, and a new dataset, SatButterfly, for butterflies. Our results show that including partial biotic information improves predictive performance on spatially separate test sets. When conditioned on a subset of species within the same dataset, CISO outperforms alternative methods in predicting the distribution of the remaining species. Furthermore, we show that combining observations from multiple datasets can improve performance. CISO is a promising ecological tool, capable of incorporating incomplete biotic information and identifying potential interactions between species from disparate taxa.
[LG-79] A Tight Lower Bound for the Approximation Guarantee of Higher-Order Singular Value Decomposition
Link: https://arxiv.org/abs/2508.06693
Authors: Matthew Fahrbach,Mehrdad Ghadiri
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 15 pages
Abstract:We prove that the classic approximation guarantee for the higher-order singular value decomposition (HOSVD) is tight by constructing a tensor for which HOSVD achieves an approximation ratio of $N/(1+\varepsilon)$, for any $\varepsilon > 0$. This matches the upper bound of De Lathauwer et al. (2000a) and shows that the approximation ratio of HOSVD cannot be improved. Using a more advanced construction, we also prove that the approximation guarantees for the ST-HOSVD algorithm of Vannieuwenhoven et al. (2012) and higher-order orthogonal iteration (HOOI) of De Lathauwer et al. (2000b) are tight by showing that they can achieve their worst-case approximation ratio of $N/(1+\varepsilon)$, for any $\varepsilon > 0$.
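For context, the classic guarantee referred to above can be written as follows. The notation is standard but does not appear in the abstract, so every symbol is an assumption: $\mathcal{T}$ is an order-$N$ tensor, $\hat{\mathcal{T}}$ its truncated HOSVD with multilinear rank $(r_1,\dots,r_N)$, $\mathcal{T}^\star$ a best approximation of that multilinear rank, and $\sigma_i^{(n)}$ the singular values of the mode-$n$ unfolding. The bound is stated for squared Frobenius error, which is assumed here to be the sense in which the ratio $N$ is measured.

```latex
\[
\|\mathcal{T}-\hat{\mathcal{T}}\|_F^2
\;\le\; \sum_{n=1}^{N}\sum_{i>r_n}\bigl(\sigma_i^{(n)}\bigr)^2
\;\le\; N\,\|\mathcal{T}-\mathcal{T}^\star\|_F^2 .
\]
```

The second inequality holds because each mode-$n$ unfolding of $\mathcal{T}^\star$ has rank at most $r_n$, so the discarded singular-value energy in any single mode cannot exceed the optimal squared error. The paper's result says the factor $N$ cannot be lowered in general.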
[LG-80] Stabilizing Federated Learning under Extreme Heterogeneity with HeteRo-Select
Link: https://arxiv.org/abs/2508.06692
Authors: Md. Akmol Masud,Md Abrar Jahin,Mahmud Hasan
Subjects: Machine Learning (cs.LG)
Abstract:Federated Learning (FL) is a machine learning technique that often suffers from training instability due to the diverse nature of client data. Although utility-based client selection methods like Oort aim to speed up convergence by prioritizing high-loss clients, they frequently experience significant drops in accuracy during later stages of training. We propose a theoretical HeteRo-Select framework designed to maintain high performance and ensure long-term training stability. We provide a theoretical analysis showing that when client data is very different (high heterogeneity), selecting a well-chosen subset of clients can reduce communication more effectively than full participation. Our HeteRo-Select method uses a clear, step-by-step scoring system that considers client usefulness, fairness, update speed, and data variety. We also establish convergence guarantees under strong regularization. Our experimental results on the CIFAR-10 dataset under significant label skew ($\alpha=0.1$) support the theoretical findings. The HeteRo-Select method performs better than existing approaches in terms of peak accuracy, final accuracy, and training stability. Specifically, HeteRo-Select achieves a peak accuracy of 74.75%, a final accuracy of 72.76%, and a minimal stability drop of 1.99%. In contrast, Oort records a lower peak accuracy of 73.98%, a final accuracy of 71.25%, and a larger stability drop of 2.73%. The theoretical foundations and empirical performance in our study make HeteRo-Select a reliable solution for real-world heterogeneous FL problems.
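The abstract does not define "stability drop", but the reported figures are consistent with reading it as peak accuracy minus final accuracy. Under that assumed definition the arithmetic works out as:

```latex
\begin{align*}
\text{HeteRo-Select:}\quad & 74.75\% - 72.76\% = 1.99\% \\
\text{Oort:}\quad & 73.98\% - 71.25\% = 2.73\%
\end{align*}
```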
[LG-81] Watermarking Kolmogorov-Arnold Networks for Emerging Networked Applications via Activation Perturbation
Link: https://arxiv.org/abs/2508.06676
Authors: Chia-Hsun Lu,Guan-Jhih Wu,Ya-Chi Ho,Chih-Ya Shen
Subjects: Machine Learning (cs.LG)
Comments: 6 pages, 3 figures, 6 tables
Abstract:With the increasing importance of protecting intellectual property in machine learning, watermarking techniques have gained significant attention. As advanced models are increasingly deployed in domains such as social network analysis, the need for robust model protection becomes even more critical. While existing watermarking methods have demonstrated effectiveness for conventional deep neural networks, they often fail to adapt to the novel architecture, Kolmogorov-Arnold Networks (KAN), which feature learnable activation functions. KAN holds strong potential for modeling complex relationships in network-structured data. However, their unique design also introduces new challenges for watermarking. Therefore, we propose a novel watermarking method, Discrete Cosine Transform-based Activation Watermarking (DCT-AW), tailored for KAN. Leveraging the learnable activation functions of KAN, our method embeds watermarks by perturbing activation outputs using discrete cosine transform, ensuring compatibility with diverse tasks and achieving task independence. Experimental results demonstrate that DCT-AW has a small impact on model performance and provides superior robustness against various watermark removal attacks, including fine-tuning, pruning, and retraining after pruning.
[LG-82] Transferring Social Network Knowledge from Multiple GNN Teachers to Kolmogorov-Arnold Networks
Link: https://arxiv.org/abs/2508.06663
Authors: Yuan-Hung Chao,Chia-Hsun Lu,Chih-Ya Shen
Subjects: Machine Learning (cs.LG)
Comments: 6 pages, 3 tables
Abstract:Graph Neural Networks (GNNs) have shown strong performance on graph-structured data, but their reliance on graph connectivity often limits scalability and efficiency. Kolmogorov-Arnold Networks (KANs), a recent architecture with learnable univariate functions, offer strong nonlinear expressiveness and efficient inference. In this work, we integrate KANs into three popular GNN architectures (GAT, SGC, and APPNP), resulting in three new models: KGAT, KSGC, and KAPPNP. We further adopt a multi-teacher knowledge amalgamation framework, where knowledge from multiple KAN-based GNNs is distilled into a graph-independent KAN student model. Experiments on benchmark datasets show that the proposed models improve node classification accuracy, and the knowledge amalgamation approach significantly boosts student model performance. Our findings highlight the potential of KANs for enhancing GNN expressiveness and for enabling efficient, graph-free inference.
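The abstract does not spell out the amalgamation objective. As a rough illustration of multi-teacher distillation in general, the sketch below averages the teachers' temperature-softened class distributions and scores the student by its KL divergence from that average. The uniform teacher weights, the temperature, and all function names are assumptions, not the paper's method.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def amalgamate(teacher_logits, weights=None, temperature=2.0):
    """Weighted average of the teachers' softened class distributions."""
    count = len(teacher_logits)
    weights = weights or [1.0 / count] * count  # uniform weights (assumption)
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    classes = range(len(probs[0]))
    return [sum(w * p[c] for w, p in zip(weights, probs)) for c in classes]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted))


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Per-node loss: student should match the amalgamated soft labels."""
    target = amalgamate(teacher_logits, temperature=temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(target, student)


# One node, three classes, three hypothetical teachers (e.g. KGAT, KSGC, KAPPNP).
teachers = [[2.0, 0.5, -1.0], [1.5, 0.8, -0.5], [2.2, 0.1, -1.2]]
print(distillation_loss([0.0, 0.0, 0.0], teachers))
```

Because the student only consumes node features and soft labels, it needs no graph at inference time, which is the "graph-free inference" the abstract mentions.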
[LG-83] Privacy-Preserving Tabular Synthetic Data Generation Using TabularARGN
Link: https://arxiv.org/abs/2508.06647
Authors: Andrey Sidorenko,Paul Tiwald
Subjects: Machine Learning (cs.LG)
Abstract:Synthetic data generation has become essential for securely sharing and analyzing sensitive data sets. Traditional anonymization techniques, however, often fail to adequately preserve privacy. We introduce the Tabular Auto-Regressive Generative Network (TabularARGN), a neural network architecture specifically designed for generating high-quality synthetic tabular data. Using a discretization-based auto-regressive approach, TabularARGN achieves high data fidelity while remaining computationally efficient. We evaluate TabularARGN against existing synthetic data generation methods, showing competitive results in statistical similarity, machine learning utility, and detection robustness. We further perform an in-depth privacy evaluation using systematic membership-inference attacks, highlighting the robustness and effective privacy-utility balance of our approach.
[LG-84] Learning to Forget with Information Divergence Reweighted Objectives for Noisy Labels
Link: https://arxiv.org/abs/2508.06622
Authors: Jeremiah Birrell,Reza Ebrahimi
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 25 pages, 2 figures
Abstract:We introduce ANTIDOTE, a new class of objectives for learning under noisy labels which are defined in terms of a relaxation over an information-divergence neighborhood. Using convex duality, we provide a reformulation as an adversarial training method that has similar computational cost to training with standard cross-entropy loss. We show that our approach adaptively reduces the influence of the samples with noisy labels during learning, exhibiting a behavior that is analogous to forgetting those samples. ANTIDOTE is effective in practical environments where label noise is inherent in the training data or where an adversary can alter the training labels. Extensive empirical evaluations on different levels of symmetric, asymmetric, human annotation, and real-world label noise show that ANTIDOTE outperforms leading comparable losses in the field and enjoys a time complexity that is very close to that of the standard cross entropy loss.
[LG-85] Local Diffusion Models and Phases of Data Distributions
Link: https://arxiv.org/abs/2508.06614
Authors: Fangjun Hu,Guangkuo Liu,Yifan Zhang,Xun Gao
Subjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Quantum Physics (quant-ph)
Comments: 8+22 pages, 4+3 figures
Abstract:As a class of generative artificial intelligence frameworks inspired by statistical physics, diffusion models have shown extraordinary performance in synthesizing complicated data distributions through a denoising process gradually guided by score functions. Real-life data, like images, is often spatially structured in low-dimensional spaces. However, ordinary diffusion models ignore this local structure and learn spatially global score functions, which are often computationally expensive. In this work, we introduce a new perspective on the phases of data distributions, which provides insight into constructing local denoisers with reduced computational costs. We define two distributions as belonging to the same data distribution phase if they can be mutually connected via spatially local operations such as local denoisers. Then, we show that the reverse denoising process consists of an early trivial phase and a late data phase, sandwiching a rapid phase transition where local denoisers must fail. To diagnose such phase transitions, we prove an information-theoretic bound on the fidelity of local denoisers based on conditional mutual information, and conduct numerical experiments on a real-world dataset. This work suggests simpler and more efficient architectures of diffusion models: far from the phase transition point, we can use small local neural networks to compute the score function; global neural networks are only necessary around the narrow time interval of phase transitions. This result also opens up new directions for studying phases of data distributions, the broader science of generative artificial intelligence, and guiding the design of neural networks inspired by physics concepts.
[LG-86] Hypergraph Neural Network with State Space Models for Node Classification
Link: https://arxiv.org/abs/2508.06587
Authors: A. Quadir,M. Tanveer
Subjects: Machine Learning (cs.LG)
Abstract:In recent years, graph neural networks (GNNs) have gained significant attention for node classification tasks on graph-structured data. However, traditional GNNs primarily focus on adjacency relationships between nodes, often overlooking the rich role-based characteristics that are crucial for learning more expressive node representations. Existing methods for capturing role-based features are largely unsupervised and fail to achieve optimal performance in downstream tasks. To address these limitations, we propose a novel hypergraph neural network with state space model (HGMN) that effectively integrates role-aware representations into GNNs and the state space model. HGMN utilizes hypergraph construction techniques to model higher-order relationships and combines role-based and adjacency-based representations through a learnable mamba transformer mechanism. By leveraging two distinct hypergraph construction methods, based on node degree and neighborhood levels, it strengthens the connections among nodes with similar roles, enhancing the model's representational power. Additionally, the inclusion of hypergraph convolution layers enables the model to capture complex dependencies within hypergraph structures. To mitigate the over-smoothing problem inherent in deep GNNs, we incorporate a residual network, ensuring improved stability and better feature propagation across layers. Extensive experiments conducted on one newly introduced dataset and four benchmark datasets demonstrate the superiority of HGMN. The model achieves significant performance improvements on node classification tasks compared to state-of-the-art GNN methods. These results highlight HGMN's ability to provide enriched node representations by effectively embedding role-based features alongside adjacency information, making it a versatile and powerful tool for a variety of graph-based learning applications.
[LG-87] GFlowNets for Learning Better Drug-Drug Interaction Representations ICANN2025
Link: https://arxiv.org/abs/2508.06576
Authors: Azmine Toushik Wasi
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Molecular Networks (q-bio.MN)
Comments: Accepted to ICANN 2025:AIDD
Abstract:Drug-drug interactions (DDIs) pose a significant challenge in clinical pharmacology, with severe class imbalance among interaction types limiting the effectiveness of predictive models. Common interactions dominate datasets, while rare but critical interactions remain underrepresented, leading to poor model performance on infrequent cases. Existing methods often treat DDI prediction as a binary problem, ignoring class-specific nuances and exacerbating bias toward frequent interactions. To address this, we propose a framework combining Generative Flow Networks (GFlowNet) with Variational Graph Autoencoders (VGAE) to generate synthetic samples for rare classes, improving model balance and generating effective and novel DDI pairs. Our approach enhances predictive performance across interaction types, ensuring better clinical reliability.
[LG-88] Semi-Supervised Supply Chain Fraud Detection with Unsupervised Pre-Filtering
Link: https://arxiv.org/abs/2508.06574
Authors: Fatemeh Moradi,Mehran Tarif,Mohammadhossein Homaei
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: Six Pages, two Figures and six Tables
Abstract:Detecting fraud in modern supply chains is a growing challenge, driven by the complexity of global networks and the scarcity of labeled data. Traditional detection methods often struggle with class imbalance and limited supervision, reducing their effectiveness in real-world applications. This paper proposes a novel two-phase learning framework to address these challenges. In the first phase, the Isolation Forest algorithm performs unsupervised anomaly detection to identify potential fraud cases and reduce the volume of data requiring further analysis. In the second phase, a self-training Support Vector Machine (SVM) refines the predictions using both labeled and high-confidence pseudo-labeled samples, enabling robust semi-supervised learning. The proposed method is evaluated on the DataCo Smart Supply Chain Dataset, a comprehensive real-world supply chain dataset with fraud indicators. It achieves an F1-score of 0.817 while maintaining a false positive rate below 3.0%. These results demonstrate the effectiveness and efficiency of combining unsupervised pre-filtering with semi-supervised refinement for supply chain fraud detection under real-world constraints, though we acknowledge limitations regarding concept drift and the need for comparison with deep learning approaches.
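The two headline numbers, F1-score and false positive rate, both follow from the confusion matrix of the fraud classifier. The sketch below shows the standard definitions on made-up labels; the helper name and the toy predictions are hypothetical, and nothing here reproduces the paper's pipeline or its DataCo results.

```python
def f1_and_fpr(y_true, y_pred):
    """Return (F1, false positive rate) for binary labels, with 1 = fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0  # share of legitimate orders flagged
    return f1, fpr


# Toy example: 2 fraudulent and 3 legitimate transactions.
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0]
f1, fpr = f1_and_fpr(y_true, y_pred)
print(f"F1 = {f1:.3f}, FPR = {fpr:.3f}")
```

Reporting the two together matters under class imbalance: F1 ignores true negatives, while the false positive rate shows how many legitimate transactions would be sent for manual review.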
[LG-89] Communication-Learning Co-Design for Differentially Private Over-the-Air Federated Distillation
Link: https://arxiv.org/abs/2508.06557
Authors: Zihao Hu(1),Jia Yan(2),Ying-Jun Angela Zhang(1) ((1) The Chinese University of Hong Kong, (2) The Hong Kong University of Science and Technology (Guangzhou))
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 9 pages, 2 figures, submitted to IEEE Wireless Communication Letters
Abstract:The ever-growing learning model size nowadays challenges the communication efficiency and privacy preservation of the traditional federated learning (FL). In this paper, we propose a novel differentially private (DP) over-the-air federated distillation (FD) framework, where wireless devices (WDs) periodically share noise-perturbed model outputs with the parameter server by harnessing the superposition property of multi-access channels. Accordingly, over-the-air FD enables the shared responsibility of the DP preservation on the low-dimensional disclosed signals among WDs. We study the communication-learning co-design problem in differentially private over-the-air FD, aiming to maximize the learning convergence rate while meeting the transmit power and DP requirements of WDs. The main challenge is rooted in the intractable learning and privacy analysis in over-the-air FD, together with the strong coupling among the decision variables spanning two timescales. To tackle this problem, we first derive the analytical learning convergence rate and privacy losses of WDs, based on which the optimal transceiver design per FD round and long-term training rounds decision are obtained in the closed forms. Numerical results demonstrate that the proposed differentially private over-the-air FD approach achieves a better learning-privacy trade-off with largely-reduced communication overhead than the conventional FL benchmarks.
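One way to read the "shared responsibility" for privacy: when devices transmit simultaneously, the channel adds their signals, so independent noise from every device accumulates in the single aggregate the server sees. The sketch below illustrates only that idea with an idealized channel (no fading, no receiver noise, no power control) and Gaussian perturbation; those simplifications and all names are assumptions, and it is not the paper's transceiver design or privacy analysis.

```python
import math
import random


def per_device_noise_std(target_aggregate_std, num_devices):
    """Noise std each device adds so the summed noise has the target std.

    Variances of independent noises add, so K devices each using
    sigma / sqrt(K) produce aggregate noise with standard deviation sigma.
    """
    return target_aggregate_std / math.sqrt(num_devices)


def over_the_air_sum(device_outputs, noise_std, rng):
    """Idealized superposition: the server receives the sum of noisy outputs."""
    dim = len(device_outputs[0])
    received = [0.0] * dim
    for output in device_outputs:
        for j in range(dim):
            received[j] += output[j] + rng.gauss(0.0, noise_std)
    return received


rng = random.Random(0)
# Hypothetical low-dimensional model outputs (e.g. class logits) from 4 devices.
outputs = [[0.2, 0.8], [0.1, 0.9], [0.3, 0.7], [0.4, 0.6]]
sigma = per_device_noise_std(target_aggregate_std=2.0, num_devices=len(outputs))
print(sigma, over_the_air_sum(outputs, sigma, rng))
```

The server never observes an individual device's output, only the noisy sum, which is what allows each device to add less noise than it would need when reporting alone.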
[LG-90] Generative Bid Shading in Real-Time Bidding Advertising
Link: https://arxiv.org/abs/2508.06550
Authors: Yinqiu Huang,Hao Ma,Wenshuai Chen,Shuli Wang,Yongqiang Zhang,Xue Wei,Yinhua Zhu,Haitao Wang,Xingxing Wang
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Abstract:Bid shading plays a crucial role in Real-Time Bidding (RTB) by adaptively adjusting the bid to avoid advertisers overspending. Existing mainstream two-stage methods, which first model bid landscapes and then optimize surplus using operations research techniques, are constrained by unimodal assumptions that fail to adapt for non-convex surplus curves and are vulnerable to cascading errors in sequential workflows. Additionally, existing discretization models of continuous values ignore the dependence between discrete intervals, reducing the model's error correction ability, while sample selection bias in bidding scenarios presents further challenges for prediction. To address these issues, this paper introduces Generative Bid Shading (GBS), which comprises two primary components: (1) an end-to-end generative model that utilizes an autoregressive approach to generate shading ratios by stepwise residuals, capturing complex value dependencies without relying on predefined priors; and (2) a reward preference alignment system, which incorporates a channel-aware hierarchical dynamic network (CHNet) as the reward model to extract fine-grained features, along with modules for surplus optimization and exploration utility reward alignment, ultimately optimizing both short-term and long-term surplus using group relative policy optimization (GRPO). Extensive experiments on both offline and online A/B tests validate GBS's effectiveness. Moreover, GBS has been deployed on the Meituan DSP platform, serving billions of bid requests daily.
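For readers new to bid shading, the quantity being optimized is the expected surplus of a first-price auction: the gap between the impression's value and the bid, weighted by the chance of winning at that bid. The sketch below is the textbook grid-search baseline, not GBS itself; the logistic `win_probability` curve and its parameters are purely hypothetical stand-ins for a learned bid landscape.

```python
import math


def win_probability(bid, market_median=1.0, sharpness=4.0):
    """Hypothetical win-rate curve: logistic in the bid (an assumption)."""
    return 1.0 / (1.0 + math.exp(-sharpness * (bid - market_median)))


def expected_surplus(value, ratio, win_prob=win_probability):
    """Expected surplus when bidding ratio * value in a first-price auction."""
    bid = ratio * value
    return (value - bid) * win_prob(bid)


def best_shading_ratio(value, grid_size=101, win_prob=win_probability):
    """Grid search over shading ratios in [0, 1] for the largest surplus."""
    ratios = [i / (grid_size - 1) for i in range(grid_size)]
    return max(ratios, key=lambda r: expected_surplus(value, r, win_prob))


value = 2.0  # estimated value of the impression to the advertiser
ratio = best_shading_ratio(value)
print(ratio, expected_surplus(value, ratio))
```

Bidding the full value (ratio 1) always yields zero surplus, while bidding too low rarely wins; the abstract's point is that real surplus curves can have several local optima, which a single unimodal landscape model handles poorly.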
[LG-91] Self-Organizing Survival Manifolds: A Theory for Unsupervised Discovery of Prognostic Structures in Biological Systems
Link: https://arxiv.org/abs/2508.06539
Authors: Atahan Karagoz
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract:Survival is traditionally modeled as a supervised learning task, reliant on curated outcome labels and fixed covariates. This work rejects that premise. It proposes that survival is not an externally annotated target but a geometric consequence: an emergent property of the curvature and flow inherent in biological state space. We develop a theory of Self-Organizing Survival Manifolds (SOSM), in which survival-relevant dynamics arise from low-curvature geodesic flows on latent manifolds shaped by internal biological constraints. A survival energy functional based on geodesic curvature minimization is introduced and shown to induce structures where prognosis aligns with geometric flow stability. We derive discrete and continuous formulations of the objective and prove theoretical results demonstrating the emergence and convergence of survival-aligned trajectories under biologically plausible conditions. The framework draws connections to thermodynamic efficiency, entropy flow, Ricci curvature, and optimal transport, grounding survival modeling in physical law. Health, disease, aging, and death are reframed as geometric phase transitions in the manifold's structure. This theory offers a universal, label-free foundation for modeling survival as a property of form, not annotation, bridging machine learning, biophysics, and the geometry of life itself.
[LG-92] Adaptive Learning for IRS-Assisted Wireless Networks: Securing Opportunistic Communications Against Byzantine Eavesdroppers
Link: https://arxiv.org/abs/2508.08206
Authors: Amirhossein Taherpour,Abbas Taherpour,Tamer Khattab
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract:We propose a joint learning framework for Byzantine-resilient spectrum sensing and secure intelligent reflecting surface (IRS)–assisted opportunistic access under channel state information (CSI) uncertainty. The sensing stage performs logit-domain Bayesian updates with trimmed aggregation and attention-weighted consensus, and the base station (BS) fuses network beliefs with a conservative minimum rule, preserving detection accuracy under a bounded number of Byzantine users. Conditioned on the sensing outcome, we pose downlink design as sum mean-squared error (MSE) minimization under transmit-power and signal-leakage constraints and jointly optimize the BS precoder, IRS phase shifts, and user equalizers. With partial (or known) CSI, we develop an augmented-Lagrangian alternating algorithm with projected updates and provide provable sublinear convergence, with accelerated rates under mild local curvature. With unknown CSI, we perform constrained Bayesian optimization (BO) in a geometry-aware low-dimensional latent space using Gaussian process (GP) surrogates; we prove regret bounds for a constrained upper confidence bound (UCB) variant of the BO module, and demonstrate strong empirical performance of the implemented procedure. Simulations across diverse network conditions show higher detection probability at fixed false-alarm rate under adversarial attacks, large reductions in sum MSE for honest users, strong suppression of eavesdropper signal power, and fast convergence. The framework offers a practical path to secure opportunistic communication that adapts to CSI availability while coherently coordinating sensing and transmission through joint learning.
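To give a feel for why trimmed aggregation in the logit domain resists Byzantine reports, the sketch below fuses per-user occupancy beliefs by converting them to log-odds, discarding the most extreme values on each side, and averaging the rest. It covers only that one ingredient: the Bayesian updates, attention-weighted consensus, and the base station's minimum rule are not reproduced, and the function names and sample beliefs are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def trimmed_mean(values, trim):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    if 2 * trim >= len(values):
        raise ValueError("trim removes every value")
    kept = sorted(values)[trim:len(values) - trim]
    return sum(kept) / len(kept)


def fuse_beliefs(probabilities, trim):
    """Fuse per-user beliefs that the channel is occupied, in the logit domain."""
    return sigmoid(trimmed_mean([logit(p) for p in probabilities], trim))


# Four honest users believe the channel is busy; two Byzantine users report
# near-zero probability to pull the fused belief down.
reports = [0.90, 0.88, 0.92, 0.91, 0.01, 0.01]
print(fuse_beliefs(reports, trim=2))
```

With at most `trim` adversaries, every surviving value lies within the range of honest reports, which is the usual argument for accuracy "under a bounded number of Byzantine users".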
[LG-93] An effective potential for generative modelling with active matter
链接: https://arxiv.org/abs/2508.08146
作者: Adrian Baule
类目: Statistical Mechanics (cond-mat.stat-mech); Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG)
*备注:
Abstract:Score-based diffusion models generate samples from a complex underlying data distribution by time-reversal of a diffusion process and represent the state-of-the-art in many generative AI applications such as artificial image synthesis. Here, I show how a generative diffusion model can be implemented based on an underlying active particle process with finite correlation time. In contrast to previous approaches that use a score function acting on the velocity coordinate of the active particle, time reversal is here achieved by imposing an effective time-dependent potential on the position coordinate only. The effective potential is valid to first order in the persistence time and leads to a force field that is fully determined by the standard score function and its derivatives up to 2nd order. Numerical experiments for artificial data distributions confirm the validity of the effective potential.
[LG-94] Sharper Perturbed-Kullback-Leibler Exponential Tail Bounds for Beta and Dirichlet Distributions
链接: https://arxiv.org/abs/2508.07991
作者: Pierre Perrault
类目: Probability (math.PR); Machine Learning (cs.LG)
*备注:
Abstract:This paper presents an improved exponential tail bound for Beta distributions, refining a result in [15]. This improvement is achieved by interpreting their bound as a regular Kullback-Leibler (KL) divergence one, while introducing a specific perturbation \eta that shifts the mean of the Beta distribution closer to zero within the KL bound. Our contribution is to show that a larger perturbation can be chosen, thereby tightening the bound. We then extend this result from the Beta distribution to Dirichlet distributions and Dirichlet processes (DPs).
[LG-95] Likelihood Ratio Tests by Kernel Gaussian Embedding
链接: https://arxiv.org/abs/2508.07982
作者: Leonardo V. Santoro,Victor M. Panaretos
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:We propose a novel kernel-based nonparametric two-sample test, employing the combined use of kernel mean and kernel covariance embedding. Our test builds on recent results showing how such combined embeddings map distinct probability measures to mutually singular Gaussian measures on the kernel’s RKHS. Leveraging this result, we construct a test statistic based on the relative entropy between the Gaussian embeddings, i.e., the likelihood ratio. The likelihood ratio is specifically tailored to detect equality versus singularity of two Gaussians, and satisfies a "0/\infty" law, in that it vanishes under the null and diverges under the alternative. To implement the test in finite samples, we introduce a regularised version, calibrated by way of permutation. We prove consistency, establish uniform power guarantees under mild conditions, and discuss how our framework unifies and extends prior approaches based on spectrally regularized MMD. Empirical results on synthetic and real data demonstrate remarkable gains in power compared to state-of-the-art methods, particularly in high-dimensional and weak-signal regimes.
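Calibration by permutation, as mentioned in the abstract, can be sketched generically. The example below is an assumption-laden illustration: it uses a plain biased MMD estimate with a Gaussian kernel as a stand-in statistic, not the paper's regularised likelihood ratio, and every function name and parameter is hypothetical.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD for scalar samples.
    Stand-in statistic only; the paper uses a different one."""
    def avg(u, v):
        total = sum(gaussian_kernel(a, b, bandwidth) for a in u for b in v)
        return total / (len(u) * len(v))
    return avg(x, x) + avg(y, y) - 2.0 * avg(x, y)


def permutation_pvalue(x, y, statistic, n_perm=200, seed=0):
    """Permutation p-value: share of label shufflings whose statistic is
    at least as large as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    pooled = list(x) + list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)


# Two clearly separated samples should yield a small p-value.
x = [i / 10 for i in range(20)]
y = [5 + i / 10 for i in range(20)]
p = permutation_pvalue(x, y, mmd2)
```

Any two-sample statistic with the signature `statistic(x, y)` can be passed in, which is the point of calibrating by permutation: the null distribution is obtained from the data rather than from asymptotics.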
[LG-96] Gaussian Approximation for Two-Timescale Linear Stochastic Approximation
链接: https://arxiv.org/abs/2508.07928
作者: Bogdan Butyrin,Artemy Rubtsov,Alexey Naumov,Vladimir Ulyanov,Sergey Samsonov
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)
*备注:
Abstract:In this paper, we establish non-asymptotic bounds for accuracy of normal approximation for linear two-timescale stochastic approximation (TTSA) algorithms driven by martingale difference or Markov noise. Focusing on both the last iterate and Polyak-Ruppert averaging regimes, we derive bounds for normal approximation in terms of the convex distance between probability distributions. Our analysis reveals a non-trivial interaction between the fast and slow timescales: the normal approximation rate for the last iterate improves as the timescale separation increases, while it decreases in the Polyak-Ruppert averaged setting. We also provide the high-order moment bounds for the error of linear TTSA algorithm, which may be of independent interest.
[LG-97] Meta Off-Policy Estimation RECSYS’25
链接: https://arxiv.org/abs/2508.07914
作者: Olivier Jeunen
类目: Machine Learning (stat.ML); Information Retrieval (cs.IR); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: To appear in the Nineteenth ACM Conference on Recommender Systems (RecSys '25)
Abstract:Off-policy estimation (OPE) methods enable unbiased offline evaluation of recommender systems, directly estimating the online reward some target policy would have obtained, from offline data and with statistical guarantees. The theoretical elegance of the framework combined with practical successes have led to a surge of interest, with many competing estimators now available to practitioners and researchers. Among these, Doubly Robust methods provide a prominent strategy to combine value- and policy-based estimators. In this work, we take an alternative perspective to combine a set of OPE estimators and their associated confidence intervals into a single, more accurate estimate. Our approach leverages a correlated fixed-effects meta-analysis framework, explicitly accounting for dependencies among estimators that arise due to shared data. This yields a best linear unbiased estimate (BLUE) of the target policy’s value, along with an appropriately conservative confidence interval that reflects inter-estimator correlation. We validate our method on both simulated and real-world data, demonstrating improved statistical efficiency over existing individual estimators. Related DOI: https://doi.org/10.1145/3705328.3759308
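The best linear unbiased estimate for several unbiased estimators with a known covariance matrix has a standard closed form: weights proportional to the covariance inverse applied to the all-ones vector. The sketch below illustrates that textbook formula only; it is not the paper's implementation, the function names and numbers are hypothetical, and in practice the covariance must itself be estimated.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting (a is a list of rows)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def blue_combine(estimates, cov):
    """Combine unbiased estimates with covariance `cov`:
    w = cov^{-1} 1 / (1^T cov^{-1} 1), variance = 1 / (1^T cov^{-1} 1)."""
    z = solve(cov, [1.0] * len(estimates))
    total = sum(z)
    weights = [zi / total for zi in z]
    value = sum(w * e for w, e in zip(weights, estimates))
    return value, 1.0 / total, weights


# Two correlated estimators of the same policy value (illustrative numbers).
estimates = [1.0, 1.2]
cov = [[0.04, 0.01],
       [0.01, 0.09]]
value, variance, weights = blue_combine(estimates, cov)
```

When the off-diagonal entries are zero this reduces to ordinary inverse-variance weighting; the combined variance is never larger than the smallest individual variance.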
[LG-98] Stochastic dynamics learning with state-space systems
链接: https://arxiv.org/abs/2508.07876
作者: Juan-Pablo Ortega,Florian Rossmannek
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS); Statistics Theory (math.ST)
*备注:
Abstract:This work advances the theoretical foundations of reservoir computing (RC) by providing a unified treatment of fading memory and the echo state property (ESP) in both deterministic and stochastic settings. We investigate state-space systems, a central model class in time series learning, and establish that fading memory and solution stability hold generically – even in the absence of the ESP – offering a robust explanation for the empirical success of RC models without strict contractivity conditions. In the stochastic case, we critically assess stochastic echo states, proposing a novel distributional perspective rooted in attractor dynamics on the space of probability distributions, which leads to a rich and coherent theory. Our results extend and generalize previous work on non-autonomous dynamical systems, offering new insights into causality, stability, and memory in RC models. This lays the groundwork for reliable generative modeling of temporal data in both deterministic and stochastic regimes.
[LG-99] G-IFT: A Gated Linear Unit adapter with Iterative Fine-Tuning for Low-Resource Children's Speaker Verification INTERSPEECH
链接: https://arxiv.org/abs/2508.07836
作者: Vishwas M. Shetty,Jiusi Zheng,Abeer Alwan
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: Accepted at WOCCI, 2025 - Interspeech workshop
Abstract:Speaker Verification (SV) systems trained on adults' speech often underperform on children’s SV due to the acoustic mismatch, and the limited amount of children's speech data makes fine-tuning less effective. In this paper, we propose an innovative framework, a Gated Linear Unit adapter with Iterative Fine-Tuning (G-IFT), to enhance knowledge transfer efficiency between the high-resource adults' speech domain and the low-resource children’s speech domain. In this framework, a Gated Linear Unit adapter is first inserted between the pre-trained speaker embedding model and the classifier. Then the classifier, adapter, and pre-trained speaker embedding model are optimized sequentially in an iterative way. This framework is agnostic to the type of the underlying architecture of the SV system. Our experiments on ECAPA-TDNN, ResNet, and X-vector architectures using the OGI and MyST datasets demonstrate that the G-IFT framework yields consistent reductions in Equal Error Rates compared to baseline methods.
[LG-100] Generative Inversion for Property-Targeted Materials Design: Application to Shape Memory Alloys
链接: https://arxiv.org/abs/2508.07798
作者: Cheng Li,Pengfei Dang,Yuehui Xian,Yumei Zhou,Bofeng Shi,Xiangdong Ding,Jun Sun,Dezhen Xue
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
Abstract:The design of shape memory alloys (SMAs) with high transformation temperatures and large mechanical work output remains a longstanding challenge in functional materials engineering. Here, we introduce a data-driven framework based on generative adversarial network (GAN) inversion for the inverse design of high-performance SMAs. By coupling a pretrained GAN with a property prediction model, we perform gradient-based latent space optimization to directly generate candidate alloy compositions and processing parameters that satisfy user-defined property targets. The framework is experimentally validated through the synthesis and characterization of five NiTi-based SMAs. Among them, the Ni_{49.8}Ti_{26.4}Hf_{18.6}Zr_{5.2} alloy achieves a high transformation temperature of 404 °C, a large mechanical work output of 9.9 J/cm^3, a transformation enthalpy of 43 J/g, and a thermal hysteresis of 29 °C, outperforming existing NiTi alloys. The enhanced performance is attributed to a pronounced transformation volume change and a fine dispersion of Ti_2Ni-type precipitates, enabled by sluggish Zr and Hf diffusion, and semi-coherent interfaces with localized strain fields. This study demonstrates that GAN inversion offers an efficient and generalizable route for the property-targeted discovery of complex alloys.
[LG-101] Statistical Theory of Multi-stage Newton Iteration Algorithm for Online Continual Learning
链接: https://arxiv.org/abs/2508.07419
作者: Xinjia Lu,Chuhan Wang,Qian Zhao,Lixing Zhu,Xuehu Zhu
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:
Abstract:We focus on the critical challenge of handling non-stationary data streams in online continual learning environments, where constrained storage capacity prevents complete retention of historical data, leading to catastrophic forgetting during sequential task training. To more effectively analyze and address the problem of catastrophic forgetting in continual learning, we propose a novel continual learning framework from a statistical perspective. Our approach incorporates random effects across all model parameters and allows the dimension of parameters to diverge to infinity, offering a general formulation for continual learning problems. To efficiently process streaming data, we develop a Multi-step Newton Iteration algorithm that significantly reduces computational costs in certain scenarios by alleviating the burden of matrix inversion. Theoretically, we derive the asymptotic normality of the estimator, enabling subsequent statistical inference. Comprehensive validation through synthetic data experiments and analyses of two real datasets demonstrates the effectiveness of our proposed method.
[LG-102] Nonparametric Reaction Coordinate Optimization with Histories: A Framework for Rare Event Dynamics
链接: https://arxiv.org/abs/2508.07326
作者: Polina V. Banushkina,Sergei V. Krivov
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Probability (math.PR); Computational Physics (physics.comp-ph); Biomolecules (q-bio.BM)
*备注:
Abstract:Rare but critical events in complex systems, such as protein folding, chemical reactions, disease progression, and extreme weather or climate phenomena, are governed by complex, high-dimensional, stochastic dynamics. Identifying an optimal reaction coordinate (RC) that accurately captures the progress of these dynamics is crucial for understanding and simulating such processes. This work introduces a nonparametric RC optimization framework that incorporates trajectory histories, enabling robust analysis even for irregular or incomplete data. The power of the method is demonstrated through increasingly challenging analyses of protein folding dynamics, where it provides accurate committor estimates that pass a stringent validation test and yield high-resolution free energy profiles. Its generality is further illustrated through applications to dynamics in phase space, a conceptual ocean circulation model, and a longitudinal clinical dataset. These results demonstrate that rare event dynamics can be accurately characterized without exhaustive sampling of the configuration space, establishing a general, flexible, and robust framework for analyzing complex dynamical systems and longitudinal datasets.
[LG-103] Channel Charting in Smart Radio Environments
链接: https://arxiv.org/abs/2508.07305
作者: Mahdi Maleki,Reza Agahzadeh Ayoubi,Marouan Mizmizi,Umberto Spagnolini
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:This paper introduces the use of static electromagnetic skins (EMSs) to enable robust device localization via channel charting (CC) in realistic urban environments. We develop a rigorous optimization framework that leverages EMS to enhance channel dissimilarity and spatial fingerprinting, formulating EMS phase profile design as a codebook-based problem targeting the upper quantiles of key embedding metrics: localization error, trustworthiness, and continuity. Through 3D ray-traced simulations of a representative city scenario, we demonstrate that optimized EMS configurations, in addition to significantly improving the average positioning error, reduce the 90th-percentile localization error from over 60 m (no EMS) to less than 25 m, while drastically improving trustworthiness and continuity. To the best of our knowledge, this is the first work to exploit Smart Radio Environment (SRE) with static EMS for enhancing CC, achieving substantial gains in localization performance under challenging Non-Line-of-Sight (NLoS) conditions.
[LG-104] BIGBOY1.2: Generating Realistic Synthetic Data for Disease Outbreak Modelling and Analytics
链接: https://arxiv.org/abs/2508.07239
作者: Raunak Narwal,Syed Abbas
类目: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG)
*备注:
Abstract:Modelling disease outbreaks remains challenging due to incomplete surveillance data, noise, and limited access to standardized datasets. We have created BIGBOY1.2, an open synthetic dataset generator that creates configurable epidemic time series and population-level trajectories suitable for benchmarking modelling, forecasting, and visualisation. The framework supports SEIR and SIR-like compartmental logic, custom seasonality, and noise injection to mimic real reporting artifacts. BIGBOY1.2 can produce datasets with diverse characteristics, making it suitable for comparing traditional epidemiological models (e.g., SIR, SEIR) with modern machine learning approaches (e.g., SVM, neural networks).
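To make the ingredients named in the abstract concrete (SEIR compartments, seasonality, reporting noise), here is a minimal generic simulator. It is not BIGBOY1.2's code or API: the function name, all parameters, the daily forward-Euler step, the cosine seasonality, and the multiplicative Gaussian noise model are assumptions chosen for illustration.

```python
import math
import random


def simulate_seir(days, population, beta0, sigma, gamma,
                  seasonal_amp=0.0, period=365.0,
                  report_noise=0.0, initial_infected=10, seed=0):
    """Discrete-time SEIR model with a seasonally forced transmission
    rate and noisy case reports. Returns (reported_cases, final_state)."""
    rng = random.Random(seed)
    s = float(population - initial_infected)
    e, i, r = 0.0, float(initial_infected), 0.0
    reported = []
    for t in range(days):
        # Transmission rate oscillates around beta0 with the given period.
        beta = beta0 * (1.0 + seasonal_amp * math.cos(2.0 * math.pi * t / period))
        new_exposed = beta * s * i / population
        new_infectious = sigma * e
        new_recovered = gamma * i
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
        # Reporting artifact: multiplicative Gaussian noise, clipped at zero.
        noisy = new_infectious * (1.0 + rng.gauss(0.0, report_noise))
        reported.append(max(0.0, noisy))
    return reported, (s, e, i, r)


cases, state = simulate_seir(days=180, population=100_000, beta0=0.3,
                             sigma=0.2, gamma=0.1, seasonal_amp=0.2,
                             report_noise=0.1)
```

Fixing `seed` makes a generated dataset reproducible, which matters when the series is used as a benchmark for comparing forecasting models.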
[LG-105] QuProFS: An Evolutionary Training-free Approach to Efficient Quantum Feature Map Search
链接: https://arxiv.org/abs/2508.07104
作者: Yaswitha Gujju,Romain Harang,Chao Li,Tetsuo Shibuya,Qibin Zhao
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:The quest for effective quantum feature maps for data encoding presents significant challenges, particularly due to the flat training landscapes and lengthy training processes associated with parameterised quantum circuits. To address these issues, we propose an evolutionary training-free quantum architecture search (QAS) framework that employs circuit-based heuristics focused on trainability, hardware robustness, generalisation ability, expressivity, complexity, and kernel-target alignment. By ranking circuit architectures with various proxies, we reduce evaluation costs and incorporate hardware-aware circuits to enhance robustness against noise. We evaluate our approach on classification tasks (using a quantum support vector machine) across both artificial and quantum-generated datasets. Our approach demonstrates competitive accuracy on both simulators and real quantum hardware, surpassing state-of-the-art QAS methods in terms of sampling efficiency and achieving up to a 2x speedup in architecture search runtime.
[LG-106] Reconstruction of Solar EUV Irradiance Using CaII K Images and SOHO/SEM Data with Bayesian Deep Learning and Uncertainty Quantification
链接: https://arxiv.org/abs/2508.07065
作者: Haodi Jiang,Qin Li,Jason T. L. Wang,Haimin Wang,Serena Criscuoli
类目: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 18 pages, 10 figures
Abstract:Solar extreme ultraviolet (EUV) irradiance plays a crucial role in heating the Earth’s ionosphere, thermosphere, and mesosphere, affecting atmospheric dynamics over varying time scales. Although significant effort has been spent studying short-term EUV variations from solar transient events, there is little work to explore the long-term evolution of the EUV flux over multiple solar cycles. Continuous EUV flux measurements have only been available since 1995, leaving significant gaps in earlier data. In this study, we propose a Bayesian deep learning model, named SEMNet, to fill the gaps. We validate our approach by applying SEMNet to construct SOHO/SEM EUV flux measurements in the period between 1998 and 2014 using CaII K images from the Precision Solar Photometric Telescope. We then extend SEMNet through transfer learning to reconstruct solar EUV irradiance in the period between 1950 and 1960 using CaII K images from the Kodaikanal Solar Observatory. Experimental results show that SEMNet provides reliable predictions along with uncertainty bounds, demonstrating the feasibility of CaII K images as a robust proxy for long-term EUV fluxes. These findings contribute to a better understanding of solar influences on Earth’s climate over extended periods.
[LG-107] Taking the Garbage Out of Data-Driven Prediction Across Climate Timescales
链接: https://arxiv.org/abs/2508.07062
作者: Jason C. Furtado,Maria J. Molina,Marybeth C. Arcodia,Weston Anderson,Tom Beucler,John A. Callahan,Laura M. Ciasto,Vittorio A. Gensini,Michelle L’Heureux,Kathleen Pegion,Jhayron S. Pérez-Carrasquilla,Maike Sonnewald,Ken Takahashi,Baoqiang Xiang,Brian G. Zimmerman
类目: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 24 pages, 4 figures, 3 tables
Abstract:Artificial intelligence (AI) – and specifically machine learning (ML) – applications for climate prediction across timescales are proliferating quickly. The emergence of these methods prompts a revisit to the impact of data preprocessing, a topic familiar to the climate community, as more traditional statistical models work with relatively small sample sizes. Indeed, the skill and confidence in the forecasts produced by data-driven models are directly influenced by the quality of the datasets and how they are treated during model development, thus yielding the colloquialism “garbage in, garbage out.” As such, this article establishes protocols for the proper preprocessing of input data for AI/ML models designed for climate prediction (i.e., subseasonal to decadal and longer). The three aims are to: (1) educate researchers, developers, and end users on the effects that preprocessing has on climate predictions; (2) provide recommended practices for data preprocessing for such applications; and (3) empower end users to decipher whether the models they are using are properly designed for their objectives. Specific topics covered in this article include the creation of (standardized) anomalies, dealing with non-stationarity and the spatiotemporally correlated nature of climate data, and handling of extreme values and variables with potentially complex distributions. Case studies will illustrate how using different preprocessing techniques can produce different predictions from the same model, which can create confusion and decrease confidence in the overall process. Ultimately, implementing the recommended practices set forth in this article will enhance the robustness and transparency of AI/ML in climate prediction studies.
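One preprocessing step the abstract names is the creation of standardized anomalies. The sketch below shows the usual definition (subtract each calendar month's mean and divide by that month's standard deviation); it is an illustration under assumed names and toy values, not a protocol taken from the article.

```python
from statistics import mean, pstdev


def standardized_anomalies(values, months):
    """values[k] was observed in calendar month months[k] (1..12).
    Returns (value - monthly mean) / monthly standard deviation."""
    by_month = {}
    for v, m in zip(values, months):
        by_month.setdefault(m, []).append(v)
    # Climatology: per-month mean and (population) standard deviation.
    clim = {m: (mean(v), pstdev(v)) for m, v in by_month.items()}
    out = []
    for v, m in zip(values, months):
        mu, sd = clim[m]
        # A constant month has no spread; report a zero anomaly for it.
        out.append((v - mu) / sd if sd > 0 else 0.0)
    return out


# Toy series: two Januaries and two Februaries.
values = [10.0, 20.0, 14.0, 24.0]
months = [1, 2, 1, 2]
anoms = standardized_anomalies(values, months)
```

Removing the seasonal cycle this way puts January and February values on a common scale, so a model sees departures from normal rather than the seasonal cycle itself.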
[LG-108] Statistical Inference for Autoencoder-based Anomaly Detection after Representation Learning-based Domain Adaptation
链接: https://arxiv.org/abs/2508.07049
作者: Tran Tuan Kiet,Nguyen Thang Loi,Vo Nguyen Le Duy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Anomaly detection (AD) plays a vital role across a wide range of domains, but its performance might deteriorate when applied to target domains with limited data. Domain Adaptation (DA) offers a solution by transferring knowledge from a related source domain with abundant data. However, this adaptation process can introduce additional uncertainty, making it difficult to draw statistically valid conclusions from AD results. In this paper, we propose STAND-DA, a novel framework for statistically rigorous Autoencoder-based AD after Representation Learning-based DA. Built on the Selective Inference (SI) framework, STAND-DA computes valid p-values for detected anomalies and rigorously controls the false positive rate below a pre-specified level \alpha (e.g., 0.05). To address the computational challenges of applying SI to deep learning models, we develop a GPU-accelerated SI implementation, significantly enhancing both scalability and runtime performance. This advancement makes SI practically feasible for modern, large-scale deep architectures. Extensive experiments on synthetic and real-world datasets validate the theoretical results and computational efficiency of the proposed STAND-DA method.
[LG-109] Explainable AI for Curie Temperature Prediction in Magnetic Materials
链接: https://arxiv.org/abs/2508.06996
作者: M. Adeel Ajaib,Fariha Nasir,Abdul Rehman
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures
Abstract:We explore machine learning techniques for predicting Curie temperatures of magnetic materials using the NEMAD database. By augmenting the dataset with composition-based and domain-aware descriptors, we evaluate the performance of several machine learning models. We find that the Extra Trees Regressor delivers the best performance reaching an R^2 score of up to 0.85 \pm 0.01 (cross-validated) for a balanced dataset. We employ the k-means clustering algorithm to gain insights into the performance of chemically distinct material groups. Furthermore, we perform the SHAP analysis to identify key physicochemical drivers of Curie behavior, such as average atomic number and magnetic moment. By employing explainable AI techniques, this analysis offers insights into the model’s predictive behavior, thereby advancing scientific interpretability.
[LG-110] Machine Learning Algorithms for Improving Exact Classical Solvers in Mixed Integer Continuous Optimization
链接: https://arxiv.org/abs/2508.06906
作者: Morteza Kimiaei,Vyacheslav Kungurtsev,Brian Olimba
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Integer and mixed-integer nonlinear programming (INLP, MINLP) are central to logistics, energy, and scheduling, but remain computationally challenging. This survey examines how machine learning and reinforcement learning can enhance exact optimization methods, particularly branch-and-bound (BB), without compromising global optimality. We cover discrete, continuous, and mixed-integer formulations, and highlight applications such as crew scheduling, vehicle routing, and hydropower planning. We introduce a unified BB framework that embeds learning-based strategies into branching, cut selection, node ordering, and parameter control. Classical algorithms are augmented using supervised, imitation, and reinforcement learning models to accelerate convergence while maintaining correctness. We conclude with a taxonomy of learning methods by solver class and learning paradigm, and outline open challenges in generalization, hybridization, and scaling intelligent solvers.
[LG-111] Near-Optimal Convergence of Accelerated Gradient Methods under Generalized and (L_0, L_1)-Smoothness
链接: https://arxiv.org/abs/2508.06884
作者: Alexander Tyurin
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:We study first-order methods for convex optimization problems with functions f satisfying the recently proposed \ell-smoothness condition \|\nabla^2 f(x)\| \le \ell(\|\nabla f(x)\|), which generalizes the L-smoothness and (L_0, L_1)-smoothness. While accelerated gradient descent (AGD) is known to reach the optimal complexity O(\sqrt{L} R / \sqrt{\varepsilon}) under L-smoothness, where \varepsilon is an error tolerance and R is the distance between a starting and an optimal point, existing extensions to \ell-smoothness either incur extra dependence on the initial gradient, suffer exponential factors in L_1 R, or require costly auxiliary sub-routines, leaving open whether an AGD-type O(\sqrt{\ell(0)} R / \sqrt{\varepsilon}) rate is possible for small \varepsilon, even in the (L_0, L_1)-smoothness case. We resolve this open question. Leveraging a new Lyapunov function and designing new algorithms, we achieve O(\sqrt{\ell(0)} R / \sqrt{\varepsilon}) oracle complexity for small \varepsilon and virtually any \ell. For instance, for (L_0, L_1)-smoothness, our bound O(\sqrt{L_0} R / \sqrt{\varepsilon}) is provably optimal in the small-\varepsilon regime and removes all non-constant multiplicative factors present in prior accelerated algorithms.
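For readers unfamiliar with the condition, the two special cases mentioned in the abstract follow from particular choices of the function \ell. The block below merely restates the standard definitions and the substitution \ell(0) = L_0; it is a reading aid, not a result quoted from the paper.

```latex
% Generalized smoothness and its two special cases
\|\nabla^2 f(x)\| \le \ell\bigl(\|\nabla f(x)\|\bigr)
\quad\text{with}\quad
\begin{cases}
\ell(s) = L & \Rightarrow\ L\text{-smoothness},\\[2pt]
\ell(s) = L_0 + L_1 s & \Rightarrow\ (L_0, L_1)\text{-smoothness}.
\end{cases}

% In the (L_0, L_1) case, \ell(0) = L_0, so the stated rate specializes to
O\!\left(\frac{\sqrt{\ell(0)}\,R}{\sqrt{\varepsilon}}\right)
= O\!\left(\frac{\sqrt{L_0}\,R}{\sqrt{\varepsilon}}\right).
```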
[LG-112] MOCA-HESP: Meta High-dimensional Bayesian Optimization for Combinatorial and Mixed Spaces via Hyper-ellipsoid Partitioning ECAI-2025
链接: https://arxiv.org/abs/2508.06847
作者: Lam Ngo,Huong Ha,Jeffrey Chan,Hongyu Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Published at the 28th European Conference on Artificial Intelligence (ECAI-2025)
Abstract:High-dimensional Bayesian Optimization (BO) has attracted significant attention in recent research. However, existing methods have mainly focused on optimizing in continuous domains, while combinatorial (ordinal and categorical) and mixed domains still remain challenging. In this paper, we first propose MOCA-HESP, a novel high-dimensional BO method for combinatorial and mixed variables. The key idea is to leverage the hyper-ellipsoid space partitioning (HESP) technique with different categorical encoders to work with high-dimensional, combinatorial and mixed spaces, while adaptively selecting the optimal encoders for HESP using a multi-armed bandit technique. Our method, MOCA-HESP, is designed as a meta-algorithm such that it can incorporate other combinatorial and mixed BO optimizers to further enhance the optimizers’ performance. Finally, we develop three practical BO methods by integrating MOCA-HESP with state-of-the-art BO optimizers for combinatorial and mixed variables: standard BO, CASMOPOLITAN, and Bounce. Our experimental results on various synthetic and real-world benchmarks show that our methods outperform existing baselines. Our code implementation can be found at this https URL
[LG-113] A Score-based Diffusion Model Approach for Adaptive Learning of Stochastic Partial Differential Equation Solutions
链接: https://arxiv.org/abs/2508.06834
作者: Toan Huynh,Ruth Lopez Fajardo,Guannan Zhang,Lili Ju,Feng Bao
类目: Computation (stat.CO); Machine Learning (cs.LG); Dynamical Systems (math.DS); Probability (math.PR); Machine Learning (stat.ML)
*备注:
Abstract:We propose a novel framework for adaptively learning the time-evolving solutions of stochastic partial differential equations (SPDEs) using score-based diffusion models within a recursive Bayesian inference setting. SPDEs play a central role in modeling complex physical systems under uncertainty, but their numerical solutions often suffer from model errors and reduced accuracy due to incomplete physical knowledge and environmental variability. To address these challenges, we encode the governing physics into the score function of a diffusion model using simulation data and incorporate observational information via a likelihood-based correction in a reverse-time stochastic differential equation. This enables adaptive learning through iterative refinement of the solution as new data becomes available. To improve computational efficiency in high-dimensional settings, we introduce the ensemble score filter, a training-free approximation of the score function designed for real-time inference. Numerical experiments on benchmark SPDEs demonstrate the accuracy and robustness of the proposed method under sparse and noisy observations.
[LG-114] Role of Large Language Models and Retrieval-Augmented Generation for Accelerating Crystalline Material Discovery: A Systematic Review
链接: https://arxiv.org/abs/2508.06691
作者: Agada Joseph Oche,Arpan Biswas
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 10 pages, 2 figures
Abstract:Large language models (LLMs) have emerged as powerful tools for knowledge-intensive tasks across domains. In materials science, finding novel materials for energy-efficient devices in real-world applications requires many simulations and experiments that are expensive in both time and cost. To narrow the uncharted material search space and minimize experimental cost, LLMs can play a larger role by first providing an accelerated search of promising known material candidates. Furthermore, the integration of LLMs with domain-specific information via retrieval-augmented generation (RAG) is poised to revolutionize how researchers predict materials structures, analyze defects, discover novel compounds, and extract knowledge from literature and databases. Motivated by the potential of LLMs and RAG to accelerate materials discovery, this paper presents a broad and systematic review of recent advancements in applying LLMs and RAG to key materials science problems. We survey state-of-the-art developments in crystal structure prediction, defect analysis, materials discovery, literature mining, database integration, and multi-modal retrieval, highlighting how combining LLMs with external knowledge sources enables new capabilities. We discuss the performance, limitations, and implications of these approaches, and outline future directions for leveraging LLMs to accelerate materials research and discovery for advancing technologies in electronics, optics, biomedicine, and energy storage.
[LG-115] Machines Learn Number Fields, But How? The Case of Galois Groups
Link: https://arxiv.org/abs/2508.06670
Authors: Kyu-Hwan Lee,Seewoo Lee
Subjects: Number Theory (math.NT); Machine Learning (cs.LG)
Comments: 31+10+3 pages
Abstract:By applying interpretable machine learning methods such as decision trees, we study how simple models can classify the Galois groups of Galois extensions over $\mathbb{Q}$ of degrees 4, 6, 8, 9, and 10, using Dedekind zeta coefficients. Our interpretation of the machine learning results allows us to understand how the distribution of zeta coefficients depends on the Galois group, and to prove new criteria for classifying the Galois groups of these extensions. Combined with previous results, this work provides another example of a new paradigm in mathematical research driven by machine learning.
[LG-116] Federated Online Learning for Heterogeneous Multisource Streaming Data
Link: https://arxiv.org/abs/2508.06652
Authors: Jingmao Li,Yuanxing Chen,Shuangge Ma,Kuangnan Fang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:Federated learning has emerged as an essential paradigm for distributed multi-source data analysis under privacy concerns. Most existing federated learning methods focus on "static" datasets. However, in many real-world applications, data arrive continuously over time, forming streaming datasets. This introduces additional challenges for data storage and algorithm design, particularly under high-dimensional settings. In this paper, we propose a federated online learning (FOL) method for distributed multi-source streaming data analysis. To account for heterogeneity, a personalized model is constructed for each data source, and a novel "subgroup" assumption is employed to capture potential similarities, thereby enhancing model performance. We adopt the penalized renewable estimation method and the efficient proximal gradient descent for model training. The proposed method aligns with both federated and online learning frameworks: raw data are not exchanged among sources, ensuring data privacy, and only summary statistics of previous data batches are required for model updates, significantly reducing storage demands. Theoretically, we establish the consistency properties for model estimation, variable selection, and subgroup structure recovery, demonstrating optimal statistical efficiency. Simulations illustrate the effectiveness of the proposed method. Furthermore, when applied to the financial lending data and the web log data, the proposed method also exhibits advantageous prediction performance. Results of the analysis also provide some practical insights.
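The idea of updating a model from summary statistics only, combined with proximal gradient descent, can be sketched as follows. This is a deliberately simplified illustration and not the paper's method: it assumes a single data source, a squared loss, and a plain lasso penalty, and it does not reproduce the personalized models or the subgroup penalty. The class name `RenewableLasso` and all data values are hypothetical.

```python
from typing import List, Sequence


def soft_threshold(value: float, threshold: float) -> float:
    """Proximal operator of the L1 penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


class RenewableLasso:
    """Lasso fitted from running summaries; raw batches are never stored."""

    def __init__(self, dim: int, penalty: float = 0.1) -> None:
        self.gram = [[0.0] * dim for _ in range(dim)]  # running sum of x x^T
        self.moment = [0.0] * dim                      # running sum of x * y
        self.count = 0
        self.coef = [0.0] * dim
        self.penalty = penalty

    def update(self, rows: Sequence[Sequence[float]],
               targets: Sequence[float], iterations: int = 500) -> List[float]:
        dim = len(self.coef)
        # Fold the new batch into the summaries, then discard it.
        for x, y in zip(rows, targets):
            for i in range(dim):
                self.moment[i] += x[i] * y
                for j in range(dim):
                    self.gram[i][j] += x[i] * x[j]
        self.count += len(rows)
        # The trace bounds the largest eigenvalue, so 1/trace is a safe step.
        lipschitz = sum(self.gram[i][i] for i in range(dim)) / self.count
        if lipschitz == 0.0:
            return self.coef
        step = 1.0 / lipschitz
        for _ in range(iterations):
            grad = [
                (sum(self.gram[i][j] * self.coef[j] for j in range(dim))
                 - self.moment[i]) / self.count
                for i in range(dim)
            ]
            self.coef = [soft_threshold(b - step * g, step * self.penalty)
                         for b, g in zip(self.coef, grad)]
        return self.coef


# Two hypothetical batches generated from y = 2 * x0 (x1 is irrelevant).
model = RenewableLasso(dim=2, penalty=0.01)
model.update([(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)], [2.0, 0.0, 2.0])
coef = model.update([(2.0, 1.0), (1.0, 2.0), (-1.0, 1.0)], [4.0, 2.0, -2.0])
```

The second call warm-starts from the previous coefficients and only touches the accumulated `gram` and `moment` summaries, which is the storage saving the abstract refers to.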
[LG-117] Benchmarking Self-Driving Labs
Link: https://arxiv.org/abs/2508.06642
Authors: Adedire D. Adesiji,Jiashuo Wang,Cheng-Shu Kuo,Keith A. Brown
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Abstract:A key goal of modern materials science is accelerating the pace of materials discovery. Self-driving labs (SDLs), systems that select experiments using machine learning and then execute them using automation, are designed to fulfil this promise by performing experiments faster, more intelligently, more reliably, and with richer metadata than conventional means. This review summarizes progress in understanding the degree to which SDLs accelerate learning by quantifying how much they reduce the number of experiments required for a given goal. The review begins by summarizing the theory underlying two key metrics, namely acceleration factor (AF) and enhancement factor (EF), which quantify how much faster and better an algorithm is relative to a reference strategy. Next, we provide a comprehensive review of the literature, which reveals a wide range of AFs, with a median of 6, that tend to increase with the dimensionality of the space, reflecting an interesting blessing of dimensionality. In contrast, reported EF values vary by over two orders of magnitude, although they consistently peak at 10-20 experiments per dimension. To understand these results, we perform a series of simulated Bayesian optimization campaigns that reveal how EF depends upon the statistical properties of the parameter space while AF depends on its complexity. Collectively, these results reinforce the motivation for using SDLs by revealing their value across a wide range of material parameter spaces and provide a common language for quantifying and understanding this acceleration.
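The two metrics can be illustrated with a small sketch. The abstract does not give their formulas, so the definitions below are assumptions: AF is taken as the ratio of experiments the reference needs to the experiments the guided campaign needs to reach the same target, and EF as the ratio of best-so-far performance at an equal experiment budget. The curves are made-up numbers, not results from the review.

```python
from typing import Optional, Sequence


def experiments_to_reach(best_so_far: Sequence[float],
                         target: float) -> Optional[int]:
    """Smallest number of experiments whose best-so-far value reaches target."""
    for n, value in enumerate(best_so_far, start=1):
        if value >= target:
            return n
    return None  # target never reached in this campaign


def acceleration_factor(reference: Sequence[float], guided: Sequence[float],
                        target: float) -> Optional[float]:
    """Assumed AF: experiments needed by the reference / by the guided run."""
    n_ref = experiments_to_reach(reference, target)
    n_guided = experiments_to_reach(guided, target)
    if n_ref is None or n_guided is None:
        return None
    return n_ref / n_guided


def enhancement_factor(reference: Sequence[float], guided: Sequence[float],
                       budget: int) -> float:
    """Assumed EF: best-so-far ratio after the same number of experiments."""
    return guided[budget - 1] / reference[budget - 1]


# Hypothetical best-so-far curves (higher is better).
reference = [0.10, 0.20, 0.25, 0.30, 0.40, 0.50, 0.55, 0.60]
guided = [0.15, 0.40, 0.60, 0.70, 0.75, 0.80, 0.82, 0.85]

af = acceleration_factor(reference, guided, target=0.60)
ef = enhancement_factor(reference, guided, budget=4)
```

With these curves the guided campaign reaches 0.60 after 3 experiments while the reference needs 8, so AF describes speed at a fixed goal and EF describes quality at a fixed budget.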
[LG-118] Do Streetscapes Still Matter for Customer Ratings of Eating and Drinking Establishments in Car-Dependent Cities?
Link: https://arxiv.org/abs/2508.06513
Authors: Chaeyeon Han,Seung Jae Lieu,Uijeong Hwang,Subhrajit Guhathakurta
Subjects: Physics and Society (physics.soc-ph); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Soon to be published in Journal of Urban Design
Abstract:This study examines how indoor and outdoor aesthetics, streetscapes, and neighborhood features shape customer satisfaction at eating and drinking establishments (EDEs) across different urban contexts, varying in car dependency, in Washington, DC. Using review photos and street view images, computer vision models quantified perceived safety and visual appeal. Ordinal logistic regression analyzed their effects on Yelp ratings. Findings reveal that both indoor and outdoor environments significantly impact EDE ratings, while streetscape quality’s influence diminishes in car-dependent areas. The study highlights the need for context-sensitive planning that integrates indoor and outdoor factors to enhance customer experiences in diverse settings.
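For readers unfamiliar with ordinal logistic regression, the sketch below shows how a proportional-odds model turns a linear score into probabilities over ordered rating levels. The feature names, coefficients, and cutpoints are hypothetical and are not taken from the study.

```python
import math
from typing import Dict, List, Sequence


def linear_score(features: Dict[str, float],
                 coefficients: Dict[str, float]) -> float:
    """Weighted sum of the predictors (no intercept; cutpoints play that role)."""
    return sum(coefficients[name] * value for name, value in features.items())


def rating_probabilities(score: float,
                         cutpoints: Sequence[float]) -> List[float]:
    """P(rating = k) under proportional odds: P(rating <= k) = sigmoid(c_k - score).

    `cutpoints` must be increasing; K cutpoints give K + 1 rating levels.
    """
    def sigmoid(z: float) -> float:
        return 1.0 / (1.0 + math.exp(-z))

    cumulative = [sigmoid(c - score) for c in cutpoints] + [1.0]
    probabilities, previous = [], 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


# Hypothetical coefficients and cutpoints for a 1-to-5 star rating.
coefficients = {"perceived_safety": 0.8, "visual_appeal": 1.2}
cutpoints = [-2.0, -1.0, 0.0, 1.0]
venue = {"perceived_safety": 0.5, "visual_appeal": 1.0}
probs = rating_probabilities(linear_score(venue, coefficients), cutpoints)
```

A positive coefficient shifts probability mass toward higher ratings without changing the ordering of the levels, which is why this model suits star ratings better than an unordered multinomial model.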
Information Retrieval
[IR-0] Early Explorations of Recommender Systems for Physical Activity and Well-being RECSYS2025
Link: https://arxiv.org/abs/2508.07980
Authors: Alan Said
Subjects: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: Second International Workshop on Recommender Systems for Sustainability and Social Good (RecSoGood) in conjunction with ACM RecSys 2025
Abstract:As recommender systems increasingly guide physical actions, often through wearables and coaching tools, new challenges arise around how users interpret, trust, and respond to this advice. This paper introduces a conceptual framework for tangible recommendations that influence users’ bodies, routines, and well-being. We describe three design dimensions: trust and interpretation, intent alignment, and consequence awareness. These highlight key limitations in applying conventional recommender logic to embodied settings. Through examples and design reflections, we outline how future systems can support long-term well-being, behavioral alignment, and socially responsible personalization.
[IR-1] Careful Queries, Credible Results: Teaching RAG Models Advanced Web Search Tools with Reinforcement Learning
Link: https://arxiv.org/abs/2508.07956
Authors: Yuqin Dai,Shuo Yang,Guoqing Wang,Yong Deng,Zhanwei Zhang,Jun Yin,Pengyu Zeng,Zhenzhe Ying,Changhua Meng,Can Yi,Yuchen Zhou,Weiqiang Wang,Shuai Lu
Subjects: Information Retrieval (cs.IR)
Abstract:Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating up-to-date external knowledge, yet real-world web environments present unique challenges. These limitations manifest as two key challenges: pervasive misinformation in the web environment, which introduces unreliable or misleading content that can degrade retrieval accuracy, and the underutilization of web tools, which, if effectively employed, could enhance query precision and help mitigate this noise, ultimately improving the retrieval results in RAG systems. To address these issues, we propose WebFilter, a novel RAG framework that generates source-restricted queries and filters out unreliable content. This approach combines a retrieval filtering mechanism with a behavior- and outcome-driven reward strategy, optimizing both query formulation and retrieval outcomes. Extensive experiments demonstrate that WebFilter improves answer quality and retrieval precision, outperforming existing RAG methods on both in-domain and out-of-domain benchmarks.
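A source-restricted query and a retrieval filter can be sketched as below. This is only an illustration of the general idea: the use of the `site:` search operator, the allowlist, and the function names are assumptions, and the reinforcement-learning reward strategy described in the abstract is not covered.

```python
from typing import Iterable, List
from urllib.parse import urlparse

# Hypothetical allowlist of sources considered reliable.
TRUSTED_DOMAINS = {"wikipedia.org", "nature.com"}


def restrict_query(query: str, domains: Iterable[str]) -> str:
    """Append site: operators so the engine only searches the given sources."""
    sites = " OR ".join(f"site:{domain}" for domain in sorted(domains))
    return f"{query} ({sites})"


def is_trusted(url: str, domains: Iterable[str]) -> bool:
    """True if the URL's host is a listed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def filter_results(urls: Iterable[str], domains: Iterable[str]) -> List[str]:
    """Drop retrieved URLs that do not come from a trusted source."""
    domains = list(domains)
    return [url for url in urls if is_trusted(url, domains)]


query = restrict_query("crispr off-target effects", TRUSTED_DOMAINS)
kept = filter_results(
    [
        "https://en.wikipedia.org/wiki/CRISPR",
        "https://evilwikipedia.org/crispr",
        "https://example.com/blog",
    ],
    TRUSTED_DOMAINS,
)
```

Matching on the full host name or a dot-prefixed suffix avoids accepting look-alike hosts such as `evilwikipedia.org`.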
[IR-2] Encode Me If You Can: Learning Universal User Representations via Event Sequence Autoencoding
Link: https://arxiv.org/abs/2508.07748
Authors: Anton Klenitskiy,Artem Fatkulin,Daria Denisova,Anton Pembek,Alexey Vasilev
Subjects: Information Retrieval (cs.IR)
Abstract:Building universal user representations that capture the essential aspects of user behavior is a crucial task for modern machine learning systems. In real-world applications, a user’s historical interactions often serve as the foundation for solving a wide range of predictive tasks, such as churn prediction, recommendations, or lifetime value estimation. Using a task-independent user representation that is effective across all such tasks can reduce the need for task-specific feature engineering and model retraining, leading to more scalable and efficient machine learning pipelines. The goal of the RecSys Challenge 2025 by Synerise was to develop such Universal Behavioral Profiles from logs of past user behavior, which included various types of events such as product purchases, page views, and search queries. We propose a method that transforms the entire user interaction history into a single chronological sequence and trains a GRU-based autoencoder to reconstruct this sequence from a fixed-size vector. If the model can accurately reconstruct the sequence, the latent vector is expected to capture the key behavioral patterns. In addition to this core model, we explored several alternative methods for generating user embeddings and combined them by concatenating their output vectors into a unified representation. This ensemble strategy further improved generalization across diverse downstream tasks and helped our team, ai_lab_recsys, achieve second place in the RecSys Challenge 2025.
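The preprocessing step, turning several per-type event logs into a single chronological sequence, can be sketched as follows. The event schema `(timestamp, event_type, payload)` and the sample events are hypothetical, and the GRU autoencoder that would consume the sequence is not shown.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical schema: (timestamp, event_type, payload).
Event = Tuple[int, str, str]


def to_chronological_sequence(*streams: Iterable[Event]) -> List[Event]:
    """Merge per-type logs, each already sorted by time, into one sequence."""
    return list(heapq.merge(*streams, key=lambda event: event[0]))


purchases = [(3, "purchase", "sku_17"), (9, "purchase", "sku_02")]
page_views = [(1, "page_view", "/home"), (4, "page_view", "/cart")]
searches = [(2, "search", "running shoes")]

sequence = to_chronological_sequence(purchases, page_views, searches)
```

`heapq.merge` assumes each input is already sorted and yields items lazily, so it suits long logs; unsorted logs would need `sorted(...)` over the concatenation instead.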
[IR-3] MLego: Interactive and Scalable Topic Exploration Through Model Reuse
Link: https://arxiv.org/abs/2508.07654
Authors: Fei Ye,Jiapan Liu,Yinan Jing,Zhenying He,Weirao Wang,X. Sean Wang
Subjects: Databases (cs.DB); Information Retrieval (cs.IR)
Comments: 14 pages
Abstract:With massive texts on social media, users and analysts often rely on topic modeling techniques to quickly extract key themes and gain insights. Traditional topic modeling techniques, such as Latent Dirichlet Allocation (LDA), provide valuable insights but are computationally expensive, making them impractical for real-time data analysis. Although recent advances in distributed training and fast sampling methods have improved efficiency, real-time topic exploration remains a significant challenge. In this paper, we present MLego, an interactive query framework designed to support real-time topic modeling analysis by leveraging model materialization and reuse. Instead of retraining models from scratch, MLego efficiently merges materialized topic models to construct approximate results at interactive speeds. To further enhance efficiency, we introduce a hierarchical plan search strategy for single queries and an optimized query reordering technique for batch queries. We integrate MLego into a visual analytics prototype system, enabling users to explore large-scale textual datasets through interactive queries. Extensive experiments demonstrate that MLego significantly reduces computation costs while maintaining high-quality topic modeling results. MLego enhances existing visual analytics approaches, which primarily focus on user-driven topic modeling, by enabling real-time, query-driven exploration. This complements traditional methods and bridges the gap between scalable topic modeling and interactive data analysis.
[IR-4] UMRE: A Unified Monotonic Transformation for Ranking Ensemble in Recommender Systems
Link: https://arxiv.org/abs/2508.07613
Authors: Zhengrui Xu,Zhe Yang,Zhengxiao Guo,Shukai Liu,Luocheng Lin,Xiaoyan Liu,Yongqi Liu,Han Li
Subjects: Information Retrieval (cs.IR)
Abstract:Industrial recommender systems commonly rely on ensemble sorting (ES) to combine predictions from multiple behavioral objectives. Traditionally, this process depends on manually designed nonlinear transformations (e.g., polynomial or exponential functions) and hand-tuned fusion weights to balance competing goals – an approach that is labor-intensive and frequently suboptimal in achieving Pareto efficiency. In this paper, we propose a novel Unified Monotonic Ranking Ensemble (UMRE) framework to address the limitations of traditional methods in ensemble sorting. UMRE replaces handcrafted transformations with Unconstrained Monotonic Neural Networks (UMNN), which learn expressive, strictly monotonic functions through the integration of positive neural integrals. Subsequently, a lightweight ranking model is employed to fuse the prediction scores, assigning personalized weights to each prediction objective. To balance competing goals, we further introduce a Pareto optimality strategy that adaptively coordinates task weights during training. UMRE eliminates manual tuning, maintains ranking consistency, and achieves fine-grained personalization. Experimental results on two public recommendation datasets (Kuairand and Tenrec) and online A/B tests demonstrate impressive performance and generalization capabilities.
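The core idea behind an unconstrained monotonic network, obtaining a strictly increasing function by integrating a strictly positive one, can be sketched without any deep learning library. Everything below is an assumption for illustration: the tiny network has fixed hand-picked weights rather than learned ones, and a plain trapezoid rule stands in for whatever quadrature the authors use.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z)); always strictly positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def positive_net(t: float,
                 w1=(1.5, -2.0), b1=(0.3, 0.1),
                 w2=(0.8, -1.2), b2=0.2) -> float:
    """Tiny one-hidden-layer network with a positive output (fixed weights)."""
    hidden = [math.tanh(w * t + b) for w, b in zip(w1, b1)]
    raw = sum(w * h for w, h in zip(w2, hidden)) + b2
    return softplus(raw)


def monotonic_transform(x: float, offset: float = 0.0,
                        steps: int = 200) -> float:
    """F(x) = offset + integral from 0 to x of positive_net(t) dt (trapezoid)."""
    h = x / steps
    total = 0.5 * (positive_net(0.0) + positive_net(x))
    total += sum(positive_net(i * h) for i in range(1, steps))
    return offset + h * total


scores = [-2.0, -1.0, 0.0, 0.5, 1.0, 3.0]
transformed = [monotonic_transform(s) for s in scores]
```

Because the integrand is positive everywhere, the transform preserves the ordering of the input scores no matter what the weights are, which is the ranking-consistency property the abstract highlights.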
[IR-5] Towards Comprehensible Recommendation with Large Language Model Fine-tuning
Link: https://arxiv.org/abs/2508.07595
Authors: Yunze Luo,Yinjie Jiang,Gaode Chen,Xinghua Zhang,Jun Zhang,Jian Liang,Kaigui Bian
Subjects: Information Retrieval (cs.IR)
Comments: 11 pages, 6 figures
Abstract:Recommender systems have become increasingly ubiquitous in daily life. While traditional recommendation approaches primarily rely on ID-based representations or item-side content features, they often fall short in capturing the underlying semantics aligned with user preferences (e.g., recommendation reasons for items), leading to a semantic-collaborative gap. Recently emerged LLM-based feature extraction approaches also face a key challenge: how to ensure that LLMs possess recommendation-aligned reasoning capabilities and can generate accurate, personalized reasons to mitigate the semantic-collaborative gap. To address these issues, we propose a novel Content Understanding from a Collaborative Perspective framework (CURec), which generates collaborative-aligned content features for more comprehensive recommendations. CURec first aligns the LLM with recommendation objectives through pretraining, equipping it with instruction-following and chain-of-thought reasoning capabilities. Next, we design a reward model inspired by traditional recommendation architectures to evaluate the quality of the recommendation reasons generated by the LLM. Finally, using the reward signals, CURec fine-tunes the LLM through RL and corrects the generated reasons to ensure their accuracy. The corrected reasons are then integrated into a downstream recommender model to enhance comprehensibility and recommendation performance. Extensive experiments on public benchmarks demonstrate the superiority of CURec over existing methods.
[IR-6] Orthogonal Low Rank Embedding Stabilization
Link: https://arxiv.org/abs/2508.07574
Authors: Kevin Zielnicki,Ko-Jen Hsiao
Subjects: Information Retrieval (cs.IR)
Abstract:The instability of embedding spaces across model retraining cycles presents significant challenges to downstream applications using user or item embeddings derived from recommendation systems as input features. This paper introduces a novel orthogonal low-rank transformation methodology designed to stabilize the user/item embedding space, ensuring consistent embedding dimensions across retraining sessions. Our approach leverages a combination of efficient low-rank singular value decomposition and orthogonal Procrustes transformation to map embeddings into a standardized space. This transformation is computationally efficient, lossless, and lightweight, preserving the dot product and inference quality while reducing operational burdens. Unlike existing methods that modify training objectives or embedding structures, our approach maintains the integrity of the primary model application and can be seamlessly integrated with other stabilization techniques.
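The orthogonal Procrustes step can be sketched with NumPy (assumed to be installed). This illustrates only the alignment step, under the simplifying assumption that retraining rotates the embedding space without changing its geometry; the low-rank SVD stage from the abstract is not reproduced, and the array names and sizes are hypothetical.

```python
import numpy as np


def procrustes_rotation(new: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Orthogonal matrix R minimizing ||new @ R - reference|| (Frobenius norm)."""
    u, _, vt = np.linalg.svd(new.T @ reference)
    return u @ vt


rng = np.random.default_rng(0)
reference_users = rng.normal(size=(50, 8))
reference_items = rng.normal(size=(30, 8))

# Simulate a retraining run that yields the same geometry, arbitrarily rotated.
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
retrained_users = reference_users @ rotation
retrained_items = reference_items @ rotation

# Estimate the alignment from users, then apply it to users and items alike.
r = procrustes_rotation(retrained_users, reference_users)
aligned_users = retrained_users @ r
aligned_items = retrained_items @ r
```

Since `r` is orthogonal, user-item dot products are identical before and after alignment, so downstream consumers see a stable space while the scores of the primary model are unchanged.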
[IR-7] Are Multimodal Embeddings Truly Beneficial for Recommendation? A Deep Dive into Whole vs. Individual Modalities
Link: https://arxiv.org/abs/2508.07399
Authors: Yu Ye,Junchen Fu,Yu Song,Kaiwen Zheng,Joemon M. Jose
Subjects: Information Retrieval (cs.IR)
Abstract:Multimodal recommendation (MMRec) has emerged as a mainstream paradigm, typically leveraging text and visual embeddings extracted from pre-trained models such as Sentence-BERT, Vision Transformers, and ResNet. This approach is founded on the intuitive assumption that incorporating multimodal embeddings can enhance recommendation performance. However, despite its popularity, this assumption lacks comprehensive empirical verification. This presents a critical research gap. To address it, we pose the central research question of this paper: Are multimodal embeddings truly beneficial for recommendation? To answer this question, we conduct a large-scale empirical study examining the role of text and visual embeddings in modern MMRec models, both as a whole and individually. Specifically, we pose two key research questions: (1) Do multimodal embeddings as a whole improve recommendation performance? (2) Is each individual modality - text and image - useful when used alone? To isolate the effect of individual modalities - text or visual - we employ a modality knockout strategy by setting the corresponding embeddings to either constant values or random noise. To ensure the scale and comprehensiveness of our study, we evaluate 14 widely used state-of-the-art MMRec models. Our findings reveal that: (1) multimodal embeddings generally enhance recommendation performance - particularly when integrated through more sophisticated graph-based fusion models. Surprisingly, commonly adopted baseline models with simple fusion schemes, such as VBPR and BM3, show only limited gains. (2) The text modality alone achieves performance comparable to the full multimodal setting in most cases, whereas the image modality alone does not. These results offer foundational insights and practical guidance for the MMRec community. We will release our code and datasets to facilitate future research.
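The modality knockout strategy, replacing one modality's embeddings with constants or random noise before training, can be sketched as below. The function name, the dictionary layout, and the choice of standard Gaussian noise are assumptions made for illustration.

```python
import random
from typing import Dict, List


def knock_out(embeddings: Dict[str, List[float]], mode: str,
              seed: int = 0, constant: float = 0.0) -> Dict[str, List[float]]:
    """Return a copy in which every item's embedding carries no information.

    mode="constant" uses the same constant vector for every item;
    mode="noise" draws independent standard Gaussian values.
    """
    rng = random.Random(seed)
    replaced = {}
    for item_id, vector in embeddings.items():
        if mode == "constant":
            replaced[item_id] = [constant] * len(vector)
        elif mode == "noise":
            replaced[item_id] = [rng.gauss(0.0, 1.0) for _ in vector]
        else:
            raise ValueError(f"unknown knockout mode: {mode}")
    return replaced


# Hypothetical text embeddings for two items.
text_embeddings = {"item_a": [0.2, -0.1, 0.7], "item_b": [1.0, 0.0, -0.3]}
constant_text = knock_out(text_embeddings, "constant")
noisy_text = knock_out(text_embeddings, "noise", seed=1)
```

Keeping the embedding dimensions unchanged lets the same model architecture be trained with and without the modality, so any performance gap can be attributed to the information in that modality.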
[IR-8] Uncertainty-Aware Semantic Decoding for LLM-Based Sequential Recommendation APWEB2025
Link: https://arxiv.org/abs/2508.07210
Authors: Chenke Yin,Li Fan,Jia Wang,Dongxiao Hu,Haichao Zhang,Chong Zhang,Yang Xiang
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by APWeb 2025
Abstract:Large language models have been widely applied to sequential recommendation tasks, yet during inference, they continue to rely on decoding strategies developed for natural language processing. This creates a mismatch between text-generation objectives and the next-item selection objectives of recommendation. This paper addresses this limitation by proposing an Uncertainty-aware Semantic Decoding (USD) framework that combines logit-based clustering with adaptive scoring to improve next-item predictions. Our approach clusters items with similar logit vectors into semantic equivalence groups, then redistributes probability mass within these clusters and computes entropy across them to control item scoring and sampling temperature during recommendation inference. Experiments on Amazon Product datasets (six domains) show gains of 18.5% in HR@3, 11.9% in NDCG@3, and 10.8% in MRR@3 compared to state-of-the-art baselines. Hyperparameter analysis confirms the optimal parameters among various settings, and experiments on the H&M and Netflix datasets indicate that the framework can adapt to differing recommendation domains. The experimental results confirm that integrating semantic clustering and uncertainty assessment yields more reliable and accurate recommendations.
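The pipeline of clustering, mass redistribution, and entropy-controlled temperature can be sketched roughly as follows. The abstract does not specify the rules, so several pieces are assumptions: cluster assignments are given rather than derived from logit vectors, mass is shared equally within a cluster, and the temperature grows linearly with normalized cluster entropy. None of this should be read as the USD framework itself.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def cluster_mass(probs: Sequence[float],
                 clusters: Sequence[int]) -> Dict[int, float]:
    """Total probability assigned to each semantic cluster."""
    mass: Dict[int, float] = defaultdict(float)
    for p, c in zip(probs, clusters):
        mass[c] += p
    return dict(mass)


def redistribute(probs: Sequence[float],
                 clusters: Sequence[int]) -> List[float]:
    """Assumed rule: share each cluster's mass equally among its members."""
    mass = cluster_mass(probs, clusters)
    size = Counter(clusters)
    return [mass[c] / size[c] for c in clusters]


def adaptive_temperature(probs: Sequence[float], clusters: Sequence[int],
                         base: float = 1.0) -> float:
    """Assumed rule: scale temperature by normalized entropy across clusters."""
    mass = cluster_mass(probs, clusters)
    entropy = -sum(m * math.log(m) for m in mass.values() if m > 0.0)
    max_entropy = math.log(len(mass)) if len(mass) > 1 else 1.0
    return base * (1.0 + entropy / max_entropy)


# Hypothetical candidate items: two near-duplicates, two similar, one outlier.
clusters = [0, 0, 1, 1, 2]
peaked = softmax([4.0, 3.9, 0.5, 0.4, -1.0])
flat = softmax([1.0, 1.0, 1.0, 1.0, 1.0])
```

In this toy setup the peaked distribution concentrates on one cluster and keeps the temperature near `base`, while the flat one spreads across clusters and raises it, which matches the intuition of sampling more broadly when the model is uncertain at the semantic level.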
[IR-9] Blending Sequential Embeddings, Graphs, and Engineered Features: 4th Place Solution in RecSys Challenge 2025
Link: https://arxiv.org/abs/2508.06970
Authors: Sergei Makeev,Alexandr Andreev,Vladimir Baikalov,Vladislav Tytskiy,Aleksei Krasilnikov,Kirill Khrylchenko
Subjects: Information Retrieval (cs.IR)
Abstract:This paper describes the 4th-place solution by team ambitious for the RecSys Challenge 2025, organized by Synerise and ACM RecSys, which focused on universal behavioral modeling. The challenge objective was to generate user embeddings effective across six diverse downstream tasks. Our solution integrates (1) a sequential encoder to capture the temporal evolution of user interests, (2) a graph neural network to enhance generalization, (3) a deep cross network to model high-order feature interactions, and (4) performance-critical feature engineering.