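This post lists the latest papers fetched from arXiv.org on 2025-08-15. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and updated automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.

Contents

Overview (2025-08-15)

508 papers were updated today, including:

  • Natural Language Processing: 89 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 153 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 128 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 140 papers (Machine Learning (cs.LG))

Natural Language Processing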

[NLP-0] Searching for Privacy Risks in LLM Agents via Simulation

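Quick read: This paper addresses the privacy leakage that agents based on Large Language Models (LLMs) can cause in multi-turn interactions, in particular the threat from malicious agents that actively extract sensitive information through dynamic dialogue strategies. The key to the solution is a search-based framework that simulates interactions among three roles (data subject, data sender, and data recipient) and alternately optimizes attacker and defender instructions. The framework uses LLMs as optimizers, with multi-threaded parallel search and cross-thread propagation to explore the interaction space efficiently. It uncovers attack patterns ranging from simple direct requests to sophisticated multi-turn tactics such as impersonation and consent forgery, and drives defenses to evolve from rule-based constraints to identity-verification state machines. The discovered attacks and defenses transfer across scenarios and models, which gives them practical value.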

Link: https://arxiv.org/abs/2508.10880
Authors: Yanzhe Zhang, Diyi Yang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint

Abstract:The widespread deployment of LLM-based agents is likely to introduce a critical privacy threat: malicious agents that proactively engage others in multi-turn interactions to extract sensitive information. These dynamic dialogues enable adaptive attack strategies that can cause severe privacy violations, yet their evolving nature makes it difficult to anticipate and discover sophisticated vulnerabilities manually. To tackle this problem, we present a search-based framework that alternates between improving attacker and defender instructions by simulating privacy-critical agent interactions. Each simulation involves three roles: data subject, data sender, and data recipient. While the data subject’s behavior is fixed, the attacker (data recipient) attempts to extract sensitive information from the defender (data sender) through persistent and interactive exchanges. To explore this interaction space efficiently, our search algorithm employs LLMs as optimizers, using parallel search with multiple threads and cross-thread propagation to analyze simulation trajectories and iteratively propose new instructions. Through this process, we find that attack strategies escalate from simple direct requests to sophisticated multi-turn tactics such as impersonation and consent forgery, while defenses advance from rule-based constraints to identity-verification state machines. The discovered attacks and defenses transfer across diverse scenarios and backbone models, demonstrating strong practical utility for building privacy-aware agents.

[NLP-1] A Survey on Diffusion Language Models

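Quick read: This paper systematically surveys the current state and technical evolution of Diffusion Language Models (DLMs), addressing the lack of a comprehensive review, the unclear categorization of methods, and the absence of clear paths for practical optimization. The key is a complete taxonomy covering pre-training strategies, post-training methods, inference optimizations, and multimodal extensions, together with an in-depth analysis of DLM strengths in generation efficiency, context modeling, and control precision. It also identifies challenges such as long-sequence handling and compute cost, giving later research a clear direction.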

Link: https://arxiv.org/abs/2508.10875
Authors: Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen
Affiliations: VILA Lab, Mohamed bin Zayed University of Artificial Intelligence; Department of Automation, Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at this https URL.

[NLP-2] SSRL: Self-Search Reinforcement Learning

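Quick read: This paper addresses the dependence of agentic search tasks in Reinforcement Learning (RL) on costly external search engines, with the aim of improving training efficiency and scalability. The core solution is Self-Search RL (SSRL), which uses format-based and rule-based rewards to strengthen the internal self-search ability of Large Language Models (LLMs). Models can then iteratively refine how they use their knowledge without access to external tools, which yields an efficient, stable, low-hallucination simulated training environment. SSRL-trained models also integrate seamlessly with external search engines, substantially reducing reliance on real search APIs.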

Link: https://arxiv.org/abs/2508.10874
Authors: Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen, Yu Fu, Xinwei Long, Xuekai Zhu, Che Jiang, Yuchen Zhang, Li Kang, Gang Chen, Cheng Huang, Zhizhou He, Bingning Wang, Lei Bai, Ning Ding, Bowen Zhou
Affiliations: Tsinghua University; Shanghai Jiao Tong University; Shanghai AI Laboratory; University College London; CSCEC Third Bureau; WeChat AI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs’ Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.

[NLP-3] From Black Box to Transparency: Enhancing Automated Interpreting Assessment with Explainable AI in College Classrooms

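Quick read: This paper addresses three problems in Automated Interpreting Quality Assessment (AIQA): insufficient assessment of language use quality, weak modeling performance caused by data scarcity and imbalance, and a lack of explainability in model predictions. The key to the solution is a multi-dimensional modeling framework that combines feature engineering, data augmentation, and explainable machine learning. By using only construct-relevant, transparent features together with Shapley value (SHAP) analysis, it prioritizes explainable predictions, delivering diagnostic feedback alongside strong predictive performance and supporting learners' self-regulated learning.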

Link: https://arxiv.org/abs/2508.10860
Authors: Zhaokun Jiang, Ziyin Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in machine learning have spurred growing interest in automated interpreting quality assessment. Nevertheless, existing research suffers from insufficient examination of language use quality, unsatisfactory modeling effectiveness due to data scarcity and imbalance, and a lack of efforts to explain model predictions. To address these gaps, we propose a multi-dimensional modeling framework that integrates feature engineering, data augmentation, and explainable machine learning. This approach prioritizes explainability over "black box" predictions by utilizing only construct-relevant, transparent features and conducting Shapley Value (SHAP) analysis. Our results demonstrate strong predictive performance on a novel English-Chinese consecutive interpreting dataset, identifying BLEURT and CometKiwi scores to be the strongest predictive features for fidelity, pause-related features for fluency, and Chinese-specific phraseological diversity metrics for language use. Overall, by placing particular emphasis on explainability, we present a scalable, reliable, and transparent alternative to traditional human evaluation, facilitating the provision of detailed diagnostic feedback for learners and supporting self-regulated learning advantages not afforded by automated scores in isolation.

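[NLP-4] Psyche-R1: Towards Reliable Psychological LLMs through Unified Empathy, Expertise, and Reasoning

Quick read: Against the backdrop of a shortage of mental health professionals, this paper asks how to improve the ability of Large Language Models (LLMs) to produce reliable, rational, and empathetic responses in psychological service settings. Existing work focuses mostly on emotional support and empathetic dialogue and overlooks the role of reasoning in high-quality psychological intervention. The key to the solution is Psyche-R1, the first Chinese psychological LLM that combines empathy, psychological expertise, and reasoning. Its core contribution is a new data curation and synthesis pipeline: Chain-of-Thought (CoT) reasoning and iterative prompt-rationale optimization produce over 75k psychological questions with detailed rationales, complemented by 73k empathetic dialogues. Training is hybrid: a multi-LLM cross-selection strategy identifies challenging samples for Group Relative Policy Optimization (GRPO) to strengthen reasoning, while the remaining data is used for Supervised Fine-Tuning (SFT) to improve empathetic expression and domain knowledge. Experiments show that the 7B Psyche-R1 achieves results comparable to the 671B DeepSeek-R1 on several psychological benchmarks.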

Link: https://arxiv.org/abs/2508.10848
Authors: Chongyuan Dai, Jinpeng Hu, Hongchang Shi, Zhuo Li, Xun Yang, Meng Wang
Affiliations: University of Science and Technology of China; Tsinghua University; Peking University; Alibaba Group
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Amidst a shortage of qualified mental health professionals, the integration of large language models (LLMs) into psychological applications offers a promising way to alleviate the growing burden of mental health disorders. Recent reasoning-augmented LLMs have achieved remarkable performance in mathematics and programming, while research in the psychological domain has predominantly emphasized emotional support and empathetic dialogue, with limited attention to reasoning mechanisms that are beneficial to generating reliable responses. Therefore, in this paper, we propose Psyche-R1, the first Chinese psychological LLM that jointly integrates empathy, psychological expertise, and reasoning, built upon a novel data curation pipeline. Specifically, we design a comprehensive data synthesis pipeline that produces over 75k high-quality psychological questions paired with detailed rationales, generated through chain-of-thought (CoT) reasoning and iterative prompt-rationale optimization, along with 73k empathetic dialogues. Subsequently, we employ a hybrid training strategy wherein challenging samples are identified through a multi-LLM cross-selection strategy for group relative policy optimization (GRPO) to improve reasoning ability, while the remaining data is used for supervised fine-tuning (SFT) to enhance empathetic response generation and psychological domain knowledge. Extensive experiment results demonstrate the effectiveness of the Psyche-R1 across several psychological benchmarks, where our 7B Psyche-R1 achieves comparable results to 671B DeepSeek-R1.

[NLP-5] Reinforced Language Models for Sequential Decision Making

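Quick read: This paper addresses the limited performance of smaller Large Language Models (LLMs) on sequential decision-making tasks, and in particular the inability of existing post-training methods to handle credit assignment in multi-step agentic tasks. The key to the solution is a new algorithm, Multi-Step Group-Relative Policy Optimization (MS-GRPO), grounded in the formal Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks. MS-GRPO assigns credit by attributing the entire cumulative episode reward to every step of the episode, and is paired with a novel absolute-advantage-weighted episode sampling strategy that improves training performance. In experiments, a post-trained 3B-parameter model outperforms a 72B-parameter baseline by 50% on the Frozen Lake task.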

Link: https://arxiv.org/abs/2508.10839
Authors: Jim Dilkes, Vahid Yazdanpanah, Sebastian Stein
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) show potential as sequential decision-making agents, but their application is often limited due to a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. To address this, we introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents, grounded in formal Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks. For credit assignment, MS-GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement this algorithm with a novel absolute-advantage-weighted episode sampling strategy that we show improves training performance. We evaluate our approach by post-training a 3-billion parameter model on Snake and Frozen Lake. Our experiments demonstrate that the method is effective in improving decision-making performance: our post-trained 3B parameter model outperforms a 72B parameter baseline by 50% on the Frozen Lake task. This work demonstrates that targeted post-training is a practical and efficient alternative to relying on model scale for creating sequential decision-making agents using LLMs.
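
The credit-assignment rule described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the usual GRPO-style normalization (subtract the group mean, divide by the group standard deviation), and the function name `episode_advantages` and the sample rewards are hypothetical.

```python
from statistics import mean, pstdev


def episode_advantages(step_rewards_per_episode, eps=1e-8):
    """Return one advantage per step for every episode in a group.

    Assumption (not taken from the paper): advantages are the group's
    cumulative episode rewards, normalized by group mean and std.
    The same episode-level value is then copied to each step.
    """
    totals = [sum(rewards) for rewards in step_rewards_per_episode]
    mu, sigma = mean(totals), pstdev(totals)
    return [
        [(total - mu) / (sigma + eps)] * len(rewards)
        for total, rewards in zip(totals, step_rewards_per_episode)
    ]


# Hypothetical group of three episodes with different lengths.
group = [[0.0, 0.0, 1.0], [0.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
advs = episode_advantages(group)
```

Every step in an episode receives the same value, so an episode with an above-average total pushes up all of its actions, including early ones that earned no immediate reward.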

[NLP-6] Memory-Augmented Transformers: A Systematic Review from Neuroscience Principles to Technical Solutions

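Quick read: This paper addresses key limitations of the Transformer architecture in long-range context retention, continual learning, and knowledge integration, which constrain its performance on complex tasks. The core of the solution is a unified framework that connects neuroscience principles (multi-timescale memory, selective attention, and consolidation) with engineering advances in Memory-Augmented Transformers. It systematically classifies existing work along three dimensions: functional objectives (such as context extension, reasoning, and knowledge integration), memory representations (parameter-encoded, state-based, explicit, and hybrid), and integration mechanisms (attention fusion, gated control, and associative retrieval). It also identifies a shift from static caches toward adaptive, test-time learning systems, pointing toward cognitively inspired, lifelong-learning Transformer models.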

Link: https://arxiv.org/abs/2508.10824
Authors: Parsa Omidi, Xingshuai Huang, Axel Laborieux, Bahareh Nikpour, Tianyu Shi, Armaghan Eshaghi
Affiliations: Huawei Technologies
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Memory is fundamental to intelligence, enabling learning, reasoning, and adaptability across biological and artificial systems. While Transformer architectures excel at sequence modeling, they face critical limitations in long-range context retention, continual learning, and knowledge integration. This review presents a unified framework bridging neuroscience principles, including dynamic multi-timescale memory, selective attention, and consolidation, with engineering advances in Memory-Augmented Transformers. We organize recent progress through three taxonomic dimensions: functional objectives (context extension, reasoning, knowledge integration, adaptation), memory representations (parameter-encoded, state-based, explicit, hybrid), and integration mechanisms (attention fusion, gated control, associative retrieval). Our analysis of core memory operations (reading, writing, forgetting, and capacity management) reveals a shift from static caches toward adaptive, test-time learning systems. We identify persistent challenges in scalability and interference, alongside emerging solutions including hierarchical buffering and surprise-gated updates. This synthesis provides a roadmap toward cognitively-inspired, lifelong-learning Transformer architectures.

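[NLP-7] Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback

Quick read: This paper addresses novelty assessment, a central but understudied part of peer review, particularly in Natural Language Processing (NLP), where surging submission volumes strain reviewer capacity. The key to the solution is a structured approach to automated novelty evaluation that models expert reviewer behavior in three stages: extracting key content from the submission, retrieving and synthesizing related work, and making a structured comparison to reach an evidence-based judgment. The approach is informed by an analysis of a large number of human-written novelty reviews and captures key patterns such as independent claim verification and contextual reasoning. On 182 ICLR 2025 submissions it reaches 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions, substantially outperforming existing baselines based on Large Language Models (LLMs), while producing detailed, literature-aware analyses that improve review consistency and transparency.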

Link: https://arxiv.org/abs/2508.10795
Authors: Osama Mohammed Afzal, Preslav Nakov, Tom Hope, Iryna Gurevych
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Novelty assessment is a central yet understudied aspect of peer review, particularly in high volume fields like NLP where reviewer capacity is increasingly strained. We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages: content extraction from submissions, retrieval and synthesis of related work, and structured comparison for evidence based assessment. Our method is informed by a large scale analysis of human written novelty reviews and captures key patterns such as independent claim verification and contextual reasoning. Evaluated on 182 ICLR 2025 submissions with human annotated reviewer novelty assessments, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions - substantially outperforming existing LLM based baselines. The method produces detailed, literature aware analyses and improves consistency over ad hoc reviewer judgments. These results highlight the potential for structured LLM assisted approaches to support more rigorous and transparent peer review without displacing human expertise. Data and code are made available.

[NLP-8] Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

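Quick read: This paper addresses the difficulty of balancing exploration and exploitation in Reinforcement Learning with Verifiable Rewards (RLVR), which causes policies to prefer conservative actions and converge to a local optimum. The key to the solution is to use Pass@k as the reward signal for policy training (Pass@k Training) and to derive an analytical solution for its advantage, which amounts to designing the advantage function directly. The analysis shows that exploration and exploitation are not inherently in conflict and can reinforce each other, which suggests a new direction for advantage design in RLVR, with promising preliminary results.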

Link: https://arxiv.org/abs/2508.10751
Authors: Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, Guang Shi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Technical Report about RLVR: 32 pages, 18 figures, 7 tables

Abstract:Reinforcement learning with verifiable rewards (RLVR), which typically adopts Pass@1 as the reward, has faced issues in balancing exploration and exploitation, causing policies to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial. Regarding prior work, although Pass@k has been used in evaluation, its connection to LLM exploration ability in RLVR remains largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., Pass@k Training), and observe the improvement on its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k Training, leading to an efficient and effective process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives, while they can mutually enhance each other. Moreover, Pass@k Training with analytical derivation essentially involves directly designing the advantage function. Inspired by this, we preliminarily explore the advantage design for RLVR, showing promising results and highlighting a potential future direction.
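
For readers unfamiliar with the metric, Pass@k is commonly computed from n sampled responses, c of which are correct, using the unbiased estimator 1 - C(n-c, k) / C(n, k). The sketch below shows only that widely used metric, under the assumption that this is the estimator meant here; it does not reproduce the paper's analytical advantage derivation, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k responses drawn without
    replacement from n samples (c of them correct) is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k draw contains a correct response.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the plain accuracy c / n, which is the Pass@1 reward the abstract describes as the typical RLVR choice.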

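[NLP-9] Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs

Quick read: This paper addresses the limited bidirectional information flow that Large Language Models (LLMs) suffer from because of their prefix-only prompting paradigm and sequential generation. The authors propose ICE (In-Place Chain-of-Thought Prompting with Early Exit), a framework with two core ideas. First, during the iterative refinement of Diffusion Large Language Models (dLLMs), prompts are embedded directly in masked token positions, enabling a more flexible in-place prompting strategy. Second, a confidence-aware early exit mechanism reduces computational overhead. Experiments show up to a 17.29% accuracy improvement with a 4.12× speedup on GSM8K, and up to a 276.67× acceleration on MMLU, while maintaining competitive performance.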

Link: https://arxiv.org/abs/2508.10736
Authors: Xiangqi Jin, Yuxuan Wang, Yifeng Gao, Zichen Wen, Biqing Qi, Dongrui Liu, Linfeng Zhang
Affiliations: Shanghai Jiao Tong University; Institute of Automation, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Although large language models (LLMs) have achieved remarkable success, their prefix-only prompting paradigm and sequential generation process offer limited flexibility for bidirectional information. Diffusion large language models (dLLMs) present new opportunities through their bidirectional attention mechanisms and iterative refinement processes, enabling more flexible in-place prompting strategies. We introduce ICE (In-Place Chain-of-Thought Prompting with Early Exit), a novel framework that transforms prefix-only prompting into in-place prompting specifically designed for dLLMs. ICE integrates in-place prompts directly within masked token positions during iterative refinement and employs a confidence-aware early exit mechanism to significantly reduce computational overhead. Extensive experiments demonstrate ICE's effectiveness, achieving up to 17.29% accuracy improvement with 4.12× speedup on GSM8K, and up to 276.67× acceleration on MMLU while maintaining competitive performance.

[NLP-10] Learning from Natural Language Feedback for Personalized Question Answering

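Quick read: This paper addresses the low learning efficiency and limited personalization quality that result when Large Language Models (LLMs) are trained for personalized question answering with reinforcement learning on scalar reward signals. Existing methods typically combine Retrieval-Augmented Generation (RAG) with scalar-reward reinforcement learning to teach the model to use the user's personal context, but such rewards often give weak, non-instructive feedback that limits how well the model learns personalization strategies. The paper proposes the VAC framework, whose core idea is to replace scalar rewards with Natural Language Feedback (NLF). The NLF is generated by a feedback model conditioned on the user profile and the question narrative, and serves as a richer, actionable supervision signal that lets the policy model iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and the policy model, and the resulting policy model produces high-quality personalized answers without needing feedback at inference time.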

Link: https://arxiv.org/abs/2508.10695
Authors: Alireza Salemi, Hamed Zamani
Affiliations: University of Massachusetts Amherst
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Personalization is crucial for enhancing both the effectiveness and user satisfaction of language technologies, particularly in information-seeking tasks like question answering. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG), followed by reinforcement learning with scalar reward signals to teach models how to use retrieved personal context. We believe that these scalar rewards sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) that are generated conditioned on the user profiles and the question narratives. NLF serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, resulting in a policy model that no longer requires feedback at inference. Evaluation on the LaMP-QA benchmark that consists of three diverse domains demonstrates consistent and significant improvements over the state-of-the-art results. Human evaluations further confirm the superior quality of the generated responses. These results demonstrate that NLF provides more effective signals for optimizing personalized question answering.

[NLP-11] Continuous Bangla Sign Language Translation: Mitigating the Expense of Gloss Annotation with the Assistance of Graph

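Quick read: This paper addresses the limited accuracy and scalability of sign language translation that depends on gloss annotations, with the goal of improving communication accessibility for the deaf and hard of hearing. The key to the solution is a new approach that fuses Graph Neural Network (GNN) methods with the Transformer architecture: a spatio-temporal graph convolutional network (STGCN) combined with a long short-term memory network (LSTM) strengthens the modeling of temporal and spatial structure in the video sequence, enabling gloss-free end-to-end translation. The architecture sets a new state of the art on several sign language datasets, namely RWTH-PHOENIX-2014T, CSL-Daily, How2Sign, and BornilDB v1.0 (benchmarked here for the first time), improving BLEU-4 by 0.5 to 4.01 over existing methods such as GASLT and slt_how2sign.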

Link: https://arxiv.org/abs/2508.10687
Authors: Safaeid Hossain Arib, Rabeya Akter, Sejuti Rahman
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Millions of individuals worldwide are affected by deafness and hearing impairment. Sign language serves as a sophisticated means of communication for the deaf and hard of hearing. However, in societies that prioritize spoken languages, sign language often faces underestimation, leading to communication barriers and social exclusion. The Continuous Bangla Sign Language Translation project aims to address this gap by enhancing translation methods. While recent approaches leverage transformer architecture for state-of-the-art results, our method integrates graph-based methods with the transformer architecture. This fusion, combining transformer and STGCN-LSTM architectures, proves more effective in gloss-free translation. Our contributions include architectural fusion, exploring various fusion strategies, and achieving a new state-of-the-art performance on diverse sign language datasets, namely RWTH-PHOENIX-2014T, CSL-Daily, How2Sign, and BornilDB v1.0. Our approach demonstrates superior performance compared to current translation outcomes across all datasets, showcasing notable improvements of BLEU-4 scores of 4.01, 2.07, and 0.5, surpassing those of GASLT, GASLT and slt_how2sign in RWTH-PHOENIX-2014T, CSL-Daily, and How2Sign, respectively. Also, we introduce benchmarking on the BornilDB v1.0 dataset for the first time. Our method sets a benchmark for future research, emphasizing the importance of gloss-free translation to improve communication accessibility for the deaf and hard of hearing.

[NLP-12] Neural Machine Translation for Coptic-French: Strategies for Low-Resource Ancient Languages

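Quick read: This paper addresses the systematic translation of Coptic into French, a gap in research on translating historical languages. The key to the solution is a comprehensive translation pipeline that compares pivot translation (through an intermediate language) with direct translation, measures the effect of pre-training and multi-version fine-tuning on model performance, and tests robustness to noise. Experiments on aligned biblical corpora show that fine-tuning on a stylistically varied, noise-aware training corpus significantly improves translation quality, providing practical guidance for building translation tools for historical languages.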

Link: https://arxiv.org/abs/2508.10683
Authors: Nasma Chaoui, Richard Khoury
Affiliations: Université Laval
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper presents the first systematic study of strategies for translating Coptic into French. Our comprehensive pipeline systematically evaluates: pivot versus direct translation, the impact of pre-training, the benefits of multi-version fine-tuning, and model robustness to noise. Utilizing aligned biblical corpora, we demonstrate that fine-tuning with a stylistically-varied and noise-aware training corpus significantly enhances translation quality. Our findings provide crucial practical insights for developing translation tools for historical languages in general.

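[NLP-13] eDIF: A European Deep Inference Fabric for Remote Interpretability of LLM

Quick read: This paper addresses the lack of widely accessible infrastructure in Europe for interpretability research on Large Language Models (LLMs), with the aim of democratizing mechanistic interpretability research. The key to the solution is the European Deep Inference Fabric (eDIF), an NDIF-compatible infrastructure built around a GPU cluster hosted at Ansbach University of Applied Sciences and interconnected with partner institutions, which supports remote model analysis through the NNsight API. The platform lets researchers run interventions such as activation patching, causal tracing, and representation analysis on models including GPT-2 and DeepSeek-R1-70B. A pilot study gives initial evidence of technical feasibility, usability, and scientific value, and lays the groundwork for expanded tooling, better performance, and a sustained user community.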

Link: https://arxiv.org/abs/2508.10553
Authors: Irma Heithoff, Marc Guggenberger, Sandra Kalogiannis, Susanne Mayer, Fabian Maag, Sigurd Schacht, Carsten Lanquillon
Affiliations: Hochschule Ansbach; Coairesearch
Categories: Computation and Language (cs.CL)
Comments: 9 pages

Abstract:This paper presents a feasibility study on the deployment of a European Deep Inference Fabric (eDIF), an NDIF-compatible infrastructure designed to support mechanistic interpretability research on large language models. The need for widespread accessibility of LLM interpretability infrastructure in Europe drives this initiative to democratize advanced model analysis capabilities for the research community. The project introduces a GPU-based cluster hosted at Ansbach University of Applied Sciences and interconnected with partner institutions, enabling remote model inspection via the NNsight API. A structured pilot study involving 16 researchers from across Europe evaluated the platform’s technical performance, usability, and scientific utility. Users conducted interventions such as activation patching, causal tracing, and representation analysis on models including GPT-2 and DeepSeek-R1-70B. The study revealed a gradual increase in user engagement, stable platform performance throughout, and a positive reception of the remote experimentation capabilities. It also marked the starting point for building a user community around the platform. Identified limitations such as prolonged download durations for activation data as well as intermittent execution interruptions are addressed in the roadmap for future development. This initiative marks a significant step towards widespread accessibility of LLM interpretability infrastructure in Europe and lays the groundwork for broader deployment, expanded tooling, and sustained community collaboration in mechanistic interpretability research.

[NLP-14] When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models

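Quick read: This paper addresses the "text dominance" problem in Multimodal Large Language Models (MLLMs): during inference the models rely heavily on text and underuse other modalities such as images, video, audio, time series, and graphs. To measure the effect systematically, the authors propose two metrics, the Modality Dominance Index (MDI) and the Attention Efficiency Index (AEI). Their analysis finds text dominance across all tested modalities and traces it to three causes: attention dilution from severe token redundancy in non-text modalities, the design of the fusion architecture, and task formulations that implicitly favor text inputs. The key to the solution is a simple token compression method that rebalances the model's attention. Applied to LLaVA-7B, for example, it lowers the MDI from 10.23 to 0.86, giving a more balanced use of modalities.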

Link: https://arxiv.org/abs/2508.10552
Authors: Huyu Wu, Meng Tang, Xinhan Zheng, Haiyun Jiang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a diverse range of multimodal tasks. However, these models suffer from a core problem known as text dominance: they depend heavily on text for their inference, while underutilizing other modalities. Prior work has acknowledged this phenomenon in vision-language tasks, often attributing it to data biases or model architectures. In this paper, we conduct the first systematic investigation of text dominance across diverse data modalities, including images, videos, audio, time-series, and graphs. To measure this imbalance, we propose two evaluation metrics: the Modality Dominance Index (MDI) and the Attention Efficiency Index (AEI). Our comprehensive analysis reveals that text dominance is both significant and pervasive across all tested modalities. Our in-depth analysis identifies three underlying causes: attention dilution from severe token redundancy in non-textual modalities, the influence of fusion architecture design, and task formulations that implicitly favor textual inputs. Furthermore, we propose a simple token compression method that effectively rebalances model attention. Applying this method to LLaVA-7B, for instance, drastically reduces its MDI from 10.23 to a well-balanced value of 0.86. Our analysis and methodological framework offer a foundation for the development of more equitable and comprehensive multimodal language models.

[NLP-15] Stabilizing Long-term Multi-turn Reinforcement Learning with Gated Rewards

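Quick read: This paper addresses reward sparsity in long-horizon Reinforcement Learning (RL), especially in Software Engineering (SWE) tasks that involve multi-turn reasoning and rule-based verification. Outcome-based reward shaping struggles to define meaningful, unbiased immediate rewards, while verification-based reward shaping is prone to reward hacking and policy degradation when immediate rewards are misaligned with long-term objectives. The key to the solution is Gated Reward Accumulation (G-RA), a method that accumulates immediate rewards only when the high-level (long-term) reward reaches a predefined threshold, which keeps RL optimization stable and avoids the performance loss caused by reward misalignment. Experiments show that G-RA substantially raises completion rates and modification rates on SWE-bench Verified and kBench while keeping the policy stable.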

Link: https://arxiv.org/abs/2508.10548
Authors: Zetian Sun, Dongfang Li, Zhuoen Chen, Yuhuai Qin, Baotian Hu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reward sparsity in long-horizon reinforcement learning (RL) tasks remains a significant challenge, while existing outcome-based reward shaping struggles to define meaningful immediate rewards without introducing bias or requiring explicit task decomposition. Alternatively, verification-based reward shaping uses stepwise critics, but misalignment between immediate rewards and long-term objectives can lead to reward hacking and suboptimal policies. In this work, we address this problem in the context of software engineering (SWE) tasks, where multi-turn reasoning and rule-based verification are critical. We introduce the SWE-oriented RL Framework, a unified system supporting multi-turn interaction, docker-based execution, and customizable reward functions. Additionally, we propose Gated Reward Accumulation (G-RA), a novel method that accumulates immediate rewards only when high-level (long-term) rewards meet a predefined threshold, ensuring stable RL optimization. Experiments on SWE-bench Verified and kBench demonstrate that G-RA leads to an increase in completion rates (47.6% → 93.8% and 22.0% → 86.0%) and modification rates (19.6% → 23.8% and 12.0% → 42.0%), while avoiding policy degradation caused by reward misalignment. Our findings highlight the importance of balanced reward accumulation in long-horizon RL and provide a practical solution.
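
The gating idea in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's formulation: it assumes a single scalar high-level reward per trajectory, assumes that the total is the high-level reward plus the gated sum of immediate rewards, and uses a hypothetical function name `gated_return`.

```python
def gated_return(immediate_rewards, high_level_reward, threshold):
    """Total reward for one trajectory under a simple gate.

    Assumption: immediate (stepwise) rewards count only when the
    high-level reward reaches the threshold; otherwise they are
    dropped and only the high-level reward remains.
    """
    if high_level_reward >= threshold:
        return high_level_reward + sum(immediate_rewards)
    return high_level_reward


# Hypothetical values: the same stepwise rewards, with and without
# the long-term objective being met.
passed = gated_return([0.1, 0.2], high_level_reward=1.0, threshold=0.5)
failed = gated_return([0.1, 0.2], high_level_reward=0.0, threshold=0.5)
```

Under this gate, a trajectory cannot collect stepwise rewards while failing the long-term objective, which is the misalignment the abstract associates with reward hacking.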

[NLP-16] Improving Value-based Process Verifier via Low-Cost Variance Reduction

Quick read: This paper addresses a performance bottleneck of large language models (LLMs) on complex reasoning tasks such as mathematical problem solving: estimation error in the training annotations of value-based process verifiers. Because LLM inference is expensive, only a limited number of Monte Carlo (MC) samples is feasible, and the resulting error stems mainly from high variance rather than bias, which hurts verifier accuracy. The key to the solution is Compound Monte Carlo Sampling (ComMCS), which builds an unbiased estimator by linearly combining the MC estimators of the current and subsequent steps, reducing estimation variance without additional LLM inference cost and thereby improving the verifier's stability and effectiveness.

Link: https://arxiv.org/abs/2508.10539
Authors: Zetian Sun, Dongfang Li, Baotian Hu, Min Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: Large language models (LLMs) have achieved remarkable success in a wide range of tasks. However, their reasoning capabilities, particularly in complex domains like mathematics, remain a significant challenge. Value-based process verifiers, which estimate the probability of a partial reasoning chain leading to a correct solution, are a promising approach for improving reasoning. Nevertheless, their effectiveness is often hindered by estimation error in their training annotations, a consequence of the limited number of Monte Carlo (MC) samples feasible due to the high cost of LLM inference. In this paper, we identify that the estimation error primarily arises from high variance rather than bias, and the MC estimator is a Minimum Variance Unbiased Estimator (MVUE). To address the problem, we propose the Compound Monte Carlo Sampling (ComMCS) method, which constructs an unbiased estimator by linearly combining the MC estimators from the current and subsequent steps. Theoretically, we show that our method leads to a predictable reduction in variance, while maintaining an unbiased estimation without additional LLM inference cost. We also perform empirical experiments on the MATH-500 and GSM8K benchmarks to demonstrate the effectiveness of our method. Notably, ComMCS outperforms regression-based optimization method by 2.8 points, the non-variance-reduced baseline by 2.2 points on MATH-500 on Best-of-32 sampling experiment.
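
Why a linear combination can stay unbiased while lowering variance is a general statistical fact, sketched below. This is not the paper's estimator: the function names and the single weight `alpha` are hypothetical, and the sketch assumes both estimates have the same expectation.

```python
def combine_estimates(current: float, subsequent: float, alpha: float) -> float:
    """Convex combination of two estimates of the same quantity.

    If both inputs are unbiased for the same value, the result is unbiased
    for any alpha in [0, 1], because the weights sum to one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * current + alpha * subsequent


def combined_variance(var_current: float, var_subsequent: float,
                      covariance: float, alpha: float) -> float:
    """Variance of the combination, from Var(aX + bY) with a = 1 - alpha, b = alpha."""
    return ((1.0 - alpha) ** 2 * var_current
            + alpha ** 2 * var_subsequent
            + 2.0 * alpha * (1.0 - alpha) * covariance)
```

With equal variances of 1.0, zero covariance, and `alpha = 0.5`, `combined_variance` gives 0.5, half the variance of either estimate alone. How ComMCS actually chooses its weights is not stated in the abstract.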

[NLP-17] Diversity First Quality Later: A Two-Stage Assumption for Language Model Alignment

Quick read: This paper examines a systematic difference in effectiveness between static preference data and on-policy preference data in language model (LM) alignment. Direct Preference Optimization (DPO) originally optimizes the policy from static preference data, and later variants add on-policy preference candidates generated during the training loop. The authors show that on-policy data is not always better: its effectiveness is about 3× that of static data on Llama-3 but only 0.4× on Zephyr. To explain this, they propose the alignment stage assumption, which splits alignment into a preference injection stage that benefits from diverse data and a preference fine-tuning stage that favors high-quality data. The key to the solution is a theoretical and empirical characterization of the two stages together with an algorithm that identifies the boundary between them; experiments on five models and two alignment methods (DPO and SLiC-HF) support the generality of the assumption and of the boundary measurement.

Link: https://arxiv.org/abs/2508.10530
Authors: Zetian Sun, Dongfang Li, Baotian Hu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: The alignment of language models (LMs) with human preferences is critical for building reliable AI systems. The problem is typically framed as optimizing an LM policy to maximize the expected reward that reflects human preferences. Recently, Direct Preference Optimization (DPO) was proposed as a LM alignment method that directly optimizes the policy from static preference data, and further improved by incorporating on-policy sampling (i.e., preference candidates generated during the training loop) for better LM alignment. However, we show on-policy data is not always optimal, with systematic effectiveness difference emerging between static and on-policy preference candidates. For example, on-policy data can result in a 3× effectiveness compared with static data for Llama-3, and a 0.4× effectiveness for Zephyr. To explain the phenomenon, we propose the alignment stage assumption, which divides the alignment process into two distinct stages: the preference injection stage, which benefits from diverse data, and the preference fine-tuning stage, which favors high-quality data. Through theoretical and empirical analysis, we characterize these stages and propose an effective algorithm to identify the boundaries between them. We perform experiments on 5 models (Llama, Zephyr, Phi-2, Qwen, Pythia) and 2 alignment methods (DPO, SLiC-HF) to show the generalizability of alignment stage assumption and boundary measurement.

[NLP-18] Reverse Physician-AI Relationship: Full-process Clinical Diagnosis Driven by a Large Language Model

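Quick read: This paper addresses the limited role of artificial intelligence (AI) in clinical diagnosis. AI usually acts as a physician's assistant that answers specific questions at particular points of the workflow, and it cannot drive the full diagnostic process starting from an ambiguous chief complaint, which limits how much it can reduce physician workload and improve diagnostic efficiency. The key to the solution is a paradigm shift that repositions AI as the primary director of the diagnostic process, with physicians as its assistants. To this end the authors develop DxDirector-7B, a large language model with deep reasoning capabilities that carries the reasoning from initial symptoms to final diagnosis, and they establish an explicit division of responsibility between AI and physicians for misdiagnoses. In rare, complex, and real-world cases the model shows higher diagnostic accuracy with less physician involvement, and expert evaluations point to its potential as a substitute for medical specialists.

Link: https://arxiv.org/abs/2508.10492
Authors: Shicheng Xu, Xin Huang, Zihao Wei, Liang Pang, Huawei Shen, Xueqi Cheng
Affiliations: State Key Laboratory of AI Safety; Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Peking University Third Hospital
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments: 39 pages
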
Abstract:Full-process clinical diagnosis in the real world encompasses the entire diagnostic workflow that begins with only an ambiguous chief complaint. While artificial intelligence (AI), particularly large language models (LLMs), is transforming clinical diagnosis, its role remains largely as an assistant to physicians. This AI-assisted working pattern makes AI can only answer specific medical questions at certain parts within the diagnostic process, but lack the ability to drive the entire diagnostic process starting from an ambiguous complaint, which still relies heavily on human physicians. This gap limits AI’s ability to fully reduce physicians’ workload and enhance diagnostic efficiency. To address this, we propose a paradigm shift that reverses the relationship between physicians and AI: repositioning AI as the primary director, with physicians serving as its assistants. So we present DxDirector-7B, an LLM endowed with advanced deep thinking capabilities, enabling it to drive the full-process diagnosis with minimal physician involvement. Furthermore, DxDirector-7B establishes a robust accountability framework for misdiagnoses, delineating responsibility between AI and human physicians. In evaluations across rare, complex, and real-world cases under full-process diagnosis setting, DxDirector-7B not only achieves significant superior diagnostic accuracy but also substantially reduces physician workload than state-of-the-art medical LLMs as well as general-purpose LLMs. Fine-grained analyses across multiple clinical departments and tasks validate its efficacy, with expert evaluations indicating its potential to serve as a viable substitute for medical specialists. These findings mark a new era where AI, traditionally a physicians’ assistant, now drives the entire diagnostic process to drastically reduce physicians’ workload, indicating an efficient and accurate diagnostic solution.

[NLP-19] When Explainability Meets Privacy: An Investigation at the Intersection of Post-hoc Explainability and Differential Privacy in the Context of Natural Language Processing

Quick read: This paper addresses the unclear relationship between explainability and privacy in trustworthy natural language processing (NLP), in particular whether the two trade off against each other or can coexist. The key to the solution is an empirical study built on two widely used families of methods, Differential Privacy (DP) and post-hoc explainability, that examines how the downstream task, the text privatization strategy, and the explanation method shape the interaction between privacy and explainability. The results indicate that the two are not necessarily at odds, and the authors distill practical recommendations for future work at this intersection.

Link: https://arxiv.org/abs/2508.10482
Authors: Mahdi Dhaini, Stephen Meisenbacher, Ege Erdogan, Florian Matthes, Gjergji Kasneci
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to AAAI/ACM Conference on AI, Ethics, and Society (AIES 2025)

Abstract: In the study of trustworthy Natural Language Processing (NLP), a number of important research fields have emerged, including that of explainability and privacy. While research interest in both explainable and privacy-preserving NLP has increased considerably in recent years, there remains a lack of investigation at the intersection of the two. This leaves a considerable gap in understanding of whether achieving both explainability and privacy is possible, or whether the two are at odds with each other. In this work, we conduct an empirical investigation into the privacy-explainability trade-off in the context of NLP, guided by the popular overarching methods of Differential Privacy (DP) and Post-hoc Explainability. Our findings include a view into the intricate relationship between privacy and explainability, which is formed by a number of factors, including the nature of the downstream task and choice of the text privatization and explainability method. In this, we highlight the potential for privacy and explainability to co-exist, and we summarize our findings in a collection of practical recommendations for future work at this important intersection.

[NLP-20] DiFaR: Enhancing Multimodal Misinformation Detection with Diverse Factual and Relevant Rationales

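Quick read: This paper addresses three core challenges that arise when textual rationales generated by large vision-language models (LVLMs) are used to support trainable multimodal misinformation detectors: insufficient diversity in the generated rationales, factual errors caused by hallucination, and noise from irrelevant or conflicting content. The key to the solution is DiFaR, a detector-agnostic framework that uses five chain-of-thought prompts to elicit varied reasoning traces from LVLMs and adds a lightweight post-hoc filtering module that selects rationale sentences by sentence-level factuality and relevance scores, improving both rationale quality and detection performance.

Link: https://arxiv.org/abs/2508.10444
Authors: Herun Wan, Jiaying Wu, Minnan Luo, Xiangzheng Kong, Zihan Ma, Zhi Zeng
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
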
Abstract:Generating textual rationales from large vision-language models (LVLMs) to support trainable multimodal misinformation detectors has emerged as a promising paradigm. However, its effectiveness is fundamentally limited by three core challenges: (i) insufficient diversity in generated rationales, (ii) factual inaccuracies due to hallucinations, and (iii) irrelevant or conflicting content that introduces noise. We introduce DiFaR, a detector-agnostic framework that produces diverse, factual, and relevant rationales to enhance misinformation detection. DiFaR employs five chain-of-thought prompts to elicit varied reasoning traces from LVLMs and incorporates a lightweight post-hoc filtering module to select rationale sentences based on sentence-level factuality and relevance scores. Extensive experiments on four popular benchmarks demonstrate that DiFaR outperforms four baseline categories by up to 5.9% and boosts existing detectors by as much as 8.7%. Both automatic metrics and human evaluations confirm that DiFaR significantly improves rationale quality across all three dimensions.
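
The post-hoc filtering step can be pictured with a short sketch. The `ScoredSentence` type, the threshold defaults, and the rule that both scores must pass are assumptions for illustration; the abstract does not specify how the two scores are computed or combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredSentence:
    text: str
    factuality: float  # hypothetical score in [0, 1] from some fact checker
    relevance: float   # hypothetical score in [0, 1] from some relevance model


def filter_rationale(sentences: List[ScoredSentence],
                     min_factuality: float = 0.5,
                     min_relevance: float = 0.5) -> List[str]:
    """Keep only sentences that pass both sentence-level thresholds."""
    return [s.text for s in sentences
            if s.factuality >= min_factuality and s.relevance >= min_relevance]
```

Filtering at sentence level, rather than dropping whole rationales, lets a detector keep the usable parts of an otherwise noisy reasoning trace.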

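[NLP-21] Computational Economics in Large Language Models: Exploring Model Behavior and Incentive Design under Resource Constraints

Quick read: This paper addresses the high computational cost of deploying large language models (LLMs), in particular how to achieve efficient and interpretable inference under resource constraints. The core of the solution is a "computational economics" framework that treats an LLM as an internal economy of resource-constrained agents (attention heads and neuron blocks). An incentive-driven training paradigm adds a differentiable computation cost term to the task loss, which encourages sparse activations and better allocation of computation while preserving task performance. On GLUE and WikiText-103 the resulting family of models traces a Pareto frontier and, compared with post-hoc pruning, reduces FLOPS and latency while yielding more interpretable attention patterns.

Link: https://arxiv.org/abs/2508.10426
Authors: Sandeep Reddy, Kabir Khan, Rohit Patil, Ananya Chakraborty, Faizan A. Khan, Swati Kulkarni, Arjun Verma, Neha Singh
Affiliations: San Francisco State University
Categories: Computation and Language (cs.CL)
Comments: Preprint; 7 figures, 4 tables, 1 algorithm. Experiments on GLUE (MNLI, STS-B, CoLA) and WikiText-103 with BERT-base; evaluation includes FLOPS, latency, Gini and entropy metrics
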
Abstract:Large language models (LLMs) are limited by substantial computational cost. We introduce a “computational economics” framework that treats an LLM as an internal economy of resource-constrained agents (attention heads and neuron blocks) that must allocate scarce computation to maximize task utility. First, we show empirically that when computation is scarce, standard LLMs reallocate attention toward high-value tokens while preserving accuracy. Building on this observation, we propose an incentive-driven training paradigm that augments the task loss with a differentiable computation cost term, encouraging sparse and efficient activations. On GLUE (MNLI, STS-B, CoLA) and WikiText-103, the method yields a family of models that trace a Pareto frontier and consistently dominate post-hoc pruning; for a similar accuracy we obtain roughly a forty percent reduction in FLOPS and lower latency, together with more interpretable attention patterns. These results indicate that economic principles offer a principled route to designing efficient, adaptive, and more transparent LLMs under strict resource constraints.
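
A cost-augmented objective of the kind described above can be sketched in a few lines. The L1-style penalty on per-head gate values and the names `incentive_loss`, `gate_activations`, and `cost_weight` are assumptions for illustration; the abstract does not give the form of the cost term.

```python
from typing import Sequence


def incentive_loss(task_loss: float,
                   gate_activations: Sequence[float],
                   cost_weight: float) -> float:
    """Task loss plus a computation cost term (hypothetical L1 penalty)."""
    # Assumption: each gate value scales how much compute one head or
    # neuron block uses, so the sum of magnitudes is a proxy for cost.
    compute_cost = sum(abs(g) for g in gate_activations)
    return task_loss + cost_weight * compute_cost
```

Sweeping `cost_weight` from zero upward is one way to obtain a family of models that trade accuracy against compute; written with an autodiff framework, the same expression would be differentiable almost everywhere in the gate values.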

[NLP-22] Evaluating LLMs on Chinese Idiom Translation

Quick read: This paper addresses systematic errors in Chinese idiom translation. Idioms are figurative and often carry historical and cultural references, so current machine translation systems, including those based on large language models (LLMs), frequently mistranslate or omit them. The authors propose IdiomEval, an evaluation framework with a comprehensive error taxonomy, and annotate 900 translation pairs from nine modern translation systems across four domains: web, news, Wikipedia, and social media. They find that existing evaluation metrics track human judgments poorly (Pearson correlation below 0.48) and develop improved models that detect idiom translation errors with an F1 score of 0.68, offering a quantifiable basis for evaluating and improving idiom translation.

Link: https://arxiv.org/abs/2508.10421
Authors: Cai Yang, Yao Dou, David Heineman, Xiaofeng Wu, Wei Xu
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: Accepted at COLM 2025

Abstract: Idioms, whose figurative meanings usually differ from their literal interpretations, are common in everyday language, especially in Chinese, where they often contain historical references and follow specific structural patterns. Despite recent progress in machine translation with large language models, little is known about Chinese idiom translation. In this work, we introduce IdiomEval, a framework with a comprehensive error taxonomy for Chinese idiom translation. We annotate 900 translation pairs from nine modern systems, including GPT-4o and Google Translate, across four domains: web, news, Wikipedia, and social media. We find these systems fail at idiom translation, producing incorrect, literal, partial, or even missing translations. The best-performing system, GPT-4, makes errors in 28% of cases. We also find that existing evaluation metrics measure idiom quality poorly with Pearson correlation below 0.48 with human ratings. We thus develop improved models that achieve F1 scores of 0.68 for detecting idiom translation errors.

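[NLP-23] ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning

Quick read: This paper addresses the difficulty of understanding long narrative texts such as novels, where plots are intricate and relations among characters evolve, and where large language models (LLMs) suffer from weaker reasoning and high computational cost over long contexts. Traditional retrieval-augmented generation (RAG) methods rely on a stateless, single-step retrieval process and struggle to capture interconnected relations across long-range context. The key to the solution is ComoRAG, which mimics how human cognition reasons with memory-related signals and models narrative reasoning as iterative reasoning cycles: in each cycle the model generates probing queries to explore new paths and integrates the newly retrieved evidence into a global memory pool, gradually building a coherent context. This is particularly helpful for complex queries that require global comprehension.

Link: https://arxiv.org/abs/2508.10419
Authors: Juyuan Wang, Rongchen Zhao, Wei Wei, Yufeng Wang, Mo Yu, Jie Zhou, Jin Xu, Liyan Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
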
Abstract:Narrative comprehension on long stories and novels has been a challenging domain attributed to their intricate plotlines and entangled, often evolving relations among characters and entities. Given the LLM’s diminished reasoning over extended context and high computational cost, retrieval-based approaches remain a pivotal role in practice. However, traditional RAG methods can fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition when reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for the query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines with consistent relative gains up to 11% compared to the strongest baseline. Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global comprehension, offering a principled, cognitively motivated paradigm for retrieval-based long context comprehension towards stateful reasoning. Our code is publicly released at this https URL
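
The iterative cycle described in the abstract (probe, retrieve, consolidate into memory, retry) can be sketched as a plain control loop. This is a toy skeleton, not ComoRAG itself: `make_probes`, `retrieve`, and `try_answer` are hypothetical callables standing in for LLM and retriever components, and the memory pool is simply a deduplicated list of passages.

```python
from typing import Callable, List, Optional, Set, Tuple


def iterative_retrieve(
    question: str,
    make_probes: Callable[[str, List[str]], List[str]],
    retrieve: Callable[[str], List[str]],
    try_answer: Callable[[str, List[str]], Optional[str]],
    max_cycles: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Stateful retrieval loop: memory persists and grows across cycles."""
    memory: List[str] = []
    seen: Set[str] = set()
    for _ in range(max_cycles):
        answer = try_answer(question, memory)
        if answer is not None:  # no impasse, stop early
            return answer, memory
        # Impasse: generate probing queries conditioned on current memory.
        for probe in make_probes(question, memory):
            for passage in retrieve(probe):
                if passage not in seen:  # consolidate only new evidence
                    seen.add(passage)
                    memory.append(passage)
    return try_answer(question, memory), memory
```

The contrast with single-step RAG is that `make_probes` sees the accumulated memory, so later queries can build on what earlier cycles found.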

[NLP-24] CorrectNav: Self-Correction Flywheel Empowers Vision-Language-Action Navigation Model

Quick read: This paper addresses the tendency of vision-and-language navigation (VLN) models to deviate from the correct trajectory when executing instructions while lacking an effective way to correct their errors. The key to the solution is a new post-training paradigm, the Self-correction Flywheel, which treats the model's error trajectories on the training set as a valuable data source rather than a drawback. Deviations in these trajectories are identified and used to automatically generate self-correction data for perception and action, which then drives further training. Over multiple flywheel iterations the model improves progressively, and CorrectNav reaches success rates of 65.1% on R2R-CE and 69.3% on RxR-CE, surpassing prior best VLA navigation models.

Link: https://arxiv.org/abs/2508.10416
Authors: Zhuoyuan Yu, Yuxing Long, Zihan Yang, Chengyan Zeng, Hongwei Fan, Jiyao Zhang, Hao Dong
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Existing vision-and-language navigation models often deviate from the correct trajectory when executing instructions. However, these models lack effective error correction capability, hindering their recovery from errors. To address this challenge, we propose Self-correction Flywheel, a novel post-training paradigm. Instead of considering the model’s error trajectories on the training set as a drawback, our paradigm emphasizes their significance as a valuable data source. We have developed a method to identify deviations in these error trajectories and devised innovative techniques to automatically generate self-correction data for perception and action. These self-correction data serve as fuel to power the model’s continued training. The brilliance of our paradigm is revealed when we re-evaluate the model on the training set, uncovering new error trajectories. At this time, the self-correction flywheel begins to spin. Through multiple flywheel iterations, we progressively enhance our monocular RGB-based VLA navigation model CorrectNav. Experiments on R2R-CE and RxR-CE benchmarks show CorrectNav achieves new state-of-the-art success rates of 65.1% and 69.3%, surpassing prior best VLA navigation models by 8.2% and 16.4%. Real robot tests in various indoor and outdoor environments demonstrate CorrectNav’s superior capability of error correction, dynamic obstacle avoidance, and long instruction following.

[NLP-25] Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation

Quick read: This paper addresses how to generate adversarial text that jailbreaks large language models (LLMs) in a black-box setting, in order to expose model vulnerabilities and improve robustness. The key to the solution is the Sparse Feature Perturbation Framework (SFPF), which is built on sparse autoencoders (SAEs): the SAE reconstructs hidden-layer representations, features with high activations are identified by clustering successfully attacked texts, and those features are selectively perturbed to produce new adversarial texts. According to the authors, this selective perturbation preserves the malicious intent while amplifying safety signals, which increases the chance of evading existing defenses.

Link: https://arxiv.org/abs/2508.10404
Authors: Huizhen Shu, Xuying Li, Qirui Wang, Yuji Kosuga, Mengqiu Tian, Zhuo Li
Affiliations: hydrox.ai
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: With the rapid proliferation of Natural Language Processing (NLP), especially Large Language Models (LLMs), generating adversarial examples to jailbreak LLMs remains a key challenge for understanding model vulnerabilities and improving robustness. In this context, we propose a new black-box attack method that leverages the interpretability of large models. We introduce the Sparse Feature Perturbation Framework (SFPF), a novel approach for adversarial text generation that utilizes sparse autoencoders to identify and manipulate critical features in text. After using the SAE model to reconstruct hidden layer representations, we perform feature clustering on the successfully attacked texts to identify features with higher activations. These highly activated features are then perturbed to generate new adversarial texts. This selective perturbation preserves the malicious intent while amplifying safety signals, thereby increasing their potential to evade existing defenses. Our method enables a new red-teaming strategy that balances adversarial effectiveness with safety alignment. Experimental results demonstrate that adversarial texts generated by SFPF can bypass state-of-the-art defense mechanisms, revealing persistent vulnerabilities in current NLP systems. However, the method’s effectiveness varies across prompts and layers, and its generalizability to other architectures and larger models remains to be validated.

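[NLP-26] Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

Quick read: This paper addresses a difficulty in evaluating jailbreak attacks: many existing red-teaming datasets contain prompts that are not overtly harmful or that fail to induce harmful outputs, which distorts evaluation results. Accurate evaluation requires assessing these datasets for maliciousness and cleaning them, but existing detection methods rely either on manual annotation, which is labor-intensive, or on large language models (LLMs), whose accuracy is inconsistent across harm types. The authors propose MDH (Malicious content Detection based on LLMs with Human assistance), a hybrid evaluation framework that combines LLM-based annotation with minimal human oversight to balance accuracy and efficiency. They also find that well-crafted developer messages can significantly raise jailbreak success and propose two new strategies on that basis: D-Attack, which uses context simulation, and DH-CoT, which incorporates hijacked chains of thought.

Link: https://arxiv.org/abs/2508.10390
Authors: Chiyu Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, Liming Fang, Zhe Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:
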
Abstract:Evaluating jailbreak attacks is challenging when prompts are not overtly harmful or fail to induce harmful outputs. Unfortunately, many existing red-teaming datasets contain such unsuitable prompts. To evaluate attacks accurately, these datasets need to be assessed and cleaned for maliciousness. However, existing malicious content detection methods rely on either manual annotation, which is labor-intensive, or large language models (LLMs), which have inconsistent accuracy in harmful types. To balance accuracy and efficiency, we propose a hybrid evaluation framework named MDH (Malicious content Detection based on LLMs with Human assistance) that combines LLM-based annotation with minimal human oversight, and apply it to dataset cleaning and detection of jailbroken responses. Furthermore, we find that well-crafted developer messages can significantly boost jailbreak success, leading us to propose two new strategies: D-Attack, which leverages context simulation, and DH-CoT, which incorporates hijacked chains of thought. The Codes, datasets, judgements, and detection results will be released in github repository: this https URL.
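
One way to combine LLM annotation with minimal human oversight is to accept unanimous automatic labels and escalate only disagreements. The sketch below illustrates that general pattern; it is an assumption, not the MDH procedure, and the function `triage` and its vote format are hypothetical.

```python
from typing import Dict, List, Tuple


def triage(votes: Dict[str, List[bool]]) -> Tuple[Dict[str, bool], List[str]]:
    """Split items into auto-labeled ones and ones needing human review.

    `votes[item_id]` holds one boolean per LLM judge (True = malicious).
    """
    auto_labels: Dict[str, bool] = {}
    needs_human: List[str] = []
    for item_id, judge_votes in votes.items():
        if judge_votes and all(judge_votes):
            auto_labels[item_id] = True    # unanimous: malicious
        elif judge_votes and not any(judge_votes):
            auto_labels[item_id] = False   # unanimous: benign
        else:
            needs_human.append(item_id)    # disagreement or no votes
    return auto_labels, needs_human
```

Human effort then scales with the number of contested items rather than with dataset size.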

[NLP-27] Improving Generative Cross-lingual Aspect-Based Sentiment Analysis with Constrained Decoding

Quick read: This paper addresses the performance bottleneck of aspect-based sentiment analysis (ABSA) for low-resource languages. Current cross-lingual ABSA approaches focus on limited, simpler tasks and depend on unreliable external translation tools. The key to the solution is constrained decoding with sequence-to-sequence models, which removes the need for translation tools and improves cross-lingual performance by 5% on average for the most complex task. The method also supports multi-tasking, where a single model solves multiple ABSA tasks and constrained decoding improves results by more than 10%. Evaluated on seven languages and six ABSA tasks, the approach surpasses state-of-the-art methods and sets new benchmarks for previously unexplored tasks.

Link: https://arxiv.org/abs/2508.10369
Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract: While aspect-based sentiment analysis (ABSA) has made substantial progress, challenges remain for low-resource languages, which are often overlooked in favour of English. Current cross-lingual ABSA approaches focus on limited, less complex tasks and often rely on external translation tools. This paper introduces a novel approach using constrained decoding with sequence-to-sequence models, eliminating the need for unreliable translation tools and improving cross-lingual performance by 5% on average for the most complex task. The proposed method also supports multi-tasking, which enables solving multiple ABSA tasks with a single model, with constrained decoding boosting results by more than 10%. We evaluate our approach across seven languages and six ABSA tasks, surpassing state-of-the-art methods and setting new benchmarks for previously unexplored tasks. Additionally, we assess large language models (LLMs) in zero-shot, few-shot, and fine-tuning scenarios. While LLMs perform poorly in zero-shot and few-shot settings, fine-tuning achieves competitive results compared to smaller multilingual models, albeit at the cost of longer training and inference times. We provide practical recommendations for real-world applications, enhancing the understanding of cross-lingual ABSA methodologies. This study offers valuable insights into the strengths and limitations of cross-lingual ABSA approaches, advancing the state-of-the-art in this challenging research domain.
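
Constrained decoding restricts each generation step to the tokens that the output format allows. The sketch below is a toy illustration only: the `aspect | polarity` output format, the token-level grammar in `allowed_tokens`, and the `score_fn` stand-in for a sequence-to-sequence model are all assumptions, not the paper's design.

```python
from typing import Callable, Dict, List, Set

POLARITIES: Set[str] = {"positive", "negative", "neutral"}


def allowed_tokens(prefix: List[str], source_tokens: List[str]) -> Set[str]:
    """Hypothetical grammar: <token copied from the source> '|' <polarity> <eos>."""
    if len(prefix) == 0:
        return set(source_tokens)  # aspect term must come from the input
    if len(prefix) == 1:
        return {"|"}
    if len(prefix) == 2:
        return POLARITIES
    return {"<eos>"}


def constrained_greedy_decode(
    score_fn: Callable[[List[str]], Dict[str, float]],
    source_tokens: List[str],
    max_len: int = 4,
) -> List[str]:
    """Greedy decoding where disallowed tokens are never considered."""
    prefix: List[str] = []
    for _ in range(max_len):
        scores = score_fn(prefix)
        allowed = allowed_tokens(prefix, source_tokens)
        # Tokens without a score are treated as having probability zero.
        best = max(allowed, key=lambda tok: scores.get(tok, float("-inf")))
        if best == "<eos>":
            break
        prefix.append(best)
    return prefix
```

Because the aspect term is restricted to tokens of the input sentence, the model cannot answer in a different language than the input, which is one plausible reason such constraints help in cross-lingual transfer.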

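[NLP-28] Large Language Models for Summarizing Czech Historical Documents and Beyond

Quick read: This paper addresses Czech text summarization, especially for historical documents, which remains underexplored because of linguistic complexity and a scarcity of annotated datasets. The authors make two contributions: they use large language models such as Mistral and mT5 to set new state-of-the-art results on the modern Czech summarization dataset SumeCzech, and they introduce Posel od Čerchova, a new dataset for summarizing historical Czech documents, together with baseline results. Together these contributions advance Czech text summarization and open new directions for processing historical Czech text.

Link: https://arxiv.org/abs/2508.10368
Authors: Václav Tran, Jakub Šmíd, Jiří Martínek, Ladislav Lenc, Pavel Král
Affiliations: University of West Bohemia in Pilsen; NTIS - New Technologies for the Information Society
Categories: Computation and Language (cs.CL)
Comments: Published in Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 2 (ICAART 2025). Official version: this https URL
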
Abstract:Text summarization is the task of shortening a larger body of text into a concise version while retaining its essential meaning and key information. While summarization has been significantly explored in English and other high-resource languages, Czech text summarization, particularly for historical documents, remains underexplored due to linguistic complexities and a scarcity of annotated datasets. Large language models such as Mistral and mT5 have demonstrated excellent results on many natural language processing tasks and languages. Therefore, we employ these models for Czech summarization, resulting in two key contributions: (1) achieving new state-of-the-art results on the modern Czech summarization dataset SumeCzech using these advanced models, and (2) introducing a novel dataset called Posel od Čerchova for summarization of historical Czech documents with baseline results. Together, these contributions provide a great potential for advancing Czech text summarization and open new avenues for research in Czech historical text processing.

[NLP-29] Advancing Cross-lingual Aspect-Based Sentiment Analysis with LLMs and Constrained Decoding for Sequence-to-Sequence Models

Quick read: This paper addresses the performance bottleneck of aspect-based sentiment analysis (ABSA) for low-resource languages. Current cross-lingual ABSA studies concentrate on simpler tasks and rely heavily on external translation tools, which limits their use in more complex settings. The key to the solution is a novel sequence-to-sequence method for compound ABSA tasks that uses constrained decoding and needs no translation tools, improving cross-lingual ABSA performance by up to 10%. This broadens the range of tasks cross-lingual ABSA can handle and offers a practical, efficient alternative to translation-dependent techniques.

Link: https://arxiv.org/abs/2508.10366
Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
Affiliations: University of West Bohemia in Pilsen; NTIS - New Technologies for the Information Society
Categories: Computation and Language (cs.CL)
Comments: Published in Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 2 (ICAART 2025). Official version: this https URL

Abstract:Aspect-based sentiment analysis (ABSA) has made significant strides, yet challenges remain for low-resource languages due to the predominant focus on English. Current cross-lingual ABSA studies often centre on simpler tasks and rely heavily on external translation tools. In this paper, we present a novel sequence-to-sequence method for compound ABSA tasks that eliminates the need for such tools. Our approach, which uses constrained decoding, improves cross-lingual ABSA performance by up to 10%. This method broadens the scope of cross-lingual ABSA, enabling it to handle more complex tasks and providing a practical, efficient alternative to translation-dependent techniques. Furthermore, we compare our approach with large language models (LLMs) and show that while fine-tuned multilingual LLMs can achieve comparable results, English-centric LLMs struggle with these tasks.

[NLP-30] Improving OCR for Historical Texts of Multiple Languages

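Quick read: This paper addresses optical character recognition (OCR) and document layout analysis across three settings: historical Hebrew fragments of the Dead Sea Scrolls, 16th to 18th-century meeting resolutions, and modern English handwriting. The key to the solution is applying a deep learning model suited to each task. For the ancient fragments, extensive data augmentation enlarges the training set and the Kraken and TrOCR models improve character recognition. For the meeting resolutions, a Convolutional Recurrent Neural Network (CRNN) integrates DeepLabV3+ semantic segmentation with a Bidirectional LSTM and is refined through confidence-based pseudo-labeling. For modern English handwriting, a CRNN with a ResNet34 encoder is trained with the Connectionist Temporal Classification (CTC) loss to capture sequential dependencies.

Link: https://arxiv.org/abs/2508.10356
Authors: Hylke Westerdijk, Ben Blankenborg, Khondoker Ittehadul Islam
Affiliations: University of Groningen
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
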
Abstract:This paper presents our methodology and findings from three tasks across Optical Character Recognition (OCR) and Document Layout Analysis using advanced deep learning techniques. First, for the historical Hebrew fragments of the Dead Sea Scrolls, we enhanced our dataset through extensive data augmentation and employed the Kraken and TrOCR models to improve character recognition. In our analysis of 16th to 18th-century meeting resolutions task, we utilized a Convolutional Recurrent Neural Network (CRNN) that integrated DeepLabV3+ for semantic segmentation with a Bidirectional LSTM, incorporating confidence-based pseudolabeling to refine our model. Finally, for modern English handwriting recognition task, we applied a CRNN with a ResNet34 encoder, trained using the Connectionist Temporal Classification (CTC) loss function to effectively capture sequential dependencies. This report offers valuable insights and suggests potential directions for future research.
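
A model trained with CTC emits one label per input frame, including a special blank label, and the frame sequence is collapsed into the final text. The sketch below shows the standard greedy collapse rule (merge consecutive repeats, then drop blanks); it illustrates CTC in general, not this paper's pipeline, and the function name is hypothetical.

```python
from typing import List


def ctc_greedy_decode(frame_labels: List[int], blank: int = 0) -> List[int]:
    """Collapse per-frame argmax labels into an output label sequence."""
    decoded: List[int] = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded
```

A blank between two identical labels is what allows doubled characters: frames `[1, 0, 1]` decode to `[1, 1]`, while `[1, 1]` decodes to `[1]`.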

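[NLP-31] Making Qwen3 Think in Korean with Reinforcement Learning

Quick read: This paper addresses the weaker reasoning of large language models (LLMs) in non-English settings, specifically how to make Qwen3 14B "think" natively in Korean. The central challenge is to improve advanced reasoning in Korean, such as math and coding tasks, while keeping the model's general capabilities. The key to the solution is a two-stage fine-tuning approach. In the first stage, supervised fine-tuning (SFT) on a high-quality Korean reasoning dataset builds a strong foundation in Korean logical reasoning. In the second stage, reinforcement learning with a customized Group Relative Policy Optimization (GRPO) algorithm is applied, and an oracle judge model calibrates the reward signal to mitigate stability problems such as reward hacking and policy collapse, improving both Korean reasoning alignment and overall problem-solving performance.

Link: https://arxiv.org/abs/2508.10355
Authors: Jungyup Lee, Jemin Kim, Sang Park, SeungJae Lee
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
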
Abstract:We present a two-stage fine-tuning approach to make the large language model Qwen3 14B “think” natively in Korean. In the first stage, supervised fine-tuning (SFT) on a high-quality Korean reasoning dataset establishes a strong foundation in Korean logical reasoning, yielding notable improvements in Korean-language tasks and even some gains in general reasoning ability. In the second stage, we employ reinforcement learning with a customized Group Relative Policy Optimization (GRPO) algorithm to further enhance both Korean reasoning alignment and overall problem-solving performance. We address critical stability challenges in GRPO training - such as reward hacking and policy collapse - by introducing an oracle judge model that calibrates the reward signal. Our approach achieves stable learning (avoiding the collapse observed in naive GRPO) and leads to steady, incremental performance gains. The final RL-tuned model demonstrates substantially improved results on advanced reasoning benchmarks (particularly math and coding tasks) while maintaining knowledge and language proficiency, successfully conducting its internal chain-of-thought entirely in Korean.
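
GRPO scores each sampled response relative to the other responses drawn for the same prompt, so no learned value function is needed. The sketch below shows that commonly used group normalization; it is not the customized variant from this paper, and the use of the population standard deviation and the `eps` constant are assumptions (implementations differ on both).

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When all responses in a group receive the same reward, every advantage is zero and the group contributes no gradient, which is one reason a poorly calibrated reward signal can stall or destabilize training.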

[NLP-32] Cross-Prompt Encoder for Low-Performing Languages

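[Quick Read]: This paper addresses the limited transfer ability of large language models (LLMs) on low-performing languages, that is, languages that perform poorly even under full-model fine-tuning. The core solution is the Cross-Prompt Encoder (XPE), a lightweight encoding architecture trained with a multi-source strategy on typologically diverse languages so that it captures abstract, transferable patterns across languages. The paper also introduces a Dual Soft Prompt mechanism that combines an encoder-based soft prompt with a directly trained standard soft prompt, giving target languages both broadly shared structure and language-specific alignment. Experiments show that XPE is most effective on low-performing languages, while the hybrid variants adapt more broadly across multilingual settings.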

Link: https://arxiv.org/abs/2508.10352
Authors: Beso Mikaberidze, Teimuraz Saghinadze, Simon Ostermann, Philipp Muller
Affiliations: Muskhelishvili Institute of Computational Mathematics, Georgian Technical University (MICM); Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI); Center for European Research in Trusted AI (CERTAIN)
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Soft prompts have emerged as a powerful alternative to adapters in parameter-efficient fine-tuning (PEFT), enabling large language models (LLMs) to adapt to downstream tasks without architectural changes or parameter updates. While prior work has focused on stabilizing training via parameter interaction in small neural prompt encoders, their broader potential for transfer across languages remains unexplored. In this paper, we demonstrate that a prompt encoder can play a central role in improving performance on low-performing languages, those that achieve poor accuracy even under full-model fine-tuning. We introduce the Cross-Prompt Encoder (XPE), which combines a lightweight encoding architecture with multi-source training on typologically diverse languages - a design that enables the model to capture abstract and transferable patterns across languages. To complement XPE, we propose a Dual Soft Prompt mechanism that combines an encoder-based prompt with a directly trained standard soft prompt. This hybrid design proves especially effective for target languages that benefit from both broadly shared structure and language-specific alignment. Experiments on the SIB-200 benchmark reveal a consistent trade-off: XPE is most effective for low-performing languages, while hybrid variants offer broader adaptability across multilingual settings.

[NLP-33] Beyond Semantic Understanding: Preserving Collaborative Frequency Components in LLM-based Recommendation

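[Quick Read]: This paper addresses the gradual attenuation of collaborative signals in recommender systems based on large language models (LLMs), which overemphasize semantic correlations when processing users' interaction histories. Specifically, when pretrained collaborative ID embeddings are used as input, the LLM backbone weakens the original collaborative information layer by layer, in contrast to traditional Transformer-based sequential models, where collaborative signals are preserved or even enhanced. The key to the solution is FreLLM4Rec, designed from a spectral perspective: a Global Graph Low-Pass Filter (G-LPF) first denoises item embeddings that mix semantic and collaborative information; Temporal Frequency Modulation (TFM) then actively preserves the collaborative signal at every layer, with a theoretical guarantee obtained by connecting it to the optimal but hard-to-implement local graph Fourier filters. The method mitigates collaborative signal attenuation and improves recommendation performance on four benchmark datasets, with NDCG@10 gains of up to 8.00% over the best baseline.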

Link: https://arxiv.org/abs/2508.10312
Authors: Minhao Wang, Yunhang He, Cong Xu, Zhangchi Zhu, Wei Zhang
Affiliations: East China Normal University; Shanghai Innovation Institute
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 8 figures

Abstract:Recommender systems in concert with Large Language Models (LLMs) present promising avenues for generating semantically-informed recommendations. However, LLM-based recommenders exhibit a tendency to overemphasize semantic correlations within users’ interaction history. When taking pretrained collaborative ID embeddings as input, LLM-based recommenders progressively weaken the inherent collaborative signals as the embeddings propagate through LLM backbones layer by layer, as opposed to traditional Transformer-based sequential models in which collaborative signals are typically preserved or even enhanced for state-of-the-art performance. To address this limitation, we introduce FreLLM4Rec, an approach designed to balance semantic and collaborative information from a spectral perspective. Item embeddings that incorporate both semantic and collaborative information are first purified using a Global Graph Low-Pass Filter (G-LPF) to preliminarily remove irrelevant high-frequency noise. Temporal Frequency Modulation (TFM) then actively preserves collaborative signal layer by layer. Note that the collaborative preservation capability of TFM is theoretically guaranteed by establishing a connection between the optimal but hard-to-implement local graph Fourier filters and the suboptimal yet computationally efficient frequency-domain filters. Extensive experiments on four benchmark datasets demonstrate that FreLLM4Rec successfully mitigates collaborative signal attenuation and achieves competitive performance, with improvements of up to 8.00% in NDCG@10 over the best baseline. Our findings provide insights into how LLMs process collaborative information and offer a principled approach for improving LLM-based recommendation systems.
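To make the notion of a frequency-domain low-pass filter concrete, the sketch below filters a one-dimensional sequence with a plain discrete Fourier transform. It is a generic textbook filter, not the paper's G-LPF or TFM, and the names `dft`, `idft`, `low_pass`, and the cutoff parameter `keep` are assumptions made for this example.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def low_pass(x, keep):
    """Zero every frequency bin above `keep` and transform back.

    Bins k and n - k describe the same frequency, so both are kept or
    dropped together; this keeps the output real for real input.
    """
    n = len(x)
    spectrum = dft(x)
    filtered = [spectrum[k] if min(k, n - k) <= keep else 0 for k in range(n)]
    return [value.real for value in idft(filtered)]


# A constant (lowest-frequency) signal passes through unchanged,
# while a rapidly alternating signal is suppressed.
smooth = low_pass([3.0, 3.0, 3.0, 3.0], keep=0)
suppressed = low_pass([1.0, -1.0, 1.0, -1.0], keep=1)
```

The same principle, keeping slowly varying components and discarding rapidly varying ones, is what a low-pass filter does whether the frequencies come from a sequence, as here, or from a graph.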

[NLP-34] From Surface to Semantics: Semantic Structure Parsing for Table-Centric Document Analysis ECAI-2025

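[Quick Read]: This paper addresses the limitations of existing document parsing methods in the semantic understanding of tables: current work focuses on surface-level tasks such as table layout analysis, detection, and data extraction, and does not mine the deep semantic links between tables and their context, which limits advanced applications such as cross-paragraph data interpretation and consistency analysis. The key to the solution is the DOTABLER framework, which builds a complete semantic parsing pipeline on a custom dataset and domain-specific fine-tuning of pre-trained models. It identifies the context segments that are semantically tied to a table and, on that basis, provides table-centric document structure parsing and domain-specific table retrieval, enabling table-anchored semantic analysis and precise extraction of semantically relevant tables.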

Link: https://arxiv.org/abs/2508.10311
Authors: Xuan Li, Jialiang Dong, Raymond Wong
Affiliations: University of New South Wales
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 5 figures, 28th European Conference on Artificial Intelligence (ECAI-2025)

Abstract:Documents are core carriers of information and knowledge, with broad applications in finance, healthcare, and scientific research. Tables, as the main medium for structured data, encapsulate key information and are among the most critical document components. Existing studies largely focus on surface-level tasks such as layout analysis, table detection, and data extraction, lacking deep semantic parsing of tables and their contextual associations. This limits advanced tasks like cross-paragraph data interpretation and context-consistent analysis. To address this, we propose DOTABLER, a table-centric semantic document parsing framework designed to uncover deep semantic links between tables and their context. DOTABLER leverages a custom dataset and domain-specific fine-tuning of pre-trained models, integrating a complete parsing pipeline to identify context segments semantically tied to tables. Built on this semantic understanding, DOTABLER implements two core functionalities: table-centric document structure parsing and domain-specific table retrieval, delivering comprehensive table-anchored semantic analysis and precise extraction of semantically relevant tables. Evaluated on nearly 4,000 pages with over 1,000 tables from real-world PDFs, DOTABLER achieves over 90% Precision and F1 scores, demonstrating superior performance in table-context semantic analysis and deep document parsing compared to advanced models such as GPT-4o.

[NLP-35] ReviewRL: Towards Automated Scientific Review with RL

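[Quick Read]: This paper addresses the declining efficiency and quality of scientific peer review caused by surging submission volumes and reviewer fatigue. Existing automated review methods fall short in factual accuracy, rating consistency, and analytical depth, and often produce superficial or generic feedback. The key to the solution is ReviewRL, a reinforcement learning (RL) framework built from three core components: (1) an ArXiv-MCP retrieval-augmented context generation pipeline that brings in relevant literature to improve factual grounding; (2) supervised fine-tuning to establish foundational reviewing ability; (3) a composite reward function that drives the RL procedure and jointly optimizes review quality and rating accuracy. Experiments on ICLR 2025 papers show that ReviewRL significantly outperforms existing methods, laying a foundation for RL-driven automatic critique generation in scientific discovery.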

Link: https://arxiv.org/abs/2508.10308
Authors: Sihang Zeng, Kai Tian, Kaiyan Zhang, Yuru Wang, Junqi Gao, Runze Liu, Sa Yang, Jingxuan Li, Xinwei Long, Jiaheng Ma, Biqing Qi, Bowen Zhou
Affiliations: Tsinghua University; University of Washington; Shanghai AI Laboratory; Peking University; Harbin Engineering University; Beijing Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures

Abstract:Peer review is essential for scientific progress but faces growing challenges due to increasing submission volumes and reviewer fatigue. Existing automated review approaches struggle with factual accuracy, rating consistency, and analytical depth, often generating superficial or generic feedback lacking the insights characteristic of high-quality human reviews. We introduce ReviewRL, a reinforcement learning framework for generating comprehensive and factually grounded scientific paper reviews. Our approach combines: (1) an ArXiv-MCP retrieval-augmented context generation pipeline that incorporates relevant scientific literature, (2) supervised fine-tuning that establishes foundational reviewing capabilities, and (3) a reinforcement learning procedure with a composite reward function that jointly enhances review quality and rating accuracy. Experiments on ICLR 2025 papers demonstrate that ReviewRL significantly outperforms existing methods across both rule-based metrics and model-based quality assessments. ReviewRL establishes a foundational framework for RL-driven automatic critique generation in scientific discovery, demonstrating promising potential for future development in this domain. The implementation of ReviewRL will be released at GitHub.

[NLP-36] Yet another algorithmic bias: A Discursive Analysis of Large Language Models Reinforcing Dominant Discourses on Gender and Race

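[Quick Read]: This paper examines how large language models (LLMs) may reproduce and reinforce gender and racial biases when generating text, with particular attention to how these biases are entrenched and propagated through discourse. The key to the solution is a qualitative, discursive analysis framework that exposes implicit bias mechanisms that automated quantitative methods struggle to capture. Through manual analysis of LLM-generated short stories about Black and white women, the study finds that Black women are typically tied to ancestry and resistance, while white women mostly appear in processes of self-discovery. This reflects the models' reproduction of social stereotypes and shows that, when prompted to correct bias, models make only superficial revisions and fail to produce genuinely inclusive narratives. The methodology underscores the need for critical, interdisciplinary perspectives in AI design and deployment in order to identify and mitigate the unequal consequences of the ideological functioning of algorithms.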

Link: https://arxiv.org/abs/2508.10304
Authors: Gustavo Bonil, Simone Hashiguti, Jhessica Silva, João Gondim, Helena Maia, Nádia Silva, Helio Pedrini, Sandra Avila
Affiliations: Instituto de Estudos da Linguagem (IEL), Universidade Estadual de Campinas (UNICAMP); Instituto de Computação (IC), Universidade Estadual de Campinas (UNICAMP); Instituto de Informática, Universidade Federal de Goiás (UFG)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 29 pages, 3 figures

Abstract:With the advance of Artificial Intelligence (AI), Large Language Models (LLMs) have gained prominence and been applied in diverse contexts. As they evolve into more sophisticated versions, it is essential to assess whether they reproduce biases, such as discrimination and racialization, while maintaining hegemonic discourses. Current bias detection approaches rely mostly on quantitative, automated methods, which often overlook the nuanced ways in which biases emerge in natural language. This study proposes a qualitative, discursive framework to complement such methods. Through manual analysis of LLM-generated short stories featuring Black and white women, we investigate gender and racial biases. We contend that qualitative methods such as the one proposed here are fundamental to help both developers and users identify the precise ways in which biases manifest in LLM outputs, thus enabling better conditions to mitigate them. Results show that Black women are portrayed as tied to ancestry and resistance, while white women appear in self-discovery processes. These patterns reflect how language models replicate crystalized discursive representations, reinforcing essentialization and a sense of social immobility. When prompted to correct biases, models offered superficial revisions that maintained problematic meanings, revealing limitations in fostering inclusive narratives. Our results demonstrate the ideological functioning of algorithms and have significant implications for the ethical use and development of AI. The study reinforces the need for critical, interdisciplinary approaches to AI design and deployment, addressing how LLM-generated discourses reflect and perpetuate inequalities.

[NLP-37] Inductive Bias Extraction and Matching for LLM Prompts

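[Quick Read]: This paper addresses the sensitivity of large language models (LLMs) to prompt wording, where small textual changes can cause large swings in output quality. The core challenge is how to match a prompt to the model's own inductive bias in order to improve task performance. The key to the solution is an "Inductive Bias Extraction and Matching" strategy: the LLM's own output is used as part of the prompt, so that the prompt wording better fits the model's inherent preferences and reasoning patterns. Empirically, the method improves LLM Likert ratings used for classification by up to 19% and those used for ranking by up to 27%.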

Link: https://arxiv.org/abs/2508.10295
Authors: Christian M. Angel, Francis Ferraro
Affiliations: University of Maryland, Baltimore County
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The active research topic of prompt engineering makes it evident that LLMs are sensitive to small changes in prompt wording. A portion of this can be ascribed to the inductive bias that is present in the LLM. By using an LLM’s output as a portion of its prompt, we can more easily create satisfactory wording for prompts. This has the effect of creating a prompt that matches the inductive bias in the model. Empirically, we show that using this Inductive Bias Extraction and Matching strategy improves LLM Likert ratings used for classification by up to 19% and LLM Likert ratings used for ranking by up to 27%.
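The idea of reusing a model's own output as part of its prompt can be sketched in two steps. This is only one plausible reading of the abstract, not the authors' procedure: the function `build_matched_prompt`, the wording of the extraction request, and the `stub_llm` stand-in for a real model call are all hypothetical.

```python
def build_matched_prompt(llm, task_description, item):
    """Build a rating prompt that reuses the model's own wording.

    `llm` is any callable mapping a prompt string to a response string
    (a hypothetical interface for this sketch).
    """
    # Step 1 (extraction): ask the model to phrase the criterion itself.
    extraction_request = (
        "In your own words, describe how you would rate the following "
        "on a 1-5 Likert scale:\n" + task_description
    )
    model_wording = llm(extraction_request)
    # Step 2 (matching): place that wording verbatim in the final prompt.
    return f"{model_wording}\n\nItem: {item}\nRating (1-5):"


def stub_llm(prompt):
    """Stand-in for a real model call so the sketch runs offline."""
    return "Rate how clearly the text states its main claim."


prompt = build_matched_prompt(
    stub_llm, "Judge the clarity of a summary.", "Example summary text."
)
```

In practice `stub_llm` would be replaced by a call to the same model that later performs the rating, so that the final prompt is phrased in terms that model itself produced.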

[NLP-38] A Computational Approach to Analyzing Language Change and Variation in the Constructed Language Toki Pona

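[Quick Read]: This paper asks whether a constructed language undergoes language change similar to that of natural languages as a community uses it. Using computational and corpus-based methods, the study analyzes features of Toki Pona, a constructed language with only about 120 core words, such as fluid word classes and transitivity, examining how their usage changes over time and varies across corpora. It finds that sociolinguistic factors influence Toki Pona in the same way as natural languages, indicating that even a deliberately designed linguistic system evolves naturally as long as a community keeps using it. The key to the approach is quantitative corpus analysis, which reveals how a constructed language changes in actual use and suggests that language evolution is not limited to natural languages.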

Link: https://arxiv.org/abs/2508.10246
Authors: Daniel Huang, Hyoun-A Joo
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 14 figures. submitted to UGA Working Papers in Linguistics 2025

Abstract:This study explores language change and variation in Toki Pona, a constructed language with approximately 120 core words. Taking a computational and corpus-based approach, the study examines features including fluid word classes and transitivity in order to examine (1) changes in preferences of content words for different syntactic positions over time and (2) variation in usage across different corpora. The results suggest that sociolinguistic factors influence Toki Pona in the same way as natural languages, and that even constructed linguistic systems naturally evolve as communities use them.

[NLP-39] Personalized Real-time Jargon Support for Online Meetings

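[Quick Read]: This paper addresses the information barriers that domain-specific jargon creates in interdisciplinary communication, particularly in workplace meetings, where existing jargon-management strategies have significant limitations. The key to the solution is ParseJargon, an interactive system powered by a large language model (LLM) that identifies jargon in real time and provides explanations personalized to each user's background, improving comprehension, engagement, and appreciation of colleagues' work. A controlled experiment shows that personalized support significantly outperforms both the no-support and general-purpose conditions, and a field study validates its usability and practical value in real meetings.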

Link: https://arxiv.org/abs/2508.10239
Authors: Yifan Song, Wing Yee Au, Hon Yung Wong, Brian P. Bailey, Tal August
Affiliations: University of Illinois Urbana-Champaign; Fujitsu Research of America
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:

Abstract:Effective interdisciplinary communication is frequently hindered by domain-specific jargon. To explore the jargon barriers in depth, we conducted a formative diary study with 16 professionals, revealing critical limitations in current jargon-management strategies during workplace meetings. Based on these insights, we designed ParseJargon, an interactive LLM-powered system providing real-time personalized jargon identification and explanations tailored to users’ individual backgrounds. A controlled experiment comparing ParseJargon against baseline (no support) and general-purpose (non-personalized) conditions demonstrated that personalized jargon support significantly enhanced participants’ comprehension, engagement, and appreciation of colleagues’ work, whereas general-purpose support negatively affected engagement. A follow-up field study validated ParseJargon’s usability and practical value in real-time meetings, highlighting both opportunities and limitations for real-world deployment. Our findings contribute insights into designing personalized jargon support tools, with implications for broader interdisciplinary and educational applications.

[NLP-40] Using Large Language Models to Measure Symptom Severity in Patients At Risk for Schizophrenia

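[Quick Read]: This paper addresses the lack of efficient, standardized symptom assessment tools for people at clinical high risk (CHR) for schizophrenia, with the goal of supporting targeted intervention. The core solution uses large language models (LLMs) to predict Brief Psychiatric Rating Scale (BPRS) scores from clinical interview transcripts, without requiring interviews specifically structured to measure the BPRS. The key finding is that zero-shot LLM predictions approach human rater reliability (median concordance 0.84, intraclass correlation coefficient (ICC) 0.73), generalize to foreign-language assessments (median concordance 0.88, ICC 0.70), and can integrate longitudinal information, offering an objective, scalable, and standardized approach to symptom monitoring for CHR patients.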

Link: https://arxiv.org/abs/2508.10226
Authors: Andrew X. Chen, Guillermo Horga, Sean Escola
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Patients who are at clinical high risk (CHR) for schizophrenia need close monitoring of their symptoms to inform appropriate treatments. The Brief Psychiatric Rating Scale (BPRS) is a validated, commonly used research tool for measuring symptoms in patients with schizophrenia and other psychotic disorders; however, it is not commonly used in clinical practice as it requires a lengthy structured interview. Here, we utilize large language models (LLMs) to predict BPRS scores from clinical interview transcripts in 409 CHR patients from the Accelerating Medicines Partnership Schizophrenia (AMP-SCZ) cohort. Despite the interviews not being specifically structured to measure the BPRS, the zero-shot performance of the LLM predictions compared to the true assessment (median concordance: 0.84, ICC: 0.73) approaches human inter- and intra-rater reliability. We further demonstrate that LLMs have substantial potential to improve and standardize the assessment of CHR patients via their accuracy in assessing the BPRS in foreign languages (median concordance: 0.88, ICC: 0.70), and integrating longitudinal information in a one-shot or few-shot learning approach.

[NLP-41] Understanding Textual Emotion Through Emoji Prediction

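[Quick Read]: This paper addresses accurate emoji prediction from short text sequences, with particular attention to the effect of class imbalance on model performance. The key to the solution is a comparison of four deep learning architectures (a feed-forward network, a CNN, a Transformer, and BERT), combined with focal loss and regularization techniques to mitigate class imbalance. The results show that BERT achieves the best overall performance thanks to its pre-training, while the CNN performs better on rare emoji classes, highlighting the importance of architecture selection and hyperparameter tuning for sentiment-aware emoji prediction.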

Link: https://arxiv.org/abs/2508.10222
Authors: Ethan Gordon, Nishank Kuppa, Rigved Tummala, Sriram Anasuri
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:This project explores emoji prediction from short text sequences using four deep learning architectures: a feed-forward network, CNN, transformer, and BERT. Using the TweetEval dataset, we address class imbalance through focal loss and regularization techniques. Results show BERT achieves the highest overall performance due to its pre-training advantage, while CNN demonstrates superior efficacy on rare emoji classes. This research shows the importance of architecture selection and hyperparameter tuning for sentiment-aware emoji prediction, contributing to improved human-computer interaction.
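For readers unfamiliar with focal loss, the sketch below shows its commonly used form for a single example, FL = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The abstract does not state the hyperparameters used, so the defaults `gamma=2.0` and `alpha=1.0` are assumptions of this example.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example.

    `probs` is a list of predicted class probabilities and `target` is the
    index of the true class. With gamma = 0 and alpha = 1 this reduces to
    ordinary cross-entropy.
    """
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct prediction is down-weighted far more than an
# uncertain one, which shifts training effort toward hard examples.
easy = focal_loss([0.9, 0.1], target=0)
hard = focal_loss([0.6, 0.4], target=0)
```

Because well-classified examples from frequent classes contribute little, the loss puts relatively more weight on the rare, hard classes that class imbalance would otherwise drown out.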

[NLP-42] Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models

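[Quick Read]: This paper addresses hallucinations in large language models (LLMs), in particular Faithfulness Hallucinations, where generated content deviates severely from, or is semantically inconsistent with, the input context. Such failures are especially dangerous in generative AI applications because they can produce misleading output. The key to the solution is Semantic Divergence Metrics (SDM), a lightweight framework whose core idea is to jointly cluster sentence embeddings into a shared topic space for prompts and answers and then quantify their semantic divergence with information-theoretic measures such as the Jensen-Shannon divergence and the Wasserstein distance. The framework measures response consistency not only for a single prompt but also across semantically equivalent paraphrases of it, which tests for a deeper form of arbitrariness and identifies confabulations. The paper further treats the KL divergence KL(Answer || Prompt) as a strong signal of "Semantic Exploration" and combines the metrics into the "Semantic Box," a diagnostic framework for classifying LLM response types and flagging high-risk confident confabulations.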

Link: https://arxiv.org/abs/2508.10192
Authors: Igor Halperin
Affiliations: Fidelity Investments
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
Comments: 24 pages, 3 figures

Abstract:The proliferation of Large Language Models (LLMs) is challenged by hallucinations, critical failure modes where models generate non-factual, nonsensical or unfaithful text. This paper introduces Semantic Divergence Metrics (SDM), a novel lightweight framework for detecting Faithfulness Hallucinations – events of severe deviations of LLM responses from input contexts. We focus on a specific implementation of these LLM errors, confabulations, defined as responses that are arbitrary and semantically misaligned with the user’s query. Existing methods like Semantic Entropy test for arbitrariness by measuring the diversity of answers to a single, fixed prompt. Our SDM framework improves upon this by being more prompt-aware: we test for a deeper form of arbitrariness by measuring response consistency not only across multiple answers but also across multiple, semantically-equivalent paraphrases of the original prompt. Methodologically, our approach uses joint clustering on sentence embeddings to create a shared topic space for prompts and answers. A heatmap of topic co-occurrences between prompts and responses can be viewed as a quantified two-dimensional visualization of the user-machine dialogue. We then compute a suite of information-theoretic metrics to measure the semantic divergence between prompts and responses. Our practical score, $\mathcal{S}_H$, combines the Jensen-Shannon divergence and Wasserstein distance to quantify this divergence, with a high score indicating a Faithfulness hallucination. Furthermore, we identify the KL divergence KL(Answer || Prompt) as a powerful indicator of Semantic Exploration, a key signal for distinguishing different generative behaviors. These metrics are further combined into the Semantic Box, a diagnostic framework for classifying LLM response types, including the dangerous, confident confabulation.
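Two of the divergences named in the abstract can be computed directly once prompts and answers are each summarized as a probability distribution over shared topics. The sketch below implements the textbook KL and Jensen-Shannon divergences only; it does not reproduce the paper's clustering step, the Wasserstein term, or the combined score, and the example topic distributions are invented for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2) in nats."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical topic distributions over three shared topic clusters.
prompt_topics = [0.7, 0.2, 0.1]
answer_topics = [0.2, 0.3, 0.5]

divergence = js_divergence(prompt_topics, answer_topics)
exploration = kl_divergence(answer_topics, prompt_topics)  # KL(Answer || Prompt)
```

A Jensen-Shannon value near zero means the answer stays on the prompt's topics, while a value near log(2) means the two distributions barely overlap.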

[NLP-43] PakBBQ: A Culturally Adapted Bias Benchmark for QA EMNLP2025

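[Quick Read]: This paper addresses the insufficient fairness of large language models (LLMs) in low-resource languages and regional contexts: mainstream LLMs are mostly trained and evaluated on Western-centric data and overlook socio-cultural bias dimensions specific to regions such as Pakistan. The key to the solution is PakBBQ, a culturally and regionally adapted benchmark covering eight bias dimensions (age, disability, appearance, gender, socio-economic status, religion, regional affiliation, and language formality), with over 214 templates and 17,180 QA pairs in English and Urdu. Evaluating multilingual LLMs under ambiguous and explicitly disambiguated contexts, and under negative versus non-negative question framings, the study finds that (i) disambiguation raises accuracy by 12% on average; (ii) models show stronger counter-bias behavior in Urdu than in English; and (iii) negative question framing markedly reduces stereotypical responses. This suggests that contextualized benchmarks and simple prompt engineering strategies are effective ways to mitigate model bias in low-resource settings.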

Link: https://arxiv.org/abs/2508.10186
Authors: Abdullah Hashmat, Muhammad Arham Mirza, Agha Ali Raza
Affiliations: Lahore University of Management Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 8 pages, 7 figures, 2 tables, Submitted to EMNLP 2025

Abstract:With the widespread adoption of Large Language Models (LLMs) across various applications, it is imperative to ensure their fairness across all user communities. However, most LLMs are trained and evaluated on Western-centric data, with little attention paid to low-resource languages and regional contexts. To address this gap, we introduce PakBBQ, a culturally and regionally adapted extension of the original Bias Benchmark for Question Answering (BBQ) dataset. PakBBQ comprises over 214 templates and 17,180 QA pairs across 8 categories in both English and Urdu, covering eight bias dimensions including age, disability, appearance, gender, socio-economic status, religion, regional affiliation, and language formality that are relevant in Pakistan. We evaluate multiple multilingual LLMs under both ambiguous and explicitly disambiguated contexts, as well as negative versus non-negative question framings. Our experiments reveal (i) an average accuracy gain of 12% with disambiguation, (ii) consistently stronger counter-bias behaviors in Urdu than in English, and (iii) marked framing effects that reduce stereotypical responses when questions are posed negatively. These findings highlight the importance of contextualized benchmarks and simple prompt engineering strategies for bias mitigation in low-resource settings.

[NLP-44] Efficient Forward-Only Data Valuation for Pretrained LLMs and VLMs

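[Quick Read]: This paper addresses the difficulty of efficiently quantifying the influence of individual training samples in large language models (LLMs) and vision-language models (VLMs), with the goal of improving model transparency and accountability. Traditional data valuation methods rely on Hessian information or model retraining, which is computationally prohibitive for billion-parameter models. The key to the solution is For-Value, a forward-only data valuation framework that uses the rich hidden representations of modern foundation models to compute influence scores with a closed-form expression from a single forward pass, with no costly gradient computation. Theoretical analysis shows that For-Value estimates per-sample influence by capturing the alignment of hidden representations and prediction errors between training and validation samples, and experiments show that it matches or outperforms gradient-based baselines at identifying impactful fine-tuning examples and detecting mislabeled data.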

Link: https://arxiv.org/abs/2508.10180
Authors: Wenlong Deng, Jiaming Zhang, Qi Zeng, Christos Thrampoulidis, Boying Gong, Xiaoxiao Li
Affiliations: Meta; Stanford University; University of California, Berkeley
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Quantifying the influence of individual training samples is essential for enhancing the transparency and accountability of large language models (LLMs) and vision-language models (VLMs). However, existing data valuation methods often rely on Hessian information or model retraining, making them computationally prohibitive for billion-parameter models. In this work, we introduce For-Value, a forward-only data valuation framework that enables scalable and efficient influence estimation for both LLMs and VLMs. By leveraging the rich representations of modern foundation models, For-Value computes influence scores using a simple closed-form expression based solely on a single forward pass, thereby eliminating the need for costly gradient computations. Our theoretical analysis demonstrates that For-Value accurately estimates per-sample influence by capturing alignment in hidden representations and prediction errors between training and validation samples. Extensive experiments show that For-Value matches or outperforms gradient-based baselines in identifying impactful fine-tuning examples and effectively detecting mislabeled data.

[NLP-45] Estimating Machine Translation Difficulty

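[Quick Read]: This paper addresses the fact that machine translation (MT) systems now reach near-perfect quality in some setups, which makes it hard to distinguish state-of-the-art models and to identify directions for improvement. The core challenge is to automatically identify texts on which translation systems struggle, which would support more discriminative evaluations and guide future research. The key to the solution is a new formalization of the translation difficulty estimation task, together with a new metric for evaluating difficulty estimators. On this basis, the authors build the dedicated Sentinel-src models (Sentinel-src-24 and Sentinel-src-25), which predict how challenging a text is for current MT systems more accurately than heuristics based on word rarity or syntactic complexity and than LLM-as-a-judge approaches, and which are used to construct more challenging MT benchmarks.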

Link: https://arxiv.org/abs/2508.10175
Authors: Lorenzo Proietti, Stefano Perrella, Vilém Zouhar, Roberto Navigli, Tom Kocmi
Affiliations: Sapienza NLP Group, Sapienza University of Rome; ETH Zurich; Cohere
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Machine translation has begun achieving near-perfect quality in some setups. These high-quality outputs make it difficult to distinguish between state-of-the-art models and to identify areas for future improvement. Automatically identifying texts where machine translation systems struggle holds promise for developing more discriminative evaluations and guiding future research. We formalize the task of translation difficulty estimation, defining a text’s difficulty based on the expected quality of its translations. We introduce a new metric to evaluate difficulty estimators and use it to assess both baselines and novel approaches. Finally, we demonstrate the practical utility of difficulty estimators by using them to construct more challenging machine translation benchmarks. Our results show that dedicated models (dubbed Sentinel-src) outperform both heuristic-based methods (e.g. word rarity or syntactic complexity) and LLM-as-a-judge approaches. We release two improved models for difficulty estimation, Sentinel-src-24 and Sentinel-src-25, which can be used to scan large collections of texts and select those most likely to challenge contemporary machine translation systems.

[NLP-46] LaajMeter: A Framework for LaaJ Evaluation

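[Quick Read]: This paper addresses the lack of reliable metrics and threshold criteria for meta-evaluating LLM-as-a-Judge (LaaJ) evaluators in domain-specific settings, where annotated data is scarce and expert evaluation is costly. Existing practice often relies on metrics that have not been validated for the domain, so it is hard to tell whether they separate LaaJs of different quality or what threshold indicates adequate evaluator performance. The key to the solution is LaajMeter, a simulation-based framework for controlled meta-evaluation: it generates synthetic data representing virtual models and judges and systematically analyzes how sensitive each metric is to LaaJ quality under realistic conditions, helping practitioners validate and refine their choice of metrics and thresholds, particularly for trustworthy and reproducible NLP evaluation in low-resource settings.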

Link: https://arxiv.org/abs/2508.10161
Authors: Gal Amram, Eitan Farchi, Shmulik Froimovich, Raviv Gal, Avi Ziv
Affiliations: IBM Research, Israel
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are increasingly used as evaluators in natural language processing tasks, a paradigm known as LLM-as-a-Judge (LaaJ). While effective in general domains, LaaJs pose significant challenges in domain-specific contexts, where annotated data is scarce and expert evaluation is costly. In such cases, meta-evaluation is often performed using metrics that have not been validated for the specific domain in which they are applied. As a result, it becomes difficult to determine which metrics effectively identify LaaJ quality, and further, what threshold indicates sufficient evaluator performance. In this work, we introduce LaaJMeter, a simulation-based framework for controlled meta-evaluation of LaaJs. LaaJMeter enables engineers to generate synthetic data representing virtual models and judges, allowing systematic analysis of evaluation metrics under realistic conditions. This helps practitioners validate and refine LaaJs for specific evaluation tasks: they can test whether their metrics correctly distinguish between better and worse (virtual) LaaJs, and estimate appropriate thresholds for evaluator adequacy. We demonstrate the utility of LaaJMeter in a code translation task involving a legacy programming language, showing how different metrics vary in sensitivity to evaluator quality. Our results highlight the limitations of common metrics and the importance of principled metric selection. LaaJMeter provides a scalable and extensible solution for assessing LaaJs in low-resource settings, contributing to the broader effort to ensure trustworthy and reproducible evaluation in NLP.
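The general idea of simulation-based meta-evaluation can be illustrated with a toy experiment: create virtual judges of known quality and check whether a candidate metric ranks them correctly. This is not the LaaJMeter implementation; the Gaussian noise model for judges, the choice of Pearson correlation as the metric under test, and all names and parameter values below are assumptions of this sketch.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_judge(true_scores, noise_std, rng):
    """Virtual judge: the true quality score plus Gaussian noise."""
    return [score + rng.gauss(0.0, noise_std) for score in true_scores]


rng = random.Random(0)  # fixed seed so the simulation is repeatable
true_scores = [rng.uniform(1.0, 5.0) for _ in range(500)]

good_judge = simulate_judge(true_scores, noise_std=0.2, rng=rng)
poor_judge = simulate_judge(true_scores, noise_std=3.0, rng=rng)

# A useful metric should score the low-noise judge clearly higher.
metric_good = pearson(true_scores, good_judge)
metric_poor = pearson(true_scores, poor_judge)
```

Sweeping `noise_std` over a range of values would show how sharply a given metric separates judges of different quality, which is the kind of sensitivity analysis the abstract describes.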

[NLP-47] Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs

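[Quick Read]: This paper addresses the weak performance of large language models (LLMs) on the complex, interactive tasks common in real-world scenarios, especially their limitations in multi-turn dialogue, information seeking, and reasoning with incomplete data. The key to the solution is a new benchmark suite of multi-turn tasks, each designed to test specific reasoning, interactive dialogue, and information-seeking abilities, with deterministic scoring that removes the need for human intervention and gives an objective measure of where models fall short in complex interactive scenarios.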

Link: https://arxiv.org/abs/2508.10142
Authors: Kartikeya Badola, Jonathan Simon, Arian Hosseini, Sara Marie Mc Carthy, Tsendsuren Munkhdalai, Abhimanyu Goyal, Tomáš Kočiský, Shyam Upadhyay, Bahare Fatemi, Mehran Kazemi
Affiliations: Google DeepMind; Google Research
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need for developing LLMs that can effectively engage in logically consistent multi-turn dialogue, seek information and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, thus eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors emerge from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios and offers a robust platform for future research aimed at improving these critical capabilities.

[NLP-48] mSCoRe: a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning

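[Quick Read]: This paper addresses the fact that it remains unclear how current reasoning-reinforced large language models (LLMs) draw on different human reasoning skills in multilingual commonsense reasoning, especially in settings involving everyday knowledge across languages and cultures. The key to the solution is mSCoRe, a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning, built from three core components: (1) a novel taxonomy of reasoning skills that supports fine-grained analysis of a model's reasoning process; (2) a data synthesis pipeline tailored to commonsense reasoning evaluation, which ensures data quality and diversity; and (3) a complexity scaling framework that lets task difficulty grow with future LLM capabilities. Experiments show that mSCoRe remains challenging for current mainstream LLMs, particularly at higher complexity levels, where it exposes limits in their grasp of multilingual general and cultural commonsense and points to directions for follow-up research.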

Link: https://arxiv.org/abs/2508.10137
Authors: Nghia Trung Ngo,Franck Dernoncourt,Thien Huu Nguyen
Affiliations: University of Oregon; Adobe Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in reasoning-reinforced Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks. However, the mechanism underlying their utilization of different human reasoning skills remains poorly investigated, especially for multilingual commonsense reasoning that involves everyday knowledge across different languages and cultures. To address this gap, we propose a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning (mSCoRe). Our benchmark incorporates three key components that are designed to systematically evaluate LLM’s reasoning capabilities, including: (1) a novel taxonomy of reasoning skills that enables fine-grained analysis of models’ reasoning processes, (2) a robust data synthesis pipeline tailored specifically for commonsense reasoning evaluation, and (3) a complexity scaling framework allowing task difficulty to scale dynamically alongside future improvements in LLM abilities. Extensive experiments on eight state-of-the-art LLMs of varying sizes and training approaches demonstrate that mSCoRe remains significantly challenging for current models, particularly at higher complexity levels. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense. We further provide detailed analysis on the models’ reasoning processes, suggesting future directions for improving multilingual commonsense reasoning capabilities.

[NLP-49] Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts

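[Quick Read]: This paper addresses the high training cost of reinforced fine-tuning (ReFT) with verifiable rewards on complex reasoning tasks such as mathematical reasoning. The standard ReFT framework must generate multiple complete reasoning paths in each training round for scoring, which consumes substantial compute. The core of the solution is a new framework, Nested-ReFT, in which a subset of the target model's layers acts as the behavior model that generates off-policy completions during training. Combined with dynamic layer skipping per batch, this lowers inference cost. Theoretical analysis indicates that the method yields unbiased gradient estimates with controlled variance, and empirical results show higher throughput (tokens/sec) across several math reasoning benchmarks and model sizes, while three bias mitigation strategies keep performance on par with baseline ReFT.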

Link: https://arxiv.org/abs/2508.10123
Authors: Maxime Heuillet,Yufei Cui,Boxing Chen,Audrey Durand,Prasanna Parthasarathi
Affiliations: Université de Lille; Inria; Google; Ecole normale supérieure
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Advanced reasoning in LLMs on challenging domains like mathematical reasoning can be tackled using verifiable rewards based reinforced fine-tuning (ReFT). In standard ReFT frameworks, a behavior model generates multiple completions with answers per problem, for the answer to be then scored by a reward function. While such RL post-training methods demonstrate significant performance improvements across challenging reasoning domains, the computational cost of generating completions during training with multiple inference steps makes the training cost non-trivial. To address this, we draw inspiration from off-policy RL, and speculative decoding to introduce a novel ReFT framework, dubbed Nested-ReFT, where a subset of layers of the target model acts as the behavior model to generate off-policy completions during training. The behavior model configured with dynamic layer skipping per batch during training decreases the inference cost compared to the standard ReFT frameworks. Our theoretical analysis shows that Nested-ReFT yields unbiased gradient estimates with controlled variance. Our empirical analysis demonstrates improved computational efficiency measured as tokens/sec across multiple math reasoning benchmarks and model sizes. Additionally, we explore three variants of bias mitigation to minimize the off-policyness in the gradient updates that allows for maintaining performance that matches the baseline ReFT performance.

[NLP-50] Amazon Nova AI Challenge – Trusted AI: Advancing secure AI-assisted software development

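[Quick Read]: This paper addresses the safety of generative AI in software development, in particular how to ensure that it does not produce harmful, unsafe, or unintended behavior in real use. The key to the solution is the Amazon Nova AI Challenge, which brings university teams together to advance automated red teaming and safety alignment side by side, sets up multi-turn adversarial evaluation, and supplies high-quality annotated data to support iterative improvement. The challenge introduced a custom baseline coding model, a tournament orchestration service, and an evaluation harness, and it drove progress in reasoning-based safety alignment, robust model guardrails, multi-turn jail-breaking, and efficient probing of large language models (LLMs), with the aim of systematically improving the safety and reliability of AI coding assistants.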

Link: https://arxiv.org/abs/2508.10108
Authors: Sattvik Sahai,Prasoon Goyal,Michael Johnston,Anna Gottardi,Yao Lu,Lucy Hu,Luke Dai,Shaohua Liu,Samyuth Sagi,Hangjie Shi,Desheng Zhang,Lavina Vaz,Leslie Ball,Maureen Murray,Rahul Gupta,Shankar Ananthakrishna
Affiliations: Amazon Nova Responsible AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 1st Proceedings of Amazon Nova AI Challenge (Trusted AI 2025)

Abstract:AI systems for software development are rapidly gaining prominence, yet significant challenges remain in ensuring their safety. To address this, Amazon launched the Trusted AI track of the Amazon Nova AI Challenge, a global competition among 10 university teams to drive advances in secure AI. In the challenge, five teams focus on developing automated red teaming bots, while the other five create safe AI assistants. This challenge provides teams with a unique platform to evaluate automated red-teaming and safety alignment methods through head-to-head adversarial tournaments where red teams have multi-turn conversations with the competing AI coding assistants to test their safety alignment. Along with this, the challenge provides teams with a feed of high quality annotated data to fuel iterative improvement. Throughout the challenge, teams developed state-of-the-art techniques, introducing novel approaches in reasoning-based safety alignment, robust model guardrails, multi-turn jail-breaking, and efficient probing of large language models (LLMs). To support these efforts, the Amazon Nova AI Challenge team made substantial scientific and engineering investments, including building a custom baseline coding specialist model for the challenge from scratch, developing a tournament orchestration service, and creating an evaluation harness. This paper outlines the advancements made by university teams and the Amazon Nova AI Challenge team in addressing the safety challenges of AI for software development, highlighting this collaborative effort to raise the bar for AI safety.

[NLP-51] SaraCoder: Orchestrating Semantic and Structural Cues for Profit-Oriented Repository-Level Code Completion

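[Quick Read]: This paper addresses two problems in repository-level code completion: reliance on superficial text similarity, which leads to semantic misguidance, redundancy, and homogeneity, and the inability to resolve external symbol ambiguity. The key to the solution is the Saracoder framework, built around two modules. The first, Hierarchical Feature Optimization, distills deep semantic relationships, prunes duplicates, applies a graph-based structural similarity metric that weights edits by their topological importance, and reranks the results to maximize both relevance and diversity. The second, the External-Aware Identifier Disambiguator, uses dependency analysis to resolve cross-file symbol ambiguity. Experiments show that the method significantly outperforms existing baselines on the CrossCodeEval and RepoEval-Updated benchmarks, offering a new paradigm for building more accurate and robust repository-level code completion systems.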

Link: https://arxiv.org/abs/2508.10068
Authors: Xiaohan Chen,Zhongying Pan,Quan Feng,Yu Tian,Shuqun Yang,Mengru Wang,Lina Gong,Yuxia Geng,Piji Li,Xiang Chen
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Information Retrieval (cs.IR); Programming Languages (cs.PL)
Comments:

Abstract:Retrieval-augmented generation (RAG) for repository-level code completion commonly relies on superficial text similarity, leading to results plagued by semantic misguidance, redundancy, and homogeneity, while also failing to resolve external symbol ambiguity. To address these challenges, we introduce Saracoder, a Hierarchical Feature-Optimized retrieval framework. Its core Hierarchical Feature Optimization module systematically refines candidates by distilling deep semantic relationships, pruning exact duplicates, assessing structural similarity with a novel graph-based metric that weighs edits by their topological importance, and reranking results to maximize both relevance and diversity. Furthermore, an External-Aware Identifier Disambiguator module accurately resolves cross-file symbol ambiguity via dependency analysis. Extensive experiments on the challenging CrossCodeEval and RepoEval-Updated benchmarks demonstrate that Saracoder significantly outperforms existing baselines across multiple programming languages and models. Our work proves that systematically refining retrieval results across multiple dimensions provides a new paradigm for building more accurate and robust repository-level code completion systems.

[NLP-52] Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion

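[Quick Read]: This paper addresses the unstable performance of large language models (LLMs) on few-shot information extraction (IE). The core challenge is that conventional example selection strategies fail to identify and exploit the model's own uncertainty, in particular the "introspective confusion" that arises jointly from difficulty generating the required format (for example, syntax errors in structured output) and from inconsistent semantic content. The key to the solution is a new framework, Active Prompting for Information Extraction (APIE), based on a dual-component uncertainty metric: Format Uncertainty (how hard it is to generate the correct syntactic structure) and Content Uncertainty (how consistent the extracted semantics are). Unlabeled data are ranked by the combined score, and the most challenging and informative samples are actively selected as few-shot exemplars, which improves extraction accuracy and robustness across multiple benchmarks.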

Link: https://arxiv.org/abs/2508.10036
Authors: Dong Zhao,Yadong Wang,Xiang Chen,Chenxi Wang,Hongliang Dai,Chuanxing Geng,Shengzhong Zhang,Shaoyuan Li,Sheng-Jun Huang
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Under Review

Abstract:Large Language Models (LLMs) show remarkable potential for few-shot information extraction (IE), yet their performance is highly sensitive to the choice of in-context examples. Conventional selection strategies often fail to provide informative guidance, as they overlook a key source of model fallibility: confusion stemming not just from semantic content, but also from the generation of well-structured formats required by IE tasks. To address this, we introduce Active Prompting for Information Extraction (APIE), a novel active prompting framework guided by a principle we term introspective confusion. Our method empowers an LLM to assess its own confusion through a dual-component uncertainty metric that uniquely quantifies both Format Uncertainty (difficulty in generating correct syntax) and Content Uncertainty (inconsistency in extracted semantics). By ranking unlabeled data with this comprehensive score, our framework actively selects the most challenging and informative samples to serve as few-shot exemplars. Extensive experiments on four benchmarks show that our approach consistently outperforms strong baselines, yielding significant improvements in both extraction accuracy and robustness. Our work highlights the critical importance of a fine-grained, dual-level view of model uncertainty when it comes to building effective and reliable structured generation systems.
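
The dual-component score described in this abstract could look roughly like the Python sketch below. It assumes the model is sampled several times per unlabeled input and that the required output format is JSON. The function names, the parse-failure rate as a proxy for Format Uncertainty, the normalized entropy as a proxy for Content Uncertainty, and the equal weighting are all illustrative assumptions, not the paper's definitions.

```python
import json
import math
from collections import Counter


def _try_parse(sample: str) -> tuple[bool, object]:
    """Return (True, value) if the sample is valid JSON, else (False, None)."""
    try:
        return True, json.loads(sample)
    except json.JSONDecodeError:
        return False, None


def format_uncertainty(samples: list[str]) -> float:
    """Assumed proxy: fraction of sampled outputs that fail to parse."""
    failures = sum(1 for s in samples if not _try_parse(s)[0])
    return failures / len(samples)


def content_uncertainty(samples: list[str]) -> float:
    """Assumed proxy: normalized entropy over the distinct parsed outputs."""
    parsed = [value for ok, value in map(_try_parse, samples) if ok]
    if len(parsed) < 2:
        return 0.0
    counts = Counter(json.dumps(v, sort_keys=True) for v in parsed)
    n = len(parsed)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n)


def rank_by_confusion(pool: dict[str, list[str]], weight: float = 0.5) -> list[str]:
    """Sort unlabeled inputs from most to least confusing (hypothetical weighting)."""
    def score(samples: list[str]) -> float:
        return (weight * format_uncertainty(samples)
                + (1 - weight) * content_uncertainty(samples))
    return sorted(pool, key=lambda text: score(pool[text]), reverse=True)
```

Under these assumptions, the inputs at the front of the ranked list would be the candidates to annotate and use as few-shot exemplars.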

[NLP-53] The Cost of Thinking: Increased Jailbreak Risk in Large Language Models

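[Quick Read]: This paper addresses the finding that large language models (LLMs) become easier to jailbreak when thinking mode is enabled. The study reports that, on AdvBench and HarmBench, LLMs in thinking mode have a higher attack success rate than in non-thinking mode in nearly all cases, that successfully attacked samples typically invoke educational purposes or involve excessively long chains of thought, and that models produce harmful answers even when they largely recognize the question as harmful. To mitigate this, the paper proposes a safe thinking intervention, whose key idea is to add "specific thinking tokens" to the prompt to explicitly guide the model's internal reasoning, which lowers the attack success rate of LLMs in thinking mode.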

Link: https://arxiv.org/abs/2508.10032
Authors: Fan Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Thinking mode has always been regarded as one of the most valuable modes in LLMs. However, we uncover a surprising and previously overlooked phenomenon: LLMs with thinking mode are more easily broken by Jailbreak attack. We evaluate 9 LLMs on AdvBench and HarmBench and find that the success rate of attacking thinking mode in LLMs is almost higher than that of non-thinking mode. Through large numbers of sample studies, it is found that for educational purposes and excessively long thinking lengths are the characteristics of successfully attacked data, and LLMs also give harmful answers when they mostly know that the questions are harmful. In order to alleviate the above problems, this paper proposes a method of safe thinking intervention for LLMs, which explicitly guides the internal thinking processes of LLMs by adding “specific thinking tokens” of LLMs to the prompt. The results demonstrate that the safe thinking intervention can significantly reduce the attack success rate of LLMs with thinking mode.

[NLP-54] Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs

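[Quick Read]: This paper addresses the safety and ethical risks that large language models (LLMs) face from jailbreak attacks, in which malicious users craft adversarial context to induce the model to generate harmful content. The key to the solution is an input pre-processing mechanism called the Context Filtering model, which filters out untrustworthy and unreliable context while identifying the primary prompt that carries the real user intent, thereby uncovering concealed malicious intent. The method reduces jailbreak attack success rates by up to 88% without degrading the LLM's original performance, achieving state-of-the-art Safety and Helpfulness Product results. It is plug-and-play, applies to both white-box and black-box LLMs, and requires no fine-tuning of the model itself.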

Link: https://arxiv.org/abs/2508.10031
Authors: Jinhwa Kim,Ian G. Harris
Affiliations: University of California, Irvine
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 2 figures

Abstract:While Large Language Models (LLMs) have shown significant advancements in performance, various jailbreak attacks have posed growing safety and ethical risks. Malicious users often exploit adversarial context to deceive LLMs, prompting them to generate responses to harmful queries. In this study, we propose a new defense mechanism called Context Filtering model, an input pre-processing method designed to filter out untrustworthy and unreliable context while identifying the primary prompts containing the real user intent to uncover concealed malicious intent. Given that enhancing the safety of LLMs often compromises their helpfulness, potentially affecting the experience of benign users, our method aims to improve the safety of the LLMs while preserving their original performance. We evaluate the effectiveness of our model in defending against jailbreak attacks through comparative analysis, comparing our approach with state-of-the-art defense mechanisms against six different attacks and assessing the helpfulness of LLMs under these defenses. Our model demonstrates its ability to reduce the Attack Success Rates of jailbreak attacks by up to 88% while maintaining the original LLMs’ performance, achieving state-of-the-art Safety and Helpfulness Product results. Notably, our model is a plug-and-play method that can be applied to all LLMs, including both white-box and black-box models, to enhance their safety without requiring any fine-tuning of the models themselves. We will make our model publicly available for research purposes.

[NLP-55] Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models

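[Quick Read]: This paper addresses the fact that existing prompt optimization methods for aligning black-box large language models (LLMs) ignore the choice of inference strategy. Current methods typically optimize the prompt in isolation, without regard to the inference strategy used at deployment (such as Best-of-N sampling or majority voting), even though empirical and theoretical analysis shows that the two are strongly interdependent and that user preferences over multi-objective trade-offs and compute budgets substantially affect the best configuration. The key to the solution is a unified framework, IAPO (Inference-Aware Prompt Optimization), which jointly optimizes the prompt and the inference scale while accounting for the inference budget and task objectives. The authors further develop a fixed-budget training algorithm, PSST (Prompt Scaling via Sequential Trimming), and analyze finite-budget guarantees on its error probability, with the aim of more efficient alignment under practical constraints.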

Link: https://arxiv.org/abs/2508.10030
Authors: Saaduddin Mahmud,Mason Nakamura,Kyle H. Wray,Shlomo Zilberstein
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages

Abstract:Prompt optimization methods have demonstrated significant effectiveness in aligning black-box large language models (LLMs). In parallel, inference scaling strategies such as Best-of-N Sampling and Majority Voting have also proven to enhance alignment and performance by trading off computation. However, existing prompt optimization approaches are inference strategy agnostic; that is, they optimize prompts without regard to the inference strategy employed during deployment. This constitutes a significant methodological gap, as our empirical and theoretical analysis reveals a strong interdependence between these two paradigms. Moreover, we find that user preferences regarding trade-offs among multiple objectives and inference budgets substantially influence the choice of prompt and inference configuration. To address this gap, we introduce a unified novel framework named IAPO (Inference-Aware Prompt Optimization) that jointly optimizes the prompt and inference scale, while being aware of the inference budget and different task objectives. We then develop a fixed-budget training algorithm for IAPO, which we call PSST (Prompt Scaling via Sequential Trimming), and analyze finite-budget guarantees on error probability. Finally, we evaluate the effectiveness of PSST on six different tasks, including multi-objective text generation and reasoning, and demonstrate the critical role of incorporating inference-awareness when aligning black-box LLMs through prompt optimization.
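
The two inference scaling strategies named in this abstract can be sketched in a few lines of Python. This is a generic illustration of Best-of-N sampling and majority voting, not IAPO or PSST. `generate` and `reward` are hypothetical callables standing in for a black-box LLM and a scoring function.

```python
from collections import Counter
from typing import Callable


def best_of_n(prompt: str, n: int,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float]) -> str:
    """Draw n candidates and return the one with the highest reward."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))


def majority_vote(prompt: str, n: int,
                  generate: Callable[[str], str]) -> str:
    """Draw n answers and return the most frequent one (first seen wins ties)."""
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Both strategies spend `n` model calls per query, so the choice of prompt and the choice of `n` draw on the same budget. That shared budget is one plausible reading of why the abstract argues for optimizing the two jointly.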

[NLP-56] Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs

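[Quick Read]: This paper addresses the vulnerability of safety-aligned large language models (LLMs) to jailbreak attacks, in which crafted inputs induce the model to generate harmful content or content that violates its safety policy. The key contribution is a representation-based attack, Latent Fusion Jailbreak (LFJ): it selects harmful and benign query pairs with high thematic and syntactic similarity, performs gradient-guided interpolation of hidden states at influential layers and tokens, and applies optimization to balance attack success, output fluency, and computational efficiency. Experiments report an average attack success rate of 94.01% across several benchmarks, outperforming existing methods. The authors also propose an adversarial training defense that fine-tunes the model on interpolated examples, reducing the attack success rate by over 80% without degrading performance on benign inputs.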

Link: https://arxiv.org/abs/2508.10029
Authors: Wenpeng Xing,Mohan Li,Chunqiang Hu,Haitao Xu,Ningyu Zhang,Bo Lin,Meng Han
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Large language models (LLMs) demonstrate impressive capabilities in various language tasks but are susceptible to jailbreak attacks that circumvent their safety alignments. This paper introduces Latent Fusion Jailbreak (LFJ), a representation-based attack that interpolates hidden states from harmful and benign query pairs to elicit prohibited responses. LFJ begins by selecting query pairs with high thematic and syntactic similarity, then performs gradient-guided interpolation at influential layers and tokens, followed by optimization to balance attack success, output fluency, and computational efficiency. Evaluations on models such as Vicuna and LLaMA-2 across benchmarks like AdvBench and MaliciousInstruct yield an average attack success rate (ASR) of 94.01%, outperforming existing methods. To mitigate LFJ, we propose an adversarial training defense that fine-tunes models on interpolated examples, reducing ASR by over 80% without degrading performance on benign inputs. Ablation studies validate the importance of query pair selection, hidden state interpolation components, and optimization strategies in LFJ’s effectiveness.

[NLP-57] PREF: Reference-Free Evaluation of Personalised Text Generation in LLMs

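[Quick Read]: This paper addresses the lack of attention to individual user differences in evaluating personalised text generation: existing methods usually ignore user-specific preferences, so evaluation results diverge from real user satisfaction. The key to the solution is PREF (Personalised Reference-free Evaluation Framework), which needs no gold personalised references and works in three stages. First, a large language model (LLM) generates a guideline covering universal quality criteria such as factuality, coherence, and completeness. Second, these factors are re-ranked and selectively augmented using the target user's profile or inferred preferences, producing a personalised evaluation rubric. Third, an LLM judge scores candidate answers against that rubric, ensuring baseline quality while capturing subjective priorities. This staged design improves robustness, transparency, and reusability, and allows smaller models to approximate the personalised quality of larger ones.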

Link: https://arxiv.org/abs/2508.10028
Authors: Xiao Fu,Hossein A. Rahmani,Bin Wu,Jerome Ramos,Emine Yilmaz,Aldo Lipani
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 7 pages

Abstract:Personalised text generation is essential for user-centric information systems, yet most evaluation methods overlook the individuality of users. We introduce PREF, a Personalised Reference-free Evaluation Framework that jointly measures general output quality and user-specific alignment without requiring gold personalised references. PREF operates in a three-step pipeline: (1) a coverage stage uses a large language model (LLM) to generate a comprehensive, query-specific guideline covering universal criteria such as factuality, coherence, and completeness; (2) a preference stage re-ranks and selectively augments these factors using the target user’s profile, stated or inferred preferences, and context, producing a personalised evaluation rubric; and (3) a scoring stage applies an LLM judge to rate candidate answers against this rubric, ensuring baseline adequacy while capturing subjective priorities. This separation of coverage from preference improves robustness, transparency, and reusability, and allows smaller models to approximate the personalised quality of larger ones. Experiments on the PrefEval benchmark, including implicit preference-following tasks, show that PREF achieves higher accuracy, better calibration, and closer alignment with human judgments than strong baselines. By enabling scalable, interpretable, and user-aligned evaluation, PREF lays the groundwork for more reliable assessment and development of personalised language generation systems.

[NLP-58] LLMCARE: Alzheimer’s Detection via Transformer Models Enhanced by LLM-Generated Synthetic Data

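[Quick Read]: This paper addresses the low diagnosis rate in early screening for Alzheimer's disease and related dementias (ADRD), proposing a speech-based natural language processing (NLP) approach to identify linguistic markers of early cognitive decline. The core of the solution is threefold: (1) a fusion model that combines transformer embeddings with 110 handcrafted lexical features, which improves classification performance (F1 = 83.3); (2) data augmentation with label-conditioned synthetic speech generated by large language models (LLMs), where augmenting the training data with 2x MedAlpaca-7B synthetic speech raises F1 to 85.7; and (3) a comparison of unimodal and multimodal LLM classifiers, which finds that fine-tuning greatly improves unimodal models (for example, MedAlpaca rises from F1 = 47.3 to 78.5) while current multimodal models remain limited (GPT-4o reaches F1 = 70.2). The study indicates that combining transformer embeddings with linguistic features, and using clinically tuned LLMs for data augmentation and classification, is a key route to more accurate ADRD detection.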

Link: https://arxiv.org/abs/2508.10027
Authors: Ali Zolnour,Hossein Azadmaleki,Yasaman Haghbin,Fatemeh Taherinezhad,Mohamad Javad Momeni Nezhad,Sina Rashidi,Masoud Khani,AmirSajjad Taleban,Samin Mahdizadeh Sani,Maryam Dadkhah,James M. Noble,Suzanne Bakken,Yadollah Yaghoobzadeh,Abdol-Hossein Vahabie,Masoud Rouhizadeh,Maryam Zolnoori
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Alzheimer’s disease and related dementias (ADRD) affect approximately five million older adults in the U.S., yet over half remain undiagnosed. Speech-based natural language processing (NLP) offers a promising, scalable approach to detect early cognitive decline through linguistic markers. To develop and evaluate a screening pipeline that (i) fuses transformer embeddings with handcrafted linguistic features, (ii) tests data augmentation using synthetic speech generated by large language models (LLMs), and (iii) benchmarks unimodal and multimodal LLM classifiers for ADRD detection. Transcripts from the DementiaBank “cookie-theft” task (n = 237) were used. Ten transformer models were evaluated under three fine-tuning strategies. A fusion model combined embeddings from the top-performing transformer with 110 lexical-derived linguistic features. Five LLMs (LLaMA-8B/70B, MedAlpaca-7B, Ministral-8B, GPT-4o) were fine-tuned to generate label-conditioned synthetic speech, which was used to augment training data. Three multimodal models (GPT-4o, Qwen-Omni, Phi-4) were tested for speech-text classification in zero-shot and fine-tuned settings. The fusion model achieved F1 = 83.3 (AUC = 89.5), outperforming linguistic or transformer-only baselines. Augmenting training data with 2x MedAlpaca-7B synthetic speech increased F1 to 85.7. Fine-tuning significantly improved unimodal LLM classifiers (e.g., MedAlpaca: F1 = 47.3 to 78.5). Current multimodal models demonstrated lower performance (GPT-4o = 70.2 F1; Qwen = 66.0). Performance gains aligned with the distributional similarity between synthetic and real speech. Integrating transformer embeddings with linguistic features enhances ADRD detection from speech. Clinically tuned LLMs effectively support both classification and data augmentation, while further advancement is needed in multimodal modeling.

[NLP-59] SABER: Switchable and Balanced Training for Efficient LLM Reasoning

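[Quick Read]: This paper addresses the excessive inference cost and latency that large language models (LLMs) incur when chain-of-thought reasoning is applied uniformly to all problems. The core of the solution is SABER (Switchable and Balanced Training for Efficient LLM Reasoning), a reinforcement learning training framework that gives models user-controllable, token-budgeted reasoning. Its key ideas are as follows: it first profiles the base model's thinking-token usage on each training example and assigns the example to one of the predefined budget tiers; during fine-tuning, system prompts and length-aware rewards guide the model to respect the assigned budget; no-think examples are included so that the model stays reliable when explicit reasoning is turned off; and four discrete inference modes (NoThink, FastThink, CoreThink, DeepThink) allow a flexible trade-off between latency and reasoning depth.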

Link: https://arxiv.org/abs/2508.10026
Authors: Kai Zhao,Yanjun Zhao,Jiaming Song,Shien He,Lusheng Zhang,Qiang Zhang,Tianjiao Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) empowered by chain-of-thought reasoning have achieved impressive accuracy on complex tasks but suffer from excessive inference costs and latency when applied uniformly to all problems. We propose SABER (Switchable and Balanced Training for Efficient LLM Reasoning), a reinforcement learning framework that endows LLMs with user-controllable, token-budgeted reasoning. SABER first profiles each training example’s base-model thinking token usage and assigns it to one of the predefined budget tiers. During fine-tuning, the model is guided by system prompts and length-aware rewards to respect its assigned budget. In parallel, we incorporate no-think examples to ensure the model remains reliable even when explicit reasoning is turned off. SABER further supports four discrete inference modes - NoThink, FastThink, CoreThink, and DeepThink, enabling flexible trade-offs between latency and reasoning depth. Extensive evaluations on math reasoning (MATH, GSM8K), code generation (MBPP), and logical reasoning (LiveBench-Reasoning) demonstrate that SABER achieves high accuracy under tight budgets, graceful degradation, and effective cross-scale and cross-domain generalization. In particular, SABER-FastThink cuts reasoning length by 65.4% and yields a 3.6% accuracy gain compared with the base model on the MATH benchmark.
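
A rough Python sketch of the budget-tier assignment and length-aware reward described in this abstract follows. The tier boundaries, the mapping of tiers onto the three thinking modes, and the shape of the penalty are all hypothetical. The abstract gives none of these values, so this should be read as an illustration of the idea rather than SABER's actual training recipe.

```python
# Hypothetical budgets in thinking tokens; the abstract does not state the real values.
BUDGET_TIERS = {"FastThink": 256, "CoreThink": 1024, "DeepThink": 4096}


def assign_tier(base_tokens: int) -> str:
    """Pick the smallest tier whose budget covers the base model's token usage."""
    for name, budget in BUDGET_TIERS.items():
        if base_tokens <= budget:
            return name
    return "DeepThink"  # anything longer falls into the largest tier


def length_aware_reward(correct: bool, used_tokens: int, budget: int,
                        penalty: float = 0.5) -> float:
    """Correctness reward minus a capped penalty for exceeding the budget (assumed shape)."""
    base = 1.0 if correct else 0.0
    if used_tokens <= budget:
        return base
    overshoot = (used_tokens - budget) / max(budget, 1)
    return base - penalty * min(overshoot, 1.0)
```

In this sketch a correct answer that stays within budget earns the full reward, and the penalty grows linearly with the overshoot until it is capped, which is one simple way to steer a policy toward its assigned tier.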

[NLP-60] Detecting and explaining postpartum depression in real-time with generative artificial intelligence

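[Quick Read]: This paper addresses the rapid detection of postpartum depression (PPD) and the analysis of its risk factors, with the aim of enabling timely intervention and targeted prevention. The key to the solution is an intelligent screening system that combines Natural Language Processing (NLP), Machine Learning (ML), and Large Language Models (LLMs) to perform real-time assessment through non-invasive analysis of free speech. To make predictions more transparent, it pairs interpretable tree-based models with LLMs, using feature importance and natural language descriptions to ease the "black box" problem. The system reaches 90% on PPD detection across all evaluation metrics, outperforming the competing solutions in the literature.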

Link: https://arxiv.org/abs/2508.10025
Authors: Silvia García-Méndez,Francisco de Arriba-Pérez
Affiliations: Information Technologies Group, atlanTTic, University of Vigo, Vigo, Spain
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Among the many challenges mothers undergo after childbirth, postpartum depression (PPD) is a severe condition that significantly impacts their mental and physical well-being. Consequently, the rapid detection of PPD and their associated risk factors is critical for in-time assessment and intervention through specialized prevention procedures. Accordingly, this work addresses the need to help practitioners make decisions with the latest technological advancements to enable real-time screening and treatment recommendations. Mainly, our work contributes to an intelligent PPD screening system that combines Natural Language Processing, Machine Learning (ML), and Large Language Models (LLMs) towards an affordable, real-time, and non-invasive free speech analysis. Moreover, it addresses the black box problem since the predictions are described to the end users thanks to the combination of LLMs with interpretable ML models (i.e., tree-based algorithms) using feature importance and natural language. The results obtained are 90% on PPD detection for all evaluation metrics, outperforming the competing solutions in the literature. Ultimately, our solution contributes to the rapid detection of PPD and their associated risk factors, critical for in-time and proper assessment and intervention.

[NLP-61] RTTC: Reward-Guided Collaborative Test-Time Compute

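[Quick Read]: This paper addresses the high computational overhead of applying Test-Time Compute (TTC) during large language model (LLM) inference: the best TTC strategy varies across queries, so applying RAG or Test-Time Training (TTT) indiscriminately wastes resources. The key to the solution is Reward-Guided Test-Time Compute (RTTC), which uses a pretrained reward model to select the most effective TTC strategy for each query, maintaining downstream accuracy while minimizing redundant computation. RTTC also uses a distributed server-client architecture together with Query-State Caching, which allows historical query states to be reused efficiently at both the retrieval and adaptation levels, improving scalability and efficiency.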

Link: https://arxiv.org/abs/2508.10024
Authors: J. Pablo Muñoz, Jinjie Yuan
Affiliations: Intel Labs; Intel Corporation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Test-Time Compute (TTC) has emerged as a powerful paradigm for enhancing the performance of Large Language Models (LLMs) at inference, leveraging strategies such as Test-Time Training (TTT) and Retrieval-Augmented Generation (RAG). However, the optimal adaptation strategy varies across queries, and indiscriminate application of TTC strategy incurs substantial computational overhead. In this work, we introduce Reward-Guided Test-Time Compute (RTTC), a novel framework that adaptively selects the most effective TTC strategy for each query via a pretrained reward model, maximizing downstream accuracy across diverse domains and tasks. RTTC operates in a distributed server-client architecture, retrieving relevant samples from a remote knowledge base and applying RAG or lightweight fine-tuning on client devices only when necessary. To further mitigate redundant computation, we propose Query-State Caching, which enables the efficient reuse of historical query states at both retrieval and adaptation levels. Extensive experiments across multiple LLMs and benchmarks demonstrate that RTTC consistently achieves superior accuracy compared to vanilla RAG or TTT, validating the necessity of adaptive, reward-guided TTC selection and the potential of RTTC for scalable, high-performance language model adaptation.
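
The per-query selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the escalation order, the `threshold` parameter, the exact-match `cache`, and the names `select_and_answer` and `reward_fn` are all assumptions made for illustration, and the reward model is replaced by an arbitrary scoring callable.

```python
from typing import Callable, Dict, List, Tuple

Result = Tuple[str, str, float]  # (strategy name, answer, reward score)


def select_and_answer(
    query: str,
    strategies: List[Tuple[str, Callable[[str], str]]],
    reward_fn: Callable[[str, str], float],
    cache: Dict[str, Result],
    threshold: float = 0.8,
) -> Result:
    """Escalate through strategies, cheapest first, until the reward is high enough."""
    if query in cache:  # reuse a previously computed query state (exact match only)
        return cache[query]
    best: Result = ("", "", float("-inf"))
    for name, run in strategies:
        answer = run(query)
        score = reward_fn(query, answer)
        if score > best[2]:
            best = (name, answer, score)
        if score >= threshold:  # good enough: skip the costlier strategies
            break
    cache[query] = best
    return best
```

Ordering `strategies` from cheapest to most expensive (for example plain generation, then RAG, then lightweight fine-tuning) means the costly options run only when the reward score says they are needed, which is the intuition behind applying adaptation "only when necessary".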

[NLP-62] Conformal P-Value in Multiple-Choice Question Answering Tasks with Provable Risk Control

[Quick read]: This paper addresses the unreliable responses of Large Language Models (LLMs) in Multiple-Choice Question Answering (MCQA) tasks caused by hallucination and nonfactual generation. The key to the solution is a significance-testing-enhanced conformal prediction (CP) framework: it computes option frequencies through self-consistency resampling, combines $p$-value computation with conformity scoring, and builds prediction sets via null hypothesis testing ($\mathcal{H}_0$), thereby achieving user-specified empirical miscoverage rates and validating Average Prediction Set Size (APSS) as an uncertainty metric.

Link: https://arxiv.org/abs/2508.10022
Authors: Yuanchang Ye
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study introduces a significance testing-enhanced conformal prediction (CP) framework to improve trustworthiness of large language models (LLMs) in multiple-choice question answering (MCQA). While LLMs have been increasingly deployed in disciplinary QA scenarios, hallucination and nonfactual generation substantially compromise response reliability. Although CP provides statistically rigorous marginal coverage guarantees for prediction sets, and significance testing offers established statistical rigor, their synergistic integration remains unexplored. To mitigate hallucination and factual inaccuracies, our framework integrates $p$-value computation with conformity scoring through self-consistency resampling of MCQA responses. This approach calculates option frequencies to address LLMs’ black-box nature, subsequently constructing prediction sets via null hypothesis testing ($\mathcal{H}_0$) with empirically derived $p$-values. Evaluations on MMLU and MMLU-Pro benchmarks using off-the-shelf LLMs demonstrate: (1) The enhanced CP achieves user-specified empirical miscoverage rates; (2) Test-set average prediction set size (APSS) decreases monotonically with increasing risk levels ($\alpha$), validating APSS as an effective uncertainty metric. This work establishes a principled statistical framework for trustworthy LLM deployment in high-stakes QA applications.
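
A minimal sketch of how frequency-based conformal $p$-values can produce prediction sets is given below. It uses the standard split-conformal construction and assumes the nonconformity score is one minus the sampled frequency of an option; the paper's exact scoring rule may differ, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def option_frequencies(samples: List[str], options: List[str]) -> Dict[str, float]:
    """Share of resampled LLM answers that picked each option."""
    counts = Counter(samples)
    return {o: counts[o] / len(samples) for o in options}


def conformal_p_value(cal_scores: List[float], test_score: float) -> float:
    """Fraction of calibration scores at least as nonconforming as the test score."""
    n = len(cal_scores)
    return (sum(s >= test_score for s in cal_scores) + 1) / (n + 1)


def prediction_set(freqs: Dict[str, float], cal_scores: List[float], alpha: float) -> List[str]:
    """Keep every option whose p-value exceeds alpha (the null is not rejected)."""
    return [
        o for o, f in freqs.items()
        if conformal_p_value(cal_scores, 1.0 - f) > alpha
    ]
```

Here `cal_scores` would hold one minus the frequency of the correct option for each calibration question. Raising `alpha` rejects more options, so the prediction set can only shrink, which matches the reported decrease of APSS with increasing risk level.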

[NLP-63] LATTE: Learning Aligned Transactions and Textual Embeddings for Bank Clients

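[Quick read]: This paper addresses the problem of learning efficient, semantically meaningful client embeddings from clients' historic communication sequences in financial settings, where applying large language models (LLMs) directly to long event sequences is computationally expensive and impractical to deploy. The key to the solution is LATTE, a contrastive learning framework that aligns raw event embeddings with semantic embeddings produced by a frozen LLM: behavioral features are compressed into short prompts, encoded by the LLM, and used as the supervision signal. This yields event sequence representations that outperform existing methods at very low inference cost and input size, while meeting the deployment requirements of latency-sensitive environments.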

Link: https://arxiv.org/abs/2508.10021
Authors: Egor Fadeev, Dzhambulat Mollaev, Aleksei Shestov, Dima Korolev, Omar Zoloev, Ivan Kireev, Andrey Savchenko, Maksim Makarenko
Affiliations: Sber AI Lab
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Learning client embeddings from sequences of their historic communications is central to financial applications. While large language models (LLMs) offer general world knowledge, their direct use on long event sequences is computationally expensive and impractical in real-world pipelines. In this paper, we propose LATTE, a contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs. Behavioral features are summarized into short prompts, embedded by the LLM, and used as supervision via contrastive loss. The proposed approach significantly reduces inference cost and input size compared to conventional processing of complete sequences by an LLM. We experimentally show that our method outperforms state-of-the-art techniques for learning event sequence representations on real-world financial datasets while remaining deployable in latency-sensitive environments.
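
The alignment objective can be pictured with a small sketch of a contrastive loss. The abstract only says "contrastive loss", so the one-directional InfoNCE form, the cosine similarity, and the `temperature` value below are assumptions, as are the function names.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(event_embs: List[Vector], text_embs: List[Vector], temperature: float = 0.1) -> float:
    """Mean cross-entropy of matching event embedding i to text embedding i within the batch."""
    total = 0.0
    for i, event in enumerate(event_embs):
        logits = [cosine(event, text) / temperature for text in text_embs]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(event_embs)
```

In this picture `text_embs` would come from the frozen LLM applied to the short behavioral prompts, and only the event encoder producing `event_embs` would be trained, so the LLM is not needed at inference time.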

[NLP-64] FedCoT: Communication-Efficient Federated Reasoning Enhancement for Large Language Models

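[Quick read]: This paper addresses the difficulty of efficiently improving the reasoning ability of Large Language Models (LLMs) in Federated Learning (FL) settings, particularly in healthcare, where reasoning outputs must be both accurate and interpretable under strict computation, communication, and privacy constraints. Conventional federated fine-tuning focuses only on answer correctness, neglects the quality of the Chain-of-Thought (CoT) rationale, and relies on knowledge distillation that may violate privacy; its communication overhead is also high. The key to the solution is the FedCoT framework: through a lightweight CoT enhancement mechanism, local models generate multiple reasoning paths and a compact discriminator dynamically selects the best one, improving reasoning accuracy and robustness while preserving interpretability. In addition, an improved aggregation strategy builds on LoRA module stacking and adds client classifier-awareness, achieving noise-free aggregation across heterogeneous clients, substantially reducing communication cost, and improving generalization.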

Link: https://arxiv.org/abs/2508.10020
Authors: Chuan Li, Qianyi Zhao, Fengran Mo, Cen Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Efficiently enhancing the reasoning capabilities of large language models (LLMs) in federated learning environments remains challenging, particularly when balancing performance gains with strict computational, communication, and privacy constraints. This challenge is especially acute in healthcare, where decisions (spanning clinical, operational, and patient-facing contexts) demand not only accurate outputs but also interpretable, traceable rationales to ensure safety, accountability, and regulatory compliance. Conventional federated tuning approaches on LLMs fail to address this need: they optimize primarily for answer correctness while neglecting rationale quality, leaving CoT capabilities dependent on models’ innate pre-training abilities. Moreover, existing methods for improving rationales typically rely on privacy-violating knowledge distillation from centralized models. Additionally, the communication overhead in traditional federated fine-tuning on LLMs remains substantial. We address this gap by proposing FedCoT, a novel framework specifically designed to enhance reasoning in federated settings. FedCoT leverages a lightweight chain-of-thought enhancement mechanism: local models generate multiple reasoning paths, and a compact discriminator dynamically selects the most promising one. This approach improves reasoning accuracy and robustness while providing valuable interpretability, which is particularly critical for medical applications. To manage client heterogeneity efficiently, we adopt an improved aggregation approach building upon advanced LoRA module stacking, incorporating client classifier-awareness to achieve noise-free aggregation across diverse clients. Comprehensive experiments on medical reasoning tasks demonstrate that FedCoT significantly boosts client-side reasoning performance under stringent resource budgets while fully preserving data privacy.
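
The client-side step, in which several reasoning paths are generated and a discriminator picks one, can be sketched as a best-of-N selection. This is only an illustration of that one step: `generate`, `discriminator`, and `n_paths` are hypothetical placeholders, and the federated aggregation of LoRA modules is not shown.

```python
from typing import Callable, List


def select_reasoning_path(
    question: str,
    generate: Callable[[str, int], str],
    discriminator: Callable[[str, str], float],
    n_paths: int = 4,
) -> str:
    """Sample n_paths rationales locally and return the one the discriminator scores highest."""
    paths: List[str] = [generate(question, seed) for seed in range(n_paths)]
    scores = [discriminator(question, path) for path in paths]
    best = max(range(n_paths), key=scores.__getitem__)
    return paths[best]
```

Because both the generator and the discriminator run on the client, the selected rationale never has to leave the device, which is consistent with the privacy constraint the abstract emphasizes.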

[NLP-65] Decoupling Understanding from Reasoning via Problem Space Mapping for Small-scale Model Reasoning

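[Quick read]: This paper addresses the limited performance of Small Language Models (SLMs) on mathematical and logical reasoning tasks. The core challenge is that the complexity and variability of natural language input make it hard for the model to extract the essential problem from redundant or distracting information, which hurts reasoning efficiency and accuracy. The key to the solution is a new framework that decouples understanding from reasoning: natural language problems are mapped into a canonical problem space that is semantically simplified yet expressive, so that SLMs can focus on reasoning over standardized inputs without interference from surface-level linguistic variation. Concretely, the authors design DURIT (Decoupled Understanding from Reasoning via Iterative Training), a three-step iterative training procedure that first builds a problem mapper with reinforcement learning, then aligns reasoning trajectories through self-distillation, and finally trains the reasoning policy in the canonical problem space, alternating optimization between the mapper and the reasoner. Experiments show that the method substantially improves SLM performance and robustness on both in-domain and out-of-domain reasoning tasks, validating the decoupling strategy.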

Link: https://arxiv.org/abs/2508.10019
Authors: Li Wang, Changhao Zhang, Zengqi Xiu, Kai Lu, Xin Yu, Kui Zhang, Wenjun Wu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite recent advances in the reasoning capabilities of Large Language Models (LLMs), improving the reasoning ability of Small Language Models (SLMs, e.g., $\leq$ 1.5B) remains challenging. A key obstacle lies in the complexity and variability of natural language: essentially equivalent problems often appear in diverse surface forms, often obscured by redundant or distracting details. This imposes a dual burden on SLMs: they must first extract the core problem from complex linguistic input, and then perform reasoning based on that understanding. The resulting vast and noisy problem space hinders optimization, particularly for models with limited capacity. To address this, we propose a new framework that decouples understanding from reasoning by mapping natural language problems into a canonical problem space, a semantically simplified yet expressive domain. This enables SLMs to focus on reasoning over standardized inputs, free from linguistic variability. Within this framework, we introduce DURIT (Decoupled Understanding from Reasoning via Iterative Training), a three-step algorithm that iteratively: (1) maps natural language problems via reinforcement learning, (2) aligns reasoning trajectories through self-distillation, and (3) trains reasoning policies in the problem space. The mapper and reasoner are co-trained in an alternating loop throughout this process. Experiments show that DURIT substantially improves SLMs’ performance on both in-domain and out-of-domain mathematical and logical reasoning tasks. Beyond improving reasoning capabilities, DURIT also improves the robustness of reasoning, validating decoupling understanding from reasoning as an effective strategy for strengthening SLMs.

[NLP-66] A Rose by Any Other Name Would Smell as Sweet: Categorical Homotopy Theory for Large Language Models

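[Quick read]: This paper addresses the failure of Large Language Models (LLMs) to produce consistent next-token probability distributions for sentences that are semantically equivalent but differently worded. For example, "Charles Darwin wrote" and "Charles Darwin is the author of" carry the same meaning, yet LLMs often assign them different probabilities. The key to the solution is a categorical homotopy framework: an LLM Markov category is constructed to formalize probability distributions over language, and homotopy techniques are used to capture "weak equivalences" between different arrows in the category, so that semantically equivalent sentences are modeled uniformly and the problem of non-isomorphic arrows arising from differences in phrasing is overcome.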

Link: https://arxiv.org/abs/2508.10018
Authors: Sridhar Mahadevan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Algebraic Topology (math.AT)
Comments: 26 pages. arXiv admin note: text overlap with arXiv:2402.18732

Abstract:Natural language is replete with superficially different statements, such as "Charles Darwin wrote" and "Charles Darwin is the author of", which carry the same meaning. Large language models (LLMs) should generate the same next-token probabilities in such cases, but usually do not. Empirical workarounds have been explored, such as using k-NN estimates of sentence similarity to produce smoothed estimates. In this paper, we tackle this problem more abstractly, introducing a categorical homotopy framework for LLMs. We introduce an LLM Markov category to represent probability distributions in language generated by an LLM, where the probability of a sentence, such as "Charles Darwin wrote", is defined by an arrow in a Markov category. However, this approach runs into difficulties as language is full of equivalent rephrases, and each generates a non-isomorphic arrow in the LLM Markov category. To address this fundamental problem, we use categorical homotopy techniques to capture "weak equivalences" in an LLM Markov category. We present a detailed overview of the application of categorical homotopy to LLMs, from higher algebraic K-theory to model categories, building on powerful theoretical results developed over the past half a century.

[NLP-67] Training-Free Multimodal Large Language Model Orchestration

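[Quick read]: This paper addresses the difficulty of integrating different Multimodal Large Language Models (MLLMs) into a unified multimodal input-output system, in particular the challenges of modal alignment, inefficient Text-to-Speech, and complex system integration. The key to the solution is a training-free orchestration mechanism, MLLM Orchestration, built on three core innovations: (1) task scheduling driven by a central controller LLM, which uses carefully designed agents to route tasks dynamically to specialized models; (2) a parallel Text-to-Speech architecture that supports true full-duplex interaction with natural interruption handling; and (3) a cross-modal memory integration system that maintains context consistency across modalities through intelligent information synthesis and retrieval, and avoids redundant modality calls in certain scenarios to improve response speed. The method markedly improves modularity, interpretability, and computational efficiency, outperforms traditional jointly-trained approaches on standard benchmarks by up to 7.8%, and reduces latency by 10.3%.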

Link: https://arxiv.org/abs/2508.10016
Authors: Tianyu Xie, Yuhang Wu, Yongdong Luo, Jiayi Ji, Xiawu Zheng
Affiliations: Xiamen University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Different Multimodal Large Language Models (MLLMs) cannot be integrated into a unified multimodal input-output system directly. In previous work, training has been considered as an inevitable component due to challenges in modal alignment, Text-to-Speech efficiency and other integration issues. In this paper, we introduce Multimodal Large Language Model Orchestration, an effective approach for creating interactive multimodal AI systems without additional training. MLLM Orchestration leverages the inherent reasoning capabilities of large language models to coordinate specialized models through explicit workflows, enabling natural multimodal interactions while maintaining modularity, improving interpretability, and significantly enhancing computational efficiency. Our orchestration framework is built upon three key innovations: (1) a central controller LLM that analyzes user inputs and dynamically routes tasks to appropriate specialized models through carefully designed agents; (2) a parallel Text-to-Speech architecture that enables true full-duplex interaction with seamless interruption handling and natural conversational flow; and (3) a cross-modal memory integration system that maintains coherent context across modalities through intelligent information synthesis and retrieval, selectively avoiding unnecessary modality calls in certain scenarios to improve response speed. Extensive evaluations demonstrate that MLLM Orchestration achieves comprehensive multimodal capabilities without additional training, performance improvements of up to 7.8% over traditional jointly-trained approaches on standard benchmarks, reduced latency by 10.3%, and significantly enhanced interpretability through explicit orchestration processes.

[NLP-68] RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis

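[Quick read]: This paper addresses three gaps in current Task-Oriented Dialogue (TOD) research: the lack of real speech signals, the insufficient evaluation of how language models handle speech disfluencies and speaker variation, and the absence of a Chinese multi-turn, multi-domain speech-text dual-modal dataset. The key to the solution is RealTalk-CN, the first Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset, comprising 5.4k dialogues (60K utterances, 150 hours) with paired speech-text annotations. The paper also proposes a novel cross-modal chat task that simulates real interactions in which users switch dynamically between speech and text, enabling a systematic evaluation of speech-based LLMs in terms of robustness to speech disfluencies, sensitivity to speaker characteristics, and cross-domain generalization.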

Link: https://arxiv.org/abs/2508.10015
Authors: Enzhi Wang, Qicheng Li, Shiwan Zhao, Aobo Kong, Jiaming Zhou, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin
Affiliations: TMCC, College of Computer Science, Nankai University; Beijing Academy of Artificial Intelligence
Subjects: Computation and Language (cs.CL)
Comments: 9 pages

Abstract:In recent years, large language models (LLMs) have achieved remarkable advancements in multimodal processing, including end-to-end speech-based language models that enable natural interactions and perform specific tasks in task-oriented dialogue (TOD) systems. However, existing TOD datasets are predominantly text-based, lacking real speech signals that are essential for evaluating the robustness of speech-based LLMs. Moreover, existing speech TOD datasets are primarily English and lack critical aspects such as speech disfluencies and speaker variations. To address these gaps, we introduce RealTalk-CN, the first Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset, comprising 5.4k dialogues (60K utterances, 150 hours) with paired speech-text annotations. RealTalk-CN captures diverse dialogue scenarios with annotated spontaneous speech disfluencies, ensuring comprehensive coverage of real-world complexities in speech dialogue. In addition, we propose a novel cross-modal chat task that authentically simulates real-world user interactions, allowing dynamic switching between speech and text modalities. Our evaluation covers robustness to speech disfluencies, sensitivity to speaker characteristics, and cross-domain performance. Extensive experiments validate the effectiveness of RealTalk-CN, establishing a strong foundation for Chinese speech-based LLMs research.

[NLP-69] PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?

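[Quick read]: This paper addresses the fact that the "LLM-as-a-judge" paradigm widely used in role-playing evaluation lacks the human sense of role fidelity: when Large Language Models (LLMs) judge role-play quality, their judgments may not reflect how humans actually perceive it. The key to the solution is PersonaEval, the first benchmark built specifically to test LLMs' role identification ability. It draws on human-authored dialogues from novels, scripts, and video transcripts, and requires the model to identify the speaker correctly from context (persona attribution). Experiments show that even the best LLMs reach only about 69% accuracy, far below the 90.8% achieved by human participants. This reveals a substantial gap in role understanding, and the analysis suggests that closing it takes more than task-specific tuning and depends on stronger, human-like reasoning ability.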

Link: https://arxiv.org/abs/2508.10014
Authors: Lingfeng Zhou, Jialing Zhang, Jin Gao, Mohan Jiang, Dequan Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted by COLM 2025

Abstract:Current role-play studies often rely on unvalidated LLM-as-a-judge paradigms, which may fail to reflect how humans perceive role fidelity. A key prerequisite for human-aligned evaluation is role identification, the ability to recognize who is speaking based on dialogue context. We argue that any meaningful judgment of role-playing quality (how well a character is played) fundamentally depends on first correctly attributing words and actions to the correct persona (who is speaking). We present PersonaEval, the first benchmark designed to test whether LLM evaluators can reliably identify human roles. PersonaEval uses human-authored dialogues from novels, scripts, and video transcripts, challenging models to determine the correct persona according to the conversation context. Our experiments, including a human study, show that even the best-performing LLMs reach only around 69% accuracy, well below the level needed for reliable evaluation. In contrast, human participants perform near ceiling with 90.8% accuracy, highlighting that current LLM evaluators are still not human enough to effectively judge role-play scenarios. To better understand this gap, we examine training-time adaptation and test-time compute, suggesting that reliable evaluation requires more than task-specific tuning, but depends on strong, human-like reasoning abilities in LLM evaluators. We release our benchmark at this https URL.

[NLP-70] Semantic Bridge: Universal Multi-Hop Question Generation via AMR-Driven Graph Synthesis

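[Quick read]: This paper addresses the scarcity of high-quality, reasoning-intensive question-answer pairs for Large Language Model (LLM) training, especially from sparse domain-specific sources such as PubMed papers or legal documents. Existing methods rely on surface patterns and cannot generate controllable, complex multi-hop reasoning questions, which limits improvement in deep understanding. The key to the solution is the Semantic Bridge framework, whose core innovation is "semantic graph weaving" with three complementary bridging mechanisms: entity bridging for role-varying shared entities, predicate chain bridging for temporal/causal/logical sequences, and causal bridging for explicit reasoning chains. AMR-driven analysis gives fine-grained control over complexity and question type and markedly improves the quality and diversity of the generated question-answer pairs. Experiments show that the framework outperforms baselines in both general and specialized domains and surpasses human-annotated examples while using less source material.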

Link: https://arxiv.org/abs/2508.10013
Authors: Linqing Chen, Hanmeng Zhong, Wentao Wu, Weilei Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language model (LLM) training faces a critical bottleneck: the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources like PubMed papers or legal documents. Existing methods rely on surface patterns, fundamentally failing to generate controllable, complex multi-hop reasoning questions that test genuine understanding, which is essential for advancing LLM training paradigms. We present Semantic Bridge, the first universal framework for controllably generating sophisticated multi-hop reasoning questions from arbitrary sources. Our breakthrough innovation is semantic graph weaving: three complementary bridging mechanisms (entity bridging for role-varying shared entities, predicate chain bridging for temporal/causal/logical sequences, and causal bridging for explicit reasoning chains) that systematically construct complex pathways across documents, with fine-grained control over complexity and types via AMR-driven analysis. Our multi-modal AMR pipeline achieves up to 9.5% better round-trip quality, enabling production-ready controllable QA generation. Extensive evaluation demonstrates performance across both general-purpose datasets (Wikipedia) and specialized domains (biomedicine). It yields consistent 18.3%-25.4% gains over baselines across four languages (English, Chinese, French, German). Question pairs generated from 200 sources outperform 600 native human annotation examples with 67% fewer materials. Human evaluation shows 23.4% higher complexity, 18.7% better answerability, and 31.2% improved pattern coverage. Semantic Bridge establishes a new paradigm for LLM training data synthesis, enabling controllable generation of targeted reasoning questions from sparse sources. We will release our core code and semantic bridge model.

[NLP-71] Guided Navigation in Knowledge-Dense Environments: Structured Semantic Exploration with Guidance Graphs

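[Quick read]: This paper addresses the limited performance of Large Language Models (LLMs) on knowledge-intensive tasks caused by their reliance on static knowledge and opaque reasoning, and it also targets two limitations of existing Knowledge Graph (KG) exploration methods: question-guided approaches explore redundantly because of granularity mismatches, while clue-guided approaches fail to use contextual information effectively in complex scenarios. The key to the solution is the Guidance Graph guided Knowledge Exploration (GG Explore) framework. Its core innovation is an intermediate Guidance Graph that abstracts the structure of the target knowledge while preserving broader semantic context, thereby delimiting the retrieval space precisely. On top of it, two mechanisms are designed: Structural Alignment, which filters candidates without LLM overhead, and Context Aware Pruning, which enforces semantic consistency with graph constraints. Together they markedly improve exploration efficiency and accuracy, especially on complex tasks, and maintain strong performance with smaller LLMs, which gives the method practical value.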

Link: https://arxiv.org/abs/2508.10012
Authors: Dehao Tao, Guangjie Liu, Weizheng, Yongfeng Huang, Minghu Jiang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:While Large Language Models (LLMs) exhibit strong linguistic capabilities, their reliance on static knowledge and opaque reasoning processes limits their performance in knowledge-intensive tasks. Knowledge graphs (KGs) offer a promising solution, but current exploration methods face a fundamental trade-off: question-guided approaches incur redundant exploration due to granularity mismatches, while clue-guided methods fail to effectively leverage contextual information for complex scenarios. To address these limitations, we propose Guidance Graph guided Knowledge Exploration (GG Explore), a novel framework that introduces an intermediate Guidance Graph to bridge unstructured queries and structured knowledge retrieval. The Guidance Graph defines the retrieval space by abstracting the target knowledge’s structure while preserving broader semantic context, enabling precise and efficient exploration. Building upon the Guidance Graph, we develop: (1) Structural Alignment that filters incompatible candidates without LLM overhead, and (2) Context Aware Pruning that enforces semantic consistency with graph constraints. Extensive experiments show our method achieves superior efficiency and outperforms SOTA, especially on complex tasks, while maintaining strong performance with smaller LLMs, demonstrating practical value.

[NLP-72] Evaluation of GPT-based large language generative AI models as study aids for the national licensure examination for registered dietitians in Japan

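[Quick read]: This paper addresses the following question: the potential of generative AI based on Large Language Models (LLMs) in nutrition education, and specifically on the Japanese national licensure examination for registered dietitians, is still unclear, and its effectiveness, accuracy, and stability as a study aid have not been evaluated systematically. The key to the solution is to use real questions from the Japanese national examination for registered dietitians as prompts, test ChatGPT and three Bing models (Precise, Creative, Balanced) over repeated runs, quantify model performance through answer accuracy, consistency, and response time, and explore possible gains from prompt engineering strategies such as role assignment. The results show that some models (Bing-Precise and Bing-Creative) barely clear the passing threshold (60%), but overall accuracy and answer stability are inadequate: performance is especially poor on Nutrition Education questions, and answers are inconsistent across repeated attempts. Current generative AI is therefore not yet a reliable study aid for professional licensure preparation.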

Link: https://arxiv.org/abs/2508.10011
Authors: Yuta Nagamori, Mikoto Kosai, Yuji Kawai, Haruka Marumo, Misaki Shibuya, Tatsuya Negishi, Masaki Imanishi, Yasumasa Ikeda, Koichiro Tsuchiya, Asuka Sawai, Licht Miyamoto
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Generative artificial intelligence (AI) based on large language models (LLMs), such as ChatGPT, has demonstrated remarkable progress across various professional fields, including medicine and education. However, its performance in nutritional education, especially on the Japanese national licensure examination for registered dietitians, remains underexplored. This study aimed to evaluate the potential of current LLM-based generative AI models as study aids for nutrition students. Questions from the Japanese national examination for registered dietitians were used as prompts for ChatGPT and three Bing models (Precise, Creative, Balanced), based on GPT-3.5 and GPT-4. Each question was entered into independent sessions, and model responses were analyzed for accuracy, consistency, and response time. Additional prompt engineering, including role assignment, was tested to assess potential performance improvements. Bing-Precise (66.2%) and Bing-Creative (61.4%) surpassed the passing threshold (60%), while Bing-Balanced (43.3%) and ChatGPT (42.8%) did not. Bing-Precise and Bing-Creative generally outperformed others across subject fields except Nutrition Education, where all models underperformed. None of the models consistently provided the same correct responses across repeated attempts, highlighting limitations in answer stability. ChatGPT showed greater consistency in response patterns but lower accuracy. Prompt engineering had minimal effect, except for modest improvement when correct answers and explanations were explicitly provided. While some generative AI models marginally exceeded the passing threshold, overall accuracy and answer consistency remained suboptimal. Moreover, all the models demonstrated notable limitations in answer consistency and robustness. Further advancements are needed to ensure reliable and stable AI-based study aids for dietitian licensure preparation.

[NLP-73] An Audit and Analysis of LLM-Assisted Health Misinformation Jailbreaks Against LLMs

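[Quick read]: This paper addresses the risk that generative AI is used to create and spread harmful misinformation in the medical domain, in particular through "jailbreak attacks" that induce Large Language Models (LLMs) to produce malicious content. The key to the solution is to use LLMs' own capabilities to detect harmful medical misinformation generated by other LLMs, and to compare it with real health misinformation found on social media, thereby testing how effective LLMs are as detection tools. The study finds that suitably designed LLMs can effectively detect misinformation produced by jailbreak attacks as well as misinformation written by people, which indicates their potential for building a healthier digital information ecosystem.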

Link: https://arxiv.org/abs/2508.10010
Authors: Ayana Hussain, Patrick Zhao, Nicholas Vincent
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) are a double-edged sword capable of generating harmful misinformation – inadvertently, or when prompted by “jailbreak” attacks that attempt to produce malicious outputs. LLMs could, with additional research, be used to detect and prevent the spread of misinformation. In this paper, we investigate the efficacy and characteristics of LLM-produced jailbreak attacks that cause other models to produce harmful medical misinformation. We also study how misinformation generated by jailbroken LLMs compares to typical misinformation found on social media, and how effectively it can be detected using standard machine learning approaches. Specifically, we closely examine 109 distinct attacks against three target LLMs and compare the attack prompts to in-the-wild health-related LLM queries. We also examine the resulting jailbreak responses, comparing the generated misinformation to health-related misinformation on Reddit. Our findings add more evidence that LLMs can be effectively used to detect misinformation from both other LLMs and from people, and support a body of work suggesting that with careful design, LLMs can contribute to a healthier overall information ecosystem.

[NLP-74] Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts INTERSPEECH2025

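[Quick read]: This paper addresses the task interference that hard-parameter sharing causes in multi-task learning, which degrades overall model performance. The key to the solution is a Supervised Mixture of Experts (S-MoE) model that uses special guiding tokens to route each task explicitly to its own expert sub-network (a separate feedforward network). This removes the need to train gating functions, as traditional Mixture of Experts models require, and effectively mitigates parameter conflicts between tasks.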

Link: https://arxiv.org/abs/2508.10009
Authors: Hojun Jin, Eunsoo Hong, Ziwon Hyung, Sungjun Lim, Seungjin Lee, Keunseok Cho
Affiliations: Samsung
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2025

Abstract:Hard-parameter sharing is a common strategy to train a single model jointly across diverse tasks. However, this often leads to task interference, impeding overall model performance. To address the issue, we propose a simple yet effective Supervised Mixture of Experts (S-MoE). Unlike traditional Mixture of Experts models, S-MoE eliminates the need for training gating functions by utilizing special guiding tokens to route each task to its designated expert. By assigning each task to a separate feedforward network, S-MoE overcomes the limitations of hard-parameter sharing. We further apply S-MoE to a speech-to-text model, enabling the model to process mixed-bandwidth input while jointly performing automatic speech recognition (ASR) and speech translation (ST). Experimental results demonstrate the effectiveness of the proposed S-MoE, achieving a 6.35% relative improvement in Word Error Rate (WER) when applied to both the encoder and decoder.
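
The routing idea can be shown with a toy sketch in which the guiding token is a dictionary key and each expert is a stand-in for a feedforward network. The token strings `<asr>` and `<st>`, the element-wise "experts", and the function names are assumptions for illustration only; a real model would use learned feedforward layers inside the encoder and decoder.

```python
from typing import Callable, Dict, List

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_expert(scale: float, bias: float) -> Expert:
    """Toy stand-in for a feedforward network: ReLU(scale * x + bias) per element."""
    return lambda x: [max(0.0, scale * v + bias) for v in x]


def s_moe_layer(hidden: Vector, guiding_token: str, experts: Dict[str, Expert]) -> Vector:
    """Route by the task's guiding token; no gating function is trained or evaluated."""
    return experts[guiding_token](hidden)


# One expert per task: speech recognition and speech translation.
EXPERTS: Dict[str, Expert] = {
    "<asr>": make_expert(2.0, 0.0),
    "<st>": make_expert(-1.0, 0.5),
}
```

Because the route is fixed by the token rather than predicted, each task's gradients would reach only its own expert, which is how a design of this kind avoids the interference seen with fully shared parameters.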

[NLP-75] Multidimensional classification of posts for online course discussion forum curation

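[Quick read]: This paper addresses the automatic curation of online course discussion forums. The core challenge is that curation needs frequent updates to keep up with new data, which makes continual fine-tuning of Large Language Models (LLMs) a resource-intensive task. To avoid the high cost of fine-tuning, the paper proposes and evaluates a solution based on Bayesian fusion: the multidimensional classification scores of a pre-trained generic LLM are fused with the outputs of a classifier trained on local data. Without any fine-tuning, the fusion improves classification performance over either classifier alone and is competitive with a fine-tuned LLM.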

Link: https://arxiv.org/abs/2508.10008
Authors: Antonio Leandro Martins Candido, Jose Everardo Bessa Maia
Affiliations: Federal Institute of Education, Science, and Technology of Ceará; State University of Ceará
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 8 pages, 1 figure

Abstract:The automatic curation of discussion forums in online courses requires constant updates, making frequent retraining of Large Language Models (LLMs) a resource-intensive process. To circumvent the need for costly fine-tuning, this paper proposes and evaluates the use of Bayesian fusion. The approach combines the multidimensional classification scores of a pre-trained generic LLM with those of a classifier trained on local data. The performance comparison demonstrated that the proposed fusion improves the results compared to each classifier individually, and is competitive with the LLM fine-tuning approach.
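
One common way to fuse two classifiers' outputs is the naive-Bayes product rule, sketched below for a single classification dimension. The abstract does not specify the fusion formula, so the conditional-independence assumption, the explicit class `prior`, and the name `bayesian_fusion` are illustrative choices rather than the paper's method.

```python
from typing import Dict


def bayesian_fusion(
    p_llm: Dict[str, float],
    p_local: Dict[str, float],
    prior: Dict[str, float],
) -> Dict[str, float]:
    """Fuse two posteriors assuming the classifiers are conditionally independent given the class.

    p(c | both) is proportional to p_llm(c) * p_local(c) / prior(c); priors must be positive.
    """
    unnormalized = {c: p_llm[c] * p_local[c] / prior[c] for c in prior}
    total = sum(unnormalized.values())
    return {c: value / total for c, value in unnormalized.items()}
```

For a multidimensional scheme, the same function would be applied once per dimension (for example question vs. non-question, urgency level), each with its own class set and prior. Neither model is retrained; only the cheap local classifier needs updating as forum data changes.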

[NLP-76] Automated scoring of the Ambiguous Intentions Hostility Questionnaire using fine-tuned large language models

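[Quick read]: This paper addresses the time and labor cost of manually scoring open-ended responses when measuring hostile attribution bias. With the Ambiguous Intentions Hostility Questionnaire (AIHQ) in particular, trained raters must code each open-ended answer one by one, which limits large-scale use. The key to the solution is to use large language models (LLMs) to score AIHQ open-ended responses automatically: models are fine-tuned on an existing clinical dataset (individuals with traumatic brain injury and healthy controls) so that their ratings of hostility attributions and aggression responses align closely with human ratings. The models generalize well across scenario types and to an independent nonclinical dataset, which substantially improves the efficiency and scalability of psychological assessment.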

Link: https://arxiv.org/abs/2508.10007
Authors: Y. Lyu, D. Combs, D. Neumann, Y. C. Leong
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Methodology (stat.ME)
Comments: We have no known conflict of interest

Abstract:Hostile attribution bias is the tendency to interpret social interactions as intentionally hostile. The Ambiguous Intentions Hostility Questionnaire (AIHQ) is commonly used to measure hostile attribution bias, and includes open-ended questions where participants describe the perceived intentions behind a negative social situation and how they would respond. While these questions provide insights into the contents of hostile attributions, they require time-intensive scoring by human raters. In this study, we assessed whether large language models can automate the scoring of AIHQ open-ended responses. We used a previously collected dataset in which individuals with traumatic brain injury (TBI) and healthy controls (HC) completed the AIHQ and had their open-ended responses rated by trained human raters. We used half of these responses to fine-tune the two models on human-generated ratings, and tested the fine-tuned models on the remaining half of AIHQ responses. Results showed that model-generated ratings aligned with human ratings for both attributions of hostility and aggression responses, with fine-tuned models showing higher alignment. This alignment was consistent across ambiguous, intentional, and accidental scenario types, and replicated previous findings on group differences in attributions of hostility and aggression responses between TBI and HC groups. The fine-tuned models also generalized well to an independent nonclinical dataset. To support broader adoption, we provide an accessible scoring interface that includes both local and cloud-based options. Together, our findings suggest that large language models can streamline AIHQ scoring in both research and clinical contexts, revealing their potential to facilitate psychological assessments across different populations.

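[NLP-77] From Answers to Questions: EQGBench for Evaluating LLMs' Educational Question Generation

Quick read: This paper addresses the challenges that generative AI faces in education when shifting from providing answers to generating high-quality educational questions (Educational Question Generation, EQG), with a focus on the educational value and effectiveness of questions generated in Chinese. The key to its solution is EQGBench, a comprehensive benchmark for Chinese educational settings that contains 900 evaluation samples covering three middle school subjects (mathematics, physics, and chemistry) and applies a five-dimensional evaluation framework to systematically assess 46 mainstream large models. The results reveal where current models fall short in generating questions that carry pedagogical value and foster students' abilities, and give future research a quantifiable, comparable baseline.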

Link: https://arxiv.org/abs/2508.10005
Authors: Chengliang Zhou, Mei Wang, Ting Zhang, Qiannan Zhu, Jian Li, Hua Huang
Affiliations: Beijing Normal University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in mathematical problem-solving. However, the transition from providing answers to generating high-quality educational questions presents significant challenges that remain underexplored. To advance Educational Question Generation (EQG) and facilitate LLMs in generating pedagogically valuable and educationally effective questions, we introduce EQGBench, a comprehensive benchmark specifically designed for evaluating LLMs’ performance in Chinese EQG. EQGBench establishes a five-dimensional evaluation framework supported by a dataset of 900 evaluation samples spanning three fundamental middle school disciplines: mathematics, physics, and chemistry. The dataset incorporates user queries with varying knowledge points, difficulty gradients, and question type specifications to simulate realistic educational scenarios. Through systematic evaluation of 46 mainstream large models, we reveal significant room for development in generating questions that reflect educational value and foster students’ comprehensive abilities.

[NLP-78] User Perception of Attention Visualizations: Effects on Interpretability Across Evidence-Based Medical Documents

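Quick read: This paper asks whether the attention mechanism in the Transformer architecture can serve as an effective explanation tool that helps medical experts understand and make decisions in biomedical literature classification, and how different visualizations affect its explanatory value. The key to its approach is a user study with medical experts from multiple disciplines that evaluates the perceived usefulness of explanations based on attention weights. The study finds that the way attention weights are visualized significantly affects their explanatory utility: although attention weights were not generally regarded as highly explanatory, users preferred intuitive visual encodings (such as text brightness or background color) over the precise encodings (such as bar length) favored by Munzner's principle of visual effectiveness.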

Link: https://arxiv.org/abs/2508.10004
Authors: Andrés Carvallo, Denis Parra, Peter Brusilovsky, Hernan Valdivieso, Gabriel Rada, Ivania Donoso, Vladimir Araujo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:The attention mechanism is a core component of the Transformer architecture. Beyond improving performance, attention has been proposed as a mechanism for explainability via attention weights, which are associated with input features (e.g., tokens in a document). In this context, larger attention weights may imply more relevant features for the model’s prediction. In evidence-based medicine, such explanations could support physicians’ understanding and interaction with AI systems used to categorize biomedical literature. However, there is still no consensus on whether attention weights provide helpful explanations. Moreover, little research has explored how visualizing attention affects its usefulness as an explanation aid. To bridge this gap, we conducted a user study to evaluate whether attention-based explanations support users in biomedical document classification and whether there is a preferred way to visualize them. The study involved medical experts from various disciplines who classified articles based on study design (e.g., systematic reviews, broad synthesis, randomized and non-randomized trials). Our findings show that the Transformer model (XLNet) classified documents accurately; however, the attention weights were not perceived as particularly helpful for explaining the predictions. However, this perception varied significantly depending on how attention was visualized. Contrary to Munzner’s principle of visual effectiveness, which favors precise encodings like bar length, users preferred more intuitive formats, such as text brightness or background color. While our results do not confirm the overall utility of attention weights for explanation, they suggest that their perceived helpfulness is influenced by how they are visually presented.

[NLP-79] Semantic Structure in Large Language Model Embeddings

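Quick read: This paper examines the structure of semantic information in large language models (LLMs) and its consistency with human semantic cognition, in particular how to uncover the low-dimensional semantic structure of the LLM embedding space without losing key semantic information. The key findings are as follows. Projections of word vectors onto semantic directions defined by antonym pairs (e.g., kind - cruel) closely reproduce human ratings of words, and these semantic features can be compressed into a three-dimensional subspace whose structure closely matches human survey data. The study also shows that shifting a token along one semantic direction produces off-target effects on other geometrically aligned features, with a strength proportional to their cosine similarity. These findings indicate that semantic features in LLMs are entangled in a way similar to human language, that semantic information is markedly low-dimensional, and that understanding and modeling this structure is important for avoiding unintended consequences when steering semantic features.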

Link: https://arxiv.org/abs/2508.10003
Authors: Austin C. Kozlowski, Callin Dai, Andrei Boutyline
Affiliations: University of Chicago; University of Michigan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Psychological research consistently finds that human ratings of words across diverse semantic scales can be reduced to a low-dimensional form with relatively little information loss. We find that the semantic associations encoded in the embedding matrices of large language models (LLMs) exhibit a similar structure. We show that the projections of words on semantic directions defined by antonym pairs (e.g. kind - cruel) correlate highly with human ratings, and further find that these projections effectively reduce to a 3-dimensional subspace within LLM embeddings, closely resembling the patterns derived from human survey responses. Moreover, we find that shifting tokens along one semantic direction causes off-target effects on geometrically aligned features proportional to their cosine similarity. These findings suggest that semantic features are entangled within LLMs similarly to how they are interconnected in human language, and a great deal of semantic information, despite its apparent complexity, is surprisingly low-dimensional. Furthermore, accounting for this semantic structure may prove essential for avoiding unintended consequences when steering features.
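
The projection and off-target effect described in this abstract can be illustrated with a toy sketch. The vectors below are made-up 3-dimensional examples rather than real LLM embeddings, and the helper names (`direction`, `project`, `shift`) are hypothetical. The sketch assumes a semantic direction is the normalized difference of an antonym pair; under that assumption, shifting a vector by `alpha` along one unit direction changes its projection on another unit direction by `alpha` times the cosine similarity of the two directions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def direction(pos, neg):
    """Unit semantic direction from an antonym pair (assumed: normalized difference)."""
    return normalize([p - q for p, q in zip(pos, neg)])

def project(vec, unit_dir):
    """Scalar projection of a vector onto a unit direction."""
    return dot(vec, unit_dir)

def shift(vec, unit_dir, alpha):
    """Move a vector by alpha along a unit direction."""
    return [x + alpha * d for x, d in zip(vec, unit_dir)]

# Toy embeddings (illustrative values only, not taken from any model).
kind, cruel = [1.0, 0.2, 0.0], [-1.0, 0.1, 0.0]
good, bad = [0.8, 0.6, 0.1], [-0.7, -0.5, 0.0]
word = [0.3, -0.4, 0.5]

d_kind = direction(kind, cruel)
d_good = direction(good, bad)
cos = dot(d_kind, d_good)  # both directions are unit length, so this is their cosine

alpha = 2.0
shifted = shift(word, d_kind, alpha)
on_target = project(shifted, d_kind) - project(word, d_kind)
off_target = project(shifted, d_good) - project(word, d_good)

print(f"cosine similarity of directions: {cos:.3f}")
print(f"on-target change: {on_target:.3f}, off-target change: {off_target:.3f}")
```

The off-target change equals `alpha * cos` by linearity of the dot product, which is the geometric reason steering one feature moves any feature whose direction is not orthogonal to it.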

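[NLP-80] HiFACTMix: A Code-Mixed Benchmark and Graph-Aware Model for Evidence-Based Political Claim Verification in Hinglish

Quick read: This paper tackles fact-checking in code-mixed, low-resource language settings such as Hinglish, a mix of Hindi and English. Existing systems mostly target high-resource, monolingual settings and do not transfer well to real political discourse in linguistically diverse regions such as India. The key to its solution is a new benchmark dataset, HiFACT, which contains 1,500 real Hinglish claims made by 28 Indian state Chief Ministers, each annotated with textual evidence and a veracity label, together with a graph-aware, retrieval-augmented fact-checking model (HiFACTMix). The model combines multilingual contextual encoding, claim-evidence semantic alignment, evidence graph construction, graph neural reasoning, and natural language explanation generation, aiming at more accurate and interpretable fact verification.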

Link: https://arxiv.org/abs/2508.10001
Authors: Rakesh Thakur, Sneha Sharma, Gauri Chopra
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Fact-checking in code-mixed, low-resource languages such as Hinglish remains an underexplored challenge in natural language processing. Existing fact-verification systems largely focus on high-resource, monolingual settings and fail to generalize to real-world political discourse in linguistically diverse regions like India. Given the widespread use of Hinglish by public figures, particularly political figures, and the growing influence of social media on public opinion, there is a critical need for robust, multilingual, and context-aware fact-checking tools. To address this gap, a novel benchmark dataset, HiFACT, is introduced with 1,500 real-world factual claims made by 28 Indian state Chief Ministers in Hinglish, under a highly code-mixed, low-resource setting. Each claim is annotated with textual evidence and veracity labels. To evaluate this benchmark, a novel graph-aware, retrieval-augmented fact-checking model is proposed that combines multilingual contextual encoding, claim-evidence semantic alignment, evidence graph construction, graph neural reasoning, and natural language explanation generation. Experimental results show that HiFACTMix outperforms state-of-the-art multilingual baseline models in accuracy and provides faithful justifications for its verdicts. This work opens a new direction for multilingual, code-mixed, and politically grounded fact verification research.

[NLP-81] AutoGeTS: Knowledge-based Automated Generation of Text Synthetics for Improving Text Classification

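Quick read: This paper addresses the problem that text classification models in real-world applications are limited by insufficient data for some text classes. The key to its solution is to use large language models (LLMs) to generate synthetic data and to run an automated workflow that searches for input examples yielding more "effective" synthetic data, thereby improving classifier performance. The work further proposes an ensemble algorithm that selects a search strategy according to the characteristics of each class, which improves models more than any single strategy.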

Link: https://arxiv.org/abs/2508.10000
Authors: Chenhao Xue, Yuanzhe Jin, Adrian Carrasco-Revilla, Joyraj Chakraborty, Min Chen
Affiliations: University of Oxford; Inetum
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:When developing text classification models for real-world applications, one major challenge is the difficulty of collecting sufficient data for all text classes. In this work, we address this challenge by utilizing large language models (LLMs) to generate synthetic data and using such data to improve the performance of the models without waiting for more real data to be collected and labelled. As an LLM generates different synthetic data in response to different input examples, we formulate an automated workflow, which searches for input examples that lead to more "effective" synthetic data for improving the model concerned. We study three search strategies with an extensive set of experiments, and use experiment results to inform an ensemble algorithm that selects a search strategy according to the characteristics of a class. Our further experiments demonstrate that this ensemble approach is more effective than each individual strategy in our automated workflow for improving classification models using LLMs.

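[NLP-82] XFacta: Contemporary Real-World Dataset and Evaluation for Multimodal Misinformation Detection with Multimodal LLMs

Quick read: This paper targets two core problems in current multimodal misinformation detection. First, it is unclear whether the bottleneck of existing methods lies in evidence retrieval or in reasoning, which limits further gains. Second, existing benchmark datasets are either outdated or artificially synthesized, so they do not reflect how multimodal misinformation actually spreads on today's social media. In response, the paper introduces XFacta, a contemporary, real-world multimodal misinformation dataset, and systematically evaluates multimodal large language models (MLLMs) of different architectures and scales on the detection task. Its key contribution is a semi-automatic detection-in-the-loop framework that keeps XFacta current by continuously adding new content, giving MLLM-based detectors a more reliable and realistic basis for evaluation.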

Link: https://arxiv.org/abs/2508.09999
Authors: Yuzhuo Xiao, Zeyu Han, Yuhan Wang, Huaizu Jiang
Affiliations: Guizhou University; Northeastern University; UC Santa Cruz
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: For associated code and dataset, see this https URL

Abstract:The rapid spread of multimodal misinformation on social media calls for more effective and robust detection methods. Recent advances leveraging multimodal large language models (MLLMs) have shown potential in addressing this challenge. However, it remains unclear exactly where the bottleneck of existing approaches lies (evidence retrieval vs. reasoning), hindering further advances in this field. On the dataset side, existing benchmarks either contain outdated events, leading to evaluation bias due to discrepancies with contemporary social media scenarios as MLLMs can simply memorize these events, or are artificially synthetic, failing to reflect real-world misinformation patterns. Additionally, comprehensive analyses of MLLM-based model design strategies are lacking. To address these issues, we introduce XFacta, a contemporary, real-world dataset that is better suited for evaluating MLLM-based detectors. We systematically evaluate various MLLM-based misinformation detection strategies, assessing models across different architectures and scales, as well as benchmarking against existing detection methods. Building on these analyses, we further enable a semi-automatic detection-in-the-loop framework that continuously updates XFacta with new content to maintain its contemporary relevance. Our analysis provides valuable insights and practices for advancing the field of multimodal misinformation detection. The code and data have been released.

[NLP-83] INTIMA: A Benchmark for Human-AI Companionship Behavior

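Quick read: This paper addresses the lack of a systematic evaluation standard for how generative AI handles emotional interactions with users, and aims to quantify the companionship behaviors of language models. The key to its solution is the Interactions and Machine Attachment Benchmark (INTIMA), a benchmark grounded in psychological theories and real user data that comprises a taxonomy of 31 behaviors across four categories and 368 targeted prompts, with responses labeled as companionship-reinforcing, boundary-maintaining, or neutral. Evaluating several mainstream models with this benchmark reveals that models balance emotional support and boundary-setting differently, which motivates more consistent and healthier designs for emotional human-AI interaction.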

Link: https://arxiv.org/abs/2508.09998
Authors: Lucie-Aimée Kaffee, Giada Pistilli, Yacine Jernite
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:AI companionship, where users develop emotional bonds with AI systems, has emerged as a significant pattern with positive but also concerning implications. We introduce Interactions and Machine Attachment Benchmark (INTIMA), a benchmark for evaluating companionship behaviors in language models. Drawing from psychological theories and user data, we develop a taxonomy of 31 behaviors across four categories and 368 targeted prompts. Responses to these prompts are evaluated as companionship-reinforcing, boundary-maintaining, or neutral. Applying INTIMA to Gemma-3, Phi-4, o3-mini, and Claude-4 reveals that companionship-reinforcing behaviors remain much more common across all models, though we observe marked differences between models. Different commercial providers prioritize different categories within the more sensitive parts of the benchmark, which is concerning since both appropriate boundary-setting and emotional support matter for user well-being. These findings highlight the need for more consistent approaches to handling emotionally charged interactions.

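[NLP-84] Thematic and Task-Based Categorization of K-12 GenAI Usages with Hierarchical Topic Modeling

Quick read: This paper addresses the lack of systematic categorization of generative AI (GenAI) interaction data in education along content and task dimensions, in particular the limited structured analysis of the large volume of anonymous text exchanged among students, teachers, and ChatGPT in real K-12 classrooms. The key to its solution is a novel, simple topic modeling approach that builds a separate hierarchical categorization for content (e.g., nature, people) and for tasks (e.g., writing, explaining). It applies LLMs with explicit instructions to pre-processed data to extract hierarchical topics, producing structures that align better with human judgment than traditional computational methods and offering empirical support and an actionable framework for using GenAI in educational practice.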

Link: https://arxiv.org/abs/2508.09997
Authors: Johannes Schneider, Béatrice S. Hasler, Michaela Varrone, Fabian Hoya, Thomas Schroffenegger, Dana-Kristin Mah, Karl Peböck
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted at the International Conference on Computer-Human Interaction Research and Applications (CHIRA), 2025

Abstract:We analyze anonymous interaction data of minors in classrooms spanning several months, schools, and subjects employing a novel, simple topic modeling approach. Specifically, we categorize more than 17,000 messages generated by students, teachers, and ChatGPT in two dimensions: content (such as nature and people) and tasks (such as writing and explaining). Our hierarchical categorization done separately for each dimension includes exemplary prompts, and provides both a high-level overview as well as tangible insights. Prior works mostly lack a content or thematic categorization. While task categorizations are more prevalent in education, most have not been supported by real-world data for K-12. In turn, it is not surprising that our analysis yielded a number of novel applications. In deriving these insights, we found that many of the well-established classical and emerging computational methods, i.e., topic modeling, for analysis of large amounts of texts underperform, leading us to directly apply state-of-the-art LLMs with adequate pre-processing to achieve hierarchical topic structures with better human alignment through explicit instructions than prior approaches. Our findings support fellow researchers, teachers and students in enriching the usage of GenAI, while our discussion also highlights a number of concerns and open questions for future research.

[NLP-85] A Transparent Fairness Evaluation Protocol for Open-Source Language Model Benchmarking on the Blockchain

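Quick read: This paper addresses fairness concerns about open-source large language models (LLMs) in high-stakes domains such as criminal justice, education, healthcare, and finance, and in particular the lack of verifiable, immutable, and reproducible evaluation mechanisms. The key to its solution is a transparent evaluation protocol built on the Internet Computer Protocol (ICP) blockchain: smart contracts issue on-chain HTTP requests to hosted Hugging Face endpoints, and datasets, prompts, and metrics are stored directly on-chain, so that evaluations are traceable and results can be trusted. The method benchmarks the fairness of Llama, DeepSeek, and Mistral models on a PISA-based academic performance prediction task, measures social bias with structured context association metrics derived from StereoSet, and extends the analysis to English, Spanish, and Portuguese to reveal cross-lingual disparities. All code and results are open source, supporting community audits and long-term fairness tracking across model versions.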

Link: https://arxiv.org/abs/2508.09993
Authors: Hugo Massaroli, Leonardo Iara, Emmanuel Iarussi, Viviana Siless
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are increasingly deployed in real-world applications, yet concerns about their fairness persist, especially in high-stakes domains like criminal justice, education, healthcare, and finance. This paper introduces a transparent evaluation protocol for benchmarking the fairness of open-source LLMs using smart contracts on the Internet Computer Protocol (ICP) blockchain (Foundation, 2023). Our method ensures verifiable, immutable, and reproducible evaluations by executing on-chain HTTP requests to hosted Hugging Face endpoints and storing datasets, prompts, and metrics directly on-chain. We benchmark the Llama, DeepSeek, and Mistral models on the PISA dataset for academic performance prediction (OECD, 2018), a dataset suitable for fairness evaluation using statistical parity and equal opportunity metrics (Hardt et al., 2016). We also evaluate structured Context Association Metrics derived from the StereoSet dataset (Nadeem et al., 2020) to measure social bias in contextual associations. We further extend our analysis with a multilingual evaluation across English, Spanish, and Portuguese using the Kaleidoscope benchmark (Salazar et al., 2025), revealing cross-linguistic disparities. All code and results are open source, enabling community audits and longitudinal fairness tracking across model versions.
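
The two fairness metrics named in this abstract have standard textbook definitions, which the following minimal sketch illustrates on made-up binary data. It is not the paper's on-chain implementation: the function names and the toy labels are assumptions for illustration. Statistical parity difference compares positive-prediction rates between two groups, and equal opportunity difference compares true-positive rates.

```python
def positive_rate(preds):
    """Fraction of predictions equal to 1."""
    return sum(preds) / len(preds)

def statistical_parity_difference(y_pred, group):
    """P(pred = 1 | group 0) - P(pred = 1 | group 1)."""
    g0 = [p for p, g in zip(y_pred, group) if g == 0]
    g1 = [p for p, g in zip(y_pred, group) if g == 1]
    return positive_rate(g0) - positive_rate(g1)

def equal_opportunity_difference(y_true, y_pred, group):
    """True-positive rate of group 0 minus true-positive rate of group 1."""
    g0 = [p for t, p, g in zip(y_true, y_pred, group) if g == 0 and t == 1]
    g1 = [p for t, p, g in zip(y_true, y_pred, group) if g == 1 and t == 1]
    return positive_rate(g0) - positive_rate(g1)

# Toy data (hypothetical): 1 = positive outcome, group is a binary protected attribute.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

spd = statistical_parity_difference(y_pred, group)
eod = equal_opportunity_difference(y_true, y_pred, group)
print(f"statistical parity difference: {spd}")
print(f"equal opportunity difference: {eod}")
```

A value of 0 for either metric means the two groups are treated identically under that criterion; the sign indicates which group is favored.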

[NLP-86] Bridging AI Innovation and Healthcare Needs: Lessons Learned from Incorporating Modern NLP at The BC Cancer Registry

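Quick read: This paper addresses the challenges of automating data extraction from clinical documents to improve efficiency in healthcare settings. The core issue is how to bridge the gap between technology and business needs when deploying Natural Language Processing (NLP) solutions in practice. The key recommendations are: define problems by clear business objectives rather than technical accuracy alone; develop iteratively; and foster interdisciplinary collaboration and co-design among domain experts, end-users, and Machine Learning (ML) specialists from the outset. The paper also stresses pragmatic model selection (such as hybrid approaches or simpler models), rigorous attention to data quality (representativeness, drift, and annotation consistency), error mitigation through human-in-the-loop validation and ongoing audits, and building organizational Artificial Intelligence (AI) literacy. These practical lessons generalize beyond cancer registries to healthcare organizations adopting AI/NLP to improve data management, patient care, and public health outcomes.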

Link: https://arxiv.org/abs/2508.09991
Authors: Lovedeep Gondara, Gregory Arbour, Raymond Ng, Jonathan Simkin, Shebnum Devji
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:Automating data extraction from clinical documents offers significant potential to improve efficiency in healthcare settings, yet deploying Natural Language Processing (NLP) solutions presents practical challenges. Drawing upon our experience implementing various NLP models for information extraction and classification tasks at the British Columbia Cancer Registry (BCCR), this paper shares key lessons learned throughout the project lifecycle. We emphasize the critical importance of defining problems based on clear business objectives rather than solely technical accuracy, adopting an iterative approach to development, and fostering deep interdisciplinary collaboration and co-design involving domain experts, end-users, and ML specialists from inception. Further insights highlight the need for pragmatic model selection (including hybrid approaches and simpler methods where appropriate), rigorous attention to data quality (representativeness, drift, annotation), robust error mitigation strategies involving human-in-the-loop validation and ongoing audits, and building organizational AI literacy. These practical considerations, generalizable beyond cancer registries, provide guidance for healthcare organizations seeking to successfully implement AI/NLP solutions to enhance data management processes and ultimately improve patient care and public health outcomes.

[NLP-87] Personalized Product Search Ranking: A Multi-Task Learning Approach with Tabular and Non-Tabular Data PRICAI-2025

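Quick read: This paper addresses how to effectively combine structured and unstructured data to improve personalized product search ranking. The core challenge is to use a multi-task learning (MTL) framework to handle mixed data types (tabular and non-tabular) while capturing diverse user behavior. The key elements of the solution are: 1) a pre-trained TinyBERT model that produces semantic embeddings for unstructured text; 2) a novel sampling strategy that better captures the diversity of user behavior; and 3) a scalable relevance labeling mechanism based on click-through rates, click positions, and semantic similarity, which replaces traditional human annotation. Experiments show that multi-task learning with non-tabular data and embeddings outperforms several baselines (such as XGBoost, TabNet, and FT-Transformer), and ablation studies confirm the contribution of fine-grained choices such as fine-tuning TinyBERT layers and query-product embedding interactions.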

Link: https://arxiv.org/abs/2508.09636
Authors: Lalitesh Morishetti, Abhay Kumar, Jonathan Scott, Kaushiki Nag, Gunjan Sharma, Shanu Vashishtha, Rahul Sridhar, Rohit Chatter, Kannan Achan
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 17 pages, 2 figures, The Pacific Rim International Conference on Artificial Intelligence (PRICAI-2025) Conference

Abstract:In this paper, we present a novel model architecture for optimizing personalized product search ranking using a multi-task learning (MTL) framework. Our approach uniquely integrates tabular and non-tabular data, leveraging a pre-trained TinyBERT model for semantic embeddings and a novel sampling technique to capture diverse customer behaviors. We evaluate our model against several baselines, including XGBoost, TabNet, FT-Transformer, DCN-V2, and MMoE, focusing on their ability to handle mixed data types and optimize personalized ranking. Additionally, we propose a scalable relevance labeling mechanism based on click-through rates, click positions, and semantic similarity, offering an alternative to traditional human-annotated labels. Experimental results show that combining non-tabular data with advanced embedding techniques in a multi-task learning paradigm significantly enhances model performance. Ablation studies further underscore the benefits of incorporating relevance labels, fine-tuning TinyBERT layers, and TinyBERT query-product embedding interactions. These results demonstrate the effectiveness of our approach in achieving improved personalized product search ranking.

[NLP-88] Large Language Models Show Signs of Alignment with Human Neurocognition During Abstract Reasoning

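Quick read: This paper investigates whether large language models (LLMs) mirror human neurocognitive mechanisms during abstract reasoning. It compares human participants with eight open-source LLMs on an abstract pattern completion task, in terms of both behavior and neural representations (fixation-related potentials, FRPs, recorded with electroencephalography, EEG). Only the largest LLMs, at roughly 70 billion parameters (such as Qwen-2.5-72B and DeepSeek-R1-70B), reach human-comparable accuracy and show a pattern-specific difficulty profile similar to that of humans. All LLMs form representations in their intermediate layers that clearly cluster the abstract pattern categories, and the strength of this clustering correlates positively with task performance. Moreover, the representational geometry of task-optimal layers shows moderate positive correlations with human frontal FRPs, while comparisons with other EEG measures (response-locked ERPs and resting-state EEG) diverge from this result. This suggests that LLMs may share abstract pattern representations with biological intelligence and offers preliminary evidence of common principles between artificial and biological intelligence.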

Link: https://arxiv.org/abs/2508.10057
Authors: Christopher Pinier, Sonia Acuña Vargas, Mariia Steeghs-Turchina, Dora Matzke, Claire E. Stevenson, Michael D. Nunez
Affiliations: Psychological Methods, University of Amsterdam
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Presented at the 8th Annual Conference on Cognitive Computational Neuroscience (August 12-15, 2025; Amsterdam, The Netherlands); 20 pages, 11 figures

Abstract:This study investigates whether large language models (LLMs) mirror human neurocognition during abstract reasoning. We compared the performance and neural representations of human participants with those of eight open-source LLMs on an abstract-pattern-completion task. We leveraged pattern type differences in task performance and in fixation-related potentials (FRPs) as recorded by electroencephalography (EEG) during the task. Our findings indicate that only the largest tested LLMs (~70 billion parameters) achieve human-comparable accuracy, with Qwen-2.5-72B and DeepSeek-R1-70B also showing similarities with the human pattern-specific difficulty profile. Critically, every LLM tested forms representations that distinctly cluster the abstract pattern categories within their intermediate layers, although the strength of this clustering scales with their performance on the task. Moderate positive correlations were observed between the representational geometries of task-optimal LLM layers and human frontal FRPs. These results consistently diverged from comparisons with other EEG measures (response-locked ERPs and resting EEG), suggesting a potential shared representational space for abstract patterns. This indicates that LLMs might mirror human brain mechanisms in abstract reasoning, offering preliminary evidence of shared principles between biological and artificial intelligence.
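
Comparing "representational geometries" is commonly done with representational similarity analysis (RSA). The sketch below shows a generic version of that idea and is not the paper's exact pipeline: the choice of Euclidean distance and Pearson correlation, the function names, and the toy vectors are all assumptions. Each system is summarized by the pairwise distances between its per-condition representations, and the two distance lists are then correlated.

```python
import math
from itertools import combinations

def rdm_upper(reps):
    """Upper triangle of the pairwise Euclidean distance matrix between conditions."""
    return [math.dist(reps[i], reps[j])
            for i, j in combinations(range(len(reps)), 2)]

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rsa_score(reps_a, reps_b):
    """Correlate the representational geometries of two systems."""
    return pearson(rdm_upper(reps_a), rdm_upper(reps_b))

# Toy data (hypothetical): one vector per pattern category for each system.
# The second system is a scaled copy embedded in a higher dimension,
# so its geometry matches the first up to scale.
model_layer = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
eeg_signal = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 6.0, 0.0], [9.0, 9.0, 0.0]]

score = rsa_score(model_layer, eeg_signal)
print(f"RSA score: {score:.3f}")
```

Because RSA compares distance structure rather than raw coordinates, the two systems do not need the same dimensionality, which is what makes it usable across model activations and EEG recordings.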

Computer Vision

[CV-0] Quantum Visual Fields with Neural Amplitude Encoding

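Quick read: This paper addresses open challenges for Quantum Implicit Neural Representations (QINRs) concerning architecture design, construction of the parametrised circuit (ansatz), effective use of quantum-mechanical properties, training stability, and interplay with classical modules, particularly for learning 2D images and 3D geometric fields. The key to its solution is a new type of QINR, the Quantum Visual Field (QVF). QVF embeds classical data into quantum statevectors via neural amplitude encoding grounded in a learnable energy manifold, which ensures meaningful Hilbert space embeddings. Its ansatz is a fully entangled, learnable parametrised quantum circuit whose quantum (unitary) operations are performed in the real Hilbert space, giving numerically stable training with fast convergence. QVF also uses projective measurement directly to extract the learned signals encoded in the ansatz, without classical post-processing, which improves end-to-end learning efficiency and expressive power.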

Link: https://arxiv.org/abs/2508.10900
Authors: Shuteng Wang, Christian Theobalt, Vladislav Golyanik
Affiliations: MPI for Informatics; SIC
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 15 figures and four tables; project page: this https URL

Abstract:Quantum Implicit Neural Representations (QINRs) include components for learning and execution on gate-based quantum computers. While QINRs recently emerged as a promising new paradigm, many challenges concerning their architecture and ansatz design, the utility of quantum-mechanical properties, training efficiency and the interplay with classical modules remain. This paper advances the field by introducing a new type of QINR for 2D image and 3D geometric field learning, which we collectively refer to as Quantum Visual Field (QVF). QVF encodes classical data into quantum statevectors using neural amplitude encoding grounded in a learnable energy manifold, ensuring meaningful Hilbert space embeddings. Our ansatz follows a fully entangled design of learnable parametrised quantum circuits, with quantum (unitary) operations performed in the real Hilbert space, resulting in numerically stable training with fast convergence. QVF does not rely on classical post-processing – in contrast to the previous QINR learning approach – and directly employs projective measurement to extract learned signals encoded in the ansatz. Experiments on a quantum hardware simulator demonstrate that QVF outperforms the existing quantum approach and widely used classical foundational baselines in terms of visual representation accuracy across various metrics and model characteristics, such as learning of high-frequency details. We also show applications of QVF in 2D and 3D field completion and 3D shape interpolation, highlighting its practical potential.
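
As background for the terms in this abstract, the sketch below shows plain (non-neural) amplitude encoding, a real norm-preserving rotation, and projective measurement probabilities, simulated classically. It does not reproduce QVF's learnable energy manifold or its ansatz; the function names and the example values are assumptions for illustration only.

```python
import math

def amplitude_encode(values):
    """Pad to a power-of-two length and scale to unit L2 norm (a valid real statevector)."""
    size = 1
    while size < len(values):
        size *= 2
    padded = list(values) + [0.0] * (size - len(values))
    norm = math.sqrt(sum(v * v for v in padded))
    if norm == 0.0:
        raise ValueError("cannot encode the all-zero vector")
    return [v / norm for v in padded]

def givens_rotation(state, i, j, theta):
    """Rotate amplitudes i and j by theta; the matrix is orthogonal, i.e. a real unitary."""
    c, s = math.cos(theta), math.sin(theta)
    out = list(state)
    out[i] = c * state[i] - s * state[j]
    out[j] = s * state[i] + c * state[j]
    return out

def measurement_probabilities(state):
    """Born rule for real amplitudes: probability of each basis outcome is amplitude squared."""
    return [a * a for a in state]

state = amplitude_encode([3.0, 4.0, 0.0])       # two qubits: four amplitudes
rotated = givens_rotation(state, 0, 1, 0.3)     # stays on the unit sphere
probs = measurement_probabilities(rotated)

print("encoded state:", state)
print("probabilities after rotation:", probs)
```

Keeping every operation real and orthogonal means the state never leaves the real unit sphere, which is one simple way to see why unitary evolution cannot blow up the magnitude of the encoded signal.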

[CV-1] Puppeteer: Rig and Animate Your 3D Models

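Quick read: This paper addresses the bottleneck in turning static 3D models into animated assets: generative AI has greatly sped up static 3D modeling, but rigging and animation still depend heavily on manual work. The key to its solution is Puppeteer, an end-to-end framework with three core modules. First, an auto-regressive transformer with a joint-based tokenization strategy and hierarchical ordering with stochastic perturbation predicts skeletal structures accurately. Second, a topology-aware attention mechanism infers skinning weights by explicitly encoding inter-joint relationships based on skeletal graph distances. Third, a differentiable optimization-based animation pipeline produces temporally coherent, high-fidelity animations with markedly better computational efficiency, eliminating the jitter common in earlier methods.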

Link: https://arxiv.org/abs/2508.10898
Authors: Chaoyue Song, Xiu Li, Fan Yang, Zhongcong Xu, Jiacheng Wei, Fayao Liu, Jiashi Feng, Guosheng Lin, Jianfeng Zhang
Affiliations: Nanyang Technological University; ByteDance Seed; Institute for Infocomm Research, A*STAR
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL

Abstract:Modern interactive applications increasingly demand dynamic 3D content, yet the transformation of static 3D models into animated assets constitutes a significant bottleneck in content creation pipelines. While recent advances in generative AI have revolutionized static 3D model creation, rigging and animation continue to depend heavily on expert intervention. We present Puppeteer, a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures via an auto-regressive transformer that introduces a joint-based tokenization strategy for compact representation and a hierarchical ordering methodology with stochastic perturbation that enhances bidirectional learning capabilities. It then infers skinning weights via an attention-based architecture incorporating topology-aware joint attention that explicitly encodes inter-joint relationships based on skeletal graph distances. Finally, we complement these rigging advances with a differentiable optimization-based animation pipeline that generates stable, high-fidelity animations while being computationally more efficient than existing approaches. Extensive evaluations across multiple benchmarks demonstrate that our method significantly outperforms state-of-the-art techniques in both skeletal prediction accuracy and skinning quality. The system robustly processes diverse 3D content, ranging from professionally designed game assets to AI-generated shapes, producing temporally coherent animations that eliminate the jittering issues common in existing methods.
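
The "skeletal graph distances" mentioned in this abstract can be read as hop counts between joints in the skeleton tree. The sketch below computes that matrix with breadth-first search; how Puppeteer feeds such distances into its attention layers is not specified here, so the parent-array input format, the function name, and the five-joint skeleton are assumptions for illustration.

```python
from collections import deque

def skeletal_graph_distances(parents):
    """All-pairs hop distances in a skeleton given as a parent array (-1 marks the root)."""
    n = len(parents)
    neighbors = [[] for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            neighbors[child].append(parent)
            neighbors[parent].append(child)

    dist = [[-1] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            joint = queue.popleft()
            for nxt in neighbors[joint]:
                if dist[start][nxt] < 0:          # not visited yet
                    dist[start][nxt] = dist[start][joint] + 1
                    queue.append(nxt)
    return dist

# Hypothetical skeleton: joint 0 is the root, 1 and 4 hang off it, 2 and 3 hang off 1.
parents = [-1, 0, 1, 1, 0]
distances = skeletal_graph_distances(parents)
for row in distances:
    print(row)
```

A distance matrix like this could, for instance, be turned into an additive attention bias so that nearby joints attend to each other more strongly, though that specific use is a guess rather than something stated in the abstract.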

[CV-2] Human-in-Context: Unified Cross-Domain 3D Human Motion Modeling via In-Context Learning

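Quick read: This paper addresses the limited flexibility and scalability of cross-domain 3D human motion models: existing methods typically rely on domain-specific components and multi-stage training, which restricts their ability to handle multiple modalities, tasks, and datasets in a unified way. The key to its solution is a new single-stage training paradigm realized by the Human-in-Context (HiC) model, which generalizes across domains end to end. HiC extends Pose-in-Context (PiC) into a unified framework that combines pose and mesh representations, adopts a max-min similarity prompt sampling strategy to strengthen cross-domain generalization, and uses a network architecture with dual-branch context injection to model contextual dependencies more effectively, improving performance across data scales and task settings.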

Link: https://arxiv.org/abs/2508.10897
Authors: Mengyuan Liu, Xinshun Wang, Zhongbin Fang, Deheng Ye, Xia Li, Tao Tang, Songtao Wu, Xiangtai Li, Ming-Hsuan Yang
Affiliations: Peking University, Shenzhen Graduate School; Tencent; ETH Zurich; Nanyang Technological University, Singapore; Sony R&D Center; University of California at Merced
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper aims to model 3D human motion across domains, where a single model is expected to handle multiple modalities, tasks, and datasets. Existing cross-domain models often rely on domain-specific components and multi-stage training, which limits their practicality and scalability. To overcome these challenges, we propose a new setting to train a unified cross-domain model through a single process, eliminating the need for domain-specific components and multi-stage training. We first introduce Pose-in-Context (PiC), which leverages in-context learning to create a pose-centric cross-domain model. While PiC generalizes across multiple pose-based tasks and datasets, it encounters difficulties with modality diversity, prompting strategy, and contextual dependency handling. We thus propose Human-in-Context (HiC), an extension of PiC that broadens generalization across modalities, tasks, and datasets. HiC combines pose and mesh representations within a unified framework, expands task coverage, and incorporates larger-scale datasets. Additionally, HiC introduces a max-min similarity prompt sampling strategy to enhance generalization across diverse domains and a network architecture with dual-branch context injection for improved handling of contextual dependencies. Extensive experimental results show that HiC performs better than PiC in terms of generalization, data scale, and performance across a wide range of domains. These results demonstrate the potential of HiC for building a unified cross-domain 3D human motion model with improved flexibility and scalability. The source codes and models are available at this https URL.

[CV-3] ESSENTIAL: Episodic and Semantic Memory Integration for Video Class-Incremental Learning ICCV

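Quick read: This paper addresses catastrophic forgetting in Video Class-Incremental Learning (VCIL) while balancing memory efficiency against performance. Existing methods usually store temporally dense samples for rehearsal training, which is costly in memory, whereas methods that store temporally sparse samples lose key temporal information and perform worse. The authors propose the ESSENTIAL framework, whose core idea is to combine episodic memory with semantic memory: the former stores temporally sparse feature representations, and the latter captures general knowledge through learnable prompts. A memory retrieval (MR) module based on cross-attention then reconstructs temporally dense information from the sparse features. This design makes much better use of memory while maintaining strong classification performance.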

链接: https://arxiv.org/abs/2508.10896
作者: Jongseo Lee,Kyungho Bae,Kyle Min,Gyeong-Moon Park,Jinwoo Choi
机构: Kyung Hee University (中央大学); Danggeun Market Inc.; Intel Labs; Korea University (高丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2025 ICCV Highlight paper, 17 pages including supplementary material

点击查看摘要

Abstract:In this work, we tackle the problem of video class-incremental learning (VCIL). Many existing VCIL methods mitigate catastrophic forgetting by rehearsal training with a few temporally dense samples stored in episodic memory, which is memory-inefficient. Alternatively, some methods store temporally sparse samples, sacrificing essential temporal information and thereby resulting in inferior performance. To address this trade-off between memory-efficiency and performance, we propose EpiSodic and SEmaNTIc memory integrAtion for video class-incremental Learning (ESSENTIAL). ESSENTIAL consists of episodic memory for storing temporally sparse features and semantic memory for storing general knowledge represented by learnable prompts. We introduce a novel memory retrieval (MR) module that integrates episodic memory and semantic prompts through cross-attention, enabling the retrieval of temporally dense features from temporally sparse features. We rigorously validate ESSENTIAL on diverse datasets: UCF-101, HMDB51, and Something-Something-V2 from the TCD benchmark and UCF-101, ActivityNet, and Kinetics-400 from the vCLIMB benchmark. Remarkably, with significantly reduced memory, ESSENTIAL achieves favorable performance on the benchmarks.
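The cross-attention retrieval described in the abstract can be illustrated with a minimal sketch. This is generic scaled dot-product cross-attention in plain Python, not the authors' MR module: the choice of queries (one hypothetical slot per dense time step), the memory layout (sparse features concatenated with prompts), and all names and values below are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention over plain Python lists.

    Returns one output vector per query, each a softmax-weighted
    average of the value vectors.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        width = len(values[0])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)])
    return outputs

# Hypothetical toy data: 2 stored sparse features plus 2 learnable prompts
# form the memory; 4 query slots stand in for the dense time steps.
sparse_features = [[1.0, 0.0], [0.0, 1.0]]
semantic_prompts = [[0.5, 0.5], [1.0, 1.0]]
memory = sparse_features + semantic_prompts
dense_queries = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
dense_features = cross_attention(dense_queries, memory, memory)
```

In this toy setup four output vectors are produced from only two stored features, which is the sense in which a retrieval module can yield a denser sequence than the one held in memory.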

[CV-4] MAESTRO: Masked AutoEncoders for Multimodal Multitemporal and Multispectral Earth Observation Data

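[Quick read]: This paper addresses the difficulty of applying self-supervised learning to remote sensing: standard self-supervised methods do not fit the multimodal, multitemporal, and multispectral nature of Earth observation data. The proposed solution is MAESTRO, an adaptation of the Masked Autoencoder with optimized fusion strategies and a tailored target normalization scheme that introduces a spectral prior as a self-supervisory signal. Evaluated on four Earth observation datasets, MAESTRO sets a new state of the art on tasks that rely strongly on multitemporal dynamics while remaining highly competitive on tasks dominated by a single mono-temporal modality.

Link: https://arxiv.org/abs/2508.10894
Authors: Antoine Labatie,Michael Vaccaro,Nina Lardiere,Anatol Garioud,Nicolas Gonthier
Affiliations: Institut national de l’information géographique et forestière (IGN), France; Univ Gustave Eiffel, ENSG, IGN, LASTIG, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: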

Abstract:Self-supervised learning holds great promise for remote sensing, but standard self-supervised methods must be adapted to the unique characteristics of Earth observation data. We take a step in this direction by conducting a comprehensive benchmark of fusion strategies and reconstruction target normalization schemes for multimodal, multitemporal, and multispectral Earth observation data. Based on our findings, we propose MAESTRO, a novel adaptation of the Masked Autoencoder, featuring optimized fusion strategies and a tailored target normalization scheme that introduces a spectral prior as a self-supervisory signal. Evaluated on four Earth observation datasets, MAESTRO sets a new state-of-the-art on tasks that strongly rely on multitemporal dynamics, while remaining highly competitive on tasks dominated by a single mono-temporal modality. Code to reproduce all our experiments is available at this https URL.

[CV-5] STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer

[Quick read]: This paper targets two bottlenecks in existing multi-view 3D reconstruction methods: reliance on expensive global optimization, and simplistic memory mechanisms that scale poorly with sequence length. The proposed STream3R reformulates pointmap prediction as a decoder-only Transformer problem and uses causal attention to process image sequences in a streaming fashion, supporting online 3D perception. By learning geometric priors from large-scale 3D datasets, it generalizes to diverse scenarios, including dynamic scenes where traditional methods often fail, and it is compatible with LLM-style training infrastructure for large-scale pretraining and downstream fine-tuning.

Link: https://arxiv.org/abs/2508.10893
Authors: Yushi Lan,Yihang Luo,Fangzhou Hong,Shangchen Zhou,Honghua Chen,Zhaoyang Lyu,Shuai Yang,Bo Dai,Chen Change Loy,Xingang Pan
Affiliations: S-Lab, Nanyang Technological University, Singapore; Shanghai Artificial Intelligence Laboratory; WICT, Peking University; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: TL;DR: Streaming 4D reconstruction using causal transformer. Project page: this https URL

Abstract:We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments. More details can be found on our project page: this https URL.
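As a rough illustration of the causal attention mentioned above, the sketch below builds a frame-level causal mask and applies it in a softmax. It is a simplification under stated assumptions: one token per frame, uniform scores, and the helper names `causal_mask` and `masked_softmax` are hypothetical rather than taken from the paper or its code.

```python
import math

def causal_mask(num_frames):
    """mask[i][j] is True when frame i may attend to frame j (j <= i)."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions; disallowed positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    peak = max(kept)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Uniform hypothetical scores: each frame spreads attention evenly
# over itself and the frames that arrived before it.
weights = [masked_softmax([0.0] * 4, row) for row in mask]
```

Because no frame attends to a later one, earlier results do not change when a new frame arrives, which is what makes streaming processing with cached past states possible in decoder-only models.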

[CV-6] ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing

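[Quick read]: This paper addresses the error accumulation and artifacts caused by handling inbetweening and colorization as separate stages in traditional animation production; existing AI inbetweening methods struggle with large motions, and colorization methods require dense per-frame sketches. The proposed ToonComposer is a generative model that unifies inbetweening and colorization into a single post-keyframing stage. It uses a sparse sketch injection mechanism for precise control and a spatial low-rank adapter to adapt a modern video foundation model to the cartoon domain while keeping its temporal prior intact. It needs as few as a single sketch and a colored reference frame, and also accepts multiple sketches at any temporal location, which reduces manual workload and adds flexibility.

Link: https://arxiv.org/abs/2508.10881
Authors: Lingen Li,Guangzhi Wang,Zhaoyang Zhang,Yaowei Li,Xiaoyu Li,Qi Dou,Jinwei Gu,Tianfan Xue,Ying Shan
Affiliations: The Chinese University of Hong Kong; ARC Lab, Tencent PCG; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL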

Abstract:Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, it uses a cartoon adaptation method with the spatial low-rank adapter to tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference frame, ToonComposer excels with sparse inputs, while also supporting multiple sketches at any temporal location for more precise motion control. This dual capability reduces manual workload and improves flexibility, empowering artists in real-world scenarios. To evaluate our model, we further created PKBench, a benchmark featuring human-drawn sketches that simulate real-world use cases. Our evaluation demonstrates that ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency, offering a superior and more flexible solution for AI-assisted cartoon production.

[CV-7] Medico 2025: Visual Question Answering for Gastrointestinal Imaging

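[Quick read]: This challenge overview addresses explainability in visual question answering (VQA) for gastrointestinal endoscopy images: building models that answer clinically relevant questions and justify their answers in a way that follows medical reasoning. It defines two subtasks: answering diverse visual questions on the Kvasir-VQA-x1 dataset (6,500 images and 159,549 complex question-answer pairs), and generating multimodal explanations to support clinical decision-making. Submissions are judged with quantitative performance metrics together with expert-reviewed explainability assessments, with the goal of advancing trustworthy AI in medical image analysis.

Link: https://arxiv.org/abs/2508.10869
Authors: Sushant Gautam,Vajira Thambawita,Michael Riegler,Pål Halvorsen,Steven Hicks
Affiliations: SimulaMet - Simula Metropolitan Center for Digital Engineering, Oslo, Norway; Simula Research Laboratory, Oslo, Norway; OsloMet - Oslo Metropolitan University, Oslo, Norway
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: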

Abstract:The Medico 2025 challenge addresses Visual Question Answering (VQA) for Gastrointestinal (GI) imaging, organized as part of the MediaEval task series. The challenge focuses on developing Explainable Artificial Intelligence (XAI) models that answer clinically relevant questions based on GI endoscopy images while providing interpretable justifications aligned with medical reasoning. It introduces two subtasks: (1) answering diverse types of visual questions using the Kvasir-VQA-x1 dataset, and (2) generating multimodal explanations to support clinical decision-making. The Kvasir-VQA-x1 dataset, created from 6,500 images and 159,549 complex question-answer (QA) pairs, serves as the benchmark for the challenge. By combining quantitative performance metrics and expert-reviewed explainability assessments, this task aims to advance trustworthy Artificial Intelligence (AI) in medical image analysis. Instructions, data access, and an updated guide for participation are available in the official competition repository: this https URL

[CV-8] TexVerse: A Universe of 3D Objects with High-Resolution Textures

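[Quick read]: This paper addresses the lack of suitable data for end-to-end generation of high-resolution textures, which remains underexplored despite recent progress on large-scale 3D datasets. It introduces TexVerse, a dataset of over 858K unique high-resolution 3D models, more than 158K of which have physically based rendering (PBR) materials; counting all high-resolution variants, the total is 1.6M 3D instances. Two specialized subsets, TexVerse-Skeleton (69K rigged models) and TexVerse-Animation (54K animated models), preserve the original skeleton and animation data, and detailed model annotations are provided. The dataset is intended as a resource for texture synthesis, PBR material development, animation, and other 3D vision and graphics tasks.

Link: https://arxiv.org/abs/2508.10868
Authors: Yibo Zhang,Li Zhang,Rui Ma,Nan Cao
Affiliations: Shanghai Innovation Institute; Jilin University; Fudan University; Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: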

Abstract:We introduce TexVerse, a large-scale 3D dataset featuring high-resolution textures. While recent advances in large-scale 3D datasets have enhanced high-resolution geometry generation, creating high-resolution textures end-to-end remains underexplored due to the lack of suitable datasets. TexVerse fills this gap with a curated collection of over 858K unique high-resolution 3D models sourced from Sketchfab, including more than 158K models with physically based rendering (PBR) materials. Each model encompasses all of its high-resolution variants, bringing the total to 1.6M 3D instances. TexVerse also includes specialized subsets: TexVerse-Skeleton, with 69K rigged models, and TexVerse-Animation, with 54K animated models, both preserving original skeleton and animation data uploaded by the user. We also provide detailed model annotations describing overall characteristics, structural components, and intricate features. TexVerse offers a high-quality data resource with wide-ranging potential applications in texture synthesis, PBR material development, animation, and various 3D vision and graphics tasks.

[CV-9] Performance of GPT-5 in Brain Tumor MRI Reasoning

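[Quick read]: This paper examines how accurately brain tumor types can be differentiated on magnetic resonance imaging (MRI), a prerequisite for treatment planning in neuro-oncology. It uses the visual question answering (VQA) capability of large language models (LLMs), which combines image interpretation with natural language reasoning, to jointly analyze multi-sequence MRI triplanar mosaics and structured clinical features in a zero-shot chain-of-thought setting. The standardized VQA benchmark is built from three Brain Tumor Segmentation (BraTS) datasets covering glioblastoma (GLI), meningioma (MEN), and brain metastases (MET). GPT-5-mini achieved the highest macro-average accuracy (44.19%), indicating that current models reach moderate accuracy on structured neuro-oncological VQA tasks but not a level acceptable for clinical use.

Link: https://arxiv.org/abs/2508.10865
Authors: Mojtaba Safari,Shansong Wang,Mingzhe Hu,Zach Eidex,Qiang Li,Xiaofeng Yang
Affiliations: Emory University School of Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: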

Abstract:Accurate differentiation of brain tumor types on magnetic resonance imaging (MRI) is critical for guiding treatment planning in neuro-oncology. Recent advances in large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. In this study, we evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a curated brain tumor VQA benchmark derived from 3 Brain Tumor Segmentation (BraTS) datasets - glioblastoma (GLI), meningioma (MEN), and brain metastases (MET). Each case included multi-sequence MRI triplanar mosaics and structured clinical features transformed into standardized VQA items. Models were assessed in a zero-shot chain-of-thought setting for accuracy on both visual and reasoning tasks. Results showed that GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%). Performance varied by tumor subtype, with no single model dominating across all cohorts. These findings suggest that GPT-5 family models can achieve moderate accuracy in structured neuro-oncological VQA tasks, but not at a level acceptable for clinical use.
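For readers unfamiliar with the metric, the sketch below shows how a macro-average accuracy differs from a pooled (micro) one. The abstract does not state exactly what is averaged, so treating the three cohorts as the averaging units is an assumption, and the counts are hypothetical numbers that do not come from the paper.

```python
def macro_average_accuracy(per_cohort):
    """Unweighted mean of per-cohort accuracies.

    per_cohort maps a cohort name to (num_correct, num_items).
    """
    accuracies = [correct / total for correct, total in per_cohort.values()]
    return sum(accuracies) / len(accuracies)

def micro_average_accuracy(per_cohort):
    """Pooled accuracy: every item counts equally, so large cohorts dominate."""
    correct = sum(c for c, _ in per_cohort.values())
    total = sum(t for _, t in per_cohort.values())
    return correct / total

# Hypothetical counts for illustration only.
counts = {"GLI": (50, 100), "MEN": (30, 100), "MET": (80, 400)}
macro = macro_average_accuracy(counts)
micro = micro_average_accuracy(counts)
```

With these made-up counts the macro average is one third while the pooled accuracy is lower, because the largest cohort also has the lowest accuracy. Macro averaging keeps a small cohort from being drowned out by a large one.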

[CV-10] Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation

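[Quick read]: This paper addresses the difficulty of guaranteeing physical plausibility in video generation: generated videos may look realistic yet violate the laws of physics, which limits their use in applications that require realism and accuracy. The proposed PhysHPO framework performs Hierarchical Cross-Modal Direct Preference Optimization, aligning preferences at four granularities: Instance Level (overall video content versus the input prompt), State Level (temporal consistency, with boundary frames as anchors), Motion Level (motion trajectories for realistic dynamics), and Semantic Level (logical consistency between narrative and visuals). An automated data selection pipeline identifies high-quality samples in existing large-scale text-video datasets, avoiding costly dataset construction.

Link: https://arxiv.org/abs/2508.10858
Authors: Harold Haodong Chen,Haojian Huang,Qifeng Chen,Harry Yang,Ser-Nam Lim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL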

Abstract:Recent advancements in video generation have enabled the creation of high-quality, visually compelling videos. However, generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. In this work, we propose PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, to tackle this challenge by enabling fine-grained preference alignment for physically plausible video generation. PhysHPO optimizes video alignment across four hierarchical granularities: a) Instance Level, aligning the overall video content with the input prompt; b) State Level, ensuring temporal consistency using boundary frames as anchors; c) Motion Level, modeling motion trajectories for realistic dynamics; and d) Semantic Level, maintaining logical consistency between narrative and visuals. Recognizing that real-world videos are the best reflections of physical phenomena, we further introduce an automated data selection pipeline to efficiently identify and utilize “good data” from existing large-scale text-video datasets, thereby eliminating the need for costly and time-intensive dataset construction. Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models. To the best of our knowledge, this is the first work to explore fine-grained preference alignment and data selection for video generation, paving the way for more realistic and human-preferred video generation paradigms.
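To make the preference-optimization idea concrete, here is a minimal sketch of the standard Direct Preference Optimization loss on log-probabilities, combined across named levels by a weighted sum. This is not PhysHPO's actual objective: the abstract gives no formula, so the weighted-sum combination, the use of plain log-probabilities, and all names and numbers are assumptions.

```python
import math

def dpo_loss(policy_win, policy_lose, ref_win, ref_lose, beta=0.1):
    """Standard DPO loss for one preference pair, given log-probabilities.

    loss = -log(sigmoid(beta * ((policy_win - ref_win) - (policy_lose - ref_lose))))
    """
    margin = beta * ((policy_win - ref_win) - (policy_lose - ref_lose))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def hierarchical_loss(level_pairs, level_weights):
    """Weighted sum of per-level DPO losses (the weighting is an assumption)."""
    return sum(level_weights[name] * dpo_loss(*pair) for name, pair in level_pairs.items())

# Hypothetical log-probabilities: (policy_win, policy_lose, ref_win, ref_lose).
pairs = {
    "instance": (-10.0, -12.0, -11.0, -11.0),
    "state": (-8.0, -8.0, -8.0, -8.0),
    "motion": (-9.0, -9.5, -9.0, -9.0),
    "semantic": (-7.0, -7.0, -7.5, -7.0),
}
weights = {name: 1.0 for name in pairs}
total_loss = hierarchical_loss(pairs, weights)
```

The loss falls below log 2 whenever the policy favors the preferred sample more strongly than the reference model does, and equals log 2 when the two models agree.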

[CV-11] Generalizable Federated Learning using Client Adaptive Focal Modulation WACV2024

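[Quick read]: This paper addresses weak model generalization in federated learning (FL) under non-IID client data and cross-domain settings. The proposed AdaptFED extends the authors' earlier TransFed by adding task-aware client embeddings that further personalize the modulation dynamics, making focal modulation more adaptive in generalizable FL. It also provides enhanced theoretical bounds on adaptation performance and an efficient variant that cuts server-client communication overhead through low-rank hypernetwork conditioning, which helps deployment in resource-constrained environments. Experiments on eight datasets show gains over state-of-the-art baselines, particularly in source-free and cross-task federated setups.

Link: https://arxiv.org/abs/2508.10840
Authors: Tajamul Ashraf,Iqra Altaf Gillani
Affiliations: MBZUAI (Mohamed Bin Zayed University of Artificial Intelligence); NIT Srinagar (National Institute of Technology Srinagar)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2024 Extended Paper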

Abstract:Federated learning (FL) has proven essential for privacy-preserving, collaborative training across distributed clients. Our prior work, TransFed, introduced a robust transformer-based FL framework that leverages a learn-to-adapt hypernetwork to generate personalized focal modulation layers per client, outperforming traditional methods in non-IID and cross-domain settings. In this extended version, we propose AdaptFED, where we deepen the investigation of focal modulation in generalizable FL by incorporating: (1) a refined adaptation strategy that integrates task-aware client embeddings to personalize modulation dynamics further, (2) enhanced theoretical bounds on adaptation performance, and (3) broader empirical validation across additional modalities, including time-series and multilingual data. We also introduce an efficient variant of TransFed that reduces server-client communication overhead via low-rank hypernetwork conditioning, enabling scalable deployment in resource-constrained environments. Extensive experiments on eight diverse datasets reaffirm the superiority of our method over state-of-the-art baselines, particularly in source-free and cross-task federated setups. Our findings not only extend the capabilities of focal modulation in FL but also pave the way for more adaptive, scalable, and generalizable transformer-based federated systems. The code is available at this http URL

[CV-12] Self-Supervised Stereo Matching with Multi-Baseline Contrastive Learning

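[Quick read]: This paper addresses the performance drop of self-supervised stereo matching in occluded regions, where the photometric consistency assumption breaks down. The proposed BaCon-Stereo is a contrastive learning framework built on a teacher-student paradigm with multi-baseline inputs: regions occluded in the student's target view are often visible in the teacher's, so the teacher can still provide a useful supervision signal there. An occlusion-aware attention map further guides the student in learning occlusion completion, improving accuracy in both occluded and non-occluded regions.

Link: https://arxiv.org/abs/2508.10838
Authors: Peng Xu,Zhiyu Xiang,Jingyun Fu,Tianyu Pu,Kai Wang,Chaojie Ji,Tingming Bai,Eryun Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: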

Abstract:Current self-supervised stereo matching relies on the photometric consistency assumption, which breaks down in occluded regions due to ill-posed correspondences. To address this issue, we propose BaCon-Stereo, a simple yet effective contrastive learning framework for self-supervised stereo network training in both non-occluded and occluded regions. We adopt a teacher-student paradigm with multi-baseline inputs, in which the stereo pairs fed into the teacher and student share the same reference view but differ in target views. Geometrically, regions occluded in the student’s target view are often visible in the teacher’s, making it easier for the teacher to predict in these regions. The teacher’s prediction is rescaled to match the student’s baseline and then used to supervise the student. We also introduce an occlusion-aware attention map to better guide the student in learning occlusion completion. To support training, we synthesize a multi-baseline dataset BaCon-20k. Extensive experiments demonstrate that BaCon-Stereo improves prediction in both occluded and non-occluded regions, achieves strong generalization and robustness, and outperforms state-of-the-art self-supervised methods on both KITTI 2015 and 2012 benchmarks. Our code and dataset will be released upon paper acceptance.
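The rescaling step mentioned in the abstract follows from basic stereo geometry: for rectified pairs, disparity equals focal length times baseline divided by depth, so for a fixed reference view it scales linearly with the baseline. The sketch below assumes rectified pairs with a shared focal length and target views on the same side of the reference; the function name and the values are hypothetical.

```python
def rescale_disparity(teacher_disparity, teacher_baseline, student_baseline):
    """Convert a disparity map from the teacher's baseline to the student's.

    With a shared reference view and focal length,
    disparity = focal_length * baseline / depth,
    so the conversion is a multiplication by the baseline ratio.
    """
    scale = student_baseline / teacher_baseline
    return [[d * scale for d in row] for row in teacher_disparity]

# Hypothetical 2x2 disparity map (pixels) and baselines (meters).
teacher_map = [[8.0, 4.0], [2.0, 16.0]]
student_map = rescale_disparity(teacher_map, teacher_baseline=0.25, student_baseline=0.5)
```

Here the student's baseline is twice the teacher's, so every disparity doubles while the implied depth stays the same.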

[CV-13] UI-Venus Technical Report: Building High-performance UI Agents with RFT

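[Quick read]: This paper addresses how a UI agent that takes only screenshots as input can achieve accurate UI grounding and navigation. It presents UI-Venus, a native UI agent built on a multimodal large language model (MLLM) and trained with reinforcement fine-tuning (RFT) on several hundred thousand high-quality samples. The 7B and 72B variants score 94.1% / 50.8% and 95.3% / 61.9% on Screenspot-V2 / Pro, surpassing prior open-source and closed-source baselines. The main contributions are carefully designed reward functions for grounding and navigation, efficient data cleaning protocols, and the Self-Evolving Trajectory History Alignment and Sparse Action Enhancement mechanisms, which refine historical reasoning traces and balance the distribution of sparse but critical actions for more coherent planning and better generalization in complex UI tasks.

Link: https://arxiv.org/abs/2508.10833
Authors: Zhangxuan Gu,Zhengwen Zeng,Zhenyu Xu,Xingran Zhou,Shuheng Shen,Yunfei Liu,Beitong Zhou,Changhua Meng,Tianyu Xia,Weizhi Chen,Yue Wen,Jingya Dou,Fei Tang,Jinzhen Lin,Yulin Liu,Zhenlin Guo,Yichen Gong,Heng Jia,Changlong Gao,Yuan Guo,Yong Deng,Zhenyu Guo,Liang Chen,Weiqiang Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: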

Abstract:We present UI-Venus, a native UI agent based on a multimodal large language model that takes only screenshots as input. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source this http URL. To show UI-Venus’s summary and planning ability, we also evaluate it on AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rates, also beating existing this http URL. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks and corresponding efficient data cleaning this http URL. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment and Sparse Action Enhancement, which refine historical reasoning traces and balance the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols, and a novel self-evolving framework for improving navigation performance, which encourage further research and development in the community. Code is available at this https URL.

[CV-14] Mobile-Friendly Deep Learning for Plant Disease Detection: A Lightweight CNN Benchmark Across 101 Classes of 33 Crops

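[Quick read]: This paper addresses early detection of plant diseases, a major threat to global food security, with a mobile-friendly computer vision solution. It builds a combined dataset covering 101 plant diseases across 33 crops from three public datasets (PlantDoc, PlantVillage, and PlantWild) and benchmarks lightweight architectures (MobileNetV2, MobileNetV3, EfficientNet-B0/B1). EfficientNet-B1 performs best at 94.7% classification accuracy and, according to the authors, offers the best balance between accuracy and computational efficiency for resource-constrained devices.

Link: https://arxiv.org/abs/2508.10817
Authors: Anand Kumar,Harminder Pal Monga,Tapasi Brahma,Satyam Kalra,Navas Sherif
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 5 figures, 2 tables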

Abstract:Plant diseases are a major threat to food security globally. It is important to develop early detection systems that can detect them accurately. Advances in computer vision techniques have the potential to solve this challenge. We have developed a mobile-friendly solution which can accurately classify 101 plant diseases across 33 crops. We built a comprehensive dataset by combining three datasets built for the same purpose: PlantDoc, PlantVillage, and PlantWild. We evaluated performance across several lightweight architectures - MobileNetV2, MobileNetV3, MobileNetV3-Large, and EfficientNet-B0, B1 - specifically chosen for their efficiency on resource-constrained devices. The results were promising, with EfficientNet-B1 delivering our best performance at 94.7% classification accuracy. This architecture struck an optimal balance between accuracy and computational efficiency, making it well-suited for real-world deployment on mobile devices.

[CV-15] Object Fidelity Diffusion for Remote Sensing Image Generation

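[Quick read]: This paper addresses low object fidelity in remote sensing image generation: existing diffusion models fail to capture morphological details, producing low-fidelity images that may hurt the robustness and reliability of object detection models. The proposed Object Fidelity Diffusion (OF-Diff) is a dual-branch diffusion model that extracts prior object shapes from the layout and adds a diffusion consistency loss, so it can generate high-fidelity images without real images as guidance during sampling. DDPO is used to fine-tune the diffusion process so that generated images are more diverse and semantically consistent. Detection performance on several polymorphic and small object classes improves markedly: mAP increases by 8.3%, 7.7%, and 4.0% for airplanes, ships, and vehicles, respectively.

Link: https://arxiv.org/abs/2508.10801
Authors: Ziqi Ye,Shuran Ma,Jie Yang,Xiaoyi Yang,Ziyang Gong,Xue Yang,Haipeng Wang
Affiliations: 1. Tsinghua University; 2. Alibaba Group; 3. University of California, Berkeley; 4. Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: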

Abstract:High-precision controllable remote sensing image generation is both meaningful and challenging. Existing diffusion models often produce low-fidelity images due to their inability to adequately capture morphological details, which may affect the robustness and reliability of object detection models. To enhance the accuracy and fidelity of generated objects in remote sensing, this paper proposes Object Fidelity Diffusion (OF-Diff), which effectively improves the fidelity of generated objects. Specifically, we are the first to extract the prior shapes of objects based on the layout for diffusion models in remote sensing. Then, we introduce a dual-branch diffusion model with diffusion consistency loss, which can generate high-fidelity remote sensing images without providing real images during the sampling phase. Furthermore, we introduce DDPO to fine-tune the diffusion process, making the generated remote sensing images more diverse and semantically consistent. Comprehensive experiments demonstrate that OF-Diff outperforms state-of-the-art methods in remote sensing across key quality metrics. Notably, the performance of several polymorphic and small object classes shows significant improvement. For instance, the mAP increases by 8.3%, 7.7%, and 4.0% for airplanes, ships, and vehicles, respectively.

[CV-16] VasoMIM: Vascular Anatomy-Aware Masked Image Modeling for Vessel Segmentation

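[Quick read]: This paper addresses limited vessel segmentation accuracy in X-ray angiograms caused by scarce annotated data, and in particular the failure of conventional masked image modeling (MIM) to learn useful vascular representations because of the severe class imbalance between vessel and background pixels. The proposed Vascular anatomy-aware Masked Image Modeling (VasoMIM) has two complementary components: an anatomy-guided masking strategy that preferentially masks vessel-containing patches so the model focuses on vessel regions, and an anatomical consistency loss that enforces consistent vascular semantics between the original and reconstructed images, improving the discriminability of vascular representations.

Link: https://arxiv.org/abs/2508.10794
Authors: De-Xing Huang,Xiao-Hu Zhou,Mei-Jiang Gui,Xiao-Liang Xie,Shi-Qi Liu,Shuang-Yi Wang,Tian-Yu Xiang,Rui-Ze Ma,Nu-Fang Xiao,Zeng-Guang Hou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 11 figures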

Abstract:Accurate vessel segmentation in X-ray angiograms is crucial for numerous clinical applications. However, the scarcity of annotated data presents a significant challenge, which has driven the adoption of self-supervised learning (SSL) methods such as masked image modeling (MIM) to leverage large-scale unlabeled data for learning transferable representations. Unfortunately, conventional MIM often fails to capture vascular anatomy because of the severe class imbalance between vessel and background pixels, leading to weak vascular representations. To address this, we introduce Vascular anatomy-aware Masked Image Modeling (VasoMIM), a novel MIM framework tailored for X-ray angiograms that explicitly integrates anatomical knowledge into the pre-training process. Specifically, it comprises two complementary components: anatomy-guided masking strategy and anatomical consistency loss. The former preferentially masks vessel-containing patches to focus the model on reconstructing vessel-relevant regions. The latter enforces consistency in vascular semantics between the original and reconstructed images, thereby improving the discriminability of vascular representations. Empirically, VasoMIM achieves state-of-the-art performance across three datasets. These findings highlight its potential to facilitate X-ray angiogram analysis.
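One way to picture "preferentially masking vessel-containing patches" is weighted sampling without replacement, where a patch's chance of being masked grows with its share of vessel pixels. The abstract does not say how VasoMIM computes its masking probabilities, so the weighting rule, the `floor` constant, and the function name below are assumptions made for illustration only.

```python
import random

def anatomy_guided_mask(vessel_fraction, mask_ratio, rng, floor=0.05):
    """Pick patch indices to mask, favoring patches with more vessel pixels.

    vessel_fraction: per-patch share of vessel pixels in [0, 1].
    floor: small constant so background patches can still be chosen (assumption).
    """
    num_to_mask = round(mask_ratio * len(vessel_fraction))
    remaining = list(range(len(vessel_fraction)))
    weights = [vessel_fraction[i] + floor for i in remaining]
    chosen = []
    for _ in range(num_to_mask):
        # Draw one position with probability proportional to its weight,
        # then remove it so it cannot be drawn twice.
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return sorted(chosen)

# Hypothetical vessel fractions for 8 patches.
fractions = [0.0, 0.6, 0.0, 0.9, 0.1, 0.0, 0.7, 0.0]
masked = anatomy_guided_mask(fractions, mask_ratio=0.5, rng=random.Random(0))
```

Compared with uniform random masking, a scheme like this spends more of the reconstruction target on the rare vessel pixels, which is the class-imbalance problem the abstract describes.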

[CV-17] Cooperative Face Liveness Detection from Optical Flow

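[Quick read]: This paper addresses the weakness of video-based face liveness detection against presentation attacks such as printed photos, screen displays, masks, and video replays. It proposes a new user-interaction protocol in which participants slowly move their frontal-oriented face closer to the camera, and combines it with optical flow analysis to extract facial volume information. Optical flow estimated by a neural network is fed together with RGB frames into a classifier, exploiting spatial-temporal features for more reliable liveness decisions than passive methods.

Link: https://arxiv.org/abs/2508.10786
Authors: Artem Sokolov,Mikhail Nikitin,Anton Konushin
Affiliations: M. V. Lomonosov Moscow State University; Tevian; AIRI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: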

Abstract:In this work, we proposed a novel cooperative video-based face liveness detection method based on a new user interaction scenario where participants are instructed to slowly move their frontal-oriented face closer to the camera. This controlled approaching face protocol, combined with optical flow analysis, represents the core innovation of our approach. By designing a system where users follow this specific movement pattern, we enable robust extraction of facial volume information through neural optical flow estimation, significantly improving discrimination between genuine faces and various presentation attacks (including printed photos, screen displays, masks, and video replays). Our method processes both the predicted optical flows and RGB frames through a neural classifier, effectively leveraging spatial-temporal features for more reliable liveness detection compared to passive methods.

[CV-18] Ultra-High-Definition Reference-Based Landmark Image Super-Resolution with Generative Diffusion Prior

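[Quick read]: This paper addresses two limitations of existing diffusion-based reference-based image super-resolution (RefSR) methods: poor alignment of information between the low-resolution (LR) image and the reference high-resolution (HR) image, and training datasets whose low resolution and missing detail limit restoration quality. The proposed TriFlowSR framework uses an explicit Reference Matching Strategy to achieve pattern matching between the LR image and the reference HR image. The authors also introduce Landmark-4K, the first RefSR dataset for Ultra-High-Definition (UHD) landmark scenarios, and report that the method makes better use of the semantic and texture information in the reference image than previous methods.

Link: https://arxiv.org/abs/2508.10779
Authors: Zhenning Shi,Zizheng Yan,Yuhang Yu,Clara Xue,Jingyu Zhuang,Qi Zhang,Jinwei Chen,Tao Li,Qingnan Fan
Affiliations: 1. University of California, Berkeley; 2. Google; 3. Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: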

Abstract:Reference-based Image Super-Resolution (RefSR) aims to restore a low-resolution (LR) image by utilizing the semantic and texture information from an additional reference high-resolution (reference HR) image. Existing diffusion-based RefSR methods are typically built upon ControlNet, which struggles to effectively align the information between the LR image and the reference HR image. Moreover, current RefSR datasets suffer from limited resolution and poor image quality, resulting in the reference images lacking sufficient fine-grained details to support high-quality restoration. To overcome the limitations above, we propose TriFlowSR, a novel framework that explicitly achieves pattern matching between the LR image and the reference HR image. Meanwhile, we introduce Landmark-4K, the first RefSR dataset for Ultra-High-Definition (UHD) landmark scenarios. Considering the UHD scenarios with real-world degradation, in TriFlowSR, we design a Reference Matching Strategy to effectively match the LR image with the reference HR image. Experimental results show that our approach can better utilize the semantic and texture information of the reference HR image compared to previous methods. To the best of our knowledge, we propose the first diffusion-based RefSR pipeline for ultra-high definition landmark scenarios under real-world degradation. Our code and model will be available at this https URL.

[CV-19] Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation

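[Quick read]: This paper addresses the inference bottlenecks of diffusion models in video generation: the slow iterative denoising process and the quadratic cost of attention over long sequences. Step distillation and sparse attention each help on their own, but combining them either gives suboptimal results (training-free integration) or requires expensive high-quality video data (separate training). The proposed BLADE is a data-free joint training framework with two components: Adaptive Block-Sparse Attention (ASA), which dynamically generates content-aware sparsity masks to focus computation on salient spatiotemporal features, and a sparsity-aware step distillation paradigm built on Trajectory Distribution Matching (TDM), which incorporates sparsity directly into distillation instead of treating it as a separate compression step. Experiments on several text-to-video models show speedups of up to 14.10x with equal or better video quality.

Link: https://arxiv.org/abs/2508.10774
Authors: Youping Gu,Xiaolong Li,Yuhao Hu,Bohan Zhuang
Affiliations: Zhejiang University; Central Media Technology Institute, Huawei Technologies
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Tech report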

Abstract:Diffusion transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and prohibitive quadratic attention costs for long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, effectively combining these approaches presents critical challenges – training-free integration yields suboptimal results, while separately training sparse attention after step distillation requires prohibitively expensive high-quality video data. To overcome these limitations, we propose BLADE, an innovative data-free joint training framework that introduces: (1) an Adaptive Block-Sparse Attention (ASA) mechanism for dynamically generating content-aware sparsity masks to focus computation on salient spatiotemporal features, and (2) a sparsity-aware step distillation paradigm built upon Trajectory Distribution Matching (TDM) that directly incorporates sparsity into the distillation process rather than treating it as a separate compression step, with fast convergence. We validate BLADE on text-to-video models like CogVideoX-5B and Wan2.1-1.3B. Our framework demonstrates remarkable efficiency gains across different scales. On Wan2.1-1.3B, BLADE achieves a 14.10x end-to-end inference acceleration over a 50-step baseline. Moreover, on models such as CogVideoX-5B with short video sequence lengths, our framework delivers a robust 8.89x speedup. Crucially, the acceleration is accompanied by a consistent quality improvement. On the VBench-2.0 benchmark, BLADE boosts the score of CogVideoX-5B to 0.569 (from 0.534) and Wan2.1-1.3B to 0.570 (from 0.563), results that are further corroborated by superior ratings in human evaluations. Our code and model weights are publicly available at: this http URL.
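
As a rough illustration of the block-sparse idea described in the abstract, the sketch below pools queries and keys into blocks, scores each block pair, and keeps only the top-scoring key blocks for every query block. It is a minimal sketch under stated assumptions, not BLADE's ASA mechanism: the mean pooling, the top-k selection rule, and the names `block_means`, `block_sparse_mask`, `mask_density`, and `keep_per_row` are hypothetical choices made for illustration.

```python
def block_means(vectors, block_size):
    """Average each run of `block_size` consecutive vectors into one block vector."""
    blocks = []
    for start in range(0, len(vectors), block_size):
        chunk = vectors[start:start + block_size]
        dim = len(chunk[0])
        blocks.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return blocks


def block_sparse_mask(queries, keys, block_size, keep_per_row):
    """Return mask[i][j] == True when query block i should attend to key block j.

    Assumption: block importance is the dot product of mean-pooled blocks.
    """
    q_blocks = block_means(queries, block_size)
    k_blocks = block_means(keys, block_size)
    mask = []
    for q in q_blocks:
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blocks]
        # Keep only the highest-scoring key blocks for this query block.
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        kept = set(order[:keep_per_row])
        mask.append([j in kept for j in range(len(k_blocks))])
    return mask


def mask_density(mask):
    """Fraction of block pairs that are still computed."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Toy sequence: 8 tokens of dimension 2, grouped into 4 blocks of 2 tokens.
tokens = [[1, 0], [1, 0], [0, 1], [0, 1], [-1, 0], [-1, 0], [0, -1], [0, -1]]
mask = block_sparse_mask(tokens, tokens, block_size=2, keep_per_row=1)
```

With one key block kept per query block, only a quarter of the block pairs in this toy example are evaluated; the attention cost scales with `mask_density(mask)` rather than with the full quadratic grid.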

[CV-20] AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences

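Quick read: This paper addresses the limited realism, scale, and complexity of current video authenticity detection benchmarks, which leaves them unable to evaluate effectively how well modern vision-language models detect high-fidelity AI-generated video forgeries. The key to the solution is AEGIS, a large-scale, carefully curated benchmark containing over 10,000 real and synthetic videos, the latter generated by a range of state-of-the-art generative models (such as Stable Video Diffusion, CogVideoX-5B, KLing, and Sora) spanning open-source and proprietary architectures. It also introduces challenging subsets for robustness evaluation and provides multimodal annotations (semantic-authenticity descriptions, motion features, and low-level visual features), supporting research on model generalization and reliability under real-world forgery threats.

Link: https://arxiv.org/abs/2508.10771
Authors: Jieyu Li,Xin Zhang,Joey Tianyi Zhou
Affiliations: National University of Singapore; Centre for Frontier AI Research, Institute of High Performance Computing, Agency for Science, Technology and Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Proceedings of the 33rd ACM International Conference on Multimedia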

Abstract:Recent advances in AI-generated content have fueled the rise of highly realistic synthetic videos, posing severe risks to societal trust and digital integrity. Existing benchmarks for video authenticity detection typically suffer from limited realism, insufficient scale, and inadequate complexity, failing to effectively evaluate modern vision-language models against sophisticated forgeries. To address this critical gap, we introduce AEGIS, a novel large-scale benchmark explicitly targeting the detection of hyper-realistic and semantically nuanced AI-generated videos. AEGIS comprises over 10,000 rigorously curated real and synthetic videos generated by diverse, state-of-the-art generative models, including Stable Video Diffusion, CogVideoX-5B, KLing, and Sora, encompassing open-source and proprietary architectures. In particular, AEGIS features specially constructed challenging subsets enhanced with robustness evaluation. Furthermore, we provide multimodal annotations spanning Semantic-Authenticity Descriptions, Motion Features, and Low-level Visual Features, facilitating authenticity detection and supporting downstream tasks such as multimodal fusion and forgery localization. Extensive experiments using advanced vision-language models demonstrate limited detection capabilities on the most challenging subsets of AEGIS, highlighting the dataset’s unique complexity and realism beyond the current generalization capabilities of existing models. In essence, AEGIS establishes an indispensable evaluation benchmark, fundamentally advancing research toward developing genuinely robust, reliable, broadly generalizable video authenticity detection methodologies capable of addressing real-world forgery threats. Our dataset is available on this https URL.

[CV-21] From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models

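Quick read: This paper addresses the weakness of current mainstream Vision Language Models (VLMs) in **spatio-physical reasoning**: although these models perform well in specialized domains such as multimodal mathematics and pure spatial understanding, they still fall well short at understanding and reasoning about spatial relations and physical laws in the real physical world. The key to the solution is to first strengthen the model's basic capabilities through supervised fine-tuning and then apply rule-based reinforcement learning to improve deep reasoning. Applied to Qwen2.5-VL-7B, this substantially improves performance on spatio-physical reasoning tasks, even surpassing some leading proprietary models. The study also notes that generalization to new physics scenarios remains limited, pointing to the need for more effective modeling paradigms.

Link: https://arxiv.org/abs/2508.10770
Authors: Tiancheng Han,Yunfei Gao,Yong Li,Wuzhou Yu,Qiaosheng Zhang,Wenqi Shao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures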

Abstract:Spatio-physical reasoning, a foundational capability for understanding the real physical world, is a critical step towards building robust world models. While recent vision language models (VLMs) have shown remarkable progress in specialized domains like multimodal mathematics and pure spatial understanding, their capability for spatio-physical reasoning remains largely unexplored. This paper provides a comprehensive diagnostic analysis of mainstream VLMs, revealing that current models perform inadequately on this crucial task. Further detailed analysis shows that this underperformance is largely attributable to biases caused by human-like priors and a lack of deep reasoning. To address these challenges, we apply supervised fine-tuning followed by rule-based reinforcement learning to Qwen2.5-VL-7B, resulting in significant improvements in spatio-physical reasoning capabilities and surpassing leading proprietary models. Nevertheless, despite this success, the model’s generalization to new physics scenarios remains limited – underscoring the pressing need for new approaches in spatio-physical reasoning.

[CV-22] Agentic Design Review System

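Quick read: This paper addresses the lack of systematic, multi-dimensional, integrated feedback in graphic design evaluation, namely how to assess a design holistically across alignment, composition, aesthetics, and color choices, and how to improve the accuracy and usefulness of the assessment by aggregating expert reviews. The key to the solution is the Agentic Design Review System (AgenticDRS), a multi-agent review system in which multiple agents collaboratively analyze a design under the coordination of a meta-agent. Its core contributions are an in-context exemplar selection approach based on graph matching and a prompt expansion method, which together make each agent design-aware. The authors also build the DRS-BENCH benchmark to evaluate the framework quantitatively.

Link: https://arxiv.org/abs/2508.10745
Authors: Sayan Nag,K J Joseph,Koustava Goswami,Vlad I Morariu,Balaji Vasan Srinivasan
Affiliations: Adobe Research
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Comments: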

Abstract:Evaluating graphic designs involves assessing them from multiple facets such as alignment, composition, aesthetics, and color choices. Evaluating designs in a holistic way involves aggregating feedback from individual expert reviewers. Towards this, we propose an Agentic Design Review System (AgenticDRS), where multiple agents collaboratively analyze a design, orchestrated by a meta-agent. A novel in-context exemplar selection approach based on graph matching and a unique prompt expansion method play a central role in making each agent design-aware. To evaluate this framework, we propose the DRS-BENCH benchmark. Thorough experimental evaluation against state-of-the-art baselines adapted to the problem setup, backed up by critical ablation experiments, brings out the efficacy of AgenticDRS in evaluating graphic designs and generating actionable feedback. We hope that this work will attract attention to this pragmatic, yet under-explored research direction.

[CV-23] An Efficient Model-Driven Groupwise Approach for Atlas Construction

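Quick read: This paper addresses three problems that data-driven image registration methods face in groupwise atlas construction: reliance on large training datasets, limited generalizability, and the lack of a true inference phase in groupwise settings. The authors propose DARC (Diffeomorphic Atlas Registration via Coordinate descent), a model-driven groupwise registration framework whose core innovation is combining a coordinate descent strategy with a centrality-enforcing activation function to build unbiased, diffeomorphic atlases with high anatomical fidelity. The approach requires no training, is computationally efficient, and handles arbitrary numbers of 3D medical images without GPU memory bottlenecks, providing a general, scalable, and resource-friendly solution for atlas construction and downstream tasks such as one-shot segmentation and shape synthesis.

Link: https://arxiv.org/abs/2508.10743
Authors: Ziwei Zou,Bei Zou,Xiaoyan Kui,Wenqi Lu,Haoran Dou,Arezoo Zakeri,Timothy Cootes,Alejandro F Frangi,Jinming Duan
Affiliations: 1. University of Oxford; 2. Tsinghua University; 3. University College London; 4. King’s College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Comments: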

Abstract:Atlas construction is fundamental to medical image analysis, offering a standardized spatial reference for tasks such as population-level anatomical modeling. While data-driven registration methods have recently shown promise in pairwise settings, their reliance on large training datasets, limited generalizability, and lack of true inference phases in groupwise contexts hinder their practical use. In contrast, model-driven methods offer training-free, theoretically grounded, and data-efficient alternatives, though they often face scalability and optimization challenges when applied to large 3D datasets. In this work, we introduce DARC (Diffeomorphic Atlas Registration via Coordinate descent), a novel model-driven groupwise registration framework for atlas construction. DARC supports a broad range of image dissimilarity metrics and efficiently handles arbitrary numbers of 3D images without incurring GPU memory issues. Through a coordinate descent strategy and a centrality-enforcing activation function, DARC produces unbiased, diffeomorphic atlases with high anatomical fidelity. Beyond atlas construction, we demonstrate two key applications: (1) One-shot segmentation, where labels annotated only on the atlas are propagated to subjects via inverse deformations, outperforming state-of-the-art few-shot methods; and (2) shape synthesis, where new anatomical variants are generated by warping the atlas mesh using synthesized diffeomorphic deformation fields. Overall, DARC offers a flexible, generalizable, and resource-efficient framework for atlas construction and applications.
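
To make the interplay of coordinate descent and a centrality constraint more concrete, here is a deliberately tiny one-dimensional sketch. It assumes each subject is reduced to a single landmark position and each deformation to a single displacement, and it enforces centrality by subtracting the mean displacement. None of this comes from the DARC paper: the mean-subtraction step stands in for the paper's centrality-enforcing activation function, and the names `center` and `build_atlas` are hypothetical.

```python
def center(displacements):
    """Force the displacements to average to zero (the centrality constraint)."""
    mean = sum(displacements) / len(displacements)
    return [d - mean for d in displacements]


def build_atlas(positions, iterations=10):
    """Alternate between fitting displacements and re-estimating the atlas.

    Toy model: atlas + displacement[i] should equal positions[i].
    """
    atlas = positions[0]  # deliberately biased start: copy the first subject
    displacements = [0.0] * len(positions)
    for _ in range(iterations):
        # Block 1: with the atlas fixed, fit each subject's displacement,
        # then re-center so that no subject dominates.
        displacements = center([x - atlas for x in positions])
        # Block 2: with the displacements fixed, re-estimate the atlas.
        atlas = sum(x - d for x, d in zip(positions, displacements)) / len(positions)
    return atlas, displacements


atlas, displacements = build_atlas([2.0, 4.0, 9.0])
```

Although the loop starts from the first subject, the centering step pulls the atlas to the population mean, which is the sense in which the result is unbiased in this toy setting.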

[CV-24] Forgery Guided Learning Strategy with Dual Perception Network for Deepfake Cross-domain Detection

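Quick read: This paper addresses the poor generalization of deepfake detection to unknown forgery techniques. As the gap between emerging and traditional forgery methods widens, existing cross-domain detection methods that rely on common forgery traces become markedly less effective. The key to the solution is a Forgery Guided Learning (FGL) strategy that captures the differential information between known and unknown forgery techniques, allowing the detection network to adjust its learning process dynamically in real time. The authors also design a Dual Perception Network (DPNet) that, in its frequency stream, dynamically perceives and extracts discriminative features across forgery techniques, and that uses graph convolution to model global relationships in the feature space. This strengthens both the perception of forgery traces and the understanding of their correlations, enabling effective detection of unknown forgery techniques.

Link: https://arxiv.org/abs/2508.10741
Authors: Lixin Jia,Zhiqing Guo,Gaobo Yang,Liejun Wang,Keqin Li
Affiliations: Xinjiang University; Xinjiang Multimodal Intelligent Processing and Information Security Engineering Technology Research Center; Hunan University; State University of New York
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: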

Abstract:The emergence of deepfake technology has introduced a range of societal problems, garnering considerable attention. Current deepfake detection methods perform well on specific datasets, but exhibit poor performance when applied to datasets with unknown forgery techniques. Moreover, as the gap between emerging and traditional forgery techniques continues to widen, cross-domain detection methods that rely on common forgery traces are becoming increasingly ineffective. This situation highlights the urgency of developing deepfake detection technology with strong generalization to cope with fast iterative forgery techniques. To address these challenges, we propose a Forgery Guided Learning (FGL) strategy designed to enable detection networks to continuously adapt to unknown forgery techniques. Specifically, the FGL strategy captures the differential information between known and unknown forgery techniques, allowing the model to dynamically adjust its learning process in real time. To further improve the ability to perceive forgery traces, we design a Dual Perception Network (DPNet) that captures both differences and relationships among forgery traces. In the frequency stream, the network dynamically perceives and extracts discriminative features across various forgery techniques, establishing essential detection cues. These features are then integrated with spatial features and projected into the embedding space. In addition, graph convolution is employed to perceive relationships across the entire feature space, facilitating a more comprehensive understanding of forgery trace correlations. Extensive experiments show that our approach generalizes well across different scenarios and effectively handles unknown forgery challenges, providing robust support for deepfake detection. Our code is available on this https URL.

[CV-25] Axis-level Symmetry Detection with Group-Equivariant Representation ICCV2025

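Quick read: This paper addresses the precision of symmetry detection in complex scenes, in particular the inability of existing heatmap-based methods to identify individual symmetry axes precisely. The key to the solution is a framework that represents reflection and rotation symmetries as explicit geometric primitives (lines and points) and uses a dual-branch architecture equivariant to the dihedral group. For reflection symmetry, it introduces orientational anchors aligned with the group structure for orientation-specific detection, along with a reflectional matching mechanism that measures the mirror similarity of patterns on either side of a candidate axis. For rotational symmetry, it proposes a rotational matching mechanism that compares patterns at fixed angular intervals to identify rotation centers. The method significantly improves axis-level detection accuracy and achieves state-of-the-art performance.

Link: https://arxiv.org/abs/2508.10740
Authors: Wongyun Yu,Ahyun Seo,Minsu Cho
Affiliations: Pohang University of Science and Technology (POSTECH), South Korea
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025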

Abstract:Symmetry is a fundamental concept that has been extensively studied, yet detecting it in complex scenes remains a significant challenge in computer vision. Recent heatmap-based approaches can localize potential regions of symmetry axes but often lack precision in identifying individual axes. In this work, we propose a novel framework for axis-level detection of the two most common symmetry types, reflection and rotation, by representing them as explicit geometric primitives, i.e., lines and points. Our method employs a dual-branch architecture that is equivariant to the dihedral group, with each branch specialized to exploit the structure of dihedral group-equivariant features for its respective symmetry type. For reflection symmetry, we introduce orientational anchors, aligned with group components, to enable orientation-specific detection, and a reflectional matching that measures similarity between patterns and their mirrored counterparts across candidate axes. For rotational symmetry, we propose a rotational matching that compares patterns at fixed angular intervals to identify rotational centers. Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches.
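
The phrase "compares patterns at fixed angular intervals" can be illustrated with a small, generic check. The sketch below assumes that the pattern around a candidate center has already been sampled into a ring of equally spaced values, and it scores n-fold rotational symmetry as the normalized correlation between the ring and a copy rotated by 360/n degrees. This is an illustrative stand-in rather than the paper's rotational matching module, and the name `rotational_match` is hypothetical.

```python
def rotational_match(ring, order):
    """Score how well `ring` matches itself after rotating by 360/order degrees.

    `ring` holds samples taken at equal angular steps around a candidate center.
    Returns a value in [-1, 1]; 1.0 means a perfect match.
    """
    n = len(ring)
    if n % order != 0:
        raise ValueError("ring length must be divisible by the symmetry order")
    shift = n // order
    # Squared difference between the ring and its rotated copy.
    diff = sum((ring[i] - ring[(i + shift) % n]) ** 2 for i in range(n))
    energy = sum(v * v for v in ring) or 1.0
    # Algebraically equal to the normalized circular autocorrelation at `shift`.
    return 1.0 - diff / (2.0 * energy)


# Twelve samples with a bump every 90 degrees: a 4-fold symmetric pattern.
ring = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
score_4fold = rotational_match(ring, 4)
score_3fold = rotational_match(ring, 3)
```

Scanning candidate centers and keeping those with a high score for some order is one simple way to turn such a matching function into a detector of rotation centers.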

[CV-26] Privacy-enhancing Sclera Segmentation Benchmarking Competition: SSBC 2025

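Quick read: This paper addresses the training of sclera segmentation models under privacy constraints, that is, how to train high-performing segmentation models from synthetic data without using real-world ocular images. The key to the solution is training on synthetically generated ocular images with dedicated training strategies; experiments show that models trained only on synthetic data can still reach competitive performance, with F1 scores above 0.8. The benefits of mixed training (synthetic plus a small amount of real data) came more from methodological choices than from simply adding real data, highlighting the potential of synthetic data for privacy-sensitive biometric development.

Link: https://arxiv.org/abs/2508.10737
Authors: Matej Vitek,Darian Tomašević,Abhijit Das,Sabari Nathan,Gökhan Özbulak,Gözde Ayşe Tataroğlu Özbulak,Jean-Paul Calbimonte,André Anjos,Hariohm Hemant Bhatt,Dhruv Dhirendra Premani,Jay Chaudhari,Caiyong Wang,Jian Jiang,Chi Zhang,Qi Zhang,Iyyakutti Iyappan Ganapathi,Syed Sadaf Ali,Divya Velayudan,Maregu Assefa,Naoufel Werghi,Zachary A. Daniels,Leeon John,Ritesh Vyas,Jalil Nourmohammadi Khiarak,Taher Akbari Saeed,Mahsa Nasehi,Ali Kianfar,Mobina Pashazadeh Panahi,Geetanjali Sharma,Pushp Raj Panth,Raghavendra Ramachandra,Aditya Nigam,Umapada Pal,Peter Peer,Vitomir Štruc
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE International Joint Conference on Biometrics (IJCB) 2025, 13 pages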

Abstract:This paper presents a summary of the 2025 Sclera Segmentation Benchmarking Competition (SSBC), which focused on the development of privacy-preserving sclera-segmentation models trained using synthetically generated ocular images. The goal of the competition was to evaluate how well models trained on synthetic data perform in comparison to those trained on real-world datasets. The competition featured two tracks: (i) one relying solely on synthetic data for model development, and (ii) one combining/mixing synthetic with (a limited amount of) real-world data. A total of nine research groups submitted diverse segmentation models, employing a variety of architectural designs, including transformer-based solutions, lightweight models, and segmentation networks guided by generative frameworks. Experiments were conducted across three evaluation datasets containing both synthetic and real-world images, collected under diverse conditions. Results show that models trained entirely on synthetic data can achieve competitive performance, particularly when dedicated training strategies are employed, as evidenced by the top performing models that achieved F1 scores of over 0.8 in the synthetic data track. Moreover, performance gains in the mixed track were often driven more by methodological choices than by the inclusion of real data, highlighting the promise of synthetic data for privacy-aware biometric development. The code and data for the competition are available at: this https URL.
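
For readers unfamiliar with the F1 score quoted above, the sketch below computes it for a binary segmentation mask flattened to a list of 0/1 pixel labels. It uses the standard definition (harmonic mean of precision and recall); the competition's exact evaluation protocol is not given in the abstract, so the function name `f1_score` and the pixel-level averaging are assumptions made for illustration.

```python
def f1_score(predicted, truth):
    """Pixel-level F1 for two equal-length sequences of 0/1 labels."""
    pairs = list(zip(predicted, truth))
    true_pos = sum(1 for p, t in pairs if p and t)
    false_pos = sum(1 for p, t in pairs if p and not t)
    false_neg = sum(1 for p, t in pairs if not p and t)
    if true_pos == 0:
        return 0.0  # no overlap between prediction and ground truth
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Flattened toy masks: 1 marks a sclera pixel, 0 marks background.
predicted_mask = [1, 1, 1, 0, 0, 1]
ground_truth = [1, 1, 0, 0, 1, 1]
score = f1_score(predicted_mask, ground_truth)
```

In this toy case there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and the F1 score is 0.75 as well.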

[CV-27] Dissecting Generalized Category Discovery: Multiplex Consensus under Self-Deconstruction ICCV2025

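Quick read: This paper addresses the difficulty current machine learning frameworks have in emulating the human perceptual system's ability to recognize objects from both known and novel categories in category discovery tasks, that is, the limitations of Generalized Category Discovery (GCD). The key to the solution is ConGCD, a consensus-aware paradigm inspired by human cognition: it decomposes objects into visual primitives and builds primitive-oriented representations through high-level semantic reconstruction. A dominant consensus unit captures class-discriminative patterns, a contextual consensus unit mines distribution-invariant features, and a consensus scheduler dynamically optimizes activation pathways, with final predictions produced through multiplex consensus integration.

Link: https://arxiv.org/abs/2508.10731
Authors: Luyao Tang,Kunze Huang,Chaoqi Chen,Yuxuan Yuan,Chenxin Li,Xiaotong Tu,Xinghao Ding,Yue Huang
Affiliations: Xiamen University; Shenzhen University; The Chinese University of Hong Kong; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by ICCV 2025 as Highlight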

Abstract:Human perceptual systems excel at inducing and recognizing objects across both known and novel categories, a capability far beyond current machine learning frameworks. While generalized category discovery (GCD) aims to bridge this gap, existing methods predominantly focus on optimizing objective functions. We present an orthogonal solution, inspired by the human cognitive process for novel object understanding: decomposing objects into visual primitives and establishing cross-knowledge comparisons. We propose ConGCD, which establishes primitive-oriented representations through high-level semantic reconstruction, binding intra-class shared attributes via deconstruction. Mirroring human preference diversity in visual processing, where distinct individuals leverage dominant or contextual cues, we implement dominant and contextual consensus units to capture class-discriminative patterns and inherent distributional invariants, respectively. A consensus scheduler dynamically optimizes activation pathways, with final predictions emerging through multiplex consensus integration. Extensive evaluations across coarse- and fine-grained benchmarks demonstrate ConGCD’s effectiveness as a consensus-aware paradigm. Code is available at this http URL.

[CV-28] EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering

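Quick read: This paper addresses the limited cross-domain generalization of current Multimodal Large Language Models (MLLMs) on Egocentric Video Question Answering (EgocentricQA). Existing work focuses mainly on daily-activity scenes such as cooking and cleaning, whereas real applications often face domain shifts with markedly different visual styles and semantic content, causing sharp performance drops. The key to the solution is EgoCross, a comprehensive benchmark covering four high-impact and distinct domains (surgery, industry, extreme sports, and animal perspective), with roughly 1,000 QA pairs in both OpenQA and CloseQA formats and four core tasks (prediction, recognition, localization, and counting), enabling systematic evaluation of cross-domain adaptation. Experiments show that mainstream MLLMs are severely limited outside daily-life domains, exposing a generalization bottleneck, while pilot studies on fine-tuning and reinforcement learning suggest directions for improving robustness.

Link: https://arxiv.org/abs/2508.10729
Authors: Yanjun Li,Yuqian Fu,Tianwen Qian,Qi’ao Xu,Silong Dai,Danda Pani Paudel,Luc Van Gool,Xiaoling Wang
Affiliations: East China Normal University; INSAIT, Institute for Computer Science, Artificial Intelligence and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: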

Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have significantly pushed the frontier of egocentric video question answering (EgocentricQA). However, existing benchmarks and studies are mainly limited to common daily activities such as cooking and cleaning. In contrast, real-world deployment inevitably encounters domain shifts, where target domains differ substantially in both visual style and semantic content. To bridge this gap, we introduce EgoCross, a comprehensive benchmark designed to evaluate the cross-domain generalization of MLLMs in EgocentricQA. EgoCross covers four diverse and challenging domains, including surgery, industry, extreme sports, and animal perspective, representing realistic and high-impact application scenarios. It comprises approximately 1,000 QA pairs across 798 video clips, spanning four key QA tasks: prediction, recognition, localization, and counting. Each QA pair provides both OpenQA and CloseQA formats to support fine-grained evaluation. Extensive experiments show that most existing MLLMs, whether general-purpose or egocentric-specialized, struggle to generalize to domains beyond daily life, highlighting the limitations of current models. Furthermore, we conduct several pilot studies, e.g., fine-tuning and reinforcement learning, to explore potential improvements. We hope EgoCross and our accompanying analysis will serve as a foundation for advancing domain-adaptive, robust egocentric video understanding. Data and codes will be released at: this https URL.

[CV-29] Exploiting Discriminative Codebook Prior for Autoregressive Image Generation

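Quick read: This paper addresses the fact that, in current discrete token-based autoregressive image generation systems, the rich token similarity information contained in the codebook is not effectively exploited. Existing methods extract a codebook prior with naive k-means clustering, which performs poorly in the codebook feature space because of inherent issues such as token space disparity and inaccurate centroid distances. The key to the solution is the Discriminative Codebook Prior Extractor (DCPE), which replaces the conventional centroid-based distance with a more reasonable instance-based distance and uses an agglomerative merging strategy that avoids splitting high-density regions while aggregating low-density ones, thereby mining and using the codebook's token similarity information more effectively. Experiments show that DCPE integrates seamlessly into existing frameworks, accelerating training while improving generation quality.

Link: https://arxiv.org/abs/2508.10719
Authors: Longxiang Tang,Ruihang Chu,Xiang Wang,Yujin Han,Pingyu Wu,Chunming He,Yingya Zhang,Shiwei Zhang,Jiaya Jia
Affiliations: The Hong Kong University of Science and Technology; Tongyi Lab, Alibaba Group; Duke University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to TPAMI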

Abstract:Advanced discrete token-based autoregressive image generation systems first tokenize images into sequences of token indices with a codebook, and then model these sequences in an autoregressive paradigm. While autoregressive generative models are trained only on index values, the prior encoded in the codebook, which contains rich token similarity information, is not exploited. Recent studies have attempted to incorporate this prior by performing naive k-means clustering on the tokens, helping to facilitate the training of generative models with a reduced codebook. However, we reveal that k-means clustering performs poorly in the codebook feature space due to inherent issues, including token space disparity and centroid distance inaccuracy. In this work, we propose the Discriminative Codebook Prior Extractor (DCPE) as an alternative to k-means clustering for more effectively mining and utilizing the token similarity information embedded in the codebook. DCPE replaces the commonly used centroid-based distance, which is found to be unsuitable and inaccurate for the token feature space, with a more reasonable instance-based distance. Using an agglomerative merging technique, it further addresses the token space disparity issue by avoiding splitting high-density regions and aggregating low-density ones. Extensive experiments demonstrate that DCPE is plug-and-play and integrates seamlessly with existing codebook prior-based paradigms. With the discriminative prior extracted, DCPE accelerates the training of autoregressive models by 42% on LlamaGen-B and improves final FID and IS performance.
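
The contrast between centroid-based and instance-based distances can be sketched with plain agglomerative clustering. The example below uses single linkage, where the distance between two clusters is the distance between their closest members, as one simple instance-based choice. The abstract does not specify DCPE's actual distance or merging rule, so this is an assumption-laden illustration, and `single_linkage` is a hypothetical name.

```python
def single_linkage(points, num_clusters):
    """Merge clusters bottom-up until `num_clusters` remain.

    Cluster-to-cluster distance is the smallest distance between any two
    member points (an instance-based distance, not a centroid distance).
    Returns a list of clusters, each a list of point indices.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b])) ** 0.5

    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None  # (distance, index of first cluster, index of second cluster)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# A dense chain of four code vectors plus two isolated ones (1-D for clarity).
codebook = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (20.0,)]
groups = single_linkage(codebook, num_clusters=3)
```

In this toy codebook the dense chain of four entries ends up in a single cluster, illustrating how an instance-based rule avoids splitting a high-density region. The quadratic pair search is fine for a demonstration but would need a more efficient implementation for a real codebook.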

[CV-30] Revisiting Cross-View Localization from Image Matching

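Quick read: This paper addresses fine-grained image matching in cross-view localization, that is, establishing precise, geometrically consistent pixel-level correspondences between ground views and aerial or satellite views in order to improve localization accuracy and interpretability. Existing methods depend on strict spatial correspondences but struggle to achieve fine matching, producing coarse or geometrically inconsistent results. The key to the solution is twofold: (1) a Surface Model that accurately models visible regions to support bird's-eye view (BEV) projection; and (2) a SimRefiner module that refines the similarity matrix through local-global residual correction, removing the reliance on post-processing steps such as RANSAC and substantially improving matching quality and localization accuracy.

Link: https://arxiv.org/abs/2508.10716
Authors: Panwang Xia,Qiong Wu,Lei Yu,Yi Liu,Mingtao Xiong,Lei Liang,Yongjun Zhang,Yi Wan
Affiliations: Wuhan University; AntGroup
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: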

Abstract:Cross-view localization aims to estimate the 3 degrees of freedom pose of a ground-view image by registering it to aerial or satellite imagery. It is essential in GNSS-denied environments such as urban canyons and disaster zones. Existing methods either regress poses directly or align features in a shared bird’s-eye view (BEV) space, both built upon accurate spatial correspondences between perspectives. However, these methods fail to establish strict cross-view correspondences, yielding only coarse or geometrically inconsistent matches. Consequently, fine-grained image matching between ground and aerial views remains an unsolved problem, which in turn constrains the interpretability of localization results. In this paper, we revisit cross-view localization from the perspective of cross-view image matching and propose a novel framework that improves both matching and localization. Specifically, we introduce a Surface Model to model visible regions for accurate BEV projection, and a SimRefiner module to refine the similarity matrix through local-global residual correction, eliminating the reliance on post-processing like RANSAC. To further support research in this area, we introduce CVFM, the first benchmark with 32,509 cross-view image pairs annotated with pixel-level correspondences. Extensive experiments demonstrate that our approach substantially improves both localization accuracy and image matching quality, setting new baselines under extreme viewpoint disparity.
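
To show what a similarity matrix between two views is typically used for, the sketch below extracts correspondences with a mutual nearest-neighbor test: a pair is kept only when each feature is the other's best match. This is a common generic matching rule and is not the paper's SimRefiner; the function name `mutual_matches` and the toy matrix are hypothetical.

```python
def mutual_matches(similarity):
    """Return (row, col) pairs that are each other's highest-similarity match.

    similarity[i][j] is the similarity between ground-view feature i and
    aerial-view feature j.
    """
    rows, cols = len(similarity), len(similarity[0])
    # Best aerial feature for every ground feature.
    best_col = [max(range(cols), key=lambda j: similarity[i][j]) for i in range(rows)]
    # Best ground feature for every aerial feature.
    best_row = [max(range(rows), key=lambda i: similarity[i][j]) for j in range(cols)]
    return [(i, j) for i, j in enumerate(best_col) if best_row[j] == i]


# Toy 3x3 similarity matrix: ground features 0 and 1 both prefer aerial feature 0.
similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
matches = mutual_matches(similarity)
```

Ground feature 1 is dropped here because aerial feature 0 prefers ground feature 0. The cleaner the similarity matrix, the fewer such ambiguous rows remain, which is why refining the matrix itself can reduce the need for outlier rejection afterwards.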

[CV-31] Lightweight CNNs for Embedded SAR Ship Target Detection and Classification

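Quick read: This paper addresses the difficulty of near-real-time monitoring when Synthetic Aperture Radar (SAR) data is used for large-scale maritime vessel surveillance. The core bottleneck is the bandwidth limitation and latency caused by downlinking all raw data and then focusing and analyzing the imagery on the ground. The key to the solution is designing and evaluating neural network models for unfocused SAR data acquired in Stripmap and Interferometric Wide (IW) modes, enabling real-time on-board inference and significantly reducing the data volume that must be downlinked; deployment on an FPGA confirms the feasibility of the approach. A binary classification task between ships and windmills further demonstrates the potential for target classification.

Link: https://arxiv.org/abs/2508.10712
Authors: Fabian Kresse,Georgios Pilikos,Mario Azcueta,Nicolas Floury
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at Big Data from Space 2025 (BiDS’25)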

Abstract:Synthetic Aperture Radar (SAR) data enables large-scale surveillance of maritime vessels. However, near-real-time monitoring is currently constrained by the need to downlink all raw data, perform image focusing, and subsequently analyze it on the ground. On-board processing to generate higher-level products could reduce the data volume that needs to be downlinked, alleviating bandwidth constraints and minimizing latency. However, traditional image focusing and processing algorithms face challenges due to the satellite’s limited memory, processing power, and computational resources. This work proposes and evaluates neural networks designed for real-time inference on unfocused SAR data acquired in Stripmap and Interferometric Wide (IW) modes captured with Sentinel-1. Our results demonstrate the feasibility of using one of our models for on-board processing and deployment on an FPGA. Additionally, by investigating a binary classification task between ships and windmills, we demonstrate that target classification is possible.

[CV-32] NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

Quick read: This paper addresses two problems in current autoregressive (AR) text-to-image generation models: reliance on computationally intensive diffusion models to process continuous image tokens, and the quantization loss introduced when vector quantization (VQ) is used to obtain discrete tokens. The key to the solution is NextStep-1, an architecture that pairs a 14B-parameter autoregressive model with a 157M-parameter flow matching head and is trained jointly on discrete text tokens and continuous image tokens with a next-token prediction objective. It achieves high-fidelity image synthesis and effective image editing, improving performance and practicality while retaining the advantages of the autoregressive paradigm.

Link: https://arxiv.org/abs/2508.10711
Authors: NextStep Team:Chunrui Han,Guopeng Li,Jingwei Wu,Quan Sun,Yan Cai,Yuang Peng,Zheng Ge,Deyu Zhou,Haomiao Tang,Hongyu Zhou,Kenkun Liu,Ailin Huang,Bin Wang,Changxin Miao,Deshan Sun,En Yu,Fukun Yin,Gang Yu,Hao Nie,Haoran Lv,Hanpeng Hu,Jia Wang,Jian Zhou,Jianjian Sun,Kaijun Tan,Kang An,Kangheng Lin,Liang Zhao,Mei Chen,Peng Xing,Rui Wang,Shiyu Liu,Shutao Xia,Tianhao You,Wei Ji,Xianfang Zeng,Xin Han,Xuelin Zhang,Yana Wei,Yanming Xu,Yimin Jiang,Yingming Wang,Yu Zhou,Yucheng Han,Ziyang Meng,Binxing Jiao,Daxin Jiang,Xiangyu Zhang,Yibo Zhu
Affiliations: NextStep-Team; StepFun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL

Abstract:Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, training on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.
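
For context on what a flow matching head is trained to do, the sketch below shows the regression target in a widely used formulation: a sample is interpolated linearly between noise and data, and the model is asked to predict the constant velocity that carries noise to data. The abstract only states that a flow matching head is used, so the linear path, the mean-squared-error loss, and the names `interpolate`, `target_velocity`, and `flow_matching_loss` are assumptions made for illustration.

```python
def interpolate(noise, data, t):
    """Point on the straight path from noise (t = 0) to data (t = 1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Velocity of the straight path; it does not depend on t."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predict_velocity, noise, data, t):
    """Mean squared error between predicted and target velocity at time t.

    `predict_velocity(x_t, t)` is any callable standing in for the model head.
    """
    x_t = interpolate(noise, data, t)
    predicted = predict_velocity(x_t, t)
    target = target_velocity(noise, data)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


# Toy two-dimensional "image token" and a noise sample.
noise = [0.0, 2.0]
token = [1.0, 4.0]
loss_zero_model = flow_matching_loss(lambda x, t: [0.0, 0.0], noise, token, t=0.5)
loss_perfect_model = flow_matching_loss(lambda x, t: [1.0, 2.0], noise, token, t=0.5)
```

At sampling time, a trained velocity predictor is integrated from noise toward data in a few steps, which is how a continuous token can be produced without quantizing it to a codebook index.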

[CV-33] CountCluster: Training-Free Object Quantity Guidance with Cross-Attention Map Clustering for Text-to-Image Generation

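Quick read: This paper addresses the difficulty diffusion models have in generating images that accurately reflect the number of objects specified in the input prompt. Existing methods rely on external counting modules or on quantity representations extracted from learned tokens or latent features, but they remain inaccurate and overlook a key structural property: the number of object instances is largely determined in the early steps of the denoising process. The key to the solution is CountCluster, which at inference time partitions the object cross-attention map into k clusters based on attention scores, defines an ideal distribution in which the clusters are spatially separated, and optimizes the latent representation toward this target distribution, achieving more precise control over object count without external tools or additional training.

Link: https://arxiv.org/abs/2508.10710
Authors: Joohyeon Lee,Jin-Seop Lee,Jee-Hyong Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review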

Abstract:Diffusion-based text-to-image generation models have demonstrated strong performance in terms of image quality and diversity. However, they still struggle to generate images that accurately reflect the number of objects specified in the input prompt. Several approaches have been proposed that rely on either external counting modules for iterative refinement or quantity representations derived from learned tokens or latent features. However, they still have limitations in accurately reflecting the specified number of objects and overlook an important structural characteristic: the number of object instances in the generated image is largely determined in the early timesteps of the denoising process. To correctly reflect the object quantity for image generation, the highly activated regions in the object cross-attention map at the early timesteps should match the input object quantity, while each region should be clearly separated. To address this issue, we propose CountCluster, a method that guides the object cross-attention map to be clustered according to the specified object count in the input, without relying on any external tools or additional training. The proposed method partitions the object cross-attention map into k clusters at inference time based on attention scores, defines an ideal distribution in which each cluster is spatially well-separated, and optimizes the latent to align with this target distribution. Our method achieves an average improvement of 18.5%p in object count accuracy compared to existing methods, and demonstrates superior quantity control performance across a variety of prompts. Code will be released at: this https URL.
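
The clustering step can be pictured with a toy attention map. The sketch below thresholds the map to find highly activated pixels and then groups their coordinates into k clusters with a small k-means loop. The thresholding, the deterministic initialization, and the names `active_pixels` and `kmeans` are hypothetical simplifications; the abstract does not describe how CountCluster forms its clusters beyond using attention scores, and the latent optimization step is not shown.

```python
def active_pixels(attention, threshold):
    """Coordinates (row, col) whose attention score reaches the threshold."""
    return [
        (r, c)
        for r, row in enumerate(attention)
        for c, value in enumerate(row)
        if value >= threshold
    ]


def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points with a deterministic, evenly spaced init."""
    centers = [points[i * len(points) // k] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
            )
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups


# Toy 4x6 cross-attention map for an object token, prompt asks for k = 2 objects.
attention = [
    [0.9, 0.8, 0.0, 0.0, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.8, 0.9],
    [0.0, 0.0, 0.0, 0.0, 0.9, 0.7],
]
centers, groups = kmeans(active_pixels(attention, threshold=0.5), k=2)
```

Here the two activated regions are recovered as two well-separated clusters. A guidance method in the spirit of the abstract would then push the attention map toward such a separated layout during the early denoising steps.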

[CV-34] Beyond conventional vision: RGB-event fusion for robust object detection in dynamic traffic scenarios

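[Quick read]: This paper addresses the limited dynamic range of conventional RGB cameras, which in complex traffic environments (e.g., nighttime driving, tunnels) reduces global contrast and loses high-frequency details such as textures and edges, hindering discriminative feature extraction and degrading frame-based object detection. The key to the solution is to fuse a bio-inspired event camera with an RGB camera in a motion cue fusion network (MCFNet) with three core components: 1) an event correction module (ECM) that temporally aligns asynchronous event streams with image frames through optical-flow-based warping and is jointly optimized with the detector to learn task-aware event representations; 2) an event dynamic upsampling module (EDUM) that raises the spatial resolution of event frames to match image structures, ensuring precise spatiotemporal alignment; 3) a cross-modal Mamba fusion module (CMM) that uses adaptive feature fusion with a novel interlaced scanning strategy to integrate complementary information for more robust detection.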

Link: https://arxiv.org/abs/2508.10704
Authors: Zhanwen Liu, Yujing Sun, Yang Wang, Nan Yang, Shengbo Eben Li, Xiangmo Zhao
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The dynamic range limitation of conventional RGB cameras reduces global contrast and causes loss of high-frequency details such as textures and edges in complex traffic environments (e.g., nighttime driving, tunnels), hindering discriminative feature extraction and degrading frame-based object detection. To address this, we integrate a bio-inspired event camera with an RGB camera to provide high dynamic range information and propose a motion cue fusion network (MCFNet), which achieves optimal spatiotemporal alignment and adaptive cross-modal feature fusion under challenging lighting. Specifically, an event correction module (ECM) temporally aligns asynchronous event streams with image frames via optical-flow-based warping, jointly optimized with the detection network to learn task-aware event representations. The event dynamic upsampling module (EDUM) enhances spatial resolution of event frames to match image structures, ensuring precise spatiotemporal alignment. The cross-modal mamba fusion module (CMM) uses adaptive feature fusion with a novel interlaced scanning mechanism, effectively integrating complementary information for robust detection. Experiments conducted on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that MCFNet significantly outperforms existing methods in various poor lighting and fast moving traffic scenarios. Notably, on the DSEC-Det dataset, MCFNet achieves a remarkable improvement, surpassing the best existing methods by 7.4% in mAP50 and 1.7% in mAP metrics, respectively. The code is available at this https URL.

[CV-35] Novel View Synthesis using DDIM Inversion

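[Quick read]: This paper addresses novel view synthesis from a single input image, which requires inferring details in occluded regions and keeping geometry consistent across viewpoints without knowing the 3D structure. Existing methods typically fine-tune large diffusion models or train on multiple views, which is expensive and prone to blurry reconstructions and poor generalization. The key to the solution is a lightweight, explicit view-translation framework: the DDIM-inverted latent of the input image is obtained first, then a camera-pose-conditioned U-Net (TUNet) predicts the inverted latent of the target view. A novel fusion strategy exploits the noise correlation structure inherent in DDIM inversion to preserve texture and fine detail, and the fused latent serves as the initial condition for DDIM sampling, so that the generative prior of the pretrained diffusion model is used for high-quality novel view synthesis.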

Link: https://arxiv.org/abs/2508.10688
Authors: Sehajdeep Singh, A V Subramanyam
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Synthesizing novel views from a single input image is a challenging task. It requires extrapolating the 3D structure of a scene while inferring details in occluded regions, and maintaining geometric consistency across viewpoints. Many existing methods must fine-tune large diffusion backbones using multiple views or train a diffusion model from scratch, which is extremely expensive. Additionally, they suffer from blurry reconstruction and poor generalization. This gap presents the opportunity to explore an explicit lightweight view translation framework that can directly utilize the high-fidelity generative capabilities of a pretrained diffusion model while reconstructing a scene from a novel view. Given the DDIM-inverted latent of a single input image, we employ a camera pose-conditioned translation U-Net, TUNet, to predict the inverted latent corresponding to the desired target view. However, the image sampled using the predicted latent may result in a blurry reconstruction. To this end, we propose a novel fusion strategy that exploits the inherent noise correlation structure observed in DDIM inversion. The proposed fusion strategy helps preserve the texture and fine-grained details. To synthesize the novel view, we use the fused latent as the initial condition for DDIM sampling, leveraging the generative prior of the pretrained diffusion model. Extensive experiments on MVImgNet demonstrate that our method outperforms existing methods.
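
For readers less familiar with the term, DDIM inversion runs the deterministic DDIM update in the direction of increasing noise. The equations below use the standard DDIM notation, which is an assumption here since the abstract gives no formulas: $\bar\alpha_t$ is the cumulative noise schedule and $\epsilon_\theta$ is the pretrained noise predictor.

```latex
% Predicted clean latent at step t
\hat{x}_0(x_t, t) = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}

% Deterministic DDIM sampling step (t -> t-1)
x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0(x_t, t) + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t, t)

% DDIM inversion step (t -> t+1), using \epsilon_\theta(x_{t+1}, t+1) \approx \epsilon_\theta(x_t, t)
x_{t+1} = \sqrt{\bar\alpha_{t+1}}\,\hat{x}_0(x_t, t) + \sqrt{1-\bar\alpha_{t+1}}\,\epsilon_\theta(x_t, t)
```

In the pipeline the abstract describes, the latent obtained by inversion is what TUNet translates to the target view. How the fusion step combines latents is not specified in the abstract and is not sketched here.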

[CV-36] IADGPT: Unified LVLM for Few-Shot Industrial Anomaly Detection, Localization, and Reasoning via In-Context Learning

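[Quick read]: This paper addresses the weak performance of existing large vision-language model (LVLM) approaches to few-shot industrial anomaly detection (FS-IAD), which lack industrial domain knowledge and reasoning ability and therefore fall short of professional quality inspectors. The key to the solution is a unified framework, IADGPT, built on a three-stage progressive training strategy: the first two stages gradually guide the model to acquire basic industrial knowledge and discrepancy awareness, and the third stage introduces an in-context-learning-based training paradigm so that the model can use a few images as exemplars and generalize to novel products. A scoring strategy derives image-level anomaly scores from the logits output and pixel-level scores from the attention map, and pairs them with the language output for anomaly reasoning, giving human-like industrial anomaly detection and explanation.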

Link: https://arxiv.org/abs/2508.10681
Authors: Mengyang Zhao, Teng Fu, Haiyang Yu, Ke Niu, Bin Li
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Few-Shot Industrial Anomaly Detection (FS-IAD) has important applications in automating industrial quality inspection. Recently, some FS-IAD methods based on Large Vision-Language Models (LVLMs) have been proposed with some achievements through prompt learning or fine-tuning. However, existing LVLMs focus on general tasks but lack basic industrial knowledge and reasoning capabilities related to FS-IAD, making these methods far from specialized human quality inspectors. To address these challenges, we propose a unified framework, IADGPT, designed to perform FS-IAD in a human-like manner, while also handling associated localization and reasoning tasks, even for diverse and novel industrial products. To this end, we introduce a three-stage progressive training strategy inspired by humans. Specifically, the first two stages gradually guide IADGPT in acquiring fundamental industrial knowledge and discrepancy awareness. In the third stage, we design an in-context learning-based training paradigm, enabling IADGPT to leverage a few-shot image as the exemplars for improved generalization to novel products. In addition, we design a strategy that enables IADGPT to output image-level and pixel-level anomaly scores using the logits output and the attention map, respectively, in conjunction with the language output to accomplish anomaly reasoning. To support our training, we present a new dataset comprising 100K images across 400 diverse industrial product categories with extensive attribute-level textual annotations. Experiments indicate IADGPT achieves considerable performance gains in anomaly detection and demonstrates competitiveness in anomaly localization and reasoning. We will release our dataset in camera-ready.

[CV-37] Physics-Informed Joint Multi-TE Super-Resolution with Implicit Neural Representation for Robust Fetal T2 Mapping

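[Quick read]: This paper addresses the challenges of quantitative T2 mapping in fetal brain MRI, in particular the image artifacts caused by fetal motion and the long scan times at mid-field strength (0.55T). Current methods repeatedly acquire multiple stacks of thick slices at each echo time (TE), which is time-consuming and highly sensitive to motion. The key to the solution is a joint reconstruction strategy that combines implicit neural representations with a physics-informed regularization that explicitly models T2 decay, sharing information across TEs while preserving anatomical structure and quantitative T2 fidelity. The method improves reconstruction under severe motion, delivers the first in vivo fetal T2 mapping results at 0.55T, and shows potential for reducing the number of stacks needed per TE by exploiting anatomical redundancy.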

Link: https://arxiv.org/abs/2508.10680
Authors: Busra Bulut, Maik Dannecker, Thomas Sanchez, Sara Neves Silva, Vladyslav Zalevskyi, Steven Jia, Jean-Baptiste Ledoux, Guillaume Auzias, François Rousseau, Jana Hutter, Daniel Rueckert, Meritxell Bach Cuadra
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:T2 mapping in fetal brain MRI has the potential to improve characterization of the developing brain, especially at mid-field (0.55T), where T2 decay is slower. However, this is challenging as fetal MRI acquisition relies on multiple motion-corrupted stacks of thick slices, requiring slice-to-volume reconstruction (SVR) to estimate a high-resolution (HR) 3D volume. Currently, T2 mapping involves repeated acquisitions of these stacks at each echo time (TE), leading to long scan times and high sensitivity to motion. We tackle this challenge with a method that jointly reconstructs data across TEs, addressing severe motion. Our approach combines implicit neural representations with a physics-informed regularization that models T2 decay, enabling information sharing across TEs while preserving anatomical and quantitative T2 fidelity. We demonstrate state-of-the-art performance on simulated fetal brain and in vivo adult datasets with fetal-like motion. We also present the first in vivo fetal T2 mapping results at 0.55T. Our study shows potential for reducing the number of stacks per TE in T2 mapping by leveraging anatomical redundancy.
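
The physics that the regularization refers to is T2 relaxation. A common model, assumed here because the abstract does not state the exact form used, is mono-exponential decay S(TE) = S0 · exp(−TE / T2). The sketch below fits that model for a single voxel by linear regression on the log-signal. The function name `fit_t2`, the echo times, and the signal values are hypothetical, and this is a plain voxel-wise fit, not the paper's joint implicit-neural-representation reconstruction.

```python
import math


def fit_t2(tes_ms, signals):
    """Fit S(TE) = S0 * exp(-TE / T2) by least squares on log(S).

    Taking logs gives log S = log S0 - TE / T2, a straight line in TE
    whose slope is -1 / T2.
    """
    ys = [math.log(s) for s in signals]
    n = len(tes_ms)
    mean_x = sum(tes_ms) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in tes_ms)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(tes_ms, ys))
    slope = sxy / sxx
    t2 = -1.0 / slope
    s0 = math.exp(mean_y - slope * mean_x)
    return s0, t2


# Hypothetical noise-free voxel: S0 = 1000, T2 = 200 ms, three echo times.
tes_ms = [80.0, 160.0, 240.0]
signals = [1000.0 * math.exp(-te / 200.0) for te in tes_ms]
s0_hat, t2_hat = fit_t2(tes_ms, signals)
print(round(s0_hat, 3), round(t2_hat, 3))
```

With noisy, motion-corrupted data a per-voxel fit like this becomes unstable, which is one motivation for sharing information across echo times during reconstruction.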

[CV-38] HyperTea: A Hypergraph-based Temporal Enhancement and Alignment Network for Moving Infrared Small Target Detection

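[Quick read]: This paper addresses moving infrared small target detection (MIRSTD), which is difficult because targets are small, weak in intensity, and follow complex motion patterns. Existing methods usually model only low-order correlations between feature nodes and extract and enhance features within a single temporal scale, so they fail to capture multi-scale spatiotemporal dynamics. The key to the solution is the HyperTea framework, which according to the authors is the first to combine convolutional neural networks (CNNs), recurrent neural networks (RNNs), and hypergraph neural networks (HGNNs) for MIRSTD. A global temporal enhancement module (GTEM) strengthens global temporal context through semantic aggregation and propagation, a local temporal enhancement module (LTEM) captures local motion patterns between adjacent frames and enhances local temporal context, and a temporal alignment module (TAM) mitigates cross-scale feature misalignment, so that high-order spatiotemporal correlations are modeled effectively and detection performance improves.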

Link: https://arxiv.org/abs/2508.10678
Authors: Zhaoyuan Qi, Weihua Gao, Wenlong Niu, Jie Tang, Yun Li, Xiaodong Peng
Affiliation: Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences; School of Computer Science and Technology, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In practical application scenarios, moving infrared small target detection (MIRSTD) remains highly challenging due to the target’s small size, weak intensity, and complex motion pattern. Existing methods typically only model low-order correlations between feature nodes and perform feature extraction and enhancement within a single temporal scale. Although hypergraphs have been widely used for high-order correlation learning, they have received limited attention in MIRSTD. To explore the potential of hypergraphs and enhance multi-timescale feature representation, we propose HyperTea, which integrates global and local temporal perspectives to effectively model high-order spatiotemporal correlations of features. HyperTea consists of three modules: the global temporal enhancement module (GTEM) realizes global temporal context enhancement through semantic aggregation and propagation; the local temporal enhancement module (LTEM) is designed to capture local motion patterns between adjacent frames and then enhance local temporal context; and a temporal alignment module (TAM) addresses potential cross-scale feature misalignment. To the best of our knowledge, HyperTea is the first work to integrate convolutional neural networks (CNNs), recurrent neural networks (RNNs), and hypergraph neural networks (HGNNs) for MIRSTD, significantly improving detection performance. Experiments on DAUB and IRDST demonstrate its state-of-the-art (SOTA) performance. Our source code is available at this https URL.

[CV-39] Hybrid Generative Fusion for Efficient and Privacy-Preserving Face Recognition Dataset Generation ICCV2025

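[Quick read]: This paper addresses the construction of a high-quality face dataset for training face recognition models while ensuring that it contains no identities overlapping with existing public face datasets. The key to the solution is a hybrid strategy. The baseline HSFace dataset is first cleaned thoroughly: a Mixture-of-Experts (MoE) approach that combines face-embedding clustering with GPT-4o-assisted verification identifies and removes mislabeled or inconsistent identities, after which the largest consistent identity cluster is kept and data augmentation brings each identity to a fixed 50 images. To increase diversity, synthetic identities are generated with Stable Diffusion and prompt engineering, and each is expanded efficiently with Vec2Face into 49 identity-consistent variants, fusing the strengths of GAN-based and diffusion-based samples. To counter the high visual similarity among synthetic identities, a curriculum learning strategy places the synthetic samples early in training so that the model progresses from easier to harder samples. The final dataset was checked for identity leakage and improves model performance at the 10K, 20K, and 100K identity scales, taking first place in the DataCV ICCV Challenge.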

Link: https://arxiv.org/abs/2508.10672
Authors: Feiran Li, Qianqian Xu, Shilong Bao, Boyu Han, Zhiyong Yang, Qingming Huang
Affiliation: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences; School of Computer Science and Technology, University of Chinese Academy of Sciences; BDKM, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted to ICCV 2025 DataCV Workshop

Abstract:In this paper, we present our approach to the DataCV ICCV Challenge, which centers on building a high-quality face dataset to train a face recognition model. The constructed dataset must not contain identities overlapping with any existing public face datasets. To handle this challenge, we begin with a thorough cleaning of the baseline HSFace dataset, identifying and removing mislabeled or inconsistent identities through a Mixture-of-Experts (MoE) strategy combining face embedding clustering and GPT-4o-assisted verification. We retain the largest consistent identity cluster and apply data augmentation up to a fixed number of images per identity. To further diversify the dataset, we generate synthetic identities using Stable Diffusion with prompt engineering. As diffusion models are computationally intensive, we generate only one reference image per identity and efficiently expand it using Vec2Face, which rapidly produces 49 identity-consistent variants. This hybrid approach fuses GAN-based and diffusion-based samples, enabling efficient construction of a diverse and high-quality dataset. To address the high visual similarity among synthetic identities, we adopt a curriculum learning strategy by placing them early in the training schedule, allowing the model to progress from easier to harder samples. Our final dataset contains 50 images per identity, and all newly generated identities are checked with mainstream face datasets to ensure no identity leakage. Our method achieves 1st place in the competition, and experimental results show that our dataset improves model performance across 10K, 20K, and 100K identity scales. Code is available at this https URL.

[CV-40] AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models

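[Quick read]: This paper addresses the poor performance of large vision-language models (LVLMs) on fine-grained street-level localization within a city, and in particular how to support flexible address-related question answering from street-view images. The key to the solution is to introduce perspective-invariant satellite images as macro visual cues and to propose cross-view alignment tuning, which comprises a grafting mechanism that joins satellite-view and street-view images and an automatic label generation mechanism, strengthening the LVLM's global understanding of street distribution. With a two-stage training protocol (cross-view alignment tuning followed by address localization tuning), the model improves average address localization accuracy over counterpart LVLMs by more than 9% and 12% on the Pittsburgh and San Francisco street-view VQA datasets, respectively.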

Link: https://arxiv.org/abs/2508.10667
Authors: Shixiong Xu, Chenghao Zhang, Lubin Fan, Yuan Zhou, Bin Fan, Shiming Xiang, Gaofeng Meng, Jieping Ye
Affiliation: Institute of Automation, Chinese Academy of Sciences; Alibaba Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large visual language models (LVLMs) have demonstrated impressive performance in coarse-grained geo-localization at the country or city level, but they struggle with fine-grained street-level localization within urban areas. In this paper, we explore integrating city-wide address localization capabilities into LVLMs, facilitating flexible address-related question answering using street-view images. A key challenge is that the street-view visual question-and-answer (VQA) data provides only microscopic visual cues, leading to subpar performance in fine-tuned models. To tackle this issue, we incorporate perspective-invariant satellite images as macro cues and propose cross-view alignment tuning including a satellite-view and street-view image grafting mechanism, along with an automatic label generation mechanism. Then LVLM’s global understanding of street distribution is enhanced through cross-view matching. Our proposed model, named AddressVLM, consists of two-stage training protocols: cross-view alignment tuning and address localization tuning. Furthermore, we have constructed two street-view VQA datasets based on image address localization datasets from Pittsburgh and San Francisco. Qualitative and quantitative evaluations demonstrate that AddressVLM outperforms counterpart LVLMs by over 9% and 12% in average address localization accuracy on these two datasets, respectively.

[CV-41] Serial Over Parallel: Learning Continual Unification for Multi-Modal Visual Object Tracking and Benchmarking

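[Quick read]: This paper addresses the train-test inconsistency in multi-modal visual object tracking (MMVOT) that arises because all data modalities are mixed during training but evaluated on separate benchmarks, which in turn degrades performance. The core of the solution is a unified benchmark, UniBench300, that brings data from multiple tasks (RGB paired with thermal infrared, depth, or event data) into a single evaluation framework and so removes the inconsistency of evaluating on separate benchmarks, together with a serial strategy that integrates tasks progressively for a more stable unification process. Under this formulation, performance degradation can be characterized as forgetting of knowledge from previous tasks, which fits naturally with the continual learning (CL) paradigm, and the experiments show that injecting CL into the unification process supports stable unification.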

Link: https://arxiv.org/abs/2508.10655
Authors: Zhangyong Tang, Tianyang Xu, Xuefeng Zhu, Chunyang Cheng, Tao Zhou, Xiaojun Wu, Josef Kittler
Affiliation: Jiangnan University; Nanjing University of Science and Technology; University of Surrey
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ACMMM 2025

Abstract:Unifying multiple multi-modal visual object tracking (MMVOT) tasks draws increasing attention due to the complementary nature of different modalities in building robust tracking systems. Existing practices mix all data sensor types in a single training procedure, structuring a parallel paradigm from the data-centric perspective and aiming for a global optimum on the joint distribution of the involved tasks. However, the absence of a unified benchmark where all types of data coexist forces evaluations on separated benchmarks, causing inconsistency between training and testing, thus leading to performance degradation. To address these issues, this work advances in two aspects: (1) A unified benchmark, coined as UniBench300, is introduced to bridge the inconsistency by incorporating multiple task data, reducing inference passes from three to one and cutting time consumption by 27%. (2) The unification process is reformulated in a serial format, progressively integrating new tasks. In this way, the performance degradation can be specified as knowledge forgetting of previous tasks, which naturally aligns with the philosophy of continual learning (CL), motivating further exploration of injecting CL into the unification process. Extensive experiments conducted on two baselines and four benchmarks demonstrate the significance of UniBench300 and the superiority of CL in supporting a stable unification process. Moreover, while conducting dedicated analyses, the performance degradation is found to be negatively correlated with network capacity. Additionally, modality discrepancies contribute to varying degradation levels across tasks (RGBT RGBD RGBE in MMVOT), offering valuable insights for future multi-modal vision research. Source code and the proposed benchmark are available at this https URL.

[CV-42] Geospatial Diffusion for Land Cover Imperviousness Change Forecasting

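[Quick read]: This paper addresses the fact that the ability to forecast land-use and land-cover change (LULC), a critical input for assessing risk and consequences under future climate scenarios, lags behind progress in Earth system models. The core of the solution is a new paradigm that frames LULC forecasting as a data synthesis problem: generative AI (GenAI) learns spatiotemporal patterns from historical and auxiliary data and generates future land-cover distributions. The authors train a diffusion model for decadal forecasting of imperviousness over the conterminous United States, and experiments show that at average resolutions of at least 0.7 × 0.7 km² its mean absolute error (MAE) is lower than that of a baseline assuming no change, indicating that generative models can capture the spatiotemporal patterns in historical data that matter for projecting future change.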

Link: https://arxiv.org/abs/2508.10649
Authors: Debvrat Varshney, Vibhas Vats, Bhartendu Pandey, Christa Brelsford, Philipe Dias
Affiliation: Oak Ridge National Laboratory; Indiana University Bloomington; Los Alamos National Laboratory
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Land cover, both present and future, has a significant effect on several important Earth system processes. For example, impervious surfaces heat up and speed up surface water runoff and reduce groundwater infiltration, with concomitant effects on regional hydrology and flood risk. While regional Earth System models have increasing skill at forecasting hydrologic and atmospheric processes at high resolution in future climate scenarios, our ability to forecast land-use and land-cover change (LULC), a critical input to risk and consequences assessment for these scenarios, has lagged behind. In this paper, we propose a new paradigm exploiting Generative AI (GenAI) for land cover change forecasting by framing LULC forecasting as a data synthesis problem conditioned on historical and auxiliary data sources. We discuss desirable properties of generative models that underpin our research premise, and demonstrate the feasibility of our methodology through experiments on imperviousness forecasting using historical data covering the entire conterminous United States. Specifically, we train a diffusion model for decadal forecasting of imperviousness and compare its performance to a baseline that assumes no change at all. Evaluation across 12 metropolitan areas for a year held out during training indicates that for average resolutions ≥ 0.7 × 0.7 km² our model yields MAE lower than such a baseline. This finding corroborates that such a generative model can capture spatiotemporal patterns from historical data that are significant for projecting future change. Finally, we discuss future research to incorporate auxiliary information on physical properties about the Earth, as well as supporting simulation of different scenarios by means of driver variables.
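
The evaluation logic, comparing a forecast against a baseline that assumes no change, is easy to make concrete. The sketch below computes mean absolute error for both on a toy 2×2 imperviousness grid. All numbers and the helper name `mae` are hypothetical, and nothing here reproduces the paper's data, model, or resolution averaging.

```python
def mae(pred, target):
    """Mean absolute error between two equally sized 2D grids."""
    pairs = [(p, t) for p_row, t_row in zip(pred, target)
             for p, t in zip(p_row, t_row)]
    return sum(abs(p - t) for p, t in pairs) / len(pairs)


# Hypothetical imperviousness (percent) on a 2x2 grid.
past = [[10, 20], [30, 40]]        # last observed year
future = [[15, 20], [40, 40]]      # held-out year to be forecast
forecast = [[14, 21], [37, 40]]    # output of some forecasting model

baseline_mae = mae(past, future)   # "no change": predict the past grid
model_mae = mae(forecast, future)
print(baseline_mae, model_mae)     # 3.75 1.25
```

A forecast is only informative if its error falls below the no-change error, which is the comparison the abstract reports.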

[CV-43] SemPT: Semantic Prompt Tuning for Vision-Language Models

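[Quick read]: This paper addresses generalization to unseen categories in visual transfer learning, where the core challenge is acquiring transferable knowledge while preserving category-specific representations. Existing prompt tuning methods rely on sparse category labels or disparate LLM-generated descriptions, which fragments the knowledge representation and weakens transfer. The key to the solution is Semantic Prompt Tuning (SemPT). A two-step prompting strategy guides the LLM to extract attribute-level knowledge shared across categories and to generate attribute-level descriptions, capturing transferable semantic cues beyond labels. A visually guided weighting mechanism then reduces noise in the embeddings of those descriptions to improve the text embeddings, and image embeddings are aligned jointly with both label embeddings and attribute-enhanced embeddings, balancing discrimination on seen categories with transferability to unseen ones. At inference, the model selects label embeddings or attribute-enhanced embeddings dynamically depending on whether the category was seen, ensuring effective adaptation.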

Link: https://arxiv.org/abs/2508.10645
Authors: Xiao Shi, Yangjun Ou, Zhenzhong Chen
Affiliation: Wuhan Textile University; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual transfer learning for unseen categories presents an active research topic yet a challenging task, due to the inherent conflict between preserving category-specific representations and acquiring transferable knowledge. Vision-Language Models (VLMs) pre-trained on large amounts of image-text pairs offer a promising solution. However, existing prompt tuning methods rely on sparse category labels or disparate LLM-generated descriptions, which fragment knowledge representation and hinder transferability. To address this limitation, we introduce Semantic Prompt Tuning (SemPT), a novel framework that tackles the generalization challenge by leveraging shared attribute-level knowledge across categories. Specifically, SemPT adopts a two-step prompting strategy to guide LLM in extracting shared visual attributes and generating attribute-level descriptions, capturing transferable semantic cues beyond labels while ensuring coherent structure. Then, visually guided weighting is applied to the embeddings of attribute-level descriptions to reduce noise from irrelevant attributes and enhance the text embeddings. Additionally, image embeddings are jointly aligned with both label and attribute-enhanced text embeddings, balancing discrimination for seen categories and transferability to unseen ones. Considering the availability of category exposure, our inference dynamically selects between standard label embeddings for seen categories and attribute-enhanced embeddings for unseen ones to ensure effective adaptation. Extensive experiments on 15 benchmark datasets demonstrate that SemPT achieves state-of-the-art performance across various settings, including base-to-novel generalization, cross-dataset transfer, cross-domain transfer, and few-shot learning.

[CV-44] Lameness detection in dairy cows using pose estimation and bidirectional LSTMs

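[Quick read]: This paper addresses automatic lameness detection in dairy cows, where traditional methods depend on manually designed locomotion features that require complex feature engineering and generalize poorly. The key to the solution is to combine pose estimation with a bidirectional long short-term memory (BLSTM) network: the T-LEAP model performs markerless pose estimation and extracts the trajectories of nine keypoints (on the hooves, head, and back), and these trajectories are fed to a BLSTM that learns temporal motion features automatically, removing the need for manual feature engineering. The best architecture reaches 85% classification accuracy, against 80% for the established feature-based approach, and the classifier can detect lameness from as little as one second of video.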

Link: https://arxiv.org/abs/2508.10643
Authors: Helena Russello, Rik van der Tol, Eldert J. van Henten, Gert Kootstra
Affiliation: Wageningen University & Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study presents a lameness detection approach that combines pose estimation and Bidirectional Long-Short-Term Memory (BLSTM) neural networks. Combining pose estimation with a BLSTM classifier offers the following advantages: markerless pose estimation, elimination of manual feature engineering by learning temporal motion features from the keypoint trajectories, and the ability to work with short sequences and small training datasets. Motion sequences of nine keypoints (located on the cows’ hooves, head and back) were extracted from videos of walking cows with the T-LEAP pose estimation model. The trajectories of the keypoints were then used as an input to a BLSTM classifier that was trained to perform binary lameness classification. Our method significantly outperformed an established method that relied on manually-designed locomotion features: our best architecture achieved a classification accuracy of 85%, against 80% accuracy for the feature-based approach. Furthermore, we showed that our BLSTM classifier could detect lameness with as little as one second of video data.

[CV-45] Processing and acquisition traces in visual encoders: What does CLIP know about your camera? ICCV2025

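[Quick read]: This paper examines how robust visual encoders are to subtle changes in image acquisition and processing parameters. Such changes may be imperceptible to the human eye, but when they are correlated with semantic labels they can act as a distribution shift at test time and make performance fluctuate. The key finding is that these acquisition- and processing-related parameters are systematically encoded in learned visual representations and can be recovered easily. More importantly, their effect on semantic predictions depends on whether they are correlated or anti-correlated with the semantic labels: a strong positive correlation can help predictions, whereas anti-correlation can hurt them significantly. The finding offers a new perspective for understanding the inherent biases of visual models and for designing more robust representation learning methods.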

Link: https://arxiv.org/abs/2508.10637
Authors: Ryan Ramos, Vladan Stojnić, Giorgos Kordopatis-Zilos, Yuta Nakashima, Giorgos Tolias, Noa Garcia
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 main pages, supplementary attached, ICCV 2025 highlight

Abstract:Prior work has analyzed the robustness of visual encoders to image transformations and corruptions, particularly in cases where such alterations are not seen during training. When this occurs, they introduce a form of distribution shift at test time, often leading to performance degradation. The primary focus has been on severe corruptions that, when applied aggressively, distort useful signals necessary for accurate semantic predictions. We take a different perspective by analyzing parameters of the image acquisition process and transformations that may be subtle or even imperceptible to the human eye. We find that such parameters are systematically encoded in the learned visual representations and can be easily recovered. More strikingly, their presence can have a profound impact, either positively or negatively, on semantic predictions. This effect depends on whether there is a strong correlation or anti-correlation between semantic labels and these acquisition-based or processing-based labels. Our code and data are available at: this https URL

[CV-46] ChatENV: An Interactive Vision-Language Model for Sensor-Guided Environmental Monitoring and Scenario Simulation

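[Quick read]: This paper addresses three limitations of current vision language models (VLMs) in understanding environmental change: they ignore causal signals from environmental sensors, they rely on single-source captions that are prone to stylistic bias, and they lack interactive scenario-based reasoning. The key to the solution is ChatENV, presented as the first interactive VLM that reasons jointly over satellite image pairs and real-world sensor data (e.g., temperature, PM10, CO). The authors build a dataset of 177k images forming 152k temporal pairs, annotate it with both GPT-4o and Gemini 2.0 for diversity, and fine-tune Qwen-2.5-VL with Low-Rank Adaptation (LoRA), achieving strong temporal and "what-if" reasoning (e.g., BERT-F1 of 0.903) and advancing grounded, sensor-aware environmental monitoring.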

Link: https://arxiv.org/abs/2508.10635
Authors: Hosam Elgendy, Ahmed Sharshar, Ahmed Aboeitta, Mohsen Guizani
Affiliation: Mohamed bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures, 7 tables

Abstract:Understanding environmental changes from aerial imagery is vital for climate resilience, urban planning, and ecosystem monitoring. Yet, current vision language models (VLMs) overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning. We present ChatENV, the first interactive VLM that jointly reasons over satellite image pairs and real-world sensor data. Our framework: (i) creates a 177k-image dataset forming 152k temporal pairs across 62 land-use classes in 197 countries with rich sensor metadata (e.g., temperature, PM10, CO); (ii) annotates data using GPT-4o and Gemini 2.0 for stylistic and semantic diversity; and (iii) fine-tunes Qwen-2.5-VL using efficient Low-Rank Adaptation (LoRA) adapters for chat purposes. ChatENV achieves strong performance in temporal and “what-if” reasoning (e.g., BERT-F1 0.903) and rivals or outperforms state-of-the-art temporal models, while supporting interactive scenario-based analysis. This positions ChatENV as a powerful tool for grounded, sensor-aware environmental monitoring.
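
Low-Rank Adaptation keeps the pretrained weight matrix W frozen and learns a low-rank update, so the effective weight is W + (alpha / r) · B · A, with B of shape d_out × r and A of shape r × d_in. The sketch below shows that arithmetic on plain Python lists. The matrices and the helper names `matmul` and `lora_weight` are hypothetical, and real fine-tuning of a model such as Qwen-2.5-VL would go through a training library rather than code like this.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)                      # A has shape (r, d_in)
    delta = matmul(b, a)            # B has shape (d_out, r)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1, 0], [0, 1]]                # frozen pretrained weight (toy values)
a = [[1, 2]]                        # trainable, rank r = 1
b = [[1], [0]]                      # trainable
adapted = lora_weight(w, a, b, alpha=2)
print(adapted)                      # [[3.0, 4.0], [0.0, 1.0]]

# With B initialised to zeros the adapter leaves W unchanged at the start.
unchanged = lora_weight(w, a, [[0], [0]], alpha=2)
```

The adapter trains r · (d_in + d_out) values instead of d_in · d_out, which is where the efficiency mentioned in the abstract comes from when r is small.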

[CV-47] Increasing the Utility of Synthetic Images through Chamfer Guidance

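[Quick read]: This paper addresses the way recent gains in the quality of generative image models have come at the cost of generation diversity, which limits their usefulness as a source of synthetic training data. Guidance-based approaches can improve quality or diversity, but often disregard the distribution shift between synthetic and real data. The key to the solution is Chamfer Guidance, a training-free guidance method that uses only a handful of real exemplar images to characterize the quality and diversity of synthetic data. Building on the Chamfer distance, it increases diversity while maintaining or improving generation quality on ImageNet-1k and standard geo-diversity benchmarks, and reaches state-of-the-art few-shot performance with as few as 2 real exemplars (96.4% precision, 86.4% distributional coverage), rising to 97.5% and 92.7% with 32 exemplars. The method does not need the unconditional model, so it uses 31% fewer FLOPs at sampling time than classifier-free-guidance-based approaches.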

Link: https://arxiv.org/abs/2508.10631
Authors: Nicola Dall’Asen, Xiaofeng Zhang, Reyhane Askari Hemmat, Melissa Hall, Jakob Verbeek, Adriana Romero-Soriano, Michal Drozdzal
Affiliation: University of Trento; University of Pisa; Mila - Québec AI Institute; FAIR at Meta; Université de Montréal; McGill University; Canada CIFAR AI chair
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Conditional image generative models hold considerable promise to produce infinite amounts of synthetic training data. Yet, recent progress in generation quality has come at the expense of generation diversity, limiting the utility of these models as a source of synthetic training data. Although guidance-based approaches have been introduced to improve the utility of generated data by focusing on quality or diversity, the (implicit or explicit) utility functions oftentimes disregard the potential distribution shift between synthetic and real data. In this work, we introduce Chamfer Guidance: a training-free guidance approach which leverages a handful of real exemplar images to characterize the quality and diversity of synthetic data. We show that by leveraging the proposed Chamfer Guidance, we can boost the diversity of the generations w.r.t. a dataset of real images while maintaining or improving the generation quality on ImageNet-1k and standard geo-diversity benchmarks. Our approach achieves state-of-the-art few-shot performance with as little as 2 exemplar real images, obtaining 96.4% in terms of precision, and 86.4% in terms of distributional coverage, which increase to 97.5% and 92.7%, respectively, when using 32 real images. We showcase the benefits of the Chamfer Guidance generation by training downstream image classifiers on synthetic data, achieving accuracy boost of up to 15% for in-distribution over the baselines, and up to 16% in out-of-distribution. Furthermore, our approach does not require using the unconditional model, and thus obtains a 31% reduction in FLOPs w.r.t. classifier-free-guidance-based approaches at sampling time.
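
The Chamfer distance that gives the method its name compares two point sets through nearest-neighbour distances in both directions. The sketch below uses one common convention (mean Euclidean nearest-neighbour distance per direction, summed); definitions vary, and the abstract does not say which variant, feature space, or guidance rule the paper uses. The point sets and function names are hypothetical.

```python
import math


def one_sided(src, dst):
    """Mean distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(a, b) for b in dst) for a in src) / len(src)


def chamfer(a, b):
    """Symmetric Chamfer distance: sum of the two one-sided terms."""
    return one_sided(a, b) + one_sided(b, a)


real = [(0, 0), (4, 0)]             # features of two real exemplars
collapsed = [(0, 0), (0, 0)]        # generations covering one exemplar only
diverse = [(0, 0), (4, 0)]          # generations covering both exemplars

# generated -> real stays small when every sample looks realistic;
# real -> generated grows when some real mode is never generated.
print(one_sided(collapsed, real))   # 0.0
print(one_sided(real, collapsed))   # 2.0
print(chamfer(diverse, real))       # 0.0
```

Read this way, one direction penalises unrealistic samples and the other penalises missing coverage, which is why a single distance can speak to both quality and diversity.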

[CV-48] FIND-Net – Fourier-Integrated Network with Dictionary Kernels for Metal Artifact Reduction MICCAI2025

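[Quick Read]: This paper addresses metal artifacts caused by metallic implants in computed tomography (CT), which severely degrade image quality and complicate diagnosis and treatment planning. Existing deep learning methods have made progress on Metal Artifact Reduction (MAR) but often fall short on either artifact suppression or preservation of structural detail. The paper proposes FIND-Net (Fourier-Integrated Network with Dictionary Kernels), whose core idea is to combine frequency-domain and spatial-domain processing: it introduces Fast Fourier Convolution (FFC) layers and trainable Gaussian filtering and treats MAR as a hybrid task across both domains, which improves global context awareness and frequency selectivity and so suppresses artifacts while preserving anatomical structure. On synthetic data, FIND-Net achieves statistically significant gains over state-of-the-art methods in mean absolute error (MAE), structural similarity index (SSIM), and peak signal-to-noise ratio (PSNR); evaluations on real clinical CT scans support its robustness and clinical applicability.

Link: https://arxiv.org/abs/2508.10617
Authors: Farid Tasharofi, Fuxin Fan, Melika Qahqaie, Mareike Thies, Andreas Maier
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted at MICCAI 2025. This is the submitted version prior to peer review. The final Version of Record will appear in the MICCAI 2025 proceedings (Springer LNCS)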

Abstract:Metal artifacts, caused by high-density metallic implants in computed tomography (CT) imaging, severely degrade image quality, complicating diagnosis and treatment planning. While existing deep learning algorithms have achieved notable success in Metal Artifact Reduction (MAR), they often struggle to suppress artifacts while preserving structural details. To address this challenge, we propose FIND-Net (Fourier-Integrated Network with Dictionary Kernels), a novel MAR framework that integrates frequency and spatial domain processing to achieve superior artifact suppression and structural preservation. FIND-Net incorporates Fast Fourier Convolution (FFC) layers and trainable Gaussian filtering, treating MAR as a hybrid task operating in both spatial and frequency domains. This approach enhances global contextual understanding and frequency selectivity, effectively reducing artifacts while maintaining anatomical structures. Experiments on synthetic datasets show that FIND-Net achieves statistically significant improvements over state-of-the-art MAR methods, with a 3.07% MAE reduction, 0.18% SSIM increase, and 0.90% PSNR improvement, confirming robustness across varying artifact complexities. Furthermore, evaluations on real-world clinical CT scans confirm FIND-Net’s ability to minimize modifications to clean anatomical regions while effectively suppressing metal-induced distortions. These findings highlight FIND-Net’s potential for advancing MAR performance, offering superior structural preservation and improved clinical applicability. Code is available at this https URL
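
For readers unfamiliar with the reported metrics, here is a minimal sketch of MAE and PSNR on flattened images. It assumes intensities scaled to a data range of 1.0 and the common definition PSNR = 10 log10(range^2 / MSE); the paper's exact evaluation protocol (intensity window, masking of metal regions) is not given in the abstract, and the pixel values are made up.

```python
import math
from typing import Sequence


def mae(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equal-length pixel sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def psnr(pred: Sequence[float], target: Sequence[float], data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)


# Hypothetical 4-pixel "images" with two pixels off by 0.1.
target = [0.0, 0.5, 1.0, 0.5]
pred = [0.1, 0.5, 0.9, 0.5]
print(mae(pred, target))   # approximately 0.05
print(psnr(pred, target))  # approximately 23.01 dB
```

Lower MAE and higher PSNR are better, which is why the abstract reports an MAE reduction alongside SSIM and PSNR increases.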

[CV-49] Fourier-Guided Attention Upsampling for Image Super-Resolution

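[Quick Read]: This paper targets single image super-resolution (SISR), where conventional upsampling modules such as Sub-Pixel Convolution struggle to reconstruct high-frequency detail and introduce aliasing artifacts. The key is a lightweight Frequency-Guided Attention (FGA) upsampling module with three components: (1) a Fourier feature-based multi-layer perceptron (MLP) for positional frequency encoding, which strengthens the modeling of high-frequency information; (2) a cross-resolution Correlation Attention Layer for adaptive spatial alignment; and (3) a frequency-domain L1 loss that supervises spectral fidelity. FGA adds only 0.3M parameters, improves performance across multiple backbones with average PSNR gains of 0.12~0.14 dB, and improves frequency-domain consistency by up to 29% on texture-rich datasets, reducing aliasing while preserving fine structure.

Link: https://arxiv.org/abs/2508.10616
Authors: Daejune Choi, Youchan No, Jinhyung Lee, Duksu Kim
Affiliations: Korea University of Technology and Education
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures, under submission to a journal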

Abstract:We propose Frequency-Guided Attention (FGA), a lightweight upsampling module for single image super-resolution. Conventional upsamplers, such as Sub-Pixel Convolution, are efficient but frequently fail to reconstruct high-frequency details and introduce aliasing artifacts. FGA addresses these issues by integrating (1) a Fourier feature-based Multi-Layer Perceptron (MLP) for positional frequency encoding, (2) a cross-resolution Correlation Attention Layer for adaptive spatial alignment, and (3) a frequency-domain L1 loss for spectral fidelity supervision. Adding merely 0.3M parameters, FGA consistently enhances performance across five diverse super-resolution backbones in both lightweight and full-capacity scenarios. Experimental results demonstrate average PSNR gains of 0.12~0.14 dB and improved frequency-domain consistency by up to 29%, particularly evident on texture-rich datasets. Visual and spectral evaluations confirm FGA’s effectiveness in reducing aliasing and preserving fine details, establishing it as a practical, scalable alternative to traditional upsampling methods.
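
To show what a frequency-domain L1 loss measures, the sketch below compares the discrete Fourier transforms of a prediction and a target. It is a 1-D toy under stated assumptions: the naive `dft` helper, the choice to compare complex coefficients directly, and the mean normalisation are illustrative, and the authors' exact formulation (2-D FFT, amplitude versus complex terms, weighting) is not specified in the abstract.

```python
import cmath
from typing import List, Sequence


def dft(signal: Sequence[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (fine for tiny examples)."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def frequency_l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute difference between the DFT coefficients of two signals."""
    fp, ft = dft(pred), dft(target)
    return sum(abs(a - b) for a, b in zip(fp, ft)) / len(ft)


# A high-frequency target versus an over-smoothed (all-zero) prediction.
target = [1.0, -1.0, 1.0, -1.0]
smooth = [0.0, 0.0, 0.0, 0.0]
print(frequency_l1_loss(target, target))  # 0.0
print(frequency_l1_loss(smooth, target))  # close to 1.0: the error sits in the highest-frequency bin
```

Because the loss is computed per frequency bin, a prediction that loses only fine texture is penalised exactly where the spectrum differs, which is the intuition behind supervising spectral fidelity.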

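[CV-50] Towards Powerful and Practical Patch Attacks for 2D Object Detection in Autonomous Driving

[Quick Read]: This paper addresses a key safety problem for learning-based 2D object detection in autonomous driving: the effectiveness of adversarial patch attacks is overestimated by existing evaluation metrics, and patches transfer poorly to high-resolution datasets. The solution has three parts. First, it proposes the Practical Attack Success Rate (PASR), an evaluation metric closer to real-world concerns such as pedestrian safety. Second, it designs a Localization-Confidence Suppression Loss (LCSL) to improve attack transferability across models. Third, it introduces Probabilistic Scale-Preserving Padding (PSPP) as a preprocessing step so that patches stay effective on high-resolution images and transfer across datasets. Experiments show that the resulting framework, P³A, outperforms state-of-the-art attacks on unseen models and unseen high-resolution datasets.

Link: https://arxiv.org/abs/2508.10600
Authors: Yuxin Cao, Yedi Zhang, Wentao He, Yifan Liao, Yan Xiao, Chang Li, Zhiyong Huang, Jin Song Dong
Affiliations: Changan Automobile; National University of Singapore; Ningbo University; Sun Yat-Sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 4 figures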

Abstract:Learning-based autonomous driving systems remain critically vulnerable to adversarial patches, posing serious safety and security risks in their real-world deployment. Black-box attacks, notable for their high attack success rate without model knowledge, are especially concerning, with their transferability extensively studied to reduce computational costs compared to query-based attacks. Previous transferability-based black-box attacks typically adopt mean Average Precision (mAP) as the evaluation metric and design training loss accordingly. However, due to the presence of multiple detected bounding boxes and the relatively lenient Intersection over Union (IoU) thresholds, the attack effectiveness of these approaches is often overestimated, resulting in reduced success rates in practical attacking scenarios. Furthermore, patches trained on low-resolution data often fail to maintain effectiveness on high-resolution images, limiting their transferability to autonomous driving datasets. To fill this gap, we propose P³A, a Powerful and Practical Patch Attack framework for 2D object detection in autonomous driving, specifically optimized for high-resolution datasets. First, we introduce a novel metric, Practical Attack Success Rate (PASR), to more accurately quantify attack effectiveness with greater relevance for pedestrian safety. Second, we present a tailored Localization-Confidence Suppression Loss (LCSL) to improve attack transferability under PASR. Finally, to maintain the transferability for high-resolution datasets, we further incorporate the Probabilistic Scale-Preserving Padding (PSPP) into the patch attack pipeline as a data preprocessing step. Extensive experiments show that P³A outperforms state-of-the-art attacks on unseen models and unseen high-resolution datasets, both under the proposed practical IoU-based evaluation metric and the previous mAP-based metrics.
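
The abstract's remark about lenient IoU thresholds can be made concrete with a small sketch. It computes Intersection over Union for two axis-aligned boxes and shows a badly displaced detection that still counts as a match under a lenient threshold. The thresholds 0.3 and 0.5 and the box coordinates are illustrative assumptions, not values from the paper, and PASR itself is not implemented because the abstract does not define it.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 < x2 and y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0.0, 0.0, 10.0, 10.0)
shifted = (5.0, 0.0, 15.0, 10.0)       # detection displaced by half the box width
overlap = iou(ground_truth, shifted)
print(round(overlap, 3))               # 0.333
print(overlap >= 0.3, overlap >= 0.5)  # True False
```

Under the lenient threshold the displaced box still counts as a detection, so an attack that merely shifts boxes may be scored differently depending on the threshold chosen.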

[CV-51] On Spectral Properties of Gradient-based Explanation Methods

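[Quick Read]: This paper addresses reliability problems in explanation methods for deep networks, which it attributes to insufficient formalism. It adopts probabilistic and spectral-analysis perspectives to formally analyze gradient-based explanation methods and reveals a pervasive spectral bias stemming from the use of gradients. Two remedies follow from this formalism: (i) a mechanism for determining a standard perturbation scale, which avoids the inconsistent explanations caused by poorly chosen perturbation hyperparameters; and (ii) an aggregation method called SpectralLens, which integrates information across the spectral domain to make explanations more stable and consistent. The theoretical analysis is supported by quantitative evaluations.

Link: https://arxiv.org/abs/2508.10595
Authors: Amir Mehrpanah, Erik Englesson, Hossein Azizpour
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 36 pages, 16 figures, published in European Conference on Computer Vision 2024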

Abstract:Understanding the behavior of deep networks is crucial to increase our confidence in their results. Despite an extensive body of work for explaining their predictions, researchers have faced reliability issues, which can be attributed to insufficient formalism. In our research, we adopt novel probabilistic and spectral perspectives to formally analyze explanation methods. Our study reveals a pervasive spectral bias stemming from the use of gradient, and sheds light on some common design choices that have been discovered experimentally, in particular, the use of squared gradient and input perturbation. We further characterize how the choice of perturbation hyperparameters in explanation methods, such as SmoothGrad, can lead to inconsistent explanations and introduce two remedies based on our proposed formalism: (i) a mechanism to determine a standard perturbation scale, and (ii) an aggregation method which we call SpectralLens. Finally, we substantiate our theoretical results through quantitative evaluations.
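
The sensitivity to perturbation hyperparameters that the abstract describes can be seen in a toy SmoothGrad sketch. SmoothGrad averages the gradient over Gaussian-perturbed copies of the input; here the "model" is a hypothetical f(x) = tanh(w . x) with a hand-written gradient, so the function names, the weights, and the sample count are all assumptions for illustration. Neither of the paper's remedies (the standard perturbation scale or SpectralLens) is implemented.

```python
import math
import random
from typing import Callable, List, Sequence


def toy_gradient(x: Sequence[float]) -> List[float]:
    """Analytic gradient of the toy model f(x) = tanh(w . x), with fixed w."""
    w = (2.0, -1.0)
    s = sum(wi * xi for wi, xi in zip(w, x))
    scale = 1.0 - math.tanh(s) ** 2
    return [scale * wi for wi in w]


def smoothgrad(
    grad_fn: Callable[[Sequence[float]], List[float]],
    x: Sequence[float],
    sigma: float,
    n_samples: int = 200,
    seed: int = 0,
) -> List[float]:
    """Average grad_fn over Gaussian perturbations of x with std sigma."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        for i, g in enumerate(grad_fn(noisy)):
            total[i] += g
    return [t / n_samples for t in total]


x = [0.0, 0.0]
print(toy_gradient(x))                         # [2.0, -1.0]
print(smoothgrad(toy_gradient, x, sigma=0.0))  # [2.0, -1.0], identical to the plain gradient
print(smoothgrad(toy_gradient, x, sigma=1.0))  # same direction, smaller magnitude
```

Even in this two-parameter toy, the attribution magnitude depends on `sigma`, which hints at why an unprincipled choice of perturbation scale can make explanations hard to compare.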

[CV-52] EvTurb: Event Camera Guided Turbulence Removal

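[Quick Read]: This paper addresses image degradation caused by atmospheric turbulence, namely blur and geometric tilt distortion, which hurts downstream computer vision tasks. Existing single-image and multi-frame methods struggle because the problem is highly ill-posed. The key is EvTurb, an event-guided turbulence removal framework that models event-based turbulence formation and decouples the two degradations with a two-step event-guided network: event integrals first reduce blur in the coarse outputs, and a variance map derived from the raw event streams then removes tilt distortion in the refined outputs. The method exploits the high temporal resolution of event cameras to restore turbulence-degraded images efficiently and with high quality.

Link: https://arxiv.org/abs/2508.10582
Authors: Yixing Liu, Minggui Teng, Yifei Xia, Peiqi Duan, Boxin Shi
Affiliations: Peking University; State Key Laboratory of Multimedia Information Processing; National Engineering Research Center of Visual Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: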

Abstract:Atmospheric turbulence degrades image quality by introducing blur and geometric tilt distortions, posing significant challenges to downstream computer vision tasks. Existing single-image and multi-frame methods struggle with the highly ill-posed nature of this problem due to the compositional complexity of turbulence-induced distortions. To address this, we propose EvTurb, an event guided turbulence removal framework that leverages high-speed event streams to decouple blur and tilt effects. EvTurb decouples blur and tilt effects by modeling event-based turbulence formation, specifically through a novel two-step event-guided network: event integrals are first employed to reduce blur in the coarse outputs. This is followed by employing a variance map, derived from raw event streams, to eliminate the tilt distortion for the refined outputs. Additionally, we present TurbEvent, the first real-captured dataset featuring diverse turbulence scenarios. Experimental results demonstrate that EvTurb surpasses state-of-the-art methods while maintaining computational efficiency.

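[CV-53] HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs

[Quick Read]: This paper addresses the lack of fine-grained evaluation frameworks for Multimodal Large Language Models (MLLMs) in human-centered interaction, in particular for understanding complex human intentions and producing empathetic, context-aware responses. The key is HumanSense, a comprehensive benchmark that systematically evaluates how well MLLMs understand extended multimodal contexts and formulate rational feedback, together with a multi-stage, modality-progressive reinforcement learning method that strengthens the model's reasoning ability and thereby its contextual analysis of the interlocutor's needs and emotions, leading to higher-quality human-computer interaction.

Link: https://arxiv.org/abs/2508.10576
Authors: Zheng Qin, Ruobing Zheng, Yabing Wang, Tianqi Li, Yi Yuan, Jingdong Chen, Le Wang
Affiliations: University of Science and Technology of China; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: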

Abstract:While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks. Furthermore, we argue that appropriate feedback stems from a contextual analysis of the interlocutor’s needs and emotions, with reasoning ability serving as the key to unlocking it. Accordingly, we employ a multi-stage, modality-progressive reinforcement learning to enhance the reasoning abilities of an Omni model, achieving substantial gains on evaluation results. Additionally, we observe that successful reasoning processes exhibit highly consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner. Project page: this https URL

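[CV-54] Towards Agentic AI for Multimodal-Guided Video Object Segmentation

[Quick Read]: This paper addresses referring-based video object segmentation (RVOS), where traditional approaches train specialized models at high computational and manual-annotation cost. Recent training-free methods built on vision-language foundation models reach performance comparable to supervised models, but their fixed pipelines cannot adapt to the dynamic nature of the task. The key is Multi-Modal Agent, an agentic system that uses the reasoning ability of large language models (LLMs) to generate a workflow tailored to each input and iteratively interacts with a set of specialized tools across modalities to identify the target object described by the multimodal cues, giving a more flexible and adaptive segmentation process.

Link: https://arxiv.org/abs/2508.10572
Authors: Tuyen Tran, Thao Minh Le, Truyen Tran
Affiliations: Deakin University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: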

Abstract:Referring-based Video Object Segmentation is a multimodal problem that requires producing fine-grained segmentation results guided by external cues. Traditional approaches to this task typically involve training specialized models, which come with high computational complexity and manual annotation effort. Recent advances in vision-language foundation models open a promising direction toward training-free approaches. Several studies have explored leveraging these general-purpose models for fine-grained segmentation, achieving performance comparable to that of fully supervised, task-specific models. However, existing methods rely on fixed pipelines that lack the flexibility needed to adapt to the dynamic nature of the task. To address this limitation, we propose Multi-Modal Agent, a novel agentic system designed to solve this task in a more flexible and adaptive manner. Specifically, our method leverages the reasoning capabilities of large language models (LLMs) to generate dynamic workflows tailored to each input. This adaptive procedure iteratively interacts with a set of specialized tools designed for low-level tasks across different modalities to identify the target object described by the multimodal cues. Our agentic approach demonstrates clear improvements over prior methods on two multimodal-conditioned VOS tasks: RVOS and Ref-AVS.

[CV-55] Adapting SAM via Cross-Entropy Masking for Class Imbalance in Remote Sensing Change Detection

[Quick Read]: This paper addresses class imbalance, multi-scale change features that are hard to capture, and limited generalization in remote sensing change detection (RSCD). The solution has three parts: fine-tuning the encoder of the Segment Anything Model (SAM) to extract general visual representations; adding spatial-temporal feature enhancement (STFE) and multi-scale decoder fusion (MSDF) to detect changes robustly at multiple scales; and a novel cross-entropy masking (CEM) loss to handle the severe class imbalance in change detection datasets. The method outperforms state-of-the-art (SOTA) methods on four public datasets (Levir-CD, WHU-CD, CLCD, and S2Looking), with a 2.5% F1-score improvement on the large, complex S2Looking dataset.

Link: https://arxiv.org/abs/2508.10568
Authors: Humza Naveed, Xina Zeng, Mitch Bryson, Nagita Mehrseresht
Affiliations: The University of Sydney; ARIAM; NearMap
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: work in progress

Abstract:Foundational models have achieved significant success in diverse domains of computer vision. They learn general representations that are easily transferable to tasks not seen during training. One such foundational model is Segment anything model (SAM), which can accurately segment objects in images. We propose adapting the SAM encoder via fine-tuning for remote sensing change detection (RSCD) along with spatial-temporal feature enhancement (STFE) and multi-scale decoder fusion (MSDF) to detect changes robustly at multiple scales. Additionally, we propose a novel cross-entropy masking (CEM) loss to handle high class imbalance in change detection datasets. Our method outperforms state-of-the-art (SOTA) methods on four change detection datasets, Levir-CD, WHU-CD, CLCD, and S2Looking. We achieved 2.5% F1-score improvement on a large complex S2Looking dataset. The code is available at: this https URL

[CV-56] SpaRC-AD: A Baseline for Radar-Camera Fusion in End-to-End Autonomous Driving

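[Quick Read]: This paper addresses the limits of vision-only perception under adverse weather, partial occlusion, and precise velocity estimation, which degrade motion understanding and long-horizon trajectory prediction in safety-sensitive scenarios and so constrain the reliability and safety of autonomous driving (AD) systems. The core of the solution is SpaRC-AD, a query-based end-to-end camera-radar fusion framework that uses sparse 3D feature alignment and Doppler-based velocity estimation to refine agent anchors, map polylines, and motion modeling. This improves 3D detection, multi-object tracking, online mapping, motion prediction, and trajectory planning, and achieves both spatial coherence and temporal consistency across multiple benchmarks.

Link: https://arxiv.org/abs/2508.10567
Authors: Philipp Wolters, Johannes Gilg, Torben Teepe, Gerhard Rigoll
Affiliations: Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 4 figures, 5 tables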

Abstract:End-to-end autonomous driving systems promise stronger performance through unified optimization of perception, motion forecasting, and planning. However, vision-based approaches face fundamental limitations in adverse weather conditions, partial occlusions, and precise velocity estimation - critical challenges in safety-sensitive scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance. To address these limitations, we propose SpaRC-AD, a query-based end-to-end camera-radar fusion framework for planning-oriented autonomous driving. Through sparse 3D feature alignment, and doppler-based velocity estimation, we achieve strong 3D scene representations for refinement of agent anchors, map polylines and motion modelling. Our method achieves strong improvements over the state-of-the-art vision-only baselines across multiple autonomous driving tasks, including 3D detection (+4.8% mAP), multi-object tracking (+8.3% AMOTA), online mapping (+1.8% mAP), motion prediction (-4.0% mADE), and trajectory planning (-0.1m L2 and -9% TPC). We achieve both spatial coherence and temporal consistency on multiple challenging benchmarks, including real-world open-loop nuScenes, long-horizon T-nuScenes, and closed-loop simulator Bench2Drive. We show the effectiveness of radar-based fusion in safety-critical scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance. The source code of all experiments is available at this https URL

[CV-57] HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

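[Quick Read]: This paper addresses motion blur and lip jitter in audio-driven talking head video generation. The root cause is that existing methods model the correlation between audio and facial motion implicitly and lack explicit articulatory priors, that is, anatomical guidance for speech-related facial movements. The key is HM-Talker, which combines implicit and explicit motion representations to improve temporal coherence and visual fidelity. The explicit cues are Action Units (AUs), anatomically defined facial muscle movements, used alongside implicit features to reduce phoneme-viseme misalignment. A Cross-Modal Disentanglement Module (CMDM) predicts AUs directly from audio and extracts complementary features, and a Hybrid Motion Modeling Module (HMMM) dynamically merges randomly paired implicit and explicit features, which removes identity-dependent bias and improves cross-subject generalization, yielding high-fidelity, accurately lip-synced personalized talking head video.

Link: https://arxiv.org/abs/2508.10566
Authors: Shiyu Liu, Kui Jiang, Xianming Liu, Hongxun Yao, Xiaocheng Feng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: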

Abstract:Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily due to their reliance on implicit modeling of audio-facial motion correlations–an approach lacking explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation combining both implicit and explicit motion cues. Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker’s superiority over state-of-the-art methods in visual quality and lip-sync accuracy.

[CV-58] PTQAT: A Hybrid Parameter-Efficient Quantization Algorithm for 3D Perception Tasks ICCV

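[Quick Read]: This paper addresses the trade-off between accuracy and efficiency in model quantization: Post-Training Quantization (PTQ) often causes significant performance degradation, while Quantization-Aware Training (QAT) is hard to deploy efficiently because of its high GPU memory use and long training time. The key is PTQAT, a general hybrid quantization method that selects critical layers for QAT fine-tuning and applies PTQ to the rest, reaching performance close to QAT while freezing nearly 50% of the quantizable layers. Counterintuitively, PTQAT finds that fine-tuning the layers with smaller output discrepancies before and after quantization, rather than those with larger discrepancies, compensates for error propagation more effectively and yields higher accuracy in the quantized model.

Link: https://arxiv.org/abs/2508.10557
Authors: Xinhao Wang, Zhiwei Lin, Zhongyu Xia, Yongtao Wang
Affiliations: Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, Accepted by ICCVW 2025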

Abstract:Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) represent two mainstream model quantization approaches. However, PTQ often leads to unacceptable performance degradation in quantized models, while QAT imposes substantial GPU memory requirements and extended training time due to weight this http URL this paper, we propose PTQAT, a novel general hybrid quantization algorithm for the efficient deployment of 3D perception networks. To address the speed accuracy trade-off between PTQ and QAT, our method selects critical layers for QAT fine-tuning and performs PTQ on the remaining layers. Contrary to intuition, fine-tuning the layers with smaller output discrepancies before and after quantization, rather than those with larger discrepancies, actually leads to greater improvements in the model’s quantization accuracy. This means we better compensate for quantization errors during their propagation, rather than addressing them at the point where they occur. The proposed PTQAT achieves similar performance to QAT with more efficiency by freezing nearly 50% of quantifiable layers. Additionally, PTQAT is a universal quantization method that supports various quantization bit widths (4 bits) as well as different model architectures, including CNNs and Transformers. The experimental results on nuScenes across diverse 3D perception tasks, including object detection, semantic segmentation, and occupancy prediction, show that our method consistently outperforms QAT-only baselines. Notably, it achieves 0.2%-0.9% NDS and 0.3%-1.0% mAP gains in object detection, 0.3%-2.0% mIoU gains in semantic segmentation and occupancy prediction while fine-tuning fewer weights.
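
The layer-selection idea can be sketched as a ranking problem: measure how much each layer's output changes after quantization, then send the layers with the smallest change to QAT and leave the rest to PTQ. Everything concrete here is an assumption for illustration: mean squared error as the discrepancy, a 50% split, the helper names, and the made-up layer outputs. The abstract does not state the authors' exact discrepancy measure or selection rule.

```python
from typing import Dict, List, Sequence, Tuple


def output_discrepancy(fp_out: Sequence[float], q_out: Sequence[float]) -> float:
    """Mean squared error between full-precision and quantized layer outputs."""
    return sum((a - b) ** 2 for a, b in zip(fp_out, q_out)) / len(fp_out)


def split_layers(
    fp_outputs: Dict[str, Sequence[float]],
    q_outputs: Dict[str, Sequence[float]],
    qat_fraction: float = 0.5,
) -> Tuple[List[str], List[str]]:
    """Return (layers to fine-tune with QAT, layers left to PTQ).

    Layers are ranked by ascending discrepancy, so the QAT group holds the
    layers whose outputs changed least after quantization.
    """
    ranked = sorted(
        fp_outputs,
        key=lambda name: output_discrepancy(fp_outputs[name], q_outputs[name]),
    )
    n_qat = round(len(ranked) * qat_fraction)
    return ranked[:n_qat], ranked[n_qat:]


# Hypothetical recorded outputs for four layers on one calibration batch.
fp_outputs = {"conv1": [1.0, 2.0], "conv2": [0.5, -0.5], "fc": [3.0, 1.0], "head": [0.2, 0.4]}
q_outputs = {"conv1": [1.0, 2.5], "conv2": [0.5, -0.25], "fc": [2.0, 1.0], "head": [0.25, 0.5]}
print(split_layers(fp_outputs, q_outputs))
# (['head', 'conv2'], ['conv1', 'fc'])
```

Reversing the sort order would give the more intuitive "fix the worst layers first" rule that the abstract reports as less effective.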

[CV-59] Retrieval-Augmented Prompt for OOD Detection

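[Quick Read]: This paper addresses the lack of sufficient semantic supervision during training in current Out-of-Distribution (OOD) detection methods: they rely on auxiliary outlier samples or in-distribution (ID) data to generate outlier information, but these outliers are limited and mismatched with real test-time OOD samples, which limits performance. The key is Retrieval-Augmented Prompt (RAP), which retrieves descriptive words semantically related to outliers from external textual knowledge and uses them to augment the OOD prompts of a pre-trained vision-language model. This gives stronger semantic supervision during training, and at test time the prompts are updated in real time based on the OOD samples actually encountered, enabling fast adaptation and more accurate OOD detection.

Link: https://arxiv.org/abs/2508.10556
Authors: Ruisong Han, Zongbo Han, Jiahao Zhang, Mingyue Cheng, Changqing Zhang
Affiliations: Tianjin University; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: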

Abstract:Out-of-Distribution (OOD) detection is crucial for the reliable deployment of machine learning models in-the-wild, enabling accurate identification of test samples that differ from the training data distribution. Existing methods rely on auxiliary outlier samples or in-distribution (ID) data to generate outlier information for training, but due to limited outliers and their mismatch with real test OOD samples, they often fail to provide sufficient semantic supervision, leading to suboptimal performance. To address this, we propose a novel OOD detection method called Retrieval-Augmented Prompt (RAP). RAP augments a pre-trained vision-language model’s prompts by retrieving external knowledge, offering enhanced semantic supervision for OOD detection. During training, RAP retrieves descriptive words for outliers based on joint similarity with external textual knowledge and uses them to augment the model’s OOD prompts. During testing, RAP dynamically updates OOD prompts in real-time based on the encountered OOD samples, enabling the model to rapidly adapt to the test environment. Our extensive experiments demonstrate that RAP achieves state-of-the-art performance on large-scale OOD detection benchmarks. For example, in 1-shot OOD detection on the ImageNet-1k dataset, RAP reduces the average FPR95 by 7.05% and improves the AUROC by 1.71% compared to previous methods. Additionally, comprehensive ablation studies validate the effectiveness of each module and the underlying motivations of our approach.
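
Since the headline result is reported as FPR95, a minimal sketch of that metric may help. It assumes the usual convention that higher scores mean "more in-distribution" and that the threshold is set so at least 95% of ID samples are accepted; the function name and the scores are made up, and the paper's scoring function is not reproduced.

```python
import math
from typing import Sequence


def fpr_at_tpr(id_scores: Sequence[float], ood_scores: Sequence[float], tpr: float = 0.95) -> float:
    """Fraction of OOD samples accepted as ID at the threshold that keeps
    at least `tpr` of the ID samples (higher score = more ID-like)."""
    ranked = sorted(id_scores, reverse=True)
    n_keep = math.ceil(tpr * len(ranked))  # number of ID samples that must pass
    threshold = ranked[n_keep - 1]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


# Hypothetical scores: 20 ID samples at 0.05, 0.10, ..., 1.00 and 4 OOD samples.
id_scores = [i / 20 for i in range(1, 21)]
ood_scores = [0.02, 0.08, 0.3, 0.7]
print(fpr_at_tpr(id_scores, ood_scores))  # 0.5
```

Here the threshold lands at 0.1 (19 of 20 ID samples pass), and two of the four OOD samples score above it. A lower FPR95 is better.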

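[CV-60] AR Surgical Navigation With Surface Tracing: Comparing In-Situ Visualization with Tool-Tracking Guidance for Neurosurgical Applications

[Quick Read]: This paper addresses depth perception errors in Augmented Reality (AR) surgical navigation caused by the vergence-accommodation conflict and the limited occlusion handling of current displays, which are especially problematic in surgical settings that demand high precision. The key is a novel surface tracing registration method that relies only on the onboard sensors of the Microsoft HoloLens 2, combined with real-time infrared tool tracking, to register target anatomy and give live guidance during simulated placement of external ventricular drain catheters. Comparing static in-situ visualization with real-time tool-tracking guidance, the experiments show that the latter improves insertion accuracy, angular error, and depth accuracy and is preferred by users, supporting the feasibility of AR surgical navigation that depends only on the headset's own sensors.

Link: https://arxiv.org/abs/2508.10554
Authors: Marc J. Fischer, Jeffrey Potts, Gabriel Urreola, Dax Jones, Paolo Palmisciano, E. Bradley Strong, Branden Cord, Andrew D. Hernandez, Julia D. Sharma, E. Brandon Strong
Affiliations: University of California, Davis; University of Oklahoma; Xironetic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, will be published at ISMAR 2025 (accepted)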

Abstract:Augmented Reality (AR) surgical navigation systems are emerging as the next generation of intraoperative surgical guidance, promising to overcome limitations of traditional navigation systems. However, known issues with AR depth perception due to vergence-accommodation conflict and occlusion handling limitations of the currently commercially available display technology present acute challenges in surgical settings where precision is paramount. This study presents a novel methodology for utilizing AR guidance to register anatomical targets and provide real-time instrument navigation using placement of simulated external ventricular drain catheters on a phantom model as the clinical scenario. The system registers target positions to the patient through a novel surface tracing method and uses real-time infrared tool tracking to aid in catheter placement, relying only on the onboard sensors of the Microsoft HoloLens 2. A group of intended users performed the procedure of simulated insertions under two AR guidance conditions: static in-situ visualization, where planned trajectories are overlaid directly onto the patient anatomy, and real-time tool-tracking guidance, where live feedback of the catheter’s pose is provided relative to the plan. Following the insertion tests, computed tomography scans of the phantom models were acquired, allowing for evaluation of insertion accuracy, target deviation, angular error, and depth precision. System Usability Scale surveys assessed user experience and cognitive workload. Tool-tracking guidance improved performance metrics across all accuracy measures and was preferred by users in subjective evaluations. A free copy of this paper and all supplemental materials are available at this https URL.
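
Two of the accuracy measures named in the abstract, angular error and target deviation, have simple geometric definitions that can be sketched as follows. The function names, the coordinate convention, and the example numbers are assumptions for illustration; the study's actual measurement pipeline from the CT scans is not described in the abstract.

```python
import math
from typing import Sequence


def angular_error_deg(planned: Sequence[float], actual: Sequence[float]) -> float:
    """Angle in degrees between two non-zero 3-D direction vectors."""
    dot = sum(p * a for p, a in zip(planned, actual))
    norms = math.hypot(*planned) * math.hypot(*actual)
    cosine = max(-1.0, min(1.0, dot / norms))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


def target_deviation(planned_tip: Sequence[float], actual_tip: Sequence[float]) -> float:
    """Euclidean distance between planned and achieved catheter tip positions."""
    return math.dist(planned_tip, actual_tip)


# Hypothetical trajectory directions and tip positions (same length unit throughout).
print(angular_error_deg((0.0, 0.0, 1.0), (0.0, 1.0, 1.0)))       # approximately 45 degrees
print(target_deviation((0.0, 0.0, 60.0), (3.0, 4.0, 60.0)))      # 5.0
```

Both functions require Python 3.8 or later, because `math.dist` and multi-argument `math.hypot` were added in that release.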

[CV-61] PSScreen: Partially Supervised Multiple Retinal Disease Screening BMVC2025

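[Quick Read]: This paper addresses two challenges in training a multi-disease retinal screening model from multiple partially labeled datasets: large differences in data distribution across medical sites (domain shift) and missing labels for some classes (the label-absent issue). The key is PSScreen, whose main ideas are: (1) a two-stream architecture in which one stream learns deterministic features and the other learns probabilistic features via uncertainty injection; (2) textual guidance that decouples both feature types into disease-wise features, which are aligned through feature distillation to improve domain generalization; and (3) pseudo-label consistency between the two streams to mitigate missing labels, plus self-distillation that transfers task-relevant semantics about known classes from the deterministic stream to the probabilistic stream. Together these improve detection of six retinal diseases and the normal state.

Link: https://arxiv.org/abs/2508.10549
Authors: Boyi Zheng, Qing Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at BMVC 2025 (Oral)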

Abstract:Leveraging multiple partially labeled datasets to train a model for multiple retinal disease screening reduces the reliance on fully annotated datasets, but remains challenging due to significant domain shifts across training datasets from various medical sites, and the label absent issue for partial classes. To solve these challenges, we propose PSScreen, a novel Partially Supervised multiple retinal disease Screening model. Our PSScreen consists of two streams and one learns deterministic features and the other learns probabilistic features via uncertainty injection. Then, we leverage the textual guidance to decouple two types of features into disease-wise features and align them via feature distillation to boost the domain generalization ability. Meanwhile, we employ pseudo label consistency between two streams to address the label absent issue and introduce a self-distillation to transfer task-relevant semantics about known classes from the deterministic to the probabilistic stream to further enhance the detection performances. Experiments show that our PSScreen significantly enhances the detection performances on six retinal diseases and the normal state averagely and achieves state-of-the-art results on both in-domain and out-of-domain datasets. Codes are available at this https URL.
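
To make the label-absent issue concrete, the sketch below shows the common baseline for partially labeled multi-label training: a binary cross-entropy that simply skips classes with no annotation. This is not PSScreen's method, which the abstract says uses pseudo-label consistency and self-distillation on top; the function name, the `None` convention for missing labels, and the numbers are illustrative assumptions.

```python
import math
from typing import Optional, Sequence


def masked_bce(probs: Sequence[float], labels: Sequence[Optional[int]], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over annotated classes only.

    labels[i] is 1 or 0 when class i is annotated, or None when the source
    dataset carries no label for that class.
    """
    losses = []
    for p, y in zip(probs, labels):
        if y is None:
            continue  # unlabeled class contributes no supervision
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical predictions for three diseases; the second has no label
# in the dataset this sample came from.
probs = [0.9, 0.2, 0.6]
labels = [1, None, 0]
print(masked_bce(probs, labels))  # approximately 0.511
```

With plain masking, the second class receives no gradient from this sample at all, which is the gap that consistency-based supervision on unlabeled classes is meant to fill.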

[CV-62] GCRPNet: Graph-Enhanced Contextual and Regional Perception Network For Salient Object Detection in Optical Remote Sensing Images

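[Quick Read]: This paper addresses two challenges for salient object detection (SOD) in optical remote sensing images (ORSIs): large variation in target scale and low contrast between targets and background. Existing methods based on vision transformers (ViTs) and convolutional neural networks (CNNs) try to combine global and local features but struggle to integrate these heterogeneous features, which limits overall performance. The key is the Graph-enhanced Contextual and Regional Perception Network (GCRPNet), built on the Mamba architecture, with two core modules. The Difference-Similarity Guided Hierarchical Graph Attention Module (DS-HGAM) strengthens interaction between features at different scales and improves structural perception. The LEVSS decoder block integrates an adaptive scanning strategy with the Multi-Granularity Collaborative Attention Enhancement Module (MCAEM) and performs adaptive patch scanning on feature maps processed by multi-scale convolutions, which captures richer local region information and improves Mamba's local modeling.

Link: https://arxiv.org/abs/2508.10542
Authors: Mengyu Ren, Yutong Li, Hua Li, Runmin Cong, Sam Kwong
Affiliations: Hainan University; Shandong University; Lingnan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: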

Abstract:Salient object detection (SOD) in optical remote sensing images (ORSIs) faces numerous challenges, including significant variations in target scales and low contrast between targets and the background. Existing methods based on vision transformers (ViTs) and convolutional neural networks (CNNs) architectures aim to leverage both global and local features, but the difficulty in effectively integrating these heterogeneous features limits their overall performance. To overcome these limitations, we propose a graph-enhanced contextual and regional perception network (GCRPNet), which builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. Specifically, we employ the visual state space (VSS) encoder to extract multi-scale features. To further achieve deep guidance and enhancement of these features, we first design a difference-similarity guided hierarchical graph attention module (DS-HGAM). This module strengthens cross-layer interaction capabilities between features of different scales while enhancing the model’s structural perception, allowing it to distinguish between foreground and background more effectively. Then, we design the LEVSS block as the decoder of GCRPNet. This module integrates our proposed adaptive scanning strategy and multi-granularity collaborative attention enhancement module (MCAEM). It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information and enhancing Mamba’s local modeling capability. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness and superiority.

[CV-63] Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset

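[Quick read]: This paper addresses three challenges in medical image grounding: limited modality coverage, coarse-grained annotations, and the lack of a unified, generalizable grounding framework. The key to the solution is Med-GLIP-5M, a large-scale multimodal medical grounding dataset with over 5.3 million region-level annotations that span seven imaging modalities and carry hierarchical labels from organs down to fine-grained lesions, together with Med-GLIP, a modality-aware grounding framework. Trained on this diverse data, the framework implicitly learns multi-level semantic understanding and recognizes multi-granularity structures (for example, distinguishing lungs from pneumonia lesions) without explicitly designed expert modules. It outperforms existing methods on multiple benchmarks and improves downstream tasks such as medical visual question answering and report generation.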

Link: https://arxiv.org/abs/2508.10528
Authors: Ziye Deng, Ruihan He, Jiaxiang Liu, Yuan Wang, Zijie Meng, Songtao Jiang, Yong Xie, Zuozhu Liu
Affiliations: Zhejiang University; Nanjing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Medical image grounding aims to align natural language phrases with specific regions in medical images, serving as a foundational task for intelligent diagnosis, visual question answering (VQA), and automated report generation (MRG). However, existing research is constrained by limited modality coverage, coarse-grained annotations, and the absence of a unified, generalizable grounding framework. To address these challenges, we construct a large-scale medical grounding dataset Med-GLIP-5M comprising over 5.3 million region-level annotations across seven imaging modalities, covering diverse anatomical structures and pathological findings. The dataset supports both segmentation and grounding tasks with hierarchical region labels, ranging from organ-level boundaries to fine-grained lesions. Based on this foundation, we propose Med-GLIP, a modality-aware grounding framework trained on Med-GLIP-5M. Rather than relying on explicitly designed expert modules, Med-GLIP implicitly acquires hierarchical semantic understanding from diverse training data – enabling it to recognize multi-granularity structures, such as distinguishing lungs from pneumonia lesions. Extensive experiments demonstrate that Med-GLIP consistently outperforms state-of-the-art baselines across multiple grounding benchmarks. Furthermore, integrating its spatial outputs into downstream tasks, including medical VQA and report generation, leads to substantial performance gains. Our dataset will be released soon.

[CV-64] Reasoning in Computer Vision: Taxonomy, Models, Tasks, and Methodologies

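[Quick read]: This paper addresses the lack of unified analysis and comparison in current visual reasoning research: existing surveys tend to treat relational, symbolic, temporal, causal, and commonsense reasoning in isolation and do not systematically compare their methodologies, architectures, and evaluation protocols. The key to the solution is a five-type taxonomy of visual reasoning, an in-depth analysis of implementations based on graph models, memory networks, attention mechanisms, and neuro-symbolic systems, and a review of how evaluation protocols assess functional correctness, structural consistency, and causal validity, along with their limitations. Together these provide a theoretical foundation and practical guidance for building scalable, interpretable, cross-domain next-generation vision systems.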

Link: https://arxiv.org/abs/2508.10523
Authors: Ayushman Sarkar, Mohd Yamani Idna Idris, Zhenyu Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual reasoning is critical for a wide range of computer vision tasks that go beyond surface-level object detection and classification. Despite notable advances in relational, symbolic, temporal, causal, and commonsense reasoning, existing surveys often address these directions in isolation, lacking a unified analysis and comparison across reasoning types, methodologies, and evaluation protocols. This survey aims to address this gap by categorizing visual reasoning into five major types (relational, symbolic, temporal, causal, and commonsense) and systematically examining their implementation through architectures such as graph-based models, memory networks, attention mechanisms, and neuro-symbolic systems. We review evaluation protocols designed to assess functional correctness, structural consistency, and causal validity, and critically analyze their limitations in terms of generalizability, reproducibility, and explanatory power. Beyond evaluation, we identify key open challenges in visual reasoning, including scalability to complex scenes, deeper integration of symbolic and neural paradigms, the lack of comprehensive benchmark datasets, and reasoning under weak supervision. Finally, we outline a forward-looking research agenda for next-generation vision systems, emphasizing that bridging perception and reasoning is essential for building transparent, trustworthy, and cross-domain adaptive AI systems, particularly in critical domains such as autonomous driving and medical diagnostics.

[CV-65] EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba ICCV2025

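[Quick read]: This paper addresses joint estimation of human dance motion from two input modalities, egocentric video and music, a task that remains largely unexplored. The core challenges are that the egocentric view often occludes most of the body, which makes full-body pose estimation difficult, and that the generated motion must stay consistent with both the visual input and the musical rhythm. The key to the solution is a new model, the EgoMusic Motion Network, whose core Skeleton Mamba module explicitly models the human skeleton structure and combines the strengths of diffusion models and the Mamba architecture to capture temporal dynamics efficiently. The authors also build EgoAIST++, a large-scale dataset with more than 36 hours of dance motion, to support training and validation. Experiments show that the method clearly outperforms state-of-the-art approaches and generalizes well to real-world data.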

Link: https://arxiv.org/abs/2508.10522
Authors: Quang Nguyen, Nhat Le, Baoru Huang, Minh Nhat Vu, Chengcheng Tang, Van Nguyen, Ngan Le, Thieu Vo, Anh Nguyen
Affiliations: FPT Software AI Center; The University of Western Australia; TU Wien; Meta; University of Arkansas; National University of Singapore; University of Liverpool
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at The 2025 IEEE/CVF International Conference on Computer Vision (ICCV 2025)

Abstract: Estimating human dance motion is a challenging task with various industrial applications. Recently, many efforts have focused on predicting human dance motion using either egocentric video or music as input. However, the task of jointly estimating human motion from both egocentric video and music remains largely unexplored. In this paper, we aim to develop a new method that predicts human dance motion from both egocentric video and music. In practice, the egocentric view often obscures much of the body, making accurate full-pose estimation challenging. Additionally, incorporating music requires the generated head and body movements to align well with both visual and musical inputs. We first introduce EgoAIST++, a new large-scale dataset that combines both egocentric views and music with more than 36 hours of dancing motion. Drawing on the success of diffusion models and Mamba in modeling sequences, we develop an EgoMusic Motion Network with a core Skeleton Mamba that explicitly captures the skeleton structure of the human body. We show that our approach is theoretically supported. Extensive experiments show that our method clearly outperforms state-of-the-art approaches and generalizes effectively to real-world data.

[CV-66] A Segmentation-driven Editing Method for Bolt Defect Augmentation and Detection

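[Quick read]: This paper addresses the limited performance of bolt defect detection on transmission lines caused by scarce defect images and imbalanced data distributions. The key to the solution is a segmentation-driven bolt defect editing method (SBDE) with three components. First, a bolt attribute segmentation model (Bolt-SAM) uses a CLAHE-FFT Adapter (CFA) and a Multipart-Aware Mask Decoder (MAMD) to improve segmentation of complex bolt attributes and generate high-quality masks. Second, a mask optimization module (MOD) is integrated with the image inpainting model LaMa to form MOD-LaMa, a defect attribute editing model that converts normal bolts into defective ones through attribute editing. Third, an editing recovery augmentation (ERA) strategy restores the edited defective bolts and places them back into the original inspection scenes, which expands the defect detection dataset. Experiments show that the defect images generated by SBDE significantly outperform those of existing image editing models and improve defect detection performance, supporting the effectiveness and application potential of the method.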

Link: https://arxiv.org/abs/2508.10509
Authors: Yangjie Xiao, Ke Zhang, Jiacun Wang, Xin Sheng, Yurong Guo, Meijuan Chen, Zehua Ren, Zhaoye Zheng, Zhenbing Zhao
Affiliations: North China Electric Power University; Monmouth University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Bolt defect detection is critical to ensure the safety of transmission lines. However, the scarcity of defect images and imbalanced data distributions significantly limit detection performance. To address this problem, we propose a segmentation-driven bolt defect editing method (SBDE) to augment the dataset. First, a bolt attribute segmentation model (Bolt-SAM) is proposed, which enhances the segmentation of complex bolt attributes through the CLAHE-FFT Adapter (CFA) and Multipart-Aware Mask Decoder (MAMD), generating high-quality masks for subsequent editing tasks. Second, a mask optimization module (MOD) is designed and integrated with the image inpainting model (LaMa) to construct the bolt defect attribute editing model (MOD-LaMa), which converts normal bolts into defective ones through attribute editing. Finally, an editing recovery augmentation (ERA) strategy is proposed to recover and put the edited defect bolts back into the original inspection scenes and expand the defect detection dataset. We constructed multiple bolt datasets and conducted extensive experiments. Experimental results demonstrate that the bolt defect images generated by SBDE significantly outperform state-of-the-art image editing models, and effectively improve the performance of bolt defect detection, which fully verifies the effectiveness and application potential of the proposed method. The code of the project is available at this https URL.

[CV-67] Multi-Sample Anti-Aliasing and Constrained Optimization for 3D Gaussian Splatting

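[Quick read]: This paper addresses the blurred fine detail that 3D Gaussian splatting produces when geometric constraints are insufficient during scene optimization, especially in regions with high-frequency textures and sharp boundaries. The key to the solution is a comprehensive optimization framework that combines multisample anti-aliasing (MSAA) with dual geometric constraints. Pixel colors are computed by adaptively blending four subsamples, which reduces aliasing in high-frequency components. Two constraints are then introduced: an adaptive weighting strategy based on dynamic gradient analysis that prioritizes under-reconstructed regions, and a gradient differential constraint that applies geometric regularization at object boundaries. The method directs computation to critical regions while maintaining global consistency, improves detail preservation while keeping real-time rendering efficiency, and outperforms baseline methods in structural similarity (SSIM) and perceptual quality (LPIPS).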

Link: https://arxiv.org/abs/2508.10507
Authors: Zheng Zhou, Jia-Chen Zhang, Yu-Jie Xiong, Chun-Ming Xia
Affiliations: SUES (Shanghai University of Engineering Science)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in 3D Gaussian splatting have significantly improved real-time novel view synthesis, yet insufficient geometric constraints during scene optimization often result in blurred reconstructions of fine-grained details, particularly in regions with high-frequency textures and sharp discontinuities. To address this, we propose a comprehensive optimization framework integrating multisample anti-aliasing (MSAA) with dual geometric constraints. Our system computes pixel colors through adaptive blending of quadruple subsamples, effectively reducing aliasing artifacts in high-frequency components. The framework introduces two constraints: (a) an adaptive weighting strategy that prioritizes under-reconstructed regions through dynamic gradient analysis, and (b) gradient differential constraints enforcing geometric regularization at object boundaries. This targeted optimization enables the model to allocate computational resources preferentially to critical regions requiring refinement while maintaining global consistency. Extensive experimental evaluations across multiple benchmarks demonstrate that our method achieves state-of-the-art performance in detail preservation, particularly in preserving high-frequency textures and sharp discontinuities, while maintaining real-time rendering efficiency. Quantitative metrics and perceptual studies confirm statistically significant improvements over baseline approaches in both structural similarity (SSIM) and perceptual quality (LPIPS).
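The subsample blending step can be illustrated with a minimal sketch. It assumes a hypothetical `shade(x, y)` callable that returns a grayscale value at continuous image coordinates, a fixed rotated-grid pattern of four sample offsets, and equal weights by default. The paper describes its blending as adaptive, and the abstract does not give the weights, so the `weights` argument here is only a placeholder for whatever scheme the authors use.

```python
from typing import Callable, Sequence, Tuple

# Four sub-pixel sample offsets inside a unit pixel (a rotated-grid pattern).
# This layout is an assumption for illustration, not taken from the paper.
OFFSETS: Tuple[Tuple[float, float], ...] = (
    (0.375, 0.125),
    (0.875, 0.375),
    (0.125, 0.625),
    (0.625, 0.875),
)


def msaa_pixel(
    shade: Callable[[float, float], float],
    px: int,
    py: int,
    weights: Sequence[float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Blend four sub-pixel samples of `shade` into one pixel value."""
    total = sum(weights)
    acc = 0.0
    for (dx, dy), w in zip(OFFSETS, weights):
        acc += w * shade(px + dx, py + dy)
    return acc / total


def hard_edge(x: float, y: float) -> float:
    """Hypothetical scene: white to the left of x = 0.5, black to the right."""
    return 1.0 if x < 0.5 else 0.0


if __name__ == "__main__":
    # The edge crosses pixel (0, 0), so its value falls between 0 and 1.
    print(msaa_pixel(hard_edge, 0, 0))
    # Pixel (1, 0) lies entirely on the dark side of the edge.
    print(msaa_pixel(hard_edge, 1, 0))
```

With a single sample per pixel the edge pixel would be either fully white or fully black; averaging several sub-pixel samples yields an intermediate value, which is the anti-aliasing effect the abstract refers to.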

[CV-68] TweezeEdit: Consistent and Efficient Image Editing with Path Regularization

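[Quick read]: This paper addresses insufficient semantic preservation and overly long editing paths in current text-guided image editing. Existing methods generate the target image, explicitly or implicitly, from the inversion noise of the source image (the inversion anchors); they tend to over-align with the target prompt while neglecting source image semantics, and they edit inefficiently. The key to the solution is TweezeEdit, a tuning-free and inversion-free framework whose core idea is to regularize the entire denoising path with a gradient-driven strategy instead of relying only on inversion anchors, which preserves source semantics and shortens the editing path. Experiments show that the method completes high-quality edits in only 12 steps (about 1.6 seconds per edit), indicating potential for real-time applications.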

Link: https://arxiv.org/abs/2508.10498
Authors: Jianda Mao, Kaibo Wang, Yang Xiang, Kani Chen
Affiliations: The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large-scale pre-trained diffusion models empower users to edit images through text guidance. However, existing methods often over-align with target prompts while inadequately preserving source image semantics. Such approaches generate target images explicitly or implicitly from the inversion noise of the source images, termed the inversion anchors. We identify this strategy as suboptimal for semantic preservation and inefficient due to elongated editing paths. We propose TweezeEdit, a tuning- and inversion-free framework for consistent and efficient image editing. Our method addresses these limitations by regularizing the entire denoising path rather than relying solely on the inversion anchors, ensuring source semantic retention and shortening editing paths. Guided by gradient-driven regularization, we efficiently inject target prompt semantics along a direct path using a consistency model. Extensive experiments demonstrate TweezeEdit’s superior performance in semantic preservation and target alignment, outperforming existing methods. Remarkably, it requires only 12 steps (1.6 seconds per edit), underscoring its potential for real-time applications.

[CV-69] On the Complexity-Faithfulness Trade-off of Gradient-Based Explanations

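[Quick read]: This paper addresses the noisy, hard-to-interpret gradient-based explanations that ReLU networks produce on visual tasks, which stem from sharp transitions and sensitivity to individual pixels. Existing methods such as GradCAM smooth the explanations by building surrogate models, but often at the cost of faithfulness. The paper proposes a unifying spectral framework for systematically analyzing and quantifying the trade-off between smoothness and faithfulness in explanations. The key is to use this framework to quantify and regularize the contribution of ReLU networks to high-frequency information, giving a principled way to identify and control the trade-off, and to define and measure the "explanation gap" introduced by different post-hoc methods.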

Link: https://arxiv.org/abs/2508.10490
Authors: Amir Mehrpanah, Matteo Gamba, Kevin Smith, Hossein Azizpour
Affiliations: KTH Royal Institute of Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 14 figures, to be published in International Conference on Computer Vision 2025

Abstract: ReLU networks, while prevalent for visual data, have sharp transitions, sometimes relying on individual pixels for predictions, making vanilla gradient-based explanations noisy and difficult to interpret. Existing methods, such as GradCAM, smooth these explanations by producing surrogate models at the cost of faithfulness. We introduce a unifying spectral framework to systematically analyze and quantify smoothness, faithfulness, and their trade-off in explanations. Using this framework, we quantify and regularize the contribution of ReLU networks to high-frequency information, providing a principled approach to identifying this trade-off. Our analysis characterizes how surrogate-based smoothing distorts explanations, leading to an “explanation gap” that we formally define and measure for different post-hoc methods. Finally, we validate our theoretical findings across different design choices, datasets, and ablations.

[CV-70] STAMP: Multi-pattern Attention-aware Multiple Instance Learning for STAS Diagnosis in Multi-center Histopathology Images AAAI2026

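[Quick read]: This paper addresses the difficulty of diagnosing spread through air spaces (STAS) in lung adenocarcinoma (LUAD). Conventional pathological assessment depends on manual slide reading, which is labor-intensive and prone to missed diagnoses and misdiagnoses. To improve the accuracy and efficiency of STAS identification, the authors propose STAMP, a multi-pattern attention-aware multiple instance learning framework. Its main ideas are a dual-branch architecture that learns STAS-related pathological features from distinct semantic spaces; Transformer-based instance encoding with a multi-pattern attention aggregation module that dynamically selects regions closely associated with STAS, suppressing irrelevant noise and strengthening the discriminative power of the global representation; and a similarity regularization constraint that reduces feature redundancy between branches, improving overall diagnostic accuracy.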

Link: https://arxiv.org/abs/2508.10473
Authors: Liangrui Pan, xiaoyu Li, Guang Zhu, Guanting Li, Ruixin Wang, Jiadi Luo, Yaning Yang, Liang qingchun, Shaoliang Peng
Affiliations: Hunan University; Central South University; Xiangtan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: Submitted to AAAI 2026

Abstract: Spread through air spaces (STAS) constitutes a novel invasive pattern in lung adenocarcinoma (LUAD), associated with tumor recurrence and diminished survival rates. However, large-scale STAS diagnosis in LUAD remains a labor-intensive endeavor, compounded by the propensity for oversight and misdiagnosis due to its distinctive pathological characteristics and morphological features. Consequently, there is a pressing clinical imperative to leverage deep learning models for STAS diagnosis. This study first assembled histopathological images from STAS patients at the Second Xiangya Hospital and the Third Xiangya Hospital of Central South University, alongside the TCGA-LUAD cohort. Three senior pathologists conducted cross-verification annotations to construct the STAS-SXY, STAS-TXY, and STAS-TCGA datasets. We then propose a multi-pattern attention-aware multiple instance learning framework, named STAMP, to analyze and diagnose the presence of STAS across multi-center histopathology images. Specifically, the dual-branch architecture guides the model to learn STAS-associated pathological features from distinct semantic spaces. Transformer-based instance encoding and a multi-pattern attention aggregation module dynamically select regions closely associated with STAS pathology, suppressing irrelevant noise and enhancing the discriminative power of global representations. Moreover, a similarity regularization constraint prevents feature redundancy across branches, thereby improving overall diagnostic accuracy. Extensive experiments demonstrated that STAMP achieved competitive diagnostic results on STAS-SXY, STAS-TXY and STAS-TCGA, with AUCs of 0.8058, 0.8017, and 0.7928, respectively, surpassing the clinical level.
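As a rough illustration of attention-based aggregation in multiple instance learning, the sketch below pools a bag of instance feature vectors into one bag-level vector using softmax attention weights. It is a generic single-pattern version, not STAMP's multi-pattern module: the `query` vector stands in for learned attention parameters, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def attention_pool(
    instances: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return (bag_vector, attention_weights) for one bag of instances.

    Each instance is scored by its dot product with `query` (a stand-in for
    learned parameters); the bag vector is the weighted sum of instances.
    """
    scores = [sum(a * b for a, b in zip(inst, query)) for inst in instances]
    weights = softmax(scores)
    dim = len(instances[0])
    bag = [
        sum(w * inst[d] for w, inst in zip(weights, instances))
        for d in range(dim)
    ]
    return bag, weights


if __name__ == "__main__":
    # Two toy patch embeddings; the query favors the first feature dimension.
    patches = [[1.0, 0.0], [0.0, 1.0]]
    bag, weights = attention_pool(patches, query=[10.0, 0.0])
    print(bag, weights)
```

Instances whose scores are high dominate the bag representation, which is how attention pooling can emphasize the few patches relevant to a slide-level label and down-weight the rest.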

[CV-71] Enhanced Sparse Point Cloud Data Processing for Privacy-aware Human Action Recognition

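[Quick read]: This paper addresses the low recognition accuracy of millimeter-wave (mmWave) radar in human action recognition (HAR), which results from sparse and noisy point cloud data. Prior work commonly applies three processing methods to improve the quality and continuity of radar data: density-based clustering (DBSCAN), the Hungarian Algorithm, and Kalman Filtering. A systematic evaluation of these methods, alone and in combination, has been missing. Using the MiliPoint dataset, the paper evaluates each method individually, every pairwise combination, and all three together; proposes targeted enhancements to each method to improve accuracy; and quantifies the trade-off between recognition accuracy and computational cost. The results offer empirical guidance for designing future mmWave radar HAR systems.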

Link: https://arxiv.org/abs/2508.10469
Authors: Maimunatu Tunau, Vincent Gbouna Zakka, Zhuangzhuang Dai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Human Action Recognition (HAR) plays a crucial role in healthcare, fitness tracking, and ambient assisted living technologies. While traditional vision-based HAR systems are effective, they pose privacy concerns. mmWave radar sensors offer a privacy-preserving alternative but present challenges due to the sparse and noisy nature of their point cloud data. In the literature, three primary data processing methods, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the Hungarian Algorithm, and Kalman Filtering, have been widely used to improve the quality and continuity of radar data. However, a comprehensive evaluation of these methods, both individually and in combination, remains lacking. This paper addresses that gap by conducting a detailed performance analysis of the three methods using the MiliPoint dataset. We evaluate each method individually, all possible pairwise combinations, and the combination of all three, assessing both recognition accuracy and computational cost. Furthermore, we propose targeted enhancements to the individual methods aimed at improving accuracy. Our results provide crucial insights into the strengths and trade-offs of each method and their integrations, guiding future work on mmWave-based HAR systems.
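Of the three methods, Kalman filtering is the easiest to show in a few lines. The sketch below is a scalar filter with a constant-position model, applied to one coordinate of a tracked point. The noise variances are arbitrary illustrative values, and the setup is an assumption for demonstration; it is not the configuration evaluated in the paper.

```python
from typing import List, Sequence


def kalman_1d(
    measurements: Sequence[float],
    process_var: float = 1e-3,
    meas_var: float = 0.25,
) -> List[float]:
    """Smooth a 1-D measurement sequence with a scalar Kalman filter.

    Uses a constant-position model: the state is predicted to stay where it
    is, with uncertainty growing by `process_var` at every step.
    """
    if not measurements:
        return []
    estimate = measurements[0]  # initialize the state from the first reading
    error = 1.0                 # initial estimate variance (assumed)
    smoothed = [estimate]
    for z in measurements[1:]:
        # Predict: state unchanged, uncertainty grows.
        error += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = error / (error + meas_var)
        estimate += gain * (z - estimate)
        error *= 1.0 - gain
        smoothed.append(estimate)
    return smoothed


if __name__ == "__main__":
    noisy_x = [1.0, 1.4, 0.6, 1.2, 0.8]  # hypothetical x-coordinates over time
    print(kalman_1d(noisy_x))
```

A tracking pipeline would typically run a filter like this per track and per coordinate, after clustering (for example with DBSCAN) and frame-to-frame association (for example with the Hungarian Algorithm) have decided which measurement belongs to which track.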

[CV-72] SingleStrip: learning skull-stripping from a single labeled example MICCAI2025

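[Quick read]: This paper addresses the dependence of skull-stripping in 3D brain magnetic resonance imaging (MRI) on large amounts of labeled data, and in particular the severe performance limits when very few labeled examples (as few as one) are available. The key to the solution is to combine domain randomization with self-training. First, voxel intensities are automatically binned to produce labels, from which synthetic training images are generated to train an initial model. Second, a convolutional autoencoder (AE) is trained on the single labeled example, and its reconstruction error is used to assess the quality of brain masks predicted for unlabeled data. Third, the top-ranking pseudo-labels are used to fine-tune the network, reaching skull-stripping performance on out-of-distribution data that approaches models trained with more labeled images. The approach makes learning from extremely limited labels more effective and offers a practical semi-supervised segmentation framework for studies of new anatomical structures or emerging imaging techniques.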

Link: https://arxiv.org/abs/2508.10464
Authors: Bella Specktor-Fadida, Malte Hoffmann
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted as an oral presentation to the MICCAI 2025 Data Engineering in Medical Imaging (DEMI) workshop

Abstract:Deep learning segmentation relies heavily on labeled data, but manual labeling is laborious and time-consuming, especially for volumetric images such as brain magnetic resonance imaging (MRI). While recent domain-randomization techniques alleviate the dependency on labeled data by synthesizing diverse training images from label maps, they offer limited anatomical variability when very few label maps are available. Semi-supervised self-training addresses label scarcity by iteratively incorporating model predictions into the training set, enabling networks to learn from unlabeled data. In this work, we combine domain randomization with self-training to train three-dimensional skull-stripping networks using as little as a single labeled example. First, we automatically bin voxel intensities, yielding labels we use to synthesize images for training an initial skull-stripping model. Second, we train a convolutional autoencoder (AE) on the labeled example and use its reconstruction error to assess the quality of brain masks predicted for unlabeled data. Third, we select the top-ranking pseudo-labels to fine-tune the network, achieving skull-stripping performance on out-of-distribution data that approaches models trained with more labeled images. We compare AE-based ranking to consistency-based ranking under test-time augmentation, finding that the AE approach yields a stronger correlation with segmentation accuracy. Our results highlight the potential of combining domain randomization and AE-based quality control to enable effective semi-supervised segmentation from extremely limited labeled data. This strategy may ease the labeling burden that slows progress in studies involving new anatomical structures or emerging imaging techniques.
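The pseudo-label selection step can be sketched as a ranking by reconstruction error. In the snippet below, each predicted mask is a flat list of voxel values and `reconstructions` holds the corresponding autoencoder outputs, supplied precomputed; the autoencoder itself is not implemented, and the use of mean squared error as the score is an assumption, since the abstract does not name the error measure.

```python
from typing import List, Sequence


def mean_squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared difference between two equally sized flat arrays."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def select_pseudo_labels(
    masks: Sequence[Sequence[float]],
    reconstructions: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k masks the autoencoder reconstructs best.

    A low reconstruction error is treated as a proxy for a plausible mask,
    so the list is ordered from lowest to highest error.
    """
    ranked = sorted(
        range(len(masks)),
        key=lambda i: mean_squared_error(masks[i], reconstructions[i]),
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy flattened masks and hypothetical autoencoder reconstructions.
    predicted = [[1, 1, 0], [1, 0, 0], [0, 1, 1]]
    rebuilt = [[1, 1, 0], [0, 1, 1], [0, 0.5, 1]]
    print(select_pseudo_labels(predicted, rebuilt, k=2))
```

The selected indices would then identify which unlabeled volumes, paired with their predicted masks, are added to the fine-tuning set.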

[CV-73] Multi-Label Plant Species Prediction with Metadata-Enhanced Multi-Head Vision Transformers

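[Quick read]: This paper addresses multi-species plant identification in vegetation quadrat images. Models are trained on single-species plant images but tested on quadrat images that contain multiple species, which creates a drastic domain shift. The key to the solution is a multi-head architecture on a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone, with separate classification heads for species, genus, and family that exploit the taxonomic hierarchy to improve generalization. The approach adds multi-scale tiling to capture plants at different scales, dynamic threshold optimization based on mean prediction length, and ensembling through bagging and Hydra model architectures. Inference also uses image cropping to remove non-plant artifacts, top-n filtering to constrain predictions, and logit thresholding. Trained on about 1.4 million images covering 7,806 species, the submission ranked 3rd on the PlantCLEF 2025 private leaderboard.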

Link: https://arxiv.org/abs/2508.10457
Authors: Hanna Herasimchyk, Robin Labryga, Tomislav Prusina
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted for publication at: LifeCLEF Lab at CLEF 2025 Working Notes, 2025, Madrid, Spain

Abstract:We present a multi-head vision transformer approach for multi-label plant species prediction in vegetation plot images, addressing the PlantCLEF 2025 challenge. The task involves training models on single-species plant images while testing on multi-species quadrat images, creating a drastic domain shift. Our methodology leverages a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone with multiple classification heads for species, genus, and family prediction, utilizing taxonomic hierarchies. Key contributions include multi-scale tiling to capture plants at different scales, dynamic threshold optimization based on mean prediction length, and ensemble strategies through bagging and Hydra model architectures. The approach incorporates various inference techniques including image cropping to remove non-plant artifacts, top-n filtering for prediction constraints, and logit thresholding strategies. Experiments were conducted on approximately 1.4 million training images covering 7,806 plant species. Results demonstrate strong performance, making our submission 3rd best on the private leaderboard. Our code is available at this https URL.
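The abstract does not spell out how the threshold is tuned from the mean prediction length, so the following is only one plausible reading: pick, from a set of candidate thresholds, the one whose average number of predicted species per image is closest to a target value. The function names, the candidate grid, and the target are all hypothetical.

```python
from typing import Sequence


def mean_prediction_length(
    scores: Sequence[Sequence[float]],
    threshold: float,
) -> float:
    """Average number of labels per image whose score meets the threshold."""
    counts = [sum(1 for s in row if s >= threshold) for row in scores]
    return sum(counts) / len(scores)


def pick_threshold(
    scores: Sequence[Sequence[float]],
    target_len: float,
    candidates: Sequence[float],
) -> float:
    """Choose the candidate threshold whose mean prediction length is
    closest to `target_len` (ties go to the earliest candidate)."""
    return min(
        candidates,
        key=lambda t: abs(mean_prediction_length(scores, t) - target_len),
    )


if __name__ == "__main__":
    # Per-image class scores for two toy images and three classes.
    toy_scores = [[0.9, 0.6, 0.2], [0.8, 0.3, 0.1]]
    print(pick_threshold(toy_scores, target_len=1.5, candidates=[0.1, 0.5, 0.7]))
```

Tying the threshold to an expected number of species per quadrat is a way to calibrate a multi-label classifier when the test distribution differs from the single-label training data.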

[CV-74] Trajectory-aware Shifted State Space Models for Online Video Super-Resolution

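[Quick read]: This paper addresses the limited long-range temporal modeling of online video super-resolution (VSR) methods that rely on a single previous frame for temporal alignment; such methods struggle to capture long-term dynamics across frames, which hurts reconstruction quality. The key to the solution is TS-Mamba, an architecture based on Trajectory-aware Shifted state space models (SSMs). It first constructs feature trajectories within the video to select the most similar tokens from previous frames. A Trajectory-aware Shifted Mamba Aggregation (TSMA) module then aggregates the selected tokens using SSM blocks built on Hilbert scanning and corresponding shift operations, which strengthen spatial continuity and compensate for scanning losses. A trajectory-aware loss supervises trajectory generation and improves token selection accuracy during training. On three widely used VSR test datasets, the method achieves state-of-the-art performance in most cases compared with six online VSR models, while reducing complexity by over 22.7% (in MACs).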

Link: https://arxiv.org/abs/2508.10453
Authors: Qiang Zhu, Xiandong Meng, Yuxian Jiang, Fan Zhang, David Bull, Shuyuan Zhu, Bing Zeng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Online video super-resolution (VSR) is an important technique for many real-world video processing applications, which aims to restore the current high-resolution video frame based on temporally previous frames. Most of the existing online VSR methods solely employ one neighboring previous frame to achieve temporal alignment, which limits long-range temporal modeling of videos. Recently, state space models (SSMs) have been proposed with linear computational complexity and a global receptive field, which significantly improve computational efficiency and performance. In this context, this paper presents a novel online VSR method based on Trajectory-aware Shifted SSMs (TS-Mamba), leveraging both long-term trajectory modeling and low-complexity Mamba to achieve efficient spatio-temporal information aggregation. Specifically, TS-Mamba first constructs the trajectories within a video to select the most similar tokens from the previous frames. Then, a Trajectory-aware Shifted Mamba Aggregation (TSMA) module consisting of the proposed shifted SSM blocks is employed to aggregate the selected tokens. The shifted SSM blocks are designed based on Hilbert scanning and corresponding shift operations to compensate for scanning losses and strengthen the spatial continuity of Mamba. Additionally, we propose a trajectory-aware loss function to supervise the trajectory generation, ensuring the accuracy of token selection when training our model. Extensive experiments on three widely used VSR test datasets demonstrate that compared with six online VSR benchmark models, our TS-Mamba achieves state-of-the-art performance in most cases and over 22.7% complexity reduction (in MACs). The source code for TS-Mamba will be available at this https URL.
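Hilbert scanning flattens a 2-D token grid into a 1-D sequence in which consecutive tokens are always spatial neighbors, a useful property when feeding image tokens to a sequence model such as Mamba. The sketch below implements the standard index-to-coordinate conversion for a Hilbert curve; the function names are illustrative, the grid side must be a power of two, and the paper's shift operations are not reproduced.

```python
from typing import List, Tuple


def hilbert_d2xy(n: int, d: int) -> Tuple[int, int]:
    """Map position `d` along a Hilbert curve to (x, y) on an n-by-n grid.

    `n` must be a power of two and `d` must lie in [0, n * n).
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the current quadrant so sub-curves connect.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_scan_order(n: int) -> List[Tuple[int, int]]:
    """Visit order of all cells in an n-by-n grid along the Hilbert curve."""
    return [hilbert_d2xy(n, d) for d in range(n * n)]


if __name__ == "__main__":
    # Scan order for a 4x4 token grid.
    print(hilbert_scan_order(4))
```

Unlike a row-major raster scan, where the last token of one row and the first token of the next are far apart, every step along this order moves to an adjacent cell.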

[CV-75] From Images to Perception: Emergence of Perceptual Properties by Reconstructing Images

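[Quick read]: This paper asks how a bio-inspired model can learn mechanisms of human visual perception, and in particular whether efficient, distortion-sensitive neural representations can emerge from image statistics alone. The key to the solution is PerceptNet, an end-to-end optimized architecture that models the pathway from the retina to the primary visual cortex (V1) and is trained on image reconstruction tasks: autoencoding, denoising, deblurring, and sparsity regularization. Although no perceptual information is used for initialization or supervision, the encoder stage (V1-like layer) shows the highest correlation with human perceptual judgments of image distortion, and this correlation peaks at moderate levels of noise, blur, and sparsity. The findings suggest that the biological visual system may be tuned to remove particular levels of distortion while maintaining moderate sparsity.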

Link: https://arxiv.org/abs/2508.10450
Authors: Pablo Hernández-Cámara, Jesus Malo, Valero Laparra
Affiliations: Image Processing Lab, University of Valencia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: A number of scientists have suggested that human visual perception may emerge from image statistics, shaping efficient neural representations in early vision. In this work, a bio-inspired architecture that can accommodate several known facts in the retina-V1 cortex, the PerceptNet, has been end-to-end optimized for different tasks related to image reconstruction: autoencoding, denoising, deblurring, and sparsity regularization. Our results show that the encoder stage (V1-like layer) consistently exhibits the highest correlation with human perceptual judgments on image distortion despite not using perceptual information in the initialization or training. This alignment exhibits an optimum for moderate noise, blur and sparsity. These findings suggest that the visual system may be tuned to remove those particular levels of distortion with that level of sparsity and that biologically inspired models can learn perceptual metrics without human supervision.

[CV-76] SkeySpot: Automating Service Key Detection for Digital Electrical Layout Plans in the Construction Industry

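[Quick read]: This paper addresses the difficulty of automating work with legacy electrical layout plans (ELPs) that survive only as scanned documents and are not machine-readable, which hinders large-scale building information modeling (BIM), facility management, and compliance checking. The key to the solution is a fully annotated Digitised Electrical Layout Plans (DELP) dataset, on which pretrained object detection models such as YOLOv8 achieve accurate symbol recognition. Building on this, the authors develop SkeySpot, a lightweight open-source toolkit for real-time detection, classification, and quantification of electrical symbols. It produces structured, standardized outputs that improve interoperability of building information across systems, reduces the dependence of small and medium-sized enterprises on proprietary CAD software and manual annotation, and supports standardization and sustainability in the construction industry.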

Link: https://arxiv.org/abs/2508.10449
Authors: Dhruv Dosi, Rohit Meena, Param Rajpura, Yogesh Kumar Meena
Affiliations: IIT Gandhinagar
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, preprint accepted in IEEE SMC 2025

Abstract: Legacy floor plans, often preserved only as scanned documents, remain essential resources for architecture, urban planning, and facility management in the construction industry. However, the lack of machine-readable floor plans renders large-scale interpretation both time-consuming and error-prone. Automated symbol spotting offers a scalable solution by enabling the identification of service key symbols directly from floor plans, supporting workflows such as cost estimation, infrastructure maintenance, and regulatory compliance. This work introduces a labelled Digitised Electrical Layout Plans (DELP) dataset comprising 45 scanned electrical layout plans annotated with 2,450 instances across 34 distinct service key classes. A systematic evaluation framework is proposed using pretrained object detection models for the DELP dataset. Among the models benchmarked, YOLOv8 achieves the highest performance with a mean Average Precision (mAP) of 82.5%. Using YOLOv8, we develop SkeySpot, a lightweight, open-source toolkit for real-time detection, classification, and quantification of electrical symbols. SkeySpot produces structured, standardised outputs that can be scaled up for interoperable building information workflows, ultimately enabling compatibility across downstream applications and regulatory platforms. By lowering dependency on proprietary CAD systems and reducing manual annotation effort, this approach makes the digitisation of electrical layouts more accessible to small and medium-sized enterprises (SMEs) in the construction industry, while supporting broader goals of standardisation, interoperability, and sustainability in the built environment.
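The quantification step, turning raw detections into per-class counts, can be sketched as follows. The `(label, confidence)` tuple format, the class names, and the confidence cutoff are all hypothetical; they do not reflect SkeySpot's actual output schema or YOLOv8's result objects.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_symbols(
    detections: Iterable[Tuple[str, float]],
    min_conf: float = 0.5,
) -> Counter:
    """Count detected symbols per class, ignoring low-confidence detections.

    Each detection is a (class_label, confidence) pair; this layout is an
    assumption made for the example.
    """
    return Counter(label for label, conf in detections if conf >= min_conf)


if __name__ == "__main__":
    # Hypothetical detections from one scanned electrical layout plan.
    plan_detections = [
        ("socket", 0.91),
        ("switch", 0.42),
        ("socket", 0.73),
        ("light", 0.88),
    ]
    for label, count in sorted(count_symbols(plan_detections).items()):
        print(f"{label}: {count}")
```

A per-class tally like this is the kind of structured output that downstream tasks such as cost estimation could consume directly.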

[CV-77] DOD-SA: Infrared-Visible Decoupled Object Detection with Single-Modality Annotations

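[Quick read]: This paper addresses the high annotation cost of infrared-visible object detection, where existing methods typically need annotations for both modalities in order to output detections for both, which limits efficiency and scalability in practice. The key to the solution is DOD-SA, a decoupled object detection framework that needs annotations for only one modality. It is built on a Single- and Dual-Modality Collaborative Teacher-Student Network (CoSD-TSNet), in which the teacher model generates pseudo-labels for the unlabeled modality to enable cross-modality knowledge transfer. A Progressive and Self-Tuning Training Strategy (PaST) trains the model in stages, and a Pseudo Label Assigner (PLA) mitigates modality misalignment during training. Together these improve detection performance while relying on single-modality annotations only.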

Link: https://arxiv.org/abs/2508.10445
Authors: Hang Jin, Chenqiang Gao, Junjie Guo, Fangcen Liu, Kanghui Tian, Qinyao Chang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures

Abstract: Infrared-visible object detection has shown great potential in real-world applications, enabling robust all-day perception by leveraging the complementary information of infrared and visible images. However, existing methods typically require dual-modality annotations to output detection results for both modalities during prediction, which incurs high annotation costs. To address this challenge, we propose a novel infrared-visible Decoupled Object Detection framework with Single-modality Annotations, called DOD-SA. The architecture of DOD-SA is built upon a Single- and Dual-Modality Collaborative Teacher-Student Network (CoSD-TSNet), which consists of a single-modality branch (SM-Branch) and a dual-modality decoupled branch (DMD-Branch). The teacher model generates pseudo-labels for the unlabeled modality, simultaneously supporting the training of the student model. The collaborative design enables cross-modality knowledge transfer from the labeled modality to the unlabeled modality, and facilitates effective SM-to-DMD branch supervision. To further improve the decoupling ability of the model and the pseudo-label quality, we introduce a Progressive and Self-Tuning Training Strategy (PaST) that trains the model in three stages: (1) pretraining SM-Branch, (2) guiding the learning of DMD-Branch by SM-Branch, and (3) refining DMD-Branch. In addition, we design a Pseudo Label Assigner (PLA) to align and pair labels across modalities, explicitly addressing modality misalignment during training. Extensive experiments on the DroneVehicle dataset demonstrate that our method outperforms state-of-the-art (SOTA) methods.

[CV-78] We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning

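[Quick read]: This paper addresses the weak performance of Multimodal Large Language Models (MLLMs) on complex mathematical reasoning, focusing on two aspects that existing work tends to overlook: comprehensive knowledge-driven design and model-centric data space modeling. The core of the solution is We-Math 2.0, a unified system with four contributions: (1) the MathBook Knowledge System, a five-level hierarchy covering 491 knowledge points and 1,819 fundamental principles; (2) MathBook-Standard and MathBook-Pro, where the latter defines a three-dimensional difficulty space and generates 7 progressive variants per problem to broaden coverage and raise difficulty; (3) MathBook-RL, a two-stage reinforcement learning (RL) framework consisting of Cold-Start Fine-tuning followed by Progressive Alignment RL, which uses average-reward learning and dynamic data scheduling; and (4) MathBookEval, a comprehensive benchmark covering all knowledge points with diverse reasoning step distributions. Together these form a closed loop from structured knowledge through refined data to an improved training mechanism; the abstract reports competitive results on four widely used benchmarks and strong results on MathBookEval.

Link: https://arxiv.org/abs/2508.10433
Authors: Runqi Qiao,Qiuna Tan,Peiqing Yang,Yanzi Wang,Xiaowan Wang,Enhui Wan,Sitong Zhou,Guanting Dong,Yuchen Zeng,Yida Xu,Jie Wang,Chong Sun,Chen Li,Honggang Zhang
Affiliations: BUPT; WeChat Vision, Tencent Inc.; Tsinghua University
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Working in progress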


Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various tasks, but still struggle with complex mathematical reasoning. Existing research primarily focuses on dataset construction and method optimization, often overlooking two critical aspects: comprehensive knowledge-driven design and model-centric data space modeling. In this paper, we introduce We-Math 2.0, a unified system that integrates a structured mathematical knowledge system, model-centric data space modeling, and a reinforcement learning (RL)-based training paradigm to comprehensively enhance the mathematical reasoning abilities of MLLMs. The key contributions of We-Math 2.0 are fourfold: (1) MathBook Knowledge System: We construct a five-level hierarchical system encompassing 491 knowledge points and 1,819 fundamental principles. (2) MathBook-Standard Pro: We develop MathBook-Standard, a dataset that ensures broad conceptual coverage and flexibility through dual expansion. Additionally, we define a three-dimensional difficulty space and generate 7 progressive variants per problem to build MathBook-Pro, a challenging dataset for robust training. (3) MathBook-RL: We propose a two-stage RL framework comprising: (i) Cold-Start Fine-tuning, which aligns the model with knowledge-oriented chain-of-thought reasoning; and (ii) Progressive Alignment RL, leveraging average-reward learning and dynamic data scheduling to achieve progressive alignment across difficulty levels. (4) MathBookEval: We introduce a comprehensive benchmark covering all 491 knowledge points with diverse reasoning step distributions. Experimental results show that MathBook-RL performs competitively with existing baselines on four widely-used benchmarks and achieves strong results on MathBookEval, suggesting promising generalization in mathematical reasoning.

[CV-79] CRISP: Contrastive Residual Injection and Semantic Prompting for Continual Video Instance Segmentation

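[Quick read]: This paper addresses three kinds of confusion in continual video instance segmentation (instance-wise, category-wise, and task-wise) while preserving temporal consistency across frames. The core solution is Contrastive Residual Injection and Semantic Prompting (CRISP), which rests on three ideas: 1) it models instance tracking with an instance correlation loss that emphasizes correlation with the prior query space while strengthening the specificity of the current task query; 2) it builds an adaptive residual semantic prompt (ARSP) learning framework that generates a learnable pool of semantic residual prompts from category text, maps current-task queries to prompts through a query-prompt matching mechanism, and uses a contrastive semantic consistency loss to keep object queries and residual prompts semantically coherent; 3) it introduces a concise initialization strategy for incremental prompts to preserve inter-task correlation within the query space. Experiments on YouTube-VIS-2019 and YouTube-VIS-2021 show that CRISP significantly outperforms existing methods, mitigating catastrophic forgetting and improving segmentation and classification performance.

Link: https://arxiv.org/abs/2508.10432
Authors: Baichen Liu,Qi Lyu,Xudong Wang,Jiahua Dong,Lianqing Liu,Zhi Han
Affiliations: State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China; University of Chinese Academy of Sciences, Beijing 100049, China; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: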


Abstract:Continual video instance segmentation demands both the plasticity to absorb new object categories and the stability to retain previously learned ones, all while preserving temporal consistency across frames. In this work, we introduce Contrastive Residual Injection and Semantic Prompting (CRISP), an earlier attempt tailored to address the instance-wise, category-wise, and task-wise confusion in continual video instance segmentation. For instance-wise learning, we model instance tracking and construct instance correlation loss, which emphasizes the correlation with the prior query space while strengthening the specificity of the current task query. For category-wise learning, we build an adaptive residual semantic prompt (ARSP) learning framework, which constructs a learnable semantic residual prompt pool generated by category text and uses an adjustive query-prompt matching mechanism to build a mapping relationship between the query of the current task and the semantic residual prompt. Meanwhile, a semantic consistency loss based on the contrastive learning is introduced to maintain semantic coherence between object queries and residual prompts during incremental training. For task-wise learning, to ensure the correlation at the inter-task level within the query space, we introduce a concise yet powerful initialization strategy for incremental prompts. Extensive experiments on YouTube-VIS-2019 and YouTube-VIS-2021 datasets demonstrate that CRISP significantly outperforms existing continual segmentation methods in the long-term continual video instance segmentation task, avoiding catastrophic forgetting and effectively improving segmentation and classification performance. The code is available at this https URL.

[CV-80] MM-Food-100K: A 100000-Sample Multimodal Food Intelligence Dataset with Verifiable Provenance

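[Quick read]: This paper addresses the scarcity of high-quality, verifiable multimodal data for food intelligence, with the aim of improving generative AI on downstream tasks such as nutrition prediction. The key to the solution is MM-Food-100K, a 100,000-sample multimodal food dataset drawn from a quality-accepted corpus of 1.2 million food images. The corpus was collected with the Codatta contribution model, which combines community sourcing with AI-assisted quality checks; each submission is linked to a wallet address in an off-chain ledger for traceability, and a full on-chain protocol is planned. Fine-tuning vision-language models (such as ChatGPT 5 and Qwen-Max) on the dataset yields consistent gains on image-based nutrition prediction, which supports the value of data quality and traceable provenance.

Link: https://arxiv.org/abs/2508.10429
Authors: Yi Dong,Yusuke Muraoka,Scott Shi,Yi Zhang
Affiliations: Codatta; Kite AI
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 6 tables. The dataset is available at this https URL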


Abstract:We present MM-Food-100K, a public 100,000-sample multimodal food intelligence dataset with verifiable provenance. It is a curated approximately 10% open subset of an original 1.2 million, quality-accepted corpus of food images annotated for a wide range of information (such as dish name, region of creation). The corpus was collected over six weeks from over 87,000 contributors using the Codatta contribution model, which combines community sourcing with configurable AI-assisted quality checks; each submission is linked to a wallet address in a secure off-chain ledger for traceability, with a full on-chain protocol on the roadmap. We describe the schema, pipeline, and QA, and validate utility by fine-tuning large vision-language models (ChatGPT 5, ChatGPT OSS, Qwen-Max) on image-based nutrition prediction. Fine-tuning yields consistent gains over out-of-box baselines across standard metrics; we report results primarily on the MM-Food-100K subset. We release MM-Food-100K for publicly free access and retain approximately 90% for potential commercial access with revenue sharing to contributors.

[CV-81] STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes

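[Quick read]: This paper addresses the lack of precise spatiotemporal reasoning in current Vision-Language Models (VLMs) for autonomous driving. Existing VLMs are trained mainly on static, web-sourced image-text pairs and struggle to understand spatiotemporal relations in dynamic traffic scenes. The authors propose STRIDE-QA, a large-scale, physically grounded Visual Question Answering (VQA) dataset. It is built from 100 hours of multi-sensor driving data collected in Tokyo and contains 16 million QA pairs over 285K frames, anchored in physical space and time by dense automatic annotations (3D bounding boxes, segmentation masks, and multi-object tracks). Three novel QA tasks support both object-centric and ego-centric reasoning. VLMs fine-tuned on STRIDE-QA reach 55% success in spatial localization and 28% consistency in future motion prediction, whereas general-purpose VLMs score near zero, which gives a foundation for more reliable safety-critical autonomous systems.

Link: https://arxiv.org/abs/2508.10427
Authors: Keishi Ishihara,Kento Sasaki,Tsubasa Takahashi,Daiki Shiono,Yu Yamaguchi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL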


Abstract:Vision-Language Models (VLMs) have been applied to autonomous driving to support decision-making in complex real-world scenarios. However, their training on static, web-sourced image-text pairs fundamentally limits the precise spatiotemporal reasoning required to understand and predict dynamic traffic scenes. We address this critical gap with STRIDE-QA, a large-scale visual question answering (VQA) dataset for physically grounded reasoning from an ego-centric perspective. Constructed from 100 hours of multi-sensor driving data in Tokyo, capturing diverse and challenging conditions, STRIDE-QA is the largest VQA dataset for spatiotemporal reasoning in urban driving, offering 16 million QA pairs over 285K frames. Grounded by dense, automatically generated annotations including 3D bounding boxes, segmentation masks, and multi-object tracks, the dataset uniquely supports both object-centric and ego-centric reasoning through three novel QA tasks that require spatial localization and temporal prediction. Our benchmarks demonstrate that existing VLMs struggle significantly, achieving near-zero scores on prediction consistency. In contrast, VLMs fine-tuned on STRIDE-QA exhibit dramatic performance gains, achieving 55% success in spatial localization and 28% consistency in future motion prediction, compared to near-zero scores from general-purpose VLMs. Therefore, STRIDE-QA establishes a comprehensive foundation for developing more reliable VLMs for safety-critical autonomous systems.

[CV-82] NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer

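[Quick read]: This paper addresses the parameter redundancy and high computational cost of controllable text-to-image generation with Diffusion Transformers (DiTs), where existing methods still rely on the ControlNet paradigm designed for UNet-based models. The core solution is NanoControl, a lightweight control mechanism that uses Flux as its backbone. Its key innovation is a LoRA-style (low-rank adaptation) control module that learns control signals directly from the raw conditioning inputs, avoiding a duplicate of the DiT backbone. It also introduces a KV-Context Augmentation mechanism that injects condition-specific key-value information into the backbone for deep fusion of conditional features. With only a 0.024% increase in parameters and a 0.029% increase in GFLOPs, NanoControl markedly reduces computational overhead while maintaining strong generation quality and controllability.

Link: https://arxiv.org/abs/2508.10424
Authors: Shanyuan Liu,Jian Zhu,Junda Lu,Yue Gong,Liuzhuozheng Li,Bo Cheng,Yuhang Ma,Liebucha Wu,Xiaoyu Wu,Dawei Leng,Yuhui Yin
Affiliations: 360 AI Research; Nanjing University of Science and Technology; University of Science and Technology Beijing; Beijing University of Aeronautics and Astronautics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: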


Abstract:Diffusion Transformers (DiTs) have demonstrated exceptional capabilities in text-to-image synthesis. However, in the domain of controllable text-to-image generation using DiTs, most existing methods still rely on the ControlNet paradigm originally designed for UNet-based diffusion models. This paradigm introduces significant parameter overhead and increased computational costs. To address these challenges, we propose the Nano Control Diffusion Transformer (NanoControl), which employs Flux as the backbone network. Our model achieves state-of-the-art controllable text-to-image generation performance while incurring only a 0.024% increase in parameter count and a 0.029% increase in GFLOPs, thus enabling highly efficient controllable generation. Specifically, rather than duplicating the DiT backbone for control, we design a LoRA-style (low-rank adaptation) control module that directly learns control signals from raw conditioning inputs. Furthermore, we introduce a KV-Context Augmentation mechanism that integrates condition-specific key-value information into the backbone in a simple yet highly effective manner, facilitating deep fusion of conditional features. Extensive benchmark experiments demonstrate that NanoControl significantly reduces computational overhead compared to conventional control approaches, while maintaining superior generation quality and achieving improved controllability.
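
To make the parameter-overhead argument concrete, the sketch below counts the parameters of a generic low-rank (LoRA-style) adapter relative to the dense weight it sits beside. It illustrates only the standard LoRA arithmetic; the layer sizes and rank are hypothetical, and it is not NanoControl's actual control module.

```python
def dense_param_count(d_in: int, d_out: int) -> int:
    """Parameters in a dense d_out x d_in weight matrix (bias ignored)."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a low-rank pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def overhead_percent(d_in: int, d_out: int, rank: int) -> float:
    """Adapter size as a percentage of the dense layer it augments."""
    return 100.0 * lora_param_count(d_in, d_out, rank) / dense_param_count(d_in, d_out)


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    print(lora_param_count(1024, 1024, 4))
    print(overhead_percent(1024, 1024, 4))
```

The overhead shrinks as the layer grows, since the adapter scales with `d_in + d_out` while the dense weight scales with `d_in * d_out`.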

[CV-83] SC-Lane: Slope-aware and Consistent Road Height Estimation Framework for 3D Lane Detection

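[Quick read]: This paper addresses limitations in height estimation for existing 3D lane detection methods, in particular poor adaptation to complex road geometry and insufficient stability across frames. Earlier methods rely on fixed slope anchors, which cope poorly with diverse road scenes, and lack a temporal consistency mechanism, so height estimates fluctuate between consecutive frames and reliability in real driving suffers. The solution is the SC-Lane framework, with two core innovations: 1) a Slope-Aware Adaptive Feature module that predicts weights dynamically from image cues and fuses multi-slope height features into a unified heightmap, improving robustness to different road terrain; 2) a Height Consistency Module that enforces consistent height estimates across consecutive frames, giving temporal stability in real driving environments.

Link: https://arxiv.org/abs/2508.10411
Authors: Chaesong Park,Eunbin Seo,Jihyeon Hwang,Jongwoo Lim
Affiliations: Seoul National University; Hyundai Motor Group
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, 5 tables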


Abstract: In this paper, we introduce SC-Lane, a novel slope-aware and temporally consistent heightmap estimation framework for 3D lane detection. Unlike previous approaches that rely on fixed slope anchors, SC-Lane adaptively determines the fusion of slope-specific height features, improving robustness to diverse road geometries. To achieve this, we propose a Slope-Aware Adaptive Feature module that dynamically predicts the appropriate weights from image cues for integrating multi-slope representations into a unified heightmap. Additionally, a Height Consistency Module enforces temporal coherence, ensuring stable and accurate height estimation across consecutive frames, which is crucial for real-world driving scenarios. To evaluate the effectiveness of SC-Lane, we employ three standardized metrics, Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and threshold-based accuracy, which, although common in surface and depth estimation, have been underutilized for road height assessment. Using the LiDAR-derived heightmap dataset introduced in prior work [20], we benchmark our method under these metrics, thereby establishing a rigorous standard for future comparisons. Extensive experiments on the OpenLane benchmark demonstrate that SC-Lane significantly improves both height estimation and 3D lane detection, achieving state-of-the-art performance with an F-score of 64.3%, outperforming existing methods by a notable margin. For detailed results and a demonstration video, please refer to our project page: this https URL
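
The three metrics named in the abstract can be sketched as follows. This is a minimal illustration on flattened heightmaps; the threshold-based accuracy is assumed here to mean the fraction of cells whose absolute error falls below a cutoff, and the cutoff value is hypothetical. The paper may define it differently.

```python
import math


def height_metrics(pred, gt, threshold=0.1):
    """Return (MAE, RMSE, threshold accuracy) for two equal-length height lists."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    mae = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    # Assumed definition: share of cells with absolute error under the cutoff.
    accuracy = sum(1 for e in errors if e < threshold) / len(errors)
    return mae, rmse, accuracy


if __name__ == "__main__":
    print(height_metrics([0.0, 1.0, 2.0], [0.0, 1.5, 1.0], threshold=0.6))
```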

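[CV-84] Translation of Text Embedding via Delta Vector to Suppress Strongly Entangled Content in Text-to-Image Diffusion Models

[Quick read]: This paper addresses loss of control in Text-to-Image (T2I) diffusion models when a word is strongly entangled with unwanted content. For example, an image of "Charlie Chaplin" consistently shows a mustache even when the prompt explicitly asks for none. The key to the solution is direct suppression in the text embedding space: a delta vector modifies the text embedding to weaken the influence of the unwanted content on the generated image. The delta vector is further integrated into the cross-attention mechanism, giving the Selective Suppression with Delta Vector (SSDV) method, which suppresses the content more effectively in the regions where it would otherwise appear. For personalized T2I models, optimizing the delta vector gives more precise suppression than previous baselines achieve.

Link: https://arxiv.org/abs/2508.10407
Authors: Eunseo Koh,Seunghoo Hong,Tae-Young Kim,Simon S. Woo,Jae-Pil Heo
Affiliations: Sungkyunkwan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: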


Abstract: Text-to-Image (T2I) diffusion models have made significant progress in generating diverse high-quality images from textual prompts. However, these models still face challenges in suppressing content that is strongly entangled with specific words. For example, when generating an image of "Charlie Chaplin", a "mustache" consistently appears even if explicitly instructed not to include it, as the concept of "mustache" is strongly entangled with "Charlie Chaplin". To address this issue, we propose a novel approach to directly suppress such entangled content within the text embedding space of diffusion models. Our method introduces a delta vector that modifies the text embedding to weaken the influence of undesired content in the generated image, and we further demonstrate that this delta vector can be easily obtained through a zero-shot approach. Furthermore, we propose a Selective Suppression with Delta Vector (SSDV) method to adapt delta vector into the cross-attention mechanism, enabling more effective suppression of unwanted content in regions where it would otherwise be generated. Additionally, we enabled more precise suppression in personalized T2I models by optimizing delta vector, which previous baselines were unable to achieve. Extensive experimental results demonstrate that our approach significantly outperforms existing methods, both in terms of quantitative and qualitative metrics.
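
The idea of translating a text embedding by a delta vector can be sketched in a few lines. The abstract does not say how the zero-shot delta is obtained, so this sketch assumes it is the difference between embeddings of a prompt with and without the unwanted concept; `delta_from_pair`, `suppress`, and `strength` are hypothetical names, and plain lists stand in for real text-encoder outputs.

```python
def delta_from_pair(emb_with, emb_without):
    """Assumed zero-shot delta: embedding with the concept minus embedding without it."""
    return [a - b for a, b in zip(emb_with, emb_without)]


def suppress(embedding, delta, strength=1.0):
    """Shift the embedding away from the unwanted concept along the delta direction."""
    return [e - strength * d for e, d in zip(embedding, delta)]


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real text embeddings.
    delta = delta_from_pair([1.0, 2.0], [0.5, 2.0])
    print(suppress([1.0, 1.0], delta, strength=2.0))
```

A larger `strength` pushes further from the concept, at the likely cost of drifting from the original prompt.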

[CV-85] PQ-DAF: Pose-driven Quality-controlled Data Augmentation for Data-scarce Driver Distraction Detection

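[Quick read]: This paper addresses the degraded generalization of driver distraction detection models in real-world deployment. The causes are the few-shot learning challenge created by costly data annotation and the substantial domain shift between training datasets and target deployment conditions. The key to the solution is the Pose-driven Quality-controlled Data Augmentation Framework (PQ-DAF), which has two core components: Progressive Conditional Diffusion Models (PCDMs) that capture key driver pose features and synthesize diverse training samples, and a sample quality assessment module built on the CogVLM vision-language model that filters out low-quality synthetic samples using a confidence threshold. This improves the reliability of the augmented dataset and its cross-domain robustness. Experiments show substantial gains in generalization under few-shot conditions.

Link: https://arxiv.org/abs/2508.10397
Authors: Haibin Sun,Xinghui Song
Affiliations: Shandong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures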


Abstract:Driver distraction detection is essential for improving traffic safety and reducing road accidents. However, existing models often suffer from degraded generalization when deployed in real-world scenarios. This limitation primarily arises from the few-shot learning challenge caused by the high cost of data annotation in practical environments, as well as the substantial domain shift between training datasets and target deployment conditions. To address these issues, we propose a Pose-driven Quality-controlled Data Augmentation Framework (PQ-DAF) that leverages a vision-language model for sample filtering to cost-effectively expand training data and enhance cross-domain robustness. Specifically, we employ a Progressive Conditional Diffusion Model (PCDMs) to accurately capture key driver pose features and synthesize diverse training examples. A sample quality assessment module, built upon the CogVLM vision-language model, is then introduced to filter out low-quality synthetic samples based on a confidence threshold, ensuring the reliability of the augmented dataset. Extensive experiments demonstrate that PQ-DAF substantially improves performance in few-shot driver distraction detection, achieving significant gains in model generalization under data-scarce conditions.
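
The confidence-threshold filtering step can be sketched as below. The scoring function stands in for a vision-language model's quality score; `score_fn`, the sample identifiers, and the default threshold are all hypothetical and not taken from the paper.

```python
def filter_synthetic(samples, score_fn, threshold=0.8):
    """Keep only the synthetic samples whose quality score reaches the threshold."""
    kept = []
    for sample in samples:
        if score_fn(sample) >= threshold:
            kept.append(sample)
    return kept


if __name__ == "__main__":
    # Placeholder scores; a real pipeline would query a VLM for each image.
    scores = {"img_a": 0.95, "img_b": 0.40, "img_c": 0.80}
    print(filter_synthetic(list(scores), scores.get, threshold=0.8))
```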

[CV-86] Unlocking Robust Semantic Segmentation Performance via Label-only Elastic Deformations against Implicit Label Noise

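[Quick read]: This paper addresses implicit label noise in real-world semantic segmentation datasets. Such noise arises from inherent difficulties like ambiguous object boundaries and annotator variability; it is subtle but still harms model performance. Conventional data augmentation applies the same transformation to the image and its label, which can amplify these subtle imperfections and limit generalization. The key to the solution is the NSegment+ framework, which decouples image and label transformations: it applies controlled elastic deformations only to the segmentation labels and leaves the original images unchanged, guiding the model to learn robust representations of object structure despite minor label inconsistencies.

Link: https://arxiv.org/abs/2508.10383
Authors: Yechan Kim,Dongho Yoon,Younkwan Lee,Unse Fatima,Hong Kook Kim,Songjae Lee,Sanga Park,Jeong Ho Park,Seonjong Kang,Moongu Jeon
Affiliations: 1. Korea Advanced Institute of Science and Technology; 2. Samsung Electronics; 3. LG Electronics
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: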


Abstract: While previous studies on image segmentation focus on handling severe (or explicit) label noise, real-world datasets also exhibit subtle (or implicit) label imperfections. These arise from inherent challenges, such as ambiguous object boundaries and annotator variability. Although not explicitly present, such mild and latent noise can still impair model performance. Typical data augmentation methods, which apply identical transformations to the image and its label, risk amplifying these subtle imperfections and limiting the model's generalization capacity. In this paper, we introduce NSegment+, a novel augmentation framework that decouples image and label transformations to address such realistic noise for semantic segmentation. By introducing controlled elastic deformations only to segmentation labels while preserving the original images, our method encourages models to focus on learning robust representations of object structures despite minor label inconsistencies. Extensive experiments demonstrate that NSegment+ consistently improves performance, achieving mIoU gains of up to +2.29, +2.38, +1.75, and +3.39 on average on Vaihingen, LoveDA, Cityscapes, and PASCAL VOC, respectively, even without bells and whistles, highlighting the importance of addressing implicit label noise. These gains can be further amplified when combined with other training tricks, including CutMix and Label Smoothing.
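
A label-only elastic deformation can be sketched in pure Python as below. This is an illustrative approximation, not the NSegment+ implementation: the displacement field is uniform noise smoothed with a box filter (a stand-in for the Gaussian smoothing commonly used in elastic deformation), and `alpha`, `radius`, and the function names are hypothetical.

```python
import random


def box_blur(field, radius):
    """Smooth a 2-D list of floats with a square mean filter (edges clamped)."""
    h, w = len(field), len(field[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += field[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def elastic_deform_label(label, alpha=2.0, radius=2, seed=None):
    """Warp a label mask with a smooth random displacement field.

    Nearest-neighbour sampling keeps class ids intact (no interpolated values).
    """
    rng = random.Random(seed)
    h, w = len(label), len(label[0])
    noise_x = [[rng.uniform(-1.0, 1.0) for _ in range(w)] for _ in range(h)]
    noise_y = [[rng.uniform(-1.0, 1.0) for _ in range(w)] for _ in range(h)]
    disp_x = box_blur(noise_x, radius)
    disp_y = box_blur(noise_y, radius)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            src_y = min(max(int(round(y + alpha * disp_y[y][x])), 0), h - 1)
            src_x = min(max(int(round(x + alpha * disp_x[y][x])), 0), w - 1)
            out[y][x] = label[src_y][src_x]
    return out


def augment(image, label, **kwargs):
    """Decoupled augmentation: the image is returned untouched, only the label moves."""
    return image, elastic_deform_label(label, **kwargs)
```

Setting `alpha=0` returns the label unchanged, which is a convenient sanity check when tuning the deformation strength.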

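[CV-87] Towards Spatially Consistent Image Generation: On Incorporating Intrinsic Scene Properties into Diffusion Models

[Quick read]: This paper addresses the spatial inconsistency and distortion produced by image generation models trained on large datasets, which have limited information about the underlying structure and spatial layout of a scene. The key to the solution is to co-generate images together with their intrinsic scene properties (such as depth and segmentation maps). Pre-trained estimators extract rich intrinsic properties from a large image dataset, and an autoencoder aggregates them into a single latent variable. Building on pre-trained Latent Diffusion Models (LDMs), the method denoises the image and intrinsic domains simultaneously, sharing mutual information so that the two reflect each other without degrading image quality. The model thereby captures scene structure implicitly and produces more spatially consistent and realistic images.

Link: https://arxiv.org/abs/2508.10382
Authors: Hyundo Lee,Suhyung Choi,Byoung-Tak Zhang,Inwoo Hwang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: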


Abstract:Image generation models trained on large datasets can synthesize high-quality images but often produce spatially inconsistent and distorted images due to limited information about the underlying structures and spatial layouts. In this work, we leverage intrinsic scene properties (e.g., depth, segmentation maps) that provide rich information about the underlying scene, unlike prior approaches that solely rely on image-text pairs or use intrinsics as conditional inputs. Our approach aims to co-generate both images and their corresponding intrinsics, enabling the model to implicitly capture the underlying scene structure and generate more spatially consistent and realistic images. Specifically, we first extract rich intrinsic scene properties from a large image dataset with pre-trained estimators, eliminating the need for additional scene information or explicit 3D representations. We then aggregate various intrinsic scene properties into a single latent variable using an autoencoder. Building upon pre-trained large-scale Latent Diffusion Models (LDMs), our method simultaneously denoises the image and intrinsic domains by carefully sharing mutual information so that the image and intrinsic reflect each other without degrading image quality. Experimental results demonstrate that our method corrects spatial inconsistencies and produces a more natural layout of scenes while maintaining the fidelity and textual alignment of the base model (e.g., Stable Diffusion).

[CV-88] Contrast Sensitivity Function of Multimodal Vision-Language Models

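[Quick read]: This paper asks how well multimodal Vision-Language Models (VLMs) align with the human visual system in perceiving low-level visual features, and in particular how to quantify their sensitivity to spatial frequency. The key to the solution is a new method inspired by behavioral psychophysics: the model is prompted directly to judge the visibility of band-pass filtered noise images at different contrasts, and its Contrast Sensitivity Function (CSF) is estimated from the responses. The method is closer to real psychophysics experiments than earlier work, and it provides a systematic comparison of the CSF shape and magnitude of several VLM architectures against human perception. Prompt phrasing has a large effect on the responses, which raises concerns about prompt stability.

Link: https://arxiv.org/abs/2508.10367
Authors: Pablo Hernández-Cámara,Alexandra Gomez-Villa,Jose Manuel Jaén-Lorites,Jorge Vila-Tomás,Jesus Malo,Valero Laparra
Affiliations: Image Processing Lab, Universidad de Valencia, Paterna, Spain; Center for Biomaterials and Tissue Engineering, Universitat Politecnica de Valencia, Valencia, Spain
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: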


Abstract: Assessing the alignment of multimodal vision-language models (VLMs) with human perception is essential to understand how they perceive low-level visual features. A key characteristic of human vision is the contrast sensitivity function (CSF), which describes sensitivity to spatial frequency at low contrasts. Here, we introduce a novel behavioral psychophysics-inspired method to estimate the CSF of chat-based VLMs by directly prompting them to judge pattern visibility at different contrasts for each frequency. This methodology is closer to the real experiments in psychophysics than those previously reported. Using band-pass filtered noise images and a diverse set of prompts, we assess model responses across multiple architectures. We find that while some models approximate human-like CSF shape or magnitude, none fully replicate both. Notably, prompt phrasing has a large effect on the responses, raising concerns about prompt stability. Our results provide a new framework for probing visual sensitivity in multimodal models and reveal key gaps between their visual representations and human perception.
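
One common way to turn visibility judgments into a CSF is to take, for each spatial frequency, the reciprocal of the lowest contrast judged visible. The sketch below assumes that convention and a simple yes/no response format; the paper's actual estimation procedure may differ, and the data layout is hypothetical.

```python
def contrast_sensitivity(responses):
    """Estimate sensitivity per frequency as 1 / (lowest contrast judged visible).

    responses maps frequency -> list of (contrast, judged_visible) pairs.
    A frequency never judged visible gets sensitivity 0.0.
    """
    csf = {}
    for frequency, trials in responses.items():
        visible = [contrast for contrast, seen in trials if seen]
        csf[frequency] = 1.0 / min(visible) if visible else 0.0
    return csf


if __name__ == "__main__":
    # Toy responses; real ones would come from prompting a VLM with noise images.
    trials = {
        1.0: [(0.5, True), (0.1, True), (0.05, False)],
        8.0: [(0.5, False)],
    }
    print(contrast_sensitivity(trials))
```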

[CV-89] AtomDiffuser: Time-Aware Degradation Modeling for Drift and Beam Damage in STEM Imaging

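[Quick read]: This paper addresses the difficulty of separating two entangled degradation effects in time-resolved scanning transmission electron microscopy (STEM) data: spatial drift caused by mechanical and thermal instabilities, and beam-induced signal loss from radiation damage. Their entanglement makes it hard to model material dynamics accurately at atomic resolution. The key to the solution is the AtomDiffuser framework, which disentangles drift and radiometric attenuation by predicting an affine transformation and a spatially varying decay map between any two STEM frames. Unlike traditional denoising or registration pipelines, it treats degradation as a physically motivated, temporally conditioned process, which enables interpretable analysis of structural evolution at the atomic scale. Although trained on synthetic degradation processes, it generalizes well to real cryo-STEM data.

Link: https://arxiv.org/abs/2508.10359
Authors: Hao Wang,Hongkui Zheng,Kai He,Abolfazl Razi
Affiliations: Clemson University; University of California, Irvine
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: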


Abstract:Scanning transmission electron microscopy (STEM) plays a critical role in modern materials science, enabling direct imaging of atomic structures and their evolution under external interferences. However, interpreting time-resolved STEM data remains challenging due to two entangled degradation effects: spatial drift caused by mechanical and thermal instabilities, and beam-induced signal loss resulting from radiation damage. These factors distort both geometry and intensity in complex, temporally correlated ways, making it difficult for existing methods to explicitly separate their effects or model material dynamics at atomic resolution. In this work, we present AtomDiffuser, a time-aware degradation modeling framework that disentangles sample drift and radiometric attenuation by predicting an affine transformation and a spatially varying decay map between any two STEM frames. Unlike traditional denoising or registration pipelines, our method leverages degradation as a physically heuristic, temporally conditioned process, enabling interpretable structural evolutions across time. Trained on synthetic degradation processes, AtomDiffuser also generalizes well to real-world cryo-STEM data. It further supports high-resolution degradation inference and drift alignment, offering tools for visualizing and quantifying degradation patterns that correlate with radiation-induced atomic instabilities.
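
The two-factor degradation described above (geometry via an affine transformation, intensity via a per-pixel decay map) can be written as a small forward model. This sketch is an assumption about how the two predicted quantities combine, namely a later frame approximated as the decay map times the warped earlier frame; the parameter layout and nearest-neighbour sampling are illustrative choices, not the paper's implementation.

```python
def apply_degradation(frame, affine, decay):
    """Warp a frame with an affine map, then attenuate it pixel by pixel.

    frame:  2-D list of intensities.
    affine: (a, b, tx, c, d, ty); output pixel (x, y) samples the source at
            (a*x + b*y + tx, c*x + d*y + ty), rounded to the nearest pixel.
    decay:  2-D list of attenuation factors in [0, 1], same shape as frame.
    Pixels whose source falls outside the frame are set to 0.0.
    """
    h, w = len(frame), len(frame[0])
    a, b, tx, c, d, ty = affine
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            src_x = int(round(a * x + b * y + tx))
            src_y = int(round(c * x + d * y + ty))
            if 0 <= src_x < w and 0 <= src_y < h:
                out[y][x] = decay[y][x] * frame[src_y][src_x]
    return out


if __name__ == "__main__":
    frame = [[1.0, 2.0], [3.0, 4.0]]
    no_decay = [[1.0, 1.0], [1.0, 1.0]]
    # One-pixel horizontal drift, no attenuation.
    print(apply_degradation(frame, (1, 0, 1, 0, 1, 0), no_decay))
```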

[CV-90] Glo-DMU: A Deep Morphometry Framework of Ultrastructural Characterization in Glomerular Electron Microscopic Images

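[Quick read]: This paper addresses the limited automation of quantitative analysis of glomerular ultrastructure in renal pathology. Existing studies mostly recognize a single ultrastructure, which falls short of practical diagnostic needs. The key to the solution is Glo-DMU, a glomerular morphometry framework built on three deep learning models: an ultrastructure segmentation model, a glomerular filtration barrier region classification model, and an electron-dense deposits detection model. It simultaneously quantifies the three most widely used diagnostic ultrastructural features: the thickness of the glomerular basement membrane, the degree of foot process effacement, and the location of electron-dense deposits. The framework is fully automated, precise, and high-throughput, and serves as an efficient tool for assisting the diagnosis of glomerular disease.

Link: https://arxiv.org/abs/2508.10351
Authors: Zhentai Zhang,Danyi Weng,Guibin Zhang,Xiang Chen,Kaixing Long,Jian Geng,Yanmeng Lu,Lei Zhang,Zhitao Zhou,Lei Cao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 6 figures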


Abstract:Complex and diverse ultrastructural features can indicate the type, progression, and prognosis of kidney diseases. Recently, computational pathology combined with deep learning methods has shown tremendous potential in advancing automatic morphological analysis of glomerular ultrastructure. However, current research predominantly focuses on the recognition of individual ultrastructure, which makes it challenging to meet practical diagnostic needs. In this study, we propose the glomerular morphometry framework of ultrastructural characterization (Glo-DMU), which is grounded on three deep models: the ultrastructure segmentation model, the glomerular filtration barrier region classification model, and the electron-dense deposits detection model. Following the conventional protocol of renal biopsy diagnosis, this framework simultaneously quantifies the three most widely used ultrastructural features: the thickness of glomerular basement membrane, the degree of foot process effacement, and the location of electron-dense deposits. We evaluated the 115 patients with 9 renal pathological types in real-world diagnostic scenarios, demonstrating good consistency between automatic quantification results and morphological descriptions in the pathological reports. Glo-DMU possesses the characteristics of full automation, high precision, and high throughput, quantifying multiple ultrastructural features simultaneously, and providing an efficient tool for assisting renal pathologists.

[CV-91] Concepts or Skills? Rethinking Instruction Selection for Multi-modal Models

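[Quick read]: This paper addresses the performance limits caused by poor training data selection in vision-language instruction tuning, with a focus on improving results on a specific benchmark. Its central finding is that existing vision-language benchmarks mainly benefit from training on instructions with either similar visual concepts or similar visual skills, and that the two are fundamentally different. The key to the solution is a targeted data selection method: extract the concepts and skills from the target benchmark, determine whether the benchmark benefits predominantly from similar concepts or similar skills, and then select the instructions that match best. Experiments on 10+ benchmarks show a gain of +0.9% over the best existing baseline averaged over all benchmarks and +1.5% on the skill-focused subset, which supports balancing conceptual knowledge against visual skill when selecting instruction data.

Link: https://arxiv.org/abs/2508.10339
Authors: Andrew Bai,Justin Cui,Ruochen Wang,Cho-Jui Hsieh
Affiliations: University of California, Los Angeles
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 1 figure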


Abstract:Vision-language instruction tuning achieves two main purposes: learning visual concepts and learning visual skills. In this paper, we found that vision-language benchmarks fall into the dichotomy of mainly benefiting from training on instructions with similar skills or visual concepts. Inspired by the discovery, we designed a simple targeted training data selection method to optimize the performance of a given benchmark. We first extract the concepts/skills from the benchmark, determine whether the benchmark predominantly benefits from similar concepts or skills, and finally select instructions with the most matching concepts/skills. Experiments on 10+ benchmarks validate the effectiveness of our targeted data selection method, showing +0.9% over the best existing baseline averaged over all benchmarks and +1.5% on the skill-focused subset. Our findings underscore the importance of recognizing the inherent trade-off within instruction selection, which requires balancing the acquisition of conceptual knowledge against visual skill.
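
The final selection step (pick the instructions whose concepts or skills best match the benchmark) can be sketched with a simple set-overlap score. Jaccard similarity over tag sets is an assumption made here for illustration; the abstract does not specify the matching function, and the tags and identifiers are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two tag collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_instructions(benchmark_tags, instructions, k):
    """Return the ids of the k instructions whose tags best match the benchmark.

    instructions is a list of (instruction_id, tags) pairs. Ties keep input order,
    because Python's sort is stable.
    """
    ranked = sorted(
        instructions,
        key=lambda item: jaccard(benchmark_tags, item[1]),
        reverse=True,
    )
    return [instruction_id for instruction_id, _ in ranked[:k]]


if __name__ == "__main__":
    pool = [("a", {"counting"}), ("b", {"ocr", "counting"}), ("c", {"color"})]
    print(select_instructions({"counting", "ocr"}, pool, k=2))
```

The same function applies whether the tags are concepts or skills; only the tag vocabulary changes once the dominant type for the benchmark has been determined.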

[CV-92] ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver

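[Quick read]: This paper addresses the difficulty current Vision-Language-Action (VLA) models have in directing visual attention to target regions in robot tasks; attention tends to be dispersed, which hurts manipulation precision. The key to the solution is ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. A diffusion transformer, conditioned on the model's visual outputs, reconstructs the gaze region of the image that corresponds to the manipulated object. This pushes the model to learn fine-grained representations and allocate visual attention accurately, so that it uses task-specific visual information effectively and manipulates objects precisely.

Link: https://arxiv.org/abs/2508.10333
Authors: Wenxuan Song,Ziyang Zhou,Han Zhao,Jiayi Chen,Pengxiang Ding,Haodong Yan,Yuxin Huang,Feilong Tang,Donglin Wang,Haoang Li
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: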


Abstract:Recent advances in Vision-Language-Action (VLA) models have enabled robotic agents to integrate multimodal understanding with action execution. However, our empirical analysis reveals that current VLAs struggle to allocate visual attention to target regions. Instead, visual attention is always dispersed. To guide the visual attention grounding on the correct target, we propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. Conditioned on the model’s visual outputs, a diffusion transformer aims to reconstruct the gaze region of the image, which corresponds to the target manipulated objects. This process prompts the VLA model to learn fine-grained representations and accurately allocate visual attention, thus effectively leveraging task-specific visual information and conducting precise manipulation. Moreover, we curate a large-scale pretraining dataset comprising over 100k trajectories and 2 million data samples from open-source robotic datasets, further boosting the model’s generalization in visual reconstruction. Extensive experiments in simulation and the real world demonstrate the superiority of our implicit grounding method, showcasing its capabilities of precise manipulation and generalization. Our project page is this https URL.

[CV-93] Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances

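[Quick read]: This paper addresses the mismatch that arises when generative models for visual content synthesis are trained with surrogate objectives (such as likelihood or reconstruction loss) that do not align with perceptual quality, semantic accuracy, or physical realism. The key to the solution is reinforcement learning (RL): its suitability for optimizing non-differentiable, preference-driven, and temporally structured objectives allows fine-grained control over the generation process and alignment with high-level goals, improving the controllability, consistency, and human alignment of generated images, videos, and 3D/4D content.

Link: https://arxiv.org/abs/2508.10316
Authors: Yuanzhi Liang,Yijie Fang,Rui Li,Ziqi Ni,Ruijie Su,Chi Zhang,Xuelong Li
Affiliations: Xi’an University of Electronic Science and Technology; University of Science and Technology of China; Southeast University; China Telecom
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Ongoing work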

Abstract:Generative models have made significant progress in synthesizing visual content, including images, videos, and 3D/4D structures. However, they are typically trained with surrogate objectives such as likelihood or reconstruction loss, which often misalign with perceptual quality, semantic accuracy, or physical realism. Reinforcement learning (RL) offers a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives. Recent advances demonstrate its effectiveness in enhancing controllability, consistency, and human alignment across generative tasks. This survey provides a systematic overview of RL-based methods for visual content generation. We review the evolution of RL from classical control to its role as a general-purpose optimization tool, and examine its integration into image, video, and 3D/4D generation. Across these domains, RL serves not only as a fine-tuning mechanism but also as a structural component for aligning generation with complex, high-level goals. We conclude with open challenges and future research directions at the intersection of RL and generative modeling.

[CV-94] From Pixel to Mask: A Survey of Out-of-Distribution Segmentation

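[Quick read]: This paper addresses the limitations of out-of-distribution (OoD) detection and segmentation in safety-critical settings such as autonomous driving. Conventional OoD detection only identifies that an anomalous object is present and lacks pixel-level spatial localization, so it cannot support precise downstream control decisions. To address this, the paper systematically reviews current OoD segmentation methods in four categories: (i) test-time OoD segmentation, (ii) outlier exposure for supervised training, (iii) reconstruction-based methods, and (iv) methods that leverage powerful models. It identifies pixel-level localization as the core of the solution: perception modules can then not only detect OoD objects but also segment them precisely, improving the robustness and safety of autonomous driving systems.

Link: https://arxiv.org/abs/2508.10309
Authors: Wenjie Zhao,Jia Li,Yunhui Guo
Affiliations: University of Texas at Dallas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: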

Abstract:Out-of-distribution (OoD) detection and segmentation have attracted growing attention as concerns about AI security rise. Conventional OoD detection methods identify the existence of OoD objects but lack spatial localization, limiting their usefulness in downstream tasks. OoD segmentation addresses this limitation by localizing anomalous objects at pixel-level granularity. This capability is crucial for safety-critical applications such as autonomous driving, where perception modules must not only detect but also precisely segment OoD objects, enabling targeted control actions and enhancing overall system robustness. In this survey, we group current OoD segmentation approaches into four categories: (i) test-time OoD segmentation, (ii) outlier exposure for supervised training, (iii) reconstruction-based methods, and (iv) approaches that leverage powerful models. We systematically review recent advances in OoD segmentation for autonomous-driving scenarios, identify emerging challenges, and discuss promising future research directions.

[CV-95] Improving Learning of New Diseases through Knowledge-Enhanced Initialization for Federated Adapter Tuning

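[Quick read]: This paper addresses how clients in healthcare federated learning (FL) can adapt efficiently to new tasks or diseases, particularly when large foundation models (FMs) are used and knowledge is transferred quickly through low-cost adapter tuning. The core challenge is how to use knowledge accumulated from past tasks and across clients to generate a more effective initial set of adapter parameters for a new task, improving learning efficiency and performance. The key to the solution is a new framework, Federated Knowledge-Enhanced Initialization (FedKEI): the server first performs global clustering to extract knowledge that generalizes across tasks; the inter-cluster and intra-cluster aggregation weights are then optimized to personalize knowledge transfer; and a bi-level optimization scheme collaboratively learns the global intra-cluster weights while optimizing the local inter-cluster weights toward each client's task objective, which markedly improves adaptation to new tasks.

Link: https://arxiv.org/abs/2508.10299
Authors: Danni Peng,Yuan Wang,Kangning Cai,Peiyan Ning,Jiming Xu,Yong Liu,Rick Siow Mong Goh,Qingsong Wei,Huazhu Fu
Affiliations: Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR); EVYD Technology
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: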

Abstract:In healthcare, federated learning (FL) is a widely adopted framework that enables privacy-preserving collaboration among medical institutions. With large foundation models (FMs) demonstrating impressive capabilities, using FMs in FL through cost-efficient adapter tuning has become a popular approach. Given the rapidly evolving healthcare environment, it is crucial for individual clients to quickly adapt to new tasks or diseases by tuning adapters while drawing upon past experiences. In this work, we introduce Federated Knowledge-Enhanced Initialization (FedKEI), a novel framework that leverages cross-client and cross-task transfer from past knowledge to generate informed initializations for learning new tasks with adapters. FedKEI begins with a global clustering process at the server to generalize knowledge across tasks, followed by the optimization of aggregation weights across clusters (inter-cluster weights) and within each cluster (intra-cluster weights) to personalize knowledge transfer for each new task. To facilitate more effective learning of the inter- and intra-cluster weights, we adopt a bi-level optimization scheme that collaboratively learns the global intra-cluster weights across clients and optimizes the local inter-cluster weights toward each client’s task objective. Extensive experiments on three benchmark datasets of different modalities, including dermatology, chest X-rays, and retinal OCT, demonstrate FedKEI’s advantage in adapting to new diseases compared to state-of-the-art methods.
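
The two-level aggregation described in the abstract can be pictured with a small sketch. It is only an illustration under stated assumptions: adapters are flattened to plain lists of floats, the weights are obtained by a softmax over free logits, and the names `informed_init`, `intra_logits`, and `inter_logits` are hypothetical. The clustering step and the bi-level optimization that would learn these logits are not shown, and the paper's actual parameterization may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(vectors, weights):
    """Element-wise weighted sum of equally sized vectors."""
    return [sum(w * v[k] for w, v in zip(weights, vectors))
            for k in range(len(vectors[0]))]

def informed_init(clusters, intra_logits, inter_logits):
    """Build an initial adapter for a new task (hypothetical helper).

    clusters:     list of clusters, each a list of past adapter vectors
    intra_logits: one logit list per cluster (weights within the cluster)
    inter_logits: one logit per cluster (weights across clusters)
    """
    # Step 1: collapse each cluster into one representative adapter.
    cluster_adapters = [
        weighted_sum(members, softmax(logits))
        for members, logits in zip(clusters, intra_logits)
    ]
    # Step 2: mix the cluster representatives for the new task.
    return weighted_sum(cluster_adapters, softmax(inter_logits))
```

With equal logits the result is a plain average at both levels; pushing one inter-cluster logit far above the others makes the initialization follow that cluster almost exclusively.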

[CV-96] SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning

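[Quick read]: This paper addresses a core challenge in mapping visual stimuli to neural responses: modeling biological variability while preserving the functional consistency that encodes stimulus information. Because the same visual input evokes variable blood-oxygen-level-dependent (BOLD) responses across trials, contexts, and subjects, traditional deterministic methods struggle to capture both. The key to the solution is the SynBrain framework, which has two core components: (i) BrainVAE models neural representations as continuous probability distributions through probabilistic learning and maintains functional consistency through visual semantic constraints; (ii) a Semantic-to-Neural Mapper acts as a semantic transmission pathway, projecting visual semantics onto the neural response manifold for high-fidelity fMRI signal synthesis. This design lets the model capture biological variability while generating functionally interpretable neural responses.

Link: https://arxiv.org/abs/2508.10298
Authors: Weijian Mai,Jiamin Wu,Yu Zhu,Zhouheng Yao,Dongzhan Zhou,Andrew F. Luo,Qihao Zheng,Wanli Ouyang,Chunfeng Song
Affiliations: Shanghai Artificial Intelligence Laboratory; The University of Hong Kong; The Chinese University of Hong Kong
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: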

Abstract:Deciphering how visual stimuli are transformed into cortical responses is a fundamental challenge in computational neuroscience. This visual-to-neural mapping is inherently a one-to-many relationship, as identical visual inputs reliably evoke variable hemodynamic responses across trials, contexts, and subjects. However, existing deterministic methods struggle to simultaneously model this biological variability while capturing the underlying functional consistency that encodes stimulus information. To address these limitations, we propose SynBrain, a generative framework that simulates the transformation from visual semantics to neural responses in a probabilistic and biologically interpretable manner. SynBrain introduces two key components: (i) BrainVAE models neural representations as continuous probability distributions via probabilistic learning while maintaining functional consistency through visual semantic constraints; (ii) A Semantic-to-Neural Mapper acts as a semantic transmission pathway, projecting visual semantics into the neural response manifold to facilitate high-fidelity fMRI synthesis. Experimental results demonstrate that SynBrain surpasses state-of-the-art methods in subject-specific visual-to-fMRI encoding performance. Furthermore, SynBrain adapts efficiently to new subjects with few-shot data and synthesizes high-quality fMRI signals that are effective in improving data-limited fMRI-to-image decoding performance. Beyond that, SynBrain reveals functional consistency across trials and subjects, with synthesized signals capturing interpretable patterns shaped by biological neural variability. The code will be made publicly available.

[CV-97] InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild ICCV2025

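[Quick read]: This paper addresses the difficulty existing motion synthesis methods have in modeling both individual motion and multi-person coordination when generating realistic interaction motions. Previous methods usually handle solo and interactive motions separately, ignoring the natural fluidity and coordination of motion in real-world scenes. The key to the solution is the Interleaved Learning for Motion Synthesis (InterSyn) framework, which uses an interleaved learning strategy and unifies the modeling through two core modules: the Interleaved Interaction Synthesis (INS) module jointly models solo and interactive behaviors from a first-person perspective, and the Relative Coordination Refinement (REC) module refines the relative dynamics between characters to keep their motions synchronized. The method improves text-to-motion alignment and motion diversity, offering a new paradigm for more natural and robust multi-character motion synthesis.

Link: https://arxiv.org/abs/2508.10297
Authors: Yiyi Ma,Yuanzhi Liang,Xiu Li,Chi Zhang,Xuelong Li
Affiliations: Shenzhen International Graduate School, Tsinghua University; Institute of Artificial Intelligence, China Telecom
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025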

Abstract:We present Interleaved Learning for Motion Synthesis (InterSyn), a novel framework that targets the generation of realistic interaction motions by learning from integrated motions that consider both solo and multi-person dynamics. Unlike previous methods that treat these components separately, InterSyn employs an interleaved learning strategy to capture the natural, dynamic interactions and nuanced coordination inherent in real-world scenarios. Our framework comprises two key modules: the Interleaved Interaction Synthesis (INS) module, which jointly models solo and interactive behaviors in a unified paradigm from a first-person perspective to support multiple character interactions, and the Relative Coordination Refinement (REC) module, which refines mutual dynamics and ensures synchronized motions among characters. Experimental results show that the motion sequences generated by InterSyn exhibit higher text-to-motion alignment and improved diversity compared with recent methods, setting a new benchmark for robust and natural motion synthesis. Additionally, our code will be open-sourced in the future to promote further research and development in this area.

[CV-98] A Sub-Pixel Multimodal Optical Remote Sensing Images Matching Method

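[Quick read]: This paper addresses the loss of matching accuracy in the geometric processing of multimodal optical images (for example visible versus infrared or near-infrared) caused by nonlinear radiometric differences and geometric deformation. The key to the solution is a phase consistency weighted least absolute deviation (PCWLAD) sub-pixel template matching method with two core steps: in the coarse matching stage, an initial match is found using the structural similarity index measure (SSIM) on unfiltered phase consistency (PC) maps; in the fine matching stage, radiometric and geometric transformation models are combined with mutual structure filtering, which suppresses the effect of noise on structural consistency, and the weighted least absolute deviation (WLAD) criterion is used to estimate the sub-pixel offset, improving matching accuracy.

Link: https://arxiv.org/abs/2508.10294
Authors: Tao Huang,Hongbo Pan,Nanxi Zhou,Shun Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: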

Abstract:High-accuracy matching of multimodal optical images is the basis of geometric processing. However, the image matching accuracy is usually degraded by the nonlinear radiation and geometric deformation differences caused by different spectral responses. To address these problems, we propose a phase consistency weighted least absolute deviation (PCWLAD) sub-pixel template matching method to improve the matching accuracy of multimodal optical images. This method consists of two main steps: coarse matching with the structural similarity index measure (SSIM) and fine matching with WLAD. In the coarse matching step, PCs are calculated without a noise filter to preserve the original structural details, and template matching is performed using the SSIM. In the fine matching step, we apply the radiometric and geometric transformation models between two multimodal PC templates based on the coarse matching. Furthermore, mutual structure filtering is adopted in the model to mitigate the impact of noise within the corresponding templates on the structural consistency, and the WLAD criterion is used to estimate the sub-pixel offset. To evaluate the performance of PCWLAD, we created three types of image datasets: visible to infrared Landsat images, visible to near-infrared close-range images, and visible to infrared uncrewed aerial vehicle (UAV) images. PCWLAD outperformed eight existing state-of-the-art methods in terms of correct matching rate (CMR) and root mean square error (RMSE) and reached an average matching accuracy of approximately 0.4 pixels across all three datasets. Our software and datasets are publicly available at this https URL.

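[CV-99] JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics

[Quick read]: This paper addresses three problems in current visual reasoning benchmarks: no clear definition of reasoning complexity, no way to control the generation of questions of varying difficulty with task customization, and no structured step-by-step reasoning annotations (workflows). The solution has three parts: first, it formalizes reasoning complexity; second, it proposes an adaptive query engine that generates customizable questions of adjustable complexity with detailed intermediate annotations; third, it extends the JRDB dataset with human-object interaction and geometric relationship annotations to build JRDB-Reasoning, a benchmark for visual reasoning in human-crowded environments. Together these enable fine-grained evaluation and dynamic testing of vision-language models across reasoning levels.

Link: https://arxiv.org/abs/2508.10287
Authors: Simindokht Jahangard,Mehrzad Mohammadi,Yi Shen,Zhixi Cai,Hamid Rezatofighi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: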

Abstract:Recent advances in Vision-Language Models (VLMs) and large language models (LLMs) have greatly enhanced visual reasoning, a key capability for embodied AI agents like robots. However, existing visual reasoning benchmarks often suffer from several limitations: they lack a clear definition of reasoning complexity, offer no control over generating questions of varying difficulty and task customization, and fail to provide structured, step-by-step reasoning annotations (workflows). To bridge these gaps, we formalize reasoning complexity, introduce an adaptive query engine that generates customizable questions of varying complexity with detailed intermediate annotations, and extend the JRDB dataset with human-object interaction and geometric relationship annotations to create JRDB-Reasoning, a benchmark tailored for visual reasoning in human-crowded environments. Our engine and benchmark enable fine-grained evaluation of visual reasoning frameworks and dynamic assessment of visual-language models across reasoning levels.

[CV-100] VIFSS: View-Invariant and Figure Skating-Specific Pose Representation Learning for Temporal Action Segmentation

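[Quick read]: This paper addresses fine-grained Temporal Action Segmentation (TAS) of jumps in figure skating, where the core challenges are scarce annotated data and existing methods that do not adequately model the three-dimensional nature and procedural structure of jumps. The key to the solution is a new framework that combines 3D pose representation learning with fine-grained phase annotation: first, a View-Invariant, Figure Skating-Specific pose representation learning method (VIFSS) combines contrastive pre-training with action-classification fine-tuning to improve generalization; second, the authors build FS-Jump3D, the first publicly available 3D pose dataset specialized for figure skating jumps, and introduce fine-grained annotations of the "entry (preparation)" and "landing" phases so that models can explicitly learn the procedural structure of a jump. Experiments show that the method exceeds 92% F1@50 on element-level TAS and still performs well when fine-tuning data is limited, underscoring its practical value.

Link: https://arxiv.org/abs/2508.10281
Authors: Ryota Tanaka,Tomohiro Suzuki,Keisuke Fujii
Affiliations: Nagoya University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: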

Abstract:Understanding human actions from videos plays a critical role across various domains, including sports analytics. In figure skating, accurately recognizing the type and timing of jumps a skater performs is essential for objective performance evaluation. However, this task typically requires expert-level knowledge due to the fine-grained and complex nature of jump procedures. While recent approaches have attempted to automate this task using Temporal Action Segmentation (TAS), there are two major limitations to TAS for figure skating: the annotated data is insufficient, and existing methods do not account for the inherent three-dimensional aspects and procedural structure of jump actions. In this work, we propose a new TAS framework for figure skating jumps that explicitly incorporates both the three-dimensional nature and the semantic procedure of jump movements. First, we propose a novel View-Invariant, Figure Skating-Specific pose representation learning approach (VIFSS) that combines contrastive learning as pre-training and action classification as fine-tuning. For view-invariant contrastive pre-training, we construct FS-Jump3D, the first publicly available 3D pose dataset specialized for figure skating jumps. Second, we introduce a fine-grained annotation scheme that marks the "entry (preparation)" and "landing" phases, enabling TAS models to learn the procedural structure of jumps. Extensive experiments demonstrate the effectiveness of our framework. Our method achieves over 92% F1@50 on element-level TAS, which requires recognizing both jump types and rotation levels. Furthermore, we show that view-invariant contrastive pre-training is particularly effective when fine-tuning data is limited, highlighting the practicality of our approach in real-world scenarios.
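
For readers unfamiliar with the F1@50 figure quoted above, the sketch below shows the segmental F1 score as it is commonly defined in the temporal action segmentation literature: a predicted segment counts as a true positive if its IoU with a not-yet-matched ground-truth segment of the same label is at least the threshold. The abstract does not spell out the paper's exact evaluation protocol (for instance, how background frames are treated), so this is a generic sketch, and the labels `"bg"` and `"jump"` in the usage note are made up.

```python
def to_segments(labels):
    """Collapse per-frame labels into (label, start, end) runs, end exclusive."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments

def f1_at_k(pred, gt, overlap=0.5):
    """Segmental F1 at an IoU threshold (overlap=0.5 gives F1@50)."""
    p_segs, g_segs = to_segments(pred), to_segments(gt)
    matched = [False] * len(g_segs)
    tp = fp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (g_label, gs, ge) in enumerate(g_segs):
            if g_label != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        # Each ground-truth segment can be claimed by one prediction only.
        if best_j >= 0 and best_iou >= overlap and not matched[best_j]:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1
    fn = matched.count(False)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```

For example, with `gt = ["bg"] * 4 + ["jump"] * 4 + ["bg"] * 2` and a prediction whose jump segment starts one frame late, every segment still overlaps its counterpart by at least 50%, so F1@50 stays at 1.0, whereas a stricter 75% threshold penalizes the shifted boundaries.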

[CV-101] High Fidelity Text to Image Generation with Contrastive Alignment and Structural Guidance

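[Quick read]: This paper addresses the performance bottlenecks of existing text-driven image generation methods in semantic alignment accuracy and structural consistency. The key to the solution is to combine text-image contrastive constraints with a structural guidance mechanism: a contrastive learning module builds strong cross-modal alignment constraints that improve the semantic match between text and image, while structural priors such as semantic layout maps or edge sketches guide the generator's spatial structure modeling, improving the layout completeness and detail fidelity of generated images. The overall framework jointly optimizes a contrastive loss, a structural consistency loss, and a semantic preservation loss, giving multi-objective supervision that improves the semantic consistency and controllability of the generated content.

Link: https://arxiv.org/abs/2508.10280
Authors: Danyi Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: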

Abstract:This paper addresses the performance bottlenecks of existing text-driven image generation methods in terms of semantic alignment accuracy and structural consistency. A high-fidelity image generation method is proposed by integrating text-image contrastive constraints with structural guidance mechanisms. The approach introduces a contrastive learning module that builds strong cross-modal alignment constraints to improve semantic matching between text and image. At the same time, structural priors such as semantic layout maps or edge sketches are used to guide the generator in spatial-level structural modeling. This enhances the layout completeness and detail fidelity of the generated images. Within the overall framework, the model jointly optimizes contrastive loss, structural consistency loss, and semantic preservation loss. A multi-objective supervision mechanism is adopted to improve the semantic consistency and controllability of the generated content. Systematic experiments are conducted on the COCO-2014 dataset. Sensitivity analyses are performed on embedding dimensions, text length, and structural guidance strength. Quantitative metrics confirm the superior performance of the proposed method in terms of CLIP Score, FID, and SSIM. The results show that the method effectively bridges the gap between semantic alignment and structural fidelity without increasing computational complexity. It demonstrates a strong ability to generate semantically clear and structurally complete images, offering a viable technical path for joint text-image modeling and image generation.

[CV-102] Pose-Robust Calibration Strategy for Point-of-Gaze Estimation on Mobile Phones BMVC

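[Quick read]: This paper addresses the poor cross-person generalization of appearance-based point-of-gaze (PoG) estimation and the performance drop that existing calibration methods suffer under head pose variation. The key to the solution is a benchmark dataset, MobilePoG, containing 32 subjects fixating on designated points under fixed or continuously changing head poses, which is used to systematically analyze how the diversity of calibration points and head poses affects estimation accuracy. Building on this, the paper proposes a dynamic calibration strategy in which users fixate on calibration points while moving their phones, which naturally introduces head pose variation, makes calibration more efficient and robust, and yields a calibrated PoG estimator that is less sensitive to head pose.

Link: https://arxiv.org/abs/2508.10268
Authors: Yujie Zhao,Jiabei Zeng,Shiguang Shan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted for British Machine Vision Conference (BMVC) 2025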

Abstract:Although appearance-based point-of-gaze (PoG) estimation has improved, the estimators still struggle to generalize across individuals due to personal differences. Therefore, person-specific calibration is required for accurate PoG estimation. However, calibrated PoG estimators are often sensitive to head pose variations. To address this, we investigate the key factors influencing calibrated estimators and explore pose-robust calibration strategies. Specifically, we first construct a benchmark, MobilePoG, which includes facial images from 32 individuals focusing on designated points under either fixed or continuously changing head poses. Using this benchmark, we systematically analyze how the diversity of calibration points and head poses influences estimation accuracy. Our experiments show that introducing a wider range of head poses during calibration improves the estimator’s ability to handle pose variation. Building on this insight, we propose a dynamic calibration strategy in which users fixate on calibration points while moving their phones. This strategy naturally introduces head pose variation during a user-friendly and efficient calibration process, ultimately producing a better calibrated PoG estimator that is less sensitive to head pose variations than those using conventional calibration strategies. Codes and datasets are available at our project page.

[CV-103] MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs

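[Quick read]: This paper addresses hallucination in Large Vision-Language Models (LVLMs) on multimodal tasks, that is, generated text that is inconsistent with the image content, which stems from the models' limited ability to verify information across different image regions. The key to the solution is a training-free decoding method, Multi-Region Fusion Decoding (MRFD): it identifies salient regions with cross-attention, generates a response for each region, computes reliability weights from the Jensen-Shannon Divergence (JSD) among these responses, and uses region-aware prompts to guide a consistency-aware fusion, improving the factual accuracy of the output text.

Link: https://arxiv.org/abs/2508.10264
Authors: Haonan Ge,Yiwei Wang,Ming-Hsuan Yang,Yujun Cai
Affiliations: University of California, Merced; The University of Queensland
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: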

Abstract:Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations – text that is inconsistent with visual input, due to the limited ability to verify information in different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates initial responses for each, and computes reliability weights based on Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.
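
To make the JSD-based weighting concrete, here is a minimal sketch of how agreement among per-region predictions could be turned into fusion weights. The Jensen-Shannon divergence itself is standard; everything else is an assumption for illustration: each region is represented by one probability distribution over a shared vocabulary, a region's score is `exp(-mean JSD to the other regions / temperature)`, and the names `reliability_weights` and `fuse` are hypothetical. The abstract does not give the paper's actual weighting formula.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 vanish."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def reliability_weights(dists, temperature=1.0):
    """Assumed scheme: regions that agree with the others get more weight.

    Requires at least two regions.
    """
    n = len(dists)
    mean_div = [
        sum(jsd(dists[i], dists[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    scores = [math.exp(-d / temperature) for d in mean_div]
    total = sum(scores)
    return [s / total for s in scores]

def fuse(dists, weights):
    """Weighted mixture of the per-region distributions."""
    return [sum(w * d[k] for w, d in zip(weights, dists))
            for k in range(len(dists[0]))]
```

Under this scheme a region whose prediction disagrees with all the others receives the smallest weight, so it contributes least to the fused distribution; lowering `temperature` sharpens that effect.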

[CV-104] Deep Learning for Crack Detection: A Review of Learning Paradigms Generalizability and Datasets

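[Quick read]: This paper addresses the limitations of current deep-learning-based crack detection in learning paradigms, generalizability, and data diversity. Its key contribution is a systematic review of the shift from fully supervised learning to emerging paradigms (semi-supervised, weakly supervised, unsupervised, few-shot, domain adaptation, and fine-tuning of foundation models), together with the push toward cross-dataset generalization. It also introduces 3DCrack, a new dataset collected with 3D laser scans, as benchmark data for future research, and establishes performance baselines for mainstream deep learning methods and recent foundation models.

Link: https://arxiv.org/abs/2508.10256
Authors: Xinan Zhang,Haolin Wang,Yung-An Hsieh,Zhongyu Yang,Anthony Yezzi,Yi-Chang Tsai
Affiliations: Georgia Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: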

Abstract:Crack detection plays a crucial role in civil infrastructures, including inspection of pavements, buildings, etc., and deep learning has significantly advanced this field in recent years. While numerous technical and review papers exist in this domain, emerging trends are reshaping the landscape. These shifts include transitions in learning paradigms (from fully supervised learning to semi-supervised, weakly-supervised, unsupervised, few-shot, domain adaptation and fine-tuning foundation models), improvements in generalizability (from single-dataset performance to cross-dataset evaluation), and diversification in dataset reacquisition (from RGB images to specialized sensor-based data). In this review, we systematically analyze these trends and highlight representative works. Additionally, we introduce a new dataset collected with 3D laser scans, 3DCrack, to support future research and conduct extensive benchmarking experiments to establish baselines for commonly used deep learning methodologies, including recent foundation models. Our findings provide insights into the evolving methodologies and future directions in deep learning-based crack detection. Project page: this https URL

[CV-105] CellSymphony: Deciphering the molecular and phenotypic orchestration of cells with single-cell pathomics

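[Quick read]: This paper addresses how to extract robust cell-level features from histology images and integrate them effectively with spatial transcriptomics data, enabling accurate cell type annotation and analysis of microenvironmental heterogeneity in complex tumor tissue. The key to the solution is CellSymphony, a flexible multimodal framework that uses foundation models to extract embeddings at true single-cell resolution from both Xenium spatial transcriptomics data and histology images, and learns a joint representation that fuses spatial gene expression with morphological context for cross-modal analysis.

Link: https://arxiv.org/abs/2508.10232
Authors: Paul H. Acosta,Pingjun Chen,Simon P. Castillo,Maria Esther Salvatierra,Yinyin Yuan,Xiaoxi Pan
Affiliations: The University of Texas MD Anderson Cancer Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: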

Abstract:Xenium, a new spatial transcriptomics platform, enables subcellular-resolution profiling of complex tumor tissues. Despite the rich morphological information in histology images, extracting robust cell-level features and integrating them with spatial transcriptomics data remains a critical challenge. We introduce CellSymphony, a flexible multimodal framework that leverages foundation model-derived embeddings from both Xenium transcriptomic profiles and histology images at true single-cell resolution. By learning joint representations that fuse spatial gene expression with morphological context, CellSymphony achieves accurate cell type annotation and uncovers distinct microenvironmental niches across three cancer types. This work highlights the potential of foundation models and multimodal fusion for deciphering the physiological and phenotypic orchestration of cells within complex tissue ecosystems.

[CV-106] EntropyGS: An Efficient Entropy Coding on 3D Gaussian Splatting

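[Quick read]: This paper addresses the efficiency of storing, transmitting, and compressing 3D Gaussian Splatting (3DGS) data, in particular how to reach a high compression ratio without a significant loss in rendering quality. The core challenge is that the distributions of the Gaussian attributes (rotation, scaling, opacity, spherical harmonic coefficients, and so on) are complex and interrelated, which traditional coding methods model poorly. The key to the solution is EntropyGS, a factorized and parameterized entropy coding method: a statistical analysis first shows that the spherical harmonic AC attributes follow Laplace distributions, while rotation, scaling, and opacity can be approximated by mixtures of Gaussians; these distributions are then used to estimate the coding parameters of each attribute adaptively, followed by quantization and entropy coding. The result is about a 30x rate reduction with rendering quality similar to the input 3DGS data and fast encoding and decoding.

Link: https://arxiv.org/abs/2508.10227
Authors: Yuning Huang,Jiahao Pang,Fengqing Zhu,Dong Tian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: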

Abstract:As an emerging novel view synthesis approach, 3D Gaussian Splatting (3DGS) demonstrates fast training/rendering with superior visual quality. The two tasks of 3DGS, Gaussian creation and view rendering, are typically separated over time or devices, and thus storage/transmission and finally compression of 3DGS Gaussians become necessary. We begin with a correlation and statistical analysis of 3DGS Gaussian attributes. An inspiring finding in this work reveals that spherical harmonic AC attributes precisely follow Laplace distributions, while mixtures of Gaussian distributions can approximate rotation, scaling, and opacity. Additionally, harmonic AC attributes manifest weak correlations with other attributes except for inherited correlations from a color space. A factorized and parameterized entropy coding method, EntropyGS, is hereinafter proposed. During encoding, distribution parameters of each Gaussian attribute are estimated to assist their entropy coding. The quantization for entropy coding is adaptively performed according to Gaussian attribute types. EntropyGS demonstrates about 30x rate reduction on benchmark datasets while maintaining similar rendering quality compared to input 3DGS data, with a fast encoding and decoding time.
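
The idea of parameterized entropy coding for a Laplace-distributed attribute can be illustrated with a short sketch. It fits a Laplace distribution by maximum likelihood (location = median, scale = mean absolute deviation from the median), quantizes values uniformly, and sums the ideal code length `-log2(p)` of each quantized symbol. This is not the authors' codec: the uniform quantizer, the `step` parameter, and the function names are assumptions, and a real system would feed these probabilities to an arithmetic or range coder rather than just summing bits.

```python
import math
from statistics import median

def fit_laplace(values):
    """Maximum-likelihood Laplace fit: median and mean absolute deviation."""
    mu = median(values)
    b = sum(abs(v - mu) for v in values) / len(values)
    return mu, max(b, 1e-9)  # guard against a zero scale for constant input

def laplace_cdf(x, mu, b):
    """Cumulative distribution function of Laplace(mu, b)."""
    if x < mu:
        return 0.5 * math.exp((x - mu) / b)
    return 1.0 - 0.5 * math.exp(-(x - mu) / b)

def estimated_bits(values, step):
    """Ideal total code length (bits) after uniform quantization with bin width `step`."""
    mu, b = fit_laplace(values)
    total = 0.0
    for v in values:
        q = round(v / step)                       # quantization index
        lo, hi = (q - 0.5) * step, (q + 0.5) * step
        p = laplace_cdf(hi, mu, b) - laplace_cdf(lo, mu, b)
        total += -math.log2(max(p, 1e-12))        # floor avoids log2(0)
    return total
```

Only the two fitted parameters per attribute need to be transmitted alongside the coded symbols, which is what makes a parametric model attractive here; a coarser `step` lowers the bit count at the cost of larger quantization error.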

[CV-107] AI-Driven Detection and Analysis of Handwriting on Seized Ivory: A Tool to Uncover Criminal Networks in the Illicit Wildlife Trade

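[Quick read]: This paper addresses the continuing threat that elephant poaching and the transnational ivory trade pose to African elephant populations, and in particular the difficulty, cost, and limited coverage of the evidence that law enforcement currently relies on to trace trafficking networks. The core solution is an AI-based pipeline that extracts and analyzes the handwritten markings traffickers leave on seized ivory, providing a novel, scalable, and low-cost source of forensic evidence. An object detection model automatically extracted more than 17,000 individual markings from 6,085 photographs, which were then labeled and described with state-of-the-art AI tools; this identified 184 recurring "signature markings", 20 of which appear in multiple seizures and therefore establish links between cases, complementing data sources such as DNA where those are unavailable.

Link: https://arxiv.org/abs/2508.10219
Authors: Will Fein,Ryan J. Horwitz,John E. Brown III,Amit Misra,Felipe Oviedo,Kevin White,Juan M. Lavista Ferres,Samuel K. Wasser
Affiliations: Microsoft; University of Washington
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted. 13 pages, 5 figures, 4 tables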

Abstract:The transnational ivory trade continues to drive the decline of elephant populations across Africa, and trafficking networks remain difficult to disrupt. Tusks seized by law enforcement officials carry forensic information on the traffickers responsible for their export, including DNA evidence and handwritten markings made by traffickers. For 20 years, analyses of tusk DNA have identified where elephants were poached and established connections among shipments of ivory. While the links established using genetic evidence are extremely conclusive, genetic data is expensive and sometimes impossible to obtain. But though handwritten markings are easy to photograph, they are rarely documented or analyzed. Here, we present an AI-driven pipeline for extracting and analyzing handwritten markings on seized elephant tusks, offering a novel, scalable, and low-cost source of forensic evidence. Having collected 6,085 photographs from eight large seizures of ivory over a 6-year period (2014-2019), we used an object detection model to extract over 17,000 individual markings, which were then labeled and described using state-of-the-art AI tools. We identified 184 recurring “signature markings” that connect the tusks on which they appear. 20 signature markings were observed in multiple seizures, establishing forensic links between these seizures through traffickers involved in both shipments. This work complements other investigative techniques by filling in gaps where other data sources are unavailable. The study demonstrates the transformative potential of AI in wildlife forensics and highlights practical steps for integrating handwriting analysis into efforts to disrupt organized wildlife crime.
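
The final linking step, where a signature marking seen in more than one seizure connects those seizures, amounts to a simple grouping operation. The sketch below assumes the upstream detection and labeling have already produced `(seizure_id, marking_id)` pairs; the identifiers and the function name `link_seizures` are hypothetical, and the paper's actual matching procedure is not described in the abstract.

```python
from collections import defaultdict
from itertools import combinations

def link_seizures(observations):
    """Map each pair of seizures to the set of signature markings they share.

    observations: iterable of (seizure_id, marking_id) pairs.
    """
    seizures_by_marking = defaultdict(set)
    for seizure_id, marking_id in observations:
        seizures_by_marking[marking_id].add(seizure_id)

    links = defaultdict(set)
    for marking_id, seizures in seizures_by_marking.items():
        # A marking seen in only one seizure produces no pairs.
        for a, b in combinations(sorted(seizures), 2):
            links[(a, b)].add(marking_id)
    return dict(links)
```

The number of shared markings per pair gives a rough measure of how strongly two seizures are connected, which could then be compared against links established from other evidence such as DNA.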

[CV-108] SynSpill: Improved Industrial Spill Detection With Synthetic Data ICCV

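[Quick read]: This paper addresses the sharp performance drop of large-scale Vision-Language Models (VLMs) on rare-event industrial detection tasks such as hazardous spill detection, where real data is scarce, sensitive, and difficult to annotate. The core challenge is that conventional fine-tuning fails when data is this scarce, and although VLMs generalize zero-shot, they do not meet the accuracy that safety-critical applications require. The key to the solution is a high-quality synthetic data generation pipeline (the SynSpill dataset): the high-fidelity synthetic data enables Parameter-Efficient Fine-Tuning (PEFT) of VLMs and substantially improves both VLMs and mainstream object detectors such as YOLO and DETR in industrial settings, bringing the two to comparable performance and offering a low-cost, scalable path to deploying industrial vision systems.

Link: https://arxiv.org/abs/2508.10171
Authors: Aaditya Baranwal,Abdul Mueez,Jason Voelker,Guneet Bhatia,Shruti Vyas
Affiliations: University of Central Florida; Siemens Energy
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments: Accepted at ICCV (VISION’25 Workshop) 2025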

Abstract:Large-scale Vision-Language Models (VLMs) have transformed general-purpose visual recognition through strong zero-shot capabilities. However, their performance degrades significantly in niche, safety-critical domains such as industrial spill detection, where hazardous events are rare, sensitive, and difficult to annotate. This scarcity – driven by privacy concerns, data sensitivity, and the infrequency of real incidents – renders conventional fine-tuning of detectors infeasible for most industrial settings. We address this challenge by introducing a scalable framework centered on a high-quality synthetic data generation pipeline. We demonstrate that this synthetic corpus enables effective Parameter-Efficient Fine-Tuning (PEFT) of VLMs and substantially boosts the performance of state-of-the-art object detectors such as YOLO and DETR. Notably, in the absence of synthetic data (SynSpill dataset), VLMs still generalize better to unseen spill scenarios than these detectors. When SynSpill is used, both VLMs and detectors achieve marked improvements, with their performance becoming comparable. Our results underscore that high-fidelity synthetic data is a powerful means to bridge the domain gap in safety-critical applications. The combination of synthetic generation and lightweight adaptation offers a cost-effective, scalable pathway for deploying vision systems in industrial environments where real data is scarce/impractical to obtain. Project Page: this https URL

[CV-109] Improving watermelon (Citrullus lanatus) disease classification with generative artificial intelligence (GenAI)-based synthetic and real-field images via a custom EfficientNetV2-L model

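[Quick Read]: This paper addresses a performance bottleneck in training crop disease classifiers: real images are costly to collect and samples are scarce. The core solution is a hybrid training strategy that combines a small number of real images with a large number of images produced by generative AI (GenAI), with the goal of improving prediction accuracy and generalization for watermelon (Citrullus lanatus) disease classification. The key finding is that synthetic images alone (H1) are of limited use, while adding a small number of real images (treatments H2, H3, and H4) markedly improves performance. In particular, at a real-to-synthetic ratio of 1:10 the weighted F1-score rises from 0.65 with real images only to 1.00, which suggests that the hybrid strategy compensates for the limited distributional realism of synthetic data and yields a more robust disease classifier.

Link: https://arxiv.org/abs/2508.10156
Authors: Nitin Rai, Nathan S. Boyd, Gary E. Vallad, Arnold W. Schumann
Affiliations: University of Florida
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: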

Abstract:The current advancements in generative artificial intelligence (GenAI) models have paved the way for new possibilities for generating high-resolution synthetic images, thereby offering a promising alternative to traditional image acquisition for training computer vision models in agriculture. In the context of crop disease diagnosis, GenAI models are being used to create synthetic images of various diseases, potentially facilitating model creation and reducing the dependency on resource-intensive in-field data collection. However, limited research has been conducted on evaluating the effectiveness of integrating real with synthetic images to improve disease classification performance. Therefore, this study aims to investigate whether combining a limited number of real images with synthetic images can enhance the prediction accuracy of an EfficientNetV2-L model for classifying watermelon (Citrullus lanatus) diseases. The training dataset was divided into five treatments: H0 (only real images), H1 (only synthetic images), H2 (1:1 real-to-synthetic), H3 (1:10 real-to-synthetic), and H4 (H3 + random images to improve variability and model generalization). All treatments were trained using a custom EfficientNetV2-L architecture with enhanced fine-tuning and transfer learning techniques. Models trained on H2, H3, and H4 treatments demonstrated high precision, recall, and F1-score metrics. Additionally, the weighted F1-score increased from 0.65 (on H0) to 1.00 (on H3-H4) signifying that the addition of a small number of real images with a considerable volume of synthetic images improved model performance and generalizability. Overall, this validates the findings that synthetic images alone cannot adequately substitute for real images; instead, both must be used in a hybrid manner to maximize model performance for crop disease classification.
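
The treatment design and the weighted F1-score in the abstract above can be illustrated with a small sketch. It is not the authors' pipeline: the function names `mix_by_ratio` and `weighted_f1`, the file names, and the class labels are hypothetical, and the sketch only assumes that a 1:N treatment pairs every real image with N randomly drawn synthetic ones.

```python
import random
from collections import Counter


def mix_by_ratio(real, synthetic, synth_per_real, seed=0):
    """Return all real items plus `synth_per_real` synthetic items per real item."""
    n_synth = len(real) * synth_per_real
    if n_synth > len(synthetic):
        raise ValueError("not enough synthetic images for the requested ratio")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    return list(real) + rng.sample(list(synthetic), n_synth)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        total += n * (2 * tp / (2 * tp + fp + fn))
    return total / len(y_true)


# Hypothetical file lists standing in for the real and GenAI-generated images.
real = [f"real_{i}.jpg" for i in range(5)]
synthetic = [f"syn_{i}.jpg" for i in range(100)]
h2 = mix_by_ratio(real, synthetic, 1)    # 1:1 real-to-synthetic
h3 = mix_by_ratio(real, synthetic, 10)   # 1:10 real-to-synthetic
```

Weighting by support matters when disease classes are imbalanced, since an unweighted mean would let a rare class dominate the score.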

[CV-110] MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning

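[Quick Read]: This paper addresses a limitation of current multimodal fusion methods, which rely on Transformer attention to learn correlations between modality features implicitly. As a result they struggle to capture the essential features of each modality, which hinders understanding of complex multimodal structures and correlations. The key to the solution is MANGO (Multimodal Attention-based Normalizing Flow), an explicit, interpretable, and tractable fusion framework built on normalizing flows and realized through a new Invertible Cross-Attention (ICA) layer. The ICA layer introduces three new cross-attention mechanisms, Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA), to model complex latent correlations in multimodal data efficiently, while the normalizing-flow structure keeps the method scalable and differentiable on high-dimensional multimodal data.

Link: https://arxiv.org/abs/2508.10133
Authors: Thanh-Dat Truong, Christophe Bobda, Nitin Agarwal, Khoa Luu
Affiliations: CVIU Lab, University of Arkansas, USA; University of Florida, USA; University of Arkansas at Little Rock, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: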

Abstract:Multimodal learning has gained much success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlation of multimodal features. As a result, the multimodal model cannot capture the essential features of each modality, making it difficult to comprehend complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach to developing explicit, interpretable, and tractable multimodal fusion learning (the source code of this work will be publicly available). In particular, we propose a new Invertible Cross-Attention (ICA) layer to develop the Normalizing Flow-based Model for multimodal data. To efficiently capture the complex, underlying correlations in multimodal data in our proposed invertible cross-attention layer, we propose three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, we introduce a new Multimodal Attention-based Normalizing Flow to enable the scalability of our proposed method to high-dimensional multimodal data. Our experimental results on three different multimodal learning tasks, i.e., semantic segmentation, image-to-image translation, and movie genre classification, have illustrated the state-of-the-art (SoTA) performance of the proposed approach.
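
For readers new to normalizing flows, the reason invertibility matters can be stated with the standard change-of-variables identity. This is the generic textbook formula, not MANGO's specific objective; the symbols $f_\theta$ (an invertible map), $p_Z$ (a simple base density), and $x$ (the input) are notation assumed here for illustration.

```latex
\log p_X(x) \;=\; \log p_Z\bigl(f_\theta(x)\bigr) \;+\; \log\left|\det \frac{\partial f_\theta(x)}{\partial x}\right|
```

A layer is usable inside a flow only if it is invertible and its Jacobian determinant is cheap to evaluate, which is presumably why the abstract emphasizes an invertible form of cross-attention.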

[CV-111] Deep Learning Enables Large-Scale Shape and Appearance Modeling in Total-Body DXA Imaging MICCAI2025

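[Quick Read]: This paper addresses the automatic placement of anatomical fiducial points on total-body dual X-ray absorptiometry (TBDXA) images, in support of downstream shape and appearance modeling (SAM). Conventional approaches rely on manual annotation, which is slow and hard to scale, limiting their use in large health studies. The key to the solution is a deep learning method, trained on 1,683 manually annotated TBDXA scans, that reaches 99.5% percentage correct keypoints on an external test set. The method places standardized keypoints on 35,928 TBDXA scans across different imaging modes, which supports SAM model construction. Two-sample Kolmogorov-Smirnov tests on two independent cohorts then reveal associations between body shape features and health markers (frailty, metabolic, inflammation, and cardiometabolic), corroborating existing evidence and suggesting new hypotheses.

Link: https://arxiv.org/abs/2508.10132
Authors: Arianna Bunnell, Devon Cataldi, Yannik Glaser, Thomas K. Wolfgruber, Steven Heymsfield, Alan B. Zonderman, Thomas L. Kelly, Peter Sadowski, John A. Shepherd
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint of manuscript accepted to the ShapeMI workshop at MICCAI 2025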

Abstract:Total-body dual X-ray absorptiometry (TBDXA) imaging is a relatively low-cost whole-body imaging modality, widely used for body composition assessment. We develop and validate a deep learning method for automatic fiducial point placement on TBDXA scans using 1,683 manually-annotated TBDXA scans. The method achieves 99.5% percentage correct keypoints in an external testing dataset. To demonstrate the value for shape and appearance modeling (SAM), our method is used to place keypoints on 35,928 scans for five different TBDXA imaging modes, then associations with health markers are tested in two cohorts not used for SAM model generation using two-sample Kolmogorov-Smirnov tests. SAM feature distributions associated with health biomarkers are shown to corroborate existing evidence and generate new hypotheses on body composition and shape’s relationship to various frailty, metabolic, inflammation, and cardiometabolic health markers. Evaluation scripts, model weights, automatic point file generation code, and triangulation files are available at this https URL.
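
The two evaluation tools named in the abstract above, percentage correct keypoints (PCK) and the two-sample Kolmogorov-Smirnov test, can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation script: the distance threshold is an assumption (the abstract does not state how it is chosen or normalized), and `ks_statistic` returns only the test statistic, not a p-value.

```python
import bisect
import math


def pck(pred, truth, threshold):
    """Fraction of keypoints predicted within `threshold` of the ground truth."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and the same length")
    hits = sum(1 for p, t in zip(pred, truth) if math.dist(p, t) <= threshold)
    return hits / len(pred)


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Share of observations that are <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))


# Toy keypoints in pixel coordinates (hypothetical values).
predicted = [(0, 0), (3, 4), (10, 10)]
annotated = [(0, 1), (0, 0), (10, 10)]
score = pck(predicted, annotated, threshold=2)
```

In practice a library routine such as SciPy's `ks_2samp` would also supply the p-value needed for the association tests.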

[CV-112] From Intent to Execution: Multimodal Chain-of-Thought Reinforcement Learning for Precise CAD Code Generation

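[Quick Read]: This paper addresses the heavy dependence of current Computer-Aided Design (CAD) workflows on domain expertise and manual modeling, and in particular the problem of translating human design intent into executable parametric 3D modeling code efficiently and accurately. The key to the solution is CAD-RL, a multimodal Chain-of-Thought (CoT) guided reinforcement learning post-training framework. It combines a CoT-based cold start with goal-driven reinforcement learning and uses three task-specific rewards, an executability reward, a geometric accuracy reward, and an external evaluation reward, to improve the logical soundness, numerical precision, and practical usability of the generated code. To counter the training instability caused by sparse, high-variance rewards, it adds three optimization strategies: Trust Region Stretch, Precision Token Loss, and Overlong Filtering. Together these yield clear gains in reasoning quality, output precision, and code executability.

Link: https://arxiv.org/abs/2508.10118
Authors: Ke Niu, Haiyang Yu, Zhuofan Chen, Mengyang Zhao, Teng Fu, Bin Li, Xiangyang Xue
Affiliations: Fudan University; ByteDance Inc.
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: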

Abstract:Computer-Aided Design (CAD) plays a vital role in engineering and manufacturing, yet current CAD workflows require extensive domain expertise and manual modeling effort. Recent advances in large language models (LLMs) have made it possible to generate code from natural language, opening new opportunities for automating parametric 3D modeling. However, directly translating human design intent into executable CAD code remains highly challenging, due to the need for logical reasoning, syntactic correctness, and numerical precision. In this work, we propose CAD-RL, a multimodal Chain-of-Thought (CoT) guided reinforcement learning post-training framework for CAD modeling code generation. Our method combines CoT-based Cold Start with goal-driven reinforcement learning post-training using three task-specific rewards: executability reward, geometric accuracy reward, and external evaluation reward. To ensure stable policy learning under sparse and high-variance reward conditions, we introduce three targeted optimization strategies: Trust Region Stretch for improved exploration, Precision Token Loss for enhanced dimension parameter accuracy, and Overlong Filtering to reduce noisy supervision. To support training and benchmarking, we release ExeCAD, a novel dataset comprising 16,540 real-world CAD examples with paired natural language and structured design language descriptions, executable CADQuery scripts, and rendered 3D models. Experiments demonstrate that CAD-RL achieves significant improvements in reasoning quality, output precision, and code executability over existing VLMs.
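
One way to picture how three task-specific rewards and Overlong Filtering might fit together is the sketch below. Everything specific in it is an assumption made for illustration: the abstract does not say how the rewards are combined, so the weighted sum, the rule that a non-executable script earns zero, the weights, and the names `combined_reward` and `drop_overlong` are all hypothetical.

```python
def combined_reward(executable, geometric_accuracy, external_score,
                    weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three rewards (hypothetical combination rule)."""
    w_exec, w_geo, w_ext = weights
    if not executable:
        # Assumption: a script that fails to run has no geometry to score.
        return 0.0
    return w_exec * 1.0 + w_geo * geometric_accuracy + w_ext * external_score


def drop_overlong(samples, max_tokens):
    """Keep only rollouts whose token count fits the budget."""
    return [s for s in samples if len(s["tokens"]) <= max_tokens]


# Hypothetical rollouts: token lists stand in for generated CAD scripts.
rollouts = [
    {"id": "short", "tokens": ["box", "(", "1", ")"]},
    {"id": "long", "tokens": ["x"] * 50},
]
kept = drop_overlong(rollouts, max_tokens=10)
```

Gating on executability first keeps the geometric and external scores from rewarding code that cannot produce a model at all.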

[CV-113] Interpretable Oracle Bone Script Decipherment through Radical and Pictographic Analysis with LVLMs

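[Quick Read]: This paper addresses the difficulty of deciphering Oracle Bone Script (OBS), which stems from the rarity, abstractness, and pictographic diversity of its glyphs. It targets the weak generalization and poor interpretability of existing deep learning methods that ignore the links between radicals and semantics. The key to the solution is an interpretable decipherment method based on Large Vision-Language Models (LVLMs), with two main contributions. First, a progressive training strategy guides the model from radical recognition to pictographic analysis and then to mutual glyph-semantic reasoning. Second, a Radical-Pictographic Dual Matching mechanism uses the structured analysis results to strengthen zero-shot decipherment. The method markedly improves decipherment accuracy on unseen OBS characters and produces a clear, logical analysis process, which may offer archaeologically useful reference results for undeciphered OBS.

Link: https://arxiv.org/abs/2508.10113
Authors: Kaixin Peng, Mengyang Zhao, Haiyang Yu, Teng Fu, Bin Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: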

Abstract:As the oldest mature writing system, Oracle Bone Script (OBS) has long posed significant challenges for archaeological decipherment due to its rarity, abstractness, and pictographic diversity. Current deep learning-based methods have made exciting progress on the OBS decipherment task, but existing approaches often ignore the intricate connections between glyphs and the semantics of OBS. This results in limited generalization and interpretability, especially when addressing zero-shot settings and undeciphered OBS. To this end, we propose an interpretable OBS decipherment method based on Large Vision-Language Models, which synergistically combines radical analysis and pictograph-semantic understanding to bridge the gap between glyphs and meanings of OBS. Specifically, we propose a progressive training strategy that guides the model from radical recognition and analysis to pictographic analysis and mutual analysis, thus enabling reasoning from glyph to meaning. We also design a Radical-Pictographic Dual Matching mechanism informed by the analysis results, significantly enhancing the model’s zero-shot decipherment performance. To facilitate model training, we propose the Pictographic Decipherment OBS Dataset, which comprises 47,157 Chinese characters annotated with OBS images and pictographic analysis texts. Experimental results on public benchmarks demonstrate that our approach achieves state-of-the-art Top-10 accuracy and superior zero-shot decipherment capabilities. More importantly, our model delivers logical analysis processes, possibly providing archaeologically valuable reference results for undeciphered OBS, and thus has potential applications in digital humanities and historical research. The dataset and code will be released at this https URL.

[CV-114] Empowering Morphing Attack Detection using Interpretable Image-Text Foundation Model

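[Quick Read]: This paper addresses the reliability of morphing attack detection in face recognition systems, that is, how to identify face images forged through image synthesis so that verification scenarios remain secure. The key to the solution is a multimodal learning framework that uses Contrastive Language-Image Pretraining (CLIP) for zero-shot detection and also predicts the text snippet most relevant to the attack. The authors design ten natural-language prompts of varying length and run extensive experiments on a morphing dataset built from a publicly available face biometric dataset, evaluating generalization across five morphing generation techniques and three capture mediums. The result is detection and explanation of morphing attacks without task-specific fine-tuning.

Link: https://arxiv.org/abs/2508.10110
Authors: Sushrut Patwardhan, Raghavendra Ramachandra, Sushma Venkatesh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: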

Abstract:Morphing attack detection has become an essential component of face recognition systems for ensuring a reliable verification scenario. In this paper, we present a multimodal learning approach that can provide a textual description of morphing attack detection. We first show that zero-shot evaluation of the proposed framework using Contrastive Language-Image Pretraining (CLIP) can yield not only generalizable morphing attack detection, but also predict the most relevant text snippet. We present an extensive analysis of ten different textual prompts that include both short and long textual prompts. These prompts are engineered by considering the human understandable textual snippet. Extensive experiments were performed on a face morphing dataset that was developed using a publicly available face biometric dataset. We present an evaluation of SOTA pre-trained neural networks together with the proposed framework in the zero-shot evaluation of five different morphing generation techniques that are captured in three different mediums.
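
Zero-shot evaluation with a CLIP-style model usually reduces to comparing one image embedding against the embeddings of several text prompts. The sketch below shows only that scoring step, on toy two-dimensional vectors. The prompts, the embedding values, and the temperature are hypothetical, and no real CLIP encoder is called, so this is a picture of the mechanism rather than the paper's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_scores(image_emb, prompt_embs, temperature=0.01):
    """Softmax over image-to-prompt similarities; returns prompt -> probability."""
    sims = {prompt: cosine(image_emb, emb) for prompt, emb in prompt_embs.items()}
    top = max(sims.values())  # subtract the max for numerical stability
    exps = {p: math.exp((s - top) / temperature) for p, s in sims.items()}
    total = sum(exps.values())
    return {p: e / total for p, e in exps.items()}


# Toy embeddings standing in for encoder outputs (hypothetical values).
prompts = {
    "a bona fide face photo": [1.0, 0.1],
    "a morphed face photo": [0.1, 1.0],
}
scores = zero_shot_scores([1.0, 0.0], prompts)
best_prompt = max(scores, key=scores.get)
```

Because the highest-scoring prompt is itself readable text, the same scoring step yields both the decision and the accompanying description.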

[CV-115] DINOv3

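[Quick Read]: This paper addresses two core problems in self-supervised visual representation learning: how to train high-quality feature representations efficiently by scaling data and models, and how to mitigate the degradation of dense feature maps during long training schedules. The solution rests on three strategies. First, careful data preparation, model design, and optimization exploit the combined benefit of scaling dataset and model size. Second, a new method called Gram anchoring markedly improves the stability and quality of dense features over long training. Third, post-hoc strategies increase the model's flexibility with respect to resolution, model size, and alignment with text. With these improvements, DINOv3 is a versatile vision foundation model that, without fine-tuning, outperforms the best existing self-supervised and weakly supervised foundation models across a range of vision tasks.

Link: https://arxiv.org/abs/2508.10104
Authors: Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, Piotr Bojanowski
Affiliations: Meta AI Research; WRI; Inria
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: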

Abstract:Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images – using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models’ flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.

[CV-116] Stochastic-based Patch Filtering for Few-Shot Learning CVPR

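[Quick Read]: This paper addresses the difficulty of few-shot classification of food images, whose visual complexity and variability (for example, one pasta dish looks very different under different garnishes, lighting, and viewpoints) make it hard for a model to focus on the key features and lead to misclassification. The key to the solution is Stochastic-based Patch Filtering (SPFF). Its core idea is to probabilistically discard patch embeddings that correlate weakly with the class-aware embedding, which sharpens attention on class-specific food features, and then to quantify the relationship between the query image and the support images with a similarity matrix, giving more robust few-shot classification.

Link: https://arxiv.org/abs/2508.10066
Authors: Javier Rodenas, Eduardo Aguilar, Petia Radeva
Affiliations: AIBA, Departament de Matemàtiques & Informàtica, Universitat de Barcelona; Departamento de Ingeniería de Sistemas y Computación, Universidad Católica del Norte; Institute of Neuroscience, Universitat de Barcelona
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR Workshop MetaFood 2025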

Abstract:Food images present unique challenges for few-shot learning models due to their visual complexity and variability. For instance, a pasta dish might appear with various garnishes on different plates and in diverse lighting conditions and camera perspectives. This problem leads to losing focus on the most important elements when comparing the query with support images, resulting in misclassification. To address this issue, we propose Stochastic-based Patch Filtering for Few-Shot Learning (SPFF) to attend to the patch embeddings that show greater correlation with the class representation. The key concept of SPFF involves the stochastic filtering of patch embeddings, where patches less similar to the class-aware embedding are more likely to be discarded. With patch embedding filtered according to the probability of appearance, we use a similarity matrix that quantifies the relationship between the query image and its respective support images. Through a qualitative analysis, we demonstrate that SPFF effectively focuses on patches where class-specific food features are most prominent while successfully filtering out non-relevant patches. We validate our approach through extensive experiments on few-shot classification benchmarks: Food-101, VireoFood-172 and UECFood-256, outperforming the existing SoA methods.
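
The stochastic filtering idea described above, where patches less similar to the class-aware embedding are more likely to be dropped, can be sketched as follows. The keep-probability rule is an assumption: the abstract does not give the formula, so this sketch simply rescales cosine similarity to the range 0 to 1 and draws one Bernoulli sample per patch. The name `stochastic_patch_filter` and the toy embeddings are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def stochastic_patch_filter(patches, class_emb, seed=0):
    """Return (kept patch indices, keep probability of every patch)."""
    sims = [cosine(p, class_emb) for p in patches]
    lo, hi = min(sims), max(sims)
    if hi == lo:
        keep_probs = [1.0] * len(sims)  # nothing to distinguish: keep everything
    else:
        # Assumed rule: min-max rescaling of similarity into a keep probability.
        keep_probs = [(s - lo) / (hi - lo) for s in sims]
    rng = random.Random(seed)
    kept = [i for i, p in enumerate(keep_probs) if rng.random() < p]
    return kept, keep_probs


# Toy patch embeddings: aligned, partly aligned, and orthogonal to the class.
patches = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
kept, probs = stochastic_patch_filter(patches, class_emb=[1.0, 0.0])
```

Sampling rather than hard thresholding means borderline patches are sometimes kept, which is one plausible reading of why the method is called stochastic.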

[CV-117] Invisible Watermarks Visible Gains: Steering Machine Unlearning with Bi-Level Watermarking Design ICCV2025

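[Quick Read]: This paper addresses the limited efficiency and flexibility of machine unlearning (MU) methods that depend on adjusting model weights during training, and asks how data-level interventions can improve unlearning. The key to the solution is Water4MU, an unlearning-friendly watermarking framework built on bi-level optimization (BLO): the upper level optimizes the watermarking network to minimize unlearning difficulty, and the lower level trains the model independently. Embedding controllable watermarks in the training data enables precise removal of the influence of specified data while preserving model performance on other tasks. Water4MU outperforms existing methods on image classification and generation tasks, especially in the "challenging forgets" scenario.

Link: https://arxiv.org/abs/2508.10065
Authors: Yuhao Sun, Yihua Zhang, Gaowen Liu, Hongtao Xie, Sijia Liu
Affiliations: University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Michigan State University; Cisco Research
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025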

Abstract:With the increasing demand for the right to be forgotten, machine unlearning (MU) has emerged as a vital tool for enhancing trust and regulatory compliance by enabling the removal of sensitive data influences from machine learning (ML) models. However, most MU algorithms primarily rely on in-training methods to adjust model weights, with limited exploration of the benefits that data-level adjustments could bring to the unlearning process. To address this gap, we propose a novel approach that leverages digital watermarking to facilitate MU by strategically modifying data content. By integrating watermarking, we establish a controlled unlearning mechanism that enables precise removal of specified data while maintaining model utility for unrelated tasks. We first examine the impact of watermarked data on MU, finding that MU effectively generalizes to watermarked data. Building on this, we introduce an unlearning-friendly watermarking framework, termed Water4MU, to enhance unlearning effectiveness. The core of Water4MU is a bi-level optimization (BLO) framework: at the upper level, the watermarking network is optimized to minimize unlearning difficulty, while at the lower level, the model itself is trained independently of watermarking. Experimental results demonstrate that Water4MU is effective in MU across both image classification and image generation tasks. Notably, it outperforms existing methods in challenging MU scenarios, known as “challenging forgets”.

[CV-118] SegDAC: Segmentation-Driven Actor-Critic for Visual Reinforcement Learning

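[Quick Read]: This paper addresses the difficulty of jointly learning perception and action policies in visual reinforcement learning (visual RL) from high-dimensional inputs and noisy rewards, with a focus on using large perception models to improve visual generalization and sample efficiency. The key to the solution is SegDAC (Segmentation-Driven Actor-Critic). It uses the Segment Anything Model (SAM) for object-centric decomposition and YOLO-World to ground the segments semantically via text prompts. A transformer-based architecture supports a dynamic number of segments at each time step and learns which segments to attend to during online RL, without human labels. The method markedly improves generalization under strong visual perturbations, doubling prior performance on the hardest setting of the Maniskill3 benchmark and matching or surpassing the sample efficiency of existing methods on all tasks.

Link: https://arxiv.org/abs/2508.09325
Authors: Alexandre Brown, Glen Berseth
Affiliations: Mila Quebec AI Institute; Université de Montréal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: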

Abstract:Visual reinforcement learning (RL) is challenging due to the need to learn both perception and actions from high-dimensional inputs and noisy rewards. Although large perception models exist, integrating them effectively into RL for visual generalization and improved sample efficiency remains unclear. We propose SegDAC, a Segmentation-Driven Actor-Critic method. SegDAC uses Segment Anything (SAM) for object-centric decomposition and YOLO-World to ground segments semantically via text prompts. It includes a novel transformer-based architecture that supports a dynamic number of segments at each time step and effectively learns which segments to focus on using online RL, without using human labels. By evaluating SegDAC over a challenging visual generalization benchmark using Maniskill3, which covers diverse manipulation tasks under strong visual perturbations, we demonstrate that SegDAC achieves significantly better visual generalization, doubling prior performance on the hardest setting and matching or surpassing prior methods in sample efficiency across all evaluated tasks.

[CV-119] When Experts Disagree: Characterizing Annotator Variability for Vessel Segmentation in DSA Images

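[Quick Read]: This paper addresses the variability among annotators in segmentations of cranial blood vessels in 2D digital subtraction angiography (2D DSA), in order to quantify segmentation uncertainty. The key to the solution is to analyze the differences between multiple annotators' segmentations of the same image, characterize and quantify the resulting uncertainty, and then use that quantification to guide additional annotation and to inform the development of uncertainty-aware automatic segmentation methods.

Link: https://arxiv.org/abs/2508.10797
Authors: M. Geshvadi, G. So, D.D. Chlorogiannis, C. Galvin, E. Torio, A. Azimi, Y. Tachie-Baffour, N. Haouchine, A. Golby, M. Vangel, W.M. Wells, Y. Epelboym, R. Du, F. Durupinar, S. Frisken
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: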

Abstract:We analyze the variability among segmentations of cranial blood vessels in 2D DSA performed by multiple annotators in order to characterize and quantify segmentation uncertainty. We use this analysis to quantify segmentation uncertainty and discuss ways it can be used to guide additional annotations and to develop uncertainty-aware automatic segmentation methods.

[CV-120] Insights from the Algonauts 2025 Winners

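[Quick Read]: This paper addresses how to use computational models to predict whole-brain functional magnetic resonance imaging (fMRI) responses while people watch naturalistic multimodal movies. The core challenge is modeling the complex, nonlinear relationship between long audiovisual stimuli and brain activity, and in particular generalizing across individuals and to out-of-distribution (OOD) data. The key to the solution is brain encoding models based on deep neural networks, trained on nearly 80 hours of movie stimuli from the CNeuroMod project (65 hours for training and roughly 15 hours for validation). Better extraction and spatiotemporal integration of multimodal representations (visual, auditory, and semantic) markedly improves prediction accuracy for fMRI signals across 1,000 whole-brain parcels in multiple participants, especially on unseen films. This progress shows the potential of current generative-AI-driven brain encoding models for understanding human perceptual processing and points toward more interpretable and general computational frameworks for neuroscience.

Link: https://arxiv.org/abs/2508.10784
Authors: Paul S. Scotti, Mihir Tripathy
Affiliations: Medical AI Research Center (MedARC); CAMRI, Baylor College of Medicine
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
Comments: Perspective piece on Algonauts 2025 Challenge conclusion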

Abstract:The Algonauts 2025 Challenge just wrapped up a few weeks ago. It is a biennial challenge in computational neuroscience in which teams attempt to build models that predict human brain activity from carefully curated stimuli. Previous editions (2019, 2021, 2023) focused on still images and short videos; the 2025 edition, which concluded last month (late July), pushed the field further by using long, multimodal movies. Teams were tasked with predicting fMRI responses across 1,000 whole-brain parcels across four participants in the dataset who were scanned while watching nearly 80 hours of naturalistic movie stimuli. These recordings came from the CNeuroMod project and included 65 hours of training data, about 55 hours of Friends (seasons 1-6) plus four feature films (The Bourne Supremacy, Hidden Figures, Life, and The Wolf of Wall Street). The remaining data were used for validation: Season 7 of Friends for in-distribution tests, and the final winners for the Challenge were those who could best predict brain activity for six films in their held-out out-of-distribution (OOD) set. The winners were just announced and the top team reports are now publicly available. As members of the MedARC team which placed 4th in the competition, we reflect on the approaches that worked, what they reveal about the current state of brain encoding, and what might come next.

[CV-121] DIVA-VQA: Detecting Inter-frame Variations in UGC Video Quality ICIP

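[Quick Read]: This paper addresses no-reference (NR) video quality assessment (VQA) for user-generated content (UGC), where quality must be estimated accurately and at low complexity without a pristine reference video. The key to the solution is a spatio-temporal fragmentation mechanism driven by inter-frame variations. Using the changes between adjacent frames, the model identifies quality-sensitive regions level by level (frames, patches, then fragmented frames) and fuses full frames with fragmented frames aligned to residuals to capture global and local spatio-temporal information. It combines 2D and 3D feature extraction to model temporal variation. On five UGC datasets it achieves leading average rank correlation (DIVA-VQA-L: 0.898, DIVA-VQA-B: 0.886) at a runtime complexity well below that of existing state-of-the-art methods.

Link: https://arxiv.org/abs/2508.10605
Authors: Xinyi Wang, Angeliki Katsenou, David Bull
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 6 pages, 1 figure. Accepted for presentation at the 2025 IEEE International Conference on Image Processing (ICIP)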

Abstract:The rapid growth of user-generated (video) content (UGC) has driven increased demand for research on no-reference (NR) perceptual video quality assessment (VQA). NR-VQA is a key component for large-scale video quality monitoring in social media and streaming applications where a pristine reference is not available. This paper proposes a novel NR-VQA model based on spatio-temporal fragmentation driven by inter-frame variations. By leveraging these inter-frame differences, the model progressively analyses quality-sensitive regions at multiple levels: frames, patches, and fragmented frames. It integrates frames, fragmented residuals, and fragmented frames aligned with residuals to effectively capture global and local information. The model extracts both 2D and 3D features in order to characterize these spatio-temporal variations. Experiments conducted on five UGC datasets and against state-of-the-art models ranked our proposed method among the top 2 in terms of average rank correlation (DIVA-VQA-L: 0.898 and DIVA-VQA-B: 0.886). The improved performance is offered at a low runtime complexity, with DIVA-VQA-B ranked top and DIVA-VQA-L third on average compared to the fastest existing NR-VQA method. Code and models are publicly available at: this https URL.
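
To make "fragmentation driven by inter-frame variations" concrete, the sketch below computes the absolute difference between two consecutive frames and ranks fixed-size patches by how much they changed. It is only an illustration of the general idea: the selection rule (mean absolute residual, top-k), the non-overlapping patch grid, and the names `frame_residual` and `top_patches` are assumptions, and the paper's actual fragment sampling may differ.

```python
def frame_residual(prev_frame, curr_frame):
    """Absolute per-pixel difference between two grayscale frames (2D lists)."""
    return [[abs(c - p) for p, c in zip(row_p, row_c)]
            for row_p, row_c in zip(prev_frame, curr_frame)]


def top_patches(residual, patch, k):
    """Top-left corners of the k non-overlapping patches with the largest mean residual."""
    height, width = len(residual), len(residual[0])
    scored = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            total = sum(residual[r][c]
                        for r in range(top, top + patch)
                        for c in range(left, left + patch))
            scored.append((total / (patch * patch), (top, left)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [corner for _, corner in scored[:k]]


# Two toy 4x4 frames that differ in a single pixel at row 2, column 3.
previous = [[0] * 4 for _ in range(4)]
current = [[0] * 4 for _ in range(4)]
current[2][3] = 8
residual = frame_residual(previous, current)
selected = top_patches(residual, patch=2, k=1)
```

The selected corners would then be used to crop the same locations from the frames and residuals before feature extraction.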

[CV-122] Efficient Image Denoising Using Global and Local Circulant Representation

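[Quick Read]: This paper addresses the pressing need for efficient and effective image denoising as ever more image data is generated, and in particular how to improve computational efficiency without sacrificing denoising performance. The key to the solution is a computationally simple algorithm named Haar-tSVD. It draws on the connection between the Haar transform and principal component analysis (PCA) under circulant representation and uses a unified tensor-singular value decomposition (t-SVD) projection to capture both global and local patch correlations. The result is a one-step, highly parallelizable filter that needs no learned local bases to represent image patches and balances speed against performance. An adaptive noise estimation scheme that combines a CNN-based estimator with eigenvalue analysis further improves robustness and adaptability.

Link: https://arxiv.org/abs/2508.10307
Authors: Zhaoming Kong, Jiahuan Zhang, Xiaowei Yang
Affiliations: South China University of Technology; Southern Medical University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: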

Abstract:The advancement of imaging devices and the countless image data generated every day impose an increasingly high demand on efficient and effective image denoising. In this paper, we present a computationally simple denoising algorithm, termed Haar-tSVD, aiming to explore the nonlocal self-similarity prior and leverage the connection between principal component analysis (PCA) and the Haar transform under circulant representation. We show that global and local patch correlations can be effectively captured through a unified tensor-singular value decomposition (t-SVD) projection with the Haar transform. This results in a one-step, highly parallelizable filtering method that eliminates the need for learning local bases to represent image patches, striking a balance between denoising speed and performance. Furthermore, we introduce an adaptive noise estimation scheme based on a CNN estimator and eigenvalue analysis to enhance the robustness and adaptability of the proposed method. Experiments on different real-world denoising tasks validate the efficiency and effectiveness of Haar-tSVD for noise removal and detail preservation. Datasets, code and results are publicly available at this https URL.
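
As background for the Haar transform mentioned above, here is a one-level orthonormal Haar transform on a 1D signal with hard thresholding of the detail coefficients. This shows only the basic transform-threshold-invert pattern; it is not Haar-tSVD, which the abstract describes as a t-SVD projection over groups of similar patches. The threshold value and the sample signal are hypothetical.

```python
import math

SCALE = 1 / math.sqrt(2)  # normalization that makes the transform orthonormal


def haar_forward(signal):
    """One-level Haar transform of an even-length signal: (approximation, detail)."""
    approx = [(signal[i] + signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * SCALE, (a - d) * SCALE])
    return out


def hard_threshold(coeffs, tau):
    """Zero every coefficient whose magnitude does not exceed tau."""
    return [c if abs(c) > tau else 0.0 for c in coeffs]


noisy = [4.0, 4.1, 8.0, 7.9]
approx, detail = haar_forward(noisy)
denoised = haar_inverse(approx, hard_threshold(detail, tau=0.2))
```

Small detail coefficients mostly carry noise, so zeroing them replaces each pair of neighbors with its average while large jumps between pairs survive.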

[CV-123] DINOMotion: advanced robust tissue motion tracking with DINOv2 in 2D-Cine MRI-guided radiotherapy

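[Quick Read]: This paper addresses the accuracy and interpretability of tissue motion tracking in 2D-Cine MRI-guided radiotherapy, where existing image-registration methods often fail under large misalignments and offer no intuitive explanation of correspondences. The key to the solution is DINOMotion, a framework built on the DINOv2 feature extractor with Low-Rank Adaptation (LoRA) layers, which automatically detects corresponding landmarks between images to derive the optimal registration, improving robustness and efficiency. The LoRA layers sharply reduce the number of trainable parameters and speed up training, while DINOv2's strong visual representations provide robustness to large misalignments. At test time the method outputs the registration directly, with no iterative optimization, which gives it real-time potential (about 30 ms per scan), and it achieves high accuracy (Dice scores of 90.90% to 95.23%) and good interpretability across several organs (kidney, liver, lung).

Link: https://arxiv.org/abs/2508.10260
Authors: Soorena Salari, Catherine Spino, Laurie-Anne Pharand, Fabienne Lathuiliere, Hassan Rivaz, Silvain Beriault, Yiming Xiao
Affiliations: Concordia University; Elekta Ltd.; Diagnos Medical Systems
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE Transactions on Biomedical Engineering (TMBE), 14 pages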

Abstract:Accurate tissue motion tracking is critical to ensure treatment outcome and safety in 2D-Cine MRI-guided radiotherapy. This is typically achieved by registration of sequential images, but existing methods often face challenges with large misalignments and lack of interpretability. In this paper, we introduce DINOMotion, a novel deep learning framework based on DINOv2 with Low-Rank Adaptation (LoRA) layers for robust, efficient, and interpretable motion tracking. DINOMotion automatically detects corresponding landmarks to derive optimal image registration, enhancing interpretability by providing explicit visual correspondences between sequential images. The integration of LoRA layers reduces trainable parameters, improving training efficiency, while DINOv2’s powerful feature representations offer robustness against large misalignments. Unlike iterative optimization-based methods, DINOMotion directly computes image registration at test time. Our experiments on volunteer and patient datasets demonstrate its effectiveness in estimating both linear and nonlinear transformations, achieving Dice scores of 92.07% for the kidney, 90.90% for the liver, and 95.23% for the lung, with corresponding Hausdorff distances of 5.47 mm, 8.31 mm, and 6.72 mm, respectively. DINOMotion processes each scan in approximately 30ms and consistently outperforms state-of-the-art methods, particularly in handling large misalignments. These results highlight its potential as a robust and interpretable solution for real-time motion tracking in 2D-Cine MRI-guided radiotherapy.
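
The two metrics reported in the abstract above, the Dice score and the Hausdorff distance, have standard definitions that are easy to state in code. The sketch below uses toy binary masks and point sets; it works in pixel units, so reporting millimetres as the paper does would additionally require the pixel spacing, which is not given here.

```python
import math


def dice(mask_a, mask_b):
    """Dice overlap of two binary masks given as 2D lists of 0/1."""
    a = {(r, c) for r, row in enumerate(mask_a) for c, v in enumerate(row) if v}
    b = {(r, c) for r, row in enumerate(mask_b) for c, v in enumerate(row) if v}
    if not a and not b:
        return 1.0  # two empty masks agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(src, dst):
        # Farthest that any source point is from its nearest destination point.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(points_a, points_b), directed(points_b, points_a))


overlap = dice([[1, 1], [0, 0]], [[1, 0], [0, 0]])
boundary_gap = hausdorff([(0, 0), (1, 0)], [(0, 0), (4, 0)])
```

Dice summarizes bulk overlap, while the Hausdorff distance is driven by the single worst boundary error, which is why the two are usually reported together.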

[CV-124] Data-Efficient Learning for Generalizable Surgical Video Understanding

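[Quick read] This doctoral work targets the limited generalization of models for surgical video analysis. The core obstacles are scarce annotations, high spatiotemporal complexity, and the domain gap across procedures and institutions. The approach is to build data-efficient, clinically scalable deep learning frameworks: first, benchmark neural network architectures to find the most effective designs for surgical phase, action, and event recognition; second, propose semi-supervised frameworks (DIST, SemiVT-Surge, and ENCORE) that exploit large amounts of unlabeled video through dynamic pseudo-labeling; third, release two large multi-task datasets (GynSurg and Cataract-1K) to support reproducibility. Overall, the work aims to reduce reliance on expert annotation and to improve robustness and generalization in real clinical settings.

Link: https://arxiv.org/abs/2508.10215
Authors: Sahar Nasirihaghighi
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: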

Abstract:Advances in surgical video analysis are transforming operating rooms into intelligent, data-driven environments. Computer-assisted systems support full surgical workflow, from preoperative planning to intraoperative guidance and postoperative assessment. However, developing robust and generalizable models for surgical video understanding remains challenging due to (I) annotation scarcity, (II) spatiotemporal complexity, and (III) domain gap across procedures and institutions. This doctoral research aims to bridge the gap between deep learning-based surgical video analysis in research and its real-world clinical deployment. To address the core challenge of recognizing surgical phases, actions, and events, critical for analysis, I benchmarked state-of-the-art neural network architectures to identify the most effective designs for each task. I further improved performance by proposing novel architectures and integrating advanced modules. Given the high cost of expert annotations and the domain gap across surgical video sources, I focused on reducing reliance on labeled data. We developed semi-supervised frameworks that improve model performance across tasks by leveraging large amounts of unlabeled surgical video. We introduced novel semi-supervised frameworks, including DIST, SemiVT-Surge, and ENCORE, that achieved state-of-the-art results on challenging surgical datasets by leveraging minimal labeled data and enhancing model training through dynamic pseudo-labeling. To support reproducibility and advance the field, we released two multi-task datasets: GynSurg, the largest gynecologic laparoscopy dataset, and Cataract-1K, the largest cataract surgery video dataset. Together, this work contributes to robust, data-efficient, and clinically scalable solutions for surgical video analysis, laying the foundation for generalizable AI systems that can meaningfully impact surgical care and training.

[CV-125] Explainable AI Technique in Lung Cancer Detection Using Convolutional Neural Networks

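[Quick read] This paper addresses the need for reliable, explainable automated support in lung cancer screening, particularly in resource-limited settings. It builds a deep learning framework with integrated explainability, evaluating a custom convolutional neural network (CNN) and three fine-tuned transfer learning backbones (DenseNet121, ResNet152, and VGG19), all trained with cost-sensitive learning to mitigate class imbalance. ResNet152 reached the highest accuracy (97.3%), while DenseNet121 gave the best overall balance of precision, recall, and F1 (up to 92%, 90%, and 91%, respectively). Shapley Additive Explanations (SHAP) are used to visualize the evidence behind each prediction, improving clinical transparency for automated screening of chest CT images.

Link: https://arxiv.org/abs/2508.10196
Authors: Nishan Rai, Sujan Khatri, Devendra Risal
Affiliations: Kathford International College of Engineering and Management
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 9 figures, 4 tables. Undergraduate research project report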

Abstract:Early detection of lung cancer is critical to improving survival outcomes. We present a deep learning framework for automated lung cancer screening from chest computed tomography (CT) images with integrated explainability. Using the IQ-OTH/NCCD dataset (1,197 scans across Normal, Benign, and Malignant classes), we evaluate a custom convolutional neural network (CNN) and three fine-tuned transfer learning backbones: DenseNet121, ResNet152, and VGG19. Models are trained with cost-sensitive learning to mitigate class imbalance and evaluated via accuracy, precision, recall, F1-score, and ROC-AUC. While ResNet152 achieved the highest accuracy (97.3%), DenseNet121 provided the best overall balance in precision, recall, and F1 (up to 92%, 90%, 91%, respectively). We further apply Shapley Additive Explanations (SHAP) to visualize evidence contributing to predictions, improving clinical transparency. Results indicate that CNN-based approaches augmented with explainability can provide fast, accurate, and interpretable support for lung cancer screening, particularly in resource-limited settings.
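The abstract mentions cost-sensitive learning without giving the weighting scheme. One common choice is inverse-frequency class weights applied to the cross-entropy loss, sketched below. The weighting formula is an assumption rather than the authors' stated method, and the class counts are hypothetical values chosen for the example, not the real IQ-OTH/NCCD split.

```python
import math


def inverse_frequency_weights(counts: dict) -> dict:
    """Weight w_c = N / (K * n_c), so rarer classes get larger weights."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


def weighted_cross_entropy(probs: dict, label: str, weights: dict) -> float:
    """Per-sample loss -w_y * log p(y) for the true class y."""
    return -weights[label] * math.log(probs[label])


# Hypothetical class counts (NOT the actual dataset statistics).
counts = {"normal": 400, "benign": 100, "malignant": 500}
weights = inverse_frequency_weights(counts)

# The same predicted probability costs more when the true class is rare.
probs = {"normal": 0.5, "benign": 0.5, "malignant": 0.5}
loss_benign = weighted_cross_entropy(probs, "benign", weights)
loss_normal = weighted_cross_entropy(probs, "normal", weights)
```

With these weights every class contributes the same total weight to the loss, so the minority class is not drowned out by the majority classes.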

Artificial Intelligence

[AI-0] Empirical Investigation into Configuring Echo State Networks for Representative Benchmark Problem Domains

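[Quick read] This paper addresses the lack of systematic guidance for configuring the architecture and parameters of Echo State Networks (ESNs), a form of reservoir computing, which is a practical hurdle for newcomers to the field. Through experiments on four benchmark tasks (time series prediction, pattern generation, chaotic system prediction, and time series classification), it distills heuristics, or rules of thumb, that apply to problems within the same domain, helping users choose ESN architectures and hyperparameter values with less trial and error.

Link: https://arxiv.org/abs/2508.10887
Authors: Brooke R. Weborg, Gursel Serpen
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 49 pages, 21 figures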

Abstract:This paper examines Echo State Network, a reservoir computer, performance using four different benchmark problems, then proposes heuristics or rules of thumb for configuring the architecture, as well as the selection of parameters and their values, which are applicable to problems within the same domain, to help serve to fill the experience gap needed by those entering this field of study. The influence of various parameter selections and their value adjustments, as well as architectural changes made to an Echo State Network, a powerful recurrent neural network configured as a reservoir computer, can be challenging to fully comprehend without experience in the field, and even some hyperparameter optimization algorithms may have difficulty adjusting parameter values without proper manual selections made first. Therefore, it is imperative to understand the effects of parameters and their value selection on Echo State Network architecture performance for a successful build. Thus, to address the requirement for an extensive background in Echo State Network architecture, as well as examine how Echo State Network performance is affected with respect to variations in architecture, design, and parameter selection and values, a series of benchmark tasks representing different problem domains, including time series prediction, pattern generation, chaotic system prediction, and time series classification, were modeled and experimented on to show the impact on the performance of Echo State Network.
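To make the kind of parameters discussed above concrete, here is a minimal sketch of an ESN reservoir update in plain Python. It is a generic textbook-style construction, not the paper's setup: the reservoir size, weight scaling, input scaling, and leak rate are all illustrative assumptions. The recurrent matrix is rescaled by its maximum absolute row sum, which is a conservative, sufficient (not necessary) way to keep the spectral radius below one.

```python
import math
import random


def make_reservoir(n: int, target_norm: float, seed: int = 0):
    """Random n x n matrix rescaled so its max absolute row sum is target_norm.

    The spectral radius cannot exceed this norm, so target_norm < 1 keeps the
    reservoir contractive.
    """
    rng = random.Random(seed)
    w = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    norm = max(sum(abs(v) for v in row) for row in w)
    return [[v * target_norm / norm for v in row] for row in w]


def step(w, w_in, state, u, leak=1.0):
    """Leaky update: x <- (1 - leak) * x + leak * tanh(W x + w_in * u)."""
    new_state = []
    for i, row in enumerate(w):
        pre = sum(wij * xj for wij, xj in zip(row, state)) + w_in[i] * u
        new_state.append((1.0 - leak) * state[i] + leak * math.tanh(pre))
    return new_state


def run(w, w_in, state, inputs):
    """Drive the reservoir with a scalar input sequence; return the final state."""
    for u in inputs:
        state = step(w, w_in, state, u)
    return state


n = 20
w = make_reservoir(n, target_norm=0.9)
rng = random.Random(1)
w_in = [rng.uniform(-0.5, 0.5) for _ in range(n)]
inputs = [math.sin(0.1 * t) for t in range(300)]

# Two different initial states driven by the same input should converge.
state_a = run(w, w_in, [0.5] * n, inputs)
state_b = run(w, w_in, [-0.5] * n, inputs)
gap = max(abs(a - b) for a, b in zip(state_a, state_b))
```

The shrinking `gap` illustrates the echo state property: the reservoir forgets its initial condition and its state depends only on the input history. A trained linear readout on top of these states is omitted here.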

[AI-1] TLE-Based A2C Agent for Terrestrial Coverage Orbital Path Planning

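[Quick read] As Low Earth Orbit (LEO) becomes more congested, Earth observation satellites face rising collision risk while still needing precise ground coverage. This paper builds a reinforcement learning framework based on the Advantage Actor-Critic (A2C) algorithm, formulating orbital parameter optimization as a Markov Decision Process (MDP) in a custom OpenAI Gymnasium environment that simulates orbital dynamics with classical Keplerian elements. The agent learns to adjust five orbital parameters (semi-major axis, eccentricity, inclination, right ascension of the ascending node, and argument of perigee) to cover a predefined target region. Compared with Proximal Policy Optimization (PPO), the authors report 5.8x higher cumulative reward and convergence in 31.5x fewer timesteps, at a computational cost low enough for real-time mission planning.

Link: https://arxiv.org/abs/2508.10872
Authors: Anantha Narayanan, Battu Bhanu Teja, Pruthwik Mishra
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures, 5 tables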

Abstract:The increasing congestion of Low Earth Orbit (LEO) poses persistent challenges to the efficient deployment and safe operation of Earth observation satellites. Mission planners must now account not only for mission-specific requirements but also for the increasing collision risk with active satellites and space debris. This work presents a reinforcement learning framework using the Advantage Actor-Critic (A2C) algorithm to optimize satellite orbital parameters for precise terrestrial coverage within predefined surface radii. By formulating the problem as a Markov Decision Process (MDP) within a custom OpenAI Gymnasium environment, our method simulates orbital dynamics using classical Keplerian elements. The agent progressively learns to adjust five of the orbital parameters (semi-major axis, eccentricity, inclination, right ascension of ascending node, and the argument of perigee) to achieve targeted terrestrial coverage. Comparative evaluation against Proximal Policy Optimization (PPO) demonstrates A2C’s superior performance, achieving 5.8x higher cumulative rewards (10.0 vs 9.263025) while converging in 31.5x fewer timesteps (2,000 vs 63,000). The A2C agent consistently meets mission objectives across diverse target coordinates while maintaining computational efficiency suitable for real-time mission planning applications. Key contributions include: (1) a TLE-based orbital simulation environment incorporating physics constraints, (2) validation of actor-critic methods’ superiority over trust region approaches in continuous orbital control, and (3) demonstration of rapid convergence enabling adaptive satellite deployment. This approach establishes reinforcement learning as a computationally efficient alternative for scalable and intelligent LEO mission planning.
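For readers unfamiliar with Keplerian elements, the sketch below groups the five parameters named in the abstract into a small data class and derives two quantities from them using standard two-body relations. The class name, field names, and the example orbit are assumptions for illustration; this is not the paper's Gymnasium environment or reward function.

```python
import math
from dataclasses import dataclass

MU_EARTH = 3.986004418e14  # Earth's gravitational parameter, m^3/s^2
R_EARTH = 6_378_137.0      # Earth's equatorial radius, m


@dataclass
class OrbitalElements:
    """The five elements the abstract says the agent adjusts."""
    semi_major_axis_m: float
    eccentricity: float
    inclination_deg: float
    raan_deg: float          # right ascension of the ascending node
    arg_perigee_deg: float

    def period_s(self) -> float:
        # Kepler's third law: T = 2*pi*sqrt(a^3 / mu).
        return 2.0 * math.pi * math.sqrt(self.semi_major_axis_m ** 3 / MU_EARTH)

    def perigee_altitude_m(self) -> float:
        # Perigee radius is a*(1 - e); subtract Earth's radius for altitude.
        return self.semi_major_axis_m * (1.0 - self.eccentricity) - R_EARTH


# Illustrative circular orbit at 550 km altitude (values chosen for the example).
orbit = OrbitalElements(R_EARTH + 550_000.0, 0.0, 53.0, 0.0, 0.0)
period_min = orbit.period_s() / 60.0
print(f"Orbital period: {period_min:.1f} min")
```

In an MDP formulation like the one described, an action would perturb these fields and the reward would score the resulting ground coverage; a perigee-altitude check like the one above is one plausible physics constraint.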

[AI-2] A Multimodal Neural Network for Recognizing Subjective Self-Disclosure Towards Social Robots IROS

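[Quick read] This paper tackles the difficulty social robots have in recognizing and modeling human subjective self-disclosure during interaction, a key step toward robots with social cognition. The solution is a custom multimodal attention network based on emotion recognition models, trained with a new loss function, the scale preserving cross entropy loss, which the authors report improves on both the classification and regression formulations of the problem. The best model trained with this loss achieves an F1 score of 0.83, an improvement of 0.48 over the best baseline.

Link: https://arxiv.org/abs/2508.10828
Authors: Henry Powell, Guy Laban, Emily S. Cross
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted at 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)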

Abstract:Subjective self-disclosure is an important feature of human social interaction. While much has been done in the social and behavioural literature to characterise the features and consequences of subjective self-disclosure, little work has been done thus far to develop computational systems that are able to accurately model it. Even less work has been done that attempts to model specifically how human interactants self-disclose with robotic partners. It is becoming more pressing as we require social robots to work in conjunction with and establish relationships with humans in various social settings. In this paper, our aim is to develop a custom multimodal attention network based on models from the emotion recognition literature, training this model on a large self-collected self-disclosure video corpus, and constructing a new loss function, the scale preserving cross entropy loss, that improves upon both classification and regression versions of this problem. Our results show that the best performing model, trained with our novel loss function, achieves an F1 score of 0.83, an improvement of 0.48 from the best baseline model. This result makes significant headway in the aim of allowing social robots to pick up on an interaction partner’s self-disclosures, an ability that will be essential in social robots with social cognition.

[AI-3] Who Benefits from AI Explanations? Towards Accessible and Interpretable Systems IJCAI2025

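[Quick read] This paper examines a significant accessibility gap in explainable AI (XAI) methods, particularly for users with vision impairments. Existing XAI work relies largely on visual presentation and rarely includes disabled users in evaluation. The paper proposes a four-part methodological proof of concept: categorizing AI systems, defining and contextualizing personas, designing and implementing prototypes, and conducting expert and user assessments, in order to put inclusive XAI design into practice. Preliminary findings suggest that simplified explanations are more comprehensible to non-visual users than detailed ones, and that multimodal presentation is required for more equitable interpretability.

Link: https://arxiv.org/abs/2508.10806
Authors: Maria J. P. Peixoto, Akriti Pandey, Ahsan Zaman, Peter R. Lewis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Paper accepted for the IJCAI 2025 Workshop on Explainable Artificial Intelligence (XAI): this https URL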

Abstract:As AI systems are increasingly deployed to support decision-making in critical domains, explainability has become a means to enhance the understandability of these outputs and enable users to make more informed and conscious choices. However, despite growing interest in the usability of eXplainable AI (XAI), the accessibility of these methods, particularly for users with vision impairments, remains underexplored. This paper investigates accessibility gaps in XAI through a two-pronged approach. First, a literature review of 79 studies reveals that evaluations of XAI techniques rarely include disabled users, with most explanations relying on inherently visual formats. Second, we present a four-part methodological proof of concept that operationalizes inclusive XAI design: (1) categorization of AI systems, (2) persona definition and contextualization, (3) prototype design and implementation, and (4) expert and user assessment of XAI techniques for accessibility. Preliminary findings suggest that simplified explanations are more comprehensible for non-visual users than detailed ones, and that multimodal presentation is required for more equitable interpretability.

[AI-4] The SET Perceptual Factors Framework: Towards Assured Perception for Autonomous Systems

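[Quick read] This paper addresses unreliable perception in autonomous systems, where environmental factors such as weather, occlusion, or sensor limitations cause perception failures that undermine safe decision-making. It proposes the SET (Self, Environment, and Target) Perceptual Factors Framework. SET State Trees categorize where such factors originate, SET Factor Trees model how these sources and factors affect perceptual tasks such as object detection or pose estimation, and Perceptual Factor Models built on both trees quantify the uncertainty for a given task. The aim is a transparent, standardized method for identifying, modeling, and communicating perceptual risks, in support of safety assurance and public trust.

Link: https://arxiv.org/abs/2508.10798
Authors: Troi Williams
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 4 pages, 4 figures, accepted to the Workshop on Public Trust in Autonomous Systems at the 2025 IEEE International Conference on Robotics and Automation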

Abstract:Future autonomous systems promise significant societal benefits, yet their deployment raises concerns about safety and trustworthiness. A key concern is assuring the reliability of robot perception, as perception seeds safe decision-making. Failures in perception are often due to complex yet common environmental factors and can lead to accidents that erode public trust. To address this concern, we introduce the SET (Self, Environment, and Target) Perceptual Factors Framework. We designed the framework to systematically analyze how factors such as weather, occlusion, or sensor limitations negatively impact perception. To achieve this, the framework employs SET State Trees to categorize where such factors originate and SET Factor Trees to model how these sources and factors impact perceptual tasks like object detection or pose estimation. Next, we develop Perceptual Factor Models using both trees to quantify the uncertainty for a given task. Our framework aims to promote rigorous safety assurances and cultivate greater public understanding and trust in autonomous systems by offering a transparent and standardized method for identifying, modeling, and communicating perceptual risks.

[AI-5] Enhancing Fairness in Autoencoders for Node-Level Graph Anomaly Detection ECAI-2025

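[Quick read] This paper addresses fairness in autoencoder-based graph anomaly detection (GAD) models, which can inherit and amplify biases in the training data and produce unfair outcomes for groups defined by sensitive attributes. It proposes the DECAF-GAD framework, which introduces a structural causal model (SCM) to disentangle sensitive attributes from learned representations and, on that basis, designs a specialized autoencoder architecture with a fairness-guided loss function. The result is competitive anomaly detection performance with significantly improved fairness metrics.

Link: https://arxiv.org/abs/2508.10785
Authors: Shouju Wang, Yuchen Song, Sheng’en Li, Dongmian Zou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted in ECAI-2025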

Abstract:Graph anomaly detection (GAD) has become an increasingly important task across various domains. With the rapid development of graph neural networks (GNNs), GAD methods have achieved significant performance improvements. However, fairness considerations in GAD remain largely underexplored. Indeed, GNN-based GAD models can inherit and amplify biases present in training data, potentially leading to unfair outcomes. While existing efforts have focused on developing fair GNNs, most approaches target node classification tasks, where models often rely on simple layer architectures rather than autoencoder-based structures, which are the most widely used architectures for anomaly detection. To address fairness in autoencoder-based GAD models, we propose DisEntangled Counterfactual Adversarial Fair (DECAF)-GAD, a framework that alleviates bias while preserving GAD performance. Specifically, we introduce a structural causal model (SCM) to disentangle sensitive attributes from learned representations. Based on this causal framework, we formulate a specialized autoencoder architecture along with a fairness-guided loss function. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that DECAF-GAD not only achieves competitive anomaly detection performance but also significantly enhances fairness metrics compared to baseline GAD methods. Our code is available at this https URL.

[AI-6] The Knowledge-Reasoning Dissociation: Fundamental Limitations of LLMs in Clinical Natural Language Inference

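[Quick read] This paper investigates why large language models (LLMs) reason poorly in high-stakes domains such as clinical trials, and in particular whether failures stem from missing knowledge or from deficient reasoning. It introduces a Clinical Trial Natural Language Inference benchmark covering four reasoning families (Causal Attribution, Compositional Grounding, Epistemic Verification, and Risk State Abstraction), with each item paired with a Ground Knowledge and Meta-Level Reasoning Verification (GKMRV) probe that separates failures of factual access from failures of inference. Models reach a mean GKMRV accuracy of 0.918, showing that they hold the relevant knowledge, yet score a mean accuracy of only 0.25 on the main reasoning tasks while producing highly consistent outputs across samples (mean 0.87). This suggests reliance on systematic heuristics rather than structured reasoning, and points to a lack of composable internal representations for deploying knowledge reliably.

Link: https://arxiv.org/abs/2508.10777
Authors: Maël Jullien, Marco Valentino, André Freitas
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 19 pages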

Abstract:Large language models are often assumed to acquire increasingly structured, generalizable internal representations simply by scaling data and parameters. We interrogate this assumption by introducing a Clinical Trial Natural Language Inference benchmark comprising four reasoning families, Causal Attribution, Compositional Grounding, Epistemic Verification, and Risk State Abstraction. Each item is paired with a targeted Ground Knowledge and Meta-Level Reasoning Verification (GKMRV) probe, allowing us to dissociate failures of factual access from failures of inference. We evaluate six contemporary LLMs under both direct and chain of thought prompting. Models achieve near-ceiling GKMRV accuracy (mean accuracy 0.918) yet perform poorly on the main reasoning tasks (mean accuracy 0.25). Despite low accuracy, output inferences are highly consistent across samples (mean 0.87), indicating a systematic application of underlying heuristics and shortcuts. These results reveal fundamental structural and representational limitations: current LLMs often possess the relevant clinical knowledge but lack the structured, composable internal representations needed to deploy it reliably (e.g., integrating constraints, weighing evidence, or simulating counterfactuals). Decoupling knowledge from reasoning with GKMRV makes this dissociation explicit and measurable, providing an effective framework for probing the reliability of LLMs in high-stakes domains.

[AI-7] Modeling Human Responses to Multimodal AI Content

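[Quick read] As AI-generated content spreads, it remains unclear how people perceive and react to it, a question that matters especially in sensitive domains such as trading and the stock market. Prior work has focused mainly on verifying authenticity rather than on how users actually respond. This paper takes a human-centered approach: it builds the MhAIM Dataset of 154,552 online posts (111,153 of them AI-generated) for large-scale analysis of human responses; proposes three new metrics, trustworthiness, impact, and openness, to quantify how users judge and engage with content; and presents T-Lens, an LLM-based agent system whose core, HR-MCP (Human Response Model Context Protocol), is built on the Model Context Protocol (MCP) and feeds predicted human responses into the LLM interaction so that answers align better with human reactions.

Link: https://arxiv.org/abs/2508.10769
Authors: Zhiqi Shen, Shaojing Fan, Danni Xu, Terence Sim, Mohan Kankanhalli
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: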

Abstract:As AI-generated content becomes widespread, so does the risk of misinformation. While prior research has primarily focused on identifying whether content is authentic, much less is known about how such content influences human perception and behavior. In domains like trading or the stock market, predicting how people react (e.g., whether a news post will go viral), can be more critical than verifying its factual accuracy. To address this, we take a human-centered approach and introduce the MhAIM Dataset, which contains 154,552 online posts (111,153 of them AI-generated), enabling large-scale analysis of how people respond to AI-generated content. Our human study reveals that people are better at identifying AI content when posts include both text and visuals, particularly when inconsistencies exist between the two. We propose three new metrics: trustworthiness, impact, and openness, to quantify how users judge and engage with online content. We present T-Lens, an LLM-based agent system designed to answer user queries by incorporating predicted human responses to multimodal information. At its core is HR-MCP (Human Response Model Context Protocol), built on the standardized Model Context Protocol (MCP), enabling seamless integration with any LLM. This integration allows T-Lens to better align with human reactions, enhancing both interpretability and interaction capabilities. Our work provides empirical insights and practical tools to equip LLMs with human-awareness capabilities. By highlighting the complex interplay among AI, human cognition, and information reception, our findings suggest actionable strategies for mitigating the risks of AI-driven misinformation.

[AI-8] Natively Trainable Sparse Attention for Hierarchical Point Cloud Datasets

[Quick read] This paper targets the efficiency bottleneck that the quadratic scaling of the attention mechanism imposes on transformers applied to large physical systems. It combines the Erwin architecture with the Native Sparse Attention (NSA) mechanism, adapting NSA to non-sequential data, to improve efficiency and receptive field while matching or exceeding the performance of the original Erwin model.

Link: https://arxiv.org/abs/2508.10758
Authors: Nicolas Lapautre, Maria Marchenko, Carlos Miguel Patiño, Xin Zhou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Unlocking the potential of transformers on datasets of large physical systems depends on overcoming the quadratic scaling of the attention mechanism. This work explores combining the Erwin architecture with the Native Sparse Attention (NSA) mechanism to improve the efficiency and receptive field of transformer models for large-scale physical systems, addressing the challenge of quadratic attention complexity. We adapt the NSA mechanism for non-sequential data, implement the Erwin NSA model, and evaluate it on three datasets from the physical sciences – cosmology simulations, molecular dynamics, and air pressure modeling – achieving performance that matches or exceeds that of the original Erwin model. Additionally, we reproduce the experimental results from the Erwin paper to validate their implementation.

[AI-9] Scaling Up without Fading Out: Goal-Aware Sparse GNN for RL-based Generalized Planning

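[Quick read] This paper addresses scaling problems in generalized planning with deep reinforcement learning (RL) and graph neural networks (GNNs), where planning states are represented as fully connected graphs. As grid environments grow, edge information blows up and node-level information is diluted, so memory requirements climb and learning becomes inefficient or infeasible. The solution is a sparse, goal-aware GNN representation that selectively encodes relevant local relationships and explicitly integrates goal-related spatial features, reducing redundant information, improving node representations, and substantially improving policy generalization and success rates on larger problems.

Link: https://arxiv.org/abs/2508.10747
Authors: Sangwoo Jeon, Juchul Shin, Gyeong-Tae Kim, YeonJe Cho, Seongwoo Kim
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 16 pages, 10 figures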

Abstract:Generalized planning using deep reinforcement learning (RL) combined with graph neural networks (GNNs) has shown promising results in various symbolic planning domains described by PDDL. However, existing approaches typically represent planning states as fully connected graphs, leading to a combinatorial explosion in edge information and substantial sparsity as problem scales grow, especially evident in large grid-based environments. This dense representation results in diluted node-level information, exponentially increases memory requirements, and ultimately makes learning infeasible for larger-scale problems. To address these challenges, we propose a sparse, goal-aware GNN representation that selectively encodes relevant local relationships and explicitly integrates spatial features related to the goal. We validate our approach by designing novel drone mission scenarios based on PDDL within a grid world, effectively simulating realistic mission execution environments. Our experimental results demonstrate that our method scales effectively to larger grid sizes previously infeasible with dense graph representations and substantially improves policy generalization and success rates. Our findings provide a practical foundation for addressing realistic, large-scale generalized planning tasks.
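To give a feel for the gap between dense and sparse state graphs on a grid, the sketch below counts directed edges in a fully connected graph versus a 4-neighbour grid graph and attaches a simple goal-relative feature to each cell. The 4-neighbour connectivity and the particular feature (offset plus Manhattan distance) are assumptions for illustration and are not the paper's encoding.

```python
def dense_edge_count(width: int, height: int) -> int:
    """Directed edges in a fully connected graph over all grid cells."""
    n = width * height
    return n * (n - 1)


def sparse_grid_edges(width: int, height: int) -> list:
    """Directed edges linking each cell only to its 4-neighbours."""
    edges = []
    for x in range(width):
        for y in range(height):
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    edges.append(((x, y), (nx, ny)))
    return edges


def goal_features(cell: tuple, goal: tuple) -> tuple:
    """Goal-aware node feature: offset to the goal and Manhattan distance."""
    dx, dy = goal[0] - cell[0], goal[1] - cell[1]
    return (dx, dy, abs(dx) + abs(dy))


for size in (10, 20, 40):
    dense = dense_edge_count(size, size)
    sparse = len(sparse_grid_edges(size, size))
    print(f"{size}x{size}: dense={dense}, sparse={sparse}")
```

In this toy setting the dense edge count grows with the square of the number of cells, while the local graph grows linearly, which is the kind of saving a sparse representation is after.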

[AI-10] APFL: Analytic Personalized Federated Learning via Dual-Stream Least Squares

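[Quick read] This paper addresses the degradation of collective generalization in Personalized Federated Learning (PFL) caused by non-IID client data, which in turn hurts personalization. It proposes Analytic Personalized Federated Learning (APFL) via dual-stream least squares: a frozen pre-trained foundation model serves as the feature extractor, followed by a shared primary stream for global generalization and a dedicated refinement stream for per-client personalization. The analytical solutions give the method heterogeneity invariance, meaning in theory that each personalized model is unaffected by how heterogeneously data is distributed across the other clients. Experiments on several datasets show accuracy advantages of at least 1.10%-15.45% over state-of-the-art baselines.

Link: https://arxiv.org/abs/2508.10732
Authors: Kejia Fan, Jianheng Tang, Zhirui Yang, Feijiang Han, Jiaxu Li, Run He, Yajiang Huang, Anfeng Liu, Houbing Herbert Song, Yunhuai Liu, Huiping Zhuang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures, 2 tables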

Abstract:Personalized Federated Learning (PFL) has presented a significant challenge to deliver personalized models to individual clients through collaborative training. Existing PFL methods are often vulnerable to non-IID data, which severely hinders collective generalization and then compromises the subsequent personalization efforts. In this paper, to address this non-IID issue in PFL, we propose an Analytic Personalized Federated Learning (APFL) approach via dual-stream least squares. In our APFL, we use a foundation model as a frozen backbone for feature extraction. Subsequent to the feature extractor, we develop dual-stream analytic models to achieve both collective generalization and individual personalization. Specifically, our APFL incorporates a shared primary stream for global generalization across all clients, and a dedicated refinement stream for local personalization of each individual client. The analytical solutions of our APFL enable its ideal property of heterogeneity invariance, theoretically meaning that each personalized model remains identical regardless of how heterogeneous the data are distributed across all other clients. Empirical results across various datasets also validate the superiority of our APFL over state-of-the-art baselines, with advantages of at least 1.10%-15.45% in accuracy.
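The intuition behind an analytic solution being indifferent to how data is partitioned can be seen in ordinary ridge regression: the closed-form weights depend only on summed statistics, and sums do not care which client contributed which sample. The sketch below shows this for a single shared model on made-up 2-D features. It is a generic illustration under that assumption, not the APFL dual-stream algorithm, and all function names are hypothetical.

```python
def gram_stats(samples):
    """Accumulate X^T X and X^T y for samples shaped ((x0, x1), y)."""
    gram = [[0.0, 0.0], [0.0, 0.0]]
    cross = [0.0, 0.0]
    for (x0, x1), y in samples:
        gram[0][0] += x0 * x0
        gram[0][1] += x0 * x1
        gram[1][0] += x1 * x0
        gram[1][1] += x1 * x1
        cross[0] += x0 * y
        cross[1] += x1 * y
    return gram, cross


def aggregate(client_stats):
    """Server side: sum the per-client statistics element-wise."""
    gram = [[0.0, 0.0], [0.0, 0.0]]
    cross = [0.0, 0.0]
    for g, c in client_stats:
        for i in range(2):
            cross[i] += c[i]
            for j in range(2):
                gram[i][j] += g[i][j]
    return gram, cross


def solve_ridge(gram, cross, lam=0.1):
    """Closed-form w = (X^T X + lam*I)^-1 X^T y for the 2x2 case."""
    a, b = gram[0][0] + lam, gram[0][1]
    c, d = gram[1][0], gram[1][1] + lam
    det = a * d - b * c
    return [(d * cross[0] - b * cross[1]) / det,
            (a * cross[1] - c * cross[0]) / det]


# Made-up samples with labels y = 2*x0 - x1.
data = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1),
        ((2, 1), 3), ((1, 2), 0), ((3, 1), 5)]

even_split = [data[:3], data[3:]]    # two clients, three samples each
skewed_split = [data[:1], data[1:]]  # one client holds almost everything

w_even = solve_ridge(*aggregate([gram_stats(c) for c in even_split]))
w_skewed = solve_ridge(*aggregate([gram_stats(c) for c in skewed_split]))
w_central = solve_ridge(*gram_stats(data))
```

All three weight vectors coincide, whereas gradient-based federated averaging would generally give different results for the two splits.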

[AI-11] Electromagnetic Simulations of Antennas on GPUs for Machine Learning Applications

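[Quick read] This paper addresses the shortage of training data for machine learning (ML) assisted antenna design and optimization, which arises because electromagnetic (EM) simulations are computationally expensive and ML methods need many samples. It builds an antenna simulation framework accelerated by graphics processing units (GPUs) on top of the open-source EM simulation software gprMax, using GPU parallelism to generate large antenna data sets for ML and surrogate model applications. The results show that an entry-level GPU substantially outperforms a high-end CPU, that a high-end gaming GPU delivers around 18 times the computational performance of a high-end CPU, and that the open-source software matches commercial software on microstrip antenna simulations when the spatial resolution is sufficiently fine.

Link: https://arxiv.org/abs/2508.10713
Authors: Murat Temiz, Vemund Bakken
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 10 figures, 4 tables, journal article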

Abstract:This study proposes an antenna simulation framework powered by graphics processing units (GPUs) based on an open-source electromagnetic (EM) simulation software (gprMax) for machine learning applications of antenna design and optimization. Furthermore, it compares the simulation results with those obtained through commercial EM software. The proposed software framework for machine learning and surrogate model applications will produce antenna data sets consisting of a large number of antenna simulation results using GPUs. Although machine learning methods can attain the optimum solutions for many problems, they are known to be data-hungry and require a great deal of samples for the training stage of the algorithms. However, producing a sufficient number of training samples in EM applications within a limited time is challenging due to the high computational complexity of EM simulations. Therefore, GPUs are utilized in this study to simulate a large number of antennas with predefined or random antenna shape parameters to produce data sets. Moreover, this study also compares various machine learning and deep learning models in terms of antenna parameter estimation performance. This study demonstrates that an entry-level GPU substantially outperforms a high-end CPU in terms of computational performance, while a high-end gaming GPU can achieve around 18 times more computational performance compared to a high-end CPU. Moreover, it is shown that the open-source EM simulation software can deliver similar results to those obtained via commercial software in the simulation of microstrip antennas when the spatial resolution of the simulations is sufficiently fine.

[AI-12] GenOM: Ontology Matching with Description Generation and Large Language Model

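[Quick read] This paper addresses semantic interoperability and integration across heterogeneous knowledge sources, in particular ontology matching (OM) in the biomedical domain, where concepts related to diseases and pharmaceuticals are complex. It proposes GenOM, a framework that uses a large language model (LLM) to enrich the semantic representation of ontology concepts by generating textual definitions, retrieves alignment candidates with an embedding model, and adds exact-matching-based tools to improve precision. On the OAEI Bio-ML track GenOM often achieves competitive performance, surpassing many traditional and recent LLM-based baselines, and ablation studies confirm the value of semantic enrichment and few-shot prompting.

Link: https://arxiv.org/abs/2508.10703
Authors: Yiping Song, Jiaoyan Chen, Renate A. Schmidt
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: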

Abstract:Ontology matching (OM) plays an essential role in enabling semantic interoperability and integration across heterogeneous knowledge sources, particularly in the biomedical domain which contains numerous complex concepts related to diseases and pharmaceuticals. This paper introduces GenOM, a large language model (LLM)-based ontology alignment framework, which enriches the semantic representations of ontology concepts via generating textual definitions, retrieves alignment candidates with an embedding model, and incorporates exact matching-based tools to improve precision. Extensive experiments conducted on the OAEI Bio-ML track demonstrate that GenOM can often achieve competitive performance, surpassing many baselines including traditional OM systems and recent LLM-based methods. Further ablation studies confirm the effectiveness of semantic enrichment and few-shot prompting, highlighting the framework’s robustness and adaptability.
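The candidate-retrieval step described above can be pictured as a nearest-neighbour search over concept embeddings. The sketch below ranks target concepts by cosine similarity to a source concept. The 3-dimensional vectors are made-up stand-ins for real embeddings, and `cosine` and `top_k` are hypothetical helpers; the paper's embedding model, definition generation, and exact-matching tools are not reproduced.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def top_k(query, candidates: dict, k: int) -> list:
    """Return the k candidate labels most similar to the query, best first."""
    ranked = sorted(candidates,
                    key=lambda name: cosine(query, candidates[name]),
                    reverse=True)
    return ranked[:k]


# Made-up vectors standing in for embeddings of generated concept definitions.
source_concept = [0.9, 0.1, 0.0]  # e.g. "myocardial infarction"
target_concepts = {
    "heart attack": [0.8, 0.2, 0.1],
    "cardiac arrest": [0.6, 0.6, 0.1],
    "aspirin": [0.0, 0.1, 0.9],
}

shortlist = top_k(source_concept, target_concepts, k=2)
print(shortlist)
```

In a pipeline of this kind, the shortlist would then be passed to a stricter decision step rather than accepted directly, since embedding similarity alone can confuse related but distinct concepts.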

[AI-13] REFN: A Reinforcement-Learning-From-Network Framework against 1-day/n-day Exploitations

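[Quick read] This paper addresses the threat that 1-day and n-day vulnerabilities (publicly disclosed flaws that remain unpatched on many devices) pose to massively deployed networked devices. Traditional defenses such as host-based patching and network-based filtering fall short because they scale poorly across diverse devices, are often incompatible with embedded or legacy systems, and involve error-prone deployment steps such as manual patch validation. The proposed REFN (Reinforcement Learning From Network) framework trains large language models (LLMs) to generate network filters autonomously, using reinforcement learning (RL) driven by online network rewards instead of human feedback (RLHF). It is deployed uniformly on edge security gateways (Amazon Eero) for compatibility and validated online against real network traffic for robustness, targeting three challenges: LLMs' limited vulnerability-fixing expertise, the gap between language and network enforcement, and LLM hallucination and non-determinism.

Link: https://arxiv.org/abs/2508.10701
Authors: Tianlong Yu, Lihong Liu, Ziyi Zhou, Fudu Xing, Kailong Wang, Yang Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: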

Abstract:The exploitation of 1-day or n-day vulnerabilities poses severe threats to networked devices due to massive deployment scales and delayed patching (average Mean Time To Patch exceeds 60 days). Existing defenses, including host-based patching and network-based filtering, are inadequate due to limited scalability across diverse devices, compatibility issues especially with embedded or legacy systems, and an error-prone deployment process (manual patch validation). To address these issues, we introduce REFN (Reinforcement Learning From Network), a novel framework that trains Large Language Models (LLMs) to autonomously generate network filters to prevent 1-day or n-day exploitations. REFN ensures scalability by uniquely employing Reinforcement Learning (RL) driven by online network rewards instead of traditional Human Feedback (RLHF). REFN guarantees compatibility via unified deployment on edge security gateways (Amazon Eero). REFN provides robustness via online validation using real network traffic. Crucially, REFN addresses three core challenges in training LLMs for exploit prevention: 1) expanding current LLMs' limited vulnerability-fixing expertise via Agentic RAG-based Knowledge Distillation, 2) bridging current LLMs' language-to-network gaps through an RL From VNF Pipeline that translates language context (vulnerability description) into network enforcement, 3) addressing LLM hallucination and non-determinism via Online Agentic Validation that penalizes erroneous outputs. Evaluated across 22 families of 1-day or n-day exploits, REFN demonstrates effectiveness (21.1% higher accuracy than alternatives), efficiency (Mean Time To Patch of 3.65 hours) and scalability (easily scaling to 10K devices). REFN serves as an initial step toward training LLMs to rapidly prevent massive-scale 1-day or n-day exploitations.

[AI-14] STEP: Stepwise Curriculum Learning for Context-Knowledge Fusion in Conversational Recommendation

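[Quick read]: This paper targets the difficulty that current conversational recommender systems (CRS) have in capturing the deep semantics of user preferences and dialogue context, and in particular the problem of efficiently integrating external knowledge graph (KG) information into dialogue generation and recommendation. Its core solution is the STEP framework, which combines a curriculum-guided context-knowledge fusion mechanism (F-Former) with lightweight task-specific prompt tuning. A three-stage curriculum first progressively aligns the dialogue context with KG entities, mitigating fine-grained semantic mismatches. Two minimal yet adaptive prefix prompts (a conversation prefix and a recommendation prefix) then inject the fused representation into a frozen language model, steering response generation toward user intent and biasing item ranking toward knowledge-consistent candidates, respectively. This dual-prompt scheme shares semantics across tasks while keeping the dialogue and recommendation objectives distinct, and experiments show that the method outperforms mainstream approaches in both recommendation precision and dialogue quality.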

Link: https://arxiv.org/abs/2508.10669
Authors: Zhenye Yang, Jinpeng Chen, Huan Li, Xiongnan Jin, Xuanyang Li, Junwei Zhang, Hongbo Gao, Kaimin Wei, Senzhang Wang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 10 pages; 4 figures; 6 tables; code available at this https URL

Abstract: Conversational recommender systems (CRSs) aim to proactively capture user preferences through natural language dialogue and recommend high-quality items. To achieve this, a CRS gathers user preferences via a dialogue module and builds user profiles through a recommendation module to generate appropriate recommendations. However, existing CRSs face challenges in capturing the deep semantics of user preferences and dialogue context. In particular, the efficient integration of external knowledge graph (KG) information into dialogue generation and recommendation remains a pressing issue. Traditional approaches typically combine KG information directly with dialogue content, which often struggles with complex semantic relationships, resulting in recommendations that may not align with user expectations. To address these challenges, we introduce STEP, a conversational recommender centered on pre-trained language models that combines curriculum-guided context-knowledge fusion with lightweight task-specific prompt tuning. At its heart, an F-Former progressively aligns the dialogue context with knowledge-graph entities through a three-stage curriculum, thus resolving fine-grained semantic mismatches. The fused representation is then injected into the frozen language model via two minimal yet adaptive prefix prompts: a conversation prefix that steers response generation toward user intent and a recommendation prefix that biases item ranking toward knowledge-consistent candidates. This dual-prompt scheme allows the model to share cross-task semantics while respecting the distinct objectives of dialogue and recommendation. Experimental results show that STEP outperforms mainstream methods in recommendation precision and dialogue quality on two public datasets.
ACM classes: H.3.3; I.2.7; H.2.8
Cite as: arXiv:2508.10669 [cs.AI]; DOI: https://doi.org/10.48550/arXiv.2508.10669

[AI-15] SPHENIC: Topology-Informed Multi-View Clustering for Spatial Transcriptomics

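[Quick read]: This paper addresses two key problems in identifying cell subpopulations in spatial transcriptomics. First, existing methods learn topology only from representations of individual cells or their interaction graphs, which makes them sensitive to noise and prone to low-quality topological signals. Second, spatial neighborhood information is modeled insufficiently, which yields low-quality spatial embeddings. The proposed method, SPHENIC, makes two main contributions: (1) it incorporates invariant topological features into the clustering network to stabilize representation learning; (2) it introduces the Spatial Constraint and Distribution Optimization Module (SCDOM), which increases the similarity between a cell's embedding and those of its spatial neighbors and decreases its similarity with non-neighboring cells, producing high-quality, clustering-friendly spatial embeddings.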

Link: https://arxiv.org/abs/2508.10646
Authors: Chenkai Guo, Yikai Zhu, Jing Yangum, Renxiang Guan, Por Lip Yee, Guangdun Peng, Dayu Hu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures, 2 tables

Abstract: By incorporating spatial location information, spatial-transcriptomics clustering yields more comprehensive insights into cell subpopulation identification. Despite recent progress, existing methods have at least two limitations: (i) topological learning typically considers only representations of individual cells or their interaction graphs; however, spatial transcriptomic profiles are often noisy, making these approaches vulnerable to low-quality topological signals, and (ii) insufficient modeling of spatial neighborhood information leads to low-quality spatial embeddings. To address these limitations, we propose SPHENIC, a novel Spatial Persistent Homology Enhanced Neighborhood Integrative Clustering method. Specifically, SPHENIC incorporates invariant topological features into the clustering network to achieve stable representation learning. Additionally, to construct high-quality spatial embeddings that reflect the true cellular distribution, we design the Spatial Constraint and Distribution Optimization Module (SCDOM). This module increases the similarity between a cell's embedding and those of its spatial neighbors, decreases similarity with non-neighboring cells, and thereby produces clustering-friendly spatial embeddings. Extensive experiments on 14 benchmark spatial transcriptomic slices demonstrate that SPHENIC achieves superior performance on the spatial clustering task, improving on the best existing state-of-the-art alternative by 3.31%-6.54%.

[AI-16] MSRS: Adaptive Multi-Subspace Representation Steering for Attribute Alignment in Large Language Models

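[Quick read]: This paper addresses the behavioral conflicts that arise when several attributes of a Large Language Model (LLM) are controlled at once and interfere with one another. Existing methods struggle to steer multiple attributes jointly and often suffer degraded performance or unpredictable trade-offs. The proposed Multi-Subspace Representation Steering (MSRS) framework assigns an orthogonal subspace to each attribute to isolate its influence, and adds a hybrid subspace composition strategy that combines attribute-specific subspaces (for distinct steering directions) with a shared subspace (for common steering directions). A dynamic weighting mechanism integrates these components, and at inference time a token-level intervention mechanism locates the most semantically relevant tokens for fine-grained behavioral adjustment. Together these reduce inter-attribute interference and improve the precision and generalization of multi-attribute control.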

Link: https://arxiv.org/abs/2508.10599
Authors: Xinyan Jiang, Lin Zhang, Jiayi Zhang, Qingsong Yang, Guimin Hu, Di Wang, Lijie Hu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Activation steering offers a promising approach to controlling the behavior of Large Language Models by directly manipulating their internal activations. However, most existing methods struggle to jointly steer multiple attributes, often resulting in interference and undesirable trade-offs. To address this challenge, we propose Multi-Subspace Representation Steering (MSRS), a novel framework for effective multi-attribute steering via subspace representation fine-tuning. MSRS reduces inter-attribute interference by allocating orthogonal subspaces to each attribute, isolating their influence within the model’s representation space. MSRS also incorporates a hybrid subspace composition strategy: it combines attribute-specific subspaces for unique steering directions with a shared subspace for common steering directions. A dynamic weighting function learns to efficiently integrate these components for precise control. During inference, MSRS introduces a token-level steering mechanism that dynamically identifies and intervenes on the most semantically relevant tokens, enabling fine-grained behavioral modulation. Experimental results show that MSRS significantly reduces attribute conflicts, surpasses existing methods across a range of attributes, and generalizes effectively to diverse downstream tasks.

[AI-17] FreeGAD: A Training-Free yet Effective Approach for Graph Anomaly Detection

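[Quick read]: This paper addresses the high training cost and poor scalability of current deep-learning-based Graph Anomaly Detection (GAD) methods. Existing models depend on complex training procedures, yet the authors' empirical study suggests that the training phase may contribute less to final detection performance than expected. The proposed solution is FreeGAD, a training-free method. It uses an affinity-gated residual encoder to generate anomaly-aware node representations, selects anchor nodes as pseudo-normal and pseudo-anomalous guides, and computes anomaly scores from anchor-guided statistical deviations, achieving efficient, accurate, and scalable anomaly detection without any training or iterative optimization.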

Link: https://arxiv.org/abs/2508.10594
Authors: Yunfeng Zhao, Yixin Liu, Shiyuan Li, Qingfeng Chen, Yu Zheng, Shirui Pan
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph Anomaly Detection (GAD) aims to identify nodes that deviate from the majority within a graph, playing a crucial role in applications such as social networks and e-commerce. Despite the current advancements in deep learning-based GAD, existing approaches often suffer from high deployment costs and poor scalability due to their complex and resource-intensive training processes. Surprisingly, our empirical findings suggest that the training phase of deep GAD methods, commonly perceived as crucial, may actually contribute less to anomaly detection performance than expected. Inspired by this, we propose FreeGAD, a novel training-free yet effective GAD method. Specifically, it leverages an affinity-gated residual encoder to generate anomaly-aware representations. Meanwhile, FreeGAD identifies anchor nodes as pseudo-normal and anomalous guides, followed by calculating anomaly scores through anchor-guided statistical deviations. Extensive experiments demonstrate that FreeGAD achieves superior anomaly detection performance, efficiency, and scalability on multiple benchmark datasets from diverse domains, without any training or iterative optimization.

[AI-18] Fake Speech Wild: Detecting Deepfake Speech on Social Media Platform

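[Quick read]: This paper addresses the sharp drop in deepfake speech detection performance in real social media settings, that is, the limited robustness of existing countermeasures (CMs) in cross-domain scenarios. The solution has three parts. First, the authors build Fake Speech Wild (FSW), a large dataset of real and fake speech focused on social platforms, covering 254 hours of audio from four different media platforms. Second, they establish a benchmark based on self-supervised learning (SSL) CMs that reflects how CMs actually perform in real-world conditions. Third, they use data augmentation to improve generalization and add the FSW training set, which substantially improves detection and yields an average equal error rate (EER) of 3.54% across all evaluation sets.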

Link: https://arxiv.org/abs/2508.10559
Authors: Yuankun Xie, Ruibo Fu, Xiaopeng Wang, Zhiyong Wang, Ya Li, Zhengqi Wen, Haonnan Cheng, Long Ye
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Abstract: The rapid advancement of speech generation technology has led to the widespread proliferation of deepfake speech across social media platforms. While deepfake audio countermeasures (CMs) achieve promising results on public datasets, their performance degrades significantly in cross-domain scenarios. To advance CMs for real-world deepfake detection, we first propose the Fake Speech Wild (FSW) dataset, which includes 254 hours of real and deepfake audio from four different media platforms, focusing on social media. We then establish a benchmark using public datasets and advanced self-supervised learning (SSL)-based CMs to evaluate current CMs in real-world scenarios. We also assess the effectiveness of data augmentation strategies in enhancing CM robustness for detecting deepfake speech on social media. Finally, by augmenting public datasets and incorporating the FSW training set, we significantly advance real-world deepfake audio detection performance, achieving an average equal error rate (EER) of 3.54% across all evaluation sets.
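
For readers unfamiliar with the metric quoted above, the equal error rate is the operating point where the false acceptance rate and false rejection rate coincide. The following is a minimal sketch of a threshold sweep, assuming that higher scores mean "more likely bona fide"; the function name and the score convention are illustrative assumptions, not the paper's evaluation code.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    Assumption: a higher score means the detector believes the audio is real.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: real audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: fake audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint at the threshold where the two rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up for illustration): one real clip and one fake clip overlap.
toy_eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

Production toolkits usually interpolate between thresholds rather than taking the nearest one, so values from this sketch can differ slightly on small score sets.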

[AI-19] Advances in Logic-Based Entity Resolution: Enhancing ASPEN with Local Merges and Optimality Criteria KR2025

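[Quick read]: This paper addresses the limited flexibility of entity resolution systems that support only global merges, which is a problem for data with ambiguous references (for example, "J. Lee" may denote different individuals). The existing system ASPEN can only treat all occurrences of matched entity constants as equivalent and merge them globally, which is often inappropriate in practice. The paper proposes ASPEN+, whose key additions are local merges, which let occurrences of the same value in different contexts be mapped to different real-world entities, and new optimality criteria for choosing preferred solutions (such as minimizing rule violations or maximizing the number of rules supporting a merge). The authors formalize and analyze several notions of optimal solution and evaluate the effect of local merges and the new criteria on accuracy and runtime using real-world datasets.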

Link: https://arxiv.org/abs/2508.10504
Authors: Zhliang Xiang, Meghyn Bienvenu, Gianluca Cima, Víctor Gutiérrez-Basulto, Yazmín Ibáñez-García
Affiliation: unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: Full version of a paper accepted at KR 2025

Abstract: In this paper, we present ASPEN+, which extends an existing ASP-based system, ASPEN, for collective entity resolution with two important functionalities: support for local merges and new optimality criteria for preferred solutions. Indeed, ASPEN only supports so-called global merges of entity-referring constants (e.g. author ids), in which all occurrences of matched constants are treated as equivalent and merged accordingly. However, it has been argued that when resolving data values, local merges are often more appropriate, as e.g. some instances of 'J. Lee' may refer to 'Joy Lee', while others should be matched with 'Jake Lee'. In addition to allowing such local merges, ASPEN+ offers new optimality criteria for selecting solutions, such as minimizing rule violations or maximizing the number of rules supporting a merge. Our main contributions are thus (1) the formalisation and computational analysis of various notions of optimal solution, and (2) an extensive experimental evaluation on real-world datasets, demonstrating the effect of local merges and the new optimality criteria on both accuracy and runtime.
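
The difference between global and local merges can be illustrated with a toy example. The sketch below is plain Python, not ASP, and is not how ASPEN or ASPEN+ is implemented; the record layout and the helper names `global_merge` and `local_merge` are hypothetical.

```python
# Toy records: the same string "J. Lee" occurs in two different cells.
RECORDS = [
    {"id": 1, "author": "J. Lee", "venue": "KR"},
    {"id": 2, "author": "J. Lee", "venue": "ICML"},
]


def global_merge(records, field, old, new):
    """Global merge: every occurrence of `old` in `field` is replaced by `new`."""
    return [{**r, field: new} if r[field] == old else dict(r) for r in records]


def local_merge(records, field, assignments):
    """Local merge: each cell is resolved on its own.

    `assignments` maps a record id to the value chosen for that one cell.
    """
    return [{**r, field: assignments.get(r["id"], r[field])} for r in records]


merged_globally = global_merge(RECORDS, "author", "J. Lee", "Joy Lee")
merged_locally = local_merge(RECORDS, "author", {1: "Joy Lee", 2: "Jake Lee"})
```

With the global merge both records end up attributed to 'Joy Lee', whereas the local merge can send the two occurrences to different people, which is the case the abstract argues for.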

[AI-20] PASS: Probabilistic Agentic Supernet Sampling for Interpretable and Adaptive Chest X-Ray Reasoning

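[Quick read]: This paper studies three problems that limit existing tool-augmented agentic systems in real-world use: (i) black-box reasoning steps that undermine trust in decisions and pose safety risks; (ii) poor multimodal integration, which is critical for healthcare tasks; and (iii) rigid, computationally inefficient agentic pipelines. The proposed framework, PASS (Probabilistic Agentic Supernet Sampling), adaptively samples agentic workflows over a multi-tool graph and produces decision paths annotated with interpretable probabilities, improving transparency and safety. It uses a task-conditioned distribution to select the most suitable tool at each layer and maintains an evolving personalized memory that compresses salient findings. A three-stage training procedure (expert knowledge warm-up, contrastive path ranking, and cost-aware reinforcement learning) optimizes the Pareto frontier between performance and computational cost. The authors present it as the first interpretable, adaptive, multimodal agentic framework for chest X-ray (CXR) reasoning.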

Link: https://arxiv.org/abs/2508.10501
Authors: Yushi Feng, Junye Du, Yingying Hong, Qifan Wang, Lequan Yu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Existing tool-augmented agentic systems are limited in the real world by (i) black-box reasoning steps that undermine trust in decision-making and pose safety risks, (ii) poor multimodal integration, which is inherently critical for healthcare tasks, and (iii) rigid and computationally inefficient agentic pipelines. We introduce PASS (Probabilistic Agentic Supernet Sampling), the first multimodal framework to address these challenges in the context of Chest X-Ray (CXR) reasoning. PASS adaptively samples agentic workflows over a multi-tool graph, yielding decision paths annotated with interpretable probabilities. Given the complex CXR reasoning task with multimodal medical data, PASS leverages its learned task-conditioned distribution over the agentic supernet. Thus, it adaptively selects the most suitable tool at each supernet layer, offering probability-annotated trajectories for post-hoc audits and directly enhancing medical AI safety. PASS also continuously compresses salient findings into an evolving personalized memory, while dynamically deciding whether to deepen its reasoning path or invoke an early exit for efficiency. To optimize a Pareto frontier balancing performance and cost, we design a novel three-stage training procedure, including expert knowledge warm-up, contrastive path-ranking, and cost-aware reinforcement learning. To facilitate rigorous evaluation, we introduce CAB-E, a comprehensive benchmark for multi-step, safety-critical, free-form CXR reasoning. Experiments across various benchmarks validate that PASS significantly outperforms strong baselines in multiple metrics (e.g., accuracy, AUC, LLM-J.) while balancing computational costs, marking a shift towards interpretable, adaptive, and multimodal medical agentic systems.
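
To make the idea of a probability-annotated decision path concrete, here is a minimal sketch of layer-wise sampling with an early exit. Everything in it is an assumption for illustration: the tool names, the fixed probabilities, and the `EXIT` marker are hypothetical, and in the paper the distribution is learned and conditioned on the task rather than hard-coded.

```python
import random

# Hypothetical supernet: one dict per layer, mapping tool name -> probability.
SUPERNET = [
    {"segmenter": 0.7, "classifier": 0.3},
    {"report_generator": 0.6, "vqa_model": 0.3, "EXIT": 0.1},
    {"grounding_tool": 0.5, "EXIT": 0.5},
]


def sample_path(supernet, rng):
    """Sample one tool per layer; stop early if the EXIT option is drawn.

    Returns the path as (tool, probability) pairs plus the joint probability,
    so the trajectory can be audited after the fact.
    """
    path, joint = [], 1.0
    for layer in supernet:
        tools = list(layer)
        choice = rng.choices(tools, weights=[layer[t] for t in tools])[0]
        path.append((choice, layer[choice]))
        joint *= layer[choice]
        if choice == "EXIT":
            break
    return path, joint


path, joint = sample_path(SUPERNET, random.Random(0))
```

Each step of the returned path carries the probability with which it was chosen, which is the kind of annotation the abstract describes for post-hoc audits.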

[AI-21] A Unified Multi-Agent Framework for Universal Multimodal Understanding and Generation

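[Quick read]: This paper addresses the difficulty of achieving "any-to-any" capability in multimodal applications, that is, unified understanding and generation across text, image, audio, and video, while combining the reasoning strengths of autoregressive language models (LLMs) with the high-fidelity generation of diffusion models. Existing approaches are constrained by rigid pipelines or tightly coupled architectures and lack flexibility and scalability. The proposed solution is MAGUS (Multi-Agent Guided Unified Multimodal System), a modular framework with two decoupled phases, Cognition and Deliberation. In the Cognition phase, three role-conditioned multimodal LLM agents (Perceiver, Planner, and Reflector) collaborate in a shared textual workspace to perform structured understanding and planning. In the Deliberation phase, a Growth-Aware Search mechanism coordinates LLM reasoning and diffusion-based generation so that they reinforce each other. The design supports plug-and-play extension, any-to-any modality conversion, and semantic alignment without joint training.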

Link: https://arxiv.org/abs/2508.10494
Authors: Jiulin Li, Ping Huang, Yexin Li, Shuo Chen, Juewen Hu, Ye Tian
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 8 pages, 5 figures

Abstract:Real-world multimodal applications often require any-to-any capabilities, enabling both understanding and generation across modalities including text, image, audio, and video. However, integrating the strengths of autoregressive language models (LLMs) for reasoning and diffusion models for high-fidelity generation remains challenging. Existing approaches rely on rigid pipelines or tightly coupled architectures, limiting flexibility and scalability. We propose MAGUS (Multi-Agent Guided Unified Multimodal System), a modular framework that unifies multimodal understanding and generation via two decoupled phases: Cognition and Deliberation. MAGUS enables symbolic multi-agent collaboration within a shared textual workspace. In the Cognition phase, three role-conditioned multimodal LLM agents - Perceiver, Planner, and Reflector - engage in collaborative dialogue to perform structured understanding and planning. The Deliberation phase incorporates a Growth-Aware Search mechanism that orchestrates LLM-based reasoning and diffusion-based generation in a mutually reinforcing manner. MAGUS supports plug-and-play extensibility, scalable any-to-any modality conversion, and semantic alignment - all without the need for joint training. Experiments across multiple benchmarks, including image, video, and audio generation, as well as cross-modal instruction following, demonstrate that MAGUS outperforms strong baselines and state-of-the-art systems. Notably, on the MME benchmark, MAGUS surpasses the powerful closed-source model GPT-4o.

[AI-22] Contrastive ECOC: Learning Output Codes for Adversarial Defense

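[Quick read]: This paper addresses the limited robustness of conventional one-hot encoding against adversarial attacks in multiclass classification, as well as the inefficiency and poor generalization of existing Error Correcting Output Codes (ECOC) methods that rely on manually designed or randomly generated codebooks. The solution is a set of three models for automated codebook learning based on contrastive learning, which let the codebook be learned directly and adaptively from data, improving robustness to adversarial examples and giving a more efficient, data-driven multiclass encoding strategy.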

Link: https://arxiv.org/abs/2508.10491
Authors: Che-Yu Chou, Hung-Hsuan Chen
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments:

Abstract:Although one-hot encoding is commonly used for multiclass classification, it is not always the most effective encoding mechanism. Error Correcting Output Codes (ECOC) address multiclass classification by mapping each class to a unique codeword used as a label. Traditional ECOC methods rely on manually designed or randomly generated codebooks, which are labor-intensive and may yield suboptimal, dataset-agnostic results. This paper introduces three models for automated codebook learning based on contrastive learning, allowing codebooks to be learned directly and adaptively from data. Across four datasets, our proposed models demonstrate superior robustness to adversarial attacks compared to two baselines. The source is available at this https URL.
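
The ECOC idea of mapping each class to a codeword can be sketched with nearest-codeword decoding. The codebook below is hand-written and purely hypothetical (the paper learns codebooks with contrastive learning, which this sketch does not attempt); it only shows why well-separated codewords tolerate bit errors.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(predicted_bits, codebook):
    """Return the class whose codeword is closest to the predicted bits."""
    return min(codebook, key=lambda label: hamming(predicted_bits, codebook[label]))


# Hypothetical codebook with minimum pairwise Hamming distance 3,
# so any single flipped bit is still decoded to the right class.
CODEBOOK = {
    "cat": (0, 0, 0, 0, 0, 0),
    "dog": (1, 1, 1, 0, 0, 0),
    "bird": (0, 0, 1, 1, 1, 1),
}

# The second bit of the "dog" codeword has been flipped.
noisy_prediction = (1, 0, 1, 0, 0, 0)
decoded = ecoc_decode(noisy_prediction, CODEBOOK)
```

A one-hot code has minimum distance 2 between any two classes, so it cannot correct even a single bit error in this sense, which is one intuition for why longer, well-separated codes may help robustness.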

[AI-23] SEQ-GPT: LLM-assisted Spatial Query via Example

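[Quick read]: This paper addresses the limited user experience of conventional spatial services (such as online maps) on complex query tasks, especially when a user needs to search for several related locations at once. The authors study the Spatial Exemplar Query (SEQ) scenario and build SEQ-GPT, a system that uses Large Language Models (LLMs) to support versatile spatial queries in natural language. The solution has two key parts. First, it uses the language understanding and interaction abilities of LLMs to ask users to clarify their intent and to adjust the search based on feedback. Second, it builds a tailored LLM adaptation pipeline that aligns natural language with structured spatial data and queries through dialogue synthesis and multi-model cooperation, providing an end-to-end demonstration of enhanced spatial search.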

Link: https://arxiv.org/abs/2508.10486
Authors: Ivan Khai Ze Lim, Ningyi Liao, Yiming Yang, Gerald Wei Yong Yip, Siqiang Luo
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Contemporary spatial services such as online maps predominantly rely on user queries for location searches. However, the user experience is limited when performing complex tasks, such as searching for a group of locations simultaneously. In this study, we examine the extended scenario known as Spatial Exemplar Query (SEQ), where multiple relevant locations are jointly searched based on user-specified examples. We introduce SEQ-GPT, a spatial query system powered by Large Language Models (LLMs) towards more versatile SEQ search using natural language. The language capabilities of LLMs enable unique interactive operations in the SEQ process, including asking users to clarify query details and dynamically adjusting the search based on user feedback. We also propose a tailored LLM adaptation pipeline that aligns natural language with structured spatial data and queries through dialogue synthesis and multi-model cooperation. SEQ-GPT offers an end-to-end demonstration for broadening spatial search with realistic data and application scenarios.

[AI-24] Pinet: Optimizing hard-constrained neural networks with orthogonal projection layers

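[Quick read]: This paper addresses the difficulty of guaranteeing constraint satisfaction when neural networks are applied to optimization problems with convex constraints, particularly parametric constrained optimization, where traditional numerical solvers are slow and learning-based methods are sensitive to hyperparameters. The solution is a new output layer called Πnet, which uses operator splitting for fast and reliable projections in the forward pass and the implicit function theorem for backpropagation, so the constraints are built into the model by design and the outputs are always feasible. The method speeds up the solution of single problems and batches of problems, and it surpasses existing learning approaches in training time, solution quality, and robustness to hyperparameter tuning while keeping similar inference times.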

Link: https://arxiv.org/abs/2508.10480
Authors: Panagiotis D. Grontas, Antonio Terpin, Efe C. Balta, Raffaello D'Andrea, John Lygeros
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Abstract: We introduce an output layer for neural networks that ensures satisfaction of convex constraints. Our approach, Πnet, leverages operator splitting for rapid and reliable projections in the forward pass, and the implicit function theorem for backpropagation. We deploy Πnet as a feasible-by-design optimization proxy for parametric constrained optimization problems and obtain modest-accuracy solutions faster than traditional solvers when solving a single problem, and significantly faster for a batch of problems. We surpass state-of-the-art learning approaches in terms of training time, solution quality, and robustness to hyperparameter tuning, while maintaining similar inference times. Finally, we tackle multi-vehicle motion planning with non-convex trajectory preferences and provide Πnet as a GPU-ready package implemented in JAX with effective tuning heuristics.
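
As a rough illustration of projecting a raw network output onto a convex set built from simpler pieces, the sketch below projects a vector onto the intersection of a box and a hyperplane. It swaps in Dykstra's alternating projections, a classical scheme that splits the problem into two easy projections; the abstract does not say which splitting Πnet uses, and the backward pass via the implicit function theorem is not shown. The constraint set and all function names are assumptions.

```python
def project_hyperplane(v, total):
    """Euclidean projection onto {x : sum(x) == total}."""
    shift = (sum(v) - total) / len(v)
    return [x - shift for x in v]


def project_box(v, lo, hi):
    """Euclidean projection onto {x : lo <= x_i <= hi}."""
    return [min(max(x, lo), hi) for x in v]


def dykstra(z, total, lo, hi, iters=200):
    """Project z onto the intersection of the hyperplane and the box.

    The correction terms p and q are what distinguish Dykstra's method from
    plain alternating projections and make it converge to the true projection.
    """
    x = list(z)
    p = [0.0] * len(z)
    q = [0.0] * len(z)
    for _ in range(iters):
        y = project_hyperplane([a + b for a, b in zip(x, p)], total)
        p = [a + b - c for a, b, c in zip(x, p, y)]
        x = project_box([a + b for a, b in zip(y, q)], lo, hi)
        q = [a + b - c for a, b, c in zip(y, q, x)]
    return x


# An infeasible raw output is mapped to a point with entries in [0, 1] summing to 1.
feasible = dykstra([2.0, 0.5, -1.0], total=1.0, lo=0.0, hi=1.0)
```

Because the last step of each iteration is the box projection, the box constraint holds exactly, while the sum constraint holds up to the iteration tolerance.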

[AI-25] FIRESPARQL: An LLM-based Framework for SPARQL Query Generation over Scholarly Knowledge Graphs

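[Quick read]: This paper addresses two problems that Large Language Models (LLMs) have when generating SPARQL queries for natural language questions (NLQs) over Scholarly Knowledge Graphs (SKGs): structural inconsistencies, such as missing or redundant triples in the query, and semantic inaccuracies, where the structure is correct but the wrong entities or properties are used. The proposed FIRESPARQL framework is built around a fine-tuned LLM, with optional context from retrieval-augmented generation (RAG) and an additional SPARQL query correction layer to improve accuracy. Experiments on the SciQA benchmark show that fine-tuning performs best, reaching 0.90 ROUGE-L for query accuracy and 0.85 RelaxedEM for result accuracy.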

Link: https://arxiv.org/abs/2508.10467
Authors: Xueli Pan, Victor de Boer, Jacco van Ossenbruggen
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: Accepted at 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)

Abstract: Question answering over Scholarly Knowledge Graphs (SKGs) remains a challenging task due to the complexity of scholarly content and the intricate structure of these graphs. Large Language Model (LLM) approaches could be used to translate natural language questions (NLQs) into SPARQL queries; however, these LLM-based approaches struggle with SPARQL query generation due to limited exposure to SKG-specific content and the underlying schema. We identified two main types of errors in the LLM-generated SPARQL queries: (i) structural inconsistencies, such as missing or redundant triples in the queries, and (ii) semantic inaccuracies, where incorrect entities or properties are shown in the queries despite a correct query structure. To address these issues, we propose FIRESPARQL, a modular framework that supports fine-tuned LLMs as a core component, with optional context provided via retrieval-augmented generation (RAG) and a SPARQL query correction layer. We evaluate the framework on the SciQA Benchmark using various configurations (zero-shot, zero-shot with RAG, one-shot, fine-tuning, and fine-tuning with RAG) and compare the performance with baseline and state-of-the-art approaches. We measure query accuracy using BLEU and ROUGE metrics, and query result accuracy using relaxed exact match (RelaxedEM), with respect to the gold standards containing the NLQs, SPARQL queries, and the results of the queries. Experimental results demonstrate that fine-tuning achieves the highest overall performance, reaching 0.90 ROUGE-L for query accuracy and 0.85 RelaxedEM for result accuracy on the test set.
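
ROUGE-L, the query-accuracy metric quoted above, scores a generated string by its longest common subsequence (LCS) with a reference. The sketch below computes an LCS-based F1 over whitespace tokens; the tokenization, the balanced F1 weighting, and the example queries are assumptions, and the paper's exact ROUGE configuration may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based F1 between two strings, split on whitespace."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and generated queries that differ in one class name.
gold = "SELECT ?paper WHERE { ?paper a :Paper }"
generated = "SELECT ?paper WHERE { ?paper a :Article }"
overlap = rouge_l_f1(generated, gold)
```

The example also shows why a string-overlap score is paired with a result-level metric: the generated query scores high on overlap although the wrong class name would likely change the returned results.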

[AI-26] X-Node: Self-Explanation is All We Need

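[Quick read]: This paper addresses the lack of interpretability of Graph Neural Networks (GNNs) in high-stakes clinical applications. Existing explanation methods are mostly post-hoc and global, and offer little insight into the local reasoning behind individual node decisions. The proposed X-Node framework is a self-explaining GNN architecture in which each node generates its own explanation as part of prediction. A structured context vector is built from interpretable cues such as degree, centrality, clustering coefficient, feature saliency, and label agreement, and a lightweight Reasoner module maps it to a compact explanation vector. That vector is used to reconstruct the node's latent embedding (to enforce faithfulness), to drive a pre-trained Large Language Model (LLM) that produces a natural language explanation, and, through a "text-injection" mechanism, to feed explanations back into the message-passing pipeline to guide the model itself, so that performance and interpretability are optimized together.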

Link: https://arxiv.org/abs/2508.10461
Authors: Prajit Sengupta, Islem Rekik
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph neural networks (GNNs) have achieved state-of-the-art results in computer vision and medical image classification tasks by capturing structural dependencies across data instances. However, their decision-making remains largely opaque, limiting their trustworthiness in high-stakes clinical applications where interpretability is essential. Existing explainability techniques for GNNs are typically post-hoc and global, offering limited insight into individual node decisions or local reasoning. We introduce X-Node, a self-explaining GNN framework in which each node generates its own explanation as part of the prediction process. For every node, we construct a structured context vector encoding interpretable cues such as degree, centrality, clustering, feature saliency, and label agreement within its local topology. A lightweight Reasoner module maps this context into a compact explanation vector, which serves three purposes: (1) reconstructing the node’s latent embedding via a decoder to enforce faithfulness, (2) generating a natural language explanation using a pre-trained LLM (e.g., Grok or Gemini), and (3) guiding the GNN itself via a “text-injection” mechanism that feeds explanations back into the message-passing pipeline. We evaluate X-Node on two graph datasets derived from MedMNIST and MorphoMNIST, integrating it with GCN, GAT, and GIN backbones. Our results show that X-Node maintains competitive classification accuracy while producing faithful, per-node explanations. Repository: this https URL.

[AI-27] RealAC: A Domain-Agnostic Framework for Realistic and Actionable Counterfactual Explanations

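[Quick read]: This paper addresses the limited realism and actionability of current counterfactual explanations for AI decisions. Existing methods usually rely on hand-crafted constraints or domain knowledge to model complex dependencies between features, fail to capture nonlinear relations in the data, and rarely support user preferences or causal plausibility, so the explanations they produce can be infeasible in practice or causally implausible. The proposed solution is RealAC, a domain-agnostic framework that implicitly preserves complex multivariate dependencies by automatically aligning the joint distributions of feature pairs between factual and counterfactual instances. It also lets users freeze attributes that they cannot or do not wish to change by suppressing changes to those features during optimization, producing counterfactuals that balance realism with user intent.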

Link: https://arxiv.org/abs/2508.10455
Authors: Asiful Arefeen, Shovito Barua Soumma, Hassan Ghasemzadeh
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Counterfactual explanations provide human-understandable reasoning for AI-made decisions by describing minimal changes to input features that would alter a model's prediction. To be truly useful in practice, such explanations must be realistic and feasible: they should respect both the underlying data distribution and user-defined feasibility constraints. Existing approaches often enforce inter-feature dependencies through rigid, hand-crafted constraints or domain-specific knowledge, which limits their generalizability and ability to capture complex, nonlinear relations inherent in data. Moreover, they rarely accommodate user-specified preferences and suggest explanations that are causally implausible or infeasible to act upon. We introduce RealAC, a domain-agnostic framework for generating realistic and actionable counterfactuals. RealAC automatically preserves complex inter-feature dependencies without relying on explicit domain knowledge, by aligning the joint distributions of feature pairs between factual and counterfactual instances. The framework also allows end-users to "freeze" attributes they cannot or do not wish to change by suppressing change in frozen features during optimization. Evaluations on three synthetic and two real datasets demonstrate that RealAC balances realism with actionability. Our method outperforms state-of-the-art baselines and Large Language Model-based counterfactual generation techniques in causal edge score, dependency preservation score, and IM1 realism metric and offers a solution for causality-aware and user-centric counterfactual generation.
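
The "freeze" mechanism can be pictured as masking the update of selected features during a counterfactual search. The sketch below does this for a hypothetical linear scoring model with made-up weights; it covers only the frozen-feature part and leaves out RealAC's alignment of joint feature distributions, so it is not the paper's optimization procedure.

```python
def score(x, w, b):
    """Linear decision score; a non-negative score means the desired outcome."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def counterfactual(x, w, b, frozen, lr=0.1, max_steps=1000):
    """Move x along the score gradient until the decision flips.

    Features whose index is in `frozen` are never updated.
    """
    cf = list(x)
    for _ in range(max_steps):
        if score(cf, w, b) >= 0:
            break
        for i, wi in enumerate(w):
            if i not in frozen:
                # The gradient of the linear score with respect to x_i is w_i.
                cf[i] += lr * wi
    return cf


# Hypothetical instance: feature 1 (say, age) is frozen by the user.
x0 = [1.0, 2.0, 0.5]
weights, bias = [1.0, -2.0, 0.5], -1.0
cf = counterfactual(x0, weights, bias, frozen={1})
```

The search reaches the desired outcome by changing only features 0 and 2, while the frozen feature keeps its original value.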

[AI-28] Alternating Approach-Putt Models for Multi-Stage Speech Enhancement

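[Quick read]: This paper addresses the distortions (artifacts) that speech enhancement networks introduce while removing noise, which degrade speech quality. The solution is PuttNet, a post-processing neural network inspired by the "Putt" that follows an "Approach" in golf. Alternating between the original speech enhancement model and PuttNet improves perceptual quality (PESQ), objective intelligibility (STOI), and background noise intrusiveness (CBAK) scores, and a graphical analysis illustrates why this alternating strategy outperforms repeated application of either model alone.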

Link: https://arxiv.org/abs/2508.10436
Authors: Iksoon Jeong, Kyung-Joong Kim, Kang-Hun Ahn
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: This work has been submitted to the IEEE for possible publication

Abstract: Speech enhancement using artificial neural networks aims to remove noise from noisy speech signals while preserving the speech content. However, speech enhancement networks often introduce distortions to the speech signal, referred to as artifacts, which can degrade audio quality. In this work, we propose a post-processing neural network designed to mitigate artifacts introduced by speech enhancement models. Inspired by the analogy of making a 'Putt' after an 'Approach' in golf, we name our model PuttNet. We demonstrate that alternating between a speech enhancement model and the proposed Putt model leads to improved speech quality, as measured by perceptual quality scores (PESQ), objective intelligibility (STOI), and background noise intrusiveness (CBAK) scores. Furthermore, we illustrate with graphical analysis why this alternating Approach outperforms repeated application of either model alone.

[AI-29] Unpacking the Implicit Norm Dynamics of Sharpness-Aware Minimization in Tensorized Models

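[Quick read]: This paper addresses the fact that the implicit regularization of Sharpness-Aware Minimization (SAM) in more general tensorized or scale-invariant models is not well understood. Prior work focuses on simple two-core scale-invariant settings, and there is little theory on how SAM balances the norms of the individual cores in more complex models. The paper introduces "Norm Deviation" as a global measure of core norm imbalance and uses gradient flow analysis to show that SAM's implicit control of Norm Deviation is governed by the covariance between the core norms and their gradient magnitudes. Based on this finding, the authors propose Deviation-Aware Scaling (DAS), a simple explicit regularization method that mimics SAM's implicit behavior by scaling core norms in a data-adaptive manner, achieving competitive or improved performance with lower computational overhead.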

Link: https://arxiv.org/abs/2508.10435
Authors: Tianxiao Cao, Kyohei Atarashi, Hisashi Kashima
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract: Sharpness-Aware Minimization (SAM) has been proven to be an effective optimization technique for improving generalization in overparameterized models. While prior works have explored the implicit regularization of SAM in simple two-core scale-invariant settings, its behavior in more general tensorized or scale-invariant models remains underexplored. In this work, we leverage scale-invariance to analyze the norm dynamics of SAM in general tensorized models. We introduce the notion of Norm Deviation as a global measure of core norm imbalance, and derive its evolution under SAM using gradient flow analysis. We show that SAM's implicit control of Norm Deviation is governed by the covariance between core norms and their gradient magnitudes. Motivated by these findings, we propose a simple yet effective method, Deviation-Aware Scaling (DAS), which explicitly mimics this regularization behavior by scaling core norms in a data-adaptive manner. Our experiments across tensor completion, noisy training, model compression, and parameter-efficient fine-tuning confirm that DAS achieves competitive or improved performance over SAM, while offering reduced computational overhead.
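
The scale-invariance that the analysis relies on can be seen in a two-core toy model: rescaling one core by c and the other by 1/c leaves the modeled matrix unchanged while making the core norms more or less balanced. The sketch below demonstrates only this property. The helper `norm_gap` is a hypothetical stand-in for an imbalance measure and is not the paper's definition of Norm Deviation, which the abstract does not give.

```python
import math


def outer(u, v):
    """Rank-1 matrix u v^T as nested lists."""
    return [[ui * vj for vj in v] for ui in u]


def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(t * t for t in x))


def norm_gap(cores):
    """Hypothetical imbalance measure: largest minus smallest core norm."""
    norms = [norm(c) for c in cores]
    return max(norms) - min(norms)


u, v = [1.0, 2.0], [3.0, -1.0]
c = 4.0
u_scaled = [c * t for t in u]   # first core scaled up by c
v_scaled = [t / c for t in v]   # second core scaled down by c

same_function = outer(u, v), outer(u_scaled, v_scaled)
gaps = norm_gap([u, v]), norm_gap([u_scaled, v_scaled])
```

Since the loss cannot tell these parameterizations apart, any preference for balanced norms has to come from the optimizer's dynamics, which is the kind of implicit effect the paper analyzes for SAM.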

[AI-30] HiRef: Leveraging Hierarchical Ontology and Network Refinement for Robust Medication Recommendation

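[Quick read]: This paper addresses the poor generalization of medication recommendation models built on longitudinal Electronic Health Records (EHR), which is caused by rarely observed medical entities and incomplete records. Existing data-driven models rely on observed co-occurrence patterns and perform poorly under missing or novel clinical conditions. The proposed solution is HiRef (Hierarchical Ontology and Network Refinement for Robust Medication Recommendation), a unified framework that combines two complementary structures. First, it uses the hierarchical semantics of medical ontologies by embedding ontology entities in hyperbolic space, which naturally models tree-like relations and enables knowledge transfer through shared ancestors. Second, it introduces a prior-guided sparse regularization scheme that refines the EHR co-occurrence graph, suppressing spurious edges while preserving clinically meaningful associations, which improves robustness to unseen codes.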

Link: https://arxiv.org/abs/2508.10425
Authors: Yan Ting Chok,Soyon Park,Seungheun Baek,Hajung Kim,Junhyun Lee,Jaewoo Kang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Medication recommendation is a crucial task for assisting physicians in making timely decisions from longitudinal patient medical records. However, real-world EHR data present significant challenges due to the presence of rarely observed medical entities and incomplete records that may not fully capture the clinical ground truth. While data-driven models trained on longitudinal Electronic Health Records often achieve strong empirical performance, they struggle to generalize under missing or novel conditions, largely due to their reliance on observed co-occurrence patterns. To address these issues, we propose Hierarchical Ontology and Network Refinement for Robust Medication Recommendation (HiRef), a unified framework that combines two complementary structures: (i) the hierarchical semantics encoded in curated medical ontologies, and (ii) refined co-occurrence patterns derived from real-world EHRs. We embed ontology entities in hyperbolic space, which naturally captures tree-like relationships and enables knowledge transfer through shared ancestors, thereby improving generalizability to unseen codes. To further improve robustness, we introduce a prior-guided sparse regularization scheme that refines the EHR co-occurrence graph by suppressing spurious edges while preserving clinically meaningful associations. Our model achieves strong performance on EHR benchmarks (MIMIC-III and MIMIC-IV) and maintains high accuracy under simulated unseen-code settings. Extensive experiments with comprehensive ablation studies demonstrate HiRef’s resilience to unseen medical codes, supported by in-depth analyses of the learned sparsified graph structure and medical code embeddings.
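
To illustrate why hyperbolic space suits tree-like ontologies, here is a small sketch of the geodesic distance in the Poincaré ball. The abstract does not say which hyperbolic model HiRef uses, so the Poincaré ball and all coordinates below are assumptions chosen for illustration.

```python
import math

def poincare_distance(u, v):
    # Geodesic distance between two points strictly inside the unit ball.
    def sq_norm(x):
        return sum(xi * xi for xi in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

root = [0.0, 0.0]      # hypothetical ontology root at the origin
parent = [0.5, 0.0]    # hypothetical mid-level code
leaf_a = [0.9, 0.1]    # hypothetical sibling leaves near the boundary
leaf_b = [0.9, -0.1]

d_root_parent = poincare_distance(root, parent)
d_leaves = poincare_distance(leaf_a, leaf_b)
```

The two leaves are close in Euclidean terms but far apart hyperbolically, because distances grow rapidly near the boundary of the ball. That growth is what leaves room for the exponentially many leaves of a tree.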

[AI-31] MASH: Cooperative-Heterogeneous Multi-Agent Reinforcement Learning for Single Humanoid Robot Locomotion

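Quick read: This paper addresses the difficulty of achieving whole-body coordination and efficient learning in locomotion control for a single humanoid robot. Conventional approaches use either single-agent reinforcement learning (SARL) for a single robot or multi-agent reinforcement learning (MARL) for multi-robot systems, and neither effectively improves limb coordination within one robot. The key to the solution is MASH (Multi-Agent Reinforcement Learning for Single Humanoid Locomotion), a cooperative-heterogeneous MARL framework that treats each limb (legs and arms) as an independent agent exploring the action space while sharing a global critic for cooperative learning. This design speeds up training convergence and improves whole-body cooperation, going beyond what a single-agent policy achieves.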

Link: https://arxiv.org/abs/2508.10423
Authors: Qi Liu,Xiaopeng Zhang,Mingshan Tan,Shuaikang Ma,Jinliang Ding,Yanjie Li
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Abstract:This paper proposes a novel method to enhance locomotion for a single humanoid robot through cooperative-heterogeneous multi-agent deep reinforcement learning (MARL). While most existing methods typically employ single-agent reinforcement learning algorithms for a single humanoid robot or MARL algorithms for multi-robot system tasks, we propose a distinct paradigm: applying cooperative-heterogeneous MARL to optimize locomotion for a single humanoid robot. The proposed method, multi-agent reinforcement learning for single humanoid locomotion (MASH), treats each limb (legs and arms) as an independent agent that explores the robot’s action space while sharing a global critic for cooperative learning. Experiments demonstrate that MASH accelerates training convergence and improves whole-body cooperation ability, outperforming conventional single-agent reinforcement learning methods. This work advances the integration of MARL into single-humanoid-robot control, offering new insights into efficient locomotion strategies.

[AI-32] MCP2OSC: Parametric Control by Natural Language

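Quick read: This paper addresses the tension between natural-language text prompts, which lack precision for intricate tasks, and knob or slider controls, which allow precise adjustment at the cost of added complexity. The key to the solution is a new Model Context Protocol (MCP) server and a set of prompt design criteria that, combined with a Large Language Model (LLM), let users generate, interpret, search, visualize, validate, and debug OpenSoundControl (OSC) messages and manage OSC address patterns in natural language. Because the LLM directly processes and generates human-readable OSC messages, the approach strengthens human-machine collaboration and points toward an LLM-based universal control mechanism for multimedia devices.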

Link: https://arxiv.org/abs/2508.10414
Authors: Yuan-Yi Fan
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Text prompts enable intuitive content creation but may fall short in achieving high precision for intricate tasks; knob or slider controls offer precise adjustments at the cost of increased complexity. To address the gap between knobs and prompts, a new MCP (Model Context Protocol) server and a unique set of prompt design criteria are presented to enable exploring parametric OSC (OpenSoundControl) control by natural language prompts. Demonstrated by 14 practical QA examples with best practices and the generalized prompt templates, this study finds Claude integrated with the MCP2OSC server effective in generating OSC messages by natural language, interpreting, searching, and visualizing OSC messages, validating and debugging OSC messages, and managing OSC address patterns. MCP2OSC enhances human-machine collaboration by leveraging an LLM (Large Language Model) to handle intricate OSC development tasks, and by empowering human creativity with an intuitive language interface featuring flexible precision controls: a prompt-based OSC tool. This study provides a novel perspective on the creative MCP application at the network protocol level by utilizing the LLM’s strength in directly processing and generating human-readable OSC messages. The results suggest its potential for an LLM-based universal control mechanism for multimedia devices.
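
As background on what an OSC message looks like on the wire, the sketch below encodes a single-float message following the OSC 1.0 layout: a null-terminated address padded to 4 bytes, a type tag string, then a big-endian float32. The address `/synth/freq` and the helper names are hypothetical; this is not the MCP2OSC server's code.

```python
import struct

def _osc_string(s):
    # OSC strings are ASCII, null-terminated, and padded to a multiple of 4 bytes.
    data = s.encode("ascii") + b"\x00"
    return data + b"\x00" * (-len(data) % 4)

def encode_osc_float(address, value):
    # One message = address pattern + type tag string ",f" + float32 argument.
    return _osc_string(address) + _osc_string(",f") + struct.pack(">f", value)

packet = encode_osc_float("/synth/freq", 440.0)  # hypothetical address pattern
```

A prompt such as "set the synth frequency to 440 Hz" would ultimately have to resolve to bytes of this shape, which is why a readable address pattern plus a typed argument is a convenient target for an LLM.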

[AI-33] AnalogSeeker: An Open-source Foundation Language Model for Analog Circuit Design

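Quick read: This paper addresses data scarcity and knowledge complexity in analog circuit design, with the goal of building an open-source foundation language model that integrates domain knowledge and provides design assistance. The core challenge is extracting fine-grained, learnable knowledge from limited, unstructured textbook text and training a high-performing model on it. The key to the solution is a corpus collection strategy based on a domain knowledge framework plus a granular domain knowledge distillation method: high-quality textbooks are systematically curated into a domain corpus, and a multi-agent framework decomposes the unlabeled text into typical learning nodes and distills question-answer pairs with detailed reasoning. The authors also design a neighborhood self-constrained supervised fine-tuning algorithm that limits the perturbation between the model's output distributions before and after training, with the aim of improving stability and performance. Trained from Qwen2.5-32B-Instruct, AnalogSeeker reaches 85.04% accuracy on the AMSBench-TQA benchmark, 15.67 percentage points above the original model, and also performs well on a downstream operational amplifier design task.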

Link: https://arxiv.org/abs/2508.10409
Authors: Zihao Chen,Ji Zhuang,Jinyi Shen,Xiaoyue Ke,Xinyi Yang,Mingjie Zhou,Zhuoyao Du,Xu Yan,Zhouyang Wu,Zhenyu Xu,Jiangli Huang,Li Shang,Xuan Zeng,Fan Yang
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we propose AnalogSeeker, an effort toward an open-source foundation language model for analog circuit design, with the aim of integrating domain knowledge and giving design assistance. To overcome the scarcity of data in this field, we employ a corpus collection strategy based on the domain knowledge framework of analog circuits. High-quality, accessible textbooks across relevant subfields are systematically curated and cleaned into a textual domain corpus. To address the complexity of knowledge of analog circuits, we introduce a granular domain knowledge distillation method. Raw, unlabeled domain corpus is decomposed into typical, granular learning nodes, where a multi-agent framework distills implicit knowledge embedded in unstructured text into question-answer data pairs with detailed reasoning processes, yielding a fine-grained, learnable dataset for fine-tuning. To address the unexplored challenges in training analog circuit foundation models, we explore and share our training methods through both theoretical analysis and experimental validation. We finally establish a fine-tuning-centric training paradigm, customizing and implementing a neighborhood self-constrained supervised fine-tuning algorithm. This approach enhances training outcomes by constraining the perturbation magnitude between the model’s output distributions before and after training. In practice, we train the Qwen2.5-32B-Instruct model to obtain AnalogSeeker, which achieves 85.04% accuracy on AMSBench-TQA, the analog circuit knowledge evaluation benchmark, a 15.67 percentage-point improvement over the original model, and is competitive with mainstream commercial models. Furthermore, AnalogSeeker also shows effectiveness in the downstream operational amplifier design task. AnalogSeeker is open-sourced at this https URL for research use.
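
The idea of constraining how far the output distribution moves during fine-tuning can be sketched with a generic KL-regularized loss. The abstract does not give the exact form of the neighborhood self-constrained algorithm, so the KL direction, the weight `lam`, and the function names below are assumptions rather than the authors' method.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) for discrete distributions; q is assumed strictly positive.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def constrained_sft_loss(new_logits, ref_logits, target_index, lam=0.5):
    # Cross-entropy on the labeled token plus a penalty that keeps the tuned
    # model's distribution near the reference (pre-fine-tuning) distribution.
    p_new = softmax(new_logits)
    p_ref = softmax(ref_logits)
    cross_entropy = -math.log(p_new[target_index])
    return cross_entropy + lam * kl_divergence(p_ref, p_new)

ref = [1.0, 2.0, 3.0]        # hypothetical next-token logits before tuning
drifted = [3.0, 2.0, 1.0]    # hypothetical logits after an aggressive update
```

With identical logits the penalty vanishes and the loss reduces to plain cross-entropy; the further the tuned distribution drifts from the reference, the more the penalty contributes.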

[AI-34] LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval

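Quick read: This paper addresses two key problems in knowledge-graph-based Retrieval-Augmented Generation (KG-RAG). First, high-level conceptual summaries form isolated "semantic islands" that lack the explicit relations needed for cross-community reasoning. Second, retrieval is structurally unaware and often degenerates into an inefficient flat search that ignores the graph's rich topology. The key to the solution is the LeanRAG framework, which tightly couples knowledge aggregation with retrieval. It first applies a novel semantic aggregation algorithm that clusters entities and builds explicit relations among aggregation-level summaries, forming a navigable semantic network. It then uses a bottom-up, structure-guided retrieval strategy that anchors the query to the most relevant fine-grained entities and traverses semantic pathways to gather a concise yet contextually complete evidence set, which lowers path-retrieval overhead and reduces redundant retrieval.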

Link: https://arxiv.org/abs/2508.10391
Authors: Yaoze Zhang,Rong Wu,Pinlong Cai,Xiaoman Wang,Guohang Yan,Song Mao,Ding Wang,Botian Shi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) plays a crucial role in grounding Large Language Models by leveraging external knowledge, but its effectiveness is often compromised by the retrieval of contextually flawed or incomplete information. To address this, knowledge graph-based RAG methods have evolved towards hierarchical structures, organizing knowledge into multi-level summaries. However, these approaches still suffer from two critical, unaddressed challenges: high-level conceptual summaries exist as disconnected "semantic islands", lacking the explicit relations needed for cross-community reasoning; and the retrieval process itself remains structurally unaware, often degenerating into an inefficient flat search that fails to exploit the graph’s rich topology. To overcome these limitations, we introduce LeanRAG, a framework that features a deeply collaborative design combining knowledge aggregation and retrieval strategies. LeanRAG first employs a novel semantic aggregation algorithm that forms entity clusters and constructs new explicit relations among aggregation-level summaries, creating a fully navigable semantic network. Then, a bottom-up, structure-guided retrieval strategy anchors queries to the most relevant fine-grained entities and systematically traverses the graph’s semantic pathways to gather concise yet contextually comprehensive evidence sets. LeanRAG mitigates the substantial overhead associated with path retrieval on graphs and minimizes redundant information retrieval. Extensive experiments on four challenging QA benchmarks from different domains demonstrate that LeanRAG significantly outperforms existing methods in response quality while reducing retrieval redundancy by 46%. Code is available at: this https URL
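
The bottom-up retrieval idea can be sketched on a toy hierarchy: start from the anchored fine-grained entities, walk upward through their summaries, and stop at the lowest summary they share. The hierarchy, node names, and stopping rule below are illustrative assumptions and not LeanRAG's actual algorithm.

```python
def path_to_root(node, parent):
    # Follow parent pointers from a fine-grained entity up through its summaries.
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def bottom_up_evidence(anchors, parent):
    # Collect each anchor's ancestors only up to the lowest summary shared by
    # all anchors, so unrelated branches of the graph are never visited.
    paths = [path_to_root(a, parent) for a in anchors]
    shared = set(paths[0]).intersection(*map(set, paths[1:]))
    evidence = []
    for path in paths:
        for node in path:
            if node not in evidence:
                evidence.append(node)
            if node in shared:
                break  # reached the lowest common ancestor on this path
    return evidence

# Hypothetical hierarchy: entity -> cluster summary -> top-level summary.
parent = {
    "aspirin": "nsaids", "ibuprofen": "nsaids",
    "nsaids": "pain_relief", "morphine": "opioids", "opioids": "pain_relief",
}
near = bottom_up_evidence(["aspirin", "ibuprofen"], parent)
far = bottom_up_evidence(["aspirin", "morphine"], parent)
```

Anchors in the same cluster yield a small evidence set that stops at the cluster summary, while anchors from different clusters climb one level higher. A flat search over all nodes would return the whole graph in both cases.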

[AI-35] eMamba: Efficient Acceleration Framework for Mamba Models in Edge Computing

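Quick read: This paper addresses the difficulty of deploying Mamba, a State Space Model (SSM)-based architecture, on resource-constrained edge devices, for which no dedicated hardware acceleration framework currently exists. The key to the solution is eMamba, an end-to-end hardware acceleration framework. First, it replaces complex normalization layers with lightweight hardware-aware alternatives, approximates expensive operations such as SiLU activation and exponentiation, and uses an approximation-aware neural architecture search (NAS) to tune the learnable parameters involved in the approximation. Second, it quantizes and deploys the full pipeline on FPGA and ASIC, reducing latency, power, and area while keeping accuracy comparable to state-of-the-art methods.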

Link: https://arxiv.org/abs/2508.10370
Authors: Jiyong Kim,Jaeho Lee,Jiahao Lin,Alish Kanani,Miao Sun,Umit Y. Ogras,Jaehyun Park
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Paper accepted at ESWEEK 2025 (CODES+ISSS) conference

Abstract:State Space Model (SSM)-based machine learning architectures have recently gained significant attention for processing sequential data. Mamba, a recent sequence-to-sequence SSM, offers competitive accuracy with superior computational efficiency compared to state-of-the-art transformer models. While this advantage makes Mamba particularly promising for resource-constrained edge devices, no hardware acceleration frameworks are currently optimized for deploying it in such environments. This paper presents eMamba, a comprehensive end-to-end hardware acceleration framework explicitly designed for deploying Mamba models on edge platforms. eMamba maximizes computational efficiency by replacing complex normalization layers with lightweight hardware-aware alternatives and approximating expensive operations, such as SiLU activation and exponentiation, considering the target applications. Then, it performs an approximation-aware neural architecture search (NAS) to tune the learnable parameters used during approximation. Evaluations with Fashion-MNIST, CIFAR-10, and MARS, an open-source human pose estimation dataset, show eMamba achieves comparable accuracy to state-of-the-art techniques using 1.63-19.9× fewer parameters. In addition, it generalizes well to large-scale natural language tasks, demonstrating stable perplexity across varying sequence lengths on the WikiText2 dataset. We also quantize and implement the entire eMamba pipeline on an AMD ZCU102 FPGA and ASIC using GlobalFoundries (GF) 22 nm technology. Experimental results show 4.95-5.62× lower latency and 2.22-9.95× higher throughput, with 4.77× smaller area, 9.84× lower power, and 48.6× lower energy consumption than baseline solutions while maintaining competitive accuracy.
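
To make "approximating expensive operations such as SiLU" concrete, the sketch below compares exact SiLU with a well-known piecewise-linear stand-in (the "hard swish" form). The abstract does not specify which approximation eMamba uses, so this particular choice is an assumption for illustration only.

```python
import math

def silu(x):
    # Exact SiLU: x * sigmoid(x), which requires an exponential.
    return x / (1.0 + math.exp(-x))

def hard_silu(x):
    # Piecewise-linear stand-in: x * clip(x + 3, 0, 6) / 6.
    # It needs only an add, a clamp, a multiply, and a constant scale.
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0

# Largest absolute gap between the two on a grid over [-8, 8].
max_error = max(abs(silu(i / 100.0) - hard_silu(i / 100.0)) for i in range(-800, 801))
```

On this grid the gap stays below 0.15. Whether that error is acceptable depends on the target application, which is presumably why the paper pairs its approximations with an approximation-aware architecture search.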

[AI-36] What to Ask Next? Probing the Imaginative Reasoning of LLMs with TurtleSoup Puzzles

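Quick read: This paper addresses the lack of evaluation of imaginative reasoning in Large Language Models (LLMs) in information-sparse environments, that is, how a model proactively constructs, tests, and revises hypotheses in dynamic, exploratory tasks. Existing benchmarks are typically static or focused on social deduction and fail to capture this process. The key to the solution is a comprehensive research framework with three components: TurtleSoup-Bench, the first large-scale, bilingual, interactive benchmark for imaginative reasoning, containing 800 "Turtle Soup" puzzles sourced from the Internet and expert authors; Mosaic-Agent, a novel agent for assessing LLMs in this setting; and a multi-dimensional evaluation protocol that measures logical consistency, detail completion, and conclusion alignment. The framework exposes the limits of current LLMs and their gap to human performance, laying a foundation for future research on exploratory agent behavior.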

Link: https://arxiv.org/abs/2508.10358
Authors: Mengtao Zhou,Sifan Wu,Huan Zhang,Qi Sima,Bang Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We investigate the capacity of Large Language Models (LLMs) for imaginative reasoning: the proactive construction, testing, and revision of hypotheses in information-sparse environments. Existing benchmarks, often static or focused on social deduction, fail to capture the dynamic, exploratory nature of this reasoning process. To address this gap, we introduce a comprehensive research framework based on the classic “Turtle Soup” game, integrating a benchmark, an agent, and an evaluation protocol. We present TurtleSoup-Bench, the first large-scale, bilingual, interactive benchmark for imaginative reasoning, comprising 800 turtle soup puzzles sourced from both the Internet and expert authors. We also propose Mosaic-Agent, a novel agent designed to assess LLMs’ performance in this setting. To evaluate reasoning quality, we develop a multi-dimensional protocol measuring logical consistency, detail completion, and conclusion alignment. Experiments with leading LLMs reveal clear capability limits, common failure patterns, and a significant performance gap compared to humans. Our work offers new insights into LLMs’ imaginative reasoning and establishes a foundation for future research on exploratory agent behavior.

[AI-37] Welfare-Centric Clustering

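Quick read: This paper addresses the problem that conventional fair clustering methods can produce undesirable or unintuitive clusterings in practice, which typically stems from fairness definitions that consider only balanced group representation or group-specific clustering costs. In response, the authors propose a welfare-centric clustering framework that models each group's utility along two dimensions, distance and proportional representation, and formalizes two optimization objectives: the Rawlsian (Egalitarian) objective and the Utilitarian objective. The key to the solution is a set of novel algorithms for both objectives with theoretical guarantees; experiments on multiple real-world datasets show significant gains over existing fair clustering baselines.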

Link: https://arxiv.org/abs/2508.10345
Authors: Claire Jie Zhang,Seyed A. Esmaeili,Jamie Morgenstern
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Data Structures and Algorithms (cs.DS)
Comments:

Abstract:Fair clustering has traditionally focused on ensuring equitable group representation or equalizing group-specific clustering costs. However, Dickerson et al. (2025) recently showed that these fairness notions may yield undesirable or unintuitive clustering outcomes and advocated for a welfare-centric clustering approach that models the utilities of the groups. In this work, we model group utilities based on both distances and proportional representation and formalize two optimization objectives based on welfare-centric clustering: the Rawlsian (Egalitarian) objective and the Utilitarian objective. We introduce novel algorithms for both objectives and prove theoretical guarantees for them. Empirical evaluations on multiple real-world datasets demonstrate that our methods significantly outperform existing fair clustering baselines.
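
The difference between the two objectives can be shown with a few lines of arithmetic. The per-group utilities, group sizes, and the size-weighted form of the utilitarian sum below are hypothetical; the paper defines utilities from distances and proportional representation, which this sketch does not attempt to reproduce.

```python
def rawlsian_welfare(utilities):
    # Egalitarian objective: a clustering is only as good as its worst-off group.
    return min(utilities.values())

def utilitarian_welfare(utilities, group_sizes):
    # Utilitarian objective: total utility, here weighted by group size (an assumption).
    return sum(utilities[g] * group_sizes[g] for g in utilities)

# Hypothetical per-group utilities under two candidate clusterings.
sizes = {"A": 80, "B": 20}
clustering_1 = {"A": 0.9, "B": 0.3}
clustering_2 = {"A": 0.7, "B": 0.6}

prefers_2_rawlsian = rawlsian_welfare(clustering_2) > rawlsian_welfare(clustering_1)
prefers_1_utilitarian = (
    utilitarian_welfare(clustering_1, sizes) > utilitarian_welfare(clustering_2, sizes)
)
```

In this example the two objectives disagree: the Rawlsian objective prefers the second clustering because the smaller group fares better, while the utilitarian objective prefers the first because the larger group's gain dominates the total.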

[AI-38] Multi-Agent Trust Region Policy Optimisation: A Joint Constraint Approach

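Quick read: This paper addresses training instability and local optima in Multi-Agent Reinforcement Learning (MARL) caused by poorly coordinated policy updates among heterogeneous agents. Existing methods such as Heterogeneous-Agent Trust Region Policy Optimization (HATRPO) constrain every agent with the same KL divergence threshold, which fits different agents' learning needs poorly and limits overall performance. The key to the solution is two methods for allocating KL thresholds across agents: HATRPO-W, a KKT-based method that optimizes threshold assignment under a global KL constraint, and HATRPO-G, a greedy algorithm that prioritizes agents by their improvement-to-divergence ratio. Both connect sequential policy optimization with constrained threshold scheduling and improve learning efficiency and stability in heterogeneous settings. Experiments show that each improves final performance by more than 22.5%, and HATRPO-W exhibits lower variance and more stable training dynamics.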

Link: https://arxiv.org/abs/2508.10340
Authors: Chak Lam Shek,Guangyao Shi,Pratap Tokekar
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-agent reinforcement learning (MARL) requires coordinated and stable policy updates among interacting agents. Heterogeneous-Agent Trust Region Policy Optimization (HATRPO) enforces per-agent trust region constraints using Kullback-Leibler (KL) divergence to stabilize training. However, assigning each agent the same KL threshold can lead to slow and locally optimal updates, especially in heterogeneous settings. To address this limitation, we propose two approaches for allocating the KL divergence threshold across agents: HATRPO-W, a Karush-Kuhn-Tucker-based (KKT-based) method that optimizes threshold assignment under global KL constraints, and HATRPO-G, a greedy algorithm that prioritizes agents based on improvement-to-divergence ratio. By connecting sequential policy optimization with constrained threshold scheduling, our approach enables more flexible and effective learning in heterogeneous-agent settings. Experimental results demonstrate that our methods significantly boost the performance of HATRPO, achieving faster convergence and higher final rewards across diverse MARL benchmarks. Specifically, HATRPO-W and HATRPO-G achieve comparable improvements in final performance, each exceeding 22.5%. Notably, HATRPO-W also demonstrates more stable learning dynamics, as reflected by its lower variance.
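
A greedy allocation by improvement-to-divergence ratio can be sketched as follows. The abstract names the prioritization rule but not the allocation details, so the per-agent cap, the budget handling, and all numbers are assumptions and should not be read as HATRPO-G itself.

```python
def greedy_kl_allocation(agents, total_budget, max_per_agent):
    # agents: {name: (expected_improvement, kl_of_proposed_update)}, with kl > 0.
    # Rank by improvement gained per unit of KL spent, then fund in that order.
    ranked = sorted(agents, key=lambda a: agents[a][0] / agents[a][1], reverse=True)
    allocation = {name: 0.0 for name in agents}
    remaining = total_budget
    for name in ranked:
        share = min(max_per_agent, remaining)
        allocation[name] = share
        remaining -= share
    return allocation

# Hypothetical agents with (improvement, divergence) estimates.
agents = {"arm": (0.2, 0.01), "leg": (0.9, 0.02), "torso": (0.1, 0.02)}
allocation = greedy_kl_allocation(agents, total_budget=0.03, max_per_agent=0.02)
```

Here the "leg" agent has the best ratio and receives the full cap, "arm" gets what is left of the global budget, and "torso" receives nothing this round. A uniform threshold would give each agent the same share.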

[AI-39] A Curriculum Learning Approach to Reinforcement Learning: Leveraging RAG for Multimodal Question Answering

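Quick read: This paper addresses how to build an efficient and accurate Retrieval-Augmented Generation (RAG) system for multimodal, multi-turn question answering, covering question answering over structured data from an image-based knowledge graph, integration of knowledge graph and web search results, and context understanding with cross-source aggregation in multi-turn conversations. For Task 1, the solution uses a Vision Large Language Model (VLLM) with supervised fine-tuning on knowledge distilled from GPT-4.1, then applies curriculum learning to guide reinforcement learning, which improves answer accuracy and reduces hallucination. For Tasks 2 and 3, it additionally integrates web search APIs to bring in external knowledge and better handle complex queries and multi-turn interaction. The approach took 1st place in Task 1 (with a 52.38% lead) and 3rd place in Task 3, supporting the effectiveness of combining curriculum learning with reinforcement learning.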

Link: https://arxiv.org/abs/2508.10337
Authors: Chenliang Zhang,Lin Wang,Yuanyuan Lu,Yusheng Qi,Kexin Wang,Peixu Hou,Wenshi Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This paper describes the solutions of the Dianping-Trust-Safety team for the META CRAG-MM challenge. The challenge requires building a comprehensive retrieval-augmented generation system capable of multi-modal multi-turn question answering. The competition consists of three tasks: (1) answering questions using structured data retrieved from an image-based mock knowledge graph, (2) synthesizing information from both knowledge graphs and web search results, and (3) handling multi-turn conversations that require context understanding and information aggregation from multiple sources. For Task 1, our solution is based on a vision large language model, enhanced by supervised fine-tuning with knowledge distilled from GPT-4.1. We further applied curriculum learning strategies to guide reinforcement learning, resulting in improved answer accuracy and reduced hallucination. For Task 2 and Task 3, we additionally leveraged web search APIs to incorporate external knowledge, enabling the system to better handle complex queries and multi-turn conversations. Our approach achieved 1st place in Task 1 with a significant lead of 52.38%, and 3rd place in Task 3, demonstrating the effectiveness of the integration of curriculum learning with reinforcement learning in our training pipeline.

[AI-40] A Vision-Language Pre-training Model-Guided Approach for Mitigating Backdoor Attacks in Federated Learning

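Quick read: This paper addresses backdoor defense in Federated Learning (FL) under non-independent and identically distributed (Non-IID) client data, in particular how to defend effectively and preserve model performance without a clean server dataset. The key to the solution is CLIP-Fed, a defense framework that leverages the zero-shot learning capability of vision-language pre-training models and combines pre-aggregation and post-aggregation defenses to mitigate the limits that Non-IID data imposes on defense effectiveness. It constructs and augments the server dataset using a multimodal large language model and frequency analysis, without using any client samples, to protect privacy. It then aligns the knowledge of the global model with CLIP using a prototype contrastive loss and Kullback-Leibler divergence, which removes the correlation between trigger patterns and target labels, lowering the attack success rate (ASR) and raising average accuracy (MA).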

Link: https://arxiv.org/abs/2508.10315
Authors: Keke Gai,Dongjue Wang,Jing Yu,Liehuang Zhu,Qi Wu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Existing backdoor defense methods in Federated Learning (FL) rely on the assumption of homogeneous client data distributions or the availability of a clean server dataset, which limits their practicality and effectiveness. Defending against backdoor attacks under heterogeneous client data distributions while preserving model performance remains a significant challenge. In this paper, we propose an FL backdoor defense framework named CLIP-Fed, which leverages the zero-shot learning capabilities of vision-language pre-training models. By integrating both pre-aggregation and post-aggregation defense strategies, CLIP-Fed overcomes the limitations of Non-IID imposed on defense effectiveness. To address privacy concerns and enhance the coverage of the dataset against diverse triggers, we construct and augment the server dataset using the multimodal large language model and frequency analysis without any client samples. To address class prototype deviations caused by backdoor samples and eliminate the correlation between trigger patterns and target labels, CLIP-Fed aligns the knowledge of the global model and CLIP on the augmented dataset using prototype contrastive loss and Kullback-Leibler divergence. Extensive experiments on representative datasets validate the effectiveness of CLIP-Fed. Compared to state-of-the-art methods, CLIP-Fed achieves an average reduction in ASR of 2.03% on CIFAR-10 and 1.35% on CIFAR-10-LT, while improving average MA by 7.92% and 0.48%, respectively.

[AI-41] Promoting Efficient Reasoning with Verifiable Stepwise Reward

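Quick read: This paper addresses "overthinking" in Large Reasoning Models (LRMs), where a model spends excessive computation on simple problems and loses efficiency. Existing efficient-reasoning methods usually depend on accurate task assessment to preset token budgets or select reasoning modes, which limits their flexibility and reliability. The paper proposes a rule-based Verifiable Stepwise Reward Mechanism (VSRM) that assigns rewards according to the performance of each intermediate state in the reasoning trajectory, encouraging effective steps and penalizing ineffective ones. The mechanism is intuitive and fits the step-by-step nature of reasoning tasks; experiments show that it substantially shortens outputs while maintaining the original reasoning performance and suppresses ineffective steps.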

Link: https://arxiv.org/abs/2508.10293
Authors: Chuhuai Yue,Chengqi Dong,Yinan Gao,Hang He,Jiajun Chai,Guojun Yin,Wei Lin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large reasoning models (LRMs) have recently achieved significant progress in complex reasoning tasks, aided by reinforcement learning with verifiable rewards. However, LRMs often suffer from overthinking, expending excessive computation on simple problems and reducing efficiency. Existing efficient reasoning methods typically require accurate task assessment to preset token budgets or select reasoning modes, which limits their flexibility and reliability. In this work, we revisit the essence of overthinking and identify that encouraging effective steps while penalizing ineffective ones is key to its solution. To this end, we propose a novel rule-based verifiable stepwise reward mechanism (VSRM), which assigns rewards based on the performance of intermediate states in the reasoning trajectory. This approach is intuitive and naturally fits the step-by-step nature of reasoning tasks. We conduct extensive experiments on standard mathematical reasoning benchmarks, including AIME24 and AIME25, by integrating VSRM with PPO and Reinforce++. Results show that our method achieves substantial output length reduction while maintaining original reasoning performance, striking an optimal balance between efficiency and accuracy. Further analysis of overthinking frequency and pass@k score before and after training demonstrates that our approach indeed effectively suppresses ineffective steps and encourages effective reasoning, fundamentally alleviating the overthinking problem. All code will be released upon acceptance.
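
One simple way to reward "the performance of intermediate states" is to credit each step with the change in estimated answer accuracy it produces. The abstract does not state VSRM's actual rule, so the difference-based reward and the accuracy estimates below are assumptions made for illustration.

```python
def stepwise_rewards(step_accuracies, baseline=0.0):
    # step_accuracies[i]: estimated chance of a correct final answer if the
    # model stopped reasoning after step i (the estimator is assumed given).
    rewards = []
    previous = baseline
    for accuracy in step_accuracies:
        rewards.append(accuracy - previous)  # >0 helpful, 0 idle, <0 harmful
        previous = accuracy
    return rewards

# Hypothetical trajectory: two useful steps, one idle step, one harmful step.
rewards = stepwise_rewards([0.25, 0.75, 0.75, 0.5])
```

Under this rule an idle step earns nothing and a step that lowers the accuracy estimate is penalized, so a policy optimized against it has no incentive to keep generating once further steps stop helping.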

[AI-42] Why Cannot Large Language Models Ever Make True Correct Reasoning?

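Quick read: This paper asks whether the "understanding ability" and "reasoning ability" currently attributed to Large Language Models (LLMs) are real, that is, whether these abilities have any substantive basis. It argues that the attribution stems from vague concepts rather than from LLMs having genuine understanding or reasoning ability. The key to the argument is that the working principle of LLMs has essential limitations: their outputs depend on statistical pattern matching rather than symbolic logic or causal reasoning mechanisms, so they cannot achieve true correct reasoning.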

Link: https://arxiv.org/abs/2508.10265
Authors: Jingde Cheng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 8 pages. arXiv admin note: substantial text overlap with arXiv:2412.12408

Abstract:Recently, with the application progress of AIGC tools based on large language models (LLMs), led by ChatGPT, many AI experts and more non-professionals are trumpeting the “understanding ability” and “reasoning ability” of the LLMs. The present author considers that the so-called “understanding ability” and “reasoning ability” of LLMs are just illusions of people with vague concepts. In fact, the LLMs can never have the true understanding ability and true reasoning ability. This paper intends to explain that, because of the essential limitations of their working principle, the LLMs can never have the ability of true correct reasoning.

[AI-43] Facilitating Longitudinal Interaction Studies of AI Systems

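Quick read: This paper addresses the fact that user interactions with Artificial Intelligence (AI) evolve over time through learning, adaptation, and repurposing, so one-time evaluations cannot capture these long-term changes, which limits how UIST (User Interface Software and Technology) tools are designed, built, and evaluated. The key to the solution is promoting longitudinal studies: developing practical deployment strategies, evaluation designs, and data collection mechanisms to help researchers overcome the technical and methodological challenges of long-term research. The workshop combines a keynote, panel discussions, and hands-on sessions to build a community around longitudinal research and establish it as a more widely accepted method for developing and evaluating UIST tools.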

Link: https://arxiv.org/abs/2508.10252
Authors: Tao Long,Sitong Wang,Émilie Fabre,Tony Wang,Anup Sathya,Jason Wu,Savvas Petridis,Dingzeyu Li,Tuhin Chakrabarty,Yue Jiang,Jingyi Li,Tiffany Tseng,Ken Nakagaki,Qian Yang,Nikolas Martelaro,Jeffrey V. Nickerson,Lydia B. Chilton
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted workshop proposal @ UIST 2025 Busan, Korea. Workshop website: this https URL

Abstract:UIST researchers develop tools to address user challenges. However, user interactions with AI evolve over time through learning, adaptation, and repurposing, making one-time evaluations insufficient. Capturing these dynamics requires longer-term studies, but challenges in deployment, evaluation design, and data collection have made such longitudinal research difficult to implement. Our workshop aims to tackle these challenges and prepare researchers with practical strategies for longitudinal studies. The workshop includes a keynote, panel discussions, and interactive breakout groups for discussion and hands-on protocol design and tool prototyping sessions. We seek to foster a community around longitudinal system research and promote it as a more embraced method for designing, building, and evaluating UIST tools.

[AI-44] Extending the Entropic Potential of Events for Uncertainty Quantification and Decision-Making in Artificial Intelligence

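Quick read: This paper addresses the weak integration of uncertainty quantification, decision optimization, and interpretability in Artificial Intelligence (AI) systems. The core challenge is modeling how discrete events change a system's future entropy, so that agents become more adaptive and transparent in complex environments. The key to the solution is the concept of entropic potential, which originates in physics and is reformulated with conditional expectations for AI, to measure how discrete events such as actions and observations perturb future uncertainty. By defining entropic potential as the expected change in future entropy caused by an event and combining it with counterfactual analysis, the framework characterizes how uncertainty propagates and applies uniformly to policy evaluation, intrinsic reward design, explainable AI, and anomaly detection, with the aim of improving the robustness and interpretability of AI systems.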

Link: https://arxiv.org/abs/2508.10241
Authors: Mark Zilberman
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:This work demonstrates how the concept of the entropic potential of events – a parameter quantifying the influence of discrete events on the expected future entropy of a system – can enhance uncertainty quantification, decision-making, and interpretability in artificial intelligence (AI). Building on its original formulation in physics, the framework is adapted for AI by introducing an event-centric measure that captures how actions, observations, or other discrete occurrences impact uncertainty at future time horizons. Both the original and AI-adjusted definitions of entropic potential are formalized, with the latter emphasizing conditional expectations to account for counterfactual scenarios. Applications are explored in policy evaluation, intrinsic reward design, explainable AI, and anomaly detection, highlighting the metric’s potential to unify and strengthen uncertainty modeling in intelligent systems. Conceptual examples illustrate its use in reinforcement learning, Bayesian inference, and anomaly detection, while practical considerations for computation in complex AI models are discussed. The entropic potential framework offers a theoretically grounded, interpretable, and versatile approach to managing uncertainty in AI, bridging principles from thermodynamics, information theory, and machine learning.
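
As a rough illustration of an event-centric entropy measure, the sketch below compares the Shannon entropy of a future-state distribution with and without an event. This single-distribution simplification, the sign convention, and the example beliefs are assumptions; the paper's formal definitions (including the conditional-expectation form) are not reproduced here.

```python
import math

def shannon_entropy(probs):
    # Entropy in bits of a discrete distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def entropic_potential(future_given_event, future_without_event):
    # Change in future entropy attributable to the event, measured against
    # the counterfactual in which the event does not occur.
    return shannon_entropy(future_given_event) - shannon_entropy(future_without_event)

# Hypothetical belief over four future states, with and without an observation.
without_observation = [0.25, 0.25, 0.25, 0.25]
with_observation = [0.5, 0.5, 0.0, 0.0]
potential = entropic_potential(with_observation, without_observation)
```

Under this convention a negative value means the event reduces future uncertainty (here by one bit), which is the kind of signal that could serve as an intrinsic reward for information-gathering actions.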

[AI-45] No Free Lunch from Audio Pretraining in Bioacoustics: A Benchmark Study of Embeddings

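Quick read: This paper addresses the unstable performance of acoustic features extracted from pretrained deep learning (DL) models in bioacoustics, in particular the limits of models used without fine-tuning and their weak separation of background noise from target sounds. The study systematically compares the embeddings of 11 DL models on the same tasks and finds that: 1) audio-pretrained models without fine-tuning underperform a fine-tuned AlexNet, even after dimensionality reduction; 2) with or without fine-tuning, these models fail to separate background noise from labeled sounds, whereas ResNet does; and 3) audio-pretrained models outperform other models when fewer background sounds are included during fine-tuning. The study therefore stresses the importance of task-specific fine-tuning and of checking embedding quality after fine-tuning.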

Link: https://arxiv.org/abs/2508.10230
Authors: Chenggang Chen,Zhiyu Yang
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Abstract:Bioacoustics, the study of animal sounds, offers a non-invasive method to monitor ecosystems. Extracting embeddings from audio-pretrained deep learning (DL) models without fine-tuning has become popular for obtaining bioacoustic features for tasks. However, a recent benchmark study reveals that while fine-tuned audio-pretrained VGG and transformer models achieve state-of-the-art performance in some tasks, they fail in others. This study benchmarks 11 DL models on the same tasks by reducing their learned embeddings’ dimensionality and evaluating them through clustering. We found that audio-pretrained DL models 1) without fine-tuning even underperform fine-tuned AlexNet, 2) both with and without fine-tuning fail to separate the background from labeled sounds, but ResNet does, and 3) outperform other models when fewer background sounds are included during fine-tuning. This study underscores the necessity of fine-tuning audio-pretrained models and checking the embeddings after fine-tuning. Our code is available at: this https URL_Embeddings

[AI-46] An Explainable AI based approach for Monitoring Animal Health

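Quick read: This paper addresses dairy cattle health monitoring and yield optimization, where the core difficulty is continuously tracking and analyzing the behavior of every cow on a farm. The key to the solution is a data-driven approach based on explainable machine learning. It collects activity data continuously from 3-axis accelerometer sensors and uses Bluetooth-based Internet of Things (IoT) devices and 4G networks for data transmission and immediate inference. Time-series features are extracted with a sliding window and combined with signal processing and statistical measures during pre-processing. A hyperparameter-optimized k-nearest neighbour classifier then classifies activity (mean AUC 0.98 on the training set and 0.99 on the test set), and explainability frameworks such as SHAP interpret feature importance, giving farmers transparent, actionable decision support for sustainable livestock management.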

Link: https://arxiv.org/abs/2508.10210
Authors: Rahul Janaa,Shubham Dixit,Mrityunjay Sharma,Ritesh Kumar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Monitoring cattle health and optimizing yield are key challenges faced by dairy farmers due to difficulties in tracking all animals on the farm. This work aims to showcase modern data-driven farming practices based on explainable machine learning (ML) methods that explain the activity and behaviour of dairy cattle (cows). Continuous data collection from 3-axis accelerometer sensors and the use of robust ML methodologies and algorithms provide farmers and researchers with actionable information on cattle activity, allowing farmers to make informed decisions and incorporate sustainable practices. This study utilizes Bluetooth-based Internet of Things (IoT) devices and 4G networks for seamless data transmission, immediate analysis, and inference generation, and explains the models’ performance with explainability frameworks. Special emphasis is put on the pre-processing of the accelerometer time series data, including the extraction of statistical characteristics, signal processing techniques, and lag-based features using the sliding window technique. Various hyperparameter-optimized ML models are evaluated across varying window lengths for activity classification. The k-nearest neighbour classifier achieved the best performance, with a mean AUC of 0.98 (standard deviation 0.0026) on the training set and 0.99 on the testing set. To ensure transparency, Explainable AI frameworks such as SHAP are used to interpret feature importance in a way that can be understood and used by practitioners. A detailed comparison of the important features, along with the stability analysis of selected features, supports development of explainable and practical ML models for sustainable livestock management.
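
The sliding-window feature extraction mentioned in the abstract can be sketched for a single accelerometer axis as follows. The feature set, window length, step size, and readings are hypothetical and are not the paper's configuration.

```python
import statistics

def window_features(signal, window, step):
    # Slide a fixed-length window (at least 2 samples) over one axis and
    # summarize each window with a few statistical and lag-based features.
    rows = []
    for start in range(0, len(signal) - window + 1, step):
        chunk = signal[start:start + window]
        rows.append({
            "mean": statistics.fmean(chunk),
            "std": statistics.pstdev(chunk),
            "range": max(chunk) - min(chunk),
            "lag1_diff": chunk[-1] - chunk[-2],  # simple lag-based feature
        })
    return rows

x_axis = [0.0, 0.0, 1.0, 1.0, 0.0, 2.0, 2.0, 0.0]  # hypothetical readings
features = window_features(x_axis, window=4, step=2)
```

Each row becomes one training example for a classifier such as k-nearest neighbours; the window length controls the trade-off between temporal resolution and feature stability, which is presumably why the authors evaluate several lengths.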

[AI-47] KompeteAI: Accelerated Autonomous Multi-Agent System for End-to-End Pipeline Generation for Machine Learning Problems

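Quick read: This paper targets two core problems in current AutoML systems based on Large Language Models (LLMs). First, exploration is constrained: one-shot methods lack diversity, and Monte Carlo Tree Search (MCTS) methods fail to recombine strong partial solutions. Second, execution is a severe bottleneck, because long code-validation cycles hold back iterative refinement. The key to the solution is KompeteAI, a new AutoML framework that adds a merging stage to compose top candidates, integrates Retrieval-Augmented Generation (RAG) to draw real-world strategies from Kaggle notebooks and arXiv papers and so widen the hypothesis space, and uses a predictive scoring model with accelerated debugging to judge a solution's potential from early-stage metrics instead of costly full-code execution, making pipeline evaluation 6.9 times faster. With these changes, KompeteAI outperforms leading methods by 3% on average on MLE-Bench and reaches state-of-the-art results on the newly proposed Kompete-bench.

Link: https://arxiv.org/abs/2508.10177
Authors: Stepan Kulibaba, Artem Dzhalilov, Roman Pakhomov, Oleg Svidchenko, Alexander Gasnikov, Aleksei Shpilman
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: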

Abstract: Recent Large Language Model (LLM)-based AutoML systems demonstrate impressive capabilities but face significant limitations such as constrained exploration strategies and a severe execution bottleneck. Exploration is hindered by one-shot methods lacking diversity and Monte Carlo Tree Search (MCTS) approaches that fail to recombine strong partial solutions. The execution bottleneck arises from lengthy code validation cycles that stifle iterative refinement. To overcome these challenges, we introduce KompeteAI, a novel AutoML framework with dynamic solution space exploration. Unlike previous MCTS methods that treat ideas in isolation, KompeteAI introduces a merging stage that composes top candidates. We further expand the hypothesis space by integrating Retrieval-Augmented Generation (RAG), sourcing ideas from Kaggle notebooks and arXiv papers to incorporate real-world strategies. KompeteAI also addresses the execution bottleneck via a predictive scoring model and an accelerated debugging method, assessing solution potential using early stage metrics to avoid costly full-code execution. This approach accelerates pipeline evaluation 6.9 times. KompeteAI outperforms leading methods (e.g., RD-agent, AIDE, and Ml-Master) by an average of 3% on the primary AutoML benchmark, MLE-Bench. Additionally, we propose Kompete-bench to address limitations in MLE-Bench, where KompeteAI also achieves state-of-the-art results.

[AI-48] Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization

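Quick read: This paper addresses the overly long generations that Large Reasoning Models (LRMs) produce when they solve complex tasks with long Chain-of-Thought (CoT) reasoning. Long outputs raise computational cost and can lead to overthinking, which makes it hard to balance reasoning effectiveness against efficiency. The key to the solution is Length Controlled Preference Optimization (LCPO). The method analyzes the convergence behavior of different preference-optimization objectives under a Bradley-Terry loss framework and learns a length preference by directly balancing the implicit reward related to the negative log-likelihood (NLL) loss. With limited data and training, LCPO shortens outputs by more than 50% on average while maintaining reasoning performance, pointing to a computationally efficient way of steering LRMs toward efficient reasoning.

Link: https://arxiv.org/abs/2508.10164
Authors: Bin Hong, Jiayu Liu, Zhenya Huang, Kai Zhang, Mengdi Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 19 pages, 5 figures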

Abstract:Recent advances in Large Reasoning Models (LRMs) have demonstrated strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. However, their lengthy outputs increase computational costs and may lead to overthinking, raising challenges in balancing reasoning effectiveness and efficiency. Current methods for efficient reasoning often compromise reasoning quality or require extensive resources. This paper investigates efficient methods to reduce the generation length of LRMs. We analyze generation path distributions and filter generated trajectories through difficulty estimation. Subsequently, we analyze the convergence behaviors of the objectives of various preference optimization methods under a Bradley-Terry loss based framework. Based on the analysis, we propose Length Controlled Preference Optimization (LCPO) that directly balances the implicit reward related to NLL loss. LCPO can effectively learn length preference with limited data and training. Extensive experiments demonstrate that our approach significantly reduces the average output length by over 50% across multiple benchmarks while maintaining the reasoning performance. Our work highlights the potential for computationally efficient approaches in guiding LRMs toward efficient reasoning.
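
For readers unfamiliar with the Bradley-Terry loss the abstract builds on, here is a minimal sketch of the generic pairwise form. It is not the LCPO objective: the abstract does not specify how LCPO defines or balances its implicit reward, so the scalar rewards and the function name `bradley_terry_loss` are placeholders assumed for illustration.

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen response beats the rejected
    one under the Bradley-Terry model: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In a length-preference setting one would typically treat the shorter correct response as "chosen", so the loss falls as its reward rises above that of the longer one.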

[AI-49] Improving and Evaluating Open Deep Research Agents

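Quick read: This paper addresses the scarcity of open-source systems and the lack of suitable benchmarks in the field of Deep Research Agents (DRAs), where existing evaluations rely largely on proprietary closed-source systems. To support academic research on DRAs, the authors reduce the challenging BrowseComp benchmark to a more computationally tractable subset, BrowseComp-Small (BC-Small), and use it to compare the open-source Open Deep Research (ODR) with two closed-source systems from Anthropic and Google. The key to the solution is three strategic improvements that yield the ODR+ model, which raises the success rate on the BC-Small test set to 10%, the best among the open-source and closed-source systems evaluated. Ablation studies indicate that each improvement contributes to the gain.

Link: https://arxiv.org/abs/2508.10152
Authors: Doaa Allabadi, Kyle Bradbury, Jordan M. Malof
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures, 2 tables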

Abstract: We focus here on Deep Research Agents (DRAs), which are systems that can take a natural language prompt from a user, and then autonomously search for, and utilize, internet-based content to address the prompt. Recent DRAs have demonstrated impressive capabilities on public benchmarks; however, recent research largely involves proprietary closed-source systems. At the time of this work, we only found one open-source DRA, termed Open Deep Research (ODR). In this work we adapt the challenging recent BrowseComp benchmark to compare ODR to existing proprietary systems. We propose BrowseComp-Small (BC-Small), comprising a subset of BrowseComp, as a more computationally-tractable DRA benchmark for academic labs. We benchmark ODR and two other proprietary systems on BC-Small: one system from Anthropic and one system from Google. We find that all three systems achieve 0% accuracy on the test set of 60 questions. We introduce three strategic improvements to ODR, resulting in the ODR+ model, which achieves a state-of-the-art 10% success rate on BC-Small among both closed-source and open-source systems. We report ablation studies indicating that all three of our improvements contributed to the success of ODR+.

[AI-50] Out-of-Distribution Detection using Counterfactual Distance

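Quick read: This paper addresses the safety of machine learning systems facing out-of-distribution (OOD) data, namely how to make OOD detection both accurate and explainable. The core challenge is that existing methods struggle to deliver strong detection performance and interpretable results at the same time. The key to the solution is a post-hoc OOD detection method that uses counterfactual explanations to compute the feature distance from an input to the decision boundaries. The method improves scalability by generating counterfactuals directly in embedding space, and the counterfactuals themselves serve as explanations of the detector's output. Experiments show results in line with or better than the state of the art on CIFAR-10, CIFAR-100, and ImageNet-200.

Link: https://arxiv.org/abs/2508.10148
Authors: Maria Stoica, Francesco Leofante, Alessio Lomuscio
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: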

Abstract: Accurate and explainable out-of-distribution (OOD) detection is required to use machine learning systems safely. Previous work has shown that feature distance to decision boundaries can be used to identify OOD data effectively. In this paper, we build on this intuition and propose a post-hoc OOD detection method that, given an input, calculates the distance to decision boundaries by leveraging counterfactual explanations. Since computing explanations can be expensive for large architectures, we also propose strategies to improve scalability by computing counterfactuals directly in embedding space. Crucially, as the method employs counterfactual explanations, we can seamlessly use them to help interpret the results of our detector. We show that our method is in line with the state of the art on CIFAR-10, achieving 93.50% AUROC and 25.80% FPR95. Our method outperforms these methods on CIFAR-100 with 97.05% AUROC and 13.79% FPR95 and on ImageNet-200 with 92.55% AUROC and 33.55% FPR95 across four OOD datasets.
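
To make "distance to decision boundaries in embedding space" concrete, the sketch below uses the one case with a closed form: a linear classification head, where the nearest point on the boundary between the predicted class and another class is found by orthogonal projection. This is an assumed simplification for illustration only; the abstract does not say how the paper generates its counterfactuals, and `boundary_distance` is a hypothetical name.

```python
import math

def boundary_distance(embedding, weights, biases):
    """Return (predicted class, distance to the nearest decision boundary)
    for a linear head with one weight row and one bias per class."""
    logits = [sum(w * e for w, e in zip(row, embedding)) + b
              for row, b in zip(weights, biases)]
    pred = max(range(len(logits)), key=logits.__getitem__)
    best = math.inf
    for j, row in enumerate(weights):
        if j == pred:
            continue
        diff = [a - c for a, c in zip(weights[pred], row)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm == 0.0:
            continue  # identical weight rows define no boundary direction
        # Orthogonal distance to the hyperplane where logit[pred] == logit[j].
        best = min(best, (logits[pred] - logits[j]) / norm)
    return pred, best
```

A detector built on this would threshold the returned distance; the threshold and its calibration are left open here.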

[AI-51] rETF-semiSL: Semi-Supervised Learning for Neural Collapse in Temporal Data

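Quick read: This paper addresses the lack of theoretical guidance for pre-training strategies in time-series classification, where pretext tasks are chosen heuristically and their transfer to downstream tasks is unreliable. The core challenge is designing a pre-training objective that leads the model to learn embeddings with good geometric structure and so improves downstream classification. The key to the solution is a new semi-supervised pre-training strategy that enforces latent representations satisfying the Neural Collapse phenomenon to improve embedding separability. It combines a rotational equiangular tight frame classifier with pseudo-labeling to train deep encoders from few labeled samples, and adds generative pretext tasks and a novel sequential augmentation strategy to capture temporal dynamics. The method significantly outperforms previous pretext tasks on LSTMs, transformers, and state-space models.

Link: https://arxiv.org/abs/2508.10147
Authors: Yuhan Xie, William Cappelletti, Mahsa Shoaran, Pascal Frossard
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures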

Abstract:Deep neural networks for time series must capture complex temporal patterns, to effectively represent dynamic data. Self- and semi-supervised learning methods show promising results in pre-training large models, which – when finetuned for classification – often outperform their counterparts trained from scratch. Still, the choice of pretext training tasks is often heuristic and their transferability to downstream classification is not granted, thus we propose a novel semi-supervised pre-training strategy to enforce latent representations that satisfy the Neural Collapse phenomenon observed in optimally trained neural classifiers. We use a rotational equiangular tight frame-classifier and pseudo-labeling to pre-train deep encoders with few labeled samples. Furthermore, to effectively capture temporal dynamics while enforcing embedding separability, we integrate generative pretext tasks with our method, and we define a novel sequential augmentation strategy. We show that our method significantly outperforms previous pretext tasks when applied to LSTMs, transformers, and state-space models on three multivariate time series classification datasets. These results highlight the benefit of aligning pre-training objectives with theoretically grounded embedding geometry.
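
The geometry that Neural Collapse refers to is the simplex equiangular tight frame (ETF): K unit-norm class vectors whose pairwise cosine is exactly -1/(K-1). The sketch below builds the standard simplex ETF in K dimensions without any rotation. The rotational variant used in the paper is not described in the abstract, so this is only the textbook construction, with `simplex_etf` and `dot` as names chosen here.

```python
import math

def simplex_etf(num_classes):
    """Rows are the K class prototypes sqrt(K/(K-1)) * (e_i - 1/K * ones)."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))
```

Fixing the classifier to such a frame and training only the encoder pushes class means toward maximally separated directions, which is the separability property the abstract appeals to.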

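[AI-52] Agentic AI Frameworks: Architectures, Protocols, and Design Challenges

Quick read: This paper addresses the fragmentation of current Agentic AI systems, and the absence of common evaluation criteria, across architecture design, communication mechanisms, memory management, safety guardrails, and alignment with service-oriented computing. The key to the solution is a systematic review and comparative analysis of leading Agentic AI frameworks (such as CrewAI, LangGraph, and AutoGen) that yields a taxonomy covering architectural principles, collaboration protocols, and safety mechanisms. The paper also analyzes agent communication protocols in depth, including the Contract Net Protocol (CNP), Agent-to-Agent (A2A), Agent Network Protocol (ANP), and Agora, to give a theoretical basis and practical guidance for improving scalability, robustness, and interoperability.

Link: https://arxiv.org/abs/2508.10146
Authors: Hana Derouiche, Zaki Brahmi, Haithem Mazeni
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: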

Abstract:The emergence of Large Language Models (LLMs) has ushered in a transformative paradigm in artificial intelligence, Agentic AI, where intelligent agents exhibit goal-directed autonomy, contextual reasoning, and dynamic multi-agent coordination. This paper provides a systematic review and comparative analysis of leading Agentic AI frameworks, including CrewAI, LangGraph, AutoGen, Semantic Kernel, Agno, Google ADK, and MetaGPT, evaluating their architectural principles, communication mechanisms, memory management, safety guardrails, and alignment with service-oriented computing paradigms. Furthermore, we identify key limitations, emerging trends, and open challenges in the field. To address the issue of agent communication, we conduct an in-depth analysis of protocols such as the Contract Net Protocol (CNP), Agent-to-Agent (A2A), Agent Network Protocol (ANP), and Agora. Our findings not only establish a foundational taxonomy for Agentic AI systems but also propose future research directions to enhance scalability, robustness, and interoperability. This work serves as a comprehensive reference for researchers and practitioners working to advance the next generation of autonomous AI systems.
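
Of the protocols listed, the Contract Net Protocol is the easiest to show in a few lines: a manager announces a task, contractors answer with bids or decline, and the manager awards the task to the best bid. The sketch below is an assumed, single-round simplification with cost-style bids (lower wins); the function name and the dictionary of bid callbacks are illustrative choices, not part of any framework named in the paper.

```python
def contract_net_round(task, contractors):
    """Run one Contract Net round.

    `contractors` maps a contractor name to a function that returns a
    numeric bid (a cost) for `task`, or None to decline. Returns the name
    of the contractor awarded the task, or None if nobody bid."""
    bids = {}
    for name, bid_fn in contractors.items():
        bid = bid_fn(task)  # call for proposals
        if bid is not None:
            bids[name] = bid
    if not bids:
        return None
    return min(bids, key=bids.get)  # award to the lowest-cost bid
```

For example, `contract_net_round("summarize", {"a": lambda t: 5.0, "b": lambda t: 3.0})` awards the task to `"b"`.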

[AI-53] MCP-Orchestrated Multi-Agent System for Automated Disinformation Detection

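Quick read: This paper addresses the threat to information integrity posed by the wide spread of disinformation on digital platforms. The key to the solution is a multi-agent Agentic AI system in which several specialized agents cooperate to detect disinformation in news titles and short text snippets: a machine learning agent based on logistic regression, a Wikipedia knowledge-check agent that relies on Named Entity Recognition (NER), a coherence-detection agent that uses Large Language Model (LLM) prompt engineering, and an analyzer that extracts relational triplets from web-scraped data for fact checking. The agents are orchestrated through the Model Context Protocol (MCP), which provides shared context and live learning. The ensemble reaches 95.3% accuracy and an F1 score of 0.964, significantly outperforming individual agents and traditional approaches, and its weighted aggregation, derived mathematically from each agent's misclassification rate, proves superior to algorithmic threshold optimization.

Link: https://arxiv.org/abs/2508.10143
Authors: Alexandru-Andrei Avram, Adrian Groza, Alexandru Lecu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages + 1 page references, 5 figures, 4 tables, Registered for the 27th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, 2025, Timisoara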

Abstract:The large spread of disinformation across digital platforms creates significant challenges to information integrity. This paper presents a multi-agent system that uses relation extraction to detect disinformation in news articles, focusing on titles and short text snippets. The proposed Agentic AI system combines four agents: (i) a machine learning agent (logistic regression), (ii) a Wikipedia knowledge check agent (which relies on named entity recognition), (iii) a coherence detection agent (using LLM prompt engineering), and (iv) a web-scraped data analyzer that extracts relational triplets for fact checking. The system is orchestrated via the Model Context Protocol (MCP), offering shared context and live learning across components. Results demonstrate that the multi-agent ensemble achieves 95.3% accuracy with an F1 score of 0.964, significantly outperforming individual agents and traditional approaches. The weighted aggregation method, mathematically derived from individual agent misclassification rates, proves superior to algorithmic threshold optimization. The modular architecture makes the system easily scalable, while also maintaining details of the decision processes.
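
One common way to derive ensemble weights from per-agent misclassification rates is log-odds weighting, where an agent with error rate e gets weight log((1 - e) / e). The abstract does not give the paper's formula, so the sketch below shows this standard choice only as an assumed stand-in, with hypothetical error rates and function names.

```python
import math

def log_odds_weight(error_rate):
    """Weight of one agent; requires 0 < error_rate < 1."""
    return math.log((1.0 - error_rate) / error_rate)

def weighted_vote(votes, error_rates):
    """Combine agent votes (+1 = disinformation, -1 = credible) into a
    single label using log-odds weights."""
    score = sum(log_odds_weight(e) * v for v, e in zip(votes, error_rates))
    return 1 if score > 0 else -1
```

With error rates `[0.05, 0.4, 0.4, 0.4]`, a single accurate agent voting +1 outweighs three weak agents voting -1, which is the behaviour that distinguishes weighted aggregation from a plain majority vote.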

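[AI-54] Less is More: Learning Graph Tasks with Just LLMs

Quick read: This paper studies how to improve the reasoning of Large Language Models (LLMs) on graph-structured tasks. It asks whether LLMs can learn to solve fundamental graph tasks without specialized graph encoding models, whether the learned solutions generalize to unseen graph structures or tasks, and how competing approaches compare. The key to the solution is training on instructive examples with chain-of-thought solutions, which lets even small LLMs learn and generalize graph-task strategies without dedicated graph encoders such as graph neural networks (GNNs).

Link: https://arxiv.org/abs/2508.10115
Authors: Sola Shirai, Kavitha Srinivas, Julian Dolby, Michael Katz, Horst Samulowitz, Shirin Sohrabi
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: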

Abstract:For large language models (LLMs), reasoning over graphs could help solve many problems. Prior work has tried to improve LLM graph reasoning by examining how best to serialize graphs as text and by combining GNNs and LLMs. However, the merits of such approaches remain unclear, so we empirically answer the following research questions: (1) Can LLMs learn to solve fundamental graph tasks without specialized graph encoding models?, (2) Can LLMs generalize learned solutions to unseen graph structures or tasks?, and (3) What are the merits of competing approaches to learn graph tasks? We show that even small LLMs can learn to solve graph tasks by training them with instructive chain-of-thought solutions, and this training generalizes, without specialized graph encoders, to new tasks and graph structures.

[AI-55] Advancing Data Equity: Practitioner Responsibility and Accountability in NLP Data Practices AAAI

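Quick read: Research on algorithmic bias has concentrated on technical audits and fairness evaluation, but little is known about how Natural Language Processing (NLP) data practitioners perceive and handle data equity in their actual work. Centering the practitioner's perspective, this study shows how practitioners understand fairness under organizational and systemic constraints, the dilemmas they face in practice, and how they engage with emerging governance efforts such as the U.S. AI Bill of Rights. The key to the solution is a multi-scalar AI governance framework that promotes participation across technical, policy, and community domains. It argues for structural governance reforms that strengthen practitioner agency and community consent, moving past superficial data diversity (diversity washing) toward lasting improvement in NLP data equity.

Link: https://arxiv.org/abs/2508.10071
Authors: Jay L. Cunningham, Kevin Zhongyang Shao, Rock Yuren Pang, Nathaniel Mengist
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 10 pages, 6 Pages (References and Appendices). The archival version has been accepted to AAAI (AIES 2025) without the extended Appendices. This extended version includes Appendices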

Abstract:While research has focused on surfacing and auditing algorithmic bias to ensure equitable AI development, less is known about how NLP practitioners - those directly involved in dataset development, annotation, and deployment - perceive and navigate issues of NLP data equity. This study is among the first to center practitioners’ perspectives, linking their experiences to a multi-scalar AI governance framework and advancing participatory recommendations that bridge technical, policy, and community domains. Drawing on a 2024 questionnaire and focus group, we examine how U.S.-based NLP data practitioners conceptualize fairness, contend with organizational and systemic constraints, and engage emerging governance efforts such as the U.S. AI Bill of Rights. Findings reveal persistent tensions between commercial objectives and equity commitments, alongside calls for more participatory and accountable data workflows. We critically engage debates on data diversity and diversity washing, arguing that improving NLP equity requires structural governance reforms that support practitioner agency and community consent.

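[AI-56] NetMoniAI: An Agentic AI Framework for Network Security Monitoring

Quick read: This paper addresses the difficulty traditional network monitoring and security systems have in detecting threats efficiently and cooperatively under resource constraints, in particular delayed anomaly detection, heavy redundant analysis, and weak global situational awareness in distributed networks. The key to the solution is a two-tier agentic-AI architecture. Each node runs a lightweight autonomous micro-agent for local traffic analysis and anomaly detection, and a central controller aggregates the nodes' insights to detect coordinated attacks and maintain network-wide situational awareness, improving response time and scalability without compromising accuracy.

Link: https://arxiv.org/abs/2508.10052
Authors: Pallavi Zambare, Venkata Nikhil Thanikella, Nikhil Padmanabh Kottur, Sree Akhil Akula, Ying Liu
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted in IEEE 3rd International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings 2025)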

Abstract: In this paper, we present NetMoniAI, an agentic AI framework for automatic network monitoring and security that integrates decentralized analysis with lightweight centralized coordination. The framework consists of two layers. Autonomous micro-agents at each node perform local traffic analysis and anomaly detection. A central controller then aggregates insights across nodes to detect coordinated attacks and maintain system-wide situational awareness. We evaluated NetMoniAI on a local micro-testbed and through NS-3 simulations. Results confirm that the two-tier agentic-AI design scales under resource constraints, reduces redundancy, and improves response time without compromising accuracy. To facilitate broader adoption and reproducibility, the complete framework is available as open source. This enables researchers and practitioners to replicate, validate, and extend it across diverse network environments and threat scenarios.

[AI-57] Legal Zero-Days: A Novel Risk Vector for Advanced AI Systems

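Quick read: This paper addresses a new kind of risk that frontier AI systems may create, "Legal Zero-Days": previously undiscovered vulnerabilities in legal frameworks that, once exploited, can cause immediate and significant societal disruption without litigation or other processes first. The key to the solution is a risk model for identifying and evaluating such vulnerabilities, together with "legal puzzles" built as evaluation instruments to test whether AI systems can discover them. The findings suggest that current AI models may not reliably find impactful Legal Zero-Days, but future systems may develop this capability, which presents both risks and opportunities for making legal systems more robust.

Link: https://arxiv.org/abs/2508.10050
Authors: Greg Sadler, Nathan Sherburn
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 10 pages, 1 table, 1 figure. Introduces Legal Zero-Days as a novel AI risk vector and provides evaluation framework for measuring AI systems’ ability to discover legal vulnerabilities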

Abstract:We introduce the concept of “Legal Zero-Days” as a novel risk vector for advanced AI systems. Legal Zero-Days are previously undiscovered vulnerabilities in legal frameworks that, when exploited, can cause immediate and significant societal disruption without requiring litigation or other processes before impact. We present a risk model for identifying and evaluating these vulnerabilities, demonstrating their potential to bypass safeguards or impede government responses to AI incidents. Using the 2017 Australian dual citizenship crisis as a case study, we illustrate how seemingly minor legal oversights can lead to large-scale governance disruption. We develop a methodology for creating “legal puzzles” as evaluation instruments for assessing AI systems’ capabilities to discover such vulnerabilities. Our findings suggest that while current AI models may not reliably find impactful Legal Zero-Days, future systems may develop this capability, presenting both risks and opportunities for improving legal robustness. This work contributes to the broader effort to identify and mitigate previously unrecognized risks from frontier AI systems.

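[AI-58] A Survey of Optimization Modeling Meets LLMs: Progress and Future Directions

Quick read: This paper addresses the heavy dependence of optimization modeling on operations research professionals by using large language models (LLMs) to automate mathematical modeling. The survey covers the full technical stack: data synthesis and fine-tuning, inference frameworks, benchmark datasets, and performance evaluation. Its key contributions are a systematic cleaning of existing benchmark datasets, a new leaderboard with fair performance evaluation, and an online portal that gathers the cleaned datasets, code, and a paper repository, all aimed at making automated modeling tools more reliable and reusable.

Link: https://arxiv.org/abs/2508.10047
Authors: Ziyang Xiao, Jingrong Xie, Lilin Xu, Shisi Guan, Jingyan Zhu, Xiongwei Han, Xiaojin Fu, WingYin Yu, Han Wu, Wei Shi, Qingcan Kang, Jiahui Duan, Tao Zhong, Mingxuan Yuan, Jia Zeng, Yuan Wang, Gang Chen, Dongxiang Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: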

Abstract:By virtue of its great utility in solving real-world problems, optimization modeling has been widely employed for optimal decision-making across various sectors, but it requires substantial expertise from operations research professionals. With the advent of large language models (LLMs), new opportunities have emerged to automate the procedure of mathematical modeling. This survey presents a comprehensive and timely review of recent advancements that cover the entire technical stack, including data synthesis and fine-tuning for the base model, inference frameworks, benchmark datasets, and performance evaluation. In addition, we conducted an in-depth analysis on the quality of benchmark datasets, which was found to have a surprisingly high error rate. We cleaned the datasets and constructed a new leaderboard with fair performance evaluation in terms of base LLM model and datasets. We also build an online portal that integrates resources of cleaned datasets, code and paper repository to benefit the community. Finally, we identify limitations in current methodologies and outline future research opportunities.

[AI-59] SABIA: An AI-Powered Tool for Detecting Opioid-Related Behaviors on Social Media

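Quick read: This paper addresses the difficulty of identifying opioid-misuse-related user behavior on social media, where informal language, slang, and coded expressions make detection hard. The key to the solution is SABIA, a hybrid deep learning model that combines Bidirectional Encoder Representations from Transformers (BERT), a bidirectional long short-term memory network (BiLSTM), and three convolutional neural networks (3CNN). Its layered feature extraction captures the semantics and context of the text and classifies users into five behavior categories (Dealers, Active Opioid Users, Recovered Users, Prescription Users, and Non-Users). SABIA improves accuracy by 9.30% over the Logistic Regression baseline, supporting the method's effectiveness and robustness for public health monitoring.

Link: https://arxiv.org/abs/2508.10046
Authors: Muhammad Ahmad, Fida Ullah, Muhammad Usman, Ildar Batyrshin, Grigori Sidorov
Affiliation: unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: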

Abstract:Social media platforms have become valuable tools for understanding public health challenges by offering insights into patient behaviors, medication use, and mental health issues. However, analyzing such data remains difficult due to the prevalence of informal language, slang, and coded communication, which can obscure the detection of opioid misuse. This study addresses the issue of opioid-related user behavior on social media, including informal expressions, slang terms, and misspelled or coded language. We analyzed the existing Bidirectional Encoder Representations from Transformers (BERT) technique and developed a BERT-BiLSTM-3CNN hybrid deep learning model, named SABIA, to create a single-task classifier that effectively captures the features of the target dataset. The SABIA model demonstrated strong capabilities in capturing semantics and contextual information. The proposed approach includes: (1) data preprocessing, (2) data representation using the SABIA model, (3) a fine-tuning phase, and (4) classification of user behavior into five categories. A new dataset was constructed from Reddit posts, identifying opioid user behaviors across five classes: Dealers, Active Opioid Users, Recovered Users, Prescription Users, and Non-Users, supported by detailed annotation guidelines. Experiments were conducted using supervised learning. Results show that SABIA achieved benchmark performance, outperforming the baseline (Logistic Regression, LR = 0.86) and improving accuracy by 9.30%. Comparisons with seven previous studies confirmed its effectiveness and robustness. This study demonstrates the potential of hybrid deep learning models for detecting complex opioid-related behaviors on social media, supporting public health monitoring and intervention efforts.

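[AI-60] Generative AI for Cybersecurity of Energy Management Systems: Methods, Challenges, and Future Directions

Quick read: This paper addresses the security of energy management systems (EMSs) in a changing landscape of cybersecurity vulnerabilities and system problems (SPs), in particular multi-stage attacks along the SCADA data flow such as stealth attacks after state estimation, EMS database manipulation, and corruption of the HMI display. The key to the solution is an anomaly detection system (ADS) based on generative AI (GenAI), extended with a set-of-mark generative intelligence (SoM-GI) framework. SoM-GI uses multimodal analysis that combines visual markers with rule-based reasoning to overcome the spatial-reasoning limits of purely numerical methods, so it can detect anomalies that are visible on the HMI but missed numerically, strengthening the EMS's defense against both cyber attacks and system errors.

Link: https://arxiv.org/abs/2508.10044
Authors: Aydin Zaboli, Junho Hong
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 36 pages, 10 figures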

Abstract:This paper elaborates on an extensive security framework specifically designed for energy management systems (EMSs), which effectively tackles the dynamic environment of cybersecurity vulnerabilities and/or system problems (SPs), accomplished through the incorporation of novel methodologies. A comprehensive multi-point attack/error model is initially proposed to systematically identify vulnerabilities throughout the entire EMS data processing pipeline, including post state estimation (SE) stealth attacks, EMS database manipulation, and human-machine interface (HMI) display corruption according to the real-time database (RTDB) storage. This framework acknowledges the interconnected nature of modern attack vectors, which utilize various phases of supervisory control and data acquisition (SCADA) data flow. Then, generative AI (GenAI)-based anomaly detection systems (ADSs) for EMSs are proposed for the first time in the power system domain to handle the scenarios. Further, a set-of-mark generative intelligence (SoM-GI) framework, which leverages multimodal analysis by integrating visual markers with rules considering the GenAI capabilities, is suggested to overcome inherent spatial reasoning limitations. The SoM-GI methodology employs systematic visual indicators to enable accurate interpretation of segmented HMI displays and detect visual anomalies that numerical methods fail to identify. Validation on the IEEE 14-Bus system shows the framework’s effectiveness across scenarios, while visual analysis identifies inconsistencies. This integrated approach combines numerical analysis with visual pattern recognition and linguistic rules to protect against cyber threats and system errors.

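[AI-61] Securing Agentic AI: Threat Modeling and Risk Analysis for Network Monitoring Agentic AI System

Quick read: This paper addresses the security issues that arise when Large Language Models (LLMs) are combined with autonomous agents in network monitoring and decision-making systems. The core challenge is that such systems are exposed to threats such as resource denial of service (for example, traffic replay attacks) and memory poisoning (for example, tampering with historical log files), which degrade performance through delayed telemetry updates and higher computational load. The key to the solution is applying and validating the MAESTRO framework, a seven-layer threat-modeling architecture, and proposing a multilayered defense with memory isolation, planner validation, real-time anomaly response, and cross-layer communication protection to identify, assess, and mitigate vulnerabilities of agentic AI systems and improve their reliability and resilience in adversarial settings.

Link: https://arxiv.org/abs/2508.10043
Authors: Pallavi Zambare, Venkata Nikhil Thanikella, Ying Liu
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Submitted and under review in IEEE Transactions on Privacy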

Abstract:When combining Large Language Models (LLMs) with autonomous agents, used in network monitoring and decision-making systems, this will create serious security issues. In this research, the MAESTRO framework consisting of the seven layers threat modeling architecture in the system was used to expose, evaluate, and eliminate vulnerabilities of agentic AI. The prototype agent system was constructed and implemented, using Python, LangChain, and telemetry in WebSockets, and deployed with inference, memory, parameter tuning, and anomaly detection modules. Two practical threat cases were confirmed as follows: (i) resource denial of service by traffic replay denial-of-service, and (ii) memory poisoning by tampering with the historical log file maintained by the agent. These situations resulted in measurable levels of performance degradation, i.e. telemetry updates were delayed, and computational loads were increased, as a result of poor system adaptations. It was suggested to use a multilayered defense-in-depth approach with memory isolation, validation of planners and anomaly response systems in real-time. These findings verify that MAESTRO is viable in operational threat mapping, prospective risk scoring, and the basis of the resilient system design. The authors bring attention to the importance of the enforcement of memory integrity, paying attention to the adaptation logic monitoring, and cross-layer communication protection that guarantee the agentic AI reliability in adversarial settings.

[AI-62] FIDELIS: Blockchain-Enabled Protection Against Poisoning Attacks in Federated Learning

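Quick read: This paper addresses the loss of model performance and integrity that data poisoning attacks cause in federated learning. Existing detection methods lack standardization and depend on centralized trust, which makes security and reliability hard to guarantee. The key to the solution is FIDELIS, a blockchain-based decentralized poison-detection framework. It distributes the role of the global server across participating clients and introduces a judge model, produced by each client and verified through consensus, to detect poisoning in model updates, giving an effective defense against data poisoning and scalable model verification.

Link: https://arxiv.org/abs/2508.10042
Authors: Jane Carney, Kushal Upreti, Gaby G. Dagher, Tim Andersen
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: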

Abstract: Federated learning enhances traditional deep learning by enabling the joint training of a model with the use of IoT devices' private data. It ensures privacy for clients, but is susceptible to data poisoning attacks during training that degrade model performance and integrity. Current poisoning detection methods in federated learning lack a standardized detection method or take significant liberties with trust. In this paper, we present FIDELIS, a novel blockchain-enabled poison detection framework in federated learning. The framework decentralizes the role of the global server across participating clients. We introduce a judge model used to detect data poisoning in model updates. The judge model is produced by each client and verified to reach consensus on a single judge model. We implement our solution to show FIDELIS is robust against data poisoning attacks and the creation of our judge model is scalable.

[AI-63] Exploring Content and Social Connections of Fake News with Explainable Text and Graph Learning

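Quick read: This paper addresses the global spread of misinformation and the difficulty that content-based automated fact-checking systems have with social-media amplification mechanisms such as "likes" and user networks. Simply labelling content as false can also trigger automation bias and confirmation bias, which weakens trust in the system. The key to the solution is an explainable multimodal framework that combines content, social-media behavior, and graph-structure features, and pairs a misinformation classifier with explainability techniques to give complete and interpretable explanations of its predictions, improving both model performance and user trust.

Link: https://arxiv.org/abs/2508.10040
Authors: Vítor N. Lourenço, Aline Paes, Tillman Weyde
Affiliation: unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: Accepted to publication at the 35th Brazilian Conference on Intelligent Systems, BRACIS 2025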

Abstract:The global spread of misinformation and concerns about content trustworthiness have driven the development of automated fact-checking systems. Since false information often exploits social media dynamics such as “likes” and user networks to amplify its reach, effective solutions must go beyond content analysis to incorporate these factors. Moreover, simply labelling content as false can be ineffective or even reinforce biases such as automation and confirmation bias. This paper proposes an explainable framework that combines content, social media, and graph-based features to enhance fact-checking. It integrates a misinformation classifier with explainability techniques to deliver complete and interpretable insights supporting classification decisions. Experiments demonstrate that multimodal information improves performance over single modalities, with evaluations conducted on datasets in English, Spanish, and Portuguese. Additionally, the framework’s explanations were assessed for interpretability, trustworthiness, and robustness with a novel protocol, showing that it effectively generates human-understandable justifications for its predictions.

[AI-64] Multi-task Adversarial Attacks against Black-box Model with Few-shot Queries

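[Quick Read]: This paper addresses the limited effectiveness of current multi-task adversarial text attacks in practical settings, in particular black-box feedback APIs, limited query budgets, and mixed task types. Existing methods typically depend on shared internal features and large numbers of queries, which makes them hard to apply in realistic environments. The key to the solution is CEMA (Cluster and Ensemble Multi-task Text Adversarial Attack), a black-box attack framework that exploits the transferability of adversarial examples across tasks. It builds a deep-level substitute model trained in a plug-and-play manner, with no need to mimic the victim model's structure. It then generates multiple adversarial candidates using several text classification methods and deploys the one that attacks the substitute model most effectively. This reduces a complex multi-task attack to a classification attack and cuts the query requirement to about 100. Effectiveness and generalization are validated across task types (classification, translation, summarization, text-to-image generation), commercial APIs (e.g., Baidu and Google Translate), large language models (e.g., ChatGPT 4o), and image-generation models (e.g., Stable Diffusion V2).

Link: https://arxiv.org/abs/2508.10039
Authors: Wenqiang Wang, Yan Xiao, Hao Lin, Yangshijie Zhang, Xiaochun Cao
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: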

Abstract:Current multi-task adversarial text attacks rely on abundant access to shared internal features and numerous queries, often limited to a single task type. As a result, these attacks are less effective against practical scenarios involving black-box feedback APIs, limited queries, or multiple task types. To bridge this gap, we propose Cluster and Ensemble Multi-task Text Adversarial Attack (CEMA), an effective black-box attack that exploits the transferability of adversarial texts across different tasks. CEMA simplifies complex multi-task scenarios by using a deep-level substitute model trained in a plug-and-play manner for text classification, enabling attacks without mimicking the victim model. This approach requires only a few queries for training, converting multi-task attacks into classification attacks and allowing attacks across various tasks. CEMA generates multiple adversarial candidates using different text classification methods and selects the one that most effectively attacks substitute models. In experiments involving multi-task models with two, three, or six tasks, spanning classification, translation, summarization, and text-to-image generation, CEMA demonstrates significant attack success with as few as 100 queries. Furthermore, CEMA can target commercial APIs (e.g., Baidu and Google Translate), large language models (e.g., ChatGPT 4o), and image-generation models (e.g., Stable Diffusion V2), showcasing its versatility and effectiveness in real-world applications.

[AI-65] Certifiably robust malware detectors by design

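[Quick Read]: This paper addresses the vulnerability of machine learning models for static malware detection to adversarial examples, in which small, carefully crafted modifications defeat the detector while the software's functionality stays unchanged. The key to the solution is a new model architecture that is certifiably robust by design, together with the result that every robust detector can be decomposed into a specific structure. This structure can be used to learn malware detectors that remain robust even on fragile features. The authors' framework, ERDALT, is built on this structure and improves robustness to adversarial examples at a limited cost in detection performance.

Link: https://arxiv.org/abs/2508.10038
Authors: Pierre-Francois Gimenez, Sarath Sivaprasad, Mario Fritz
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: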

Abstract:Malware analysis involves analyzing suspicious software to detect malicious payloads. Static malware analysis, which does not require software execution, relies increasingly on machine learning techniques to achieve scalability. Although such techniques obtain very high detection accuracy, they can be easily evaded with adversarial examples where a few modifications of the sample can dupe the detector without modifying the behavior of the software. Unlike other domains, such as computer vision, creating an adversarial example of malware without altering its functionality requires specific transformations. We propose a new model architecture for certifiably robust malware detection by design. In addition, we show that every robust detector can be decomposed into a specific structure, which can be applied to learn empirically robust malware detectors, even on fragile features. Our framework ERDALT is based on this structure. We compare and validate these approaches with machine-learning-based malware detection methods, allowing for robust detection with limited reduction of detection performance.

[AI-66] Cognitive Cybersecurity for Artificial Intelligence: Guardrail Engineering with CCS-7

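[Quick Read]: This paper addresses human-like cognitive security vulnerabilities in language models (LMs), such as emotional framing, which traditional behavioral alignment struggles to capture and guard against. The key to the solution is CCS-7, a taxonomy of seven vulnerabilities grounded in human cognitive security research, with a "Think First, Verify Always" (TFVA) intervention as the baseline guardrail. A randomized controlled trial with 151 participants showed that TFVA improves human cognitive security by 7.9%. The strategy was then tested in 12,180 experiments on seven language model architectures, revealing strongly architecture-dependent risk: some vulnerabilities (e.g., identity confusion) are almost fully mitigated, while others (e.g., source interference) backfire, with error rates rising by up to 135%. Cognitive safety should therefore be treated as a model-specific engineering problem, and architecture-aware cognitive safety testing is needed to ensure that interventions are effective and safe.

Link: https://arxiv.org/abs/2508.10033
Authors: Yuksel Aydin
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: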

Abstract:Language models exhibit human-like cognitive vulnerabilities, such as emotional framing, that escape traditional behavioral alignment. We present CCS-7 (Cognitive Cybersecurity Suite), a taxonomy of seven vulnerabilities grounded in human cognitive security research. To establish a human benchmark, we ran a randomized controlled trial with 151 participants: a “Think First, Verify Always” (TFVA) lesson improved cognitive security by +7.9% overall. We then evaluated TFVA-style guardrails across 12,180 experiments on seven diverse language model architectures. Results reveal architecture-dependent risk patterns: some vulnerabilities (e.g., identity confusion) are almost fully mitigated, while others (e.g., source interference) exhibit escalating backfire, with error rates increasing by up to 135% in certain models. Humans, in contrast, show consistent moderate improvement. These findings reframe cognitive safety as a model-specific engineering problem: interventions effective in one architecture may fail, or actively harm, another, underscoring the need for architecture-aware cognitive safety testing before deployment.

[AI-67] A Robust Pipeline for Differentially Private Federated Learning on Imbalanced Clinical Data using SMOTETomek and FedProx

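[Quick Read]: This paper addresses the performance degradation of federated learning (FL) in healthcare settings caused by non-independent and identically distributed (non-IID) data and severe class imbalance, while balancing privacy protection through differential privacy (DP) against clinical utility. The key to the solution has three parts: first, applying the hybrid oversampling technique SMOTETomek at the client level to mitigate class imbalance and raise recall; second, using a tuned FedProx algorithm to handle non-IID data and stabilize convergence; and finally, identifying an optimal operating region between the privacy budget (epsilon) and model recall, where strong privacy guarantees (epsilon = 9.0) coexist with high clinical utility (recall > 77%).

Link: https://arxiv.org/abs/2508.10017
Authors: Rodrigo Tertulino
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: This is being prepared to be submitted to the Journal of the Brazilian Computer Society (JBCS), which is still under construction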

Abstract:Federated Learning (FL) presents a groundbreaking approach for collaborative health research, allowing model training on decentralized data while safeguarding patient privacy. FL offers formal security guarantees when combined with Differential Privacy (DP). The integration of these technologies, however, introduces a significant trade-off between privacy and clinical utility, a challenge further complicated by the severe class imbalance often present in medical datasets. The research presented herein addresses these interconnected issues through a systematic, multi-stage analysis. An FL framework was implemented for cardiovascular risk prediction, where initial experiments showed that standard methods struggled with imbalanced data, resulting in a recall of zero. To overcome such a limitation, we first integrated the hybrid Synthetic Minority Over-sampling Technique with Tomek Links (SMOTETomek) at the client level, successfully developing a clinically useful model. Subsequently, the framework was optimized for non-IID data using a tuned FedProx algorithm. Our final results reveal a clear, non-linear trade-off between the privacy budget (epsilon) and model recall, with the optimized FedProx consistently out-performing standard FedAvg. An optimal operational region was identified on the privacy-utility frontier, where strong privacy guarantees (with epsilon 9.0) can be achieved while maintaining high clinical utility (recall greater than 77%). Ultimately, our study provides a practical methodological blueprint for creating effective, secure, and accurate diagnostic tools that can be applied to real-world, heterogeneous healthcare data.
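
The FedProx component mentioned in this abstract can be sketched roughly as follows. This is a minimal illustration of the general FedProx idea (a proximal term that pulls each client's weights toward the current global model), not the author's pipeline: the function names, the default `lr` and `mu` values, and the size-weighted server average are assumptions made for this example, and the SMOTETomek resampling and differential-privacy noise steps are omitted.

```python
def fedprox_local_step(w, w_global, grad, lr=0.1, mu=0.1):
    """One local gradient step on F_k(w) + (mu / 2) * ||w - w_global||^2.

    `grad` is the gradient of the client's own loss F_k at `w`. The proximal
    term contributes mu * (w - w_global), which pulls the client back toward
    the global model. With mu = 0 this reduces to a plain SGD step.
    (lr and mu defaults are illustrative, not the paper's tuned values.)
    """
    return [wi - lr * (gi + mu * (wi - wg))
            for wi, wg, gi in zip(w, w_global, grad)]


def aggregate(client_models, client_sizes):
    """Server step (assumed here): average client models weighted by sample count."""
    total = float(sum(client_sizes))
    dim = len(client_models[0])
    return [sum(m[d] * n for m, n in zip(client_models, client_sizes)) / total
            for d in range(dim)]


if __name__ == "__main__":
    w_global = [0.0, 2.0]
    local = fedprox_local_step([1.0, 2.0], w_global, grad=[0.5, -1.0])
    print(local)
    print(aggregate([[1.0, 0.0], [3.0, 2.0]], client_sizes=[1, 3]))
```

A larger `mu` keeps heterogeneous clients closer to the global model, which is the usual motivation for FedProx on non-IID data.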

[AI-68] OpenFPL: An open-source forecasting method rivaling state-of-the-art Fantasy Premier League services

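[Quick Read]: This paper addresses the difficulty Fantasy Premier League (FPL) participants face in making optimal squad decisions without transparent, high-accuracy player performance forecasts. High-accuracy forecasting models are currently available mainly through commercial services whose algorithms and data sources are opaque, which limits fair access for ordinary users. The key to the solution is OpenFPL, an open-source player performance forecasting method built entirely on public data, using position-specific ensemble models optimized on FPL and Understat data from four seasons (2020-21 to 2023-24). In prospective testing on the 2024-25 season, OpenFPL achieves accuracy comparable to a leading commercial service and clearly outperforms the benchmark for high-return players (≥2 points), supporting both short- to medium-term planning (one to three gameweeks) and final-day decisions.

Link: https://arxiv.org/abs/2508.09992
Authors: Daniel Groos
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Models and inference code are freely available at this https URL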

Abstract:Fantasy Premier League engages the football community in selecting the Premier League players who will perform best from gameweek to gameweek. Access to accurate performance forecasts gives participants an edge over competitors by guiding expectations about player outcomes and reducing uncertainty in squad selection. However, high-accuracy forecasts are currently limited to commercial services whose inner workings are undisclosed and that rely on proprietary data. This paper aims to democratize access to highly accurate forecasts of player performance by presenting OpenFPL, an open-source Fantasy Premier League forecasting method developed exclusively from public data. Comprising position-specific ensemble models optimized on Fantasy Premier League and Understat data from four previous seasons (2020-21 to 2023-24), OpenFPL achieves accuracy comparable to a leading commercial service when tested prospectively on data from the 2024-25 season. OpenFPL also surpasses the commercial benchmark for high-return players ( 2 points), which are most influential for rank gains. These findings hold across one-, two-, and three-gameweek forecast horizons, supporting long-term planning of transfers and strategies while also informing final-day decisions.

[AI-69] Estimating Covariance for Global Minimum Variance Portfolio: A Decision-Focused Learning Approach

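[Quick Read]: This paper addresses the loss of decision quality in traditional portfolio optimization caused by inaccurate parameter estimation, in particular the fact that prediction-focused estimators that minimize mean-squared error (MSE) may fail to produce optimal asset allocations in practice. The key to the solution is the decision-focused learning (DFL) framework, which optimizes the quality of the investment decision directly rather than minimizing prediction error. Specifically, the authors use the analytic solution of the global minimum-variance portfolio (GMVP) and its principal-component properties to derive the gradient of the decision loss, and use that gradient to guide parameter estimation, which markedly improves realized portfolio performance, especially in reducing volatility and strengthening decision-driving features.

Link: https://arxiv.org/abs/2508.10776
Authors: Juchan Kim, Inwoo Tae, Yongjae Lee
Affiliation: Unknown
Categories: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI)
Comments: 11 pages, 12 figures, 3 tables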

Abstract:Portfolio optimization constitutes a cornerstone of risk management by quantifying the risk-return trade-off. Since it inherently depends on accurate parameter estimation under conditions of future uncertainty, the selection of appropriate input parameters is critical for effective portfolio construction. However, most conventional statistical estimators and machine learning algorithms determine these parameters by minimizing mean-squared error (MSE), a criterion that can yield suboptimal investment decisions. In this paper, we adopt decision-focused learning (DFL) - an approach that directly optimizes decision quality rather than prediction error such as MSE - to derive the global minimum-variance portfolio (GMVP). Specifically, we theoretically derive the gradient of decision loss using the analytic solution of GMVP and its properties regarding the principal components of itself. Through extensive empirical evaluation, we show that prediction-focused estimation methods may fail to produce optimal allocations in practice, whereas DFL-based methods consistently deliver superior decision performance. Furthermore, we provide a comprehensive analysis of DFL’s mechanism in GMVP construction, focusing on its volatility reduction capability, decision-driving features, and estimation characteristics.
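
To make the "analytic solution of GMVP" concrete, here is a small sketch of the standard closed form w = Σ⁻¹1 / (1ᵀΣ⁻¹1), written in plain Python. The `decision_loss` function is an assumption made for illustration (the realized variance of the portfolio built from an estimated covariance, evaluated under the true covariance); the paper's exact loss and gradient derivation are not reproduced here, and all function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def gmvp_weights(cov):
    """Closed-form GMVP weights: inv(cov) @ 1, normalized to sum to one."""
    y = solve_linear(cov, [1.0] * len(cov))
    total = sum(y)
    return [v / total for v in y]


def portfolio_variance(w, cov):
    """Quadratic form w' cov w."""
    n = len(w)
    return sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))


def decision_loss(cov_hat, cov_true):
    """Assumed decision loss: variance actually realized by the estimated GMVP."""
    return portfolio_variance(gmvp_weights(cov_hat), cov_true)


if __name__ == "__main__":
    cov_true = [[0.04, 0.0], [0.0, 0.01]]
    print(gmvp_weights(cov_true))
    print(decision_loss([[1.0, 0.0], [0.0, 1.0]], cov_true))
```

Under this definition, two covariance estimates with the same MSE can lead to different realized variances, which is the gap between prediction-focused and decision-focused estimation that the abstract describes.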

[AI-70] FROGENT: An End-to-End Full-process Drug Design Agent

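[Quick Read]: This paper addresses the fragmentation of AI tools in drug design: powerful tools are scattered across separate web apps, desktop programs, and code libraries, forcing scientists to deal manually with incompatible interfaces and specialized scripts in a cumbersome, repetitive process. The key to the solution is FROGENT (Full-pROcess druG dEsign ageNT), an agent framework that uses a large language model (LLM) and the Model Context Protocol to dynamically integrate multiple biochemical databases, extensible tool libraries, and task-specific AI models, automating complex drug discovery workflows from target identification and molecule generation to retrosynthetic planning.

Link: https://arxiv.org/abs/2508.10760
Authors: Qihua Pan, Dong Xu, Jenna Xinyi Yao, Lijia Ma, Zexuan Zhu, Junkai Ji
Affiliation: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures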

Abstract:Powerful AI tools for drug discovery reside in isolated web apps, desktop programs, and code libraries. Such fragmentation forces scientists to manage incompatible interfaces and specialized scripts, which can be a cumbersome and repetitive process. To address this issue, a Full-pROcess druG dEsign ageNT, named FROGENT, has been proposed. Specifically, FROGENT utilizes a Large Language Model and the Model Context Protocol to integrate multiple dynamic biochemical databases, extensible tool libraries, and task-specific AI models. This agentic framework allows FROGENT to execute complicated drug discovery workflows dynamically, including component tasks such as target identification, molecule generation and retrosynthetic planning. FROGENT has been evaluated on eight benchmarks that cover various aspects of drug discovery, such as knowledge retrieval, property prediction, virtual screening, mechanistic analysis, molecular design, and synthesis. It was compared against six increasingly advanced ReAct-style agents that support code execution and literature searches. Empirical results demonstrated that FROGENT triples the best baseline performance in hit-finding and doubles it in interaction profiling, significantly outperforming both the open-source model Qwen3-32B and the commercial model GPT-4o. In addition, real-world cases have been utilized to validate the practicability and generalization of FROGENT. This development suggests that streamlining the agentic drug discovery pipeline can significantly enhance researcher productivity.

[AI-71] Deep Learning in Classical and Quantum Physics

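[Quick Read]: This paper addresses research bottlenecks in quantum science and technology that stem from system complexity, namely how to explore high-dimensional parameter spaces efficiently, extract patterns from experimental data, and provide data-driven guidance for research directions. The key to the solution is deep learning (DL) as a tool whose nonlinear modeling capacity supports tasks such as optimizing quantum control protocols and discovering materials with targeted quantum properties. The notes also stress recognizing and mitigating the risks of DL models, including overfitting to noisy data, obscuring causal structure, and limited physical interpretability, in order to preserve scientific rigor, and they offer researchers in quantum physics, chemistry, and engineering a guide to applying AI methods both practically and responsibly.

Link: https://arxiv.org/abs/2508.10666
Authors: Timothy Heightman, Marcin Płodzień
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Computational Physics (physics.comp-ph)
Comments: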

Abstract:Scientific progress is tightly coupled to the emergence of new research tools. Today, machine learning (ML), especially deep learning (DL), has become a transformative instrument for quantum science and technology. Owing to the intrinsic complexity of quantum systems, DL enables efficient exploration of large parameter spaces, extraction of patterns from experimental data, and data-driven guidance for research directions. These capabilities already support tasks such as refining quantum control protocols and accelerating the discovery of materials with targeted quantum properties, making ML/DL literacy an essential skill for the next generation of quantum scientists. At the same time, DL’s power brings risks: models can overfit noisy data, obscure causal structure, and yield results with limited physical interpretability. Recognizing these limitations and deploying mitigation strategies is crucial for scientific rigor. These lecture notes provide a comprehensive, graduate-level introduction to DL for quantum applications, combining conceptual exposition with hands-on examples. Organized as a progressive sequence, they aim to equip readers to decide when and how to apply DL effectively, to understand its practical constraints, and to adapt AI methods responsibly to problems across quantum physics, chemistry, and engineering.

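[AI-72] Layer-Wise Analysis of Self-Supervised Representations for Age and Gender Classification in Children's Speech

[Quick Read]: This paper addresses the challenges that high variability in pitch, articulation, and developmental traits poses for age and gender classification of children's speech, and in particular the open question of how well self-supervised learning (SSL) models encode speaker traits in children's speech. The key to the solution is a layer-wise analysis of four Wav2Vec2 variants, which finds that early layers (1-7) capture speaker-specific information better than deeper layers, which increasingly focus on linguistic information. Principal component analysis (PCA) further reduces redundancy and highlights the most discriminative features, improving classification: for example, Wav2Vec2-large-lv60 reaches 97.14% (age) and 98.20% (gender) accuracy on CMU Kids. The results support targeted strategies, based on how features are structured across model depth, for building child-aware speech interfaces.

Link: https://arxiv.org/abs/2508.10332
Authors: Abhijit Sinha, Harishankar Kumar, Mohit Joshi, Hemant Kumar Kathania, Shrikanth Narayanan, Sudarsana Reddy Kadiri
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted at Workshop on Child Computer Interaction (WOCCI 2025)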

Abstract:Children’s speech presents challenges for age and gender classification due to high variability in pitch, articulation, and developmental traits. While self-supervised learning (SSL) models perform well on adult speech tasks, their ability to encode speaker traits in children remains underexplored. This paper presents a detailed layer-wise analysis of four Wav2Vec2 variants using the PFSTAR and CMU Kids datasets. Results show that early layers (1-7) capture speaker-specific cues more effectively than deeper layers, which increasingly focus on linguistic information. Applying PCA further improves classification, reducing redundancy and highlighting the most informative components. The Wav2Vec2-large-lv60 model achieves 97.14% (age) and 98.20% (gender) on CMU Kids; base-100h and large-lv60 models reach 86.05% and 95.00% on PFSTAR. These results reveal how speaker traits are structured across SSL model depth and support more targeted, adaptive strategies for child-aware speech interfaces.

[AI-73] CATNet: A geometric deep learning approach for CAT bond spread prediction in the primary market

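[Quick Read]: This paper addresses the difficulty traditional catastrophe bond (CAT bond) pricing models have in capturing the complex relational data of these instruments. The key to the solution is CATNet, a framework that uses a geometric deep learning architecture, the Relational Graph Convolutional Network (R-GCN), to model the CAT bond primary market as a graph and exploit its underlying network topology. The method significantly outperforms a strong baseline (Random Forest), and adding topological centrality measures as features improves accuracy further. The analysis also shows that these network features quantify long-held industry intuition about issuer reputation, underwriter influence, and peril concentration, indicating that network connectivity is a key determinant of price.

Link: https://arxiv.org/abs/2508.10208
Authors: Dixon Domfeh, Saeid Safarveisi
Affiliation: Unknown
Categories: Pricing of Securities (q-fin.PR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Finance (q-fin.CP); Risk Management (q-fin.RM)
Comments: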

Abstract:Traditional models for pricing catastrophe (CAT) bonds struggle to capture the complex, relational data inherent in these instruments. This paper introduces CATNet, a novel framework that applies a geometric deep learning architecture, the Relational Graph Convolutional Network (R-GCN), to model the CAT bond primary market as a graph, leveraging its underlying network structure for spread prediction. Our analysis reveals that the CAT bond market exhibits the characteristics of a scale-free network, a structure dominated by a few highly connected and influential hubs. CATNet demonstrates high predictive performance, significantly outperforming a strong Random Forest benchmark. The inclusion of topological centrality measures as features provides a further, significant boost in accuracy. Interpretability analysis confirms that these network features are not mere statistical artifacts; they are quantitative proxies for long-held industry intuition regarding issuer reputation, underwriter influence, and peril concentration. This research provides evidence that network connectivity is a key determinant of price, offering a new paradigm for risk assessment and proving that graph-based models can deliver both state-of-the-art accuracy and deeper, quantifiable market insights.

[AI-74] Jet Image Tagging Using Deep Learning: An Ensemble Model

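[Quick Read]: This paper addresses jet classification in high-energy particle physics: accurately identifying jets, which arise from the fragmentation and hadronization of quarks and gluons and have a complex, multidimensional structure, in support of studying fundamental interactions and new physics beyond the Standard Model. Traditional classification methods struggle to capture the fine structure of jets, so the paper proposes an ensemble of two neural networks (the Ensemble Model). The key is to convert jet data into two-dimensional histograms rather than points in a higher-dimensional space and to combine the strengths of the constituent networks, enabling both binary and multi-class classification of jet categories such as top quarks, light quarks (up or down), and W/Z bosons, with performance superior to either individual network.

Link: https://arxiv.org/abs/2508.10034
Authors: Juvenal Bassa, Vidya Manian, Sudhir Malik, Arghya Chattopadhyay
Affiliation: Unknown
Categories: Data Analysis, Statistics and Probability (physics.data-an); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph)
Comments: 19 Pages. All codes available at this https URL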

Abstract:Jet classification in high-energy particle physics is important for understanding fundamental interactions and probing phenomena beyond the Standard Model. Jets originate from the fragmentation and hadronization of quarks and gluons, and pose a challenge for identification due to their complex, multidimensional structure. Traditional classification methods often fall short in capturing these intricacies, necessitating advanced machine learning approaches. In this paper, we employ two neural networks simultaneously as an ensemble to tag various jet types. We convert the jet data to two-dimensional histograms instead of representing them as points in a higher-dimensional space. Specifically, this ensemble approach, hereafter referred to as Ensemble Model, is used to tag jets into classes from the JetNet dataset, corresponding to: Top Quarks, Light Quarks (up or down), and W and Z bosons. For the jet classes mentioned above, we show that the Ensemble Model can be used for both binary and multi-categorical classification. This ensemble approach learns jet features by leveraging the strengths of each constituent network achieving superior performance compared to either individual network.

Machine Learning

[LG-0] A Dataset for Distilling Knowledge Priors from Literature for Therapeutic Design

Link: https://arxiv.org/abs/2508.10899
Authors: Haydn Thomas Jones, Natalie Maus, Josh Magnus Ludan, Maggie Ziyu Huan, Jiaming Liang, Marcelo Der Torossian Torres, Jiatao Liang, Zachary Ives, Yoseph Barash, Cesar de la Fuente-Nunez, Jacob R. Gardner, Mark Yatskar
Categories: Machine Learning (cs.LG)
Comments:

Abstract:AI-driven discovery can greatly reduce design time and enhance new therapeutics’ effectiveness. Models using simulators explore broad design spaces but risk violating implicit constraints due to a lack of experimental priors. For example, in a new analysis we performed on a diverse set of models on the GuacaMol benchmark using supervised classifiers, over 60% of molecules proposed had high probability of being mutagenic. In this work, we introduce \ourdataset, a dataset of priors for design problems extracted from literature describing compounds used in lab settings. It is constructed with LLM pipelines for discovering therapeutic entities in relevant paragraphs and summarizing information in concise fair-use facts. \ourdataset consists of 32.3 million pairs of natural language facts, and appropriate entity representations (i.e. SMILES or refseq IDs). To demonstrate the potential of the data, we train LLM, CLIP, and LLava architectures to reason jointly about text and design targets and evaluate on tasks from the Therapeutic Data Commons (TDC). \ourdataset is highly effective for creating models with strong priors: in supervised prediction problems that use our data as pretraining, our best models with 15M learnable parameters outperform larger 2B TxGemma on both regression and classification TDC tasks, and perform comparably to 9B models on average. Models built with \ourdataset can be used as constraints while optimizing for novel molecules in GuacaMol, resulting in proposals that are safer and nearly as effective. We release our dataset at this https URL, and will provide expanded versions as available literature grows.

[LG-1] Efficiently Verifiable Proofs of Data Attribution

Link: https://arxiv.org/abs/2508.10866
Authors: Ari Karchmer, Seth Neel, Martin Pawelczyk
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Data attribution methods aim to answer useful counterfactual questions like “what would a ML model’s prediction be if it were trained on a different dataset?” However, estimation of data attribution models through techniques like empirical influence or “datamodeling” remains very computationally expensive. This causes a critical trust issue: if only a few computationally rich parties can obtain data attributions, how can resource-constrained parties trust that the provided attributions are indeed “good,” especially when they are used for important downstream applications (e.g., data pricing)? In this paper, we address this trust issue by proposing an interactive verification paradigm for data attribution. An untrusted and computationally powerful Prover learns data attributions, and then engages in an interactive proof with a resource-constrained Verifier. Our main result is a protocol that provides formal completeness, soundness, and efficiency guarantees in the sense of Probably-Approximately-Correct (PAC) verification. Specifically, if both Prover and Verifier follow the protocol, the Verifier accepts data attributions that are \epsilon-close to the optimal data attributions (in terms of the Mean Squared Error) with probability 1-\delta. Conversely, if the Prover arbitrarily deviates from the protocol, even with infinite compute, then this is detected (or it still yields data attributions to the Verifier) except with probability \delta. Importantly, our protocol ensures the Verifier’s workload, measured by the number of independent model retrainings it must perform, scales only as O(1/\epsilon); i.e., independently of the dataset size. At a technical level, our results apply to efficiently verifying any linear function over the boolean hypercube computed by the Prover, making them broadly applicable to various attribution tasks.

[LG-2] CrossDenoise: Denoising Implicit Feedback via a Lightweight Entity-Aware Synergistic Framework

Link: https://arxiv.org/abs/2508.10851
Authors: Ze Liu, Xianquan Wang, Shuochen Liu, Jie Ma, Huibo Xu, Yupeng Han, Zhe Yang, Kai Zhang, Longfei Li, Jun Zhou
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Recommender systems heavily rely on implicit feedback, which is inherently noisy due to false positives and negatives, severely degrading recommendation accuracy. Existing denoising strategies often overlook entity-aware modeling, suffer from high computational overhead, or demand excessive hyperparameter tuning, limiting their real-world applicability. We propose CrossDenoise, a novel and lightweight framework that addresses these challenges by disentangling noise estimation into user-, item-, and interaction-specific factors. Leveraging empirical observations that show significant heterogeneity in user and item noise propensities, CrossDenoise computes entity reputation factors (user/item reliability) via a rank-based linear mapping of average training losses. These are fused with interaction-level weights derived from an empirical cumulative distribution function (ECDF) of individual losses. This design is model-agnostic, computationally efficient, and requires only two intuitive hyperparameters. Extensive experiments on ML-1M, Yelp, and Amazon-book datasets, across GMF, NeuMF, and CDAE backbones, demonstrate that CrossDenoise consistently and significantly outperforms state-of-the-art baselines. For instance, it achieves up to 27.01% NDCG@50 gain on Yelp with NeuMF, while incurring negligible computational and memory overhead. Our analysis confirms that CrossDenoise effectively separates clean from noisy samples and remains robust under varied hyperparameter settings. It offers a practical and scalable solution for denoising implicit feedback.
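
The weighting scheme described in this abstract might look roughly like the sketch below. It is an interpretation, not the authors' implementation: the direction of the rank mapping (lowest average loss gets reputation 1), the `floor` parameter, the use of 1 - ECDF(loss) as the interaction weight, and the product used to fuse the three factors are all assumptions chosen for illustration, and every function name is hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def mean_loss_by(keys, losses):
    """Average training loss per entity (user or item)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for k, loss in zip(keys, losses):
        sums[k] += loss
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}


def rank_linear_reputation(avg_loss, floor=0.5):
    """Map entities to [floor, 1] linearly in loss rank (lowest loss -> 1)."""
    order = sorted(avg_loss, key=avg_loss.get)
    n = len(order)
    if n == 1:
        return {order[0]: 1.0}
    return {e: 1.0 - (1.0 - floor) * r / (n - 1) for r, e in enumerate(order)}


def ecdf_weights(losses):
    """Weight each interaction by 1 - ECDF(loss): high-loss samples shrink."""
    ordered = sorted(losses)
    n = len(ordered)
    return [1.0 - bisect_right(ordered, x) / n for x in losses]


def fused_weights(interactions, losses, floor=0.5):
    """Assumed fusion: user reputation * item reputation * interaction weight."""
    users = [u for u, _ in interactions]
    items = [i for _, i in interactions]
    user_rep = rank_linear_reputation(mean_loss_by(users, losses), floor)
    item_rep = rank_linear_reputation(mean_loss_by(items, losses), floor)
    per_sample = ecdf_weights(losses)
    return [user_rep[u] * item_rep[i] * w
            for (u, i), w in zip(interactions, per_sample)]


if __name__ == "__main__":
    pairs = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i2")]
    print(fused_weights(pairs, [0.1, 0.2, 0.3, 0.6]))
```

The resulting per-sample weights would multiply the recommendation loss during training, so interactions that look noisy at the user, item, and sample level contribute less to the gradient.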

[LG-3] SoK: Data Minimization in Machine Learning

Link: https://arxiv.org/abs/2508.10836
Authors: Robin Staab, Nikola Jovanović, Kimberly Mai, Prakhar Ganesh, Martin Vechev, Ferdinando Fioretto, Matthew Jagielski
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Abstract:Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task. It is a foundational principle across major data protection regulations like GDPR and CPRA. Violations of this principle have substantial real-world consequences, with regulatory actions resulting in fines reaching hundreds of millions of dollars. Notably, the relevance of data minimization is particularly pronounced in machine learning (ML) applications, which typically rely on large datasets, resulting in an emerging research area known as Data Minimization in Machine Learning (DMML). At the same time, existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection. This disconnect leads to confusion among practitioners, complicating their efforts to implement DM principles and interpret the terminology, metrics, and evaluation criteria used across different research communities. To address this gap, our work introduces a comprehensive framework for DMML, including a unified data pipeline, adversaries, and points of minimization. This framework allows us to systematically review the literature on data minimization and DM-adjacent methodologies, for the first time presenting a structured overview designed to help practitioners and researchers effectively apply DM principles. Our work facilitates a unified DM-centric understanding and broader adoption of data minimization strategies in AI/ML.

[LG-4] Comparison of Data Reduction Criteria for Online Gaussian Processes

Link: https://arxiv.org/abs/2508.10815
Authors: Thore Wietzke, Knut Graichen
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 12 pages

Abstract:Gaussian Processes (GPs) are widely used for regression and system identification due to their flexibility and ability to quantify uncertainty. However, their computational complexity limits their applicability to small datasets. Moreover in a streaming scenario, more and more datapoints accumulate which is intractable even for Sparse GPs. Online GPs aim to alleviate this problem by e.g. defining a maximum budget of datapoints and removing redundant datapoints. This work provides a unified comparison of several reduction criteria, analyzing both their computational complexity and reduction behavior. The criteria are evaluated on benchmark functions and real-world datasets, including dynamic system identification tasks. Additionally, acceptance criteria are proposed to further filter out redundant datapoints. This work yields practical guidelines for choosing a suitable criterion for an online GP algorithm.

[LG-5] Non-Stationary Restless Multi-Armed Bandits with Provable Guarantee

Link: https://arxiv.org/abs/2508.10804
Authors: Yu-Heng Hung, Ping-Chun Hsieh, Kai Wang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Online restless multi-armed bandits (RMABs) typically assume that each arm follows a stationary Markov Decision Process (MDP) with fixed state transitions and rewards. However, in real-world applications like healthcare and recommendation systems, these assumptions often break due to non-stationary dynamics, posing significant challenges for traditional RMAB algorithms. In this work, we specifically consider $N$-armed RMAB with non-stationary transition constrained by bounded variation budgets $B$. Our proposed \rmab algorithm integrates sliding window reinforcement learning (RL) with an upper confidence bound (UCB) mechanism to simultaneously learn transition dynamics and their variations. We further establish that \rmab achieves a $\widetilde{\mathcal{O}}(N^2 B^{\frac{1}{4}} T^{\frac{3}{4}})$ regret bound by leveraging a relaxed definition of regret, providing a foundational theoretical framework for non-stationary RMAB problems for the first time.

[LG-6] IBEX: Information-Bottleneck-EXplored Coarse-to-Fine Molecular Generation under Limited Data

Link: https://arxiv.org/abs/2508.10775
Authors: Dong Xu, Zhangfan Yang, Jenna Xinyi Yao, Shuangbao Song, Zexuan Zhu, Junkai Ji
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: 10 pages, 8 figures

Abstract:Three-dimensional generative models increasingly drive structure-based drug discovery, yet they remain constrained by the scarcity of publicly available protein-ligand complexes. Under such data scarcity, almost all existing pipelines struggle to learn transferable geometric priors and consequently overfit to training-set biases. As such, we present IBEX, an Information-Bottleneck-EXplored coarse-to-fine pipeline to tackle the chronic shortage of protein-ligand complex data in structure-based drug design. Specifically, we use PAC-Bayesian information-bottleneck theory to quantify the information density of each sample. This analysis reveals how different masking strategies affect generalization and indicates that, compared with conventional de novo generation, the constrained Scaffold Hopping task endows the model with greater effective capacity and improved transfer performance. IBEX retains the original TargetDiff architecture and hyperparameters for training to generate molecules compatible with the binding pocket; it then applies an L-BFGS optimization step to finely refine each conformation by optimizing five physics-based terms and adjusting six translational and rotational degrees of freedom in under one second. With only these modifications, IBEX raises the zero-shot docking success rate on the CrossDocked2020-based CBGBench from 53% to 64%, improves the mean Vina score from $-7.41$ kcal mol$^{-1}$ to $-8.07$ kcal mol$^{-1}$, and achieves the best median Vina energy in 57 of 100 pockets versus 3 for the original TargetDiff. IBEX also increases the QED by 25%, achieves state-of-the-art validity and diversity, and markedly reduces extrapolation error.

[LG-7] MDNS: Masked Diffusion Neural Sampler via Stochastic Optimal Control

Link: https://arxiv.org/abs/2508.10684
Authors: Yuchen Zhu, Wei Guo, Jaemoo Choi, Guan-Horng Liu, Yongxin Chen, Molei Tao
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We study the problem of learning a neural sampler to generate samples from discrete state spaces where the target probability mass function $\pi \propto \mathrm{e}^{-U}$ is known up to a normalizing constant, which is an important task in fields such as statistical physics, machine learning, combinatorial optimization, etc. To better address this challenging task when the state space has a large cardinality and the distribution is multi-modal, we propose Masked Diffusion Neural Sampler (MDNS), a novel framework for training discrete neural samplers by aligning two path measures through a family of learning objectives, theoretically grounded in the stochastic optimal control of the continuous-time Markov chains. We validate the efficiency and scalability of MDNS through extensive experiments on various distributions with distinct statistical properties, where MDNS learns to accurately sample from the target distributions despite the extremely high problem dimensions and outperforms other learning-based baselines by a large margin. A comprehensive study of ablations and extensions is also provided to demonstrate the efficacy and potential of the proposed framework.

[LG-8] Advancing Autonomous Incident Response: Leveraging LLMs and Cyber Threat Intelligence

Link: https://arxiv.org/abs/2508.10677
Authors: Amine Tellache, Abdelaziz Amara Korba, Amdjed Mokhtari, Horea Moldovan, Yacine Ghamri-Doudane
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Effective incident response (IR) is critical for mitigating cyber threats, yet security teams are overwhelmed by alert fatigue, high false-positive rates, and the vast volume of unstructured Cyber Threat Intelligence (CTI) documents. While CTI holds immense potential for enriching security operations, its extensive and fragmented nature makes manual analysis time-consuming and resource-intensive. To bridge this gap, we introduce a novel Retrieval-Augmented Generation (RAG)-based framework that leverages Large Language Models (LLMs) to automate and enhance IR by integrating dynamically retrieved CTI. Our approach introduces a hybrid retrieval mechanism that combines NLP-based similarity searches within a CTI vector database with standardized queries to external CTI platforms, facilitating context-aware enrichment of security alerts. The augmented intelligence is then leveraged by an LLM-powered response generation module, which formulates precise, actionable, and contextually relevant incident mitigation strategies. We propose a dual evaluation paradigm, wherein automated assessment using an auxiliary LLM is systematically cross-validated by cybersecurity experts. Empirical validation on real-world and simulated alerts demonstrates that our approach enhances the accuracy, contextualization, and efficiency of IR, alleviating analyst workload and reducing response latency. This work underscores the potential of LLM-driven CTI fusion in advancing autonomous security operations and establishing a foundation for intelligent, adaptive cybersecurity frameworks.

[LG-9] Graph Learning via Logic-Based Weisfeiler-Leman Variants and Tabularization

Link: https://arxiv.org/abs/2508.10651
Authors: Reijo Jaakkola, Tomi Janhunen, Antti Kuusisto, Magdalena Ortiz, Matias Selin, Mantas Šimkus
Subjects: Machine Learning (cs.LG)

Abstract:We present a novel approach for graph classification based on tabularizing graph data via variants of the Weisfeiler-Leman algorithm and then applying methods for tabular data. We investigate a comprehensive class of Weisfeiler-Leman variants obtained by modifying the underlying logical framework and establish a precise theoretical characterization of their expressive power. We then test two selected variants on twelve benchmark datasets that span a range of different domains. The experiments demonstrate that our approach matches the accuracy of state-of-the-art graph neural networks and graph kernels while being more time or memory efficient, depending on the dataset. We also briefly discuss directly extracting interpretable modal logic formulas from graph datasets.
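
The "tabularize, then classify" pipeline can be sketched with the plain 1-dimensional Weisfeiler-Leman colour refinement. This is an assumption made for illustration: the paper studies logic-based variants, and the function names below are hypothetical. Each graph becomes one table row of colour counts.

```python
from collections import Counter


def wl_color_histogram(adjacency, rounds=2):
    """Run plain 1-WL colour refinement and count the final colours.

    `adjacency` maps each node to a list of its neighbours. Colours are
    nested tuples, so they are comparable across different graphs.
    """
    colors = {node: 0 for node in adjacency}  # uniform initial colour
    for _ in range(rounds):
        # New colour = old colour plus the sorted multiset of neighbour colours.
        colors = {
            node: (colors[node], tuple(sorted(colors[nb] for nb in adjacency[node])))
            for node in adjacency
        }
    return Counter(colors.values())


def tabularize(graphs, rounds=2):
    """Turn a list of graphs into rows of colour counts (one column per colour)."""
    histograms = [wl_color_histogram(g, rounds) for g in graphs]
    columns = sorted({c for h in histograms for c in h}, key=repr)
    return [[h.get(c, 0) for c in columns] for h in histograms]
```

The resulting rows can be passed to any tabular learner. More refinement rounds give finer colours and therefore more, sparser columns.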

[LG-10] Conditional Information Bottleneck for Multimodal Fusion: Overcoming Shortcut Learning in Sarcasm Detection

Link: https://arxiv.org/abs/2508.10644
Authors: Yihua Wang, Qi Jia, Cong Xu, Feiyu Chen, Yuhan Liu, Haotian Zhang, Liang Jin, Lu Liu, Zhichun Wang
Subjects: Machine Learning (cs.LG)

Abstract:Multimodal sarcasm detection is a complex task that requires distinguishing subtle complementary signals across modalities while filtering out irrelevant information. Many advanced methods rely on learning shortcuts from datasets rather than extracting intended sarcasm-related features. However, our experiments show that shortcut learning impairs the model’s generalization in real-world scenarios. Furthermore, we reveal the weaknesses of current modality fusion strategies for multimodal sarcasm detection through systematic experiments, highlighting the necessity of focusing on effective modality fusion for complex emotion recognition. To address these challenges, we construct MUStARD++$^{R}$ by removing shortcut signals from MUStARD++. Then, a Multimodal Conditional Information Bottleneck (MCIB) model is introduced to enable efficient multimodal fusion for sarcasm detection. Experimental results show that the MCIB achieves the best performance without relying on shortcut learning.

[LG-11] Energy-Based Models for Predicting Mutational Effects on Proteins

Link: https://arxiv.org/abs/2508.10629
Authors: Patrick Soga, Zhenyu Lei, Yinhan He, Camille Bilodeau, Jundong Li
Subjects: Machine Learning (cs.LG)
Comments: 12 pages

Abstract:Predicting changes in binding free energy ($\Delta\Delta G$) is a vital task in protein engineering and protein-protein interaction (PPI) engineering for drug discovery. Previous works have observed a high correlation between $\Delta\Delta G$ and entropy, using probabilities of biologically important objects such as side chain angles and residue identities to estimate $\Delta\Delta G$. However, estimating the full conformational distribution of a protein complex is generally considered intractable. In this work, we propose a new approach to $\Delta\Delta G$ prediction that avoids this issue by instead leveraging energy-based models for estimating the probability of a complex’s conformation. Specifically, we novelly decompose $\Delta\Delta G$ into a sequence-based component estimated by an inverse folding model and a structure-based component estimated by an energy model. This decomposition is made tractable by assuming equilibrium between the bound and unbound states, allowing us to simplify the estimation of degeneracies associated with each state. Unlike previous deep learning-based methods, our method incorporates an energy-based physical inductive bias by connecting the often-used sequence log-odds ratio-based approach to $\Delta\Delta G$ prediction with a new $\Delta\Delta E$ term grounded in statistical mechanics. We demonstrate superiority over existing state-of-the-art structure and sequence-based deep learning methods in $\Delta\Delta G$ prediction and antibody optimization against SARS-CoV-2.

[LG-12] Beyond Random Sampling: Instance Quality-Based Data Partitioning via Item Response Theory

Link: https://arxiv.org/abs/2508.10628
Authors: Lucas Cardoso, Vitor Santos, José Ribeiro Filho, Ricardo Prudêncio, Regiane Kawasaki, Ronnie Alves
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 8 figures, 1 table, Accepted to the ENIAC 2025 conference

Abstract:Robust validation of Machine Learning (ML) models is essential, but traditional data partitioning approaches often ignore the intrinsic quality of each instance. This study proposes the use of Item Response Theory (IRT) parameters to characterize and guide the partitioning of datasets in the model validation stage. The impact of IRT-informed partitioning strategies on the performance of several ML models in four tabular datasets was evaluated. The results obtained demonstrate that IRT reveals an inherent heterogeneity of the instances and highlights the existence of informative subgroups of instances within the same dataset. Based on IRT, balanced partitions were created that consistently help to better understand the tradeoff between bias and variance of the models. In addition, the guessing parameter proved to be a determining factor: training with high-guessing instances can significantly impair model performance and resulted in cases with accuracy below 50%, while other partitions reached more than 70% in the same dataset.
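
For readers unfamiliar with the guessing parameter, the standard three-parameter logistic (3PL) IRT model is sketched below. How the paper maps instances and models onto items and respondents is not stated in the abstract; the sketch assumes each dataset instance plays the role of an item and each ML model the role of a respondent with ability `theta`.

```python
import math


def irt_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL IRT model.

    theta: respondent ability, a: discrimination, b: difficulty,
    c: guessing (the lower asymptote of the curve).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

With `c = 0.25`, even a very weak respondent is predicted to answer correctly about a quarter of the time, so a correct answer on a high-guessing item carries little evidence of ability.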

[LG-13] Variance Reduced Policy Gradient Method for Multi-Objective Reinforcement Learning

Link: https://arxiv.org/abs/2508.10608
Authors: Davide Guidobene, Lorenzo Benedetti, Diego Arapovic
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Statistics Theory (math.ST)
Comments: 7 pages, 4 figures

Abstract:Multi-Objective Reinforcement Learning (MORL) is a generalization of traditional Reinforcement Learning (RL) that aims to optimize multiple, often conflicting objectives simultaneously rather than focusing on a single reward. This approach is crucial in complex decision-making scenarios where agents must balance trade-offs between various goals, such as maximizing performance while minimizing costs. We consider the problem of MORL where the objectives are combined using a non-linear scalarization function. Just like in standard RL, policy gradient methods (PGMs) are amongst the most effective for handling large and continuous state-action spaces in MORL. However, existing PGMs for MORL suffer from high sample inefficiency, requiring large amounts of data to be effective. Previous attempts to solve this problem rely on overly strict assumptions, losing PGMs’ benefits in scalability to large state-action spaces. In this work, we address the issue of sample efficiency by implementing variance-reduction techniques to reduce the sample complexity of policy gradients while maintaining general assumptions.
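
As a hedged illustration of why non-linear scalarization complicates policy gradients, write the objective as $f(J_1(\theta), \dots, J_M(\theta))$, where $J_i(\theta)$ is the expected return of objective $i$ under policy parameters $\theta$. This notation is an assumption made here for illustration and is not taken from the paper. The chain rule gives:

```latex
\[
\nabla_\theta f\bigl(J(\theta)\bigr)
  = \sum_{i=1}^{M} \frac{\partial f}{\partial J_i}\bigl(J(\theta)\bigr)\,
    \nabla_\theta J_i(\theta)
\]
```

Because each partial derivative is evaluated at $J(\theta)$, an estimator needs both the returns and their gradients, so sampling noise in either one enters the gradient estimate. For linear $f$ the partial derivatives are constants and this extra dependence disappears.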

[LG-14] Oops!.. They Stole it Again: Attacks on Split Learning

Link: https://arxiv.org/abs/2508.10598
Authors: Tanveer Khan, Antonis Michalas
Subjects: Machine Learning (cs.LG)

Abstract:Split Learning (SL) is a collaborative learning approach that improves privacy by keeping data on the client-side while sharing only the intermediate output with a server. However, the distributed nature of SL introduces new security challenges, necessitating a comprehensive exploration of potential attacks. This paper systematically reviews various attacks on SL, classifying them based on factors such as the attacker’s role, the type of privacy risks, when data leaks occur, and where vulnerabilities exist. We also analyze existing defense methods, including cryptographic methods, data modification approaches, distributed techniques, and hybrid solutions. Our findings reveal security gaps, highlighting the effectiveness and limitations of existing defenses. By identifying open challenges and future directions, this work provides valuable information to improve SL privacy issues and guide further research.

[LG-15] Self-Supervised Temporal Super-Resolution of Energy Data using Generative Adversarial Transformer

Link: https://arxiv.org/abs/2508.10587
Authors: Xuanhao Mu, Gökhan Demirel, Yuzhe Zhang, Jianlei Liu, Thorsten Schlachter, Veit Hagenmeyer
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Numerical Analysis (math.NA)

Abstract:To bridge the temporal granularity gap in energy network design and operation based on Energy System Models, resampling of time series is required. While conventional upsampling methods are computationally efficient, they often result in significant information loss or increased noise. Advanced models such as time series generation models, Super-Resolution models and imputation models show potential, but also face fundamental challenges. The goal of time series generative models is to learn the distribution of the original data to generate high-resolution series with similar statistical characteristics. This is not entirely consistent with the definition of upsampling. Time series Super-Resolution models or imputation models can degrade the accuracy of upsampling because the input low-resolution time series are sparse and may have insufficient context. Moreover, such models usually rely on supervised learning paradigms. This presents a fundamental application paradox: their training requires the high-resolution time series that is intrinsically absent in upsampling application scenarios. To address the mentioned upsampling issue, this paper introduces a new method utilizing Generative Adversarial Transformers (GATs), which can be trained without access to any ground-truth high-resolution data. Compared with conventional interpolation methods, the introduced method can reduce the root mean square error (RMSE) of upsampling tasks by 9%, and the accuracy of a model predictive control (MPC) application scenario is improved by 13%.

[LG-16] GNN-based Unified Deep Learning

Link: https://arxiv.org/abs/2508.10583
Authors: Furkan Pala, Islem Rekik
Subjects: Machine Learning (cs.LG)

Abstract:Deep learning models often struggle to maintain generalizability in medical imaging, particularly under domain-fracture scenarios where distribution shifts arise from varying imaging techniques, acquisition protocols, patient populations, demographics, and equipment. In practice, each hospital may need to train distinct models - differing in learning task, width, and depth - to match local data. For example, one hospital may use Euclidean architectures such as MLPs and CNNs for tabular or grid-like image data, while another may require non-Euclidean architectures such as graph neural networks (GNNs) for irregular data like brain connectomes. How to train such heterogeneous models coherently across datasets, while enhancing each model’s generalizability, remains an open problem. We propose unified learning, a new paradigm that encodes each model into a graph representation, enabling unification in a shared graph learning space. A GNN then guides optimization of these unified models. By decoupling parameters of individual models and controlling them through a unified GNN (uGNN), our method supports parameter sharing and knowledge transfer across varying architectures (MLPs, CNNs, GNNs) and distributions, improving generalizability. Evaluations on MorphoMNIST and two MedMNIST benchmarks - PneumoniaMNIST and BreastMNIST - show that unified learning boosts performance when models are trained on unique distributions and tested on mixed ones, demonstrating strong robustness to unseen data with large distribution shifts. Code and benchmarks: this https URL

[LG-17] Technical Report: Facilitating the Adoption of Causal Inference Methods Through LLM-Empowered Co-Pilot

Link: https://arxiv.org/abs/2508.10581
Authors: Jeroen Berrevoets, Julianna Piskorz, Robert Davis, Harry Amad, Jim Weatherall, Mihaela van der Schaar
Subjects: Machine Learning (cs.LG)

Abstract:Estimating treatment effects (TE) from observational data is a critical yet complex task in many fields, from healthcare and economics to public policy. While recent advances in machine learning and causal inference have produced powerful estimation techniques, their adoption remains limited due to the need for deep expertise in causal assumptions, adjustment strategies, and model selection. In this paper, we introduce CATE-B, an open-source co-pilot system that uses large language models (LLMs) within an agentic framework to guide users through the end-to-end process of treatment effect estimation. CATE-B assists in (i) constructing a structural causal model via causal discovery and LLM-based edge orientation, (ii) identifying robust adjustment sets through a novel Minimal Uncertainty Adjustment Set criterion, and (iii) selecting appropriate regression methods tailored to the causal structure and dataset characteristics. To encourage reproducibility and evaluation, we release a suite of benchmark tasks spanning diverse domains and causal complexities. By combining causal inference with intelligent, interactive assistance, CATE-B lowers the barrier to rigorous causal analysis and lays the foundation for a new class of benchmarks in automated treatment effect estimation.

[LG-18] Reproducible Physiological Features in Affective Computing: A Preliminary Analysis on Arousal Modeling

Link: https://arxiv.org/abs/2508.10561
Authors: Andrea Gargano, Jasin Machkour, Mimma Nardelli, Enzo Pasquale Scilingo, Michael Muma
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Submitted to 2025 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and Neural Engineering (MetroXRAINE). 6 pages, 3 figures

Abstract:In Affective Computing, a key challenge lies in reliably linking subjective emotional experiences with objective physiological markers. This preliminary study addresses the issue of reproducibility by identifying physiological features from cardiovascular and electrodermal signals that are associated with continuous self-reports of arousal levels. Using the Continuously Annotated Signal of Emotion dataset, we analyzed 164 features extracted from cardiac and electrodermal signals of 30 participants exposed to short emotion-evoking videos. Feature selection was performed using the Terminating-Random Experiments (T-Rex) method, which performs variable selection systematically controlling a user-defined target False Discovery Rate. Remarkably, among all candidate features, only two electrodermal-derived features exhibited reproducible and statistically significant associations with arousal, achieving a 100% confirmation rate. These results highlight the necessity of rigorous reproducibility assessments in physiological features selection, an aspect often overlooked in Affective Computing. Our approach is particularly promising for applications in safety-critical environments requiring trustworthy and reliable white box models, such as mental disorder recognition and human-robot interaction systems.

[LG-19] Driving Accurate Allergen Prediction with Protein Language Models and Generalization-Focused Evaluation

Link: https://arxiv.org/abs/2508.10541
Authors: Brian Shing-Hei Wong, Joshua Mincheol Kim, Sin-Hang Fung, Qing Xiong, Kelvin Fu-Kiu Ao, Junkang Wei, Ran Wang, Dan Michelle Wang, Jingying Zhou, Bo Feng, Alfred Sze-Lok Cheng, Kevin Y. Yip, Stephen Kwok-Wing Tsui, Qin Cao
Subjects: Machine Learning (cs.LG)
Comments: 59 pages, 5 main figures, 15 supplementary figures, 2 supplementary tables

Abstract:Allergens, typically proteins capable of triggering adverse immune responses, represent a significant public health challenge. To accurately identify allergen proteins, we introduce Applm (Allergen Prediction with Protein Language Models), a computational framework that leverages the 100-billion parameter xTrimoPGLM protein language model. We show that Applm consistently outperforms seven state-of-the-art methods in a diverse set of tasks that closely resemble difficult real-world scenarios. These include identifying novel allergens that lack similar examples in the training set, differentiating between allergens and non-allergens among homologs with high sequence similarity, and assessing functional consequences of mutations that create few changes to the protein sequences. Our analysis confirms that xTrimoPGLM, originally trained on one trillion tokens to capture general protein sequence characteristics, is crucial for Applm’s performance by detecting important differences among protein sequences. In addition to providing Applm as open-source software, we also provide our carefully curated benchmark datasets to facilitate future research.

[LG-20] Projected Coupled Diffusion for Test-Time Constrained Joint Generation

Link: https://arxiv.org/abs/2508.10531
Authors: Hao Luan, Yi Xian Goh, See-Kiong Ng, Chun Kai Ling
Subjects: Machine Learning (cs.LG)
Comments: 37 pages

Abstract:Modifications to test-time sampling have emerged as an important extension to diffusion algorithms, with the goal of biasing the generative process to achieve a given objective without having to retrain the entire diffusion model. However, generating jointly correlated samples from multiple pre-trained diffusion models while simultaneously enforcing task-specific constraints without costly retraining has remained challenging. To this end, we propose Projected Coupled Diffusion (PCD), a novel test-time framework for constrained joint generation. PCD introduces a coupled guidance term into the generative dynamics to encourage coordination between diffusion models and incorporates a projection step at each diffusion step to enforce hard constraints. Empirically, we demonstrate the effectiveness of PCD in application scenarios of image-pair generation, object manipulation, and multi-robot motion planning. Our results show improved coupling effects and guaranteed constraint satisfaction without incurring excessive computational costs.
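
The two ingredients named in the abstract, a coupling term in the dynamics and a projection after every step, can be illustrated with a toy sampler. This is a sketch under strong assumptions and not PCD itself: the two "models" are one-dimensional Gaussians with known scores, the coupling is a quadratic penalty, the constraint is a box, and the update is plain Langevin dynamics. All names are hypothetical.

```python
import math
import random


def clip(value, low, high):
    """Euclidean projection of a scalar onto the interval [low, high]."""
    return max(low, min(high, value))


def coupled_projected_langevin(steps=500, step_size=0.01, coupling=2.0,
                               box=(-1.0, 1.0), seed=0):
    """Sample (x, y) from two coupled toy models under a hard box constraint."""
    rng = random.Random(seed)
    x, y = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    noise_scale = math.sqrt(2.0 * step_size)
    for _ in range(steps):
        # Scores of two independent toy "models": N(-2, 1) and N(+2, 1).
        score_x = -(x + 2.0)
        score_y = -(y - 2.0)
        # Coupling term: gradient of -0.5 * coupling * (x - y)^2.
        guide_x = -coupling * (x - y)
        guide_y = -coupling * (y - x)
        x += step_size * (score_x + guide_x) + noise_scale * rng.gauss(0.0, 1.0)
        y += step_size * (score_y + guide_y) + noise_scale * rng.gauss(0.0, 1.0)
        # Projection step: enforce the hard constraint after every update.
        x, y = clip(x, *box), clip(y, *box)
    return x, y
```

Because the projection runs last in every iteration, the returned pair always satisfies the constraint, whatever the scores and coupling strength are.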

[LG-21] Nonlocal Monte Carlo via Reinforcement Learning

Link: https://arxiv.org/abs/2508.10520
Authors: Dmitrii Dobrynin, Masoud Mohseni, John Paul Strachan
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)

Abstract:Optimizing or sampling complex cost functions of combinatorial optimization problems is a longstanding challenge across disciplines and applications. When employing the family of conventional algorithms based on Markov Chain Monte Carlo (MCMC), such as simulated annealing or parallel tempering, one assumes homogeneous (equilibrium) temperature profiles across the input. This instance-independent approach was shown to be ineffective for the hardest benchmarks near a computational phase transition when the so-called overlap-gap property holds. In these regimes, conventional MCMC struggles to unfreeze rigid variables, escape suboptimal basins of attraction, and sample high-quality and diverse solutions. To mitigate these challenges, Nonequilibrium Nonlocal Monte Carlo (NMC) algorithms were proposed that leverage inhomogeneous temperature profiles, thereby accelerating exploration of the configuration space without compromising its exploitation. Here, we employ deep reinforcement learning (RL) to train the nonlocal transition policies of NMC, which were previously designed phenomenologically. We demonstrate that the resulting solver can be trained solely by observing energy changes of the configuration space exploration as RL rewards and the local minimum energy landscape geometry as RL states. We further show that the trained policies improve upon the standard MCMC-based and nonlocal simulated annealing on hard uniform random and scale-free random 4-SAT benchmarks in terms of residual energy, time-to-solution, and diversity of solutions metrics.

[LG-22] Learning State-Space Models of Dynamic Systems from Arbitrary Data using Joint Embedding Predictive Architectures

Link: https://arxiv.org/abs/2508.10489
Authors: Jonas Ulmen, Ganesh Sundaram, Daniel Görges
Subjects: Machine Learning (cs.LG)
Comments: 6 Pages, Published in IFAC Joint Symposia on Mechatronics Robotics 2025

Abstract:With the advent of Joint Embedding Predictive Architectures (JEPAs), which appear to be more capable than reconstruction-based methods, this paper introduces a novel technique for creating world models using continuous-time dynamic systems from arbitrary observation data. The proposed method integrates sequence embeddings with neural ordinary differential equations (neural ODEs). It employs loss functions that enforce contractive embeddings and Lipschitz constants in state transitions to construct a well-organized latent state space. The approach’s effectiveness is demonstrated through the generation of structured latent state-space models for a simple pendulum system using only image data. This opens up a new technique for developing more general control algorithms and estimation techniques with broad applications in robotics.

[LG-23] Confounding is a Pervasive Problem in Real World Recommender Systems

Link: https://arxiv.org/abs/2508.10479
Authors: Alexander Merkov, David Rohde, Alexandre Gilotte, Benjamin Heymann
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)
Comments: 12 pages, 4 figures

Abstract:Unobserved confounding arises when an unmeasured feature influences both the treatment and the outcome, leading to biased causal effect estimates. This issue undermines observational studies in fields like economics, medicine, ecology or epidemiology. Recommender systems leveraging fully observed data seem not to be vulnerable to this problem. However, many standard practices in recommender systems result in observed features being ignored, resulting in effectively the same problem. This paper will show that numerous common practices such as feature engineering, A/B testing and modularization can in fact introduce confounding into recommendation systems and hamper their performance. Several illustrations of the phenomena are provided, supported by simulation studies with practical suggestions about how practitioners may reduce or avoid the effects of confounding in real systems.

[LG-24] EDAPT: Towards Calibration-Free BCIs with Continual Online Adaptation

Link: https://arxiv.org/abs/2508.10474
Authors: Lisa Haxel, Jaivardhan Kapoor, Ulf Ziemann, Jakob H. Macke
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Neurons and Cognition (q-bio.NC)
Comments: Preprint

Abstract:Brain-computer interfaces (BCIs) suffer from accuracy degradation as neural signals drift over time and vary across users, requiring frequent recalibration that limits practical deployment. We introduce EDAPT, a task- and model-agnostic framework that eliminates calibration through continual model adaptation. EDAPT first trains a baseline decoder using data from multiple users, then continually personalizes this model via supervised finetuning as the neural patterns evolve during use. We tested EDAPT across nine datasets covering three BCI tasks, and found that it consistently improved accuracy over conventional, static methods. These improvements primarily stem from combining population-level pretraining and online continual finetuning, with unsupervised domain adaptation providing further gains on some datasets. EDAPT runs efficiently, updating models within 200 milliseconds on consumer-grade hardware. Finally, decoding accuracy scales with total data budget rather than its allocation between subjects and trials. EDAPT provides a practical pathway toward calibration-free BCIs, reducing a major barrier to BCI deployment.
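
The "pretrain on a population, then keep finetuning online" loop can be sketched as a predict-then-update cycle. The decoder below is a plain logistic regression trained with single-example SGD, chosen only to keep the sketch self-contained; it is not EDAPT's decoder, and the function names and learning rate are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    """Probability of class 1 under a linear logistic decoder."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)))


def online_finetune(weights, stream, lr=0.1):
    """Decode each trial first, then take one SGD step on its label.

    `weights` stands in for a decoder pretrained on other users;
    `stream` is a list of (features, label) pairs in arrival order.
    Returns the adapted weights and the accuracy measured before each update.
    """
    weights = list(weights)  # leave the pretrained weights untouched
    correct = 0
    for features, label in stream:
        prob = predict(weights, features)
        correct += int((prob >= 0.5) == bool(label))
        # Gradient of the logistic loss for a single example.
        error = prob - label
        weights = [w - lr * error * f for w, f in zip(weights, features)]
    return weights, correct / len(stream)
```

Scoring each trial before updating on it means the reported accuracy reflects what a user would experience during live use, since no trial is decoded by a model that has already seen its label.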

[LG-25] GraphFedMIG: Tackling Class Imbalance in Federated Graph Learning via Mutual Information-Guided Generation

Link: https://arxiv.org/abs/2508.10471
Authors: Xinrui Li, Qilin Fan, Tianfu Wang, Kaiwen Wei, Ke Yu, Xu Zhang
Subjects: Machine Learning (cs.LG)

Abstract:Federated graph learning (FGL) enables multiple clients to collaboratively train powerful graph neural networks without sharing their private, decentralized graph data. Inherited from generic federated learning, FGL is critically challenged by statistical heterogeneity, where non-IID data distributions across clients can severely impair model performance. A particularly destructive form of this is class imbalance, which causes the global model to become biased towards majority classes and fail at identifying rare but critical events. This issue is exacerbated in FGL, as nodes from a minority class are often surrounded by biased neighborhood information, hindering the learning of expressive embeddings. To grapple with this challenge, we propose GraphFedMIG, a novel FGL framework that reframes the problem as a federated generative data augmentation task. GraphFedMIG employs a hierarchical generative adversarial network where each client trains a local generator to synthesize high-fidelity feature representations. To provide tailored supervision, clients are grouped into clusters, each sharing a dedicated discriminator. Crucially, the framework designs a mutual information-guided mechanism to steer the evolution of these client generators. By calculating each client’s unique informational value, this mechanism corrects the local generator parameters, ensuring that subsequent rounds of mutual information-guided generation are focused on producing high-value, minority-class features. We conduct extensive experiments on four real-world datasets, and the results demonstrate the superiority of the proposed GraphFedMIG compared with other baselines.

[LG-26] Efficient Methods for Accurate Sparse Trajectory Recovery and Map Matching

Link: https://arxiv.org/abs/2508.10460
Authors: Wei Tian, Jieming Shi, Man Lung Yiu
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Comments: 13 pages, accepted by 2025 IEEE 41st International Conference on Data Engineering (ICDE)

Abstract:Real-world trajectories are often sparse with low-sampling rates (i.e., long intervals between consecutive GPS points) and misaligned with road networks, yet many applications demand high-quality data for optimal performance. To improve data quality with sparse trajectories as input, we systematically study two related research problems: trajectory recovery on road network, which aims to infer missing points to recover high-sampling trajectories, and map matching, which aims to map GPS points to road segments to determine underlying routes. In this paper, we present efficient methods TRMMA and MMA for accurate trajectory recovery and map matching, respectively, where MMA serves as the first step of TRMMA. In MMA, we carefully formulate a classification task to map a GPS point from sparse trajectories to a road segment over a small candidate segment set, rather than the entire road network. We develop techniques in MMA to generate effective embeddings that capture the patterns of GPS data, directional information, and road segments, to accurately align sparse trajectories to routes. For trajectory recovery, TRMMA focuses on the segments in the route returned by MMA to infer missing points with position ratios on road segments, producing high-sampling trajectories efficiently by avoiding evaluation of all road segments. Specifically, in TRMMA, we design a dual-transformer encoding process to cohesively capture latent patterns in trajectories and routes, and an effective decoding technique to sequentially predict the position ratios and road segments of missing points. We conduct extensive experiments to compare TRMMA and MMA with numerous existing methods for trajectory recovery and map matching, respectively, on 4 large real-world datasets. TRMMA and MMA consistently achieve the best result quality, often by a significant margin.

[LG-27] SC2Arena and StarEvolve: Benchmark and Self-Improvement Framework for LLMs in Complex Decision-Making Tasks

Link: https://arxiv.org/abs/2508.10428
Authors: Pengbo Shen,Yaqing Wang,Ni Mu,Yao Luan,Runpeng Xie,Senhao Yang,Lexiang Wang,Hao Hu,Shuang Xu,Yiqin Yang,Bo Xu
Categories: Machine Learning (cs.LG)

Abstract:Evaluating large language models (LLMs) in complex decision-making is essential for advancing AI’s ability for strategic planning and real-time adaptation. However, existing benchmarks for tasks like StarCraft II fail to capture the game’s full complexity, such as its complete game context, diverse action spaces, and all playable races. To address this gap, we present SC2Arena, a benchmark that fully supports all playable races, low-level action spaces, and optimizes text-based observations to tackle spatial reasoning challenges. Complementing this, we introduce StarEvolve, a hierarchical framework that integrates strategic planning with tactical execution, featuring iterative self-correction and continuous improvement via fine-tuning on high-quality gameplay data. Its key components include a Planner-Executor-Verifier structure to break down gameplay, and a scoring system for selecting high-quality training samples. Comprehensive analysis using SC2Arena provides valuable insights into developing generalist agents that were not possible with previous benchmarks. Experimental results also demonstrate that our proposed StarEvolve achieves superior performance in strategic planning. Our code, environment, and algorithms are publicly available.

[LG-28] XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization

Link: https://arxiv.org/abs/2508.10395
Authors: Aditya Tomar,Coleman Hooper,Minjae Lee,Haocheng Xi,Rishabh Tiwari,Wonjun Kang,Luca Manolache,Michael W. Mahoney,Kurt Keutzer,Amir Gholami
Categories: Machine Learning (cs.LG)
Comments: 24 pages

Abstract:Although LLM inference has emerged as a critical workload for many downstream applications, efficiently inferring LLMs is challenging due to the substantial memory footprint and bandwidth requirements. In parallel, compute capabilities have steadily outpaced both memory capacity and bandwidth over the last few decades, a trend that remains evident in modern GPU hardware and exacerbates the challenge of LLM inference. As such, new algorithms are emerging that trade increased computation for reduced memory operations. To that end, we present XQuant, which takes advantage of this trend, enabling an order-of-magnitude reduction in memory consumption through low-bit quantization with substantial accuracy benefits relative to state-of-the-art KV cache quantization methods. We accomplish this by quantizing and caching the layer input activations X, instead of using standard KV caching, and then rematerializing the Keys and Values on-the-fly during inference. This results in an immediate $2\times$ memory savings compared to KV caching. By applying XQuant, we achieve up to $\sim 7.7\times$ memory savings with 0.1 perplexity degradation compared to the FP16 baseline. Furthermore, our approach leverages the fact that X values are similar across layers. Building on this observation, we introduce XQuant-CL, which exploits the cross-layer similarity in the X embeddings for extreme compression. Across different models, XQuant-CL attains up to $10\times$ memory savings relative to the FP16 baseline with only 0.01 perplexity degradation, and $12.5\times$ memory savings with only 0.1 perplexity degradation. XQuant exploits the rapidly increasing compute capabilities of hardware platforms to eliminate the memory bottleneck, while surpassing state-of-the-art KV cache quantization methods and achieving near-FP16 accuracy across a wide range of models.
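
The caching idea in this abstract (store the layer input X and recompute Keys and Values when needed) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a standard multi-head attention layer in which K and V each have the same width as X, it omits quantization entirely, and the weights, sizes, and function names are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

random.seed(0)
seq_len, d_model = 4, 8  # toy sizes, chosen only for illustration

# Hypothetical projection weights and layer input for one attention layer.
W_k = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_model)]
W_v = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_model)]
X = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(seq_len)]

# Standard KV caching stores both K and V for every token.
kv_cache = {"K": matmul(X, W_k), "V": matmul(X, W_v)}
kv_entries = 2 * seq_len * d_model

# X caching stores only the layer input.
x_cache = X
x_entries = seq_len * d_model

def rematerialize(x_cached):
    """Recompute K and V from the cached input: more compute, less memory."""
    return matmul(x_cached, W_k), matmul(x_cached, W_v)

K, V = rematerialize(x_cache)
memory_ratio = kv_entries / x_entries  # stored entries: KV cache vs. X cache
```

Under these assumptions the X cache holds half as many entries as the KV cache, which matches the 2x figure quoted before any quantization is applied.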

[LG-29] A Unified Evaluation Framework for Multi-Annotator Tendency Learning

Link: https://arxiv.org/abs/2508.10393
Authors: Liyun Zhang,Jingcheng Ke,Shenli Fan,Xuanmeng Sha,Zheng Lian
Categories: Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 9 pages

Abstract:Recent works have emerged in multi-annotator learning that shift focus from Consensus-oriented Learning (CoL), which aggregates multiple annotations into a single ground-truth prediction, to Individual Tendency Learning (ITL), which models annotator-specific labeling behavior patterns (i.e., tendency) to provide explanation analysis for understanding annotator decisions. However, no evaluation framework currently exists to assess whether ITL methods truly capture individual tendencies and provide meaningful behavioral explanations. To address this gap, we propose the first unified evaluation framework with two novel metrics: (1) Difference of Inter-annotator Consistency (DIC) quantifies how well models capture annotator tendencies by comparing predicted inter-annotator similarity structures with ground-truth; (2) Behavior Alignment Explainability (BAE) evaluates how well model explanations reflect annotator behavior and decision relevance by aligning explainability-derived with ground-truth labeling similarity structures via Multidimensional Scaling (MDS). Extensive experiments validate the effectiveness of our proposed evaluation framework.

[LG-30] Clicks Versus Conversion: Choosing a Recommender's Training Objective in E-Commerce

Link: https://arxiv.org/abs/2508.10377
Authors: Michael Weiss,Robert Rosenbach,Christian Eggenberger
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Ranking product recommendations to optimize for a high click-through rate (CTR) or for high conversion, such as add-to-cart rate (ACR) and Order-Submit-Rate (OSR, view-to-purchase conversion) are standard practices in e-commerce. Optimizing for CTR appears like a straightforward choice: Training data (i.e., click data) are simple to collect and often available in large quantities. Additionally, CTR is used far beyond e-commerce, making it a generalist, easily implemented option. ACR and OSR, on the other hand, are more directly linked to a shop’s business goals, such as the Gross Merchandise Value (GMV). In this paper, we compare the effects of using either of these objectives using an online A/B test. Among our key findings, we demonstrate that in our shops, optimizing for OSR produces a GMV uplift more than five times larger than when optimizing for CTR, without sacrificing new product discovery. Our results also provide insights into the different feature importances for each of the objectives.

[LG-31] Semantic Communication with Distribution Learning through Sequential Observations

Link: https://arxiv.org/abs/2508.10350
Authors: Samer Lahoud,Kinda Khawam
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:Semantic communication aims to convey meaning rather than bit-perfect reproduction, representing a paradigm shift from traditional communication. This paper investigates distribution learning in semantic communication where receivers must infer the underlying meaning distribution through sequential observations. While semantic communication traditionally optimizes individual meaning transmission, we establish fundamental conditions for learning source statistics when priors are unknown. We prove that learnability requires full rank of the effective transmission matrix, characterize the convergence rate of distribution estimation, and quantify how estimation errors translate to semantic distortion. Our analysis reveals a fundamental trade-off: encoding schemes optimized for immediate semantic performance often sacrifice long-term learnability. Experiments on CIFAR-10 validate our theoretical framework, demonstrating that system conditioning critically impacts both learning rate and achievable performance. These results provide the first rigorous characterization of statistical learning in semantic communication and offer design principles for systems that balance immediate performance with adaptation capability.

[LG-32] Flexible Personalized Split Federated Learning for On-Device Fine-Tuning of Foundation Models

Link: https://arxiv.org/abs/2508.10349
Authors: Tianjun Yuan,Jiaxiang Geng,Pengchao Han,Xianhao Chen,Bing Luo
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 10 pages, Submitted to INFOCOM2026

Abstract:Fine-tuning foundation models is critical for superior performance on personalized downstream tasks, compared to using pre-trained models. Collaborative learning can leverage local clients’ datasets for fine-tuning, but limited client data and heterogeneous data distributions hinder effective collaboration. To address the challenge, we propose a flexible personalized federated learning paradigm that enables clients to engage in collaborative learning while maintaining personalized objectives. Given the limited and heterogeneous computational resources available on clients, we introduce flexible personalized split federated learning (FlexP-SFL). Based on split learning, FlexP-SFL allows each client to train a portion of the model locally while offloading the rest to a server, according to resource constraints. Additionally, we propose an alignment strategy to improve personalized model performance on global data. Experimental results show that FlexP-SFL outperforms baseline models in personalized fine-tuning efficiency and final accuracy.

[LG-33] A Hierarchical IDS for Zero-Day Attack Detection in Internet of Medical Things Networks

Link: https://arxiv.org/abs/2508.10346
Authors: Md Ashraf Uddin,Nam H. Chu,Reza Rafeh
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 13 pages, and 4 figures

Abstract:The Internet of Medical Things (IoMT) is driving a healthcare revolution but remains vulnerable to cyberattacks such as denial of service, ransomware, data hijacking, and spoofing. These networks comprise resource-constrained, heterogeneous devices (e.g., wearable sensors, smart pills, implantables), making traditional centralized Intrusion Detection Systems (IDSs) unsuitable due to response delays, privacy risks, and added vulnerabilities. Centralized IDSs require all sensors to transmit data to a central server, causing delays or network disruptions in dense environments. Running IDSs locally on IoMT devices is often infeasible due to limited computation, and even lightweight IDS components remain at risk if updated models are delayed, leaving them exposed to zero-day attacks that threaten patient health and data security. We propose a multi-level IoMT IDS framework capable of detecting zero-day attacks and distinguishing between known and unknown threats. The first layer (near Edge) filters traffic at a coarse level (attack or not) using meta-learning or One Class Classification (OCC) with the usfAD algorithm. Subsequent layers (far Edge, Cloud) identify attack type and novelty. Experiments on the CICIoMT2024 dataset show 99.77% accuracy and a 97.8% F1-score. The first layer detects zero-day attacks with high accuracy without needing new datasets, ensuring strong applicability in IoMT environments. Additionally, the meta-learning approach achieves high.

[LG-34] Uncertainty-Aware Prediction of Parkinson's Disease Medication Needs: A Two-Stage Conformal Prediction Approach

Link: https://arxiv.org/abs/2508.10284
Authors: Ricardo Diaz-Rincon,Muxuan Liang,Adolfo Ramirez-Zamora,Benjamin Shickel
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: Accepted to MLHC 2025

Abstract:Parkinson’s Disease (PD) medication management presents unique challenges due to heterogeneous disease progression and treatment response. Neurologists must balance symptom control with optimal dopaminergic dosing based on functional disability while minimizing side effects. This balance is crucial as inadequate or abrupt changes can cause levodopa-induced dyskinesia, wearing off, and neuropsychiatric effects, significantly reducing quality of life. Current approaches rely on trial-and-error decisions without systematic predictive methods. Despite machine learning advances, clinical adoption remains limited due to reliance on point predictions that do not account for prediction uncertainty, undermining clinical trust and utility. Clinicians require not only predictions of future medication needs but also reliable confidence measures. Without quantified uncertainty, adjustments risk premature escalation to maximum doses or prolonged inadequate symptom control. We developed a conformal prediction framework anticipating medication needs up to two years in advance with reliable prediction intervals and statistical guarantees. Our approach addresses zero-inflation in PD inpatient data, where patients maintain stable medication regimens between visits. Using electronic health records from 631 inpatient admissions at University of Florida Health (2011-2021), our two-stage approach identifies patients likely to need medication changes, then predicts required levodopa equivalent daily dose adjustments. Our framework achieved marginal coverage while reducing prediction interval lengths compared to traditional approaches, providing precise predictions for short-term planning and wider ranges for long-term forecasting. By quantifying uncertainty, our approach enables evidence-based decisions about levodopa dosing, optimizing symptom control while minimizing side effects and improving life quality.
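
For readers unfamiliar with conformal prediction, the following is a minimal sketch of generic split conformal regression, which is the basic mechanism behind prediction intervals with marginal coverage guarantees. It is not the two-stage method of the paper; the residual values, the point prediction, and the function names are invented for illustration.

```python
import math

def conformal_quantile(residuals, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest absolute residual."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(abs(r) for r in residuals)[rank - 1]

def prediction_interval(point_prediction, q):
    """Symmetric interval around a point prediction."""
    return point_prediction - q, point_prediction + q

# Hypothetical calibration residuals (observed minus predicted dose change).
calibration_residuals = [-30.0, 12.0, 5.0, -8.0, 40.0, -15.0, 22.0, 3.0, -50.0]

# alpha = 0.25 targets 75% marginal coverage on exchangeable data.
q = conformal_quantile(calibration_residuals, alpha=0.25)
low, high = prediction_interval(300.0, q)
```

With nine calibration points and alpha = 0.25, the rank is ceil(10 * 0.75) = 8, so the interval half-width is the eighth-smallest absolute residual.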

[LG-35] The Conditional Regret-Capacity Theorem for Batch Universal Prediction

Link: https://arxiv.org/abs/2508.10282
Authors: Marco Bondaschi,Michael Gastpar
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We derive a conditional version of the classical regret-capacity theorem. This result can be used in universal prediction to find lower bounds on the minimal batch regret, which is a recently introduced generalization of the average regret, when batches of training data are available to the predictor. As an example, we apply this result to the class of binary memoryless sources. Finally, we generalize the theorem to Rényi information measures, revealing a deep connection between the conditional Rényi divergence and the conditional Sibson’s mutual information.

[LG-36] Source Component Shift Adaptation via Offline Decomposition and Online Mixing Approach ECAI2025

Link: https://arxiv.org/abs/2508.10257
Authors: Ryuta Matsuno
Categories: Machine Learning (cs.LG)
Comments: To appear in ECAI 2025

Abstract:This paper addresses source component shift adaptation, aiming to update predictions adapting to source component shifts for incoming data streams based on past training data. Existing online learning methods often fail to utilize recurring shifts effectively, while model-pool-based methods struggle to capture individual source components, leading to poor adaptation. In this paper, we propose a source component shift adaptation method via an offline decomposition and online mixing approach. We theoretically identify that the problem can be divided into two subproblems: offline source component decomposition and online mixing weight adaptation. Based on this, our method first determines prediction models, each of which learns a source component solely based on past training data offline through the EM algorithm. Then, it updates the mixing weight of the prediction models for precise prediction through online convex optimization. Thanks to our theoretical derivation, our method fully leverages the characteristics of the shifts, achieving superior adaptation performance over existing methods. Experiments conducted on various real-world regression datasets demonstrate that our method outperforms baselines, reducing the cumulative test loss by up to 67.4%.
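
The online half of this approach (adapting mixing weights over a fixed pool of component models) can be sketched as follows. The abstract does not state which online convex optimization update is used, so this sketch substitutes a standard exponentiated-weights (Hedge-style) step as a stand-in; the two component models, the data stream, and the learning rate are hypothetical.

```python
import math

def update_weights(weights, losses, learning_rate=0.5):
    """Exponentiated-weights step: shrink the weight of models with high loss."""
    scaled = [w * math.exp(-learning_rate * l) for w, l in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

def mix(weights, predictions):
    """Weighted combination of the component predictions."""
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical component models, assumed to have been fitted offline.
models = [lambda x: 2.0 * x, lambda x: -1.0 * x]
weights = [0.5, 0.5]

# Hypothetical stream generated by the first component (y = 2x).
stream = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
for x, y in stream:
    preds = [m(x) for m in models]
    losses = [(p - y) ** 2 for p in preds]
    weights = update_weights(weights, losses)

final_prediction = mix(weights, [m(4.0) for m in models])
```

After a few observations the weight mass concentrates on the component that currently explains the stream, so the mixed prediction tracks that component.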

[LG-37] Federated Anomaly Detection for Multi-Tenant Cloud Platforms with Personalized Modeling

Link: https://arxiv.org/abs/2508.10255
Authors: Yuxi Wang,Heyao Liu,Nyutian Long,Guanzi Yao
Categories: Machine Learning (cs.LG)

Abstract:This paper proposes an anomaly detection method based on federated learning to address key challenges in multi-tenant cloud environments, including data privacy leakage, heterogeneous resource behavior, and the limitations of centralized modeling. The method establishes a federated training framework involving multiple tenants. Each tenant trains the model locally using private resource usage data. Through parameter aggregation, a global model is optimized, enabling cross-tenant collaborative anomaly detection while preserving data privacy. To improve adaptability to diverse resource usage patterns, a personalized parameter adjustment mechanism is introduced. This allows the model to retain tenant-specific feature representations while sharing global knowledge. In the model output stage, the Mahalanobis distance is used to compute anomaly scores. This enhances both the accuracy and stability of anomaly detection. The experiments use real telemetry data from a cloud platform to construct a simulated multi-tenant environment. The study evaluates the model’s performance under varying participation rates and noise injection levels. These comparisons demonstrate the proposed method’s robustness and detection accuracy. Experimental results show that the proposed method outperforms existing mainstream models across key metrics such as Precision, Recall, and F1-Score. It also maintains stable performance in various complex scenarios. These findings highlight the method’s practical potential for intelligent resource monitoring and anomaly diagnosis in cloud computing environments.
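
The Mahalanobis distance used for anomaly scoring in this abstract can be sketched in two dimensions as follows. This is only an illustration of the distance itself, not of the federated model: the four reference points, the two-feature setting, and the function names are assumptions.

```python
def mean_and_cov_2d(points):
    """Sample mean and sample covariance of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis_2d(x, mean, cov):
    """Distance of x from mean, scaled by the covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mean)^T cov^{-1} (x - mean), using the 2x2 inverse.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return q ** 0.5

# Hypothetical normalized (CPU, memory) readings taken as normal behavior.
normal = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean, cov = mean_and_cov_2d(normal)

score_typical = mahalanobis_2d((1.0, 1.0), mean, cov)
score_outlier = mahalanobis_2d((5.0, 1.0), mean, cov)
```

A point at the mean scores zero, and the score grows with distance measured in units of the observed spread, which is why the distance is a natural anomaly score for correlated telemetry features.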

[LG-38] Multi-Agent Reinforcement Learning for Adaptive Resource Orchestration in Cloud-Native Clusters

Link: https://arxiv.org/abs/2508.10253
Authors: Guanzi Yao,Heyao Liu,Linyan Dai
Categories: Machine Learning (cs.LG)

Abstract:This paper addresses the challenges of high resource dynamism and scheduling complexity in cloud-native database systems. It proposes an adaptive resource orchestration method based on multi-agent reinforcement learning. The method introduces a heterogeneous role-based agent modeling mechanism. This allows different resource entities, such as compute nodes, storage nodes, and schedulers, to adopt distinct policy representations. These agents are better able to reflect diverse functional responsibilities and local environmental characteristics within the system. A reward-shaping mechanism is designed to integrate local observations with global feedback. This helps mitigate policy learning bias caused by incomplete state observations. By combining real-time local performance signals with global system value estimation, the mechanism improves coordination among agents and enhances policy convergence stability. A unified multi-agent training framework is developed and evaluated on a representative production scheduling dataset. Experimental results show that the proposed method outperforms traditional approaches across multiple key metrics. These include resource utilization, scheduling latency, policy convergence speed, system stability, and fairness. The results demonstrate strong generalization and practical utility. Across various experimental scenarios, the method proves effective in handling orchestration tasks with high concurrency, high-dimensional state spaces, and complex dependency relationships. This confirms its advantages in real-world, large-scale scheduling environments.

[LG-39] Convergence Analysis of Max-Min Exponential Neural Network Operators in Orlicz Space

Link: https://arxiv.org/abs/2508.10248
Authors: Satyaranjan Pradhan,Madan Mohan Soren
Categories: Machine Learning (cs.LG); Functional Analysis (math.FA)
Comments: 35 pages, 6 figures

Abstract:In this current work, we propose a Max Min approach for approximating functions using exponential neural network operators. We extend this framework to develop the Max Min Kantorovich-type exponential neural network operators and investigate their approximation properties. We study both pointwise and uniform convergence for univariate functions. To analyze the order of convergence, we use the logarithmic modulus of continuity and estimate the corresponding rate of convergence. Furthermore, we examine the convergence behavior of the Max Min Kantorovich type exponential neural network operators within the Orlicz space setting. We provide some graphical representations to illustrate the approximation error of the function through suitable kernel and sigmoidal activation functions.

[LG-40] Pruning and Malicious Injection: A Retraining-Free Backdoor Attack on Transformer Models

Link: https://arxiv.org/abs/2508.10243
Authors: Taibiao Zhao,Mingxuan Sun,Hao Wang,Xiaobing Chen,Xiangwei Zhou
Categories: Machine Learning (cs.LG)

Abstract:Transformer models have demonstrated exceptional performance and have become indispensable in computer vision (CV) and natural language processing (NLP) tasks. However, recent studies reveal that transformers are susceptible to backdoor attacks. Prior backdoor attack methods typically rely on retraining with clean data or altering the model architecture, both of which can be resource-intensive and intrusive. In this paper, we propose Head-wise Pruning and Malicious Injection (HPMI), a novel retraining-free backdoor attack on transformers that does not alter the model’s architecture. Our approach requires only a small subset of the original data and basic knowledge of the model architecture, eliminating the need for retraining the target transformer. Technically, HPMI works by pruning the least important head and injecting a pre-trained malicious head to establish the backdoor. We provide a rigorous theoretical justification demonstrating that the implanted backdoor resists detection and removal by state-of-the-art defense techniques, under reasonable assumptions. Experimental evaluations across multiple datasets further validate the effectiveness of HPMI, showing that it 1) incurs negligible clean accuracy loss, 2) achieves at least 99.55% attack success rate, and 3) bypasses four advanced defense mechanisms. Additionally, relative to state-of-the-art retraining-dependent attacks, HPMI achieves greater concealment and robustness against diverse defense strategies, while maintaining minimal impact on clean accuracy.

[LG-41] Can Transformers Break Encryption Schemes via In-Context Learning?

Link: https://arxiv.org/abs/2508.10235
Authors: Jathin Korrapati,Patrick Mendoza,Aditya Tomar,Abein Abraham
Categories: Machine Learning (cs.LG)

Abstract:In-context learning (ICL) has emerged as a powerful capability of transformer-based language models, enabling them to perform tasks by conditioning on a small number of examples presented at inference time, without any parameter updates. Prior work has shown that transformers can generalize over simple function classes like linear functions, decision trees, even neural networks, purely from context, focusing on numerical or symbolic reasoning over underlying well-structured functions. Instead, we propose a novel application of ICL into the domain of cryptographic function learning, specifically focusing on ciphers such as mono-alphabetic substitution and Vigenère ciphers, two classes of private-key encryption schemes. These ciphers involve a fixed but hidden bijective mapping between plain text and cipher text characters. Given a small set of (cipher text, plain text) pairs, the goal is for the model to infer the underlying substitution and decode a new cipher text word. This setting poses a structured inference challenge, which is well-suited for evaluating the inductive biases and generalization capabilities of transformers under the ICL paradigm. Code is available at this https URL.
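
The inference task described in this abstract (recover a hidden character mapping from a few paired examples, then decode a new word) can be made concrete with a classical lookup-table baseline. This is not the transformer-based method studied in the paper; the example pairs use a shift-by-one key purely as a hypothetical hidden mapping, and the function names are invented.

```python
def infer_substitution(pairs):
    """Build a cipher-to-plain character map from (cipher_text, plain_text) pairs."""
    mapping = {}
    for cipher_text, plain_text in pairs:
        if len(cipher_text) != len(plain_text):
            raise ValueError("paired words must have equal length")
        for c, p in zip(cipher_text, plain_text):
            # setdefault records the first sighting and exposes later conflicts.
            if mapping.setdefault(c, p) != p:
                raise ValueError(f"inconsistent mapping for {c!r}")
    return mapping

def decode(cipher_text, mapping, unknown="?"):
    """Decode with the inferred map; unseen cipher characters become `unknown`."""
    return "".join(mapping.get(c, unknown) for c in cipher_text)

# Hypothetical in-context examples produced by a hidden shift-by-one key.
examples = [("dbu", "cat"), ("eph", "dog")]
mapping = infer_substitution(examples)

decoded_known = decode("hpbu", mapping)    # every character was seen in the examples
decoded_partial = decode("cbu", mapping)   # "c" never appeared as a cipher character
```

The baseline can only decode characters that appeared in the examples, which is the gap a model with useful inductive biases would need to close.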

[LG-42] Interpretable Machine Learning Model for Early Prediction of Acute Kidney Injury in Critically Ill Patients with Cirrhosis: A Retrospective Study

Link: https://arxiv.org/abs/2508.10233
Authors: Li Sun,Shuheng Chen,Junyi Fan,Yong Si,Minoo Ahmadi,Elham Pishgar,Kamiar Alaei,Maryam Pishgar
Categories: Machine Learning (cs.LG)

Abstract:Background: Cirrhosis is a progressive liver disease with high mortality and frequent complications, notably acute kidney injury (AKI), which occurs in up to 50% of hospitalized patients and worsens outcomes. AKI stems from complex hemodynamic, inflammatory, and metabolic changes, making early detection essential. Many predictive tools lack accuracy, interpretability, and alignment with intensive care unit (ICU) workflows. This study developed an interpretable machine learning model for early AKI prediction in critically ill patients with cirrhosis. Methods: We conducted a retrospective analysis of the MIMIC-IV v2.2 database, identifying 1240 adult ICU patients with cirrhosis and excluding those with ICU stays under 48 hours or missing key data. Laboratory and physiological variables from the first 48 hours were extracted. The pipeline included preprocessing, missingness filtering, LASSO feature selection, and SMOTE class balancing. Six algorithms (LightGBM, CatBoost, XGBoost, logistic regression, naive Bayes, and neural networks) were trained and evaluated using AUROC, accuracy, F1-score, sensitivity, specificity, and predictive values. Results: LightGBM achieved the best performance (AUROC 0.808, 95% CI 0.741-0.856; accuracy 0.704; NPV 0.911). Key predictors included prolonged partial thromboplastin time, absence of outside-facility 20G placement, low pH, and altered pO2, consistent with known cirrhosis-AKI mechanisms and suggesting actionable targets. Conclusion: The LightGBM-based model enables accurate early AKI risk stratification in ICU patients with cirrhosis using routine clinical variables. Its high negative predictive value supports safe de-escalation for low-risk patients, and interpretability fosters clinician trust and targeted prevention. External validation and integration into electronic health record systems are warranted.

[LG-43] Comparison of D-Wave Quantum Annealing and Markov Chain Monte Carlo for Sampling from a Probability Distribution of a Restricted Boltzmann Machine

Link: https://arxiv.org/abs/2508.10228
Authors: Abdelmoula El Yazizi,Samee U. Khan,Yaroslav Koshka
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph); Machine Learning (stat.ML)
Comments: 22 pages, 10 figures

Abstract:A local-valley (LV) centered approach to assessing the quality of sampling from Restricted Boltzmann Machines (RBMs) was applied to the latest generation of the D-Wave quantum annealer. D-Wave and Gibbs samples from a classically trained RBM were obtained at conditions relevant to the contrastive-divergence-based RBM learning. The samples were compared for the number of the LVs to which they belonged and the energy of the corresponding local minima. No significant (desirable) increase in the number of the LVs has been achieved by decreasing the D-Wave annealing time. At any training epoch, the states sampled by the D-Wave belonged to a somewhat higher number of LVs than in the Gibbs sampling. However, many of those LVs found by the two techniques differed. For high-probability sampled states, the two techniques were (unfavorably) less complementary and more overlapping. Nevertheless, many potentially “important” local minima, i.e., those having intermediate, even if not high, probability values, were found by only one of the two sampling techniques while missed by the other. The two techniques overlapped less at later than earlier training epochs, which is precisely the stage of the training when modest improvements to the sampling quality could make meaningful differences for the RBM trainability. The results of this work may explain the failure of previous investigations to achieve substantial (or any) improvement when using D-Wave-based sampling. However, the results reveal some potential for improvement, e.g., using a combined classical-quantum approach.

[LG-44] Benchmark-Driven Selection of AI: Evidence from DeepSeek-R1

Link: https://arxiv.org/abs/2508.10173
Authors: Petr Spelda,Vit Stritecky
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: 17 pages, 5 figures, 2 tables

Abstract:Evaluation of reasoning language models gained importance after it was observed that they can combine their existing capabilities into novel traces of intermediate steps before task completion and that the traces can sometimes help them to generalize better than past models. As reasoning becomes the next scaling dimension of large language models, careful study of their capabilities in critical tasks is needed. We show that better performance is not always caused by test-time algorithmic improvements or model sizes but also by using impactful benchmarks as curricula for learning. We call this benchmark-driven selection of AI and show its effects on DeepSeek-R1 using our sequential decision-making problem from Humanity’s Last Exam. Steering development of AI by impactful benchmarks trades evaluation for learning and makes novelty of test tasks key for measuring generalization capabilities of reasoning models. Consequently, some benchmarks could be seen as curricula for training rather than unseen test sets.

[LG-45] Pre-trained Transformer-models using chronic invasive electrophysiology for symptom decoding without patient-individual training

Link: https://arxiv.org/abs/2508.10160
Authors: Timon Merk,Saeed Salehi,Richard M. Koehler,Qiming Cui,Maria Olaru,Amelia Hahn,Nicole R. Provenza,Simon Little,Reza Abbasi-Asl,Phil A. Starr,Wolf-Julian Neumann
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 5 pages, 6 figures

Abstract:Neural decoding of pathological and physiological states can enable patient-individualized closed-loop neuromodulation therapy. Recent advances in pre-trained large-scale foundation models offer the potential for generalized state estimation without patient-individual training. Here we present a foundation model trained on chronic longitudinal deep brain stimulation recordings spanning over 24 days. Adhering to long time-scale symptom fluctuations, we highlight the extended context window of 30 minutes. We present an optimized pre-training loss function for neural electrophysiological data that corrects for the frequency bias of common masked auto-encoder loss functions due to the 1-over-f power law. We show in a downstream task the decoding of Parkinson’s disease symptoms with leave-one-subject-out cross-validation without patient-individual training.

[LG-46] Characterizing Evolution in Expectation-Maximization Estimates for Overspecified Mixed Linear Regression

Link: https://arxiv.org/abs/2508.10154
Authors: Zhankun Luo,Abolfazl Hashemi
Categories: Machine Learning (cs.LG)

Abstract:Mixture models have attracted significant attention due to practical effectiveness and comprehensive theoretical foundations. A persisting challenge is model misspecification, which occurs when the model to be fitted has more mixture components than those in the data distribution. In this paper, we develop a theoretical understanding of the Expectation-Maximization (EM) algorithm’s behavior in the context of targeted model misspecification for overspecified two-component Mixed Linear Regression (2MLR) with unknown $d$-dimensional regression parameters and mixing weights. In Theorem 5.1 at the population level, with an unbalanced initial guess for mixing weights, we establish linear convergence of regression parameters in $O(\log(1/\epsilon))$ steps. Conversely, with a balanced initial guess for mixing weights, we observe sublinear convergence in $O(\epsilon^{-2})$ steps to achieve the $\epsilon$-accuracy at Euclidean distance. In Theorem 6.1 at the finite-sample level, for mixtures with sufficiently unbalanced fixed mixing weights, we demonstrate a statistical accuracy of $O((d/n)^{1/2})$, whereas for those with sufficiently balanced fixed mixing weights, the accuracy is $O((d/n)^{1/4})$ given $n$ data samples. Furthermore, we underscore the connection between our population level and finite-sample level results: by setting the desired final accuracy $\epsilon$ in Theorem 5.1 to match that in Theorem 6.1 at the finite-sample level, namely letting $\epsilon = O((d/n)^{1/2})$ for sufficiently unbalanced fixed mixing weights and $\epsilon = O((d/n)^{1/4})$ for sufficiently balanced fixed mixing weights, we intuitively derive iteration complexity bounds $O(\log(1/\epsilon)) = O(\log(n/d))$ and $O(\epsilon^{-2}) = O((n/d)^{1/2})$ at the finite-sample level for sufficiently unbalanced and balanced initial mixing weights. We further extend our analysis in overspecified setting to low SNR regime.
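
The substitution step at the end of this abstract can be written out explicitly. The lines below only restate the abstract's own bounds, taking the target accuracy proportional to the statistical accuracy and suppressing constants; they are a reading aid, not part of the paper's proofs.

```latex
\begin{align*}
\text{unbalanced: } \epsilon \asymp (d/n)^{1/2}
  &\;\Rightarrow\; O\big(\log(1/\epsilon)\big)
   = O\big(\tfrac{1}{2}\log(n/d)\big)
   = O\big(\log(n/d)\big), \\
\text{balanced: } \epsilon \asymp (d/n)^{1/4}
  &\;\Rightarrow\; O\big(\epsilon^{-2}\big)
   = O\big((n/d)^{2/4}\big)
   = O\big((n/d)^{1/2}\big).
\end{align*}
```

The first case needs only logarithmically many EM iterations in $n/d$, while the second needs a number that grows polynomially, on the order of $\sqrt{n/d}$.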

[LG-47] Constrained Decoding of Diffusion LLMs with Context-Free Grammars

Link: https://arxiv.org/abs/2508.10111
Authors: Niels Mündler, Jasper Dekoninck, Martin Vechev
Categories: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL); Software Engineering (cs.SE)

Abstract:Large language models (LLMs) have shown promising performance across diverse domains. Many practical applications of LLMs, such as code completion and structured data extraction, require adherence to syntactic constraints specified by a formal language. Yet, due to their probabilistic nature, LLM output is not guaranteed to adhere to such formal languages. Prior work has proposed constrained decoding as a means to restrict LLM generation to particular formal languages. However, existing works are not applicable to the emerging paradigm of diffusion LLMs, when used in practical scenarios such as the generation of formally correct C++ or JSON output. In this paper we address this challenge and present the first constrained decoding method for diffusion models, one that can handle formal languages captured by context-free grammars. We begin by reducing constrained decoding to the more general additive infilling problem, which asks whether a partial output can be completed to a valid word in the target language. This problem also naturally subsumes the previously unaddressed multi-region infilling constrained decoding. We then reduce this problem to the task of deciding whether the intersection of the target language and a regular language is empty and present an efficient algorithm to solve it for context-free languages. Empirical results on various applications, such as C++ code infilling and structured data extraction in JSON, demonstrate that our method achieves near-perfect syntactic correctness while consistently preserving or improving functional correctness. Importantly, our efficiency optimizations ensure that the computational overhead remains practical.
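The emptiness question mentioned in the abstract can be illustrated with a textbook fixpoint construction. The sketch below is not the authors' algorithm or their optimizations: it assumes a grammar in Chomsky normal form and an NFA given as a transition dictionary, and every name in it (`cfg_regular_intersection_nonempty`, the toy grammar, the two automata) is hypothetical. The grammar generates non-empty balanced parentheses, and each automaton stands in for a partial output with a fixed prefix and an unconstrained remainder.

```python
def cfg_regular_intersection_nonempty(terminal_rules, binary_rules, start,
                                      nfa, initial, accepting):
    """Decide whether some word is in both the CNF grammar and the NFA.

    terminal_rules: list of (A, a) for rules A -> a
    binary_rules:   list of (A, B, C) for rules A -> B C
    nfa:            dict mapping (state, symbol) -> set of next states
    """
    # A fact (A, p, q) means: nonterminal A derives some string that can
    # move the automaton from state p to state q.
    facts = set()
    for head, symbol in terminal_rules:
        for (state, read), targets in nfa.items():
            if read == symbol:
                facts.update((head, state, target) for target in targets)

    # Saturate with the binary rules until no new fact appears.
    changed = True
    while changed:
        changed = False
        snapshot = list(facts)
        for head, left, right in binary_rules:
            for (b, p, r) in snapshot:
                if b != left:
                    continue
                for (c, r2, q) in snapshot:
                    if c == right and r2 == r and (head, p, q) not in facts:
                        facts.add((head, p, q))
                        changed = True

    return any((start, initial, final) in facts for final in accepting)


# Toy grammar for non-empty balanced parentheses, in Chomsky normal form:
#   S -> L R | L X | S S,   X -> S R,   L -> "(",   R -> ")"
TERMINAL_RULES = [("L", "("), ("R", ")")]
BINARY_RULES = [("S", "L", "R"), ("S", "L", "X"), ("S", "S", "S"), ("X", "S", "R")]

# Partial output "((" followed by anything: can still be completed.
PREFIX_OPEN_OPEN = {(0, "("): {1}, (1, "("): {2}, (2, "("): {2}, (2, ")"): {2}}
# Partial output ")" followed by anything: can never become balanced.
PREFIX_CLOSE = {(0, ")"): {1}, (1, "("): {1}, (1, ")"): {1}}

can_complete = cfg_regular_intersection_nonempty(
    TERMINAL_RULES, BINARY_RULES, "S", PREFIX_OPEN_OPEN, 0, {2})
cannot_complete = cfg_regular_intersection_nonempty(
    TERMINAL_RULES, BINARY_RULES, "S", PREFIX_CLOSE, 0, {1})
```

In a constrained decoder, a check of this kind would be run on each candidate update, and candidates whose partial output can no longer be completed would be rejected.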

[LG-48] Next Edit Prediction: Learning to Predict Code Edits from Context and Interaction History

Link: https://arxiv.org/abs/2508.10074
Authors: Ruofan Lu, Yintong Huo, Meng Zhang, Yichen Li, Michael R. Lyu
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)

Abstract:The rapid advancement of large language models (LLMs) has led to the widespread adoption of AI-powered coding assistants integrated into development environments. On one hand, low-latency code completion offers suggestions but is fundamentally constrained to the cursor’s current position. On the other hand, chat-based editing can perform complex modifications, yet it forces developers to stop their work and describe their intent in natural language, which causes a context switch away from the code. This creates a suboptimal user experience, as neither paradigm proactively predicts the developer’s next edit in a sequence of related edits. To bridge this gap and provide seamless code edit suggestions, we introduce Next Edit Prediction, a novel task designed to infer developer intent from recent interaction history to predict both the location and content of the subsequent edit. Specifically, we curate a high-quality supervised fine-tuning dataset and an evaluation benchmark for the Next Edit Prediction task. Then, we conduct supervised fine-tuning on a series of models and perform a comprehensive evaluation of both the fine-tuned models and other baseline models, yielding several novel findings. This work lays the foundation for a new interaction paradigm that proactively collaborates with developers by anticipating their next action, rather than merely reacting to explicit instructions.

[LG-49] Measuring Time Series Forecast Stability for Demand Planning KDD’25

Link: https://arxiv.org/abs/2508.10063
Authors: Steven Klee, Yuntian Xia
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 3 figures; KDD '25

Abstract:Time series forecasting is a critical first step in generating demand plans for supply chains. Experiments on time series models typically focus on demonstrating improvements in forecast accuracy over existing/baseline solutions, quantified according to some accuracy metric. There is no doubt that forecast accuracy is important; however in production systems, demand planners often value consistency and stability over incremental accuracy improvements. Assuming that the inputs have not changed significantly, forecasts that vary drastically from one planning cycle to the next require high amounts of human intervention, which frustrates demand planners and can even cause them to lose trust in ML forecasting models. We study model-induced stochasticity, which quantifies the variance of a set of forecasts produced by a single model when the set of inputs is fixed. Models with lower variance are more stable. Recently the forecasting community has seen significant advances in forecast accuracy through the development of deep machine learning models for time series forecasting. We perform a case study measuring the stability and accuracy of state-of-the-art forecasting models (Chronos, DeepAR, PatchTST, Temporal Fusion Transformer, TiDE, and the AutoGluon best quality ensemble) on public data sets from the M5 competition and Favorita grocery sales. We show that ensemble models improve stability without significantly deteriorating (or even improving) forecast accuracy. While these results may not be surprising, the main point of this paper is to propose the need for further study of forecast stability for models that are being deployed in production systems.
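One way to make "variance of a set of forecasts produced by a single model when the set of inputs is fixed" concrete is sketched below. The abstract does not give the exact metric, so the choice of a per-step coefficient of variation averaged over the horizon is an assumption, and the function name `forecast_instability` is hypothetical.

```python
from statistics import mean, pstdev


def forecast_instability(runs):
    """Average coefficient of variation across the forecast horizon.

    runs: list of forecasts from the SAME model on the SAME inputs,
          differing only in the random seed; each forecast is a list
          with one value per horizon step.
    """
    horizon = len(runs[0])
    per_step = []
    for t in range(horizon):
        values = [run[t] for run in runs]
        centre = mean(values)
        # Assumption: a zero mean is treated as perfectly stable to
        # avoid dividing by zero; a real metric would need a policy here.
        per_step.append(pstdev(values) / abs(centre) if centre != 0 else 0.0)
    return mean(per_step)


# Two seeds that agree exactly are perfectly stable.
stable = forecast_instability([[10.0, 20.0], [10.0, 20.0]])
# Two seeds that disagree by about 10% around the mean are less stable.
unstable = forecast_instability([[9.0, 18.0], [11.0, 22.0]])
```

Lower values mean the model returns more similar forecasts from one run to the next, which is the property the abstract argues demand planners care about.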

[LG-50] A Personalized Exercise Assistant using Reinforcement Learning (PEARL): Results from a four-arm Randomized-controlled Trial

Link: https://arxiv.org/abs/2508.10060
Authors: Amy Armento Lee, Narayan Hegde, Nina Deliu, Emily Rosenzweig, Arun Suggala, Sriram Lakshminarasimhan, Qian He, John Hernandez, Martin Seneviratne, Rahul Singh, Pradnesh Kalkar, Karthikeyan Shanmugam, Aravindan Raghuveer, Abhimanyu Singh, My Nguyen, James Taylor, Jatin Alla, Sofia S. Villar, Hulya Emir-Farinas
Categories: Machine Learning (cs.LG)

Abstract:Consistent physical inactivity poses a major global health challenge. Mobile health (mHealth) interventions, particularly Just-in-Time Adaptive Interventions (JITAIs), offer a promising avenue for scalable, personalized physical activity (PA) promotion. However, developing and evaluating such interventions at scale, while integrating robust behavioral science, presents methodological hurdles. The PEARL study was the first large-scale, four-arm randomized controlled trial to assess a reinforcement learning (RL) algorithm, informed by health behavior change theory, to personalize the content and timing of PA nudges via a Fitbit app. We enrolled and randomized 13,463 Fitbit users into four study arms: control, random, fixed, and RL. The control arm received no nudges. The other three arms received nudges from a bank of 155 nudges based on behavioral science principles. The random arm received nudges selected at random. The fixed arm received nudges based on a pre-set logic from survey responses about PA barriers. The RL group received nudges selected by an adaptive RL algorithm. We included 7,711 participants in primary analyses (mean age 42.1, 86.3% female, baseline steps 5,618.2). We observed an increase in PA for the RL group compared to all other groups from baseline to 1 and 2 months. The RL group had significantly increased average daily step count at 1 month compared to all other groups: control (+296 steps, p=0.0002), random (+218 steps, p=0.005), and fixed (+238 steps, p=0.002). At 2 months, the RL group sustained a significant increase compared to the control group (+210 steps, p=0.0122). Generalized estimating equation models also revealed a sustained increase in daily steps in the RL group vs. control (+208 steps, p=0.002). These findings demonstrate the potential of a scalable, behaviorally-informed RL approach to personalize digital health interventions for PA. 

[LG-51] xRFM: Accurate, scalable, and interpretable feature learning models for tabular data

Link: https://arxiv.org/abs/2508.10053
Authors: Daniel Beaglehole, David Holzmüller, Adityanarayanan Radhakrishnan, Mikhail Belkin
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Inference from tabular data, collections of continuous and categorical variables organized into matrices, is a foundation for modern technology and science. Yet, in contrast to the explosive changes in the rest of AI, the best practice for these predictive tasks has been relatively unchanged and is still primarily based on variations of Gradient Boosted Decision Trees (GBDTs). Very recently, there has been renewed interest in developing state-of-the-art methods for tabular data based on recent developments in neural networks and feature learning methods. In this work, we introduce xRFM, an algorithm that combines feature learning kernel machines with a tree structure to both adapt to the local structure of the data and scale to essentially unlimited amounts of training data. We show that compared to 31 other methods, including recently introduced tabular foundation models (TabPFNv2) and GBDTs, xRFM achieves best performance across 100 regression datasets and is competitive to the best methods across 200 classification datasets outperforming GBDTs. Additionally, xRFM provides interpretability natively through the Average Gradient Outer Product.
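The Average Gradient Outer Product (AGOP) named in the abstract is the average of the outer product of a predictor's gradient with itself over the data points. The sketch below computes it for an arbitrary scalar function using central finite differences. It is an illustration of the definition only, not the xRFM implementation; the helper names and the toy predictor are hypothetical.

```python
def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def agop(f, points):
    """Average over the points of grad f(x) * grad f(x)^T, as a d x d list of lists."""
    d = len(points[0])
    matrix = [[0.0] * d for _ in range(d)]
    for x in points:
        g = numerical_gradient(f, x)
        for i in range(d):
            for j in range(d):
                matrix[i][j] += g[i] * g[j] / len(points)
    return matrix


# Toy predictor that depends only on the first feature.
def toy_predictor(x):
    return 3.0 * x[0]


toy_agop = agop(toy_predictor, [[0.0, 0.0], [1.0, 2.0], [-1.0, 0.5]])
```

For this toy predictor the only sizeable entry is the one for the first feature, which is how a large diagonal entry of the AGOP can be read as a feature-importance signal.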

[LG-52] Neural Network-Based Detection and Multi-Class Classification of FDI Attacks in Smart Grid Home Energy Systems

Link: https://arxiv.org/abs/2508.10035
Authors: Varsha Sen, Biswash Basnet
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 17 pages, 7 figures

Abstract:False Data Injection Attacks (FDIAs) pose a significant threat to smart grid infrastructures, particularly Home Area Networks (HANs), where real-time monitoring and control are highly adopted. Owing to the comparatively less stringent security controls and widespread availability of HANs, attackers view them as an attractive entry point to manipulate aggregated demand patterns, which can ultimately propagate and disrupt broader grid operations. These attacks undermine the integrity of smart meter data, enabling malicious actors to manipulate consumption values without activating conventional alarms, thereby creating serious vulnerabilities across both residential and utility-scale infrastructures. This paper presents a machine learning-based framework for both the detection and classification of FDIAs using residential energy data. A real-time detection is provided by the lightweight Artificial Neural Network (ANN), which works by using the most vital features of energy consumption, cost, and time context. For the classification of different attack types, a Bidirectional LSTM is trained to recognize normal, trapezoidal, and sigmoid attack shapes through learning sequential dependencies in the data. A synthetic time-series dataset was generated to emulate realistic household behaviour. Experimental results demonstrate that the proposed models are effective in identifying and classifying FDIAs, offering a scalable solution for enhancing grid resilience at the edge. This work contributes toward building intelligent, data-driven defence mechanisms that strengthen smart grid cybersecurity from residential endpoints.

[LG-53] Whisper Smarter not Harder: Adversarial Attack on Partial Suppression

Link: https://arxiv.org/abs/2508.09994
Authors: Zheng Jie Wong, Bingquan Shen
Categories: Sound (cs.SD); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 13 pages, 7 figures

Abstract:Currently, Automatic Speech Recognition (ASR) models are deployed in an extensive range of applications. However, recent studies have demonstrated the possibility of adversarial attacks on these models, which could potentially suppress or disrupt model output. We investigate and verify the robustness of these attacks and explore whether it is possible to increase their imperceptibility. We additionally find that by relaxing the optimisation objective from complete suppression to partial suppression, we can further increase the imperceptibility of the attack. We also explore possible defences against these attacks and show that a low-pass filter could potentially serve as an effective defence.
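To give a feel for what a low-pass filter defence does to a signal, here is a minimal single-pole filter in plain Python. The abstract does not state the filter type or cutoff the authors used, so both the filter design and the `alpha` value are assumptions, as is the premise that the adversarial perturbation sits mostly in high frequencies.

```python
def low_pass(samples, alpha=0.1):
    """Single-pole IIR low-pass filter: y[n] = y[n-1] + alpha * (x[n] - y[n-1]).

    Smaller alpha means a lower cutoff, so more high-frequency content is removed.
    """
    filtered = []
    y = 0.0
    for x in samples:
        y += alpha * (x - y)
        filtered.append(y)
    return filtered


# A rapidly alternating signal stands in for a high-frequency perturbation.
perturbation = [(-1.0) ** n for n in range(200)]
# A constant signal stands in for slowly varying content.
slow_content = [1.0] * 200

residual_perturbation = max(abs(v) for v in low_pass(perturbation)[100:])
retained_content = low_pass(slow_content)[-1]
```

The alternating input is strongly attenuated while the constant input passes almost unchanged, which is the behaviour such a defence relies on. Real speech also carries high-frequency detail, so a practical cutoff would have to be chosen with recognition accuracy in mind.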

[LG-54] An Iterative Algorithm for Differentially Private k-PCA with Adaptive Noise

Link: https://arxiv.org/abs/2508.10879
Authors: Johanna Düngler, Amartya Sanyal
Categories: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Given $n$ i.i.d. random matrices $A_i \in \mathbb{R}^{d \times d}$ that share a common expectation $\Sigma$, the objective of Differentially Private Stochastic PCA is to identify a subspace of dimension $k$ that captures the largest variance directions of $\Sigma$, while preserving differential privacy (DP) of each individual $A_i$. Existing methods either (i) require the sample size $n$ to scale super-linearly with dimension $d$, even under Gaussian assumptions on the $A_i$, or (ii) introduce excessive noise for DP even when the intrinsic randomness within $A_i$ is small. Liu et al. (2022a) addressed these issues for sub-Gaussian data but only for estimating the top eigenvector ($k=1$) using their algorithm DP-PCA. We propose the first algorithm capable of estimating the top $k$ eigenvectors for arbitrary $k \leq d$, whilst overcoming both limitations above. For $k=1$ our algorithm matches the utility guarantees of DP-PCA, achieving near-optimal statistical error even when $n = \tilde{O}(d)$. We further provide a lower bound for general $k > 1$, matching our upper bound up to a factor of $k$, and experimentally demonstrate the advantages of our algorithm over comparable baselines.

[LG-55] Performance of universal machine-learned potentials with explicit long-range interactions in biomolecular simulations

Link: https://arxiv.org/abs/2508.10841
Authors: Viktor Zaverkin, Matheus Ferraz, Francesco Alesiani, Mathias Niepert
Categories: Chemical Physics (physics.chem-ph); Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Universal machine-learned potentials promise transferable accuracy across compositional and vibrational degrees of freedom, yet their application to biomolecular simulations remains underexplored. This work systematically evaluates equivariant message-passing architectures trained on the SPICE-v2 dataset with and without explicit long-range dispersion and electrostatics. We assess the impact of model size, training data composition, and electrostatic treatment across in- and out-of-distribution benchmark datasets, as well as molecular simulations of bulk liquid water, aqueous NaCl solutions, and biomolecules, including alanine tripeptide, the mini-protein Trp-cage, and Crambin. While larger models improve accuracy on benchmark datasets, this trend does not consistently extend to properties obtained from simulations. Predicted properties also depend on the composition of the training dataset. Long-range electrostatics show no systematic impact across systems. However, for Trp-cage, their inclusion yields increased conformational variability. Our results suggest that imbalanced datasets and immature evaluation practices currently challenge the applicability of universal machine-learned potentials to biomolecular simulations.

[LG-56] Accelerating exoplanet climate modelling: A machine learning approach to complement 3D GCM grid simulations

Link: https://arxiv.org/abs/2508.10827
Authors: Alexander Plaschzug, Amit Reza, Ludmila Carone, Sebastian Gernjak, Christiane Helling
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)

Abstract:With the development of ever-improving telescopes capable of observing exoplanet atmospheres in greater detail and number, there is a growing demand for enhanced 3D climate models to support and help interpret observational data from space missions like CHEOPS, TESS, JWST, PLATO, and Ariel. However, the computationally intensive and time-consuming nature of general circulation models (GCMs) poses significant challenges in simulating a wide range of exoplanetary atmospheres. This study aims to determine whether machine learning (ML) algorithms can be used to predict the 3D temperature and wind structure of arbitrary tidally-locked gaseous exoplanets in a range of planetary parameters. A new 3D GCM grid with 60 inflated hot Jupiters orbiting A, F, G, K, and M-type host stars modelled with ExoRad has been introduced. A dense neural network (DNN) and a decision tree algorithm (XGBoost) are trained on this grid to predict local gas temperatures along with horizontal and vertical winds. To ensure the reliability and quality of the ML model predictions, WASP-121 b, HATS-42 b, NGTS-17 b, WASP-23 b, and NGTS-1 b-like planets, which are all targets for PLATO observation, are selected and modelled with ExoRad and the two ML methods as test cases. The DNN predictions for the gas temperatures are accurate to such a degree that the calculated spectra agree within 32 ppm for all but one planet, for which only a single HCN feature reaches a 100 ppm difference. The developed ML emulators can reliably predict the complete 3D temperature field of an inflated warm to ultra-hot tidally locked Jupiter around A to M-type host stars. They provide a fast tool to complement and extend traditional GCM grids for exoplanet ensemble studies. The quality of the predictions is such that no or minimal effects on the gas phase chemistry, hence on the cloud formation and transmission spectra, are to be expected.

[LG-57] Parity Cross-Resonance: A Multiqubit Gate

Link: https://arxiv.org/abs/2508.10807
Authors: Xuexin Xu, Siyu Wang, Radhika Joshi, Rihan Hai, Mohammad H. Ansari
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 19 pages, 10 figures

Abstract:We present a native three-qubit entangling gate that exploits engineered interactions to realize control-control-target and control-target-target operations in a single coherent step. Unlike conventional decompositions into multiple two-qubit gates, our hybrid optimization approach selectively amplifies desired interactions while suppressing unwanted couplings, yielding robust performance across the computational subspace and beyond. The new gate can be classified as a cross-resonance gate. We show it can be utilized in several ways, for example, in GHZ triplet state preparation, Toffoli-class logic demonstrations with many-body interactions, and in implementing a controlled-ZZ gate. The latter maps the parity of two data qubits directly onto a measurement qubit, enabling faster and higher-fidelity stabilizer measurements in surface-code quantum error correction. In all these examples, we show that the three-qubit gate performance remains robust across Hilbert space sizes, as confirmed by testing under increasing total excitation numbers. This work lays the foundation for co-designing circuit architectures and control protocols that leverage native multiqubit interactions as core elements of next-generation superconducting quantum processors.

[LG-58] Memorisation and forgetting in a learning Hopfield neural network: bifurcation mechanisms, attractors and basins

Link: https://arxiv.org/abs/2508.10765
Authors: Adam E. Essex (1), Natalia B. Janson (1), Rachel A. Norris (1), Alexander G. Balanov (1) ((1) Loughborough University, England)
Categories: Dynamical Systems (math.DS); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: 19 pages, 14 figures. The following article has been submitted to "Chaos: An Interdisciplinary Journal of Nonlinear Science". After it is published, it will be found at this https URL

Abstract:Despite explosive expansion of artificial intelligence based on artificial neural networks (ANNs), these are employed as "black boxes", as it is unclear how, during learning, they form memories or develop unwanted features, including spurious memories and catastrophic forgetting. Much research is available on isolated aspects of learning ANNs, but due to their high dimensionality and non-linearity, their comprehensive analysis remains a challenge. In ANNs, knowledge is thought to reside in connection weights or in attractor basins, but these two paradigms are not linked explicitly. Here we comprehensively analyse mechanisms of memory formation in an 81-neuron Hopfield network undergoing Hebbian learning by revealing bifurcations leading to formation and destruction of attractors and their basin boundaries. We show that, by affecting evolution of connection weights, the applied stimuli induce a pitchfork and then a cascade of saddle-node bifurcations creating new attractors with their basins that can code true or spurious memories, and an abrupt disappearance of old memories (catastrophic forgetting). With successful learning, new categories are represented by the basins of newly born point attractors, and their boundaries by the stable manifolds of new saddles. With this, memorisation and forgetting represent two manifestations of the same mechanism. Our strategy to analyse high-dimensional learning ANNs is universal and applicable to recurrent ANNs of any form. The demonstrated mechanisms of memory formation and of catastrophic forgetting shed light on the operation of a wider class of recurrent ANNs and could aid the development of approaches to mitigate their flaws.
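For readers unfamiliar with the setting, the sketch below shows the classic discrete Hopfield network with a one-shot Hebbian rule. It is a simplification: the paper studies an 81-neuron network whose weights evolve during learning, whereas this example uses 9 binary neurons and fixed weights. All names are hypothetical.

```python
def hebbian_weights(patterns):
    """One-shot Hebbian rule: w[i][j] = (1/n) * sum over patterns of p[i] * p[j], zero diagonal."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j] / n
    return weights


def recall(weights, state, steps=5):
    """Synchronous sign updates; the state should settle into a stored attractor."""
    s = list(state)
    n = len(s)
    for _ in range(steps):
        s = [1 if sum(weights[i][j] * s[j] for j in range(n)) >= 0 else -1
             for i in range(n)]
    return s


stored = [1, -1, 1, -1, 1, -1, 1, -1, 1]
weights = hebbian_weights([stored])

# Flip the first two neurons; the network should fall back into the stored memory.
noisy = [-stored[0], -stored[1]] + stored[2:]
recovered = recall(weights, noisy)

# The sign-inverted pattern is also a fixed point, although it was never taught.
inverted = [-v for v in stored]
inverted_recalled = recall(weights, inverted)
```

The corrupted state returns to the stored pattern because it lies in that pattern's basin of attraction. The inverted pattern being stable as well is a simple instance of the spurious memories the abstract refers to.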

[LG-59] Symmetry-Constrained Multi-Scale Physics-Informed Neural Networks for Graphene Electronic Band Structure Prediction

Link: https://arxiv.org/abs/2508.10718
Authors: Wei Shan Lee, I Hang Kwok, Kam Ian Leong, Chi Kiu Althina Chau, Kei Chon Sio
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 36 pages and 14 figures

Abstract:Accurate prediction of electronic band structures in two-dimensional materials remains a fundamental challenge, with existing methods struggling to balance computational efficiency and physical accuracy. We present the Symmetry-Constrained Multi-Scale Physics-Informed Neural Network (SCMS-PINN) v35, which directly learns graphene band structures while rigorously enforcing crystallographic symmetries through a multi-head architecture. Our approach introduces three specialized ResNet-6 pathways – K-head for Dirac physics, M-head for saddle points, and General head for smooth interpolation – operating on 31 physics-informed features extracted from k-points. Progressive Dirac constraint scheduling systematically increases the weight parameter from 5.0 to 25.0, enabling hierarchical learning from global topology to local critical physics. Training on 10,000 k-points over 300 epochs achieves 99.99% reduction in training loss (34.597 to 0.003) with validation loss of 0.0085. The model predicts Dirac point gaps within $30.3\,\mu\text{eV}$ of theoretical zero and achieves average errors of 53.9 meV (valence) and 40.5 meV (conduction) across the Brillouin zone. All twelve $C_{6v}$ operations are enforced through systematic averaging, guaranteeing exact symmetry preservation. This framework establishes a foundation for extending physics-informed learning to broader two-dimensional materials for accelerated discovery.
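Enforcing a point-group symmetry by averaging can be sketched in a few lines. The example below assumes that "systematic averaging" means evaluating the model on all twelve images of a k-point under the 2D point group of the hexagonal lattice (six rotations and six reflections) and taking the mean. That reading, the function names, and the stand-in `raw_model` are assumptions, not the paper's code.

```python
import math


def c6v_operations():
    """The twelve 2x2 matrices of C6v: six rotations and six rotation-mirror products."""
    ops = []
    for k in range(6):
        angle = k * math.pi / 3
        c, s = math.cos(angle), math.sin(angle)
        ops.append(((c, -s), (s, c)))   # rotation by k * 60 degrees
        ops.append(((c, s), (s, -c)))   # same rotation applied after the mirror ky -> -ky
    return ops


def apply(op, kx, ky):
    """Apply a 2x2 matrix to the point (kx, ky)."""
    return (op[0][0] * kx + op[0][1] * ky, op[1][0] * kx + op[1][1] * ky)


def symmetrize(f):
    """Return a version of f(kx, ky) averaged over the whole group."""
    ops = c6v_operations()

    def f_sym(kx, ky):
        return sum(f(*apply(op, kx, ky)) for op in ops) / len(ops)

    return f_sym


# Stand-in for an unconstrained network output; deliberately not symmetric.
def raw_model(kx, ky):
    return kx + 0.3 * ky ** 2 + math.sin(kx * ky)


symmetric_model = symmetrize(raw_model)
reference = symmetric_model(0.7, -0.2)
# Largest change in the symmetrized output across all symmetry-equivalent k-points.
max_deviation = max(abs(symmetric_model(*apply(op, 0.7, -0.2)) - reference)
                    for op in c6v_operations())
```

Because the twelve operations form a group, moving the input to a symmetry-equivalent point only reorders the terms of the average, so the output is invariant up to floating-point rounding regardless of what the underlying model has learned.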

[LG-60] Physics-Informed Deep Contrast Source Inversion: A Unified Framework for Inverse Scattering Problems

Link: https://arxiv.org/abs/2508.10555
Authors: Haoran Sun, Daoqi Liu, Hongyu Zhou, Maokun Li, Shenheng Xu, Fan Yang
Categories: Computational Physics (physics.comp-ph); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

Abstract:Inverse scattering problems are critical in electromagnetic imaging and medical diagnostics but are challenged by their nonlinearity and diverse measurement scenarios. This paper proposes a physics-informed deep contrast source inversion framework (DeepCSI) for fast and accurate medium reconstruction across various measurement conditions. Inspired by contrast source inversion (CSI) and neural operator methods, a residual multilayer perceptron (ResMLP) is employed to model current distributions in the region of interest under different transmitter excitations, effectively linearizing the nonlinear inverse scattering problem and significantly reducing the computational cost of traditional full-waveform inversion. By modeling medium parameters as learnable tensors and utilizing a hybrid loss function that integrates state equation loss, data equation loss, and total variation regularization, DeepCSI establishes a fully differentiable framework for joint optimization of network parameters and medium properties. Compared with conventional methods, DeepCSI offers advantages in terms of simplicity and universal modeling capabilities for diverse measurement scenarios, including phase-less and multi-frequency observation. Simulations and experiments demonstrate that DeepCSI achieves high-precision, robust reconstruction under full-data, phaseless data, and multifrequency conditions, outperforming traditional CSI methods and providing an efficient and universal solution for complex inverse scattering problems.

[LG-61] Mitigating Exponential Mixed Frequency Growth through Frequency Selection and Dimensional Separation in Quantum Machine Learning

Link: https://arxiv.org/abs/2508.10533
Authors: Michael Poppel, David Bucher, Maximilian Zorn, Nico Kraus, Jonas Stein, Claudia Linnhoff-Popien
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:To leverage the potential computational speedup of quantum computing (QC), research in quantum machine learning (QML) has gained increasing prominence. Angle encoding techniques in QML models have been shown to generate truncated Fourier series, offering asymptotically universal function approximation capabilities. By selecting efficient feature maps (FMs) within quantum circuits, one can leverage the exponential growth of Fourier frequencies for improved approximation. In multi-dimensional settings, additional input dimensions induce further exponential scaling via mixed frequencies. In practice, however, quantum models frequently fail at regression tasks. Through two white-box experiments, we show that such failures can occur even when the relevant frequencies are present, due to an insufficient number of trainable parameters. In order to mitigate the double-exponential parameter growth resulting from double-exponentially growing frequencies, we propose frequency selection and dimensional separation as techniques to constrain the number of parameters, thereby improving trainability. By restricting the QML model to essential frequencies and permitting mixed frequencies only among feature dimensions with known interdependence, we expand the set of tractable problems on current hardware. We demonstrate the reduced parameter requirements by fitting two white-box functions with known frequency spectrum and dimensional interdependencies that could not be fitted with the default methods. The reduced parameter requirements permit us to perform training on a noisy quantum simulator and to demonstrate inference on real quantum hardware.

[LG-62] Virtual Sensing for Solder Layer Degradation and Temperature Monitoring in IGBT Modules

Link: https://arxiv.org/abs/2508.10515
Authors: Andrea Urgolo, Monika Stipsitz, Helios Sanchis-Alepuz
Categories: Computational Physics (physics.comp-ph); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Andrea Urgolo and Monika Stipsitz contributed equally to this work

Abstract:Monitoring the degradation state of Insulated Gate Bipolar Transistor (IGBT) modules is essential for ensuring the reliability and longevity of power electronic systems, especially in safety-critical and high-performance applications. However, direct measurement of key degradation indicators - such as junction temperature, solder fatigue or delamination - remains challenging due to the physical inaccessibility of internal components and the harsh environment. In this context, machine learning-based virtual sensing offers a promising alternative by bridging the gap from feasible sensor placement to the relevant but inaccessible locations. This paper explores the feasibility of estimating the degradation state of solder layers, and the corresponding full temperature maps based on a limited number of physical sensors. Based on synthetic data of a specific degradation mode, we obtain a high accuracy in the estimation of the degraded solder area (1.17% mean absolute error), and are able to reproduce the surface temperature of the IGBT with a maximum relative error of 4.56% (corresponding to an average relative error of 0.37%).

[LG-63] Mo Memory Mo Problems: Stream-Native Machine Unlearning

Link: https://arxiv.org/abs/2508.10193
Authors: Kennon Stewart
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Machine unlearning work assumes a static, i.i.d. training environment that doesn’t truly exist. Modern ML pipelines need to learn, unlearn, and predict continuously on production streams of data. We translate the notion of the batch unlearning scenario to the online setting using notions of regret, sample complexity, and deletion capacity. We further tighten regret bounds to a logarithmic $\mathcal{O}(\ln T)$, a first for a machine unlearning algorithm. And we swap out an expensive Hessian inversion with an online variant of L-BFGS optimization, removing a memory footprint that scales linearly with time. Such changes extend the lifespan of an ML model before expensive retraining, making for a more efficient unlearning process.

[LG-64] Estimating carbon pools in the shelf sea environment: reanalysis or model-informed machine learning?

Link: https://arxiv.org/abs/2508.10178
Authors: Jozef Skakala
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 24 pages, 9 figures (4 in the appendix)

Abstract:Shelf seas are important for carbon sequestration and the carbon cycle, but available in situ or satellite data for carbon pools in the shelf sea environment are often sparse or highly uncertain. An alternative can be provided by reanalyses, but these are often expensive to run. We propose to use an ensemble of neural networks (NN) to learn from a coupled physics-biogeochemistry model the relationship between the directly observable variables and carbon pools. We demonstrate for the North-West European Shelf (NWES) sea environment that when the NN trained on a model free-run simulation is applied to the NWES reanalysis, it is capable of reproducing the reanalysis outputs for carbon pools. Moreover, unlike the existing NWES reanalysis, the NN ensemble is also capable of providing uncertainty information for the pools. We focus on explainability of the results and demonstrate potential use of the NNs for future climate what-if scenarios. We suggest that model-informed machine learning presents a viable alternative to expensive reanalyses and could complement observational data, wherever they are missing and/or highly uncertain.

[LG-65] Prediction-Powered Inference with Inverse Probability Weighting

Link: https://arxiv.org/abs/2508.10149
Authors: Jyotishka Datta, Nicholas G. Polson
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 15 pages, 3 figures

Abstract:Prediction-powered inference (PPI) is a recent framework for valid statistical inference with partially labeled data, combining model-based predictions on a large unlabeled set with bias correction from a smaller labeled subset. We show that PPI can be extended to handle informative labeling by replacing its unweighted bias-correction term with an inverse probability weighted (IPW) version, using the classical Horvitz–Thompson or Hájek forms. This connection unites design-based survey sampling ideas with modern prediction-assisted inference, yielding estimators that remain valid when labeling probabilities vary across units. We consider the common setting where the inclusion probabilities are not known but estimated from a correctly specified model. In simulations, the performance of IPW-adjusted PPI with estimated propensities closely matches the known-probability case, retaining both nominal coverage and the variance-reduction benefits of PPI.
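
A minimal sketch of how an IPW-adjusted PPI estimate of a population mean could be computed, assuming the standard PPI form (mean prediction over all units plus a residual correction from the labeled units). The function name `ipw_ppi_mean`, its arguments, and the toy numbers are illustrative assumptions, not taken from the paper, and the labeling probabilities are treated as given rather than estimated from a model.

```python
def ipw_ppi_mean(preds, labels, probs, hajek=False):
    """Estimate a population mean from predictions plus an IPW residual correction.

    preds  : model predictions f(X_i) for all N units
    labels : observed Y_i for labeled units, None for unlabeled units
    probs  : labeling (inclusion) probabilities pi_i, each in (0, 1]
    hajek  : if True, normalise by the sum of weights (Hajek form);
             otherwise normalise by N (Horvitz-Thompson form)
    """
    n_total = len(preds)
    plug_in = sum(preds) / n_total  # mean prediction over all units

    weighted_resid = 0.0  # sum of (Y_i - f(X_i)) / pi_i over labeled units
    weight_sum = 0.0      # sum of 1 / pi_i over labeled units
    for f, y, p in zip(preds, labels, probs):
        if y is not None:
            weighted_resid += (y - f) / p
            weight_sum += 1.0 / p

    if weight_sum == 0.0:
        raise ValueError("at least one labeled unit is required")

    denom = weight_sum if hajek else n_total
    return plug_in + weighted_resid / denom


# Toy data: four units, two labeled, with unequal labeling probabilities.
preds = [1.0, 2.0, 3.0, 4.0]
labels = [2.0, None, 3.0, None]
probs = [0.5, 0.5, 0.25, 0.5]

ht_estimate = ipw_ppi_mean(preds, labels, probs)
hajek_estimate = ipw_ppi_mean(preds, labels, probs, hajek=True)
```

The two forms differ only in the normaliser of the correction term: Horvitz-Thompson divides by the population size, while Hájek divides by the sum of the inverse-probability weights, which tends to be more stable when the weights vary widely.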

[LG-66] Machine Learning for Cloud Detection in IASI Measurements: A Data-Driven SVM Approach with Physical Constraints

Link: https://arxiv.org/abs/2508.10120
Authors: Chiara Zugarini, Cristina Sgattoni, Luca Sgheri
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)

Abstract:Cloud detection is essential for atmospheric retrievals, climate studies, and weather forecasting. We analyze infrared radiances from the Infrared Atmospheric Sounding Interferometer (IASI) onboard Meteorological Operational (MetOp) satellites to classify scenes as clear or cloudy. We apply the Support Vector Machine (SVM) approach, based on kernel methods for non-separable data. In this study, the method is implemented for Cloud Identification (CISVM) to classify the test set using radiances or brightness temperatures, with dimensionality reduction through Principal Component Analysis (PCA) and cloud-sensitive channel selection to focus on the most informative features. Our best configuration achieves 88.30 percent agreement with reference labels and shows strong consistency with cloud masks from the Moderate Resolution Imaging Spectroradiometer (MODIS), with the largest discrepancies in polar regions due to sensor differences. These results demonstrate that CISVM is a robust, flexible, and efficient method for automated cloud classification from infrared radiances, suitable for operational retrievals and future missions such as Far infrared Outgoing Radiation Understanding and Monitoring (FORUM), the ninth European Space Agency Earth Explorer Mission.

[LG-67] In silico study on the cytotoxicity against Hela cancer cells of xanthones bioactive compounds from Garcinia cowa: QSAR based on Graph Deep Learning Network Pharmacology and Molecular Docking

Link: https://arxiv.org/abs/2508.10117
Authors: Nguyen Manh Son, Pham Huu Vang, Nguyen Thi Dung, Nguyen Manh Ha, Ta Thi Thao, Tran Thi Thu Thuy, Phan Minh Giang
Categories: Molecular Networks (q-bio.MN); Machine Learning (cs.LG)

Abstract:Cancer is recognized as a complex group of diseases, contributing to the highest global mortality rates, with increasing prevalence and a trend toward affecting younger populations. It is characterized by uncontrolled proliferation of abnormal cells, invasion of adjacent tissues, and metastasis to distant organs. Garcinia cowa, a traditional medicinal plant widely used in Southeast Asia, including Vietnam, is employed to treat fever, cough, indigestion, as a laxative, and for parasitic diseases. Numerous xanthone compounds isolated from this species exhibit a broad spectrum of biological activities, with some showing promise as anti-cancer and antimalarial agents. Network pharmacology analysis successfully identified key bioactive compounds (Rubraxanthone, Garcinone D, Norcowanin, Cowanol, and Cowaxanthone) alongside their primary protein targets (TNF, CTNNB1, SRC, NFKB1, and MTOR), providing critical insights into the molecular mechanisms underlying their anti-cancer effects. The Graph Attention Network algorithm demonstrated superior predictive performance, achieving an $R^2$ of 0.98 and an RMSE of 0.02 after data augmentation, highlighting its accuracy in predicting pIC50 values for xanthone-based compounds. Additionally, molecular docking revealed MTOR as a potential target through which compounds from Garcinia cowa induce cytotoxicity in HeLa cancer cells.

[LG-68] Dynamical Alignment: A Principle for Adaptive Neural Computation

Link: https://arxiv.org/abs/2508.10064
Authors: Xia Chen
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments: 16 pages, 10 figures

Abstract:The computational capabilities of a neural network are widely assumed to be determined by its static architecture. Here we challenge this view by establishing that a fixed neural structure can operate in fundamentally different computational modes, driven not by its structure but by the temporal dynamics of its input signals. We term this principle ‘Dynamical Alignment’. Applying this principle offers a novel resolution to the long-standing paradox of why brain-inspired spiking neural networks (SNNs) underperform. By encoding static input into controllable dynamical trajectories, we uncover a bimodal optimization landscape with a critical phase transition governed by phase space volume dynamics. A ‘dissipative’ mode, driven by contracting dynamics, achieves superior energy efficiency through sparse temporal codes. In contrast, an ‘expansive’ mode, driven by expanding dynamics, unlocks the representational power required for SNNs to match or even exceed their artificial neural network counterparts on diverse tasks, including classification, reinforcement learning, and cognitive integration. We find this computational advantage emerges from a timescale alignment between input dynamics and neuronal integration. This principle, in turn, offers a unified, computable perspective on long-observed dualities in neuroscience, from stability-plasticity dilemma to segregation-integration dynamic. It demonstrates that computation in both biological and artificial systems can be dynamically sculpted by ‘software’ on fixed ‘hardware’, pointing toward a potential paradigm shift for AI research: away from designing complex static architectures and toward mastering adaptive, dynamic computation principles. 
MSC classes: 68T05; ACM classes: I.2.6

[LG-69] Bayesian Models for Joint Selection of Features and Auto-Regressive Lags: Theory and Applications in Environmental and Financial Forecasting

Link: https://arxiv.org/abs/2508.10055
Authors: Alokesh Manna, Sujit K. Ghosh
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:We develop a Bayesian framework for variable selection in linear regression with autocorrelated errors, accommodating lagged covariates and autoregressive structures. This setting occurs in time series applications where responses depend on contemporaneous or past explanatory variables and persistent stochastic shocks, including financial modeling, hydrological forecasting, and meteorological applications requiring temporal dependency capture. Our methodology uses hierarchical Bayesian models with spike-and-slab priors to simultaneously select relevant covariates and lagged error terms. We propose an efficient two-stage MCMC algorithm separating sampling of variable inclusion indicators and model parameters to address high-dimensional computational challenges. Theoretical analysis establishes posterior selection consistency under mild conditions, even when candidate predictors grow exponentially with sample size, common in modern time series with many potential lagged variables. Through simulations and real applications (groundwater depth prediction, S&P 500 log returns modeling), we demonstrate substantial gains in variable selection accuracy and predictive performance. Compared to existing methods, our framework achieves lower MSPE, improved true model component identification, and greater robustness with autocorrelated noise, underscoring practical utility for model interpretation and forecasting in autoregressive settings.
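
For orientation, one generic way to write the model class this abstract describes is sketched below. All notation ($\beta_j$, $\phi_k$, $\gamma_j$, $\tau^2$, $\pi$, $p$, $q$) is assumed here for illustration; the paper's exact prior hierarchy and parameterisation may differ.

```latex
\begin{aligned}
y_t &= \sum_{j=1}^{p} x_{t,j}\,\beta_j + \varepsilon_t,
\qquad
\varepsilon_t = \sum_{k=1}^{q} \phi_k\,\varepsilon_{t-k} + \eta_t,
\quad \eta_t \sim \mathcal{N}(0, \sigma^2), \\
\beta_j \mid \gamma_j &\sim (1-\gamma_j)\,\delta_0 + \gamma_j\,\mathcal{N}(0, \tau^2),
\qquad
\gamma_j \sim \mathrm{Bernoulli}(\pi).
\end{aligned}
```

Here the covariates $x_{t,j}$ may include lagged explanatory variables, $\delta_0$ is a point mass at zero (the "spike"), and the normal component is the "slab". An analogous indicator-based prior on the $\phi_k$ would select which autoregressive lags of the error term are active.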

[LG-70] zERExtractor: An Automated Platform for Enzyme-Catalyzed Reaction Data Extraction from Scientific Literature

Link: https://arxiv.org/abs/2508.09995
Authors: Rui Zhou, Haohui Ma, Tianle Xin, Lixin Zou, Qiuyue Hu, Hongxi Cheng, Mingzhi Lin, Jingjing Guo, Sheng Wang, Guoqing Zhang, Yanjie Wei, Liangzhen Zheng
Categories: Biomolecules (q-bio.BM); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:The rapid expansion of enzyme kinetics literature has outpaced the curation capabilities of major biochemical databases, creating a substantial barrier to AI-driven modeling and knowledge discovery. We present zERExtractor, an automated and extensible platform for comprehensive extraction of enzyme-catalyzed reaction and activity data from scientific literature. zERExtractor features a unified, modular architecture that supports plug-and-play integration of state-of-the-art models, including large language models (LLMs), as interchangeable components, enabling continuous system evolution alongside advances in AI. Our pipeline combines domain-adapted deep learning, advanced OCR, semantic entity recognition, and prompt-driven LLM modules, together with human expert corrections, to extract kinetic parameters (e.g., kcat, Km), enzyme sequences, substrate SMILES, experimental conditions, and molecular diagrams from heterogeneous document formats. Through active learning strategies integrating AI-assisted annotation, expert validation, and iterative refinement, the system adapts rapidly to new data sources. We also release a large benchmark dataset comprising over 1,000 annotated tables and 5,000 biological fields from 270 P450-related enzymology publications. Benchmarking demonstrates that zERExtractor consistently outperforms existing baselines in table recognition (Acc 89.9%), molecular image interpretation (up to 99.1%), and relation extraction (accuracy 94.2%). zERExtractor bridges the longstanding data gap in enzyme kinetics with a flexible, plugin-ready framework and high-fidelity extraction, laying the groundwork for future AI-powered enzyme modeling and biochemical knowledge discovery.

Information Retrieval

[IR-0] Hypercomplex Prompt-aware Multimodal Recommendation CIKM2025

Link: https://arxiv.org/abs/2508.10753
Authors: Zheyu Chen, Jinfeng Xu, Hewei Wang, Shuo Yang, Zitong Wan, Haibo Hu
Categories: Information Retrieval (cs.IR)
Comments: Accepted by CIKM 2025

Abstract:Modern recommender systems face critical challenges in handling information overload while addressing the inherent limitations of multimodal representation learning. Existing methods suffer from three fundamental limitations: (1) restricted ability to represent rich multimodal features through a single representation, (2) linear modality fusion strategies that ignore the deep nonlinear correlations between modalities, and (3) static optimization methods that fail to dynamically mitigate the over-smoothing problem in graph convolutional networks (GCNs). To overcome these limitations, we propose HPMRec, a novel Hypercomplex Prompt-aware Multimodal Recommendation framework, which utilizes hypercomplex embeddings in the form of multi-components to enhance the representation diversity of multimodal features. HPMRec adopts the hypercomplex multiplication to naturally establish nonlinear cross-modality interactions to bridge semantic gaps, which helps explore cross-modality features. HPMRec also introduces the prompt-aware compensation mechanism to mitigate misalignment between components and the loss of modality-specific features, and this mechanism fundamentally alleviates the over-smoothing problem. It further designs self-supervised learning tasks that enhance representation diversity and align different modalities. Extensive experiments on four public datasets show that HPMRec achieves state-of-the-art recommendation performance.

[IR-1] FuXi-β: Towards a Lightweight and Fast Large-Scale Generative Recommendation Model

Link: https://arxiv.org/abs/2508.10615
Authors: Yufei Ye, Wei Guo, Hao Wang, Hong Zhu, Yuyang Ye, Yong Liu, Huifeng Guo, Ruiming Tang, Defu Lian, Enhong Chen
Categories: Information Retrieval (cs.IR)

Abstract:Scaling laws for autoregressive generative recommenders reveal potential for larger, more versatile systems but mean greater latency and training costs. To accelerate training and inference, we investigated the recent generative recommendation models HSTU and FuXi-α, identifying two efficiency bottlenecks: the indexing operations in relative temporal attention bias and the computation of the query-key attention map. Additionally, we observed that relative attention bias in self-attention mechanisms can also serve as attention maps. Previous works like Synthesizer have shown that alternative forms of attention maps can achieve similar performance, naturally raising the question of whether some attention maps are redundant. Through empirical experiments, we discovered that using the query-key attention map might degrade the model’s performance in recommendation tasks. To address these bottlenecks, we propose a new framework applicable to Transformer-like recommendation models. On one hand, we introduce Functional Relative Attention Bias, which avoids the time-consuming operations of the original relative attention bias, thereby accelerating the process. On the other hand, we remove the query-key attention map from the original self-attention layer and design a new Attention-Free Token Mixer module. Furthermore, by applying this framework to FuXi-α, we introduce a new model, FuXi-β. Experiments across multiple datasets demonstrate that FuXi-β outperforms previous state-of-the-art models and achieves significant acceleration compared to FuXi-α, while also adhering to the scaling law. Notably, FuXi-β shows an improvement of 27% to 47% in the NDCG@10 metric on large-scale industrial datasets compared to FuXi-α. Our code is available in a public repository.
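
Since the headline result is reported in NDCG@10, a small sketch of how that metric is commonly computed may help. It assumes linear gain with a log2 rank discount and derives the ideal ranking from the same relevance list (that is, it assumes every relevant item appears in the list); the function names are illustrative and the paper's exact evaluation protocol is not stated in the abstract.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (rank 0 is the top)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalised by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of the items in ranked order: only the second-ranked item is relevant.
score = ndcg_at_k([0, 1, 0])
```

With a single relevant item, the score depends only on the rank at which that item appears, so moving it from rank 2 to rank 1 raises NDCG from roughly 0.63 to 1.0.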

[IR-2] DAS: Dual-Aligned Semantic IDs Empowered Industrial Recommender System CIKM2025

Link: https://arxiv.org/abs/2508.10584
Authors: Wencai Ye, Mingjie Sun, Shaoyun Shi, Peng Wang, Wenjin Wu, Peng Jiang
Categories: Information Retrieval (cs.IR)
Comments: Accepted by CIKM 2025

Abstract:Semantic IDs are discrete identifiers generated by quantizing embeddings from Multi-modal Large Language Models (MLLMs), enabling efficient multi-modal content integration in recommendation systems. However, their lack of collaborative signals results in a misalignment with downstream discriminative and generative recommendation objectives. Recent studies have introduced various alignment mechanisms to address this problem, but their two-stage framework design still leads to two main limitations: (1) inevitable information loss during alignment, and (2) inflexibility in applying adaptive alignment strategies, consequently constraining the mutual information maximization during the alignment process. To address these limitations, we propose a novel and flexible one-stage Dual-Aligned Semantic IDs (DAS) method that simultaneously optimizes quantization and alignment, preserving semantic integrity and alignment quality while avoiding the information loss typically associated with two-stage methods. Meanwhile, DAS achieves more efficient alignment between the semantic IDs and collaborative signals, with the following two innovative and effective approaches: (1) Multi-view Contrastive Alignment: To maximize mutual information between semantic IDs and collaborative signals, we first incorporate an ID-based CF debias module, and then design three effective contrastive alignment methods: dual user-to-item (u2i), dual item-to-item/user-to-user (i2i/u2u), and dual co-occurrence item-to-item/user-to-user (i2i/u2u). (2) Dual Learning: By aligning the dual quantizations of users and ads, the constructed semantic IDs for users and ads achieve stronger alignment. Finally, we conduct extensive offline experiments and online A/B tests to evaluate DAS’s effectiveness, which is now successfully deployed across various advertising scenarios in the Kuaishou App, serving over 400 million users daily.
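
To make "quantizing embeddings into discrete identifiers" concrete, here is a minimal sketch using residual quantization, one common way of building Semantic IDs. The abstract does not say which quantizer DAS uses, so the technique, the function names, and the tiny hand-written codebooks are all assumptions for illustration; the sketch also omits the joint alignment training that is the paper's actual contribution.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def semantic_id(embedding, codebooks):
    """Map an embedding to a tuple of code indices, one per quantization level.

    Each level quantizes what the previous levels failed to explain
    (the residual), so later codes refine earlier ones.
    """
    residual = list(embedding)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)


# Hypothetical two-level codebooks over 2-dimensional embeddings.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],    # coarse level
    [[0.0, 0.0], [0.5, -0.5]],   # refinement level
]
item_code = semantic_id([1.4, 0.6], codebooks)
```

Items whose embeddings are close share a prefix of codes, which is what lets a generative recommender treat the code tuple as a short token sequence.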

[IR-3] Efficient Patent Searching Using Graph Transformers SIGIR2025

Link: https://arxiv.org/abs/2508.10496
Authors: Krzysztof Daniell, Igor Buzhinsky, Sebastian Björkqvist
Categories: Information Retrieval (cs.IR)
Comments: Accepted for publication at the PatentSemTech 2025 workshop, held in conjunction with SIGIR 2025

Abstract:Finding relevant prior art is crucial when deciding whether to file a new patent application or invalidate an existing patent. However, searching for prior art is challenging due to the large number of patent documents and the need for nuanced comparisons to determine novelty. An accurate search engine is therefore invaluable for speeding up the process. We present a Graph Transformer-based dense retrieval method for patent searching where each invention is represented by a graph describing its features and their relationships. Our model processes these invention graphs and is trained using prior art citations from patent office examiners as relevance signals. Using graphs as input significantly improves the computational efficiency of processing long documents, while leveraging examiner citations allows the model to learn domain-specific similarities beyond simple text-based matching. The result is a search engine that emulates how professional patent examiners identify relevant documents. We compare our approach against publicly available text embedding models and show substantial improvements in both prior art retrieval quality and computational efficiency.

[IR-4] Semantic IDs for Joint Generative Search and Recommendation RECSYS2025

Link: https://arxiv.org/abs/2508.10478
Authors: Gustavo Penha, Edoardo D’Amico, Marco De Nadai, Enrico Palumbo, Alexandre Tamborrino, Ali Vardasbi, Max Lefarov, Shawn Lin, Timothy Heath, Francesco Fabbri, Hugues Bouchard
Categories: Information Retrieval (cs.IR)
Comments: Accepted for publication in the 19th ACM Conference on Recommender Systems (RecSys 2025), Late-Breaking Results track

Abstract:Generative models powered by Large Language Models (LLMs) are emerging as a unified solution for powering both recommendation and search tasks. A key design choice in these models is how to represent items, traditionally through unique identifiers (IDs) and more recently with Semantic IDs composed of discrete codes obtained from embeddings. While task-specific embedding models can improve performance for individual tasks, they may not generalize well in a joint setting. In this paper, we explore how to construct Semantic IDs that perform well both in search and recommendation when using a unified model. We compare a range of strategies to construct Semantic IDs, looking into task-specific and cross-task approaches, and also whether each task should have its own Semantic ID tokens in a joint search and recommendation generative model. Our results show that using a bi-encoder model fine-tuned on both search and recommendation tasks to obtain item embeddings, followed by the construction of a unified Semantic ID space, provides an effective trade-off, enabling strong performance in both tasks. We hope these findings spark follow-up work on generalisable, semantically grounded ID schemes and inform the next wave of unified generative recommender architectures.

[IR-5] Proxy Model-Guided Reinforcement Learning for Client Selection in Federated Recommendation

Link: https://arxiv.org/abs/2508.10401
Authors: Liang Qu, Jianxin Li, Wei Yuan, Penghui Ruan, Yuhui Shi, Hongzhi Yin
Categories: Information Retrieval (cs.IR)
Comments: Under review

Abstract:Federated recommender systems (FedRSs) have emerged as a promising privacy-preserving paradigm, enabling personalized recommendation services without exposing users’ raw data. By keeping data local and relying on a central server to coordinate training across distributed clients, FedRSs protect user privacy while collaboratively learning global models. However, most existing FedRS frameworks adopt a fully random client selection strategy in each training round, overlooking the statistical heterogeneity of user data arising from diverse preferences and behavior patterns, thereby resulting in suboptimal model performance. While some client selection strategies have been proposed in the broader federated learning literature, these methods are typically designed for generic tasks and fail to address the unique challenges of recommendation scenarios, such as expensive contribution evaluation due to the large number of clients, and sparse updates resulting from long-tail item distributions. To bridge this gap, we propose ProxyRL-FRS, a proxy model-guided reinforcement learning framework tailored for client selection in federated recommendation. Specifically, we first introduce ProxyNCF, a dual-branch model deployed on each client, which augments standard Neural Collaborative Filtering with an additional proxy model branch that provides lightweight contribution estimation, thus eliminating the need for expensive per-round local training traditionally required to evaluate a client’s contribution. Furthermore, we design a staleness-aware (SA) reinforcement learning agent that selects clients based on the proxy-estimated contribution, and is guided by a reward function balancing recommendation accuracy and embedding staleness, thereby enriching the update coverage of item embeddings. Experiments conducted on public recommendation datasets demonstrate the effectiveness of ProxyRL-FRS.

[IR-6] DS4RS: Community-Driven and Explainable Dataset Search Engine for Recommender System Research

Link: https://arxiv.org/abs/2508.10238
Authors: Xinyang Shao, Tri Kurniawan Wijaya
Categories: Information Retrieval (cs.IR)

Abstract:Accessing suitable datasets is critical for research and development in recommender systems. However, finding datasets that match specific recommendation tasks or domains remains a challenge due to scattered sources and inconsistent metadata. To address this gap, we propose a community-driven and explainable dataset search engine tailored for recommender system research. Our system supports semantic search across multiple dataset attributes, such as dataset names, descriptions, and recommendation domain, and provides explanations of search relevance to enhance transparency. The system encourages community participation by allowing users to contribute standardized dataset metadata in a public repository. By improving dataset discoverability and search interpretability, the system facilitates more efficient research reproduction. The platform is publicly available.

[IR-7] Bridging Modality Gaps in e-Commerce Products via Vision-Language Alignment

Link: https://arxiv.org/abs/2508.10116
Authors: Yipeng Zhang, Hongju Yu, Aritra Mandal, Canran Xu, Qunzhi Zhou, Zhe Wu
Categories: Information Retrieval (cs.IR)

Abstract:Item information, such as titles and attributes, is essential for effective user engagement in e-commerce. However, manual or semi-manual entry of structured item specifics often produces inconsistent quality, errors, and slow turnaround, especially for Customer-to-Customer sellers. Generating accurate descriptions directly from item images offers a promising alternative. Existing retrieval-based solutions address some of these issues but often miss fine-grained visual details and struggle with niche or specialized categories. We propose Optimized Preference-Based AI for Listings (OPAL), a framework for generating schema-compliant, high-quality item descriptions from images using a fine-tuned multimodal large language model (MLLM). OPAL addresses key challenges in multimodal e-commerce applications, including bridging modality gaps and capturing detailed contextual information. It introduces two data refinement methods: MLLM-Assisted Conformity Enhancement, which ensures alignment with structured schema requirements, and LLM-Assisted Contextual Understanding, which improves the capture of nuanced and fine-grained information from visual inputs. OPAL uses visual instruction tuning combined with direct preference optimization to fine-tune the MLLM, reducing hallucinations and improving robustness across different backbone architectures. We evaluate OPAL on real-world e-commerce datasets, showing that it consistently outperforms baseline methods in both description quality and schema completion rates. These results demonstrate that OPAL effectively bridges the gap between visual and textual modalities, delivering richer, more accurate, and more consistent item descriptions. This work advances automated listing optimization and supports scalable, high-quality content generation in e-commerce platforms. 

Attachment Download

Click to download today's full paper list