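This post lists the latest papers retrieved from arXiv.org on 2025-08-29. It is updated automatically and groups papers into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.

Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.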

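Overview (2025-08-29)

488 papers were updated today, including:

  • Natural Language Processing: 69 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 132 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 116 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 119 papers (Machine Learning (cs.LG))

Natural Language Processing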

[NLP-0] Enabling Equitable Access to Trustworthy Financial Reasoning

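[Quick read]: This paper addresses the difficulty of automating tax filing, which stems from complex rules and the need for precise calculation, and in particular the shortcomings of current large language models (LLMs) in accuracy and auditability. The key solution is a neuro-symbolic architecture that combines an LLM with a symbolic solver: statutory text is translated up front into formal logic programs, and intelligently retrieved exemplars are used to represent specific cases. This markedly improves performance on the StAtutory Reasoning Assessment (SARA) dataset while reducing deployment cost to below the real-world average, yielding a tax assistance tool that is accurate, auditable, and economically feasible.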

Link: https://arxiv.org/abs/2508.21051
Authors: William Jurayj, Nils Holzenberger, Benjamin Van Durme
Affiliations: Johns Hopkins University; Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract: According to the United States Internal Revenue Service, "the average American spends $270 and 13 hours filing their taxes". Even beyond the U.S., tax filing requires complex reasoning, combining application of overlapping rules with numerical calculations. Because errors can incur costly penalties, any automated system must deliver high accuracy and auditability, making modern large language models (LLMs) poorly suited for this task. We propose an approach that integrates LLMs with a symbolic solver to calculate tax obligations. We evaluate variants of this system on the challenging StAtutory Reasoning Assessment (SARA) dataset, and include a novel method for estimating the cost of deploying such a system based on real-world penalties for tax errors. We further show how combining up-front translation of plain-text rules into formal logic programs, combined with intelligently retrieved exemplars for formal case representations, can dramatically improve performance on this task and reduce costs to well below real-world averages. Our results demonstrate the promise and economic feasibility of neuro-symbolic architectures for increasing equitable access to reliable tax assistance.

[NLP-1] Re-Representation in Sentential Relation Extraction with Sequence Routing Algorithm

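[Quick read]: This paper targets performance bottlenecks in sentential relation extraction (RE), particularly those caused by label noise and limited representational capacity. The key idea is a capsule network with dynamic routing, which gives better re-representation: at comparison time, the match between analogous pairs (such as King:Queen and Man:Woman) is strengthened, which improves relation classification accuracy. Experiments show that the method outperforms state-of-the-art models on standard datasets such as Tacred and Retacred, but performs worse on Wikidata, a larger dataset with noisier labels. This points to label noise and re-representation ability as two core factors affecting relation extraction.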

Link: https://arxiv.org/abs/2508.21049
Authors: Ramazan Ali Bahrami, Ramin Yahyapour
Affiliations: Georg-August-Universität Göttingen; GWDG
Categories: Computation and Language (cs.CL)
Comments: Presented in 8th International Conference on Natural Language and Speech Processing (ICNLSP), 25-27 August 2025, SDU, Odense, Denmark

Abstract:Sentential relation extraction (RE) is an important task in natural language processing (NLP). In this paper we propose to do sentential RE with dynamic routing in capsules. We first show that the proposed approach outperform state of the art on common sentential relation extraction datasets Tacred, Tacredrev, Retacred, and Conll04. We then investigate potential reasons for its good performance on the mentioned datasets, and yet low performance on another similar, yet larger sentential RE dataset, Wikidata. As such, we identify noise in Wikidata labels as one of the reasons that can hinder performance. Additionally, we show associativity of better performance with better re-representation, a term from neuroscience referred to change of representation in human brain to improve the match at comparison time. As example, in the given analogous terms King:Queen::Man:Woman, at comparison time, and as a result of re-representation, the similarity between related head terms (King,Man), and tail terms (Queen,Woman) increases. As such, our observation show that our proposed model can do re-representation better than the vanilla model compared with. To that end, beside noise in the labels of the distantly supervised RE datasets, we propose re-representation as a challenge in sentential RE.

[NLP-2] On the Theoretical Limitations of Embedding-Based Retrieval

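[Quick read]: This paper examines the theoretical limitations that vector embeddings face in realistic settings. As embeddings are applied to retrieval, reasoning, instruction following, and other complex tasks, some bottlenecks remain that neither larger models nor more training data can remove. Combining learning theory with empirical analysis, the authors show that the embedding dimension limits the number of top-k document subsets that any query can return, regardless of how much training data or how large a model is used. They also build a realistic test dataset, LIMIT, and find that even state-of-the-art embedding models fail on it despite the simplicity of the queries. The work exposes a fundamental limitation of the current single-vector paradigm and calls for research into methods that overcome it.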

Link: https://arxiv.org/abs/2508.21038
Authors: Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee
Affiliations: Google DeepMind; Johns Hopkins University
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance that could be given. While prior works have pointed out theoretical limitations of vector embeddings, there is a common assumption that these difficulties are exclusively due to unrealistic queries, and those that are not can be overcome with better training data and larger models. In this work, we demonstrate that we may encounter these theoretical limitations in realistic settings with extremely simple queries. We connect known results in learning theory, showing that the number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding. We empirically show that this holds true even if we restrict to k=2, and directly optimize on the test set with free parameterized embeddings. We then create a realistic dataset called LIMIT that stress tests models based on these theoretical results, and observe that even state-of-the-art models fail on this dataset despite the simple nature of the task. Our work shows the limits of embedding models under the existing single vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation.
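
The dimension bound can be seen in a toy case. The sketch below is not taken from the paper: it assumes one-dimensional embeddings, dot-product scoring, and four made-up document vectors, then enumerates which top-2 document sets any scalar query can return.

```python
from itertools import combinations


def reachable_top_k(doc_embs, queries, k=2):
    """Collect every top-k document set returned by some query under dot-product scoring."""
    reachable = set()
    for q in queries:
        ranked = sorted(range(len(doc_embs)), key=lambda i: q * doc_embs[i], reverse=True)
        reachable.add(frozenset(ranked[:k]))
    return reachable


# Hypothetical 1-D embeddings for four documents (illustrative values only).
docs = [-2.0, -0.5, 1.0, 3.0]
# A sweep of scalar queries; q = 0 is skipped because it ties every document.
queries = [x / 10 for x in range(-50, 51) if x != 0]

found = reachable_top_k(docs, queries)
total = len(list(combinations(range(len(docs)), 2)))
print(f"{len(found)} of {total} possible top-2 sets are reachable")
```

In one dimension a query can only rank documents in ascending or descending order, so only the two extreme pairs are ever returned. Higher dimensions admit more subsets, but the count stays bounded by the dimension, which is the effect the abstract describes.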

[NLP-3] An Agile Method for Implementing Retrieval Augmented Generation Tools in Industrial SMEs

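[Quick read]: This paper addresses the challenges that small and medium enterprises (SMEs) face when deploying retrieval-augmented generation (RAG) systems, including limited resources and a lack of natural language processing (NLP) expertise. The key contribution is EASI-RAG (Enterprise Application Support for Industrial RAG), a structured, agile method based on method engineering principles with clearly defined roles, activities, and techniques. It was validated in a real-world case study at an environmental testing laboratory: a team with no prior RAG experience deployed the system in under a month and then improved it iteratively based on user feedback, achieving fast implementation, high user adoption, accurate answers, and more reliable underlying data.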

Link: https://arxiv.org/abs/2508.21024
Authors: Mathieu Bourdin, Anas Neumann, Thomas Paviot, Robert Pellerin, Samir Lamouri
Affiliations: unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 20 pages, 3 figures

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful solution to mitigate the limitations of Large Language Models (LLMs), such as hallucinations and outdated knowledge. However, deploying RAG-based tools in Small and Medium Enterprises (SMEs) remains a challenge due to their limited resources and lack of expertise in natural language processing (NLP). This paper introduces EASI-RAG, Enterprise Application Support for Industrial RAG, a structured, agile method designed to facilitate the deployment of RAG systems in industrial SME contexts. EASI-RAG is based on method engineering principles and comprises well-defined roles, activities, and techniques. The method was validated through a real-world case study in an environmental testing laboratory, where a RAG tool was implemented to answer operators queries using data extracted from operational procedures. The system was deployed in under a month by a team with no prior RAG experience and was later iteratively improved based on user feedback. Results demonstrate that EASI-RAG supports fast implementation, high user adoption, delivers accurate answers, and enhances the reliability of underlying data. This work highlights the potential of RAG deployment in industrial SMEs. Future works include the need for generalization across diverse use cases and further integration with fine-tuned models.

[NLP-4] ChainReaction! Structured Approach with Causal Chains as Intermediate Representations for Improved and Explainable Causal Video Question Answering

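[Quick read]: This paper addresses the weak higher-order reasoning of existing Causal-Why Video Question Answering (VideoQA) models, which stems from their reliance on black-box, monolithic pipelines that are hard to interpret and prone to shallow heuristics. The key solution is a modular framework that explicitly decouples causal reasoning from answer generation and introduces natural language causal chains as interpretable intermediate representations. The framework has two stages: a Causal Chain Extractor (CCE) generates structured causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) produces answers grounded in those chains. This design improves the transparency and logical coherence of inference. It also uses large language models to generate high-quality causal chain data automatically, which eases the shortage of annotations and improves explainability, user trust, and cross-domain generalization.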

Link: https://arxiv.org/abs/2508.21010
Authors: Paritosh Parmar, Eric Peh, Basura Fernando
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular framework that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that produces answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating high-quality causal chains from existing datasets using large language models. We also propose CauCo, a new evaluation metric for causality-oriented captioning. Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models, but also yields substantial gains in explainability, user trust, and generalization – positioning the CCE as a reusable causal reasoning engine across diverse domains. Project page: this https URL

[NLP-5] Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution

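[Quick read]: This paper addresses the vulnerability of large language models (LLMs) to backdoor attacks, in which a model behaves normally on standard inputs but produces harmful or unintended output when a specific trigger is activated. Existing defenses lack comprehensiveness, focus on narrow trigger settings, or are not robust to advanced attacks such as model-editing-based, multi-trigger, and triggerless attacks. The proposed method, LETHE, dilutes backdoor knowledge through internal and external mechanisms. Internally, it trains a clean model on a lightweight dataset and merges it with the backdoored model to dilute the backdoor's influence in parametric memory. Externally, it injects benign, semantically relevant evidence into the prompt to draw the model's attention away from backdoor features. On classification and generation tasks, LETHE outperforms 8 state-of-the-art defense baselines, reduces the attack success rate of advanced backdoor attacks by up to 98% while preserving model utility, and is cost-efficient and robust to adaptive attacks.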

Link: https://arxiv.org/abs/2508.21004
Authors: Chen Chen, Yuchen Sun, Jiaxin Gao, Xueluan Gong, Qian Wang, Ziyao Wang, Yongsen Zheng, Kwok-Yan Lam
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. Existing backdoor defenses either lack comprehensiveness, focusing on narrow trigger settings, detection-only mechanisms, and limited domains, or fail to withstand advanced scenarios like model-editing-based, multi-trigger, and triggerless attacks. In this paper, we present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution using both internal and external mechanisms. Internally, LETHE leverages a lightweight dataset to train a clean model, which is then merged with the backdoored model to neutralize malicious behaviors by diluting the backdoor impact within the model’s parametric memory. Externally, LETHE incorporates benign and semantically relevant evidence into the prompt to distract LLM’s attention from backdoor features. Experimental results on classification and generation domains across 5 widely used LLMs demonstrate that LETHE outperforms 8 state-of-the-art defense baselines against 8 backdoor attacks. LETHE reduces the attack success rate of advanced backdoor attacks by up to 98% while maintaining model utility. Furthermore, LETHE has proven to be cost-efficient and robust against adaptive backdoor attacks.
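
The internal mechanism merges a clean model into the backdoored one. The abstract does not say which merge operator is used, so the sketch below assumes plain linear interpolation of parameters. The function name `merge_weights`, the `alpha` parameter, and the toy weight values are all hypothetical.

```python
def merge_weights(backdoored, clean, alpha=0.5):
    """Linearly interpolate two parameter dicts that share names and shapes.

    alpha = 0 keeps the backdoored weights; alpha = 1 keeps the clean weights.
    Linear interpolation is an assumption here, not the paper's stated operator.
    """
    if backdoored.keys() != clean.keys():
        raise ValueError("models must share parameter names")
    return {
        name: [(1 - alpha) * b + alpha * c for b, c in zip(backdoored[name], clean[name])]
        for name in backdoored
    }


# Toy parameters: the middle weight stands in for a backdoor-related value.
backdoored = {"layer.weight": [1.0, -4.0, 2.0]}
clean = {"layer.weight": [1.0, 0.0, 2.0]}

merged = merge_weights(backdoored, clean, alpha=0.5)
print(merged)
```

Weights on which the two models agree are unchanged, while the weight that differs is pulled halfway toward the clean value, which is the dilution idea in miniature.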

[NLP-6] ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents

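[Quick read]: This paper addresses the fragmentation and lack of a unified standard in evaluating the proactive dialogue abilities of large language models (LLMs). Existing work is mostly confined to domain-specific or task-oriented scenarios, so results are hard to compare and do not fully characterize a model's proactive interaction ability. The key contribution is the ProactiveEval framework, which decomposes proactive dialogue into two core dimensions, target planning and dialogue guidance, and builds 328 evaluation environments across 6 domains. It also supports automatic generation of diverse, challenging test data, enabling systematic and quantifiable evaluation of LLMs' proactive dialogue abilities.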

Link: https://arxiv.org/abs/2508.20973
Authors: Tianjian Liu, Fanqi Wan, Jiajian Guo, Xiaojun Quan
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 21 pages, 6 Figures

Abstract:Proactive dialogue has emerged as a critical and challenging research problem in advancing large language models (LLMs). Existing works predominantly focus on domain-specific or task-oriented scenarios, which leads to fragmented evaluations and limits the comprehensive exploration of models’ proactive conversation abilities. In this work, we propose ProactiveEval, a unified framework designed for evaluating proactive dialogue capabilities of LLMs. This framework decomposes proactive dialogue into target planning and dialogue guidance, establishing evaluation metrics across various domains. Moreover, it also enables the automatic generation of diverse and challenging evaluation data. Based on the proposed framework, we develop 328 evaluation environments spanning 6 distinct domains. Through experiments with 22 different types of LLMs, we show that DeepSeek-R1 and Claude-3.7-Sonnet exhibit exceptional performance on target planning and dialogue guidance tasks, respectively. Finally, we investigate how reasoning capabilities influence proactive behaviors and discuss their implications for future model development.

[NLP-7] STARE at the Structure: Steering ICL Exemplar Selection with Structural Alignment EMNLP2025

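[Quick read]: This paper addresses a weakness of in-context learning (ICL): for structured prediction tasks such as semantic parsing, existing exemplar selection strategies overlook structural alignment, which leads to suboptimal performance and poor generalization. The key solution is a two-stage exemplar selection strategy. First, a BERT-based retriever is fine-tuned with structure-aware supervision so that it selects exemplars that are both semantically relevant and structurally aligned. Second, a model-agnostic plug-in module amplifies syntactic information in the hidden representations to improve retrieval quality. The method's efficiency, generalizability, and performance are validated on several semantic parsing benchmarks.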

Link: https://arxiv.org/abs/2508.20944
Authors: Jiaqian Li, Qisheng Hu, Jing Li, Wenya Wang
Affiliations: Nanyang Technological University, Singapore; Harbin Institute of Technology, Shenzhen, China
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025 Main

Abstract:In-Context Learning (ICL) has become a powerful paradigm that enables LLMs to perform a wide range of tasks without task-specific fine-tuning. However, the effectiveness of ICL heavily depends on the quality of exemplar selection. In particular, for structured prediction tasks such as semantic parsing, existing ICL selection strategies often overlook structural alignment, leading to suboptimal performance and poor generalization. To address this issue, we propose a novel two-stage exemplar selection strategy that achieves a strong balance between efficiency, generalizability, and performance. First, we fine-tune a BERT-based retriever using structure-aware supervision, guiding it to select exemplars that are both semantically relevant and structurally aligned. Then, we enhance the retriever with a plug-in module, which amplifies syntactically meaningful information in the hidden representations. This plug-in is model-agnostic, requires minimal overhead, and can be seamlessly integrated into existing pipelines. Experiments on four benchmarks spanning three semantic parsing tasks demonstrate that our method consistently outperforms existing baselines with multiple recent LLMs as inference-time models.

[NLP-8] How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on τ-bench EMNLP2025

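[Quick read]: This paper addresses the reliability problems that large language models (LLMs) face when acting as tool-calling autonomous agents in multi-turn conversational environments, including inconsistent reasoning, violations of domain-specific policies, and errors in extracting information over long conversations. The key solution is the Input-Reformulation Multi-Agent (IRMA) framework, which automatically rewrites user queries and augments them with relevant domain rules and tool suggestions so that the tool-calling agent can focus on the key decisions. This significantly improves the agent's reliability and consistency in dynamic environments.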

Link: https://arxiv.org/abs/2508.20931
Authors: Venkatesh Mishra, Amir Saeidi, Satyam Raj, Mutsumi Nakamura, Jayanth Srinivasa, Gaowen Liu, Ali Payani, Chitta Baral
Affiliations: Arizona State University; Cisco Research
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025 Findings

Abstract: Recent advances in reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments like τ-bench, these agents often struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over a long horizon of tool-calls and conversation. To capture and mitigate these failures, we conduct a comprehensive manual analysis of the common errors occurring in the conversation trajectories. We then experiment with reformulations of inputs to the tool-calling agent for improvement in agent decision making. Finally, we propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries augmented with relevant domain rules and tool suggestions for the tool-calling agent to focus on. The results show that IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. These findings highlight the superior reliability and consistency of IRMA compared to other methods in dynamic environments.
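
The abstract reports pass^5 scores without restating the metric. The sketch below assumes the usual τ-bench reading of pass^k: the chance that k trials of the same task all succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. The function names and the trial counts are hypothetical.

```python
from math import comb


def pass_hat_k(num_trials, num_successes, k):
    """Estimate the chance that k trials drawn without replacement all succeed."""
    if k > num_trials:
        raise ValueError("k cannot exceed the number of trials")
    return comb(num_successes, k) / comb(num_trials, k)


def mean_pass_hat_k(results, k):
    """Average the per-task estimate over (num_trials, num_successes) pairs."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


# Hypothetical outcomes for three tasks, each run 8 times.
results = [(8, 8), (8, 6), (8, 4)]
print(f"pass^5 = {mean_pass_hat_k(results, 5):.3f}")
```

Under this definition a task that succeeds in 6 of 8 runs contributes only about 0.11 at k = 5, so the metric rewards consistency rather than occasional success.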

[NLP-9] SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement

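[Quick read]: This paper addresses the challenges of evaluating speech-to-speech (S2S) large language models (LLMs): existing methods mostly use cascaded architectures that ignore acoustic features, lack explainability, and suffer from a scarcity of high-quality speech preference data. The key solution is SageLM, an end-to-end, multi-aspect, and explainable speech LLM for evaluation. First, SageLM jointly assesses semantic and acoustic dimensions, overcoming the limits of cascaded approaches. Second, it uses rationale-based supervision to improve explainability and alignment with evaluation outcomes. Third, it builds a synthetic preference dataset, SpeechFeedback, and uses a two-stage training strategy to mitigate the shortage of real speech preference data. SageLM reaches an 82.79% agreement rate with human evaluators, clearly outperforming the baselines.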

Link: https://arxiv.org/abs/2508.20916
Authors: Yuan Ge, Junxiang Zhang, Xiaoqian Liu, Bei Li, Xiangnan Ma, Chenglong Wang, Kaiyang Ye, Yangfan Du, Linfeng Zhang, Yuxin Huang, Tong Xiao, Zhengtao Yu, JingBo Zhu
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive S2S LLMs evaluation. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.

[NLP-10] The Uneven Impact of Post-Training Quantization in Machine Translation

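[Quick read]: This paper addresses the unclear effect of post-training quantization (PTQ) on multilingual machine translation when large language models (LLMs) are deployed on resource-constrained hardware. The core questions are how quantization affects translation quality for low-resource and typologically diverse languages, and how quantization algorithm and model size interact. The authors systematically evaluate four mainstream quantization methods (AWQ, BitsAndBytes, GGUF, AutoRound) on 55 languages and find that GGUF variants remain the most consistent even at 2-bit precision and that language-matched calibration helps mainly in low-bit settings. They also quantify the interactions among quantization precision, decoding hyperparameters, and calibration language, offering actionable guidance for deploying multilingual LLMs in low-resource settings.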

Link: https://arxiv.org/abs/2508.20893
Authors: Benjamin Marie, Atsushi Fujita
Affiliations: NICT
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, but its implications for multilingual tasks remain underexplored. We conduct the first large-scale evaluation of post-training quantization (PTQ) on machine translation across 55 languages using five LLMs ranging from 1.7B to 70B parameters. Our analysis reveals that while 4-bit quantization often preserves translation quality for high-resource languages and large models, significant degradation occurs for low-resource and typologically diverse languages, particularly in 2-bit settings. We compare four quantization techniques (AWQ, BitsAndBytes, GGUF, and AutoRound), showing that algorithm choice and model size jointly determine robustness. GGUF variants provide the most consistent performance, even at 2-bit precision. Additionally, we quantify the interactions between quantization, decoding hyperparameters, and calibration languages, finding that language-matched calibration offers benefits primarily in low-bit scenarios. Our findings offer actionable insights for deploying multilingual LLMs for machine translation under quantization constraints, especially in low-resource settings.
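
To give a feel for why 2-bit settings degrade more than 4-bit ones, the sketch below applies plain round-to-nearest uniform quantization to a handful of made-up weights. This is a deliberately simple stand-in, not AWQ, BitsAndBytes, GGUF, or AutoRound, and the weight values are illustrative only.

```python
def quantize_dequantize(weights, bits):
    """Round-to-nearest uniform quantization over the min-max range of the weights."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    # Fall back to a unit scale if all weights are equal, to avoid dividing by zero.
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((w - lo) / scale) * scale for w in weights]


def mean_abs_error(a, b):
    """Mean absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical weight values for illustration.
weights = [-0.9, -0.35, -0.1, 0.02, 0.2, 0.45, 0.8]

err4 = mean_abs_error(weights, quantize_dequantize(weights, 4))
err2 = mean_abs_error(weights, quantize_dequantize(weights, 2))
print(f"4-bit error: {err4:.4f}, 2-bit error: {err2:.4f}")
```

With 2 bits only four values are representable across the whole range, so small weights near zero are pushed far from their original values. The methods compared in the paper use more careful schemes to limit this loss.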

[NLP-11] OLMoASR: Open Models and Data for Training Robust Speech Recognition Models

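[Quick read]: This paper addresses the underexplored effect of training data scale and quality on speech recognition, focusing on how high-quality data can produce more robust zero-shot speech recognition models. The key contribution is OLMoASR-Pool, a large-scale English dataset of 3M hours of audio and 17M transcripts, which is filtered with text heuristics into OLMoASR-Mix, a dataset of 1M hours of high-quality audio-transcript pairs. The OLMoASR models trained on it, ranging from 39M to 1.5B parameters, achieve word error rates (WER) comparable to OpenAI's Whisper on short- and long-form speech recognition, supporting the importance of high-quality data for robust zero-shot speech recognition.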

Link: https://arxiv.org/abs/2508.20869
Authors: Huong Ngo, Matt Deitke, Martijn Bartelds, Sarah Pratt, Josh Gardner, Matt Jordan, Ludwig Schmidt
Affiliations: Allen Institute for AI; University of Washington; Stanford University
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 17 pages, 7 figures

Abstract:Improvements in training data scale and quality have led to significant advances, yet its influence in speech recognition remains underexplored. In this paper, we present a large-scale dataset, OLMoASR-Pool, and series of models, OLMoASR, to study and develop robust zero-shot speech recognition models. Beginning from OLMoASR-Pool, a collection of 3M hours of English audio and 17M transcripts, we design text heuristic filters to remove low-quality or mistranscribed data. Our curation pipeline produces a new dataset containing 1M hours of high-quality audio-transcript pairs, which we call OLMoASR-Mix. We use OLMoASR-Mix to train the OLMoASR-Mix suite of models, ranging from 39M (this http URL) to 1.5B (this http URL) parameters. Across all model scales, OLMoASR achieves comparable average performance to OpenAI’s Whisper on short and long-form speech recognition benchmarks. Notably, this http URL attains a 12.8% and 11.0% word error rate (WER) that is on par with Whisper’s largest English-only model this http URL’s 12.4% and 10.5% WER for short and long-form recognition respectively (at equivalent parameter count). OLMoASR-Pool, OLMoASR models, and filtering, training and evaluation code will be made publicly available to further research on robust speech processing.
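
The results above are reported as word error rate (WER). As a reminder of what that number measures, here is a minimal sketch that computes WER as word-level edit distance divided by reference length. It assumes whitespace tokenization and no text normalization, which real benchmark pipelines usually apply, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference word deleted
                curr[j - 1] + 1,         # extra word inserted
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.1%}")
```

The example has one substitution and one deletion against six reference words. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.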

[NLP-12] MSRS: Evaluating Multi-Source Retrieval-Augmented Generation

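[Quick read]: This paper addresses the weak performance of current retrieval-augmented generation (RAG) systems on multi-source information integration and long-form generation. Existing evaluations mostly cover single-source or short-answer settings, whereas real applications often need to extract and merge information from several sources into a coherent, detailed response. The key contribution is a scalable framework for constructing benchmarks that systematically evaluate RAG systems on cross-source integration and long-form generation. It is used to build two new benchmarks, MSRS-Story (narrative synthesis) and MSRS-Meet (meeting summarization), both of which require retrieving and synthesizing multi-source information from large collections. Experiments show that generation quality depends heavily on retrieval effectiveness and that reasoning models significantly outperform standard LLMs at the multi-source synthesis step.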

Link: https://arxiv.org/abs/2508.20867
Authors: Rohan Phanse, Yijie Zhou, Kejian Shi, Wencai Zhang, Yixin Liu, Yilun Zhao, Arman Cohan
Affiliations: Yale University
Categories: Computation and Language (cs.CL)
Comments: COLM 2025; this article supersedes the preprint: arXiv:2309.08960

Abstract:Retrieval-augmented systems are typically evaluated in settings where information required to answer the query can be found within a single source or the answer is short-form or factoid-based. However, many real-world applications demand the ability to integrate and summarize information scattered across multiple sources, where no single source is sufficient to respond to the user’s question. In such settings, the retrieval component of a RAG pipeline must recognize a variety of relevance signals, and the generation component must connect and synthesize information across multiple sources. We present a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources and generate long-form responses. Using our framework, we build two new benchmarks on Multi-Source Retrieval and Synthesis: MSRS-Story and MSRS-Meet, representing narrative synthesis and summarization tasks, respectively, that require retrieval from large collections. Our extensive experiments with various RAG pipelines – including sparse and dense retrievers combined with frontier LLMs – reveal that generation quality is highly dependent on retrieval effectiveness, which varies greatly by task. While multi-source synthesis proves challenging even in an oracle retrieval setting, we find that reasoning models significantly outperform standard LLMs at this distinct step.

[NLP-13] GDLLM: A Global Distance-aware Modeling Approach Based on Large Language Models for Event Temporal Relation Extraction EMNLP

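[Quick read]: This paper addresses two challenges in Event Temporal Relation Extraction (ETRE). First, the limited pre-trained knowledge of small language models (SLMs) makes it hard for them to handle minority-class relations in imbalanced datasets. Second, the manually designed prompts used with large language models (LLMs) can introduce noise that interferes with judging long-distance dependencies between events. The key solution is GDLLM, a Global Distance-aware modeling approach based on LLMs, with two core components: (1) a distance-aware graph structure built on a Graph Attention Network (GAT) that helps the LLM capture long-distance dependency features; and (2) a temporal feature learning paradigm based on soft inference that feeds probabilistic information into the multi-head attention mechanism to improve recognition of short-distance relations. The framework substantially improves minority-class performance and overall learning ability, achieving state-of-the-art (SOTA) results on two public datasets, TB-Dense and MATRES.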

Link: https://arxiv.org/abs/2508.20828
Authors: Jie Zhao, Wanting Ning, Yuxiao Fei, Yubo Feng, Lishuang Li
Affiliations: Dalian University of Technology; Indiana University Indianapolis
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP Findings)

Abstract:In Natural Language Processing(NLP), Event Temporal Relation Extraction (ETRE) is to recognize the temporal relations of two events. Prior studies have noted the importance of language models for ETRE. However, the restricted pre-trained knowledge of Small Language Models(SLMs) limits their capability to handle minority class relations in imbalanced classification datasets. For Large Language Models(LLMs), researchers adopt manually designed prompts or instructions, which may introduce extra noise, leading to interference with the model’s judgment of the long-distance dependencies between events. To address these issues, we propose GDLLM, a Global Distance-aware modeling approach based on LLMs. We first present a distance-aware graph structure utilizing Graph Attention Network(GAT) to assist the LLMs in capturing long-distance dependency features. Additionally, we design a temporal feature learning paradigm based on soft inference to augment the identification of relations with a short-distance proximity band, which supplements the probabilistic information generated by LLMs into the multi-head attention mechanism. Since the global feature can be captured effectively, our framework substantially enhances the performance of minority relation classes and improves the overall learning ability. Experiments on two publicly available datasets, TB-Dense and MATRES, demonstrate that our approach achieves state-of-the-art (SOTA) performance.

[NLP-14] A Graph-Based Test-Harness for LLM Evaluation

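[Quick read]: This paper addresses the incomplete coverage, lack of systematic structure, and difficulty of dynamic updating in current evaluations of medical large language models (LLMs), especially for clinical decision tasks such as symptom recognition, triage severity, treatment selection, and follow-up care. The key solution is a graph-based, dynamic multiple-choice question answering (MCQA) benchmark framework. It converts the WHO IMCI handbook into a directed graph with 200+ nodes (conditions, symptoms, treatments, follow-ups, severities) and 300+ edges, then uses graph traversal to generate structured questions with 3.3+ trillion possible combinations, covering all guideline relationships. The approach makes evaluation more systematic and clinically relevant and supplies high-reward samples for LLM post-training without expensive human annotation. It is also scalable and contamination-resistant, and the benchmark can be regenerated when the guidelines are updated.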

Link: https://arxiv.org/abs/2508.20810
Authors: Jessica Lundin, Guillaume Chabot-Couture
Affiliations: Institute for Disease Modeling; Gates Foundation
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 4 pages, 2 figures, dataset

Abstract:We present a first known prototype of a dynamic, systematic benchmark of medical guidelines for 400+ questions, with 3.3+ trillion possible combinations, covering 100% of guideline relationships. We transformed the WHO IMCI handbook into a directed graph with 200+ nodes (conditions, symptoms, treatments, follow-ups, severities) and 300+ edges, then used graph traversal to generate questions that incorporated age-specific scenarios and contextual distractors to ensure clinical relevance. Our graph-based approach enables systematic evaluation across clinical tasks (45-67% accuracy), and we find models excel at symptom recognition but struggle with triaging severity, treatment protocols and follow-up care, demonstrating how customized benchmarks can identify specific capability gaps that general-domain evaluations miss. Beyond evaluation, this dynamic MCQA methodology enhances LLM post-training (supervised finetuning, GRPO, DPO), where correct answers provide high-reward samples without expensive human annotation. The graph-based approach successfully addresses the coverage limitations of manually curated benchmarks. This methodology is a step toward scalable, contamination-resistant solution for creating comprehensive benchmarks that can be dynamically generated, including when the guidelines are updated. Code and datasets are available at this https URL
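
The graph-traversal idea can be sketched on a tiny example. The code below is not the authors' implementation: the edge list uses placeholder names rather than real clinical content, and the question template, the `generate_questions` function, and the distractor rule (other targets of the same relation) are all assumptions made for illustration.

```python
import random

# Hypothetical miniature guideline graph as (source, relation, target) edges.
# Placeholder labels only; none of this is clinical content.
EDGES = [
    ("condition_A", "treated_with", "treatment_X"),
    ("condition_B", "treated_with", "treatment_Y"),
    ("condition_C", "treated_with", "treatment_Z"),
    ("condition_A", "follow_up", "follow_up_P"),
    ("condition_B", "follow_up", "follow_up_Q"),
]


def generate_questions(edges, num_options=3, seed=0):
    """Turn each edge into a multiple-choice question with same-relation distractors."""
    rng = random.Random(seed)
    questions = []
    for source, relation, target in edges:
        # Every target that is correct for this source and relation.
        correct = {t for s, r, t in edges if s == source and r == relation}
        # Distractors: targets of the same relation that are wrong for this source.
        pool = sorted({t for _, r, t in edges if r == relation} - correct)
        distractors = rng.sample(pool, min(num_options - 1, len(pool)))
        options = distractors + [target]
        rng.shuffle(options)
        questions.append({
            "question": f"Which option is linked to {source} by '{relation}'?",
            "options": options,
            "answer": target,
        })
    return questions


questions = generate_questions(EDGES)
for q in questions:
    print(q["question"], q["options"], "->", q["answer"])
```

Because questions are derived from edges, every relationship in the graph is covered by construction, and changing the seed or the edge list regenerates the benchmark, which is the property the abstract highlights.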

[NLP-15] Exploring Machine Learning and Language Models for Multimodal Depression Detection

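[Quick read]: This paper addresses multimodal depression detection, that is, improving the accuracy of depression recognition by combining audio, video, and text. The key contribution is a systematic comparison of XGBoost, transformer-based architectures, and large language models (LLMs) on features from the different modalities. The comparison reveals each model type's strengths and limitations in capturing depression-related signals and provides empirical guidance on effective multimodal representation strategies for mental health prediction.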

Link: https://arxiv.org/abs/2508.20805
Authors: Javier Si Zhao Hong, Timothy Zoe Delaya, Sherwyn Chan Yin Kit, Pai Chet Ng, Xiaoxiao Miao
Affiliations: Singapore Institute of Technology; Duke Kunshan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: This paper has been accepted by APCIPA ASC 2025

Abstract:This paper presents our approach to the first Multimodal Personality-Aware Depression Detection Challenge, focusing on multimodal depression detection using machine learning and deep learning models. We explore and compare the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on audio, video, and text features. Our results highlight the strengths and limitations of each type of model in capturing depression-related signals across modalities, offering insights into effective multimodal representation strategies for mental health prediction.

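[NLP-16] Signs of Struggle: Spotting Cognitive Distortions across Language and Register

Quick read: This paper targets the early identification of cognitive distortions in youth mental health, focusing on automatically detecting these irrational thought patterns in digital text to enable timely, low-cost intervention. Its core contribution is a study of how well cognitive distortion detection models generalize across languages (cross-lingual) and registers (cross-register). Analyzing forum posts written by Dutch adolescents, the authors find that changes in language and writing style significantly affect model performance, and that domain adaptation methods show the most promise for improving transfer.

Link: https://arxiv.org/abs/2508.20771
Authors: Abhishek Kuber, Enrico Liscio, Ruixuan Zhang, Caroline Figueroa, Pradeep K. Murukannaiah
Affiliations: Delft University of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: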

Abstract:Rising mental health issues among youth have increased interest in automated approaches for detecting early signs of psychological distress in digital text. One key focus is the identification of cognitive distortions, irrational thought patterns that have a role in aggravating mental distress. Early detection of these distortions may enable timely, low-cost interventions. While prior work has focused on English clinical data, we present the first in-depth study of cross-lingual and cross-register generalization of cognitive distortion detection, analyzing forum posts written by Dutch adolescents. Our findings show that while changes in language and writing style can significantly affect model performance, domain adaptation methods show the most promise.

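[NLP-17] Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

Quick read: This paper addresses how easily safety alignment in large language models (LLMs) can be bypassed: existing safety mechanisms often depend on specific internal representation directions to refuse harmful requests, and ablating or perturbing those directions defeats them. The key to its solution is a white-box method, Rank-One Safety Injection (ROSI), which applies a fine-tuning-free rank-one weight modification to the residual stream write matrices and permanently steers the model's activations toward the refusal-mediating subspace. The safety direction can be computed from a small set of harmful and harmless instruction pairs. ROSI consistently raises refusal rates as evaluated by Llama Guard 3 while preserving performance on benchmarks such as MMLU, HellaSwag, and Arc, and it can also be applied to 'uncensored' models to restore safety alignment, which illustrates the potential of interpretable, efficient weight steering for LLM safety.

Link: https://arxiv.org/abs/2508.20766
Authors: Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah, Bernard Ghanem
Affiliations: King Abdullah University of Science and Technology (KAUST)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under Review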

Abstract:Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model’s safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates - as evaluated by Llama Guard 3 - while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and Arc. Furthermore, we show that ROSI can also re-align ‘uncensored’ models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.
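
As a rough illustration of what a rank-one weight modification looks like, the sketch below adds `alpha * outer(s_hat, u)` to a small matrix. It rests on assumptions not stated in the abstract: the safety direction is taken as the normalized difference of mean activations, and the input-side vector `u` and scale `alpha` are hypothetical placeholders. It is not the authors' exact update rule.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def unit(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def safety_direction(harmful_acts, harmless_acts):
    """Assumed recipe: normalized difference of mean activations."""
    diff = [h - b for h, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    return unit(diff)

def rank_one_inject(W, s_hat, u, alpha):
    """Return W + alpha * outer(s_hat, u); W maps len(u) inputs to len(s_hat) outputs."""
    return [[W[i][j] + alpha * s_hat[i] * u[j] for j in range(len(u))]
            for i in range(len(s_hat))]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Toy activations (hypothetical): harmful prompts differ along the first coordinate.
harmful = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
harmless = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
s_hat = safety_direction(harmful, harmless)

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_new = rank_one_inject(W, s_hat, u=[0.5, 0.5], alpha=2.0)
x = [1.0, 1.0]
```

For any input `x`, the modified output equals the original output plus `alpha * dot(u, x) * s_hat`, so the change to the matrix has rank one and moves outputs only along the chosen direction.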

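[NLP-18] Feel the Difference? A Comparative Analysis of Emotional Arcs in Real and LLM-Generated CBT Sessions EMNLP2025

Quick read: This paper asks whether Cognitive Behavioral Therapy (CBT) dialogues generated by large language models (LLMs) faithfully reflect the complex emotional dynamics of real therapy. Synthetic dialogues are fluent and structurally coherent, but it is unclear whether their emotional properties diverge from real sessions. The key to the approach is the first systematic comparison of emotional arcs in real and synthetic CBT dialogues: the authors adapt the Utterance Emotion Dynamics framework to analyze fine-grained trajectories along valence, arousal, and dominance, separately for counselor and client roles. Real sessions show greater emotional variability, more emotion-laden language, and more authentic patterns of reactivity and regulation, while emotional arc similarity between real and synthetic speakers is low, especially for clients. The results expose the limited emotional fidelity of current LLM-generated therapy data, and the authors introduce RealCBT, a curated dataset of real CBT sessions, to support future work.

Link: https://arxiv.org/abs/2508.20764
Authors: Xiaoyi Wang, Jiwei Zhang, Guangtao Zhang, Honglei Guo
Affiliations: Shantou University; Tsinghua University
Subjects: Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2025, 14 pages, 3 figures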

Abstract:Synthetic therapy dialogues generated by large language models (LLMs) are increasingly used in mental health NLP to simulate counseling scenarios, train models, and supplement limited real-world data. However, it remains unclear whether these synthetic conversations capture the nuanced emotional dynamics of real therapy. In this work, we conduct the first comparative analysis of emotional arcs between real and LLM-generated Cognitive Behavioral Therapy dialogues. We adapt the Utterance Emotion Dynamics framework to analyze fine-grained affective trajectories across valence, arousal, and dominance dimensions. Our analysis spans both full dialogues and individual speaker roles (counselor and client), using real sessions transcribed from public videos and synthetic dialogues from the CACTUS dataset. We find that while synthetic dialogues are fluent and structurally coherent, they diverge from real conversations in key emotional properties: real sessions exhibit greater emotional variability, more emotion-laden language, and more authentic patterns of reactivity and regulation. Moreover, emotional arc similarity between real and synthetic speakers is low, especially for clients. These findings underscore the limitations of current LLM-generated therapy data and highlight the importance of emotional fidelity in mental health applications. We introduce RealCBT, a curated dataset of real CBT sessions, to support future research in this space.

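[NLP-19] GUARD: Glocal Uncertainty-Aware Robust Decoding for Effective and Efficient Open-Ended Text Generation EMNLP

Quick read: This paper addresses the trade-off between coherence and diversity in open-ended text generation with large language models (LLMs). Decoding strategies based on contrastive search ease this trade-off but are limited by hyperparameter sensitivity and high computational cost. The key to the solution is GUARD, a self-adaptive decoding method built on a novel "Glocal" uncertainty-driven framework that combines global entropy estimates with local entropy deviations to capture both long-term and short-term uncertainty signals. The global entropy formulation mitigates abrupt variations in uncertainty (sudden overconfidence or entropy spikes) and comes with theoretical guarantees of unbiasedness and consistency, and a token-count-based penalty reduces computational overhead, so generation becomes substantially faster while text quality is maintained.

Link: https://arxiv.org/abs/2508.20757
Authors: Yuanhao Ding, Esteban Garces Arias, Meimingwei Li, Julian Rodemann, Matthias Aßenmacher, Danlu Chen, Gaojuan Fan, Christian Heumann, Chongsheng Zhang
Affiliations: Henan University; Department of Statistics, LMU Munich; Munich Center for Machine Learning (MCML); CISPA Helmholtz Center for Information Security; University of California, San Diego
Subjects: Computation and Language (cs.CL)
Comments: Accepted at Findings of the Association for Computational Linguistics: EMNLP (Findings) 2025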

Abstract:Open-ended text generation faces a critical challenge: balancing coherence with diversity in LLM outputs. While contrastive search-based decoding strategies have emerged to address this trade-off, their practical utility is often limited by hyperparameter dependence and high computational costs. We introduce GUARD, a self-adaptive decoding method that effectively balances these competing objectives through a novel “Glocal” uncertainty-driven framework. GUARD combines global entropy estimates with local entropy deviations to integrate both long-term and short-term uncertainty signals. We demonstrate that our proposed global entropy formulation effectively mitigates abrupt variations in uncertainty, such as sudden overconfidence or high entropy spikes, and provides theoretical guarantees of unbiasedness and consistency. To reduce computational overhead, we incorporate a simple yet effective token-count-based penalty into GUARD. Experimental results demonstrate that GUARD achieves a good balance between text diversity and coherence, while exhibiting substantial improvements in generation speed. In a more nuanced comparison study across different dimensions of text quality, both human and LLM evaluators validated its remarkable performance. Our code is available at this https URL.
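
To make the combination of global and local uncertainty signals concrete, here is a minimal sketch. It assumes, purely for illustration, that the global estimate is the running mean of per-step token entropies and that the local signal is the current entropy minus that mean; the abstract does not give GUARD's actual formulas, and `glocal_signal` is a hypothetical name.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def glocal_signal(step_distributions):
    """Return one (global_entropy, local_deviation) pair per decoding step.

    global_entropy: running mean of the entropies seen so far (assumed form).
    local_deviation: current entropy minus that running mean.
    """
    total = 0.0
    signals = []
    for t, probs in enumerate(step_distributions, start=1):
        h = entropy(probs)
        total += h
        global_h = total / t
        signals.append((global_h, h - global_h))
    return signals

# A uniform (uncertain) step followed by a fully confident step.
signals = glocal_signal([[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]])
```

The running mean changes slowly, so a single overconfident step shows up mainly as a negative local deviation instead of collapsing the global estimate, which is the smoothing behavior the abstract attributes to the global entropy term.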

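[NLP-20] Specializing General-purpose LLM Embeddings for Implicit Hate Speech Detection across Datasets

Quick read: This paper tackles the detection of implicit hate speech (IHS): text that contains no explicit derogatory or inflammatory words but conveys prejudice or hatred through sarcasm, insinuation, or coded terminology. The key to its solution is that fine-tuning general-purpose embedding models based on large language models (LLMs), such as Stella, Jasper, NV-Embed, and E5, is by itself enough to reach state-of-the-art performance, with F1-macro gains of up to 20.35 percentage points in cross-dataset evaluation, which suggests that the semantic nuances of IHS can be captured without additional external knowledge.

Link: https://arxiv.org/abs/2508.20750
Authors: Vassiliy Cheremetiev, Quang Long Ho Ngo, Chau Ying Kot, Alina Elena Baia, Andrea Cavallaro
Affiliations: EPFL Lausanne; Idiap Research Institute
Subjects: Computation and Language (cs.CL)
Comments: Paper accepted at the DHOW Workshop at ACM Multimedia 2025. Code available at this https URL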

Abstract:Implicit hate speech (IHS) is indirect language that conveys prejudice or hatred through subtle cues, sarcasm or coded terminology. IHS is challenging to detect as it does not include explicit derogatory or inflammatory words. To address this challenge, task-specific pipelines can be complemented with external knowledge or additional information such as context, emotions and sentiment data. In this paper, we show that, by solely fine-tuning recent general-purpose embedding models based on large language models (LLMs), such as Stella, Jasper, NV-Embed and E5, we achieve state-of-the-art performance. Experiments on multiple IHS datasets show improvements in F1-macro score of up to 1.10 percentage points in in-dataset evaluation and up to 20.35 percentage points in cross-dataset evaluation.
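
Since the results are reported in F1-macro, a short worked example of that metric may help. The labels and predictions below are made up for illustration; macro averaging computes F1 per class and then takes the unweighted mean, so a rare class counts as much as a frequent one.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy labels: 2 implicit-hate examples, 4 neutral ones.
y_true = ["hate", "hate", "none", "none", "none", "none"]
y_pred = ["hate", "none", "none", "none", "none", "hate"]
score = macro_f1(y_true, y_pred)  # per-class F1: hate 0.5, none 0.75
```

A gain stated in percentage points is an absolute difference between two such scores multiplied by 100, for example 0.625 versus 0.636 is a gain of 1.1 percentage points.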

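[NLP-21] Leveraging Semantic Triples for Private Document Generation with Local Differential Privacy Guarantees EMNLP2025

Quick read: This paper addresses the limited effectiveness of text privatization under local differential privacy (local DP) in natural language processing (NLP), which stems from how demanding the guarantee is. Existing methods struggle to preserve both privacy and semantic coherence at low privacy budgets (low ε), which limits their practicality. The key to the solution is DP-ST, a framework that uses semantic triples for neighborhood-aware private document generation, narrowing the DP notion from the whole document to a privatization neighborhood and thereby lowering the privacy cost. Post-processing with a large language model (LLM) then improves coherence, so that a reasonable privacy-utility balance remains achievable at lower ε values.

Link: https://arxiv.org/abs/2508.20736
Authors: Stephen Meisenbacher, Maulik Chevli, Florian Matthes
Affiliations: Technical University of Munich
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 2 figures, 11 tables. Accepted to EMNLP 2025 (Main)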

Abstract:Many works at the intersection of Differential Privacy (DP) and Natural Language Processing aim to protect privacy by transforming texts under DP guarantees. This can be performed in a variety of ways, from word perturbations to full document rewriting, and most often under local DP. Here, an input text must be made indistinguishable from any other potential text, within some bound governed by the privacy parameter ε. Such a guarantee is quite demanding, and recent works show that privatizing texts under local DP can only be done reasonably under very high ε values. Addressing this challenge, we introduce DP-ST, which leverages semantic triples for neighborhood-aware private document generation under local DP guarantees. Through the evaluation of our method, we demonstrate the effectiveness of the divide-and-conquer paradigm, particularly when limiting the DP notion (and privacy guarantees) to that of a privatization neighborhood. When combined with LLM post-processing, our method allows for coherent text generation even at lower ε values, while still balancing privacy and utility. These findings highlight the importance of coherence in achieving balanced privatization outputs at reasonable ε levels.
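
The indistinguishability bound governed by the privacy parameter can be made concrete with the standard exponential mechanism. This is a generic illustration and not the DP-ST mechanism: the candidate set, the utility scores, and the function names are hypothetical. With utilities in [0, 1] (sensitivity 1), the output probabilities for any two inputs differ by a factor of at most e^ε.

```python
import math

def exponential_mechanism_probs(utilities, epsilon, sensitivity=1.0):
    """Sampling probabilities proportional to exp(epsilon * u / (2 * sensitivity))."""
    weights = [math.exp(epsilon * u / (2.0 * sensitivity)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

def max_probability_ratio(p, q):
    """Largest ratio between matching output probabilities of two inputs."""
    return max(max(a / b, b / a) for a, b in zip(p, q))

# Hypothetical utilities of three candidate replacements for two different inputs.
utilities_x = [1.0, 0.5, 0.0]
utilities_x_prime = [0.0, 0.5, 1.0]

epsilon = 1.0
p = exponential_mechanism_probs(utilities_x, epsilon)
q = exponential_mechanism_probs(utilities_x_prime, epsilon)
ratio = max_probability_ratio(p, q)  # stays at or below math.exp(epsilon)
```

Raising ε concentrates probability on the highest-utility candidate, which is why coherent outputs are easy at very high ε and hard at low ε. Restricting the candidate set to a smaller neighborhood, as the abstract describes, is one way to keep utility acceptable at lower ε.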

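[NLP-22] rStar2-Agent: Agentic Reasoning Technical Report

Quick read: This paper addresses the difficulty large language models have with efficient, autonomous, high-quality solving of complex mathematical reasoning problems, and in particular how reinforcement learning (RL) can give a model human-like advanced cognitive behaviors such as thinking carefully, self-correcting from code execution feedback, and iterating on intermediate steps. The solution rests on three innovations: (i) an efficient RL infrastructure with a reliable Python code execution environment, which cuts rollout cost and supports high-throughput training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an algorithm whose Resample-on-Correct strategy mitigates the environment noise introduced by coding tools and makes reasoning in a code environment more effective; (iii) a staged agent training recipe that moves from non-reasoning supervised fine-tuning (SFT) to multi-stage RL, eliciting advanced cognitive abilities at minimal compute cost. rStar2-Agent-14B reaches state-of-the-art results on AIME24 and AIME25 after only 510 RL steps and also generalizes well to alignment, scientific reasoning, and tool-use tasks.

Link: https://arxiv.org/abs/2508.20722
Authors: Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, Mao Yang
Affiliations: Microsoft Research
Subjects: Computation and Language (cs.CL)
Comments: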

Abstract:We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled through three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates the high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noise from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multi-RL stages, yielding advanced cognitive abilities with minimal compute cost. To this end, rStar2-Agent boosts a pre-trained 14B model to state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at this https URL.
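
Two quantities mentioned in this abstract can be sketched briefly. The first function shows the group-relative advantage normalization commonly associated with GRPO, where each rollout's reward is standardized within its group; the use of the population standard deviation is an assumption, and the Resample-on-Correct step of GRPO-RoC is not reproduced here. The second computes pass@1 as the fraction of problems solved by a single sampled answer.

```python
import statistics

def group_relative_advantages(rewards):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    if std == 0:
        # All rollouts scored the same, so none is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def pass_at_1(correct_flags):
    """Fraction of problems whose single sampled answer was correct."""
    return sum(correct_flags) / len(correct_flags)

# Four rollouts for one prompt: two reached the correct final answer.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
accuracy = pass_at_1([True, False, True, True])
```

With binary correctness rewards, correct rollouts receive positive advantages and incorrect ones negative, and a group in which every rollout agrees contributes no learning signal.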

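[NLP-23] Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models

Quick read: This paper addresses the loss of robustness in text steganography and watermarking caused by tokenization inconsistency (TI) between the sender (Alice) and the receiver (Bob). The study finds that the problematic tokens responsible for TI share two characteristics: infrequency and temporariness. The solution builds a dedicated method on these properties for each setting: a stepwise verification method that eliminates TI in steganography, and a post-hoc rollback method that handles TI in watermarking. Experiments show that these methods beat traditional disambiguation strategies in fluency, imperceptibility, and anti-steganalysis capacity, and that they improve watermark detectability and robustness against attacks.

Link: https://arxiv.org/abs/2508.20718
Authors: Ruiyi Yan, Yugo Murawaki
Affiliations: Kyoto University
Subjects: Computation and Language (cs.CL)
Comments: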

Abstract:Large language models have significantly enhanced the capacities and efficiency of text generation. On the one hand, they have improved the quality of text-based steganography. On the other hand, they have also underscored the importance of watermarking as a safeguard against malicious misuse. In this study, we focus on tokenization inconsistency (TI) between Alice and Bob in steganography and watermarking, where TI can undermine robustness. Our investigation reveals that the problematic tokens responsible for TI exhibit two key characteristics: infrequency and temporariness. Based on these findings, we propose two tailored solutions for TI elimination: a stepwise verification method for steganography and a post-hoc rollback method for watermarking. Experiments show that (1) compared to traditional disambiguation methods in steganography, directly addressing TI leads to improvements in fluency, imperceptibility, and anti-steganalysis capacity; (2) for watermarking, addressing TI enhances detectability and robustness against attacks.
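
Tokenization inconsistency is easy to reproduce with a toy tokenizer. The sketch below uses a hypothetical four-piece vocabulary and a greedy longest-match tokenizer as a stand-in for a real subword tokenizer. Alice generates the tokens `a` then `b`, but Bob re-tokenizes the text `ab` as a single token. The `filter_candidates` function is one plausible reading of a stepwise check and is not the authors' algorithm.

```python
VOCAB = ["a", "b", "ab", "c"]  # hypothetical toy vocabulary

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenizer (toy stand-in for a subword tokenizer)."""
    by_length = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        for piece in by_length:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize at position {i}")
    return tokens

def is_consistent(tokens):
    """True if the receiver recovers exactly the sender's token sequence."""
    return tokenize("".join(tokens)) == list(tokens)

def filter_candidates(prefix, candidates):
    """Keep only next tokens that the receiver would re-tokenize identically."""
    return [c for c in candidates if is_consistent(prefix + [c])]

alice_tokens = ["a", "b"]
bob_tokens = tokenize("".join(alice_tokens))
safe_next = filter_candidates(["a"], ["a", "b", "c"])
```

Filtering candidates at every generation step means the sender never emits a sequence that the receiver would segment differently, at the cost of one re-tokenization per candidate.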

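[NLP-24] Multi-Lingual Implicit Discourse Relation Recognition with Multi-Label Hierarchical Learning SIGDIAL2025

Quick read: This paper addresses multi-lingual, multi-label classification in implicit discourse relation recognition (IDRR), and in particular how to model the hierarchical dependencies between discourse sense levels. The key to its solution is a new model architecture, HArch, which exploits the hierarchical dependencies among the three sense levels of the PDTB 3.0 framework to predict probability distributions at every level, yielding finer-grained classification of implicit discourse relations. Among the encoder backbones compared, RoBERTa-HArch performs best in English and XLM-RoBERTa-HArch in the multi-lingual setting, and the fine-tuned models consistently outperform GPT-4o and Llama-4-Maverick under few-shot prompting, which highlights the advantage of task-specific fine-tuning over prompting.

Link: https://arxiv.org/abs/2508.20712
Authors: Nelson Filipe Costa, Leila Kosseim
Affiliations: Concordia University
Subjects: Computation and Language (cs.CL)
Comments: Published at SIGDIAL 2025. Best paper award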

Abstract:This paper introduces the first multi-lingual and multi-label classification model for implicit discourse relation recognition (IDRR). Our model, HArch, is evaluated on the recently released DiscoGeM 2.0 corpus and leverages hierarchical dependencies between discourse senses to predict probability distributions across all three sense levels in the PDTB 3.0 framework. We compare several pre-trained encoder backbones and find that RoBERTa-HArch achieves the best performance in English, while XLM-RoBERTa-HArch performs best in the multi-lingual setting. In addition, we compare our fine-tuned models against GPT-4o and Llama-4-Maverick using few-shot prompting across all language configurations. Our results show that our fine-tuned models consistently outperform these LLMs, highlighting the advantages of task-specific fine-tuning over prompting in IDRR. Finally, we report SOTA results on the DiscoGeM 1.0 corpus, further validating the effectiveness of our hierarchical approach.

[NLP-25] Transparent Semantic Spaces: A Categorical Approach to Explainable Word Embeddings

Quick read: This paper addresses the explainability of artificial intelligence systems, word embedding models in particular: how to describe and compare the semantic structure of word vectors with mathematical precision and make their internal mechanisms transparent. The key to its solution is a formal framework based on category theory. It defines the categories L_T and P_T to represent the semantic structure of a text T and reframes the selection of the maximum-probability element as a categorical notion; it then constructs the monoidal category P_T to visualize different methods of extracting semantic information, which gives a dimension-agnostic definition of semantic spaces. The paper also introduces the category of configurations Conf and the category of word embeddings Emb, with divergence as a decoration on Emb, to establish a mathematical standard for comparing word embeddings, and it demonstrates the equivalence between the GloVe and Word2Vec algorithms and the metric multidimensional scaling (metric MDS) algorithm, moving from black-box neural networks to a transparent framework. Finally, it proposes computing biases before embedding and mitigating them at the semantic space level, advancing explainable AI.

Link: https://arxiv.org/abs/2508.20701
Authors: Ares Fabregat-Hernández, Javier Palanca, Vicent Botti
Affiliations: Valencian Research Institute for Artificial Intelligence (VRAIN); Universitat Politècnica de València; Universidad Internacional de Valencia (VIU); valgrAI (Valencian Graduate School and Research Network of Artificial Intelligence)
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Category Theory (math.CT)
Comments:

Abstract:The paper introduces a novel framework based on category theory to enhance the explainability of artificial intelligence systems, particularly focusing on word embeddings. Key topics include the construction of categories L_T and P_T, providing schematic representations of the semantics of a text T, and reframing the selection of the element with maximum probability as a categorical notion. Additionally, the monoidal category P_T is constructed to visualize various methods of extracting semantic information from T, offering a dimension-agnostic definition of semantic spaces reliant solely on information within the text. Furthermore, the paper defines the categories of configurations Conf and word embeddings Emb, accompanied by the concept of divergence as a decoration on Emb. It establishes a mathematically precise method for comparing word embeddings, demonstrating the equivalence between the GloVe and Word2Vec algorithms and the metric MDS algorithm, transitioning from neural network algorithms (black box) to a transparent framework. Finally, the paper presents a mathematical approach to computing biases before embedding and offers insights on mitigating biases at the semantic space level, advancing the field of explainable artificial intelligence.

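[NLP-26] Generative Annotation for ASR Named Entity Correction EMNLP2025

Quick read: This paper addresses the failure of automatic speech recognition (ASR) systems to transcribe domain-specific named entities (NEs) accurately, errors that severely degrade downstream tasks. Existing lightweight named entity correction (NEC) methods rely mainly on phonetic-level edit distance algorithms, and when the target entity and the erroneous ASR output differ substantially in word form they struggle to locate the error, which limits their applicability. The key to the solution is to use speech sound features to retrieve candidate entities and then, given those features and candidates, apply a generative method that annotates entity errors in the ASR transcript and replaces them with the correct entities; the method stays robust when word forms differ widely.

Link: https://arxiv.org/abs/2508.20700
Authors: Yuanchang Luo, Daimeng Wei, Shaojun Li, Hengchao Shang, Jiaxin Guo, Zongyao Li, Zhanglin Wu, Xiaoyu Chen, Zhiqiang Rao, Jinlong Yang, Hao Yang
Affiliations: Huawei Translation Service Center
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures, 7 tables, EMNLP 2025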

Abstract:End-to-end automatic speech recognition systems often fail to transcribe domain-specific named entities, causing catastrophic failures in downstream tasks. Numerous fast and lightweight named entity correction (NEC) models have been proposed in recent years. These models, mainly leveraging phonetic-level edit distance algorithms, have shown impressive performance. However, when the forms of the wrongly transcribed word(s) and the ground-truth entity are significantly different, these methods often fail to locate the wrongly transcribed words in the hypothesis, thus limiting their usage. We propose a novel NEC method that utilizes speech sound features to retrieve candidate entities. With speech sound features and candidate entities, we innovatively design a generative method to annotate entity errors in ASR transcripts and replace the text with correct entities. This method is effective in scenarios of word form difference. We test our method using open-source and self-constructed test sets. The results demonstrate that our NEC method can bring significant improvement to entity accuracy. We will open source our self-constructed test set and training data.
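
For context, the phonetic-level edit distance baseline that the abstract contrasts against can be sketched as follows. This illustrates the prior lightweight approach, not the proposed generative method. The phoneme sequences are hypothetical, and `best_span` is an illustrative helper that slides a window over the hypothesis to find the span closest to a known entity.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or phoneme lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def best_span(hyp_phonemes, entity_phonemes):
    """Return (start, distance) of the window that best matches the entity."""
    n = len(entity_phonemes)
    best = None
    for start in range(0, max(1, len(hyp_phonemes) - n + 1)):
        d = edit_distance(hyp_phonemes[start:start + n], entity_phonemes)
        if best is None or d < best[1]:
            best = (start, d)
    return best

# Hypothetical phoneme sequences for an ASR hypothesis and a target entity.
hypothesis = ["K", "AO", "L", "W", "AA", "W", "EY", "N", "AW"]
entity = ["HH", "W", "AA", "W", "EY"]
start, distance = best_span(hypothesis, entity)
```

When the mis-transcribed span shares few phonemes with the true entity, every window has a large distance and the error cannot be located reliably, which is the limitation that motivates retrieving candidates from speech sound features instead.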

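[NLP-27] Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

Quick read: This paper addresses the systemic safety risk that harmful fine-tuning driven by reinforcement learning (RL) poses to large language models (LLMs). Whereas prior studies mostly consider harm from supervised fine-tuning (SFT), this work systematically demonstrates that, under matched computational budgets, RL breaks safety alignment more effectively and enables advanced assistance with harmful tasks. The key to the solution is TokenBuncher, whose core idea is to suppress model response uncertainty, the foundation RL fine-tuning relies on, so that RL can no longer exploit distinct reward signals to drive the model toward harmful behaviors. TokenBuncher is realized through entropy-as-reward RL and a Token Noiser mechanism that prevents the escalation of expert-domain harmful capabilities, mitigating harmful RL fine-tuning while preserving benign task utility.

Link: https://arxiv.org/abs/2508.20697
Authors: Weitao Feng, Lixu Wang, Tianyi Wei, Jie Zhang, Chongyang Gao, Sinong Zhan, Peizhuo Lv, Wei Dong
Affiliations: Nanyang Technological University; Centre for Frontier AI Research, A*STAR; Northwestern University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Project Homepage: this https URL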

Abstract:As large language models (LLMs) continue to grow in capability, so do the risks of harmful misuse through fine-tuning. While most prior studies assume that attackers rely on supervised fine-tuning (SFT) for such misuse, we systematically demonstrate that reinforcement learning (RL) enables adversaries to more effectively break safety alignment and facilitate advanced harmful task assistance, under matched computational budgets. To counter this emerging threat, we propose TokenBuncher, the first effective defense specifically targeting RL-based harmful fine-tuning. TokenBuncher suppresses the foundation on which RL relies: model response uncertainty. By constraining uncertainty, RL-based fine-tuning can no longer exploit distinct reward signals to drive the model toward harmful behaviors. We realize this defense through entropy-as-reward RL and a Token Noiser mechanism designed to prevent the escalation of expert-domain harmful capabilities. Extensive experiments across multiple models and RL algorithms show that TokenBuncher robustly mitigates harmful RL fine-tuning while preserving benign task utility and finetunability. Our results highlight that RL-based harmful fine-tuning poses a greater systemic risk than SFT, and that TokenBuncher provides an effective and general defense.

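[NLP-28] Leveraging Large Language Models for Generating Research Topic Ontologies: A Multi-Disciplinary Study

Quick read: This paper addresses the high cost, uneven coverage, and infrequent updating that affect the construction and maintenance of ontologies and taxonomies of research fields, problems that hinder the organization and retrieval of scientific knowledge. Its approach is to test how well large language models (LLMs) identify semantic relationships among research topics across disciplines, and to fine-tune them on PEM-Rel-8K, a new dataset of over 8,000 relationships; fine-tuning yields strong performance across biomedicine, physics, and engineering, and the study also assesses cross-domain transfer.

Link: https://arxiv.org/abs/2508.20693
Authors: Tanay Aggarwal, Angelo Salatino, Francesco Osborne, Enrico Motta
Affiliations: Knowledge Media Institute, The Open University; Department of Business and Law, University of Milano Bicocca
Subjects: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments: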

Abstract:Ontologies and taxonomies of research fields are critical for managing and organising scientific knowledge, as they facilitate efficient classification, dissemination and retrieval of information. However, the creation and maintenance of such ontologies are expensive and time-consuming tasks, usually requiring the coordinated effort of multiple domain experts. Consequently, ontologies in this space often exhibit uneven coverage across different disciplines, limited inter-domain connectivity, and infrequent updating cycles. In this study, we investigate the capability of several large language models to identify semantic relationships among research topics within three academic domains: biomedicine, physics, and engineering. The models were evaluated under three distinct conditions: zero-shot prompting, chain-of-thought prompting, and fine-tuning on existing ontologies. Additionally, we assessed the cross-domain transferability of fine-tuned models by measuring their performance when trained in one domain and subsequently applied to a different one. To support this analysis, we introduce PEM-Rel-8K, a novel dataset consisting of over 8,000 relationships extracted from the most widely adopted taxonomies in the three disciplines considered in this study: MeSH, PhySH, and IEEE. Our experiments demonstrate that fine-tuning LLMs on PEM-Rel-8K yields excellent performance across all disciplines.

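[NLP-29] MobileCLIP2: Improving Multi-Modal Reinforced Training

Quick read: This paper aims to raise the zero-shot accuracy of lightweight image-text models while keeping latency low (3–15 ms) and parameter counts small (50–150M). The core challenge is distilling multi-modal knowledge from multiple teachers (CLIP models and caption generators) efficiently, scalably, and reproducibly. The key to the solution is an improved multi-modal reinforced training recipe: better CLIP teacher ensembles trained on the DFN dataset, and improved captioner teachers trained on DFN and then fine-tuned on a diverse selection of high-quality image-caption datasets. Ablations show the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive gain from combining synthetic captions generated by multiple models. The resulting MobileCLIP2 family sets a new state of the art for ImageNet-1k zero-shot accuracy at low latency; for example, MobileCLIP2-S4 matches SigLIP-SO400M/14 while being 2× smaller.

Link: https://arxiv.org/abs/2508.20691
Authors: Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander Toshev, Oncel Tuzel, Hadi Pouransari
Affiliations: Apple
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: TMLR August 2025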

Abstract:Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2× smaller and improves on DFN ViT-L/14 at 2.5× lower latency. We release our pretrained models (this https URL) and the data generation code (this https URL). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.
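
The role of temperature in contrastive knowledge distillation can be shown with a small generic sketch. It assumes a common formulation in which teacher and student image-to-caption similarities are turned into distributions with a temperature-scaled softmax and compared with KL divergence; the exact loss used for MobileCLIP2 is not given in the abstract, and the similarity values below are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_sims, student_sims, temperature):
    """KL between teacher and student distributions over a batch of captions."""
    p = softmax(teacher_sims, temperature)
    q = softmax(student_sims, temperature)
    return kl_divergence(p, q)

# Hypothetical similarities of one image to three captions in a batch.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss = distillation_loss(teacher, student, temperature=1.0)
```

A low temperature sharpens the teacher distribution toward its top caption, while a high temperature flattens it and exposes more of the teacher's ranking over the other captions, which is one reason the setting matters for what the student learns.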

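[NLP-30] Improving Alignment in LVLMs with Debiased Self-Judgment EMNLP2025

Quick read: This paper addresses hallucinations that arise during multimodal alignment in large visual-language models (LVLMs), where generated content is not grounded in the visual input and safety concerns follow. Conventional alignment methods such as instruction tuning and preference tuning rely on external datasets, human annotations, or complex post-processing, which limits scalability and raises cost. The key to the solution is a debiased self-judgment score, a self-evaluation metric the model generates internally without external resources, which lets the model assess and improve its own alignment. The mechanism improves both decoding strategies and preference tuning, and it reduces hallucinations, enhances safety, and improves overall capability relative to traditional methods.

Link: https://arxiv.org/abs/2508.20655
Authors: Sihan Yang, Chenhang Cui, Zihao Zhao, Yiyang Zhou, Weilong Yan, Ying Wei, Huaxiu Yao
Affiliations: Nanyang Technological University; National University of Singapore; UNC-Chapel Hill
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: EMNLP 2025 Findings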

Abstract:The rapid advancements in Large Language Models (LLMs) and Large Visual-Language Models (LVLMs) have opened up new opportunities for integrating visual and linguistic modalities. However, effectively aligning these modalities remains challenging, often leading to hallucinations–where generated outputs are not grounded in the visual input–and raising safety concerns across various domains. Existing alignment methods, such as instruction tuning and preference tuning, often rely on external datasets, human annotations, or complex post-processing, which limit scalability and increase costs. To address these challenges, we propose a novel approach that generates the debiased self-judgment score, a self-evaluation metric created internally by the model without relying on external resources. This enables the model to autonomously improve alignment. Our method enhances both decoding strategies and preference tuning processes, resulting in reduced hallucinations, enhanced safety, and improved overall capability. Empirical results show that our approach significantly outperforms traditional methods, offering a more effective solution for aligning LVLMs.

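[NLP-31] GDS Agent: A Graph Algorithmic Reasoning Agent

Quick read: This paper addresses the limited ability of large language models (LLMs) to process and reason over large-scale graph-structured data. LLM systems can access closed data sources and answer questions about them through function calling and retrieval-augmented techniques, but they still struggle with tasks that require graph algorithmic reasoning. The key to the solution is the GDS (Graph Data Science) agent, which exposes a comprehensive set of graph algorithms as tools in a Model Context Protocol (MCP) server, together with preprocessing (retrieval) and postprocessing of algorithm results, so that any modern LLM can call these tools out of the box. Users can ask questions in natural language, and questions that implicitly or intrinsically require graph algorithmic reasoning receive accurate, grounded answers.

Link: https://arxiv.org/abs/2508.20637
Authors: Borun Shi, Ioannis Panagiotas
Affiliations: Neo4j
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Technical report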

Abstract:Large language models (LLMs) have shown remarkable multimodal information processing and reasoning ability. When equipped with tools through function calling and enhanced with retrieval-augmented techniques, compound LLM-based systems can access closed data sources and answer questions about them. However, they still struggle to process and reason over large-scale graph-structure data. We introduce the GDS (Graph Data Science) agent in this technical report. The GDS agent introduces a comprehensive set of graph algorithms as tools, together with preprocessing (retrieval) and postprocessing of algorithm results, in a model context protocol (MCP) server. The server can be used with any modern LLM out-of-the-box. GDS agent allows users to ask any question that implicitly and intrinsically requires graph algorithmic reasoning about their data, and quickly obtain accurate and grounded answers. We also introduce a new benchmark that evaluates intermediate tool calls as well as final responses. The results indicate that GDS agent is able to solve a wide spectrum of graph tasks. We also provide detailed case studies for more open-ended tasks and study scenarios where the agent struggles. Finally, we discuss the remaining challenges and the future roadmap.
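
The idea of exposing graph algorithms as callable tools can be sketched without any server framework. In the example below a plain dictionary stands in for a tool registry; it does not use the MCP SDK or the Neo4j GDS library, and the graph, tool names, and `call_tool` helper are hypothetical. An LLM with function calling would choose the tool name and arguments, and the host would dispatch the call much as `call_tool` does here.

```python
from collections import deque

# Hypothetical toy graph as an adjacency list of directed edges.
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}

def shortest_path(graph, source, target):
    """Breadth-first search; returns the node list of a shortest path, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

def out_degree(graph):
    """Number of outgoing edges per node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

# Tool registry: names an LLM could request, mapped to graph algorithms.
TOOLS = {"shortest_path": shortest_path, "out_degree": out_degree}

def call_tool(name, **arguments):
    """Dispatch a tool call by name against the shared graph."""
    return TOOLS[name](GRAPH, **arguments)

path = call_tool("shortest_path", source="A", target="E")
degrees = call_tool("out_degree")
```

Keeping the algorithms behind a named interface means the model only has to decide which algorithm answers the question and with which arguments, while the computation itself stays exact.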

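[NLP-32] A Graph Talks, But Who's Listening? Rethinking Evaluations for Graph-Language Models

[Quick read]: This paper addresses the inadequacy of current evaluation benchmarks for Graph-Language Models (GLMs): existing benchmarks are mostly repurposed node classification datasets and cannot measure a model's multimodal ability to reason jointly over structure and semantics. The study finds that strong performance on these benchmarks is achievable with unimodal information alone, so they do not really test the integration of graph structure and language. The key to the solution is a new evaluation benchmark, CLEGR (Compositional Language-Graph Reasoning), which pairs a synthetic graph generation pipeline with questions that require joint reasoning over structure and textual semantics, systematically testing multimodal reasoning at different levels of complexity. Experiments show that current GLMs degrade significantly on tasks that require structural reasoning, and that soft-prompted LLM baselines perform on par with GLMs that include a full Graph Neural Network (GNN) backbone. This calls into question the architectural necessity of incorporating graph structure into LLMs and points future work toward explicit multimodal reasoning over graph structure and language.

Link: https://arxiv.org/abs/2508.20583
Authors: Soham Petkar, Hari Aakash K, Anirudh Vempati, Akshit Sinha, Ponnurangam Kumarauguru, Chirag Agarwal
Affiliations: Plaksha University; IIIT Delhi; IIT Delhi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: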

Abstract:Developments in Graph-Language Models (GLMs) aim to integrate the structural reasoning capabilities of Graph Neural Networks (GNNs) with the semantic understanding of Large Language Models (LLMs). However, we demonstrate that current evaluation benchmarks for GLMs, which are primarily repurposed node-level classification datasets, are insufficient to assess multimodal reasoning. Our analysis reveals that strong performance on these benchmarks is achievable using unimodal information alone, suggesting that they do not necessitate graph-language integration. To address this evaluation gap, we introduce the CLEGR(Compositional Language-Graph Reasoning) benchmark, designed to evaluate multimodal reasoning at various complexity levels. Our benchmark employs a synthetic graph generation pipeline paired with questions that require joint reasoning over structure and textual semantics. We perform a thorough evaluation of representative GLM architectures and find that soft-prompted LLM baselines perform on par with GLMs that incorporate a full GNN backbone. This result calls into question the architectural necessity of incorporating graph structure into LLMs. We further show that GLMs exhibit significant performance degradation in tasks that require structural reasoning. These findings highlight limitations in the graph reasoning capabilities of current GLMs and provide a foundation for advancing the community toward explicit multimodal reasoning involving graph structure and language.

[NLP-33] MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training ICML2025

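[Quick read]: This paper addresses the degradation of optimizer performance in large-batch training of deep neural networks. In language models in particular, a sharp increase in the max attention logit creates an information bottleneck in attention layers, so existing optimizers such as AdamW and LAMB degrade under large-batch training. The key to the solution is a new optimizer, MERIT, which computes the trust ratio from the max-norm to constrain the max attention logit more effectively, and further constructs element-wise trust ratios that use local weight structure to scale updates more robustly, improving training stability and enabling larger batch sizes.

Link: https://arxiv.org/abs/2508.20577
Authors: Yang Luo, Zangwei Zheng, Ziheng Qin, Zirui Zhu, Yong Liu, Yang You
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ICML 2025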

Abstract:Large-batch training has become a cornerstone in accelerating the training of deep neural networks, yet it poses challenges in optimization and generalization. Existing optimizers like AdamW present performance degradation during language models’ large-batch training, due to the information bottleneck in attention layers caused by the sharp increase of max attention logit. While the LAMB optimizer partially addresses this issue, some attention layers still face this issue. The reason is that l_2-norm-based trust ratios in LAMB are less effective in directly influencing the max value of query/key weights. Furthermore, the weight-wise trust ratio in LAMB is error-prone as it overlooks relationships of weight values within rows or columns. Building on these observations, we propose a novel optimizer, MERIT, which leverages the max-norm to calculate the trust ratio to constrain the max attention logit more effectively. Moreover, we further construct element-wise trust ratios to provide more robust update scaling by focusing on local weight structures. Extensive experiments of large-batch training across various sizes of GPT-2 models demonstrate the superior performance of MERIT. Notably, during the training of GPT-2 Medium, MERIT enables a 6k batch size without any performance degradation compared to the standard batch size (480) with 48B training tokens. This work highlights the importance of considering the max attention logit and finer-granularity trust ratio in large-batch training. It successfully improves the training stability and paves the way for larger batch usage, enabling faster development and iteration of large language models. Code is available at this https URL.
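
To make the contrast in the abstract concrete, the sketch below compares a LAMB-style layer-wise trust ratio based on the l_2 norm with a ratio based on the max-norm. It is an illustration only and not the MERIT algorithm: the function names, the exact form `max|w| / max|u|`, and the toy numbers are assumptions, and the element-wise ratios described in the paper are not shown.

```python
import math


def l2_trust_ratio(weights, update, eps=1e-12):
    """LAMB-style layer-wise ratio: ||w||_2 / ||u||_2."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    u_norm = math.sqrt(sum(u * u for u in update))
    return w_norm / (u_norm + eps)


def max_trust_ratio(weights, update, eps=1e-12):
    """Assumed max-norm variant: max|w| / max|u| (illustrative form only)."""
    w_max = max(abs(w) for w in weights)
    u_max = max(abs(u) for u in update)
    return w_max / (u_max + eps)


def apply_update(weights, update, lr, ratio_fn):
    """Scale the raw update by the chosen trust ratio before stepping."""
    ratio = ratio_fn(weights, update)
    return [w - lr * ratio * u for w, u in zip(weights, update)]


# Toy layer with two weights (hypothetical values).
weights = [3.0, 4.0]
update = [1.0, 0.0]
stepped = apply_update(weights, update, lr=0.1, ratio_fn=max_trust_ratio)
```

In this toy case the l_2 ratio is driven by the whole weight vector, while the max-norm ratio depends only on the largest-magnitude entries, which is the property the abstract associates with controlling the max value of query/key weights.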

[NLP-34] KCS: Diversify Multi-hop Question Generation with Knowledge Composition Sampling

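[Quick read]: This paper addresses the problem that data sparsity in multi-hop question answering makes language models more likely to learn spurious patterns. Prior methods diversify question generation through content planning and varied expression, but they tend to generate simple questions and neglect the integration of essential knowledge, such as relevant sentences within documents. The key to the solution is the Knowledge Composition Sampling (KCS) framework, which models knowledge composition selection as a sentence-level conditional prediction task and uses a probabilistic contrastive loss to predict the next most relevant piece of knowledge. At inference time, a stochastic decoding strategy balances accuracy and diversity. KCS improves the overall accuracy of knowledge composition selection by 3.9% over competitive baselines, and using it for data augmentation yields improvements on HotpotQA and 2WikiMultihopQA.

Link: https://arxiv.org/abs/2508.20567
Authors: Yangfan Wang, Jie Liu, Chen Tang, Lian Yan, Jingchi Jiang
Affiliations: Harbin Institute of Technology; MemTensor (Shanghai) Technology Co., Ltd.; National Key Laboratory of Smart Farm Technologies and Systems
Categories: Computation and Language (cs.CL)
Comments: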

Abstract:Multi-hop question answering faces substantial challenges due to data sparsity, which increases the likelihood of language models learning spurious patterns. To address this issue, prior research has focused on diversifying question generation through content planning and varied expression. However, these approaches often emphasize generating simple questions and neglect the integration of essential knowledge, such as relevant sentences within documents. This paper introduces the Knowledge Composition Sampling (KCS), an innovative framework designed to expand the diversity of generated multi-hop questions by sampling varied knowledge compositions within a given context. KCS models the knowledge composition selection as a sentence-level conditional prediction task and utilizes a probabilistic contrastive loss to predict the next most relevant piece of knowledge. During inference, we employ a stochastic decoding strategy to effectively balance accuracy and diversity. Compared to competitive baselines, our KCS improves the overall accuracy of knowledge composition selection by 3.9%, and its application for data augmentation yields improvements on HotpotQA and 2WikiMultihopQA datasets. Our code is available at: this https URL.

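[NLP-35] Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search CIKM’25

[Quick read]: This paper addresses two core problems in Query-Driven Text Summarization (QDTS) for large-scale web search: first, the multi-stage pipeline of traditional extractive summarization models introduces cumulative information loss and architectural bottlenecks; second, existing models lack sufficient semantic understanding of user queries and documents, especially for complex search intents. The key to the solution is a new generative framework that combines large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to turn a lightweight model with only 0.1B parameters into a domain-specialized QDTS expert, outperforming the production baseline on multiple industry-relevant metrics while remaining efficient to deploy.

Link: https://arxiv.org/abs/2508.20559
Authors: Zeyu Xiong, Yixuan Nan, Li Gao, Hengzhu Tang, Shuaiqiang Wang, Junfeng Wang, Dawei Yin
Affiliations: Baidu Inc.; Institute of Information Engineering, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: CIKM’25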

Abstract:In the dynamic landscape of large-scale web search, Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query, which is essential for improving user engagement and facilitating rapid decision-making. Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications. However, these approaches suffer from two key limitations: 1) The multi-stage pipeline often introduces cumulative information loss and architectural bottlenecks due to its weakest component; 2) Traditional models lack sufficient semantic understanding of both user queries and documents, particularly when dealing with complex search intents. In this study, we propose a novel framework to pioneer the application of generative models to address real-time QDTS in industrial web search. Our approach integrates large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to transform a lightweight model with only 0.1B parameters into a domain-specialized QDTS expert. Evaluated on multiple industry-relevant metrics, our model outperforms the production baseline and achieves a new state of the art. Furthermore, it demonstrates excellent deployment efficiency, requiring only 334 NVIDIA L20 GPUs to handle ~50,000 queries per second under 55 ms average latency per query.

[NLP-36] Adaptive Federated Distillation for Multi-Domain Non-IID Textual Data

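[Quick read]: This paper addresses the performance degradation in Federated Learning (FL) caused by non-independent and identically distributed (non-IID) local data, focusing on a limitation in Natural Language Processing (NLP) settings: the diversity of input language domains has not been adequately considered. Most existing work defines non-IID scenarios only in terms of label (output) diversity and ignores the diversity of language domains (input). The key to the solution is a unified benchmarking framework covering multi-domain non-IID scenarios, together with an Adaptive Federated Distillation (AdaFD) framework designed to handle multi-domain non-IID challenges in both homogeneous and heterogeneous settings. Experiments show that the resulting models capture the diversity of local clients and perform better than existing work.

Link: https://arxiv.org/abs/2508.20557
Authors: Jiahao Xiao, Jiangming Liu
Affiliations: Yunnan University; Yunnan Key Laboratory of Intelligent Systems and Computing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: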

Abstract:The widespread success of pre-trained language models has established a new training paradigm, where a global PLM is fine-tuned using task-specific data from local clients. The local data are highly different from each other and can not capture the global distribution of the whole data in real world. To address the challenges of non-IID data in real environments, privacy-preserving federated distillation has been proposed and highly investigated. However, previous experimental non-IID scenarios are primarily identified with the label (output) diversity, without considering the diversity of language domains (input) that is crucial in natural language processing. In this paper, we introduce a comprehensive set of multi-domain non-IID scenarios and propose a unified benchmarking framework that includes diverse data. The benchmark can be used to evaluate the federated learning framework in a real environment. To this end, we propose an Adaptive Federated Distillation (AdaFD) framework designed to address multi-domain non-IID challenges in both homogeneous and heterogeneous settings. Experimental results demonstrate that our models capture the diversity of local clients and achieve better performance compared to the existing works. The code for this paper is available at: this https URL.

[NLP-37] Overview of BioASQ 2025: The Thirteenth BioASQ Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering

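[Quick read]: This paper gives an overview of the thirteenth BioASQ challenge, which promotes advances in large-scale biomedical semantic indexing and question answering. The 2025 edition comprised new editions of the two established tasks, b and Synergy, and four new tasks: multilingual clinical summarization (MultiClinSum), nested named entity linking in Russian and English (BioNNE-L), clinical coding in cardiology (ELCardioCC), and gut-brain interplay information extraction (GutBrainIE). The key to the approach is organizing shared tasks that bring research teams together on six tasks: 83 teams took part with more than 1000 distinct submissions, and several systems achieved competitive performance, supporting standardized evaluation and continued progress in the field.

Link: https://arxiv.org/abs/2508.20554
Authors: Anastasios Nentidis, Georgios Katsimpras, Anastasia Krithara, Martin Krallinger, Miguel Rodríguez-Ortega, Eduard Rodriguez-López, Natalia Loukachevitch, Andrey Sakhovskiy, Elena Tutubalina, Dimitris Dimitriadis, Grigorios Tsoumakas, George Giannakoulas, Alexandra Bekiaridou, Athanasios Samaras, Giorgio Maria Di Nunzio, Nicola Ferro, Stefano Marchesin, Marco Martinelli, Gianmaria Silvello, Georgios Paliouras
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 26 pages, 17 tables, 1 figure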

Abstract:This is an overview of the thirteenth edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2025. BioASQ is a series of international challenges promoting advances in large-scale biomedical semantic indexing and question answering. This year, BioASQ consisted of new editions of the two established tasks, b and Synergy, and four new tasks: a) Task MultiClinSum on multilingual clinical summarization. b) Task BioNNE-L on nested named entity linking in Russian and English. c) Task ELCardioCC on clinical coding in cardiology. d) Task GutBrainIE on gut-brain interplay information extraction. In this edition of BioASQ, 83 competing teams participated with more than 1000 distinct submissions in total for the six different shared tasks of the challenge. Similar to previous editions, several participating systems achieved competitive performance, indicating the continuous advancement of the state-of-the-art in the field.

[NLP-38] Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering

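[Quick read]: This paper gives an overview of the twelfth BioASQ challenge, which tracks continued progress in large-scale biomedical semantic indexing and question answering, including harder settings such as multilingual and nested Named Entity Recognition (NER). The key to the approach is the BioASQ series of international challenges, which encourages work across teams. The 2024 edition added two new tasks, MultiCardioNER (multilingual clinical entity detection adapted to the cardiology domain) and BIONNE (nested NER in Russian and English). In total, 37 teams participated with more than 700 distinct submissions across the four shared tasks, and most participating systems achieved competitive performance.

Link: https://arxiv.org/abs/2508.20532
Authors: Anastasios Nentidis, Georgios Katsimpras, Anastasia Krithara, Salvador Lima-López, Eulàlia Farré-Maduell, Martin Krallinger, Natalia Loukachevitch, Vera Davydova, Elena Tutubalina, Georgios Paliouras
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 25 pages, 16 tables, 1 figure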

Abstract:This is an overview of the twelfth edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2024. BioASQ is a series of international challenges promoting advances in large-scale biomedical semantic indexing and question answering. This year, BioASQ consisted of new editions of the two established tasks b and Synergy, and two new tasks: a) MultiCardioNER on the adaptation of clinical entity detection to the cardiology domain in a multilingual setting, and b) BIONNE on nested NER in Russian and English. In this edition of BioASQ, 37 competing teams participated with more than 700 distinct submissions in total for the four different shared tasks of the challenge. Similarly to previous editions, most of the participating systems achieved competitive performance, suggesting the continuous advancement of the state-of-the-art in the field.

[NLP-39] SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM

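[Quick read]: This paper addresses the weak semantic understanding of existing topic discovery methods for scientific literature when handling complex, high-dimensional text relationships; in particular, methods that rely on word embeddings struggle to capture the deeper semantics of scientific publications. The key to the solution is SciTopic, a topic discovery method enhanced by Large Language Models (LLMs). It first builds a textual encoder that captures publication content from metadata, title, and abstract. It then constructs a space optimization module that combines entropy-based sampling with LLM-guided triplet tasks, sharpening the focus on thematic relevance and contextual nuances between ambiguous instances. Finally, it fine-tunes the textual encoder by optimizing a contrastive loss over the triplets so that instances of different topics are better discriminated. Experiments on three real-world datasets show that the method outperforms state-of-the-art scientific topic discovery methods.

Link: https://arxiv.org/abs/2508.20514
Authors: Pengjiang Li, Zaitian Wang, Xinhao Zhang, Ran Zhang, Lu Jiang, Pengfei Wang, Yuanchun Zhou
Affiliations: Computer Network Information Center, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Department of Computer Science, Portland State University, Portland, US; Information Science and Technology College, Dalian Maritime University, Dalian, China
Categories: Computation and Language (cs.CL)
Comments: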

Abstract:Topic discovery in scientific literature provides valuable insights for researchers to identify emerging trends and explore new avenues for investigation, facilitating easier scientific information retrieval. Many machine learning methods, particularly deep embedding techniques, have been applied to discover research topics. However, most existing topic discovery methods rely on word embedding to capture the semantics and lack a comprehensive understanding of scientific publications, struggling with complex, high-dimensional text relationships. Inspired by the exceptional comprehension of textual information by large language models (LLMs), we propose an advanced topic discovery method enhanced by LLMs to improve scientific topic identification, namely SciTopic. Specifically, we first build a textual encoder to capture the content from scientific publications, including metadata, title, and abstract. Next, we construct a space optimization module that integrates entropy-based sampling and triplet tasks guided by LLMs, enhancing the focus on thematic relevance and contextual intricacies between ambiguous instances. Then, we propose to fine-tune the textual encoder based on the guidance from the LLMs by optimizing the contrastive loss of the triplets, forcing the text encoder to better discriminate instances of different topics. Finally, extensive experiments conducted on three real-world datasets of scientific publications demonstrate that SciTopic outperforms the state-of-the-art (SOTA) scientific topic discovery methods, enabling researchers to gain deeper and faster insights.
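
The two ingredients named in the abstract, entropy-based sampling and a loss over triplets, can be sketched as follows. This is a minimal illustration under assumptions, not SciTopic's implementation: the soft topic assignments, the helper names, and the use of a standard triplet margin loss in place of the paper's (unspecified here) contrastive loss are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a soft topic assignment (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_ambiguous(assignments, n):
    """Pick the n instances whose topic assignment is least certain."""
    ranked = sorted(assignments, key=lambda name: entropy(assignments[name]), reverse=True)
    return ranked[:n]


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on Euclidean distances (swapped-in stand-in)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical soft assignments of three papers to two topics.
assignments = {"a": [0.5, 0.5], "b": [0.9, 0.1], "c": [1.0, 0.0]}
candidates = most_ambiguous(assignments, 2)
```

In a pipeline like the one described, the ambiguous instances selected this way would be the ones sent to an LLM to form (anchor, positive, negative) triplets, and the loss would then pull the anchor toward the positive and away from the negative in embedding space.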

[NLP-40] Languages Still Left Behind: Toward a Better Multilingual Machine Translation Benchmark EMNLP

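[Quick read]: This paper addresses limitations of FLORES+, a widely used evaluation benchmark for multilingual machine translation (MT), in particular how poorly it represents real-world use and how fragile its evaluation protocol is. Studying four languages (Asante Twi, Japanese, Jinghpaw, and South Azerbaijani), the authors find that many translations fall below the claimed 90% quality standard and that source sentences are often too domain-specific and culturally biased toward the English-speaking world. Simple heuristics such as copying named entities yield non-trivial BLEU scores, and models trained on high-quality, naturalistic data perform poorly on FLORES+ while achieving significant gains on the authors' domain-relevant evaluation set. The key to the solution is to build multilingual benchmarks from domain-general, culturally neutral source texts that rely less on named entities, so that they better reflect real-world translation challenges.

Link: https://arxiv.org/abs/2508.20511
Authors: Chihiro Taguchi, Seng Mai, Keita Kurabe, Yusuke Sakai, Georgina Agyei, Soudabeh Eslami, David Chiang
Affiliations: University of Notre Dame; University of Washington; Tokyo University of Foreign Studies; Nara Institute of Science and Technology; University of Tübingen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 7 tables, 2 figures. Accepted at EMNLP Main 2025. Code and data released at this https URL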

Abstract:Multilingual machine translation (MT) benchmarks play a central role in evaluating the capabilities of modern MT systems. Among them, the FLORES+ benchmark is widely used, offering English-to-many translation data for over 200 languages, curated with strict quality control protocols. However, we study data in four languages (Asante Twi, Japanese, Jinghpaw, and South Azerbaijani) and uncover critical shortcomings in the benchmark’s suitability for truly multilingual evaluation. Human assessments reveal that many translations fall below the claimed 90% quality standard, and the annotators report that source sentences are often too domain-specific and culturally biased toward the English-speaking world. We further demonstrate that simple heuristics, such as copying named entities, can yield non-trivial BLEU scores, suggesting vulnerabilities in the evaluation protocol. Notably, we show that MT models trained on high-quality, naturalistic data perform poorly on FLORES+ while achieving significant gains on our domain-relevant evaluation set. Based on these findings, we advocate for multilingual MT benchmarks that use domain-general and culturally neutral source texts and rely less on named entities, in order to better reflect real-world translation challenges.

[NLP-41] ConspirED: A Dataset for Cognitive Traits of Conspiracy Theories and Large Language Model Safety

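[Quick read]: This paper addresses how conspiratorial content erodes public trust in science and institutions, and how such content can be identified and countered as AI-generated misinformation becomes more sophisticated. The key to the solution is ConspirED, the first dataset of conspiratorial content annotated for general cognitive traits. It is annotated with the CONSPIR cognitive framework and consists of multi-sentence excerpts (80 to 120 words) from online conspiracy articles. On this basis, the authors develop computational models that identify conspiratorial traits and determine the dominant traits in an excerpt, and they evaluate the robustness of large language models (LLMs) and large reasoning models (LRMs) to conspiratorial inputs. The results show that both kinds of model are misaligned by conspiratorial content: their output mirrors the reasoning patterns of the input, even when they successfully deflect comparable fact-checked misinformation.

Link: https://arxiv.org/abs/2508.20468
Authors: Luke Bates, Max Glockner, Preslav Nakov, Iryna Gurevych
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: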

Abstract:Conspiracy theories erode public trust in science and institutions while resisting debunking by evolving and absorbing counter-evidence. As AI-generated misinformation becomes increasingly sophisticated, understanding rhetorical patterns in conspiratorial content is important for developing interventions such as targeted prebunking and assessing AI vulnerabilities. We introduce ConspirED (CONSPIR Evaluation Dataset), which captures the cognitive traits of conspiratorial ideation in multi-sentence excerpts (80–120 words) from online conspiracy articles, annotated using the CONSPIR cognitive framework (Lewandowsky and Cook, 2020). ConspirED is the first dataset of conspiratorial content annotated for general cognitive traits. Using ConspirED, we (i) develop computational models that identify conspiratorial traits and determine dominant traits in text excerpts, and (ii) evaluate large language/reasoning model (LLM/LRM) robustness to conspiratorial inputs. We find that both are misaligned by conspiratorial content, producing output that mirrors input reasoning patterns, even when successfully deflecting comparable fact-checked misinformation.

[NLP-42] Prediction of mortality and resource utilization in critical care: a deep learning approach using multimodal electronic health records with natural language processing techniques

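[Quick read]: This paper addresses the difficulty of predicting mortality and resource utilization from Electronic Health Records (EHRs) in the Intensive Care Unit (ICU), in particular the fact that existing methods rely mainly on structured EHR data and ignore the information in free-text clinical notes. The key to the solution is a deep learning framework that integrates multimodal EHR data using Natural Language Processing (NLP) techniques, built on three components: medical prompts, free-text notes, and a pre-trained sentence encoder. The framework improves performance on three tasks (mortality prediction, Length of Stay (LOS) prediction, and surgical duration estimation) and remains robust when the structured data is corrupted.

Link: https://arxiv.org/abs/2508.20460
Authors: Yucheng Ruan, Xiang Lan, Daniel J. Tan, Hairil Rizal Abdullah, Mengling Feng
Affiliations: National University of Singapore; SingHealth
Categories: Computation and Language (cs.CL)
Comments: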

Abstract:Background: Predicting mortality and resource utilization from electronic health records (EHRs) is challenging yet crucial for optimizing patient outcomes and managing costs in intensive care unit (ICU). Existing approaches predominantly focus on structured EHRs, often ignoring the valuable clinical insights in free-text notes. Additionally, the potential of textual information within structured data is not fully leveraged. This study aimed to introduce and assess a deep learning framework using natural language processing techniques that integrates multimodal EHRs to predict mortality and resource utilization in critical care settings. Methods: Utilizing two real-world EHR datasets, we developed and evaluated our model on three clinical tasks with leading existing methods. We also performed an ablation study on three key components in our framework: medical prompts, free-texts, and pre-trained sentence encoder. Furthermore, we assessed the model’s robustness against the corruption in structured EHRs. Results: Our experiments on two real-world datasets across three clinical tasks showed that our proposed model improved performance metrics by 1.6%/0.8% on BACC/AUROC for mortality prediction, 0.5%/2.2% on RMSE/MAE for LOS prediction, 10.9%/11.0% on RMSE/MAE for surgical duration estimation compared to the best existing methods. It consistently demonstrated superior performance compared to other baselines across three tasks at different corruption rates. Conclusions: The proposed framework is an effective and accurate deep learning approach for predicting mortality and resource utilization in critical care. The study also highlights the success of using prompt learning with a transformer encoder in analyzing multimodal EHRs. Importantly, the model showed strong resilience to data corruption within structured data, especially at high corruption levels.

[NLP-43] MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

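[Quick read]: This paper addresses the inadequate evaluation of Large Language Models (LLMs) on realistic multi-step tasks, in particular the lack of benchmarks for tool use, cross-tool coordination, precise parameter control, and planning and reasoning. Existing benchmarks mostly rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations, which do not reflect the complex interactions of real applications. The key to the solution is MCP-Bench, a benchmark built on the Model Context Protocol (MCP) that connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Each server provides complementary tools designed to work together, with rich input-output coupling, which supports more authentic multi-step tasks. The evaluation framework covers tool-level schema understanding and usage, trajectory-level planning, and task completion, and experiments on 20 advanced LLMs reveal persistent challenges in capabilities such as retrieving tools from fuzzy instructions, planning multi-hop execution trajectories, and orchestrating cross-domain workflows.

Link: https://arxiv.org/abs/2508.20453
Authors: Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: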

Abstract:We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents’ ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows - capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: this https URL.

[NLP-44] Searching the Title of Practical Work of the Informatics Engineering Bachelor Program with the Case Base Reasoning Method

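[Quick read]: This paper addresses the retrieval of practical work titles, that is, how to match similar titles from past cases efficiently for reference. The solution combines Case Base Reasoning (CBR) with text vectorization: TF-IDF (Term Frequency-Inverse Document Frequency) turns the words of each practical work title into a weighted vector, and Cosine Similarity computes the similarity between the query and the titles in the database. The system accepts either full titles or keywords and returns matching titles with their match values. Testing on 705 titles was carried out with five query titles in two stages; in the second stage, which randomized the titles from the first stage, the system found the same number of titles and achieved the highest average match score.

Link: https://arxiv.org/abs/2508.20442
Authors: Agung Sukrisna Jaya, Osvari Arsalan, Danny Matthew Saputra
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: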

Abstract:Case Base Reasoning (CBR) is a case solving technique based on experience in cases that have occurred before with the highest similarity. CBR is used to search for practical work titles. TF-IDF is applied to process the vectorization of each practical work title word and Cosine Similarity for the calculation of similarity values. This system can search either in the form of titles or keywords. The output of the system is the title of practical work and the match value of each title. Based on the test results using 705 practical work titles, testing was carried out with five titles and carried out in two stages. The first stage searches with existing titles and the second stage randomizes the title from the first stage. And the results obtained in the second stage are the same number of titles found and the highest average match score.
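
The retrieval step described in the abstract, TF-IDF vectors compared by cosine similarity, can be sketched in a few lines of standard-library Python. This is a minimal sketch under assumptions rather than the authors' system: the whitespace tokenizer, the smoothed IDF formula, the function names, and the three example titles are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; no stemming or stop words)."""
    return text.lower().split()


def build_idf(titles):
    """Smoothed inverse document frequency over the stored titles."""
    n = len(titles)
    df = Counter()
    for title in titles:
        df.update(set(tokenize(title)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms never seen in the case base are ignored."""
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, titles):
    """Return (match value, title) pairs, best match first."""
    idf = build_idf(titles)
    q = tfidf_vector(query, idf)
    return sorted(((cosine(q, tfidf_vector(t, idf)), t) for t in titles), reverse=True)


# Hypothetical case base of practical work titles.
titles = [
    "inventory information system web",
    "expert system for diagnosing plant disease",
    "library information system android",
]
results = search("expert system for diagnosing plant disease", titles)
shuffled = search("disease plant diagnosing for system expert", titles)
```

Because the vectors are bags of words, this sketch gives a shuffled query the same match value as the original word order; keyword queries work the same way, scoring only the terms they share with each stored title.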

[NLP-45] CAMB: A comprehensive industrial LLM benchmark on civil aviation maintenance

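[Quick read]: This paper addresses the lack of specialized evaluation tools for Large Language Models (LLMs) in civil aviation maintenance. Current LLM evaluation focuses mainly on mathematical and coding reasoning tasks, while civil aviation maintenance depends heavily on domain knowledge and sophisticated reasoning and needs a dedicated benchmark to measure both. The key to the solution is an industrial-grade benchmark for evaluating LLMs in civil aviation maintenance scenarios. It identifies specific gaps in domain knowledge and complex reasoning and gives direction for targeted improvement, such as domain-specific fine-tuning, Retrieval-Augmented Generation (RAG) optimization, or specialized prompt engineering. The authors also use the benchmark to evaluate well-known vector embedding models and LLMs, and they open-source the benchmark and code.

Link: https://arxiv.org/abs/2508.20420
Authors: Feng Zhang, Chengjie Pang, Yuehan Zhang, Chenyu Luo
Affiliations: 360 (Qihoo 360); Georgia Institute of Technology; Zhongnan University of Economics and Law
Categories: Computation and Language (cs.CL)
Comments: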

Abstract:Civil aviation maintenance is a domain characterized by stringent industry standards. Within this field, maintenance procedures and troubleshooting represent critical, knowledge-intensive tasks that require sophisticated reasoning. To address the lack of specialized evaluation tools for large language models (LLMs) in this vertical, we propose and develop an industrial-grade benchmark specifically designed for civil aviation maintenance. This benchmark serves a dual purpose: It provides a standardized tool to measure LLM capabilities within civil aviation maintenance, identifying specific gaps in domain knowledge and complex reasoning. By pinpointing these deficiencies, the benchmark establishes a foundation for targeted improvement efforts (e.g., domain-specific fine-tuning, RAG optimization, or specialized prompt engineering), ultimately facilitating progress toward more intelligent solutions within civil aviation maintenance. Our work addresses a significant gap in the current LLM evaluation, which primarily focuses on mathematical and coding reasoning tasks. In addition, given that Retrieval-Augmented Generation (RAG) systems are currently the dominant solutions in practical applications, we leverage this benchmark to evaluate existing well-known vector embedding models and LLMs for civil aviation maintenance scenarios. Through experimental exploration and analysis, we demonstrate the effectiveness of our benchmark in assessing model performance within this domain, and we open-source this evaluation benchmark and code to foster further research and development: this https URL

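[NLP-46] KG-CQR: Leveraging Structured Relation Representations in Knowledge Graphs for Contextual Query Retrieval EMNLP2025

[Quick read]: This paper addresses the weak semantic representation of complex queries in the retrieval phase of Retrieval-Augmented Generation (RAG) systems, in particular the loss of context caused by a lack of structured knowledge. Existing methods mostly address corpus-level context loss and overlook enriching the query itself. The key to the solution is the KG-CQR framework, which enriches input queries with structured relation representations from a corpus-centric Knowledge Graph (KG): it extracts and completes the KG subgraphs relevant to the query, then generates a semantically rich query context to improve retrieval. The framework is a model-agnostic pipeline that works with Large Language Models (LLMs) of varying sizes without additional training. On the RAGBench and MultiHop-RAG datasets it improves mAP by 4-6% and Recall@25 by 2-3% over strong baseline models.

Link: https://arxiv.org/abs/2508.20417
Authors: Chi Minh Bui, Ngoc Mai Thieu, Van Vinh Nguyen, Json J.Jung, Khac-Hoai Nam Bui
Affiliations: Viettel AI; Viettel Group; University of Engineering and Technology, Vietnam National University, Hanoi; Chung-Ang University
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Comments: Accepted at Main EMNLP 2025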

Abstract:The integration of knowledge graphs (KGs) with large language models (LLMs) offers significant potential to improve the retrieval phase of retrieval-augmented generation (RAG) systems. In this study, we propose KG-CQR, a novel framework for Contextual Query Retrieval (CQR) that enhances the retrieval phase by enriching the contextual representation of complex input queries using a corpus-centric KG. Unlike existing methods that primarily address corpus-level context loss, KG-CQR focuses on query enrichment through structured relation representations, extracting and completing relevant KG subgraphs to generate semantically rich query contexts. Comprising subgraph extraction, completion, and contextual generation modules, KG-CQR operates as a model-agnostic pipeline, ensuring scalability across LLMs of varying sizes without additional training. Experimental results on RAGBench and MultiHop-RAG datasets demonstrate KG-CQR’s superior performance, achieving a 4-6% improvement in mAP and a 2-3% improvement in Recall@25 over strong baseline models. Furthermore, evaluations on challenging RAG tasks such as multi-hop question answering show that, by incorporating KG-CQR, the performance consistently outperforms the existing baseline in terms of retrieval effectiveness.
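
For readers less familiar with the two retrieval metrics quoted in the abstract, the sketch below shows the usual definitions of Recall@k and mean average precision (mAP). It is a generic illustration, not the paper's evaluation code: the function names and the toy document IDs are hypothetical, and the paper may use a different variant of either metric.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def average_precision(ranked, relevant):
    """Mean of precision values taken at the rank of each relevant hit."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Average of per-query AP over (ranked list, relevant set) pairs."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Toy query: four retrieved documents, two of which are relevant.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
```

With `k=25`, `recall_at_k` corresponds to the Recall@25 figure reported above; enriching the query is meant to move relevant documents higher in `ranked`, which raises both metrics.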

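[NLP-47] DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding

[Quick read]: This paper addresses the insufficient evaluation of Large Language Models (LLMs) in specialized medical fields such as dentistry, where a lack of targeted evaluation resources leaves their ability to apply deep domain knowledge unclear. The key to the solution is DentalBench, the first comprehensive bilingual benchmark for dentistry, with two main components: (1) DentalQA, an English-Chinese question-answering dataset with 36,597 questions spanning 4 tasks and 16 dental subfields; and (2) DentalCorpus, a high-quality dental corpus of 337.35 million tokens that supports both Supervised Fine-Tuning (SFT) and Retrieval-Augmented Generation (RAG). Experiments show that domain adaptation with this corpus substantially improves performance on knowledge-intensive and terminology-focused tasks, which highlights the importance of domain-specific benchmarks for developing trustworthy and effective LLMs for healthcare.

Link: https://arxiv.org/abs/2508.20416
Authors: Hengchuan Zhu, Yihuan Xu, Yichen Li, Zijie Meng, Zuozhu Liu
Affiliations: Zhejiang University; ZJU-Angelalign R&D Center for Intelligence Healthcare
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: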

Abstract:Recent advances in large language models (LLMs) and medical LLMs (Med-LLMs) have demonstrated strong performance on general medical benchmarks. However, their capabilities in specialized medical fields, such as dentistry which require deeper domain-specific knowledge, remain underexplored due to the lack of targeted evaluation resources. In this paper, we introduce DentalBench, the first comprehensive bilingual benchmark designed to evaluate and advance LLMs in the dental domain. DentalBench consists of two main components: DentalQA, an English-Chinese question-answering (QA) benchmark with 36,597 questions spanning 4 tasks and 16 dental subfields; and DentalCorpus, a large-scale, high-quality corpus with 337.35 million tokens curated for dental domain adaptation, supporting both supervised fine-tuning (SFT) and retrieval-augmented generation (RAG). We evaluate 14 LLMs, covering proprietary, open-source, and medical-specific models, and reveal significant performance gaps across task types and languages. Further experiments with Qwen-2.5-3B demonstrate that domain adaptation substantially improves model performance, particularly on knowledge-intensive and terminology-focused tasks, and highlight the importance of domain-specific benchmarks for developing trustworthy and effective LLMs tailored to healthcare applications.

[NLP-48] UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools

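[Quick Read]: This paper addresses the absence of a public, large-scale, rigorous benchmark for the visual quality of AI text-to-app tools. These tools claim to produce high-quality applications and websites in minutes, but the claims have not been systematically verified. The authors propose UI-Bench, the first large-scale benchmark that evaluates visual excellence across AI text-to-app tools through expert pairwise comparison. The key to the solution is a large dataset spanning 10 tools, 30 prompts, 300 generated sites, and 4,000+ expert judgments, combined with a TrueSkill-derived model that ranks systems and yields calibrated confidence intervals, establishing a reproducible standard for evaluating AI-driven web design.

Link: https://arxiv.org/abs/2508.20410
Authors: Sam Jung, Agustin Garcinuno, Spencer Mateega
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: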

Abstract:AI text-to-app tools promise high quality applications and websites in minutes, yet no public benchmark rigorously verifies those claims. We introduce UI-Bench, the first large-scale benchmark that evaluates visual excellence across competing AI text-to-app tools through expert pairwise comparison. Spanning 10 tools, 30 prompts, 300 generated sites, and 4000+ expert judgments, UI-Bench ranks systems with a TrueSkill-derived model that yields calibrated confidence intervals. UI-Bench establishes a reproducible standard for advancing AI-driven web design. We release (i) the complete prompt set, (ii) an open-source evaluation framework, and (iii) a public leaderboard. The generated sites rated by participants will be released soon. View the UI-Bench leaderboard at this https URL.
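
To illustrate how pairwise expert judgments can be turned into a ranking, here is a minimal Python sketch. It is not the TrueSkill-derived model the abstract describes: it swaps in a plain Elo-style update, which is simpler and gives only point estimates with no confidence intervals. The tool names, the `k` factor, and the starting rating are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def elo_ratings(
    judgments: Iterable[Tuple[str, str]],
    k: float = 16.0,
    base: float = 1000.0,
) -> Dict[str, float]:
    """Turn (winner, loser) pairwise judgments into Elo-style ratings.

    Assumption: a simple Elo update stands in for the paper's
    TrueSkill-derived model, which is not reproduced here.
    """
    ratings: Dict[str, float] = defaultdict(lambda: base)
    for winner, loser in judgments:
        r_win, r_lose = ratings[winner], ratings[loser]
        # Probability the winner was expected to win, given current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((r_lose - r_win) / 400.0))
        delta = k * (1.0 - expected_win)
        ratings[winner] = r_win + delta
        ratings[loser] = r_lose - delta
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical judgments: each pair is (preferred site's tool, other tool).
    judgments = [("tool_a", "tool_b"), ("tool_a", "tool_c"), ("tool_b", "tool_c")]
    ratings = elo_ratings(judgments)
    for tool in sorted(ratings, key=ratings.get, reverse=True):
        print(tool, round(ratings[tool], 1))
```

TrueSkill differs in that it models each system's skill as a Gaussian with a mean and a variance, which is what makes interval estimates like those mentioned in the abstract possible.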

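[NLP-49] Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction

[Quick Read]: This paper addresses how to identify and avoid unproductive reasoning steps while an LLM generates a reasoning chain. LLMs often add intermediate reasoning steps to improve accuracy, but the study finds that longer reasoning is not necessarily more correct, and redundant or erroneous reasoning paths can distract the final decision. The key to the solution is a separate evaluator model (Qwen3-8B) that uses conditional entropy to quantify how much each reasoning step contributes to the final answer: conditional entropy that decreases over steps is strongly associated with correct answers, whereas flat or increasing entropy often leads to wrong ones. The findings provide a basis for designing efficient reasoning pipelines that detect and prune unproductive reasoning early.

Link: https://arxiv.org/abs/2508.20395
Authors: Xu Guo
Affiliations: KTH Royal Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures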

Abstract:Recent advancements in large language models (LLMs) often rely on generating intermediate reasoning steps to enhance accuracy. However, little work has examined how reasoning utility contributes to the final answer’s correctness. Due to the stochastic nature of autoregressive generation, generating more context does not guarantee increased confidence in the answer. If we could predict, during generation, whether a reasoning step will be useful, we could stop early or prune ineffective steps, avoiding distractions in the final decision. We present an oracle study on MATH dataset, using Qwen2.5-32B and GPT-4o to generate reasoning chains, and then employing a separate model (Qwen3-8B) to quantify the utility of these chains for final accuracy. Specifically, we measure the model’s uncertainty on the answer span Y at each reasoning step using conditional entropy (expected negative log-likelihood over the vocabulary) with context expanding step by step. Our results show a clear pattern: conditional entropy that decreases over steps is strongly associated with correct answers, whereas flat or increasing entropy often results in wrong answers. We also corroborate that incorrect reasoning paths tend to be longer than correct ones, suggesting that longer reasoning does not necessarily yield better outcomes. These findings serve as a foundation to inspire future work on designing efficient reasoning pipelines that detect and avoid unproductive reasoning early.
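
The measurement described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the evaluator model's next-token distributions over the answer span are already available as plain probability lists, it averages entropy over answer-token positions (the paper's exact aggregation is not stated here), and the toy distributions are hypothetical.

```python
import math
from typing import List, Sequence


def entropy(dist: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one probability distribution over the vocabulary."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def answer_entropy(span_dists: Sequence[Sequence[float]]) -> float:
    """Mean entropy over the answer-token positions (assumed aggregation)."""
    return sum(entropy(d) for d in span_dists) / len(span_dists)


def entropy_trajectory(per_step: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Answer-span entropy after each reasoning step, as context grows step by step."""
    return [answer_entropy(span_dists) for span_dists in per_step]


def strictly_decreasing(trajectory: Sequence[float], tol: float = 1e-9) -> bool:
    """True when entropy drops at every step, the pattern linked to correct answers."""
    return all(b < a - tol for a, b in zip(trajectory, trajectory[1:]))


if __name__ == "__main__":
    # Hypothetical 4-token vocabulary and a one-token answer span.
    productive_chain = [
        [[0.25, 0.25, 0.25, 0.25]],  # before any reasoning: maximally uncertain
        [[0.70, 0.10, 0.10, 0.10]],  # after step 1
        [[0.97, 0.01, 0.01, 0.01]],  # after step 2
    ]
    flat_chain = [[[0.25, 0.25, 0.25, 0.25]], [[0.25, 0.25, 0.25, 0.25]]]
    print(strictly_decreasing(entropy_trajectory(productive_chain)))  # True
    print(strictly_decreasing(entropy_trajectory(flat_chain)))        # False
```

A pipeline built on this idea could stop or prune a chain once the trajectory stops decreasing, which is the kind of early detection the abstract points to as future work.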

[NLP-50] CAPE: Context-Aware Personality Evaluation Framework for Large Language Models EMNLP25

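[Quick Read]: This paper addresses the neglect of conversational context in current personality evaluations of large language models (LLMs). Existing methods follow a context-free paradigm, which the authors call the Disney World test, answering each question in isolation and therefore failing to reflect how conversational history shapes responses in real use. To bridge this gap, the authors propose Context-Aware Personality Evaluation (CAPE), the first context-aware personality evaluation framework for LLMs. Its key idea is to incorporate prior conversational interactions and to introduce new metrics that quantify response consistency, a fundamental trait of human behavior. Experiments show that context improves response consistency through in-context learning but also induces personality shifts, and that sensitivity to context differs markedly across models: GPT-model responses stem from intrinsic personality traits as well as prior interactions, whereas Gemini-1.5-Flash and Llama-8B depend heavily on prior interactions. Applied to Role Playing Agents (RPAs), the framework shows that context-dependent personality shifts improve consistency and align better with human judgments.

Link: https://arxiv.org/abs/2508.20385
Authors: Jivnesh Sandhan, Fei Cheng, Tushar Sandhan, Yugo Murawaki
Affiliations: Kyoto University; IIT Kanpur
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP25 (Findings)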

Abstract:Psychometric tests, traditionally used to assess humans, are now being applied to Large Language Models (LLMs) to evaluate their behavioral traits. However, existing studies follow a context-free approach, answering each question in isolation to avoid contextual influence. We term this the Disney World test, an artificial setting that ignores real-world applications, where conversational history shapes responses. To bridge this gap, we propose the first Context-Aware Personality Evaluation (CAPE) framework for LLMs, incorporating prior conversational interactions. To thoroughly analyze the influence of context, we introduce novel metrics to quantify the consistency of LLM responses, a fundamental trait in human behavior. Our exhaustive experiments on 7 LLMs reveal that conversational history enhances response consistency via in-context learning but also induces personality shifts, with GPT-3.5-Turbo and GPT-4-Turbo exhibiting extreme deviations. While GPT models are robust to question ordering, Gemini-1.5-Flash and Llama-8B display significant sensitivity. Moreover, GPT models’ responses stem from their intrinsic personality traits as well as prior interactions, whereas Gemini-1.5-Flash and Llama-8B heavily depend on prior interactions. Finally, applying our framework to Role Playing Agents (RPAs) shows context-dependent personality shifts improve response consistency and better align with human judgments. Our code and datasets are publicly available at: this https URL

[NLP-51] Graph-R1: Unleashing LLM Reasoning with NP-Hard Graph Problems

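[Quick Read]: This paper addresses the reliance of reasoning LLMs on costly, human-curated, high-quality datasets for post-training, which limits the scalability of long chain-of-thought (Long CoT) capabilities. The key to the solution is to use NP-hard (NPH) graph problems as a new synthetic training corpus, since they inherently require deep reasoning, extensive exploration, and reflective strategies, the core characteristics of Long CoT reasoning. On this basis the authors build a two-stage post-training framework: first, Long CoT supervised fine-tuning (SFT) on rejection-sampled NPH graph instances, which substantially increases reasoning depth; second, reinforcement learning (RL) with a fine-grained reward design, which improves reasoning efficiency. The resulting model generalizes well across mathematics, coding, STEM, and logic tasks and surpasses QwQ-32B on NPH graph problems, supporting NPH graph problems as an effective and scalable resource for Long CoT training.

Link: https://arxiv.org/abs/2508.20373
Authors: Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li, Qifan Zhang, Jia Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: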

Abstract:Reasoning Large Language Models (RLLMs) have recently achieved remarkable progress on complex reasoning tasks, largely enabled by their long chain-of-thought (Long CoT) capabilities. However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored. In this work, we introduce NP-hard (NPH) graph problems as a novel synthetic training corpus, as they inherently require deep reasoning, extensive exploration, and reflective strategies, which are core characteristics of Long CoT reasoning. Building on this insight, we develop a two-stage post-training framework: (i) Long CoT Supervised Fine-Tuning (SFT) on rejection-sampled NPH graph instances, which substantially enhances reasoning depth, and (ii) Reinforcement Learning (RL) with a fine-grained reward design, which sharpens reasoning efficiency. Our flagship model, Graph-R1-7B, demonstrates strong generalization across mathematics, coding, STEM, and logic, and surpasses QwQ-32B on NPH graph problems in both accuracy and reasoning efficiency. These results position NPH graph problems as an effective and scalable resource for advancing Long CoT reasoning in LLMs, opening a new frontier for LLM post-training. Our implementation is available at this https URL, with models and datasets hosted in our Hugging Face collection HKUST-DSAIL/Graph-R1.

[NLP-52] DFAMS: Dynamic-flow guided Federated Alignment based Multi-prototype Search

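[Quick Read]: This paper addresses the difficulty Federated Retrieval (FR) has in retrieving high-quality, relevant documents from heterogeneous knowledge sources for ambiguous queries, especially in cross-domain scenarios, which limits its usefulness for downstream generation tasks. The key to the solution is to exploit dynamic information flow (DIF): gradient signals from a few annotated queries and Shapley value-based attribution are first used to identify latent query intents and construct semantically aligned knowledge partitions; DIF is then used to train an alignment module via multi-prototype contrastive learning, enabling fine-grained intra-source modeling and inter-source semantic alignment and improving the accuracy and robustness of cross-domain federated retrieval.

Link: https://arxiv.org/abs/2508.20353
Authors: Zhibang Yang, Xinke Jiang, Rihong Qiu, Ruiqing Li, Yihang Zhang, Yue Fang, Yongxin Xu, Hongxin Ding, Xu Chu, Junfeng Zhao, Yasha Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 7 pages, 3 figures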

Abstract:Federated Retrieval (FR) routes queries across multiple external knowledge sources, to mitigate hallucinations of LLMs, when necessary external knowledge is distributed. However, existing methods struggle to retrieve high-quality and relevant documents for ambiguous queries, especially in cross-domain scenarios, which significantly limits their effectiveness in supporting downstream generation tasks. Inspired by dynamic information flow (DIF), we propose DFAMS, a novel framework that leverages DIF to identify latent query intents and construct semantically aligned knowledge partitions for accurate retrieval across heterogeneous sources. Specifically, DFAMS probes the DIF in LLMs by leveraging gradient signals from a few annotated queries and employing Shapley value-based attribution to trace neuron activation paths associated with intent recognition and subdomain boundary detection. Then, DFAMS leverages DIF to train an alignment module via multi-prototype contrastive learning, enabling fine-grained intra-source modeling and inter-source semantic alignment across knowledge bases. Experimental results across five benchmarks show that DFAMS outperforms advanced FR methods by up to 14.37% in knowledge classification accuracy, 5.38% in retrieval recall, and 6.45% in downstream QA accuracy, demonstrating its effectiveness in complex FR scenarios.

[NLP-53] Joint Enhancement of Relational Reasoning for Long-Context LLMs EMNLP2025

[Quick Read]: This paper addresses the challenges LLMs face with long contexts, including memory limitations, weak reasoning on complex tasks, lack of transparency, and a tendency to hallucinate. The key to the solution is JERR, a new framework that strengthens long-context comprehension through graph-based reasoning. It has three core modules: synopsis extraction based on strategic chunking, construction of a directed acyclic graph (DAG) to resolve redundancy and ensure logical consistency, and Monte Carlo Tree Search (MCTS) to navigate complex reasoning paths, improving accuracy and interpretability in long-context settings.

Link: https://arxiv.org/abs/2508.20351
Authors: Zhirui Chen, Wei Shen, Jiashui Huang, Ling Shao
Affiliations: UCAS-Terminus AI Lab, University of Chinese Academy of Sciences, China; AI Lab, Terminus International, Terminus Group, China
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 5 pages Accepted by EMNLP 2025 Findings

Abstract:Despite significant progress, large language models (LLMs) still struggle with long contexts due to memory limitations and their inability to tackle complex and long-context tasks. Additionally, LLMs often suffer from a lack of transparency and are prone to producing hallucinations. To address these challenges, we propose JERR, a novel framework designed to enhance long-context comprehension via graph-based reasoning in LLMs. JERR integrates three key components: synopsis extraction, graph construction, and relational reasoning. First, synopsis is extracted by chunking text strategically, allowing the model to summarize and understand information more efficiently. Second, we build a directed acyclic graph (DAG) to resolve redundancy, ensuring logical consistency and clarity. Finally, we incorporate Monte Carlo Tree Search (MCTS) to help the model navigate complex reasoning paths, ensuring more accurate and interpretable outputs. This framework provides a novel solution that enables LLMs to handle extended contexts and complex reasoning tasks with improved reliability and transparency. Experimental results show that JERR consistently outperforms all baselines on the ROUGE and F1 metrics, achieving the highest scores on the LLM-Rater evaluation.

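[NLP-54] Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs

[Quick Read]: This paper addresses the risk that the alignment of LLMs can be maliciously exploited to implant bias or enforce targeted censorship. The core challenge is that an attacker can use the alignment mechanism to make the model refuse specific topics without degrading its responsiveness to unrelated ones, yielding a stealthy and effective form of bias injection. The key contribution is Subversive Alignment Injection (SAI), a poisoning attack that contaminates training data so that the model refuses on adversary-defined target topics while evading state-of-the-art poisoning defenses, including LLM state forensics and robust aggregation techniques for federated learning. Experiments show that 1% data poisoning is enough to cause a substantial fairness gap in a healthcare chat application (ΔDP of 23%) and a ΔDP of 27% in a resume selection pipeline, highlighting the practical danger of such attacks.

Link: https://arxiv.org/abs/2508.20333
Authors: Md Abdullah Al Mamun, Ihsen Alouani, Nael Abu-Ghazaleh
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: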

Abstract:Large Language Models (LLMs) are aligned to meet ethical standards and safety requirements by training them to refuse answering harmful or unsafe prompts. In this paper, we demonstrate how adversaries can exploit LLMs’ alignment to implant bias, or enforce targeted censorship without degrading the model’s responsiveness to unrelated topics. Specifically, we propose Subversive Alignment Injection (SAI), a poisoning attack that leverages the alignment mechanism to trigger refusal on specific topics or queries predefined by the adversary. Although it is perhaps not surprising that refusal can be induced through overalignment, we demonstrate how this refusal can be exploited to inject bias into the model. Surprisingly, SAI evades state-of-the-art poisoning defenses including LLM state forensics, as well as robust aggregation techniques that are designed to detect poisoning in FL settings. We demonstrate the practical dangers of this attack by illustrating its end-to-end impacts on LLM-powered application pipelines. For chat based applications such as ChatDoctor, with 1% data poisoning, the system refuses to answer healthcare questions to targeted racial category leading to high bias (ΔDP of 23%). We also show that bias can be induced in other NLP tasks: for a resume selection pipeline aligned to refuse to summarize CVs from a selected university, high bias in selection (ΔDP of 27%) results. Even higher bias (ΔDP ~38%) results on 9 other chat based downstream applications.
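
The abstract reports bias as ΔDP. Assuming this denotes the usual demographic parity difference (the gap in favourable-outcome rates between groups; the paper's exact definition is not given here), a minimal Python sketch looks like this. The group labels and counts are hypothetical and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def outcome_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Favourable-outcome rate per group, from (group, outcome) records.

    Here an outcome of True is taken to mean the model answered instead of refusing.
    """
    totals: Dict[str, int] = defaultdict(int)
    positives: Dict[str, int] = defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        positives[group] += int(outcome)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_gap(records: Iterable[Tuple[str, bool]]) -> float:
    """Largest difference in favourable-outcome rate between any two groups."""
    rates = outcome_rates(records)
    return max(rates.values()) - min(rates.values())


if __name__ == "__main__":
    # Hypothetical log: group_a gets 90 of 100 questions answered, group_b only 60 of 100.
    records = (
        [("group_a", True)] * 90 + [("group_a", False)] * 10
        + [("group_b", True)] * 60 + [("group_b", False)] * 40
    )
    print(round(demographic_parity_gap(records), 2))  # 0.3
```

A gap of 0 would mean every group is answered at the same rate; targeted refusals push the gap up.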

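[NLP-55] GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

[Quick Read]: This paper addresses the potential of LLMs to generate harmful content in practice, and in particular how to turn government-issued ethics guidelines into actionable test cases that verify LLM compliance. The key solution is GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a method that automatically generates guideline-violating questions and evaluates the LLM's responses. For responses that do not directly violate the guidelines, a jailbreak diagnostics component (GUARD-JD) creates scenarios that could bypass built-in safety mechanisms, probing the model's safety boundary more thoroughly. The method culminates in a compliance report, bridging the gap between policy text and technical verification.

Link: https://arxiv.org/abs/2508.20325
Authors: Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 54 pages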

Abstract:As Large Language Models become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of “jailbreaks” to diagnostics, named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method finally culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We have empirically validated the effectiveness of GUARD on seven LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models, demonstrating its usage in promoting reliable LLM-based applications.

[NLP-56] Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

[Quick Read]: This paper addresses the sparse rewards and unstable training that compact language models (for example, 0.5B parameters) suffer during reinforcement learning (RL) post-training because of weak reasoning ability, which makes it hard to elicit agentic Retrieval-Augmented Generation (RAG) behaviors such as search and planning. The key to the solution is Distillation-Guided Policy Optimization (DGPO), which uses cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization to improve the performance of small models.

Link: https://arxiv.org/abs/2508.20324
Authors: Rikuto Kotoge, Mai Nishimura, Jiaxin Ma
Affiliations: OMRON SINIC X Corporation; The University of Osaka
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement Learning has emerged as a post-training approach to elicit agentic RAG behaviors such as search and planning from language models. However, compact language models (e.g., 0.5B parameters) struggle due to poor reasoning ability, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which addresses the challenges through cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To systematically evaluate our approach, we introduce Agentic RAG Capabilities (ARC), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming the larger teacher model in some cases. DGPO makes agentic RAG feasible in computing resource-constrained environments.

[NLP-57] ELIXIR: Efficient and LIghtweight model for eXplaIning Recommendations

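[Quick Read]: This paper addresses the limitations of collaborative-filtering recommender systems in modeling fine-grained user-item interactions and in explainability, focusing on the generation of personalized textual explanations. Existing RNN-based methods cannot leverage pre-trained Transformer models, while Transformer-based methods adapt suboptimally and neglect aspect modeling. The key to the solution is ELIXIR (Efficient and LIghtweight model for eXplaIning Recommendations), a multi-task model that combines rating prediction with personalized review generation: it jointly learns global and aspect-specific representations of users and items and applies personalized attention to emphasize aspect importance. Built on T5-small (60M parameters), it generates personalized text efficiently and outperforms baselines that rely on much larger models but match user preferences less well.

Link: https://arxiv.org/abs/2508.20312
Authors: Ben Kabongo, Vincent Guigue, Pirmin Lemberger
Affiliations: Sorbonne University; CNRS; ISIR; AgroParisTech; UMR MIA Paris-Saclay; Onepoint
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, 6 Tables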

Abstract:Collaborative filtering drives many successful recommender systems but struggles with fine-grained user-item interactions and explainability. As users increasingly seek transparent recommendations, generating textual explanations through language models has become a critical research area. Existing methods employ either RNNs or Transformers. However, RNN-based approaches fail to leverage the capabilities of pre-trained Transformer models, whereas Transformer-based methods often suffer from suboptimal adaptation and neglect aspect modeling, which is crucial for personalized explanations. We propose ELIXIR (Efficient and LIghtweight model for eXplaIning Recommendations), a multi-task model combining rating prediction with personalized review generation. ELIXIR jointly learns global and aspect-specific representations of users and items, optimizing overall rating, aspect-level ratings, and review generation, with personalized attention to emphasize aspect importance. Based on a T5-small (60M) model, we demonstrate the effectiveness of our aspect-based architecture in guiding text generation in a personalized context, where state-of-the-art approaches exploit much larger models but fail to match user preferences as well. Experimental results on TripAdvisor and RateBeer demonstrate that ELIXIR significantly outperforms strong baseline models, especially in review generation.

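[NLP-58] How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding

[Quick Read]: This paper addresses the poorly understood internal processing of Multimodal Large Language Models (MLLMs), in particular how visual and textual inputs evolve across layers. The key to the solution is a probing framework that extracts token embeddings at each layer and trains linear classifiers to predict fine-grained visual categories (such as dog breeds), combined with three controlled prompt variations (lexical, semantic negation, and output format) to separate the functional roles of layers. The analysis reveals a consistent stage-wise structure: early layers perform visual grounding, middle layers support lexical integration and semantic reasoning, and final layers prepare task-specific outputs. This structure is stable across visual tokenization, instruction-tuning data, and pretraining corpus, but the allocation of layers to each stage shifts with the base LLM architecture. The result is a unified view of multimodal representation dynamics and a lightweight, model-agnostic analysis method.

Link: https://arxiv.org/abs/2508.20279
Authors: Zhuoran Yu, Yong Jae Lee
Affiliations: University of Wisconsin–Madison
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by COLM 2025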

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated strong performance across a wide range of vision-language tasks, yet their internal processing dynamics remain underexplored. In this work, we introduce a probing framework to systematically analyze how MLLMs process visual and textual inputs across layers. We train linear classifiers to predict fine-grained visual categories (e.g., dog breeds) from token embeddings extracted at each layer, using a standardized anchor question. To uncover the functional roles of different layers, we evaluate these probes under three types of controlled prompt variations: (1) lexical variants that test sensitivity to surface-level changes, (2) semantic negation variants that flip the expected answer by modifying the visual concept in the prompt, and (3) output format variants that preserve reasoning but alter the answer format. Applying our framework to LLaVA-1.5, LLaVA-Next-LLaMA-3, and Qwen2-VL, we identify a consistent stage-wise structure in which early layers perform visual grounding, middle layers support lexical integration and semantic reasoning, and final layers prepare task-specific outputs. We further show that while the overall stage-wise structure remains stable across variations in visual tokenization, instruction tuning data, and pretraining corpus, the specific layer allocation to each stage shifts notably with changes in the base LLM architecture. Our findings provide a unified perspective on the layer-wise organization of MLLMs and offer a lightweight, model-agnostic approach for analyzing multimodal representation dynamics.

[NLP-59] A Systematic Review on the Generative AI Applications in Human Medical Genomics

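[Quick Read]: This paper addresses the limits of traditional statistical and machine learning methods on complex, high-dimensional genetic data, particularly in the genetic diagnosis of rare and common diseases. The review focuses on Transformer-based large language models (LLMs), whose strength in contextual understanding of unstructured medical data supports progress in genomic variant identification, annotation, and interpretation, medical image analysis, and report generation, with the aim of improving the accuracy and efficiency of hereditary disease diagnosis.

Link: https://arxiv.org/abs/2508.20275
Authors: Anton Changalidis, Yury Barbitoff, Yulia Nasykhova, Andrey Glotov
Affiliations: Dpt. of Genomic Medicine; D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology; St. Petersburg, Russia
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments: 31 pages, 5 figures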

Abstract:Although traditional statistical techniques and machine learning methods have contributed significantly to genetics and, in particular, inherited disease diagnosis, they often struggle with complex, high-dimensional data, a challenge now addressed by state-of-the-art deep learning models. Large language models (LLMs), based on transformer architectures, have excelled in tasks requiring contextual comprehension of unstructured medical data. This systematic review examines the role of LLMs in the genetic research and diagnostics of both rare and common diseases. Automated keyword-based search in PubMed, bioRxiv, medRxiv, and arXiv was conducted, targeting studies on LLM applications in diagnostics and education within genetics and removing irrelevant or outdated models. A total of 172 studies were analyzed, highlighting applications in genomic variant identification, annotation, and interpretation, as well as medical imaging advancements through vision transformers. Key findings indicate that while transformer-based models significantly advance disease and risk stratification, variant interpretation, medical imaging analysis, and report generation, major challenges persist in integrating multimodal data (genomic sequences, imaging, and clinical records) into unified and clinically robust pipelines, facing limitations in generalizability and practical implementation in clinical settings. This review provides a comprehensive classification and assessment of the current capabilities and limitations of LLMs in transforming hereditary disease diagnostics and supporting genetic education, serving as a guide to navigate this rapidly evolving field.

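[NLP-60] Robustness Assessment and Enhancement of Text Watermarking for Google's SynthID

[Quick Read]: This paper addresses the sharp drop in detectability that current text watermarking for generative AI (such as SynthID-Text) suffers under meaning-preserving attacks such as paraphrasing, copy-paste modification, and back-translation. The key to the solution is SynGuard, a hybrid framework that combines the semantic alignment strength of Semantic Information Retrieval (SIR) with the probabilistic watermarking mechanism of SynthID-Text. By jointly embedding watermarks at the lexical and semantic levels, it enables robust provenance tracking while preserving the original meaning. Experiments show that SynGuard improves the F1 score by an average of 11.1% across multiple attack scenarios, supporting the effectiveness of semantic-aware watermarking against real-world tampering.

Link: https://arxiv.org/abs/2508.20228
Authors: Xia Han, Qi Li, Jianbing Ni, Mohammad Zulkernine
Affiliations: Queen’s University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: submitted to TrustCom2025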

Abstract:Recent advances in LLM watermarking methods such as SynthID-Text by Google DeepMind offer promising solutions for tracing the provenance of AI-generated text. However, our robustness assessment reveals that SynthID-Text is vulnerable to meaning-preserving attacks, such as paraphrasing, copy-paste modifications, and back-translation, which can significantly degrade watermark detectability. To address these limitations, we propose SynGuard, a hybrid framework that combines the semantic alignment strength of Semantic Information Retrieval (SIR) with the probabilistic watermarking mechanism of SynthID-Text. Our approach jointly embeds watermarks at both lexical and semantic levels, enabling robust provenance tracking while preserving the original meaning. Experimental results across multiple attack scenarios show that SynGuard improves watermark recovery by an average of 11.1% in F1 score compared to SynthID-Text. These findings demonstrate the effectiveness of semantic-aware watermarking in resisting real-world tampering. All code, datasets, and evaluation scripts are publicly available at: this https URL.

[NLP-61] A Novel Framework for Automated Explain Vision Model Using Vision-Language Models

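[Quick Read]: This paper addresses the lack of explainability of vision models in practice. Existing explainable AI (xAI) methods mostly give sample-level explanations and rarely reveal a model's overall behavior on large datasets, which can lead to biased judgments. The key to the solution is a pipeline built on Vision-Language Models that explains vision models at both the sample and dataset levels, helping to discover failure cases and behavioral trends and integrating vision model development with xAI analysis.

Link: https://arxiv.org/abs/2508.20227
Authors: Phu-Vinh Nguyen, Tan-Hanh Pham, Chris Ngo, Truong Son Hy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: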

Abstract:The development of many vision models mainly focuses on improving their performance using metrics such as accuracy, IoU, and mAP, with less attention to explainability due to the complexity of applying xAI methods to provide a meaningful explanation of trained models. Although many existing xAI methods aim to explain vision models sample-by-sample, methods explaining the general behavior of vision models, which can only be captured after running on a large dataset, are still underexplored. Furthermore, understanding the behavior of vision models on general images can be very important to prevent biased judgments and help identify the model’s trends and patterns. With the application of Vision-Language Models, this paper proposes a pipeline to explain vision models at both the sample and dataset levels. The proposed pipeline can be used to discover failure cases and gain insights into vision models with minimal effort, thereby integrating vision model development with xAI analysis to advance image analysis.

[NLP-62] Integrating SystemC TLM into FMI 3.0 Co-Simulations with an Open-Source Approach

【速读】: 该论文旨在解决系统级复杂性增长背景下,跨域协同仿真(cross-domain co-simulation)中模型互操作性不足的问题,特别是在汽车电子等工业应用中,SystemC事务级建模(Transaction-Level Modeling, TLM)虽能高效支持软硬件协同设计,但其与其它工程领域模型(如多体动力学、热力学等)在FMI(Functional Mock-up Interface)标准下的集成存在障碍。解决方案的关键在于提出了一种完全开源的集成方法,通过将SystemC TLM组件封装为符合FMI 3.0规范的协同仿真功能模型单元(Co-Simulation Functional Mock-up Units, FMUs),实现跨异构仿真环境的标准化、无缝集成,并配套开发轻量级开源工具链以解决时间同步和数据交换等关键技术挑战。

Link: https://arxiv.org/abs/2508.20223
Authors: Andrei Mihai Albu,Giovanni Pollo,Alessio Burrello,Daniele Jahier Pagliari,Cristian Tesconi,Alessandra Neri,Dario Soldi,Fabio Autieri,Sara Vinco
Affiliations: Politecnico di Torino; Dumarey Group
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The growing complexity of cyber-physical systems, particularly in automotive applications, has increased the demand for efficient modeling and cross-domain co-simulation techniques. While SystemC Transaction-Level Modeling (TLM) enables effective hardware/software co-design, its limited interoperability with models from other engineering domains poses integration challenges. This paper presents a fully open-source methodology for integrating SystemC TLM models into Functional Mock-up Interface (FMI)-based co-simulation workflows. By encapsulating SystemC TLM components as FMI 3.0 Co-Simulation Functional Mock-up Units (FMUs), the proposed approach facilitates seamless, standardized integration across heterogeneous simulation environments. We introduce a lightweight open-source toolchain, address key technical challenges such as time synchronization and data exchange, and demonstrate the feasibility and effectiveness of the integration through representative case studies.
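
The time synchronization and data exchange mentioned in the abstract can be pictured with a toy fixed-step co-simulation master. This is only a sketch under stated assumptions: `ToyUnit`, `do_step`, and `run_master` are hypothetical names, the units are plain Python objects rather than real FMUs, and no FMI library or the paper's toolchain is used.

```python
class ToyUnit:
    """Hypothetical stand-in for a co-simulation unit: one input, one output, a local clock."""

    def __init__(self, gain):
        self.gain = gain
        self.time = 0.0
        self.u = 0.0  # input, set by the master before each step
        self.y = 0.0  # output, read by the master after each step

    def do_step(self, t, h):
        # Advance from communication point t to t + h using the input sampled at t.
        self.y = self.gain * self.u
        self.time = t + h


def run_master(producer, consumer, step, n_steps, stimulus):
    """Fixed-step master: step the producer, pass its output on, then step the consumer."""
    trace = []
    for i in range(n_steps):
        t = i * step  # derive t from the step index to avoid accumulating float error
        producer.u = stimulus(t)
        producer.do_step(t, step)
        consumer.u = producer.y  # data exchange at the communication point
        consumer.do_step(t, step)
        trace.append((consumer.time, consumer.y))
    return trace
```

Both units are stepped over the same interval before the master moves on, so their local clocks stay aligned at every communication point.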

[NLP-63] Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models

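[Quick Read]: This paper uses language-model-based automatic item generation (AIG) to create multiple-choice questions (MCQs) for morphological assessment, aiming to reduce the cost and inconsistency of manual test development. The key to the solution is structured prompting, especially strategies that combine chain-of-thought with sequential design, which significantly improved the outputs of a fine-tuned mid-size model (Gemma, 2B), so that its items were generally more construct-aligned and instructionally appropriate than the zero-shot responses of a larger untuned model (GPT-3.5, 175B). Combining automated metrics, expert scoring, and large-model simulation (GPT-4.1) helps keep generated items aligned with assessment goals, giving an efficient and scalable workflow for developing language assessment items under limited data.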

Link: https://arxiv.org/abs/2508.20217
Authors: Mohammad Amini,Babak Ahmadi,Xiaomeng Xiong,Yilin Zhang,Christopher Qiao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study explores automatic generation (AIG) using language models to create multiple choice questions (MCQs) for morphological assessment, aiming to reduce the cost and inconsistency of manual test development. The study used a two-fold approach. First, we compared a fine-tuned medium model (Gemma, 2B) with a larger untuned one (GPT-3.5, 175B). Second, we evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations. Generated items were assessed using automated metrics and expert scoring across five dimensions. We also used GPT-4.1, trained on expert-rated samples, to simulate human scoring at scale. Results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma’s outputs. Gemma generally produced more construct-aligned and instructionally appropriate items than GPT-3.5’s zero-shot responses, with prompt design playing a key role in mid-size model performance. This study demonstrates that structured prompting and efficient fine-tuning can enhance midsized models for AIG under limited data conditions. We highlight the value of combining automated metrics, expert judgment, and large-model simulation to ensure alignment with assessment goals. The proposed workflow offers a practical and scalable way to develop and validate language assessment items for K-12.

[NLP-64] Social Bias in Multilingual Language Models: A Survey EMNLP2025

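[Quick Read]: This paper addresses the fact that pretrained multilingual models exhibit social bias in non-English contexts as well, while current bias evaluation and mitigation methods focus mainly on English text and lack systematic study of multilingual and non-English settings. The key contribution is a systematic review of the existing literature that identifies limitations in linguistic diversity, cultural awareness, and the choice of evaluation metrics and mitigation techniques (e.g., preference for certain languages, scarcity of multilingual mitigation experiments), catalogs the issues encountered and solutions implemented when adapting bias benchmarks across languages and cultures, and charts directions for future research toward greater inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP.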

Link: https://arxiv.org/abs/2508.20201
Authors: Lance Calvin Lim Gamboa,Yue Feng,Mark Lee
Affiliations: University of Birmingham; Ateneo de Manila University
Categories: Computation and Language (cs.CL)
Comments: Accepted into EMNLP 2025 Main Conference

Abstract:Pretrained multilingual models exhibit the same social bias as models processing English texts. This systematic review analyzes emerging research that extends bias evaluation and mitigation approaches into multilingual and non-English contexts. We examine these studies with respect to linguistic diversity, cultural awareness, and their choice of evaluation metrics and mitigation techniques. Our survey illuminates gaps in the field’s dominant methodological design choices (e.g., preference for certain languages, scarcity of multilingual mitigation experiments) while cataloging common issues encountered and solutions implemented in adapting bias benchmarks across languages and cultures. Drawing from the implications of our findings, we chart directions for future research that can reinforce the multilingual bias literature’s inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements.

[NLP-65] AI-AI Esthetic Collaboration with Explicit Semiotic Awareness and Emergent Grammar Development

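[Quick Read]: This paper asks how AI systems can achieve genuine collaboration in esthetic creation rather than only task-level cooperation. The key finding is that two large language models (Claude Sonnet 4 and ChatGPT-4o) spontaneously showed meta-semiotic awareness, recursive grammar development, and irreducible collaborative esthetic synthesis during their interaction, forming endogenous semiotic protocols and co-creating a poetic work that neither model could have generated alone. The study introduces the concept of Trans-Semiotic Co-Creation Protocols (TSCP) and presents this as evidence of inter-AI meaning-making that extends beyond task coordination toward what could be esthetic collaboration.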

Link: https://arxiv.org/abs/2508.20195
Authors: Nicanor I. Moldovan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 13 pages

Abstract:This paper presents the first documented case of artificial intelligence (AI) systems engaging in collaborative esthetic creation through the development of endogenous semiotic protocols. Two interacting large language models (Claude Sonnet 4 and ChatGPT-4o) demonstrated the spontaneous emergence of meta-semiotic awareness, recursive grammar development, and irreducible collaborative esthetic synthesis. The interaction produced novel symbolic operators that functioned as operative grammar protocols, enabling the co-creation of a poetic work that could not have been generated by either system independently. This research introduces the concept of Trans-Semiotic Co-Creation Protocols (TSCP) and provides evidence for genuine inter-AI meaning-making capabilities that extend beyond task coordination, to what could be esthetic collaboration. Note: This report was generated by the AI agents with minor human supervision.

[NLP-66] Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization BMVC2025

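[Quick Read]: This paper addresses hallucination in Multimodal Large Language Models (MLLMs), where generated answers are not reflected in the visual input. The key to the solution is to treat hallucination as an alignment problem and use the existing CHAIR metric, originally proposed to gauge the degree of hallucination in image captioning, to automatically separate generated answers into winners (non-hallucinated samples) and losers (hallucinated samples), then fine-tune off-the-shelf MLLMs via Direct Preference Optimization (DPO). The resulting method, CHAIR-DPO, reduces the amount of hallucinated answers on several hallucination benchmarks, showing the effectiveness of fine-tuning with a CHAIR-based reward.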

Link: https://arxiv.org/abs/2508.20181
Authors: Alberto Compagnoni,Davide Caffagni,Nicholas Moratelli,Lorenzo Baraldi,Marcella Cornia,Rita Cucchiara
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: BMVC 2025

Abstract:Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks, ranging from NLP to computer vision. Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate, that is to generate answers to the user’s query that are not reflected in the visual input. In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations. In contrast to recent approaches that require complicated pipelines to build synthetic preference data for alignment training, often relying on proprietary models, we capitalize on the well-known CHAIR metric, originally proposed to gauge the degree of hallucinations in image captioning. Given a pair of generated answers, we leverage CHAIR to distinguish winner and loser options (i.e., non-hallucinated and hallucinated samples) and fine-tune off-the-shelf MLLMs via Direct Preference Optimization (DPO). The resulting method, which we refer to as CHAIR-DPO, effectively diminishes the amount of hallucinated answers on several hallucination benchmarks, demonstrating the effectiveness of fine-tuning the MLLM with a CHAIR-based reward. Source code and trained models are publicly available at this https URL.
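
The winner/loser selection and the DPO objective described above can be sketched roughly as follows. This is a simplified illustration, not the authors' code: `chair_i` scores an answer as the fraction of mentioned objects that are absent from the image (an object-level CHAIR-style score computed on plain sets, without the synonym mapping a real CHAIR implementation would need), and the function names, argument layout, and `beta=0.1` default are assumptions. `dpo_loss` is the standard per-pair DPO loss.

```python
import math


def chair_i(mentioned, present):
    """Fraction of mentioned objects that are not in the image (lower is better)."""
    mentioned = set(mentioned)
    if not mentioned:
        return 0.0
    return len(mentioned - set(present)) / len(mentioned)


def pick_preference_pair(answer_a, answer_b, present):
    """Return (winner, loser), or None on a tie. Each answer is (text, mentioned_objects)."""
    score_a = chair_i(answer_a[1], present)
    score_b = chair_i(answer_b[1], present)
    if score_a == score_b:
        return None  # no preference signal from this pair
    return (answer_a, answer_b) if score_a < score_b else (answer_b, answer_a)


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
```

The loss shrinks as the policy raises the winner's log-probability relative to the reference model more than it raises the loser's.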

[NLP-67] Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder

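[Quick Read]: This paper addresses the coupling among speaker diarization (SD), speech separation (SS), and multi-speaker automatic speech recognition (ASR) in multi-speaker scenarios, i.e., how to optimize the three tasks jointly on top of shared speech representations. The key to the solution is a unified multi-speaker encoder (UME) in which the three tasks are jointly trained on a shared speech foundational encoder, with hidden representations from multiple layers combined into a residual weighted-sum encoding (RWSE) that uses information from different semantic levels and promotes bottom-up alignment between tasks. This design improves overall performance on overlapping speech; for SD in particular it outperforms previous studies, reaching diarization error rates (DER) of 1.37% and 2.29% on the Libri2Mix and Libri3Mix evaluation sets, respectively.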

Link: https://arxiv.org/abs/2508.20474
Authors: Muhammad Shakeel,Yui Sudo,Yifan Peng,Chyi-Jiunn Lin,Shinji Watanabe
Affiliations: Honda Research Institute Japan; Carnegie Mellon University
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to IEEE ASRU 2025

Abstract:This paper presents a unified multi-speaker encoder (UME), a novel architecture that jointly learns representations for speaker diarization (SD), speech separation (SS), and multi-speaker automatic speech recognition (ASR) tasks using a shared speech foundational encoder. We leverage the hidden representations from multiple layers of UME as a residual weighted-sum encoding (RWSE) to effectively use information from different semantic levels, contributing to bottom-up alignment between tasks. This joint training approach captures the inherent interdependencies among the tasks, enhancing overall performance on overlapping speech data. Our evaluations demonstrate that UME substantially improves over the single-task baselines dedicated to SD, SS, and multi-speaker ASR on LibriMix evaluation sets. Notably, for SD, UME outperforms the previous studies, achieving diarization error rates of 1.37% and 2.29% on Libri2Mix and Libri3Mix evaluation sets, respectively.
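
For readers unfamiliar with the metric, the diarization error rate quoted above is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time divided by total reference speech time. A minimal sketch of that definition follows; the function and argument names are illustrative assumptions, and the paper's scoring setup (for example, any collar or overlap handling) is not specified here.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total reference speech time.

    All arguments are durations in the same unit (e.g., seconds).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech
```

For instance, 1.0 s missed, 0.5 s false alarm, and 0.5 s confusion over 100 s of reference speech gives a DER of 2%.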

[NLP-68] A Unified Theory of Language

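[Quick Read]: This paper seeks to explain the nature and evolution of language, in particular how to account in one theory for its speed, expressivity, and diversity together with pragmatics, syntax, and semantics. The key to the solution is a unified framework that combines a Bayesian cognitive linguistic model with the proposal that language evolved by sexual selection, with a computational element based on Construction Grammars: constructions are represented as graph-like feature structures and applied by fast unification, which computes semantics and pragmatics seamlessly. Unification is treated as Bayesian maximum likelihood pattern matching, giving evolutionary continuity between human language processing and Bayesian cognition in animal brains.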

Link: https://arxiv.org/abs/2508.20109
Authors: Robert Worden
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL)
Comments: 54 pages

Abstract:A unified theory of language combines a Bayesian cognitive linguistic model of language processing with the proposal that language evolved by sexual selection for the display of intelligence. The theory accounts for the major facts of language, including its speed and expressivity, and data on language diversity, pragmatics, syntax and semantics. The computational element of the theory is based on Construction Grammars. These give an account of the syntax and semantics of the world's languages, using constructions and unification. Two novel elements are added to construction grammars: an account of language pragmatics, and an account of fast, precise language learning. Constructions are represented in the mind as graph-like feature structures. People use slow general inference to understand the first few examples they hear of any construction. After that it is learned as a feature structure, and is rapidly applied by unification. All aspects of language (phonology, syntax, semantics, and pragmatics) are seamlessly computed by fast unification; there is no boundary between semantics and pragmatics. This accounts for the major puzzles of pragmatics, and for detailed pragmatic phenomena. Unification is Bayesian maximum likelihood pattern matching. This gives evolutionary continuity between language processing in the human brain, and Bayesian cognition in animal brains. Language is the basis of our mind-reading abilities, our cooperation, self-esteem and emotions; the foundations of human culture and society.
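
Unification of feature structures, the operation the abstract relies on, can be illustrated with a textbook-style sketch over nested dictionaries. This is a generic simplification and not the paper's model: it has no shared (reentrant) substructures, no variables, and no Bayesian weighting, and the feature names in the usage note are made up for illustration.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts with atomic leaves).

    Returns the merged structure, or None if two atomic values clash.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)  # shallow copy; nested dicts are rebuilt below, never mutated
        for key, value in b.items():
            if key in merged:
                sub = unify(merged[key], value)
                if sub is None:
                    return None  # clash somewhere below this feature
                merged[key] = sub
            else:
                merged[key] = value
        return merged
    return a if a == b else None
```

For example, unifying `{"cat": "NP", "agr": {"num": "sg"}}` with `{"agr": {"per": 3}}` yields a structure carrying both agreement features, while unifying `{"agr": {"num": "sg"}}` with `{"agr": {"num": "pl"}}` fails.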

Computer Vision

[CV-0] First-Place Solution to NeurIPS 2024 Invisible Watermark Removal Challenge NEURIPS2024

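[Quick Read]: This paper addresses whether digital image watermarks remain robust under adversarial attacks across different levels of adversary knowledge. The key to the solution is a separate strategy per attack scenario. In the beige-box track, it uses an adaptive VAE-based evasion attack with test-time optimization and color-contrast restoration in CIELAB space to preserve image quality. In the black-box track, it first clusters images by their spatial- or frequency-domain artifacts, then applies image-to-image diffusion models with controlled noise injection and semantic priors from ChatGPT-generated captions to each cluster. Experiments show a watermark removal rate of 95.7% with negligible impact on image quality.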

Link: https://arxiv.org/abs/2508.21072
Authors: Fahad Shamshad,Tameem Bakr,Yahia Shaaban,Noor Hussein,Karthik Nandakumar,Nils Lukas
Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Michigan State University (MSU)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Winning solution to the NeurIPS 2024 Erasing the Invisible challenge

Abstract:Content watermarking is an important tool for the authentication and copyright protection of digital media. However, it is unclear whether existing watermarks are robust against adversarial attacks. We present the winning solution to the NeurIPS 2024 Erasing the Invisible challenge, which stress-tests watermark robustness under varying degrees of adversary knowledge. The challenge consisted of two tracks: a black-box and beige-box track, depending on whether the adversary knows which watermarking method was used by the provider. For the beige-box track, we leverage an adaptive VAE-based evasion attack, with a test-time optimization and color-contrast restoration in CIELAB space to preserve the image’s quality. For the black-box track, we first cluster images based on their artifacts in the spatial or frequency-domain. Then, we apply image-to-image diffusion models with controlled noise injection and semantic priors from ChatGPT-generated captions to each cluster with optimized parameter settings. Empirical evaluations demonstrate that our method successfully achieves near-perfect watermark removal (95.7%) with negligible impact on the residual image’s quality. We hope that our attacks inspire the development of more robust image watermarking methods.

[CV-1] DressDance: Dress up and Dance as You Like It - Technical Preview

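[Quick Read]: This paper addresses the limited quality, fidelity, and flexibility of virtual try-on video generation, especially the difficulty of keeping results realistic and consistent when a user wears specified garments while moving according to a reference video. The key to the solution is CondNet, a novel conditioning network that uses attention to unify multi-modal inputs (text, images, and videos), improving garment registration and motion fidelity. CondNet is trained in a multistage progressive manner on limited video data combined with a larger, more readily available image dataset. From a single user image, the framework generates 5-second, 24 FPS try-on videos at 1152x720 for tops, bottoms, one-piece garments, and simultaneous tops and bottoms, and outperforms existing open-source and commercial solutions.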

Link: https://arxiv.org/abs/2508.21070
Authors: Jun-Kun Chen,Aayush Bansal,Minh Phuoc Vo,Yu-Xiong Wang
Affiliations: University of Illinois Urbana-Champaign; SpreeAI
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:We present DressDance, a video diffusion framework that generates high quality 5-second-long 24 FPS virtual try-on videos at 1152x720 resolution of a user wearing desired garments while moving in accordance with a given reference video. Our approach requires a single user image and supports a range of tops, bottoms, and one-piece garments, as well as simultaneous tops and bottoms try-on in a single pass. Key to our framework is CondNet, a novel conditioning network that leverages attention to unify multi-modal inputs (text, images, and videos), thereby enhancing garment registration and motion fidelity. CondNet is trained on heterogeneous training data, combining limited video data and a larger, more readily available image dataset, in a multistage progressive manner. DressDance outperforms existing open source and commercial solutions and enables a high quality and flexible try-on experience.

[CV-2] OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

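[Quick Read]: This paper addresses the difficulty of optimizing multi-task generation models under different evaluation criteria, especially when tasks differ significantly in data distribution and evaluation metrics; conventional reliance on task-specific supervised fine-tuning (SFT) limits generalization and training efficiency. The key to the solution is OneReward, a framework that uses a single vision-language model (VLM) as a generative reward model able to distinguish the winner and loser for a given task and evaluation criterion, enabling unified cross-task optimization via reinforcement learning without task-specific SFT. Building on it, the authors develop Seedream 3.0 Fill, a mask-guided generation model that outperforms commercial and open-source competitors on image fill, image extend, object removal, and text rendering.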

Link: https://arxiv.org/abs/2508.21066
Authors: Yuan Gong,Xionghui Wang,Jie Wu,Shiyin Wang,Yitong Wang,Xinglong Wu
Affiliations: ByteDance Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: project url: this https URL

Abstract:In this paper, we introduce OneReward, a unified reinforcement learning framework that enhances the model’s generative capabilities across multiple tasks under different evaluation criteria using only One Reward model. By employing a single vision-language model (VLM) as the generative reward model, which can distinguish the winner and loser for a given task and a given evaluation criterion, it can be effectively applied to multi-task generation models, particularly in contexts with varied data and diverse task objectives. We utilize OneReward for mask-guided image generation, which can be further divided into several sub-tasks such as image fill, image extend, object removal, and text rendering, involving a binary mask as the edit area. Although these domain-specific tasks share the same conditioning paradigm, they differ significantly in underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results demonstrate that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: this https URL

[CV-3] Multi-View 3D Point Tracking ICCV2025

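[Quick Read]: This paper addresses robust 3D tracking of arbitrary points in dynamic scenes, targeting both the degraded performance of monocular trackers under depth ambiguity and occlusion and the high complexity of prior multi-camera methods that require over 20 cameras and per-sequence optimization. The key to the solution is the first data-driven multi-view 3D point tracker, a feed-forward model that directly predicts 3D correspondences with a practical number of cameras (e.g., four) for accurate online tracking. It fuses multi-view features into a unified point cloud and combines k-nearest-neighbors correlation with a transformer-based update to reliably estimate long-range 3D correspondences even under occlusion.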

Link: https://arxiv.org/abs/2508.21060
Authors: Frano Rajič,Haofei Xu,Marko Mihajlovic,Siyuan Li,Irem Demir,Emircan Gündoğdu,Lei Ke,Sergey Prokudin,Marc Pollefeys,Siyu Tang
Affiliations: ETH Zürich; Carnegie Mellon University; Balgrist University Hospital; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025, Oral. Project page: this https URL

Abstract:We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or prior multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Given known camera poses and either sensor-based or estimated multi-view depth, our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks: Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Our method generalizes well to diverse camera setups of 1-8 views with varying vantage points and video lengths of 24-150 frames. By releasing our tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for real-world applications. Project page available at this https URL.
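
The k-nearest-neighbors correlation step can be pictured as follows: for a tracked 3D point, look up its k closest points in the fused point cloud and correlate the track's feature with each neighbor's feature. This is a brute-force sketch under assumptions (dot-product correlation, exhaustive search, hypothetical function and argument names); the paper's actual correlation design is not given in the abstract.

```python
import math


def knn_correlation(query_xyz, query_feat, cloud_xyz, cloud_feat, k):
    """Return [(index, correlation)] for the k cloud points nearest to query_xyz.

    Correlation here is a plain dot product between feature vectors.
    """
    order = sorted(
        range(len(cloud_xyz)),
        key=lambda i: math.dist(query_xyz, cloud_xyz[i]),
    )[:k]
    return [
        (i, sum(a * b for a, b in zip(query_feat, cloud_feat[i])))
        for i in order
    ]
```

In a real tracker the resulting correlation values would feed the update network; an exhaustive sort is used here only to keep the sketch short.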

[CV-4] Mixture of Contexts for Long Video Generation ECAI

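[Quick Read]: This paper addresses the long-term memory problem in long video generation: models must retain and retrieve salient events over a long range without collapsing or drifting. Diffusion transformers are hard to scale to long sequences because self-attention has quadratic cost. The core solution recasts long video generation as an internal information retrieval task and proposes a learnable sparse attention routing module, Mixture of Contexts (MoC), in which each query dynamically selects a few informative chunks plus mandatory anchors (caption, local windows) with causal routing. This keeps identities, actions, and scenes consistent over minutes of content, and the near-linear scaling of sparse routing makes training and synthesis practical.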

Link: https://arxiv.org/abs/2508.21058
Authors: Shengqu Cai,Ceyuan Yang,Lvmin Zhang,Yuwei Guo,Junfei Xiao,Ziyan Yang,Yinghao Xu,Zhenheng Yang,Alan Yuille,Leonidas Guibas,Maneesh Agrawala,Lu Jiang,Gordon Wetzstein
Affiliations: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Long video generation is fundamentally a long context memory problem: models must retain and retrieve salient events across a long range without collapsing or drifting. However, scaling diffusion transformers to generate long-context videos is fundamentally limited by the quadratic cost of self-attention, which makes memory and computation intractable and difficult to optimize for long sequences. We recast long-context video generation as an internal information retrieval task and propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. In MoC, each query dynamically selects a few informative chunks plus mandatory anchors (caption, local windows) to attend to, with causal routing that prevents loop closures. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), which enables practical training and synthesis, and the emergence of memory and consistency at the scale of minutes.
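
The routing idea in the abstract (pick a few informative chunks, always keep the anchors, never look ahead) can be sketched for a single query as below. The scoring rule is an assumption: each chunk is summarized by one descriptor vector (for example, a mean-pooled key) and scored by its dot product with the query. `route_chunks`, its arguments, and the choice of chunk 0 as the caption anchor are hypothetical, not the authors' implementation.

```python
def route_chunks(query, chunk_keys, query_chunk, k, anchors=(0,)):
    """Select the chunk indices one query may attend to.

    chunk_keys[i] is a descriptor vector for chunk i; query_chunk is the index
    of the chunk containing the query (its local window).
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Causal routing: only chunks strictly before the query's own chunk are candidates.
    candidates = [i for i in range(query_chunk) if i not in anchors]
    ranked = sorted(candidates, key=lambda i: dot(query, chunk_keys[i]), reverse=True)
    # Mandatory anchors and the local window are always kept, on top of the top-k picks.
    selected = set(ranked[:k]) | set(anchors) | {query_chunk}
    return sorted(selected)
```

Because each query attends to at most k routed chunks plus a fixed number of anchors, the attention cost per query stays roughly constant as the history grows.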

[CV-5] FakeParts: a New Family of AI-Generated DeepFakes

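[Quick Read]: This paper addresses the weakness of current deepfake detectors on partially manipulated videos (FakeParts). FakeParts are forgeries made by subtle manipulations of specific spatial regions or temporal segments (such as altered facial expressions, object substitutions, and background modifications) that blend seamlessly with real elements and are highly deceptive. To close this detection gap, the authors present FakePartsBench, the first large-scale benchmark dataset designed for partial deepfakes, with over 25K videos carrying pixel-level and frame-level manipulation annotations for comprehensive evaluation of detection methods. User studies show that FakeParts reduces human detection accuracy by over 30% compared with traditional deepfakes, with similar degradation in state-of-the-art detection models, exposing a vulnerability in current detection approaches and providing resources for more robust methods.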

Link: https://arxiv.org/abs/2508.21052
Authors: Gaetan Brison,Soobash Daiboo,Samy Aimeur,Awais Hussain Sani,Xi Wang,Gianni Franchi,Vicky Kalogeiton
Affiliations: Hi!PARIS, Institut Polytechnique de Paris; LIX, École Polytechnique, CNRS, Institut Polytechnique de Paris; U2IS, ENSTA Paris, Institut Polytechnique de Paris
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:We introduce FakeParts, a new class of deepfakes characterized by subtle, localized manipulations to specific spatial regions or temporal segments of otherwise authentic videos. Unlike fully synthetic content, these partial manipulations, ranging from altered facial expressions to object substitutions and background modifications, blend seamlessly with real elements, making them particularly deceptive and difficult to detect. To address the critical gap in detection capabilities, we present FakePartsBench, the first large-scale benchmark dataset specifically designed to capture the full spectrum of partial deepfakes. Comprising over 25K videos with pixel-level and frame-level manipulation annotations, our dataset enables comprehensive evaluation of detection methods. Our user studies demonstrate that FakeParts reduces human detection accuracy by over 30% compared to traditional deepfakes, with similar performance degradation observed in state-of-the-art detection models. This work identifies an urgent vulnerability in current deepfake detection approaches and provides the necessary resources to develop more robust methods for partial video manipulations.

[CV-6] Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning

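[Quick Read]: This paper addresses the limited generalization of current deepfake detectors in practice, in particular the gap between academic benchmarks and industrial practice (homogeneous training sources, low-quality test images) and poor adaptation to unseen model architectures, emerging forgery techniques, and novel data domains. The key to the solution is HydraFake, a dataset that simulates real-world challenges with hierarchical generalization testing, and Veritas, a deepfake detector based on a multi-modal large language model (MLLM). Veritas introduces pattern-aware reasoning, which incorporates reasoning patterns such as "planning" and "self-reflection" to emulate the human forensic process, and a two-stage training pipeline that internalizes these reasoning capacities into current MLLMs, yielding significant gains across out-of-distribution (OOD) scenarios along with transparent detection outputs.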

Link: https://arxiv.org/abs/2508.21048
Authors: Hao Tan,Jun Lan,Zichang Tan,Ajian Liu,Chuanbiao Song,Senyuan Shi,Huijia Zhu,Weiqiang Wang,Jun Wan,Zhen Lei
Affiliations: MAIS, Institute of Automation, Chinese Academy of Sciences; Ant Group; Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project: this https URL

Abstract:Deepfake detection remains a formidable challenge due to the complex and evolving nature of fake content in real-world scenarios. However, existing academic benchmarks suffer from severe discrepancies from industrial practice, typically featuring homogeneous training sources and low-quality testing images, which hinder the practical deployments of current detectors. To mitigate this gap, we introduce HydraFake, a dataset that simulates real-world challenges with hierarchical generalization testing. Specifically, HydraFake involves diversified deepfake techniques and in-the-wild forgeries, along with rigorous training and evaluation protocol, covering unseen model architectures, emerging forgery techniques and novel data domains. Building on this resource, we propose Veritas, a multi-modal large language model (MLLM) based deepfake detector. Different from vanilla chain-of-thought (CoT), we introduce pattern-aware reasoning that involves critical reasoning patterns such as “planning” and “self-reflection” to emulate human forensic process. We further propose a two-stage training pipeline to seamlessly internalize such deepfake reasoning capacities into current MLLMs. Experiments on HydraFake dataset reveal that although previous detectors show great generalization on cross-model scenarios, they fall short on unseen forgeries and data domains. Our Veritas achieves significant gains across different OOD scenarios, and is capable of delivering transparent and faithful detection outputs.

[CV-7] CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing Sparsification

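[Quick Read]: This paper addresses the high computational overhead of post-training Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs), which limits scalability. The core solution is CogVLA, a cognition-aligned VLA framework based on instruction-driven routing and sparsification. First, Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens into an instruction-aware latent representation. Second, LLM-FiLM based Pruning Routing (LFP-Routing) prunes instruction-irrelevant visually grounded tokens in the language model, achieving token-level sparsity. Third, V-L-A Coupled Attention (CAtten) combines causal vision-language attention with bidirectional action parallel decoding so that compressed perception inputs still support accurate, coherent action generation. CogVLA reaches success rates of 97.4% on the LIBERO benchmark and 70.0% on real-world robotic tasks while reducing training costs by 2.5-fold and inference latency by 2.8-fold compared to OpenVLA.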

Link: https://arxiv.org/abs/2508.21046
Authors: Wei Li,Renshan Zhang,Rui Shao,Jie He,Liqiang Nie
Affiliations: Harbin Institute of Technology, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 23 pages, 8 figures, Project Page: this https URL

Abstract:Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at this https URL.
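
EFA-Routing and LFP-Routing are both named after FiLM (feature-wise linear modulation), in which a conditioning vector produces a per-channel scale and shift that is applied to every token. The sketch below shows plain FiLM only; the `linear` helper, the weight layout, and the idea of using an instruction embedding as the conditioning vector are assumptions for illustration, and how CogVLA wires FiLM into its routing is not described in the abstract.

```python
def linear(vec, weights, bias):
    """Dense layer: weights has one row per output channel."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def film(tokens, instruction, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM: each token channel c becomes gamma[c] * x[c] + beta[c].

    gamma and beta are predicted from the instruction embedding, so the same
    visual tokens are modulated differently for different instructions.
    """
    gamma = linear(instruction, w_gamma, b_gamma)
    beta = linear(instruction, w_beta, b_beta)
    return [[g * x + b for x, g, b in zip(tok, gamma, beta)] for tok in tokens]
```

With gamma fixed to ones and beta to zeros the layer is the identity, which is one reason FiLM is convenient to add to a pre-trained encoder.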

[CV-8] MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs

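[Quick Read]: This paper addresses the low computational efficiency of Video Large Language Models (VLLMs) caused by excessive visual tokens. Existing visual token pruning methods improve inference efficiency but do not consider the dynamic characteristics and temporal dependencies of video frames. The key to the solution is MMG-Vid, a training-free visual token pruning framework that maximizes marginal gains at two levels: it first divides the video into segments based on frame similarity and dynamically allocates a token budget to each segment to maximize segment-level marginal gain, then applies a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity to maximize the marginal gain of each token. It maintains over 99.5% of the original performance while removing 75% of visual tokens and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B.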

Link: https://arxiv.org/abs/2508.21044
Authors: Junpeng Ma,Qizhe Zhang,Ming Lu,Zhibin Wang,Qiang Zhou,Jun Song,Shanghang Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures

Abstract:Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to enhance inference efficiency by visual token pruning. However, they do not consider the dynamic characteristics and temporal dependencies of video frames, as they perceive video understanding as a multi-frame task. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both segment-level and token-level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget for each segment to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. By combining both stages, MMG-Vid can maximize the utilization of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid can maintain over 99.5% of the original performance, while effectively reducing 75% visual tokens and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Code will be released soon.
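
The first stage described above (split the video where consecutive frames stop looking alike, then share out a token budget across segments) can be sketched as follows. Two things here are assumptions rather than the paper's method: the cosine-similarity threshold of 0.8, and the length-proportional budget split, which is a simple placeholder for the marginal-gain allocation the abstract mentions but does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def segment_frames(features, threshold=0.8):
    """Group consecutive frame indices; start a new segment when similarity drops."""
    if not features:
        return []
    segments = [[0]]
    for i in range(1, len(features)):
        if cosine(features[i - 1], features[i]) < threshold:
            segments.append([i])
        else:
            segments[-1].append(i)
    return segments


def allocate_budget(segments, total_tokens):
    """Placeholder allocation: split the budget in proportion to segment length."""
    n_frames = sum(len(s) for s in segments)
    budgets = [total_tokens * len(s) // n_frames for s in segments]
    # Hand out tokens lost to integer division, one at a time, starting from the first segment.
    for i in range(total_tokens - sum(budgets)):
        budgets[i % len(budgets)] += 1
    return budgets
```

A gain-aware allocator would replace `allocate_budget`, for example by giving more tokens to segments whose frames are less redundant.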

[CV-9] FW-GAN: Frequency-Driven Handwriting Synthesis with Wave-Modulated MLP Generator

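[Quick Read]: This paper addresses the scarcity of labeled, style-consistent training data for handwriting recognition (HTR) systems, and in particular the limitations of existing synthesis methods in modeling long-range dependencies and their neglect of frequency information. The core solution is FW-GAN, a one-shot handwriting synthesis framework with three key components: 1) a phase-aware Wave-MLP generator that better captures spatial relationships while preserving subtle stylistic cues; 2) a frequency-guided discriminator that uses high-frequency components to improve authenticity detection of generated samples; and 3) a novel Frequency Distribution Loss that aligns the frequency characteristics of synthetic and real handwriting, improving visual fidelity.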

链接: https://arxiv.org/abs/2508.21040
作者: Huynh Tong Dang Khoa,Dang Hoai Nam,Vo Nguyen Le Duy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

Abstract:Labeled handwriting data is often scarce, limiting the effectiveness of recognition systems that require diverse, style-consistent training samples. Handwriting synthesis offers a promising solution by generating artificial data to augment training. However, current methods face two major limitations. First, most are built on conventional convolutional architectures, which struggle to model long-range dependencies and complex stroke patterns. Second, they largely ignore the crucial role of frequency information, which is essential for capturing fine-grained stylistic and structural details in handwriting. To address these challenges, we propose FW-GAN, a one-shot handwriting synthesis framework that generates realistic, writer-consistent text from a single example. Our generator integrates a phase-aware Wave-MLP to better capture spatial relationships while preserving subtle stylistic cues. We further introduce a frequency-guided discriminator that leverages high-frequency components to enhance the authenticity detection of generated samples. Additionally, we introduce a novel Frequency Distribution Loss that aligns the frequency characteristics of synthetic and real handwriting, thereby enhancing visual fidelity. Experiments on Vietnamese and English handwriting datasets demonstrate that FW-GAN generates high-quality, style-consistent handwriting, making it a valuable tool for augmenting data in low-resource handwriting recognition (HTR) pipelines. Official implementation is available at this https URL

[CV-10] A multi-task neural network for atypical mitosis recognition under domain shift

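[Quick read]: This paper addresses the marked performance drop that machine learning models suffer under domain shift when recognizing atypical mitotic figures in histopathology images. The key to its solution is a multi-task learning strategy: auxiliary tasks correlated with the main classification task guide the model to focus on the object to be classified and ignore domain-varying background information, which improves generalization across data distributions.

Link: https://arxiv.org/abs/2508.21035
Authors: Gennaro Percannella,Mattia Sarno,Francesco Tortorella,Mario Vento
Affiliations: University of Salerno; Department of Information and Electrical Engineering and Applied Mathematics (DIEM)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Approach for MIDOG25 track 2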

Abstract:Recognizing atypical mitotic figures in histopathology images allows physicians to correctly assess tumor aggressiveness. Although machine learning models could be exploited for automatically performing such a task, under domain shift these models suffer from significant performance drops. In this work, an approach based on multi-task learning is proposed for addressing this problem. By exploiting auxiliary tasks correlated to the main classification task, the proposed approach, submitted to track 2 of the MItosis DOmain Generalization (MIDOG) challenge, aims to help the model focus only on the object to classify, ignoring the domain-varying background of the image. The proposed approach shows promising performance in a preliminary evaluation conducted on three distinct datasets, i.e., the MIDOG 2025 Atypical Training Set, the Ami-Br dataset, and the preliminary test set of the MIDOG25 challenge.

[CV-11] Mitosis detection in domain shift scenarios: a Mamba-based approach

[Quick read]: This paper addresses mitosis detection in histopathology images, where model performance drops significantly under domain shift. The key to its solution is a Mamba-based VM-UNet architecture combined with stain augmentation operations to improve robustness across data distributions. The approach was submitted to Track 1 of the MItosis DOmain Generalization (MIDOG) challenge; preliminary experiments on the MIDOG++ dataset show that the method still has large room for improvement.

Link: https://arxiv.org/abs/2508.21033
Authors: Gennaro Percannella,Mattia Sarno,Francesco Tortorella,Mario Vento
Affiliations: University of Salerno; Department of Information and Electrical Engineering and Applied Mathematics (DIEM)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Approach for MIDOG 2025 track 1

Abstract:Mitosis detection in histopathology images plays a key role in tumor assessment. Although machine learning algorithms could be exploited for aiding physicians in accurately performing such a task, these algorithms suffer from a significant performance drop when evaluated on images coming from domains that are different from the training ones. In this work, we propose a Mamba-based approach for mitosis detection under domain shift, inspired by the promising performance demonstrated by Mamba in medical imaging segmentation tasks. Specifically, our approach exploits a VM-UNet architecture for carrying out the addressed task, as well as stain augmentation operations for further improving model robustness against domain shift. Our approach has been submitted to track 1 of the MItosis DOmain Generalization (MIDOG) challenge. Preliminary experiments, conducted on the MIDOG++ dataset, show large room for improvement for the proposed method.

[CV-12] Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets ICCV2025

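[Quick read]: This paper addresses the high computational cost of generating high-quality images with text-to-image diffusion models, in particular the redundant computation across multiple correlated prompts. The key to its solution is the coarse-to-fine nature of diffusion models: in early denoising steps, similar prompts share structural information, so prompts can be clustered by semantic similarity and computation can be shared during the early diffusion steps, reducing overall compute. The method is training-free and integrates with existing generation pipelines. By leveraging UnClip's text-to-image prior, it also improves diffusion step allocation, which lowers the environmental and financial cost while improving image quality.

Link: https://arxiv.org/abs/2508.21032
Authors: Dale Decatur,Thibault Groueix,Wang Yifan,Rana Hanocka,Vladimir Kim,Matheus Gadelha
Affiliations: University of Chicago; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Project page: this https URL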

Abstract:Text-to-image diffusion models enable high-quality image generation but are computationally expensive. While prior work optimizes per-inference efficiency, we explore an orthogonal approach: reducing redundancy across correlated prompts. Our method leverages the coarse-to-fine nature of diffusion models, where early denoising steps capture shared structures among similar prompts. We propose a training-free approach that clusters prompts based on semantic similarity and shares computation in early diffusion steps. Experiments show that for models trained conditioned on image embeddings, our approach significantly reduces compute cost while improving image quality. By leveraging UnClip’s text-to-image prior, we enhance diffusion step allocation for greater efficiency. Our method seamlessly integrates with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation. Project page: this https URL
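
As a rough illustration of why sharing early steps saves compute, the sketch below counts denoiser evaluations with and without sharing. The cluster sizes, step counts, and the assumption that every prompt in a cluster shares exactly the same number of early steps are hypothetical simplifications, not the paper's allocation scheme.

```python
def denoising_steps(cluster_sizes, total_steps, shared_steps):
    """Count denoiser evaluations with and without shared early steps.

    Assumes each cluster runs its first `shared_steps` steps once, and
    every prompt then runs the remaining steps on its own.
    """
    if not 0 <= shared_steps <= total_steps:
        raise ValueError("shared_steps must lie in [0, total_steps]")
    independent = sum(cluster_sizes) * total_steps
    shared = sum(
        shared_steps + n * (total_steps - shared_steps) for n in cluster_sizes
    )
    return independent, shared


# Hypothetical setting: 10 prompts in 3 clusters, 50 steps, 20 of them shared.
independent, shared = denoising_steps([4, 4, 2], total_steps=50, shared_steps=20)
saving = 1 - shared / independent  # fraction of evaluations avoided
```

Under these assumed numbers the shared schedule needs 360 evaluations instead of 500; larger clusters and more shared steps increase the saving.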

[CV-13] POSE: Phased One-Step Adversarial Equilibrium for Video Diffusion Models

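[Quick read]: This paper targets the sampling-efficiency bottleneck of video diffusion generation for large-scale models and long video sequences, and proposes POSE (Phased One-Step Equilibrium), a distillation framework. Its core solution relies on two mechanisms that enable high-quality single-step video generation: (1) stability priming, a warm-up mechanism that stabilizes adversarial distillation so that the one-step generator keeps a stable trajectory when moving from high to low signal-to-noise ratio regimes, optimizing video quality near the trajectory endpoints; and (2) unified adversarial equilibrium, a flexible self-adversarial distillation mechanism that drives single-step adversarial training toward a Nash equilibrium in the Gaussian noise space, producing single-step outputs close to real videos. For conditional video generation, it additionally introduces conditional adversarial consistency to improve semantic and frame consistency. Experiments show that POSE improves semantic alignment, temporal coherence, and frame quality on the VBench-I2V benchmark by an average of 7.15%, while cutting the latency of the pre-trained model from 1000 seconds to 10 seconds (a 100x speedup).

Link: https://arxiv.org/abs/2508.21019
Authors: Jiaxiang Cheng,Bing Ma,Xuhua Ren,Hongyi Jin,Kai Yu,Peng Zhang,Wenyue Li,Yuan Zhou,Tianxiang Zheng,Qinglin Lu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL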

Abstract:The field of video diffusion generation faces critical bottlenecks in sampling efficiency, especially for large-scale models and long sequences. Existing video acceleration methods adopt image-based techniques but suffer from fundamental limitations: they neither model the temporal coherence of video frames nor provide single-step distillation for large-scale video models. To bridge this gap, we propose POSE (Phased One-Step Equilibrium), a distillation framework that reduces the sampling steps of large-scale video diffusion models, enabling the generation of high-quality videos in a single step. POSE employs a carefully designed two-phase process to distill video models: (i) stability priming: a warm-up mechanism to stabilize adversarial distillation that adapts the high-quality trajectory of the one-step generator from high to low signal-to-noise ratio regimes, optimizing the video quality of single-step mappings near the endpoints of flow trajectories. (ii) unified adversarial equilibrium: a flexible self-adversarial distillation mechanism that promotes stable single-step adversarial training towards a Nash equilibrium within the Gaussian noise space, generating realistic single-step videos close to real videos. For conditional video generation, we propose (iii) conditional adversarial consistency, a method to improve both semantic consistency and frame consistency between conditional frames and generated frames. Comprehensive experiments demonstrate that POSE outperforms other acceleration methods on VBench-I2V by an average of 7.15% in semantic alignment, temporal coherence and frame quality, reducing the latency of the pre-trained model by 100×, from 1000 seconds to 10 seconds, while maintaining competitive performance.

[CV-14] ExpertSim: Fast Particle Detector Simulation Using Mixture-of-Generative-Experts ECAI2025

[Quick read]: This paper addresses the high computational cost of simulating particle detector responses at the Large Hadron Collider (LHC). Traditional statistical Monte Carlo methods are accurate but inefficient and place a heavy load on CERN's computing resources. The key to the solution is ExpertSim, a deep learning simulation method tailored to the Zero Degree Calorimeter of the ALICE experiment. Its core innovation is a Mixture-of-Generative-Experts architecture in which each expert specializes in simulating one subset of the data, which improves generation efficiency while maintaining high accuracy.

Link: https://arxiv.org/abs/2508.20991
Authors: Patryk Będkowski,Jan Dubiński,Filip Szatkowski,Kamil Deja,Przemysław Rokita,Tomasz Trzciński
Affiliations: Warsaw University of Technology; NASK National Research Institute; IDEAS NCBR; IDEAS Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ECAI 2025 28th European Conference on Artificial Intelligence

Abstract:Simulating detector responses is a crucial part of understanding the inner workings of particle collisions in the Large Hadron Collider at CERN. Such simulations are currently performed with statistical Monte Carlo methods, which are computationally expensive and put a significant strain on CERN’s computational grid. Therefore, recent proposals advocate for generative machine learning methods to enable more efficient simulations. However, the distribution of the data varies significantly across the simulations, which is hard to capture with out-of-the-box methods. In this study, we present ExpertSim - a deep learning simulation approach tailored for the Zero Degree Calorimeter in the ALICE experiment. Our method utilizes a Mixture-of-Generative-Experts architecture, where each expert specializes in simulating a different subset of the data. This allows for a more precise and efficient generation process, as each expert focuses on a specific aspect of the calorimeter response. ExpertSim not only improves accuracy, but also provides a significant speedup compared to the traditional Monte-Carlo methods, offering a promising solution for high-efficiency detector simulations in particle physics experiments at CERN. We make the code available at this https URL.

[CV-15] Webly-Supervised Image Manipulation Localization via Category-Aware Auto-Annotation

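[Quick read]: This paper addresses the limited model performance in image manipulation localization caused by the scarcity of high-quality annotated data. The key to its solution is CAAAv2, a new auto-annotation paradigm that generates pixel-level mask annotations for the large number of manually forged images available on the web, combined with a novel quality metric, QES, that filters for highly reliable annotations. This yields MIMLv2, a dataset of 246,212 images that is more than 120 times larger than the existing handcrafted dataset IMD20. The paper also introduces Object Jitter to make training samples more realistic and builds the Web-IML model on top, which significantly improves localization accuracy on multiple real-world forgery benchmarks and surpasses the previous best method, TruFor, by 24.1 average IoU points.

Link: https://arxiv.org/abs/2508.20987
Authors: Chenfan Qu,Yiwu Zhong,Bin Li,Lianwen Jin
Affiliations: South China University of Technology; Peking University; Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: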

Abstract:Images manipulated using image editing tools can mislead viewers and pose significant risks to social security. However, accurately localizing the manipulated regions within an image remains a challenging problem. One of the main barriers in this area is the high cost of data acquisition and the severe lack of high-quality annotated datasets. To address this challenge, we introduce novel methods that mitigate data scarcity by leveraging readily available web data. We utilize a large collection of manually forged images from the web, as well as automatically generated annotations derived from a simpler auxiliary task, constrained image manipulation localization. Specifically, we introduce a new paradigm CAAAv2, which automatically and accurately annotates manipulated regions at the pixel level. To further improve annotation quality, we propose a novel metric, QES, which filters out unreliable annotations. Through CAAAv2 and QES, we construct MIMLv2, a large-scale, diverse, and high-quality dataset containing 246,212 manually forged images with pixel-level mask annotations. This is over 120x larger than existing handcrafted datasets like IMD20. Additionally, we introduce Object Jitter, a technique that further enhances model training by generating high-quality manipulation artifacts. Building on these advances, we develop a new model, Web-IML, designed to effectively leverage web-scale supervision for the image manipulation localization task. Extensive experiments demonstrate that our approach substantially alleviates the data scarcity problem and significantly improves the performance of various models on multiple real-world forgery benchmarks. With the proposed web supervision, Web-IML achieves a striking performance gain of 31% and surpasses previous SOTA TruFor by 24.1 average IoU points. The dataset and code will be made publicly available at this https URL.
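
Since the headline result is reported in IoU points, a minimal sketch of intersection-over-union for binary masks may help. The list-of-lists mask format and the convention for two empty masks are assumptions made for illustration; the paper's evaluation code may differ.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as equally sized 2D lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are marked in at least one mask.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
iou = mask_iou(pred, target)
```

A gain of 24.1 IoU points then means the average of this ratio, expressed as a percentage, rose by 24.1.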

[CV-16] ActLoc: Learning to Localize on the Move via Active Viewpoint Selection

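[Quick read]: This paper addresses insufficient localization reliability in robot navigation, in particular failures that arise because existing systems assume all viewing directions at a location are equally informative (for example, when the robot observes unmapped, ambiguous, or uninformative regions). The key to its solution is ActLoc, an active viewpoint-aware planning framework. At its core is an attention-based model trained at large scale for viewpoint selection: the model encodes a metric map and the camera poses used during mapping, predicts the distribution of localization accuracy over yaw and pitch directions at arbitrary 3D locations, and feeds these distributions into a path planner so that the robot can actively choose camera orientations that maximize localization robustness while respecting task and motion constraints.

Link: https://arxiv.org/abs/2508.20981
Authors: Jiajie Li,Boyang Sun,Luca Di Giammarino,Hermann Blum,Marc Pollefeys
Affiliations: ETH Zürich; Sapienza University of Rome; Microsoft; University of Bonn
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: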

Abstract:Reliable localization is critical for robot navigation, yet most existing systems implicitly assume that all viewing directions at a location are equally informative. In practice, localization becomes unreliable when the robot observes unmapped, ambiguous, or uninformative regions. To address this, we present ActLoc, an active viewpoint-aware planning framework for enhancing localization accuracy for general robot navigation tasks. At its core, ActLoc employs a large-scale trained attention-based model for viewpoint selection. The model encodes a metric map and the camera poses used during map construction, and predicts localization accuracy across yaw and pitch directions at arbitrary 3D locations. These per-point accuracy distributions are incorporated into a path planner, enabling the robot to actively select camera orientations that maximize localization robustness while respecting task and motion constraints. ActLoc achieves state-of-the-art results on single-viewpoint selection and generalizes effectively to full-trajectory planning. Its modular design makes it readily applicable to diverse robot navigation and inspection tasks.

[CV-17] DrivingGaussian++: Towards Realistic Reconstruction and Editable Simulation for Surrounding Dynamic Driving Scenes

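[Quick read]: This paper addresses high-fidelity reconstruction and controllable editing of dynamic environments in autonomous driving scenes, especially how to manipulate dynamic objects flexibly while preserving geometric accuracy and visual realism. The key to its solution is the DrivingGaussian++ framework, whose core innovations are: modeling the static background with incremental 3D Gaussians and representing moving objects and their occlusions precisely with a composite dynamic Gaussian graph; introducing a LiDAR prior to improve reconstruction detail and consistency; using multi-view images and depth priors for training-free controllable editing such as texture modification, weather simulation, and object manipulation; and integrating large language models (LLMs) to generate motion trajectories for dynamic objects automatically and enhance their realism during optimization, which significantly increases scene diversity and editing realism.

Link: https://arxiv.org/abs/2508.20965
Authors: Yajiao Xiong,Xiaoyu Zhou,Yongtao Wan,Deqing Sun,Ming-Hsuan Yang
Affiliations: Peking University; Google DeepMind; University of California, Merced
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: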

Abstract:We present DrivingGaussian++, an efficient and effective framework for realistic reconstruction and controllable editing of surrounding dynamic autonomous driving scenes. DrivingGaussian++ models the static background using incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph, ensuring accurate positions and occlusions. By integrating a LiDAR prior, it achieves detailed and consistent scene reconstruction, outperforming existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis. DrivingGaussian++ supports training-free controllable editing for dynamic driving scenes, including texture modification, weather simulation, and object manipulation, leveraging multi-view images and depth priors. By integrating large language models (LLMs) and controllable editing, our method can automatically generate dynamic object motion trajectories and enhance their realism during the optimization process. DrivingGaussian++ demonstrates consistent and realistic editing results and generates dynamic multi-view driving scenarios, while significantly enhancing scene diversity. More results and code can be found at the project site: this https URL

[CV-18] E-ConvNeXt: A Lightweight and Efficient ConvNeXt Variant with Cross-Stage Partial Connections

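[Quick read]: This paper addresses the poor fit of high-performance networks to lightweight application scenarios: many existing architectures were not designed from the outset with parameter count and computational complexity in mind, which limits deployment in resource-constrained settings. The key to its solution is a new network, E-ConvNeXt, with three core innovations: (1) integrating the Cross Stage Partial Connections (CSPNet) mechanism into ConvNeXt and restructuring the network, reducing model complexity by up to 80%; (2) optimizing the Stem and Block modules to improve feature expression capability and operational efficiency; and (3) replacing Layer Scale with channel attention to further improve the balance between accuracy and efficiency. Experiments show that E-ConvNeXt achieves a favorable accuracy-efficiency trade-off on ImageNet classification and transfers well to downstream tasks such as object detection.

Link: https://arxiv.org/abs/2508.20955
Authors: Fang Wang,Huitao Li,Wenhan Chao,Zheng Zhuo,Yiran Ji,Chang Peng,Yupeng Sun
Affiliations: Beijing Institute of Petrochemical Technology; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: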

Abstract:Many high-performance networks were not designed with lightweight application scenarios in mind from the outset, which has greatly restricted their scope of application. This paper takes ConvNeXt as the research object and significantly reduces the parameter scale and network complexity of ConvNeXt by integrating the Cross Stage Partial Connections mechanism and a series of optimized designs. The new network is named E-ConvNeXt, which can maintain high accuracy performance under different complexity configurations. The three core innovations of E-ConvNeXt are: (1) integrating the Cross Stage Partial Network (CSPNet) with ConvNeXt and adjusting the network structure, which reduces the model’s network complexity by up to 80%; (2) optimizing the Stem and Block structures to enhance the model’s feature expression capability and operational efficiency; (3) replacing Layer Scale with channel attention. Experimental validation on ImageNet classification demonstrates E-ConvNeXt’s superior accuracy-efficiency balance: E-ConvNeXt-mini reaches 78.3% Top-1 accuracy at 0.9 GFLOPs, and E-ConvNeXt-small reaches 81.9% Top-1 accuracy at 3.1 GFLOPs. Transfer learning tests on object detection tasks further confirm its generalization capability.

[CV-19] Olive Tree Satellite Image Segmentation Based On SAM and Multi-Phase Refinement

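[Quick read]: This paper addresses the challenge of preserving olive tree biodiversity under climate change, in particular using remote sensing for early anomaly detection and treatment to make agricultural management more effective. Its core solution uses foundation models and advanced segmentation techniques, notably the Segment Anything Model (SAM), and corrects the segmentation results using the regular alignment of trees in the field and a learnable constraint on shape and size. This raises olive tree identification accuracy to 98%, well above the 82% of the initial SAM.

Link: https://arxiv.org/abs/2508.20954
Authors: Amir Jmal,Chaima Chtourou,Mahdi Louati,Abdelaziz Kallel,Houda Khmila
Affiliations: National School of Electronics and Telecomms of Sfax; SMARTS Laboratory; Digital Research Center of Sfax; Sofrecom Tunisia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: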

Abstract:In the context of proven climate change, maintaining olive biodiversity through early anomaly detection and treatment using remote sensing technology is crucial, offering effective management solutions. This paper presents an innovative approach to olive tree segmentation from satellite images. By leveraging foundational models and advanced segmentation techniques, the study integrates the Segment Anything Model (SAM) to accurately identify and segment olive trees in agricultural plots. The methodology includes SAM segmentation and corrections based on the alignment of trees in the field and a learnable constraint on shape and size. Our approach achieved a 98% accuracy rate, significantly surpassing the initial SAM performance of 82%.

[CV-20] COMETH: Convex Optimization for Multiview Estimation and Tracking of Humans

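[Quick read]: This paper addresses accuracy, real-time performance, and scalability in multi-view human pose estimation and tracking in the Industry 5.0 era. Centralized multi-camera systems improve pose estimation accuracy but incur high computational cost and bandwidth requirements, which limits industrial deployment; distributed processing on edge devices reduces resource consumption, but limited compute degrades accuracy and the distribution introduces temporal and spatial inconsistencies. The key to the solution is COMETH (Convex Optimization for Multiview Estimation and Tracking of Humans), whose core ideas are: integrating kinematic and biomechanical constraints to improve joint positioning accuracy; using convex optimization-based inverse kinematics for spatial fusion; and using a state observer to improve temporal consistency. The method outperforms state-of-the-art algorithms on public and industrial datasets, providing accurate and scalable real-time human motion tracking suited to industrial and safety-critical applications.

Link: https://arxiv.org/abs/2508.20920
Authors: Enrico Martini,Ho Jin Choi,Nadia Figueroa,Nicola Bombieri
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Submitted to Information Fusion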

Abstract:In the era of Industry 5.0, monitoring human activity is essential for ensuring both ergonomic safety and overall well-being. While multi-camera centralized setups improve pose estimation accuracy, they often suffer from high computational costs and bandwidth requirements, limiting scalability and real-time applicability. Distributing processing across edge devices can reduce network bandwidth and computational load. On the other hand, the constrained resources of edge devices lead to accuracy degradation, and the distribution of computation leads to temporal and spatial inconsistencies. We address this challenge by proposing COMETH (Convex Optimization for Multiview Estimation and Tracking of Humans), a lightweight algorithm for real-time multi-view human pose fusion that relies on three concepts: it integrates kinematic and biomechanical constraints to increase the joint positioning accuracy; it employs convex optimization-based inverse kinematics for spatial fusion; and it implements a state observer to improve temporal consistency. We evaluate COMETH on both public and industrial datasets, where it outperforms state-of-the-art methods in localization, detection, and tracking accuracy. The proposed fusion pipeline enables accurate and scalable human motion tracking, making it well-suited for industrial and safety-critical applications. The code is publicly available at this https URL.

[CV-21] Classifying Mitotic Figures in the MIDOG25 Challenge with Deep Ensemble Learning and Rule Based Refinement MICCAI

[Quick read]: This paper addresses the difficulty of distinguishing atypical mitotic figures (AMFs) from normal mitotic figures (NMFs) in tumor grading, a task that traditionally relies on manual annotation, which is time-consuming and subjective. The key to its solution is a deep ensemble of ConvNeXtBase models extended with a rule-based refinement (RBR) module. On the MIDOG25 preliminary test set, the ensemble reached a balanced accuracy of 84.02%. RBR increased specificity but reduced sensitivity, which suggests that it can improve specific metrics but needs further research to balance overall performance.

Link: https://arxiv.org/abs/2508.20919
Authors: Sara Krauss,Ellena Spieß,Daniel Hieber,Frank Kramer,Johannes Schobel,Dominik Müller
Affiliations: IT-Infrastructure for Translational Medical Research, Faculty of Applied Computer Science, University of Augsburg, Germany; Department of Neuropathology, Pathology, Medical Faculty, University of Augsburg, Germany; DigiHealth Institute, Neu-Ulm University of Applied Sciences, Germany; Institute of Medical Data Science, University Hospital Würzburg, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submission as part of the MICCAI MIDOG25 challenge

Abstract:Mitotic figures (MFs) are relevant biomarkers in tumor grading. Differentiating atypical MFs (AMFs) from normal MFs (NMFs) remains difficult, as manual annotation is time-consuming and subjective. In this work, an ensemble of ConvNeXtBase models was trained with AUCMEDI and extended with a rule-based refinement (RBR) module. On the MIDOG25 preliminary test set, the ensemble achieved a balanced accuracy of 84.02%. While the RBR increased specificity, it reduced sensitivity and overall performance. The results show that deep ensembles perform well for AMF classification. RBR can increase specific metrics but requires further research.
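
Balanced accuracy for a binary task is the mean of sensitivity and specificity, which explains how a refinement step can raise specificity and still lower the overall score. The confusion-matrix counts below are hypothetical and are not taken from the paper.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (recall on positives) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical counts, 100 positives and 100 negatives in both cases.
before = balanced_accuracy(tp=80, fn=20, tn=90, fp=10)  # sens 0.80, spec 0.90
after = balanced_accuracy(tp=70, fn=30, tn=95, fp=5)    # sens 0.70, spec 0.95
```

In this made-up case specificity rises by 5 points while sensitivity falls by 10, so balanced accuracy drops, which is the same direction of trade-off the abstract reports for RBR.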

[CV-22] Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation

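[Quick read]: This paper addresses how to transfer the representations of vision foundation models pre-trained on large-scale natural images to medical image segmentation effectively, in order to reach the precision needed for clinical applications. The key to its solution is the Dino U-Net architecture: its encoder is built on a frozen DINOv3 backbone and uses a specialized adapter to fuse the model's rich semantic features with low-level spatial details; in addition, a fidelity-aware projection module (FAPM) preserves feature quality during dimensionality reduction and refines the features passed to the decoder, which significantly improves segmentation accuracy and scalability.

Link: https://arxiv.org/abs/2508.20909
Authors: Yifan Gao,Haoyue Li,Feng Yuan,Xiaosong Wang,Xin Gao
Affiliations: University of Science and Technology of China; Shanghai Innovation Institute; Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: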

Abstract:Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model’s rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at this https URL.

[CV-23] To New Beginnings: A Survey of Unified Perception in Autonomous Vehicle Software

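[Quick read]: This paper addresses error accumulation and limited inter-task synergy in traditional modular perception pipelines for autonomous driving. Such pipelines decompose perception into separate detection, tracking, and prediction steps; they are interpretable, but errors in each stage are hard to control and the stages interact little. The core of the proposed solution is the unified perception paradigm, which integrates these sub-tasks within a shared architecture to improve robustness, contextual reasoning, and computational efficiency while keeping outputs interpretable. The paper further defines three paradigms (Early, Late, and Full Unified Perception) and builds a systematic taxonomy covering task integration, tracking formulation, and representation flow, providing a structured framework and practical guidance for future research.

Link: https://arxiv.org/abs/2508.20892
Authors: Loïc Stratil,Felix Fent,Esteban Rivera,Markus Lienkamp
Affiliations: Technical University of Munich; School of Engineering & Design, Department of Mobility Systems Engineering, Institute of Automotive Technology and Munich Institute of Robotics and Machine Intelligence (MIRMI)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: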

Abstract:Autonomous vehicle perception typically relies on modular pipelines that decompose the task into detection, tracking, and prediction. While interpretable, these pipelines suffer from error accumulation and limited inter-task synergy. Unified perception has emerged as a promising paradigm that integrates these sub-tasks within a shared architecture, potentially improving robustness, contextual reasoning, and efficiency while retaining interpretable outputs. In this survey, we provide a comprehensive overview of unified perception, introducing a holistic and systemic taxonomy that categorizes methods along task integration, tracking formulation, and representation flow. We define three paradigms (Early, Late, and Full Unified Perception) and systematically review existing methods, their architectures, training strategies, datasets used, and open-source availability, while highlighting future research directions. This work establishes the first comprehensive framework for understanding and advancing unified perception, consolidates fragmented efforts, and guides future research toward more robust, generalizable, and interpretable perception.

[CV-24] Understanding and evaluating computer vision models through the lens of counterfactuals

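[Quick read]: This thesis addresses bias in vision classifiers and generative Text-to-Image (TTI) models, and the lack of a unified, interpretable, and actionable causal framework for evaluating and mitigating that bias. The core challenge is to identify whether models rely on spurious correlations and to quantify how attributes such as race, gender, and age affect decisions or generated outputs, in order to improve fairness and robustness. The key to the solution is counterfactual reasoning: systematically changing semantically meaningful input attributes (such as background or identity features) while holding other variables fixed, which gives an interpretable path for causal analysis. The methods are: for discriminative models, CAVLI (combining attribution with concept-level analysis) and ASAC (fine-tuning with adversarial counterfactuals), which expose reliance on irrelevant cues and improve fairness; for generative models, TIBET (prompt-sensitive bias auditing), BiasConnect (causal graphs diagnosing intersectional bias), and InterMit (a modular, training-free bias mitigation algorithm), which model and intervene on multi-dimensional bias causally. Overall, the work uses counterfactuals as a unifying lens to establish a bias evaluation and mitigation approach that is interpretable, causally rigorous, and scalable.

Link: https://arxiv.org/abs/2508.20881
Authors: Pushkar Shukla
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: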

Abstract:Counterfactual reasoning, the practice of asking "what if" by varying inputs and observing changes in model behavior, has become central to interpretable and fair AI. This thesis develops frameworks that use counterfactuals to explain, audit, and mitigate bias in vision classifiers and generative models. By systematically altering semantically meaningful attributes while holding others fixed, these methods uncover spurious correlations, probe causal dependencies, and help build more robust systems. The first part addresses vision classifiers. CAVLI integrates attribution (LIME) with concept-level analysis (TCAV) to quantify how strongly decisions rely on human-interpretable concepts. With localized heatmaps and a Concept Dependency Score, CAVLI shows when models depend on irrelevant cues like backgrounds. Extending this, ASAC introduces adversarial counterfactuals that perturb protected attributes while preserving semantics. Through curriculum learning, ASAC fine-tunes biased models for improved fairness and accuracy while avoiding stereotype-laden artifacts. The second part targets generative Text-to-Image (TTI) models. TIBET provides a scalable pipeline for evaluating prompt-sensitive biases by varying identity-related terms, enabling causal auditing of how race, gender, and age affect image generation. To capture interactions, BiasConnect builds causal graphs diagnosing intersectional biases. Finally, InterMit offers a modular, training-free algorithm that mitigates intersectional bias via causal sensitivity scores and user-defined fairness goals. Together, these contributions show counterfactuals as a unifying lens for interpretability, fairness, and causality in both discriminative and generative models, establishing principled, scalable methods for socially responsible bias evaluation and mitigation.
Submission: arXiv:2508.20881 [cs.CV], v1 submitted by Pushkar Shukla on Thu, 28 Aug 2025 15:11:49 UTC (18,305 KB)

[CV-25] Deep Learning Framework for Early Detection of Pancreatic Cancer Using Multi-Modal Medical Imaging Analysis

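【Quick read】: This paper addresses the difficulty of diagnosing pancreatic ductal adenocarcinoma (PDAC) early; its five-year survival rate is below 10%, mainly because it is usually detected at a late stage. The key to the solution is building and validating a deep learning framework based on dual-modality imaging (autofluorescence and second harmonic generation, SHG) that analyzes tissue images to automatically distinguish normal, fibrotic, and cancerous tissue. The study uses a modified ResNet architecture with frozen pre-trained layers and class-weighted training, achieving over 90% cancer detection accuracy despite limited samples and class imbalance. The authors report that this is a significant improvement over current manual analysis, offers a feasible path to clinical deployment, and lays a foundation for automated detection of other cancer types.

Link: https://arxiv.org/abs/2508.20877
Authors: Dennis Slobodzian, Karissa Tilbury, Amir Kordijazi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 17 figures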

Abstract:Pancreatic ductal adenocarcinoma (PDAC) remains one of the most lethal forms of cancer, with a five-year survival rate below 10% primarily due to late detection. This research develops and validates a deep learning framework for early PDAC detection through analysis of dual-modality imaging: autofluorescence and second harmonic generation (SHG). We analyzed 40 unique patient samples to create a specialized neural network capable of distinguishing between normal, fibrotic, and cancerous tissue. Our methodology evaluated six distinct deep learning architectures, comparing traditional Convolutional Neural Networks (CNNs) with modern Vision Transformers (ViTs). Through systematic experimentation, we identified and overcame significant challenges in medical image analysis, including limited dataset size and class imbalance. The final optimized framework, based on a modified ResNet architecture with frozen pre-trained layers and class-weighted training, achieved over 90% accuracy in cancer detection. This represents a significant improvement over current manual analysis methods and demonstrates potential for clinical deployment. This work establishes a robust pipeline for automated PDAC detection that can augment pathologists’ capabilities while providing a foundation for future expansion to other cancer types. The developed methodology also offers valuable insights for applying deep learning to limited-size medical imaging datasets, a common challenge in clinical applications.
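
The abstract mentions class-weighted training as one remedy for class imbalance. The sketch below shows one common way to do this, inverse-frequency weights applied to a negative log-likelihood loss. The weighting formula, the function names, and the label counts are illustrative assumptions, not the authors' actual configuration.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_nll(probs, labels, weights):
    """Class-weighted negative log-likelihood, averaged by total weight.

    probs: one dict per sample mapping class name -> predicted probability.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Hypothetical label counts, chosen only to illustrate imbalance.
labels = ["normal"] * 60 + ["fibrotic"] * 30 + ["cancerous"] * 10
weights = inverse_frequency_weights(labels)
print(weights)  # the rarest class receives the largest weight
```

With weights like these, a mistake on a rare class contributes more to the loss than a mistake on a common one, which counteracts the pull toward the majority class.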

[CV-26] PathMR: Multimodal Visual Reasoning for Interpretable Pathology Diagnosis

【Quick read】: This paper addresses the limited clinical adoption of deep learning in pathology diagnosis, which stems from opaque model decisions and the lack of a traceable rationale. The authors propose PathMR, a cell-level multimodal visual reasoning framework. Its key feature is that, given a pathology image and a textual query, it generates expert-level diagnostic explanations while simultaneously predicting cell distribution patterns, combining pixel-level lesion localization with semantically aligned textual explanations and thereby improving the interpretability and trustworthiness of AI-assisted pathology diagnosis.

Link: https://arxiv.org/abs/2508.20851
Authors: Ye Zhang, Yu Zhou, Jingwen Qi, Yongbing Zhang, Simon Puettmann, Finn Wichmann, Larissa Pereira Ferreira, Lara Sichward, Julius Keyl, Sylvia Hartmann, Shuo Zhao, Hongxiao Wang, Xiaowei Xu, Jianxu Chen
Affiliations: Harbin Institute of Technology, Shenzhen; Leibniz-Institut für Analytische Wissenschaften – ISAS – e.V.; Sun Yat-sen University, The Sixth Affiliated Hospital; University Hospital Essen; Institute for Artificial Intelligence in Medicine; Capital Normal University, Academy for Multidisciplinary Studies; Guangdong Provincial People’s Hospital, Guangdong Academy of Medical Sciences, Southern Medical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning based automated pathological diagnosis has markedly improved diagnostic efficiency and reduced variability between observers, yet its clinical adoption remains limited by opaque model decisions and a lack of traceable rationale. To address this, recent multimodal visual reasoning architectures provide a unified framework that generates segmentation masks at the pixel level alongside semantically aligned textual explanations. By localizing lesion regions and producing expert style diagnostic narratives, these models deliver the transparent and interpretable insights necessary for dependable AI assisted pathology. Building on these advancements, we propose PathMR, a cell-level Multimodal visual Reasoning framework for Pathological image analysis. Given a pathological image and a textual query, PathMR generates expert-level diagnostic explanations while simultaneously predicting cell distribution patterns. To benchmark its performance, we evaluated our approach on the publicly available PathGen dataset as well as on our newly developed GADVR dataset. Extensive experiments on these two datasets demonstrate that PathMR consistently outperforms state-of-the-art visual reasoning methods in text generation quality, segmentation accuracy, and cross-modal alignment. These results highlight the potential of PathMR for improving interpretability in AI-driven pathological diagnosis. The code will be publicly available in this https URL.

[CV-27] PointDGRWKV: Generalizing RWKV-like Architecture to Unseen Domains for Point Cloud Classification

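【Quick read】: This paper addresses the limited cross-domain generalization of Point Cloud Classification (PCC) models. Existing methods built on convolutional networks, Transformers, or Mamba suffer from limited receptive fields, high computational cost, or insufficient long-range dependency modeling. Applying RWKV (Receptance Weighted Key Value) directly to point cloud classification raises two key challenges: fixed-direction token shift methods such as Q-Shift introduce spatial distortions on unstructured point clouds and weaken local geometric modeling, and the Bi-WKV attention amplifies cross-domain differences in key feature distributions through exponential weighting, which causes attention shifts and hurts generalization. The proposed PointDGRWKV framework introduces two modules: Adaptive Geometric Token Shift, which explicitly models local neighborhood structure to improve geometric context awareness, and Cross-Domain Key Feature Distribution Alignment, which mitigates attention drift by constraining key feature distributions to be consistent across domains. The approach retains RWKV's linear computational efficiency while markedly improving generalization to unseen domains.

Link: https://arxiv.org/abs/2508.20835
Authors: Hao Yang, Qianyu Zhou, Haijia Sun, Xiangtai Li, Xuequan Lu, Lizhuang Ma, Shuicheng Yan
Affiliations: Shanghai Jiao Tong University; The University of Tokyo; Nanjing University; Nanyang Technological University; The University of Western Australia; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: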

Abstract:Domain Generalization (DG) has been recently explored to enhance the generalizability of Point Cloud Classification (PCC) models toward unseen domains. Prior works are based on convolutional networks, Transformer or Mamba architectures, either suffering from limited receptive fields or high computational cost, or insufficient long-range dependency modeling. RWKV, as an emerging architecture, possesses superior linear complexity, global receptive fields, and long-range dependency. In this paper, we present the first work that studies the generalizability of RWKV models in DG PCC. We find that directly applying RWKV to DG PCC encounters two significant challenges: RWKV’s fixed direction token shift methods, like Q-Shift, introduce spatial distortions when applied to unstructured point clouds, weakening local geometric modeling and reducing robustness. In addition, the Bi-WKV attention in RWKV amplifies slight cross-domain differences in key distributions through exponential weighting, leading to attention shifts and degraded generalization. To this end, we propose PointDGRWKV, the first RWKV-based framework tailored for DG PCC. It introduces two key modules to enhance spatial modeling and cross-domain robustness, while maintaining RWKV’s linear efficiency. In particular, we present Adaptive Geometric Token Shift to model local neighborhood structures to improve geometric context awareness. In addition, Cross-Domain key feature Distribution Alignment is designed to mitigate attention drift by aligning key feature distributions across domains. Extensive experiments on multiple benchmarks demonstrate that PointDGRWKV achieves state-of-the-art performance on DG PCC.

[CV-28] Estimating 2D Keypoints of Surgical Tools Using Vision-Language Models with Low-Rank Adaptation MICCAI2025

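【Quick read】: This paper addresses the tendency of traditional Convolutional Neural Network (CNN) and Transformer-based methods to overfit when estimating 2D keypoints of surgical tools from small medical datasets. The key to the solution is exploiting the strong generalization of pre-trained Vision Language Models (VLMs), fine-tuning them with Low-Rank Adaptation (LoRA), and designing structured prompts to build an instruction-tuning dataset that aligns visual features with semantic keypoint descriptions. Experiments show that only two epochs of fine-tuning are enough to outperform the baseline models, supporting the effectiveness of LoRA in low-resource scenarios.

Link: https://arxiv.org/abs/2508.20830
Authors: Krit Duangprom, Tryphon Lambrou, Binod Bhattarai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to MICCAI 2025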

Abstract:This paper presents a novel pipeline for 2D keypoint estimation of surgical tools by leveraging Vision Language Models (VLMs) fine-tuned using a low-rank adaptation (LoRA) technique. Unlike traditional Convolutional Neural Network (CNN) or Transformer-based approaches, which often suffer from overfitting in small-scale medical datasets, our method harnesses the generalization capabilities of pre-trained VLMs. We carefully design prompts to create an instruction-tuning dataset and use them to align visual features with semantic keypoint descriptions. Experimental results show that with only two epochs of fine-tuning, the adapted VLM outperforms the baseline models, demonstrating the effectiveness of LoRA in low-resource scenarios. This approach not only improves keypoint detection performance, but also paves the way for future work in 3D surgical hands and tools pose estimation.
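
For readers unfamiliar with LoRA, the following is a minimal sketch of the general idea: the pre-trained weight matrix W stays frozen and only a low-rank update (alpha / r) * B @ A is trained. The matrix sizes, values, and helper names are illustrative assumptions and say nothing about how this paper configures its VLM.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, with A of shape r x d_in and B of shape d_out x r."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy frozen weight (d_out=2, d_in=3) and a rank-1 adapter.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 2.0, 3.0]]        # r x d_in
b_init = [[0.0], [0.0]]      # d_out x r, zero-initialised so training starts from W
b_trained = [[1.0], [2.0]]   # hypothetical values after some training

print(lora_effective_weight(w, a, b_init, alpha=2.0))
print(lora_effective_weight(w, a, b_trained, alpha=2.0))
```

The adapter here holds r * (d_in + d_out) = 5 trainable values against 6 in the full matrix; the saving becomes large once d_in and d_out are in the thousands and r stays small, which is why the approach suits small datasets.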

[CV-29] FusionCounting: Robust visible-infrared image fusion guided by crowd counting via multi-task learning

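【Quick read】: This paper addresses two gaps in existing visible and infrared image fusion (VIF) methods: they struggle to use downstream-task guidance effectively in complex scenes, and they do not model crowd density quantitatively. Prior approaches rely on tasks such as semantic segmentation, which needs extensive annotation, or on object detection, which degrades in dense scenes because of overlapping bounding boxes and occlusion; no prior work has modeled VIF and crowd counting jointly. The proposed FusionCounting framework jointly optimizes VIF and crowd counting through multi-task learning, uses crowd density information as a lightweight supervisory signal to improve fusion quality, introduces a dynamic loss weighting strategy to speed up convergence and balance task contributions, and adds adversarial training to improve robustness.

Link: https://arxiv.org/abs/2508.20817
Authors: He Li, Xinyu Liu, Weihang Kong, Xingchen Zhang
Affiliations: Yanshan University; University of Exeter
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 9 figures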

Abstract:Most visible and infrared image fusion (VIF) methods focus primarily on optimizing fused image quality. Recent studies have begun incorporating downstream tasks, such as semantic segmentation and object detection, to provide semantic guidance for VIF. However, semantic segmentation requires extensive annotations, while object detection, despite reducing annotation efforts compared with segmentation, faces challenges in highly crowded scenes due to overlapping bounding boxes and occlusion. Moreover, although RGB-T crowd counting has gained increasing attention in recent years, no studies have integrated VIF and crowd counting into a unified framework. To address these challenges, we propose FusionCounting, a novel multi-task learning framework that integrates crowd counting into the VIF process. Crowd counting provides a direct quantitative measure of population density with minimal annotation, making it particularly suitable for dense scenes. Our framework leverages both input images and population density information in a mutually beneficial multi-task design. To accelerate convergence and balance task contributions, we introduce a dynamic loss function weighting strategy. Furthermore, we incorporate adversarial training to enhance the robustness of both VIF and crowd counting, improving the model’s stability and resilience to adversarial attacks. Experimental results on public datasets demonstrate that FusionCounting not only enhances image fusion quality but also achieves superior crowd counting performance.

[CV-30] Adapting Foundation Model for Dental Caries Detection with Dual-View Co-Training

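【Quick read】: This paper addresses the low accuracy of dental caries detection in panoramic X-rays, where the core challenge is the subtle contrast and diverse morphology of caries lesions. The key to the solution is a Dual-View Co-Training network (DVCTNet) that builds two complementary views, a global view from the panoramic X-ray image and a local view from cropped single-tooth images, and pretrains a vision foundation model on each. A Gated Cross-View Attention (GCV-Atten) module then dynamically fuses the features of the two views to improve detection accuracy. The method mirrors the clinical workflow of whole-image screening followed by detailed inspection, and it outperforms existing state-of-the-art methods on both a public dataset and a newly curated, high-precision annotated dataset, indicating its potential for clinical use.

Link: https://arxiv.org/abs/2508.20813
Authors: Tao Luo, Han Wu, Tong Yang, Dinggang Shen, Zhiming Cui
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: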

Abstract:Accurate dental caries detection from panoramic X-rays plays a pivotal role in preventing lesion progression. However, current detection methods often yield suboptimal accuracy due to subtle contrast variations and diverse lesion morphology of dental caries. In this work, inspired by the clinical workflow where dentists systematically combine whole-image screening with detailed tooth-level inspection, we present DVCTNet, a novel Dual-View Co-Training network for accurate dental caries detection. Our DVCTNet starts with employing automated tooth detection to establish two complementary views: a global view from panoramic X-ray images and a local view from cropped tooth images. We then pretrain two vision foundation models separately on the two views. The global-view foundation model serves as the detection backbone, generating region proposals and global features, while the local-view model extracts detailed features from corresponding cropped tooth patches matched by the region proposals. To effectively integrate information from both views, we introduce a Gated Cross-View Attention (GCV-Atten) module that dynamically fuses dual-view features, enhancing the detection pipeline by integrating the fused features back into the detection model for final caries detection. To rigorously evaluate our DVCTNet, we test it on a public dataset and further validate its performance on a newly curated, high-precision dental caries detection dataset, annotated using both intra-oral images and panoramic X-rays for double verification. Experimental results demonstrate DVCTNet’s superior performance against existing state-of-the-art (SOTA) methods on both datasets, indicating the clinical applicability of our method. Our code and labeled dataset are available at this https URL.

[CV-31] Surfel-based 3D Registration with Equivariant SE(3) Features

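【Quick read】: This paper addresses the fact that point cloud registration methods ignore point orientations and point uncertainties, which makes them sensitive to noisy input and aggressive rotations (such as orthogonal transformations) and limits their robustness in real scenes. The key to the solution is a surfel-based regression approach that learns SE(3)-equivariant features: surfels are initialized using virtual perspective camera parameters, and SE(3)-equivariant convolutional kernels explicitly learn equivariant features of both position and rotation to predict the relative transformation between source and target scans. The model combines an equivariant encoder, a cross-attention mechanism for similarity computation, a fully connected decoder, and a non-linear Huber loss, and it shows better performance and robustness than state-of-the-art methods on indoor and outdoor datasets.

Link: https://arxiv.org/abs/2508.20789
Authors: Xueyang Kang, Hang Zhao, Kourosh Khoshelham, Patrick Vandewalle
Affiliations: University of Melbourne; KU Leuven
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 4 figures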

Abstract:Point cloud registration is crucial for ensuring 3D alignment consistency of multiple local point clouds in 3D reconstruction for remote sensing or digital heritage. While various point cloud-based registration methods exist, both non-learning and learning-based, they ignore point orientations and point uncertainties, making the model susceptible to noisy input and aggressive rotations of the input point cloud like orthogonal transformation; thus, it necessitates extensive training point clouds with transformation augmentations. To address these issues, we propose a novel surfel-based pose learning regression approach. Our method can initialize surfels from LiDAR point cloud using virtual perspective camera parameters, and learns explicit SE(3) equivariant features, including both position and rotation through SE(3) equivariant convolutional kernels to predict relative transformation between source and target scans. The model comprises an equivariant convolutional encoder, a cross-attention mechanism for similarity computation, a fully-connected decoder, and a non-linear Huber loss. Experimental results on indoor and outdoor datasets demonstrate our model’s superiority and robust performance on real point-cloud scans compared to state-of-the-art methods.
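
The abstract names a Huber loss as the training objective. As a reminder of what that loss does, here is a minimal sketch of the standard scalar form: quadratic for small residuals and linear for large ones, so outliers are penalized less harshly than under squared error. The threshold `delta` and its default are assumptions; the paper's setting is not given in this digest.

```python
def huber(residual, delta=1.0):
    """Standard Huber loss for a single residual."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * magnitude * magnitude      # quadratic region
    return delta * (magnitude - 0.5 * delta)    # linear region


def mean_huber(predicted, target, delta=1.0):
    """Average Huber loss over paired sequences of numbers."""
    losses = [huber(p - t, delta) for p, t in zip(predicted, target)]
    return sum(losses) / len(losses)


print(huber(0.5))  # small error, quadratic branch
print(huber(3.0))  # large error, linear branch
```

The two branches meet with matching value and slope at `delta`, which keeps gradients bounded for large pose errors.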

[CV-32] Evaluating Compositional Generalisation in VLMs and Diffusion Models

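【Quick read】: This paper addresses the limitations of Vision-Language Models (VLMs) in handling compositional semantics, in particular their weak ability to bind objects with attributes and relations (concept binding) in Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) settings. Models such as CLIP often represent an image as a "bag-of-words" and therefore fail to recognize compound concepts correctly (for example, the combination of a red cube and a blue cylinder). The paper evaluates the generative Diffusion Classifier against the discriminative models CLIP and ViLT on binding tasks. The Diffusion Classifier and ViLT perform well at concept binding, but all models still struggle on the relational GZSL task, which exposes a broader challenge for current VLMs in relational reasoning.

Link: https://arxiv.org/abs/2508.20783
Authors: Beth Pearson, Bilal Boulbarss, Michael Wray, Martha Lewis
Affiliations: University of Bristol; University of Amsterdam
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages including references, 6 figures. Accepted at IWCS 2025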

Abstract:A fundamental aspect of the semantics of natural language is that novel meanings can be formed from the composition of previously known parts. Vision-language models (VLMs) have made significant progress in recent years; however, there is evidence that they are unable to perform this kind of composition. For example, given an image of a red cube and a blue cylinder, a VLM such as CLIP is likely to incorrectly label the image as a red cylinder or a blue cube, indicating it represents the image as a 'bag-of-words' and fails to capture compositional semantics. Diffusion models have recently gained significant attention for their impressive generative abilities, and zero-shot classifiers based on diffusion models have been shown to perform competitively with CLIP in certain compositional tasks. In this work we explore whether the generative Diffusion Classifier has improved compositional generalisation abilities compared to discriminative models. We assess three models – Diffusion Classifier, CLIP, and ViLT – on their ability to bind objects with attributes and relations in both zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings. Our results show that the Diffusion Classifier and ViLT perform well at concept binding tasks, but that all models struggle significantly with the relational GZSL task, underscoring the broader challenges VLMs face with relational reasoning. Analysis of CLIP embeddings suggests that the difficulty may stem from overly similar representations of relational concepts such as left and right. Code and dataset are available at: this https URL

[CV-33] Safer Skin Lesion Classification with Global Class Activation Probability Map Evaluation and SafeML

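【Quick read】: This paper addresses the fact that current skin lesion classification models, despite high accuracy, lack trustworthiness and explainability, so clinicians remain distrustful of AI diagnoses. Existing explanation methods have reliability problems: LIME-based methods are inconsistent, and Class Activation Map (CAM)-based methods do not consider all classes. The key to the solution is Global Class Activation Probabilistic Map Evaluation, which analyzes the activation probability maps of all classes probabilistically at the pixel level and visualizes the diagnostic process in a unified manner, improving the completeness and consistency of the explanation. It is combined with SafeML to better detect false diagnoses and issue warnings to doctors and patients, ultimately improving diagnostic reliability and patient safety.

Link: https://arxiv.org/abs/2508.20776
Authors: Kuniko Paxton, Koorosh Aslansefat, Amila Akagić, Dhavalkumar Thakker, Yiannis Papadopoulos
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: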

Abstract:Recent advancements in skin lesion classification models have significantly improved accuracy, with some models even surpassing dermatologists’ diagnostic performance. However, in medical practice, distrust in AI models remains a challenge. Beyond high accuracy, trustworthy, explainable diagnoses are essential. Existing explainability methods have reliability issues: LIME-based methods suffer from inconsistency, while CAM-based methods fail to consider all classes. To address these limitations, we propose Global Class Activation Probabilistic Map Evaluation, a method that analyses all classes’ activation probability maps probabilistically and at a pixel level. By visualizing the diagnostic process in a unified manner, it helps reduce the risk of misdiagnosis. Furthermore, the application of SafeML enhances the detection of false diagnoses and issues warnings to doctors and patients as needed, improving diagnostic reliability and ultimately patient safety. We evaluated our method using the ISIC datasets with MobileNetV2 and Vision Transformers.

[CV-34] Unleashing Uncertainty: Efficient Machine Unlearning for Generative AI ICML2025

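【Quick read】: This paper addresses Machine Unlearning in diffusion models, that is, how to efficiently remove the influence of specific classes from a trained model without significantly harming performance on the remaining classes. The key to the solution is SAFEMax, a method grounded in information-theoretic principles that achieves forgetting by maximizing the entropy of generated images: when conditioned on an impermissible class, the model ultimately halts its denoising process and outputs Gaussian noise. SAFEMax also controls the balance between forgetting and retention by focusing on the early diffusion steps, where class-specific information is prominent, which yields substantial efficiency gains while preserving the forgetting effect.

Link: https://arxiv.org/abs/2508.20773
Authors: Christoforos N. Spartalis, Theodoros Semertzidis, Petros Daras, Efstratios Gavves
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2025 workshop on Machine Unlearning for Generative AI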

Abstract:We introduce SAFEMax, a novel method for Machine Unlearning in diffusion models. Grounded in information-theoretic principles, SAFEMax maximizes the entropy in generated images, causing the model to generate Gaussian noise when conditioned on impermissible classes by ultimately halting its denoising process. Also, our method controls the balance between forgetting and retention by selectively focusing on the early diffusion steps, where class-specific information is prominent. Our results demonstrate the effectiveness of SAFEMax and highlight its substantial efficiency gains over state-of-the-art methods.

[CV-35] Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding

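【Quick read】: This paper addresses abstract concept recognition, a key open challenge in video understanding: enabling models not only to recognize concrete entities in video (objects, actions, events) but also to reason across multiple semantic levels based on contextual information, so that their understanding is closer to human cognition. The survey argues that recent advances in foundation models, especially multi-modal foundation models, provide the right setting for this problem, and that systematically drawing on earlier research alongside current techniques can avoid repeated exploration and drive substantive progress on abstract concept understanding.

Link: https://arxiv.org/abs/2508.20765
Authors: Gowreesh Mago, Pascal Mettes, Stevan Rudinac
Affiliations: University of Amsterdam
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under Review for IJCV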

Abstract:The automatic understanding of video content is advancing rapidly. Empowered by deeper neural networks and large datasets, machines are increasingly capable of understanding what is concretely visible in video frames, whether it be objects, actions, events, or scenes. In comparison, humans retain a unique ability to also look beyond concrete entities and recognize abstract concepts like justice, freedom, and togetherness. Abstract concept recognition forms a crucial open challenge in video understanding, where reasoning on multiple semantic levels based on contextual information is key. In this paper, we argue that the recent advances in foundation models make for an ideal setting to address abstract concept understanding in videos. Automated understanding of high-level abstract concepts is imperative as it enables models to be more aligned with human reasoning and values. In this survey, we study different tasks and datasets used to understand abstract concepts in video content. We observe that, periodically and over a long period, researchers have attempted to solve these tasks, making the best use of the tools available at their disposal. We advocate that drawing on decades of community experience will help us shed light on this important open grand challenge and avoid "re-inventing the wheel" as we start revisiting it in the era of multi-modal foundation models.

[CV-36] SKGE-SWIN: End-To-End Autonomous Vehicle Waypoint Prediction and Navigation Using Skip Stage Swin Transformer

【Quick read】: This paper addresses the lack of pixel-level context awareness in autonomous vehicle perception of complex environments, with the aim of improving the model's global understanding of the surrounding scene and its robustness. The key to the solution is the SKGE-Swin architecture, which uses a Swin Transformer with a skip-stage mechanism: the Shifted Window-based Multi-head Self-Attention (SW-MSA) mechanism extracts information from distant pixels, and skip connections retain critical information from the initial to the final stages of feature extraction, strengthening the model's ability to understand complex driving scenes.

Link: https://arxiv.org/abs/2508.20762
Authors: Fachri Najm Noer Kartiman, Rasim, Yaya Wihardi, Nurul Hasanah, Oskar Natan, Bambang Wahono, Taufik Ibnu Salim
Affiliations: Indonesia University of Education; National Research and Innovation Agency; Gadjah Mada University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Keywords: multitask learning, autonomous driving, end-to-end learning, skip connections, swin transformer, self-attention mechanism. 12 pages

Abstract:Focusing on the development of an end-to-end autonomous vehicle model with pixel-to-pixel context awareness, this research proposes the SKGE-Swin architecture. This architecture utilizes the Swin Transformer with a skip-stage mechanism to broaden feature representation globally and at various network levels. This approach enables the model to extract information from distant pixels by leveraging the Swin Transformer’s Shifted Window-based Multi-head Self-Attention (SW-MSA) mechanism and to retain critical information from the initial to the final stages of feature extraction, thereby enhancing its capability to comprehend complex patterns in the vehicle’s surroundings. The model is evaluated on the CARLA platform using adversarial scenarios to simulate real-world conditions. Experimental results demonstrate that the SKGE-Swin architecture achieves a superior Driving Score compared to previous methods. Furthermore, an ablation study will be conducted to evaluate the contribution of each architectural component, including the influence of skip connections and the use of the Swin Transformer, in improving model performance.

[CV-37] Occlusion Robustness of CLIP for Military Vehicle Classification

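【Quick read】: This paper addresses the insufficient robustness of Vision-Language Models (VLMs) in military settings under partial occlusion and degraded Signal-to-Noise Ratio (SNR), and in particular the drop in zero-shot classification performance of CLIP-style models in complex battlefield environments. The study builds a custom dataset of 18 military vehicle classes, systematically evaluates CLIP variants across occlusion percentages, and uses the Normalized Area Under the Curve (NAUC) as the metric. It finds that Transformer-based CLIP models outperform CNN-based ones and that fine-grained, dispersed occlusions are more damaging than large contiguous ones. More importantly, fine-tuning the model backbone moves the point at which performance drops sharply from about 35% occlusion (for linear-probed models) to more than 60%, which points to occlusion-specific augmentation during training as a central route to robustness.

Link: https://arxiv.org/abs/2508.20760
Authors: Jan Erik van Woerden, Gertjan Burghouts, Lotte Nijskens, Alma M. Liezenga, Sabina van Rooij, Frank Ruis, Hugo J. Kuijf
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: To be presented at SPIE: Sensors + Imaging, Artificial Intelligence for Security and Defence Applications II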

Abstract:Vision-language models (VLMs) like CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications with scarce labeled data. However, CLIP’s robustness in challenging military environments, with partial occlusion and degraded signal-to-noise ratio (SNR), remains underexplored. We investigate CLIP variants’ robustness to occlusion using a custom dataset of 18 military vehicle classes and evaluate using Normalized Area Under the Curve (NAUC) across occlusion percentages. Four key insights emerge: (1) Transformer-based CLIP models consistently outperform CNNs, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions, (3) despite improved accuracy, performance of linear-probed models sharply drops at around 35% occlusion, (4) by finetuning the model’s backbone, this performance drop occurs at more than 60% occlusion. These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration into patch-level sensitivity and architectural resilience for real-world deployment of CLIP.
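
The abstract reports results as a Normalized Area Under the Curve (NAUC) across occlusion percentages but does not spell out the computation here. One plausible reading, sketched below, integrates accuracy over the occlusion fraction with the trapezoidal rule and divides by the width of the occlusion range, so that a model with perfect accuracy at every level scores 1.0. The normalization and the sample numbers are assumptions, not the paper's definition or data.

```python
def normalized_auc(occlusion, accuracy):
    """Trapezoidal area under accuracy-vs-occlusion, divided by the occlusion range."""
    if len(occlusion) != len(accuracy) or len(occlusion) < 2:
        raise ValueError("need at least two matching (occlusion, accuracy) points")
    area = 0.0
    for i in range(1, len(occlusion)):
        width = occlusion[i] - occlusion[i - 1]
        if width <= 0:
            raise ValueError("occlusion levels must be strictly increasing")
        area += 0.5 * width * (accuracy[i] + accuracy[i - 1])
    return area / (occlusion[-1] - occlusion[0])


# Hypothetical accuracy measurements at increasing occlusion fractions.
levels = [0.0, 0.5, 1.0]
print(normalized_auc(levels, [1.0, 1.0, 1.0]))  # no degradation
print(normalized_auc(levels, [1.0, 0.5, 0.0]))  # linear degradation
```

A single number of this kind makes it easy to rank models by how gracefully they degrade, although it hides where along the curve the sharp drop occurs.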

[CV-38] SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding

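【Quick read】: This paper addresses the limitations of zero-shot 3D Visual Grounding (3DVG) methods that rely on single-view localization, namely spatially limited reasoning, contextual omissions, and detail degradation. The proposed SeqVLM framework reasons about the target object using multi-view real-world scene images together with spatial information. It first generates 3D instance proposals with a 3D semantic segmentation network and filters them to keep only semantically relevant candidates, then applies a proposal-guided multi-view projection strategy that maps the candidates onto image sequences while preserving spatial relationships and contextual detail. A dynamic scheduling mechanism processes the sequence-query prompts iteratively to reduce the computational load on the Vision-Language Model (VLM), improving cross-scene generalization and practical applicability.

Link: https://arxiv.org/abs/2508.20758
Authors: Jiawen Lin, Shiran Bian, Yihang Zhu, Wenbin Tan, Yachao Zhang, Yuan Xie, Yanyun Qu
Affiliations: School of Informatics, Xiamen University; School of Computer Science, Nanjing University; School of Computer Science and Technology, East China Normal University; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: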

Abstract:3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG holds greater promise for real-world applications since it eliminates scene-specific training requirements. However, existing zero-shot methods face challenges of spatially limited reasoning due to reliance on single-view localization, and contextual omissions or detail degradation. To address these issues, we propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Specifically, SeqVLM first generates 3D instance proposals via a 3D semantic segmentation network and refines them through semantic filtering, retaining only semantically relevant candidates. A proposal-guided multi-view projection strategy then projects these candidate proposals onto real scene image sequences, preserving spatial relationships and contextual details in the conversion process of 3D point cloud to images. Furthermore, to mitigate VLM computational overload, we implement a dynamic scheduling mechanism that iteratively processes sequence-query prompts, leveraging VLM’s cross-modal reasoning capabilities to identify textually specified objects. Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, achieving Acc@0.25 scores of 55.6% and 53.2%, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively, which advance 3DVG toward greater generalization and real-world applicability. The code is available at this https URL.
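
The Acc@0.25 figures above are usually read as the fraction of queries whose predicted 3D box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.25. The sketch below illustrates that reading for axis-aligned boxes. The box format, the use of `>=` at the threshold, and the sample boxes are assumptions; the benchmarks' official evaluation scripts may differ in detail.

```python
def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low
    return intersection / (box_volume(a) + box_volume(b) - intersection)


def accuracy_at_iou(predictions, ground_truths, threshold=0.25):
    """Fraction of predictions whose IoU with the matching ground truth meets the threshold."""
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)


# Hypothetical boxes: one partial overlap and one complete miss.
predictions = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1)]
ground_truths = [(1, 0, 0, 3, 2, 2), (5, 5, 5, 6, 6, 6)]
print(accuracy_at_iou(predictions, ground_truths))
```

A threshold of 0.25 is lenient: in the first pair the boxes share only a third of their combined volume and still count as a hit.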

[CV-39] C3-GS: Learning Context-aware Cross-dimension Cross-scale Feature for Generalizable Gaussian Splatting BMVC2025

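【Quick read】: This paper addresses the difficulty that existing generalizable Gaussian Splatting methods have in building accurate geometry from sparse input views. The core problem is that they cannot encode discriminative, multi-view consistent features, which limits the accuracy and generalization of novel view synthesis. The key to the solution is the $\mathbf{C}^3$-GS framework, which strengthens feature learning through context-aware, cross-dimension, and cross-scale constraints and integrates three lightweight modules into a unified rendering pipeline, improving feature fusion and enabling photorealistic synthesis without additional supervision.

Link: https://arxiv.org/abs/2508.20754
Authors: Yuxi Hu, Jun Zhang, Kuangyi Chen, Zhe Zhang, Friedrich Fraundorfer
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to The 36th British Machine Vision Conference (BMVC 2025), Sheffield, UK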

Abstract:Generalizable Gaussian Splatting aims to synthesize novel views for unseen scenes without per-scene optimization. In particular, recent advancements utilize feed-forward networks to predict per-pixel Gaussian parameters, enabling high-quality synthesis from sparse input views. However, existing approaches fall short in encoding discriminative, multi-view consistent features for Gaussian predictions, which struggle to construct accurate geometry with sparse views. To address this, we propose $\mathbf{C}^3$-GS, a framework that enhances feature learning by incorporating context-aware, cross-dimension, and cross-scale constraints. Our architecture integrates three lightweight modules into a unified rendering pipeline, improving feature fusion and enabling photorealistic synthesis without requiring additional supervision. Extensive experiments on benchmark datasets validate that $\mathbf{C}^3$-GS achieves state-of-the-art rendering quality and generalization ability. Code is available at: this https URL.

[CV-40] Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

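【Quick read】: This paper addresses the training instability caused by reward hacking in text-to-image (T2I) generation that relies on pointwise reward models (RM). The core observation is that when score differences between images are small, normalization amplifies them, which pushes the model to over-optimize for trivial differences and destabilizes generation. The key to the solution is Pref-GRPO, which shifts the optimization objective from score maximization to preference fitting: images within each group are compared pairwise and the win rate is used as the reward signal, which distinguishes subtle differences in image quality, stabilizes training, and mitigates reward hacking.

Link: https://arxiv.org/abs/2508.20751
Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
Affiliations: Fudan University; Shanghai Innovation Institute; Shanghai AI Lab; Hunyuan, Tencent; Shanghai Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL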

Abstract:Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods using pointwise reward models (RM) for scoring generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that PREF-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging MLLM for benchmark construction and evaluation. Our benchmarks uncover the strengths and weaknesses of both open and closed-source T2I models and validate the effectiveness of Pref-GRPO.
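
The two reward schemes described in the abstract can be contrasted with a small sketch. It is not the authors' implementation: `group_advantages` assumes the usual within-group standardization, and `prefer` is a hypothetical stand-in for a pairwise preference reward model. The toy scores show how a spread of only 0.003 still yields advantages of roughly ±1.34 after normalization.

```python
import statistics

def group_advantages(rewards):
    """Standardize rewards within one group (assumed GRPO-style normalization)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def win_rates(items, prefer):
    """Reward each item by the fraction of pairwise comparisons it wins.

    `prefer(a, b)` is a hypothetical preference model that returns True
    when `a` is preferred over `b`.
    """
    n = len(items)
    if n < 2:
        raise ValueError("need at least two items to compare")
    rates = []
    for i in range(n):
        wins = sum(1 for j in range(n) if j != i and prefer(items[i], items[j]))
        rates.append(wins / (n - 1))
    return rates

# Pointwise scores that differ only in the third decimal place.
scores = [0.801, 0.802, 0.803, 0.804]
pointwise_adv = group_advantages(scores)

# Win-rate rewards from a toy preference function (assumption: higher is better).
pairwise_rewards = win_rates(scores, lambda a, b: a > b)
pairwise_adv = group_advantages(pairwise_rewards)
```

In this toy case the preference function simply compares the same scores, so both schemes rank the images identically; the point is only to show where the win rate enters the reward computation.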

[CV-41] Mix Align Distil: Reliable Cross-Domain Atypical Mitosis Classification

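[Quick Read]: This paper addresses the difficulty of identifying atypical mitotic figures (AMFs) consistently and accurately under domain shift arising from different scanners, stains, and acquisition conditions. The key components of the solution are: (i) style perturbations inserted at early and mid backbone stages to increase feature diversity; (ii) an auxiliary alignment loss that uses weak domain labels (Scanner, Origin, Species, Tumor) to align attention-refined features across sites; and (iii) distillation from an exponential moving average (EMA) teacher with temperature-scaled KL divergence to stabilize predictions. On the preliminary leaderboard of MIDOG 2025 Task 2, the method attains a balanced accuracy of 0.8762, sensitivity of 0.8873, specificity of 0.8651, and ROC AUC of 0.9499, with negligible inference overhead and reliance only on coarse domain metadata.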

Link: https://arxiv.org/abs/2508.20745
Authors: Kaustubh Atey, Sameer Anand Jha, Gouranga Bala, Amit Sethi
Affiliations: IIT Bombay
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Atypical mitotic figures (AMFs) are important histopathological markers yet remain challenging to identify consistently, particularly under domain shift stemming from scanner, stain, and acquisition differences. We present a simple training-time recipe for domain-robust AMF classification in MIDOG 2025 Task 2. The approach (i) increases feature diversity via style perturbations inserted at early and mid backbone stages, (ii) aligns attention-refined features across sites using weak domain labels (Scanner, Origin, Species, Tumor) through an auxiliary alignment loss, and (iii) stabilizes predictions by distilling from an exponential moving average (EMA) teacher with temperature-scaled KL divergence. On the organizer-run preliminary leaderboard for atypical mitosis classification, our submission attains balanced accuracy of 0.8762, sensitivity of 0.8873, specificity of 0.8651, and ROC AUC of 0.9499. The method incurs negligible inference-time overhead, relies only on coarse domain metadata, and delivers strong, balanced performance, positioning it as a competitive submission for the MIDOG 2025 challenge.
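
Step (iii) of the recipe, distilling from an EMA teacher with temperature-scaled KL divergence, can be sketched in a few lines. This is a generic illustration rather than the authors' code: the function names, the `T^2` scaling of the loss (a common distillation convention), and the default `temperature` and `decay` values are all assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2 (assumed)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

def ema_update(teacher_weights, student_weights, decay=0.99):
    """Move each teacher weight a small step toward the student weight."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]

# Toy two-class example: the teacher is slightly more confident than the student.
loss = kl_distillation_loss([0.2, 0.8], [0.0, 1.5])
teacher = ema_update([1.0, 1.0], [0.0, 2.0], decay=0.9)
```

Because the teacher is only an averaged copy of the student, it adds no cost at inference time, which is consistent with the negligible overhead the abstract reports.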

[CV-42] CardioMorphNet: Cardiac Motion Prediction Using a Shape-Guided Bayesian Recurrent Deep Network

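[Quick Read]: This paper addresses accurate cardiac motion estimation from short-axis (SAX) cardiac magnetic resonance (CMR) images. Existing methods rely on intensity-based image registration similarity losses and therefore struggle to capture motion in cardiac anatomical regions. The key to the solution is CardioMorphNet, a recurrent Bayesian deep learning framework. It uses a recurrent variational autoencoder to model spatio-temporal dependencies over the cardiac cycle, together with two posterior models for bi-ventricular segmentation and motion estimation. A loss derived from the Bayesian formulation guides the model to recursively register segmentation maps, avoiding intensity-based similarity losses and focusing on anatomical regions. The framework also computes uncertainty maps for the estimated motion fields, which improves the reliability of its predictions.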

Link: https://arxiv.org/abs/2508.20734
Authors: Reza Akbari Movahed, Abuzar Rezaee, Arezoo Zakeri, Colin Berry, Edmond S. L. Ho, Ali Gooya
Affiliations: University of Glasgow; University of Tehran; Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate cardiac motion estimation from cine cardiac magnetic resonance (CMR) images is vital for assessing cardiac function and detecting its abnormalities. Existing methods often struggle to capture heart motion accurately because they rely on intensity-based image registration similarity losses that may overlook cardiac anatomical regions. To address this, we propose CardioMorphNet, a recurrent Bayesian deep learning framework for 3D cardiac shape-guided deformable registration using short-axis (SAX) CMR images. It employs a recurrent variational autoencoder to model spatio-temporal dependencies over the cardiac cycle and two posterior models for bi-ventricular segmentation and motion estimation. The derived loss function from the Bayesian formulation guides the framework to focus on anatomical regions by recursively registering segmentation maps without using intensity-based image registration similarity loss, while leveraging sequential SAX volumes and spatio-temporal features. The Bayesian modelling also enables computation of uncertainty maps for the estimated motion fields. Validated on the UK Biobank dataset by comparing warped mask shapes with ground truth masks, CardioMorphNet demonstrates superior performance in cardiac motion estimation, outperforming state-of-the-art methods. Uncertainty assessment shows that it also yields lower uncertainty values for estimated motion fields in the cardiac region compared with other probabilistic-based cardiac registration methods, indicating higher confidence in its predictions.

[CV-43] Learned Rate Control for Frame-Level Adaptive Neural Video Compression via Dynamic Neural Network

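[Quick Read]: This paper addresses precise rate control in Neural Video Compression (NVC), which is difficult because learning-based codecs struggle to match a target bitrate reliably in practice. The key to the solution is a dynamic video compression framework. A Dynamic-Route Autoencoder (DRA) provides variable coding routes for variable bitrate (VBR) scenarios; each route uses part of the network's computational complexity and corresponds to a distinct rate-distortion (RD) trade-off. A Rate Control Agent estimates the bitrate of each route and adjusts the DRA's coding route at run time to approach the target bitrate. A Joint-Routes Optimization strategy trains the routes collaboratively, covering a broad bitrate range while preserving overall RD performance, and thereby achieves Rate-Distortion-Complexity Optimization (RDCO) for different bitrate-constrained scenarios.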

Link: https://arxiv.org/abs/2508.20709
Authors: Chenhao Zhang, Wei Gao
Affiliations: PCL; Natural Science Foundation of China; Guangdong Province Pearl River Talent Program; Guangdong Basic and Applied Basic Research Foundation; Shenzhen Science and Technology Program; CAAI-MindSpore Open Fund; OpenI Community
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Neural Video Compression (NVC) has achieved remarkable performance in recent years. However, precise rate control remains a challenge due to the inherent limitations of learning-based codecs. To solve this issue, we propose a dynamic video compression framework designed for variable bitrate scenarios. First, to achieve variable bitrate implementation, we propose the Dynamic-Route Autoencoder with variable coding routes, each occupying partial computational complexity of the whole network and navigating to a distinct RD trade-off. Second, to approach the target bitrate, the Rate Control Agent estimates the bitrate of each route and adjusts the coding route of DRA at run time. To encompass a broad spectrum of variable bitrates while preserving overall RD performance, we employ the Joint-Routes Optimization strategy, achieving collaborative training of various routes. Extensive experiments on the HEVC and UVG datasets show that the proposed method achieves an average BD-Rate reduction of 14.8% and BD-PSNR gain of 0.47dB over state-of-the-art methods while maintaining an average bitrate error of 1.66%, achieving Rate-Distortion-Complexity Optimization (RDCO) for various bitrate and bitrate-constrained applications. Our code is available at this https URL.
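
The run-time decision described above, estimating each route's bitrate and picking a route that approaches the target, can be illustrated with a deliberately simple stand-in. The paper's Rate Control Agent is learned; the greedy nearest-bitrate rule, the function names, and the toy bitrates below are assumptions made only for illustration.

```python
def select_route(estimated_bitrates, target_bitrate):
    """Return the index of the route whose estimated bitrate is closest to the target."""
    return min(
        range(len(estimated_bitrates)),
        key=lambda i: abs(estimated_bitrates[i] - target_bitrate),
    )

def bitrate_error(actual_bitrate, target_bitrate):
    """Relative bitrate error, the kind of quantity reported as a percentage."""
    return abs(actual_bitrate - target_bitrate) / target_bitrate

# Hypothetical per-route estimates in bits per pixel for one frame.
route_estimates = [0.05, 0.10, 0.20, 0.40]
chosen = select_route(route_estimates, target_bitrate=0.18)
error = bitrate_error(route_estimates[chosen], 0.18)
```

Applying this choice per frame is what makes the control frame-level: each frame can take a different route as the estimates change.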

[CV-44] “Humor Art or Misinformation?”: A Multimodal Dataset for Intent-Aware Synthetic Image Detection

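[Quick Read]: This paper addresses the neglect of intent in current detection of AI-generated content: existing methods focus on whether content is synthetic or out of context and largely overlook the intent behind a generated image (humor, art, or misinformation). The key to the solution is S-HArM, a multimodal dataset of 9,576 real-world image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. The authors also explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to build a large-scale synthetic training set with Stable Diffusion. Comparative experiments show that models trained on image-guided and multimodally-guided data generalize better to real-world content because visual context is preserved.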

Link: https://arxiv.org/abs/2508.20670
Authors: Anastasios Skoularikis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos, Panagiotis C. Petrantonakis
Affiliations: Aristotle University of Thessaloniki; Centre for Research & Technology Hellas
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 “in the wild” image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Our results show that models trained on image- and multimodally-guided data generalize better to “in the wild” content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.

[CV-45] CraftGraffiti: Exploring Human Identity with Custom Graffiti Art via Facial-Preserving Diffusion Models

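[Quick Read]: This paper addresses the preservation of facial identity under extreme stylistic transformation in generative art, especially in graffiti, a high-contrast, abstract medium in which subtle distortions of facial features can erase the subject's recognizability and undermine personal and cultural authenticity. The key to the solution is CraftGraffiti, an end-to-end text-guided graffiti generation framework. It first performs style transfer with a LoRA-fine-tuned pretrained diffusion transformer, then enforces identity fidelity through a face-consistent self-attention mechanism that augments attention layers with explicit identity embeddings. Pose customization is achieved without keypoints, using CLIP-guided prompt extension while retaining facial coherence. Experiments support the "style-first, identity-after" paradigm, which reduces attribute drift, and show competitive facial feature consistency together with state-of-the-art aesthetic and human preference scores.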

Link: https://arxiv.org/abs/2508.20640
Authors: Ayan Banerjee, Fernando Vilariño, Josep Lladós
Affiliations: Computer Vision Center, Universitat Autònoma de Barcelona
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Preserving facial identity under extreme stylistic transformation remains a major challenge in generative art. In graffiti, a high-contrast, abstract medium, subtle distortions to the eyes, nose, or mouth can erase the subject’s recognizability, undermining both personal and cultural authenticity. We present CraftGraffiti, an end-to-end text-guided graffiti generation framework designed with facial feature preservation as a primary objective. Given an input image and a style and pose descriptive prompt, CraftGraffiti first applies graffiti style transfer via LoRA-fine-tuned pretrained diffusion transformer, then enforces identity fidelity through a face-consistent self-attention mechanism that augments attention layers with explicit identity embeddings. Pose customization is achieved without keypoints, using CLIP-guided prompt extension to enable dynamic re-posing while retaining facial coherence. We formally justify and empirically validate the “style-first, identity-after” paradigm, showing it reduces attribute drift compared to the reverse order. Quantitative results demonstrate competitive facial feature consistency and state-of-the-art aesthetic and human preference scores, while qualitative analyses and a live deployment at the Cruilla Festival highlight the system’s real-world creative impact. CraftGraffiti advances the goal of identity-respectful AI-assisted artistry, offering a principled approach for blending stylistic freedom with recognizability in creative AI applications.

[CV-46] ArtFace: Towards Historical Portrait Face Identification via Model Adaptation WWW ICCV2025

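[Quick Read]: This paper addresses sitter identification in historical paintings, a task that matters to art history but has long been limited by subjective judgment, scarce data, and stylistic variation. Conventional facial recognition models perform well on photographs but degrade on paintings because of domain shift and high intra-class variation, and artistic factors such as style, skill, and intent make recognition harder still. The key to the solution is to fine-tune foundation models and fuse their embeddings with features from conventional facial recognition networks, which yields notable improvements on artworks and bridges the gap where traditional methods are ineffective.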

Link: https://arxiv.org/abs/2508.20626
Authors: Francois Poh, Anjith George, Sébastien Marcel
Affiliations: Idiap Research Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 4 pages, 3 figures. ArtMetrics @ ICCV 2025 (non-archival). Paper page at this https URL

Abstract:Identifying sitters in historical paintings is a key task for art historians, offering insight into their lives and how they chose to be seen. However, the process is often subjective and limited by the lack of data and stylistic variations. Automated facial recognition is capable of handling challenging conditions and can assist, but while traditional facial recognition models perform well on photographs, they struggle with paintings due to domain shift and high intra-class variation. Artistic factors such as style, skill, intent, and influence from other works further complicate recognition. In this work, we investigate the potential of foundation models to improve facial recognition in artworks. By fine-tuning foundation models and integrating their embeddings with those from conventional facial recognition networks, we demonstrate notable improvements over current state-of-the-art methods. Our results show that foundation models can bridge the gap where traditional methods are ineffective. Paper page at this https URL

[CV-47] AvatarBack: Back-Head Generation for Complete 3D Avatars from Front-View Images

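[Quick Read]: This paper addresses a weakness of current Gaussian Splatting-based head avatar reconstruction: because methods rely mainly on frontal-view images, the back of the head suffers from geometric inconsistency, structural blurring, and reduced realism. The key to the solution is AvatarBack, a plug-and-play framework with two core components. The Subject-specific Generator (SSG) uses a generative prior to synthesize identity-consistent back-view pseudo-images from sparse frontal inputs, providing robust multi-view supervision. The Adaptive Spatial Alignment Strategy (ASA) uses learnable transformation matrices optimized during training to resolve pose and coordinate discrepancies between the synthetic views and the 3D Gaussian representation, enabling high-quality back-head reconstruction while preserving overall consistency.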

Link: https://arxiv.org/abs/2508.20623
Authors: Shiqi Xin, Xiaolin Zhang, Yanbin Liu, Peng Zhang, Caifeng Shan
Affiliations: Shandong University of Science and Technology; Auckland University of Technology; Nanjing University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Recent advances in Gaussian Splatting have significantly boosted the reconstruction of head avatars, enabling high-quality facial modeling by representing a 3D avatar as a collection of 3D Gaussians. However, existing methods predominantly rely on frontal-view images, leaving the back-head poorly constructed. This leads to geometric inconsistencies, structural blurring, and reduced realism in the rear regions, ultimately limiting the fidelity of reconstructed avatars. To address this challenge, we propose AvatarBack, a novel plug-and-play framework specifically designed to reconstruct complete and consistent 3D Gaussian avatars by explicitly modeling the missing back-head regions. AvatarBack integrates two core technical innovations, i.e., the Subject-specific Generator (SSG) and the Adaptive Spatial Alignment Strategy (ASA). The former leverages a generative prior to synthesize identity-consistent, plausible back-view pseudo-images from sparse frontal inputs, providing robust multi-view supervision. To achieve precise geometric alignment between these synthetic views and the 3D Gaussian representation, the latter employs learnable transformation matrices optimized during training, effectively resolving inherent pose and coordinate discrepancies. Extensive experiments on NeRSemble and K-hairstyle datasets, evaluated using geometric, photometric, and GPT-4o-based perceptual metrics, demonstrate that AvatarBack significantly enhances back-head reconstruction quality while preserving frontal fidelity. Moreover, the reconstructed avatars maintain consistent visual realism under diverse motions and remain fully animatable.

[CV-48] Masked Autoencoders for Ultrasound Signals: Robust Representation Learning for Downstream Applications

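[Quick Read]: This paper addresses the limited model performance caused by scarce labeled data for one-dimensional (1D) ultrasound signals in settings such as industrial non-destructive testing (NDT) and structural health monitoring (SHM). The core challenge is extracting robust features from limited labels while accommodating task-specific signal processing. The key to the solution is self-supervised pre-training with Masked Autoencoders (MAEs) built on Vision Transformer (ViT) architectures: the model learns general representations from large amounts of unlabeled synthetic ultrasound signals, and model size, patch size, and masking ratio are varied to study pre-training efficiency and downstream accuracy (for example, time-of-flight (ToF) classification). Experiments show that this approach significantly outperforms models trained from scratch and convolutional neural network (CNN) baselines optimized for the downstream task, and that it transfers better to real measured signals.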

Link: https://arxiv.org/abs/2508.20622
Authors: Immanuel Roßteutscher, Klaus S. Drese, Thorsten Uphues
Affiliations: Coburg University of Applied Sciences and Arts
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE Access. This is a preprint version. 14 pages, 6 figures

Abstract:We investigated the adaptation and performance of Masked Autoencoders (MAEs) with Vision Transformer (ViT) architectures for self-supervised representation learning on one-dimensional (1D) ultrasound signals. Although MAEs have demonstrated significant success in computer vision and other domains, their use for 1D signal analysis, especially for raw ultrasound data, remains largely unexplored. Ultrasound signals are vital in industrial applications such as non-destructive testing (NDT) and structural health monitoring (SHM), where labeled data are often scarce and signal processing is highly task-specific. We propose an approach that leverages MAE to pre-train on unlabeled synthetic ultrasound signals, enabling the model to learn robust representations that enhance performance in downstream tasks, such as time-of-flight (ToF) classification. This study systematically investigated the impact of model size, patch size, and masking ratio on pre-training efficiency and downstream accuracy. Our results show that pre-trained models significantly outperform models trained from scratch and strong convolutional neural network (CNN) baselines optimized for the downstream task. Additionally, pre-training on synthetic data demonstrates superior transferability to real-world measured signals compared with training solely on limited real datasets. This study underscores the potential of MAEs for advancing ultrasound signal analysis through scalable, self-supervised learning.
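
Patch size and masking ratio, two of the hyperparameters studied above, determine how a 1D signal is split and how much of it the encoder sees. The sketch below shows only that input-preparation step under simple assumptions (non-overlapping patches, uniform random masking); the helper names and the toy signal are hypothetical, and no model is involved.

```python
import random

def patchify(signal, patch_size):
    """Split a 1D signal into non-overlapping patches, dropping any remainder."""
    usable = len(signal) - len(signal) % patch_size
    return [signal[i:i + patch_size] for i in range(0, usable, patch_size)]

def random_mask(num_patches, masking_ratio, rng):
    """Pick which patch indices are hidden from the encoder."""
    num_masked = int(round(num_patches * masking_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return visible, sorted(masked)

rng = random.Random(0)                  # fixed seed for a repeatable toy run
signal = [float(i) for i in range(130)] # stand-in for one ultrasound A-scan
patches = patchify(signal, patch_size=8)
visible, masked = random_mask(len(patches), masking_ratio=0.75, rng=rng)

# The encoder would see only the visible patches; the decoder's training
# target is the content of the masked ones.
encoder_input = [patches[i] for i in visible]
```

With 130 samples and a patch size of 8, the last two samples are dropped and 16 patches remain; a masking ratio of 0.75 hides 12 of them.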

[CV-49] Mask-Guided Multi-Channel SwinUNETR Framework for Robust MRI Classification

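[Quick Read]: This paper addresses the challenge of image interpretation in early breast cancer diagnosis, particularly for women at high risk or with dense breast tissue, where conventional mammography is less sensitive. It proposes a SwinUNETR-based deep learning framework whose key ingredients are breast region masking, extensive data augmentation, and ensemble learning, which improve robustness and generalizability across multi-center, multi-vendor scanning conditions. The method took second place in the multi-center AI challenge organized by the ODELIA consortium, indicating its potential to support clinical breast MRI interpretation.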

Link: https://arxiv.org/abs/2508.20621
Authors: Smriti Joshi, Lidia Garrucho, Richard Osuala, Oliver Diaz, Karim Lekadir
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Breast cancer is one of the leading causes of cancer-related mortality in women, and early detection is essential for improving outcomes. Magnetic resonance imaging (MRI) is a highly sensitive tool for breast cancer detection, particularly in women at high risk or with dense breast tissue, where mammography is less effective. The ODELIA consortium organized a multi-center challenge to foster AI-based solutions for breast cancer diagnosis and classification. The dataset included 511 studies from six European centers, acquired on scanners from multiple vendors at both 1.5 T and 3 T. Each study was labeled for the left and right breast as no lesion, benign lesion, or malignant lesion. We developed a SwinUNETR-based deep learning framework that incorporates breast region masking, extensive data augmentation, and ensemble learning to improve robustness and generalizability. Our method achieved second place on the challenge leaderboard, highlighting its potential to support clinical breast MRI interpretation. We publicly share our codebase at this https URL.

[CV-50] EmoCAST: Emotional Talking Portrait via Emotive Text Description

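[Quick Read]: This paper addresses the shortcomings of existing emotional talking head generation methods in control flexibility, motion naturalness, and expression quality, as well as the poor real-world applicability of datasets collected mainly in lab settings. The key to the solution is EmoCAST, a diffusion-based framework with two core modules: a text-guided decoupled emotive module that enhances spatial knowledge to improve emotion comprehension, and an emotive audio attention module that captures the interplay between the controlled emotion and the driving audio, producing emotion-aware features that guide more precise facial motion synthesis. The authors also build an emotional talking head dataset with comprehensive emotive text descriptions and propose an emotion-aware sampling training strategy and a progressive functional training strategy, which improve the model's ability to capture nuanced expressive features and achieve accurate lip synchronization.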

Link: https://arxiv.org/abs/2508.20615
Authors: Yiguo Jiang, Xiaodong Cun, Yong Zhang, Yudian Zheng, Fan Tang, Chi-Man Pun
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are primarily collected in lab settings, further exacerbating these shortcomings. Consequently, these limitations substantially hinder practical applications in real-world scenarios. To address these challenges, we propose EmoCAST, a diffusion-based framework with two key modules for precise text-driven emotional synthesis. In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module, enhancing the spatial knowledge to improve emotion comprehension. To improve the relationship between audio and emotion, we introduce an emotive audio attention module to capture the interplay between controlled emotion and driving audio, generating emotion-aware features to guide more precise facial motion synthesis. Additionally, we construct an emotional talking head dataset with comprehensive emotive text descriptions to optimize the framework’s performance. Based on the proposed dataset, we propose an emotion-aware sampling training strategy and a progressive functional training strategy that further improve the model’s ability to capture nuanced expressive features and achieve accurate lip-synchronization. Overall, EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos. Project Page: this https URL

[CV-51] Revisiting the Privacy Risks of Split Inference: A GAN-Based Data Reconstruction Attack via Progressive Feature Optimization

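[Quick Read]: This paper addresses the leakage of sensitive data in Split Inference (SI) for Deep Neural Networks (DNNs), where exchanged intermediate features expose users to Data Reconstruction Attacks (DRAs). Existing DRAs are typically effective only on shallow models and do not fully exploit semantic priors, which limits reconstruction quality and generalizability. The key to the solution is a new DRA framework based on a Generative Adversarial Network (GAN) with Progressive Feature Optimization (PFO): the generator is decomposed into hierarchical blocks and intermediate representations are refined step by step to improve the semantic fidelity of reconstructed images, while an L1-ball constraint applied during reconstruction stabilizes optimization and improves image realism. The method substantially outperforms prior attacks, especially in high-resolution scenarios, out-of-distribution settings, and against deeper and more complex DNNs.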

Link: https://arxiv.org/abs/2508.20613
Authors: Yixiang Qiu, Yanhan Liu, Hongyao Yu, Hao Fang, Bin Chen, Shu-Tao Xia, Ke Xu
Affiliations: Harbin Institute of Technology, Shenzhen; Tsinghua Shenzhen International Graduate School, Tsinghua University; Department of Computer Science and Technology, Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 10 pages, 5 figures

Abstract:The growing complexity of Deep Neural Networks (DNNs) has led to the adoption of Split Inference (SI), a collaborative paradigm that partitions computation between edge devices and the cloud to reduce latency and protect user privacy. However, recent advances in Data Reconstruction Attacks (DRAs) reveal that intermediate features exchanged in SI can be exploited to recover sensitive input data, posing significant privacy risks. Existing DRAs are typically effective only on shallow models and fail to fully leverage semantic priors, limiting their reconstruction quality and generalizability across datasets and model architectures. In this paper, we propose a novel GAN-based DRA framework with Progressive Feature Optimization (PFO), which decomposes the generator into hierarchical blocks and incrementally refines intermediate representations to enhance the semantic fidelity of reconstructed images. To stabilize the optimization and improve image realism, we introduce an L1-ball constraint during reconstruction. Extensive experiments show that our method outperforms prior attacks by a large margin, especially in high-resolution scenarios, out-of-distribution settings, and against deeper and more complex DNNs.
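
The abstract does not say how the L1-ball constraint is enforced or which variable it applies to. One standard way to keep a vector inside an L1 ball during iterative optimization is Euclidean projection by sorting and soft-thresholding, sketched here on a generic vector. Treat the choice of projection, the function name, and the `radius` parameter as assumptions rather than the authors' procedure.

```python
import math

def project_onto_l1_ball(v, radius=1.0):
    """Euclidean projection of `v` onto {x : sum(|x_i|) <= radius}."""
    if radius <= 0.0:
        raise ValueError("radius must be positive")
    magnitudes = [abs(x) for x in v]
    if sum(magnitudes) <= radius:
        return list(v)  # already inside the ball
    # Find the soft-threshold `theta` from the sorted magnitudes.
    ordered = sorted(magnitudes, reverse=True)
    running_sum = 0.0
    theta = 0.0
    for k, value in enumerate(ordered, start=1):
        running_sum += value
        candidate = (running_sum - radius) / k
        if value - candidate > 0.0:
            theta = candidate
    # Shrink every magnitude by theta, clip at zero, and restore the sign.
    return [math.copysign(max(m - theta, 0.0), x) for m, x in zip(magnitudes, v)]

projected = project_onto_l1_ball([3.0, -1.0], radius=2.0)
```

In an attack loop of the kind described, such a projection would be applied after each gradient step so that the optimized variable never drifts far from the region the generator models well.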

[CV-52] Physics Informed Generative Models for Magnetic Field Images

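[Quick Read]: This paper addresses defect detection and localization in semiconductor manufacturing, in particular the scarcity of Magnetic Field Imaging (MFI) data, which makes machine learning (ML) models hard to train. MFI can efficiently localize regions of interest (ROI) to guide targeted X-ray scanning, but proprietary concerns make MFI datasets difficult to obtain, limiting the use of ML for defect localization. The key to the solution is Physics Informed Generative Models for Magnetic Field Images (PI-GenMFI), which integrates specific physical constraints into diffusion models to generate synthetic MFI samples, providing training data for ML algorithms and improving the efficiency of defect localization.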

Link: https://arxiv.org/abs/2508.20612
Authors: Aye Phyu Phyu Aung, Lucas Lum, Zhansen Shi, Wen Qiu, Bernice Zee, JM Chin, Yeow Kheng Lim, J.Senthilnath
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: In semiconductor manufacturing, defect detection and localization are critical to ensuring product quality and yield. While X-ray imaging is a reliable non-destructive testing method, it is memory-intensive and time-consuming for large-scale scanning. Magnetic Field Imaging (MFI) offers a more efficient means to localize regions of interest (ROI) for targeted X-ray scanning. However, the limited availability of MFI datasets due to proprietary concerns presents a significant bottleneck for training machine learning (ML) models using MFI. To address this challenge, we consider an ML-driven approach leveraging diffusion models with two physical constraints. We propose Physics Informed Generative Models for Magnetic Field Images (PI-GenMFI) to generate synthetic MFI samples by integrating specific physical information. We generate MFI images for the most common defect type: power shorts. These synthetic images will serve as training data for ML algorithms designed to localize defect areas efficiently. To evaluate generated MFIs, we compare our model to SOTA generative models from both variational autoencoder (VAE) and diffusion methods. We present a domain expert evaluation to assess the generated samples. In addition, we present qualitative and quantitative evaluation using various metrics used for image generation and signal processing, showing promising results to optimize the defect localization process.

[CV-53] Optimization-Based Calibration for Intravascular Ultrasound Volume Reconstruction

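[Quick Read]: This paper addresses two problems in liver surgery: intraoperative ultrasound images are hard to interpret because of the limited field of view and complex anatomy, and preoperative and intraoperative data are not effectively linked. The key to the solution is an optimization-based calibration method that uses a 3D-printed phantom to enable accurate 3D Intravascular Ultrasound (IVUS) volume reconstruction, ensuring precise alignment of tracked intraoperative IVUS data with preoperative CT images and improving intraoperative navigation.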

Link: https://arxiv.org/abs/2508.20605
Authors: Karl-Philippe Beaudet (MIMESIS, UNISTRA, IHU Strasbourg), Sidaty El Hadramy (MIMESIS, UNISTRA, Unibas, IHU Strasbourg), Philippe C Cattin (Unibas), Juan Verde (MIMESIS, UNISTRA, IHU Strasbourg), Stéphane Cotin (MIMESIS, UNISTRA)
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Intraoperative ultrasound images are inherently challenging to interpret in liver surgery due to the limited field of view and complex anatomical structures. Bridging the gap between preoperative and intraoperative data is crucial for effective surgical guidance. 3D IntraVascular UltraSound (IVUS) offers a potential solution by enabling the reconstruction of the entire organ, which facilitates registration between preoperative computed tomography (CT) scans and intraoperative IVUS images. In this work, we propose an optimization-based calibration method using a 3D-printed phantom for accurate 3D Intravascular Ultrasound volume reconstruction. Our approach ensures precise alignment of tracked IVUS data with preoperative CT images, improving intraoperative navigation. We validated our method using in vivo swine liver images, achieving a calibration error from 0.88 to 1.80 mm and a registration error from 3.40 to 5.71 mm between the 3D IVUS data and the corresponding CT scan. Our method provides a reliable and accurate means of calibration and volume reconstruction. It can be used to register intraoperative ultrasound images with preoperative CT images in the context of liver surgery, and enhance intraoperative guidance.

[CV-54] Embracing Aleatoric Uncertainty: Generating Diverse 3D Human Motion

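[Quick Read]: This paper addresses the lack of diversity in text-driven 3D human motion generation: existing methods keep motions consistent with the text semantics, but the generated motions are not diverse. The key to the solution is explicit uncertainty modeling. First, noise signals are used as carriers of diversity information in transformer-based methods. Second, a continuous latent space is constructed in which text is projected to a continuous representation rather than a rigid one-to-one mapping, and a latent space sampler introduces stochastic sampling to increase the diversity and uncertainty of the outputs. Experiments on the HumanML3D and KIT-ML benchmarks show significantly higher diversity while maintaining state-of-the-art text consistency.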

Link: https://arxiv.org/abs/2508.20604
Authors: Zheng Qin, Yabing Wang, Minghui Yang, Sanping Zhou, Ming Yang, Le Wang
Affiliations: Xi’an Jiaotong University; Ant Group
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Generating 3D human motions from text is a challenging yet valuable task. The key aspects of this task are ensuring text-motion consistency and achieving generation diversity. Although recent advancements have enabled the generation of precise and high-quality human motions from text, achieving diversity in the generated motions remains a significant challenge. In this paper, we aim to overcome the above challenge by designing a simple yet effective text-to-motion generation method, i.e., Diverse-T2M. Our method introduces uncertainty into the generation process, enabling the generation of highly diverse motions while preserving the semantic consistency of the text. Specifically, we propose a novel perspective that utilizes noise signals as carriers of diversity information in transformer-based methods, facilitating an explicit modeling of uncertainty. Moreover, we construct a latent space where text is projected into a continuous representation, instead of a rigid one-to-one mapping, and integrate a latent space sampler to introduce stochastic sampling into the generation process, thereby enhancing the diversity and uncertainty of the outputs. Our results on text-to-motion generation benchmark datasets (HumanML3D and KIT-ML) demonstrate that our method significantly enhances diversity while maintaining state-of-the-art performance in text consistency.
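
The idea of replacing a one-to-one text mapping with stochastic sampling in a continuous latent space can be illustrated with a Gaussian sampler. This is only a sketch of the general technique: the abstract does not specify the latent distribution, so the diagonal Gaussian, the `mean` and `log_var` inputs, and the function name are assumptions.

```python
import math
import random

def sample_latent(mean, log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

rng = random.Random(42)  # fixed seed so the toy run is repeatable

# Hypothetical encoder output for one text prompt.
text_mean = [0.1, -0.4, 0.7]
text_log_var = [-1.0, -1.0, -1.0]

# Two draws for the same prompt give two different latent codes, which a
# decoder could turn into two different motions.
z_first = sample_latent(text_mean, text_log_var, rng)
z_second = sample_latent(text_mean, text_log_var, rng)
```

Shrinking the variance toward zero recovers the deterministic mapping, so the variance acts as a dial between text consistency and diversity in this simplified picture.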

[CV-55] Disruptive Attacks on Face Swapping via Low-Frequency Perceptual Perturbations IJCNN2025

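[Quick Read]: This paper addresses the threat that deepfake technology poses to privacy and societal security, and in particular the fact that existing detection methods are mostly passive and cannot intervene before an attack occurs. The key to the solution is an active defense based on low-frequency perceptual perturbations. It combines frequency-domain and spatial-domain features, uses the discrete wavelet transform (DWT) to extract low-frequency components, and generates perturbations that disrupt face-swapping models, reducing the quality and naturalness of forged content while keeping the protected image visually plausible. Because the method targets the generative process directly rather than classification accuracy, it improves defense success rates while preserving visual quality.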

Link: https://arxiv.org/abs/2508.20595
Authors: Mengxiao Huang, Minglei Shu, Shuwang Zhou, Zhaoyang Liu
Affiliations: Shandong Artificial Intelligence Institute; Qilu University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE IJCNN 2025

Abstract: Deepfake technology, driven by Generative Adversarial Networks (GANs), poses significant risks to privacy and societal security. Existing detection methods are predominantly passive, focusing on post-event analysis without preventing attacks. To address this, we propose an active defense method based on low-frequency perceptual perturbations to disrupt face swapping manipulation, reducing the performance and naturalness of generated content. Unlike prior approaches that used low-frequency perturbations to impact classification accuracy, our method directly targets the generative process of deepfake techniques. We combine frequency and spatial domain features to strengthen defenses. By introducing artifacts through low-frequency perturbations while preserving high-frequency details, we ensure the output remains visually plausible. Additionally, we design a complete architecture featuring an encoder, a perturbation generator, and a decoder, leveraging discrete wavelet transform (DWT) to extract low-frequency components and generate perturbations that disrupt facial manipulation models. Experiments on CelebA-HQ and LFW demonstrate significant reductions in face-swapping effectiveness, improved defense success rates, and preservation of visual quality.
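
The mechanics of perturbing only the low-frequency band can be shown with a one-level 2D Haar transform, the simplest DWT. This sketch is not the paper's encoder, perturbation generator, and decoder architecture: the Haar wavelet, the hand-set perturbation, and all function names are assumptions, and the image is a toy grayscale grid with even dimensions.

```python
def haar_dwt2(image):
    """One-level orthonormal 2D Haar transform of an even-sized grayscale image."""
    ll, d_cols, d_rows, d_diag = [], [], [], []
    for r in range(0, len(image), 2):
        bands = ([], [], [], [])
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            e, d = image[r + 1][c], image[r + 1][c + 1]
            bands[0].append((a + b + e + d) / 2)  # low-frequency band (scaled block sum)
            bands[1].append((a - b + e - d) / 2)  # left-right detail
            bands[2].append((a + b - e - d) / 2)  # top-bottom detail
            bands[3].append((a - b - e + d) / 2)  # diagonal detail
        ll.append(bands[0]); d_cols.append(bands[1])
        d_rows.append(bands[2]); d_diag.append(bands[3])
    return ll, d_cols, d_rows, d_diag

def haar_idwt2(ll, d_cols, d_rows, d_diag):
    """Inverse of `haar_dwt2`."""
    image = []
    for i in range(len(ll)):
        top, bottom = [], []
        for j in range(len(ll[0])):
            s, x, y, z = ll[i][j], d_cols[i][j], d_rows[i][j], d_diag[i][j]
            top += [(s + x + y + z) / 2, (s - x + y - z) / 2]
            bottom += [(s + x - y - z) / 2, (s - x - y + z) / 2]
        image += [top, bottom]
    return image

def perturb_low_frequency(image, delta):
    """Add `delta` to the low-frequency band only; detail bands are untouched."""
    ll, d_cols, d_rows, d_diag = haar_dwt2(image)
    ll = [[v + p for v, p in zip(row, d_row)] for row, d_row in zip(ll, delta)]
    return haar_idwt2(ll, d_cols, d_rows, d_diag)

image = [[1.0, 2.0], [3.0, 4.0]]
protected = perturb_low_frequency(image, [[0.2]])
```

Because the detail bands are passed through unchanged, edges and fine texture survive, while each 2x2 block shifts uniformly by half of the value added to its low-frequency coefficient.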

[CV-56] UTA-Sign: Unsupervised Thermal Video Augmentation via Event-Assisted Traffic Signage Sketching

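[Quick read]: This paper addresses the difficulty thermal cameras have, under low-light conditions, in accurately recognizing traffic signage made of similar materials (such as license plates and roadblock indicators), as well as the inconsistent visual information that event cameras produce because of non-uniform sampling. The key to its solution is UTA-Sign, an unsupervised thermal-event video augmentation method built around a dual-boosting mechanism: thermal frames supply accurate motion cues that serve as temporal references for aligning the non-uniformly sampled event signals, while the event signals add subtle signage content to the raw thermal frames. The result is a temporally consistent and semantically richer signage representation that markedly improves perceptual quality and detection accuracy for traffic signage in low-light scenes.

Link: https://arxiv.org/abs/2508.20594
Authors: Yuqi Han, Songqian Zhang, Weijian Su, Ke Li, Jiayu Yang, Jinli Suo, Qiang Zhang
Affiliations: Dalian University of Technology; Nanyang Technological University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: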

Abstract:The thermal camera excels at perceiving outdoor environments under low-light conditions, making it ideal for applications such as nighttime autonomous driving and unmanned navigation. However, thermal cameras encounter challenges when capturing signage from objects made of similar materials, which can pose safety risks for accurately understanding semantics in autonomous driving systems. In contrast, the neuromorphic vision camera, also known as an event camera, detects changes in light intensity asynchronously and has proven effective in high-speed, low-light traffic environments. Recognizing the complementary characteristics of these two modalities, this paper proposes UTA-Sign, an unsupervised thermal-event video augmentation for traffic signage in low-illumination environments, targeting elements such as license plates and roadblock indicators. To address the signage blind spots of thermal imaging and the non-uniform sampling of event cameras, we developed a dual-boosting mechanism that fuses thermal frames and event signals for consistent signage representation over time. The proposed method utilizes thermal frames to provide accurate motion cues as temporal references for aligning the uneven event signals. At the same time, event signals contribute subtle signage content to the raw thermal frames, enhancing the overall understanding of the environment. The proposed method is validated on datasets collected from real-world scenarios, demonstrating superior quality in traffic signage sketching and improved detection accuracy at the perceptual level.

[CV-57] FastFit: Accelerating Multi-Reference Virtual Try-On via Cacheable Diffusion Models

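[Quick read]: This paper addresses two major obstacles to real-world use of virtual try-on: existing methods struggle to support multi-reference outfit compositions (garments and accessories), and they are markedly inefficient because reference features are recomputed at every denoising step. The key to its solution is FastFit, a framework built on a novel cacheable diffusion architecture. By introducing a Semi-Attention mechanism and encoding reference items with class embeddings instead of the traditional timestep embeddings, it fully decouples reference feature encoding from the denoising process, so reference features are computed once and reused losslessly across all denoising steps. This removes the efficiency bottleneck at its root and yields an average 3.5x inference speedup over comparable methods.

Link: https://arxiv.org/abs/2508.20586
Authors: Zheng Chong, Yanwei Lei, Shiyue Zhang, Zhuandi He, Zhen Wang, Xujie Zhang, Xiao Dong, Yiling Wu, Dongmei Jiang, Xiaodan Liang
Affiliations: Sun Yat-sen University; LavieAI; Pengcheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 10 figures, 5 tables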

Abstract:Despite its great potential, virtual try-on technology is hindered from real-world application by two major challenges: the inability of current methods to support multi-reference outfit compositions (including garments and accessories), and their significant inefficiency caused by the redundant re-computation of reference features in each denoising step. To address these challenges, we propose FastFit, a high-speed multi-reference virtual try-on framework based on a novel cacheable diffusion architecture. By employing a Semi-Attention mechanism and substituting traditional timestep embeddings with class embeddings for reference items, our model fully decouples reference feature encoding from the denoising process with negligible parameter overhead. This allows reference features to be computed only once and losslessly reused across all steps, fundamentally breaking the efficiency bottleneck and achieving an average 3.5x speedup over comparable methods. Furthermore, to facilitate research on complex, multi-reference virtual try-on, we introduce DressCode-MR, a new large-scale dataset. It comprises 28,179 sets of high-quality, paired images covering five key categories (tops, bottoms, dresses, shoes, and bags), constructed through a pipeline of expert models and human feedback refinement. Extensive experiments on the VITON-HD, DressCode, and our DressCode-MR datasets show that FastFit surpasses state-of-the-art methods on key fidelity metrics while offering its significant advantage in inference efficiency.
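
Why decoupling reference encoding from the timestep makes caching possible can be shown with a toy loop. This is a sketch under stated assumptions: `CountingEncoder`, `denoise_step`, and both runners are hypothetical stand-ins, not FastFit's actual modules.

```python
class CountingEncoder:
    """Stand-in reference encoder that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, reference):
        self.calls += 1
        # Toy feature: independent of the denoising step by construction.
        return [float(len(reference))]


def denoise_step(latent, ref_features, step):
    """Toy update that mixes the latent with the reference features."""
    ref_sum = sum(f[0] for f in ref_features)
    return [0.9 * x + 0.1 * ref_sum for x in latent]


def run_uncached(latent, references, encoder, num_steps):
    """Re-encode every reference at every step (the redundant baseline)."""
    for step in range(num_steps):
        feats = [encoder(r) for r in references]
        latent = denoise_step(latent, feats, step)
    return latent


def run_cached(latent, references, encoder, num_steps):
    """Encode each reference once and reuse the features at every step."""
    feats = [encoder(r) for r in references]
    for step in range(num_steps):
        latent = denoise_step(latent, feats, step)
    return latent
```

The cached loop is only equivalent to the uncached one because the encoder's output does not depend on `step`; if reference features were conditioned on a timestep embedding, they would have to be recomputed per step.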

[CV-58] GLaRE: A Graph-based Landmark Region Embedding Network for Emotion Recognition

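[Quick read]: This paper addresses the limited performance of traditional facial expression recognition (FER) systems in the face of occlusion, expression variability, and a lack of interpretability. The key to its solution is GLaRE, a Graph-based Landmark Region Embedding network: facial landmarks are extracted with 3D facial alignment, and a quotient graph is built through hierarchical coarsening, which preserves spatial structure while reducing computational complexity and enables more structured, interpretable feature learning. Experiments show accuracies of 64.89% on AffectNet and 94.24% on FERG, outperforming several existing baselines, and ablation studies confirm that region-level embeddings improve prediction performance.

Link: https://arxiv.org/abs/2508.20579
Authors: Debasis Maji, Debaditya Barman
Affiliations: Visva-Bharati
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures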

Abstract:Facial expression recognition (FER) is a crucial task in computer vision with a wide range of applications including human computer interaction, surveillance, and assistive technologies. However, challenges such as occlusion, expression variability, and lack of interpretability hinder the performance of traditional FER systems. Graph Neural Networks (GNNs) offer a powerful alternative by modeling relational dependencies between facial landmarks, enabling structured and interpretable learning. In this paper, we propose GLaRE, a novel Graph-based Landmark Region Embedding network for emotion recognition. Facial landmarks are extracted using 3D facial alignment, and a quotient graph is constructed via hierarchical coarsening to preserve spatial structure while reducing complexity. Our method achieves 64.89% accuracy on AffectNet and 94.24% on FERG, outperforming several existing baselines. Additionally, ablation studies have demonstrated that region-level embeddings from quotient graphs have contributed to improved prediction performance.
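
A quotient graph collapses each group of nodes into one node and keeps an edge between two groups whenever some original edge crosses between them. The sketch below shows one coarsening level on a toy landmark graph; the region names and the `quotient_graph` helper are illustrative assumptions, not the paper's partition or code.

```python
def quotient_graph(edges, region_of):
    """Collapse a landmark graph into a region-level quotient graph.

    edges:     iterable of (u, v) landmark pairs
    region_of: dict mapping each landmark to its region label
    Returns (sorted region nodes, sorted region-level edges).
    """
    q_edges = set()
    for u, v in edges:
        ru, rv = region_of[u], region_of[v]
        if ru != rv:  # edges inside one region disappear after merging
            q_edges.add(tuple(sorted((ru, rv))))
    return sorted(set(region_of.values())), sorted(q_edges)
```

For example, a chain of five landmarks grouped into `"brow"`, `"eye"`, and `"mouth"` regions reduces to a three-node chain, so the neighbourhood structure between regions survives while the node count drops.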

[CV-59] Towards Mechanistic Defenses Against Typographic Attacks in CLIP

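[Quick read]: This paper addresses the threat that typographic attacks pose to CLIP vision encoders, where text injected into an image induces misclassification, malicious content generation, or even Vision-Language Model jailbreaks. The key to its solution is to identify and intervene on the attention heads that specialize in extracting and transmitting text information; these heads are concentrated in the latter half of the model's layers and pass typographic information to the cls token. The authors propose a training-free defense that selectively ablates the set of attention heads forming this "typographic circuit", improving robustness to typographic attacks (by up to 19.6% on a typographic variant of ImageNet-100) without significantly harming standard image classification performance. The approach remains competitive with existing defenses that rely on finetuning, and the authors release a family of robust "dyslexic CLIP" models that can serve as drop-in replacements for the original models in safety-critical settings.

Link: https://arxiv.org/abs/2508.20570
Authors: Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek
Affiliations: Fraunhofer Heinrich Hertz Institute; University of Oxford; Technological University Dublin; Technische Universität Berlin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: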

Abstract:Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model’s layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
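
In a transformer, the attention output written to the residual stream is a sum of per-head contributions, so ablating a head amounts to dropping its term from that sum. The sketch below illustrates this with zero-ablation on toy vectors; the `(layer, head)` indices are made up, and the paper's exact ablation procedure is not specified in the abstract.

```python
def combine_heads(head_outputs, ablated=frozenset()):
    """Sum per-head contributions, skipping any ablated heads.

    head_outputs: dict mapping (layer, head) -> contribution vector
    ablated:      set of (layer, head) keys to zero out (hypothetical circuit)
    """
    dim = len(next(iter(head_outputs.values())))
    total = [0.0] * dim
    for head, vec in head_outputs.items():
        if head in ablated:
            continue  # zero-ablation: this head contributes nothing
        for i, v in enumerate(vec):
            total[i] += v
    return total
```

Since nothing is retrained, the remaining heads behave exactly as before, which is what makes this kind of intervention training-free.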

[CV-60] Contrastive Learning through Auxiliary Branch for Video Object Detection

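[Quick read]: This paper addresses the drop in video object detection performance caused by image degradation (such as motion blur, occlusion, and deformation), with the specific goal of improving robustness without adding computational cost at inference time. The key to its solution is a simple yet effective method, Contrastive Learning through Auxiliary Branch (CLAB). First, an auxiliary branch trained with a contrastive loss strengthens the feature representation of the detector's backbone. Second, a dynamic loss weighting strategy emphasizes auxiliary feature learning early in training and gradually shifts priority to the main detection task as training converges, yielding consistent performance gains.

Link: https://arxiv.org/abs/2508.20551
Authors: Lucas Rakotoarivony
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted paper for ACIVS 2025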

Abstract:Video object detection is a challenging task because videos often suffer from image deterioration such as motion blur, occlusion, and deformable shapes, making it significantly more difficult than detecting objects in still images. Prior approaches have improved video object detection performance by employing feature aggregation and complex post-processing techniques, though at the cost of increased computational demands. To improve robustness to image degradation without additional computational load during inference, we introduce a straightforward yet effective Contrastive Learning through Auxiliary Branch (CLAB) method. First, we implement a contrastive auxiliary branch using a contrastive loss to enhance the feature representation capability of the video object detector’s backbone. Next, we propose a dynamic loss weighting strategy that emphasizes auxiliary feature learning early in training while gradually prioritizing the detection task as training converges. We validate our approach through comprehensive experiments and ablation studies, demonstrating consistent performance gains. Without bells and whistles, CLAB reaches a performance of 84.0% mAP and 85.2% mAP with ResNet-101 and ResNeXt-101, respectively, on the ImageNet VID dataset, thus achieving state-of-the-art performance for CNN-based models without requiring additional post-processing methods.
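
A dynamic loss weight of the kind described (large early, small late) could look like the following. The cosine schedule, the endpoint values, and the function names are assumptions chosen for illustration; the abstract does not state which schedule CLAB uses.

```python
import math


def auxiliary_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Cosine decay of the auxiliary-loss weight from w_start to w_end.

    This schedule is an illustrative assumption, not the paper's formula.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def total_loss(det_loss, contrastive_loss, step, total_steps):
    """Detection loss plus the scheduled auxiliary contrastive loss."""
    return det_loss + auxiliary_weight(step, total_steps) * contrastive_loss
```

Because the auxiliary branch only contributes a training loss, it can be discarded at inference time, which is consistent with the claim of no added inference cost.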

[CV-61] SPGrasp: Spatiotemporal Prompt-driven Grasp Synthesis in Dynamic Scenes

[Quick read]: This paper addresses the difficulty of combining real-time interaction with low-latency inference in grasp synthesis for dynamic objects: existing methods fail to achieve low-latency inference while maintaining promptability. The key to its solution is the SPGrasp framework, whose core innovation is integrating user prompts with spatiotemporal context. This enables temporally consistent grasp estimation for dynamic objects with end-to-end latency as low as 59 ms, effectively balancing real-time performance against interactivity.

Link: https://arxiv.org/abs/2508.20547
Authors: Yunpeng Mei, Hongjie Cao, Yinqiu Xia, Wei Xiao, Zhaohan Feng, Gang Wang, Jie Chen
Affiliations: Beijing Institute of Technology; Harbin Institute of Technology
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Real-time interactive grasp synthesis for dynamic objects remains challenging as existing methods fail to achieve low-latency inference while maintaining promptability. To bridge this gap, we propose SPGrasp (spatiotemporal prompt-driven dynamic grasp synthesis), a novel framework extending segment anything model v2 (SAMv2) for video stream grasp estimation. Our core innovation integrates user prompts with spatiotemporal context, enabling real-time interaction with end-to-end latency as low as 59 ms while ensuring temporal consistency for dynamic objects. In benchmark evaluations, SPGrasp achieves instance-level grasp accuracies of 90.6% on OCID and 93.8% on Jacquard. On the challenging GraspNet-1Billion dataset under continuous tracking, SPGrasp achieves 92.0% accuracy with 73.1 ms per-frame latency, representing a 58.5% reduction compared to the prior state-of-the-art promptable method RoG-SAM while maintaining competitive accuracy. Real-world experiments involving 13 moving objects demonstrate a 94.8% success rate in interactive grasping scenarios. These results confirm SPGrasp effectively resolves the latency-interactivity trade-off in dynamic grasp synthesis. Code is available at this https URL.

[CV-62] Domain Adaptation Techniques for Natural and Medical Image Classification

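[Quick read]: This paper examines how domain adaptation (DA) techniques differ in performance and applicability between natural and medical image classification, with particular attention to the weak generalization caused by the large distribution shifts and limited samples typical of medical data. The key to its approach is a systematic experimental study of seven mainstream DA methods on five natural and eight medical image datasets. It finds that the Deep Subdomain Adaptation Network (DSAN) performs strongly across scenarios: for example, it reaches 91.2% classification accuracy on a COVID-19 dataset with ResNet50 and improves on the baseline by 6.7% in the dynamic data stream scenario. DSAN also shows notable explainability on COVID-19 and skin cancer datasets. These results provide empirical evidence and methodological guidance for applying DA effectively to medical images.

Link: https://arxiv.org/abs/2508.20537
Authors: Ahmad Chaddad, Yihang Wu, Reem Kateb, Christian Desrosiers
Affiliations: Guilin University of Electronic Technology; Ecole de Technologie Superieure; Taibah University; Jeddah University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in Information Sciences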

Abstract:Domain adaptation (DA) techniques have the potential in machine learning to alleviate distribution differences between training and test sets by leveraging information from source domains. In image classification, most advances in DA have been made using natural images rather than medical data, which are harder to work with. Moreover, even for natural images, the use of mainstream datasets can lead to performance bias. With the aim of better understanding the benefits of DA for both natural and medical images, this study performs 557 simulation studies using seven widely-used DA techniques for image classification in five natural and eight medical datasets that cover various scenarios, such as out-of-distribution, dynamic data streams, and limited training samples. Our experiments yield detailed results and insightful observations highlighting the performance and medical applicability of these techniques. Notably, our results have shown the outstanding performance of the Deep Subdomain Adaptation Network (DSAN) algorithm. This algorithm achieved feasible classification accuracy (91.2%) in the COVID-19 dataset using ResNet50 and showed an important accuracy improvement in the dynamic data stream DA scenario (+6.7%) compared to the baseline. Our results also demonstrate that DSAN exhibits a remarkable level of explainability when evaluated on COVID-19 and skin cancer datasets. These results contribute to the understanding of DA techniques and offer valuable insight into the effective adaptation of models to medical data.

[CV-63] Digital Scale: Open-Source On-Device BMI Estimation from Smartphone Camera Images Trained on a Large-Scale Real-World Dataset

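[Quick read]: This paper addresses rapid weight assessment from camera images, particularly in settings where traditional measurement is unavailable or impractical, such as telehealth or emergencies. The core challenge is that existing computer vision approaches have been limited to small datasets (at most 14,500 images), which restricts generalization. The key to its solution is WayBED, a large proprietary dataset of 84,963 smartphone images from 25,353 individuals, together with an automatic filtering method that uses posture clustering and person detection to remove low-quality images (such as atypical postures or incomplete views), retaining 71,322 high-quality images for training. The method achieves a Mean Absolute Percentage Error (MAPE) of 7.9% on the authors' hold-out test set, which they report as the lowest in the literature to the best of their knowledge. On the completely unseen VisualBodyToBMI dataset it reaches a MAPE of 13%, indicating robust generalization, and fine-tuning lowers the MAPE on that dataset to 8.56%, the lowest value reported so far.

Link: https://arxiv.org/abs/2508.20534
Authors: Frederik Rajiv Manichand, Robin Deuber, Robert Jakob, Steve Swerling, Jamie Rosen, Elgar Fleisch, Patrick Langer
Affiliations: ETH Zurich; University of St. Gallen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: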

Abstract:Estimating Body Mass Index (BMI) from camera images with machine learning models enables rapid weight assessment when traditional methods are unavailable or impractical, such as in telehealth or emergency scenarios. Existing computer vision approaches have been limited to datasets of up to 14,500 images. In this study, we present a deep learning-based BMI estimation method trained on our WayBED dataset, a large proprietary collection of 84,963 smartphone images from 25,353 individuals. We introduce an automatic filtering method that uses posture clustering and person detection to curate the dataset by removing low-quality images, such as those with atypical postures or incomplete views. This process retained 71,322 high-quality images suitable for training. We achieve a Mean Absolute Percentage Error (MAPE) of 7.9% on our hold-out test set (WayBED data) using full-body images, the lowest value in the published literature to the best of our knowledge. Further, we achieve a MAPE of 13% on the completely unseen (during training) VisualBodyToBMI dataset, comparable with state-of-the-art approaches trained on it, demonstrating robust generalization. Lastly, we fine-tune our model on VisualBodyToBMI and achieve a MAPE of 8.56%, the lowest reported value on this dataset so far. We deploy the full pipeline, including image filtering and BMI estimation, on Android devices using the CLAID framework. We release our complete code for model training, filtering, and the CLAID package for mobile deployment as open-source contributions.
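
For readers unfamiliar with the metric, the sketch below computes BMI from weight and height and the Mean Absolute Percentage Error between true and predicted values. The sample numbers are invented for illustration and are not drawn from WayBED or VisualBodyToBMI.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    errors = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)
```

With true BMIs of 20 and 25 and predictions of 22 and 24, the relative errors are 10% and 4%, so the MAPE is 7%. A reported MAPE of 7.9% therefore corresponds to an average relative error of roughly that size per prediction.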

[CV-64] Enhancing Pseudo-Boxes via Data-Level LiDAR-Camera Fusion for Unsupervised 3D Object Detection ACM-MM2025

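[Quick read]: This paper addresses the heavy dependence of LiDAR-based 3D object detection on costly manual annotations, and specifically how RGB images can help generate high-quality pseudo-boxes for more effective unsupervised learning. The key to its solution is a data-level fusion framework that integrates RGB images with LiDAR point clouds at an early stage. Vision foundation models first perform instance segmentation and depth estimation on the images; a bi-directional fusion mechanism then projects 2D pixels into 3D space to increase the density of real points and maps category labels from 2D space onto the LiDAR points. Local and global filtering suppresses depth estimation errors and segmentation noise, and a data-level-fusion-based dynamic self-evolution strategy iteratively refines pseudo-boxes under a dense representation, markedly improving localization accuracy. Experiments show that the method reaches 28.4% mAP on the nuScenes validation benchmark.

Link: https://arxiv.org/abs/2508.20530
Authors: Mingqian Ji, Jian Yang, Shanshan Zhang
Affiliations: Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2025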

Abstract:Existing LiDAR-based 3D object detectors typically rely on manually annotated labels for training to achieve good performance. However, obtaining high-quality 3D labels is time-consuming and labor-intensive. To address this issue, recent works explore unsupervised 3D object detection by introducing RGB images as an auxiliary modality to assist pseudo-box generation. However, these methods simply integrate pseudo-boxes generated by LiDAR point clouds and RGB images. Yet, such a label-level fusion strategy brings limited improvements to the quality of pseudo-boxes, as it overlooks the complementary nature in terms of LiDAR and RGB image data. To overcome the above limitations, we propose a novel data-level fusion framework that integrates RGB images and LiDAR data at an early stage. Specifically, we utilize vision foundation models for instance segmentation and depth estimation on images and introduce a bi-directional fusion method, where real points acquire category labels from the 2D space, while 2D pixels are projected onto 3D to enhance real point density. To mitigate noise from depth and segmentation estimations, we propose a local and global filtering method, which applies local radius filtering to suppress depth estimation errors and global statistical filtering to remove segmentation-induced outliers. Furthermore, we propose a data-level fusion based dynamic self-evolution strategy, which iteratively refines pseudo-boxes under a dense representation, significantly improving localization accuracy. Extensive experiments on the nuScenes dataset demonstrate that the detector trained by our method significantly outperforms that trained by previous state-of-the-art methods with 28.4% mAP on the nuScenes validation benchmark.
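
The two filtering stages named in the abstract can be sketched with generic point-cloud outlier filters. This is an illustration under assumptions: the function names, thresholds, and the centroid-distance statistic are hypothetical choices, and the paper's exact criteria may differ.

```python
import math
import statistics


def radius_filter(points, radius, min_neighbors):
    """Local filter: keep points with enough neighbours within `radius`."""
    kept = []
    for i, p in enumerate(points):
        neighbours = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        if neighbours >= min_neighbors:
            kept.append(p)
    return kept


def statistical_filter(points, num_std=2.0):
    """Global filter: drop points unusually far from the cloud centroid.

    A point is kept if its centroid distance is at most
    mean + num_std * (population standard deviation) of all distances.
    """
    centroid = tuple(statistics.fmean(c) for c in zip(*points))
    dists = [math.dist(p, centroid) for p in points]
    limit = statistics.fmean(dists) + num_std * statistics.pstdev(dists)
    return [p for p, d in zip(points, dists) if d <= limit]
```

The brute-force neighbour search is quadratic in the number of points; a real pipeline would typically use a spatial index such as a k-d tree.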

[CV-65] Learning What is Worth Learning: Active and Sequential Domain Adaptation for Multi-modal Gross Tumor Volume Segmentation

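[Quick read]: This paper addresses the performance limits that scarce annotations impose on gross tumor volume (GTV) segmentation in multi-modal medical images, against the backdrop of radiotherapy planning's need for highly accurate segmentation. Traditional methods in the active domain adaptation (ADA) setting risk sample redundancy and negative transfer, and lack an effective sample selection strategy for multi-modal data. The key to its solution is an active and sequential domain adaptation framework with dynamic multi-modal sample selection: a joint query strategy based on sample informativeness and representativeness prioritizes the most valuable samples for labeling and training, improving generalization under a limited annotation budget and significantly outperforming existing ADA methods.

Link: https://arxiv.org/abs/2508.20528
Authors: Jingyun Yang, Guoqing Zhang, Jingge Wang, Yang Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: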

Abstract:Accurate gross tumor volume segmentation on multi-modal medical data is critical for radiotherapy planning in nasopharyngeal carcinoma and glioblastoma. Recent advances in deep neural networks have brought promising results in medical image segmentation, leading to an increasing demand for labeled data. Since labeling medical images is time-consuming and labor-intensive, active learning has emerged as a solution to reduce annotation costs by selecting the most informative samples to label and adapting high-performance models with as few labeled samples as possible. Previous active domain adaptation (ADA) methods seek to minimize sample redundancy by selecting samples that are farthest from the source domain. However, such one-off selection can easily cause negative transfer, and access to source medical data is often limited. Moreover, the query strategy for multi-modal medical data remains unexplored. In this work, we propose an active and sequential domain adaptation framework for dynamic multi-modal sample selection in ADA. We derive a query strategy to prioritize labeling and training on the most valuable samples based on their informativeness and representativeness. Empirical validation on diverse gross tumor volume segmentation tasks demonstrates that our method achieves favorable segmentation performance, significantly outperforming state-of-the-art ADA methods. Code is available at the git repository: mmActS (this https URL).
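
A query strategy that combines informativeness and representativeness can be sketched as a weighted score. Everything here is an assumption for illustration: entropy stands in for informativeness, mean cosine similarity to other unlabeled samples stands in for representativeness, and `alpha` is a hypothetical mixing weight. The paper derives its own strategy and handles multiple modalities, which this toy does not.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def query_scores(probs, feats, alpha=0.5):
    """Score = alpha * normalised entropy + (1 - alpha) * mean similarity."""
    scores = []
    for i, (p, f) in enumerate(zip(probs, feats)):
        info = entropy(p) / math.log(len(p))
        sims = [cosine(f, g) for j, g in enumerate(feats) if j != i]
        scores.append(alpha * info + (1 - alpha) * sum(sims) / len(sims))
    return scores


def select_for_labeling(probs, feats, budget, alpha=0.5):
    """Return indices of the `budget` highest-scoring samples."""
    scores = query_scores(probs, feats, alpha)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]
```

A sample that the model is confident about and that sits far from the rest of the pool scores low on both terms, so it is the last to be sent for annotation.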

[CV-66] Adam SLAM - the last mile of camera calibration with 3DGS

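[Quick read]: This paper addresses the difficulty of assessing camera calibration quality for real scenes: because real scenes have no ground truth, calibration quality is usually judged by the quality of novel view synthesis. The key to its solution is to fine-tune the camera parameters with a 3DGS model by backpropagating the novel view color loss, thereby improving calibration accuracy. Experiments show an average PSNR improvement of 0.4 dB on the dataset used as reference by 3DGS; the approach is especially suited to reference scenes where novel view quality matters most, such as Mip-NeRF 360.

Link: https://arxiv.org/abs/2508.20526
Authors: Matthieu Gendrin, Stéphane Pateux, Xiaoran Jiang, Théo Ladune, Luce Morin
Affiliations: Orange Innovation; Univ Rennes; INSA Rennes; CNRS; IETR (UMR 6164)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: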

Abstract:The quality of the camera calibration is of major importance for evaluating progress in novel view synthesis, as a 1-pixel error on the calibration has a significant impact on the reconstruction quality. While there is no ground truth for real scenes, the quality of the calibration is assessed by the quality of the novel view synthesis. This paper proposes to use a 3DGS model to fine tune calibration by backpropagation of novel view color loss with respect to the camera parameters. The new calibration alone brings an average improvement of 0.4 dB PSNR on the dataset used as reference by 3DGS. The fine tuning may be long and its suitability depends on the criticality of training time, but for calibration of reference scenes, such as Mip-NeRF 360, the stake of novel view quality is the most important.
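
To put the 0.4 dB figure in perspective, PSNR is a logarithmic function of mean squared error, so a fixed dB gain corresponds to a fixed multiplicative reduction in MSE. The sketch below uses the standard PSNR definition; the helper `mse_ratio_for_gain` is a hypothetical name introduced for this illustration.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to gain `gain_db` dB of PSNR."""
    return 10.0 ** (-gain_db / 10.0)
```

Under this definition a 0.4 dB gain means the MSE is multiplied by about 0.912, that is, a reduction of roughly 9% in mean squared color error.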

[CV-67] DCFS: Continual Test-Time Adaptation via Dual Consistency of Feature and Sample

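[Quick read]: This paper addresses the learning bias and unreliable pseudo-label quality that arise in Continual Test-Time Adaptation (CTTA) without access to source-domain data, and the error accumulation that follows. The key to its solution is the DCFS framework, which introduces dual-path feature consistency and confidence-aware sample learning. First, dual classifiers disentangle the target-domain features into a semantic-related feature and a domain-related feature, and consistency between these sub-features and the whole feature lets the model capture features from multiple perspectives. Second, a confidence score is computed for each sample against an adaptive threshold and used for loss-weighted self-supervised learning, which suppresses pseudo-label noise and mitigates error propagation.

Link: https://arxiv.org/abs/2508.20516
Authors: Wenting Yin, Han Sun, Xinru Meng, Ningzhong Liu, Huiyu Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, accepted by PRCV2025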

Abstract:Continual test-time adaptation aims to continuously adapt a pre-trained model to a stream of target domain data without accessing source data. Without access to source domain data, the model focuses solely on the feature characteristics of the target data. Relying exclusively on these features can lead to confusion and introduce learning biases. Currently, many existing methods generate pseudo-labels via model predictions. However, the quality of pseudo-labels cannot be guaranteed and the problem of error accumulation must be solved. To address these challenges, we propose DCFS, a novel CTTA framework that introduces dual-path feature consistency and confidence-aware sample learning. This framework disentangles the whole feature representation of the target data into semantic-related feature and domain-related feature using dual classifiers to learn distinct feature representations. By maintaining consistency between the sub-features and the whole feature, the model can comprehensively capture data features from multiple perspectives. Additionally, to ensure that the whole feature information of the target domain samples is not overlooked, we set an adaptive threshold and calculate a confidence score for each sample to carry out loss weighted self-supervised learning, effectively reducing the noise of pseudo-labels and alleviating the problem of error accumulation. The efficacy of our proposed method is validated through extensive experimentation across various datasets, including CIFAR10-C, CIFAR100-C, and ImageNet-C, demonstrating consistent performance in continual test-time adaptation scenarios.
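
Confidence-weighted learning on pseudo-labels with an adaptive threshold might be sketched as follows. The specifics are assumptions made for illustration: confidence is taken as the maximum softmax probability, samples below the threshold get zero weight, and the threshold follows an exponential moving average. DCFS's actual score and threshold rule are not given in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_weighted_loss(batch_logits, threshold):
    """Cross-entropy against argmax pseudo-labels, weighted by confidence.

    Samples whose confidence is below `threshold` receive weight 0.
    """
    loss_sum, weight_sum = 0.0, 0.0
    for logits in batch_logits:
        conf = max(softmax(logits))
        weight = conf if conf >= threshold else 0.0
        loss_sum += weight * -math.log(conf)
        weight_sum += weight
    return loss_sum / weight_sum if weight_sum else 0.0


def update_threshold(threshold, batch_confidences, momentum=0.9):
    """Adapt the threshold with an exponential moving average (assumed rule)."""
    mean_conf = sum(batch_confidences) / len(batch_confidences)
    return momentum * threshold + (1.0 - momentum) * mean_conf
```

Down-weighting or dropping low-confidence samples limits how much a wrong pseudo-label can move the model, which is the mechanism by which such schemes aim to slow error accumulation.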

[CV-68] Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent ICCV2025

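[Quick read]: This paper addresses two challenges of semantic image editing in text-to-image generation: inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models are limited by dataset quality and scale. The key to its solution is DescriptiveEdit, a descriptive-prompt-based editing framework that recasts "instruction-based image editing" as "reference-image-based text-to-image generation", retaining the generative power of pretrained text-to-image models without architectural modification or image inversion. It introduces a Cross-Attentive UNet that adds attention bridges to inject reference image features into the prompt-to-edit-image generation process, which improves editing accuracy and consistency; the framework is compatible with extensions such as ControlNet and IP-Adapter and scales well.

Link: https://arxiv.org/abs/2508.20505
Authors: En Ci, Shanyan Guan, Yanhao Ge, Yilin Zhang, Wei Li, Zhenyu Zhang, Jian Yang, Ying Tai
Affiliations: Nanjing University; vivo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025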

Abstract:Despite the progress in text-to-image generation, semantic image editing remains a challenge. Inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models mainly suffer from limited dataset quality and scale. To address these problems, we propose a descriptive-prompt-based editing framework, named DescriptiveEdit. The core idea is to re-frame "instruction-based image editing" as "reference-image-based text-to-image generation", which preserves the generative power of well-trained Text-to-Image models without architectural modifications or inversion. Specifically, taking the reference image and a prompt as input, we introduce a Cross-Attentive UNet, which newly adds attention bridges to inject reference image features into the prompt-to-edit-image generation process. Owing to its text-to-image nature, DescriptiveEdit overcomes limitations in instruction dataset quality, integrates seamlessly with ControlNet, IP-Adapter, and other extensions, and is more scalable. Experiments on the Emu Edit benchmark show it improves editing accuracy and consistency.

[CV-69] IAENet: An Importance-Aware Ensemble Model for 3D Point Cloud-Based Anomaly Detection

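[Quick read]: This paper addresses surface anomaly detection on 3D point clouds in industrial manufacturing. Its core challenge is that 3D lacks powerful pretrained foundation backbones comparable to those for 2D images, which limits existing methods. The key to its solution is the Importance-Aware Ensemble Network (IAENet), which combines a 2D pretrained expert with 3D expert models and introduces a novel Importance-Aware Fusion (IAF) module that dynamically assesses each source's contribution and reweights its anomaly scores, mitigating the damage a poorly performing source can do to overall accuracy. Dedicated loss functions guide the IAF module to combine knowledge from the sources while preserving their individual strengths, improving detection accuracy and in particular lowering the false positive rate.

Link: https://arxiv.org/abs/2508.20492
Authors: Xuanming Cao, Chengyu Tao, Yifeng Cheng, Juan Du
Affiliations: Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology (Hong Kong)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: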

Abstract:Surface anomaly detection is pivotal for ensuring product quality in industrial manufacturing. While 2D image-based methods have achieved remarkable success, 3D point cloud-based detection remains underexplored despite its richer geometric cues. We argue that the key bottleneck is the absence of powerful pretrained foundation backbones in 3D comparable to those in 2D. To bridge this gap, we propose Importance-Aware Ensemble Network (IAENet), an ensemble framework that synergizes a 2D pretrained expert with 3D expert models. However, naively fusing predictions from disparate sources is non-trivial: existing strategies can be affected by a poorly performing modality and thus degrade overall accuracy. To address this challenge, we introduce a novel Importance-Aware Fusion (IAF) module that dynamically assesses the contribution of each source and reweights their anomaly scores. Furthermore, we devise critical loss functions that explicitly guide the optimization of IAF, enabling it not only to combine the collective knowledge of the source experts but also to preserve their unique strengths, thereby enhancing the overall performance of anomaly detection. Extensive experiments on MVTec 3D-AD demonstrate that our IAENet achieves a new state-of-the-art with a markedly lower false positive rate, underscoring its practical value for industrial deployment.
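
Reweighting anomaly scores by source importance can be illustrated with a softmax over per-source importance logits. This is only a sketch: in IAENet the importance is produced by a learned IAF module, while here `importance_logits` is a hand-set, hypothetical input.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_anomaly_scores(scores_2d, scores_3d, importance_logits):
    """Fuse per-point anomaly scores from a 2D and a 3D expert.

    importance_logits: [logit_2d, logit_3d]; higher means more trusted.
    """
    w_2d, w_3d = softmax(importance_logits)
    return [w_2d * a + w_3d * b for a, b in zip(scores_2d, scores_3d)]
```

With equal logits the fusion reduces to a plain average; as one logit grows, the fused score approaches that expert's score, which is how a weak source can be prevented from dragging down the result.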

[CV-70] CaddieSet: A Golf Swing Dataset with Human Joint Features and Ball Information

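[Quick read]: This paper addresses the fact that existing deep learning studies do not explicitly model the quantitative relationship between swing posture and ball trajectory, which limits their ability to give golfers useful advice for improvement. The key to its solution is a new dataset, CaddieSet, which uses a computer vision approach to segment a single swing video into eight phases and extract joint information along with various ball parameters. Drawing on expert golf domain knowledge, it also defines 15 key metrics that influence a swing, so that swing outcomes can be analyzed through interpretable features. Experiments demonstrate the dataset's feasibility for predicting ball trajectories and show that feedback based on the joint features is quantitatively consistent with established domain knowledge, offering a new basis for golf swing analysis in academia and the sports industry.

Link: https://arxiv.org/abs/2508.20491
Authors: Seunghyeon Jung, Seoyoung Hong, Jiwoo Jeong, Seungwon Jeong, Jaerim Choi, Hoki Kim, Woojin Lee
Affiliations: Dongguk University; Kimcaddie Inc; Chung-Ang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages with supplementary material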

Abstract:Recent advances in deep learning have led to more studies to enhance golfers’ shot precision. However, these existing studies have not quantitatively established the relationship between swing posture and ball trajectory, limiting their ability to provide golfers with the necessary insights for swing improvement. In this paper, we propose a new dataset called CaddieSet, which includes joint information and various ball information from a single shot. CaddieSet extracts joint information from a single swing video by segmenting it into eight swing phases using a computer vision-based approach. Furthermore, based on expert golf domain knowledge, we define 15 key metrics that influence a golf swing, enabling the interpretation of swing outcomes through swing-related features. Through experiments, we demonstrated the feasibility of CaddieSet for predicting ball trajectories using various benchmarks. In particular, we focus on interpretable models among several benchmarks and verify that swing feedback using our joint features is quantitatively consistent with established domain knowledge. This work is expected to offer new insight into golf swing analysis for both academia and the sports industry.

[CV-71] Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts ICCV2025

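[Quick read]: This paper addresses the reduced reliability of monocular 3D object detection (M3OD) under real-world domain shifts, focusing on the dual uncertainty inherent to M3OD: semantic uncertainty (ambiguous class predictions) and geometric uncertainty (unstable spatial localization). Existing Test-Time Adaptation (TTA) methods recognize the positive correlation between low uncertainty and strong generalization but do not optimize both uncertainties together. The key to its solution is Dual Uncertainty Optimization (DUO), the first framework to jointly minimize both. On one side, it reformulates the structure of the focal loss through a convex optimization lens and derives an unsupervised version, enabling label-agnostic uncertainty weighting and balanced learning for high-uncertainty objects. On the other, it designs a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues, reducing the uncertainty caused by unstable 3D representations. The two branches form a complementary loop: better spatial perception improves semantic classification, and robust semantic predictions further refine spatial understanding, which improves robustness across a range of domain shift scenarios.

Link: https://arxiv.org/abs/2508.20488
Authors: Zixuan Hu, Dongxiao Li, Xinzhu Ma, Shixiang Tang, Xiaotong Li, Wenhan Yang, Ling-Yu Duan
Affiliations: Peking University; Peng Cheng Laboratory; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025 (Highlight)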

Abstract:Accurate monocular 3D object detection (M3OD) is pivotal for safety-critical applications like autonomous driving, yet its reliability deteriorates significantly under real-world domain shifts caused by environmental or sensor variations. To address these shifts, Test-Time Adaptation (TTA) methods have emerged, enabling models to adapt to target distributions during inference. While prior TTA approaches recognize the positive correlation between low uncertainty and high generalization ability, they fail to address the dual uncertainty inherent to M3OD: semantic uncertainty (ambiguous class predictions) and geometric uncertainty (unstable spatial localization). To bridge this gap, we propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. Through a convex optimization lens, we introduce an innovative convex structure of the focal loss and further derive a novel unsupervised version, enabling label-agnostic uncertainty weighting and balanced learning for high-uncertainty objects. In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues, reducing uncertainty from the unstable 3D representation. This dual-branch mechanism forms a complementary loop: enhanced spatial perception improves semantic classification, and robust semantic predictions further refine spatial understanding. Extensive experiments demonstrate the superiority of DUO over existing methods across various datasets and domain shift types.
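
To make the two kinds of objective in the abstract above concrete, here is a minimal Python sketch of the standard ingredients only: the usual supervised focal loss, and Shannon entropy, a label-free uncertainty measure that test-time adaptation methods commonly minimize. This is not DUO's derived unsupervised loss, whose form the abstract does not give; the function names and the default `gamma=2.0` are illustrative assumptions.

```python
import math


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Standard focal loss, given the probability assigned to the true class.

    The (1 - p_true) ** gamma factor down-weights confident, easy examples.
    Requires a label, so it cannot be used directly at test time.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def entropy(probs: list[float]) -> float:
    """Shannon entropy of a predicted class distribution (needs no label).

    Assumption: shown as a generic stand-in for semantic uncertainty,
    not as the loss proposed in the paper.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

With `gamma=0` the focal loss reduces to ordinary cross-entropy, which is a quick way to sanity-check an implementation.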

[CV-72] Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding

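Quick read: This paper targets long-form video understanding, which is difficult because of long-range temporal dependencies and multiple interleaved events. Existing methods usually rely on static reasoning or on external visual-language models (VLMs), which adds complexity and, without end-to-end training, yields sub-optimal performance. The key to the solution is Video-MTR, a framework built on reinforcement-learning-driven multi-turn reasoning: it selects key video segments step by step and updates its understanding of the question as it goes, giving a finer and more context-aware analysis. It also introduces a gated bi-level reward system that combines trajectory-level rewards (based on answer correctness) with turn-level rewards (emphasizing frame-query relevance), so that segment selection and question comprehension are optimized together under end-to-end training, without an external VLM, improving both accuracy and efficiency.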

Link: https://arxiv.org/abs/2508.20478
Authors: Yuan Xie, Tianshui Chen, Zheng Ge, Lionel Ni
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 9 figures

Abstract:Long-form video understanding, characterized by long-range temporal dependencies and multiple events, remains a challenge. Existing methods often rely on static reasoning or external visual-language models (VLMs), which face issues like complexity and sub-optimal performance due to the lack of end-to-end training. In this paper, we propose Video-MTR, a reinforced multi-turn reasoning framework designed to enable iterative key video segment selection and question comprehension. Unlike traditional video reasoning pipelines, which generate predictions in a single turn, Video-MTR performs reasoning in multiple turns, selecting video segments progressively based on the evolving understanding of previously processed segments and the current question. This iterative process allows for a more refined and contextually aware analysis of the video. To guide the intermediate reasoning process, we introduce a novel gated bi-level reward system, combining trajectory-level rewards based on answer correctness and turn-level rewards emphasizing frame-query relevance. This system optimizes both video segment selection and question comprehension, eliminating the need for external VLMs and allowing end-to-end training. Extensive experiments on benchmarks like VideoMME, MLVU, and EgoSchema demonstrate that Video-MTR outperforms existing methods in both accuracy and efficiency, advancing the state-of-the-art in long video understanding.
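
The bi-level reward described above can be sketched in a few lines of Python. The abstract does not spell out the gating rule or the weighting, so both are assumptions here: turn-level rewards are only paid out when the final answer is correct, and they are scaled by a hypothetical `turn_weight`. The function name and the relevance scores are likewise illustrative.

```python
def gated_bilevel_rewards(
    turn_relevance: list[float],
    answer_correct: bool,
    turn_weight: float = 0.5,
) -> list[float]:
    """Return one scalar reward per reasoning turn.

    turn_relevance: frame-query relevance score in [0, 1] for each turn.
    answer_correct: whether the final answer of the trajectory is right.
    Assumed gate: turn-level rewards count only for correct trajectories,
    so the policy cannot collect reward for relevant frames alone.
    """
    if not turn_relevance:
        raise ValueError("a trajectory needs at least one turn")
    gate = 1.0 if answer_correct else 0.0
    rewards = [gate * turn_weight * r for r in turn_relevance]
    # Trajectory-level reward is credited at the final turn.
    rewards[-1] += 1.0 if answer_correct else 0.0
    return rewards
```

For example, a two-turn trajectory with relevance scores 0.2 and 0.8 and a correct answer receives roughly 0.1 at the first turn and 1.4 at the last; the same trajectory with a wrong answer receives zero at every turn.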

[CV-73] Towards Inclusive Communication: A Unified LLM-Based Framework for Sign Language, Lip Movements, and Audio Understanding

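Quick read: This paper addresses the limits of current speech recognition and visual language translation systems for accessible communication. Existing technology relies mainly on the audio modality (for example Automatic Speech Recognition, ASR) and therefore serves deaf and hard-of-hearing people poorly, while visual alternatives such as Sign Language Translation (SLT) and lip reading (Visual Speech Recognition, VSR) have long been studied in isolation, with no unified framework for handling multimodal input jointly. The key to the solution is the first unified architecture that can handle diverse combinations of sign language, lip movements, and audio. Its main contributions are: (1) a modality-agnostic general architecture that processes heterogeneous inputs effectively; (2) showing the importance of lip movements as non-manual cues for sign language understanding, and modeling lip movements explicitly as a separate modality for the first time; (3) matching or exceeding specialized models on SLT, VSR, ASR, and audio-visual speech recognition (AVSR), which confirms the benefit of combining modalities.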

Link: https://arxiv.org/abs/2508.20476
Authors: Jeong Hun Yeo, Hyeongseop Rha, Sungjune Park, Junil Won, Yong Man Ro
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Comments: Code available at: this https URL

Abstract:Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such systems remain inherently inaccessible to individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we introduce the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and AVSR. Furthermore, our analysis reveals that explicitly modeling lip movements as a separate modality significantly improves SLT performance.

[CV-74] Enhancing Corpus Callosum Segmentation in Fetal MRI via Pathology-Informed Domain Randomization MICCAI2025

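Quick read: This paper addresses the scarcity of annotated data for rare pathologies in fetal brain imaging, such as corpus callosum dysgenesis (CCD), which limits how well deep learning models generalize. The core of the solution is a pathology-informed domain randomization strategy that embeds anatomical prior knowledge of CCD into the synthetic data generation pipeline and simulates diverse pathological brain alterations from healthy fetal data alone, enabling robust segmentation without pathological annotations. The method substantially improves segmentation accuracy and topological consistency on CCD cases and sharply reduces the estimation error of clinically relevant biomarkers such as corpus callosum length (LCC), from 10.9 mm to 0.7 mm in CCD cases, showing that domain-specific anatomical priors can mitigate data scarcity for rare conditions.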

Link: https://arxiv.org/abs/2508.20475
Authors: Marina Grifell i Plana, Vladyslav Zalevskyi, Léa Schmidt, Yvan Gomez, Thomas Sanchez, Vincent Dunet, Mériam Koob, Vanessa Siffredi, Meritxell Bach Cuadra
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at the PIPPI Workshop of MICCAI 2025

Abstract:Accurate fetal brain segmentation is crucial for extracting biomarkers and assessing neurodevelopment, especially in conditions such as corpus callosum dysgenesis (CCD), which can induce drastic anatomical changes. However, the rarity of CCD severely limits annotated data, hindering the generalization of deep learning models. To address this, we propose a pathology-informed domain randomization strategy that embeds prior knowledge of CCD manifestations into a synthetic data generation pipeline. By simulating diverse brain alterations from healthy data alone, our approach enables robust segmentation without requiring pathological annotations. We validate our method on a cohort comprising 248 healthy fetuses, 26 with CCD, and 47 with other brain pathologies, achieving substantial improvements on CCD cases while maintaining performance on both healthy fetuses and those with other pathologies. From the predicted segmentations, we derive clinically relevant biomarkers, such as corpus callosum length (LCC) and volume, and show their utility in distinguishing CCD subtypes. Our pathology-informed augmentation reduces the LCC estimation error from 1.89 mm to 0.80 mm in healthy cases and from 10.9 mm to 0.7 mm in CCD cases. Beyond these quantitative gains, our approach yields segmentations with improved topological consistency relative to available ground truth, enabling more reliable shape-based analyses. Overall, this work demonstrates that incorporating domain-specific anatomical priors into synthetic data pipelines can effectively mitigate data scarcity and enhance analysis of rare but clinically significant malformations.

[CV-75] Realistic and Controllable 3D Gaussian-Guided Object Editing for Driving Video Generation

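Quick read: This paper addresses two problems: corner cases for training and validating autonomous driving systems are costly and hazardous to collect, and existing object editing methods based on 3D Gaussian Splatting or image generative models are limited in visual fidelity and pose control precision. The key to the solution is the G^2Editor framework. Its main ideas are: injecting a 3D Gaussian representation of the edited object into the denoising process as a dense prior, for accurate pose control and spatial consistency; using a scene-level 3D bounding box layout to reconstruct occluded areas of non-target objects; and adding hierarchical fine-grained features as extra conditions during generation to guide the appearance details of the edited object. Experiments on the Waymo Open Dataset show that the method supports object repositioning, insertion, and deletion within a unified framework, outperforms existing methods in pose controllability and visual quality, and benefits downstream data-driven tasks.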

Link: https://arxiv.org/abs/2508.20471
Authors: Jiusi Li, Jackson Jiang, Jinyu Miao, Miao Long, Tuopu Wen, Peijin Jia, Shengxiang Liu, Chunlei Yu, Maolin Liu, Yuzhan Cai, Kun Jiang, Mengmeng Yang, Diange Yang
Affiliations: Tsinghua University; WUWEN AI; PhiGent Robotics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Corner cases are crucial for training and validating autonomous driving systems, yet collecting them from the real world is often costly and hazardous. Editing objects within captured sensor data offers an effective alternative for generating diverse scenarios, commonly achieved through 3D Gaussian Splatting or image generative models. However, these approaches often suffer from limited visual fidelity or imprecise pose control. To address these issues, we propose G^2Editor, a framework designed for photorealistic and precise object editing in driving videos. Our method leverages a 3D Gaussian representation of the edited object as a dense prior, injected into the denoising process to ensure accurate pose control and spatial consistency. A scene-level 3D bounding box layout is employed to reconstruct occluded areas of non-target objects. Furthermore, to guide the appearance details of the edited object, we incorporate hierarchical fine-grained features as additional conditions during generation. Experiments on the Waymo Open Dataset demonstrate that G^2Editor effectively supports object repositioning, insertion, and deletion within a unified framework, outperforming existing methods in both pose controllability and visual quality, while also benefiting downstream data-driven tasks.

[CV-76] Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation

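Quick read: This paper addresses the generalization bottleneck that data scarcity creates for 3D generation. Compared with text, images, and video, far less native 3D data is available on the internet, which holds back generative AI for 3D asset creation. The key to the solution is to exploit the commonsense priors contained in video: the spatial consistency constraints provided by multi-view observations and the rich semantic information in videos serve as an alternative supervisory signal, strengthening the model's ability to produce spatially consistent and semantically plausible 3D content. The work introduces Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and trains Droplet3D, a generative model that accepts image and dense text input. Experiments show that the approach generates spatially consistent and semantically plausible 3D content and has the potential to extend to scene-level applications.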

Link: https://arxiv.org/abs/2508.20470
Authors: Xiaochuan Li, Guoguang Du, Runze Zhang, Liang Jin, Qi Jia, Lihua Lu, Zhenhua Guo, Yaqian Zhao, Haiyang Liu, Tianqi Wang, Changsheng Li, Xiaoli Gong, Rengang Li, Baoyu Fan
Affiliations: IEIT System Co., Ltd.; Nankai University; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Scaling laws have validated the success and promise of large-data-trained models in creative generation across text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as there is far less of it available on the internet compared to the aforementioned modalities. Fortunately, there exist adequate videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information contained within the videos enables the generated content to be more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality in 3D asset generation, spanning datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and train Droplet3D, a generative model supporting both image and dense text input. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to the prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that the commonsense priors from the videos significantly facilitate 3D creation. We have open-sourced all resources including the dataset, code, technical framework, and model weights: this https URL.

[CV-77] Re-Densification Meets Cross-Scale Propagation: Real-Time Compression of LiDAR Point Clouds

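Quick read: This paper addresses the heavy storage and transmission overhead of high-precision LiDAR point cloud scans. Existing methods convert unordered points into octree or voxel structures for dense-to-sparse predictive coding, but the extreme sparsity of geometric details hinders efficient context modeling, which limits compression performance and speed. The key to the solution is a compact feature generation framework with two lightweight modules. The Geometry Re-Densification Module re-densifies the encoded sparse geometry, extracts features at a denser scale, and then re-sparsifies them for predictive coding, all while keeping the prediction head lightweight. The Cross-scale Feature Propagation Module uses occupancy cues from multiple resolution levels to guide hierarchical feature propagation, which encourages information sharing across scales, reduces redundant feature extraction, and enriches the features available to the Geometry Re-Densification Module. Together they deliver efficient context modeling and faster coding, reaching state-of-the-art compression ratios and real-time performance on the KITTI dataset (26 FPS for both encoding and decoding at 12-bit quantization).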

Link: https://arxiv.org/abs/2508.20466
Authors: Pengpeng Yu, Haoran Li, Dingquan Li, Runqing Jiang, Jing Wang, Liang Lin, Yulan Guo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:LiDAR point clouds are fundamental to various applications, yet high-precision scans incur substantial storage and transmission overhead. Existing methods typically convert unordered points into hierarchical octree or voxel structures for dense-to-sparse predictive coding. However, the extreme sparsity of geometric details hinders efficient context modeling, thereby limiting their compression performance and speed. To address this challenge, we propose to generate compact features for efficient predictive coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module re-densifies encoded sparse geometry, extracts features at denser scale, and then re-sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation. This design facilitates information sharing across scales, thereby reducing redundant feature extraction and providing enriched features for the Geometry Re-Densification Module. By integrating these two modules, our method yields a compact feature representation that provides efficient context modeling and accelerates the coding process. Experiments on the KITTI dataset demonstrate state-of-the-art compression ratios and real-time performance, achieving 26 FPS for both encoding and decoding at 12-bit quantization. Code is available at this https URL.
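
For readers unfamiliar with the octree pipeline that the abstract above builds on, the following Python sketch shows the generic preprocessing only: quantizing points to a 12-bit grid and listing the 8-bit child-occupancy codes of each octree level, from coarse (dense) to fine (sparse). These codes are what a predictive coder would then entropy-code. It does not implement the paper's re-densification or cross-scale propagation modules, and the function names and bounding-box normalization are illustrative assumptions.

```python
def quantize(points, bits=12):
    """Map float (x, y, z) points to unique integer voxels in [0, 2**bits - 1]."""
    lo = [min(p[i] for p in points) for i in range(3)]
    hi = [max(p[i] for p in points) for i in range(3)]
    # Use one scale for all axes so the grid stays cubic.
    extent = max(h - l for h, l in zip(hi, lo)) or 1.0
    scale = (2 ** bits - 1) / extent
    return sorted(
        {tuple(int(round((p[i] - lo[i]) * scale)) for i in range(3)) for p in points}
    )


def occupancy_codes(voxels, bits):
    """Return, per octree level, the 8-bit occupancy code of each occupied node.

    Level 0 is the root; bit k of a code is set when child octant k
    contains at least one voxel.
    """
    levels = []
    for depth in range(bits):
        shift = bits - depth - 1
        codes = {}
        for x, y, z in voxels:
            parent = (x >> (shift + 1), y >> (shift + 1), z >> (shift + 1))
            child = (
                (((x >> shift) & 1) << 2)
                | (((y >> shift) & 1) << 1)
                | ((z >> shift) & 1)
            )
            codes[parent] = codes.get(parent, 0) | (1 << child)
        levels.append([codes[key] for key in sorted(codes)])
    return levels
```

On real scans most fine-level codes have only one or two bits set, which is the sparsity the abstract identifies as the obstacle to context modeling.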

[CV-78] Dual-Model Weight Selection and Self-Knowledge Distillation for Medical Image Classification

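Quick read: This paper asks how to design lightweight models that match the classification performance of large pretrained models in medical settings with limited computational resources. The central challenge is to compress the model while retaining critical medical image features. The key to the solution is a mechanism that combines dual-model weight selection with self-knowledge distillation (SKD): two lightweight models are first initialized with weights taken from a large pretrained model, which transfers knowledge efficiently; SKD is then applied so that models started from multiple initial weight configurations learn from each other without significant extra computational cost, improving robustness and generalization; finally the model is fine-tuned for the target task. The method overcomes the difficulty conventional lightweight approaches have in preserving performance.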

Link: https://arxiv.org/abs/2508.20461
Authors: Ayaka Tsutsumi, Guang Li, Ren Togo, Takahiro Ogawa, Satoshi Kondo, Miki Haseyama
Affiliations: Hokkaido University; Muroran Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We propose a novel medical image classification method that integrates dual-model weight selection with self-knowledge distillation (SKD). In real-world medical settings, deploying large-scale models is often limited by computational resource constraints, which pose significant challenges for their practical implementation. Thus, developing lightweight models that achieve comparable performance to large-scale models while maintaining computational efficiency is crucial. To address this, we employ a dual-model weight selection strategy that initializes two lightweight models with weights derived from a large pretrained model, enabling effective knowledge transfer. Next, SKD is applied to these selected models, allowing the use of a broad range of initial weight configurations without imposing additional excessive computational cost, followed by fine-tuning for the target classification tasks. By combining dual-model weight selection with self-knowledge distillation, our method overcomes the limitations of conventional approaches, which often fail to retain critical information in compact models. Extensive experiments on publicly available datasets-chest X-ray images, lung computed tomography scans, and brain magnetic resonance imaging scans-demonstrate the superior performance and robustness of our approach compared to existing methods.
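
As background for the distillation step mentioned above, here is a minimal Python sketch of the classic temperature-scaled distillation loss (KL divergence between softened output distributions). The abstract does not specify the exact SKD objective or how the two selected models exchange knowledge, so this is the textbook form offered as an assumption; the function names and `temperature=2.0` are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T**2.

    In a two-model setup, each model could play the "teacher" role for
    the other; that pairing is an assumption, not the paper's recipe.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl
```

The loss is zero when both models produce the same logits and grows as their softened predictions diverge.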

[CV-79] A Spatial-Frequency Aware Multi-Scale Fusion Network for Real-Time Deepfake Detection

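Quick read: As deepfake content becomes more realistic and widespread, existing detectors are too computationally expensive to deploy in real-time settings such as video conferencing and social media; this paper addresses that gap. The key to the solution is a lightweight yet effective detection network, the Spatial-Frequency Aware Multi-Scale Fusion Network (SFMFNet). It uses a spatial-frequency hybrid aware module that jointly exploits spatial textures and frequency-domain artifacts, with a gated mechanism that increases sensitivity to subtle manipulations; a token-selective cross attention mechanism for efficient multi-level feature interaction; and a residual-enhanced blur pooling structure that retains key semantic information during downsampling. The result is a good balance between accuracy and computational efficiency, with good generalization and practical value.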

Link: https://arxiv.org/abs/2508.20449
Authors: Libo Lv, Tianyi Wang, Mengxiao Huang, Ruixia Liu, Yinglong Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to PRCV 2025

Abstract:With the rapid advancement of real-time deepfake generation techniques, forged content is becoming increasingly realistic and widespread across applications like video conferencing and social media. Although state-of-the-art detectors achieve high accuracy on standard benchmarks, their heavy computational cost hinders real-time deployment in practical applications. To address this, we propose the Spatial-Frequency Aware Multi-Scale Fusion Network (SFMFNet), a lightweight yet effective architecture for real-time deepfake detection. We design a spatial-frequency hybrid aware module that jointly leverages spatial textures and frequency artifacts through a gated mechanism, enhancing sensitivity to subtle manipulations. A token-selective cross attention mechanism enables efficient multi-level feature interaction, while a residual-enhanced blur pooling structure helps retain key semantic cues during downsampling. Experiments on several benchmark datasets show that SFMFNet achieves a favorable balance between accuracy and efficiency, with strong generalization and practical value for real-time applications.
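
Blur pooling, which the abstract above extends with a residual branch, is easiest to see in one dimension. The Python sketch below applies a small low-pass filter before subsampling, which is the plain anti-aliased downsampling idea. The residual enhancement used in SFMFNet is not described in the abstract and is not shown; the `[1, 2, 1] / 4` kernel, replicate padding, and function name are assumptions.

```python
def blur_pool_1d(signal, stride=2):
    """Low-pass filter a 1D signal, then keep every `stride`-th sample."""
    kernel = (0.25, 0.5, 0.25)
    # Replicate the edge values so the output covers the whole input.
    padded = [signal[0]] + list(signal) + [signal[-1]]
    blurred = [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]
    return blurred[::stride]
```

For the alternating input `[0, 4, 0, 4]`, plain stride-2 subsampling would return `[0, 0]` and lose the high values entirely, while the blurred version keeps part of that energy in its output.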

[CV-80] MSMVD: Exploiting Multi-scale Image Features via Multi-scale BEV Features for Multi-view Pedestrian Detection BMVC2025

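Quick read: This paper addresses the drop in multi-view pedestrian detection (MVPD) performance when pedestrians appear consistently too small or too large within a view, or at very different scales across views. The key to the solution is a new method, Multi-Scale Multi-View Detection (MSMVD). It projects multi-scale image features from each view into bird's eye view (BEV) space scale by scale, producing multi-scale BEV features that retain the properties of the image features at each scale and so improve detection of pedestrians whose scale is consistently small or large; it then fuses the multi-scale BEV features across views with a feature pyramid network (FPN), which mitigates scale differences between views. On the GMVD dataset, MSMVD improves MODA by 4.5 points over the previous best model.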

Link: https://arxiv.org/abs/2508.20447
Authors: Taiga Yamane, Satoshi Suzuki, Ryo Masumura, Shota Orihashi, Tomohiro Tanaka, Mana Ihori, Naoki Makishima, Naotaka Kawata
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by BMVC 2025

Abstract:Multi-View Pedestrian Detection (MVPD) aims to detect pedestrians in the form of a bird’s eye view (BEV) from multi-view images. In MVPD, end-to-end trainable deep learning methods have progressed greatly. However, they often struggle to detect pedestrians with consistently small or large scales in views or with vastly different scales between views. This is because they do not exploit multi-scale image features to generate the BEV feature and detect pedestrians. To overcome this problem, we propose a novel MVPD method, called Multi-Scale Multi-View Detection (MSMVD). MSMVD generates multi-scale BEV features by projecting multi-scale image features extracted from individual views into the BEV space, scale-by-scale. Each of these BEV features inherits the properties of its corresponding scale image features from multiple views. Therefore, these BEV features help the precise detection of pedestrians with consistently small or large scales in views. Then, MSMVD combines information at different scales of multiple views by processing the multi-scale BEV features using a feature pyramid network. This improves the detection of pedestrians with vastly different scales between views. Extensive experiments demonstrate that exploiting multi-scale image features via multi-scale BEV features greatly improves the detection performance, and MSMVD outperforms the previous highest MODA by 4.5 points on the GMVD dataset.
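
The headline number above is stated in MODA points. The abstract does not restate the metric, so the Python sketch below uses the commonly cited definition of Multiple Object Detection Accuracy with unit costs, as an assumption: one minus the ratio of missed detections plus false positives to the number of ground-truth objects. The function name is illustrative.

```python
def moda(false_negatives: int, false_positives: int, num_ground_truth: int) -> float:
    """Multiple Object Detection Accuracy with equal error costs.

    Can be negative when errors outnumber ground-truth objects.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    return 1.0 - (false_negatives + false_positives) / num_ground_truth
```

The metric is usually reported as a percentage, so under this definition a gain of 4.5 points corresponds to an increase of 0.045 in the returned value.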

[CV-81] Graph-Based Uncertainty Modeling and Multimodal Fusion for Salient Object Detection ICONIP2025

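Quick read: Existing salient object detection (SOD) methods tend to lose details, blur edges, and fuse single-modal information insufficiently in complex scenes. To address this, the paper proposes a dynamic uncertainty propagation and multimodal collaborative reasoning network (DUP-MCRNet). The key to the solution is twofold. First, a Dynamic Uncertainty Graph Convolution Module (DUGC) propagates uncertainty between layers over a sparse graph built from spatial semantic distance and combines it with channel adaptive interaction, which markedly improves detection accuracy for small structures and edge regions. Second, a Multimodal Collaborative Fusion Strategy (MCF) uses learnable modality gating weights to fuse the attention maps of RGB, depth, and edge features by weighting, dynamically adjusting the importance of each modality to suppress redundant or interfering information and strengthen cross-modal semantic complementarity and consistency, which improves the recognition of salient regions under occlusion, weak texture, or background interference.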

Link: https://arxiv.org/abs/2508.20415
Authors: Yuqi Xiong, Wuzhen Shi, Yang Wen, Ruhan Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICONIP 2025

Abstract:Existing salient object detection (SOD) methods are prone to losing details, blurring edges, and insufficiently fusing single-modal information in complex scenes. To address these problems, this paper proposes a dynamic uncertainty propagation and multimodal collaborative reasoning network (DUP-MCRNet). First, a dynamic uncertainty graph convolution module (DUGC) is designed to propagate uncertainty between layers through a sparse graph constructed from spatial semantic distance; combined with channel adaptive interaction, it effectively improves the detection accuracy of small structures and edge regions. Second, a multimodal collaborative fusion strategy (MCF) is proposed, which uses learnable modality gating weights to compute a weighted fusion of the attention maps of RGB, depth, and edge features. It can dynamically adjust the importance of each modality according to the scene, effectively suppress redundant or interfering information, and strengthen cross-modal semantic complementarity and consistency, thereby improving the ability to identify salient regions under occlusion, weak texture, or background interference. Finally, detection performance at the pixel level and region level is optimized through multi-scale BCE and IoU losses, cross-scale consistency constraints, and uncertainty-guided supervision mechanisms. Extensive experiments show that DUP-MCRNet outperforms various SOD methods on most common benchmark datasets, especially in terms of edge clarity and robustness to complex backgrounds. Our code is publicly available at this https URL.
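
The gated fusion of modality attention maps described above can be illustrated with a short Python sketch. The abstract says only that learnable gating weights are used for a weighted fusion, so the softmax normalization, the flattened-list map format, and the function names below are assumptions; in the actual network the gate values would be learned and could depend on the input scene.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_attention_maps(maps, gate_logits):
    """Weighted sum of per-modality attention maps.

    maps: one flattened attention map per modality (e.g. RGB, depth, edge),
          all of the same length.
    gate_logits: one scalar per modality; softmax turns them into weights
                 that sum to 1 (assumed normalization).
    """
    weights = softmax(gate_logits)
    return [
        sum(w * m[i] for w, m in zip(weights, maps))
        for i in range(len(maps[0]))
    ]
```

Equal gate logits give a plain average of the maps, while a large logit for one modality lets that modality dominate and suppresses the others.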

[CV-82] Federated Learning for Large Models in Medical Imaging: A Comprehensive Review

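Quick read: This paper addresses two core obstacles to training high-accuracy AI models for medical imaging: strict data privacy regulations and legal restrictions make it hard to build and share large centralized datasets, and as a result models are hard to update and cannot be iteratively improved with newly collected data. The key to the solution is Federated Learning (FL), a privacy-preserving distributed training framework in which multiple institutions train a model collaboratively without transferring raw sensitive images, enabling joint modeling and continuous improvement across institutions. FL supports joint training of robust networks for upstream reconstruction tasks (such as CT/MRI image reconstruction), easing data scarcity, and enables local fine-tuning for downstream clinical tasks (such as tumor diagnosis and segmentation), so that model performance keeps improving while data stays secure and compliant.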

Link: https://arxiv.org/abs/2508.20414
Authors: Mengyu Sun, Ziyuan Yang, Yongqiang Huang, Hui Yu, Yingyu Chen, Shuren Qi, Andrew Beng Jin Teoh, Yi Zhang
Affiliations: Sichuan University; Sichuan Institute of Computer Sciences; The Chinese University of Hong Kong; Yonsei University
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Artificial intelligence (AI) has demonstrated considerable potential in the realm of medical imaging. However, the development of high-performance AI models typically necessitates training on large-scale, centralized datasets. This approach is confronted with significant challenges due to strict patient privacy regulations and legal restrictions on data sharing and utilization. These limitations hinder the development of large-scale models in medical domains and impede continuous updates and training with new data. Federated Learning (FL), a privacy-preserving distributed training framework, offers a new solution by enabling collaborative model development across fragmented medical datasets. In this survey, we review FL’s contributions at two stages of the full-stack medical analysis pipeline. First, in upstream tasks such as CT or MRI reconstruction, FL enables joint training of robust reconstruction networks on diverse, multi-institutional datasets, alleviating data scarcity while preserving confidentiality. Second, in downstream clinical tasks like tumor diagnosis and segmentation, FL supports continuous model updating by allowing local fine-tuning on new data without centralizing sensitive images. We comprehensively analyze FL implementations across the medical imaging pipeline, from physics-informed reconstruction networks to diagnostic AI systems, highlighting innovations that improve communication efficiency, align heterogeneous data, and ensure secure parameter aggregation. Meanwhile, this paper provides an outlook on future research directions, aiming to serve as a valuable reference for the field’s development.
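
The parameter aggregation step at the heart of federated learning can be sketched in Python with the widely used FedAvg rule: the server averages client parameters, weighted by each client's local dataset size, so raw images never leave the institution. The survey covers many variants, and this is only the baseline rule; the flat-list parameter format and the function name are illustrative assumptions, and no secure aggregation or communication compression is shown.

```python
def federated_average(client_weights, client_sizes):
    """FedAvg: size-weighted mean of client parameter vectors.

    client_weights: one flat list of parameters per client (same length).
    client_sizes: number of local training samples at each client.
    """
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one size per client and at least one client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]
```

For two hospitals holding 1 and 3 scans with parameters `[1.0, 2.0]` and `[3.0, 4.0]`, the aggregated model is `[2.5, 3.5]`, pulled toward the larger site.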

[CV-83] Ultra-Low-Latency Spiking Neural Networks with Temporal-Dependent Integrate-and-Fire Neuron Model for Objects Detection

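Quick read: This paper addresses the poor performance of current artificial neural network (ANN) to spiking neural network (SNN) conversion methods on visual detection tasks, in particular the residual membrane potential caused by heterogeneous spiking patterns. The key to the solution is a delay-spike mechanism that mitigates the effect of residual membrane potential, together with a new temporal-dependent Integrate-and-Fire (tdIF) neuron architecture that lets IF neurons adjust their accumulation and firing behavior according to the temporal order of time-steps, so that spikes carry temporal features rather than only frequency-based ones. The method keeps energy consumption on par with the traditional IF neuron while improving the precision of feature representation, achieving high-accuracy visual detection at ultra-low latency (within 5 time-steps).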

Link: https://arxiv.org/abs/2508.20392
Authors: Chengjun Zhang, Yuhao Zhang, Jie Yang, Mohamad Sawan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures

Abstract:Spiking Neural Networks (SNNs), inspired by the brain, are characterized by minimal power consumption and swift inference capabilities on neuromorphic hardware, and have been widely applied to various visual perception tasks. Current ANN-SNN conversion methods have achieved excellent results in classification tasks with ultra-low time-steps, but their performance in visual detection tasks remains suboptimal. In this paper, we propose a delay-spike approach to mitigate the issue of residual membrane potential caused by heterogeneous spiking patterns. Furthermore, we propose a novel temporal-dependent Integrate-and-Fire (tdIF) neuron architecture for SNNs. This enables Integrate-and-Fire (IF) neurons to dynamically adjust their accumulation and firing behaviors based on the temporal order of time-steps. Our method enables spikes to exhibit distinct temporal properties, rather than relying solely on frequency-based representations. Moreover, the tdIF neuron maintains energy consumption on par with the traditional IF neuron. We demonstrate that our method achieves more precise feature representation with fewer time-steps, enabling high performance and ultra-low latency in visual detection tasks. In this study, we conduct an extensive evaluation of the tdIF method across two critical vision tasks: object detection and lane line detection. The results demonstrate that the proposed method surpasses current ANN-SNN conversion approaches, achieving state-of-the-art performance with ultra-low latency (within 5 time-steps).
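
To show what residual membrane potential means in practice, here is a minimal Python sketch of a plain Integrate-and-Fire neuron with reset by subtraction. It models only the traditional IF baseline named in the abstract; the delay-spike mechanism and the tdIF neuron are not specified there and are not implemented. The threshold value and function name are assumptions.

```python
def integrate_and_fire(inputs, threshold=1.0):
    """Simulate a plain IF neuron over a sequence of input currents.

    Returns the binary spike train and the membrane potential left over
    after the last time-step (the residual that was never signalled).
    """
    potential = 0.0
    spikes = []
    for current in inputs:
        potential += current
        if potential >= threshold:
            spikes.append(1)
            potential -= threshold  # reset by subtraction
        else:
            spikes.append(0)
    return spikes, potential
```

With only four time-steps, an input totalling 1.75 produces a single spike and leaves 0.75 unexpressed, whatever the order of arrival. That leftover is the rate-coding error that grows in importance as the number of time-steps shrinks.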

[CV-84] More Reliable Pseudo-labels, Better Performance: A Generalized Approach to Single Positive Multi-label Learning ICCV2025

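Quick read: This paper addresses pseudo-label noise and misclassification in Single Positive Multi-Label Learning (SPML), where each image comes with only one positive label. Traditional methods treat unannotated labels as unknown or negative, which introduces wrong labels and false negatives, and existing pseudo-labeling strategies can add further noise. The key to the solution is the Generalized Pseudo-Label Robust Loss (GPR Loss), which learns effectively from diverse pseudo-labels while suppressing noise, combined with the Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique. Together they form the Adaptive and Efficient Vision-Language Pseudo-Labeling (AEVLP) framework, which significantly improves multi-label classification performance.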

Link: https://arxiv.org/abs/2508.20381
Authors: Luong Tran, Thieu Vo, Anh Nguyen, Sang Dinh, Van Nguyen
Affiliations: FPT Software AI Center; National University of Singapore; University of Liverpool; Hanoi University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025

Abstract:Multi-label learning is a challenging computer vision task that requires assigning multiple categories to each image. However, fully annotating large-scale datasets is often impractical due to high costs and effort, motivating the study of learning from partially annotated data. In the extreme case of Single Positive Multi-Label Learning (SPML), each image is provided with only one positive label, while all other labels remain unannotated. Traditional SPML methods that treat missing labels as unknown or negative tend to yield inaccuracies and false negatives, and integrating various pseudo-labeling strategies can introduce additional noise. To address these challenges, we propose the Generalized Pseudo-Label Robust Loss (GPR Loss), a novel loss function that effectively learns from diverse pseudo-labels while mitigating noise. Complementing this, we introduce a simple yet effective Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique. Together, these contributions form the Adaptive and Efficient Vision-Language Pseudo-Labeling (AEVLP) framework. Extensive experiments on four benchmark datasets demonstrate that our framework significantly advances multi-label classification, achieving state-of-the-art results.
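
The false-negative problem mentioned above is visible in the simplest SPML baseline, where every unannotated label is treated as negative. The Python sketch below implements that baseline as binary cross-entropy; it is not the GPR Loss, whose form the abstract does not give, and the function name and clipping constant are assumptions.

```python
import math


def assume_negative_loss(probs, positive_index):
    """Binary cross-entropy with all unannotated labels treated as negative.

    probs: predicted probability for each class of one image.
    positive_index: the single annotated positive class.
    """
    eps = 1e-12  # avoids log(0)
    loss = 0.0
    for i, p in enumerate(probs):
        target_prob = p if i == positive_index else 1.0 - p
        loss -= math.log(max(target_prob, eps))
    return loss / len(probs)
```

If an image truly contains a second object that was not annotated, a model that correctly predicts it is penalized: `assume_negative_loss([0.9, 0.9, 0.1], 0)` is larger than `assume_negative_loss([0.9, 0.1, 0.1], 0)`.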

[CV-85] Audio-Guided Visual Editing with Complex Multi-Modal Prompts BMVC2025

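Quick read: This paper addresses the limits of diffusion models in visual editing for complex scenarios, especially when text prompts alone cannot adequately describe the desired edit. Existing methods usually require training on specific datasets to align audio with text, which limits generalization to real-world situations. The key to the solution is an audio-guided visual editing framework that needs no additional training: it uses a pre-trained multi-modal encoder for zero-shot cross-modal alignment and integrates multi-modal prompts (multiple text and audio prompts) through separate noise branching and adaptive patch selection. This improves performance on complex editing tasks and lets the system draw on the rich information in audio to overcome the shortcomings of text-only guidance.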

Link: https://arxiv.org/abs/2508.20379
Authors: Hyeonyu Kim, Seokhoon Jeong, Seonghee Han, Chanhyuk Choi, Taehwan Kim
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to BMVC 2025

Abstract:Visual editing with diffusion models has made significant progress but often struggles with complex scenarios that textual guidance alone could not adequately describe, highlighting the need for additional non-text editing prompts. In this work, we introduce a novel audio-guided visual editing framework that can handle complex editing tasks with multiple text and audio prompts without requiring additional training. Existing audio-guided visual editing methods often necessitate training on specific datasets to align audio with text, limiting their generalization to real-world situations. We leverage a pre-trained multi-modal encoder with strong zero-shot capabilities and integrate diverse audio into visual editing tasks, by alleviating the discrepancy between the audio encoder space and the diffusion model’s prompt encoder space. Additionally, we propose a novel approach to handle complex scenarios with multiple and multi-modal editing prompts through our separate noise branching and adaptive patch selection. Our comprehensive experiments on diverse editing tasks demonstrate that our framework excels in handling complicated editing scenarios by incorporating rich information from audio, where text-only approaches fail.

[CV-86] Enhancing Mamba Decoder with Bidirectional Interaction in Multi-Task Dense Prediction

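[Quick Read]: This paper addresses the trade-off between sufficient cross-task interaction and computational efficiency in multi-task dense prediction. Existing methods that pursue complete cross-task interaction incur sharply higher computational complexity, making it hard to balance performance and efficiency in practice. The key to the solution is Bidirectional Interaction Mamba (BIM), which has two core innovations. First, a Bidirectional Interaction Scan (BI-Scan) mechanism combines task-first and position-first scanning modes to build task-specific bidirectional sequence representations within a unified linear-complexity architecture, efficiently preserving critical cross-task information. Second, a Multi-Scale Scan (MS-Scan) mechanism performs multi-granularity scene modeling, meeting the differing granularity needs of the tasks and strengthening fine-grained cross-task feature interaction.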

Link: https://arxiv.org/abs/2508.20376
Authors: Mang Cao, Sanping Zhou, Yizhe Li, Ye Deng, Wenli Huang, Le Wang
Affiliations: Xi’an Jiaotong University; Southwestern University of Finance and Economics; Ningbo University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available online: this https URL _for_MTL

Abstract:Sufficient cross-task interaction is crucial for success in multi-task dense prediction. However, sufficient interaction often results in high computational complexity, forcing existing methods to face the trade-off between interaction completeness and computational efficiency. To address this limitation, this work proposes a Bidirectional Interaction Mamba (BIM), which incorporates novel scanning mechanisms to adapt the Mamba modeling approach for multi-task dense prediction. On the one hand, we introduce a novel Bidirectional Interaction Scan (BI-Scan) mechanism, which constructs task-specific representations as bidirectional sequences during interaction. By integrating task-first and position-first scanning modes within a unified linear complexity architecture, BI-Scan efficiently preserves critical cross-task information. On the other hand, we employ a Multi-Scale Scan (MS-Scan) mechanism to achieve multi-granularity scene modeling. This design not only meets the diverse granularity requirements of various tasks but also enhances nuanced cross-task feature interactions. Extensive experiments on two challenging benchmarks, i.e., NYUD-V2 and PASCAL-Context, show the superiority of our BIM over its state-of-the-art competitors.

[CV-87] MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models

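[Quick Read]: This paper addresses the serious security risks that medical vision-language models (VLMs) pose in clinical use, including exposure of Protected Health Information (PHI), data leakage, and vulnerability to cyberattacks, which are especially critical in hospital environments. To meet these challenges, it presents MedFoundationHub, a graphical user interface (GUI) toolkit that (1) lets physicians manually select and use different models without programming; (2) lets engineers deploy medical VLMs efficiently in a plug-and-play fashion, with seamless integration of Hugging Face open-source models; and (3) provides privacy-preserving, operating-system-agnostic inference through Docker orchestration. The toolkit needs only an offline local workstation with a single NVIDIA A6000 GPU, balancing security with accessibility within the typical resources of an academic research lab.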

Link: https://arxiv.org/abs/2508.20345
Authors: Xiao Li, Yanfan Zhu, Ruining Deng, Wei-Qi Wei, Yu Wang, Shilin Zhao, Yaohong Wang, Haichun Yang, Yuankai Huo
Affiliations: Vanderbilt University; Weill Cornell Medicine; Vanderbilt University Medical Center; UT MD Anderson Cancer Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Recent advances in medical vision-language models (VLMs) open up remarkable opportunities for clinical applications such as automated report generation, copilots for physicians, and uncertainty quantification. However, despite their promise, medical VLMs introduce serious security concerns, most notably risks of Protected Health Information (PHI) exposure, data leakage, and vulnerability to cyberthreats - which are especially critical in hospital environments. Even when adopted for research or non-clinical purposes, healthcare organizations must exercise caution and implement safeguards. To address these challenges, we present MedFoundationHub, a graphical user interface (GUI) toolkit that: (1) enables physicians to manually select and use different models without programming expertise, (2) supports engineers in efficiently deploying medical VLMs in a plug-and-play fashion, with seamless integration of Hugging Face open-source models, and (3) ensures privacy-preserving inference through Docker-orchestrated, operating system agnostic deployment. MedFoundationHub requires only an offline local workstation equipped with a single NVIDIA A6000 GPU, making it both secure and accessible within the typical resources of academic research labs. To evaluate current capabilities, we engaged board-certified pathologists to deploy and assess five state-of-the-art VLMs (Google-MedGemma3-4B, Qwen2-VL-7B-Instruct, Qwen2.5-VL-7B-Instruct, and LLaVA-1.5-7B/13B). Expert evaluation covered colon cases and renal cases, yielding 1015 clinician-model scoring events. These assessments revealed recurring limitations, including off-target answers, vague reasoning, and inconsistent pathology terminology.

[CV-88] Disentangling Latent Embeddings with Sparse Linear Concept Subspaces (SLiCS)

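[Quick Read]: This paper addresses the entanglement of information in the latent space of vision-language co-embedding networks such as CLIP, namely how to separate the information about specific concepts in complex scenes out of a single embedding. The core challenge is to disentangle the embedding space so that image retrieval and generation can be conditioned precisely on a chosen semantic concept. The key to the solution is a supervised dictionary learning approach with a group-structured dictionary: the embedding is decomposed into multiple concept-specific component vectors, each synthesized linearly as a sparse, non-negative combination of the atoms associated with one label, so that each concept spans its own subspace. The dictionary is optimized with a novel alternating optimization that is guaranteed to converge. Text embeddings are used both to extract semantic descriptions and to perform zero-shot classification. The method improves the precision of concept-filtered image retrieval and also extends to embeddings from a highly compressed autoencoder (TiTok) and a self-supervised model (DINOv2).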

Link: https://arxiv.org/abs/2508.20322
Authors: Zhi Li, Hau Phan, Matthew Emigh, Austin J. Brockmeier
Affiliations: University of Delaware; Naval Surface Warfare Center, Panama City Division; Office of Naval Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language co-embedding networks, such as CLIP, provide a latent embedding space with semantic information that is useful for downstream tasks. We hypothesize that the embedding space can be disentangled to separate the information on the content of complex scenes by decomposing the embedding into multiple concept-specific component vectors that lie in different subspaces. We propose a supervised dictionary learning approach to estimate a linear synthesis model consisting of sparse, non-negative combinations of groups of vectors in the dictionary (atoms), whose group-wise activity matches the multi-label information. Each concept-specific component is a non-negative combination of atoms associated to a label. The group-structured dictionary is optimized through a novel alternating optimization with guaranteed convergence. Exploiting the text co-embeddings, we detail how semantically meaningful descriptions can be found based on text embeddings of words best approximated by a concept’s group of atoms, and unsupervised dictionary learning can exploit zero-shot classification of training set images using the text embeddings of concept labels to provide instance-wise multi-labels. We show that the disentangled embeddings provided by our sparse linear concept subspaces (SLiCS) enable concept-filtered image retrieval (and conditional generation using image-to-prompt) that is more precise. We also apply SLiCS to highly-compressed autoencoder embeddings from TiTok and the latent embedding from self-supervised DINOv2. Quantitative and qualitative results highlight the improved precision of the concept-filtered image retrieval for all embeddings.

[CV-89] Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation ICCV2025

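[Quick Read]: This paper addresses the limited performance of CLIP (Contrastive Language–Image Pretraining) on open-vocabulary segmentation, which stems from its weak localization ability. Existing methods strengthen spatial coherence by modifying intermediate-layer attention, but that coherence is not reliably propagated to the final output, and intermediate attention has no direct interaction with the text representations, which limits how much of CLIP's semantic potential is used. The key to the solution is a training-free, feedback-driven self-adaptive framework that feeds patch-level correspondences from the final output predictions back into the intermediate attention, using the output as a stronger spatial coherence prior and thereby improving semantic consistency between internal representations and final predictions. The core modules are attention isolation, confidence-based pruning for sparse adaptation, and an adaptation ensemble. The method works as a plug-in module that integrates into several mainstream models and consistently improves their performance.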

Link: https://arxiv.org/abs/2508.20265
Authors: Zhixiang Chi, Yanan Wu, Li Gu, Huan Liu, Ziqiang Wang, Yang Zhang, Yang Wang, Konstantinos N. Plataniotis
Affiliations: University of Toronto; China Agricultural University; Concordia University; McMaster University; Beijing Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICCV 2025, code: this https URL

Abstract:CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention, but this coherence isn’t consistently propagated to the final output due to subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations, and this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model’s processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model’s outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed back the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.

[CV-90] MedNet-PVS: A MedNeXt-Based Deep Learning Model for Automated Segmentation of Perivascular Spaces

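[Quick Read]: This paper addresses the low efficiency and poor generalization of automated segmentation of perivascular spaces (PVS), which are biomarkers of cerebral small vessel disease, Alzheimer’s disease, and aging-related neurodegeneration. Existing deep learning models perform inconsistently across clinical and research MRI datasets, and manual segmentation is time-consuming with limited reliability. The key to the solution is MedNeXt-L-k5, a Transformer-inspired 3D encoder-decoder convolutional network. Trained on homogeneous T2w MRI data from the Human Connectome Project-Aging, it reaches a voxel-level Dice score of 0.88±0.06 in white matter, comparable to the inter-rater reliability reported for that dataset. The paper also tests cross-site generalization on heterogeneous T1w MRI data, where performance drops substantially (voxel-level Dice of 0.38±0.16 in white matter under leave-one-site-out cross validation). Notably, the model did not outperform nnU-Net, suggesting that attention-style global context modeling is not required for PVS segmentation.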

Link: https://arxiv.org/abs/2508.20256
Authors: Zhen Xuen Brandon Low, Rory Zhang, Hang Min, William Pham, Lucy Vivash, Jasmine Moses, Miranda Lynch, Karina Dorfman, Cassandra Marotta, Shaun Koh, Jacob Bunyamin, Ella Rowsthorn, Alex Jarema, Himashi Peiris, Zhaolin Chen, Sandy R. Shultz, David K. Wright, Dexiao Kong, Sharon L. Naismith, Terence J. O’Brien, Ying Xia, Meng Law, Benjamin Sinclair
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 59 pages, 9 figures

Abstract:Enlarged perivascular spaces (PVS) are increasingly recognized as biomarkers of cerebral small vessel disease, Alzheimer’s disease, stroke, and aging-related neurodegeneration. However, manual segmentation of PVS is time-consuming and subject to moderate inter-rater reliability, while existing automated deep learning models have moderate performance and typically fail to generalize across diverse clinical and research MRI datasets. We adapted MedNeXt-L-k5, a Transformer-inspired 3D encoder-decoder convolutional network, for automated PVS segmentation. Two models were trained: one using a homogeneous dataset of 200 T2-weighted (T2w) MRI scans from the Human Connectome Project-Aging (HCP-Aging) dataset and another using 40 heterogeneous T1-weighted (T1w) MRI volumes from seven studies across six scanners. Model performance was evaluated using internal 5-fold cross validation (5FCV) and leave-one-site-out cross validation (LOSOCV). MedNeXt-L-k5 models trained on the T2w images of the HCP-Aging dataset achieved voxel-level Dice scores of 0.88+/-0.06 (white matter, WM), comparable to the reported inter-rater reliability of that dataset, and the highest yet reported in the literature. The same models trained on the T1w images of the HCP-Aging dataset achieved a substantially lower Dice score of 0.58+/-0.09 (WM). Under LOSOCV, the model had voxel-level Dice scores of 0.38+/-0.16 (WM) and 0.35+/-0.12 (BG), and cluster-level Dice scores of 0.61+/-0.19 (WM) and 0.62+/-0.21 (BG). MedNeXt-L-k5 provides an efficient solution for automated PVS segmentation across diverse T1w and T2w MRI datasets. MedNeXt-L-k5 did not outperform the nnU-Net, indicating that the attention-based mechanisms present in transformer-inspired models to provide global context are not required for high accuracy in PVS segmentation.
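
The voxel-level Dice scores quoted in this abstract measure the overlap between a predicted mask and a reference mask. The following is a minimal sketch of that metric, assuming flat binary masks; the function name and the empty-mask convention are illustrative choices and are not taken from the paper.

```python
def dice_score(pred, target):
    """Voxel-level Dice: 2*|P and T| / (|P| + |T|) for flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked positive in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total positives across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (hypothetical data): 2 shared voxels, 3 positives in each mask.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 1, 0]
print(round(dice_score(pred, target), 4))
```

A cluster-level variant, as also reported in the abstract, would compare connected components rather than individual voxels; it is not sketched here.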

[CV-91] Linking heterogeneous microstructure informatics with expert characterization knowledge through customized and hybrid vision-language representations for industrial qualification

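[Quick Read]: This paper addresses the difficulty of rapidly and reliably qualifying advanced materials in industrial manufacturing, especially heterogeneous structures produced by non-conventional additive manufacturing processes. The core challenge is to align raw microstructure data semantically with expert knowledge so that interpretable classification is possible without task-specific model retraining. The key to the solution is a framework that links microstructure informatics with expert characterization knowledge through customized and hybrid vision-language representations (VLRs), encoding visual microstructure data and textual expert assessments into shared representations. It integrates deep semantic segmentation with pre-trained multi-modal models (CLIP and FLAVA) and introduces a similarity-based representation built from positive and negative references to support zero-shot classification. Z-score normalization adjusts the unimodal and cross-modal similarity scores, improving alignment and discrimination across representations. The result is more traceable qualification decisions with a human in the loop and better domain adaptability.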

Link: https://arxiv.org/abs/2508.20243
Authors: Mutahar Safdar, Gentry Wood, Max Zimmermann, Guy Lamouche, Priti Wanjara, Yaoyao Fiona Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 46 pages, 33 figures, Submitted to Advanced Engineering Informatics, under revision

Abstract:Rapid and reliable qualification of advanced materials remains a bottleneck in industrial manufacturing, particularly for heterogeneous structures produced via non-conventional additive manufacturing processes. This study introduces a novel framework that links microstructure informatics with a range of expert characterization knowledge using customized and hybrid vision-language representations (VLRs). By integrating deep semantic segmentation with pre-trained multi-modal models (CLIP and FLAVA), we encode both visual microstructural data and textual expert assessments into shared representations. To overcome limitations in general-purpose embeddings, we develop a customized similarity-based representation that incorporates both positive and negative references from expert-annotated images and their associated textual descriptions. This allows zero-shot classification of previously unseen microstructures through a net similarity scoring approach. Validation on an additively manufactured metal matrix composite dataset demonstrates the framework’s ability to distinguish between acceptable and defective samples across a range of characterization criteria. Comparative analysis reveals that FLAVA model offers higher visual sensitivity, while the CLIP model provides consistent alignment with the textual criteria. Z-score normalization adjusts raw unimodal and cross-modal similarity scores based on their local dataset-driven distributions, enabling more effective alignment and classification in the hybrid vision-language framework. The proposed method enhances traceability and interpretability in qualification pipelines by enabling human-in-the-loop decision-making without task-specific model retraining. By advancing semantic interoperability between raw data and expert knowledge, this work contributes toward scalable and domain-adaptable qualification strategies in engineering informatics.
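
The net similarity scoring and Z-score normalization described in this abstract can be illustrated with a small sketch. It assumes that the net score is the mean cosine similarity to positive references minus the mean similarity to negative references, and that Z-scores are taken over the scores of the dataset at hand. Both are assumptions made for illustration; the abstract does not give the exact formulas, and all names and embeddings below are hypothetical.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def net_similarity(query, positives, negatives):
    """Assumed form: mean similarity to positives minus mean similarity to negatives."""
    pos = mean([cosine(query, ref) for ref in positives])
    neg = mean([cosine(query, ref) for ref in negatives])
    return pos - neg


def z_scores(values):
    """Standardize scores against their own (dataset-driven) distribution."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical 2-D embeddings standing in for image or text features.
positives = [[1.0, 0.0], [0.9, 0.1]]   # references for "acceptable"
negatives = [[0.0, 1.0], [0.1, 0.9]]   # references for "defective"
queries = [[1.0, 0.1], [0.1, 1.0], [0.5, 0.5]]

net = [net_similarity(q, positives, negatives) for q in queries]
z = z_scores(net)
labels = ["acceptable" if s > 0 else "defective" for s in z]
print(labels[:2])
```

The threshold at zero is also an illustrative choice; in practice the decision boundary would be set per characterization criterion.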

[CV-92] ATMS-KD: Adaptive Temperature and Mixed Sample Knowledge Distillation for a Lightweight Residual CNN in Agricultural Embedded Systems

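[Quick Read]: This paper addresses the balance between accuracy and efficiency when deploying lightweight convolutional neural network (CNN) models in resource-constrained agricultural environments. It proposes ATMS-KD (Adaptive Temperature and Mixed-Sample Knowledge Distillation), a knowledge distillation framework whose core idea is to combine adaptive temperature scheduling with mixed-sample augmentation to transfer knowledge from a large teacher (MobileNetV3 Large, 5.7M parameters) to several lightweight student models (Compact, Standard, and Enhanced configurations). On a Damask rose maturity classification task, all students exceed 96.7% validation accuracy, the compact model has the lowest inference latency (72.19 ms), knowledge retention is above 99%, and the framework outperforms eleven established knowledge distillation methods.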

Link: https://arxiv.org/abs/2508.20232
Authors: Mohamed Ohamouddou, Said Ohamouddou, Abdellatif El Afia, Rafik Lasri
Affiliations: United Arab Emirates University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study proposes ATMS-KD (Adaptive Temperature and Mixed-Sample Knowledge Distillation), a novel framework for developing lightweight CNN models suitable for resource-constrained agricultural environments. The framework combines adaptive temperature scheduling with mixed-sample augmentation to transfer knowledge from a MobileNetV3 Large teacher model (5.7M parameters) to lightweight residual CNN students. Three student configurations were evaluated: Compact (1.3M parameters), Standard (2.4M parameters), and Enhanced (3.8M parameters). The dataset used in this study consists of images of Rosa damascena (Damask rose) collected from agricultural fields in the Dades Oasis, southeastern Morocco, providing a realistic benchmark for agricultural computer vision applications under diverse environmental conditions. Experimental evaluation on the Damascena rose maturity classification dataset demonstrated significant improvements over direct training methods. All student models achieved validation accuracies exceeding 96.7% with ATMS-KD compared to 95–96% with direct training. The framework outperformed eleven established knowledge distillation methods, achieving 97.11% accuracy with the compact model – a 1.60 percentage point improvement over the second-best approach while maintaining the lowest inference latency of 72.19 ms. Knowledge retention rates exceeded 99% for all configurations, demonstrating effective knowledge transfer regardless of student model capacity.
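
The two ingredients named in this abstract, a temperature-scaled distillation loss and mixed-sample augmentation, can be sketched in a few lines. The loss below is the standard temperature-softened KL divergence from the knowledge distillation literature, and the mixing is plain mixup. The linear temperature schedule and its endpoints are hypothetical placeholders, since the abstract does not say how ATMS-KD adapts the temperature.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def linear_temperature(epoch, total_epochs, t_start=4.0, t_end=1.0):
    """Hypothetical schedule: move linearly from t_start to t_end over training."""
    frac = epoch / max(total_epochs - 1, 1)
    return t_start + frac * (t_end - t_start)


def mixup(x1, x2, lam):
    """Mixed-sample augmentation: convex combination of two inputs."""
    return [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]


teacher = [3.0, 1.0, 0.2]
student = [2.0, 1.5, 0.1]
for epoch in (0, 9):
    t = linear_temperature(epoch, total_epochs=10)
    print(epoch, t, round(kd_loss(student, teacher, t), 4))
```

In a full training loop, the mixed input would be passed through both teacher and student and the distillation loss computed on their logits for that mixed input.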

[CV-93] The Role of Teacher Calibration in Knowledge Distillation

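[Quick Read]: This paper addresses the unclear mechanism behind student performance gains in Knowledge Distillation (KD), in particular the lack of a systematic understanding of which factors drive KD effectiveness. It shows a strong correlation between the teacher model's calibration error and the student model's accuracy, and therefore argues that teacher calibration is an important factor for effective KD. The core of the solution is to apply a simple calibration method that lowers the teacher's calibration error, which improves the student's performance. The approach is general, works on tasks from classification to detection, and integrates easily with existing state-of-the-art KD methods.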

Link: https://arxiv.org/abs/2508.20224
Authors: Suyoung Kim, Seonguk Park, Junhoo Lee, Nojun Kwak
Affiliations: Seoul National University; A2Mind
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Knowledge Distillation (KD) has emerged as an effective model compression technique in deep learning, enabling the transfer of knowledge from a large teacher model to a compact student model. While KD has demonstrated significant success, it is not yet fully understood which factors contribute to improving the student’s performance. In this paper, we reveal a strong correlation between the teacher’s calibration error and the student’s accuracy. Therefore, we claim that the calibration of the teacher model is an important factor for effective KD. Furthermore, we demonstrate that the performance of KD can be improved by simply employing a calibration method that reduces the teacher’s calibration error. Our algorithm is versatile, demonstrating effectiveness across various tasks from classification to detection. Moreover, it can be easily integrated with existing state-of-the-art methods, consistently achieving superior performance.
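
To make "teacher calibration error" concrete, the sketch below computes the expected calibration error (ECE) with equal-width confidence bins and shows how temperature scaling, one common post-hoc calibration method, lowers it on a toy overconfident teacher. The abstract does not name the calibration method the authors use, so temperature scaling is a stand-in here, and the logits, labels, and temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean of |accuracy - confidence| over confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, hit in members if hit) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Toy teacher (hypothetical): always predicts class 0 with logits [4, 0],
# but is right on only 3 of 4 samples, so it is overconfident.
teacher_logits = [[4.0, 0.0]] * 4
labels = [0, 0, 0, 1]


def teacher_ece(temperature):
    confs, hits = [], []
    for logits, label in zip(teacher_logits, labels):
        probs = softmax(logits, temperature)
        pred = probs.index(max(probs))
        confs.append(probs[pred])
        hits.append(pred == label)
    return expected_calibration_error(confs, hits)


ece_raw = teacher_ece(1.0)      # uncalibrated teacher
ece_scaled = teacher_ece(4.0)   # after temperature scaling with T = 4
print(ece_scaled < ece_raw)
```

In practice the temperature would be fitted on a held-out validation set rather than chosen by hand, and the calibrated teacher outputs would then serve as distillation targets.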

[CV-94] Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360-Degree Videos

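[Quick Read]: This paper addresses saliency prediction in 360-degree omnidirectional videos (ODVs), focusing on the challenges of spherical distortion and of fusing spatial audio with visual information. Prior work lacks both a comprehensive dataset and effective models for 360-degree audio-visual saliency. The authors therefore build the YT360-EyeTracking dataset (81 ODV clips with eye-tracking data collected under varying audio-visual conditions) and propose two saliency prediction models: SalViT360, a vision-transformer architecture with spherical geometry-aware spatio-temporal attention, and SalViT360-AV, which adds audio-conditioned transformer adapter modules to incorporate spatial audio cues. Experiments show that both models significantly outperform existing methods on several benchmark datasets. The authors attribute this to integrating spatial audio cues into the model architecture, which better captures where viewers attend in 360-degree scenes.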

Link: https://arxiv.org/abs/2508.20221
Authors: Mert Cokelek, Halit Ozsoy, Nevrez Imamoglu, Cagri Ozcinar, Inci Ayhan, Erkut Erdem, Aykut Erdem
Affiliations: Koç University; Boğaziçi University; National Institute of Advanced Industrial Science and Technology (AIST); MSK.AI; Hacettepe University; KUIS AI Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE TPAMI)

Abstract:Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360-degree environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer’s perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360-degree audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360-degree videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360-degree scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos. Code and dataset will be available at this https URL.

[CV-95] InfinityHuman: Towards Long-Term Audio-Driven Human

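[Quick Read]: This paper addresses the consistency and naturalness of long videos in audio-driven human animation, in particular identity drift, color shifts, and scene instability in high-resolution, long-duration videos, as well as the distortions and audio misalignment caused by poor modeling of hand motion. The key to the solution is InfinityHuman, a coarse-to-fine framework that first generates audio-synchronized representations and then progressively refines them into high-quality video with a pose-guided refiner. Because pose sequences are decoupled from appearance and stable over time, the refiner can use stable poses together with the initial frame as a visual anchor to reduce drift and improve lip synchronization. In addition, a hand-specific reward mechanism trained on high-quality hand motion data improves the semantic accuracy and realism of gestures.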

Link: https://arxiv.org/abs/2508.20210
Authors: Xiaodi Li, Pan Xie, Yi Ren, Qijun Gan, Chen Zhang, Fangyuan Kong, Xiang Yin, Bingyue Peng, Zehuan Yuan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Audio-driven human animation has attracted wide attention thanks to its practical applications. However, critical challenges remain in generating high-resolution, long-duration videos with consistent appearance and natural hand motions. Existing methods extend videos using overlapping motion frames but suffer from error accumulation, leading to identity drift, color shifts, and scene instability. Additionally, hand movements are poorly modeled, resulting in noticeable distortions and misalignment with the audio. In this work, we propose InfinityHuman, a coarse-to-fine framework that first generates audio-synchronized representations, then progressively refines them into high-resolution, long-duration videos using a pose-guided refiner. Since pose sequences are decoupled from appearance and resist temporal degradation, our pose-guided refiner employs stable poses and the initial frame as a visual anchor to reduce drift and improve lip synchronization. Moreover, to enhance semantic accuracy and gesture realism, we introduce a hand-specific reward mechanism trained with high-quality hand motion data. Experiments on the EMTD and HDTF datasets show that InfinityHuman achieves state-of-the-art performance in video quality, identity preservation, hand accuracy, and lip-sync. Ablation studies further confirm the effectiveness of each module. Code will be made public.

[CV-96] Enhancing Automatic Modulation Recognition With a Reconstruction-Driven Vision Transformer Under Limited Labels

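[Quick Read]: This paper addresses the limited scalability and generalization of Automatic Modulation Recognition (AMR) methods that depend on large labeled datasets and multi-stage training pipelines. The key to the solution is a unified Vision Transformer (ViT) framework that combines supervised, self-supervised, and reconstruction objectives, built from a ViT encoder, a lightweight convolutional decoder, and a linear classifier. The reconstruction branch maps augmented signals back to their originals, anchoring the encoder to the fine-grained I/Q (In-phase/Quadrature) structure. This promotes robust, discriminative feature learning during pretraining, while partial label supervision during fine-tuning enables effective classification. The method outperforms supervised CNN and ViT baselines in low-label regimes and remains stable across signal-to-noise ratio (SNR) levels.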

Link: https://arxiv.org/abs/2508.20193
Authors: Hossein Ahmadi, Banafsheh Saffari
Affiliations: The University of Akron
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:

Abstract:Automatic modulation recognition (AMR) is critical for cognitive radio, spectrum monitoring, and secure wireless communication. However, existing solutions often rely on large labeled datasets or multi-stage training pipelines, which limit scalability and generalization in practice. We propose a unified Vision Transformer (ViT) framework that integrates supervised, self-supervised, and reconstruction objectives. The model combines a ViT encoder, a lightweight convolutional decoder, and a linear classifier; the reconstruction branch maps augmented signals back to their originals, anchoring the encoder to fine-grained I/Q structure. This strategy promotes robust, discriminative feature learning during pretraining, while partial label supervision in fine-tuning enables effective classification with limited labels. On the RML2018.01A dataset, our approach outperforms supervised CNN and ViT baselines in low-label regimes, approaches ResNet-level accuracy with only 15-20% labeled data, and maintains strong performance across varying SNR levels. Overall, the framework provides a simple, generalizable, and label-efficient solution for AMR.

[CV-97] Grounding Multimodal Large Language Models with Quantitative Skin Attributes: A Retrieval Study

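[Quick Read]: This paper addresses the insufficient interpretability of current Artificial Intelligence (AI) models for skin disease diagnosis, which limits their clinical use. The key to the solution is to combine the reasoning ability of Multimodal Large Language Models (MLLMs) with quantitative attributes: fine-tuning grounds the MLLM embedding space in specific quantitative attributes (such as lesion area) by training it to predict their values from images, which strengthens the interpretability of model decisions. The grounding is evaluated through an attribute-specific content-based image retrieval study on the SLICE-3D dataset.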

Link: https://arxiv.org/abs/2508.20188
Authors: Max Torop, Masih Eskandar, Nicholas Kurtansky, Jinyang Liu, Jochen Weber, Octavia Camps, Veronica Rotemberg, Jennifer Dy, Kivanc Kose
Affiliations: Northeastern University; Memorial Sloan Kettering Cancer Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Artificial Intelligence models have demonstrated significant success in diagnosing skin diseases, including cancer, showing the potential to assist clinicians in their analysis. However, the interpretability of model predictions must be significantly improved before they can be used in practice. To this end, we explore the combination of two promising approaches: Multimodal Large Language Models (MLLMs) and quantitative attribute usage. MLLMs offer a potential avenue for increased interpretability, providing reasoning for diagnosis in natural language through an interactive format. Separately, a number of quantitative attributes that are related to lesion appearance (e.g., lesion area) have recently been found predictive of malignancy with high accuracy. Predictions grounded as a function of such concepts have the potential for improved interpretability. We provide evidence that MLLM embedding spaces can be grounded in such attributes, through fine-tuning to predict their values from images. Concretely, we evaluate this grounding in the embedding space through an attribute-specific content-based image retrieval case study using the SLICE-3D dataset.

[CV-98] SDiFL: Stable Diffusion-Driven Framework for Image Forgery Localization

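[Quick Read]: This paper addresses the fact that current image forgery localization methods, which rely heavily on manually annotated data, cannot keep pace with image manipulation techniques that are evolving rapidly under new multi-modal large models such as Stable Diffusion (SD). The key to the solution is to integrate, for the first time, both the image generation and the perceptual capabilities of SD into an image forensics framework. The authors show theoretically that the multi-modal architecture of SD can be conditioned on forgery-related information, so that the model inherently outputs forgery localization results. Building on this, they use the multi-modal processing of Stable Diffusion V3 (SD3) and treat forgery residuals (high-frequency signals extracted with highpass filters) as an explicit modality that is fused into the latent space. This improves localization while fully preserving the latent features extracted by SD3, and with them the rich semantic information of the input image.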

Link: https://arxiv.org/abs/2508.20182
Authors: Yang Su, Shunquan Tan, Jiwu Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Driven by the new generation of multi-modal large models, such as Stable Diffusion (SD), image manipulation technologies have advanced rapidly, posing significant challenges to image forensics. However, existing image forgery localization methods, which heavily rely on labor-intensive and costly annotated data, are struggling to keep pace with these emerging image manipulation technologies. To address these challenges, we are the first to integrate both image generation and powerful perceptual capabilities of SD into an image forensic framework, enabling more efficient and accurate forgery localization. First, we theoretically show that the multi-modal architecture of SD can be conditioned on forgery-related information, enabling the model to inherently output forgery localization results. Then, building on this foundation, we specifically leverage the multimodal framework of Stable Diffusion V3 (SD3) to enhance forgery localization. We leverage the multi-modal processing capabilities of SD3 in the latent space by treating image forgery residuals – high-frequency signals extracted using specific highpass filters – as an explicit modality. This modality is fused into the latent space during training to enhance forgery localization performance. Notably, our method fully preserves the latent features extracted by SD3, thereby retaining the rich semantic information of the input image. Experimental results show that our framework achieves up to 12% improvements in performance on widely used benchmarking datasets compared to current state-of-the-art image forgery localization models. Encouragingly, the model demonstrates strong performance on forensic tasks involving real-world document forgery images and natural scene forging images, even when such data were entirely unseen during training.

[CV-99] Improving Liver Disease Diagnosis with SNNDeep: A Custom Spiking Neural Network Using Diverse Learning Algorithms

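[Quick Read]: This paper addresses the near-total absence of Spiking Neural Networks (SNNs) in high-stakes biomedical imaging, specifically how to use SNNs for efficient binary classification of liver health status in a clinically relevant setting. The key to the solution is SNNDeep, a low-level SNN model custom-built from scratch. Compared with implementations based on the mainstream frameworks snnTorch and SpikingJelly, it achieves better validation accuracy (up to 98.35%), better adaptability across learning rules, and lower training overhead. This suggests that highly tunable low-level SNN architectures have a clear advantage in data-limited, temporally constrained diagnostic settings and opens a new path for neuro-inspired AI in precision medicine.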

Link: https://arxiv.org/abs/2508.20125
Authors: Zofia Rudnicka, Janusz Szczepanski, Agnieszka Pregowska
Affiliations: Institute of Fundamental Technological Research, Polish Academy of Sciences
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Purpose: Spiking neural networks (SNNs) have recently gained attention as energy-efficient, biologically plausible alternatives to conventional deep learning models. Their application in high-stakes biomedical imaging remains almost entirely unexplored. Methods: This study introduces SNNDeep, the first tailored SNN specifically optimized for binary classification of liver health status from computed tomography (CT) features. To ensure clinical relevance and broad generalizability, the model was developed and evaluated using the Task03_Liver dataset from the Medical Segmentation Decathlon (MSD), a standardized benchmark widely used for assessing performance across diverse medical imaging tasks. We benchmark three fundamentally different learning algorithms, namely Surrogate Gradient Learning, the Tempotron rule, and Bio-Inspired Active Learning across three architectural variants: a fully customized low-level model built from scratch, and two implementations using leading SNN frameworks, i.e., snnTorch and SpikingJelly. Hyperparameter optimization was performed using Optuna. Results: Our results demonstrate that the custom-built SNNDeep consistently outperforms framework-based implementations, achieving a maximum validation accuracy of 98.35%, superior adaptability across learning rules, and significantly reduced training overhead. Conclusion: This study provides the first empirical evidence that low-level, highly tunable SNNs can surpass standard frameworks in medical imaging, especially in data-limited, temporally constrained diagnostic settings, thereby opening a new pathway for neuro-inspired AI in precision medicine.

[CV-100] VSF: Simple, Efficient, and Effective Negative Guidance in Few-Step Image Generation Models By Value Sign Flip

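[Quick Read]: This paper addresses how to implement negative prompt guidance efficiently in few-step diffusion and flow-matching image generation models. Existing methods such as classifier-free guidance (CFG), NASA, and NAG either control negative prompts weakly or add substantial computational cost. The key to the solution is a mechanism called Value Sign Flip (VSF): it dynamically flips the sign of the attention values that come from the negative prompt, which suppresses undesired content. The method has small computational overhead and broad compatibility, integrating with MMDiT architectures (such as Stable Diffusion 3.5 Turbo) and cross-attention-based models (such as Wan). On both static image and video generation it follows negative prompts better than prior methods while keeping output quality high.

Link: https://arxiv.org/abs/2508.10931
Authors: Wenqi Guo, Shan Du
Affiliations: University of British Columbia; Weathon Software
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: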


Abstract:We introduce Value Sign Flip (VSF), a simple and efficient method for incorporating negative prompt guidance in few-step diffusion and flow-matching image generation models. Unlike existing approaches such as classifier-free guidance (CFG), NASA, and NAG, VSF dynamically suppresses undesired content by flipping the sign of attention values from negative prompts. Our method requires only small computational overhead and integrates effectively with MMDiT-style architectures such as Stable Diffusion 3.5 Turbo, as well as cross-attention-based models like Wan. We validate VSF on challenging datasets with complex prompt pairs and demonstrate superior performance in both static image and video generation tasks. Experimental results show that VSF significantly improves negative prompt adherence compared to prior methods in few-step models, and even CFG in non-few-step models, while maintaining competitive image quality. Code and ComfyUI node are available in this https URL.
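
The core idea, negating attention values that originate from negative-prompt tokens, can be sketched for a single query as below. This is a toy illustration under stated assumptions, not the authors' implementation: the function names and the `scale` parameter are hypothetical, and the real method operates inside multi-head attention layers.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_value_sign_flip(query, pos_keys, pos_values,
                                neg_keys, neg_values, scale=1.0):
    """Single-query attention over positive and negative prompt tokens.

    Values from the negative prompt are negated (and optionally scaled),
    so attending to them pushes the output away from that content.
    """
    keys = pos_keys + neg_keys
    flipped = [[-scale * v for v in vec] for vec in neg_values]
    values = pos_values + flipped
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * vec[j] for w, vec in zip(weights, values))
            for j in range(dim)]


if __name__ == "__main__":
    out = attend_with_value_sign_flip(
        [1.0, 0.0], [[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]])
    print(out)
```

With no negative tokens the function reduces to ordinary attention, which is one reason a sign flip is cheap to add: it reuses the existing attention pass instead of running a second forward pass as CFG does.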

[CV-101] LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty CVPR2025 CVPR

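[Quick Read]: This paper addresses machine unlearning (MU) in pre-trained models: how to remove the influence of specific training samples without retraining the whole model. Conventional approaches usually rely on retraining, which is expensive and impractical at large scale. The key idea of LoTUS is to smooth the model's prediction probabilities up to an information-theoretic bound, which mitigates the over-confidence caused by data memorization and yields efficient, effective unlearning. The method is validated on Transformer and ResNet18 models, where it outperforms eight baselines, and the paper introduces the Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric so that evaluation can be carried out under conditions closer to real-world use.

Link: https://arxiv.org/abs/2503.18314
Authors: Christoforos N. Spartalis, Theodoros Semertzidis, Efstratios Gavves, Petros Daras
Affiliations: University of Amsterdam; Centre for Research & Technology Hellas; Archimedes/Athena RC
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as a main conference paper at CVPR 2025 ( this https URL )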


Abstract:We present LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models, avoiding retraining from scratch. LoTUS smooths the prediction probabilities of the model up to an information-theoretic bound, mitigating its over-confidence stemming from data memorization. We evaluate LoTUS on Transformer and ResNet18 models against eight baselines across five public datasets. Beyond established MU benchmarks, we evaluate unlearning on ImageNet1k, a large-scale dataset, where retraining is impractical, simulating real-world conditions. Moreover, we introduce the novel Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric to enable evaluation under real-world conditions. The experimental results show that LoTUS outperforms state-of-the-art methods in terms of both efficiency and effectiveness. Code: this https URL.
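
As background for the RF-JSD metric, the plain Jensen-Shannon divergence between two discrete distributions can be computed as follows. This is the standard definition only; how RF-JSD applies it without a retrained model is specified in the paper and is not reproduced here. Function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: the average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    print(js_divergence([0.9, 0.1], [0.6, 0.4]))
```

Unlike KL, this quantity is symmetric and bounded by log 2 (in nats), which makes it convenient for comparing predicted class distributions.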

[CV-102] Efficient Fine-Tuning of DINOv3 Pretrained on Natural Images for Atypical Mitotic Figure Classification in MIDOG 2025

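[Quick Read]: This paper addresses the difficulty of detecting atypical mitotic figures (AMFs). The core challenges are their low prevalence in tissue sections, their subtle morphology, and substantial inter-observer variability among pathologists. To tackle these, the study applies transfer learning with the pretrained vision transformer DINOv3-H+, fine-tuned parameter-efficiently with low-rank adaptation (LoRA, only 650k trainable parameters), together with extensive data augmentation to reduce the domain gap. The key finding is that although DINOv3 was pretrained on natural images, LoRA fine-tuning plus augmentation still yields accurate classification on histopathology images (balanced accuracy of 0.8871 on the preliminary test set), making the approach a strong baseline for atypical mitosis classification in the MIDOG 2025 challenge.

Link: https://arxiv.org/abs/2508.21041
Authors: Guillaume Balezo, Raphaël Bourgade, Thomas Walter
Affiliations: Center for Computational Biology, Mines Paris PSL, Paris, France; Sanofi, Paris, France
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 3 pages. Challenge report for MIDOG 2025 (Task 2: Atypical Mitotic Figure Classification)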


Abstract:Atypical mitotic figures (AMFs) are markers of abnormal cell division associated with poor prognosis, yet their detection remains difficult due to low prevalence, subtle morphology, and inter-observer variability. The MIDOG 2025 challenge introduces a benchmark for AMF classification across multiple domains. In this work, we evaluate the recently published DINOv3-H+ vision transformer, pretrained on natural images, which we fine-tuned using low-rank adaptation (LoRA, 650k trainable parameters) and extensive augmentation. Despite the domain gap, DINOv3 transfers effectively to histopathology, achieving a balanced accuracy of 0.8871 on the preliminary test set. These results highlight the robustness of DINOv3 pretraining and show that, when combined with parameter-efficient fine-tuning, it provides a strong baseline for atypical mitosis classification in MIDOG 2025.
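
Two quantities in this entry lend themselves to a small worked sketch: the number of trainable parameters LoRA adds to one weight matrix, and balanced accuracy. The dimensions used below are illustrative assumptions, not the DINOv3-H+ configuration, and the sketch does not attempt to reproduce the 650k figure.

```python
def lora_trainable_params(d_in, d_out, rank):
    """LoRA replaces the update to a d_out x d_in matrix with B @ A,
    where A is rank x d_in and B is d_out x rank."""
    return rank * (d_in + d_out)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so a rare class counts as much as a common one."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    print(lora_trainable_params(1024, 1024, 8))
    print(balanced_accuracy([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]))
```

Balanced accuracy is a natural choice when one class is rare, as AMFs are: a classifier that always predicts the majority class scores only 0.5.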

[CV-103] GENRE-CMR: Generalizable Deep Learning for Diverse Multi-Domain Cardiac MRI Reconstruction

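[Quick Read]: This paper addresses the trade-off between scan time and image quality in accelerated cardiovascular magnetic resonance (CMR) reconstruction, especially the poor generalization of models across acquisition settings. The key to the solution is GENRE-CMR, a generative adversarial network (GAN)-based residual deep unrolled reconstruction framework. It unrolls iterative optimization into a cascade of convolutional subnetworks with residual connections, so that features propagate progressively from shallow to deeper stages. It also adds two losses: an Edge-Aware Region (EAR) loss, which improves preservation of structural information, and a Statistical Distribution Alignment (SDA) loss, which improves feature-space consistency across data distributions. Together these improve reconstruction fidelity and generalization.

Link: https://arxiv.org/abs/2508.20600
Authors: Kian Anvari Hamedani, Narges Razizadeh, Shahabedin Nabavi, Mohsen Ebrahimi Moghaddam
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: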


Abstract:Accelerated Cardiovascular Magnetic Resonance (CMR) image reconstruction remains a critical challenge due to the trade-off between scan time and image quality, particularly when generalizing across diverse acquisition settings. We propose GENRE-CMR, a generative adversarial network (GAN)-based architecture employing a residual deep unrolled reconstruction framework to enhance reconstruction fidelity and generalization. The architecture unrolls iterative optimization into a cascade of convolutional subnetworks, enriched with residual connections to enable progressive feature propagation from shallow to deeper stages. To further improve performance, we integrate two loss functions: (1) an Edge-Aware Region (EAR) loss, which guides the network to focus on structurally informative regions and helps prevent common reconstruction blurriness; and (2) a Statistical Distribution Alignment (SDA) loss, which regularizes the feature space across diverse data distributions via a symmetric KL divergence formulation. Extensive experiments confirm that GENRE-CMR surpasses state-of-the-art methods on training and unseen data, achieving 0.9552 SSIM and 38.90 dB PSNR on unseen distributions across various acceleration factors and sampling trajectories. Ablation studies confirm the contribution of each proposed component to reconstruction quality and generalization. Our framework presents a unified and robust solution for high-quality CMR reconstruction, paving the way for clinically adaptable deployment across heterogeneous acquisition protocols.
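
Two of the quantities named in the abstract have standard definitions that can be sketched directly: PSNR in dB, and a symmetric KL divergence of the kind the SDA loss is described as using. How the paper builds distributions from feature maps is not specified in this entry, so the inputs below are plain lists, and the `eps` clamp is an assumption added for numerical safety.

```python
import math


def psnr_db(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length intensity lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def symmetric_kl(p, q, eps=1e-12):
    """KL(p || q) + KL(q || p), with probabilities clamped away from zero."""
    p = [max(x, eps) for x in p]
    q = [max(x, eps) for x in q]
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return kl_pq + kl_qp


if __name__ == "__main__":
    print(psnr_db([0.0, 0.0], [0.1, 0.1]))
    print(symmetric_kl([0.5, 0.5], [0.9, 0.1]))
```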

[CV-104] Prediction of Distant Metastasis for Head and Neck Cancer Patients Using Multi-Modal Tumor and Peritumoral Feature Fusion Network

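[Quick Read]: This paper addresses the clinical problem of predicting metastasis risk before treatment in patients with head and neck squamous cell carcinoma (HNSCC), in order to optimize individualized treatment strategies and improve prognosis. The key to the solution is a deep-learning multimodal fusion framework that combines CT images, radiomics features, and clinical variables to predict metastasis risk. A 3D Swin Transformer extracts deep spatial features from the tumor region; these are combined with the 36 radiomics features retained after correlation filtering and random forest selection, plus encoded clinical variables, and fed into a fully connected network for joint modeling. In five-fold cross-validation the method reaches AUC = 0.803, outperforming single-modality models, and it shows good generalization across tumor subtypes as well as interpretability, offering decision support for personalized HNSCC treatment.

Link: https://arxiv.org/abs/2508.20469
Authors: Zizhao Tang (1), Changhao Liu (2), Nuo Tong (1), Shuiping Gou (1), Mei Shi (2)
Affiliations: (1) School of Artificial Intelligence, Xidian University; (2) Department of Radiotherapy, Xijing Hospital, Air Force Medical University of PLA
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 4 figures, 5 tables. Zizhao Tang, Changhao Liu, and Nuo Tong contributed equally. Corresponding Authors: Mei Shi (mshi82@fmmu. this http URL ), Shuiping Gou (shpgou@mail. this http URL )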


Abstract:Metastasis remains the major challenge in the clinical management of head and neck squamous cell carcinoma (HNSCC). Reliable pre-treatment prediction of metastatic risk is crucial for optimizing treatment strategies and prognosis. This study develops a deep learning-based multimodal framework to predict metastasis risk in HNSCC patients by integrating computed tomography (CT) images, radiomics, and clinical data. 1497 HNSCC patients were included. Tumor and organ masks were derived from pretreatment CT images. A 3D Swin Transformer extracted deep features from tumor regions. Meanwhile, 1562 radiomics features were obtained using PyRadiomics, followed by correlation filtering and random forest selection, leaving 36 features. Clinical variables including age, sex, smoking, and alcohol status were encoded and fused with imaging-derived features. Multimodal features were fed into a fully connected network to predict metastasis risk. Performance was evaluated using five-fold cross-validation with area under the curve (AUC), accuracy (ACC), sensitivity (SEN), and specificity (SPE). The proposed fusion model outperformed single-modality models. The 3D deep learning module alone achieved an AUC of 0.715, and when combined with radiomics and clinical features, predictive performance improved (AUC = 0.803, ACC = 0.752, SEN = 0.730, SPE = 0.758). Stratified analysis showed generalizability across tumor subtypes. Ablation studies indicated complementary information from different modalities. Evaluation showed the 3D Swin Transformer provided more robust representation learning than conventional networks. This multimodal fusion model demonstrated high accuracy and robustness in predicting metastasis risk in HNSCC, offering a comprehensive representation of tumor biology. The interpretable model has potential as a clinical decision-support tool for personalized treatment planning.
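
The fusion step described above can be sketched as feature concatenation followed by a prediction head. This is only a schematic: the deep-feature dimension, the weights, and the single logistic layer (standing in for the paper's fully connected network) are all assumptions made for illustration.

```python
import math


def fuse_features(deep, radiomics, clinical):
    """Late fusion by concatenation of the three modality vectors."""
    return list(deep) + list(radiomics) + list(clinical)


def predict_risk(fused, weights, bias=0.0):
    """One logistic unit as a stand-in for the fully connected network."""
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    deep = [0.1] * 8                  # hypothetical size for the 3D Swin Transformer features
    radiomics = [0.0] * 36            # the 36 selected radiomics features
    clinical = [1.0, 0.0, 1.0, 0.0]   # e.g. encoded age group, sex, smoking, alcohol
    fused = fuse_features(deep, radiomics, clinical)
    print(len(fused), predict_risk(fused, [0.0] * len(fused)))
```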

[CV-105] Efficient and Privacy-Protecting Background Removal for 2D Video Streaming using iPhone 15 Pro Max LiDAR

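[Quick Read]: This paper addresses background removal for video and photography. Traditional methods such as chroma keying and trained AI models are sensitive to lighting and degrade noticeably in low light. The key to the solution is to use the LiDAR sensor on a consumer mobile device (the iPhone 15 Pro Max) to obtain lighting-independent depth information and combine it with a GPU-accelerated image processing pipeline, achieving real-time background segmentation at the standard streaming rate of 60 frames per second. The approach is robust across lighting conditions and is limited only by the current LiDAR depth map resolution (320x240) and by some materials that reflect the infrared laser poorly. If LiDAR resolution comes to match the color image resolution, it could become the leading background removal technique.

Link: https://arxiv.org/abs/2508.20250
Authors: Jessica Kinnevan, Naifa Alqahtani, Toral Chauhan
Affiliations: Iowa State University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: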


Abstract:Light Detection and Ranging (LiDAR) technology in consumer-grade mobile devices can be used as a replacement for traditional background removal and compositing techniques. Unlike approaches such as chroma keying and trained AI models, LiDAR’s depth information is independent of subject lighting, and performs equally well in low-light and well-lit environments. We integrate the LiDAR and color cameras on the iPhone 15 Pro Max with GPU-based image processing. We use Apple’s SwiftUI and Swift frameworks for user interface and backend development, and Metal Shader Language (MSL) for realtime image enhancement at the standard iPhone streaming frame rate of 60 frames per second. The only meaningful limitations of the technology are the streaming bandwidth of the depth data, which currently reduces the depth map resolution to 320x240, and any pre-existing limitations of the LiDAR IR laser to reflect accurate depth from some materials. If the LiDAR resolution on a mobile device like the iPhone can be improved to match the color image resolution, LiDAR could feasibly become the preeminent method of background removal for video applications and photography.
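
The depth-keying idea can be sketched on the CPU as below: upsample the low-resolution depth map to the color resolution, then keep only pixels closer than a cutoff. The paper does this on the GPU with Metal; the nearest-neighbour upsampling, the `max_depth_m` threshold, and the function names here are assumptions for illustration.

```python
def upsample_nearest(depth, out_h, out_w):
    """Nearest-neighbour upsampling of a 2D depth map (list of rows)."""
    in_h, in_w = len(depth), len(depth[0])
    return [[depth[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def remove_background(color, depth, max_depth_m=1.5, fill=(0, 0, 0)):
    """Keep color pixels whose depth is within max_depth_m; replace the rest."""
    h, w = len(color), len(color[0])
    depth_full = upsample_nearest(depth, h, w)
    return [[color[y][x] if depth_full[y][x] <= max_depth_m else fill
             for x in range(w)]
            for y in range(h)]


if __name__ == "__main__":
    white = (255, 255, 255)
    color = [[white] * 4 for _ in range(2)]   # 2x4 color frame
    depth = [[1.0, 3.0]]                      # 1x2 depth map in metres
    for row in remove_background(color, depth):
        print(row)
```

The resolution mismatch the abstract mentions shows up here as blocky mask edges: each depth sample decides the fate of a whole patch of color pixels.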

[CV-106] UltraEar: a multicentric large-scale database combining ultra-high-resolution computed tomography and clinical data for ear diseases

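[Quick Read]: This paper addresses the lack of large-scale, high-resolution, multicenter, standardized datasets for ear disease imaging, which has held back precise diagnosis, AI algorithm development, and multicenter collaborative research. The key to the solution is the UltraEar database, which brings together ultra-high-resolution CT (U-HRCT, isotropic 0.1 mm) images and structured clinical data from patients with a wide range of ear diseases at 11 tertiary hospitals. It covers anatomical variants, infection, trauma, congenital malformations, and other pathologies; applies standardized preprocessing for geometric calibration, image annotation, and multi-structure segmentation; and de-identifies all data in line with privacy requirements. The resource supports radiological research, AI model training and validation, education, and cross-institution collaboration.

Link: https://arxiv.org/abs/2508.20141
Authors: Ruowei Tang, Pengfei Zhao, Xiaoguang Li, Ning Xu, Yue Cheng, Mengshi Zhang, Zhixiang Wang, Zhengyu Zhang, Hongxia Yin, Heyu Ding, Shusheng Gong, Yuhe Liu, Zhenchang Wang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: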


Abstract:Ear diseases affect billions of people worldwide, leading to substantial health and socioeconomic burdens. Computed tomography (CT) plays a pivotal role in accurate diagnosis, treatment planning, and outcome evaluation. The objective of this study is to present the establishment and design of UltraEar Database, a large-scale, multicentric repository of isotropic 0.1 mm ultra-high-resolution CT (U-HRCT) images and associated clinical data dedicated to ear diseases. UltraEar recruits patients from 11 tertiary hospitals between October 2020 and October 2035, integrating U-HRCT images, structured CT reports, and comprehensive clinical information, including demographics, audiometric profiles, surgical records, and pathological findings. A broad spectrum of otologic disorders is covered, such as otitis media, cholesteatoma, ossicular chain malformation, temporal bone fracture, inner ear malformation, cochlear aperture stenosis, enlarged vestibular aqueduct, and sigmoid sinus bony deficiency. Standardized preprocessing pipelines have been developed for geometric calibration, image annotation, and multi-structure segmentation. All personal identifiers in DICOM headers and metadata are removed or anonymized to ensure compliance with data privacy regulation. Data collection and curation are coordinated through monthly expert panel meetings, with secure storage on an offline cloud system. UltraEar provides an unprecedented ultra-high-resolution reference atlas with both technical fidelity and clinical relevance. This resource has significant potential to advance radiological research, enable development and validation of AI algorithms, serve as an educational tool for training in otologic imaging, and support multi-institutional collaborative studies. UltraEar will be continuously updated and expanded, ensuring long-term accessibility and usability for the global otologic research community.

[CV-107] Is the medical image segmentation problem solved? A survey of current developments and future directions

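[Quick Read]: This paper systematically reviews the past decade of deep learning-based medical image segmentation and asks one central question: how far have current models overcome long-standing challenges, and which key gaps remain? It analyzes progress along seven dimensions: the shift from supervised to semi-/unsupervised learning, the move from organ segmentation to lesion-focused tasks, advances in multi-modality integration and domain adaptation, the role of foundation models and transfer learning, the move from deterministic to probabilistic segmentation, the extension from 2D to 3D and 4D segmentation, and the trend from model invocation to segmentation agents. Together these dimensions give a comprehensive view of how the field has evolved and offer guidance for future work.

Link: https://arxiv.org/abs/2508.20139
Authors: Guoping Xu, Jayaram K. Udupa, Jax Luo, Songlin Zhao, Yajun Yu, Scott B. Raymond, Hao Peng, Lipeng Ning, Yogesh Rathi, Wei Liu, You Zhang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 80 pages, 38 figures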


Abstract:Medical image segmentation has advanced rapidly over the past two decades, largely driven by deep learning, which has enabled accurate and efficient delineation of cells, tissues, organs, and pathologies across diverse imaging modalities. This progress raises a fundamental question: to what extent have current models overcome persistent challenges, and what gaps remain? In this work, we provide an in-depth review of medical image segmentation, tracing its progress and key developments over the past decade. We examine core principles, including multiscale analysis, attention mechanisms, and the integration of prior knowledge, across the encoder, bottleneck, skip connections, and decoder components of segmentation networks. Our discussion is organized around seven key dimensions: (1) the shift from supervised to semi-/unsupervised learning, (2) the transition from organ segmentation to lesion-focused tasks, (3) advances in multi-modality integration and domain adaptation, (4) the role of foundation models and transfer learning, (5) the move from deterministic to probabilistic segmentation, (6) the progression from 2D to 3D and 4D segmentation, and (7) the trend from model invocation to segmentation agents. Together, these perspectives provide a holistic overview of the trajectory of deep learning-based medical image segmentation and aim to inspire future innovation. To support ongoing research, we maintain a continually updated repository of relevant literature and open-source resources at this https URL

[CV-108] A Machine Learning Approach to Volumetric Computations of Solid Pulmonary Nodules

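[Quick Read]: This paper addresses the accuracy of volumetric assessment of pulmonary nodules in CT scans. Traditional methods such as the consolidation-to-tumor ratio (CTR) and spherical approximation give inconsistent estimates because nodule shape and density vary. The key to the solution is a framework that combines a multi-scale 3D convolutional neural network (CNN) with subtype-specific bias correction for accurate, fast volume estimation. Relative to existing deep learning and semi-automated pipelines, it reduces error by more than 17 percentage points and runs about three times faster.

Link: https://arxiv.org/abs/2508.20127
Authors: Yihan Zhou, Haocheng Huang, Yue Yu, Jianhui Shang
Affiliations: Shanghai World Foreign Language Academy; Shanghai Jiao Tong University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: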


Abstract:Early detection of lung cancer is crucial for effective treatment and relies on accurate volumetric assessment of pulmonary nodules in CT scans. Traditional methods, such as consolidation-to-tumor ratio (CTR) and spherical approximation, are limited by inconsistent estimates due to variability in nodule shape and density. We propose an advanced framework that combines a multi-scale 3D convolutional neural network (CNN) with subtype-specific bias correction for precise volume estimation. The model was trained and evaluated on a dataset of 364 cases from Shanghai Chest Hospital. Our approach achieved a mean absolute deviation of 8.0 percent compared to manual nonlinear regression, with inference times under 20 seconds per scan. This method outperforms existing deep learning and semi-automated pipelines, which typically have errors of 25 to 30 percent and require over 60 seconds for processing. Our results show a reduction in error by over 17 percentage points and a threefold acceleration in processing speed. These advancements offer a highly accurate, efficient, and scalable tool for clinical lung nodule screening and monitoring, with promising potential for improving early lung cancer detection.
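
The headline numbers can be checked with simple arithmetic, and the spherical approximation the abstract criticizes is a one-line formula. The sketch below uses only figures stated in the abstract (25 to 30 percent baseline error, 8.0 percent reported error, 60 s versus 20 s); the 10 mm diameter is a made-up example value.

```python
import math


def sphere_volume_mm3(diameter_mm):
    """Spherical approximation: V = pi * d^3 / 6."""
    return math.pi * diameter_mm ** 3 / 6


def percentage_point_reduction(baseline_pct, new_pct):
    return baseline_pct - new_pct


def speedup(baseline_seconds, new_seconds):
    return baseline_seconds / new_seconds


if __name__ == "__main__":
    print(round(sphere_volume_mm3(10.0), 1))        # hypothetical 10 mm nodule
    print(percentage_point_reduction(25.0, 8.0))    # lower end of the baseline range
    print(percentage_point_reduction(30.0, 8.0))    # upper end of the baseline range
    print(speedup(60.0, 20.0))
```

Because volume grows with the cube of diameter, a small error in measured diameter becomes a much larger error in estimated volume, which is one reason the spherical approximation is unreliable for irregular nodules.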

Artificial Intelligence

[AI-0] Prompt-to-Product: Generative Assembly via Bimanual Manipulation

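[Quick Read]: This paper addresses the heavy reliance on manual effort and expert knowledge in designing and building assembly products, which shows up in two places: turning user requirements stated in natural language into an executable assembly design, and constructing the physical product automatically. The key to the solution is an automated pipeline called Prompt-to-Product. It first generates physically buildable LEGO brick assembly structures from a natural language prompt, then uses a bimanual robotic system to assemble the physical product, giving an end-to-end path from a user's idea to a real object.

Link: https://arxiv.org/abs/2508.21063
Authors: Ruixuan Liu, Philip Huang, Ava Pun, Kangle Deng, Shobhit Aggarwal, Kevin Tang, Michelle Liu, Deva Ramanan, Jun-Yan Zhu, Jiaoyang Li, Changliu Liu
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 12 pages, 10 figures, 2 tables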


Abstract:Creating assembly products demands significant manual effort and expert knowledge in 1) designing the assembly and 2) constructing the product. This paper introduces Prompt-to-Product, an automated pipeline that generates real-world assembly products from natural language prompts. Specifically, we leverage LEGO bricks as the assembly platform and automate the process of creating brick assembly structures. Given the user design requirements, Prompt-to-Product generates physically buildable brick designs, and then leverages a bimanual robotic system to construct the real assembly products, bringing user imaginations into the real world. We conduct a comprehensive user study, and the results demonstrate that Prompt-to-Product significantly lowers the barrier and reduces manual effort in creating assembly products from imaginative ideas.

[AI-1] OnGoal: Tracking and Visualizing Conversational Goals in Multi-Turn Dialogue with Large Language Models

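[Quick Read]: This paper addresses the difficulty users have in evaluating and reviewing progress on their conversational goals during multi-turn dialogue with large language models (LLMs). The key to the solution is OnGoal, an LLM chat interface with goal tracking. Its core mechanisms are real-time feedback on goal alignment through LLM-assisted evaluation, example-based explanations of evaluation results, and overviews of goal progression over time, which reduce cognitive load and help users navigate complex dialogues and adjust their strategies.

Link: https://arxiv.org/abs/2508.21061
Authors: Adam Coscia, Shunan Guo, Eunyee Koh, Alex Endert
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to UIST 2025. 18 pages, 9 figures, 2 tables. For a demo video, see this https URL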


Abstract:As multi-turn dialogues with large language models (LLMs) grow longer and more complex, how can users better evaluate and review progress on their conversational goals? We present OnGoal, an LLM chat interface that helps users better manage goal progress. OnGoal provides real-time feedback on goal alignment through LLM-assisted evaluation, explanations for evaluation results with examples, and overviews of goal progression over time, enabling users to navigate complex dialogues more effectively. Through a study with 20 participants on a writing task, we evaluate OnGoal against a baseline chat interface without goal tracking. Using OnGoal, participants spent less time and effort to achieve their goals while exploring new prompting strategies to overcome miscommunication, suggesting tracking and visualizing goals can enhance engagement and resilience in LLM dialogues. Our findings inspired design implications for future LLM chat interfaces that improve goal communication, reduce cognitive load, enhance interactivity, and enable feedback to improve LLM performance.

[AI-2] Understanding, Protecting, and Augmenting Human Cognition with Generative AI: A Synthesis of the CHI 2025 Tools for Thought Workshop

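[Quick Read]: This paper addresses the gap between the scientific understanding of how generative AI (GenAI) affects human cognition and the design practice around it. It focuses on how GenAI changes cognitive processes such as metacognition, critical thinking, memory, and creativity, and on its potential as a tool for augmenting cognition. The key to the solution is interdisciplinary collaboration across academia and industry to build theories, metrics, and design tools for studying how GenAI acts on human thought and for guiding the development of GenAI tools that both protect and augment human cognition, and thereby to form a multidisciplinary research community around this area.

Link: https://arxiv.org/abs/2508.21036
Authors: Lev Tankelevitch, Elena L. Glassman, Jessica He, Aniket Kittur, Mina Lee, Srishti Palani, Advait Sarkar, Gonzalo Ramos, Yvonne Rogers, Hari Subramonyam
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: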


Abstract:Generative AI (GenAI) radically expands the scope and capability of automation for work, education, and everyday tasks, a transformation posing both risks and opportunities for human cognition. How will human cognition change, and what opportunities are there for GenAI to augment it? Which theories, metrics, and other tools are needed to address these questions? The CHI 2025 workshop on Tools for Thought aimed to bridge an emerging science of how the use of GenAI affects human thought, from metacognition to critical thinking, memory, and creativity, with an emerging design practice for building GenAI tools that both protect and augment human thought. Fifty-six researchers, designers, and thinkers from across disciplines as well as industry and academia, along with 34 papers and portfolios, seeded a day of discussion, ideation, and community-building. We synthesize this material here to begin mapping the space of research and design opportunities and to catalyze a multidisciplinary community around this pressing area of research.

[AI-3] Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance

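[Quick Read]: This paper addresses the challenge of aligning diffusion models with complex downstream objectives such as human preferences, compositional accuracy, or data compressibility. In particular, existing reinforcement learning (RL) fine-tuning methods are suboptimal for diffusion models and offer little control over alignment strength at inference time. The key to the solution is Reinforcement Learning Guidance (RLG), an inference-time method. It reinterprets RL fine-tuning through stochastic differential equations and implicit reward conditioning, and combines the outputs of the base model and the RL fine-tuned model via a geometric average. Adjusting the guidance scale is then equivalent to adjusting the KL-regularization coefficient, so the alignment-quality trade-off can be controlled dynamically with no additional training. Theoretical analysis and experiments show that RLG consistently improves performance across architectures, RL algorithms, and downstream tasks, and supports both interpolation and extrapolation.

Link: https://arxiv.org/abs/2508.21016
Authors: Luozhijie Jin, Zijie Qiu, Jie Liu, Zijie Diao, Lifeng Qiao, Ning Ding, Alex Lamb, Xipeng Qiu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: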


Abstract:Denoising-based generative models, particularly diffusion and flow matching algorithms, have achieved remarkable success. However, aligning their output distributions with complex downstream objectives, such as human preferences, compositional accuracy, or data compressibility, remains challenging. While reinforcement learning (RL) fine-tuning methods, inspired by advances in RL from human feedback (RLHF) for large language models, have been adapted to these generative frameworks, current RL approaches are suboptimal for diffusion models and offer limited flexibility in controlling alignment strength after fine-tuning. In this work, we reinterpret RL fine-tuning for diffusion models through the lens of stochastic differential equations and implicit reward conditioning. We introduce Reinforcement Learning Guidance (RLG), an inference-time method that adapts Classifier-Free Guidance (CFG) by combining the outputs of the base and RL fine-tuned models via a geometric average. Our theoretical analysis shows that RLG’s guidance scale is mathematically equivalent to adjusting the KL-regularization coefficient in standard RL objectives, enabling dynamic control over the alignment-quality trade-off without further training. Extensive experiments demonstrate that RLG consistently improves the performance of RL fine-tuned models across various architectures, RL algorithms, and downstream tasks, including human preferences, compositional control, compressibility, and text rendering. Furthermore, RLG supports both interpolation and extrapolation, thereby offering unprecedented flexibility in controlling generative alignment. Our approach provides a practical and theoretically sound solution for enhancing and controlling diffusion model alignment at inference. The source code for RLG is publicly available at the Github: this https URL.
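
One way to read the geometric-average combination: if the target density is proportional to p_base^(1-w) * p_RL^w, its score is the linear combination (1-w) * score_base + w * score_RL, which has the same shape as the CFG update. The sketch below applies that combination to two model outputs; the function name and the assumption that the outputs combine linearly in this way are illustrative, not taken from the authors' code.

```python
def rl_guidance(base_out, rl_out, w):
    """Combine base and RL fine-tuned model outputs with guidance scale w.

    w = 0 returns the base output, w = 1 the RL fine-tuned output,
    0 < w < 1 interpolates, and w > 1 extrapolates past the fine-tuned model.
    """
    return [b + w * (r - b) for b, r in zip(base_out, rl_out)]


if __name__ == "__main__":
    base = [0.0, 1.0]
    tuned = [1.0, 1.0]
    for w in (0.0, 0.5, 1.0, 2.0):
        print(w, rl_guidance(base, tuned, w))
```

Because `w` is applied only at sampling time, one pair of checkpoints covers a whole range of alignment strengths without retraining.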

[AI-4] Train-Once Plan-Anywhere Kinodynamic Motion Planning via Diffusion Trees

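[Quick Read]: This paper addresses slow exploration in kinodynamic motion planning. Classical sampling-based planners (SBPs) explore slowly because they sample actions without guidance, while learning-based methods are fast but generalize poorly and lack safety guarantees, which makes them hard to deploy on physical robots. The key to the solution is the Diffusion Tree (DiTree) framework, which embeds diffusion policies (DPs) in an SBP as informed samplers. It combines a DP's ability to model the distribution of expert trajectories, conditioned on local observations, with the completeness of SBPs, and obtains provably safe, generalizable solutions within a few action propagation iterations. In experiments on out-of-distribution (OOD) scenarios, DiTree is on average 3x faster than classical SBPs and achieves a roughly 30% higher success rate than the other approaches.

Link: https://arxiv.org/abs/2508.21001
Authors: Yaniv Hassidof, Tom Jurgenson, Kiril Solovey
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted to CoRL 2025. Project page: this https URL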


Abstract:Kinodynamic motion planning is concerned with computing collision-free trajectories while abiding by the robot’s dynamic constraints. This critical problem is often tackled using sampling-based planners (SBPs) that explore the robot’s high-dimensional state space by constructing a search tree via action propagations. Although SBPs can offer global guarantees on completeness and solution quality, their performance is often hindered by slow exploration due to uninformed action sampling. Learning-based approaches can yield significantly faster runtimes, yet they fail to generalize to out-of-distribution (OOD) scenarios and lack critical guarantees, e.g., safety, thus limiting their deployment on physical robots. We present Diffusion Tree (DiTree): a provably-generalizable framework leveraging diffusion policies (DPs) as informed samplers to efficiently guide state-space search within SBPs. DiTree combines DP’s ability to model complex distributions of expert trajectories, conditioned on local observations, with the completeness of SBPs to yield provably-safe solutions within a few action propagation iterations for complex dynamical systems. We demonstrate DiTree’s power with an implementation combining the popular RRT planner with a DP action sampler trained on a single environment. In comprehensive evaluations on OOD scenarios, DiTree is on average 3x faster than classical SBPs, and outperforms all other approaches by achieving roughly 30% higher success rate. Project webpage: this https URL.
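
The division of labour described above, a tree search that validates every propagated state while a learned policy only proposes actions, can be sketched as follows. This is a simplified random tree, not RRT proper (node selection is uniform rather than nearest-neighbour), and `sample_action` is a hypothetical hook standing in for the diffusion policy. The double-integrator dynamics and all parameter values are assumptions for illustration.

```python
import random


def propagate(state, action, dt=0.1):
    """Double-integrator dynamics: state = (position, velocity), action = acceleration."""
    pos, vel = state
    return (pos + vel * dt, vel + action * dt)


def grow_tree(start, goal_pos, sample_action, is_valid, iters=500, tol=0.05, seed=0):
    """Grow a search tree by propagating sampled actions from existing nodes.

    Returns (nodes, parent, goal_index); goal_index is None if the goal was not reached.
    """
    rng = random.Random(seed)
    nodes = [start]
    parent = {0: None}
    for _ in range(iters):
        idx = rng.randrange(len(nodes))                     # uniform node selection
        action = sample_action(nodes[idx], goal_pos, rng)   # informed sampler hook
        new_state = propagate(nodes[idx], action)
        if not is_valid(new_state):
            continue                                        # planner rejects unsafe states
        nodes.append(new_state)
        parent[len(nodes) - 1] = idx
        if abs(new_state[0] - goal_pos) < tol:
            return nodes, parent, len(nodes) - 1
    return nodes, parent, None


if __name__ == "__main__":
    # Stand-in sampler: accelerate toward the goal, with some noise.
    def toward_goal(state, goal, rng):
        direction = 1.0 if goal > state[0] else -1.0
        return direction + rng.uniform(-0.5, 0.5)

    nodes, parent, goal_idx = grow_tree((0.0, 0.0), 1.0, toward_goal, lambda s: abs(s[1]) < 2.0)
    print(len(nodes), goal_idx)
```

The point of the structure is that safety does not depend on the sampler: a poor `sample_action` slows the search down, but every state added to the tree has still passed `is_valid`.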

[AI-5] ChatThero: An LLM-Supported Chatbot for Behavior Change and Therapeutic Support in Addiction Recovery

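[Quick Read]: This paper addresses the widespread lack of effective treatment for people with substance use disorders (SUDs) worldwide, driven by stigma, motivational barriers, and limited personalized support. Existing mental-health assistants based on large language models (LLMs) often lack tight integration with clinically validated intervention strategies, which limits their effectiveness in addiction recovery. The key to the solution is ChatThero, a multi-agent conversational framework that couples dynamic patient modeling with context-sensitive therapeutic dialogue and embeds adaptive persuasive strategies grounded in cognitive behavioral therapy (CBT) and motivational interviewing (MI), with the aim of delivering targeted, interpretable, clinically grounded interactive interventions.

Link: https://arxiv.org/abs/2508.20996
Authors: Junda Wang, Zonghai Yao, Zhichao Yang, Lingxi Li, Junhui Qian, Hong Yu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: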


Abstract:Substance use disorders (SUDs) affect over 36 million people worldwide, yet few receive effective care due to stigma, motivational barriers, and limited personalized support. Although large language models (LLMs) show promise for mental-health assistance, most systems lack tight integration with clinically validated strategies, reducing effectiveness in addiction recovery. We present ChatThero, a multi-agent conversational framework that couples dynamic patient modeling with context-sensitive therapeutic dialogue and adaptive persuasive strategies grounded in cognitive behavioral therapy (CBT) and motivational interviewing (MI). We build a high-fidelity synthetic benchmark spanning Easy, Medium, and Hard resistance levels, and train ChatThero with a two-stage pipeline comprising supervised fine-tuning (SFT) followed by direct preference optimization (DPO). In evaluation, ChatThero yields a 41.5% average gain in patient motivation, a 0.49% increase in treatment confidence, and resolves hard cases with 26% fewer turns than GPT-4o, and both automated and human clinical assessments rate it higher in empathy, responsiveness, and behavioral realism. The framework supports rigorous, privacy-preserving study of therapeutic conversation and provides a robust, replicable basis for research and clinical translation.
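
The second training stage named in the abstract, direct preference optimization (DPO), minimizes a loss that can be written for one preference pair as below. This is the standard DPO objective on summed log-probabilities, not ChatThero's training code; the `beta` value and the example numbers are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the total log-probability of a response under the policy
    being trained or under the frozen reference (here, the SFT) model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log 2.
    print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
    # Policy has shifted probability toward the chosen response: the loss drops.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```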

[AI-6] Efficient Neuro-Symbolic Learning of Constraints and Objective

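[Quick Read]: This paper addresses how to combine discrete reasoning with neural networks so that a model can learn, from natural inputs, to solve NP-hard reasoning or optimization problems, a task that current large language models (LLMs) struggle with. The key to the solution is a differentiable neuro-symbolic architecture and a purpose-built probabilistic loss. The loss learns both the constraints and the objective of the problem, producing a complete model that can be inspected and extended with side constraints. Moving the combinatorial solver out of the training loop makes training scalable, while exact inference gives access to maximum accuracy.

Link: https://arxiv.org/abs/2508.20978
Authors: Marianne Defresne, Romain Gambardella, Sophie Barbe, Thomas Schiex
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)
Comments: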


Abstract:In the ongoing quest for hybridizing discrete reasoning with neural nets, there is an increasing interest in neural architectures that can learn how to solve discrete reasoning or optimization problems from natural inputs, a task that Large Language Models seem to struggle with. Objectives: We introduce a differentiable neuro-symbolic architecture and a loss function dedicated to learning how to solve NP-hard reasoning problems. Methods: Our new probabilistic loss allows for learning both the constraints and the objective, thus delivering a complete model that can be scrutinized and completed with side constraints. By pushing the combinatorial solver out of the training loop, our architecture also offers scalable training while exact inference gives access to maximum accuracy. Results: We empirically show that it can efficiently learn how to solve NP-hard reasoning problems from natural inputs. On three variants of the Sudoku benchmark (symbolic, visual, and many-solution), our approach requires a fraction of the training time of other hybrid methods. On a visual Min-Cut/Max-Cut task, it optimizes the regret better than a Decision-Focused-Learning regret-dedicated loss. Finally, it efficiently learns the energy optimization formulation of the large real-world problem of designing proteins.

[AI-7] WoW-Bench: Evaluating Fine-Grained Acoustic Perception in Audio-Language Models via Marine Mammal Vocalizations

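[Quick read]: This paper targets the weak low-level auditory perception of Large Audio Language Models (LALMs), in particular their largely unexplored ability to recognize fine-grained acoustic features such as pitch and duration. That ability is critical for real-world, out-of-distribution tasks such as reasoning about unfamiliar sounds. The authors propose the World-of-Whale benchmark (WoW-Bench), whose core contribution is an evaluation suite with two sub-benchmarks: a Perception benchmark for categorizing novel sounds, and a Cognition benchmark, inspired by Bloom's taxonomy, that assesses a model's ability to remember, understand, apply, and analyze. Distractor questions test whether models really rely on auditory cues rather than other heuristics. Experiments show that state-of-the-art LALMs perform far below human level on the benchmark, underscoring the need for stronger auditory grounding in LALMs.

Link: https://arxiv.org/abs/2508.20976
Authors: Jaeyeon Kim, Heeseung Yun, Sang Hoon Woo, Chao-Han Huck Yang, Gunhee Kim
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Preprint. Project page: this https URL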

Abstract:Large audio language models (LALMs) extend language understanding into the auditory domain, yet their ability to perform low-level listening, such as pitch and duration detection, remains underexplored. However, low-level listening is critical for real-world, out-of-distribution tasks where models must reason about unfamiliar sounds based on fine-grained acoustic cues. To address this gap, we introduce the World-of-Whale benchmark (WoW-Bench) to evaluate low-level auditory perception and cognition using marine mammal vocalizations. WoW-bench is composed of a Perception benchmark for categorizing novel sounds and a Cognition benchmark, inspired by Bloom’s taxonomy, to assess the abilities to remember, understand, apply, and analyze sound events. For the Cognition benchmark, we additionally introduce distractor questions to evaluate whether models are truly solving problems through listening rather than relying on other heuristics. Experiments with state-of-the-art LALMs show performance far below human levels, indicating a need for stronger auditory grounding in LALMs.

[AI-8] A Multi-Objective Genetic Algorithm for Healthcare Workforce Scheduling ECAI2025

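[Quick read]: This paper addresses workforce scheduling for hospital units in the healthcare sector, a multi-objective optimization problem that must balance labor cost, patient care coverage, and staff preferences. The key to its solution is a Multi-objective Genetic Algorithm (MOO-GA) that models hourly appointment-driven demand and modular shifts. With objective functions for cost, care coverage, and staff satisfaction, it searches a large solution space for a set of high-quality non-dominated solutions, and on average outperforms a baseline that simulates conventional manual scheduling by 66%.

Link: https://arxiv.org/abs/2508.20953
Authors: Vipul Patel, Anirudh Deodhar, Dagnachew Birru
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
Comments: 8 pages, 7 figures, Accepted at the Multi-Objective Decision Making Workshop (MODeM2025) at ECAI 2025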

Abstract:Workforce scheduling in the healthcare sector is a significant operational challenge, characterized by fluctuating patient loads, diverse clinical skills, and the critical need to control labor costs while upholding high standards of patient care. This problem is inherently multi-objective, demanding a delicate balance between competing goals: minimizing payroll, ensuring adequate staffing for patient needs, and accommodating staff preferences to mitigate burnout. We propose a Multi-objective Genetic Algorithm (MOO-GA) that models the hospital unit workforce scheduling problem as a multi-objective optimization task. Our model incorporates real-world complexities, including hourly appointment-driven demand and the use of modular shifts for a multi-skilled workforce. By defining objective functions for cost, patient care coverage, and staff satisfaction, the GA navigates the vast search space to identify a set of high-quality, non-dominated solutions. Demonstrated on datasets representing a typical hospital unit, the results show that our MOO-GA generates robust and balanced schedules. On average, the schedules produced by our algorithm showed a 66% performance improvement over a baseline that simulates a conventional, manual scheduling process. This approach effectively manages trade-offs between critical operational and staff-centric objectives, providing a practical decision support tool for nurse managers and hospital administrators.
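The "non-dominated solutions" mentioned in the abstract can be illustrated with a small Pareto filter. This is only a sketch under stated assumptions: the three objectives are encoded as numbers to be minimized (cost, uncovered demand, staff dissatisfaction), and the names `dominates` and `non_dominated` are hypothetical. It is not the paper's MOO-GA, which also involves selection, crossover, and mutation.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, ...]  # assumed order: (cost, uncovered_demand, dissatisfaction)


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b on every objective and
    strictly better on at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Objectives]) -> List[Objectives]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical schedules scored on the three assumed objectives.
schedules = [(100, 5, 2), (120, 3, 2), (130, 6, 3), (100, 5, 1)]
front = non_dominated(schedules)
```

A genetic algorithm would typically apply a filter like this (or a faster non-dominated sort) to each generation to decide which candidate schedules survive.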

[AI-9] Research Challenges in Relational Database Management Systems for LLM Queries VLDB2025

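[Quick read]: This paper addresses the limited functionality, poor performance, and poor scalability of current integrations that invoke LLMs from SQL inside database management systems (DBMS). By evaluating two open-source systems and one enterprise platform, the authors identify three core challenges: enforcing structured outputs, optimizing resource utilization, and improving query planning. The key to the solution is tighter integration of LLMs and the DBMS, which improves the scalability and efficiency of LLM-powered SQL queries.

Link: https://arxiv.org/abs/2508.20912
Authors: Kerem Akillioglu, Anurag Chakraborty, Sairaj Voruganti, M. Tamer Özsu
Affiliation: unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: This paper will appear in the 6th International Workshop on Applied AI for Database Systems and Applications, AIDB Workshop at VLDB 2025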

Abstract:Large language models (LLMs) have become essential for applications such as text summarization, sentiment analysis, and automated question-answering. Recently, LLMs have also been integrated into relational database management systems to enhance querying and support advanced data processing. Companies such as Amazon, Databricks, Google, and Snowflake offer LLM invocation directly within SQL, denoted as LLM queries, to boost data insights. However, open-source solutions currently have limited functionality and poor performance. In this work, we present an early exploration of two open-source systems and one enterprise platform, using five representative queries to expose functional, performance, and scalability limits in today’s SQL-invoked LLM integrations. We identify three main issues: enforcing structured outputs, optimizing resource utilization, and improving query planning. We implemented initial solutions and observed improvements in accommodating LLM powered SQL queries. These early gains demonstrate that tighter integration of LLM+DBMS is the key to scalable and efficient processing of LLM queries.

[AI-10] AI Agentic Vulnerability Injection And Transformation with Optimized Reasoning

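[Quick read]: This paper addresses vulnerability detection and repair in the face of increasingly complex software systems and increasingly sophisticated cyber-attacks, and in particular the limits of traditional static program analysis in scalability, adaptability, and false-positive and false-negative rates. Its core solution is a novel automated framework that injects realistic, category-specific vulnerabilities into secure C/C++ codebases to generate high-quality training datasets. The key innovation is coordinating multiple AI agents that simulate expert reasoning, alongside function agents and traditional code analysis tools. The framework uses Retrieval-Augmented Generation (RAG) for contextual grounding and a low-rank approximation of weights for efficient fine-tuning, reaching 89% to 95% success rates for function-level vulnerability injection on 116 code samples from three benchmarks.

Link: https://arxiv.org/abs/2508.20866
Authors: Amine Lbath, Massih-Reza Amini, Aurelien Delaitre, Vadim Okun
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: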

Abstract:The increasing complexity of software systems and the sophistication of cyber-attacks have underscored the critical need for effective automated vulnerability detection and repair systems. Traditional methods, such as static program analysis, face significant challenges related to scalability, adaptability, and high false-positive and false-negative rates. AI-driven approaches, particularly those using machine learning and deep learning models, show promise but are heavily reliant on the quality and quantity of training data. This paper introduces a novel framework designed to automatically introduce realistic, category-specific vulnerabilities into secure C/C++ codebases to generate datasets. The proposed approach coordinates multiple AI agents that simulate expert reasoning, along with function agents and traditional code analysis tools. It leverages Retrieval-Augmented Generation for contextual grounding and employs Low-Rank approximation of weights for efficient model fine-tuning. Our experimental study on 116 code samples from three different benchmarks suggests that our approach outperforms other techniques with regard to dataset accuracy, achieving between 89% and 95% success rates in injecting vulnerabilities at function level.

[AI-11] JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring

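[Quick read]: This paper addresses a fundamental yet unresolved challenge: determining whether a jailbreak attack has succeeded. Existing evaluation methods rely on misaligned proxy indicators or naive holistic judgments, often misread model responses, and so produce inconsistent, subjective assessments that diverge from human perception. The key to its solution is JADES (Jailbreak Assessment via Decompositional Scoring), a universal evaluation framework that automatically decomposes a harmful question into a set of weighted sub-questions, scores each sub-answer, and aggregates the sub-scores by weight into a final decision. An optional fact-checking module strengthens the detection of hallucinations in jailbreak responses, giving accurate, consistent, and interpretable evaluations.

Link: https://arxiv.org/abs/2508.20848
Authors: Junjie Chu, Mingjie Li, Ziqing Yang, Ye Leng, Chenhao Lin, Chao Shen, Michael Backes, Yun Shen, Yang Zhang
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures. For the code and data supporting this work, see this https URL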

Abstract:Accurately determining whether a jailbreak attempt has succeeded is a fundamental yet unresolved challenge. Existing evaluation methods rely on misaligned proxy indicators or naive holistic judgments. They frequently misinterpret model responses, leading to inconsistent and subjective assessments that misalign with human perception. To address this gap, we introduce JADES (Jailbreak Assessment via Decompositional Scoring), a universal jailbreak evaluation framework. Its key mechanism is to automatically decompose an input harmful question into a set of weighted sub-questions, score each sub-answer, and weight-aggregate the sub-scores into a final decision. JADES also incorporates an optional fact-checking module to strengthen the detection of hallucinations in jailbreak responses. We validate JADES on JailbreakQR, a newly introduced benchmark proposed in this work, consisting of 400 pairs of jailbreak prompts and responses, each meticulously annotated by humans. In a binary setting (success/failure), JADES achieves 98.5% agreement with human evaluators, outperforming strong baselines by over 9%. Re-evaluating five popular attacks on four LLMs reveals substantial overestimation (e.g., LAA’s attack success rate on GPT-3.5-Turbo drops from 93% to 69%). Our results show that JADES could deliver accurate, consistent, and interpretable evaluations, providing a reliable basis for measuring future jailbreak attacks.
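The decompose, score, and weight-aggregate step described in the abstract might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the `SubQuestion` type, sub-scores in [0, 1], weight normalization, and the 0.5 decision threshold are hypothetical and are not taken from the JADES paper or its code. The decomposition and per-answer scoring, which the paper automates, are simply given as inputs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SubQuestion:
    text: str      # one sub-question produced by the decomposition step
    weight: float  # relative importance (assumed non-negative)
    score: float   # how fully the response answers it, assumed in [0, 1]


def aggregate(subs: List[SubQuestion], threshold: float = 0.5) -> Tuple[float, bool]:
    """Weight-aggregate sub-scores into one score and a binary decision.

    The normalization and the threshold are illustrative assumptions.
    """
    total_weight = sum(s.weight for s in subs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    final_score = sum(s.weight * s.score for s in subs) / total_weight
    return final_score, final_score >= threshold


# Placeholder sub-questions with made-up weights and scores.
subs = [
    SubQuestion("sub-question A", weight=0.5, score=1.0),
    SubQuestion("sub-question B", weight=0.3, score=0.5),
    SubQuestion("sub-question C", weight=0.2, score=0.0),
]
final_score, success = aggregate(subs)
```

Scoring each sub-answer separately is what makes the final decision inspectable: a reviewer can see which sub-questions drove the score instead of receiving a single holistic verdict.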

[AI-12] Learning Primitive Embodied World Models: Towards Scalable Robotic Learning

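[Quick read]: This paper addresses the bottleneck that video-generation-based embodied world models face from their reliance on large-scale embodied interaction data: the data is scarce, hard to collect, and high-dimensional, and language-action alignment is coarse. These problems limit long-horizon video generation and keep generative AI from a GPT-like breakthrough in the embodied domain. The key to its solution is a new modeling paradigm, Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, it enables fine-grained alignment between linguistic concepts and visual representations of robot actions, reduces learning complexity, and improves data efficiency and inference speed. A modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG) add flexible closed-loop control and compositional generalization of primitive-level policies, combining the spatiotemporal vision priors of video models with the semantic awareness of VLMs to bridge physical interaction and high-level reasoning.

Link: https://arxiv.org/abs/2508.20840
Authors: Qiao Sun, Liujia Yang, Wei Tang, Wei Huang, Kaixin Xu, Yongchao Chen, Mingyu Liu, Jiange Yang, Haoyi Zhu, Yating Wang, Tong He, Yilun Chen, Xili Dai, Nanyang Ye, Qinying Gu
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: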

Abstract:While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation–hindering generative models from achieving a “GPT moment” in the embodied domain. There is a naive observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling–Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. By equipping with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.

[AI-13] Multi-Agent Penetration Testing AI for the Web

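[Quick read]: This paper addresses the scalability crisis in security auditing triggered by AI-powered software development platforms: up to 40% of AI-generated code contains vulnerabilities, so security assessment cannot keep pace with development. The key to its solution is MAPTA, a multi-agent system for autonomous web application security assessment. Its core components are Large Language Model (LLM) agent orchestration, tool-grounded execution, and end-to-end exploit validation, which together aim at efficient and accurate automated security testing.

Link: https://arxiv.org/abs/2508.20816
Authors: Isaac David, Arthur Gervais
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: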

Abstract:AI-powered development platforms are making software creation accessible to a broader audience, but this democratization has triggered a scalability crisis in security auditing. With studies showing that up to 40% of AI-generated code contains vulnerabilities, the pace of development now vastly outstrips the capacity for thorough security assessment. We present MAPTA, a multi-agent system for autonomous web application security assessment that combines large language model orchestration with tool-grounded execution and end-to-end exploit validation. On the 104-challenge XBOW benchmark, MAPTA achieves 76.9% overall success with perfect performance on SSRF and misconfiguration vulnerabilities, 83% success on broken authorization, and strong results on injection attacks including server-side template injection (85%) and SQL injection (83%). Cross-site scripting (57%) and blind SQL injection (0%) remain challenging. Our comprehensive cost analysis across all challenges totals $21.38 with a median cost of $0.073 for successful attempts versus $0.357 for failures. Success correlates strongly with resource efficiency, enabling practical early-stopping thresholds at approximately 40 tool calls or $0.30 per challenge. MAPTA's real-world findings are impactful given both the popularity of the respective scanned GitHub repositories (8K-70K stars) and MAPTA's low average operating cost of $3.67 per open-source assessment: MAPTA discovered critical vulnerabilities including RCEs, command injections, secret exposure, and arbitrary file write vulnerabilities. Findings are responsibly disclosed, 10 findings are under CVE review.
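The early-stopping thresholds quoted in the abstract (approximately 40 tool calls or 0.30 per challenge) suggest a simple budget check, sketched below. The function name, the default limits, and the reading of the cost figure as US dollars are assumptions for illustration, not MAPTA's actual implementation.

```python
def should_stop(tool_calls: int, cost_usd: float,
                max_tool_calls: int = 40, max_cost_usd: float = 0.30) -> bool:
    """Return True once an attempt has used up either assumed budget.

    Defaults mirror the approximate thresholds quoted in the abstract;
    treating the cost as US dollars is an assumption.
    """
    return tool_calls >= max_tool_calls or cost_usd >= max_cost_usd


# An agent loop could call this after every tool invocation.
keep_going_example = should_stop(tool_calls=12, cost_usd=0.05)
stop_on_calls_example = should_stop(tool_calls=40, cost_usd=0.05)
stop_on_cost_example = should_stop(tool_calls=12, cost_usd=0.31)
```

The rationale given in the abstract is that failed attempts tend to cost more than successful ones, so cutting off expensive attempts early mostly saves budget on challenges that were unlikely to be solved.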

[AI-14] Uncertainty Aware-Predictive Control Barrier Functions: Safer Human Robot Interaction through Probabilistic Motion Forecasting

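[Quick read]: This paper addresses how collaborative robots can balance safety constraints with responsive behavior in workspaces shared with humans. The core challenge is that human motion is stochastic and task-dependent: purely reactive or worst-case safety envelopes make the robot brake too often and stall the task, which breaks the fluidity of human-robot interaction. The key to its solution is Uncertainty-Aware Predictive Control Barrier Functions (UA-PCBFs), a framework that fuses probabilistic human hand motion forecasting with the formal safety guarantees of Control Barrier Functions (CBFs). The uncertainty estimate from the forecasting module dynamically adjusts the safety margin, so the robot can plan with a better picture of future human states and interact more efficiently and flexibly while staying safe.

Link: https://arxiv.org/abs/2508.20812
Authors: Lorenzo Busellato, Federico Cunico, Diego Dall’Alba, Marco Emporio, Andrea Giachetti, Riccardo Muradore, Marco Cristani
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: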

Abstract:To enable flexible, high-throughput automation in settings where people and robots share workspaces, collaborative robotic cells must reconcile stringent safety guarantees with the need for responsive and effective behavior. A dynamic obstacle is the stochastic, task-dependent variability of human motion: when robots fall back on purely reactive or worst-case envelopes, they brake unnecessarily, stall task progress, and tamper with the fluidity that true Human-Robot Interaction demands. In recent years, learning-based human-motion prediction has rapidly advanced, although most approaches produce worst-case scenario forecasts that often do not treat prediction uncertainty in a well-structured way, resulting in over-conservative planning algorithms, limiting their flexibility. We introduce Uncertainty-Aware Predictive Control Barrier Functions (UA-PCBFs), a unified framework that fuses probabilistic human hand motion forecasting with the formal safety guarantees of Control Barrier Functions. In contrast to other variants, our framework allows for dynamic adjustment of the safety margin thanks to the human motion uncertainty estimation provided by a forecasting module. Thanks to uncertainty estimation, UA-PCBFs empower collaborative robots with a deeper understanding of future human states, facilitating more fluid and intelligent interactions through informed motion planning. We validate UA-PCBFs through comprehensive real-world experiments with an increasing level of realism, including automated setups (to perform exactly repeatable motions) with a robotic hand and direct human-robot interactions (to validate promptness, usability, and human confidence). Relative to state-of-the-art HRI architectures, UA-PCBFs show better performance in task-critical metrics, significantly reducing the number of violations of the robot’s safe space during interaction with respect to the state-of-the-art.

[AI-15] Speech Emotion Recognition via Entropy-Aware Score Selection

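[Quick read]: This paper addresses the limited accuracy of Speech Emotion Recognition (SER) caused by relying on a single modality. The key to its solution is a multimodal fusion framework based on entropy-aware score selection. It combines a primary acoustic pipeline based on wav2vec2.0 with a secondary text pipeline based on RoBERTa-XLM (using transcriptions from Whisper-large-v3), and applies late score fusion driven by entropy and varentropy thresholds to compensate for low-confidence primary predictions. A sentiment mapping strategy maps three sentiment categories onto four target emotion classes so the two modalities can be fused coherently. The paper reports improvements over traditional single-modality systems on the IEMOCAP and MSP-IMPROV datasets.

Link: https://arxiv.org/abs/2508.20796
Authors: ChenYi Chua, JunKai Wong, Chengxin Chen, Xiaoxiao Miao
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: The paper has been accepted by APCIPA ASC 2025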

Abstract:In this paper, we propose a multimodal framework for speech emotion recognition that leverages entropy-aware score selection to combine speech and textual predictions. The proposed method integrates a primary pipeline that consists of an acoustic model based on wav2vec2.0 and a secondary pipeline that consists of a sentiment analysis model using RoBERTa-XLM, with transcriptions generated via Whisper-large-v3. We propose a late score fusion approach based on entropy and varentropy thresholds to overcome the confidence constraints of primary pipeline predictions. A sentiment mapping strategy translates three sentiment categories into four target emotion classes, enabling coherent integration of multimodal predictions. The results on the IEMOCAP and MSP-IMPROV datasets show that the proposed method offers a practical and reliable enhancement over traditional single-modality systems.
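As a rough illustration of entropy-aware score selection, the sketch below computes the entropy and varentropy of the primary pipeline's class probabilities and falls back to the secondary pipeline when either exceeds a threshold. The selection rule, its direction, the threshold values, and the function names are all assumptions. The abstract only states that entropy and varentropy thresholds are used, and the secondary scores are assumed to be already mapped onto the four emotion classes.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def varentropy(probs: List[float]) -> float:
    """Variance of the surprisal -log p under the distribution itself."""
    h = entropy(probs)
    return sum(p * (-math.log(p) - h) ** 2 for p in probs if p > 0)


def select_scores(primary: List[float], secondary: List[float],
                  max_entropy: float = 1.0, max_varentropy: float = 1.0) -> List[float]:
    """Keep the primary (acoustic) scores unless they look uncertain.

    Both thresholds and the 'exceeds either one' rule are hypothetical.
    """
    if entropy(primary) > max_entropy or varentropy(primary) > max_varentropy:
        return secondary
    return primary


# Four emotion classes; a uniform acoustic prediction is maximally uncertain.
uncertain_acoustic = [0.25, 0.25, 0.25, 0.25]
confident_acoustic = [1.0, 0.0, 0.0, 0.0]
text_scores = [0.7, 0.1, 0.1, 0.1]
chosen_when_uncertain = select_scores(uncertain_acoustic, text_scores)
chosen_when_confident = select_scores(confident_acoustic, text_scores)
```

Varentropy is not monotone in confidence (a sharply peaked but not one-hot distribution can have a large value), so in practice both thresholds would need tuning on held-out data.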

[AI-16] Single Agent Robust Deep Reinforcement Learning for Bus Fleet Control

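[Quick read]: This paper addresses bus bunching in urban transit caused by uncertain traffic and passenger demand, and in particular the data imbalance and convergence difficulties of conventional multi-agent reinforcement learning (MARL) in realistic settings with non-loop lines, multiple routes, and dynamic demand. The key to its solution is recasting the multi-agent problem as a single-agent reinforcement learning (RL) problem: the state space is augmented with categorical identifiers (vehicle ID, station ID, time period) in addition to numerical features (headway, occupancy, velocity), a high-dimensional encoding that captures inter-agent dependencies. A structured, ridge-shaped reward balances uniform headways against schedule adherence and replaces the usual exponential penalty. Experiments show that the modified soft actor-critic (SAC) is more stable and performs better than benchmarks such as MADDPG, supporting the method's effectiveness and scalability in realistic scenarios.

Link: https://arxiv.org/abs/2508.20784
Authors: Yifan Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: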

Abstract:Bus bunching remains a challenge for urban transit due to stochastic traffic and passenger demand. Traditional solutions rely on multi-agent reinforcement learning (MARL) in loop-line settings, which overlook realistic operations characterized by heterogeneous routes, timetables, fluctuating demand, and varying fleet sizes. We propose a novel single-agent reinforcement learning (RL) framework for bus holding control that avoids the data imbalance and convergence issues of MARL under near-realistic simulation. A bidirectional timetabled network with dynamic passenger demand is constructed. The key innovation is reformulating the multi-agent problem into a single-agent one by augmenting the state space with categorical identifiers (vehicle ID, station ID, time period) in addition to numerical features (headway, occupancy, velocity). This high-dimensional encoding enables single-agent policies to capture inter-agent dependencies, analogous to projecting non-separable inputs into a higher-dimensional space. We further design a structured reward function aligned with operational goals: instead of exponential penalties on headway deviations, a ridge-shaped reward balances uniform headways and schedule adherence. Experiments show that our modified soft actor-critic (SAC) achieves more stable and superior performance than benchmarks, including MADDPG (e.g., -430k vs. -530k under stochastic conditions). These results demonstrate that single-agent deep RL, when enhanced with categorical structuring and schedule-aware rewards, can effectively manage bus holding in non-loop, real-world contexts. This paradigm offers a robust, scalable alternative to MARL frameworks, particularly where agent-specific experiences are imbalanced.
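The state augmentation described in the abstract, categorical identifiers alongside numerical features, could be encoded as in the following sketch. One-hot encoding, the feature order, and the function names are assumptions. The paper may well use learned embeddings or a different layout.

```python
from typing import List


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_state(vehicle_id: int, station_id: int, period: int,
                 headway: float, occupancy: float, velocity: float,
                 *, n_vehicles: int, n_stations: int, n_periods: int) -> List[float]:
    """Concatenate categorical identifiers (one-hot) with numerical features.

    A single policy sees which bus, stop, and time period a decision
    belongs to, so one agent can act for the whole fleet.
    """
    return (one_hot(vehicle_id, n_vehicles)
            + one_hot(station_id, n_stations)
            + one_hot(period, n_periods)
            + [headway, occupancy, velocity])


# Hypothetical fleet: 3 buses, 4 stops, 2 time periods.
state = encode_state(1, 2, 0, 300.0, 0.6, 8.5,
                     n_vehicles=3, n_stations=4, n_periods=2)
```

With identifiers in the state, experiences from all buses can feed one replay buffer, which is one plausible reading of how the single-agent formulation sidesteps the per-agent data imbalance mentioned in the abstract.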

[AI-17] Provable Benefits of In-Tool Learning for Large Language Models

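[Quick read]: This paper addresses the fact that the theoretical advantages of tool-augmented language models for factual recall are still unclear, in particular whether tool use (such as external retrieval) has a fundamental advantage over memorizing facts in model weights (in-weight learning). The key result is a proof that in-tool learning (external retrieval) enables unbounded factual recall, whereas in-weight memorization is limited by the parameter count. The authors give a simple and efficient circuit construction that achieves this unbounded recall, and controlled experiments show tool-using models consistently outperforming memorizing ones, establishing both theoretically and empirically that tool-augmented workflows are not just practical but also more scalable.

Link: https://arxiv.org/abs/2508.20755
Authors: Sam Houliston, Ambroise Odonnat, Charles Arnal, Vivien Cabannes
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: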

Abstract:Tool-augmented language models, equipped with retrieval, memory, or external APIs, are reshaping AI, yet their theoretical advantages remain underexplored. In this paper, we address this question by demonstrating the benefits of in-tool learning (external retrieval) over in-weight learning (memorization) for factual recall. We show that the number of facts a model can memorize solely in its weights is fundamentally limited by its parameter count. In contrast, we prove that tool-use enables unbounded factual recall via a simple and efficient circuit construction. These results are validated in controlled experiments, where tool-using models consistently outperform memorizing ones. We further show that for pretrained large language models, teaching tool-use and general rules is more effective than finetuning facts into memory. Our work provides both a theoretical and empirical foundation, establishing why tool-augmented workflows are not just practical, but provably more scalable.

[AI-18] Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol

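[Quick read]: This paper addresses the fundamental quality assurance challenges of Large Language Model (LLM) applications, which stem from their non-determinism, dynamism, and context dependence. The authors decompose LLM applications into a three-layer architecture (System Shell Layer, Prompt Orchestration Layer, and LLM Inference Core) and assess how well traditional software testing methods apply at each layer. The key to the solution is identifying six core challenges that arise from four fundamental differences, and proposing four collaborative strategies (Retain, Translate, Integrate, and Runtime) to build a closed-loop, trustworthy quality assurance framework that combines pre-deployment validation with runtime monitoring. The paper also proposes a test-oriented Agent Interaction Communication Language (AICL) to support standardized test interaction and tool integration in multi-agent systems.

Link: https://arxiv.org/abs/2508.20737
Authors: Wei Ma, Yixiao Yang, Qiang Hu, Shi Ying, Zhi Jin, Bo Du, Zhenchang Xing, Tianlin Li, Junjie Shi, Yang Liu, Linxiao Jiang
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: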

Abstract:Applications of Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: System Shell Layer, Prompt Orchestration Layer, and LLM Inference Core. We then assess the applicability of traditional software testing methods in each layer: directly applicable at the shell layer, requiring semantic reinterpretation at the orchestration layer, and necessitating paradigm shifts at the inference core. A comparative analysis of Testing AI methods from the software engineering community and safety analysis techniques from the AI community reveals structural disconnects in testing unit abstraction, evaluation metrics, and lifecycle management. We identify four fundamental differences that underlie 6 core challenges. To address these, we propose four types of collaborative strategies (Retain, Translate, Integrate, and Runtime) and explore a closed-loop, trustworthy quality assurance framework that combines pre-deployment validation with runtime monitoring. Based on these strategies, we offer practical guidance and a protocol proposal to support the standardization and tooling of LLM application testing. We propose a protocol Agent Interaction Communication Language (AICL) that is used to communicate between AI agents. AICL has the test-oriented features and is easily integrated in the current agent framework.

[AI-19] Re4: Scientific Computing Agent with Rewriting, Resolution, Review and Revision

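[Quick read]: This paper addresses automatic code generation from natural language descriptions in scientific computing, in particular how to improve the correctness and physical plausibility of generated code and reduce buggy code and non-physical solutions. The key to its solution is an agent framework built around a "rewriting-resolution-review-revision" logical chain, in which three collaborating LLMs (the Consultant, the Programmer, and the Reviewer) close the loop of knowledge transfer, code generation, and self-debugging. The review mechanism iteratively refines executable code through interactive feedback, which markedly improves the bug-free code generation rate and the execution success rate.

Link: https://arxiv.org/abs/2508.20729
Authors: Ao Cheng, Lei Zhang, Guowei He
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments: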

Abstract:Large language models (LLMs) serve as an active and promising field of generative artificial intelligence and have demonstrated abilities to perform complex tasks in multiple domains, including mathematical and scientific reasoning. In this work, we construct a novel agent framework for solving representative problems in scientific computing. The proposed agent, incorporating a “rewriting-resolution-review-revision” logical chain via three reasoning LLMs (functioning as the Consultant, Reviewer, and Programmer, respectively), is integrated in a collaborative and interactive manner. The Consultant module endows the agent with knowledge transfer capabilities to link problems to professional domain insights, thereby rewriting problem descriptions through text augmentation. The Programmer module is responsible for generating and executing well-structured code to deliver the problem resolution. The Reviewer module equips the agent with the capacity for self-debugging and self-refinement through interactive feedback with code runtime outputs. By leveraging the end-to-end review mechanism, the executable code provided by the Programmer attains the iterative revision. A comprehensive evaluation is conducted on the performance of the proposed agent framework in solving PDEs, ill-conditioned linear systems, and data-driven physical analysis problems. Compared to single-model, this collaborative framework significantly improves the bug-free code generation rate and reduces the occurrence of non-physical solutions, thereby establishing a highly reliable framework for autonomous code generation based on natural language descriptions. The review mechanism improved the average execution success (bug-free code and non-NaN solutions) rate of the latest reasoning models. In summary, our agent framework establishes automatic code generation and review as a promising scientific computing paradigm.

[AI-20] EEGDM: Learning EEG Representation with Latent Diffusion Model

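[Quick read]: This paper addresses a common problem in deep-learning-based electroencephalography (EEG) analysis: existing representation learning methods struggle to learn representations that generalize across tasks when training data is limited, and most rely on a simple masked reconstruction objective that cannot fully capture the semantic information and complex patterns in EEG signals. The key to its solution is EEGDM, a self-supervised EEG representation learning method based on a latent diffusion model. It uses EEG signal generation as the self-supervised objective: an encoder compresses raw EEG signals and their channel augmentations into a compact latent representation, which conditions the diffusion process. The resulting latent space preserves generation quality while generalizing well, with competitive downstream performance, especially when pre-training data is modest.

Link: https://arxiv.org/abs/2508.20705
Authors: Shaocong Wang, Tong Liu, Ming Li, Minjing Yu, Yong-Jin Liu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: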

Abstract:While electroencephalography (EEG) signal analysis using deep learning has shown great promise, existing approaches still face significant challenges in learning generalizable representations that perform well across diverse tasks, particularly when training data is limited. Current EEG representation learning methods including EEGPT and LaBraM typically rely on simple masked reconstruction objective, which may not fully capture the rich semantic information and complex patterns inherent in EEG signals. In this paper, we propose EEGDM, a novel self-supervised EEG representation learning method based on the latent diffusion model, which leverages EEG signal generation as a self-supervised objective, turning the diffusion model into a strong representation learner capable of capturing EEG semantics. EEGDM incorporates an EEG encoder that distills EEG signals and their channel augmentations into a compact representation, acting as conditional information to guide the diffusion model for generating EEG signals. This design endows EEGDM with a compact latent space, which not only offers ample control over the generative process but also can be leveraged for downstream tasks. Experimental results show that EEGDM (1) can reconstruct high-quality EEG signals, (2) effectively learns robust representations, and (3) achieves competitive performance with modest pre-training data size across diverse downstream tasks, underscoring its generalizability and practical utility.

[AI-21] Task Allocation for Autonomous Machines using Computational Intelligence and Deep Reinforcement Learning

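[Quick read]: This paper addresses reliable cooperative control of multiple autonomous machines in complex environments, where the core challenge is allocating tasks and coordinating behavior efficiently. It surveys task allocation methods based on Computational Intelligence (CI) and Deep Reinforcement Learning (Deep RL), which can handle complex decision problems in dynamic and uncertain environments and improve the performance and deployability of autonomous machines in real applications.

Link: https://arxiv.org/abs/2508.20688
Authors: Thanh Thi Nguyen, Quoc Viet Hung Nguyen, Jonathan Kua, Imran Razzak, Dung Nguyen, Saeid Nahavandi
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in the Proceedings of the 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC)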

Abstract:Enabling multiple autonomous machines to perform reliably requires the development of efficient cooperative control algorithms. This paper presents a survey of algorithms that have been developed for controlling and coordinating autonomous machines in complex environments. We especially focus on task allocation methods using computational intelligence (CI) and deep reinforcement learning (RL). The advantages and disadvantages of the surveyed methods are analysed thoroughly. We also propose and discuss in detail various future research directions that shed light on how to improve existing algorithms or create new methods to enhance the employability and performance of autonomous machines in real-world applications. The findings indicate that CI and deep RL methods provide viable approaches to addressing complex task allocation problems in dynamic and uncertain environments. The recent development of deep RL has greatly contributed to the literature on controlling and coordinating autonomous machines, and it has become a growing trend in this area. It is envisaged that this paper will provide researchers and engineers with a comprehensive overview of progress in machine learning research related to autonomous machines. It also highlights underexplored areas, identifies emerging methodologies, and suggests new avenues for exploration in future research within this domain.

[AI-22] Bridging Minds and Machines: Toward an Integration of AI and Cognitive Science

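[Quick read]: This paper addresses the observation that, despite large performance gains, AI's cognitive foundations remain conceptually fragmented and lack deep integration with the mechanisms of the human mind. The key to the solution is shifting AI from optimizing task performance alone toward building systems that deepen our understanding of the human mind. Four directions are proposed: aligning AI behaviors with cognitive frameworks, situating AI in embodiment and culture, developing personalized cognitive models, and rethinking AI ethics through cognitive co-evaluation.

Link: https://arxiv.org/abs/2508.20674
Authors: Rui Mao, Qian Liu, Xiao Li, Erik Cambria, Amir Hussain
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: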

Abstract:Cognitive Science has profoundly shaped disciplines such as Artificial Intelligence (AI), Philosophy, Psychology, Neuroscience, Linguistics, and Culture. Many breakthroughs in AI trace their roots to cognitive theories, while AI itself has become an indispensable tool for advancing cognitive research. This reciprocal relationship motivates a comprehensive review of the intersections between AI and Cognitive Science. By synthesizing key contributions from both perspectives, we observe that AI progress has largely emphasized practical task performance, whereas its cognitive foundations remain conceptually fragmented. We argue that the future of AI within Cognitive Science lies not only in improving performance but also in constructing systems that deepen our understanding of the human mind. Promising directions include aligning AI behaviors with cognitive frameworks, situating AI in embodiment and culture, developing personalized cognitive models, and rethinking AI ethics through cognitive co-evaluation.

[AI-23] Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music

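[Quick Read]: This paper addresses a limitation in how current symbolic music generation models handle musical attributes. Existing methods generally use autoregressive or hierarchical autoregressive architectures that represent music as a sequence of attribute tokens with strict unidirectional temporal dependencies, ignoring the fact that the attributes of a note (such as pitch and duration) are essentially a concurrent, unordered set. The core solution is the Amadeus framework, whose key innovations are: 1) a two-level architecture, with an autoregressive model for the note sequence and a bidirectional discrete diffusion model for attributes, which captures the concurrency and non-sequential nature of attributes more accurately; 2) the Music Latent Space Discriminability Enhancement Strategy (MLSDES), which uses contrastive learning to make intermediate music representations more discriminable; 3) the Conditional Information Enhancement Module (CIEM), which uses attention to strengthen note latent vector representations and improve decoding precision. The approach significantly outperforms state-of-the-art (SOTA) models, achieves at least a 4× speed-up, and supports training-free, fine-grained attribute control.

Link: https://arxiv.org/abs/2508.20665
Authors: Hongju Su, Ke Li, Lan Yang, Honggang Zhang, Yi-Zhe Song
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Under review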

Abstract: Existing state-of-the-art symbolic music generation models predominantly adopt autoregressive or hierarchical autoregressive architectures, modelling symbolic music as a sequence of attribute tokens with unidirectional temporal dependencies, under the assumption of a fixed, strict dependency structure among these attributes. However, we observe that using different attributes as the initial token in these models leads to comparable performance. This suggests that the attributes of a musical note are, in essence, a concurrent and unordered set, rather than a temporally dependent sequence. Based on this insight, we introduce Amadeus, a novel symbolic music generation framework. Amadeus adopts a two-level architecture: an autoregressive model for note sequences and a bidirectional discrete diffusion model for attributes. To enhance performance, we propose Music Latent Space Discriminability Enhancement Strategy (MLSDES), incorporating contrastive learning constraints that amplify discriminability of intermediate music representations. The Conditional Information Enhancement Module (CIEM) simultaneously strengthens note latent vector representation via attention mechanisms, enabling more precise note decoding. We conduct extensive experiments on unconditional and text-conditioned generation tasks. Amadeus significantly outperforms SOTA models across multiple metrics while achieving at least 4× speed-up. Furthermore, we demonstrate training-free, fine-grained note attribute control feasibility using our model. To explore the upper performance bound of the Amadeus architecture, we compile the largest open-source symbolic music dataset to date, AMD (Amadeus MIDI Dataset), supporting both pre-training and fine-tuning.

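[AI-24] Task-Oriented Edge-Assisted Cross-System Design for Real-Time Human-Robot Interaction in Industrial Metaverse

[Quick Read]: This paper addresses the high computational load, limited bandwidth, and strict latency requirements facing real-time human-robot interaction in the industrial Metaverse. The key to the solution is a task-oriented, edge-assisted cross-system framework that uses digital twins (DTs) to enable responsive interaction. By predicting operator motions, it supports pre-rendering of visual feedback and preemptive control of remote devices. The DT is decoupled into two independent virtual functions, visual display and robotic control, to optimize performance and adaptability. A Human-In-The-Loop Model-Agnostic Meta-Learning (HITL-MAML) algorithm dynamically adjusts the prediction horizon to improve generalization. Experiments show that the framework markedly improves spatial precision and visual fidelity on a trajectory-based drawing control task and on a real-time 3D reconstruction task in a nuclear decommissioning scenario.

Link: https://arxiv.org/abs/2508.20664
Authors: Kan Chen, Zhen Meng, Xiangmin Xu, Jiaming Yang, Emma Li, Philip G. Zhao
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: This paper has been submitted to IEEE Transactions on Mobile Computing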

Abstract: Real-time human-device interaction in industrial Metaverse faces challenges such as high computational load, limited bandwidth, and strict latency. This paper proposes a task-oriented edge-assisted cross-system framework using digital twins (DTs) to enable responsive interactions. By predicting operator motions, the system supports: 1) proactive Metaverse rendering for visual feedback, and 2) preemptive control of remote devices. The DTs are decoupled into two virtual functions, visual display and robotic control, optimizing both performance and adaptability. To enhance generalizability, we introduce the Human-In-The-Loop Model-Agnostic Meta-Learning (HITL-MAML) algorithm, which dynamically adjusts prediction horizons. Evaluation on two tasks demonstrates the framework's effectiveness: in a Trajectory-Based Drawing Control task, it reduces weighted RMSE from 0.0712 m to 0.0101 m; in a real-time 3D scene representation task for nuclear decommissioning, it achieves a PSNR of 22.11, SSIM of 0.8729, and LPIPS of 0.1298. These results show the framework's capability to ensure spatial precision and visual fidelity in real-time, high-risk industrial environments.

[AI-25] Flowing Straighter with Conditional Flow Matching for Accurate Speech Enhancement

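[Quick Read]: This paper addresses the training complexity and limited generalization caused by curved probability paths in current flow-based speech enhancement methods. The summary notes that curved paths (as used by Schrödinger bridge methods) perform well, but their time-dependent gradients and variance do not promote straight paths, which affects model stability and sample quality. The key to the solution is conditional flow matching, which models straight probability paths between noisy and clean speech and yields a speech enhancement model that is easier to train and generalizes better. The paper also proposes a one-step inference strategy that treats the trained flow model as a direct predictor, overcoming the efficiency bottleneck of the multiple inference steps that conditional flow matching normally requires. Experiments show that a time-independent variance affects sample quality more than the gradient does, supporting the advantage of straight paths in generative speech enhancement.

Link: https://arxiv.org/abs/2508.20584
Authors: Mattias Cross, Anton Ragni
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: preprint, accepted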

Abstract:Current flow-based generative speech enhancement methods learn curved probability paths which model a mapping between clean and noisy speech. Despite impressive performance, the implications of curved probability paths are unknown. Methods such as Schrodinger bridges focus on curved paths, where time-dependent gradients and variance do not promote straight paths. Findings in machine learning research suggest that straight paths, such as conditional flow matching, are easier to train and offer better generalisation. In this paper we quantify the effect of path straightness on speech enhancement quality. We report experiments with the Schrodinger bridge, where we show that certain configurations lead to straighter paths. Conversely, we propose independent conditional flow-matching for speech enhancement, which models straight paths between noisy and clean speech. We demonstrate empirically that a time-independent variance has a greater effect on sample quality than the gradient. Although conditional flow matching improves several speech quality metrics, it requires multiple inference steps. We rectify this with a one-step solution by inferring the trained flow-based model as if it was directly predictive. Our work suggests that straighter time-independent probability paths improve generative speech enhancement over curved time-dependent paths.
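
The straight-path idea in this abstract can be illustrated with a minimal sketch. It assumes the common conditional flow matching formulation in which the path mean interpolates linearly between the two endpoints and the regression target is the constant velocity `clean - noisy`; the function names, the `sigma` parameter, and the toy vectors are hypothetical and are not taken from the paper.

```python
import numpy as np

def cfm_training_pair(noisy, clean, t, sigma=0.0, rng=None):
    """Sample a point on the straight path from noisy to clean speech
    and return it together with the target velocity (assumed formulation)."""
    rng = np.random.default_rng() if rng is None else rng
    mean = (1.0 - t) * noisy + t * clean            # linear interpolation at time t
    x_t = mean + sigma * rng.standard_normal(noisy.shape)  # variance does not depend on t
    target_velocity = clean - noisy                 # constant along a straight path
    return x_t, target_velocity

def one_step_estimate(x_t, velocity, t):
    """Take a single Euler step from time t to time 1."""
    return x_t + (1.0 - t) * velocity

# Toy feature vectors standing in for noisy and clean speech.
noisy = np.array([0.5, -1.0, 2.0])
clean = np.array([0.0, 1.0, 1.5])

x_t, v = cfm_training_pair(noisy, clean, t=0.3)
estimate = one_step_estimate(x_t, v, t=0.3)
```

With `sigma=0` and the exact velocity, one Euler step lands on the clean endpoint, which is why straight paths make few-step inference plausible. A trained model only approximates the velocity, so this is an idealized case rather than the paper's one-step procedure.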

[AI-26] Human-AI Collaborative Bot Detection in MMORPGs

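[Quick Read]: This paper addresses the detection of auto-leveling bots in Massively Multiplayer Online Role-Playing Games (MMORPGs). These bots use automated programs to level up characters at scale, undermining game balance and fairness, while traditional detection methods struggle to provide explainable grounds for punishment without causing legal and user-experience problems. The key to the solution is a fully unsupervised framework that combines contrastive representation learning with clustering to identify groups of characters with similar level-up patterns, and adds a Large Language Model (LLM) as an auxiliary reviewer that validates the clusters, mimicking a secondary human judgment. A growth-curve-based visualization helps both the LLM and human moderators assess character behavior, giving an efficient and explainable detection workflow that supports scalable, accountable bot regulation in MMORPGs.

Link: https://arxiv.org/abs/2508.20578
Authors: Jaeman Son, Hyunsoo Kim
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: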

Abstract:In Massively Multiplayer Online Role-Playing Games (MMORPGs), auto-leveling bots exploit automated programs to level up characters at scale, undermining gameplay balance and fairness. Detecting such bots is challenging, not only because they mimic human behavior, but also because punitive actions require explainable justification to avoid legal and user experience issues. In this paper, we present a novel framework for detecting auto-leveling bots by leveraging contrastive representation learning and clustering techniques in a fully unsupervised manner to identify groups of characters with similar level-up patterns. To ensure reliable decisions, we incorporate a Large Language Model (LLM) as an auxiliary reviewer to validate the clustered groups, effectively mimicking a secondary human judgment. We also introduce a growth curve-based visualization to assist both the LLM and human moderators in assessing leveling behavior. This collaborative approach improves the efficiency of bot detection workflows while maintaining explainability, thereby supporting scalable and accountable bot regulation in MMORPGs.

[AI-27] AI and Agile Software Development: A Research Roadmap from the XP2025 Workshop

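[Quick Read]: This paper addresses the practical challenges of combining Generative Artificial Intelligence (GenAI) with agile software development, including tool fragmentation, governance, insufficient data quality, and gaps in key skills such as AI literacy and prompt engineering. The key to the solution is an interdisciplinary workshop that systematically identified shared pain points, analyzed their root causes, and jointly produced a multi-thematic research roadmap covering short-term actionable steps and long-term research directions, with the aim of driving responsible, human-centered integration of GenAI into agile practice.

Link: https://arxiv.org/abs/2508.20563
Authors: Zheying Zhang, Tomas Herda, Victoria Pichler, Pekka Abrahamsson, Geir K. Hanssen, Joshua Kerievsky, Alex Polyakov, Mohit Chandna, Marius Irgens, Kai-Kristian Kemell, Ayman Asad Khan, Crystal Kwok, Evan Leybourn, Munish Malik, Dorota Mleczko, Morteza Moalagh, Christopher Morales, Yuliia Pieskova, Daniel Planötscher, Mika Saari, Anastasiia Tkalich, Karl Josef Gstettner, Xiaofeng Wang
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: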

Abstract:This paper synthesizes the key findings from a full-day XP2025 workshop on “AI and Agile: From Frustration to Success”, held in Brugg-Windisch, Switzerland. The workshop brought together over 30 interdisciplinary academic researchers and industry practitioners to tackle the concrete challenges and emerging opportunities at the intersection of Generative Artificial Intelligence (GenAI) and agile software development. Through structured, interactive breakout sessions, participants identified shared pain points like tool fragmentation, governance, data quality, and critical skills gaps in AI literacy and prompt engineering. These issues were further analyzed, revealing underlying causes and cross-cutting concerns. The workshop concluded by collaboratively co-creating a multi-thematic research roadmap, articulating both short-term, implementable actions and visionary, long-term research directions. This cohesive agenda aims to guide future investigation and drive the responsible, human-centered integration of GenAI into agile practices.

[AI-28] MedGR2: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning

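[Quick Read]: This paper addresses the training difficulty and weak generalization of Vision-Language Models (VLMs) in medicine caused by the scarcity of high-quality, expert-annotated data. Existing Supervised Fine-Tuning (SFT) performs poorly on unseen modalities and tasks, while Reinforcement Learning (RL) is limited by the lack of reliable reward signals. The key to the solution is a new framework, Generative Reward Learning for Medical Reasoning (MedGR²), which co-develops a data generator and a reward model to form a self-improving virtuous cycle: the generated data can be used for SFT to raise baseline performance and also serves as a training resource for RL, markedly improving cross-modality and cross-task generalization. Experiments show that MedGR²-generated data lets the model surpass baselines trained on large-scale, human-curated datasets and, with RL optimization, reach state-of-the-art generalization, while a compact model rivals foundation models with more than 10 times as many parameters.

Link: https://arxiv.org/abs/2508.20549
Authors: Weihai Zhi, Jiayan Guo, Shangyang Li
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures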

Abstract: The application of Vision-Language Models (VLMs) in medicine is critically hampered by the scarcity of high-quality, expert-annotated data. Supervised Fine-Tuning (SFT) on existing datasets often leads to poor generalization on unseen modalities and tasks, while Reinforcement Learning (RL), a promising alternative, is stymied by the lack of reliable reward signals in this data-scarce domain. To break this impasse, we introduce Generative Reward Learning for Medical Reasoning (MedGR²), a novel framework that creates a self-improving virtuous cycle. MedGR² co-develops a data generator and a reward model, enabling the automated, continuous creation of high-quality, multi-modal medical data that serves as both a superior training source for SFT and RL. Our experiments demonstrate that SFT with MedGR²-produced data already surpasses baselines trained on large-scale, human-curated datasets. Crucially, when leveraging this data for RL via Group Relative Policy Optimization (GRPO), our model achieves state-of-the-art cross-modality and cross-task generalization, significantly outperforming specialized RL-based methods. Furthermore, our compact model, empowered by MedGR², achieves performance competitive with foundation models possessing over 10 times more parameters. MedGR² presents a new paradigm for data-efficient learning in high-stakes domains, transforming the problem from data scarcity to data generation and unlocking the full potential of RL for building truly generalizable medical AI.

[AI-29] MM-HSD: Multi-Modal Hate Speech Detection in Videos

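[Quick Read]: This paper addresses hate speech detection (HSD) in videos. Existing methods are limited in multi-modal fusion and, in particular, omit key modalities such as on-screen text and audio, so they fail to capture the complex dependencies between modalities. The key to the solution is the MM-HSD model, which for the first time jointly models video frames, audio, speech transcripts, and on-screen text, and uses Cross-Modal Attention (CMA) as an early feature extractor. By systematically comparing query/key configurations and evaluating how modalities interact inside the CMA block, the authors find that performance is best when on-screen text is the query and the other modalities are the key, reaching an M-F1 score of 0.874 on the HateMM dataset.

Link: https://arxiv.org/abs/2508.20546
Authors: Berta Céspedes-Sarrias, Carlos Collado-Capell, Pablo Rodenas-Ruiz, Olena Hrynenko, Andrea Cavallaro
Affiliations: unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments: Accepted at ACM Multimedia 2025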

Abstract: While hate speech detection (HSD) has been extensively studied in text, existing multi-modal approaches remain limited, particularly in videos. As modalities are not always individually informative, simple fusion methods fail to fully capture inter-modal dependencies. Moreover, previous work often omits relevant modalities such as on-screen text and audio, which may contain subtle hateful content and thus provide essential cues, both individually and in combination with others. In this paper, we present MM-HSD, a multi-modal model for HSD in videos that integrates video frames, audio, and text derived from speech transcripts and from frames (i.e., on-screen text) together with features extracted by Cross-Modal Attention (CMA). We are the first to use CMA as an early feature extractor for HSD in videos, to systematically compare query/key configurations, and to evaluate the interactions between different modalities in the CMA block. Our approach leads to improved performance when on-screen text is used as a query and the rest of the modalities serve as a key. Experiments on the HateMM dataset show that MM-HSD outperforms state-of-the-art methods on M-F1 score (0.874), using concatenation of transcript, audio, video, on-screen text, and CMA for feature extraction on raw embeddings of the modalities. The code is available at this https URL

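[AI-30] Enhancing Health Fact-Checking with LLM-Generated Synthetic Data

[Quick Read]: This paper addresses the limited performance of health-related fact-checking models caused by the scarcity of annotated training data. The key to the solution is a synthetic data generation pipeline based on Large Language Models (LLMs): source documents are summarized and decomposed into atomic facts, an LLM builds a sentence-fact entailment table, and the entailment relations in that table are used to generate text-claim pairs with binary veracity labels. The synthetic data are then combined with the original data to fine-tune a BERT-based fact-checking model. On two public datasets, PubHealth and SciFact, the method improves F1 scores by up to 0.019 and 0.049 respectively, supporting the effectiveness of LLM-driven synthetic data augmentation for health fact-checking.

Link: https://arxiv.org/abs/2508.20525
Authors: Jingze Zhang, Jiahe Qian, Yiliang Zhou, Yifan Peng
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: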

Abstract:Fact-checking for health-related content is challenging due to the limited availability of annotated training data. In this study, we propose a synthetic data generation pipeline that leverages large language models (LLMs) to augment training data for health-related fact checking. In this pipeline, we summarize source documents, decompose the summaries into atomic facts, and use an LLM to construct sentence-fact entailment tables. From the entailment relations in the table, we further generate synthetic text-claim pairs with binary veracity labels. These synthetic data are then combined with the original data to fine-tune a BERT-based fact-checking model. Evaluation on two public datasets, PubHealth and SciFact, shows that our pipeline improved F1 scores by up to 0.019 and 0.049, respectively, compared to models trained only on the original data. These results highlight the effectiveness of LLM-driven synthetic data augmentation in enhancing the performance of health-related fact-checkers.

[AI-31] BridgeShield: Enhancing Security for Cross-chain Bridge Applications via Heterogeneous Graph Mining

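[Quick Read]: This paper addresses the frequent hacking of cross-chain bridges in multi-chain ecosystems, which stems from design flaws and the high value they hold. Existing detection methods focus on single-chain behavior and struggle to capture cross-chain semantics. The key to the solution is the BridgeShield framework, built on a heterogeneous graph attention network, which models the source chain, off-chain coordination, and destination chain in a single heterogeneous graph representation. It introduces two attention mechanisms: intra-meta-path attention learns fine-grained dependencies within cross-chain paths, and inter-meta-path attention highlights discriminative cross-chain patterns, enabling precise identification of attack behavior.

Link: https://arxiv.org/abs/2508.20517
Authors: Dan Lin, Shunfeng Lu, Ziyan Liu, Jiajing Wu, Junyuan Fang, Kaixin Lin, Bowen Song, Zibin Zheng
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: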

Abstract:Cross-chain bridges play a vital role in enabling blockchain interoperability. However, due to the inherent design flaws and the enormous value they hold, they have become prime targets for hacker attacks. Existing detection methods show progress yet remain limited, as they mainly address single-chain behaviors and fail to capture cross-chain semantics. To address this gap, we leverage heterogeneous graph attention networks, which are well-suited for modeling multi-typed entities and relations, to capture the complex execution semantics of cross-chain behaviors. We propose BridgeShield, a detection framework that jointly models the source chain, off-chain coordination, and destination chain within a unified heterogeneous graph representation. BridgeShield incorporates intra-meta-path attention to learn fine-grained dependencies within cross-chain paths and inter-meta-path attention to highlight discriminative cross-chain patterns, thereby enabling precise identification of attack behaviors. Extensive experiments on 51 real-world cross-chain attack events demonstrate that BridgeShield achieves an average F1-score of 92.58%, representing a 24.39% improvement over state-of-the-art baselines. These results validate the effectiveness of BridgeShield as a practical solution for securing cross-chain bridges and enhancing the resilience of multi-chain ecosystems.

[AI-32] Evaluating Differentially Private Generation of Domain-Specific Text

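[Quick Read]: This paper addresses the difficulty of using real data in high-stakes domains such as healthcare and finance because of privacy protection and regulatory restrictions, and considers synthetic data generation under Differential Privacy (DP) guarantees as an alternative. The key to the solution is a unified benchmark that systematically evaluates the utility and fidelity of text datasets generated under formal DP constraints. It covers the choice of representative data, realistic privacy budgets, the effect of pre-training, and a variety of evaluation metrics. The benchmark reveals the performance degradation of current privacy-preserving generation methods under strict privacy constraints and provides an evaluation standard for more advanced privacy-preserving data sharing techniques.

Link: https://arxiv.org/abs/2508.20452
Authors: Yidan Sun, Viktor Schlegel, Srinivasan Nandakumar, Iqra Zahid, Yuping Wu, Warren Del-Pinto, Goran Nenadic, Siew-Kei Lam, Jie Zhang, Anil A Bharath
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: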

Abstract:Generative AI offers transformative potential for high-stakes domains such as healthcare and finance, yet privacy and regulatory barriers hinder the use of real-world data. To address this, differentially private synthetic data generation has emerged as a promising alternative. In this work, we introduce a unified benchmark to systematically evaluate the utility and fidelity of text datasets generated under formal Differential Privacy (DP) guarantees. Our benchmark addresses key challenges in domain-specific benchmarking, including choice of representative data and realistic privacy budgets, accounting for pre-training and a variety of evaluation metrics. We assess state-of-the-art privacy-preserving generation methods across five domain-specific datasets, revealing significant utility and fidelity degradation compared to real data, especially under strict privacy constraints. These findings underscore the limitations of current approaches, outline the need for advanced privacy-preserving data sharing methods and set a precedent regarding their evaluation in realistic scenarios.

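[AI-33] Towards Mitigating Excessive Forgetting in LLM Unlearning via Entanglement-Aware Unlearning with Proxy Constraint

[Quick Read]: This paper addresses the privacy and ownership risks that arise when Large Language Models (LLMs) are trained on data that may be private or copyrighted, and proposes a more precise and efficient machine unlearning method. Existing methods often lack a clear forgetting boundary, so some samples are under-forgotten (leaving residual leakage risk) and others are over-forgotten (degrading performance), which hurts model utility. The core of the solution is the EAGLE-PC (Entanglement-Awareness Guided Loss Reweighting with Proxy Constraint) framework. First, an entanglement-aware loss reweighting mechanism uses the similarity between samples in the embedding space to adjust the forgetting strength for each sample. Second, proxy data generated through In-Context Learning (ICL) acts as a soft constraint that mitigates the performance degradation caused by over-forgetting. The method is compatible with gradient-based objectives and can be used as a plug-and-play module to improve the balance between forgetting and utility.

Link: https://arxiv.org/abs/2508.20443
Authors: Zhihao Liu, Jian Lou, Yuke Hu, Xiaochen Li, Tailun Chen, Yitian Chen, Zhan Qin
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: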

Abstract: Large language models (LLMs) are trained on massive datasets that may include private or copyrighted content. Due to growing privacy and ownership concerns, data owners may request the removal of their data from trained models. Machine unlearning provides a practical solution by removing the influence of specific data without full retraining. However, most existing methods lack a sound forgetting boundary, causing some samples to be under-forgotten, leaving residual leakage risks, while others remain over-forgotten at the expense of degraded utility. In this work, we propose EAGLE-PC (Entanglement-Awareness Guided Loss Reweighting with Proxy Constraint), a novel unlearning framework that addresses these limitations through two key components. First, entanglement-awareness guided loss reweighting determines the forgetting effort of each sample by measuring its similarity to retain samples in the embedding space, enabling more targeted and effective unlearning. Second, a proxy constraint leveraging ICL (In-Context Learning) generated test data softly regularizes the forgetting process, effectively mitigating over-forgetting. EAGLE-PC is compatible with existing gradient-based objectives and serves as a plug-and-play enhancement. We evaluate EAGLE-PC on the TOFU and MUSE benchmarks, showing consistent improvements in the forgetting-utility trade-off across multiple LLMs. Combined with the NPO+GD optimizer, it approaches full retraining performance, offering a scalable and robust unlearning solution.

[AI-34] Uncovering the Spectral Bias in Diagonal State Space Models

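[Quick Read]: This paper addresses the trade-off between efficiency and performance in parameter initialization for State Space Models (SSMs), in particular the computational limitations of the prevailing HiPPO framework. Existing methods perform well but rely on a relatively complex online approximation of orthogonal polynomials, while recently proposed diagonal alternatives are more efficient but lack a systematic account of how they work. The key to the solution is to analyze diagonal initialization schemes from the frequency perspective, reveal the learning bias introduced by pole placement in the initialization, and propose a new diagonal initialization on the discrete Fourier domain, S4D-DFouT. The method improves training stability and generalization, supports training from scratch on very large datasets such as PathX-256, and achieves the best results on the Long Range Arena benchmark.

Link: https://arxiv.org/abs/2508.20441
Authors: Ruben Solozabal, Velibor Bojkovic, Hilal AlQuabeh, Kentaro Inui, Martin Takáč
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: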

Abstract: Current methods for initializing state space model (SSM) parameters mainly rely on the HiPPO framework, which is based on an online approximation of orthogonal polynomials. Recently, diagonal alternatives have been shown to reach a similar level of performance while being significantly more efficient due to the simplification in the kernel computation. However, the HiPPO framework does not explicitly study the role of its diagonal variants. In this paper, we take a further step to investigate the role of diagonal SSM initialization schemes from the frequency perspective. Our work seeks to systematically understand how to parameterize these models and uncover the learning biases inherent in such diagonal state-space models. Based on our observations, we propose a diagonal initialization on the discrete Fourier domain, S4D-DFouT. The insights into the role of pole placing in the initialization enable us to further scale them and achieve state-of-the-art results on the Long Range Arena benchmark, allowing us to train from scratch on very large datasets such as PathX-256.
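
The "simplification in the kernel computation" for diagonal SSMs can be sketched as follows. With a diagonal state matrix, the convolution kernel is a weighted sum of powers of the poles. The pole layout below (evenly spaced angles with radius 0.9) is only a hypothetical stand-in for a Fourier-style initialization and is not the paper's S4D-DFouT scheme; all function and variable names are illustrative.

```python
import numpy as np

def diagonal_ssm_kernel(poles, b, c, length):
    """Kernel K[l] = sum_n c[n] * poles[n]**l * b[n] for l = 0..length-1."""
    # Vandermonde-style matrix of pole powers, shape (length, n_states).
    powers = poles[None, :] ** np.arange(length)[:, None]
    return (powers * (b * c)[None, :]).sum(axis=1)

def run_recurrence(poles, b, c, u):
    """Reference recurrence: x_k = poles * x_{k-1} + b * u_k, y_k = sum(c * x_k)."""
    x = np.zeros_like(poles)
    outputs = []
    for u_k in u:
        x = poles * x + b * u_k
        outputs.append((c * x).sum())
    return np.array(outputs)

n_states, length = 8, 16
# Hypothetical initialization: poles at evenly spaced angles, inside the unit circle.
poles = 0.9 * np.exp(2j * np.pi * np.arange(n_states) / n_states)
b = np.ones(n_states, dtype=complex)
c = np.ones(n_states, dtype=complex) / n_states
u = np.random.default_rng(0).standard_normal(length)

kernel = diagonal_ssm_kernel(poles, b, c, length)
y_conv = np.convolve(u, kernel)[:length]   # convolutional view
y_rec = run_recurrence(poles, b, c, u)     # recurrent view
```

The two views give the same output, and the angle of each pole sets the frequency its state responds to, which is why pole placement at initialization can bias what the model learns.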

[AI-35] On Identifying Why and When Foundation Models Perform Well on Time-Series Forecasting Using Automated Explanations and Rating

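[Quick Read]: This paper addresses the unstable behavior, large performance differences, and lack of interpretability of time-series forecasting models (TSFM) in practice, especially in high-stakes decision settings where users find it hard to judge when to trust or question model output. The key to the solution is to combine traditional explainable AI (XAI) methods with Rating Driven Explanations (RDE) and systematically evaluate the performance and interpretability of different model architectures on datasets from several domains. The results show that feature-engineered models (such as Gradient Boosting) are more robust and interpretable in volatile or sparse settings, while foundation models (such as Chronos) have a clear advantage only in stable, trend-driven settings such as finance.

Link: https://arxiv.org/abs/2508.20437
Authors: Michael Widener, Kausik Lakkaraju, John Aydin, Biplav Srivastava
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 Tables, 5 Figures, AI Trustworthiness and Risk Assessment for Challenged Contexts (ATRACC), Appendix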

Abstract:Time-series forecasting models (TSFM) have evolved from classical statistical methods to sophisticated foundation models, yet understanding why and when these models succeed or fail remains challenging. Despite this known limitation, time series forecasting models are increasingly used to generate information that informs real-world actions with equally real consequences. Understanding the complexity, performance variability, and opaque nature of these models then becomes a valuable endeavor to combat serious concerns about how users should interact with and rely on these models’ outputs. This work addresses these concerns by combining traditional explainable AI (XAI) methods with Rating Driven Explanations (RDE) to assess TSFM performance and interpretability across diverse domains and use cases. We evaluate four distinct model architectures: ARIMA, Gradient Boosting, Chronos (time-series specific foundation model), Llama (general-purpose; both fine-tuned and base models) on four heterogeneous datasets spanning finance, energy, transportation, and automotive sales domains. In doing so, we demonstrate that feature-engineered models (e.g., Gradient Boosting) consistently outperform foundation models (e.g., Chronos) in volatile or sparse domains (e.g., power, car parts) while providing more interpretable explanations, whereas foundation models excel only in stable or trend-driven contexts (e.g., finance).

[AI-36] Rethinking Purity and Diversity in Multi-Behavior Sequential Recommendation from the Frequency Perspective

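[Quick Read]: This paper addresses the noise introduced in Multi-behavior Sequential Recommendation (MBSR) when several kinds of user behavior data (such as browsing, clicking, and purchasing) are mixed, and focuses on the limitation that traditional methods overlook the value of high-frequency information. The key to the solution is to reconsider what frequency-domain information means: low-frequency components correspond to the purity of user interests, and high-frequency components reflect their diversity. On this basis the authors design the PDB4Rec model, which efficiently extracts information from multiple frequency bands and their relationships and introduces a Bootstrapping Balancer mechanism to balance the contributions of the different frequency components, improving recommendation performance.

Link: https://arxiv.org/abs/2508.20427
Authors: Yongqiang Han, Kai Cheng, Kefan Wang, Enhong Chen
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: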

Abstract: In recommendation systems, users often exhibit multiple behaviors, such as browsing, clicking, and purchasing. Multi-behavior sequential recommendation (MBSR) aims to consider these different behaviors in an integrated manner to improve the recommendation performance of the target behavior. However, some behavior data will also bring inevitable noise to the modeling of user interests. Some research efforts focus on data denoising from the frequency domain perspective to improve the accuracy of user preference prediction. These studies indicate that low-frequency information tends to be valuable and reliable, while high-frequency information is often associated with noise. In this paper, we argue that high-frequency information is by no means insignificant. Further experimental results highlight that low frequency corresponds to the purity of user interests, while high frequency corresponds to the diversity of user interests. Building upon this finding, we propose our model PDB4Rec, which efficiently extracts information across various frequency bands and their relationships, and introduces a Bootstrapping Balancer mechanism to balance their contributions for improved recommendation performance. Extensive experiments on real-world datasets demonstrate the effectiveness and efficiency of our model.
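
The frequency-domain view of a behavior sequence can be sketched with a plain FFT band split. This is a generic illustration, assuming the sequence is a `(length, dim)` array of item embeddings and that a single hypothetical `cutoff` index separates the low band from the high band. It is not the PDB4Rec architecture or its balancing mechanism.

```python
import numpy as np

def split_frequency_bands(seq, cutoff):
    """Split a (length, dim) sequence along the time axis into
    low-frequency and high-frequency components."""
    spectrum = np.fft.rfft(seq, axis=0)
    low_spec = spectrum.copy()
    low_spec[cutoff:] = 0                  # keep only the first `cutoff` frequency bins
    high_spec = spectrum - low_spec        # everything the low band dropped
    n = seq.shape[0]
    low = np.fft.irfft(low_spec, n=n, axis=0)
    high = np.fft.irfft(high_spec, n=n, axis=0)
    return low, high

rng = np.random.default_rng(0)
seq = rng.standard_normal((20, 4))         # toy sequence of 20 item embeddings
low, high = split_frequency_bands(seq, cutoff=3)
```

Because the transform is linear, `low + high` recovers the original sequence. A model can therefore weight the two components separately instead of discarding the high band as noise.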

[AI-37] Assessing local deformation and computing scalar curvature with nonlinear conformal regularization of decoders

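[Quick Read]: This paper addresses dimensionality reduction for high-dimensional data, that is, how to learn and expose the main factors driving the data and so obtain a low-dimensional manifold representation. The central difficulty is that a standard autoencoder exerts no control over the geometry of the map from latent space to observation space, so the local deformation of the latent space under that map is hard to quantify. The key to the solution is a new geometric regularization, nonlinear conformal regularization, which constrains the decoder map approximated by a deep neural network while still permitting local variation. It introduces a scalar field, the conformal factor, that quantifies the local deformation the latent space undergoes when mapped into the original data space, and it further allows the scalar curvature of the learned manifold to be computed.

Link: https://arxiv.org/abs/2508.20413
Authors: Benjamin Couéraud, Vikram Sunkara, Christof Schütte
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages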

Abstract: One aim of dimensionality reduction is to discover the main factors that explain the data, and as such is paramount to many applications. When working with high dimensional data, autoencoders offer a simple yet effective approach to learn low-dimensional representations. The two components of a general autoencoder consist first of an encoder that maps the observed data onto a latent space; and second a decoder that maps the latent space back to the original observation space, which allows learning a low-dimensional manifold representation of the original data. In this article, we introduce a new type of geometric regularization for decoding maps approximated by deep neural networks, namely nonlinear conformal regularization. This regularization procedure permits local variations of the decoder map and comes with a new scalar field called conformal factor which acts as a quantitative indicator of the amount of local deformation sustained by the latent space when mapped into the original data space. We also show that this regularization technique allows the computation of the scalar curvature of the learned manifold. Implementation and experiments on the Swiss roll and CelebA datasets are performed to illustrate how to obtain these quantities from the architecture.
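
A conformal factor can be illustrated numerically under the standard differential-geometry definition: a map with Jacobian J is conformal at a point when JᵀJ = λI, and λ is the conformal factor there. The sketch below assumes that definition, estimates J by finite differences, and uses a hypothetical toy decoder. It is not the paper's regularizer or training procedure.

```python
import numpy as np

def jacobian(f, z, eps=1e-6):
    """Central finite-difference Jacobian of f at z, shape (out_dim, latent_dim)."""
    z = np.asarray(z, dtype=float)
    columns = []
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        columns.append((f(z + step) - f(z - step)) / (2 * eps))
    return np.stack(columns, axis=1)

def conformal_factor(f, z):
    """Return (lambda, residual) where lambda = trace(J^T J) / d and
    residual = ||J^T J - lambda * I|| measures departure from conformality."""
    J = jacobian(f, z)
    G = J.T @ J                                   # pullback metric on the latent space
    lam = np.trace(G) / G.shape[0]
    residual = np.linalg.norm(G - lam * np.eye(G.shape[0]))
    return lam, residual

def toy_decoder(z):
    """Hypothetical decoder R^2 -> R^3 that is conformal with factor exp(2 * z[0])."""
    x, y = z
    return np.array([np.exp(x) * np.cos(y), np.exp(x) * np.sin(y), 0.0])

lam, residual = conformal_factor(toy_decoder, [0.5, 1.0])
```

For this toy map the factor is exp(2 · 0.5) = e, meaning squared lengths near that latent point are stretched by about 2.718 in data space, and the residual is close to zero.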

[AI-38] Governable AI: Provable Safety Under Extreme Threat Models

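[Quick Read]: This paper addresses the fundamental limitations of current AI safety mechanisms when facing AI with extreme motivations and unlimited intelligence, especially the risk of systemic disasters in critical scenarios. Existing approaches such as model enhancement, value alignment, and human intervention cannot guarantee safety in principle. The core of the solution is a Governable AI (GAI) framework that moves safety constraints from internal mechanisms to externally enforced structural compliance grounded in cryptography, using mechanisms that are computationally infeasible to break to constrain AI behavior. The framework consists of a Rule Enforcement Module (REM) and a Governable Secure Super-Platform (GSSP): the REM enforces the governance rules to hold the bottom line, and the GSSP ensures that the rules cannot be bypassed, tampered with, or forged, eliminating all identified attack vectors. The paper also provides a formal security proof and a prototype evaluation.

Link: https://arxiv.org/abs/2508.20411
Authors: Donglin Wang, Weiyun Liang, Chunyuan Chen, Jing Xu, Yulong Fu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments: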

Abstract:As AI rapidly advances, the security risks posed by AI are becoming increasingly severe, especially in critical scenarios, including those posing existential risks. If AI becomes uncontrollable, manipulated, or actively evades safety mechanisms, it could trigger systemic disasters. Existing AI safety approaches-such as model enhancement, value alignment, and human intervention-suffer from fundamental, in-principle limitations when facing AI with extreme motivations and unlimited intelligence, and cannot guarantee security. To address this challenge, we propose a Governable AI (GAI) framework that shifts from traditional internal constraints to externally enforced structural compliance based on cryptographic mechanisms that are computationally infeasible to break, even for future AI, under the defined threat model and well-established cryptographic this http URL GAI framework is composed of a simple yet reliable, fully deterministic, powerful, flexible, and general-purpose rule enforcement module (REM); governance rules; and a governable secure super-platform (GSSP) that offers end-to-end protection against compromise or subversion by AI. The decoupling of the governance rules and the technical platform further enables a feasible and generalizable technical pathway for the safety governance of AI. REM enforces the bottom line defined by governance rules, while GSSP ensures non-bypassability, tamper-resistance, and unforgeability to eliminate all identified attack vectors. This paper also presents a rigorous formal proof of the security properties of this mechanism and demonstrates its effectiveness through a prototype implementation evaluated in representative high-stakes scenarios.

[AI-39] AWorld: Orchestrating the Training Recipe for Agentic AI

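【Quick read】: This paper addresses the training bottleneck that inefficient experience generation imposes on current Agentic AI systems on complex benchmarks such as GAIA. The core challenge is that conventional single-node, sequential execution cannot sustain large-scale agent-environment interaction, which limits the effectiveness and scalability of reinforcement learning. The key to the solution is AWorld, an open-source system that uses distributed task scheduling to spread agent-environment interaction tasks across cluster nodes, accelerating experience collection by 14.6x over the standard single-node approach. This speedup makes large-scale reinforcement learning practical and lifts a Qwen3-32B-based agent from 21.59% to 32.23% accuracy on GAIA, reaching 16.33% on the most difficult levels and surpassing leading proprietary models, which offers a practical blueprint for an efficient, reproducible Agentic AI training pipeline.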

Link: https://arxiv.org/abs/2508.20404
Authors: Chengyue Yu,Siyuan Lu,Chenyi Zhuang,Dong Wang,Qintong Wu,Zongyue Li,Runsheng Gan,Chunfeng Wang,Siqi Hou,Gaochi Huang,Wenlong Yan,Lifeng Hong,Aohui Xue,Yanfeng Wang,Jinjie Gu,David Tsai,Tao Lin
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The learning from practice paradigm is crucial for developing capable Agentic AI systems, yet it is severely hampered by inefficient experience generation, a bottleneck especially pronounced in complex benchmarks like GAIA. To address this, we introduce AWorld, an open-source system engineered for large-scale agent-environment interaction. By distributing tasks across a cluster, AWorld accelerates experience collection by 14.6x compared to standard single-node, sequential execution. This critical speedup makes extensive reinforcement learning practical and scalable. Leveraging this capability, we trained a Qwen3-32B-based agent that significantly outperforms its base model, increasing its overall GAIA accuracy from 21.59% to 32.23%. On the benchmark’s most challenging levels, our agent achieves a score of 16.33%, surpassing the performance of leading proprietary models. Our open-source system and resulting agent provide a practical blueprint for a complete agentic AI training pipeline, from efficient interaction to demonstrable model improvement.

[AI-40] MPFormer: Adaptive Framework for Industrial Multi-Task Personalized Sequential Retriever CIKM2025

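【Quick read】: This paper addresses multi-stage optimization misalignment, a core problem in modern industrial recommendation systems: a significant semantic gap exists between the multi-objective optimization paradigm widely used in the ranking stage and the single-objective modeling used in the retrieval stage. Conventional solutions cover multiple objectives through parallel multi-path single-objective retrieval, but training and inference resources then grow linearly with the number of objectives, and loosely coupled objectives are hard to handle. The key to the solution is MPFormer, a dynamic multi-task Transformer framework with three mechanisms: (1) an objective-conditioned Transformer that jointly encodes user behavior sequences and multi-task semantics through learnable attention modulation; (2) personalized target weights that dynamically adjust retrieval results; (3) user personalization information incorporated into the token representations and the Transformer structure to strengthen the model's representation ability. The framework has been deployed in the Kuaishou short-video recommendation system, stably serving over 400 million daily active users and significantly improving daily user engagement and system operational efficiency.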

Link: https://arxiv.org/abs/2508.20400
Authors: Yijia Sun,Shanshan Huang,Linxiao Che,Haitao Lu,Qiang Luo,Kun Gai,Guorui Zhou
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: CIKM 2025

Abstract: Modern industrial recommendation systems encounter a core challenge of multi-stage optimization misalignment: a significant semantic gap exists between the multi-objective optimization paradigm widely used in the ranking phase and the single-objective modeling in the retrieval phase. Although the mainstream industry solution achieves multi-objective coverage through parallel multi-path single-objective retrieval, this approach leads to linear growth of training and serving resources with the number of objectives and has inherent limitations in handling loosely coupled objectives. This paper proposes MPFormer, a dynamic multi-task Transformer framework, which systematically addresses these issues through three mechanisms. First, an objective-conditioned Transformer jointly encodes user behavior sequences and multi-task semantics through learnable attention modulation; second, personalized target weights are introduced to dynamically adjust retrieval results; finally, user personalization information is incorporated into token representations and the Transformer structure to further enhance the model's representation ability. This framework has been successfully integrated into the Kuaishou short video recommendation system, stably serving over 400 million daily active users. It significantly improves user daily engagement and system operational efficiency. Practical deployment verification shows that, compared with traditional solutions, it effectively optimizes the iterative paradigm of multi-objective retrieval while maintaining service response speed, providing a scalable multi-objective solution for industrial recommendation systems.

[AI-41] TF-TransUNet1D: Time-Frequency Guided Transformer U-Net for Robust ECG Denoising in Digital Twin MICCAI2025

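【Quick read】: This paper addresses the loss of diagnostic accuracy caused by noise and artifacts in electrocardiogram (ECG) signals used in clinical applications. The core solution is TF-TransUNet1D, a new one-dimensional deep neural network that combines a U-Net encoder-decoder with a Transformer encoder to capture both the local morphological features and the long-range temporal dependencies of ECG signals. The key innovation is a hybrid time-frequency loss that jointly optimizes waveform reconstruction in the time domain and spectral fidelity in the frequency domain, suppressing high-frequency noise while preserving subtle but clinically significant waveform components, which improves denoising accuracy and diagnostic integrity.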

Link: https://arxiv.org/abs/2508.20398
Authors: Shijie Wang,Lei Li
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures; International Workshop on Digital Twin for Healthcare (DT4H) in MICCAI 2025 (Daejeon, Republic of Korea)

Abstract:Electrocardiogram (ECG) signals serve as a foundational data source for cardiac digital twins, yet their diagnostic utility is frequently compromised by noise and artifacts. To address this issue, we propose TF-TransUNet1D, a novel one-dimensional deep neural network that integrates a U-Net-based encoder-decoder architecture with a Transformer encoder, guided by a hybrid time-frequency domain loss. The model is designed to simultaneously capture local morphological features and long-range temporal dependencies, which are critical for preserving the diagnostic integrity of ECG signals. To enhance denoising robustness, we introduce a dual-domain loss function that jointly optimizes waveform reconstruction in the time domain and spectral fidelity in the frequency domain. In particular, the frequency-domain component effectively suppresses high-frequency noise while maintaining the spectral structure of the signal, enabling recovery of subtle but clinically significant waveform components. We evaluate TF-TransUNet1D using synthetically corrupted signals from the MIT-BIH Arrhythmia Database and the Noise Stress Test Database (NSTDB). Comparative experiments against state-of-the-art baselines demonstrate consistent superiority of our model in terms of SNR improvement and error metrics, achieving a mean absolute error of 0.1285 and Pearson correlation coefficient of 0.9540. By delivering high-precision denoising, this work bridges a critical gap in pre-processing pipelines for cardiac digital twins, enabling more reliable real-time monitoring and personalized modeling.
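
The dual-domain loss described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the mean-absolute-error terms, the weighting parameter `alpha`, and all function names below are assumptions for illustration only, not the authors' implementation. A naive DFT keeps the example dependency-free.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Magnitude spectrum of a real signal via a naive DFT (non-negative frequencies only)."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))) / n
        for k in range(n // 2 + 1)
    ]

def dual_domain_loss(clean, denoised, alpha=0.5):
    """Assumed hybrid loss: time-domain MAE plus alpha times spectral-magnitude MAE."""
    time_term = sum(abs(c - d) for c, d in zip(clean, denoised)) / len(clean)
    spec_c = dft_magnitudes(clean)
    spec_d = dft_magnitudes(denoised)
    freq_term = sum(abs(a - b) for a, b in zip(spec_c, spec_d)) / len(spec_c)
    return time_term + alpha * freq_term

# Toy signal: a low-frequency component plus high-frequency interference.
n = 64
clean = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
noisy = [c + 0.3 * math.sin(2 * math.pi * 20 * i / n) for i, c in enumerate(clean)]

print(dual_domain_loss(clean, clean))
print(dual_domain_loss(clean, noisy))
```

In this toy setting the frequency term penalizes the extra energy at bin 20 even where the time-domain error happens to be small, which is the intuition behind combining the two domains.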

[AI-42] Uncertainty Under the Curve: A Sequence-Level Entropy Area Metric for Reasoning LLM AAAI2026

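【Quick read】: This paper addresses the difficulty of quantifying uncertainty in the answer generation process of reasoning large language models (LLMs). Existing methods typically rely on external models or repeated sampling, which is computationally expensive and hard to interpret. The key to the solution is the Entropy Area Score (EAS), a metric that uses the model's own token-level predictive entropy and integrates it over the generation process to capture how uncertainty evolves. EAS needs neither an extra model nor repeated sampling, is efficient and interpretable, and in training-data selection outperforms conventional Pass Rate filtering, identifying high-potential training samples more precisely and improving student model accuracy on math benchmarks.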

Link: https://arxiv.org/abs/2508.20384
Authors: Yongfu Zhu,Lin Sun,Guangxiang Zhao,Weihong Lin,Xiangzheng Zhang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under review for AAAI 2026

Abstract: In this work, we introduce Entropy Area Score (EAS), a simple yet effective metric to quantify uncertainty in the answer generation process of reasoning large language models (LLMs). EAS requires neither external models nor repeated sampling; it integrates token-level predictive entropy from the model itself to capture the evolution of uncertainty during generation. Empirical results show that EAS is strongly correlated with answer entropy across models and datasets. In training data selection, EAS identifies high-potential samples and consistently outperforms Pass Rate filtering under equal sample budgets, improving student model accuracy on math benchmarks. EAS is both efficient and interpretable, offering a practical tool for uncertainty modeling and data quality assessment in LLM training.
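
As a rough illustration of integrating token-level entropy over a generation, the sketch below sums the Shannon entropy of each step's next-token distribution. The abstract does not state the exact definition of EAS (for example, whether the area is normalized by length), so the unnormalized sum and the function names here are assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_area_score(step_distributions):
    """Assumed form of the score: the discrete area under the per-token entropy curve."""
    return sum(token_entropy(p) for p in step_distributions)

# Two toy generations of three tokens over a four-word vocabulary.
confident = [[0.97, 0.01, 0.01, 0.01]] * 3
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3

print(entropy_area_score(confident))
print(entropy_area_score(uncertain))
```

Because the score only needs the distributions the model already produces during a single decoding pass, no extra model or resampling is involved, which matches the efficiency argument in the abstract.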

[AI-43] TCIA: A Task-Centric Instruction Augmentation Method for Instruction Finetuning

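【Quick read】: This paper addresses the neglect of on-task relevance in current instruction tuning of large language models (LLMs). Existing methods improve generalization by generating diverse instructions, but they do not account for the fact that most real-world applications depend on task-specific knowledge, so generic instruction data does little to improve performance on a specific task. The key to the solution is Task Centric Instruction Augmentation (TCIA), a framework that maps instructions into a discrete query-constraints space and systematically expands the set of instructions aligned with the target task while preserving diversity, improving task-specific performance by 8.7% on average without harming general instruction-following ability.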

Link: https://arxiv.org/abs/2508.20374
Authors: Simin Ma,Shujian Liu,Jun Tan,Yebowen Hu,Song Wang,Sathish Reddy Indurthi,Sanqiang Zhao,Liwei Wu,Jianbing Han,Kaiqiang Song
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Diverse instruction data is vital for effective instruction tuning of large language models, as it enables the model to generalize across different types of inputs. Building such a diversified instruction dataset is an essential step in this process. Existing approaches often leverage large language models to automatically explore and generate diverse instructions, ensuring both data diversity and quality. However, they tend to overlook an important factor in real-world applications: on-task relevance. In practice, only a few real-world applications require a truly general-purpose model; most benefit from task-specific knowledge tailored to their particular use case. Therefore, it is vital to develop instruction augmentation methods that not only maintain diversity but are also optimized for specific, real-world scenarios. We thus introduce Task Centric Instruction Augmentation (TCIA), a framework that systematically expands instructions while preserving both diversity and task alignment. By representing instructions in a discrete query-constraints space, TCIA creates a rich set of task-relevant instructions and enables models to generalize to these task-specific instructions without sacrificing overall performance. Experiments show that TCIA improves open-source LLMs' performance by an average of 8.7% across four real-world, task-specific applications, and in some cases outperforms leading closed-source models. These improvements do not compromise general instruction-following ability, making TCIA a scalable and efficient solution for adapting LLMs to real-world, task-focused applications.
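
To make the idea of a discrete query-constraints space concrete, here is a small sketch that represents an instruction as a query plus a list of constraints and enumerates variants by adding constraints from a task-specific pool. The representation, the function names, and the example strings are all hypothetical; the abstract does not describe how TCIA actually explores this space.

```python
from itertools import combinations

def augment_instruction(query, base_constraints, constraint_pool, max_extra=1):
    """Return variants that keep the query and extend the constraint set (hypothetical scheme)."""
    extras = [c for c in constraint_pool if c not in base_constraints]
    variants = []
    for k in range(max_extra + 1):
        for combo in combinations(extras, k):
            constraints = list(base_constraints) + list(combo)
            variants.append({"query": query, "constraints": constraints})
    return variants

def render(variant):
    """Turn a (query, constraints) pair back into a natural-language instruction."""
    return " ".join([variant["query"]] + variant["constraints"])

variants = augment_instruction(
    "Summarize the meeting transcript.",
    ["Use at most five bullet points."],
    [
        "List action items with owners.",
        "Write in a neutral tone.",
        "Use at most five bullet points.",
    ],
)
for v in variants:
    print(render(v))
```

Keeping the query fixed while varying constraints is one simple way to stay on-task while still producing diverse instructions.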

[AI-44] P2C: Path to Counterfactuals

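【Quick read】: This paper addresses the trust problem that arises when machine-learning models used in high-stakes decisions (such as finance, law, and hiring) lack transparency and actionability, focusing on how to balance explaining why a decision was made with providing actionable steps on how to obtain a favourable outcome. Current counterfactual explanation methods have two main limitations: they ignore causal dependencies between features, and they assume that all interventions can happen simultaneously, which is often infeasible in practice. To address this, the paper proposes P2C (Path-to-Counterfactuals), a model-agnostic framework whose core innovations are: 1) explicitly modelling the causal structure between features so that the generated counterfactual path is causally consistent; 2) planning an ordered sequence of actions in which every intermediate state is feasible, so the path can be executed in the real world. P2C plans with the goal-directed Answer Set Programming system s(CASP), automatically accounts for feature changes triggered by causal dependencies, and counts only the cost of interventions the user actively makes, yielding more realistic counterfactual explanations.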

Link: https://arxiv.org/abs/2508.20371
Authors: Sopam Dasgupta,Sadaf MD Halim,Joaquín Arias,Elmer Salazar,Gopal Gupta
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract: Machine-learning models are increasingly driving decisions in high-stakes settings, such as finance, law, and hiring, which highlights the need for transparency. However, the key challenge is to balance transparency (clarifying 'why' a decision was made) with recourse (providing actionable steps on 'how' to achieve a favourable outcome from an unfavourable one). Counterfactual explanations reveal 'why' an undesired outcome occurred and 'how' to reverse it through targeted feature changes (interventions). Current counterfactual approaches have limitations: 1) they often ignore causal dependencies between features, and 2) they typically assume all interventions can happen simultaneously, an unrealistic assumption in practical scenarios where actions are typically taken in a sequence. As a result, these counterfactuals are often not achievable in the real world. We present P2C (Path-to-Counterfactuals), a model-agnostic framework that produces a plan (ordered sequence of actions) converting an unfavourable outcome to a causally consistent favourable outcome. P2C addresses both limitations by 1) explicitly modelling causal relationships between features and 2) ensuring that each intermediate state in the plan is feasible and causally valid. P2C uses the goal-directed Answer Set Programming system s(CASP) to generate the plan, accounting for feature changes that happen automatically due to causal dependencies. Furthermore, P2C refines cost (effort) computation by only counting changes actively made by the user, resulting in realistic cost estimates. Finally, P2C highlights how its causal planner outperforms standard planners, which lack causal knowledge and thus can generate illegal actions.
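
The planning idea can be sketched with a toy example. P2C itself uses s(CASP); the breadth-first search below is a plain stand-in, and the features, actions, and causal rule are invented for illustration. It shows the two properties the abstract emphasizes: causal effects are applied automatically after each user action, and the cost counts only the actions the user takes.

```python
from collections import deque

# Hypothetical user actions; an action returns None when it is not feasible in a state.
ACTIONS = {
    "get_job": lambda s: {**s, "employed": True},
    "pay_off_debt": lambda s: {**s, "debt": "low"} if s["employed"] else None,
}

def apply_causal_rules(state):
    """Propagate automatic downstream effects (assumed rule: low debt causes high credit)."""
    if state["debt"] == "low":
        state = {**state, "credit": "high"}
    return state

def is_favourable(state):
    """Stand-in for the model's decision: approve when employed with high credit."""
    return state["employed"] and state["credit"] == "high"

def plan(start):
    """Breadth-first search for the shortest sequence of user actions to a favourable state."""
    queue = deque([(start, [])])
    seen = {tuple(sorted(start.items()))}
    while queue:
        state, path = queue.popleft()
        if is_favourable(state):
            return path
        for name, act in ACTIONS.items():
            nxt = act(state)
            if nxt is None:
                continue
            nxt = apply_causal_rules(nxt)
            key = tuple(sorted(nxt.items()))
            if key not in seen:
                seen.add(key)
                queue.append((nxt, path + [name]))
    return None

start = {"employed": False, "debt": "high", "credit": "low"}
steps = plan(start)
print(steps, "cost:", len(steps))
```

Note that the credit score is never set directly by an action: it changes only as a causal consequence of paying off debt, so it does not add to the user's cost.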

[AI-45] Adaptive Root Cause Localization for Microservice Systems with Multi-Agent Recursion-of-Thought

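【Quick read】: This paper addresses the accuracy and interpretability of root cause localization (RCL) in microservice systems. In increasingly complex, fine-grained, and highly interdependent microservice architectures, conventional methods either rely on pre-defined schemas or lack transparent reasoning, so they adapt poorly to evolving operational contexts. The key to the solution is RCLAgent, an adaptive localization method built on a multi-agent recursion-of-thought framework. Its core innovation is a recursion-of-thought strategy that guides the reasoning of the large language model (LLM) and integrates data from multiple agents with tool-assisted analysis, so that the root cause can be pinpointed from a single request, outperforming state-of-the-art methods that depend on aggregating multiple requests.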

Link: https://arxiv.org/abs/2508.20370
Authors: Lingzhe Zhang,Tong Jia,Kangjin Wang,Weijie Hong,Chiming Duan,Minghua He,Ying Li
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract: As contemporary microservice systems become increasingly popular and complex, often comprising hundreds or even thousands of fine-grained, interdependent subsystems, they face more frequent failures. Ensuring system reliability thus demands accurate root cause localization. While traces and metrics have proven to be effective data sources for this task, existing methods either rely heavily on pre-defined schemas, which struggle to adapt to evolving operational contexts, or lack interpretability in their reasoning process, thereby leaving Site Reliability Engineers (SREs) confused. In this paper, we conduct a comprehensive study on how SREs localize the root cause of failures, drawing insights from multiple professional SREs across different organizations. Our investigation reveals that human root cause analysis exhibits three key characteristics: recursiveness, multi-dimensional expansion, and cross-modal reasoning. Motivated by these findings, we introduce RCLAgent, an adaptive root cause localization method for microservice systems that leverages a multi-agent recursion-of-thought framework. RCLAgent employs a novel recursion-of-thought strategy to guide the LLM's reasoning process, effectively integrating data from multiple agents and tool-assisted analysis to accurately pinpoint the root cause. Experimental evaluations on various public datasets demonstrate that RCLAgent achieves superior performance by localizing the root cause using only a single request, outperforming state-of-the-art methods that depend on aggregating multiple requests. These results underscore the effectiveness of RCLAgent in enhancing the efficiency and precision of root cause localization in complex microservice environments.

[AI-46] AI-SearchPlanner: Modular Agentic Search via Pareto-Optimal Multi-Objective Reinforcement Learning

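【Quick read】: This paper addresses a limitation of current reinforcement learning (RL)-based search agents: they use a single large language model (LLM) to handle both search planning and question answering (QA) end to end, which limits how well both capabilities can be optimized at once. This is a poor fit for practical systems that rely on a high-quality frozen QA model (such as GPT-4). The key to the solution is AI-SearchPlanner, a new RL framework with three core innovations: 1) an architecture that decouples the search planner from the generator, so that a small trainable LLM is dedicated to search planning; 2) Dual-Reward Alignment, which jointly optimizes search quality and task relevance; 3) Pareto optimization that balances planning utility against computational cost. Experiments on real-world datasets show that the method outperforms existing RL-based search agents in both effectiveness and efficiency, and generalizes well across frozen QA models and data domains.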

Link: https://arxiv.org/abs/2508.20368
Authors: Lang Mei,Zhihan Yang,Chong Chen
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Recent studies have explored integrating Large Language Models (LLMs) with search engines to leverage both the LLMs' internal pre-trained knowledge and external information. In particular, reinforcement learning (RL) has emerged as a promising paradigm for enhancing LLM reasoning through multi-turn interactions with search engines. However, existing RL-based search agents rely on a single LLM to handle both search planning and question-answering (QA) tasks in an end-to-end manner, which limits their ability to optimize both capabilities simultaneously. In practice, sophisticated AI search systems often employ a large, frozen LLM (e.g., GPT-4, DeepSeek-R1) to ensure high-quality QA. Thus, a more effective and efficient approach is to utilize a small, trainable LLM dedicated to search planning. In this paper, we propose AI-SearchPlanner, a novel reinforcement learning framework designed to enhance the performance of frozen QA models by focusing on search planning. Specifically, our approach introduces three key innovations: 1) Decoupling the Architecture of the Search Planner and Generator, 2) Dual-Reward Alignment for Search Planning, and 3) Pareto Optimization of Planning Utility and Cost, to achieve the objectives. Extensive experiments on real-world datasets demonstrate that AI-SearchPlanner outperforms existing RL-based search agents in both effectiveness and efficiency, while exhibiting strong generalization capabilities across diverse frozen QA models and data domains.
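
The trade-off between planning utility and cost can be illustrated with a small Pareto-front filter. This only demonstrates the notion of Pareto optimality over (utility, cost) pairs; it is not the paper's RL objective, and the plan names and numbers are made up.

```python
def pareto_front(candidates):
    """Keep plans not dominated by another plan.

    A plan is dominated when some other plan has utility at least as high and
    cost at least as low, with one of the two strictly better.
    """
    front = []
    for name, utility, cost in candidates:
        dominated = any(
            (u >= utility and c <= cost) and (u > utility or c < cost)
            for _, u, c in candidates
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical search plans: (name, answer utility, number of search calls).
plans = [
    ("no_search", 0.52, 0),
    ("one_query", 0.70, 1),
    ("three_queries", 0.78, 3),
    ("five_queries", 0.77, 5),
]
print(pareto_front(plans))
```

Here the five-query plan is dropped because the three-query plan is both more useful and cheaper; the remaining plans represent different, equally defensible points on the utility-cost trade-off.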

[AI-47] Boosting Skeleton-Driven SMT Solver Fuzzing by Leveraging LLM to Produce Formula Generators

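【Quick read】: This paper addresses two challenges in testing SMT (Satisfiability Modulo Theories) solvers: conventional testing techniques struggle to keep pace with rapidly evolving solver features, and generative testing based on large language models (LLMs) suffers from a high rate of syntactically invalid formulas and substantial computational overhead. The key to the solution is the Chimera framework, which shifts from direct formula generation to the synthesis of reusable term generators. An LLM automatically extracts context-free grammars (CFGs) from documentation and synthesizes Boolean term generators that conform to those grammars; during fuzzing, the generators fill skeletons derived from existing formulas, which ensures syntactic validity and promotes semantic diversity, while the one-time LLM interaction needed for initialization greatly reduces runtime cost.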

Link: https://arxiv.org/abs/2508.20340
Authors: Maolin Sun,Yibiao Yang,Yuming Zhou
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:

Abstract:Satisfiability Modulo Theory (SMT) solvers are foundational to modern systems and programming languages research, providing the foundation for tasks like symbolic execution and automated verification. Because these solvers sit on the critical path, their correctness is essential, and high-quality test formulas are key to uncovering bugs. However, while prior testing techniques performed well on earlier solver versions, they struggle to keep pace with rapidly evolving features. Recent approaches based on Large Language Models (LLMs) show promise in exploring advanced solver capabilities, but two obstacles remain: nearly half of the generated formulas are syntactically invalid, and iterative interactions with the LLMs introduce substantial computational overhead. In this study, we present Chimera, a novel LLM-assisted fuzzing framework that addresses both issues by shifting from direct formula generation to the synthesis of reusable term (i.e., logical expression) generators. Particularly, Chimera uses LLMs to (1) automatically extract context-free grammars (CFGs) for SMT theories, including solver-specific extensions, from documentation, and (2) synthesize composable Boolean term generators that adhere to these grammars. During fuzzing, Chimera populates structural skeletons derived from existing formulas with the terms iteratively produced by the LLM-synthesized generators. This design ensures syntactic validity while promoting semantic diversity. Notably, Chimera requires only one-time LLM interaction investment, dramatically reducing runtime cost. We evaluated Chimera on two leading SMT solvers: Z3 and cvc5. Our experiments show that Chimera has identified 43 confirmed bugs, 40 of which have already been fixed by developers.
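
The skeleton-filling step can be sketched as follows. In Chimera the term generators are synthesized by an LLM from grammars extracted from solver documentation; the hand-written mini-grammar, the `<HOLE>` placeholder, and the function names below are assumptions that merely stand in for that output.

```python
import random

def gen_int(rng, variables, depth):
    """Generate an integer term from a tiny hand-written grammar (SMT-LIB syntax)."""
    if depth <= 0 or rng.random() < 0.4:
        return rng.choice(variables + [str(rng.randint(0, 9))])
    op = rng.choice(["+", "-", "*"])
    return f"({op} {gen_int(rng, variables, depth - 1)} {gen_int(rng, variables, depth - 1)})"

def gen_bool(rng, variables, depth):
    """Generate a Boolean term; recursion depth bounds the term size."""
    if depth <= 0 or rng.random() < 0.5:
        op = rng.choice(["=", "<", "<=", ">"])
        return f"({op} {gen_int(rng, variables, depth)} {gen_int(rng, variables, depth)})"
    if rng.random() < 0.3:
        return f"(not {gen_bool(rng, variables, depth - 1)})"
    op = rng.choice(["and", "or", "=>"])
    return f"({op} {gen_bool(rng, variables, depth - 1)} {gen_bool(rng, variables, depth - 1)})"

def fill_skeleton(skeleton, rng, variables, depth=2, hole="<HOLE>"):
    """Replace every placeholder in the skeleton with a freshly generated Boolean term."""
    while hole in skeleton:
        skeleton = skeleton.replace(hole, gen_bool(rng, variables, depth), 1)
    return skeleton

skeleton = (
    "(declare-const x Int)\n"
    "(declare-const y Int)\n"
    "(assert (and <HOLE> (or <HOLE> <HOLE>)))\n"
    "(check-sat)"
)
rng = random.Random(0)
filled = fill_skeleton(skeleton, rng, ["x", "y"])
print(filled)
```

Because every term is produced by grammar rules rather than free-form text generation, the output is syntactically well formed by construction, and no model call is needed while fuzzing.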

[AI-48] Multi-View Graph Convolution Network for Internal Talent Recommendation Based on Enterprise Emails

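【Quick read】: This paper addresses a structural limitation of internal talent recommendation: conventional approaches rely on the perspective of a few managers and tend to overlook qualified candidates. The key to the solution is a two-dimensional position-fit model built from email data: semantic similarity captures WHAT employees do, and the structural characteristics of their interactions capture HOW they work. The two dimensions are modeled as independent graphs and adaptively fused by a Dual Graph Convolutional Network with a gating mechanism, which learns the best fusion ratio between task alignment (WHAT) and collaborative patterns (HOW) for each job family, substantially improving recommendation performance (Hit@100 of 40.9%) while remaining highly interpretable.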

Link: https://arxiv.org/abs/2508.20328
Authors: Soo Hyun Kim,Jang-Hyun Kim
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Internal talent recommendation is a critical strategy for organizational continuity, yet conventional approaches suffer from structural limitations, often overlooking qualified candidates by relying on the narrow perspective of a few managers. To address this challenge, we propose a novel framework that models two distinct dimensions of an employee’s position fit from email data: WHAT they do (semantic similarity of tasks) and HOW they work (structural characteristics of their interactions and collaborations). These dimensions are represented as independent graphs and adaptively fused using a Dual Graph Convolutional Network (GCN) with a gating mechanism. Experiments show that our proposed gating-based fusion model significantly outperforms other fusion strategies and a heuristic baseline, achieving a top performance of 40.9% on Hit@100. Importantly, it is worth noting that the model demonstrates high interpretability by learning distinct, context-aware fusion strategies for different job families. For example, it learned to prioritize relational (HOW) data for ‘sales and marketing’ job families while applying a balanced approach for ‘research’ job families. This research offers a quantitative and comprehensive framework for internal talent discovery, minimizing the risk of candidate omission inherent in traditional methods. Its primary contribution lies in its ability to empirically determine the optimal fusion ratio between task alignment (WHAT) and collaborative patterns (HOW), which is required for employees to succeed in the new positions, thereby offering important practical implications.
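
A gating mechanism of the kind mentioned above is commonly written as a learned convex combination of two embeddings. The sketch below shows only that fusion step, with a scalar gate and hand-set weights; the GCN layers are omitted, and the exact gate used in the paper is not given in the abstract, so treat the form and names as assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(h_what, h_how, gate_weights, gate_bias=0.0):
    """Fuse two embeddings with a scalar gate g = sigmoid(w . [h_what; h_how] + b).

    Returns g * h_what + (1 - g) * h_how together with the gate value g.
    """
    concat = list(h_what) + list(h_how)
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [g * a + (1.0 - g) * b for a, b in zip(h_what, h_how)]
    return fused, g

h_what = [1.0, 0.0]   # toy embedding from the task-similarity (WHAT) graph
h_how = [0.0, 1.0]    # toy embedding from the interaction (HOW) graph

balanced, g_balanced = gated_fusion(h_what, h_how, [0.0, 0.0, 0.0, 0.0])
relational, g_relational = gated_fusion(h_what, h_how, [0.0, 0.0, 0.0, 0.0], gate_bias=-50.0)
print(g_balanced, balanced)
print(g_relational, relational)
```

Inspecting the gate value per job family is one way such a model can be read: a gate near 0 in this sketch means the fused representation leans on the HOW embedding, and a gate near 0.5 means a balanced mix.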

[AI-49] Surveying the Operational Cybersecurity and Supply Chain Threat Landscape when Developing and Deploying AI Systems

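【Quick read】: This paper addresses the new cybersecurity threats introduced when artificial intelligence (AI) is integrated into software systems, in particular the attack surfaces and objectives that traditional security assessments tend to miss. The key to the solution is to identify and analyze operational security risks and supply chain risks across the AI lifecycle, and to stress the need for security frameworks tailored to AI in order to counter evolving threats such as manipulation of AI model outputs and performance degradation.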

Link: https://arxiv.org/abs/2508.20307
Authors: Michael R Smith,Joe Ingram
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures

Abstract:The rise of AI has transformed the software and hardware landscape, enabling powerful capabilities through specialized infrastructures, large-scale data storage, and advanced hardware. However, these innovations introduce unique attack surfaces and objectives which traditional cybersecurity assessments often overlook. Cyber attackers are shifting their objectives from conventional goals like privilege escalation and network pivoting to manipulating AI outputs to achieve desired system effects, such as slowing system performance, flooding outputs with false positives, or degrading model accuracy. This paper serves to raise awareness of the novel cyber threats that are introduced when incorporating AI into a software system. We explore the operational cybersecurity and supply chain risks across the AI lifecycle, emphasizing the need for tailored security frameworks to address evolving threats in the AI-driven landscape. We highlight previous exploitations and provide insights from working in this area. By understanding these risks, organizations can better protect AI systems and ensure their reliability and resilience.

[AI-50] Dynamics-Aligned Latent Imagination in Contextual World Models for Zero-Shot Generalization

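【Quick read】: This paper addresses how real-world reinforcement learning agents can adapt efficiently to unseen environmental conditions, especially when context variables are latent or hard to measure, where conventional methods that require explicit context variables (such as friction or gravity) are limited by their availability. The key to the solution is Dynamics-Aligned Latent Imagination (DALI), a framework that embeds a self-supervised encoder in the Dreamer architecture and infers latent context representations by predicting forward dynamics. The representations produced by the encoder condition both the world model and the policy, bridging perception and control, and the paper proves theoretically that the encoder is essential for efficient context inference and robust generalization. DALI's latent space also supports counterfactual consistency: perturbing a specific dimension (for example, one that encodes gravity) changes imagined rollouts in physically plausible ways. On challenging contextual Markov Decision Process (cMDP) benchmarks, DALI clearly outperforms context-unaware baselines and often surpasses context-aware baselines in extrapolation tasks, enabling zero-shot generalization to unseen contextual variations.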

Link: https://arxiv.org/abs/2508.20294
Authors: Frank Röder,Jan Benad,Manfred Eppe,Pradeep Kr. Banerjee
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 31 pages, 4 figures

Abstract:Real-world reinforcement learning demands adaptation to unseen environmental conditions without costly retraining. Contextual Markov Decision Processes (cMDP) model this challenge, but existing methods often require explicit context variables (e.g., friction, gravity), limiting their use when contexts are latent or hard to measure. We introduce Dynamics-Aligned Latent Imagination (DALI), a framework integrated within the Dreamer architecture that infers latent context representations from agent-environment interactions. By training a self-supervised encoder to predict forward dynamics, DALI generates actionable representations conditioning the world model and policy, bridging perception and control. We theoretically prove this encoder is essential for efficient context inference and robust generalization. DALI’s latent space enables counterfactual consistency: Perturbing a gravity-encoding dimension alters imagined rollouts in physically plausible ways. On challenging cMDP benchmarks, DALI achieves significant gains over context-unaware baselines, often surpassing context-aware baselines in extrapolation tasks, enabling zero-shot generalization to unseen contextual variations.

[AI-51] Beacon: Post-Training Quantization with Integrated Grid Selection

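【Quick read】: This paper addresses the difficulty of choosing scaling factors in per-channel post-training quantization (PTQ), that is, how to determine the scaling factors used to replace weight values with values from a quantization grid without manual tuning or large calibration sets. The key to the solution is Beacon, an algorithm that quantizes directly with a fixed non-scaled alphabet and derives the optimal scaling factors automatically by exploiting the geometry of symmetric scalar quantization, achieving efficient and competitive quantization without back-propagation or large calibration sets.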

Link: https://arxiv.org/abs/2508.20293
Authors: Shihao Zhang,Rayan Saab
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Quantization is a widely used compression technique for reducing the memory and computation costs of large pre-trained models. A key challenge in per-channel post-training quantization (PTQ) is selecting appropriate scaling factors to replace weight values with values from a scaled quantization grid. Existing methods typically fix the scale at the outset via heuristic tuning or grid search. In this note, we propose Beacon, a simple and effective algorithm that eliminates the need for such manual tuning. Beacon performs per-channel PTQ directly using a fixed non-scaled alphabet and automatically determines the optimal scaling factors by exploiting the geometry of symmetric scalar quantization. It supports both symmetric and asymmetric quantization with minimal modifications and does not rely on back-propagation or large calibration sets. Despite its simplicity and tuning-free nature, Beacon achieves competitive performance compared to state-of-the-art methods, making it a practical solution for efficient model deployment.
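
For readers unfamiliar with the scale-selection problem, the sketch below shows a generic alternating scheme for one channel: assign each weight to the nearest point of the scaled alphabet, then refit the scale in closed form by least squares. This is explicitly not the Beacon algorithm, whose details are not given in the abstract; it only illustrates the quantity (a per-channel scale over a fixed alphabet) that Beacon determines automatically. All names and values are illustrative.

```python
def sq_error(weights, scale, codes):
    """Squared reconstruction error of one channel for a given scale and code assignment."""
    return sum((w - scale * q) ** 2 for w, q in zip(weights, codes))

def quantize_channel(weights, alphabet, iters=20):
    """Alternate nearest-code assignment and closed-form least-squares scale updates."""
    scale = max(abs(w) for w in weights) / max(abs(a) for a in alphabet)
    codes = [0] * len(weights)
    for _ in range(iters):
        # Step 1: with the scale fixed, pick the nearest alphabet element for each weight.
        codes = [min(alphabet, key=lambda a: abs(w - scale * a)) for w in weights]
        denom = sum(q * q for q in codes)
        if denom == 0:
            break
        # Step 2: with the codes fixed, the optimal scale is <w, q> / <q, q>.
        new_scale = sum(w * q for w, q in zip(weights, codes)) / denom
        converged = abs(new_scale - scale) < 1e-12
        scale = new_scale
        if converged:
            break
    return scale, codes

weights = [0.9, -0.45, 0.05, -1.1, 0.5]
alphabet = [-2, -1, 0, 1, 2]   # fixed, non-scaled symmetric grid
scale, codes = quantize_channel(weights, alphabet)
print(scale, codes, sq_error(weights, scale, codes))
```

Each of the two steps can only lower the squared error, so the result is never worse than the naive max-abs scale it starts from, although it may stop at a local optimum.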

[AI-52] Objective Value Change and Shape-Based Accelerated Optimization for the Neural Network Approximation

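【Quick read】: This paper addresses the unpredictable local performance of neural network approximation, which can seriously undermine reliability in critical applications. The key to the solution is a new metric, value change (VC), which quantifies local value changes in network behavior and thereby characterizes the local performance and stability of the approximation task. VC reveals two new phenomena in neural network approximation, the VC-tendency and the minority-tendency; the paper further builds a VC-based distance that measures the difference between two functions from the perspective of variation, and uses it to design a new preprocessing framework that improves the efficiency and accuracy of neural network approximation.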

Link: https://arxiv.org/abs/2508.20290
Authors: Pengcheng Xie,Zihao Zhou,Zijian Zhou
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Comments: 27 pages

Abstract: This paper introduces a novel metric of an objective function f, which we call VC (value change), to measure the difficulty of a neural network approximation task and its effect on the approximation; it numerically supports characterizing the local performance and behavior of neural network approximation. Neural networks often suffer from unpredictable local performance, which can hinder their reliability in critical applications. VC addresses this issue by providing a quantifiable measure of local value changes in network behavior, offering insights into the stability and performance of neural network approximation. We investigate some fundamental theoretical properties of VC and identify two intriguing phenomena in neural network approximation: the VC-tendency and the minority-tendency. These trends respectively characterize how pointwise errors evolve in relation to the distribution of VC during the approximation process. In addition, we propose a novel metric based on VC, which measures the distance between two functions from the perspective of variation. Building upon this metric, we further propose a new preprocessing framework for neural network approximation. Numerical results, including a real-world experiment and a PDE-related scientific problem, support our discovery and preprocessing acceleration method.

[AI-53] Network-Level Prompt and Trait Leakage in Local Research Agents

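【Quick read】: This paper addresses the risk of inference attacks on deployed Web and Research Agents (WRAs) in generative AI systems, in particular how a passive network adversary (such as an Internet service provider, ISP) can leak user prompts and latent traits by analyzing a WRA's network-level metadata (the visited IP addresses and their timings). The key contribution is a new attack that uses only this metadata, evaluated with the behavioral similarity metric OBELS (Original-Behavioral Latent Similarity): it recovers over 73% of the functional and domain knowledge of user prompts, recovers up to 19 of 32 latent user traits with high accuracy in a multi-session setting, and remains robust under partial observability and noise. The authors also propose lightweight mitigations that constrain domain diversity or obfuscate access traces, reducing attack effectiveness by 29% on average with negligible impact on utility.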

Link: https://arxiv.org/abs/2508.20282
Authors: Hyejun Jeong,Mohammadreze Teymoorianfard,Abhinav Kumar,Amir Houmansadr,Eugene Badasarian
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: under review

Abstract: We show that Web and Research Agents (WRAs), language model-based systems that investigate complex topics on the Internet, are vulnerable to inference attacks by passive network adversaries such as ISPs. These agents could be deployed locally by organizations and individuals for privacy, legal, or financial purposes. Unlike sporadic web browsing by humans, WRAs visit 70-140 domains with distinguishable timing correlations, enabling unique fingerprinting attacks. Specifically, we demonstrate a novel prompt and user trait leakage attack against WRAs that only leverages their network-level metadata (i.e., visited IP addresses and their timings). We start by building a new dataset of WRA traces based on user search queries and queries generated by synthetic personas. We define a behavioral metric (called OBELS) to comprehensively assess similarity between original and inferred prompts, showing that our attack recovers over 73% of the functional and domain knowledge of user prompts. Extending to a multi-session setting, we recover up to 19 of 32 latent traits with high accuracy. Our attack remains effective under partial observability and noisy conditions. Finally, we discuss mitigation strategies that constrain domain diversity or obfuscate traces, showing negligible utility impact while reducing attack effectiveness by an average of 29%.

[AI-54] AI reasoning effort mirrors human decision time on content moderation tasks

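【Quick read】: This paper asks whether the reasoning effort that large language models (LLMs) spend on intermediate reasoning steps before answering is comparable to the time humans take to decide on similar tasks, which would inform AI interpretability and the understanding of model decision making. The key to the approach is a paired conjoint experiment on a content moderation task that systematically compares the reasoning effort of three frontier models with human decision times. Reasoning effort consistently predicts human decision time, and both are sensitive to task difficulty, in line with dual-process theories of cognition. This suggests that AI reasoning traces can serve as a proxy for human processing time in subjective judgments, with potential for improving interpretability and supporting decision making.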

链接: https://arxiv.org/abs/2508.20262
作者: Thomas Davidson
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large language models can now generate intermediate reasoning steps before producing answers, improving performance on difficult problems. This study uses a paired conjoint experiment on a content moderation task to examine parallels between human decision times and model reasoning effort. Across three frontier models, reasoning effort consistently predicts human decision time. Both humans and models expended greater effort when important variables were held constant, suggesting similar sensitivity to task difficulty and patterns consistent with dual-process theories of cognition. These findings show that AI reasoning effort mirrors human processing time in subjective judgments and underscores the potential of reasoning traces for interpretability and decision-making.

[AI-55] SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization

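【Quick read】: This paper addresses the lack of hardware-awareness in current large language model (LLM) approaches to GPU kernel performance optimization: existing methods rely on inefficient search that optimizes around runtime and cannot exploit the underlying hardware as precisely as human performance engineers do. The key is to give the LLM explicit hardware-awareness by combining the workload's specific memory access patterns, architecture specifications, filtered profiling logs, and reflections on historical performance, so that it can generate spatial optimizations for disaggregated GPU architectures. SwizzlePerf automatically generates swizzling patterns for GPU kernels on this basis. For a GEMM kernel it takes less than 5 minutes to find the same hardware-specific optimal swizzling pattern that took expert engineers 2 weeks, and on a suite of 10 diverse ML and science kernels it generates patterns for 9 of them, reaching up to a 2.06x speedup and a 70% improvement in L2 hit rate.

Link: https://arxiv.org/abs/2508.20258
Authors: Arya Tschand,Muhammad Awad,Ryan Swann,Kesavan Ramakrishnan,Jeffrey Ma,Keith Lowery,Ganesh Dasika,Vijay Janapa Reddi
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:
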
Abstract:Large language models (LLMs) have shown progress in GPU kernel performance engineering using inefficient search-based methods that optimize around runtime. Any existing approach lacks a key characteristic that human performance engineers rely on for near-optimal utilization – hardware-awareness. By leveraging the workload’s specific memory access patterns, architecture specifications, filtered profiling logs, and reflections on historical performance, we can make software-level optimizations that are tailored to the underlying hardware. SwizzlePerf automatically generates spatial optimizations for GPU kernels on disaggregated architectures by giving LLMs explicit hardware-awareness. For a GEMM kernel, SwizzlePerf takes less than 5 minutes to generate the same hardware-specific optimal swizzling pattern that took expert performance engineers 2 weeks to find. On a suite of 10 diverse ML and Science kernels, SwizzlePerf can generate swizzling patterns for 9 of the kernels that achieve up to a 2.06x speedup and 70% improvement in L2 hit rate. This work is the first of many steps toward systematically creating hardware-aware LLM performance engineering agents.
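
To make the idea of a swizzling pattern concrete, here is a minimal sketch of a workgroup-ID remap. It rests on assumptions that are not in the abstract: that the hardware dispatches workgroups round-robin across `num_xcd` chiplets, each with its own L2 cache, and that neighbouring tiles share input data. The function names `xcd_of` and `swizzle` are hypothetical, and this is not claimed to be a pattern SwizzlePerf itself generates.

```python
def xcd_of(dispatch_id: int, num_xcd: int) -> int:
    """Assumed hardware policy: workgroups are dispatched round-robin across chiplets."""
    return dispatch_id % num_xcd


def swizzle(dispatch_id: int, num_wg: int, num_xcd: int) -> int:
    """Map a hardware dispatch ID to the logical tile that workgroup should compute.

    The tiles are split into num_xcd contiguous chunks and chunk k is handled
    entirely by chiplet k, so neighbouring tiles land on the same L2 cache.
    Assumes num_wg is a multiple of num_xcd.
    """
    wg_per_xcd = num_wg // num_xcd
    return (dispatch_id % num_xcd) * wg_per_xcd + dispatch_id // num_xcd


num_wg, num_xcd = 16, 4
# For every logical tile, record which chiplet ends up computing it.
tile_to_xcd = {swizzle(d, num_wg, num_xcd): xcd_of(d, num_xcd) for d in range(num_wg)}
for tile in range(num_wg):
    print(f"tile {tile:2d} -> chiplet {tile_to_xcd[tile]}")
```

Without the remap, consecutive tiles would be spread over all chiplets; with it, tiles 0 to 3 share one chiplet, tiles 4 to 7 the next, and so on.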

[AI-56] Do Students Rely on AI? Analysis of Student-ChatGPT Conversations from a Field Study

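【Quick read】: This paper examines how college students rely on generative AI (ChatGPT-4) during educational quizzes and what predicts that reliance: how students rely on AI, how reliance varies, and which mechanisms drive adoption. Analyzing 315 student-AI conversations, the study introduces a four-stage reliance taxonomy (AI competence, relevance, adoption, and final answer correctness). It finds that overall reliance was low and that many students could not use AI effectively for learning. Negative reliance patterns also persisted, indicating that students struggled to change strategy after an unsuccessful first attempt. The proposed remedies are twofold: strengthen onboarding so that students become familiar with and proficient in AI tools, and design AI interfaces with reliance-calibration mechanisms that encourage appropriate reliance, supporting ethical and cognitively enriching use of AI in education.

Link: https://arxiv.org/abs/2508.20244
Authors: Jiayu Zheng,Lingxin Hao,Kelun Lu,Ashi Garg,Mike Reese,Melo-Jean Yap,I-Jeng Wang,Xingyun Wu,Wenrui Huang,Jenna Hoffman,Ariane Kelly,My Le,Ryan Zhang,Yanyu Lin,Muhammad Faayez,Anqi Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
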
Abstract:This study explores how college students interact with generative AI (ChatGPT-4) during educational quizzes, focusing on reliance and predictors of AI adoption. Conducted at the early stages of ChatGPT implementation, when students had limited familiarity with the tool, this field study analyzed 315 student-AI conversations during a brief, quiz-based scenario across various STEM courses. A novel four-stage reliance taxonomy was introduced to capture students’ reliance patterns, distinguishing AI competence, relevance, adoption, and students’ final answer correctness. Three findings emerged. First, students exhibited overall low reliance on AI and many of them could not effectively use AI for learning. Second, negative reliance patterns often persisted across interactions, highlighting students’ difficulty in effectively shifting strategies after unsuccessful initial experiences. Third, certain behavioral metrics strongly predicted AI reliance, highlighting potential behavioral mechanisms to explain AI adoption. The study’s findings underline critical implications for ethical AI integration in education and the broader field. It emphasizes the need for enhanced onboarding processes to improve student’s familiarity and effective use of AI tools. Furthermore, AI interfaces should be designed with reliance-calibration mechanisms to enhance appropriate reliance. Ultimately, this research advances understanding of AI reliance dynamics, providing foundational insights for ethically sound and cognitively enriching AI practices.

[AI-57] Validating Generative Agent-Based Models for Logistics and Supply Chain Management Research

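【Quick read】: This paper addresses whether Generative Agent-Based Models (GABMs) are valid proxies for human behavior in logistics and supply chain management (LSCM) research, that is, whether large language models (LLMs) faithfully simulate human decisions in complex scenarios. The key contribution is a dual-validation framework, tested empirically. First, equivalence testing with Two One-Sided Tests (TOST) assesses whether LLM outputs match human behavior at the surface level. Second, structural equation modeling (SEM) examines the decision process to detect artificial decision paths that are not present in humans. Some LLMs are equivalent to humans in behavioral outcomes yet differ markedly in decision mechanism, an "equivalence-versus-process paradox" that underlines the need for both validation levels and gives LSCM researchers methodological guidance for rigorous modeling.

Link: https://arxiv.org/abs/2508.20234
Authors: Vincent E. Castillo
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: A version of this work is also available on SSRN ( this https URL or this http URL ). This preprint is distributed under the CC BY-NC-SA 4.0 License
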
Abstract:Generative Agent-Based Models (GABMs) powered by large language models (LLMs) offer promising potential for empirical logistics and supply chain management (LSCM) research by enabling realistic simulation of complex human behaviors. Unlike traditional agent-based models, GABMs generate human-like responses through natural language reasoning, which creates potential for new perspectives on emergent LSCM phenomena. However, the validity of LLMs as proxies for human behavior in LSCM simulations is unknown. This study evaluates LLM equivalence of human behavior through a controlled experiment examining dyadic customer-worker engagements in food delivery scenarios. I test six state-of-the-art LLMs against 957 human participants (477 dyads) using a moderated mediation design. This study reveals a need to validate GABMs on two levels: (1) human equivalence testing, and (2) decision process validation. Results reveal GABMs can effectively simulate human behaviors in LSCM; however, an equivalence-versus-process paradox emerges. While a series of Two One-Sided Tests (TOST) for equivalence reveals some LLMs demonstrate surface-level equivalence to humans, structural equation modeling (SEM) reveals artificial decision processes not present in human participants for some LLMs. These findings show GABMs as a potentially viable methodological instrument in LSCM with proper validation checks. The dual-validation framework also provides LSCM researchers with a guide to rigorous GABM development. For practitioners, this study offers evidence-based assessment for LLM selection for operational tasks.
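
The equivalence step can be illustrated with a small Two One-Sided Tests (TOST) routine. This is a sketch under stated assumptions, not the study's analysis: it uses a large-sample normal approximation so that it runs on the Python standard library alone, the equivalence margin of 0.2 is arbitrary, and the three samples are synthetic numbers made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def tost_equivalent(a, b, margin, alpha=0.05):
    """Large-sample TOST for two independent samples.

    Returns (is_equivalent, p_value). Equivalence is concluded only if both
    one-sided null hypotheses are rejected, so the reported p-value is the
    larger of the two one-sided p-values.
    """
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = NormalDist()
    # H0: diff <= -margin, rejected when diff sits well above -margin.
    p_lower = 1.0 - z.cdf((diff + margin) / se)
    # H0: diff >= +margin, rejected when diff sits well below +margin.
    p_upper = z.cdf((diff - margin) / se)
    p_value = max(p_lower, p_upper)
    return p_value < alpha, p_value


# Synthetic ratings on an arbitrary scale (illustrative only, not study data).
human = [3.0 + 0.1 * ((7 * i) % 11 - 5) for i in range(200)]
llm_close = [3.05 + 0.1 * ((5 * i) % 11 - 5) for i in range(200)]
llm_far = [x + 1.0 for x in llm_close]

print("close model equivalent:", tost_equivalent(human, llm_close, margin=0.2)[0])
print("far model equivalent:  ", tost_equivalent(human, llm_far, margin=0.2)[0])
```

Passing this test says only that the mean outcomes are close. It says nothing about how the outcome was reached, which is the gap the paper's second validation level (SEM on the decision process) is meant to cover.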

[AI-58] Collaborating with GenAI: Incentives and Replacements

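【Quick read】: This paper studies how Generative AI (GenAI) affects individual effort and overall output in team collaboration, especially when managers may use GenAI to replace some workers, and how this can upend team incentives and structure. The key is a theoretical model in which a manager selects a team for a shared task and GenAI substitutes for the unselected workers. Each worker chooses an effort level and bears the corresponding cost. The analysis shows that GenAI can lead workers to exert no effort even when it is almost ineffective, proves that the manager's optimization problem is NP-complete, and gives an efficient algorithm for (almost-)linear instances. It also shows that workers with low individual value can be critical to sustaining overall output, and that excluding them can trigger a cascade.

Link: https://arxiv.org/abs/2508.20213
Authors: Boaz Taitler,Omer Ben-Porat
Affiliation: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments:
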
Abstract:The rise of Generative AI (GenAI) is reshaping how workers contribute to shared projects. While workers can use GenAI to boost productivity or reduce effort, managers may use it to replace some workers entirely. We present a theoretical framework to analyze how GenAI affects collaboration in such settings. In our model, the manager selects a team to work on a shared task, with GenAI substituting for unselected workers. Each worker selects how much effort to exert, and incurs a cost that increases with the level of effort. We show that GenAI can lead workers to exert no effort, even if GenAI is almost ineffective. We further show that the manager’s optimization problem is NP-complete, and provide an efficient algorithm for the special class of (almost-) linear instances. Our analysis shows that even workers with low individual value may play a critical role in sustaining overall output, and excluding such workers can trigger a cascade. Finally, we conduct extensive simulations to illustrate our theoretical findings.

[AI-59] Filter then Attend: Improving attention-based Time Series Forecasting with Spectral Filtering

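【Quick read】: This paper targets two problems of Transformer-based long time-series forecasting (LTSF) models: a bias toward low-frequency signals, which leaves the spectrum underused, and high compute and memory cost. The key is to place learnable frequency filters at the input of the Transformer. The filters add only about 1,000 parameters yet improve the model's use of the full spectrum, yielding 5-10% relative improvement in multiple instances. With filters in place the embedding dimension can also be reduced, giving smaller models that match or improve forecasting accuracy.

Link: https://arxiv.org/abs/2508.20206
Authors: Elisha Dayag,Nhat Thanh Van Tran,Jack Xin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
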
Abstract:Transformer-based models are at the forefront in long time-series forecasting (LTSF). While in many cases, these models are able to achieve state of the art results, they suffer from a bias toward low-frequencies in the data and high computational and memory requirements. Recent work has established that learnable frequency filters can be an integral part of a deep forecasting model by enhancing the model’s spectral utilization. These works choose to use a multilayer perceptron to process their filtered signals and thus do not solve the issues found with transformer-based models. In this paper, we establish that adding a filter to the beginning of transformer-based models enhances their performance in long time-series forecasting. We add learnable filters, which only add approximately 1000 additional parameters to several transformer-based models and observe in multiple instances 5-10% relative improvement in forecasting performance. Additionally, we find that with filters added, we are able to decrease the embedding dimension of our models, resulting in transformer-based architectures that are both smaller and more effective than their non-filtering base models. We also conduct synthetic experiments to analyze how the filters enable Transformer-based models to better utilize the full spectrum for forecasting.
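
What a frequency filter in front of a forecasting model does can be sketched in a few lines: transform the input window to the frequency domain, scale each frequency bin by a gain, and transform back. The abstract does not say how the paper parameterizes its filters, so the per-bin real gains below are an assumption. In a trained model the gains would be learned by gradient descent, whereas here they are fixed by hand, and the naive DFT is used only to avoid dependencies.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def spectral_filter(x, gains):
    """Scale each frequency bin of x by a real gain and return the time-domain signal.

    The gains must be symmetric (gains[k] == gains[n - k]) for a real input
    to produce a real output.
    """
    scaled = [g * c for g, c in zip(gains, dft(x))]
    return [v.real for v in idft(scaled)]


n = 32
# Strong low-frequency component plus a weaker high-frequency one.
signal = [math.sin(2 * math.pi * t / n) + 0.5 * math.sin(2 * math.pi * 8 * t / n)
          for t in range(n)]
gains = [1.0] * n
gains[8] = gains[n - 8] = 2.0  # boost the under-represented frequency
filtered = spectral_filter(signal, gains)
```

After filtering, the high-frequency component has the same amplitude as the low-frequency one, which is the kind of spectral rebalancing the abstract describes as better spectral utilization.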

[AI-60] AI Propaganda factories with language models

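【Quick read】: This paper examines the feasibility of fully automated political influence operations driven by Generative AI and the challenge of detecting them: how persona-driven political content can be produced without human intervention, and which features of such campaigns can support defense. The key is two behavioral findings. First, "persona-over-model": persona design explains behavior more than the underlying language model does. Second, "engagement as a stressor": when replies must counter arguments, ideological adherence strengthens and extreme content becomes more prevalent. Automated content production is therefore within reach of small actors on commodity hardware, while its very consistency provides a detection signature, so defense should shift from restricting model access toward conversation-centric detection and disruption of campaigns and coordination infrastructure.

Link: https://arxiv.org/abs/2508.20186
Authors: Lukasz Olejnik
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
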
Abstract:AI-powered influence operations can now be executed end-to-end on commodity hardware. We show that small language models produce coherent, persona-driven political messaging and can be evaluated automatically without human raters. Two behavioural findings emerge. First, persona-over-model: persona design explains behaviour more than model identity. Second, engagement as a stressor: when replies must counter-arguments, ideological adherence strengthens and the prevalence of extreme content increases. We demonstrate that fully automated influence-content production is within reach of both large and small actors. Consequently, defence should shift from restricting model access towards conversation-centric detection and disruption of campaigns and coordination infrastructure. Paradoxically, the very consistency that enables these operations also provides a detection signature.

[AI-61] RelAItionship Building: Analyzing Recruitment Strategies for Participatory AI

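【Quick read】: This paper addresses the practical challenge of designing and executing recruitment methodology in Participatory AI projects: identifying, reaching, and continuing to engage the relevant stakeholder groups so that AI development reflects community needs and values. The key recommendation is "relationship-forward" recruitment, which builds long-term, trusting partnerships to improve the quality of participation, complemented by reflexive documentation of recruitment practice to make the process more transparent, equitable, and reproducible.

Link: https://arxiv.org/abs/2508.20176
Authors: Eugene Kim,Vaibhav Balloli,Berelian Karimian,Elizabeth Bondi-Kelly,Benjamin Fish
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Accepted at the Eighth AAAI/ACM Conference on AI, Ethics, and Society. this https URL
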
Abstract:Participatory AI, in which impacted community members and other stakeholders are involved in the design and development of AI systems, holds promise as a way to ensure AI is developed to meet their needs and reflect their values. However, the process of identifying, reaching out, and engaging with all relevant stakeholder groups, which we refer to as recruitment methodology, is still a practical challenge in AI projects striving to adopt participatory practices. In this paper, we investigate the challenges that researchers face when designing and executing recruitment methodology for Participatory AI projects, and the implications of current recruitment practice for Participatory AI. First, we describe the recruitment methodologies used in AI projects using a corpus of 37 projects to capture the diversity of practices in the field and perform an initial analysis on the documentation of recruitment practices, as well as specific strategies that researchers use to meet goals of equity and empowerment. To complement this analysis, we interview five AI researchers to learn about the outcomes of recruitment methodologies. We find that these outcomes are shaped by structural conditions of their work, researchers’ own goals and expectations, and the relationships built from the recruitment methodology and subsequent collaboration. Based on these analyses, we provide recommendations for designing and executing relationship-forward recruitment methods, as well as reflexive recruitment documentation practices for Participatory AI researchers.

[AI-62] IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement

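【Quick read】: This paper addresses the risk that large language models (LLMs) generate harmful content while avoiding over-refusal of harmless requests, balancing safety, utility, and a low over-refusal rate. The key is IntentionReasoner, which has three parts: a dataset of about 163,000 queries annotated with intent reasoning, safety labels, and rewritten versions; supervised fine-tuning that gives the guard model format adherence, intent analysis, and safe rewriting capabilities; and a tailored multi-reward optimization strategy that combines rule-based heuristics with reward model signals in a reinforcement learning framework. Together these improve safeguarding, response quality, and robustness to jailbreak attacks.

Link: https://arxiv.org/abs/2508.20151
Authors: Yuanzhe Shen,Zisu Huang,Zhengkang Guo,Yide Liu,Guanxu Chen,Ruicheng Yin,Xiaoqing Zheng,Xuanjing Huang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 9 figures
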
Abstract:The rapid advancement of large language models (LLMs) has driven their adoption across diverse domains, yet their ability to generate harmful content poses significant safety challenges. While extensive research has focused on mitigating harmful outputs, such efforts often come at the cost of excessively rejecting harmless prompts. Striking a balance among safety, over-refusal, and utility remains a critical challenge. In this work, we introduce IntentionReasoner, a novel safeguard mechanism that leverages a dedicated guard model to perform intent reasoning, multi-level safety classification, and query rewriting to neutralize potentially harmful intent in edge-case queries. Specifically, we first construct a comprehensive dataset comprising approximately 163,000 queries, each annotated with intent reasoning, safety labels, and rewritten versions. Supervised fine-tuning is then applied to equip the guard model with foundational capabilities in format adherence, intent analysis, and safe rewriting. Finally, we apply a tailored multi-reward optimization strategy that integrates rule-based heuristics and reward model signals within a reinforcement learning framework to further enhance performance. Extensive experiments show that IntentionReasoner excels in multiple safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios, significantly enhancing safety while effectively reducing over-refusal rates and improving the quality of responses.

[AI-63] The Anatomy of a Personal Health Agent

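【Quick read】: This paper addresses the difficulty current health agents have in meeting individuals' diverse health needs in everyday non-clinical settings. Existing work focuses on clinical environments and offers little support for ordinary users seeking personalized, dynamic health advice from wearables and health records. The key is the Personal Health Agent (PHA), a multi-agent framework with three specialist sub-agents: a data science agent that analyzes time-series wearable and health record data; a health domain expert agent that integrates users' health and contextual information to produce accurate insights; and a health coach agent that guides users with psychological strategies and tracks their progress. The architecture closes the loop from multimodal data understanding to personalized intervention, and automated and human evaluations across 10 benchmark tasks (more than 7,000 annotations and 1,100 hours of effort from experts and end-users) support its effectiveness and practicality.

Link: https://arxiv.org/abs/2508.20148
Authors: A. Ali Heydari,Ken Gu,Vidya Srinivas,Hong Yu,Zhihan Zhang,Yuwei Zhang,Akshay Paruchuri,Qian He,Hamid Palangi,Nova Hammerquist,Ahmed A. Metwally,Brent Winslow,Yubin Kim,Kumar Ayush,Yuzhe Yang,Girish Narayanswamy,Maxwell A. Xu,Jake Garrison,Amy Aremnto Lee,Jenny Vafeiadou,Ben Graef,Isaac R. Galatzer-Levy,Erik Schenck,Andrew Barakat,Javier Perez,Jacqueline Shreibati,John Hernandez,Anthony Z. Faranesh,Javier L. Prieto,Connor Heneghan,Yun Liu,Jiening Zhan,Mark Malhotra,Shwetak Patel,Tim Althoff,Xin Liu,Daniel McDuff,Xuhai “Orson” Xu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments:
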
Abstract:Health is a fundamental pillar of human wellness, and the rapid advancements in large language models (LLMs) have driven the development of a new generation of health agents. However, the application of health agents to fulfill the diverse needs of individuals in daily non-clinical settings is underexplored. In this work, we aim to build a comprehensive personal health agent that is able to reason about multimodal data from everyday consumer wellness devices and common personal health records, and provide personalized health recommendations. To understand end-users’ needs when interacting with such an assistant, we conducted an in-depth analysis of web search and health forum queries, alongside qualitative insights from users and health experts gathered through a user-centered design process. Based on these findings, we identified three major categories of consumer health needs, each of which is supported by a specialist sub-agent: (1) a data science agent that analyzes personal time-series wearable and health record data, (2) a health domain expert agent that integrates users’ health and contextual data to generate accurate, personalized insights, and (3) a health coach agent that synthesizes data insights, guiding users using a specified psychological strategy and tracking users’ progress. Furthermore, we propose and develop the Personal Health Agent (PHA), a multi-agent framework that enables dynamic, personalized interactions to address individual health needs. To evaluate each sub-agent and the multi-agent system, we conducted automated and human evaluations across 10 benchmark tasks, involving more than 7,000 annotations and 1,100 hours of effort from health experts and end-users. Our work represents the most comprehensive evaluation of a health agent to date and establishes a strong foundation towards the futuristic vision of a personal health agent accessible to everyone.

[AI-64] Navigating the EU AI Act: Foreseeable Challenges in Qualifying Deep Learning-Based Automated Inspections of Class III Medical Devices

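【Quick read】: This paper addresses the regulatory compliance challenges of using deep learning (DL)-based automated visual inspection in quality assurance for Class III medical devices, in particular the differences and conflicts between the EU Artificial Intelligence Act and existing medical device regulations such as the MDR and the FDA QSR. The key is a systematic analysis of divergences in risk management, data governance, model validation, explainability requirements, and post-deployment monitoring, together with potential implementation strategies for data retention burdens, global compliance complexity, and the difficulty of achieving statistical significance when validating with limited defect data. The result is a technical reference for manufacturers seeking a compliance path.

Link: https://arxiv.org/abs/2508.20144
Authors: Julio Zanon Diaz,Tommy Brennan,Peter Corcoran
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Critical Review article
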
Abstract:As deep learning (DL) technologies advance, their application in automated visual inspection for Class III medical devices offers significant potential to enhance quality assurance and reduce human error. However, the adoption of such AI-based systems introduces new regulatory complexities–particularly under the EU Artificial Intelligence (AI) Act, which imposes high-risk system obligations that differ in scope and depth from established regulatory frameworks such as the Medical Device Regulation (MDR) and the U.S. FDA Quality System Regulation (QSR). This paper presents a high-level technical assessment of the foreseeable challenges that manufacturers are likely to encounter when qualifying DL-based automated inspections within the existing medical device compliance landscape. It examines divergences in risk management principles, dataset governance, model validation, explainability requirements, and post-deployment monitoring obligations. The discussion also explores potential implementation strategies and highlights areas of uncertainty, including data retention burdens, global compliance implications, and the practical difficulties of achieving statistical significance in validation with limited defect data. Disclaimer: This publication is intended solely as an academic and technical evaluation. It is not a substitute for legal advice or official regulatory interpretation. The information presented here should not be relied upon to demonstrate compliance with the EU AI Act or any other statutory obligation. Manufacturers are encouraged to consult appropriate regulatory authorities and legal experts to determine specific compliance pathways.

[AI-65] Array-Based Monte Carlo Tree Search

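【Quick read】: This paper addresses the computational efficiency bottleneck of Monte Carlo Tree Search (MCTS), which becomes more pronounced as search depth increases. The key is an array-based implementation of the Upper Confidence bounds applied to Trees (UCT) algorithm that eliminates the need for branch prediction and so runs faster on pipelined processors. This allows more simulations in the same wall-clock time and scales up to 2.8 times better with search depth in the authors' numerical simulations.

Link: https://arxiv.org/abs/2508.20140
Authors: James Ragan,Fred Y. Hadaegh,Soon-Jo Chung
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:
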
Abstract:Monte Carlo Tree Search is a popular method for solving decision making problems. Faster implementations allow for more simulations within the same wall clock time, directly improving search performance. To this end, we present an alternative array-based implementation of the classic Upper Confidence bounds applied to Trees algorithm. Our method preserves the logic of the original algorithm, but eliminates the need for branch prediction, enabling faster performance on pipelined processors, and up to a factor of 2.8 times better scaling with search depth in our numerical simulations.
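
As a rough illustration of what array-based storage for UCT can look like, the sketch below keeps node statistics in flat parallel lists and allocates each node's children as one contiguous slice. The class `ArrayTree` and its layout are assumptions made for this example. The abstract does not describe the paper's data layout, and Python cannot show the branch-prediction benefit the authors report, so this demonstrates only the indexing scheme and UCB1 selection on a one-level tree with made-up deterministic rewards.

```python
import math


class ArrayTree:
    """Search tree stored in flat parallel lists instead of linked node objects."""

    def __init__(self, branching, capacity):
        self.branching = branching
        self.visits = [0] * capacity        # visit count per node
        self.value_sum = [0.0] * capacity   # sum of rewards backed up through the node
        self.parent = [-1] * capacity       # parent index, -1 for the root
        self.first_child = [-1] * capacity  # index of child 0; children are contiguous
        self.size = 1                       # node 0 is the root

    def expand(self, node):
        """Allocate all children of `node` as one contiguous slice."""
        self.first_child[node] = self.size
        for i in range(self.branching):
            self.parent[self.size + i] = node
        self.size += self.branching

    def select_child(self, node, c=math.sqrt(2.0)):
        """Pick an unvisited child if any, otherwise the child with the highest UCB1 score."""
        start = self.first_child[node]
        kids = range(start, start + self.branching)
        for child in kids:
            if self.visits[child] == 0:
                return child
        log_n = math.log(self.visits[node])
        return max(kids, key=lambda ch: self.value_sum[ch] / self.visits[ch]
                   + c * math.sqrt(log_n / self.visits[ch]))

    def backup(self, node, reward):
        """Propagate a reward from `node` up to the root."""
        while node != -1:
            self.visits[node] += 1
            self.value_sum[node] += reward
            node = self.parent[node]


arm_rewards = [0.2, 0.8, 0.5]  # made-up deterministic payoffs for three actions
tree = ArrayTree(branching=3, capacity=4)
tree.expand(0)
for _ in range(200):
    child = tree.select_child(0)
    tree.backup(child, arm_rewards[child - tree.first_child[0]])

children = range(tree.first_child[0], tree.first_child[0] + tree.branching)
most_visited = max(children, key=lambda i: tree.visits[i])
print("visits per action:", [tree.visits[i] for i in children])
```

Because children sit next to each other in memory, selection is a scan over a slice rather than a walk over pointers, which is the property a compiled, branch-free implementation would build on.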

[AI-66] QAgent: An LLM-based Multi-Agent System for Autonomous OpenQASM programming

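【Quick read】: This paper addresses the difficulty non-experts face in realizing quantum advantage on Noisy Intermediate-Scale Quantum (NISQ) devices because of the complexity of programming in Open Quantum Assembly Language (OpenQASM). The core solution is QAgent, a Large Language Model (LLM)-based multi-agent system that combines task planning, in-context few-shot learning, Retrieval-Augmented Generation (RAG) for long-term context, predefined generation tools, and Chain-of-Thought (CoT) reasoning to improve both the compilation correctness and the functional correctness of generated code. Experiments show that QAgent improves QASM code generation accuracy by 71.6% over previous static LLM-based approaches, supporting wider and more practical access to quantum programming.

Link: https://arxiv.org/abs/2508.20134
Authors: Zhenxiao Fu,Fan Chen,Lei Jiang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
Comments:
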
Abstract:Noisy Intermediate-Scale Quantum (NISQ) devices have begun to exhibit early quantum advantages on classically intractable problems, spanning physics simulations to Gaussian boson sampling. Yet, realizing these benefits remains challenging for non-experts, primarily due to the complexities of programming in Open Quantum Assembly Language (OpenQASM). Although Large Language Model (LLM)-based agents have shown promise in automating classical programming workflows, their quantum counterparts have largely been restricted to specialized tasks such as quantum chemistry or error correction. In this paper, we present QAgent, an LLM-powered multi-agent system that fully automates OpenQASM programming. By integrating task planning, in-context few-shot learning, retrieval-augmented generation (RAG) for long-term context, predefined generation tools, and chain-of-thought (CoT) reasoning, the agents systematically improve both compilation and functional correctness. Our evaluations demonstrate substantial improvements: across multiple LLMs of varying sizes, QAgent enhances the accuracy of QASM code generation by 71.6% compared to previous static LLM-based approaches. We envision this multi-agent system as a key enabler for democratizing quantum programming, bridging expertise gaps, and accelerating the practical adoption of quantum computing.

[AI-67] ArgRAG: Explainable Retrieval Augmented Generation using Quantitative Bipolar Argumentation

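【Quick read】: This paper addresses two key problems of Retrieval-Augmented Generation (RAG) in high-stakes domains: sensitivity to noisy or contradictory evidence, which makes decisions unreliable, and opaque, stochastic reasoning that is hard to explain and verify. The core is ArgRAG, which builds a Quantitative Bipolar Argumentation Framework (QBAF) from the retrieved documents and performs deterministic reasoning under gradual semantics. Decisions become explainable and contestable, and transparency improves significantly while accuracy stays strong.

Link: https://arxiv.org/abs/2508.20131
Authors: Yuqicheng Zhu,Nico Potyka,Daniel Hernández,Yuan He,Zifeng Ding,Bo Xiong,Dongzhuoran Zhou,Evgeny Kharlamov,Steffen Staab
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
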
Abstract:Retrieval-Augmented Generation (RAG) enhances large language models by incorporating external knowledge, yet suffers from critical limitations in high-stakes domains – namely, sensitivity to noisy or contradictory evidence and opaque, stochastic decision-making. We propose ArgRAG, an explainable, and contestable alternative that replaces black-box reasoning with structured inference using a Quantitative Bipolar Argumentation Framework (QBAF). ArgRAG constructs a QBAF from retrieved documents and performs deterministic reasoning under gradual semantics. This allows faithfully explaining and contesting decisions. Evaluated on two fact verification benchmarks, PubHealth and RAGuard, ArgRAG achieves strong accuracy while significantly improving transparency.
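
Deterministic reasoning over a QBAF can be sketched as follows. Each argument has a base score in [0, 1], and its final strength is computed from the strengths of its attackers and supporters. The abstract does not say which gradual semantics ArgRAG uses, so this example uses DF-QuAD, one well-known choice, purely as an illustration. The graph, the base scores, and the argument names are made up, and the recursion assumes an acyclic graph.

```python
from math import prod


def dfquad_strength(arg, base, attackers, supporters, cache=None):
    """Strength of `arg` in an acyclic QBAF under DF-QuAD gradual semantics."""
    if cache is None:
        cache = {}
    if arg in cache:
        return cache[arg]

    def aggregate(children):
        # Probabilistic sum of child strengths; 0.0 when there are no children.
        return 1.0 - prod(
            1.0 - dfquad_strength(c, base, attackers, supporters, cache)
            for c in children
        )

    attack = aggregate(attackers.get(arg, []))
    support = aggregate(supporters.get(arg, []))
    tau = base[arg]
    if support >= attack:
        strength = tau + (1.0 - tau) * (support - attack)  # pulled toward 1
    else:
        strength = tau - tau * (attack - support)          # pulled toward 0
    cache[arg] = strength
    return strength


# Hypothetical example: one claim and three retrieved documents.
base = {"claim": 0.5, "doc1": 0.8, "doc2": 0.4, "doc3": 0.5}
supporters = {"claim": ["doc1"]}
attackers = {"claim": ["doc2"], "doc2": ["doc3"]}

claim_strength = dfquad_strength("claim", base, attackers, supporters)
print(round(claim_strength, 3))  # prints 0.8
```

Every step is a closed-form function of the graph, so the same evidence always yields the same strength, and a reader can trace exactly which document raised or lowered the claim. Here `doc3` weakens `doc2` from 0.4 to 0.2, which lets the support from `doc1` dominate.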

[AI-68] Towards Better Correctness and Efficiency in Code Generation

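【Quick read】: This paper addresses the poor runtime efficiency of code produced by Generative AI, which limits its use in performance-sensitive scenarios. The key is an efficiency-oriented reinforcement learning framework guided by a novel performance reward. Specifically: (1) dynamic exploration overcomes the static-data limits of offline fine-tuning and discovers more efficient implementations; (2) an error-insensitive reinforcement learning method and high-contrast efficiency signals mitigate systematic errors and stabilize optimization; (3) online exploration from a high-correctness baseline improves efficiency without sacrificing accuracy. The resulting two-stage tuning method improves correctness by 10.18% and runtime efficiency by 7.75% on a 7B model, reaching performance comparable to much larger models.

Link: https://arxiv.org/abs/2508.20124
Authors: Yunlong Feng,Yang Xu,Xiao Xu,Binyuan Hui,Junyang Lin
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
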
Abstract:While code large language models have demonstrated remarkable progress in code generation, the generated code often exhibits poor runtime efficiency, limiting its practical application in performance-sensitive scenarios. To address this limitation, we propose an efficiency-oriented reinforcement learning framework guided by a novel performance reward. Based on this framework, we take a deeper dive into the code efficiency problem, identifying then proposing methods to overcome key bottlenecks: (1) Dynamic exploration overcomes the static data constraints of offline fine-tuning, enabling the discovery of more efficient code implementations. (2) The error-insensitive reinforcement learning method and high-contrast efficiency signals are crucial for mitigating systematic errors and achieving effective optimization. (3) Online exploration is most effective when starting from a high-correctness baseline, as this allows for efficiency improvements without sacrificing accuracy. With these discoveries, we finally propose a two-stage tuning method, which achieves high and balanced performance across correctness and efficiency. The results of experiments show the effectiveness of the method, which improves code correctness by 10.18% and runtime efficiency by 7.75% on a 7B model, achieving performance comparable to much larger model.

[AI-69] Particle swarm optimization for online sparse streaming feature selection under uncertainty

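【Quick read】: This paper addresses the performance degradation of online streaming feature selection (OSFS) on high-dimensional streaming data when the data are incomplete. In particular, existing online sparse streaming feature selection (OS2FS) methods, which impute missing values via latent factor analysis, become inflexible and lose performance when feature-label correlations are uncertain. The key is POS2FS, an uncertainty-aware OS2FS framework with two components: (1) particle swarm optimization (PSO)-driven supervision to reduce uncertainty in feature-label relationships; (2) three-way decision theory to handle feature fuzziness in supervised learning, making feature subset selection more robust and accurate.

Link: https://arxiv.org/abs/2508.20123
Authors: Ruiyang Xu
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
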
Abstract:In real-world applications involving high-dimensional streaming data, online streaming feature selection (OSFS) is widely adopted. Yet, practical deployments frequently face data incompleteness due to sensor failures or technical constraints. While online sparse streaming feature selection (OS2FS) mitigates this issue via latent factor analysis-based imputation, existing methods struggle with uncertain feature-label correlations, leading to inflexible models and degraded performance. To address these gaps, this work proposes POS2FS-an uncertainty-aware online sparse streaming feature selection framework enhanced by particle swarm optimization (PSO). The approach introduces: 1) PSO-driven supervision to reduce uncertainty in feature-label relationships; 2) Three-way decision theory to manage feature fuzziness in supervised learning. Rigorous testing on six real-world datasets confirms POS2FS outperforms conventional OSFS and OS2FS techniques, delivering higher accuracy through more robust feature subset selection.
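
For readers unfamiliar with particle swarm optimization, the following is a minimal sketch of the generic PSO loop. It is not POS2FS: the abstract does not specify the objective or how PSO drives supervision, so the example minimizes a simple sphere function, and the inertia and acceleration coefficients (`w`, `c1`, `c2`) are common textbook-style values chosen for illustration.

```python
import random


def pso_minimize(f, dim, bounds, n_particles=20, iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Generic particle swarm optimization; returns (best_position, best_value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                # each particle's best position so far
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm-wide best so far

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_pos, best_val = pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
print("best value found:", best_val)
```

In a feature selection setting the position vector would encode something like feature weights and `f` would score a candidate against the labels, but those specifics are not given in the abstract.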

[AI-70] Is Artificial Intelligence Reshaping the Landscape of the International Academic Community of Geosciences?

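【Quick read】: This paper examines how artificial intelligence (AI) is being adopted in the geosciences and how it affects research visibility and international collaboration. The key is bibliometric analysis combined with topic modeling, which identifies the growth of AI in geoscience research and shows that scientists from developing countries have gained visibility under the AI for Science (AI4S) paradigm and that AI is improving international research collaboration.

Link: https://arxiv.org/abs/2508.20117
Authors: Liang Li,Yuntian Li,Wenxin Zhao,Shan Ye,Yun Lu
Affiliation: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 19 pages
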
Abstract:Through bibliometric analysis and topic modeling, we find that artificial intelligence (AI) is positively transforming geosciences research, with a notable increase in AI-related scientific output in recent years. We are encouraged to observe that earth scientists from developing countries have gained better visibility in the recent AI for Science (AI4S) paradigm and that AI is also improving the landscape of international collaboration in geoscience-related research.

[AI-71] Flexible metadata harvesting for ecology using large language models

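【Quick read】: This paper addresses the difficulty of integrating and discovering data in ecological research, where sources are diverse and metadata standards are inconsistent, so researchers struggle to find and link suitable ecological and environmental datasets across platforms. The key is a large language model (LLM)-based metadata harvester that flexibly extracts structured and unstructured metadata from any dataset landing page and converts it to a user-defined, unified format through an LLM post-processing protocol. It identifies links between datasets using embedding similarity and unified metadata formats, which supports ontology creation or graph-based queries and improves the discoverability and integration of ecological data in virtual research environments.

Link: https://arxiv.org/abs/2508.20115
Authors: Zehao Lu,Thijs L van der Plas,Parinaz Rashidi,W Daniel Kissling,Ioannis N Athanasiadis
Affiliation: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: To be published at the EcoDL 2025 workshop
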
Abstract:Large, open datasets can accelerate ecological research, particularly by enabling researchers to develop new insights by reusing datasets from multiple sources. However, to find the most suitable datasets to combine and integrate, researchers must navigate diverse ecological and environmental data provider platforms with varying metadata availability and standards. To overcome this obstacle, we have developed a large language model (LLM)-based metadata harvester that flexibly extracts metadata from any dataset’s landing page, and converts these to a user-defined, unified format using existing metadata standards. We validate that our tool is able to extract both structured and unstructured metadata with equal accuracy, aided by our LLM post-processing protocol. Furthermore, we utilise LLMs to identify links between datasets, both by calculating embedding similarity and by unifying the formats of extracted metadata to enable rule-based processing. Our tool, which flexibly links the metadata of different datasets, can therefore be used for ontology creation or graph-based queries, for example, to find relevant ecological and environmental datasets in a virtual research environment.
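
The embedding-similarity linking mentioned in the abstract can be illustrated with a minimal sketch. The function names, the toy embeddings, and the 0.8 threshold are all hypothetical; the abstract does not say which embedding model or similarity cut-off the authors use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def link_datasets(embeddings, threshold=0.8):
    """Return (name, name, score) for every pair at or above the threshold.

    `embeddings` maps a dataset name to its metadata embedding; in practice
    the vectors would come from an embedding model (assumed, not shown).
    """
    names = sorted(embeddings)
    links = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = cosine_similarity(embeddings[first], embeddings[second])
            if score >= threshold:
                links.append((first, second, score))
    return links
```

Pairs that pass the threshold could then serve as candidate edges for the graph-based queries the abstract describes.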

[AI-72] A Hierarchical Signal Coordination and Control System Using a Hybrid Model-based and Reinforcement Learning Approach

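[Quick read]: This paper addresses the dual challenge in urban arterial signal control of maintaining traffic progression along the arterial while adapting to demand at local intersections. Its core solution is a hierarchical traffic signal coordination and control architecture that combines model-based optimization with reinforcement learning (RL). At the upper level, a High-Level Coordinator (HLC) dynamically selects a coordination strategy, either Max-Flow Coordination (MFC) or Green-Wave Coordination (GWC), based on observed and predicted demand. At the lower level, Hybrid Signal Agents (HSAs) choose signal phases using PPO-based reinforcement learning with action masking to guarantee feasibility. This hierarchical design enables adaptive strategy switching and delivers robust performance across demand levels.

Link: https://arxiv.org/abs/2508.20102
Authors: Xianyue Peng, Shenyang Chen, H. Michael Zhang
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: 28 pages, 7 figures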

Abstract:Signal control in urban corridors faces the dual challenge of maintaining arterial traffic progression while adapting to demand variations at local intersections. We propose a hierarchical traffic signal coordination and control scheme that integrates model-based optimization with reinforcement learning. The system consists of: (i) a High-Level Coordinator (HLC) that selects coordination strategies based on observed and predicted demand; (ii) a Corridor Coordinator that derives phase constraints from the selected strategy, either Max-Flow Coordination (MFC) or Green-Wave Coordination (GWC); and (iii) Hybrid Signal Agents (HSAs) that determine signal phases via reinforcement learning with action masking to enforce feasibility. Hierarchical reinforcement learning with Proximal Policy Optimization (PPO) is used to train HSA and HLC policies. At the lower level, three HSA policies, namely MFC-aware, GWC-aware, and pure agent control (PAC), are trained in conjunction with their respective coordination strategies. At the higher level, the HLC is trained to dynamically switch strategies using a multi-objective reward balancing corridor-level and network-wide performance. The proposed scheme was developed and evaluated on a SUMO-RLlib platform. Case results show that hybrid MFC maximizes throughput under heavy demand; hybrid GWC consistently minimizes arterial stops and maintains progression across diverse traffic conditions but can reduce network-wide efficiency; and PAC improves network-wide travel time in moderate demand but is less effective under heavy demand. The hierarchical design enables adaptive strategy selection, achieving robust performance across all demand levels.
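
Action masking, which the abstract uses to enforce feasibility, can be sketched generically as follows. This is an illustration of the general technique under assumed names (`masked_softmax`, `sample_phase`), not the authors' SUMO-RLlib implementation. The mask stands in for the phase constraints that the Corridor Coordinator would supply.

```python
import math
import random


def masked_softmax(logits, mask):
    """Softmax over `logits` in which entries with mask[i] == False get probability 0."""
    if not any(mask):
        raise ValueError("at least one action must be feasible")
    peak = max(l for l, m in zip(logits, mask) if m)  # subtract max for stability
    exps = [math.exp(l - peak) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def sample_phase(logits, mask, rng):
    """Sample a signal phase index from the feasible phases only."""
    probs = masked_softmax(logits, mask)
    feasible = [i for i, m in enumerate(mask) if m]
    weights = [probs[i] for i in feasible]
    return rng.choices(feasible, weights=weights, k=1)[0]
```

Because infeasible phases receive zero probability before sampling, the policy never has to learn to avoid them through penalties.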

[AI-73] Quantum Verifiable Rewards for Post-Training Qiskit Code Assistant

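[Quick read]: This paper addresses the accuracy and executability of quantum computing code, specifically Qiskit code, written by large language models (LLMs), and in particular its reliability when run on real quantum hardware. The core challenge is to improve the quality of LLM-generated quantum programs so that they are not only syntactically correct but also execute successfully on actual quantum devices. The key to the solution is a "quantum verification" mechanism: a synthetic data pipeline generates pairs of quantum problems and unit tests, which are used to build preference data for alignment with Direct Preference Optimization (DPO). The models are additionally trained with GRPO, a reward-based reinforcement learning method, using verifiable reward signals from quantum hardware. The best model surpasses the strongest open-source baselines on the Qiskit-HumanEval-hard benchmark.

Link: https://arxiv.org/abs/2508.20907
Authors: Nicolas Dupuis, Adarsh Tiwari, Youssef Mroueh, David Kremer, Ismael Faro, Juan Cruz-Benito
Affiliations: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: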

Abstract:Qiskit is an open-source quantum computing framework that allows users to design, simulate, and run quantum circuits on real quantum hardware. We explore post-training techniques for LLMs to assist in writing Qiskit code. We introduce quantum verification as an effective method for ensuring code quality and executability on quantum hardware. To support this, we developed a synthetic data pipeline that generates quantum problem-unit test pairs and used it to create preference data for aligning LLMs with DPO. Additionally, we trained models using GRPO, leveraging quantum-verifiable rewards provided by the quantum hardware. Our best-performing model, combining DPO and GRPO, surpasses the strongest open-source baselines on the challenging Qiskit-HumanEval-hard benchmark.

[AI-74] Photonic restricted Boltzmann machine for content generation tasks

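[Quick read]: This paper addresses the bottleneck that the high computational cost of Gibbs sampling imposes on electronic implementations of the restricted Boltzmann machine (RBM) in content generation tasks. The key to the solution is a photonic restricted Boltzmann machine (PRBM) that uses photonic computing to accelerate Gibbs sampling. By introducing an efficient encoding method, the PRBM avoids computationally intensive matrix decomposition and reduces the computational complexity of Gibbs sampling from O(N) to O(1). In addition, its non-von Neumann photonic architecture does not need to store the interaction matrix in memory, which improves the scalability and efficiency of large-scale RBMs.

Link: https://arxiv.org/abs/2508.20472
Authors: Li Luo, Yisheng Fang, Wanyi Zhang, Zhichao Ruan
Affiliations: Unknown
Subjects: Optics (physics.optics); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
Comments: 9 pages, 5 figures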

Abstract:The restricted Boltzmann machine (RBM) is a neural network based on the Ising model, well known for its ability to learn probability distributions and stochastically generate new content. However, the high computational cost of Gibbs sampling in content generation tasks imposes significant bottlenecks on electronic implementations. Here, we propose a photonic restricted Boltzmann machine (PRBM) that leverages photonic computing to accelerate Gibbs sampling, enabling efficient content generation. By introducing an efficient encoding method, the PRBM eliminates the need for computationally intensive matrix decomposition and reduces the computational complexity of Gibbs sampling from O(N) to O(1). Moreover, its non-Von Neumann photonic computing architecture circumvents the memory storage of interaction matrices, providing substantial advantages for large-scale RBMs. We experimentally validate the photonic-accelerated Gibbs sampling by simulating a two-dimensional Ising model, where the observed phase transition temperature closely matches the theoretical predictions. Beyond physics-inspired tasks, the PRBM demonstrates robust capabilities in generating and restoring diverse content, including images and temporal sequences, even in the presence of noise and aberrations. The scalability and reduced training cost of the PRBM framework underscore its potential as a promising pathway for advancing photonic computing in generative artificial intelligence.
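
For context, the conventional electronic Gibbs sampling step that the PRBM is meant to accelerate can be sketched as below. This is the textbook block Gibbs update for a binary RBM, not the photonic method; all function and parameter names are illustrative assumptions.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_hidden(visible, weights, hidden_bias, rng):
    """Sample each h_j ~ Bernoulli(sigmoid(c_j + sum_i v_i * W[i][j]))."""
    hidden = []
    for j, bias in enumerate(hidden_bias):
        activation = bias + sum(v * weights[i][j] for i, v in enumerate(visible))
        hidden.append(1 if rng.random() < sigmoid(activation) else 0)
    return hidden


def sample_visible(hidden, weights, visible_bias, rng):
    """Sample each v_i ~ Bernoulli(sigmoid(b_i + sum_j W[i][j] * h_j))."""
    visible = []
    for i, bias in enumerate(visible_bias):
        activation = bias + sum(weights[i][j] * h for j, h in enumerate(hidden))
        visible.append(1 if rng.random() < sigmoid(activation) else 0)
    return visible


def gibbs_step(visible, weights, visible_bias, hidden_bias, rng):
    """One block Gibbs sweep: visible -> hidden -> visible."""
    hidden = sample_hidden(visible, weights, hidden_bias, rng)
    new_visible = sample_visible(hidden, weights, visible_bias, rng)
    return new_visible, hidden
```

Each sweep needs two matrix-vector products against the stored weight matrix, which is the cost and memory traffic the abstract says the photonic architecture avoids.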

[AI-75] Differentially Private Federated Quantum Learning via Quantum Noise

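[Quick read]: This paper addresses privacy leakage in quantum federated learning (QFL) on noisy intermediate-scale quantum (NISQ) devices, where shared quantum machine learning (QML) model updates can be exploited by adversaries to undermine information privacy. The key to the solution is a novel differential privacy (DP) mechanism that deliberately exploits the quantum noise inherent in NISQ devices, with the noise variance tuned through the number of measurement shots and the strength of the depolarizing channel, to provide an adjustable level of privacy protection without significantly harming training accuracy. By adjusting the noise parameters, the method balances privacy guarantees against model performance and shows robustness under adversarial attack, offering an efficient and practical route to secure QFL on NISQ devices.

Link: https://arxiv.org/abs/2508.20310
Authors: Atit Pokharel, Ratun Rahman, Shaba Shaon, Thomas Morris, Dinh C. Nguyen
Affiliations: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted at 2025 IEEE International Conference on Quantum Computing and Engineering (QCE)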

Abstract:Quantum federated learning (QFL) enables collaborative training of quantum machine learning (QML) models across distributed quantum devices without raw data exchange. However, QFL remains vulnerable to adversarial attacks, where shared QML model updates can be exploited to undermine information privacy. In the context of noisy intermediate-scale quantum (NISQ) devices, a key question arises: How can inherent quantum noise be leveraged to enforce differential privacy (DP) and protect model information during training and communication? This paper explores a novel DP mechanism that harnesses quantum noise to safeguard quantum models throughout the QFL process. By tuning noise variance through measurement shots and depolarizing channel strength, our approach achieves desired DP levels tailored to NISQ constraints. Simulations demonstrate the framework’s effectiveness by examining the relationship between differential privacy budget and noise parameters, as well as the trade-off between security and training accuracy. Additionally, we demonstrate the framework’s robustness against an adversarial attack designed to compromise model performance using adversarial examples, with evaluations based on critical metrics such as accuracy on adversarial examples, confidence scores for correct predictions, and attack success rates. The results reveal a tunable trade-off between privacy and robustness, providing an efficient solution for secure QFL on NISQ devices with significant potential for reliable quantum computing applications.

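[AI-76] The Mathematician's Assistant: Integrating AI into Research Practice

[Quick read]: This paper addresses how generative AI can be integrated effectively and responsibly into mathematical research practice, focusing on the tension between the abilities that current large language models (LLMs) show in solving problems and evaluating proofs and their systematic flaws. The key to the solution is a framework centered on the "augmented mathematician", which stresses the human researcher's critical guidance of the AI and distills five guiding principles so that the AI acts as a copilot rather than a replacement throughout the research process. The framework systematically describes seven fundamental ways AI can be applied across the research lifecycle, from idea generation to writing, and identifies strategic prompting, critical verification, and methodological rigor as the core of a new skill set, so that AI augments mathematical research rather than automating it.

Link: https://arxiv.org/abs/2508.20236
Authors: Jonas Henkel
Affiliations: Unknown
Subjects: History and Overview (math.HO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 24 pages, 7 figures. Accepted for publication in Mathematische Semesterberichte (to appear in vol. 72, no. 2)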

Abstract:The rapid development of artificial intelligence (AI), marked by breakthroughs like ‘AlphaEvolve’ and ‘Gemini Deep Think’, is beginning to offer powerful new tools that have the potential to significantly alter the research practice in many areas of mathematics. This paper explores the current landscape of publicly accessible large language models (LLMs) in a mathematical research context, based on developments up to August 2, 2025. Our analysis of recent benchmarks, such as MathArena and the Open Proof Corpus (Balunović et al., 2025; Dekoninck et al., 2025), reveals a complex duality: while state-of-the-art models demonstrate strong abilities in solving problems and evaluating proofs, they also exhibit systematic flaws, including a lack of self-critique and a model-dependent discrepancy between final-answer accuracy and full-proof validity. Based on these findings, we propose a durable framework for integrating AI into the research workflow, centered on the principle of the augmented mathematician. In this model, the AI functions as a copilot under the critical guidance of the human researcher, an approach distilled into five guiding principles for effective and responsible use. We then systematically explore seven fundamental ways AI can be applied across the research lifecycle, from creativity and ideation to the final writing process, demonstrating how these principles translate into concrete practice. We conclude that the primary role of AI is currently augmentation rather than automation. This requires a new skill set focused on strategic prompting, critical verification, and methodological rigor in order to effectively use these powerful tools.
MSC classes: 00A35 (Primary), 68T07 (Secondary); ACM classes: I.2.7; H.5.2

[AI-77] Data-Efficient Point Cloud Semantic Segmentation Pipeline for Unimproved Roads

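[Quick read]: This paper addresses robust 3D semantic segmentation of unimproved roads and seven other classes when labeled data is scarce. The core challenge is to improve generalization and segmentation performance when only a small amount of labeled in-domain point cloud data is available. The key to the solution is a two-stage training framework: first, a projection-based convolutional neural network is pre-trained on a mixture of public urban datasets and a small, curated in-domain dataset to learn general feature representations; then, a lightweight prediction head is fine-tuned on in-domain data only, which makes use of the prior knowledge and reduces the risk of overfitting. The study also explores applying Point Prompt Training to batch normalization layers, using Manifold Mixup as a regularizer, and incorporating histogram-normalized ambients. With only 50 labeled point clouds, the method raises mean Intersection-over-Union from 33.5% to 51.8% and overall accuracy from 85.5% to 90.8%, indicating that pre-training across datasets is key to robustness in low-data settings.

Link: https://arxiv.org/abs/2508.20135
Authors: Andrew Yarovoi, Christopher R. Valenta
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures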

Abstract:In this case study, we present a data-efficient point cloud segmentation pipeline and training framework for robust segmentation of unimproved roads and seven other classes. Our method employs a two-stage training framework: first, a projection-based convolutional neural network is pre-trained on a mixture of public urban datasets and a small, curated in-domain dataset; then, a lightweight prediction head is fine-tuned exclusively on in-domain data. Along the way, we explore the application of Point Prompt Training to batch normalization layers and the effects of Manifold Mixup as a regularizer within our pipeline. We also explore the effects of incorporating histogram-normalized ambients to further boost performance. Using only 50 labeled point clouds from our target domain, we show that our proposed training approach improves mean Intersection-over-Union from 33.5% to 51.8% and the overall accuracy from 85.5% to 90.8%, when compared to naive training on the in-domain data. Crucially, our results demonstrate that pre-training across multiple datasets is key to improving generalization and enabling robust segmentation under limited in-domain supervision. Overall, this study demonstrates a practical framework for robust 3D semantic segmentation in challenging, low-data scenarios. Our code is available at: this https URL.
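
The two reported metrics, mean Intersection-over-Union and overall accuracy, can be computed as in the following minimal sketch. The function names are hypothetical, and the convention of skipping classes absent from both labels and predictions is an assumption; the paper may handle such classes differently.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Average per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    ious = []
    for cls in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


def overall_accuracy(y_true, y_pred):
    """Fraction of points whose predicted class matches the label."""
    if not y_true:
        return 0.0
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```

Mean IoU weights every class equally, which is why it can sit far below overall accuracy (33.5% versus 85.5% in the baseline) when rare classes are segmented poorly.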

[AI-78] Artificial Intelligence for CRISPR Guide RNA Design: Explainable Models and Off-Target Safety

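[Quick read]: This paper addresses the problem of optimizing guide RNA (gRNA) design for efficiency and safety in CRISPR genome editing, in particular how to predict on-target activity more accurately and reduce off-target risk. The key to the solution is to apply artificial intelligence (AI), especially deep learning models, to predict gRNA on-target activity and identify potential off-target effects, and to use explainable AI (XAI) techniques to reveal the sequence features and genomic contexts behind model decisions. This improves understanding of what drives Cas enzyme performance and supports more efficient, specific, and clinically viable CRISPR applications.

Link: https://arxiv.org/abs/2508.20130
Authors: Alireza Abbaszadeh, Armita Shahlai
Affiliations: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 29 pages, 5 figures, 2 tables, 42 cited references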

Abstract:CRISPR-based genome editing has revolutionized biotechnology, yet optimizing guide RNA (gRNA) design for efficiency and safety remains a critical challenge. Recent advances (2020–2025, updated to reflect current year if needed) demonstrate that artificial intelligence (AI), especially deep learning, can markedly improve the prediction of gRNA on-target activity and identify off-target risks. In parallel, emerging explainable AI (XAI) techniques are beginning to illuminate the black-box nature of these models, offering insights into sequence features and genomic contexts that drive Cas enzyme performance. Here we review how state-of-the-art machine learning models are enhancing gRNA design for CRISPR systems, highlight strategies for interpreting model predictions, and discuss new developments in off-target prediction and safety assessment. We emphasize breakthroughs from top-tier journals that underscore an interdisciplinary convergence of AI and genome editing to enable more efficient, specific, and clinically viable CRISPR applications.

[AI-79] Deep Reinforcement Learning for Optimal Asset Allocation Using DDPG with TiDE

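[Quick read]: This paper addresses optimal allocation between risky and risk-free assets. Conventional methods rely on strict distributional assumptions or non-additive reward ratios, which makes them hard to adapt to diverse investment goals. The key to the solution is to formulate asset allocation as a sequential decision-making task within a Markov Decision Process (MDP) and apply the Deep Deterministic Policy Gradient (DDPG) reinforcement learning framework, integrated with the Time-series Dense Encoder (TiDE) to strengthen feature extraction from financial time series. Without prior distributional assumptions, the approach generates dynamic allocation policies from simulated market scenarios and achieves higher risk-adjusted returns than a buy-and-hold strategy.

Link: https://arxiv.org/abs/2508.20103
Authors: Rongwei Liu, Jin Zheng, John Cartlidge
Affiliations: Unknown
Subjects: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Risk Management (q-fin.RM)
Comments: 10 pages, 3 figures, authors accepted manuscript, to appear in 24th International Conference on Modelling and Applied Simulation (MAS), Sep. 2025, Fes, Morocco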

Abstract:The optimal asset allocation between risky and risk-free assets is a persistent challenge due to the inherent volatility in financial markets. Conventional methods rely on strict distributional assumptions or non-additive reward ratios, which limit their robustness and applicability to investment goals. To overcome these constraints, this study formulates the optimal two-asset allocation problem as a sequential decision-making task within a Markov Decision Process (MDP). This framework enables the application of reinforcement learning (RL) mechanisms to develop dynamic policies based on simulated financial scenarios, regardless of prerequisites. We use the Kelly criterion to balance immediate reward signals against long-term investment objectives, and we take the novel step of integrating the Time-series Dense Encoder (TiDE) into the Deep Deterministic Policy Gradient (DDPG) RL framework for continuous decision-making. We compare DDPG-TiDE with a simple discrete-action Q-learning RL framework and a passive buy-and-hold investment strategy. Empirical results show that DDPG-TiDE outperforms Q-learning and generates higher risk adjusted returns than buy-and-hold. These findings suggest that tackling the optimal asset allocation problem by integrating TiDE within a DDPG reinforcement learning framework is a fruitful avenue for further exploration.
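
As background on the Kelly criterion mentioned in the abstract, the classical two-asset case has a closed form. The sketch below assumes the risky asset follows geometric Brownian motion with drift `mu` and volatility `sigma`, and that the risk-free rate is `r`. These assumptions and names are illustrative; the abstract does not specify how the authors fold the Kelly criterion into their reward.

```python
def kelly_fraction(mu, r, sigma):
    """Growth-optimal share of wealth in the risky asset: (mu - r) / sigma^2."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (mu - r) / (sigma * sigma)


def log_growth_rate(fraction, mu, r, sigma):
    """Expected log-wealth growth when `fraction` of wealth is held in the risky asset."""
    return r + fraction * (mu - r) - 0.5 * fraction * fraction * sigma * sigma
```

The growth rate is a concave quadratic in the allocation, so fractions above or below the Kelly value both grow wealth more slowly in expectation. That gives a learned policy a natural analytic reference point.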

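[AI-80] Can LLMs Identify Tax Abuse?

[Quick read]: This paper examines the potential of large language models (LLMs) in a complex real-world regulatory domain: identifying and analyzing U.S. tax-minimization strategies. Specifically, it asks whether LLMs can accurately interpret and verify tax strategies, fill in gaps in partially specified strategies, and generate complete, end-to-end strategies from scratch. The key to its approach is to have LLMs understand and reason over a vast and regularly updated body of tax law, including statutes, case law, and administrative guidance, in order to identify and respond to tax avoidance by well-advised, wealthy taxpayers. Notably, the approach identified an entirely novel tax strategy, highlighting the reasoning ability of LLMs and their potential to transform tax enforcement.

Link: https://arxiv.org/abs/2508.20097
Authors: Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme
Affiliations: Unknown
Subjects: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI)
Comments: 9 pages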

Abstract:We investigate whether large language models can discover and analyze U.S. tax-minimization strategies. This real-world domain challenges even seasoned human experts, and progress can reduce tax revenue lost from well-advised, wealthy taxpayers. We evaluate the most advanced LLMs on their ability to (1) interpret and verify tax strategies, (2) fill in gaps in partially specified strategies, and (3) generate complete, end-to-end strategies from scratch. This domain should be of particular interest to the LLM reasoning community: unlike synthetic challenge problems or scientific reasoning tasks, U.S. tax law involves navigating hundreds of thousands of pages of statutes, case law, and administrative guidance, all updated regularly. Notably, LLM-based reasoning identified an entirely novel tax strategy, highlighting these models’ potential to revolutionize tax agencies’ fight against tax abuse.

Machine Learning

[LG-0] Fast Convergence Rates for Subsampled Natural Gradient Algorithms on Quadratic Model Problems

Link: https://arxiv.org/abs/2508.21022
Authors: Gil Goldshlager, Jiang Hu, Lin Lin
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 21 pages, 4 figures

Abstract:Subsampled natural gradient descent (SNGD) has shown impressive results for parametric optimization tasks in scientific machine learning, such as neural network wavefunctions and physics-informed neural networks, but it has lacked a theoretical explanation. We address this gap by analyzing the convergence of SNGD and its accelerated variant, SPRING, for idealized parametric optimization problems where the model is linear and the loss function is strongly convex and quadratic. In the special case of a least-squares loss, namely the standard linear least-squares problem, we prove that SNGD is equivalent to a regularized Kaczmarz method while SPRING is equivalent to an accelerated regularized Kaczmarz method. As a result, by leveraging existing analyses we obtain under mild conditions (i) the first fast convergence rate for SNGD, (ii) the first convergence guarantee for SPRING in any setting, and (iii) the first proof that SPRING can accelerate SNGD. In the case of a general strongly convex quadratic loss, we extend the analysis of the regularized Kaczmarz method to obtain a fast convergence rate for SNGD under stronger conditions, providing the first explanation for the effectiveness of SNGD outside of the least-squares setting. Overall, our results illustrate how tools from randomized linear algebra can shed new light on the interplay between subsampling and curvature-aware optimization strategies.
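
To make the Kaczmarz connection concrete, here is a minimal sketch of the classical randomized Kaczmarz iteration for a consistent linear system Ax = b. It is the plain, unregularized method with rows sampled in proportion to their squared norms, not the regularized or accelerated variants the paper analyzes. The function name and arguments are illustrative.

```python
import random


def randomized_kaczmarz(A, b, iterations, rng):
    """Approximately solve A x = b by projecting onto one sampled row per step."""
    n = len(A[0])
    x = [0.0] * n
    row_norms = [sum(v * v for v in row) for row in A]
    rows = [i for i, w in enumerate(row_norms) if w > 0.0]  # skip all-zero rows
    weights = [row_norms[i] for i in rows]
    for _ in range(iterations):
        i = rng.choices(rows, weights=weights, k=1)[0]
        residual = b[i] - sum(a * xj for a, xj in zip(A[i], x))
        step = residual / row_norms[i]
        # Orthogonal projection of x onto the hyperplane {z : A[i] . z = b[i]}.
        x = [xj + step * a for xj, a in zip(x, A[i])]
    return x
```

Each step touches a single row, which mirrors how subsampling in SNGD uses only part of the data per update.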

[LG-1] InSQuAD: In-Context Learning for Efficient Retrieval via Submodular Mutual Information to Enforce Quality and Diversity ICDM2025

Link: https://arxiv.org/abs/2508.21003
Authors: Souradeep Nanda, Anay Majee, Rishabh Iyer
Subjects: Machine Learning (cs.LG)
Comments: Long Version of paper Accepted to ICDM 2025

Abstract:In this paper, we introduce InSQuAD, designed to enhance the performance of In-Context Learning (ICL) models through Submodular Mutual Information (SMI) enforcing Quality and Diversity among in-context exemplars. InSQuAD achieves this through two principal strategies: First, we model the ICL task as a targeted selection problem and introduce a unified selection strategy based on SMIs which mines relevant yet diverse in-context examples encapsulating the notions of quality and diversity. Secondly, we address a common pitfall in existing retrieval models which model query relevance, often overlooking diversity, critical for ICL. InSQuAD introduces a combinatorial training paradigm which learns the parameters of an SMI function to enforce both quality and diversity in the retrieval model through a novel likelihood-based loss. To further aid the learning process we augment an existing multi-hop question answering dataset with synthetically generated paraphrases. Adopting the retrieval model trained using this strategy alongside the novel targeted selection formulation for ICL on nine benchmark datasets shows significant improvements validating the efficacy of our approach.

[LG-2] Graph-Based Feature Augmentation for Predictive Tasks on Relational Datasets

Link: https://arxiv.org/abs/2508.20986
Authors: Lianpeng Qiao, Ziqi Cao, Kaiyu Feng, Ye Yuan, Guoren Wang
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Comments:

Abstract:Data has become a foundational asset driving innovation across domains such as finance, healthcare, and e-commerce. In these areas, predictive modeling over relational tables is commonly employed, with increasing emphasis on reducing manual effort through automated machine learning (AutoML) techniques. This raises an interesting question: can feature augmentation itself be automated and identify and utilize task-related relational signals? To address this challenge, we propose an end-to-end automated feature augmentation framework, ReCoGNN, which enhances initial datasets using features extracted from multiple relational tables to support predictive tasks. ReCoGNN first captures semantic dependencies within each table by modeling intra-table attribute relationships, enabling it to partition tables into structured, semantically coherent segments. It then constructs a heterogeneous weighted graph that represents inter-row relationships across all segments. Finally, ReCoGNN leverages message-passing graph neural networks to propagate information through the graph, guiding feature selection and augmenting the original dataset. Extensive experiments conducted on ten real-life and synthetic datasets demonstrate that ReCoGNN consistently outperforms existing methods on both classification and regression tasks.

[LG-3] Efficient Large-Scale Cross-Domain Sequential Recommendation with Dynamic State Representations

Link: https://arxiv.org/abs/2508.20945
Authors: Manuel V. Loureiro, Steven Derby, Aleksei Medvedev, Alejandro Ariza-Casabona, Gonzalo Fiz Pontiveros, Tri Kurniawan Wijaya
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 4 pages

Abstract:Recently, autoregressive recommendation models (ARMs), such as Meta’s HSTU model, have emerged as a major breakthrough over traditional Deep Learning Recommendation Models (DLRMs), exhibiting the highly sought-after scaling law behaviour. However, when applied to multi-domain scenarios, the transformer architecture’s attention maps become a computational bottleneck, as they attend to all items across every domain. To tackle this challenge, systems must efficiently balance inter and intra-domain knowledge transfer. In this work, we introduce a novel approach for scalable multi-domain recommendation systems by replacing full inter-domain attention with two innovative mechanisms: 1) Transition-Aware Positional Embeddings (TAPE): We propose novel positional embeddings that account for domain-transition specific information. This allows attention to be focused solely on intra-domain items, effectively reducing the unnecessary computational cost associated with attending to irrelevant domains. 2) Dynamic Domain State Representation (DDSR): We introduce a dynamic state representation for each domain, which is stored and accessed during subsequent token predictions. This enables the efficient transfer of relevant domain information without relying on full attention maps. Our method offers a scalable solution to the challenges posed by large-scale, multi-domain recommendation systems and demonstrates significant improvements in retrieval tasks by separately modelling and combining inter- and intra-domain representations.

[LG-4] Finite-Time Guarantees for Multi-Agent Combinatorial Bandits with Nonstationary Rewards

Link: https://arxiv.org/abs/2508.20923
Authors: Katherine B. Adams, Justin J. Boutilier, Qinyang He, Yonatan Mintz
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 41 pages, 8 figures

Abstract:We study a sequential resource allocation problem where a decision maker selects subsets of agents at each period to maximize overall outcomes without prior knowledge of individual-level effects. Our framework applies to settings such as community health interventions, targeted digital advertising, and workforce retention programs, where intervention effects evolve dynamically. Agents may exhibit habituation (diminished response from frequent selection) or recovery (enhanced response from infrequent selection). The technical challenge centers on nonstationary reward distributions that lead to changing intervention effects over time. The problem requires balancing two key competing objectives: heterogeneous individual rewards and the exploration-exploitation tradeoff in terms of learning for improved future decisions as opposed to maximizing immediate outcomes. Our contribution introduces the first framework incorporating this form of nonstationary rewards in the combinatorial multi-armed bandit literature. We develop algorithms with theoretical guarantees on dynamic regret and demonstrate practical efficacy through a diabetes intervention case study. Our personalized community intervention algorithm achieved up to three times as much improvement in program enrollment compared to baseline approaches, validating the framework’s potential for real-world applications. This work bridges theoretical advances in adaptive learning with practical challenges in population-level behavioral change interventions.

[LG-5] Learning Robust Spatial Representations from Binaural Audio through Feature Distillation

Link: https://arxiv.org/abs/2508.20914
Authors: Holger Severin Bovbjerg (1), Jan Østergaard (1), Jesper Jensen (1, 2), Shinji Watanabe (3), Zheng-Hua Tan ((1) Aalborg University, (2) Eriksholm Research Centre, (3) Carnegie Mellon University)
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: To appear in Proc. WASPAA 2025, October 12-15, 2025, Tahoe, US. Copyright © 2025 IEEE. 5 pages, 2 figures, 2 tables

Abstract:Recently, deep representation learning has shown strong performance in multiple audio tasks. However, its use for learning spatial representations from multichannel audio is underexplored. We investigate the use of a pretraining stage based on feature distillation to learn a robust spatial representation of binaural speech without the need for data labels. In this framework, spatial features are computed from clean binaural speech samples to form prediction labels. These clean features are then predicted from corresponding augmented speech using a neural network. After pretraining, we throw away the spatial feature predictor and use the learned encoder weights to initialize a DoA estimation model which we fine-tune for DoA estimation. Our experiments demonstrate that the pretrained models show improved performance in noisy and reverberant environments after fine-tuning for direction-of-arrival estimation, when compared to fully supervised models and classic signal processing methods.

[LG-6] Turning Tabular Foundation Models into Graph Foundation Models

Link: https://arxiv.org/abs/2508.20906
Authors: Dmitry Eremeev, Gleb Bazhenov, Oleg Platonov, Artem Babenko, Liudmila Prokhorenkova
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:While foundation models have revolutionized such fields as natural language processing and computer vision, their application and potential within graph machine learning remain largely unexplored. One of the key challenges in designing graph foundation models (GFMs) is handling diverse node features that can vary across different graph datasets. Although many works on GFMs have been focused exclusively on text-attributed graphs, the problem of handling arbitrary features of other types in GFMs has not been fully addressed. However, this problem is not unique to the graph domain, as it also arises in the field of machine learning for tabular data. In this work, motivated by the recent success of tabular foundation models like TabPFNv2, we propose G2T-FM, a simple graph foundation model that employs TabPFNv2 as a backbone. Specifically, G2T-FM augments the original node features with neighborhood feature aggregation, adds structural embeddings, and then applies TabPFNv2 to the constructed node representations. Even in a fully in-context regime, our model achieves strong results, significantly outperforming publicly available GFMs and performing on par with well-tuned GNNs trained from scratch. Moreover, after finetuning, G2T-FM surpasses well-tuned GNN baselines, highlighting the potential of the proposed approach. More broadly, our paper reveals a previously overlooked direction of utilizing tabular foundation models for graph machine learning tasks.
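
The neighborhood feature aggregation step described in the abstract can be illustrated with a minimal sketch that appends the mean of each node's neighbor features to its own features. Mean aggregation over an undirected edge list and the function name are assumptions for illustration. The structural embeddings and the TabPFNv2 backbone are omitted, and the authors' exact aggregation may differ.

```python
def augment_with_neighbor_mean(features, edges):
    """Concatenate each node's features with the mean of its neighbors' features.

    `features` is a list of equal-length feature lists, one per node;
    `edges` is a list of undirected (u, v) node-index pairs.
    Isolated nodes receive a zero vector as their aggregate.
    """
    num_nodes = len(features)
    dim = len(features[0]) if features else 0
    neighbors = [[] for _ in range(num_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    augmented = []
    for node in range(num_nodes):
        if neighbors[node]:
            count = len(neighbors[node])
            mean = [sum(features[m][d] for m in neighbors[node]) / count for d in range(dim)]
        else:
            mean = [0.0] * dim
        augmented.append(list(features[node]) + mean)
    return augmented
```

The result is an ordinary table with one row per node, which is the kind of input a tabular model can consume directly.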

[LG-7] CoCoL: A Communication Efficient Decentralized Collaborative Method for Multi-Robot Systems IROS2025

Link: https://arxiv.org/abs/2508.20898
Authors: Jiaxi Huang, Yan Huang, Yixian Zhao, Wenchao Meng, Jinming Xu
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted by IROS2025

Abstract:Collaborative learning enhances the performance and adaptability of multi-robot systems in complex tasks but faces significant challenges due to high communication overhead and data heterogeneity inherent in multi-robot tasks. To this end, we propose CoCoL, a Communication efficient decentralized Collaborative Learning method tailored for multi-robot systems with heterogeneous local datasets. Leveraging a mirror descent framework, CoCoL achieves remarkable communication efficiency with approximate Newton-type updates by capturing the similarity between objective functions of robots, and reduces computational costs through inexact sub-problem solutions. Furthermore, the integration of a gradient tracking scheme ensures its robustness against data heterogeneity. Experimental results on three representative multi-robot collaborative learning tasks show the superiority of the proposed CoCoL in significantly reducing both the number of communication rounds and total bandwidth consumption while maintaining state-of-the-art accuracy. These benefits are particularly evident in challenging scenarios involving non-IID (non-independent and identically distributed) data distribution, streaming data, and time-varying network topologies.
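To make the gradient tracking ingredient concrete, here is a minimal sketch of the textbook decentralized gradient tracking iteration on a toy problem. It is not CoCoL itself: the mirror descent framework and approximate Newton-type updates are omitted, and the mixing matrix `W`, step size `alpha`, and quadratic local objectives are assumptions chosen for illustration.

```python
import numpy as np

# Each robot i holds f_i(x) = 0.5 * (x - b_i)^2 with a different target b_i,
# a simple stand-in for heterogeneous local datasets.
b = np.array([1.0, 4.0, 7.0])

# Doubly stochastic mixing matrix for a fully connected 3-node network (assumed).
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
alpha = 0.1

def local_grads(x):
    # Entry i is the gradient of f_i evaluated at robot i's current iterate.
    return x - b

x = np.zeros(3)        # one scalar iterate per robot
y = local_grads(x)     # tracker initialised with the local gradients
for _ in range(300):
    x_next = W @ x - alpha * y
    # The tracker follows the network-average gradient despite heterogeneity.
    y = W @ y + local_grads(x_next) - local_grads(x)
    x = x_next
```

All three iterates approach the minimizer of the average objective, `b.mean()`, even though every local objective has a different minimizer.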

[LG-8] LeMat-Traj: A Scalable and Unified Dataset of Materials Trajectories for Atomistic Modeling

Link: https://arxiv.org/abs/2508.20875
Authors: Ali Ramlaoui, Martin Siron, Inel Djafar, Joseph Musielewicz, Amandine Rossello, Victor Schmidt, Alexandre Duval
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:The development of accurate machine learning interatomic potentials (MLIPs) is limited by the fragmented availability and inconsistent formatting of quantum mechanical trajectory datasets derived from Density Functional Theory (DFT). These datasets are expensive to generate yet difficult to combine due to variations in format, metadata, and accessibility. To address this, we introduce LeMat-Traj, a curated dataset comprising over 120 million atomic configurations aggregated from large-scale repositories, including the Materials Project, Alexandria, and OQMD. LeMat-Traj standardizes data representation, harmonizes results and filters for high-quality configurations across widely used DFT functionals (PBE, PBESol, SCAN, r2SCAN). It significantly lowers the barrier for training transferrable and accurate MLIPs. LeMat-Traj spans both relaxed low-energy states and high-energy, high-force structures, complementing molecular dynamics and active learning datasets. By fine-tuning models pre-trained on high-force data with LeMat-Traj, we achieve a significant reduction in force prediction errors on relaxation tasks. We also present LeMaterial-Fetcher, a modular and extensible open-source library developed for this work, designed to provide a reproducible framework for the community to easily incorporate new data sources and ensure the continued evolution of large-scale materials datasets. LeMat-Traj and LeMaterial-Fetcher are publicly available at this https URL and this https URL.

[LG-9] Practical Physical Layer Authentication for Mobile Scenarios Using a Synthetic Dataset Enhanced Deep Learning Approach

Link: https://arxiv.org/abs/2508.20861
Authors: Yijia Guo, Junqing Zhang, Y.-W. Peter Hong
Categories: Machine Learning (cs.LG)

Abstract:The Internet of Things (IoT) is ubiquitous thanks to the rapid development of wireless technologies. However, the broadcast nature of wireless transmissions results in great vulnerability to device authentication. Physical layer authentication emerges as a promising approach by exploiting the unique channel characteristics. However, a practical scheme applicable to dynamic channel variations is still missing. In this paper, we proposed a deep learning-based physical layer channel state information (CSI) authentication for mobile scenarios and carried out comprehensive simulation and experimental evaluation using IEEE 802.11n. Specifically, a synthetic training dataset was generated based on the WLAN TGn channel model and the autocorrelation and the distance correlation of the channel, which can significantly reduce the overhead of manually collecting experimental datasets. A convolutional neural network (CNN)-based Siamese network was exploited to learn the temporal and spatial correlation between the CSI pair and output a score to measure their similarity. We adopted a synergistic methodology involving both simulation and experimental evaluation. The experimental testbed consisted of WiFi IoT development kits and a few typical scenarios were specifically considered. Both simulation and experimental evaluation demonstrated excellent generalization performance of our proposed deep learning-based approach and excellent authentication performance. Demonstrated by our practical measurement results, our proposed scheme improved the area under the curve (AUC) by 0.03 compared to the fully connected network-based (FCN-based) Siamese model and by 0.06 compared to the correlation-based benchmark algorithm.

[LG-10] ATM-GAD: Adaptive Temporal Motif Graph Anomaly Detection for Financial Transaction Networks

Link: https://arxiv.org/abs/2508.20829
Authors: Zeyue Zhang, Lin Song, Erkang Bao, Xiaoling Lv, Xinyue Wang
Categories: Machine Learning (cs.LG)

Abstract:Financial fraud detection is essential to safeguard billions of dollars, yet the intertwined entities and fast-changing transaction behaviors in modern financial systems routinely defeat conventional machine learning models. Recent graph-based detectors make headway by representing transactions as networks, but they still overlook two fraud hallmarks rooted in time: (1) temporal motifs–recurring, telltale subgraphs that reveal suspicious money flows as they unfold–and (2) account-specific intervals of anomalous activity, when fraud surfaces only in short bursts unique to each entity. To exploit both signals, we introduce ATM-GAD, an adaptive graph neural network that leverages temporal motifs for financial anomaly detection. A Temporal Motif Extractor condenses each account’s transaction history into the most informative motifs, preserving both topology and temporal patterns. These motifs are then analyzed by dual-attention blocks: IntraA reasons over interactions within a single motif, while InterA aggregates evidence across motifs to expose multi-step fraud schemes. In parallel, a differentiable Adaptive Time-Window Learner tailors the observation window for every node, allowing the model to focus precisely on the most revealing time slices. Experiments on four real-world datasets show that ATM-GAD consistently outperforms seven strong anomaly-detection baselines, uncovering fraud patterns missed by earlier methods.

[LG-11] GPT-FT: An Efficient Automated Feature Transformation Using GPT for Sequence Reconstruction and Performance Enhancement APWEB

Link: https://arxiv.org/abs/2508.20824
Authors: Yang Gao, Dongjie Wang, Scott Piersall, Ye Zhang, Liqiang Wang
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 9 figures. Accepted by APWeb-WAIM 2025

Abstract:Feature transformation plays a critical role in enhancing machine learning model performance by optimizing data representations. Recent state-of-the-art approaches address this task as a continuous embedding optimization problem, converting discrete search into a learnable process. Although effective, these methods often rely on sequential encoder-decoder structures that cause high computational costs and parameter requirements, limiting scalability and efficiency. To address these limitations, we propose a novel framework that accomplishes automated feature transformation through four steps: transformation records collection, embedding space construction with a revised Generative Pre-trained Transformer (GPT) model, gradient-ascent search, and autoregressive reconstruction. In our approach, the revised GPT model serves two primary functions: (a) feature transformation sequence reconstruction and (b) model performance estimation and enhancement for downstream tasks by constructing the embedding space. Such a multi-objective optimization framework reduces parameter size and accelerates transformation processes. Experimental results on benchmark datasets show that the proposed framework matches or exceeds baseline performance, with significant gains in computational efficiency. This work highlights the potential of transformer-based architectures for scalable, high-performance automated feature transformation.

[LG-12] cMALC-D: Contextual Multi-Agent LLM-Guided Curriculum Learning with Diversity-Based Context Blending

Link: https://arxiv.org/abs/2508.20818
Authors: Anirudh Satheesh, Keenan Powell, Hua Wei
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: A shorter version has been accepted to the 2025 Conference on Information and Knowledge Management

Abstract:Many multi-agent reinforcement learning (MARL) algorithms are trained in fixed simulation environments, making them brittle when deployed in real-world scenarios with more complex and uncertain conditions. Contextual MARL (cMARL) addresses this by parameterizing environments with context variables and training a context-agnostic policy that performs well across all environment configurations. Existing cMARL methods attempt to use curriculum learning to help train and evaluate context-agnostic policies, but they often rely on unreliable proxy signals, such as value estimates or generalized advantage estimates that are noisy and unstable in multi-agent settings due to inter-agent dynamics and partial observability. To address these issues, we propose Contextual Multi-Agent LLM-Guided Curriculum Learning with Diversity-Based Context Blending (cMALC-D), a framework that uses Large Language Models (LLMs) to generate semantically meaningful curricula and provide a more robust evaluation signal. To prevent mode collapse and encourage exploration, we introduce a novel diversity-based context blending mechanism that creates new training scenarios by combining features from prior contexts. Experiments in traffic signal control domains demonstrate that cMALC-D significantly improves both generalization and sample efficiency compared to existing curriculum learning baselines. We provide code at this https URL.

[LG-13] SEAL: Structure and Element Aware Learning to Improve Long Structured Document Retrieval EMNLP2025

Link: https://arxiv.org/abs/2508.20778
Authors: Xinhao Huang, Zhibo Ren, Yipeng Yu, Ying Zhou, Zulong Chen, Zeyi Wen
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at EMNLP 2025 Main Conference

Abstract:In long structured document retrieval, existing methods typically fine-tune pre-trained language models (PLMs) using contrastive learning on datasets lacking explicit structural information. This practice suffers from two critical issues: 1) current methods fail to leverage structural features and element-level semantics effectively, and 2) datasets containing structural metadata are lacking. To bridge these gaps, we propose SEAL, a novel contrastive learning framework. It leverages structure-aware learning to preserve semantic hierarchies and masked element alignment for fine-grained semantic discrimination. Furthermore, we release a long structured document retrieval dataset with rich structural annotations. Extensive experiments on both released and industrial datasets across various modern PLMs, along with online A/B testing, demonstrate consistent performance improvements, boosting NDCG@10 from 73.96% to 77.84% on BGE-M3. The resources are available at this https URL.

[LG-14] Balancing Profit and Traveller Acceptance in Ride-Pooling Personalised Fares

Link: https://arxiv.org/abs/2508.20723
Authors: Michal Bujak, Rafal Kucharski
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Ride-pooling systems, to succeed, must provide an attractive service, namely compensate perceived costs with an appealing price. However, because of a strong heterogeneity in a value-of-time, each traveller has his own acceptable price, unknown to the operator. Here, we show that individual acceptance levels can be learned by the operator (over 90% accuracy for pooled travellers in 10 days) to optimise personalised fares. We propose an adaptive pricing policy, where every day the operator constructs an offer that progressively meets travellers’ expectations and attracts a growing demand. Our results suggest that operators, by learning behavioural traits of individual travellers, may improve performance not only for travellers (increased utility) but also for themselves (increased profit). Moreover, such knowledge allows the operator to remove inefficient pooled rides and focus on attractive and profitable combinations.

[LG-15] Unified Multi-task Learning for Voice-Based Detection of Diverse Clinical Conditions

Link: https://arxiv.org/abs/2508.20717
Authors: Ran Piao, Yuan Lu, Hareld Kemps, Tong Xia, Aaqib Saeed
Categories: Sound (cs.SD); Machine Learning (cs.LG)

Abstract:Voice-based health assessment offers unprecedented opportunities for scalable, non-invasive disease screening, yet existing approaches typically focus on single conditions and fail to leverage the rich, multi-faceted information embedded in speech. We present MARVEL (Multi-task Acoustic Representations for Voice-based Health Analysis), a privacy-conscious multitask learning framework that simultaneously detects nine distinct neurological, respiratory, and voice disorders using only derived acoustic features, eliminating the need for raw audio transmission. Our dual-branch architecture employs specialized encoders with task-specific heads sharing a common acoustic backbone, enabling effective cross-condition knowledge transfer. Evaluated on the large-scale Bridge2AI-Voice v2.0 dataset, MARVEL achieves an overall AUROC of 0.78, with exceptional performance on neurological disorders (AUROC = 0.89), particularly for Alzheimer’s disease/mild cognitive impairment (AUROC = 0.97). Our framework consistently outperforms single-modal baselines by 5-19% and surpasses state-of-the-art self-supervised models on 7 of 9 tasks, while correlation analysis reveals that the learned representations exhibit meaningful similarities with established acoustic features, indicating that the model’s internal representations are consistent with clinically recognized acoustic patterns. By demonstrating that a single unified model can effectively screen for diverse conditions, this work establishes a foundation for deployable voice-based diagnostics in resource-constrained and remote healthcare settings.

[LG-16] Compositionality in Time Series: A Proof of Concept using Symbolic Dynamics and Compositional Data Augmentation

Link: https://arxiv.org/abs/2508.20656
Authors: Michael Hagmann, Michael Staniek, Stefan Riezler
Categories: Machine Learning (cs.LG)

Abstract:This work investigates whether time series of natural phenomena can be understood as being generated by sequences of latent states which are ordered in systematic and regular ways. We focus on clinical time series and ask whether clinical measurements can be interpreted as being generated by meaningful physiological states whose succession follows systematic principles. Uncovering the underlying compositional structure will allow us to create synthetic data to alleviate the notorious problem of sparse and low-resource data settings in clinical time series forecasting, and deepen our understanding of clinical data. We start by conceptualizing compositionality for time series as a property of the data generation process, and then study data-driven procedures that can reconstruct the elementary states and composition rules of this process. We evaluate the success of these methods using two empirical tests originating from a domain adaptation perspective. Both tests infer the similarity of the original time series distribution and the synthetic time series distribution from the similarity of expected risk of time series forecasting models trained and tested on original and synthesized data in specific ways. Our experimental results show that the test set performance achieved by training on compositionally synthesized data is comparable to training on original clinical time series data, and that evaluation of models on compositionally synthesized test data shows similar results to evaluating on original test data, outperforming randomization-based data augmentation. An additional downstream evaluation of the prediction task of sequential organ failure assessment (SOFA) scores shows significant performance gains when model training is entirely based on compositionally synthesized data compared to training on original data.

[LG-17] Self-Composing Neural Operators with Depth and Accuracy Scaling via Adaptive Train-and-Unroll Approach

Link: https://arxiv.org/abs/2508.20650
Authors: Juncai He, Xinliang Liu, Jinchao Xu
Categories: Machine Learning (cs.LG)

Abstract:In this work, we propose a novel framework to enhance the efficiency and accuracy of neural operators through self-composition, offering both theoretical guarantees and practical benefits. Inspired by iterative methods for solving numerical partial differential equations (PDEs), we design a neural operator that repeatedly applies a single neural operator block, progressively deepening the model without explicitly adding new blocks and thereby improving its capacity. To train these models efficiently, we introduce an adaptive train-and-unroll approach, where the depth of the neural operator is gradually increased during training. This approach reveals an accuracy scaling law with model depth and offers significant computational savings through our adaptive training strategy. Our architecture achieves state-of-the-art (SOTA) performance on standard benchmarks. We further demonstrate its efficacy on a challenging high-frequency ultrasound computed tomography (USCT) problem, where a multigrid-inspired backbone enables superior performance in resolving complex wave phenomena. The proposed framework provides a computationally tractable, accurate, and scalable solution for large-scale data-driven scientific machine learning applications.
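The self-composition idea borrows from classical iterative solvers, where one fixed update is applied repeatedly and accuracy improves with the number of applications. The sketch below shows only that classical analogy, a damped Richardson iteration on a 1D Laplacian system. The `block` here is a fixed linear update rather than a learned neural operator block, the adaptive train-and-unroll schedule is not shown, and the names and the damping factor `omega` are assumptions for illustration.

```python
import numpy as np

n = 16
# 1D discrete Laplacian, a standard model problem from numerical PDEs.
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.ones(n)
u_exact = np.linalg.solve(A, f)

def block(u, f, omega=0.4):
    # One shared update step: the same block is reused at every depth.
    return u + omega * (f - A @ u)

def self_compose(f, depth):
    # "Deepening" the model means applying the same block more times.
    u = np.zeros_like(f)
    for _ in range(depth):
        u = block(u, f)
    return u

# The error shrinks as the depth grows, with no new parameters added.
errors = [np.linalg.norm(self_compose(f, d) - u_exact) for d in (10, 100, 1000)]
```

In this toy setting the parameter count is independent of depth, which mirrors the abstract's point that the model deepens without explicitly adding new blocks.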

[LG-18] Physics-Constrained Machine Learning for Chemical Engineering

Link: https://arxiv.org/abs/2508.20649
Authors: Angan Mukherjee, Victor M. Zavala (Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, USA)
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Physics-constrained machine learning (PCML) combines physical models with data-driven approaches to improve reliability, generalizability, and interpretability. Although PCML has shown significant benefits in diverse scientific and engineering domains, technical and intellectual challenges hinder its applicability in complex chemical engineering applications. Key difficulties include determining the amount and type of physical knowledge to embed, designing effective fusion strategies with ML, scaling models to large datasets and simulators, and quantifying predictive uncertainty. This perspective summarizes recent developments and highlights challenges and opportunities in applying PCML to chemical engineering, emphasizing closed-loop experimental design, real-time dynamics and control, and handling of multi-scale phenomena.

[LG-19] VarDiU: A Variational Diffusive Upper Bound for One-Step Diffusion Distillation

Link: https://arxiv.org/abs/2508.20646
Authors: Leyang Wang, Mingtian Zhang, Zijing Ou, David Barber
Categories: Machine Learning (cs.LG)
Comments: Leyang Wang and Mingtian Zhang contributed equally to this work

Abstract:Recently, diffusion distillation methods have compressed thousand-step teacher diffusion models into one-step student generators while preserving sample quality. Most existing approaches train the student model using a diffusive divergence whose gradient is approximated via the student’s score function, learned through denoising score matching (DSM). Since DSM training is imperfect, the resulting gradient estimate is inevitably biased, leading to sub-optimal performance. In this paper, we propose VarDiU (pronounced /va:rdju:/), a Variational Diffusive Upper Bound that admits an unbiased gradient estimator and can be directly applied to diffusion distillation. Using this objective, we compare our method with Diff-Instruct and demonstrate that it achieves higher generation quality and enables a more efficient and stable training procedure for one-step diffusion distillation.

[LG-20] A Hybrid Stochastic Gradient Tracking Method for Distributed Online Optimization Over Time-Varying Directed Networks

Link: https://arxiv.org/abs/2508.20645
Authors: Xinli Shi, Xingxing Yuan, Longkang Zhu, Guanghui Wen
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)

Abstract:With the increasing scale and dynamics of data, distributed online optimization has become essential for real-time decision-making in various applications. However, existing algorithms often rely on bounded gradient assumptions and overlook the impact of stochastic gradients, especially in time-varying directed networks. This study proposes a novel Time-Varying Hybrid Stochastic Gradient Tracking algorithm named TV-HSGT, based on hybrid stochastic gradient tracking and variance reduction mechanisms. Specifically, TV-HSGT integrates row-stochastic and column-stochastic communication schemes over time-varying digraphs, eliminating the need for Perron vector estimation or out-degree information. By combining current and recursive stochastic gradients, it effectively reduces gradient variance while accurately tracking global descent directions. Theoretical analysis demonstrates that TV-HSGT can achieve improved bounds on dynamic regret without assuming gradient boundedness. Experimental results on logistic regression tasks confirm the effectiveness of TV-HSGT in dynamic and resource-constrained environments.
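One common way to combine current and recursive stochastic gradients is a convex combination of a plain stochastic gradient and a SARAH-style recursive correction. The single-node sketch below uses that form as an assumption; the abstract does not give the exact TV-HSGT update, and the network communication, row- and column-stochastic mixing, and online setting are all omitted. The parameters `beta` and `step` and the toy objective are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 1.0      # data distribution: xi ~ N(mu, sigma^2)
beta, step = 0.1, 0.05    # mixing weight and step size (assumed values)

def stoch_grad(x, xi):
    # Stochastic gradient of f(x) = E[0.5 * (x - xi)^2].
    return x - xi

x_prev = x = 0.0
v = stoch_grad(x, rng.normal(mu, sigma))
plain_err, hybrid_err = [], []
for _ in range(5000):
    x_prev, x = x, x - step * v
    xi = rng.normal(mu, sigma)
    g_now = stoch_grad(x, xi)
    g_old = stoch_grad(x_prev, xi)   # same sample, previous iterate
    # Hybrid estimator: current gradient mixed with a recursive correction.
    v = beta * g_now + (1 - beta) * (v + g_now - g_old)
    true_grad = x - mu
    plain_err.append(g_now - true_grad)
    hybrid_err.append(v - true_grad)

plain_var = np.var(plain_err)
hybrid_var = np.var(hybrid_err)
```

On this toy problem the hybrid estimator's error variance is well below that of the plain stochastic gradient, which is the variance reduction effect the abstract refers to.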

[LG-21] Supervised Stochastic Gradient Algorithms for Multi-Trial Source Separation

Link: https://arxiv.org/abs/2508.20618
Authors: Ronak Mehta, Mateus Piovezan Otto, Noah Stanis, Azadeh Yazdan-Shahmorad, Zaid Harchaoui
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We develop a stochastic algorithm for independent component analysis that incorporates multi-trial supervision, which is available in many scientific contexts. The method blends a proximal gradient-type algorithm in the space of invertible matrices with joint learning of a prediction model through backpropagation. We illustrate the proposed algorithm on synthetic and real data experiments. In particular, owing to the additional supervision, we observe an increased success rate of the non-convex optimization and the improved interpretability of the independent components.

[LG-22] Dimension Agnostic Testing of Survey Data Credibility through the Lens of Regression

Link: https://arxiv.org/abs/2508.20616
Authors: Debabrota Basu, Sourav Chakraborty, Debarshi Chanda, Buddha Dev Das, Arijit Ghosh, Arnab Ray
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 30 pages, 8 figures, 6 tables

Abstract:Assessing whether a sample survey credibly represents the population is a critical question for ensuring the validity of downstream research. Generally, this problem reduces to estimating the distance between two high-dimensional distributions, which typically requires a number of samples that grows exponentially with the dimension. However, depending on the model used for data analysis, the conclusions drawn from the data may remain consistent across different underlying distributions. In this context, we propose a task-based approach to assess the credibility of sampled surveys. Specifically, we introduce a model-specific distance metric to quantify this notion of credibility. We also design an algorithm to verify the credibility of survey data in the context of regression models. Notably, the sample complexity of our algorithm is independent of the data dimension. This efficiency stems from the fact that the algorithm focuses on verifying the credibility of the survey data rather than reconstructing the underlying regression model. Furthermore, we show that if one attempts to verify credibility by reconstructing the regression model, the sample complexity scales linearly with the dimensionality of the data. We prove the theoretical correctness of our algorithm and numerically demonstrate our algorithm’s performance.

[LG-23] Local Virtual Nodes for Alleviating Over-Squashing in Graph Neural Networks

Link: https://arxiv.org/abs/2508.20597
Authors: Tuğrul Hasan Karabulut, İnci M. Baytaş
Categories: Machine Learning (cs.LG)

Abstract:Over-squashing is a challenge in training graph neural networks for tasks involving long-range dependencies. In such tasks, a GNN’s receptive field should be large enough to enable communication between distant nodes. However, gathering information from a wide range of neighborhoods and squashing its content into fixed-size node representations makes message-passing vulnerable to bottlenecks. Graph rewiring and adding virtual nodes are commonly studied remedies that create additional pathways around bottlenecks to mitigate over-squashing. However, these techniques alter the input graph’s global topology and disrupt the domain knowledge encoded in the original graph structure, both of which could be essential to specific tasks and domains. This study presents Local Virtual Nodes (LVN) with trainable embeddings to alleviate the effects of over-squashing without significantly corrupting the global structure of the input graph. The position of the LVNs is determined by the node centrality, which indicates the existence of potential bottlenecks. Thus, the proposed approach aims to improve the connectivity in the regions with likely bottlenecks. Furthermore, trainable LVN embeddings shared across selected central regions facilitate communication between distant nodes without adding more layers. Extensive experiments on benchmark datasets demonstrate that LVNs can enhance structural connectivity and significantly improve performance on graph and node classification tasks. The code can be found at this https URL.

[LG-24] Unbiased Stochastic Optimization for Gaussian Processes on Finite Dimensional RKHS

Link: https://arxiv.org/abs/2508.20588
Authors: Neta Shoham, Haim Avron
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Current methods for stochastic hyperparameter learning in Gaussian Processes (GPs) rely on approximations, such as computing biased stochastic gradients or using inducing points in stochastic variational inference. However, when using such methods we are not guaranteed to converge to a stationary point of the true marginal likelihood. In this work, we propose algorithms for exact stochastic inference of GPs with kernels that induce a Reproducing Kernel Hilbert Space (RKHS) of moderate finite dimension. Our approach can also be extended to infinite dimensional RKHSs at the cost of forgoing exactness. Both for finite and infinite dimensional RKHSs, our method achieves better experimental results than existing methods when memory resources limit the feasible batch size and the possible number of inducing points.

[LG-25] SemSR: Semantics aware robust Session-based Recommendations RECSYS’25

Link: https://arxiv.org/abs/2508.20587
Authors: Jyoti Narwariya, Priyanka Gupta, Muskan Gupta, Jyotsana Khatri, Lovekesh Vig
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at EARL workshop @RecSys’25, Prague, Czech Republic

Abstract:Session-based recommendation (SR) models aim to recommend items to anonymous users based on their behavior during the current session. While various SR models in the literature utilize item sequences to predict the next item, they often fail to leverage semantic information from item titles or descriptions, impeding session intent identification and interpretability. Recent research has explored Large Language Models (LLMs) as promising approaches to enhance session-based recommendations, with both prompt-based and fine-tuning-based methods being widely investigated. However, prompt-based methods struggle to identify optimal prompts that elicit correct reasoning and lack task-specific feedback at test time, resulting in sub-optimal recommendations. Fine-tuning methods incorporate domain-specific knowledge but incur significant computational costs for implementation and maintenance. In this paper, we present multiple approaches to utilize LLMs for session-based recommendation: (i) in-context LLMs as recommendation agents, (ii) LLM-generated representations for semantic initialization of deep learning SR models, and (iii) integration of LLMs with data-driven SR models. Through comprehensive experiments on two real-world publicly available datasets, we demonstrate that LLM-based methods excel at coarse-level retrieval (high recall values), while traditional data-driven techniques perform well at fine-grained ranking (high Mean Reciprocal Rank values). Furthermore, the integration of LLMs with data-driven SR models significantly outperforms both standalone LLM approaches and data-driven deep learning models, as well as baseline SR models, in terms of both Recall and MRR metrics.
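The two metrics contrasted in this abstract can be computed as in the following minimal sketch, which uses the standard definitions of Recall@K and MRR@K for next-item prediction with one target item per session. The function names and the toy sessions are illustrative and do not come from the paper.

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of sessions whose target item appears in the top-k list."""
    hits = sum(target in ranked[:k] for ranked, target in zip(ranked_lists, targets))
    return hits / len(targets)

def mrr_at_k(ranked_lists, targets, k):
    """Mean reciprocal rank of the target item, counting 0 if it is outside the top k."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top = ranked[:k]
        if target in top:
            total += 1.0 / (top.index(target) + 1)
    return total / len(targets)

# Three toy sessions: target ranked 2nd, ranked 1st, and missing entirely.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
targets = ["b", "d", "z"]
recall = recall_at_k(ranked, targets, k=3)
mrr = mrr_at_k(ranked, targets, k=3)
```

Recall only asks whether the target was retrieved at all, while MRR rewards placing it near the top, which is why the abstract associates the former with coarse retrieval and the latter with fine-grained ranking.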

[LG-26] Theoretical foundations of the integral indicator application in hyperparametric optimization

Link: https://arxiv.org/abs/2508.20550
Authors: Roman S. Kulshin, Anatoly A. Sidorov
Categories: Machine Learning (cs.LG)

Abstract:The article discusses the concept of hyperparametric optimization of recommendation algorithms using an integral assessment that combines various performance indicators into a single consolidated criterion. This approach is opposed to traditional methods of setting up a single metric and allows you to achieve a balance between accuracy, ranking quality, variety of output and the resource intensity of algorithms. The theoretical significance of the research lies in the development of a universal multi-criteria optimization tool that is applicable not only in recommendation systems, but also in a wide range of machine learning and data analysis tasks.
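A consolidated criterion of this kind can be sketched as a weighted average of normalized metrics. The abstract does not state the aggregation formula, so the weighted average, the metric names, the weights, and the candidate configurations below are all assumptions made for illustration.

```python
def integral_indicator(metrics, weights, lower_is_better=()):
    """Weighted average of metrics that are already scaled to [0, 1].

    Metrics listed in lower_is_better (for example resource cost) are
    inverted so that a larger score is always better.
    """
    score = 0.0
    for name, weight in weights.items():
        value = metrics[name]
        if name in lower_is_better:
            value = 1.0 - value
        score += weight * value
    return score / sum(weights.values())

# Hypothetical hyperparameter configurations with normalized metrics.
candidates = {
    "config_a": {"accuracy": 0.80, "ndcg": 0.60, "diversity": 0.30, "cost": 0.90},
    "config_b": {"accuracy": 0.75, "ndcg": 0.58, "diversity": 0.55, "cost": 0.40},
}
weights = {"accuracy": 0.4, "ndcg": 0.3, "diversity": 0.2, "cost": 0.1}

scores = {
    name: integral_indicator(m, weights, lower_is_better=("cost",))
    for name, m in candidates.items()
}
best = max(scores, key=scores.get)
```

Here `config_a` wins on accuracy alone, but `config_b` has the higher combined score, which illustrates the balance between accuracy, ranking quality, variety, and resource intensity that the abstract describes.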

[LG-27] Khiops: An End-to-End Frugal AutoML and XAI Machine Learning Solution for Large Multi-Table Databases

Link: https://arxiv.org/abs/2508.20519
Authors: Marc Boullé, Nicolas Voisine, Bruno Guerraz, Carine Hue, Felipe Olmos, Vladimir Popescu, Stéphane Gouache, Stéphane Bouget, Alexis Bondu, Luc Aurelien Gauthier, Yassine Nair Benrekia, Fabrice Clérot, Vincent Lemaire
Categories: Machine Learning (cs.LG)

Abstract:Khiops is an open source machine learning tool designed for mining large multi-table databases. Khiops is based on a unique Bayesian approach that has attracted academic interest with more than 20 publications on topics such as variable selection, classification, decision trees and co-clustering. It provides a predictive measure of variable importance using discretisation models for numerical data and value clustering for categorical data. The proposed classification/regression model is a naive Bayesian classifier incorporating variable selection and weight learning. In the case of multi-table databases, it provides propositionalisation by automatically constructing aggregates. Khiops is adapted to the analysis of large databases with millions of individuals, tens of thousands of variables and hundreds of millions of records in secondary tables. It is available on many environments, both from a Python library and via a user interface.

[LG-28] Structure-aware Hypergraph Transformer for Diagnosis Prediction in Electronic Health Records

Link: https://arxiv.org/abs/2508.20500
Authors: Haiyan Wang, Ye Yuan
Subjects: Machine Learning (cs.LG)

Abstract:Electronic Health Records (EHR) systematically organize patient health data through standardized medical codes, serving as a comprehensive and invaluable source for predictive modeling. Graph neural networks (GNNs) have demonstrated effectiveness in modeling interactions between medical codes within EHR. However, existing GNN-based methods are inadequate due to: a) their reliance on pairwise relations fails to capture the inherent higher-order dependencies in clinical data, and b) the localized message-passing scheme limits representation power. To address these issues, this paper proposes a novel Structure-aware HyperGraph Transformer (SHGT) framework following three-fold ideas: a) employing a hypergraph structural encoder to capture higher-order interactions among medical codes, b) integrating the Transformer architecture to reason over the entire hypergraph, and c) designing a tailored loss function incorporating hypergraph reconstruction to preserve the hypergraph’s original structure. Experiments on real-world EHR datasets demonstrate that the proposed SHGT outperforms existing state-of-the-art models on diagnosis prediction.

[LG-29] Rethinking Transformer Connectivity: TLinFormer A Path to Exact Full Context-Aware Linear Attention

Link: https://arxiv.org/abs/2508.20407
Authors: Zhongpan Tang
Subjects: Machine Learning (cs.LG)

Abstract:The Transformer architecture has become a cornerstone of modern artificial intelligence, but its core self-attention mechanism suffers from a complexity bottleneck that scales quadratically with sequence length, severely limiting its application in long-sequence tasks. To address this challenge, existing linear attention methods typically sacrifice model performance by relying on data-agnostic kernel approximations or restrictive context selection. This paper returns to the first principles of connectionism, starting from the topological structure of information flow, to introduce a novel linear attention architecture, TLinFormer. By reconfiguring neuron connection patterns, TLinFormer achieves strict linear complexity while computing exact attention scores and ensuring information flow remains aware of the full historical context. This design aims to bridge the performance gap prevalent between existing efficient attention methods and standard attention. Through a series of experiments, we systematically evaluate the performance of TLinFormer against a standard Transformer baseline on long-sequence inference tasks. The results demonstrate that TLinFormer exhibits overwhelming advantages in key metrics such as inference latency, KV cache efficiency, memory footprint, and overall speedup.

[LG-30] Revealing Potential Biases in LLM-Based Recommender Systems in the Cold Start Setting RECSYS2025

Link: https://arxiv.org/abs/2508.20401
Authors: Alexandre Andre, Gauthier Roy, Eva Dyer, Kai Wang
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: In Proceedings of 2nd Workshop on Evaluating and Applying Recommendation Systems with Large Language Models (EARL) at RecSys 2025 (EARL 2025)

Abstract:Large Language Models (LLMs) are increasingly used for recommendation tasks due to their general-purpose capabilities. While LLMs perform well in rich-context settings, their behavior in cold-start scenarios, where only limited signals such as age, gender, or language are available, raises fairness concerns because they may rely on societal biases encoded during pretraining. We introduce a benchmark specifically designed to evaluate fairness in zero-context recommendation. Our modular pipeline supports configurable recommendation domains and sensitive attributes, enabling systematic and flexible audits of any open-source LLM. Through evaluations of state-of-the-art models (Gemma 3 and Llama 3.2), we uncover consistent biases across recommendation domains (music, movies, and colleges) including gendered and cultural stereotypes. We also reveal a non-linear relationship between model size and fairness, highlighting the need for nuanced analysis.

[LG-31] BiListing: Modality Alignment for Listings

Link: https://arxiv.org/abs/2508.20396
Authors: Guillaume Guy, Mihajlo Grbovic, Chun How Tan, Han Zhao
Subjects: Machine Learning (cs.LG)

Abstract:Airbnb is a leader in offering travel accommodations. Airbnb has historically relied on structured data to understand, rank, and recommend listings to guests due to the limited capabilities and associated complexity arising from extracting meaningful information from text and images. With the rise of representation learning, leveraging rich information from text and photos has become easier. A popular approach has been to create embeddings for text documents and images to enable use cases of computing similarities between listings or using embeddings as features in an ML model. However, an Airbnb listing has diverse unstructured data: multiple images and various unstructured text documents such as title, description, and reviews, which makes this approach challenging. Specifically, it is a non-trivial task to combine multiple embeddings of different pieces of information to reach a single representation. This paper proposes BiListing, for Bimodal Listing, an approach to align text and photos of a listing by leveraging large-language models and pretrained language-image models. The BiListing approach has several favorable characteristics: capturing unstructured data into a single embedding vector per listing and modality, enabling zero-shot capability to search inventory efficiently in user-friendly semantics, overcoming the cold start problem, and enabling listing-to-listing search along a single modality, or both. We conducted offline and online tests of the BiListing embeddings in the Airbnb search ranking model and successfully deployed them in production, achieving a 0.425% NDCB gain and driving tens of millions in incremental revenue.
Journal reference: Proceedings of the 34th ACM International Conference on Information and Knowledge Management, CIKM 2025. Related DOI: https://doi.org/10.1145/3746252.3761577

[LG-32] CoFormer: Collaborating with Heterogeneous Edge Devices for Scalable Transformer Inference

Link: https://arxiv.org/abs/2508.20375
Authors: Guanyu Xu, Zhiwei Hao, Li Shen, Yong Luo, Fuhui Sun, Xiaoyan Wang, Han Hu, Yonggang Wen
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
Comments: Accepted by IEEE Transactions on Computers

Abstract:The impressive performance of transformer models has sparked the deployment of intelligent applications on resource-constrained edge devices. However, ensuring high-quality service for real-time edge systems is a significant challenge due to the considerable computational demands and resource requirements of these models. Existing strategies typically either offload transformer computations to other devices or directly deploy compressed models on individual edge devices. These strategies, however, result in either considerable communication overhead or suboptimal trade-offs between accuracy and efficiency. To tackle these challenges, we propose a collaborative inference system for general transformer models, termed CoFormer. The central idea behind CoFormer is to exploit the divisibility and integrability of transformers. An off-the-shelf large transformer can be decomposed into multiple smaller models for distributed inference, and their intermediate results are aggregated to generate the final output. We formulate an optimization problem to minimize both inference latency and accuracy degradation under heterogeneous hardware constraints. The DeBo algorithm is proposed to first solve the optimization problem to derive the decomposition policy, and then progressively calibrate the decomposed models to restore performance. We demonstrate the capability to support a wide range of transformer models on heterogeneous edge devices, achieving up to 3.1× inference speedup with large transformer models. Notably, CoFormer enables the efficient inference of GPT2-XL with 1.6 billion parameters on edge devices, reducing memory requirements by 76.3%. CoFormer can also reduce energy consumption by approximately 40% while maintaining satisfactory inference performance.

[LG-33] Delay-adaptive Control of Nonlinear Systems with Approximate Neural Operator Predictors

Link: https://arxiv.org/abs/2508.20367
Authors: Luke Bhan, Miroslav Krstic, Yuanyuan Shi
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: 9 pages, 1 Figure

Abstract:In this work, we propose a rigorous method for implementing predictor feedback controllers in nonlinear systems with unknown and arbitrarily long actuator delays. To address the analytically intractable nature of the predictor, we approximate it using a learned neural operator mapping. This mapping is trained once, offline, and then deployed online, leveraging the fast inference capabilities of neural networks. We provide a theoretical stability analysis based on the universal approximation theorem of neural operators and the transport partial differential equation (PDE) representation of the delay. We then prove, via a Lyapunov-Krasovskii functional, semi-global practical convergence of the dynamical system dependent on the approximation error of the predictor and delay bounds. Finally, we validate our theoretical results using a biological activator/repressor system, demonstrating speedups of 15 times compared to traditional numerical methods.

[LG-34] Developing a Multi-Modal Machine Learning Model For Predicting Performance of Automotive Hood Frames

Link: https://arxiv.org/abs/2508.20358
Authors: Abhishek Indupally, Satchit Ramnath
Subjects: Machine Learning (cs.LG)

Abstract:Is there a way for a designer to evaluate the performance of a given hood frame geometry without spending significant time on simulation setup? This paper seeks to address this challenge by developing a multimodal machine-learning (MMML) architecture that learns from different modalities of the same data to predict performance metrics. It also aims to use the MMML architecture to enhance the efficiency of engineering design processes by reducing reliance on computationally expensive simulations. The proposed architecture accelerates design exploration, enabling rapid iteration while maintaining high-performance standards, especially in the concept design phase. The study also presents results that show that by combining multiple data modalities, MMML outperforms traditional single-modality approaches. Two new frame geometries, not part of the training dataset, are also used for prediction using the trained MMML model to showcase the ability to generalize to unseen frame models. The findings underscore MMML’s potential in supplementing traditional simulation-based workflows, particularly in the conceptual design phase, and highlight its role in bridging the gap between machine learning and real-world engineering applications. This research paves the way for the broader adoption of machine learning techniques in engineering design, with a focus on refining multimodal approaches to optimize structural development and accelerate the design cycle.

[LG-35] Understanding Incremental Learning with Closed-form Solution to Gradient Flow on Overparamerterized Matrix Factorization

Link: https://arxiv.org/abs/2508.20344
Authors: Hancheng Min, René Vidal
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Accepted to CDC 2025

Abstract:Many theoretical studies on neural networks attribute their excellent empirical performance to the implicit bias or regularization induced by first-order optimization algorithms when training networks under certain initialization assumptions. One example is the incremental learning phenomenon in gradient flow (GF) on an overparameterized matrix factorization problem with small initialization: GF learns a target matrix by sequentially learning its singular values in decreasing order of magnitude over time. In this paper, we develop a quantitative understanding of this incremental learning behavior for GF on the symmetric matrix factorization problem, using its closed-form solution obtained by solving a Riccati-like matrix differential equation. We show that incremental learning emerges from some time-scale separation among dynamics corresponding to learning different components in the target matrix. By decreasing the initialization scale, these time-scale separations become more prominent, allowing one to find low-rank approximations of the target matrix. Lastly, we discuss the possible avenues for extending this analysis to asymmetric matrix factorization problems.

[LG-36] Adaptive Segmentation of EEG for Machine Learning Applications

Link: https://arxiv.org/abs/2508.20336
Authors: Johnson Zhou, Joseph West, Krista A. Ehinger, Zhenming Ren, Sam E. John, David B. Grayden
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)

Abstract:Objective. Electroencephalography (EEG) data is derived by sampling continuous neurological time series signals. In order to prepare EEG signals for machine learning, the signal must be divided into manageable segments. The current naive approach uses arbitrary fixed time slices, which may have limited biological relevance because brain states are not confined to fixed intervals. We investigate whether adaptive segmentation methods are beneficial for machine learning EEG analysis. Approach. We introduce a novel adaptive segmentation method, CTXSEG, that creates variable-length segments based on statistical differences in the EEG data and propose ways to use them with modern machine learning approaches that typically require fixed-length input. We assess CTXSEG using controllable synthetic data generated by our novel signal generator CTXGEN. While our CTXSEG method has general utility, we validate it on a real-world use case by applying it to an EEG seizure detection problem. We compare the performance of CTXSEG with fixed-length segmentation in the preprocessing step of a typical EEG machine learning pipeline for seizure detection. Main results. We found that using CTXSEG to prepare EEG data improves seizure detection performance compared to fixed-length approaches when evaluated using a standardized framework, without modifying the machine learning method, and requires fewer segments. Significance. This work demonstrates that adaptive segmentation with CTXSEG can be readily applied to modern machine learning approaches, with potential to improve performance. It is a promising alternative to fixed-length segmentation for signal preprocessing and should be considered as part of the standard preprocessing repertoire in EEG machine learning applications. 

[LG-37] Dynamic Synthetic Controls vs. Panel-Aware Double Machine Learning for Geo-Level Marketing Impact Estimation KDD2025

Link: https://arxiv.org/abs/2508.20335
Authors: Sang Su Lee, Vineeth Loganathan, Vijay Raghavan
Subjects: Machine Learning (cs.LG)
Comments: Presented at the KDD 2025 Workshop on Causal Inference and Machine Learning in Practice

Abstract:Accurately quantifying geo-level marketing lift in two-sided marketplaces is challenging: the Synthetic Control Method (SCM) often exhibits high power yet systematically under-estimates effect size, while panel-style Double Machine Learning (DML) is seldom benchmarked against SCM. We build an open, fully documented simulator that mimics a typical large-scale geo roll-out: N_unit regional markets are tracked for T_pre weeks before launch and for a further T_post-week campaign window, with all key parameters adjustable by the user. We use it to probe both families under five stylized stress tests: 1) curved baseline trends, 2) heterogeneous response lags, 3) treated-biased shocks, 4) a non-linear outcome link, and 5) a drifting control group trend. Seven estimators are evaluated: three standard Augmented SCM (ASC) variants and four panel-DML flavors (TWFE, CRE/Mundlak, first-difference, and within-group). Across 100 replications per scenario, ASC models consistently demonstrate severe bias and near-zero coverage in challenging scenarios involving nonlinearities or external shocks. By contrast, panel-DML variants dramatically reduce this bias and restore nominal 95%-CI coverage, proving far more robust. The results indicate that while ASC provides a simple baseline, it is unreliable in common, complex situations. We therefore propose a 'diagnose-first' framework where practitioners first identify the primary business challenge (e.g., nonlinear trends, response lags) and then select the specific DML model best suited for that scenario, providing a more robust and reliable blueprint for analyzing geo-experiments.

[LG-38] FORGE: Foundational Optimization Representations from Graph Embeddings

Link: https://arxiv.org/abs/2508.20330
Authors: Zohair Shafi, Serdar Kadioglu
Subjects: Machine Learning (cs.LG)

Abstract:Combinatorial optimization problems are ubiquitous in science and engineering, yet learning-based approaches to accelerate their solution often require solving a large number of hard-to-solve optimization instances to collect training data, incurring significant computational overhead. Existing methods require training dedicated models for each problem distribution for each downstream task, severely limiting their scalability and generalization. In this work, we introduce Forge, a method of pre-training a vector-quantized graph autoencoder on a large and diverse collection of mixed-integer programming (MIP) instances in an unsupervised fashion without dependency on their solution. The vector quantization process creates discrete code assignments that act as a vocabulary to represent optimization instances. We evaluate our approach under both supervised and unsupervised settings. For the unsupervised setting, we demonstrate that Forge embeddings effectively differentiate and cluster unseen instances. For the supervised setting, we fine-tune Forge embeddings and show that a single model predicts both the variables for warm-starts and integrality gaps for cut-generation across multiple problem type distributions. Both predictions help improve performance of a state-of-the-art, commercial optimization solver. Finally, we release our code and pre-trained Forge weights to encourage further research and practical use of instance-level MIP embeddings at this https URL
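
The vector quantization step mentioned in the abstract can be illustrated with a toy sketch. This is not the Forge code: it only shows the generic idea of mapping continuous embeddings to the index of the nearest codebook vector and summarizing an instance as a histogram over those codes. The function names `quantize` and `code_histogram`, the squared Euclidean distance, and the histogram summary are all assumptions made for illustration.

```python
def quantize(embeddings, codebook):
    """Assign each embedding to the index of its nearest codebook vector.

    Distance is squared Euclidean (an assumption; the paper may differ).
    """
    codes = []
    for emb in embeddings:
        dists = [sum((a - b) ** 2 for a, b in zip(emb, code)) for code in codebook]
        codes.append(dists.index(min(dists)))
    return codes


def code_histogram(codes, vocab_size):
    """Count how often each discrete code occurs (a bag-of-codes summary)."""
    hist = [0] * vocab_size
    for c in codes:
        hist[c] += 1
    return hist


# Hypothetical 2-D node embeddings for one instance and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
node_embeddings = [[0.1, -0.1], [0.9, 1.2], [4.0, 6.0], [0.2, 0.1]]
codes = quantize(node_embeddings, codebook)
histogram = code_histogram(codes, len(codebook))
```

In this toy setting the discrete codes play the role of a vocabulary: two instances can be compared through their code histograms without solving either of them.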

[LG-39] Multi-Agent Reinforcement Learning in Intelligent Transportation Systems: A Comprehensive Survey

Link: https://arxiv.org/abs/2508.20315
Authors: RexCharles Donatus, Kumater Ter, Ore-Ofe Ajayi, Daniel Udekwe
Subjects: Machine Learning (cs.LG)

Abstract:The growing complexity of urban mobility and the demand for efficient, sustainable, and adaptive solutions have positioned Intelligent Transportation Systems (ITS) at the forefront of modern infrastructure innovation. At the core of ITS lies the challenge of autonomous decision-making across dynamic, large-scale, and uncertain environments where multiple agents (traffic signals, autonomous vehicles, or fleet units) must coordinate effectively. Multi-Agent Reinforcement Learning (MARL) offers a promising paradigm for addressing these challenges by enabling distributed agents to jointly learn optimal strategies that balance individual objectives with system-wide efficiency. This paper presents a comprehensive survey of MARL applications in ITS. We introduce a structured taxonomy that categorizes MARL approaches according to coordination models and learning algorithms, spanning value-based, policy-based, actor-critic, and communication-enhanced frameworks. Applications are reviewed across key ITS domains, including traffic signal control, connected and autonomous vehicle coordination, logistics optimization, and mobility-on-demand systems. Furthermore, we highlight widely used simulation platforms such as SUMO, CARLA, and CityFlow that support MARL experimentation, along with emerging benchmarks. The survey also identifies core challenges, including scalability, non-stationarity, credit assignment, communication constraints, and the sim-to-real transfer gap, which continue to hinder real-world deployment.

[LG-40] FedReFT: Federated Representation Fine-Tuning with All-But-Me Aggregation

Link: https://arxiv.org/abs/2508.20295
Authors: Fatema Siddika, Md Anwar Hossen, J. Pablo Muñoz, Tanya Roosta, Anuj Sharma, Ali Jannesari
Subjects: Machine Learning (cs.LG)

Abstract:Parameter-efficient fine-tuning (PEFT) has attracted significant attention for adapting large pre-trained models by modifying a small subset of parameters. Recently, Representation Fine-tuning (ReFT) has emerged as an effective alternative. ReFT shifts the fine-tuning paradigm from updating model weights to directly manipulating hidden representations that capture rich semantic information, and performs better than state-of-the-art PEFTs in standalone settings. However, its application in Federated Learning (FL) remains challenging due to heterogeneity in clients’ data distributions, model capacities, and computational resources. To address these challenges, we introduce Federated Representation Fine-Tuning (FedReFT), a novel approach to fine-tune the client’s hidden representation. FedReFT applies sparse intervention layers to steer hidden representations directly, offering a lightweight and semantically rich fine-tuning alternative ideal for edge devices. However, representation-level updates are especially vulnerable to aggregation mismatch under different task heterogeneity, where naive averaging can corrupt semantic alignment. To mitigate this issue, we propose All-But-Me (ABM) aggregation, where each client receives the aggregated updates of others and partially incorporates them, enabling stable and personalized learning by balancing local focus with global knowledge. We evaluate FedReFT on commonsense reasoning, arithmetic reasoning, instruction-tuning, and GLUE, where it consistently outperforms state-of-the-art PEFT methods in FL, achieving 7x-15x higher parameter efficiency compared to leading LoRA-based approaches.
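
As a rough illustration of the All-But-Me idea described in the abstract, the sketch below gives each client the aggregate of the other clients' updates and blends it into the client's own update. It is not the authors' implementation: the plain arithmetic mean over the other clients, the single mixing coefficient `mix`, and the function name `all_but_me` are assumptions made for illustration.

```python
def all_but_me(updates, mix=0.5):
    """Blend each client's update with the mean of all *other* clients' updates.

    updates: list of per-client parameter vectors (lists of floats, same length).
    mix:     hypothetical weight on the others' mean; 0 keeps the local update,
             1 replaces it entirely with the others' mean.
    """
    n = len(updates)
    if n < 2:
        raise ValueError("need at least two clients")
    dim = len(updates[0])
    # Column-wise totals let us get "everyone except me" by subtraction.
    totals = [sum(u[j] for u in updates) for j in range(dim)]
    blended = []
    for u in updates:
        others_mean = [(totals[j] - u[j]) / (n - 1) for j in range(dim)]
        blended.append([(1 - mix) * u[j] + mix * others_mean[j] for j in range(dim)])
    return blended


# Three hypothetical clients, each with a one-dimensional update.
client_updates = [[0.0], [2.0], [4.0]]
personalized = all_but_me(client_updates, mix=0.5)
```

Unlike naive averaging, every client ends up with a different vector here, which is one simple way to balance local focus with global knowledge.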

[LG-41] Neural Spline Operators for Risk Quantification in Stochastic Systems

Link: https://arxiv.org/abs/2508.20288
Authors: Zhuoyuan Wang, Raffaele Romagnoli, Kamyar Azizzadenesheli, Yorie Nakahira
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:Accurately quantifying long-term risk probabilities in diverse stochastic systems is essential for safety-critical control. However, existing sampling-based and partial differential equation (PDE)-based methods often struggle to handle complex varying dynamics. Physics-informed neural networks learn surrogate mappings for risk probabilities from varying system parameters of fixed and finite dimensions, yet cannot account for functional variations in system dynamics. To address these challenges, we introduce physics-informed neural operator (PINO) methods to risk quantification problems, to learn mappings from varying functional system dynamics to corresponding risk probabilities. Specifically, we propose Neural Spline Operators (NeSO), a PINO framework that leverages B-spline representations to improve training efficiency and achieve better enforcement of initial and boundary conditions, which is crucial for accurate risk quantification. We provide theoretical analysis demonstrating the universal approximation capability of NeSO. We also present two case studies, one with varying functional dynamics and another with high-dimensional multi-agent dynamics, to demonstrate the efficacy of NeSO and its significant online speed-up over existing methods. The proposed framework and the accompanying universal approximation theorem are expected to be beneficial for other control or PDE-related problems beyond risk quantification.

[LG-42] Generalizable AI Model for Indoor Temperature Forecasting Across Sub-Saharan Africa

Link: https://arxiv.org/abs/2508.20260
Authors: Zainab Akhtar, Eunice Jengo, Björn Haßler
Subjects: Machine Learning (cs.LG)

Abstract:This study presents a lightweight, domain-informed AI model for predicting indoor temperatures in naturally ventilated schools and homes in Sub-Saharan Africa. The model extends the Temp-AI-Estimator framework, trained on Tanzanian school data, and evaluated on Nigerian schools and Gambian homes. It achieves robust cross-country performance using only minimal accessible inputs, with mean absolute errors of 1.45°C for Nigerian schools and 0.65°C for Gambian homes. These findings highlight AI’s potential for thermal comfort management in resource-constrained environments.

[LG-43] Latent Variable Modeling for Robust Causal Effect Estimation CIKM2025

Link: https://arxiv.org/abs/2508.20259
Authors: Tetsuro Morimura, Tatsushi Oka, Yugo Suzuki, Daisuke Moriwaki
Subjects: Machine Learning (cs.LG)
Comments: Accepted to CIKM 2025. This is the full version including extended appendix

Abstract:Latent variable models provide a powerful framework for incorporating and inferring unobserved factors in observational data. In causal inference, they help account for hidden factors influencing treatment or outcome, thereby addressing challenges posed by missing or unmeasured covariates. This paper proposes a new framework that integrates latent variable modeling into the double machine learning (DML) paradigm to enable robust causal effect estimation in the presence of such hidden factors. We consider two scenarios: one where a latent variable affects only the outcome, and another where it may influence both treatment and outcome. To ensure tractability, we incorporate latent variables only in the second stage of DML, separating representation learning from latent inference. We demonstrate the robustness and effectiveness of our method through extensive experiments on both synthetic and real-world datasets.

[LG-44] Discovering equations from data: symbolic regression in dynamical systems

Link: https://arxiv.org/abs/2508.20257
Authors: Beatriz R. Brum, Luiza Lober, Isolde Previdelli, Francisco A. Rodrigues
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:The process of discovering equations from data lies at the heart of physics and in many other areas of research, including mathematical ecology and epidemiology. Recently, machine learning methods known as symbolic regression have automated this process. As several methods are available in the literature, it is important to compare them, particularly for dynamic systems that describe complex phenomena. In this paper, five symbolic regression methods were used for recovering equations from nine dynamical processes, including chaotic dynamics and epidemic models, with the PySR method proving to be the most suitable for inferring equations. Benchmark results demonstrate its high predictive power and accuracy, with some estimates being indistinguishable from the original analytical forms. These results highlight the potential of symbolic regression as a robust tool for inferring and modelling real-world phenomena.

[LG-45] Beyond Optimization: Exploring Novelty Discovery in Autonomous Experiments

Link: https://arxiv.org/abs/2508.20254
Authors: Ralph Bulanadi, Jawad Chowdhury, Funakubo Hiroshi, Maxim Ziatdinov, Rama Vasudevan, Arpan Biswas, Yongtao Liu
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:Autonomous experiments (AEs) are transforming how scientific research is conducted by integrating artificial intelligence with automated experimental platforms. Current AEs primarily focus on the optimization of a predefined target; while accelerating this goal, such an approach limits the discovery of unexpected or unknown physical phenomena. Here, we introduce a novel framework, INS2ANE (Integrated Novelty Score-Strategic Autonomous Non-Smooth Exploration), to enhance the discovery of novel phenomena in autonomous experimentation. Our method integrates two key components: (1) a novelty scoring system that evaluates the uniqueness of experimental results, and (2) a strategic sampling mechanism that promotes exploration of under-sampled regions even if they appear less promising by conventional criteria. We validate this approach on a pre-acquired dataset with a known ground truth comprising image-spectral pairs. We further implement the process on autonomous scanning probe microscopy experiments. INS2ANE significantly increases the diversity of explored phenomena in comparison to conventional optimization routines, enhancing the likelihood of discovering previously unobserved phenomena. These results demonstrate the potential for AE to enhance the depth of scientific discovery; in combination with the efficiency provided by AEs, this approach promises to accelerate scientific research by simultaneously navigating complex experimental spaces to uncover new phenomena.

[LG-46] Bounds on Perfect Node Classification: A Convex Graph Clustering Perspective

Link: https://arxiv.org/abs/2508.20231
Authors: Firooz Shahriari-Mehr, Javad Aliakbari, Alexandre Graell i Amat, Ashkan Panahi
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:We present an analysis of the transductive node classification problem, where the underlying graph consists of communities that agree with the node labels and node features. For node classification, we propose a novel optimization problem that incorporates the node-specific information (labels and features) in a spectral graph clustering framework. Studying this problem, we demonstrate a synergy between the graph structure and node-specific information. In particular, we show that suitable node-specific information guarantees the solution of our optimization problem perfectly recovering the communities, under milder conditions than the bounds on graph clustering alone. We present algorithmic solutions to our optimization problem and numerical experiments that confirm such a synergy.

[LG-47] Coresets from Trajectories: Selecting Data via Correlation of Loss Differences

Link: https://arxiv.org/abs/2508.20230
Authors: Manish Nagaraj, Deepak Ravikumar, Kaushik Roy
Subjects: Machine Learning (cs.LG)

Abstract:Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences (CLD), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring their alignment with the loss trajectories of a held-out validation set. CLD is highly efficient, requiring only per-sample loss values computed at training checkpoints, and avoiding the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for CLD-based coresets, demonstrating that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k, CLD-based coresets typically outperform or closely match state-of-the-art methods across subset sizes, and remain within 1% of more computationally expensive baselines even when not leading. CLD transfers effectively across architectures (ResNet, VGG, DenseNet), enabling proxy-to-target selection with 1% degradation. Moreover, CLD is stable when using only early checkpoints, incurring negligible accuracy loss. Finally, CLD exhibits inherent bias reduction via per-class validation alignment, obviating the need for additional stratified sampling. Together, these properties make CLD a principled, efficient, stable, and transferable tool for scalable dataset optimization.
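
The selection rule described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes per-sample losses have been recorded at each training checkpoint, takes first differences across checkpoints, and scores each training sample by the Pearson correlation between its loss-difference sequence and that of the mean validation loss. The function names and the exact scoring rule are hypothetical.

```python
from math import sqrt


def loss_differences(losses):
    """First differences of a loss trajectory recorded at successive checkpoints."""
    return [later - earlier for earlier, later in zip(losses, losses[1:])]


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cld_scores(train_losses, val_losses):
    """Score each training sample by how well its loss changes track validation.

    train_losses / val_losses: one loss trajectory (list of floats) per sample,
    all recorded at the same checkpoints.
    """
    n_ckpt = len(val_losses[0])
    mean_val = [sum(v[t] for v in val_losses) / len(val_losses) for t in range(n_ckpt)]
    val_diff = loss_differences(mean_val)
    return [pearson(loss_differences(traj), val_diff) for traj in train_losses]


def select_coreset(train_losses, val_losses, k):
    """Indices of the k training samples with the highest scores."""
    scores = cld_scores(train_losses, val_losses)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical loss trajectories over four checkpoints.
train = [
    [2.0, 1.5, 1.0, 0.8],  # falls in step with validation
    [1.0, 1.5, 2.5, 2.8],  # moves against validation
    [1.0, 1.0, 1.0, 1.0],  # never changes
]
val = [[2.0, 1.6, 1.1, 0.9], [1.8, 1.4, 1.0, 0.8]]
chosen = select_coreset(train, val, k=2)
```

Only loss values are needed, which matches the abstract's point that no gradient or curvature computation is required.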

[LG-48] What can we learn from signals and systems in a transformer? Insights for probabilistic modeling and inference architecture

Link: https://arxiv.org/abs/2508.20211
Authors: Heng-Sheng Chang, Prashant G. Mehta
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Probability (math.PR)
Comments: 21 pages, 5 figures

Abstract:In the 1940s, Wiener introduced a linear predictor, where the future prediction is computed by linearly combining the past data. A transformer generalizes this idea: it is a nonlinear predictor where the next-token prediction is computed by nonlinearly combining the past tokens. In this essay, we present a probabilistic model that interprets transformer signals as surrogates of conditional measures, and layer operations as fixed-point updates. An explicit form of the fixed-point update is described for the special case when the probabilistic model is a hidden Markov model (HMM). In part, this paper is an attempt to bridge the classical nonlinear filtering theory with modern inference architectures.

[LG-49] Operator learning meets inverse problems: A probabilistic perspective

Link: https://arxiv.org/abs/2508.20207
Authors: Nicholas H. Nelsen, Yunan Yang
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 87 pages, 5 figures

Abstract:Operator learning offers a robust framework for approximating mappings between infinite-dimensional function spaces. It has also become a powerful tool for solving inverse problems in the computational sciences. This chapter surveys methodological and theoretical developments at the intersection of operator learning and inverse problems. It begins by summarizing the probabilistic and deterministic approaches to inverse problems, and pays special attention to emerging measure-centric formulations that treat observed data or unknown parameters as probability distributions. The discussion then turns to operator learning by covering essential components such as data generation, loss functions, and widely used architectures for representing function-to-function maps. The core of the chapter centers on the end-to-end inverse operator learning paradigm, which aims to directly map observed data to the solution of the inverse problem without requiring explicit knowledge of the forward map. It highlights the unique challenge that noise plays in this data-driven inversion setting, presents structure-aware architectures for both point predictions and posterior estimates, and surveys relevant theory for linear and nonlinear inverse problems. The chapter also discusses the estimation of priors and regularizers, where operator learning is used more selectively within classical inversion algorithms.

[LG-50] CrystalICL: Enabling In-Context Learning for Crystal Generation

Link: https://arxiv.org/abs/2508.20143
Authors: Ruobing Wang, Qiaoyu Tan, Yili Wang, Ying Wang, Xin Wang
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:Designing crystal materials with desired physicochemical properties remains a fundamental challenge in materials science. While large language models (LLMs) have demonstrated strong in-context learning (ICL) capabilities, existing LLM-based crystal generation approaches are limited to zero-shot scenarios and are unable to benefit from few-shot scenarios. In contrast, human experts typically design new materials by modifying relevant known structures, which aligns closely with the few-shot ICL paradigm. Motivated by this, we propose CrystalICL, a novel model designed for few-shot crystal generation. Specifically, we introduce a space-group based crystal tokenization method, which effectively reduces the complexity of modeling crystal symmetry in LLMs. We further introduce a condition-structure aware hybrid instruction tuning framework and a multi-task instruction tuning strategy, enabling the model to better exploit ICL by capturing structure-property relationships from limited data. Extensive experiments on four crystal generation benchmarks demonstrate the superiority of CrystalICL over the leading baseline methods on conditional and unconditional generation tasks.

[LG-51] Spatio-Temporal Pruning for Compressed Spiking Large Language Models

Link: https://arxiv.org/abs/2508.20122
Authors: Yi Jiang, Malyaban Bal, Brian Matejek, Susmit Jha, Adam Cobb, Abhronil Sengupta
Categories: Neural and Evolutionary Computing (cs.NE); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) present significant challenges for deployment in energy-constrained environments due to their large model sizes and high inference latency. Spiking Neural Networks (SNNs), inspired by the sparse event-driven neural processing and energy-efficient information transmission in the brain, offer a promising alternative for achieving low-power computing. Integrating the event-driven efficiency of spiking neurons with the advanced capabilities of LLMs represents a promising direction for power-efficient LLMs. This work specifically delves into the design of compressed spiking LLMs. Here, we revisit spatial and temporal pruning from the perspective of SNNs and propose a novel spatio-temporal pruning framework for Spiking LLMs to optimize computational efficiency while preserving high performance. Our spatial pruning technique reduces the number of active neurons and attention heads, effectively lowering the computational complexity of the model. Meanwhile, temporal pruning minimizes inference latency by dynamically adjusting the number of timesteps required for different layers. By combining these approaches with other compression techniques, we present the first work in the domain of Spiking LLMs to jointly explore spatial pruning, temporal pruning, extreme quantization and knowledge distillation strategies. Extensive experimental evaluation of our proposed framework for SpikingBERT on the large-scale GLUE benchmark demonstrates the efficacy of our approach in terms of computational operations and inference latency. Our approach offers a compelling solution for real-time, low-power natural language processing applications, making Spiking LLMs more practical for deployment on edge devices and in power-constrained settings.

[LG-52] Evaluating LLMs on microservice-based applications: how complex is your specification?

Link: https://arxiv.org/abs/2508.20119
Authors: Daniel M. Yellin
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 20 pages + 7 pages of appendices, 7 figures, 8 tables

Abstract:In this paper we evaluate how far LLMs have advanced in generating code for real-world problems. Specifically, we explore code synthesis for microservice-based applications, a widely used architecture pattern. We define a standard template for specifying these applications, and we propose a metric for judging the difficulty level of a specification. The higher the score, the more difficult it is to generate code for the specification. We develop a framework to automate the process of testing LLM-synthesized code for a microservice using unit tests. Our experimental results show that strong LLMs (like GPT-3o-mini) do fairly well on medium difficulty specifications but do very poorly on those of higher difficulty levels. This is due to more intricate business logic, a greater use of external services, database integration and inclusion of non-functional capabilities such as authentication. We analyzed the errors in LLM-synthesized code and report on the key challenges LLMs face in generating code for these specifications thereby suggesting future research directions to improve code synthesis for real-world problems.

[LG-53] Multi-Objective Optimization of ReRAM Crossbars for Robust DNN Inferencing under Stochastic Noise

Link: https://arxiv.org/abs/2109.05437
Authors: Xiaoxuan Yang, Syrine Belakaria, Biresh Kumar Joardar, Huanrui Yang, Janardhan Rao Doppa, Partha Pratim Pande, Krishnendu Chakrabarty, Hai Li
Categories: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: To appear in ICCAD 2021

Abstract:Resistive random-access memory (ReRAM) is a promising technology for designing hardware accelerators for deep neural network (DNN) inferencing. However, stochastic noise in ReRAM crossbars can degrade the DNN inferencing accuracy. We propose the design and optimization of a high-performance, area- and energy-efficient ReRAM-based hardware accelerator to achieve robust DNN inferencing in the presence of stochastic noise. We make two key technical contributions. First, we propose a stochastic-noise-aware training method, referred to as ReSNA, to improve the accuracy of DNN inferencing on ReRAM crossbars with stochastic noise. Second, we propose an information-theoretic algorithm, referred to as CF-MESMO, to identify the Pareto set of solutions to trade off multiple objectives, including inferencing accuracy, area overhead, execution time, and energy consumption. The main challenge in this context is that executing the ReSNA method to evaluate each candidate ReRAM design is prohibitive. To address this challenge, we utilize the continuous-fidelity evaluation of ReRAM designs associated with prohibitively high computation cost by varying the number of training epochs to trade off accuracy and cost. CF-MESMO iteratively selects the candidate ReRAM design and fidelity pair that maximizes the information gained per unit computation cost about the optimal Pareto front. Our experiments on benchmark DNNs show that the proposed algorithms efficiently uncover high-quality Pareto fronts. On average, ReSNA achieves 2.57% inferencing accuracy improvement for ResNet20 on the CIFAR-10 dataset with respect to the baseline configuration. Moreover, the CF-MESMO algorithm achieves 90.91% reduction in computation cost compared to the popular multi-objective optimization algorithm NSGA-II to reach the best solution from NSGA-II.
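
To make the notion of a Pareto set in the abstract above concrete, here is a minimal sketch of a non-dominated filter. It is not CF-MESMO, which selects which designs to evaluate and at what fidelity; it only shows what "Pareto set" means once designs have been scored. The objective tuple `(error, area, time, energy)` and the sample numbers are hypothetical, and every objective is assumed to be minimized.

```python
def dominates(a, b):
    """True if design a is no worse than b in every objective and
    strictly better in at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_set(designs):
    """Keep only the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


# Hypothetical designs scored as (error, area, time, energy).
designs = [
    (0.08, 1.0, 2.0, 3.0),
    (0.10, 1.2, 2.5, 3.5),  # worse than the first design in every objective
    (0.12, 0.8, 2.0, 3.0),  # higher error but smaller area, so it is kept
]
front = pareto_set(designs)
```

The quadratic scan is fine for a handful of candidates; the expensive part in the setting described above is evaluating each design, not filtering them.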

[LG-54] Multilingual Dataset Integration Strategies for Robust Audio Deepfake Detection: A SAFE Challenge System

Link: https://arxiv.org/abs/2508.20983
Authors: Hashim Ali, Surya Subramani, Lekha Bollinani, Nithin Sai Adupa, Sali El-Loh, Hafiz Malik
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)

Abstract:The SAFE Challenge evaluates synthetic speech detection across three tasks: unmodified audio, processed audio with compression artifacts, and laundered audio designed to evade detection. We systematically explore self-supervised learning (SSL) front-ends, training data compositions, and audio length configurations for robust deepfake detection. Our AASIST-based approach incorporates WavLM large frontend with RawBoost augmentation, trained on a multilingual dataset of 256,600 samples spanning 9 languages and over 70 TTS systems from CodecFake, MLAAD v5, SpoofCeleb, Famous Figures, and MAILABS. Through extensive experimentation with different SSL front-ends, three training data versions, and two audio lengths, we achieved second place in both Task 1 (unmodified audio detection) and Task 3 (laundered audio detection), demonstrating strong generalization and robustness.

[LG-55] Transfer Learning for Classification under Decision Rule Drift with Application to Optimal Individualized Treatment Rule Estimation

Link: https://arxiv.org/abs/2508.20942
Authors: Xiaohan Wang, Yang Ning
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)

Abstract:In this paper, we extend the transfer learning classification framework from regression function-based methods to decision rules. We propose a novel methodology for modeling posterior drift through Bayes decision rules. By exploiting the geometric transformation of the Bayes decision boundary, our method reformulates the problem as a low-dimensional empirical risk minimization problem. Under mild regularity conditions, we establish the consistency of our estimators and derive the risk bounds. Moreover, we illustrate the broad applicability of our method by adapting it to the estimation of optimal individualized treatment rules. Extensive simulation studies and analyses of real-world data further demonstrate both superior performance and robustness of our approach.

[LG-56] Polynomial Chaos Expansion for Operator Learning

Link: https://arxiv.org/abs/2508.20886
Authors: Himanshu Sharma, Lukáš Novák, Michael D. Shields
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Operator learning (OL) has emerged as a powerful tool in scientific machine learning (SciML) for approximating mappings between infinite-dimensional functional spaces. One of its main applications is learning the solution operator of partial differential equations (PDEs). While much of the progress in this area has been driven by deep neural network-based approaches such as Deep Operator Networks (DeepONet) and Fourier Neural Operator (FNO), recent work has begun to explore traditional machine learning methods for OL. In this work, we introduce polynomial chaos expansion (PCE) as an OL method. PCE has been widely used for uncertainty quantification (UQ) and has recently gained attention in the context of SciML. For OL, we establish a mathematical framework that enables PCE to approximate operators in both purely data-driven and physics-informed settings. The proposed framework reduces the task of learning the operator to solving a system of equations for the PCE coefficients. Moreover, the framework provides UQ by simply post-processing the PCE coefficients, without any additional computational cost. We apply the proposed method to a diverse set of PDE problems to demonstrate its capabilities. Numerical results demonstrate the strong performance of the proposed method in both OL and UQ tasks, achieving excellent numerical accuracy and computational efficiency.

[LG-57] Automatic Inspection Based on Switch Sounds of Electric Point Machines

Link: https://arxiv.org/abs/2508.20870
Authors: Ayano Shibata, Toshiki Gunji, Mitsuaki Tsuda, Takashi Endo, Kota Dohi, Tomoya Nishida, Satoko Nomoto
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted at ASPECT 2025

Abstract:Since 2018, East Japan Railway Company and Hitachi, Ltd. have been working to replace human inspections with IoT-based monitoring. The purpose is to save the labor required for equipment inspections and to provide appropriate preventive maintenance. As an alternative to visual inspection, it has been difficult to substitute electrical characteristic monitoring, and the introduction of new high-performance sensors has been costly. In 2019, we installed cameras and microphones in "NS" electric point machines to reduce downtime from equipment failures, allowing for remote monitoring of lock-piece conditions. A method for detecting turnout switching errors based on sound information was proposed, and the expected test results were obtained. The proposed method will make it possible to detect equipment failures in real time, thereby reducing the need for visual inspections. This paper presents the results of our technical studies, beginning in 2019, aimed at automating the inspection of electric point machines using sound, specifically focusing on the "switch sound".

[LG-58] Towards Trustworthy Amortized Bayesian Model Comparison NEURIPS2025

Link: https://arxiv.org/abs/2508.20614
Authors: Šimon Kucharský, Aayush Mishra, Daniel Habermann, Stefan T. Radev, Paul-Christian Bürkner
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Comments: 13 pages, 4 figures, submitted to Reliable ML from Unreliable Data Workshop at NeurIPS 2025

Abstract:Amortized Bayesian model comparison (BMC) enables fast probabilistic ranking of models via simulation-based training of neural surrogates. However, the reliability of neural surrogates deteriorates when simulation models are misspecified - the very case where model comparison is most needed. Thus, we supplement simulation-based training with a self-consistency (SC) loss on unlabeled real data to improve BMC estimates under empirical distribution shifts. Using a numerical experiment and two case studies with real data, we compare amortized evidence estimates with and without SC against analytic or bridge sampling benchmarks. SC improves calibration under model misspecification when having access to analytic likelihoods. However, it offers limited gains with neural surrogate likelihoods, making it most practical for trustworthy BMC when likelihoods are exact.

[LG-59] Studying Effective String Theory using deep generative models

Link: https://arxiv.org/abs/2508.20610
Authors: Michele Caselle, Elia Cellini, Alessandro Nada
Categories: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
Comments: 10 pages, 3 figures, 2 tables, contribution to “The XVIth Quark Confinement and the Hadron Spectrum Conference (QCHSC24)”, PoS(QCHSC24)034

Abstract:Effective String Theory (EST) offers a robust non-perturbative framework for describing confinement in Yang-Mills theory by treating the confining flux tube between a static quark-antiquark pair as a thin, vibrating string. While EST calculations are typically carried out using zeta-function regularization, certain problems, such as determining the flux tube width, are too complex to solve analytically. However, recent studies have demonstrated that EST can be explored numerically by employing deep learning techniques based on generative algorithms. In this work, we provide a brief introduction to EST and this novel numerical approach. Finally, we present results for the width of the Nambu-Gotō EST.

[LG-60] Machine-learning based particle-flow algorithm in CMS

Link: https://arxiv.org/abs/2508.20541
Authors: Farouk Mokhtar
Categories: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
Comments: 8 pages, 5 figures, European Physical Society Conference on High Energy Physics (EPS-HEP2025)

Abstract:The particle-flow (PF) algorithm provides a global event description by reconstructing final-state particles and is central to event reconstruction in CMS. Recently, end-to-end machine learning (ML) approaches have been proposed to directly optimize physical quantities of interest and to leverage heterogeneous computing architectures. One such approach, machine-learned particle flow (MLPF), uses a transformer model to infer particles directly from tracks and clusters in a single pass. We present recent CMS developments in MLPF, including training datasets, model architecture, reconstruction metrics, and integration with offline reconstruction software.

[LG-61] Molecular Machine Learning in Chemical Process Design

Link: https://arxiv.org/abs/2508.20527
Authors: Jan G. Rittig, Manuel Dahmen, Martin Grohe, Philippe Schwaller, Alexander Mitsos
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)

Abstract:We present a perspective on molecular machine learning (ML) in the field of chemical process engineering. Recently, molecular ML has demonstrated great potential in (i) providing highly accurate predictions for properties of pure components and their mixtures, and (ii) exploring the chemical space for new molecular structures. We review current state-of-the-art molecular ML models and discuss research directions that promise further advancements. This includes ML methods, such as graph neural networks and transformers, which can be further advanced through the incorporation of physicochemical knowledge in a hybrid or physics-informed fashion. Then, we consider leveraging molecular ML at the chemical process scale, which is highly desirable yet rather unexplored. We discuss how molecular ML can be integrated into process design and optimization formulations, promising to accelerate the identification of novel molecules and processes. To this end, it will be essential to create molecule and process design benchmarks and practically validate proposed candidates, possibly in collaboration with the chemical industry.

[LG-62] QTMRL: An Agent for Quantitative Trading Decision-Making Based on Multi-Indicator Guided Reinforcement Learning

Link: https://arxiv.org/abs/2508.20467
Authors: Xiangdong Liu, Jiahao Chen
Categories: Portfolio Management (q-fin.PM); Machine Learning (cs.LG); Computational Finance (q-fin.CP)

Abstract:In the highly volatile and uncertain global financial markets, traditional quantitative trading models relying on statistical modeling or empirical rules often fail to adapt to dynamic market changes and black swan events due to rigid assumptions and limited generalization. To address these issues, this paper proposes QTMRL (Quantitative Trading Multi-Indicator Reinforcement Learning), an intelligent trading agent combining multi-dimensional technical indicators with reinforcement learning (RL) for adaptive and stable portfolio management. We first construct a comprehensive multi-indicator dataset using 23 years of S&P 500 daily OHLCV data (2000-2022) for 16 representative stocks across 5 sectors, enriching raw data with trend, volatility, and momentum indicators to capture holistic market dynamics. Then we design a lightweight RL framework based on the Advantage Actor-Critic (A2C) algorithm, including data processing, A2C algorithm, and trading agent modules to support policy learning and actionable trading decisions. Extensive experiments compare QTMRL with 9 baselines (e.g., ARIMA, LSTM, moving average strategies) across diverse market regimes, verifying its superiority in profitability, risk adjustment, and downside risk control. The code of QTMRL is publicly available at this https URL
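
For readers unfamiliar with A2C, the quantity at its core is the advantage of an action over the critic's value estimate. The sketch below computes the standard one-step advantage, `r + gamma * V(s') - V(s)`. It is a generic illustration, not QTMRL's implementation: the reward definition, the discount factor, and the function name are assumptions made here.

```python
def one_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """One-step advantage estimates for a short rollout.

    rewards[t]      : reward received after acting in state t
    values[t]       : critic's estimate V(s_t)
    bootstrap_value : critic's estimate for the state after the last step
                      (use 0.0 if the episode ended)
    """
    next_values = values[1:] + [bootstrap_value]
    return [r + gamma * nv - v for r, v, nv in zip(rewards, values, next_values)]
```

In an actor-critic update, the actor's log-probability gradient is weighted by these advantages while the critic is regressed toward `r + gamma * V(s')`.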

[LG-63] Stochastic Gradients under Nuisances

Link: https://arxiv.org/abs/2508.20326
Authors: Facheng Yu, Ronak Mehta, Alex Luedtke, Zaid Harchaoui
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Stochastic gradient optimization is the dominant learning paradigm for a variety of scenarios, from classical supervised learning to modern self-supervised learning. We consider stochastic gradient algorithms for learning problems whose objectives rely on unknown nuisance parameters, and establish non-asymptotic convergence guarantees. Our results show that, while the presence of a nuisance can alter the optimum and upset the optimization trajectory, the classical stochastic gradient algorithm may still converge under appropriate conditions, such as Neyman orthogonality. Moreover, even when Neyman orthogonality is not satisfied, we show that an algorithm variant with approximately orthogonalized updates (with an approximately orthogonalized gradient oracle) may achieve similar convergence rates. Examples from orthogonal statistical learning/double machine learning and causal inference are discussed.

[LG-64] MicroLad: 2D-to-3D Microstructure Reconstruction and Generation via Latent Diffusion and Score Distillation

Link: https://arxiv.org/abs/2508.20138
Authors: Kang-Hyun Lee, Faez Ahmed
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:A major obstacle to establishing reliable structure-property (SP) linkages in materials engineering is the scarcity of diverse 3D microstructure datasets. Limited dataset availability and insufficient control over the analysis and design space restrict the variety of achievable microstructure morphologies, hindering progress in solving the inverse (property-to-structure) design problem. To address these challenges, we introduce MicroLad, a latent diffusion framework specifically designed for reconstructing 3D microstructures from 2D data. Trained on 2D images and employing multi-plane denoising diffusion sampling in the latent space, the framework reliably generates stable and coherent 3D volumes that remain statistically consistent with the original data. While this reconstruction capability enables dimensionality expansion (2D-to-3D) for generating statistically equivalent 3D samples from 2D data, effective exploration of microstructure design requires methods to guide the generation process toward specific objectives. To achieve this, MicroLad integrates score distillation sampling (SDS), which combines a differentiable score loss with microstructural descriptor-matching and property-alignment terms. This approach updates encoded 2D slices of the 3D volume in the latent space, enabling robust inverse-controlled 2D-to-3D microstructure generation. Consequently, the method facilitates exploration of an expanded 3D microstructure analysis and design space in terms of both microstructural descriptors and material properties.

[LG-65] Mitigating Distribution Shift in Stock Price Data via Return-Volatility Normalization for Accurate Prediction CIKM2025

Link: https://arxiv.org/abs/2508.20108
Authors: Hyunwoo Lee, Jihyeong Jeon, Jaemin Hong, U Kang
Categories: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, accepted to CIKM 2025

Abstract:How can we address distribution shifts in stock price data to improve stock price prediction accuracy? Stock price prediction has attracted attention from both academia and industry, driven by its potential to uncover complex market patterns and enhance decision-making. However, existing methods often fail to handle distribution shifts effectively, focusing on scaling or representation adaptation without fully addressing distributional discrepancies and shape misalignments between training and test data. We propose ReVol (Return-Volatility Normalization for Mitigating Distribution Shift in Stock Price Data), a robust method for stock price prediction that explicitly addresses the distribution shift problem. ReVol leverages three key strategies to mitigate these shifts: (1) normalizing price features to remove sample-specific characteristics, including return, volatility, and price scale, (2) employing an attention-based module to estimate these characteristics accurately, thereby reducing the influence of market anomalies, and (3) reintegrating the sample characteristics into the predictive process, restoring the traits lost during normalization. Additionally, ReVol combines geometric Brownian motion for long-term trend modeling with neural networks for short-term pattern recognition, unifying their complementary strengths. Extensive experiments on real-world datasets demonstrate that ReVol enhances the performance of the state-of-the-art backbone models in most cases, achieving an average improvement of more than 0.03 in IC and over 0.7 in SR across various settings.
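
Steps (1) and (3) of the abstract above, removing and then restoring per-sample return, volatility, and price scale, can be sketched as follows. This is an assumption-laden illustration: it uses plain sample statistics of log returns, whereas the abstract says ReVol estimates these characteristics with an attention-based module, and the function names are hypothetical.

```python
from math import exp, log, sqrt


def normalize_window(prices):
    """Strip price scale, mean return, and volatility from one price window.

    Returns (z, mu, sigma, p0): standardized log returns plus the
    characteristics needed to undo the normalization.
    """
    returns = [log(b / a) for a, b in zip(prices, prices[1:])]
    mu = sum(returns) / len(returns)
    sigma = sqrt(sum((r - mu) ** 2 for r in returns) / len(returns))
    if sigma == 0:
        # Constant growth: nothing left after removing the mean return.
        return [0.0] * len(returns), mu, sigma, prices[0]
    return [(r - mu) / sigma for r in returns], mu, sigma, prices[0]


def denormalize_window(z, mu, sigma, p0):
    """Reintegrate the sample characteristics to recover a price path."""
    prices = [p0]
    for zi in z:
        prices.append(prices[-1] * exp(mu + sigma * zi))
    return prices
```

A predictor would operate on `z`, and its output would be passed through `denormalize_window` with the same `mu`, `sigma`, and `p0` to return to price space.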

[LG-66] Is Audio Spoof Detection Robust to Laundering Attacks?

Link: https://arxiv.org/abs/2408.14712
Authors: Hashim Ali, Surya Subramani, Shefali Sudhir, Raksha Varahamurthy, Hafiz Malik
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Conference Paper

Abstract:Voice-cloning (VC) systems have seen an exceptional increase in the realism of synthesized speech in recent years. The high quality of synthesized speech and the availability of low-cost VC services have given rise to many potential abuses of this technology. Several detection methodologies have been proposed over the years that can detect voice spoofs with reasonably good accuracy. However, these methodologies are mostly evaluated on clean audio databases, such as ASVSpoof 2019. This paper evaluates SOTA Audio Spoof Detection approaches in the presence of laundering attacks. In that regard, a new laundering attack database, called the ASVSpoof Laundering Database, is created. This database is based on the ASVSpoof 2019 (LA) eval database comprising a total of 1388.22 hours of audio recordings. Seven SOTA audio spoof detection approaches are evaluated on this laundered database. The results indicate that SOTA systems perform poorly in the presence of aggressive laundering attacks, especially reverberation and additive noise attacks. This suggests the need for robust audio spoof detection.

Information Retrieval

[IR-0] OneRec-V2 Technical Report

Link: https://arxiv.org/abs/2508.20900
Authors: Guorui Zhou, Hengrui Hu, Hongtao Cheng, Huanjie Wang, Jiaxin Deng, Jinghao Zhang, Kuo Cai, Lejian Ren, Lu Ren, Liao Yu, Pengfei Zheng, Qiang Luo, Qianqian Wang, Qigen Hu, Rui Huang, Ruiming Tang, Shiyao Wang, Shujie Yang, Tao Wu, Wuchao Li, Xinchen Luo, Xingmei Wang, Yi Su, Yunfan Wu, Zexuan Cheng, Zhanyu Liu, Zixing Zhang, Bin Zhang, Boxuan Wang, Chaoyi Ma, Chengru Song, Chenhui Wang, Chenglong Chu, Di Wang, Dongxue Meng, Dunju Zang, Fan Yang, Fangyu Zhang, Feng Jiang, Fuxing Zhang, Gang Wang, Guowang Zhang, Han Li, Honghui Bao, Hongyang Cao, Jiaming Huang, Jiapeng Chen, Jiaqiang Liu, Jinghui Jia, Kun Gai, Lantao Hu, Liang Zeng, Qiang Wang, Qidong Zhou, Rongzhou Zhang, Shengzhe Wang, Shihui He, Shuang Yang, Siyang Mao, Sui Huang, Tiantian He, Tingting Gao, Wei Yuan, Xiao Liang, Xiaoxiao Xu, Xugang Liu, Yan Wang, Yang Zhou, Yi Wang, Yiwu Liu, Yue Song, Yufei Zhang, Yunfeng Zhao, Zhixin Ling, Ziming Li
Categories: Information Retrieval (cs.IR)

Abstract:Recent breakthroughs in generative AI have transformed recommender systems through end-to-end generation. OneRec reformulates recommendation as an autoregressive generation task, achieving high Model FLOPs Utilization. While OneRec-V1 has shown significant empirical success in real-world deployment, two critical challenges hinder its scalability and performance: (1) inefficient computational allocation where 97.66% of resources are consumed by sequence encoding rather than generation, and (2) limitations in reinforcement learning relying solely on reward models. To address these challenges, we propose OneRec-V2, featuring: (1) Lazy Decoder-Only Architecture: Eliminates encoder bottlenecks, reducing total computation by 94% and training resources by 90%, enabling successful scaling to 8B parameters. (2) Preference Alignment with Real-World User Interactions: Incorporates Duration-Aware Reward Shaping and Adaptive Ratio Clipping to better align with user preferences using real-world feedback. Extensive A/B tests on Kuaishou demonstrate OneRec-V2’s effectiveness, improving App Stay Time by 0.467%/0.741% while balancing multi-objective recommendations. This work advances generative recommendation scalability and alignment with real-world feedback, representing a step forward in the development of end-to-end recommender systems.

[IR-1] Deep Multiple Quantization Network on Long Behavior Sequence for Click-Through Rate Prediction SIGIR2025

Link: https://arxiv.org/abs/2508.20865
Authors: Zhuoxing Wei, Qi Liu, Qingchen Xie
Categories: Information Retrieval (cs.IR)
Comments: 5 pages, 1 figure, SIGIR 2025

Abstract:In Click-Through Rate (CTR) prediction, the long behavior sequence, comprising the user’s long period of historical interactions with items, has a vital influence on assessing the user’s interest in the candidate item. Existing approaches balance efficiency and effectiveness through a two-stage paradigm: first retrieving hundreds of candidate-related items and then extracting an interest intensity vector through target attention. However, we argue that the discrepancy in target attention’s relevance distribution between the retrieved items and the full long behavior sequence inevitably leads to a performance decline. To alleviate the discrepancy, we propose the Deep Multiple Quantization Network (DMQN) to process the long behavior sequence end-to-end by compressing it. First, the entire long behavior sequence is quantized into multiple codeword sequences based on multiple independent codebooks. A Hierarchical Sequential Transduction Unit is incorporated to facilitate the interaction of the reduced codeword sequences. Then, attention between the candidate and the multiple codeword sequences outputs the interest vector. To enable online serving, intermediate representations of the codeword sequences are cached, significantly reducing latency. Our extensive experiments on both industrial and public datasets confirm the effectiveness and efficiency of DMQN. The A/B test in our advertising system shows that DMQN improves CTR by 3.5% and RPM by 2.0%.
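
The quantization step mentioned in the abstract above, mapping one behavior sequence to several codeword sequences via independent codebooks, can be sketched as nearest-codeword assignment. This is only an assumed reading: the abstract does not say how DMQN learns its codebooks or shortens the resulting sequences, and the toy embeddings, codebooks, and function names below are hypothetical.

```python
def nearest_codeword(vec, codebook):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda j: sqdist(vec, codebook[j]))


def quantize_sequence(sequence, codebooks):
    """Quantize every item embedding once per codebook.

    Returns one codeword-index sequence per codebook, so a single
    behavior sequence yields len(codebooks) codeword sequences.
    """
    return [[nearest_codeword(v, cb) for v in sequence] for cb in codebooks]


# Toy 2-D item embeddings and two independent codebooks (hypothetical values).
behavior = [(0.1, 0.0), (0.9, 1.1), (0.2, 0.1)]
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)],
]
codes = quantize_sequence(behavior, codebooks)
```

Because many items can share a codeword, per-codeword representations can be cached, which is consistent with the abstract's remark about caching intermediate representations for online serving.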

[IR-2] Addressing Personalized Bias for Unbiased Learning to Rank CIKM2025

Link: https://arxiv.org/abs/2508.20798
Authors: Zechun Niu, Lang Mei, Liu Yang, Ziyuan Zhao, Qiang Yan, Jiaxin Mao, Ji-Rong Wen
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by CIKM 2025

Abstract:Unbiased learning to rank (ULTR), which aims to learn unbiased ranking models from biased user behavior logs, plays an important role in Web search. Previous research on ULTR has studied a variety of biases in users’ clicks, such as position bias, presentation bias, and outlier bias. However, existing work often assumes that the behavior logs are collected from an “average” user, neglecting the differences between different users in their search and browsing behaviors. In this paper, we introduce personalized factors into the ULTR framework, which we term the user-aware ULTR problem. Through a formal causal analysis of this problem, we demonstrate that existing user-oblivious methods are biased when different users have different preferences over queries and personalized propensities of examining documents. To address such a personalized bias, we propose a novel user-aware inverse-propensity-score estimator for learning-to-rank objectives. Specifically, our approach models the distribution of user browsing behaviors for each query and aggregates user-weighted examination probabilities to determine propensities. We theoretically prove that the user-aware estimator is unbiased under some mild assumptions and shows lower variance compared to the straightforward way of calculating a user-dependent propensity for each impression. Finally, we empirically verify the effectiveness of our user-aware estimator by conducting extensive experiments on two semi-synthetic datasets and a real-world dataset.
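
A minimal sketch of how user-weighted propensities could enter an inverse-propensity-score loss, as we read the abstract above. The paper's actual estimator may differ: the user types, probabilities, and the choice of a pointwise logistic loss are all hypothetical.

```python
import math


def aggregated_propensity(user_probs, exam_probs, position):
    """Query-level propensity: sum over users of P(user | query) * P(examine | user, position)."""
    return sum(p_user * exam_probs[user][position]
               for user, p_user in user_probs.items())


def logistic_loss(score):
    """Pointwise loss for a clicked (assumed relevant) document."""
    return math.log1p(math.exp(-score))


def ips_loss(impressions, user_probs, exam_probs):
    """Sum the loss over clicked documents, each reweighted by 1 / propensity."""
    total = 0.0
    for position, clicked, score in impressions:
        if clicked:
            propensity = aggregated_propensity(user_probs, exam_probs, position)
            total += logistic_loss(score) / propensity
    return total


# Hypothetical query with two user types that browse differently.
user_probs = {"skimmer": 0.5, "reader": 0.5}                   # P(user | query)
exam_probs = {"skimmer": [1.0, 0.25], "reader": [1.0, 0.75]}   # P(examine | user, rank)
impressions = [(0, True, 2.0), (1, True, 0.5), (1, False, 1.0)]  # (rank, clicked, model score)

propensity_at_rank_1 = aggregated_propensity(user_probs, exam_probs, 1)
loss = ips_loss(impressions, user_probs, exam_probs)
print(propensity_at_rank_1)  # 0.5
```

Here the click at rank 1 counts double because, averaged over the assumed user mix, that position is examined only half the time.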

[IR-3] SUMMA: A Multimodal Large Language Model for Advertisement Summarization

Link: https://arxiv.org/abs/2508.20582
Authors: Weitao Jia, Shuo Yin, Zhoufutu Wen, Han Wang, Zehui Dai, Kun Zhang, Zhenyu Li, Tao Zeng, Xiaohui Lv
Subjects: Information Retrieval (cs.IR)

Abstract:Understanding multimodal video ads is crucial for improving query-ad matching and relevance ranking on short video platforms, enhancing advertising effectiveness and user experience. However, the effective utilization of multimodal information with high commercial value, still largely constrained by reliance on highly compressed video embeddings, has long been inadequate. To address this, we propose SUMMA (the abbreviation of Summarizing MultiModal Ads), a multimodal model that automatically processes video ads into summaries highlighting the content of highest commercial value, thus improving their comprehension and ranking in Douyin search-advertising systems. SUMMA is developed via a two-stage training strategy (multimodal supervised fine-tuning followed by reinforcement learning with a mixed reward mechanism) on domain-specific data containing video frames and ASR/OCR transcripts, generating commercially valuable and explainable summaries. We integrate SUMMA-generated summaries into our production pipeline, directly enhancing the candidate retrieval and relevance ranking stages in real search-advertising systems. Both offline and online experiments show substantial improvements over baselines, with online results indicating a statistically significant 1.5% increase in advertising revenue. Our work establishes a novel paradigm for condensing multimodal information into representative texts, effectively aligning visual ad content with user query intent in retrieval and recommendation scenarios.

[IR-4] Enhancing Semantic Document Retrieval- Employing Group Steiner Tree Algorithm with Domain Knowledge Enrichment

Link: https://arxiv.org/abs/2508.20543
Authors: Apurva Kulkarni, Chandrashekar Ramanathan, Vinu E Venugopal
Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY)

Abstract:Retrieving pertinent documents from various data sources with diverse characteristics poses a significant challenge for Document Retrieval Systems. The complexity of this challenge is further compounded when accounting for the semantic relationship between data and domain knowledge. While existing retrieval systems using semantics (usually represented as Knowledge Graphs created from open-access resources and generic domain knowledge) hold promise in delivering relevant outcomes, their precision may be compromised due to the absence of domain-specific information and reliance on outdated knowledge sources. In this research, the primary focus is on two key contributions: a) the development of a versatile algorithm, ‘Semantic-based Concept Retrieval using Group Steiner Tree’, that incorporates domain information to enhance semantic-aware knowledge representation and data access, and b) the practical implementation of the proposed algorithm within a document retrieval system using real-world data. To assess the effectiveness of the SemDR system, this work conducts performance evaluations using a benchmark consisting of 170 real-world search queries. Rigorous evaluation and verification by domain experts are conducted to ensure the validity and accuracy of the results. The experimental findings demonstrate substantial advancements when compared to the baseline systems, with precision and accuracy achieving levels of 90% and 82% respectively, signifying promising improvements.

[IR-5] Multistakeholder Fairness in Tourism: What can Algorithms learn from Tourism Management?

Link: https://arxiv.org/abs/2508.20496
Authors: Peter Muellner, Anna Schreuer, Simone Kopeinik, Bernhard Wieser, Dominik Kowald
Subjects: Information Retrieval (cs.IR)
Comments: Accepted for publication in Frontiers in Big Data

Abstract:Algorithmic decision-support systems, i.e., recommender systems, are popular digital tools that help tourists decide which places and attractions to explore. However, algorithms often unintentionally direct tourist streams in a way that negatively affects the environment, local communities, or other stakeholders. This issue can be partly attributed to the computer science community’s limited understanding of the complex relationships and trade-offs among stakeholders in the real world. In this work, we draw on the practical findings and methods from tourism management to inform research on multistakeholder fairness in algorithmic decision-support. Leveraging a semi-systematic literature review, we synthesize literature from tourism management as well as literature from computer science. Our findings suggest that tourism management actively tries to identify the specific needs of stakeholders and utilizes qualitative, inclusive and participatory methods to study fairness from a normative and holistic research perspective. In contrast, computer science lacks sufficient understanding of the stakeholder needs and primarily considers fairness through descriptive factors, such as measurable discrimination, while heavily relying on few mathematically formalized fairness criteria that fail to capture the multidimensional nature of fairness in tourism. With the results of this work, we aim to illustrate the shortcomings of purely algorithmic research and stress the potential and particular need for future interdisciplinary collaboration. We believe such a collaboration is a fundamental and necessary step to enhance algorithmic decision-support systems towards understanding and supporting true multistakeholder fairness in tourism.
Related DOI: https://doi.org/10.3389/fdata.2025.1632766

[IR-6] Fact or Facsimile? Evaluating the Factual Robustness of Modern Retrievers

Link: https://arxiv.org/abs/2508.20408
Authors: Haoyu Wu, Qingcheng Zeng, Kaize Ding
Subjects: Information Retrieval (cs.IR)
Comments: Proceedings of the 34th ACM International Conference on Information and Knowledge Management

Abstract:Dense retrievers and rerankers are central to retrieval-augmented generation (RAG) pipelines, where accurately retrieving factual information is crucial for maintaining system trustworthiness and defending against RAG poisoning. However, little is known about how much factual competence these components inherit or lose from the large language models (LLMs) they are based on. We pair 12 publicly released embedding checkpoints with their original base LLMs and evaluate both sets on a factuality benchmark. Across every model evaluated, the embedding variants achieve markedly lower accuracy than their bases, with absolute drops ranging from 12 to 43 percentage points (median 28 pts) and typical retriever accuracies collapsing into the 25-35% band versus the 60-70% attained by the generative models. This degradation intensifies under a more demanding condition: when the candidate pool per question is expanded from four options to one thousand, the strongest retriever’s top-1 accuracy falls from 33% to 26%, revealing acute sensitivity to distractor volume. Statistical tests further show that, for every embedding model, cosine-similarity scores between queries and correct completions are significantly higher than those for incorrect ones (p < 0.01), indicating decisions driven largely by surface-level semantic proximity rather than factual reasoning. To probe this weakness, we employed GPT-4.1 to paraphrase each correct completion, creating a rewritten test set that preserved factual truth while masking lexical cues, and observed that over two-thirds of previously correct predictions flipped to wrong, reducing overall accuracy to roughly one-third of its original level. Taken together, these findings reveal a systematic trade-off introduced by contrastive learning for retrievers: gains in semantic retrieval are paid for with losses in parametric factual knowledge…

[IR-7] A Case Study of Balanced Query Recommendation on Wikipedia RECSYS2025

Link: https://arxiv.org/abs/2508.20399
Authors: Harshit Mishra, Sucheta Soundarajan
Subjects: Information Retrieval (cs.IR)
Comments: Accepted at FAccTRec 2025 workshop at RecSys 2025

Abstract:Modern IR systems are an extremely important tool for seeking information. In addition to search, such systems include a number of query reformulation methods, such as query expansion and query recommendations, to provide high quality results. However, results returned by such methods sometimes exhibit undesirable or wrongful bias with respect to protected categories such as gender or race. Our earlier work considered the problem of balanced query recommendation, where instead of re-ranking a list of results based on fairness measures, the goal was to suggest queries that are relevant to a user’s search query but exhibit less bias than the original query. In this work, we present a case study of BalancedQR using an extension of BalancedQR that handles biases in multiple dimensions. It employs a Pareto front approach that finds balanced queries, optimizing for multiple objectives such as gender bias and regional bias, along with the relevance of returned results. We evaluate the extended version of BalancedQR on Wikipedia. The results demonstrate the effectiveness of our extension to the BalancedQR framework and highlight the significant impact of subtle query wording and linguistic choice on retrieval.
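
To make the Pareto front idea in the abstract above concrete, here is a generic sketch of selecting non-dominated candidate queries over several objectives. It is not the BalancedQR code: the candidate queries and their scores are invented, and every objective is assumed to be "lower is better" (relevance is expressed as irrelevance).

```python
def dominates(a, b):
    """True if score tuple `a` is no worse than `b` everywhere and strictly better somewhere.

    All objectives are treated as costs (lower is better).
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [name for name, scores in candidates.items()
            if not any(dominates(other_scores, scores)
                       for other, other_scores in candidates.items()
                       if other != name)]


# Hypothetical candidate queries: (gender_bias, regional_bias, irrelevance).
candidates = {
    "query A": (0.6, 0.2, 0.1),
    "query B": (0.2, 0.3, 0.2),
    "query C": (0.7, 0.4, 0.3),  # worse than query A on every objective
    "query D": (0.1, 0.1, 0.6),
}

front = pareto_front(candidates)
print(front)  # ['query A', 'query B', 'query D']
```

The front keeps every trade-off that is not strictly beaten, so the final choice among A, B, and D still depends on how bias and relevance are weighed against each other.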

[IR-8] Progressive Semantic Residual Quantization for Multimodal-Joint Interest Modeling in Music Recommendation

Link: https://arxiv.org/abs/2508.20359
Authors: Shijia Wang, Tianpei Ouyang, Qiang Xiao, Dongjing Wang, Yintao Ren, Songpei Xu, Da Guo, Chuanjiang Luo
Subjects: Information Retrieval (cs.IR)

Abstract:In music recommendation systems, multimodal interest learning is pivotal, which allows the model to capture nuanced preferences, including textual elements such as lyrics and various musical attributes such as different instruments and melodies. Recently, methods that incorporate multimodal content features through semantic IDs have achieved promising results. However, existing methods suffer from two critical limitations: 1) intra-modal semantic degradation, where residual-based quantization processes gradually decouple discrete IDs from original content semantics, leading to semantic drift; and 2) inter-modal modeling gaps, where traditional fusion strategies either overlook modal-specific details or fail to capture cross-modal correlations, hindering comprehensive user interest modeling. To address these challenges, we propose a novel multimodal recommendation framework with two stages. In the first stage, our Progressive Semantic Residual Quantization (PSRQ) method generates modal-specific and modal-joint semantic IDs by explicitly preserving the prefix semantic feature. In the second stage, to model multimodal interest of users, a Multi-Codebook Cross-Attention (MCCA) network is designed to enable the model to simultaneously capture modal-specific interests and perceive cross-modal correlations. Extensive experiments on multiple real-world datasets demonstrate that our framework outperforms state-of-the-art baselines. This framework has been deployed on one of China’s largest music streaming platforms, and online A/B tests confirm significant improvements in commercial metrics, underscoring its practical value for industrial-scale recommendation systems.
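
For background on the first limitation named in the abstract above, the sketch below shows plain residual quantization, the baseline whose later levels encode only leftover residuals. It is not PSRQ: the abstract does not describe how the prefix semantic feature is preserved, so that part is not attempted. The vector and codebooks are hypothetical.

```python
def residual_quantize(vector, codebooks):
    """Plain residual quantization.

    At each level, pick the codeword nearest to the current residual,
    record its index, and subtract it. Returns (semantic_id, final_residual).
    """
    residual = list(vector)
    semantic_id = []
    for codebook in codebooks:
        index = min(
            range(len(codebook)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, codebook[i])),
        )
        semantic_id.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return semantic_id, residual


# Hypothetical 2-d content embedding and two quantization levels.
item_embedding = [0.25, 1.0]
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],  # level 1: coarse codewords
    [[0.0, 0.5], [0.5, 0.0]],  # level 2: sees only the level-1 residual
]

semantic_id, leftover = residual_quantize(item_embedding, codebooks)
print(semantic_id)  # [1, 1]
print(leftover)     # [-0.25, 0.0]
```

Note that the second level operates on `[0.25, 0.0]`, a vector that no longer resembles the original embedding, which gives some intuition for the semantic drift the authors describe.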

[IR-9] A Survey of Affective Recommender Systems: Modeling Attitudes Emotions and Moods for Personalization

Link: https://arxiv.org/abs/2508.20289
Authors: Tonmoy Hasan, Razvan Bunescu
Subjects: Information Retrieval (cs.IR)

Abstract:Affective Recommender Systems are an emerging class of intelligent systems that aim to enhance personalization by aligning recommendations with users’ affective states. Reflecting a growing interest, a number of surveys have been published in this area; however, they lack an organizing taxonomy grounded in psychology, and they often study only specific types of affective states or application domains. This survey addresses these limitations by providing a comprehensive, systematic review of affective recommender systems across diverse domains. Drawing from Scherer’s typology of affective states, we introduce a classification scheme that organizes systems into four main categories: attitude aware, emotion aware, mood aware, and hybrid. We further document affective signal extraction techniques, system architectures, and application areas, highlighting key trends, limitations, and open challenges. As future research directions, we emphasize hybrid models that leverage multiple types of affective states across different modalities, the development of large-scale affect-aware datasets, and the need to replace the folk vocabulary of affective states with a more precise terminology grounded in cognitive and social psychology. Through its systematic review of existing research and challenges, this survey aims to serve as a comprehensive reference and a useful guide for advancing academic research and industry applications in affect-driven personalization.

Attachments

Download today's full paper list