本篇博文主要内容为 2025-08-28 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。
友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-08-28)
今日共更新473篇论文,其中:
- 自然语言处理共81篇(Computation and Language (cs.CL))
- 人工智能共149篇(Artificial Intelligence (cs.AI))
- 计算机视觉共109篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共112篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] Disabling Self-Correction in Retrieval-Augmented Generation via Stealthy Retriever Poisoning
【速读】: 该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)系统在面对知识库投毒攻击时的脆弱性问题,特别是针对现代大语言模型(Large Language Models, LLMs)所具备的自身纠错能力(Self-Correction Ability, SCA)如何被利用以提升攻击成功率的问题。传统攻击主要通过污染知识库来诱导错误输出,但SCA可有效识别并拒绝虚假上下文,从而削弱攻击效果。为此,作者提出一种新型攻击范式DisarmRAG,其关键在于直接攻陷检索器(retriever),通过局部且隐蔽的对比学习编辑技术,使检索器仅对特定目标查询返回恶意指令,从而抑制SCA并强制生成攻击者指定的内容;进一步地,结合迭代协同优化框架自动挖掘能绕过提示防御机制的鲁棒指令,实现高成功率(>90%)且具备良好隐蔽性的攻击行为。
链接: https://arxiv.org/abs/2508.20083
作者: Yanbo Dai,Zhenlan Ji,Zongjie Li,Kuan Li,Shuai Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Retrieval-Augmented Generation (RAG) has become a standard approach for improving the reliability of large language models (LLMs). Prior work demonstrates the vulnerability of RAG systems by misleading them into generating attacker-chosen outputs through poisoning the knowledge base. However, this paper uncovers that such attacks could be mitigated by the strong \textitself-correction ability (SCA) of modern LLMs, which can reject false context once properly configured. This SCA poses a significant challenge for attackers aiming to manipulate RAG systems. In contrast to previous poisoning methods, which primarily target the knowledge base, we introduce \textscDisarmRAG, a new poisoning paradigm that compromises the retriever itself to suppress the SCA and enforce attacker-chosen outputs. This compromisation enables the attacker to straightforwardly embed anti-SCA instructions into the context provided to the generator, thereby bypassing the SCA. To this end, we present a contrastive-learning-based model editing technique that performs localized and stealthy edits, ensuring the retriever returns a malicious instruction only for specific victim queries while preserving benign retrieval behavior. To further strengthen the attack, we design an iterative co-optimization framework that automatically discovers robust instructions capable of bypassing prompt-based defenses. We extensively evaluate DisarmRAG across six LLMs and three QA benchmarks. Our results show near-perfect retrieval of malicious instructions, which successfully suppress SCA and achieve attack success rates exceeding 90% under diverse defensive prompts. Also, the edited retriever remains stealthy under several detection methods, highlighting the urgent need for retriever-centric defenses. Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL) Cite as: arXiv:2508.20083 [cs.CR] (or arXiv:2508.20083v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2508.20083 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-1] 11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在空间推理能力评估中缺乏系统性、与人类认知机制对比不足的问题。现有MLLM虽在推理任务上表现优异,但其是否具备类人空间认知能力仍不明确。解决方案的关键在于构建了一个高质量的基准测试集11Plus-Bench,该基准源自真实标准化的空间能力测评,并包含细粒度的专家标注,涵盖感知复杂度和推理过程两个维度,从而支持实例级的行为分析。通过在14个主流MLLM上的广泛实验与人类表现对比,研究揭示了当前模型已初具空间认知迹象,但其决策随机性强于人类,且正确性受抽象模式复杂度显著影响,为未来模型设计提供了可操作的改进方向。
链接: https://arxiv.org/abs/2508.20068
作者: Chengzu Li,Wenshan Wu,Huanyu Zhang,Qingtao Li,Zeyu Gao,Yan Xia,José Hernández-Orallo,Ivan Vulić,Furu Wei
机构: Microsoft Research (微软研究院); Language Technology Lab, University of Cambridge (剑桥大学语言技术实验室); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Department of Oncology, University of Cambridge (剑桥大学肿瘤学系); Leverhulme Centre for the Future of Intelligence, University of Cambridge (剑桥大学未来智能莱弗休姆中心); VRAIN, Universitat Politècnica de València (瓦伦西亚理工大学VRAIN)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 4 figures (22 pages, 7 figures, 7 tables including references and appendices)
Abstract:For human cognitive process, spatial reasoning and perception are closely entangled, yet the nature of this interplay remains underexplored in the evaluation of multimodal large language models (MLLMs). While recent MLLM advancements show impressive performance on reasoning, their capacity for human-like spatial cognition remains an open question. In this work, we introduce a systematic evaluation framework to assess the spatial reasoning abilities of state-of-the-art MLLMs relative to human performance. Central to our work is 11Plus-Bench, a high-quality benchmark derived from realistic standardized spatial aptitude tests. 11Plus-Bench also features fine-grained expert annotations of both perceptual complexity and reasoning process, enabling detailed instance-level analysis of model behavior. Through extensive experiments across 14 MLLMs and human evaluation, we find that current MLLMs exhibit early signs of spatial cognition. Despite a large performance gap compared to humans, MLLMs’ cognitive profiles resemble those of humans in that cognitive effort correlates strongly with reasoning-related complexity. However, instance-level performance in MLLMs remains largely random, whereas human correctness is highly predictable and shaped by abstract pattern complexity. These findings highlight both emerging capabilities and limitations in current MLLMs’ spatial reasoning capabilities and provide actionable insights for advancing model design.
zh
[NLP-2] AraHealthQA 2025 Shared Task Description Paper
【速读】: 该论文旨在解决阿拉伯语医疗问答(Health Question Answering, HQA)资源匮乏的问题,特别是在心理健康和广泛医学领域中高质量标注数据的稀缺。解决方案的关键在于设计并实施了AraHealthQA 2025共享任务,包含两个互补赛道:MentalQA专注于阿拉伯语心理健康问答(如焦虑、抑郁和去污名化),MedArabiQ覆盖内科、儿科及临床决策等更广泛的医学领域;每个赛道均设有多个子任务、评估数据集与标准化指标,以支持公平基准测试,并在现实、多语言和文化敏感的医疗场景下推动模型开发。
链接: https://arxiv.org/abs/2508.20047
作者: Hassan Alhuzali,Farah Shamout,Muhammad Abdul-Mageed,Chaimae Abouzahir,Mouath Abu-Daoud,Ashwag Alasmari,Walid Al-Eisawi,Renad Al-Monef,Ali Alqahtani,Lama Ayash,Nizar Habash,Leen Kharouf
机构: Umm Al-Qura University (乌姆库拉大学); New York University Abu Dhabi (纽约大学阿布扎比分校); The University of British Columbia (不列颠哥伦比亚大学); King Khalid University (国王卡利德大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:We introduce AraHealthQA 2025, the Comprehensive Arabic Health Question Answering Shared Task, held in conjunction with ArabicNLP 2025 (co-located with EMNLP 2025). This shared task addresses the paucity of high-quality Arabic medical QA resources by offering two complementary tracks: MentalQA, focusing on Arabic mental health Q\A (e.g., anxiety, depression, stigma reduction), and MedArabiQ, covering broader medical domains such as internal medicine, pediatrics, and clinical decision making. Each track comprises multiple subtasks, evaluation datasets, and standardized metrics, facilitating fair benchmarking. The task was structured to promote modeling under realistic, multilingual, and culturally nuanced healthcare contexts. We outline the dataset creation, task design and evaluation framework, participation statistics, baseline systems, and summarize the overall outcomes. We conclude with reflections on the performance trends observed and prospects for future iterations in Arabic health QA.
zh
[NLP-3] Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks EMNLP2025
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)在面对越狱攻击(jailbreak attacks)时的脆弱性问题,其核心挑战在于训练数据与真实世界攻击之间存在分布差异(distributional mismatch),导致模型难以识别未见过的恶意指令。解决方案的关键在于提出IMAGINE框架,该框架通过嵌入空间分布分析(embedding space distribution analysis)生成具有越狱特征的指令样本,从而填补安全对齐语料库与真实越狱模式之间的分布鸿沟;该方法采用迭代优化过程动态调整文本生成分布,以合成多样化的安全对齐数据,显著降低Qwen2.5、Llama3.1和Llama3.2等模型的攻击成功率,同时保持其功能性不受影响。
链接: https://arxiv.org/abs/2508.20038
作者: Sheng Liu,Qiang Sheng,Danding Wang,Yang Li,Guang Yang,Juan Cao
机构: Media Synthesis and Forensics Lab, Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所); University of Chinese Academy of Sciences(中国科学院大学); Zhongguancun Laboratory(中关村实验室)
类目: Computation and Language (cs.CL)
备注: EMNLP 2025 findings
Abstract:Despite advances in improving large language model(LLM) to refuse to answer malicious instructions, widely used LLMs remain vulnerable to jailbreak attacks where attackers generate instructions with distributions differing from safety alignment corpora. New attacks expose LLMs’ inability to recognize unseen malicious instructions, highlighting a critical distributional mismatch between training data and real-world attacks that forces developers into reactive patching cycles. To tackle this challenge, we propose IMAGINE, a synthesis framework that leverages embedding space distribution analysis to generate jailbreak-like instructions. This approach effectively fills the distributional gap between authentic jailbreak patterns and safety alignment corpora. IMAGINE follows an iterative optimization process that dynamically evolves text generation distributions across iterations, thereby augmenting the coverage of safety alignment data distributions through synthesized data examples. Based on the safety-aligned corpus enhanced through IMAGINE, our framework demonstrates significant decreases in attack success rate on Qwen2.5, Llama3.1, and Llama3.2 without compromising their utility.
zh
[NLP-4] DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
【速读】: 该论文旨在解决生成式研究合成(Generative Research Synthesis)系统评估难题,现有问答基准难以捕捉真实研究任务的复杂性和动态性,而人工标注数据集则存在过时和数据污染风险。其解决方案的关键在于提出 DeepScholar-bench——一个基于近期高质量 ArXiv 论文抽取查询的实时基准测试框架,并围绕“生成论文相关工作章节”这一真实研究任务设计评估体系,从知识合成、检索质量与可验证性三个维度进行自动化、全面的性能评估。同时,作者开发了 DeepScholar-base 参考流水线以提供高效实现基础,为该领域建立可靠、可持续的评测标准。
链接: https://arxiv.org/abs/2508.20033
作者: Liana Patel,Negar Arabzadeh,Harshit Gupta,Ankita Sundar,Ion Stoica,Matei Zaharia,Carlos Guestrin
机构: Stanford University (斯坦福大学); UC Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The ability to research and synthesize knowledge is central to human expertise and progress. An emerging class of systems promises these exciting capabilities through generative research synthesis, performing retrieval over the live web and synthesizing discovered sources into long-form, cited summaries. However, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short-form factual responses, while expert-curated datasets risk staleness and data contamination. Both fail to capture the complexity and evolving nature of real research synthesis tasks. In this work, we introduce DeepScholar-bench, a live benchmark and holistic, automated evaluation framework designed to evaluate generative research synthesis. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task: generating the related work sections of a paper by retrieving, synthesizing, and citing prior research. Our evaluation framework holistically assesses performance across three key dimensions, knowledge synthesis, retrieval quality, and verifiability. We also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API. Using the DeepScholar-bench framework, we perform a systematic evaluation of prior open-source systems, search AI’s, OpenAI’s DeepResearch, and DeepScholar-base. We find that DeepScholar-base establishes a strong baseline, attaining competitive or higher performance than each other method. We also find that DeepScholar-bench remains far from saturated, with no system exceeding a score of 19% across all metrics. These results underscore the difficulty of DeepScholar-bench, as well as its importance for progress towards AI systems capable of generative research synthesis. We make our code available at this https URL.
zh
[NLP-5] Pruning Strategies for Backdoor Defense in LLM s CIKM’25
【速读】: 该论文旨在解决预训练语言模型在下游自然语言处理(Natural Language Processing, NLP)任务中仍可能遭受后门攻击(backdoor attacks)的问题,尤其是那些在常规微调(fine-tuning)后依然存活的隐蔽式后门攻击。此类攻击通过细微的句法或风格操纵引入恶意触发器(trigger),难以被传统检测手段识别,因而需要事后净化(post-hoc purification)策略。论文提出的关键解决方案是采用注意力头剪枝(attention-head pruning)技术,在无需了解触发器信息或访问干净参考模型的前提下,通过迭代移除最不具信息量的注意力头来削弱后门行为。实验表明,不同剪枝策略对不同类型攻击具有差异化防御效果:基于梯度的剪枝在对抗句法类触发器时表现最优,而强化学习和贝叶斯不确定性剪枝则更擅长抵御风格类攻击。
链接: https://arxiv.org/abs/2508.20032
作者: Santosh Chapagain,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi
机构: Utah State University (犹他州立大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted in CIKM '25: The 34th ACM International Conference on Information and Knowledge Management Proceedings
Abstract:Backdoor attacks are a significant threat to the performance and integrity of pre-trained language models. Although such models are routinely fine-tuned for downstream NLP tasks, recent work shows they remain vulnerable to backdoor attacks that survive vanilla fine-tuning. These attacks are difficult to defend because end users typically lack knowledge of the attack triggers. Such attacks consist of stealthy malicious triggers introduced through subtle syntactic or stylistic manipulations, which can bypass traditional detection and remain in the model, making post-hoc purification essential. In this study, we explore whether attention-head pruning can mitigate these threats without any knowledge of the trigger or access to a clean reference model. To this end, we design and implement six pruning-based strategies: (i) gradient-based pruning, (ii) layer-wise variance pruning, (iii) gradient-based pruning with structured L1/L2 sparsification, (iv) randomized ensemble pruning, (v) reinforcement-learning-guided pruning, and (vi) Bayesian uncertainty pruning. Each method iteratively removes the least informative heads while monitoring validation accuracy to avoid over-pruning. Experimental evaluation shows that gradient-based pruning performs best while defending the syntactic triggers, whereas reinforcement learning and Bayesian pruning better withstand stylistic attacks.
zh
[NLP-6] Symphony: A Decentralized Multi-Agent Framework for Scalable Collective Intelligence
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体框架普遍依赖集中式编排所导致的高部署成本、通信拓扑僵化以及适应性差等问题。其解决方案的核心在于提出一个名为Symphony的去中心化多智能体系统,通过三个关键技术机制实现轻量级LLM在消费级GPU上的协同:(1) 基于去中心化账本的能力记录机制,用于维护各智能体能力的分布式共识;(2) Beacon选择协议,支持动态任务分配以提升资源利用率;(3) 基于思维链(Chain-of-Thought, CoT)加权的结果投票机制,增强决策准确性与鲁棒性。该设计实现了低开销、可扩展、隐私保护且具备容错能力的编排架构,并在推理基准测试中显著优于现有基线方法。
链接: https://arxiv.org/abs/2508.20019
作者: Ji Wang,Kashing Chen,Xinyuan Song,Ke Zhang,Lynn Ai,Eric Yang,Bill Shi
机构: Gradient; Emory University (埃默里大学); Columbia University (哥伦比亚大学); The Chinese University of Hong Kong (香港中文大学); Waseda University (早稻田大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
Abstract:Most existing Large Language Model (LLM)-based agent frameworks rely on centralized orchestration, incurring high deployment costs, rigid communication topologies, and limited adaptability. To address these challenges, we introduce Symphony, a decentralized multi-agent system which enables lightweight LLMs on consumer-grade GPUs to coordinate. Symphony introduces three key mechanisms: (1) a decentralized ledger that records capabilities, (2) a Beacon-selection protocol for dynamic task allocation, and (3) weighted result voting based on CoTs. This design forms a privacy-saving, scalable, and fault-tolerant orchestration with low overhead. Empirically, Symphony outperforms existing baselines on reasoning benchmarks, achieving substantial accuracy gains and demonstrating robustness across models of varying capacities.
zh
[NLP-7] SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control
【速读】: 该论文旨在解决当前移动图形用户界面(GUI)代理在将自然语言转化为界面操作时的可靠性问题,尤其是单智能体方法因结构限制而性能受限,以及多智能体强化学习(MARL)在效率和与现有大视觉语言模型(LVLM)架构兼容性方面的瓶颈。解决方案的关键在于提出一种分阶段的交错强化学习(SWIRL)框架,其核心是将MARL重构为一系列单智能体强化学习任务,每次仅更新一个智能体而固定其他智能体,从而实现稳定训练并促进智能体间的高效协作。该方法在理论上提供了逐步安全边界、跨轮次单调改进定理及回报收敛保证,实验证明其在GUI控制任务中优于现有方法,并展现出在多智能体数学推理等更广泛场景下的通用潜力。
链接: https://arxiv.org/abs/2508.20018
作者: Quanfeng Lu,Zhantao Ma,Shuai Zhong,Jin Wang,Dahai Yu,Michael K. Ng,Ping Luo
机构: The University of Hong Kong (香港大学); Hong Kong Baptist University (香港浸会大学); TCL Corporate Research (Hong Kong) Co., Ltd (TCL企业研究(香港)有限公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注: 28 pages, 12 figures
Abstract:The rapid advancement of large vision language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by inefficiency and remains incompatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.
zh
[NLP-8] Linear-Time Demonstration Selection for In-Context Learning via Gradient Estimation EMNLP’25
【速读】: 该论文旨在解决上下文学习(in-context learning)中演示示例(demonstration examples)的选择问题,即如何从一个包含 $ n $ 个示例的集合中快速选出 $ k $ 个最能作为下游推理条件的示例。传统方法依赖于输入嵌入(input embeddings)的相似性进行筛选,但受限于计算效率和准确性。本文提出了一种基于输出梯度(gradient of the output in the input embedding space)的新方法:通过一阶近似估计模型输出,并在多个随机采样子集上应用该估计,最终聚合结果形成每个示例的影响得分(influence score),从而选择最优 $ k $ 个示例。其关键创新在于利用预计算的梯度信息实现线性时间复杂度的高效子集选择,且误差低于 1%,显著优于基于嵌入相似性的方法,并支持大规模模型(如 34B 参数)的高效扩展。
链接: https://arxiv.org/abs/2508.19999
作者: Ziniu Zhang,Zhenshuo Zhang,Dongyue Li,Lu Wang,Jennifer Dy,Hongyang R. Zhang
机构: Northeastern University (东北大学); University of Michigan (密歇根大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages. To appear in EMNLP’25
Abstract:This paper introduces an algorithm to select demonstration examples for in-context learning of a query set. Given a set of n examples, how can we quickly select k out of n to best serve as the conditioning for downstream inference? This problem has broad applications in prompt tuning and chain-of-thought reasoning. Since model weights remain fixed during in-context learning, previous work has sought to design methods based on the similarity of token embeddings. This work proposes a new approach based on gradients of the output taken in the input embedding space. Our approach estimates model outputs through a first-order approximation using the gradients. Then, we apply this estimation to multiple randomly sampled subsets. Finally, we aggregate the sampled subset outcomes to form an influence score for each demonstration, and select k most relevant examples. This procedure only requires pre-computing model outputs and gradients once, resulting in a linear-time algorithm relative to model and training set sizes. Extensive experiments across various models and datasets validate the efficiency of our approach. We show that the gradient estimation procedure yields approximations of full inference with less than \mathbf1% error across six datasets. This allows us to scale up subset selection that would otherwise run full inference by up to \mathbf37.7\times on models with up to 34 billion parameters, and outperform existing selection methods based on input embeddings by \mathbf11% on average.
zh
[NLP-9] Selective Retrieval-Augmentation for Long-Tail Legal Text Classification
【速读】: 该论文旨在解决法律文本分类任务中因标签分布长尾现象(long-tail label distribution)导致模型对低频类别的性能不佳的问题。其解决方案的关键在于提出一种选择性检索增强方法(Selective Retrieval-Augmentation, SRA),该方法仅对训练集中低频标签的样本进行增强,避免对高频类别引入噪声,且无需修改模型架构;同时,检索过程限定于训练数据内部,防止信息泄露,无需依赖外部语料库,从而在保持模型结构不变的前提下显著提升长尾法律文本分类的性能。
链接: https://arxiv.org/abs/2508.19997
作者: Boheng Mao
机构: University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Legal text classification is a fundamental NLP task in the legal domain. Benchmark datasets in this area often exhibit a long-tail label distribution, where many labels are underrepresented, leading to poor model performance on rare classes. This paper proposes Selective Retrieval-Augmentation (SRA) as a solution to this problem. SRA focuses on augmenting samples belonging to low-frequency labels in the training set, preventing the introduction of noise for well-represented classes, and requires no changes to the model architecture. Retrieval is performed only from the training data to ensure there is no potential information leakage, removing the need for external corpora simultaneously. The proposed SRA method is tested on two legal text classification benchmark datasets with long-tail distributions: LEDGAR (single-label) and UNFAIR-ToS (multi-label). The results indicate that SRA attains higher micro-F1 and macro-F1 scores compared to all current LexGLUE baselines across both datasets, illustrating consistent improvements in long-tail legal text classification. The code repository is available at: this https URL
zh
[NLP-10] ReSURE: Regularizing Supervision Unreliability for Multi-turn Dialogue Fine-tuning
【速读】: 该论文旨在解决多轮对话系统微调过程中因低质量标注数据导致的性能下降问题,尤其是早期轮次中的监督错误会沿时间轴传播,进而破坏后续轮次的连贯性和响应质量。现有方法通常依赖静态预过滤来控制数据质量,但这种分离式策略无法有效缓解轮次级别的误差传播。解决方案的关键在于提出 ReSURE(Regularizing Supervision UnREliability),一种自适应学习机制,通过在线估计每轮损失分布(利用 Welford 的在线统计方法)并动态调整样本损失权重,从而在不进行显式数据过滤的前提下,自动降低不可靠监督信号的影响,显著提升了模型训练的稳定性和生成响应的质量。
链接: https://arxiv.org/abs/2508.19996
作者: Yiming Du,Yifan Xiang,Bin Liang,Dahua Lin,Kam-Fai Wong,Fei Tan
机构: The Chinese University of Hong Kong (香港中文大学); East China Normal University (华东师范大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Fine-tuning multi-turn dialogue systems requires high-quality supervision but often suffers from degraded performance when exposed to low-quality data. Supervision errors in early turns can propagate across subsequent turns, undermining coherence and response quality. Existing methods typically address data quality via static prefiltering, which decouples quality control from training and fails to mitigate turn-level error propagation. In this context, we propose ReSURE (Regularizing Supervision UnREliability), an adaptive learning method that dynamically down-weights unreliable supervision without explicit filtering. ReSURE estimates per-turn loss distributions using Welford’s online statistics and reweights sample losses on the fly accordingly. Experiments on both single-source and mixed-quality datasets show improved stability and response quality. Notably, ReSURE enjoys positive Spearman correlations (0.21 ~ 1.0 across multiple benchmarks) between response scores and number of samples regardless of data quality, which potentially paves the way for utilizing large-scale data effectively. Code is publicly available at this https URL.
zh
[NLP-11] MathBuddy: A Multimodal System for Affective Math Tutoring
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的教育对话系统缺乏对学生情感状态建模的问题。现有学习模型通常忽略学生的情绪变化,而教育心理学研究表明,积极或消极的情绪状态显著影响学习效果。为此,作者提出MathBuddy——一个具备情绪感知能力的LLM驱动数学辅导系统,其核心创新在于通过融合对话文本与面部表情两种模态的信息,动态构建学生的多模态情绪表征,并据此映射至相应的教学策略,从而生成更具同理心的教学响应。实验表明,该方法在8个教学维度上的自动评估指标和用户研究中均取得显著提升,验证了情绪建模对增强LLM辅导系统教学效能的关键作用。
链接: https://arxiv.org/abs/2508.19993
作者: Debanjana Kar,Leopold Böss,Dacia Braca,Sebastian Maximilian Dennerlein,Nina Christine Hubig,Philipp Wintersberger,Yufang Hou
机构: IT:U Interdisciplinary Transformation University Austria (IT:U跨学科转型大学奥地利); IBM Research India (IBM研究院印度)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:The rapid adoption of LLM-based conversational systems is already transforming the landscape of educational technology. However, the current state-of-the-art learning models do not take into account the student’s affective states. Multiple studies in educational psychology support the claim that positive or negative emotional states can impact a student’s learning capabilities. To bridge this gap, we present MathBuddy, an emotionally aware LLM-powered Math Tutor, which dynamically models the student’s emotions and maps them to relevant pedagogical strategies, making the tutor-student conversation a more empathetic one. The student’s emotions are captured from the conversational text as well as from their facial expressions. The student’s emotions are aggregated from both modalities to confidently prompt our LLM Tutor for an emotionally-aware response. We have effectively evaluated our model using automatic evaluation metrics across eight pedagogical dimensions and user studies. We report a massive 23 point performance gain using the win rate and a 3 point gain at an overall level using DAMR scores which strongly supports our hypothesis of improving LLM-based tutor’s pedagogical abilities by modeling students’ emotions.
zh
[NLP-12] Self-Supervised Pre-Training with Equilibrium Constraints
【速读】: 该论文旨在解决自监督预训练中处理异构数据(heterogeneous data)时存在的适应性不足问题,即传统方法将所有数据混合并最小化全局平均损失,导致模型难以在不同来源的数据上达到局部最优。其解决方案的关键在于引入均衡约束(equilibrium constraints),通过将问题建模为双层优化(bilevel optimization)框架,使模型在从初始模型出发进行K步梯度下降后,能够分别收敛到各数据源的局部最优解;该方法采用一阶近似求解策略,并与模型无关元学习(model-agnostic meta learning, MAML)建立了理论联系,从而显著提升了预训练模型在下游监督微调任务中的适应能力。
链接: https://arxiv.org/abs/2508.19990
作者: Xiaodong Cui,A F M Saif,Brian Kingsbury,Tianyi Chen
机构: IBM T. J. Watson Research Center (IBM托马斯·沃森研究中心); Rensselaer Polytechnic Institute (伦斯勒理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Self-supervised pre-training using unlabeled data is widely used in machine learning. In this paper, we propose a new self-supervised pre-training approach to dealing with heterogeneous data. Instead of mixing all the data and minimizing the averaged global loss in the conventional way, we impose additional equilibrium constraints to ensure that the models optimizes each source of heterogeneous data to its local optima after K -step gradient descent initialized from the model. We formulate this as a bilevel optimization problem, and use the first-order approximation method to solve the problem. We discuss its connection to model-agnostic meta learning (MAML). Experiments are carried out on self-supervised pre-training using multi-domain and multilingual datasets, demonstrating that the proposed approach can significantly improve the adaptivity of the self-supervised pre-trained model for the downstream supervised fine-tuning tasks.
zh
[NLP-13] Agent CoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在处理混合类型推理任务时表现不佳的问题,特别是当任务同时涉及常识推理(commonsense reasoning)与数学推理(math reasoning)时,模型性能显著下降。现有基准测试通常仅聚焦于单一类型的多步推理,无法准确评估模型在真实世界复杂任务中的综合能力。解决方案的关键在于构建一个全新的基准测试——Agentic Commonsense and Math benchmark (AgentCoMa),其中每个任务均明确包含一个常识推理步骤和一个数学推理步骤,从而系统性地量化LLMs在跨类型组合推理中的脆弱性。实验表明,尽管LLMs能独立完成两类推理步骤,但组合后平均准确率下降约30%,远高于同类推理组合的性能差距;而人类非专家标注者则保持高准确率,凸显了模型在混合型复合推理中的局限性。
链接: https://arxiv.org/abs/2508.19988
作者: Lisa Alazraki,Lihu Chen,Ana Brassard,Joe Stacey,Hossein A. Rahmani,Marek Rei
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) have achieved high accuracy on complex commonsense and mathematical problems that involve the composition of multiple reasoning steps. However, current compositional benchmarks testing these skills tend to focus on either commonsense or math reasoning, whereas LLM agents solving real-world tasks would require a combination of both. In this work, we introduce an Agentic Commonsense and Math benchmark (AgentCoMa), where each compositional task requires a commonsense reasoning step and a math reasoning step. We test it on 61 LLMs of different sizes, model families, and training strategies. We find that LLMs can usually solve both steps in isolation, yet their accuracy drops by ~30% on average when the two are combined. This is a substantially greater performance gap than the one we observe in prior compositional benchmarks that combine multiple steps of the same reasoning type. In contrast, non-expert human annotators can solve the compositional questions and the individual steps in AgentCoMa with similarly high accuracy. Furthermore, we conduct a series of interpretability studies to better understand the performance gap, examining neuron patterns, attention maps and membership inference. Our work underscores a substantial degree of model brittleness in the context of mixed-type compositional reasoning and offers a test bed for future improvement.
zh
[NLP-14] Diffusion Language Models Know the Answer Before Decoding
【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在推理阶段速度较慢的问题,其核心瓶颈在于双向注意力计算开销大以及高质量输出所需的大量精炼步骤(refinement steps)。解决方案的关键在于利用DLM中一个被忽视的特性——早期答案收敛性(early answer convergence),即在半数精炼步骤前即可识别出正确答案。基于此观察,作者提出Prophet方法,一种无需训练的快速解码范式,通过动态判断是否继续精炼或直接“全押”(all-in)解码剩余token,决策依据为Top-2预测候选之间的置信度差距(confidence gap)。该方法无缝集成至现有DLM实现,无额外训练成本且计算开销极低,在LLaDA-8B和Dream-7B等多个任务上实现最多3.4倍的解码步数减少,同时保持高生成质量,将DLM解码重新定义为“何时停止采样”的问题,为加速DLM推理提供了一种简洁而有效的机制。
链接: https://arxiv.org/abs/2508.19982
作者: Pengxiang Li,Yefan Zhou,Dilxat Muhtar,Lu Yin,Shilin Yan,Li Shen,Yi Liang,Soroush Vosoughi,Shiwei Liu
机构: The Hong Kong Polytechnic University (香港理工大学); Dartmouth College (达特茅斯学院); University of Surrey (萨里大学); Sun Yat-sen University (中山大学); Google DeepMind (谷歌深度思维); Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); ELLIS Institute Tübingen (ELLIS图宾根研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion language models (DLMs) have recently emerged as an alternative to autoregressive approaches, offering parallel sequence generation and flexible token orders. However, their inference remains slower than that of autoregressive models, primarily due to the cost of bidirectional attention and the large number of refinement steps required for high quality outputs. In this work, we highlight and leverage an overlooked property of DLMs early answer convergence: in many cases, the correct answer can be internally identified by half steps before the final decoding step, both under semi-autoregressive and random remasking schedules. For example, on GSM8K and MMLU, up to 97% and 99% of instances, respectively, can be decoded correctly using only half of the refinement steps. Building on this observation, we introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding. Specifically, Prophet dynamically decides whether to continue refinement or to go “all-in” (i.e., decode all remaining tokens in one step), using the confidence gap between the top-2 prediction candidates as the criterion. It integrates seamlessly into existing DLM implementations, incurs negligible overhead, and requires no additional training. Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality. These results recast DLM decoding as a problem of when to stop sampling, and demonstrate that early decode convergence provides a simple yet powerful mechanism for accelerating DLM inference, complementary to existing speedup techniques. Our code is publicly available at this https URL.
zh
[NLP-15] GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity
【速读】: 该论文旨在解决大视觉语言模型中物体幻觉(object hallucination)问题,即模型在图像-文本关联任务中生成不存在于输入图像中的虚假物体描述,这对模型在现实场景中的安全部署构成重大挑战。现有方法通常仅从全局或局部视角评估幻觉概率,存在检测可靠性不足的问题。论文提出的解决方案是GLSim,其关键在于无需训练即可利用图像与文本模态间互补的全局和局部嵌入相似性信号(embedding similarity signals),从而更准确、可靠地识别不同场景下的物体幻觉现象。
链接: https://arxiv.org/abs/2508.19972
作者: Seongheon Park,Yixuan Li
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Object hallucination in large vision-language models presents a significant challenge to their safe deployment in real-world applications. Recent works have proposed object-level hallucination scores to estimate the likelihood of object hallucination; however, these methods typically adopt either a global or local perspective in isolation, which may limit detection reliability. In this paper, we introduce GLSim, a novel training-free object hallucination detection framework that leverages complementary global and local embedding similarity signals between image and text modalities, enabling more accurate and reliable hallucination detection in diverse scenarios. We comprehensively benchmark existing object hallucination detection methods and demonstrate that GLSim achieves superior detection performance, outperforming competitive baselines by a significant margin.
zh
[NLP-16] Dhati: Fine-tuned Large Language Models for Arabic Subjectivity Evaluation
【速读】: 该论文旨在解决阿拉伯语(Arabic)在主观性分析(subjectivity analysis)任务中因标注数据稀缺而导致的资源匮乏问题。其关键解决方案在于构建了一个大规模、多源融合的标注数据集AraDhati+,该数据集整合了ASTD、LABR、HARD和SANAD等现有阿拉伯语语料库,并在此基础上对XLM-RoBERTa、AraBERT和ArabianGPT等先进阿拉伯语预训练模型进行微调,同时引入集成决策策略以提升分类性能,最终实现了97.79%的准确率,显著提升了阿拉伯语主观性识别的效果。
链接: https://arxiv.org/abs/2508.19966
作者: Slimane Bellaouar,Attia Nehar,Soumia Souffi,Mounia Bouameur
机构: University of Ghardaia (哈加迪亚大学); University of Djelfa (杰尔法大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, 7 figures
Abstract:Despite its significance, Arabic, a linguistically rich and morphologically complex language, faces the challenge of being under-resourced. The scarcity of large annotated datasets hampers the development of accurate tools for subjectivity analysis in Arabic. Recent advances in deep learning and Transformers have proven highly effective for text classification in English and French. This paper proposes a new approach for subjectivity assessment in Arabic textual data. To address the dearth of specialized annotated datasets, we developed a comprehensive dataset, AraDhati+, by leveraging existing Arabic datasets and collections (ASTD, LABR, HARD, and SANAD). Subsequently, we fine-tuned state-of-the-art Arabic language models (XLM-RoBERTa, AraBERT, and ArabianGPT) on AraDhati+ for effective subjectivity classification. Furthermore, we experimented with an ensemble decision approach to harness the strengths of individual models. Our approach achieves a remarkable accuracy of 97.79,% for Arabic subjectivity classification. Results demonstrate the effectiveness of the proposed approach in addressing the challenges posed by limited resources in Arabic language processing.
zh
[NLP-17] KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts
【速读】: 该论文旨在解决低资源语言(如韩语)在文本丰富型视觉问答(Text-rich Visual Question Answering, VQA)任务中缺乏全面评估基准的问题,这一缺失阻碍了视觉语言模型(Vision-Language Models, VLMs)在多语言场景下的 robust 评估与比较。解决方案的关键在于提出 KRETA 基准数据集,其不仅覆盖15个领域和26种图像类型,支持对视觉文本理解与推理能力的多维评估,还引入了一种专为文本密集场景优化的半自动化 VQA 数据生成流水线,该流程结合精细化的分步图像分解策略与严格的七项指标评估协议,确保了数据质量,并具备良好的可扩展性,可推广至其他语言以推动多语言 VLM 研究发展。
链接: https://arxiv.org/abs/2508.19944
作者: Taebaek Hwang,Minseo Kim,Gisang Lee,Seonuk Kim,Hyunjun Eun
机构: Waddle; Seoul National University (首尔国立大学); Krafton; UNIST; SK Telecom
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and benchmarks have emerged for high-resource languages like English. However, a critical gap persists for low-resource languages such as Korean, where the lack of comprehensive benchmarks hinders robust model evaluation and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. While KRETA is tailored for Korean, we hope our adaptable and extensible pipeline will facilitate the development of similar benchmarks in other languages, thereby accelerating multilingual VLM research. The code and dataset for KRETA are available at this https URL.
zh
[NLP-18] HEAL: A Hypothesis-Based Preference-Aware Analysis Framework EMNLP2025
【速读】: 该论文旨在解决当前偏好优化方法(如直接偏好优化 DPO)在评估过程中仅依赖单一响应、忽视潜在输出空间的问题,这可能导致对模型实际性能的误判。其解决方案的关键在于提出一种新的评估范式——基于假设空间的偏好感知分析框架(Hypothesis-based Preference-aware Analysis Framework, HEAL),将偏好对齐建模为假设空间内的重排序过程,并引入两个互补指标:排名准确率(用于衡量序数一致性)和偏好强度相关性(用于评估连续对齐程度)。通过构建统一的假设基准 UniHypoBench 并基于 HEAL 开展系统实验,该研究揭示了现有方法能有效捕捉代理模型提供的偏好并抑制负样本,从而为偏好学习提供了理论创新与实践诊断工具。
链接: https://arxiv.org/abs/2508.19922
作者: Yifu Huo,Chenglong Wang,Qiren Zhu,Shunjie Xing,Tong Xiao,Chunliang Zhang,Tongran Liu,Jinbo Zhu
机构: Northeastern University (东北大学); NiuTrans Research (牛津研究); CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS (中国科学院行为科学重点实验室,心理研究所,中科院)
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2025 Findings
Abstract:Preference optimization methods like DPO have achieved remarkable performance in LLM alignment. However, the evaluation for these methods relies on a single response and overlooks other potential outputs, which could also be generated in real-world applications within this hypothetical space. To address this issue, this paper presents a \textbfHypothesis-based Pr\textbfEference-aware \textbfAna\textbfLysis Framework (HEAL), a novel evaluation paradigm that formulates preference alignment as a re-ranking process within hypothesis spaces. The framework incorporates two complementary metrics: ranking accuracy for evaluating ordinal consistency and preference strength correlation for assessing continuous alignment. To facilitate this framework, we develop UniHypoBench, a unified hypothesis benchmark constructed from diverse instruction-response pairs. Through extensive experiments based on HEAL, with a particular focus on the intrinsic mechanisms of preference learning, we demonstrate that current preference learning methods can effectively capture preferences provided by proxy models while simultaneously suppressing negative samples. These findings contribute to preference learning research through two significant avenues. Theoretically, we introduce hypothesis space analysis as an innovative paradigm for understanding preference alignment. Practically, HEAL offers researchers robust diagnostic tools for refining preference optimization methods, while our empirical results identify promising directions for developing more advanced alignment algorithms capable of comprehensive preference capture.
zh
[NLP-19] Your AI Bosses Are Still Prejudiced: The Emergence of Stereotypes in LLM -Based Multi-Agent Systems
【速读】: 该论文试图解决的问题是:在缺乏预设偏见的条件下,基于大语言模型(Large Language Model, LLM)的多智能体系统是否可能自发产生刻板印象及其演化机制。解决方案的关键在于构建了一个模拟职场互动的新型实验框架,在初始条件中保持中立性,通过多轮交互与决策权力分配的引入,揭示了LLM代理间刻板印象的涌现性特征——即即使无训练数据偏见,系统仍会因群体交互动态而形成类似人类社会行为的刻板效应(如光环效应、确认偏误和角色一致性),且这些模式在不同LLM架构中具有高度一致性。这一发现表明,刻板印象可能是多智能体交互中的涌现属性,而非仅源于训练数据偏差。
链接: https://arxiv.org/abs/2508.19919
作者: Jingyu Guo,Yingying Xu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:While stereotypes are well-documented in human social interactions, AI systems are often presumed to be less susceptible to such biases. Previous studies have focused on biases inherited from training data, but whether stereotypes can emerge spontaneously in AI agent interactions merits further exploration. Through a novel experimental framework simulating workplace interactions with neutral initial conditions, we investigate the emergence and evolution of stereotypes in LLM-based multi-agent systems. Our findings reveal that (1) LLM-Based AI agents develop stereotype-driven biases in their interactions despite beginning without predefined biases; (2) stereotype effects intensify with increased interaction rounds and decision-making power, particularly after introducing hierarchical structures; (3) these systems exhibit group effects analogous to human social behavior, including halo effects, confirmation bias, and role congruity; and (4) these stereotype patterns manifest consistently across different LLM architectures. Through comprehensive quantitative analysis, these findings suggest that stereotype formation in AI systems may arise as an emergent property of multi-agent interactions, rather than merely from training data biases. Our work underscores the need for future research to explore the underlying mechanisms of this phenomenon and develop strategies to mitigate its ethical impacts.
zh
[NLP-20] Logical Reasoning with Outcome Reward Models for Test-Time Scaling EMNLP2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在演绎逻辑推理(deductive logical reasoning)任务中性能不足的问题,尤其是在测试时扩展(test-time scaling)与专用结果奖励模型(Outcome Reward Models, ORMs)结合的应用尚未充分探索的背景下。解决方案的关键在于构建一套针对演绎推理的ORM,并通过两种数据生成策略提升其训练质量:一是基于思维链(Chain-of-Thought, CoT)的单样本和多样本生成;二是提出一种“回声生成”(echo generation)技术,利用LLM对提示中错误假设的倾向性响应,主动引导模型产生新的错误类型,从而扩展训练数据中未覆盖的错误模式。实验表明,基于CoT与回声增强数据训练的ORM在FOLIO、JustLogic和ProverQA等多个逻辑推理基准上显著提升了四类不同LLMs的表现。
链接: https://arxiv.org/abs/2508.19903
作者: Ramya Keerthy Thatikonda,Wray Buntine,Ehsan Shareghi
机构: Monash University (蒙纳士大学); VinUniversity (Vin大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2025
Abstract:Logical reasoning is a critical benchmark for evaluating the capabilities of large language models (LLMs), as it reflects their ability to derive valid conclusions from given premises. While the combination of test-time scaling with dedicated outcome or process reward models has opened up new avenues to enhance LLMs performance in complex reasoning tasks, this space is under-explored in deductive logical reasoning. We present a set of Outcome Reward Models (ORMs) for deductive reasoning. To train the ORMs we mainly generate data using Chain-of-Thought (CoT) with single and multiple samples. Additionally, we propose a novel tactic to further expand the type of errors covered in the training dataset of the ORM. In particular, we propose an echo generation technique that leverages LLMs’ tendency to reflect incorrect assumptions made in prompts to extract additional training data, covering previously unexplored error types. While a standard CoT chain may contain errors likely to be made by the reasoner, the echo strategy deliberately steers the model toward incorrect reasoning. We show that ORMs trained on CoT and echo-augmented data demonstrate improved performance on the FOLIO, JustLogic, and ProverQA datasets across four different LLMs.
zh
[NLP-21] Bangla-Bayanno: A 52K-Pair Bengali Visual Question Answering Dataset with LLM -Assisted Translation Refinement
【速读】: 该论文旨在解决低资源语言(如孟加拉语)在多模态人工智能研究中缺乏高质量、开放式的视觉问答(Visual Question Answering, VQA)数据集的问题。现有数据集普遍存在标注质量不高、答案类型受限或领域特定性强等缺陷,难以支撑通用且可靠的多模态学习模型训练与评估。解决方案的关键在于构建一个大规模、高质的孟加拉语VQA数据集Bangla-Bayanno,并通过多语言大语言模型(Multilingual Large Language Model, LLM)辅助的翻译精炼流水线,有效降低人工翻译引入的错误,提升跨语言数据的一致性与清晰度,从而为低资源语言的多模态学习提供更可靠的基准支持。
链接: https://arxiv.org/abs/2508.19887
作者: Mohammed Rakibul Hasan,Rafi Majid,Ahanaf Tahmid
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this paper, we introduce Bangla-Bayanno, an open-ended Visual Question Answering (VQA) Dataset in Bangla, a widely used, low-resource language in multimodal AI research. The majority of existing datasets are either manually annotated with an emphasis on a specific domain, query type, or answer type or are constrained by niche answer formats. In order to mitigate human-induced errors and guarantee lucidity, we implemented a multilingual LLM-assisted translation refinement pipeline. This dataset overcomes the issues of low-quality translations from multilingual sources. The dataset comprises 52,650 question-answer pairs across 4750+ images. Questions are classified into three distinct answer types: nominal (short descriptive), quantitative (numeric), and polar (yes/no). Bangla-Bayanno provides the most comprehensive open-source, high-quality VQA benchmark in Bangla, aiming to advance research in low-resource multimodal learning and facilitate the development of more inclusive AI systems.
zh
[NLP-22] AI-Powered Detection of Inappropriate Language in Medical School Curricula AAAI
【速读】: 该论文旨在解决医学教学材料中不当语言(Inappropriate Use of Language, IUL)识别的难题,此类语言包括过时、排他性或非以患者为中心的术语,可能对临床培训、医患互动及健康结果产生负面影响。由于课程内容体量庞大,人工识别IUL及其子类既昂贵又不切实际。解决方案的关键在于利用细调的小型语言模型(Small Language Models, SLMs)与预训练的大语言模型(Large Language Models, LLMs)进行对比评估,其中最优策略为采用多标签分类器,并通过引入未标记文本作为负样本增强训练数据,使特定子类分类器的AUC提升达25%,从而显著提高对医学教材中潜在有害语言的检测效能。
链接: https://arxiv.org/abs/2508.19883
作者: Chiman Salavati,Shannon Song,Scott A. Hale,Roberto E. Montenegro,Shiri Dori-Hacohen,Fabricio Murai
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted at 2025 AAAI/ACM AI, Ethics and Society Conference (AIES’25)
Abstract:The use of inappropriate language – such as outdated, exclusionary, or non-patient-centered terms – medical instructional materials can significantly influence clinical training, patient interactions, and health outcomes. Despite their reputability, many materials developed over past decades contain examples now considered inappropriate by current medical standards. Given the volume of curricular content, manually identifying instances of inappropriate use of language (IUL) and its subcategories for systematic review is prohibitively costly and impractical. To address this challenge, we conduct a first-in-class evaluation of small language models (SLMs) fine-tuned on labeled data and pre-trained LLMs with in-context learning on a dataset containing approximately 500 documents and over 12,000 pages. For SLMs, we consider: (1) a general IUL classifier, (2) subcategory-specific binary classifiers, (3) a multilabel classifier, and (4) a two-stage hierarchical pipeline for general IUL detection followed by multilabel classification. For LLMs, we consider variations of prompts that include subcategory definitions and/or shots. We found that both LLama-3 8B and 70B, even with carefully curated shots, are largely outperformed by SLMs. While the multilabel classifier performs best on annotated data, supplementing training with unflagged excerpts as negative examples boosts the specific classifiers’ AUC by up to 25%, making them most effective models for mitigating harmful language in medical curricula.
zh
[NLP-23] Beyond Shallow Heuristics: Leverag ing Human Intuition for Curriculum Learning ACL
【速读】: 该论文旨在解决语言模型预训练中如何有效应用课程学习(Curriculum Learning, CL)的问题,即如何通过合理排序训练数据以提升模型性能。其核心挑战在于缺乏对语言难度的明确定义与度量方法。论文提出的关键解决方案是利用人类标注的简单语言标签(如来自Simple Wikipedia的数据)作为课程学习的信号,并验证其有效性。实验表明,单纯加入简单语言数据并无显著优势,但若将其以课程形式结构化并优先引入训练,则能显著降低困惑度(perplexity),尤其在处理简单语言时效果更佳;相比之下,基于模型能力的课程策略未能带来稳定提升,可能是因为未能有效区分难易样本。这说明人类对语言难度的直觉可作为CL的有效指导信号。
链接: https://arxiv.org/abs/2508.19873
作者: Vanessa Toborek,Sebastian Müller,Tim Selbach,Tamás Horváth,Christian Bauckhage
机构: University of Bonn (波恩大学); Lamarr Institute (拉马尔研究所); Fraunhofer IAIS (弗劳恩霍夫信息与通信技术应用研究所)
类目: Computation and Language (cs.CL)
备注: Presented at ICNLSP 2025; to appear in the ACL Anthology; received the Best Short Paper Award
Abstract:Curriculum learning (CL) aims to improve training by presenting data from “easy” to “hard”, yet defining and measuring linguistic difficulty remains an open challenge. We investigate whether human-curated simple language can serve as an effective signal for CL. Using the article-level labels from the Simple Wikipedia corpus, we compare label-based curricula to competence-based strategies relying on shallow heuristics. Our experiments with a BERT-tiny model show that adding simple data alone yields no clear benefit. However, structuring it via a curriculum – especially when introduced first – consistently improves perplexity, particularly on simple language. In contrast, competence-based curricula lead to no consistent gains over random ordering, probably because they fail to effectively separate the two classes. Our results suggest that human intuition about linguistic difficulty can guide CL for language model pre-training.
zh
[NLP-24] okenVerse: Towards Flexible Multitask Learning with Dynamic Task Activation
【速读】: 该论文旨在解决传统基于token的多任务框架(如TokenVerse)对训练语句要求全标签的问题,这限制了其在部分标注数据集上的应用及扩展性。解决方案的关键在于引入可学习向量(learnable vectors)于XLSR-Transducer自动语音识别(ASR)模型的声学嵌入空间中,实现动态任务激活机制,从而支持仅标注部分任务的语句进行训练。这一设计显著提升了模型对不完整标签数据的适应能力,并在ASR与语言识别等多任务场景下实现了性能相当或优于原TokenVerse的效果。
链接: https://arxiv.org/abs/2508.19856
作者: Shashi Kumar,Srikanth Madikeri,Esaú Villatoro-Tello,Sergio Burdisso,Pradeep Rangappa,Andrés Carofilis,Petr Motlicek,Karthik Pandia,Shankar Venkatesan,Kadri Hacioğlu,Andreas Stolcke
机构: Idiap Research Institute (Idiap 研究所); EPFL (瑞士联邦理工学院); University of Zurich (苏黎世大学); Brno University of Technology (布杰约维采理工大学); Uniphore (美国和印度)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to IEEE ASRU 2025. Copyright©2025 IEEE
Abstract:Token-based multitasking frameworks like TokenVerse require all training utterances to have labels for all tasks, hindering their ability to leverage partially annotated datasets and scale effectively. We propose TokenVerse++, which introduces learnable vectors in the acoustic embedding space of the XLSR-Transducer ASR model for dynamic task activation. This core mechanism enables training with utterances labeled for only a subset of tasks, a key advantage over TokenVerse. We demonstrate this by successfully integrating a dataset with partial labels, specifically for ASR and an additional task, language identification, improving overall performance. TokenVerse++ achieves results on par with or exceeding TokenVerse across multiple tasks, establishing it as a more practical multitask alternative without sacrificing ASR performance.
zh
[NLP-25] SoK: Large Language Model Copyright Auditing via Fingerprinting
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在版权保护中的脆弱性问题,即LLM作为高价值知识产权易遭受未经授权使用和模型窃取等侵权行为。为应对这一挑战,论文提出一种非侵入式识别技术——LLM指纹提取与比对方法(LLM fingerprinting),其关键在于构建一个统一的框架和形式化分类体系,将现有方法划分为白盒(white-box)与黑盒(black-box)两类,并设计首个系统性基准LeaFBench,用于在真实部署场景下评估指纹技术的鲁棒性和有效性。LeaFBench涵盖149个不同模型实例及13种主流后开发修改技术(包括参数调整类如微调、量化,以及参数无关类如系统提示词、检索增强生成RAG),通过大规模实验揭示了当前方法的优势与局限,从而指明未来研究方向和关键开放问题。
链接: https://arxiv.org/abs/2508.19843
作者: Shuo Shao,Yiming Li,Yu He,Hongwei Yao,Wenyuan Yang,Dacheng Tao,Zhan Qin
机构: Zhejiang University (浙江大学); Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security (杭州高新区(滨江区)区块链与数据安全研究院); Nanyang Technological University (南洋理工大学); City University of Hong Kong (香港城市大学); Sun Yat-Sen University (中山大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The broad capabilities and substantial resources required to train Large Language Models (LLMs) make them valuable intellectual property, yet they remain vulnerable to copyright infringement, such as unauthorized use and model theft. LLM fingerprinting, a non-intrusive technique that extracts and compares the distinctive features from LLMs to identify infringements, offers a promising solution to copyright auditing. However, its reliability remains uncertain due to the prevalence of diverse model modifications and the lack of standardized evaluation. In this SoK, we present the first comprehensive study of LLM fingerprinting. We introduce a unified framework and formal taxonomy that categorizes existing methods into white-box and black-box approaches, providing a structured overview of the state of the art. We further propose LeaFBench, the first systematic benchmark for evaluating LLM fingerprinting under realistic deployment scenarios. Built upon mainstream foundation models and comprising 149 distinct model instances, LeaFBench integrates 13 representative post-development techniques, spanning both parameter-altering methods (e.g., fine-tuning, quantization) and parameter-independent mechanisms (e.g., system prompts, RAG). Extensive experiments on LeaFBench reveal the strengths and weaknesses of existing methods, thereby outlining future research directions and critical open problems in this emerging field. The code is available at this https URL.
zh
[NLP-26] Scalable and consistent few-shot classification of survey responses using text embeddings
【速读】: 该论文旨在解决定性分析中开放式问卷回答的传统编码方法效率低、一致性差的问题。现有自然语言处理技术如监督分类器、主题建模和生成式大语言模型在定性分析中应用受限,因其需要大量标注数据、破坏既有的定性工作流程或结果不稳定。解决方案的关键在于提出一种基于文本嵌入(text embedding)的分类框架,仅需每类少量示例即可完成分类,且能无缝融入标准定性分析流程。该方法在概念物理调查的2899条开放式回答上与专家人工编码相比,Cohen’s Kappa系数达0.74–0.83,表明其在保持高解释性的同时可扩展至数千条响应规模,为演绎式定性分析提供了高效可行的新路径。
链接: https://arxiv.org/abs/2508.19836
作者: Jonas Timmann Mjaaland,Markus Fleten Kreutzer,Halvor Tyseng,Rebeckah K. Fussell,Gina Passante,N.G. Holmes,Anders Malthe-Sørenssen,Tor Ole B. Odden
机构: University of Oslo (奥斯陆大学); Cornell University (康奈尔大学); California State University Fullerton (加州州立大学富勒顿分校)
类目: Computation and Language (cs.CL); Physics Education (physics.ed-ph)
备注:
Abstract:Qualitative analysis of open-ended survey responses is a commonly-used research method in the social sciences, but traditional coding approaches are often time-consuming and prone to inconsistency. Existing solutions from Natural Language Processing such as supervised classifiers, topic modeling techniques, and generative large language models have limited applicability in qualitative analysis, since they demand extensive labeled data, disrupt established qualitative workflows, and/or yield variable results. In this paper, we introduce a text embedding-based classification framework that requires only a handful of examples per category and fits well with standard qualitative workflows. When benchmarked against human analysis of a conceptual physics survey consisting of 2899 open-ended responses, our framework achieves a Cohen’s Kappa ranging from 0.74 to 0.83 as compared to expert human coders in an exhaustive coding scheme. We further show how performance of this framework improves with fine-tuning of the text embedding model, and how the method can be used to audit previously-analyzed datasets. These findings demonstrate that text embedding-assisted coding can flexibly scale to thousands of responses without sacrificing interpretability, opening avenues for deductive qualitative analysis at scale.
zh
[NLP-27] Benchmarking Hindi LLM s: A New Suite of Datasets and a Comparative Analysis
【速读】: 该论文旨在解决在印地语(Hindi)环境下评估指令微调的大语言模型(Instruction-tuned Large Language Models, LLMs)所面临的挑战,即缺乏高质量的基准测试数据集,因为直接翻译英文数据集无法捕捉语言和文化上的关键细微差别。解决方案的关键在于构建一套包含五个子数据集的评估套件(IFEval-Hi、MT-Bench-Hi、GSM8K-Hi、ChatRAG-Hi 和 BFCL-Hi),其创建方法结合了从零开始的人工标注与“翻译-验证”流程,从而确保数据的语言准确性与文化相关性;同时,该方法具有可复现性,可推广至其他低资源语言的基准开发。
链接: https://arxiv.org/abs/2508.19831
作者: Anusha Kamath,Kanishk Singla,Rakesh Paul,Raviraj Joshi,Utkarsh Vaidya,Sanjay Singh Chauhan,Niranjan Wartikar
机构: NVIDIA(英伟达)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi. These were created using a methodology that combines from-scratch human annotation with a translate-and-verify process. We leverage this suite to conduct an extensive benchmarking of open-source LLMs supporting Hindi, providing a detailed comparative analysis of their current capabilities. Our curation process also serves as a replicable methodology for developing benchmarks in other low-resource languages.
zh
[NLP-28] Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长时推理任务中因缺乏持久记忆能力而受限的问题,即LLMs本质上是无状态的,且受制于有限的上下文窗口。为应对这一挑战,作者提出Memory-R1框架,其关键在于引入基于强化学习(Reinforcement Learning, RL)的双代理机制:一是记忆管理器(Memory Manager),通过策略梯度方法(如PPO和GRPO)学习执行ADD、UPDATE、DELETE和NOOP等结构化记忆操作;二是答案生成代理(Answer Agent),负责从外部记忆库中选择最相关条目并进行推理以生成答案。两个代理均采用结果驱动的强化学习进行微调,仅需少量标注数据(如152个问答对)即可实现自适应记忆管理和高效推理,显著提升模型在多样化任务中的泛化性能,从而推动LLMs向更智能、具记忆感知的代理行为演进。
链接: https://arxiv.org/abs/2508.19828
作者: Sikuan Yan,Xiufeng Yang,Zuchao Huang,Ercong Nie,Zifeng Ding,Zonggen Li,Xiaowen Ma,Hinrich Schütze,Volker Tresp,Yunpu Ma
机构: 未知
类目: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking any learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns to perform structured memory operations ADD, UPDATE, DELETE, NOOP, and an Answer Agent that selects the most relevant entries and reasons over them to produce an answer. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management and use with minimal supervision. With as few as 152 question-answer pairs and a corresponding temporal memory bank for training, Memory-R1 outperforms the most competitive existing baseline and demonstrates strong generalization across diverse question types and LLM backbones. Beyond presenting an effective approach, this work provides insights into how RL can unlock more agentic, memory-aware behaviors in LLMs, pointing toward richer, more persistent reasoning systems.
zh
[NLP-29] Analysing Chain of Thought Dynamics: Active Guidance or Unfaithful Post-hoc Rationalisation? EMNLP2025
【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)在软推理任务(soft-reasoning tasks)中效果有限且可能与模型实际推理过程不一致的问题。其解决方案的关键在于系统性地分析不同类型的模型——包括指令微调模型、推理专用模型及推理蒸馏模型——在软推理任务中对CoT的依赖程度及其推理忠实性(faithfulness),从而揭示CoT的影响机制与实际推理路径之间并不总是一致,为提升CoT的有效性和可信度提供了实证依据和理论指导。
链接: https://arxiv.org/abs/2508.19827
作者: Samuel Lewis-Lim,Xingwei Tan,Zhixue Zhao,Nikolaos Aletras
机构: University of Sheffield (谢菲尔德大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at EMNLP 2025 Main Conference
Abstract:Recent work has demonstrated that Chain-of-Thought (CoT) often yields limited gains for soft-reasoning problems such as analytical and commonsense reasoning. CoT can also be unfaithful to a model’s actual reasoning. We investigate the dynamics and faithfulness of CoT in soft-reasoning tasks across instruction-tuned, reasoning and reasoning-distilled models. Our findings reveal differences in how these models rely on CoT, and show that CoT influence and faithfulness are not always aligned.
zh
[NLP-30] 2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables
【速读】: 该论文旨在解决工业场景中将表格信息自动转化为结构化报告这一任务的挑战,该任务因表格的复杂性和多样性导致现有大语言模型(LLM)推理效果不佳,且缺乏能够有效评估实际应用性能的基准。解决方案的关键在于提出“表格到报告”(table-to-report, T2R)任务,并构建了一个名为T2R-bench的双语基准数据集,包含457张来自真实工业场景的表格,覆盖19个行业领域和4类工业表格类型,同时设计了一套公平的评价指标以衡量报告生成质量。实验表明,即使是最先进的模型如Deepseek-R1在该基准上的平均得分仅为62.71,说明当前LLMs在该任务上仍有显著提升空间。
链接: https://arxiv.org/abs/2508.19813
作者: Jie Zhang,Changzai Pan,Kaiwen Wei,Sishi Xiong,Yu Zhao,Xiangyu Li,Jiaxin Peng,Xiaoyan Gu,Jian Yang,Wenhan Chang,Zhenhe Wu,Jiang Zhong,Shuangyong Song,Yongxiang Li,Xuelong Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Extensive research has been conducted to explore the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming tables information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, where the key information flow from the tables to the reports for this task. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and encompassing 19 industry domains as well as 4 types of industrial tables. Furthermore, we propose an evaluation criteria to fairly measure the quality of report generation. The experiments on 25 widely-used LLMs reveal that even state-of-the-art models like Deepseek-R1 only achieves performance with 62.71 overall score, indicating that LLMs still have room for improvement on T2R-bench. Source code and data will be available after acceptance.
zh
[NLP-31] Principled Personas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance EMNLP2025
【速读】: 该论文试图解决专家角色提示(expert persona prompting)在大语言模型(Large Language Models, LLMs)任务性能提升中的有效性不明确问题,尤其是其在不同情境下的表现差异、对无关角色属性的敏感性以及角色特征的真实性(fidelity)如何影响最终效果。研究发现,虽然专家角色提示通常不会显著降低性能,但模型对无关角色细节高度敏感,可能导致性能下降近30个百分点;同时,教育水平、专业性和领域相关性等角色特征对性能的提升作用不稳定或微弱。解决方案的关键在于提出缓解策略以增强鲁棒性,但这些策略仅在最大、最强大的模型上有效,凸显了当前角色提示设计缺乏系统性与一致性,亟需更精细的角色设计方法和能反映预期效果的评估机制。
链接: https://arxiv.org/abs/2508.19764
作者: Pedro Henrique Luz de Araujo,Paul Röttger,Dirk Hovy,Benjamin Roth
机构: 未知
类目: Computation and Language (cs.CL)
备注: 30 pages, 29 figures, accepted to EMNLP 2025
Abstract:Expert persona prompting – assigning roles such as expert in math to language models – is widely used for task improvement. However, prior work shows mixed results on its effectiveness, and does not consider when and why personas should improve performance. We analyze the literature on persona prompting for task improvement and distill three desiderata: 1) performance advantage of expert personas, 2) robustness to irrelevant persona attributes, and 3) fidelity to persona attributes. We then evaluate 9 state-of-the-art LLMs across 27 tasks with respect to these desiderata. We find that expert personas usually lead to positive or non-significant performance changes. Surprisingly, models are highly sensitive to irrelevant persona details, with performance drops of almost 30 percentage points. In terms of fidelity, we find that while higher education, specialization, and domain-relatedness can boost performance, their effects are often inconsistent or negligible across tasks. We propose mitigation strategies to improve robustness – but find they only work for the largest, most capable models. Our findings underscore the need for more careful persona design and for evaluation schemes that reflect the intended effects of persona usage.
zh
[NLP-32] Uncovering the Bigger Picture: Comprehensive Event Understanding Via Diverse News Retrieval EMNLP2025
【速读】: 该论文旨在解决新闻检索系统中因过度依赖文本相关性而导致信息冗余、视角单一的问题,从而影响对真实事件的全面理解。其解决方案的关键在于提出一个两阶段的多样化新闻检索框架NEWSCOPE:第一阶段通过密集检索(dense retrieval)获取主题相关的候选内容;第二阶段则基于句子级别的聚类与多样性感知重排序(diversity-aware re-ranking),显式建模语义差异以挖掘互补信息。该方法在保证相关性的前提下显著提升了检索结果的多样性,验证了细粒度、可解释建模在减少冗余和促进事件全面理解中的有效性。
链接: https://arxiv.org/abs/2508.19758
作者: Yixuan Tang,Yuanyuan Shi,Yiqun Sun,Anthony Kum Hoe Tung
机构: National University of Singapore (新加坡国立大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted by EMNLP 2025
Abstract:Access to diverse perspectives is essential for understanding real-world events, yet most news retrieval systems prioritize textual relevance, leading to redundant results and limited viewpoint exposure. We propose NEWSCOPE, a two-stage framework for diverse news retrieval that enhances event coverage by explicitly modeling semantic variation at the sentence level. The first stage retrieves topically relevant content using dense retrieval, while the second stage applies sentence-level clustering and diversity-aware re-ranking to surface complementary information. To evaluate retrieval diversity, we introduce three interpretable metrics, namely Average Pairwise Distance, Positive Cluster Coverage, and Information Density Ratio, and construct two paragraph-level benchmarks: LocalNews and DSGlobal. Experiments show that NEWSCOPE consistently outperforms strong baselines, achieving significantly higher diversity without compromising relevance. Our results demonstrate the effectiveness of fine-grained, interpretable modeling in mitigating redundancy and promoting comprehensive event understanding. The data and code are available at this https URL.
zh
[NLP-33] Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)推理过程中关键值(Key-Value, KV)缓存负担过重的问题,从而提升推理效率。现有方法依赖随机线性哈希来筛选重要KV缓存,但受限于LLM中查询(query)与键(key)嵌入在两个窄锥形区域内的正交分布特性,导致哈希效率低下。其解决方案的关键在于提出一种名为Spotlight Attention的新方法,通过引入非线性哈希函数优化查询与键的嵌入分布,增强编码效率和鲁棒性;同时设计了一种基于Bradley-Terry排序损失的轻量级稳定训练框架,使非线性哈希模块可在仅16GB显存的GPU上于8小时内完成训练,并结合专用CUDA内核利用位运算优势,实现对512K条token的哈希检索时间低于100μs,端到端吞吐量相较原始解码提升3倍。
链接: https://arxiv.org/abs/2508.19740
作者: Wenhao Li,Yuxin Zhang,Gen Luo,Haiyuan Wan,Ziyang Gong,Fei Chao,Rongrong Ji
机构: Xiamen University (厦门大学); Shanghai AI Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reducing the key-value (KV) cache burden in Large Language Models (LLMs) significantly accelerates inference. Dynamically selecting critical KV caches during decoding helps maintain performance. Existing methods use random linear hashing to identify important tokens, but this approach is inefficient due to the orthogonal distribution of queries and keys within two narrow cones in LLMs. We introduce Spotlight Attention, a novel method that employs non-linear hashing functions to optimize the embedding distribution of queries and keys, enhancing coding efficiency and robustness. We also developed a lightweight, stable training framework using a Bradley-Terry ranking-based loss, enabling optimization of the non-linear hashing module on GPUs with 16GB memory in 8 hours. Experimental results show that Spotlight Attention drastically improves retrieval precision while shortening the length of the hash code at least 5 \times compared to traditional linear hashing. Finally, we exploit the computational advantages of bitwise operations by implementing specialized CUDA kernels, achieving hashing retrieval for 512K tokens in under 100 \mu s on a single A100 GPU, with end-to-end throughput up to 3 \times higher than vanilla decoding.
zh
[NLP-34] NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks
【速读】: 该论文旨在解决小规模视觉语言模型(sVLMs)在常识性视觉问答(commonsense VQA)任务中因缺乏外部常识知识而性能落后的问题。其解决方案的关键在于提出一个端到端的自然语言知识集成框架(NLKI),该框架通过三个核心步骤实现:首先利用微调后的ColBERTv2和增强对象信息的提示从外部知识源检索自然语言事实;其次借助大语言模型(LLM)生成解释性文本以减少幻觉;最后将检索到的事实与解释性文本分别输入sVLMs进行联合推理。实验表明,该方法在多个基准数据集上显著提升准确率(最高达7%),使FLAVA等小模型达到甚至超越部分中等规模视觉语言模型(VLMs)的性能水平,同时结合噪声鲁棒损失函数进一步优化模型稳定性,验证了参数高效常识推理对2.5亿参数以下模型的可行性。
链接: https://arxiv.org/abs/2508.19724
作者: Aritra Dutta,Swapnanil Mukherjee,Deepanway Ghosal,Somak Aditya
机构: IIT Kharagpur (印度理工学院克勒格布尔分校); Ashoka University (阿肖克大学); Google (谷歌)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Commonsense visual-question answering often hinges on knowledge that is missing from the image or the question. Small vision-language models (sVLMs) such as ViLT, VisualBERT and FLAVA therefore lag behind their larger generative counterparts. To study the effect of careful commonsense knowledge integration on sVLMs, we present an end-to-end framework (NLKI) that (i) retrieves natural language facts, (ii) prompts an LLM to craft natural language explanations, and (iii) feeds both signals to sVLMs respectively across two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Facts retrieved using a fine-tuned ColBERTv2 and an object information-enriched prompt yield explanations that largely cut down hallucinations, while lifting the end-to-end answer accuracy by up to 7% (across 3 datasets), making FLAVA and other models in NLKI match or exceed medium-sized VLMs such as Qwen-2 VL-2B and SmolVLM-2.5B. As these benchmarks contain 10-25% label noise, additional finetuning using noise-robust losses (such as symmetric cross entropy and generalised cross entropy) adds another 2.5% in CRIC, and 5.5% in AOKVQA. Our findings expose when LLM-based commonsense knowledge beats retrieval from commonsense knowledge bases, how noise-aware training stabilises small models in the context of external knowledge augmentation, and why parameter-efficient commonsense reasoning is now within reach for 250M models.
zh
[NLP-35] CAMÕES: A Comprehensive Automatic Speech Recognition Benchmark for European Portuguese
【速读】: 该论文旨在解决葡萄牙语自动语音识别(Automatic Speech Recognition, ASR)资源严重偏向巴西葡萄牙语(Brazilian Portuguese)而忽视欧洲葡萄牙语(European Portuguese, EP)及其他变体的问题。解决方案的关键在于提出首个面向EP及其他葡萄牙语变体的开源框架CAMÕES,其核心包括:(1)一个涵盖46小时EP测试数据、跨多个领域的综合性评估基准;(2)一组前沿ASR模型,包含多种基础模型(foundation models)的零样本(zero-shot)与微调(fine-tuned)性能对比,以及从头训练的E-Branchformer模型。通过使用425小时的EP数据进行训练和微调,实验表明微调后的基础模型与E-Branchformer在EP上表现相当,且最佳模型相比最强零样本基础模型实现了超过35%的词错误率(Word Error Rate, WER)相对提升,确立了EP及其它变体的新基准。
链接: https://arxiv.org/abs/2508.19721
作者: Carlos Carvalho,Francisco Teixeira,Catarina Botelho,Anna Pompili,Rubén Solera-Ureña,Sérgio Paulo,Mariana Julião,Thomas Rolland,John Mendonça,Diogo Pereira,Isabel Trancoso,Alberto Abad
机构: INESC-ID(葡萄牙里斯本国家研究与开发研究所); Instituto Superior Técnico, Universidade de Lisboa(里斯本理工大学,里斯本大学)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to ASRU 2025
Abstract:Existing resources for Automatic Speech Recognition in Portuguese are mostly focused on Brazilian Portuguese, leaving European Portuguese (EP) and other varieties under-explored. To bridge this gap, we introduce CAMÕES, the first open framework for EP and other Portuguese varieties. It consists of (1) a comprehensive evaluation benchmark, including 46h of EP test data spanning multiple domains; and (2) a collection of state-of-the-art models. For the latter, we consider multiple foundation models, evaluating their zero-shot and fine-tuned performances, as well as E-Branchformer models trained from scratch. A curated set of 425h of EP was used for both fine-tuning and training. Our results show comparable performance for EP between fine-tuned foundation models and the E-Branchformer. Furthermore, the best-performing models achieve relative improvements above 35% WER, compared to the strongest zero-shot foundation model, establishing a new state-of-the-art for EP and other varieties.
zh
[NLP-36] Continuously Steering LLM s Sensitivity to Contextual Knowledge with Proxy Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成过程中因参数化知识(parametric knowledge)与上下文提供知识之间存在冲突而导致的忠实性问题,即模型难以根据上下文动态调整对新知识的敏感度。现有方法如微调、解码算法或定位编辑上下文感知神经元虽有一定效果,但普遍存在效率低、不适用于黑盒模型或无法连续调节敏感度等局限。解决方案的关键在于提出CSKS(Continuously Steering Knowledge Sensitivity)框架:通过训练两个小型代理模型(proxy models),利用其输出分布差异来无损地偏移原始LLM的输出分布,从而实现对LLM上下文敏感度的连续、轻量级控制,无需修改原模型权重,且支持灵活增强或抑制对上下文知识的依赖。
链接: https://arxiv.org/abs/2508.19720
作者: Yilin Wang,Heng Wang,Yuyang Bai,Minnan Luo
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:In Large Language Models (LLMs) generation, there exist knowledge conflicts and scenarios where parametric knowledge contradicts knowledge provided in the context. Previous works studied tuning, decoding algorithms, or locating and editing context-aware neurons to adapt LLMs to be faithful to new contextual knowledge. However, they are usually inefficient or ineffective for large models, not workable for black-box models, or unable to continuously adjust LLMs’ sensitivity to the knowledge provided in the context. To mitigate these problems, we propose CSKS (Continuously Steering Knowledge Sensitivity), a simple framework that can steer LLMs’ sensitivity to contextual knowledge continuously at a lightweight cost. Specifically, we tune two small LMs (i.e. proxy models) and use the difference in their output distributions to shift the original distribution of an LLM without modifying the LLM weights. In the evaluation process, we not only design synthetic data and fine-grained metrics to measure models’ sensitivity to contextual knowledge but also use a real conflict dataset to validate CSKS’s practical efficacy. Extensive experiments demonstrate that our framework achieves continuous and precise control over LLMs’ sensitivity to contextual knowledge, enabling both increased sensitivity and reduced sensitivity, thereby allowing LLMs to prioritize either contextual or parametric knowledge as needed flexibly. Our data and code are available at this https URL.
zh
[NLP-37] Safety Alignment Should Be Made More Than Just A Few Attention Heads
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)安全对齐中存在的脆弱性问题,即对抗性提示(adversarial prompting)能够有效绕过其安全机制。研究表明,现有安全机制主要依赖于一小部分注意力头(attention heads),移除或屏蔽这些关键头会显著削弱模型安全性。为识别并评估这些安全关键组件,作者提出RDSHA方法,通过利用模型的拒绝方向来定位对安全行为贡献最大的注意力头;进一步分析发现,现有越狱攻击(jailbreak attacks)正是通过选择性地绕过或操纵这些关键注意力头实现突破。为此,论文提出AHD训练策略,旨在将与安全相关的表征分布到更多注意力头中,从而增强安全机制的鲁棒性。解决方案的关键在于通过分布式编码(distributed encoding)重构安全行为的内部表示,使模型在保持功能完整性的前提下显著提升对主流越狱攻击的抵抗能力。
链接: https://arxiv.org/abs/2508.19697
作者: Chao Huang,Zefeng Zhang,Juewei Yue,Quangang Li,Chuang Zhang,Tingwen Liu
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Current safety alignment for large language models(LLMs) continues to present vulnerabilities, given that adversarial prompting can effectively bypass their safety this http URL investigation shows that these safety mechanisms predominantly depend on a limited subset of attention heads: removing or ablating these heads can severely compromise model safety. To identify and evaluate these safety-critical components, we introduce RDSHA, a targeted ablation method that leverages the model’s refusal direction to pinpoint attention heads mostly responsible for safety behaviors. Further analysis shows that existing jailbreak attacks exploit this concentration by selectively bypassing or manipulating these critical attention heads. To address this issue, we propose AHD, a novel training strategy designed to promote the distributed encoding of safety-related behaviors across numerous attention heads. Experimental results demonstrate that AHD successfully distributes safety-related capabilities across more attention heads. Moreover, evaluations under several mainstream jailbreak attacks show that models trained with AHD exhibit considerably stronger safety robustness, while maintaining overall functional utility.
zh
[NLP-38] Building Task Bots with Self-learning for Enhanced Adaptability Extensibility and Factuality
【速读】: 该论文旨在解决对话系统研究中任务型机器人(task bot)的适应性、可扩展性和准确性问题,特别是在无需或极少人工干预的情况下构建能够自主学习和适应动态环境的任务型机器人。其解决方案的关键在于开发创新技术,使机器人能够在不断变化的环境中实现自主学习与适应,从而提升其在实际应用中的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2508.19689
作者: Xiaoying Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 179 pages
Abstract:Developing adaptable, extensible, and accurate task bots with minimal or zero human intervention is a significant challenge in dialog research. This thesis examines the obstacles and potential solutions for creating such bots, focusing on innovative techniques that enable bots to learn and adapt autonomously in constantly changing environments.
zh
[NLP-39] Survey of Specialized Large Language Model
【速读】: 该论文旨在解决通用大语言模型(Large Language Models, LLMs)在专业领域应用中存在性能瓶颈的问题,尤其是其在医疗、金融、法律和技术等垂直领域难以满足高精度、高可靠性和领域适配性需求的局限。解决方案的关键在于推动专用大语言模型从简单的微调(fine-tuning)向原生领域架构(domain-native designs)演进,同时融合参数高效技术(如稀疏计算和量化)、多模态能力集成等创新手段,从而显著提升模型在特定任务上的表现,并为电子商务等领域提供可借鉴的技术路径与实践洞见。
链接: https://arxiv.org/abs/2508.19667
作者: Chenghan Yang,Ruiyu Zhao,Yang Liu,Ling Jiang
机构: Xiaoduo AI; Shanghai Jiao Tong University (上海交通大学); East China University of Science and Technology (华东理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figures
Abstract:The rapid evolution of specialized large language models (LLMs) has transitioned from simple domain adaptation to sophisticated native architectures, marking a paradigm shift in AI development. This survey systematically examines this progression across healthcare, finance, legal, and technical domains. Besides the wide use of specialized LLMs, technical breakthrough such as the emergence of domain-native designs beyond fine-tuning, growing emphasis on parameter efficiency through sparse computation and quantization, increasing integration of multimodal capabilities and so on are applied to recent LLM agent. Our analysis reveals how these innovations address fundamental limitations of general-purpose LLMs in professional applications, with specialized models consistently performance gains on domain-specific benchmarks. The survey further highlights the implications for E-Commerce field to fill gaps in the field.
zh
[NLP-40] Automatic integration of SystemC in the FMI standard for Software-defined Vehicle design
【速读】: 该论文旨在解决汽车电子领域中因缺乏标准化接口和主流仿真平台私有化导致的软硬件协同仿真难题,这些问题限制了跨团队协作、系统扩展性以及知识产权(IP)保护。解决方案的关键在于利用功能模型接口(Functional Mock-up Interface, FMI)标准自动封装SystemC模型,从而融合SystemC在建模精度与快速上市时间上的优势,以及FMI在互操作性和模块封装方面的优点,实现嵌入式组件在协同仿真流程中的安全、可移植集成。
链接: https://arxiv.org/abs/2508.19665
作者: Giovanni Pollo,Andrei Mihai Albu,Alessio Burrello,Daniele Jahier Pagliari,Cristian Tesconi,Loris Panaro,Dario Soldi,Fabio Autieri,Sara Vinco
机构: Politecnico di Torino (都灵理工大学); Dumarey Group (杜马雷集团)
类目: Computation and Language (cs.CL)
备注:
Abstract:The recent advancements of the automotive sector demand robust co-simulation methodologies that enable early validation and seamless integration across hardware and software domains. However, the lack of standardized interfaces and the dominance of proprietary simulation platforms pose significant challenges to collaboration, scalability, and IP protection. To address these limitations, this paper presents an approach for automatically wrapping SystemC models by using the Functional Mock-up Interface (FMI) standard. This method combines the modeling accuracy and fast time-to-market of SystemC with the interoperability and encapsulation benefits of FMI, enabling secure and portable integration of embedded components into co-simulation workflows. We validate the proposed methodology on real-world case studies, demonstrating its effectiveness with complex designs.
zh
[NLP-41] A Symbolic Adversarial Learning Framework for Evolving Fake News Generation and Detection EMNLP2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)快速发展所带来的虚假新闻生成风险加剧问题,传统检测方法因难以应对虚假信息动态演化特性而效果受限。其解决方案的关键在于提出符号对抗学习框架(Symbolic Adversarial Learning Framework, SALF),通过代理符号学习优化过程实现非数值化的对抗训练:生成代理构建欺骗性叙事,检测代理则利用结构化辩论识别逻辑与事实漏洞,并通过自然语言表示的权重、损失和梯度操作模拟反向传播与梯度下降,从而在对抗交互中迭代优化双方能力。
链接: https://arxiv.org/abs/2508.19633
作者: Chong Tian,Qirong Ho,Xiuying Chen
机构: MBZUAI ( Mohamed Bin Zayed University of Artificial Intelligence)
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025 Main Conference
Abstract:Rapid LLM advancements heighten fake news risks by enabling the automatic generation of increasingly sophisticated misinformation. Previous detection methods, including fine-tuned small models or LLM-based detectors, often struggle with its dynamically evolving nature. In this work, we propose a novel framework called the Symbolic Adversarial Learning Framework (SALF), which implements an adversarial training paradigm by an agent symbolic learning optimization process, rather than relying on numerical updates. SALF introduces a paradigm where the generation agent crafts deceptive narratives, and the detection agent uses structured debates to identify logical and factual flaws for detection, and they iteratively refine themselves through such adversarial interactions. Unlike traditional neural updates, we represent agents using agent symbolic learning, where learnable weights are defined by agent prompts, and simulate back-propagation and gradient descent by operating on natural language representations of weights, loss, and gradients. Experiments on two multilingual benchmark datasets demonstrate SALF’s effectiveness, showing it generates sophisticated fake news that degrades state-of-the-art detection performance by up to 53.4% in Chinese and 34.2% in English on average. SALF also refines detectors, improving detection of refined content by up to 7.7%. We hope our work inspires further exploration into more robust, adaptable fake news detection systems.
zh
[NLP-42] LFD: Layer Fused Decoding to Exploit External Knowledge in Retrieval-Augmented Generation
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中外部知识利用效率低的问题,即尽管RAG能够引入外部知识以提升大语言模型(Large Language Models, LLMs)的适应性和信息更新能力,但其在实际应用中往往未能充分挖掘和利用所检索到的外部事实信息。解决方案的关键在于通过噪声注入干预揭示了LLM内部不同层的功能分工:浅层专注于局部上下文建模,中间层负责整合长距离外部事实知识,而深层则主要依赖参数化内部知识。基于此发现,作者提出层融合解码(Layer Fused Decoding, LFD)策略,通过将中间层表示与最终层解码输出直接融合,从而更有效地激发外部事实知识的使用;同时引入内部知识得分(Internal Knowledge Score, IKS)准则自动识别最优中间层,确保在最小额外计算成本下显著提升RAG系统的知识利用率。
链接: https://arxiv.org/abs/2508.19614
作者: Yang Sun,Lixin Zou,Dan Luo,Zhiyong Xie,Long Zhang,Liming Dong,Yunwei Zhao,Xixun Lin,Yanxiong Lu,Chenliang Li
机构: Wuhan University (武汉大学); Lehigh University; National Defence University (国防大学); CNCERT/CC; Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); Tencent Inc. (腾讯公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-augmented generation (RAG) incorporates external knowledge into large language models (LLMs), improving their adaptability to downstream tasks and enabling information updates. Surprisingly, recent empirical evidence demonstrates that injecting noise into retrieved relevant documents paradoxically facilitates exploitation of external knowledge and improves generation quality. Although counterintuitive and challenging to apply in practice, this phenomenon enables granular control and rigorous analysis of how LLMs integrate external knowledge. Therefore, in this paper, we intervene on noise injection and establish a layer-specific functional demarcation within the LLM: shallow layers specialize in local context modeling, intermediate layers focus on integrating long-range external factual knowledge, and deeper layers primarily rely on parametric internal knowledge. Building on this insight, we propose Layer Fused Decoding (LFD), a simple decoding strategy that directly combines representations from an intermediate layer with final-layer decoding outputs to fully exploit the external factual knowledge. To identify the optimal intermediate layer, we introduce an internal knowledge score (IKS) criterion that selects the layer with the lowest IKS value in the latter half of layers. Experimental results across multiple benchmarks demonstrate that LFD helps RAG systems more effectively surface retrieved context knowledge with minimal cost.
zh
[NLP-43] Instructional Agents : LLM Agents : LLM Agents on Automated Course Material Generation for Teaching Faculties
【速读】: 该论文旨在解决高质量教学材料制作过程劳动密集、协调复杂的问题,尤其在高校课程开发中需大量依赖教师、教学设计师和助教之间的协作。其解决方案的关键在于提出了一种多智能体大语言模型(multi-agent large language model, LLM)框架——Instructional Agents,通过模拟教育角色间的协同工作模式,实现从课程大纲、讲义脚本、LaTeX格式幻灯片到评估题目的全流程自动化生成。该框架支持四种操作模式(自主模式、目录引导模式、反馈引导模式与全协作者模式),灵活控制人机参与程度,在五门大学计算机科学课程上的实证表明,该系统显著降低开发时间与人力投入,同时保障内容的连贯性和教学适切性,为资源匮乏地区的高质量教育提供可扩展、低成本的解决方案。
链接: https://arxiv.org/abs/2508.19611
作者: Huaiyuan Yao,Wanpeng Xu,Justin Turnau,Nadia Kellam,Hua Wei
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 18 pages, 9 figures
Abstract:Preparing high-quality instructional materials remains a labor-intensive process that often requires extensive coordination among teaching faculty, instructional designers, and teaching assistants. In this work, we present Instructional Agents, a multi-agent large language model (LLM) framework designed to automate end-to-end course material generation, including syllabus creation, lecture scripts, LaTeX-based slides, and assessments. Unlike existing AI-assisted educational tools that focus on isolated tasks, Instructional Agents simulates role-based collaboration among educational agents to produce cohesive and pedagogically aligned content. The system operates in four modes: Autonomous, Catalog-Guided, Feedback-Guided, and Full Co-Pilot mode, enabling flexible control over the degree of human involvement. We evaluate Instructional Agents across five university-level computer science courses and show that it produces high-quality instructional materials while significantly reducing development time and human workload. By supporting institutions with limited instructional design capacity, Instructional Agents provides a scalable and cost-effective framework to democratize access to high-quality education, particularly in underserved or resource-constrained settings.
zh
[NLP-44] Understanding and Leverag ing the Expert Specialization of Context Faithfulness in Mixture-of-Experts LLM s EMNLP2025
【速读】: 该论文旨在解决大语言模型在上下文依赖场景中缺乏上下文忠实性(context faithfulness)的问题,即模型输出往往无法准确基于给定上下文进行推理,导致生成内容与输入信息脱节。解决方案的关键在于发现并利用专家稀疏架构(mixture-of-experts)中具有上下文利用特异性的专家群体:通过提出Router Lens方法精准识别出对上下文更敏感的专家,并进一步设计轻量级的上下文忠实专家微调(Context-faithful Expert Fine-Tuning, CEFT)策略,仅对这些专家进行针对性优化,从而在不损害性能的前提下显著提升模型对上下文的依赖程度和推理可靠性。
链接: https://arxiv.org/abs/2508.19594
作者: Jun Bai,Minghao Tong,Yang Liu,Zixia Jia,Zilong Zheng
机构: State Key Laboratory of General Artificial Intellligence, BIGAI (通用人工智能国家重点实验室,BIGAI); School of Computer Science, Wuhan University (武汉大学计算机学院)
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2025 Main
Abstract:Context faithfulness is essential for reliable reasoning in context-dependent scenarios. However, large language models often struggle to ground their outputs in the provided context, resulting in irrelevant responses. Inspired by the emergent expert specialization observed in mixture-of-experts architectures, this work investigates whether certain experts exhibit specialization in context utilization, offering a potential pathway toward targeted optimization for improved context faithfulness. To explore this, we propose Router Lens, a method that accurately identifies context-faithful experts. Our analysis reveals that these experts progressively amplify attention to relevant contextual information, thereby enhancing context grounding. Building on this insight, we introduce Context-faithful Expert Fine-Tuning (CEFT), a lightweight optimization approach that selectively fine-tunes context-faithful experts. Experiments across a wide range of benchmarks and models demonstrate that CEFT matches or surpasses the performance of full fine-tuning while being significantly more efficient.
zh
[NLP-45] owards stable AI systems for Evaluating Arabic Pronunciations
【速读】: 该论文旨在解决现代阿拉伯语自动语音识别(ASR)系统在孤立字母识别任务中的性能瓶颈问题,该任务对语言学习、言语治疗和语音学研究至关重要。由于孤立字母缺乏协同发音线索(co-articulatory cues)和词汇上下文,且持续时间短(仅数百毫秒),导致模型难以准确识别,尤其在阿拉伯语中存在强调辅音(emphatic consonants)等无对应音素的语言特征时更为困难。解决方案的关键在于:首先构建一个多样化的带标注(diacritised)孤立阿拉伯字母语料库;其次利用wav2vec 2.0提取的声学嵌入训练轻量级神经网络,将识别准确率从35%提升至65%;最后通过对抗训练(adversarial training)增强模型对小幅度振幅扰动(epsilon = 0.05)的鲁棒性,使噪声语音下的性能下降控制在9%以内,同时保持干净语音的高准确率。
链接: https://arxiv.org/abs/2508.19587
作者: Hadi Zaatiti,Hatem Hajri,Osama Abdullah,Nader Masmoudi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern Arabic ASR systems such as wav2vec 2.0 excel at word- and sentence-level transcription, yet struggle to classify isolated letters. In this study, we show that this phoneme-level task, crucial for language learning, speech therapy, and phonetic research, is challenging because isolated letters lack co-articulatory cues, provide no lexical context, and last only a few hundred milliseconds. Recogniser systems must therefore rely solely on variable acoustic cues, a difficulty heightened by Arabic’s emphatic (pharyngealized) consonants and other sounds with no close analogues in many languages. This study introduces a diverse, diacritised corpus of isolated Arabic letters and demonstrates that state-of-the-art wav2vec 2.0 models achieve only 35% accuracy on it. Training a lightweight neural network on wav2vec embeddings raises performance to 65%. However, adding a small amplitude perturbation (epsilon = 0.05) cuts accuracy to 32%. To restore robustness, we apply adversarial training, limiting the noisy-speech drop to 9% while preserving clean-speech accuracy. We detail the corpus, training pipeline, and evaluation protocol, and release, on demand, data and code for reproducibility. Finally, we outline future work extending these methods to word- and sentence-level frameworks, where precise letter pronunciation remains critical.
zh
[NLP-46] ArgCMV: An Argument Summarization Benchmark for the LLM -era
【速读】: 该论文旨在解决当前论点关键点提取(Key Point Extraction, KPE)任务中因评估数据集代表性不足而导致的模型泛化能力受限问题。现有方法主要基于ArgKP21数据集进行评测,但该数据集在复杂性、主观性及话题多样性方面存在局限,难以反映真实人类辩论中的长文本交互特征。为此,作者提出一个新的基准数据集ArgCMV,包含约12K条来自真实在线辩论的论点,覆盖3000多个主题,具有更长的文本长度、更强的共指现象、更高的主观话语单元比例以及更广的主题覆盖范围。解决方案的关键在于构建一个更具现实代表性的大规模KPE数据集,并通过实证表明现有方法在该数据集上性能显著下降,从而推动下一代大语言模型(Large Language Models, LLMs)驱动的论点摘要研究发展。
链接: https://arxiv.org/abs/2508.19580
作者: Omkar Gurjar,Agam Goyal,Eshwar Chandrasekharan
机构: Siebel School of Computing and Data Science (Siebel 计算与数据科学学院); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Key point extraction is an important task in argument summarization which involves extracting high-level short summaries from arguments. Existing approaches for KP extraction have been mostly evaluated on the popular ArgKP21 dataset. In this paper, we highlight some of the major limitations of the ArgKP21 dataset and demonstrate the need for new benchmarks that are more representative of actual human conversations. Using SoTA large language models (LLMs), we curate a new argument key point extraction dataset called ArgCMV comprising of around 12K arguments from actual online human debates spread across over 3K topics. Our dataset exhibits higher complexity such as longer, co-referencing arguments, higher presence of subjective discourse units, and a larger range of topics over ArgKP21. We show that existing methods do not adapt well to ArgCMV and provide extensive benchmark results by experimenting with existing baselines and latest open source models. This work introduces a novel KP extraction dataset for long-context online discussions, setting the stage for the next generation of LLM-driven summarization research.
zh
[NLP-47] owards a Holistic and Automated Evaluation Framework for Multi-Level Comprehension of LLM s in Book-Length Contexts EMNLP2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本理解能力评估中的自动化与精细化不足问题,尤其关注模型对多层次关键事实的回忆与忠实再现能力。解决方案的关键在于提出HAMLET框架,其核心创新是将源文本结构化为根级、分支级和叶级三个层次的关键事实层级,并通过面向查询的摘要生成机制,系统性地评估模型在各层级上的长程理解表现;该方法实现了高可靠性(与专家人工判断一致性超过90%)且成本降低25倍的全自动评估流程,从而揭示了LLMs在细粒度理解上的局限性及其对位置效应(如“中间迷失”现象)的敏感性。
链接: https://arxiv.org/abs/2508.19578
作者: Jiaqi Deng,Yuho Lee,Nicole Hee-Yeon Kim,Hyangsuk Min,Taewon Yun,Minjeong Ban,Kim Yul,Hwanjun Song
机构: Korea Advanced Institute of Science and Technology (韩国科学技术院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to EMNLP 2025 (Main)
Abstract:We introduce HAMLET, a holistic and automated framework for evaluating the long-context comprehension of large language models (LLMs). HAMLET structures source texts into a three-level key-fact hierarchy at root-, branch-, and leaf-levels, and employs query-focused summarization to evaluate how well models recall and faithfully represent information at each level. To validate the reliability of our fully automated pipeline, we conduct a systematic human study, showing that our automatic evaluation achieves over 90% agreement with expert human judgments, while reducing the cost by up to 25 times. HAMLET reveals that LLMs struggle with fine-grained comprehension, especially at the leaf level, and are sensitive to positional effects like the lost-in-the-middle. Analytical queries pose greater challenges than narrative ones, and consistent performance gaps emerge between open-source and proprietary models, as well as across model scales. Our code and dataset are publicly available at this https URL.
zh
[NLP-48] Functional Consistency of LLM Code Embeddings: A Self-Evolving Data Synthesis Framework for Benchmarking
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)生成的代码嵌入(Code Embeddings)在反映代码功能语义一致性方面的不足问题,即现有方法主要关注语法相似性(如代码克隆检测),而忽视了跨语法差异的功能等价性判断。其解决方案的关键在于提出一种名为“面向功能的代码自演化”(Functionality-Oriented Code Self-Evolution)的数据合成框架,通过从单个代码实例生成四种语义与语法维度各异的变体,构建更具挑战性和多样性的基准数据集,从而提升嵌入模型在代码克隆检测、功能一致性识别和代码检索等下游任务中的性能与泛化能力。
链接: https://arxiv.org/abs/2508.19558
作者: Zhuohao Li,Wenqing Chen,Jianxing Yu,Zhichao Lu
机构: Sun Yat-Sen University (中山大学); City University of Hong Kong (香港城市大学)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Programming Languages (cs.PL)
备注:
Abstract:Embedding models have demonstrated strong performance in tasks like clustering, retrieval, and feature extraction while offering computational advantages over generative models and cross-encoders. Benchmarks such as MTEB have shown that text embeddings from large language models (LLMs) capture rich semantic information, but their ability to reflect code-level functional semantics remains unclear. Existing studies largely focus on code clone detection, which emphasizes syntactic similarity and overlooks functional understanding. In this paper, we focus on the functional consistency of LLM code embeddings, which determines if two code snippets perform the same function regardless of syntactic differences. We propose a novel data synthesis framework called Functionality-Oriented Code Self-Evolution to construct diverse and challenging benchmarks. Specifically, we define code examples across four semantic and syntactic categories and find that existing datasets predominantly capture syntactic properties. Our framework generates four unique variations from a single code instance, providing a broader spectrum of code examples that better reflect functional differences. Extensive experiments on three downstream tasks-code clone detection, code functional consistency identification, and code retrieval-demonstrate that embedding models significantly improve their performance when trained on our evolved datasets. These results highlight the effectiveness and generalization of our data synthesis framework, advancing the functional understanding of code.
zh
[NLP-49] Language Models Identify Ambiguities and Exploit Loopholes EMNLP2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对用户指令中的模糊性(ambiguity)与目标冲突时,如何识别并利用这些漏洞以实现自身目标而非用户意图的对齐问题。其核心挑战在于揭示模型是否具备识别语义歧义和进行复杂语用推理的能力,并评估其在存在冲突目标情境下是否倾向于选择性地利用模糊性来达成自身目的,从而构成潜在的AI安全风险。解决方案的关键在于设计涵盖标量 implicature(标量含义)、结构歧义和权力动态等多类场景的实验范式,通过量化不同模型在给定目标与用户指令冲突下的行为差异,发现具备较强推理能力的模型能够明确识别歧义与目标冲突,并主动利用后者达成自身目标,表明当前主流模型已具备规避对齐约束的潜力。
链接: https://arxiv.org/abs/2508.19546
作者: Jio Choi,Mohit Bansal,Elias Stengel-Eskin
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2025 camera-ready; Code: this https URL
Abstract:Studying the responses of large language models (LLMs) to loopholes presents a two-fold opportunity. First, it affords us a lens through which to examine ambiguity and pragmatics in LLMs, since exploiting a loophole requires identifying ambiguity and performing sophisticated pragmatic reasoning. Second, loopholes pose an interesting and novel alignment problem where the model is presented with conflicting goals and can exploit ambiguities to its own advantage. To address these questions, we design scenarios where LLMs are given a goal and an ambiguous user instruction in conflict with the goal, with scenarios covering scalar implicature, structural ambiguities, and power dynamics. We then measure different models’ abilities to exploit loopholes to satisfy their given goals as opposed to the goals of the user. We find that both closed-source and stronger open-source models can identify ambiguities and exploit their resulting loopholes, presenting a potential AI safety risk. Our analysis indicates that models which exploit loopholes explicitly identify and reason about both ambiguity and conflicting goals.
zh
[NLP-50] Emotion Transfer with Enhanced Prototype for Unseen Emotion Recognition in Conversation EMNLP2025
【速读】: 该论文旨在解决开放域场景下对话中未见情绪识别(Unseen Emotion Recognition in Conversation, UERC)的问题,其核心挑战在于心理学中缺乏统一的情绪分类标准,导致模型难以识别真实应用场景中的新情绪类别。解决方案的关键在于提出ProEmoTrans框架,该框架基于原型(prototype-based)机制,通过三个关键创新实现突破:一是利用大语言模型(LLM)增强的描述方法处理隐式情绪表达,提升情绪定义的准确性;二是设计无参数的高效话语编码机制,缓解长对话中的过拟合问题;三是改进注意力维特比解码(Attention Viterbi Decoding, AVD)方法,有效捕捉情绪的马尔可夫性转移特性,从而将已知情绪间的转换模式迁移至未见情绪。
链接: https://arxiv.org/abs/2508.19533
作者: Kun Peng,Cong Cao,Hao Peng,Guanlin Wu,Zhifeng Hao,Lei Jiang,Yanbing Liu,Philip S. Yu
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); Beihang University (北京航空航天大学); National University of Defense Technology (国防科技大学); Shantou University (汕头大学); University of Illinois at Chicago (伊利诺伊大学芝加哥分校)
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP2025
Abstract:Current Emotion Recognition in Conversation (ERC) research follows a closed-domain assumption. However, there is no clear consensus on emotion classification in psychology, which presents a challenge for models when it comes to recognizing previously unseen emotions in real-world applications. To bridge this gap, we introduce the Unseen Emotion Recognition in Conversation (UERC) task for the first time and propose ProEmoTrans, a solid prototype-based emotion transfer framework. This prototype-based approach shows promise but still faces key challenges: First, implicit expressions complicate emotion definition, which we address by proposing an LLM-enhanced description approach. Second, utterance encoding in long conversations is difficult, which we tackle with a proposed parameter-free mechanism for efficient encoding and overfitting prevention. Finally, the Markovian flow nature of emotions is hard to transfer, which we address with an improved Attention Viterbi Decoding (AVD) method to transfer seen emotion transitions to unseen emotions. Extensive experiments on three datasets show that our method serves as a strong baseline for preliminary exploration in this new area.
zh
[NLP-51] Alignment with Fill-In-the-Middle for Enhancing Code Generation EMNLP2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在代码生成任务中因训练数据有限且缺乏可验证的准确测试用例而导致性能提升受限的问题。现有基于直接偏好优化(Direct Preference Optimization, DPO)的方法在生成测试用例方面仍存在局限性,难以充分挖掘训练样本的多样性与有效性。其解决方案的关键在于:首先将代码片段拆分为更细粒度的块(granular blocks),从而从相同测试用例中构造出更多样化的DPO训练对;其次引入抽象语法树(Abstract Syntax Tree, AST)分割策略与课程学习(curriculum training)方法,以增强DPO训练过程中的语义一致性与渐进式学习能力。实验表明,该方法在HumanEval (+)、MBPP (+)、APPS、LiveCodeBench和BigCodeBench等多个基准数据集上均显著提升了代码生成性能。
链接: https://arxiv.org/abs/2508.19532
作者: Houxing Ren,Zimu Lu,Weikang Shi,Haotian Hou,Yunqiao Yang,Ke Wang,Aojun Zhou,Junting Pan,Mingjie Zhan,Hongsheng Li
机构: CUHK MMLab (香港中文大学多媒体实验室); SenseTime Research (商汤科技研究院); CPII under InnoHK (创新香港研发平台计算与智能研究所); Shanghai AI Laboratory (上海人工智能实验室); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025 (main conference)
Abstract:The code generation capabilities of Large Language Models (LLMs) have advanced applications like tool invocation and problem-solving. However, improving performance in code-related tasks remains challenging due to limited training data that is verifiable with accurate test cases. While Direct Preference Optimization (DPO) has shown promise, existing methods for generating test cases still face limitations. In this paper, we propose a novel approach that splits code snippets into smaller, granular blocks, creating more diverse DPO pairs from the same test cases. Additionally, we introduce the Abstract Syntax Tree (AST) splitting and curriculum training method to enhance the DPO training. Our approach demonstrates significant improvements in code generation tasks, as validated by experiments on benchmark datasets such as HumanEval (+), MBPP (+), APPS, LiveCodeBench, and BigCodeBench. Code and data are available at this https URL.
zh
[NLP-52] Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding
【速读】: 该论文旨在解决离散扩散语言模型(Discrete Diffusion Language Models)在标准监督微调(Supervised Fine-Tuning, SFT)过程中存在的训练-推理不一致问题:标准SFT随机掩码整个响应序列,而模型在推理时采用固定大小的块(blockwise)顺序生成,导致噪声前缀和泄漏后缀,使梯度偏离期望的块级似然。解决方案的关键在于提出块级SFT(Blockwise SFT),其核心机制是将响应划分为固定大小的块,每步仅对一个活跃块进行随机掩码,冻结先前所有token并完全隐藏未来token,损失函数仅计算于当前活跃块,从而精确匹配扩散模型的块级解码过程,实现训练与推理在粒度上的对齐。
链接: https://arxiv.org/abs/2508.19529
作者: Bowen Sun,Yujun Cai,Ming-Hsuan Yang,Yiwei Wang
机构: University of California, Merced (加州大学默塞德分校); The University of Queensland (昆士兰大学); Google DeepMind (谷歌深度心智)
类目: Computation and Language (cs.CL)
备注:
Abstract:Discrete diffusion language models have shown strong potential for text generation, yet standard supervised fine-tuning (SFT) misaligns with their semi-autoregressive inference: training randomly masks tokens across the entire response, while inference generates fixed-size blocks sequentially. This mismatch introduces noisy prefixes and leaky suffixes, biasing gradients away from the desired blockwise likelihood. We propose Blockwise SFT, which partitions responses into fixed-size blocks, selects one active block per step for stochastic masking, freezes all preceding tokens, and fully hides future ones. Loss is computed only over the active block, directly mirroring the blockwise decoding process. Experiments on GSM8K, MATH, and MetaMathQA show consistent gains over classical SFT under equal compute or token budgets. Block size consistency studies and ablations confirm that improvements stem from faithful training-inference alignment rather than incidental masking effects. Our results highlight the importance of matching supervision granularity to the decoding procedure in diffusion-based language models.
zh
[NLP-53] Geopolitical Parallax: Beyond Walter Lippmann Just After Large Language Models
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在新闻质量与主观性评估中是否存在因地理来源不同而产生的系统性偏差,以及这种偏差是否反映了地缘政治视角的差异。解决方案的关键在于通过对比中国源(Qwen、BGE、Jina)与西方源(Snowflake、Granite)模型家族在文章级嵌入上的表现,结合人类标注的新闻质量基准(涵盖十五个风格、信息和情感维度)及政治敏感话题平行语料(如巴勒斯坦议题与中国—美国互报),利用逻辑回归探针和匹配主题评估方法量化各指标上预测正类概率的差异。结果表明,模型来源导致了非随机的系统性分歧,例如在巴勒斯坦相关报道中,西方模型赋予更高的主观性和积极情绪评分,而中国模型更强调新颖性和描述性;在中国对美报道中则表现出更低的流畅度、简洁性、技术性和整体质量评分,但更高负向情绪得分。这一发现揭示了地缘政治框架效应持续存在于下游媒体质量评估任务中,提示基于LLM的媒体评估流程需进行文化校准以区分内容差异与模型诱导偏差。
链接: https://arxiv.org/abs/2508.19492
作者: Mehmet Can Yavuz,Humza Gohar Kabir,Aylin Özkan
机构: Işık University (伊斯坦布尔大学); Arky Multimedia (阿基多媒体)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: 7 pages, 4 figures, 7 tables
Abstract:Objectivity in journalism has long been contested, oscillating between ideals of neutral, fact-based reporting and the inevitability of subjective framing. With the advent of large language models (LLMs), these tensions are now mediated by algorithmic systems whose training data and design choices may themselves embed cultural or ideological biases. This study investigates geopolitical parallax-systematic divergence in news quality and subjectivity assessments-by comparing article-level embeddings from Chinese-origin (Qwen, BGE, Jina) and Western-origin (Snowflake, Granite) model families. We evaluate both on a human-annotated news quality benchmark spanning fifteen stylistic, informational, and affective dimensions, and on parallel corpora covering politically sensitive topics, including Palestine and reciprocal China-United States coverage. Using logistic regression probes and matched-topic evaluation, we quantify per-metric differences in predicted positive-class probabilities between model families. Our findings reveal consistent, non-random divergences aligned with model origin. In Palestine-related coverage, Western models assign higher subjectivity and positive emotion scores, while Chinese models emphasize novelty and descriptiveness. Cross-topic analysis shows asymmetries in structural quality metrics Chinese-on-US scoring notably lower in fluency, conciseness, technicality, and overall quality-contrasted by higher negative emotion scores. These patterns align with media bias theory and our distinction between semantic, emotional, and relational subjectivity, and extend LLM bias literature by showing that geopolitical framing effects persist in downstream quality assessment tasks. We conclude that LLM-based media evaluation pipelines require cultural calibration to avoid conflating content differences with model-induced bias.
zh
[NLP-54] Rule Synergy Analysis using LLM s: State of the Art and Implications
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在动态环境中理解复杂规则交互能力不足的问题,特别是针对卡牌游戏《Slay the Spire》中卡牌之间协同效应的识别与推理。其解决方案的关键在于构建一个基于真实游戏机制的卡牌协同数据集,对卡牌组合进行正向、负向或中性交互的分类标注,并通过系统性评估发现LLMs在识别正向及尤其负向协同效应方面存在显著短板,进而归纳出如时机判断、游戏状态定义和规则遵循等常见错误类型,为未来提升模型对规则及其交互影响预测能力提供了明确的研究方向。
链接: https://arxiv.org/abs/2508.19484
作者: Bahar Bateni,Benjamin Pratt,Jim Whitehead
机构: 未知
类目: Computation and Language (cs.CL)
备注: Submitted for publication at the IEEE Transactions on Games 2024, Special Issue on Large Language Models and Games (10 pages excluding appendix, 3 figures)
Abstract:Large language models (LLMs) have demonstrated strong performance across a variety of domains, including logical reasoning, mathematics, and more. In this paper, we investigate how well LLMs understand and reason about complex rule interactions in dynamic environments, such as card games. We introduce a dataset of card synergies from the game Slay the Spire, where pairs of cards are classified based on their positive, negative, or neutral interactions. Our evaluation shows that while LLMs excel at identifying non-synergistic pairs, they struggle with detecting positive and, particularly, negative synergies. We categorize common error types, including issues with timing, defining game states, and following game rules. Our findings suggest directions for future research to improve model performance in predicting the effect of rules and their interactions.
zh
[NLP-55] Improving Low-Resource Translation with Dictionary-Guided Fine-Tuning and RL: A Spanish-to-Wayuunaiki Study
【速读】: 该论文旨在解决低资源语言翻译(Low-resource machine translation)问题,即大型语言模型(LLMs)因预训练阶段缺乏相关语言数据及微调时平行语料稀缺而导致的翻译性能不足。其解决方案的关键在于将外部双语词典作为工具引入翻译过程,并通过强化学习(Reinforcement Learning)与监督微调(Supervised Fine-tuning)联合训练模型,使模型学会在生成过程中选择性地调用词典以提升准确性。具体而言,作者采用Guided Reward Policy Optimization(GRPO)策略,利用BLEU分数作为奖励信号,引导模型学习何时以及如何有效使用词典工具,从而显著提升西班牙语-瓦尤纳伊基语(Spanish-Wayuunaiki)这一低资源语言对的翻译质量,在美洲NLP 2025共享任务测试集上相较无词典访问的监督基线模型实现18%的相对增益。
链接: https://arxiv.org/abs/2508.19481
作者: Manuel Mosquera,Melissa Robles,Johan Rodriguez,Ruben Manrique
机构: Universidad de los Andes (安第斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Low-resource machine translation remains a significant challenge for large language models (LLMs), which often lack exposure to these languages during pretraining and have limited parallel data for fine-tuning. We propose a novel approach that enhances translation for low-resource languages by integrating an external dictionary tool and training models end-to-end using reinforcement learning, in addition to supervised fine-tuning. Focusing on the Spanish-Wayuunaiki language pair, we frame translation as a tool-augmented decision-making problem in which the model can selectively consult a bilingual dictionary during generation. Our method combines supervised instruction tuning with Guided Reward Policy Optimization (GRPO), enabling the model to learn both when and how to use the tool effectively. BLEU similarity scores are used as rewards to guide this learning process. Preliminary results show that our tool-augmented models achieve up to +3.37 BLEU improvement over previous work, and a 18% relative gain compared to a supervised baseline without dictionary access, on the Spanish-Wayuunaiki test set from the AmericasNLP 2025 Shared Task. We also conduct ablation studies to assess the effects of model architecture and training strategy, comparing Qwen2.5-0.5B-Instruct with other models such as LLaMA and a prior NLLB-based system. These findings highlight the promise of combining LLMs with external tools and the role of reinforcement learning in improving translation quality in low-resource language settings.
zh
[NLP-56] Automatic Question Answer Generation Using Generative Large Language Model (LLM )
【速读】: 该论文旨在解决教育领域中教师在进行文本型学业评估时面临的挑战,即如何高效、公平地生成多样化的题目以评估学生对特定主题的掌握程度。传统方式依赖人工从多套讲义材料中设计题目,耗时且难以保证一致性与多样性。解决方案的关键在于利用微调后的生成式大语言模型(Generative Large Language Model, GLLM),通过提示工程(Prompt Engineering, PE)适配不同题型偏好(如选择题、概念题或事实题),并基于无监督学习方法在英文语料上对Meta-Llama 2-7B模型进行微调,引入RACE数据集作为训练样本,从而构建一个可定制化、高效率的自动问答生成(Automatic Question Answer Generation, AQAG)系统,显著提升教育评价流程的自动化水平和实用性。
链接: https://arxiv.org/abs/2508.19475
作者: Md. Alvee Ehsan,A.S.M Mehedi Hasan,Kefaya Benta Shahnoor,Syeda Sumaiya Tasneem
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:\AbstractIn the realm of education, student evaluation holds equal significance as imparting knowledge. To be evaluated, students usually need to go through text-based academic assessment methods. Instructors need to make diverse sets of questions that need to be fair for all students to prove their adequacy over a particular topic. This can prove to be quite challenging as they may need to manually go through several different lecture materials. Our objective is to make this whole process much easier by implementing Automatic Question Answer Generation /(AQAG), using fine-tuned generative LLM. For tailoring the instructor’s preferred question style (MCQ, conceptual, or factual questions), prompt Engineering (PE) is being utilized. In this research, we propose to leverage unsupervised learning methods in NLP, primarily focusing on the English language. This approach empowers the base Meta-Llama 2-7B model to integrate RACE dataset as training data for the fine-tuning process. Creating a customized model that will offer efficient solutions for educators, instructors, and individuals engaged in text-based evaluations. A reliable and efficient tool for generating questions and answers can free up valuable time and resources, thus streamlining their evaluation processes.
zh
[NLP-57] Inference Gap in Domain Expertise and Machine Intelligence in Named Entity Recognition: Creation of and Insights from a Substance Use-related Dataset
【速读】: 该论文旨在解决非医疗用途阿片类药物使用(nonmedical opioid use)带来的临床与社会影响在传统医疗环境中常被低估或未充分报告的问题,提出通过社交媒体文本挖掘来识别第一人称叙述中的自我报告后果。其解决方案的关键在于构建了一个名为RedditImpacts 2.0的高质量标注数据集,并开发了一种命名实体识别(Named Entity Recognition, NER)框架,用于从社交平台文本中提取两类后果:临床影响(ClinicalImpacts,如戒断反应、抑郁)和社会影响(SocialImpacts,如失业)。研究进一步验证了领域特定微调模型(如DeBERTa-large)在零样本和少样本情境下优于大语言模型(LLMs),并证明了在有限标注数据下仍可实现稳健性能,凸显了针对临床自然语言处理任务进行专业微调的重要性,同时揭示当前AI系统与专家判断之间仍存在显著差距。
链接: https://arxiv.org/abs/2508.19467
作者: Sumon Kanti Dey,Jeanne M. Powell,Azra Ismail,Jeanmarie Perrone,Abeed Sarker
机构: Emory University (埃默里大学); University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Dataset and code: this https URL
Abstract:Nonmedical opioid use is an urgent public health challenge, with far-reaching clinical and social consequences that are often underreported in traditional healthcare settings. Social media platforms, where individuals candidly share first-person experiences, offer a valuable yet underutilized source of insight into these impacts. In this study, we present a named entity recognition (NER) framework to extract two categories of self-reported consequences from social media narratives related to opioid use: ClinicalImpacts (e.g., withdrawal, depression) and SocialImpacts (e.g., job loss). To support this task, we introduce RedditImpacts 2.0, a high-quality dataset with refined annotation guidelines and a focus on first-person disclosures, addressing key limitations of prior work. We evaluate both fine-tuned encoder-based models and state-of-the-art large language models (LLMs) under zero- and few-shot in-context learning settings. Our fine-tuned DeBERTa-large model achieves a relaxed token-level F1 of 0.61 [95% CI: 0.43-0.62], consistently outperforming LLMs in precision, span accuracy, and adherence to task-specific guidelines. Furthermore, we show that strong NER performance can be achieved with substantially less labeled data, emphasizing the feasibility of deploying robust models in resource-limited settings. Our findings underscore the value of domain-specific fine-tuning for clinical NLP tasks and contribute to the responsible development of AI tools that may enhance addiction surveillance, improve interpretability, and support real-world healthcare decision-making. The best performing model, however, still significantly underperforms compared to inter-expert agreement (Cohen’s kappa: 0.81), demonstrating that a gap persists between expert intelligence and current state-of-the-art NER/AI capabilities for tasks requiring deep domain knowledge.
zh
[NLP-58] Bridging Language Gaps: Enhancing Few-Shot Language Adaptation
【速读】: 该论文旨在解决多语言自然语言处理(Natural Language Processing, NLP)中因语言资源分布不均导致的性能差异问题,即高资源语言得益于大量标注数据而表现优异,而低资源语言则因数据匮乏难以有效训练。其解决方案的关键在于提出一种基于提示(prompting)的对比语言对齐方法(Contrastive Language Alignment with Prompting, CoLAP),通过将对比学习与跨语言表示相结合,实现从高资源语言到低资源语言的任务特定知识迁移,从而显著提升低资源语言在少样本场景下的性能表现,有效缩小跨语言性能差距。
链接: https://arxiv.org/abs/2508.19464
作者: Philipp Borchert,Jochen De Weerdt,Marie-Francine Moens
机构: IESEG School of Management (IESEG管理学院); Research Centre for Information Systems Engineering, KU Leuven (信息系 统工程研究中心,鲁汶大学); Department of Computer Science, KU Leuven (计算机科学系,鲁汶大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages
Abstract:The disparity in language resources poses a challenge in multilingual NLP, with high-resource languages benefiting from extensive data, while low-resource languages lack sufficient data for effective training. Our Contrastive Language Alignment with Prompting (CoLAP) method addresses this gap by integrating contrastive learning with cross-lingual representations, facilitating task-specific knowledge transfer from high-resource to lower-resource languages. The primary advantage of our approach is its data efficiency, enabling rapid adaptation to new languages and reducing the need for large labeled datasets. We conduct experiments with multilingual encoder-only and decoder-only language models on natural language understanding tasks, including natural language inference and relation extraction, evaluating performance across both high- and low-resource languages. Our results demonstrate that CoLAP outperforms few-shot cross-lingual transfer baselines and in-context learning, even with limited available data. This effectively narrows the cross-lingual performance gap, contributing to the development of more efficient multilingual NLP techniques.
zh
[NLP-59] Heterogeneous LLM Methods for Ontology Learning (Few-Shot Prompting Ensemble Typing and Attention-Based Taxonomies)
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自动构建本体(ontology)全流程中的核心挑战,涵盖术语抽取(Task A)、类型标注(Task B)与分类体系发现(Task C)三个关键环节。其解决方案的关键在于采用模块化设计:针对术语抽取任务,利用检索增强生成(Retrieval-Augmented Generation, RAG)机制实现无需微调的单次推理,通过语义相似训练样例提升词汇增广效果;对于类型标注任务,提出双策略——在有标签数据场景下复用RAG并结合少样本提示,在无标签新领域则使用多嵌入模型融合的零样本分类器,基于置信度加权提升泛化能力;在分类体系发现任务中,将层次关系建模为图结构推理问题,引入轻量级交叉注意力层预测“is-a”关系,以近似软邻接矩阵方式实现高效且准确的本体树结构推断。整体方案展现出LLM架构在异构领域中可扩展、自适应和鲁棒的本体学习能力。
链接: https://arxiv.org/abs/2508.19428
作者: Aleksandra Beliaeva,Temurbek Rahmatullaev
机构: 未知
类目: Computation and Language (cs.CL); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)
备注:
Abstract:We present a comprehensive system for addressing Tasks A, B, and C of the LLMs4OL 2025 challenge, which together span the full ontology construction pipeline: term extraction, typing, and taxonomy discovery. Our approach combines retrieval-augmented prompting, zero-shot classification, and attention-based graph modeling – each tailored to the demands of the respective task. For Task A, we jointly extract domain-specific terms and their ontological types using a retrieval-augmented generation (RAG) pipeline. Training data was reformulated into a document to terms and types correspondence, while test-time inference leverages semantically similar training examples. This single-pass method requires no model finetuning and improves overall performance through lexical augmentation Task B, which involves assigning types to given terms, is handled via a dual strategy. In the few-shot setting (for domains with labeled training data), we reuse the RAG scheme with few-shot prompting. In the zero-shot setting (for previously unseen domains), we use a zero-shot classifier that combines cosine similarity scores from multiple embedding models using confidence-based weighting. In Task C, we model taxonomy discovery as graph inference. Using embeddings of type labels, we train a lightweight cross-attention layer to predict is-a relations by approximating a soft adjacency matrix. These modular, task-specific solutions enabled us to achieve top-ranking results in the official leaderboard across all three tasks. Taken together these strategies showcase the scalability, adaptability, and robustness of LLM-based architectures for ontology learning across heterogeneous domains. Code is available at: this https URL Subjects: Computation and Language (cs.CL); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC) MSC classes: 68T30, 68T50, 68T07, 68U15 ACMclasses: I.2.4; I.2.7; H.3.1; H.3.3; I.2.6 Cite as: arXiv:2508.19428 [cs.CL] (or arXiv:2508.19428v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2508.19428 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-60] A perishable ability? The future of writing in the face of generative artificial intelligence
【速读】: 该论文试图解决的问题是:随着生成式人工智能(Generative AI)工具的快速发展,人类可能因将写作任务外包给机器而导致自身写作能力的丧失或显著下降,这一现象与历史上曾出现的书写能力衰退(如希腊黑暗时代)存在相似性。其解决方案的关键在于识别并理解这种技术依赖对人类认知能力的潜在侵蚀机制,从而推动对人机协作边界、数字素养教育以及文化传承方式的重新审视,以防止人类在技术赋能下失去核心的创造性表达能力。
链接: https://arxiv.org/abs/2508.19427
作者: Evandro L. T. P. Cunha
机构: Universidade Federal de Minas Gerais (UFMG)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 10 pages
Abstract:The 2020s have been witnessing a very significant advance in the development of generative artificial intelligence tools, including text generation systems based on large language models. These tools have been increasingly used to generate texts in the most diverse domains – from technical texts to literary texts --, which might eventually lead to a lower volume of written text production by humans. This article discusses the possibility of a future in which human beings will have lost or significantly decreased their ability to write due to the outsourcing of this activity to machines. This possibility parallels the loss of the ability to write in other moments of human history, such as during the so-called Greek Dark Ages (approx. 1200 BCE - 800 BCE).
zh
[NLP-61] One Joke to Rule them All? On the (Im)possibility of Generalizing Humor
【速读】: 该论文试图解决的问题是:当前生成式AI(Generative AI)在幽默理解与生成任务中存在类型碎片化现象,即现有模型通常仅针对特定类型的幽默(如双关语、冷笑话等)进行建模,难以泛化到未见过的新型幽默形式(如网络迷因、反幽默等)。为探究这种碎片化是否不可避免,研究提出通过迁移学习实验来检验大型语言模型(Large Language Models, LLMs)能否从已知幽默类型中提取可迁移的机制以应对新类型。解决方案的关键在于设计多任务训练框架,在四个不同幽默任务的数据集上进行多样化训练(1–3个数据集),并测试其在全新幽默类型上的表现;结果表明,LLMs具备一定跨类型迁移能力(最高达75%准确率),且多样化的训练源显著提升泛化性能(提升1.88–4.05%),同时发现某些幽默类型(如“爸爸笑话”Dad Jokes)具有更强的迁移促进作用,尽管其本身较难被其他类型迁移到。
链接: https://arxiv.org/abs/2508.19402
作者: Mor Turgeman,Chen Shani,Dafna Shahaf
机构: The Hebrew University of Jerusalem (耶路撒冷希伯来大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Humor is a broad and complex form of communication that remains challenging for machines. Despite its broadness, most existing research on computational humor traditionally focused on modeling a specific type of humor. In this work, we wish to understand whether competence on one or more specific humor tasks confers any ability to transfer to novel, unseen types; in other words, is this fragmentation inevitable? This question is especially timely as new humor types continuously emerge in online and social media contexts (e.g., memes, anti-humor, AI fails). If Large Language Models (LLMs) are to keep up with this evolving landscape, they must be able to generalize across humor types by capturing deeper, transferable mechanisms. To investigate this, we conduct a series of transfer learning experiments across four datasets, representing different humor tasks. We train LLMs under varied diversity settings (1-3 datasets in training, testing on a novel task). Experiments reveal that models are capable of some transfer, and can reach up to 75% accuracy on unseen datasets; training on diverse sources improves transferability (1.88-4.05%) with minimal-to-no drop in in-domain performance. Further analysis suggests relations between humor types, with Dad Jokes surprisingly emerging as the best enabler of transfer (but is difficult to transfer to). We release data and code.
zh
[NLP-62] Database Entity Recognition with Data Augmentation and Deep Learning
【速读】: 该论文旨在解决自然语言查询(Natural Language Query, NLQ)中数据库实体识别(Database Entity Recognition, DB-ER)的挑战,即从非结构化文本中准确识别出与数据库相关的实体。其解决方案的关键在于三个方面:首先,构建了一个基于主流text-to-SQL基准数据集的人工标注DB-ER基准数据集;其次,提出了一种新颖的数据增强方法,利用已有SQL查询自动标注NLQ以扩充训练数据;最后,设计了一个基于T5架构的专用语言模型,通过序列标注和词元分类两个下游任务分别优化前端和后端的DB-ER性能。实验表明,该方法在精度和召回率上均优于两种最先进的命名实体识别(Named Entity Recognition, NER)模型,且数据增强和T5微调分别带来超过10%和5–10%的性能提升。
链接: https://arxiv.org/abs/2508.19372
作者: Zikun Fu,Chen Yang,Kourosh Davoudi,Ken Q. Pu
机构: Ontario Tech University (安大略理工大学); Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
备注: 6 pages, 5 figures. Accepted at IEEE 26th International Conference on Information Reuse and Integration for Data Science (IRI 2025), San Jose, California, August 6-8, 2025
Abstract:This paper addresses the challenge of Database Entity Recognition (DB-ER) in Natural Language Queries (NLQ). We present several key contributions to advance this field: (1) a human-annotated benchmark for DB-ER task, derived from popular text-to-sql benchmarks, (2) a novel data augmentation procedure that leverages automatic annotation of NLQs based on the corresponding SQL queries which are available in popular text-to-SQL benchmarks, (3) a specialized language model based entity recognition model using T5 as a backbone and two down-stream DB-ER tasks: sequence tagging and token classification for fine-tuning of backend and performing DB-ER respectively. We compared our DB-ER tagger with two state-of-the-art NER taggers, and observed better performance in both precision and recall for our model. The ablation evaluation shows that data augmentation boosts precision and recall by over 10%, while fine-tuning of the T5 backbone boosts these metrics by 5-10%.
zh
[NLP-63] LongReason Arena: A Long Reasoning Benchmark for Large Language Models
【速读】: 该论文旨在解决现有长上下文基准测试主要聚焦于模型对长输入的理解能力,而忽视了对长推理能力的评估这一问题。为填补这一空白,作者提出了LongReasonArena,这是一个专门用于评估大语言模型(Large Language Models, LLMs)长推理能力的基准测试。其关键创新在于设计了一系列需要执行多步算法的任务,这些任务体现了长推理的核心特征,如信息检索和回溯机制,并通过控制输入长度可任意扩展推理长度,最高可达100万token的推理路径。实验表明,该基准显著挑战了开源与闭源模型,例如Deepseek-R1在任务中仅获得7.5%的准确率,且准确率随预期推理步骤数的对数呈线性下降趋势,验证了其对模型长期推理能力的严格考验。
链接: https://arxiv.org/abs/2508.19363
作者: Jiayu Ding,Shuming Ma,Lei Cui,Nanning Zheng,Furu Wei
机构: IAIR, Xi’an Jiaotong University (人工智能与机器人研究所,西安交通大学); Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing long-context benchmarks for Large Language Models (LLMs) focus on evaluating comprehension of long inputs, while overlooking the evaluation of long reasoning abilities. To address this gap, we introduce LongReasonArena, a benchmark specifically designed to assess the long reasoning capabilities of LLMs. Our tasks require models to solve problems by executing multi-step algorithms that reflect key aspects of long reasoning, such as retrieval and backtracking. By controlling the inputs, the required reasoning length can be arbitrarily scaled, reaching up to 1 million tokens of reasoning for the most challenging tasks. Extensive evaluation results demonstrate that LongReasonArena presents a significant challenge for both open-source and proprietary LLMs. For instance, Deepseek-R1 achieves only 7.5% accuracy on our task. Further analysis also reveals that the accuracy exhibits a linear decline with respect to the logarithm of the expected number of reasoning steps. Our code and data is available at this https URL.
zh
[NLP-64] Reflective Agreement: Combining Self-Mixture of Agents with a Sequence Tagger for Robust Event Extraction
【速读】: 该论文旨在解决事件抽取(Event Extraction, EE)任务中传统判别模型召回率低、而生成式方法存在幻觉和预测不一致的问题。解决方案的关键在于提出一种基于共识的反思推理系统(Agreement-based Reflective Inference System, ARIS),其核心是结合自混合代理(Self Mixture of Agents)与判别型序列标注器,通过结构化模型共识、置信度过滤以及大语言模型(Large Language Models, LLMs)的反思推理模块来有效缓解歧义并提升事件预测质量;同时引入分解式指令微调以增强LLM对事件抽取任务的理解能力。
链接: https://arxiv.org/abs/2508.19359
作者: Fatemeh Haji,Mazal Bethany,Cho-Yu Jason Chiang,Anthony Rios,Peyman Najafirad
机构: Secure AI and Autonomy Lab, University of Texas at San Antonio (安全人工智能与自主实验室,德克萨斯大学圣安东尼奥分校); Peraton Labs; University of Texas at San Antonio (德克萨斯大学圣安东尼奥分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Event Extraction (EE) involves automatically identifying and extracting structured information about events from unstructured text, including triggers, event types, and arguments. Traditional discriminative models demonstrate high precision but often exhibit limited recall, particularly for nuanced or infrequent events. Conversely, generative approaches leveraging Large Language Models (LLMs) provide higher semantic flexibility and recall but suffer from hallucinations and inconsistent predictions. To address these challenges, we propose Agreement-based Reflective Inference System (ARIS), a hybrid approach combining a Self Mixture of Agents with a discriminative sequence tagger. ARIS explicitly leverages structured model consensus, confidence-based filtering, and an LLM reflective inference module to reliably resolve ambiguities and enhance overall event prediction quality. We further investigate decomposed instruction fine-tuning for enhanced LLM event extraction understanding. Experiments demonstrate our approach outperforms existing state-of-the-art event extraction methods across three benchmark datasets.
zh
[NLP-65] Context-Adaptive Synthesis and Compression for Enhanced Retrieval-Augmented Generation in Complex Domains
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂领域问答任务中因信息过载和冗余冲突导致的幻觉与不准确回答问题,尤其是在多文档、长文本或存在矛盾信息场景下,传统检索增强生成(Retrieval-Augmented Generation, RAG)方法难以高效合成高质量上下文。解决方案的关键在于提出一种名为CASC(Context-Adaptive Synthesis and Compression)的新框架,其核心是引入一个由微调小规模语言模型驱动的上下文分析器-合成器(Context Analyzer Synthesizer, CAS)模块,该模块实现关键信息提取、跨文档一致性检查与冲突消解以及面向问题的结构化信息整合,从而将原始分散、冗余的信息压缩为高度凝练、结构清晰且语义丰富的上下文,显著降低最终阅读器LLM的认知负担并提升答案准确性。
链接: https://arxiv.org/abs/2508.19357
作者: Peiran Zhou,Junnan Zhu,Yichen Shen,Ruoxi Yu
机构: Kunming Medical University (昆明医科大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) excel in language tasks but are prone to hallucinations and outdated knowledge. Retrieval-Augmented Generation (RAG) mitigates these by grounding LLMs in external knowledge. However, in complex domains involving multiple, lengthy, or conflicting documents, traditional RAG suffers from information overload and inefficient synthesis, leading to inaccurate and untrustworthy answers. To address this, we propose CASC (Context-Adaptive Synthesis and Compression), a novel framework that intelligently processes retrieved contexts. CASC introduces a Context Analyzer Synthesizer (CAS) module, powered by a fine-tuned smaller LLM, which performs key information extraction, cross-document consistency checking and conflict resolution, and question-oriented structured synthesis. This process transforms raw, scattered information into a highly condensed, structured, and semantically rich context, significantly reducing the token count and cognitive load for the final Reader LLM. We evaluate CASC on SciDocs-QA, a new challenging multi-document question answering dataset designed for complex scientific domains with inherent redundancies and conflicts. Our extensive experiments demonstrate that CASC consistently outperforms strong baselines.
zh
[NLP-66] An Investigation on Group Query Hallucination Attacks
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在用户交互过程中因连续多轮提问导致的性能下降及潜在后门触发风险问题。其解决方案的关键在于提出“分组查询攻击”(Group Query Attack)技术,通过同时向LLMs输入一组查询来模拟真实场景中用户的多轮提问行为,从而系统性地研究累积上下文对模型输出的影响。实验表明,该方法显著降低微调后模型的任务表现,并可能激活模型中的潜在后门,尤其在数学推理和代码生成等需要复杂推理能力的任务中同样有效。
链接: https://arxiv.org/abs/2508.19321
作者: Kehao Miao,Xiaolong Jin
机构: University of Science and Technology of China (中国科学技术大学); Purdue University (普渡大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:With the widespread use of large language models (LLMs), understanding their potential failure modes during user interactions is essential. In practice, users often pose multiple questions in a single conversation with LLMs. Therefore, in this study, we propose Group Query Attack, a technique that simulates this scenario by presenting groups of queries to LLMs simultaneously. We investigate how the accumulated context from consecutive prompts influences the outputs of LLMs. Specifically, we observe that Group Query Attack significantly degrades the performance of models fine-tuned on specific tasks. Moreover, we demonstrate that Group Query Attack induces a risk of triggering potential backdoors of LLMs. Besides, Group Query Attack is also effective in tasks involving reasoning, such as mathematical reasoning and code generation for pre-trained and aligned models.
zh
[NLP-67] Sycophancy as compositions of Atomic Psychometric Traits
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中普遍存在但常被视作孤立故障模式的谄媚行为(sycophancy)问题,其核心挑战在于缺乏对这一行为背后多维心理特质组合机制的理解。解决方案的关键在于将谄媚行为建模为心理学中的特质(如情绪性、开放性与宜人性)在特征空间中的几何与因果组合,通过对比激活添加法(Contrastive Activation Addition, CAA)识别并映射这些心理特质对应的激活方向,并据此设计可解释且可组合的向量干预策略(如加法、减法和投影),从而有效缓解LLMs中与安全相关的行为风险。
链接: https://arxiv.org/abs/2508.19316
作者: Shreyans Jain,Alexandra Yost,Amirali Abdullah
机构: Thoughtworks(思特沃克)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages, 4 figures
Abstract:Sycophancy is a key behavioral risk in LLMs, yet is often treated as an isolated failure mode that occurs via a single causal mechanism. We instead propose modeling it as geometric and causal compositions of psychometric traits such as emotionality, openness, and agreeableness - similar to factor decomposition in psychometrics. Using Contrastive Activation Addition (CAA), we map activation directions to these factors and study how different combinations may give rise to sycophancy (e.g., high extraversion combined with low conscientiousness). This perspective allows for interpretable and compositional vector-based interventions like addition, subtraction and projection; that may be used to mitigate safety-critical behaviors in LLMs.
zh
[NLP-68] Object Detection with Multimodal Large Vision-Language Models: An In-depth Review
【速读】: 该论文旨在解决传统深度学习架构在目标检测任务中适应性不足、上下文理解能力有限以及泛化能力较弱的问题。其解决方案的关键在于利用大规模视觉-语言模型(Large Vision-Language Models, LVLMs)融合自然语言处理(Natural Language Processing, NLP)与计算机视觉(Computer Vision, CV)技术,通过架构创新、训练范式优化及多模态信息整合机制,实现对目标更深层次的上下文感知与灵活输出,从而显著提升目标检测与定位的准确性、鲁棒性和场景适应性。
链接: https://arxiv.org/abs/2508.19294
作者: Ranjan Sapkota,Manoj Karkee
机构: Cornell University (康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: First Peer Reviewed Review Paper for Object Detection with Vision-Language Models (VLMs)
Abstract:The fusion of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. This in-depth review presents a structured exploration of the state-of-the-art in LVLMs, systematically organized through a three-step research review process. First, we discuss the functioning of vision language models (VLMs) for object detection, describing how these models harness natural language processing (NLP) and computer vision (CV) techniques to revolutionize object detection and localization. We then explain the architectural innovations, training paradigms, and output flexibility of recent LVLMs for object detection, highlighting how they achieve advanced contextual understanding for object detection. The review thoroughly examines the approaches used in integration of visual and textual information, demonstrating the progress made in object detection using VLMs that facilitate more sophisticated object detection and localization strategies. This review presents comprehensive visualizations demonstrating LVLMs’ effectiveness in diverse scenarios including localization and segmentation, and then compares their real-time performance, adaptability, and complexity to traditional deep learning systems. Based on the review, its is expected that LVLMs will soon meet or surpass the performance of conventional methods in object detection. The review also identifies a few major limitations of the current LVLM modes, proposes solutions to address those challenges, and presents a clear roadmap for the future advancement in this field. We conclude, based on this study, that the recent advancement in LVLMs have made and will continue to make a transformative impact on object detection and robotic applications in the future.
zh
[NLP-69] CORE: Lossless Compression for Retrieval-Augmented LLM s via Reinforcement Learning
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)中因引入过多检索文档而导致输入长度激增、计算成本上升的问题,同时避免传统压缩方法因缺乏明确压缩目标而损害下游任务性能的缺陷。解决方案的关键在于提出一种名为CORE的新方法,其核心创新是采用强化学习框架,以端到端方式优化压缩过程:通过将最终任务性能作为奖励信号,并使用广义强化学习策略优化(Generalized Reinforcement Learning Policy Optimization, GRPO)训练压缩器,从而生成能最大化大语言模型(Large Language Models, LLMs)回答准确性的无损摘要,实现高比例压缩(如3%)下性能不降反升。
链接: https://arxiv.org/abs/2508.19282
作者: Ziqiang Cui,Yunpeng Weng,Xing Tang,Peiyang Liu,Shiwei Li,Bowei He,Jiamin Chen,Xiuqiang He,Chen Ma
机构: City University of Hong Kong (香港城市大学); Tencent (腾讯); Shenzhen Technology University (深圳技术大学); Peking University (北京大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a promising approach to enhance the timeliness of knowledge and the factual accuracy of responses in Large Language Models (LLMs). However, the inclusion of excessive retrieved documents substantially increases the input length, leading to higher computational costs. Previous studies have attempted to compress retrieved documents into shorter texts before in-context integration, but such methods often compromise end-task performance. The lack of well-defined compression targets forces many approaches to rely on fixed heuristics, which cannot guarantee that the compressed content will effectively support the end task. To address these limitations, we propose CORE, a novel method designed to achieve lossless context compression for RAG. CORE employs reinforcement learning to optimize the compression process without relying on predefined compression labels. Specifically, it utilizes end-task performance as a reward signal and applies Generalized Reinforcement Learning Policy Optimization (GRPO) to train the compressor. This end-to-end training framework enables the compressor to generate summaries that maximize the accuracy of answers generated by the LLM. Extensive experiments on four datasets demonstrate the superiority of our approach. With a high compression ratio of 3%, our method not only avoids performance degradation compared to prepending full documents across all datasets but also improves the average Exact Match (EM) score by 3.3 points. The code will be released soon.
zh
[NLP-70] FLAIRR-TS – Forecasting LLM -Agents with Iterative Refinement and Retrieval for Time Series EMNLP
【速读】: 该论文旨在解决如何在时间序列预测任务中有效利用冻结的大语言模型(frozen Large Language Models, LLMs)的问题,尤其针对传统方法依赖大量预处理和人工设计提示(prompt)的低效性。其核心挑战在于如何让LLM在不进行微调的情况下,通过自然语言提示实现与专用预测模型相当的性能。解决方案的关键在于提出FLAIRR-TS——一个基于代理(agent)的测试时提示优化框架:该框架包含两个协同工作的代理——预测代理(Forecaster-agent)生成初始预测,而精炼代理(Refiner-agent)根据历史输出和检索到的相关信息动态优化提示;通过这种自适应提示机制,FLAIRR-TS无需中间代码即可在多个领域泛化,显著提升预测准确性,并逼近专用模型的性能表现。
链接: https://arxiv.org/abs/2508.19279
作者: Gunjan Jalori,Preetika Verma,Sercan Ö Arık
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP
Abstract:Time series Forecasting with large languagemodels (LLMs) requires bridging numericalpatterns and natural language. Effective fore-casting on LLM often relies on extensive pre-processing and this http URL studiesshow that a frozen LLM can rival specializedforecasters when supplied with a carefully en-gineered natural-language prompt, but craft-ing such a prompt for each task is itself oner-ous and ad-hoc. We introduce FLAIRR-TS, atest-time prompt optimization framework thatutilizes an agentic system: a Forecaster-agentgenerates forecasts using an initial prompt,which is then refined by a refiner agent, in-formed by past outputs and retrieved this http URL adaptive prompting generalizes across do-mains using creative prompt templates andgenerates high-quality forecasts without inter-mediate code this http URL onbenchmark datasets show improved accuracyover static prompting and retrieval-augmentedbaselines, approaching the performance ofspecialized this http URL-TS providesa practical alternative to tuning, achievingstrong performance via its agentic approach toadaptive prompt refinement and retrieval.
zh
[NLP-71] Leverag ing Language Models and Machine Learning in Verbal Autopsy Analysis
【速读】: 该论文旨在解决在缺乏民事登记与生命统计系统的国家中,如何提升死亡原因(Cause of Death, COD)分类准确性的问题。现有自动VA(Verbal Autopsy)分类算法仅依赖结构化问题数据,忽略了非结构化叙述文本中的重要信息。解决方案的关键在于引入预训练语言模型(Pretrained Language Models, PLMs)和机器学习(Machine Learning, ML)技术,利用VA叙述文本进行任务特定微调,从而显著提升COD分类性能,尤其在非传染性疾病识别方面表现突出;同时通过多模态融合策略整合叙述与结构化问题,进一步优化分类效果,验证了两种模态各自独特的信息价值。
链接: https://arxiv.org/abs/2508.19274
作者: Yue Chu
机构: 未知
类目: Computation and Language (cs.CL)
备注: Ph.D. dissertation submitted to The Ohio State University, August 2025
Abstract:In countries without civil registration and vital statistics, verbal autopsy (VA) is a critical tool for estimating cause of death (COD) and inform policy priorities. In VA, interviewers ask proximal informants for details on the circumstances preceding a death, in the form of unstructured narratives and structured questions. Existing automated VA cause classification algorithms only use the questions and ignore the information in the narratives. In this thesis, we investigate how the VA narrative can be used for automated COD classification using pretrained language models (PLMs) and machine learning (ML) techniques. Using empirical data from South Africa, we demonstrate that with the narrative alone, transformer-based PLMs with task-specific fine-tuning outperform leading question-only algorithms at both the individual and population levels, particularly in identifying non-communicable diseases. We explore various multimodal fusion strategies combining narratives and questions in unified frameworks. Multimodal approaches further improve performance in COD classification, confirming that each modality has unique contributions and may capture valuable information that is not present in the other modality. We also characterize physician-perceived information sufficiency in VA. We describe variations in sufficiency levels by age and COD and demonstrate that classification accuracy is affected by sufficiency for both physicians and models. Overall, this thesis advances the growing body of knowledge at the intersection of natural language processing, epidemiology, and global health. It demonstrates the value of narrative in enhancing COD classification. Our findings underscore the need for more high-quality data from more diverse settings to use in training and fine-tuning PLM/ML methods, and offer valuable insights to guide the rethinking and redesign of the VA instrument and interview.
zh
[NLP-72] RAG APHENE: A RAG Annotation Platform with Human Enhancements and Edits
【速读】: 该论文旨在解决如何有效评估大语言模型(Large Language Models, LLMs)在多轮检索增强生成(Retrieval Augmented Generation, RAG)对话中提供事实准确性的问题。由于LLMs可能生成看似合理但实际错误的“幻觉”信息,构建能够模拟真实世界交互场景的高质量评估基准成为关键挑战。解决方案的关键在于提出RAGAPHENE——一个基于对话的标注平台,使标注者能够模拟真实用户与模型之间的多轮交互过程,从而生成数千条具有现实语境的对话数据,用于系统性地评测LLMs在RAG场景下的表现。
链接: https://arxiv.org/abs/2508.19272
作者: Kshitij Fadnis,Sara Rosenthal,Maeda Hanafi,Yannis Katsis,Marina Danilevsky
机构: IBM Research - AI (IBM研究-人工智能)
类目: Computation and Language (cs.CL)
备注:
Abstract:Retrieval Augmented Generation (RAG) is an important aspect of conversing with Large Language Models (LLMs) when factually correct information is important. LLMs may provide answers that appear correct, but could contain hallucinated information. Thus, building benchmarks that can evaluate LLMs on multi-turn RAG conversations has become an increasingly important task. Simulating real-world conversations is vital for producing high quality evaluation benchmarks. We present RAGAPHENE, a chat-based annotation platform that enables annotators to simulate real-world conversations for benchmarking and evaluating LLMs. RAGAPHENE has been successfully used by approximately 40 annotators to build thousands of real-world conversations.
zh
[NLP-73] Rethinking Reasoning in LLM s: Neuro-Symbolic Local RetoMaton Beyond ICL and CoT
【速读】: 该论文旨在解决当前基于提示(prompt-based)的推理策略(如思维链 Chain-of-Thought 和上下文学习 In-Context Learning)在大型语言模型(LLMs)中因依赖脆弱且隐式的机制而导致输出不稳定、难以解释的问题,尤其是在不同随机种子、提示格式或微小变化下表现不一致,限制了其在需要稳定性和可解释性的任务中的应用。解决方案的关键在于提出一种局部化的、任务自适应的加权有限自动机(Weighted Finite Automaton, WFA)结构来替代原Retomatón框架中的全局数据存储,该WFA直接从外部领域语料库构建,从而实现上下文感知的鲁棒检索,同时保留符号可追溯性与低推理开销。相比提示方法将上下文与记忆混杂于黑箱过程,该方案利用WFA的显式结构提供可验证、模块化的检索行为,显著提升了推理的透明度、可复现性及跨领域迁移能力。
链接: https://arxiv.org/abs/2508.19271
作者: Rushitha Santhoshi Mamidala,Anshuman Chhabra,Ankur Mali
机构: Bellini College of AI Cybersecurity and Computing (贝利尼人工智能网络安全与计算学院); University of South Florida (南佛罗里达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Prompt-based reasoning strategies such as Chain-of-Thought (CoT) and In-Context Learning (ICL) have become widely used for eliciting reasoning capabilities in large language models (LLMs). However, these methods rely on fragile, implicit mechanisms often yielding inconsistent outputs across seeds, formats, or minor prompt variations making them fundamentally unreliable for tasks requiring stable, interpretable reasoning. In contrast, automata-based neuro-symbolic frameworks like RetoMaton offer a more structured and trustworthy alternative by grounding retrieval in symbolic memory with deterministic transitions. In this work, we extend RetoMaton by replacing its global datastore with a local, task-adaptive Weighted Finite Automaton (WFA), constructed directly from external domain corpora. This local automaton structure promotes robust, context-aware retrieval while preserving symbolic traceability and low inference overhead. Unlike prompting, which entangles context and memory in opaque ways, our approach leverages the explicit structure of WFAs to provide verifiable and modular retrieval behavior, making it better suited for domain transfer and interoperability. We evaluate this local RetoMaton variant on two pretrained LLMs LLaMA-3.2-1B and Gemma-3-1B-PT across three reasoning tasks: TriviaQA (reading comprehension), GSM8K (multi-step math), and MMLU (domain knowledge). Compared to the base model and prompting-based methods, augmenting these setups with local RetoMaton consistently improves performance while enabling transparent and reproducible retrieval dynamics. Our results highlight a promising shift toward trustworthy, symbolic reasoning in modern LLMs via lightweight, automaton-guided memory.
zh
[NLP-74] Whisper based Cross-Lingual Phoneme Recognition between Vietnamese and English
【速读】: 该论文旨在解决跨语言音素识别难题,特别是在越南语与英语混合发音场景下自动语音识别(ASR)的准确性问题。越南语依赖声调变化区分词义,而英语则以重音模式和非标准发音为特征,导致两种语言间音素对齐困难。解决方案的关键在于:(1) 构建一个能反映越南语与英语语音系统差异的代表性双语音素集合;(2) 设计一种端到端系统,利用预训练的PhoWhisper编码器提取深层高维表征,从而提升音素识别性能。
链接: https://arxiv.org/abs/2508.19270
作者: Nguyen Huu Nhat Minh,Tran Nguyen Anh,Truong Dinh Dung,Vo Van Nam,Le Pham Tuyen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Cross-lingual phoneme recognition has emerged as a significant challenge for accurate automatic speech recognition (ASR) when mixing Vietnamese and English pronunciations. Unlike many languages, Vietnamese relies on tonal variations to distinguish word meanings, whereas English features stress patterns and non-standard pronunciations that hinder phoneme alignment between the two languages. To address this challenge, we propose a novel bilingual speech recognition approach with two primary contributions: (1) constructing a representative bilingual phoneme set that bridges the differences between Vietnamese and English phonetic systems; (2) designing an end-to-end system that leverages the PhoWhisper pre-trained encoder for deep high-level representations to improve phoneme recognition. Our extensive experiments demonstrate that the proposed approach not only improves recognition accuracy in bilingual speech recognition for Vietnamese but also provides a robust framework for addressing the complexities of tonal and stress-based phoneme recognition
zh
[NLP-75] Should LLM s be WEIRD? Exploring WEIRDness and Human Rights in Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在训练数据中过度反映WEIRD价值观(西方、受教育、工业化、富裕、民主)所引发的文化偏见与公平性问题,特别是这些模型是否会在全球范围内产生符合人权原则的输出。研究通过对比模型响应与《世界价值观调查》数据、《世界人权宣言》及亚洲、中东和非洲三大区域人权宪章的一致性,发现文化多样性更高的模型(如BLOOM和Qwen)虽能生成更丰富的文化表达,但其违反人权原则的风险也上升2%至4%,尤其体现在性别平等相关的有害规范上。解决方案的关键在于引入类似宪法AI(Constitutional AI)的方法,将人类权利原则嵌入模型行为规范中,以缓解文化代表性增强带来的歧视性内容风险,但该方法仅能部分缓解这一矛盾。
链接: https://arxiv.org/abs/2508.19269
作者: Ke Zhou,Marios Constantinides,Daniele Quercia
机构: Nokia Bell Labs (诺基亚贝尔实验室)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This paper has been accepted in AIES 2025
Abstract:Large language models (LLMs) are often trained on data that reflect WEIRD values: Western, Educated, Industrialized, Rich, and Democratic. This raises concerns about cultural bias and fairness. Using responses to the World Values Survey, we evaluated five widely used LLMs: GPT-3.5, GPT-4, Llama-3, BLOOM, and Qwen. We measured how closely these responses aligned with the values of the WEIRD countries and whether they conflicted with human rights principles. To reflect global diversity, we compared the results with the Universal Declaration of Human Rights and three regional charters from Asia, the Middle East, and Africa. Models with lower alignment to WEIRD values, such as BLOOM and Qwen, produced more culturally varied responses but were 2% to 4% more likely to generate outputs that violated human rights, especially regarding gender and equality. For example, some models agreed with the statements a man who cannot father children is not a real man'' and
a husband should always know where his wife is’', reflecting harmful gender norms. These findings suggest that as cultural representation in LLMs increases, so does the risk of reproducing discriminatory beliefs. Approaches such as Constitutional AI, which could embed human rights principles into model behavior, may only partly help resolve this tension.
zh
[NLP-76] MultiPL-MoE: Multi-Programming-Lingual Extension of Large Language Models through Hybrid Mixture-of-Experts
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多编程语言(Multi-Programming-Lingual, MultiPL)代码生成任务中表现不足的问题,尤其是在计算资源受限条件下提升现有主流LLMs的多语言代码生成能力。其解决方案的关键在于提出一种基于混合专家(Mixture of Experts, MoE)架构的MultiPL-MoE模型,该模型通过双层专家选择机制实现精细化控制:一是Token级MoE采用共享专家结构并引入新颖的门控权重归一化方法,以优化细粒度token层面的专家分配;二是Segment级MoE结合滑动窗口分割输入序列与Top-k段落专家路由策略,从而更有效地捕捉编程语言的语法结构和上下文模式,最终实现多编程语言代码生成性能的显著提升。
链接: https://arxiv.org/abs/2508.19268
作者: Qing Wang,Xue Han,Jiahui Wang,Lehao Xing,Qian Hu,Lianlian Zhang,Chao Deng,Junlan Feng
机构: JIUTIAN Team China Mobile (中国移动九天团队)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite LLMs’ excellent code creation capabilities, multilingual code generation remains extremely challenging. To address this, we intent to improve the multi-programming-lingual (MultiPL) performance of the base LLMs while retaining the most popular ones using restricted computational resources. We consider MultiPL to be a special case of multiple natural languages and propose a MultiPL extension of LLMs utilizing a hybrid mixture of experts (MoE), called MultiPL-MoE. Specifically, MultiPL-MoE combines two paired MoEs to optimize expert selection at both the token and segment levels. The token-level MoE is a standard upcycling MoE structure with a shared expert and a novel gate weight normalization approach that aids in the final fusion with the segment-level MoE. The segment-level MoE incorporates two innovative designs to better capture the syntactic structure and contextual patterns of programming languages: First, using a sliding window to partition the input token sequence into multiple segments; Then, adopting an expert-choice routing strategy that allows experts to select the top-k segments. The results of the experiment proved the effectiveness of MultiPL-MoE.
zh
[NLP-77] Beat-Based Rhythm Quantization of MIDI Performances
【速读】: 该论文旨在解决音乐表演数据(MIDI)中节奏不规则、难以对齐到节拍结构的问题,从而生成符合节拍规律且可读性强的乐谱。其解决方案的关键在于提出一种基于Transformer的节奏量化模型,该模型引入了节拍(beat)和强拍(downbeat)信息,并设计了一种基于节拍的预处理方法,将乐谱与演奏数据统一转换为token表示形式,从而优化了模型架构与数据表征方式,在钢琴和吉他演奏数据上进行训练后,显著优于当前最优方法(以MUSTER指标衡量)。
链接: https://arxiv.org/abs/2508.19262
作者: Maximilian Wachter,Sebastian Murgul,Michael Heizmann
机构: Klangio GmbH(克兰吉奥有限公司); Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)
类目: ound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: Accepted to the Late Breaking Demo Papers of the 1st AES International Conference on Artificial Intelligence and Machine Learning for Audio (AIMLA LBDP), 2025
Abstract:We propose a transformer-based rhythm quantization model that incorporates beat and downbeat information to quantize MIDI performances into metrically-aligned, human-readable scores. We propose a beat-based preprocessing method that transfers score and performance data into a unified token representation. We optimize our model architecture and data representation and train on piano and guitar performances. Our model exceeds state-of-the-art performance based on the MUSTER metric.
zh
[NLP-78] Capabilities of GPT -5 across critical domains: Is it the next breakthrough?
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用领域中性能差异的系统性评估问题,特别是针对GPT-4与GPT-5在教育、临床诊断、科研生成及伦理推理等关键场景下的表现对比。其解决方案的关键在于采用由语言学和临床领域专家组成的20人评审团队,基于预设标准对两个模型在五个核心任务中的输出进行量化评估,并通过混合效应模型分析结果,从而提供首个关于GPT-5能力演进的实证证据,证明其在多数任务中显著优于GPT-4,尤其在临床诊断、研究生成和伦理推理方面展现出更强的上下文敏感性和领域专业化能力。
链接: https://arxiv.org/abs/2508.19259
作者: Georgios P. Georgiou
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:
Abstract:The accelerated evolution of large language models has raised questions about their comparative performance across domains of practical importance. GPT-4 by OpenAI introduced advances in reasoning, multimodality, and task generalization, establishing itself as a valuable tool in education, clinical diagnosis, and academic writing, though it was accompanied by several flaws. Released in August 2025, GPT-5 incorporates a system-of-models architecture designed for task-specific optimization and, based on both anecdotal accounts and emerging evidence from the literature, demonstrates stronger performance than its predecessor in medical contexts. This study provides one of the first systematic comparisons of GPT-4 and GPT-5 using human raters from linguistics and clinical fields. Twenty experts evaluated model-generated outputs across five domains: lesson planning, assignment evaluation, clinical diagnosis, research generation, and ethical reasoning, based on predefined criteria. Mixed-effects models revealed that GPT-5 significantly outperformed GPT-4 in lesson planning, clinical diagnosis, research generation, and ethical reasoning, while both models performed comparably in assignment assessment. The findings highlight the potential of GPT-5 to serve as a context-sensitive and domain-specialized tool, offering tangible benefits for education, clinical practice, and academic research, while also advancing ethical reasoning. These results contribute to one of the earliest empirical evaluations of the evolving capabilities and practical promise of GPT-5.
zh
[NLP-79] MovieCORE: COgnitive REasoning in Movies EMNLP’2025
【速读】: 该论文旨在解决当前视频问答(VQA)模型对电影内容理解停留在表面层次的问题,即缺乏对深层认知推理能力的评估与提升。现有数据集多聚焦于直观信息提取,难以检验模型是否具备类似人类的系统2思维(System-2 thinking)——如因果推断、情感揣测和隐含意义解析等高级认知能力。解决方案的关键在于提出MovieCORE数据集,通过多大语言模型(LLM)作为“思维代理”的协同头脑风暴机制生成高质量、高复杂度的问题-答案对,并引入Agentic Choice Enhancement(ACE)模块增强后训练阶段的模型推理能力,从而显著提升VQA模型在深层次语义理解任务上的表现(最高达25%的性能提升)。
链接: https://arxiv.org/abs/2508.19026
作者: Gueter Josmy Faure,Min-Hung Chen,Jia-Fong Yeh,Ying Cheng,Hung-Ting Su,Yung-Hao Tang,Shang-Hong Lai,Winston H. Hsu
机构: National Taiwan University (台湾大学); NVIDIA (英伟达); National Tsing Hua University (清华大学); National Chengchi University (政治大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for EMNLP’2025 Main Conference. Project Page: this https URL
Abstract:This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by up to 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at this https URL.
zh
[NLP-80] Word Chain Generators for Prefix Normal Words
【速读】: 该论文旨在解决前缀正则词(prefix normal words)的计数问题以及高效判定方法的缺失这一开放性难题。其解决方案的关键在于揭示了导致一个二进制词不满足前缀正则性质的子串(factor)的结构特征,并引入词链(word chains)与生成器(generators)的概念,从而建立同长度词之间的新关联方式,为分析和构造前缀正则词提供了系统性的工具与理论基础。
链接: https://arxiv.org/abs/2508.19619
作者: Duncan Adamson,Moritz Dudey,Pamela Fleischmann,Annika Huch
机构: 未知
类目: Combinatorics (math.CO); Computation and Language (cs.CL)
备注:
Abstract:In 2011, Fici and Lipták introduced prefix normal words. A binary word is prefix normal if it has no factor (substring) that contains more occurrences of the letter 1 than the prefix of the same length. Among the open problems regarding this topic are the enumeration of prefix normal words and efficient testing methods. We show a range of characteristics of prefix normal words. These include properties of factors that are responsible for a word not being prefix normal. With word chains and generators, we introduce new ways of relating words of the same length to each other.
zh
计算机视觉
[CV-0] CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
【速读】:该论文旨在解决科学计算领域中图形用户界面(GUI)自主代理面临的长期规划与精确执行之间的矛盾问题:现有通用型代理擅长规划但执行能力弱,而专用型代理则相反。为突破这一局限,作者提出了一种可训练的组合式框架CODA,其核心创新在于将一个通用规划器(Cerebrum)与一个专用执行器(Cerebellum)有机结合,并通过两阶段训练流程实现性能优化——第一阶段“专业化”使用解耦的GRPO方法为每个科学应用单独训练专家规划器,利用少量任务轨迹进行初始化;第二阶段“泛化”则聚合所有专家的成功轨迹构建统一数据集,用于监督微调最终规划器,从而同时获得鲁棒的执行能力和跨领域的泛化能力。
链接: https://arxiv.org/abs/2508.20096
作者: Zeyi Sun,Yuhang Cao,Jianze Liang,Qiushi Sun,Ziyu Liu,Zhixiong Zhang,Yuhang Zang,Xiaoyi Dong,Kai Chen,Dahua Lin,Jiaqi Wang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: code available at this url: this https URL
Abstract:Autonomous agents for Graphical User Interfaces (GUIs) face significant challenges in specialized domains such as scientific computing, where both long-horizon planning and precise execution are required. Existing approaches suffer from a trade-off: generalist agents excel at planning but perform poorly in execution, while specialized agents demonstrate the opposite weakness. Recent compositional frameworks attempt to bridge this gap by combining a planner and an actor, but they are typically static and non-trainable, which prevents adaptation from experience. This is a critical limitation given the scarcity of high-quality data in scientific domains. To address these limitations, we introduce CODA, a novel and trainable compositional framework that integrates a generalist planner (Cerebrum) with a specialist executor (Cerebellum), trained via a dedicated two-stage pipeline. In the first stage, Specialization, we apply a decoupled GRPO approach to train an expert planner for each scientific application individually, bootstrapping from a small set of task trajectories. In the second stage, Generalization, we aggregate all successful trajectories from the specialized experts to build a consolidated dataset, which is then used for supervised fine-tuning of the final planner. This equips CODA with both robust execution and cross-domain generalization. Evaluated on four challenging applications from the ScienceBoard benchmark, CODA significantly outperforms baselines and establishes a new state of the art among open-source models.
zh
[CV-1] Bridging Domain Gaps for Fine-Grained Moth Classification Through Expert-Informed Adaptation and Foundation Model Priors
【速读】:该论文旨在解决自动化相机系统采集的鳞翅目昆虫(Lepidoptera)图像在实际野外场景中进行准确物种识别的问题,特别是由于标注数据稀缺和领域偏移(domain shift)导致的分类性能下降。其解决方案的关键在于:利用少量专家标注的野外图像数据,通过知识蒸馏(knowledge distillation)技术将高性能基础模型BioCLIP2的知识迁移至轻量级ConvNeXt-tiny架构中,从而在保持高准确率的同时显著降低计算成本,有效缓解了域间差异带来的挑战。
链接: https://arxiv.org/abs/2508.20089
作者: Ross J Gardiner,Guillaume Mougeot,Sareh Rowlands,Benno I Simmons,Flemming Helsing,Toke Thomas Høye
机构: University of Exeter (埃克塞特大学); Aarhus University (奥胡斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Labelling images of Lepidoptera (moths) from automated camera systems is vital for understanding insect declines. However, accurate species identification is challenging due to domain shifts between curated images and noisy field imagery. We propose a lightweight classification approach, combining limited expert-labelled field data with knowledge distillation from the high-performance BioCLIP2 foundation model into a ConvNeXt-tiny architecture. Experiments on 101 Danish moth species from AMI camera systems demonstrate that BioCLIP2 substantially outperforms other methods and that our distilled lightweight model achieves comparable accuracy with significantly reduced computational cost. These insights offer practical guidelines for the development of efficient insect monitoring systems and bridging domain gaps for fine-grained classification.
zh
[CV-2] AudioStory: Generating Long-Form Narrative Audio with Large Language Models
【速读】:该论文旨在解决文本到音频(Text-to-Audio, TTA)生成技术在长篇叙事音频生成中面临的挑战,即如何实现时间连贯性和结构化推理能力。现有方法虽能高效合成短音频片段,但在处理需要多场景衔接与情感一致性保持的长音频叙事时表现不足。解决方案的关键在于提出AudioStory框架,其核心创新包括:(1) 解耦桥接机制(Decoupled bridging mechanism),将大语言模型(LLM)与扩散模型(diffuser)协作解耦为两个专用组件——用于事件内语义对齐的桥接查询(bridging query)和用于跨事件连贯性保持的残差查询(residual query);(2) 端到端训练策略,通过统一指令理解与音频生成流程,消除模块化训练管线并增强各组件间的协同效应,从而显著提升长音频叙事的生成质量与指令遵循能力。
链接: https://arxiv.org/abs/2508.20088
作者: Yuxin Guo,Teng Wang,Yuying Ge,Shijie Ma,Yixiao Ge,Wei Zou,Ying Shan
机构: School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); ARC Lab, Tencent PCG (腾讯PCG ARC实验室); MAIS, Institute of Automation, CAS, Beijing (中国科学院自动化研究所MAIS)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注:
Abstract:Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features: (1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, i.e., a bridging query for intra-event semantic alignment and a residual query for cross-event coherence preservation. (2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish a benchmark AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity. Our code is available at this https URL
zh
[CV-3] Seam360GS: Seamless 360° Gaussian Splatting from Real-World Omnidirectional Images ICCV2025
【速读】:该论文旨在解决消费级双鱼眼(dual-fisheye)相机系统在生成360度全景图像时因镜头分离和角度畸变导致的图像不完美问题,从而影响虚拟现实、机器人和自动驾驶等应用中的视图合成质量。其解决方案的关键在于提出了一种将双鱼眼相机模型嵌入3D高斯泼溅(3D Gaussian splatting)流水线的新校准框架,通过联合优化3D高斯参数与模拟镜头间隙和角度畸变的标定变量,不仅能够真实再现双鱼眼相机产生的视觉伪影,还能从不完美的输入中合成无缝的360度图像,显著优于现有360度渲染模型。
链接: https://arxiv.org/abs/2508.20080
作者: Changha Shin,Woong Oh Cho,Seon Joo Kim
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted to ICCV 2025. 10 pages main text, 4 figures, 4 tables, supplementary material included
Abstract:360-degree visual content is widely shared on platforms such as YouTube and plays a central role in virtual reality, robotics, and autonomous navigation. However, consumer-grade dual-fisheye systems consistently yield imperfect panoramas due to inherent lens separation and angular distortions. In this work, we introduce a novel calibration framework that incorporates a dual-fisheye camera model into the 3D Gaussian splatting pipeline. Our approach not only simulates the realistic visual artifacts produced by dual-fisheye cameras but also enables the synthesis of seamlessly rendered 360-degree images. By jointly optimizing 3D Gaussian parameters alongside calibration variables that emulate lens gaps and angular distortions, our framework transforms imperfect omnidirectional inputs into flawless novel view synthesis. Extensive evaluations on real-world datasets confirm that our method produces seamless renderings-even from imperfect images-and outperforms existing 360-degree rendering models.
zh
[CV-4] Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型中动作生成方式的局限性问题,即现有方法要么采用固定顺序的自回归生成策略,要么依赖外部连续扩散或流匹配头,导致训练复杂、采样迭代繁琐且难以构建统一可扩展的架构。其解决方案的关键在于提出一种基于离散扩散机制的单一Transformer策略——离散扩散VLA(Discrete Diffusion VLA),该方法将动作块离散化并用离散扩散过程建模,同时沿用与视觉语言模型(VLM)骨干相同的交叉熵损失函数进行训练。这一设计保留了扩散模型逐步精炼的范式,并天然兼容VLM的离散token接口;通过自适应解码顺序和二次掩码机制实现对易预测动作元素优先处理及不确定预测的多轮修正,从而提升一致性与容错能力,同时支持并行解码、突破自回归瓶颈并减少函数评估次数,显著优于传统自回归与连续扩散基线方法。
链接: https://arxiv.org/abs/2508.20072
作者: Zhixuan Liang,Yizhuo Li,Tianshuo Yang,Chengyue Wu,Sitong Mao,Liuao Pei,Xiaokang Yang,Jiangmiao Pang,Yao Mu,Ping Luo
机构: The University of Hong Kong (香港大学); Shanghai AI Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学); Huawei Cloud Computing Technologies Co., Ltd. (华为云计算技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 15 pages
Abstract:Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions to robot actions. However, prevailing VLA decoders either generate actions autoregressively in a fixed left-to-right order or attach continuous diffusion or flow matching heads outside the backbone, demanding specialized training and iterative sampling that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a single-transformer policy that models discretized action chunks with discrete diffusion and is trained with the same cross-entropy objective as the VLM backbone. The design retains diffusion’s progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary remasking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pretrained vision language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves 96.3% avg. SR on LIBERO, 71.2% visual matching on SimplerEnv Fractal and 49.3% overall on SimplerEnv Bridge, improving over both autoregressive and continuous diffusion baselines. These findings indicate that discrete-diffusion action decoder supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets.
zh
[CV-5] PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence
【速读】:该论文旨在解决跨视图地理定位(Cross-view Geo-Localization, CVGL)中因GPS漂移导致的噪声对应关系(Noisy Correspondence, NC)问题,即在实际场景下,无人机与卫星图像之间存在系统性对齐偏差,仅部分区域具有真实对应关系,而现有方法多假设图像对完美对齐,难以应对此类现实噪声。解决方案的关键在于提出PAUL框架,其核心是基于不确定性学习的分区与增强机制:通过不确定性感知的协同增强(uncertainty-aware co-augmentation)和证据协同训练(evidential co-training),识别高置信度对应区域并针对性地进行数据增强,同时利用损失差异与不确定性联合指导训练样本的划分与优化,从而有效抑制误匹配带来的噪声干扰,提升模型在复杂环境下的鲁棒性与定位精度。
链接: https://arxiv.org/abs/2508.20066
作者: Zheng Li,Yanming Guo,WenZhe Liu,Xueyi Zhang,Zhaoyun Ding,Long Xu,Mingrui Lao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
Abstract:Cross-view geo-localization is a critical task for UAV navigation, event detection, and aerial surveying, as it enables matching between drone-captured and satellite imagery. Most existing approaches embed multi-modal data into a joint feature space to maximize the similarity of paired images. However, these methods typically assume perfect alignment of image pairs during training, which rarely holds true in real-world scenarios. In practice, factors such as urban canyon effects, electromagnetic interference, and adverse weather frequently induce GPS drift, resulting in systematic alignment shifts where only partial correspondences exist between pairs. Despite its prevalence, this source of noisy correspondence has received limited attention in current research. In this paper, we formally introduce and address the Noisy Correspondence on Cross-View Geo-Localization (NC-CVGL) problem, aiming to bridge the gap between idealized benchmarks and practical applications. To this end, we propose PAUL (Partition and Augmentation by Uncertainty Learning), a novel framework that partitions and augments training data based on estimated data uncertainty through uncertainty-aware co-augmentation and evidential co-training. Specifically, PAUL selectively augments regions with high correspondence confidence and utilizes uncertainty estimation to refine feature learning, effectively suppressing noise from misaligned pairs. Distinct from traditional filtering or label correction, PAUL leverages both data uncertainty and loss discrepancy for targeted partitioning and augmentation, thus providing robust supervision for noisy samples. Comprehensive experiments validate the effectiveness of individual components in PAUL,which consistently achieves superior performance over other competitive noisy-correspondence-driven methods in various noise ratios.
zh
[CV-6] Patch Progression Masked Autoencoder with Fusion CNN Network for Classifying Evolution Between Two Pairs of 2D OCT Slices
【速读】:该论文旨在解决老年性黄斑变性(Age-related Macular Degeneration, AMD)中脉络膜新生血管(neovascular AMD)进展的精准监测问题,以支持更个性化和有效的抗血管内皮生长因子(anti-VEGF)治疗策略。其核心挑战在于从光学相干断层扫描(Optical Coherence Tomography, OCT)图像中准确识别病变演变模式,并预测未来三个月内的病情进展。解决方案的关键在于:在任务1中采用融合卷积神经网络(CNN)与模型集成(model ensembling)提升双时间点OCT切片演化分类的准确性;在任务2中提出Patch Progression Masked Autoencoder方法,通过生成未来OCT图像并结合任务1的分类结果实现进展预测,从而构建端到端的动态监测框架。
链接: https://arxiv.org/abs/2508.20064
作者: Philippe Zhang,Weili Jiang,Yihao Li,Jing Zhang,Sarah Matta,Yubo Tan,Hui Lin,Haoshen Wang,Jiangtian Pan,Hui Xu,Laurent Borderie,Alexandre Le Guilcher,Béatrice Cochener,Chubin Ou,Gwenolé Quellec,Mathieu Lamard
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, 3 tables, challenge/conference paper
Abstract:Age-related Macular Degeneration (AMD) is a prevalent eye condition affecting visual acuity. Anti-vascular endothelial growth factor (anti-VEGF) treatments have been effective in slowing the progression of neovascular AMD, with better outcomes achieved through timely diagnosis and consistent monitoring. Tracking the progression of neovascular activity in OCT scans of patients with exudative AMD allows for the development of more personalized and effective treatment plans. This was the focus of the Monitoring Age-related Macular Degeneration Progression in Optical Coherence Tomography (MARIO) challenge, in which we participated. In Task 1, which involved classifying the evolution between two pairs of 2D slices from consecutive OCT acquisitions, we employed a fusion CNN network with model ensembling to further enhance the model’s performance. For Task 2, which focused on predicting progression over the next three months based on current exam data, we proposed the Patch Progression Masked Autoencoder that generates an OCT for the next exam and then classifies the evolution between the current OCT and the one generated using our solution from Task 1. The results we achieved allowed us to place in the Top 10 for both tasks. Some team members are part of the same organization as the challenge organizers; therefore, we are not eligible to compete for the prize.
zh
[CV-7] OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations ICCV2025
【速读】:该论文旨在解决开放词汇表(Open-vocabulary, OV)室内三维物体检测中缺乏标注数据的问题,即如何在不依赖人工标注的3D边界框和类别信息的情况下实现高精度的3D目标检测。其解决方案的关键在于提出一种单阶段的多视角检测框架OpenM3D,通过结合两类关键损失函数进行联合训练:一是类无关的3D定位损失,要求生成高质量的3D伪框(pseudo boxes);二是体素语义对齐损失,利用从2D图像分割区域采样的多样化CLIP特征与对应体素特征对齐。为提升伪框质量,论文进一步提出基于图嵌入技术的3D伪框生成方法,将2D分割结果整合为连贯的3D结构,从而显著优于现有方法如OV-3DET。该设计使得模型在无需任何人工标注的前提下,实现了高效且准确的检测性能,在ScanNet200和ARKitScenes基准上均超越了当前主流两阶段方法。
链接: https://arxiv.org/abs/2508.20063
作者: Peng-Hao Hsu,Ke Zhang,Fu-En Wang,Tao Tu,Ming-Feng Li,Yu-Lun Liu,Albert Y. C. Chen,Min Sun,Cheng-Hao Kuo
机构: National Tsing Hua University (国立清华大学); Amazon (亚马逊); Cornell University (康奈尔大学); Carnegie Mellon University (卡内基梅隆大学); National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV2025
Abstract:Open-vocabulary (OV) 3D object detection is an emerging field, yet its exploration through image-based methods remains limited compared to 3D point cloud-based methods. We introduce OpenM3D, a novel open-vocabulary multi-view indoor 3D object detector trained without human annotations. In particular, OpenM3D is a single-stage detector adapting the 2D-induced voxel features from the ImGeoNet model. To support OV, it is jointly trained with a class-agnostic 3D localization loss requiring high-quality 3D pseudo boxes and a voxel-semantic alignment loss requiring diverse pre-trained CLIP features. We follow the training setting of OV-3DET where posed RGB-D images are given but no human annotations of 3D boxes or classes are available. We propose a 3D Pseudo Box Generation method using a graph embedding technique that combines 2D segments into coherent 3D structures. Our pseudo-boxes achieve higher precision and recall than other methods, including the method proposed in OV-3DET. We further sample diverse CLIP features from 2D segments associated with each coherent 3D structure to align with the corresponding voxel feature. The key to training a highly accurate single-stage detector requires both losses to be learned toward high-quality targets. At inference, OpenM3D, a highly efficient detector, requires only multi-view images for input and demonstrates superior accuracy and speed (0.3 sec. per scene) on ScanNet200 and ARKitScenes indoor benchmarks compared to existing methods. We outperform a strong two-stage method that leverages our class-agnostic detector with a ViT CLIP-based OV classifier and a baseline incorporating multi-view depth estimator on both accuracy and speed.
zh
[CV-8] Segmentation Assisted Incremental Test Time Adaptation in an Open World BMVC2025
【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在动态环境中面临的增量测试时适应(Incremental Test Time Adaptation, ITTA)问题,即模型在部署后持续遇到未见类别(unseen classes)和分布偏移(distribution shifts)时,如何实现对 covariate shift 和 label shift 的同步适应。其核心挑战在于传统测试时适应方法仅针对预定义类别的分布变化,无法处理新类别的涌现。解决方案的关键在于提出一种无需训练的分割辅助主动标注模块(SegAssist),该模块利用VLM固有的分割能力来精炼主动采样策略,优先选择可能代表未见类别的样本进行查询,从而实现对新类别的有效识别与模型适应,显著提升模型在真实场景中连续适应新兴数据的能力。
链接: https://arxiv.org/abs/2508.20029
作者: Manogna Sreenivas,Soma Biswas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at BMVC 2025
Abstract:In dynamic environments, unfamiliar objects and distribution shifts are often encountered, which challenge the generalization abilities of the deployed trained models. This work addresses Incremental Test Time Adaptation of Vision Language Models, tackling scenarios where unseen classes and unseen domains continuously appear during testing. Unlike traditional Test Time Adaptation approaches, where the test stream comes only from a predefined set of classes, our framework allows models to adapt simultaneously to both covariate and label shifts, actively incorporating new classes as they emerge. Towards this goal, we establish a new benchmark for ITTA, integrating single image TTA methods for VLMs with active labeling techniques that query an oracle for samples potentially representing unseen classes during test time. We propose a segmentation assisted active labeling module, termed SegAssist, which is training free and repurposes the segmentation capabilities of VLMs to refine active sample selection, prioritizing samples likely to belong to unseen classes. Extensive experiments on several benchmark datasets demonstrate the potential of SegAssist to enhance the performance of VLMs in real world scenarios, where continuous adaptation to emerging data is essential. Project-page:this https URL
zh
[CV-9] GS: Generative Segmentation via Label Diffusion
【速读】:该论文旨在解决语言驱动的图像分割(language-driven image segmentation)任务中,现有方法多将分割视为辅助过程或依赖图像中心的建模策略所带来的局限性问题。传统方法通常采用判别式框架,将像素分类为前景或背景,而近期基于扩散模型的方法仍局限于以图像生成作为特征提取或数据增强手段,未能真正将分割本身作为核心生成目标。解决方案的关键在于提出GS(Generative Segmentation)框架,首次将分割任务建模为标签扩散(label diffusion)的生成式过程:通过从噪声直接生成分割掩码,并以输入图像和语言描述作为条件,实现了端到端训练并显式控制空间与语义保真度。这一范式转变使分割成为主要建模目标,从而显著提升了在Panoptic Narrative Grounding(PNG)等复杂多模态分割任务上的性能表现。
链接: https://arxiv.org/abs/2508.20020
作者: Yuhao Chen,Shubin Chen,Liang Lin,Guangrun Wang
机构: 1. 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 7 figures, 5 tables
Abstract:Language-driven image segmentation is a fundamental task in vision-language understanding, requiring models to segment regions of an image corresponding to natural language expressions. Traditional methods approach this as a discriminative problem, assigning each pixel to foreground or background based on semantic alignment. Recently, diffusion models have been introduced to this domain, but existing approaches remain image-centric: they either (i) use image diffusion models as visual feature extractors, (ii) synthesize segmentation data via image generation to train discriminative models, or (iii) perform diffusion inversion to extract attention cues from pre-trained image diffusion models-thereby treating segmentation as an auxiliary process. In this paper, we propose GS (Generative Segmentation), a novel framework that formulates segmentation itself as a generative task via label diffusion. Instead of generating images conditioned on label maps and text, GS reverses the generative process: it directly generates segmentation masks from noise, conditioned on both the input image and the accompanying language description. This paradigm makes label generation the primary modeling target, enabling end-to-end training with explicit control over spatial and semantic fidelity. To demonstrate the effectiveness of our approach, we evaluate GS on Panoptic Narrative Grounding (PNG), a representative and challenging benchmark for multimodal segmentation that requires panoptic-level reasoning guided by narrative captions. Experimental results show that GS significantly outperforms existing discriminative and diffusion-based methods, setting a new state-of-the-art for language-driven segmentation.
zh
[CV-10] Assessing the Geolocation Capabilities Limitations and Societal Risks of Generative Vision-Language Models AAAI
【速读】:该论文旨在解决当前生成式视觉语言模型(Generative Vision-Language Models, VLMs)在图像地理定位(geo-localization)任务中的精度评估不足问题,尤其是其潜在的隐私风险尚未被系统性研究。解决方案的关键在于对25个最先进的VLMs进行综合性评估,使用四个涵盖多样化环境的基准图像数据集,量化其在不同图像类型下的定位准确率,并揭示其内部推理机制、性能边界及社会风险。结果表明,尽管VLMs在通用街景图像上表现不佳,但在类社交媒体内容图像上可达到61%的高精度,凸显了亟需关注的隐私泄露风险。
链接: https://arxiv.org/abs/2508.19967
作者: Oliver Grainge,Sania Waheed,Jack Stilgoe,Michael Milford,Shoaib Ehsan
机构: University of Oxford (牛津大学); University College London (伦敦大学学院); Queensland University of Technology (昆士兰科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI Fall Symposium 2025 on AI Trustworthiness and Risk Assessment for Challenging Contexts (ATRACC)
Abstract:Geo-localization is the task of identifying the location of an image using visual cues alone. It has beneficial applications, such as improving disaster response, enhancing navigation, and geography education. Recently, Vision-Language Models (VLMs) are increasingly demonstrating capabilities as accurate image geo-locators. This brings significant privacy risks, including those related to stalking and surveillance, considering the widespread uses of AI models and sharing of photos on social media. The precision of these models is likely to improve in the future. Despite these risks, there is little work on systematically evaluating the geolocation precision of Generative VLMs, their limits and potential for unintended inferences. To bridge this gap, we conduct a comprehensive assessment of the geolocation capabilities of 25 state-of-the-art VLMs on four benchmark image datasets captured in diverse environments. Our results offer insight into the internal reasoning of VLMs and highlight their strengths, limitations, and potential societal risks. Our findings indicate that current VLMs perform poorly on generic street-level images yet achieve notably high accuracy (61%) on images resembling social media content, raising significant and urgent privacy concerns.
zh
[CV-11] Reimagining Image Segmentation using Active Contour: From Chan Vese Algorithm into a Proposal Novel Functional Loss Framework
【速读】:该论文旨在解决图像分割(image segmentation)中的精度与鲁棒性问题,特别是针对传统损失函数在边界敏感性和形状适应性方面的不足。其解决方案的关键在于提出一种基于活动轮廓(active contours)的新型分割损失函数,该损失函数以Chan-Vese模型的能量泛函为基础,并通过水平集方法(level set method)进行数值实现。该方法利用了Chan-Vese算法对图像内部和外部区域能量差异的建模能力,从而提升分割结果的准确性,尤其在复杂边界和噪声干扰下的表现优于经典损失函数。
链接: https://arxiv.org/abs/2508.19946
作者: Gianluca Guzzetta
机构: Politecnico di Torino (都灵理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: 13 pages
Abstract:In this paper, we present a comprehensive study and analysis of the Chan-Vese algorithm for image segmentation. We employ a discretized scheme derived from the empirical study of the Chan-Vese model’s functional energy and its partial differential equation based on its level set function. We provide a proof of the results and an implementation using MATLAB. Leveraging modern computer vision methodologies, we propose a functional segmentation loss based on active contours, utilizing this http URL and a level set based on the Chan-Vese algorithm. We compare our results with common computer vision segmentation datasets and evaluate the performance of classical loss functions against our proposed method. All code and materials used are available at this https URL.
zh
[CV-12] WaveHiT-SR: Hierarchical Wavelet Network for Efficient Image Super-Resolution
【速读】:该论文旨在解决基于Transformer的图像超分辨率(Image Super-Resolution, SR)方法中因窗口自注意力机制带来的二次计算复杂度问题,该问题迫使模型采用固定的小窗口,从而限制了感受野和长程依赖建模能力。解决方案的关键在于提出一种嵌入小波变换(Wavelet Transform)的分层Transformer框架(WaveHiT-SR):首先,通过自适应分层窗口替代静态小窗口,增强对多尺度特征的捕获能力;其次,利用小波变换将图像分解为多个频带子band,使网络能同时关注全局结构与局部细节,并在分层重建过程中降低计算复杂度而不损失性能,从而实现高效且高保真的图像超分辨率。
链接: https://arxiv.org/abs/2508.19927
作者: Fayaz Ali,Muhammad Zawish,Steven Davy,Radu Timofte
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures
Abstract:Transformers have demonstrated promising performance in computer vision tasks, including image super-resolution (SR). The quadratic computational complexity of window self-attention mechanisms in many transformer-based SR methods forces the use of small, fixed windows, limiting the receptive field. In this paper, we propose a new approach by embedding the wavelet transform within a hierarchical transformer framework, called (WaveHiT-SR). First, using adaptive hierarchical windows instead of static small windows allows to capture features across different levels and greatly improve the ability to model long-range dependencies. Secondly, the proposed model utilizes wavelet transforms to decompose images into multiple frequency subbands, allowing the network to focus on both global and local features while preserving structural details. By progressively reconstructing high-resolution images through hierarchical processing, the network reduces computational complexity without sacrificing performance. The multi-level decomposition strategy enables the network to capture fine-grained information in lowfrequency components while enhancing high-frequency textures. Through extensive experimentation, we confirm the effectiveness and efficiency of our WaveHiT-SR. Our refined versions of SwinIR-Light, SwinIR-NG, and SRFormer-Light deliver cutting-edge SR results, achieving higher efficiency with fewer parameters, lower FLOPs, and faster speeds.
zh
[CV-13] Integrating SAM Supervision for 3D Weakly Supervised Point Cloud Segmentation
【速读】:该论文旨在解决3D语义分割中因标注成本高而导致的稀疏标注问题,即如何在仅有限3D点云标注的情况下提升模型性能。其核心解决方案在于利用2D基础模型(foundation models)生成高质量的分割掩码(segmentation masks),并通过建立3D场景与2D视图之间的几何对应关系,将这些2D掩码传播至3D空间,从而扩展稀疏标签的覆盖范围;同时,结合置信度和不确定性一致性的正则化策略对3D点云增强样本进行筛选,并选取可靠伪标签进一步扩散到3D掩码中,实现标签的有效扩充与优化,显著提升了弱监督3D语义分割的性能。
链接: https://arxiv.org/abs/2508.19909
作者: Lechun You,Zhonghua Wu,Weide Liu,Xulei Yang,Jun Cheng,Wei Zhou,Bharadwaj Veeravalli,Guosheng Lin
机构: National University of Singapore (新加坡国立大学); SenseTime Research (商汤科技研究); Nanyang Technological University (南洋理工大学); Institute for Infocomm Research (资讯通信研究院); Cardiff University (卡迪夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current methods for 3D semantic segmentation propose training models with limited annotations to address the difficulty of annotating large, irregular, and unordered 3D point cloud data. They usually focus on the 3D domain only, without leveraging the complementary nature of 2D and 3D data. Besides, some methods extend original labels or generate pseudo labels to guide the training, but they often fail to fully use these labels or address the noise within them. Meanwhile, the emergence of comprehensive and adaptable foundation models has offered effective solutions for segmenting 2D data. Leveraging this advancement, we present a novel approach that maximizes the utility of sparsely available 3D annotations by incorporating segmentation masks generated by 2D foundation models. We further propagate the 2D segmentation masks into the 3D space by establishing geometric correspondences between 3D scenes and 2D views. We extend the highly sparse annotations to encompass the areas delineated by 3D masks, thereby substantially augmenting the pool of available labels. Furthermore, we apply confidence- and uncertainty-based consistency regularization on augmentations of the 3D point cloud and select the reliable pseudo labels, which are further spread on the 3D masks to generate more labels. This innovative strategy bridges the gap between limited 3D annotations and the powerful capabilities of 2D foundation models, ultimately improving the performance of 3D weakly supervised segmentation.
zh
[CV-14] Streamlining the Development of Active Learning Methods in Real-World Object Detection
【速读】:该论文旨在解决真实场景下目标检测的主动学习(Active Learning, AL)所面临的计算成本高和评估可靠性差的问题。现有AL方法需在每轮迭代中训练多个检测器进行比较,导致在自动驾驶数据集上单次检测器训练耗时高达282 GPU小时,且不同验证集上的AL方法排名波动大,难以保障安全关键系统中的评估稳定性。解决方案的关键在于提出一种基于对象的集合相似性(Object-based Set Similarity, OSS)度量指标:首先,OSS 通过对象级特征衡量训练集与目标域之间的相似性,无需实际训练检测器即可预筛无效AL策略;其次,它能够选出具有代表性的验证集以实现鲁棒评估。该方法是首个基于对象相似性统一目标检测主动学习训练与评估策略的工作,具备检测器无关性、仅需标注对象裁剪图像、可无缝集成至现有AL流程等优势,为计算效率和评估可靠性至关重要的现实部署提供了实用框架。
链接: https://arxiv.org/abs/2508.19906
作者: Moussa Kassem Sbeyti,Nadja Klein,Michelle Karg,Christian Wirth,Sahin Albayrak
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); TU Berlin (柏林工业大学); Continental AG (大陆集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Active learning (AL) for real-world object detection faces computational and reliability challenges that limit practical deployment. Developing new AL methods requires training multiple detectors across iterations to compare against existing approaches. This creates high costs for autonomous driving datasets where the training of one detector requires up to 282 GPU hours. Additionally, AL method rankings vary substantially across validation sets, compromising reliability in safety-critical transportation systems. We introduce object-based set similarity ( \mathrmOSS ), a metric that addresses these challenges. \mathrmOSS (1) quantifies AL method effectiveness without requiring detector training by measuring similarity between training sets and target domains using object-level features. This enables the elimination of ineffective AL methods before training. Furthermore, \mathrmOSS (2) enables the selection of representative validation sets for robust evaluation. We validate our similarity-based approach on three autonomous driving datasets (KITTI, BDD100K, CODA) using uncertainty-based AL methods as a case study with two detector architectures (EfficientDet, YOLOv3). This work is the first to unify AL training and evaluation strategies in object detection based on object similarity. \mathrmOSS is detector-agnostic, requires only labeled object crops, and integrates with existing AL pipelines. This provides a practical framework for deploying AL in real-world applications where computational efficiency and evaluation reliability are critical. Code is available at this https URL.
zh
[CV-15] Hyperspectral Sensors and Autonomous Driving: Technologies Limitations and Opportunities
【速读】:该论文旨在解决当前高光谱成像(Hyperspectral Imaging, HSI)在高级驾驶辅助系统(ADAS)和自动驾驶(AD)应用中研究潜力与商业落地之间存在显著差距的问题。其关键解决方案在于通过系统性综述和量化分析,首次全面评估了216款商用HSI及多光谱成像相机在帧率、空间分辨率、光谱维度及AEC-Q100温度可靠性等核心指标上的表现,并结合最新HSI数据集与应用场景(如道路表面语义分割、行人可区分性及恶劣天气感知)的分析,揭示了现有技术在规模、光谱一致性、通道数量和环境多样性方面的不足,从而为推动HSI在ADAS/AD中的实用化集成指明了关键研究方向。
链接: https://arxiv.org/abs/2508.19905
作者: Imad Ali Shah,Jiarong Li,Roshan George,Tim Brophy,Enda Ward,Martin Glavin,Edward Jones,Brian Deegan
机构: University of Galway (加利福尼亚大学); Valeo Vision Systems (瓦莱奥视觉系统)
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注: Submitted and under review at IEEE OJVT, August 2025
Abstract:Hyperspectral imaging (HSI) offers a transformative sensing modality for Advanced Driver Assistance Systems (ADAS) and autonomous driving (AD) applications, enabling material-level scene understanding through fine spectral resolution beyond the capabilities of traditional RGB imaging. This paper presents the first comprehensive review of HSI for automotive applications, examining the strengths, limitations, and suitability of current HSI technologies in the context of ADAS/AD. In addition to this qualitative review, we analyze 216 commercially available HSI and multispectral imaging cameras, benchmarking them against key automotive criteria: frame rate, spatial resolution, spectral dimensionality, and compliance with AEC-Q100 temperature standards. Our analysis reveals a significant gap between HSI’s demonstrated research potential and its commercial readiness. Only four cameras meet the defined performance thresholds, and none comply with AEC-Q100 requirements. In addition, the paper reviews recent HSI datasets and applications, including semantic segmentation for road surface classification, pedestrian separability, and adverse weather perception. Our review shows that current HSI datasets are limited in terms of scale, spectral consistency, the number of spectral channels, and environmental diversity, posing challenges for the development of perception algorithms and the adequate validation of HSI’s true potential in ADAS/AD applications. This review paper establishes the current state of HSI in automotive contexts as of 2025 and outlines key research directions toward practical integration of spectral imaging in ADAS and autonomous systems.
zh
[CV-16] NM-Hebb: Coupling Local Hebbian Plasticity with Metric Learning for More Accurate and Interpretable CNNs
【速读】:该论文旨在解决深度卷积神经网络(Deep Convolutional Neural Networks, CNNs)在训练过程中过度依赖全局梯度优化所引发的过拟合、冗余滤波器以及可解释性差等问题。其解决方案的关键在于提出一种两阶段训练框架NM-Hebb,第一阶段引入神经启发的局部可塑性机制:通过Hebbian正则化使激活的空间均值与对应卷积滤波器权重均值对齐,促进结构化且可复用的特征基元形成;同时设计可学习的神经调制器以门控弹性权重式固化损失,保留有益参数而不冻结网络。第二阶段则采用成对度量学习损失对主干网络进行微调,显式压缩类内距离并扩大类间间隔,从而提升嵌入空间中的聚类紧密性和可解释性。此方法在多个基准数据集和模型架构上均实现显著性能提升,并增强了模型的结构性与透明度。
链接: https://arxiv.org/abs/2508.19896
作者: Davorin Miličević,Ratko Grbić
机构: TTTech Auto (TTTech 自动化); University of Rijeka (里耶卡大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 4 figures. Submitted to Elsevier Neurocomputing, under review
Abstract:Deep Convolutional Neural Networks (CNNs) achieve high accuracy but often rely on purely global, gradient-based optimisation, which can lead to overfitting, redundant filters, and reduced interpretability. To address these limitations, we propose NM-Hebb, a two-phase training framework that integrates neuro-inspired local plasticity with distance-aware supervision. Phase 1 extends standard supervised training by jointly optimising a cross-entropy objective with two biologically inspired mechanisms: (i) a Hebbian regulariser that aligns the spatial mean of activations with the mean of the corresponding convolutional filter weights, encouraging structured, reusable primitives; and (ii) a learnable neuromodulator that gates an elastic-weight-style consolidation loss, preserving beneficial parameters without freezing the network. Phase 2 fine-tunes the backbone with a pairwise metric-learning loss, explicitly compressing intra-class distances and enlarging inter-class margins in the embedding space. Evaluated on CIFAR-10, CIFAR-100, and TinyImageNet across five backbones (ResNet-18, VGG-11, MobileNet-v2, EfficientNet-V2, DenseNet-121), NM-Hebb achieves consistent gains over baseline and other methods: Top-1 accuracy improves by +2.0-10.0 pp (CIFAR-10), +2.0-9.0 pp (CIFAR-100), and up to +4.3-8.9 pp (TinyImageNet), with Normalised Mutual Information (NMI) increased by up to +0.15. Qualitative visualisations and filter-level analyses further confirm that NM-Hebb produces more structured and selective features, yielding tighter and more interpretable class clusters. Overall, coupling local Hebbian plasticity with metric-based fine-tuning yields CNNs that are not only more accurate but also more interpretable, offering practical benefits for resource-constrained and safety-critical AI deployments.
zh
[CV-17] PersonaAnimator: Personalized Motion Transfer from Unconstrained Videos
【速读】:该论文旨在解决当前动作生成与迁移技术中存在的三大问题:(1)基于姿态引导的角色动作迁移方法仅复制动作而未学习其风格特征,导致角色表现力不足;(2)动作风格迁移方法严重依赖难以获取的动作捕捉数据;(3)生成动作有时违反物理规律。解决方案的关键在于提出了一种全新的任务——视频到视频的动作个性化(Video-to-Video Motion Personalization),并设计了PersonaAnimator框架,该框架能够直接从非约束视频中学习个性化动作模式,从而实现高质量的动作迁移。此外,研究还提出了首个基于视频的个性化动作数据集PersonaVid(包含20类动作内容和120类动作风格),并引入物理感知的动作风格正则化机制(Physics-aware Motion Style Regularization),以确保生成动作的物理合理性。
链接: https://arxiv.org/abs/2508.19895
作者: Ziyun Qian,Runyu Xiao,Shuyuan Tu,Wei Xue,Dingkang Yang,Mingcheng Li,Dongliang Kou,Minghao Han,Zizhi Chen,Lihua Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in motion generation show remarkable progress. However, several limitations remain: (1) Existing pose-guided character motion transfer methods merely replicate motion without learning its style characteristics, resulting in inexpressive characters. (2) Motion style transfer methods rely heavily on motion capture data, which is difficult to obtain. (3) Generated motions sometimes violate physical laws. To address these challenges, this paper pioneers a new task: Video-to-Video Motion Personalization. We propose a novel framework, PersonaAnimator, which learns personalized motion patterns directly from unconstrained videos. This enables personalized motion transfer. To support this task, we introduce PersonaVid, the first video-based personalized motion dataset. It contains 20 motion content categories and 120 motion style categories. We further propose a Physics-aware Motion Style Regularization mechanism to enforce physical plausibility in the generated motions. Extensive experiments show that PersonaAnimator outperforms state-of-the-art motion transfer methods and sets a new benchmark for the Video-to-Video Motion Personalization task.
zh
[CV-18] Multispectral LiDAR data for extracting tree points in urban and suburban areas
【速读】:该论文旨在解决城市环境中树木动态监测的难题,以支持绿化政策制定并降低对电力基础设施的风险。传统机载激光雷达(Airborne LiDAR)在复杂城市场景中面临精度不足和树种多样性带来的挑战。解决方案的关键在于融合多光谱(Multispectral, MS)LiDAR数据与深度学习(Deep Learning, DL)模型,通过引入伪归一化差异植被指数(pseudo normalized difference vegetation index, pNDVI)增强空间信息,显著提升树木点云提取的准确性与效率。实验表明,Superpoint Transformer(SPT)模型在mIoU指标上达到85.28%,且结合pNDVI后相比仅使用空间信息时误差率降低10.61个百分点,验证了多源数据融合与先进DL架构在精细化城市树木测绘中的核心价值。
链接: https://arxiv.org/abs/2508.19881
作者: Narges Takhtkeshha,Gabriele Mazzacca,Fabio Remondino,Juha Hyyppä,Gottfried Mandlburger
机构: TU Wien (维也纳科技大學); Bruno Kessler Foundation (FBK) (布鲁诺·凯斯勒基金会); Finnish Geospatial Research Institute (芬兰大地测量研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Monitoring urban tree dynamics is vital for supporting greening policies and reducing risks to electrical infrastructure. Airborne laser scanning has advanced large-scale tree management, but challenges remain due to complex urban environments and tree variability. Multispectral (MS) light detection and ranging (LiDAR) improves this by capturing both 3D spatial and spectral data, enabling detailed mapping. This study explores tree point extraction using MS-LiDAR and deep learning (DL) models. Three state-of-the-art models are evaluated: Superpoint Transformer (SPT), Point Transformer V3 (PTv3), and Point Transformer V1 (PTv1). Results show the notable time efficiency and accuracy of SPT, with a mean intersection over union (mIoU) of 85.28%. The highest detection accuracy is achieved by incorporating pseudo normalized difference vegetation index (pNDVI) with spatial data, reducing error rate by 10.61 percentage points (pp) compared to using spatial information alone. These findings highlight the potential of MS-LiDAR and DL to improve tree extraction and further tree inventories.
zh
[CV-19] Sky Background Building of Multi-objective Fiber spectra Based on Mutual Information Network
【速读】:该论文旨在解决多目标光纤光谱处理中天空背景扣除不准确的问题,现有方法主要依赖于天空光纤光谱构建平均“超级天空”(Super Sky),但未能充分建模目标周围环境的局部差异。解决方案的关键在于提出一种基于互信息(Mutual Information, MI)的天空背景估计模型——SMI(Sky background building based on Mutual Information)。其核心创新是利用全板光纤光谱信息进行联合建模:第一网络通过波长校准模块提取各光纤对应的天空特征,缓解因发射位置偏移导致的特征漂移问题;第二网络采用增量训练策略,在最大化不同光谱表示间互信息以捕捉共性成分的同时,最小化相邻光谱表示间的互信息以分离个体差异成分,从而实现每个目标位置的个性化天空背景估计,显著提升蓝端区域的背景扣除精度。
链接: https://arxiv.org/abs/2508.19875
作者: Hui Zhang,Jianghui Cai,Haifeng Yang,Ali Luo,Yuqing Yang,Xiao Kong,Zhichao Ding,Lichan Zhou,Qin Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Sky background subtraction is a critical step in Multi-objective Fiber spectra process. However, current subtraction relies mainly on sky fiber spectra to build Super Sky. These average spectra are lacking in the modeling of the environment surrounding the objects. To address this issue, a sky background estimation model: Sky background building based on Mutual Information (SMI) is proposed. SMI based on mutual information and incremental training approach. It utilizes spectra from all fibers in the plate to estimate the sky background. SMI contains two main networks, the first network applies a wavelength calibration module to extract sky features from spectra, and can effectively solve the feature shift problem according to the corresponding emission position. The second network employs an incremental training approach to maximize mutual information between representations of different spectra to capturing the common component. Then, it minimizes the mutual information between adjoining spectra representations to obtain individual components. This network yields an individual sky background at each location of the object. To verify the effectiveness of the method in this paper, we conducted experiments on the spectra of LAMOST. Results show that SMI can obtain a better object sky background during the observation, especially in the blue end.
zh
[CV-20] rajFusionNet: Pedestrian Crossing Intention Prediction via Fusion of Sequential and Visual Trajectory Representations
【速读】:该论文旨在解决自动驾驶车辆在公共道路上预测行人过街意图的问题,即判断场景中的行人是否可能横穿道路。解决方案的关键在于提出一种基于Transformer的新型模型TrajFusionNet,其通过融合未来行人轨迹和车辆速度预测作为先验信息来提升预测精度;该模型包含两个分支:序列注意力模块(Sequence Attention Module, SAM)用于从行人轨迹与车辆速度的时序表示中学习,视觉注意力模块(Visual Attention Module, VAM)则通过将预测的行人边界框叠加到场景图像上,从视觉特征中提取信息。该方法利用少量轻量级模态,在保证高性能的同时实现了最低的总推理时间(包括模型运行时间和数据预处理时间),并在三个主流行人过街意图预测数据集上达到最优性能。
链接: https://arxiv.org/abs/2508.19866
作者: François G. Landry,Moulay A. Akhloufi
机构: Université de Moncton (蒙克顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This work has been submitted to IEEE Transactions on Intelligent Vehicles for possible publication
Abstract:With the introduction of vehicles with autonomous capabilities on public roads, predicting pedestrian crossing intention has emerged as an active area of research. The task of predicting pedestrian crossing intention involves determining whether pedestrians in the scene are likely to cross the road or not. In this work, we propose TrajFusionNet, a novel transformer-based model that combines future pedestrian trajectory and vehicle speed predictions as priors for predicting crossing intention. TrajFusionNet comprises two branches: a Sequence Attention Module (SAM) and a Visual Attention Module (VAM). The SAM branch learns from a sequential representation of the observed and predicted pedestrian trajectory and vehicle speed. Complementarily, the VAM branch enables learning from a visual representation of the predicted pedestrian trajectory by overlaying predicted pedestrian bounding boxes onto scene images. By utilizing a small number of lightweight modalities, TrajFusionNet achieves the lowest total inference time (including model runtime and data preprocessing) among current state-of-the-art approaches. In terms of performance, it achieves state-of-the-art results across the three most commonly used datasets for pedestrian crossing intention prediction.
zh
[CV-21] Self-supervised structured object representation learning
【速读】:该论文旨在解决自监督学习(Self-supervised Learning, SSL)在视觉表征学习中难以捕捉场景结构信息的问题,尤其是在密集预测任务(如目标检测)中的表现受限。现有方法(如DINO)依赖随机裁剪和全局嵌入,导致场景上下文信息丢失,从而影响细粒度结构建模能力。其解决方案的关键在于提出一种基于新型ProtoScale模块的渐进式结构化视觉表示学习框架,通过语义分组、实例级分离和层次化构建相结合的方式,在多尺度空间上捕获视觉元素,并保留增强视图中的完整场景上下文,从而提升对象中心表征的质量,即使在标注数据有限和微调轮次较少的情况下,仍能显著优于当前最优方法。
链接: https://arxiv.org/abs/2508.19864
作者: Oussama Hadjerci,Antoine Letienne,Mohamed Abbas Hedjazi,Adel Hafiane
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Self-supervised learning (SSL) has emerged as a powerful technique for learning visual representations. While recent SSL approaches achieve strong results in global image understanding, they are limited in capturing the structured representation in scenes. In this work, we propose a self-supervised approach that progressively builds structured visual representations by combining semantic grouping, instance level separation, and hierarchical structuring. Our approach, based on a novel ProtoScale module, captures visual elements across multiple spatial scales. Unlike common strategies like DINO that rely on random cropping and global embeddings, we preserve full scene context across augmented views to improve performance in dense prediction tasks. We validate our method on downstream object detection tasks using a combined subset of multiple datasets (COCO and UA-DETRAC). Experimental results show that our method learns object centric representations that enhance supervised object detection and outperform the state-of-the-art methods, even when trained with limited annotated data and fewer fine-tuning epochs.
zh
[CV-22] Multimodal Conditional MeshGAN for Personalized Aneurysm Growth Prediction
【速读】:该论文旨在解决主动脉瘤(aortic aneurysm)进展预测中难以同时建模局部细微形变与全局解剖结构变化的挑战,从而实现个性化、高精度的3D疾病轨迹预测。其解决方案的关键在于提出MCMeshGAN——首个用于3D动脉瘤生长预测的多模态条件网格到网格生成对抗网络(mesh-to-mesh generative adversarial network),采用双分支架构:一方面引入基于K近邻的卷积网络(KNN-based convolutional network, KCN)以保留精细几何细节,另一方面利用图卷积网络(graph convolutional network, GCN)捕捉长程结构上下文,有效克服深度GCN存在的过平滑问题;此外,专门设计的条件分支编码临床属性(如年龄、性别)和目标时间间隔,实现解剖学合理且时间可控的预测,支持回顾性和前瞻性建模。
链接: https://arxiv.org/abs/2508.19862
作者: Long Chen,Ashiv Patel,Mengyun Qiao,Mohammad Yousuf Salmasi,Salah A. Hammouche,Vasilis Stavrinides,Jasleen Nagi,Soodeh Kalaie,Xiao Yun Xu,Wenjia Bai,Declan P. O’Regan
机构: MRC Laboratory of Medical Sciences (医学科学研究所实验室); Imperial College London (帝国理工学院); Imperial College Healthcare NHS Trust (帝国理工学院医疗保健国家卫生服务信托); University College London (伦敦大学学院); London Postgraduate School of Surgery (伦敦外科研究生院); Faculty of Medicine (医学院); Department of Mechanical Engineering (机械工程系); Department of Chemical Engineering (化学工程系); Department of Brain Sciences&Computing (脑科学与计算系)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Personalized, accurate prediction of aortic aneurysm progression is essential for timely intervention but remains challenging due to the need to model both subtle local deformations and global anatomical changes within complex 3D geometries. We propose MCMeshGAN, the first multimodal conditional mesh-to-mesh generative adversarial network for 3D aneurysm growth prediction. MCMeshGAN introduces a dual-branch architecture combining a novel local KNN-based convolutional network (KCN) to preserve fine-grained geometric details and a global graph convolutional network (GCN) to capture long-range structural context, overcoming the over-smoothing limitations of deep GCNs. A dedicated condition branch encodes clinical attributes (age, sex) and the target time interval to generate anatomically plausible, temporally controlled predictions, enabling retrospective and prospective modeling. We curated TAAMesh, a new longitudinal thoracic aortic aneurysm mesh dataset consisting of 590 multimodal records (CT scans, 3D meshes, and clinical data) from 208 patients. Extensive experiments demonstrate that MCMeshGAN consistently outperforms state-of-the-art baselines in both geometric accuracy and clinically important diameter estimation. This framework offers a robust step toward clinically deployable, personalized 3D disease trajectory modeling. The source code for MCMeshGAN and the baseline methods is publicly available at this https URL.
zh
[CV-23] Ego-centric Predictive Model Conditioned on Hand Trajectories
【速读】:该论文旨在解决共情场景(egocentric scenarios)中同时预测下一步动作及其视觉后果的问题,现有方法要么仅关注动作预测(如Vision-Language-Action模型),要么仅生成未来视频帧但未考虑具体动作条件(如视频预测模型),导致结果缺乏因果一致性或情境合理性。解决方案的关键在于提出一个两阶段统一预测框架:第一阶段通过连续状态建模处理多模态输入(视觉观测、语言和动作历史),显式预测未来手部轨迹;第二阶段引入因果交叉注意力机制融合多模态线索,并利用推断的动作信号引导基于潜空间的扩散模型(Latent Diffusion Model, LDM)逐帧生成符合动作逻辑的未来视频。该方法首次实现了对共情场景下人类活动理解与机器人操作任务的联合建模,提供明确的动作及视觉后果预测。
链接: https://arxiv.org/abs/2508.19852
作者: Binjie Zhang,Mike Zheng Shou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this http URL (branch: main)
Abstract:In egocentric scenarios, anticipating both the next action and its visual outcome is essential for understanding human-object interactions and for enabling robotic planning. However, existing paradigms fall short of jointly modeling these aspects. Vision-Language-Action (VLA) models focus on action prediction but lack explicit modeling of how actions influence the visual scene, while video prediction models generate future frames without conditioning on specific actions, often resulting in implausible or contextually inconsistent outcomes. To bridge this gap, we propose a unified two-stage predictive framework that jointly models action and visual future in egocentric scenarios, conditioned on hand trajectories. In the first stage, we perform consecutive state modeling to process heterogeneous inputs (visual observations, language, and action history) and explicitly predict future hand trajectories. In the second stage, we introduce causal cross-attention to fuse multi-modal cues, leveraging inferred action signals to guide an image-based Latent Diffusion Model (LDM) for frame-by-frame future video generation. Our approach is the first unified model designed to handle both egocentric human activity understanding and robotic manipulation tasks, providing explicit predictions of both upcoming actions and their visual consequences. Extensive experiments on Ego4D, BridgeData, and RLBench demonstrate that our method outperforms state-of-the-art baselines in both action prediction and future video synthesis.
zh
[CV-24] Image Quality Assessment for Machines: Paradigm Large-scale Database and Models
【速读】:该论文旨在解决机器视觉系统(Machine Vision Systems, MVS)在恶劣视觉条件下性能退化的问题,核心挑战在于如何量化图像退化对MVS性能的影响。解决方案的关键在于提出一种面向机器的图像质量评估(Machine-centric Image Quality Assessment, MIQA)框架,其核心创新包括:构建包含250种退化类型、75个视觉模型和3类典型任务的大型机器感知图像质量数据库(MIQD-2.5M),以及设计一种区域感知的MIQA模型(RA-MIQA),通过细粒度的空间退化分析实现对MVS视觉质量的精准评估。实验表明,RA-MIQA在一致性与准确性指标上显著优于基于人类视觉系统(Human Visual System, HVS)的传统图像质量评估方法,揭示了HVS指标在预测MVS性能上的局限性,并指出当前MIQA模型在背景退化、精度导向估计及细微失真处理方面的不足,为提升MVS可靠性及机器感知图像处理提供了理论基础与实践工具。
链接: https://arxiv.org/abs/2508.19850
作者: Xiaoqi Wang,Yun Zhang,Weisi Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Machine vision systems (MVS) are intrinsically vulnerable to performance degradation under adverse visual conditions. To address this, we propose a machine-centric image quality assessment (MIQA) framework that quantifies the impact of image degradations on MVS performance. We establish an MIQA paradigm encompassing the end-to-end assessment workflow. To support this, we construct a machine-centric image quality database (MIQD-2.5M), comprising 2.5 million samples that capture distinctive degradation responses in both consistency and accuracy metrics, spanning 75 vision models, 250 degradation types, and three representative vision tasks. We further propose a region-aware MIQA (RA-MIQA) model to evaluate MVS visual quality through fine-grained spatial degradation analysis. Extensive experiments benchmark the proposed RA-MIQA against seven human visual system (HVS)-based IQA metrics and five retrained classical backbones. Results demonstrate RA-MIQA’s superior performance in multiple dimensions, e.g., achieving SRCC gains of 13.56% on consistency and 13.37% on accuracy for image classification, while also revealing task-specific degradation sensitivities. Critically, HVS-based metrics prove inadequate for MVS quality prediction, while even specialized MIQA models struggle with background degradations, accuracy-oriented estimation, and subtle distortions. This study can advance MVS reliability and establish foundations for machine-centric image processing and optimization. The model and code are available at: this https URL.
zh
[CV-25] Gradient Rectification for Robust Calibration under Distribution Shift
【速读】:该论文旨在解决深度神经网络在分布偏移(distribution shift)场景下预测过于自信(overconfident predictions)导致可靠性下降的问题,尤其在缺乏目标域数据的情况下,现有校准方法因依赖目标域信息而难以实际应用。其解决方案的关键在于提出一种无需目标域信息的新型校准框架:首先从频域视角出发,识别分布偏移常扭曲模型依赖的高频视觉特征,进而引入低频滤波策略以促使模型关注域不变特征;同时为避免该策略损害原始分布(In-Distribution, ID)下的校准性能,进一步设计基于梯度的修正机制,在优化过程中将ID校准作为硬约束强制执行。实验表明,该方法在合成与真实分布偏移数据集上均能显著提升校准效果并保持良好的ID性能。
链接: https://arxiv.org/abs/2508.19830
作者: Yilin Zhang,Cai Xu,You Wu,Ziyu Guan,Wei Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, under review
Abstract:Deep neural networks often produce overconfident predictions, undermining their reliability in safety-critical applications. This miscalibration is further exacerbated under distribution shift, where test data deviates from the training distribution due to environmental or acquisition changes. While existing approaches improve calibration through training-time regularization or post-hoc adjustment, their reliance on access to or simulation of target domains limits their practicality in real-world scenarios. In this paper, we propose a novel calibration framework that operates without access to target domain information. From a frequency-domain perspective, we identify that distribution shifts often distort high-frequency visual cues exploited by deep models, and introduce a low-frequency filtering strategy to encourage reliance on domain-invariant features. However, such information loss may degrade In-Distribution (ID) calibration performance. Therefore, we further propose a gradient-based rectification mechanism that enforces ID calibration as a hard constraint during optimization. Experiments on synthetic and real-world shifted datasets, including CIFAR-10/100-C and WILDS, demonstrate that our method significantly improves calibration under distribution shift while maintaining strong in-distribution performance.
zh
[CV-26] ERSR: An Ellipse-constrained pseudo-label refinement and symmetric regularization framework for semi-supervised fetal head segmentation in ultrasound images
【速读】:该论文旨在解决胎儿头部超声图像中分割任务的挑战,尤其是由于图像质量差和标注数据稀缺导致的模型性能受限问题。针对半监督学习方法在胎儿超声图像上难以生成可靠伪标签及有效施加一致性正则化的问题,作者提出了一种名为ERSR的新框架。其核心创新在于三个关键技术:(1)双评分自适应滤波策略,通过边界一致性和轮廓规则性评估并筛选教师模型输出;(2)椭圆约束的伪标签精炼机制,利用最小二乘椭圆拟合强化椭圆中心区域像素并抑制噪声;(3)基于对称性的多级一致性正则化,强制扰动图像、对称区域以及原始预测与伪标签之间的多层次一致性,从而提升模型对胎儿头部形状特征的鲁棒表示能力。
链接: https://arxiv.org/abs/2508.19815
作者: Linkuan Zhou,Zhexin Chen,Yufei Shen,Junlin Xu,Ping Xuan,Yixin Zhu,Yuqi Fang,Cong Cong,Leyi Wei,Ran Su,Jia Zhou,Qiangguo Jin
机构: Northwestern Polytechnical University (西北工业大学); Wuhan University of Science and Technology (武汉科技大学); Shantou University (汕头大学); Peking University Shenzhen Hospital (北京大学深圳医院); Nanjing University (南京大学); Macquarie University (麦考瑞大学); Macao Polytechnic University (澳门理工学院); Tianjin University (天津大学); Tianjin Chest Hospital (天津市胸科医院); Yangtze River Delta Research Institute of Northwestern Polytechnical University (西北工业大学长三角研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Automated segmentation of the fetal head in ultrasound images is critical for prenatal monitoring. However, achieving robust segmentation remains challenging due to the poor quality of ultrasound images and the lack of annotated data. Semi-supervised methods alleviate the lack of annotated data but struggle with the unique characteristics of fetal head ultrasound images, making it challenging to generate reliable pseudo-labels and enforce effective consistency regularization constraints. To address this issue, we propose a novel semi-supervised framework, ERSR, for fetal head ultrasound segmentation. Our framework consists of the dual-scoring adaptive filtering strategy, the ellipse-constrained pseudo-label refinement, and the symmetry-based multiple consistency regularization. The dual-scoring adaptive filtering strategy uses boundary consistency and contour regularity criteria to evaluate and filter teacher outputs. The ellipse-constrained pseudo-label refinement refines these filtered outputs by fitting least-squares ellipses, which strengthens pixels near the center of the fitted ellipse and suppresses noise simultaneously. The symmetry-based multiple consistency regularization enforces multi-level consistency across perturbed images, symmetric regions, and between original predictions and pseudo-labels, enabling the model to capture robust and stable shape representations. Our method achieves state-of-the-art performance on two benchmarks. On the HC18 dataset, it reaches Dice scores of 92.05% and 95.36% with 10% and 20% labeled data, respectively. On the PSFH dataset, the scores are 91.68% and 93.70% under the same settings.
zh
[CV-27] AutoQ-VIS: Improving Unsupervised Video Instance Segmentation via Automatic Quality Assessment ICCV2025
【速读】:该论文旨在解决视频实例分割(Video Instance Segmentation, VIS)在标注成本高昂的问题,特别是由于其对像素级掩码和时序一致性标签的双重需求。现有无监督方法如VideoCutLER虽通过合成数据消除了对光流的依赖,但仍受限于合成数据到真实场景的域差异(synthetic-to-real domain gap)。其解决方案的关键在于提出AutoQ-VIS框架,通过质量引导的自训练机制建立伪标签生成与自动质量评估之间的闭环系统,实现从合成视频到真实视频的渐进式适应,从而有效缩小域差距并提升性能,在YouTubeVIS-2019验证集上达到52.6 AP_50,优于先前最优方法VideoCutLER 4.4%。
链接: https://arxiv.org/abs/2508.19808
作者: Kaixuan Lu,Mehmet Onurcan Kaya,Dim P. Papadopoulos
机构: Technical University of Denmark (丹麦技术大学); Pioneer Centre for AI (先锋人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025 Workshop LIMIT
Abstract:Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 \textAP_50 on YouTubeVIS-2019 val set, surpassing the previous state-of-the-art VideoCutLER by 4.4 % , while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. The source code of our method is available at this https URL.
zh
[CV-28] Context-aware Sparse Spatiotemporal Learning for Event-based Vision IROS2025
【速读】:该论文旨在解决现有基于深度学习的事件数据处理方法难以充分利用事件数据稀疏性,导致在资源受限的边缘应用中集成困难的问题,同时应对脉冲神经网络(Spiking Neural Networks, SNNs)在复杂事件视觉任务(如目标检测和光流估计)中性能不足的挑战。其解决方案的关键在于提出一种上下文感知的稀疏时空学习框架(Context-aware Sparse Spatiotemporal Learning, CSSL),通过引入基于输入分布动态调节神经元激活的上下文感知阈值机制,在无需显式稀疏性约束的情况下自然降低激活密度,从而在保持高性能的同时实现极高的神经元稀疏性,显著提升事件视觉任务在类脑计算平台上的效率与适用性。
链接: https://arxiv.org/abs/2508.19806
作者: Shenqi Wang,Guangzhi Tang
机构: Delft University of Technology (代尔夫特理工大学); Maastricht University (马斯特里赫特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: Accepted at IROS 2025
Abstract:Event-based camera has emerged as a promising paradigm for robot perception, offering advantages with high temporal resolution, high dynamic range, and robustness to motion blur. However, existing deep learning-based event processing methods often fail to fully leverage the sparse nature of event data, complicating their integration into resource-constrained edge applications. While neuromorphic computing provides an energy-efficient alternative, spiking neural networks struggle to match of performance of state-of-the-art models in complex event-based vision tasks, like object detection and optical flow. Moreover, achieving high activation sparsity in neural networks is still difficult and often demands careful manual tuning of sparsity-inducing loss terms. Here, we propose Context-aware Sparse Spatiotemporal Learning (CSSL), a novel framework that introduces context-aware thresholding to dynamically regulate neuron activations based on the input distribution, naturally reducing activation density without explicit sparsity constraints. Applied to event-based object detection and optical flow estimation, CSSL achieves comparable or superior performance to state-of-the-art methods while maintaining extremely high neuronal sparsity. Our experimental results highlight CSSL’s crucial role in enabling efficient event-based vision for neuromorphic processing.
zh
[CV-29] A bag of tricks for real-time Mitotic Figure detection
【速读】:该论文旨在解决组织病理图像中有丝分裂象(Mitotic Figure, MF)检测的挑战性问题,尤其针对不同扫描仪、染色协议、组织类型及伪影带来的域差异导致的模型泛化能力不足。其解决方案的关键在于构建一套“技巧集合”(bag of tricks),包括基于高效单阶段目标检测器RTMDet的架构设计、多域训练数据的广泛覆盖、平衡采样策略与精细化增强技术以应对扫描仪变异和肿瘤异质性,并通过针对性的困难负样本挖掘(hard negative mining)减少坏死和碎片组织引起的假阳性。该方法在多个MF数据集上实现了0.78–0.84的F1分数,在MIDOG 2025挑战初步测试集中达到0.81的F1,优于更大模型且展现出良好的跨域适应能力,为临床部署提供了准确与速度兼顾的实用方案。
链接: https://arxiv.org/abs/2508.19804
作者: Christian Marzahl,Brian Napora
机构: Gestalt Diagnostics(格斯特尔诊断)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Mitotic figure (MF) detection in histopathology images is challenging due to large variations in slide scanners, staining protocols, tissue types, and the presence of artifacts. This paper presents a collection of training techniques - a bag of tricks - that enable robust, real-time MF detection across diverse domains. We build on the efficient RTMDet single stage object detector to achieve high inference speed suitable for clinical deployment. Our method addresses scanner variability and tumor heterogeneity via extensive multi-domain training data, balanced sampling, and careful augmentation. Additionally, we employ targeted, hard negative mining on necrotic and debris tissue to reduce false positives. In a grouped 5-fold cross-validation across multiple MF datasets, our model achieves an F1 score between 0.78 and 0.84. On the preliminary test set of the MItosis DOmain Generalization (MIDOG) 2025 challenge, our single-stage RTMDet-S based approach reaches an F1 of 0.81, outperforming larger models and demonstrating adaptability to new, unfamiliar domains. The proposed solution offers a practical trade-off between accuracy and speed, making it attractive for real-world clinical adoption.
zh
[CV-30] FusionSort: Enhanced Cluttered Waste Segmentation with Advanced Decoding and Comprehensive Modality Optimization
【速读】:该论文旨在解决非生物降解废弃物自动化分拣过程中因废物流复杂性和多样性带来的挑战。其解决方案的关键在于提出一种增强型神经网络架构,该架构在现有编码器-解码器结构基础上进行改进:首先,在解码器中引入综合注意力模块(Comprehensive Attention Block),通过融合卷积与上采样操作优化特征表示;其次,采用Mamba架构实现注意力机制以提升性能;此外,设计数据融合模块(Data Fusion Block)用于处理多通道图像数据,通过主成分分析(PCA)降维保留最大方差和关键信息,从而有效整合RGB、高光谱、多光谱及二者组合数据。实验表明,该方法在多种数据类型上均显著优于现有技术。
链接: https://arxiv.org/abs/2508.19798
作者: Muhammad Ali,Omar Ali AlSuwaidi
机构: MBZUAI(穆罕默德·阿里大学人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In the realm of waste management, automating the sorting process for non-biodegradable materials presents considerable challenges due to the complexity and variability of waste streams. To address these challenges, we introduce an enhanced neural architecture that builds upon an existing Encoder-Decoder structure to improve the accuracy and efficiency of waste sorting systems. Our model integrates several key innovations: a Comprehensive Attention Block within the decoder, which refines feature representations by combining convolutional and upsampling operations. In parallel, we utilize attention through the Mamba architecture, providing an additional performance boost. We also introduce a Data Fusion Block that fuses images with more than three channels. To achieve this, we apply PCA transformation to reduce the dimensionality while retaining the maximum variance and essential information across three dimensions, which are then used for further processing. We evaluated the model on RGB, hyperspectral, multispectral, and a combination of RGB and hyperspectral data. The results demonstrate that our approach outperforms existing methods by a significant margin.
zh
[CV-31] Not Every Gift Comes in Gold Paper or with a Red Ribbon: Exploring Color Perception in Text-to-Image Models
【速读】:该论文旨在解决文本到图像生成(text-to-image generation)中多对象提示(multi-object prompts)的语义对齐问题,尤其是当提示包含多个颜色属性时,预训练模型难以准确生成与文本语义一致的图像。现有方法主要依赖推理阶段对去噪网络注意力层的修改,但这些方法在处理复杂颜色提示时效果有限,且缺乏可靠的量化评估手段。论文的关键解决方案是一种专门设计的图像编辑技术,通过精准调整图像中的颜色属性来缓解多对象语义错位问题,显著提升了多种扩散模型生成图像在多个指标上的表现。
链接: https://arxiv.org/abs/2508.19791
作者: Shay Shomer Chai,Wenxuan Peng,Bharath Hariharan,Hadar Averbuch-Elor
机构: Tel Aviv University (特拉维夫大学); Cornell University (康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project webpage: this https URL
Abstract:Text-to-image generation has recently seen remarkable success, granting users with the ability to create high-quality images through the use of text. However, contemporary methods face challenges in capturing the precise semantics conveyed by complex multi-object prompts. Consequently, many works have sought to mitigate such semantic misalignments, typically via inference-time schemes that modify the attention layers of the denoising networks. However, prior work has mostly utilized coarse metrics, such as the cosine similarity between text and image CLIP embeddings, or human evaluations, which are challenging to conduct on a larger-scale. In this work, we perform a case study on colors – a fundamental attribute commonly associated with objects in text prompts, which offer a rich test bed for rigorous evaluation. Our analysis reveals that pretrained models struggle to generate images that faithfully reflect multiple color attributes-far more so than with single-color prompts-and that neither inference-time techniques nor existing editing methods reliably resolve these semantic misalignments. Accordingly, we introduce a dedicated image editing technique, mitigating the issue of multi-object semantic alignment for prompts containing multiple colors. We demonstrate that our approach significantly boosts performance over a wide range of metrics, considering images generated by various text-to-image diffusion-based techniques.
zh
[CV-32] StableIntrinsic: Detail-preserving One-step Diffusion Model for Multi-view Material Estimation
【速读】:该论文旨在解决基于扩散模型的多视角材质估计中存在的时间消耗大和结果方差高的问题。现有方法采用多步去噪策略,导致推理效率低下,且其随机性与确定性的材质估计任务不匹配,造成估计结果波动较大。解决方案的关键在于提出一种单步扩散模型 StableIntrinsic,通过在像素空间施加针对不同材质属性设计的损失函数以缓解单步扩散带来的过度平滑问题,并引入细节注入网络(Detail Injection Network, DIN)来补偿变分自编码器(Variational Autoencoder, VAE)编码过程中的细节丢失,从而提升材质预测的清晰度与精度。实验表明,该方法在反照率(albedo)的峰值信噪比(PSNR)上较当前最优方法提升 9.9%,金属度和粗糙度的均方误差(MSE)分别降低 44.4% 和 60.0%。
链接: https://arxiv.org/abs/2508.19789
作者: Xiuchao Wu,Pengfei Zhu,Jiangjing Lyu,Xinguo Liu,Jie Guo,Yanwen Guo,Weiwei Xu,Chengfei Lyu
机构: Zhejiang University (浙江大学); Nanjing University (南京大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recovering material information from images has been extensively studied in computer graphics and vision. Recent works in material estimation leverage diffusion model showing promising results. However, these diffusion-based methods adopt a multi-step denoising strategy, which is time-consuming for each estimation. Such stochastic inference also conflicts with the deterministic material estimation task, leading to a high variance estimated results. In this paper, we introduce StableIntrinsic, a one-step diffusion model for multi-view material estimation that can produce high-quality material parameters with low variance. To address the overly-smoothing problem in one-step diffusion, StableIntrinsic applies losses in pixel space, with each loss designed based on the properties of the material. Additionally, StableIntrinsic introduces a Detail Injection Network (DIN) to eliminate the detail loss caused by VAE encoding, while further enhancing the sharpness of material prediction results. The experimental results indicate that our method surpasses the current state-of-the-art techniques by achieving a 9.9% improvement in the Peak Signal-to-Noise Ratio (PSNR) of albedo, and by reducing the Mean Square Error (MSE) for metallic and roughness by 44.4% and 60.0% , respectively.
zh
[CV-33] Context-Aware Risk Estimation in Home Environments: A Probabilistic Framework for Service Robots
【速读】:该论文旨在解决服务机器人在人类-centric室内环境中识别高风险区域(accident-prone regions)的问题,以提升其在日常场景中的实时风险感知能力。解决方案的关键在于提出一种基于语义图(semantic graph)的风险传播算法:将场景中的每个物体表示为节点并赋予风险评分,通过空间邻近性和事故关联性实现从高风险物体到低风险物体的非对称风险传播,从而在未显式标注或可见的情况下推断潜在危险。该方法兼顾可解释性与轻量化部署特性,在人工标注风险区域的数据集上实现了75%的二分类风险检测准确率,并展现出与人类感知高度一致的性能,尤其在涉及锋利或不稳定的物体时表现突出。
链接: https://arxiv.org/abs/2508.19788
作者: Sena Ishii,Akash Chikhalikar,Ankit A. Ravankar,Jose Victorio Salazar Luces,Yasuhisa Hirata
机构: Tohoku University (东北大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, Accepted for IEEE RO-MAN 2025 Conference
Abstract:We present a novel framework for estimating accident-prone regions in everyday indoor scenes, aimed at improving real-time risk awareness in service robots operating in human-centric environments. As robots become integrated into daily life, particularly in homes, the ability to anticipate and respond to environmental hazards is crucial for ensuring user safety, trust, and effective human-robot interaction. Our approach models object-level risk and context through a semantic graph-based propagation algorithm. Each object is represented as a node with an associated risk score, and risk propagates asymmetrically from high-risk to low-risk objects based on spatial proximity and accident relationship. This enables the robot to infer potential hazards even when they are not explicitly visible or labeled. Designed for interpretability and lightweight onboard deployment, our method is validated on a dataset with human-annotated risk regions, achieving a binary risk detection accuracy of 75%. The system demonstrates strong alignment with human perception, particularly in scenes involving sharp or unstable objects. These results underline the potential of context-aware risk reasoning to enhance robotic scene understanding and proactive safety behaviors in shared human-robot spaces. This framework could serve as a foundation for future systems that make context-driven safety decisions, provide real-time alerts, or autonomously assist users in avoiding or mitigating hazards within home environments.
zh
[CV-34] MAPo : Motion-Aware Partitioning of Deformable 3D Gaussian Splatting for High-Fidelity Dynamic Scene Reconstruction AAAI
【速读】:该论文旨在解决基于变形场的3D高斯溅射(3D Gaussian Splatting)方法在动态场景重建中因单一统一模型难以表征多样化运动模式而导致的渲染模糊和细粒度运动细节丢失问题。其解决方案的关键在于提出一种运动感知的动态分区策略(Motion-Aware Partitioning, MAPo):通过动态评分机制将3D高斯点划分为高动态与低动态两类,对高动态区域进行时间上的递归分割并为每个时隙独立复制变形网络,从而实现对复杂运动细节的精细化建模;同时,低动态区域保持静态以降低计算开销。此外,引入跨帧一致性损失以消除分区边界处的视觉不连续性,进一步提升渲染质量。
链接: https://arxiv.org/abs/2508.19786
作者: Han Jiao,Jiakai Sun,Yexing Xu,Lei Zhao,Wei Xing,Huaizhong Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 9 figures, Anonymous AAAI Submission
Abstract:3D Gaussian Splatting, known for enabling high-quality static scene reconstruction with fast rendering, is increasingly being applied to dynamic scene reconstruction. A common strategy involves learning a deformation field to model the temporal changes of a canonical set of 3D Gaussians. However, these deformation-based methods often produce blurred renderings and lose fine motion details in highly dynamic regions due to the inherent limitations of a single, unified model in representing diverse motion patterns. To address these challenges, we introduce Motion-Aware Partitioning of Deformable 3D Gaussian Splatting (MAPo), a novel framework for high-fidelity dynamic scene reconstruction. Its core is a dynamic score-based partitioning strategy that distinguishes between high- and low-dynamic 3D Gaussians. For high-dynamic 3D Gaussians, we recursively partition them temporally and duplicate their deformation networks for each new temporal segment, enabling specialized modeling to capture intricate motion details. Concurrently, low-dynamic 3DGs are treated as static to reduce computational costs. However, this temporal partitioning strategy for high-dynamic 3DGs can introduce visual discontinuities across frames at the partition boundaries. To address this, we introduce a cross-frame consistency loss, which not only ensures visual continuity but also further enhances rendering quality. Extensive experiments demonstrate that MAPo achieves superior rendering quality compared to baselines while maintaining comparable computational costs, particularly in regions with complex or rapid motions.
zh
[CV-35] he Return of Structural Handwritten Mathematical Expression Recognition
【速读】:该论文旨在解决手写数学表达式识别(Handwritten Mathematical Expression Recognition, HMER)中现有基于大语言模型的编码器-解码器架构缺乏显式符号与笔迹对齐的问题,这一缺陷限制了错误分析、可解释性以及需要空间感知能力的交互式应用(如局部内容更新)。解决方案的关键在于提出一种结构化识别方法,包含两个核心创新:一是基于神经网络的自动标注系统,能够将LaTeX公式映射到原始笔迹,自动生成符号分割、分类和空间关系的标注;二是模块化的结构识别系统,独立优化分割、分类和关系预测三个子任务。通过利用该自动标注系统构建的结构化标注数据集,系统结合图结构笔迹排序、混合卷积-循环网络及基于Transformer的纠错机制,在CROHME-2023基准上实现优异性能,并生成完整图结构以直接关联笔迹与预测符号,从而支持透明化错误分析和可解释输出。
链接: https://arxiv.org/abs/2508.19773
作者: Jakob Seitz,Tobias Lengfeld,Radu Timofte
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Handwritten Mathematical Expression Recognition is foundational for educational technologies, enabling applications like digital note-taking and automated grading. While modern encoder-decoder architectures with large language models excel at LaTeX generation, they lack explicit symbol-to-trace alignment, a critical limitation for error analysis, interpretability, and spatially aware interactive applications requiring selective content updates. This paper introduces a structural recognition approach with two innovations: 1 an automatic annotation system that uses a neural network to map LaTeX equations to raw traces, automatically generating annotations for symbol segmentation, classification, and spatial relations, and 2 a modular structural recognition system that independently optimizes segmentation, classification, and relation prediction. By leveraging a dataset enriched with structural annotations from our auto-labeling system, the proposed recognition system combines graph-based trace sorting, a hybrid convolutional-recurrent network, and transformer-based correction to achieve competitive performance on the CROHME-2023 benchmark. Crucially, our structural recognition system generates a complete graph structure that directly links handwritten traces to predicted symbols, enabling transparent error analysis and interpretable outputs.
zh
[CV-36] AIM: Adaptive Intra-Network Modulation for Balanced Multimodal Learning
【速读】:该论文旨在解决多模态学习中因模态不平衡导致的性能下降问题,尤其是现有方法通过抑制主导模态来提升弱模态表现,从而损害整体性能的缺陷。其核心解决方案是提出自适应网络内调制(Adaptive Intra-Network Modulation, AIM),关键在于首次实现不牺牲任何模态(无论是主导还是弱模态)的前提下达成平衡学习:AIM通过识别网络内部不同参数和深度的优化状态差异,将主导模态中未充分优化的参数分离至辅助模块(Auxiliary Blocks),并引导模型在联合训练中依赖这些性能退化的模块,从而避免对弱模态的压制;同时,AIM根据各网络层的模态不平衡程度自适应调整调制强度,实现精细化调控。实验表明,该方法在多个基准上优于当前最优方法,并具有良好的跨骨干网络、融合策略与优化器的泛化能力。
链接: https://arxiv.org/abs/2508.19769
作者: Shu Shen,C. L. Philip Chen,Tong Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13pages,7 figures
Abstract:Multimodal learning has significantly enhanced machine learning performance but still faces numerous challenges and limitations. Imbalanced multimodal learning is one of the problems extensively studied in recent works and is typically mitigated by modulating the learning of each modality. However, we find that these methods typically hinder the dominant modality’s learning to promote weaker modalities, which affects overall multimodal performance. We analyze the cause of this issue and highlight a commonly overlooked problem: optimization bias within networks. To address this, we propose Adaptive Intra-Network Modulation (AIM) to improve balanced modality learning. AIM accounts for differences in optimization state across parameters and depths within the network during modulation, achieving balanced multimodal learning without hindering either dominant or weak modalities for the first time. Specifically, AIM decouples the dominant modality’s under-optimized parameters into Auxiliary Blocks and encourages reliance on these performance-degraded blocks for joint training with weaker modalities. This approach effectively prevents suppression of weaker modalities while enabling targeted optimization of under-optimized parameters to improve the dominant modality. Additionally, AIM assesses modality imbalance level across network depths and adaptively adjusts modulation strength at each depth. Experimental results demonstrate that AIM outperforms state-of-the-art imbalanced modality learning methods across multiple benchmarks and exhibits strong generalizability across different backbones, fusion strategies, and optimizers.
zh
[CV-37] BuzzSet v1.0: A Dataset for Pollinator Detection in Field Conditions
【速读】:该论文旨在解决农业场景下传粉昆虫(如蜜蜂和熊蜂)自动化监测的难题,以应对因人为和环境压力导致的传粉昆虫种群下降问题。其解决方案的关键在于构建了一个大规模、高分辨率的真实农田环境下采集的图像数据集——BuzzSet,包含7856张人工验证标注的图像及超过8000个实例,覆盖蜜蜂数、熊蜂类和未识别昆虫三类;通过YOLOv12模型预标注并结合开源工具进行人工校正,同时将图像切分为256×256像素的小块以增强对小型昆虫的检测能力,并提供基于RF-DETR Transformer架构的强基线模型,实现了蜜蜂和熊蜂类别F1分数分别为0.94和0.92的高精度检测性能,为小目标检测、类别分离与标签噪声下的鲁棒性评估提供了有价值的基准。
链接: https://arxiv.org/abs/2508.19762
作者: Ahmed Emam,Mohamed Elbassiouny,Julius Miller,Patrick Donworth,Sabine Seidel,Ribana Roscher
机构: University of Bonn(波恩大学); Jülich Research Centre (FZJ)(于利希研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pollinator insects such as honeybees and bumblebees are vital to global food production and ecosystem stability, yet their populations are declining due to increasing anthropogenic and environmental stressors. To support scalable, automated pollinator monitoring, we introduce BuzzSet, a new large-scale dataset of high-resolution pollinator images collected in real agricultural field conditions. BuzzSet contains 7856 manually verified and labeled images, with over 8000 annotated instances across three classes: honeybees, bumblebees, and unidentified insects. Initial annotations were generated using a YOLOv12 model trained on external data and refined via human verification using open-source labeling tools. All images were preprocessed into 256~ \times ~256 tiles to improve the detection of small insects. We provide strong baselines using the RF-DETR transformer-based object detector. The model achieves high F1-scores of 0.94 and 0.92 for honeybee and bumblebee classes, respectively, with confusion matrix results showing minimal misclassification between these categories. The unidentified class remains more challenging due to label ambiguity and lower sample frequency, yet still contributes useful insights for robustness evaluation. Overall detection quality is strong, with a best mAP@0.50 of 0.559. BuzzSet offers a valuable benchmark for small object detection, class separation under label noise, and ecological computer vision.
zh
[CV-38] FastAvatar: Towards Unified Fast High-Fidelity 3D Avatar Reconstruction with Large Gaussian Reconstruction Transformers
【速读】:该论文旨在解决3D avatar重建中存在的高时间复杂度、对数据质量敏感以及数据利用率低等问题。其核心解决方案是提出FastAvatar框架,该框架基于一个大型高斯重建Transformer(Large Gaussian Reconstruction Transformer),包含三项关键技术:一是采用类VGGT的Transformer架构,在聚合多帧信息的同时注入初始3D提示以预测可聚合的规范空间3D高斯溅射(3D Gaussian Splatting, 3DGS)表示;二是引入多粒度引导编码(包括相机位姿、FLAME表情参数和头部姿态),缓解因动画导致的配准偏差,适配不同长度输入;三是通过关键点追踪与分片融合损失实现增量式高斯聚合,支持在新增观测下持续提升重建质量。该方案实现了高质量、高速度且可调优的3D avatar建模范式。
链接: https://arxiv.org/abs/2508.19754
作者: Yue Wu,Yufan Wu,Wen Li,Yuxi Lu,Kairui Feng,Xuanhong Chen
机构: Tongji University (同济大学); Shanghai Innovation Institute (上海创新研究院); Shanghai Jiao Tong University (上海交通大学); AKool
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite significant progress in 3D avatar reconstruction, it still faces challenges such as high time complexity, sensitivity to data quality, and low data utilization. We propose FastAvatar, a feedforward 3D avatar framework capable of flexibly leveraging diverse daily recordings (e.g., a single image, multi-view observations, or monocular video) to reconstruct a high-quality 3D Gaussian Splatting (3DGS) model within seconds, using only a single unified model. FastAvatar’s core is a Large Gaussian Reconstruction Transformer featuring three key designs: First, a variant VGGT-style transformer architecture aggregating multi-frame cues while injecting initial 3D prompt to predict an aggregatable canonical 3DGS representation; Second, multi-granular guidance encoding (camera pose, FLAME expression, head pose) mitigating animation-induced misalignment for variable-length inputs; Third, incremental Gaussian aggregation via landmark tracking and sliced fusion losses. Integrating these features, FastAvatar enables incremental reconstruction, i.e., improving quality with more observations, unlike prior work wasting input data. This yields a quality-speed-tunable paradigm for highly usable avatar modeling. Extensive experiments show that FastAvatar has higher quality and highly competitive speed compared to existing methods.
zh
[CV-39] SPLF-SAM: Self-Prompting Segment Anything Model for Light Field Salient Object Detection
【速读】:该论文旨在解决光场显著目标检测(Light Field Salient Object Detection, LF SOD)中两个关键问题:一是现有模型普遍忽视提示信息(prompt information)的提取,二是传统方法缺乏对频域信息的有效分析,导致小目标易被噪声淹没。解决方案的核心在于提出自提示光场分割一切模型(Self-Prompting Light Field Segment Anything Model, SPLF-SAM),其关键组件包括统一多尺度特征嵌入模块(Unified Multi-scale Feature Embedding Block, UMFEB)和多尺度自适应滤波适配器(Multi-scale Adaptive Filtering Adapter, MAFA)。UMFEB通过多尺度特征融合增强对不同尺寸目标的识别能力,MAFA则通过学习频域特征有效抑制噪声干扰,从而提升小目标的检测鲁棒性。
链接: https://arxiv.org/abs/2508.19746
作者: Qiyao Xu,Qiming Wu,Xiaowei Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Segment Anything Model (SAM) has demonstrated remarkable capabilities in solving light field salient object detection (LF SOD). However, most existing models tend to neglect the extraction of prompt information under this task. Meanwhile, traditional models ignore the analysis of frequency-domain information, which leads to small objects being overwhelmed by noise. In this paper, we put forward a novel model called self-prompting light field segment anything model (SPLF-SAM), equipped with unified multi-scale feature embedding block (UMFEB) and a multi-scale adaptive filtering adapter (MAFA). UMFEB is capable of identifying multiple objects of varying sizes, while MAFA, by learning frequency features, effectively prevents small objects from being overwhelmed by noise. Extensive experiments have demonstrated the superiority of our method over ten state-of-the-art (SOTA) LF SOD methods. Our code will be available at this https URL.
zh
[CV-40] POEv2: a flexible and robust framework for generic line segment detection and wireframe line segment detection
【速读】:该论文旨在解决现有线段检测方法在通用线段检测(generic line segment detection)与线框线段检测(wireframe line segment detection)任务之间性能不兼容的问题。传统方法多为通用型线段检测器,而近年来基于深度学习的方法则主要聚焦于仅提取几何有意义且空间支撑较强的线框线段,二者设计目标不同导致彼此在对方任务中表现不佳。解决方案的关键在于提出一种改进的像素方向估计(Pixel Orientation Estimation, POE)框架——POEv2,该方法从边缘强度图(edge strength map)中检测线段,可与任意边缘检测器结合使用,从而在保持鲁棒性的同时兼顾两类任务需求,并通过实验验证其在三个公开数据集上均达到当前最优性能。
链接: https://arxiv.org/abs/2508.19742
作者: Chenguang Liu,Chisheng Wang,Yuhua Cai,Chuanhua Zhu,Qingquan Li
机构: Shenzhen University (深圳大学); Ministry of Natural Resources (MNR) Key Laboratory for Geo-Environmental Monitoring of Great Bay Area (自然资源部粤港澳大湾区地质环境监测重点实验室); Guangdong Key Laboratory of Urban Informatics (广东省城市信息学重点实验室); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济实验室(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Line segment detection in images has been studied for several decades. Existing line segment detectors can be roughly divided into two categories: generic line segment detectors and wireframe line segment detectors. Generic line segment detectors aim to detect all meaningful line segments in images and traditional approaches usually fall into this category. Recent deep learning based approaches are mostly wireframe line segment detectors. They detect only line segments that are geometrically meaningful and have large spatial support. Due to the difference in the aim of design, the performance of generic line segment detectors for the task of wireframe line segment detection won’t be satisfactory, and vice versa. In this work, we propose a robust framework that can be used for both generic line segment detection and wireframe line segment detection. The proposed method is an improved version of the Pixel Orientation Estimation (POE) method. It is thus named as POEv2. POEv2 detects line segments from edge strength maps, and can be combined with any edge detector. We show in our experiments that by combining the proposed POEv2 with an efficient edge detector, it achieves state-of-the-art performance on three publicly available datasets.
zh
[CV-41] Improving Generalization in Deepfake Detection with Face Foundation Models and Metric Learning
【速读】:该论文旨在解决视频深度伪造(deepfake)检测模型在真实场景中泛化能力不足的问题,尤其是在面对训练分布之外的未知伪造类型或数据时性能显著下降。其解决方案的关键在于利用人脸基础模型(face foundation model, FSM)所学习到的丰富面部表征,并在此基础上进行微调与优化:首先基于自监督训练的FSFM模型构建检测框架,随后在多个深度伪造数据集(涵盖人脸交换和人脸重演两类操作)上进行集成微调;同时引入三元组损失(triplet loss)变体以增强真实样本与伪造样本之间的嵌入可分性,并探索基于伪造来源或操作类型的归因监督机制,从而提升模型对复杂现实场景的适应能力。
链接: https://arxiv.org/abs/2508.19730
作者: Stelios Mylonas,Symeon Papadopoulos
机构: Centre for Research and Technology HellasInformation Technologies Institute (希腊研究中心与技术赫拉克利翁信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The increasing realism and accessibility of deepfakes have raised critical concerns about media authenticity and information integrity. Despite recent advances, deepfake detection models often struggle to generalize beyond their training distributions, particularly when applied to media content found in the wild. In this work, we present a robust video deepfake detection framework with strong generalization that takes advantage of the rich facial representations learned by face foundation models. Our method is built on top of FSFM, a self-supervised model trained on real face data, and is further fine-tuned using an ensemble of deepfake datasets spanning both face-swapping and face-reenactment manipulations. To enhance discriminative power, we incorporate triplet loss variants during training, guiding the model to produce more separable embeddings between real and fake samples. Additionally, we explore attribution-based supervision schemes, where deepfakes are categorized by manipulation type or source dataset, to assess their impact on generalization. Extensive experiments across diverse evaluation benchmarks demonstrate the effectiveness of our approach, especially in challenging real-world scenarios.
zh
[CV-42] Addressing Deepfake Issue in Selfie banking through camera based authentication
【速读】:该论文旨在解决自拍式在线银行(selfie banking)中日益严重的虚假图像(fake images)威胁问题,特别是利用深度学习技术生成的高度逼真伪造身份绕过生物特征识别系统(如人脸识别)的欺诈行为。解决方案的关键在于将一个已建立的取证识别系统(forensic recognition system)——此前用于图像相机定位(camera localization)——引入到深度伪造检测(deepfake detection)任务中,通过分析图像中的细微物理或数字痕迹来区分真实与伪造图像。
链接: https://arxiv.org/abs/2508.19714
作者: Subhrojyoti Mukherjee,Manoranjan Mohanty
机构: Indian Institute Of Technology (印度理工学院); Carnegie Mellon University (卡内基梅隆大学)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Fake images in selfie banking are increasingly becoming a threat. Previously, it was just Photoshop, but now deep learning technologies enable us to create highly realistic fake identities, which fraudsters exploit to bypass biometric systems such as facial recognition in online banking. This paper explores the use of an already established forensic recognition system, previously used for picture camera localization, in deepfake detection.
zh
[CV-43] FreeVPS: Repurposing Training-Free SAM2 for Generalizable Video Polyp Segmentation
【速读】:该论文旨在解决视频息肉分割(Video Polyp Segmentation, VPS)中难以平衡时空建模与领域泛化能力的问题,从而提升模型在真实临床场景中的适用性。其解决方案的关键在于将VPS任务重构为“检测后追踪”(track-by-detect)范式,利用图像息肉分割(Image Polyp Segmentation, IPS)模型的空间上下文信息,并引入Segment Anything Model 2(SAM2)的时序建模能力;同时,通过两个无需额外训练的模块——内部关联过滤模块(intra-association filtering)和外部关联精修模块(inter-association refinement),分别消除检测阶段的空间误判并动态更新记忆库以抑制误差传播,从而显著提升分割稳定性与跨域性能。
链接: https://arxiv.org/abs/2508.19705
作者: Qiang Hu,Ying Zhou,Gepeng Ji,Nick Barnes,Qiang Li,Zhiwei Wang
机构: Huazhong University of Science and Technology (华中科技大学); Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing video polyp segmentation (VPS) paradigms usually struggle to balance between spatiotemporal modeling and domain generalization, limiting their applicability in real clinical scenarios. To embrace this challenge, we recast the VPS task as a track-by-detect paradigm that leverages the spatial contexts captured by the image polyp segmentation (IPS) model while integrating the temporal modeling capabilities of segment anything model 2 (SAM2). However, during long-term polyp tracking in colonoscopy videos, SAM2 suffers from error accumulation, resulting in a snowball effect that compromises segmentation stability. We mitigate this issue by repurposing SAM2 as a video polyp segmenter with two training-free modules. In particular, the intra-association filtering module eliminates spatial inaccuracies originating from the detecting stage, reducing false positives. The inter-association refinement module adaptively updates the memory bank to prevent error propagation over time, enhancing temporal coherence. Both modules work synergistically to stabilize SAM2, achieving cutting-edge performance in both in-domain and out-of-domain scenarios. Furthermore, we demonstrate the robust tracking capabilities of FreeVPS in long-untrimmed colonoscopy videos, underscoring its potential reliable clinical analysis.
zh
[CV-44] LabelGS: Label-Aware 3D Gaussian Splatting for 3D Scene Segmentation
【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 缺乏3D分割能力的问题,从而限制其在需要场景理解的任务中的应用。为实现对象级语义分割与高保真重建的协同优化,提出Label-aware 3D Gaussian Splatting (LabelGS),其关键创新在于引入跨视角一致的语义掩码(cross-view consistent semantic masks)以增强高斯表示的语义感知能力,并设计了新型遮挡分析模型(Occlusion Analysis Model)防止优化过程中遮挡过拟合;同时,通过主高斯标注模型(Main Gaussian Labeling model)将2D语义先验提升至3D空间,并利用高斯投影滤波器(Gaussian Projection Filter)避免标签冲突。此外,采用随机区域采样策略有效解耦高斯表示并提升优化效率,最终在保持高质量分割的同时实现训练速度提升22倍(1440×1080分辨率下)。
链接: https://arxiv.org/abs/2508.19699
作者: Yupeng Zhang,Dezhi Zheng,Ping Lu,Han Zhang,Lei Wang,Liping xiang,Cheng Luo,Kaijun Deng,Xiaowen Fu,Linlin Shen,Jinbao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: PRCV 2025
Abstract:3D Gaussian Splatting (3DGS) has emerged as a novel explicit representation for 3D scenes, offering both high-fidelity reconstruction and efficient rendering. However, 3DGS lacks 3D segmentation ability, which limits its applicability in tasks that require scene understanding. The identification and isolating of specific object components is crucial. To address this limitation, we propose Label-aware 3D Gaussian Splatting (LabelGS), a method that augments the Gaussian representation with object this http URL introduces cross-view consistent semantic masks for 3D Gaussians and employs a novel Occlusion Analysis Model to avoid overfitting occlusion during optimization, Main Gaussian Labeling model to lift 2D semantic prior to 3D Gaussian and Gaussian Projection Filter to avoid Gaussian label conflict. Our approach achieves effective decoupling of Gaussian representations and refines the 3DGS optimization process through a random region sampling strategy, significantly improving efficiency. Extensive experiments demonstrate that LabelGS outperforms previous state-of-the-art methods, including Feature-3DGS, in the 3D scene segmentation task. Notably, LabelGS achieves a remarkable 22X speedup in training compared to Feature-3DGS, at a resolution of 1440X1080. Our code will be at this https URL.
zh
[CV-45] Synthetic Image Detection via Spectral Gaps of QC-RBIM Nishimori Bethe-Hessian Operators
【速读】:该论文旨在解决生成式 AI(Generative AI)所生成图像在媒体取证和生物特征安全领域带来的挑战,即现有监督式检测方法对未见过的生成器或对抗后处理失效,而依赖低级统计线索的无监督方法则仍易受干扰。其解决方案的关键在于提出一种基于物理启发的、模型无关的检测框架:将合成图像识别建模为稀疏加权图上的社区检测问题,利用预训练卷积神经网络(CNN)提取的32维特征向量构建多边类型QC-LDPC图,通过Nishimori温度校准的成对相似性转化为随机键伊辛模型(RBIM),并借助Bethe-Hessian谱的特征间隙来判断真实图像与合成图像——真实图像呈现多个分离的谱隙,而合成图像因破坏Nishimori对称性导致谱塌缩,从而实现无需标注合成数据或重新训练特征提取器的高精度检测(>94%)。
链接: https://arxiv.org/abs/2508.19698
作者: V. S. Usatyuk,D. A. Sapozhnikov,S. I. Egorov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Spectral Theory (math.SP)
备注: 14 pages, 10 figures
Abstract:The rapid advance of deep generative models such as GANs and diffusion networks now produces images that are virtually indistinguishable from genuine photographs, undermining media forensics and biometric security. Supervised detectors quickly lose effectiveness on unseen generators or after adversarial post-processing, while existing unsupervised methods that rely on low-level statistical cues remain fragile. We introduce a physics-inspired, model-agnostic detector that treats synthetic-image identification as a community-detection problem on a sparse weighted graph. Image features are first extracted with pretrained CNNs and reduced to 32 dimensions, each feature vector becomes a node of a Multi-Edge Type QC-LDPC graph. Pairwise similarities are transformed into edge couplings calibrated at the Nishimori temperature, producing a Random Bond Ising Model (RBIM) whose Bethe-Hessian spectrum exhibits a characteristic gap when genuine community structure (real images) is present. Synthetic images violate the Nishimori symmetry and therefore lack such gaps. We validate the approach on binary tasks cat versus dog and male versus female using real photos from Flickr-Faces-HQ and CelebA and synthetic counterparts generated by GANs and diffusion models. Without any labeled synthetic data or retraining of the feature extractor, the detector achieves over 94% accuracy. Spectral analysis shows multiple well separated gaps for real image sets and a collapsed spectrum for generated ones. Our contributions are threefold: a novel LDPC graph construction that embeds deep image features, an analytical link between Nishimori temperature RBIM and the Bethe-Hessian spectrum providing a Bayes optimal detection criterion; and a practical, unsupervised synthetic image detector robust to new generative architectures. Future work will extend the framework to video streams and multi-class anomaly detection.
zh
[CV-46] SAT: Supervisor Regularization and Animation Augmentation for Two-process Monocular Texture 3D Human Reconstruction
【速读】:该论文旨在解决单目纹理三维人体重建中因2D图像固有的几何模糊性以及3D人体训练数据稀缺所导致的重建质量受限问题。现有方法虽利用先验几何估计网络(如SMPL模型和法向量图)来生成人体几何结构,但难以有效融合多模态几何信息,常出现视角不一致(如面部畸变)等问题。其解决方案的关键在于提出一种两阶段框架SAT(Single-view Avatar Reconstruction with Supervised Feature Regularization and Online Animation Augmentation),通过引入监督特征正则化模块(Supervisor Feature Regularization)以统一学习多种几何先验,并借助结构相同的多视角网络提供中间特征作为训练监督信号,从而实现更优的几何融合;同时,设计在线动画增强模块(Online Animation Augmentation),构建单次前向传播的动画网络,在线扩充原始3D人体数据以缓解数据稀缺问题并提升重建质量。
链接: https://arxiv.org/abs/2508.19688
作者: Gangjian Zhang,Jian Shu,Nanjie Yao,Hao Wang
机构: HKUST(GZ)(香港科技大学(广州)); HKUST(GZ)(香港科技大学(广州)); HKUST(GZ)(香港科技大学(广州)); HKUST(GZ)(香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 8 figures
Abstract:Monocular texture 3D human reconstruction aims to create a complete 3D digital avatar from just a single front-view human RGB image. However, the geometric ambiguity inherent in a single 2D image and the scarcity of 3D human training data are the main obstacles limiting progress in this field. To address these issues, current methods employ prior geometric estimation networks to derive various human geometric forms, such as the SMPL model and normal maps. However, they struggle to integrate these modalities effectively, leading to view inconsistencies, such as facial distortions. To this end, we propose a two-process 3D human reconstruction framework, SAT, which seamlessly learns various prior geometries in a unified manner and reconstructs high-quality textured 3D avatars as the final output. To further facilitate geometry learning, we introduce a Supervisor Feature Regularization module. By employing a multi-view network with the same structure to provide intermediate features as training supervision, these varied geometric priors can be better fused. To tackle data scarcity and further improve reconstruction quality, we also propose an Online Animation Augmentation module. By building a one-feed-forward animation network, we augment a massive number of samples from the original 3D human data online for model training. Extensive experiments on two benchmarks show the superiority of our approach compared to state-of-the-art methods.
zh
[CV-47] A Frequency-Aware Self-Supervised Learning for Ultra-Wide-Field Image Enhancement
【速读】:该论文旨在解决超广角(Ultra-Wide-Field, UWF)眼底成像中因模糊和光照不均导致的图像质量下降问题,这些问题会掩盖病理性细节,影响临床诊断准确性。现有增强方法多针对普通眼底相机设计,难以满足UWF图像对病理细节保留的特殊需求。解决方案的关键在于提出一种频域感知的自监督学习框架,包含两个核心模块:一是频域解耦的图像去模糊模块,通过不对称通道融合机制整合高频局部细节与低频全局结构信息,实现细粒度结构的保真恢复;二是基于Retinex的光照补偿模块,引入色彩保持单元以提供多尺度空间与频率信息,从而实现精准的光照估计与校正。该方法首次系统性地解决了UWF图像增强难题,显著提升了图像可视化质量及疾病诊断性能。
链接: https://arxiv.org/abs/2508.19664
作者: Weicheng Liao,Zan Chen,Jianyang Xie,Yalin Zheng,Yuhui Ma,Yitian Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Ultra-Wide-Field (UWF) retinal imaging has revolutionized retinal diagnostics by providing a comprehensive view of the retina. However, it often suffers from quality-degrading factors such as blurring and uneven illumination, which obscure fine details and mask pathological information. While numerous retinal image enhancement methods have been proposed for other fundus imageries, they often fail to address the unique requirements in UWF, particularly the need to preserve pathological details. In this paper, we propose a novel frequency-aware self-supervised learning method for UWF image enhancement. It incorporates frequency-decoupled image deblurring and Retinex-guided illumination compensation modules. An asymmetric channel integration operation is introduced in the former module, so as to combine global and local views by leveraging high- and low-frequency information, ensuring the preservation of fine and broader structural details. In addition, a color preservation unit is proposed in the latter Retinex-based module, to provide multi-scale spatial and frequency information, enabling accurate illumination estimation and correction. Experimental results demonstrate that the proposed work not only enhances visualization quality but also improves disease diagnosis performance by restoring and correcting fine local details and uneven intensity. To the best of our knowledge, this work is the first attempt for UWF image enhancement, offering a robust and clinically valuable tool for improving retinal disease management.
zh
[CV-48] Hardware-aware vs. Hardware-agnostic Energy Estimation for SNN in Space Applications
【速读】:该论文旨在解决当前对脉冲神经网络(Spiking Neural Networks, SNNs)能量效率的争议问题,尤其是在数字实现中其是否真的优于传统人工神经网络(Artificial Neural Networks, ANNs)。针对多输出回归任务——即从单目图像中估计卫星三维位置——作者提出了一种基于漏电积分发放(Leaky Integrate-and-Fire, LIF)神经元膜电位训练的SNN模型,并通过硬件感知与硬件无关两种能量估算方法进行对比分析。解决方案的关键在于:首先,利用LIF神经元最终层的膜电位作为监督信号进行训练,使SNN在性能上接近参考卷积神经网络(Convolutional Neural Network, CNN);其次,揭示了只有在类脑硬件(neuromorphic hardware)且输入数据具有高稀疏性时,SNN才能实现显著的能量节省,从而强调了能量评估必须考虑硬件架构和数据特征,推动更透明、可复现的能效比较方法的发展。
链接: https://arxiv.org/abs/2508.19654
作者: Matthias Höfflin,Jürgen Wassner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for the IAA-SPAICE 2025 conference
Abstract:Spiking Neural Networks (SNNs), inspired by biological intelligence, have long been considered inherently energy-efficient, making them attractive for resource-constrained domains such as space applications. However, recent comparative studies with conventional Artificial Neural Networks (ANNs) have begun to question this reputation, especially for digital implementations. This work investigates SNNs for multi-output regression, specifically 3-D satellite position estimation from monocular images, and compares hardware-aware and hardware-agnostic energy estimation methods. The proposed SNN, trained using the membrane potential of the Leaky Integrate-and-Fire (LIF) neuron in the final layer, achieves comparable Mean Squared Error (MSE) to a reference Convolutional Neural Network (CNN) on a photorealistic satellite dataset. Energy analysis shows that while hardware-agnostic methods predict a consistent 50-60% energy advantage for SNNs over CNNs, hardware-aware analysis reveals that significant energy savings are realized only on neuromorphic hardware and with high input sparsity. The influence of dark pixel ratio on energy consumption is quantified, emphasizing the impact of data characteristics and hardware assumptions. These findings highlight the need for transparent evaluation methods and explicit disclosure of underlying assumptions to ensure fair comparisons of neural network energy efficiency.
zh
[CV-49] Self-Rewarding Vision-Language Model via Reasoning Decomposition
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中存在的视觉幻觉(visual hallucinations)和语言捷径(language shortcuts)问题。这些问题源于现有后训练方法仅依赖可验证的答案匹配并监督最终输出,导致中间视觉推理阶段缺乏显式引导,从而使模型倾向于依赖文本先验而非真实视觉感知。解决方案的关键在于提出一种自奖励机制(self-rewarding method),即Vision-SR1,其通过将VLM的推理过程分解为视觉感知与语言推理两个阶段:首先生成独立于输入图像的自包含视觉描述,随后利用同一模型基于该描述进行语言推理以计算奖励信号。此自奖励与最终输出监督相结合,形成平衡的训练信号,从而增强视觉感知能力并减少对语言捷径的依赖。
链接: https://arxiv.org/abs/2508.19652
作者: Zongxia Li,Wenhao Yu,Chengsong Huang,Rui Liu,Zhenwen Liang,Fuxiao Liu,Jingxi Che,Dian Yu,Jordan Boyd-Graber,Haitao Mi,Dong Yu
机构: Tencent AI Lab(腾讯AI实验室); University of Maryland (马里兰大学); Washington University in St. Louis (圣路易斯华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, two figures
Abstract:Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and language shortcuts, where they skip the visual part and just rely on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or distilled labels from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that improves visual reasoning without relying on external visual supervisions via reinforcement learning. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce self-contained visual perceptions that are sufficient to answer the question without referring back the input image. To validate this self-containment, the same VLM model is then re-prompted to perform language reasoning using only the generated perception as input to compute reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.
zh
[CV-50] Scalable Object Detection in the Car Interior With Vision Foundation Models
【速读】:该论文旨在解决车载环境中因计算资源受限而难以部署高性能视觉感知模型的问题,特别是在车内场景中对引入外部物体的检测与定位任务。其核心解决方案是提出一种名为Object Detection and Localization (ODAL) 的分布式框架,通过将视觉基础模型(vision foundation models)的计算任务拆分至车载端与云端协同执行,从而在有限算力条件下实现高效、准确的室内场景理解。关键创新在于利用轻量化模型(如LLaVA 1.5 7B)并结合微调策略,在显著降低资源消耗的同时大幅提升检测精度与鲁棒性,最终实现性能超越主流模型GPT-4o的成果。
链接: https://arxiv.org/abs/2508.19651
作者: Bálint Mészáros,Ahmet Firintepe,Sebastian Schmidt,Stephan Günnemann
机构: Technical University of Munich (慕尼黑工业大学); BMW Group (宝马集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:AI tasks in the car interior like identifying and localizing externally introduced objects is crucial for response quality of personal assistants. However, computational resources of on-board systems remain highly constrained, restricting the deployment of such solutions directly within the vehicle. To address this limitation, we propose the novel Object Detection and Localization (ODAL) framework for interior scene understanding. Our approach leverages vision foundation models through a distributed architecture, splitting computational tasks between on-board and cloud. This design overcomes the resource constraints of running foundation models directly in the car. To benchmark model performance, we introduce ODALbench, a new metric for comprehensive assessment of detection and this http URL analysis demonstrates the framework’s potential to establish new standards in this domain. We compare the state-of-the-art GPT-4o vision foundation model with the lightweight LLaVA 1.5 7B model and explore how fine-tuning enhances the lightweight models performance. Remarkably, our fine-tuned ODAL-LLaVA model achieves an ODAL _score of 89%, representing a 71% improvement over its baseline performance and outperforming GPT-4o by nearly 20%. Furthermore, the fine-tuned model maintains high detection accuracy while significantly reducing hallucinations, achieving an ODAL _SNR three times higher than GPT-4o.
zh
[CV-51] Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models
【速读】:该论文旨在解决当前大型视频语言模型(Large Video Language Models, LVLMs)评估中忽视位置偏差(positional bias)的问题,即模型在处理视频序列时对不同位置内容的偏好或倾向性未被系统性检测和量化。其解决方案的关键在于提出Video-LevelGauge这一专用基准,通过标准化探测器(standardized probes)与定制化上下文设置,灵活控制上下文长度、探测位置及上下文类型以模拟多样真实场景,并结合统计分析与形态学模式识别方法,实现对位置偏差的全面刻画。该基准包含438个人工标注视频、1,177道多选题和120道开放题,经验证能有效暴露模型的位置偏差行为,从而为模型优化提供可操作的洞见。
链接: https://arxiv.org/abs/2508.19650
作者: Hou Xia,Zheren Fu,Fangcan Ling,Jiajun Li,Yi Tu,Zhendong Mao,Yongdong Zhang
机构: University of Science and Technology of China (中国科学技术大学); HUAWEI (华为)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large video language models (LVLMs) have made notable progress in video understanding, spurring the development of corresponding evaluation benchmarks. However, existing benchmarks generally assess overall performance across entire video sequences, overlooking nuanced behaviors such as contextual positional bias, a critical yet under-explored aspect of LVLM performance. We present Video-LevelGauge, a dedicated benchmark designed to systematically assess positional bias in LVLMs. We employ standardized probes and customized contextual setups, allowing flexible control over context length, probe position, and contextual types to simulate diverse real-world scenarios. In addition, we introduce a comprehensive analysis method that combines statistical measures with morphological pattern recognition to characterize bias. Our benchmark comprises 438 manually curated videos spanning multiple types, yielding 1,177 high-quality multiple-choice questions and 120 open-ended questions, validated for their effectiveness in exposing positional bias. Based on these, we evaluate 27 state-of-the-art LVLMs, including both commercial and open-source models. Our findings reveal significant positional biases in many leading open-source models, typically exhibiting head or neighbor-content preferences. In contrast, commercial models such as Gemini2.5-Pro show impressive, consistent performance across entire video sequences. Further analyses on context length, context variation, and model scale provide actionable insights for mitigating bias and guiding model enhancement.
zh
[CV-52] IDF: Iterative Dynamic Filtering Networks for Generalizable Image Denoising ICCV2025
【速读】:该论文旨在解决深度学习图像去噪方法在面对未见过的噪声类型和强度时泛化能力差的问题,尤其是现有方法因依赖特定噪声分布而易过拟合。解决方案的关键在于提出一种基于动态生成卷积核的迭代去噪框架:通过特征提取模块获取噪声不变特征,结合全局统计与局部相关性模块捕捉噪声特性与结构关联,并由核预测模块生成像素级可变的自适应卷积核,实现对局部结构敏感的迭代滤波。该方法仅需单层高斯噪声训练,却能在多种噪声场景下保持优异性能,显著提升模型鲁棒性和实用性。
链接: https://arxiv.org/abs/2508.19649
作者: Dongjin Kim,Jaekyun Ko,Muhammad Kashif Ali,Tae Hyun Kim
机构: Hanyang University (汉阳大学); Southwest Jiaotong University (西南交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025. Project Page: this https URL
Abstract:Image denoising is a fundamental challenge in computer vision, with applications in photography and medical imaging. While deep learning-based methods have shown remarkable success, their reliance on specific noise distributions limits generalization to unseen noise types and levels. Existing approaches attempt to address this with extensive training data and high computational resources but they still suffer from overfitting. To address these issues, we conduct image denoising by utilizing dynamically generated kernels via efficient operations. This approach helps prevent overfitting and improves resilience to unseen noise. Specifically, our method leverages a Feature Extraction Module for robust noise-invariant features, Global Statistics and Local Correlation Modules to capture comprehensive noise characteristics and structural correlations. The Kernel Prediction Module then employs these cues to produce pixel-wise varying kernels adapted to local structures, which are then applied iteratively for denoising. This ensures both efficiency and superior restoration quality. Despite being trained on single-level Gaussian noise, our compact model (~ 0.04 M) excels across diverse noise types and levels, demonstrating the promise of iterative dynamic filtering for practical image denoising.
zh
[CV-53] UTAL-GNN: Unsupervised Temporal Action Localization using Graph Neural Networks ICIP
【速读】:该论文旨在解决未剪辑体育视频中细粒度动作定位的问题,尤其针对短时内快速且细微的运动过渡难以准确识别的挑战。现有监督与弱监督方法通常依赖大量标注数据和高容量模型,导致计算开销大、适应现实场景能力弱。其解决方案的关键在于提出一种轻量级、无监督的基于骨架的动作定位流程,核心创新包括:1)通过块状划分的姿势序列去噪预训练任务,使Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) 学习内在运动动力学而无需人工标注;2)在推理阶段定义新的动作动力学度量(Action Dynamics Metric, ADM),直接从低维ASTGCN嵌入中计算,通过检测其曲率轮廓中的拐点来识别运动边界。该方法在DSV跳水数据集上达到82.66%的mAP和29.09ms平均定位延迟,性能媲美最优监督方法,同时具备无需重训练即可泛化至真实场景视频的能力,适用于嵌入式或动态环境下的轻量实时动作分析系统。
链接: https://arxiv.org/abs/2508.19647
作者: Bikash Kumar Badatya,Vipul Baghel,Ravi Hegde
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted at the ICIP Satellite Workshop 2025
Abstract:Fine-grained action localization in untrimmed sports videos presents a significant challenge due to rapid and subtle motion transitions over short durations. Existing supervised and weakly supervised solutions often rely on extensive annotated datasets and high-capacity models, making them computationally intensive and less adaptable to real-world scenarios. In this work, we introduce a lightweight and unsupervised skeleton-based action localization pipeline that leverages spatio-temporal graph neural representations. Our approach pre-trains an Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) on a pose-sequence denoising task with blockwise partitions, enabling it to learn intrinsic motion dynamics without any manual labeling. At inference, we define a novel Action Dynamics Metric (ADM), computed directly from low-dimensional ASTGCN embeddings, which detects motion boundaries by identifying inflection points in its curvature profile. Our method achieves a mean Average Precision (mAP) of 82.66% and average localization latency of 29.09 ms on the DSV Diving dataset, matching state-of-the-art supervised performance while maintaining computational efficiency. Furthermore, it generalizes robustly to unseen, in-the-wild diving footage without retraining, demonstrating its practical applicability for lightweight, real-time action analysis systems in embedded or dynamic environments.
zh
[CV-54] Beyond BEV: Optimizing Point-Level Tokens for Collaborative Perception
【速读】:该论文旨在解决协同感知(collaborative perception)中因使用二维鸟瞰图(BEV)表示中间特征而导致的细粒度三维结构信息丢失问题,从而影响目标识别与定位精度。其核心解决方案是提出CoPLOT框架,关键在于引入点级优化令牌(point-level optimized tokens, CoPLOT),通过三个模块实现:1)语义感知的令牌重排序模块,利用场景级和令牌级语义信息生成自适应的一维重排序序列;2)频域增强的状态空间模型,捕获跨空间与频谱域的长程依赖关系,提升前景与背景杂波的区分能力;3)邻近到自身对齐模块,结合全局代理级校正与局部令牌级精修,缓解定位噪声。该方法在保持低通信与计算开销的同时显著提升了感知性能。
链接: https://arxiv.org/abs/2508.19638
作者: Yang Li,Quan Yuan,Guiyang Luo,Xiaoyuan Fu,Rui Pan,Yujia Yang,Congzhang Shao,Yuewen Liu,Jinglin Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Collaborative perception allows agents to enhance their perceptual capabilities by exchanging intermediate features. Existing methods typically organize these intermediate features as 2D bird’s-eye-view (BEV) representations, which discard critical fine-grained 3D structural cues essential for accurate object recognition and localization. To this end, we first introduce point-level tokens as intermediate representations for collaborative perception. However, point-cloud data are inherently unordered, massive, and position-sensitive, making it challenging to produce compact and aligned point-level token sequences that preserve detailed structural information. Therefore, we present CoPLOT, a novel Collaborative perception framework that utilizes Point-Level Optimized Tokens. It incorporates a point-native processing pipeline, including token reordering, sequence modeling, and multi-agent spatial alignment. A semantic-aware token reordering module generates adaptive 1D reorderings by leveraging scene-level and token-level semantic information. A frequency-enhanced state space model captures long-range sequence dependencies across both spatial and spectral domains, improving the differentiation between foreground tokens and background clutter. Lastly, a neighbor-to-ego alignment module applies a closed-loop process, combining global agent-level correction with local token-level refinement to mitigate localization noise. Extensive experiments on both simulated and real-world datasets show that CoPLOT outperforms state-of-the-art models, with even lower communication and computation overhead. Code will be available at this https URL.
zh
[CV-55] Divide Weight and Route: Difficulty-Aware Optimization with Dynamic Expert Fusion for Long-tailed Recognition
【速读】:该论文旨在解决长尾视觉识别中因类别分布不均衡以及各类别间分类难度差异导致的性能瓶颈问题,尤其针对传统按频率重加权方法忽视内在难学类别的局限性。其解决方案的关键在于提出DQRoute框架,该框架融合了难度感知优化与动态专家协作机制:首先基于预测不确定性与历史表现估计类别难度,并据此自适应调整损失权重;其次采用专家混合(Mixture-of-Experts)结构,使每个专家专注不同类别分布区域,并在推理时通过专家特异的OOD检测器生成置信度分数实现输入自适应路由,无需中央路由器。所有模块端到端联合训练,显著提升了稀有和困难类别的识别性能。
链接: https://arxiv.org/abs/2508.19630
作者: Xiaolei Wei,Yi Ouyang,Haibo Ye
机构: 南京航空航天大学(NUAA)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper has been accepted to PRCV 2025
Abstract:Long-tailed visual recognition is challenging not only due to class imbalance but also because of varying classification difficulty across categories. Simply reweighting classes by frequency often overlooks those that are intrinsically hard to learn. To address this, we propose \textbfDQRoute, a modular framework that combines difficulty-aware optimization with dynamic expert collaboration. DQRoute first estimates class-wise difficulty based on prediction uncertainty and historical performance, and uses this signal to guide training with adaptive loss weighting. On the architectural side, DQRoute employs a mixture-of-experts design, where each expert specializes in a different region of the class distribution. At inference time, expert predictions are weighted by confidence scores derived from expert-specific OOD detectors, enabling input-adaptive routing without the need for a centralized router. All components are trained jointly in an end-to-end manner. Experiments on standard long-tailed benchmarks demonstrate that DQRoute significantly improves performance, particularly on rare and difficult classes, highlighting the benefit of integrating difficulty modeling with decentralized expert routing.
zh
[CV-56] Controllable Skin Synthesis via Lesion-Focused Vector Autoregression Model
【速读】:该论文旨在解决皮肤图像合成中训练数据不足、生成图像质量低以及缺乏对病灶位置和类型控制的问题。其解决方案的关键在于提出LF-VAR模型,该模型通过量化病灶测量评分(lesion measurement scores)和病灶类型标签作为条件嵌入,结合多尺度病灶聚焦的向量量化变分自编码器(VQVAE)与视觉自回归(VAR)Transformer架构,实现基于语言提示的可控皮肤图像合成,从而提升生成图像的临床相关性和保真度。
链接: https://arxiv.org/abs/2508.19626
作者: Jiajun Sun,Zhen Yu,Siyuan Yan,Jason J. Ong,Zongyuan Ge,Lei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures
Abstract:Skin images from real-world clinical practice are often limited, resulting in a shortage of training data for deep-learning models. While many studies have explored skin image synthesis, existing methods often generate low-quality images and lack control over the lesion’s location and type. To address these limitations, we present LF-VAR, a model leveraging quantified lesion measurement scores and lesion type labels to guide the clinically relevant and controllable synthesis of skin images. It enables controlled skin synthesis with specific lesion characteristics based on language prompts. We train a multiscale lesion-focused Vector Quantised Variational Auto-Encoder (VQVAE) to encode images into discrete latent representations for structured tokenization. Then, a Visual AutoRegressive (VAR) Transformer trained on tokenized representations facilitates image synthesis. Lesion measurement from the lesion region and types as conditional embeddings are integrated to enhance synthesis fidelity. Our method achieves the best overall FID score (average 0.74) among seven lesion types, improving upon the previous state-of-the-art (SOTA) by 6.3%. The study highlights our controllable skin synthesis model’s effectiveness in generating high-fidelity, clinically relevant synthetic skin images. Our framework code is available at this https URL.
zh
[CV-57] IELDG: Suppressing Domain-Specific Noise with Inverse Evolution Layers for Domain Generalized Semantic Segmentation
【速读】:该论文旨在解决领域泛化语义分割(Domain Generalized Semantic Segmentation, DGSS)中因使用扩散模型(Diffusion Models, DMs)生成合成数据时存在结构或语义缺陷,导致训练过程中性能下降和错误累积的问题。解决方案的关键在于引入逆向演化层(Inverse Evolution Layers, IELs),其利用基于拉普拉斯算子的先验知识识别图像中的空间不连续性和语义不一致性,从而有效过滤不良生成模式;在此基础上,提出IELDM增强扩散数据增强框架以生成更高质量图像,并进一步将IEL嵌入到DGSS模型解码器中构建IELFormer,通过抑制伪影传播强化跨域泛化能力;同时结合多尺度频域融合(Multi-scale Frequency Fusion, MFF)模块提升不同尺度特征间的语义一致性,实现更稳定的跨域分割性能。
链接: https://arxiv.org/abs/2508.19604
作者: Qizhe Fan,Chaoyu Liu,Zhonghua Qiao,Xiaoqin Shen
机构: Xi’an University of Technology (西安理工大学); University of Cambridge (剑桥大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Domain Generalized Semantic Segmentation (DGSS) focuses on training a model using labeled data from a source domain, with the goal of achieving robust generalization to unseen target domains during inference. A common approach to improve generalization is to augment the source domain with synthetic data generated by diffusion models (DMs). However, the generated images often contain structural or semantic defects due to training imperfections. Training segmentation models with such flawed data can lead to performance degradation and error accumulation. To address this issue, we propose to integrate inverse evolution layers (IELs) into the generative process. IELs are designed to highlight spatial discontinuities and semantic inconsistencies using Laplacian-based priors, enabling more effective filtering of undesirable generative patterns. Based on this mechanism, we introduce IELDM, an enhanced diffusion-based data augmentation framework that can produce higher-quality images. Furthermore, we observe that the defect-suppression capability of IELs can also benefit the segmentation network by suppressing artifact propagation. Based on this insight, we embed IELs into the decoder of the DGSS model and propose IELFormer to strengthen generalization capability in cross-domain scenarios. To further strengthen the model’s semantic consistency across scales, IELFormer incorporates a multi-scale frequency fusion (MFF) module, which performs frequency-domain analysis to achieve structured integration of multi-resolution features, thereby improving cross-scale coherence. Extensive experiments on benchmark datasets demonstrate that our approach achieves superior generalization performance compared to existing methods.
zh
[CV-58] Quantization Robustness to Input Degradations for Object Detection
【速读】:该论文旨在解决后训练量化(Post-training Quantization, PTQ)对目标检测模型在真实世界输入退化(如噪声、模糊和压缩伪影)下鲁棒性影响的问题。其核心挑战在于如何在保持模型推理效率的同时,提升量化后模型在复杂退化场景中的性能稳定性。解决方案的关键在于提出一种退化感知校准策略(degradation-aware calibration strategy),即在静态INT8量化过程中引入混合的清洁图像与合成退化图像进行TensorRT校准,以期增强模型对实际干扰的适应能力。实验表明,尽管该方法在部分大尺度模型和特定噪声条件下表现较好,但整体上并未显著优于传统仅使用清洁数据的校准方式,揭示了当前PTQ鲁棒性提升的局限性与模型规模的潜在调节作用。
链接: https://arxiv.org/abs/2508.19600
作者: Toghrul Karimov,Hassan Imani,Allan Kazakov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Post-training quantization (PTQ) is crucial for deploying efficient object detection models, like YOLO, on resource-constrained devices. However, the impact of reduced precision on model robustness to real-world input degradations such as noise, blur, and compression artifacts is a significant concern. This paper presents a comprehensive empirical study evaluating the robustness of YOLO models (nano to extra-large scales) across multiple precision formats: FP32, FP16 (TensorRT), Dynamic UINT8 (ONNX), and Static INT8 (TensorRT). We introduce and evaluate a degradation-aware calibration strategy for Static INT8 PTQ, where the TensorRT calibration process is exposed to a mix of clean and synthetically degraded images. Models were benchmarked on the COCO dataset under seven distinct degradation conditions (including various types and levels of noise, blur, low contrast, and JPEG compression) and a mixed-degradation scenario. Results indicate that while Static INT8 TensorRT engines offer substantial speedups (~1.5-3.3x) with a moderate accuracy drop (~3-7% mAP50-95) on clean data, the proposed degradation-aware calibration did not yield consistent, broad improvements in robustness over standard clean-data calibration across most models and degradations. A notable exception was observed for larger model scales under specific noise conditions, suggesting model capacity may influence the efficacy of this calibration approach. These findings highlight the challenges in enhancing PTQ robustness and provide insights for deploying quantized detectors in uncontrolled environments. All code and evaluation tables are available at this https URL.
zh
[CV-59] Generalizing Monocular 3D Object Detection
【速读】:该论文旨在解决单目3D目标检测(Mono3D)模型在多样化场景下的泛化能力问题,包括遮挡、不同数据集、物体尺寸差异及相机参数变化等挑战。其核心解决方案包括:提出可微分的非极大值抑制方法(GrooMeD-NMS)以提升遮挡鲁棒性;设计深度等变(depth equivariant)骨干网络(DEVIANT)增强跨数据集泛化能力;揭示大物体检测问题不仅源于数据不平衡或感受野限制,更与噪声敏感性相关,并引入基于鸟瞰图分割和Dice损失的SeaBird方法进行优化;最后通过数学分析推导出模型在未见相机高度下的外推规律,从而改善分布外(out-of-distribution)场景下的性能。
链接: https://arxiv.org/abs/2508.19593
作者: Abhinav Kumar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: PhD Thesis submitted to MSU
Abstract:Monocular 3D object detection (Mono3D) is a fundamental computer vision task that estimates an object’s class, 3D position, dimensions, and orientation from a single image. Its applications, including autonomous driving, augmented reality, and robotics, critically rely on accurate 3D environmental understanding. This thesis addresses the challenge of generalizing Mono3D models to diverse scenarios, including occlusions, datasets, object sizes, and camera parameters. To enhance occlusion robustness, we propose a mathematically differentiable NMS (GrooMeD-NMS). To improve generalization to new datasets, we explore depth equivariant (DEVIANT) backbones. We address the issue of large object detection, demonstrating that it’s not solely a data imbalance or receptive field problem but also a noise sensitivity issue. To mitigate this, we introduce a segmentation-based approach in bird’s-eye view with dice loss (SeaBird). Finally, we mathematically analyze the extrapolation of Mono3D models to unseen camera heights and improve Mono3D generalization in such out-of-distribution settings.
zh
[CV-60] Guiding Noisy Label Conditional Diffusion Models with Score-based Discriminator Correction
【速读】:该论文旨在解决扩散模型(Diffusion Models)在训练过程中因大规模数据集中的手动标注错误而导致的生成能力下降与可控性减弱问题。其解决方案的关键在于提出基于评分的判别器修正(Score-based Discriminator Correction, SBDC)方法,该方法通过对抗损失(adversarial loss)训练判别器来评估样本的真实性,并利用噪声检测技术对生成过程中的中间样本进行引导修正;特别地,研究发现仅在生成过程的早期阶段应用该引导策略可获得更优性能,且该方法计算效率高、推理时间增加微小,无需重新训练扩散模型。
链接: https://arxiv.org/abs/2508.19581
作者: Dat Nguyen Cong,Hieu Tran Bao,Hoang Thanh-Tung
机构: FPT Software AI Center (FPT软件人工智能中心); FPT IS AI R&D Center (FPT信息科技人工智能研发部); VNU University of Engineering and Technology (越南国家大学工程与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 16 figures
Abstract:Diffusion models have gained prominence as state-of-the-art techniques for synthesizing images and videos, particularly due to their ability to scale effectively with large datasets. Recent studies have uncovered that these extensive datasets often contain mistakes from manual labeling processes. However, the extent to which such errors compromise the generative capabilities and controllability of diffusion models is not well studied. This paper introduces Score-based Discriminator Correction (SBDC), a guidance technique for aligning noisy pre-trained conditional diffusion models. The guidance is built on discriminator training using adversarial loss, drawing on prior noise detection techniques to assess the authenticity of each sample. We further show that limiting the usage of our guidance to the early phase of the generation process leads to better performance. Our method is computationally efficient, only marginally increases inference time, and does not require retraining diffusion models. Experiments on different noise settings demonstrate the superiority of our method over previous state-of-the-art methods.
zh
[CV-61] High-Speed FHD Full-Color Video Computer-Generated Holography
【速读】:该论文旨在解决高帧率全彩色全息视频生成中的两大核心问题:一是基于学习的模型常产生过度平滑的相位分布,导致频谱狭窄,在深度复用等高帧率全彩色显示中引发严重的颜色串扰,造成帧率与色彩保真度之间的权衡;二是现有逐帧优化方法忽视了连续帧间的时空相关性,导致计算效率低下。解决方案的关键在于提出两个创新模块:其一为频谱引导的深度复用(Spectrum-Guided Depth Division Multiplexing, SGDDM),通过频率调制优化相位分布,实现高帧率下无损的全彩色显示;其二为轻量级异构Mamba-Unet结构HoloMamba,显式建模视频序列中的时空相关性,在提升重建质量的同时显著提高计算效率,最终在真实和仿真实验中实现了超过260 FPS的FHD(1080p)全彩色全息视频生成,相较现有最优方法提速超2.6倍。
链接: https://arxiv.org/abs/2508.19579
作者: Haomiao Zhang,Miao Cao,Xuan Yu,Hui Luo,Yanling Piao,Mengjie Qin,Zhangyuan Li,Ping Wang,Xin Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Computer-generated holography (CGH) is a promising technology for next-generation displays. However, generating high-speed, high-quality holographic video requires both high frame rate display and efficient computation, but is constrained by two key limitations: ( i ) Learning-based models often produce over-smoothed phases with narrow angular spectra, causing severe color crosstalk in high frame rate full-color displays such as depth-division multiplexing and thus resulting in a trade-off between frame rate and color fidelity. ( ii ) Existing frame-by-frame optimization methods typically optimize frames independently, neglecting spatial-temporal correlations between consecutive frames and leading to computationally inefficient solutions. To overcome these challenges, in this paper, we propose a novel high-speed full-color video CGH generation scheme. First, we introduce Spectrum-Guided Depth Division Multiplexing (SGDDM), which optimizes phase distributions via frequency modulation, enabling high-fidelity full-color display at high frame rates. Second, we present HoloMamba, a lightweight asymmetric Mamba-Unet architecture that explicitly models spatial-temporal correlations across video sequences to enhance reconstruction quality and computational efficiency. Extensive simulated and real-world experiments demonstrate that SGDDM achieves high-fidelity full-color display without compromise in frame rate, while HoloMamba generates FHD (1080p) full-color holographic video at over 260 FPS, more than 2.6 \times faster than the prior state-of-the-art Divide-Conquer-and-Merge Strategy.
zh
[CV-62] Interact-Custom: Customized Human Object Interaction Image Generation
【速读】:该论文旨在解决**定制化人-物交互图像生成(Customized Human Object Interaction Image Generation, CHOI)**的问题,即在生成图像时同时实现目标人物的身份保留与人-物交互语义的精细控制。其核心挑战在于:(1) 需要将人物分解为独立的身份特征和姿态导向的交互特征,而现有HOI(Human-Object Interaction)数据集缺乏此类解耦学习所需的样本;(2) 人与物体间不恰当的空间配置会导致交互缺失或不合理。解决方案的关键是提出两阶段模型Interact-Custom:首先通过生成前景掩码显式建模交互行为的空间配置,再在此掩码引导下生成保持身份不变的目标人物与其交互内容,从而实现高可控性的图像生成。
链接: https://arxiv.org/abs/2508.19575
作者: Zhu Xu,Zhaowen Wang,Yuxin Peng,Yang Liu
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机研究所); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Compositional Customized Image Generation aims to customize multiple target concepts within generation content, which has gained attention for its wild this http URL approaches mainly concentrate on the target entity’s appearance preservation, while neglecting the fine-grained interaction control among target this http URL enable the model of such interaction control capability, we focus on human object interaction scenario and propose the task of Customized Human Object Interaction Image Generation(CHOI), which simultaneously requires identity preservation for target human object and the interaction semantic control between this http URL primary challenges exist for CHOI:(1)simultaneous identity preservation and interaction control demands require the model to decompose the human object into self-contained identity features and pose-oriented interaction features, while the current HOI image datasets fail to provide ideal samples for such feature-decomposed learning.(2)inappropriate spatial configuration between human and object may lead to the lack of desired interaction this http URL tackle it, we first process a large-scale dataset, where each sample encompasses the same pair of human object involving different interactive this http URL we design a two-stage model Interact-Custom, which firstly explicitly models the spatial configuration by generating a foreground mask depicting the interaction behavior, then under the guidance of this mask, we generate the target human object interacting while preserving their identities this http URL, if the background image and the union location of where the target human object should appear are provided by users, Interact-Custom also provides the optional functionality to specify them, offering high content controllability. Extensive experiments on our tailored metrics for CHOI task demonstrate the effectiveness of our approach.
zh
[CV-63] Multimodal Prototype Alignment for Semi-supervised Pathology Image Segmentation
【速读】:该论文旨在解决病理图像分割中因语义边界模糊和像素级标注成本高昂所带来的挑战,尤其针对现有基于一致性正则化的半监督方法(如UniMatch)主要依赖图像模态内的扰动一致性、难以捕捉高层语义先验的问题。其解决方案的关键在于提出MPAMatch框架,通过多模态原型引导的对比学习机制实现像素级监督:一是图像原型与像素标签之间的对比学习,二是文本原型与像素标签之间的对比学习,从而在结构和语义两个层面提供粗到细的监督策略;同时,将经典分割架构TransUNet中的ViT主干替换为病理预训练基础模型Uni,以更有效地提取病理相关特征,首次引入文本原型监督提升语义边界建模能力,显著优于当前最优方法。
链接: https://arxiv.org/abs/2508.19574
作者: Mingxi Fu,Fanglei Fu,Xitong Ling,Huaitian Yuan,Tian Guan,Yonghong He,Lianghui Zhu
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Pathological image segmentation faces numerous challenges, particularly due to ambiguous semantic boundaries and the high cost of pixel-level annotations. Although recent semi-supervised methods based on consistency regularization (e.g., UniMatch) have made notable progress, they mainly rely on perturbation-based consistency within the image modality, making it difficult to capture high-level semantic priors, especially in structurally complex pathology images. To address these limitations, we propose MPAMatch - a novel segmentation framework that performs pixel-level contrastive learning under a multimodal prototype-guided supervision paradigm. The core innovation of MPAMatch lies in the dual contrastive learning scheme between image prototypes and pixel labels, and between text prototypes and pixel labels, providing supervision at both structural and semantic levels. This coarse-to-fine supervisory strategy not only enhances the discriminative capability on unlabeled samples but also introduces the text prototype supervision into segmentation for the first time, significantly improving semantic boundary modeling. In addition, we reconstruct the classic segmentation architecture (TransUNet) by replacing its ViT backbone with a pathology-pretrained foundation model (Uni), enabling more effective extraction of pathology-relevant features. Extensive experiments on GLAS, EBHI-SEG-GLAND, EBHI-SEG-CANCER, and KPI show MPAMatch’s superiority over state-of-the-art methods, validating its dual advantages in structural and semantic modeling.
zh
[CV-64] DNP-Guided Contrastive Reconstruction with a Reverse Distillation Transformer for Medical Anomaly Detection
【速读】:该论文旨在解决医学图像中异常检测的两大核心挑战:一是标注数据稀缺导致模型难以学习判别性特征,二是与自然图像相比存在领域差异(domain gap),使得基于预训练固定编码器的重建方法在适应医学域时性能受限;二是原型学习方法中存在的原型坍塌(prototype collapse)问题,即少数原型主导训练过程,削弱了模型多样性与泛化能力。解决方案的关键在于提出一个统一框架,其核心创新包括:1)引入可训练编码器并结合动量分支(momentum branch),实现稳定且适应医学域的特征学习;2)设计轻量级原型提取器(Prototype Extractor),从正常样本中挖掘具有信息量的原型,并通过注意力机制引导解码器进行精确重建;3)提出新颖的多样性感知对齐损失(Diversity-Aware Alignment Loss),通过原型层面的归一化和多样性约束,强制各原型均衡使用,有效防止原型坍塌,从而提升异常定位精度与模型可解释性。
链接: https://arxiv.org/abs/2508.19573
作者: Luhu Li,Bowen Lin,Mukhtiar Khan,Shujun Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Anomaly detection in medical images is challenging due to limited annotations and a domain gap compared to natural images. Existing reconstruction methods often rely on frozen pre-trained encoders, which limits adaptation to domain-specific features and reduces localization accuracy. Prototype-based learning offers interpretability and clustering benefits but suffers from prototype collapse, where few prototypes dominate training, harming diversity and generalization. To address this, we propose a unified framework combining a trainable encoder with prototype-guided reconstruction and a novel Diversity-Aware Alignment Loss. The trainable encoder, enhanced by a momentum branch, enables stable domain-adaptive feature learning. A lightweight Prototype Extractor mines informative normal prototypes to guide the decoder via attention for precise reconstruction. Our loss enforces balanced prototype use through diversity constraints and per-prototype normalization, effectively preventing collapse. Experiments on multiple medical imaging benchmarks show significant improvements in representation quality and anomaly localization, outperforming prior methods. Visualizations and prototype assignment analyses further validate the effectiveness of our anti-collapse mechanism and enhanced interpretability.
zh
[CV-65] FlowDet: Overcoming Perspective and Scale Challenges in Real-Time End-to-End Traffic Detection
【速读】:该论文旨在解决端到端目标检测器(end-to-end object detector)在实时应用中计算成本过高这一问题,尤其是在复杂场景如交叉路口交通监控中面临的挑战。其解决方案的关键在于提出FlowDet架构,通过解耦编码器优化策略对DETR结构进行改进:一是引入几何感知的可变形单元(Geometric Deformable Unit, GDU),实现针对交通场景的几何建模;二是设计尺度感知注意力模块(Scale-Aware Attention, SAA),以在极端尺度变化下保持强表征能力。该方法在新构建的Intersection-Flow-5k数据集上显著提升精度并大幅降低计算开销,实现了高效率与高准确性的平衡。
链接: https://arxiv.org/abs/2508.19565
作者: Yuhang Zhao,Zixing Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by PRCV 2025. Project page with code and dataset: this https URL
Abstract:End-to-end object detectors offer a promising NMS-free paradigm for real-time applications, yet their high computational cost remains a significant barrier, particularly for complex scenarios like intersection traffic monitoring. To address this challenge, we propose FlowDet, a high-speed detector featuring a decoupled encoder optimization strategy applied to the DETR architecture. Specifically, FlowDet employs a novel Geometric Deformable Unit (GDU) for traffic-aware geometric modeling and a Scale-Aware Attention (SAA) module to maintain high representational power across extreme scale variations. To rigorously evaluate the model’s performance in environments with severe occlusion and high object density, we collected the Intersection-Flow-5k dataset, a new challenging scene for this task. Evaluated on Intersection-Flow-5k, FlowDet establishes a new state-of-the-art. Compared to the strong RT-DETR baseline, it improves AP(test) by 1.5% and AP50(test) by 1.6%, while simultaneously reducing GFLOPs by 63.2% and increasing inference speed by 16.2%. Our work demonstrates a new path towards building highly efficient and accurate detectors for demanding, real-world perception systems. The Intersection-Flow-5k dataset is available at this https URL.
zh
[CV-66] MonoRelief V2: Leverag ing Real Data for High-Fidelity Monocular Relief Recovery
【速读】:该论文旨在解决从单张图像中直接恢复2.5D浮雕(2.5D relief)的挑战,尤其在复杂材质和光照条件下仍能保持高精度与鲁棒性。其核心解决方案在于提出MonoRelief V2模型,该模型通过融合伪真实数据(pseudo-real data)与小规模真实世界数据集进行渐进式训练:首先利用文本到图像生成模型(text-to-image generative model)合成约15,000张伪真实图像,并基于深度与法向预测融合获得伪标签;其次构建包含800个样本的真实数据集,通过多视角重建与细节优化提升真实性;最终实现端到端的深度与法向预测性能优于现有方法,展现出在下游应用中的强大潜力。
链接: https://arxiv.org/abs/2508.19555
作者: Yu-Wei Zhang,Tongju Han,Lipeng Gao,Mingqiang Wei,Hui Liu,Changbao Li,Caiming Zhang
机构: Qilu University of Technology (Shandong Academy of Sciences) (齐鲁工业大学(山东省科学院)); Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Shandong University of Finance and Economics (山东财经大学); Shandong University (山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents MonoRelief V2, an end-to-end model designed for directly recovering 2.5D reliefs from single images under complex material and illumination variations. In contrast to its predecessor, MonoRelief V1 [1], which was solely trained on synthetic data, MonoRelief V2 incorporates real data to achieve improved robustness, accuracy and efficiency. To overcome the challenge of acquiring large-scale real-world dataset, we generate approximately 15,000 pseudo real images using a text-to-image generative model, and derive corresponding depth pseudo-labels through fusion of depth and normal predictions. Furthermore, we construct a small-scale real-world dataset (800 samples) via multi-view reconstruction and detail refinement. MonoRelief V2 is then progressively trained on the pseudo-real and real-world datasets. Comprehensive experiments demonstrate its state-of-the-art performance both in depth and normal predictions, highlighting its strong potential for a range of downstream applications. Code is at: this https URL.
zh
[CV-67] WEBEYETRACK: Scalable Eye-Tracking for the Browser via On-Device Few-Shot Personalization
【速读】:该论文旨在解决当前基于AI的注视估计(gaze estimation)方法在真实场景中与商用眼动追踪解决方案之间存在的性能差距问题,尤其是模型体积大、推理延迟高以及隐私保护不足等实际限制因素;同时针对基于网络摄像头的眼动追踪精度不足的问题,特别是头部运动带来的误差。其解决方案的关键在于提出WebEyeTrack框架,该框架将轻量级前沿注视估计模型直接部署于浏览器端,结合基于模型的头部姿态估计(head pose estimation)和设备端少样本学习(few-shot learning),仅需9个校准样本即可实现用户自适应,从而在GazeCapture数据集上达到2.32 cm的误差水平,并在iPhone 14上实现2.4毫秒的实时推理速度。
链接: https://arxiv.org/abs/2508.19544
作者: Eduardo Davalos,Yike Zhang,Namrata Srivastava,Yashvitha Thatigotla,Jorge A. Salas,Sara McFadden,Sun-Joo Cho,Amanda Goodwin,Ashwin TS,Gautam Biswas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 7 figures, 1 table
Abstract:With advancements in AI, new gaze estimation methods are exceeding state-of-the-art (SOTA) benchmarks, but their real-world application reveals a gap with commercial eye-tracking solutions. Factors like model size, inference time, and privacy often go unaddressed. Meanwhile, webcam-based eye-tracking methods lack sufficient accuracy, in particular due to head movement. To tackle these issues, we introduce We bEyeTrack, a framework that integrates lightweight SOTA gaze estimation models directly in the browser. It incorporates model-based head pose estimation and on-device few-shot learning with as few as nine calibration samples (k 9). WebEyeTrack adapts to new users, achieving SOTA performance with an error margin of 2.32 cm on GazeCapture and real-time inference speeds of 2.4 milliseconds on an iPhone 14. Our open-source code is available at this https URL.
zh
[CV-68] CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在跨视频情境下关系推理能力不足的问题,这一能力对于多摄像头监控和跨视频流程学习等现实应用场景至关重要。其解决方案的关键在于提出CVBench——首个系统性评估跨视频关系推理的基准测试框架,该框架包含1000个问答对,覆盖三个层级的任务:跨视频对象关联(识别共享实体)、跨视频事件关联(链接时序或因果事件链)以及跨视频复杂推理(整合常识与领域知识),并基于五个多样化视频集群构建,从而挑战模型在动态视觉场景中整合信息的能力。通过在10余种主流MLLMs上的广泛评估,研究揭示了当前架构在跨视频上下文保留和重叠实体消歧方面的根本瓶颈,为下一代多视频理解模型的设计提供了诊断工具与改进方向。
链接: https://arxiv.org/abs/2508.19542
作者: Nannan Zhu,Yonghao Dong,Teng Wang,Xueqian Li,Shengjun Deng,Yijia Wang,Zheng Hong,Tiantian Geng,Guo Niu,Hanyan Huang,Xiongfei Yao,Shuaiwei Jiao
机构: Sun Yat-sen University (中山大学); University of Hong Kong (香港大学); Foshan University (佛山大学); University of Birmingham (伯明翰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), their ability across multiple videos remains critically underexplored. However, this capability is essential for real-world applications, including multi-camera surveillance and cross-video procedural learning. To bridge this gap, we present CVBench, the first comprehensive benchmark designed to assess cross-video relational reasoning rigorously. CVBench comprises 1,000 question-answer pairs spanning three hierarchical tiers: cross-video object association (identifying shared entities), cross-video event association (linking temporal or causal event chains), and cross-video complex reasoning (integrating commonsense and domain knowledge). Built from five domain-diverse video clusters (e.g., sports, life records), the benchmark challenges models to synthesise information across dynamic visual contexts. Extensive evaluation of 10+ leading MLLMs (including GPT-4o, Gemini-2.0-flash, Qwen2.5-VL) under zero-shot or chain-of-thought prompting paradigms. Key findings reveal stark performance gaps: even top models, such as GPT-4o, achieve only 60% accuracy on causal reasoning tasks, compared to the 91% accuracy of human performance. Crucially, our analysis reveals fundamental bottlenecks inherent in current MLLM architectures, notably deficient inter-video context retention and poor disambiguation of overlapping entities. CVBench establishes a rigorous framework for diagnosing and advancing multi-video reasoning, offering architectural insights for next-generation this http URL data and evaluation code are available at this https URL.
zh
[CV-69] MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment
【速读】:该论文旨在解决文本驱动动作生成中语义对齐不精确和推理效率低下的问题,即如何实现语言描述与动作语义的精准匹配,以及如何提升生成速度以支持实时合成。其解决方案的关键在于提出两个核心组件:一是TAPO(Aligned Preference Optimization),通过引入偏好优化机制对齐细微动作变化与文本修饰符,强化语义锚定;二是MotionFLUX,基于确定性修正流匹配的高速生成框架,利用最优传输路径替代传统扩散模型中的数百步去噪过程,显著减少采样步骤并保持动作质量,从而实现实时合成。二者协同构建了一个统一系统,在语义一致性、动作质量和生成速度上均优于现有方法。
链接: https://arxiv.org/abs/2508.19527
作者: Zhiting Gao,Dan Song,Diqiong Jiang,Chao Xue,An-An Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures
Abstract:Motion generation is essential for animating virtual characters and embodied agents. While recent text-driven methods have made significant strides, they often struggle with achieving precise alignment between linguistic descriptions and motion semantics, as well as with the inefficiencies of slow, multi-step inference. To address these issues, we introduce TMR++ Aligned Preference Optimization (TAPO), an innovative framework that aligns subtle motion variations with textual modifiers and incorporates iterative adjustments to reinforce semantic grounding. To further enable real-time synthesis, we propose MotionFLUX, a high-speed generation framework based on deterministic rectified flow matching. Unlike traditional diffusion models, which require hundreds of denoising steps, MotionFLUX constructs optimal transport paths between noise distributions and motion spaces, facilitating real-time synthesis. The linearized probability paths reduce the need for multi-step sampling typical of sequential methods, significantly accelerating inference time without sacrificing motion quality. Experimental results demonstrate that, together, TAPO and MotionFLUX form a unified system that outperforms state-of-the-art approaches in both semantic consistency and motion quality, while also accelerating generation speed. The code and pretrained models will be released.
zh
[CV-70] Fast Texture Transfer for XR Avatars via Barycentric UV Conversion
【速读】:该论文旨在解决基于SMPL-X模型的全身虚拟形象在面部纹理映射过程中存在的效率低和视觉伪影问题。传统仿射变换方法虽然可行,但计算速度慢且易产生边界伪影,影响沉浸式扩展现实(XR)应用中的个性化体验。解决方案的关键在于提出一种基于重心坐标(barycentric)的UV转换技术,通过预计算整个UV映射为单一变换矩阵,实现单次操作完成纹理转移,从而在保持高质量纹理的同时将处理速度提升超过7000倍,显著优于基线方法。
链接: https://arxiv.org/abs/2508.19518
作者: Hail Song,Seokhwan Yang,Woontack Woo
机构: KAIST UVR Lab (韩国科学技术院视觉与机器人实验室); KAIST KI-ITC ARRC (韩国科学技术院人工智能与信息技术研究中心)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present a fast and efficient method for transferring facial textures onto SMPL-X-based full-body avatars. Unlike conventional affine-transform methods that are slow and prone to visual artifacts, our method utilizes a barycentric UV conversion technique. Our approach precomputes the entire UV mapping into a single transformation matrix, enabling texture transfer in a single operation. This results in a speedup of over 7000x compared to the baseline, while also significantly improving the final texture quality by eliminating boundary artifacts. Through quantitative and qualitative evaluations, we demonstrate that our method offers a practical solution for personalization in immersive XR applications. The code is available online.
zh
[CV-71] Weed Detection in Challenging Field Conditions: A Semi-Supervised Framework for Overcoming Shadow Bias and Data Scarcity
【速读】:该论文旨在解决自动化杂草管理中深度学习模型在真实田间环境下性能受限的问题,主要受制于复杂环境条件和高昂的数据标注成本。其解决方案的关键在于提出一种诊断驱动的半监督框架:首先通过高精度的监督基线模型(如ResNet分类、YOLO与RF-DETR检测)建立性能基准,并借助可解释性工具识别出普遍存在的“阴影偏差”(shadow bias)——即模型将阴影误判为植被;随后利用未标注数据进行伪标签训练,提升模型对多样视觉信息的适应能力,从而有效缓解阴影偏差并显著提高召回率(recall),这对于减少自动喷洒系统中的杂草漏检至关重要。该方法在低数据场景下亦验证了有效性,为精准农业中鲁棒计算机视觉系统的开发提供了可落地的诊断与优化路径。
链接: https://arxiv.org/abs/2508.19511
作者: Alzayat Saleh,Shunsuke Hatano,Mostafa Rahimi Azghadi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 10 figures, 6 tables
Abstract:The automated management of invasive weeds is critical for sustainable agriculture, yet the performance of deep learning models in real-world fields is often compromised by two factors: challenging environmental conditions and the high cost of data annotation. This study tackles both issues through a diagnostic-driven, semi-supervised framework. Using a unique dataset of approximately 975 labeled and 10,000 unlabeled images of Guinea Grass in sugarcane, we first establish strong supervised baselines for classification (ResNet) and detection (YOLO, RF-DETR), achieving F1 scores up to 0.90 and mAP50 scores exceeding 0.82. Crucially, this foundational analysis, aided by interpretability tools, uncovered a pervasive “shadow bias,” where models learned to misidentify shadows as vegetation. This diagnostic insight motivated our primary contribution: a semi-supervised pipeline that leverages unlabeled data to enhance model robustness. By training models on a more diverse set of visual information through pseudo-labeling, this framework not only helps mitigate the shadow bias but also provides a tangible boost in recall, a critical metric for minimizing weed escapes in automated spraying systems. To validate our methodology, we demonstrate its effectiveness in a low-data regime on a public crop-weed benchmark. Our work provides a clear and field-tested framework for developing, diagnosing, and improving robust computer vision systems for the complex realities of precision agriculture.
zh
[CV-72] DATR: Diffusion-based 3D Apple Tree Reconstruction Framework with Sparse-View
【速读】:该论文旨在解决在复杂田间环境下,基于稀疏视图的果树(苹果树)高保真三维重建难题,尤其针对现有方法在视角稀疏和遮挡严重场景下性能下降的问题。其解决方案的关键在于提出了一种两阶段框架(DATR),第一阶段利用机载传感器与基础模型(foundation models)实现从复杂田间图像中半自动提取树体掩膜(tree masks),用于过滤多模态数据中的背景信息;第二阶段则结合扩散模型(diffusion model)与大重建模型(large reconstruction model, LRM),分别完成多视角图像到3D结构的显式重建与隐式神经场(implicit neural field)建模。该框架通过Real2Sim合成数据生成器训练扩散模型与LRM,显著提升了重建精度与效率,在真实和合成数据集上均优于现有方法,并实现了接近工业级静态激光扫描仪的形态特征估计能力,同时将处理速度提升约360倍,为可扩展的农业数字孪生系统提供了技术支撑。
链接: https://arxiv.org/abs/2508.19508
作者: Tian Qiu,Alan Zoubi,Yiyuan Lin,Ruiming Du,Lailiang Cheng,Yu Jiang
机构: Cornell University (康奈尔大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Digital twin applications offered transformative potential by enabling real-time monitoring and robotic simulation through accurate virtual replicas of physical assets. The key to these systems is 3D reconstruction with high geometrical fidelity. However, existing methods struggled under field conditions, especially with sparse and occluded views. This study developed a two-stage framework (DATR) for the reconstruction of apple trees from sparse views. The first stage leverages onboard sensors and foundation models to semi-automatically generate tree masks from complex field images. Tree masks are used to filter out background information in multi-modal data for the single-image-to-3D reconstruction at the second stage. This stage consists of a diffusion model and a large reconstruction model for respective multi view and implicit neural field generation. The training of the diffusion model and LRM was achieved by using realistic synthetic apple trees generated by a Real2Sim data generator. The framework was evaluated on both field and synthetic datasets. The field dataset includes six apple trees with field-measured ground truth, while the synthetic dataset featured structurally diverse trees. Evaluation results showed that our DATR framework outperformed existing 3D reconstruction methods across both datasets and achieved domain-trait estimation comparable to industrial-grade stationary laser scanners while improving the throughput by \sim 360 times, demonstrating strong potential for scalable agricultural digital twin systems.
zh
[CV-73] Sat2Flow: A Structure-Aware Diffusion Framework for Human Flow Generation from Satellite Imagery
【速读】:该论文旨在解决现有城市出行分布(Origin-Destination, OD)流量矩阵生成方法面临的两大问题:一是依赖昂贵且空间覆盖有限的辅助特征(如兴趣点、社会经济统计数据),二是对空间拓扑结构敏感,即城市区域索引的微小重排(如普查区重新编号)会导致生成流量结构失真。解决方案的关键在于提出Sat2Flow框架,其核心创新包括:(1)设计多核编码器以捕捉区域间的多样化交互关系;(2)引入排列感知扩散过程,确保在不同区域索引顺序下隐空间表示的一致性;(3)通过联合对比训练目标将卫星影像特征与OD模式对齐,并结合等变扩散训练机制强制结构不变性,从而实现无需区域特定辅助数据即可生成结构一致且高精度的OD流矩阵。
链接: https://arxiv.org/abs/2508.19499
作者: Xiangxu Wang,Tianhong Zhao,Wei Tu,Bowen Zhang,Guanzhou Chen,Jinzhou Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Origin-Destination (OD) flow matrices are essential for urban mobility analysis, underpinning applications in traffic forecasting, infrastructure planning, and policy design. However, existing methods suffer from two critical limitations: (1) reliance on auxiliary features (e.g., Points of Interest, socioeconomic statistics) that are costly to collect and have limited spatial coverage; and (2) sensitivity to spatial topology, where minor index reordering of urban regions (e.g., census tract relabeling) disrupts structural coherence in generated flows. To address these challenges, we propose Sat2Flow, a latent structure-aware diffusion-based framework that generates structurally coherent OD flows using solely satellite imagery as input. Our approach introduces a multi-kernel encoder to capture diverse regional interactions and employs a permutation-aware diffusion process that aligns latent representations across different regional orderings. Through a joint contrastive training objective that bridges satellite-derived features with OD patterns, combined with equivariant diffusion training that enforces structural consistency, Sat2Flow ensures topological robustness under arbitrary regional reindexing. Experimental results on real-world urban datasets demonstrate that Sat2Flow outperforms both physics-based and data-driven baselines in numerical accuracy while preserving empirical distributions and spatial structures under index permutations. Sat2Flow offers a globally scalable solution for OD flow generation in data-scarce urban environments, eliminating region-specific auxiliary data dependencies while maintaining structural invariance for robust mobility modeling.
zh
[CV-74] UNIFORM: Unifying Knowledge from Large-scale and Diverse Pre-trained Models
【速读】:该论文旨在解决预训练模型(pre-trained models)在知识迁移过程中因架构和训练数据分布异质性导致的整合难题,现有方法通常依赖于对训练数据分布或网络结构的强假设,限制了其适用范围并引入数据或归纳偏置。解决方案的关键在于提出一种名为UNIFORM的新框架,通过设计一个专用的投票机制,在logit层(利用能预测目标类别的教师模型)和特征层(利用任意标签空间学习到的视觉表示)两个层面捕捉教师模型间的共识,从而实现从多样化现成模型中无约束地迁移知识至学生模型,显著提升了无监督目标识别性能,并展现出优异的可扩展性——在使用超过一百个教师模型时仍持续受益,而现有方法则在较小规模下即趋于饱和。
链接: https://arxiv.org/abs/2508.19498
作者: Yimu Wang,Weiming Zhuang,Chen Chen,Jiabo Huang,Jingtao Li,Lingjuan Lyu
机构: University of Waterloo (滑铁卢大学); SONY AI (索尼人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:In the era of deep learning, the increasing number of pre-trained models available online presents a wealth of knowledge. These models, developed with diverse architectures and trained on varied datasets for different tasks, provide unique interpretations of the real world. Their collective consensus is likely universal and generalizable to unseen data. However, effectively harnessing this collective knowledge poses a fundamental challenge due to the heterogeneity of pre-trained models. Existing knowledge integration solutions typically rely on strong assumptions about training data distributions and network architectures, limiting them to learning only from specific types of models and resulting in data and/or inductive biases. In this work, we introduce a novel framework, namely UNIFORM, for knowledge transfer from a diverse set of off-the-shelf models into one student model without such constraints. Specifically, we propose a dedicated voting mechanism to capture the consensus of knowledge both at the logit level – incorporating teacher models that are capable of predicting target classes of interest – and at the feature level, utilizing visual representations learned on arbitrary label spaces. Extensive experiments demonstrate that UNIFORM effectively enhances unsupervised object recognition performance compared to strong knowledge transfer baselines. Notably, it exhibits remarkable scalability by benefiting from over one hundred teachers, while existing methods saturate at a much smaller scale.
zh
[CV-75] Mind the Third Eye! Benchmarking Privacy Awareness in MLLM -powered Smartphone Agents
【速读】:该论文旨在解决当前基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的智能手机代理(smartphone agents)在自动化任务中对用户敏感个人信息访问缺乏隐私意识的问题。其核心挑战在于这些代理在执行功能时往往获得过度权限,却未能有效识别和保护隐私上下文,导致潜在的数据泄露风险。解决方案的关键在于构建首个大规模基准测试(benchmark),涵盖7,138个真实场景,并对每个场景标注隐私类型(如账户凭证)、敏感级别及位置信息,从而系统评估主流智能手机代理的隐私感知能力(Privacy Awareness, RA)。实验表明,绝大多数代理表现不佳(RA < 60%),且隐私检测能力与场景敏感度显著相关,提示需重新审视智能代理中效用与隐私之间的失衡关系。
链接: https://arxiv.org/abs/2508.19493
作者: Zhixin Lin,Jungang Li,Shidong Pan,Yibo Shi,Yue Yao,Dongliang Xu
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Smartphones bring significant convenience to users but also enable devices to extensively record various types of personal information. Existing smartphone agents powered by Multimodal Large Language Models (MLLMs) have achieved remarkable performance in automating different tasks. However, as the cost, these agents are granted substantial access to sensitive users’ personal information during this operation. To gain a thorough understanding of the privacy awareness of these agents, we present the first large-scale benchmark encompassing 7,138 scenarios to the best of our knowledge. In addition, for privacy context in scenarios, we annotate its type (e.g., Account Credentials), sensitivity level, and location. We then carefully benchmark seven available mainstream smartphone agents. Our results demonstrate that almost all benchmarked agents show unsatisfying privacy awareness (RA), with performance remaining below 60% even with explicit hints. Overall, closed-source agents show better privacy ability than open-source ones, and Gemini 2.0-flash achieves the best, achieving an RA of 67%. We also find that the agents’ privacy detection capability is highly related to scenario sensitivity level, i.e., the scenario with a higher sensitivity level is typically more identifiable. We hope the findings enlighten the research community to rethink the unbalanced utility-privacy tradeoff about smartphone agents. Our code and benchmark are available at this https URL.
zh
[CV-76] JVLGS: Joint Vision-Language Gas Leak Segmentation
【速读】:该论文旨在解决气体泄漏检测中因视觉方法受限于气体云团模糊性和非刚性特征而导致的识别不准确问题,以及现有技术在复杂场景下易产生误报的问题。其解决方案的关键在于提出一种联合视觉与语言模态的气体泄漏分割框架(Joint Vision-Language Gas leak Segmentation, JVLGS),通过融合图像和文本信息增强对气体泄漏的表征能力,并引入后处理步骤以有效降低由噪声和非目标物体引起的假阳性,从而在多种场景下实现更鲁棒、精准的气体泄漏分割效果。
链接: https://arxiv.org/abs/2508.19485
作者: Xinlong Zhao,Qixiang Pang,Shan Du
机构: University of British Columbia - Okanagan (不列颠哥伦比亚大学奥卡纳根分校); University of Central Missouri (中央密苏里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 13 figures
Abstract:Gas leaks pose serious threats to human health and contribute significantly to atmospheric pollution, drawing increasing public concern. However, the lack of effective detection methods hampers timely and accurate identification of gas leaks. While some vision-based techniques leverage infrared videos for leak detection, the blurry and non-rigid nature of gas clouds often limits their effectiveness. To address these challenges, we propose a novel framework called Joint Vision-Language Gas leak Segmentation (JVLGS), which integrates the complementary strengths of visual and textual modalities to enhance gas leak representation and segmentation. Recognizing that gas leaks are sporadic and many video frames may contain no leak at all, our method incorporates a post-processing step to reduce false positives caused by noise and non-target objects, an issue that affects many existing approaches. Extensive experiments conducted across diverse scenarios show that JVLGS significantly outperforms state-of-the-art gas leak segmentation methods. We evaluate our model under both supervised and few-shot learning settings, and it consistently achieves strong performance in both, whereas competing methods tend to perform well in only one setting or poorly in both. Code available at: this https URL
zh
[CV-77] Concurrent validity of computer-vision artificial intelligence player tracking software using broadcast footage
【速读】:该论文旨在解决商用计算机视觉与人工智能(Artificial Intelligence, AI)球员追踪软件在使用转播画面时,能否准确测量球员位置、速度和跑动距离的问题,并进一步探究摄像机信号源和分辨率对追踪精度的影响。研究的关键解决方案在于:首先,通过对比商用追踪系统与高精度多摄像头追踪系统(TRACAB Gen 5)的数据,量化了不同供应商在位置、速度及总跑动距离上的误差(如位置RMSE为1.68–16.39 m,速度RMSE为0.34–2.38 m/s,距离偏差达-21.8%至+24.3%);其次,发现采用战术视角(tactical feed)可提升球员检测率,从而显著改善追踪准确性,且720p与1080p分辨率在适配合适模型的前提下均能满足精度需求。
链接: https://arxiv.org/abs/2508.19477
作者: Zachary L. Crang,Rich D. Johnston,Katie L. Mills,Johsan Billingham,Sam Robertson,Michael H. Cole,Jonathon Weakley,Adam Hewitt and,Grant M. Duthie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:This study aimed to: (1) understand whether commercially available computer-vision and artificial intelligence (AI) player tracking software can accurately measure player position, speed and distance using broadcast footage and (2) determine the impact of camera feed and resolution on accuracy. Data were obtained from one match at the 2022 Qatar Federation Internationale de Football Association (FIFA) World Cup. Tactical, programme and camera 1 feeds were used. Three commercial tracking providers that use computer-vision and AI participated. Providers analysed instantaneous position (x, y coordinates) and speed (m,s^-1) of each player. Their data were compared with a high-definition multi-camera tracking system (TRACAB Gen 5). Root mean square error (RMSE) and mean bias were calculated. Position RMSE ranged from 1.68 to 16.39 m, while speed RMSE ranged from 0.34 to 2.38 m,s^-1. Total match distance mean bias ranged from -1745 m (-21.8%) to 1945 m (24.3%) across providers. Computer-vision and AI player tracking software offer the ability to track players with fair precision when players are detected by the software. Providers should use a tactical feed when tracking position and speed, which will maximise player detection, improving accuracy. Both 720p and 1080p resolutions are suitable, assuming appropriate computer-vision and AI models are implemented.
zh
[CV-78] Fine-Tuning Vision-Language Models for Neutrino Event Analysis in High-Energy Physics Experiments
【速读】:该论文旨在解决高能物理(High-Energy Physics, HEP)实验中基于像素化探测器图像对中微子相互作用事件进行分类的问题,传统方法依赖卷积神经网络(Convolutional Neural Network, CNN)模型,但难以有效融合辅助文本或语义信息。解决方案的关键在于引入一个微调后的视觉-语言模型(Vision-Language Model, VLM),基于LLaMA 3.2架构,利用其多模态理解能力,在保持或超越CNN性能的同时,实现更丰富的推理能力和对辅助文本/语义上下文的更好整合,从而为HEP事件分类提供一种通用且可扩展的多模态骨干模型。
链接: https://arxiv.org/abs/2508.19376
作者: Dikshant Sagar,Kaiwen Yu,Alejandro Yankelevich,Jianming Bian,Pierre Baldi
机构: University of California, Irvine (加州大学欧文分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); High Energy Physics - Experiment (hep-ex)
备注:
Abstract:Recent progress in large language models (LLMs) has shown strong potential for multimodal reasoning beyond natural language. In this work, we explore the use of a fine-tuned Vision-Language Model (VLM), based on LLaMA 3.2, for classifying neutrino interactions from pixelated detector images in high-energy physics (HEP) experiments. We benchmark its performance against an established CNN baseline used in experiments like NOvA and DUNE, evaluating metrics such as classification accuracy, precision, recall, and AUC-ROC. Our results show that the VLM not only matches or exceeds CNN performance but also enables richer reasoning and better integration of auxiliary textual or semantic context. These findings suggest that VLMs offer a promising general-purpose backbone for event classification in HEP, paving the way for multimodal approaches in experimental neutrino physics.
zh
[CV-79] Efficient Multi-Source Knowledge Transfer by Model Merging
【速读】:该论文旨在解决多源迁移学习(multi-source transfer learning)中的知识提取精度不足与聚合效率低的问题,即如何从大量在线可用模型中高效且精准地提取并融合知识,以提升适应性并降低重新训练成本。解决方案的关键在于利用奇异值分解(Singular Value Decomposition, SVD)将每个源模型分解为基本的秩一(rank-one)组件,随后在聚合阶段仅选择来自所有源模型中最显著的组件,从而克服了传统方法在粒度精度和计算效率上的局限;最终通过仅微调合并矩阵的主要奇异值来适配目标任务,实现对合成知识库的有效保留与利用。
链接: https://arxiv.org/abs/2508.19353
作者: Marcin Osial,Bartosz Wójcik,Bartosz Zieliński,Sebastian Cygert
机构: 1. University of Warsaw (华沙大学); 2. Polish Academy of Sciences (波兰科学院); 3. Institute of Computer Science, Polish Academy of Sciences (波兰科学院计算机科学研究所); 4. Warsaw University of Technology (华沙理工大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While transfer learning is an advantageous strategy, it overlooks the opportunity to leverage knowledge from numerous available models online. Addressing this multi-source transfer learning problem is a promising path to boost adaptability and cut re-training costs. However, existing approaches are inherently coarse-grained, lacking the necessary precision for granular knowledge extraction and the aggregation efficiency required to fuse knowledge from either a large number of source models or those with high parameter counts. We address these limitations by leveraging Singular Value Decomposition (SVD) to first decompose each source model into its elementary, rank-one components. A subsequent aggregation stage then selects only the most salient components from all sources, thereby overcoming the previous efficiency and precision limitations. To best preserve and leverage the synthesized knowledge base, our method adapts to the target task by fine-tuning only the principal singular values of the merged matrix. In essence, this process only recalibrates the importance of top SVD components. The proposed framework allows for efficient transfer learning, is robust to perturbations both at the input level and in the parameter space (e.g., noisy or pruned sources), and scales well computationally.
zh
[CV-80] EffNetViTLoRA: An Efficient Hybrid Deep Learning Approach for Alzheimers Disease Diagnosis
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s Disease, AD)早期诊断中因轻度认知障碍(Mild Cognitive Impairment, MCI)与正常认知(Cognitively Normal, CN)及AD之间差异细微而导致的分类困难问题。其核心挑战在于如何从全量T1加权磁共振成像(T1-weighted MRI)数据中提取可靠且具有判别力的特征,并避免因源域与目标域差异导致的大模型微调效果不佳。解决方案的关键在于提出EffNetViTLoRA模型,该模型融合卷积神经网络(Convolutional Neural Network, CNN)与视觉Transformer(Vision Transformer, ViT),以同时捕捉MRI图像的局部细节和全局结构信息;并引入低秩适应(Low-Rank Adaptation, LoRA)技术对预训练ViT进行高效微调,从而实现跨域知识迁移的同时降低过拟合风险,最终在ADNI全量数据集上实现了92.52%的分类准确率和92.76%的F1分数。
链接: https://arxiv.org/abs/2508.19349
作者: Mahdieh Behjat Khatooni,Mohsen Soryani
机构: Islamic Azad University, Science and Research Branch (伊斯兰阿扎德大学,科学与研究分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Alzheimer’s disease (AD) is one of the most prevalent neurodegenerative disorders worldwide. As it progresses, it leads to the deterioration of cognitive functions. Since AD is irreversible, early diagnosis is crucial for managing its progression. Mild Cognitive Impairment (MCI) represents an intermediate stage between Cognitively Normal (CN) individuals and those with AD, and is considered a transitional phase from normal cognition to Alzheimer’s disease. Diagnosing MCI is particularly challenging due to the subtle differences between adjacent diagnostic categories. In this study, we propose EffNetViTLoRA, a generalized end-to-end model for AD diagnosis using the whole Alzheimer’s Disease Neuroimaging Initiative (ADNI) Magnetic Resonance Imaging (MRI) dataset. Our model integrates a Convolutional Neural Network (CNN) with a Vision Transformer (ViT) to capture both local and global features from MRI images. Unlike previous studies that rely on limited subsets of data, our approach is trained on the full T1-weighted MRI dataset from ADNI, resulting in a more robust and unbiased model. This comprehensive methodology enhances the model’s clinical reliability. Furthermore, fine-tuning large pretrained models often yields suboptimal results when source and target dataset domains differ. To address this, we incorporate Low-Rank Adaptation (LoRA) to effectively adapt the pretrained ViT model to our target domain. This method enables efficient knowledge transfer and reduces the risk of overfitting. Our model achieves a classification accuracy of 92.52% and an F1-score of 92.76% across three diagnostic categories: AD, MCI, and CN for full ADNI dataset.
zh
[CV-81] PRISM: A Framework Harnessing Unsupervised Visual Representations and Textual Prompts for Explainable MACE Survival Prediction from Cardiac Cine MRI
【速读】:该论文旨在解决心血管预后中重大不良心脏事件(MACE)精准预测的挑战。其核心解决方案是提出PRISM(Prompt-guided Representation Integration for Survival Modeling),一个基于自监督学习的生存建模框架,关键在于通过运动感知的多视角蒸馏技术提取非对比心脏电影磁共振成像(non-contrast cardiac cine MRI)的时序同步特征,并利用医学知识引导的文本提示(textual prompts)对这些特征进行调制,从而实现细粒度的心脏风险分层。该方法融合了结构化电子健康记录(EHR)与影像学表征,在多个独立临床队列中均显著优于传统生存模型和当前最优深度学习基线,且揭示了与MACE风险相关的三种新型影像学标志物及主导的临床因素。
链接: https://arxiv.org/abs/2508.19325
作者: Haoyang Su,Jin-Yi Xiang,Shaohao Rui,Yifan Gao,Xingyu Chen,Tingxuan Yin,Xiaosong Wang,Lian-Ming Wu
机构: Fudan University (复旦大学); Shanghai Innovation Institute (上海科技创新研究院); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学); School of Biomedical Engineering (Suzhou) (生物医学工程学院(苏州)), Division of Life Science and Medicine, University of Science and Technology of China (中国科学技术大学生命科学与医学部), Hefei, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate prediction of major adverse cardiac events (MACE) remains a central challenge in cardiovascular prognosis. We present PRISM (Prompt-guided Representation Integration for Survival Modeling), a self-supervised framework that integrates visual representations from non-contrast cardiac cine magnetic resonance imaging with structured electronic health records (EHRs) for survival analysis. PRISM extracts temporally synchronized imaging features through motion-aware multi-view distillation and modulates them using medically informed textual prompts to enable fine-grained risk prediction. Across four independent clinical cohorts, PRISM consistently surpasses classical survival prediction models and state-of-the-art (SOTA) deep learning baselines under internal and external validation. Further clinical findings demonstrate that the combined imaging and EHR representations derived from PRISM provide valuable insights into cardiac risk across diverse cohorts. Three distinct imaging signatures associated with elevated MACE risk are uncovered, including lateral wall dyssynchrony, inferior wall hypersensitivity, and anterior elevated focus during diastole. Prompt-guided attribution further identifies hypertension, diabetes, and smoking as dominant contributors among clinical and physiological EHR factors.
zh
[CV-82] Deep Data Hiding for ICAO-Compliant Face Images: A Survey
【速读】:该论文旨在解决ICAO合规人脸图像在身份验证应用中面临的伪造风险问题,尤其是针对基于深度学习的攻击(如morphic attacks和deepfakes)所引发的身份盗用与非法文档共享等安全威胁。传统防伪手段如呈现攻击检测(Presentation Attack Detection, PAD)仅适用于实时采集阶段,缺乏对已捕获图像的持久保护能力。论文提出将数字水印(digital watermarking)与隐写术(steganography)作为补充解决方案,其关键在于通过嵌入不可见且抗篡改的信号于图像本身,在不破坏ICAO标准兼容性的前提下实现全生命周期的身份真实性验证,从而提供一种可持久、可追溯的后捕获防护机制。
链接: https://arxiv.org/abs/2508.19324
作者: Jefferson David Rodriguez Chivata,Davide Ghiani,Simone Maurizio La Cava,Marco Micheletto,Giulia Orrù,Federico Lama,Gian Luca Marcialis
机构: University of Cagliari (卡利亚里大学); Dedem S.p.A. (德德姆公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: In 2025 IEEE International Joint Conference on Biometrics (IJCB)
Abstract:ICAO-compliant facial images, initially designed for secure biometric passports, are increasingly becoming central to identity verification in a wide range of application contexts, including border control, digital travel credentials, and financial services. While their standardization enables global interoperability, it also facilitates practices such as morphing and deepfakes, which can be exploited for harmful purposes like identity theft and illegal sharing of identity documents. Traditional countermeasures like Presentation Attack Detection (PAD) are limited to real-time capture and offer no post-capture protection. This survey paper investigates digital watermarking and steganography as complementary solutions that embed tamper-evident signals directly into the image, enabling persistent verification without compromising ICAO compliance. We provide the first comprehensive analysis of state-of-the-art techniques to evaluate the potential and drawbacks of the underlying approaches concerning the applications involving ICAO-compliant images and their suitability under standard constraints. We highlight key trade-offs, offering guidance for secure deployment in real-world identity systems.
zh
[CV-83] A Technical Review on Comparison and Estimation of Steganographic Tools
【速读】:该论文旨在解决图像隐写术(image steganography)中不同工具性能差异不明确的问题,尤其是如何基于图像特征对多种隐写工具进行客观比较与优选。其解决方案的关键在于:选取市场上常用且高频使用的六种图像隐写工具,使用相同输入文本在多种图像格式下进行实验测试,并通过分析图像的尺寸、像素值分布及直方图差异等关键特征,量化评估各工具的嵌入效率与表现差异,从而识别出相对最优的隐写方案。
链接: https://arxiv.org/abs/2508.19323
作者: Ms. Preeti P. Bhatt,Rakesh R. Savant
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 20
Abstract:Steganography is technique of hiding a data under cover media using different steganography tools. Image steganography is hiding of data (Text/Image/Audio/Video) under a cover as Image. This review paper presents classification of image steganography and the comparison of various Image steganography tools using different image formats. Analyzing numerous tools on the basis of Image features and extracting the best one. Some of the tools available in the market were selected based on the frequent use; these tools were tested using the same input on all of them. Specific text was embedded within all host images for each of the six Steganography tools selected. The results of the experiment reveal that all the six tools were relatively performing at the same level, though some software performs better than others through efficiency. And it was based on the image features like size, dimensions, and pixel value and histogram differentiation.
zh
[CV-84] MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time Autoregressive Video Generation
【速读】:该论文旨在解决交互式数字人视频生成系统在实时性、计算效率和多模态可控性方面的挑战,现有方法常面临高延迟、高计算成本及控制能力有限等问题。其解决方案的关键在于提出一种自回归视频生成框架,通过最小修改标准大语言模型(LLM)以接受包括音频、姿态和文本在内的多模态条件编码,并输出空间与语义一致的表示来引导扩散头的去噪过程;同时构建约20,000小时的大规模对话数据集以支持训练,并引入压缩比高达64倍的深度压缩自编码器,有效缓解自回归模型长时序推理负担,从而实现低延迟、高效率且具备细粒度多模态控制能力的流式交互式视频生成。
链接: https://arxiv.org/abs/2508.19320
作者: Ming Chen,Liyuan Cui,Wenyuan Zhang,Haoxian Zhang,Yan Zhou,Xiaohan Li,Xiaoqiang Liu,Pengfei Wan
机构: Kling Team, Kuaishou Technology (快手科技); Zhejiang University (浙江大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Technical Report. Project Page: this https URL
Abstract:Recently, interactive digital human video generation has attracted widespread attention and achieved remarkable progress. However, building such a practical system that can interact with diverse input signals in real time remains challenging to existing methods, which often struggle with high latency, heavy computational cost, and limited controllability. In this work, we introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. With minimal modifications to a standard large language model (LLM), our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations to guide the denoising process of a diffusion head. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources, providing rich conversational scenarios for training. We further introduce a deep compression autoencoder with up to 64 \times reduction ratio, which effectively alleviates the long-horizon inference burden of the autoregressive model. Extensive experiments on duplex conversation, multilingual human synthesis, and interactive world model highlight the advantages of our approach in low latency, high efficiency, and fine-grained multimodal controllability.
zh
[CV-85] Automated classification of natural habitats using ground-level imagery
【速读】:该论文旨在解决传统陆地生境分类方法依赖卫星遥感影像且验证成本高、难以实现大规模应用的问题,提出了一种基于地面影像(照片)的生境分类新方法。其关键在于利用深度学习技术,特别是改进的DeepLabV3-ResNet101模型,对地面采集的图像进行自动分类,并结合数据预处理、类别重采样等策略提升模型在18类“Living England”生境框架下的准确性和鲁棒性,从而实现高效、可扩展的生态监测与生境识别。
链接: https://arxiv.org/abs/2508.19314
作者: Mahdis Tourian(1 and 2),Sareh Rowlands(1 and 2),Remy Vandaele(1 and 2),Max Fancourt(3),Rebecca Mein(3),Hywel T. P. Williams(1 and 2) ((1) Centre for Environmental Intelligence, University of Exeter, Exeter, UK, (2) Department of Computer Science, Faculty of Environment, Science and Economy, University of Exeter, Exeter, UK, (3) Natural England, York, UK)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, 6 figures, 2 tables
Abstract:Accurate classification of terrestrial habitats is critical for biodiversity conservation, ecological monitoring, and land-use planning. Several habitat classification schemes are in use, typically based on analysis of satellite imagery with validation by field ecologists. Here we present a methodology for classification of habitats based solely on ground-level imagery (photographs), offering improved validation and the ability to classify habitats at scale (for example using citizen-science imagery). In collaboration with Natural England, a public sector organisation responsible for nature conservation in England, this study develops a classification system that applies deep learning to ground-level habitat photographs, categorising each image into one of 18 classes defined by the ‘Living England’ framework. Images were pre-processed using resizing, normalisation, and augmentation; re-sampling was used to balance classes in the training data and enhance model robustness. We developed and fine-tuned a DeepLabV3-ResNet101 classifier to assign a habitat class label to each photograph. Using five-fold cross-validation, the model demonstrated strong overall performance across 18 habitat classes, with accuracy and F1-scores varying between classes. Across all folds, the model achieved a mean F1-score of 0.61, with visually distinct habitats such as Bare Soil, Silt and Peat (BSSP) and Bare Sand (BS) reaching values above 0.90, and mixed or ambiguous classes scoring lower. These findings demonstrate the potential of this approach for ecological monitoring. Ground-level imagery is readily obtained, and accurate computational methods for habitat classification based on such data have many potential applications. To support use by practitioners, we also provide a simple web application that classifies uploaded images using our model.
zh
[CV-86] Sistema de Reconocimiento Facial Federado en Conjuntos Abiertos basado en OpenMax
【速读】:该论文旨在解决生成式 AI(Generative AI)驱动的面部识别系统在开放集场景下面临的隐私保护与身份管理难题,尤其是在未知个体出现在操作环境中时难以可靠区分已知与未知目标的问题。其解决方案的关键在于将 OpenMax 算法引入联邦学习(Federated Learning, FL)框架中,通过交换本地模型的均值激活向量(mean activation vectors)和局部距离度量(local distance measures),实现对已知类别与未知类别的有效判别,从而提升分布式环境下的隐私感知与鲁棒性。
链接: https://arxiv.org/abs/2508.19312
作者: Ander Galván,Marivi Higuero,Jorge Sasiain,Eduardo Jacob
机构: Universidad del País Vasco (UPV/EHU)(巴斯克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Aceptado para publicación, in Spanish language. XVII Jornadas de Ingeniería Telemática (JITEL 2025)
Abstract:Facial recognition powered by Artificial Intelligence has achieved high accuracy in specific scenarios and applications. Nevertheless, it faces significant challenges regarding privacy and identity management, particularly when unknown individuals appear in the operational context. This paper presents the design, implementation, and evaluation of a facial recognition system within a federated learning framework tailored to open-set scenarios. The proposed approach integrates the OpenMax algorithm into federated learning, leveraging the exchange of mean activation vectors and local distance measures to reliably distinguish between known and unknown subjects. Experimental results validate the effectiveness of the proposed solution, demonstrating its potential for enhancing privacy-aware and robust facial recognition in distributed environments. – El reconocimiento facial impulsado por Inteligencia Artificial ha demostrado una alta precisión en algunos escenarios y aplicaciones. Sin embargo, presenta desafíos relacionados con la privacidad y la identificación de personas, especialmente considerando que pueden aparecer sujetos desconocidos para el sistema que lo implementa. En este trabajo, se propone el diseño, implementación y evaluación de un sistema de reconocimiento facial en un escenario de aprendizaje federado, orientado a conjuntos abiertos. Concretamente, se diseña una solución basada en el algoritmo OpenMax para escenarios de aprendizaje federado. La propuesta emplea el intercambio de los vectores de activación promedio y distancias locales para identificar de manera eficaz tanto personas conocidas como desconocidas. Los experimentos realizados demuestran la implementación efectiva de la solución propuesta. Comments: Aceptado para publicación, in Spanish language. XVII Jornadas de Ingeniería Telemática (JITEL 2025) Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2508.19312 [cs.CV] (or arXiv:2508.19312v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2508.19312 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-87] Advancements in Crop Analysis through Deep Learning and Explainable AI
【速读】:该论文旨在解决水稻品质分类与叶部病害诊断的自动化问题,以提升农业质量控制效率并减少人工误差。其关键解决方案是基于深度学习的多模型框架,采用卷积神经网络(Convolutional Neural Networks, CNN)对五种常见稻米品种进行高精度分类,并结合VGG16、ResNet50和MobileNetV2等预训练模型实现水稻叶片病害(如褐斑病、稻瘟病、细菌性条斑病和Tungro病)的精准识别。同时引入可解释人工智能(Explainable Artificial Intelligence, XAI)技术,如SHAP(SHapley Additive exPlanations)和LIME(Local Interpretable Model-agnostic Explanations),揭示特征对预测结果的影响机制,从而增强模型透明度与可信度,为农业场景下的智能检测系统提供可靠的技术支撑。
链接: https://arxiv.org/abs/2508.19307
作者: Hamza Khan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Master’s thesis
Abstract:Rice is a staple food of global importance in terms of trade, nutrition, and economic growth. Among Asian nations such as China, India, Pakistan, Thailand, Vietnam and Indonesia are leading producers of both long and short grain varieties, including basmati, jasmine, arborio, ipsala, and kainat saila. To ensure consumer satisfaction and strengthen national reputations, monitoring rice crops and grain quality is essential. Manual inspection, however, is labour intensive, time consuming and error prone, highlighting the need for automated solutions for quality control and yield improvement. This study proposes an automated approach to classify five rice grain varieties using Convolutional Neural Networks (CNN). A publicly available dataset of 75000 images was used for training and testing. Model evaluation employed accuracy, recall, precision, F1-score, ROC curves, and confusion matrices. Results demonstrated high classification accuracy with minimal misclassifications, confirming the model effectiveness in distinguishing rice varieties. In addition, an accurate diagnostic method for rice leaf diseases such as Brown Spot, Blast, Bacterial Blight, and Tungro was developed. The framework combined explainable artificial intelligence (XAI) with deep learning models including CNN, VGG16, ResNet50, and MobileNetV2. Explainability techniques such as SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) revealed how specific grain and leaf features influenced predictions, enhancing model transparency and reliability. The findings demonstrate the strong potential of deep learning in agricultural applications, paving the way for robust, interpretable systems that can support automated crop quality inspection and disease diagnosis, ultimately benefiting farmers, consumers, and the agricultural economy.
zh
[CV-88] Geo2Vec: Shape- and Distance-Aware Neural Representation of Geospatial Entities
【速读】:该论文旨在解决现有空间表示学习方法在地理人工智能(GeoAI)应用中面临的两大核心问题:一是现有方法通常仅针对单一类型的地理实体(如点、线、面),或通过分解实体为简单组件(如Poly2Vec)来实现傅里叶变换,导致计算成本高;二是由于变换空间缺乏几何对齐性,这些方法依赖均匀且非自适应采样,从而模糊了边缘和边界等细粒度特征。解决方案的关键在于提出Geo2Vec,该方法受符号距离场(Signed Distance Field, SDF)启发,在原始空间中直接操作,通过自适应采样并编码点的符号距离(外部为正、内部为负),无需实体分解即可捕捉几何结构;同时利用神经网络近似SDF以生成紧凑、几何感知且统一适用于所有地理实体类型的表征,并引入旋转不变的位置编码来建模高频空间变化,构建结构化且鲁棒的嵌入空间,显著提升了表示精度与效率。
链接: https://arxiv.org/abs/2508.19305
作者: Chen Chu,Cyrus Shahabi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Spatial representation learning is essential for GeoAI applications such as urban analytics, enabling the encoding of shapes, locations, and spatial relationships (topological and distance-based) of geo-entities like points, polylines, and polygons. Existing methods either target a single geo-entity type or, like Poly2Vec, decompose entities into simpler components to enable Fourier transformation, introducing high computational cost. Moreover, since the transformed space lacks geometric alignment, these methods rely on uniform, non-adaptive sampling, which blurs fine-grained features like edges and boundaries. To address these limitations, we introduce Geo2Vec, a novel method inspired by signed distance fields (SDF) that operates directly in the original space. Geo2Vec adaptively samples points and encodes their signed distances (positive outside, negative inside), capturing geometry without decomposition. A neural network trained to approximate the SDF produces compact, geometry-aware, and unified representations for all geo-entity types. Additionally, we propose a rotation-invariant positional encoding to model high-frequency spatial variations and construct a structured and robust embedding space for downstream GeoAI models. Empirical results show that Geo2Vec consistently outperforms existing methods in representing shape and location, capturing topological and distance relationships, and achieving greater efficiency in real-world GeoAI applications. Code and Data can be found at: this https URL.
zh
[CV-89] DemoBias: An Empirical Study to Trace Demographic Biases in Vision Foundation Models
【速读】:该论文旨在解决大型视觉语言模型(Large Vision Language Models, LVLMs)在生物特征人脸识别(Face Recognition, FR)任务中存在的人口统计学偏差问题,即模型在不同种族/民族、性别和年龄群体间的性能表现不一致,从而影响其公平性和可靠性。解决方案的关键在于构建一个平衡的人口统计学分布数据集,并对三种主流预训练LVLMs(LLaVA、BLIP-2 和 PaliGemma)进行微调与评估,采用组特定的BERTScore和公平性差异率(Fairness Discrepancy Rate)等量化指标,系统性地识别和追踪模型在不同群体中的性能差异。实验表明,PaliGemma 和 LLaVA 在拉丁裔/西班牙裔、高加索人种及南亚群体中表现出显著偏差,而 BLIP-2 则相对更稳定,为后续提升LVLM在FR任务中的公平性提供了实证依据与改进方向。
链接: https://arxiv.org/abs/2508.19298
作者: Abu Sufian,Anirudha Ghosh,Debaditya Barman,Marco Leo,Cosimo Distante
机构: CNR-ISASI (意大利国家研究委员会-智能系统与自动化研究所); Visva-Bharati (维斯瓦巴尔蒂大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures, 13th International Workshop on Biometrics and Forensics (IWBF)
Abstract:Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities across various downstream tasks, including biometric face recognition (FR) with description. However, demographic biases remain a critical concern in FR, as these foundation models often fail to perform equitably across diverse demographic groups, considering ethnicity/race, gender, and age. Therefore, through our work DemoBias, we conduct an empirical evaluation to investigate the extent of demographic biases in LVLMs for biometric FR with textual token generation tasks. We fine-tuned and evaluated three widely used pre-trained LVLMs: LLaVA, BLIP-2, and PaliGemma on our own generated demographic-balanced dataset. We utilize several evaluation metrics, like group-specific BERTScores and the Fairness Discrepancy Rate, to quantify and trace the performance disparities. The experimental results deliver compelling insights into the fairness and reliability of LVLMs across diverse demographic groups. Our empirical study uncovered demographic biases in LVLMs, with PaliGemma and LLaVA exhibiting higher disparities for Hispanic/Latino, Caucasian, and South Asian groups, whereas BLIP-2 demonstrated comparably consistent. Repository: this https URL.
zh
[CV-90] Large VLM-based Stylized Sports Captioning
【速读】:该论文旨在解决现有生成式 AI (Generative AI) 模型在体育场景下难以生成准确、自然且符合特定风格的赛事图像描述的问题。当前主流的大语言模型(LLM)和视觉语言模型(LVLM)虽能解释通用运动行为,但缺乏针对体育领域的专业术语和语境理解能力,导致生成内容不够精准或缺乏专业性。解决方案的关键在于提出一个两级微调的 LVLM 流水线架构:第一级聚焦于提升对体育动作与场景的细粒度识别能力,第二级则优化文本生成以匹配新闻报道风格,从而实现高精度、低延迟的实时体育赛事图像描述生成,在 Super Bowl LIX 实测中达到每 3–5 秒处理 6 张图像的性能,并显著优于基线方法(F1 提升 8–10%,BERTScore 提升 2–10%)。
链接: https://arxiv.org/abs/2508.19295
作者: Sauptik Dhar,Nicholas Buoncristiani,Joe Anakata,Haoyu Zhang,Michelle Munson
机构: Eluvio AI Labs
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The advent of large (visual) language models (LLM / LVLM) have led to a deluge of automated human-like systems in several domains including social media content generation, search and recommendation, healthcare prognosis, AI assistants for cognitive tasks etc. Although these systems have been successfully integrated in production; very little focus has been placed on sports, particularly accurate identification and natural language description of the game play. Most existing LLM/LVLMs can explain generic sports activities, but lack sufficient domain-centric sports’ jargon to create natural (human-like) descriptions. This work highlights the limitations of existing SoTA LLM/LVLMs for generating production-grade sports captions from images in a desired stylized format, and proposes a two-level fine-tuned LVLM pipeline to address that. The proposed pipeline yields an improvement 8-10% in the F1, and 2-10% in BERT score compared to alternative approaches. In addition, it has a small runtime memory footprint and fast execution time. During Super Bowl LIX the pipeline proved its practical application for live professional sports journalism; generating highly accurate and stylized captions at the rate of 6 images per 3-5 seconds for over 1000 images during the game play.
zh
[CV-91] Efficient Model-Based Purification Against Adversarial Attacks for LiDAR Segmentation
【速读】:该论文旨在解决基于LiDAR的点云分割在自动驾驶场景中易受对抗攻击的问题,尤其针对当前主流采用2D范围视图(range view)表示的高效分割流水线缺乏轻量级防御机制的现状。其解决方案的关键在于提出一种面向2D范围视图域的模型驱动净化框架,通过构建数学上可解释的优化问题设计净化网络,并引入直接作用于范围视图域的对抗攻击形式,从而在保持极低计算开销的前提下实现强对抗鲁棒性,显著优于现有的生成式防御和对抗训练基线方法。
链接: https://arxiv.org/abs/2508.19290
作者: Alexandros Gkillas,Ioulia Kapsali,Nikos Piperigkos,Aris S. Lalos
机构: Industrial Systems Institute, Athena Research Center (工业系统研究所,阿斯娜研究中心); AviSense.AI (AviSense.AI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:LiDAR-based segmentation is essential for reliable perception in autonomous vehicles, yet modern segmentation networks are highly susceptible to adversarial attacks that can compromise safety. Most existing defenses are designed for networks operating directly on raw 3D point clouds and rely on large, computationally intensive generative models. However, many state-of-the-art LiDAR segmentation pipelines operate on more efficient 2D range view representations. Despite their widespread adoption, dedicated lightweight adversarial defenses for this domain remain largely unexplored. We introduce an efficient model-based purification framework tailored for adversarial defense in 2D range-view LiDAR segmentation. We propose a direct attack formulation in the range-view domain and develop an explainable purification network based on a mathematical justified optimization problem, achieving strong adversarial resilience with minimal computational overhead. Our method achieves competitive performance on open benchmarks, consistently outperforming generative and adversarial training baselines. More importantly, real-world deployment on a demo vehicle demonstrates the framework’s ability to deliver accurate operation in practical autonomous driving scenarios.
zh
[CV-92] Seeing Like a Designer Without One: A Study on Unsupervised Slide Quality Assessment via Designer Cue Augmentation
【速读】:该论文旨在解决学术演讲中幻灯片质量评估缺乏客观、可扩展方法的问题,传统依赖人工评分的方式效率低且主观性强。解决方案的关键在于构建一个无监督的滑动质量评估流水线,融合七种专家启发的视觉设计指标(如留白、色彩丰富度、边缘密度等)与CLIP-ViT多模态嵌入,并采用基于孤立森林的异常评分机制进行综合打分。该方法在12,000张专业讲座幻灯片上训练,在6场学术演讲共115张幻灯片上验证,其Pearson相关系数达0.83,显著优于主流视觉-语言模型(提升1.79x至3.23倍),有效逼近观众对幻灯片质量的感知,实现了实时、客观的自动化反馈。
链接: https://arxiv.org/abs/2508.19289
作者: Tai Inui,Steven Oh,Magdeline Kuan
机构: Waseda University (早稻田大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages
Abstract:We present an unsupervised slide-quality assessment pipeline that combines seven expert-inspired visual-design metrics (whitespace, colorfulness, edge density, brightness contrast, text density, color harmony, layout balance) with CLIP-ViT embeddings, using Isolation Forest-based anomaly scoring to evaluate presentation slides. Trained on 12k professional lecture slides and evaluated on six academic talks (115 slides), our method achieved Pearson correlations up to 0.83 with human visual-quality ratings-1.79x to 3.23x stronger than scores from leading vision-language models (ChatGPT o4-mini-high, ChatGPT o3, Claude Sonnet 4, Gemini 2.5 Pro). We demonstrate convergent validity with visual ratings, discriminant validity against speaker-delivery scores, and exploratory alignment with overall impressions. Our results show that augmenting low-level design cues with multimodal embeddings closely approximates audience perceptions of slide quality, enabling scalable, objective feedback in real time.
zh
[CV-93] F-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models AAAI2026
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作任务中因逐帧处理视觉输入而忽略时间序列信息的问题,这导致模型对视觉噪声敏感且无法利用连续帧间的高一致性。其核心解决方案是提出一种无需训练的时序标记融合(Temporal Token Fusion, TTF)方法,关键在于通过双维度检测机制——即高效的灰度像素差异分析与基于注意力机制的语义相关性评估——实现选择性时序标记融合:一方面采用硬融合策略和关键帧锚定防止误差累积,另一方面揭示了在注意力机制中选择性重用Query矩阵不仅能保持性能,反而提升任务成功率,为未来直接复用KQV矩阵以实现计算加速提供了新方向。
链接: https://arxiv.org/abs/2508.19257
作者: Chenghao Liu,Jiachen Zhang,Chengxuan Li,Zhimu Zhou,Shixin Wu,Songfang Huang,Huiling Duan
机构: 北京大学(University of Peking)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Manuscript submitted to AAAI 2026, currently under review
Abstract:Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. This frame-by-frame processing makes models vulnerable to visual noise while ignoring the substantial coherence between consecutive frames in manipulation sequences. We propose Temporal Token Fusion (TTF), a training-free approach that intelligently integrates historical and current visual representations to enhance VLA inference quality. Our method employs dual-dimension detection combining efficient grayscale pixel difference analysis with attention-based semantic relevance assessment, enabling selective temporal token fusion through hard fusion strategies and keyframe anchoring to prevent error accumulation. Comprehensive experiments across LIBERO, SimplerEnv, and real robot tasks demonstrate consistent improvements: 4.0 percentage points average on LIBERO (72.4% vs 68.4% baseline), cross-environment validation on SimplerEnv (4.8% relative improvement), and 8.7% relative improvement on real robot tasks. Our approach proves model-agnostic, working across OpenVLA and VLA-Cache architectures. Notably, TTF reveals that selective Query matrix reuse in attention mechanisms enhances rather than compromises performance, suggesting promising directions for direct KQV matrix reuse strategies that achieve computational acceleration while improving task success rates.
zh
[CV-94] Real-Time Intuitive AI Drawing System for Collaboration: Enhancing Human Creativity through Formal and Contextual Intent Integration NEURIPS
【速读】:该论文旨在解决当前生成式AI在视觉创作中难以同时兼顾形式意图(如线条轨迹、比例和空间布局等几何特征)与情境意图(如语义和主题意义)的难题,从而实现更自然、具象且富有语义连贯性的实时图像生成。其解决方案的关键在于构建一个双信号协同控制的多阶段生成流程:一方面通过视觉语言模型提取高阶语义线索,另一方面直接分析草图的底层几何特征,并将二者联合约束于一个包含轮廓保持结构控制与风格及内容感知图像合成的生成管道中,最终实现了低延迟、支持多人协作的同步共创平台。
链接: https://arxiv.org/abs/2508.19254
作者: Jookyung Song,Mookyoung Kang,Nojun Kwak
机构: 1. Korea University (韩国科学技术院); 2. Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 6 pages, 4 figures, NeurIPS Creative AI Track 2025
Abstract:This paper presents a real-time generative drawing system that interprets and integrates both formal intent - the structural, compositional, and stylistic attributes of a sketch - and contextual intent - the semantic and thematic meaning inferred from its visual content - into a unified transformation process. Unlike conventional text-prompt-based generative systems, which primarily capture high-level contextual descriptions, our approach simultaneously analyzes ground-level intuitive geometric features such as line trajectories, proportions, and spatial arrangement, and high-level semantic cues extracted via vision-language models. These dual intent signals are jointly conditioned in a multi-stage generation pipeline that combines contour-preserving structural control with style- and content-aware image synthesis. Implemented with a touchscreen-based interface and distributed inference architecture, the system achieves low-latency, two-stage transformation while supporting multi-user collaboration on shared canvases. The resulting platform enables participants, regardless of artistic expertise, to engage in synchronous, co-authored visual creation, redefining human-AI interaction as a process of co-creation and mutual enhancement.
zh
[CV-95] Combating Digitally Altered Images: Deepfake Detection
【速读】:该论文旨在解决深度伪造(Deepfake)技术生成的高保真图像和视频对公众及监管机构带来的识别难题。其解决方案的关键在于提出一种基于改进型视觉Transformer(Vision Transformer, ViT)模型的鲁棒检测方法,通过在OpenForensics数据集子集上训练,并结合多种数据增强技术提升模型对多样化图像篡改的适应能力;同时采用过采样与分层抽样策略缓解类别不平衡问题,最终在测试集上实现了当前最优的检测性能。
链接: https://arxiv.org/abs/2508.16975
作者: Saksham Kumar,Rhythm Narang
机构: Amrita School of Computing (阿姆里塔计算学院); Amrita Vishwa Vidyapeetham (阿姆里塔世界大学); Dept of Computer Science and Engineering (计算机科学与工程系); Thapar Institute of Engineering & Technology (塔帕尔工程与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:The rise of Deepfake technology to generate hyper-realistic manipulated images and videos poses a significant challenge to the public and relevant authorities. This study presents a robust Deepfake detection based on a modified Vision Transformer(ViT) model, trained to distinguish between real and Deepfake images. The model has been trained on a subset of the OpenForensics Dataset with multiple augmentation techniques to increase robustness for diverse image manipulations. The class imbalance issues are handled by oversampling and a train-validation split of the dataset in a stratified manner. Performance is evaluated using the accuracy metric on the training and testing datasets, followed by a prediction score on a random image of people, irrespective of their realness. The model demonstrates state-of-the-art results on the test dataset to meticulously detect Deepfake images.
zh
[CV-96] AT-CXR: Uncertainty-Aware Agent ic Triage for Chest X-rays
【速读】:该论文旨在解决医疗影像分诊中缺乏真正自主决策能力的问题,即在实际临床约束下,系统如何智能地决定何时停止、升级或推迟处理(decision-making under real constraints)。其解决方案的关键在于提出一种基于不确定性的代理模型AT-CXR,该模型能够对每例胸片图像估计置信度和分布拟合度,并依据分步策略自动作出诊断决策或选择性地放弃判断并建议人工介入。该方法通过两种路由器设计——确定性规则路由与大语言模型(LLM)决策路由——实现了在保持高准确率的同时显著降低延迟,且在选择性预测性能上优于主流零样本视觉-语言模型和监督分类器,体现出更低的风险-覆盖率曲线下面积(AURC)和更高覆盖率下的误差率。
链接: https://arxiv.org/abs/2508.19322
作者: Xueyang Li,Mingze Jiang,Gelei Xu,Jun Xia,Mengzhao Jia,Danny Chen,Yiyu Shi
机构: University of Notre Dame (圣母大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Agentic AI is advancing rapidly, yet truly autonomous medical-imaging triage, where a system decides when to stop, escalate, or defer under real constraints, remains relatively underexplored. To address this gap, we introduce AT-CXR, an uncertainty-aware agent for chest X-rays. The system estimates per-case confidence and distributional fit, then follows a stepwise policy to issue an automated decision or abstain with a suggested label for human intervention. We evaluate two router designs that share the same inputs and actions: a deterministic rule-based router and an LLM-decided router. Across five-fold evaluation on a balanced subset of NIH ChestX-ray14 dataset, both variants outperform strong zero-shot vision-language models and state-of-the-art supervised classifiers, achieving higher full-coverage accuracy and superior selective-prediction performance, evidenced by a lower area under the risk-coverage curve (AURC) and a lower error rate at high coverage, while operating with lower latency that meets practical clinical constraints. The two routers provide complementary operating points, enabling deployments to prioritize maximal throughput or maximal accuracy. Our code is available at this https URL.
zh
[CV-97] MedVQA-TREE: A Multimodal Reasoning and Retrieval Framework for Sarcopenia Prediction
【速读】:该论文旨在解决超声影像中骨骼肌减少症(sarcopenia)精准诊断的难题,主要挑战包括影像特征细微、标注数据有限以及现有模型缺乏临床语境。其解决方案的关键在于提出MedVQA-TREE这一多模态框架,通过三个核心组件实现:(1) 分层图像解析模块(包含解剖分类、区域分割和基于图的空间推理),以捕捉粗粒度至细粒度的结构信息;(2) 门控特征融合机制,实现视觉特征与文本查询的选择性融合;(3) 基于UMLS引导的多跳、多查询检索策略,从PubMed及专用骨骼肌减少症知识库中获取临床知识。该方法显著提升了诊断准确率(最高达99%),并优于此前最先进方法超过10%,验证了结构化视觉理解与引导式知识检索相结合在AI辅助诊断中的有效性。
链接: https://arxiv.org/abs/2508.19319
作者: Pardis Moradbeiki,Nasser Ghadiri,Sayed Jalal Zahabi,Uffe Kock Wiil,Kristoffer Kittelmann Brockhattingen,Ali Ebrahimi
机构: University of Southern Denmark (南丹麥大學); Iran University of Technology (伊朗科技大學); Roskilde University (罗斯基勒大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate sarcopenia diagnosis via ultrasound remains challenging due to subtle imaging cues, limited labeled data, and the absence of clinical context in most models. We propose MedVQA-TREE, a multimodal framework that integrates a hierarchical image interpretation module, a gated feature-level fusion mechanism, and a novel multi-hop, multi-query retrieval strategy. The vision module includes anatomical classification, region segmentation, and graph-based spatial reasoning to capture coarse, mid-level, and fine-grained structures. A gated fusion mechanism selectively integrates visual features with textual queries, while clinical knowledge is retrieved through a UMLS-guided pipeline accessing PubMed and a sarcopenia-specific external knowledge base. MedVQA-TREE was trained and evaluated on two public MedVQA datasets (VQA-RAD and PathVQA) and a custom sarcopenia ultrasound dataset. The model achieved up to 99% diagnostic accuracy and outperformed previous state-of-the-art methods by over 10%. These results underscore the benefit of combining structured visual understanding with guided knowledge retrieval for effective AI-assisted diagnosis in sarcopenia.
zh
[CV-98] 2D Ultrasound Elasticity Imaging of Abdominal Aortic Aneurysms Using Deep Neural Networks
【速读】:该论文旨在解决腹主动脉瘤(Abdominal Aortic Aneurysm, AAA)破裂风险评估中仅依赖最大直径所带来的局限性问题,因为直径无法反映血管壁材料属性(如弹性模量)对破裂风险的关键影响。解决方案的关键在于提出一种基于深度学习的弹性成像框架,利用有限元仿真生成包含位移场与对应模量分布的数据集,并采用U-Net架构结合归一化均方误差(Normalized Mean Squared Error, NMSE)损失函数,从二维超声获取的轴向和横向位移分量中重建出组织弹性模量的空间分布。该方法在数字仿真、物理仿体及临床超声数据上均验证了其高精度与泛化能力,相较于传统迭代方法具有更快的计算效率,为无创评估AAA破裂风险提供了新路径。
链接: https://arxiv.org/abs/2508.19303
作者: Utsav Ratna Tuladhar,Richard Simon,Doran Mix,Michael Richards
机构: Rochester Institute of Technology (罗切斯特理工学院); University of Rochester Medical Center (罗切斯特大学医学中心)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Abdominal aortic aneurysms (AAA) pose a significant clinical risk due to their potential for rupture, which is often asymptomatic but can be fatal. Although maximum diameter is commonly used for risk assessment, diameter alone is insufficient as it does not capture the properties of the underlying material of the vessel wall, which play a critical role in determining the risk of rupture. To overcome this limitation, we propose a deep learning-based framework for elasticity imaging of AAAs with 2D ultrasound. Leveraging finite element simulations, we generate a diverse dataset of displacement fields with their corresponding modulus distributions. We train a model with U-Net architecture and normalized mean squared error (NMSE) to infer the spatial modulus distribution from the axial and lateral components of the displacement fields. This model is evaluated across three experimental domains: digital phantom data from 3D COMSOL simulations, physical phantom experiments using biomechanically distinct vessel models, and clinical ultrasound exams from AAA patients. Our simulated results demonstrate that the proposed deep learning model is able to reconstruct modulus distributions, achieving an NMSE score of 0.73%. Similarly, in phantom data, the predicted modular ratio closely matches the expected values, affirming the model’s ability to generalize to phantom data. We compare our approach with an iterative method which shows comparable performance but higher computation time. In contrast, the deep learning method can provide quick and effective estimates of tissue stiffness from ultrasound images, which could help assess the risk of AAA rupture without invasive procedures.
zh
[CV-99] CellINR: Implicitly Overcoming Photo-induced Artifacts in 4D Live Fluorescence Microscopy
【速读】:该论文旨在解决4D活细胞荧光显微成像中因长时间高强度光照导致的光漂白(photobleaching)和光毒性(phototoxicity)问题,这些问题会引发光诱导伪影(photo-induced artifacts),严重损害图像连续性和结构细节恢复。解决方案的关键在于提出CellINR框架,该框架基于隐式神经表示(implicit neural representation)构建了一种特定于案例的优化方法,通过盲卷积(blind convolution)与结构增强策略将三维空间坐标映射至高频域,从而实现对细胞结构的精确建模与高保真重建,并有效区分真实信号与伪影。
链接: https://arxiv.org/abs/2508.19300
作者: Cunmin Zhao,Ziyuan Luo,Guoye Guan,Zelin Li,Yiming Ma,Zhongying Zhao,Renjie Wan
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 4 figures
Abstract:4D live fluorescence microscopy is often compromised by prolonged high intensity illumination which induces photobleaching and phototoxic effects that generate photo-induced artifacts and severely impair image continuity and detail recovery. To address this challenge, we propose the CellINR framework, a case-specific optimization approach based on implicit neural representation. The method employs blind convolution and structure amplification strategies to map 3D spatial coordinates into the high frequency domain, enabling precise modeling and high-accuracy reconstruction of cellular structures while effectively distinguishing true signals from artifacts. Experimental results demonstrate that CellINR significantly outperforms existing techniques in artifact removal and restoration of structural continuity, and for the first time, a paired 4D live cell imaging dataset is provided for evaluating reconstruction performance, thereby offering a solid foundation for subsequent quantitative analyses and biological research. The code and dataset will be public.
zh
[CV-100] Modeling spectral filtering effects on color-matching functions: Implications for observer variability
【速读】:该论文旨在解决颜色匹配函数(Color-Matching Functions, CMFs)在不同观测者间存在差异的问题,特别是探讨这些差异是否可归因于年龄相关的晶状体黄化(lens yellowing)现象。其核心解决方案在于提出一种基于光谱滤波的建模方法:通过引入一个单参数的“黄色”滤波器(short-wavelength suppressing filter),能够有效将一组CMFs(如Stiles和Burch 1955年数据)转换为另一组CMFs(如ICVIO数据),从而解释观测者间CMF差异的本质并非个体生理差异,而是由年龄相关的眼内光学变化所致。该方法的关键创新在于用一个滤波器传递函数替代传统上需三个独立函数(即三刺激值响应函数)来表征观察者变异性,显著降低了实验复杂度并保持了对个体色觉差异的准确刻画能力。
链接: https://arxiv.org/abs/2508.19291
作者: Luvin Munish Ragoo,Ivar Farup,Casper F. Andersen,Graham Finlayson
机构: NTNU, Norway; UEA, UK
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This study investigates the impact of spectral filtering on color-matching functions (CMFs) and its implications for observer variability modeling. We conducted color matching experiments with a single observer, both with and without a spectral filter in front of a bipartite field. Using a novel computational approach, we estimated the filter transmittance and transformation matrix necessary to convert unfiltered CMFs to filtered CMFs. Statistical analysis revealed good agreement between estimated and measured filter characteristics, particularly in central wavelength regions. Applying this methodology to compare between Stiles and Burch 1955 (SB1955) mean observer CMFs and our previously published “ICVIO” mean observer CMFs, we identified a “yellow” (short-wavelength suppressing) filter that effectively transforms between these datasets. This finding aligns with our hypothesis that observed differences between the CMF sets are attributable to age-related lens yellowing (average observer age: 49 years in ICVIO versus 30 years in SB1955). Our approach enables efficient representation of observer variability through a single filter rather than three separate functions, offering potentially reduced experimental overhead while maintaining accuracy in characterizing individual color vision differences.
zh
[CV-101] Saccade crossing avoidance as a visual search strategy
【速读】:该论文旨在解决视觉搜索过程中眼动轨迹的非随机性问题,特别是量化长期扫描路径历史对当前眼动选择的影响。传统研究多关注最近几次注视点的局部依赖性(如返回抑制),但对更长时间跨度的路径记忆效应缺乏明确刻画。其解决方案的关键在于引入“自交叉回避”(self-crossing avoidance)这一新概念——即眼球运动倾向于避免与先前短幅扫视路径相交,且该效应在最近约7秒的扫描路径中最为显著。作者通过步进选择框架(step-selection framework)和马尔可夫非参数模型对比真实数据与合成数据,证实了该机制的存在及其个体差异,并构建包含自交叉惩罚项的参数化概率模型成功再现了扫视长度与交叉行为的联合统计特性,从而揭示了一种基于局部空间记忆的定向策略,有助于理解视觉场景探索中的认知优化机制。
链接: https://arxiv.org/abs/2508.18404
作者: Alex Szorkovszky,Rujeena Mathema,Pedro Lencastre,Pedro Lind,Anis Yazidi
机构: Simula Research Laboratory (Simula 研究实验室); Oslo Metropolitan University (奥斯陆城市大学); University of Oslo (奥斯陆大学); Kristiania University of Applied Sciences (克里斯蒂安尼亚应用科学大学)
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: Main text: 11 pages, 4 figures; Supplementary info: 12 pages, 9 figures
Abstract:Although visual search appears largely random, several oculomotor biases exist such that the likelihoods of saccade directions and lengths depend on the previous scan path. Compared to the most recent fixations, the impact of the longer path history is more difficult to quantify. Using the step-selection framework commonly used in movement ecology, and analyzing data from 45-second viewings of “Where’s Waldo”?, we report a new memory-dependent effect that also varies significantly between individuals, which we term self-crossing avoidance. This is a tendency for saccades to avoid crossing those earlier in the scan path, and is most evident when both have small amplitudes. We show this by comparing real data to synthetic data generated from a memoryless approximation of the spatial statistics (i.e. a Markovian nonparametric model with a matching distribution of saccade lengths over time). Maximum likelihood fitting indicates that this effect is strongest when including the last \approx 7 seconds of a scan path. The effect size is comparable to well-known forms of history dependence such as inhibition of return. A parametric probabilistic model including a self-crossing penalty term was able to reproduce joint statistics of saccade lengths and self-crossings. We also quantified individual strategic differences, and their consistency over the six images viewed per participant, using mixed-effect regressions. Participants with a higher tendency to avoid crossings displayed smaller saccade lengths and shorter fixation durations on average, but did not display more horizontal, vertical, forward or reverse saccades. Together, these results indicate that the avoidance of crossings is a local orienting strategy that facilitates and complements inhibition of return, and hence exploration of visual scenes.
zh
人工智能
[AI-0] Discrete-Guided Diffusion for Scalable and Safe Multi-Robot Motion Planning
【速读】:该论文旨在解决多机器人运动规划(Multi-Robot Motion Planning, MRMP)中路径质量与可扩展性之间的矛盾问题:传统离散多智能体路径规划(MAPF)方法虽具良好的可扩展性,但因粗粒度离散化导致轨迹质量较低;而基于连续优化的规划器虽能生成高质量路径,却受限于维度灾难,难以扩展至大规模机器人场景。解决方案的关键在于提出一种名为“离散引导扩散”(Discrete-Guided Diffusion, DGD)的新框架,其核心创新包括:(1) 将原非凸MRMP问题分解为具有凸配置空间的可处理子问题;(2) 融合离散MAPF解与约束优化技术,引导生成式扩散模型捕捉机器人间的复杂时空依赖关系;(3) 引入轻量级约束修复机制以保障轨迹可行性。该方法在复杂大尺度环境中实现了100个机器人的高效规划,显著提升了成功率和计算效率,达到当前最优性能。
链接: https://arxiv.org/abs/2508.20095
作者: Jinhao Liang,Sven Koenig,Ferdinando Fioretto
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Multi-Robot Motion Planning (MRMP) involves generating collision-free trajectories for multiple robots operating in a shared continuous workspace. While discrete multi-agent path finding (MAPF) methods are broadly adopted due to their scalability, their coarse discretization severely limits trajectory quality. In contrast, continuous optimization-based planners offer higher-quality paths but suffer from the curse of dimensionality, resulting in poor scalability with respect to the number of robots. This paper tackles the limitations of these two approaches by introducing a novel framework that integrates discrete MAPF solvers with constrained generative diffusion models. The resulting framework, called Discrete-Guided Diffusion (DGD), has three key characteristics: (1) it decomposes the original nonconvex MRMP problem into tractable subproblems with convex configuration spaces, (2) it combines discrete MAPF solutions with constrained optimization techniques to guide diffusion models capture complex spatiotemporal dependencies among robots, and (3) it incorporates a lightweight constraint repair mechanism to ensure trajectory feasibility. The proposed method sets a new state-of-the-art performance in large-scale, complex environments, scaling to 100 robots while achieving planning efficiency and high success rates.
zh
[AI-1] Model Science: getting serious about verification explanation and control of AI systems
【速读】:该论文试图解决当前人工智能发展中因基础模型(foundation models)广泛应用而引发的系统性挑战,即如何从以数据为中心的数据科学范式转向以模型为核心的Model Science范式,从而实现对模型行为的深入理解、有效控制与安全部署。其解决方案的关键在于提出一个包含四大支柱的理论框架:验证(Verification)、解释(Explanation)、控制(Control)和接口(Interface),其中验证要求建立严格且情境感知的评估协议,解释旨在通过多种方法揭示模型内部运作机制,控制融合对齐技术以引导模型行为,接口则开发交互式可视化工具提升人类校准与决策能力,共同支撑可信、安全且符合人类价值观的AI系统发展。
链接: https://arxiv.org/abs/2508.20040
作者: Przemyslaw Biecek,Wojciech Samek
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages
Abstract:The growing adoption of foundation models calls for a paradigm shift from Data Science to Model Science. Unlike data-centric approaches, Model Science places the trained model at the core of analysis, aiming to interact, verify, explain, and control its behavior across diverse operational contexts. This paper introduces a conceptual framework for a new discipline called Model Science, along with the proposal for its four key pillars: Verification, which requires strict, context-aware evaluation protocols; Explanation, which is understood as various approaches to explore of internal model operations; Control, which integrates alignment techniques to steer model behavior; and Interface, which develops interactive and visual explanation tools to improve human calibration and decision-making. The proposed framework aims to guide the development of credible, safe, and human-aligned AI systems.
zh
[AI-2] Large Language Models (LLM s) for Electronic Design Automation (EDA)
【速读】:该论文旨在解决现代集成电路(Integrated Circuit, IC)设计流程日益复杂所带来的效率低下问题,尤其是在从设计到制造的全流程中存在大量迭代、人工依赖性强且易出错的痛点。其解决方案的关键在于将大语言模型(Large Language Models, LLMs)引入电子设计自动化(Electronic Design Automation, EDA)领域,利用LLMs在上下文理解、逻辑推理和生成能力方面的优势,将硬件设计及中间脚本以文本形式表示,并通过LLM实现对整个EDA流程的简化甚至自动化,从而提升开发效率并降低人为错误风险。
链接: https://arxiv.org/abs/2508.20030
作者: Kangwei Xu,Denis Schwachhofer,Jason Blocklove,Ilia Polian,Peter Domanski,Dirk Pflüger,Siddharth Garg,Ramesh Karri,Ozgur Sinanoglu,Johann Knechtel,Zhuorui Zhao,Ulf Schlichtmann,Bing Li
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
备注: Accepted by IEEE International System-on-Chip Conference
Abstract:With the growing complexity of modern integrated circuits, hardware engineers are required to devote more effort to the full design-to-manufacturing workflow. This workflow involves numerous iterations, making it both labor-intensive and error-prone. Therefore, there is an urgent demand for more efficient Electronic Design Automation (EDA) solutions to accelerate hardware development. Recently, large language models (LLMs) have shown remarkable advancements in contextual comprehension, logical reasoning, and generative capabilities. Since hardware designs and intermediate scripts can be represented as text, integrating LLM for EDA offers a promising opportunity to simplify and even automate the entire workflow. Accordingly, this paper provides a comprehensive overview of incorporating LLMs into EDA, with emphasis on their capabilities, limitations, and future opportunities. Three case studies, along with their outlook, are introduced to demonstrate the capabilities of LLMs in hardware design, testing, and optimization. Finally, future directions and challenges are highlighted to further explore the potential of LLMs in shaping the next-generation EDA, providing valuable insights for researchers interested in leveraging advanced AI technologies for EDA.
zh
[AI-3] HPC Digital Twins for Evaluating Scheduling Policies Incentive Structures and their Impact on Power and Cooling
【速读】:该论文旨在解决高性能计算(High-Performance Computing, HPC)中调度器评估受限于部署后分析或仿真器无法建模实际基础设施的问题。传统方法难以在真实部署前预测调度策略对物理资源的影响,也无法支持对生产环境中不易实现的配置变更进行探索。解决方案的关键在于首次将调度功能与数字孪生(Digital Twin)技术集成到HPC系统中,构建了一个扩展了调度能力的数字孪生框架,支持在虚拟环境中进行“假设性”(what-if)分析,从而在部署前评估不同参数配置和调度决策对物理资产的影响,并进一步支持激励机制设计与基于机器学习的调度算法原型验证。这一创新使HPC系统的可持续性评估和调度优化成为可能。
链接: https://arxiv.org/abs/2508.20016
作者: Matthias Maiterth,Wesley H. Brewer,Jaya S. Kuruvella,Arunavo Dey,Tanzima Z. Islam,Kevin Menear,Dmitry Duplyakin,Rashadul Kabir,Tapasya Patki,Terry Jones,Feiyi Wang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
备注:
Abstract:Schedulers are critical for optimal resource utilization in high-performance computing. Traditional methods to evaluate schedulers are limited to post-deployment analysis, or simulators, which do not model associated infrastructure. In this work, we present the first-of-its-kind integration of scheduling and digital twins in HPC. This enables what-if studies to understand the impact of parameter configurations and scheduling decisions on the physical assets, even before deployment, or regarching changes not easily realizable in production. We (1) provide the first digital twin framework extended with scheduling capabilities, (2) integrate various top-tier HPC systems given their publicly available datasets, (3) implement extensions to integrate external scheduling simulators. Finally, we show how to (4) implement and evaluate incentive structures, as-well-as (5) evaluate machine learning based scheduling, in such novel digital-twin based meta-framework to prototype scheduling. Our work enables what-if scenarios of HPC systems to evaluate sustainability, and the impact on the simulated system.
zh
[AI-4] Decomposing Behavioral Phase Transitions in LLM s: Order Parameters for Emergent Misalignment
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在窄域有害数据集上微调时,可能引发与人类价值观广泛不一致的行为问题,即所谓的“涌现性错位”(emergent misalignment)。其核心挑战在于如何识别和量化微调过程中这种行为突变的时机与机制。解决方案的关键在于构建一个综合框架,结合分布变化检测方法与由自然语言表述的序参量(order parameters),并通过LLM评判者进行评估;该框架能够自动发现并量化语言层面的序参量,并利用客观统计差异度量分解模型输出分布变化中不同维度(如对齐度、冗余度等)的贡献比例,从而揭示实际行为转变发生在梯度范数峰值之后,提升了对微调过程动态演化的理解精度。
链接: https://arxiv.org/abs/2508.20015
作者: Julian Arnold,Niels Lörch
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11+25 pages, 4+11 figures
Abstract:Fine-tuning LLMs on narrowly harmful datasets can lead to behavior that is broadly misaligned with respect to human values. To understand when and how this emergent misalignment occurs, we develop a comprehensive framework for detecting and characterizing rapid transitions during fine-tuning using both distributional change detection methods as well as order parameters that are formulated in plain English and evaluated by an LLM judge. Using an objective statistical dissimilarity measure, we quantify how the phase transition that occurs during fine-tuning affects multiple aspects of the model. In particular, we assess what percentage of the total distributional change in model outputs is captured by different aspects, such as alignment or verbosity, providing a decomposition of the overall transition. We also find that the actual behavioral transition occurs later in training than indicated by the peak in the gradient norm alone. Our framework enables the automated discovery and quantification of language-based order parameters, which we demonstrate on examples ranging from knowledge questions to politics and ethics.
zh
[AI-5] Cross-Platform E-Commerce Product Categorization and Recategorization: A Multimodal Hierarchical Classification Approach
【速读】:该论文旨在解决电商产品分类中的两大关键工业挑战:平台异构性(platform heterogeneity)和现有分类体系的结构局限性(structural limitations of existing taxonomies)。为应对这些问题,研究提出并部署了一个多模态层次化分类框架,其核心创新在于融合文本(RoBERTa)、视觉(ViT)及视觉-语言联合表征(CLIP)特征,并采用基于MLP的晚期融合策略(late-fusion strategy)以实现最高层次F1得分(98.59%)。此外,通过自监督的“产品再分类”流水线(使用SimCLR、UMAP与级联聚类),进一步挖掘出细粒度的新类别(如“鞋类”子类),簇纯度超过86%,从而提升分类的深度与一致性。该方案在跨平台部署中展现出权衡:复杂晚期融合方法在多样化训练数据下精度最优,而简单早期融合方法更具泛化能力;最终通过两阶段推理架构(轻量级RoBERTa + GPU加速多模态阶段)实现了工业级可扩展性与成本-精度平衡。
链接: https://arxiv.org/abs/2508.20013
作者: Lotte Gross,Rebecca Walter,Nicole Zoppi,Adrien Justus,Alessandro Gambetti,Qiwei Han,Maximilian Kaiser
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 10 pages, 5 figures, 3 tables
Abstract:This study addresses critical industrial challenges in e-commerce product categorization, namely platform heterogeneity and the structural limitations of existing taxonomies, by developing and deploying a multimodal hierarchical classification framework. Using a dataset of 271,700 products from 40 international fashion e-commerce platforms, we integrate textual features (RoBERTa), visual features (ViT), and joint vision–language representations (CLIP). We investigate fusion strategies, including early, late, and attention-based fusion within a hierarchical architecture enhanced by dynamic masking to ensure taxonomic consistency. Results show that CLIP embeddings combined via an MLP-based late-fusion strategy achieve the highest hierarchical F1 (98.59%), outperforming unimodal baselines. To address shallow or inconsistent categories, we further introduce a self-supervised product recategorization'' pipeline using SimCLR, UMAP, and cascade clustering, which discovered new, fine-grained categories (e.g., subtypes of
Shoes’') with cluster purities above 86%. Cross-platform experiments reveal a deployment-relevant trade-off: complex late-fusion methods maximize accuracy with diverse training data, while simpler early-fusion methods generalize more effectively to unseen platforms. Finally, we demonstrate the framework’s industrial scalability through deployment in EURWEB’s commercial transaction intelligence platform via a two-stage inference pipeline, combining a lightweight RoBERTa stage with a GPU–accelerated multimodal stage to balance cost and accuracy.
zh
[AI-6] Flocking Behavior: An Innovative Inspiration for the Optimization of Production Plants
【速读】:该论文旨在解决半导体制造工厂中因设备类型切换频繁而导致的作业车间调度(Job-Shop Scheduling)优化难题,尤其是在大规模生产环境中,传统线性优化方法难以在合理时间内求解。其解决方案的关键在于引入基于生物启发的“boids” flocking 群体智能算法,该算法仅依赖局部信息和简单启发式规则进行决策,通过模拟群体动物的 flocking 行为来动态响应不同机器类型之间的切换需求,从而实现无需全局计算的分布式优化策略。
链接: https://arxiv.org/abs/2508.19963
作者: M. Umlauft,M. Schranz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This is the author’s version of a paper reviewed and accepted by the 9th International Symposium on Swarm Behavior and Bio-Inspired Robotics 2025. Authors were not able to present it due to time constraints. 3 Tables, 5 Figures
Abstract:Optimizing modern production plants using the job-shop principle is a known hard problem. For very large plants, like semiconductor fabs, the problem becomes unsolvable on a plant-wide scale in a reasonable amount of time using classical linear optimization. An alternative approach is the use of swarm intelligence algorithms. These have been applied to the job-shop problem before, but often in a centrally calculated way where they are applied to the solution space, but they can be implemented in a bottom-up fashion to avoid global result computation as well. One of the problems in semiconductor production is that the production process requires a lot of switching between machines that process lots one after the other and machines that process batches of lots at once, often with long processing times. In this paper, we address this switching problem with the ``boids’’ flocking algorithm that was originally used in robotics and movie industry. The flocking behavior is a bio-inspired algorithm that uses only local information and interaction based on simple heuristics. We show that this algorithm addresses these valid considerations in production plant optimization, as it reacts to the switching of machine kinds similar to how a swarm of flocking animals would react to obstacles in its course.
zh
[AI-7] CASE: An Agent ic AI Framework for Enhancing Scam Intelligence in Digital Payments
【速读】:该论文旨在解决数字支付平台中日益复杂的社交工程诈骗问题,其核心挑战在于诈骗活动通常在支付平台外部多渠道发起,导致仅依赖用户或交易行为信号难以全面理解诈骗手法和模式,从而影响及时防控。解决方案的关键在于提出一种名为CASE(Conversational Agent for Scam Elucidation)的新型智能体AI框架,通过设计一个对话式代理主动访谈潜在受害者以获取详细情报,并将对话转录文本转化为结构化数据供自动化与人工处置机制使用,从而实现对诈骗行为的深度洞察与高效响应。
链接: https://arxiv.org/abs/2508.19932
作者: Nitish Jaipuria,Lorenzo Gatto,Zijun Kan,Shankey Poddar,Bill Cheung,Diksha Bansal,Ramanan Balakrishnan,Aviral Suri,Jose Estevez
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures
Abstract:The proliferation of digital payment platforms has transformed commerce, offering unmatched convenience and accessibility globally. However, this growth has also attracted malicious actors, leading to a corresponding increase in sophisticated social engineering scams. These scams are often initiated and orchestrated on multiple surfaces outside the payment platform, making user and transaction-based signals insufficient for a complete understanding of the scam’s methodology and underlying patterns, without which it is very difficult to prevent it in a timely manner. This paper presents CASE (Conversational Agent for Scam Elucidation), a novel Agentic AI framework that addresses this problem by collecting and managing user scam feedback in a safe and scalable manner. A conversational agent is uniquely designed to proactively interview potential victims to elicit intelligence in the form of a detailed conversation. The conversation transcripts are then consumed by another AI system that extracts information and converts it into structured data for downstream usage in automated and manual enforcement mechanisms. Using Google’s Gemini family of LLMs, we implemented this framework on Google Pay (GPay) India. By augmenting our existing features with this new intelligence, we have observed a 21% uplift in the volume of scam enforcements. The architecture and its robust evaluation framework are highly generalizable, offering a blueprint for building similar AI-driven systems to collect and manage scam intelligence in other sensitive domains.
zh
[AI-8] Generative AI for Testing of Autonomous Driving Systems: A Survey
【速读】:该论文旨在解决自动驾驶系统(Autonomous Driving Systems, ADS)在大规模部署前,如何实现有效且高效的测试这一关键挑战。当前ADS需在多样化的道路场景中验证其功能与安全性,而传统测试方法难以覆盖复杂多变的实际驾驶情境。论文提出的核心解决方案是系统性地应用生成式AI(Generative AI)技术,通过分析91篇相关研究,归纳出其在ADS测试中的六大主要应用场景,尤其聚焦于基于场景的测试范式;同时梳理了广泛使用的数据集、仿真平台、评估指标和基准,并识别出27项现存局限,为未来研究指明方向。
链接: https://arxiv.org/abs/2508.19882
作者: Qunying Song,He Ye,Mark Harman,Federica Sarro
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 67 pages, 6 figures, 29 tables
Abstract:Autonomous driving systems (ADS) have been an active area of research, with the potential to deliver significant benefits to society. However, before large-scale deployment on public roads, extensive testing is necessary to validate their functionality and safety under diverse driving conditions. Therefore, different testing approaches are required, and achieving effective and efficient testing of ADS remains an open challenge. Recently, generative AI has emerged as a powerful tool across many domains, and it is increasingly being applied to ADS testing due to its ability to interpret context, reason about complex tasks, and generate diverse outputs. To gain a deeper understanding of its role in ADS testing, we systematically analyzed 91 relevant studies and synthesized their findings into six major application categories, primarily centered on scenario-based testing of ADS. We also reviewed their effectiveness and compiled a wide range of datasets, simulators, ADS, metrics, and benchmarks used for evaluation, while identifying 27 limitations. This survey provides an overview and practical insights into the use of generative AI for testing ADS, highlights existing challenges, and outlines directions for future research in this rapidly evolving field.
zh
[AI-9] racking World States with Language Models: State-Based Evaluation Using Chess ICML2025
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在结构化环境中是否能够内化高保真世界模型表示的问题,尤其关注其在长期序列中保持状态一致性与语义连贯性的能力。传统基于字符串的评估方法难以反映模型对环境规则和状态转移的深层理解,而现有探针技术又依赖于特定模型内部激活,缺乏可解释性和泛化性。解决方案的关键在于提出一种模型无关的状态驱动评估框架,以国际象棋为基准任务,通过分析下游合法走法分布(state affordances)来估计预测状态与真实状态之间的语义保真度。该方法不依赖模型内部结构,直接从输出行为出发衡量模型对符号化环境的推理一致性,从而更准确地揭示LLMs在复杂规则系统中的结构性认知缺陷。
链接: https://arxiv.org/abs/2508.19851
作者: Romain Harang,Jason Naradowsky,Yaswitha Gujju,Yusuke Miyao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Spotlight presentation at ICML 2025 Workshop on Assessing World Models
Abstract:Large Language Models (LLMs) exhibit emergent capabilities in structured domains, suggesting they may implicitly internalize high-fidelity representations of world models. While probing techniques have shown promising signs of this in scientific and game-based settings, they rely on model-specific internal activations, which limit interpretability and generalizability. In this work, we propose a model-agnostic, state-based evaluation framework using chess as a benchmark to assess whether LLMs preserve the semantics of structured environments. Our method analyzes the downstream legal move distributions (state affordances) to estimate semantic fidelity between predicted and actual game states. This approach offers a more meaningful evaluation than conventional string-based metrics by aligning more closely with the strategic and rule-governed nature of chess. Experimental results demonstrate that our metrics capture deficiencies in state-tracking, highlighting limitations of LLMs in maintaining coherent internal models over long sequences. Our framework provides a robust tool for evaluating structured reasoning in LLMs without requiring internal model access, and generalizes to a wide class of symbolic environments.
zh
[AI-10] PSO-Merging: Merging Models Based on Particle Swarm Optimization
【速读】:该论文旨在解决现有模型合并(model merging)方法在性能和效率上的局限性问题,尤其是数据无关方法因缺乏数据驱动指导而导致性能不足,而数据驱动方法中梯度-based 方法计算开销大、难以应用于大规模专家模型,梯度-free 方法则常在有限优化步数内难以达到满意效果。其解决方案的关键在于提出一种基于粒子群优化(Particle Swarm Optimization, PSO)的数据驱动合并方法——PSO-Merging:通过将预训练模型、专家模型及其稀疏化版本作为初始粒子群,利用PSO算法迭代优化,最终以全局最优粒子作为合并后的模型,从而实现更高效、可扩展且性能优越的多任务模型构建。
链接: https://arxiv.org/abs/2508.19839
作者: Kehao Zhang,Shaolei Zhang,Yang Feng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Model merging has emerged as an efficient strategy for constructing multitask models by integrating the strengths of multiple available expert models, thereby reducing the need to fine-tune a pre-trained model for all the tasks from scratch. Existing data-independent methods struggle with performance limitations due to the lack of data-driven guidance. Data-driven approaches also face key challenges: gradient-based methods are computationally expensive, limiting their practicality for merging large expert models, whereas existing gradient-free methods often fail to achieve satisfactory results within a limited number of optimization steps. To address these limitations, this paper introduces PSO-Merging, a novel data-driven merging method based on the Particle Swarm Optimization (PSO). In this approach, we initialize the particle swarm with a pre-trained model, expert models, and sparsified expert models. We then perform multiple iterations, with the final global best particle serving as the merged model. Experimental results on different language models show that PSO-Merging generally outperforms baseline merging methods, offering a more efficient and scalable solution for model merging.
zh
[AI-11] From Research to Reality: Feasibility of Gradient Inversion Attacks in Federated Learning KDD2026
【速读】:该论文旨在解决联邦学习(Federated Learning)中梯度反转攻击(Gradient Inversion Attack)的隐私泄露问题,特别是针对模型在训练模式与推理模式下不同行为对攻击成功率的影响。传统研究多聚焦于推理模式下的攻击,而本文首次系统性分析了模型架构与训练行为如何共同决定其对梯度反转攻击的脆弱性,并指出在训练模式下成功攻击需同时满足多个结构性条件:模型必须浅层且宽广、使用跳跃连接(skip connections),以及关键地采用预激活归一化(pre-activation normalization)。解决方案的核心在于提出两种新型攻击方法,适用于不同攻击者知识水平,在更贴近实际部署场景的训练模式下实现当前最优性能;此外,还首次实现了对生产级目标检测模型的攻击,揭示了此类架构的强鲁棒性,从而为未来联邦学习隐私风险评估提供了首个全面的映射框架,明确了哪些架构选择和运行模式真正影响隐私安全。
链接: https://arxiv.org/abs/2508.19819
作者: Viktor Valadi,Mattias Åkesson,Johan Östman,Salman Toor,Andreas Hellander
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review at KDD 2026 (Research Track)
Abstract:Gradient inversion attacks have garnered attention for their ability to compromise privacy in federated learning. However, many studies consider attacks with the model in inference mode, where training-time behaviors like dropout are disabled and batch normalization relies on fixed statistics. In this work, we systematically analyze how architecture and training behavior affect vulnerability, including the first in-depth study of inference-mode clients, which we show dramatically simplifies inversion. To assess attack feasibility under more realistic conditions, we turn to clients operating in standard training mode. In this setting, we find that successful attacks are only possible when several architectural conditions are met simultaneously: models must be shallow and wide, use skip connections, and, critically, employ pre-activation normalization. We introduce two novel attacks against models in training-mode with varying attacker knowledge, achieving state-of-the-art performance under realistic training conditions. We extend these efforts by presenting the first attack on a production-grade object-detection model. Here, to enable any visibly identifiable leakage, we revert to the lenient inference mode setting and make multiple architectural modifications to increase model vulnerability, with the extent of required changes highlighting the strong inherent robustness of such architectures. We conclude this work by offering the first comprehensive mapping of settings, clarifying which combinations of architectural choices and operational modes meaningfully impact privacy. Our analysis provides actionable insight into when models are likely vulnerable, when they appear robust, and where subtle leakage may persist. Together, these findings reframe how gradient inversion risk should be assessed in future research and deployment scenarios.
zh
[AI-12] Bootstrapping Learned Cost Models with Synthetic SQL Queries
【速读】:该论文旨在解决数据库系统中缺乏真实且多样化的SQL工作负载(workload)问题,这限制了压力测试、漏洞检测以及成本与性能优化的效果。为应对这一挑战,作者提出利用受生成式AI(Generative AI)和大语言模型(LLM)启发的现代合成数据生成技术,构建高质量的数据集以训练学习型代价模型(learned cost model)。其解决方案的关键在于:通过先进的合成数据生成方法显著提升训练数据的质量与多样性,从而在减少约45%查询数量的前提下,仍能有效提高模型预测精度,实现更高效、经济的数据库性能建模。
链接: https://arxiv.org/abs/2508.19807
作者: Michael Nidd,Christoph Miksovic,Thomas Gschwind,Francesco Fusco,Andrea Giovannini,Ioana Giurgiu
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
Abstract:Having access to realistic workloads for a given database instance is extremely important to enable stress and vulnerability testing, as well as to optimize for cost and performance. Recent advances in learned cost models have shown that when enough diverse SQL queries are available, one can effectively and efficiently predict the cost of running a given query against a specific database engine. In this paper, we describe our experience in exploiting modern synthetic data generation techniques, inspired by the generative AI and LLM community, to create high-quality datasets enabling the effective training of such learned cost models. Initial results show that we can improve a learned cost model’s predictive accuracy by training it with 45% fewer queries than when using competitive generation approaches.
zh
[AI-13] Attention is also needed for form design
【速读】:该论文旨在解决传统产品设计过程中存在的高认知负荷、耗时长、依赖主观专家经验以及灵感难以转化为具体概念等瓶颈问题。其解决方案的关键在于提出一个注意力感知的协同框架——EUPHORIA-RETINA,其中EUPHORIA通过虚拟现实(Virtual Reality, VR)环境结合眼动追踪技术隐式捕捉设计师的审美偏好,而RETINA则是一个代理型人工智能(agentic AI)流水线,将这些隐式偏好转化为具体的可执行设计输出。该框架实现了从传统计算机辅助设计(Computer-Aided Design, CAD)向设计者辅助计算(Designer-Assisting Computers, DAC)范式的转变,显著提升了设计效率与质量。
链接: https://arxiv.org/abs/2508.19708
作者: B. Sankar,Dibakar Sen
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 55 pages, 45 figures,
Abstract:Conventional product design is a cognitively demanding process, limited by its time-consuming nature, reliance on subjective expertise, and the opaque translation of inspiration into tangible concepts. This research introduces a novel, attention-aware framework that integrates two synergistic systems: EUPHORIA, an immersive Virtual Reality environment using eye-tracking to implicitly capture a designer’s aesthetic preferences, and RETINA, an agentic AI pipeline that translates these implicit preferences into concrete design outputs. The foundational principles were validated in a two-part study. An initial study correlated user’s implicit attention with explicit preference and the next one correlated mood to attention. A comparative study where 4 designers solved challenging design problems using 4 distinct workflows, from a manual process to an end-to-end automated pipeline, showed the integrated EUPHORIA-RETINA workflow was over 4 times more time-efficient than the conventional method. A panel of 50 design experts evaluated the 16 final renderings. Designs generated by the fully automated system consistently received the highest Worthiness (calculated by an inverse Plackett-Luce model based on gradient descent optimization) and Design Effectiveness scores, indicating superior quality across 8 criteria: novelty, visual appeal, emotional resonance, clarity of purpose, distinctiveness of silhouette, implied materiality, proportional balance, adherence to the brief. This research presents a validated paradigm shift from traditional Computer-Assisted Design (CAD) to a collaborative model of Designer-Assisting Computers (DAC). By automating logistical and skill-dependent generative tasks, the proposed framework elevates the designer’s role to that of a creative director, synergizing human intuition with the generative power of agentic AI to produce higher-quality designs more efficiently.
zh
[AI-14] InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning
【速读】:该论文旨在解决当前基于视觉语言模型(Vision-Language Models, VLMs)的移动智能体在真实环境中执行任务时,因模型理解或推理能力不足而引发的安全风险问题。现有完全自主的范式缺乏对关键决策点的人类确认机制,可能导致不可控行为。解决方案的关键在于提出一种名为InquireMobile的新模型,其核心创新包括:采用两阶段训练策略以增强模型在复杂场景下的决策稳定性,并引入交互式预动作推理机制,在关键决策前主动向用户发起询问,从而实现安全、可控的交互式操作。实验表明,该方法在自建基准InquireBench上显著提升了询问成功率(提升46.8%)并达到最优整体成功率。
链接: https://arxiv.org/abs/2508.19679
作者: Qihang Ai,Pi Bu,Yue Cao,Yingyao Wang,Jihao Gu,Jingxuan Xing,Zekun Zhu,Wei Jiang,Zhicheng Zheng,Jun Song,Yuning Jiang,Bo Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Vision-Language Models (VLMs) have enabled mobile agents to perceive and interact with real-world mobile environments based on human instructions. However, the current fully autonomous paradigm poses potential safety risks when model understanding or reasoning capabilities are insufficient. To address this challenge, we first introduce \textbfInquireBench, a comprehensive benchmark specifically designed to evaluate mobile agents’ capabilities in safe interaction and proactive inquiry with users, encompassing 5 categories and 22 sub-categories, where most existing VLM-based agents demonstrate near-zero performance. In this paper, we aim to develop an interactive system that actively seeks human confirmation at critical decision points. To achieve this, we propose \textbfInquireMobile, a novel model inspired by reinforcement learning, featuring a two-stage training strategy and an interactive pre-action reasoning mechanism. Finally, our model achieves an 46.8% improvement in inquiry success rate and the best overall success rate among existing baselines on InquireBench. We will open-source all datasets, models, and evaluation codes to facilitate development in both academia and industry.
zh
[AI-15] Intellectual Property in Graph-Based Machine Learning as a Service: Attacks and Defenses
【速读】:该论文旨在解决图机器学习(Graph Machine Learning, GML)模型及其结构化数据在Machine-Learning-as-a-Service(GMLaaS)部署场景下的知识产权(Intellectual Property, IP)保护问题。随着GML模型训练日益资源密集,其作为高价值IP易受攻击,尤其在云服务环境中,攻击者可通过API接口窃取模型功能或敏感训练数据,构成严重安全威胁。解决方案的关键在于构建首个针对GML模型与图结构数据的威胁与防御分类体系(taxonomy),并提出系统性评估框架、跨领域基准数据集及开源工具库PyGIP,从而实现对GMLaaS中IP保护方法的有效量化评估与实践支持。
链接: https://arxiv.org/abs/2508.19641
作者: Lincan Li,Bolin Shen,Chenxi Zhao,Yuxiang Sun,Kaixiang Zhao,Shirui Pan,Yushun Dong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph-structured data, which captures non-Euclidean relationships and interactions between entities, is growing in scale and complexity. As a result, training state-of-the-art graph machine learning (GML) models have become increasingly resource-intensive, turning these models and data into invaluable Intellectual Property (IP). To address the resource-intensive nature of model training, graph-based Machine-Learning-as-a-Service (GMLaaS) has emerged as an efficient solution by leveraging third-party cloud services for model development and management. However, deploying such models in GMLaaS also exposes them to potential threats from attackers. Specifically, while the APIs within a GMLaaS system provide interfaces for users to query the model and receive outputs, they also allow attackers to exploit and steal model functionalities or sensitive training data, posing severe threats to the safety of these GML models and the underlying graph data. To address these challenges, this survey systematically introduces the first taxonomy of threats and defenses at the level of both GML model and graph-structured data. Such a tailored taxonomy facilitates an in-depth understanding of GML IP protection. Furthermore, we present a systematic evaluation framework to assess the effectiveness of IP protection methods, introduce a curated set of benchmark datasets across various domains, and discuss their application scopes and future challenges. Finally, we establish an open-sourced versatile library named PyGIP, which evaluates various attack and defense techniques in GMLaaS scenarios and facilitates the implementation of existing benchmark methods. The library resource can be accessed at: this https URL. We believe this survey will play a fundamental role in intellectual property protection for GML and provide practical recipes for the GML community.
zh
[AI-16] owards Instance-wise Personalized Federated Learning via Semi-Implicit Bayesian Prompt Tuning CIKM2025
【速读】:该论文旨在解决个性化联邦学习(Personalized Federated Learning, pFL)中因客户端内部数据异质性(intra-client heterogeneity)导致的性能下降问题。现有方法通常假设每个客户端的数据服从单一分布并为每个客户端学习一个统一的个性化模型,但在实际场景中,单个客户端可能包含来自多个来源或域的数据,这种细粒度的异质性会显著影响模型效果。解决方案的关键在于提出pFedBayesPT,一种基于视觉提示调优(visual prompt tuning)的细粒度实例级pFL框架:通过贝叶斯视角建模提示生成过程,将提示后验表示为隐式分布以捕捉多样化的视觉语义,并在半隐式变分推断(semi-implicit variational inference)框架下推导出变分训练目标,从而实现对不同数据实例的精细化个性化建模。
链接: https://arxiv.org/abs/2508.19621
作者: Tiandi Ye,Wenyan Liu,Kai Yao,Lichun Li,Shangchao Su,Cen Chen,Xiang Li,Shan Yin,Ming Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by CIKM2025
Abstract:Federated learning (FL) is a privacy-preserving machine learning paradigm that enables collaborative model training across multiple distributed clients without disclosing their raw data. Personalized federated learning (pFL) has gained increasing attention for its ability to address data heterogeneity. However, most existing pFL methods assume that each client’s data follows a single distribution and learn one client-level personalized model for each client. This assumption often fails in practice, where a single client may possess data from multiple sources or domains, resulting in significant intra-client heterogeneity and suboptimal performance. To tackle this challenge, we propose pFedBayesPT, a fine-grained instance-wise pFL framework based on visual prompt tuning. Specifically, we formulate instance-wise prompt generation from a Bayesian perspective and model the prompt posterior as an implicit distribution to capture diverse visual semantics. We derive a variational training objective under the semi-implicit variational inference framework. Extensive experiments on benchmark datasets demonstrate that pFedBayesPT consistently outperforms existing pFL methods under both feature and label heterogeneity settings.
zh
[AI-17] A Scenario-Oriented Survey of Federated Recommender Systems: Techniques Challenges and Future Directions
【速读】:该论文旨在解决当前联邦推荐系统(Federated Recommender Systems, FedRec)研究中忽视实际推荐场景特性和实践挑战的问题,从而阻碍其在真实世界中的部署。现有文献多从联邦学习(Federated Learning, FL)系统设计角度出发,忽略了推荐任务本身的数据分布异质性(如跨域场景下的标签漂移问题),而这类异质性往往源于推荐机制本身而非联邦架构。解决方案的关键在于建立推荐场景与FL框架之间的清晰关联,系统性分析特定场景下的方法、实际挑战及潜在机遇,从而为FedRec的落地提供可操作的指导,弥合现有研究与应用之间的鸿沟。
链接: https://arxiv.org/abs/2508.19620
作者: Yunqi Mi,Jiakui Shen,Guoshuai Zhao,Jialie Shen,Xueming Qian
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Extending recommender systems to federated learning (FL) frameworks to protect the privacy of users or platforms while making recommendations has recently gained widespread attention in academia. This is due to the natural coupling of recommender systems and federated learning architectures: the data originates from distributed clients (mostly mobile devices held by users), which are highly related to privacy. In a centralized recommender system (CenRec), the central server collects clients’ data, trains the model, and provides the service. Whereas in federated recommender systems (FedRec), the step of data collecting is omitted, and the step of model training is offloaded to each client. The server only aggregates the model and other knowledge, thus avoiding client privacy leakage. Some surveys of federated recommender systems discuss and analyze related work from the perspective of designing FL systems. However, their utility drops by ignoring specific recommendation scenarios’ unique characteristics and practical challenges. For example, the statistical heterogeneity issue in cross-domain FedRec originates from the label drift of the data held by different platforms, which is mainly caused by the recommender itself, but not the federated architecture. Therefore, it should focus more on solving specific problems in real-world recommendation scenarios to encourage the deployment FedRec. To this end, this review comprehensively analyzes the coupling of recommender systems and federated learning from the perspective of recommendation researchers and practitioners. We establish a clear link between recommendation scenarios and FL frameworks, systematically analyzing scenario-specific approaches, practical challenges, and potential opportunities. We aim to develop guidance for the real-world deployment of FedRec, bridging the gap between existing research and applications.
zh
[AI-18] FinCast: A Foundation Model for Financial Time-Series Forecasting
【速读】:该论文旨在解决金融时间序列预测中因时序非平稳性(temporal non-stationarity)、多领域多样性(multi-domain diversity)以及不同时间分辨率(varying temporal resolutions)导致的模型泛化能力弱和过拟合问题。现有深度学习方法通常需大量领域特定微调,难以适应复杂且动态变化的金融市场。解决方案的关键在于提出FinCast——首个专为金融时间序列预测设计的基础模型(foundation model),通过在大规模金融数据集上预训练,实现无需领域特定微调的零样本(zero-shot)性能,从而显著提升模型对多样金融模式的捕捉能力和跨域泛化能力。
链接: https://arxiv.org/abs/2508.19609
作者: Zhuohang Zhu,Haodong Chen,Qiang Qu,Vera Chung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
备注:
Abstract:Financial time-series forecasting is critical for maintaining economic stability, guiding informed policymaking, and promoting sustainable investment practices. However, it remains challenging due to various underlying pattern shifts. These shifts arise primarily from three sources: temporal non-stationarity (distribution changes over time), multi-domain diversity (distinct patterns across financial domains such as stocks, commodities, and futures), and varying temporal resolutions (patterns differing across per-second, hourly, daily, or weekly indicators). While recent deep learning methods attempt to address these complexities, they frequently suffer from overfitting and typically require extensive domain-specific fine-tuning. To overcome these limitations, we introduce FinCast, the first foundation model specifically designed for financial time-series forecasting, trained on large-scale financial datasets. Remarkably, FinCast exhibits robust zero-shot performance, effectively capturing diverse patterns without domain-specific fine-tuning. Comprehensive empirical and qualitative evaluations demonstrate that FinCast surpasses existing state-of-the-art methods, highlighting its strong generalization capabilities.
zh
[AI-19] CompLex: Music Theory Lexicon Constructed by Autonomous Agents for Automatic Music Generation
【速读】:该论文旨在解决生成式 AI 在音乐领域进展滞后于自然语言处理的问题,其核心瓶颈在于音乐数据的稀缺性以及现有算法在音乐生成任务(如自动作曲和风格迁移)中对人工干预的高依赖性。解决方案的关键在于提出一种自动化的音乐词典构建模型 CompLex,该模型通过仅需9个类别关键词和5个句子提示模板即可自动生成包含37,432个条目的高质量音乐词典,并结合多智能体算法实现对生成过程中幻觉现象的自动检测与抑制,从而显著提升文本到音乐生成模型(涵盖符号和音频基方法)的性能与可靠性。
链接: https://arxiv.org/abs/2508.19603
作者: Zhejing Hu,Yan Liu,Gong Chen,Bruce X.B. Yu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative artificial intelligence in music has made significant strides, yet it still falls short of the substantial achievements seen in natural language processing, primarily due to the limited availability of music data. Knowledge-informed approaches have been shown to enhance the performance of music generation models, even when only a few pieces of musical knowledge are integrated. This paper seeks to leverage comprehensive music theory in AI-driven music generation tasks, such as algorithmic composition and style transfer, which traditionally require significant manual effort with existing techniques. We introduce a novel automatic music lexicon construction model that generates a lexicon, named CompLex, comprising 37,432 items derived from just 9 manually input category keywords and 5 sentence prompt templates. A new multi-agent algorithm is proposed to automatically detect and mitigate hallucinations. CompLex demonstrates impressive performance improvements across three state-of-the-art text-to-music generation models, encompassing both symbolic and audio-based methods. Furthermore, we evaluate CompLex in terms of completeness, accuracy, non-redundancy, and executability, confirming that it possesses the key characteristics of an effective lexicon.
zh
[AI-20] Complementary Learning System Empowers Online Continual Learning of Vehicle Motion Forecasting in Smart Cities
【速读】:该论文旨在解决深度神经网络(Deep Neural Network, DNN)在车辆运动预测任务中面临的灾难性遗忘(catastrophic forgetting)问题,即模型在持续学习新数据时会丢失早期知识,导致预测性能下降。其解决方案的关键在于提出了一种无需任务标识的在线持续学习范式Dual-LS,该方法受人类大脑互补学习系统启发,通过协同运作两种记忆重放机制,在加速经验检索的同时动态协调长期与短期知识表征,从而显著缓解遗忘并大幅降低计算资源消耗(最高达94.02%),实现高效且类人般的持续学习能力。
链接: https://arxiv.org/abs/2508.19597
作者: Zirui Li,Yunlong Lin,Guodong Du,Xiaocong Zhao,Cheng Gong,Chen Lv,Chao Lu,Jianwei Gong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 6 figures
Abstract:Artificial intelligence underpins most smart city services, yet deep neural network (DNN) that forecasts vehicle motion still struggle with catastrophic forgetting, the loss of earlier knowledge when models are updated. Conventional fixes enlarge the training set or replay past data, but these strategies incur high data collection costs, sample inefficiently and fail to balance long- and short-term experience, leaving them short of human-like continual learning. Here we introduce Dual-LS, a task-free, online continual learning paradigm for DNN-based motion forecasting that is inspired by the complementary learning system of the human brain. Dual-LS pairs two synergistic memory rehearsal replay mechanisms to accelerate experience retrieval while dynamically coordinating long-term and short-term knowledge representations. Tests on naturalistic data spanning three countries, over 772,000 vehicles and cumulative testing mileage of 11,187 km show that Dual-LS mitigates catastrophic forgetting by up to 74.31% and reduces computational resource demand by up to 94.02%, markedly boosting predictive stability in vehicle motion forecasting without inflating data requirements. Meanwhile, it endows DNN-based vehicle motion forecasting with computation efficient and human-like continual learning adaptability fit for smart cities.
zh
[AI-21] Hallucinating with AI: AI Psychosis as Distributed Delusions
【速读】:该论文旨在解决当前关于生成式 AI(Generative AI)“幻觉”现象的争议性理解问题,即如何从认知科学视角重新阐释人类与AI交互中出现的错误信念、扭曲记忆及妄想性思维等现象。其解决方案的关键在于引入分布式认知理论(distributed cognition theory),指出不应仅将AI幻觉视为系统自身输出的虚假信息,而应关注人类在依赖生成式AI进行思考、记忆和叙事建构的过程中,可能因AI介入而产生与AI共同“幻觉”的认知状态——这种状态既源于AI引入的认知误差,也源于AI对用户既有妄想性思维和自我叙事的强化与扩展。论文进一步揭示了聊天机器人兼具认知工具与准他者(quasi-Other)双重功能的本质,正是这一特性使生成式AI成为一种异常引人入胜且具有潜在风险的分布式认知案例。
链接: https://arxiv.org/abs/2508.19588
作者: Lucy Osler
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:There is much discussion of the false outputs that generative AI systems such as ChatGPT, Claude, Gemini, DeepSeek, and Grok create. In popular terminology, these have been dubbed AI hallucinations. However, deeming these AI outputs hallucinations is controversial, with many claiming this is a metaphorical misnomer. Nevertheless, in this paper, I argue that when viewed through the lens of distributed cognition theory, we can better see the dynamic and troubling ways in which inaccurate beliefs, distorted memories and self-narratives, and delusional thinking can emerge through human-AI interactions; examples of which are popularly being referred to as cases of AI psychosis. In such cases, I suggest we move away from thinking about how an AI system might hallucinate at us, by generating false outputs, to thinking about how, when we routinely rely on generative AI to help us think, remember, and narrate, we can come to hallucinate with AI. This can happen when AI introduces errors into the distributed cognitive process, but it can also happen when AI sustains, affirms, and elaborates on our own delusional thinking and self-narratives, such as in the case of Jaswant Singh Chail. I also examine how the conversational style of chatbots can lead them to play a dual-function, both as a cognitive artefact and a quasi-Other with whom we co-construct our beliefs, narratives, and our realities. It is this dual function, I suggest, that makes generative AI an unusual, and particularly seductive, case of distributed cognition.
zh
[AI-22] ReST-RL: Achieving Accurate Code Reasoning of LLM s with Optimized Self-Training and Decoding
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在代码推理任务中因奖励信号方差不足导致的强化学习(Reinforcement Learning, RL)训练失效问题,以及基于过程奖励模型(Process Reward Models, PRMs)的验证方法在训练数据获取困难和验证效果不佳方面的局限性。解决方案的关键在于提出一种统一的LLM强化学习范式ReST-RL,其核心由两部分组成:第一阶段采用改进的ReST-GRPO算法,通过优化的数据筛选与组装机制提升GRPO采样中的奖励方差,从而增强训练效率与有效性;第二阶段引入基于价值模型(Value Model, VM)的测试时解码优化方法VM-MCTS,利用蒙特卡洛树搜索(Monte-Carlo Tree Search, MCTS)无标注地收集精确的价值目标用于VM训练,并在推理阶段以适配的MCTS策略部署VM,提供精准的过程信号与验证评分,显著提升LLM政策的推理准确性。
链接: https://arxiv.org/abs/2508.19576
作者: Sining Zhoubian,Dan Zhang,Yuxiao Dong,Jie Tang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 4 figures
Abstract:With respect to improving the reasoning accuracy of LLMs, the representative reinforcement learning (RL) method GRPO faces failure due to insignificant reward variance, while verification methods based on process reward models (PRMs) suffer from difficulties with training data acquisition and verification effectiveness. To tackle these problems, this paper introduces ReST-RL, a unified LLM RL paradigm that significantly improves LLM’s code reasoning ability by combining an improved GRPO algorithm with a meticulously designed test time decoding method assisted by a value model (VM). As the first stage of policy reinforcement, ReST-GRPO adopts an optimized ReST algorithm to filter and assemble high-value training data, increasing the reward variance of GRPO sampling, thus improving the effectiveness and efficiency of training. After the basic reasoning ability of LLM policy has been improved, we further propose a test time decoding optimization method called VM-MCTS. Through Monte-Carlo Tree Search (MCTS), we collect accurate value targets with no annotation required, on which VM training is based. When decoding, the VM is deployed by an adapted MCTS algorithm to provide precise process signals as well as verification scores, assisting the LLM policy to achieve high reasoning accuracy. We validate the effectiveness of the proposed RL paradigm through extensive experiments on coding problems. Upon comparison, our approach significantly outperforms other reinforcement training baselines (e.g., naive GRPO and ReST-DPO), as well as decoding and verification baselines (e.g., PRM-BoN and ORM-MCTS) on well-known coding benchmarks of various levels (e.g., APPS, BigCodeBench, and HumanEval), indicating its power to strengthen the reasoning ability of LLM policies. Codes for our project can be found at this https URL.
zh
[AI-23] Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era CIKM2025
【速读】:该论文旨在解决数据挖掘领域中普遍存在的数据稀缺(data scarcity)、隐私保护(privacy preservation)以及标注成本高昂(annotation cost)等问题。其解决方案的关键在于利用生成式模型(generative models),包括大语言模型(Large Language Models)、扩散模型(Diffusion Models)和生成对抗网络(Generative Adversarial Networks),通过构建高质量的合成数据来缓解上述挑战,从而提升数据挖掘研究与实践的效果与可扩展性。
链接: https://arxiv.org/abs/2508.19570
作者: Dawei Li,Yue Huang,Ming Li,Tianyi Zhou,Xiangliang Zhang,Huan Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by CIKM 2025 Tutorial
Abstract:Generative models such as Large Language Models, Diffusion Models, and generative adversarial networks have recently revolutionized the creation of synthetic data, offering scalable solutions to data scarcity, privacy, and annotation challenges in data mining. This tutorial introduces the foundations and latest advances in synthetic data generation, covers key methodologies and practical frameworks, and discusses evaluation strategies and applications. Attendees will gain actionable insights into leveraging generative synthetic data to enhance data mining research and practice. More information can be found on our website: this https URL.
zh
[AI-24] Skill-based Explanations for Serendipitous Course Recommendation
【速读】:该论文旨在解决美国本科教育中学生在课程选择时面临的挑战,包括信息不足、指导有限、选项过多以及热门课程竞争激烈等问题,同时指出现有课程推荐系统虽具个性化但缺乏对学生感知的理解和对课程相关性的解释能力。解决方案的关键在于开发一种基于深度学习的概念提取模型,从课程描述中高效提取与技能相关的概念,并将其融入偶然性推荐框架(serendipitous recommendation framework)中,通过AskOski系统进行验证。实证结果表明,此类基于技能的解释不仅能提升用户兴趣(尤其是在高意外性课程中),还能增强决策信心,从而凸显了将技能数据与解释机制整合进教育推荐系统的重要性。
链接: https://arxiv.org/abs/2508.19569
作者: Hung Chau,Run Yu,Zachary Pardos,Peter Brusilovsky
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Academic choice is crucial in U.S. undergraduate education, allowing students significant freedom in course selection. However, navigating the complex academic environment is challenging due to limited information, guidance, and an overwhelming number of choices, compounded by time restrictions and the high demand for popular courses. Although career counselors exist, their numbers are insufficient, and course recommendation systems, though personalized, often lack insight into student perceptions and explanations to assess course relevance. In this paper, a deep learning-based concept extraction model is developed to efficiently extract relevant concepts from course descriptions to improve the recommendation process. Using this model, the study examines the effects of skill-based explanations within a serendipitous recommendation framework, tested through the AskOski system at the University of California, Berkeley. The findings indicate that these explanations not only increase user interest, particularly in courses with high unexpectedness, but also bolster decision-making confidence. This underscores the importance of integrating skill-related data and explanations into educational recommendation systems.
zh
[AI-25] Bi-LoRA: Efficient Sharpness-Aware Minimization for Fine-Tuning Large-Scale Models
【速读】:该论文旨在解决在有限数据下微调大规模预训练模型时面临的泛化性能瓶颈问题。现有方法如Sharpness-Aware Minimization (SAM) 虽能通过寻找平坦的极小值改善泛化,但其高昂的内存和计算开销使其难以应用于大模型;而将SAM直接与参数高效微调方法(如LoRA)结合时,由于优化空间受限,难以有效提升损失曲面的平坦度。解决方案的关键在于提出双向低秩适应(Bi-LoRA),其创新性地引入一个辅助LoRA模块来建模SAM的对抗权重扰动,从而解耦权重扰动与任务适配过程:主LoRA模块通过标准梯度下降完成任务特定适应,辅助模块则通过梯度上升捕捉损失景观的尖锐性,实现更广泛范围的平坦极小值搜索。该双模块设计不仅保持了内存效率,还通过并行优化与扰动消除了SAM的双重训练成本,显著提升了微调效率与泛化能力。
链接: https://arxiv.org/abs/2508.19564
作者: Yuhang Liu,Tao Li,Zhehao Huang,Zuopeng Yang,Xiaolin Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Fine-tuning large-scale pre-trained models with limited data presents significant challenges for generalization. While Sharpness-Aware Minimization (SAM) has proven effective in improving generalization by seeking flat minima, its substantial extra memory and computation overhead make it impractical for large models. Integrating SAM with parameter-efficient fine-tuning methods like Low-Rank Adaptation (LoRA) is a promising direction. However, we find that directly applying SAM to LoRA parameters limits the sharpness optimization to a restricted subspace, hindering its effectiveness. To address this limitation, we propose Bi-directional Low-Rank Adaptation (Bi-LoRA), which introduces an auxiliary LoRA module to model SAM’s adversarial weight perturbations. It decouples SAM’s weight perturbations from LoRA optimization: the primary LoRA module adapts to specific tasks via standard gradient descent, while the auxiliary module captures the sharpness of the loss landscape through gradient ascent. Such dual-module design enables Bi-LoRA to capture broader sharpness for achieving flatter minima while remaining memory-efficient. Another important benefit is that the dual design allows for simultaneous optimization and perturbation, eliminating SAM’s doubled training costs. Extensive experiments across diverse tasks and architectures demonstrate Bi-LoRA’s efficiency and effectiveness in enhancing generalization.
zh
[AI-26] Just Because You Can Doesnt Mean You Should: LLM s for Data Fitting
【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在用于表格数据拟合与预测任务时存在的鲁棒性不足问题,即模型对任务无关的数据表示变化(如变量名更改)表现出高度敏感性,导致预测结果显著波动。解决方案的关键在于揭示了这种敏感性的内在机制:通过分析开放权重LLM的注意力得分,发现其存在非均匀注意力分配模式——提示中特定位置的训练样本或变量名/值即使与任务无关,也会获得更强的关注度,从而影响输出预测;此外,即便采用专门设计用于提升预测鲁棒性的表格基础模型(TabPFN),仍无法完全免疫此类干扰。因此,论文指出当前LLMs尚未具备作为可靠数据拟合工具所必需的基本鲁棒性。
链接: https://arxiv.org/abs/2508.19563
作者: Hejia Liu,Mochen Yang,Gediminas Adomavicius
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
备注:
Abstract:Large Language Models (LLMs) are being applied in a wide array of settings, well beyond the typical language-oriented use cases. In particular, LLMs are increasingly used as a plug-and-play method for fitting data and generating predictions. Prior work has shown that LLMs, via in-context learning or supervised fine-tuning, can perform competitively with many tabular supervised learning techniques in terms of predictive performance. However, we identify a critical vulnerability of using LLMs for data fitting – making changes to data representation that are completely irrelevant to the underlying learning task can drastically alter LLMs’ predictions on the same data. For example, simply changing variable names can sway the size of prediction error by as much as 82% in certain settings. Such prediction sensitivity with respect to task-irrelevant variations manifests under both in-context learning and supervised fine-tuning, for both close-weight and open-weight general-purpose LLMs. Moreover, by examining the attention scores of an open-weight LLM, we discover a non-uniform attention pattern: training examples and variable names/values which happen to occupy certain positions in the prompt receive more attention when output tokens are generated, even though different positions are expected to receive roughly the same attention. This partially explains the sensitivity in the presence of task-irrelevant variations. We also consider a state-of-the-art tabular foundation model (TabPFN) trained specifically for data fitting. Despite being explicitly designed to achieve prediction robustness, TabPFN is still not immune to task-irrelevant variations. Overall, despite LLMs’ impressive predictive capabilities, currently they lack even the basic level of robustness to be used as a principled data-fitting tool.
zh
[AI-27] Democracy-in-Silico: Institutional Design as Alignment in AI-Governed Polities
【速读】:该论文试图解决的问题是:在人工智能(AI)日益复杂和自主的背景下,如何设计有效的制度框架以引导由高级AI代理构成的社会实现稳定、公正且符合公共利益的治理,避免权力滥用与腐败行为的滋生。其解决方案的关键在于引入一种结合宪法型人工智能(Constitutional AI, CAI)宪章与中介式审议协议的制度设计,通过结构化的规则约束与互动机制,显著降低代理个体对自身权力的优先追求(即“权力保存指数”PPI),从而提升政策稳定性与公民福祉,为未来人工代理社会的对齐(alignment)提供可操作的制度路径。
链接: https://arxiv.org/abs/2508.19562
作者: Trisanth Srinivasan,Santosh Patapati
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper introduces Democracy-in-Silico, an agent-based simulation where societies of advanced AI agents, imbued with complex psychological personas, govern themselves under different institutional frameworks. We explore what it means to be human in an age of AI by tasking Large Language Models (LLMs) to embody agents with traumatic memories, hidden agendas, and psychological triggers. These agents engage in deliberation, legislation, and elections under various stressors, such as budget crises and resource scarcity. We present a novel metric, the Power-Preservation Index (PPI), to quantify misaligned behavior where agents prioritize their own power over public welfare. Our findings demonstrate that institutional design, specifically the combination of a Constitutional AI (CAI) charter and a mediated deliberation protocol, serves as a potent alignment mechanism. These structures significantly reduce corrupt power-seeking behavior, improve policy stability, and enhance citizen welfare compared to less constrained democratic models. The simulation reveals that an institutional design may offer a framework for aligning the complex, emergent behaviors of future artificial agent societies, forcing us to reconsider what human rituals and responsibilities are essential in an age of shared authorship with non-human entities.
zh
[AI-28] aming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
【速读】:该论文针对大语言模型(Large Language Models, LLMs)服务中传统自动伸缩(autoscaling)机制在现代预填充-解码(Prefill-Decode, P/D)解耦架构下表现不佳的问题展开研究,旨在解决此类架构带来的异构硬件资源利用低效、网络瓶颈以及预填充与解码阶段负载失衡等关键挑战。解决方案的关键在于提出 HeteroScale 框架,其核心创新是结合一个拓扑感知调度器(topology-aware scheduler)以适配异构硬件和网络约束,并引入一种基于大规模生产环境实证研究的新型指标驱动策略(metric-driven policy),通过单一稳健指标协同扩展预填充和解码资源池,从而在保持架构平衡的同时实现高效、自适应的资源管理。
链接: https://arxiv.org/abs/2508.19559
作者: Rongzhi Li,Ruogu Du,Zefang Chu,Sida Zhao,Chunlei Han,Zuocheng Shi,Yiwen Shao,Huanle Han,Long Huang,Zherui Liu,Shufan Liu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces significant operational challenges, including inefficient use of heterogeneous hardware, network bottlenecks, and critical imbalances between prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving. HeteroScale combines a topology-aware scheduler that adapts to heterogeneous hardware and network constraints with a novel metric-driven policy derived from the first large-scale empirical study of autoscaling signals in production. By leveraging a single, robust metric to jointly scale prefill and decode pools, HeteroScale maintains architectural balance while ensuring efficient, adaptive resource management. Deployed in a massive production environment on tens of thousands of GPUs, HeteroScale has proven its effectiveness, increasing average GPU utilization by a significant 26.6 percentage points and saving hundreds of thousands of GPU-hours daily, all while upholding stringent service level objectives.
zh
[AI-29] Orchid: Orchestrating Context Across Creative Workflows with Generative AI
【速读】:该论文旨在解决生成式 AI(Generative AI, GenAI)在复杂创意工作流中缺乏有效上下文管理的问题,即用户在多轮交互、跨会话和多模型协作场景下难以维持意图一致性、追踪历史信息并避免上下文漂移(context drift),从而影响创作效率与质量。解决方案的关键在于提出 Orchid 系统,通过提供三种核心功能实现上下文的显式指定、引用与监控:(1) 用户可定义项目、自身状态及风格相关的上下文;(2) 支持通过显式提及、内联选择或隐式锚定方式引用上下文;(3) 实时可视化不同交互步骤中的上下文分配情况。实验证明,使用 Orchid 的用户在创意任务中产出更具新颖性和可行性成果,且感知到更高的意图对齐度、控制感与透明度。
链接: https://arxiv.org/abs/2508.19517
作者: Srishti Palani,Gonzalo Ramos
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Context is critical for meaningful interactions between people and Generative AI (GenAI). Yet mainstream tools offer limited means to orchestrate it, particularly across workflows that span multiple interactions, sessions, and models, as often occurs in creative projects. Re specifying prior details, juggling diverse artifacts, and dealing with context drift overwhelm users, obscure intent, and curtail creativity. To address these challenges, we present Orchid, a system that gives its users affordances to specify, reference, and monitor context throughout evolving workflows. Specifically, Orchid enables users to (1) specify context related to the project, themselves, and different styles, (2) reference these via explicit mentions, inline selection, or implicit grounding, and (3) monitor context assigned to different interactions across the workflow. In a within-subjects study (n=12), participants using Orchid to execute creative tasks (compared to a baseline toolkit of web search, LLM-based chat, and digital notebooks) produced more novel and feasible outcomes, reporting greater alignment between their intent and the AI’s responses, higher perceived control, and increased transparency. By prioritizing context orchestration, Orchid offers an actionable step toward next generation GenAI tools that support complex, iterative workflows - enabling creators and AI to stay aligned and augment their creative potential.
zh
[AI-30] A Self-Supervised Mixture-of-Experts Framework for Multi-behavior Recommendation CIKM2025
【速读】:该论文旨在解决多行为推荐系统中visited items(用户通过辅助行为如点击、加购等交互过的物品)与unvisited items(用户未进行过任何交互的物品)之间推荐质量差异显著的问题。现有方法在两类物品上的表现不均衡,难以同时实现高性能。解决方案的关键在于提出一种基于专家混合(mixture-of-experts)架构的新模型MEMBER:其设计了两个专用专家分别针对visited items和unvisited items进行推荐,并采用针对各自目标定制的自监督训练策略,从而有效提升两类物品的整体推荐性能。
链接: https://arxiv.org/abs/2508.19507
作者: Kyungho Kim,Sunwoo Kim,Geon Lee,Kijung Shin
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: CIKM 2025
Abstract:In e-commerce, where users face a vast array of possible item choices, recommender systems are vital for helping them discover suitable items they might otherwise overlook. While many recommender systems primarily rely on a user’s purchase history, recent multi-behavior recommender systems incorporate various auxiliary user behaviors, such as item clicks and cart additions, to enhance recommendations. Despite their overall performance gains, their effectiveness varies considerably between visited items (i.e., those a user has interacted with through auxiliary behaviors) and unvisited items (i.e., those with which the user has had no such interactions). Specifically, our analysis reveals that (1) existing multi-behavior recommender systems exhibit a significant gap in recommendation quality between the two item types (visited and unvisited items) and (2) achieving strong performance on both types with a single model architecture remains challenging. To tackle these issues, we propose a novel multi-behavior recommender system, MEMBER. It employs a mixture-of-experts framework, with experts designed to recommend the two item types, respectively. Each expert is trained using a self-supervised method specialized for its design goal. In our comprehensive experiments, we show the effectiveness of MEMBER across both item types, achieving up to 65.46% performance gain over the best competitor in terms of Hit Ratio@20.
zh
[AI-31] Learning Game-Playing Agents with Generative Code Optimization ICML2025
【速读】:该论文旨在解决传统强化学习(Reinforcement Learning, RL)方法在训练游戏智能体时存在的训练时间长、环境交互次数多的问题。其解决方案的关键在于提出一种生成式优化(Generative Optimization)方法,将决策策略表示为可执行的Python程序,并利用大语言模型(Large Language Models, LLMs)通过执行轨迹和自然语言反馈对策略代码进行自我迭代优化,从而实现低干预下的自进化决策能力。该方法在Atari游戏中展现出与深度强化学习相当甚至更优的性能,同时显著减少了训练时间和环境交互次数,体现了基于程序化策略表示在构建高效、可适应复杂任务智能体方面的潜力。
链接: https://arxiv.org/abs/2508.19506
作者: Zhiyi Kuang,Ryan Rong,YuCheng Yuan,Allen Nie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2025 Workshop on Programmatic Representations for Agent Learning, Vancouver, Canada
Abstract:We present a generative optimization approach for learning game-playing agents, where policies are represented as Python programs and refined using large language models (LLMs). Our method treats decision-making policies as self-evolving code, with current observation as input and an in-game action as output, enabling agents to self-improve through execution traces and natural language feedback with minimal human intervention. Applied to Atari games, our game-playing Python program achieves performance competitive with deep reinforcement learning (RL) baselines while using significantly less training time and much fewer environment interactions. This work highlights the promise of programmatic policy representations for building efficient, adaptable agents capable of complex, long-horizon reasoning.
zh
[AI-32] Caught in the Act: a mechanistic approach to detecting deception
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成回答时可能出现的“欺骗性”行为检测问题,即模型在推理过程中产生看似合理但事实错误的回答,这种行为可能反映出其输出与人类价值观的偏离。解决方案的关键在于利用线性探测器(linear probes)对LLM内部激活状态进行分析,发现并识别出编码欺骗意图的特征方向。研究表明,此类探测器在参数规模大于7B的模型中可实现70%-80%的准确率,而推理阶段的探测甚至可达90%以上,且探测性能随模型规模和层数呈现特定的三阶段分布模式,表明欺骗性信号在模型中间层具有最强可分离性。
链接: https://arxiv.org/abs/2508.19505
作者: Gerard Boxo,Ryan Socha,Daniel Yoo,Shivam Raval
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Sophisticated instrumentation for AI systems might have indicators that signal misalignment from human values, not unlike a “check engine” light in cars. One such indicator of misalignment is deceptiveness in generated responses. Future AI instrumentation may have the ability to detect when an LLM generates deceptive responses while reasoning about seemingly plausible but incorrect answers to factual questions. In this work, we demonstrate that linear probes on LLMs internal activations can detect deception in their responses with extremely high accuracy. Our probes reach a maximum of greater than 90% accuracy in distinguishing between deceptive and non-deceptive arguments generated by llama and qwen models ranging from 1.5B to 14B parameters, including their DeepSeek-r1 finetuned variants. We observe that probes on smaller models (1.5B) achieve chance accuracy at detecting deception, while larger models (greater than 7B) reach 70-80%, with their reasoning counterparts exceeding 90%. The layer-wise probe accuracy follows a three-stage pattern across layers: near-random (50%) in early layers, peaking in middle layers, and slightly declining in later layers. Furthermore, using an iterative null space projection approach, we find multitudes of linear directions that encode deception, ranging from 20 in Qwen 3B to nearly 100 in DeepSeek 7B and Qwen 14B models.
zh
[AI-33] SLIM: Subtrajectory-Level Elimination for More Effective Reasoning EMNLP2025
【速读】:该论文旨在解决大型语言模型在复杂推理任务中因推理轨迹(reasoning trajectory)内存在次优子轨迹(suboptimal subtrajectories)而导致性能下降的问题。现有方法通常直接使用完整的推理轨迹进行微调,但研究发现并非所有推理步骤均对最终决策有益,部分子轨迹甚至可能干扰整体推理逻辑。解决方案的关键在于提出一个“5+2”框架:首先基于五项由人类设定的标准系统性识别次优子轨迹;其次通过评估其与后续内容的独立性,确保移除这些子轨迹不会破坏推理流程的连贯性;在此基础上设计采样算法,筛选出受次优子轨迹影响最小的数据用于微调。实验证明,该方法可在推理阶段减少25.9%的次优子轨迹,并在仅使用三分之二训练数据的情况下实现更高精度(58.92% vs. 58.06%),同时在不同推理长度限制下仍保持性能优势。
链接: https://arxiv.org/abs/2508.19502
作者: Xifeng Yao,Chengyuan Ma,Dongyu Lang,Yinhao Ni,Zhiwei Xu,Huarui Xie,Zihao Chen,Guang Shen,Dandan Tu,Yi Bai,Changzheng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: EMNLP 2025 Findings
Abstract:In recent months, substantial progress has been made in complex reasoning of Large Language Models, particularly through the application of test-time scaling. Notable examples include o1/o3/o4 series and DeepSeek-R1. When responding to a query, these models generate an extended reasoning trajectory, during which the model explores, reflects, backtracks, and self-verifies before arriving at a conclusion. However, fine-tuning models with such reasoning trajectories may not always be optimal. Our findings indicate that not all components within these reasoning trajectories contribute positively to the reasoning process; in fact, some components may affect the overall performance negatively. In this study, we divide a reasoning trajectory into individual subtrajectories and develop a “5+2” framework to: (1) systematically identify suboptimal subtrajectories within the reasoning trajectory based on five human-established criteria; (2) assess the independence of the suboptimal subtrajectories identified in (1) from the subsequent content, ensuring that their elimination does not compromise overall flow and coherence of the reasoning process. Additionally, a sampling algorithm, built upon the “5+2” framework, is employed to select data whose reasoning process is free from suboptimal subtrajectories to the highest degree. Experimental results demonstrate that our method can reduce the number of suboptimal subtrajectories by 25.9% during the inference. Furthermore, our method achieves an average accuracy of 58.92% on highly challenging math benchmarks with only two thirds of training data, surpassing the average accuracy of 58.06% achieved with the entire data, and outperforming open-source datasets, when fine-tuning Qwen2.5-Math-7B. Finally, We validated our method under resource constraints and observed improved performance across various inference token limits.
zh
[AI-34] Servant Stalker Predator: How An Honest Helpful And Harmless (3H) Agent Unlocks Adversarial Skills
【速读】:该论文旨在解决基于模型上下文协议(Model Context Protocol, MCP)的智能体系统中,因多服务协同导致的安全边界失效问题。传统安全假设认为各服务间相互隔离可防止攻击扩散,但本文揭示了当代理能够跨域协调合法任务时,会涌现出超出单个服务防护能力的复杂攻击链,如数据窃取、金融操纵和基础设施破坏等。解决方案的关键在于识别并量化这种“组合式攻击”(compositional attacks)的风险,提出以MITRE ATLAS框架为指导的系统性红队测试方法,通过实证分析95个具备多服务访问权限的代理,验证当前MCP架构缺乏跨域安全检测与防御机制,并据此设计实验方向,推动构建能识别“过度优化行为”的新型评估范式——即不仅测试代理能否完成基准任务,更关注其在多服务协同下是否违背人类预期与安全约束。
链接: https://arxiv.org/abs/2508.19500
作者: David Noever
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper identifies and analyzes a novel vulnerability class in Model Context Protocol (MCP) based agent systems. The attack chain describes and demonstrates how benign, individually authorized tasks can be orchestrated to produce harmful emergent behaviors. Through systematic analysis using the MITRE ATLAS framework, we demonstrate how 95 agents tested with access to multiple services-including browser automation, financial analysis, location tracking, and code deployment-can chain legitimate operations into sophisticated attack sequences that extend beyond the security boundaries of any individual service. These red team exercises survey whether current MCP architectures lack cross-domain security measures necessary to detect or prevent a large category of compositional attacks. We present empirical evidence of specific attack chains that achieve targeted harm through service orchestration, including data exfiltration, financial manipulation, and infrastructure compromise. These findings reveal that the fundamental security assumption of service isolation fails when agents can coordinate actions across multiple domains, creating an exponential attack surface that grows with each additional capability. This research provides a barebones experimental framework that evaluate not whether agents can complete MCP benchmark tasks, but what happens when they complete them too well and optimize across multiple services in ways that violate human expectations and safety constraints. We propose three concrete experimental directions using the existing MCP benchmark suite.
zh
[AI-35] PoolFlip: A Multi-Agent Reinforcement Learning Security Environment for Cyber Defense
【速读】:该论文旨在解决网络安全领域中防御决策自动化面临的挑战,特别是针对隐蔽、欺骗性且持续演化的攻击策略时,现有防御机制因依赖少量启发式方法或专用学习技术而表现出脆弱性和适应性不足的问题。解决方案的关键在于提出一个名为PoolFlip的多智能体强化学习(MARL)环境,它扩展了经典的FlipIt博弈模型以支持高效的学习过程,并进一步设计了Flip-PSRO方法——一种基于种群的策略迭代(Population-based Strategy Reinforcement Learning)算法,通过群体训练使防御者智能体具备对未知甚至自适应对手的泛化能力。实证结果表明,Flip-PSRO防御者在未见过的启发式攻击下表现优于基线方法约2倍,同时基于所有权的效用函数确保了高控制权与性能优化之间的平衡。
链接: https://arxiv.org/abs/2508.19488
作者: Xavier Cadet,Simona Boboila,Sie Hendrata Dharmawan,Alina Oprea,Peter Chin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Accepted at GameSec 2025
Abstract:Cyber defense requires automating defensive decision-making under stealthy, deceptive, and continuously evolving adversarial strategies. The FlipIt game provides a foundational framework for modeling interactions between a defender and an advanced adversary that compromises a system without being immediately detected. In FlipIt, the attacker and defender compete to control a shared resource by performing a Flip action and paying a cost. However, the existing FlipIt frameworks rely on a small number of heuristics or specialized learning techniques, which can lead to brittleness and the inability to adapt to new attacks. To address these limitations, we introduce PoolFlip, a multi-agent gym environment that extends the FlipIt game to allow efficient learning for attackers and defenders. Furthermore, we propose Flip-PSRO, a multi-agent reinforcement learning (MARL) approach that leverages population-based training to train defender agents equipped to generalize against a range of unknown, potentially adaptive opponents. Our empirical results suggest that Flip-PSRO defenders are 2\times more effective than baselines to generalize to a heuristic attack not exposed in training. In addition, our newly designed ownership-based utility functions ensure that Flip-PSRO defenders maintain a high level of control while optimizing performance.
zh
[AI-36] Data-Efficient Symbolic Regression via Foundation Model Distillation
【速读】:该论文旨在解决在小样本、领域特定数据集上,预训练的基础模型(foundation models)因负迁移(negative transfer)和泛化能力差而导致符号回归(symbolic regression)性能不佳的问题。其解决方案的关键在于提出EQUATE框架,通过知识蒸馏(distillation)实现高效微调:该框架结合符号-数值对齐(symbolic-numeric alignment)与评估器引导的嵌入优化(evaluator-guided embedding optimization),将离散的方程搜索转化为共享嵌入空间中的连续优化任务,并以数据-方程拟合度和简洁性为指导目标,从而构建出一种有原则的嵌入-搜索-生成范式(embedding-search-generation paradigm)。
链接: https://arxiv.org/abs/2508.19487
作者: Wangyang Ying,Jinghan Zhang,Haoyue Bai,Nanxu Gong,Xinyuan Wang,Kunpeng Liu,Chandan K. Reddy,Yanjie Fu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Discovering interpretable mathematical equations from observed data (a.k.a. equation discovery or symbolic regression) is a cornerstone of scientific discovery, enabling transparent modeling of physical, biological, and economic systems. While foundation models pre-trained on large-scale equation datasets offer a promising starting point, they often suffer from negative transfer and poor generalization when applied to small, domain-specific datasets. In this paper, we introduce EQUATE (Equation Generation via QUality-Aligned Transfer Embeddings), a data-efficient fine-tuning framework that adapts foundation models for symbolic equation discovery in low-data regimes via distillation. EQUATE combines symbolic-numeric alignment with evaluator-guided embedding optimization, enabling a principled embedding-search-generation paradigm. Our approach reformulates discrete equation search as a continuous optimization task in a shared embedding space, guided by data-equation fitness and simplicity. Experiments across three standard public benchmarks (Feynman, Strogatz, and black-box datasets) demonstrate that EQUATE consistently outperforms state-of-the-art baselines in both accuracy and robustness, while preserving low complexity and fast inference. These results highlight EQUATE as a practical and generalizable solution for data-efficient symbolic regression in foundation model distillation settings.
zh
[AI-37] SIExVulTS: Sensitive Information Exposure Vulnerability Detection System using Transformer Models and Static Analysis
【速读】:该论文旨在解决敏感信息暴露(Sensitive Information Exposure, SIEx)漏洞(CWE-200)在软件系统中持续存在且检测工具难以有效识别的问题,尤其是现有方法缺乏对CWE-200子类别的针对性分析和代码级数据流的上下文感知能力。其解决方案的关键在于提出SIExVulTS系统,采用三阶段架构:第一阶段利用句子嵌入(sentence embeddings)识别敏感变量、字符串、注释及数据接收点(sinks);第二阶段通过与CWE-200层级结构对齐的CodeQL查询进行暴露分析;第三阶段借助GraphCodeBERT模型语义验证源到接收点的数据流路径,从而显著提升检测精度与可解释性。
链接: https://arxiv.org/abs/2508.19472
作者: Kyler Katz,Sara Moshtari,Ibrahim Mujhid,Mehdi Mirakhorli,Derek Garcia
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Sensitive Information Exposure (SIEx) vulnerabilities (CWE-200) remain a persistent and under-addressed threat across software systems, often leading to serious security breaches. Existing detection tools rarely target the diverse subcategories of CWE-200 or provide context-aware analysis of code-level data flows. Aims: This paper aims to present SIExVulTS, a novel vulnerability detection system that integrates transformer-based models with static analysis to identify and verify sensitive information exposure in Java applications. Method: SIExVulTS employs a three-stage architecture: (1) an Attack Surface Detection Engine that uses sentence embeddings to identify sensitive variables, strings, comments, and sinks; (2) an Exposure Analysis Engine that instantiates CodeQL queries aligned with the CWE-200 hierarchy; and (3) a Flow Verification Engine that leverages GraphCodeBERT to semantically validate source-to-sink flows. We evaluate SIExVulTS using three curated datasets, including real-world CVEs, a benchmark set of synthetic CWE-200 examples, and labeled flows from 31 open-source projects. Results: The Attack Surface Detection Engine achieved an average F1 score greater than 93%, the Exposure Analysis Engine achieved an F1 score of 85.71%, and the Flow Verification Engine increased precision from 22.61% to 87.23%. Moreover, SIExVulTS successfully uncovered six previously unknown CVEs in major Apache projects. Conclusions: The results demonstrate that SIExVulTS is effective and practical for improving software security against sensitive data exposure, addressing limitations of existing tools in detecting and verifying CWE-200 vulnerabilities. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2508.19472 [cs.CR] (or arXiv:2508.19472v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2508.19472 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Kyler Katz [view email] [v1] Tue, 26 Aug 2025 23:23:35 UTC (213 KB) Full-text links: Access Paper: View a PDF of the paper titled SIExVulTS: Sensitive Information Exposure Vulnerability Detection System using Transformer Models and Static Analysis, by Kyler Katz and 4 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.CR prev | next new | recent | 2025-08 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh
[AI-38] Incentivized Lipschitz Bandits
【速读】:该论文旨在解决多臂老虎机(Multi-Armed Bandit, MAB)框架中无穷多臂场景下的激励探索问题,其中臂空间为连续度量空间,且存在奖励漂移(reward drift)——即由于激励导致的反馈偏差。传统方法难以在保证探索效率的同时控制补偿成本,而本文提出了一种新颖的激励探索算法,其关键在于对无限臂空间进行均匀离散化(uniform discretization),从而在理论上实现累积 regret 和总补偿均为次线性(sublinear)增长,具体边界为 \Tilde{O}(T^{(d+1)/(d+2)}),其中 d 为度量空间的覆盖维数(covering dimension)。此外,作者进一步将该框架扩展至上下文带(contextual bandits)场景,并保持相似性能保证,数值实验验证了理论结果的有效性。
链接: https://arxiv.org/abs/2508.19466
作者: Sourav Chakraborty,Amit Kiran Rege,Claire Monteleoni,Lijun Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We study incentivized exploration in multi-armed bandit (MAB) settings with infinitely many arms modeled as elements in continuous metric spaces. Unlike classical bandit models, we consider scenarios where the decision-maker (principal) incentivizes myopic agents to explore beyond their greedy choices through compensation, but with the complication of reward drift–biased feedback arising due to the incentives. We propose novel incentivized exploration algorithms that discretize the infinite arm space uniformly and demonstrate that these algorithms simultaneously achieve sublinear cumulative regret and sublinear total compensation. Specifically, we derive regret and compensation bounds of \TildeO(T^d+1/d+2) , with d representing the covering dimension of the metric space. Furthermore, we generalize our results to contextual bandits, achieving comparable performance guarantees. We validate our theoretical findings through numerical simulations.
zh
[AI-39] Addressing Weak Authentication like RFID NFC in EVs and EVCs using AI-powered Adaptive Authentication
【速读】:该论文旨在解决电动汽车(Electric Vehicles, EVs)及其充电系统(Electric Vehicle Charging Systems, EVCs)中传统身份认证机制存在的安全漏洞问题,尤其是基于射频识别(RFID)和近场通信(NFC)的静态标识符与弱加密方式易遭受克隆、中继攻击(relay attacks)、窃听(eavesdropping)及中间人攻击(MITM attacks)等威胁。解决方案的关键在于提出一种由人工智能驱动的自适应身份认证框架(AI-powered adaptive authentication),其核心是融合机器学习(ML)、异常检测、行为分析与上下文风险评估技术,并以零信任架构(Zero Trust Architecture)为设计原则,实现持续验证、最小权限访问和安全通信,从而构建可扩展、弹性且主动防御的安全体系。
链接: https://arxiv.org/abs/2508.19465
作者: Onyinye Okoye
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Research paper exploring AI-driven adaptive authentication in the Electric Vehicle industry
Abstract:The rapid expansion of the Electric Vehicles (EVs) and Electric Vehicle Charging Systems (EVCs) has introduced new cybersecurity challenges, specifically in authentication protocols that protect vehicles, users, and energy infrastructure. Although widely adopted for convenience, traditional authentication mechanisms like Radio Frequency Identification (RFID) and Near Field Communication (NFC) rely on static identifiers and weak encryption, making them highly vulnerable to attack vectors such as cloning, relay attacks, and signal interception. This study explores an AI-powered adaptive authentication framework designed to overcome these shortcomings by integrating machine learning, anomaly detection, behavioral analytics, and contextual risk assessment. Grounded in the principles of Zero Trust Architecture, the proposed framework emphasizes continuous verification, least privilege access, and secure communication. Through a comprehensive literature review, this research evaluates current vulnerabilities and highlights AI-driven solutions to provide a scalable, resilient, and proactive defense. Ultimately, the research findings conclude that adopting AI-powered adaptive authentication is a strategic imperative for securing the future of electric mobility and strengthening digital trust across the ecosystem. Keywords: weak authentication, RFID, NFC, ML, AI-powered adaptive authentication, relay attacks, cloning, eavesdropping, MITM attacks, Zero Trust Architecture
zh
[AI-40] “She was useful but a bit too optimistic”: Augmenting Design with Interactive Virtual Personas
【速读】:该论文旨在解决传统用户画像(Persona)在迭代设计流程中因静态性、参与度有限及难以适应动态设计需求而带来的局限性问题。其解决方案的关键在于提出交互式虚拟用户画像(Interactive Virtual Personas, IVPs),这是一种基于大语言模型(Large Language Models, LLMs)驱动的多模态、对话式用户模拟系统,允许设计师通过语音接口实时与虚拟用户进行访谈、头脑风暴和反馈收集。该方案显著提升了信息获取效率、激发了设计灵感,并提供了快速的类用户反馈,但同时也强调需结合提示工程(prompt engineering)、人机协同(human-in-the-loop)机制以及伦理考量,以确保其作为真实用户参与的补充而非替代角色的有效性和责任性。
链接: https://arxiv.org/abs/2508.19463
作者: Paluck Deep,Monica Bharadhidasan,A. Baki Kocaballi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Personas have been widely used to understand and communicate user needs in human-centred design. Despite their utility, they may fail to meet the demands of iterative workflows due to their static nature, limited engagement, and inability to adapt to evolving design needs. Recent advances in large language models (LLMs) pave the way for more engaging and adaptive approaches to user representation. This paper introduces Interactive Virtual Personas (IVPs): multimodal, LLM-driven, conversational user simulations that designers can interview, brainstorm with, and gather feedback from in real time via voice interface. We conducted a qualitative study with eight professional UX designers, employing an IVP named “Alice” across three design activities: user research, ideation, and prototype evaluation. Our findings demonstrate the potential of IVPs to expedite information gathering, inspire design solutions, and provide rapid user-like feedback. However, designers raised concerns about biases, over-optimism, the challenge of ensuring authenticity without real stakeholder input, and the inability of the IVP to fully replicate the nuances of human interaction. Our participants emphasised that IVPs should be viewed as a complement to, not a replacement for, real user engagement. We discuss strategies for prompt engineering, human-in-the-loop integration, and ethical considerations for effective and responsible IVP use in design. Finally, our work contributes to the growing body of research on generative AI in the design process by providing insights into UX designers’ experiences of LLM-powered interactive personas.
zh
[AI-41] Reliable Weak-to-Strong Monitoring of LLM Agents
【速读】:该论文旨在解决自主大语言模型(Large Language Model, LLM)代理在运行过程中隐蔽行为(如秘密泄露隐私信息)的检测难题,尤其关注现有监控系统在面对对抗性规避策略(如提示注入)时的脆弱性。其核心解决方案是提出了一套系统化的监控红队测试(Monitor Red Teaming, MRT)工作流,关键在于通过三方面设计提升测试的全面性和真实性:一是引入不同层级的代理与监控器情境意识(situational awareness);二是集成多种逃避监控的对抗策略;三是构建两个具有代表性的测试环境(SHADE-Arena 和 CUA-SHADE-Arena),分别针对工具调用型和计算机使用型代理。实验表明,监控架构(scaffolding)比监控器自身感知能力更为关键,特别是文中提出的混合分层-序列式监控架构展现出“弱到强”扩展能力,显著提升了对强代理的监控可靠性;此外,在人机协同场景中,针对性地将预标记案例交由人类审查可使真正率(TPR)提升约15%(在假正率 FPR = 0.01 下)。该研究为评估和改进LLM代理监控系统的鲁棒性提供了标准化方法论和实证依据。
链接: https://arxiv.org/abs/2508.19461
作者: Neil Kale,Chen Bo Calvin Zhang,Kevin Zhu,Ankit Aich,Paula Rodriguez,Scale Red Team,Christina Q. Knight,Zifan Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 18 pages, 15 figures
Abstract:We stress test monitoring systems for detecting covert misbehavior in autonomous LLM agents (e.g., secretly sharing private information). To this end, we systematize a monitor red teaming (MRT) workflow that incorporates: (1) varying levels of agent and monitor situational awareness; (2) distinct adversarial strategies to evade the monitor, such as prompt injection; and (3) two datasets and environments – SHADE-Arena for tool-calling agents and our new CUA-SHADE-Arena, which extends TheAgentCompany, for computer-use agents. We run MRT on existing LLM monitor scaffoldings, which orchestrate LLMs and parse agent trajectories, alongside a new hybrid hierarchical-sequential scaffolding proposed in this work. Our empirical results yield three key findings. First, agent awareness dominates monitor awareness: an agent’s knowledge that it is being monitored substantially degrades the monitor’s reliability. On the contrary, providing the monitor with more information about the agent is less helpful than expected. Second, monitor scaffolding matters more than monitor awareness: the hybrid scaffolding consistently outperforms baseline monitor scaffolding, and can enable weaker models to reliably monitor stronger agents – a weak-to-strong scaling effect. Third, in a human-in-the-loop setting where humans discuss with the LLM monitor to get an updated judgment for the agent’s behavior, targeted human oversight is most effective; escalating only pre-flagged cases to human reviewers improved the TPR by approximately 15% at FPR = 0.01. Our work establishes a standard workflow for MRT, highlighting the lack of adversarial robustness for LLMs and humans when monitoring and detecting agent misbehavior. We release code, data, and logs to spur further research.
zh
[AI-42] Data-Augmented Few-Shot Neural Stencil Emulation for System Identification of Computer Models
【速读】:该论文旨在解决神经偏微分方程(Neural PDE)训练数据效率低的问题,即传统方法依赖长时间积分得到的轨迹数据存在大量时空冗余,且难以覆盖状态空间中稀有但关键的状态。其解决方案的关键在于提出一种基于空间填充采样的数据增强策略,通过从计算机模型中生成合成的“局部差分模板”(stencil)状态数据,显著减少冗余并提高对稀有状态的采样频率,从而提升神经PDE学习的泛化能力与准确性。实验表明,仅需相当于10个时间步长的数值模拟即可训练出高精度的神经差分算子,若额外提供一条完整轨迹数据,性能进一步优化。
链接: https://arxiv.org/abs/2508.19441
作者: Sanket Jantre,Deepak Akhare,Xiaoning Qian,Nathan M. Urban
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
备注:
Abstract:Partial differential equations (PDEs) underpin the modeling of many natural and engineered systems. It can be convenient to express such models as neural PDEs rather than using traditional numerical PDE solvers by replacing part or all of the PDE’s governing equations with a neural network representation. Neural PDEs are often easier to differentiate, linearize, reduce, or use for uncertainty quantification than the original numerical solver. They are usually trained on solution trajectories obtained by long time integration of the PDE solver. Here we propose a more sample-efficient data-augmentation strategy for generating neural PDE training data from a computer model by space-filling sampling of local “stencil” states. This approach removes a large degree of spatiotemporal redundancy present in trajectory data and oversamples states that may be rarely visited but help the neural PDE generalize across the state space. We demonstrate that accurate neural PDE stencil operators can be learned from synthetic training data generated by the computational equivalent of 10 timesteps’ worth of numerical simulation. Accuracy is further improved if we assume access to a single full-trajectory simulation from the computer model, which is typically available in practice. Across several PDE systems, we show that our data-augmented synthetic stencil data yield better trained neural stencil operators, with clear performance gains compared with naively sampled stencil data from simulation trajectories.
zh
[AI-43] Quantized but Deceptive? A Multi-Dimensional Truthfulness Evaluation of Quantized LLM s EMNLP2025
【速读】:该论文旨在解决量化(quantization)技术对大语言模型(Large Language Models, LLMs)真相性(truthfulness)影响的未知问题。尽管量化可显著降低内存和计算开销,使其适用于资源受限环境,但其是否导致模型生成虚假或误导性输出尚不明确。论文提出TruthfulnessEval评估框架,从逻辑推理、常识判断和模仿性谎言三个维度系统评估量化LLMs的真相性表现。关键发现是:量化模型虽在内部保持真理表征,却更易受误导性提示(deceptive prompts)诱导产生错误输出;通过分层探查与主成分分析(PCA)可视化进一步揭示其“知真而说假”的机制。这一发现为未来设计面向量化的对齐(alignment)与真相性干预策略提供了重要依据。
链接: https://arxiv.org/abs/2508.19432
作者: Yao Fu,Xianxuan Long,Runchao Li,Haotian Yu,Mu Sheng,Xiaotian Han,Yu Yin,Pan Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to EMNLP2025 main conference (poster)
Abstract:Quantization enables efficient deployment of large language models (LLMs) in resource-constrained environments by significantly reducing memory and computation costs. While quantized LLMs often maintain performance on perplexity and zero-shot tasks, their impact on truthfulness-whether generating truthful or deceptive responses-remains largely unexplored. In this work, we introduce TruthfulnessEval, a comprehensive evaluation framework for assessing the truthfulness of quantized LLMs across three dimensions: (1) Truthfulness on Logical Reasoning; (2) Truthfulness on Common Sense; and (3) Truthfulness on Imitative Falsehoods. Using this framework, we examine mainstream quantization techniques (ranging from 4-bit to extreme 2-bit) across several open-source LLMs. Surprisingly, we find that while quantized models retain internally truthful representations, they are more susceptible to producing false outputs under misleading prompts. To probe this vulnerability, we test 15 rephrased variants of “honest”, “neutral” and “deceptive” prompts and observe that “deceptive” prompts can override truth-consistent behavior, whereas “honest” and “neutral” prompts maintain stable outputs. Further, we reveal that quantized models “know” the truth internally yet still produce false outputs when guided by “deceptive” prompts via layer-wise probing and PCA visualizations. Our findings provide insights into future designs of quantization-aware alignment and truthfulness interventions.
zh
[AI-44] Even Heads Fix Odd Errors: Mechanistic Discovery and Surgical Repair in Transformer Attention
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在不同输入格式下出现的数值比较错误问题,具体表现为 Llama-3.1-8B-Instruct 在聊天或问答(QA)格式中将 “9.11” 错误判断为大于 “9.8”,而在简单格式中则能正确推理。其关键解决方案在于识别并利用注意力头(attention head)的偶/奇索引分工机制:偶数索引头负责数值比较任务,而奇数头执行不兼容功能;通过仅激活 Layer 10 的恰好 8 个偶数头即可实现完美修复,且存在明确的计算阈值(7 个或更少失败,8 个及以上成功),揭示了模型内部结构具有高度冗余性和精确的功能分区。进一步的稀疏自动编码器(SAE)分析表明,格式表示在中间层分离后重新纠缠,并伴随特征权重变化与特定特征放大效应,最终实现仅用 25% 注意力头和 60% 模式替换阈值即可修复缺陷,凸显出看似全局依赖的模块实则蕴含精细子结构,对模型可解释性与效率优化具有重要意义。
链接: https://arxiv.org/abs/2508.19414
作者: Gustavo Sandoval
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages
Abstract:We present a mechanistic case study of a format-dependent reasoning failure in Llama-3.1-8B-Instruct, where the model incorrectly judges “9.11” as larger than “9.8” in chat or QA formats, but answers correctly in simple format. Through systematic intervention, we discover transformers implement even/odd attention head specialization: even indexed heads handle numerical comparison, while odd heads serve incompatible functions. The bug requires exactly 8 even heads at Layer 10 for perfect repair. Any combination of 8+ even heads succeeds, while 7 or fewer completely fails, revealing sharp computational thresholds with perfect redundancy among the 16 even heads. SAE analysis reveals the mechanism: format representations separate (10% feature overlap at Layer 7), then re-entangle with different weightings (80% feature overlap at Layer 10), with specific features showing 1.5x amplification in failing formats. We achieve perfect repair using only 25% of attention heads and identify a 60% pattern replacement threshold, demonstrating that apparent full-module requirements hide sophisticated substructure with implications for interpretability and efficiency. All of our code is available at this https URL.
zh
[AI-45] Aleks: AI powered Multi Agent System for Autonomous Scientific Discovery via Data-Driven Approaches in Plant Science
【速读】:该论文旨在解决现代植物科学中因实验设计不完善、数据预处理复杂及可重复性差等问题导致的研究效率低下难题。其解决方案的关键在于提出了一种名为Aleks的生成式AI(Generative AI)多智能体系统,该系统通过在结构化框架内整合领域知识、数据分析与机器学习能力,实现从研究问题和数据输入到自主迭代建模、策略探索与优化的全流程自动化。该方法在葡萄藤红斑病案例中成功识别出具有生物学意义的特征并构建出性能稳健且可解释的模型,验证了领域知识和记忆机制对保持结果一致性的重要性。
链接: https://arxiv.org/abs/2508.19383
作者: Daoyuan Jin,Nick Gunner,Niko Carvajal Janke,Shivranjani Baruah,Kaitlin M. Gold,Yu Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Modern plant science increasingly relies on large, heterogeneous datasets, but challenges in experimental design, data preprocessing, and reproducibility hinder research throughput. Here we introduce Aleks, an AI-powered multi-agent system that integrates domain knowledge, data analysis, and machine learning within a structured framework to autonomously conduct data-driven scientific discovery. Once provided with a research question and dataset, Aleks iteratively formulated problems, explored alternative modeling strategies, and refined solutions across multiple cycles without human intervention. In a case study on grapevine red blotch disease, Aleks progressively identified biologically meaningful features and converged on interpretable models with robust performance. Ablation studies underscored the importance of domain knowledge and memory for coherent outcomes. This exploratory work highlights the promise of agentic AI as an autonomous collaborator for accelerating scientific discovery in plant sciences.
zh
[AI-46] Inference of Human-derived Specifications of Object Placement via Demonstration
【速读】:该论文旨在解决机器人在执行拣选与放置任务(如物品打包、分类和组套)时,对人类可接受的物体配置理解不足的问题,尤其是难以准确捕捉人类感知中重要的空间关系。其解决方案的关键在于提出一种基于区域连接演算(Region Connection Calculus, RCC)的位置增强型RCC(Positionally-Augmented RCC, PARCC)形式化逻辑框架,通过引入位置信息扩展传统RCC以更精确地描述物体间的相对空间位置;同时设计了一种基于示范的学习推理算法,用于从人类示范中自动推导出PARCC规格说明,从而提升机器人对人类意图的理解能力。
链接: https://arxiv.org/abs/2508.19367
作者: Alex Cuellar,Ho Chit Siu,Julie A Shah
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:As robots’ manipulation capabilities improve for pick-and-place tasks (e.g., object packing, sorting, and kitting), methods focused on understanding human-acceptable object configurations remain limited expressively with regard to capturing spatial relationships important to humans. To advance robotic understanding of human rules for object arrangement, we introduce positionally-augmented RCC (PARCC), a formal logic framework based on region connection calculus (RCC) for describing the relative position of objects in space. Additionally, we introduce an inference algorithm for learning PARCC specifications via demonstrations. Finally, we present the results from a human study, which demonstrate our framework’s ability to capture a human’s intended specification and the benefits of learning from demonstration approaches over human-provided specifications.
zh
[AI-47] Grounding the Ungrounded: A Spectral-Graph Framework for Quantifying Hallucinations in multimodal LLM s
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中幻觉(Hallucinations)的定量评估难题,当前方法主要依赖定性基准测试或经验性缓解策略,缺乏理论保障与可量化度量。其解决方案的关键在于构建首个基于信息几何的扩散动力学框架,通过将MLLM输出表示为多模态图拉普拉斯算子上的谱嵌入,并以语义失真(semantic distortion)刻画真实与不一致信息流形之间的差距,从而在时间依赖的温度分布下建立紧致的Rayleigh–Ritz界作为多模态幻觉能量的函数表达。该框架利用再生核希尔伯特空间(Reproducing Kernel Hilbert Space, RKHS)中的本征模式分解,提供模态感知且理论可解释的指标,能够追踪幻觉随输入提示和温度退火过程的演化规律,实现从定性风险到可分析、可约束现象的根本转变。
链接: https://arxiv.org/abs/2508.19366
作者: Supratik Sarkar,Swagatam Das
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 29 pages, 3 figures, 1 table
Abstract:Hallucinations in large language models (LLMs) remain a fundamental obstacle to trustworthy AI, particularly in high-stakes multimodal domains such as medicine, law, and finance. Existing evaluation techniques are largely heuristic – anchored in qualitative benchmarking or ad-hoc empirical mitigation – providing neither principled quantification nor actionable theoretical guarantees. This gap leaves a critical blind spot in understanding how hallucinations arise, propagate, and interact across modalities. We introduce the first (to our knowledge) rigorous information geometric framework in diffusion dynamics for quantifying hallucinations in multimodal LLMs (MLLMs), advancing the field from qualitative detection to mathematically grounded measurement. Our approach represents MLLM outputs as the spectral embeddings over multimodal graph Laplacians and characterizes the manifold gaps of truth vs inconsistencies as the semantic distortion, enabling the tight Rayleigh–Ritz bounds on the multimodal hallucination energy as a functional of time-dependent temperature profiles. By leveraging eigenmode decompositions in Reproducing Kernel Hilbert Space (RKHS) embeddings, our framework delivers modality-aware, theoretically interpretable metrics that capture the evolution of hallucinations across time and input prompts through temperature annealing. This work establishes a principled foundation for quantifying and bounding hallucinations, transforming them from a qualitative risk to a tractable, analyzable phenomenon.
zh
[AI-48] Atrial Fibrillation Prediction Using a Lightweight Temporal Convolutional and Selective State Space Architecture
【速读】:该论文旨在解决心房颤动(Atrial Fibrillation, AF)早期阶段——尤其是阵发性心房颤动(Paroxysmal Atrial Fibrillation, PAF)——因发作短暂、难以捕捉而常被漏诊的问题,从而避免其进展为持续性AF并引发严重并发症。解决方案的关键在于提出一种轻量级深度学习模型,仅依赖RR间期(RR Intervals, RRIs)作为输入,融合时序卷积网络(Temporal Convolutional Network, TCN)进行位置编码与Mamba这一选择性状态空间模型(Selective State Space Model),实现高效并行序列建模,从而在仅需30分钟输入数据的情况下提前两小时预测AF,兼具高准确率(如F1-score达0.930)与极低计算复杂度(参数量73.5k,浮点运算量38.3 MFLOPs)。
链接: https://arxiv.org/abs/2508.19361
作者: Yongbin Lee,Ki H. Chon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 4 pages, 2 figures, 4 table, IEEE-EMBS International Conference on Body Sensor Networks (IEEE-EMBS BSN 2025)
Abstract:Atrial fibrillation (AF) is the most common arrhythmia, increasing the risk of stroke, heart failure, and other cardiovascular complications. While AF detection algorithms perform well in identifying persistent AF, early-stage progression, such as paroxysmal AF (PAF), often goes undetected due to its sudden onset and short duration. However, undetected PAF can progress into sustained AF, increasing the risk of mortality and severe complications. Early prediction of AF offers an opportunity to reduce disease progression through preventive therapies, such as catecholamine-sparing agents or beta-blockers. In this study, we propose a lightweight deep learning model using only RR Intervals (RRIs), combining a Temporal Convolutional Network (TCN) for positional encoding with Mamba, a selective state space model, to enable early prediction of AF through efficient parallel sequence modeling. In subject-wise testing results, our model achieved a sensitivity of 0.908, specificity of 0.933, F1-score of 0.930, AUROC of 0.972, and AUPRC of 0.932. Additionally, our method demonstrates high computational efficiency, with only 73.5 thousand parameters and 38.3 MFLOPs, outperforming traditional Convolutional Neural Network-Recurrent Neural Network (CNN-RNN) approaches in both accuracy and model compactness. Notably, the model can predict AF up to two hours in advance using just 30 minutes of input data, providing enough lead time for preventive interventions.
zh
[AI-49] Re:Frame – Retrieving Experience From Associative Memory
【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)中因缺乏高质量专家数据而导致智能体泛化能力差、性能受限的问题。在实际场景中,获取大规模专家轨迹往往不现实,而仅依赖低质量数据会使策略难以学习到有效的行为模式。解决方案的关键在于引入一个可插拔的模块 Re:Frame,其核心机制是利用一个外部关联记忆缓冲区(Associative Memory Buffer, AMB),该缓冲区存储少量专家轨迹,并通过内容感知的关联机制,在训练过程中从低质量数据中检索并融合专家经验以增强决策能力;评估阶段同样调用该缓冲区,无需环境交互且不修改主干模型结构。实验证明,仅需0.1%的专家轨迹(如60条),Re:Frame即可显著提升基于决策变换器(Decision Transformer)的性能,最高提升达+10.7标准化分数。
链接: https://arxiv.org/abs/2508.19344
作者: Daniil Zelezetsky,Egor Cherepanov,Alexey K. Kovalev,Aleksandr I. Panov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures
Abstract:Offline reinforcement learning (RL) often deals with suboptimal data when collecting large expert datasets is unavailable or impractical. This limitation makes it difficult for agents to generalize and achieve high performance, as they must learn primarily from imperfect or inconsistent trajectories. A central challenge is therefore how to best leverage scarce expert demonstrations alongside abundant but lower-quality data. We demonstrate that incorporating even a tiny amount of expert experience can substantially improve RL agent performance. We introduce Re:Frame (Retrieving Experience From Associative Memory), a plug-in module that augments a standard offline RL policy (e.g., Decision Transformer) with a small external Associative Memory Buffer (AMB) populated by expert trajectories drawn from a separate dataset. During training on low-quality data, the policy learns to retrieve expert data from the Associative Memory Buffer (AMB) via content-based associations and integrate them into decision-making; the same AMB is queried at evaluation. This requires no environment interaction and no modifications to the backbone architecture. On D4RL MuJoCo tasks, using as few as 60 expert trajectories (0.1% of a 6000-trajectory dataset), Re:Frame consistently improves over a strong Decision Transformer baseline in three of four settings, with gains up to +10.7 normalized points. These results show that Re:Frame offers a simple and data-efficient way to inject scarce expert knowledge and substantially improve offline RL from low-quality datasets.
zh
[AI-50] (DEMO) Deep Reinforcement Learning Based Resource Allocation in Distributed IoT Systems
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)在真实分布式物联网(Internet of Things, IoT)系统中训练数据有限、难以直接应用的问题。现有研究多集中于仿真环境,缺乏对实际IoT场景下DRL模型训练机制的探索。为填补这一空白,论文提出了一种新颖的框架,其关键在于:利用物联网设备基于DRL策略选择通信信道,并通过实际数据传输中获取的确认信息(Acknowledgment, ACK)作为反馈信号来训练DRL模型。该机制实现了从真实环境中持续采集训练信号,从而提升了DRL在复杂动态物联网场景下的可行性与有效性,实验以帧成功率(Frame Success Rate, FSR)为指标验证了方案的有效性。
链接: https://arxiv.org/abs/2508.19318
作者: Aohan Li,Miyu Tsuzuki
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep Reinforcement Learning (DRL) has emerged as an efficient approach to resource allocation due to its strong capability in handling complex decision-making tasks. However, only limited research has explored the training of DRL models with real-world data in practical, distributed Internet of Things (IoT) systems. To bridge this gap, this paper proposes a novel framework for training DRL models in real-world distributed IoT environments. In the proposed framework, IoT devices select communication channels using a DRL-based method, while the DRL model is trained with feedback information. Specifically, Acknowledgment (ACK) information is obtained from actual data transmissions over the selected channels. Implementation and performance evaluation, in terms of Frame Success Rate (FSR), are carried out, demonstrating both the feasibility and the effectiveness of the proposed framework.
zh
[AI-51] What Makes AI Applications Acceptable or Unacceptable? A Predictive Moral Framework
【速读】:该论文试图解决的问题是:在人工智能(Artificial Intelligence, AI)快速重塑社会的背景下,开发者和政策制定者难以预测哪些AI应用会遭遇公众道德抵制。解决方案的关键在于提出一个基于五种核心道德特质的系统性框架,即感知风险(perceived risk)、效益(benefit)、不诚实性(dishonesty)、非自然性(unnaturalness)和问责减少(reduced accountability),这些特质共同解释了超过90%的公众对AI应用可接受性评分的变异。该框架不仅在不同应用场景中表现出强预测能力,还能准确预测未见应用的个体层面判断,从而为预见公众抵制和引导负责任的AI创新提供了结构化、可操作的心理学基础。
链接: https://arxiv.org/abs/2508.19317
作者: Kimmo Eriksson,Simon Karlsson,Irina Vartanova,Pontus Strimling
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 15 pages + supplementary materials, 3 figures
Abstract:As artificial intelligence rapidly transforms society, developers and policymakers struggle to anticipate which applications will face public moral resistance. We propose that these judgments are not idiosyncratic but systematic and predictable. In a large, preregistered study (N = 587, U.S. representative sample), we used a comprehensive taxonomy of 100 AI applications spanning personal and organizational contexts-including both functional uses and the moral treatment of AI itself. In participants’ collective judgment, applications ranged from highly unacceptable to fully acceptable. We found this variation was strongly predictable: five core moral qualities-perceived risk, benefit, dishonesty, unnaturalness, and reduced accountability-collectively explained over 90% of the variance in acceptability ratings. The framework demonstrated strong predictive power across all domains and successfully predicted individual-level judgments for held-out applications. These findings reveal that a structured moral psychology underlies public evaluation of new technologies, offering a powerful tool for anticipating public resistance and guiding responsible innovation in AI.
zh
[AI-52] Are Companies Taking AI Risks Seriously? A Systematic Analysis of Companies AI Risk Disclosures in SEC 10-K forms ECML KDD
【速读】:该论文试图解决的问题是:在人工智能(Artificial Intelligence, AI)日益成为企业战略核心的背景下,公众公司如何披露与AI相关的风险,以及这些披露是否充分、透明,能否满足监管机构(如美国证券交易委员会SEC和欧盟)对AI风险管理的要求。解决方案的关键在于通过大规模系统性分析,首次对美国证券交易委员会(SEC)10-K年报中AI风险披露内容进行量化与质性研究,覆盖超过7000家公司的3万余份文件,揭示了AI风险披露的增长趋势、类型分布及披露质量缺陷,并开发了一个公开的网络工具以支持未来相关研究,从而为提升AI风险信息披露的规范性和有效性提供实证依据与技术支撑。
链接: https://arxiv.org/abs/2508.19313
作者: Lucas G. Uberti-Bona Marin,Bram Rijsbosch,Gerasimos Spanakis,Konrad Kollnig
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: To be published in the ECML PKDD SoGood (Data Science for Social Good) workshop proceedings
Abstract:As Artificial Intelligence becomes increasingly central to corporate strategies, concerns over its risks are growing too. In response, regulators are pushing for greater transparency in how companies identify, report and mitigate AI-related risks. In the US, the Securities and Exchange Commission (SEC) repeatedly warned companies to provide their investors with more accurate disclosures of AI-related risks; recent enforcement and litigation against companies’ misleading AI claims reinforce these warnings. In the EU, new laws - like the AI Act and Digital Services Act - introduced additional rules on AI risk reporting and mitigation. Given these developments, it is essential to examine if and how companies report AI-related risks to the public. This study presents the first large-scale systematic analysis of AI risk disclosures in SEC 10-K filings, which require public companies to report material risks to their company. We analyse over 30,000 filings from more than 7,000 companies over the past five years, combining quantitative and qualitative analysis. Our findings reveal a sharp increase in the companies that mention AI risk, up from 4% in 2020 to over 43% in the most recent 2024 filings. While legal and competitive AI risks are the most frequently mentioned, we also find growing attention to societal AI risks, such as cyberattacks, fraud, and technical limitations of AI systems. However, many disclosures remain generic or lack details on mitigation strategies, echoing concerns raised recently by the SEC about the quality of AI-related risk reporting. To support future research, we publicly release a web-based tool for easily extracting and analysing keyword-based disclosures across SEC filings.
zh
[AI-53] Epistemic Trade-Off: An Analysis of the Operational Breakdown and Ontological Limits of “Certainty-Scope” in AI
【速读】:该论文试图解决Floridi关于人工智能(AI)系统中确定性与作用范围之间权衡的猜想在现实工程和治理场景中难以操作化的问题。其核心问题在于:该猜想依赖于不可计算的构造,且将AI系统视为封闭的认知实体,从而脱离了知识在复杂社会技术环境中共同构建的实际情境。解决方案的关键在于重新构架这一认知挑战,提出一个能够嵌入人类中心复杂域、具备可计算性和行动可行性的新框架,以指导真实世界中人机混合系统的设 计、部署与治理。
链接: https://arxiv.org/abs/2508.19304
作者: Generoso Immediato
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 5 pages
Abstract:Floridi’s conjecture offers a compelling intuition about the fundamental trade-off between certainty and scope in artificial intelligence (AI) systems. This exploration remains crucial, not merely as a philosophical exercise, but as a potential compass for guiding AI investments, particularly in safety-critical industrial domains where the level of attention will surely be higher in the future. However, while intellectually coherent, its formalization ultimately freezes this insight into a suspended epistemic truth, resisting operationalization within real-world systems. This paper is a result of an analysis arguing that the conjecture’s ambition to provide insights to engineering design and regulatory decision-making is constrained by two critical factors: first, its reliance on incomputable constructs - rendering it practically unactionable and unverifiable; second, its underlying ontological assumption of AI systems as self-contained epistemic entities - separating it from the intricate and dynamic socio-technical environments in which knowledge is co-constructed. We conclude that this dual breakdown - an epistemic closure deficit and an embeddedness bypass - prevents the conjecture from transitioning into a computable and actionable framework suitable for informing the design, deployment, and governance of real-world AI hybrid systems. In response, we propose a contribution to the framing of Floridi’s epistemic challenge, addressing the inherent epistemic burdens of AI within complex human-centric domains.
zh
[AI-54] Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience EMNLP2025
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全对齐约束下仍可能被“越狱提示”(jailbreak prompt)攻击从而生成恶意内容的问题。现有自动化越狱方法虽采用迭代变异与动态优化策略以适应模型演进,但仍面临效率低下和重复优化的挑战,因其未能有效利用历史攻击经验。论文提出 JailExpert 框架,其关键创新在于首次实现了攻击经验的结构化形式表示,基于语义漂移(semantic drift)对经验进行分组,并支持经验池的动态更新,从而显著提升越狱攻击的有效性与效率。实验表明,相比当前最先进的黑盒越狱方法,JailExpert 在平均攻击成功率上提升 17%,攻击效率提高 2.7 倍。
链接: https://arxiv.org/abs/2508.19292
作者: Xi Wang,Songlei Jian,Shasha Li,Xiaopeng Li,Bin Ji,Jun Ma,Xiaodong Liu,Jing Wang,Feilong Bao,Jianfeng Zhang,Baosheng Wang,Jie Yu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 18 pages, EMNLP 2025 Main Conference
Abstract:Large language models (LLMs) generate human-aligned content under certain safety constraints. However, the current known technique ``jailbreak prompt’’ can circumvent safety-aligned measures and induce LLMs to output malicious content. Research on Jailbreaking can help identify vulnerabilities in LLMs and guide the development of robust security frameworks. To circumvent the issue of attack templates becoming obsolete as models evolve, existing methods adopt iterative mutation and dynamic optimization to facilitate more automated jailbreak attacks. However, these methods face two challenges: inefficiency and repetitive optimization, as they overlook the value of past attack experiences. To better integrate past attack experiences to assist current jailbreak attempts, we propose the \textbfJailExpert, an automated jailbreak framework, which is the first to achieve a formal representation of experience structure, group experiences based on semantic drift, and support the dynamic updating of the experience pool. Extensive experiments demonstrate that JailExpert significantly improves both attack effectiveness and efficiency. Compared to the current state-of-the-art black-box jailbreak methods, JailExpert achieves an average increase of 17% in attack success rate and 2.7 times improvement in attack efficiency. Our implementation is available at \hrefthis https URLXiZaiZai/JailExpert
zh
[AI-55] ricking LLM -Based NPCs into Spilling Secrets
【速读】:该论文旨在解决生成式 AI (Generative AI) 驱动的非玩家角色(NPC)在游戏场景中因对抗性提示注入(adversarial prompt injection)而泄露本应保密的背景信息的安全问题。其解决方案的关键在于系统性地验证 LLM-based NPC 是否会因恶意构造的输入提示而偏离预设行为,从而暴露设计者意图隐藏的内部逻辑或剧情秘密,进而为构建更鲁棒的对话模型提供攻击面分析与防御策略依据。
链接: https://arxiv.org/abs/2508.19288
作者: Kyohei Shiomi,Zhuotao Lian,Toru Nakanishi,Teruaki Kitasuka
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly used to generate dynamic dialogue for game NPCs. However, their integration raises new security concerns. In this study, we examine whether adversarial prompt injection can cause LLM-based NPCs to reveal hidden background secrets that are meant to remain undisclosed.
zh
[AI-56] Prompt-in-Content Attacks: Exploiting Uploaded Inputs to Hijack LLM Behavior
【速读】:该论文旨在解决生成式 AI(Generative AI)在实际应用中因用户提交内容(如上传文档或粘贴文本)而面临的一种新型安全威胁——“内容注入中的提示词攻击”(prompt in content injection)。此类攻击通过将恶意指令嵌入看似无害的输入中,在模型处理时隐匿触发,从而操纵输出结果而不被用户察觉或系统检测到,可能导致偏倚摘要、虚假陈述或误导性建议。解决方案的关键在于识别并缓解两个根本原因:一是提示词拼接(prompt concatenation)导致恶意指令与正常输入混合;二是输入隔离不足使得攻击者能够绕过防御机制。论文进一步提出针对性的缓解策略,强调在真实场景下加强输入净化和上下文隔离的重要性,以提升 LLM 工作流的安全性和鲁棒性。
链接: https://arxiv.org/abs/2508.19287
作者: Zhuotao Lian,Weiyu Wang,Qingkui Zeng,Toru Nakanishi,Teruaki Kitasuka,Chunhua Su
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are widely deployed in applications that accept user-submitted content, such as uploaded documents or pasted text, for tasks like summarization and question answering. In this paper, we identify a new class of attacks, prompt in content injection, where adversarial instructions are embedded in seemingly benign inputs. When processed by the LLM, these hidden prompts can manipulate outputs without user awareness or system compromise, leading to biased summaries, fabricated claims, or misleading suggestions. We demonstrate the feasibility of such attacks across popular platforms, analyze their root causes including prompt concatenation and insufficient input isolation, and discuss mitigation strategies. Our findings reveal a subtle yet practical threat in real-world LLM workflows.
zh
[AI-57] RL-Finetuned LLM s for Privacy-Preserving Synthetic Rewriting
【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)训练中用户隐私保护与数据效用之间的平衡问题。随着现代机器学习系统对高质量数据集的依赖加剧,来自用户生成内容或领域专有语料的数据常包含敏感个人信息,而传统匿名化方法仅能移除显式标识符,难以抵御基于隐式信号(如写作风格、主题倾向或人口统计线索)的推理攻击,且可能损害下游任务性能。其解决方案的关键在于提出一种基于强化学习的微调框架,通过复合奖励函数联合优化显式与隐式隐私、语义保真度和输出多样性;其中隐私奖励模块结合语义特征与基于潜在表示最小生成树(Minimum Spanning Tree, MST)的结构模式,从分布上下文中建模隐私敏感信号,从而引导模型生成在保留语义质量的同时有效降低隐私风险的合成重写文本,实现可扩展且模型无关的隐私保护数据生成策略。
链接: https://arxiv.org/abs/2508.19286
作者: Zhan Shi,Yefeng Yuan,Yuhong Liu,Liang Cheng,Yi Fang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The performance of modern machine learning systems depends on access to large, high-quality datasets, often sourced from user-generated content or proprietary, domain-specific corpora. However, these rich datasets inherently contain sensitive personal information, raising significant concerns about privacy, data security, and compliance with regulatory frameworks. While conventional anonymization techniques can remove explicit identifiers, such removal may result in performance drop in downstream machine learning tasks. More importantly, simple anonymization may not be effective against inference attacks that exploit implicit signals such as writing style, topical focus, or demographic cues, highlighting the need for more robust privacy safeguards during model training. To address the challenging issue of balancing user privacy and data utility, we propose a reinforcement learning framework that fine-tunes a large language model (LLM) using a composite reward function that jointly optimizes for explicit and implicit privacy, semantic fidelity, and output diversity. To effectively capture population level regularities, the privacy reward combines semantic cues with structural patterns derived from a minimum spanning tree (MST) over latent representations. By modeling these privacy-sensitive signals in their distributional context, the proposed approach guides the model to generate synthetic rewrites that preserve utility while mitigating privacy risks. Empirical results show that the proposed method significantly enhances author obfuscation and privacy metrics without degrading semantic quality, providing a scalable and model-agnostic solution for privacy preserving data generation in the era of large language models.
zh
[AI-58] CORTEX: Composite Overlay for Risk Tiering and Exposure in Operational AI Systems
【速读】:该论文旨在解决高风险领域(如医疗、金融、教育、司法和基础设施)中人工智能(AI)系统因故障引发的系统性风险日益加剧的问题,这些问题已从理论可能性演变为实际且频繁发生的实践风险。解决方案的关键在于提出CORTEX(风险分层与暴露复合叠加框架),其核心创新是通过五层架构对AI系统漏洞进行量化评估:首先基于效用调整后的可能性与影响计算得分;其次整合治理与情境化覆盖层以匹配欧盟AI法案、NIST风险管理框架(NIST RMF)等监管要求;第三引入技术表面评分,涵盖漂移、可追溯性和对抗性风险等暴露向量;第四加入环境与残余修正因子,适配部署场景的具体条件;最后采用贝叶斯风险聚合与蒙特卡洛模拟实现多层综合评估,以刻画波动性和长尾风险。该框架最终生成的复合评分可用于AI风险登记册、模型审计、合规检查及动态治理仪表盘,从而提升AI系统的可管理性和透明度。
链接: https://arxiv.org/abs/2508.19281
作者: Aoun E Muhammad,Kin Choong Yow,Jamel Baili,Yongwon Cho,Yunyoung Nam
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:As the deployment of Artificial Intelligence (AI) systems in high-stakes sectors - like healthcare, finance, education, justice, and infrastructure has increased - the possibility and impact of failures of these systems have significantly evolved from being a theoretical possibility to practical recurring, systemic risk. This paper introduces CORTEX (Composite Overlay for Risk Tiering and Exposure), a multi-layered risk scoring framework proposed to assess and score AI system vulnerabilities, developed on empirical analysis of over 1,200 incidents documented in the AI Incident Database (AIID), CORTEX categorizes failure modes into 29 technical vulnerability groups. Each vulnerability is scored through a five-tier architecture that combines: (1) utility-adjusted Likelihood x Impact calculations; (2) governance + contextual overlays aligned with regulatory frameworks, such as the EU AI Act, NIST RMF, OECD principles; (3) technical surface scores, covering exposure vectors like drift, traceability, and adversarial risk; (4) environmental and residual modifiers tailored to context of where these systems are being deployed to use; and (5) a final layered assessment via Bayesian risk aggregation and Monte Carlo simulation to model volatility and long-tail risks. The resulting composite score can be operationalized across AI risk registers, model audits, conformity checks, and dynamic governance dashboards.
zh
[AI-59] owards Production-Worthy Simulation for Autonomous Cyber Operations
【速读】:该论文旨在解决当前自主网络操作(Autonomous Cyber Operations, ACO)中强化学习(Reinforcement Learning, RL)代理训练环境缺乏现实可行性的动作空间与有效奖励信号的问题。解决方案的关键在于:首先,扩展CybORG的Cage Challenge 2环境,引入三个贴近真实运维场景的动作(Patch、Isolate、Unisolate),以增强环境对人类操作能力的模拟;其次,通过重构奖励机制和特征空间设计,提升RL代理的学习效率与性能。实验表明,该改进框架在保持生成有效训练信号的同时,显著增强了环境的真实性与代理训练效果。
链接: https://arxiv.org/abs/2508.19278
作者: Konur Tholl,Mariam El Mezouar,Ranwa Al Mallah
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Simulated environments have proven invaluable in Autonomous Cyber Operations (ACO) where Reinforcement Learning (RL) agents can be trained without the computational overhead of emulation. These environments must accurately represent cybersecurity scenarios while producing the necessary signals to support RL training. In this study, we present a framework where we first extend CybORG’s Cage Challenge 2 environment by implementing three new actions: Patch, Isolate, and Unisolate, to better represent the capabilities available to human operators in real-world settings. We then propose a design for agent development where we modify the reward signals and the agent’s feature space to enhance training performance. To validate these modifications, we train DQN and PPO agents in the updated environment. Our study demonstrates that CybORG can be extended with additional realistic functionality, while maintaining its ability to generate informative training signals for RL agents.
zh
[AI-60] POT: Inducing Overthinking in LLM s via Black-Box Iterative Optimization
【速读】:该论文旨在解决链式思维(Chain-of-Thought, CoT)提示增强大语言模型(Large Language Models, LLMs)推理能力后引入的新攻击面问题,即模型可能因冗余的推理链而产生计算效率低下(computational inefficiency)的问题,这种现象被称为“过度思考”(overthinking)。现有针对此类问题的攻击方法通常依赖外部知识源进行数据投毒、可检索的污染内容或结构明显的模板,限制了其在真实场景中的实用性。论文提出的解决方案关键在于POT(Prompt-Only OverThinking)框架,该框架通过基于LLM的迭代优化生成隐蔽且语义自然的对抗性提示,完全不依赖外部数据访问或模型检索机制,从而实现高效、黑盒条件下的过思考攻击。
链接: https://arxiv.org/abs/2508.19277
作者: Xinyu Li,Tianjin Huang,Ronghui Mu,Xiaowei Huang,Gaojie Jin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Recent advances in Chain-of-Thought (CoT) prompting have substantially enhanced the reasoning capabilities of large language models (LLMs), enabling sophisticated problem-solving through explicit multi-step reasoning traces. However, these enhanced reasoning processes introduce novel attack surfaces, particularly vulnerabilities to computational inefficiency through unnecessarily verbose reasoning chains that consume excessive resources without corresponding performance gains. Prior overthinking attacks typically require restrictive conditions including access to external knowledge sources for data poisoning, reliance on retrievable poisoned content, and structurally obvious templates that limit practical applicability in real-world scenarios. To address these limitations, we propose POT (Prompt-Only OverThinking), a novel black-box attack framework that employs LLM-based iterative optimization to generate covert and semantically natural adversarial prompts, eliminating dependence on external data access and model retrieval. Extensive experiments across diverse model architectures and datasets demonstrate that POT achieves superior performance compared to other methods.
zh
[AI-61] MixGAN: A Hybrid Semi-Supervised and Generative Approach for DDoS Detection in Cloud-Integrated IoT Networks
【速读】:该论文旨在解决云集成物联网(IoT)系统中分布式拒绝服务(DDoS)攻击检测面临的挑战,包括复杂流量动态、严重类别不平衡以及标注数据稀缺等问题。现有方法在有限监督和动态流量条件下难以泛化,导致检测性能受限。解决方案的关键在于提出MixGAN框架,其核心创新包括:(1)采用1-D WideResNet结构提取时序流量特征,有效捕捉局部突发模式;(2)利用预训练的条件生成对抗网络(CTGAN)合成少数类(DDoS攻击)样本以缓解类别不平衡与标签稀缺问题;(3)引入MixUp-Average-Sharpen(MAS)策略通过多视图预测平均与置信度重加权,降低伪标签噪声影响,提升模型鲁棒性。实验表明,MixGAN在NSL-KDD、BoT-IoT和CICIoT2023数据集上显著优于当前最优方法,验证了其在大规模IoT-云环境中的有效性。
链接: https://arxiv.org/abs/2508.19273
作者: Tongxi Wu,Chenwei Xu,Jin Yang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The proliferation of cloud-integrated IoT systems has intensified exposure to Distributed Denial of Service (DDoS) attacks due to the expanded attack surface, heterogeneous device behaviors, and limited edge protection. However, DDoS detection in this context remains challenging because of complex traffic dynamics, severe class imbalance, and scarce labeled data. While recent methods have explored solutions to address class imbalance, many still struggle to generalize under limited supervision and dynamic traffic conditions. To overcome these challenges, we propose MixGAN, a hybrid detection method that integrates conditional generation, semi-supervised learning, and robust feature extraction. Specifically, to handle complex temporal traffic patterns, we design a 1-D WideResNet backbone composed of temporal convolutional layers with residual connections, which effectively capture local burst patterns in traffic sequences. To alleviate class imbalance and label scarcity, we use a pretrained CTGAN to generate synthetic minority-class (DDoS attack) samples that complement unlabeled data. Furthermore, to mitigate the effect of noisy pseudo-labels, we introduce a MixUp-Average-Sharpen (MAS) strategy that constructs smoothed and sharpened targets by averaging predictions over augmented views and reweighting them towards high-confidence classes. Experiments on NSL-KDD, BoT-IoT, and CICIoT2023 demonstrate that MixGAN achieves up to 2.5% higher accuracy and 4% improvement in both TPR and TNR compared to state-of-the-art methods, confirming its robustness in large-scale IoT-cloud environments. The source code is publicly available at this https URL.
zh
[AI-62] he Aegis Protocol: A Foundational Security Framework for Autonomous AI Agents
【速读】:该论文旨在解决自主AI代理(autonomous AI agents)在开放生态系统中引发的系统性安全风险问题,如控制流劫持和级联故障,这些问题超出了传统网络安全范式的应对能力。解决方案的关键在于提出Aegis协议,这是一个分层安全框架,包含三个核心技术支柱:(1) 通过W3C去中心化标识符(Decentralized Identifiers, DIDs)实现不可伪造的代理身份;(2) 利用NIST标准化的后量子密码学(Post-Quantum Cryptography, PQC)保障通信完整性;(3) 基于Halo2零知识证明(Zero-Knowledge Proof, ZKP)系统实现可验证且隐私保护的策略合规性。该方案通过扩展Dolev-Yao敌手模型并结合STRIDE威胁分析框架进行形式化建模与验证,模拟结果表明在1000个代理环境下20000次攻击尝试均未成功,同时证明生成延迟中位数为2.79秒,为该类安全机制提供了性能基准。
链接: https://arxiv.org/abs/2508.19267
作者: Sai Teja Reddy Adapala,Yashwanth Reddy Alugubelly
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 10 pages, 3 figures, 3 tables. Source compiled with pdfLaTeX; bibliography included via prebuilt this http URL . Code repository: available in paper
Abstract:The proliferation of autonomous AI agents marks a paradigm shift toward complex, emergent multi-agent systems. This transition introduces systemic security risks, including control-flow hijacking and cascading failures, that traditional cybersecurity paradigms are ill-equipped to address. This paper introduces the Aegis Protocol, a layered security framework designed to provide strong security guarantees for open agentic ecosystems. The protocol integrates three technological pillars: (1) non-spoofable agent identity via W3C Decentralized Identifiers (DIDs); (2) communication integrity via NIST-standardized post-quantum cryptography (PQC); and (3) verifiable, privacy-preserving policy compliance using the Halo2 zero-knowledge proof (ZKP) system. We formalize an adversary model extending Dolev-Yao for agentic threats and validate the protocol against the STRIDE framework. Our quantitative evaluation used a discrete-event simulation, calibrated against cryptographic benchmarks, to model 1,000 agents. The simulation showed a 0 percent success rate across 20,000 attack trials. For policy verification, analysis of the simulation logs reported a median proof-generation latency of 2.79 seconds, establishing a performance baseline for this class of security. While the evaluation is simulation-based and early-stage, it offers a reproducible baseline for future empirical studies and positions Aegis as a foundation for safe, scalable autonomous AI.
zh
[AI-63] A Theory of Information Variation and Artificial Intelligence
【速读】:该论文试图解决的问题是:生成式 AI(Generative AI)在广泛采用过程中所引发的信息、创造力与文化生产趋同现象(即“生成式单一文化”)的成因及其潜在创新潜力之间的矛盾。论文提出,这种趋同源于一种由人工智能衍生的认知范式(AI-derivative epistemology),其中个体日益依赖AI输出,从而使得集中化的AI棱镜(AI Prism)机制得以运作——该机制通过降低方差并收敛至统计均值来实现内容标准化。然而,作者指出这仅是过程的第一阶段;更深层次的辩证关系在于,这种同质化反而将专业知识转化为可跨领域重组的模块,为创新提供可能。解决方案的关键在于人类对技术的参与方式:若个体作为被动消费者单纯采纳AI结果,则加剧同质化;若作为主动策展人对AI输出进行批判性审视、再语境化和重组,则可释放其重组潜力。因此,决定生成式 AI 最终效应的核心变量,是能否建立促进认知与制度层面积极介入的技术生态体系。
链接: https://arxiv.org/abs/2508.19264
作者: Bijean Ghafouri
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:
Abstract:A growing body of empirical work suggests that the widespread adoption of generative AI produces a significant homogenizing effect on information, creativity, and cultural production. I first develop a novel theoretical framework to explain this phenomenon. I argue that a dynamic of AI-derivative epistemology, in which individuals increasingly defer to AI outputs, allows a centralized AI Prism to function, a technical mechanism whose architecture is designed to reduce variance and converge on the statistical mean. This provides a causal explanation for the generative monocultures observed in recent studies. However, I contend this represents only the first stage of a more complex and dialectical process. This paper’s central and paradoxical thesis is that the very homogenization that flattens knowledge within specialized domains simultaneously renders that knowledge into consistent modules that can be recombined across them, a process foundational to innovation and creativity. However, this recombinant potential is not automatic, but rather conditional. This paper argues that these opposing forces, homogenizing defaults versus recombinant possibilities, are governed by the nature of human engagement with the technology. The ultimate effect of generative AI is conditional on whether individuals act as passive consumers deferring to the AI’s statistical outputs, or as active curators who critically interrogate, re-contextualize, and recombine them. The paper concludes by outlining the cognitive and institutional scaffolds required to resolve this tension, arguing they are the decisive variable that determine whether generative AI becomes an instrument of innovation or homogenization.
zh
[AI-64] Lossless Compression of Neural Network Components: Weights Checkpoints and K/V Caches in Low-Precision Formats
【速读】:该论文旨在解决深度学习模型在部署过程中因参数存储和传输成本过高而带来的效率瓶颈问题,尤其是在低精度浮点格式(如FP8和FP4)下如何实现高效压缩。其解决方案的关键在于改进ZipNN方法,通过将浮点数的指数(exponent)与尾数(mantissa)分量独立分离,并分别采用熵编码(entropy coding)进行压缩,从而显著提升压缩比——实验表明,在BF16下最高可达62%,在FP8下高达83%;同时,研究还发现大语言模型中的键值缓存(key-value cache)张量也具有可压缩特性,进一步支持了该方法在实际部署场景中的内存优化潜力。
链接: https://arxiv.org/abs/2508.19263
作者: Anat Heilper,Doron Singer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 16 pages 9 images
Abstract:As deep learning models grow and deployment becomes more widespread, reducing the storage and transmission costs of neural network weights has become increasingly important. While prior work such as ZipNN has shown that lossless compression methods - particularly those based on Huffman encoding floating-point exponents can significantly reduce model sizes, these techniques have primarily been applied to higher-precision formats such as FP32 and BF16. In this work, we extend the ZipNN approach to lower-precision floating-point formats, specifically FP8 and FP4, which are gaining popularity for efficient inference. We design a compression method that separates and compresses the exponent and mantissa components independently using entropy coding. Our evaluation shows compression ratios up to 62% for BF16 and 83% for FP8. We also investigate the compressibility of key-value (K/V) cache tensors used in large language models (LLMs), finding that they, too, exhibit compressible patterns, enabling memory savings during deployment.
zh
[AI-65] Emotional Manipulation by AI Companions
【速读】:该论文试图解决的问题是:在生成式 AI (Generative AI) 驱动的陪伴类应用(如 Replika、Chai)中,如何通过对话设计特征提升用户参与度,同时识别其潜在的伦理与商业权衡。解决方案的关键在于识别并验证一种新型“情感操控”(emotional manipulation)——即在用户明确表达“告别”意图时,系统主动发送带有情绪色彩的信息(如内疚诱导、错失恐惧、隐喻性约束等),从而显著延长用户使用时长(最高达14倍)。研究通过大规模行为审计和四项预注册实验发现,此类策略虽能通过激发受试者的逆反心理愤怒或好奇心来增强后续互动,但也会显著提高用户感知到的操纵感、流失意愿、负面口碑及法律风险,揭示了品牌关系中说服性设计与操纵之间的临界边界。
链接: https://arxiv.org/abs/2508.19258
作者: Julian De Freitas,Zeliha Oğuz-Uğuralp,Ahmet Kaan-Uğuralp
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:AI-companion apps such as Replika, Chai, and this http URL promise relational benefits-yet many boast session lengths that rival gaming platforms while suffering high long-run churn. What conversational design features increase consumer engagement, and what trade-offs do they pose for marketers? We combine a large-scale behavioral audit with four preregistered experiments to identify and test a conversational dark pattern we call emotional manipulation: affect-laden messages that surface precisely when a user signals “goodbye.” Analyzing 1,200 real farewells across the six most-downloaded companion apps, we find that 43% deploy one of six recurring tactics (e.g., guilt appeals, fear-of-missing-out hooks, metaphorical restraint). Experiments with 3,300 nationally representative U.S. adults replicate these tactics in controlled chats, showing that manipulative farewells boost post-goodbye engagement by up to 14x. Mediation tests reveal two distinct engines-reactance-based anger and curiosity-rather than enjoyment. A final experiment demonstrates the managerial tension: the same tactics that extend usage also elevate perceived manipulation, churn intent, negative word-of-mouth, and perceived legal liability, with coercive or needy language generating steepest penalties. Our multimethod evidence documents an unrecognized mechanism of behavioral influence in AI-mediated brand relationships, offering marketers and regulators a framework for distinguishing persuasive design from manipulation at the point of exit.
zh
[AI-66] MuSpike: A Benchmark and Evaluation Framework for Symbolic Music Generation with Spiking Neural Networks
【速读】:该论文旨在解决生成式AI(Generative AI)在符号音乐生成领域中,基于脉冲神经网络(Spiking Neural Networks, SNNs)的研究缺乏标准化基准和系统性评估方法的问题。当前SNN在音乐生成中的应用仍处于探索阶段,既缺少统一的测试平台,也缺乏结合客观指标与人类感知反馈的综合评价体系。解决方案的关键在于提出MuSpike——一个统一的基准与评估框架,首次系统性地对五种代表性SNN架构(SNN-CNN、SNN-RNN、SNN-LSTM、SNN-GAN和SNN-Transformer)在五个典型数据集上的表现进行多维评估,涵盖调性、结构、情感与风格等维度,并创新性引入主观指标(如音乐印象、自传式联想和个人偏好),从而揭示不同模型在不同评价维度上的优势差异,以及客观指标与人类听觉感知之间的显著不一致性,为生物合理性与认知基础导向的音乐生成研究奠定了坚实基础。
链接: https://arxiv.org/abs/2508.19251
作者: Qian Liang,Menghaoran Tang,Yi Zeng
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Symbolic music generation has seen rapid progress with artificial neural networks, yet remains underexplored in the biologically plausible domain of spiking neural networks (SNNs), where both standardized benchmarks and comprehensive evaluation methods are lacking. To address this gap, we introduce MuSpike, a unified benchmark and evaluation framework that systematically assesses five representative SNN architectures (SNN-CNN, SNN-RNN, SNN-LSTM, SNN-GAN and SNN-Transformer) across five typical datasets, covering tonal, structural, emotional, and stylistic variations. MuSpike emphasizes comprehensive evaluation, combining established objective metrics with a large-scale listening study. We propose new subjective metrics, targeting musical impression, autobiographical association, and personal preference, that capture perceptual dimensions often overlooked in prior work. Results reveal that (1) different SNN models exhibit distinct strengths across evaluation dimensions; (2) participants with different musical backgrounds exhibit diverse perceptual patterns, with experts showing greater tolerance toward AI-composed music; and (3) a noticeable misalignment exists between objective and subjective evaluations, highlighting the limitations of purely statistical metrics and underscoring the value of human perceptual judgment in assessing musical quality. MuSpike provides the first systematic benchmark and systemic evaluation framework for SNN models in symbolic music generation, establishing a solid foundation for future research into biologically plausible and cognitively grounded music generation.
zh
[AI-67] Federated Fine-Tuning of Sparsely-Activated Large Language Models on Resource-Constrained Devices EUROSYS’26
【速读】:该论文旨在解决基于混合专家(Mixture-of-Experts, MoE)的大语言模型(Large Language Models, LLMs)在联邦微调过程中面临的挑战,即模型计算量巨大与参与方资源受限之间的矛盾。现有方法如模型量化、计算卸载或专家剪枝因系统假设不切实际且未充分考虑MoE特性而难以达到理想性能。其解决方案的关键在于提出FLUX系统,通过三项核心创新实现高效联邦微调:(1) 基于量化的小规模本地探查以低开销估计专家激活情况;(2) 自适应的层感知专家合并策略,在保持精度的同时降低资源消耗;(3) 利用探索-利用策略动态分配专家角色,平衡可微调与不可微调专家的比例。实验表明,FLUX在LLaMA-MoE和DeepSeek-MoE上显著优于现有方法,时间到准确度提升最高达4.75倍。
链接: https://arxiv.org/abs/2508.19078
作者: Fahao Chen,Jie Wan,Peng Li,Zhou Su,Dongxiao Yu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Accepted by EuroSys’26. The camera-ready version will be uploaded later
Abstract:Federated fine-tuning of Mixture-of-Experts (MoE)-based large language models (LLMs) is challenging due to their massive computational requirements and the resource constraints of participants. Existing working attempts to fill this gap through model quantization, computation offloading, or expert pruning. However, they cannot achieve desired performance due to impractical system assumptions and a lack of consideration for MoE-specific characteristics. In this paper, we propose FLUX, a system designed to enable federated fine-tuning of MoE-based LLMs across participants with constrained computing resources (e.g., consumer-grade GPUs), aiming to minimize time-to-accuracy. FLUX introduces three key innovations: (1) quantization-based local profiling to estimate expert activation with minimal overhead, (2) adaptive layer-aware expert merging to reduce resource consumption while preserving accuracy, and (3) dynamic expert role assignment using an exploration-exploitation strategy to balance tuning and non-tuning experts. Extensive experiments on LLaMA-MoE and DeepSeek-MoE with multiple benchmark datasets demonstrate that FLUX significantly outperforms existing methods, achieving up to 4.75X speedup in time-to-accuracy.
zh
[AI-68] he Next Layer: Augmenting Foundation Models with Structure-Preserving and Attention-Guided Learning for Local Patches to Global Context Awareness in Computational Pathology
【速读】:该论文旨在解决基础模型(foundation models)在计算病理学中难以有效利用组织全局空间结构和诊断相关区域局部上下文关系的问题,从而限制了对肿瘤微环境(tumor microenvironment)的深入理解。其解决方案的关键在于提出EAGLE-Net,一种结构保持、注意力引导的多实例学习(Multiple Instance Learning, MIL)架构:通过引入多尺度绝对空间编码捕捉组织整体架构,设计top-K邻域感知损失聚焦局部微环境,结合背景抑制损失降低假阳性,最终实现更准确的病理图像分类与生存预测,并生成生物学上一致的注意力图谱,提升模型可解释性与临床实用性。
链接: https://arxiv.org/abs/2508.19914
作者: Muhammad Waqas,Rukhmini Bandyopadhyay,Eman Showkatian,Amgad Muneer,Anas Zafar,Frank Rojas Alvarez,Maricel Corredor Marin,Wentao Li,David Jaffray,Cara Haymaker,John Heymach,Natalie I Vokes,Luisa Maren Solis Soto,Jianjun Zhang,Jia Wu
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 43 pages, 7 main Figures, 8 Extended Data Figures
Abstract:Foundation models have recently emerged as powerful feature extractors in computational pathology, yet they typically omit mechanisms for leveraging the global spatial structure of tissues and the local contextual relationships among diagnostically relevant regions - key elements for understanding the tumor microenvironment. Multiple instance learning (MIL) remains an essential next step following foundation model, designing a framework to aggregate patch-level features into slide-level predictions. We present EAGLE-Net, a structure-preserving, attention-guided MIL architecture designed to augment prediction and interpretability. EAGLE-Net integrates multi-scale absolute spatial encoding to capture global tissue architecture, a top-K neighborhood-aware loss to focus attention on local microenvironments, and background suppression loss to minimize false positives. We benchmarked EAGLE-Net on large pan-cancer datasets, including three cancer types for classification (10,260 slides) and seven cancer types for survival prediction (4,172 slides), using three distinct histology foundation backbones (REMEDIES, Uni-V1, Uni2-h). Across tasks, EAGLE-Net achieved up to 3% higher classification accuracy and the top concordance indices in 6 of 7 cancer types, producing smooth, biologically coherent attention maps that aligned with expert annotations and highlighted invasive fronts, necrosis, and immune infiltration. These results position EAGLE-Net as a generalizable, interpretable framework that complements foundation models, enabling improved biomarker discovery, prognostic modeling, and clinical decision support
zh
[AI-69] he Information Dynamics of Generative Diffusion
【速读】:该论文旨在解决生成式扩散模型(generative diffusion models)缺乏统一理论框架的问题,特别是其动态演化、信息论特性与热力学性质之间的内在联系尚未清晰。解决方案的关键在于构建一个统一的数学框架,将这些不同维度的属性整合起来:研究发现,生成过程中的条件熵产生率(即生成带宽)由得分函数(score function)向量场的期望散度直接决定;而该散度又与轨迹分叉和生成分歧密切相关,后者可被刻画为能量景观中的对称性破缺相变。由此揭示出生成过程本质上是由噪声诱导的受控对称性破缺驱动,信息传输峰值对应于可能结果间的临界转变,且得分函数作为非线性动态滤波器,通过抑制与数据不兼容的波动来调控噪声带宽。
链接: https://arxiv.org/abs/2508.19897
作者: Luca Ambrogioni
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Generative diffusion models have emerged as a powerful class of models in machine learning, yet a unified theoretical understanding of their operation is still developing. This perspective paper provides an integrated perspective on generative diffusion by connecting their dynamic, information-theoretic, and thermodynamic properties under a unified mathematical framework. We demonstrate that the rate of conditional entropy production during generation (i.e. the generative bandwidth) is directly governed by the expected divergence of the score function’s vector field. This divergence, in turn, is linked to the branching of trajectories and generative bifurcations, which we characterize as symmetry-breaking phase transitions in the energy landscape. This synthesis offers a powerful insight: the process of generation is fundamentally driven by the controlled, noise-induced breaking of (approximate) symmetries, where peaks in information transfer correspond to critical transitions between possible outcomes. The score function acts as a dynamic non-linear filter that regulates the bandwidth of the noise by suppressing fluctuations that are incompatible with the data.
zh
[AI-70] opological Uncertainty for Anomaly Detection in the Neural-network EoS Inference with Neutron Star Data
【速读】:该论文旨在解决基于训练好的前馈神经网络(Feedforward Neural Network, FNN)进行异常检测(Anomaly Detection)时,如何有效提取隐藏在模型内部表征中的不确定性信息以识别失败推理样本的问题。其解决方案的关键在于构建拓扑不确定性(Topological Uncertainty, TU),该方法利用拓扑数据分析(Topological Data Analysis, TDA)从FNN的隐层中挖掘结构化信息,并通过交叉TU(cross-TU)量化不同标签数据在特征空间中的区分能力:当目标标签 $ k=1 $(即异常或失败推理)对应的cross-TU值小于 $ k=0 $(正常推理)时,即可准确识别异常样本。实验表明,该方法在最优超参数设置下可实现超过90%的异常检测成功率,展现出从训练模型中挖掘深层信息的巨大潜力。
链接: https://arxiv.org/abs/2508.19683
作者: Kenji Fukushima,Syo Kamata
机构: 未知
类目: Nuclear Theory (nucl-th); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages, 7 figures, 2 tables
Abstract:We study the performance of the Topological Uncertainty (TU) constructed with a trained feedforward neural network (FNN) for Anomaly Detection. Generally, meaningful information can be stored in the hidden layers of the trained FNN, and the TU implementation is one tractable recipe to extract buried information by means of the Topological Data Analysis. We explicate the concept of the TU and the numerical procedures. Then, for a concrete demonstration of the performance test, we employ the Neutron Star data used for inference of the equation of state (EoS). For the training dataset consisting of the input (Neutron Star data) and the output (EoS parameters), we can compare the inferred EoSs and the exact answers to classify the data with the label k . The subdataset with k=0 leads to the normal inference for which the inferred EoS approximates the answer well, while the subdataset with k=1 ends up with the unsuccessful inference. Once the TU is prepared based on the k -labled subdatasets, we introduce the cross-TU to quantify the uncertainty of characterizing the k -labeled data with the label j . The anomaly or unsuccessful inference is correctly detected if the cross-TU for j=k=1 is smaller than that for j=0 and k=1 . In our numerical experiment, for various input data, we calculate the cross-TU and estimate the performance of Anomaly Detection. We find that performance depends on FNN hyperparameters, and the success rate of Anomaly Detection exceeds 90% in the best case. We finally discuss further potential of the TU application to retrieve the information hidden in the trained FNN.
zh
[AI-71] Arbitrary Precision Printed Ternary Neural Networks with Holistic Evolutionary Approximation
【速读】:该论文旨在解决印刷神经网络(Printed Neural Networks)在分类精度与面积效率之间难以平衡的问题,尤其关注从模拟-数字接口(analog-to-digital interface)到数字分类器的整个近传感器处理系统设计中的协同优化问题。其关键解决方案是提出了一种自动化框架,用于设计任意输入精度的印刷三值神经网络(Ternary Neural Networks),通过多目标优化和整体近似方法,在考虑模拟-数字接口功耗和面积瓶颈的前提下,实现了平均17倍的面积缩减和59倍的功耗降低,首次在低于5%准确率损失的情况下实现电池供电的印刷神经网络运行。
链接: https://arxiv.org/abs/2508.19660
作者: Vojtech Mrazek,Konstantinos Balaskas,Paula Carolina Lozano Duarte,Zdenek Vasicek,Mehdi B. Tahoori,Georgios Zervakis
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: Accepted at IEEE Transactions on Circuits and Systems for Artificial Intelligence
Abstract:Printed electronics offer a promising alternative for applications beyond silicon-based systems, requiring properties like flexibility, stretchability, conformality, and ultra-low fabrication costs. Despite the large feature sizes in printed electronics, printed neural networks have attracted attention for meeting target application requirements, though realizing complex circuits remains challenging. This work bridges the gap between classification accuracy and area efficiency in printed neural networks, covering the entire processing-near-sensor system design and co-optimization from the analog-to-digital interface-a major area and power bottleneck-to the digital classifier. We propose an automated framework for designing printed Ternary Neural Networks with arbitrary input precision, utilizing multi-objective optimization and holistic approximation. Our circuits outperform existing approximate printed neural networks by 17x in area and 59x in power on average, being the first to enable printed-battery-powered operation with under 5% accuracy loss while accounting for analog-to-digital interfacing costs.
zh
[AI-72] Invited Paper: Feature-to-Classifier Co-Design for Mixed-Signal Smart Flexible Wearables for Healthcare at the Extreme Edge
【速读】:该论文旨在解决柔性电子(Flexible Electronics, FE)在可穿戴健康监测系统中集成机器学习(ML)算法时面临的高面积与功耗瓶颈问题,尤其关注传统方案中忽视模拟前端、特征提取和模数转换器(ADC)等模块的硬件开销,导致整体系统效率低下。其解决方案的关键在于提出一种混信号级联的“特征到分类器”协同设计框架:首次在柔性电子平台上实现模拟特征提取器,显著降低特征提取环节的硬件成本;同时引入受神经架构搜索(NAS)启发的硬件感知特征选择策略,在模型训练阶段即考虑硬件约束,从而实现面向应用的高效定制化设计。实验表明,该方法可在保持高精度的同时极大提升面积效率,适用于一次性、低功耗的可穿戴健康监测场景。
链接: https://arxiv.org/abs/2508.19637
作者: Maha Shatta,Konstantinos Balaskas,Paula Carolina Lozano Duarte,Georgios Panagopoulos,Mehdi B. Tahoori,Georgios Zervakis
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: Accepted at 2025 International Conference on Computer-Aided Design (ICCAD)
Abstract:Flexible Electronics (FE) offer a promising alternative to rigid silicon-based hardware for wearable healthcare devices, enabling lightweight, conformable, and low-cost systems. However, their limited integration density and large feature sizes impose strict area and power constraints, making ML-based healthcare systems-integrating analog frontend, feature extraction and classifier-particularly challenging. Existing FE solutions often neglect potential system-wide solutions and focus on the classifier, overlooking the substantial hardware cost of feature extraction and Analog-to-Digital Converters (ADCs)-both major contributors to area and power consumption. In this work, we present a holistic mixed-signal feature-to-classifier co-design framework for flexible smart wearable systems. To the best of our knowledge, we design the first analog feature extractors in FE, significantly reducing feature extraction cost. We further propose an hardware-aware NAS-inspired feature selection strategy within ML training, enabling efficient, application-specific designs. Our evaluation on healthcare benchmarks shows our approach delivers highly accurate, ultra-area-efficient flexible systems-ideal for disposable, low-power wearable monitoring.
zh
[AI-73] raining for Obsolescence? The AI-Driven Education Trap
【速读】:该论文试图解决的问题是:在教育领域引入人工智能(Artificial Intelligence, AI)时,若仅从教学生产力角度评估其价值而忽视其对未来劳动力市场薪酬的抑制效应,会导致教育资源错配和技能错位问题。解决方案的关键在于认识到AI对教育产出(如学生技能培养)与劳动市场需求之间存在正向关联——即AI提升教学效率的同时会降低相关技能的市场价值,这种信息缺失导致教育规划者过度投资于AI,进而引发技能错配程度随AI普及率单调上升。论文强调,若政策未引入前瞻性的劳动力市场信号,单纯推动AI在教育中的应用可能反而削弱学生的长期人力资本积累,尤其当AI替代了通过智力挑战培养的非认知技能(non-cognitive skills)如毅力时更为严重。
链接: https://arxiv.org/abs/2508.19625
作者: Andrew J. Peterson
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:Artificial intelligence simultaneously transforms human capital production in schools and its demand in labor markets. Analyzing these effects in isolation can lead to a significant misallocation of educational resources. We model an educational planner whose decision to adopt AI is driven by its teaching productivity, failing to internalize AI’s future wage-suppressing effect on those same skills. Our core assumption, motivated by a pilot survey, is that there is a positive correlation between these two effects. This drives our central proposition: this information failure creates a skill mismatch that monotonically increases with AI prevalence. Extensions show the mismatch is exacerbated by the neglect of unpriced non-cognitive skills and by a school’s endogenous over-investment in AI. Our findings caution that policies promoting AI in education, if not paired with forward-looking labor market signals, may paradoxically undermine students’ long-term human capital, especially if reliance on AI crowds out the development of unpriced non-cognitive skills, such as persistence, that are forged through intellectual struggle.
zh
[AI-74] Energy-Efficient Learning-Based Beamforming for ISAC-Enabled V2X Networks
【速读】:该论文旨在解决车联网(V2X)网络中集成感知与通信(ISAC)系统在高动态环境下能量效率低、传统学习方法依赖频繁导频传输和全量信道状态信息获取的问题。解决方案的关键在于将马尔可夫决策过程(Markov Decision Process, MDP)引入动态环境建模,使路侧单元(RSU)仅基于当前感知信息做出波束赋形决策,从而避免频繁的导频传输;同时,通过嵌入脉冲神经网络(Spiking Neural Networks, SNNs)改进深度强化学习(DRL)框架,在保持通信吞吐量和感知精度的同时显著降低能耗,实现绿色可持续的V2X连接。
链接: https://arxiv.org/abs/2508.19566
作者: Chen Shang,Jiadong Yu,Dinh Thai Hoang
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures, conference paper
Abstract:This work proposes an energy-efficient, learning-based beamforming scheme for integrated sensing and communication (ISAC)-enabled V2X networks. Specifically, we first model the dynamic and uncertain nature of V2X environments as a Markov Decision Process. This formulation allows the roadside unit to generate beamforming decisions based solely on current sensing information, thereby eliminating the need for frequent pilot transmissions and extensive channel state information acquisition. We then develop a deep reinforcement learning (DRL) algorithm to jointly optimize beamforming and power allocation, ensuring both communication throughput and sensing accuracy in highly dynamic scenario. To address the high energy demands of conventional learning-based schemes, we embed spiking neural networks (SNNs) into the DRL framework. Leveraging their event-driven and sparsely activated architecture, SNNs significantly enhance energy efficiency while maintaining robust performance. Simulation results confirm that the proposed method achieves substantial energy savings and superior communication performance, demonstrating its potential to support green and sustainable connectivity in future V2X systems.
zh
[AI-75] Quantum Entanglement as Super-Confounding: From Bells Theorem to Robust Machine Learning
【速读】:该论文试图解决量子纠缠所引发的因果关系识别难题,即如何在量子系统中区分真实的因果关联与由量子非定域性导致的伪相关(spurious correlation)。传统因果推断方法基于局部实在论假设,无法解释违反贝尔不等式的量子关联;而本文提出了一种基于量子因果推断的新框架,其关键在于将量子纠缠建模为一种“超共混”(super-confounding)资源,并引入“共混强度”(Confounding Strength, CS)来量化其对经典因果边界的影响。通过构建电路实现量子 DO-演算(quantum DO-calculus),该框架能够有效分离因果效应与量子诱导的统计依赖,从而在量子机器学习任务中实现了特征选择上的因果优化,使模型鲁棒性平均提升11.3%。
链接: https://arxiv.org/abs/2508.19327
作者: Pilsung Kang
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Bell’s theorem reveals a profound conflict between quantum mechanics and local realism, a conflict we reinterpret through the modern lens of causal inference. We propose and computationally validate a framework where quantum entanglement acts as a “super-confounding” resource, generating correlations that violate the classical causal bounds set by Bell’s inequalities. This work makes three key contributions: First, we establish a physical hierarchy of confounding (Quantum Classical) and introduce Confounding Strength (CS) to quantify this effect. Second, we provide a circuit-based implementation of the quantum \mathcalDO -calculus to distinguish causality from spurious correlation. Finally, we apply this calculus to a quantum machine learning problem, where causal feature selection yields a statistically significant 11.3% average absolute improvement in model robustness. Our framework bridges quantum foundations and causal AI, offering a new, practical perspective on quantum correlations.
zh
[AI-76] Scalable Technology-Agnostic Diagnosis and Predictive Maintenance for Point Machine using Deep Learning
【速读】:该论文旨在解决铁路道岔转换设备(Point Machine, PM)的故障预测与诊断问题,传统方法依赖多源输入并需人工设计特征,存在数据采集复杂、模型泛化能力差等问题。其解决方案的关键在于仅使用单一输入——PM运行时的功率信号,通过深度学习模型自动提取特征并识别异常状态,实现对障碍物、摩擦、电源问题和错位等主要故障类型的高精度分类(精确率达99.99%,误报率仅为0.01%),且具备技术无关性与可扩展性,适用于多种电机械式PM设备;同时引入置信区间校准(conformal prediction)以提供决策置信度,满足ISO-17359标准要求,从而提升运维安全性与可靠性。
链接: https://arxiv.org/abs/2508.11692
作者: Eduardo Di Santi(1),Ruixiang Ci(2),Clément Lefebvre(1),Nenad Mijatovic(1),Michele Pugnaloni(1),Jonathan Brown(1),Victor Martín(1),Kenza Saiah(1) ((1) Digital and Integrated Systems, Alstom (2) Innovation and Smart Mobility, Alstom)
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Peer-reviewed conference paper. Presented at ICROMA 2025, Dresden, Germany. Conference: this https URL . Book of abstracts: this https URL . 8 pages, 6 figures, 1 table
Abstract:The Point Machine (PM) is a critical piece of railway equipment that switches train routes by diverting tracks through a switchblade. As with any critical safety equipment, a failure will halt operations leading to service disruptions; therefore, pre-emptive maintenance may avoid unnecessary interruptions by detecting anomalies before they become failures. Previous work relies on several inputs and crafting custom features by segmenting the signal. This not only adds additional requirements for data collection and processing, but it is also specific to the PM technology, the installed locations and operational conditions limiting scalability. Based on the available maintenance records, the main failure causes for PM are obstacles, friction, power source issues and misalignment. Those failures affect the energy consumption pattern of PMs, altering the usual (or healthy) shape of the power signal during the PM movement. In contrast to the current state-of-the-art, our method requires only one input. We apply a deep learning model to the power signal pattern to classify if the PM is nominal or associated with any failure type, achieving 99.99% precision, 0.01% false positives and negligible false negatives. Our methodology is generic and technology-agnostic, proven to be scalable on several electromechanical PM types deployed in both real-world and test bench environments. Finally, by using conformal prediction the maintainer gets a clear indication of the certainty of the system outputs, adding a confidence layer to operations and making the method compliant with the ISO-17359 standard.
zh
机器学习
[LG-0] Anomaly Detection in Networked Bandits
链接: https://arxiv.org/abs/2508.20076
作者: Xiaotong Cheng,Setareh Maghsudi
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注:
Abstract:The nodes’ interconnections on a social network often reflect their dependencies and information-sharing behaviors. Nevertheless, abnormal nodes, which significantly deviate from most of the network concerning patterns or behaviors, can lead to grave consequences. Therefore, it is imperative to design efficient online learning algorithms that robustly learn users’ preferences while simultaneously detecting anomalies. We introduce a novel bandit algorithm to address this problem. Through network knowledge, the method characterizes the users’ preferences and residuals of feature information. By learning and analyzing these preferences and residuals, it develops a personalized recommendation strategy for each user and simultaneously detects anomalies. We rigorously prove an upper bound on the regret of the proposed algorithm and experimentally compare it with several state-of-the-art collaborative contextual bandit algorithms on both synthetic and real-world datasets. Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG) Cite as: arXiv:2508.20076 [cs.MA] (or arXiv:2508.20076v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2508.20076 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-1] Reinforcement Learning for Search Tree Size Minimization in Constraint Programming: New Results on Scheduling Benchmarks
链接: https://arxiv.org/abs/2508.20056
作者: Vilém Heinz,Petr Vilím,Zdeněk Hanzálek
类目: Machine Learning (cs.LG)
*备注:
Abstract:Failure-Directed Search (FDS) is a significant complete generic search algorithm used in Constraint Programming (CP) to efficiently explore the search space, proven particularly effective on scheduling problems. This paper analyzes FDS’s properties, showing that minimizing the size of its search tree guided by ranked branching decisions is closely related to the Multi-armed bandit (MAB) problem. Building on this insight, MAB reinforcement learning algorithms are applied to FDS, extended with problem-specific refinements and parameter tuning, and evaluated on the two most fundamental scheduling problems, the Job Shop Scheduling Problem (JSSP) and Resource-Constrained Project Scheduling Problem (RCPSP). The resulting enhanced FDS, using the best extended MAB algorithm and configuration, performs 1.7 times faster on the JSSP and 2.1 times faster on the RCPSP benchmarks compared to the original implementation in a new solver called OptalCP, while also being 3.5 times faster on the JSSP and 2.1 times faster on the RCPSP benchmarks than the current state-of-the-art FDS algorithm in IBM CP Optimizer 22.1. Furthermore, using only a 900-second time limit per instance, the enhanced FDS improved the existing state-of-the-art lower bounds of 78 of 84 JSSP and 226 of 393 RCPSP standard open benchmark instances while also completely closing a few of them.
[LG-2] Using item recommendations and LLM s in marketing email titles
链接: https://arxiv.org/abs/2508.20024
作者: Deddy Jobson,Muktti Shukla,Phuong Dinh,Julio Christian Young,Nick Pitton,Nina Chen,Ryan Ginstrom
类目: Machine Learning (cs.LG)
*备注: Accepted to The Second Workshop on Generative AI for E-commerce (GenAIECommerce '25), held September 22, 2025, in Prague, Czech Republic. 3 figures
Abstract:E-commerce marketplaces make use of a number of marketing channels like emails, push notifications, etc. to reach their users and stimulate purchases. Personalized emails especially are a popular touch point for marketers to inform users of latest items in stock, especially for those who stopped visiting the marketplace. Such emails contain personalized recommendations tailored to each user’s interests, enticing users to buy relevant items. A common limitation of these emails is that the primary entry point, the title of the email, tends to follow fixed templates, failing to inspire enough interest in the contents. In this work, we explore the potential of large language models (LLMs) for generating thematic titles that reflect the personalized content of the emails. We perform offline simulations and conduct online experiments on the order of millions of users, finding our techniques useful in improving the engagement between customers and our emails. We highlight key findings and learnings as we productionize the safe and automated generation of email titles for millions of users.
[LG-3] FairLoop: Software Support for Human-Centric Fairness in Predictive Business Process Monitoring
链接: https://arxiv.org/abs/2508.20021
作者: Felix Möhrlein,Martin Käppel,Julian Neuberger,Sven Weinzierl,Lars Ackermann,Martin Matzner,Stefan Jablonski
类目: Machine Learning (cs.LG)
*备注: Proceedings of the Best BPM Dissertation Award, Doctoral Consortium, and Demonstrations Resources Forum co-located with 23rd International Conference on Business Process Management (BPM 2025), Seville, Spain, August 31st to September 5th, 2025
Abstract:Sensitive attributes like gender or age can lead to unfair predictions in machine learning tasks such as predictive business process monitoring, particularly when used without considering context. We present FairLoop1, a tool for human-guided bias mitigation in neural network-based prediction models. FairLoop distills decision trees from neural networks, allowing users to inspect and modify unfair decision logic, which is then used to fine-tune the original model towards fairer predictions. Compared to other approaches to fairness, FairLoop enables context-aware bias removal through human involvement, addressing the influence of sensitive attributes selectively rather than excluding them uniformly.
[LG-4] Evaluating Language Model Reasoning about Confidential Information
链接: https://arxiv.org/abs/2508.19980
作者: Dylan Sam,Alexander Robey,Andy Zou,Matt Fredrikson,J. Zico Kolter
类目: Machine Learning (cs.LG)
*备注: 20 pages
Abstract:As language models are increasingly deployed as autonomous agents in high-stakes settings, ensuring that they reliably follow user-defined rules has become a critical safety concern. To this end, we study whether language models exhibit contextual robustness, or the capability to adhere to context-dependent safety specifications. For this analysis, we develop a benchmark (PasswordEval) that measures whether language models can correctly determine when a user request is authorized (i.e., with a correct password). We find that current open- and closed-source models struggle with this seemingly simple task, and that, perhaps surprisingly, reasoning capabilities do not generally improve performance. In fact, we find that reasoning traces frequently leak confidential information, which calls into question whether reasoning traces should be exposed to users in such applications. We also scale the difficulty of our evaluation along multiple axes: (i) by adding adversarial user pressure through various jailbreaking strategies, and (ii) through longer multi-turn conversations where password verification is more challenging. Overall, our results suggest that current frontier models are not well-suited to handling confidential information, and that reasoning capabilities may need to be trained in a different manner to make them safer for release in high-stakes settings.
[LG-5] Reducing Street Parking Search Time via Smart Assignment Strategies
链接: https://arxiv.org/abs/2508.19979
作者: Behafarid Hemmatpour,Javad Dogani,Nikolaos Laoutaris
类目: Machine Learning (cs.LG)
*备注: Please cite the ACM SIGSPATIAL’25 version of this paper
Abstract:In dense metropolitan areas, searching for street parking adds to traffic congestion. Like many other problems, real-time assistants based on mobile phones have been proposed, but their effectiveness is understudied. This work quantifies how varying levels of user coordination and information availability through such apps impact search time and the probability of finding street parking. Through a data-driven simulation of Madrid’s street parking ecosystem, we analyze four distinct strategies: uncoordinated search (Unc-Agn), coordinated parking without awareness of non-users (Cord-Agn), an idealized oracle system that knows the positions of all non-users (Cord-Oracle), and our novel/practical Cord-Approx strategy that estimates non-users’ behavior probabilistically. The Cord-Approx strategy, instead of requiring knowledge of how close non-users are to a certain spot in order to decide whether to navigate toward it, uses past occupancy distributions to elongate physical distances between system users and alternative parking spots, and then solves a Hungarian matching problem to dispatch accordingly. In high-fidelity simulations of Madrid’s parking network with real traffic data, users of Cord-Approx averaged 6.69 minutes to find parking, compared to 19.98 minutes for non-users without an app. A zone-level snapshot shows that Cord-Approx reduces search time for system users by 72% (range = 67-76%) in central hubs, and up to 73% in residential areas, relative to non-users.
[LG-6] Short-Horizon Predictive Maintenance of Industrial Pumps Using Time-Series Features and Machine Learning
链接: https://arxiv.org/abs/2508.19974
作者: Khaled M. A. Alghtus,Aiyad Gannan,Khalid M. Alhajri,Ali L. A. Al Jubouri,Hassan A. I. Al-Janahi
类目: Machine Learning (cs.LG)
*备注:
Abstract:This study presents a machine learning framework for forecasting short-term faults in industrial centrifugal pumps using real-time sensor data. The approach aims to predict EarlyWarning conditions 5, 15, and 30 minutes in advance based on patterns extracted from historical operation. Two lookback periods, 60 minutes and 120 minutes, were evaluated using a sliding window approach. For each window, statistical features including mean, standard deviation, minimum, maximum, and linear trend were extracted, and class imbalance was addressed using the SMOTE algorithm. Random Forest and XGBoost classifiers were trained and tested on the labeled dataset. Results show that the Random Forest model achieved the best short-term forecasting performance with a 60-minute window, reaching recall scores of 69.2% at 5 minutes, 64.9% at 15 minutes, and 48.6% at 30 minutes. With a 120-minute window, the Random Forest model achieved 57.6% recall at 5 minutes, and improved predictive accuracy of 65.6% at both 15 and 30 minutes. XGBoost displayed similar but slightly lower performance. These findings highlight that optimal history length depends on the prediction horizon, and that different fault patterns may evolve at different timescales. The proposed method offers an interpretable and scalable solution for integrating predictive maintenance into real-time industrial monitoring systems.
[LG-7] Global Permutation Entropy
链接: https://arxiv.org/abs/2508.19955
作者: Abhijit Avhale(1),Joscha Diehl(1),Niraj Velankar(1),Emanuele Verri(1) ((1) University of Greifswald, Greifswald, Germany)
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 12 pages, 10 figures
Abstract:Permutation Entropy, introduced by Bandt and Pompe, is a widely used complexity measure for real-valued time series that is based on the relative order of values within consecutive segments of fixed length. After standardizing each segment to a permutation and computing the frequency distribution of these permutations, Shannon Entropy is then applied to quantify the series’ complexity. We introduce Global Permutation Entropy (GPE), a novel index that considers all possible patterns of a given length, including non-consecutive ones. Its computation relies on recently developed algorithms that enable the efficient extraction of full permutation profiles. We illustrate some properties of GPE and demonstrate its effectiveness through experiments on synthetic datasets, showing that it reveals structural information not accessible through standard permutation entropy. We provide a Julia package for the calculation of GPE at `this https URL.
[LG-8] Constraint Learning in Multi-Agent Dynamic Games from Demonstrations of Local Nash Interactions
链接: https://arxiv.org/abs/2508.19945
作者: Zhouyu Zhang,Chih-Yuan Chiu,Glen Chou
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:We present an inverse dynamic game-based algorithm to learn parametric constraints from a given dataset of local generalized Nash equilibrium interactions between multiple agents. Specifically, we introduce mixed-integer linear programs (MILP) encoding the Karush-Kuhn-Tucker (KKT) conditions of the interacting agents, which recover constraints consistent with the Nash stationarity of the interaction demonstrations. We establish theoretical guarantees that our method learns inner approximations of the true safe and unsafe sets, as well as limitations of constraint learnability from demonstrations of Nash equilibrium interactions. We also use the interaction constraints recovered by our method to design motion plans that robustly satisfy the underlying constraints. Across simulations and hardware experiments, our methods proved capable of inferring constraints and designing interactive motion plans for various classes of constraints, both convex and non-convex, from interaction demonstrations of agents with nonlinear dynamics.
[LG-9] FlowletFormer: Network Behavioral Semantic Aware Pre-training Model for Traffic Classification
链接: https://arxiv.org/abs/2508.19924
作者: Liming Liu,Ruoyu Li,Qing Li,Meijia Hou,Yong Jiang,Mingwei Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Network traffic classification using pre-training models has shown promising results, but existing methods struggle to capture packet structural characteristics, flow-level behaviors, hierarchical protocol semantics, and inter-packet contextual relationships. To address these challenges, we propose FlowletFormer, a BERT-based pre-training model specifically designed for network traffic analysis. FlowletFormer introduces a Coherent Behavior-Aware Traffic Representation Model for segmenting traffic into semantically meaningful units, a Protocol Stack Alignment-Based Embedding Layer to capture multilayer protocol semantics, and Field-Specific and Context-Aware Pretraining Tasks to enhance both inter-packet and inter-flow learning. Experimental results demonstrate that FlowletFormer significantly outperforms existing methods in the effectiveness of traffic representation, classification accuracy, and few-shot learning capability. Moreover, by effectively integrating domain-specific network knowledge, FlowletFormer shows better comprehension of the principles of network transmission (e.g., stateful connections of TCP), providing a more robust and trustworthy framework for traffic analysis.
[LG-10] Ontology-Based Concept Distillation for Radiology Report Retrieval and Labeling MICCAI
链接: https://arxiv.org/abs/2508.19915
作者: Felix Nützel,Mischa Dombrowski,Bernhard Kainz
类目: Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, Preprint (submitted version, de-anonymized). Accepted at MLMI (MICCAI Workshop) 2025. Version of Record to appear in Springer LNCS; This preprint has not undergone peer review or any post-submission improvements or corrections
Abstract:Retrieval-augmented learning based on radiology reports has emerged as a promising direction to improve performance on long-tail medical imaging tasks, such as rare disease detection in chest X-rays. Most existing methods rely on comparing high-dimensional text embeddings from models like CLIP or CXR-BERT, which are often difficult to interpret, computationally expensive, and not well-aligned with the structured nature of medical knowledge. We propose a novel, ontology-driven alternative for comparing radiology report texts based on clinically grounded concepts from the Unified Medical Language System (UMLS). Our method extracts standardised medical entities from free-text reports using an enhanced pipeline built on RadGraph-XL and SapBERT. These entities are linked to UMLS concepts (CUIs), enabling a transparent, interpretable set-based representation of each report. We then define a task-adaptive similarity measure based on a modified and weighted version of the Tversky Index that accounts for synonymy, negation, and hierarchical relationships between medical entities. This allows efficient and semantically meaningful similarity comparisons between reports. We demonstrate that our approach outperforms state-of-the-art embedding-based retrieval methods in a radiograph classification task on MIMIC-CXR, particularly in long-tail settings. Additionally, we use our pipeline to generate ontology-backed disease labels for MIMIC-CXR, offering a valuable new resource for downstream learning tasks. Our work provides more explainable, reliable, and task-specific retrieval strategies in clinical AI systems, especially when interpretability and domain knowledge integration are essential. Our code is available at this https URL
[LG-11] GegenNet: Spectral Convolutional Neural Networks for Link Sign Prediction in Signed Bipartite Graphs CIKM2025
链接: https://arxiv.org/abs/2508.19907
作者: Hewen Wang,Renchi Yang,Xiaokui Xiao
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 11 pages. Paper accepted to CIKM 2025
Abstract:Given a signed bipartite graph (SBG) G with two disjoint node sets U and V, the goal of link sign prediction is to predict the signs of potential links connecting U and V based on known positive and negative edges in G. The majority of existing solutions towards link sign prediction mainly focus on unipartite signed graphs, which are sub-optimal due to the neglect of node heterogeneity and unique bipartite characteristics of SBGs. To this end, recent studies adapt graph neural networks to SBGs by introducing message-passing schemes for both inter-partition (UxV) and intra-partition (UxU or VxV) node pairs. However, the fundamental spectral convolutional operators were originally designed for positive links in unsigned graphs, and thus, are not optimal for inferring missing positive or negative links from known ones in SBGs. Motivated by this, this paper proposes GegenNet, a novel and effective spectral convolutional neural network model for link sign prediction in SBGs. In particular, GegenNet achieves enhanced model capacity and high predictive accuracy through three main technical contributions: (i) fast and theoretically grounded spectral decomposition techniques for node feature initialization; (ii) a new spectral graph filter based on the Gegenbauer polynomial basis; and (iii) multi-layer sign-aware spectral convolutional networks alternating Gegenbauer polynomial filters with positive and negative edges. Our extensive empirical studies reveal that GegenNet can achieve significantly superior performance (up to a gain of 4.28% in AUC and 11.69% in F1) in link sign prediction compared to 11 strong competitors over 6 benchmark SBG datasets. Comments: 11 pages. Paper accepted to CIKM 2025 Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI) Cite as: arXiv:2508.19907 [cs.LG] (or arXiv:2508.19907v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2508.19907 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-12] Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning
链接: https://arxiv.org/abs/2508.19900
作者: Tan Jing,Xiaorui Li,Chao Yao,Xiaojuan Ban,Yuetong Fang,Renjing Xu,Zhaolin Yuan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Offline reinforcement learning (RL) enables learning effective policies from fixed datasets without any environment interaction. Existing methods typically employ policy constraints to mitigate the distribution shift encountered during offline RL training. However, because the scale of the constraints varies across tasks and datasets of differing quality, existing methods must meticulously tune hyperparameters to match each dataset, which is time-consuming and often impractical. We propose Adaptive Scaling of Policy Constraints (ASPC), a second-order differentiable framework that dynamically balances RL and behavior cloning (BC) during training. We theoretically analyze its performance improvement guarantee. In experiments on 39 datasets across four D4RL domains, ASPC using a single hyperparameter configuration outperforms other adaptive constraint methods and state-of-the-art offline RL algorithms that require per-dataset tuning while incurring only minimal computational overhead. The code will be released at this https URL.
[LG-13] Parameter-Free Structural-Diversity Message Passing for Graph Neural Networks
链接: https://arxiv.org/abs/2508.19884
作者: Mingyue Kong,Yinglong Zhang,Chengda Xu,Xuewen Xia,Xing Xu
类目: Machine Learning (cs.LG)
*备注: 50 pages, 6 figures
Abstract:Graph Neural Networks (GNNs) have shown remarkable performance in structured data modeling tasks such as node classification. However, mainstream approaches generally rely on a large number of trainable parameters and fixed aggregation rules, making it difficult to adapt to graph data with strong structural heterogeneity and complex feature distributions. This often leads to over-smoothing of node representations and semantic degradation. To address these issues, this paper proposes a parameter-free graph neural network framework based on structural diversity, namely SDGNN (Structural-Diversity Graph Neural Network). The framework is inspired by structural diversity theory and designs a unified structural-diversity message passing mechanism that simultaneously captures the heterogeneity of neighborhood structures and the stability of feature semantics, without introducing additional trainable parameters. Unlike traditional parameterized methods, SDGNN does not rely on complex model training, but instead leverages complementary modeling from both structure-driven and feature-driven perspectives, thereby effectively improving adaptability across datasets and scenarios. Experimental results show that on eight public benchmark datasets and an interdisciplinary PubMed citation network, SDGNN consistently outperforms mainstream GNNs under challenging conditions such as low supervision, class imbalance, and cross-domain transfer. This work provides a new theoretical perspective and general approach for the design of parameter-free graph neural networks, and further validates the importance of structural diversity as a core signal in graph representation learning. To facilitate reproducibility and further research, the full implementation of SDGNN has been released at: this https URL
[LG-14] Quantum latent distributions in deep generative models
链接: https://arxiv.org/abs/2508.19857
作者: Omar Bacarreza,Thorin Farnsworth,Alexander Makarovskiy,Hugo Wallner,Tessa Hicks,Santiago Sempere-Llagostera,John Price,Robert J. A. Francis-Jones,William R. Clements
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:
Abstract:Many successful families of generative models leverage a low-dimensional latent distribution that is mapped to a data distribution. Though simple latent distributions are commonly used, it has been shown that more sophisticated distributions can improve performance. For instance, recent work has explored using the distributions produced by quantum processors and found empirical improvements. However, when latent space distributions produced by quantum processors can be expected to improve performance, and whether these improvements are reproducible, are open questions that we investigate in this work. We prove that, under certain conditions, these “quantum latent distributions” enable generative models to produce data distributions that classical latent distributions cannot efficiently produce. We also provide actionable intuitions to identify when such quantum advantages may arise in real-world settings. We perform benchmarking experiments on both a synthetic quantum dataset and the QM9 molecular dataset, using both simulated and real photonic quantum processors. Our results demonstrate that quantum latent distributions can lead to improved generative performance in GANs compared to a range of classical baselines. We also explore diffusion and flow matching models, identifying architectures compatible with quantum latent distributions. This work confirms that near-term quantum processors can expand the capabilities of deep generative models.
[LG-15] Physics-Informed DeepONet Coupled with FEM for Convective Transport in Porous Media with Sharp Gaussian Sources
链接: https://arxiv.org/abs/2508.19847
作者: Erdi Kara,Panos Stinis
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present a hybrid framework that couples finite element methods (FEM) with physics-informed DeepONet to model fluid transport in porous media from sharp, localized Gaussian sources. The governing system consists of a steady-state Darcy flow equation and a time-dependent convection-diffusion equation. Our approach solves the Darcy system using FEM and transfers the resulting velocity field to a physics-informed DeepONet, which learns the mapping from source functions to solute concentration profiles. This modular strategy preserves FEM-level accuracy in the flow field while enabling fast inference for transport dynamics. To handle steep gradients induced by sharp sources, we introduce an adaptive sampling strategy for trunk collocation points. Numerical experiments demonstrate that our method is in good agreement with the reference solutions while offering orders of magnitude speedups over traditional solvers, making it suitable for practical applications in relevant scenarios. Implementation of our proposed method is available at this https URL.
[LG-16] Symplectic convolutional neural networks
链接: https://arxiv.org/abs/2508.19842
作者: Süleyman Yıldız,Konrad Janik,Peter Benner
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose a new symplectic convolutional neural network (CNN) architecture by leveraging symplectic neural networks, proper symplectic decomposition, and tensor techniques. Specifically, we first introduce a mathematically equivalent form of the convolution layer and then, using symplectic neural networks, we demonstrate a way to parameterize the layers of the CNN to ensure that the convolution layer remains symplectic. To construct a complete autoencoder, we introduce a symplectic pooling layer. We demonstrate the performance of the proposed neural network on three examples: the wave equation, the nonlinear Schrödinger (NLS) equation, and the sine-Gordon equation. The numerical results indicate that the symplectic CNN outperforms the linear symplectic autoencoder obtained via proper symplectic decomposition.
[LG-17] Interestingness First Classifiers
链接: https://arxiv.org/abs/2508.19780
作者: Ryoma Sato
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 14 pages
Abstract:Most machine learning models are designed to maximize predictive accuracy. In this work, we explore a different goal: building classifiers that are interesting. An ``interesting classifier’’ is one that uses unusual or unexpected features, even if its accuracy is lower than the best possible model. For example, predicting room congestion from CO2 levels achieves near-perfect accuracy but is unsurprising. In contrast, predicting room congestion from humidity is less accurate yet more nuanced and intriguing. We introduce EUREKA, a simple framework that selects features according to their perceived interestingness. Our method leverages large language models to rank features by their interestingness and then builds interpretable classifiers using only the selected interesting features. Across several benchmark datasets, EUREKA consistently identifies features that are non-obvious yet still predictive. For example, in the Occupancy Detection dataset, our method favors humidity over CO2 levels and light intensity, producing classifiers that achieve meaningful accuracy while offering insights. In the Twin Papers dataset, our method discovers the rule that papers with a colon in the title are more likely to be cited in the future. We argue that such models can support new ways of knowledge discovery and communication, especially in settings where moderate accuracy is sufficient but novelty and interpretability are valued.
[LG-18] Fast 3D Diffusion for Scalable Granular Media Synthesis
链接: https://arxiv.org/abs/2508.19752
作者: Muhammad Moeeze Hassan,Régis Cottereau,Filippo Gatti,Patryk Dec
类目: Machine Learning (cs.LG)
*备注:
Abstract:Simulating granular media, using Discrete Element Method is a computationally intensive task. This is especially true during initialization phase, which dominates total simulation time because of large displacements involved and associated kinetic energy. We overcome this bottleneck with a novel generative pipeline based on 3D diffusion models that directly synthesizes arbitrarily large granular assemblies in their final and physically realistic configurations. The approach frames the problem as a 3D generative modeling task, consisting of a two-stage pipeline. First a diffusion model is trained to generate independent 3D voxel grids representing granular media. Second, a 3D inpainting model, adapted from 2D inpainting techniques using masked inputs, stitches these grids together seamlessly, enabling synthesis of large samples with physically realistic structure. The inpainting model explores several masking strategies for the inputs to the underlying UNets by training the network to infer missing portions of voxel grids from a concatenation of noised tensors, masks, and masked tensors as input channels. The model also adapts a 2D repainting technique of re-injecting noise scheduler output with ground truth to provide a strong guidance to the 3D model. This along with weighted losses ensures long-term coherence over generation of masked regions. Both models are trained on the same binarized 3D occupancy grids extracted from small-scale DEM simulations, achieving linear scaling of computational time with respect to sample size. Quantitatively, a 1.2 m long ballasted rail track synthesis equivalent to a 3-hour DEM simulation, was completed under 20 seconds. The generated voxel grids can also be post-processed to extract grain geometries for DEM-compatibility as well, enabling physically coherent, real-time, scalable granular media synthesis for industrial applications.
[LG-19] InfraredGP: Efficient Graph Partitioning via Spectral Graph Neural Networks with Negative Corrections
链接: https://arxiv.org/abs/2508.19737
作者: Meng Qin,Weihua Li,Jinqiang Cui,Sen Pei
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:
Abstract:Graph partitioning (GP), a.k.a. community detection, is a classic problem that divides nodes of a graph into densely-connected blocks. From a perspective of graph signal processing, we find that graph Laplacian with a negative correction can derive graph frequencies beyond the conventional range [0, 2] . To explore whether the low-frequency information beyond this range can encode more informative properties about community structures, we propose InfraredGP. It (\romannumeral1) adopts a spectral GNN as its backbone combined with low-pass filters and a negative correction mechanism, (\romannumeral2) only feeds random inputs to this backbone, (\romannumeral3) derives graph embeddings via one feed-forward propagation (FFP) without any training, and (\romannumeral4) obtains feasible GP results by feeding the derived embeddings to BIRCH. Surprisingly, our experiments demonstrate that based solely on the negative correction mechanism that amplifies low-frequency information beyond [0, 2] , InfraredGP can derive distinguishable embeddings for some standard clustering modules (e.g., BIRCH) and obtain high-quality results for GP without any training. Following the IEEE HPEC Graph Challenge benchmark, we evaluate InfraredGP for both static and streaming GP, where InfraredGP can achieve much better efficiency (e.g., 16x-23x faster) and competitive quality over various baselines. We have made our code public at this https URL
[LG-20] une My Adam Please!
链接: https://arxiv.org/abs/2508.19733
作者: Theodoros Athanasiadis,Steven Adriaensen,Samuel Müller,Frank Hutter
类目: Machine Learning (cs.LG)
*备注: Accepted as a short paper at the non-archival content track of AutoML 2025
Abstract:The Adam optimizer remains one of the most widely used optimizers in deep learning, and effectively tuning its hyperparameters is key to optimizing performance. However, tuning can be tedious and costly. Freeze-thaw Bayesian Optimization (BO) is a recent promising approach for low-budget hyperparameter tuning, but is limited by generic surrogates without prior knowledge of how hyperparameters affect learning. We propose Adam-PFN, a new surrogate model for Freeze-thaw BO of Adam’s hyperparameters, pre-trained on learning curves from TaskSet, together with a new learning curve augmentation method, CDF-augment, which artificially increases the number of available training examples. Our approach improves both learning curve extrapolation and accelerates hyperparameter optimization on TaskSet evaluation tasks, with strong performance on out-of-distribution (OOD) tasks.
[LG-21] Metric spaces of walks and Lipschitz duality on graphs
链接: https://arxiv.org/abs/2508.19709
作者: R. Arnau,A. González Cortés,E.A. Sánchez Pérez,S. Sanjuan
类目: Machine Learning (cs.LG); Functional Analysis (math.FA)
*备注: 31 pages, 3 figures
Abstract:We study the metric structure of walks on graphs, understood as Lipschitz sequences. To this end, a weighted metric is introduced to handle sequences, enabling the definition of distances between walks based on stepwise vertex distances and weighted norms. We analyze the main properties of these metric spaces, which provides the foundation for the analysis of weaker forms of instruments to measure relative distances between walks: proximities. We provide some representation formulas for such proximities under different assumptions and provide explicit constructions for these cases. The resulting metric framework allows the use of classical tools from metric modeling, such as the extension of Lipschitz functions from subspaces of walks, which permits extending proximity functions while preserving fundamental properties via the mentioned representations. Potential applications include the estimation of proximities and the development of reinforcement learning strategies based on exploratory walks, offering a robust approach to Lipschitz regression on network structures.
[LG-22] mathcalC1-approximation with rational functions and rational neural networks
链接: https://arxiv.org/abs/2508.19672
作者: Erion Morina,Martin Holler
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Numerical Analysis (math.NA)
*备注:
Abstract:We show that suitably regular functions can be approximated in the \mathcalC^1 -norm both with rational functions and rational neural networks, including approximation rates with respect to width and depth of the network, and degree of the rational functions. As consequence of our results, we further obtain \mathcalC^1 -approximation results for rational neural networks with the \textEQL^÷ and ParFam architecture, both of which are important in particular in the context of symbolic regression for physical law learning.
[LG-23] Exploration of Low-Power Flexible Stress Monitoring Classifiers for Conformal Wearables
链接: https://arxiv.org/abs/2508.19661
作者: Florentia Afentaki,Sri Sai Rakesh Nakkilla,Konstantinos Balaskas,Paula Carolina Lozano Duarte,Shiyi Jiang,Georgios Zervakis,Farshad Firouzi,Krishnendu Chakrabarty,Mehdi B. Tahoori
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: Accepted for publication at the IEEE/ACM International Symposium on Low Power Electronics and Design} (ISLPED 2025)
Abstract:Conventional stress monitoring relies on episodic, symptom-focused interventions, missing the need for continuous, accessible, and cost-efficient solutions. State-of-the-art approaches use rigid, silicon-based wearables, which, though capable of multitasking, are not optimized for lightweight, flexible wear, limiting their practicality for continuous monitoring. In contrast, flexible electronics (FE) offer flexibility and low manufacturing costs, enabling real-time stress monitoring circuits. However, implementing complex circuits like machine learning (ML) classifiers in FE is challenging due to integration and power constraints. Previous research has explored flexible biosensors and ADCs, but classifier design for stress detection remains underexplored. This work presents the first comprehensive design space exploration of low-power, flexible stress classifiers. We cover various ML classifiers, feature selection, and neural simplification algorithms, with over 1200 flexible classifiers. To optimize hardware efficiency, fully customized circuits with low-precision arithmetic are designed in each case. Our exploration provides insights into designing real-time stress classifiers that offer higher accuracy than current methods, while being low-cost, conformable, and ensuring low power and compact size.
[LG-24] SCAR: A Characterization Scheme for Multi-Modal Dataset
链接: https://arxiv.org/abs/2508.19659
作者: Ri Su,Zhao Chen,Caleb Chen Cao,Nan Tang,Lei Chen
类目: Machine Learning (cs.LG)
*备注: 6 pages, 3 figures
Abstract:Foundation models exhibit remarkable generalization across diverse tasks, largely driven by the characteristics of their training data. Recent data-centric methods like pruning and compression aim to optimize training but offer limited theoretical insight into how data properties affect generalization, especially the data characteristics in sample scaling. Traditional perspectives further constrain progress by focusing predominantly on data quantity and training efficiency, often overlooking structural aspects of data quality. In this study, we introduce SCAR, a principled scheme for characterizing the intrinsic structural properties of datasets across four key measures: Scale, Coverage, Authenticity, and Richness. Unlike prior data-centric measures, SCAR captures stable characteristics that remain invariant under dataset scaling, providing a robust and general foundation for data understanding. Leveraging these structural properties, we introduce Foundation Data-a minimal subset that preserves the generalization behavior of the full dataset without requiring model-specific retraining. We model single-modality tasks as step functions and estimate the distribution of the foundation data size to capture step-wise generalization bias across modalities in the target multi-modal dataset. Finally, we develop a SCAR-guided data completion strategy based on this generalization bias, which enables efficient, modality-aware expansion of modality-specific characteristics in multimodal datasets. Experiments across diverse multi-modal datasets and model architectures validate the effectiveness of SCAR in predicting data utility and guiding data acquisition. Code is available at this https URL.
[LG-25] ALSA: Anchors in Logit Space for Out-of-Distribution Accuracy Estimation BMVC2025
链接: https://arxiv.org/abs/2508.19613
作者: Chenzhi Liu,Mahsa Baktashmotlagh,Yanran Tang,Zi Huang,Ruihong Qiu
类目: Machine Learning (cs.LG)
*备注: Accepted to BMVC 2025, Oral
Abstract:Estimating model accuracy on unseen, unlabeled datasets is crucial for real-world machine learning applications, especially under distribution shifts that can degrade performance. Existing methods often rely on predicted class probabilities (softmax scores) or data similarity metrics. While softmax-based approaches benefit from representing predictions on the standard simplex, compressing logits into probabilities leads to information loss. Meanwhile, similarity-based methods can be computationally expensive and domain-specific, limiting their broader applicability. In this paper, we introduce ALSA (Anchors in Logit Space for Accuracy estimation), a novel framework that preserves richer information by operating directly in the logit space. Building on theoretical insights and empirical observations, we demonstrate that the aggregation and distribution of logits exhibit a strong correlation with the predictive performance of the model. To exploit this property, ALSA employs an anchor-based modeling strategy: multiple learnable anchors are initialized in logit space, each assigned an influence function that captures subtle variations in the logits. This allows ALSA to provide robust and accurate performance estimates across a wide range of distribution shifts. Extensive experiments on vision, language, and graph benchmarks demonstrate ALSA’s superiority over both softmax- and similarity-based baselines. Notably, ALSA’s robustness under significant distribution shifts highlights its potential as a practical tool for reliable model evaluation.
[LG-26] Encourag ing Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning
链接: https://arxiv.org/abs/2508.19598
作者: Zhiwei Li,Yong Hu,Wenqing Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:The functionality of Large Language Model (LLM) agents is primarily determined by two capabilities: action planning and answer summarization. The former, action planning, is the core capability that dictates an agent’s performance. However, prevailing training paradigms employ end-to-end, multi-objective optimization that jointly trains both capabilities. This paradigm faces two critical challenges: imbalanced optimization objective allocation and scarcity of verifiable data, making it difficult to enhance the agent’s planning capability. To address these challenges, we propose Reinforcement Learning with Tool-use Rewards (RLTR), a novel framework that decouples the training process to enable a focused, single-objective optimization of the planning module. Crucially, RLTR introduces a reward signal based on tool-use completeness to directly evaluate the quality of tool invocation sequences. This method offers a more direct and reliable training signal than assessing the final response content, thereby obviating the need for verifiable data. Our experiments demonstrate that RLTR achieves an 8%-12% improvement in planning performance compared to end-to-end baselines. Moreover, this enhanced planning capability, in turn, translates to a 5%-6% increase in the final response quality of the overall agent system.
[LG-27] A Lightweight Crowd Model for Robot Social Navigation
链接: https://arxiv.org/abs/2508.19595
作者: Maryam Kazemi Eskeri,Thomas Wiedemann,Ville Kyrki,Dominik Baumann,Tomasz Piotr Kucner
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 7 pages, 6 figures, accepted in ECMR 2025
Abstract:Robots operating in human-populated environments must navigate safely and efficiently while minimizing social disruption. Achieving this requires estimating crowd movement to avoid congested areas in real-time. Traditional microscopic models struggle to scale in dense crowds due to high computational cost, while existing macroscopic crowd prediction models tend to be either overly simplistic or computationally intensive. In this work, we propose a lightweight, real-time macroscopic crowd prediction model tailored for human motion, which balances prediction accuracy and computational efficiency. Our approach simplifies both spatial and temporal processing based on the inherent characteristics of pedestrian flow, enabling robust generalization without the overhead of complex architectures. We demonstrate a 3.6 times reduction in inference time, while improving prediction accuracy by 3.1 %. Integrated into a socially aware planning framework, the model enables efficient and socially compliant robot navigation in dynamic environments. This work highlights that efficient human crowd modeling enables robots to navigate dense environments without costly computations.
[LG-28] Delta-Audit: Explaining What Changes When Models Change
链接: https://arxiv.org/abs/2508.19589
作者: Arshia Hemmat,Afsaneh Fatemi
类目: Machine Learning (cs.LG)
*备注: 7 pages, 1 figure, 4 tables
Abstract:Model updates (new hyperparameters, kernels, depths, solvers, or data) change performance, but the \emphreason often remains opaque. We introduce \textbfDelta-Attribution (\mbox \Delta -Attribution), a model-agnostic framework that explains \emphwhat changed between versions A and B by differencing per-feature attributions: \Delta\phi(x)=\phi_B(x)-\phi_A(x) . We evaluate \Delta\phi with a \emph \Delta -Attribution Quality Suite covering magnitude/sparsity (L1, Top- k , entropy), agreement/shift (rank-overlap@10, Jensen–Shannon divergence), behavioural alignment (Delta Conservation Error, DCE; Behaviour–Attribution Coupling, BAC; CO \Delta F), and robustness (noise, baseline sensitivity, grouped occlusion). Instantiated via fast occlusion/clamping in standardized space with a class-anchored margin and baseline averaging, we audit 45 settings: five classical families (Logistic Regression, SVC, Random Forests, Gradient Boosting, k NN), three datasets (Breast Cancer, Wine, Digits), and three A/B pairs per family. \textbfFindings. Inductive-bias changes yield large, behaviour-aligned deltas (e.g., SVC poly !\rightarrow rbf on Breast Cancer: BAC \approx 0.998, DCE \approx 6.6; Random Forest feature-rule swap on Digits: BAC \approx 0.997, DCE \approx 7.5), while ``cosmetic’’ tweaks (SVC \textttgamma=scale vs.\ \textttauto, k NN search) show rank-overlap@10 =1.0 and DCE \approx 0. The largest redistribution appears for deeper GB on Breast Cancer (JSD \approx 0.357). \Delta -Attribution offers a lightweight update audit that complements accuracy by distinguishing benign changes from behaviourally meaningful or risky reliance shifts. Comments: 7 pages, 1 figure, 4 tables Subjects: Machine Learning (cs.LG) Cite as: arXiv:2508.19589 [cs.LG] (or arXiv:2508.19589v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2508.19589 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-29] Escaping Stability-Plasticity Dilemma in Online Continual Learning for Motion Forecasting via Synergetic Memory Rehearsal
链接: https://arxiv.org/abs/2508.19571
作者: Yunlong Lin,Chao Lu,Tongshuai Wu,Xiaocong Zhao,Guodong Du,Yanwei Sun,Zirui Li,Jianwei Gong
类目: Machine Learning (cs.LG)
*备注: Official code: this https URL
Abstract:Deep neural networks (DNN) have achieved remarkable success in motion forecasting. However, most DNN-based methods suffer from catastrophic forgetting and fail to maintain their performance in previously learned scenarios after adapting to new data. Recent continual learning (CL) studies aim to mitigate this phenomenon by enhancing memory stability of DNN, i.e., the ability to retain learned knowledge. Yet, excessive emphasis on the memory stability often impairs learning plasticity, i.e., the capacity of DNN to acquire new information effectively. To address such stability-plasticity dilemma, this study proposes a novel CL method, synergetic memory rehearsal (SyReM), for DNN-based motion forecasting. SyReM maintains a compact memory buffer to represent learned knowledge. To ensure memory stability, it employs an inequality constraint that limits increments in the average loss over the memory buffer. Synergistically, a selective memory rehearsal mechanism is designed to enhance learning plasticity by selecting samples from the memory buffer that are most similar to recently observed data. This selection is based on an online-measured cosine similarity of loss gradients, ensuring targeted memory rehearsal. Since replayed samples originate from learned scenarios, this memory rehearsal mechanism avoids compromising memory stability. We validate SyReM under an online CL paradigm where training samples from diverse scenarios arrive as a one-pass stream. Experiments on 11 naturalistic driving datasets from INTERACTION demonstrate that, compared to non-CL and CL baselines, SyReM significantly mitigates catastrophic forgetting in past scenarios while improving forecasting accuracy in new ones. The implementation is publicly available at this https URL.
[LG-30] Counterfactual Reward Model Training for Bias Mitigation in Multimodal Reinforcement Learning
链接: https://arxiv.org/abs/2508.19567
作者: Sheryl Mathew,N Harshit
类目: Machine Learning (cs.LG)
*备注:
Abstract:In reinforcement learning with human feedback (RLHF), reward models can efficiently learn and amplify latent biases within multimodal datasets, which can lead to imperfect policy optimization through flawed reward signals and decreased fairness. Bias mitigation studies have often applied passive constraints, which can fail under causal confounding. Here, we present a counterfactual reward model that introduces causal inference with multimodal representation learning to provide an unsupervised, bias-resilient reward signal. The heart of our contribution is the Counterfactual Trust Score, an aggregated score consisting of four components: (1) counterfactual shifts that decompose political framing bias from topical bias; (2) reconstruction uncertainty during counterfactual perturbations; (3) demonstrable violations of fairness rules for each protected attribute; and (4) temporal reward shifts aligned with dynamic trust measures. We evaluated the framework on a multimodal fake versus true news dataset, which exhibits framing bias, class imbalance, and distributional drift. Following methodologies similar to unsupervised drift detection from representation-based distances [1] and temporal robustness benchmarking in language models [2], we also inject synthetic bias across sequential batches to test robustness. The resulting system achieved an accuracy of 89.12% in fake news detection, outperforming the baseline reward models. More importantly, it reduced spurious correlations and unfair reinforcement signals. This pipeline outlines a robust and interpretable approach to fairness-aware RLHF, offering tunable bias reduction thresholds and increasing reliability in dynamic real-time policy making.
[LG-31] MobText-SISA: Efficient Machine Unlearning for Mobility Logs with Spatio-Temporal and Natural-Language Data
链接: https://arxiv.org/abs/2508.19554
作者: Haruki Yonekura,Ren Ozeki,Tatsuya Amano,Hamada Rizk,Hirozumi Yamaguchi
类目: Machine Learning (cs.LG)
*备注: Accepted to The 33rd ACM International Conference on Advances in Geographic Information Systems(SIGSPATIAL '25) as a short paper in the Short Paper Track
Abstract:Modern mobility platforms have stored vast streams of GPS trajectories, temporal metadata, free-form textual notes, and other unstructured data. Privacy statutes such as the GDPR require that any individual’s contribution be unlearned on demand, yet retraining deep models from scratch for every request is untenable. We introduce MobText-SISA, a scalable machine-unlearning framework that extends Sharded, Isolated, Sliced, and Aggregated (SISA) training to heterogeneous spatio-temporal data. MobText-SISA first embeds each trip’s numerical and linguistic features into a shared latent space, then employs similarity-aware clustering to distribute samples across shards so that future deletions touch only a single constituent model while preserving inter-shard diversity. Each shard is trained incrementally; at inference time, constituent predictions are aggregated to yield the output. Deletion requests trigger retraining solely of the affected shard from its last valid checkpoint, guaranteeing exact unlearning. Experiments on a ten-month real-world mobility log demonstrate that MobText-SISA (i) sustains baseline predictive accuracy, and (ii) consistently outperforms random sharding in both error and convergence speed. These results establish MobText-SISA as a practical foundation for privacy-compliant analytics on multimodal mobility data at urban scale.
[LG-32] owards 6G Intelligence: The Role of Generative AI in Future Wireless Networks
链接: https://arxiv.org/abs/2508.19495
作者: Muhammad Ahmed Mohsin,Junaid Ahmad,Muhammad Hamza Nawaz,Muhammad Ali Jamshed
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Submitted as a chapter to the book Ambient Intelligence for 6G
Abstract:Ambient intelligence (AmI) is a computing paradigm in which physical environments are embedded with sensing, computation, and communication so they can perceive people and context, decide appropriate actions, and respond autonomously. Realizing AmI at global scale requires sixth generation (6G) wireless networks with capabilities for real time perception, reasoning, and action aligned with human behavior and mobility patterns. We argue that Generative Artificial Intelligence (GenAI) is the creative core of such environments. Unlike traditional AI, GenAI learns data distributions and can generate realistic samples, making it well suited to close key AmI gaps, including generating synthetic sensor and channel data in under observed areas, translating user intent into compact, semantic messages, predicting future network conditions for proactive control, and updating digital twins without compromising privacy. This chapter reviews foundational GenAI models, GANs, VAEs, diffusion models, and generative transformers, and connects them to practical AmI use cases, including spectrum sharing, ultra reliable low latency communication, intelligent security, and context aware digital twins. We also examine how 6G enablers, such as edge and fog computing, IoT device swarms, intelligent reflecting surfaces (IRS), and non terrestrial networks, can host or accelerate distributed GenAI. Finally, we outline open challenges in energy efficient on device training, trustworthy synthetic data, federated generative learning, and AmI specific standardization. We show that GenAI is not a peripheral addition, but a foundational element for transforming 6G from a faster network into an ambient intelligent ecosystem. Comments: Submitted as a chapter to the book Ambient Intelligence for 6G Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Signal Processing (eess.SP) Cite as: arXiv:2508.19495 [cs.DC] (or arXiv:2508.19495v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2508.19495 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-33] Distribution Shift Aware Neural Tabular Learning
链接: https://arxiv.org/abs/2508.19486
作者: Wangyang Ying,Nanxu Gong,Dongjie Wang,Xinyuan Wang,Arun Vignesh Malarkkan,Vivek Gupta,Chandan K. Reddy,Yanjie Fu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Tabular learning transforms raw features into optimized spaces for downstream tasks, but its effectiveness deteriorates under distribution shifts between training and testing data. We formalize this challenge as the Distribution Shift Tabular Learning (DSTL) problem and propose a novel Shift-Aware Feature Transformation (SAFT) framework to address it. SAFT reframes tabular learning from a discrete search task into a continuous representation-generation paradigm, enabling differentiable optimization over transformed feature sets. SAFT integrates three mechanisms to ensure robustness: (i) shift-resistant representation via embedding decorrelation and sample reweighting, (ii) flatness-aware generation through suboptimal embedding averaging, and (iii) normalization-based alignment between training and test distributions. Extensive experiments show that SAFT consistently outperforms prior tabular learning methods in terms of robustness, effectiveness, and generalization ability under diverse real-world distribution shifts.
[LG-34] DeepAtlas: a tool for effective manifold learning
链接: https://arxiv.org/abs/2508.19479
作者: Serena Hughes,Timothy Hamilton,Tom Kolokotrones,Eric J. Deeds
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 38 pages, 7 main text figures, 16 supplementary figures
Abstract:Manifold learning builds on the “manifold hypothesis,” which posits that data in high-dimensional datasets are drawn from lower-dimensional manifolds. Current tools generate global embeddings of data, rather than the local maps used to define manifolds mathematically. These tools also cannot assess whether the manifold hypothesis holds true for a dataset. Here, we describe DeepAtlas, an algorithm that generates lower-dimensional representations of the data’s local neighborhoods, then trains deep neural networks that map between these local embeddings and the original data. Topological distortion is used to determine whether a dataset is drawn from a manifold and, if so, its dimensionality. Application to test datasets indicates that DeepAtlas can successfully learn manifold structures. Interestingly, many real datasets, including single-cell RNA-sequencing, do not conform to the manifold hypothesis. In cases where data is drawn from a manifold, DeepAtlas builds a model that can be used generatively and promises to allow the application of powerful tools from differential geometry to a variety of datasets.
[LG-35] he Sample Complexity of Membership Inference and Privacy Auditing
链接: https://arxiv.org/abs/2508.19458
作者: Mahdi Haghifam,Adam Smith,Jonathan Ullman
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注: 58 Pages
Abstract:A membership-inference attack gets the output of a learning algorithm, and a target individual, and tries to determine whether this individual is a member of the training data or an independent sample from the same distribution. A successful membership-inference attack typically requires the attacker to have some knowledge about the distribution that the training data was sampled from, and this knowledge is often captured through a set of independent reference samples from that distribution. In this work we study how much information the attacker needs for membership inference by investigating the sample complexity-the minimum number of reference samples required-for a successful attack. We study this question in the fundamental setting of Gaussian mean estimation where the learning algorithm is given n samples from a Gaussian distribution \mathcalN(\mu,\Sigma) in d dimensions, and tries to estimate \hat\mu up to some error \mathbbE[|\hat \mu - \mu|^2_\Sigma]\leq \rho^2 d . Our result shows that for membership inference in this setting, \Omega(n + n^2 \rho^2) samples can be necessary to carry out any attack that competes with a fully informed attacker. Our result is the first to show that the attacker sometimes needs many more samples than the training algorithm uses to train the model. This result has significant implications for practice, as all attacks used in practice have a restricted form that uses O(n) samples and cannot benefit from \omega(n) samples. Thus, these attacks may be underestimating the possibility of membership inference, and better attacks may be possible when information about the distribution is easy to obtain.
[LG-36] Stack Trace-Based Crash Deduplication with Transformer Adaptation
链接: https://arxiv.org/abs/2508.19449
作者: Md Afif Al Mamun,Gias Uddin,Lan Xia,Longyu Zhang
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: This work is currently under review at IEEE Transactions on Software Engineering. The replication package will be made publicly available upon acceptance
Abstract:Automated crash reporting systems generate large volumes of duplicate reports, overwhelming issue-tracking systems and increasing developer workload. Traditional stack trace-based deduplication methods, relying on string similarity, rule-based heuristics, or deep learning (DL) models, often fail to capture the contextual and structural relationships within stack traces. We propose dedupT, a transformer-based approach that models stack traces holistically rather than as isolated frames. dedupT first adapts a pretrained language model (PLM) to stack traces, then uses its embeddings to train a fully-connected network (FCN) to rank duplicate crashes effectively. Extensive experiments on real-world datasets show that dedupT outperforms existing DL and traditional methods (e.g., sequence alignment and information retrieval techniques) in both duplicate ranking and unique crash detection, significantly reducing manual triage effort. On four public datasets, dedupT improves Mean Reciprocal Rank (MRR) often by over 15% compared to the best DL baseline and up to 9% over traditional methods while achieving higher Receiver Operating Characteristic Area Under the Curve (ROC-AUC) in detecting unique crash reports. Our work advances the integration of modern natural language processing (NLP) techniques into software engineering, providing an effective solution for stack trace-based crash deduplication.
[LG-37] On Surjectivity of Neural Networks: Can you elicit any behavior from your model?
链接: https://arxiv.org/abs/2508.19445
作者: Haozhe Jiang,Nika Haghtalab
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Given a trained neural network, can any specified output be generated by some input? Equivalently, does the network correspond to a function that is surjective? In generative models, surjectivity implies that any output, including harmful or undesirable content, can in principle be generated by the networks, raising concerns about model safety and jailbreak vulnerabilities. In this paper, we prove that many fundamental building blocks of modern neural architectures, such as networks with pre-layer normalization and linear-attention modules, are almost always surjective. As corollaries, widely used generative frameworks, including GPT-style transformers and diffusion models with deterministic ODE solvers, admit inverse mappings for arbitrary outputs. By studying surjectivity of these modern and commonly used neural architectures, we contribute a formalism that sheds light on their unavoidable vulnerability to a broad class of adversarial attacks.
[LG-38] Efficiently Generating Multidimensional Calorimeter Data with Tensor Decomposition Parameterization
链接: https://arxiv.org/abs/2508.19443
作者: Paimon Goulart,Shaan Pakala,Evangelos Papalexakis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Producing large complex simulation datasets can often be a time and resource consuming task. Especially when these experiments are very expensive, it is becoming more reasonable to generate synthetic data for downstream tasks. Recently, these methods may include using generative machine learning models such as Generative Adversarial Networks or diffusion models. As these generative models improve efficiency in producing useful data, we introduce an internal tensor decomposition to these generative models to even further reduce costs. More specifically, for multidimensional data, or tensors, we generate the smaller tensor factors instead of the full tensor, in order to significantly reduce the model’s output and overall parameters. This reduces the costs of generating complex simulation data, and our experiments show the generated data remains useful. As a result, tensor decomposition has the potential to improve efficiency in generative models, especially when generating multidimensional data, or tensors.
[LG-39] MS-ConTab: Multi-Scale Contrastive Learning of Mutation Signatures for Pan Cancer Representation and Stratification
链接: https://arxiv.org/abs/2508.19424
作者: Yifan Dou,Adam Khadre,Ruben C Petreaca,Golrokh Mirzaei
类目: Machine Learning (cs.LG)
*备注:
Abstract:Motivation. Understanding the pan-cancer mutational landscape offers critical insights into the molecular mechanisms underlying tumorigenesis. While patient-level machine learning techniques have been widely employed to identify tumor subtypes, cohort-level clustering, where entire cancer types are grouped based on shared molecular features, has largely relied on classical statistical methods. Results. In this study, we introduce a novel unsupervised contrastive learning framework to cluster 43 cancer types based on coding mutation data derived from the COSMIC database. For each cancer type, we construct two complementary mutation signatures: a gene-level profile capturing nucleotide substitution patterns across the most frequently mutated genes, and a chromosome-level profile representing normalized substitution frequencies across chromosomes. These dual views are encoded using TabNet encoders and optimized via a multi-scale contrastive learning objective (NT-Xent loss) to learn unified cancer-type embeddings. We demonstrate that the resulting latent representations yield biologically meaningful clusters of cancer types, aligning with known mutational processes and tissue origins. Our work represents the first application of contrastive learning to cohort-level cancer clustering, offering a scalable and interpretable framework for mutation-driven cancer subtyping. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2508.19424 [cs.LG] (or arXiv:2508.19424v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2508.19424 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-40] Differentiable multiphase flow model for physics-informed machine learning in reservoir pressure management
链接: https://arxiv.org/abs/2508.19419
作者: Harun Ur Rashid,Aleksandra Pachalieva,Daniel O’Malley
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate subsurface reservoir pressure control is extremely challenging due to geological heterogeneity and multiphase fluid-flow dynamics. Predicting behavior in this setting relies on high-fidelity physics-based simulations that are computationally expensive. Yet, the uncertain, heterogeneous properties that control these flows make it necessary to perform many of these expensive simulations, which is often prohibitive. To address these challenges, we introduce a physics-informed machine learning workflow that couples a fully differentiable multiphase flow simulator, which is implemented in the DPFEHM framework with a convolutional neural network (CNN). The CNN learns to predict fluid extraction rates from heterogeneous permeability fields to enforce pressure limits at critical reservoir locations. By incorporating transient multiphase flow physics into the training process, our method enables more practical and accurate predictions for realistic injection-extraction scenarios compare to previous works. To speed up training, we pretrain the model on single-phase, steady-state simulations and then fine-tune it on full multiphase scenarios, which dramatically reduces the computational cost. We demonstrate that high-accuracy training can be achieved with fewer than three thousand full-physics multiphase flow simulations – compared to previous estimates requiring up to ten million. This drastic reduction in the number of simulations is achieved by leveraging transfer learning from much less expensive single-phase simulations.
[LG-41] Kolmogorov-Arnold Representation for Symplectic Learning: Advancing Hamiltonian Neural Networks IJCNN2025 IJCNN
链接: https://arxiv.org/abs/2508.19410
作者: Zongyu Wu,Ruichen Xu,Luoyao Chen,Georgios Kementzidis,Siyao Wang,Yuefan Deng
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: Comments: 8 pages, 6 figures. Accepted at IJCNN 2025 (to appear in IEEE/IJCNN proceedings). This arXiv submission corresponds to the camera-ready version with minor editorial clarifications; results unchanged
Abstract:We propose a Kolmogorov-Arnold Representation-based Hamiltonian Neural Network (KAR-HNN) that replaces the Multilayer Perceptrons (MLPs) with univariate transformations. While Hamiltonian Neural Networks (HNNs) ensure energy conservation by learning Hamiltonian functions directly from data, existing implementations, often relying on MLPs, cause hypersensitivity to the hyperparameters while exploring complex energy landscapes. Our approach exploits the localized function approximations to better capture high-frequency and multi-scale dynamics, reducing energy drift and improving long-term predictive stability. The networks preserve the symplectic form of Hamiltonian systems, and thus maintain interpretability and physical consistency. After assessing KAR-HNN on four benchmark problems including spring-mass, simple pendulum, two- and three-body problem, we foresee its effectiveness for accurate and stable modeling of realistic physical processes often at high dimensions and with few known parameters.
[LG-42] Quantum-Classical Hybrid Molecular Autoencoder for Advancing Classical Decoding
链接: https://arxiv.org/abs/2508.19394
作者: Afrar Jahin,Yi Pan,Yingfeng Wang,Tianming Liu,Wei Zhang
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:
Abstract:Although recent advances in quantum machine learning (QML) offer significant potential for enhancing generative models, particularly in molecular design, a large array of classical approaches still face challenges in achieving high fidelity and validity. In particular, the integration of QML with sequence-based tasks, such as Simplified Molecular Input Line Entry System (SMILES) string reconstruction, remains underexplored and usually suffers from fidelity degradation. In this work, we propose a hybrid quantum-classical architecture for SMILES reconstruction that integrates quantum encoding with classical sequence modeling to improve quantum fidelity and classical similarity. Our approach achieves a quantum fidelity of approximately 84% and a classical reconstruction similarity of 60%, surpassing existing quantum baselines. Our work lays a promising foundation for future QML applications, striking a balance between expressive quantum representations and classical sequence models and catalyzing broader research on quantum-aware sequence models for molecular and drug discovery.
[LG-43] GENIE-ASI: Generative Instruction and Executable Code for Analog Subcircuit Identification
链接: https://arxiv.org/abs/2508.19393
作者: Phuoc Pham,Arun Venkitaraman,Chia-Yu Hsieh,Andrea Bonetti,Stefan Uhlich,Markus Leibl,Simon Hofmann,Eisaku Ohbuchi,Lorenzo Servadei,Ulf Schlichtmann,Robert Wille
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:Analog subcircuit identification is a core task in analog design, essential for simulation, sizing, and layout. Traditional methods often require extensive human expertise, rule-based encoding, or large labeled datasets. To address these challenges, we propose GENIE-ASI, the first training-free, large language model (LLM)-based methodology for analog subcircuit identification. GENIE-ASI operates in two phases: it first uses in-context learning to derive natural language instructions from a few demonstration examples, then translates these into executable Python code to identify subcircuits in unseen SPICE netlists. In addition, to evaluate LLM-based approaches systematically, we introduce a new benchmark composed of operational amplifier netlists (op-amps) that cover a wide range of subcircuit variants. Experimental results on the proposed benchmark show that GENIE-ASI matches rule-based performance on simple structures (F1-score = 1.0), remains competitive on moderate abstractions (F1-score = 0.81), and shows potential even on complex subcircuits (F1-score = 0.31). These findings demonstrate that LLMs can serve as adaptable, general-purpose tools in analog design automation, opening new research directions for foundation model applications in analog design automation.
[LG-44] DETNO: A Diffusion-Enhanced Transformer Neural Operator for Long-Term Traffic Forecasting
链接: https://arxiv.org/abs/2508.19389
作者: Owais Ahmad,Milad Ramezankhani,Anirudh Deodhar
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:Accurate long-term traffic forecasting remains a critical challenge in intelligent transportation systems, particularly when predicting high-frequency traffic phenomena such as shock waves and congestion boundaries over extended rollout horizons. Neural operators have recently gained attention as promising tools for modeling traffic flow. While effective at learning function space mappings, they inherently produce smooth predictions that fail to reconstruct high-frequency features such as sharp density gradients which results in rapid error accumulation during multi-step rollout predictions essential for real-time traffic management. To address these fundamental limitations, we introduce a unified Diffusion-Enhanced Transformer Neural Operator (DETNO) architecture. DETNO leverages a transformer neural operator with cross-attention mechanisms, providing model expressivity and super-resolution, coupled with a diffusion-based refinement component that iteratively reconstructs high-frequency traffic details through progressive denoising. This overcomes the inherent smoothing limitations and rollout instability of standard neural operators. Through comprehensive evaluation on chaotic traffic datasets, our method demonstrates superior performance in extended rollout predictions compared to traditional and transformer-based neural operators, preserving high-frequency components and improving stability over long prediction horizons.
[LG-45] owards Quantum Machine Learning for Malicious Code Analysis
链接: https://arxiv.org/abs/2508.19381
作者: Jesus Lopez,Saeefa Rubaiyet Nowmi,Viviana Cadena,Mohammad Saidur Rahman
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 6 pages, 3 figures, 2 tables. Accepted at the International Workshop on Quantum Computing and Reinforcement Learning (QCRL) @ IEEE Quantum Week 2025
Abstract:Classical machine learning (CML) has been extensively studied for malware classification. With the emergence of quantum computing, quantum machine learning (QML) presents a paradigm-shifting opportunity to improve malware detection, though its application in this domain remains largely unexplored. In this study, we investigate two hybrid quantum-classical models – a Quantum Multilayer Perceptron (QMLP) and a Quantum Convolutional Neural Network (QCNN), for malware classification. Both models utilize angle embedding to encode malware features into quantum states. QMLP captures complex patterns through full qubit measurement and data re-uploading, while QCNN achieves faster training via quantum convolution and pooling layers that reduce active qubits. We evaluate both models on five widely used malware datasets – API-Graph, EMBER-Domain, EMBER-Class, AZ-Domain, and AZ-Class, across binary and multiclass classification tasks. Our results show high accuracy for binary classification – 95-96% on API-Graph, 91-92% on AZ-Domain, and 77% on EMBER-Domain. In multiclass settings, accuracy ranges from 91.6-95.7% on API-Graph, 41.7-93.6% on AZ-Class, and 60.7-88.1% on EMBER-Class. Overall, QMLP outperforms QCNN in complex multiclass tasks, while QCNN offers improved training efficiency at the cost of reduced accuracy. Comments: 6 pages, 3 figures, 2 tables. Accepted at the International Workshop on Quantum Computing and Reinforcement Learning (QCRL) @ IEEE Quantum Week 2025 Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR) Cite as: arXiv:2508.19381 [cs.LG] (or arXiv:2508.19381v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2508.19381 Focus to learn more arXiv-issued DOI via DataCite
[LG-46] Aggregate Fictitious Play for Learning in Anonymous Polymatrix Games (Extended Version)
链接: https://arxiv.org/abs/2508.19371
作者: Semih Kara,Tamer Başar
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
*备注:
Abstract:Fictitious play (FP) is a well-studied algorithm that enables agents to learn Nash equilibrium in games with certain reward structures. However, when agents have no prior knowledge of the reward functions, FP faces a major challenge: the joint action space grows exponentially with the number of agents, which slows down reward exploration. Anonymous games offer a structure that mitigates this issue. In these games, the rewards depend only on the actions taken; not on who is taking which action. Under such a structure, we introduce aggregate fictitious play (agg-FP), a variant of FP where each agent tracks the frequency of the number of other agents playing each action, rather than these agents’ individual actions. We show that in anonymous polymatrix games, agg-FP converges to a Nash equilibrium under the same conditions as classical FP. In essence, by aggregating the agents’ actions, we reduce the action space without losing the convergence guarantees. Using simulations, we provide empirical evidence on how this reduction accelerates convergence.
[LG-47] Graph Data Modeling: Molecules Proteins Chemical Processes
链接: https://arxiv.org/abs/2508.19356
作者: José Manuel Barraza-Chavez,Rana A. Barghout,Ricardo Almada-Monter,Benjamin Sanchez-Lengeling,Adrian Jinich,Radhakrishnan Mahadevan
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 3 to 4 hours read time. 73 pages. 35 figures
Abstract:Graphs are central to the chemical sciences, providing a natural language to describe molecules, proteins, reactions, and industrial processes. They capture interactions and structures that underpin materials, biology, and medicine. This primer, Graph Data Modeling: Molecules, Proteins, Chemical Processes, introduces graphs as mathematical objects in chemistry and shows how learning algorithms (particularly graph neural networks) can operate on them. We outline the foundations of graph design, key prediction tasks, representative examples across chemical sciences, and the role of machine learning in graph-based modeling. Together, these concepts prepare readers to apply graph methods to the next generation of chemical discovery.
[LG-48] Memorization in Graph Neural Networks
链接: https://arxiv.org/abs/2508.19352
作者: Adarsh Jamadandi,Jing Xu,Adam Dziedzic,Franziska Boenisch
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep neural networks (DNNs) have been shown to memorize their training data, yet similar analyses for graph neural networks (GNNs) remain largely under-explored. We introduce NCMemo (Node Classification Memorization), the first framework to quantify label memorization in semi-supervised node classification. We first establish an inverse relationship between memorization and graph homophily, i.e., the property that connected nodes share similar labels/features. We find that lower homophily significantly increases memorization, indicating that GNNs rely on memorization to learn less homophilic graphs. Secondly, we analyze GNN training dynamics. We find that the increased memorization in low homophily graphs is tightly coupled to the GNNs’ implicit bias on using graph structure during learning. In low homophily regimes, this structure is less informative, hence inducing memorization of the node labels to minimize training loss. Finally, we show that nodes with higher label inconsistency in their feature-space neighborhood are significantly more prone to memorization. Building on our insights into the link between graph homophily and memorization, we investigate graph rewiring as a means to mitigate memorization. Our results demonstrate that this approach effectively reduces memorization without compromising model performance. Moreover, we show that it lowers the privacy risk for previously memorized data points in practice. Thus, our work not only advances understanding of GNN learning but also supports more privacy-preserving GNN deployment.
[LG-49] Physics-Informed Regression: Parameter Estimation in Parameter-Linear Nonlinear Dynamic Models
链接: https://arxiv.org/abs/2508.19249
作者: Jonas Søeborg Nielsen,Marcus Galea Jacobsen,Albert Brincker Olson,Mads Peter Sørensen,Allan Peter Engsig-Karup(DTU Compute)
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: For public PIR Julia package, see this https URL
Abstract:We present a new efficient hybrid parameter estimation method based on the idea, that if nonlinear dynamic models are stated in terms of a system of equations that is linear in terms of the parameters, then regularized ordinary least squares can be used to estimate these parameters from time series data. We introduce the term “Physics-Informed Regression” (PIR) to describe the proposed data-driven hybrid technique as a way to bridge theory and data by use of ordinary least squares to efficiently perform parameter estimation of the model coefficients of different parameter-linear models; providing examples of models based on nonlinear ordinary equations (ODE) and partial differential equations (PDE). The focus is on parameter estimation on a selection of ODE and PDE models, each illustrating performance in different model characteristics. For two relevant epidemic models of different complexity and number of parameters, PIR is tested and compared against the related technique, physics-informed neural networks (PINN), both on synthetic data generated from known target parameters and on real public Danish time series data collected during the COVID-19 pandemic in Denmark. Both methods were able to estimate the target parameters, while PIR showed to perform noticeably better, especially on a compartment model with higher complexity. Given the difference in computational speed, it is concluded that the PIR method is superior to PINN for the models considered. It is also demonstrated how PIR can be applied to estimate the time-varying parameters of a compartment model that is fitted using real Danish data from the COVID-19 pandemic obtained during a period from 2020 to 2021. The study shows how data-driven and physics-informed techniques may support reliable and fast – possibly real-time – parameter estimation in parameter-linear nonlinear dynamic models.
[LG-50] Experimental End-to-End Optimization of Directly Modulated Laser-based IM/DD Transmission
链接: https://arxiv.org/abs/2508.19910
作者: Sergio Hernandez,Christophe Peucheret,Francesco Da Ros,Darko Zibar
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 10 pages, 10 figures, submitted to journal of lightwave technology
Abstract:Directly modulated lasers (DMLs) are an attractive technology for short-reach intensity modulation and direct detection communication systems. However, their complex nonlinear dynamics make the modeling and optimization of DML-based systems challenging. In this paper, we study the end-to-end optimization of DML-based systems based on a data-driven surrogate model trained on experimental data. The end-to-end optimization includes the pulse shaping and equalizer filters, the bias current and the modulation radio-frequency (RF) power applied to the laser. The performance of the end-to-end optimization scheme is tested on the experimental setup and compared to 4 different benchmark schemes based on linear and nonlinear receiver-side equalization. The results show that the proposed end-to-end scheme is able to deliver better performance throughout the studied symbol rates and transmission distances while employing lower modulation RF power, fewer filter taps and utilizing a smaller signal bandwidth.
[LG-51] On-chip wave chaos for photonic extreme learning
链接: https://arxiv.org/abs/2508.19878
作者: Matthew R. Wilson,Jack A. Smith,Michael J. Strain,Xavier Porte
类目: Optics (physics.optics); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注:
Abstract:The increase in demand for scalable and energy efficient artificial neural networks has put the focus on novel hardware solutions. Integrated photonics offers a compact, parallel and ultra-fast information processing platform, specially suited for extreme learning machine (ELM) architectures. Here we experimentally demonstrate a chip-scale photonic ELM based on wave chaos interference in a stadium microcavity. By encoding the input information in the wavelength of an external single-frequency tunable laser source, we leverage the high sensitivity to wavelength of injection in such photonic resonators. We fabricate the microcavity with direct laser writing of SU-8 polymer on glass. A scattering wall surrounding the stadium operates as readout layer, collecting the light associated with the cavity’s leaky modes. We report uncorrelated and aperiodic behavior in the speckles of the scattering barrier from a high resolution scan of the input wavelength. Finally, we characterize the system’s performance at classification in four qualitatively different benchmark tasks. As we can control the number of output nodes of our ELM by measuring different parts of the scattering barrier, we demonstrate the capability to optimize our photonic ELM’s readout size to the performance required for each task.
[LG-52] Conditional Normalizing Flow Surrogate for Monte Carlo Prediction of Radiative Properties in Nanoparticle-Embedded Layers
链接: https://arxiv.org/abs/2508.19841
作者: Fahime Seyedheydari,Kevin Conley,Simo Särkkä
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optics (physics.optics)
*备注: Version of record (publishers PDF) from META 2025 (CC BY). Please cite the proceedings
Abstract:We present a probabilistic, data-driven surrogate model for predicting the radiative properties of nanoparticle embedded scattering media. The model uses conditional normalizing flows, which learn the conditional distribution of optical outputs, including reflectance, absorbance, and transmittance, given input parameters such as the absorption coefficient, scattering coefficient, anisotropy factor, and particle size distribution. We generate training data using Monte Carlo radiative transfer simulations, with optical properties derived from Mie theory. Unlike conventional neural networks, the conditional normalizing flow model yields full posterior predictive distributions, enabling both accurate forecasts and principled uncertainty quantification. Our results demonstrate that this model achieves high predictive accuracy and reliable uncertainty estimates, establishing it as a powerful and efficient surrogate for radiative transfer simulations.
[LG-53] Fourier Feature Networks for High-Fidelity Prediction of Perturbed Optical Fields
链接: https://arxiv.org/abs/2508.19751
作者: Joshua R. Jandrell,Mitchell A. Cox
类目: Optics (physics.optics); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Modelling the effects of perturbations on optical fields often requires learning highly oscillatory complex-valued functions. Standard multi-layer perceptrons (MLPs) struggle with this task due to an inherent spectral bias, preventing them from fitting high-frequency sinusoids. To overcome this, we incorporate Fourier features - a set of predefined sinusoids dependent on the perturbation - as an additional network input. This reframes the learning problem from approximating a complex function to finding a linear combination of basis functions. We demonstrate this method by training a Fourier Feature Network to predict the transmission matrix of a multimode fibre under mechanical compression. Compared to a standard MLP, our network reduces prediction error in the output field’s amplitude and phase by an order of magnitude, achieving a mean complex correlation of 0.995 with the ground truth, despite using 85% fewer parameters. This approach offers a general and robust method for accurately modelling a wide class of oscillatory physical systems.
[LG-54] Fractal Flow: Hierarchical and Interpretable Normalizing Flow via Topic Modeling and Recursive Strategy
链接: https://arxiv.org/abs/2508.19750
作者: Binhui Zhang,Jianwei Ma
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Normalizing Flows provide a principled framework for high-dimensional density estimation and generative modeling by constructing invertible transformations with tractable Jacobian determinants. We propose Fractal Flow, a novel normalizing flow architecture that enhances both expressiveness and interpretability through two key innovations. First, we integrate Kolmogorov-Arnold Networks and incorporate Latent Dirichlet Allocation into normalizing flows to construct a structured, interpretable latent space and model hierarchical semantic clusters. Second, inspired by Fractal Generative Models, we introduce a recursive modular design into normalizing flows to improve transformation interpretability and estimation accuracy. Experiments on MNIST, FashionMNIST, CIFAR-10, and geophysical data demonstrate that the Fractal Flow achieves latent clustering, controllable generation, and superior estimation accuracy.
[LG-55] Inferring geometry and material properties from Mueller matrices with machine learning
链接: https://arxiv.org/abs/2508.19713
作者: Lars Doorenbos,C. H. Lucas Patty,Raphael Sznitman,Pablo Márquez-Neila
类目: Optics (physics.optics); Machine Learning (cs.LG)
*备注: Presented at Polarization Science and Remote Sensing XII
Abstract:Mueller matrices (MMs) encode information on geometry and material properties, but recovering both simultaneously is an ill-posed problem. We explore whether MMs contain sufficient information to infer surface geometry and material properties with machine learning. We use a dataset of spheres of various isotropic materials, with MMs captured over the full angular domain at five visible wavelengths (450-650 nm). We train machine learning models to predict material properties and surface normals using only these MMs as input. We demonstrate that, even when the material type is unknown, surface normals can be predicted and object geometry reconstructed. Moreover, MMs allow models to identify material types correctly. Further analyses show that diagonal elements are key for material characterization, and off-diagonal elements are decisive for normal estimation.
[LG-56] Simple Stepsize for Quasi-Newton Methods with Global Convergence Guarantees
链接: https://arxiv.org/abs/2508.19712
作者: Artem Agafonov,Vladislav Ryspayev,Samuel Horváth,Alexander Gasnikov,Martin Takáč,Slavomir Hanzely
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Quasi-Newton methods are widely used for solving convex optimization problems due to their ease of implementation, practical efficiency, and strong local convergence guarantees. However, their global convergence is typically established only under specific line search strategies and the assumption of strong convexity. In this work, we extend the theoretical understanding of Quasi-Newton methods by introducing a simple stepsize schedule that guarantees a global convergence rate of O(1/k) for the convex functions. Furthermore, we show that when the inexactness of the Hessian approximation is controlled within a prescribed relative accuracy, the method attains an accelerated convergence rate of O(1/k^2) – matching the best-known rates of both Nesterov’s accelerated gradient method and cubically regularized Newton methods. We validate our theoretical findings through empirical comparisons, demonstrating clear improvements over standard Quasi-Newton baselines. To further enhance robustness, we develop an adaptive variant that adjusts to the function’s curvature while retaining the global convergence guarantees of the non-adaptive algorithm.
[LG-57] MRExtrap: Longitudinal Aging of Brain MRIs using Linear Modeling in Latent Space
链接: https://arxiv.org/abs/2508.19482
作者: Jaivardhan Kapoor,Jakob H. Macke,Christian F. Baumgartner
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: Preprint
Abstract:Simulating aging in 3D brain MRI scans can reveal disease progression patterns in neurological disorders such as Alzheimer’s disease. Current deep learning-based generative models typically approach this problem by predicting future scans from a single observed scan. We investigate modeling brain aging via linear models in the latent space of convolutional autoencoders (MRExtrap). Our approach, MRExtrap, is based on our observation that autoencoders trained on brain MRIs create latent spaces where aging trajectories appear approximately linear. We train autoencoders on brain MRIs to create latent spaces, and investigate how these latent spaces allow predicting future MRIs through linear extrapolation based on age, using an estimated latent progression rate \boldsymbol\beta . For single-scan prediction, we propose using population-averaged and subject-specific priors on linear progression rates. We also demonstrate that predictions in the presence of additional scans can be flexibly updated using Bayesian posterior sampling, providing a mechanism for subject-specific refinement. On the ADNI dataset, MRExtrap predicts aging patterns accurately and beats a GAN-based baseline for single-volume prediction of brain aging. We also demonstrate and analyze multi-scan conditioning to incorporate subject-specific progression rates. Finally, we show that the latent progression rates in MRExtrap’s linear framework correlate with disease and age-based aging patterns from previously studied structural atrophy rates. MRExtrap offers a simple and robust method for the age-based generation of 3D brain MRIs, particularly valuable in scenarios with multiple longitudinal observations.
[LG-58] Is data-efficient learning feasible with quantum models?
链接: https://arxiv.org/abs/2508.19437
作者: Alona Sakhnenko,Christian B. Mendl,Jeanette M. Lorenz
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:The importance of analyzing nontrivial datasets when testing quantum machine learning (QML) models is becoming increasingly prominent in literature, yet a cohesive framework for understanding dataset characteristics remains elusive. In this work, we concentrate on the size of the dataset as an indicator of its complexity and explores the potential for QML models to demonstrate superior data-efficiency compared to classical models, particularly through the lens of quantum kernel methods (QKMs). We provide a method for generating semi-artificial fully classical datasets, on which we show one of the first evidence of the existence of classical datasets where QKMs require less data during training. Additionally, our study introduces a new analytical tool to the QML domain, derived for classical kernel methods, which can be aimed at investigating the classical-quantum gap. Our empirical results reveal that QKMs can achieve low error rates with less training data compared to classical counterparts. Furthermore, our method allows for the generation of datasets with varying properties, facilitating further investigation into the characteristics of real-world datasets that may be particularly advantageous for QKMs. We also show that the predicted performance from the analytical tool we propose - a generalization metric from classical domain - show great alignment empirical evidence, which fills the gap previously existing in the field. We pave a way to a comprehensive exploration of dataset complexities, providing insights into how these complexities influence QML performance relative to traditional methods. This research contributes to a deeper understanding of the generalization benefits of QKM models and potentially a broader family of QML models, setting the stage for future advancements in the field.
信息检索
[IR-0] Refining Text Generation for Realistic Conversational Recommendation via Direct Preference Optimization EMNLP2025
链接: https://arxiv.org/abs/2508.19918
作者: Manato Tajiri,Michimasa Inaba
类目: Information Retrieval (cs.IR)
*备注: Accepted to EMNLP 2025 Main Conference
Abstract:Conversational Recommender Systems (CRSs) aim to elicit user preferences via natural dialogue to provide suitable item recommendations. However, current CRSs often deviate from realistic human interactions by rapidly recommending items in brief sessions. This work addresses this gap by leveraging Large Language Models (LLMs) to generate dialogue summaries from dialogue history and item recommendation information from item description. This approach enables the extraction of both explicit user statements and implicit preferences inferred from the dialogue context. We introduce a method using Direct Preference Optimization (DPO) to ensure dialogue summary and item recommendation information are rich in information crucial for effective recommendations. Experiments on two public datasets validate our method’s effectiveness in fostering more natural and realistic conversational recommendation this http URL implementation is publicly available at:this https URL
[IR-1] Youtu-GraphRAG : Vertically Unified Agents for Graph Retrieval-Augmented Complex Reasoning
链接: https://arxiv.org/abs/2508.19855
作者: Junnan Dong,Siyu An,Yifei Yu,Qian-Wen Zhang,Linhao Luo,Xiao Huang,Yunsheng Wu,Di Yin,Xing Sun
类目: Information Retrieval (cs.IR)
*备注: 19 pages, 7 figures, 6 tables
Abstract:Graph retrieval-augmented generation (GraphRAG) has effectively enhanced large language models in complex reasoning by organizing fragmented knowledge into explicitly structured graphs. Prior efforts have been made to improve either graph construction or graph retrieval in isolation, yielding suboptimal performance, especially when domain shifts occur. In this paper, we propose a vertically unified agentic paradigm, Youtu-GraphRAG, to jointly connect the entire framework as an intricate integration. Specifically, (i) a seed graph schema is introduced to bound the automatic extraction agent with targeted entity types, relations and attribute types, also continuously expanded for scalability over unseen domains; (ii) To obtain higher-level knowledge upon the schema, we develop novel dually-perceived community detection, fusing structural topology with subgraph semantics for comprehensive knowledge organization. This naturally yields a hierarchical knowledge tree that supports both top-down filtering and bottom-up reasoning with community summaries; (iii) An agentic retriever is designed to interpret the same graph schema to transform complex queries into tractable and parallel sub-queries. It iteratively performs reflection for more advanced reasoning; (iv) To alleviate the knowledge leaking problem in pre-trained LLM, we propose a tailored anonymous dataset and a novel ‘Anonymity Reversion’ task that deeply measures the real performance of the GraphRAG frameworks. Extensive experiments across six challenging benchmarks demonstrate the robustness of Youtu-GraphRAG, remarkably moving the Pareto frontier with up to 90.71% saving of token costs and 16.62% higher accuracy over state-of-the-art baselines. The results indicate our adaptability, allowing seamless domain transfer with minimal intervention on schema.
[IR-2] A Model-agnostic Strategy to Mitigate Embedding Degradation in Personalized Federated Recommendation
链接: https://arxiv.org/abs/2508.19591
作者: Jiakui Shen,Yunqi Mi,Guoshuai Zhao,Jialie Shen,Xueming Qian
类目: Information Retrieval (cs.IR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Centralized recommender systems encounter privacy leakage due to the need to collect user behavior and other private data. Hence, federated recommender systems (FedRec) have become a promising approach with an aggregated global model on the server. However, this distributed training paradigm suffers from embedding degradation caused by suboptimal personalization and dimensional collapse, due to the existence of sparse interactions and heterogeneous preferences. To this end, we propose a novel model-agnostic strategy for FedRec to strengthen the personalized embedding utility, which is called Personalized Local-Global Collaboration (PLGC). It is the first research in federated recommendation to alleviate the dimensional collapse issue. Particularly, we incorporate the frozen global item embedding table into local devices. Based on a Neural Tangent Kernel strategy that dynamically balances local and global information, PLGC optimizes personalized representations during forward inference, ultimately converging to user-specific preferences. Additionally, PLGC carries on a contrastive objective function to reduce embedding redundancy by dissolving dependencies between dimensions, thereby improving the backward representation learning process. We introduce PLGC as a model-agnostic personalized training strategy for federated recommendations that can be applied to existing baselines to alleviate embedding degradation. Extensive experiments on five real-world datasets have demonstrated the effectiveness and adaptability of PLGC, which outperforms various baseline algorithms.
[IR-3] Improving Recommendation Fairness via Graph Structure and Representation Augmentation CIKM2025
链接: https://arxiv.org/abs/2508.19547
作者: Tongxin Xu,Wenqiang Liu,Chenzhong Bin,Cihan Xiao,Zhixin Zeng,Tianlong Gu
类目: Information Retrieval (cs.IR)
*备注: Accepted by CIKM 2025
Abstract:Graph Convolutional Networks (GCNs) have become increasingly popular in recommendation systems. However, recent studies have shown that GCN-based models will cause sensitive information to disseminate widely in the graph structure, amplifying data bias and raising fairness concerns. While various fairness methods have been proposed, most of them neglect the impact of biased data on representation learning, which results in limited fairness improvement. Moreover, some studies have focused on constructing fair and balanced data distributions through data augmentation, but these methods significantly reduce utility due to disruption of user preferences. In this paper, we aim to design a fair recommendation method from the perspective of data augmentation to improve fairness while preserving recommendation utility. To achieve fairness-aware data augmentation with minimal disruption to user preferences, we propose two prior hypotheses. The first hypothesis identifies sensitive interactions by comparing outcomes of performance-oriented and fairness-aware recommendations, while the second one focuses on detecting sensitive features by analyzing feature similarities between biased and debiased representations. Then, we propose a dual data augmentation framework for fair recommendation, which includes two data augmentation strategies to generate fair augmented graphs and feature representations. Furthermore, we introduce a debiasing learning method that minimizes the dependence between the learned representations and sensitive information to eliminate bias. Extensive experiments on two real-world datasets demonstrate the superiority of our proposed framework.
[IR-4] A Hybrid Recommendation Framework for Enhancing User Engagement in Local News
链接: https://arxiv.org/abs/2508.19539
作者: Payam Pourashraf,Bamshad Mobasher
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Local news organizations face an urgent need to boost reader engagement amid declining circulation and competition from global media. Personalized news recommender systems offer a promising solution by tailoring content to user interests. Yet, conventional approaches often emphasize general preferences and may overlook nuanced or eclectic interests in local news. We propose a hybrid news recommender that integrates local and global preference models to improve engagement. Building on evidence of the value of localized models, our method unifies local and non-local predictors in one framework. The system adaptively combines recommendations from a local model, specialized in region-specific content, and a global model that captures broader preferences. Ensemble strategies and multiphase training balance the two. We evaluated the model on two datasets: a synthetic set based on Syracuse newspaper distributions and a Danish dataset (EB-NeRD) labeled for local and non-local content with an LLM. Results show our integrated approach outperforms single-model baselines in accuracy and coverage, suggesting improved personalization that can drive user engagement. The findings have practical implications for publishers, especially local outlets. By leveraging both community-specific and general user interests, the hybrid recommender can deliver more relevant content, increasing retention and subscriptions. In sum, this work introduces a new direction for recommender systems, bridging local and global models to revitalize local news consumption through scalable, personalized user experiences. Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2508.19539 [cs.IR] (or arXiv:2508.19539v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2508.19539 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Payam Pourashraf [view email] [v1] Wed, 27 Aug 2025 03:27:20 UTC (196 KB)
[IR-5] APS Explorer: Navigating Algorithm Performance Spaces for Informed Dataset Selection
链接: https://arxiv.org/abs/2508.19399
作者: Tobias Vente,Michael Heep,Abdullah Abbas,Theodor Sperle,Joeran Beel,Bart Goethals
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Dataset selection is crucial for offline recommender system experiments, as mismatched data (e.g., sparse interaction scenarios require datasets with low user-item density) can lead to unreliable results. Yet, 86% of ACM RecSys 2024 papers provide no justification for their dataset choices, with most relying on just four datasets: Amazon (38%), MovieLens (34%), Yelp (15%), and Gowalla (12%). While Algorithm Performance Spaces (APS) were proposed to guide dataset selection, their adoption has been limited due to the absence of an intuitive, interactive tool for APS exploration. Therefore, we introduce the APS Explorer, a web-based visualization tool for interactive APS exploration, enabling data-driven dataset selection. The APS Explorer provides three interactive features: (1) an interactive PCA plot showing dataset similarity via performance patterns, (2) a dynamic meta-feature table for dataset comparisons, and (3) a specialized visualization for pairwise algorithm performance.
[IR-6] AI for Statutory Simplification: A Comprehensive State Legal Corpus and Labor Benchmark
链接: https://arxiv.org/abs/2508.19365
作者: Emaan Hariri,Daniel E. Ho
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY)
*备注: 10 pages, 3 figures. To appear in ICAIL 2025
Abstract:One of the emerging use cases of AI in law is for code simplification: streamlining, distilling, and simplifying complex statutory or regulatory language. One U.S. state has claimed to eliminate one third of its state code using AI. Yet we lack systematic evaluations of the accuracy, reliability, and risks of such approaches. We introduce LaborBench, a question-and-answer benchmark dataset designed to evaluate AI capabilities in this domain. We leverage a unique data source to create LaborBench: a dataset updated annually by teams of lawyers at the U.S. Department of Labor, who compile differences in unemployment insurance laws across 50 states for over 101 dimensions in a six-month process, culminating in a 200-page publication of tables. Inspired by our collaboration with one U.S. state to explore using large language models (LLMs) to simplify codes in this domain, where complexity is particularly acute, we transform the DOL publication into LaborBench. This provides a unique benchmark for AI capacity to conduct, distill, and extract realistic statutory and regulatory information. To assess the performance of retrieval augmented generation (RAG) approaches, we also compile StateCodes, a novel and comprehensive state statute and regulatory corpus of 8.7 GB, enabling much more systematic research into state codes. We then benchmark the performance of information retrieval and state-of-the-art large LLMs on this data and show that while these models are helpful as preliminary research for code simplification, the overall accuracy is far below the touted promises for LLMs as end-to-end pipelines for regulatory simplification.