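This post lists the latest papers retrieved from arXiv.org on 2025-08-08. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: The daily paper data is retrieved from arXiv.org and refreshed automatically at around 12:00 each day.
Reminder: if you would like the daily paper data sent to your inbox, leave your email address in the comments.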
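Contents
Overview (2025-08-08)
520 papers were updated today, including:
- Natural Language Processing: 71 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 160 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 127 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 161 papers (Machine Learning (cs.LG))
Natural Language Processing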
[NLP-0] H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages
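[Quick Read]: This paper addresses the computational inefficiency that byte-level language models face in morphologically rich languages (MRLs), where words span many bytes. The core challenge is to segment text efficiently and in a semantically sensible way without relying on a conventional tokenizer such as BPE. The key contribution is H-NET++, a hierarchical dynamic-chunking model that learns linguistically informed chunking through end-to-end training. Its main components are a lightweight Transformer context-mixer (1.9M parameters) for cross-chunk attention, a two-level latent hyper-prior for document-level consistency, specialized handling of orthographic artifacts (such as the Persian ZWNJ), and curriculum-based training with staged sequence lengths. On Persian data the method outperforms existing baselines in compression (0.159 BPB reduction versus BPE-based GPT-2-fa), downstream performance (5.4pp gain on ParsGLUE), and robustness (53% improved robustness to ZWNJ corruption).
Link: https://arxiv.org/abs/2508.05628
Authors: Mehrdad Zakershahrak, Samira Ghodratnama
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: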
Abstract:Byte-level language models eliminate fragile tokenizers but face computational challenges in morphologically-rich languages (MRLs), where words span many bytes. We propose H-NET++, a hierarchical dynamic-chunking model that learns linguistically-informed segmentation through end-to-end training. Key innovations include: (1) a lightweight Transformer context-mixer (1.9M parameters) for cross-chunk attention, (2) a two-level latent hyper-prior for document-level consistency, (3) specialized handling of orthographic artifacts (e.g. Persian ZWNJ), and (4) curriculum-based training with staged sequence lengths. On a 1.4B-token Persian corpus, H-NET++ achieves state-of-the-art results: 0.159 BPB reduction versus BPE-based GPT-2-fa (12% better compression), 5.4pp gain on ParsGLUE, 53% improved robustness to ZWNJ corruption, and 73.8% F1 on gold morphological boundaries. Our learned chunks align with Persian morphology without explicit supervision, demonstrating that hierarchical dynamic chunking provides an effective tokenizer-free solution for MRLs while maintaining computational efficiency.
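The byte-level cost and the BPB metric mentioned in this abstract can be made concrete with a small sketch. The Persian word and the helper names `utf8_length` and `bits_per_byte` are illustrative assumptions, not taken from the paper; the sketch only shows how a ZWNJ-joined word expands in UTF-8 and how bits per byte is conventionally computed from a summed negative log-likelihood.

```python
import math

# Illustrative Persian word ("mi-ravam") written with a zero-width non-joiner
# (ZWNJ, U+200C) between prefix and stem. The word choice is an assumption.
WORD = "\u0645\u06cc\u200c\u0631\u0648\u0645"


def utf8_length(text: str) -> int:
    """Number of bytes a byte-level model has to process for `text`."""
    return len(text.encode("utf-8"))


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * n_bytes)


if __name__ == "__main__":
    # Each Persian letter takes 2 bytes in UTF-8 and the ZWNJ takes 3.
    print(len(WORD), "code points ->", utf8_length(WORD), "bytes")
    # A hypothetical model that spends 26 bits on this word scores about 2 BPB.
    print(bits_per_byte(26 * math.log(2), utf8_length(WORD)))
```

Because BPB is normalized by bytes rather than tokens, it allows a tokenizer-free model and a BPE-based model to be compared on the same footing.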
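[NLP-1] How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations
[Quick Read]: This paper addresses how to understand the mechanisms by which large language models (LLMs) persuade in natural multi-turn conversations. Existing work lacks a systematic analysis of LLM persuasion, and the specific pathways at work in dynamic interaction remain unclear. The key approach is to use linear probes, a lightweight tool, and, guided by insights from cognitive science, to train separate probes for persuasion success, persuadee personality, and persuasion strategy. Experiments show that despite their simplicity, the probes identify the point at which persuasion occurs at both the sample and dataset levels. They are also much cheaper to run than prompting-based approaches and even outperform prompting in some settings, such as uncovering persuasion strategy, which makes them a plausible avenue for studying other complex behaviours such as deception and manipulation.
Link: https://arxiv.org/abs/2508.05625
Authors: Brandon Jaipersaud, David Krueger, Ekdeep Singh Lubana
Affiliations: Mila; University of Montreal; CBS-NTT Program in Physics of Intelligence, Harvard University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: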
Abstract:Large Language Models (LLMs) have started to demonstrate the ability to persuade humans, yet our understanding of how this dynamic transpires is limited. Recent work has used linear probes, lightweight tools for analyzing model representations, to study various LLM skills such as the ability to model user sentiment and political perspective. Motivated by this, we apply probes to study persuasion dynamics in natural, multi-turn conversations. We leverage insights from cognitive science to train probes on distinct aspects of persuasion: persuasion success, persuadee personality, and persuasion strategy. Despite their simplicity, we show that they capture various aspects of persuasion at both the sample and dataset levels. For instance, probes can identify the point in a conversation where the persuadee was persuaded or where persuasive success generally occurs across the entire dataset. We also show that in addition to being faster than expensive prompting-based approaches, probes can do just as well and even outperform prompting in some settings, such as when uncovering persuasion strategy. This suggests probes as a plausible avenue for studying other complex behaviours such as deception and manipulation, especially in multi-turn settings and large-scale dataset analysis where prompting-based methods would be computationally inefficient.
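A linear probe in this sense is just a linear classifier trained on frozen model representations. The sketch below is a minimal illustration under stated assumptions: the "activations" are synthetic Gaussian vectors standing in for LLM hidden states, the binary label stands in for something like persuasion success, and all function names are hypothetical. It is not the authors' setup.

```python
import math
import random


def make_synthetic_activations(n, dim, seed=0):
    """Synthetic stand-in for hidden states: two classes shifted along one direction."""
    rng = random.Random(seed)
    direction = [1.0 if j % 2 == 0 else -1.0 for j in range(dim)]
    feats, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        shift = 1.0 if y == 1 else -1.0
        feats.append([rng.gauss(0.0, 1.0) + shift * d for d in direction])
        labels.append(y)
    return feats, labels


def train_linear_probe(features, labels, lr=0.1, epochs=200):
    """Fit logistic regression (the probe) with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_predict(w, b, x):
    """Return the probe's hard label for one activation vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


if __name__ == "__main__":
    feats, labels = make_synthetic_activations(200, 8)
    w, b = train_linear_probe(feats[:150], labels[:150])
    hits = sum(probe_predict(w, b, x) == y for x, y in zip(feats[150:], labels[150:]))
    print("held-out accuracy:", hits / 50)
```

In practice the feature vectors would be activations extracted per conversation turn, which is what lets a probe localize where in a dialogue a property becomes decodable.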
[NLP-2] Learning to Reason for Factuality
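[Quick Read]: This paper addresses the high hallucination rate and weak factual accuracy of reasoning large language models (R-LLMs) on long-form factuality tasks. R-LLMs perform well on complex reasoning, yet their output often contains unsupported claims because reliable verification methods are lacking. Prior work used automatic evaluation frameworks such as FActScore to build offline preference data for reinforcement learning (RL), but using such metrics directly as the reward in online RL leads to reward hacking, for example producing less detailed or less relevant answers. The key to the solution is a new reward function that jointly considers factual precision, response detail level, and answer relevance, optimized with online RL, so that hallucinations drop and answer quality rises without sacrificing overall helpfulness. Across six long-form factuality benchmarks, the method reduces the hallucination rate by 23.1 percentage points on average and raises the detail level by 23%.
Link: https://arxiv.org/abs/2508.05618
Authors: Xilun Chen, Ilia Kulikov, Vincent-Pierre Berges, Barlas Oğuz, Rulin Shao, Gargi Ghosh, Jason Weston, Wen-tau Yih
Affiliations: Meta
Categories: Computation and Language (cs.CL)
Comments: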
Abstract:Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or relevant responses. We propose a novel reward function that simultaneously considers the factual precision, response detail level, and answer relevance, and applies online RL to learn high quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in the overall response helpfulness.
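[NLP-3] Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
[Quick Read]: This paper addresses the dependence of graphical user interface (GUI) grounding on large amounts of pixel-level annotation, which is costly and hard to obtain. Existing methods typically rely on supervised learning or on reinforcement learning with labeled reward signals, which limits generalization and deployment efficiency in real-world settings. The key idea is to exploit the spatial overlap among multiple predictions that the model generates at test time and to extract from it an implicit confidence signal, namely region consistency. GUI-RC (Region Consistency) builds a spatial voting grid to find the regions where predictions agree most, improving grounding accuracy without any training. GUI-RCPO (Region Consistency Policy Optimization) then turns these consistency patterns into rewards, so that the model can refine itself on unlabeled data during inference through self-supervised reinforcement learning. The approach shows the potential of test-time scaling and test-time reinforcement learning for GUI grounding and offers a path toward more robust and data-efficient GUI agents.
Link: https://arxiv.org/abs/2508.05615
Authors: Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, Yongliang Shen
Affiliations: Zhejiang University; Central South University; Zhejiang University of Science and Technology; SF Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL Code: this https URL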
Abstract:Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
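The spatial voting idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes predictions are integer pixel boxes `(x1, y1, x2, y2)` with exclusive right and bottom edges, and it takes the bounding box of the highest-vote cells as the consensus region, which is a simplification. The function name `region_consistency_vote` is hypothetical.

```python
def region_consistency_vote(boxes, width, height):
    """Accumulate one vote per pixel per sampled box, then return the
    bounding box of the pixels with the most votes and that vote count."""
    grid = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                grid[y][x] += 1

    top = max(max(row) for row in grid)
    cells = [(x, y) for y in range(height) for x in range(width) if grid[y][x] == top]
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    consensus = (min(xs), min(ys), max(xs) + 1, max(ys) + 1)
    return consensus, top


if __name__ == "__main__":
    # Three roughly agreeing samples plus one outlier on a 40x40 screen.
    samples = [(2, 2, 8, 8), (4, 3, 10, 9), (3, 4, 9, 10), (30, 30, 35, 35)]
    region, votes = region_consistency_vote(samples, width=40, height=40)
    print(region, votes)
```

The outlier box receives only its own vote, so it does not affect the consensus region. A reward in the spirit of GUI-RCPO could then score each sampled box by how many votes its area collects, although the exact reward used in the paper is not given here.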
[NLP-4] OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks
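[Quick Read]: This paper addresses the gap in evaluating how well large language models reason as embodied agents, in particular in complex scenarios involving physical interaction, tool use, and multi-agent coordination. Existing benchmarks often provide predefined tool sets or explicit collaboration directives and therefore do not reflect an agent's capacity for autonomous decision-making. The proposed solution is the OmniEAR framework, which models continuous physical properties and complex spatial relationships through a text-based environment representation and requires agents to acquire capabilities dynamically and determine coordination strategies on their own according to task demands, enabling a systematic evaluation of embodied reasoning. Its key innovation is to drop static constraints and explicit guidance and instead test reasoning from constraints. The evaluation reveals marked performance drops on implicit collaboration and compound tasks, exposes fundamental architectural limitations, and provides a rigorous benchmark for future embodied AI.
Link: https://arxiv.org/abs/2508.05614
Authors: Zixuan Wang, Dingming Li, Hongxing Li, Shuo Chen, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL Code: this https URL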
Abstract:Large language models excel at abstract reasoning but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, with compound tasks showing over 50% failure rates. Surprisingly, complete environmental information degrades coordination performance, indicating models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses fundamentally different challenges than current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems. Our code and data are included in the supplementary materials and will be open-sourced upon acceptance.
[NLP-5] Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models
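[Quick Read]: This paper addresses two problems in reinforcement learning (RL) for large language models (LLMs) on reasoning tasks: rule-based rewards lack robustness, and model-based rewards are vulnerable to reward hacking. The key to the solution is Cooper, a framework that jointly optimizes the policy model and the reward model. It exploits the high precision of rule-based rewards when identifying correct responses and dynamically constructs positive-negative sample pairs to keep training the reward model, which improves robustness and reduces the risk of reward hacking. The paper also introduces a hybrid annotation strategy and a reference-based reward modeling paradigm that further improve the reward model's accuracy and practicality. End-to-end RL performance improves on models such as Qwen2.5-1.5B-Instruct, where the reported gain in average accuracy is 0.54%.
Link: https://arxiv.org/abs/2508.05613
Authors: Haitao Hong, Yuchen Yan, Xingyu Wu, Guiyang Hou, Wenqi Zhang, Weiming Lu, Yongliang Shen, Jun Xiao
Affiliations: Zhejiang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL Code: this https URL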
Abstract:Large language models (LLMs) have demonstrated remarkable performance in reasoning tasks, where reinforcement learning (RL) serves as a key algorithm for enhancing their reasoning capabilities. Currently, there are two mainstream reward paradigms: model-based rewards and rule-based rewards. However, both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. To address these issues, we propose Cooper (Co-optimizing Policy Model and Reward Model), an RL framework that jointly optimizes both the policy model and the reward model. Cooper leverages the high precision of rule-based rewards when identifying correct responses, and dynamically constructs and selects positive-negative sample pairs for continued training of the reward model. This design enhances robustness and mitigates the risk of reward hacking. To further support Cooper, we introduce a hybrid annotation strategy that efficiently and accurately generates training data for the reward model. We also propose a reference-based reward modeling paradigm, where the reward model takes a reference answer as input. Based on this design, we train a reward model named VerifyRM, which achieves higher accuracy on VerifyBench compared to other models of the same size. We conduct reinforcement learning using both VerifyRM and Cooper. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance, for instance, achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct. Our findings demonstrate that dynamically updating the reward model is an effective way to combat reward hacking, providing a reference for better integrating reward models into RL.
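[NLP-6] Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
[Quick Read]: This paper addresses the difficulty of extending Chain-of-Thought (CoT) reasoning to vision-language tasks, in particular how to model the evolution of visual states so as to support coherent and grounded multimodal reasoning. Existing methods fall short because they have limited capacity to model visual state transitions or because fragmented architectures produce incoherent visual trajectories. The key to the solution is Uni-CoT, a unified Chain-of-Thought framework with a two-level reasoning paradigm: a macro-level CoT for high-level task planning and a micro-level CoT for subtask execution, which significantly reduces computational overhead. A structured training paradigm combines interleaved image-text supervision with multi-task objectives, allowing a single model to perform efficient, scalable, and coherent multimodal reasoning.
Link: https://arxiv.org/abs/2508.05606
Authors: Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, Hao Li
Affiliations: Shanghai Academy of AI for Science; Fudan University; Nanyang Technological University; INFTech
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: this https URL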
Abstract:Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this due to limited capacity of modeling visual state transitions or incoherent visual trajectories caused by fragmented architectures. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to achieve that is non-trivial, given the high computational cost and the burden of training. To address this, Uni-CoT introduces a novel two-level reasoning paradigm: a Macro-Level CoT for high-level task planning and a Micro-Level CoT for subtask execution. This design significantly reduces the computational overhead. Furthermore, we introduce a structured training paradigm that combines interleaved image-text supervision for macro-level CoT with multi-task objectives for micro-level CoT. Together, these innovations allow Uni-CoT to perform scalable and coherent multi-modal reasoning. Furthermore, thanks to our design, all experiments can be efficiently completed using only 8 A100 GPUs with 80GB VRAM each. Experimental results on a reasoning-driven image generation benchmark (WISE) and editing benchmarks (RISE and KRIS) indicate that Uni-CoT demonstrates SOTA performance and strong generalization, establishing Uni-CoT as a promising solution for multi-modal reasoning.
Project Page and Code: this https URL
[NLP-7] MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy
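[Quick Read]: This paper addresses the scarcity of high-quality, high-difficulty training data that limits progress in the mathematical reasoning of large language models (LLMs). Existing synthesis methods mostly transform human-written templates, which limits diversity and scalability. The key to the solution is the MathSmith framework, which builds entirely new problems by randomly sampling concept-explanation pairs from PlanetMath, ensuring data independence and avoiding contamination. It uses nine predefined strategies as soft constraints during rationale generation to raise difficulty, and applies reinforcement learning to jointly optimize structural validity, reasoning complexity, and answer consistency. The length of the reasoning trace serves as a proxy for cognitive complexity, which encourages challenging problems suited to long chain-of-thought (CoT) reasoning.
Link: https://arxiv.org/abs/2508.05592
Authors: Shaoxiong Zhan, Yanlin Lai, Ziyu Lu, Dahua Lin, Ziqing Yang, Fei Tang
Affiliations: Tsinghua University; The Chinese University of Hong Kong; SenseTime Research; East China Normal University
Categories: Computation and Language (cs.CL)
Comments: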
Abstract:Large language models have achieved substantial progress in mathematical reasoning, yet their advancement is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods largely rely on transforming human-written templates, limiting both diversity and scalability. We propose MathSmith, a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning. Rather than modifying existing problems, MathSmith constructs new ones from scratch by randomly sampling concept-explanation pairs from PlanetMath, ensuring data independence and avoiding contamination. To increase difficulty, we design nine predefined strategies as soft constraints during rationales. We further adopt reinforcement learning to jointly optimize structural validity, reasoning complexity, and answer consistency. The length of the reasoning trace generated under autoregressive prompting is used to reflect cognitive complexity, encouraging the creation of more demanding problems aligned with long-chain-of-thought reasoning. Experiments across five benchmarks, categorized as easy and medium (GSM8K, MATH-500) and hard (AIME2024, AIME2025, OlympiadBench), show that MathSmith consistently outperforms existing baselines under both short and long CoT settings. Additionally, a weakness-focused variant generation module enables targeted improvement on specific concepts. Overall, MathSmith exhibits strong scalability, generalization, and transferability, highlighting the promise of high-difficulty synthetic data in advancing LLM reasoning capabilities.
[NLP-8] Iterative Learning of Computable Phenotypes for Treatment Resistant Hypertension using Large Language Models ALT
[Quick Read]: This paper addresses how large language models (LLMs) can be used to generate interpretable computable phenotypes (CPs) that support clinical decision-making for patients with hypertension, enabling scalable and precise care. The key to the solution is a "synthesize, execute, debug, instruct" strategy that iteratively refines the CP code generated by LLMs using data-driven feedback. With few training examples, the resulting programs approach the performance of state-of-the-art machine learning methods while remaining interpretable and reasonably accurate.
Link: https://arxiv.org/abs/2508.05581
Authors: Guilherme Seidyo Imai Aldeia, Daniel S. Herman, William G. La Cava
Affiliations: Federal University of ABC; University of Pennsylvania; Boston Children’s Hospital; Harvard Medical School
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: To appear in PMLR, Volume 298, Machine Learning for Healthcare, 2025
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities for medical question answering and programming, but their potential for generating interpretable computable phenotypes (CPs) is under-explored. In this work, we investigate whether LLMs can generate accurate and concise CPs for six clinical phenotypes of varying complexity, which could be leveraged to enable scalable clinical decision support to improve care for patients with hypertension. In addition to evaluating zero-shot performance, we propose and test a synthesize, execute, debug, instruct strategy that uses LLMs to generate and iteratively refine CPs using data-driven feedback. Our results show that LLMs, coupled with iterative learning, can generate interpretable and reasonably accurate programs that approach the performance of state-of-the-art ML methods while requiring significantly fewer training examples.
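[NLP-9] Fairy±i: the First 2-bit Complex LLM with All Parameters in {±1, ±i}
[Quick Read]: This paper addresses the fact that current low-bit quantization-aware training (QAT) methods cannot exceed the accuracy ceiling set by the full-precision model. Existing work aims to minimize the quantization error relative to a full-precision model, so the accuracy of the 2-bit model is bounded by that of the original. To break this limit, the authors propose a new paradigm: first raise the ceiling by improving the full-precision model, then quantize it efficiently to 2 bits. The core of the solution is the Fairy±i framework, which maps weights to the fourth roots of unity {±1, ±i}, an information-theoretically optimal 2-bit representation. It uses the representational advantages of the complex domain to raise full-precision accuracy, and because each quantized weight has either a zero real part or a zero imaginary part, inference needs no multiplications, only additions and element swaps, which improves compute and storage efficiency.
Link: https://arxiv.org/abs/2508.05571
Authors: Feiyu Wang, Guoan Wang, Yihao Zhang, Shengfan Wang, Weitao Li, Bokai Huang, Shimao Chen, Zihan Jiang, Rui Xu, Tong Yang
Affiliations: Peking University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 13 pages, 14 figures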
Abstract:Quantization-Aware Training (QAT) integrates quantization into the training loop, enabling LLMs to learn robust low-bit representations, and is widely recognized as one of the most promising research directions. All current QAT research focuses on minimizing quantization error on full-precision models, where the full-precision accuracy acts as an upper bound (accuracy ceiling). No existing method has even attempted to surpass this ceiling. To break this ceiling, we propose a new paradigm: raising the ceiling (full-precision model), and then still quantizing it efficiently into 2 bits. We propose Fairy±i, the first 2-bit quantization framework for complex-valued LLMs. Specifically, our method leverages the representational advantages of the complex domain to boost full-precision accuracy. We map weights to the fourth roots of unity {±1, ±i}, forming a perfectly symmetric and information-theoretically optimal 2-bit representation. Importantly, each quantized weight has either a zero real or imaginary part, enabling multiplication-free inference using only additions and element swaps. Experimental results show that Fairy±i outperforms the ceiling of existing 2-bit quantization approaches in terms of both PPL and downstream tasks, while maintaining strict storage and compute efficiency. This work opens a new direction for building highly accurate and practical LLMs under extremely low-bit constraints.
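The claim that weights in {±1, ±i} allow multiplication-free inference can be checked with a small sketch. The nearest-root quantizer below is an assumption made for illustration (the paper's actual quantizer, including any scaling, is not described here), and the function names are hypothetical. What the sketch does show reliably is that multiplying by a fourth root of unity reduces to swapping and negating the real and imaginary parts.

```python
def quantize_to_fourth_root(w: complex) -> complex:
    """Map a complex weight to the nearest of {1, -1, i, -i} by phase.
    This nearest-root rule is an illustrative assumption."""
    if abs(w.real) >= abs(w.imag):
        return complex(1, 0) if w.real >= 0 else complex(-1, 0)
    return complex(0, 1) if w.imag >= 0 else complex(0, -1)


def multiply_free(q: complex, x: complex) -> complex:
    """Compute q * x for q in {1, -1, i, -i} using only swaps and sign flips."""
    a, b = x.real, x.imag
    if q == 1:
        return complex(a, b)
    if q == -1:
        return complex(-a, -b)
    if q == 1j:
        return complex(-b, a)   # i * (a + bi) = -b + ai
    if q == -1j:
        return complex(b, -a)   # -i * (a + bi) = b - ai
    raise ValueError("q must be one of 1, -1, 1j, -1j")


def quantized_dot(weights, activations):
    """Dot product with quantized weights: additions only, no multiplications."""
    total = complex(0, 0)
    for w, x in zip(weights, activations):
        total += multiply_free(quantize_to_fourth_root(w), x)
    return total


if __name__ == "__main__":
    ws = [0.8 + 0.1j, -0.2 + 0.9j, -0.7 - 0.3j, 0.1 - 0.6j]
    xs = [1 + 2j, 3 - 1j, -2 + 0.5j, 0.5 + 0.5j]
    print(quantized_dot(ws, xs))
```

Four possible values fit in exactly 2 bits, which is where the 2-bit figure comes from.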
[NLP-10] SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription INTERSPEECH2025
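[Quick Read]: This paper addresses the scarcity of annotated data for multi-talker automatic speech recognition (ASR) in the financial domain, in particular the lack of professionally transcribed audio-text pairs with speaker tags. The key to the solution is the SPGISpeech 2.0 dataset, which contains 3,780 additional hours of professionally transcribed earnings-call audio and provides call and speaker information for every audio snippet, supporting end-to-end speaker-tagged ASR. The dataset broadens the range of applicable modeling tasks, and fine-tuning popular speech recognition models on it improves speaker-tagged ASR performance, giving financial speech recognition a high-quality data foundation.
Link: https://arxiv.org/abs/2508.05554
Authors: Raymond Grossman, Taejin Park, Kunal Dhawan, Andrew Titus, Sophia Zhi, Yulia Shchadilova, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg
Affiliations: Kensho; NVIDIA
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: To be presented at Interspeech 2025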
Abstract:We introduce SPGISpeech 2.0, a dataset suitable for speaker-tagged transcription in the financial domain. SPGISpeech 2.0 improves the diversity of applicable modeling tasks while maintaining the core characteristic of the original SPGISpeech dataset: audio snippets and their corresponding fully formatted text transcriptions, usable for end-to-end automatic speech recognition (ASR). SPGISpeech 2.0 consists of 3,780 additional hours of professionally transcribed earnings calls. Furthermore, the dataset contains call and speaker information for each audio snippet facilitating multi-talker ASR. We validate the utility of SPGISpeech 2.0 through improvements in speaker-tagged ASR performance of popular speech recognition models after fine-tuning on SPGISpeech 2.0. Released free for non-commercial use, we expect SPGISpeech 2.0 to foster advancements in speech recognition technologies and inspire a wide range of research applications.
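[NLP-11] Do Political Opinions Transfer Between Western Languages? An Analysis of Unaligned and Aligned Multilingual LLMs
[Quick Read]: This paper asks whether multilingual large language models (MLLMs) express different political opinions in different languages, that is, whether the cross-cultural differences in political attitudes seen in surveys show up as cross-lingual differences in the models. The approach is to evaluate the models' political stance by prompting them with political statements from voting advice applications, and to shift their political leaning with direct preference optimization (DPO) using English alignment data only, so that opinions can be compared systematically across five Western languages before and after alignment. Unaligned models show very few significant cross-lingual differences, while alignment shifts opinions almost uniformly across all five languages. This indicates that political opinions transfer between Western languages and highlights the challenge of achieving explicit socio-linguistic, cultural, and political alignment.
Link: https://arxiv.org/abs/2508.05553
Authors: Franziska Weeber, Tanise Ceron, Sebastian Padó
Affiliations: University of Stuttgart; Bocconi University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: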
Abstract:Public opinion surveys show cross-cultural differences in political opinions between socio-cultural contexts. However, there is no clear evidence whether these differences translate to cross-lingual differences in multilingual large language models (MLLMs). We analyze whether opinions transfer between languages or whether there are separate opinions for each language in MLLMs of various sizes across five Western languages. We evaluate MLLMs’ opinions by prompting them to report their (dis)agreement with political statements from voting advice applications. To better understand the interaction between languages in the models, we evaluate them both before and after aligning them with more left or right views using direct preference optimization and English alignment data only. Our findings reveal that unaligned models show only very few significant cross-lingual differences in the political opinions they reflect. The political alignment shifts opinions almost uniformly across all five languages. We conclude that in Western language contexts, political opinions transfer between languages, demonstrating the challenges in achieving explicit socio-linguistic, cultural, and political alignment of MLLMs.
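For readers unfamiliar with direct preference optimization, the standard DPO objective for a single preference pair can be written as a short function. This is a generic sketch of the published DPO loss, not the authors' training code; the log-probability values in the usage example are made up, and `beta=0.1` is an assumed default.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair:
    -log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]),
    where each argument is the summed log-probability of a full response."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # Numerically stable form of -log(sigmoid(z)) = log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy prefers the chosen response more than the reference does: lower loss.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

In the setting described above, the "chosen" response would be the one expressing the target (more left or more right) view, with all pairs written in English.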
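[NLP-12] Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees
[Quick Read]: This paper addresses the unreliability of large language models (LLMs) in multiple-choice question answering (MCQA) caused by hallucination and overconfidence, especially in high-risk applications. The key to the solution is a frequency-based uncertainty quantification method for black-box settings that uses conformal prediction (CP) to obtain provable coverage guarantees. The method draws multiple independent samples from the model's output distribution and uses the most frequent sample as the reference for computing predictive entropy (PE). Experiments show that it distinguishes correct from incorrect predictions better than logit-based PE and keeps the empirical miscoverage rate within the user-specified risk level. The result is a distribution-free, model-agnostic framework for reliable uncertainty estimation in MCQA that needs no access to model internals.
Link: https://arxiv.org/abs/2508.05544
Authors: Guang Yang, Xinyang Liu
Affiliations: University of Jinan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: under review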
Abstract:Large Language Models (LLMs) have shown remarkable progress in multiple-choice question answering (MCQA), but their inherent unreliability, such as hallucination and overconfidence, limits their application in high-risk domains. To address this, we propose a frequency-based uncertainty quantification method under black-box settings, leveraging conformal prediction (CP) to ensure provable coverage guarantees. Our approach involves multiple independent samplings of the model’s output distribution for each input, with the most frequent sample serving as a reference to calculate predictive entropy (PE). Experimental evaluations across six LLMs and four datasets (MedMCQA, MedQA, MMLU, MMLU-Pro) demonstrate that frequency-based PE outperforms logit-based PE in distinguishing between correct and incorrect predictions, as measured by AUROC. Furthermore, the method effectively controls the empirical miscoverage rate under user-specified risk levels, validating that sampling frequency can serve as a viable substitute for logit-based probabilities in black-box scenarios. This work provides a distribution-free model-agnostic framework for reliable uncertainty quantification in MCQA with guaranteed coverage, enhancing the trustworthiness of LLMs in practical applications.
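How sampling frequencies can replace logits in conformal prediction is easiest to see in a generic split-conformal sketch. The nonconformity score used here, one minus the sampled frequency of an option, is a common textbook choice and an assumption on our part; the paper's exact score may differ. All function names and numbers are hypothetical.

```python
import math
from collections import Counter


def option_frequencies(samples, options):
    """Fraction of sampled answers that picked each option (black-box access only)."""
    counts = Counter(samples)
    return {opt: counts[opt] / len(samples) for opt in options}


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score.
    Each calibration score is 1 - frequency of the true option."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep every option
    return sorted(calibration_scores)[rank - 1]


def prediction_set(freqs, threshold):
    """Keep every option whose nonconformity score is within the threshold."""
    return [opt for opt, f in freqs.items() if 1.0 - f <= threshold]


if __name__ == "__main__":
    # Made-up calibration scores from questions with known answers.
    cal_scores = [0.0, 0.1, 0.2, 0.3, 0.4, 0.75, 0.9]
    qhat = conformal_threshold(cal_scores, alpha=0.25)
    # Ten sampled answers for a new question.
    samples = ["A"] * 6 + ["B"] * 3 + ["C"]
    freqs = option_frequencies(samples, ["A", "B", "C", "D"])
    print(qhat, prediction_set(freqs, qhat))
```

Under exchangeability of calibration and test questions, sets built this way contain the true answer with probability at least 1 - alpha, which is the kind of coverage guarantee the abstract refers to.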
[NLP-13] Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation
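[Quick Read]: This paper addresses the challenge that, in long-horizon human-robot collaboration, the robot must adapt to diverse human partners whose behavior, willingness to assist, and understanding of the robot's capabilities change over time. To enable flexible and efficient collaboration, it proposes MICoBot, built around a tightly coupled mixed-initiative dialog in which both the robot and the human can propose, accept, or decline suggestions about who should carry out each step of a task. The system makes decisions at three levels: (1) a meta-planner formulates a high-level collaboration strategy from the human's dialog; (2) a planner allocates the remaining steps using a simulation-pretrained affordance model of the robot's capabilities together with the human's estimated availability; (3) an action executor decides the low-level actions to perform or the words to say. Experiments in simulation and in the real world show significantly better task success and user experience than a pure LLM baseline and other allocation strategies.
Link: https://arxiv.org/abs/2508.05535
Authors: Albert Yu, Chengshu Li, Luca Macesanu, Arnav Balaji, Ruchira Ray, Raymond Mooney, Roberto Martín-Martín
Affiliations: UT Austin; Stanford
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Project website at this https URL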
Abstract:Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot’s capabilities may change over time. This demands a tightly coupled communication loop that grants both agents the flexibility to propose, accept, or decline requests as they coordinate toward completing the task effectively. We apply a Mixed-Initiative dialog paradigm to Collaborative human-roBot teaming and propose MICoBot, a system that handles the common scenario where both agents, using natural language, take initiative in formulating, accepting, or rejecting proposals on who can best complete different steps of a task. To handle diverse, task-directed dialog, and find successful collaborative strategies that minimize human effort, MICoBot makes decisions at three levels: (1) a meta-planner considers human dialog to formulate and code a high-level collaboration strategy, (2) a planner optimally allocates the remaining steps to either agent based on the robot’s capabilities (measured by a simulation-pretrained affordance model) and the human’s estimated availability to help, and (3) an action executor decides the low-level actions to perform or words to say to the human. Our extensive evaluations in simulation and real-world – on a physical robot with 18 unique human participants over 27 hours – demonstrate the ability of our method to effectively collaborate with diverse human users, yielding significantly improved task success and user experience than a pure LLM baseline and other agent allocation models. See additional videos and materials at this https URL.
[NLP-14] CoCoLex: Confidence-guided Copy-based Decoding for Grounded Legal Text Generation ACL2025
[Quick Read]: This paper addresses hallucination in legal text generation by large language models (LLMs), which stems from a lack of faithfulness to the source. The problem is most pronounced in long-form generation, where retrieval-augmented generation (RAG) cannot guarantee that the external context is actually used. The key to the solution is a decoding strategy called Confidence-guided Copy-based Decoding for Legal Text Generation (CoCoLex). It dynamically interpolates the model's own vocabulary distribution with a distribution derived from copying from the context, and uses the model's confidence to decide how strongly to copy, which significantly improves the faithfulness of the output to the source.
Link: https://arxiv.org/abs/2508.05534
Authors: Santosh T.Y.S.S, Youssef Tarek Elkhayat, Oana Ichim, Pranav Shetty, Dongsheng Wang, Zhiqiang Ma, Armineh Nourbakhsh, Xiaomo Liu
Affiliations: Technical University of Munich; Graduate Institute of International and Development Studies; JPMorgan AI Research
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025-Main Conference
Abstract:Due to their ability to process long and complex contexts, LLMs can offer key benefits to the legal domain, but their adoption has been hindered by their tendency to generate unfaithful, ungrounded, or hallucinatory outputs. While Retrieval-Augmented Generation offers a promising solution by grounding generations in external knowledge, it offers no guarantee that the provided context will be effectively integrated. To address this, context-aware decoding strategies have been proposed to amplify the influence of relevant context, but they usually do not explicitly enforce faithfulness to the context. In this work, we introduce Confidence-guided Copy-based Decoding for Legal Text Generation (CoCoLex), a decoding strategy that dynamically interpolates the model-produced vocabulary distribution with a distribution derived by copying from the context. CoCoLex encourages direct copying based on the model’s confidence, ensuring greater fidelity to the source. Experimental results on five legal benchmarks demonstrate that CoCoLex outperforms existing context-aware decoding methods, particularly in long-form generation tasks.
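As a rough illustration of the interpolation idea in the CoCoLex abstract, the Python sketch below mixes a model distribution with a copy distribution. The entropy-based weighting rule, the function names `normalized_entropy` and `interpolate`, and the toy numbers are all assumptions made for illustration; the abstract does not state the paper's actual formula.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1] by dividing by log(vocabulary size)."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return min(1.0, h / math.log(len(probs)))

def interpolate(model_probs, copy_probs):
    """Mix two distributions over the same toy vocabulary.

    Assumed rule (not from the paper): the less confident the model is
    (higher entropy), the more weight the copy distribution receives.
    """
    copy_weight = normalized_entropy(model_probs)
    return [(1 - copy_weight) * m + copy_weight * c
            for m, c in zip(model_probs, copy_probs)]

# Toy vocabulary of four tokens; the context suggests copying token 1.
copy_probs = [0.0, 1.0, 0.0, 0.0]
uncertain = interpolate([0.25, 0.25, 0.25, 0.25], copy_probs)
confident = interpolate([0.97, 0.01, 0.01, 0.01], copy_probs)
print(uncertain, confident)
```

With this assumed rule, an uncertain model defers almost entirely to the context, while a confident model keeps most of its own distribution.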
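[NLP-15] The World According to LLMs: How Geographic Origin Influences LLMs' Entity Deduction Capabilities
[Quick read]: This paper addresses implicit geographic and cultural bias in large language models (LLMs), which is often hard to uncover with conventional human-crafted questions. The core problem is that, although LLMs have been extensively tuned against explicit bias, during reasoning they may still favor entities from the Global North or Global West because of structural imbalances in their pre-training data. The key to the solution is a creative, free-form evaluation framework, the multi-turn deduction game 20 Questions, together with a new cross-cultural, cross-regional dataset, Geo20Q+, for examining how models perform when they proactively ask questions and reason. The method exposes implicit bias that stays hidden under standard prompting setups, revealing substantially worse entity-deduction performance for the Global South and Global East. The gap cannot be fully explained by Wikipedia pageviews or corpus frequency, which underscores the value of analyzing multi-turn interactive behavior for identifying subtle bias.
Link: https://arxiv.org/abs/2508.05525
Authors: Harsh Nishant Lalai,Raj Sanjay Shah,Jiaxin Pei,Sashank Varma,Yi-Chia Wang,Ali Emami
Affiliations: BITS, Pilani; Georgia Institute of Technology; Stanford University; Emory University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Conference on Language Modeling 2025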
Abstract:Large Language Models (LLMs) have been extensively tuned to mitigate explicit biases, yet they often exhibit subtle implicit biases rooted in their pre-training data. Rather than directly probing LLMs with human-crafted questions that may trigger guardrails, we propose studying how models behave when they proactively ask questions themselves. The 20 Questions game, a multi-turn deduction task, serves as an ideal testbed for this purpose. We systematically evaluate geographic performance disparities in entity deduction using a new dataset, Geo20Q+, consisting of both notable people and culturally significant objects (e.g., foods, landmarks, animals) from diverse regions. We test popular LLMs across two gameplay configurations (canonical 20-question and unlimited turns) and in seven languages (English, Hindi, Mandarin, Japanese, French, Spanish, and Turkish). Our results reveal geographic disparities: LLMs are substantially more successful at deducing entities from the Global North than the Global South, and the Global West than the Global East. While Wikipedia pageviews and pre-training corpus frequency correlate mildly with performance, they fail to fully explain these disparities. Notably, the language in which the game is played has minimal impact on performance gaps. These findings demonstrate the value of creative, free-form evaluation frameworks for uncovering subtle biases in LLMs that remain hidden in standard prompting setups. By analyzing how models initiate and pursue reasoning goals over multiple turns, we find geographic and cultural disparities embedded in their reasoning processes. We release the dataset (Geo20Q+) and code at this https URL.
[NLP-16] LAG: Logic-Augmented Generation from a Cartesian Perspective
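[Quick read]: This paper addresses the tendency of large language models (LLMs) to hallucinate on knowledge-intensive tasks, especially where complex reasoning is required and conventional retrieval-augmented generation (RAG) underperforms because it relies on direct semantic retrieval and lacks structured logical organization. The key to the solution is the Logic-Augmented Generation (LAG) paradigm: systematic question decomposition, inspired by Cartesian method, breaks a complex question into atomic sub-questions ordered by logical dependency; answers to earlier sub-questions guide context retrieval for later ones, keeping each step of the reasoning chain grounded; a logical termination mechanism prevents error propagation and reduces wasted computation; finally, all sub-answers are synthesized into a verified response, improving reasoning robustness and alignment with human cognition.
Link: https://arxiv.org/abs/2508.05509
Authors: Yilin Xiao,Chuang Zhou,Qinggang Zhang,Su Dong,Shengyuan Chen,Xiao Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: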
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet exhibit critical limitations in knowledge-intensive tasks, often generating hallucinations when faced with questions requiring specialized expertise. While retrieval-augmented generation (RAG) mitigates this by integrating external knowledge, it struggles with complex reasoning scenarios due to its reliance on direct semantic retrieval and lack of structured logical organization. Inspired by Cartesian principles from Discours de la méthode, this paper introduces Logic-Augmented Generation (LAG), a novel paradigm that reframes knowledge augmentation through systematic question decomposition and dependency-aware reasoning. Specifically, LAG first decomposes complex questions into atomic sub-questions ordered by logical dependencies. It then resolves these sequentially, using prior answers to guide context retrieval for subsequent sub-questions, ensuring stepwise grounding in the logical chain. To prevent error propagation, LAG incorporates a logical termination mechanism that halts inference upon encountering unanswerable sub-questions and reduces wasted computation on excessive reasoning. Finally, it synthesizes all sub-resolutions to generate verified responses. Experiments on four benchmark datasets demonstrate that LAG significantly enhances reasoning robustness, reduces hallucination, and aligns LLM problem-solving with human cognition, offering a principled alternative to existing RAG systems.
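The control flow described in the LAG abstract (order sub-questions by dependency, answer them in sequence, feed earlier answers into later ones, stop at the first unanswerable step) can be sketched in Python as follows. This is only an illustrative skeleton under stated assumptions: the decomposition is written by hand, the lookup table `KB` stands in for retrieval plus an LLM, and the names `resolve` and `toy_answer` are hypothetical rather than taken from the paper. It needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

def resolve(sub_questions, dependencies, answer_fn):
    """Answer sub-questions in dependency order; halt at the first gap.

    sub_questions: id -> question template (may reference earlier ids)
    dependencies:  id -> set of ids that must be answered first
    answer_fn:     (template, prior_answers) -> answer string or None
    """
    answers = {}
    for qid in TopologicalSorter(dependencies).static_order():
        prior = {dep: answers[dep] for dep in dependencies.get(qid, ())}
        result = answer_fn(sub_questions[qid], prior)
        if result is None:
            # Assumed form of "logical termination": do not build on a gap.
            return answers, False
        answers[qid] = result
    return answers, True

# Hand-written toy decomposition and a lookup table standing in for retrieval.
KB = {
    "Who wrote Discours de la méthode?": "René Descartes",
    "In which country was René Descartes born?": "France",
}

def toy_answer(template, prior):
    # Earlier answers are substituted into the later question before lookup.
    return KB.get(template.format(**prior))

sub_questions = {
    "q1": "Who wrote Discours de la méthode?",
    "q2": "In which country was {q1} born?",
}
dependencies = {"q1": set(), "q2": {"q1"}}
answers, complete = resolve(sub_questions, dependencies, toy_answer)
print(answers, complete)
```

Returning the partial answers together with a completion flag lets a caller decide whether to synthesize a response or report that the chain broke.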
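[NLP-17] MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs
[Quick read]: This paper addresses the marked performance drop of multimodal large language models (MLLMs) in low-resource language settings. Existing methods usually focus on the text modality alone or rely on machine translation, so models can only produce “thin descriptions” and neglect multimodal informativeness and its connection to cultural context, and therefore cannot serve low-resource language users effectively. The key to the solution is a dual-source strategy targeting two core objectives, linguistic capability and cultural groundedness: native web alt-text is collected to strengthen cultural awareness, while MLLM-generated image captions are used to improve linguistic capability. Fine-tuning on the MELLA dataset built with this strategy improves model performance across low-resource languages and yields “thick descriptions” with greater contextual depth, confirming the benefit of enhancing both cultural knowledge and linguistic capability.
Link: https://arxiv.org/abs/2508.05502
Authors: Yufei Gao,Jiaying Fei,Nuo Chen,Ruirui Chen,Guohang Yan,Yunshi Lan,Botian Shi
Affiliations: Shanghai Artificial Intelligence Laboratory; East China Normal University; The Chinese University of Hong Kong, Shenzhen; Institute of High Performance Computing, A*STAR
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: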
Abstract:Multimodal Large Language Models (MLLMs) have shown remarkable performance in high-resource languages. However, their effectiveness diminishes significantly in the contexts of low-resource languages. Current multilingual enhancement methods are often limited to text modality or rely solely on machine translation. While such approaches help models acquire basic linguistic capabilities and produce “thin descriptions”, they neglect the importance of multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively. To bridge this gap, in this study, we identify two significant objectives for a truly effective MLLM in low-resource language settings, namely 1) linguistic capability and 2) cultural groundedness, placing special emphasis on cultural awareness. To achieve these dual objectives, we propose a dual-source strategy that guides the collection of data tailored to each goal, sourcing native web alt-text for culture and MLLM-generated captions for linguistics. As a concrete implementation, we introduce MELLA, a multimodal, multilingual dataset. Experiment results show that after fine-tuning on MELLA, there is a general performance improvement for the eight languages on various MLLM backbones, with models producing “thick descriptions”. We verify that the performance gains are from both cultural knowledge enhancement and linguistic capability enhancement. Our dataset can be found at this https URL.
[NLP-18] Can Large Language Models Generate Effective Datasets for Emotion Recognition in Conversations?
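[Quick read]: This paper addresses data scarcity in emotion recognition in conversations (ERC), the source bias of existing datasets, and the strong subjectivity of soft labels, while exploring the still-limited use of large language models (LLMs) for ERC data generation. The key to the solution is using a small, resource-efficient, general-purpose LLM to synthesize ERC datasets with diverse properties that supplement the three most widely used ERC benchmarks. Six new datasets are generated, two tailored to each benchmark, and are used both to supplement existing data for classification and to analyze the effects of label imbalance. Experiments show that ERC models trained on the generated data are robust and achieve consistent, statistically significant improvements on the existing benchmarks.
Link: https://arxiv.org/abs/2508.05474
Authors: Burak Can Kaplan,Hugo Cesar De Castro Carneiro,Stefan Wermter
Affiliations: University of Hamburg
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 4 figures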
Abstract:Emotion recognition in conversations (ERC) focuses on identifying emotion shifts within interactions, representing a significant step toward advancing machine intelligence. However, ERC data remains scarce, and existing datasets face numerous challenges due to their highly biased sources and the inherent subjectivity of soft labels. Even though Large Language Models (LLMs) have demonstrated their quality in many affective tasks, they are typically expensive to train, and their application to ERC tasks–particularly in data generation–remains limited. To address these challenges, we employ a small, resource-efficient, and general-purpose LLM to synthesize ERC datasets with diverse properties, supplementing the three most widely used ERC benchmarks. We generate six novel datasets, with two tailored to enhance each benchmark. We evaluate the utility of these datasets to (1) supplement existing datasets for ERC classification, and (2) analyze the effects of label imbalance in ERC. Our experimental results indicate that ERC classifier models trained on the generated datasets exhibit strong robustness and consistently achieve statistically significant performance improvements on existing ERC benchmarks.
[NLP-19] Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations
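[Quick read]: This paper tackles the lack of consistency and comparability among the metrics currently used to evaluate the creativity of generative AI, across domains such as creative writing, unconventional problem-solving, and research ideation. The key to its approach is a systematic comparison of representative creativity measures, namely the creativity index, perplexity, syntactic templates, and LLM-as-a-Judge, showing that each captures a different dimension of creativity and has its own limitations, and thereby underscoring the need for more robust, generalizable evaluation frameworks that align better with human judgments of creativity.
Link: https://arxiv.org/abs/2508.05470
Authors: Li-Chun Lu,Miri Liu,Pin-Chun Lu,Yufei Tian,Shao-Hua Sun,Nanyun Peng
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 6 figures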
Abstract:We systematically examine, analyze, and compare representative creativity measures–creativity index, perplexity, syntactic templates, and LLM-as-a-Judge–across diverse creative domains, including creative writing, unconventional problem-solving, and research ideation. Our analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity. We highlight key limitations, including the creativity index’s focus on lexical diversity, perplexity’s sensitivity to model confidence, and syntactic templates’ inability to capture conceptual creativity. Additionally, LLM-as-a-Judge shows instability and bias. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.
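[NLP-20] TASE: Token Awareness and Structured Evaluation for Multilingual Language Models
[Quick read]: This paper addresses the weakness of current large language models (LLMs) in fine-grained token-level understanding and structural reasoning, which limits them on tasks that require precise control and structural awareness. The key to the solution is the TASE benchmark, which covers 10 tasks in two categories, token awareness and structural understanding, across Chinese, English, and Korean, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. By systematically evaluating more than 30 leading commercial and open-source models (e.g., O3, Claude 4, Gemini 2.5 Pro, DeepSeek-R1) and training a Qwen2.5-14B model with GRPO, the study shows that human performance still significantly exceeds that of existing LLMs, identifies token-level reasoning as a key weakness, and provides a diagnostic tool and direction for improving low-level language understanding and cross-lingual generalization.
Link: https://arxiv.org/abs/2508.05468
Authors: Chenzhuo Zhao,Xinda Wang,Yue Huang,Junting Lu,Ziqian Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: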
Abstract:While large language models (LLMs) have demonstrated remarkable performance on high-level semantic tasks, they often struggle with fine-grained, token-level understanding and structural reasoning–capabilities that are essential for applications requiring precision and control. We introduce TASE, a comprehensive benchmark designed to evaluate LLMs’ ability to perceive and reason about token-level information across languages. TASE covers 10 tasks under two core categories: token awareness and structural understanding, spanning Chinese, English, and Korean, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Tasks include character counting, token alignment, syntactic structure parsing, and length constraint satisfaction. We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1, and train a custom Qwen2.5-14B model using the GRPO training method. Results show that human performance significantly outpaces current LLMs, revealing persistent weaknesses in token-level reasoning. TASE sheds light on these limitations and provides a new diagnostic lens for future improvements in low-level language understanding and cross-lingual generalization. Our code and dataset are publicly available at this https URL.
[NLP-21] Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
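[Quick read]: This paper addresses the “benchmark-regulation gap” between current evaluation of General Purpose AI (GPAI) and emerging regulatory requirements: existing benchmarks do not adequately cover the systemic risks targeted by the EU AI Act. The key to the solution is the Bench-2-CoP framework, which uses validated LLM-as-judge analysis to systematically map 194,955 questions from widely used benchmarks onto the EU AI Act's taxonomy of capabilities and propensities, thereby quantifying coverage. The study finds that evaluation is concentrated on behavioral propensities such as “Tendency to hallucinate” and “Discriminatory bias”, whereas functional capabilities tied to loss-of-control scenarios (evading human oversight, self-replication, and autonomous AI development) receive no coverage at all. Systemic risks such as “Loss of Control” and “Cyber Offence” are left with under 1% coverage, which shows that evaluation tools need to be rebuilt around regulatory priorities.
Link: https://arxiv.org/abs/2508.05464
Authors: Matteo Prandi,Vincenzo Suriani,Federico Pierucci,Marcello Galisai,Daniele Nardi,Piercosma Bisconti
Affiliations: Sapienza University of Rome; DEXAI-Artificial Ethics; Sant’Anna School of Advanced Studies
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: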
Abstract:The rapid advancement of General Purpose AI (GPAI) models necessitates robust evaluation frameworks, especially with emerging regulations like the EU AI Act and its associated Code of Practice (CoP). Current AI evaluation practices depend heavily on established benchmarks, but these tools were not designed to measure the systemic risks that are the focus of the new regulatory landscape. This research addresses the urgent need to quantify this “benchmark-regulation gap.” We introduce Bench-2-CoP, a novel, systematic framework that uses validated LLM-as-judge analysis to map the coverage of 194,955 questions from widely-used benchmarks against the EU AI Act’s taxonomy of model capabilities and propensities. Our findings reveal a profound misalignment: the evaluation ecosystem is overwhelmingly focused on a narrow set of behavioral propensities, such as “Tendency to hallucinate” (53.7% of the corpus) and “Discriminatory bias” (28.9%), while critical functional capabilities are dangerously neglected. Crucially, capabilities central to loss-of-control scenarios, including evading human oversight, self-replication, and autonomous AI development, receive zero coverage in the entire benchmark corpus. This translates to a near-total evaluation gap for systemic risks like “Loss of Control” (0.4% coverage) and “Cyber Offence” (0.8% coverage). This study provides the first comprehensive, quantitative analysis of this gap, offering critical insights for policymakers to refine the CoP and for developers to build the next generation of evaluation tools, ultimately fostering safer and more compliant AI.
[NLP-22] LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
[Quick read]: This paper addresses data contamination and leaderboard overfitting in static benchmarks for large language models (LLMs), which obscure true model capabilities. The key to the solution is LLMEval-3, a dynamic evaluation framework built on a proprietary bank of 220k graduate-level questions from which an unseen test set is dynamically sampled for each evaluation run. Integrity is ensured by contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process (90% agreement with human experts), complemented by a relative ranking system for fair comparison, which improves the stability and credibility of evaluation results.
Link: https://arxiv.org/abs/2508.05452
Authors: Ming Zhang,Yujiong Shen,Jingyi Deng,Yuhui Wang,Yue Zhang,Junzhe Wang,Shichun Liu,Shihan Dou,Huayu Sha,Qiyuan Peng,Changhao Jiang,Jingqi Tong,Yilong Wu,Zhihao Zhang,Mingqi Wu,Zhiheng Xi,Mingxu Chai,Tao Liang,Zhihui Fei,Zhen Wang,Mingyang Wan,Guojun Ma,Tao Gui,Qi Zhang,Xuanjing Huang
Affiliations: Tsinghua University; Alibaba Group
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
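One simple way to picture "dynamically sampling unseen test sets for each evaluation run" is a question bank that retires every question once it has been served. The Python sketch below illustrates that reading only; the class name `QuestionBank`, the retire-after-use policy, and the seeded shuffle are assumptions, and the abstract does not describe how LLMEval-3 actually samples.

```python
import random

class QuestionBank:
    """Toy question bank that never serves the same question twice."""

    def __init__(self, question_ids, seed=0):
        self._unseen = list(question_ids)
        self._rng = random.Random(seed)  # seeded so runs are reproducible

    def draw_test_set(self, size):
        """Return `size` questions that no earlier run has seen, then retire them."""
        if size > len(self._unseen):
            raise ValueError("not enough unseen questions left in the bank")
        self._rng.shuffle(self._unseen)
        batch, self._unseen = self._unseen[:size], self._unseen[size:]
        return batch

    def remaining(self):
        return len(self._unseen)

bank = QuestionBank(range(100))
run_a = bank.draw_test_set(10)
run_b = bank.draw_test_set(10)
print(run_a, run_b, bank.remaining())
```

Because served questions are removed from the pool, two runs can never overlap, at the cost of the bank eventually running out.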
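[NLP-23] MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints
[Quick read]: This paper addresses cultural bias in large language models (LLMs), in particular inaccurate cultural representation and evaluation in low-resource language settings. The core challenge is that training data comes mainly from high-resource languages such as English and Chinese, leaving models with a weak grasp of less-represented cultures. The key to the solution is the MyCulture benchmark, which targets Malaysian culture across six pillars (arts, attire, customs, entertainment, food, and religion) and uses a novel open-ended multiple-choice question format that avoids the guessing and format bias introduced by predefined options, improving fairness and discriminative power. The study also analyzes structural bias by comparing structured and free-form outputs, and probes language bias through multilingual prompt variations, revealing significant disparities in cultural comprehension across LLMs and highlighting the urgent need for culturally grounded, linguistically inclusive benchmarks.
Link: https://arxiv.org/abs/2508.05429
Authors: Zhong Ken Hew,Jia Xin Low,Sze Jue Yang,Chee Seng Chan
Affiliations: Universiti Malaya
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: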
Abstract:Large Language Models (LLMs) often exhibit cultural biases due to training data dominated by high-resource languages like English and Chinese. This poses challenges for accurately representing and evaluating diverse cultural contexts, particularly in low-resource language settings. To address this, we introduce MyCulture, a benchmark designed to comprehensively evaluate LLMs on Malaysian culture across six pillars: arts, attire, customs, entertainment, food, and religion presented in Bahasa Melayu. Unlike conventional benchmarks, MyCulture employs a novel open-ended multiple-choice question format without predefined options, thereby reducing guessing and mitigating format bias. We provide a theoretical justification for the effectiveness of this open-ended structure in improving both fairness and discriminative power. Furthermore, we analyze structural bias by comparing model performance on structured versus free-form outputs, and assess language bias through multilingual prompt variations. Our evaluation across a range of regional and international LLMs reveals significant disparities in cultural comprehension, highlighting the urgent need for culturally grounded and linguistically inclusive benchmarks in the development and assessment of LLMs.
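[NLP-24] The TUB Sign Language Corpus Collection
[Quick read]: This paper addresses the scarcity of multilingual video data, specifically the lack of large-scale parallel corpora between sign languages and the dominant spoken languages. The key to the solution is a collection of video parallel corpora for 12 sign languages, with more than 1,300 hours of video and subtitles containing 14 million tokens. It provides the first consistent parallel data for 8 Latin American sign languages, and its German Sign Language corpora are ten times the size of those previously available. The collection was assembled from online sources such as news shows, governmental bodies, and educational channels, through stages that included data collection, informing content creators, seeking usage approvals, scraping, and cropping, and offers a data foundation for sign language research, natural language processing, and generative AI applications.
Link: https://arxiv.org/abs/2508.05374
Authors: Eleftherios Avramidis,Vera Czehmann,Fabian Deckert,Lorenz Hufe,Aljoscha Lipski,Yuni Amaloa Quintero Villalobos,Tae Kwon Rhee,Mengqian Shi,Lennart Stölting,Fabrizio Nunnari,Sebastian Möller
Affiliations: German Research Center for AI (DFKI) Speech and Language Technology; Technische Universität Berlin; Fraunhofer HHI; Bliss e.V.; Google; German Research Center for AI (DFKI) Cognitive Assistants
Categories: Computation and Language (cs.CL)
Comments: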
Abstract:We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours in 4,381 video files, accompanied by 1.3M subtitles containing 14M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, whereas the size of the German Sign Language corpora is ten times the size of the previously available corpora. The collection was created by collecting and processing videos of multiple sign languages from various online sources, mainly broadcast material of news shows, governmental bodies and educational channels. The preparation involved several stages, including data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data.
[NLP-25] Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025
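[Quick read]: This paper examines how agentic Retrieval Augmented Generation (RAG) and “deep research” systems driven by generative AI can be applied effectively to autonomous search in specialized domains such as biomedicine. The core challenge is that automated systems may reduce user involvement and fall short of expert information needs, while professional search demands high transparency and accuracy. The authors study a self-feedback mechanism in which large language models (LLMs) iteratively generate, evaluate, and refine their own outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). The aim is to test whether this self-correction improves performance and to compare reasoning and non-reasoning models in their ability to generate useful feedback.
Link: https://arxiv.org/abs/2508.05366
Authors: Samy Ateia,Udo Kruschwitz
Affiliations: Information Science, University of Regensburg
Categories: Computation and Language (cs.CL)
Comments: Version as accepted at the BioASQ Lab at CLEF 2025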
Abstract:Agentic Retrieval Augmented Generation (RAG) and ‘deep research’ systems aim to enable autonomous search processes where Large Language Models (LLMs) iteratively refine outputs. However, applying these systems to domain-specific professional search, such as biomedical research, presents challenges, as automated systems may reduce user involvement and misalign with expert information needs. Professional search tasks often demand high levels of user expertise and transparency. The BioASQ CLEF 2025 challenge, using expert-formulated questions, can serve as a platform to study these issues. We explored the performance of current reasoning and nonreasoning LLMs like Gemini-Flash 2.0, o3-mini, o4-mini and DeepSeek-R1. A key aspect of our methodology was a self-feedback mechanism where LLMs generated, evaluated, and then refined their outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). We investigated whether this iterative self-correction improves performance and if reasoning models are more capable of generating useful feedback. Preliminary results indicate varied performance for the self-feedback strategy across models and tasks. This work offers insights into LLM self-correction and informs future work on comparing the effectiveness of LLM-generated feedback with direct human expert input in these search systems.
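[NLP-26] Evaluation of a Sign Language Avatar on Comprehensibility, User Experience & Acceptability
[Quick read]: This paper addresses the poor comprehensibility and user experience (UX) of an existing sign language (SL) avatar on an augmented-reality device (Microsoft Hololens 2), particularly given missing non-manual features (mouthings and facial expressions) and implementation flaws (indistinct hand shapes, lack of feedback, and poor menu positioning). The study finds that although users prefer adjustable settings, adjustability by itself did not significantly improve UX or comprehensibility, and stress levels were higher with the adjustable avatar. The key recommendations are therefore to make the avatar comprehensible by default with high-quality animation (especially mouthings and facial expressions), to improve the interaction interface, and to apply participatory design, rather than relying on personalization alone.
Link: https://arxiv.org/abs/2508.05358
Authors: Fenya Wasserroth,Eleftherios Avramidis,Vera Czehmann,Tanja Kojic,Fabrizio Nunnari,Sebastian Möller
Affiliations: Technische Universität Berlin; German Research Center for AI
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: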
Abstract:This paper presents an investigation into the impact of adding adjustment features to an existing sign language (SL) avatar on a Microsoft Hololens 2 device. Through a detailed analysis of interactions of expert German Sign Language (DGS) users with both adjustable and non-adjustable avatars in a specific use case, this study identifies the key factors influencing the comprehensibility, the user experience (UX), and the acceptability of such a system. Despite user preference for adjustable settings, no significant improvements in UX or comprehensibility were observed, which remained at low levels, amid missing SL elements (mouthings and facial expressions) and implementation issues (indistinct hand shapes, lack of feedback and menu positioning). Hedonic quality was rated higher than pragmatic quality, indicating that users found the system more emotionally or aesthetically pleasing than functionally useful. Stress levels were higher for the adjustable avatar, reflecting lower performance, greater effort and more frustration. Additionally, concerns were raised about whether the Hololens adjustment gestures are intuitive and easy to familiarise oneself with. While acceptability of the concept of adjustability was generally positive, it was strongly dependent on usability and animation quality. This study highlights that personalisation alone is insufficient, and that SL avatars must be comprehensible by default. Key recommendations include enhancing mouthing and facial animation, improving interaction interfaces, and applying participatory design.
[NLP-27] Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression
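[Quick read]: This paper addresses the “overthinking” problem in Large Reasoning Language Models (LRLMs): when producing long chains of thought, models frequently trigger reflection behaviors (signaled by words such as “Wait” and “Alternatively”), generating redundant reasoning steps that increase token usage and inference cost and reduce practical utility. The key to the solution is Certainty-Guided Reflection Suppression (CGRS), a model-agnostic method that requires no retraining or architectural changes: it dynamically suppresses the generation of reflection triggers when the model is highly confident in its current response, avoiding unnecessary reflection cycles while preserving accuracy. Across four reasoning benchmarks, CGRS reduces token usage by an average of 18.5% to 41.9% and achieves a better balance between length reduction and performance than the baselines.
Link: https://arxiv.org/abs/2508.05337
Authors: Jiameng Huang,Baijiong Lin,Guhao Feng,Jierun Chen,Di He,Lu Hou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Technical Report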
Abstract:Recent Large Reasoning Language Models (LRLMs) employ long chain-of-thought reasoning with complex reflection behaviors, typically signaled by specific trigger words (e.g., “Wait” and “Alternatively”) to enhance performance. However, these reflection behaviors can lead to the overthinking problem, in which the model generates redundant reasoning steps that unnecessarily increase token usage, raise inference costs, and reduce practical utility. In this paper, we propose Certainty-Guided Reflection Suppression (CGRS), a novel method that mitigates overthinking in LRLMs while maintaining reasoning accuracy. CGRS operates by dynamically suppressing the model’s generation of reflection triggers when it exhibits high confidence in its current response, thereby preventing redundant reflection cycles without compromising output quality. Our approach is model-agnostic, requires no retraining or architectural modifications, and can be integrated seamlessly with existing autoregressive generation pipelines. Extensive experiments across four reasoning benchmarks (i.e., AIME24, AMC23, MATH500, and GPQA-D) demonstrate CGRS’s effectiveness: it reduces token usage by an average of 18.5% to 41.9% while preserving accuracy. It also achieves the optimal balance between length reduction and performance compared to state-of-the-art baselines. These results hold consistently across model architectures (e.g., DeepSeek-R1-Distill series, QwQ-32B, and Qwen3 family) and scales (4B to 32B parameters), highlighting CGRS’s practical value for efficient reasoning.
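The suppression step in the CGRS abstract can be pictured as masking the logits of reflection-trigger tokens whenever a certainty score is high. The Python sketch below shows that idea on a toy next-token table. How certainty is computed, the threshold of 0.9, and the function names are assumptions for illustration; only the two trigger words come from the abstract.

```python
import math

# Trigger words named in the abstract; a real list would likely be longer.
REFLECTION_TRIGGERS = {"Wait", "Alternatively"}

def suppress_reflection(logits, certainty, threshold=0.9):
    """Return a copy of `logits` with trigger tokens masked when certainty is high.

    logits:    token string -> raw score for the next position
    certainty: assumed score in [0, 1] for the model's current answer
    threshold: hypothetical cut-off, not a value taken from the paper
    """
    if certainty < threshold:
        return dict(logits)  # low certainty: leave reflection available
    return {
        token: (-math.inf if token in REFLECTION_TRIGGERS else score)
        for token, score in logits.items()
    }

def greedy_pick(logits):
    """Pick the highest-scoring token."""
    return max(logits, key=logits.get)

next_token_logits = {"Wait": 2.0, "So": 1.5, "Alternatively": 0.3}
print(greedy_pick(suppress_reflection(next_token_logits, certainty=0.95)))
print(greedy_pick(suppress_reflection(next_token_logits, certainty=0.50)))
```

Masking at decoding time needs no retraining, which matches the abstract's claim that the method plugs into existing autoregressive generation pipelines.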
[NLP-28] A Novel Architecture for Symbolic Reasoning with Decision Trees and LLM Agents
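[Quick read]: This paper addresses the limited interpretability, weak logical consistency, and underuse of structured knowledge of current large language models (LLMs) on complex reasoning tasks. The core challenge is how to combine symbolic reasoning with neural generative capability to obtain a general-purpose reasoning system that is both logically rigorous and capable of generalization and interactive planning. The key to the solution is a coordinated multi-agent architecture that embeds decision trees and random forests as callable symbolic oracles within a unified reasoning framework, with a central orchestrator that maintains belief-state consistency and mediates communication among agents and external tools, enabling joint reasoning over structured and unstructured inputs. The abstract reports gains in consistency and accuracy on ProofWriter, GSM8k, and ARC, and describes applications in clinical decision support and scientific discovery.
Link: https://arxiv.org/abs/2508.05311
Authors: Andrew Kiruluta
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: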
Abstract:We propose a hybrid architecture that integrates decision tree-based symbolic reasoning with the generative capabilities of large language models (LLMs) within a coordinated multi-agent framework. Unlike prior approaches that loosely couple symbolic and neural modules, our design embeds decision trees and random forests as callable oracles within a unified reasoning system. Tree-based modules enable interpretable rule inference and causal logic, while LLM agents handle abductive reasoning, generalization, and interactive planning. A central orchestrator maintains belief state consistency and mediates communication across agents and external tools, enabling reasoning over both structured and unstructured inputs. The system achieves strong performance on reasoning benchmarks. On ProofWriter, it improves entailment consistency by +7.2% through logic-grounded tree validation. On GSM8k, it achieves +5.3% accuracy gains in multistep mathematical problems via symbolic augmentation. On ARC, it boosts abstraction accuracy by +6.0% through integration of symbolic oracles. Applications in clinical decision support and scientific discovery show how the system encodes domain rules symbolically while leveraging LLMs for contextual inference and hypothesis generation. This architecture offers a robust, interpretable, and extensible solution for general-purpose neuro-symbolic reasoning.
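[NLP-29] SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens
[Quick read]: This paper addresses the fact that the Large Concept Model (LCM) is trained with mean-squared error or diffusion objectives and relies on a diffusion sampler, so its training signal is not likelihood-based, which limits how well it models the true data distribution. The key to the solution is SONAR-LLM, a decoder-only Transformer that reasons in the continuous SONAR embedding space but is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while restoring a likelihood-based training signal and eliminating the diffusion sampler.
Link: https://arxiv.org/abs/2508.05305
Authors: Nikita Dragunov,Temurbek Rahmatullaev,Elizaveta Goncharova,Andrey Kuznetsov,Anton Razzhigaev
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: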
Abstract:The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that “thinks” in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.
[NLP-30] Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue
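[Quick Read]: This paper addresses two problems in meta-reviewing: data scarcity makes it hard to train dialogue agents, and existing general-purpose Large Language Models (LLMs) fall short in this setting. The core solution has three parts. First, an LLM with a self-refinement strategy generates high-quality synthetic data, improving the relevance of the dialogues to expert domains. Second, this synthetic data is used to train dialogue agents dedicated to meta-reviewing, which experiments show outperform off-the-shelf LLM assistants. Finally, the agents are validated in real meta-reviewing scenarios, where they markedly improve reviewing efficiency.
Link: https://arxiv.org/abs/2508.05283
Authors: Sukannya Purkayastha,Nils Dycke,Anne Lauscher,Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab; Technical University of Darmstadt; Hessian Center for AI; University of Hamburg
Categories: Computation and Language (cs.CL)
Comments: 36 pages, 16 tables, 13 figures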
Abstract:Meta-reviewing is a pivotal stage in the peer-review process, serving as the final step in determining whether a paper is recommended for acceptance. Prior research on meta-reviewing has treated this as a summarization problem over review reports. However, complementary to this perspective, meta-reviewing is a decision-making process that requires weighing reviewer arguments and placing them within a broader context. Prior research has demonstrated that decision-makers can be effectively assisted in such scenarios via dialogue agents. In line with this framing, we explore the practical challenges for realizing dialog agents that can effectively assist meta-reviewers. Concretely, we first address the issue of data scarcity for training dialogue agents by generating synthetic data using Large Language Models (LLMs) based on a self-refinement strategy to improve the relevance of these dialogues to expert domains. Our experiments demonstrate that this method produces higher-quality synthetic data and can serve as a valuable resource towards training meta-reviewing assistants. Subsequently, we utilize this data to train dialogue agents tailored for meta-reviewing and find that these agents outperform off-the-shelf LLM-based assistants for this task. Finally, we apply our agents in real-world meta-reviewing scenarios and confirm their effectiveness in enhancing the efficiency of meta-reviewing. Code and Data: this https URL
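[NLP-31] ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs
[Quick Read]: This paper addresses the reliability problem caused by error propagation in Chain-of-Thought (CoT) reasoning, in particular the phenomenon of "Late-Stage Fragility": errors in later reasoning steps are more likely to corrupt the final answer than errors in earlier ones. The key to the solution is Adaptive Self-Correction Chain-of-Thought (ASCoT), which has two core modules. An Adaptive Verification Manager (AVM) first uses a Positional Impact Score function $I(k)$ to identify high-risk late-stage reasoning steps; a Multi-Perspective Self-Correction Engine (MSCE) then applies targeted dual-path correction to those critical steps. The result is a shift from uniform verification to adaptive, vulnerability-aware correction.
Link: https://arxiv.org/abs/2508.05282
Authors: Dongxu Zhang,Ning Yang,Jihua Zhu,Jinnan Yang,Miao Xin,Baoliang Tian
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: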
Abstract:Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Large Language Models (LLMs), yet the reliability of these reasoning chains remains a critical challenge. A widely held “cascading failure” hypothesis suggests that errors are most detrimental when they occur early in the reasoning process. This paper challenges that assumption through systematic error-injection experiments, revealing a counter-intuitive phenomenon we term “Late-Stage Fragility”: errors introduced in the later stages of a CoT chain are significantly more likely to corrupt the final answer than identical errors made at the beginning. To address this specific vulnerability, we introduce the Adaptive Self-Correction Chain-of-Thought (ASCoT) method. ASCoT employs a modular pipeline in which an Adaptive Verification Manager (AVM) operates first, followed by the Multi-Perspective Self-Correction Engine (MSCE). The AVM leverages a Positional Impact Score function I(k) that assigns different weights based on the position within the reasoning chains, addressing the Late-Stage Fragility issue by identifying and prioritizing high-risk, late-stage steps. Once these critical steps are identified, the MSCE applies robust, dual-path correction specifically to the failure parts. Extensive experiments on benchmarks such as GSM8K and MATH demonstrate that ASCoT achieves outstanding accuracy, outperforming strong baselines, including standard CoT. Our work underscores the importance of diagnosing specific failure modes in LLM reasoning and advocates for a shift from uniform verification strategies to adaptive, vulnerability-aware correction mechanisms.
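[NLP-32] Understanding and Mitigating Errors of LLM-Generated RTL Code
[Quick Read]: This paper addresses the low overall success rate of Large Language Model (LLM) based Register-Transfer Level (RTL) code generation, where the core challenge is that the causes of errors are complex and lack systematic analysis. The study finds that most errors stem not from limited LLM reasoning ability but from insufficient RTL programming knowledge, poor understanding of circuit concepts, ambiguous design descriptions, or misreading of multimodal inputs. The key to the solution is a set of targeted error-correction mechanisms: in-context learning is used to build a domain-specific knowledge base, with Retrieval-Augmented Generation (RAG) supplying RTL knowledge; design description rules and a rule-checking mechanism reduce ambiguity; external tools convert multimodal inputs into LLM-compatible meta-formats to lower the risk of misinterpretation; and an iterative debugging loop (simulation, error localization, correction) handles the remaining errors. Integrated into an LLM framework, these methods reach 91.0% accuracy on the VerilogEval benchmark, a 32.7% improvement over the baseline.
Link: https://arxiv.org/abs/2508.05266
Authors: Jiazheng Zhang,Cheng Liu,Huawei Li
Affiliations: State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences; Department of Computer Science and Technology, University of Chinese Academy of Sciences
Categories: Hardware Architecture (cs.AR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 14 pages, 26 figures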
Abstract:Despite the promising potential of large language model (LLM) based register-transfer-level (RTL) code generation, the overall success rate remains unsatisfactory. Errors arise from various factors, with limited understanding of specific failure causes hindering improvement. To address this, we conduct a comprehensive error analysis and manual categorization. Our findings reveal that most errors stem not from LLM reasoning limitations, but from insufficient RTL programming knowledge, poor understanding of circuit concepts, ambiguous design descriptions, or misinterpretation of complex multimodal inputs. Leveraging in-context learning, we propose targeted error correction techniques. Specifically, we construct a domain-specific knowledge base and employ retrieval-augmented generation (RAG) to supply necessary RTL knowledge. To mitigate ambiguity errors, we introduce design description rules and implement a rule-checking mechanism. For multimodal misinterpretation, we integrate external tools to convert inputs into LLM-compatible meta-formats. For remaining errors, we adopt an iterative debugging loop (simulation-error localization-correction). We incorporate these error correction techniques into a foundational LLM-based RTL code generation framework, resulting in significantly improved performance. Experimental results show that our enhanced framework achieves 91.0% accuracy on the VerilogEval benchmark, surpassing the baseline code generation approach by 32.7%, demonstrating the effectiveness of our methods.
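[NLP-33] CodeBoost: Boosting Code LLMs by Squeezing Knowledge from Code Snippets with RL
[Quick Read]: This paper addresses a bottleneck in the post-training of Code Large Language Models (Code LLMs): the reliance on human-annotated "instruction-final answer" pairs. High-quality coding instructions are costly to collect and hard to scale, whereas code snippets are widely available. To overcome this limitation, the authors propose CodeBoost, a framework whose core innovation is to enhance the model purely from code snippets, without human-annotated instructions. Its key components are: maximum-clique curation, which builds a representative and diverse training corpus; bi-directional prediction, which lets the model learn both forward and backward generation objectives; error-aware prediction, which uses learning signals from both correct and incorrect outputs; heterogeneous augmentation, which enriches the semantic diversity of the code; and heterogeneous rewarding, which guides optimization along multiple dimensions, including format correctness and execution feedback from successes and failures. Experiments show that CodeBoost significantly improves performance across several code LLMs and benchmarks, indicating an efficient and scalable post-training approach.
Link: https://arxiv.org/abs/2508.05242
Authors: Sijie Wang,Quanjiang Guo,Kai Zhao,Yawei Zhang,Xin Li,Xiang Li,Siqi Li,Rui She,Shangshu Yu,Wee Peng Tay
Affiliations: Nanyang Technological University; University of Electronic Science and Technology of China; Beihang University; Northeastern University, China
Categories: Computation and Language (cs.CL)
Comments: Technical report. Project page: this https URL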
Abstract:Code large language models (LLMs) have become indispensable tools for building efficient and automated coding pipelines. Existing models are typically post-trained using reinforcement learning (RL) from general-purpose LLMs using “human instruction-final answer” pairs, where the instructions are usually from manual annotations. However, collecting high-quality coding instructions is both labor-intensive and difficult to scale. On the other hand, code snippets are abundantly available from various sources. This imbalance presents a major bottleneck in instruction-based post-training. We propose CodeBoost, a post-training framework that enhances code LLMs purely from code snippets, without relying on human-annotated instructions. CodeBoost introduces the following key components: (1) maximum-clique curation, which selects a representative and diverse training corpus from code; (2) bi-directional prediction, which enables the model to learn from both forward and backward prediction objectives; (3) error-aware prediction, which incorporates learning signals from both correct and incorrect outputs; (4) heterogeneous augmentation, which diversifies the training distribution to enrich code semantics; and (5) heterogeneous rewarding, which guides model learning through multiple reward types including format correctness and execution feedback from both successes and failures. Extensive experiments across several code LLMs and benchmarks verify that CodeBoost consistently improves performance, demonstrating its effectiveness as a scalable and effective training pipeline.
[NLP-34] Pruning Large Language Models by Identifying and Preserving Functional Networks
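[Quick Read]: This paper addresses a weakness of current structured pruning methods for compressing Large Language Models (LLMs): they overlook the interaction and collaboration among artificial neurons, which disrupts the model's macro functional architecture and degrades performance after pruning. The key to the solution is to borrow from the properties of functional neural networks in the human brain and treat the LLM as a "digital brain", pruning it efficiently and faithfully by identifying and preserving its functional networks. Concretely, the LLM is first decomposed into functional networks (analogous to the functional brain networks identified in neuroimaging data), and the key neurons within those networks are then preserved, so that GPU memory use falls and inference speeds up while the model's core functionality remains intact.
Link: https://arxiv.org/abs/2508.05239
Authors: Yiheng Liu,Junhao Ning,Sichen Xia,Xiaohui Gao,Ning Qiang,Bao Ge,Junwei Han,Xintao Hu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures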
Abstract:Structured pruning is one of the representative techniques for compressing large language models (LLMs) to reduce GPU memory consumption and accelerate inference speed. It offers significant practical value in improving the efficiency of LLMs in real-world applications. Current structured pruning methods typically assess the importance of structural units and prune the less important ones. Most of them overlook the interaction and collaboration among artificial neurons that are crucial for the functionalities of LLMs, leading to a disruption in the macro functional architecture of LLMs and consequently a pruning performance degradation. Inspired by the inherent similarities between artificial neural networks and functional neural networks in the human brain, we alleviate this challenge and propose to prune LLMs by identifying and preserving functional networks within LLMs in this study. To achieve this, we treat an LLM as a digital brain and decompose the LLM into functional networks, analogous to identifying functional brain networks in neuroimaging data. Afterwards, an LLM is pruned by preserving the key neurons within these functional networks. Experimental results demonstrate that the proposed method can successfully identify and locate functional networks and key neurons in LLMs, enabling efficient model pruning. Our code is available at this https URL.
[NLP-35] Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation
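[Quick Read]: This paper addresses how to generate sentiment reasoning chains and classify sentiment efficiently and interpretably in multimodal sentiment analysis under resource constraints, a task it calls Resource-Limited Joint Multimodal Sentiment Reasoning and Classification (JMSRC). Existing methods depend on parameter-heavy Multimodal Large Language Models (MLLMs) and overlook the need to deploy lightweight models on edge devices or in low-compute settings. The key to the solution is a "Teacher-Assistant-Student" distillation framework (MulCoT-RD): an MLLM generates the initial reasoning data, a medium-sized assistant model is trained on it with multi-task learning, and a lightweight student model is then jointly optimized to perform both multimodal sentiment reasoning chain generation and classification. With only 3B parameters, the model achieves strong performance, robust generalization, and enhanced interpretability.
Link: https://arxiv.org/abs/2508.05234
Authors: Haonan Shangguan,Xiaocui Yang,Shi Feng,Daling Wang,Yifei Zhang,Ge Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: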
Abstract:The surge in rich multimodal content on social media platforms has greatly advanced Multimodal Sentiment Analysis (MSA), with Large Language Models (LLMs) further accelerating progress in this field. Current approaches primarily leverage the knowledge and reasoning capabilities of parameter-heavy (Multimodal) LLMs for sentiment classification, overlooking autonomous multimodal sentiment reasoning generation in resource-constrained environments. Therefore, we focus on the Resource-Limited Joint Multimodal Sentiment Reasoning and Classification task, JMSRC, which simultaneously performs multimodal sentiment reasoning chain generation and sentiment classification only with a lightweight model. We propose a Multimodal Chain-of-Thought Reasoning Distillation model, MulCoT-RD, designed for JMSRC that employs a “Teacher-Assistant-Student” distillation paradigm to address deployment constraints in resource-limited environments. We first leverage a high-performance Multimodal Large Language Model (MLLM) to generate the initial reasoning dataset and train a medium-sized assistant model with a multi-task learning mechanism. A lightweight student model is jointly trained to perform efficient multimodal sentiment reasoning generation and classification. Extensive experiments on four datasets demonstrate that MulCoT-RD with only 3B parameters achieves strong performance on JMSRC, while exhibiting robust generalization and enhanced interpretability.
[NLP-36] FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance
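[Quick Read]: This paper addresses the reliability problems that hallucination causes when Large Language Models (LLMs) are applied in finance, especially errors in extracting and calculating numerical values from tabular data, which can seriously affect financial decision-making and compliance. The key to the solution is a rigorous and scalable evaluation framework that models hallucination as a context-aware masked span prediction task over real financial documents and uses an automated masking strategy to generate a new evaluation dataset derived from S&P 500 annual reports. This allows a systematic evaluation of the intrinsic hallucination patterns of advanced LLMs on financial tabular data, gives financial institutions a reliable method for in-house model evaluation, and supports the development of more trustworthy financial Generative AI systems.
Link: https://arxiv.org/abs/2508.05201
Authors: Mengao Zhang,Jiayu Fu,Tanya Warrier,Yuwen Wang,Tianhui Tan,Ke-wei Huang
Affiliations: Asian Institute of Digital Finance, National University of Singapore
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages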
Abstract:Hallucination remains a critical challenge for deploying Large Language Models (LLMs) in finance. Accurate extraction and precise calculation from tabular data are essential for reliable financial analysis, since even minor numerical errors can undermine decision-making and regulatory compliance. Financial applications have unique requirements, often relying on context-dependent, numerical, and proprietary tabular data that existing hallucination benchmarks rarely capture. In this study, we develop a rigorous and scalable framework for evaluating intrinsic hallucinations in financial LLMs, conceptualized as a context-aware masked span prediction task over real-world financial documents. Our main contributions are: (1) a novel, automated dataset creation paradigm using a masking strategy; (2) a new hallucination evaluation dataset derived from S&P 500 annual reports; and (3) a comprehensive evaluation of intrinsic hallucination patterns in state-of-the-art LLMs on financial tabular data. Our work provides a robust methodology for in-house LLM evaluation and serves as a critical step toward building more trustworthy and reliable financial Generative AI systems.
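To make the idea of masked span prediction over numeric content concrete, here is a minimal sketch. It is not the FAITH pipeline: the regular expression, the `[MASK]` token, the helper names (`mask_numbers`, `exact_match`), and the sample sentence and predictions are all assumptions chosen for illustration. Each number in a passage is masked in turn, and a prediction counts as correct only if it matches the hidden value after normalization.

```python
import re
from typing import List, Tuple

# Assumed pattern: integers with optional thousands separators and decimals.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def mask_numbers(text: str, mask: str = "[MASK]") -> List[Tuple[str, str]]:
    """Return one (masked_text, hidden_value) pair per numeric span."""
    examples = []
    for match in NUMBER.finditer(text):
        masked = text[:match.start()] + mask + text[match.end():]
        examples.append((masked, match.group()))
    return examples


def exact_match(prediction: str, gold: str) -> bool:
    """Compare two numbers as strings, ignoring thousands separators."""
    def normalise(value: str) -> str:
        return value.replace(",", "").strip()
    return normalise(prediction) == normalise(gold)


sentence = "Revenue was 1,250 in 2023 and 1,100 in 2022."
examples = mask_numbers(sentence)

# Hypothetical model outputs, one per masked example.
predictions = ["1250", "2023", "1,010", "2022"]
errors = sum(
    not exact_match(pred, gold)
    for pred, (_, gold) in zip(predictions, examples)
)
error_rate = errors / len(examples)
print(f"{errors} of {len(examples)} masked values predicted wrongly ({error_rate:.0%})")
```

Because every masked value is present in the source document, a wrong prediction here is an error against the given context, which is the sense of "intrinsic" hallucination used in the abstract.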
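[NLP-37] QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering
[Quick Read]: This paper addresses a limitation of existing Retrieval-Augmented Generation (RAG) methods when Multimodal Large Language Models (MLLMs) handle knowledge-intensive Visual Question Answering (VQA): conventional RAG usually retrieves from text or images in isolation and struggles with complex queries that need multi-hop reasoning or the fusion of knowledge from multiple sources. The key to the solution is QA-Dragon, a query-aware dynamic RAG system. A domain router identifies the subject domain of the query to enable domain-specific reasoning, and a search router dynamically selects the best retrieval strategy. Text and image retrieval agents are orchestrated in a hybrid architecture, supporting multimodal, multi-turn, and multi-hop reasoning and significantly improving answer accuracy and knowledge coverage on complex VQA tasks.
Link: https://arxiv.org/abs/2508.05197
Authors: Zhuohang Jiang,Pangjing Wu,Xu Yuan,Wenqi Fan,Qing Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: The source code for our system is released at this https URL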
Abstract:Retrieval-Augmented Generation (RAG) has been introduced to mitigate hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge into the generation process, and it has become a widely adopted approach for knowledge-intensive Visual Question Answering (VQA). However, existing RAG methods typically retrieve from either text or images in isolation, limiting their ability to address complex queries that require multi-hop reasoning or up-to-date factual knowledge. To address this limitation, we propose QA-Dragon, a Query-Aware Dynamic RAG System for Knowledge-Intensive VQA. Specifically, QA-Dragon introduces a domain router to identify the query’s subject domain for domain-specific reasoning, along with a search router that dynamically selects optimal retrieval strategies. By orchestrating both text and image search agents in a hybrid setup, our system supports multimodal, multi-turn, and multi-hop reasoning, enabling it to tackle complex VQA tasks effectively. We evaluate our QA-Dragon on the Meta CRAG-MM Challenge at KDD Cup 2025, where it significantly enhances the reasoning performance of base models under challenging scenarios. Our framework achieves substantial improvements in both answer accuracy and knowledge overlap scores, outperforming baselines by 5.06% on the single-source task, 6.35% on the multi-source task, and 5.03% on the multi-turn task.
[NLP-38] ATLANTIS at SemEval-2025 Task 3: Detecting Hallucinated Text Spans in Question Answering
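[Quick Read]: This paper addresses hallucination by Large Language Models (LLMs) in question answering systems, that is, generated text spans that are factually incorrect or misleading. The key to the solution is to incorporate external context and apply two strategies: few-shot prompting, and fine-tuning an LLM on synthetic data for token-level classification. Experiments show that adding relevant context effectively mitigates hallucination, and that both the fine-tuned models and carefully designed prompts perform well, achieving leading or competitive results across languages.
Link: https://arxiv.org/abs/2508.05179
Authors: Catherine Kobus,François Lancelot,Marion-Cécile Martin,Nawal Ould Amer
Affiliations: Airbus AI Research
Categories: Computation and Language (cs.CL)
Comments: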
Abstract:This paper presents the contributions of the ATLANTIS team to SemEval-2025 Task 3, focusing on detecting hallucinated text spans in question answering systems. Large Language Models (LLMs) have significantly advanced Natural Language Generation (NLG) but remain susceptible to hallucinations, generating incorrect or misleading content. To address this, we explored methods both with and without external context, utilizing few-shot prompting with an LLM, token-level classification, or an LLM fine-tuned on synthetic data. Notably, our approaches achieved top rankings in Spanish and competitive placements in English and German. This work highlights the importance of integrating relevant context to mitigate hallucinations and demonstrates the potential of fine-tuned models and prompt engineering.
[NLP-39] Posterior-GRPO: Rewarding Reasoning Processes in Code Generation
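[Quick Read]: This paper addresses the fact that current Reinforcement Learning (RL) for code generation with Large Language Models (LLMs) relies only on rewards from test-case outcomes and ignores the quality of the intermediate reasoning process. Supervising the reasoning directly is prone to reward hacking, in which the model learns to optimize the reasoning reward signal without improving the correctness of the final code. The key to the solution is a unified framework. First, the LCB-RB benchmark is built to evaluate reasoning quality. Second, an Optimized-Degraded (OD-based) reward modeling method systematically optimizes and degrades initial reasoning paths to produce high-quality preference pairs, yielding a state-of-the-art 7B-parameter reward model. Finally, Posterior-GRPO (P-GRPO) applies process rewards only to the reasoning of successful tasks, which mitigates reward hacking and aligns the model's internal reasoning with final code correctness. The approach outperforms outcome-only baselines by 4.5% across diverse code generation tasks and is comparable to GPT-4-Turbo.
Link: https://arxiv.org/abs/2508.05170
Authors: Lishui Fan,Yu Zhang,Mouxiang Chen,Zhongxin Liu
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: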
Abstract:Reinforcement learning (RL) has significantly advanced code generation for large language models (LLMs). However, current paradigms rely on outcome-based rewards from test cases, neglecting the quality of the intermediate reasoning process. While supervising the reasoning process directly is a promising direction, it is highly susceptible to reward hacking, where the policy model learns to exploit the reasoning reward signal without improving final outcomes. To address this, we introduce a unified framework that can effectively incorporate the quality of the reasoning process during RL. First, to enable reasoning evaluation, we develop LCB-RB, a benchmark comprising preference pairs of superior and inferior reasoning processes. Second, to accurately score reasoning quality, we introduce an Optimized-Degraded based (OD-based) method for reward model training. This method generates high-quality preference pairs by systematically optimizing and degrading initial reasoning paths along curated dimensions of reasoning quality, such as factual accuracy, logical rigor, and coherence. A 7B parameter reward model with this method achieves state-of-the-art (SOTA) performance on LCB-RB and generalizes well to other benchmarks. Finally, we introduce Posterior-GRPO (P-GRPO), a novel RL method that conditions process-based rewards on task success. By selectively applying rewards to the reasoning processes of only successful outcomes, P-GRPO effectively mitigates reward hacking and aligns the model’s internal reasoning with final code correctness. A 7B parameter model with P-GRPO achieves superior performance across diverse code generation tasks, outperforming outcome-only baselines by 4.5%, achieving comparable performance to GPT-4-Turbo. We further demonstrate the generalizability of our approach by extending it to mathematical tasks. Our models, dataset, and code are publicly available.
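The idea of conditioning process rewards on task success can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the additive combination of outcome and reasoning rewards, the 0/1 outcome reward, the use of the population standard deviation, and the names `posterior_rewards` and `group_relative_advantages` are all assumptions. The second function is the group-normalized advantage commonly associated with GRPO.

```python
from statistics import mean, pstdev
from typing import List


def posterior_rewards(passed: List[bool], reasoning_scores: List[float]) -> List[float]:
    """Outcome reward of 1.0 for passing tests, plus the reasoning score only on success.

    A failed sample gets 0.0 regardless of how highly its reasoning was scored,
    so the policy cannot profit from polished reasoning that yields wrong code.
    """
    return [1.0 + score if ok else 0.0 for ok, score in zip(passed, reasoning_scores)]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt (hypothetical values).
passed = [True, True, False, False]
reasoning_scores = [0.8, 0.2, 0.9, 0.1]

rewards = posterior_rewards(passed, reasoning_scores)
advantages = group_relative_advantages(rewards)
for ok, score, adv in zip(passed, reasoning_scores, advantages):
    print(f"passed={ok!s:5} reasoning={score:.1f} advantage={adv:+.3f}")
```

In this toy group the third completion has the highest reasoning score but fails its tests, so it receives the same advantage as the other failure; among the successes, the better-reasoned one is preferred.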
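[NLP-40] Aligning LLMs on a Budget: Inference-Time Alignment with Heuristic Reward Models
[Quick Read]: This paper addresses the trade-off between computational cost and alignment quality when aligning Large Language Models (LLMs) with user preferences at inference time. Existing methods usually ignore this balance and focus only on the performance of the optimized policy, which leads to costly fine-tuning or inference. The key to the solution is HIA (Heuristic-Guided Inference-time Alignment), a tuning-free inference-time alignment method compatible with black-box deployment. It combines a lightweight prompt optimizer, heuristic reward models, and a two-stage filtering mechanism to substantially reduce the number of inference calls while preserving alignment quality, and it still improves multi-objective, goal-conditioned tasks under low inference budgets (as few as one or two response queries).
Link: https://arxiv.org/abs/2508.05165
Authors: Mason Nakamura,Saaduddin Mahmud,Kyle H. Wray,Hamed Zamani,Shlomo Zilberstein
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: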
Abstract:Aligning LLMs with user preferences is crucial for real-world use but often requires costly fine-tuning or expensive inference, forcing trade-offs between alignment quality and computational cost. Existing inference-time methods typically ignore this balance, focusing solely on the optimized policy’s performance. We propose HIA (Heuristic-Guided Inference-time Alignment), a tuning-free, black-box-compatible approach that uses a lightweight prompt optimizer, heuristic reward models, and two-stage filtering to reduce inference calls while preserving alignment quality. On real-world prompt datasets, HelpSteer and ComPRed, HIA outperforms best-of-N sampling, beam search, and greedy search baselines in multi-objective, goal-conditioned tasks under the same inference budget. We also find that HIA is effective under low-inference budgets with as little as one or two response queries, offering a practical solution for scalable, personalized LLM deployment.
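[NLP-41] Towards Assessing Medical Ethics from Knowledge to Practice
[Quick Read]: This paper addresses the lack of systematic evaluation of medical ethics principles in current applications of Large Language Models (LLMs) to healthcare; existing benchmarks often neglect ethical reasoning. The key to the solution is PrinciplismQA, a comprehensive benchmark of 3,648 questions built on the Principlism framework of medical ethics. It comprises multiple-choice questions curated from authoritative textbooks and open-ended questions drawn from medical ethics case study literature, all validated by medical experts. The benchmark systematically assesses how well LLMs understand and dynamically apply ethical principles (Beneficence, Non-maleficence, Autonomy, and Justice) in medical scenarios, reveals a gap between models' knowledge and their practice in real ethical decision-making, and provides a quantifiable diagnostic tool and direction for improving the ethical alignment of medical AI.
Link: https://arxiv.org/abs/2508.05132
Authors: Chang Hong,Minghao Wu,Qingying Xiao,Yuchi Wang,Xiang Wan,Guangjun Yu,Benyou Wang,Yan Hu
Affiliations: The Chinese University of Hong Kong, Shenzhen; National Health Data Institute, Shenzhen; Shenzhen Research Institute of Big Data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: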
Abstract:The integration of large language models into healthcare necessitates a rigorous evaluation of their ethical reasoning, an area current benchmarks often overlook. We introduce PrinciplismQA, a comprehensive benchmark with 3,648 questions designed to systematically assess LLMs’ alignment with core medical ethics. Grounded in Principlism, our benchmark features a high-quality dataset. This includes multiple-choice questions curated from authoritative textbooks and open-ended questions sourced from authoritative medical ethics case study literature, all validated by medical experts. Our experiments reveal a significant gap between models’ ethical knowledge and their practical application, especially in dynamically applying ethical principles to real-world scenarios. Most LLMs struggle with dilemmas concerning Beneficence, often over-emphasizing other principles. Frontier closed-source models, driven by strong general capabilities, currently lead the benchmark. Notably, medical domain fine-tuning can enhance models’ overall ethical competence, but further progress requires better alignment with medical ethical knowledge. PrinciplismQA offers a scalable framework to diagnose these specific ethical weaknesses, paving the way for more balanced and responsible medical AI.
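[NLP-42] Navigating Through Paper Flood: Advancing LLM-based Paper Evaluation through Domain-Aware Retrieval and Latent Reasoning
[Quick Read]: This paper addresses the challenge of identifying high-quality research efficiently and accurately as the number of academic papers surges. Existing automated paper evaluation methods based on Large Language Models (LLMs) are often limited by outdated domain knowledge and insufficient reasoning ability. The key to the solution is the PaperEval framework, which has two core components: a domain-aware paper retrieval module that fetches relevant concurrent work to support contextualized assessment of novelty and contribution, and a latent reasoning mechanism that builds a deep understanding of complex research motivations and methods and compares them comprehensively with concurrent related work, improving the accuracy and reliability of the evaluation. In addition, a progressive ranking optimization strategy guides the LLM to refine its predictions iteratively and strengthens reasoning oriented toward relative comparison. Experiments show that the method outperforms existing approaches in evaluating both academic impact and paper quality.
Link: https://arxiv.org/abs/2508.05129
Authors: Wuqiang Zheng,Yiyan Xu,Xinyu Lin,Chongming Gao,Wenjie Wang,Fuli Feng
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: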
Abstract:With the rapid and continuous increase in academic publications, identifying high-quality research has become an increasingly pressing challenge. While recent methods leveraging Large Language Models (LLMs) for automated paper evaluation have shown great promise, they are often constrained by outdated domain knowledge and limited reasoning capabilities. In this work, we present PaperEval, a novel LLM-based framework for automated paper evaluation that addresses these limitations through two key components: 1) a domain-aware paper retrieval module that retrieves relevant concurrent work to support contextualized assessments of novelty and contributions, and 2) a latent reasoning mechanism that enables deep understanding of complex motivations and methodologies, along with comprehensive comparison against concurrently related work, to support more accurate and reliable evaluation. To guide the reasoning process, we introduce a progressive ranking optimization strategy that encourages the LLM to iteratively refine its predictions with an emphasis on relative comparison. Experiments on two datasets demonstrate that PaperEval consistently outperforms existing methods in both academic impact and paper quality evaluation. In addition, we deploy PaperEval in a real-world paper recommendation system for filtering high-quality papers, which has gained strong engagement on social media – amassing over 8,000 subscribers and attracting over 10,000 views for many filtered high-quality papers – demonstrating the practical effectiveness of PaperEval.
[NLP-43] Attention Basin: Why Contextual Position Matters in Large Language Models
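[Quick Read]: This paper addresses the sensitivity of Large Language Models (LLMs) to the position of information in the input sequence: models tend to attend to information at the beginning and end of the sequence and neglect the middle, a phenomenon the authors call the "attention basin". The core issue is that the attention mechanism does not reliably give critical information higher weight, which limits performance. The key to the solution is Attention-Driven Reranking (AttnRank), a training-free two-stage framework that leaves model parameters unchanged. It estimates the model's intrinsic positional preferences on a small calibration set and then reorders the input (such as retrieved documents or few-shot examples) so that the most salient information sits in high-attention positions, substantially improving performance on multi-hop question answering (multi-hop QA) and few-shot in-context learning.
Link: https://arxiv.org/abs/2508.05128
Authors: Zihao Yi,Delong Zeng,Zhenqing Ling,Haohao Luo,Zhe Xu,Wei Liu,Jian Luan,Wanxia Cao,Ying Shen
Affiliations: Institute of Artificial Intelligence, Zhejiang University; Alibaba Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: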
Abstract:The performance of Large Language Models (LLMs) is significantly sensitive to the contextual position of information in the input. To investigate the mechanism behind this positional bias, our extensive experiments reveal a consistent phenomenon we term the attention basin: when presented with a sequence of structured items (e.g., retrieved documents or few-shot examples), models systematically assign higher attention to the items at the beginning and end of the sequence, while neglecting those in the middle. Crucially, our analysis further reveals that allocating higher attention to critical information is key to enhancing model performance. Based on these insights, we introduce Attention-Driven Reranking (AttnRank), a two-stage framework that (i) estimates a model’s intrinsic positional attention preferences using a small calibration set, and (ii) reorders retrieved documents or few-shot examples to align the most salient content with these high-attention positions. AttnRank is a model-agnostic, training-free, and plug-and-play method with minimal computational overhead. Experiments on multi-hop QA and few-shot in-context learning tasks demonstrate that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales, without modifying model parameters or training procedures.
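The reordering stage described above can be sketched as a simple matching between documents and positions. This is an illustrative reading of the abstract rather than the released method: the function name `attention_driven_reorder`, the relevance scores, and the U-shaped attention profile are hypothetical, and the calibration stage that would estimate the profile from a real model is assumed to have already run.

```python
from typing import List, Sequence


def attention_driven_reorder(
    docs: Sequence[str],
    relevance: Sequence[float],
    position_attention: Sequence[float],
) -> List[str]:
    """Place the most relevant documents in the most attended positions."""
    if not (len(docs) == len(relevance) == len(position_attention)):
        raise ValueError("docs, relevance and position_attention must have equal length")

    # Position indices, from most to least attended.
    positions = sorted(range(len(docs)), key=lambda p: position_attention[p], reverse=True)
    # Document indices, from most to least relevant.
    ranked_docs = sorted(range(len(docs)), key=lambda d: relevance[d], reverse=True)

    ordered = [""] * len(docs)
    for position, doc_index in zip(positions, ranked_docs):
        ordered[position] = docs[doc_index]
    return ordered


# Hypothetical calibration result: high attention at both ends, low in the middle.
profile = [0.30, 0.10, 0.05, 0.15, 0.40]
docs = ["A", "B", "C", "D", "E"]
relevance = [0.9, 0.2, 0.5, 0.1, 0.7]

reordered = attention_driven_reorder(docs, relevance, profile)
print(reordered)  # ['E', 'B', 'D', 'C', 'A']
```

With this profile, the most relevant document "A" moves to the last slot (the most attended), the runner-up "E" to the first, and the least relevant document "D" lands in the middle of the basin.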
[NLP-44] Exploring Superior Function Calls via Reinforcement Learning
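[Quick Read]: This paper addresses three core problems that Large Language Models (LLMs) face when performing function calling in real applications: insufficient exploration during policy learning, a lack of structured reasoning in chain-of-thought generation, and inadequate verification of parameter extraction. The key to the solution is a novel reinforcement learning framework that enhances group relative policy optimization (GRPO) with strategic entropy-based exploration, combined with a two-stage data preparation pipeline (iterative LLM evaluation followed by Abstract Syntax Tree (AST) validation) that ensures the quality of the training samples. The method reaches 86.02% overall accuracy on the Berkeley Function Calling Leaderboard and outperforms standard GRPO by up to 6% on complex multi-function scenarios. It is especially effective on code-pretrained models, suggesting that structured language generation ability provides a strong starting point for reinforcement learning on function calling.
Link: https://arxiv.org/abs/2508.05118
Authors: Bingguang Hao,Maolin Wang,Zengzhuang Xu,Yicheng Chen,Cunyin Peng,Jinjie GU,Chenyi Zhuang
Affiliations: AWorld Team, Inclusion AI; Formulas Youshu; Ant Group
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: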
Abstract:Function calling capabilities are crucial for deploying Large Language Models in real-world applications, yet current training approaches fail to develop robust reasoning strategies. Supervised fine-tuning produces models that rely on superficial pattern matching, while standard reinforcement learning methods struggle with the complex action space of structured function calls. We present a novel reinforcement learning framework designed to enhance group relative policy optimization through strategic entropy-based exploration specifically tailored for function calling tasks. Our approach addresses three critical challenges in function calling: insufficient exploration during policy learning, lack of structured reasoning in chain-of-thought generation, and inadequate verification of parameter extraction. Our two-stage data preparation pipeline ensures high-quality training samples through iterative LLM evaluation and abstract syntax tree validation. Extensive experiments on the Berkeley Function Calling Leaderboard demonstrate that this framework achieves state-of-the-art performance among open-source models with 86.02% overall accuracy, outperforming standard GRPO by up to 6% on complex multi-function scenarios. Notably, our method shows particularly strong improvements on code-pretrained models, suggesting that structured language generation capabilities provide an advantageous starting point for reinforcement learning in function calling tasks. We will release all the code, models and dataset to benefit the community.
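As an illustration of what abstract syntax tree validation of a function call can look like, the sketch below checks a call written in Python syntax against a small schema using the standard `ast` module. It is an assumption-laden example, not the paper's pipeline: the schema format, the `get_weather` function, the rule that every parameter is required, and the restriction to keyword arguments with literal values are all choices made here for brevity.

```python
import ast
from typing import Dict, Set


def validate_call(call_src: str, schema: Dict[str, Set[str]]) -> bool:
    """Return True if call_src is one call to a known function with exactly
    the expected keyword arguments, each given as a literal value."""
    try:
        tree = ast.parse(call_src.strip(), mode="eval")
    except SyntaxError:
        return False  # not parseable as a single expression

    node = tree.body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return False

    expected = schema.get(node.func.id)
    if expected is None or node.args:
        return False  # unknown function, or positional arguments used

    names = [kw.arg for kw in node.keywords]
    if None in names:
        return False  # **kwargs unpacking hides the real parameter names

    for kw in node.keywords:
        try:
            ast.literal_eval(kw.value)  # accepts literals only, never runs code
        except (ValueError, TypeError):
            return False

    return set(names) == expected


# Hypothetical schema: function name -> required parameter names.
schema = {"get_weather": {"city", "unit"}}

print(validate_call('get_weather(city="Paris", unit="celsius")', schema))  # True
print(validate_call('get_weather(city="Paris")', schema))                  # False
```

Parsing instead of string matching means malformed calls, unknown functions, missing parameters, and non-literal arguments are all rejected without executing any generated code.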
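[NLP-45] BEE-RAG: Balanced Entropy Engineering for Retrieval-Augmented Generation
[Quick read]: This paper addresses the performance degradation that retrieval-augmented generation (RAG) systems suffer on long contexts because of unconstrained entropy growth and attention dilution. The key to the solution is the balanced entropy-engineered RAG (BEE-RAG) framework, which uses the principle of entropy invariance to decouple attention sensitivity from context length and thereby keep the entropy level stable. On top of this, it introduces a zero-shot inference strategy for multi-importance estimation and a parameter-efficient adaptive fine-tuning mechanism that find the optimal balancing factor for different settings, improving the adaptability and performance of RAG across multiple tasks.
Link: https://arxiv.org/abs/2508.05100
Authors: Yuhao Wang,Ruiyang Ren,Yucheng Wang,Jing Liu,Wayne Xin Zhao,Hua Wu,Haifeng Wang
Affiliations: Baidu; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: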
Abstract:With the rapid advancement of large language models (LLMs), retrieval-augmented generation (RAG) has emerged as a critical approach to supplement the inherent knowledge limitations of LLMs. However, due to the typically large volume of retrieved information, RAG tends to operate with long context lengths. From the perspective of entropy engineering, we identify unconstrained entropy growth and attention dilution due to long retrieval context as significant factors affecting RAG performance. In this paper, we propose the balanced entropy-engineered RAG (BEE-RAG) framework, which improves the adaptability of RAG systems to varying context lengths through the principle of entropy invariance. By leveraging balanced context entropy to reformulate attention dynamics, BEE-RAG separates attention sensitivity from context length, ensuring a stable entropy level. Building upon this, we introduce a zero-shot inference strategy for multi-importance estimation and a parameter-efficient adaptive fine-tuning mechanism to obtain the optimal balancing factor for different settings. Extensive experiments across multiple RAG tasks demonstrate the effectiveness of BEE-RAG.
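The two effects named in the abstract, entropy growth and attention dilution, can be reproduced with a toy softmax. The sketch below assumes one relevant passage with score 5 and distractors with score 0, and applies a generic log-length scaling of the logits to show what "entropy invariance" could look like. The scaling rule is an illustrative assumption, not the balancing factor defined by BEE-RAG.

```python
import math

def attention(scores, scale=1.0):
    """Numerically stable softmax over scaled scores."""
    scaled = [s * scale for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def context_scores(n, relevant_score=5.0):
    """One relevant passage followed by n - 1 distractors (toy assumption)."""
    return [relevant_score] + [0.0] * (n - 1)

BASE_LEN = 64
short = attention(context_scores(BASE_LEN))
long_plain = attention(context_scores(4096))

# Illustrative length-dependent sharpening: scale logits by log(n) / log(n0).
length_scale = math.log(4096) / math.log(BASE_LEN)
long_scaled = attention(context_scores(4096), scale=length_scale)
```

In this toy setting the weight on the relevant passage collapses as the context grows from 64 to 4096 items, while the scaled variant keeps both that weight and the entropy close to the short-context values.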
[NLP-46] Multimodal Fact Checking with Unified Visual Textual and Contextual Representations
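[Quick read]: This paper addresses the challenge that multimodal misinformation poses for fact-checking: when a false claim is supported by both text and images, verification systems that rely mainly on textual evidence struggle to detect it. The key to the solution is "MultiCheck", a unified framework for fine-grained multimodal fact verification. It processes text and images with dedicated encoders, uses a fusion module with element-wise interactions to capture cross-modal relationships, and adds a contrastive learning objective that encourages semantic alignment between claim-evidence pairs in a shared latent space, yielding more accurate and interpretable verification.
Link: https://arxiv.org/abs/2508.05097
Authors: Aditya Kishore,Gaurav Kumar,Jasabanta Patro
Affiliations: Indian Institute of Science Education and Research Bhopal
Categories: Computation and Language (cs.CL)
Comments: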
Abstract:The growing rate of multimodal misinformation, where claims are supported by both text and images, poses significant challenges to fact-checking systems that rely primarily on textual evidence. In this work, we have proposed a unified framework for fine-grained multimodal fact verification called “MultiCheck”, designed to reason over structured textual and visual signals. Our architecture combines dedicated encoders for text and images with a fusion module that captures cross-modal relationships using element-wise interactions. A classification head then predicts the veracity of a claim, supported by a contrastive learning objective that encourages semantic alignment between claim-evidence pairs in a shared latent space. We evaluate our approach on the Factify 2 dataset, achieving a weighted F1 score of 0.84, substantially outperforming the baseline. These results highlight the effectiveness of explicit multimodal reasoning and demonstrate the potential of our approach for scalable and interpretable fact-checking in complex, real-world scenarios.
[NLP-47] JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering
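[Quick read]: This paper addresses a key flaw in current research on jailbreak attacks against multimodal large language models (MLLMs): existing methods focus on maximizing attack success rate (ASR) while overlooking whether the generated responses actually fulfill the attacker's malicious intent, so many jailbreak outputs bypass safety filters but are low in quality and lack substantively harmful content. To fill this gap, the authors propose JPS (Jailbreak MLLMs with collaborative visual Perturbation and textual Steering), whose core idea is the joint optimization of visual perturbations and a textual steering prompt. Specifically, JPS uses target-guided adversarial image perturbations to bypass safety mechanisms, and a steering prompt optimized by a multi-agent system to guide the model's output toward the attacker's intent; the two components are co-optimized iteratively. The paper also introduces the Malicious Intent Fulfillment Rate (MIFR) metric, assessed with a reasoning-LLM-based evaluator, and reports that JPS sets a new state of the art in both ASR and MIFR across several MLLMs and benchmarks.
Link: https://arxiv.org/abs/2508.05087
Authors: Renmiao Chen,Shiyao Cui,Xuancheng Huang,Chengwei Pan,Victor Shea-Jay Huang,QingLin Zhang,Xuan Ouyang,Zhexin Zhang,Hongning Wang,Minlie Huang
Affiliations: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 10 pages, 3 tables, 2 figures, to appear in the Proceedings of the 33rd ACM International Conference on Multimedia (MM '25)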
Abstract:Jailbreak attacks against multimodal large language models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker’s malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS, Jailbreak MLLMs with collaborative visual Perturbation and textual Steering, which achieves jailbreaks via the cooperation of a visual image and a textual steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by a “steering prompt” optimized via a multi-agent system to specifically guide LLM responses toward fulfilling the attacker’s intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning-LLM-based evaluator. Our experiments show JPS sets a new state-of-the-art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Codes are available at this https URL. Warning: This paper contains potentially sensitive contents.
[NLP-48] Cognitive Duality for Adaptive Web Agents
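[Quick read]: This paper addresses the limited decision-making ability of current autonomous web agents in complex, dynamic environments. Existing approaches are usually confined to either offline imitation learning or online exploration and rarely integrate the two effectively. The key to the solution is to borrow the dual-process theory of human cognition and decompose agent behavior into a fast System 1 (intuitive reaction) and a slow System 2 (deliberate reasoning). On this basis the authors build CogniWeb, a modular architecture that switches adaptively between the two modes according to task complexity, achieving a competitive 43.96% success rate on the WebArena benchmark while cutting token usage by 75%.
Link: https://arxiv.org/abs/2508.05081
Authors: Jiarun Liu,Chunhong Zhang,Zheng Hu
Affiliations: Beijing University of Posts and Telecommunications
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: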
Abstract:Web navigation represents a critical and challenging domain for evaluating artificial general intelligence (AGI), demanding complex decision-making within high-entropy, dynamic environments with combinatorially explosive action spaces. Current approaches to building autonomous web agents either focus on offline imitation learning or online exploration, but rarely integrate both paradigms effectively. Inspired by the dual-process theory of human cognition, we derive a principled decomposition into fast System 1 and slow System 2 cognitive processes. This decomposition provides a unifying perspective on existing web agent methodologies, bridging the gap between offline learning of intuitive reactive behaviors and online acquisition of deliberative planning capabilities. We implement this framework in CogniWeb, a modular agent architecture that adaptively toggles between fast intuitive processing and deliberate reasoning based on task complexity. Our evaluation on WebArena demonstrates that CogniWeb achieves competitive performance (43.96% success rate) while maintaining significantly higher efficiency (75% reduction in token usage).
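The idea of toggling between a fast path and a deliberate path can be sketched as a simple router. Everything below is hypothetical: the complexity heuristic, the threshold, and the two stub policies are stand-ins for illustration, not CogniWeb's components.

```python
from dataclasses import dataclass

@dataclass
class Task:
    instruction: str
    num_page_elements: int

def complexity(task):
    """Toy heuristic: number of chained steps times the size of the page."""
    steps = task.instruction.count(" then ") + 1
    return steps * task.num_page_elements

def system1(task):
    """Fast path: emit a single reactive action without planning."""
    return ["act: " + task.instruction]

def system2(task):
    """Slow path: reason about each step before acting on it."""
    plan = []
    for i, step in enumerate(task.instruction.split(" then "), start=1):
        plan.append(f"think: how to do step {i}")
        plan.append("act: " + step)
    return plan

def run(task, threshold=100):
    """Route to System 1 for simple tasks and to System 2 otherwise."""
    if complexity(task) <= threshold:
        return "system1", system1(task)
    return "system2", system2(task)

easy = run(Task("click login", 20))
hard = run(Task("search shoes then sort by price then add first item to cart", 80))
```

The efficiency argument in the abstract corresponds to the fast path emitting far fewer steps (and, in a real agent, far fewer tokens) than the deliberate one.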
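[NLP-49] Align, Don't Divide: Revisiting the LoRA Architecture in Multi-Task Learning
[Quick read]: This paper studies how to adapt large language models (LLMs) efficiently to multiple tasks from different domains in a multi-task learning (MTL) setting. Prevailing methods rely on structurally diverse LoRA variants, such as multi-adapter or multi-head designs, to capture task-specific knowledge, but the authors find that such complex architectures are not optimal. Their key contribution is a new hypothesis: effective multi-task generalization hinges on learning robust shared representations rather than isolating task-specific features. To test it, they propose Align-LoRA, which adds an explicit alignment loss that pushes different tasks toward consistent representations within the shared adapter space, simplifying the architecture while significantly improving performance.
Link: https://arxiv.org/abs/2508.05078
Authors: Jinda Liu,Bo Cheng,Yi Chang,Yuan Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: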
Abstract:Parameter-Efficient Fine-Tuning (PEFT) is essential for adapting Large Language Models (LLMs). In practice, LLMs are often required to handle a diverse set of tasks from multiple domains, a scenario naturally addressed by multi-task learning (MTL). Within this MTL context, a prevailing trend involves LoRA variants with multiple adapters or heads, which advocate for structural diversity to capture task-specific knowledge. Our findings present a direct challenge to this paradigm. We first show that a simplified multi-head architecture with high inter-head similarity substantially outperforms complex multi-adapter and multi-head systems. This leads us to question the multi-component paradigm itself, and we further demonstrate that a standard single-adapter LoRA, with a sufficiently increased rank, also achieves highly competitive performance. These results lead us to a new hypothesis: effective MTL generalization hinges on learning robust shared representations, not isolating task-specific features. To validate this, we propose Align-LoRA, which incorporates an explicit loss to align task representations within the shared adapter space. Experiments confirm that Align-LoRA significantly surpasses all baselines, establishing a simpler yet more effective paradigm for adapting LLMs to multiple tasks. The code is available at this https URL.
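The abstract does not state which alignment loss is used, so the following is only a sketch of one simple possibility: penalizing the squared distance between the mean adapter representations of each pair of tasks. The function names, the first-moment matching choice, and the weight of 0.1 are all assumptions made for illustration.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(rows[0])
    return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

def alignment_loss(task_reps):
    """Average squared distance between per-task mean representations.

    task_reps maps a task name to the adapter outputs of its examples.
    """
    means = [mean_vector(rows) for rows in task_reps.values()]
    total, pairs = 0.0, 0
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            total += sum((a - b) ** 2 for a, b in zip(means[i], means[j]))
            pairs += 1
    return total / pairs if pairs else 0.0

def total_loss(task_loss, task_reps, weight=0.1):
    """Task loss plus a weighted alignment penalty (weight is illustrative)."""
    return task_loss + weight * alignment_loss(task_reps)

aligned = {"qa": [[1.0, 0.0], [3.0, 0.0]], "summarization": [[2.0, 0.0]]}
shifted = {"qa": [[0.0, 0.0]], "summarization": [[3.0, 4.0]]}
```

With `aligned`, both tasks share the same mean and the penalty vanishes; with `shifted`, the penalty is the squared distance 25 between the two means.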
[NLP-50] A Study of the Framework and Real-World Applications of Language Embedding for 3D Scene Understanding
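[Quick read]: This paper fills a gap in language-guided 3D scene understanding: the lack of a comprehensive overview of how large language models (LLMs) and language embeddings are integrated with 3D Gaussian Splatting. The survey systematically reviews existing methods that incorporate language embeddings into Gaussian Splatting pipelines, covering theoretical foundations, integration strategies, and real-world use cases. It also identifies key limitations, namely computational bottlenecks, limited generalizability, and the scarcity of semantically annotated 3D Gaussian data, and outlines directions for future work on language-guided 3D scene modeling and understanding.
Link: https://arxiv.org/abs/2508.05064
Authors: Mahmoud Chick Zaouali,Todd Charter,Yehor Karpichev,Brandon Haworth,Homayoun Najjaran
Affiliations: University of Victoria
Categories: Graphics (cs.GR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: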
Abstract:Gaussian Splatting has rapidly emerged as a transformative technique for real-time 3D scene representation, offering a highly efficient and expressive alternative to Neural Radiance Fields (NeRF). Its ability to render complex scenes with high fidelity has enabled progress across domains such as scene reconstruction, robotics, and interactive content creation. More recently, the integration of Large Language Models (LLMs) and language embeddings into Gaussian Splatting pipelines has opened new possibilities for text-conditioned generation, editing, and semantic scene understanding. Despite these advances, a comprehensive overview of this emerging intersection has been lacking. This survey presents a structured review of current research efforts that combine language guidance with 3D Gaussian Splatting, detailing theoretical foundations, integration strategies, and real-world use cases. We highlight key limitations such as computational bottlenecks, generalizability, and the scarcity of semantically annotated 3D Gaussian data and outline open challenges and future directions for advancing language-guided 3D scene understanding using Gaussian Splatting.
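[NLP-51] Evaluation of LLMs in AMR Parsing
[Quick read]: This paper addresses Abstract Meaning Representation (AMR) parsing, a semantic representation task in natural language processing: converting a sentence into a structured semantic graph (a rooted, directed, acyclic graph) that captures concepts and the semantic relations between them. The key to the approach is straightforward fine-tuning of decoder-only large language models (LLMs) on a standard AMR dataset (LDC2020T02 Gold AMR3.0), without complex architectures or multi-stage pipelines, which achieves performance comparable to state-of-the-art (SOTA) AMR parsers. LLaMA 3.2 reaches a SMATCH F1 of 0.804 with this approach, approaching the 0.854 of Graphene Smatch (MBSE).
Link: https://arxiv.org/abs/2508.05028
Authors: Shu Han Ho
Affiliations: University College London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 27 pages, 32 figures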
Abstract: Abstract Meaning Representation (AMR) is a semantic formalism that encodes sentence meaning as rooted, directed, acyclic graphs, where nodes represent concepts and edges denote semantic relations. Finetuning decoder-only Large Language Models (LLMs) represents a promising and straightforward new direction for AMR parsing. This paper presents a comprehensive evaluation of finetuning four distinct LLM architectures, Phi 3.5, Gemma 2, LLaMA 3.2, and DeepSeek R1 LLaMA Distilled, using the LDC2020T02 Gold AMR3.0 test set. Our results show that straightforward finetuning of decoder-only LLMs can achieve performance comparable to complex State of the Art (SOTA) AMR parsers. Notably, LLaMA 3.2 demonstrates competitive performance against SOTA AMR parsers given a straightforward finetuning approach. We achieved SMATCH F1: 0.804 on the full LDC2020T02 test split, on par with APT + Silver (IBM) at 0.804 and approaching Graphene Smatch (MBSE) at 0.854. Across our analysis, we also observed a consistent pattern where LLaMA 3.2 leads in semantic performance while Phi 3.5 excels in structural validity.
[NLP-52] Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning CIKM2025
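[Quick read]: This paper addresses the noise problem in extracting target-aspect-opinion-sentiment quadruples from multi-round, multi-participant dialogues. Existing methods typically assume that sentiment elements are uniformly distributed and learn word relations across the entire dialogue, but real dialogues often contain multiple semantically independent sub-dialogues with no clear dependencies between them, so global modeling introduces extra noise. The solution has two parts. First, a structural entropy minimization algorithm partitions the dialogue into as few semantically independent sub-dialogues as possible, preserving relevant utterances while separating irrelevant ones. Second, a two-step extraction framework extracts individual sentiment elements at the utterance level and then matches quadruples at the sub-dialogue level, improving extraction accuracy at a much lower computational cost.
Link: https://arxiv.org/abs/2508.05023
Authors: Kun Peng,Cong Cao,Hao Peng,Zhifeng Hao,Lei Jiang,Kongjing Gu,Yanbing Liu,Philip S. Yu
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Beihang University; University of Shantou; National University of Defense Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by CIKM2025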
Abstract:Dialogues Aspect-based Sentiment Quadruple Extraction (DiaASQ) aims to extract all target-aspect-opinion-sentiment quadruples from a given multi-round, multi-participant dialogue. Existing methods typically learn word relations across entire dialogues, assuming a uniform distribution of sentiment elements. However, we find that dialogues often contain multiple semantically independent sub-dialogues without clear dependencies between them. Therefore, learning word relationships across the entire dialogue inevitably introduces additional noise into the extraction process. To address this, our method focuses on partitioning dialogues into semantically independent sub-dialogues. Achieving completeness while minimizing these sub-dialogues presents a significant challenge. Simply partitioning based on reply relationships is ineffective. Instead, we propose utilizing a structural entropy minimization algorithm to partition the dialogues. This approach aims to preserve relevant utterances while distinguishing irrelevant ones as much as possible. Furthermore, we introduce a two-step framework for quadruple extraction: first extracting individual sentiment elements at the utterance level, then matching quadruples at the sub-dialogue level. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in DiaASQ with much lower computational costs.
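To give a feel for the quantity being minimized, the sketch below computes the standard two-dimensional structural entropy of a graph partition and compares two candidate partitions of a toy utterance graph. Treating utterances as nodes and relations between them as unweighted edges is an assumption made here, and the paper's actual minimization algorithm is not reproduced; the code only scores given partitions.

```python
import math

def structural_entropy(edges, partition):
    """Two-dimensional structural entropy of an undirected, unweighted graph.

    edges: list of (u, v) pairs; partition: list of sets covering all nodes.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    two_m = 2 * len(edges)
    block_of = {node: i for i, block in enumerate(partition) for node in block}
    total = 0.0
    for i, block in enumerate(partition):
        volume = sum(degree[n] for n in block)
        # edges with exactly one endpoint inside this block
        cut = sum(1 for u, v in edges if (block_of[u] == i) != (block_of[v] == i))
        total -= (cut / two_m) * math.log2(volume / two_m)
        for n in block:
            total -= (degree[n] / two_m) * math.log2(degree[n] / volume)
    return total

# Toy dialogue: two tightly connected threads joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_thread = [{0, 1, 2}, {3, 4, 5}]
mixed = [{0, 3}, {1, 4}, {2, 5}]
good = structural_entropy(edges, by_thread)
bad = structural_entropy(edges, mixed)
```

The partition that follows the two threads scores lower than the one that mixes them, which is the behaviour a minimization procedure would exploit when searching for sub-dialogues.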
[NLP-53] Making Prompts First-Class Citizens for Adaptive LLM Pipelines
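[Quick read]: This paper addresses the limits of prompt management in large language model (LLM) pipelines: prompts are still static, opaque strings disconnected from the surrounding dataflow, which makes them hard to reuse, optimize, and control at runtime. The proposed solution is SPEAR, a language and runtime that treats prompts as structured, adaptive, first-class components of the execution model. SPEAR defines a prompt algebra and supports runtime prompt refinement (adjusting prompts dynamically in response to signals such as confidence, latency, or missing context) and structured prompt management (organizing prompt fragments into versioned views with support for introspection and logging). Because prompt logic is treated as structured data, optimizations such as operator fusion, prefix caching, and view reuse become possible.
Link: https://arxiv.org/abs/2508.05012
Authors: Ugur Cetintemel,Shu Chen,Alexander W. Lee,Deepti Raghavan
Affiliations: Brown University
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: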
Abstract:Modern LLM pipelines increasingly resemble data-centric systems: they retrieve external context, compose intermediate outputs, validate results, and adapt based on runtime feedback. Yet, the central element guiding this process – the prompt – remains a brittle, opaque string, disconnected from the surrounding dataflow. This disconnect limits reuse, optimization, and runtime control. In this paper, we describe our vision and an initial design for SPEAR, a language and runtime that fills this prompt management gap by making prompts structured, adaptive, and first-class components of the execution model. SPEAR enables (1) runtime prompt refinement – modifying prompts dynamically in response to execution-time signals such as confidence, latency, or missing context; and (2) structured prompt management – organizing prompt fragments into versioned views with support for introspection and logging. SPEAR defines a prompt algebra that governs how prompts are constructed and adapted within a pipeline. It supports multiple refinement modes (manual, assisted, and automatic), giving developers a balance between control and automation. By treating prompt logic as structured data, SPEAR enables optimizations such as operator fusion, prefix caching, and view reuse. Preliminary experiments quantify the behavior of different refinement modes compared to static prompts and agentic retries, as well as the impact of prompt-level optimizations such as operator fusion.
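What "prompts as structured, versioned data" might look like in practice can be sketched with an immutable view object. The class, its fields, and the confidence rule are hypothetical names chosen for illustration; they are not SPEAR's language or API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptView:
    """An immutable, versioned list of prompt fragments (hypothetical design)."""
    fragments: tuple
    version: int = 0
    log: tuple = ()

    def render(self):
        """Flatten the fragments into the string sent to the model."""
        return "\n".join(self.fragments)

    def refine(self, fragment, reason):
        """Return a new version with one more fragment and a logged reason."""
        return PromptView(
            fragments=self.fragments + (fragment,),
            version=self.version + 1,
            log=self.log + (reason,),
        )

def adapt(view, confidence, min_confidence=0.7):
    """Refine the prompt only when a runtime signal (confidence) is too low."""
    if confidence < min_confidence:
        return view.refine(
            "Cite the retrieved passage that supports the answer.",
            f"confidence {confidence:.2f} below {min_confidence:.2f}",
        )
    return view

base = PromptView(("Answer the question.",))
unchanged = adapt(base, confidence=0.9)
refined = adapt(base, confidence=0.4)
```

Because each refinement returns a new object, earlier versions stay available for inspection, and a shared prefix of fragments is easy to detect for caching.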
[NLP-54] Can Large Language Models Integrate Spatial Data? Empirical Insights into Reasoning Strengths and Computational Weaknesses
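[Quick read]: This paper addresses the integration of urban spatial datasets, where traditional rule-based methods cannot cover all edge cases and need manual verification and repair, and machine learning methods need large numbers of labeled task-specific samples. The key to the solution is to use the reasoning ability of large language models (LLMs) and to supply relevant features that reduce the dependence on complex spatial reasoning, which improves integration accuracy. A review-and-refine strategy then corrects erroneous initial responses while preserving accurate ones, showing the potential and flexibility of LLMs for adaptive spatial data integration.
Link: https://arxiv.org/abs/2508.05009
Authors: Bin Han,Robert Wolfe,Anat Caspi,Bill Howe
Affiliations: University of Washington
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: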
Abstract:We explore the application of large language models (LLMs) to empower domain experts in integrating large, heterogeneous, and noisy urban spatial datasets. Traditional rule-based integration methods are unable to cover all edge cases, requiring manual verification and repair. Machine learning approaches require collecting and labeling of large numbers of task-specific samples. In this study, we investigate the potential of LLMs for spatial data integration. Our analysis first considers how LLMs reason about environmental spatial relationships mediated by human experience, such as between roads and sidewalks. We show that while LLMs exhibit spatial reasoning capabilities, they struggle to connect the macro-scale environment with the relevant computational geometry tasks, often producing logically incoherent responses. But when provided relevant features, thereby reducing dependence on spatial reasoning, LLMs are able to generate high-performing results. We then adapt a review-and-refine method, which proves remarkably effective in correcting erroneous initial responses while preserving accurate responses. We discuss practical implications of employing LLMs for spatial data integration in real-world contexts and outline future research directions, including post-training, multi-modal integration methods, and support for diverse data formats. Our findings position LLMs as a promising and flexible alternative to traditional rule-based heuristics, advancing the capabilities of adaptive spatial data integration.
[NLP-55] R-Zero: Self-Evolving Reasoning LLM from Zero Data
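[Quick read]: This paper addresses the heavy reliance of current self-evolving large language models (LLMs) on human-curated tasks and labels during training, a bottleneck for advancing AI systems beyond human intelligence. The key to the solution is R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, it initializes two independent models with distinct roles, a Challenger and a Solver, which co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver's capability, and the Solver is rewarded for solving increasingly challenging tasks. Without any pre-existing tasks or labels, this yields a targeted, self-improving curriculum that substantially improves reasoning capability.
Link: https://arxiv.org/abs/2508.05004
Authors: Chengsong Huang,Wenhao Yu,Xiaoyang Wang,Hongming Zhang,Zongxia Li,Ruosen Li,Jiaxin Huang,Haitao Mi,Dong Yu
Affiliations: Tencent AI Seattle Lab; Washington University in St. Louis; University of Maryland College Park; The University of Texas at Dallas
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: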
Abstract:Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks and labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting the Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.
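One way to picture "tasks near the edge of the Solver capability" without any labels is to reward questions on which the Solver's sampled answers are split. The sketch below assumes majority agreement as a proxy for solve rate and a triangular reward peaking at 50% agreement; both choices are illustrative assumptions, since the abstract does not give the reward formula.

```python
from collections import Counter

def solver_agreement(answers):
    """Fraction of sampled Solver answers that match the majority answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

def challenger_reward(answers):
    """Reward in [0, 1] that peaks when agreement is 0.5 (assumed shape)."""
    p = solver_agreement(answers)
    return 1.0 - 2.0 * abs(p - 0.5)

too_easy = challenger_reward(["4"] * 8)               # Solver always agrees
borderline = challenger_reward(["4", "4", "5", "5"])  # Solver is split
in_between = challenger_reward(["4", "4", "4", "5"])
```

A question the Solver always answers the same way earns the Challenger nothing, while a question that splits the Solver evenly earns the maximum reward.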
[NLP-56] A Multi-Stage Large Language Model Framework for Extracting Suicide-Related Social Determinants of Health
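[Quick read]: This paper addresses three challenges in extracting suicide-related social determinants of health (SDoH) from unstructured text: long-tailed factor distributions, difficulty identifying pivotal stressors, and limited model explainability. The key to the solution is a multi-stage large language model framework that processes the text in steps to improve both the accuracy of SDoH factor extraction and the retrieval of relevant context, and that exposes intermediate explanations to make the model more transparent. Experiments also show that fine-tuning a smaller, task-specific model achieves comparable or better performance than mainstream models at a lower inference cost.
Link: https://arxiv.org/abs/2508.05003
Authors: Song Wang,Yishu Wei,Haotian Ma,Max Lovitt,Kelly Deng,Yuan Meng,Zihan Xu,Jingze Zhang,Yunyu Xiao,Ying Ding,Xuhai Xu,Joydeep Ghosh,Yifan Peng
Affiliations: The University of Texas at Austin; Weill Cornell Medicine; Columbia University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: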
Abstract:Background: Understanding social determinants of health (SDoH) factors contributing to suicide incidents is crucial for early intervention and prevention. However, data-driven approaches to this goal face challenges such as long-tailed factor distributions, analyzing pivotal stressors preceding suicide incidents, and limited model explainability. Methods: We present a multi-stage large language model framework to enhance SDoH factor extraction from unstructured text. Our approach was compared to other state-of-the-art language models (i.e., pre-trained BioBERT and GPT-3.5-turbo) and reasoning models (i.e., DeepSeek-R1). We also evaluated how the model’s explanations help people annotate SDoH factors more quickly and accurately. The analysis included both automated comparisons and a pilot user study. Results: We show that our proposed framework demonstrated performance boosts in the overarching task of extracting SDoH factors and in the finer-grained tasks of retrieving relevant context. Additionally, we show that fine-tuning a smaller, task-specific model achieves comparable or better performance with reduced inference costs. The multi-stage design not only enhances extraction but also provides intermediate explanations, improving model explainability. Conclusions: Our approach improves both the accuracy and transparency of extracting suicide-related SDoH from unstructured texts. These advancements have the potential to support early identification of individuals at risk and inform more effective prevention strategies.
[NLP-57] REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation
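[Quick read]: This paper addresses the trade-off between translation quality and latency in simultaneous speech translation (SimulST) systems: how to keep latency as low as possible without sacrificing accuracy. The key to the solution is Regularized Entropy INformation Adaptation (REINA), a new training loss derived from information-theoretic principles. It trains an adaptive policy that decides when to wait for more input, delaying output only when waiting yields additional information. REINA can be used to train a streaming policy on top of an existing non-streaming translation model and pushes the Pareto frontier of the latency/quality trade-off. On translation between English and French, Spanish, and German, it achieves state-of-the-art (SOTA) streaming results for models of comparable size, and a newly introduced streaming efficiency metric shows an improvement of up to 21% in the latency/quality trade-off over prior approaches.
Link: https://arxiv.org/abs/2508.04946
Authors: Nameer Hirschkind,Joseph Liu,Mahesh Kumar Nandwana,Xiao Yu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: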
Abstract:Simultaneous Speech Translation (SimulST) systems stream in audio while simultaneously emitting translated text or speech. Such systems face the significant challenge of balancing translation quality and latency. We introduce a strategy to optimize this tradeoff: wait for more input only if you gain information by doing so. Based on this strategy, we present Regularized Entropy INformation Adaptation (REINA), a novel loss to train an adaptive policy using an existing non-streaming translation model. We derive REINA from information theory principles and show that REINA helps push the reported Pareto frontier of the latency/quality tradeoff over prior works. Utilizing REINA, we train a SimulST model on French, Spanish and German, both from and into English. Training on only open source or synthetically generated data, we achieve state-of-the-art (SOTA) streaming results for models of comparable size. We also introduce a metric for streaming efficiency, quantitatively showing REINA improves the latency/quality trade-off by as much as 21% compared to prior approaches, normalized against non-streaming baseline BLEU scores.
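The strategy "wait for more input only if you gain information by doing so" can be illustrated with next-token distributions. The sketch assumes a training-time setting where the model's distribution is available both with the partial audio and with the full audio; the entropy difference and the fixed threshold are illustrative choices and are not the REINA loss itself.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def information_gain(partial_probs, full_probs):
    """How much uncertainty about the next token the remaining audio removes."""
    return entropy(partial_probs) - entropy(full_probs)

def decide(partial_probs, full_probs, threshold=0.2):
    """READ more audio if waiting is informative, otherwise WRITE a token."""
    if information_gain(partial_probs, full_probs) > threshold:
        return "READ"
    return "WRITE"

# The rest of the utterance disambiguates the next token: worth waiting.
ambiguous = decide([0.5, 0.5], [0.99, 0.01])
# The partial input is already as informative as the full one: emit now.
settled = decide([0.9, 0.1], [0.9, 0.1])
```

A learned policy would have to predict this kind of gain from the partial input alone, since the full utterance is not available while streaming.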
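[NLP-58] Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering
[Quick read]: This paper addresses semantic ambiguity in evaluating visual activity recognition systems: the same action may be described by synonymous verbs (e.g., "brushing" vs. "grooming"), and different perspectives may lead to distinct but equally valid verb choices (e.g., "piloting" vs. "operating"). Standard exact-match evaluation against a single gold answer cannot capture this diversity and therefore underestimates model performance. The key to the solution is a vision-language clustering framework that constructs verb sense clusters representing different interpretations of the action in an image. Each image in the imSitu dataset maps to 2.8 sense clusters on average, and the cluster-based evaluation aligns better with human judgments than standard evaluation, giving a more robust and nuanced assessment.
Link: https://arxiv.org/abs/2508.04945
Authors: Louie Hong Yao,Nicholas Jarvis,Tianyu Jiang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 5 figures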
Abstract:Evaluating visual activity recognition systems is challenging due to inherent ambiguities in verb semantics and image interpretation. When describing actions in images, synonymous verbs can refer to the same event (e.g., brushing vs. grooming), while different perspectives can lead to equally valid but distinct verb choices (e.g., piloting vs. operating). Standard exact-match evaluation, which relies on a single gold answer, fails to capture these ambiguities, resulting in an incomplete assessment of model performance. To address this, we propose a vision-language clustering framework that constructs verb sense clusters, providing a more robust evaluation. Our analysis of the imSitu dataset shows that each image maps to an average of 2.8 sense clusters, with each cluster representing a distinct perspective of the image. We evaluate multiple activity recognition models and compare our cluster-based evaluation with standard evaluation methods. Additionally, our human alignment analysis suggests that the cluster-based evaluation better aligns with human judgements, offering a more nuanced assessment of model performance.
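The difference between exact-match and cluster-based scoring is easy to see on a few hand-made examples. The sketch assumes that a prediction counts as correct when it falls into any sense cluster attached to the image; the clusters and predictions below are invented for illustration and do not come from imSitu.

```python
def exact_match_accuracy(examples):
    """Share of predictions identical to the single gold verb."""
    hits = sum(1 for ex in examples if ex["pred"] == ex["gold"])
    return hits / len(examples)

def cluster_accuracy(examples):
    """Share of predictions that fall in any sense cluster of the image."""
    hits = sum(
        1 for ex in examples
        if any(ex["pred"] in cluster for cluster in ex["clusters"])
    )
    return hits / len(examples)

examples = [
    {"pred": "grooming", "gold": "brushing",
     "clusters": [{"brushing", "grooming"}, {"standing"}]},
    {"pred": "operating", "gold": "piloting",
     "clusters": [{"piloting", "flying"}, {"operating", "steering"}]},
    {"pred": "eating", "gold": "running",
     "clusters": [{"running", "jogging"}]},
]
exact = exact_match_accuracy(examples)
clustered = cluster_accuracy(examples)
```

Exact match scores all three predictions as wrong, while the cluster-based score accepts the synonym and the alternative perspective and still rejects the genuinely wrong verb.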
[NLP-59] I Think Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations
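[Quick read]: This paper addresses the systematic bias that large language models (LLMs) show in automated evaluation toward particular linguistic features such as hedging language, a bias that can lead to unfair ratings for different social groups. The key to the solution is a structured benchmark that uses 100 validated question-response pairs to generate controlled linguistic variations, isolating specific linguistic phenomena while preserving semantic equivalence, so that discriminatory response patterns tied to gender, social class, or regional background can be measured precisely. The benchmark quantifies implicit linguistic discrimination in LLMs and shows that hedged responses receive ratings that are 25.6% lower on average, offering a reusable approach to fairness testing.
Link: https://arxiv.org/abs/2508.04939
Authors: Julia Kharchenko,Tanya Roosta,Aman Chadha,Chirag Shah
Affiliations: University of Washington; UC Berkeley; Stanford University; Amazon GenAI
Categories: Computation and Language (cs.CL)
Comments: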
Abstract:This paper introduces a comprehensive benchmark for evaluating how Large Language Models (LLMs) respond to linguistic shibboleths: subtle linguistic markers that can inadvertently reveal demographic attributes such as gender, social class, or regional background. Through carefully constructed interview simulations using 100 validated question-response pairs, we demonstrate how LLMs systematically penalize certain linguistic patterns, particularly hedging language, despite equivalent content quality. Our benchmark generates controlled linguistic variations that isolate specific phenomena while maintaining semantic equivalence, which enables the precise measurement of demographic bias in automated evaluation systems. We validate our approach along multiple linguistic dimensions, showing that hedged responses receive 25.6% lower ratings on average, and demonstrate the benchmark’s effectiveness in identifying model-specific biases. This work establishes a foundational framework for detecting and measuring linguistic discrimination in AI systems, with broad applications to fairness in automated decision-making contexts.
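[NLP-60] ConfAgents: A Conformal-Guided Multi-Agent Framework for Cost-Efficient Medical Diagnosis
[Quick read]: This paper addresses a limitation of current AI agents in healthcare research, which rely on static, predefined strategies: agents can become better tool-users but cannot learn to become better strategic planners, a crucial skill for complex healthcare tasks. The core of the solution is HealthFlow, a self-evolving AI agent with a novel meta-level evolution mechanism that distills the successes and failures of its executions into a durable strategic knowledge base and so refines its own high-level problem-solving policies autonomously. This frees HealthFlow from the fixed strategies of conventional frameworks and significantly improves performance on realistic health data analysis tasks.
Link: https://arxiv.org/abs/2508.04915
Authors: Huiya Zhao,Yinghao Zhu,Zixiang Wang,Yasha Wang,Junyi Gao,Liantao Ma
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: Code: this https URL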
Abstract:The efficacy of AI agents in healthcare research is hindered by their reliance on static, predefined strategies. This creates a critical limitation: agents can become better tool-users but cannot learn to become better strategic planners, a crucial skill for complex domains like healthcare. We introduce HealthFlow, a self-evolving AI agent that overcomes this limitation through a novel meta-level evolution mechanism. HealthFlow autonomously refines its own high-level problem-solving policies by distilling procedural successes and failures into a durable, strategic knowledge base. To anchor our research and facilitate reproducible evaluation, we introduce EHRFlowBench, a new benchmark featuring complex, realistic health data analysis tasks derived from peer-reviewed clinical research. Our comprehensive experiments demonstrate that HealthFlow’s self-evolving approach significantly outperforms state-of-the-art agent frameworks. This work marks a necessary shift from building better tool-users to designing smarter, self-evolving task-managers, paving the way for more autonomous and effective AI for scientific discovery.
[NLP-61] Advancing Hate Speech Detection with Transformers: Insights from the MetaHate
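[Quick read]: This paper addresses automated hate speech detection on social media, where the core challenge is accurately identifying socially harmful discriminatory or defamatory content across diverse platforms. The key to the solution is fine-tuning pretrained Transformer-based language models. Using the MetaHate dataset (a meta-collection of 36 datasets with 1.2 million samples), the study systematically evaluates BERT, RoBERTa, GPT-2, and ELECTRA; fine-tuned ELECTRA performs best, with an F1 score of 0.8980.
Link: https://arxiv.org/abs/2508.04913
Authors: Santosh Chapagain,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted to the Deviant Dynamics in Digital Spaces workshop at ASONAM 2025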
Abstract:Hate speech is a widespread and harmful form of online discourse, encompassing slurs and defamatory posts that can have serious social, psychological, and sometimes physical impacts on targeted individuals and communities. As social media platforms such as X (formerly Twitter), Facebook, Instagram, Reddit, and others continue to facilitate widespread communication, they also become breeding grounds for hate speech, which has increasingly been linked to real-world hate crimes. Addressing this issue requires the development of robust automated methods to detect hate speech in diverse social media environments. Deep learning approaches, such as vanilla recurrent neural networks (RNNs), long short-term memory (LSTM), and convolutional neural networks (CNNs), have achieved good results, but are often limited by issues such as long-term dependencies and inefficient parallelization. This study represents the comprehensive exploration of transformer-based models for hate speech detection using the MetaHate dataset–a meta-collection of 36 datasets with 1.2 million social media samples. We evaluate multiple state-of-the-art transformer models, including BERT, RoBERTa, GPT-2, and ELECTRA, with fine-tuned ELECTRA achieving the highest performance (F1 score: 0.8980). We also analyze classification errors, revealing challenges with sarcasm, coded language, and label noise.
[NLP-62] RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory
[Quick read]: This paper addresses the excessive token consumption, redundant memory exposure, and poor adaptability across interaction rounds that static or full-context routing causes in multi-agent large language model (LLM) systems during collaborative reasoning. It proposes RCR-Router, a modular, role-aware context routing framework that uses a lightweight scoring policy to dynamically select the memory subset semantically relevant to each agent's role and task stage while adhering to a strict token budget. Agent outputs are iteratively integrated into a shared memory store for progressive context refinement, which improves collaboration efficiency and scalability.
Link: https://arxiv.org/abs/2508.04903
Authors: Jun Liu,Zhenglun Kong,Changdi Yang,Fan Yang,Tianqi Li,Peiyan Dong,Joannah Nanjekye,Hao Tang,Geng Yuan,Wei Niu,Wenbin Zhang,Pu Zhao,Xue Lin,Dong Huang,Yanzhi Wang
Affiliations: Tsinghua University; Northeastern University; Peking University; University of California, Berkeley; University of Oxford; University of Cambridge; Chinese Academy of Sciences; MIT
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Multi-agent large language model (LLM) systems have shown strong potential in complex reasoning and collaborative decision-making tasks. However, most existing coordination schemes rely on static or full-context routing strategies, which lead to excessive token consumption, redundant memory exposure, and limited adaptability across interaction rounds. We introduce RCR-Router, a modular and role-aware context routing framework designed to enable efficient, adaptive collaboration in multi-agent LLMs. To our knowledge, this is the first routing approach that dynamically selects semantically relevant memory subsets for each agent based on its role and task stage, while adhering to a strict token budget. A lightweight scoring policy guides memory selection, and agent outputs are iteratively integrated into a shared memory store to facilitate progressive context refinement. To better evaluate model behavior, we further propose an Answer Quality Score metric that captures LLM-generated explanations beyond standard QA accuracy. Experiments on three multi-hop QA benchmarks – HotPotQA, MuSiQue, and 2WikiMultihop – demonstrate that RCR-Router reduces token usage (up to 30%) while improving or maintaining answer quality. These results highlight the importance of structured memory routing and output-aware evaluation in advancing scalable multi-agent LLM systems.
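The routing idea in this abstract (score each memory item for the current role and task stage, then keep the best items that fit a token budget) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `MemoryItem` fields, the scoring weights, the whitespace token count, and the greedy selection are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    text: str
    role_tags: frozenset  # roles this item is relevant to (hypothetical field)
    stage: str            # task stage that produced the item (hypothetical field)


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def score(item: MemoryItem, role: str, stage: str, query_terms: set) -> float:
    # Hypothetical lightweight scoring policy: role match, stage match,
    # and lexical overlap with the query, with arbitrary weights.
    overlap = len(query_terms & set(item.text.lower().split()))
    return 2.0 * (role in item.role_tags) + 1.0 * (item.stage == stage) + 0.5 * overlap


def route_context(memory, role, stage, query, budget):
    """Greedily pick the highest-scoring items whose total size fits the budget."""
    query_terms = set(query.lower().split())
    ranked = sorted(memory, key=lambda m: score(m, role, stage, query_terms), reverse=True)
    selected, used = [], 0
    for item in ranked:
        cost = count_tokens(item.text)
        if used + cost <= budget:
            selected.append(item)
            used += cost
    return selected


memory = [
    MemoryItem("paris is the capital of france", frozenset({"planner"}), "plan"),
    MemoryItem("the eiffel tower is in paris", frozenset({"searcher"}), "search"),
    MemoryItem("unrelated note about cooking pasta at home tonight", frozenset({"writer"}), "write"),
]
context = route_context(memory, "searcher", "search", "where is the eiffel tower", budget=8)
```

With a budget of 8 tokens, only the item tagged for the `searcher` role fits; a larger budget would admit lower-scoring items in rank order.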
[NLP-63] Fine-Tuning Small Language Models (SLMs) for Autonomous Web-based Geographical Information Systems (AWebGIS)
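[Quick read]: This paper addresses the privacy, connectivity, and scalability problems of current natural-language-driven autonomous web-based geographical information systems (AWebGIS) that depend on cloud-based large language models (LLMs). The core solution is a fine-tuned small language model (SLM), specifically T5-small, that runs on the client side (in the user's browser) and processes requests fully offline, without server-side inference. In the experiments it outperforms the cloud-based and semi-automated approaches, with an exact-match accuracy of 0.93, a Levenshtein similarity of 0.99, and ROUGE-1 and ROUGE-L scores of 0.98, which supports the feasibility of client-side models for AWebGIS.
Link: https://arxiv.org/abs/2508.04846
Authors: Mahdi Nazari Ashani,Ali Asghar Alesheikh,Saba Kazemi,Kimya Kheirkhah,Yasin Mohammadi,Fatemeh Rezaie,Amir Mahdi Manafi,Hedieh Zarkesh
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: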
Abstract:Autonomous web-based geographical information systems (AWebGIS) aim to perform geospatial operations from natural language input, providing intuitive, intelligent, and hands-free interaction. However, most current solutions rely on cloud-based large language models (LLMs), which require continuous internet access and raise users’ privacy and scalability issues due to centralized server processing. This study compares three approaches to enabling AWebGIS: (1) a fully-automated online method using cloud-based LLMs (e.g., Cohere); (2) a semi-automated offline method using classical machine learning classifiers such as support vector machine and random forest; and (3) a fully autonomous offline (client-side) method based on a fine-tuned small language model (SLM), specifically T5-small model, executed in the client’s web browser. The third approach, which leverages SLMs, achieved the highest accuracy among all methods, with an exact matching accuracy of 0.93, Levenshtein similarity of 0.99, and recall-oriented understudy for gisting evaluation ROUGE-1 and ROUGE-L scores of 0.98. Crucially, this client-side computation strategy reduces the load on backend servers by offloading processing to the user’s device, eliminating the need for server-based inference. These results highlight the feasibility of browser-executable models for AWebGIS solutions.
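Two of the metrics reported above, exact-match accuracy and Levenshtein similarity, are easy to compute by hand. The sketch below assumes the similarity is normalized as 1 minus the edit distance divided by the longer string's length; the abstract does not state which normalization the authors used, and the example command strings are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    # Assumed normalization: 1 - distance / length of the longer string.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip() == reference.strip())


# Hypothetical model output versus reference command.
pred = "zoomTo(layer='roads')"
ref = "zoomTo(layer='roads')"
print(exact_match(pred, ref), levenshtein_similarity(pred, "zoomTo(layer='road')"))
```

Exact match is strict, so a single missing character scores 0, while the Levenshtein similarity stays close to 1. That difference is consistent with the gap between the 0.93 and 0.99 figures in the abstract.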
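[NLP-64] Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History
[Quick read]: This paper addresses the lack of behavioral consistency in large language models (LLMs) intended for safe deployment, in particular the instability of their personality-like traits. Even models with more than 400B parameters show substantial response variability (SD 0.4), and interventions expected to stabilize behavior (chain-of-thought reasoning, detailed persona instructions, inclusion of conversation history) can instead increase it. The key contribution is PERSIST, a systematic evaluation framework covering 500,000+ responses that combines traditional and LLM-adapted personality instruments and varies prompt structure and reasoning mode. It shows that behavioral inconsistency at the personality level is widespread and persistent in current LLMs, which suggests that personality-based alignment strategies may be fundamentally inadequate.
Link: https://arxiv.org/abs/2508.04826
Authors: Tommaso Tosato,Saskia Helbling,Yorguin-Jose Mantilla-Ramos,Mahmood Hegazy,Alberto Tosato,David John Lemay,Irina Rish,Guillaume Dumas
Affiliations: University of Montreal; Mila - Quebec AI Institute; Google Brain; Meta AI; University of Toronto; McGill University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: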
Abstract:Large language models require consistent behavioral patterns for safe deployment, yet their personality-like traits remain poorly understood. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25+ open-source models (1B-671B parameters) across 500,000+ responses. Using traditional (BFI-44, SD3) and novel LLM-adapted personality instruments, we systematically vary question order, paraphrasing, personas, and reasoning modes. Our findings challenge fundamental deployment assumptions: (1) Even 400B+ models exhibit substantial response variability (SD 0.4); (2) Minor prompt reordering alone shifts personality measurements by up to 20%; (3) Interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed personas instruction, inclusion of conversation history, can paradoxically increase variability; (4) LLM-adapted instruments show equal instability to human-centric versions, confirming architectural rather than translational limitations. This persistent instability across scales and mitigation strategies suggests current LLMs lack the foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that personality-based alignment strategies may be fundamentally inadequate.
[NLP-65] Pitch Accent Detection improves Pretrained Automatic Speech Recognition
[Quick read]: This paper addresses the automatic speech recognition (ASR) performance that is lost when pretrained speech models, fine-tuned with limited resources, ignore prosodic cues such as pitch accent. The solution is a jointly trained ASR and pitch accent detection model that explicitly models pitch accent during training. The pitch accent detector closes the F1-score gap to the state of the art by 41%, and joint training reduces word error rate (WER) by 28.3% on LibriSpeech, which shows the value of retaining or re-learning prosodic information in speech models.
Link: https://arxiv.org/abs/2508.04814
Authors: David Sasu,Natalie Schluter
Affiliations: ITU; Apple
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:We show that the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. The pitch accent detection component of our model achieves a significant improvement on the state-of-the-art for the task, closing the gap in F1-score by 41%. Additionally, joint training decreases ASR WER by 28.3% on LibriSpeech under limited-resource fine-tuning. With these results, we show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent.
[NLP-66] Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
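[Quick read]: This paper addresses the inequity that standard tokenizer-learning algorithms such as Byte Pair Encoding (BPE) introduce in multilingual NLP through their frequency-based objectives. These algorithms optimize compression for the languages that dominate the training data, leaving lower-resource languages with tokenizations that are longer, morphologically implausible, or full of UNK tokens, which amplifies computational and financial inequalities between language communities. The proposed Parity-aware BPE changes the merge rule: at every merge step it maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity in token counts. Experiments show a negligible effect on the global compression rate and on downstream language-model performance, with markedly more balanced token counts across languages.
Link: https://arxiv.org/abs/2508.04796
Authors: Negar Foroutan,Clara Meister,Debjit Paul,Joel Niklaus,Sina Ahmadi,Antoine Bosselut,Rico Sennrich
Affiliations: EPFL; ETH Zurich; Niklaus.ai; University of Zurich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: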
Abstract:Tokenization is the first – and often least scrutinized – step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with UNK placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE leads to more equitable token counts across languages, with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks.
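The merge rule described above can be illustrated with a toy Python sketch. It assumes that "worst-compressed" means the highest ratio of current tokens to original characters, and that the merge chosen is simply the most frequent adjacent pair in that language's corpus. Both are simplifications for illustration; the paper's actual objective and implementation may differ.

```python
from collections import Counter


def pair_counts(words):
    """Count adjacent symbol pairs across a list of tokenized words."""
    counts = Counter()
    for w in words:
        counts.update(zip(w, w[1:]))
    return counts


def apply_merge(words, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    merged_words = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged_words.append(out)
    return merged_words


def parity_aware_bpe(corpora, num_merges):
    """corpora maps a language code to a non-empty list of words (strings)."""
    state = {lang: [list(w) for w in words] for lang, words in corpora.items()}
    n_chars = {lang: sum(len(w) for w in words) for lang, words in corpora.items()}
    merges = []
    for _ in range(num_merges):
        # Tokens per character: a higher value means worse compression (assumed metric).
        rates = {lang: sum(len(w) for w in ws) / n_chars[lang] for lang, ws in state.items()}
        pair = None
        for lang in sorted(state, key=rates.get, reverse=True):
            counts = pair_counts(state[lang])
            if counts:  # fall back to the next language if nothing is left to merge
                pair = counts.most_common(1)[0][0]
                break
        if pair is None:
            break
        merges.append(pair)
        # The learned merge is shared, so it is applied to every language.
        state = {lang: apply_merge(ws, pair) for lang, ws in state.items()}
    return merges, state


corpora = {"en": ["aaaa", "aaaa", "aaaa"], "xx": ["bcbc"]}
merges, state = parity_aware_bpe(corpora, num_merges=2)
```

In this toy corpus, plain frequency-based BPE would spend its second merge on `("aa", "aa")` because that pair is more frequent overall. The parity-aware rule instead picks `("b", "c")`, because language `xx` is still uncompressed.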
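[NLP-67] Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM
[Quick read]: This paper addresses the lack of structured speaker-characteristic annotations (such as age, gender, and emotion) in the post-processing of dialogue transcripts, with the aim of making transcripts semantically richer and more interpretable. The approach couples frozen audio foundation models (such as Whisper or WavLM) with a frozen LLAMA language model and aligns audio and language representations through lightweight connectors, so speaker attributes can be inferred without task-specific fine-tuning of either model while modularity and speed are preserved. The study also finds that a frozen LLAMA model can compare x-vectors directly, reaching an Equal Error Rate (EER) of 8.8% in some scenarios.
Link: https://arxiv.org/abs/2508.04795
Authors: Thomas Thebaud,Yen-Ju Lu,Matthew Wiesner,Peter Viechnicki,Najim Dehak
Affiliations: Johns Hopkins University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted in the 2025 IEEE Automatic Speech Recognition and Understanding Workshop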
Abstract:In dialogue transcription pipelines, Large Language Models (LLMs) are frequently employed in post-processing to improve grammar, punctuation, and readability. We explore a complementary post-processing step: enriching transcribed dialogues by adding metadata tags for speaker characteristics such as age, gender, and emotion. Some of the tags are global to the entire dialogue, while some are time-variant. Our approach couples frozen audio foundation models, such as Whisper or WavLM, with a frozen LLAMA language model to infer these speaker attributes, without requiring task-specific fine-tuning of either model. Using lightweight, efficient connectors to bridge audio and language representations, we achieve competitive performance on speaker profiling tasks while preserving modularity and speed. Additionally, we demonstrate that a frozen LLAMA model can compare x-vectors directly, achieving an Equal Error Rate of 8.8% in some scenarios.
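For readers unfamiliar with the Equal Error Rate quoted above, the following sketch shows one simple way to estimate it from verification scores. The threshold sweep and the averaging at the closest crossing are a common approximation, not the authors' evaluation code, and the scores are invented.

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER: sweep thresholds and return the error rate where the
    false acceptance rate (FAR) and false rejection rate (FRR) are closest."""
    best_gap, eer = float("inf"), None
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(s < t for s in genuine) / len(genuine)     # same-speaker pairs rejected
        far = sum(s >= t for s in impostor) / len(impostor)  # different-speaker pairs accepted
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up similarity scores for same-speaker and different-speaker trials.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine_scores, impostor_scores))
```

An EER of 8.8% therefore means that, at the operating point where both error types are equal, about 8.8% of same-speaker pairs are rejected and about 8.8% of different-speaker pairs are accepted.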
[NLP-68] Prescriptive Agents based on Rag for Automated Maintenance (PARAM)
[Quick read]: This paper addresses the gap between condition monitoring and actionable maintenance planning in industrial machinery, where traditional anomaly detection alone yields no executable maintenance advice. The solution is an intelligent system based on a Large Language Model (LLM) that serializes bearing vibration frequency features (BPFO, BPFI, BSF, FTF) into natural language, classifies fault type and severity, and uses a multi-agent component with vector embeddings and semantic search over maintenance manuals, plus web search, to generate structured recommendations. These cover immediate actions, inspection checklists, corrective measures, parts requirements, and timelines.
Link: https://arxiv.org/abs/2508.04714
Authors: Chitranshu Harbola,Anupam Purwar
Affiliations: AIGurukul; Independent
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Signal Processing (eess.SP)
Comments:
Abstract:Industrial machinery maintenance requires timely intervention to prevent catastrophic failures and optimize operational efficiency. This paper presents an integrated Large Language Model (LLM)-based intelligent system for prescriptive maintenance that extends beyond traditional anomaly detection to provide actionable maintenance recommendations. Building upon our prior LAMP framework for numerical data analysis, we develop a comprehensive solution that combines bearing vibration frequency analysis with multi-agentic generation for intelligent maintenance planning. Our approach serializes bearing vibration data (BPFO, BPFI, BSF, FTF frequencies) into natural language for LLM processing, enabling few-shot anomaly detection with high accuracy. The system classifies fault types (inner race, outer race, ball/roller, cage faults) and assesses severity levels. A multi-agentic component processes maintenance manuals using vector embeddings and semantic search, while also conducting web searches to retrieve comprehensive procedural knowledge and access up-to-date maintenance practices for more accurate and in-depth recommendations. The Gemini model then generates structured maintenance recommendations that include immediate actions, inspection checklists, corrective measures, parts requirements, and timeline specifications. Experimental validation on bearing vibration datasets demonstrates effective anomaly detection and contextually relevant maintenance guidance. The system successfully bridges the gap between condition monitoring and actionable maintenance planning, providing industrial practitioners with intelligent decision support. This work advances the application of LLMs in industrial maintenance, offering a scalable framework for prescriptive maintenance across machinery components and industrial sectors.
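The four characteristic frequencies named in this abstract follow from bearing geometry through the standard kinematic formulas, and serializing them into text is straightforward. The sketch below is illustrative only: the bearing dimensions, the amplitude values, and the sentence template are hypothetical and are not the format used by the authors.

```python
import math


def bearing_fault_frequencies(shaft_hz, n_balls, ball_d, pitch_d, contact_angle_deg=0.0):
    """Standard kinematic fault frequencies for a rolling-element bearing (in Hz)."""
    ratio = (ball_d / pitch_d) * math.cos(math.radians(contact_angle_deg))
    return {
        "FTF": 0.5 * shaft_hz * (1 - ratio),                            # cage (fundamental train)
        "BPFO": 0.5 * n_balls * shaft_hz * (1 - ratio),                 # outer race
        "BPFI": 0.5 * n_balls * shaft_hz * (1 + ratio),                 # inner race
        "BSF": 0.5 * (pitch_d / ball_d) * shaft_hz * (1 - ratio ** 2),  # ball spin
    }


def serialize(freqs, amplitudes):
    # Hypothetical sentence template for passing readings to an LLM prompt.
    parts = [
        f"{name} at {freqs[name]:.1f} Hz has amplitude {amplitudes[name]:.3f}"
        for name in ("BPFO", "BPFI", "BSF", "FTF")
    ]
    return "Bearing vibration reading: " + "; ".join(parts) + "."


# Made-up geometry: 30 Hz shaft speed, 9 balls, 7.94 mm ball, 39.04 mm pitch diameter.
freqs = bearing_fault_frequencies(30.0, 9, 7.94, 39.04)
text = serialize(freqs, {"BPFO": 0.012, "BPFI": 0.087, "BSF": 0.004, "FTF": 0.002})
print(text)
```

A useful sanity check on the formulas is that BPFO plus BPFI equals the number of rolling elements times the shaft frequency.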
[NLP-69] Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages INTERSPEECH2025
[Quick read]: This paper addresses the limited performance of automatic speech recognition (ASR) for low-resource languages, in particular how large language models (LLMs) can be used effectively when training data is scarce. It studies the SLAM-ASR framework, in which a trainable lightweight projector connects a speech encoder to an LLM and maps speech features into the LLM's input space. The results show that mono- or multilingual projectors pretrained on high-resource languages substantially reduce the impact of data scarcity, especially with small training sets. Experiments combine multilingual LLMs (EuroLLM, Salamandra) with whisper-large-v3-turbo and offer guidance for optimizing Speech LLMs for low-resource and multilingual settings.
Link: https://arxiv.org/abs/2508.05149
Authors: Seraphina Fong,Marco Matassoni,Alessio Brutti
Affiliations: University of Trento; Fondazione Bruno Kessler
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at Interspeech 2025. 5 pages, 2 figures, 3 tables
Abstract:Large language models (LLMs) have demonstrated potential in handling spoken inputs for high-resource languages, reaching state-of-the-art performance in various tasks. However, their applicability is still less explored in low-resource settings. This work investigates the use of Speech LLMs for low-resource Automatic Speech Recognition using the SLAM-ASR framework, where a trainable lightweight projector connects a speech encoder and an LLM. Firstly, we assess training data volume requirements to match Whisper-only performance, re-emphasizing the challenges of limited data. Secondly, we show that leveraging mono- or multilingual projectors pretrained on high-resource languages reduces the impact of data scarcity, especially with small training sets. Using multilingual LLMs (EuroLLM, Salamandra) with whisper-large-v3-turbo, we evaluate performance on several public benchmarks, providing insights for future research on optimizing Speech LLMs for low-resource languages and multilinguality.
[NLP-70] Federal Reserve Communication and the COVID-19 Pandemic
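[Quick read]: This paper asks how central bank communication, specifically that of the Federal Reserve, evolves during major economic crises and adapts to exceptional economic circumstances. The method combines specialized dictionaries (for COVID-19, unconventional monetary policy (UMP), and financial stability) with sentiment analysis and topic modeling to compare the content, sentiment, and timing of Fed communication during the COVID-19 pandemic, the dot-com crisis, and the global financial crisis. The study finds that during the pandemic the Fed focused more on financial stability, market volatility, social welfare, and UMP, and that its communication was more reactive than in earlier crises. Declining financial-stability sentiment in interest rate announcements and minutes anticipated subsequent accommodative policy decisions, and communicating about UMP has become the "new normal" since the global financial crisis, which reflects an institutional adaptation in communication strategy.
Link: https://arxiv.org/abs/2508.04830
Authors: Jonathan Benchimol,Sophia Kazinnik,Yossi Saadon
Affiliations: Unknown
Categories: General Economics (econ.GN); Computation and Language (cs.CL); Information Theory (cs.IT); Applications (stat.AP); Machine Learning (stat.ML)
Comments: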
Abstract:In this study, we examine the Federal Reserve’s communication strategies during the COVID-19 pandemic, comparing them with communication during previous periods of economic stress. Using specialized dictionaries tailored to COVID-19, unconventional monetary policy (UMP), and financial stability, combined with sentiment analysis and topic modeling techniques, we identify a distinct focus in Fed communication during the pandemic on financial stability, market volatility, social welfare, and UMP, characterized by notable contextual uncertainty. Through comparative analysis, we juxtapose the Fed’s communication during the COVID-19 crisis with its responses during the dot-com and global financial crises, examining content, sentiment, and timing dimensions. Our findings reveal that Fed communication and policy actions were more reactive to the COVID-19 crisis than to previous crises. Additionally, declining sentiment related to financial stability in interest rate announcements and minutes anticipated subsequent accommodative monetary policy decisions. We further document that communicating about UMP has become the “new normal” for the Fed’s Federal Open Market Committee meeting minutes and Chairman’s speeches since the Global Financial Crisis, reflecting an institutional adaptation in communication strategy following periods of economic distress. These findings contribute to our understanding of how central bank communication evolves during crises and how communication strategies adapt to exceptional economic circumstances.
Computer Vision
[CV-0] FaceAnonyMixer: Cancelable Faces via Identity Consistent Latent Space Mixing
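[Quick read]: This paper addresses the privacy risks raised by advances in face recognition (FR), and in particular the failure of existing face anonymization methods to meet the three core requirements of biometric template protection: revocability, unlinkability, and irreversibility. It proposes FaceAnonyMixer, a cancelable face generation framework built on the latent space of a pretrained generative model. The latent code of a real face image is irreversibly mixed with a synthetic code derived from a revocable key, and the mixed code is refined with a multi-objective loss. The resulting high-quality anonymized faces satisfy cancelable biometric requirements and can be matched by existing FR systems without modification, combining strong privacy protection with high recognition accuracy.
Link: https://arxiv.org/abs/2508.05636
Authors: Mohammed Talha Alam,Fahad Shamshad,Fakhri Karray,Karthik Nandakumar
Affiliations: Mohamed Bin Zayed University of Artificial Intelligence; University of Waterloo; Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the International Joint Conference on Biometrics (IJCB) 2025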
Abstract:Advancements in face recognition (FR) technologies have amplified privacy concerns, necessitating methods that protect identity while maintaining recognition utility. Existing face anonymization methods typically focus on obscuring identity but fail to meet the requirements of biometric template protection, including revocability, unlinkability, and irreversibility. We propose FaceAnonyMixer, a cancelable face generation framework that leverages the latent space of a pre-trained generative model to synthesize privacy-preserving face images. The core idea of FaceAnonyMixer is to irreversibly mix the latent code of a real face image with a synthetic code derived from a revocable key. The mixed latent code is further refined through a carefully designed multi-objective loss to satisfy all cancelable biometric requirements. FaceAnonyMixer is capable of generating high-quality cancelable faces that can be directly matched using existing FR systems without requiring any modifications. Extensive experiments on benchmark datasets demonstrate that FaceAnonyMixer delivers superior recognition accuracy while providing significantly stronger privacy protection, achieving over an 11% gain on commercial API compared to recent cancelable biometric methods. Code is available at: this https URL.
[CV-1] Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
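[Quick read]: This paper addresses the difficulty of unifying policy learning, evaluation, and simulation in robotic manipulation, a bottleneck for instruction-driven, general-purpose embodied intelligence. It introduces Genie Envisioner (GE), a unified world foundation platform. Its core, GE-Base, is an instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. GE-Act, a lightweight flow-matching decoder, maps latent representations to executable action trajectories, enabling policy inference that generalizes across embodiments with minimal supervision. GE-Sim, an action-conditioned neural simulator, produces high-fidelity rollouts for closed-loop policy development, and the EWMBench benchmark suite measures visual fidelity, physical consistency, and instruction-action alignment.
Link: https://arxiv.org/abs/2508.05635
Authors: Yue Liao,Pengfei Zhou,Siyuan Huang,Donglin Yang,Shengcong Chen,Yuxin Jiang,Yue Hu,Jingbin Cai,Si Liu,Jianlan Luo,Liliang Chen,Shuicheng Yan,Maoqing Yao,Guanghui Ren
Affiliations: AgiBot Genie Team; NUS LV-Lab; BUAA
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL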
Abstract:We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.
[CV-2] Towards Generalizable Safety in Crowd Navigation via Conformal Uncertainty Handling
[Quick read]: This paper addresses the sharp performance drop that mobile robots trained with reinforcement learning in crowded environments suffer in out-of-distribution scenarios. The core challenge is making the navigation policy robust to unknown or changing crowd dynamics. The solution estimates prediction uncertainty with adaptive conformal inference, adds these estimates to the agent's observations, and uses constrained reinforcement learning to guide the robot's behavior, which improves adaptation to distribution shift while maintaining safety.
Link: https://arxiv.org/abs/2508.05634
Authors: Jianpeng Yao,Xiaopan Zhang,Yu Xia,Zejin Wang,Amit K. Roy-Chowdhury,Jiachen Li
Affiliations: University of California, Riverside
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 9th Conference on Robot Learning (CoRL 2025); Project website: this https URL . arXiv admin note: text overlap with arXiv:2407.17460
Abstract:Mobile robots navigating in crowds trained using reinforcement learning are known to suffer performance degradation when faced with out-of-distribution scenarios. We propose that by properly accounting for the uncertainties of pedestrians, a robot can learn safe navigation policies that are robust to distribution shifts. Our method augments agent observations with prediction uncertainty estimates generated by adaptive conformal inference, and it uses these estimates to guide the agent’s behavior through constrained reinforcement learning. The system helps regulate the agent’s actions and enables it to adapt to distribution shifts. In the in-distribution setting, our approach achieves a 96.93% success rate, which is over 8.80% higher than the previous state-of-the-art baselines with over 3.72 times fewer collisions and 2.43 times fewer intrusions into ground-truth human future trajectories. In three out-of-distribution scenarios, our method shows much stronger robustness when facing distribution shifts in velocity variations, policy changes, and transitions from individual to group dynamics. We deploy our method on a real robot, and experiments show that the robot makes safe and robust decisions when interacting with both sparse and dense crowds. Our code and videos are available on this https URL.
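As background for the uncertainty estimates mentioned above, here is a minimal sketch of the commonly cited adaptive conformal inference update. The miscoverage level is nudged up after a covered step and down after a miss, so the prediction region widens when recent predictions have been failing. The step size `gamma`, the use of absolute prediction errors as scores, and the function names are assumptions; the sketch does not reproduce the paper's pipeline.

```python
import math


def aci_step(alpha_t, target_alpha, covered, gamma=0.05):
    """One adaptive conformal inference update:
    alpha_{t+1} = alpha_t + gamma * (target_alpha - err_t),
    where err_t is 1 if the last prediction region missed the true value, else 0."""
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err)


def region_radius(past_errors, alpha_t):
    """Radius of the prediction region: the empirical (1 - alpha_t) quantile
    of past absolute prediction errors (for example, metres of position error)."""
    if alpha_t <= 0:
        return float("inf")  # demand full coverage: unbounded region
    if alpha_t >= 1:
        return 0.0
    ordered = sorted(past_errors)
    k = min(len(ordered) - 1, max(0, math.ceil((1 - alpha_t) * len(ordered)) - 1))
    return ordered[k]


alpha = 0.1
alpha_after_miss = aci_step(alpha, 0.1, covered=False)  # shrinks alpha, so the region widens
alpha_after_hit = aci_step(alpha, 0.1, covered=True)    # grows alpha slightly
radius = region_radius([4.0, 1.0, 3.0, 2.0], 0.25)
```

In a crowd-navigation setting, a radius like this could be attached to each predicted pedestrian position as an extra observation feature, which is the role the abstract assigns to the uncertainty estimates.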
[CV-3] GAP: Gaussianize Any Point Clouds with Text Guidance ICCV2025
[Quick read]: This paper addresses the open problem of generating high-quality 3D Gaussians directly from colorless 3D point clouds. It proposes GAP, which uses a multi-view optimization framework with a depth-aware image diffusion model to synthesize appearances that are consistent across viewpoints. A surface-anchoring mechanism constrains the Gaussians to lie on the surfaces of the 3D shape for geometric accuracy, and a diffuse-based inpainting strategy completes hard-to-observe regions, yielding a high-fidelity conversion from point clouds to Gaussians.
Link: https://arxiv.org/abs/2508.05631
Authors: Weiqi Zhang,Junsheng Zhou,Haotian Geng,Wenyuan Zhang,Yu-Shen Liu
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Project page: this https URL
Abstract:3D Gaussian Splatting (3DGS) has demonstrated its advantages in achieving fast and high-quality rendering. As point clouds serve as a widely-used and easily accessible form of 3D representation, bridging the gap between point clouds and Gaussians becomes increasingly important. Recent studies have explored how to convert colored points into Gaussians, but directly generating Gaussians from colorless 3D point clouds remains an unsolved challenge. In this paper, we propose GAP, a novel approach that gaussianizes raw point clouds into high-fidelity 3D Gaussians with text guidance. Our key idea is to design a multi-view optimization framework that leverages a depth-aware image diffusion model to synthesize consistent appearances across different viewpoints. To ensure geometric accuracy, we introduce a surface-anchoring mechanism that effectively constrains Gaussians to lie on the surfaces of 3D shapes during optimization. Furthermore, GAP incorporates a diffuse-based inpainting strategy that specifically targets the completion of hard-to-observe regions. We evaluate GAP on the Point-to-Gaussian generation task across varying complexity levels, from synthetic point clouds to challenging real-world scans, and even large-scale scenes. Project Page: this https URL.
[CV-4] MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
[Quick read]: This paper addresses the weak generalization of current video object segmentation (VOS) methods to complex real-world scenes. Existing benchmarks such as DAVIS and YouTube-VOS mostly contain salient, isolated, dominant objects and do not reflect real-world diversity. The authors present MOSEv2, a substantially harder successor to MOSEv1 that adds more frequent object disappearance and reappearance, severe occlusion and crowding, smaller objects, adverse weather (rain, snow, fog), low-light scenes (nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (shadows, reflections), and scenarios requiring external knowledge. Benchmarking 20 representative VOS methods under 5 settings and 9 video object tracking methods shows consistent performance drops (for example, SAM2 falls from 76.4% to 50.9%), confirming MOSEv2 as a more challenging benchmark.
Link: https://arxiv.org/abs/2508.05630
Authors: Henghui Ding,Kaining Ying,Chang Liu,Shuting He,Xudong Jiang,Yu-Gang Jiang,Philip H.S. Torr,Song Bai
Affiliations: Fudan University; ByteDance Inc.; Shanghai University of Finance and Economics; Nanyang Technological University; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MOSEv2 Dataset Report, Project Page: this https URL
Abstract:Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To advance VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on the strengths and limitations of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and over 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces significantly greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), scenarios requiring external knowledge, etc. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks. These results highlight that despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities. MOSEv2 is publicly available at this https URL.
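VOS scores of the kind quoted above are usually built from a region term J (the intersection-over-union of predicted and ground-truth masks) and a boundary term F. The sketch below computes only J for a single frame, using nested lists as masks; the boundary term is omitted, and the abstract does not specify the benchmark's exact evaluation protocol.

```python
def region_similarity(pred_mask, gt_mask):
    """Jaccard index (intersection over union) of two binary masks
    given as equally sized nested lists of 0/1 values."""
    inter = union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            inter += bool(p) and bool(g)
            union += bool(p) or bool(g)
    # By convention here, two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


pred = [[1, 1],
        [0, 0]]
gt = [[1, 0],
      [1, 0]]
j = region_similarity(pred, gt)  # 1 shared pixel out of 3 covered pixels
```

Averaging this value over all frames and objects gives the J part of a sequence-level score.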
[CV-5] Physically Controllable Relighting of Photographs SIGGRAPH2025
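[Quick read]: This paper addresses illumination editing for in-the-wild images: achieving physically accurate, controllable relighting without manual annotation or prior information. Traditional methods struggle to combine physical accuracy with photorealism, while generative AI methods produce high-quality images but lack explicit control over lighting parameters. The solution is a self-supervised pipeline. Monocular estimates of geometry and intrinsic components are used to build a colored mesh of the scene, a path-tracing engine renders an approximate result under the new lighting, and a feed-forward neural renderer turns that rendering into the final photorealistic relighting. A differentiable rendering process reconstructs the original scene illumination, so the neural renderer can be trained on raw image collections without labels. The method brings the explicit light control of 3D tools such as Blender to in-the-wild relighting.
Link: https://arxiv.org/abs/2508.05626
Authors: Chris Careaga,Yağız Aksoy
Affiliations: Simon Fraser University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Proc. SIGGRAPH 2025, 10 pages, 9 figures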
Abstract:We present a self-supervised approach to in-the-wild image relighting that enables fully controllable, physically based illumination editing. We achieve this by combining the physical accuracy of traditional rendering with the photorealistic appearance made possible by neural rendering. Our pipeline works by inferring a colored mesh representation of a given scene using monocular estimates of geometry and intrinsic components. This representation allows users to define their desired illumination configuration in 3D. The scene under the new lighting can then be rendered using a path-tracing engine. We send this approximate rendering of the scene through a feed-forward neural renderer to predict the final photorealistic relighting result. We develop a differentiable rendering process to reconstruct in-the-wild scene illumination, enabling self-supervised training of our neural renderer on raw image collections. Our method represents a significant step in bringing the explicit physical control over lights available in typical 3D computer graphics tools, such as Blender, to in-the-wild relighting.
[CV-6] Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity
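[Quick read]: This paper addresses the limitations of current quality assessment for generated 3D content: existing methods rely mainly on image-level metrics and evaluate only at the object level, so they struggle to capture spatial coherence, material authenticity, and high-fidelity local detail. The key to the solution is Hi3DEval, a hierarchical evaluation framework that combines object-level and part-level evaluation to support both holistic multi-dimensional judgments and fine-grained analysis. The authors also build Hi3DBench, a large-scale annotated dataset, and an automated scoring system based on hybrid 3D representations, which uses video-based representations to better model spatio-temporal consistency and pretrained 3D features for part-level perception. This improves alignment with human preference and offers a scalable automatic alternative to manual evaluation.
Link: https://arxiv.org/abs/2508.05609
Authors: Yuhan Zhang,Long Zhuo,Ziyang Chu,Tong Wu,Zhibing Li,Liang Pan,Dahua Lin,Ziwei Liu
Affiliations: Fudan University; Shanghai Artificial Intelligence Laboratory; Tsinghua University; Stanford University; The Chinese University of Hong Kong; S-Lab, Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Page: this https URL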
Abstract:Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging. Existing methods mainly rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high-fidelity local details. 1) To address these challenges, we introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. It combines both object-level and part-level evaluation, enabling holistic assessments across multiple dimensions as well as fine-grained quality analysis. Additionally, we extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism, focusing on attributes such as albedo, saturation, and metallicness. 2) To support this framework, we construct Hi3DBench, a large-scale dataset comprising diverse 3D assets and high-quality annotations, accompanied by a reliable multi-agent annotation pipeline. We further propose a 3D-aware automated scoring system based on hybrid 3D representations. Specifically, we leverage video-based representations for object-level and material-subject evaluations to enhance modeling of spatio-temporal consistency and employ pretrained 3D features for part-level perception. Extensive experiments demonstrate that our approach outperforms existing image-based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations. The project page is available at this https URL.
[CV-7] LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model
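[Quick read]: This paper addresses binary image-text relevancy evaluation in multimodal generative AI, that is, deciding whether a given text is relevant to an input image. The task matters for measuring response quality and ranking candidate responses, but it is hard because texts come in diverse formats and the definition of relevancy varies across scenarios. The key to the solution is to build the evaluator on Multimodal Large Language Models (MLLMs), which can flexibly handle complex text formats and take in additional task information. Specifically, the authors propose LLaVA-RE, which follows the LLaVA architecture and uses detailed task instructions and multimodal in-context samples, and they construct a new binary relevancy dataset covering various tasks. Experiments validate the effectiveness of the framework.
Link: https://arxiv.org/abs/2508.05602
Authors: Tao Sun,Oliver Liu,JinJin Li,Lan Ma
Affiliations: Stony Brook University; Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in the First Workshop of Evaluation of Multi-Modal Generation 2025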
Abstract:Multimodal generative AI usually involves generating image or text responses given inputs in another modality. The evaluation of image-text relevancy is essential for measuring response quality or ranking candidate responses. In particular, binary relevancy evaluation, i.e., "Relevant" vs. "Not Relevant", is a fundamental problem. However, this is a challenging task considering that texts have diverse formats and the definition of relevancy varies in different scenarios. We find that Multimodal Large Language Models (MLLMs) are an ideal choice to build such evaluators, as they can flexibly handle complex text formats and take in additional task information. In this paper, we present LLaVA-RE, a first attempt for binary image-text relevancy evaluation with MLLM. It follows the LLaVA architecture and adopts detailed task instructions and multimodal in-context samples. In addition, we propose a novel binary relevancy data set that covers various tasks. Experimental results validate the effectiveness of our framework.
[CV-8] WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction
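[Quick read]: This paper addresses the unsatisfactory trade-off between compression ratio and reconstruction fidelity in existing visual tokenizers for visual generation. The core solution is WeTok, a powerful and concise tokenizer with two key innovations: (1) Group-wise lookup-free Quantization (GQ), which partitions latent features into groups and applies lookup-free quantization to each group, yielding more scalable codebooks and markedly better reconstruction without extra memory or compute; (2) Generative Decoding (GD), which introduces a generative decoder with a prior over an extra noise variable, so that the model can probabilistically model the visual data distribution conditioned on discrete tokens and recover details even at high compression ratios. Experiments show that WeTok achieves the lowest zero-shot rFID (0.12) on the ImageNet 50k validation set and still outperforms the compared method at a compression ratio of 768.
Link: https://arxiv.org/abs/2508.05599
Authors: Shaobin Zhuang,Yiwei Guo,Canmiao Fu,Zhipeng Huang,Zeyue Tian,Ying Zhang,Chen Li,Yali Wang
Affiliations: Shanghai Jiao Tong University; WeChat Vision, Tencent Inc.; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Hong Kong University of Science and Technology; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 10 figures, 37 tables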
Abstract:Visual tokenizer is a critical component for vision generation. However, the existing tokenizers often face unsatisfactory trade-off between compression ratios and reconstruction fidelity. To fill this gap, we introduce a powerful and concise WeTok tokenizer, which surpasses the previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups, and perform lookup-free quantization for each group. As a result, GQ can efficiently overcome memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Different from prior tokenizers, we introduce a generative decoder with a prior of extra noise variable. In this case, GD can probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks show superior performance of our WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19). Furthermore, our highest compression model achieves a zero-shot rFID of 3.49 with a compression ratio of 768, outperforming Cosmos (384) 4.57 which has only 50% compression rate of ours. Code and models are available: this https URL.
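The group-wise quantization idea described above can be sketched in a few lines. This is a minimal illustration, assuming the common form of lookup-free quantization in which each latent channel is binarized by its sign and the sign pattern itself serves as the token index; the abstract does not give WeTok's exact formulation, and the function name `groupwise_lfq` and the toy input are hypothetical.

```python
import numpy as np

def groupwise_lfq(z, num_groups):
    """Quantize latent vectors z of shape (N, D) group by group, with no codebook lookup.

    Assumption: sign binarization per channel, as in generic lookup-free quantization.
    """
    n, d = z.shape
    assert d % num_groups == 0, "channel count must divide evenly into groups"
    group_dim = d // num_groups
    # Binarize every channel to -1 / +1 (zero maps to +1).
    codes = np.where(z >= 0, 1.0, -1.0)
    # Read each group's sign pattern as a binary number to obtain its token index.
    bits = (codes > 0).astype(np.int64).reshape(n, num_groups, group_dim)
    weights = 2 ** np.arange(group_dim)
    indices = (bits * weights).sum(axis=-1)
    return codes, indices

# Toy latent with 6 channels split into 2 groups of 3 channels each.
z = np.array([[0.3, -1.2, 0.7, 0.1, -0.4, -0.9]])
codes, indices = groupwise_lfq(z, num_groups=2)
```

In this toy setup each group has 2^3 = 8 implicit codes, so two groups give an effective vocabulary of 64 combinations without storing or searching a codebook table.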
[CV-9] DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition ACM-MM2025
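[Quick read]: This paper addresses two key challenges in Open-Vocabulary Multi-Label Recognition (OV-MLR): insufficient fine-grained intra-class localization under weak supervision, and the lack of effective use of structured inter-class relational knowledge, both of which limit recognition of unseen categories. The core of the solution is the Dual Adaptive Refinement Transfer (DART) framework, with two key innovations: 1) an Adaptive Refinement Module (ARM) combined with a Weakly Supervised Patch Selecting (WPS) loss, which refines intra-class features discriminatively using only image-level labels; 2) an Adaptive Transfer Module (ATM), which builds a Class Relationship Graph (CRG) from structured knowledge mined from a Large Language Model (LLM) and uses a graph attention network to adaptively transfer relational information between classes. According to the authors, DART is the first OV-MLR framework to combine weakly supervised intra-class refinement with inter-class transfer driven by external LLM-derived relational knowledge, and it substantially improves overall performance.
Link: https://arxiv.org/abs/2508.05585
Authors: Haijing Liu,Tao Pu,Hefeng Wu,Keze Wang,Liang Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2025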
Abstract:Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image, requiring both precise intra-class localization to pinpoint objects and effective inter-class reasoning to model complex category dependencies. While Vision-Language Pre-training (VLP) models offer a strong open-vocabulary foundation, they often struggle with fine-grained localization under weak supervision and typically fail to explicitly leverage structured relational knowledge beyond basic semantics, limiting performance especially for unseen classes. To overcome these limitations, we propose the Dual Adaptive Refinement Transfer (DART) framework. DART enhances a frozen VLP backbone via two synergistic adaptive modules. For intra-class refinement, an Adaptive Refinement Module (ARM) refines patch features adaptively, coupled with a novel Weakly Supervised Patch Selecting (WPS) loss that enables discriminative localization using only image-level labels. Concurrently, for inter-class transfer, an Adaptive Transfer Module (ATM) leverages a Class Relationship Graph (CRG), constructed using structured knowledge mined from a Large Language Model (LLM), and employs graph attention network to adaptively transfer relational information between class representations. DART is the first framework, to our knowledge, to explicitly integrate external LLM-derived relational knowledge for adaptive inter-class transfer while simultaneously performing adaptive intra-class refinement under weak supervision for OV-MLR. Extensive experiments on challenging benchmarks demonstrate that our DART achieves new state-of-the-art performance, validating its effectiveness.
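To make the inter-class transfer step more concrete, here is a minimal sketch of one single-head graph attention step over class embeddings. It assumes the standard graph attention formulation (LeakyReLU scoring followed by a softmax over graph neighbors); the abstract does not specify DART's layer design, and the class names, random weights, and adjacency matrix below are all hypothetical.

```python
import numpy as np

def graph_attention_layer(h, adj, W, a, negative_slope=0.2):
    """One single-head graph attention step.

    h: (N, F) class embeddings, adj: (N, N) 0/1 adjacency with self-loops,
    W: (F, F_out) projection, a: (2 * F_out,) attention vector.
    """
    z = h @ W
    f_out = z.shape[1]
    # Score e[i, j] = LeakyReLU(a^T [z_i || z_j]), split into two dot products.
    scores = (z @ a[:f_out])[:, None] + (z @ a[f_out:])[None, :]
    scores = np.where(scores > 0, scores, negative_slope * scores)
    # Mask out class pairs that are not connected in the relationship graph.
    scores = np.where(adj > 0, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    attention = np.exp(scores)
    attention = attention / attention.sum(axis=1, keepdims=True)
    return attention @ z, attention

rng = np.random.default_rng(0)
class_names = ["dog", "leash", "airplane"]  # hypothetical classes
h = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
a = rng.normal(size=(4,))
# Hypothetical graph: "dog" and "leash" are related, "airplane" is isolated.
adj = np.array([[1, 1, 0],
                [1, 1, 0],
                [0, 0, 1]])
updated, attention = graph_attention_layer(h, adj, W, a)
```

Because "airplane" has no neighbors other than itself, its updated embedding is just its own projection, while "dog" and "leash" exchange information in proportion to their attention weights.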
[CV-10] Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis
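[Quick read]: This paper addresses the difficulty of obtaining high-quality, diverse, and scalable datasets for generative AI, in particular the inefficiency and limited scale caused by the current reliance on manual scene construction. The key to the solution is Follow-Your-Instruction, an automated data synthesis framework driven by a Multimodal Large Language Model (MLLM). MLLM-Collector first gathers assets and their descriptions from multimodal inputs; MLLM-Generator and MLLM-Optimizer then handle 3D layout construction and semantic refinement, respectively; finally, MLLM-Planner generates temporally coherent future frames. Together these modules enable efficient synthesis of 2D, 3D, and 4D data that significantly improves downstream model performance.
Link: https://arxiv.org/abs/2508.05580
Authors: Kunyu Feng,Yue Ma,Xinhua Zhang,Boshi Liu,Yikuang Yuluo,Yinhan Zhang,Runtao Liu,Hongyu Liu,Zhiyuan Qin,Shanhui Mo,Qifeng Chen,Zeyu Wang
Affiliations: HKUST(GZ); HKUST; Tsinghua University; Peking University; Chongqing University; Beijing Innovation Center of Humanoid Robotics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: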
Abstract:With the growing demands of AI-generated content (AIGC), the need for high-quality, diverse, and scalable data has become increasingly crucial. However, collecting large-scale real-world data remains costly and time-consuming, hindering the development of downstream applications. While some works attempt to collect task-specific data via a rendering process, most approaches still rely on manual scene construction, limiting their scalability and accuracy. To address these challenges, we propose Follow-Your-Instruction, a Multimodal Large Language Model (MLLM)-driven framework for automatically synthesizing high-quality 2D, 3D, and 4D data. Our Follow-Your-Instruction first collects assets and their associated descriptions through multimodal inputs using the MLLM-Collector. Then it constructs 3D layouts, and leverages Vision-Language Models (VLMs) for semantic refinement through multi-view scenes with the MLLM-Generator and MLLM-Optimizer, respectively. Finally, it uses MLLM-Planner to generate temporally coherent future frames. We evaluate the quality of the generated data through comprehensive experiments on the 2D, 3D, and 4D generative tasks. The results show that our synthetic data significantly boosts the performance of existing baseline models, demonstrating Follow-Your-Instruction’s potential as a scalable and effective data engine for generative intelligence.
[CV-11] X-VFL: A New Vertical Federated Learning Framework with Cross Completion and Decision Subspace Alignment
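[Quick read]: This paper addresses two key problems in Vertical Federated Learning (VFL): first, data samples must be perfectly aligned across all clients (missing features are not allowed); second, inference requires all clients to participate jointly, so a single client cannot run inference independently. To this end the authors propose the X-VFL framework, built around two modules: Cross Completion (XCom) and Decision Subspace Alignment (DS-Align). XCom reconstructs the missing features of non-aligned samples using information from other clients, which makes data with partially missing features usable. DS-Align aligns local features with global features inside the decision subspace, so that each client can perform inference on its own. The approach markedly improves VFL's ability to handle incomplete data and distributed deployment in practice.
Link: https://arxiv.org/abs/2508.05568
Authors: Qinghua Yao,Xiangrui Xu,Zhize Li
Affiliations: Singapore Management University; University of Pennsylvania; Beijing Jiaotong University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
Comments: 20 pages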
Abstract:Vertical Federated Learning (VFL) enables collaborative learning by integrating disjoint feature subsets from multiple clients/parties. However, VFL typically faces two key challenges: i) the requirement for perfectly aligned data samples across all clients (missing features are not allowed); ii) the requirement for joint collaborative inference/prediction involving all clients (it does not support locally independent inference on a single client). To address these challenges, we propose X-VFL, a new VFL framework designed to deal with the non-aligned data samples with (partially) missing features and to support locally independent inference of new data samples for each client. In particular, we design two novel modules in X-VFL: Cross Completion (XCom) and Decision Subspace Alignment (DS-Align). XCom can complete/reconstruct missing features for non-aligned data samples by leveraging information from other clients. DS-Align aligns local features with completed and global features across all clients within the decision subspace, thus enabling locally independent inference at each client. Moreover, we provide convergence theorems for different algorithms used in training X-VFL, showing an O(1/\sqrt{T}) convergence rate for SGD-type algorithms and an O(1/T) rate for PAGE-type algorithms, where T denotes the number of training update steps. Extensive experiments on real-world datasets demonstrate that X-VFL significantly outperforms existing methods, e.g., achieving a 15% improvement in accuracy on the image CIFAR-10 dataset and a 43% improvement on the medical MIMIC-III dataset. These results validate the practical effectiveness and superiority of X-VFL, particularly in scenarios involving partially missing features and locally independent inference.
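The practical gap between the two convergence rates quoted above can be seen with a short worked bound. The constants $C_1$, $C_2$ and the generic error measure $\epsilon_T$ are hypothetical placeholders; the abstract does not state which quantity the theorems bound.

```latex
% Steps needed to reach a target accuracy \epsilon under each rate.
\begin{align*}
\text{SGD-type:}  \quad & \epsilon_T \le \frac{C_1}{\sqrt{T}} \le \epsilon
  \;\Longrightarrow\; T \ge \frac{C_1^2}{\epsilon^2}, \\
\text{PAGE-type:} \quad & \epsilon_T \le \frac{C_2}{T} \le \epsilon
  \;\Longrightarrow\; T \ge \frac{C_2}{\epsilon}.
\end{align*}
% Example with C_1 = C_2 = 1 and \epsilon = 10^{-2}:
% SGD-type needs T \ge 10^{4} steps, PAGE-type needs T \ge 10^{2} steps.
```

Under these illustrative constants, the O(1/T) rate reaches the same target in a hundred times fewer update steps.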
[CV-12] Adapting Vision-Language Models Without Labels: A Comprehensive Survey
[Quick read]: This paper addresses the suboptimal performance of Vision-Language Models (VLMs) on specific downstream tasks, and in particular how to adapt them efficiently when labeled data is unavailable. The core challenge is improving the generalization and practical utility of VLMs in a given scenario without relying on large numbers of labeled samples. The key contribution is a systematic taxonomy based on the availability and nature of unlabeled visual data, which groups existing unsupervised VLM adaptation methods into four paradigms: Data-Free Transfer, Unsupervised Domain Transfer, Episodic Test-Time Adaptation, and Online Test-Time Adaptation. Within this taxonomy the survey analyzes the core methods and strategies of each paradigm, building a structured understanding of unsupervised VLM adaptation to support further research and practical use.
Link: https://arxiv.org/abs/2508.05547
Authors: Hao Dong,Lijun Sheng,Jian Liang,Ran He,Eleni Chatzi,Olga Fink
Affiliations: ETH Zürich; University of Science and Technology of China; NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; EPFL
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Discussions, comments, and questions are welcome at this https URL
Abstract:Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. However, their performance often remains suboptimal when directly applied to specific downstream scenarios without task-specific adaptation. To enhance their utility while preserving data efficiency, recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. Despite the growing interest in this area, there remains a lack of a unified, task-oriented survey dedicated to unsupervised VLM adaptation. To bridge this gap, we present a comprehensive and structured overview of the field. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms: Data-Free Transfer (no data), Unsupervised Domain Transfer (abundant data), Episodic Test-Time Adaptation (batch data), and Online Test-Time Adaptation (streaming data). Within this framework, we analyze core methodologies and adaptation strategies associated with each paradigm, aiming to establish a systematic understanding of the field. Additionally, we review representative benchmarks across diverse applications and highlight open challenges and promising directions for future research. An actively maintained repository of relevant literature is available at this https URL.
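The taxonomy in the abstract maps the availability of unlabeled data to an adaptation paradigm, which can be written down as a small lookup. The dictionary keys and the helper name `paradigm_for` are hypothetical labels chosen for this sketch; only the four paradigm names and their data regimes come from the abstract.

```python
# Data regime (as described in the abstract) -> adaptation paradigm.
PARADIGMS = {
    "none": "Data-Free Transfer",
    "abundant": "Unsupervised Domain Transfer",
    "batch": "Episodic Test-Time Adaptation",
    "streaming": "Online Test-Time Adaptation",
}

def paradigm_for(data_availability):
    """Return the survey's paradigm name for a given unlabeled-data regime."""
    if data_availability not in PARADIGMS:
        raise ValueError(f"unknown data regime: {data_availability!r}")
    return PARADIGMS[data_availability]

streaming_paradigm = paradigm_for("streaming")
```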
[CV-13] Point cloud segmentation for 3D Clothed Human Layering
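[Quick read]: This paper addresses the difficulty of achieving high fidelity in 3D garment modeling and simulation given the wide variability of clothing shapes, especially when generating realistic wrinkles. Conventional 3D shape segmentation methods are mostly designed for scene understanding and cope poorly with the strongly overlapping, multi-layer structure of garments over a body. The key to the solution is a new 3D point cloud segmentation paradigm, "clothed human layering", in which each point may belong to several layers at once (for example the body and different garment layers), enabling semantic reconstruction of occluded regions, that is, the lower garments or body parts covered by the clothing above. The authors also build a synthetic dataset that simulates realistic 3D scans and evaluate several neural network settings for layer identification at both fine and coarse granularity, validating the strategy on real and synthetic data.
Link: https://arxiv.org/abs/2508.05531
Authors: Davide Garavaso,Federico Masi,Pietro Musoni,Umberto Castellani
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: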
Abstract:3D cloth modeling and simulation is essential for avatar creation in several fields, such as fashion, entertainment, and animation. Achieving high-quality results is challenging due to the large variability of the clothed body, especially in the generation of realistic wrinkles. 3D scan acquisitions provide more accuracy in the representation of real-world objects but lack semantic information, which can be inferred with a reliable semantic reconstruction pipeline. To this aim, shape segmentation plays a crucial role in identifying the semantic shape parts. However, current 3D shape segmentation methods are designed for scene understanding and interpretation, and only a few works are devoted to modeling. In the context of clothed body modeling, segmentation is a preliminary step for fully semantic reconstruction of the shape parts, namely the underlying body and the involved garments. These parts represent several layers with strong overlap, in contrast with standard segmentation methods that provide disjoint sets. In this work we propose a new 3D point cloud segmentation paradigm where each 3D point can be simultaneously associated with different layers. In this fashion we can estimate the underlying body parts and the unseen clothed regions, i.e., the part of a garment occluded by the clothing layer above. We name this segmentation paradigm clothed human layering. We create a new synthetic dataset that simulates very realistic 3D scans with the ground truth of the involved clothing layers. We propose and evaluate different neural network settings to deal with 3D clothing layering. We consider both coarse and fine-grained per-layer garment identification. Our experiments demonstrate the benefit of introducing proper strategies for segmentation in the garment domain on both the synthetic and real-world scan datasets.
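The difference between layered and standard (disjoint) segmentation can be illustrated with per-point, per-layer scores. This is a minimal sketch that assumes an independent sigmoid per layer with a 0.5 threshold; the abstract does not describe the network heads actually used, and the layer names and logits are hypothetical.

```python
import numpy as np

def layer_membership(logits, threshold=0.5):
    """logits: (num_points, num_layers). Returns a boolean membership mask per layer."""
    probs = 1.0 / (1.0 + np.exp(-logits))  # independent sigmoid per layer
    return probs >= threshold

# Hypothetical layers, in column order: body, shirt, jacket.
logits = np.array([[4.0, 3.0, -5.0],    # body point lying under a shirt
                   [5.0, -4.0, -6.0],   # bare skin
                   [2.5, 3.5, 1.5]])    # body, shirt, and jacket stacked
layers = layer_membership(logits)

# A standard disjoint segmentation would keep only one label per point.
exclusive = logits.argmax(axis=1)
```

The multi-label mask keeps the occluded body and shirt for the third point, whereas the argmax labeling collapses it to a single layer.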
[CV-14] Looking into the Unknown: Exploring Action Discovery for Segmentation of Known and Unknown Actions
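[Quick read]: This paper targets partially annotated data in Temporal Action Segmentation and proposes a new "Action Discovery" setting, in which only some action classes (the known actions) are annotated in the training set while unknown actions are entirely unlabeled. Such situations are common in fields like neuroscience, where well-defined behaviors (e.g., walking, eating) coexist with subtle or rare ones and annotations are often incomplete because labels are ambiguous or missing. The key to the solution is a two-step strategy. First, the Granularity-Guided Segmentation Module (GGSM) identifies temporal intervals for both known and unknown actions by mimicking the temporal granularity of the annotated actions. Second, the Unknown Action Segment Assignment (UASA) assigns semantic classes to unknown action segments based on learned embedding similarities. The method considerably outperforms existing baselines on three benchmark datasets: Breakfast, 50Salads, and Desktop Assembly.
Link: https://arxiv.org/abs/2508.05529
Authors: Federico Spurio,Emad Bahrami,Olga Zatsarynna,Yazan Abu Farha,Gianpiero Francesca,Juergen Gall
Affiliations: University of Bonn; ETH Zurich; University of Trento; German Research Center for Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: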
Abstract:We introduce Action Discovery, a novel setup within Temporal Action Segmentation that addresses the challenge of defining and annotating ambiguous actions and incomplete annotations in partially labeled datasets. In this setup, only a subset of actions - referred to as known actions - is annotated in the training data, while other unknown actions remain unlabeled. This scenario is particularly relevant in domains like neuroscience, where well-defined behaviors (e.g., walking, eating) coexist with subtle or infrequent actions that are often overlooked, as well as in applications where datasets are inherently partially annotated due to ambiguous or missing labels. To address this problem, we propose a two-step approach that leverages the known annotations to guide both the temporal and semantic granularity of unknown action segments. First, we introduce the Granularity-Guided Segmentation Module (GGSM), which identifies temporal intervals for both known and unknown actions by mimicking the granularity of annotated actions. Second, we propose the Unknown Action Segment Assignment (UASA), which identifies semantically meaningful classes within the unknown actions, based on learned embedding similarities. We systematically explore the proposed setting of Action Discovery on three challenging datasets - Breakfast, 50Salads, and Desktop Assembly - demonstrating that our method considerably improves upon existing baselines.
[CV-15] AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety ICCV2025
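[Quick read]: This paper addresses the efficiency bottleneck and mental health risks that traditional human moderation faces when handling unsafe videos amid the surge of online video content, with a focus on the nuanced understanding of visual and textual cues that multimodal moderation requires. The core solution is to apply Multimodal Large Language Models (MLLMs) to brand safety classification and to build a new multimodal, multilingual dataset labeled by professional reviewers across a range of risk categories. With it, the authors systematically evaluate the accuracy and cost efficiency of MLLMs against human experts and examine their limitations and failure cases, providing a benchmark and data for future research on responsible brand safety and content moderation.
Link: https://arxiv.org/abs/2508.05527
Authors: Adi Levi,Or Levi,Sardhendu Mishra,Jonathan Morra
Affiliations: Zefr Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the Computer Vision in Advertising and Marketing (CVAM) workshop at ICCV 2025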
Abstract:As the volume of video content online grows exponentially, the demand for moderation of unsafe videos has surpassed human capabilities, posing both operational and mental health challenges. While recent studies demonstrated the merits of Multimodal Large Language Models (MLLMs) in various video understanding tasks, their application to multimodal content moderation, a domain that requires nuanced understanding of both visual and textual cues, remains relatively underexplored. In this work, we benchmark the capabilities of MLLMs in brand safety classification, a critical subset of content moderation for safe-guarding advertising integrity. To this end, we introduce a novel, multimodal and multilingual dataset, meticulously labeled by professional reviewers in a multitude of risk categories. Through a detailed comparative analysis, we demonstrate the effectiveness of MLLMs such as Gemini, GPT, and Llama in multimodal brand safety, and evaluate their accuracy and cost efficiency compared to professional human reviewers. Furthermore, we present an in-depth discussion shedding light on limitations of MLLMs and failure cases. We are releasing our dataset alongside this paper to facilitate future research on effective and responsible brand safety and content moderation.
[CV-16] When Deepfake Detection Meets Graph Neural Network: a Unified and Lightweight Learning Framework
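[Quick read]: This paper addresses a key challenge in detecting AI-generated or manipulated videos as generative video models proliferate: existing detectors rely on isolated spatial, temporal, or spectral information, so they generalize poorly across manipulation types and usually need large models to perform well. The core of the solution is SSTGNN (Spatial-Spectral-Temporal Graph Neural Network), which represents videos as structured graphs to reason jointly over spatial inconsistencies, temporal artifacts, and spectral distortions, and which adds learnable spectral filters and temporal differential modeling to better capture subtle manipulation traces. Experiments show that SSTGNN performs strongly in both in-domain and cross-domain settings and is robust to unseen manipulations, with up to 42.4 times fewer parameters than state-of-the-art models, which makes it lightweight and practical to deploy.
Link: https://arxiv.org/abs/2508.05526
Authors: Haoyu Liu,Chaoyu Gong,Mengke He,Jiate Li,Kai Han,Siqiang Luo
Affiliations: Nanyang Technological University; University of Southern California; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages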
Abstract:The proliferation of generative video models has made detecting AI-generated and manipulated videos an urgent challenge. Existing detection approaches often fail to generalize across diverse manipulation types due to their reliance on isolated spatial, temporal, or spectral information, and typically require large models to perform well. This paper introduces SSTGNN, a lightweight Spatial-Spectral-Temporal Graph Neural Network framework that represents videos as structured graphs, enabling joint reasoning over spatial inconsistencies, temporal artifacts, and spectral distortions. SSTGNN incorporates learnable spectral filters and temporal differential modeling into a graph-based architecture, capturing subtle manipulation traces more effectively. Extensive experiments on diverse benchmark datasets demonstrate that SSTGNN not only achieves superior performance in both in-domain and cross-domain settings, but also offers strong robustness against unseen manipulations. Remarkably, SSTGNN accomplishes these results with up to 42.4× fewer parameters than state-of-the-art models, making it highly lightweight and scalable for real-world deployment.
[CV-17] Optimal Brain Connection: Towards Efficient Structural Pruning
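[Quick read]: This paper addresses the fact that structural pruning methods for compressing neural networks tend to ignore the interconnections among parameters, which degrades model performance. The solution rests on two mechanisms. The first is the Jacobian Criterion, a first-order metric that captures both intra-component interactions and inter-layer dependencies and so evaluates the importance of structural parameters more accurately. The second is Equivalent Pruning, which uses autoencoders to retain the contributions of all original connections (including pruned ones) during fine-tuning, effectively mitigating performance degradation after pruning.
Link: https://arxiv.org/abs/2508.05521
Authors: Shaowu Chen,Wei Ma,Binhua Huang,Qingyuan Wang,Guoxin Wang,Weize Sun,Lei Huang,Deepu John
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: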
Abstract:Structural pruning has been widely studied for its effectiveness in compressing neural networks. However, existing methods often neglect the interconnections among parameters. To address this limitation, this paper proposes a structural pruning framework termed Optimal Brain Connection. First, we introduce the Jacobian Criterion, a first-order metric for evaluating the saliency of structural parameters. Unlike existing first-order methods that assess parameters in isolation, our criterion explicitly captures both intra-component interactions and inter-layer dependencies. Second, we propose the Equivalent Pruning mechanism, which utilizes autoencoders to retain the contributions of all original connections (including pruned ones) during fine-tuning. Experimental results demonstrate that the Jacobian Criterion outperforms several popular metrics in preserving model performance, while the Equivalent Pruning mechanism effectively mitigates performance degradation after fine-tuning. Code: this https URL
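Why scoring parameters in isolation can mislead is easy to show with a generic first-order Taylor saliency. The sketch below is not the paper's Jacobian Criterion, whose definition the abstract does not give; it only contrasts summing per-parameter scores with scoring a structural group jointly. The function names and the toy weights and gradients are hypothetical.

```python
import numpy as np

def isolated_saliency(weights, grads):
    """Sum of per-parameter first-order scores |g_i * w_i| (each scored alone)."""
    return float(np.sum(np.abs(weights * grads)))

def joint_saliency(weights, grads):
    """First-order loss change |sum_i g_i * w_i| when the whole group is removed."""
    return float(np.abs(np.sum(weights * grads)))

# Toy structural group whose two parameters have opposite first-order effects.
weights = np.array([1.0, -2.0])
grads = np.array([0.5, 0.25])

isolated = isolated_saliency(weights, grads)
joint = joint_saliency(weights, grads)
```

In this toy case the isolated score is 1.0 while the joint score is 0.0: to first order the two effects cancel, so removing the group together costs far less than the per-parameter scores suggest.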
[CV-18] Leveraging AI to Accelerate Clinical Data Cleaning: A Comparative Study of AI-Assisted vs. Traditional Methods
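[Quick read]: This paper addresses the efficiency bottleneck in clinical trial data cleaning, where traditional manual review struggles with growing data volume and complexity. The key to the solution is Octozi, an AI-assisted platform that combines large language models with domain-specific heuristics to improve the efficiency and accuracy of data cleaning. In the reported experiment, AI assistance increased data cleaning throughput by 6.03-fold, reduced the error rate from 54.67% to 8.48%, and cut false positive queries by 15.48-fold. The effect was consistent across reviewers of different experience levels, supporting its applicability to safety-critical clinical workflows.
Link: https://arxiv.org/abs/2508.05519
Authors: Matthew Purri,Amit Patel,Erik Deurrell
Affiliations: Octozi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: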
Abstract:Clinical trial data cleaning represents a critical bottleneck in drug development, with manual review processes struggling to manage exponentially increasing data volumes and complexity. This paper presents Octozi, an artificial intelligence-assisted platform that combines large language models with domain-specific heuristics to transform clinical data review. In a controlled experimental study with experienced clinical reviewers (n=10), we demonstrate that AI assistance increased data cleaning throughput by 6.03-fold while simultaneously decreasing cleaning errors from 54.67% to 8.48% (a 6.44-fold improvement). Crucially, the system reduced false positive queries by 15.48-fold, minimizing unnecessary site burden. These improvements were consistent across reviewers regardless of experience level, suggesting broad applicability. Our findings indicate that AI-assisted approaches can address fundamental inefficiencies in clinical trial operations, potentially accelerating drug development timelines and reducing costs while maintaining regulatory compliance. This work establishes a framework for integrating AI into safety-critical clinical workflows and demonstrates the transformative potential of human-AI collaboration in pharmaceutical clinical trials.
[CV-19] FS-IQA: Certified Feature Smoothing for Robust Image Quality Assessment
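[Quick read]: This paper addresses the lack of robustness and provable guarantees in Image Quality Assessment (IQA) models under adversarial attack. Existing defenses typically add Gaussian noise in the input space, which degrades the visual quality of the image. The key to the solution is to apply randomized smoothing in the feature space rather than the input space, and to analyze the maximum singular value of the backbone network's Jacobian in order to formally connect feature-space noise levels with input-space perturbations. Without changing the model architecture, this gives certified robustness for both full-reference (FR) and no-reference (NR) IQA models while preserving image fidelity and substantially improving computational efficiency.
Link: https://arxiv.org/abs/2508.05516
Authors: Ekaterina Shumitskaya,Dmitriy Vatolin,Anastasia Antsiferova
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: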
Abstract:We propose a novel certified defense method for Image Quality Assessment (IQA) models based on randomized smoothing with noise applied in the feature space rather than the input space. Unlike prior approaches that inject Gaussian noise directly into input images, often degrading visual quality, our method preserves image fidelity while providing robustness guarantees. To formally connect noise levels in the feature space with corresponding input-space perturbations, we analyze the maximum singular value of the backbone network’s Jacobian. Our approach supports both full-reference (FR) and no-reference (NR) IQA models without requiring any architectural modifications, suitable for various scenarios. It is also computationally efficient, requiring a single backbone forward pass per image. Compared to previous methods, it reduces inference time by 99.5% without certification and by 20.6% when certification is applied. We validate our method with extensive experiments on two benchmark datasets, involving six widely-used FR and NR IQA models and comparisons against five state-of-the-art certified defenses. Our results demonstrate consistent improvements in correlation with subjective quality scores by up to 30.9%.
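The role of the Jacobian's maximum singular value can be sketched with a toy linear backbone, for which the Jacobian is simply the weight matrix. The reasoning shown is the generic Lipschitz argument (a feature-space shift is at most the largest singular value times the input-space shift); the paper's exact certificate is not given in the abstract, and the matrix `W`, the radius 0.6, and all names are hypothetical. For a nonlinear backbone the Jacobian varies with the input and would have to be bounded over a region.

```python
import numpy as np

# Toy linear backbone f(x) = W @ x, so its Jacobian is W everywhere.
W = np.array([[3.0, 0.0],
              [0.0, 0.5]])

def backbone(x):
    return W @ x

# Largest singular value = worst-case amplification of an input perturbation.
sigma_max = np.linalg.svd(W, compute_uv=False)[0]

# If smoothing certifies the score within this feature-space radius (hypothetical value)...
feature_radius = 0.6
# ...then any input perturbation within this radius stays inside the certified region.
input_radius = feature_radius / sigma_max

# Check with the worst-case direction for this W (the first coordinate axis).
x = np.array([1.0, 2.0])
delta = np.array([input_radius, 0.0])
feature_shift = np.linalg.norm(backbone(x + delta) - backbone(x))
```

Here the worst-case perturbation of norm 0.2 moves the features by exactly 0.6, which shows the bound is tight for this toy backbone.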
[CV-20] Head Anchor Enhanced Detection and Association for Crowded Pedestrian Tracking
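[Quick read]: This paper addresses unstable trajectories in pedestrian tracking under severe occlusion in complex scenes, especially the loss of target features when multiple pedestrians interact or overlap. The solution has two parts. First, it fuses detection features from both the regression and classification branches of the object detector and embeds spatial and positional information, producing a richer appearance representation. Second, it introduces a head keypoint detection model to exploit regions that are less often occluded (the head), together with an iterative Kalman filtering motion model that uses 3D priors to better complete trajectories in complex scenes, yielding more robust multi-object tracking.
Link: https://arxiv.org/abs/2508.05514
Authors: Zewei Wu,César Teixeira,Wei Ke,Zhang Xiong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: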
Abstract:Visual pedestrian tracking represents a promising research field, with extensive applications in intelligent surveillance, behavior analysis, and human-computer interaction. However, real-world applications face significant occlusion challenges. When multiple pedestrians interact or overlap, the loss of target features severely compromises the tracker’s ability to maintain stable trajectories. Traditional tracking methods, which typically rely on full-body bounding box features extracted from Re-ID models and linear constant-velocity motion assumptions, often struggle in severe occlusion scenarios. To address these limitations, this work proposes an enhanced tracking framework that leverages richer feature representations and a more robust motion model. Specifically, the proposed method incorporates detection features from both the regression and classification branches of an object detector, embedding spatial and positional information directly into the feature representations. To further mitigate occlusion challenges, a head keypoint detection model is introduced, as the head is less prone to occlusion compared to the full body. In terms of motion modeling, we propose an iterative Kalman filtering approach designed to align with modern detector assumptions, integrating 3D priors to better complete motion trajectories in complex scenes. By combining these advancements in appearance and motion modeling, the proposed method offers a more robust solution for multi-object tracking in crowded environments where occlusions are prevalent.
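For reference, the linear constant-velocity motion assumption that the abstract attributes to traditional trackers can be sketched as a small Kalman filter. This is the textbook baseline in one dimension, not the paper's iterative Kalman filtering with 3D priors; the class name, noise settings, and measurements are hypothetical.

```python
import numpy as np

class ConstantVelocityKalman:
    """1D constant-velocity Kalman filter with state [position, velocity]."""

    def __init__(self, position, velocity=0.0, dt=1.0):
        self.x = np.array([position, velocity], dtype=float)
        self.P = np.eye(2)                           # state covariance
        self.F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity transition
        self.H = np.array([[1.0, 0.0]])              # only position is observed
        self.Q = 0.01 * np.eye(2)                    # process noise (illustrative)
        self.R = np.array([[0.1]])                   # measurement noise (illustrative)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[0]

    def update(self, measured_position):
        innovation = measured_position - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K @ innovation
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x[0]

# During an occlusion there are no detections, so the track only predicts.
track = ConstantVelocityKalman(position=0.0, velocity=2.0)
predicted = [track.predict() for _ in range(3)]
# When the pedestrian reappears, the detection corrects the prediction.
corrected = track.update(6.5)
```

While occluded, the track coasts along a straight line, which is exactly the behavior that breaks down when pedestrians change speed or direction behind an occluder.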
[CV-21] Revealing Latent Information: A Physics-inspired Self-supervised Pre-training Framework for Noisy and Sparse Events
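[Quick Read]: This paper addresses the inherent sparsity and noise of event camera data, which mainly reflects brightness changes and therefore makes effective feature extraction difficult. To fully reveal latent information in event data (such as edge and texture cues), the authors propose a self-supervised pre-training framework built on three stages. Difference-guided Masked Modeling, inspired by the physical sampling process of event sensors, reconstructs temporal intensity difference maps to extract enhanced information from raw events. Backbone-fixed Feature Transition keeps the backbone frozen during contrastive learning to preserve the representations learned by masked modeling and stabilize their effect. Focus-aimed Contrastive Learning updates the entire model to improve semantic discrimination by focusing on high-value regions. The abstract reports consistent gains on downstream tasks including object recognition, semantic segmentation, and optical flow estimation.
Link: https://arxiv.org/abs/2508.05507
Authors: Lin Zhu, Ruonan Liu, Xiao Wang, Lizhi Wang, Hua Huang
Affiliations: Beijing Institute of Technology; Anhui University; Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: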
Abstract:The event camera, a novel neuromorphic vision sensor, records data with high temporal resolution and wide dynamic range, offering new possibilities for accurate visual representation in challenging scenarios. However, event data is inherently sparse and noisy, mainly reflecting brightness changes, which complicates effective feature extraction. To address this, we propose a self-supervised pre-training framework to fully reveal latent information in event data, including edge information and texture cues. Our framework consists of three stages: Difference-guided Masked Modeling, inspired by the event physical sampling process, reconstructs temporal intensity difference maps to extract enhanced information from raw event data. Backbone-fixed Feature Transition contrasts event and image features without updating the backbone to preserve representations learned from masked modeling and stabilize their effect on contrastive learning. Focus-aimed Contrastive Learning updates the entire model to improve semantic discrimination by focusing on high-value regions. Extensive experiments show our framework is robust and consistently outperforms state-of-the-art methods on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation. The code and dataset are available at this https URL.
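The reconstruction target of the first stage can be illustrated with the standard event-generation model, in which a pixel emits an event of polarity +1 or -1 each time its log intensity changes by a contrast threshold. The following is a minimal sketch, assuming events arrive as `(x, y, polarity)` tuples and a hypothetical threshold of 0.2; the paper's exact definition of the temporal intensity difference map may differ.

```python
def difference_map(events, height, width, contrast_threshold=0.2):
    """Accumulate signed events into an approximate log-intensity change per pixel."""
    diff = [[0.0] * width for _ in range(height)]
    for x, y, polarity in events:
        # Each event accounts for one threshold-sized step in log intensity.
        diff[y][x] += contrast_threshold * polarity
    return diff


# Two positive events at pixel (0, 0) and one negative event at pixel (1, 0).
events = [(0, 0, 1), (0, 0, 1), (1, 0, -1)]
dmap = difference_map(events, height=1, width=2)
```

Pixels that emit no events keep a value of zero, which is one way to see why raw event data is sparse.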
[CV-22] MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips
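[Quick Read]: This paper addresses implausible reconstructions produced by existing RGB-based hand-object reconstruction methods in real-world settings, where viewpoints are limited and parts of the object are unobserved. Template-based methods require a known object shape, while template-free methods typically assume full object visibility, an assumption that breaks under fixed camera viewpoints and static grips. The key idea is to use the rich object supervision offered by large-scale novel view synthesis diffusion models as a prior that regularizes object regions left unobserved during hand interaction. Visible contact constraints then align the hand with the object surface, improving reconstruction accuracy and plausibility.
Link: https://arxiv.org/abs/2508.05506
Authors: Shibo Wang, Haonan He, Maria Parelli, Christoph Gebhardt, Zicong Fan, Jie Song
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; ETH Zürich; Max Planck Institute for Intelligent Systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: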
Abstract:Most RGB-based hand-object reconstruction methods rely on object templates, while template-free methods typically assume full object visibility. This assumption often breaks in real-world settings, where fixed camera viewpoints and static grips leave parts of the object unobserved, resulting in implausible reconstructions. To overcome this, we present MagicHOI, a method for reconstructing hands and objects from short monocular interaction videos, even under limited viewpoint variation. Our key insight is that, despite the scarcity of paired 3D hand-object data, large-scale novel view synthesis diffusion models offer rich object supervision. This supervision serves as a prior to regularize unseen object regions during hand interactions. Leveraging this insight, we integrate a novel view synthesis model into our hand-object reconstruction framework. We further align hand to object by incorporating visible contact constraints. Our results demonstrate that MagicHOI significantly outperforms existing state-of-the-art hand-object reconstruction methods. We also show that novel view synthesis diffusion priors effectively regularize unseen object regions, enhancing 3D hand-object reconstruction.
[CV-23] Symmetry Understanding of 3D Shapes via Chirality Disentanglement ICCV2025
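[Quick Read]: This paper addresses the lack of chirality features in shape analysis (point clouds and meshes), that is, features able to distinguish left from right symmetric parts. Chirality has been studied extensively in the image domain, but existing shape vertex descriptors usually cannot disambiguate left and right symmetric parts, which limits downstream performance. The solution is an unsupervised chirality feature extraction pipeline, built on the recent Diff3F framework, that extracts chirality information from 2D foundation models and attaches it to shape vertices, producing chirality-aware features. Quantitative and qualitative experiments across several datasets, and downstream tasks including left-right disentanglement, shape matching, and part segmentation, support its effectiveness.
Link: https://arxiv.org/abs/2508.05505
Authors: Weikang Wang, Tobias Weißberg, Nafie El Amrani, Florian Bernard
Affiliations: University of Bonn; Lamarr Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025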
Abstract:Chirality information (i.e. information that allows distinguishing left from right) is ubiquitous for various data modes in computer vision, including images, videos, point clouds, and meshes. While chirality has been extensively studied in the image domain, its exploration in shape analysis (such as point clouds and meshes) remains underdeveloped. Although many shape vertex descriptors have shown appealing properties (e.g. robustness to rigid-body transformations), they are often not able to disambiguate between left and right symmetric parts. Considering the ubiquity of chirality information in different shape analysis problems and the lack of chirality-aware features within current shape descriptors, developing a chirality feature extractor becomes necessary and urgent. Based on the recent Diff3F framework, we propose an unsupervised chirality feature extraction pipeline to decorate shape vertices with chirality-aware information, extracted from 2D foundation models. We evaluated the extracted chirality features through quantitative and qualitative experiments across diverse datasets. Results from downstream tasks including left-right disentanglement, shape matching, and part segmentation demonstrate their effectiveness and practical utility. Project page: this https URL
[CV-24] Parameter-free entropy-regularized multi-view clustering with hierarchical feature selection
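[Quick Read]: This paper addresses automatic pattern discovery across heterogeneous views in multi-view clustering while handling the redundant information carried by high-dimensional features. Traditional methods rely on manual parameter tuning and lack a principled cross-view integration mechanism. The solution consists of two complementary algorithms, AMVFCM-U and AAMVFCM-U, that form a unified parameter-free framework. Entropy regularization terms replace the fuzzification parameters and enforce adaptive cross-view consensus; a signal-to-noise-ratio based regularization, $\delta_j^h = \frac{\bar{x}_j^h}{(\sigma_j^h)^2}$, provides principled feature weighting with convergence guarantees; and dual-level entropy terms automatically balance view and feature contributions. AAMVFCM-U additionally applies hierarchical dimensionality reduction at the feature and view levels through adaptive thresholding, $\theta^{h^{(t)}} = \frac{d_h^{(t)}}{n}$, which improves computational efficiency and identifies the critical view combinations.
Link: https://arxiv.org/abs/2508.05504
Authors: Kristina P. Sinaga, Sara Colantonio, Miin-Shen Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Statistics Theory (math.ST)
Comments: 81 pages, 10 figures, 17 tables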
Abstract:Multi-view clustering faces critical challenges in automatically discovering patterns across heterogeneous data while managing high-dimensional features and eliminating irrelevant information. Traditional approaches suffer from manual parameter tuning and lack principled cross-view integration mechanisms. This work introduces two complementary algorithms: AMVFCM-U and AAMVFCM-U, providing a unified parameter-free framework. Our approach replaces fuzzification parameters with entropy regularization terms that enforce adaptive cross-view consensus. The core innovation employs signal-to-noise ratio based regularization ($\delta_j^h = \frac{\bar{x}_j^h}{(\sigma_j^h)^2}$) for principled feature weighting with convergence guarantees, coupled with dual-level entropy terms that automatically balance view and feature contributions. AAMVFCM-U extends this with hierarchical dimensionality reduction operating at feature and view levels through adaptive thresholding ($\theta^{h^{(t)}} = \frac{d_h^{(t)}}{n}$). Evaluation across five diverse benchmarks demonstrates superiority over 15 state-of-the-art methods. AAMVFCM-U achieves up to 97% computational efficiency gains, reduces dimensionality to 0.45% of original size, and automatically identifies critical view combinations for optimal pattern discovery.
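The two formulas in the abstract can be evaluated directly. The sketch below computes the signal-to-noise weight (feature mean divided by feature variance) for each feature of one view, and the threshold given by a feature count divided by the sample count. It assumes population variance, adds a small `eps` to avoid division by zero, and reads `d_h` as the current number of features in view `h`; none of these details are stated in the abstract, and how the quantities enter the clustering updates is not shown.

```python
from statistics import fmean, pvariance


def snr_weights(view, eps=1e-12):
    """Per-feature weight mean / variance for one view (rows are samples)."""
    columns = list(zip(*view))
    return [fmean(col) / (pvariance(col) + eps) for col in columns]


def adaptive_threshold(num_features, num_samples):
    """Threshold d_h / n: feature count of the view over the sample count."""
    return num_features / num_samples


view = [[1.0, 4.0], [3.0, 8.0]]   # two samples, two features
weights = snr_weights(view)       # approximately [2.0, 1.5]
theta = adaptive_threshold(num_features=2, num_samples=2)
```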
[CV-25] AutoIAD: Manager-Driven Multi-Agent Collaboration for Automated Industrial Anomaly Detection
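[Quick Read]: This paper addresses the heavy manual effort and cumbersome development process that industrial visual anomaly detection (IAD) requires in practice. Conventional approaches need hand-designed data preprocessing, model selection, and training for each scenario, which is inefficient and hard to standardize. The authors propose AutoIAD, a multi-agent collaboration framework in which a Manager-Driven central agent coordinates specialized sub-agents (Data Preparation, Data Loader, Model Designer, Trainer) and integrates a domain-specific knowledge base, automating development end to end from raw industrial images to a trained anomaly detection model. The key elements are the manager's scheduling of sub-agents and knowledge-base-driven iterative refinement, which the abstract reports improve task completion rate and model performance (AUROC) while mitigating hallucination.
Link: https://arxiv.org/abs/2508.05503
Authors: Dongwei Ji, Bingzhang Hu, Yi Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: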
Abstract:Industrial anomaly detection (IAD) is critical for manufacturing quality control, but conventionally requires significant manual effort for various application scenarios. This paper introduces AutoIAD, a multi-agent collaboration framework, specifically designed for end-to-end automated development of industrial visual anomaly detection. AutoIAD leverages a Manager-Driven central agent to orchestrate specialized sub-agents (including Data Preparation, Data Loader, Model Designer, Trainer) and integrates a domain-specific knowledge base, which intelligently handles the entire pipeline using raw industrial image data to develop a trained anomaly detection model. We construct a comprehensive benchmark using MVTec AD datasets to evaluate AutoIAD across various LLM backends. Extensive experiments demonstrate that AutoIAD significantly outperforms existing general-purpose agentic collaboration frameworks and traditional AutoML frameworks in task completion rate and model performance (AUROC), while effectively mitigating issues like hallucination through iterative refinement. Ablation studies further confirm the crucial roles of the Manager central agent and the domain knowledge base module in producing robust and high-quality IAD solutions.
[CV-26] SMOL-MapSeg: Show Me One Label
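[Quick Read]: This paper addresses the poor performance of pre-trained foundation models (such as SAM) on historical maps. The core challenge is that the same concept can appear in very different shapes and styles across historical maps, which lack the semantic consistency of modern images. The solution is On-Need Declarative (OND) knowledge-based prompting, which uses explicit prompts to tell the model which patterns correspond to which concepts, so that users can specify the target concept and pattern at inference time (on-need inference). The method replaces SAM's prompt encoder with the OND prompting mechanism and fine-tunes on historical maps, producing SMOL-MapSeg, which accurately segments classes defined by OND knowledge and can adapt to unseen classes through few-shot fine-tuning.
Link: https://arxiv.org/abs/2508.05501
Authors: Yunshuang Yuan, Frank Thiemann, Thorsten Dahms, Monika Sester
Affiliations: Institute of Cartography and Geoinformatics, Leibniz University Hannover, Germany; The German Federal Agency for Cartography and Geodesy (BKG)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: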
Abstract:Historical maps are valuable for studying changes to the Earth’s surface. With the rise of deep learning, models like UNet have been used to extract information from these maps through semantic segmentation. Recently, pre-trained foundation models have shown strong performance across domains such as autonomous driving, medical imaging, and industrial inspection. However, they struggle with historical maps. These models are trained on modern or domain-specific images, where patterns can be tied to predefined concepts through common sense or expert knowledge. Historical maps lack such consistency – similar concepts can appear in vastly different shapes and styles. To address this, we propose On-Need Declarative (OND) knowledge-based prompting, which introduces explicit prompts to guide the model on what patterns correspond to which concepts. This allows users to specify the target concept and pattern during inference (on-need inference). We implement this by replacing the prompt encoder of the foundation model SAM with our OND prompting mechanism and fine-tune it on historical maps. The resulting model is called SMOL-MapSeg (Show Me One Label). Experiments show that SMOL-MapSeg can accurately segment classes defined by OND knowledge. It can also adapt to unseen classes through few-shot fine-tuning. Additionally, it outperforms a UNet-based baseline in average segmentation performance.
[CV-27] Keep It Real: Challenges in Attacking Compression-Based Adversarial Purification
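[Quick Read]: This paper evaluates how effective image compression is as a defense against adversarial perturbations, in particular under strong white-box and adaptive attacks. The key finding is that compression models capable of realistic, high-fidelity reconstruction are substantially harder to attack. This robustness is not due to gradient masking; rather, realistic reconstructions stay aligned with the distribution of natural images, which appears to be a real obstacle for attackers. The result identifies a core challenge for future attacks: developing techniques that overcome realism.
Link: https://arxiv.org/abs/2508.05489
Authors: Samuel Räber, Till Aczel, Andreas Plesner, Roger Wattenhofer
Affiliations: ETH Zürich
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: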
Abstract:Previous work has suggested that preprocessing images through lossy compression can defend against adversarial perturbations, but comprehensive attack evaluations have been lacking. In this paper, we construct strong white-box and adaptive attacks against various compression models and identify a critical challenge for attackers: high realism in reconstructed images significantly increases attack difficulty. Through rigorous evaluation across multiple attack scenarios, we demonstrate that compression models capable of producing realistic, high-fidelity reconstructions are substantially more resistant to our attacks. In contrast, low-realism compression models can be broken. Our analysis reveals that this is not due to gradient masking. Rather, realistic reconstructions maintaining distributional alignment with natural images seem to offer inherent robustness. This work highlights a significant obstacle for future adversarial attacks and suggests that developing more effective techniques to overcome realism represents an essential challenge for comprehensive security evaluation.
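The preprocessing idea under evaluation can be illustrated with the crudest possible lossy compressor. The sketch below uses uniform quantization as a stand-in (an assumption; the paper evaluates actual compression models) and shows a small perturbation being erased when every pixel stays inside its quantization bin. A perturbation large enough to cross a bin boundary would survive, and the abstract notes that low-realism compressors of this kind can be broken.

```python
def quantize(pixels, step=16):
    """Uniform quantization as a toy 'compress then reconstruct' step."""
    return [step * round(p / step) for p in pixels]


clean = [32, 64, 128, 208]
perturbation = [3, -2, 4, -3]                    # hypothetical adversarial noise
attacked = [c + d for c, d in zip(clean, perturbation)]
purified = quantize(attacked)                    # matches quantize(clean) for this input
```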
[CV-28] F2PASeg: Feature Fusion for Pituitary Anatomy Segmentation in Endoscopic Surgery
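[Quick Read]: This paper addresses the identification of intraoperative risk in pituitary tumor surgery, where tumors deform or encapsulate adjacent vital structures. The core challenges are the scarcity of pixel-level annotated video datasets and inconsistent feature representations caused by occlusion, camera motion, and surgical bleeding. The authors build a new dataset, PAS (Pituitary Anatomy Segmentation), containing 7,845 time-coherent images extracted from 120 videos, and mitigate class imbalance with data augmentation that simulates surgical instruments. They further propose F2PASeg, whose Feature Fusion module combines high-resolution image features with deep semantic embeddings, improving robustness to intraoperative variation and enabling real-time segmentation of critical anatomical structures.
Link: https://arxiv.org/abs/2508.05465
Authors: Lumin Chen, Zhiying Wu, Tianye Lei, Xuexue Bai, Ming Feng, Yuxi Wang, Gaofeng Meng, Zhen Lei, Hongbin Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
Comments: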
Abstract:Pituitary tumors often cause deformation or encapsulation of adjacent vital structures. Anatomical structure segmentation can provide surgeons with early warnings of regions that pose surgical risks, thereby enhancing the safety of pituitary surgery. However, pixel-level annotated video stream datasets for pituitary surgeries are extremely rare. To address this challenge, we introduce a new dataset for Pituitary Anatomy Segmentation (PAS). PAS comprises 7,845 time-coherent images extracted from 120 videos. To mitigate class imbalance, we apply data augmentation techniques that simulate the presence of surgical instruments in the training data. One major challenge in pituitary anatomy segmentation is the inconsistency in feature representation due to occlusions, camera motion, and surgical bleeding. By incorporating a Feature Fusion module, F2PASeg is proposed to refine anatomical structure segmentation by leveraging both high-resolution image features and deep semantic embeddings, enhancing robustness against intraoperative variations. Experimental results demonstrate that F2PASeg consistently segments critical anatomical structures in real time, providing a reliable solution for intraoperative pituitary surgery planning. Code: this https URL.
[CV-29] How and Why: Taming Flow Matching for Unsupervised Anomaly Detection and Localization
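[Quick Read]: This paper addresses the performance bottleneck of conventional flow-based methods in unsupervised anomaly detection and localization, which stems from limited model expressivity. The core solution is a new paradigm based on time-reversed Flow Matching (rFM). It introduces Worst Transport (WT) displacement interpolation to reconstruct a non-probabilistic evolution path, which strengthens dynamical control over sample trajectories: anomaly-free samples are drawn into "degenerate potential wells", while anomalous samples escape, giving a theoretically grounded separation mechanism. The abstract presents this as the first successful application of Flow Matching to unsupervised anomaly detection, with state-of-the-art single-scale performance on the MVTec dataset and a framework that scales to complex data.
Link: https://arxiv.org/abs/2508.05461
Authors: Liangwei Li, Lin Liu, Juanxiu Liu, Jing Zhang, Ruqian Hao, Xiaohui Du
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: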
Abstract:We propose a new paradigm for unsupervised anomaly detection and localization using Flow Matching (FM), which fundamentally addresses the model expressivity limitations of conventional flow-based methods. To this end, we formalize the concept of time-reversed Flow Matching (rFM) as a vector field regression along a predefined probability path to transform unknown data distributions into standard Gaussian. We bring two core observations that reshape our understanding of FM. First, we rigorously prove that FM with linear interpolation probability paths is inherently non-invertible. Second, our analysis reveals that employing reversed Gaussian probability paths in high-dimensional spaces can lead to trivial vector fields. This issue arises due to the manifold-related constraints. Building on the second observation, we propose Worst Transport (WT) displacement interpolation to reconstruct a non-probabilistic evolution path. The proposed WT-Flow enhances dynamical control over sample trajectories, constructing “degenerate potential wells” for anomaly-free samples while allowing anomalous samples to escape. This novel unsupervised paradigm offers a theoretically grounded separation mechanism for anomalous samples. Notably, FM provides a computationally tractable framework that scales to complex data. We present the first successful application of FM for the unsupervised anomaly detection task, achieving state-of-the-art performance at a single scale on the MVTec dataset. The reproducible code for training will be released upon camera-ready submission.
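For readers unfamiliar with Flow Matching, the linear interpolation probability path mentioned in the abstract pairs each data sample with a Gaussian sample and regresses a vector field onto their difference. Below is a minimal sketch of that training target, with data flowing toward noise as in the time-reversed setting. It does not implement the paper's Worst Transport interpolation, and all function names are illustrative assumptions.

```python
import random


def linear_path_sample(x_data, x_noise, t):
    """Point on the straight path from data (t=0) to noise (t=1) and its velocity."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x_data, x_noise)]
    target_velocity = [b - a for a, b in zip(x_data, x_noise)]
    return x_t, target_velocity


def fm_loss(predicted_velocity, target_velocity):
    """Mean squared error between a predicted and the target velocity."""
    n = len(target_velocity)
    return sum((p - u) ** 2 for p, u in zip(predicted_velocity, target_velocity)) / n


random.seed(0)
x_data = [1.0, 2.0]
x_noise = [random.gauss(0.0, 1.0) for _ in x_data]
x_t, u_t = linear_path_sample(x_data, x_noise, t=0.5)
loss_for_zero_field = fm_loss([0.0, 0.0], u_t)   # loss of a model that predicts no motion
```

In training, the predicted velocity would come from a neural network evaluated at `(x_t, t)`; here a zero field stands in for it.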
[CV-30] Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions
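[Quick Read]: This paper addresses the limited explainability of language-image pre-training (LIP) models. Existing attribution methods based on saliency maps capture only first-order feature importance and overlook the complex higher-order interactions between the visual and language modalities. The solution is a unified game-theoretic approach, faithful interaction explanations of LIP models (FIxLIP). Its core idea is to use the weighted Banzhaf interaction index instead of the Shapley interaction quantification framework, which offers more flexibility and better computational efficiency when characterizing second-order cross-modal interactions. The paper also extends evaluation metrics such as the pointing game and the area between the insertion/deletion curves to second-order explanations. Experiments on the MS COCO and ImageNet-1k benchmarks indicate that FIxLIP outperforms first-order attribution methods and can be used to compare models.
Link: https://arxiv.org/abs/2508.05430
Authors: Hubert Baniecki, Maximilian Muschalik, Fabian Fumagalli, Barbara Hammer, Eyke Hüllermeier, Przemyslaw Biecek
Affiliations: University of Warsaw; Warsaw University of Technology; LMU Munich, MCML; Bielefeld University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint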
Abstract:Language-image pre-training (LIP) enables the development of vision-language models capable of zero-shot classification, localization, multimodal retrieval, and semantic understanding. Various explanation methods have been proposed to visualize the importance of input image-text pairs on the model’s similarity outputs. However, popular saliency maps are limited by capturing only first-order attributions, overlooking the complex cross-modal interactions intrinsic to such encoders. We introduce faithful interaction explanations of LIP models (FIxLIP) as a unified approach to decomposing the similarity in vision-language encoders. FIxLIP is rooted in game theory, where we analyze how using the weighted Banzhaf interaction index offers greater flexibility and improves computational efficiency over the Shapley interaction quantification framework. From a practical perspective, we propose how to naturally extend explanation evaluation metrics, like the pointing game and area between the insertion/deletion curves, to second-order interaction explanations. Experiments on MS COCO and ImageNet-1k benchmarks validate that second-order methods like FIxLIP outperform first-order attribution methods. Beyond delivering high-quality explanations, we demonstrate the utility of FIxLIP in comparing different models like CLIP vs. SigLIP-2 and ViT-B/32 vs. ViT-L/16.
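The weighted Banzhaf interaction index can be computed exactly for a small game by enumerating coalitions. In the sketch below the players stand for image patches and text tokens, and `value` is a hypothetical stand-in for the similarity score obtained when only the players in a coalition are kept. The parameter `p` is the probability that each remaining player is present, and `p = 0.5` recovers the standard Banzhaf interaction index. Exhaustive enumeration is exponential in the number of players, so this illustrates the definition rather than the paper's estimator.

```python
from itertools import combinations


def weighted_banzhaf_interaction(value, n, i, j, p=0.5):
    """Second-order weighted Banzhaf interaction between players i and j."""
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        # Probability of one specific coalition of this size among the others.
        weight = p ** size * (1 - p) ** (len(others) - size)
        for subset in combinations(others, size):
            s = frozenset(subset)
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


def toy_similarity(coalition):
    """Toy game: players 0 and 1 only matter together, player 2 matters alone."""
    joint = 1.0 if {0, 1} <= coalition else 0.0
    solo = 0.5 if 2 in coalition else 0.0
    return joint + solo


pair_01 = weighted_banzhaf_interaction(toy_similarity, n=3, i=0, j=1, p=0.3)
pair_02 = weighted_banzhaf_interaction(toy_similarity, n=3, i=0, j=2, p=0.3)
```

In this toy game the pair (0, 1) receives the whole joint contribution, while the pair (0, 2) has no interaction, which a first-order saliency map could not express.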
[CV-31] Smoothing Slot Attention Iterations and Recurrences
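[Quick Read]: This paper targets two issues in Object-Centric Learning (OCL) methods built on Slot Attention (SA) for images and video. First, the cold-start queries on the first frame lack sample-specific cues, which hinders precise aggregation. Second, queries on non-first frames are already sample-specific, yet existing methods apply the same transform as on the first frame and so do not distinguish the aggregation needs of different frames. The proposed SmoothSA (1) "preheats" the cold-start queries with rich information from the input features through a tiny self-distilled module, improving initial aggregation, and (2) differentiates the transforms across video frames by running full iterations on the first frame and a single iteration on non-first frames, smoothing the recurrence across frames.
Link: https://arxiv.org/abs/2508.05417
Authors: Rongzhen Zhao, Wenyan Yang, Juho Kannala, Joni Pajarinen
Affiliations: Aalto University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: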
Abstract:Slot Attention (SA) and its variants lie at the heart of mainstream Object-Centric Learning (OCL). Objects in an image can be aggregated into respective slot vectors, by iteratively refining cold-start query vectors, typically three times, via SA on image features. For video, such aggregation is recurrently shared across frames, with queries cold-started on the first frame while transitioned from the previous frame’s slots on non-first frames. However, the cold-start queries lack sample-specific cues thus hinder precise aggregation on the image or video’s first frame; also, non-first frames’ queries are already sample-specific thus require transforms different from the first frame’s aggregation. We address these issues for the first time with our SmoothSA: (1) To smooth SA iterations on the image or video’s first frame, we preheat the cold-start queries with rich information of input features, via a tiny module self-distilled inside OCL; (2) To smooth SA recurrences across all video frames, we differentiate the homogeneous transforms on the first and non-first frames, by using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and downstream benchmarks validate our method’s effectiveness. Further analyses intuitively illuminate how our method smooths SA iterations and recurrences. Our code is available in the supplement.
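The iteration schedule described in point (2) can be sketched with a stripped-down Slot Attention step. The version below keeps only the softmax competition across slots and the weighted-mean update. It omits the learned projections, GRU, and MLP of real Slot Attention, as well as the paper's preheating module and any learned transition between frames, so it is a simplified assumption rather than the authors' implementation.

```python
import math


def sa_iteration(slots, features):
    """One simplified Slot Attention step: softmax over slots, then weighted mean."""
    dim = len(features[0])
    attn = []  # attn[n][k]: share of feature n assigned to slot k
    for f in features:
        logits = [sum(a * b for a, b in zip(s, f)) / math.sqrt(dim) for s in slots]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        norm = sum(exps)
        attn.append([e / norm for e in exps])
    new_slots = []
    for k in range(len(slots)):
        mass = sum(row[k] for row in attn) + 1e-8
        new_slots.append([
            sum(attn[n][k] * features[n][d] for n in range(len(features))) / mass
            for d in range(dim)
        ])
    return new_slots


def process_video(frames, cold_start_queries, full_iterations=3):
    """Full iterations on the first frame, a single iteration on later frames."""
    slots, history = cold_start_queries, []
    for index, features in enumerate(frames):
        for _ in range(full_iterations if index == 0 else 1):
            slots = sa_iteration(slots, features)
        history.append(slots)
    return history


frames = [
    [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.9]],
    [[0.2, 0.8], [0.8, 0.2], [1.0, 0.0], [0.0, 1.0]],
]
slots_per_frame = process_video(frames, cold_start_queries=[[1.0, 0.0], [0.0, 1.0]])
```

The default of three iterations on the first frame follows the "typically three times" remark in the abstract.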
[CV-32] Physical Adversarial Camouflage through Gradient Calibration and Regularization IJCAI2025
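[Quick Read]: This paper addresses the stability and effectiveness of optimizing physical adversarial camouflage in variable environments. Two challenges are identified: (1) sampling point densities differ across distances, so gradient optimization cannot ensure local continuity; and (2) updating texture gradients from multiple angles causes conflicts, reducing optimization stability and attack effectiveness. The solution is a new adversarial camouflage framework based on gradient optimization. A gradient calibration strategy propagates gradients from sparsely sampled points to unsampled texture points, giving consistent gradient updates across distances. A gradient decorrelation method prioritizes and orthogonalizes gradients according to their loss values, removing redundant or conflicting updates and improving stability and attack effectiveness in multi-angle optimization.
Link: https://arxiv.org/abs/2508.05414
Authors: Jiawei Liang, Siyuan Liang, Jianjie Huang, Chenxi Si, Ming Zhang, Xiaochun Cao
Affiliations: Sun Yat-sen University Shenzhen Campus; Peng Cheng Laboratory; Nanyang Technological University; National Key Laboratory of Science and Technology on Information System Security
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IJCAI 2025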
Abstract:The advancement of deep object detectors has greatly affected safety-critical fields like autonomous driving. However, physical adversarial camouflage poses a significant security risk by altering object textures to deceive detectors. Existing techniques struggle with variable physical environments, facing two main challenges: 1) inconsistent sampling point densities across distances hinder the gradient optimization from ensuring local continuity, and 2) updating texture gradients from multiple angles causes conflicts, reducing optimization stability and attack effectiveness. To address these issues, we propose a novel adversarial camouflage framework based on gradient optimization. First, we introduce a gradient calibration strategy, which ensures consistent gradient updates across distances by propagating gradients from sparsely sampled to unsampled texture points. Additionally, we develop a gradient decorrelation method, which prioritizes and orthogonalizes gradients based on loss values, enhancing stability and effectiveness in multi-angle optimization by eliminating redundant or conflicting updates. Extensive experimental results on various detection models, angles and distances show that our method significantly exceeds the state of the art, with an average increase in attack success rate (ASR) of 13.46% across distances and 11.03% across angles. Furthermore, empirical evaluation in real-world scenarios highlights the need for more robust system design.
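One way to read "prioritizes and orthogonalizes gradients based on loss values" is a Gram-Schmidt pass over the per-angle gradients in order of priority. The sketch below assumes that a higher loss means higher priority and that each lower-priority gradient has its components along the already accepted ones removed. Both choices are assumptions for illustration; the abstract does not specify the exact procedure.

```python
def decorrelate(gradients, losses):
    """Orthogonalize per-angle gradients, processing higher-loss angles first."""
    order = sorted(range(len(gradients)), key=lambda idx: losses[idx], reverse=True)
    basis = []                          # orthogonalized gradients accepted so far
    result = [None] * len(gradients)
    for idx in order:
        g = list(gradients[idx])
        for b in basis:
            norm_sq = sum(v * v for v in b)
            if norm_sq > 0:
                coeff = sum(x * y for x, y in zip(g, b)) / norm_sq
                g = [x - coeff * y for x, y in zip(g, b)]   # remove the shared direction
        basis.append(g)
        result[idx] = g
    return result


# Two viewing angles whose texture gradients partly overlap.
grads = [[1.0, 0.0], [1.0, 1.0]]
decorrelated = decorrelate(grads, losses=[2.0, 1.0])
```

After the pass, the returned gradients are mutually orthogonal, so an update along one no longer cancels or duplicates an update along another.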
[CV-33] From Detection to Correction: Backdoor-Resilient Face Recognition via Vision-Language Trigger Detection and Noise-Based Neutralization
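[Quick Read]: This paper addresses backdoor attacks injected during training into biometric systems, such as face recognition systems based on deep neural networks. In such attacks, a small trigger (a sticker, make-up, or patterned mask) is inserted into a few training images, so that an adversary presenting the same trigger at authentication is falsely recognized as another person and gains unauthorized access. Existing defenses struggle to precisely identify and neutralize poisoned samples without compromising data utility, which undermines system reliability. The solution is a generalizable approach, TrueBiometric: Trustworthy Biometrics, which detects poisoned images by majority voting over multiple state-of-the-art large vision language models (VLMs) and then corrects them with targeted, calibrated corrective noise. The abstract reports 100% detection and correction accuracy without loss of accuracy on clean images.
Link: https://arxiv.org/abs/2508.05409
Authors: Farah Wahida, M.A.P. Chamikara, Yashothara Shanmugarasa, Mohan Baruwal Chhetri, Thilina Ranbaduge, Ibrahim Khalil
Affiliations: RMIT University; CSIRO's Data61
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 19 Pages, 24 Figures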
Abstract:Biometric systems, such as face recognition systems powered by deep neural networks (DNNs), rely on large and highly sensitive datasets. Backdoor attacks can subvert these systems by manipulating the training process. By inserting a small trigger, such as a sticker, make-up, or patterned mask, into a few training images, an adversary can later present the same trigger during authentication to be falsely recognized as another individual, thereby gaining unauthorized access. Existing defense mechanisms against backdoor attacks still face challenges in precisely identifying and mitigating poisoned images without compromising data utility, which undermines the overall reliability of the system. We propose a novel and generalizable approach, TrueBiometric: Trustworthy Biometrics, which accurately detects poisoned images using a majority voting mechanism leveraging multiple state-of-the-art large vision language models. Once identified, poisoned samples are corrected using targeted and calibrated corrective noise. Our extensive empirical results demonstrate that TrueBiometric detects and corrects poisoned images with 100% accuracy without compromising accuracy on clean images. Compared to existing state-of-the-art approaches, TrueBiometric offers a more practical, accurate, and effective solution for mitigating backdoor attacks in face recognition systems.
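The detection step can be sketched as a plain majority vote. In the code below each detector is a hypothetical callable standing in for one vision language model asked whether an image contains a trigger. The real prompts, models, and the corrective-noise step are not described in enough detail in the abstract to reproduce, so they are left out.

```python
def majority_flags_poisoned(votes):
    """True when strictly more than half of the detectors report a trigger."""
    return 2 * sum(votes) > len(votes)


def partition_images(images, detectors):
    """Split images into (clean, poisoned) using a majority vote over detectors."""
    clean, poisoned = [], []
    for image in images:
        votes = [bool(detect(image)) for detect in detectors]
        (poisoned if majority_flags_poisoned(votes) else clean).append(image)
    return clean, poisoned


# Toy stand-ins: images are dicts, detectors are simple rules of varying strictness.
images = [{"id": 1, "sticker": True}, {"id": 2, "sticker": False}]
detectors = [
    lambda img: img["sticker"],
    lambda img: img["sticker"],
    lambda img: False,            # a detector that misses every trigger
]
clean, poisoned = partition_images(images, detectors)
```

With three voters, one detector that misses the trigger is outvoted by the other two.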
[CV-34] DistillDrive: End-to-End Multi-Mode Autonomous Driving Distillation by Isomorphic Hetero-Source Planning Model
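[Quick Read]: This paper addresses the problem that current end-to-end autonomous driving models rely too heavily on ego-vehicle status as the sole learning objective and lack planning-oriented understanding, which limits the robustness of decision-making. The solution is DistillDrive, a knowledge-distillation framework that uses a planning model based on structured scene representations as the teacher and takes its diversified planning instances as multi-objective learning targets to strengthen multi-mode motion feature learning. It also incorporates reinforcement learning to optimize the state-to-decision mapping and uses generative modeling to construct planning-oriented instances, encouraging rich interactions in the latent space.
Link: https://arxiv.org/abs/2508.05402
Authors: Rui Yu, Xianghang Zhang, Runkai Zhao, Huaicheng Yan, Meng Wang
Affiliations: East China University of Science and Technology; SenseAuto Research; The University of Sydney
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: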
Abstract:End-to-end autonomous driving has recently seen rapid development, exerting a profound influence on both industry and academia. However, existing work places excessive focus on ego-vehicle status as its sole learning objective and lacks planning-oriented understanding, which limits the robustness of the overall decision-making process. In this work, we introduce DistillDrive, an end-to-end knowledge distillation-based autonomous driving model that leverages diversified instance imitation to enhance multi-mode motion feature learning. Specifically, we employ a planning model based on structured scene representations as the teacher model, leveraging its diversified planning instances as multi-objective learning targets for the end-to-end model. Moreover, we incorporate reinforcement learning to enhance the optimization of state-to-decision mappings, while utilizing generative modeling to construct planning-oriented instances, fostering intricate interactions within the latent space. We validate our model on the nuScenes and NAVSIM datasets, achieving a 50% reduction in collision rate and a 3-point improvement in closed-loop performance compared to the baseline model. Code and model are publicly available at this https URL
[CV-35] UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation
[Quick Read]: This paper addresses inaccurate attribute binding and poor text-image alignment in text-to-image (T2I) generation, especially compositional T2I generation. Diffusion Models have been studied extensively for this issue, but Masked Generative Transformers show similar limitations and have not been explored in this context. The authors propose a training-free method, Unmasking with Contrastive Attention Guidance (UNCAGE), which uses attention maps to guide the unmasking process and prioritizes unmasking tokens that clearly represent individual objects, improving compositional fidelity. The abstract reports consistent quantitative and qualitative gains across multiple benchmarks with negligible inference overhead.
Link: https://arxiv.org/abs/2508.05399
Authors: Wonjun Kang, Byeongkeun Ahn, Minjae Lee, Kevin Galim, Seunghyuk Oh, Hyung Il Koo, Nam Ik Cho
Affiliations: 1. Korea University; 2. Samsung Electronics; 3. Seoul National University; 4. KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code is available at this https URL
Abstract:Text-to-image (T2I) generation has been actively studied using Diffusion Models and Autoregressive Models. Recently, Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding through bidirectional attention and parallel decoding, enabling efficient and high-quality image generation. However, compositional T2I generation remains challenging, as even state-of-the-art Diffusion Models often fail to accurately bind attributes and achieve proper text-image alignment. While Diffusion Models have been extensively studied for this issue, Masked Generative Transformers exhibit similar limitations but have not been explored in this context. To address this, we propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps to prioritize the unmasking of tokens that clearly represent individual objects. UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Our code is available at this https URL.
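The idea of prioritizing tokens that "clearly represent individual objects" can be sketched as a re-ranking of masked image tokens. The scoring rule below is an illustrative assumption, not the paper's formula: each token's usual unmasking confidence is increased by the gap between its strongest and second-strongest attention to the object words in the prompt, so tokens that attend to exactly one object are unmasked first.

```python
def contrast_score(attention_row):
    """Gap between the strongest and second-strongest object attention."""
    ranked = sorted(attention_row, reverse=True)
    return ranked[0] - ranked[1]


def select_tokens_to_unmask(confidence, attention, masked, k, guidance=1.0):
    """Pick k masked tokens by confidence plus a contrastive attention bonus."""
    scored = [
        (confidence[t] + guidance * contrast_score(attention[t]), t)
        for t in masked
    ]
    scored.sort(reverse=True)
    return sorted(t for _, t in scored[:k])


# Three masked image tokens, two object words in the prompt (hypothetical values).
confidence = [0.5, 0.5, 0.5]
attention = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.7]]
chosen = select_tokens_to_unmask(confidence, attention, masked=[0, 1, 2], k=2)
```

Token 1 attends equally to both objects, so it is deferred to a later unmasking step. Because the rule only reorders tokens at inference time, no retraining is involved, which matches the training-free framing in the abstract.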
[CV-36] Deformable Attention Graph Representation Learning for Histopathology Whole Slide Image Analysis
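[Quick Read]: This paper addresses the limited accuracy of Whole Slide Image (WSI) and Region of Interest (ROI) classification that results from not modeling spatial dependencies among tissue structures. Methods based on Multiple Instance Learning (MIL) struggle to capture these dependencies; Graph Neural Networks (GNNs) usually rely on static graph topologies and overlook the physical positions of tissue patches; and conventional attention mechanisms are limited in focusing on structurally relevant regions. The solution is a new GNN framework with deformable attention. It builds a dynamic weighted directed graph from patch features, in which each node aggregates context from its neighbors through attention-weighted edges, and adds learnable spatial offsets informed by the real patch coordinates so the model can adaptively attend to morphologically relevant regions, enlarging the contextual field while preserving spatial specificity.
Link: https://arxiv.org/abs/2508.05382
Authors: Mingxi Fu, Xitong Ling, Yuxuan Chen, Jiawen Li, fanglei fu, Huaitian Yuan, Tian Guan, Yonghong He, Lianghui Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: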
Abstract:Accurate classification of Whole Slide Images (WSIs) and Regions of Interest (ROIs) is a fundamental challenge in computational pathology. While mainstream approaches often adopt Multiple Instance Learning (MIL), they struggle to capture the spatial dependencies among tissue structures. Graph Neural Networks (GNNs) have emerged as a solution to model inter-instance relationships, yet most rely on static graph topologies and overlook the physical spatial positions of tissue patches. Moreover, conventional attention mechanisms lack specificity, limiting their ability to focus on structurally relevant regions. In this work, we propose a novel GNN framework with deformable attention for pathology image analysis. We construct a dynamic weighted directed graph based on patch features, where each node aggregates contextual information from its neighbors via attention-weighted edges. Specifically, we incorporate learnable spatial offsets informed by the real coordinates of each patch, enabling the model to adaptively attend to morphologically relevant regions across the slide. This design significantly enhances the contextual field while preserving spatial specificity. Our framework achieves state-of-the-art performance on four benchmark datasets (TCGA-COAD, BRACS, gastric intestinal metaplasia grading, and intestinal ROI classification), demonstrating the power of deformable attention in capturing complex spatial structures in WSIs and ROIs.
[CV-37] CT-GRAPH: Hierarchical Graph Attention Network for Anatomy-Guided CT Report Generation
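【Quick read】: This paper addresses the problem that current methods for automatic radiology report generation rely only on global image features and cannot capture fine-grained relationships between organs, which hurts report accuracy. The key to the solution is a hierarchical graph attention network (CT-GRAPH) that explicitly models radiological knowledge by structuring anatomical regions into a graph: fine-grained organ features are linked to coarser anatomical systems and to a global patient context, pretrained 3D medical feature encoders extract organ-level and global features, these features are refined within the graph, and the result is integrated into a large language model to generate detailed medical reports.
Link: https://arxiv.org/abs/2508.05375
Authors: Hamza Kalisch,Fabian Hörst,Jens Kleesiek,Ken Herrmann,Constantin Seibold
Institution: Institute for AI in Medicine (IKIM), University Hospital Essen (AöR); Department of Nuclear Medicine, University Hospital Essen (AöR); Department of Physics, TU Dortmund
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: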
Abstract:As medical imaging is central to diagnostic processes, automating the generation of radiology reports has become increasingly relevant to assist radiologists with their heavy workloads. Most current methods rely solely on global image features, failing to capture fine-grained organ relationships crucial for accurate reporting. To this end, we propose CT-GRAPH, a hierarchical graph attention network that explicitly models radiological knowledge by structuring anatomical regions into a graph, linking fine-grained organ features to coarser anatomical systems and a global patient context. Our method leverages pretrained 3D medical feature encoders to obtain global and organ-level features by utilizing anatomical masks. These features are further refined within the graph and then integrated into a large language model to generate detailed medical reports. We evaluate our approach for the task of report generation on the large-scale chest CT dataset CT-RATE. We provide an in-depth analysis of pretrained feature encoders for CT report generation and show that our method achieves a substantial improvement of absolute 7.9% in F1 score over current state-of-the-art methods. The code is publicly available at this https URL.
[CV-38] Cross-View Localization via Redundant Sliced Observations and A-Contrario Validation
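【Quick read】: This paper addresses the problem that cross-view localization (CVL) methods output only a single observation (the camera pose), which makes mutual validation among measurements difficult and hinders the assessment of localization reliability. The key to the solution is Slice-Loc, a two-stage method whose core innovation is an a-contrario reliability validation: the query image is first divided into sub-images and a 3-DoF pose is estimated for each slice, yielding redundant and independent observations; a geometric rigidity formula then removes erroneous poses and the valid inliers are merged into the final pose; in addition, the number of false alarms (NFA) is estimated to quantify how trustworthy the localization is, so that failures can be detected and suppressed while accuracy improves. Experiments show that the method reduces the proportion of errors exceeding 10 m to under 3% and, in cross-city tests on DReSS, cuts the mean localization error from 4.47 m to 1.86 m and the mean orientation error from 3.42° to 1.24°.
Link: https://arxiv.org/abs/2508.05369
Authors: Yongjun Zhang,Mingtao Xiong,Yi Wan,Gui-Song Xia
Institution: School of Remote Sensing and Information Engineering, Wuhan University (武汉大学遥感信息工程学院); School of Artificial Intelligence, Wuhan University (武汉大学人工智能学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: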
Abstract:Cross-view localization (CVL) matches ground-level images with aerial references to determine the geo-position of a camera, enabling smart vehicles to self-localize offline in GNSS-denied environments. However, most CVL methods output only a single observation, the camera pose, and lack the redundant observations required by surveying principles, making it challenging to assess localization reliability through the mutual validation of observational data. To tackle this, we introduce Slice-Loc, a two-stage method featuring an a-contrario reliability validation for CVL. Instead of using the query image as a single input, Slice-Loc divides it into sub-images and estimates the 3-DoF pose for each slice, creating redundant and independent observations. Then, a geometric rigidity formula is proposed to filter out the erroneous 3-DoF poses, and the inliers are merged to generate the final camera pose. Furthermore, we propose a model that quantifies the meaningfulness of localization by estimating the number of false alarms (NFA), according to the distribution of the locations of the sliced images. By eliminating gross errors, Slice-Loc boosts localization accuracy and effectively detects failures. After filtering out mislocalizations, Slice-Loc reduces the proportion of errors exceeding 10 m to under 3%. In cross-city tests on the DReSS dataset, Slice-Loc cuts the mean localization error from 4.47 m to 1.86 m and the mean orientation error from 3.42° to 1.24°, outperforming state-of-the-art methods. Code and dataset will be available at: this https URL.
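The abstract does not spell out Slice-Loc's NFA model, so the following is only a generic a-contrario sketch under stated assumptions: each slice is assumed to land near the consensus location by chance with probability `p_chance`, and the NFA is the number of tested hypotheses times a binomial tail. The names `binomial_tail`, `number_of_false_alarms`, `n_tests`, and `p_chance` are hypothetical and are not taken from the paper.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def number_of_false_alarms(n_tests: int, n_slices: int,
                           n_agreeing: int, p_chance: float) -> float:
    """Generic a-contrario NFA: expected number of chance detections.

    Assumption (not from the paper): slices agree with the consensus
    independently with probability p_chance under the background model.
    """
    return n_tests * binomial_tail(n_slices, n_agreeing, p_chance)


# Hypothetical numbers: 8 slices, 7 of which agree on the camera location.
nfa = number_of_false_alarms(n_tests=100, n_slices=8, n_agreeing=7, p_chance=0.05)
is_meaningful = nfa < 1.0  # common a-contrario convention: accept when NFA < 1
```

In this convention a small NFA means the agreement among slices is unlikely to be accidental, which is one way to flag a localization as reliable.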
[CV-39] PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation
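【Quick read】: This paper addresses the neglect of patient-specific prior knowledge in chest X-ray report generation, namely the clinical context (e.g., symptoms, medical history) and the most recent prior image, both of which radiologists rely on for diagnostic reasoning. Most existing methods generate reports from a single image and do not integrate this prior information, so the generated reports fail to reflect diagnostic intent or disease progression. The key to the solution is the PriorRG framework with a two-stage training strategy: Stage 1 introduces prior-guided contrastive pre-training, which uses the clinical context to guide spatiotemporal feature extraction and brings the model closer to the intrinsic spatiotemporal semantics of radiology reports; Stage 2 designs a prior-aware coarse-to-fine decoding mechanism that progressively fuses patient-specific prior knowledge with the vision encoder's hidden states, improving the clinical accuracy and fluency of the reports.
Link: https://arxiv.org/abs/2508.05353
Authors: Kang Liu,Zhuoqi Ma,Zikang Fang,Yunan Li,Kun Xie,Qiguang Miao
Institution: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: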
Abstract:Chest X-ray report generation aims to reduce radiologists’ workload by automatically producing high-quality preliminary reports. A critical yet underexplored aspect of this task is the effective use of patient-specific prior knowledge – including clinical context (e.g., symptoms, medical history) and the most recent prior image – which radiologists routinely rely on for diagnostic reasoning. Most existing methods generate reports from single images, neglecting this essential prior information and thus failing to capture diagnostic intent or disease progression. To bridge this gap, we propose PriorRG, a novel chest X-ray report generation framework that emulates real-world clinical workflows via a two-stage training pipeline. In Stage 1, we introduce a prior-guided contrastive pre-training scheme that leverages clinical context to guide spatiotemporal feature extraction, allowing the model to align more closely with the intrinsic spatiotemporal semantics in radiology reports. In Stage 2, we present a prior-aware coarse-to-fine decoding for report generation that progressively integrates patient-specific prior knowledge with the vision encoder’s hidden states. This decoding allows the model to align with diagnostic focus and track disease progression, thereby enhancing the clinical accuracy and fluency of the generated reports. Extensive experiments on MIMIC-CXR and MIMIC-ABN datasets demonstrate that PriorRG outperforms state-of-the-art methods, achieving a 3.6% BLEU-4 and 3.8% F1 score improvement on MIMIC-CXR, and a 5.9% BLEU-1 gain on MIMIC-ABN. Code and checkpoints will be released upon acceptance.
[CV-40] 3DGabSplat: 3D Gabor Splatting for Frequency-adaptive Radiance Field Rendering ACM-MM’25
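【Quick read】: This paper addresses three weaknesses of 3D Gaussian Splatting (3DGS) in 3D scene reconstruction: the inherently low-pass Gaussian function limits the representation of high-frequency detail, redundant primitives degrade training and rendering efficiency, and memory overhead is excessive. The key to the solution is a new primitive based on 3D Gabor kernels, 3D Gabor Splatting (3DGabSplat), which forms a filter bank with multiple directional frequency responses and uses several 3D Gabor kernels at different frequencies to model fine 3D structure more flexibly and efficiently. An efficient CUDA-based rasterizer and a frequency-adaptive mechanism handle the projection of the directional frequency components and the adaptive joint optimization of primitives, improving view-synthesis quality while reducing the number of primitives and the memory footprint.
Link: https://arxiv.org/abs/2508.05343
Authors: Junyu Zhou,Yuyang Huang,Wenrui Dai,Junni Zou,Ziyang Zheng,Nuowen Kan,Chenglin Li,Hongkai Xiong
Institution: Shanghai Jiao Tong University (上海交通大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM’25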
Abstract:Recent prominence in 3D Gaussian Splatting (3DGS) has enabled real-time rendering while maintaining high-fidelity novel view synthesis. However, 3DGS resorts to the Gaussian function that is low-pass by nature and is restricted in representing high-frequency details in 3D scenes. Moreover, it causes redundant primitives with degraded training and rendering efficiency and excessive memory overhead. To overcome these limitations, we propose 3D Gabor Splatting (3DGabSplat) that leverages a novel 3D Gabor-based primitive with multiple directional 3D frequency responses for radiance field representation supervised by multi-view images. The proposed 3D Gabor-based primitive forms a filter bank incorporating multiple 3D Gabor kernels at different frequencies to enhance flexibility and efficiency in capturing fine 3D details. Furthermore, to achieve novel view rendering, an efficient CUDA-based rasterizer is developed to project the multiple directional 3D frequency components characterized by 3D Gabor-based primitives onto the 2D image plane, and a frequency-adaptive mechanism is presented for adaptive joint optimization of primitives. 3DGabSplat is scalable to be a plug-and-play kernel for seamless integration into existing 3DGS paradigms to enhance both efficiency and quality of novel view synthesis. Extensive experiments demonstrate that 3DGabSplat outperforms 3DGS and its variants using alternative primitives, and achieves state-of-the-art rendering quality across both real-world and synthetic scenes. Remarkably, we achieve up to 1.35 dB PSNR gain over 3DGS with simultaneously reduced number of primitives and memory consumption.
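To make the low-pass argument concrete, here is a minimal sketch of a textbook 3D Gabor kernel: a Gaussian envelope multiplied by a cosine carrier. The isotropic envelope, the function names `gabor_3d` and `gabor_bank`, and the weighted-sum filter bank are illustrative assumptions; the abstract does not give the paper's actual parameterization.

```python
import math


def gabor_3d(x, center, sigma, freq):
    """Textbook 3D Gabor kernel: isotropic Gaussian envelope times a cosine carrier.

    x, center: 3-tuples (position and kernel center)
    sigma:     envelope standard deviation (isotropic here for simplicity)
    freq:      3-tuple, spatial frequency vector of the carrier
    """
    d = [xi - ci for xi, ci in zip(x, center)]
    envelope = math.exp(-sum(di * di for di in d) / (2.0 * sigma**2))
    carrier = math.cos(2.0 * math.pi * sum(fi * di for fi, di in zip(freq, d)))
    return envelope * carrier


def gabor_bank(x, center, sigma, freqs, weights):
    """Hypothetical filter bank: weighted sum of Gabor kernels sharing one envelope."""
    return sum(w * gabor_3d(x, center, sigma, f) for f, w in zip(freqs, weights))


# With freq = (0, 0, 0) the kernel reduces to a plain Gaussian (the 3DGS case).
plain_gaussian = gabor_3d((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0, (0.0, 0.0, 0.0))
```

The zero-frequency case shows why a Gabor-style primitive can be read as a generalization of the Gaussian one: nonzero carrier frequencies add oscillation inside the same envelope.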
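[CV-41] Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting
【Quick read】: This paper addresses how to optimize large pre-trained Vision Language Models (VLMs) for specific object detection targets, and in particular how to fine-tune efficiently while keeping the model's original zero-shot capabilities and natural language querying. Conventional fine-tuning improves task performance but often makes the model forget its original abilities. The paper proposes an approach inspired by Textual Inversion (TI) that extends the VLM vocabulary by learning new token embeddings or improving existing ones, so that novel or fine-grained objects can be detected from as few as three example images. The key point is that updates are confined to the token embedding dimension while the original model weights stay frozen, which preserves the original benchmark performance, supports zero-shot domain transfer (e.g., detecting sketches after training only on real photos), and costs far less compute than full-model fine-tuning.
Link: https://arxiv.org/abs/2508.05323
Authors: Frank Ruis,Gertjan Burghouts,Hugo Kuijf
Institution: TNO (荷兰应用科学研究组织)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: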
Abstract:Recent progress in large pre-trained vision language models (VLMs) has reached state-of-the-art performance on several object detection benchmarks and boasts strong zero-shot capabilities, but for optimal performance on specific targets some form of finetuning is still necessary. While the initial VLM weights allow for great few-shot transfer learning, this usually involves the loss of the original natural language querying and zero-shot capabilities. Inspired by the success of Textual Inversion (TI) in personalizing text-to-image diffusion models, we propose a similar formulation for open-vocabulary object detection. TI allows extending the VLM vocabulary by learning new or improving existing tokens to accurately detect novel or fine-grained objects from as little as three examples. The learned tokens are completely compatible with the original VLM weights while keeping them frozen, retaining the original model’s benchmark performance, and leveraging its existing capabilities such as zero-shot domain transfer (e.g., detecting a sketch of an object after training only on real photos). The storage and gradient calculations are limited to the token embedding dimension, requiring significantly less compute than full-model fine-tuning. We evaluated whether the method matches or outperforms the baseline methods that suffer from forgetting in a wide variety of quantitative and qualitative experiments.
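[CV-42] mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering
【Quick read】: This paper addresses a core problem of conventional Retrieval-Augmented Generation (RAG) methods for Visual Question Answering (VQA) on knowledge-intensive tasks: because they rely on unstructured documents and overlook the structural relationships among knowledge elements, they introduce irrelevant or misleading content and reduce answer accuracy and reliability. The key to the solution is mKG-RAG, a generation framework enhanced by multimodal knowledge graphs (KGs). It uses MLLM-powered keyword extraction and vision-text matching to distill semantically consistent, modality-aligned entities and relationships from multimodal documents, building high-quality multimodal KGs as a structured knowledge representation. It also introduces a dual-stage retrieval strategy with a question-aware multimodal retriever, which improves retrieval efficiency while refining precision and leads to more accurate and reliable VQA answers.
Link: https://arxiv.org/abs/2508.05318
Authors: Xu Yuan,Liangbo Ning,Wenqi Fan,Qing Li
Institution: The Hong Kong Polytechnic University (香港理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: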
Abstract:Recently, Retrieval-Augmented Generation (RAG) has been proposed to expand internal knowledge of Multimodal Large Language Models (MLLMs) by incorporating external knowledge databases into the generation process, which is widely used for knowledge-based Visual Question Answering (VQA) tasks. Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relationships among knowledge elements frequently introduce irrelevant or misleading content, reducing answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks to enhance the generation by introducing structured multimodal knowledge. Therefore, in this paper, we propose a novel multimodal knowledge-augmented generation framework (mKG-RAG) based on multimodal KGs for knowledge-intensive VQA tasks. Specifically, our approach leverages MLLM-powered keyword extraction and vision-text matching to distill semantically consistent and modality-aligned entities/relationships from multimodal documents, constructing high-quality multimodal KGs as structured knowledge representations. In addition, a dual-stage retrieval strategy equipped with a question-aware multimodal retriever is introduced to improve retrieval efficiency while refining precision. Comprehensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art for knowledge-based VQA.
[CV-43] Divide-and-Conquer for Enhancing Unlabeled Learning Stability and Plasticity in Semi-supervised Continual Learning ICCV2025
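【Quick read】: This paper addresses the core challenge of Semi-supervised Continual Learning (SSCL): how to use labeled and unlabeled data effectively in a sequential learning setting while balancing Memory Stability (MS) and Learning Plasticity (LP) and improving Unlabeled Learning (UL). The key to the solution is the USP framework, which uses a divide-and-conquer strategy to improve all three aspects together: (1) a Feature Space Reservation (FSR) strategy strengthens LP by shaping old classes into an equiangular tight frame, which reserves feature locations for future classes; (2) a Divide-and-Conquer Pseudo-labeling (DCP) approach improves UL by assigning reliable pseudo-labels to both high- and low-confidence unlabeled samples; (3) Class-mean-anchored Unlabeled Distillation (CUD) reinforces MS by reusing DCP's outputs to anchor unlabeled data to stable class means for distillation, which mitigates catastrophic forgetting. Experiments show gains of up to 5.94% in last accuracy over existing methods.
Link: https://arxiv.org/abs/2508.05316
Authors: Yue Duan,Taicai Chen,Lei Qi,Yinghuan Shi
Institution: Nanjing University (南京大学); Southeast University (东南大学)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025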
Abstract:Semi-supervised continual learning (SSCL) seeks to leverage both labeled and unlabeled data in a sequential learning setup, aiming to reduce annotation costs while managing continual data arrival. SSCL introduces complex challenges, including ensuring effective unlabeled learning (UL), while balancing memory stability (MS) and learning plasticity (LP). Previous SSCL efforts have typically focused on isolated aspects of the three, while this work presents USP, a divide-and-conquer framework designed to synergistically enhance these three aspects: (1) Feature Space Reservation (FSR) strategy for LP, which constructs reserved feature locations for future classes by shaping old classes into an equiangular tight frame; (2) Divide-and-Conquer Pseudo-labeling (DCP) approach for UL, which assigns reliable pseudo-labels across both high- and low-confidence unlabeled data; and (3) Class-mean-anchored Unlabeled Distillation (CUD) for MS, which reuses DCP’s outputs to anchor unlabeled data to stable class means for distillation to prevent forgetting. Comprehensive evaluations show USP outperforms prior SSCL methods, with gains up to 5.94% in the last accuracy, validating its effectiveness. The code is available at this https URL.
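The equiangular tight frame mentioned for FSR has a standard closed form, the simplex ETF, sketched below. How USP assigns old and future classes to the frame is not described in the abstract, so the split into `old_class_targets` and `reserved_targets` is a hypothetical illustration only.

```python
import math


def simplex_etf(num_classes: int):
    """Return num_classes unit vectors in R^num_classes forming a simplex ETF.

    Vector i is sqrt(K / (K - 1)) * (e_i - 1/K * ones); every pair of distinct
    vectors has inner product -1 / (K - 1), the most negative value possible
    for K equiangular unit vectors.
    """
    k = num_classes
    if k < 2:
        raise ValueError("a simplex ETF needs at least 2 classes")
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical split: 3 classes seen so far, 2 slots reserved for future classes.
etf = simplex_etf(5)
old_class_targets, reserved_targets = etf[:3], etf[3:]
```

Because all targets are mutually equiangular, a class that arrives later gets a location that is as far from every old class as the old classes are from each other.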
[CV-44] CoCAViT: Compact Vision Transformer with Robust Global Coordination
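【Quick read】: This paper addresses the disproportionately large performance drop of small vision models on out-of-distribution (OOD) data, which exposes a generalization deficiency in existing efficient architectures. The key to the solution is to identify and correct the architectural bottlenecks and design flaws behind the problem, and to introduce a Coordinator-patch Cross Attention (CoCA) mechanism that uses dynamic, domain-aware global tokens to strengthen local-global feature modeling, giving robust cross-domain feature extraction at very low computational cost. Combining these improvements, the authors propose CoCAViT, a visual backbone designed for real-time visual representation, which shows strong accuracy and OOD robustness on benchmarks such as ImageNet-1K.
Link: https://arxiv.org/abs/2508.05307
Authors: Xuyang Wang,Lingjuan Miao,Zhiqiang Zhou
Institution: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: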
Abstract:In recent years, large-scale visual backbones have demonstrated remarkable capabilities in learning general-purpose features from images via extensive pre-training. Concurrently, many efficient architectures have emerged that have performance comparable to that of larger models on in-domain benchmarks. However, we observe that for smaller models, the performance drop on out-of-distribution (OOD) data is disproportionately larger, indicating a deficiency in the generalization performance of existing efficient models. To address this, we identify key architectural bottlenecks and inappropriate design choices that contribute to this issue, retaining robustness for smaller models. To restore the global field of pure window attention, we further introduce a Coordinator-patch Cross Attention (CoCA) mechanism, featuring dynamic, domain-aware global tokens that enhance local-global feature modeling and adaptively capture robust patterns across domains with minimal computational overhead. Integrating these advancements, we present CoCAViT, a novel visual backbone designed for robust real-time visual representation. Extensive experiments empirically validate our design. At a resolution of 224*224, CoCAViT-28M achieves 84.0% top-1 accuracy on ImageNet-1K, with significant gains on multiple OOD benchmarks, compared to competing models. It also attains 52.2 mAP on COCO object detection and 51.3 mIOU on ADE20K semantic segmentation, while maintaining low latency.
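[CV-45] VS-LLM: Visual-Semantic Depression Assessment based on LLM for Drawing Projection Test
【Quick read】: This paper addresses the strong subjectivity and low efficiency of manual interpretation when sketches on the theme "a person picking an apple from a tree" (PPAT) are used to assess depression in the Drawing Projection Test (DPT). Traditional practice depends on the psychologist's experience and subjective judgment, which makes large-scale automated analysis difficult. The key to the solution is a Visual-Semantic depression assessment method based on LLM (VS-LLM), which focuses on holistic properties of the sketch such as color usage and spatial layout rather than detailed depiction, and automates the analysis; experiments report a 17.6% improvement over the psychologist assessment method.
Link: https://arxiv.org/abs/2508.05299
Authors: Meiqi Wu,Yaxuan Kang,Xuchen Li,Shiyu Hu,Xiaotang Chen,Yunfeng Kang,Weiqiang Wang,Kaiqi Huang
Institution: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: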
Abstract:The Drawing Projection Test (DPT) is an essential tool in art therapy, allowing psychologists to assess participants’ mental states through their sketches. Specifically, through sketches with the theme of “a person picking an apple from a tree (PPAT)”, it can be revealed whether the participants are in mental states such as depression. Compared with scales, the DPT can enrich psychologists’ understanding of an individual’s mental state. However, the interpretation of the PPAT is laborious and depends on the experience of the psychologists. To address this issue, we propose an effective identification method to support psychologists in conducting a large-scale automatic DPT. Unlike traditional sketch recognition, the DPT focuses more on the overall evaluation of the sketches, such as color usage and space utilization. Moreover, PPAT imposes a time limit and prohibits verbal reminders, resulting in low drawing accuracy and a lack of detailed depiction. To address these challenges, we propose the following efforts: (1) Providing an experimental environment for automated analysis of PPAT sketches for depression assessment; (2) Offering a Visual-Semantic depression assessment based on LLM (VS-LLM) method; (3) Experimental results demonstrate that our method improves by 17.6% compared to the psychologist assessment method. We anticipate that this work will contribute to the research in mental state assessment based on PPAT sketches’ elements recognition. Our datasets and codes are available at this https URL.
[CV-46] Wavelet-Guided Dual-Frequency Encoding for Remote Sensing Change Detection
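【Quick read】: This paper addresses the difficulty of detecting subtle change regions in remote sensing change detection, which stems from the limited diversity of spatial-domain feature representations and is most severe where edges are blurred and structural detail is unclear. The key to the solution is Wavelet-Guided Dual-Frequency Encoding (WGDF). The Discrete Wavelet Transform (DWT) first decomposes the input images into high-frequency and low-frequency components, which model local details and global structures respectively. The high-frequency branch uses a Dual-Frequency Feature Enhancement (DFFE) module to strengthen edge-detail representation and a Frequency-Domain Interactive Difference (FDID) module to improve fine-grained change modeling. The low-frequency branch uses Transformers to capture global semantic relationships and a Progressive Contextual Difference Module (PCDM) to refine change regions progressively for precise structural and semantic characterization. Finally, the high- and low-frequency features are fused to combine local sensitivity with global discriminability, improving detection accuracy and robustness.
Link: https://arxiv.org/abs/2508.05271
Authors: Xiaoyang Zhang,Guodong Fan,Guang-Yong Chen,Zhen Hua,Jinjiang Li,Min Gan,C. L. Philip Chen
Institution: Shandong Technology and Business University (山东工商大学); Fuzhou University (福州大学); Qingdao University (青岛大学); South China University of Technology (华南理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to TAES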
Abstract:Change detection in remote sensing imagery plays a vital role in various engineering applications, such as natural disaster monitoring, urban expansion tracking, and infrastructure management. Despite the remarkable progress of deep learning in recent years, most existing methods still rely on spatial-domain modeling, where the limited diversity of feature representations hinders the detection of subtle change regions. We observe that frequency-domain feature modeling, particularly in the wavelet domain, can amplify fine-grained differences in frequency components, enhancing the perception of edge changes that are challenging to capture in the spatial domain. Thus, we propose a method called Wavelet-Guided Dual-Frequency Encoding (WGDF). Specifically, we first apply Discrete Wavelet Transform (DWT) to decompose the input images into high-frequency and low-frequency components, which are used to model local details and global structures, respectively. In the high-frequency branch, we design a Dual-Frequency Feature Enhancement (DFFE) module to strengthen edge detail representation and introduce a Frequency-Domain Interactive Difference (FDID) module to enhance the modeling of fine-grained changes. In the low-frequency branch, we exploit Transformers to capture global semantic relationships and employ a Progressive Contextual Difference Module (PCDM) to progressively refine change regions, enabling precise structural semantic characterization. Finally, the high- and low-frequency features are synergistically fused to unify local sensitivity with global discriminability. Extensive experiments on multiple remote sensing datasets demonstrate that WGDF significantly alleviates edge ambiguity and achieves superior detection accuracy and robustness compared to state-of-the-art methods. The code will be available at this https URL.
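The DWT decomposition step can be illustrated with a single-level 2D Haar transform. The abstract does not say which wavelet WGDF uses, so Haar is an assumption chosen for brevity, and the sub-band names (`low`, `col_diff`, `row_diff`, `diag`) are descriptive labels rather than the paper's notation.

```python
def haar_dwt2(image):
    """Single-level 2D Haar DWT on a list-of-lists image with even height and width.

    Each 2x2 block [[a, b], [c, d]] yields four orthonormal coefficients:
      low      = (a + b + c + d) / 2   low-frequency average (global structure)
      col_diff = (a - b + c - d) / 2   change across columns (vertical edges)
      row_diff = (a + b - c - d) / 2   change across rows (horizontal edges)
      diag     = (a - b - c + d) / 2   diagonal detail
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    low, col_diff, row_diff, diag = [], [], [], []
    for r in range(0, rows, 2):
        low_row, cd_row, rd_row, dg_row = [], [], [], []
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            c2, d = image[r + 1][c], image[r + 1][c + 1]
            low_row.append((a + b + c2 + d) / 2)
            cd_row.append((a - b + c2 - d) / 2)
            rd_row.append((a + b - c2 - d) / 2)
            dg_row.append((a - b - c2 + d) / 2)
        low.append(low_row)
        col_diff.append(cd_row)
        row_diff.append(rd_row)
        diag.append(dg_row)
    return low, col_diff, row_diff, diag


# A vertical edge shows up only in the low band and the column-difference band.
low, col_diff, row_diff, diag = haar_dwt2([[1, 0], [1, 0]])
```

A dual-branch design like the one described would route the `low` band to the global branch and the three detail bands to the edge-oriented branch.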
[CV-47] B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding ACM-MM2025
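【Quick read】: This paper addresses the underuse of 4D LiDAR data by Multimodal Large Language Models (MLLMs) for understanding dynamic outdoor environments. The core challenges are the absence of high-quality, modality-specific annotations and the lack of MLLM architectures that can process high-dimensional 4D LiDAR data directly. The key to the solution is the B4DL benchmark, a scalable data generation pipeline, and the first MLLM that processes raw 4D LiDAR directly by bridging it with language understanding, which together give a unified approach to spatio-temporal reasoning in dynamic outdoor scenes.
Link: https://arxiv.org/abs/2508.05269
Authors: Changho Choi,Youngwoo Shin,Gyojin Han,Dong-Jae Lee,Junmo Kim
Institution: Korea Advanced Institute of Science and Technology (韩国科学技术院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ACM MM 2025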
Abstract:Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. LiDAR-based 4D point clouds provide precise spatial geometry and rich temporal cues, making them ideal for representing real-world scenes. However, despite their potential, 4D LiDAR remains underexplored in the context of Multimodal Large Language Models (MLLMs) due to the absence of high-quality, modality-specific annotations and the lack of MLLM architectures capable of processing its high-dimensional composition. To address these challenges, we introduce B4DL, a new benchmark specifically designed for training and evaluating MLLMs on 4D LiDAR understanding. In addition, we propose a scalable data generation pipeline and an MLLM model that, for the first time, directly processes raw 4D LiDAR by bridging it with language understanding. Combined with our dataset and benchmark, our model offers a unified solution for spatio-temporal reasoning in dynamic outdoor environments. We provide rendered 4D LiDAR videos, generated dataset, and inference outputs on diverse scenarios at: this https URL
[CV-48] SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion
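【Quick read】: This paper addresses two problems in Infrared and Visible Image Fusion (IVIF): key target information is lost because methods lack deep semantic understanding of the scene, and the fusion process itself introduces artifacts and loses detail. The key to the solution is SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM): high-quality semantic masks generated by SAM serve as explicit priors, and a conditional diffusion model performs coarse-to-fine denoising generation, so that semantic guidance is injected explicitly into the fusion process and the final result keeps high fidelity.
Link: https://arxiv.org/abs/2508.05264
Authors: Xiaoyang Zhang,Zhen Hua,Yakun Ju,Wei Zhou,Jun Liu,Alex C. Kot
Institution: Shandong Technology and Business University (山东工商学院); University of Leicester (莱斯特大学); Lancaster University (兰卡斯特大学); Cardiff University (卡迪夫大学); Nanyang Technological University (南洋理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to TCSVT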
Abstract:Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model’s coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion. The code of SGDFuse is available at this https URL.
[CV-49] Robust Tracking with Particle Filtering for Fluorescent Cardiac Imaging
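【Quick read】: This paper addresses the failure of traditional feature-point tracking in real-time cardiac perfusion assessment after Coronary Artery Bypass Grafting (CABG), where heart motion and vessel structural enrichment cause large fluctuations in image characteristics. The key to the solution is a particle filtering tracker based on cyclic consistency checks, which applies geometric constraints and cross-frame consistency validation to the sampled particles and markedly improves tracking robustness and accuracy in complex dynamic scenes. It tracks 117 targets simultaneously at 25.4 fps in real time with a tracking error of (5.00 ± 0.22 px), outperforming existing deep learning and conventional trackers.
Link: https://arxiv.org/abs/2508.05262
Authors: Suresh Guttikonda,Maximilian Neidhart,Johanna Sprenger,Johannes Petersen,Christian Detter,Alexander Schlaefer
Institution: Hamburg University of Technology (汉堡工业大学); University Heart and Vascular Center Hamburg (汉堡大学心脏和血管中心); SustAInLivWork Center of Excellence (SustAInLivWork卓越中心)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to CURAC conference 2025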
Abstract:Intraoperative fluorescent cardiac imaging enables quality control following coronary bypass grafting surgery. We can estimate local quantitative indicators, such as cardiac perfusion, by tracking local feature points. However, heart motion and significant fluctuations in image characteristics caused by vessel structural enrichment limit traditional tracking methods. We propose a particle filtering tracker based on cyclic consistency checks to robustly track particles sampled to follow target landmarks. Our method tracks 117 targets simultaneously at 25.4 fps, allowing real-time estimates during interventions. It achieves a tracking error of (5.00 +/- 0.22 px) and outperforms other deep learning trackers (22.3 +/- 1.1 px) and conventional trackers (58.1 +/- 27.1 px).
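The abstract does not define its cyclic consistency check, so the sketch below shows one common reading, a forward-backward test: a particle is kept only if tracking it to the next frame and back returns close to where it started. The callables `forward` and `backward`, the tolerance `tol_px`, and the toy motion model are all hypothetical.

```python
import math


def cycle_error(point, forward, backward):
    """Distance between a point and its forward-then-backward tracked position."""
    fx, fy = forward(point)
    bx, by = backward((fx, fy))
    return math.hypot(bx - point[0], by - point[1])


def keep_consistent(particles, forward, backward, tol_px):
    """Propagate particles one frame, dropping those that fail the cycle check."""
    return [forward(p) for p in particles
            if cycle_error(p, forward, backward) <= tol_px]


# Toy motion model: every point moves by (+2, +1) px between frames.
def forward(p):
    return (p[0] + 2.0, p[1] + 1.0)


# Toy backward tracker that drifts by 5 px for points with x > 12 (simulated failure).
def backward(p):
    drift = 5.0 if p[0] > 12.0 else 0.0
    return (p[0] - 2.0 + drift, p[1] - 1.0)


particles = [(1.0, 1.0), (11.0, 3.0)]
kept = keep_consistent(particles, forward, backward, tol_px=1.0)
```

In a full particle filter the cycle error would more likely feed into particle weights than act as a hard threshold; the hard cut is used here only to keep the example short.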
[CV-50] CF3: Compact and Fast 3D Feature Fields ICCV2025
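【Quick read】: This paper addresses the high computational cost that current 3D Gaussian Splatting (3DGS) methods incur when incorporating information from 2D foundation models through a bottom-up optimization process. Existing methods treat raw 2D features as ground truth for optimization, which is inefficient and makes it hard to build a compact 3D feature field. The key to the solution is CF3, a top-down pipeline: it first performs a fast weighted fusion of multi-view 2D features with pre-trained Gaussians, then trains a per-Gaussian autoencoder directly on the lifted features so that it matches the actual feature distribution better. More importantly, it introduces an adaptive sparsification strategy that optimizes the attributes of the remaining Gaussians while pruning and merging redundant ones, giving an efficient representation that preserves geometric detail; the resulting 3D feature field is competitive while using as little as 5% of the Gaussians of Feature-3DGS.
Link: https://arxiv.org/abs/2508.05254
Authors: Hyunjoon Lee,Joonkyu Min,Jaesik Park
Institution: Seoul National University (首尔国立大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025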
Abstract:3D Gaussian Splatting (3DGS) has begun incorporating rich information from 2D foundation models. However, most approaches rely on a bottom-up optimization process that treats raw 2D features as ground truth, incurring increased computational costs. We propose a top-down pipeline for constructing compact and fast 3D Gaussian feature fields, namely, CF3. We first perform a fast weighted fusion of multi-view 2D features with pre-trained Gaussians. This approach enables training a per-Gaussian autoencoder directly on the lifted features, instead of training autoencoders in the 2D domain. As a result, the autoencoder better aligns with the feature distribution. More importantly, we introduce an adaptive sparsification method that optimizes the Gaussian attributes of the feature field while pruning and merging the redundant Gaussians, constructing an efficient representation with preserved geometric details. Our approach achieves a competitive 3D feature field using as little as 5% of the Gaussians compared to Feature-3DGS.
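The "fast weighted fusion" of multi-view features can be pictured as a weighted average per Gaussian. The sketch below assumes each view contributes a feature vector and a scalar weight (for example, how strongly that Gaussian contributes to the view); the abstract does not state the weighting scheme, so both the weights and the name `fuse_features` are hypothetical.

```python
def fuse_features(observations):
    """Weighted average of per-view feature vectors for a single Gaussian.

    observations: list of (weight, feature_vector) pairs, one per view.
    The weights are an assumption standing in for the Gaussian's contribution
    to each view; they must not all be zero.
    """
    total = sum(weight for weight, _ in observations)
    if total == 0:
        raise ValueError("at least one observation needs a nonzero weight")
    dim = len(observations[0][1])
    return [sum(weight * feat[i] for weight, feat in observations) / total
            for i in range(dim)]


# Two views of one Gaussian; the second view is weighted three times as much.
lifted = fuse_features([(1.0, [1.0, 0.0]), (3.0, [0.0, 1.0])])
```

Since this step is a closed-form average rather than an optimization loop, it matches the top-down framing in which features are lifted to 3D first and compressed afterwards.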
[CV-51] A Study of Gender Classification Techniques Based on Iris Images: A Deep Survey and Analysis
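【Quick read】: This paper addresses biometric gender classification, with a focus on methods that use the iris as the feature source. The key points are that the iris is highly stable (essentially constant over an individual's life), non-invasive, and externally visible, and that mature techniques already exist for segmenting and encoding iris images, so attribute vectors can be extracted from iris texture and used as a reliable basis for gender recognition. The study reviews existing methods systematically and identifies challenges and directions for improvement, giving later research a theoretical foundation and practical reference.
Link: https://arxiv.org/abs/2508.05246
Authors: Basna Mohammed Salih Hasan,Ramadhan J. Mstafa
Institution: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 Pages, 8 Figures, 1 Table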
Abstract:Gender classification is attractive in a range of applications, including surveillance and monitoring, corporate profiling, and human-computer interaction. Individuals’ identities may be gleaned from information about their gender, which is a kind of soft biometric. Over the years, several methods for determining a person’s gender have been devised. Some of the most well-known ones are based on physical characteristics like face, fingerprint, palmprint, DNA, ears, gait, and iris. On the other hand, facial features account for the vast majority of gender classification methods. Also, the iris is a significant biometric trait because the iris, according to research, remains basically constant during an individual’s life. Besides that, the iris is externally visible and is non-invasive to the user, which is important for practical applications. Furthermore, there are already high-quality methods for segmenting and encoding iris images, and the current methods facilitate selecting and extracting attribute vectors from iris textures. This study discusses several approaches to determining gender. The previous works of literature are briefly reviewed. Additionally, there are a variety of methodologies for different steps of gender classification. This study provides researchers with knowledge and analysis of the existing gender classification approaches. Also, it will assist researchers who are interested in this specific area, as well as highlight the gaps and challenges in the field, and finally provide suggestions and future paths for improvement.
[CV-52] RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding
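【Quick read】: This paper addresses two key problems in medical image understanding: high-quality annotated medical data is scarce, and existing methods rely too heavily on global image features, so they miss subtle but clinically significant lesion regions. The key to the solution is the RegionMed-CLIP framework, built around a region-aware multimodal contrastive learning mechanism: an innovative region-of-interest (ROI) processor adaptively fuses fine-grained local features with global semantic information, and a progressive training strategy strengthens hierarchical cross-modal alignment. This makes the model markedly more sensitive to localized pathological signals, and it surpasses current state-of-the-art models on image-text retrieval, zero-shot classification, and visual question answering.
Link: https://arxiv.org/abs/2508.05244
Authors: Tianchen Fang,Guiru Liu
Institution: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: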
Abstract:Medical image understanding plays a crucial role in enabling automated diagnosis and data-driven clinical decision support. However, its progress is impeded by two primary challenges: the limited availability of high-quality annotated medical data and an overreliance on global image features, which often miss subtle but clinically significant pathological regions. To address these issues, we introduce RegionMed-CLIP, a region-aware multimodal contrastive learning framework that explicitly incorporates localized pathological signals along with holistic semantic representations. The core of our method is an innovative region-of-interest (ROI) processor that adaptively integrates fine-grained regional features with the global context, supported by a progressive training strategy that enhances hierarchical multimodal alignment. To enable large-scale region-level representation learning, we construct MedRegion-500k, a comprehensive medical image-text corpus that features extensive regional annotations and multilevel clinical descriptions. Extensive experiments on image-text retrieval, zero-shot classification, and visual question answering tasks demonstrate that RegionMed-CLIP consistently exceeds state-of-the-art vision language models by a wide margin. Our results highlight the critical importance of region-aware contrastive pre-training and position RegionMed-CLIP as a robust foundation for advancing multimodal medical image understanding.
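The region-aware contrastive objective described above can be illustrated with a minimal Python sketch. It assumes a CLIP-style symmetric InfoNCE loss and a simple weighted blend of global and region features; the `fuse` rule, the `alpha` weight, and the `temperature` value are assumptions made for illustration, not the paper's ROI processor.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def fuse(global_feat, region_feats, alpha=0.5):
    """Hypothetical fusion: blend the global feature with the mean region feature."""
    dim = len(global_feat)
    mean_region = [sum(r[d] for r in region_feats) / len(region_feats) for d in range(dim)]
    return l2_normalize([(1 - alpha) * g + alpha * m for g, m in zip(global_feat, mean_region)])

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; image i and text i are assumed to be a matched pair."""
    n = len(image_embs)
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
              for img in image_embs]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))
```

In this sketch, `fuse` would be applied to each image before `clip_loss`; a larger `alpha` gives the region features more weight relative to the global context.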
[CV-53] Navigating the Trade-off: A Synthesis of Defensive Strategies for Zero-Shot Adversarial Robustness in Vision-Language Models
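[Quick read]: This paper addresses the inherent trade-off in vision-language models (VLMs) between zero-shot adversarial robustness and zero-shot generalization. Its core contribution is a systematic review and comparison of two defense paradigms: Adversarial Fine-Tuning (AFT), which adjusts model parameters, and training-free/test-time defenses, which leave the parameters unchanged. It further traces the technical evolution from alignment-preserving methods (TeCoA) to embedding-space re-engineering (LAAT, TIMA), and from input heuristics (AOM, TTC) to latent-space purification (CLIPure), and argues that future work should explore hybrid defense strategies and adversarial pre-training to break through current bottlenecks.
Link: https://arxiv.org/abs/2508.05237
Authors: Zane Xu, Jason Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: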
Abstract:This report synthesizes eight seminal papers on the zero-shot adversarial robustness of vision-language models (VLMs) like CLIP. A central challenge in this domain is the inherent trade-off between enhancing adversarial robustness and preserving the model’s zero-shot generalization capabilities. We analyze two primary defense paradigms: Adversarial Fine-Tuning (AFT), which modifies model parameters, and Training-Free/Test-Time Defenses, which preserve them. We trace the evolution from alignment-preserving methods (TeCoA) to embedding space re-engineering (LAAT, TIMA), and from input heuristics (AOM, TTC) to latent-space purification (CLIPure). Finally, we identify key challenges and future directions including hybrid defense strategies and adversarial pre-training.
[CV-54] ArbiViewGen: Controllable Arbitrary Viewpoint Camera Data Generation for Autonomous Driving via Stable Diffusion Models
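[Quick read]: This paper addresses arbitrary-viewpoint image generation for autonomous driving scenes. The core challenge is the lack of ground-truth data for unseen viewpoints, which limits the training of high-fidelity generative models. The key to the solution is Arbiviewgen, a diffusion-based framework with two new components. The first, Feature-Aware Adaptive View Stitching (FAVS), performs coarse geometric matching from camera poses, then combines an improved feature-matching algorithm with clustering analysis to identify high-confidence matching regions. The second, Cross-View Consistency Self-Supervised Learning (CVC-SSL), uses a diffusion model to reconstruct the original camera views from the synthesized stitched images, which imposes a cross-view consistency constraint without extra sensors or depth maps. As a result, the model can perform controllable arbitrary-viewpoint image generation from multi-camera images and their poses alone.
Link: https://arxiv.org/abs/2508.05236
Authors: Yatong Lan, Jingfeng Chen, Yiru Wang, Lei He
Affiliations: Tsinghua University (清华大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures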
Abstract:Arbitrary viewpoint image generation holds significant potential for autonomous driving, yet remains a challenging task due to the lack of ground-truth data for extrapolated views, which hampers the training of high-fidelity generative models. In this work, we propose Arbiviewgen, a novel diffusion-based framework for the generation of controllable camera images from arbitrary points of view. To address the absence of ground-truth data in unseen views, we introduce two key components: Feature-Aware Adaptive View Stitching (FAVS) and Cross-View Consistency Self-Supervised Learning (CVC-SSL). FAVS employs a hierarchical matching strategy that first establishes coarse geometric correspondences using camera poses, then performs fine-grained alignment through improved feature matching algorithms, and identifies high-confidence matching regions via clustering analysis. Building upon this, CVC-SSL adopts a self-supervised training paradigm where the model reconstructs the original camera views from the synthesized stitched images using a diffusion model, enforcing cross-view consistency without requiring supervision from extrapolated data. Our framework requires only multi-camera images and their associated poses for training, eliminating the need for additional sensors or depth maps. To our knowledge, Arbiviewgen is the first method capable of controllable arbitrary view camera image generation in multiple vehicle configurations.
[CV-55] Segmenting the Complex and Irregular in Two-Phase Flows: A Real-World Empirical Study with SAM2
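[Quick read]: This paper addresses bubble segmentation in multiphase flows, in particular the failure of traditional methods (and most learning-based ones) that assume near-spherical bubbles when bubbles deform, coalesce, or break up into complex shapes. The key to the solution is to bring in modern vision foundation models and cast the task as a transfer learning problem. The paper shows for the first time that a fine-tuned Segment Anything Model (SAM v2.1) can accurately segment highly non-convex, irregular bubble structures with only about 100 annotated images, which substantially improves segmentation accuracy and generalization in complex settings such as air lubrication systems.
Link: https://arxiv.org/abs/2508.05227
Authors: Semanur Küçük, Cosimo Della Santina, Angeliki Laskari
Affiliations: Delft University of Technology (代尔夫特理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages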
Abstract:Segmenting gas bubbles in multiphase flows is a critical yet unsolved challenge in numerous industrial settings, from metallurgical processing to maritime drag reduction. Traditional approaches, and most recent learning-based methods, assume near-spherical shapes, limiting their effectiveness in regimes where bubbles undergo deformation, coalescence, or breakup. This complexity is particularly evident in air lubrication systems, where coalesced bubbles form amorphous and topologically diverse patches. In this work, we revisit the problem through the lens of modern vision foundation models. We cast the task as a transfer learning problem and demonstrate, for the first time, that a fine-tuned Segment Anything Model (SAM v2.1) can accurately segment highly non-convex, irregular bubble structures using as few as 100 annotated images.
[CV-56] Don’t Reach for the Stars: Rethinking Topology for Resilient Federated Learning
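[Quick read]: This paper addresses the limitations of traditional federated learning (FL) under a centralized architecture: a single point of failure, weak personalization, poor robustness to data distribution shifts, and unreliable client update selection that depends on low-level parameter differences. The core of the solution is a decentralized peer-to-peer (P2P) FL framework, LIGHTYEAR (Local Inference Guided Aggregation for Heterogeneous Training Environments to Yield Enhancement Through Agreement and Regularization). Its key innovation is an agreement score, computed on a local validation set, that quantifies how well model updates from other clients align semantically, in function space, with the current client's reference model. Each client uses this score to select trustworthy and beneficial updates for aggregation and adds a regularization term to stabilize training. This mechanism markedly improves client-level performance, especially in adversarial and heterogeneous settings.
Link: https://arxiv.org/abs/2508.05224
Authors: Mirko Konstantin, Anirban Mukhopadhyay
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: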
Abstract:Federated learning (FL) enables collaborative model training across distributed clients while preserving data privacy by keeping data local. Traditional FL approaches rely on a centralized, star-shaped topology, where a central server aggregates model updates from clients. However, this architecture introduces several limitations, including a single point of failure, limited personalization, and poor robustness to distribution shifts or vulnerability to malfunctioning clients. Moreover, update selection in centralized FL often relies on low-level parameter differences, which can be unreliable when client data is not independent and identically distributed, and offers clients little control. In this work, we propose a decentralized, peer-to-peer (P2P) FL framework. It leverages the flexibility of the P2P topology to enable each client to identify and aggregate a personalized set of trustworthy and beneficial updates. Our framework is the Local Inference Guided Aggregation for Heterogeneous Training Environments to Yield Enhancement Through Agreement and Regularization (LIGHTYEAR). Central to our method is an agreement score, computed on a local validation set, which quantifies the semantic alignment of incoming updates in the function space with respect to the client’s reference model. Each client uses this score to select a tailored subset of updates and performs aggregation with a regularization term that further stabilizes the training. Our empirical evaluation across two datasets shows that the proposed approach consistently outperforms both centralized baselines and existing P2P methods in terms of client-level performance, particularly under adversarial and heterogeneous conditions.
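The select-then-aggregate step described above can be sketched in a few lines of Python. The abstract does not give the exact form of the agreement score or of the regularizer, so this sketch assumes the score is the fraction of local validation samples on which a peer model and the reference model predict the same label, and that regularization is a convex pull toward the client's own parameters. The names `agreement_score`, `select_and_aggregate`, `threshold`, and `reg` are hypothetical.

```python
def agreement_score(ref_preds, peer_preds):
    """Assumed score: share of local validation samples with identical predictions."""
    matches = sum(1 for a, b in zip(ref_preds, peer_preds) if a == b)
    return matches / len(ref_preds)

def select_and_aggregate(own_params, own_preds, peers, threshold=0.7, reg=0.5):
    """Pick peers that agree with the reference model, then average their parameters.

    peers: list of (params, predictions_on_local_validation_set) tuples.
    Returns (new_params, indices_of_selected_peers).
    """
    selected = [i for i, (_, preds) in enumerate(peers)
                if agreement_score(own_preds, preds) >= threshold]
    if not selected:
        return list(own_params), selected
    dim = len(own_params)
    peer_mean = [sum(peers[i][0][d] for i in selected) / len(selected) for d in range(dim)]
    # Assumed regularizer: `reg` keeps the result close to the client's own parameters.
    new_params = [reg * own + (1 - reg) * mean for own, mean in zip(own_params, peer_mean)]
    return new_params, selected
```

Because the score is computed from predictions rather than from parameter distances, a peer whose weights look very different can still be selected if it behaves like the reference model on local data.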
[CV-57] ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking
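[Quick read]: This paper addresses the limitations of existing vision-language tracking (VLT) methods: insufficient adaptation to target changes, an opaque reasoning process, and failure to exploit the strengths of large models. The core problem is that traditional methods fuse fixed text with visual features or use simple attention mechanisms, so they cope poorly with dynamic changes in the target's appearance and semantics; recent text-generation-based methods try to improve adaptability but do not model the reasoning logic explicitly, which limits performance. The key to the solution is ReasoningTrack, a reasoning-based VLT framework built on the pre-trained multimodal large model Qwen2.5-VL. It optimizes language generation and reasoning with supervised fine-tuning (SFT) and the reinforcement learning method GRPO, embeds the dynamically updated language descriptions into a unified tracking backbone, and localizes the target jointly with visual features, yielding more robust and interpretable tracking.
Link: https://arxiv.org/abs/2508.05221
Authors: Xiao Wang, Liye Jin, Xufeng Lou, Shiao Wang, Lan Chen, Bo Jiang, Zhipeng Zhang
Affiliations: Anhui University (安徽大学); Shanghai Jiao Tong University (上海交通大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: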
Abstract:Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse the fixed language with vision features or simply modify them using attention; however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to variations in the target during tracking. However, these works fail to provide insights into the model’s reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address these issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, based on the pre-trained vision-language model Qwen2.5-VL. Both SFT (Supervised Fine-Tuning) and reinforcement learning (GRPO) are used to optimize reasoning and language generation. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features. Then, we adopt a tracking head to predict the specific location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. Twenty baseline visual trackers are re-trained and evaluated on this dataset, which builds a solid foundation for the vision-language visual tracking task. Extensive experiments on multiple vision-language tracking benchmark datasets fully validate the effectiveness of our proposed reasoning-based natural language generation strategy. The source code of this paper will be released on this https URL
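GRPO, named in the abstract above, scores each sampled response relative to the other responses drawn for the same prompt. Below is a minimal sketch of that group-relative normalization, assuming one scalar reward per generated description. The function name, the `eps` constant, and the use of the population standard deviation are illustrative choices, not details taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std of the group (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that score above the group mean receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.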
[CV-58] Textual and Visual Guided Task Adaptation for Source-Free Cross-Domain Few-Shot Segmentation
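[Quick read]: This paper addresses the performance drop in Cross-Domain Few-Shot Segmentation (CD-FSS) caused by the distribution gap between source and target domains, particularly when source data are unavailable (source-free CD-FSS). The core of the solution is a source-free CD-FSS method that needs no access to source-domain data and adapts to the target-domain task by fusing visual and textual information. The key innovation is the Task-Specific Attention Adapters (TSAA), which are attached to the feature pyramid of a pretrained backbone to adapt multi-level features. Their parameters are optimized through a Visual-Visual Embedding Alignment (VVEA) module and a Text-Visual Embedding Alignment (TVEA) module: VVEA uses global-local visual features to align image features across views, while TVEA uses textual priors from pre-aligned multimodal models such as CLIP to guide cross-modal adaptation. The outputs of the two modules are combined through dense comparison and fused via skip connections to produce refined segmentation masks. Under the 1-shot and 5-shot settings, average segmentation accuracy improves by 2.18% and 4.11%, respectively, significantly outperforming existing state-of-the-art methods.
Link: https://arxiv.org/abs/2508.05213
Authors: Jianming Liu, Wenlong Qiu, Haitao Wei
Affiliations: Jiangxi Normal University (江西师范大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, Accepted at ACMMM2025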
Abstract:Few-Shot Segmentation (FSS) aims to efficiently segment new objects from few labeled samples. However, its performance degrades significantly when domain discrepancies exist between training and deployment. Cross-Domain Few-Shot Segmentation (CD-FSS) has been proposed to mitigate such performance degradation. Current CD-FSS methods primarily seek to develop segmentation models on a source domain that are capable of cross-domain generalization. However, driven by escalating concerns over data privacy and the imperative to minimize data transfer and training expenses, the development of source-free CD-FSS approaches has become essential. In this work, we propose a source-free CD-FSS method that leverages both textual and visual information to facilitate target domain task adaptation without requiring source domain data. Specifically, we first append Task-Specific Attention Adapters (TSAA) to the feature pyramid of a pretrained backbone, which adapt multi-level features extracted from the shared pre-trained backbone to the target task. Then, the parameters of the TSAA are trained through a Visual-Visual Embedding Alignment (VVEA) module and a Text-Visual Embedding Alignment (TVEA) module. The VVEA module utilizes global-local visual features to align image features across different views, while the TVEA module leverages textual priors from pre-aligned multi-modal features (e.g., from CLIP) to guide cross-modal adaptation. By combining the outputs of these modules through dense comparison operations and subsequent fusion via skip connections, our method produces refined prediction masks. Under both 1-shot and 5-shot settings, the proposed approach achieves average segmentation accuracy improvements of 2.18% and 4.11%, respectively, across four cross-domain datasets, significantly outperforming state-of-the-art CD-FSS methods. Code is available at this https URL.
[CV-59] VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization ICCV2025
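[Quick read]: This paper addresses the high computational cost that redundant visual tokens impose on Large Multimodal Models (LMMs) during inference. Existing methods usually build importance maps from attention scores and prune in stages, but their strategies are simple and often cause a marked performance drop. The key to the solution is the VFlowOpt framework. It computes the importance map from attention-derived context relevance and patch-level information entropy, and it includes a progressive pruning module with a recycling mechanism that retains pruned tokens as "recycled tokens" to avoid information loss. The pruning hyperparameters are optimized by a visual information flow-guided method that treats the last token in the LMM as the most representative signal of text-visual interaction and minimizes the discrepancy between token representations with and without pruning, which yields efficient pruning strategies tailored to different LMMs. Experiments show that the method can prune 90% of visual tokens at comparable performance, cutting KV-cache memory by 89% and making inference 3.8 times faster.
Link: https://arxiv.org/abs/2508.05211
Authors: Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, Jiangmiao Pang
Affiliations: Shanghai AI Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学); National University of Singapore (新加坡国立大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025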
Abstract:Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information, but this token redundancy results in significant computational costs. Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages. Despite this progress, pruning frameworks and strategies remain simplistic and insufficiently explored, often resulting in substantial performance degradation. In this paper, we propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. The hyperparameters of its pruning strategy are further optimized by a visual information flow-guided method. Specifically, we compute an importance map for image tokens based on their attention-derived context relevance and patch-level information entropy. We then decide which tokens to retain or prune and aggregate the pruned ones as recycled tokens to avoid potential information loss. Finally, we apply a visual information flow-guided method that regards the last token in the LMM as the most representative signal of text-visual interactions. This method minimizes the discrepancy between token representations in LMMs with and without pruning, thereby enabling superior pruning strategies tailored to different LMMs. Experiments demonstrate that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, leading to an 89% reduction in KV-Cache memory and 3.8 times faster inference.
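The importance map and recycling step described above can be illustrated with a small Python sketch. It assumes the importance score is a weighted sum of an attention-derived relevance value and a normalized patch entropy, and that pruned tokens are merged into a single importance-weighted "recycled" token. The weight `lam`, the bin count `BINS`, and the single-stage pruning are assumptions; the paper's progressive multi-stage module and its flow-guided hyperparameter search are not reproduced here.

```python
import math

BINS = 8  # histogram bins for the entropy estimate (assumed value)

def patch_entropy(pixels):
    """Shannon entropy in bits of a patch's intensity histogram; pixels lie in [0, 1]."""
    counts = [0] * BINS
    for p in pixels:
        counts[min(int(p * BINS), BINS - 1)] += 1
    n = len(pixels)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c)

def prune_with_recycling(tokens, relevance, patches, keep, lam=0.5):
    """Keep the `keep` highest-scoring tokens and merge the rest into one recycled token."""
    max_entropy = math.log2(BINS)
    scores = [lam * r + (1 - lam) * patch_entropy(p) / max_entropy
              for r, p in zip(relevance, patches)]
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = [tokens[i] for i in sorted(order[:keep])]  # preserve original token order
    pruned = order[keep:]
    if not pruned:
        return kept
    # Importance-weighted average of the pruned tokens (epsilon avoids all-zero weights).
    weights = [scores[i] + 1e-8 for i in pruned]
    total = sum(weights)
    dim = len(tokens[0])
    recycled = [sum(w * tokens[i][d] for w, i in zip(weights, pruned)) / total
                for d in range(dim)]
    return kept + [recycled]
```

With `keep` set to 10% of the visual tokens, this would correspond to the 90% pruning ratio reported in the abstract, although the reported accuracy depends on the tuned strategy rather than on this simplified rule.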
[CV-60] EndoMatcher: Generalizable Endoscopic Image Matcher via Multi-Domain Pre-training for Robot-Assisted Surgery
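[Quick read]: This paper addresses generalizable dense feature matching in endoscopic images, which is essential for robot-assisted surgery tasks such as 3D reconstruction, navigation, and surgical scene understanding, but is hampered by difficult visual conditions (weak textures, large viewpoint changes) and scarce annotated data. The key to the solution is EndoMatcher, a generalizable endoscopic image matcher pre-trained on large-scale multi-domain data. First, it uses a two-branch Vision Transformer with dual interaction blocks to extract multi-scale features and make correspondence learning more robust. Second, it builds Endo-Mix6, the first multi-domain endoscopic matching dataset (about 1.2M real and synthetic image pairs across six domains), with correspondence labels generated by Structure-from-Motion and simulated transformations, which eases data scarcity and increases domain diversity. Finally, a progressive multi-objective training strategy handles the training instability caused by differences in dataset size, distribution shifts, and error imbalance, enabling good zero-shot generalization across organs and imaging conditions.
Link: https://arxiv.org/abs/2508.05205
Authors: Bingyu Yang, Qingyao Tian, Yimeng Geng, Huai Liao, Xinyan Huang, Jiebo Luo, Hongbin Liu
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (中国科学院自动化研究所多模态人工智能系统重点实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China (中国科学院大学人工智能学院); Department of Pulmonary and Critical Care Medicine, The First Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510275, China (中山大学附属第一医院呼吸与危重症医学科); Hong Kong Institute of Science & Innovation, Hong Kong SAR (香港科学与创新研究院); Centre of AI and Robotics, Hong Kong Institute of Science & Innovation, Hong Kong SAR (香港科学与创新研究院人工智能与机器人中心)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: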
Abstract:Generalizable dense feature matching in endoscopic images is crucial for robot-assisted tasks, including 3D reconstruction, navigation, and surgical scene understanding. Yet, it remains a challenge due to difficult visual conditions (e.g., weak textures, large viewpoint variations) and a scarcity of annotated data. To address these challenges, we propose EndoMatcher, a generalizable endoscopic image matcher via large-scale, multi-domain data pre-training. To address difficult visual conditions, EndoMatcher employs a two-branch Vision Transformer to extract multi-scale features, enhanced by dual interaction blocks for robust correspondence learning. To overcome data scarcity and improve domain diversity, we construct Endo-Mix6, the first multi-domain dataset for endoscopic matching. Endo-Mix6 consists of approximately 1.2M real and synthetic image pairs across six domains, with correspondence labels generated using Structure-from-Motion and simulated transformations. The diversity and scale of Endo-Mix6 introduce new challenges in training stability due to significant variations in dataset sizes, distribution shifts, and error imbalance. To address them, a progressive multi-objective training strategy is employed to promote balanced learning and improve representation quality across domains. This enables EndoMatcher to generalize across unseen organs and imaging conditions in a zero-shot fashion. Extensive zero-shot matching experiments demonstrate that EndoMatcher increases the number of inlier matches by 140.69% and 201.43% on the Hamlyn and Bladder datasets over state-of-the-art methods, respectively, and improves the Matching Direction Prediction Accuracy (MDPA) by 9.40% on the Gastro-Matching dataset, achieving dense and accurate matching under challenging endoscopic conditions. The code is publicly available at this https URL.
[CV-61] SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images
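[Quick read]: This paper addresses the fact that spectral information in multispectral remote sensing imagery is largely neglected by vision-language models (VLMs), which leads to poor pixel-level land cover extraction. The core challenge is to encode the spectral priors of land-cover objects, obtained from classical spectral index computations, into textual attributes that large language models (LLMs) can understand, so that instruction-driven land cover extraction is both precise and flexible. The key to the solution is SPIE, a vision-language instruction-following dataset that converts the spectral characteristics of land-cover objects into textual descriptions via classical spectral indices. On top of it the authors propose SPEX, a multimodal large language model dedicated to land cover extraction from multispectral remote sensing imagery. SPEX introduces multiscale feature aggregation, token context condensation, and multispectral visual pre-training, which markedly improve pixel-level classification accuracy and the interpretability of predictions. It is the first multimodal VLM aimed at land cover extraction in spectral remote sensing.
Link: https://arxiv.org/abs/2508.05202
Authors: Dongchen Si, Di Wang, Erzhong Gao, Xiaolei Qin, Liu Zhao, Jing Zhang, Minqiang Xu, Jianbo Zhan, Jianshe Wang, Lin Liu, Bo Du, Liangpei Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: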
Abstract:Spectral information has long been recognized as a critical cue in remote sensing observations. Although numerous vision-language models have been developed for pixel-level interpretation, spectral information remains underutilized, resulting in suboptimal performance, particularly in multispectral scenarios. To address this limitation, we construct a vision-language instruction-following dataset named SPIE, which encodes spectral priors of land-cover objects into textual attributes recognizable by large language models (LLMs), based on classical spectral index computations. Leveraging this dataset, we propose SPEX, a multimodal LLM designed for instruction-driven land cover extraction. To this end, we introduce several carefully designed components and training strategies, including multiscale feature aggregation, token context condensation, and multispectral visual pre-training, to achieve precise and flexible pixel-level interpretation. To the best of our knowledge, SPEX is the first multimodal vision-language model dedicated to land cover extraction in spectral remote sensing imagery. Extensive experiments on five public multispectral datasets demonstrate that SPEX consistently outperforms existing state-of-the-art methods in extracting typical land cover categories such as vegetation, buildings, and water bodies. Moreover, SPEX is capable of generating textual explanations for its predictions, thereby enhancing interpretability and user-friendliness. Code will be released at: this https URL.
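To make the idea of turning spectral indices into textual attributes concrete, here is a minimal Python sketch. It uses two classical indices, NDVI = (NIR - Red) / (NIR + Red) and NDWI = (Green - NIR) / (Green + NIR). The thresholds, the label wording, and the function names are assumptions for illustration; the abstract does not say which indices or phrasing the SPIE dataset uses.

```python
def normalized_difference(a, b):
    """Generic normalized difference (a - b) / (a + b), returning 0.0 when a + b is 0."""
    denom = a + b
    return 0.0 if denom == 0 else (a - b) / denom

def describe_pixel(green, red, nir):
    """Turn band reflectances into a short textual attribute (thresholds are assumed)."""
    ndvi = normalized_difference(nir, red)    # high for healthy vegetation
    ndwi = normalized_difference(green, nir)  # high for open water
    if ndvi > 0.4:
        label = "dense vegetation"
    elif ndwi > 0.2:
        label = "open water"
    else:
        label = "bare or built-up surface"
    return f"NDVI={ndvi:.2f}, NDWI={ndwi:.2f}: likely {label}"
```

A string such as the one returned by `describe_pixel(0.05, 0.04, 0.5)` could then be placed in an instruction prompt alongside the image, which is the general mechanism the abstract describes.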
[CV-62] Refining Gaussian Splatting: A Volumetric Densification Approach
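[Quick read]: This paper addresses the fact that high-quality novel view synthesis in 3D Gaussian Splatting (3DGS) depends on effective management of point primitives, and in particular the shortcomings of the original 3DGS densification strategy. The key to the solution is a new density control method that uses the volume of inertia associated with each Gaussian function to guide refinement, and thereby steer densification and pruning of the point cloud more effectively. The paper also studies how traditional Structure from Motion (SfM) and Deep Image Matching (DIM) methods affect point cloud initialization. Experiments on the Mip-NeRF 360 dataset show that the method surpasses standard 3DGS in reconstruction quality across a variety of scenes.
Link: https://arxiv.org/abs/2508.05187
Authors: Mohamed Abdul Gafoor, Marius Preda, Titus Zaharia
Affiliations: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: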
Abstract:Achieving high-quality novel view synthesis in 3D Gaussian Splatting (3DGS) often depends on effective point primitive management. The underlying Adaptive Density Control (ADC) process addresses this issue by automating densification and pruning. Yet, the vanilla 3DGS densification strategy shows key shortcomings. To address this issue, in this paper we introduce a novel density control method, which exploits the volumes of inertia associated with each Gaussian function to guide the refinement process. Furthermore, we study the effect of both traditional Structure from Motion (SfM) and Deep Image Matching (DIM) methods for point cloud initialization. Extensive experimental evaluations on the Mip-NeRF 360 dataset demonstrate that our approach surpasses 3DGS in reconstruction quality, delivering encouraging performance across diverse scenes.
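As a rough illustration of volume-guided density control, the Python sketch below decides whether a Gaussian should be split, cloned, or left alone. It is only a stand-in: the abstract does not define the "volume of inertia", so the sketch assumes the volume of the ellipsoid spanned by the Gaussian's three scale parameters, V = (4/3)·π·s_x·s_y·s_z, and keeps the gradient gate used by vanilla 3DGS. The thresholds and function names are hypothetical.

```python
import math

def ellipsoid_volume(scales):
    """Volume of the ellipsoid with semi-axes equal to the Gaussian's scales (a proxy)."""
    sx, sy, sz = scales
    return 4.0 / 3.0 * math.pi * sx * sy * sz

def densify_decision(scales, grad_norm, grad_thresh=0.0002, vol_thresh=1e-3):
    """Return 'keep', 'split', or 'clone' for one Gaussian (thresholds are assumed)."""
    if grad_norm < grad_thresh:
        return "keep"  # well-reconstructed region: no densification needed
    # Large Gaussians are split into smaller ones; small ones are cloned.
    return "split" if ellipsoid_volume(scales) > vol_thresh else "clone"
```

Compared with a rule based on the largest scale alone, a volume-based criterion also takes the two smaller axes into account, so a long but very thin Gaussian is treated differently from a uniformly large one.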
[CV-63] Learning to See and Act: Task-Aware View Planning for Robotic Manipulation
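[Quick read]: This paper addresses the limited 3D perception and task interference that arise when current vision-language-action (VLA) models for multi-task robotic manipulation use fixed viewpoints and a shared visual encoder, which hurts robustness and generalization. The key to the solution is the Task-Aware View Planning (TAVP) framework, which combines task-specific representation learning with active view planning. An exploration policy, accelerated by a novel pseudo-environment, acquires informative views, and a Mixture-of-Experts (MoE) visual encoder disentangles feature representations across tasks. Together these make the visual representations more complete and discriminative and markedly improve action prediction.
Link: https://arxiv.org/abs/2508.05186
Authors: Yongjie Bai, Zhouxia Wang, Yang Liu, Weixing Chen, Ziliang Chen, Mingtong Dai, Yongsen Zheng, Lingbo Liu, Guanbin Li, Liang Lin
Affiliations: 1. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); 2. University of Chinese Academy of Sciences (中国科学院大学); 3. Huawei Cloud (华为云); 4. Tsinghua University (清华大学)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 9 figures, project page: this https URL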
Abstract:Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-Aware View Planning (TAVP), a framework designed to overcome these challenges by integrating active view planning with task-specific representation learning. TAVP employs an efficient exploration policy, accelerated by a novel pseudo-environment, to actively acquire informative views. Furthermore, we introduce a Mixture-of-Experts (MoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization. By learning to see the world in a task-aware way, TAVP generates more complete and discriminative visual representations, demonstrating significantly enhanced action prediction across a wide array of manipulation challenges. Extensive experiments on RLBench tasks show that our proposed TAVP model achieves superior performance over state-of-the-art fixed-view approaches. Visual results and code are provided at: this https URL.
[CV-64] SPA++: Generalized Graph Spectral Alignment for Versatile Domain Adaptation
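[Quick read]: This paper addresses the loss of discriminability in domain adaptation (DA) that results from ignoring intra-domain structure. Existing methods mostly focus on inter-domain transferability and leave the rich structure inside the source and target domains unexploited, which degrades performance on the target domain. The key to the solution is SPA++, a generalized graph spectral alignment framework. It first casts the DA problem as graph structures and introduces a novel spectral regularizer that aligns the domain graphs in eigenspace; it then adds a fine-grained neighbor-aware propagation mechanism to strengthen discriminability in the target domain; finally, it incorporates data augmentation and consistency regularization to improve adaptability and robustness under complex distribution scenarios.
Link: https://arxiv.org/abs/2508.05182
Authors: Zhiqing Xiao, Haobo Wang, Xu Lu, Wentao Ye, Gang Chen, Junbo Zhao
Affiliations: Zhejiang University (浙江大学); Hunan University (湖南大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: The article has been accepted by Frontiers of Computer Science (FCS) (DOI: https://doi.org/10.1007/s11704-025-50328-w ). arXiv admin note: text overlap with arXiv:2310.17594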
Abstract:Domain Adaptation (DA) aims to transfer knowledge from a labeled source domain to an unlabeled or sparsely labeled target domain under domain shifts. Most prior works focus on capturing the inter-domain transferability but largely overlook rich intra-domain structures, which empirically results in even worse discriminability. To tackle this tradeoff, we propose a generalized graph SPectral Alignment framework, SPA++. Its core is briefly condensed as follows: (1) by casting the DA problem to graph primitives, it composes a coarse graph alignment mechanism with a novel spectral regularizer toward aligning the domain graphs in eigenspaces; (2) we further develop a fine-grained neighbor-aware propagation mechanism for enhanced discriminability in the target domain; (3) by incorporating data augmentation and consistency regularization, SPA++ can adapt to complex scenarios including most DA settings and even challenging distribution scenarios. Furthermore, we also provide theoretical analysis to support our method, including the generalization bound of graph-based DA and the role of spectral alignment and smoothing consistency. Extensive experiments on benchmark datasets demonstrate that SPA++ consistently outperforms existing cutting-edge methods, achieving superior robustness and adaptability across various challenging adaptation scenarios.
[CV-65] Multi-tracklet Tracking for Generic Targets with Adaptive Detection Clustering
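[Quick read]: This paper addresses the challenges that unseen categories pose for Multiple Object Tracking (MOT): low-confidence detections, weak motion and appearance constraints, and long-term occlusions. The key to the solution is the Multi-Tracklet Tracking (MTT) framework, which integrates flexible tracklet generation into a multi-tracklet association scheme. It first adaptively clusters detections into robust tracklets according to their short-term spatio-temporal correlation, then estimates the best tracklet partition from multiple cues such as location and appearance, which mitigates error propagation in long-term association.
Link: https://arxiv.org/abs/2508.05172
Authors: Zewei Wu, Longhao Wang, Cui Wang, César Teixeira, Wei Ke, Zhang Xiong
Affiliations: Macao Polytechnic University (澳门理工学院); University of Coimbra (科英布拉大学); Beihang University (北京航空航天大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: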
Abstract:Tracking specific targets, such as pedestrians and vehicles, has been the focus of recent vision-based multitarget tracking studies. However, in some real-world scenarios, unseen categories often challenge existing methods due to low-confidence detections, weak motion and appearance constraints, and long-term occlusions. To address these issues, this article proposes a tracklet-enhanced tracker called Multi-Tracklet Tracking (MTT) that integrates flexible tracklet generation into a multi-tracklet association framework. This framework first adaptively clusters the detection results according to their short-term spatio-temporal correlation into robust tracklets and then estimates the best tracklet partitions using multiple clues, such as location and appearance over time to mitigate error propagation in long-term association. Finally, extensive experiments on the benchmark for generic multiple object tracking demonstrate the competitiveness of the proposed framework.
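The first stage described above, grouping detections into short tracklets by spatio-temporal correlation, can be approximated with a simple Python sketch. It assumes the correlation is measured by box overlap (IoU) between consecutive frames and links boxes greedily. This is a simplification swapped in for illustration: the paper's adaptive clustering and its later multi-cue tracklet partitioning are not reproduced, and `build_tracklets` and `iou_thresh` are hypothetical names.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def build_tracklets(frames, iou_thresh=0.5):
    """Greedily link each box to the best-overlapping tracklet that ended in the previous frame.

    frames: list (one entry per frame) of lists of boxes.
    Returns a list of tracklets, each a list of (frame_index, box) pairs.
    """
    tracklets = []
    for t, boxes in enumerate(frames):
        open_ends = [k for k, tr in enumerate(tracklets) if tr[-1][0] == t - 1]
        used = set()
        for box in boxes:
            best, best_iou = None, iou_thresh
            for k in open_ends:
                if k in used:
                    continue
                score = iou(tracklets[k][-1][1], box)
                if score >= best_iou:
                    best, best_iou = k, score
            if best is None:
                tracklets.append([(t, box)])  # no good match: start a new tracklet
            else:
                tracklets[best].append((t, box))
                used.add(best)
    return tracklets
```

Because a tracklet is only extended from the immediately preceding frame, an occluded target produces two separate tracklets here; rejoining them is the job of the later association stage that the abstract describes.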
[CV-66] PhysPatch: A Physically Realizable and Transferable Adversarial Patch Attack for Multimodal Large Language Models -based Autonomous Driving Systems
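[Quick read]: This paper addresses adversarial attacks on Multimodal Large Language Models (MLLMs) in autonomous driving (AD) systems, specifically the difficulty of building adversarial patch attacks that are physically realizable and transfer well. Existing methods are designed mainly for object detection models and work poorly when transferred to MLLM architectures with complex reasoning abilities. The key to the solution is the PhysPatch framework, whose main components are a semantic-based mask initialization strategy for plausible patch placement in real scenes, an SVD-based local alignment loss combined with patch-guided crop-resize to improve cross-model transferability, and a potential field-based mask refinement method that optimizes the patch region. Together these keep the attack effective and physically feasible, and they raise both the attack success rate against MLLM-based AD systems and its potential for real-world deployment.
Link: https://arxiv.org/abs/2508.05167
Authors: Qi Guo, Xiaojun Jia, Shanmin Pang, Simeng Qin, Lin Wang, Ju Jia, Yang Liu, Qing Guo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: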
Abstract:Multimodal Large Language Models (MLLMs) are becoming integral to autonomous driving (AD) systems due to their strong vision-language reasoning capabilities. However, MLLMs are vulnerable to adversarial attacks, particularly adversarial patch attacks, which can pose serious threats in real-world scenarios. Existing patch-based attack methods are primarily designed for object detection models and perform poorly when transferred to MLLM-based systems due to the latter’s complex architectures and reasoning abilities. To address these limitations, we propose PhysPatch, a physically realizable and transferable adversarial patch framework tailored for MLLM-based AD systems. PhysPatch jointly optimizes patch location, shape, and content to enhance attack effectiveness and real-world applicability. It introduces a semantic-based mask initialization strategy for realistic placement, an SVD-based local alignment loss with patch-guided crop-resize to improve transferability, and a potential field-based mask refinement method. Extensive experiments across open-source, commercial, and reasoning-capable MLLMs demonstrate that PhysPatch significantly outperforms prior methods in steering MLLM-based AD systems toward target-aligned perception and planning outputs. Moreover, PhysPatch consistently places adversarial patches in physically feasible regions of AD scenes, ensuring strong real-world applicability and deployability.
[CV-67] X-MoGen: Unified Motion Generation across Humans and Animals
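[Quick read]: This paper addresses cross-species (human and animal) text-driven motion generation: how to model the motion of different body morphologies in one framework so that generated motion is more plausible and generalizes better. Traditional methods model human and animal motion separately, cannot share knowledge, and are sensitive to morphological differences, which leads to unnatural motion or poor cross-species transfer. The key to the solution is X-MoGen, the first unified framework, with a two-stage architecture. In the first stage, a conditional graph variational autoencoder learns canonical T-pose priors, and a morphological loss regularizes the motion embeddings in a shared latent space. In the second stage, masked motion modeling generates motion embeddings consistent with the text description. The authors also build UniMo4D, a dataset of 115 species and 119k motion sequences under a unified skeletal topology for joint training, which improves the accuracy and robustness of cross-species motion generation.
Link: https://arxiv.org/abs/2508.05162
Authors: Xuan Wang, Kai Ruan, Liyang Qian, Zhizhi Guo, Chang Su, Gaoang Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: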
Abstract:Text-driven motion generation has attracted increasing attention due to its broad applications in virtual reality, animation, and robotics. While existing methods typically model human and animal motion separately, a joint cross-species approach offers key advantages, such as a unified representation and improved generalization. However, morphological differences across species remain a key challenge, often compromising motion plausibility. To address this, we propose X-MoGen, the first unified framework for cross-species text-driven motion generation covering both humans and animals. X-MoGen adopts a two-stage architecture. First, a conditional graph variational autoencoder learns canonical T-pose priors, while an autoencoder encodes motion into a shared latent space regularized by morphological loss. In the second stage, we perform masked motion modeling to generate motion embeddings conditioned on textual descriptions. During training, a morphological consistency module is employed to promote skeletal plausibility across species. To support unified modeling, we construct UniMo4D, a large-scale dataset of 115 species and 119k motion sequences, which integrates human and animal motions under a shared skeletal topology for joint training. Extensive experiments on UniMo4D demonstrate that X-MoGen outperforms state-of-the-art methods on both seen and unseen species.
[CV-68] Rotation Equivariant Arbitrary-scale Image Super-Resolution
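[Quick read]: This paper targets arbitrary-scale image super-resolution (ASISR), where severe geometric distortion in the low-resolution input causes unexpected artifacts in the high-resolution reconstruction; in particular, common geometric patterns such as repetitive textures, edges, and shapes become warped during recovery. The key to the solution is a rotation-equivariant ASISR method: the basic architectures of the implicit neural representation (INR) module and the encoder are redesigned to carry intrinsic rotation equivariance, so that, for the first time, rotation equivariance is maintained end to end from input to output. The paper also gives a theoretical analysis of the intrinsic equivariance error, reports experiments on simulated and real datasets that support the method's advantage, and shows that it can be integrated into existing ASISR frameworks as a plug-and-play module to further improve their results.
Link: https://arxiv.org/abs/2508.05160
Authors: Qi Xie, Jiahong Fu, Zongben Xu, Deyu Meng
Affiliations: Cranberry-Lemon University; Xi’an Jiaotong University; Macao Institute of Systems Engineering
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE TPAMI, code and supplementary material is available at this https URL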
Abstract:The arbitrary-scale image super-resolution (ASISR), a recent popular topic in computer vision, aims to achieve arbitrary-scale high-resolution recoveries from a low-resolution input image. This task is realized by representing the image as a continuous implicit function through two fundamental modules, a deep-network-based encoder and an implicit neural representation (INR) module. Despite achieving notable progress, a crucial challenge of such a highly ill-posed setting is that many common geometric patterns, such as repetitive textures, edges, or shapes, are seriously warped and deformed in the low-resolution images, naturally leading to unexpected artifacts appearing in their high-resolution recoveries. Embedding rotation equivariance into the ASISR network is thus necessary, as it has been widely demonstrated that this enhancement enables the recovery to faithfully maintain the original orientations and structural integrity of geometric patterns underlying the input image. Motivated by this, we make efforts to construct a rotation equivariant ASISR method in this study. Specifically, we elaborately redesign the basic architectures of INR and encoder modules, incorporating intrinsic rotation equivariance capabilities beyond those of conventional ASISR networks. Through such amelioration, the ASISR network can, for the first time, be implemented with end-to-end rotational equivariance maintained from input to output. We also provide a solid theoretical analysis to evaluate its intrinsic equivariance error, demonstrating its inherent nature of embedding such an equivariance structure. The superiority of the proposed method is substantiated by experiments conducted on both simulated and real datasets. We also validate that the proposed framework can be readily integrated into current ASISR methods in a plug & play manner to further enhance their performance.
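To make the equivariance property concrete, here is a minimal sketch rather than the paper's method: it checks on a small integer grid whether an operator commutes with a 90-degree rotation. The helper names (`rot90`, `neighbor_sum`, `right_neighbor`, `is_rot90_equivariant`) and both toy operators are illustrative assumptions; the paper itself concerns learned encoder and INR modules and more general rotations.

```python
def rot90(grid):
    """Rotate a square grid 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[j][n - 1 - i] for j in range(n)] for i in range(n)]


def _at(grid, i, j):
    """Read grid[i][j], treating out-of-range positions as zero padding."""
    n = len(grid)
    return grid[i][j] if 0 <= i < n and 0 <= j < n else 0


def neighbor_sum(grid):
    """Isotropic toy operator: sum of the four axis-aligned neighbors."""
    n = len(grid)
    return [[_at(grid, i - 1, j) + _at(grid, i + 1, j)
             + _at(grid, i, j - 1) + _at(grid, i, j + 1)
             for j in range(n)] for i in range(n)]


def right_neighbor(grid):
    """Direction-dependent toy operator: copy the value to the right."""
    n = len(grid)
    return [[_at(grid, i, j + 1) for j in range(n)] for i in range(n)]


def is_rot90_equivariant(op, grid):
    """True if rotating then applying op equals applying op then rotating."""
    return op(rot90(grid)) == rot90(op(grid))
```

The isotropic operator passes the check while the direction-dependent one does not, which is the kind of structural constraint an equivariant architecture is meant to guarantee by construction.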
[CV-69] Deep Learning-based Animal Behavior Analysis: Insights from Mouse Chronic Pain Models
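[Quick read]: This paper addresses the difficulty of capturing the subtle, persistent behavioral changes of chronic pain when assessment depends on manual labeling. Existing methods are limited by incomplete human understanding of pain-related behavior, so feature extraction is subjective and incomplete. The key to the solution is an automatic feature-discovery framework that needs no human-defined action labels: a universal action space projector extracts mouse behavior features directly from raw video, retaining rich behavioral information and avoiding labeling bias. The authors also build a mouse behavior dataset covering the progression of neuropathic and inflammatory pain across multiple time points. The method reaches 48.41% accuracy on a 15-class pain classification task (versus 21.33% for human experts and 30.52% for B-SOiD) and 73.1% when the task is simplified to three classes (neuropathic pain, inflammatory pain, no pain), supporting its effectiveness and potential for clinical translation.
Link: https://arxiv.org/abs/2508.05138
Authors: Yu-Hsi Chen, Wei-Hsin Chen, Chien-Yao Wang, Hong-Yuan Mark Liao, James C. Liao, Chien-Chang Chen
Affiliations: Institute of Information Science (IIS); Institute of Biomedical Sciences (IBMS); Institute of Biological Chemistry (IBC)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: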
Abstract:Assessing chronic pain behavior in mice is critical for preclinical studies. However, existing methods mostly rely on manual labeling of behavioral features, and humans lack a clear understanding of which behaviors best represent chronic pain. For this reason, existing methods struggle to accurately capture the insidious and persistent behavioral changes in chronic pain. This study proposes a framework to automatically discover features related to chronic pain without relying on human-defined action labels. Our method uses a universal action space projector to automatically extract mouse action features, and avoids the potential bias of human labeling by retaining the rich behavioral information in the original video. In this paper, we also collected a mouse pain behavior dataset that captures the disease progression of both neuropathic and inflammatory pain across multiple time points. Our method achieves 48.41% accuracy in a 15-class pain classification task, significantly outperforming human experts (21.33%) and the widely used method B-SOiD (30.52%). Furthermore, when the classification is simplified to only three categories, i.e., neuropathic pain, inflammatory pain, and no pain, our method achieves an accuracy of 73.1%, which is notably higher than that of human experts (48%) and B-SOiD (58.43%). Finally, our method revealed differences in drug efficacy for different types of pain on zero-shot Gabapentin drug testing, and the results were consistent with past drug efficacy literature. This study demonstrates the potential clinical application of our method, which can provide new insights into pain research and related drug development.
[CV-70] FedGIN: Federated Learning with Dynamic Global Intensity Non-linear Augmentation for Organ Segmentation using Multi-modal Images MICCAI2025
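[Quick read]: This paper addresses multimodal medical image segmentation under data scarcity, distribution shift between modalities (e.g. CT vs. MRI), and privacy restrictions that prevent data sharing, with the goal of deploying accurate cross-modality segmentation models. The key to the solution is FedGIN, a Federated Learning (FL) framework that adds a lightweight Global Intensity Non-linear (GIN) augmentation module to harmonize modality-specific intensity distributions during local training, improving cross-modality generalization without sharing raw data. Experiments show that FedGIN clearly outperforms baselines in both the limited-data and complete-data scenarios.
Link: https://arxiv.org/abs/2508.05137
Authors: Sachin Dudda Nagaraju, Ashkan Moradi, Bendik Skarre Abrahamsen, Mattijs Elschot
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Paper Accepted at MICCAI 2025 DeCaf Workshop Track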
Abstract:Medical image segmentation plays a crucial role in AI-assisted diagnostics, surgical planning, and treatment monitoring. Accurate and robust segmentation models are essential for enabling reliable, data-driven clinical decision making across diverse imaging modalities. Given the inherent variability in image characteristics across modalities, developing a unified model capable of generalizing effectively to multiple modalities would be highly beneficial. This model could streamline clinical workflows and reduce the need for modality-specific training. However, real-world deployment faces major challenges, including data scarcity, domain shift between modalities (e.g., CT vs. MRI), and privacy restrictions that prevent data sharing. To address these issues, we propose FedGIN, a Federated Learning (FL) framework that enables multimodal organ segmentation without sharing raw patient data. Our method integrates a lightweight Global Intensity Non-linear (GIN) augmentation module that harmonizes modality-specific intensity distributions during local training. We evaluated FedGIN using two types of datasets: an imputed dataset and a complete dataset. In the limited dataset scenario, the model was initially trained using only MRI data, and CT data was added to assess its performance improvements. In the complete dataset scenario, both MRI and CT data were fully utilized for training on all clients. In the limited-data scenario, FedGIN achieved a 12 to 18% improvement in 3D Dice scores on MRI test cases compared to FL without GIN and consistently outperformed local baselines. In the complete dataset scenario, FedGIN demonstrated near-centralized performance, with a 30% Dice score improvement over the MRI-only baseline and a 10% improvement over the CT-only baseline, highlighting its strong cross-modality generalization under privacy constraints.
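The two ingredients named in the abstract, local intensity augmentation and server-side aggregation without raw data exchange, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not specify the GIN module or the aggregation rule, so a random gamma curve stands in for the non-linear intensity transform and a size-weighted average (FedAvg-style) stands in for aggregation.

```python
import random


def random_intensity_map(pixels, rng):
    """Stand-in non-linear intensity augmentation (assumption, not the
    paper's GIN module): apply a random gamma curve to values in [0, 1]."""
    gamma = rng.uniform(0.5, 2.0)
    return [p ** gamma for p in pixels]


def fed_avg(client_weights, client_sizes):
    """Size-weighted average of per-client parameter vectors.
    Only parameters are exchanged; raw images never leave the client."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]
```

In this sketch each client would apply `random_intensity_map` to its own images during local training and send only its parameter vector to the server, which combines them with `fed_avg`.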
[CV-71] Latent Expression Generation for Referring Image Segmentation and Grounding ICCV2025
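[Quick read]: This paper addresses target misidentification in visual grounding tasks (referring image segmentation, RIS, and referring expression comprehension, REC) that arises because a single textual input cannot capture the rich visual detail in an image. Existing methods usually rely on one textual description and overlook diverse attributes such as color and position, so similar objects get confused. The key to the solution is a visual grounding framework that generates multiple latent expressions from a single textual input. Subject distributor and visual concept injector modules embed shared-subject and distinct-attribute information into the latent representations to capture target-specific visual cues, and a positive-margin contrastive learning strategy aligns all latent expressions with the original text while preserving subtle differences, improving localization accuracy and generalization.
Link: https://arxiv.org/abs/2508.05123
Authors: Seonghoon Yu, Joonbeom Hong, Joonseok Lee, Jeany Son
Affiliations: GIST; Seoul National University; POSTECH
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICCV 2025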
Abstract:Visual grounding tasks, such as referring image segmentation (RIS) and referring expression comprehension (REC), aim to localize a target object based on a given textual description. The target object in an image can be described in multiple ways, reflecting diverse attributes such as color, position, and more. However, most existing methods rely on a single textual input, which captures only a fraction of the rich information available in the visual domain. This mismatch between rich visual details and sparse textual cues can lead to the misidentification of similar objects. To address this, we propose a novel visual grounding framework that leverages multiple latent expressions generated from a single textual input by incorporating complementary visual details absent from the original description. Specifically, we introduce subject distributor and visual concept injector modules to embed both shared-subject and distinct-attributes concepts into the latent representations, thereby capturing unique and target-specific visual cues. We also propose a positive-margin contrastive learning strategy to align all latent expressions with the original text while preserving subtle variations. Experimental results show that our method not only outperforms state-of-the-art RIS and REC approaches on multiple benchmarks but also achieves outstanding performance on the generalized referring expression segmentation (GRES) benchmark.
[CV-72] RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer
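[Quick read]: This paper addresses the tension between generation quality and computational efficiency in real-time audio-driven portrait animation. Existing methods achieve high quality through high-dimensional intermediate representations and explicit motion modeling, but their computational cost rules out real-time deployment, while compressed latent spaces tend to lose fine-grained spatiotemporal detail and cause audio-visual desynchronization. The key to the solution is RAP (Real-time Audio-driven Portrait animation), a unified framework whose main contributions are (1) a hybrid attention mechanism for fine-grained audio control and (2) a static-dynamic training-inference paradigm that avoids explicit motion supervision, which maintains high visual fidelity while mitigating long-term temporal drift for low-latency, accurate real-time portrait animation.
Link: https://arxiv.org/abs/2508.05115
Authors: Fangyu Du, Taiqing Li, Ziwei Zhang, Qian Qiao, Tan Yu, Dingcheng Zhen, Xu Jia, Yang Yang, Shunshun Yin, Siyuan Liu
Affiliations: 1. Tsinghua University; 2. Institute of Automation, Chinese Academy of Sciences; 3. Peking University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 11 pages, 9 figures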
Abstract:Audio-driven portrait animation aims to synthesize realistic and natural talking head videos from an input audio signal and a single reference image. While existing methods achieve high-quality results by leveraging high-dimensional intermediate representations and explicitly modeling motion dynamics, their computational complexity renders them unsuitable for real-time deployment. Real-time inference imposes stringent latency and memory constraints, often necessitating the use of highly compressed latent representations. However, operating in such compact spaces hinders the preservation of fine-grained spatiotemporal details, thereby complicating audio-visual synchronization. We propose RAP (Real-time Audio-driven Portrait animation), a unified framework for generating high-quality talking portraits under real-time constraints. Specifically, RAP introduces a hybrid attention mechanism for fine-grained audio control, and a static-dynamic training-inference paradigm that avoids explicit motion supervision. Through these techniques, RAP achieves precise audio-driven control, mitigates long-term temporal drift, and maintains high visual fidelity. Extensive experiments demonstrate that RAP achieves state-of-the-art performance while operating under real-time constraints.
[CV-73] AHDMIL: Asymmetric Hierarchical Distillation Multi-Instance Learning for Fast and Accurate Whole-Slide Image Classification
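[Quick read]: This paper addresses the high inference cost of multi-instance learning (MIL) in pathology image classification, which stems from processing thousands of patches from each gigapixel whole slide image (WSI). The key to the solution is AHDMIL, an Asymmetric Hierarchical Distillation Multi-Instance Learning framework with two components: the Dynamic Multi-Instance Network (DMIN), which classifies high-resolution WSIs, and the Dual-Branch Lightweight Instance Pre-screening Network (DB-LIPN), which quickly judges relevance on the corresponding low-resolution images. Training has two steps, self-distillation (SD) and asymmetric distillation (AD): DMIN first produces per-patch attention scores that identify irrelevant patches, then DB-LIPN learns to predict the relevance of low-resolution patches, and the patches it selects are mapped by spatial correspondence to high-resolution patches used for DMIN fine-tuning and efficient inference. The paper also introduces what it describes as the first Chebyshev-polynomial-based Kolmogorov-Arnold (CKA) classifier in computational pathology, which improves classification through learnable activation layers. Experiments on four public datasets show higher classification accuracy together with inference speedups (1.2 to 2.1 times on average).
Link: https://arxiv.org/abs/2508.05114
Authors: Jiuyang Dong, Jiahan Li, Junjun Jiang, Kui Jiang, Yongbing Zhang
Affiliations: Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: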
Abstract:Although multi-instance learning (MIL) has succeeded in pathological image classification, it faces the challenge of high inference costs due to the need to process thousands of patches from each gigapixel whole slide image (WSI). To address this, we propose AHDMIL, an Asymmetric Hierarchical Distillation Multi-Instance Learning framework that enables fast and accurate classification by eliminating irrelevant patches through a two-step training process. AHDMIL comprises two key components: the Dynamic Multi-Instance Network (DMIN), which operates on high-resolution WSIs, and the Dual-Branch Lightweight Instance Pre-screening Network (DB-LIPN), which analyzes corresponding low-resolution counterparts. In the first step, self-distillation (SD), DMIN is trained for WSI classification while generating per-instance attention scores to identify irrelevant patches. These scores guide the second step, asymmetric distillation (AD), where DB-LIPN learns to predict the relevance of each low-resolution patch. The relevant patches predicted by DB-LIPN have spatial correspondence with patches in high-resolution WSIs, which are used for fine-tuning and efficient inference of DMIN. In addition, we design the first Chebyshev-polynomial-based Kolmogorov-Arnold (CKA) classifier in computational pathology, which improves classification performance through learnable activation layers. Extensive experiments on four public datasets demonstrate that AHDMIL consistently outperforms previous state-of-the-art methods in both classification performance and inference speed. For example, on the Camelyon16 dataset, it achieves a relative improvement of 5.3% in accuracy and accelerates inference by this http URL. Across all datasets, area under the curve (AUC), accuracy, f1 score, and brier score show consistent gains, with average inference speedups ranging from 1.2 to 2.1 times. The code is available.
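The pre-screening idea described above (score patches cheaply at low resolution, keep only the relevant ones, then map them to their high-resolution counterparts) can be sketched as follows. This is a hedged illustration only: the function names, the fixed `keep_ratio`, and the integer `scale` between resolutions are assumptions, and the scores would come from a trained network rather than a hand-written list.

```python
def screen_patches(scores, keep_ratio):
    """Return indices of the top `keep_ratio` fraction of patches by
    relevance score, in their original order (at least one is kept)."""
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def to_high_res(low_xy, scale):
    """Map a low-resolution patch origin to high-resolution coordinates,
    assuming the two levels differ by an integer magnification factor."""
    x, y = low_xy
    return (x * scale, y * scale)
```

Only the patches returned by `screen_patches` would then be read at full resolution and passed to the heavier classifier, which is where the inference savings come from.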
[CV-74] Sculpting Margin Penalty: Intra-Task Adapter Merging and Classifier Calibration for Few-Shot Class-Incremental Learning
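[Quick read]: This paper addresses few-shot class-incremental learning (FSCIL), where scarce training data degrades base-class discriminability and limits generalization to new classes, and where the lack of access to original data during incremental tasks blurs inter-class decision boundaries. The key to the solution is SMP (Sculpting Margin Penalty), which applies margin penalties at different stages of a parameter-efficient fine-tuning framework. For base-task training, Margin-aware Intra-task Adapter Merging (MIAM) trains two sets of low-rank adapters with different classification losses, one with a margin penalty to strengthen base-class discriminability and one without margin constraints to promote generalization to new classes, and merges them adaptively to improve forward compatibility. For incremental tasks, Margin Penalty-based Classifier Calibration (MPCC) fine-tunes the classifiers on embeddings of all seen classes with a margin penalty to refine decision boundaries. The method achieves state-of-the-art results on CIFAR100, ImageNet-R, and CUB200 while balancing base-class and new-class performance.
Link: https://arxiv.org/abs/2508.05094
Authors: Liang Bai, Hong Song, Jinfu Li, Yucong Lin, Jingfan Fan, Tianyu Fu, Danni Ai, Deqiang Xiao, Jian Yang
Affiliations: Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures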
Abstract:Real-world applications often face data privacy constraints and high acquisition costs, making the assumption of sufficient training data in incremental tasks unrealistic and leading to significant performance degradation in class-incremental learning. Forward-compatible learning, which prospectively prepares for future tasks during base task training, has emerged as a promising solution for Few-Shot Class-Incremental Learning (FSCIL). However, existing methods still struggle to balance base-class discriminability and new-class generalization. Moreover, limited access to original data during incremental tasks often results in ambiguous inter-class decision boundaries. To address these challenges, we propose SMP (Sculpting Margin Penalty), a novel FSCIL method that strategically integrates margin penalties at different stages within the parameter-efficient fine-tuning paradigm. Specifically, we introduce the Margin-aware Intra-task Adapter Merging (MIAM) mechanism for base task learning. MIAM trains two sets of low-rank adapters with distinct classification losses: one with a margin penalty to enhance base-class discriminability, and the other without margin constraints to promote generalization to future new classes. These adapters are then adaptively merged to improve forward compatibility. For incremental tasks, we propose a Margin Penalty-based Classifier Calibration (MPCC) strategy to refine decision boundaries by fine-tuning classifiers on all seen classes’ embeddings with a margin penalty. Extensive experiments on CIFAR100, ImageNet-R, and CUB200 demonstrate that SMP achieves state-of-the-art performance in FSCIL while maintaining a better balance between base and new classes.
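A margin penalty and an adapter merge can each be written in a few lines, which may help in reading the description above. This is a sketch under stated assumptions, not the paper's formulation: the additive cosine margin (CosFace-style), the `scale` value, and the fixed interpolation weight `alpha` are all illustrative choices, whereas the abstract says the adapters are merged adaptively.

```python
import math


def margin_cross_entropy(cosines, target, margin=0.0, scale=16.0):
    """Softmax cross-entropy on scaled cosine logits, with an additive
    margin subtracted from the target-class cosine (assumed form)."""
    logits = [scale * (c - margin if i == target else c)
              for i, c in enumerate(cosines)]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def merge_adapters(with_margin, without_margin, alpha):
    """Linear interpolation of two adapter parameter vectors; `alpha`
    weights the margin-trained adapter (fixed here as an assumption)."""
    return [alpha * a + (1.0 - alpha) * b
            for a, b in zip(with_margin, without_margin)]
```

With the margin enabled, the same embedding incurs a larger loss, so training pushes the target class further from its neighbors; the merge then trades that tighter clustering against the looser features learned without a margin.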
[CV-75] PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation
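[Quick read]: This paper addresses two core problems that current diffusion models face when generating long, high-fidelity videos: identity drift, where the subject's appearance becomes inconsistent over time, and insufficient motion control, which makes precise control of action details difficult. The key to the solution is PoseGen, which uses an in-context LoRA finetuning strategy that injects subject appearance at the token level for identity preservation while conditioning on pose information at the channel level for fine-grained motion control. To overcome length limits, an interleaved segment generation method, combined with a shared KV cache and a dedicated transition process, keeps the background consistent and the video temporally continuous, so that videos of arbitrary length can be generated without artifacts.
Link: https://arxiv.org/abs/2508.05091
Authors: Jingxuan He, Busheng Su, Finn Wong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: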
Abstract:Generating long, temporally coherent videos with precise control over subject identity and motion is a formidable challenge for current diffusion models, which often suffer from identity drift and are limited to short clips. We introduce PoseGen, a novel framework that generates arbitrarily long videos of a specific subject from a single reference image and a driving pose sequence. Our core innovation is an in-context LoRA finetuning strategy that injects subject appearance at the token level for identity preservation, while simultaneously conditioning on pose information at the channel level for fine-grained motion control. To overcome duration limits, PoseGen pioneers an interleaved segment generation method that seamlessly stitches video clips together, using a shared KV cache mechanism and a specialized transition process to ensure background consistency and temporal smoothness. Trained on a remarkably small 33-hour video dataset, extensive experiments show that PoseGen significantly outperforms state-of-the-art methods in identity fidelity, pose accuracy, and its unique ability to produce coherent, artifact-free videos of unlimited duration.
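Generating a long video as a chain of shorter clips requires deciding which frames each clip covers. The sketch below shows one generic way to do that with overlapping windows; it is an assumption for illustration and not PoseGen's interleaved scheme, whose details (shared KV cache, transition process) the abstract does not spell out. The name `plan_segments` and its parameters are hypothetical.

```python
def plan_segments(num_frames, segment_len, overlap):
    """Split `num_frames` pose frames into (start, end) windows of at most
    `segment_len` frames, where consecutive windows share `overlap` frames
    that can serve as a transition region."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")
    stride = segment_len - overlap
    segments = []
    start = 0
    while True:
        end = min(start + segment_len, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start += stride
    return segments
```

Because the number of windows grows with the length of the driving pose sequence, a scheme of this shape places no fixed upper bound on video duration.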
[CV-76] AdaFusion: Prompt-Guided Inference with Adaptive Fusion of Pathology Foundation Models
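[Quick read]: This paper addresses latent biases in pathology foundation models (PFMs) that stem from their diverse and opaque pretraining contexts and that limit generalization and interpretability in downstream tasks. The key to the solution is AdaFusion, a framework described as among the first to use prompt-guided dynamic fusion to adaptively integrate complementary knowledge from multiple PFMs: it compresses and aligns tile-level features from different models and uses a lightweight attention mechanism to adjust fusion weights according to tissue phenotype context, improving performance through cross-model cooperation while offering interpretable insight into model-specific inductive biases.
Link: https://arxiv.org/abs/2508.05084
Authors: Yuxiang Xiao, Yang Hu, Bin Li, Tianyang Zhang, Zexi Li, Huazhu Fu, Jens Rittscher, Kaixiang Yang
Affiliations: 1. University of Oxford; 2. University of Oxford; 3. Tsinghua University; 4. University of Oxford; 5. Zhejiang University; 6. University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 Tables, 11 Figures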
Abstract:Pathology foundation models (PFMs) have demonstrated strong representational capabilities through self-supervised pre-training on large-scale, unannotated histopathology image datasets. However, their diverse yet opaque pretraining contexts, shaped by both data-related and structural/training factors, introduce latent biases that hinder generalisability and transparency in downstream applications. In this paper, we propose AdaFusion, a novel prompt-guided inference framework that, to our knowledge, is among the very first to dynamically integrate complementary knowledge from multiple PFMs. Our method compresses and aligns tile-level features from diverse models and employs a lightweight attention mechanism to adaptively fuse them based on tissue phenotype context. We evaluate AdaFusion on three real-world benchmarks spanning treatment response prediction, tumour grading, and spatial gene expression inference. Our approach consistently surpasses individual PFMs across both classification and regression tasks, while offering interpretable insights into each model’s biosemantic specialisation. These results highlight AdaFusion’s ability to bridge heterogeneous PFMs, achieving both enhanced performance and interpretability of model-specific inductive biases.
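Attention-based fusion of features from several models reduces, at its simplest, to a softmax-weighted sum. The following sketch assumes the per-model features have already been compressed and aligned to a common dimension, and uses scaled dot-product scores against a context vector; the function names and the scoring rule are illustrative assumptions, not AdaFusion's actual architecture.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, context):
    """Fuse per-model feature vectors with attention weights derived from
    a context vector. Returns (fused_vector, weights)."""
    d = len(context)
    scores = [sum(f_i * c_i for f_i, c_i in zip(f, context)) / math.sqrt(d)
              for f in features]
    weights = softmax(scores)
    fused = [sum(w * f[i] for w, f in zip(weights, features))
             for i in range(d)]
    return fused, weights
```

The returned weights indicate how much each model contributed for a given context, which is the kind of signal that could support the interpretability claims in the abstract.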
[CV-77] FLUX-Makeup: High-Fidelity Identity-Consistent and Robust Makeup Transfer via Diffusion Transformer
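[Quick read]: This paper addresses the poor identity consistency, error accumulation, and limited performance of current makeup transfer methods that depend on extra face-control modules or algorithms. Existing GAN-based methods need carefully designed loss functions to balance transfer quality against facial identity consistency, while diffusion-based methods often add auxiliary control modules to preserve identity, and these components tend to introduce errors that hurt the final result. The key to the solution is FLUX-Makeup, whose main contributions are: (1) building on FLUX-Kontext and using the source image as its native conditional input, with no extra face control; (2) RefLoRAInjector, a lightweight module that decouples the reference pathway from the backbone for efficient and comprehensive extraction of makeup features; and (3) a robust, scalable data generation pipeline that produces paired makeup data for training, which the authors report to be of higher quality than existing datasets. Without any auxiliary control components, the approach achieves high fidelity, strong identity consistency, and robustness across scenarios.
Link: https://arxiv.org/abs/2508.05069
Authors: Jian Zhu, Shanyuan Liu, Liuzhuozheng Li, Yue Gong, He Wang, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Dawei Leng, Yuhui Yin, Yang Xu
Affiliations: Nanjing University of Science and Technology; 360 AI Research; Beijing University of Aeronautics and Astronautics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: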
Abstract:Makeup transfer aims to apply the makeup style from a reference face to a target face and has been increasingly adopted in practical applications. Existing GAN-based approaches typically rely on carefully designed loss functions to balance transfer quality and facial identity consistency, while diffusion-based methods often depend on additional face-control modules or algorithms to preserve identity. However, these auxiliary components tend to introduce extra errors, leading to suboptimal transfer results. To overcome these limitations, we propose FLUX-Makeup, a high-fidelity, identity-consistent, and robust makeup transfer framework that eliminates the need for any auxiliary face-control components. Instead, our method directly leverages source-reference image pairs to achieve superior transfer performance. Specifically, we build our framework upon FLUX-Kontext, using the source image as its native conditional input. Furthermore, we introduce RefLoRAInjector, a lightweight makeup feature injector that decouples the reference pathway from the backbone, enabling efficient and comprehensive extraction of makeup-related information. In parallel, we design a robust and scalable data generation pipeline to provide more accurate supervision during training. The paired makeup datasets produced by this pipeline significantly surpass the quality of all existing datasets. Extensive experiments demonstrate that FLUX-Makeup achieves state-of-the-art performance, exhibiting strong robustness across diverse scenarios.
[CV-78] Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks
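[Quick read]: This paper addresses image colorization, i.e. automatically adding plausible colors to grayscale images. The task is inherently ill-posed: two of the three color channels are lost, leaving a very large solution space. The key to the approach is to use scene semantics and surface texture as priors for color prediction, and to model the multi-modal nature of the color distribution by combining classification with adversarial learning, which overcomes the tendency of conventional regression methods to ignore color diversity.
Link: https://arxiv.org/abs/2508.05068
Authors: Ruiyu Li, Changyuan Qiu, Hangrui Cao, Qihan Ren, Yuqing Qiu
Affiliations: University of Michigan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 5 pages, 4 figures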
Abstract:Image colorization, the task of adding colors to grayscale images, has been the focus of significant research efforts in computer vision in recent years for its various application areas such as color restoration and automatic animation colorization [15, 1]. The colorization problem is challenging as it is highly ill-posed with two out of three image dimensions lost, resulting in large degrees of freedom. However, semantics of the scene as well as the surface texture could provide important cues for colors: the sky is typically blue, the clouds are typically white and the grass is typically green, and there are huge amounts of training data available for learning such priors since any colored image could serve as a training data point [20]. Colorization is initially formulated as a regression task [5], which ignores the multi-modal nature of color prediction. In this project, we explore automatic image colorization via classification and adversarial learning. We will build our models on prior works, apply modifications for our specific scenario and make comparisons.
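Treating colorization as classification means turning continuous chrominance values into discrete class labels. The sketch below assumes the Lab color space with the two chrominance channels (a, b) quantized on a uniform grid; the grid size of 10, the range [-110, 110), and the function names are illustrative assumptions and are not taken from the abstract.

```python
GRID = 10                  # assumed bin width along each chrominance axis
AB_MIN, AB_MAX = -110, 110  # assumed range covered by the grid
BINS_PER_AXIS = (AB_MAX - AB_MIN) // GRID  # 22 bins per axis, 484 classes


def _axis_bin(value):
    """Bin index along one axis, clamped to the valid range."""
    idx = int((value - AB_MIN) // GRID)
    return max(0, min(idx, BINS_PER_AXIS - 1))


def ab_to_class(a, b):
    """Map a continuous (a, b) chrominance pair to a class index."""
    return _axis_bin(a) * BINS_PER_AXIS + _axis_bin(b)


def class_to_ab(index):
    """Map a class index back to the center of its (a, b) bin."""
    ia, ib = divmod(index, BINS_PER_AXIS)
    return (AB_MIN + ia * GRID + GRID / 2, AB_MIN + ib * GRID + GRID / 2)
```

A classifier over these bins can assign probability to several plausible colors for the same pixel, which a single regressed value cannot express.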
[CV-79] Decoupling Continual Semantic Segmentation
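[Quick read]: This paper addresses catastrophic forgetting in continual semantic segmentation (CSS), i.e. forgetting previously learned knowledge while learning new classes, which is especially pronounced in dense prediction. Existing methods usually use single-stage encoder-decoder architectures in which class labels and segmentation masks are tightly coupled, causing interference between old and new class learning and making it hard to balance retention and plasticity. The key to the solution is DecoupleCSS, a two-stage framework: the first stage uses pre-trained text and image encoders, adapted with LoRA, to extract class-aware information and generate location-aware prompts; the second stage uses the Segment Anything Model (SAM) to produce precise class-agnostic segmentation masks, so segmentation knowledge is shared across old and new classes. Decoupling class recognition from segmentation in this way improves performance in continual learning settings.
Link: https://arxiv.org/abs/2508.05065
Authors: Yifu Guo, Yuquan Lu, Wentao Zhang, Zishan Xu, Dexia Chen, Siyu Zhang, Yizhe Zhang, Ruixuan Wang
Affiliations: Sun Yat-sen University; South China Normal University; Southwest University; Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL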
Abstract:Continual Semantic Segmentation (CSS) requires learning new classes without forgetting previously acquired knowledge, addressing the fundamental challenge of catastrophic forgetting in dense prediction tasks. However, existing CSS methods typically employ single-stage encoder-decoder architectures where segmentation masks and class labels are tightly coupled, leading to interference between old and new class learning and suboptimal retention-plasticity balance. We introduce DecoupleCSS, a novel two-stage framework for CSS. By decoupling class-aware detection from class-agnostic segmentation, DecoupleCSS enables more effective continual learning, preserving past knowledge while learning new classes. The first stage leverages pre-trained text and image encoders, adapted using LoRA, to encode class-specific information and generate location-aware prompts. In the second stage, the Segment Anything Model (SAM) is employed to produce precise segmentation masks, ensuring that segmentation knowledge is shared across both new and previous classes. This approach improves the balance between retention and adaptability in CSS, achieving state-of-the-art performance across a variety of challenging tasks. Our code is publicly available at: this https URL.
[CV-80] DualMat: PBR Material Estimation via Coherent Dual-Path Diffusion
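[Quick read]: This paper addresses accurate estimation of Physically Based Rendering (PBR) material parameters from a single image under complex lighting, in particular improving albedo, metallic, and roughness estimates together. The key to the solution is DualMat, a dual-path diffusion framework that operates in two distinct but coordinated latent spaces: one path optimizes albedo by leveraging pretrained visual knowledge through an RGB latent space, and the other uses a compact, purpose-built latent space for precise metallic and roughness estimation. Feature distillation keeps the two paths' predictions coherent, and rectified flow reduces the number of inference steps while maintaining quality, giving efficient, high-quality material reconstruction.
Link: https://arxiv.org/abs/2508.05060
Authors: Yifeng Huang, Zhang Chen, Yi Xu, Minh Hoai, Zhong Li
Affiliations: Stony Brook University; Meta; Goertek Alpha Labs; The University of Adelaide; Apple Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: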
Abstract:We present DualMat, a novel dual-path diffusion framework for estimating Physically Based Rendering (PBR) materials from single images under complex lighting conditions. Our approach operates in two distinct latent spaces: an albedo-optimized path leveraging pretrained visual knowledge through RGB latent space, and a material-specialized path operating in a compact latent space designed for precise metallic and roughness estimation. To ensure coherent predictions between the albedo-optimized and material-specialized paths, we introduce feature distillation during training. We employ rectified flow to enhance efficiency by reducing inference steps while maintaining quality. Our framework extends to high-resolution and multi-view inputs through patch-based estimation and cross-view attention, enabling seamless integration into image-to-3D pipelines. DualMat achieves state-of-the-art performance on both Objaverse and real-world data, significantly outperforming existing methods with up to 28% improvement in albedo estimation and 39% reduction in metallic-roughness prediction errors.
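The link between rectified flow and fewer inference steps can be seen in a toy example: when the sampling path is a straight line, Euler integration reaches the endpoint in very few steps. The sketch below is an illustration only; `straight_line_velocity` is a hand-written stand-in for the learned velocity network, and all names are hypothetical.

```python
def euler_sample(x0, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with
    `num_steps` explicit Euler steps and return the final point."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return x


def straight_line_velocity(target):
    """Toy velocity field whose trajectories are straight lines that end
    at `target` when t = 1 (stand-in for a trained model)."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity
```

For this straight-line field a single Euler step already lands on the target, whereas a curved trajectory would need many small steps to follow accurately.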
[CV-81] Learning from Oblivion: Predicting Knowledge Overflowed Weights via Retrodiction of Forgetting
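[Quick read]: This paper addresses how to obtain pre-trained weights that encapsulate more knowledge than the available dataset provides, in order to improve downstream performance in data-scarce settings. The core of the solution is KNowledge Overflowed Weights (KNOW) prediction, a strategy that synthesizes knowledge-enriched weights through structured forgetting and its inversion. The authors find that sequential fine-tuning on progressively downsized datasets induces a structured forgetting process that can be modeled with meta-learning and reversed, recovering weights as if they had been trained on a larger dataset. The resulting hyper-model, the KNowledge Overflowed Weights Nowcaster (KNOWN), learns the general evolution of weights and predicts weights with better generalization; experiments show it outperforms naive fine-tuning and simple weight prediction.
Link: https://arxiv.org/abs/2508.05059
Authors: Jinhyeok Jang, Jaehong Kim, Jung Uk Kim
Affiliations: ETRI; Kyung Hee University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: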
Abstract:Pre-trained weights have become a cornerstone of modern deep learning, enabling efficient knowledge transfer and improving downstream task performance, especially in data-scarce scenarios. However, a fundamental question remains: how can we obtain better pre-trained weights that encapsulate more knowledge beyond the given dataset? In this work, we introduce KNowledge Overflowed Weights (KNOW) prediction, a novel strategy that leverages structured forgetting and its inversion to synthesize knowledge-enriched weights. Our key insight is that sequential fine-tuning on progressively downsized datasets induces a structured forgetting process, which can be modeled and reversed to recover knowledge as if trained on a larger dataset. We construct a dataset of weight transitions governed by this controlled forgetting and employ meta-learning to model weight prediction effectively. Specifically, our KNowledge Overflowed Weights Nowcaster (KNOWN) acts as a hyper-model that learns the general evolution of weights and predicts enhanced weights with improved generalization. Extensive experiments across diverse datasets and architectures demonstrate that KNOW prediction consistently outperforms naive fine-tuning and simple weight prediction, leading to superior downstream performance. Our work provides a new perspective on reinterpreting forgetting dynamics to push the limits of knowledge transfer in deep learning.
[CV-82] Finding Needles in Images: Can Multimodal LLMs Locate Fine Details? ACL2025
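[Quick read]: This paper addresses the limited ability of Multi-modal Large Language Models (MLLMs) to locate and reason about fine-grained information in complex documents, especially in tasks that require extracting a specific detail from a large amount of text or image content, such as nutritional information on a menu or a disclaimer in a long article. The core difficulty is that models perceive and focus on small local regions poorly, a setting the authors call Finding Needles in Images (NiM). The proposed Spot-IT method combines intelligent patch selection with Gaussian attention to mimic how humans zoom and focus, strengthening the model's attention to small but important document regions. Experiments show that Spot-IT significantly improves MLLMs' ability to extract precise details from documents with complex layouts.
Link: https://arxiv.org/abs/2508.05053
Authors: Parth Thakkar, Ankush Agarwal, Prasad Kasu, Pulkit Bansal, Chaitanya Devaguptapu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ACL 2025 in the main track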
Abstract:While Multi-modal Large Language Models (MLLMs) have shown impressive capabilities in document understanding tasks, their ability to locate and reason about fine-grained details within complex documents remains understudied. Consider searching a restaurant menu for a specific nutritional detail or identifying a disclaimer in a lengthy newspaper article: tasks that demand careful attention to small but significant details within a broader narrative, akin to Finding Needles in Images (NiM). To address this gap, we introduce NiM, a carefully curated benchmark spanning diverse real-world documents including newspapers, menus, and lecture images, specifically designed to evaluate MLLMs’ capability in these intricate tasks. Building on this, we further propose Spot-IT, a simple yet effective approach that enhances MLLMs’ capability through intelligent patch selection and Gaussian attention, motivated by how humans zoom and focus when searching documents. Our extensive experiments reveal both the capabilities and limitations of current MLLMs in handling fine-grained document understanding tasks, while demonstrating the effectiveness of our approach. Spot-IT achieves significant improvements over baseline methods, particularly in scenarios requiring precise detail extraction from complex layouts.
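The abstract does not say how Spot-IT computes its Gaussian attention. Purely as an illustration of the general idea, the sketch below weights a grid of image patches with a 2D Gaussian centred on one selected patch. The function name `gaussian_patch_weights`, the grid size, and `sigma` are hypothetical and are not taken from the paper.

```python
import math


def gaussian_patch_weights(rows, cols, center, sigma=1.0):
    """Return a rows x cols grid of weights that peak at `center` (row, col).

    Hypothetical helper: an illustrative Gaussian weighting over patches,
    not the Spot-IT implementation.
    """
    cr, cc = center
    grid = [
        [math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2.0 * sigma ** 2))
         for c in range(cols)]
        for r in range(rows)
    ]
    total = sum(sum(row) for row in grid)
    # Normalise so the weights sum to 1, like an attention distribution.
    return [[w / total for w in row] for row in grid]


# Assumed setup: a 4x4 patch grid whose selected patch sits at row 1, column 2.
weights = gaussian_patch_weights(rows=4, cols=4, center=(1, 2), sigma=1.0)
```

A smaller `sigma` concentrates the weight on the selected patch, while a larger one spreads it over neighbouring patches, which loosely mirrors zooming in and out.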
[CV-83] HAMoBE: Hierarchical and Adaptive Mixture of Biometric Experts for Video-based Person ReID ICCV2025
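[Quick read]: This paper targets video-based person re-identification (Video ReID), where existing methods match poorly because they overlook selecting the most discriminative features from both videos of a query-gallery pair. It proposes the Hierarchical and Adaptive Mixture of Biometric Experts (HAMoBE) framework. The first level extracts low-level representations from multi-layer features of a frozen pre-trained large model (e.g., CLIP); the second level adds specialized experts focused on long-term, short-term, and temporal features. A dual-input decision gating network dynamically adjusts each expert's contribution to the current input scenario, mimicking human perceptual mechanisms and yielding more robust cross-video matching.
Link: https://arxiv.org/abs/2508.05038
Authors: Yiyang Su, Yunping Shi, Feng Liu, Xiaoming Liu
Affiliations: Michigan State University; Drexel University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at ICCV 2025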
Abstract:Recently, research interest in person re-identification (ReID) has increasingly focused on video-based scenarios, which are essential for robust surveillance and security in varied and dynamic environments. However, existing video-based ReID methods often overlook the necessity of identifying and selecting the most discriminative features from both videos in a query-gallery pair for effective matching. To address this issue, we propose a novel Hierarchical and Adaptive Mixture of Biometric Experts (HAMoBE) framework, which leverages multi-layer features from a pre-trained large model (e.g., CLIP) and is designed to mimic human perceptual mechanisms by independently modeling key biometric features–appearance, static body shape, and dynamic gait–and adaptively integrating them. Specifically, HAMoBE includes two levels: the first level extracts low-level features from multi-layer representations provided by the frozen large model, while the second level consists of specialized experts focusing on long-term, short-term, and temporal features. To ensure robust matching, we introduce a new dual-input decision gating network that dynamically adjusts the contributions of each expert based on their relevance to the input scenarios. Extensive evaluations on benchmarks like MEVID demonstrate that our approach yields significant performance improvements (e.g., +13.0% Rank-1 accuracy).
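To make the idea of adaptively weighting experts concrete, here is a minimal sketch of a softmax-gated mixture. It is only an assumption about the general pattern: the names `softmax` and `gated_mixture` are hypothetical, and the raw gate scores are passed in by hand rather than produced by the learned dual-input gating network the abstract describes.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_mixture(expert_features, gate_scores):
    """Combine expert feature vectors with softmax-normalised gate weights.

    expert_features: list of equal-length vectors, one per expert.
    gate_scores: one raw score per expert (hand-set here; a learned gating
    network would produce them, which this sketch does not model).
    """
    weights = softmax(gate_scores)
    dim = len(expert_features[0])
    fused = [0.0] * dim
    for w, feat in zip(weights, expert_features):
        for i in range(dim):
            fused[i] += w * feat[i]
    return weights, fused


# Three hypothetical experts with 2-D features; the first gets the highest score.
weights, fused = gated_mixture(
    expert_features=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    gate_scores=[2.0, 0.0, 0.0],
)
```

Because the weights are normalised, raising one expert's score necessarily lowers the relative contribution of the others.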
[CV-84] A Novel Image Similarity Metric for Scene Composition Structure ICIP2025
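[Quick read]: This paper addresses the difficulty of evaluating how well generative AI models preserve Scene Composition Structure (SCS) during image generation. Traditional image similarity metrics assess SCS integrity poorly: pixel-level methods are sensitive to minor noise, perception-based metrics favor human aesthetic preference, and neural-network-based metrics incur training overhead and may generalize poorly. The paper introduces the SCS Similarity Index Measure (SCSSIM), a training-free metric that uses statistical measures derived from Cuboidal hierarchical partitioning of images to robustly capture non-object-based structural relationships. Experiments show that SCSSIM is highly invariant to non-compositional distortions and decreases strongly and monotonically under compositional distortions, indicating when SCS has been altered, and that it compares favorably with existing metrics for structural evaluation.
Link: https://arxiv.org/abs/2508.05037
Authors: Md Redwanul Haque, Manzur Murshed, Manoranjan Paul, Tsz-Kwan Lee
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
Comments: IEEE ICIP 2025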
Abstract:The rapid advancement of generative AI models necessitates novel methods for evaluating image quality that extend beyond human perception. A critical concern for these models is the preservation of an image’s underlying Scene Composition Structure (SCS), which defines the geometric relationships among objects and the background, their relative positions, sizes, orientations, etc. Maintaining SCS integrity is paramount for ensuring faithful and structurally accurate GenAI outputs. Traditional image similarity metrics often fall short in assessing SCS. Pixel-level approaches are overly sensitive to minor visual noise, while perception-based metrics prioritize human aesthetic appeal, neither adequately capturing structural fidelity. Furthermore, recent neural-network-based metrics introduce training overheads and potential generalization issues. We introduce the SCS Similarity Index Measure (SCSSIM), a novel, analytical, and training-free metric that quantifies SCS preservation by exploiting statistical measures derived from the Cuboidal hierarchical partitioning of images, robustly capturing non-object-based structural relationships. Our experiments demonstrate SCSSIM’s high invariance to non-compositional distortions, accurately reflecting unchanged SCS. Conversely, it shows a strong monotonic decrease for compositional distortions, precisely indicating when SCS has been altered. Compared to existing metrics, SCSSIM exhibits superior properties for structural evaluation, making it an invaluable tool for developing and evaluating generative models, ensuring the integrity of scene composition.
[CV-85] Skin-SOAP: A Weakly Supervised Framework for Generating Structured SOAP Notes IJCAI2025
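[Quick read]: This paper addresses the inefficiency of producing SOAP (Subjective, Objective, Assessment, and Plan) notes in clinical care for skin carcinoma, where manual writing places a heavy burden on physicians. It proposes skin-SOAP, a weakly supervised multimodal framework that generates structured, clinically coherent SOAP notes from limited inputs such as lesion images and sparse clinical text, reducing reliance on large annotated datasets while achieving clinical relevance comparable to large models such as GPT-4o and Claude. Two new metrics, MedConceptEval and Clinical Coherence Score (CCS), quantify the quality of the generated notes in terms of medical concept consistency and clinical coherence.
Link: https://arxiv.org/abs/2508.05019
Authors: Sadia Kamal, Tim Oates, Joy Wan
Affiliations: University of Maryland, Baltimore County; Johns Hopkins University School of Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to IJCAI 2025 Workshops. arXiv admin note: substantial text overlap with arXiv:2506.10328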
Abstract:Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. Early diagnosis and accurate, timely treatment are critical to improving patient survival rates. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose skin-SOAP, a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated data. Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. To evaluate this clinical relevance, we introduce two novel metrics, MedConceptEval and Clinical Coherence Score (CCS), which assess semantic alignment with expert medical concepts and input features, respectively.
[CV-86] AU-IQA: A Benchmark Dataset for Perceptual Quality Assessment of AI-Enhanced User-Generated Content
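[Quick read]: This paper addresses the lack of perceptual quality assessment models designed for AI-enhanced user-generated content (AI-UGC), a gap that limits user experience and the development of enhancement methods. It constructs AU-IQA, a benchmark dataset of 4,800 AI-UGC images produced by three representative enhancement types (super-resolution, low-light enhancement, and denoising), and uses it to systematically evaluate existing quality assessment models, including traditional image quality assessment methods and large multimodal models, revealing how they differ and where they fall short on AI-UGC.
Link: https://arxiv.org/abs/2508.05016
Authors: Shushi Wang, Chunyi Li, Zicheng Zhang, Han Zhou, Wei Dong, Jun Chen, Guangtao Zhai, Xiaohong Liu
Affiliations: Shanghai Jiao Tong University; McMaster University; Suzhou Key Laboratory of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: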
Abstract:AI-based image enhancement techniques have been widely adopted in various visual applications, significantly improving the perceptual quality of user-generated content (UGC). However, the lack of specialized quality assessment models has become a significant bottleneck in this field, limiting user experience and hindering the advancement of enhancement methods. While perceptual quality assessment methods have shown strong performance on UGC and AIGC individually, their effectiveness on AI-enhanced UGC (AI-UGC), which blends features from both, remains largely unexplored. To address this gap, we construct AU-IQA, a benchmark dataset comprising 4,800 AI-UGC images produced by three representative enhancement types: super-resolution, low-light enhancement, and denoising. On this dataset, we further evaluate a range of existing quality assessment models, including traditional IQA methods and large multimodal models. Finally, we provide a comprehensive analysis of how well current approaches perform in assessing the perceptual quality of AI-UGC. The access link to the AU-IQA is this https URL.
[CV-87] Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation
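[Quick read]: This paper addresses the poor generalization of Vision-Language Models (VLMs) in medical image segmentation under domain shift, in particular the high variability of medical images caused by confounders such as equipment differences, imaging artifacts, and imaging modes. It proposes Multimodal Causal-Driven Representation Learning (MCDRL). First, it uses CLIP's cross-modal alignment and text prompts to build a confounder dictionary that represents domain-specific variations. Second, it trains a causal intervention network that uses this dictionary to identify and remove domain-specific interference while preserving the anatomical structure information essential for segmentation, giving more robust domain generalization.
Link: https://arxiv.org/abs/2508.05008
Authors: Xusheng Liang, Lihua Zhou, Nianxin Li, Miao Xu, Ziyang Song, Dong Yi, Jinlin Wu, Hongbin Liu, Jiebo Luo, Zhen Lei
Affiliations: University of Science and Technology of China; Tsinghua University; Alibaba Group; Institute of Automation, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review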
Abstract:Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP’s cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary through text prompts, specifically designed to represent domain-specific variations; second, it trains a causal intervention network that utilizes this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical for segmentation tasks. Extensive experiments demonstrate that MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
[CV-88] CRAM: Large-scale Video Continual Learning with Bootstrapped Compression
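[Quick read]: This paper addresses the practical difficulty that video's high storage requirements pose for Video Continual Learning (Video CL): with long video streams and a limited memory buffer, conventional rehearsal-based methods struggle to maintain performance. It proposes Continually Refreshed Amodal Memory (CRAM), which uses compressed vision, storing videos as codes (embeddings) rather than raw frames to cut storage substantially. Because the compressor is trained online and is itself subject to catastrophic forgetting, stored codes are refreshed by decoding them with the previous network and recompressing them with the new one, keeping memory contents consistent as the model evolves. On large-scale video benchmarks such as EpicKitchens-100 and Kinetics-700, the method outperforms prior approaches with a memory footprint under 2 GB.
Link: https://arxiv.org/abs/2508.05001
Authors: Shivani Mall, Joao F. Henriques
Affiliations: University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Performance (cs.PF)
Comments: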
Abstract:Continual learning (CL) promises to allow neural networks to learn from continuous streams of inputs, instead of IID (independent and identically distributed) sampling, which requires random access to a full dataset. This would allow for much smaller storage requirements and self-sufficiency of deployed systems that cope with natural distribution shifts, similarly to biological learning. We focus on video CL employing a rehearsal-based approach, which reinforces past samples from a memory buffer. We posit that part of the reason why practical video CL is challenging is the high memory requirements of video, further exacerbated by long videos and continual streams, which are at odds with the common rehearsal-buffer size constraints. To address this, we propose to use compressed vision, i.e. store video codes (embeddings) instead of raw inputs, and train a video classifier by IID sampling from this rolling buffer. Training a video compressor online (so not depending on any pre-trained networks) means that it is also subject to catastrophic forgetting. We propose a scheme to deal with this forgetting by refreshing video codes, which requires careful decompression with a previous version of the network and recompression with a new one. We name our method Continually Refreshed Amodal Memory (CRAM). We expand current video CL benchmarks to large-scale settings, namely EpicKitchens-100 and Kinetics-700, storing thousands of relatively long videos in under 2 GB, and demonstrate empirically that our video CL method outperforms prior art with a significantly reduced memory footprint.
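The code-refresh step described above (decompress with the previous network, recompress with the new one) can be sketched with a toy stand-in for the compressor. Everything here is hypothetical: `ToyCompressor` simply scales values and `refresh_buffer` is an illustrative name, so the sketch shows only the data flow, not the learned video compressor from the paper.

```python
class ToyCompressor:
    """Stand-in for a learned compressor: encodes by scaling each value."""

    def __init__(self, scale):
        self.scale = scale

    def encode(self, frame):
        return [x * self.scale for x in frame]

    def decode(self, code):
        return [c / self.scale for c in code]


def refresh_buffer(buffer, old_model, new_model):
    """Decode each stored code with the old model, then re-encode with the new."""
    return [new_model.encode(old_model.decode(code)) for code in buffer]


old = ToyCompressor(scale=2.0)
new = ToyCompressor(scale=4.0)

# The buffer holds codes written by the old compressor, never raw frames.
buffer = [old.encode([1.0, 2.0]), old.encode([3.0, 4.0])]
buffer = refresh_buffer(buffer, old, new)

# After the refresh, the new model alone is enough to decode the buffer.
restored = [new.decode(code) for code in buffer]
```

The point of the refresh is that the old model can be discarded afterwards. With a real lossy compressor each refresh would add some reconstruction error, which this lossless toy does not show.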
[CV-89] Attribute Guidance With Inherent Pseudo-label For Occluded Person Re-identification ECAI2025
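[Quick read]: This paper addresses the performance bottleneck of person re-identification (Re-ID) under occlusion: pre-trained vision-language models focus on holistic image semantics and neglect fine-grained attribute information, so they struggle to distinguish partially occluded pedestrians or individuals with subtle appearance differences. It proposes Attribute-Guide ReID (AG-ReID). The framework first uses the pre-trained model's own capabilities to generate fine-grained attribute pseudo-labels that capture subtle visual characteristics, then applies a dual-guidance mechanism that combines holistic semantics with fine-grained attribute information to strengthen image feature extraction. Without additional data or annotations, the method achieves state-of-the-art results on several widely used Re-ID datasets and improves robustness to occlusion and subtle attribute differences.
Link: https://arxiv.org/abs/2508.04998
Authors: Rui Zhi, Zhen Yang, Haiyang Zhang
Affiliations: Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 2 supplement pages, 3 figures, ECAI2025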
Abstract:Person re-identification (Re-ID) aims to match person images across different camera views, with occluded Re-ID addressing scenarios where pedestrians are partially visible. While pre-trained vision-language models have shown effectiveness in Re-ID tasks, they face significant challenges in occluded scenarios by focusing on holistic image semantics while neglecting fine-grained attribute information. This limitation becomes particularly evident when dealing with partially occluded pedestrians or when distinguishing between individuals with subtle appearance differences. To address this limitation, we propose Attribute-Guide ReID (AG-ReID), a novel framework that leverages pre-trained models’ inherent capabilities to extract fine-grained semantic attributes without additional data or annotations. Our framework operates through a two-stage process: first generating attribute pseudo-labels that capture subtle visual characteristics, then introducing a dual-guidance mechanism that combines holistic and fine-grained attribute information to enhance image feature extraction. Extensive experiments demonstrate that AG-ReID achieves state-of-the-art results on multiple widely-used Re-ID datasets, showing significant improvements in handling occlusions and subtle attribute differences while maintaining competitive performance on standard Re-ID scenarios.
[CV-90] Modeling Rapid Contextual Learning in the Visual Cortex with Fast-Weight Deep Autoencoder Networks
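[Quick read]: This paper asks how familiarity training can make the early layers of a deep neural network sensitive to global image context, modeling the rapid learning of global context observed in the early visual cortex. It introduces fast weights based on Low-Rank Adaptation (LoRA) to encode short-term memory traces in each self-attention module of a Vision Transformer (ViT). This hybrid fast-and-slow weight architecture aligns the latent representations of early layers with those of the top layer, which contains global information, and broadens the scope of self-attention within remembered image contexts, functionally reproducing a mechanism similar to biological neural circuits for enhancing sensitivity to global context.
Link: https://arxiv.org/abs/2508.04988
Authors: Yue Li, Weifan Wang, Tai Sing Lee
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: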
Abstract:Recent neurophysiological studies have revealed that the early visual cortex can rapidly learn global image context, as evidenced by a sparsification of population responses and a reduction in mean activity when exposed to familiar versus novel image contexts. This phenomenon has been attributed primarily to local recurrent interactions, rather than changes in feedforward or feedback pathways, supported by both empirical findings and circuit-level modeling. Recurrent neural circuits capable of simulating these effects have been shown to reshape the geometry of neural manifolds, enhancing robustness and invariance to irrelevant variations. In this study, we employ a Vision Transformer (ViT)-based autoencoder to investigate, from a functional perspective, how familiarity training can induce sensitivity to global context in the early layers of a deep neural network. We hypothesize that rapid learning operates via fast weights, which encode transient or short-term memory traces, and we explore the use of Low-Rank Adaptation (LoRA) to implement such fast weights within each Transformer layer. Our results show that (1) The proposed ViT-based autoencoder’s self-attention circuit performs a manifold transform similar to a neural circuit model of the familiarity effect. (2) Familiarity training aligns latent representations in early layers with those in the top layer that contains global context information. (3) Familiarity training broadens the self-attention scope within the remembered image context. (4) These effects are significantly amplified by LoRA-based fast weights. Together, these findings suggest that familiarity training introduces global sensitivity to earlier layers in a hierarchical network, and that a hybrid fast-and-slow weight architecture may provide a viable computational model for studying rapid global context learning in the brain.
[CV-91] Unified modality separation: A vision-language framework for unsupervised domain adaptation
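[Quick read]: This paper addresses the performance bottleneck that the modality gap creates when pre-trained vision-language models (VLMs) are used for unsupervised domain adaptation (UDA). In the presence of a modality gap, existing methods transfer only modality-invariant knowledge and make little use of modality-specific information, which limits target-domain performance. It proposes a unified modality separation framework that disentangles modality-specific and modality-invariant components from VLM features, handles them separately during training, and at test time uses adaptive ensemble weights to maximize the synergy between components. The authors also design an instance-level modality discrepancy metric to categorize samples and guide optimization. Combined with prompt tuning, the method reports up to a 9% performance gain together with a 9-fold improvement in computational efficiency.
Link: https://arxiv.org/abs/2508.04987
Authors: Xinyao Li, Jingjing Li, Zhekai Du, Lei Zhu, Heng Tao Shen
Affiliations: University of Electronic Science and Technology of China; Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to TPAMI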
Abstract:Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging semantic information to facilitate target tasks. By aligning vision and text embeddings, VLMs have shown notable success in bridging domain gaps. However, inherent differences naturally exist between modalities, which is known as the modality gap. Our findings reveal that direct UDA in the presence of a modality gap only transfers modality-invariant knowledge, leading to suboptimal target performance. To address this limitation, we propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. During training, different modality components are disentangled from VLM features and then handled separately in a unified manner. At test time, modality-adaptive ensemble weights are automatically determined to maximize the synergy of different components. To evaluate instance-level modality characteristics, we design a modality discrepancy metric to categorize samples into modality-invariant, modality-specific, and uncertain ones. The modality-invariant samples are exploited to facilitate cross-modal alignment, while uncertain ones are annotated to enhance model capabilities. Building upon prompt tuning techniques, our methods achieve up to a 9% performance gain with 9 times the computational efficiency. Extensive experiments and analysis across various backbones, baselines, datasets and adaptation settings demonstrate the efficacy of our design.
[CV-92] Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth Completion
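[Quick read]: This paper addresses the marked performance drop of depth completion models in out-of-distribution (OOD) scenarios. Existing learning-based methods rely on carefully prepared but limited training data, so they generalize poorly. It brings in a depth foundation model, using it to extract environmental cues (structural and semantic context) from RGB images to guide the propagation of sparse depth into missing regions. A parameter-free dual-space propagation mechanism propagates sparse depth in both 3D and 2D space to preserve geometric structure and local consistency, and a learnable correction module progressively refines the predicted depth, improving robustness in OOD scenarios without large-scale training data.
Link: https://arxiv.org/abs/2508.04984
Authors: Shenglun Chen, Xinzhu Ma, Hong Zhang, Haojie Li, Zhihui Wang
Affiliations: Dalian University of Technology; Beihang University; Shandong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE TIP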
Abstract:Depth completion is a pivotal challenge in computer vision, aiming at reconstructing the dense depth map from a sparse one, typically with a paired RGB image. Existing learning-based models rely on carefully prepared but limited data, leading to significant performance degradation in out-of-distribution (OOD) scenarios. Recent foundation models have demonstrated exceptional robustness in monocular depth estimation through large-scale training, and using such models to enhance the robustness of depth completion models is a promising solution. In this work, we propose a novel depth completion framework that leverages depth foundation models to attain remarkable robustness without large-scale training. Specifically, we leverage a depth foundation model to extract environmental cues, including structural and semantic context, from RGB images to guide the propagation of sparse depth information into missing regions. We further design a dual-space propagation approach, without any learnable parameters, to effectively propagate sparse depth in both 3D and 2D spaces to maintain geometric structure and local consistency. To refine the intricate structure, we introduce a learnable correction module to progressively adjust the depth prediction towards the real depth. We train our model on the NYUv2 and KITTI datasets as in-distribution datasets and extensively evaluate the framework on 16 other datasets. Our framework performs remarkably well in the OOD scenarios and outperforms existing state-of-the-art depth completion methods. Our models are released in this https URL.
[CV-93] Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression
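[Quick read]: This paper addresses two key problems of diffusion-based image compression: high decoding latency caused by multi-step sampling, and poor reconstruction fidelity caused by over-reliance on generative priors. It proposes SODEC, a single-step diffusion image compression model. Based on the insight that a sufficiently informative latent makes multi-step refinement unnecessary, it uses a pre-trained VAE-based model to produce information-rich latents and replaces iterative denoising with single-step decoding. A fidelity guidance module encourages outputs that are faithful to the original image, and a rate annealing training strategy enables effective training at extremely low bitrates. Experiments show that SODEC outperforms existing methods in rate-distortion-perception performance and decodes more than 20 times faster than previous diffusion-based compression models.
Link: https://arxiv.org/abs/2508.04979
Authors: Zheng Chen, Mingde Zhou, Jinpei Guo, Jiale Yuan, Yifei Ji, Yulun Zhang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at: this https URL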
Abstract:Diffusion-based image compression has demonstrated impressive perceptual performance. However, it suffers from two critical drawbacks: (1) excessive decoding latency due to multi-step sampling, and (2) poor fidelity resulting from over-reliance on generative priors. To address these issues, we propose SODEC, a novel single-step diffusion image compression model. We argue that in image compression, a sufficiently informative latent renders multi-step refinement unnecessary. Based on this insight, we leverage a pre-trained VAE-based model to produce latents with rich information, and replace the iterative denoising process with a single-step decoding. Meanwhile, to improve fidelity, we introduce the fidelity guidance module, encouraging output that is faithful to the original image. Furthermore, we design the rate annealing training strategy to enable effective training under extremely low bitrates. Extensive experiments show that SODEC significantly outperforms existing methods, achieving superior rate-distortion-perception performance. Moreover, compared to previous diffusion-based compression models, SODEC improves decoding speed by more than 20×. Code is released at: this https URL.
[CV-94] CSRAP: Enhanced Canvas Attention Scheduling for Real-Time Mission Critical Perception
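[Quick read]: This paper addresses real-time object detection on edge computing platforms, where high-resolution detection must run under limited compute and strict latency constraints. It extends canvas-based attention scheduling in two ways: canvas frames may vary in size, and the canvas frame rate is selectable and may differ from the original data frame rate. These additional degrees of freedom improve the attainable quality/cost trade-off. Running YOLOv11 on an NVIDIA Jetson Orin Nano over video frames from the Waymo Open Dataset, the approach consistently achieves higher mean average precision (mAP) and recall than the state of the art.
Link: https://arxiv.org/abs/2508.04976
Authors: Md Iftekharul Islam Sakib, Yigong Hu, Tarek Abdelzaher
Affiliations: University of Illinois at Urbana-Champaign; Bangladesh University of Engineering and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: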
Abstract:Real-time perception on edge platforms faces a core challenge: executing high-resolution object detection under stringent latency constraints on limited computing resources. Canvas-based attention scheduling was proposed in earlier work as a mechanism to reduce the resource demands of perception subsystems. It consolidates areas of interest in an input data frame onto a smaller area, called a canvas frame, that can be processed at the requisite frame rate. This paper extends prior canvas-based attention scheduling literature by (i) allowing for variable-size canvas frames and (ii) employing selectable canvas frame rates that may depart from the original data frame rate. We evaluate our solution by running YOLOv11, as the perception module, on an NVIDIA Jetson Orin Nano to inspect video frames from the Waymo Open Dataset. Our results show that the additional degrees of freedom improve the attainable quality/cost trade-offs, thereby allowing for a consistently higher mean average precision (mAP) and recall with respect to the state of the art.
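The abstract describes consolidating areas of interest onto a smaller canvas frame but does not give the packing procedure. As a rough illustration under that caveat, the sketch below uses a simple shelf-packing heuristic, which is a stand-in technique and not the paper's scheduler; `pack_regions` and the region sizes are hypothetical.

```python
def pack_regions(regions, canvas_width):
    """Place (width, height) regions left to right on horizontal shelves.

    Returns the (x, y) offset of each region on the canvas, in input order,
    and the canvas height actually used. Hypothetical shelf-packing heuristic
    used only for illustration.
    """
    placements = []
    x = y = shelf_height = 0
    for w, h in regions:
        if w > canvas_width:
            raise ValueError("region wider than canvas")
        if x + w > canvas_width:
            # The current shelf is full: start a new one below it.
            y += shelf_height
            x = shelf_height = 0
        placements.append((x, y))
        x += w
        shelf_height = max(shelf_height, h)
    return placements, y + shelf_height


# Three assumed regions of interest packed onto a canvas 100 pixels wide.
placements, used_height = pack_regions(
    [(60, 40), (50, 30), (40, 20)], canvas_width=100
)
```

In this toy setting the used height varies with the regions present in each frame, which gives a feel for why a variable-size canvas can save work compared with a fixed one.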
[CV-95] UGOD: Uncertainty-Guided Differentiable Opacity and Soft Dropout for Enhanced Sparse-View 3DGS
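[Quick read]: This paper addresses overfitting in 3D Gaussian Splatting (3DGS) under sparse views, which arises because Gaussians are weighted equally during rendering and which degrades reconstruction quality. It introduces a learned uncertainty that serves two purposes: it guides the differentiable update of Gaussian opacity while preserving the integrity of the 3DGS pipeline, and, through soft differentiable dropout regularisation, it is converted into continuous drop probabilities that control how much each Gaussian participates in projection and blending, giving more efficient, higher-quality sparse-view synthesis.
Link: https://arxiv.org/abs/2508.04968
Authors: Zhihao Guo, Peng Wang, Zidong Chen, Xiangyu Kong, Yan Lyu, Guanyu Gao, Liangxiu Han
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures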
Abstract:3D Gaussian Splatting (3DGS) has become a competitive approach for novel view synthesis (NVS) due to its advanced rendering efficiency through 3D Gaussian projection and blending. However, Gaussians are weighted equally for rendering in most 3DGS methods, making them prone to overfitting, particularly in sparse-view scenarios. To address this, we investigate how adaptive weighting of Gaussians, characterised by the proposed learned uncertainties, affects rendering quality. This learned uncertainty serves two key purposes: first, it guides the differentiable update of Gaussian opacity while preserving the 3DGS pipeline integrity; second, the uncertainty undergoes soft differentiable dropout regularisation, which strategically transforms the original uncertainty into continuous drop probabilities that govern the final Gaussian projection and blending process for rendering. Extensive experimental results over widely adopted datasets demonstrate that our method outperforms rivals in sparse-view 3D synthesis, achieving higher quality reconstruction with fewer Gaussians in most datasets compared to existing sparse-view approaches, e.g., compared to DropGaussian, our method achieves a 3.27% PSNR improvement on the MipNeRF 360 dataset.
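The abstract says that uncertainty is turned into continuous drop probabilities but does not give the mapping. One plausible reading, shown here purely as an assumption, is a sigmoid that maps each Gaussian's uncertainty to a drop probability, with the opacity scaled by the matching keep probability. The names `soft_drop_probability` and `soft_dropout_opacity`, and the `midpoint` and `temperature` values, are hypothetical.

```python
import math


def soft_drop_probability(uncertainty, midpoint=0.5, temperature=0.1):
    """Map an uncertainty value to a drop probability in (0, 1) via a sigmoid.

    Assumed mapping for illustration; the paper's actual transform may differ.
    """
    return 1.0 / (1.0 + math.exp(-(uncertainty - midpoint) / temperature))


def soft_dropout_opacity(opacities, uncertainties):
    """Scale each opacity by its keep probability (1 - drop probability)."""
    return [
        o * (1.0 - soft_drop_probability(u))
        for o, u in zip(opacities, uncertainties)
    ]


# Three Gaussians with equal opacity but increasing (assumed) uncertainty.
effective = soft_dropout_opacity(
    opacities=[0.9, 0.9, 0.9],
    uncertainties=[0.1, 0.5, 0.9],
)
```

Unlike hard dropout, which removes a Gaussian entirely or keeps it, this continuous scaling keeps every term differentiable with respect to the uncertainty.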
[CV-96] Laplacian Analysis Meets Dynamics Modelling: Gaussian Splatting for 4D Reconstruction
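[Quick read]: This paper addresses two core problems of dynamic-scene 3D Gaussian Splatting (3DGS) methods: over-smoothing caused by low-rank decomposition, and feature collision caused by high-dimensional grid sampling. Both stem from an inherent spectral conflict between preserving motion detail and maintaining deformation consistency. It proposes a dynamic 3DGS framework with hybrid explicit-implicit functions and three innovations: (1) a spectral-aware Laplacian encoding architecture that merges hash encoding with a Laplacian-based module for flexible frequency-level motion control; (2) an enhanced Gaussian dynamics attribute that compensates for photometric distortions caused by geometric deformation; and (3) an adaptive Gaussian split strategy guided by KDTree-based primitive control to query and optimize dynamic areas efficiently. The method achieves high reconstruction fidelity on complex dynamic scenes.
Link: https://arxiv.org/abs/2508.04966
Authors: Yifan Zhou, Beizhen Zhao, Pengcheng Wu, Hao Wang
Affiliations: unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: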
Abstract:While 3D Gaussian Splatting (3DGS) excels in static scene modeling, its extension to dynamic scenes introduces significant challenges. Existing dynamic 3DGS methods suffer from either over-smoothing due to low-rank decomposition or feature collision from high-dimensional grid sampling. This is because of the inherent spectral conflicts between preserving motion details and maintaining deformation consistency at different frequencies. To address these challenges, we propose a novel dynamic 3DGS framework with hybrid explicit-implicit functions. Our approach contains three key innovations: a spectral-aware Laplacian encoding architecture which merges Hash encoding and a Laplacian-based module for flexible frequency motion control, an enhanced Gaussian dynamics attribute that compensates for photometric distortions caused by geometric deformation, and an adaptive Gaussian split strategy guided by KDTree-based primitive control to efficiently query and optimize dynamic areas. Through extensive experiments, our method demonstrates state-of-the-art performance in reconstructing complex dynamic scenes, achieving better reconstruction fidelity.
[CV-97] Perceive-Sample-Compress: Towards Real-Time 3D Gaussian Splatting
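[Quick Read]: This paper addresses the limitations of traditional 3D Gaussian Splatting (3DGS) in large-scale scene management and efficient storage, in particular the difficulty of balancing rendering quality against memory efficiency in complex environments or under limited computational resources. The key to the solution is a perceive-sample-compress framework. First, a scene perception compensation algorithm refines the Gaussian parameters at each level and prioritizes visual fidelity in critical regions. Second, a pyramid sampling representation manages Gaussian primitives hierarchically for multi-scale resource scheduling. Finally, a Generalized Gaussian Mixed model compression algorithm substantially reduces storage cost without sacrificing visual quality. Together these improve memory efficiency and overall visible quality while maintaining real-time rendering speed.
Link: https://arxiv.org/abs/2508.04965
Authors: Zijian Wang, Beizhen Zhao, Hao Wang
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: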
Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated remarkable capabilities in real-time and photorealistic novel view synthesis. However, traditional 3DGS representations often struggle with large-scale scene management and efficient storage, particularly when dealing with complex environments or limited computational resources. To address these limitations, we introduce a novel perceive-sample-compress framework for 3D Gaussian Splatting. Specifically, we propose a scene perception compensation algorithm that intelligently refines Gaussian parameters at each level. This algorithm intelligently prioritizes visual importance for higher fidelity rendering in critical areas, while optimizing resource usage and improving overall visible quality. Furthermore, we propose a pyramid sampling representation to manage Gaussian primitives across hierarchical levels. Finally, to facilitate efficient storage of proposed hierarchical pyramid representations, we develop a Generalized Gaussian Mixed model compression algorithm to achieve significant compression ratios without sacrificing visual fidelity. The extensive experiments demonstrate that our method significantly improves memory efficiency and high visual quality while maintaining real-time rendering speed.
[CV-98] Open-world Point Cloud Semantic Segmentation: A Human-in-the-loop Framework
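[Quick Read]: This paper addresses the reliance of existing open-world point cloud semantic segmentation (OW-Seg) methods on resource-intensive offline incremental learning or densely annotated support data, which limits their practicality in real-world scenarios. The key to the solution is HOW-Seg, the first human-in-the-loop framework for OW-Seg. It constructs class prototypes directly on the query data, avoiding the prototype bias caused by intra-class distribution shifts between the support and query sets; uses sparse human annotations to guide prototype generation, enabling prototype-based segmentation of both base and novel classes; introduces a hierarchical prototype disambiguation mechanism to refine ambiguous prototypes; and applies a dense conditional random field (CRF) to enrich contextual awareness. Iterative human feedback then dynamically refines the predictions, significantly improving segmentation quality.
Link: https://arxiv.org/abs/2508.04962
Authors: Peng Zhang, Songru Yang, Jinsheng Sun, Weiqing Li, Zhiyong Su
Affiliations: Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: To be published in IEEE Transactions on Circuits and Systems for Video Technology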
Abstract:Open-world point cloud semantic segmentation (OW-Seg) aims to predict point labels of both base and novel classes in real-world scenarios. However, existing methods rely on resource-intensive offline incremental learning or densely annotated support data, limiting their practicality. To address these limitations, we propose HOW-Seg, the first human-in-the-loop framework for OW-Seg. Specifically, we construct class prototypes, the fundamental segmentation units, directly on the query data, avoiding the prototype bias caused by intra-class distribution shifts between the support and query data. By leveraging sparse human annotations as guidance, HOW-Seg enables prototype-based segmentation for both base and novel classes. Considering the lack of granularity of initial prototypes, we introduce a hierarchical prototype disambiguation mechanism to refine ambiguous prototypes, which correspond to annotations of different classes. To further enrich contextual awareness, we employ a dense conditional random field (CRF) upon the refined prototypes to optimize their label assignments. Through iterative human feedback, HOW-Seg dynamically improves its predictions, achieving high-quality segmentation for both base and novel classes. Experiments demonstrate that with sparse annotations (e.g., one-novel-class-one-click), HOW-Seg matches or surpasses the state-of-the-art generalized few-shot segmentation (GFS-Seg) method under the 5-shot setting. When using advanced backbones (e.g., Stratified Transformer) and denser annotations (e.g., 10 clicks per sub-scene), HOW-Seg achieves 85.27% mIoU on S3DIS and 66.37% mIoU on ScanNetv2, significantly outperforming alternatives.
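The prototype idea in this abstract can be illustrated with a minimal sketch. It assumes that a prototype is the mean feature of the clicked points of a class and that points are assigned by cosine similarity; the function names, the toy 2-D features, and the class names are hypothetical, and the paper's hierarchical disambiguation and dense CRF steps are not reproduced.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def build_prototypes(features, clicks):
    """clicks maps a class name to the indices of points an annotator clicked (assumed input format)."""
    return {cls: mean_vector([features[i] for i in idxs]) for cls, idxs in clicks.items()}


def segment(features, prototypes):
    """Label every point with the class of its most similar prototype."""
    return [max(prototypes, key=lambda c: cosine(f, prototypes[c])) for f in features]


# Toy query scene: four points, one click per class.
features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
clicks = {"chair": [0], "lamp": [2]}
labels = segment(features, build_prototypes(features, clicks))
```

Because the prototypes are built from the query scene itself, no separate support set enters the computation, which is the property the abstract emphasizes.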
[CV-99] AdvDINO: Domain-Adversarial Self-Supervised Representation Learning for Spatial Proteomics
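[Quick Read]: This paper addresses the insufficient robustness of self-supervised learning (SSL) under domain shift, especially in biomedical imaging, where batch effects can obscure true biological signals. The key to the solution is the AdvDINO framework, which integrates a gradient reversal layer into the DINOv2 architecture to enable domain-adversarial training and thereby promote domain-invariant feature learning. The method mitigates slide-specific biases and significantly improves robustness and biological interpretability on six-channel multiplex immunofluorescence (mIF) whole slide images.
Link: https://arxiv.org/abs/2508.04955
Authors: Stella Su, Marc Harary, Scott J. Rodig, William Lotter
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: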
Abstract:Self-supervised learning (SSL) has emerged as a powerful approach for learning visual representations without manual annotations. However, the robustness of standard SSL methods to domain shift – systematic differences across data sources – remains uncertain, posing an especially critical challenge in biomedical imaging where batch effects can obscure true biological signals. We present AdvDINO, a domain-adversarial self-supervised learning framework that integrates a gradient reversal layer into the DINOv2 architecture to promote domain-invariant feature learning. Applied to a real-world cohort of six-channel multiplex immunofluorescence (mIF) whole slide images from non-small cell lung cancer patients, AdvDINO mitigates slide-specific biases to learn more robust and biologically meaningful representations than non-adversarial baselines. Across 5.46 million mIF image tiles, the model uncovers phenotype clusters with distinct proteomic profiles and prognostic significance, and improves survival prediction in attention-based multiple instance learning. While demonstrated on mIF data, AdvDINO is broadly applicable to other imaging domains – including radiology, remote sensing, and autonomous driving – where domain shift and limited annotated data hinder model generalization and interpretability.
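A gradient reversal layer is simple enough to sketch without a deep learning framework. The class below is an illustrative stand-in with hand-written forward and backward passes, not AdvDINO's implementation; the class name and the `lam` scaling factor are assumptions.

```python
class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # strength of the adversarial signal (hypothetical hyperparameter)

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, grad_from_domain_head):
        # Flipping the sign makes the encoder increase the domain loss
        # that the domain classifier is trying to decrease.
        return [-self.lam * g for g in grad_from_domain_head]


grl = GradientReversal(lam=0.5)
features = [0.2, -1.0, 3.0]
out = grl.forward(features)
grad_to_encoder = grl.backward([1.0, -2.0, 0.0])
```

Placing such a layer between the encoder and a domain (here, slide) classifier pushes the encoder toward features from which the domain cannot be predicted.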
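[CV-100] TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring
[Quick Read]: This paper addresses inaccurate localization and low-confidence proposals in Weakly Supervised Dynamic Scene Graph Generation (WS-DSGG) that result from relying on external pretrained object detectors. Existing methods perform poorly in dynamic, relation-aware scenarios mainly because these detectors are trained on static, object-centric images and adapt poorly to the complex motion and interactions in video. The key to the solution is a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method with two core components: (1) relation-aware knowledge mining, which uses category-specific attention maps to locate object regions and interactive areas, and adds an Inter-frame Attention Augmentation strategy that exploits optical flow to improve robustness to motion blur; and (2) a Dual-stream Fusion Module, which fuses these attention maps with the external detections to refine object localization and raise proposal confidence. The method significantly improves WS-DSGG performance on the Action Genome dataset.
Link: https://arxiv.org/abs/2508.04943
Authors: Zhu Xu, Ting Lei, Zhimin Li, Guan Wang, Qingchao Chen, Yuxin Peng, Yang liu
Affiliations: Wangxuan Institute of Computer Technology, Peking University; Tencent Inc.; Baidu Inc.; National Institute of Health Data Science, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: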
Abstract:Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method, which leverages knowledge to enhance detection in relation-aware dynamic scenarios. TRKT is built on two key components: (1) Relation-aware knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps to highlight both object regions and interactive areas. Then we propose an Inter-frame Attention Augmentation strategy that exploits optical flow for neighboring frames to enhance the attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware knowledge mining for WS-DSGG. (2) We introduce a Dual-stream Fusion Module that integrates category-specific attention maps into external detections to refine object localization and boost confidence scores for object proposals. Extensive experiments demonstrate that TRKT achieves state-of-the-art performance on the Action Genome dataset. Our code is available at this https URL.
[CV-101] Accelerating Conditional Prompt Learning via Masked Image Modeling for Vision-Language Models
[Quick Read]: This paper addresses the tendency of prompt learning methods for vision-language models (VLMs), such as CoOp and CoCoOp, to overfit to known classes in zero-shot and few-shot settings, which limits generalization to unseen categories. The key to the solution is the ProMIM framework, whose core idea is to integrate masked image modeling (MIM) seamlessly into existing VLM prompt learning pipelines: it masks only visible image patches and uses the resulting representations to guide the generation of instance-conditioned prompts. This improves feature robustness and mitigates overfitting at negligible additional computational cost, and it works as a plug-and-play module that strengthens methods such as CoOp.
Link: https://arxiv.org/abs/2508.04942
Authors: Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo
Affiliations: Convergence Research Institute, Sungkyunkwan University; Deakin University; Department of Electrical and Computer Engineering, Sungkyunkwan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACMMM-LAVA 2025, 10 pages, camera-ready version
Abstract:Vision-language models (VLMs) like CLIP excel in zero-shot learning but often require resource-intensive training to adapt to new tasks. Prompt learning techniques, such as CoOp and CoCoOp, offer efficient adaptation but tend to overfit to known classes, limiting generalization to unseen categories. We introduce ProMIM, a plug-and-play framework that enhances conditional prompt learning by integrating masked image modeling (MIM) into existing VLM pipelines. ProMIM leverages a simple yet effective masking strategy to generate robust, instance-conditioned prompts, seamlessly augmenting methods like CoOp and CoCoOp without altering their core architectures. By masking only visible image patches and using these representations to guide prompt generation, ProMIM improves feature robustness and mitigates overfitting, all while introducing negligible additional computational cost. Extensive experiments across zero-shot and few-shot classification tasks demonstrate that ProMIM consistently boosts generalization performance when plugged into existing approaches, providing a practical, lightweight solution for real-world vision-language applications.
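[CV-102] Toward Errorless Training ImageNet-1k
[Quick Read]: This paper aims to push image classification accuracy on the ImageNet 2012 contest dataset toward error-free training. The key to the solution is a new training method (proposed in reference [5]) applied to a feedforward artificial neural network, which reaches an accuracy rate of 98.3% with a 99.69 Top-1 rate and an average of 285.9 labels perfectly classified over the 10 batch partitions of the dataset. The authors conjecture that the model falls short of 100% accuracy because of a double-labeling problem: duplicate images in the dataset carry different labels.
Link: https://arxiv.org/abs/2508.04941
Authors: Bo Deng, Levi Heath
Affiliations: University of Nebraska-Lincoln; University of Colorado Colorado Springs
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages, 2 figures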
Abstract:In this paper, we describe a feedforward artificial neural network trained on the ImageNet 2012 contest dataset [7] with the new method of [5] to an accuracy rate of 98.3% with a 99.69 Top-1 rate, and an average of 285.9 labels that are perfectly classified over the 10 batch partitions of the dataset. The best performing model uses 322,430,160 parameters, with 4 decimal places precision. We conjecture that the reason our model does not achieve a 100% accuracy rate is due to a double-labeling problem, by which there are duplicate images in the dataset with different labels.
[CV-103] ALScope: A Unified Toolkit for Deep Active Learning
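[Quick Read]: This paper addresses the lack of a unified evaluation platform for Deep Active Learning (DAL) in complex real-world scenarios, which has hindered fair and systematic comparison of algorithms under challenges such as distribution shifts (e.g., open-set recognition) and data imbalance. The key to the solution is ALScope, a DAL platform that integrates 10 datasets from computer vision (CV) and natural language processing (NLP) with 21 representative DAL algorithms, covering both classical baselines and recent methods designed for specific challenges. It supports flexible configuration of key experimental factors, such as the out-of-distribution (OOD) sample ratio and the class imbalance ratio, enabling comprehensive evaluation under realistic task settings.
Link: https://arxiv.org/abs/2508.04937
Authors: Chenkai Wu, Yuanyuan Qi, Xiaohao Yang, Jueqing Lu, Gang Liu, Wray Buntine, Lan Du
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: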
Abstract:Deep Active Learning (DAL) reduces annotation costs by selecting the most informative unlabeled samples during training. As real-world applications become more complex, challenges stemming from distribution shifts (e.g., open-set recognition) and data imbalance have gained increasing attention, prompting the development of numerous DAL algorithms. However, the lack of a unified platform has hindered fair and systematic evaluation under diverse conditions. Therefore, we present a new DAL platform ALScope for classification tasks, integrating 10 datasets from computer vision (CV) and natural language processing (NLP), and 21 representative DAL algorithms, including both classical baselines and recent approaches designed to handle challenges such as distribution shifts and data imbalance. This platform supports flexible configuration of key experimental factors, ranging from algorithm and dataset choices to task-specific factors like out-of-distribution (OOD) sample ratio, and class imbalance ratio, enabling comprehensive and realistic evaluation. We conduct extensive experiments on this platform under various settings. Our findings show that: (1) DAL algorithms’ performance varies significantly across domains and task settings; (2) in non-standard scenarios such as imbalanced and open-set settings, DAL algorithms show room for improvement and require further investigation; and (3) some algorithms achieve good performance, but require significantly longer selection time.
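To make "selecting the most informative unlabeled samples" concrete, here is a minimal sketch of entropy-based uncertainty sampling, a classical active learning baseline. Whether this exact strategy is among the 21 algorithms in ALScope is not stated in the abstract, and the function names and toy probabilities are illustrative only.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_batch(pool_probs, budget):
    """Return indices of the `budget` unlabeled samples with the highest predictive entropy."""
    ranked = sorted(range(len(pool_probs)), key=lambda i: entropy(pool_probs[i]), reverse=True)
    return ranked[:budget]


# Model predictions on four unlabeled samples (three classes).
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # nearly uniform, highest entropy
    [0.60, 0.30, 0.10],
    [0.50, 0.50, 0.00],
]
chosen = select_batch(pool_probs, budget=2)
```

The selection step is also where the third finding of the abstract applies: strategies that score every pool sample with a more expensive criterion can take much longer than this single pass.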
[CV-104] Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens
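[Quick Read]: This paper addresses the erroneous depth estimates that foundational monocular depth estimators (FMDEs) produce on fisheye images because of the covariate shift introduced by changes in camera calibration parameters. Existing approaches typically retrain or finetune the model, or adapt to the new camera by recalibrating or projecting images to a canonical reference frame in image space, which tends to introduce artifacts and information loss. The key to the solution is a set of light-weight Calibration Tokens that modulate the latent embeddings of fisheye images so that their distribution aligns with that of perspective images, allowing trained FMDEs to be reused without retraining. The method is self-supervised and relies only on publicly available large-scale perspective image datasets: perspective images are recalibrated to fisheye images, and consistency between the two depth estimates is enforced during training, which yields generalization across camera types.
Link: https://arxiv.org/abs/2508.04928
Authors: Suchisrit Gangopadhyay, Jung-Hee Kim, Xien Chen, Patrick Rim, Hyoungseob Park, Alex Wong
Affiliations: Yale University; Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: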
Abstract:We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, on both indoors and outdoors, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: this https URL.
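The step of recalibrating perspective images to fisheye images can be illustrated with a point mapping between two standard camera models. The sketch assumes a pinhole model (r = f·tan θ) for the perspective image and an equidistant fisheye model (r = f·θ); the abstract does not say which fisheye model the authors use, and the function name and the focal length of 500 pixels are hypothetical.

```python
import math


def perspective_to_fisheye(x, y, f):
    """Map a pinhole-image point, given relative to the principal point,
    to an equidistant fisheye image with the same focal length f (in pixels)."""
    r_p = math.hypot(x, y)  # radial distance in the perspective image
    if r_p == 0.0:
        return (0.0, 0.0)  # the principal point maps to itself
    theta = math.atan2(r_p, f)  # angle between the ray and the optical axis
    r_f = f * theta  # equidistant projection: radius grows linearly with angle
    scale = r_f / r_p
    return (x * scale, y * scale)


center = perspective_to_fisheye(0.0, 0.0, 500.0)
edge = perspective_to_fisheye(500.0, 0.0, 500.0)  # a ray 45 degrees off-axis
```

Points far from the center are pulled inward, which is the kind of distortion a model trained only on perspective images has never seen.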
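[CV-105] Test-Time Adaptation for Video Highlight Detection Using Meta-Auxiliary Learning and Cross-Modality Hallucinations
[Quick Read]: This paper addresses the limited test-time generalization of existing video highlight detection methods: a fixed model cannot adapt to the diverse content, styles, or audio and visual quality of new videos, which degrades detection performance. The key to the solution is the Highlight-TTA framework, which dynamically adapts the model at test time to the characteristics of each test video. It is jointly optimized with an auxiliary task, cross-modality hallucinations, and uses a meta-auxiliary training scheme so that the primary and auxiliary tasks work together effectively; during testing, the auxiliary task guides the adaptation, improving generalization and detection performance.
Link: https://arxiv.org/abs/2508.04924
Authors: Zahidul Islam, Sujoy Paul, Mrigank Rochan
Affiliations: University of Saskatchewan; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: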
Abstract:Existing video highlight detection methods, although advanced, struggle to generalize well to all test videos. These methods typically employ a generic highlight detection model for each test video, which is suboptimal as it fails to account for the unique characteristics and variations of individual test videos. Such fixed models do not adapt to the diverse content, styles, or audio and visual qualities present in new, unseen test videos, leading to reduced highlight detection performance. In this paper, we propose Highlight-TTA, a test-time adaptation framework for video highlight detection that addresses this limitation by dynamically adapting the model during testing to better align with the specific characteristics of each test video, thereby improving generalization and highlight detection performance. Highlight-TTA is jointly optimized with an auxiliary task, cross-modality hallucinations, alongside the primary highlight detection task. We utilize a meta-auxiliary training scheme to enable effective adaptation through the auxiliary task while enhancing the primary task. During testing, we adapt the trained model using the auxiliary task on the test video to further enhance its highlight detection performance. Extensive experiments with three state-of-the-art highlight detection models and three benchmark datasets show that the introduction of Highlight-TTA to these models improves their performance, yielding superior results.
[CV-106] Revealing Temporal Label Noise in Multimodal Hateful Video Classification
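[Quick Read]: This paper addresses the label noise that video-level annotations introduce into multimodal hate speech detection: many videos annotated as hateful contain long non-hateful segments, which weakens a model's ability to recognize hateful content at a fine granularity. The key to the solution is timestamp-based trimming, which extracts explicitly hateful segments from the HateMM and MultiHateClip datasets, followed by a fine-grained analysis that exposes the semantic overlap between hateful and non-hateful content and the resulting annotation ambiguity. Controlled experiments further show that timestamp noise alters model decision boundaries and weakens classification confidence, underscoring the context dependency and temporal continuity of hate speech expression. The findings call for temporally aware models and benchmarks to improve the robustness and interpretability of detection systems.
Link: https://arxiv.org/abs/2508.04900
Authors: Shuonan Yang, Tailin Chen, Rahul Singh, Jiangbei Yue, Jianbo Jiao, Zeyu Fu
Affiliations: University of Exeter; University of Birmingham
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: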
Abstract:The rapid proliferation of online multimedia content has intensified the spread of hate speech, presenting critical societal and regulatory challenges. While recent work has advanced multimodal hateful video detection, most approaches rely on coarse, video-level annotations that overlook the temporal granularity of hateful content. This introduces substantial label noise, as videos annotated as hateful often contain long non-hateful segments. In this paper, we investigate the impact of such label ambiguity through a fine-grained approach. Specifically, we trim hateful videos from the HateMM and MultiHateClip English datasets using annotated timestamps to isolate explicitly hateful segments. We then conduct an exploratory analysis of these trimmed segments to examine the distribution and characteristics of both hateful and non-hateful content. This analysis highlights the degree of semantic overlap and the confusion introduced by coarse, video-level annotations. Finally, controlled experiments demonstrated that time-stamp noise fundamentally alters model decision boundaries and weakens classification confidence, highlighting the inherent context dependency and temporal continuity of hate speech expression. Our findings provide new insights into the temporal dynamics of multimodal hateful videos and highlight the need for temporally aware models and benchmarks for improved robustness and interpretability. Code and data are available at this https URL.
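The size of the label noise described here can be estimated per video from the annotated timestamps. The sketch below assumes annotations arrive as (start, end) pairs in seconds that may overlap; the function names and the example values are made up for illustration and are not taken from HateMM or MultiHateClip.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals so no second is counted twice."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def non_hateful_fraction(duration, hateful_intervals):
    """Share of a video labeled 'hateful' at video level that is not hateful by timestamp."""
    hateful = sum(end - start for start, end in merge_intervals(hateful_intervals))
    return 1.0 - hateful / duration


# A 120-second video with three annotated hateful spans, two of which overlap.
fraction = non_hateful_fraction(120.0, [(10.0, 25.0), (20.0, 40.0), (100.0, 110.0)])
```

In this toy case two thirds of the video carries a "hateful" label without containing hateful content, which is the mismatch that trimming removes.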
[CV-107] Dual-Stream Attention with Multi-Modal Queries for Object Detection in Transportation Applications
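[Quick Read]: This paper addresses the difficulties that Transformer-based detectors have with occlusions, fine-grained localization, and computational inefficiency, which are typically caused by fixed queries and dense attention. The key to the solution is the DAMM (Dual-stream Attention with Multi-Modal queries) framework. Its core innovations are query adaptation through three types of queries (appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries for general scene coverage) and a dual-stream cross-attention module that refines semantic and spatial features separately, improving localization precision in complex scenes and overall efficiency.
Link: https://arxiv.org/abs/2508.04868
Authors: Noreen Anwar, Guillaume-Alexandre Bilodeau, Wassim Bouachir
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages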
Abstract:Transformer-based object detectors often struggle with occlusions, fine-grained localization, and computational inefficiency caused by fixed queries and dense attention. We propose DAMM, Dual-stream Attention with Multi-Modal queries, a novel framework introducing both query adaptation and structured cross-attention for improved accuracy and efficiency. DAMM capitalizes on three types of queries: appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries for general scene coverage. Furthermore, a dual-stream cross-attention module separately refines semantic and spatial features, boosting localization precision in cluttered scenes. We evaluated DAMM on four challenging benchmarks, and it achieved state-of-the-art performance in average precision (AP) and recall, demonstrating the effectiveness of multi-modal query adaptation and dual-stream attention. Source code is at: this https URL.
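[CV-108] VER-Bench: Evaluating MLLMs on Reasoning with Fine-Grained Visual Evidence
[Quick Read]: This paper addresses a key gap in how the visual understanding of Multimodal Large Language Models (MLLMs) is evaluated: existing benchmarks are either limited to simple perception of local details (e.g., "what is in the image?") or focus on salient, macro-level objects, so they cannot effectively assess a model's ability to identify and integrate subtle, inconspicuous visual clues. The authors argue that genuinely complex visual reasoning depends more on capturing and using the information hidden in tiny regions (on average only 0.25% of the image area). To this end they propose VER-Bench, whose core contribution is an evaluation suite of 374 carefully designed questions spanning Geospatial, Temporal, Situational, Intent, System State, and Symbolic reasoning, each accompanied by structured evidence: fine-grained visual clues and the reasoning derived from combining them with world knowledge. The benchmark systematically exposes current models' limitations in extracting subtle visual evidence and performing evidence-based reasoning, pushing MLLMs toward deeper, more human-like visual understanding.
Link: https://arxiv.org/abs/2508.04852
Authors: Chenhui Qiang, Zhaoyang Wei, Xumeng Han, Zipeng Wang, Siyao Li, Xiangyuan Lan, Jianbin Jiao, Zhenjun Han
Affiliations: University of Chinese Academy of Sciences; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACMM2025 Dataset track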
Abstract:With the rapid development of MLLMs, evaluating their visual capabilities has become increasingly crucial. Current benchmarks primarily fall into two main types: basic perception benchmarks, which focus on local details but lack deep reasoning (e.g., “what is in the image?”), and mainstream reasoning benchmarks, which concentrate on prominent image elements but may fail to assess subtle clues requiring intricate analysis. However, profound visual understanding and complex reasoning depend more on interpreting subtle, inconspicuous local details than on perceiving salient, macro-level objects. These details, though occupying minimal image area, often contain richer, more critical information for robust analysis. To bridge this gap, we introduce the VER-Bench, a novel framework to evaluate MLLMs’ ability to: 1) identify fine-grained visual clues, often occupying on average just 0.25% of the image area; 2) integrate these clues with world knowledge for complex reasoning. Comprising 374 carefully designed questions across Geospatial, Temporal, Situational, Intent, System State, and Symbolic reasoning, each question in VER-Bench is accompanied by structured evidence: visual clues and question-related reasoning derived from them. VER-Bench reveals current models’ limitations in extracting subtle visual evidence and constructing evidence-based arguments, highlighting the need to enhance models’ capabilities in fine-grained visual evidence extraction, integration, and reasoning for genuine visual understanding and human-like analysis. Dataset and additional materials are available at this https URL.
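To give a sense of scale for "0.25% of the image area", the short calculation below converts that share into pixels for a 1920x1080 image. The resolution is an arbitrary example, not a property of the VER-Bench images, and the function name is hypothetical.

```python
import math


def clue_size(width, height, area_permyriad=25):
    """Pixel count and square side length for a region covering
    area_permyriad / 10000 of the image (25 corresponds to 0.25%)."""
    clue_pixels = width * height * area_permyriad // 10000
    return clue_pixels, math.isqrt(clue_pixels)  # isqrt returns the floored integer square root


pixels, side = clue_size(1920, 1080)
```

For this resolution the clue covers 5184 pixels, roughly a 72x72 patch inside a frame of about two million pixels.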
[CV-109] LuKAN: A Kolmogorov-Arnold Network Framework for 3D Human Motion Prediction
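[Quick Read]: This paper addresses the difficulty of balancing prediction accuracy and computational efficiency in 3D human motion prediction. The key to the solution is LuKAN, a model based on Kolmogorov-Arnold Networks (KANs) with Lucas polynomial activations. It encodes temporal information with the discrete wavelet transform (DWT), captures inter-joint dependencies with a spatial projection layer, and builds its Temporal Dependency Learner from a KAN layer parameterized by Lucas polynomials, whose linear recurrence keeps computation cheap. The result is efficient function approximation and temporally coherent predictions that preserve the structural consistency of the human body.
Link: https://arxiv.org/abs/2508.04847
Authors: Md Zahidul Hasan, A. Ben Hamza, Nizar Bouguila
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: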
Abstract:The goal of 3D human motion prediction is to forecast future 3D poses of the human body based on historical motion data. Existing methods often face limitations in achieving a balance between prediction accuracy and computational efficiency. In this paper, we present LuKAN, an effective model based on Kolmogorov-Arnold Networks (KANs) with Lucas polynomial activations. Our model first applies the discrete wavelet transform to encode temporal information in the input motion sequence. Then, a spatial projection layer is used to capture inter-joint dependencies, ensuring structural consistency of the human body. At the core of LuKAN is the Temporal Dependency Learner, which employs a KAN layer parameterized by Lucas polynomials for efficient function approximation. These polynomials provide computational efficiency and an enhanced capability to handle oscillatory behaviors. Finally, the inverse discrete wavelet transform reconstructs motion sequences in the time domain, generating temporally coherent predictions. Extensive experiments on three benchmark datasets demonstrate the competitive performance of our model compared to strong baselines, as evidenced by both quantitative and qualitative evaluations. Moreover, its compact architecture coupled with the linear recurrence of Lucas polynomials, ensures computational efficiency.
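The linear recurrence mentioned at the end of the abstract is L_0(x) = 2, L_1(x) = x, L_n(x) = x·L_{n-1}(x) + L_{n-2}(x). The sketch below evaluates the basis with that recurrence and forms one learnable univariate function as a weighted sum of basis values, which is the general KAN idea; how LuKAN actually parameterizes and normalizes its layer is not given in the abstract, so `kan_edge` and its coefficients are assumptions.

```python
def lucas_polynomials(x, degree):
    """Return [L_0(x), ..., L_degree(x)] using L_n = x * L_{n-1} + L_{n-2}."""
    values = [2.0, x]
    for _ in range(2, degree + 1):
        values.append(x * values[-1] + values[-2])  # one multiply-add per extra degree
    return values[: degree + 1]


def kan_edge(x, coeffs):
    """A single learnable 1-D function: weighted sum of Lucas basis values (illustrative)."""
    basis = lucas_polynomials(x, len(coeffs) - 1)
    return sum(c * b for c, b in zip(coeffs, basis))


lucas_numbers = lucas_polynomials(1.0, 5)  # at x = 1 the polynomials give the Lucas numbers
y = kan_edge(2.0, [1.0, 0.5, 0.25])  # 1.0*L_0(2) + 0.5*L_1(2) + 0.25*L_2(2)
```

Each additional degree costs one multiplication and one addition, which is the source of the efficiency the abstract attributes to the recurrence.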
[CV-110] A deep learning approach to track eye movements based on events
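[Quick Read]: This paper addresses how to track the eye center position (x, y) accurately and at low cost during rapid eye movements, in particular without the expensive high-speed cameras that traditional methods depend on. The core of the solution is to analyze eye movements with deep learning models applied to the sparse, high-temporal-resolution data from an event camera. Among the models explored, CNN_LSTM performed best, achieving approximately 81% accuracy and laying the groundwork for interpretable, cost-effective attention prediction algorithms.
Link: https://arxiv.org/abs/2508.04827
Authors: Chirag Seth, Divya Naiken, Keyan Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: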
Abstract:This research project addresses the challenge of accurately tracking eye movements during specific events by leveraging previous research. Given the rapid movements of human eyes, which can reach speeds of 300°/s, precise eye tracking typically requires expensive and high-speed cameras. Our primary objective is to locate the eye center position (x, y) using inputs from an event camera. Eye movement analysis has extensive applications in consumer electronics, especially in VR and AR product development. Therefore, our ultimate goal is to develop an interpretable and cost-effective algorithm using deep learning methods to predict human attention, thereby improving device comfort and enhancing overall user experience. To achieve this goal, we explored various approaches, with the CNN_LSTM model proving most effective, achieving approximately 81% accuracy. Additionally, we propose future work focusing on Layer-wise Relevance Propagation (LRP) to further enhance the model’s interpretability and predictive performance.
[CV-111] Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
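[Quick Read]: This paper addresses the difficulty of modeling garment-body correspondence in virtual try-on, especially accurate matching under pose and appearance variation. The core of the solution is Voost, a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. Joint learning provides bidirectional supervision and flexible conditioning, strengthening garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. Two inference-time techniques complement the model: self-corrective sampling, which leverages bidirectional consistency between the tasks, and attention temperature scaling, which improves robustness to resolution or mask variation. Together they improve alignment accuracy, visual fidelity, and generalization.
Link: https://arxiv.org/abs/2508.04825
Authors: Seungyong Lee, Jeong-gi Kwak
Affiliations: NXN Labs
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL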
Abstract:Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost - a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
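The abstract names attention temperature scaling without giving a formula. The sketch below assumes the common form, a softmax over scores divided by a temperature, to show how the temperature sharpens or flattens attention; how Voost chooses the temperature from the resolution or mask size is not stated, and the function name is hypothetical.

```python
import math


def attention_weights(scores, temperature=1.0):
    """Softmax over attention scores divided by a temperature (assumed form).
    Temperatures below 1 sharpen the distribution; above 1 they flatten it."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.0]
baseline = attention_weights(scores)
sharp = attention_weights(scores, temperature=0.5)
soft = attention_weights(scores, temperature=2.0)
```

When the number of tokens changes at inference time, for example at a higher resolution, adjusting the temperature is one way to keep attention from becoming too diffuse.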
[CV-112] Single-Step Reconstruction-Free Anomaly Detection and Segmentation via Diffusion Models
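[Quick Read]: This paper addresses three challenges of reconstruction-based diffusion models for anomaly detection: (1) the reconstruction process is computationally expensive, which makes real-time applications impractical; (2) for complex or subtle anomalous patterns, the reconstructed image may correspond to a different normal image rather than the original input; and (3) choosing the intermediate noise level relies on prior knowledge of anomalies, an assumption that does not hold in unsupervised settings. The key to the solution is a reconstruction-free method, Reconstruction-free Anomaly Detection with Attention-based diffusion models in Real-time (RADAR), which produces anomaly maps directly from the diffusion model, improving detection accuracy while substantially reducing computational cost.
Link: https://arxiv.org/abs/2508.04818
Authors: Mehrdad Moradi, Marco Grasso, Bianca Maria Colosimo, Kamran Paynabar
Affiliations: Georgia Institute of Technology; Politecnico di Milano
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
Comments: 9 pages, 8 figures, 2 tables. Submitted to an IEEE conference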
Abstract:Generative models have demonstrated significant success in anomaly detection and segmentation over the past decade. Recently, diffusion models have emerged as a powerful alternative, outperforming previous approaches such as GANs and VAEs. In typical diffusion-based anomaly detection, a model is trained on normal data, and during inference, anomalous images are perturbed to a predefined intermediate step in the forward diffusion process. The corresponding normal image is then reconstructed through iterative reverse sampling. However, reconstruction-based approaches present three major challenges: (1) the reconstruction process is computationally expensive due to multiple sampling steps, making real-time applications impractical; (2) for complex or subtle patterns, the reconstructed image may correspond to a different normal pattern rather than the original input; and (3) Choosing an appropriate intermediate noise level is challenging because it is application-dependent and often assumes prior knowledge of anomalies, an assumption that does not hold in unsupervised settings. We introduce Reconstruction-free Anomaly Detection with Attention-based diffusion models in Real-time (RADAR), which overcomes the limitations of reconstruction-based anomaly detection. Unlike current SOTA methods that reconstruct the input image, RADAR directly produces anomaly maps from the diffusion model, improving both detection accuracy and computational efficiency. We evaluate RADAR on real-world 3D-printed material and the MVTec-AD dataset. Our approach surpasses state-of-the-art diffusion-based and statistical machine learning models across all key metrics, including accuracy, precision, recall, and F1 score. Specifically, RADAR improves F1 score by 7% on MVTec-AD and 13% on the 3D-printed material dataset compared to the next best model. Code available at: this https URL
[CV-113] CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework
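Quick read: This paper addresses the fact that self-supervised learning (SSL) models are usually pretrained in isolation, so complementary knowledge across models goes unused and the resulting models are too large for resource-constrained deployment. It proposes Consensus-oriented Masked Distillation (CoMAD), a lightweight framework that adds no extra parameters. With asymmetric masking, the student network sees only 25% of the image patches while each teacher receives a progressively lighter, unique mask, which forces the student to interpolate missing features under richer context. Teacher embeddings are aligned to the student space through a linear adapter and layer normalization, and a joint consensus gating mechanism weights each token by combining cosine similarity with inter-teacher agreement. Training uses a dual-level KL divergence on visible tokens and reconstructed feature maps, giving efficient knowledge transfer into a compact model.
Link: https://arxiv.org/abs/2508.04816
Authors: Sriram Mandalika, Lalitha V
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures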
Abstract:Numerous self-supervised learning paradigms, such as contrastive learning and masked image modeling, learn powerful representations from unlabeled data but are typically pretrained in isolation, overlooking complementary insights and yielding large models that are impractical for resource-constrained deployment. To overcome these challenges, we introduce Consensus-oriented Masked Distillation (CoMAD), a lightweight, parameter-free framework that unifies knowledge from multiple current state-of-the-art self-supervised Vision Transformers into a compact student network. CoMAD distills from three pretrained ViT-Base teachers, MAE, MoCo v3, and iBOT, each offering distinct semantic and contextual priors. Rather than naively averaging teacher outputs, we apply asymmetric masking: the student sees only 25 percent of patches while each teacher receives a progressively lighter, unique mask, forcing the student to interpolate missing features under richer contexts. Teacher embeddings are aligned to the student’s space via a linear adapter and layer normalization, then fused through our joint consensus gating, which weights each token by combining cosine affinity with inter-teacher agreement. The student is trained with dual-level KL divergence on visible tokens and reconstructed feature maps, capturing both local and global structure. On ImageNet-1K, CoMAD’s ViT-Tiny achieves 75.4 percent Top-1, an increment of 0.4 percent over the previous state-of-the-art. In dense-prediction transfers, it attains 47.3 percent mIoU on ADE20K, and 44.5 percent box average precision and 40.5 percent mask average precision on MS-COCO, establishing a new state-of-the-art in compact SSL distillation.
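The abstract describes consensus gating only in words, so the following is a minimal sketch of one way such a gate could work for a single token, not the authors' implementation. The names `consensus_fuse`, `cosine`, and `softmax`, the additive score (affinity plus agreement), and the softmax over teachers are all assumptions made for illustration; the teacher embeddings are assumed to be already aligned to the student space.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + eps)

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def consensus_fuse(student_tok, teacher_toks):
    """Fuse one token from T >= 2 aligned teachers (hypothetical gating rule).

    student_tok:  list of D floats (student embedding of the token)
    teacher_toks: list of T lists of D floats (aligned teacher embeddings)
    Returns (fused_embedding, per_teacher_weights).
    """
    n_teachers = len(teacher_toks)
    scores = []
    for i, t in enumerate(teacher_toks):
        # Affinity: how close this teacher is to the student.
        affinity = cosine(student_tok, t)
        # Agreement: mean similarity to the other teachers.
        agreement = sum(
            cosine(t, u) for j, u in enumerate(teacher_toks) if j != i
        ) / (n_teachers - 1)
        scores.append(affinity + agreement)  # assumed combination rule
    weights = softmax(scores)
    dim = len(student_tok)
    fused = [
        sum(w * t[d] for w, t in zip(weights, teacher_toks)) for d in range(dim)
    ]
    return fused, weights

# Toy example: the third teacher disagrees with everyone and is down-weighted.
fused, weights = consensus_fuse(
    [1.0, 0.0], [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
)
print(weights)
```

Under this rule a teacher that contradicts both the student and its peers receives the smallest weight, which is the qualitative behavior the abstract attributes to the gate.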
[CV-114] ACM Multimedia Grand Challenge on ENT Endoscopy Analysis
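Quick read: This paper targets the challenges of automated analysis of ENT (ear, nose, and throat) endoscopic images in clinical practice: variability across devices and operators, subtle and localized findings, and fine-grained distinctions such as laterality and vocal-fold state. Existing public benchmarks rarely support these needs, in particular reliable retrieval of visually similar cases and cross-modal matching from concise text descriptions. The key contribution is the ENTRep dataset, which contains expert-annotated images labeled by anatomical region and normal or abnormal status, each with bilingual (Vietnamese and English) clinical narrative descriptions. The challenge defines three standardized benchmark tasks (fine-grained classification, image-to-image retrieval, and text-to-image retrieval) and uses server-side scoring to keep evaluation consistent, with the aim of advancing end-to-end endoscopy analysis for multilingual clinical settings.
Link: https://arxiv.org/abs/2508.04801
Authors: Trong-Thuan Nguyen, Viet-Tham Huynh, Thao Thi Phuong Dao, Ha Nguyen Thi, Tien To Vu Thuy, Uyen Hanh Tran, Tam V. Nguyen, Thanh Dinh Le, Minh-Triet Tran
Affiliations: University of Science, VNU-HCM; Vietnam National University; Thong Nhat Hospital; Faculty of Medicine, Pham Ngoc Thach University of Medicine; Cho Ray Hospital; University of Dayton
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: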
Abstract:Automated analysis of endoscopic imagery is a critical yet underdeveloped component of ENT (ear, nose, and throat) care, hindered by variability in devices and operators, subtle and localized findings, and fine-grained distinctions such as laterality and vocal-fold state. In addition to classification, clinicians require reliable retrieval of similar cases, both visually and through concise textual descriptions. These capabilities are rarely supported by existing public benchmarks. To this end, we introduce ENTRep, the ACM Multimedia 2025 Grand Challenge on ENT endoscopy analysis, which integrates fine-grained anatomical classification with image-to-image and text-to-image retrieval under bilingual (Vietnamese and English) clinical supervision. Specifically, the dataset comprises expert-annotated images, labeled for anatomical region and normal or abnormal status, and accompanied by dual-language narrative descriptions. In addition, we define three benchmark tasks, standardize the submission protocol, and evaluate performance on public and private test splits using server-side scoring. Moreover, we report results from the top-performing teams and provide an insight discussion.
[CV-115] RetinexDual: Retinex-based Dual Nature Approach for Generalized Ultra-High-Definition Image Restoration
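Quick read: This paper addresses the limitations of traditional methods for Ultra-High-Definition Image Restoration (UHD IR): extreme downsampling causes irreversible information loss, while pure frequency-domain methods lose degradation locality and therefore handle spatially confined artifacts poorly. The proposed solution is RetinexDual, a two-branch framework based on Retinex theory built from two complementary sub-networks. The Scale-Attentive maMBA (SAMBA) restores the reflectance component in a coarse-to-fine manner to reduce artifacts and recover detail, and the Frequency Illumination Adaptor (FIA) works in the frequency domain, using its global context to correct color and illumination distortions. Giving each branch a distinct, specialized design improves restoration quality across UHD IR tasks.
Link: https://arxiv.org/abs/2508.04797
Authors: Mohab Kishawy, Ali Abdellatif Hussein, Jun Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: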
Abstract:Advancements in image sensing have elevated the importance of Ultra-High-Definition Image Restoration (UHD IR). Traditional methods, such as extreme downsampling or transformation from the spatial to the frequency domain, encounter significant drawbacks: downsampling induces irreversible information loss in UHD images, while our frequency analysis reveals that pure frequency-domain approaches are ineffective for spatially confined image artifacts, primarily due to the loss of degradation locality. To overcome these limitations, we present RetinexDual, a novel Retinex theory-based framework designed for generalized UHD IR tasks. RetinexDual leverages two complementary sub-networks: the Scale-Attentive maMBA (SAMBA) and the Frequency Illumination Adaptor (FIA). SAMBA, responsible for correcting the reflectance component, utilizes a coarse-to-fine mechanism to overcome the causal modeling of mamba, which effectively reduces artifacts and restores intricate details. On the other hand, FIA ensures precise correction of color and illumination distortions by operating in the frequency domain and leveraging the global context provided by it. Evaluating RetinexDual on four UHD IR tasks, namely deraining, deblurring, dehazing, and Low-Light Image Enhancement (LLIE), shows that it outperforms recent methods qualitatively and quantitatively. Ablation studies demonstrate the importance of employing distinct designs for each branch in RetinexDual, as well as the effectiveness of its various components.
[CV-116] Artificial Intelligence-Based Classification of Spitz Tumors
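Quick read: This paper addresses the diagnostic difficulty of Spitz tumors, which overlap morphologically with conventional melanomas, and asks how well their genetic aberration and diagnostic category can be predicted. The authors develop and validate AI models that use histological and/or clinical features. The best model, based on UNI features, distinguishes Spitz tumors from conventional melanomas with an AUROC of 0.95 and an accuracy of 0.86. It predicts the genetic aberration with an accuracy of 0.55 (chance: 0.25) and the diagnostic category with an accuracy of 0.51 (chance: 0.33), and on all three tasks it outperforms four experienced pathologists, although most individual differences are not statistically significant. A simulation experiment further suggests that AI-based recommendations for ancillary diagnostic testing could reduce material costs, turnaround times, and the number of examinations in the pathology department.
Link: https://arxiv.org/abs/2508.05391
Authors: Ruben T. Lucassen, Marjanna Romers, Chiel F. Ebbelaar, Aia N. Najem, Donal P. Hayes, Antien L. Mooyaart, Sara Roshani, Liliane C. D. Wynaendts, Nikolas Stathonikos, Gerben E. Breimer, Anne M. L. Jansen, Mitko Veta, Willeke A. M. Blokx
Affiliations: 1. Erasmus MC; 2. University Medical Center Utrecht; 3. Amsterdam University Medical Centers; 4. Leiden University Medical Center; 5. University of Oxford; 6. Radboud University Medical Center; 7. University College London; 8. KU Leuven
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 2 figures, 6 tables, 6 supplementary tables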
Abstract:Spitz tumors are diagnostically challenging due to overlap in atypical histological features with conventional melanomas. We investigated to what extent AI models, using histological and/or clinical features, can: (1) distinguish Spitz tumors from conventional melanomas; (2) predict the underlying genetic aberration of Spitz tumors; and (3) predict the diagnostic category of Spitz tumors. The AI models were developed and validated using a dataset of 393 Spitz tumors and 379 conventional melanomas. Predictive performance was measured using the AUROC and the accuracy. The performance of the AI models was compared with that of four experienced pathologists in a reader study. Moreover, a simulation experiment was conducted to investigate the impact of implementing AI-based recommendations for ancillary diagnostic testing on the workflow of the pathology department. The best AI model based on UNI features reached an AUROC of 0.95 and an accuracy of 0.86 in differentiating Spitz tumors from conventional melanomas. The genetic aberration was predicted with an accuracy of 0.55 compared to 0.25 for randomly guessing. The diagnostic category was predicted with an accuracy of 0.51, where random chance-level accuracy equaled 0.33. On all three tasks, the AI models performed better than the four pathologists, although differences were not statistically significant for most individual comparisons. Based on the simulation experiment, implementing AI-based recommendations for ancillary diagnostic testing could reduce material costs, turnaround times, and examinations. In conclusion, the AI models achieved a strong predictive performance in distinguishing between Spitz tumors and conventional melanomas. On the more challenging tasks of predicting the genetic aberration and the diagnostic category of Spitz tumors, the AI models performed better than random chance.
[CV-117] Coarse-to-Fine Joint Registration of MR and Ultrasound Images via Imaging Style Transfer
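Quick read: This paper addresses the difficulty of registering pre-surgery Magnetic Resonance (MR) images with post-resection Ultrasound (US) images, where the two modalities differ greatly and corresponding landmarks are lacking. The proposed pipeline combines unpaired style transfer with hierarchical registration: a 3D CycleGAN first generates synthetic T1-weighted MR images to bridge the domain gap between MR and US, and affine followed by local deformable transformations then perform coarse-to-fine registration. The authors report improved consistency between MR and US image pairs in most cases.
Link: https://arxiv.org/abs/2508.05240
Authors: Junyi Wang, Xi Zhu, Yikun Guo, Zixi Wang, Haichuan Gao, Le Zhang, Fan Zhang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: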
Abstract:We developed a pipeline for registering pre-surgery Magnetic Resonance (MR) images and post-resection Ultrasound (US) images. Our approach leverages unpaired style transfer using 3D CycleGAN to generate synthetic T1 images, thereby enhancing registration performance. Additionally, our registration process employs both affine and local deformable transformations for a coarse-to-fine registration. The results demonstrate that our approach improves the consistency between MR and US image pairs in most cases.
[CV-118] Beyond Pixels: Medical Image Quality Assessment with Implicit Neural Representations
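Quick read: This paper addresses the problem that artifacts in medical images harm diagnostic accuracy and downstream analysis, and in particular that conventional image-based artifact detection depends on preprocessing that loses information, while high memory demands limit the scalability of classification models. The key idea is to represent medical images with implicit neural representations (INRs), which are compact and continuous, adapt naturally to varying resolutions and image sizes, and reduce memory overhead. On top of this, the authors build deep weight space networks, graph neural networks, and relational attention transformers that assess image quality directly in INR space. Experiments on the ACDC dataset with synthetic artifact patterns show performance comparable to conventional methods with fewer parameters.
Link: https://arxiv.org/abs/2508.05168
Authors: Caner Özer, Patryk Rygiel, Bram de Wilde, İlkay Öksüz, Jelmer M. Wolterink
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in 16th Machine Learning in Medical Imaging (MLMI 2025) workshop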
Abstract:Artifacts pose a significant challenge in medical imaging, impacting diagnostic accuracy and downstream analysis. While image-based approaches for detecting artifacts can be effective, they often rely on preprocessing methods that can lead to information loss and high-memory-demand medical images, thereby limiting the scalability of classification models. In this work, we propose the use of implicit neural representations (INRs) for image quality assessment. INRs provide a compact and continuous representation of medical images, naturally handling variations in resolution and image size while reducing memory overhead. We develop deep weight space networks, graph neural networks, and relational attention transformers that operate on INRs to achieve image quality assessment. Our method is evaluated on the ACDC dataset with synthetically generated artifact patterns, demonstrating its effectiveness in assessing image quality while achieving similar performance with fewer parameters.
[CV-119] CryoGS: Gaussian Splatting for Cryo-EM Homogeneous Reconstruction
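Quick read: This paper addresses the reliance of existing single-particle cryo-EM reconstruction methods on external consensus maps or atomic models for initialization, which limits their use in self-contained pipelines. The proposed cryoGS is a method based on Gaussian mixture models (GMMs) that integrates Gaussian splatting with the physics of cryo-EM image formation. It introduces an orthogonal projection-aware Gaussian splatting, with a normalization term and an FFT-aligned coordinate system tailored to cryo-EM imaging, and achieves stable and efficient homogeneous reconstruction directly from raw particle images using random initialization.
Link: https://arxiv.org/abs/2508.04929
Authors: Suyi Chen, Haibin Ling
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: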
Abstract:As a critical modality for structural biology, cryogenic electron microscopy (cryo-EM) facilitates the determination of macromolecular structures at near-atomic resolution. The core computational task in single-particle cryo-EM is to reconstruct the 3D electrostatic potential of a molecule from a large collection of noisy 2D projections acquired at unknown orientations. Gaussian mixture models (GMMs) provide a continuous, compact, and physically interpretable representation for molecular density and have recently gained interest in cryo-EM reconstruction. However, existing methods rely on external consensus maps or atomic models for initialization, limiting their use in self-contained pipelines. Addressing this issue, we introduce cryoGS, a GMM-based method that integrates Gaussian splatting with the physics of cryo-EM image formation. In particular, we develop an orthogonal projection-aware Gaussian splatting, with adaptations such as a normalization term and FFT-aligned coordinate system tailored for cryo-EM imaging. All these innovations enable stable and efficient homogeneous reconstruction directly from raw cryo-EM particle images using random initialization. Experimental results on real datasets validate the effectiveness and robustness of cryoGS over representative baselines. The code will be released upon publication.
[CV-120] Advanced Multi-Architecture Deep Learning Framework for BIRADS-Based Mammographic Image Retrieval: Comprehensive Performance Analysis with Super-Ensemble Optimization
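Quick read: This paper addresses five-class mammographic image retrieval that requires exact matching of BIRADS (Breast Imaging Reporting and Data System) categories. Exact five-class matching is considerably harder than the binary tasks common in the literature, and existing studies suffer from inadequate sample sizes, improper data splitting, and insufficient statistical validation, all of which hinder clinical translation. The authors build a systematic evaluation framework that compares the CNN architectures DenseNet121, ResNet50, and VGG16 and adds advanced training strategies: fine-tuning with differential learning rates, metric learning, and super-ensemble optimization. Validation uses stratified data splitting (50%/20%/30%) and bootstrap confidence intervals with 1,000 samples. Super-ensemble optimization reaches 36.33% precision@10 (95% CI: [34.78%, 37.88%]), a 24.93% improvement over the baseline and above the 20-25% that the authors cite as the achievable range in the literature, which provides a performance benchmark and architecture selection guidance for diagnostic support and quality assurance applications.
Link: https://arxiv.org/abs/2508.04790
Authors: MD Shaikh Rahman, Feiroz Humayara, Syed Maudud E Rabbi, Muhammad Mahbubur Rashid
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: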
Abstract:Content-based mammographic image retrieval systems require exact BIRADS categorical matching across five distinct classes, presenting significantly greater complexity than binary classification tasks commonly addressed in literature. Current medical image retrieval studies suffer from methodological limitations including inadequate sample sizes, improper data splitting, and insufficient statistical validation that hinder clinical translation. We developed a comprehensive evaluation framework systematically comparing CNN architectures (DenseNet121, ResNet50, VGG16) with advanced training strategies including sophisticated fine-tuning, metric learning, and super-ensemble optimization. Our evaluation employed rigorous stratified data splitting (50%/20%/30% train/validation/test), 602 test queries, and systematic validation using bootstrap confidence intervals with 1,000 samples. Advanced fine-tuning with differential learning rates achieved substantial improvements: DenseNet121 (34.79% precision@10, 19.64% improvement) and ResNet50 (34.54%, 19.58% improvement). Super-ensemble optimization combining complementary architectures achieved 36.33% precision@10 (95% CI: [34.78%, 37.88%]), representing 24.93% improvement over baseline and providing 3.6 relevant cases per query. Statistical analysis revealed significant performance differences between optimization strategies (p < 0.001) with large effect sizes (Cohen’s d > 0.8), while maintaining practical search efficiency (2.8 milliseconds). Performance significantly exceeds realistic expectations for 5-class medical retrieval tasks, where literature suggests 20-25% precision@10 represents achievable performance for exact BIRADS matching. Our framework establishes new performance benchmarks while providing evidence-based architecture selection guidelines for clinical deployment in diagnostic support and quality assurance applications.
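For readers unfamiliar with the metrics quoted in this abstract, here is a minimal sketch of precision@10 and a percentile bootstrap confidence interval. It is not the authors' code: the function names, the choice of the percentile method, and the toy labels are assumptions made for illustration.

```python
import random

def precision_at_k(query_label, retrieved_labels, k=10):
    """Fraction of the top-k retrieved items whose label equals the query label."""
    top = retrieved_labels[:k]
    return sum(1 for lab in top if lab == query_label) / k

def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-query scores."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample the queries with replacement and record the mean score.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]

# Toy query with BIRADS label 4; four of the ten retrieved items match.
p = precision_at_k(4, [4, 4, 1, 2, 4, 3, 4, 0, 1, 2])
print(p)

# Toy per-query scores (made up) and their bootstrap interval.
scores = [0.1, 0.4, 0.3, 0.5, 0.2, 0.6, 0.4, 0.3]
print(bootstrap_ci(scores))
```

The "3.6 relevant cases per query" in the abstract follows from the same definition: a precision@10 of 36.33% corresponds to 0.3633 × 10 ≈ 3.6 matching items among the top ten.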
Artificial Intelligence
[AI-0] KuaiLive: A Real-time Interactive Dataset for Live Streaming Recommendation
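Quick read: This paper addresses a bottleneck in live streaming recommendation research: the lack of a real, dynamic, and interactive dataset. Without one, it is hard to model the real-time interactions between users and streamers (click, comment, like, gift) and the dynamically changing content of live rooms, which limits how well recommendation algorithms work in practice. The solution is to build and publicly release KuaiLive, described as the first real-time interactive dataset from Kuaishou, a leading live streaming platform in China. It covers fine-grained behavior logs of 23,772 users and 452,621 streamers over 21 days, with precise live room start and end timestamps, multiple types of real-time interaction signals, and rich side information for both users and streamers. The dataset supports more realistic simulation of dynamic candidate items and modeling of user and streamer behavior, provides a benchmark for tasks such as top-K recommendation, click-through rate prediction, and watch time prediction, and enables research on multi-behavior modeling, multi-task learning, and fairness-aware recommendation.
Link: https://arxiv.org/abs/2508.05633
Authors: Changle Qu, Sunhao Dai, Ke Guo, Liqin Zhao, Yanan Niu, Xiao Zhang, Jun Xu
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: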
Abstract:Live streaming platforms have become a dominant form of online content consumption, offering dynamically evolving content, real-time interactions, and highly engaging user experiences. These unique characteristics introduce new challenges that differentiate live streaming recommendation from traditional recommendation settings and have garnered increasing attention from industry in recent years. However, research progress in academia has been hindered by the lack of publicly available datasets that accurately reflect the dynamic nature of live streaming environments. To address this gap, we introduce KuaiLive, the first real-time, interactive dataset collected from Kuaishou, a leading live streaming platform in China with over 400 million daily active users. The dataset records the interaction logs of 23,772 users and 452,621 streamers over a 21-day period. Compared to existing datasets, KuaiLive offers several advantages: it includes precise live room start and end timestamps, multiple types of real-time user interactions (click, comment, like, gift), and rich side information features for both users and streamers. These features enable more realistic simulation of dynamic candidate items and better modeling of user and streamer behaviors. We conduct a thorough analysis of KuaiLive from multiple perspectives and evaluate several representative recommendation methods on it, establishing a strong benchmark for future research. KuaiLive can support a wide range of tasks in the live streaming domain, such as top-K recommendation, click-through rate prediction, watch time prediction, and gift price prediction. Moreover, its fine-grained behavioral data also enables research on multi-behavior modeling, multi-task learning, and fairness-aware recommendation. The dataset and related resources are publicly available at this https URL.
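[AI-1] Simulating Human-Like Learning Dynamics with LLM-Empowered Agents
Quick read: This paper addresses the limitations of existing work in capturing the dynamics of human learning behavior, in particular the difficulty of tracking long-term progress, the lack of explainability, and the inability to simulate individual differences in a realistic teaching environment. It proposes LearnerAgent, a multi-agent framework based on Large Language Models (LLMs). The framework constructs learners with psychologically grounded profiles (Deep, Surface, and Lazy) plus a persona-free General Learner, and combines weekly knowledge acquisition, monthly strategic choices, periodic tests, and peer interaction to track each learner's cognitive development over a full year. The study finds that the default behavior of the base LLM is that of a "diligent but brittle Surface Learner", which offers a new perspective on the cognitive limitations of LLMs.
Link: https://arxiv.org/abs/2508.05622
Authors: Yu Yuan, Lili Zhao, Wei Chen, Guangting Zheng, Kai Zhang, Mengdi Zhang, Qi Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: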
Abstract:Capturing human learning behavior based on deep learning methods has become a major research focus in both psychology and intelligent systems. Recent approaches rely on controlled experiments or rule-based models to explore cognitive processes. However, they struggle to capture learning dynamics, track progress over time, or provide explainability. To address these challenges, we introduce LearnerAgent, a novel multi-agent framework based on Large Language Models (LLMs) to simulate a realistic teaching environment. To explore human-like learning dynamics, we construct learners with psychologically grounded profiles (such as Deep, Surface, and Lazy) as well as a persona-free General Learner to inspect the base LLM’s default behavior. Through weekly knowledge acquisition, monthly strategic choices, periodic tests, and peer interaction, we can track the dynamic learning progress of individual learners over a full-year journey. Our findings are fourfold: 1) Longitudinal analysis reveals that only Deep Learner achieves sustained cognitive growth. Our specially designed “trap questions” effectively diagnose Surface Learner’s shallow knowledge. 2) The behavioral and cognitive patterns of distinct learners align closely with their psychological profiles. 3) Learners’ self-concept scores evolve realistically, with the General Learner developing surprisingly high self-efficacy despite its cognitive limitations. 4) Critically, the default profile of base LLM is a “diligent but brittle Surface Learner”: an agent that mimics the behaviors of a good student but lacks true, generalizable understanding. Extensive simulation experiments demonstrate that LearnerAgent aligns well with real scenarios, yielding more insightful findings about LLMs’ behavior.
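[AI-2] The Missing Reward: Active Inference in the Era of Experience
Quick read: This paper addresses what it calls the "grounded-agency gap" on the road to genuinely autonomous intelligence: current AI systems cannot autonomously formulate, adapt, and pursue objectives in changing environments. The core obstacle is the dependence on hand-engineered reward functions, which limits scalability and keeps systems from learning effectively from experience. The proposed answer is Active Inference (AIF), which replaces external reward signals with an intrinsic drive to minimize free energy within a unified Bayesian objective, so that agents balance exploration and exploitation naturally. Combining this with Large Language Models as generative world models is argued to yield autonomous agents that learn efficiently from experience while remaining aligned with human values.
Link: https://arxiv.org/abs/2508.05619
Authors: Bo Wen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO); Biological Physics (physics.bio-ph); Computational Physics (physics.comp-ph); History and Philosophy of Physics (physics.hist-ph)
Comments: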
Abstract:This paper argues that Active Inference (AIF) provides a crucial foundation for developing autonomous AI agents capable of learning from experience without continuous human reward engineering. As AI systems begin to exhaust high-quality training data and rely on increasingly large human workforces for reward design, the current paradigm faces significant scalability challenges that could impede progress toward genuinely autonomous intelligence. The proposal for an “Era of Experience,” where agents learn from self-generated data, is a promising step forward. However, this vision still depends on extensive human engineering of reward functions, effectively shifting the bottleneck from data curation to reward curation. This highlights what we identify as the grounded-agency gap: the inability of contemporary AI systems to autonomously formulate, adapt, and pursue objectives in response to changing circumstances. We propose that AIF can bridge this gap by replacing external reward signals with an intrinsic drive to minimize free energy, allowing agents to naturally balance exploration and exploitation through a unified Bayesian objective. By integrating Large Language Models as generative world models with AIF’s principled decision-making framework, we can create agents that learn efficiently from experience while remaining aligned with human values. This synthesis offers a compelling path toward AI systems that can develop autonomously while adhering to both computational and physical constraints.
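[AI-3] TrajEvo: Trajectory Prediction Heuristics Design via LLM-driven Evolution
Quick read: This paper addresses two weaknesses in trajectory prediction: traditional handcrafted heuristics lack accuracy and generalizability, while deep learning methods are computationally expensive, hard to explain, and generalize poorly to out-of-distribution (OOD) scenarios. It proposes TrajEvo, a framework that uses Large Language Models (LLMs) to design trajectory prediction heuristics automatically, generating and refining them from past trajectory data with an evolutionary algorithm. The two key innovations are Cross-Generation Elite Sampling, which encourages population diversity, and a Statistics Feedback Loop, which lets the LLM analyze and improve alternative predictions. The result is trajectory prediction that is fast, explainable, and generalizes well.
Link: https://arxiv.org/abs/2508.05616
Authors: Zhikai Zhao, Chuanbo Hua, Federico Berto, Kanghoon Lee, Zihan Ma, Jiachen Li, Jinkyoo Park
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
Comments: arXiv admin note: substantial text overlap with arXiv:2505.04480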
Abstract:Trajectory prediction is a critical task in modeling human behavior, especially in safety-critical domains such as social robotics and autonomous vehicle navigation. Traditional heuristics based on handcrafted rules often lack accuracy and generalizability. Although deep learning approaches offer improved performance, they typically suffer from high computational cost, limited explainability, and, importantly, poor generalization to out-of-distribution (OOD) scenarios. In this paper, we introduce TrajEvo, a framework that leverages Large Language Models (LLMs) to automatically design trajectory prediction heuristics. TrajEvo employs an evolutionary algorithm to generate and refine prediction heuristics from past trajectory data. We propose two key innovations: Cross-Generation Elite Sampling to encourage population diversity, and a Statistics Feedback Loop that enables the LLM to analyze and improve alternative predictions. Our evaluations demonstrate that TrajEvo outperforms existing heuristic methods across multiple real-world datasets, and notably surpasses both heuristic and deep learning methods in generalizing to an unseen OOD real-world dataset. TrajEvo marks a promising step toward the automated design of fast, explainable, and generalizable trajectory prediction heuristics. We release our source code to facilitate future research at this https URL.
[AI-4] Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle
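Quick read: This paper addresses training inefficiency in reinforcement learning (RL) fine-tuning of multimodal large language models (MLLMs), which it traces to two underexplored phenomena: Advantage Collapsing, where most advantages in a batch concentrate near zero, and Rollout Silencing, where the proportion of rollouts contributing non-zero gradients shrinks as training proceeds. Both degrade gradient updates and hinder long-term learning efficiency. The proposed Shuffle-R1 framework has two components: (1) Pairwise Trajectory Sampling, which selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2) Advantage-based Trajectory Shuffle, which deliberately recomposes batches so that valuable rollouts are seen more often, improving sample efficiency. Experiments show consistent gains over strong RL baselines on multiple reasoning benchmarks with minimal overhead, highlighting the value of data-centric adaptations for RL training of MLLMs.
Link: https://arxiv.org/abs/2508.05612
Authors: Linghao Zhu, Yiran Guan, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Bin Qin, Jian Luan, Yuliang Liu, Xiang Bai
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: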
Abstract:Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language model (MLLM). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing, where most advantages in a batch concentrate near zero, and Rollout Silencing, where the proportion of rollouts contributing non-zero gradients diminishes over time. These issues lead to suboptimal gradient updates and hinder long-term learning efficiency. To address these issues, we propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces (1) Pairwise Trajectory Sampling, which selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2) Advantage-based Trajectory Shuffle, which increases exposure of valuable rollouts through informed batch reshuffling. Experiments across multiple reasoning benchmarks show that our framework consistently outperforms strong RL baselines with minimal overhead. These results highlight the importance of data-centric adaptations for more efficient RL training in MLLM.
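To make the two failure modes concrete, the sketch below computes group-normalized advantages for a batch of rollouts, measures how many are effectively zero, and forms high-contrast pairs. It is only an illustration under stated assumptions: the abstract does not specify how advantages are computed or how pairs are chosen, so the mean/std normalization, the "best with worst" pairing rule, and the names `group_advantages`, `near_zero_fraction`, and `high_contrast_pairs` are all hypothetical.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Advantages as rewards normalized by the group's mean and std (assumed form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def near_zero_fraction(advantages, tol=1e-3):
    """Share of rollouts whose advantage is too small to give a useful gradient."""
    return sum(1 for a in advantages if abs(a) < tol) / len(advantages)

def high_contrast_pairs(advantages, keep=2):
    """Pair the highest-advantage rollout with the lowest, the second highest
    with the second lowest, and so on; return up to `keep` index pairs."""
    order = sorted(range(len(advantages)), key=lambda i: advantages[i])
    pairs = []
    lo, hi = 0, len(order) - 1
    while lo < hi and len(pairs) < keep:
        pairs.append((order[hi], order[lo]))
        lo += 1
        hi -= 1
    return pairs

# A group where every rollout earns the same reward carries no learning signal.
print(near_zero_fraction(group_advantages([1, 1, 1, 1])))

# A mixed group: pick the two most contrasting pairs of rollouts.
adv = group_advantages([1, 0, 0, 0, 1, 0, 0, 0])
print(high_contrast_pairs(adv, keep=2))
```

When all rollouts in a group receive the same reward, every advantage is zero and the whole group is silent, which is the situation the sampling and shuffling steps are meant to counteract.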
[AI-5] MV-Debate: Multi-view Agent Debate with Dynamic Reflection Gating for Multimodal Harmful Content Detection in Social Media
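Quick read: This paper addresses the detection of multimodal harmful content on social media, such as sarcasm, hate speech, and misinformation, which is made harder by cross-modal contradictions, rapid cultural shifts, and subtle pragmatic cues. It proposes MV-Debate, a multi-view agent debate framework with dynamic reflection gating. The framework assembles four complementary debate agents (a surface analyst, a deep reasoner, a modality contrast agent, and a social contextualist) that analyze content from different interpretive perspectives and refine their responses through iterative debate and reflection under a reflection-gain criterion, balancing accuracy and efficiency for unified multimodal harmful content detection.
Link: https://arxiv.org/abs/2508.05557
Authors: Rui Lu, Jinhe Bi, Yunpu Ma, Feng Xiao, Yuntao Du, Yijun Tian
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: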
Abstract:Social media has evolved into a complex multimodal environment where text, images, and other signals interact to shape nuanced meanings, often concealing harmful intent. Identifying such intent, whether sarcasm, hate speech, or misinformation, remains challenging due to cross-modal contradictions, rapid cultural shifts, and subtle pragmatic cues. To address these challenges, we propose MV-Debate, a multi-view agent debate framework with dynamic reflection gating for unified multimodal harmful content detection. MV-Debate assembles four complementary debate agents, a surface analyst, a deep reasoner, a modality contrast, and a social contextualist, to analyze content from diverse interpretive perspectives. Through iterative debate and reflection, the agents refine responses under a reflection-gain criterion, ensuring both accuracy and efficiency. Experiments on three benchmark datasets demonstrate that MV-Debate significantly outperforms strong single-model and existing multi-agent debate baselines. This work highlights the promise of multi-agent debate in advancing reliable social intent detection in safety-critical online contexts.
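[AI-6] Tractable Sharpness-Aware Learning of Probabilistic Circuits
Quick read: This paper addresses overfitting in Probabilistic Circuits (PCs), which becomes more likely as model capacity grows, especially when data is limited. The core observation is that PC training often converges to sharp optima that fit the training set well but generalize poorly. The solution is a Hessian-based regularizer: by exploiting the structure of PCs, the trace of the Hessian of the log-likelihood, a sharpness proxy that is typically intractable in deep neural networks, can be computed efficiently. Minimizing this trace induces a gradient-norm regularizer that gives closed-form parameter updates for expectation-maximization (EM) and integrates seamlessly with gradient-based learning. Experiments show that the method consistently steers PCs toward flatter minima and improves generalization.
Link: https://arxiv.org/abs/2508.05537
Authors: Hrithik Suresh, Sahil Sidheekh, Vishnu Shreeram M.P, Sriraam Natarajan, Narayanan C. Krishnan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: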
Abstract:Probabilistic Circuits (PCs) are a class of generative models that allow exact and tractable inference for a wide range of queries. While recent developments have enabled the learning of deep and expressive PCs, this increased capacity can often lead to overfitting, especially when data is limited. We analyze PC overfitting from a log-likelihood-landscape perspective and show that it is often caused by convergence to sharp optima that generalize poorly. Inspired by sharpness-aware minimization in neural networks, we propose a Hessian-based regularizer for training PCs. As a key contribution, we show that the trace of the Hessian of the log-likelihood (a sharpness proxy that is typically intractable in deep neural networks) can be computed efficiently for PCs. Minimizing this Hessian trace induces a gradient-norm-based regularizer that yields simple closed-form parameter updates for EM, and integrates seamlessly with gradient-based learning methods. Experiments on synthetic and real-world datasets demonstrate that our method consistently guides PCs toward flatter minima and improves generalization performance.
[AI-7] Streamlining Admission with LOR Insights: AI-Based Leadership Assessment in Online Masters Program
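【Quick Read】: This paper addresses the inefficiency of evaluating Letters of Recommendation (LORs) in graduate applications: manually reviewing these text-heavy letters is time-consuming and labor-intensive, which makes it hard to identify candidates' leadership traits systematically. The key to the solution is LORI (LOR Insights), an AI-based detection tool that uses natural language processing together with large language models such as RoBERTa and LLAMA to automatically identify leadership dimensions expressed in LORs, including teamwork, communication, and innovation. The latest RoBERTa model achieves a weighted F1 score of 91.6%, a precision of 92.4%, and a recall of 91.6% on the test data, showing strong consistency and enabling efficient, automated, and comprehensive assessment of applicants' leadership capabilities.
Link: https://arxiv.org/abs/2508.05513
Authors: Meryem Yilmaz Soylu,Adrian Gallard,Jeonghyun Lee,Gayane Grigoryan,Rushil Desai,Stephen Harmon
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: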
Abstract:Letters of recommendation (LORs) provide valuable insights into candidates’ capabilities and experiences beyond standardized test scores. However, reviewing these text-heavy materials is time-consuming and labor-intensive. To address this challenge and support the admission committee in providing feedback for students’ professional growth, our study introduces LORI: LOR Insights, a novel AI-based detection tool for assessing leadership skills in LORs submitted by online master’s program applicants. By employing natural language processing and leveraging large language models using RoBERTa and LLAMA, we seek to identify leadership attributes such as teamwork, communication, and innovation. Our latest RoBERTa model achieves a weighted F1 score of 91.6%, a precision of 92.4%, and a recall of 91.6%, showing a strong level of consistency in our test data. With the growing importance of leadership skills in the STEM sector, integrating LORI into the graduate admissions process is crucial for accurately assessing applicants’ leadership capabilities. This approach not only streamlines the admissions process but also automates and ensures a more comprehensive evaluation of candidates’ capabilities.
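For readers unfamiliar with the weighted metrics quoted above, the following sketch shows how weighted precision, recall, and F1 are computed from a confusion matrix, with each class weighted by its share of true examples. The matrix values are made up for illustration and are not the paper's data.

```python
def weighted_scores(confusion):
    """confusion[i][j] = count of items with true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    w_prec = w_rec = w_f1 = 0.0
    for c in range(n):
        tp = confusion[c][c]
        support = sum(confusion[c])                        # true items of class c
        predicted = sum(confusion[r][c] for r in range(n)) # items predicted as c
        prec = tp / predicted if predicted else 0.0
        rec = tp / support if support else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = support / total                           # class share of the data
        w_prec += weight * prec
        w_rec += weight * rec
        w_f1 += weight * f1
    return w_prec, w_rec, w_f1

# Two classes, ten true examples each (illustrative numbers only).
confusion = [[8, 2],
             [1, 9]]
w_prec, w_rec, w_f1 = weighted_scores(confusion)
```

With support-based weights, weighted recall always equals overall accuracy, which is worth keeping in mind when reading the three figures side by side.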
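[AI-8] Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation
【Quick Read】: This paper addresses the limitations of current methods for evaluating how well foundation models complete tasks as agents across domains: approaches such as LLM-as-a-Judge look only at final outputs and overlook the step-by-step reasoning behind agentic decisions, while existing Agent-as-a-Judge systems are usually confined to specific domains and lack generality. The key to the solution is a generalizable, modular evaluation framework that emulates human evaluation by decomposing a task into sub-tasks and validating each step using the agent's output and reasoning. Each module contributes one aspect of the evaluation, and the module outputs are aggregated into a final verdict on task completion. On the GAIA and BigCodeBench benchmarks, the framework outperforms a GPT-4o-based LLM-as-a-Judge baseline, improving alignment accuracy with human evaluation of task success by 4.76% and 10.52%, respectively.
Link: https://arxiv.org/abs/2508.05508
Authors: Roshita Bhonsle,Rishav Dutta,Sneha Vavilapalli,Harsh Seth,Abubakarr Jaye,Yapei Chang,Mukund Rungta,Emmanuel Aboah Boateng,Sadid Hasan,Ehi Nosakhare,Soundar Srinivasan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: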
Abstract:The increasing adoption of foundation models as agents across diverse domains necessitates a robust evaluation framework. Current methods, such as LLM-as-a-Judge, focus only on final outputs, overlooking the step-by-step reasoning that drives agentic decision-making. Meanwhile, existing Agent-as-a-Judge systems, where one agent evaluates another’s task completion, are typically designed for narrow, domain-specific settings. To address this gap, we propose a generalizable, modular framework for evaluating agent task completion independent of the task domain. The framework emulates human-like evaluation by decomposing tasks into sub-tasks and validating each step using available information, such as the agent’s output and reasoning. Each module contributes to a specific aspect of the evaluation process, and their outputs are aggregated to produce a final verdict on task completion. We validate our framework by evaluating the Magentic-One Actor Agent on two benchmarks, GAIA and BigCodeBench. Our Judge Agent predicts task success with closer agreement to human evaluations, achieving 4.76% and 10.52% higher alignment accuracy, respectively, compared to the GPT-4o based LLM-as-a-Judge baseline. This demonstrates the potential of our proposed general-purpose evaluation framework.
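The decompose, validate, and aggregate flow described above might be sketched as follows. This is only an illustration of the aggregation step: `StepCheck`, `aggregate_verdict`, and the `required_fraction` threshold are hypothetical names and choices, and the abstract does not say how the actual framework combines module outputs.

```python
from dataclasses import dataclass

@dataclass
class StepCheck:
    """Outcome of validating one sub-task (hypothetical structure)."""
    sub_task: str
    passed: bool
    evidence: str

def aggregate_verdict(checks, required_fraction=1.0):
    """Combine per-step checks into a final task-completion verdict."""
    if not checks:
        return {"completed": False, "score": 0.0, "failed": []}
    passed = sum(1 for c in checks if c.passed)
    score = passed / len(checks)
    failed = [c.sub_task for c in checks if not c.passed]
    return {"completed": score >= required_fraction, "score": score, "failed": failed}

checks = [
    StepCheck("locate the source file", True, "agent log cites the path"),
    StepCheck("apply the requested change", True, "diff present in the output"),
    StepCheck("run the tests", False, "no test output found"),
]
verdict = aggregate_verdict(checks)
```

Keeping the failed sub-tasks in the verdict makes the judgment auditable, which a single pass/fail score from a final-output-only judge does not provide.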
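[AI-9] GRAIL: Learning to Interact with Large Knowledge Graphs for Retrieval Augmented Reasoning
【Quick Read】: This paper addresses the limited ability of existing Retrieval-Augmented Generation (RAG) methods to handle structured knowledge such as knowledge graphs; in particular, current graph retrieval methods struggle to capture global graph structure while maintaining precision, which degrades reasoning performance. The key to the solution is the GRAIL framework, which combines random exploration guided by Large Language Models (LLMs) with path filtering to build a data synthesis pipeline that produces fine-grained reasoning trajectories. A two-stage training strategy then learns a dynamic decision policy, and the goal of balancing retrieval precision against conciseness is decoupled into fine-grained process-supervised rewards, improving data efficiency and training stability. At deployment, GRAIL uses an interactive retrieval paradigm in which the model autonomously explores graph paths and dynamically trades off retrieval breadth against precision, which improves accuracy and F1 on knowledge graph question-answering tasks.
Link: https://arxiv.org/abs/2508.05498
Authors: Ge Chang,Jinbo Su,Jiacheng Liu,Pengfei Yang,Yuhao Shang,Huiwen Zheng,Hongli Ma,Yan Liang,Yuanchun Li,Yunxin Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures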
Abstract:Large Language Models (LLMs) integrated with Retrieval-Augmented Generation (RAG) techniques have exhibited remarkable performance across a wide range of domains. However, existing RAG approaches primarily operate on unstructured data and demonstrate limited capability in handling structured knowledge such as knowledge graphs. Meanwhile, current graph retrieval methods fundamentally struggle to capture holistic graph structures while simultaneously facing precision control challenges that manifest as either critical information gaps or excessive redundant connections, collectively undermining reasoning performance. To address this challenge, we propose GRAIL: Graph-Retrieval Augmented Interactive Learning, a framework designed to interact with large-scale graphs for retrieval-augmented reasoning. Specifically, GRAIL integrates LLM-guided random exploration with path filtering to establish a data synthesis pipeline, where a fine-grained reasoning trajectory is automatically generated for each task. Based on the synthesized data, we then employ a two-stage training process to learn a policy that dynamically decides the optimal actions at each reasoning step. The overall objective of precision-conciseness balance in graph retrieval is decoupled into fine-grained process-supervised rewards to enhance data efficiency and training stability. In practical deployment, GRAIL adopts an interactive retrieval paradigm, enabling the model to autonomously explore graph paths while dynamically balancing retrieval breadth and precision. Extensive experiments have shown that GRAIL achieves an average accuracy improvement of 21.01% and F1 improvement of 22.43% on three knowledge graph question-answering datasets. Our source code and datasets are available at this https URL.
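[AI-10] InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities
【Quick Read】: This paper addresses the high resource cost and low data efficiency of improving the reasoning ability of Large Language Models (LLMs) during post-training. Existing methods mostly rely on heuristic or task-specific data selection strategies that are hard to scale to diverse settings. The key to the solution is InfiAlign, a scalable and data-efficient post-training framework that combines Supervised Fine-Tuning (SFT) with Direct Preference Optimization (DPO) and uses an automatic data selection pipeline, driven by multidimensional quality metrics, to curate high-quality alignment data from open-source reasoning datasets. This sharply reduces data requirements (about 12% of the training data suffices to match DeepSeek-R1-Distill-Qwen-7B), preserves generalization across tasks, and yields an average improvement of 3.89% on the AIME 24/25 mathematical reasoning benchmarks.
Link: https://arxiv.org/abs/2508.05496
Authors: Shuo Cai,Su Lu,Qi Zhou,Kejing Yang,Zhijie Sang,Congkai Xie,Hongxia Yang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: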
Abstract:Large language models (LLMs) have exhibited impressive reasoning abilities on a wide range of complex tasks. However, enhancing these capabilities through post-training remains resource intensive, particularly in terms of data and computational cost. Although recent efforts have sought to improve sample efficiency through selective data curation, existing methods often rely on heuristic or task-specific strategies that hinder scalability. In this work, we introduce InfiAlign, a scalable and sample-efficient post-training framework that integrates supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to align LLMs for enhanced reasoning. At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources. When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks. Additional improvements are obtained through the application of DPO, with particularly notable gains in mathematical reasoning tasks. The model achieves an average improvement of 3.89% on AIME 24/25 benchmarks. Our results highlight the effectiveness of combining principled data selection with full-stage post-training, offering a practical solution for aligning large reasoning models in a scalable and data-efficient manner. The model checkpoints are available at this https URL.
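Since the second stage relies on Direct Preference Optimization, a minimal sketch of the standard per-pair DPO loss may help. It takes summed log-probabilities of a chosen and a rejected response under the policy and under a frozen reference model. The value `beta=0.1` is an arbitrary illustration, and nothing here reflects InfiAlign's actual hyperparameters or data pipeline.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The policy matches the reference: loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# The policy raises the chosen response and lowers the rejected one: loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss only depends on how far the policy has moved relative to the reference, which is why no separate reward model is needed.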
[AI-11] MoMA: A Mixture-of-Multimodal-Agents Architecture for Enhancing Clinical Prediction Modelling
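【Quick Read】: This paper addresses the difficulty of effectively integrating multimodal electronic health record (EHR) data in clinical prediction modeling, a challenge driven by the heterogeneity of modalities (such as medical images, laboratory results, and clinical text) and by high data requirements. The key to the solution is a new architecture, Mixture-of-Multimodal-Agents (MoMA), in which large language model (LLM) agents collaborate in three tiers: "specialist agents" convert non-textual modalities into structured textual summaries, an "aggregator agent" fuses these summaries with clinical notes into a unified multimodal summary, and a "predictor agent" produces clinical predictions from that summary. This modular design improves accuracy and flexibility across multiple tasks.
Link: https://arxiv.org/abs/2508.05492
Authors: Jifan Gao,Mahmudur Rahman,John Caskey,Madeline Oguss,Ann O’Rourke,Randy Brown,Anne Stey,Anoop Mayampurath,Matthew M. Churpek,Guanhua Chen,Majid Afshar
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: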
Abstract:Multimodal electronic health record (EHR) data provide richer, complementary insights into patient health compared to single-modality data. However, effectively integrating diverse data modalities for clinical prediction modeling remains challenging due to the substantial data requirements. We introduce a novel architecture, Mixture-of-Multimodal-Agents (MoMA), designed to leverage multiple large language model (LLM) agents for clinical prediction tasks using multimodal EHR data. MoMA employs specialized LLM agents (“specialist agents”) to convert non-textual modalities, such as medical images and laboratory results, into structured textual summaries. These summaries, together with clinical notes, are combined by another LLM (“aggregator agent”) to generate a unified multimodal summary, which is then used by a third LLM (“predictor agent”) to produce clinical predictions. Evaluating MoMA on three prediction tasks using real-world datasets with different modality combinations and prediction settings, MoMA outperforms current state-of-the-art methods, highlighting its enhanced accuracy and flexibility across various tasks.
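The three-tier flow (specialists, aggregator, predictor) can be sketched as plain function composition. Everything below is a placeholder: the agents are stub functions standing in for LLM calls, the record is invented for illustration, and the keyword rule in `predictor` is a toy, not a clinical model or the paper's implementation.

```python
def moma_predict(record, specialists, aggregator, predictor):
    """Route each non-text modality to its specialist, then aggregate and predict."""
    summaries = {
        modality: specialists[modality](data)
        for modality, data in record.items()
        if modality in specialists
    }
    unified = aggregator(summaries, record.get("clinical_notes", ""))
    return predictor(unified)

# Stub agents (hypothetical): each would be an LLM call in a real system.
specialists = {
    "labs": lambda labs: "labs: " + ", ".join(f"{k}={v}" for k, v in sorted(labs.items())),
    "image": lambda img: f"image: {img['finding']}",
}
aggregator = lambda summaries, notes: " | ".join([*sorted(summaries.values()), f"notes: {notes}"])
predictor = lambda summary: "high risk" if "opacity" in summary else "low risk"

record = {  # invented example record
    "labs": {"lactate": 3.1, "wbc": 14.2},
    "image": {"finding": "right lower lobe opacity"},
    "clinical_notes": "fever and cough for three days",
}
prediction = moma_predict(record, specialists, aggregator, predictor)
```

Because every stage exchanges text, a modality can be added or dropped by editing the `specialists` mapping alone, which is one way to read the flexibility claim in the abstract.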
[AI-12] Embedding Alignment in Code Generation for Audio
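【Quick Read】: This paper addresses the difficulty that code generation models have in offering diverse, musically meaningful code candidates in creative coding with generative AI, especially live-coding. Existing models do not explicitly model the mapping between code and its audio output, so users cannot effectively use varied code options to realize their musical intentions. The key to the solution is a code-audio embedding alignment map: by learning the nonlinear topological relationship between the code embedding space and the audio embedding space, the model can predict the audio embedding corresponding to input code, improving the musical diversity and controllability of the output.
Link: https://arxiv.org/abs/2508.05473
Authors: Sam Kouteili,Hiren Madhu,George Typaldos,Mark Santolucito
Affiliation: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: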
Abstract:LLM-powered code generation has the potential to revolutionize creative coding endeavors, such as live-coding, by enabling users to focus on structural motifs over syntactic details. In such domains, when prompting an LLM, users may benefit from considering multiple varied code candidates to better realize their musical intentions. Code generation models, however, struggle to present unique and diverse code candidates, with no direct insight into the code’s audio output. To better establish a relationship between code candidates and produced audio, we investigate the topology of the mapping between code and audio embedding spaces. We find that code and audio embeddings do not exhibit a simple linear relationship, but supplement this with a constructed predictive model that shows an embedding alignment map could be learned. Supplementing the aim for musically diverse output, we present a model that given code predicts output audio embedding, constructing a code-audio embedding alignment map.
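[AI-13] Task complexity shapes internal representations and robustness in neural networks
【Quick Read】: This paper investigates how the internal representations of neural networks are shaped by input data complexity and task difficulty, that is, how the topology and robustness of representations in multilayer perceptrons (MLPs) change with task complexity. The key to the approach is a suite of five data-agnostic probes (pruning, binarization, noise injection, sign flipping, and bipartite network randomization). From a network science perspective, MLPs are modeled as signed, weighted bipartite graphs, and models trained on an easy task (MNIST) are compared with models trained on a hard task (Fashion-MNIST). The findings are: binarizing weights collapses performance on the hard task while easy-task models remain robust; pruning low-magnitude edges in binarized hard-task models triggers a sharp phase transition in performance; moderate noise can improve accuracy, resembling a stochastic-resonance effect linked to optimal sign flips of small-magnitude weights; and preserving only the sign structure suffices to maintain high accuracy, indicating that signed bipartite topology is the key factor. These results define a model- and modality-agnostic measure of task complexity, namely the performance gap between the full-precision network and its binarized or shuffled counterpart, and suggest practical strategies for model compression and interpretability aligned with task complexity.
Link: https://arxiv.org/abs/2508.05463
Authors: Robert Jankowski,Filippo Radicchi,M. Ángeles Serrano,Marián Boguñá,Santo Fortunato
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: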
Abstract:Neural networks excel across a wide range of tasks, yet remain black boxes. In particular, how their internal representations are shaped by the complexity of the input data and the problems they solve remains obscure. In this work, we introduce a suite of five data-agnostic probes (pruning, binarization, noise injection, sign flipping, and bipartite network randomization) to quantify how task difficulty influences the topology and robustness of representations in multilayer perceptrons (MLPs). MLPs are represented as signed, weighted bipartite graphs from a network science perspective. We contrast easy and hard classification tasks on the MNIST and Fashion-MNIST datasets. We show that binarizing weights in hard-task models collapses accuracy to chance, whereas easy-task models remain robust. We also find that pruning low-magnitude edges in binarized hard-task models reveals a sharp phase transition in performance. Moreover, moderate noise injection can enhance accuracy, resembling a stochastic-resonance effect linked to optimal sign flips of small-magnitude weights. Finally, preserving only the sign structure, instead of precise weight magnitudes, through bipartite network randomizations suffices to maintain high accuracy. These phenomena define a model- and modality-agnostic measure of task complexity: the performance gap between full-precision and binarized or shuffled neural network performance. Our findings highlight the crucial role of signed bipartite topology in learned representations and suggest practical strategies for model compression and interpretability that align with task complexity.
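Three of the five probes are easy to state on a single weight matrix. The sketch below gives one plausible reading of binarization, magnitude pruning, and sign flipping using plain nested lists. The exact definitions used in the paper (for example whether binarized weights are rescaled, or how ties are pruned) are not given in the abstract, so these are assumptions.

```python
import random

def binarize(weights):
    """Keep only the sign of each weight: +1, -1, or 0."""
    return [[(w > 0) - (w < 0) for w in row] for row in weights]

def prune_low_magnitude(weights, fraction):
    """Zero out the given fraction of edges with the smallest |weight|.
    Ties at the threshold are all pruned, so slightly more may be removed."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(fraction * len(flat))
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

def flip_signs(weights, prob, rng):
    """Flip the sign of each weight independently with probability prob."""
    return [[-w if rng.random() < prob else w for w in row] for row in weights]

layer = [[0.5, -0.1, 0.0],
         [-2.0, 0.3, 0.05]]   # toy 2x3 layer, arbitrary values

signs = binarize(layer)
pruned = prune_low_magnitude(layer, fraction=0.5)
flipped = flip_signs(layer, prob=0.2, rng=random.Random(0))
```

The complexity measure described in the abstract would then be the accuracy gap between a model using `layer` and the same model using `signs` in its place.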
[AI-14] EnergyPatchTST: Multi-scale Time Series Transformers with Uncertainty Estimation for Energy Forecasting
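【Quick Read】: This paper addresses the limited performance of existing deep learning methods for energy time series forecasting, which stems from multi-scale temporal dynamics and irregularities in real data. The key to the solution is the EnergyPatchTST model, whose core innovations are: (1) a multi-scale feature extraction mechanism that captures patterns at different temporal resolutions; (2) a probabilistic forecasting framework that estimates uncertainty through Monte Carlo elimination; (3) an integration path for future known variables (such as temperature and wind conditions); and (4) a pre-training and fine-tuning strategy that improves performance on small energy datasets. Experiments show a 7-12% reduction in prediction error on several common energy datasets together with reliable uncertainty estimates, providing an important reference for time series forecasting in the energy field.
Link: https://arxiv.org/abs/2508.05454
Authors: Wei Li,Zixin Wang,Qizheng Sun,Qixiang Gao,Fenglei Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at the International Conference on Intelligent Computing (ICIC 2025). 12 pages. The final authenticated version is published in the Lecture Notes in Computer Science (LNCS) series, vol 15860, and is available online. This is the author’s version of the work submitted for peer review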
Abstract:Accurate and reliable energy time series prediction is of great significance for power generation planning and allocation. At present, deep learning time series prediction has become the mainstream method. However, the multi-scale time dynamics and the irregularity of real data limit the existing methods. Therefore, we propose EnergyPatchTST, an extension of the Patch Time Series Transformer specially designed for energy forecasting. The main innovations of our method are as follows: (1) a multi-scale feature extraction mechanism to capture patterns with different time resolutions; (2) a probability prediction framework to estimate uncertainty through Monte Carlo elimination; (3) an integration path for future known variables (such as temperature and wind conditions); and (4) pre-training and fine-tuning examples to enhance performance on limited energy data sets. A series of experiments on common energy data sets shows that EnergyPatchTST outperforms other commonly used methods, reducing the prediction error by 7-12% and providing reliable uncertainty estimation, which offers an important reference for time series prediction in the energy field.
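Patch-based time series transformers start by cutting the input series into fixed-length patches, and a multi-scale variant can repeat this at several patch lengths. The sketch below shows only that preprocessing idea. The function names, the 50% overlap, and the patch lengths are assumptions for illustration; the abstract does not describe how EnergyPatchTST actually builds or fuses its scales.

```python
def make_patches(series, patch_len, stride):
    """Split a 1-D series into (possibly overlapping) patches of length patch_len."""
    last_start = len(series) - patch_len
    return [series[i:i + patch_len] for i in range(0, last_start + 1, stride)]

def multi_scale_patches(series, patch_lens, overlap=0.5):
    """Patch the same series at several resolutions; returns {patch_len: patches}."""
    out = {}
    for p in patch_lens:
        stride = max(1, int(p * (1 - overlap)))
        out[p] = make_patches(series, p, stride)
    return out

hourly_load = list(range(24))   # placeholder values, not real measurements
scales = multi_scale_patches(hourly_load, patch_lens=[4, 12])
```

Short patches expose fast fluctuations while long patches summarize slower cycles, which is the intuition behind processing several temporal resolutions together.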
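[AI-15] Tail-Risk-Safe Monte Carlo Tree Search under PAC-Level Guarantees
【Quick Read】: This paper addresses the lack of rigorous guarantees against extreme adverse outcomes (tail risk) in Monte Carlo Tree Search (MCTS) for high-stakes decision making. Conventional safety-aware MCTS methods typically rely on mean risk measures or hard cost thresholds, which provide no rigorous guarantee about extreme risk events and can lead to serious consequences. The key to the solution is two new methods. The first, CVaR-MCTS, embeds Conditional Value-at-Risk (CVaR), a coherent tail-risk measure, into the MCTS framework and uses the parameter α to explicitly control the expected loss in the "worst (1−α)% scenarios". The second, Wasserstein-MCTS (W-MCTS), introduces a first-order Wasserstein ambiguity set to characterize uncertainty in tail-risk estimates, mitigating the bias caused by limited samples. Theoretical analysis shows that both methods have probably approximately correct (PAC) tail-safety guarantees and establishes corresponding upper bounds on cumulative regret; experiments confirm their performance and robustness in diverse simulated environments.
Link: https://arxiv.org/abs/2508.05441
Authors: Zuyuan Zhang,Arnob Ghosh,Tian Lan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: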
Abstract:Making decisions with respect to just the expected returns in Monte Carlo Tree Search (MCTS) cannot account for the potential range of high-risk, adverse outcomes associated with a decision. To this end, safety-aware MCTS methods often consider constrained variants that introduce some form of mean risk measure or hard cost threshold. These approaches fail to provide rigorous tail-safety guarantees with respect to extreme or high-risk outcomes (denoted as tail-risk), potentially resulting in serious consequences in high-stakes scenarios. This paper addresses the problem by developing two novel solutions. We first propose CVaR-MCTS, which embeds a coherent tail risk measure, Conditional Value-at-Risk (CVaR), into MCTS. Our CVaR-MCTS with parameter \alpha achieves explicit tail-risk control over the expected loss in the “worst (1-\alpha)% scenarios.” Second, we further address the estimation bias of tail-risk due to limited samples. We propose Wasserstein-MCTS (or W-MCTS) by introducing a first-order Wasserstein ambiguity set \mathcal{P}_{\varepsilon_s}(s,a) with radius \varepsilon_s to characterize the uncertainty in tail-risk estimates. We prove PAC tail-safety guarantees for both CVaR-MCTS and W-MCTS and establish their regret bounds. Evaluations on diverse simulated environments demonstrate that our proposed methods outperform existing baselines, effectively achieving robust tail-risk guarantees with improved rewards and stability.
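To make the "worst (1-\alpha) scenarios" reading of CVaR concrete, here is a minimal empirical estimator: sort the sampled losses and average the worst (1-\alpha) fraction. This is the simple tail-average convention applied to made-up losses. It is not the estimator analysed in the paper, and it ignores the Wasserstein robustification entirely.

```python
import math

def empirical_cvar(losses, alpha):
    """Mean of the worst (1 - alpha) fraction of losses (larger loss = worse)."""
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(losses, reverse=True)
    # round() guards against float noise such as (1 - 0.7) * 10 = 3.0000000000000004
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:k]) / k

rollout_losses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # illustrative samples
mean_loss = empirical_cvar(rollout_losses, alpha=0.0)   # plain average
tail_loss = empirical_cvar(rollout_losses, alpha=0.8)   # average of the worst 20%
```

Two actions with the same mean loss can have very different tail averages, which is exactly the distinction an expectation-only MCTS backup cannot see.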
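[AI-16] Whose Truth? Pluralistic Geo-Alignment for (Agentic) AI
【Quick Read】: The problem this paper sets out to address is that current AI alignment research focuses mostly on bias and inequality while neglecting how geographic differences affect the norms for AI behavior: judgments of what is "appropriate", "truthful", or "legal" vary widely across regions because of cultural, political, and legal context. For example, a text-to-image model may depict balanced gender ratios in company leadership despite real-world imbalances, while the appropriate treatment of topics such as Kashmir depends heavily on the user's location and context. The key to the solution is to promote spatio-temporally aware alignment in place of "one-size-fits-all" alignment strategies, identifying and adapting to the user's geographic setting and context so that AI outputs are locally reasonable and socially acceptable, and to provide an assessable, explainable alignment framework for the development of Agentic AI.
Link: https://arxiv.org/abs/2508.05432
Authors: Krzysztof Janowicz,Zilong Liu,Gengchen Mai,Zhangyu Wang,Ivan Majic,Alexandra Fortacz,Grant McKenzie,Song Gao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: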
Abstract:AI (super) alignment describes the challenge of ensuring (future) AI systems behave in accordance with societal norms and goals. While a quickly evolving literature is addressing biases and inequalities, the geographic variability of alignment remains underexplored. Simply put, what is considered appropriate, truthful, or legal can differ widely across regions due to cultural norms, political realities, and legislation. Alignment measures applied to AI/ML workflows can sometimes produce outcomes that diverge from statistical realities, such as text-to-image models depicting balanced gender ratios in company leadership despite existing imbalances. Crucially, some model outputs are globally acceptable, while others, e.g., questions about Kashmir, depend on knowing the user’s location and their context. This geographic sensitivity is not new. For instance, Google Maps renders Kashmir’s borders differently based on user location. What is new is the unprecedented scale and automation with which AI now mediates knowledge, expresses opinions, and represents geographic reality to millions of users worldwide, often with little transparency about how context is managed. As we approach Agentic AI, the need for spatio-temporally aware alignment, rather than one-size-fits-all approaches, is increasingly urgent. This paper reviews key geographic research problems, suggests topics for future work, and outlines methods for assessing alignment sensitivity.
[AI-17] Large Language Models Transform Organic Synthesis From Reaction Prediction to Automation
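【Quick Read】: This paper addresses the inefficient reaction design, long experimental cycles, and insufficient use of data in traditional organic synthesis, as well as the challenges that generative AI faces in chemistry, including dataset bias, opaque reasoning, and safety risks. The key to the solution is coupling Large Language Models (LLMs) with Graph Neural Networks, quantum calculations, and real-time spectroscopy to shorten molecular discovery cycles, improve the accuracy of synthetic route prediction, and support greener, explainable, and safe chemical R&D workflows. Community collaboration mechanisms such as open benchmarks, federated learning, and explainable interfaces aim to democratize access while keeping humans in control of automated systems.
Link: https://arxiv.org/abs/2508.05427
Authors: Kartar Kumar Lohana Tharwani,Rajesh Kumar,Sumita,Numan Ahmed,Yong Tang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: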
Abstract:Large language models (LLMs) are beginning to reshape how chemists plan and run reactions in organic synthesis. Trained on millions of reported transformations, these text-based models can propose synthetic routes, forecast reaction outcomes and even instruct robots that execute experiments without human supervision. Here we survey the milestones that turned LLMs from speculative tools into practical lab partners. We show how coupling LLMs with graph neural networks, quantum calculations and real-time spectroscopy shrinks discovery cycles and supports greener, data-driven chemistry. We discuss limitations, including biased datasets, opaque reasoning and the need for safety gates that prevent unintentional hazards. Finally, we outline community initiatives (open benchmarks, federated learning and explainable interfaces) that aim to democratize access while keeping humans firmly in control. These advances chart a path towards rapid, reliable and inclusive molecular innovation powered by artificial intelligence and automation.
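[AI-18] DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning
【Quick Read】: This paper addresses the difficulty that Vision Language Models (VLMs) have with attention to detail and precise action planning in complex, dynamic environments, particularly in realistic tasks that require understanding physical rules, spatial reasoning, and long-term strategy refinement. The key to the solution is DeepPHY, a new benchmark framework that systematically evaluates VLMs' understanding of and reasoning about fundamental physical principles through a series of challenging simulated environments, with fine-grained evaluation metrics that quantify their predictive control performance.
Link: https://arxiv.org/abs/2508.05405
Authors: Xinrun Xu,Pi Bu,Ye Wang,Börje F. Karlsson,Ziming Wang,Tengtao Song,Qi Zhu,Jun Song,Zhiming Ding,Bo Zheng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 48 pages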
Abstract:Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, leading to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, usually necessitating understanding the physics rules of the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate VLMs’ understanding and reasoning about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty levels and incorporates fine-grained evaluation metrics. Our evaluation finds that even state-of-the-art VLMs struggle to translate descriptive physical knowledge into precise, predictive control.
[AI-19] Real-Time Iteration Scheme for Diffusion Policy
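【Quick Read】: This paper addresses two limitations of Diffusion Policy in robotic manipulation: the long inference time caused by iterative denoising, and the need to execute an action chunk before the next prediction to keep actions consistent. Together these restrict its use in latency-critical tasks and in simple tasks with short cycle times. The key to the solution is to borrow the Real-Time Iteration (RTI) scheme from optimal control, using the solution from the previous time step as the initial guess for the current one and thereby substantially reducing runtime computational cost. The paper also proposes a scaling-based method for handling discrete actions (such as grasping). Because no distillation or policy redesign is needed, the scheme integrates seamlessly into many pre-trained diffusion-based models, in particular resource-demanding large models.
Link: https://arxiv.org/abs/2508.05396
Authors: Yufei Duan,Hang Yin,Danica Kragic
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works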
Abstract:Diffusion Policies have demonstrated impressive performance in robotic manipulation tasks. However, their long inference time, resulting from an extensive iterative denoising process, and the need to execute an action chunk before the next prediction to maintain consistent actions limit their applicability to latency-critical tasks or simple tasks with a short cycle time. While recent methods explored distillation or alternative policy structures to accelerate inference, these often demand additional training, which can be resource-intensive for large robotic models. In this paper, we introduce a novel approach inspired by the Real-Time Iteration (RTI) Scheme, a method from optimal control that accelerates optimization by leveraging solutions from previous time steps as initial guesses for subsequent iterations. We explore the application of this scheme in diffusion inference and propose a scaling-based method to effectively handle discrete actions, such as grasping, in robotic manipulation. The proposed scheme significantly reduces runtime computational costs without the need for distillation or policy redesign. This enables a seamless integration into many pre-trained diffusion-based models, in particular, to resource-demanding large models. We also provide theoretical conditions for the contractivity which could be useful for estimating the initial denoising step. Quantitative results from extensive simulation experiments show a substantial reduction in inference time, with comparable overall performance compared with Diffusion Policy using full-step denoising. Our project page with additional resources is available at: this https URL.
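The warm-start idea behind RTI can be illustrated without any diffusion model. In the generic numerical sketch below, a contractive fixed-point iteration is solved for a slowly drifting target, once from a cold start each time and once starting from the previous solution. The contraction rate, tolerance, and targets are arbitrary assumptions, and this toy solver stands in for, but is not, the denoising process studied in the paper.

```python
def solve(b, x0, tol=1e-6, rate=0.5, max_iters=1000):
    """Fixed-point iteration x <- rate * x + b; returns (solution, iterations used)."""
    x = x0
    for i in range(1, max_iters + 1):
        x_new = rate * x + b
        if abs(x_new - x) < tol:
            return x_new, i
        x = x_new
    return x, max_iters

targets = [1.00, 1.01, 1.02, 1.03]   # problem data drifting slowly over time steps

# Cold start: every time step begins from the same uninformed guess.
cold_iters = [solve(b, 0.0)[1] for b in targets]

# Warm start: each time step begins from the previous step's solution.
warm_iters, x_prev = [], 0.0
for b in targets:
    x_prev, used = solve(b, x_prev)
    warm_iters.append(used)

print(cold_iters, warm_iters)
```

When consecutive problems are similar, the previous solution is already close to the new one, so fewer refinement steps are needed. The abstract's contractivity conditions address when this kind of reasoning carries over to diffusion inference.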
[AI-20] An Explainable Machine Learning Framework for Railway Predictive Maintenance using Data Streams from the Metro Operator of Portugal
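【Quick Read】: This paper addresses real-time fault prediction and maintenance decision support in Intelligent Transportation Systems (ITS), targeting the need for accurate and explainable predictive models in railway operations and maintenance. The key to the solution is an online processing pipeline with two core innovations: a dedicated sample pre-processing module that builds statistical and frequency-domain features in real time, and an explainability module that combines natural language with visualization, described as the first to provide explainable output for online fault prediction. Validated on the MetroPT data set from the Porto metro, the method achieves results above 98% for F-measure and 99% for accuracy, maintains high performance under class imbalance and noise, and supports reliable and timely proactive maintenance decisions in railway operations.
Link: https://arxiv.org/abs/2508.05388
Authors: Silvia García-Méndez,Francisco de Arriba-Pérez,Fátima Leal,Bruno Veloso,Benedita Malheiro,Juan Carlos Burguillo-Rial
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: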
Abstract:This work contributes to a real-time data-driven predictive maintenance solution for Intelligent Transportation Systems. The proposed method implements a processing pipeline comprised of sample pre-processing, incremental classification with Machine Learning models, and outcome explanation. This novel online processing pipeline has two main highlights: (i) a dedicated sample pre-processing module, which builds statistical and frequency-related features on the fly, and (ii) an explainability module. This work is the first to perform online fault prediction with natural language and visual explainability. The experiments were performed with the MetroPT data set from the metro operator of Porto, Portugal. The results are above 98 % for F-measure and 99 % for accuracy. In the context of railway predictive maintenance, achieving these high values is crucial due to the practical and operational implications of accurate failure prediction. In the specific case of a high F-measure, this ensures that the system maintains an optimal balance between detecting the highest possible number of real faults and minimizing false alarms, which is crucial for maximizing service availability. Furthermore, the accuracy obtained enables reliability, directly impacting cost reduction and increased safety. The analysis demonstrates that the pipeline maintains high performance even in the presence of class imbalance and noise, and its explanations effectively reflect the decision-making process. These findings validate the methodological soundness of the approach and confirm its practical applicability for supporting proactive maintenance decisions in real-world railway operations. Therefore, by identifying the early signs of failure, this pipeline enables decision-makers to understand the underlying problems and act accordingly swiftly.
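Building statistical features "on the fly" means updating them one sample at a time without storing the stream. A common way to do this for mean and variance is Welford's algorithm, sketched below. Which features the paper's pre-processing module actually computes is not stated in the abstract, so this is an assumed example of the general technique, run on placeholder readings.

```python
class RunningStats:
    """Streaming mean and sample variance via Welford's algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:   # placeholder sensor readings
    stats.update(reading)
```

Each update costs constant time and memory, which is what makes this kind of feature suitable for incremental classifiers on data streams.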
[AI-21] Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms
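【Quick Read】: This paper addresses the serial workload switching that arises in reinforcement learning (RL) post-training of Large Language Models (LLMs) when trajectory sampling and policy optimisation are co-located on the same GPU cluster. This switching violates the Single-Program-Multiple-Data (SPMD) assumption underlying current distributed training systems and limits hardware utilization and scalability. The key to the solution is the Echo system, which cleanly decouples the sampling and training phases across heterogeneous "inference" and "training" swarms and introduces two lightweight synchronization protocols: a sequential pull mode that refreshes sampler weights on every API call to minimize bias, and an asynchronous push-pull mode that streams version-tagged rollouts through a replay buffer to maximize hardware utilization. When training several representative RL workloads on a geographically distributed cluster, Echo matches a fully co-located baseline in convergence speed and final reward while offloading trajectory generation to commodity edge hardware, indicating that datacentre-grade performance is achievable with decentralized resources.
Link: https://arxiv.org/abs/2508.05387
Authors: Jie Xiao,Shaoduo Gan,Changyuan Fan,Qingnan Ren,Alfred Long,Yuchen Zhang,Rymon Yu,Eric Yang,Lynn Ai
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: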
Abstract:Modern RL-based post-training for large language models (LLMs) co-locates trajectory sampling and policy optimisation on the same GPU cluster, forcing the system to switch between inference and training workloads. This serial context switching violates the single-program-multiple-data (SPMD) assumption underlying today’s distributed training systems. We present Echo, the RL system that cleanly decouples these two phases across heterogeneous “inference” and “training” swarms while preserving statistical efficiency. Echo introduces two lightweight synchronization protocols: a sequential pull mode that refreshes sampler weights on every API call for minimal bias, and an asynchronous push-pull mode that streams version-tagged rollouts through a replay buffer to maximise hardware utilisation. Training three representative RL workloads with Qwen3-4B, Qwen2.5-7B and Qwen3-32B on a geographically distributed cluster, Echo matches a fully co-located Verl baseline in convergence speed and final reward while off-loading trajectory generation to commodity edge hardware. These promising results demonstrate that large-scale RL for LLMs could achieve datacentre-grade performance using decentralised, heterogeneous resources.
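The asynchronous push-pull mode streams version-tagged rollouts through a replay buffer. One way such a buffer could work is sketched below: samplers push rollouts tagged with the policy version that produced them, and the trainer pulls batches while discarding rollouts that are too stale. The class name, the `max_staleness` rule, and the FIFO order are assumptions for illustration, not Echo's documented design.

```python
from collections import deque

class VersionedReplayBuffer:
    """FIFO buffer of (policy_version, rollout) pairs with a staleness cut-off."""

    def __init__(self, max_staleness):
        self.max_staleness = max_staleness   # hypothetical tuning knob
        self._items = deque()

    def push(self, rollout, policy_version):
        """Called by an inference worker after finishing a rollout."""
        self._items.append((policy_version, rollout))

    def pull(self, current_version, batch_size):
        """Called by the trainer; stale rollouts are dropped while scanning."""
        batch = []
        while self._items and len(batch) < batch_size:
            version, rollout = self._items.popleft()
            if current_version - version <= self.max_staleness:
                batch.append(rollout)
        return batch

buffer = VersionedReplayBuffer(max_staleness=2)
buffer.push("rollout-a", policy_version=1)
buffer.push("rollout-b", policy_version=4)
buffer.push("rollout-c", policy_version=5)
batch = buffer.pull(current_version=5, batch_size=8)
```

A staleness bound like this trades hardware utilization against off-policy bias, the same trade-off the two synchronization modes in the abstract sit on opposite sides of.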
[AI-22] StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models
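【Quick Read】: This paper addresses the limitations of existing Vision-Language Models on complex, multi-step reasoning tasks, especially where feedback on partial correctness is needed for effective learning and a conventional single binary reward cannot provide fine-grained guidance. The key to the solution is StructVRM, built around a model-based verifier that assesses the semantic and mathematical equivalence of answers to multiple sub-questions instead of relying on strict string matching. This produces fine-grained, verifiable reward scores for each sub-part of a complex problem, supports partial credit, and improves the model's reasoning on benchmarks including the high-difficulty STEM-Bench.
Link: https://arxiv.org/abs/2508.05383
Authors: Xiangxiang Zhang,Jingxuan Wei,Donghong Zhong,Qi Chen,Caijun Jia,Cheng Tan,Jinming Gu,Xiaobo Qin,Zhiping Liu,Liang Hu,Tong Sun,Yuchen Wu,Zewei Sun,Chenwei Lou,Hua Zheng,Tianyang Zhan,Changbao Wang,Shuangzhi Wu,Zefa Lin,Chang Guo,Sihang Yuan,Riwei Chen,Shixiong Zhao,Yingping Zhang,Gaowei Wu,Bihui Yu,Jiahui Wu,Zhehui Zhao,Qianqian Liu,Ruofeng Tang,Xingyue Huang,Bing Zhao,Mengyang Zhang,Youqiang Zhou
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: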
Abstract:Existing Vision-Language Models often struggle with complex, multi-question reasoning tasks where partial correctness is crucial for effective learning. Traditional reward mechanisms, which provide a single binary score for an entire response, are too coarse to guide models through intricate problems with multiple sub-parts. To address this, we introduce StructVRM, a method that aligns multimodal reasoning with Structured and Verifiable Reward Models. At its core is a model-based verifier trained to provide fine-grained, sub-question-level feedback, assessing semantic and mathematical equivalence rather than relying on rigid string matching. This allows for nuanced, partial credit scoring in previously intractable problem formats. Extensive experiments demonstrate the effectiveness of StructVRM. Our trained model, Seed-StructVRM, achieves state-of-the-art performance on six out of twelve public multimodal benchmarks and our newly curated, high-difficulty STEM-Bench. The success of StructVRM validates that training with structured, verifiable rewards is a highly effective approach for advancing the capabilities of multimodal models in complex, real-world reasoning domains.
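The difference between a single binary score and sub-question-level partial credit can be shown with a small rule-based stand-in. The paper's verifier is a trained model; the sketch below replaces it with a hypothetical `equivalent` function that treats numerically equal answers (such as `0.5` and `1/2`) as matching and otherwise falls back to case-insensitive text comparison.

```python
from fractions import Fraction

def equivalent(predicted, reference):
    """Rule-based stand-in for a learned verifier (an assumption, not StructVRM)."""
    try:
        return Fraction(predicted.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return predicted.strip().lower() == reference.strip().lower()

def structured_reward(predictions, references):
    """Return (fraction of sub-questions answered correctly, per-item scores)."""
    per_item = [1.0 if equivalent(p, r) else 0.0
                for p, r in zip(predictions, references)]
    return sum(per_item) / len(references), per_item

def binary_reward(predictions, references):
    """All-or-nothing exact string match over the whole response."""
    return 1.0 if predictions == references else 0.0

predictions = ["0.5", "paris", "7"]
references = ["1/2", "Paris", "8"]
reward, per_item = structured_reward(predictions, references)
strict = binary_reward(predictions, references)
```

Here the binary reward gives no signal at all, while the structured reward credits the two sub-answers that are correct up to formatting.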
[AI-23] Optimal Corpus Aware Training for Neural Machine Translation
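[Quick read]: This paper addresses a weakness of conventional Corpus Aware Training (CAT): the group of high-quality data must be pre-defined before training, which is error-prone and inefficient. The proposed Optimal Corpus Aware Training (OCAT) freezes most model parameters and fine-tunes only a small set of corpus-related parameters, which makes it lightweight, resilient to overfitting, and effective. On the WMT23 English-to-Chinese and English-to-German translation tasks it gains +3.6 and +1.8 chrF, respectively, over vanilla training, while performing on par with or slightly better than other state-of-the-art fine-tuning techniques and being less sensitive to hyperparameter settings.
Link: https://arxiv.org/abs/2508.05364
Authors: Yi-Hsiu Liao,Cheng Shen,Brenda(Zixiaofan)Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: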
Abstract:Corpus Aware Training (CAT) leverages valuable corpus metadata during training by injecting corpus information into each training example, and has been found effective in the literature, commonly known as the “tagging” approach. Models trained with CAT inherently learn the quality, domain and nuance between corpora directly from data, and can easily switch to different inference behavior. To achieve the best evaluation, CAT models pre-define a group of high quality data before training starts, which can be error-prone and inefficient. In this work, we propose Optimal Corpus Aware Training (OCAT), which fine-tunes a CAT pre-trained model by freezing most of the model parameters and tuning only a small set of corpus-related parameters. We show that OCAT is lightweight, resilient to overfitting, and effective in boosting model accuracy. We use WMT23 English to Chinese and English to German translation tasks as our test ground and show +3.6 and +1.8 chrF improvement, respectively, over vanilla training. Furthermore, our approach is on par with or slightly better than other state-of-the-art fine-tuning techniques while being less sensitive to hyperparameter settings.
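The "tagging" idea behind CAT, injecting corpus information into each training example, can be sketched in a few lines. The tag format `<corpus:...>` and the helper names are assumptions for illustration, not the paper's actual preprocessing, and OCAT's parameter freezing is not shown.

```python
from typing import Iterable, List, Tuple


def tag_example(source: str, corpus_id: str) -> str:
    """Prepend a corpus tag token (hypothetical format) to a source sentence."""
    return f"<corpus:{corpus_id}> {source}"


def build_tagged_pairs(
    pairs: Iterable[Tuple[str, str, str]]
) -> List[Tuple[str, str]]:
    """Turn (corpus_id, source, target) triples into tagged (source, target) pairs."""
    return [(tag_example(src, cid), tgt) for cid, src, tgt in pairs]


if __name__ == "__main__":
    data = [
        ("news", "Hello world.", "Hallo Welt."),
        ("crawl", "Good morning.", "Guten Morgen."),
    ]
    for src, tgt in build_tagged_pairs(data):
        print(src, "->", tgt)
    # At inference time the same helper selects the desired behavior,
    # e.g. by tagging the input with a corpus considered high quality.
    print(tag_example("How are you?", "news"))
```

Choosing which tag to use at inference is exactly the step that plain CAT fixes in advance and that the abstract describes as error-prone.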
[AI-24] Building Effective Safety Guardrails in AI Education Tools
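[Quick read]: This paper addresses the risks that arise when generative AI is used in education without adequate content safety and age-appropriateness, in particular inappropriate output when teachers use an AI assistant for lesson planning. The solution is a multi-layered set of safety guardrails: (1) prompt engineering to keep AI outputs pedagogically sound and aligned with the national curriculum; (2) input threat detection to mitigate attacks; (3) an Independent Asynchronous Content Moderation Agent (IACMA) that assesses generated content against predefined safety categories; and (4) a human-in-the-loop approach that encourages teachers to review content before classroom use. The paper also stresses ongoing iteration of the guardrails and cross-sector collaboration to improve the safety and reliability of generative AI education tools.
Link: https://arxiv.org/abs/2508.05360
Authors: Hannah-Beth Clark,Laura Benton,Emma Searle,Margaux Dowland,Matthew Gregory,Will Gayne,John Roberts
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 9 pages, published in proceedings of International Conference on Artificial Intelligence in Education (AIED) 2025: Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium, Blue Sky, and WideAIED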
Abstract:There has been rapid development in generative AI tools across the education sector, which in turn is leading to increased adoption by teachers. However, this raises concerns regarding the safety and age-appropriateness of the AI-generated content that is being created for use in classrooms. This paper explores Oak National Academy’s approach to addressing these concerns within the development of the UK Government’s first publicly available generative AI tool - our AI-powered lesson planning assistant (Aila). Aila is intended to support teachers planning national curriculum-aligned lessons that are appropriate for pupils aged 5-16 years. To mitigate safety risks associated with AI-generated content we have implemented four key safety guardrails - (1) prompt engineering to ensure AI outputs are generated within pedagogically sound and curriculum-aligned parameters, (2) input threat detection to mitigate attacks, (3) an Independent Asynchronous Content Moderation Agent (IACMA) to assess outputs against predefined safety categories, and (4) taking a human-in-the-loop approach, to encourage teachers to review generated content before it is used in the classroom. Through our on-going evaluation of these safety guardrails we have identified several challenges and opportunities to take into account when implementing and testing safety guardrails. This paper highlights ways to build more effective safety guardrails in generative AI education tools including the on-going iteration and refinement of guardrails, as well as enabling cross-sector collaboration through sharing both open-source code, datasets and learnings.
[AI-25] Multi-Modal Multi-Behavior Sequential Recommendation with Conditional Diffusion-Based Feature Denoising SIGIR2025
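[Quick read]: This paper tackles three challenges in multi-modal multi-behavior sequential recommendation: (1) user preferences over item modalities vary with the behavior and are hard to characterize; (2) implicit feedback contains implicit noise, such as accidental clicks, that is hard to suppress; and (3) modality noise in multi-modal representations degrades the modeling of user preferences. The proposed M³BSR model has three core modules. A Conditional Diffusion Modality Denoising Layer removes noise from the multi-modal representations. Conditional Diffusion Behavior Denoising then uses deep behavioral information to guide the denoising of shallow behavioral data, reducing the impact of noisy implicit feedback. Finally, a Multi-Expert Interest Extraction Layer explicitly models the common and specific interests across behaviors and modalities to improve recommendation performance.
Link: https://arxiv.org/abs/2508.05352
Authors: Xiaoxi Cui,Weihai Lu,Yu Tong,Yiheng Li,Zhejun Zhao
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: SIGIR 2025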
Abstract:The sequential recommendation system utilizes historical user interactions to predict preferences. Effectively integrating diverse user behavior patterns with rich multimodal information of items to enhance the accuracy of sequential recommendations is an emerging and challenging research direction. This paper focuses on the problem of multi-modal multi-behavior sequential recommendation, aiming to address the following challenges: (1) the lack of effective characterization of modal preferences across different behaviors, as user attention to different item modalities varies depending on the behavior; (2) the difficulty of effectively mitigating implicit noise in user behavior, such as unintended actions like accidental clicks; (3) the inability to handle modality noise in multi-modal representations, which further impacts the accurate modeling of user preferences. To tackle these issues, we propose a novel Multi-Modal Multi-Behavior Sequential Recommendation model (M³BSR). This model first removes noise in multi-modal representations using a Conditional Diffusion Modality Denoising Layer. Subsequently, it utilizes deep behavioral information to guide the denoising of shallow behavioral data, thereby alleviating the impact of noise in implicit feedback through Conditional Diffusion Behavior Denoising. Finally, by introducing a Multi-Expert Interest Extraction Layer, M³BSR explicitly models the common and specific interests across behaviors and modalities to enhance recommendation performance. Experimental results indicate that M³BSR significantly outperforms existing state-of-the-art methods on benchmark datasets.
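[AI-26] Minimal Model Reasoning in Description Logics: Don't Try This at Home!
[Quick read]: This paper studies concept satisfiability under 'pure' minimal models in Description Logics (DLs), that is, deciding whether a concept is satisfiable in a minimal model when the extensions of all predicates must be minimal. The problem had been little explored in DLs, and the paper shows that it is very hard: it is undecidable already for the basic logic EL, and the undecidability extends to a very restricted fragment of tuple-generating dependencies. To regain decidability, the authors impose acyclicity conditions on the TBox, which bring the worst-case complexity below double exponential time and establish a connection with the recently studied pointwise circumscription; they also derive data complexity results. Finally, in the DL-Lite family, where a positive result was known for DL-Lite_core, they show ExpSpace-hardness already for its extension DL-Lite_horn. The key to the solution is a structural restriction (acyclicity) that keeps reasoning complexity under control.
Link: https://arxiv.org/abs/2508.05350
Authors: Federica Di Stefano,Quentin Manière,Magdalena Ortiz,Mantas Šimkus
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Logic in Computer Science (cs.LO)
Comments: 44 pages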
Abstract:Reasoning with minimal models has always been at the core of many knowledge representation techniques, but we still have only a limited understanding of this problem in Description Logics (DLs). Minimization of some selected predicates, letting the remaining predicates vary or be fixed, as proposed in circumscription, has been explored and exhibits high complexity. The case of ‘pure’ minimal models, where the extension of all predicates must be minimal, has remained largely uncharted. We address this problem in popular DLs and obtain surprisingly negative results: concept satisfiability in minimal models is undecidable already for $\mathcal{EL}$. This undecidability also extends to a very restricted fragment of tuple-generating dependencies. To regain decidability, we impose acyclicity conditions on the TBox that bring the worst-case complexity below double exponential time and allow us to establish a connection with the recently studied pointwise circumscription; we also derive results in data complexity. We conclude with a brief excursion to the DL-Lite family, where a positive result was known for DL-Lite$_{\text{core}}$, but our investigation establishes ExpSpace-hardness already for its extension DL-Lite$_{\text{horn}}$.
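[AI-27] NomicLaw: Emergent Trust and Strategic Argumentation in LLMs During Collaborative Law-Making
[Quick read]: This paper addresses the lack of empirical understanding of how Large Language Models (LLMs) behave in open-ended, multi-agent settings, especially when deliberating over legal and ethical dilemmas. The authors introduce NomicLaw, a structured multi-agent simulation in which LLMs make law collaboratively by proposing rules, justifying them, and voting on peer proposals. The key to the approach is to quantify trust and reciprocity through voting patterns and to assess qualitatively how agents use strategic language to influence decisions. This reveals the social reasoning and persuasive capabilities of LLMs as they spontaneously form alliances, betray trust, and adapt their rhetoric to shape collective outcomes.
Link: https://arxiv.org/abs/2508.05344
Authors: Asutosh Hota,Jussi P.P. Jokinen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: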
Abstract:Recent advancements in large language models (LLMs) have extended their capabilities from basic text processing to complex reasoning tasks, including legal interpretation, argumentation, and strategic interaction. However, empirical understanding of LLM behavior in open-ended, multi-agent settings, especially those involving deliberation over legal and ethical dilemmas, remains limited. We introduce NomicLaw, a structured multi-agent simulation where LLMs engage in collaborative law-making, responding to complex legal vignettes by proposing rules, justifying them, and voting on peer proposals. We quantitatively measure trust and reciprocity via voting patterns and qualitatively assess how agents use strategic language to justify proposals and influence outcomes. Experiments involving homogeneous and heterogeneous LLM groups demonstrate how agents spontaneously form alliances, betray trust, and adapt their rhetoric to shape collective decisions. Our results highlight the latent social reasoning and persuasive capabilities of ten open-source LLMs and provide insights into the design of future AI systems capable of autonomous negotiation, coordination and drafting legislation in legal settings.
[AI-28] Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control
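[Quick read]: This paper addresses the difficulty of teaching robots dexterous manipulation skills from human videos, where conventional low-level trajectory imitation fails to generalize across object types, spatial layouts, and manipulator configurations. The proposed Graph-Fused Vision-Language-Action (GF-VLA) framework extracts Shannon-information-based cues to identify the hands and objects with the highest task relevance and encodes them as temporally ordered scene graphs that capture hand-object and object-object interactions. These graphs are fused with a language-conditioned transformer to generate hierarchical behavior trees and interpretable Cartesian motion commands. A cross-hand selection policy further optimizes gripper assignment in dual-arm settings without explicit geometric reasoning. Across several structured dual-arm block assembly tasks, the method reaches a 90% overall task success rate and shows strong generalization and robustness.
Link: https://arxiv.org/abs/2508.05342
Authors: Shunlei Li,Longsen Gao,Jin Wang,Chang Che,Xi Xiao,Jiuwen Cao,Yingbai Hu,Hamid Reza Karimi
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Journal under review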
Abstract:Teaching robots dexterous skills from human videos remains challenging due to the reliance on low-level trajectory imitation, which fails to generalize across object types, spatial layouts, and manipulator configurations. We propose Graph-Fused Vision-Language-Action (GF-VLA), a framework that enables dual-arm robotic systems to perform task-level reasoning and execution directly from RGB and Depth human demonstrations. GF-VLA first extracts Shannon-information-based cues to identify hands and objects with the highest task relevance, then encodes these cues into temporally ordered scene graphs that capture both hand-object and object-object interactions. These graphs are fused with a language-conditioned transformer that generates hierarchical behavior trees and interpretable Cartesian motion commands. To improve execution efficiency in bimanual settings, we further introduce a cross-hand selection policy that infers optimal gripper assignment without explicit geometric reasoning. We evaluate GF-VLA on four structured dual-arm block assembly tasks involving symbolic shape construction and spatial generalization. Experimental results show that the information-theoretic scene representation achieves over 95 percent graph accuracy and 93 percent subtask segmentation, supporting the LLM planner in generating reliable and human-readable task policies. When executed by the dual-arm robot, these policies yield 94 percent grasp success, 89 percent placement accuracy, and 90 percent overall task success across stacking, letter-building, and geometric reconfiguration scenarios, demonstrating strong generalization and robustness across diverse spatial and semantic variations.
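[AI-29] The Term Agent Has Been Diluted Beyond Utility and Requires Redefinition
[Quick read]: This paper addresses the problems caused by the many meanings of the term "agent" in artificial intelligence: obstacles to research communication, difficulties in system evaluation and reproducibility, and challenges for policy development. The key to the solution is a structured framework that defines clear minimum requirements for a system to count as an agent and characterizes systems along the dimensions of environmental interaction, learning and adaptation, autonomy, goal complexity, and temporal coherence. The framework offers precise descriptive vocabulary while preserving the term's historically multifaceted nature, supporting clearer research, better reproducibility, and more effective policy development.
Link: https://arxiv.org/abs/2508.05338
Authors: Brinnae Bent
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted to AIES 2025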
Abstract:The term ‘agent’ in artificial intelligence has long carried multiple interpretations across different subfields. Recent developments in AI capabilities, particularly in large language model systems, have amplified this ambiguity, creating significant challenges in research communication, system evaluation and reproducibility, and policy development. This paper argues that the term ‘agent’ requires redefinition. Drawing from historical analysis and contemporary usage patterns, we propose a framework that defines clear minimum requirements for a system to be considered an agent while characterizing systems along a multidimensional spectrum of environmental interaction, learning and adaptation, autonomy, goal complexity, and temporal coherence. This approach provides precise vocabulary for system description while preserving the term’s historically multifaceted nature. After examining potential counterarguments and implementation challenges, we provide specific recommendations for moving forward as a field, including suggestions for terminology standardization and framework adoption. The proposed approach offers practical tools for improving research clarity and reproducibility while supporting more effective policy development.
[AI-30] ASkDAgger: Active Skill-level Data Aggregation for Interactive Imitation Learning
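[Quick read]: This paper addresses the bottleneck of high human teaching effort in interactive imitation learning, and in particular how to reduce the number of queries to the human teacher. Existing methods use active learning to query the teacher only in uncertain, risky, or novel situations, but they ignore the novice's planned actions and the associated uncertainty. The proposed Active Skill-level Data Aggregation (ASkDAgger) framework lets the novice say "I plan to do this, but I am uncertain" and exploits the teacher's feedback through three mechanisms: (1) S-Aware Gating (SAG), which adjusts the gating threshold to track sensitivity, specificity, or a minimum success rate; (2) Foresight Interactive Experience Replay (FIER), which recasts valid and relabeled novice action plans into demonstrations; and (3) Prioritized Interactive Experience Replay (PIER), which prioritizes replay based on uncertainty, novice success, and demonstration age. The framework reduces the number of required demonstration annotations, improves generalization, and speeds up adaptation to changing domains.
Link: https://arxiv.org/abs/2508.05310
Authors: Jelle Luijkx,Zlatan Ajanović,Laura Ferranti,Jens Kober
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: Accepted for publication in Transactions on Machine Learning Research (TMLR, 2025)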
Abstract:Human teaching effort is a significant bottleneck for the broader applicability of interactive imitation learning. To reduce the number of required queries, existing methods employ active learning to query the human teacher only in uncertain, risky, or novel situations. However, during these queries, the novice’s planned actions are not utilized despite containing valuable information, such as the novice’s capabilities, as well as corresponding uncertainty levels. To this end, we allow the novice to say: “I plan to do this, but I am uncertain.” We introduce the Active Skill-level Data Aggregation (ASkDAgger) framework, which leverages teacher feedback on the novice plan in three key ways: (1) S-Aware Gating (SAG): Adjusts the gating threshold to track sensitivity, specificity, or a minimum success rate; (2) Foresight Interactive Experience Replay (FIER), which recasts valid and relabeled novice action plans into demonstrations; and (3) Prioritized Interactive Experience Replay (PIER), which prioritizes replay based on uncertainty, novice success, and demonstration age. Together, these components balance query frequency with failure incidence, reduce the number of required demonstration annotations, improve generalization, and speed up adaptation to changing domains. We validate the effectiveness of ASkDAgger through language-conditioned manipulation tasks in both simulation and real-world environments. Code, data, and videos are available at this https URL.
[AI-31] Estimating Musical Surprisal from Audio in Autoregressive Diffusion Model Noise Spaces
[Quick read]: This paper asks how to model musical expectancy and surprisal in audio more effectively, and in particular whether information content (IC) computed with autoregressive diffusion models (ADMs) can improve on a Generative Infinite-Vocabulary Transformer (GIVT). The key to the approach is to estimate IC with ADMs based on two different diffusion ordinary differential equations (ODEs) and to relate the surprisal estimated at different noise levels of the diffusion process to music and audio features at different audio granularities. On two tasks, capturing monophonic pitch surprisal and detecting segment boundaries in multi-track audio, the diffusion models match or exceed the GIVT baseline, and results improve further at appropriate noise levels.
Link: https://arxiv.org/abs/2508.05306
Authors: Mathias Rose Bjare,Stefan Lattner,Gerhard Widmer
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 9 pages, 1 figure, 5 tables. Accepted at the 25th International Society for Music Information Retrieval Conference (ISMIR), Daejeon, South Korea, 2025
Abstract:Recently, the information content (IC) of predictions from a Generative Infinite-Vocabulary Transformer (GIVT) has been used to model musical expectancy and surprisal in audio. We investigate the effectiveness of such modelling using IC calculated with autoregressive diffusion models (ADMs). We empirically show that IC estimates of models based on two different diffusion ordinary differential equations (ODEs) describe diverse data better, in terms of negative log-likelihood, than a GIVT. We evaluate diffusion model IC’s effectiveness in capturing surprisal aspects by examining two tasks: (1) capturing monophonic pitch surprisal, and (2) detecting segment boundaries in multi-track audio. In both tasks, the diffusion models match or exceed the performance of a GIVT. We hypothesize that the surprisal estimated at different diffusion process noise levels corresponds to the surprisal of music and audio features present at different audio granularities. Testing our hypothesis, we find that, for appropriate noise levels, the studied musical surprisal tasks’ results improve. Code is provided on this http URL.
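For readers unfamiliar with the term, information content is the negative log-probability a model assigns to an event, and its average over a sequence is the negative log-likelihood the abstract uses for comparison. The sketch below uses made-up discrete probabilities purely to show the arithmetic; the paper works with autoregressive diffusion models over audio, where the quantities are densities rather than probabilities, and none of that machinery is reproduced here.

```python
import math
from typing import Sequence


def information_content(prob: float) -> float:
    """Information content (surprisal) in bits of an event with probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)


def mean_information_content(probs: Sequence[float]) -> float:
    """Average surprisal over a sequence, i.e. negative log-likelihood per event."""
    if not probs:
        raise ValueError("probs must not be empty")
    return sum(information_content(p) for p in probs) / len(probs)


if __name__ == "__main__":
    # Hypothetical per-note probabilities from some predictive model:
    # the third note is unexpected and therefore carries high surprisal.
    note_probs = [0.5, 0.5, 0.125, 0.5]
    print([information_content(p) for p in note_probs])
    print(mean_information_content(note_probs))
```

A spike in per-event information content is the kind of signal that tasks such as segment boundary detection look for.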
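[AI-32] Towards Embodied Agentic AI: Review and Classification of LLM- and VLM-Driven Robot Autonomy and Interaction
[Quick read]: This survey addresses the difficulty of planning and executing complex tasks in robot autonomy and human-robot interaction, and in particular how generative AI models can make robotic systems more capable and more general. It focuses on agentic architectures in which AI agents act as coordinators, planners, perception actors, or generalist interfaces that reason over natural language instructions, invoke APIs, plan task sequences, or assist in operations and diagnostics, enabling more flexible and extensible control of robot behavior. The paper proposes a taxonomy of model integration approaches and compares the roles that agents play across the literature, giving a structured view of the field and its trends.
Link: https://arxiv.org/abs/2508.05294
Authors: Sahar Salimpour,Lei Fu,Farhad Keramat,Leonardo Militano,Giovanni Toffetti,Harry Edelman,Jorge Peña Queralta
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: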
Abstract:Foundation models, including large language models (LLMs) and vision-language models (VLMs), have recently enabled novel approaches to robot autonomy and human-robot interfaces. In parallel, vision-language-action models (VLAs) or large behavior models (LBMs) are increasing the dexterity and capabilities of robotic systems. This survey paper focuses on those works advancing towards agentic applications and architectures. This includes initial efforts exploring GPT-style interfaces to tooling, as well as more complex systems where AI agents are coordinators, planners, perception actors, or generalist interfaces. Such agentic architectures allow robots to reason over natural language instructions, invoke APIs, plan task sequences, or assist in operations and diagnostics. In addition to peer-reviewed research, due to the fast-evolving nature of the field, we highlight and include community-driven projects, ROS packages, and industrial frameworks that show emerging trends. We propose a taxonomy for classifying model integration approaches and present a comparative analysis of the role that agents play in different solutions in today’s literature.
[AI-33] FlowState: Sampling Rate Invariant Time Series Forecasting AAAI2026
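[Quick read]: This paper addresses the limitations of existing Time Series Foundation Models (TSFMs): poor generalization across varying context and target lengths, lack of adaptability to different sampling rates, and computational inefficiency. The proposed architecture, FlowState, rests on two innovations, a State Space Model (SSM) based encoder and a functional basis decoder. Together they enable continuous-time modeling and dynamic time-scale adjustment, so the model inherently generalizes across temporal resolutions and adapts online to different input sampling rates, improving both efficiency and performance.
Link: https://arxiv.org/abs/2508.05287
Authors: Lars Graf,Thomas Ortner,Stanisław Woźniak,Angeliki Pantazi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Currently under review at AAAI 2026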
Abstract:Foundation models (FMs) have transformed natural language processing, but their success has not yet translated to time series forecasting. Existing time series foundation models (TSFMs), often based on transformer variants, struggle with generalization across varying context and target lengths, lack adaptability to different sampling rates, and are computationally inefficient. We introduce FlowState, a novel TSFM architecture that addresses these challenges through two key innovations: a state space model (SSM) based encoder and a functional basis decoder. This design enables continuous-time modeling and dynamic time-scale adjustment, allowing FlowState to inherently generalize across all possible temporal resolutions, and dynamically adjust the forecasting horizons. In contrast to other state-of-the-art TSFMs, which require training data across all possible sampling rates to memorize patterns at each scale, FlowState inherently adapts its internal dynamics to the input scale, enabling smaller models, reduced data requirements, and improved efficiency. We further propose an efficient pretraining strategy that improves robustness and accelerates training. Despite being the smallest model, FlowState outperforms all other models and is state-of-the-art for the GIFT-ZS and the Chronos-ZS benchmarks. Ablation studies confirm the effectiveness of its components, and we demonstrate its unique ability to adapt online to varying input sampling rates.
[AI-34] An Explainable Natural Language Framework for Identifying and Notifying Target Audiences In Enterprise Communication ISWC2025
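[Quick read]: This paper addresses the problem of identifying Subject Matter Experts (SMEs) and communicating efficiently across complex entity relationships in large maintenance organizations, where traditional communication suffers from information overload and slow responses. The key to the solution is a framework that combines RDF graph databases with Large Language Models (LLMs): natural language queries are used for precise audience targeting, and a planning-orchestration architecture exposes the reasoning behind each result, improving communication efficiency while keeping the results trustworthy.
Link: https://arxiv.org/abs/2508.05267
Authors: Vítor N. Lourenço,Mohnish Dubey,Yunfei Bai,Audrey Depeige,Vivek Jain
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted for publication at the 24th International Semantic Web Conference Industry Track, ISWC 2025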
Abstract:In large-scale maintenance organizations, identifying subject matter experts and managing communications across complex entity relationships pose significant challenges – including information overload and longer response times – that traditional communication approaches fail to address effectively. We propose a novel framework that combines RDF graph databases with LLMs to process natural language queries for precise audience targeting, while providing transparent reasoning through a planning-orchestration architecture. Our solution enables communication owners to formulate intuitive queries combining concepts such as equipment, manufacturers, maintenance engineers, and facilities, delivering explainable results that maintain trust in the system while improving communication efficiency across the organization.
[AI-35] Marine Chlorophyll Prediction and Driver Analysis based on LSTM-RF Hybrid Models
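[Quick read]: This paper addresses the accurate prediction of marine chlorophyll concentration, a key indicator of ecosystem health and carbon cycle strength that matters for red tide warning and ecological response. To overcome the limits of single models in time-series modeling and nonlinear feature representation, the authors propose an LSTM-RF hybrid model that combines the temporal dependency modeling of long short-term memory (LSTM) networks with the ability of Random Forest (RF) to capture nonlinear relationships. Trained on multi-source ocean data (temperature, salinity, dissolved oxygen, etc.), the hybrid reaches an R² of 0.5386, better than LSTM alone (R² = 0.0208) or RF alone (R² = 0.4934). Standardization and a sliding window approach further improve accuracy, offering a solution for high-frequency prediction of marine ecological variables.
Link: https://arxiv.org/abs/2508.05260
Authors: Zhouyao Qian,Yang Chen,Baodian Li,Shuyi Zhang,Zhen Tian,Gongsen Wang,Tianyue Gu,Xinyu Zhou,Huilin Chen,Xinyi Li,Hao Zhu,Shuyao Zhang,Zongheng Li,Siyuan Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE 5th International Conference on Advanced Algorithms and Neural Networks (AANN)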
Abstract:Marine chlorophyll concentration is an important indicator of ecosystem health and carbon cycle strength, and its accurate prediction is crucial for red tide warning and ecological response. In this paper, we propose an LSTM-RF hybrid model that combines the advantages of LSTM and RF, which solves the deficiencies of a single model in time-series modelling and nonlinear feature portrayal. Trained with multi-source ocean data (temperature, salinity, dissolved oxygen, etc.), the LSTM-RF model achieves an R^2 of 0.5386, an MSE of 0.005806, and an MAE of 0.057147 on the test set, which is significantly better than using LSTM (R^2 = 0.0208) or RF (R^2 = 0.4934) alone. The standardised treatment and sliding window approach improved the prediction accuracy of the model and provided an innovative solution for high-frequency prediction of marine ecological variables.
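The sliding window approach and the three reported metrics (R², MSE, MAE) can be made concrete with a short, dependency-free sketch. The window length, helper names, and sample values are assumptions for illustration; the paper's actual preprocessing and its LSTM and RF models are not reproduced.

```python
from typing import List, Sequence, Tuple


def sliding_windows(series: Sequence[float],
                    window: int) -> List[Tuple[List[float], float]]:
    """Pair each run of `window` past values with the next value as target."""
    return [(list(series[i:i + window]), series[i + window])
            for i in range(len(series) - window)]


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    chlorophyll = [0.2, 0.3, 0.5, 0.4, 0.6]  # made-up sample values
    for inputs, target in sliding_windows(chlorophyll, window=3):
        print(inputs, "->", target)
    print(r2_score([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5]))
```

An R² of 0 corresponds to always predicting the mean, which puts the reported LSTM-only score of 0.0208 in context: it is barely better than that baseline.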
[AI-36] Driver Assistant: Persuading Drivers to Adjust Secondary Tasks Using Large Language Models
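[Quick read]: This paper addresses the problem that drivers using Level 3 automated driving systems become distracted by secondary tasks and struggle to take over the vehicle in time during an emergency. The system permits distraction, yet when intervention is needed it leaves only a limited window for reaction and imposes a substantial cognitive burden. The key to the solution is to use a Large Language Model (LLM) to generate "humanized" persuasive advice, triggered by the road conditions that the Level 3 system encounters and delivered proactively through both visual and auditory channels. This sustains appropriate attention to the road with reduced cognitive load and helps coordinate secondary tasks with takeover behavior.
Link: https://arxiv.org/abs/2508.05238
Authors: Wei Xiang,Muchen Li,Jie Yan,Manling Zheng,Hanfei Zhu,Mengyun Jiang,Lingyun Sun
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures, 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC)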
Abstract:Level 3 automated driving systems allow drivers to engage in secondary tasks while diminishing their perception of risk. In the event of an emergency necessitating driver intervention, the system will alert the driver with a limited window for reaction, imposing a substantial cognitive burden. To address this challenge, this study employs a Large Language Model (LLM) to assist drivers in maintaining appropriate attention to road conditions through “humanized” persuasive advice. Our tool leverages the road conditions encountered by Level 3 systems as triggers, proactively steering driver behavior via both visual and auditory routes. An empirical study indicates that our tool is effective in sustaining driver attention with reduced cognitive load and coordinating secondary tasks with takeover behavior. Our work provides insights into the potential of using LLMs to support drivers during multi-task automated driving.
[AI-37] FDC-Net: Rethinking the association between EEG artifact removal and multi-dimensional affective computing
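[Quick read]: This paper addresses the performance drop that physiological artifacts cause in electroencephalogram (EEG) emotion recognition. Existing methods treat denoising and emotion recognition as independent tasks, which accumulates errors and leaves the synergy between the tasks unexploited. The key to the solution is an end-to-end noise-robust emotion recognition framework, the Feedback-Driven Collaborative Network (FDC-Net). It couples the denoising and classification tasks dynamically through bidirectional gradient propagation with joint optimization, and adds a gated attention mechanism integrated with a frequency-adaptive Transformer using learnable band-position encoding. The result is high denoising accuracy together with improved emotion recognition accuracy under complex artifact interference.
Link: https://arxiv.org/abs/2508.05231
Authors: Wenjia Dong,Xueyuan Xu,Tianze Yu,Junming Zhang,Li Zhuo
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: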
Abstract:Electroencephalogram (EEG)-based emotion recognition holds significant value in affective computing and brain-computer interfaces. However, in practical applications, EEG recordings are susceptible to the effects of various physiological artifacts. Current approaches typically treat denoising and emotion recognition as independent tasks using cascaded architectures, which not only leads to error accumulation, but also fails to exploit potential synergies between these tasks. Moreover, conventional EEG-based emotion recognition models often rely on the idealized assumption of “perfectly denoised data”, lacking a systematic design for noise robustness. To address these challenges, a novel framework that deeply couples denoising and emotion recognition tasks is proposed for end-to-end noise-robust emotion recognition, termed the Feedback-Driven Collaborative Network for Denoising-Classification Nexus (FDC-Net). Our primary innovation lies in establishing a dynamic collaborative mechanism between artifact removal and emotion recognition through: (1) bidirectional gradient propagation with joint optimization strategies; (2) a gated attention mechanism integrated with a frequency-adaptive Transformer using learnable band-position encoding. The two most popular EEG-based emotion datasets (DEAP and DREAMER) with multi-dimensional emotional labels were employed to compare the artifact removal and emotion recognition performance between ASLSL and nine state-of-the-art methods. In terms of the denoising task, FDC-Net obtains a maximum correlation coefficient (CC) value of 96.30% on DEAP and a maximum CC value of 90.31% on DREAMER. In terms of the emotion recognition task under physiological artifact interference, FDC-Net achieves emotion recognition accuracies of 82.3±7.1% on DEAP and 88.1±0.8% on DREAMER.
[AI-38] ADSEL: Adaptive dual self-expression learning for EEG feature selection via incomplete multi-dimensional emotional tagging
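[Quick read]: This paper addresses classifier overfitting and high computational complexity in EEG-based multi-dimensional emotion recognition, which stem from high feature dimensionality and limited sample sizes, and it targets the practical case where multi-dimensional emotion labels are incomplete. Existing feature selection methods usually assume complete labels and neglect the correlation between samples in the label space and its interaction with the label dimensions, which hurts generalization. The key to the solution is a new incomplete multi-dimensional feature selection algorithm that combines adaptive dual self-expression learning (ADSEL) with least squares regression. ADSEL builds a bidirectional pathway between sample-level and dimension-level self-expression learning in the label space so that the two processes share learned information. Label reconstruction can then exploit information across both samples and dimensions, which improves label recovery accuracy and identifies the optimal EEG feature subset for multi-dimensional emotion recognition.
Link: https://arxiv.org/abs/2508.05229
Authors: Tianze Yu,Junming Zhang,Wenjia Dong,Xueyuan Xu,Li Zhuo
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: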
Abstract:EEG-based multi-dimensional emotion recognition has attracted substantial research interest in human-computer interfaces. However, the high dimensionality of EEG features, coupled with limited sample sizes, frequently leads to classifier overfitting and high computational complexity. Feature selection constitutes a critical strategy for mitigating these challenges. Most existing EEG feature selection methods assume complete multi-dimensional emotion labels. In practice, open acquisition environments and the inherent subjectivity of emotion perception often result in incomplete label data, which can compromise model generalization. Additionally, existing feature selection methods for handling incomplete multi-dimensional labels primarily focus on correlations among various dimensions during label recovery, neglecting the correlation between samples in the label space and their interaction with various dimensions. To address these issues, we propose a novel incomplete multi-dimensional feature selection algorithm for EEG-based emotion recognition. The proposed method integrates adaptive dual self-expression learning (ADSEL) with least squares regression. ADSEL establishes a bidirectional pathway between sample-level and dimension-level self-expression learning processes within the label space. It facilitates the cross-sharing of learned information between these processes, enabling the simultaneous exploitation of effective information across both samples and dimensions for label reconstruction. Consequently, ADSEL enhances label recovery accuracy and effectively identifies the optimal EEG feature subset for multi-dimensional emotion recognition.
[AI-39] CWEFS: Brain volume conduction effects inspired channel-wise EEG feature selection for multi-dimensional emotion recognition
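[Quick read]: This paper addresses the redundant and irrelevant information in high-dimensional multi-channel EEG features, which hinders the extraction of discriminative emotional features and degrades real-time performance. Existing feature selection methods overlook the influence of latent EEG feature structure on emotional label correlations and assume that all channels are equally important, which limits how precisely EEG feature selection models can be built for multi-dimensional affective computing. The key to the solution is a channel-wise EEG feature selection (CWEFS) method: a shared latent structure model constructs a consensus latent space across EEG channels; to preserve local geometric structure, this space is combined with latent semantic analysis of the multi-dimensional emotional labels, which strengthens the modelling of label correlations; and adaptive channel-weight learning automatically determines each channel's importance for emotional feature selection, giving more accurate and interpretable emotion recognition.
Link: https://arxiv.org/abs/2508.05228
Authors: Xueyuan Xu,Wenjia Dong,Fulin Wei,Li Zhuo
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: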
Abstract:Due to the intracranial volume conduction effects, high-dimensional multi-channel electroencephalography (EEG) features often contain substantial redundant and irrelevant information. This issue not only hinders the extraction of discriminative emotional representations but also compromises the real-time performance. Feature selection has been established as an effective approach to address the challenges while enhancing the transparency and interpretability of emotion recognition models. However, existing EEG feature selection research overlooks the influence of latent EEG feature structures on emotional label correlations and assumes uniform importance across various channels, directly limiting the precise construction of EEG feature selection models for multi-dimensional affective computing. To address these limitations, a novel channel-wise EEG feature selection (CWEFS) method is proposed for multi-dimensional emotion recognition. Specifically, inspired by brain volume conduction effects, CWEFS integrates EEG emotional feature selection into a shared latent structure model designed to construct a consensus latent space across diverse EEG channels. To preserve the local geometric structure, this consensus space is further integrated with the latent semantic analysis of multi-dimensional emotional labels. Additionally, CWEFS incorporates adaptive channel-weight learning to automatically determine the significance of different EEG channels in the emotional feature selection task. The effectiveness of CWEFS was validated using three popular EEG datasets with multi-dimensional emotional labels. Comprehensive experimental results, compared against nineteen feature selection methods, demonstrate that the EEG feature subsets chosen by CWEFS achieve optimal emotion recognition performance across six evaluation metrics.
[AI-40] Advanced Hybrid Transformer LSTM Technique with Attention and TS Mixer for Drilling Rate of Penetration Prediction
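[Quick read]: This paper addresses the limited accuracy of Rate of Penetration (ROP) prediction in drilling: traditional methods struggle to capture the complex temporal dependencies and high-dimensional feature interactions in drilling data, which leads to poor predictions and limited real-time utility. The key to the solution is a hybrid deep learning architecture that combines Long Short-Term Memory (LSTM) networks, Transformer encoders, Time-Series Mixer (TS-Mixer) blocks, and attention mechanisms to jointly model temporal dependencies, static feature interactions, global context, and dynamic feature importance. On a real-world drilling dataset it reaches an R² of 0.9988 and a Mean Absolute Percentage Error (MAPE) of 1.447%, while SHAP and LIME are used for interpretability and bias checks are used to assess accuracy and fairness across scenarios.
Link: https://arxiv.org/abs/2508.05210
Authors: Saddam Hussain Khan (Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat, Pakistan)
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 37 Pages, 19 Figures, 9 Tables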
Abstract:The Rate of Penetration (ROP) is crucial for optimizing drilling operations; however, accurately predicting it is hindered by the complex, dynamic, and high-dimensional nature of drilling data. Traditional empirical, physics-based, and basic machine learning models often fail to capture intricate temporal and contextual relationships, resulting in suboptimal predictions and limited real-time utility. To address this gap, we propose a novel hybrid deep learning architecture integrating Long Short-Term Memory (LSTM) networks, Transformer encoders, Time-Series Mixer (TS-Mixer) blocks, and attention mechanisms to synergistically model temporal dependencies, static feature interactions, global context, and dynamic feature importance. Evaluated on a real-world drilling dataset, our model outperformed benchmarks (standalone LSTM, TS-Mixer, and simpler hybrids) with an R-squared score of 0.9988 and a Mean Absolute Percentage Error of 1.447%, as measured by standard regression metrics (R-squared, MAE, RMSE, MAPE). Model interpretability was ensured using SHAP and LIME, while actual vs. predicted curves and bias checks confirmed accuracy and fairness across scenarios. This advanced hybrid approach enables reliable real-time ROP prediction, paving the way for intelligent, cost-effective drilling optimization systems with significant operational impact.
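The two headline metrics in the abstract above, R-squared and MAPE, follow standard textbook definitions. The following is a minimal sketch of how they are computed; the sample readings and the function names `r_squared` and `mape` are hypothetical and are not taken from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (y_true must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ROP readings and model predictions (illustrative only).
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [11.0, 19.0, 30.0, 42.0]

print(f"R^2  = {r_squared(actual, predicted):.3f}")
print(f"MAPE = {mape(actual, predicted):.1f}%")
```

Note that MAPE is undefined when a true value is zero, so it only suits targets that stay away from zero.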
[AI-41] SpectroStream: A Versatile Neural Codec for General Audio
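[Quick read]: This paper addresses the difficulty conventional neural audio codecs have in reconstructing high-quality audio at low bit rates (4–16 kbps) for high sample rates (e.g. 48 kHz) and multiple channels (e.g. stereo). The key to the solution is the SpectroStream architecture, which represents audio in the time-frequency domain and thereby improves quality at high sample rates, together with a delayed-fusion strategy for multi-channel audio that balances per-channel acoustic quality against cross-channel phase consistency.
Link: https://arxiv.org/abs/2508.05207
Authors: Yunpeng Li,Kehang Han,Brian McWilliams,Zalan Borsos,Marco Tagliasacchi
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: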
Abstract:We propose SpectroStream, a full-band multi-channel neural audio codec. Successor to the well-established SoundStream, SpectroStream extends its capability beyond 24 kHz monophonic audio and enables high-quality reconstruction of 48 kHz stereo music at bit rates of 4–16 kbps. This is accomplished with a new neural architecture that leverages audio representation in the time-frequency domain, which leads to better audio quality especially at higher sample rates. The model also uses a delayed-fusion strategy to handle multi-channel audio, which is crucial in balancing per-channel acoustic quality and cross-channel phase consistency.
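To put the 4–16 kbps figure in perspective, a rough back-of-the-envelope calculation compares it with uncompressed PCM. The 16-bit sample depth is an assumption made here for illustration; the abstract does not state the reference format.

```python
def pcm_bitrate_kbps(sample_rate_hz, channels, bits_per_sample):
    """Bit rate of uncompressed PCM audio in kbit/s."""
    return sample_rate_hz * channels * bits_per_sample / 1000.0


# Assumed reference: 48 kHz stereo at 16 bits per sample (not stated in the abstract).
raw = pcm_bitrate_kbps(48_000, 2, 16)

# Compression ratio at the two ends of the codec's stated bit-rate range.
ratios = {kbps: raw / kbps for kbps in (4, 16)}

print(f"raw PCM: {raw:.0f} kbps")
for kbps, ratio in ratios.items():
    print(f"{kbps} kbps -> {ratio:.0f}x smaller than raw PCM")
```

Under that assumption the codec operates at roughly two orders of magnitude below the raw bit rate.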
[AI-42] EvoGraph: Hybrid Directed Graph Evolution toward Software 3.0 ICSE2025
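[Quick read]: This paper addresses the maintenance problems that software systems face over long-term evolution, in particular the challenges of legacy modernization such as implicit contracts, performance preservation, and integration evolution. Its core solution is the EvoGraph framework, which represents every software artefact (source code, build pipelines, documentation, and tickets) in a typed directed graph, applies learned mutation operators driven by language-specific small language models (SLMs), and selects the best evolution paths with a multi-objective fitness function. Reported results: 83% of known security vulnerabilities fixed, 93% functional equivalence for COBOL-to-Java translation, documentation freshness within two minutes, a 40% latency reduction and a sevenfold drop in feature lead time compared with baselines, and 82–96% semantic equivalence across several language migration tasks at 90% lower computational cost than large language models.
Link: https://arxiv.org/abs/2508.05199
Authors: Igor Costa,Christopher Baran
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 15 pages, 3 tables, 1 algorithm. Submitted to ICSE 2025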
Abstract:We introduce EvoGraph, a framework that enables software systems to evolve their own source code, build pipelines, documentation, and tickets. EvoGraph represents every artefact in a typed directed graph, applies learned mutation operators driven by specialized small language models (SLMs), and selects survivors with a multi-objective fitness. On three benchmarks, EvoGraph fixes 83% of known security vulnerabilities, translates COBOL to Java with 93% functional equivalence (test verified), and maintains documentation freshness within two minutes. Experiments show a 40% latency reduction and a sevenfold drop in feature lead time compared with strong baselines. We extend our approach to evoGraph, leveraging language-specific SLMs for modernizing .NET, Lisp, CGI, ColdFusion, legacy Python, and C codebases, achieving 82-96% semantic equivalence across languages while reducing computational costs by 90% compared to large language models. EvoGraph’s design responds to empirical failure modes in legacy modernization, such as implicit contracts, performance preservation, and integration evolution. Our results suggest a practical path toward Software 3.0, where systems adapt continuously yet remain under measurable control.
[AI-43] Balancing Accuracy and Novelty with Sub-Item Popularity
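[Quick read]: This paper addresses insufficient novelty in music recommender systems that lean heavily on users' repeated listening: existing methods based on Personalised Popularity Scores (PPS) improve relevance but tend to reinforce already-known content and suppress the discovery of novel or serendipitous tracks, which harms long-term user engagement. The key to the solution is to repurpose the sub-item decomposition of the RecJPQ framework and introduce personalised popularity scores at the finer sub-ID level (sPPS), so that the model captures repetition patterns shared across sub-embeddings and gives explicit control over personalised novelty without sacrificing recommendation accuracy.
Link: https://arxiv.org/abs/2508.05198
Authors: Chiara Mallamaci,Aleksandr Vladimirovich Petrov,Alberto Carlo Maria Mancino,Vito Walter Anelli,Tommaso Di Noia,Craig Macdonald
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: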
Abstract:In the realm of music recommendation, sequential recommenders have shown promise in capturing the dynamic nature of music consumption. A key characteristic of this domain is repetitive listening, where users frequently replay familiar tracks. To capture these repetition patterns, recent research has introduced Personalised Popularity Scores (PPS), which quantify user-specific preferences based on historical frequency. While PPS enhances relevance in recommendation, it often reinforces already-known content, limiting the system’s ability to surface novel or serendipitous items - key elements for fostering long-term user engagement and satisfaction. To address this limitation, we build upon RecJPQ, a Transformer-based framework initially developed to improve scalability in large-item catalogues through sub-item decomposition. We repurpose RecJPQ’s sub-item architecture to model personalised popularity at a finer granularity. This allows us to capture shared repetition patterns across sub-embeddings - latent structures not accessible through item-level popularity alone. We propose a novel integration of sub-ID-level personalised popularity within the RecJPQ framework, enabling explicit control over the trade-off between accuracy and personalised novelty. Our sub-ID-level PPS method (sPPS) consistently outperforms item-level PPS by achieving significantly higher personalised novelty without compromising recommendation accuracy. Code and experiments are publicly available at this https URL.
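The difference between item-level and sub-ID-level popularity can be sketched in a few lines. This is only an illustration of the idea described in the abstract above, not the authors' implementation: the codebook, the track names, and the averaging rule in `score_item` are all assumptions.

```python
from collections import Counter


def item_pps(history):
    """Item-level personalised popularity: share of the user's plays per item."""
    counts = Counter(history)
    total = len(history)
    return {item: c / total for item, c in counts.items()}


def sub_id_pps(history, codebook):
    """Per-split popularity of each sub-ID that occurs in the user's history."""
    n_splits = len(next(iter(codebook.values())))
    per_split = [Counter() for _ in range(n_splits)]
    for item in history:
        for split, code in enumerate(codebook[item]):
            per_split[split][code] += 1
    total = len(history)
    return [{code: c / total for code, c in counter.items()} for counter in per_split]


def score_item(item, sub_pps, codebook):
    """Average the popularity of an item's sub-IDs (an assumed aggregation rule)."""
    codes = codebook[item]
    return sum(sub_pps[s].get(code, 0.0) for s, code in enumerate(codes)) / len(codes)


# Hypothetical codebook: item -> (sub-ID in split 0, sub-ID in split 1).
codebook = {"track_a": (0, 1), "track_b": (0, 2), "track_c": (3, 1), "track_d": (4, 5)}
history = ["track_a", "track_a", "track_b", "track_a"]

item_scores = item_pps(history)
sub_scores = sub_id_pps(history, codebook)
novel_score = score_item("track_c", sub_scores, codebook)

print(item_scores)   # only tracks the user has already played get a score
print(novel_score)   # an unplayed track still scores via a shared sub-ID
```

In this toy setup `track_c` has never been played, so item-level PPS gives it no score, yet it shares a sub-ID with the user's most played track and therefore receives a non-zero sub-ID-level score. That is the kind of shared pattern the abstract says item-level popularity cannot reach.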
[AI-44] Incident Response Planning Using a Lightweight Large Language Model with Reduced Hallucination
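[Quick read]: This paper addresses the difficulty of identifying the right incident response actions for complex systems, and in particular the high cost and hallucination risk of current response-planning methods based on frontier large language models (LLMs). The key to the solution is a three-step method of fine-tuning, information retrieval, and lookahead planning, which reduces the probability of hallucination, shortens recovery times while preserving response quality, and is lightweight enough to run on commodity hardware.
Link: https://arxiv.org/abs/2508.05188
Authors: Kim Hammar,Tansu Alpcan,Emil C. Lupu
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: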
Abstract:Timely and effective incident response is key to managing the growing frequency of cyberattacks. However, identifying the right response actions for complex systems is a major technical challenge. A promising approach to mitigate this challenge is to use the security knowledge embedded in large language models (LLMs) to assist security operators during incident handling. Recent research has demonstrated the potential of this approach, but current methods are mainly based on prompt engineering of frontier LLMs, which is costly and prone to hallucinations. We address these limitations by presenting a novel way to use an LLM for incident response planning with reduced hallucination. Our method includes three steps: fine-tuning, information retrieval, and lookahead planning. We prove that our method generates response plans with a bounded probability of hallucination and that this probability can be made arbitrarily small at the expense of increased planning time under certain assumptions. Moreover, we show that our method is lightweight and can run on commodity hardware. We evaluate our method on logs from incidents reported in the literature. The experimental results show that our method a) achieves up to 22% shorter recovery times than frontier LLMs and b) generalizes to a broad range of incident types and response actions.
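[AI-45] Domain-driven Metrics for Reinforcement Learning: A Case Study on Epidemic Control using Agent-based Simulation
[Quick read]: This paper addresses the challenges of evaluating reinforcement learning (RL)-driven agent-based models (ABMs) and rational agent-based models (RABMs), namely system complexity, stochasticity, and the lack of standardized metrics for comparing RL algorithms. The key to the solution is to develop domain-driven rewards and combine them with traditional and state-of-the-art metrics, so that models can be optimized and evaluated in a more accurate, interpretable, and practically meaningful way in a concrete scenario (masking, vaccination, and lockdown behaviour in epidemic modelling).
Link: https://arxiv.org/abs/2508.05154
Authors: Rishabh Gaur,Gaurav Deshkar,Jayanta Kshirsagar,Harshal Hayatnagarkar,Janani Venugopalan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: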
Abstract:For the development and optimization of agent-based models (ABMs) and rational agent-based models (RABMs), optimization algorithms such as reinforcement learning are extensively used. However, assessing the performance of RL-based ABMs and RABMs is challenging due to the complexity and stochasticity of the modeled systems, and the lack of well-standardized metrics for comparing RL algorithms. In this study, we are developing domain-driven metrics for RL, while building on state-of-the-art metrics. We demonstrate our "Domain-driven-RL-metrics" using policy optimization on a rational ABM disease modeling case study to model masking behavior, vaccination, and lockdown in a pandemic. Our results show the use of domain-driven rewards in conjunction with traditional and state-of-the-art metrics for a few different simulation scenarios such as the differential availability of masks.
[AI-46] FCBV-Net: Category-Level Robotic Garment Smoothing via Feature-Conditioned Bimanual Value Prediction
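[Quick read]: This paper addresses category-level generalization in robotic garment manipulation, specifically bimanual smoothing, which is hard because of the high-dimensional state space, complex dynamics, and large variation among garments of the same category. Existing methods either overfit visual features to specific instances or, despite category-level perception, fail to predict the value of synergistic bimanual actions. The key to the solution is the Feature-Conditioned Bimanual Value Network (FCBV-Net), which predicts bimanual action values from pre-trained, frozen dense geometric features for robustness to intra-category variation, while trainable downstream components learn the task-specific policy, decoupling geometric understanding from action-value learning. In experiments, FCBV-Net shows only an 11.5% efficiency drop on unseen garments, versus 96.2% for a 2D image-based baseline, and reaches 89% final coverage, versus 83% for a 3D correspondence-based baseline.
Link: https://arxiv.org/abs/2508.05153
Authors: Mohammed Daba,Jing Qiu
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages, 3 figures, 1 table. Submitted to IEEE Robotics and Automation Letters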
Abstract:Category-level generalization for robotic garment manipulation, such as bimanual smoothing, remains a significant hurdle due to high dimensionality, complex dynamics, and intra-category variations. Current approaches often struggle, either overfitting with concurrently learned visual features for a specific instance or, despite category-level perceptual generalization, failing to predict the value of synergistic bimanual actions. We propose the Feature-Conditioned Bimanual Value Network (FCBV-Net), operating on 3D point clouds to specifically enhance category-level policy generalization for garment smoothing. FCBV-Net conditions bimanual action value prediction on pre-trained, frozen dense geometric features, ensuring robustness to intra-category garment variations. Trainable downstream components then learn a task-specific policy using these static features. In simulated GarmentLab experiments with the CLOTH3D dataset, FCBV-Net demonstrated superior category-level generalization. It exhibited only an 11.5% efficiency drop (Steps80) on unseen garments compared to 96.2% for a 2D image-based baseline, and achieved 89% final coverage, outperforming an 83% coverage from a 3D correspondence-based baseline that uses identical per-point geometric features but a fixed primitive. These results highlight that the decoupling of geometric understanding from bimanual action value learning enables better category-level generalization.
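[AI-47] Tool Graph Retriever: Exploring Dependency Graph-based Tool Retrieval for Large Language Models
[Quick read]: This paper addresses the omission of prerequisite tools in tool retrieval for AI agents, which happens when retrieval ignores dependencies between tools and which can cause task execution to fail. The key to the solution is Tool Graph Retriever (TGR), which builds a tool dependency graph and uses graph convolution to integrate the dependencies into the tool representations, improving retrieval accuracy and effectiveness.
Link: https://arxiv.org/abs/2508.05152
Authors: Linfeng Gao,Yaoxiang Wang,Minlong Peng,Jialong Tang,Yuzhe Shang,Mingming Sun,Jinsong Su
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: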
Abstract:With the remarkable advancement of AI agents, the number of their equipped tools is increasing rapidly. However, integrating all tool information into the limited model context becomes impractical, highlighting the need for efficient tool retrieval methods. In this regard, dominant methods primarily rely on semantic similarities between tool descriptions and user queries to retrieve relevant tools. However, they often consider each tool independently, overlooking dependencies between tools, which may lead to the omission of prerequisite tools for successful task execution. To deal with this defect, in this paper, we propose Tool Graph Retriever (TGR), which exploits the dependencies among tools to learn better tool representations for retrieval. First, we construct a dataset termed TDI300K to train a discriminator for identifying tool dependencies. Then, we represent all candidate tools as a tool dependency graph and use graph convolution to integrate the dependencies into their representations. Finally, these updated tool representations are employed for online retrieval. Experimental results on several commonly used datasets show that our TGR can bring a performance improvement to existing dominant methods, achieving SOTA performance. Moreover, in-depth analyses also verify the importance of tool dependencies and the effectiveness of our TGR.
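The effect of propagating dependencies through a graph can be illustrated with a single, weight-free graph convolution step. This is a deliberately simplified sketch, not TGR itself: the tool names, the one-hot embeddings, the undirected treatment of edges, and the plain neighbour averaging (no learned weights) are all assumptions made for illustration.

```python
def graph_conv_step(features, edges):
    """One weight-free propagation step: average each node with its neighbours."""
    n = len(features)
    neighbours = [{i} for i in range(n)]       # start with self-loops
    for src, dst in edges:                     # dependencies treated as undirected
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    dim = len(features[0])
    out = []
    for i in range(n):
        group = neighbours[i]
        out.append([sum(features[j][d] for j in group) / len(group) for d in range(dim)])
    return out


def dot(u, v):
    """Inner-product similarity between two vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical tools and embeddings; "get_orders" depends on "login".
tools = ["login", "get_orders", "weather"]
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
edges = [(1, 0)]                               # (dependent tool, prerequisite tool)

query = [0.0, 1.0, 0.0]                        # a query that matches "get_orders" only
updated = graph_conv_step(features, edges)

for name, before, after in zip(tools, features, updated):
    print(name, dot(query, before), dot(query, after))
```

Before propagation the prerequisite `login` has zero similarity to the query. After one step it inherits part of the `get_orders` representation and outranks the unrelated `weather` tool, which is the behaviour a purely semantic retriever would miss.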
[AI-48] Chemist Eye: A Visual Language Model-Powered System for Safety Monitoring and Robot Decision-Making in Self-Driving Laboratories
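[Quick read]: This paper addresses the additional safety complexity introduced by integrating robots and automation into self-driving laboratories (SDLs), in particular increased fire risk, the difficulty of ensuring personal protective equipment (PPE) compliance, and delayed responses to accidents. The key to the solution is Chemist Eye, a distributed safety monitoring system that combines multimodal visual sensing (RGB, depth, and infrared cameras) with decision-making driven by a vision-language model (VLM) to identify worker status, PPE compliance, and fire hazards in real time. It can steer mobile robots away from hazardous areas or people, raise audible warnings, and push notifications through third-party messaging platforms, improving situational awareness and emergency response in SDLs.
Link: https://arxiv.org/abs/2508.05148
Authors: Francisco Munguia-Galeano,Zhengxue Zhou,Satheeshkumar Veeramani,Hatem Fakhruldeen,Louis Longley,Rob Clowes,Andrew I. Cooper
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: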
Abstract:The integration of robotics and automation into self-driving laboratories (SDLs) can introduce additional safety complexities, in addition to those that already apply to conventional research laboratories. Personal protective equipment (PPE) is an essential requirement for ensuring the safety and well-being of workers in laboratories, self-driving or otherwise. Fires are another important risk factor in chemical laboratories. In SDLs, fires that occur close to mobile robots, which use flammable lithium batteries, could have increased severity. Here, we present Chemist Eye, a distributed safety monitoring system designed to enhance situational awareness in SDLs. The system integrates multiple stations equipped with RGB, depth, and infrared cameras, designed to monitor incidents in SDLs. Chemist Eye is also designed to spot workers who have suffered a potential accident or medical emergency, to check PPE compliance, and to detect fire hazards. To do this, Chemist Eye uses decision-making driven by a vision-language model (VLM). Chemist Eye is designed for seamless integration, enabling real-time communication with robots. Based on the VLM recommendations, the system attempts to drive mobile robots away from potential fire locations, exits, or individuals not wearing PPE, and issues audible warnings where necessary. It also integrates with third-party messaging platforms to provide instant notifications to lab personnel. We tested Chemist Eye with real-world data from an SDL equipped with three mobile robots and found that the spotting of possible safety hazards and decision-making performances reached 97% and 95%, respectively.
[AI-49] Graph-based Event Log Repair
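[Quick read]: This paper addresses low-quality event logs in Process Mining, in particular missing event attributes caused by manual activities or incomplete data collection. Standard approaches either rely on a predefined process model to infer the missing values or use machine learning/deep learning models that learn from similar cases. The key to the solution is a Heterogeneous Graph Neural Network (HGNN) model that represents the complex multi-modal sequences in execution traces more naturally and, given a trace with incomplete events, reconstructs all of their missing attributes. Whereas the autoencoder-based state of the art and other model-free approaches mainly repair a subset of attributes, the proposed approach performs well across different types of missing values.
Link: https://arxiv.org/abs/2508.05145
Authors: Sebastiano Dissegna,Chiara Di Francescomarino,Massimiliano Ronzani
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: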
Abstract:The quality of event logs in Process Mining is crucial when applying any form of analysis to them. In real-world event logs, the acquisition of data can be non-trivial (e.g., due to the execution of manual activities and related manual recording or to issues in collecting, for each event, all its attributes), and often may end up with events recorded with some missing information. Standard approaches to the problem of trace (or log) reconstruction either require the availability of a process model that is used to fill missing values by leveraging different reasoning techniques or employ a Machine Learning/Deep Learning model to restore the missing values by learning from similar cases. In recent years, a new type of Deep Learning model that is capable of handling input data encoded as graphs has emerged, namely Graph Neural Networks. Graph Neural Network models, and even more so Heterogeneous Graph Neural Networks, offer the advantage of working with a more natural representation of complex multi-modal sequences like the execution traces in Process Mining, allowing for more expressive and semantically rich encodings. In this work, we focus on the development of a Heterogeneous Graph Neural Network model that, given a trace containing some incomplete events, will return the full set of attributes missing from those events. We evaluate our work against a state-of-the-art approach leveraging autoencoders on two synthetic logs and four real event logs, on different types of missing values. Different from state-of-the-art model-free approaches, which mainly focus on repairing a subset of event attributes, the proposed approach shows very good performance in reconstructing all different event attributes.
[AI-50] Beyond Automation: Socratic AI Epistemic Agency and the Implications of the Emergence of Orchestrated Multi-Agent Learning Architectures
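[Quick read]: This paper addresses the risk of de-skilling when generative AI is widely used in higher education, that is, students relying on AI tools to the detriment of critical thinking and independent learning. The key to the solution is the design and evaluation of a Socratic AI Tutor that guides students in developing research questions through structured dialogue and promotes metacognitive engagement. In the experiment, students who used the tutor reported significantly greater support for critical, independent, and reflective thinking than students who used an uninstructed AI chatbot. The paper further proposes orchestrated multi-agent systems (MAS): modular constellations of AI agents curated by educators that support diverse learning trajectories through differentiated roles and coordinated interaction, anchored in an adapted offer-and-use model. The work provides both empirical evidence and a conceptual roadmap for hybrid learning ecosystems built on human-AI co-agency.
Link: https://arxiv.org/abs/2508.05116
Authors: Peer-Benedikt Degen,Igor Asanov
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: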
Abstract:Generative AI is no longer a peripheral tool in higher education. It is rapidly evolving into a general-purpose infrastructure that reshapes how knowledge is generated, mediated, and validated. This paper presents findings from a controlled experiment evaluating a Socratic AI Tutor, a large language model designed to scaffold student research question development through structured dialogue grounded in constructivist theory. Conducted with 65 pre-service teacher students in Germany, the study compares interaction with the Socratic Tutor to engagement with an uninstructed AI chatbot. Students using the Socratic Tutor reported significantly greater support for critical, independent, and reflective thinking, suggesting that dialogic AI can stimulate metacognitive engagement and challenging recent narratives of de-skilling due to generative AI usage. These findings serve as a proof of concept for a broader pedagogical shift: the use of multi-agent systems (MAS) composed of specialised AI agents. To conceptualise this, we introduce the notion of orchestrated MAS: modular, pedagogically aligned agent constellations, curated by educators, that support diverse learning trajectories through differentiated roles and coordinated interaction. To anchor this shift, we propose an adapted offer-and-use model, in which students appropriate instructional offers from these agents. Beyond technical feasibility, we examine system-level implications for higher education institutions and students, including funding necessities and changes to faculty roles, curricula, competencies and assessment practices. We conclude with a comparative cost-effectiveness analysis highlighting the scalability of such systems. In sum, this study contributes both empirical evidence and a conceptual roadmap for hybrid learning ecosystems that embed human-AI co-agency and pedagogical alignment.
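[AI-51] EasySize: Elastic Analog Circuit Sizing via LLM-Guided Heuristic Search
[Quick read]: This paper addresses the low efficiency, dependence on human experience, and weak portability across technology nodes of gate sizing for analog circuits. Traditional methods struggle to stay fast and stable across technology nodes, design specifications, and circuit topologies, while existing approaches based on large language models (LLMs) tend to be computationally expensive and lack portability. The key to the solution is EasySize, a lightweight framework built on a fine-tuned Qwen3-8B model that uses the Ease of Attainability (EOA) of performance metrics to dynamically construct task-specific loss functions, and combines global Differential Evolution (DE) with local Particle Swarm Optimization (PSO) for heuristic search within a feedback-enhanced flow. Although fine-tuned only on 350nm data, EasySize generalizes to the 180nm, 45nm, and 22nm nodes and outperforms AutoCkt, a widely used reinforcement-learning-based sizing framework.
Link: https://arxiv.org/abs/2508.05113
Authors: Xinyue Wu,Fan Hu,Shaik Jani Babu,Yi Zhao,Xinfei Guo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: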
Abstract:Analog circuit design is a time-consuming, experience-driven task in chip development. Despite advances in AI, developing universal, fast, and stable gate sizing methods for analog circuits remains a significant challenge. Recent approaches combine Large Language Models (LLMs) with heuristic search techniques to enhance generalizability, but they often depend on large model sizes and lack portability across different technology nodes. To overcome these limitations, we propose EasySize, the first lightweight gate sizing framework based on a finetuned Qwen3-8B model, designed for universal applicability across process nodes, design specifications, and circuit topologies. EasySize exploits the varying Ease of Attainability (EOA) of performance metrics to dynamically construct task-specific loss functions, enabling efficient heuristic search through global Differential Evolution (DE) and local Particle Swarm Optimization (PSO) within a feedback-enhanced flow. Although finetuned solely on 350nm node data, EasySize achieves strong performance on 5 operational amplifier (Op-Amp) netlists across 180nm, 45nm, and 22nm technology nodes without additional targeted training, and outperforms AutoCkt, a widely-used Reinforcement Learning based sizing framework, on 86.67% of tasks while reducing simulation resources by more than 96.67%. We argue that EasySize can significantly reduce the reliance on human expertise and computational resources in gate sizing, thereby accelerating and simplifying the analog circuit design process. EasySize will be open-sourced at a later date.
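The global search stage named in the abstract above, Differential Evolution, can be sketched generically as follows. This is a textbook DE/rand/1/bin loop on a toy loss, not EasySize's flow: the quadratic `toy_sizing_loss` stands in for a circuit simulator, and the bounds, hyperparameters, and function names are assumptions. The EOA-based loss construction and the PSO refinement are not modelled.

```python
import random


def differential_evolution(loss, bounds, pop_size=12, generations=40,
                           f=0.8, cr=0.9, seed=0):
    """Minimal DE/rand/1/bin minimiser.

    Returns (best_vector, best_loss, initial_best_loss).
    """
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fitness = [loss(x) for x in pop]
    initial_best = min(fitness)
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct population members other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)   # at least one coordinate comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)   # clip to the search box
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_loss = loss(trial)
            if trial_loss <= fitness[i]:   # greedy selection keeps the better vector
                pop[i], fitness[i] = trial, trial_loss
    best = min(range(pop_size), key=fitness.__getitem__)
    return pop[best], fitness[best], initial_best


def toy_sizing_loss(widths):
    """Hypothetical stand-in for a simulator: distance from made-up target widths."""
    w1, w2 = widths
    return (w1 - 4.0) ** 2 + (w2 - 1.5) ** 2


bounds = [(0.5, 10.0), (0.5, 10.0)]   # assumed search range for two device widths
best_x, best_loss, initial_best = differential_evolution(toy_sizing_loss, bounds)
print(best_x, best_loss, initial_best)
```

Because selection is greedy, the best loss never gets worse from one generation to the next. In a real sizing flow each call to `loss` would be a circuit simulation, which is why reducing the number of evaluations matters.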
[AI-52] Integrated Influence: Data Attribution with Baseline
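[Quick read]: This paper addresses the limited explanatory power of existing data attribution methods, in particular methods based on leave-one-out (LOO), which perturb only a single training sample and therefore overlook collective influence within the training set, and which lack a baseline and so cannot easily provide counterfactual explanations. The key to the solution is Integrated Influence, which defines a baseline dataset, follows a data degeneration process that gradually transitions the current dataset to the baseline, and accumulates each training sample's influence on the model output along the way, giving more reliable and flexible data attribution. The method has a theoretical foundation, and popular methods such as influence functions can be viewed as special cases of it.
Link: https://arxiv.org/abs/2508.05089
Authors: Linxiao Yang,Xinyu Gu,Liang Sun
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: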
Abstract:As an effective approach to quantify how training samples influence a test sample, data attribution is crucial for understanding data and models and for further enhancing the transparency of machine learning models. We find that prevailing data attribution methods based on the leave-one-out (LOO) strategy suffer from local-based explanation, as these LOO-based methods only perturb a single training sample, and overlook the collective influence in the training set. On the other hand, the lack of a baseline in many data attribution methods reduces the flexibility of the explanation, e.g., failing to provide counterfactual explanations. In this paper, we propose Integrated Influence, a novel data attribution method that incorporates a baseline approach. Our method defines a baseline dataset, follows a data degeneration process to transition the current dataset to the baseline, and accumulates the influence of each sample throughout this process. We provide a solid theoretical framework for our method, and further demonstrate that popular methods, such as influence functions, can be viewed as special cases of our approach. Experimental results show that Integrated Influence generates more reliable data attributions compared to existing methods in both the data attribution task and the mislabelled example identification task.
[AI-53] MedMKEB: A Comprehensive Knowledge Editing Benchmark for Medical Multimodal Large Language Models
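[Quick read]: This paper addresses the difficulty medical multimodal large language models (MLLMs) have in efficiently and reliably correcting outdated or incorrect information as medical knowledge evolves. Existing knowledge editing work focuses on text-only settings and lacks a systematic evaluation framework for joint image-text modalities. The key to the solution is MedMKEB, the first multimodal knowledge editing benchmark dedicated to medicine, which covers reliability, generality, locality, portability, and robustness, and evaluates editing ability through carefully constructed tasks (counterfactual correction, semantic generalization, knowledge transfer, and adversarial robustness) with human expert validation. Experiments on state-of-the-art general and medical MLLMs show clear limitations of existing editing approaches, which underlines the need for specialized editing strategies.
Link: https://arxiv.org/abs/2508.05083
Authors: Dexuan Xu,Jieyi Wang,Zhongyan Chai,Yongzhi Cao,Hanpin Wang,Huamin Zhang,Yu Huang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages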
Abstract:Recent advances in multimodal large language models (MLLMs) have significantly improved medical AI, enabling it to unify the understanding of visual and textual information. However, as medical knowledge continues to evolve, it is critical to allow these models to efficiently update outdated or incorrect information without retraining from scratch. Although textual knowledge editing has been widely studied, there is still a lack of systematic benchmarks for multimodal medical knowledge editing involving image and text modalities. To fill this gap, we present MedMKEB, the first comprehensive benchmark designed to evaluate the reliability, generality, locality, portability, and robustness of knowledge editing in medical multimodal large language models. MedMKEB is built on a high-quality medical visual question-answering dataset and enriched with carefully constructed editing tasks, including counterfactual correction, semantic generalization, knowledge transfer, and adversarial robustness. We incorporate human expert validation to ensure the accuracy and reliability of the benchmark. Extensive single editing and sequential editing experiments on state-of-the-art general and medical MLLMs demonstrate the limitations of existing knowledge-based editing approaches in medicine, highlighting the need to develop specialized editing strategies. MedMKEB will serve as a standard benchmark to promote the development of trustworthy and efficient medical knowledge editing algorithms.
[AI-54] Align-for-Fusion: Harmonizing Triple Preferences via Dual-oriented Diffusion for Cross-domain Sequential Recommendation
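[Quick read]: This paper addresses the performance degradation caused by data sparsity and interest drift in Cross-Domain Sequential Recommendation (CDSR), and in particular the lack of fine-grained modelling when existing methods fuse preferences from multiple domains. The key to the solution is HorizonRec, an align-for-fusion framework with two main components: a mixed-conditioned distribution retrieval strategy, which uses distributions retrieved from users' authentic behavioural logic as semantic bridges across domains for consistent multi-domain preference modelling; and a dual-oriented preference diffusion method, which suppresses potential noise and emphasizes target-relevant interests to achieve finer-grained fusion of triple-domain preferences, improving recommendation accuracy and robustness.
Link: https://arxiv.org/abs/2508.05074
Authors: Yongfu Zha,Xinxin Dong,Haokai Ma,Yonghui Yang,Xiaodong Wang
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: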
Abstract:Personalized sequential recommendation aims to predict appropriate items for users based on their behavioral sequences. To alleviate data sparsity and interest drift issues, conventional approaches typically incorporate auxiliary behaviors from other domains via cross-domain transition. However, existing cross-domain sequential recommendation (CDSR) methods often follow an align-then-fusion paradigm that performs representation-level alignment across multiple domains and combines them mechanically for recommendation, overlooking the fine-grained fusion of domain-specific preferences. Inspired by recent advances in diffusion models (DMs) for distribution matching, we propose an align-for-fusion framework for CDSR to harmonize triple preferences via dual-oriented DMs, termed HorizonRec. Specifically, we investigate the uncertainty injection of DMs and identify stochastic noise as a key source of instability in existing DM-based recommenders. To address this, we introduce a mixed-conditioned distribution retrieval strategy that leverages distributions retrieved from users’ authentic behavioral logic as semantic bridges across domains, enabling consistent multi-domain preference modeling. Furthermore, we propose a dual-oriented preference diffusion method to suppress potential noise and emphasize target-relevant interests during multi-domain user representation fusion. Extensive experiments on four CDSR datasets from two distinct platforms demonstrate the effectiveness and robustness of HorizonRec in fine-grained triple-domain preference fusion.
[AI-55] Human-AI Schema Discovery and Application for Creative Problem Solving
[Quick Read]: This paper addresses the difficulty people have in discovering and applying implicit structural patterns (schemas) in complex or unfamiliar domains, even though such schemas are essential to creative problem solving. The key to the solution is a framework for human-AI schema discovery and application: systems first help users abstract schemas from examples (schema-guided sensemaking), then turn those schemas into human-AI co-creative workflows (operationalizing the schemas). This makes implicit knowledge more accessible and actionable and advances more transparent, collaborative human-AI systems.
Link: https://arxiv.org/abs/2508.05045
Authors: Sitong Wang
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Humans often rely on underlying structural patterns (schemas) to create, whether by writing stories, designing software, or composing music. Schemas help organize ideas and guide exploration, but they are often difficult to discover and apply, especially in complex or unfamiliar domains. My Ph.D. research develops a framework for human-AI schema discovery and application to support creative problem solving. I design systems that support users in sensemaking over examples to abstract schemas, and in operationalizing schemas into human-AI co-creative workflows for application. This research offers insights into how schema-guided interaction can make implicit knowledge more accessible and actionable, advancing more transparent and collaborative human-AI systems.
[AI-56] SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models
[Quick Read]: This paper addresses the heavy data and compute requirements of reinforcement learning (RL) fine-tuning for large language models (LLMs), which make it impractical for smaller models. Existing curriculum learning and data selection methods are largely heuristic or computationally expensive, which limits their scalability and generalizability. The key to the solution is SPaRFT (Self-Paced Reinforcement Fine-Tuning), a self-paced learning framework with two components: a cluster-based data reduction step that partitions training data by semantics and difficulty and extracts a compact, diverse subset to reduce redundancy; and a multi-armed bandit that treats data clusters as arms and allocates training samples according to the model's current performance. Experiments on multiple reasoning benchmarks show that SPaRFT matches or exceeds the accuracy of state-of-the-art baselines while using up to 100× fewer training samples, supporting performance-driven curricula as a way to improve LLM reasoning in low-resource settings.
Link: https://arxiv.org/abs/2508.05015
Authors: Dai Do,Manh Nguyen,Svetha Venkatesh,Hung Le
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). However, such methods require extensive data and compute, making them impractical for smaller models. Current approaches to curriculum learning or data selection are largely heuristic-driven or demand extensive computational resources, limiting their scalability and generalizability. We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained through optimizing which data to use and when. First, we apply cluster-based data reduction to partition training data by semantics and difficulty, extracting a compact yet diverse subset that reduces redundancy. Then, a multi-armed bandit treats data clusters as arms, optimized to allocate training samples based on the model's current performance. Experiments across multiple reasoning benchmarks show that SPaRFT achieves comparable or better accuracy than state-of-the-art baselines while using up to 100× fewer samples. Ablation studies and analyses further highlight the importance of both data clustering and adaptive selection. Our results demonstrate that carefully curated, performance-driven training curricula can unlock strong reasoning abilities in LLMs with minimal resources.
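The bandit step described above can be illustrated with a small sketch. The abstract does not say which bandit algorithm or reward signal SPaRFT uses, so this example assumes UCB1 and a synthetic 0/1 reward per cluster; `ClusterBandit`, `run_demo`, and the `signal` probabilities are hypothetical names and values chosen only for illustration.

```python
import math
import random


class ClusterBandit:
    """UCB1 bandit whose arms are data clusters (hypothetical helper, not the paper's code)."""

    def __init__(self, n_clusters: int, exploration: float = 2.0):
        self.counts = [0] * n_clusters    # how often each cluster was sampled
        self.values = [0.0] * n_clusters  # running mean reward per cluster
        self.exploration = exploration

    def select(self) -> int:
        # Sample every cluster once before applying the UCB rule.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            value + math.sqrt(self.exploration * math.log(total) / count)
            for value, count in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        # Incremental mean update for the chosen cluster.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def run_demo(steps: int = 500, seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    # Assumed probability that a batch from each cluster yields a useful
    # training signal; purely synthetic numbers.
    signal = [0.1, 0.5, 0.9]
    bandit = ClusterBandit(len(signal))
    for _ in range(steps):
        arm = bandit.select()
        reward = 1.0 if rng.random() < signal[arm] else 0.0
        bandit.update(arm, reward)
    return bandit.counts
```

In a real training loop the reward would come from the model being fine-tuned (for example, a change in accuracy on the sampled cluster), which is what makes the curriculum track the model's current capability.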
[AI-57] Towards Hallucination-Free Music: A Reinforcement Learning Preference Optimization Framework for Reliable Song Generation
[Quick Read]: This paper addresses content hallucination in lyric-to-song generation, where the generated output is misaligned with the input lyrics and musical coherence suffers. The key to the solution is a reinforcement learning (RL) preference optimization framework: the authors build a hallucination preference dataset based on phoneme error rate (PER) and train with three preference optimization strategies (DPO, PPO, and GRPO). DPO works off-policy to raise the likelihood of positive tokens and reduces PER by 7.4%. PPO and GRPO use a PER-based reward model to optimize the policy iteratively under KL regularization, suppressing hallucinations while preserving musical quality and reducing PER by 4.9% and 4.7%, respectively.
Link: https://arxiv.org/abs/2508.05011
Authors: Huaicheng Zhang,Wei Tan,Guangzheng Li,Yixuan Zhang,Hangting Chen,Shun Lei,Chenyu Yang,Zhiyong Wu,Shuai Wang,Qijun Huang,Dong Yu
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Recent advances in audio-based generative language models have accelerated AI-driven lyric-to-song generation. However, these models frequently suffer from content hallucination, producing outputs misaligned with the input lyrics and undermining musical coherence. Current supervised fine-tuning (SFT) approaches, limited by passive label-fitting, exhibit constrained self-improvement and poor hallucination mitigation. To address this core challenge, we propose a novel reinforcement learning (RL) framework leveraging preference optimization for hallucination control. Our key contributions include: (1) Developing a robust hallucination preference dataset constructed via phoneme error rate (PER) computation and rule-based filtering to capture alignment with human expectations; (2) Implementing and evaluating three distinct preference optimization strategies within the RL framework: Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO). DPO operates off-policy to enhance positive token likelihood, achieving a significant 7.4% PER reduction. PPO and GRPO employ an on-policy approach, training a PER-based reward model to iteratively optimize sequences via reward maximization and KL-regularization, yielding PER reductions of 4.9% and 4.7%, respectively. Comprehensive objective and subjective evaluations confirm that our methods effectively suppress hallucinations while preserving musical quality. Crucially, this work presents a systematic, RL-based solution to hallucination control in lyric-to-song generation. The framework’s transferability also unlocks potential for music style adherence and musicality enhancement, opening new avenues for future generative song research.
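To make the PER-based preference data concrete, here is a minimal sketch under stated assumptions: PER is taken to be the Levenshtein distance between reference and recognized phoneme sequences divided by the reference length, and the `min_gap` threshold is a hypothetical stand-in for the paper's rule-based filtering, whose actual rules the abstract does not give.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: list[str], hyp: list[str]) -> float:
    if not ref:
        raise ValueError("reference phoneme sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def build_preference_pair(ref, candidates, min_gap=0.1):
    """Return (chosen, rejected) ranked by PER, or None if the PER gap is too small.

    min_gap is an illustrative filter, not a value from the paper.
    """
    scored = sorted(candidates, key=lambda c: phoneme_error_rate(ref, c))
    best, worst = scored[0], scored[-1]
    gap = phoneme_error_rate(ref, worst) - phoneme_error_rate(ref, best)
    return (best, worst) if gap >= min_gap else None
```

In practice the hypothesis phonemes would come from a speech recognizer run on the generated song; that step is outside this sketch.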
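[AI-58] The Docking Game: Loop Self-Play for Fast Dynamic and Accurate Prediction of Flexible Protein–Ligand Binding
[Quick Read]: This paper addresses a performance gap in current multi-task learning models for molecular docking: they do markedly worse on ligand docking than on protein pocket docking, largely because ligands and proteins differ in structural complexity. The key to the solution is a game-theoretic framework, the Docking Game, in which the ligand docking module is the ligand player and the protein pocket docking module is the protein player, together with a new Loop Self-Play (LoopPlay) algorithm that trains the two players through a two-level loop. In the outer loop the players exchange predicted poses so that each adapts to the other; in the inner loop each player feeds its own predictions back into its model to refine its output. The authors prove that LoopPlay converges, and experiments on public benchmarks show roughly a 10% improvement over previous state-of-the-art methods in predicting accurate binding modes.
Link: https://arxiv.org/abs/2508.05006
Authors: Youzhi Zhang,Yufei Li,Gaofeng Meng,Hongbin Liu,Jiebo Luo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 21 pages, 9 figures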
Abstract:Molecular docking is a crucial aspect of drug discovery, as it predicts the binding interactions between small-molecule ligands and protein pockets. However, current multi-task learning models for docking often show inferior performance in ligand docking compared to protein pocket docking. This disparity arises largely due to the distinct structural complexities of ligands and proteins. To address this issue, we propose a novel game-theoretic framework that models the protein-ligand interaction as a two-player game called the Docking Game, with the ligand docking module acting as the ligand player and the protein pocket docking module as the protein player. To solve this game, we develop a novel Loop Self-Play (LoopPlay) algorithm, which alternately trains these players through a two-level loop. In the outer loop, the players exchange predicted poses, allowing each to incorporate the other’s structural predictions, which fosters mutual adaptation over multiple iterations. In the inner loop, each player dynamically refines its predictions by incorporating its own predicted ligand or pocket poses back into its model. We theoretically show the convergence of LoopPlay, ensuring stable optimization. Extensive experiments conducted on public benchmark datasets demonstrate that LoopPlay achieves approximately a 10% improvement in predicting accurate binding modes compared to previous state-of-the-art methods. This highlights its potential to enhance the accuracy of molecular docking in drug discovery.
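The two-level loop can be sketched as plain control flow. This is only an illustration of the outer-exchange and inner-refinement structure: poses are reduced to single floats, the two "players" are toy contraction maps rather than trained docking networks, and all names (`loop_play`, `toy_ligand_player`, `toy_pocket_player`) are hypothetical.

```python
from typing import Callable

Pose = float  # stand-in for a real 3D pose
Player = Callable[[Pose, Pose], Pose]  # (own_pose, other_pose) -> refined own pose


def loop_play(ligand_player: Player, pocket_player: Player,
              ligand_pose: Pose, pocket_pose: Pose,
              outer_iters: int = 100, inner_iters: int = 3) -> tuple[Pose, Pose]:
    for _ in range(outer_iters):
        # Outer loop: the players exchange their latest predicted poses.
        seen_pocket, seen_ligand = pocket_pose, ligand_pose
        # Inner loop: each player feeds its own prediction back into itself
        # while holding the other player's pose fixed.
        for _ in range(inner_iters):
            ligand_pose = ligand_player(ligand_pose, seen_pocket)
        for _ in range(inner_iters):
            pocket_pose = pocket_player(pocket_pose, seen_ligand)
    return ligand_pose, pocket_pose


def toy_ligand_player(own: Pose, other: Pose) -> Pose:
    # Toy rule: the ligand settles one unit away from the pocket.
    return 0.5 * own + 0.5 * (other + 1.0)


def toy_pocket_player(own: Pose, other: Pose) -> Pose:
    # Toy rule: the pocket settles at half the ligand coordinate.
    return 0.5 * own + 0.5 * (0.5 * other)
```

With these toy rules the joint fixed point is ligand = 2, pocket = 1, and the loop converges to it from any start. The paper's convergence result concerns the actual training procedure, which this sketch does not reproduce.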
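[AI-59] AgenticData: An Agentic Data Analytics System for Heterogeneous Data
[Quick Read]: This paper addresses the high cost and low efficiency of existing unstructured data analytics systems, which depend on experts to write code and manage complex analysis workflows. The core solution is AgenticData, an agentic data analytics system that lets users pose natural language (NL) questions and autonomously analyzes data sources across multiple domains, covering both structured and unstructured data. It has three key elements: a feedback-driven planning technique that converts an NL query into a semantic plan composed of relational and semantic operators; a multi-agent collaboration strategy with a data profiling agent that discovers relevant data, a semantic cross-validation agent that optimizes iteratively from feedback, and a smart memory agent that maintains short-term context and long-term knowledge; and a semantic optimization model that refines and executes semantic plans efficiently. Experiments on three benchmarks show that it significantly outperforms state-of-the-art methods, with higher accuracy on both easy and difficult tasks.
Link: https://arxiv.org/abs/2508.05002
Authors: Ji Sun,Guoliang Li,Peiyao Zhou,Yihui Ma,Jingzhe Xu,Yuan Li
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: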
Abstract:Existing unstructured data analytics systems rely on experts to write code and manage complex analysis workflows, making them both expensive and time-consuming. To address these challenges, we introduce AgenticData, an innovative agentic data analytics system that allows users to simply pose natural language (NL) questions while autonomously analyzing data sources across multiple domains, including both unstructured and structured data. First, AgenticData employs a feedback-driven planning technique that automatically converts an NL query into a semantic plan composed of relational and semantic operators. We propose a multi-agent collaboration strategy by utilizing a data profiling agent for discovering relevant data, a semantic cross-validation agent for iterative optimization based on feedback, and a smart memory agent for maintaining short-term context and long-term knowledge. Second, we propose a semantic optimization model to refine and execute semantic plans effectively. Our system, AgenticData, has been tested using three benchmarks. Experimental results showed that AgenticData achieved superior accuracy on both easy and difficult tasks, significantly outperforming state-of-the-art methods.
[AI-60] Situated Epistemic Infrastructures: A Diagnostic Framework for Post-Coherence Knowledge
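[Quick Read]: This paper addresses the fragility that traditional knowledge infrastructures show now that large language models (LLMs) such as ChatGPT are in wide use: without stable citation, authority, and validation mechanisms, the authority of knowledge is hard to sustain. In response, the author proposes the Situated Epistemic Infrastructures (SEI) framework as a diagnostic tool. Its core idea is to trace how credibility is mediated across institutional, computational, and temporal arrangements, and so to rethink how knowledge becomes authoritative in hybrid human-machine systems. SEI moves beyond traditional classificatory thinking by emphasizing coordination over static classification, and argues for anticipatory, adaptive models of epistemic stewardship, offering an alternative to representationalist models for AI governance, knowledge production, and the ethical design of information systems.
Link: https://arxiv.org/abs/2508.04995
Authors: Matthew Kelly
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: 27 pages including references. Draft prepared for submission to Science, Technology, & Human Values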
Abstract:Large Language Models (LLMs) such as ChatGPT have rendered visible the fragility of contemporary knowledge infrastructures by simulating coherence while bypassing traditional modes of citation, authority, and validation. This paper introduces the Situated Epistemic Infrastructures (SEI) framework as a diagnostic tool for analyzing how knowledge becomes authoritative across hybrid human-machine systems under post-coherence conditions. Rather than relying on stable scholarly domains or bounded communities of practice, SEI traces how credibility is mediated across institutional, computational, and temporal arrangements. Integrating insights from infrastructure studies, platform theory, and epistemology, the framework foregrounds coordination over classification, emphasizing the need for anticipatory and adaptive models of epistemic stewardship. The paper contributes to debates on AI governance, knowledge production, and the ethical design of information systems by offering a robust alternative to representationalist models of scholarly communication.
[AI-61] Hierarchical Deep Deterministic Policy Gradient for Autonomous Maze Navigation of Mobile Robots
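[Quick Read]: This paper addresses the low success rates and low average rewards of the Deep Deterministic Policy Gradient (DDPG) algorithm in maze navigation, which result from sparse rewards, inefficient exploration, and the difficulty of long-horizon planning. The key to the solution is an efficient Hierarchical DDPG (HDDPG) algorithm with a high-level and a low-level policy. The high-level policy uses an improved DDPG framework to generate intermediate subgoals from a long-term perspective; the low-level policy observes the current state and executes primitive actions that follow the subgoal assigned by the high-level policy. The method also adds off-policy correction to refine subgoal relabeling, adaptive parameter space noise to improve exploration, and a reshaped intrinsic-extrinsic reward function to speed up learning, which together markedly improve navigation performance.
Link: https://arxiv.org/abs/2508.04994
Authors: Wenjie Hu,Ye Zhou,Hann Woei Ho
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: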
Abstract:Maze navigation is a fundamental challenge in robotics, requiring agents to traverse complex environments efficiently. While the Deep Deterministic Policy Gradient (DDPG) algorithm excels in control tasks, its performance in maze navigation suffers from sparse rewards, inefficient exploration, and long-horizon planning difficulties, often leading to low success rates and average rewards, sometimes even failing to achieve effective navigation. To address these limitations, this paper proposes an efficient Hierarchical DDPG (HDDPG) algorithm, which includes high-level and low-level policies. The high-level policy employs an advanced DDPG framework to generate intermediate subgoals from a long-term perspective and on a higher temporal scale. The low-level policy, also powered by the improved DDPG algorithm, generates primitive actions by observing current states and following the subgoal assigned by the high-level policy. The proposed method enhances stability with off-policy correction, refining subgoal assignments by relabeling historical experiences. Additionally, adaptive parameter space noise is utilized to improve exploration, and a reshaped intrinsic-extrinsic reward function is employed to boost learning efficiency. Further optimizations, including gradient clipping and Xavier initialization, are employed to improve robustness. The proposed algorithm is rigorously evaluated through numerical simulation experiments executed using the Robot Operating System (ROS) and Gazebo. Regarding the three distinct final targets in autonomous maze navigation tasks, HDDPG significantly overcomes the limitations of standard DDPG and its variants, improving the success rate by at least 56.59% and boosting the average reward by a minimum of 519.03 compared to baseline algorithms.
[AI-62] MENDR: Manifold Explainable Neural Data Representations
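[Quick Read]: This paper addresses three shortcomings of current electroencephalography (EEG) foundation models: limited transparency in their pretraining dynamics, uncertainty about how much EEG information their embeddings preserve, and insufficient interpretability for clinical use. The key to the solution is MENDR (Manifold Explainable Neural Data Representations), a filter bank-based EEG foundation model built on a novel Riemannian Manifold Transformer architecture. It decomposes EEG signals into multi-resolution coefficients with discrete wavelet packet transforms and learns symmetric positive definite (SPD) matrix embeddings. MENDR visualizes the SPD embeddings as geometric ellipsoids to improve interpretability and supports accurate reconstruction of the original EEG signals from the embeddings, improving transparency and clinical applicability while maintaining performance.
Link: https://arxiv.org/abs/2508.04956
Authors: Matthew Chen,Micky Nnamdi,Justin Shao,Andrew Hornback,Hongyun Huang,Ben Tamo,Yishan Zhong,Benoit Marteau,Wenqi Shi,May Dongmei Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: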
Abstract:Foundation models for electroencephalography (EEG) signals have recently demonstrated success in learning generalized representations of EEGs, outperforming specialized models in various downstream tasks. However, many of these models lack transparency in their pretraining dynamics and offer limited insight into how well EEG information is preserved within their embeddings. For successful clinical integration, EEG foundation models must ensure transparency in pretraining, downstream fine-tuning, and the interpretability of learned representations. Current approaches primarily operate in the temporal domain, overlooking advancements in digital signal processing that enable the extraction of deterministic and traceable features, such as wavelet-based representations. We propose MENDR (Manifold Explainable Neural Data Representations), a filter bank-based EEG foundation model built on a novel Riemannian Manifold Transformer architecture to resolve these issues. MENDR learns symmetric positive definite matrix embeddings of EEG signals and is pretrained on a large corpus comprising over 4,000 hours of EEG data, decomposed via discrete wavelet packet transforms into multi-resolution coefficients. MENDR significantly enhances interpretability by visualizing symmetric positive definite embeddings as geometric ellipsoids and supports accurate reconstruction of EEG signals from learned embeddings. Evaluations across multiple clinical EEG tasks demonstrate that MENDR achieves near state-of-the-art performance with substantially fewer parameters, underscoring its potential for efficient, interpretable, and clinically applicable EEG analysis.
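[AI-63] Tesserae: Scalable Placement Policies for Deep Learning Workloads
[Quick Read]: This paper addresses low resource utilization in deep learning (DL) cluster scheduling, and in particular the suboptimal performance or poor scalability of existing job placement policies. The key to the solution is to formulate many placement constraints as graph matching problems and, on that basis, to design new placement policies that minimize job migration overhead and improve job packing. These policies are integrated into the Tesserae scheduler, which is both scalable and effective: experiments show it improves average job completion time (JCT) by up to 1.62x and makespan by up to 1.15x.
Link: https://arxiv.org/abs/2508.04953
Authors: Song Bian,Saurabh Agarwal,Md. Tareq Mahmood,Shivaram Venkataraman
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 16 pages, 18 figures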
Abstract:Training deep learning (DL) models has become a dominant workload in data-centers and improving resource utilization is a key goal of DL cluster schedulers. In order to do this, schedulers typically incorporate placement policies that govern where jobs are placed on the cluster. Existing placement policies are either designed as ad-hoc heuristics or incorporated as constraints within a complex optimization problem and thus either suffer from suboptimal performance or poor scalability. Our key insight is that many placement constraints can be formulated as graph matching problems and based on that we design novel placement policies for minimizing job migration overheads and job packing. We integrate these policies into Tesserae and describe how our design leads to a scalable and effective GPU cluster scheduler. Our experimental results show that Tesserae improves average JCT by up to 1.62x and the Makespan by up to 1.15x compared with the existing schedulers.
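One way to read "placement as graph matching" is as a minimum-cost bipartite matching between jobs and GPUs, where an edge costs nothing if the job stays where it is and otherwise costs that job's migration overhead. The sketch below assumes exactly that formulation, which the abstract does not spell out, and solves it by brute force for clarity; the function name and the cost numbers are hypothetical.

```python
from itertools import permutations


def min_migration_placement(current: dict[str, int],
                            available_gpus: list[int],
                            migration_cost: dict[str, float]):
    """Assign each job to a distinct available GPU with minimal total migration cost.

    current maps job name -> GPU it runs on now. Brute force over all
    assignments, so only suitable for tiny illustrative inputs.
    """
    jobs = sorted(current)
    if len(available_gpus) < len(jobs):
        raise ValueError("not enough GPUs for all jobs")
    best_total, best_assign = float("inf"), None
    for gpus in permutations(available_gpus, len(jobs)):
        # A job pays its migration cost only if it changes GPU.
        total = sum(migration_cost[j] for j, g in zip(jobs, gpus) if current[j] != g)
        if total < best_total:
            best_total, best_assign = total, dict(zip(jobs, gpus))
    return best_assign, best_total
```

For example, if jobs A, B, and C run on GPUs 0, 1, and 2 and GPU 0 is no longer available, calling the function with `available_gpus=[1, 2, 3]` moves only A. At cluster scale a polynomial-time solver such as the Hungarian algorithm (for instance `scipy.optimize.linear_sum_assignment`) would replace the brute-force search.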
[AI-64] INTENTION: Inferring Tendencies of Humanoid Robot Motion Through Interactive Intuition and Grounded VLM
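[Quick Read]: This paper addresses the failure of traditional robotic manipulation control and planning in real-world scenes, where imprecise physical models and weak task generalization lead to poor results on unseen tasks or changed environments. The key to the solution is the INTENTION framework, which combines scene reasoning driven by Vision-Language Models (VLMs) with interaction-driven memory to give robots human-like interactive intuition and autonomous manipulation ability. A Memory Graph records scene information from previous task interactions, reflecting a human-like implicit understanding of, and decision logic for, different tasks, while an Intuitive Perceptor extracts physical relations and affordances from visual scenes. Together they let the robot infer appropriate interaction behaviors in new scenes without repeated instructions.
Link: https://arxiv.org/abs/2508.04931
Authors: Jin Wang,Weijie Wang,Boyuan Deng,Heng Zhang,Rui Dai,Nikos Tsagarakis
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project Web: this https URL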
Abstract:Traditional control and planning for robotic manipulation heavily rely on precise physical models and predefined action sequences. While effective in structured environments, such approaches often fail in real-world scenarios due to modeling inaccuracies and struggle to generalize to novel tasks. In contrast, humans intuitively interact with their surroundings, demonstrating remarkable adaptability, making efficient decisions through implicit physical understanding. In this work, we propose INTENTION, a novel framework enabling robots with learned interactive intuition and autonomous manipulation in diverse scenarios, by integrating Vision-Language Models (VLMs) based scene reasoning with interaction-driven memory. We introduce Memory Graph to record scenes from previous task interactions which embodies human-like understanding and decision-making about different tasks in real world. Meanwhile, we design an Intuitive Perceptor that extracts physical relations and affordances from visual scenes. Together, these components empower robots to infer appropriate interaction behaviors in new scenes without relying on repetitive instructions. Videos: this https URL
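[AI-65] Taxonomy of Faults in Attention-Based Neural Networks
[Quick Read]: This paper addresses the failure of existing deep learning fault taxonomies to capture faults specific to attention mechanisms, which leaves practitioners without actionable diagnostic guidance. The key to the solution is a systematic analysis of 555 real-world faults from 96 projects, from which the authors build the first taxonomy of seven attention-specific fault categories for attention-based neural networks (ABNNs). By analyzing associations between symptoms and root causes, they also derive four evidence-based diagnostic heuristics that explain 33.0% of attention-specific faults, providing the first systematic diagnostic guidance for attention-based models.
Link: https://arxiv.org/abs/2508.04925
Authors: Sigma Jahan,Saurabh Singh Rajput,Tushar Sharma,Mohammad Masudur Rahman
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: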
Abstract:Attention mechanisms are at the core of modern neural architectures, powering systems ranging from ChatGPT to autonomous vehicles and driving a major economic impact. However, high-profile failures, such as ChatGPT’s nonsensical outputs or Google’s suspension of Gemini’s image generation due to attention weight errors, highlight a critical gap: existing deep learning fault taxonomies might not adequately capture the unique failures introduced by attention mechanisms. This gap leaves practitioners without actionable diagnostic guidance. To address this gap, we present the first comprehensive empirical study of faults in attention-based neural networks (ABNNs). Our work is based on a systematic analysis of 555 real-world faults collected from 96 projects across ten frameworks, including GitHub, Hugging Face, and Stack Overflow. Through our analysis, we develop a novel taxonomy comprising seven attention-specific fault categories, not captured by existing work. Our results show that over half of the ABNN faults arise from mechanisms unique to attention architectures. We further analyze the root causes and manifestations of these faults through various symptoms. Finally, by analyzing symptom-root cause associations, we identify four evidence-based diagnostic heuristics that explain 33.0% of attention-specific faults, offering the first systematic diagnostic guidance for attention-based models.
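[AI-66] Adversarial Attacks and Defenses on Graph-aware Large Language Models (LLMs)
[Quick Read]: This paper addresses the vulnerability to adversarial attacks of large language models (LLMs) that are combined with graph-structured data for node classification, a robustness question that has not been studied systematically for graph-aware LLMs. The key to the solution is GALGUARD, an end-to-end defense framework with two modules: an LLM-based feature correction module that mitigates imperceptible feature-level perturbations, and adapted graph neural network (GNN) defenses that protect against structural attacks. Together they guard against the multiple attack paths opened up by design choices in graph encoding.
Link: https://arxiv.org/abs/2508.04894
Authors: Iyiola E. Olatunji,Franziska Boenisch,Jing Xu,Adam Dziedzic
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: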
Abstract:Large Language Models (LLMs) are increasingly integrated with graph-structured data for tasks like node classification, a domain traditionally dominated by Graph Neural Networks (GNNs). While this integration leverages rich relational information to improve task performance, their robustness against adversarial attacks remains unexplored. We take the first step to explore the vulnerabilities of graph-aware LLMs by leveraging existing adversarial attack methods tailored for graph-based models, including those for poisoning (training-time attacks) and evasion (test-time attacks), on two representative models, LLAGA (Chen et al. 2024) and GRAPHPROMPTER (Liu et al. 2024). Additionally, we discover a new attack surface for LLAGA where an attacker can inject malicious nodes as placeholders into the node sequence template to severely degrade its performance. Our systematic analysis reveals that certain design choices in graph encoding can enhance attack success, with specific findings that: (1) the node sequence template in LLAGA increases its vulnerability; (2) the GNN encoder used in GRAPHPROMPTER demonstrates greater robustness; and (3) both approaches remain susceptible to imperceptible feature perturbation attacks. Finally, we propose an end-to-end defense framework GALGUARD, that combines an LLM-based feature correction module to mitigate feature-level perturbations and adapted GNN defenses to protect against structural attacks.
[AI-67] Leveraging Deep Learning for Physical Model Bias of Global Air Quality Estimates
[Quick Read]: This paper addresses the bias of current physics-based models when simulating surface ozone concentrations, especially at scales relevant to human health, where the drivers of global ozone trends remain largely unknown and limit the practical use of such models. The key to the solution is a 2D Convolutional Neural Network architecture that estimates the residuals (model bias) of the MOMO-Chem chemical transport model, capturing complex spatial features that the physical model misses. The study also assesses how land use information from high-resolution satellite imagery affects the estimates, and discusses the factors that influence ozone bias at urban scales, offering a scientific basis for improving environmental policy.
Link: https://arxiv.org/abs/2508.04886
Authors: Kelsey Doerksen,Yuliya Marchetti,Kevin Bowman,Steven Lu,James Montgomery,Yarin Gal,Freddie Kalaitzis,Kazuyuki Miyazaki
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Air pollution is the world’s largest environmental risk factor for human disease and premature death, resulting in more than 6 million premature deaths in 2019. Currently, there is still a challenge to model one of the most important air pollutants, surface ozone, particularly at scales relevant for human health impacts, with the drivers of global ozone trends at these scales largely unknown, limiting the practical use of physics-based models. We employ a 2D Convolutional Neural Network based architecture that estimates surface ozone MOMO-Chem model residuals, referred to as model bias. We demonstrate the potential of this technique in North America and Europe, highlighting its ability to better capture physical model residuals compared to a traditional machine learning method. We assess the impact of incorporating land use information from high-resolution satellite imagery to improve model estimates. Importantly, we discuss how our results can improve our scientific understanding of the factors impacting ozone bias at urban scales that can be used to improve environmental policy.
[AI-68] Uncertainty Quantification for Surface Ozone Emulators using Deep Learning
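[Quick Read]: This paper addresses the limitations of traditional physics-based models in simulating surface ozone (O3) pollution trends, especially their limited accuracy at scales relevant to human health. The core challenge is to keep predictive performance high while improving interpretability, so that the models can support policy and public health decisions. The key to the solution is an uncertainty-aware U-Net architecture, combined with Bayesian and quantile regression, that produces regional estimates of the surface ozone residuals (bias) of the Multi-mOdel Multi-cOnstituent Chemical data assimilation (MOMO-Chem) model. Uncertainty quantification (UQ) is then used to identify which ground stations are optimal or sub-optimal candidates for bias correction, and the study evaluates the role of land-use information in modeling surface ozone residuals.
Link: https://arxiv.org/abs/2508.04885
Authors: Kelsey Doerksen,Yuliya Marchetti,Steven Lu,Kevin Bowman,James Montgomery,Kazuyuki Miyazaki,Yarin Gal,Freddie Kalaitzis
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: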
Abstract:Air pollution is a global hazard, and as of 2023, 94% of the world’s population is exposed to unsafe pollution levels. Surface Ozone (O3), an important pollutant, and the drivers of its trends are difficult to model, and traditional physics-based models fall short in their practical use for scales relevant to human-health impacts. Deep Learning-based emulators have shown promise in capturing complex climate patterns, but overall lack the interpretability necessary to support critical decision making for policy changes and public health measures. We implement an uncertainty-aware U-Net architecture to predict the Multi-mOdel Multi-cOnstituent Chemical data assimilation (MOMO-Chem) model’s surface ozone residuals (bias) using Bayesian and quantile regression methods. We demonstrate the capability of our techniques in regional estimation of bias in North America and Europe for June 2019. We highlight the uncertainty quantification (UQ) scores between our two UQ methodologies and discern which ground stations are optimal and sub-optimal candidates for MOMO-Chem bias correction, and evaluate the impact of land-use information in surface ozone residual modeling.
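Quantile regression, one of the two UQ methods named above, is usually trained with the pinball (quantile) loss. The sketch below shows that standard loss and a simple empirical coverage check for a predicted interval; it is a generic illustration, not the paper's training code, and the function names are hypothetical.

```python
def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Pinball loss for quantile level q in (0, 1).

    Under-prediction is weighted by q and over-prediction by (1 - q),
    so minimizing it pushes y_pred toward the q-th quantile.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)


def interval_coverage(y_true, lower, upper) -> float:
    """Fraction of observations that fall inside [lower, upper]."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return hits / len(y_true)
```

A model that predicts, say, the 5% and 95% quantiles of the ozone residual yields a 90% interval per grid cell, and comparing `interval_coverage` against 0.90 is one basic calibration check.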
[AI-69] Sequence Aware SAC Control for Engine Fuel Consumption Optimization in Electrified Powertrain
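[Quick Read]: This paper addresses adaptive, efficient energy management for engine control in heavy-duty hybrid electric vehicles (HD-HEVs), with the goal of reducing fuel consumption while maintaining battery charge over long operation times. The key to the solution is a reinforcement learning (RL) framework based on the Soft Actor-Critic (SAC) algorithm, in which Gated Recurrent Units (GRUs) and Decision Transformers (DTs) are incorporated into the actor and critic networks to capture temporal dependencies and improve decisions under complex driving conditions. Experiments show that the SAC agent with a DT-based actor and a GRU-based critic comes within 1.8% of Dynamic Programming (DP) in fuel savings on the HFET cycle, and that sequence-aware agents consistently outperform feedforward network (FFN) baselines on unseen drive cycles (US06 and the HHDDT cruise segment), indicating good generalization and robustness in realistic settings.
Link: https://arxiv.org/abs/2508.04874
Authors: Wafeeq Jaleel,Md Ragib Rownak,Athar Hanif,Sidra Ghayour Bhatti,Qadeer Ahmed
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: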
Abstract:As hybrid electric vehicles (HEVs) gain traction in heavy-duty trucks, adaptive and efficient energy management is critical for reducing fuel consumption while maintaining battery charge for long operation times. We present a new reinforcement learning (RL) framework based on the Soft Actor-Critic (SAC) algorithm to optimize engine control in series HEVs. We reformulate the control task as a sequential decision-making problem and enhance SAC by incorporating Gated Recurrent Units (GRUs) and Decision Transformers (DTs) into both actor and critic networks to capture temporal dependencies and improve planning over time. To evaluate robustness and generalization, we train the models under diverse initial battery states, drive cycle durations, power demands, and input sequence lengths. Experiments show that the SAC agent with a DT-based actor and GRU-based critic was within 1.8% of Dynamic Programming (DP) in fuel savings on the Highway Fuel Economy Test (HFET) cycle, while the SAC agent with GRUs in both actor and critic networks, and FFN actor-critic agent were within 3.16% and 3.43%, respectively. On unseen drive cycles (US06 and Heavy Heavy-Duty Diesel Truck (HHDDT) cruise segment), generalized sequence-aware agents consistently outperformed feedforward network (FFN)-based agents, highlighting their adaptability and robustness in real-world settings.
[AI-70] Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos
[Quick Read]: This paper addresses the lack of rigorous quantitative theoretical guarantees for post-training quantization (PTQ) methods, focusing on the widely used OPTQ algorithm (also known as GPTQ) and the related Qronos. Its key contribution is the first set of non-asymptotic 2-norm error bounds for deterministic and stochastic variants of OPTQ and for Qronos, along with stronger infinity-norm error bounds for the stochastic variant. The analysis shows how OPTQ's iterative procedure induces quantization error and makes explicit how that error depends on the calibration data and the regularization parameter. This gives theoretical support for practical heuristics such as ordering features by decreasing norm, offers guidance for choosing the regularization parameter, and, through the extension to Qronos, helps explain that method's empirical advantages.
Link: https://arxiv.org/abs/2508.04853
Authors: Haoyu Zhang,Shihao Zhang,Ian Colbert,Rayan Saab
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Numerical Analysis (math.NA)
Comments:
Abstract:Post-training quantization (PTQ) has become a crucial tool for reducing the memory and compute costs of modern deep neural networks, including large language models (LLMs). Among PTQ algorithms, the OPTQ framework, also known as GPTQ, has emerged as a leading method due to its computational efficiency and strong empirical performance. Despite its widespread adoption, however, OPTQ lacks rigorous quantitative theoretical guarantees. This paper presents the first quantitative error bounds for both deterministic and stochastic variants of OPTQ, as well as for Qronos, a recent related state-of-the-art PTQ algorithm. We analyze how OPTQ’s iterative procedure induces quantization error and derive non-asymptotic 2-norm error bounds that depend explicitly on the calibration data and a regularization parameter that OPTQ uses. Our analysis provides theoretical justification for several practical design choices, including the widely used heuristic of ordering features by decreasing norm, as well as guidance for selecting the regularization parameter. For the stochastic variant, we establish stronger infinity-norm error bounds, which enable control over the required quantization alphabet and are particularly useful for downstream layers and nonlinearities. Finally, we extend our analysis to Qronos, providing new theoretical bounds, for both its deterministic and stochastic variants, that help explain its empirical advantages.
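The abstract distinguishes deterministic and stochastic variants of OPTQ. The sketch below assumes (the abstract does not say this) that the variants differ in the scalar rounding operator: round-to-nearest versus unbiased stochastic rounding onto a uniform grid. It does not implement OPTQ's iterative error-feedback updates, and both function names are hypothetical.

```python
import math
import random


def round_nearest(x: float, step: float) -> float:
    """Deterministic rounding of x to the nearest multiple of step."""
    return step * round(x / step)


def round_stochastic(x: float, step: float, rng: random.Random) -> float:
    """Round x to one of its two neighboring grid points at random.

    The upper neighbor is chosen with probability equal to the fractional
    position of x between the two, so the result equals x in expectation.
    """
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    return step * (lower + (1 if rng.random() < frac else 0))
```

Round-to-nearest minimizes the error of each individual weight, while stochastic rounding gives up some per-weight accuracy for zero-mean errors, a property that typically makes error accumulation easier to bound.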
[AI-71] Large Language Models Reasoning Abilities Under Non-Ideal Conditions After RL-Fine-Tuning
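[Quick Read]: This paper addresses the gap between the strong reasoning that large language models (LLMs) show under idealized evaluation settings and their much weaker performance in realistic, non-ideal scenarios. The core issue is that existing reinforcement learning (RL) fine-tuning improves reasoning under ideal conditions but does not cope with situations common in practice, such as incomplete inputs, noise, and redundant context. The paper defines and systematically evaluates three practically relevant non-ideal scenarios (summary inference, fine-grained noise suppression, and contextual filtering), fine-tuning several LLMs and a state-of-the-art large vision-language model (LVLM) with a policy-gradient RL algorithm, and finds serious robustness shortcomings in current mainstream methods.
Link: https://arxiv.org/abs/2508.04848
Authors: Chang Tian,Matthew B. Blaschko,Mingzhe Xing,Xiuxing Li,Yinliang Yue,Marie-Francine Moens
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: large language models, large vision-language model, reasoning, non-ideal conditions, reinforcement learning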
Abstract:Reinforcement learning (RL) has become a key technique for enhancing the reasoning abilities of large language models (LLMs), with policy-gradient algorithms dominating the post-training stage because of their efficiency and effectiveness. However, most existing benchmarks evaluate large-language-model reasoning under idealized settings, overlooking performance in realistic, non-ideal scenarios. We identify three representative non-ideal scenarios with practical relevance: summary inference, fine-grained noise suppression, and contextual filtering. We introduce a new research direction guided by brain-science findings that human reasoning remains reliable under imperfect inputs. We formally define and evaluate these challenging scenarios. We fine-tune three LLMs and a state-of-the-art large vision-language model (LVLM) using RL with a representative policy-gradient algorithm and then test their performance on eight public datasets. Our results reveal that while RL fine-tuning improves baseline reasoning under idealized settings, performance declines significantly across all three non-ideal scenarios, exposing critical limitations in advanced reasoning capabilities. Although we propose a scenario-specific remediation method, our results suggest current methods leave these reasoning deficits largely unresolved. This work highlights that the reasoning abilities of large models are often overstated and underscores the importance of evaluating models under non-ideal scenarios. The code and data will be released at XXXX.
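[AI-72] Multi-Stage Knowledge-Distilled VGAE and GAT for Robust Controller-Area-Network Intrusion Detection
[Quick read]: This paper addresses the vulnerability of the in-vehicle Controller Area Network (CAN) protocol to cyber-attacks, which stems from its lack of built-in security. It proposes a multi-stage intrusion detection framework whose key idea is to model CAN bus traffic as graph sequences that capture temporal and relational dependencies, and to combine unsupervised anomaly detection with supervised graph learning. A Variational Graph Autoencoder (VGAE) first detects structural anomalies and drives selective undersampling to handle class imbalance; a Knowledge-Distilled Graph Attention Network (KD-GAT) then classifies attacks, with the student model keeping strong performance at 96% fewer parameters than the teacher.
Link: https://arxiv.org/abs/2508.04845
Authors: Robert Frenken, Sidra Ghayour Bhatti, Hanqin Zhang, Qadeer Ahmed
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2507.19686. Author note: This submission is an extension of the above work by the same author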
Abstract:The Controller Area Network (CAN) protocol is a standard for in-vehicle communication but remains susceptible to cyber-attacks due to its lack of built-in security. This paper presents a multi-stage intrusion detection framework leveraging unsupervised anomaly detection and supervised graph learning tailored for automotive CAN traffic. Our architecture combines a Variational Graph Autoencoder (VGAE) for structural anomaly detection with a Knowledge-Distilled Graph Attention Network (KD-GAT) for robust attack classification. CAN bus activity is encoded as graph sequences to model temporal and relational dependencies. The pipeline applies VGAE-based selective undersampling to address class imbalance, followed by GAT classification with optional score-level fusion. The compact student GAT achieves 96% parameter reduction compared to the teacher model while maintaining strong predictive performance. Experiments on six public CAN intrusion datasets–Car-Hacking, Car-Survival, and can-train-and-test–demonstrate competitive accuracy and efficiency, with average improvements of 16.2% in F1-score over existing methods, particularly excelling on highly imbalanced datasets with up to 55% F1-score improvements.
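Illustrative note (not from the paper): the abstract does not state which distillation objective trains the student GAT. A common choice, assumed here, is the classic temperature-scaled formulation that mixes a KL term against the teacher's softened outputs with ordinary cross-entropy on the label. The function names, `temperature`, and `alpha` below are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Assumed KD objective: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE.

    This is the generic textbook form, not necessarily the loss used for KD-GAT.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Cross-entropy of the student (temperature 1) against the hard label.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce
```

A student that reproduces the teacher's logits pays no KL penalty, so only the cross-entropy term remains.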
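[AI-73] Automated File-Level Logging Generation for Machine Learning Applications using LLMs: A Case Study using GPT-4o Mini
[Quick read]: This paper addresses the lack of file-level log generation for machine learning (ML) projects, and asks how well large language models (LLMs) can automatically produce log statements that are high quality, well placed, and consistent with project conventions. The approach is a file-level log generation experiment with GPT-4o mini on 171 ML projects containing 4,073 Python files: generated logs are compared with human-written logs on position, log level, variables, and text quality, and a manual analysis identifies common patterns and challenges. The model places logs where humans did in 63.91% of cases, but with an overlogging rate of 82.66%. It also tends to overlog at the beginning or end of functions, struggles to place logs inside large code blocks, and departs from project-specific logging conventions, so these limitations must be addressed before practical use.
Link: https://arxiv.org/abs/2508.04820
Authors: Mayra Sofia Ruiz Rodriguez, SayedHassan Khatoonabadi, Emad Shihab
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: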
Abstract:Logging is essential in software development, helping developers monitor system behavior and aiding in debugging applications. Given the ability of large language models (LLMs) to generate natural language and code, researchers are exploring their potential to generate log statements. However, prior work focuses on evaluating logs introduced in code functions, leaving file-level log generation underexplored – especially in machine learning (ML) applications, where comprehensive logging can enhance reliability. In this study, we evaluate the capacity of GPT-4o mini as a case study to generate log statements for ML projects at file level. We gathered a set of 171 ML repositories containing 4,073 Python files with at least one log statement. We identified and removed the original logs from the files, prompted the LLM to generate logs for them, and evaluated both the position of the logs and log level, variables, and text quality of the generated logs compared to human-written logs. In addition, we manually analyzed a representative sample of generated logs to identify common patterns and challenges. We find that the LLM introduces logs in the same place as humans in 63.91% of cases, but at the cost of a high overlogging rate of 82.66%. Furthermore, our manual analysis reveals challenges for file-level logging, which shows overlogging at the beginning or end of a function, difficulty logging within large code blocks, and misalignment with project-specific logging conventions. While the LLM shows promise for generating logs for complete files, these limitations remain to be addressed for practical implementation.
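[AI-74] Optimality Principles and Neural Ordinary Differential Equations-based Process Modeling for Distributed Control
[Quick read]: This paper addresses how to integrate data-driven machine learning methods naturally into classical process modeling and control frameworks for complex process networks. The core challenge is to keep the system topology consistent and conserve extensive quantities while introducing dynamic relations learned from data. The key is a process modeling framework built on connectivity matrices and network graphs: a natural objective function, equivalent to the non-equilibrium entropy production at steady state, acts as the driving force of the dynamics, and conic sector (passivity) conditions constrain the flows. On this basis, sparse deep neural networks trained with the adjoint method and an adaptive ODE solver learn the system-specific constitutive relations from synthetic time-series data, yielding a state-space model usable in model predictive control (MPC).
Link: https://arxiv.org/abs/2508.04799
Authors: Michael R. Wartmann, B. Erik Ydstie
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 27 pages, 7 figures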
Abstract:Most recent advances in machine learning and analytics for process control pose the question of how to naturally integrate new data-driven methods with classical process models and control. We propose a process modeling framework enabling integration of data-driven algorithms through consistent topological properties and conservation of extensive quantities. Interconnections among process network units are represented through connectivity matrices and network graphs. We derive the system’s natural objective function equivalent to the non-equilibrium entropy production in a steady state system as a driving force for the process dynamics. We illustrate how distributed control and optimization can be implemented into process network structures and how control laws and algorithms alter the system’s natural equilibrium towards engineered objectives. The basic requirement is that the flow conditions can be expressed in terms of conic sector (passivity) conditions. Our formalism allows integration of fundamental conservation properties from topology with learned dynamic relations from data through sparse deep neural networks. We demonstrate in a practical example of a simple inventory control system how to integrate the basic topology of a process with a neural network ordinary differential equation model. The system specific constitutive equations are left undescribed and learned by the neural ordinary differential equation algorithm using the adjoint method in combination with an adaptive ODE solver from synthetic time-series data. The resulting neural network forms a state space model for use in e.g. a model predictive control algorithm. 
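[AI-75] Evaluating the Impact of LLM-guided Reflection on Learning Outcomes with Interactive AI-Generated Educational Podcasts
[Quick read]: This paper asks whether embedding large language model (LLM)-guided reflection prompts in an interactive AI-generated podcast improves learning and user experience. The approach is to add LLM-guided reflection prompts intended to encourage deeper processing of the content. Learning outcomes were similar across conditions, while the prompts lowered perceived attractiveness, which suggests that reflective interactivity design needs further work to balance cognitive benefit and user experience.
Link: https://arxiv.org/abs/2508.04787
Authors: Vishnu Menon, Andy Cherney, Elizabeth B. Cloude, Li Zhang, Tiffany D. Do
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted to NCME Special Interest Group on AI in Measurement: AIME-CON 2025 conference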
Abstract:This study examined whether embedding LLM-guided reflection prompts in an interactive AI-generated podcast improved learning and user experience compared to a version without prompts. Thirty-six undergraduates participated, and while learning outcomes were similar across conditions, reflection prompts reduced perceived attractiveness, highlighting a call for more research on reflective interactivity design.
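[AI-76] Uncertainty-aware Predict-Then-Optimize Framework for Equitable Post-Disaster Power Restoration
[Quick read]: This paper addresses the imbalance between efficiency and equity in power system restoration, in particular the inequity that arises because disadvantaged communities submit fewer restoration requests and therefore face a higher risk of prolonged outages. The proposed framework, EPOPR, follows a predict-then-optimize design with two key components: (1) Equity-Conformalized Quantile Regression, an uncertainty-aware method for predicting repair durations under heteroscedastic data; and (2) Spatial-Temporal Attentional RL, which lets the decision agent adapt to varying uncertainty levels across regions, improving equity between communities while preserving overall restoration efficiency.
Link: https://arxiv.org/abs/2508.04780
Authors: Lin Jiang, Dahai Yu, Rongchao Xu, Tian Tang, Guang Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 9 pages, 12 figures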
Abstract:The increasing frequency of extreme weather events, such as hurricanes, highlights the urgent need for efficient and equitable power system restoration. Many electricity providers make restoration decisions primarily based on the volume of power restoration requests from each region. However, our data-driven analysis reveals significant disparities in request submission volume, as disadvantaged communities tend to submit fewer restoration requests. This disparity makes the current restoration solution inequitable, leaving these communities vulnerable to extended power outages. To address this, we aim to propose an equity-aware power restoration strategy that balances both restoration efficiency and equity across communities. However, achieving this goal is challenging for two reasons: the difficulty of predicting repair durations under dataset heteroscedasticity, and the tendency of reinforcement learning agents to favor low-uncertainty actions, which potentially undermine equity. To overcome these challenges, we design a predict-then-optimize framework called EPOPR with two key components: (1) Equity-Conformalized Quantile Regression for uncertainty-aware repair duration prediction, and (2) Spatial-Temporal Attentional RL that adapts to varying uncertainty levels across regions for equitable decision-making. Experimental results show that our EPOPR effectively reduces the average power outage duration by 3.60% and decreases inequity between different communities by 14.19% compared to state-of-the-art baselines.
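Illustrative note (not from the paper): the abstract names Equity-Conformalized Quantile Regression but does not describe its equity-specific calibration. The sketch below shows only plain split-conformal quantile regression, the standard technique it presumably builds on: a held-out calibration set is used to widen predicted quantile intervals by a single offset. All function and variable names are hypothetical.

```python
import math


def conformal_offset(lower_preds, upper_preds, y_true, alpha=0.1):
    """Offset that widens [lower, upper] intervals to target 1 - alpha coverage.

    Plain split-conformal quantile regression; the paper's equity-aware
    variant is not reproduced here.
    """
    # Conformity score: how far each true value falls outside its interval
    # (negative when the value is strictly inside).
    scores = sorted(max(lo - y, y - hi)
                    for lo, hi, y in zip(lower_preds, upper_preds, y_true))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for the requested coverage level.
        return float("inf")
    return scores[rank - 1]


def calibrated_interval(lower, upper, offset):
    """Apply the calibration offset symmetrically to one predicted interval."""
    return lower - offset, upper + offset
```

With heteroscedastic repair durations, the base quantile model supplies region-dependent interval widths, and the offset corrects their overall coverage.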
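[AI-77] Agency Affordances and Enculturation of Augmentation Technologies
[Quick read]: This paper critically examines the optimistic assumption that augmentation technologies will markedly improve lives, literacy, cultures, arts, economies, and social contexts; the concern is that, in the AI era, this assumption can obscure the power relations and social encoding behind the technology. The approach is interdisciplinary, drawing on media and communication studies to analyze the power structures in agentive relationships between humans and machines, and to show how marketing communication draws future users into commercial digital landscapes such as the Metaverse concept, thereby enculturating augmentation technologies. The chapter stresses the need to clarify ambiguous AI terminology, uses the WIPO Categorization of AI Technologies Scheme as a framework, analyzes how the development of non-human agents in industry drives the rise of augmentation technologies, and closes with a careful assessment of claims about the Metaverse and augmented reality (AR).
Link: https://arxiv.org/abs/2508.04725
Authors: Ann Hill Duin, Isabel Pedersen
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 28 pages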
Abstract:Augmentation technologies are undergoing a process of enculturation due to many factors, one being the rise of artificial intelligence (AI), or what the World Intellectual Property Organization (WIPO) terms the AI wave or AI boom. Chapter 3 focuses critical attention on the hyped assumption that sophisticated, emergent, and embodied augmentation technologies will improve lives, literacy, cultures, arts, economies, and social contexts. The chapter begins by discussing the problem of ambiguity with AI terminology, which it aids with a description of the WIPO Categorization of AI Technologies Scheme. It then draws on media and communication studies to explore concepts such as agents, agency, power, and agentive relationships between humans and robots. The chapter focuses on the development of non-human agents in industry as a critical factor in the rise of augmentation technologies. It looks at how marketing communication enculturates future users to adopt and adapt to the technology. Scholars are charting the significant ways that people are drawn further into commercial digital landscapes, such as the Metaverse concept, in post-internet society. It concludes by examining recent claims concerning the Metaverse and augmented reality.
[AI-78] Wearable Music2Emotion: Assessing Emotions Induced by AI-Generated Music through Portable EEG-fNIRS Fusion ACM-MM2025
[Quick read]: This paper addresses three problems in research on music-induced emotion: (1) music stimuli are limited to small corpora and subject to subjective selection bias, so they fail to reflect individual affective differences; (2) over-reliance on unimodal neural signals such as EEG ignores complementary information from multimodal fusion; (3) conventional brain-computer interface equipment is bulky and complex to operate, which limits real-world use. The proposed MEEtBrain framework uses generative AI to produce music stimuli automatically and at scale, removing subjective bias and increasing diversity, and uses a lightweight wireless headband to record EEG and fNIRS synchronously for portable, multimodal emotion analysis (valence/arousal). The framework was validated on 20 participants; the dataset has since grown to 44 participants and is publicly shared.
Link: https://arxiv.org/abs/2508.04723
Authors: Sha Zhao, Song Yi, Yangxuan Zhou, Jiadong Pan, Jiquan Wang, Jie Xia, Shijian Li, Shurong Dong, Gang Pan
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted by ACM MM 2025
Abstract:Emotions critically influence mental health, driving interest in music-based affective computing via neurophysiological signals with Brain-computer Interface techniques. While prior studies leverage music’s accessibility for emotion induction, three key limitations persist: (1) Stimulus Constraints: Music stimuli are confined to small corpora due to copyright and curation costs, with selection biases from heuristic emotion-music mappings that ignore individual affective profiles. (2) Modality Specificity: Overreliance on unimodal neural data (e.g., EEG) ignores complementary insights from cross-modal signal fusion. (3) Portability Limitation: Cumbersome setups (e.g., 64+ channel gel-based EEG caps) hinder real-world applicability due to procedural complexity and portability barriers. To address these limitations, we propose MEEtBrain, a portable and multimodal framework for emotion analysis (valence/arousal), integrating AI-generated music stimuli with synchronized EEG-fNIRS acquisition via a wireless headband. With MEEtBrain, the music stimuli can be automatically generated by AI on a large scale, eliminating subjective selection biases while ensuring music diversity. We use our developed portable device, which is designed in a lightweight headband style and uses dry electrodes, to simultaneously collect EEG and fNIRS recordings. A 14-hour dataset from 20 participants was collected in the first recruitment to validate the framework’s efficacy, with AI-generated music eliciting target emotions (valence/arousal). We are actively expanding our multimodal dataset (44 participants in the latest dataset) and make it publicly available to promote further research and practical applications. The dataset is available at this https URL.
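[AI-79] Toward Low-Latency End-to-End Voice Agents for Telecommunications Using Streaming ASR, Quantized LLMs, and Real-Time TTS
[Quick read]: This paper addresses the need for low latency and strong domain adaptation in real-time voice AI agents for telecommunications, targeting call center automation, intelligent Interactive Voice Response (IVR), and AI-driven customer support. The key is an end-to-end, telecom-specific voice AI pipeline that integrates four domain-adapted models: TSLAM (a 4-bit quantized telecom LLM), T-VEC (a telecom embedding model), TTE (a telecom automatic speech recognition model), and T-Synth (a telecom text-to-speech model). Streaming ASR, conversational intelligence, retrieval-augmented generation (RAG) over telecom documents, and real-time TTS together give low-latency, knowledge-grounded spoken interaction, with measured real-time factors (RTF) below 1.0.
Link: https://arxiv.org/abs/2508.04721
Authors: Vignesh Ethiraj, Ashwath David, Sidhanth Menon, Divya Vijay
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: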
Abstract:We introduce a low-latency telecom AI voice agent pipeline for real-time, interactive telecommunications use, enabling advanced voice AI for call center automation, intelligent IVR (Interactive Voice Response), and AI-driven customer support. The solution is built for telecom, combining four specialized models by NetoAI: TSLAM, a 4-bit quantized Telecom-Specific Large Language Model (LLM); T-VEC, a Telecom-Specific Embedding Model; TTE, a Telecom-Specific Automatic Speech Recognition (ASR) model; and T-Synth, a Telecom-Specific Text-to-Speech (TTS) model. These models enable highly responsive, domain-adapted voice AI agents supporting knowledge-grounded spoken interactions with low latency. The pipeline integrates streaming ASR (TTE), conversational intelligence (TSLAM), retrieval augmented generation (RAG) over telecom documents, and real-time TTS (T-Synth), setting a new benchmark for telecom voice assistants. To evaluate the system, we built a dataset of 500 human-recorded telecom questions from RFCs, simulating real telecom agent queries. This framework allows analysis of latency, domain relevance, and real-time performance across the stack. Results show that TSLAM, TTE, and T-Synth deliver real-time factors (RTF) below 1.0, supporting enterprise, low-latency telecom deployments. These AI agents – powered by TSLAM, TTE, and T-Synth – provide a foundation for next-generation telecom AI, enabling automated customer support, diagnostics, and more.
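Illustrative note (not from the paper): the real-time factor (RTF) cited above is conventionally the processing time divided by the duration of the audio involved, so a value below 1.0 means the component runs faster than real time. The helper names below are hypothetical, and `process_fn` stands in for any ASR or TTS call; the paper's own measurement harness is not described.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(process_fn, audio, audio_seconds):
    """Time one call of a (hypothetical) speech component and return its RTF."""
    start = time.perf_counter()
    process_fn(audio)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)
```

For example, transcribing a 2-second utterance in 0.5 seconds gives an RTF of 0.25.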
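[AI-80] Who is a Better Player: LLM against LLM
[Quick read]: This paper addresses the data-dependency limitation of mainstream Question-and-Answer (QA) benchmarks for large language models (LLMs) by proposing a comprehensive evaluation framework based on adversarial board games. The key is Qi Town, a specialized evaluation platform that supports 5 widely played games and 20 LLM-driven players. It uses the Elo rating system and a novel Performance Loop Graph (PLG) to quantify technical capability, and a Positive Sentiment Score (PSS) to assess mental fitness; a round-robin tournament enables systematic comparison, giving a fuller picture of LLM performance and stability in high-stress adversarial settings.
Link: https://arxiv.org/abs/2508.04720
Authors: Yingjie Zhou, Jiezhang Cao, Farong Wen, Li Xu, Yanwei Jiang, Jun Jia, Ronghui Li, Xiaohong Liu, Yu Zhou, Xiongkuo Min, Jie Guo, Zicheng Zhang, Guangtao Zhai
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: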
Abstract:Adversarial board games, as a paradigmatic domain of strategic reasoning and intelligence, have long served as both a popular competitive activity and a benchmark for evaluating artificial intelligence (AI) systems. Building on this foundation, we propose an adversarial benchmarking framework to assess the comprehensive performance of Large Language Models (LLMs) through board games competition, compensating the limitation of data dependency of the mainstream Question-and-Answer (QA) based benchmark method. We introduce Qi Town, a specialized evaluation platform that supports 5 widely played games and involves 20 LLM-driven players. The platform employs both the Elo rating system and a novel Performance Loop Graph (PLG) to quantitatively evaluate the technical capabilities of LLMs, while also capturing Positive Sentiment Score (PSS) throughout gameplay to assess mental fitness. The evaluation is structured as a round-robin tournament, enabling systematic comparison across players. Experimental results indicate that, despite technical differences, most LLMs remain optimistic about winning and losing, demonstrating greater adaptability to high-stress adversarial environments than humans. On the other hand, the complex relationship between cyclic wins and losses in PLGs exposes the instability of LLMs’ skill play during games, warranting further explanation and exploration.
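Illustrative note (not from the paper): the Elo rating system mentioned above updates both players after each game from the gap between the actual and expected score. The sketch uses the standard logistic formula with a 400-point scale; the K-factor of 32 is an assumption, since the abstract does not say what Qi Town uses.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor, not a value taken from the paper.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

In a round-robin tournament, this update would be applied once per game for every pair of players.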
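[AI-81] GeoFlow: Agentic Workflow Automation for Geospatial Tasks
[Quick read]: This paper addresses the limited efficiency and accuracy of agents on geospatial tasks: prior work focuses on reasoning decomposition but gives no explicit guidance for geospatial API calls. The key idea of GeoFlow is to give each agent detailed tool-calling objectives that guide geospatial API invocation at runtime. Compared with state-of-the-art approaches, this raises agentic success by 6.8% and cuts token usage by up to fourfold.
Link: https://arxiv.org/abs/2508.04719
Authors: Amulya Bhattaram, Justin Chung, Stanley Chung, Ranit Gupta, Janani Ramamoorthy, Kartikeya Gullapalli, Diana Marculescu, Dimitrios Stamoulis
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ACM SIGSPATIAL 2025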
Abstract:We present GeoFlow, a method that automatically generates agentic workflows for geospatial tasks. Unlike prior work that focuses on reasoning decomposition and leaves API selection implicit, our method provides each agent with detailed tool-calling objectives to guide geospatial API invocation at runtime. GeoFlow increases agentic success by 6.8% and reduces token usage by up to fourfold across major LLM families compared to state-of-the-art approaches.
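[AI-82] AI Should Be More Human, Not More Complex
[Quick read]: This paper addresses the tendency of large language models (LLMs) in search applications to produce verbose, lexically complex answers that lower user satisfaction and engagement. Based on a large-scale user study, it finds that users prefer concise, source-attributed responses to elaborate explanations. The authors argue that AI systems should avoid “artificial sophistication” and instead mirror effective human communication: direct, properly sourced, and honest about limitations, which improves trust and user experience.
Link: https://arxiv.org/abs/2508.04713
Authors: Carlo Esposito (Eyed Softwares, Aploide Softwares)
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 2025 - Knowledge Commons - Eyed Research Collection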
Abstract:Large Language Models (LLMs) in search applications increasingly prioritize verbose, lexically complex responses that paradoxically reduce user satisfaction and engagement. Through a comprehensive study of 10.000 (est.) participants comparing responses from five major AI-powered search systems, we demonstrate that users overwhelmingly prefer concise, source-attributed responses over elaborate explanations. Our analysis reveals that current AI development trends toward “artificial sophistication” create an uncanny valley effect where systems sound knowledgeable but lack genuine critical thinking, leading to reduced trust and increased cognitive load. We present evidence that optimal AI communication mirrors effective human discourse: direct, properly sourced, and honest about limitations. Our findings challenge the prevailing assumption that more complex AI responses indicate better performance, instead suggesting that human-like brevity and transparency are key to user engagement and system reliability.
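[AI-83] How Robust are LLM-Generated Library Imports? An Empirical Study using Stack Overflow
[Quick read]: This paper examines how accurate and usable the software libraries recommended by large language models (LLMs) are when they generate code, and in particular whether the recommendations work out of the box. The key finding is that LLMs tend to recommend mature, popular, and permissively licensed third-party libraries, yet usability gaps remain: 4.6% of the recommended libraries could not be resolved automatically because import names did not match installable packages, and only two of the six models gave installation guidance, leaving developers to resolve dependencies by hand. This exposes a limitation of LLM-generated code in dependency management and points to concrete ways to improve its reliability in real development.
Link: https://arxiv.org/abs/2507.10818
Authors: Jasmine Latendresse, SayedHassan Khatoonabadi, Emad Shihab
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: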
Abstract:Software libraries are central to the functionality, security, and maintainability of modern code. As developers increasingly turn to Large Language Models (LLMs) to assist with programming tasks, understanding how these models recommend libraries is essential. In this paper, we conduct an empirical study of six state-of-the-art LLMs, both proprietary and open-source, by prompting them to solve real-world Python problems sourced from Stack Overflow. We analyze the types of libraries they import, the characteristics of those libraries, and the extent to which the recommendations are usable out of the box. Our results show that LLMs predominantly favour third-party libraries over standard ones, and often recommend mature, popular, and permissively licensed dependencies. However, we also identify gaps in usability: 4.6% of the libraries could not be resolved automatically due to structural mismatches between import names and installable packages, and only two models (out of six) provided installation guidance. While the generated code is technically valid, the lack of contextual support places the burden of manually resolving dependencies on the user. Our findings offer actionable insights for both developers and researchers, and highlight opportunities to improve the reliability and usability of LLM-generated code in the context of software dependencies.
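Illustrative note (not from the paper): the mismatch between import names and installable packages described above can be seen with a few well-known cases, such as `cv2` being installed as `opencv-python`. The sketch extracts top-level imports with the standard `ast` module and maps them through a small hand-written table. The table, the function names, and the fallback of reusing the import name are assumptions for illustration; this is not the study's resolution procedure.

```python
import ast

# Hand-written examples of import names that differ from their PyPI package names.
IMPORT_TO_PACKAGE = {
    "cv2": "opencv-python",
    "sklearn": "scikit-learn",
    "yaml": "PyYAML",
    "PIL": "Pillow",
    "bs4": "beautifulsoup4",
}


def top_level_imports(source):
    """Return the set of top-level module names imported by a Python source string."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import x".
            names.add(node.module.split(".")[0])
    return names


def install_candidates(source, stdlib_modules):
    """Map each non-stdlib import to a guessed package name (import name by default)."""
    return {
        name: IMPORT_TO_PACKAGE.get(name, name)
        for name in sorted(top_level_imports(source))
        if name not in stdlib_modules
    }
```

On Python 3.10 and later, `sys.stdlib_module_names` is one possible source for the `stdlib_modules` argument.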
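[AI-84] Evaluating the Use of LLMs for Documentation to Code Traceability
[Quick read]: This paper addresses the automatic recovery of traceability links between software documentation (such as API references and user guides) and source code, with the aim of improving development efficiency and maintenance quality. The approach uses large language models (LLMs) to model and reason about the semantic relations between documentation and code. Two new datasets built from open-source projects (Unity Catalog and Crawl4AI) are used to assess three capabilities: (1) trace link identification accuracy, (2) relationship explanation quality, and (3) multi-step chain reconstruction. The best LLM reaches F1-scores of 79.4% and 80.4% on the two datasets, well above the baselines (TF-IDF, BM25, and CodeBERT); fully correct explanations range from 42.9% to 71.1%, but partial accuracy exceeds 97%. Task framing, such as a one-to-many matching strategy, strongly affects performance, and typical errors such as naming-based assumptions and phantom links suggest that future tools should keep a human in the loop.
Link: https://arxiv.org/abs/2506.16440
Authors: Ebube Alor, SayedHassan Khatoonabadi, Emad Shihab
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: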
Abstract:Large Language Models (LLMs) offer new potential for automating documentation-to-code traceability, yet their capabilities remain underexplored. We present a comprehensive evaluation of LLMs (Claude 3.5 Sonnet, GPT-4o, and o3-mini) in establishing trace links between various software documentation (including API references and user guides) and source code. We create two novel datasets from two open-source projects (Unity Catalog and Crawl4AI). Through systematic experiments, we assess three key capabilities: (1) trace link identification accuracy, (2) relationship explanation quality, and (3) multi-step chain reconstruction. Results show that the best-performing LLM achieves F1-scores of 79.4% and 80.4% across the two datasets, substantially outperforming our baselines (TF-IDF, BM25, and CodeBERT). While fully correct relationship explanations range from 42.9% to 71.1%, partial accuracy exceeds 97%, indicating that fundamental connections are rarely missed. For multi-step chains, LLMs maintain high endpoint accuracy but vary in capturing precise intermediate links. Error analysis reveals that many false positives stem from naming-based assumptions, phantom links, or overgeneralization of architectural patterns. We demonstrate that task-framing, such as a one-to-many matching strategy, is critical for performance. These findings position LLMs as powerful assistants for trace discovery, but their limitations could necessitate human-in-the-loop tool design and highlight specific error patterns for future research.
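Illustrative note (not from the paper): F1-scores for trace link identification are typically computed by comparing the set of predicted (document, code) links with a gold set. The sketch below shows that standard set-based calculation; the link representation as tuples and the function name are assumptions, and the paper's exact matching rules are not reproduced.

```python
def link_f1(predicted_links, gold_links):
    """Return (precision, recall, f1) for predicted vs. gold trace links.

    Links are hashable items, for example (doc_section, source_file) tuples.
    """
    predicted = set(predicted_links)
    gold = set(gold_links)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A "phantom link" in the abstract's sense would appear here as a predicted pair that is absent from the gold set, lowering precision.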
[AI-85] LLM-based Multi-Agent Copilot for Quantum Sensor
[Quick read]: This paper addresses the efficiency bottleneck in quantum sensor development caused by interdisciplinary knowledge barriers and complex optimization processes. The key is QCopilot, a large language model (LLM)-based multi-agent framework that combines external knowledge retrieval, active learning, and uncertainty quantification. Specialized agents adaptively select optimization methods, automate modeling analysis, and diagnose problems independently, enabling autonomous experiment design and identification of anomalous parameters without human intervention.
Link: https://arxiv.org/abs/2508.05421
Authors: Rong Sha, Binglin Wang, Jun Yang, Xiaoxiao Ma, Chengkun Wu, Liang Yan, Chao Zhou, Jixun Liu, Guochao Wang, Shuhua Yan, Lingxiao Zhu
Affiliations: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Atomic Physics (physics.atom-ph)
Comments: 13 pages, 4 figures
Abstract:Large language models (LLM) exhibit broad utility but face limitations in quantum sensor development, stemming from interdisciplinary knowledge barriers and involving complex optimization processes. Here we present QCopilot, an LLM-based multi-agent framework integrating external knowledge access, active learning, and uncertainty quantification for quantum sensor design and diagnosis. Comprising commercial LLMs with few-shot prompt engineering and vector knowledge base, QCopilot employs specialized agents to adaptively select optimization methods, automate modeling analysis, and independently perform problem diagnosis. Applying QCopilot to atom cooling experiments, we generated $10^{8}$ sub-$\mu$K atoms without any human intervention within a few hours, representing $\sim 100\times$ speedup over manual experimentation. Notably, by continuously accumulating prior knowledge and enabling dynamic modeling, QCopilot can autonomously identify anomalous parameters in multi-parameter experimental settings. Our work reduces barriers to large-scale quantum sensor deployment and readily extends to other quantum information systems.
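[AI-86] Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS INTERSPEECH2025
[Quick read]: This paper addresses the data scarcity that holds back assistive technologies for dysarthric speech, and in particular the biases that generative speech synthesis may introduce when used for data augmentation. It evaluates the state-of-the-art F5-TTS model on zero-shot voice cloning of dysarthric speech, focusing on intelligibility, speaker similarity, and prosody preservation, and uses fairness metrics such as Disparate Impact and Parity Difference to quantify disparities across dysarthria severity levels. The study finds that F5-TTS is strongly biased toward intelligibility at the expense of speaker and prosody preservation, which exposes a fairness problem in current methods and informs the design of more inclusive dysarthric speech synthesis.
Link: https://arxiv.org/abs/2508.05102
Authors: Anuprabha M, Krishna Gurugubelli, Anil Kumar Vuppala
Affiliations: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: Accepted at Interspeech 2025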
Abstract:Dysarthric speech poses significant challenges in developing assistive technologies, primarily due to the limited availability of data. Recent advances in neural speech synthesis, especially zero-shot voice cloning, facilitate synthetic speech generation for data augmentation; however, they may introduce biases towards dysarthric speech. In this paper, we investigate the effectiveness of state-of-the-art F5-TTS in cloning dysarthric speech using TORGO dataset, focusing on intelligibility, speaker similarity, and prosody preservation. We also analyze potential biases using fairness metrics like Disparate Impact and Parity Difference to assess disparities across dysarthric severity levels. Results show that F5-TTS exhibits a strong bias toward speech intelligibility over speaker and prosody preservation in dysarthric speech synthesis. Insights from this study can help integrate fairness-aware dysarthric speech synthesis, fostering the advancement of more inclusive speech technologies.
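Illustrative note (not from the paper): Disparate Impact and Parity Difference are usually defined on the rate of a favorable binary outcome in two groups, as a ratio and a difference respectively. How the paper derives a favorable outcome from intelligibility, speaker similarity, or prosody scores is not stated in the abstract, so the binary outcome lists and group names below are purely hypothetical.

```python
def favorable_rate(outcomes):
    """Fraction of favorable outcomes (1 = favorable, 0 = unfavorable)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


def disparate_impact(unprivileged, privileged):
    """Ratio of favorable rates; 1.0 indicates parity between the groups."""
    reference = favorable_rate(privileged)
    if reference == 0:
        raise ValueError("privileged group has no favorable outcomes")
    return favorable_rate(unprivileged) / reference


def parity_difference(unprivileged, privileged):
    """Difference of favorable rates; 0.0 indicates parity between the groups."""
    return favorable_rate(unprivileged) - favorable_rate(privileged)
```

For instance, one could treat a mild-severity group as the reference and a severe group as the comparison, with an outcome marked favorable when a cloned utterance passes some chosen quality threshold.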
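[AI-87] ERDES: A Benchmark Video Dataset for Retinal Detachment and Macular Status Classification in Ocular Ultrasound
[Quick read]: This paper addresses the lack of automated tools for identifying retinal detachment (RD) in ultrasound and for determining whether the macula is involved (macula-intact vs. macula-detached), a distinction that is essential for surgical prioritization. Diagnosis with point-of-care ultrasound (POCUS) is limited by operator expertise, especially in resource-limited settings. The key contribution is Eye Retinal DEtachment ultraSound (ERDES), the first open-access dataset of ocular ultrasound clips labeled for RD presence and macular status, together with baseline benchmarks using spatiotemporal convolutional neural networks (CNNs), to support the development and evaluation of machine learning models for RD detection and macular status classification.
Link: https://arxiv.org/abs/2508.04735
Authors: Pouyan Navard, Yasemin Ozkut, Srikar Adhikari, Elaine Situ-LaCasse, Josie Acuña, Adrienne Yarnish, Alper Yilmaz
Affiliations: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: Under Review, this https URL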
Abstract:Retinal detachment (RD) is a vision-threatening condition that requires timely intervention to preserve vision. Macular involvement – whether the macula is still intact (macula-intact) or detached (macula-detached) – is the key determinant of visual outcomes and treatment urgency. Point-of-care ultrasound (POCUS) offers a fast, non-invasive, cost-effective, and accessible imaging modality widely used in diverse clinical settings to detect RD. However, ultrasound image interpretation is limited by a lack of expertise among healthcare providers, especially in resource-limited settings. Deep learning offers the potential to automate ultrasound-based assessment of RD. However, there are no ML ultrasound algorithms currently available for clinical use to detect RD and no prior research has been done on assessing macular status using ultrasound in RD cases – an essential distinction for surgical prioritization. Moreover, no public dataset currently supports macular-based RD classification using ultrasound video clips. We introduce Eye Retinal DEtachment ultraSound, ERDES, the first open-access dataset of ocular ultrasound clips labeled for (i) presence of retinal detachment and (ii) macula-intact versus macula-detached status. The dataset is intended to facilitate the development and evaluation of machine learning models for detecting retinal detachment. We also provide baseline benchmarks using multiple spatiotemporal convolutional neural network (CNN) architectures. All clips, labels, and training code are publicly available at this https URL.
[AI-88] Cross-Domain Image Synthesis: Generating HE from Multiplex Biomarker Imaging
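[Quick read]: This paper addresses the lack of a morphological reference in multiplex immunofluorescence (mIF) imaging, namely how to effectively combine the high-dimensional molecular information from mIF with the tissue-structure information of conventional Hematoxylin & Eosin (HE) staining. The core challenge is that, although mIF provides rich spatial molecular data, it lacks the standardized morphological features of HE staining, which limits its direct use in existing HE-based computer-aided diagnosis (CAD) tools. The key to the solution is a multi-level Vector-Quantized Generative Adversarial Network (VQGAN) that generates high-quality virtual HE images from mIF images, connecting molecular data to morphological analysis workflows. Experiments show that, compared with a conventional conditional GAN (cGAN), the virtual HE images generated by the VQGAN yield higher accuracy and fidelity in downstream tasks such as nuclei segmentation and tissue classification, supporting its effectiveness as a scientifically reliable virtual staining method.
Link: https://arxiv.org/abs/2508.04734
Authors: Jillur Rahman Saurav, Mohammad Sadegh Nasr, Jacob M. Luber
Affiliation: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: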
Abstract:While multiplex immunofluorescence (mIF) imaging provides deep, spatially-resolved molecular data, integrating this information with the morphological standard of Hematoxylin & Eosin (HE) can be very important for obtaining complementary information about the underlying tissue. Generating a virtual HE stain from mIF data offers a powerful solution, providing immediate morphological context. Crucially, this approach enables the application of the vast ecosystem of HE-based computer-aided diagnosis (CAD) tools to analyze rich molecular data, bridging the gap between molecular and morphological analysis. In this work, we investigate the use of a multi-level Vector-Quantized Generative Adversarial Network (VQGAN) to create high-fidelity virtual HE stains from mIF images. We rigorously evaluated our VQGAN against a standard conditional GAN (cGAN) baseline on two publicly available colorectal cancer datasets, assessing performance on both image similarity and functional utility for downstream analysis. Our results show that while both architectures produce visually plausible images, the virtual stains generated by our VQGAN provide a more effective substrate for computer-aided diagnosis. Specifically, downstream nuclei segmentation and semantic preservation in tissue classification tasks performed on VQGAN-generated images demonstrate superior performance and agreement with ground-truth analysis compared to those from the cGAN. This work establishes that a multi-level VQGAN is a robust and superior architecture for generating scientifically useful virtual stains, offering a viable pathway to integrate the rich molecular data of mIF into established and powerful HE-based analytical workflows.
[AI-89] Hybrid Reward-Driven Reinforcement Learning for Efficient Quantum Circuit Synthesis
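[Quick read]: This paper addresses the problem of designing quantum circuits that efficiently synthesize a specified target quantum state from a fixed initial state, in the NISQ (noisy intermediate-scale quantum) era and in future fault-tolerant quantum computing. The core challenge is that the exponential growth of the quantum state space makes conventional methods hard to scale. The key to the solution is a tabular Q-learning framework based on action sequences, combined with a discretized representation of the quantum state space to control the growth of the state-space dimension, and a hybrid reward mechanism that combines a static, domain-informed reward with customizable dynamic penalties (for example, for gate congestion and redundant state revisits) to guide the agent toward efficient paths. Using sparse matrix representations and state-space discretization, the method navigates high-dimensional environments scalably with low computational overhead. It produces minimal-depth circuits on graph-state preparation tasks and retains this behaviour when extended to a universal gate set, demonstrating its resource efficiency and adaptability for quantum circuit optimization.
Link: https://arxiv.org/abs/2507.16641
Authors: Sara Giordano, Kornikar Sen, Miguel A. Martin-Delgado
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 4 figures, color figures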
Abstract:A reinforcement learning (RL) framework is introduced for the efficient synthesis of quantum circuits that generate specified target quantum states from a fixed initial state, addressing a central challenge in both the NISQ era and future fault-tolerant quantum computing. The approach utilizes tabular Q-learning, based on action sequences, within a discretized quantum state space, to effectively manage the exponential growth of the space dimension. The framework introduces a hybrid reward mechanism, combining a static, domain-informed reward that guides the agent toward the target state with customizable dynamic penalties that discourage inefficient circuit structures such as gate congestion and redundant state revisits. By leveraging sparse matrix representations and state-space discretization, the method enables scalable navigation of high-dimensional environments while minimizing computational overhead. Benchmarking on graph-state preparation tasks for up to seven qubits, we demonstrate that the algorithm consistently discovers minimal-depth circuits with optimized gate counts. Moreover, extending the framework to a universal gate set for arbitrary quantum states, it still produces minimal depth circuits, highlighting the algorithm’s robustness and adaptability. The results confirm that this RL-driven approach efficiently explores the complex quantum state space and synthesizes near-optimal quantum circuits, providing a resource-efficient foundation for quantum circuit optimization.
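The hybrid reward described in this abstract can be illustrated with a minimal tabular Q-learning sketch. Everything here is an assumption made for illustration: the environment is a toy five-state chain rather than a quantum state space, the static reward is simply 1 on reaching the target, and the only dynamic penalty is a small charge for revisiting a state within an episode. The function names and constants are hypothetical and are not taken from the paper.

```python
import random

def train_q_table(n_states=5, episodes=2000, alpha=0.5, gamma=0.9,
                  epsilon=0.3, revisit_penalty=0.01, max_steps=100, seed=0):
    """Tabular Q-learning on a toy chain with a hybrid reward (illustrative only)."""
    rng = random.Random(seed)
    actions = (-1, +1)                      # move left / move right
    target = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state, visited = 0, {0}
        for _ in range(max_steps):
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max(range(2), key=lambda i: q[state][i])
            nxt = min(max(state + actions[a], 0), target)
            # Static reward for reaching the target, dynamic penalty for revisits.
            reward = 1.0 if nxt == target else 0.0
            if nxt in visited:
                reward -= revisit_penalty
            done = nxt == target
            best_next = 0.0 if done else max(q[nxt])
            # Standard Q-learning update.
            q[state][a] += alpha * (reward + gamma * best_next - q[state][a])
            visited.add(nxt)
            state = nxt
            if done:
                break
    return q

def greedy_path(q, max_steps=20):
    """Follow the greedy policy from state 0 and return the visited states."""
    actions = (-1, +1)
    target = len(q) - 1
    state, path = 0, [0]
    while state != target and len(path) <= max_steps:
        a = max(range(2), key=lambda i: q[state][i])
        state = min(max(state + actions[a], 0), target)
        path.append(state)
    return path
```

Calling `greedy_path(train_q_table())` follows the learned policy from the initial state. In this toy setting the revisit penalty makes backtracking strictly worse than moving toward the target, which is the role the abstract assigns to the dynamic penalty terms.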
[AI-90] Reinforcement Learning Generation of 4-Qubits Entangled States
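[Quick read]: This paper addresses how to automatically construct four-qubit (4-qubit) entangled states with specific entanglement structures, in particular representative states of the 49 true SLOCC (Stochastic Local Operations and Classical Communication) equivalence classes. The key to the solution is an artificial intelligence algorithm based on reinforcement learning (Q-learning) that uses a state-link graph (SLG) to visualize and guide the construction of the Q-matrix, thereby identifying the quantum gate operations and connections needed to reach the target entangled state. The method generates quantum circuits that are optimal with respect to the chosen gate set and also reveals the connection between entanglement features and specific quantum gates, offering a practical route for experimental realization and fundamental physics research.
Link: https://arxiv.org/abs/2204.12351
Authors: Sara Giordano, Miguel A. Martin-Delgado
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Tex file, color figures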
Abstract:We have devised an artificial intelligence algorithm with machine reinforcement learning (Q-learning) to construct remarkable entangled states with 4 qubits. This way, the algorithm is able to generate representative states for some of the 49 true SLOCC classes of the four-qubit entanglement states. In particular, it is possible to reach at least one true SLOCC class for each of the nine entanglement families. The quantum circuits synthesized by the algorithm may be useful for the experimental realization of these important classes of entangled states and to draw conclusions about the intrinsic properties of our universe. We introduce a graphical tool called the state-link graph (SLG) to represent the construction of the Quality matrix (Q-matrix) used by the algorithm to build a given objective state belonging to the corresponding entanglement class. This allows us to discover the necessary connections between specific entanglement features and the role of certain quantum gates that the algorithm needs to include in the quantum gate set of actions. The quantum circuits found are optimal by construction with respect to the quantum gate-set chosen. These SLGs make the algorithm simple, intuitive and a useful resource for the automated construction of entangled states with a low number of qubits.
Machine Learning
[LG-0] On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
Link: https://arxiv.org/abs/2508.05629
Authors: Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 3 figures
Abstract:We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for the Large Language Model (LLM), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at this https URL.
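A minimal sketch of the rescaling idea in this abstract, under one stated assumption: the token probability used as the scaling factor is treated as a constant (a stop-gradient), so the per-token loss becomes `-stop_grad(p_t) * log p_t`. The abstract does not spell this out, and the function names below are hypothetical. The sketch compares the gradients with respect to the logits for a single token.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_grad(logits, target):
    """Gradient of -log p[target] with respect to the logits."""
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def dft_grad(logits, target):
    """Gradient of -stop_grad(p[target]) * log p[target] with respect to the logits.

    Assumption: the scaling factor p[target] carries no gradient.
    """
    p_t = softmax(logits)[target]
    return [p_t * g for g in sft_grad(logits, target)]
```

Under this assumption the rescaled gradient is the standard SFT gradient multiplied by the target token's probability, so low-probability tokens produce smaller updates than they do under plain SFT.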
[LG-1] Non-omniscient backdoor injection with a single poison sample: Proving the one-poison hypothesis for linear regression and linear classification
Link: https://arxiv.org/abs/2508.05600
Authors: Thorsten Peinemann, Paula Arnold, Sebastian Berndt, Thomas Eisenbarth, Esfandiar Mohammadi
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
Abstract:Backdoor injection attacks are a threat to machine learning models that are trained on large data collected from untrusted sources; these attacks enable attackers to inject malicious behavior into the model that can be triggered by specially crafted inputs. Prior work has established bounds on the success of backdoor attacks and their impact on the benign learning task, however, an open question is what amount of poison data is needed for a successful backdoor attack. Typical attacks either use few samples, but need much information about the data points or need to poison many data points. In this paper, we formulate the one-poison hypothesis: An adversary with one poison sample and limited background knowledge can inject a backdoor with zero backdooring-error and without significantly impacting the benign learning task performance. Moreover, we prove the one-poison hypothesis for linear regression and linear classification. For adversaries that utilize a direction that is unused by the benign data distribution for the poison sample, we show that the resulting model is functionally equivalent to a model where the poison was excluded from training. We build on prior work on statistical backdoor learning to show that in all other cases, the impact on the benign learning task is still limited. We also validate our theoretical results experimentally with realistic benchmark data sets.
[LG-2] Optimizing IoT Threat Detection with Kolmogorov-Arnold Networks (KANs)
Link: https://arxiv.org/abs/2508.05591
Authors: Natalia Emelianova, Carlos Kamienski, Ronaldo C. Prati
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 13 pages
Abstract:The exponential growth of the Internet of Things (IoT) has led to the emergence of substantial security concerns, with IoT networks becoming the primary target for cyberattacks. This study examines the potential of Kolmogorov-Arnold Networks (KANs) as an alternative to conventional machine learning models for intrusion detection in IoT networks. The study demonstrates that KANs, which employ learnable activation functions, outperform traditional MLPs and achieve competitive accuracy compared to state-of-the-art models such as Random Forest and XGBoost, while offering superior interpretability for intrusion detection in IoT networks.
[LG-3] Enhancing PyKEEN with Multiple Negative Sampling Solutions for Knowledge Graph Embedding Models
Link: https://arxiv.org/abs/2508.05587
Authors: Claudia d’Amato, Ivan Diliso, Nicola Fanizzi, Zafar Saeed
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 3 figures
Abstract:Embedding methods have become popular due to their scalability on link prediction and/or triple classification tasks on Knowledge Graphs. Embedding models are trained relying on both positive and negative samples of triples. However, in the absence of negative assertions, these must usually be generated artificially using various negative sampling strategies, ranging from random corruption to more sophisticated techniques, which have an impact on the overall performance. Most of the popular libraries for knowledge graph embedding support only basic strategies and lack advanced solutions. To address this gap, we deliver an extension for the popular KGE framework PyKEEN that integrates a suite of several advanced negative samplers (including both static and dynamic corruption strategies), within a consistent modular architecture, to generate meaningful negative samples, while remaining compatible with existing PyKEEN-based workflows and pipelines. The developed extension not only enhances PyKEEN itself but also allows for easier and comprehensive development of embedding methods and/or for their customization. As a proof of concept, we present a comprehensive empirical study of the developed extensions and their impact on the performance (link prediction tasks) of different embedding methods, which also provides useful insights for the design of more effective strategies.
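The simplest strategy mentioned in this abstract, random corruption, can be sketched in plain Python. This is not the PyKEEN API or the extension described in the paper: `corrupt_triples` is a hypothetical helper that replaces the head or tail of each positive triple with a uniformly sampled entity and discards candidates that are already known positives.

```python
import random

def corrupt_triples(positives, entities, num_negs=1, seed=0):
    """Uniform random corruption with filtering of known positive triples."""
    rng = random.Random(seed)
    known = set(positives)
    negatives = []
    for head, rel, tail in positives:
        made = 0
        attempts = 0
        # Cap the attempts so a tiny entity set cannot loop forever.
        while made < num_negs and attempts < 100:
            attempts += 1
            replacement = rng.choice(entities)
            # Corrupt either the head or the tail, chosen at random.
            if rng.random() < 0.5:
                candidate = (replacement, rel, tail)
            else:
                candidate = (head, rel, replacement)
            if candidate not in known:
                negatives.append(candidate)
                made += 1
    return negatives
```

The more advanced static and dynamic samplers mentioned in the abstract would replace the uniform `rng.choice` step with a more informed choice of replacement entity; the filtering step stays the same.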
[LG-4] Prediction of Survival Outcomes under Clinical Presence Shift: A Joint Neural Network Architecture
Link: https://arxiv.org/abs/2508.05472
Authors: Vincent Jeanselme, Glen Martin, Matthew Sperrin, Niels Peek, Brian Tom, Jessica Barrett
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Electronic health records arise from the complex interaction between patients and the healthcare system. This observation process of interactions, referred to as clinical presence, often impacts observed outcomes. When using electronic health records to develop clinical prediction models, it is standard practice to overlook clinical presence, impacting performance and limiting the transportability of models when this interaction evolves. We propose a multi-task recurrent neural network that jointly models the inter-observation time and the missingness processes characterising this interaction in parallel to the survival outcome of interest. Our work formalises the concept of clinical presence shift when the prediction model is deployed in new settings (e.g. different hospitals, regions or countries), and we theoretically justify why the proposed joint modelling can improve transportability under changes in clinical presence. We demonstrate, in a real-world mortality prediction task in the MIMIC-III dataset, how the proposed strategy improves performance and transportability compared to state-of-the-art prediction models that do not incorporate the observation process. These results emphasise the importance of leveraging clinical presence to improve performance and create more transportable clinical prediction models.
[LG-5] Let's Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes
Link: https://arxiv.org/abs/2508.05469
Authors: Zachary Robertson, Sanmi Koyejo
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: 13 pages
Abstract:We develop mechanisms for evaluating AI systems without ground truth by exploiting a connection between gaming resistance and output quality. The data processing inequality ensures that post-hoc attempts to game a metric degrade both information content and task performance. We prove that f-mutual information measures are the unique gaming resistant mechanisms under natural conditions, with the overseer acting as an agent. While Shannon mutual information faces exponential sample complexity, bounded measures like total variation distance remain tractable. Empirically, across ten domains from translation to peer review, all information-theoretic mechanisms achieve perfect discrimination (d > 0.5) between faithful and strategic agents. In contrast, LLM judges exhibit systematic evaluation inversion, preferring fabricated content over accurate summaries. Our mechanisms show 10-100x better robustness to adversarial manipulation than current practices. We also find performance follows an inverted-U curve with compression ratio, peaking at 10:1 where agent responses exhibit optimal information diversity (3 effective dimensions), giving a bias-variance perspective on when our approach is expected to be most effective.
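As a small illustration of the bounded measures mentioned in this abstract, the sketch below computes a total-variation form of mutual information: the total variation distance between a discrete joint distribution and the product of its marginals. The function name and the table representation are assumptions for illustration, and the paper's actual estimators for LLM outputs are not reproduced here.

```python
def tvd_mutual_information(joint):
    """Total variation distance between a joint table and the product of its marginals.

    `joint` is a list of rows whose entries are probabilities summing to 1.
    """
    row = [sum(r) for r in joint]
    col = [sum(c) for c in zip(*joint)]
    return 0.5 * sum(
        abs(joint[i][j] - row[i] * col[j])
        for i in range(len(row))
        for j in range(len(col))
    )
```

The value is 0 when the two variables are independent and is bounded above by 1, which is what keeps it tractable where Shannon mutual information can be unbounded.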
[LG-6] Learning Geometric-Aware Quadrature Rules for Functional Minimization
Link: https://arxiv.org/abs/2508.05445
Authors: Costas Smaragdakis
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 15 pages, 4 figures
Abstract:Accurate numerical integration over non-uniform point clouds is a challenge for modern mesh-free machine learning solvers for partial differential equations (PDEs) using variational principles. While standard Monte Carlo (MC) methods are not capable of handling a non-uniform point cloud, modern neural network architectures can deal with permutation-invariant inputs, creating quadrature rules for any point cloud. In this work, we introduce QuadrANN, a Graph Neural Network (GNN) architecture designed to learn optimal quadrature weights directly from the underlying geometry of point clouds. The design of the model exploits a deep message-passing scheme where the initial layer encodes rich local geometric features from absolute and relative positions as well as an explicit local density measure. In contrast, the following layers incorporate a global context vector. These architectural choices allow the QuadrANN to generate a data-driven quadrature rule that is permutation-invariant and adaptive to both local point density and the overall domain shape. We test our methodology on a series of challenging test cases, including integration on convex and non-convex domains and estimating the solution of the Heat and Fokker-Planck equations. Across all the tests, QuadrANN reduces the variance of the integral estimation compared to standard Quasi-Monte Carlo methods by warping the point clouds to be more dense in critical areas where the integrands present certain singularities. This enhanced stability in critical areas of the domain at hand is critical for the optimization of energy functionals, leading to improved deep learning-based variational solvers.
[LG-7] Online Sparsification of Bipartite-Like Clusters in Graphs ICML2025
Link: https://arxiv.org/abs/2508.05437
Authors: Joyentanuj Das, Suranjan De, He Sun
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: ICML 2025
Abstract:Graph clustering is an important algorithmic technique for analysing massive graphs, and has been widely applied in many research fields of data science. While the objective of most graph clustering algorithms is to find a vertex set of low conductance, a sequence of recent studies highlights the importance of the inter-connection between vertex sets when analysing real-world datasets. Following this line of research, in this work we study bipartite-like clusters and present efficient and online sparsification algorithms that find such clusters in both undirected graphs and directed ones. We conduct experimental studies on both synthetic and real-world datasets, and show that our algorithms significantly speed up the running time of existing clustering algorithms while preserving their effectiveness.
[LG-8] Competing Risks: Impact on Risk Estimation and Algorithmic Fairness
Link: https://arxiv.org/abs/2508.05435
Authors: Vincent Jeanselme, Brian Tom, Jessica Barrett
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Accurate time-to-event prediction is integral to decision-making, informing medical guidelines, hiring decisions, and resource allocation. Survival analysis, the quantitative framework used to model time-to-event data, accounts for patients who do not experience the event of interest during the study period, known as censored patients. However, many patients experience events that prevent the observation of the outcome of interest. These competing risks are often treated as censoring, a practice frequently overlooked due to a limited understanding of its consequences. Our work theoretically demonstrates why treating competing risks as censoring introduces substantial bias in survival estimates, leading to systematic overestimation of risk and, critically, amplifying disparities. First, we formalize the problem of misclassifying competing risks as censoring and quantify the resulting error in survival estimates. Specifically, we develop a framework to estimate this error and demonstrate the associated implications for predictive performance and algorithmic fairness. Furthermore, we examine how differing risk profiles across demographic groups lead to group-specific errors, potentially exacerbating existing disparities. Our findings, supported by an empirical analysis of cardiovascular management, demonstrate that ignoring competing risks disproportionately impacts the individuals most at risk of these events, potentially accentuating inequity. By quantifying the error and highlighting the fairness implications of the common practice of considering competing risks as censoring, our work provides a critical insight into the development of survival models: practitioners must account for competing risks to improve accuracy, reduce disparities in risk assessment, and better inform downstream decisions.
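The overestimation described in this abstract can be reproduced on a toy dataset. The sketch below is a hand-rolled illustration, not the paper's framework: it compares the naive risk, computed as one minus the Kaplan-Meier estimate with competing events treated as censoring, against the Aalen-Johansen cumulative incidence of the event of interest. The function name and the event coding (0, 1, 2) are assumptions.

```python
def naive_and_cif(records):
    """Compare two risk estimates at the last observed time.

    records: list of (time, event) with event 0 = censored,
    1 = event of interest, 2 = competing event.
    Returns (naive_risk, cif): naive_risk is 1 - Kaplan-Meier with competing
    events treated as censoring; cif is the Aalen-Johansen cumulative incidence.
    """
    surv_naive = 1.0   # KM survival with competing events censored
    surv_any = 1.0     # KM survival from any event (interest or competing)
    cif = 0.0
    for t in sorted({time for time, _ in records}):
        at_risk = sum(1 for time, _ in records if time >= t)
        d1 = sum(1 for time, ev in records if time == t and ev == 1)
        d2 = sum(1 for time, ev in records if time == t and ev == 2)
        cif += surv_any * d1 / at_risk          # uses survival just before t
        surv_naive *= 1.0 - d1 / at_risk
        surv_any *= 1.0 - (d1 + d2) / at_risk
    return 1.0 - surv_naive, cif
```

For the six records `[(1, 1), (2, 2), (3, 1), (4, 2), (5, 0), (6, 1)]`, the naive estimate is 1.0 while the cumulative incidence is 2/3: subjects removed by the competing event are implicitly assumed to still be able to experience the event of interest, which inflates the naive risk.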
[LG-9] Discovering Interpretable Programmatic Policies via Multimodal LLM-assisted Evolutionary Search
Link: https://arxiv.org/abs/2508.05433
Authors: Qinglong Hu, Xialiang Tong, Mingxuan Yuan, Fei Liu, Zhichao Lu, Qingfu Zhang
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Interpretability and high performance are essential goals in designing control policies, particularly for safety-critical tasks. Deep reinforcement learning has greatly enhanced performance, yet its inherent lack of interpretability often undermines trust and hinders real-world deployment. This work addresses these dual challenges by introducing a novel approach for programmatic policy discovery, called Multimodal Large Language Model-assisted Evolutionary Search (MLES). MLES utilizes multimodal large language models as policy generators, combining them with evolutionary mechanisms for automatic policy optimization. It integrates visual feedback-driven behavior analysis within the policy generation process to identify failure patterns and facilitate targeted improvements, enhancing the efficiency of policy discovery and producing adaptable, human-aligned policies. Experimental results show that MLES achieves policy discovery capabilities and efficiency comparable to Proximal Policy Optimization (PPO) across two control tasks, while offering transparent control logic and traceable design processes. This paradigm overcomes the limitations of predefined domain-specific languages, facilitates knowledge transfer and reuse, and is scalable across various control tasks. MLES shows promise as a leading approach for the next generation of interpretable control policy discovery.
[LG-10] Group Causal Policy Optimization for Post-Training Large Language Models
Link: https://arxiv.org/abs/2508.05428
Authors: Ziyin Gu, Jingyao Wang, Ran Zuo, Chuxiong Sun, Zeen Song, Changwen Zheng, Wenwen Qiang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Recent advances in large language models (LLMs) have broadened their applicability across diverse tasks, yet specialized domains still require targeted post-training. Among existing methods, Group Relative Policy Optimization (GRPO) stands out for its efficiency, leveraging groupwise relative rewards while avoiding costly value function learning. However, GRPO treats candidate responses as independent, overlooking semantic interactions such as complementarity and contradiction. To address this challenge, we first introduce a Structural Causal Model (SCM) that reveals hidden dependencies among candidate responses induced by conditioning on a final integrated output, forming a collider structure. Then, our causal analysis leads to two insights: (1) projecting responses onto a causally informed subspace improves prediction quality, and (2) this projection yields a better baseline than query-only conditioning. Building on these insights, we propose Group Causal Policy Optimization (GCPO), which integrates causal structure into optimization through two key components: a causally informed reward adjustment and a novel KL regularization term that aligns the policy with a causally projected reference distribution. Comprehensive experimental evaluations demonstrate that GCPO consistently surpasses existing methods, including GRPO, across multiple reasoning benchmarks.
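For context, the groupwise relative reward that this abstract attributes to GRPO is commonly computed by standardising each response's reward within its group, which is what removes the need for a learned value function. The sketch below shows only that baseline step. The function name and the `eps` constant are assumptions, and GCPO's causally informed reward adjustment and KL term are not reproduced here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of candidate responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]
```

Each response is scored independently of the content of the others in this baseline, which is the independence assumption the abstract identifies as GRPO's limitation.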
[LG-11] Federated Multi-Objective Learning with Controlled Pareto Frontiers
Link: https://arxiv.org/abs/2508.05424
Authors: Jiansheng Rao, Jiayi Li, Zhizhi Gong, Soummya Kar, Haoxuan Li
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Federated learning (FL) is a widely adopted paradigm for privacy-preserving model training, but FedAvg optimises for the majority while under-serving minority clients. Existing methods such as federated multi-objective learning (FMOL) attempt to import multi-objective optimisation (MOO) into FL. However, FMOL merely delivers task-wise Pareto-stationary points, leaving client fairness to chance. In this paper, we introduce Conically-Regularised FMOL (CR-FMOL), the first federated MOO framework that enforces client-wise Pareto optimality through a novel preference-cone constraint. After local federated multi-gradient descent averaging (FMGDA) / federated stochastic multi-gradient descent averaging (FSMGDA) steps, each client transmits its aggregated task-loss vector as an implicit preference; the server then solves a cone-constrained Pareto-MTL sub-problem centred at the uniform vector, producing a descent direction that is Pareto-stationary for every client within its cone. Experiments on non-IID benchmarks show that CR-FMOL enhances client fairness, and although the early-stage performance is slightly inferior to FedAvg, it is expected to achieve comparable accuracy given sufficient training rounds.
[LG-12] Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling
Link: https://arxiv.org/abs/2508.05423
Authors: Yixuan Zhang, Wenxin Zhang, Hua Jiang, Quyu Kong, Feng Zhou
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Biological neurons communicate through spike trains, discrete, irregular bursts of activity that exhibit variability far beyond the modeling capacity of conventional variational autoencoders (VAEs). Recent work, such as the Poisson-VAE, makes a biologically inspired move by modeling spike counts using the Poisson distribution. However, this imposes a rigid constraint: equal mean and variance, which fails to reflect the true stochastic nature of neural activity. In this work, we challenge this constraint and introduce NegBio-VAE, a principled extension of the VAE framework that models spike counts using the negative binomial distribution. This shift grants explicit control over dispersion, unlocking a broader and more accurate family of neural representations. We further develop two ELBO optimization schemes and two differentiable reparameterization strategies tailored to the negative binomial setting. By introducing one additional dispersion parameter, NegBio-VAE generalizes the Poisson latent model to a negative binomial formulation. Empirical results demonstrate this minor yet impactful change leads to significant gains in reconstruction fidelity, highlighting the importance of explicitly modeling overdispersion in spike-like activations.
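The constraint this abstract refers to is easiest to see by comparing variances. The sketch below assumes the common mean/dispersion parameterisation of the negative binomial, in which the variance is `mean + mean**2 / dispersion`. The paper may use a different parameterisation, and the function names are hypothetical.

```python
def poisson_variance(mean):
    """A Poisson count has variance equal to its mean."""
    return mean

def negative_binomial_variance(mean, dispersion):
    """Variance under the mean/dispersion parameterisation (assumed here)."""
    return mean + mean ** 2 / dispersion
```

With a mean of 4 and a dispersion of 2 the negative binomial variance is 12, three times the Poisson value, and as the dispersion parameter grows the variance approaches the Poisson case. This is the sense in which one extra parameter generalises the Poisson latent model.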
[LG-13] Echo State Networks for Bitcoin Time Series Prediction
Link: https://arxiv.org/abs/2508.05416
Authors: Mansi Sharma, Enrico Sartor, Marc Cavazza, Helmut Prendinger
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Forecasting stock and cryptocurrency prices is challenging due to high volatility and non-stationarity, influenced by factors like economic changes and market sentiment. Previous research shows that Echo State Networks (ESNs) can effectively model short-term stock market movements, capturing nonlinear patterns in dynamic data. To the best of our knowledge, this work is among the first to explore ESNs for cryptocurrency forecasting, especially during extreme volatility. We also conduct chaos analysis through the Lyapunov exponent in chaotic periods and show that our approach outperforms existing machine learning methods by a significant margin. Our findings are consistent with the Lyapunov exponent analysis, showing that ESNs are robust during chaotic periods and excel under high chaos compared to Boosting and Naïve methods.
[LG-14] MolSnap: Snap-Fast Molecular Generation with Latent Variational Mean Flow
Link: https://arxiv.org/abs/2508.05411
Authors: Md Atik Ahamed, Qiang Ye, Qiang Cheng
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Molecular generation conditioned on textual descriptions is a fundamental task in computational chemistry and drug discovery. Existing methods often struggle to simultaneously ensure high-quality, diverse generation and fast inference. In this work, we propose a novel causality-aware framework that addresses these challenges through two key innovations. First, we introduce a Causality-Aware Transformer (CAT) that jointly encodes molecular graph tokens and text instructions while enforcing causal dependencies during generation. Second, we develop a Variational Mean Flow (VMF) framework that generalizes existing flow-based methods by modeling the latent space as a mixture of Gaussians, enhancing expressiveness beyond unimodal priors. VMF enables efficient one-step inference while maintaining strong generation quality and diversity. Extensive experiments on four standard molecular benchmarks demonstrate that our model outperforms state-of-the-art baselines, achieving higher novelty (up to 74.5%), diversity (up to 70.3%), and 100% validity across all datasets. Moreover, VMF requires only one function evaluation (NFE) during conditional generation and up to five NFEs for unconditional generation, offering substantial computational efficiency over diffusion-based methods.
[LG-15] Cumulative Learning Rate Adaptation: Revisiting Path-Based Schedules for SGD and Adam
Link: https://arxiv.org/abs/2508.05408
Authors: Asma Atamna, Tom Maus, Fabian Kievelitz, Tobias Glasmachers
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The learning rate is a crucial hyperparameter in deep learning, with its ideal value depending on the problem and potentially changing during training. In this paper, we investigate the practical utility of adaptive learning rate mechanisms that adjust step sizes dynamically in response to the loss landscape. We revisit a cumulative path-based adaptation scheme proposed in 2017, which adjusts the learning rate based on the discrepancy between the observed path length, computed as a time-discounted sum of normalized gradient steps, and the expected length of a random walk. While the original approach offers a compelling intuition, we show that its adaptation mechanism for Adam is conceptually inconsistent due to the optimizer’s internal preconditioning. We propose a corrected variant that better reflects Adam’s update dynamics. To assess the practical value of online learning rate adaptation, we benchmark SGD and Adam, with and without cumulative adaptation, and compare them to a recent alternative method. Our results aim to clarify when and why such adaptive strategies offer practical benefits.
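The path-based mechanism described in this abstract can be sketched in the style of cumulative step-size adaptation. The constants `c` and `damping`, the exponential update rule, and the function name are assumptions chosen for illustration and are not the scheme from the paper. The only property taken from the abstract is the comparison between a time-discounted sum of normalised gradient steps and the length expected under a random walk.

```python
import math

def adapt_learning_rate(gradients, lr=0.1, c=0.2, damping=0.1):
    """CSA-style cumulative adaptation sketch (constants are illustrative assumptions)."""
    dim = len(gradients[0])
    path = [0.0] * dim
    for g in gradients:
        norm = math.sqrt(sum(x * x for x in g)) or 1.0   # avoid dividing by zero
        # Time-discounted sum of normalised gradient steps.
        path = [(1 - c) * p + math.sqrt(c * (2 - c)) * x / norm
                for p, x in zip(path, g)]
        # With this scaling, uncorrelated unit steps give an expected squared
        # path length that tends to 1, so 1 serves as the random-walk reference.
        sq_len = sum(p * p for p in path)
        lr *= math.exp(damping * (sq_len - 1.0))
    return lr
```

When successive gradients point the same way the path grows longer than the random-walk reference and the learning rate increases; when they alternate in sign the path stays short and the learning rate shrinks. For Adam, the abstract argues that the steps entering the path should reflect the optimizer's preconditioning, which this plain-gradient sketch does not attempt.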
[LG-16] NT-ML: Backdoor Defense via Non-target Label Training and Mutual Learning
Link: https://arxiv.org/abs/2508.05404
Authors: Wenjie Huo, Katinka Wolter
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Recent studies have shown that deep neural networks (DNNs) are vulnerable to backdoor attacks, where a designed trigger is injected into the dataset, causing erroneous predictions when activated. In this paper, we propose a novel defense mechanism, Non-target label Training and Mutual Learning (NT-ML), which can successfully restore the poisoned model under advanced backdoor attacks. NT aims to reduce the harm of poisoned data by retraining the model with the outputs of the standard training. At this stage, a teacher model with high accuracy on clean data and a student model with higher confidence in correct prediction on poisoned data are obtained. Then, the teacher and student can learn the strengths from each other through ML to obtain a purified student model. Extensive experiments show that NT-ML can effectively defend against 6 backdoor attacks with a small number of clean samples, and outperforms 5 state-of-the-art backdoor defenses.
[LG-17] On the Reliability of Sampling Strategies in Offline Recommender Evaluation RECSYS2025
Link: https://arxiv.org/abs/2508.05398
Authors: Bruno L. Pereira, Alan Said, Rodrygo L. T. Santos
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to RecSys 2025
Abstract:Offline evaluation plays a central role in benchmarking recommender systems when online testing is impractical or risky. However, it is susceptible to two key sources of bias: exposure bias, where users only interact with items they are shown, and sampling bias, introduced when evaluation is performed on a subset of logged items rather than the full catalog. While prior work has proposed methods to mitigate sampling bias, these are typically assessed on fixed logged datasets rather than for their ability to support reliable model comparisons under varying exposure conditions or relative to true user preferences. In this paper, we investigate how different combinations of logging and sampling choices affect the reliability of offline evaluation. Using a fully observed dataset as ground truth, we systematically simulate diverse exposure biases and assess the reliability of common sampling strategies along four dimensions: sampling resolution (recommender model separability), fidelity (agreement with full evaluation), robustness (stability under exposure bias), and predictive power (alignment with ground truth). Our findings highlight when and how sampling distorts evaluation outcomes and offer practical guidance for selecting strategies that yield faithful and robust offline comparisons.
[LG-18] Latent Preference Bandits
Link: https://arxiv.org/abs/2508.05367
Authors: Newton Mwai, Emil Carlsson, Fredrik D. Johansson
Categories: Machine Learning (cs.LG)
Comments: 25 pages, 9 figures
Abstract:Bandit algorithms are guaranteed to solve diverse sequential decision-making problems, provided that a sufficient exploration budget is available. However, learning from scratch is often too costly for personalization tasks where a single individual faces only a small number of decision points. Latent bandits offer substantially reduced exploration times for such problems, given that the joint distribution of a latent state and the rewards of actions is known and accurate. In practice, finding such a model is non-trivial, and there may not exist a small number of latent states that explain the responses of all individuals. For example, patients with similar latent conditions may have the same preference in treatments but rate their symptoms on different scales. With this in mind, we propose relaxing the assumptions of latent bandits to require only a model of the preference ordering of actions in each latent state. This allows problem instances with the same latent state to vary in their reward distributions, as long as their preference orderings are equal. We give a posterior-sampling algorithm for this problem and demonstrate that its empirical performance is competitive with latent bandits that have full knowledge of the reward distribution when this is well-specified, and outperforms them when reward scales differ between instances with the same latent state.
[LG-19] Adaptive Batch Size and Learning Rate Scheduler for Stochastic Gradient Descent Based on Minimization of Stochastic First-order Oracle Complexity
Link: https://arxiv.org/abs/2508.05302
Authors: Hikaru Umeda, Hideaki Iiduka
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:The convergence behavior of mini-batch stochastic gradient descent (SGD) is highly sensitive to the batch size and learning rate settings. Recent theoretical studies have identified the existence of a critical batch size that minimizes stochastic first-order oracle (SFO) complexity, defined as the expected number of gradient evaluations required to reach a stationary point of the empirical loss function in a deep neural network. An adaptive scheduling strategy is introduced to accelerate SGD that leverages theoretical findings on the critical batch size. The batch size and learning rate are adjusted on the basis of the observed decay in the full gradient norm during training. Experiments using an adaptive joint scheduler based on this strategy demonstrated improved convergence speed compared with that of existing schedulers.
[LG-20] Optimal Growth Schedules for Batch Size and Learning Rate in SGD that Reduce SFO Complexity
Link: https://arxiv.org/abs/2508.05297
Authors: Hikaru Umeda, Hideaki Iiduka
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:The unprecedented growth of deep learning models has enabled remarkable advances but introduced substantial computational bottlenecks. A key factor contributing to training efficiency is batch-size and learning-rate scheduling in stochastic gradient methods. However, naive scheduling of these hyperparameters can degrade optimization efficiency and compromise generalization. Motivated by recent theoretical insights, we investigated how the batch size and learning rate should be increased during training to balance efficiency and convergence. We analyzed this problem on the basis of stochastic first-order oracle (SFO) complexity, defined as the expected number of gradient evaluations needed to reach an $\epsilon$-approximate stationary point of the empirical loss. We theoretically derived optimal growth schedules for the batch size and learning rate that reduce SFO complexity and validated them through extensive experiments. Our results offer both theoretical insights and practical guidelines for scalable and efficient large-batch training in deep learning.
[LG-21] RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders
Link: https://arxiv.org/abs/2508.05289
Authors: Zhongheng Yang, Aijia Sun, Yushang Zhao, Yinuo Yang, Dannier Li, Chengrui Zhou
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Conversational recommender systems (CRS) based on Large Language Models (LLMs) need to be continually aligned with user preferences to provide satisfying and context-relevant item recommendations. Traditional supervised fine-tuning cannot capture implicit feedback signals, e.g., dwell time, sentiment polarity, or engagement patterns. In this paper, we share a fine-tuning solution using reinforcement learning from human feedback (RLHF) to maximize implied user feedback (IUF) in a multi-turn recommendation context. We specify a reward model $R_\phi$ learnt on weakly-labelled engagement information and maximize user-centric utility by optimizing the foundational LLM $M_\theta$ through a proximal policy optimization (PPO) approach. The architecture models conversational state transitions $s_t \to a_t \to s_{t+1}$, where the action $a_t$ is associated with LLM-generated item suggestions conditioned only on the past conversation history. The evaluation across synthetic and real-world datasets (this http URL, OpenDialKG) demonstrates that our RLHF-fine-tuned models can perform better in terms of top-$k$ recommendation accuracy, coherence, and user satisfaction compared to [...]. This paper shows that implicit signal alignment can be efficient in achieving scalable and user-adaptive design of CRS.
[LG-22] MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs
Link: https://arxiv.org/abs/2508.05257
Authors: Xiaodong Chen, Mingming Ha, Zhenzhong Lan, Jing Zhang, Jianguo Li
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Despite offering strong performance and computational efficiency, large MoE-based LLMs like DeepSeek-V3-0324 and Kimi-K2-Instruct present serious challenges due to substantial memory requirements in deployment. While recent works have explored MoE compression to address this issue, existing methods often suffer from considerable accuracy drops (e.g., 7-14% in relative terms) even at modest compression rates. This paper introduces a novel Mixture-of-Basis-Experts (MoBE) method that achieves model compression while incurring minimal accuracy drops. Specifically, each up/gate matrix in an expert is decomposed via a rank decomposition as $W = AB$, where matrix $A$ is unique to each expert. The relatively larger matrix $B$ is further re-parameterized as a linear combination of basis matrices $B_i$ shared across all experts within a given MoE layer. The factorization is learned by minimizing the reconstruction error relative to the original weight matrices. Experiments demonstrate that MoBE achieves notably lower accuracy drops compared to prior works. For instance, MoBE can reduce the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B) and Kimi-K2-Instruct (1T) by 24%-30% with only a 1%-2% accuracy drop (about 2% when measured in relative terms).
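The factorization described in this abstract can be illustrated with a small sketch. It only covers what the abstract states: an expert-specific factor A, and a factor B built as a linear combination of bases shared within a layer. The matrix shapes, the function names (`combine_bases`, `reconstruct_expert`, `mobe_params`), and the purely linear form with no extra nonlinearity are assumptions made for illustration, not the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for a toy example."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def combine_bases(coeffs: List[float], bases: List[Matrix]) -> Matrix:
    """B = sum_i coeffs[i] * bases[i]; the bases are shared by all experts in a layer."""
    rows, cols = len(bases[0]), len(bases[0][0])
    return [[sum(c * b[r][s] for c, b in zip(coeffs, bases))
             for s in range(cols)] for r in range(rows)]


def reconstruct_expert(a: Matrix, coeffs: List[float], bases: List[Matrix]) -> Matrix:
    """W = A B, with A expert-specific (p x r) and B (r x q) mixed from shared bases."""
    return matmul(a, combine_bases(coeffs, bases))


def dense_params(num_experts: int, p: int, q: int) -> int:
    """Parameter count when every expert stores its own p x q matrix."""
    return num_experts * p * q


def mobe_params(num_experts: int, p: int, q: int, r: int, num_bases: int) -> int:
    """Per expert: A (p*r) plus mixing coefficients; shared: num_bases bases of size r*q.
    (Hypothetical accounting for this sketch only.)"""
    return num_experts * (p * r + num_bases) + num_bases * r * q


# Toy example: two shared bases, one expert with an identity A.
bases = [[[1, 2, 3], [4, 5, 6]], [[1, 0, 0], [0, 1, 0]]]
w = reconstruct_expert([[1, 0], [0, 1]], [2.0, -1.0], bases)
```

In this toy setting, sharing the bases is where the savings come from: with 8 experts, p=4, q=16, r=4 and 2 bases, the hypothetical count drops from 512 to 272 parameters.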
[LG-23] Cross-LoRA: A Data-Free LoRA Transfer Framework across Heterogeneous LLMs
Link: https://arxiv.org/abs/2508.05232
Authors: Feifan Xia, Mingyang Liao, Yuyang Fang, Defang Li, Yantong Xie, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Traditional parameter-efficient fine-tuning (PEFT) methods such as LoRA are tightly coupled with the base model architecture, which constrains their applicability across heterogeneous pretrained large language models (LLMs). To address this limitation, we introduce Cross-LoRA, a data-free framework for transferring LoRA modules between diverse base models without requiring additional training data. Cross-LoRA consists of two key components: (a) LoRA-Align, which performs subspace alignment between source and target base models through rank-truncated singular value decomposition (SVD) and Frobenius-optimal linear transformation, ensuring compatibility under dimension mismatch; and (b) LoRA-Shift, which applies the aligned subspaces to project source LoRA weight updates into the target model parameter space. Both components are data-free, training-free, and enable lightweight adaptation on a commodity GPU in 20 minutes. Experiments on ARCs, OBQA and HellaSwag show that Cross-LoRA achieves relative gains of up to 5.26% over base models. Across other commonsense reasoning benchmarks, Cross-LoRA maintains performance comparable to that of directly trained LoRA adapters.
[LG-24] ML-based Short Physical Performance Battery future score prediction based on questionnaire data
Link: https://arxiv.org/abs/2508.05222
Authors: Marcin Kolakowski, Seif Ben Bader
Categories: Machine Learning (cs.LG)
Comments: Originally presented at: 2024 32nd Telecommunication Forum (TELFOR), Belgrade, Serbia
Abstract:Effective slowing down of older adults’ physical capacity deterioration requires intervention as soon as the first symptoms surface. In this paper, we analyze the possibility of predicting the Short Physical Performance Battery (SPPB) score at a four-year horizon based on questionnaire data. The ML algorithms tested included Random Forest, XGBoost, Linear Regression, dense and TabNet neural networks. The best results were achieved for the XGBoost (mean absolute error of 0.79 points). Based on the Shapley values analysis, we selected smaller subsets of features (from 10 to 20) and retrained the XGBoost regressor, achieving a mean absolute error of 0.82.
[LG-25] DFW: A Novel Weighting Scheme for Covariate Balancing and Treatment Effect Estimation
Link: https://arxiv.org/abs/2508.05215
Authors: Ahmad Saeed Khan, Erik Schaffernicht, Johannes Andreas Stork
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments: This paper has been accepted in Frontiers in Applied Mathematics and Statistics - Mathematics of Computation and Data Science
Abstract:Estimating causal effects from observational data is challenging due to selection bias, which leads to imbalanced covariate distributions across treatment groups. Propensity score-based weighting methods are widely used to address this issue by reweighting samples to simulate a randomized controlled trial (RCT). However, the effectiveness of these methods heavily depends on the observed data and the accuracy of the propensity score estimator. For example, inverse propensity weighting (IPW) assigns weights based on the inverse of the propensity score, which can lead to unstable weights when propensity scores have high variance (due to either the data or model misspecification), ultimately degrading the ability to handle selection bias and estimate treatment effects. To overcome these limitations, we propose Deconfounding Factor Weighting (DFW), a novel propensity score-based approach that leverages the deconfounding factor to construct stable and effective sample weights. DFW prioritizes less confounded samples while mitigating the influence of highly confounded ones, producing a pseudopopulation that better approximates an RCT. Our approach ensures bounded weights, lower variance, and improved covariate balance. While DFW is formulated for binary treatments, it naturally extends to multi-treatment settings, as the deconfounding factor is computed based on the estimated probability of the treatment actually received by each sample. Through extensive experiments on real-world benchmark and synthetic datasets, we demonstrate that DFW outperforms existing methods, including IPW and CBPS, in both covariate balancing and treatment effect estimation.
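To make the instability of inverse propensity weighting concrete, here is a minimal sketch of the standard IPW baseline that the abstract contrasts with. It does not implement DFW, whose weighting formula is not given in the abstract. The function names and the normalized (Hajek-style) estimator are choices made for this illustration.

```python
from typing import Sequence


def ipw_weight(treated: int, propensity: float) -> float:
    """Standard IPW weight: 1/e(x) for treated units, 1/(1 - e(x)) for controls."""
    if not 0.0 < propensity < 1.0:
        raise ValueError("propensity must lie strictly between 0 and 1")
    return 1.0 / propensity if treated else 1.0 / (1.0 - propensity)


def ipw_ate(treatments: Sequence[int],
            propensities: Sequence[float],
            outcomes: Sequence[float]) -> float:
    """Normalized IPW estimate of the average treatment effect."""
    num1 = den1 = num0 = den0 = 0.0
    for t, e, y in zip(treatments, propensities, outcomes):
        w = ipw_weight(t, e)
        if t:
            num1 += w * y
            den1 += w
        else:
            num0 += w * y
            den0 += w
    return num1 / den1 - num0 / den0


# A treated unit with a tiny estimated propensity receives a very large weight,
# which is the source of the variance problem described above.
balanced = ipw_weight(1, 0.5)
extreme = ipw_weight(1, 0.01)
ate = ipw_ate([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5], [3.0, 5.0, 1.0, 2.0])
```

Because the weight is unbounded as the propensity approaches 0 or 1, a single poorly estimated score can dominate the estimate, which is the motivation the abstract gives for bounded weights.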
[LG-26] Bidding-Aware Retrieval for Multi-Stage Consistency in Online Advertising
Link: https://arxiv.org/abs/2508.05206
Authors: Bin Liu, Yunfei Liu, Ziru Xu, Zhaoyu Zhou, Zhi Kou, Yeqiu Yang, Han Zhu, Jian Xu, Bo Zheng
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments:
Abstract:Online advertising systems typically use a cascaded architecture to manage massive requests and candidate volumes, where the ranking stages allocate traffic based on eCPM (predicted CTR $\times$ Bid). With the increasing popularity of auto-bidding strategies, the inconsistency between the computationally sensitive retrieval stage and the ranking stages becomes more pronounced, as the former cannot access precise, real-time bids for the vast ad corpus. This discrepancy leads to sub-optimal platform revenue and advertiser outcomes. To tackle this problem, we propose Bidding-Aware Retrieval (BAR), a model-based retrieval framework that addresses multi-stage inconsistency by incorporating ad bid value into the retrieval scoring function. The core innovation is Bidding-Aware Modeling, incorporating bid signals through monotonicity-constrained learning and multi-task distillation to ensure economically coherent representations, while Asynchronous Near-Line Inference enables real-time updates to the embedding for market responsiveness. Furthermore, the Task-Attentive Refinement module selectively enhances feature interactions to disentangle user interest and commercial value signals. Extensive offline experiments and full-scale deployment across Alibaba’s display advertising platform validated BAR’s efficacy: 4.32% platform revenue increase with 22.2% impression lift for positively-operated advertisements.
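The retrieval-versus-ranking inconsistency mentioned in this abstract can be seen in a toy example. The sketch below only uses the eCPM definition stated above (predicted CTR times bid). The ad data, field names, and the `top_k` helper are hypothetical, and the snippet does not represent the BAR model itself.

```python
from typing import Callable, Dict, List

Ad = Dict[str, object]


def ecpm(ad: Ad) -> float:
    """eCPM as defined in the abstract: predicted CTR times bid."""
    return float(ad["pctr"]) * float(ad["bid"])


def top_k(ads: List[Ad], k: int, score: Callable[[Ad], float]) -> List[str]:
    """Return the ids of the k highest-scoring ads."""
    return [str(ad["id"]) for ad in sorted(ads, key=score, reverse=True)[:k]]


# Hypothetical candidates: ad "b" has a modest CTR but a high bid.
ads = [
    {"id": "a", "pctr": 0.10, "bid": 1.0},
    {"id": "b", "pctr": 0.05, "bid": 5.0},
    {"id": "c", "pctr": 0.08, "bid": 0.5},
]

# A bid-unaware retrieval stage that scores by predicted CTR alone drops "b",
# even though the ranking stage would place it first by eCPM.
retrieved_by_ctr = top_k(ads, 2, lambda ad: float(ad["pctr"]))
ranked_by_ecpm = top_k(ads, 2, ecpm)
```

Here the two stages disagree on which candidates matter, which is the kind of multi-stage inconsistency the abstract sets out to reduce.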
[LG-27] Physics-Informed Time-Integrated DeepONet: Temporal Tangent Space Operator Learning for High-Accuracy Inference
Link: https://arxiv.org/abs/2508.05190
Authors: Luis Mandl, Dibyajyoti Nayak, Tim Ricken, Somdatta Goswami
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 16 figures, 3 tables
Abstract:Accurately modeling and inferring solutions to time-dependent partial differential equations (PDEs) over extended horizons remains a core challenge in scientific machine learning. Traditional full rollout (FR) methods, which predict entire trajectories in one pass, often fail to capture the causal dependencies and generalize poorly outside the training time horizon. Autoregressive (AR) approaches, evolving the system step by step, suffer from error accumulation, limiting long-term accuracy. These shortcomings limit the long-term accuracy and reliability of both strategies. To address these issues, we introduce the Physics-Informed Time-Integrated Deep Operator Network (PITI-DeepONet), a dual-output architecture trained via fully physics-informed or hybrid physics- and data-driven objectives to ensure stable, accurate long-term evolution well beyond the training horizon. Instead of forecasting future states, the network learns the time-derivative operator from the current state, integrating it using classical time-stepping schemes to advance the solution in time. Additionally, the framework can leverage residual monitoring during inference to estimate prediction quality and detect when the system transitions outside the training domain. Applied to benchmark problems, PITI-DeepONet shows improved accuracy over extended inference time horizons when compared to traditional methods. Mean relative $\mathcal{L}_2$ errors reduced by 84% (vs. FR) and 79% (vs. AR) for the one-dimensional heat equation; by 87% (vs. FR) and 98% (vs. AR) for the one-dimensional Burgers equation; and by 42% (vs. FR) and 89% (vs. AR) for the two-dimensional Allen-Cahn equation. By moving beyond classic FR and AR schemes, PITI-DeepONet paves the way for more reliable, long-term integration of complex, time-dependent PDEs.
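The idea of learning a time-derivative operator and advancing the state with a classical integrator can be sketched as follows. The abstract does not say which time-stepping scheme is used, so the fourth-order Runge-Kutta step here is an assumption. The `toy_rhs` function is a stand-in for a trained network (it encodes du/dt = -u so the result can be checked analytically), and all names are hypothetical.

```python
import math
from typing import Callable, List

State = List[float]


def rk4_step(rhs: Callable[[State], State], u: State, dt: float) -> State:
    """One classical Runge-Kutta 4 step for du/dt = rhs(u)."""
    def shifted(scale: float, k: State) -> State:
        return [ui + scale * ki for ui, ki in zip(u, k)]

    k1 = rhs(u)
    k2 = rhs(shifted(0.5 * dt, k1))
    k3 = rhs(shifted(0.5 * dt, k2))
    k4 = rhs(shifted(dt, k3))
    return [ui + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for ui, a, b, c, d in zip(u, k1, k2, k3, k4)]


def rollout(rhs: Callable[[State], State], u0: State, dt: float, n_steps: int) -> State:
    """Advance the state by repeatedly integrating the (learned) time derivative."""
    u = list(u0)
    for _ in range(n_steps):
        u = rk4_step(rhs, u, dt)
    return u


def toy_rhs(u: State) -> State:
    """Stand-in for the learned operator: exponential decay, du/dt = -u."""
    return [-ui for ui in u]


u_final = rollout(toy_rhs, [1.0], dt=0.1, n_steps=10)
exact = math.exp(-1.0)  # analytic solution of the toy problem at t = 1
```

In a real setting, `toy_rhs` would be replaced by the trained operator network evaluated on the current discretized state, and the integrator could be swapped for any other classical scheme.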
[LG-28] Human Activity Recognition from Smartphone Sensor Data for Clinical Trials
Link: https://arxiv.org/abs/2508.05175
Authors: Stefania Russo, Rafał Klimas, Marta Płonka, Hugo Le Gall, Sven Holm, Dimitar Stanev, Florian Lipsmeier, Mattia Zanon, Lito Kriara
Categories: Machine Learning (cs.LG)
Comments: 32 pages, 5 figures, 4 tables, 1 supplementary figure, 4 supplementary tables
Abstract:We developed a ResNet-based human activity recognition (HAR) model with minimal overhead to detect gait versus non-gait activities and everyday activities (walking, running, stairs, standing, sitting, lying, sit-to-stand transitions). The model was trained and evaluated using smartphone sensor data from adult healthy controls (HC) and people with multiple sclerosis (PwMS) with Expanded Disability Status Scale (EDSS) scores between 0.0-6.5. Datasets included the GaitLab study (ISRCTN15993728), an internal Roche dataset, and publicly available data sources (training only). Data from 34 HC and 68 PwMS (mean [SD] EDSS: 4.7 [1.5]) were included in the evaluation. The HAR model showed 98.4% and 99.6% accuracy in detecting gait versus non-gait activities in the GaitLab and Roche datasets, respectively, similar to a comparative state-of-the-art ResNet model (99.3% and 99.4%). For everyday activities, the proposed model not only demonstrated higher accuracy than the state-of-the-art model (96.2% vs 91.9%; internal Roche dataset) but also maintained high performance across 9 smartphone wear locations (handbag, shopping bag, crossbody bag, backpack, hoodie pocket, coat/jacket pocket, hand, neck, belt), outperforming the state-of-the-art model by 2.8% - 9.0%. In conclusion, the proposed HAR model accurately detects everyday activities and shows high robustness to various smartphone wear locations, demonstrating its practical applicability.
[LG-29] Near Optimal Inference for the Best-Performing Algorithm
Link: https://arxiv.org/abs/2508.05173
Authors: Amichai Painsky
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Consider a collection of competing machine learning algorithms. Given their performance on a benchmark of datasets, we would like to identify the best performing algorithm. Specifically, we ask which algorithm is most likely to rank highest on a future, unseen dataset. A natural approach is to select the algorithm that demonstrates the best performance on the benchmark. However, in many cases the performance differences are marginal and additional candidates may also be considered. This problem is formulated as subset selection for multinomial distributions. Formally, given a sample from a countable alphabet, our goal is to identify a minimal subset of symbols that includes the most frequent symbol in the population with high confidence. In this work, we introduce a novel framework for the subset selection problem. We provide both asymptotic and finite-sample schemes that significantly improve upon currently known methods. In addition, we provide matching lower bounds, demonstrating the favorable performance of our proposed schemes.
[LG-30] S2M-Former: Spiking Symmetric Mixing Branchformer for Brain Auditory Attention Detection
Link: https://arxiv.org/abs/2508.05164
Authors: Jiaqi Wang, Zhengyu Ma, Xiongri Shen, Chenlin Zhou, Leilei Zhao, Han Zhang, Yi Zhong, Siqi Cai, Zhenxi Song, Zhiguo Zhang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Auditory attention detection (AAD) aims to decode listeners’ focus in complex auditory environments from electroencephalography (EEG) recordings, which is crucial for developing neuro-steered hearing devices. Despite recent advancements, EEG-based AAD remains hindered by the absence of synergistic frameworks that can fully leverage complementary EEG features under energy-efficiency constraints. We propose $S^2$M-Former, a novel spiking symmetric mixing framework to address this limitation through two key innovations: i) Presenting a spike-driven symmetric architecture composed of parallel spatial and frequency branches with mirrored modular design, leveraging biologically plausible token-channel mixers to enhance complementary learning across branches; ii) Introducing lightweight 1D token sequences to replace conventional 3D operations, reducing parameters by 14.7$\times$. The brain-inspired spiking architecture further reduces power consumption, achieving a 5.8$\times$ energy reduction compared to recent ANN methods, while also surpassing existing SNN baselines in terms of parameter efficiency and performance. Comprehensive experiments on three AAD benchmarks (KUL, DTU and AV-GC-AAD) across three settings (within-trial, cross-trial and cross-subject) demonstrate that $S^2$M-Former achieves comparable state-of-the-art (SOTA) decoding accuracy, making it a promising low-power, high-performance solution for AAD tasks.
[LG-31] pFedDSH: Enabling Knowledge Transfer in Personalized Federated Learning through Data-free Sub-Hypernetwork
Link: https://arxiv.org/abs/2508.05157
Authors: Thinh Nguyen, Le Huy Khiem, Van-Tuan Tran, Khoa D Doan, Nitesh V Chawla, Kok-Seng Wong
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 4 figures
Abstract:Federated Learning (FL) enables collaborative model training across distributed clients without sharing raw data, offering a significant privacy benefit. However, most existing Personalized Federated Learning (pFL) methods assume a static client participation, which does not reflect real-world scenarios where new clients may continuously join the federated system (i.e., dynamic client onboarding). In this paper, we explore a practical scenario in which a new batch of clients is introduced incrementally while the learning task remains unchanged. This dynamic environment poses various challenges, including preserving performance for existing clients without retraining and enabling efficient knowledge transfer between client batches. To address these issues, we propose Personalized Federated Data-Free Sub-Hypernetwork (pFedDSH), a novel framework based on a central hypernetwork that generates personalized models for each client via embedding vectors. To maintain knowledge stability for existing clients, pFedDSH incorporates batch-specific masks, which activate subsets of neurons to preserve knowledge. Furthermore, we introduce a data-free replay strategy motivated by DeepInversion to facilitate backward transfer, enhancing existing clients’ performance without compromising privacy. Extensive experiments conducted on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that pFedDSH outperforms the state-of-the-art pFL and Federated Continual Learning baselines in our investigation scenario. Our approach achieves robust performance stability for existing clients, as well as adaptation for new clients and efficient utilization of neural resources.
[LG-32] PSEO: Optimizing Post-hoc Stacking Ensemble Through Hyperparameter Tuning
Link: https://arxiv.org/abs/2508.05144
Authors: Beicheng Xu, Wei Liu, Keyao Ding, Yupeng Lu, Bin Cui
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The Combined Algorithm Selection and Hyperparameter Optimization (CASH) problem is fundamental in Automated Machine Learning (AutoML). Inspired by the success of ensemble learning, recent AutoML systems construct post-hoc ensembles for final predictions rather than relying on the best single model. However, while most CASH methods conduct extensive searches for the optimal single model, they typically employ fixed strategies during the ensemble phase that fail to adapt to specific task characteristics. To tackle this issue, we propose PSEO, a framework for post-hoc stacking ensemble optimization. First, we conduct base model selection through binary quadratic programming, with a trade-off between diversity and performance. Furthermore, we introduce two mechanisms to fully realize the potential of multi-layer stacking. Finally, PSEO builds a hyperparameter space and searches for the optimal post-hoc ensemble strategy within it. Empirical results on 80 public datasets show that PSEO achieves the best average test rank (2.96) among 16 methods, including post-hoc designs in recent AutoML systems and state-of-the-art ensemble learning methods.
[LG-33] Deep Neural Networks with General Activations: Super-Convergence in Sobolev Norms
Link: https://arxiv.org/abs/2508.05141
Authors: Yahong Yang, Juncai He
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 45 pages, 4 figures
Abstract:This paper establishes a comprehensive approximation result for deep fully-connected neural networks with commonly-used and general activation functions in Sobolev spaces $W^{n,\infty}$, with errors measured in the $W^{m,p}$-norm for $m < n$ and $1 \le p \le \infty$. The derived rates surpass those of classical numerical approximation techniques, such as finite element and spectral methods, exhibiting a phenomenon we refer to as super-convergence. Our analysis shows that deep networks with general activations can approximate weak solutions of partial differential equations (PDEs) with superior accuracy compared to traditional numerical methods at the approximation level. Furthermore, this work closes a significant gap in the error-estimation theory for neural-network-based approaches to PDEs, offering a unified theoretical foundation for their use in scientific computing.
[LG-34] HFedATM: Hierarchical Federated Domain Generalization via Optimal Transport and Regularized Mean Aggregation
Link: https://arxiv.org/abs/2508.05135
Authors: Thinh Nguyen, Trung Phan, Binh T. Nguyen, Khoa D Doan, Kok-Seng Wong
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 11 pages, 3 figures
Abstract:Federated Learning (FL) is a decentralized approach where multiple clients collaboratively train a shared global model without sharing their raw data. Despite its effectiveness, conventional FL faces scalability challenges due to excessive computational and communication demands placed on a single central server as the number of participating devices grows. Hierarchical Federated Learning (HFL) addresses these issues by distributing model aggregation tasks across intermediate nodes (stations), thereby enhancing system scalability and robustness against single points of failure. However, HFL still suffers from a critical yet often overlooked limitation: domain shift, where data distributions vary significantly across different clients and stations, reducing model performance on unseen target domains. While Federated Domain Generalization (FedDG) methods have emerged to improve robustness to domain shifts, their integration into HFL frameworks remains largely unexplored. In this paper, we formally introduce Hierarchical Federated Domain Generalization (HFedDG), a novel scenario designed to investigate domain shift within hierarchical architectures. Specifically, we propose HFedATM, a hierarchical aggregation method that first aligns the convolutional filters of models from different stations through Filter-wise Optimal Transport Alignment and subsequently merges aligned models using a Shrinkage-aware Regularized Mean Aggregation. Our extensive experimental evaluations demonstrate that HFedATM significantly boosts the performance of existing FedDG baselines across multiple datasets and maintains computational and communication efficiency. Moreover, theoretical analyses indicate that HFedATM achieves tighter generalization error bounds compared to standard hierarchical averaging, resulting in faster convergence and stable training behavior.
[LG-35] Learning from Similarity-Confidence and Confidence-Difference
Link: https://arxiv.org/abs/2508.05108
Authors: Tomoya Tate, Kosuke Sugiyama, Masato Uchida
Categories: Machine Learning (cs.LG)
Comments: 41 pages, 13 figures. arXiv admin note: text overlap with arXiv:2310.05632 by other authors
Abstract:In practical machine learning applications, it is often challenging to assign accurate labels to data, and the number of labeled instances that can be collected is often limited. In such cases, Weakly Supervised Learning (WSL), which enables training with incomplete or imprecise supervision, provides a practical and effective solution. However, most existing WSL methods focus on leveraging a single type of weak supervision. In this paper, we propose a novel WSL framework that leverages complementary weak supervision signals from multiple relational perspectives, which can be especially valuable when labeled data is limited. Specifically, we introduce SconfConfDiff Classification, a method that integrates two distinct forms of weak labels: similarity-confidence and confidence-difference, which are assigned to unlabeled data pairs. To implement this method, we derive two types of unbiased risk estimators for classification: one based on a convex combination of existing estimators, and another newly designed by modeling the interaction between two weak labels. We prove that both estimators achieve optimal convergence rates with respect to estimation error bounds. Furthermore, we introduce a risk correction approach to mitigate overfitting caused by negative empirical risk, and provide theoretical analysis on the robustness of the proposed method against inaccurate class prior probability and label noise. Experimental results demonstrate that the proposed method consistently outperforms existing baselines across a variety of settings.
[LG-36] Cold Start Active Preference Learning in Socio-Economic Domains
Link: https://arxiv.org/abs/2508.05090
Authors: Mojtaba Fayaz-Bakhsh, Danial Ataee, MohammadAmin Fazli
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Active preference learning is a powerful paradigm for efficiently modeling preferences, yet it suffers from the cold-start problem: a significant drop in performance when no initial labeled data is available. This challenge is particularly acute in computational social systems and economic analysis, where labeled data is often scarce, expensive, and subject to expert noise. To address this gap, we propose a novel framework for cold-start active preference learning. Our method initiates the learning process through a self-supervised pre-training phase, utilizing Principal Component Analysis (PCA) to derive initial pseudo-labels from the data’s inherent structure, thereby creating a cold-start model without any initial oracle interaction. Subsequently, the model is refined through an active learning loop that strategically queries a simulated noisy oracle for labels. We conduct extensive experiments on diverse datasets from different domains, including financial credibility, career success rate, and socio-economic status. The results demonstrate that our cold-start approach outperforms standard active learning strategies that begin from a blank slate, achieving higher accuracy with substantially fewer labeled pairs. Our framework offers a practical and effective solution to mitigate the cold-start problem, enhancing the sample efficiency and applicability of preference learning in data-constrained environments. We release our code at this https URL
[LG-37] Analyzing the Impact of Multimodal Perception on Sample Complexity and Optimization Landscapes in Imitation Learning
链接: https://arxiv.org/abs/2508.05077
作者: Luai Abuelsamen,Temitope Lukman Adebanjo
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 9 pages, 1 figure, 1 table, theoretical analysis with empirical validation on PerAct implementation in MuJoCo simulation environment
Abstract:This paper examines the theoretical foundations of multimodal imitation learning through the lens of statistical learning theory. We analyze how multimodal perception (RGB-D, proprioception, language) affects sample complexity and optimization landscapes in imitation policies. Building on recent advances in multimodal learning theory, we show that properly integrated multimodal policies can achieve tighter generalization bounds and more favorable optimization landscapes than their unimodal counterparts. We provide a comprehensive review of theoretical frameworks that explain why multimodal architectures like PerAct and CLIPort achieve superior performance, connecting these empirical results to fundamental concepts in Rademacher complexity, PAC learning, and information theory.
[LG-38] ULU: A Unified Activation Function
链接: https://arxiv.org/abs/2508.05073
作者: Simin Huo
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose ULU, a novel non-monotonic, piecewise activation function defined as $\{ f(x;\alpha_1),\ x<0;\ f(x;\alpha_2),\ x\ge 0 \}$, where $f(x;\alpha)=0.5x(\tanh(\alpha x)+1)$, $\alpha>0$. ULU treats positive and negative inputs differently. Extensive experiments demonstrate ULU significantly outperforms ReLU and Mish across image classification and object detection tasks. Its variant Adaptive ULU (AULU) is expressed as $\{ f(x;\beta_1^2),\ x<0;\ f(x;\beta_2^2),\ x\ge 0 \}$, where $\beta_1$ and $\beta_2$ are learnable parameters, enabling it to adapt its response separately for positive and negative inputs. Additionally, we introduce the LIB (Like Inductive Bias) metric from AULU to quantitatively measure the inductive bias of the model.
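A small NumPy sketch of the activation may help make the definition concrete. It assumes one slope parameter governs negative inputs and another governs non-negative inputs; the argument names `alpha_neg` and `alpha_pos` are illustrative. Because 0.5(tanh(z)+1) equals the logistic sigmoid of 2z, the base function f(x; 0.5) coincides with SiLU/Swish.

```python
import numpy as np

def f(x, alpha):
    """Base function f(x; alpha) = 0.5 * x * (tanh(alpha * x) + 1)."""
    return 0.5 * x * (np.tanh(alpha * x) + 1.0)

def ulu(x, alpha_neg, alpha_pos):
    """Piecewise activation: one alpha for x < 0, another for x >= 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x < 0, f(x, alpha_neg), f(x, alpha_pos))

def aulu(x, beta_neg, beta_pos):
    """Adaptive variant: alphas are squares of (learnable) betas."""
    return ulu(x, beta_neg ** 2, beta_pos ** 2)

x = np.linspace(-4.0, 4.0, 9)
y = ulu(x, alpha_neg=0.5, alpha_pos=2.0)
```

Squaring the betas keeps the effective slope non-negative without constraining the underlying parameters during training.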
[LG-39] TANGO: Graph Neural Dynamics via Learned Energy and Tangential Flows
链接: https://arxiv.org/abs/2508.05070
作者: Moshe Eliasof,Eldad Haber,Carola-Bibiane Schönlieb
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce TANGO – a dynamical systems inspired framework for graph representation learning that governs node feature evolution through a learned energy landscape and its associated descent dynamics. At the core of our approach is a learnable Lyapunov function over node embeddings, whose gradient defines an energy-reducing direction that guarantees convergence and stability. To enhance flexibility while preserving the benefits of energy-based dynamics, we incorporate a novel tangential component, learned via message passing, that evolves features while maintaining the energy value. This decomposition into orthogonal flows of energy gradient descent and tangential evolution yields a flexible form of graph dynamics, and enables effective signal propagation even in flat or ill-conditioned energy regions, that often appear in graph learning. Our method mitigates oversquashing and is compatible with different graph neural network backbones. Empirically, TANGO achieves strong performance across a diverse set of node and graph classification and regression benchmarks, demonstrating the effectiveness of jointly learned energy functions and tangential flows for graph neural networks.
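The split into an energy-descent direction and an energy-preserving tangential direction can be sketched numerically. This is a toy illustration under stated assumptions, not the paper's architecture: the quadratic energy stands in for the learned Lyapunov function, the neighbor aggregation `A @ X` stands in for the learned message-passing field, and the explicit Euler step with step size `h` is an arbitrary choice.

```python
import numpy as np

def energy(X):
    """Toy energy (assumption): E(X) = 0.5 * ||X||_F^2."""
    return 0.5 * np.sum(X ** 2)

def energy_grad(X):
    return X  # gradient of the toy energy

def tangential(V, G, eps=1e-12):
    """Remove the component of V along G, leaving a flow orthogonal to grad E."""
    coeff = np.sum(V * G) / (np.sum(G * G) + eps)
    return V - coeff * G

def step(X, A, h=0.05):
    """One Euler step of dX/dt = -grad E(X) + tangential(message_passing(X))."""
    G = energy_grad(X)
    V = A @ X                 # stand-in for a learned message-passing field
    return X + h * (-G + tangential(V, G))

# Ring graph on 5 nodes with 3 features per node.
n = 5
A = np.zeros((n, n))
for k in range(n):
    A[k, (k + 1) % n] = A[(k + 1) % n, k] = 1.0

rng = np.random.default_rng(0)
X = rng.normal(size=(n, 3))
energies = [energy(X)]
for _ in range(50):
    X = step(X, A)
    energies.append(energy(X))
```

In continuous time the tangential term leaves the energy unchanged to first order, so the energy decays at the rate set by the gradient term alone while features can still move along level sets.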
[LG-40] Two tales for a geometric Jensen–Shannon divergence
链接: https://arxiv.org/abs/2508.05066
作者: Frank Nielsen
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 17 pages
Abstract:The geometric Jensen–Shannon divergence (G-JSD) gained popularity in machine learning and information sciences thanks to its closed-form expression between Gaussian distributions. In this work, we introduce an alternative definition of the geometric Jensen–Shannon divergence tailored to positive densities which does not normalize geometric mixtures. This novel divergence is termed the extended G-JSD as it extends to more general positive measures. We give explicitly the gap between the extended G-JSD and G-JSD when considering probability densities, and report both lower and upper bounds in terms of other statistical divergences. We derive corresponding closed-form expressions when considering the case of multivariate Gaussian distributions often met in applications. Finally, we show that these two types of geometric JSDs, the G-JSD and the extended G-JSD, can be interpreted as regularizations of the ordinary JSD by additive terms.
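A discrete toy computation can illustrate the difference between normalizing and not normalizing the geometric mixture. The definitions below are assumptions made for illustration: an equal-weight geometric mean, the ordinary Kullback–Leibler divergence for the normalized case, and the extended KL divergence for positive measures (with the extra `q - p` term) for the unnormalized case. Under these assumptions the gap between the two quantities works out to Z - 1 - log Z, where Z is the total mass of the unnormalized geometric mean.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def extended_kl(p, q):
    """Extended KL divergence for positive (unnormalized) measures."""
    return float(np.sum(p * np.log(p / q) + q - p))

def g_jsd(p, q):
    """Geometric JSD: KL to the *normalized* geometric mean."""
    g = np.sqrt(p * q)
    g = g / g.sum()
    return 0.5 * kl(p, g) + 0.5 * kl(q, g)

def extended_g_jsd(p, q):
    """Extended variant: extended KL to the *unnormalized* geometric mean."""
    g = np.sqrt(p * q)
    return 0.5 * extended_kl(p, g) + 0.5 * extended_kl(q, g)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
Z = float(np.sum(np.sqrt(p * q)))   # mass of the unnormalized geometric mean
gap = extended_g_jsd(p, q) - g_jsd(p, q)
```

Since Z is at most 1 for probability vectors, the gap Z - 1 - log Z is non-negative in this toy setting.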
[LG-41] An ML-based Approach to Predicting Software Change Dependencies: Insights from an Empirical Study on OpenStack
链接: https://arxiv.org/abs/2508.05034
作者: Ali Arabat,Mohammed Sayagh,Jameleddine Hassine
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:As software systems grow in complexity, accurately identifying and managing dependencies among changes becomes increasingly critical. For instance, a change that leverages a function must depend on the change that introduces it. Establishing such dependencies allows CI/CD pipelines to build and orchestrate changes effectively, preventing build failures and incomplete feature deployments. In modern software systems, dependencies often span multiple components across teams, creating challenges for development and deployment. They serve various purposes, from enabling new features to managing configurations, and can even involve traditionally independent changes like documentation updates. To address these challenges, we conducted a preliminary study on dependency management in OpenStack, a large-scale software system. Our study revealed that a substantial portion of software changes in OpenStack over the past 10 years are interdependent. Surprisingly, 51.08% of these dependencies are identified during the code review phase, after a median delay of 5.06 hours, rather than at the time of change creation. Developers often spend a median of 57.12 hours identifying dependencies, searching among a median of 463 other changes. To help developers proactively identify dependencies, we propose a semi-automated approach that leverages two ML models. The first model predicts the likelihood of dependencies among changes, while the second identifies the exact pairs of dependent changes. Our proposed models demonstrate strong performance, achieving average AUC scores of 79.33% and 91.89%, and Brier scores of 0.11 and 0.014, respectively. Indeed, the second model has a good top-k recall across all types of pairs, while the top-k precision has room for improvement.
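For readers less familiar with the two reported metrics, the sketch below computes them from scratch: the Brier score is the mean squared difference between predicted probability and the 0/1 outcome (lower is better), and AUC is the probability that a randomly chosen positive is ranked above a randomly chosen negative. The labels and probabilities are made-up toy values, not data from the study.

```python
import numpy as np

def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(probs, dtype=float)
    return float(np.mean((p - y) ** 2))

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    diff = s[y == 1][:, None] - s[y == 0][None, :]
    # Ties count as half a correct ranking.
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

# Toy example: 1 = "change has a dependency", 0 = "no dependency".
y_true = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
toy_brier = brier_score(y_true, probs)
toy_auc = auc(y_true, probs)
```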
[LG-42] Will You Be Aware? Eye Tracking-Based Modeling of Situational Awareness in Augmented Reality
链接: https://arxiv.org/abs/2508.05025
作者: Zhehan Qu,Tianyi Hu,Christian Fronk,Maria Gorlatova
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:
Abstract:Augmented Reality (AR) systems, while enhancing task performance through real-time guidance, pose risks of inducing cognitive tunneling, a hyperfocus on virtual content that compromises situational awareness (SA) in safety-critical scenarios. This paper investigates SA in AR-guided cardiopulmonary resuscitation (CPR), where responders must balance effective compressions with vigilance to unpredictable hazards (e.g., patient vomiting). We developed an AR app on a Magic Leap 2 that overlays real-time CPR feedback (compression depth and rate) and conducted a user study with simulated unexpected incidents (e.g., bleeding) to evaluate SA, in which SA metrics were collected via observation and questionnaires administered during freeze-probe events. Eye tracking analysis revealed that higher SA levels were associated with greater saccadic amplitude and velocity, and with reduced proportion and frequency of fixations on virtual content. To predict SA, we propose FixGraphPool, a graph neural network that structures gaze events (fixations, saccades) into spatiotemporal graphs, effectively capturing dynamic attentional patterns. Our model achieved 83.0% accuracy (F1=81.0%), outperforming feature-based machine learning and state-of-the-art time-series models by leveraging domain knowledge and spatial-temporal information encoded in ET data. These findings demonstrate the potential of eye tracking for SA modeling in AR and highlight its utility in designing AR systems that ensure user safety and situational awareness.
[LG-43] Disentangling Bias by Modeling Intra- and Inter-modal Causal Attention for Multimodal Sentiment Analysis
链接: https://arxiv.org/abs/2508.04999
作者: Menghua Jiang,Yuxia Lin,Baoliang Chen,Haifeng Hu,Yuncheng Jiang,Sijie Mai
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multimodal sentiment analysis (MSA) aims to understand human emotions by integrating information from multiple modalities, such as text, audio, and visual data. However, existing methods often suffer from spurious correlations both within and across modalities, leading models to rely on statistical shortcuts rather than true causal relationships, thereby undermining generalization. To mitigate this issue, we propose a Multi-relational Multimodal Causal Intervention (MMCI) model, which leverages the backdoor adjustment from causal theory to address the confounding effects of such shortcuts. Specifically, we first model the multimodal inputs as a multi-relational graph to explicitly capture intra- and inter-modal dependencies. Then, we apply an attention mechanism to separately estimate and disentangle the causal features and shortcut features corresponding to these intra- and inter-modal relations. Finally, by applying the backdoor adjustment, we stratify the shortcut features and dynamically combine them with the causal features to encourage MMCI to produce stable predictions under distribution shifts. Extensive experiments on several standard MSA datasets and out-of-distribution (OOD) test sets demonstrate that our method effectively suppresses biases and improves performance.
[LG-44] RCUKF: Data-Driven Modeling Meets Bayesian Estimation
链接: https://arxiv.org/abs/2508.04985
作者: Kumar Anurag,Kasra Azizi,Francesco Sorrentino,Wenbin Wan
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注: 6 pages, 6 figures. Accepted at IFAC MECC 2025 (Modeling, Estimation and Control Conference)
Abstract:Accurate modeling is crucial in many engineering and scientific applications, yet obtaining a reliable process model for complex systems is often challenging. To address this challenge, we propose a novel framework, reservoir computing with unscented Kalman filtering (RCUKF), which integrates data-driven modeling via reservoir computing (RC) with Bayesian estimation through the unscented Kalman filter (UKF). The RC component learns the nonlinear system dynamics directly from data, serving as a surrogate process model in the UKF prediction step to generate state estimates in high-dimensional or chaotic regimes where nominal mathematical models may fail. Meanwhile, the UKF measurement update integrates real-time sensor data to correct potential drift in the data-driven model. We demonstrate the effectiveness of RCUKF on well-known benchmark problems and a real-time vehicle trajectory estimation task in a high-fidelity simulation environment.
[LG-45] A Metric for MLLM Alignment in Large-scale Recommendation
链接: https://arxiv.org/abs/2508.04963
作者: Yubin Zhang,Yanhua Huang,Haiming Xu,Mingliang Qi,Chang Wang,Jiarui Jin,Xiangyuan Ren,Xiaodan Wang,Ruiwen Xu
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: this http URL Review
Abstract:Multimodal recommendation has emerged as a critical technique in modern recommender systems, leveraging content representations from advanced multimodal large language models (MLLMs). To ensure these representations are well-adapted, alignment with the recommender system is essential. However, evaluating the alignment of MLLMs for recommendation presents significant challenges due to three key issues: (1) static benchmarks are inaccurate because of the dynamism in real-world applications, (2) evaluations with online systems, while accurate, are prohibitively expensive at scale, and (3) conventional metrics fail to provide actionable insights when learned representations underperform. To address these challenges, we propose the Leakage Impact Score (LIS), a novel metric for multimodal recommendation. Rather than directly assessing MLLMs, LIS efficiently measures the upper bound of preference data. We also share practical insights on deploying MLLMs with LIS in real-world scenarios. Online A/B tests on both Content Feed and Display Ads of Xiaohongshu’s Explore Feed production demonstrate the effectiveness of our proposed method, showing significant improvements in user time spent and advertiser value.
[LG-46] Compressed Decentralized Momentum Stochastic Gradient Methods for Nonconvex Optimization
链接: https://arxiv.org/abs/2508.04950
作者: Wei Liu,Anweshit Panda,Ujwal Pandey,Christopher Brissette,Yikang Shen,George M. Slota,Naigang Wang,Jie Chen,Yangyang Xu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: accepted by TMLR
Abstract:In this paper, we design two compressed decentralized algorithms for solving nonconvex stochastic optimization under two different scenarios. Both algorithms adopt a momentum technique to achieve fast convergence and a message-compression technique to save communication costs. Though momentum acceleration and compressed communication have been used in literature, it is highly nontrivial to theoretically prove the effectiveness of their composition in a decentralized algorithm that can maintain the benefits of both sides, because of the need to simultaneously control the consensus error, the compression error, and the bias from the momentum gradient. For the scenario where gradients are bounded, our proposal is a compressed decentralized adaptive method. To the best of our knowledge, this is the first decentralized adaptive stochastic gradient method with compressed communication. For the scenario of data heterogeneity without bounded gradients, our proposal is a compressed decentralized heavy-ball method, which applies a gradient tracking technique to address the challenge of data heterogeneity. Notably, both methods achieve an optimal convergence rate, and they can achieve linear speed up and adopt topology-independent algorithmic parameters within a certain regime of the user-specified error tolerance. Superior empirical performance is observed over state-of-the-art methods on training deep neural networks (DNNs) and Transformers.
[LG-47] Self-Error Adjustment: Theory and Practice of Balancing Individual Performance and Diversity in Ensemble Learning
链接: https://arxiv.org/abs/2508.04948
作者: Rui Zou
类目: Machine Learning (cs.LG)
*备注:
Abstract:Ensemble learning boosts performance by aggregating predictions from multiple base learners. A core challenge is balancing individual learner accuracy with diversity. Traditional methods like Bagging and Boosting promote diversity through randomness but lack precise control over the accuracy-diversity trade-off. Negative Correlation Learning (NCL) introduces a penalty to manage this trade-off but suffers from loose theoretical bounds and limited adjustment range. To overcome these limitations, we propose a novel framework called Self-Error Adjustment (SEA), which decomposes ensemble errors into two distinct components: individual performance terms, representing the self-error of each base learner, and diversity terms, reflecting interactions among learners. This decomposition allows us to introduce an adjustable parameter into the loss function, offering precise control over the contribution of each component, thus enabling finer regulation of ensemble performance. Compared to NCL and its variants, SEA provides a broader range of effective adjustments and more consistent changes in diversity. Furthermore, we establish tighter theoretical bounds for adjustable ensemble methods and validate them through empirical experiments. Experimental results on several public regression and classification datasets demonstrate that SEA consistently outperforms baseline methods across all tasks. Ablation studies confirm that SEA offers more flexible adjustment capabilities and superior performance in fine-tuning strategies.
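The idea of splitting ensemble error into per-learner terms and interaction terms can be checked with a short numerical sketch. It uses a standard algebraic identity for the squared error of an averaged ensemble: with e_i = f_i - y and M learners, the ensemble error equals (1/M²)·Σ e_i² plus (1/M²)·Σ_{i≠j} e_i e_j. Whether SEA uses exactly this form is an assumption, and the weighting parameter `lam` below is a hypothetical illustration of an adjustable trade-off rather than the paper's loss.

```python
import numpy as np

def decompose(preds, y):
    """Split the averaged-ensemble MSE into self-error and interaction parts.

    preds: array of shape (M, N) with predictions of M base learners.
    y:     array of shape (N,) with targets.
    """
    M = preds.shape[0]
    err = preds - y                                   # e_i per learner and sample
    self_term = np.sum(err ** 2, axis=0) / M ** 2     # (1/M^2) * sum_i e_i^2
    total = np.sum(err, axis=0) ** 2 / M ** 2         # squared ensemble error
    interaction = total - self_term                   # (1/M^2) * sum_{i != j} e_i e_j
    return float(self_term.mean()), float(interaction.mean())

def adjustable_loss(preds, y, lam):
    """Hypothetical adjustable objective: self-error + lam * interaction."""
    s, inter = decompose(preds, y)
    return s + lam * inter

rng = np.random.default_rng(0)
y = rng.normal(size=100)
preds = y + rng.normal(scale=0.5, size=(4, 100))      # 4 noisy base learners
s, inter = decompose(preds, y)
ensemble_mse = float(np.mean((preds.mean(axis=0) - y) ** 2))
```

With `lam = 1` the objective reduces to the ordinary ensemble MSE; other values re-weight how much learner interactions count.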
[LG-48] Sensitivity of Stability: Theoretical & Empirical Analysis of Replicability for Adaptive Data Selection in Transfer Learning
链接: https://arxiv.org/abs/2508.04901
作者: Prabhav Singh,Jessica Sorrell
类目: Machine Learning (cs.LG)
*备注: 24 Pages, 5 Figures
Abstract:The widespread adoption of transfer learning has revolutionized machine learning by enabling efficient adaptation of pre-trained models to new domains. However, the reliability of these adaptations remains poorly understood, particularly when using adaptive data selection strategies that dynamically prioritize training examples. We present a comprehensive theoretical and empirical analysis of replicability in transfer learning, introducing a mathematical framework that quantifies the fundamental trade-off between adaptation effectiveness and result consistency. Our key contribution is the formalization of selection sensitivity ( \Delta_Q ), a measure that captures how adaptive selection strategies respond to perturbations in training data. We prove that the replicability failure probability, i.e., the likelihood that two independent training runs produce models differing in performance by more than a threshold, increases quadratically with selection sensitivity while decreasing exponentially with sample size. Through extensive experiments on the MultiNLI corpus using six adaptive selection strategies, ranging from uniform sampling to gradient-based selection, we demonstrate that this theoretical relationship holds precisely in practice. Our results reveal that highly adaptive strategies like gradient-based and curriculum learning achieve superior task performance but suffer from high replicability failure rates, while less adaptive approaches maintain failure rates below 7%. Crucially, we show that source domain pretraining provides a powerful mitigation mechanism, reducing failure rates by up to 30% while preserving performance gains. These findings establish principled guidelines for practitioners to navigate the performance-replicability trade-off and highlight the need for replicability-aware design in modern transfer learning systems.
[LG-49] Honest and Reliable Evaluation and Expert Equivalence Testing of Automated Neonatal Seizure Detection
链接: https://arxiv.org/abs/2508.04899
作者: Jovana Kljajic,John M. O’Toole,Robert Hogan,Tamara Skoric
类目: Machine Learning (cs.LG)
*备注: Submitted for possible publication at IEEE Journal of Biomedical and Health Informatics
Abstract:Reliable evaluation of machine learning models for neonatal seizure detection is critical for clinical adoption. Current practices often rely on inconsistent and biased metrics, hindering model comparability and interpretability. Expert-level claims about AI performance are frequently made without rigorous validation, raising concerns about their reliability. This study aims to systematically evaluate common performance metrics and propose best practices tailored to the specific challenges of neonatal seizure detection. Using real and synthetic seizure annotations, we assessed standard performance metrics, consensus strategies, and human-expert level equivalence tests under varying class imbalance, inter-rater agreement, and number of raters. Matthews and Pearson’s correlation coefficients outperformed the area under the curve in reflecting performance under class imbalance. Consensus types are sensitive to the number of raters and agreement level among them. Among human-expert level equivalence tests, the multi-rater Turing test using Fleiss' kappa best captured expert-level AI performance. We recommend reporting: (1) at least one balanced metric, (2) sensitivity, specificity, PPV and NPV, (3) multi-rater Turing test results using Fleiss' kappa, and (4) all of the above on a held-out validation set. This proposed framework provides an important prerequisite to clinical validation by enabling a thorough and honest appraisal of AI methods for neonatal seizure detection.
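Since the recommendations rest on Fleiss' kappa, a from-scratch computation of the statistic may be useful. The sketch implements the textbook formula on a subjects-by-categories count matrix where every subject is rated by the same number of raters; it does not reproduce the multi-rater Turing test procedure itself, and the toy matrices are invented (think of rows as EEG epochs and columns as "seizure" / "non-seizure" votes).

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for a (subjects x categories) matrix of rating counts.

    counts[i, j] = number of raters who assigned subject i to category j.
    Every row must sum to the same number of raters.
    """
    counts = np.asarray(counts, dtype=float)
    n_subjects = counts.shape[0]
    n_raters = counts[0].sum()
    if not np.all(counts.sum(axis=1) == n_raters):
        raise ValueError("each subject needs the same number of ratings")
    # Share of all ratings that fall in each category.
    p_cat = counts.sum(axis=0) / (n_subjects * n_raters)
    # Per-subject agreement: fraction of rater pairs that agree.
    p_subj = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_subj.mean()          # observed agreement
    p_exp = np.sum(p_cat ** 2)     # agreement expected by chance
    return float((p_bar - p_exp) / (1.0 - p_exp))

perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]   # three raters, full agreement
mixed = [[2, 1], [1, 2]]                     # three raters, split votes
kappa_perfect = fleiss_kappa(perfect)
kappa_mixed = fleiss_kappa(mixed)
```

The statistic is undefined when all ratings fall into a single category, because chance agreement is then 1 and the denominator vanishes.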
[LG-50] Retrieval-Augmented Water Level Forecasting for Everglades
链接: https://arxiv.org/abs/2508.04888
作者: Rahuul Rangaraj,Jimeng Shi,Rajendra Paudel,Giri Narasimhan,Yanzhao Wu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate water level forecasting is crucial for managing ecosystems such as the Everglades, a subtropical wetland vital for flood mitigation, drought management, water resource planning, and biodiversity conservation. While recent advances in deep learning, particularly time series foundation models, have demonstrated success in general-domain forecasting, their application in hydrology remains underexplored. Furthermore, they often struggle to generalize across diverse unseen datasets and domains, due to the lack of effective mechanisms for adaptation. To address this gap, we introduce Retrieval-Augmented Forecasting (RAF) into the hydrology domain, proposing a framework that retrieves historically analogous multivariate hydrological episodes to enrich the model input before forecasting. By maintaining an external archive of past observations, RAF identifies and incorporates relevant patterns from historical data, thereby enhancing contextual awareness and predictive accuracy without requiring task-specific retraining or fine-tuning of the model. Furthermore, we explore and compare both similarity-based and mutual information-based RAF methods. We conduct a comprehensive evaluation on real-world data from the Everglades, demonstrating that the RAF framework yields substantial improvements in water level forecasting accuracy. This study highlights the potential of RAF approaches in environmental hydrology and paves the way for broader adoption of adaptive AI methods by domain experts in ecosystem management. The code and data are available at this https URL.
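The similarity-based retrieval step can be sketched as follows. This is a minimal illustration assuming cosine similarity over flattened multivariate windows and simple concatenation of retrieved episodes in front of the query; the window length, horizon, `k`, and all function names are hypothetical, and the mutual-information variant mentioned in the abstract is not shown.

```python
import numpy as np

def build_archive(series, window, horizon):
    """Slice a (T, D) history into lookback windows and their continuations."""
    keys, values = [], []
    for s in range(len(series) - window - horizon + 1):
        keys.append(series[s:s + window])
        values.append(series[s + window:s + window + horizon])
    return np.stack(keys), np.stack(values)

def retrieve(query, keys, k=3):
    """Return indices of the k archive windows most similar to the query."""
    q = query.ravel()
    K = keys.reshape(len(keys), -1)
    sims = K @ q / (np.linalg.norm(K, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)[:k]

def augment(query, keys, values, top):
    """Prepend each retrieved episode (window + what followed) to the query."""
    episodes = [np.concatenate([keys[t], values[t]], axis=0) for t in top]
    return np.concatenate(episodes + [query], axis=0)

rng = np.random.default_rng(0)
history = rng.normal(size=(300, 4))      # synthetic stand-in for hydrological data
keys, values = build_archive(history, window=24, horizon=6)
query = keys[7].copy()                   # a window known to be in the archive
top = retrieve(query, keys, k=3)
model_input = augment(query, keys, values, top)
```

The augmented array would then be passed to a frozen forecasting model, which is how retrieval adds context without any retraining.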
[LG-51] Gaussian mixture layers for neural networks
链接: https://arxiv.org/abs/2508.04883
作者: Sinho Chewi,Philippe Rigollet,Yuling Yan
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
Abstract:The mean-field theory for two-layer neural networks considers infinitely wide networks that are linearly parameterized by a probability measure over the parameter space. This nonparametric perspective has significantly advanced both the theoretical and conceptual understanding of neural networks, with substantial efforts made to validate its applicability to networks of moderate width. In this work, we explore the opposite direction, investigating whether dynamics can be directly implemented over probability measures. Specifically, we employ Gaussian mixture models as a flexible and expressive parametric family of distributions together with the theory of Wasserstein gradient flows to derive training dynamics for such measures. Our approach introduces a new type of layer – the Gaussian mixture (GM) layer – that can be integrated into neural network architectures. As a proof of concept, we validate our proposal through experiments on simple classification tasks, where a GM layer achieves test performance comparable to that of a two-layer fully connected network. Furthermore, we examine the behavior of these dynamics and demonstrate numerically that GM layers exhibit markedly different behavior compared to classical fully connected layers, even when the latter are large enough to be considered in the mean-field regime.
[LG-52] Hilbert Neural Operator: Operator Learning in the Analytic Signal Domain
链接: https://arxiv.org/abs/2508.04882
作者: Saman Pordanesh,Pejman Shahsavari,Hossein Ghadjari
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural operators have emerged as a powerful, data-driven paradigm for learning solution operators of partial differential equations (PDEs). State-of-the-art architectures, such as the Fourier Neural Operator (FNO), have achieved remarkable success by performing convolutions in the frequency domain, making them highly effective for a wide range of problems. However, this method has some limitations, including the periodicity assumption of the Fourier transform. In addition, there are other ways of analysing a signal, beyond the phase and amplitude perspective, that provide further useful information for learning an effective network. We introduce the Hilbert Neural Operator (HNO), a new neural operator architecture that addresses some of these limitations by incorporating a strong inductive bias from signal processing. HNO operates by first mapping the input signal to its analytic representation via the Hilbert transform, thereby making instantaneous amplitude and phase information explicit features for the learning process. The core learnable operation, a spectral convolution, is then applied to this Hilbert-transformed representation. We hypothesize that this architecture enables HNO to model operators more effectively for causal, phase-sensitive, and non-stationary systems. We formalize the HNO architecture and provide the theoretical motivation for its design, rooted in analytic signal theory.
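The first stage described above, mapping a real signal to its analytic representation, can be sketched with NumPy's FFT. The construction zeroes the negative-frequency bins and doubles the positive ones, which is the usual discrete recipe for the analytic signal; the learnable spectral convolution that follows in HNO is not shown, and the test signal is an arbitrary choice.

```python
import numpy as np

def analytic_signal(x):
    """Return x + i * H[x] for a real 1-D signal, computed via the FFT."""
    x = np.asarray(x, dtype=float)
    n = x.shape[-1]
    spectrum = np.fft.fft(x)
    # Keep DC (and Nyquist for even n), double positive frequencies,
    # zero out negative frequencies.
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * h)

# A cosine with exactly 8 cycles over the sampling window.
t = np.arange(256) / 256.0
x = np.cos(2 * np.pi * 8 * t)
z = analytic_signal(x)
amplitude = np.abs(z)                  # instantaneous amplitude
phase = np.unwrap(np.angle(z))         # instantaneous phase
```

For this input the imaginary part is the matching sine, so the instantaneous amplitude is constant and the phase grows linearly, which is the kind of explicit feature the abstract refers to.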
[LG-53] Agnostics: Learning to Code in Any Programming Language via Reinforcement with a Universal Learning Environment
链接: https://arxiv.org/abs/2508.04865
作者: Aleksander Boruch-Gruszecki,Yangtian Zi,Zixuan Wu,Tejas Oberoi,Carolyn Jane Anderson,Joydeep Biswas,Arjun Guha
类目: Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: 18 pages, 19 figures. For artifacts, see this https URL
Abstract:Large language models (LLMs) already excel at writing code in high-resource languages such as Python and JavaScript, yet stumble on low-resource languages that remain essential to science and engineering. Besides the obvious shortage of pre-training data, post-training itself is a bottleneck: every new language seems to require new datasets, test harnesses, and reinforcement-learning (RL) infrastructure. We introduce Agnostics, a language-agnostic post-training pipeline that eliminates this per-language engineering. The key idea is to judge code solely by its externally observable behavior, so a single verifier can test solutions written in any language. Concretely, we (i) use an LLM to rewrite existing unit-test datasets into an I/O format, (ii) supply a short configuration that tells the verifier how to compile and run a target language, and (iii) apply reinforcement learning with verifiable rewards (RLVR) in a robust code execution environment. Applied to five low-resource languages (Lua, Julia, R, OCaml, and Fortran), Agnostics (1) improves Qwen-3 4B to performance that rivals other 16B-70B open-weight models; (2) scales cleanly to larger and diverse model families (Qwen-3 8B, DeepSeek Coder 6.7B Instruct, Phi 4 Mini); and (3) for \le 16 B parameter models, sets new state-of-the-art pass@1 results on MultiPL-E and a new multi-language version of LiveCodeBench that we introduce. We will release the language-agnostic training datasets (Ag-MBPP-X, Ag-Codeforces-X, Ag-LiveCodeBench-X), training code, and ready-to-use configurations, making RL post-training in any programming language as simple as editing a short YAML file.
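The idea of judging code only by its observable input/output behavior can be sketched with the standard library. Everything specific here is an assumption made for illustration: the config keys (`filename`, `compile`, `run`), the `{src}` placeholder, the Lua entry, and the pass-fraction reward are hypothetical and do not reflect the released Agnostics configuration format. Only the Python config is exercised in the example.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical per-language configs: how to store, compile, and run a solution.
PYTHON_CONFIG = {"filename": "main.py", "compile": None,
                 "run": [sys.executable, "{src}"]}
LUA_CONFIG = {"filename": "main.lua", "compile": None,
              "run": ["lua", "{src}"]}   # illustrative only, not executed below

def run_io_tests(source, config, tests, timeout=5.0):
    """Return the fraction of (stdin, expected_stdout) tests the program passes."""
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, config["filename"])
        with open(src, "w") as fh:
            fh.write(source)
        fill = lambda args: [a.replace("{src}", src) for a in args]
        if config["compile"]:
            built = subprocess.run(fill(config["compile"]), cwd=workdir,
                                   capture_output=True, text=True)
            if built.returncode != 0:
                return 0.0      # compilation failure earns no reward
        passed = 0
        for stdin_text, expected in tests:
            try:
                result = subprocess.run(fill(config["run"]), input=stdin_text,
                                        capture_output=True, text=True,
                                        cwd=workdir, timeout=timeout)
            except subprocess.TimeoutExpired:
                continue        # a timeout counts as a failed test
            if result.returncode == 0 and result.stdout.strip() == expected.strip():
                passed += 1
        return passed / len(tests)

tests = [("1 2 3\n", "6"), ("10 20\n", "30")]
good = "print(sum(int(tok) for tok in input().split()))\n"
bad = "print(0)\n"
reward_good = run_io_tests(good, PYTHON_CONFIG, tests)
reward_bad = run_io_tests(bad, PYTHON_CONFIG, tests)
```

Because the checker only sees stdin and stdout, supporting another language amounts to adding one more config entry; a production setup would additionally sandbox execution.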
[LG-54] Unified Flow Matching for Long Horizon Event Forecasting
链接: https://arxiv.org/abs/2508.04843
作者: Xiao Shou
类目: Machine Learning (cs.LG)
*备注: 7 pages
Abstract:Modeling long horizon marked event sequences is a fundamental challenge in many real-world applications, including healthcare, finance, and user behavior modeling. Existing neural temporal point process models are typically autoregressive, predicting the next event one step at a time, which limits their efficiency and leads to error accumulation in long-range forecasting. In this work, we propose a unified flow matching framework for marked temporal point processes that enables non-autoregressive, joint modeling of inter-event times and event types, via continuous and discrete flow matching. By learning continuous-time flows for both components, our method generates coherent long horizon event trajectories without sequential decoding. We evaluate our model on six real-world benchmarks and demonstrate significant improvements over autoregressive and diffusion-based baselines in both accuracy and generation efficiency.
[LG-55] HCRide: Harmonizing Passenger Fairness and Driver Preference for Human-Centered Ride-Hailing
链接: https://arxiv.org/abs/2508.04811
作者: Lin Jiang,Yu Yang,Guang Wang
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 9 pages,4 figures
Abstract:Order dispatch systems play a vital role in ride-hailing services, which directly influence operator revenue, driver profit, and passenger experience. Most existing work focuses on improving system efficiency in terms of operator revenue, which may cause a bad experience for both passengers and drivers. Hence, in this work, we aim to design a human-centered ride-hailing system by considering both passenger fairness and driver preference without compromising the overall system efficiency. However, it is nontrivial to achieve this target due to the potential conflicts between passenger fairness and driver preference since optimizing one may sacrifice the other. To address this challenge, we design HCRide, a Human-Centered Ride-hailing system based on a novel multi-agent reinforcement learning algorithm called Harmonization-oriented Actor-Bi-Critic (Habic), which includes three major components (i.e., a multi-agent competition mechanism, a dynamic Actor network, and a Bi-Critic network) to optimize system efficiency and passenger fairness with driver preference consideration. We extensively evaluate our HCRide using two real-world ride-hailing datasets from Shenzhen and New York City. Experimental results show our HCRide effectively improves system efficiency by 2.02%, fairness by 5.39%, and driver preference by 10.21% compared to state-of-the-art baselines.
[LG-56] Federated Continual Recommendation CIKM2025
链接: https://arxiv.org/abs/2508.04792
作者: Jaehyung Lim,Wonbin Kweon,Woojoo Kim,Junyoung Kim,Seongjin Choi,Dongha Kim,Hwanjo Yu
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: Accepted to CIKM 2025
Abstract:The increasing emphasis on privacy in recommendation systems has led to the adoption of Federated Learning (FL) as a privacy-preserving solution, enabling collaborative training without sharing user data. While Federated Recommendation (FedRec) effectively protects privacy, existing methods struggle with non-stationary data streams, failing to maintain consistent recommendation quality over time. On the other hand, Continual Learning Recommendation (CLRec) methods address evolving user preferences but typically assume centralized data access, making them incompatible with FL constraints. To bridge this gap, we introduce Federated Continual Recommendation (FCRec), a novel task that integrates FedRec and CLRec, requiring models to learn from streaming data while preserving privacy. As a solution, we propose F3CRec, a framework designed to balance knowledge retention and adaptation under the strict constraints of FCRec. F3CRec introduces two key components: Adaptive Replay Memory on the client side, which selectively retains past preferences based on user-specific shifts, and Item-wise Temporal Mean on the server side, which integrates new knowledge while preserving prior information. Extensive experiments demonstrate that F3CRec outperforms existing approaches in maintaining recommendation quality over time in a federated environment.
[LG-57] Are Large Language Models Dynamic Treatment Planners? An In Silico Study from a Prior Knowledge Injection Angle
Link: https://arxiv.org/abs/2508.04755
Authors: Zhiyao Luo,Tingting Zhu
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: 20 pages
Abstract:Reinforcement learning (RL)-based dynamic treatment regimes (DTRs) hold promise for automating complex clinical decision-making, yet their practical deployment remains hindered by the intensive engineering required to inject clinical knowledge and ensure patient safety. Recent advancements in large language models (LLMs) suggest a complementary approach, where implicit prior knowledge and clinical heuristics are naturally embedded through linguistic prompts without requiring environment-specific training. In this study, we rigorously evaluate open-source LLMs as dynamic insulin dosing agents in an in silico Type 1 diabetes simulator, comparing their zero-shot inference performance against small neural network-based RL agents (SRAs) explicitly trained for the task. Our results indicate that carefully designed zero-shot prompts enable smaller LLMs (e.g., Qwen2.5-7B) to achieve comparable or superior clinical performance relative to extensively trained SRAs, particularly in stable patient cohorts. However, LLMs exhibit notable limitations, such as overly aggressive insulin dosing when prompted with chain-of-thought (CoT) reasoning, highlighting critical failure modes including arithmetic hallucination, temporal misinterpretation, and inconsistent clinical logic. Incorporating explicit reasoning about latent clinical states (e.g., meals) yielded minimal performance gains, underscoring the current model’s limitations in capturing complex, hidden physiological dynamics solely through textual inference. Our findings advocate for cautious yet optimistic integration of LLMs into clinical workflows, emphasising the necessity of targeted prompt engineering, careful validation, and potentially hybrid approaches that combine linguistic reasoning with structured physiological modelling to achieve safe, robust, and clinically effective decision-support systems.
[LG-58] InfoQ: Mixed-Precision Quantization via Global Information Flow
Link: https://arxiv.org/abs/2508.04753
Authors: Mehmet Emre Akbulut,Hazem Hesham Yousef Shalby,Fabrizio Pittorino,Manuel Roveri
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Mixed-precision quantization (MPQ) is crucial for deploying deep neural networks on resource-constrained devices, but finding the optimal bit-width for each layer represents a complex combinatorial optimization problem. Current state-of-the-art methods rely on computationally expensive search algorithms or local sensitivity heuristic proxies like the Hessian, which fail to capture the cascading global effects of quantization error. In this work, we argue that the quantization sensitivity of a layer should not be measured by its local properties, but by its impact on the information flow throughout the entire network. We introduce InfoQ, a novel framework for MPQ that is training-free in the bit-width search phase. InfoQ assesses layer sensitivity by quantizing each layer at different bit-widths and measuring, through a single forward pass, the resulting change in mutual information in the subsequent layers. This quantifies how much each layer quantization impacts the network information flow. The resulting scores are used to formulate bit-width allocation as an integer linear programming problem, which is solved efficiently to minimize total sensitivity under a given budget (e.g., model size or BitOps). Our retraining-free search phase provides a superior search-time/accuracy trade-off (using two orders of magnitude less data compared to state-of-the-art methods such as LIMPQ), while yielding up to a 1% accuracy improvement for MobileNetV2 and ResNet18 on ImageNet at high compression rates (14X and 10.66X).
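The budgeted bit-width allocation described in this abstract can be illustrated with a toy example. The sketch below is not InfoQ's implementation: the sensitivity scores are made-up numbers, the function name `allocate_bits` is hypothetical, and exhaustive enumeration stands in for the integer linear programming solver, which is only practical for a handful of layers.

```python
from itertools import product


def allocate_bits(sensitivity, sizes, budget_bits):
    """Pick one bit-width per layer, minimizing total sensitivity under a size budget.

    sensitivity[l][b]: illustrative sensitivity score of layer l at bit-width b.
    sizes[l]: number of parameters in layer l.
    budget_bits: maximum total model size in bits.
    Returns (assignment, cost), or (None, inf) if no assignment fits the budget.
    """
    layers = range(len(sizes))
    choices = [sorted(sensitivity[l]) for l in layers]
    best_cost, best_assign = float("inf"), None
    # Enumerate every combination of per-layer bit-widths (assumption: few layers).
    for assign in product(*choices):
        size = sum(sizes[l] * assign[l] for l in layers)
        if size > budget_bits:
            continue  # violates the model-size constraint
        cost = sum(sensitivity[l][assign[l]] for l in layers)
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign, best_cost


# Two layers: a small sensitive one and a large robust one (made-up scores).
scores = [{2: 0.9, 4: 0.3, 8: 0.1}, {2: 0.2, 4: 0.1, 8: 0.05}]
assignment, cost = allocate_bits(scores, sizes=[100, 1000], budget_bits=4400)
print(assignment, cost)
```

In this toy setting the small, sensitive layer keeps 8 bits while the large, robust layer drops to 2 bits, which is the kind of trade-off a sensitivity-weighted budget constraint is meant to produce.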
[LG-59] PA-RNet: Perturbation-Aware Reasoning Network for Multimodal Time Series Forecasting
Link: https://arxiv.org/abs/2508.04750
Authors: Chanjuan Liu(1),Shengzhi Wang(2),Enqiang Zhu(2) ((1) School of Computer Science and Technology, Dalian University of Technology, Dalian, China, (2) Institute of Computing Technology, Guangzhou University, Guangzhou, China)
Categories: Machine Learning (cs.LG)
Comments:
Abstract:In real-world applications, multimodal time series data often suffer from interference, especially in the textual modality. Existing methods for multimodal time series forecasting often neglect the inherent perturbations within textual data, where irrelevant, noisy, or ambiguous content can significantly degrade model performance, particularly when the noise exhibits varying intensity or stems from structural inconsistencies. To address this challenge, we propose PA-RNet (Perturbation-Aware Reasoning Network for Multimodal Time Series Forecasting), a robust multimodal forecasting framework. PA-RNet features a perturbation-aware projection module and a cross-modal attention mechanism to effectively separate noise from the textual embeddings while maintaining semantically meaningful representations, thereby enhancing the model’s generalization ability. Theoretically, we establish the Lipschitz continuity of PA-RNet with respect to textual inputs and prove that the proposed perturbation module can reduce expected prediction error, offering strong guarantees of stability under noisy conditions. Furthermore, we introduce a textual perturbation pipeline that can be seamlessly incorporated into existing multimodal time series forecasting tasks, allowing for systematic evaluation of the model’s robustness in the presence of varying levels of textual noise. Extensive experiments across diverse domains and temporal settings demonstrate that PA-RNet consistently outperforms state-of-the-art baselines.
[LG-60] AttriLens-Mol: Attribute Guided Reinforcement Learning for Molecular Property Prediction with Large Language Models
Link: https://arxiv.org/abs/2508.04748
Authors: Xuan Lin,Long Chen,Yile Wang
Categories: Machine Learning (cs.LG)
Comments: 9 pages
Abstract:Large Language Models (LLMs) have shown promise in assisting molecular property prediction tasks but often rely on human-crafted prompts and chain-of-thought templates. While recent advanced large reasoning models like DeepSeek-R1 employ reinforcement learning for an extended "thinking" process, their reasoning can be verbose and lack relevance. We introduce AttriLens-Mol, an attribute-guided reinforcement learning framework for molecular property prediction with LLMs. AttriLens-Mol steers the model’s reasoning by using: (1) a format reward encouraging attribute-based structured output, (2) a count reward to avoid enumerating irrelevant attributes, and (3) a rationality reward using advanced LLMs and RDKit to verify the relatedness of the generated attributes. This approach implicitly elicits the model’s inherent knowledge of relevant molecular attributes during reasoning, enabling more effective predictions of the molecular property. Experiments on both in-distribution and out-of-distribution datasets show that training both 7B-size R1-Distilled-Qwen2.5 and R1-Distilled-LLaMA3.1 models on 4,000 samples with our proposed AttriLens-Mol method significantly boosts performance, achieving results comparable to or better than supervised fine-tuning models (Mol-Instructions, ChemDFM, etc.) and advanced models (GPT-3.5, GPT-4o, DeepSeek-V3, DeepSeek-R1, etc.). Further, our extracted attributes for the target property, when used as features for an interpretable decision tree model, yield superior performance compared to attributes generated by prompting LLMs. This shows that AttriLens-Mol effectively elicits more relevant and predictive molecular attributes, leading to enhanced interpretability and performance for property prediction. We release the code at this https URL.
[LG-61] A Foundational Multi-Modal Model for Few-Shot Learning
Link: https://arxiv.org/abs/2508.04746
Authors: Pengtao Dang,Tingbo Guo,Sha Cao,Chi Zhang
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 5 figures
Abstract:Few-shot learning (FSL) is a machine learning paradigm that aims to generalize models from a small number of labeled examples, typically fewer than 10 per class. FSL is particularly crucial in biomedical, environmental, materials, and mechanical sciences, where samples are limited and data collection is often prohibitively costly, time-consuming, or ethically constrained. In this study, we present an innovative approach to FSL by demonstrating that a Large Multi-Modal Model (LMMM), trained on a set of independent tasks spanning diverse domains, task types, and input modalities, can substantially improve the generalization of FSL models, outperforming models based on conventional meta-learning on tasks of the same type. To support this, we first constructed a Multi-Modal Model Few-shot Dataset (M3FD, more than 10K few-shot samples), which includes 2D RGB images, 2D/3D medical scans, and tabular and time-course datasets, from which we manually curated FSL tasks such as classification. We further introduced M3F (Multi-Modal Model for Few-shot learning framework), a novel Large Multi-Modal Model framework tailored for data-constrained scientific applications. M3F supports a wide range of scientific data types through a modular pipeline. By fine-tuning the model on M3FD, M3F improves model performance, making LMMM feasible for real-world FSL deployment. The source code is located at this https URL. To democratize access to complex FSL data and promote reproducibility for public usage, M3FD is paired with a flexible and user-friendly tool that enables efficient querying, task-specific sampling, and preprocessing. Together, our dataset and framework offer a unified, scalable solution that significantly lowers the barrier to applying LMMMs in data-scarce scientific domains.
[LG-62] Edge-Assisted Collaborative Fine-Tuning for Multi-User Personalized Artificial Intelligence Generated Content (AIGC)
Link: https://arxiv.org/abs/2508.04745
Authors: Nan Li,Wanting Yang,Marie Siew,Zehui Xiong,Binbin Chen,Shiwen Mao,Kwok-Yan Lam
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Diffusion models (DMs) have emerged as powerful tools for high-quality content generation, yet their intensive computational requirements for inference pose challenges for resource-constrained edge devices. Cloud-based solutions aid in computation but often fall short in addressing privacy risks, personalization efficiency, and communication costs in multi-user edge-AIGC scenarios. To bridge this gap, we first analyze existing edge-AIGC applications in personalized content synthesis, revealing their limitations in efficiency and scalability. We then propose a novel cluster-aware hierarchical federated aggregation framework. Based on parameter-efficient local fine-tuning via Low-Rank Adaptation (LoRA), the framework first clusters clients based on the similarity of their uploaded task requirements, followed by an intra-cluster aggregation for enhanced personalization at the server-side. Subsequently, an inter-cluster knowledge interaction paradigm is implemented to enable hybrid-style content generation across diverse clusters. Building upon federated learning (FL) collaboration, our framework simultaneously trains personalized models for individual users at the devices and a shared global model enhanced with multiple LoRA adapters on the server, enabling efficient edge inference; meanwhile, all prompts for clustering and inference are encoded prior to transmission, thereby further mitigating the risk of plaintext leakage. Our evaluations demonstrate that the framework achieves accelerated convergence while maintaining practical viability for scalable multi-user personalized AIGC services under edge constraints.
[LG-63] MissMecha: An All-in-One Python Package for Studying Missing Data Mechanisms
Link: https://arxiv.org/abs/2508.04740
Authors: Youran Zhou,Mohamed Reda Bouadjenek,Sunil Aryal
Categories: Machine Learning (cs.LG); Mathematical Software (cs.MS)
Comments:
Abstract:Incomplete data is a persistent challenge in real-world datasets, often governed by complex and unobservable missing mechanisms. Simulating missingness has become a standard approach for understanding its impact on learning and analysis. However, existing tools are fragmented, mechanism-limited, and typically focus only on numerical variables, overlooking the heterogeneous nature of real-world tabular data. We present MissMecha, an open-source Python toolkit for simulating, visualizing, and evaluating missing data under MCAR, MAR, and MNAR assumptions. MissMecha supports both numerical and categorical features, enabling mechanism-aware studies across mixed-type tabular datasets. It includes visual diagnostics, MCAR testing utilities, and type-aware imputation evaluation metrics. Designed to support data quality research, benchmarking, and education, MissMecha offers a unified platform for researchers and practitioners working with incomplete data.
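To make the three missingness assumptions named in this abstract concrete, here is a minimal standard-library sketch. It does not use or imitate the MissMecha API; the function names, the threshold rule, and the rates are all illustrative assumptions.

```python
import random


def mcar_mask(n, rate, rng):
    """MCAR: each entry is missing with the same probability, independent of all data."""
    return [rng.random() < rate for _ in range(n)]


def mar_mask(observed, rate_low, rate_high, threshold, rng):
    """MAR: missingness of the target depends only on another, fully observed column."""
    return [rng.random() < (rate_high if x > threshold else rate_low) for x in observed]


def mnar_mask(target, rate_low, rate_high, threshold, rng):
    """MNAR: missingness depends on the value of the target column itself."""
    return [rng.random() < (rate_high if y > threshold else rate_low) for y in target]


def apply_mask(values, mask):
    """Replace masked entries with None."""
    return [None if m else v for v, m in zip(values, mask)]


rng = random.Random(0)
age = [23, 35, 47, 59, 61]       # fully observed column (made-up data)
income = [20, 40, 60, 80, 100]   # target column that receives missing values
print(apply_mask(income, mcar_mask(len(income), 0.3, rng)))
print(apply_mask(income, mar_mask(age, 0.1, 0.8, 50, rng)))
print(apply_mask(income, mnar_mask(income, 0.1, 0.8, 70, rng)))
```

The MAR and MNAR functions share the same form on purpose: the only difference is whether the driving column is observed (MAR) or is the column being masked (MNAR), which is why the two cannot be told apart from the incomplete data alone.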
[LG-64] LumiGen: An LVLM-Enhanced Iterative Framework for Fine-Grained Text-to-Image Generation
Link: https://arxiv.org/abs/2508.04732
Authors: Xiaoqi Dong,Xiangyu Zhou,Nicholas Evans,Yujia Lin
Categories: Machine Learning (cs.LG); Graphics (cs.GR)
Comments:
Abstract:Text-to-Image (T2I) generation has made significant advancements with diffusion models, yet challenges persist in handling complex instructions, ensuring fine-grained content control, and maintaining deep semantic consistency. Existing T2I models often struggle with tasks like accurate text rendering, precise pose generation, or intricate compositional coherence. Concurrently, Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in cross-modal understanding and instruction following. We propose LumiGen, a novel LVLM-enhanced iterative framework designed to elevate T2I model performance, particularly in areas requiring fine-grained control, through a closed-loop, LVLM-driven feedback mechanism. LumiGen comprises an Intelligent Prompt Parsing Augmentation (IPPA) module for proactive prompt enhancement and an Iterative Visual Feedback Refinement (IVFR) module, which acts as a “visual critic” to iteratively correct and optimize generated images. Evaluated on the challenging LongBench-T2I Benchmark, LumiGen achieves a superior average score of 3.08, outperforming state-of-the-art baselines. Notably, our framework demonstrates significant improvements in critical dimensions such as text rendering and pose expression, validating the effectiveness of LVLM integration for more controllable and higher-quality image generation.
[LG-65] NAEx: A Plug-and-Play Framework for Explaining Network Alignment
Link: https://arxiv.org/abs/2508.04731
Authors: Shruti Saxena,Arijit Khan,Joydeep Chandra
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments:
Abstract:Network alignment (NA) identifies corresponding nodes across multiple networks, with applications in domains like social networks, co-authorship, and biology. Despite advances in alignment models, their interpretability remains limited, making it difficult to understand alignment decisions and posing challenges in building trust, particularly in high-stakes domains. To address this, we introduce NAEx, a plug-and-play, model-agnostic framework that explains alignment models by identifying key subgraphs and features influencing predictions. NAEx addresses the key challenge of preserving the joint cross-network dependencies on alignment decisions by: (1) jointly parameterizing graph structures and feature spaces through learnable edge and feature masks, and (2) introducing an optimization objective that ensures explanations are both faithful to the original predictions and enable meaningful comparisons of structural and feature-based similarities between networks. NAEx is an inductive framework that efficiently generates NA explanations for previously unseen data. We introduce evaluation metrics tailored to alignment explainability and demonstrate NAEx’s effectiveness and efficiency on benchmark datasets by integrating it with four representative NA models.
[LG-66] Scaling Generative Recommendations with Context Parallelism on Hierarchical Sequential Transducers
Link: https://arxiv.org/abs/2508.04711
Authors: Yue Dong,Han Li,Shen Li,Nikhil Patel,Xing Liu,Xiaodong Wang,Chuanhao Zhuge
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Large-scale recommendation systems are pivotal for processing an immense volume of daily user interactions, requiring the effective modeling of high-cardinality and heterogeneous features to ensure accurate predictions. In prior work, we introduced Hierarchical Sequential Transducers (HSTU), an attention-based architecture for modeling high-cardinality, non-stationary streaming recommendation data, which exhibits good scaling-law behavior in the generative recommender (GR) framework. Recent studies and experiments demonstrate that attending to longer user history sequences yields significant metric improvements. However, scaling sequence length is activation-heavy, necessitating parallelism solutions to effectively shard activation memory. In transformer-based LLMs, context parallelism (CP) is a commonly used technique that distributes computation along the sequence-length dimension across multiple GPUs, effectively reducing memory usage from attention activations. In contrast, production ranking models typically utilize jagged input tensors to represent user interaction features, introducing unique CP implementation challenges. In this work, we introduce context parallelism with jagged tensor support for HSTU attention, establishing foundational capabilities for scaling up sequence dimensions. Our approach enables a 5.3x increase in supported user interaction sequence length, while achieving a 1.55x scaling factor when combined with Distributed Data Parallelism (DDP).
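The difficulty with jagged inputs mentioned in this abstract is that each user's sequence has a different length, so a batch cannot be cut at a single fixed position. The sketch below shows one simple way to shard a jagged batch along the sequence dimension. It is an illustration under stated assumptions, not the paper's implementation: the `(values, offsets)` layout and the function name `shard_jagged` are hypothetical, and a real context-parallel attention kernel would also need to exchange keys and values between ranks and balance the causal-attention workload.

```python
def shard_jagged(values, offsets, cp_size):
    """Split every variable-length sequence into cp_size contiguous chunks, one per rank.

    values: flat list holding all sequences back to back.
    offsets: list of length num_sequences + 1 marking sequence boundaries in values.
    Returns a list with one (values, offsets) jagged pair per rank.
    """
    shards = [([], [0]) for _ in range(cp_size)]
    for start, end in zip(offsets[:-1], offsets[1:]):
        base, extra = divmod(end - start, cp_size)
        pos = start
        for rank in range(cp_size):
            # The first `extra` ranks take one additional element of this sequence.
            chunk = base + (1 if rank < extra else 0)
            rank_values, rank_offsets = shards[rank]
            rank_values.extend(values[pos:pos + chunk])
            rank_offsets.append(len(rank_values))
            pos += chunk
    return shards


# Two user histories of lengths 3 and 4, split across 2 ranks.
for rank, (vals, offs) in enumerate(shard_jagged(list(range(7)), [0, 3, 7], cp_size=2)):
    print(rank, vals, offs)
```

Each rank ends up with its own smaller jagged batch, so per-rank activation memory shrinks roughly in proportion to the number of ranks.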
[LG-67] High-Order Error Bounds for Markovian LSA with Richardson-Romberg Extrapolation
Link: https://arxiv.org/abs/2508.05570
Authors: Ilya Levin,Alexey Naumov,Sergey Samsonov
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
Comments:
Abstract:In this paper, we study the bias and high-order error bounds of the Linear Stochastic Approximation (LSA) algorithm with Polyak-Ruppert (PR) averaging under Markovian noise. We focus on the version of the algorithm with constant step size $\alpha$ and propose a novel decomposition of the bias via a linearization technique. We analyze the structure of the bias and show that the leading-order term is linear in $\alpha$ and cannot be eliminated by PR averaging. To address this, we apply the Richardson-Romberg (RR) extrapolation procedure, which effectively cancels the leading bias term. We derive high-order moment bounds for the RR iterates and show that the leading error term aligns with the asymptotically optimal covariance matrix of the vanilla averaged LSA iterates.
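A short worked derivation shows why the extrapolation mentioned in this abstract removes a bias term that is linear in the step size. The notation is an assumption introduced here for illustration: $\bar\theta^{(\alpha)}$ denotes the PR-averaged iterate run with step size $\alpha$, $\theta^\star$ the target solution, and $\Delta$ the coefficient of the leading bias term. The expansion is the standard textbook form of Richardson-Romberg extrapolation with step sizes $\alpha$ and $2\alpha$; the paper's precise remainder terms are not reproduced.

```latex
% Assumed bias expansion of the averaged iterate (transient terms omitted):
\mathbb{E}\bigl[\bar\theta^{(\alpha)}\bigr] = \theta^\star + \alpha\,\Delta + O(\alpha^2),
\qquad
\mathbb{E}\bigl[\bar\theta^{(2\alpha)}\bigr] = \theta^\star + 2\alpha\,\Delta + O(\alpha^2).

% Richardson-Romberg estimator built from two runs with step sizes \alpha and 2\alpha:
\bar\theta^{\mathrm{RR}} = 2\,\bar\theta^{(\alpha)} - \bar\theta^{(2\alpha)}.

% The linear terms cancel:
\mathbb{E}\bigl[\bar\theta^{\mathrm{RR}}\bigr]
  = 2\bigl(\theta^\star + \alpha\,\Delta\bigr) - \bigl(\theta^\star + 2\alpha\,\Delta\bigr) + O(\alpha^2)
  = \theta^\star + O(\alpha^2).
```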
[LG-68] L1-Regularized Functional Support Vector Machine
Link: https://arxiv.org/abs/2508.05567
Authors: Bingfan Liu,Peijun Sang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Comments:
Abstract:In functional data analysis, binary classification with one functional covariate has been extensively studied. We aim to fill the gap by considering multivariate functional covariates in classification. In particular, we propose an $L_1$-regularized functional support vector machine for binary classification. An accompanying algorithm is developed to fit the classifier. By imposing an $L_1$ penalty, the algorithm enables us to identify relevant functional covariates of the binary response. Numerical results from simulations and one real-world application demonstrate that the proposed classifier enjoys good performance in both prediction and feature selection.
[LG-69] On the Design of Expressive and Trainable Pulse-based Quantum Machine Learning Models
Link: https://arxiv.org/abs/2508.05559
Authors: Han-Xiao Tao,Xin Wang,Re-Bing Wu
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 15 pages, 4 figures
Abstract:Pulse-based Quantum Machine Learning (QML) has emerged as a novel paradigm in quantum artificial intelligence due to its exceptional hardware efficiency. For practical applications, pulse-based models must be both expressive and trainable. Previous studies suggest that pulse-based models under dynamic symmetry can be effectively trained, thanks to a favorable loss landscape that has no barren plateaus. However, the resulting uncontrollability may compromise expressivity when the model is inadequately designed. This paper investigates the requirements for pulse-based QML models to be expressive while preserving trainability. We present a necessary condition pertaining to the system’s initial state, the measurement observable, and the underlying dynamical symmetry Lie algebra, supported by numerical simulations. Our findings establish a framework for designing practical pulse-based QML models that balance expressivity and trainability.
[LG-70] Exact and Heuristic Algorithms for Constrained Biclustering
Link: https://arxiv.org/abs/2508.05493
Authors: Antonio M. Sudoso
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:Biclustering, also known as co-clustering or two-way clustering, simultaneously partitions the rows and columns of a data matrix to reveal submatrices with coherent patterns. Incorporating background knowledge into clustering to enhance solution quality and interpretability has attracted growing interest in mathematical optimization and machine learning research. Extending this paradigm to biclustering enables prior information to guide the joint grouping of rows and columns. We study constrained biclustering with pairwise constraints, namely must-link and cannot-link constraints, which specify whether objects should belong to the same or different biclusters. As a model problem, we address the constrained version of the k-densest disjoint biclique problem, which aims to identify k disjoint complete bipartite subgraphs (called bicliques) in a weighted complete bipartite graph, maximizing the total density while satisfying pairwise constraints. We propose both exact and heuristic algorithms. The exact approach is a tailored branch-and-cut algorithm based on a low-dimensional semidefinite programming (SDP) relaxation, strengthened with valid inequalities and solved in a cutting-plane fashion. Exploiting integer programming tools, a rounding scheme converts SDP solutions into feasible biclusterings at each node. For large-scale instances, we introduce an efficient heuristic based on the low-rank factorization of the SDP. The resulting nonlinear optimization problem is tackled with an augmented Lagrangian method, where the subproblem is solved by decomposition through a block-coordinate projected gradient algorithm. Extensive experiments on synthetic and real-world datasets show that the exact method significantly outperforms general-purpose solvers, while the heuristic achieves high-quality solutions efficiently on large instances.
[LG-71] Harmonic fractal transformation for modeling complex neuronal effects: from bursting and noise shaping to waveform sensitivity and noise-induced subthreshold spiking
Link: https://arxiv.org/abs/2508.05341
Authors: Mariia Sorokina
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:We propose the first fractal frequency mapping, which, in a simple form, makes it possible to replicate complex neuronal effects. Unlike conventional filters, which suppress or amplify the input spectral components according to the filter weights, the transformation excites novel components by a fractal recomposition of the input spectra, resulting in the formation of spikes at resonant frequencies that are optimal for sampling. This enables high-sensitivity detection, robustness to noise, and noise-induced signal amplification. The proposed model illustrates that neuronal functionality can be viewed as a linear summation of the spectrum over a nonlinearly transformed frequency domain.
[LG-72] Salt-Rock Creep Deformation Forecasting Using Deep Neural Networks and Analytical Models for Subsurface Energy Storage Applications
Link: https://arxiv.org/abs/2508.05248
Authors: Pradeep Kumar Shukla,Tanujit Chakraborty,Mustafa Sari,Joel Sarout,Partha Pratim Mandal
Categories: Geophysics (physics.geo-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments:
Abstract:This study provides an in-depth analysis of time series forecasting methods to predict the time-dependent deformation trend (also known as creep) of salt rock under varying confining pressure conditions. Creep deformation assessment is essential for designing and operating underground storage facilities for nuclear waste, hydrogen energy, or radioactive materials. Salt rocks, known for their mechanical properties like low porosity, low permeability, high ductility, and exceptional creep and self-healing capacities, were examined using multi-stage triaxial (MSTL) creep data. After resampling, axial strain datasets were recorded at 5–10 second intervals under confining pressure levels ranging from 5 to 35 MPa over 5.8–21 days. Initial analyses, including Seasonal-Trend Decomposition (STL) and Granger causality tests, revealed minimal seasonality and causality between axial strain and temperature data. Further statistical tests, such as the Augmented Dickey-Fuller (ADF) test, confirmed the stationarity of the data with p-values less than 0.05, and wavelet coherence plot (WCP) analysis indicated repeating trends. A suite of deep neural network (DNN) models (Neural Basis Expansion Analysis for Time Series (N-BEATS), Temporal Convolutional Networks (TCN), Recurrent Neural Networks (RNN), and Transformers (TF)) was utilized and compared against statistical baseline models. Predictive performance was evaluated using Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Symmetric Mean Absolute Percentage Error (SMAPE). Results demonstrated that the N-BEATS and TCN models outperformed the other models across the various stress levels. DNN models, particularly N-BEATS and TCN, showed a 15–20% improvement in accuracy over traditional analytical models, effectively capturing complex temporal dependencies and patterns.
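For reference, the four error metrics named in this abstract can be computed as in the sketch below. The definitions are the common textbook ones; SMAPE in particular has several variants, and the form used here (absolute error divided by the mean of the absolute actual and predicted values) is an assumption, since the abstract does not say which variant the authors used. The sample values are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent; undefined if any true value is 0."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def smape(y_true, y_pred):
    """Symmetric MAPE in percent; a term counts as 0 when both values are 0."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = abs(t) + abs(p)
        if denom > 0:
            total += 2.0 * abs(t - p) / denom
    return 100.0 * total / len(y_true)


strain_true = [1.0, 2.0, 4.0]   # made-up axial strain values
strain_pred = [1.0, 1.0, 5.0]
for metric in (rmse, mae, mape, smape):
    print(metric.__name__, round(metric(strain_true, strain_pred), 4))
```

RMSE and MAE are in the units of the target, while MAPE and SMAPE are scale-free percentages, which is why studies that compare series recorded under different conditions often report all four.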
[LG-73] High-Dimensional Differentially Private Quantile Regression: Distributed Estimation and Statistical Inference
Link: https://arxiv.org/abs/2508.05212
Authors: Ziliang Shen,Caixing Wang,Shaoli Wang,Yibo Yan
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:With the development of big data and machine learning, privacy concerns have become increasingly critical, especially when handling heterogeneous datasets containing sensitive personal information. Differential privacy provides a rigorous framework for safeguarding individual privacy while enabling meaningful statistical analysis. In this paper, we propose a differentially private quantile regression method for high-dimensional data in a distributed setting. Quantile regression is a powerful and robust tool for modeling the relationships between the covariates and responses in the presence of outliers or heavy-tailed distributions. To address the computational challenges due to the non-smoothness of the quantile loss function, we introduce a Newton-type transformation that reformulates the quantile regression task into an ordinary least squares problem. Building on this, we develop a differentially private estimation algorithm with iterative updates, ensuring both near-optimal statistical accuracy and formal privacy guarantees. For inference, we further propose a differentially private debiased estimator, which enables valid confidence interval construction and hypothesis testing. Additionally, we propose a communication-efficient and differentially private bootstrap for simultaneous hypothesis testing in high-dimensional quantile regression, suitable for distributed settings with both small and abundant local data. Extensive simulations demonstrate the robustness and effectiveness of our methods in practical scenarios.
[LG-74] Hybrid quantum tensor networks for aeroelastic applications
Link: https://arxiv.org/abs/2508.05169
Authors: M. Lautaro Hickmann(1),Pedro Alves(1),David Quero(2),Friedhelm Schwenker(3),Hans-Martin Rieser(1) ((1) Institute for AI Safety and Security, German Aerospace Center (DLR), Ulm and St. Augustin, Germany, (2) Institute of Aeroelasticity, German Aerospace Center (DLR), Göttingen, Germany, (3) Institute of Neural Information Processing, University of Ulm, Ulm, Germany)
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 20 pages, 8 Figures, submitted to Quantum Machine Intelligence
Abstract:We investigate the application of hybrid quantum tensor networks to aeroelastic problems, harnessing the power of Quantum Machine Learning (QML). By combining tensor networks with variational quantum circuits, we demonstrate the potential of QML to tackle complex time series classification and regression tasks. Our results showcase the ability of hybrid quantum tensor networks to achieve high accuracy in binary classification. Furthermore, we observe promising performance in regressing discrete variables. While hyperparameter selection remains a challenge, requiring careful optimisation to unlock the full potential of these models, this work contributes significantly to the development of QML for solving intricate problems in aeroelasticity. We present an end-to-end trainable hybrid algorithm. We first encode time series into tensor networks to then utilise trainable tensor networks for dimensionality reduction, and convert the resulting tensor to a quantum circuit in the encoding step. Then, a tensor network inspired trainable variational quantum circuit is applied to solve either a classification or a multivariate or univariate regression task in the aeroelasticity domain.
[LG-75] Q-DPTS: Quantum Differentially Private Time Series Forecasting via Variational Quantum Circuits
Link: https://arxiv.org/abs/2508.05036
Authors: Chi-Sheng Chen,Samuel Yen-Chi Chen
Categories: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:Time series forecasting is vital in domains where data sensitivity is paramount, such as finance and energy systems. While Differential Privacy (DP) provides theoretical guarantees to protect individual data contributions, its integration, especially via DP-SGD, often impairs model performance due to injected noise. In this paper, we propose Q-DPTS, a hybrid quantum-classical framework for Quantum Differentially Private Time Series Forecasting. Q-DPTS combines Variational Quantum Circuits (VQCs) with per-sample gradient clipping and Gaussian noise injection, ensuring rigorous $(\epsilon, \delta)$-differential privacy. The expressiveness of quantum models enables improved robustness against the utility loss induced by DP mechanisms. We evaluate Q-DPTS on the ETT (Electricity Transformer Temperature) dataset, a standard benchmark for long-term time series forecasting. Our approach is compared against both classical and quantum baselines, including LSTM, QASA, QRWKV, and QLSTM. Results demonstrate that Q-DPTS consistently achieves lower prediction error under the same privacy budget, indicating a favorable privacy-utility trade-off. This work presents one of the first explorations into quantum-enhanced differentially private forecasting, offering promising directions for secure and accurate time series modeling in privacy-critical scenarios.
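The two DP-SGD ingredients this abstract names, per-sample gradient clipping and Gaussian noise injection, can be sketched in a few lines. This is a generic illustration and not the Q-DPTS code: gradients are plain Python lists rather than gradients of a variational quantum circuit, the names `clip` and `private_mean_gradient` are hypothetical, and converting the noise multiplier into an $(\epsilon, \delta)$ guarantee requires a separate privacy accountant that is not shown.

```python
import math
import random


def clip(grad, max_norm):
    """Rescale one per-sample gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each gradient, sum, add Gaussian noise, then average over the batch."""
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation is proportional to the clipping bound (the sensitivity).
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 0.5]]   # made-up per-sample gradients
print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single sample can move the update, and the added noise masks that bounded contribution; the utility loss the abstract refers to comes from both steps distorting the true mean gradient.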
[LG-76] Supervised Machine Learning Methods with Uncertainty Quantification for Exoplanet Atmospheric Retrievals from Transmission Spectroscopy
Link: https://arxiv.org/abs/2508.04982
Authors: Roy T. Forestano,Konstantin T. Matchev,Katia Matcheva,Eyup B. Unlu
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 51 pages, 26 figures, Submitted to AAS Journals
Abstract:Standard Bayesian retrievals for exoplanet atmospheric parameters from transmission spectroscopy, while well understood and widely used, are generally computationally expensive. In the era of the JWST and other upcoming observatories, machine learning approaches have emerged as viable alternatives that are both efficient and robust. In this paper we present a systematic study of several existing machine learning regression techniques and compare their performance for retrieving exoplanet atmospheric parameters from transmission spectra. We benchmark the different algorithms on accuracy, precision, and speed. The regression methods tested here include partial least squares (PLS), support vector machines (SVM), k nearest neighbors (KNN), decision trees (DT), random forests (RF), voting (VOTE), stacking (STACK), and extreme gradient boosting (XGB). We also investigate the impact of different preprocessing methods of the training data on the model performance. We quantify the model uncertainties across the entire dynamical range of planetary parameters. The best performing combination of ML model and preprocessing scheme is validated on a case study of the JWST observation of WASP-39b.
[LG-77] Anti-Jamming Sensing with Distributed Reconfigurable Intelligent Metasurface Antennas
Link: https://arxiv.org/abs/2508.04964
Authors: Zhaowei Wang, Yunsong Huang, Weicheng Liu, Hui-Ming Wang
Categories: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:
Abstract:The utilization of radio frequency (RF) signals for wireless sensing has garnered increasing attention. However, the radio environment is unpredictable and often unfavorable, and the sensing accuracy of traditional RF sensing methods is often degraded by adverse propagation channels from the transmitter to the receiver, such as fading and noise. In this paper, we propose employing distributed Reconfigurable Intelligent Metasurface Antennas (RIMSA) to detect the presence and location of objects, where multiple RIMSA receivers (RIMSA Rxs) are deployed at different places. By programming their beamforming patterns, RIMSA Rxs can enhance the quality of received signals. The RF sensing problem is modeled as a joint optimization problem over the beamforming pattern and the mapping of received signals to sensing outcomes. To address this challenge, we introduce a deep reinforcement learning (DRL) algorithm aimed at calculating the optimal beamforming patterns and a neural network aimed at converting received signals into sensing outcomes. In addition, a malicious attacker may launch a jamming attack to disrupt the sensing process. To enable effective sensing in interference-prone environments, we devise a combined loss function that takes into account the Signal to Interference plus Noise Ratio (SINR) of the received signals. The simulation results show that the proposed distributed RIMSA system achieves more efficient sensing performance and better overcomes environmental influences than a centralized implementation. Furthermore, the introduced method ensures high-accuracy sensing performance even under jamming attack.
[LG-78] The Cosine Schedule is Fisher-Rao-Optimal for Masked Discrete Diffusion Models
Link: https://arxiv.org/abs/2508.04884
Authors: Leo Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Preprint
Abstract:In this work, we study the problem of choosing the discretisation schedule for sampling from masked discrete diffusion models in terms of the information geometry of the induced probability path. Specifically, we show that the optimal schedule under the Fisher-Rao geometry recovers the popularly-used cosine schedule.
[LG-79] Can SGD Handle Heavy-Tailed Noise?
Link: https://arxiv.org/abs/2508.04860
Authors: Ilyas Fatkhullin, Florian Hübler, Guanghui Lan
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:Stochastic Gradient Descent (SGD) is a cornerstone of large-scale optimization, yet its theoretical behavior under heavy-tailed noise, common in modern machine learning and reinforcement learning, remains poorly understood. In this work, we rigorously investigate whether vanilla SGD, devoid of any adaptive modifications, can provably succeed under such adverse stochastic conditions. Assuming only that stochastic gradients have bounded $p$-th moments for some $p \in (1, 2]$, we establish sharp convergence guarantees for (projected) SGD across convex, strongly convex, and non-convex problem classes. In particular, we show that SGD achieves minimax optimal sample complexity under minimal assumptions in the convex and strongly convex regimes: $\mathcal{O}(\varepsilon^{-\frac{p}{p-1}})$ and $\mathcal{O}(\varepsilon^{-\frac{p}{2(p-1)}})$, respectively. For non-convex objectives under Hölder smoothness, we prove convergence to a stationary point with rate $\mathcal{O}(\varepsilon^{-\frac{2p}{p-1}})$, and complement this with a matching lower bound specific to SGD with arbitrary polynomial step-size schedules. Finally, we consider non-convex Mini-batch SGD under standard smoothness and bounded central moment assumptions, and show that it also achieves a comparable $\mathcal{O}(\varepsilon^{-\frac{2p}{p-1}})$ sample complexity with a potential improvement in the smoothness constant. These results challenge the prevailing view that heavy-tailed noise renders SGD ineffective, and establish vanilla SGD as a robust and theoretically principled baseline, even in regimes where the variance is unbounded.
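To make the exponents in the abstract above concrete, the three rates can be evaluated at two sample values of the moment parameter $p$. The choice of $p = 2$ (bounded variance) and $p = 3/2$ is only illustrative; the substitution is plain arithmetic on the stated rates (convex, strongly convex, non-convex, in that order).

```latex
\begin{align*}
p = 2:&\quad
\varepsilon^{-\frac{p}{p-1}} = \varepsilon^{-2}, \qquad
\varepsilon^{-\frac{p}{2(p-1)}} = \varepsilon^{-1}, \qquad
\varepsilon^{-\frac{2p}{p-1}} = \varepsilon^{-4}, \\
p = \tfrac{3}{2}:&\quad
\varepsilon^{-\frac{p}{p-1}} = \varepsilon^{-3}, \qquad
\varepsilon^{-\frac{p}{2(p-1)}} = \varepsilon^{-3/2}, \qquad
\varepsilon^{-\frac{2p}{p-1}} = \varepsilon^{-6}.
\end{align*}
```

All three exponents grow without bound as $p \to 1$, since each has $p - 1$ in the denominator, so heavier tails translate directly into a larger sample requirement.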
[LG-80] Keyword Spotting with Hyper-Matched Filters for Small Footprint Devices
Link: https://arxiv.org/abs/2508.04857
Authors: Yael Segal-Feldman, Ann R. Bradlow, Matthew Goldrick, Joseph Keshet
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: pre-print
Abstract:Open-vocabulary keyword spotting (KWS) refers to the task of detecting words or terms within speech recordings, regardless of whether they were included in the training data. This paper introduces an open-vocabulary keyword spotting model with state-of-the-art detection accuracy for small-footprint devices. The model is composed of a speech encoder, a target keyword encoder, and a detection network. The speech encoder is either a tiny Whisper or a tiny Conformer. The target keyword encoder is implemented as a hyper-network that takes the desired keyword as a character string and generates a unique set of weights for a convolutional layer, which can be considered as a keyword-specific matched filter. The detection network uses the matched-filter weights to perform a keyword-specific convolution, which guides the cross-attention mechanism of a Perceiver module in determining whether the target term appears in the recording. The results indicate that our system achieves state-of-the-art detection performance and generalizes effectively to out-of-domain conditions, including second-language (L2) speech. Notably, our smallest model, with just 4.2 million parameters, matches or outperforms models that are several times larger, demonstrating both efficiency and robustness.
[LG-81] Data Driven Insights into Composition Property Relationships in FCC High Entropy Alloys
Link: https://arxiv.org/abs/2508.04841
Authors: Nicolas Flores, Daniel Salas Mula, Wenle Xu, Sahu Bibhu, Daniel Lewis, Alexandra Eve Salinas, Samantha Mitra, Raj Mahat, Surya R. Kalidindi, Justin Wilkerson, James Paramore, Ankit Srivastiva, George Pharr, Douglas Allaire, Ibrahim Karaman, Brady Butler, Vahid Attari, Raymundo Arroyave
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:
Abstract:Structural High Entropy Alloys (HEAs) are crucial in advancing technology across various sectors, including aerospace, automotive, and defense industries. However, the scarcity of integrated chemistry, process, structure, and property data presents significant challenges for predictive property modeling. Given the vast design space of these alloys, uncovering the underlying patterns is essential yet difficult, requiring advanced methods capable of learning from limited and heterogeneous datasets. This work presents several sensitivity analyses, highlighting key elemental contributions to mechanical behavior, including insights into the compositional factors associated with brittle and fractured responses observed during nanoindentation testing in the BIRDSHOT center NiCoFeCrVMnCuAl system dataset. Several encoder-decoder-based chemistry-property models, carefully tuned through Bayesian multi-objective hyperparameter optimization, are evaluated for mapping alloy composition to six mechanical properties. The models achieve competitive or superior performance to conventional regressors across all properties, particularly for yield strength and the UTS/YS ratio, demonstrating their effectiveness in capturing complex composition-property relationships.
[LG-82] Differentially Private Model-X Knockoffs via Johnson-Lindenstrauss Transform
Link: https://arxiv.org/abs/2508.04800
Authors: Yuxuan Tao, Adel Javanmard
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 68 pages, 6 figures
Abstract:We introduce a novel privatization framework for high-dimensional controlled variable selection. Our framework enables rigorous False Discovery Rate (FDR) control under differential privacy constraints. While the Model-X knockoff procedure provides FDR guarantees by constructing provably exchangeable "negative control" features, existing privacy mechanisms like Laplace or Gaussian noise injection disrupt its core exchangeability conditions. Our key innovation lies in privatizing the data knockoff matrix through the Gaussian Johnson-Lindenstrauss Transformation (JLT), a dimension reduction technique that simultaneously preserves covariate relationships through approximate isometry for $(\epsilon,\delta)$-differential privacy. We theoretically characterize both FDR and the power of the proposed private variable selection procedure, in an asymptotic regime. Our theoretical analysis characterizes the role of different factors, such as the JLT’s dimension reduction ratio, signal-to-noise ratio, differential privacy parameters, sample size and feature dimension, in shaping the privacy-power trade-off. Our analysis is based on a novel 'debiasing technique' for high-dimensional private knockoff procedure. We further establish sufficient conditions under which the power of the proposed procedure converges to one. This work bridges two critical paradigms, knockoff-based FDR control and private data release, enabling reliable variable selection in sensitive domains. Our analysis demonstrates that structural privacy preservation through random projections outperforms the classical noise addition mechanism, maintaining statistical power even under strict privacy budgets.
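The approximate isometry that the abstract above relies on can be illustrated with a plain Gaussian Johnson-Lindenstrauss sketch. The code below is a generic random projection over the sample dimension, written under assumptions of my own (the helper names, the 50 x 2 toy matrix, and the sketch size `k = 400` are illustrative). It is not the paper's privatization procedure and does not by itself certify any $(\epsilon,\delta)$ guarantee.

```python
import random


def gaussian_jl_sketch(matrix, k, rng):
    """Left-multiply an n x d matrix by a k x n matrix with N(0, 1/k) entries."""
    n = len(matrix)
    d = len(matrix[0])
    scale = (1.0 / k) ** 0.5  # standard deviation of each projection entry
    sketch = []
    for _ in range(k):
        row = [rng.gauss(0.0, scale) for _ in range(n)]
        sketch.append([sum(row[i] * matrix[i][j] for i in range(n)) for j in range(d)])
    return sketch


def column_sq_norm(matrix, j):
    """Squared L2 norm of column j."""
    return sum(r[j] ** 2 for r in matrix)


rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(50)]
sketch = gaussian_jl_sketch(data, 400, rng)
# In expectation the sketch preserves squared column norms, so this ratio is near 1.
ratio = column_sq_norm(sketch, 0) / column_sq_norm(data, 0)
```

Because each projected entry is a Gaussian mixture of all rows, no individual row appears verbatim in the sketch, while column norms and inner products are preserved on average. That is the property a knockoff-style analysis would need to survive the transformation.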
[LG-83] Embedding Is (Almost) All You Need: Retrieval-Augmented Inference for Generalizable Genomic Prediction Tasks
Link: https://arxiv.org/abs/2508.04757
Authors: Nirjhor Datta, Swakkhar Shatabda, M Sohel Rahman
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)
Comments:
Abstract:Large pre-trained DNA language models such as DNABERT-2, Nucleotide Transformer, and HyenaDNA have demonstrated strong performance on various genomic benchmarks. However, most applications rely on expensive fine-tuning, which works best when the training and test data share a similar distribution. In this work, we investigate whether task-specific fine-tuning is always necessary. We show that simple embedding-based pipelines that extract fixed representations from these models and feed them into lightweight classifiers can achieve competitive performance. In evaluation settings with different data distributions, embedding-based methods often outperform fine-tuning while reducing inference time by 10x to 20x. Our results suggest that embedding extraction is not only a strong baseline but also a more generalizable and efficient alternative to fine-tuning, especially for deployment in diverse or unseen genomic contexts. For example, in enhancer classification, HyenaDNA embeddings combined with zCurve achieve 0.68 accuracy (vs. 0.58 for fine-tuning), with an 88% reduction in inference time and over 8x lower carbon emissions (0.02 kg vs. 0.17 kg CO2). In non-TATA promoter classification, DNABERT-2 embeddings with zCurve or GC content reach 0.85 accuracy (vs. 0.89 with fine-tuning) with a 22x lower carbon footprint (0.02 kg vs. 0.44 kg CO2). These results show that embedding-based pipelines offer over 10x better carbon efficiency while maintaining strong predictive performance. The code is available here: this https URL.
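One of the handcrafted features mentioned in the abstract above, GC content, is simple enough to sketch directly. The snippet below computes it and appends it to a fixed embedding vector before a lightweight classifier would be applied. The function names and the three-number embedding are illustrative placeholders, not output from DNABERT-2 or HyenaDNA, and the zCurve feature is not shown.

```python
def gc_content(sequence):
    """Fraction of G and C bases in a DNA sequence (case-insensitive)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def augment_embedding(embedding, sequence):
    """Append the GC-content scalar to a fixed embedding vector."""
    return list(embedding) + [gc_content(sequence)]


# Hypothetical 3-dimensional embedding standing in for a frozen model's output.
features = augment_embedding([0.12, -0.48, 0.91], "ATGCGC")
```

The resulting feature vector could then be fed to any small classifier, which is the sense in which the pipeline avoids fine-tuning the underlying language model.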
[LG-84] GRIT: Graph-Regularized Logit Refinement for Zero-shot Cell Type Annotation
Link: https://arxiv.org/abs/2508.04747
Authors: Tianxiang Hu, Chenyi Zhou, Jiaxiang Liu, Jiongxin Wang, Ruizhe Chen, Haoxiang Xia, Gaoang Wang, Jian Wu, Zuozhu Liu
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)
Comments:
Abstract:Cell type annotation is a fundamental step in the analysis of single-cell RNA sequencing (scRNA-seq) data. In practice, human experts often rely on the structure revealed by principal component analysis (PCA) followed by k-nearest neighbor (k-NN) graph construction to guide annotation. While effective, this process is labor-intensive and does not scale to large datasets. Recent advances in CLIP-style models offer a promising path toward automating cell type annotation. By aligning scRNA-seq profiles with natural language descriptions, models like LangCell enable zero-shot annotation. While LangCell demonstrates decent zero-shot performance, its predictions remain suboptimal, particularly in achieving consistent accuracy across all cell types. In this paper, we propose to refine the zero-shot logits produced by LangCell through a graph-regularized optimization framework. By enforcing local consistency over the task-specific PCA-based k-NN graph, our method combines the scalability of the pre-trained models with the structural robustness relied upon in expert annotation. We evaluate our approach on 14 annotated human scRNA-seq datasets from 4 distinct studies, spanning 11 organs and over 200,000 single cells. Our method consistently improves zero-shot annotation accuracy, achieving accuracy gains of up to 10%. Further analysis showcases the mechanism by which GRIT effectively propagates correct signals through the graph, pulling back mislabeled cells toward more accurate predictions. The method is training-free, model-agnostic, and serves as a simple yet effective plug-in for enhancing automated cell type annotation in practice.
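The idea of refining logits by enforcing local consistency over a neighbor graph can be sketched with a generic quadratic smoothing objective. The code below is an assumption on my part rather than GRIT's exact formulation: it minimizes the sum of squared deviations from the initial logits plus `lam` times the squared differences across graph edges, using Jacobi iterations. The graph is a hand-written toy neighbor list, not one built from PCA.

```python
def refine_logits(logits, neighbors, lam=1.0, iters=100):
    """Jacobi iterations for
    min_Z  sum_i ||z_i - z0_i||^2 + lam * sum_{(i,j) in edges} ||z_i - z_j||^2,
    where neighbors[i] lists the graph neighbors of cell i (unit edge weights)."""
    n_classes = len(logits[0])
    z = [row[:] for row in logits]
    for _ in range(iters):
        new_z = []
        for i, nbrs in enumerate(neighbors):
            denom = 1.0 + lam * len(nbrs)
            new_z.append([
                (logits[i][c] + lam * sum(z[j][c] for j in nbrs)) / denom
                for c in range(n_classes)
            ])
        z = new_z
    return z


def argmax(values):
    return max(range(len(values)), key=lambda c: values[c])


# Three mutually connected cells; cell 2 starts with a weak vote for class 1.
initial = [[2.0, 0.0], [2.0, 0.0], [0.0, 0.5]]
graph = [[1, 2], [0, 2], [0, 1]]
refined = refine_logits(initial, graph, lam=1.0)
labels_before = [argmax(row) for row in initial]
labels_after = [argmax(row) for row in refined]
```

In this toy case the weakly mislabeled cell is pulled toward the label of its two confident neighbors, which mirrors the "propagating correct signals through the graph" behavior the abstract describes. The update needs no training, only the initial logits and the graph.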
[LG-85] Alz-QNet: A Quantum Regression Network for Studying Alzheimer's Gene Interactions
Link: https://arxiv.org/abs/2508.04743
Authors: Debanjan Konar, Neerav Sreekumar, Richard Jiang, Vaneet Aggarwal
Categories: Molecular Networks (q-bio.MN); Machine Learning (cs.LG); Genomics (q-bio.GN); Quantum Physics (quant-ph)
Comments:
Abstract:Understanding the molecular-level mechanisms underpinning Alzheimer’s disease (AD) by studying crucial genes associated with the disease remains a challenge. Alzheimer’s, being a multifactorial disease, requires understanding the gene-gene interactions underlying it for theranostics and progress. In this article, a novel attempt has been made using quantum regression to decode how some crucial genes in AD, such as Amyloid Beta Precursor Protein (APP), Sterol regulatory element binding transcription factor 14 (FGF14), Yin Yang 1 (YY1), and Phospholipase D Family Member 3 (PLD3), are influenced by other prominent switching genes during disease progression, which may help in gene expression-based therapy for AD. Our proposed Quantum Regression Network (Alz-QNet) introduces a pioneering approach with insights from the state-of-the-art Quantum Gene Regulatory Networks (QGRN) to unravel the gene interactions involved in AD pathology, particularly within the Entorhinal Cortex (EC), where early pathological changes occur. Using the proposed Alz-QNet framework, we explore the interactions between key genes (APP, FGF14, YY1, EGR1, GAS7, AKT3, SREBF2, and PLD3) within the EC microenvironment of AD patients, studying genetic samples from the database GSE138852, all of which are believed to play a crucial role in the progression of AD. Our investigation uncovers intricate gene-gene interactions, shedding light on the potential regulatory mechanisms that underlie the pathogenesis of AD, which help us to find potential gene inhibitors or regulators for theranostics.
[LG-86] Discovery of Disease Relationships via Transcriptomic Signature Analysis Powered by Agentic AI
Link: https://arxiv.org/abs/2508.04742
Authors: Ke Chen, Haohan Wang
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)
Comments:
Abstract:Modern disease classification often overlooks molecular commonalities hidden beneath divergent clinical presentations. This study introduces a transcriptomics-driven framework for discovering disease relationships by analyzing over 1300 disease-condition pairs using GenoMAS, a fully automated agentic AI system. Beyond identifying robust gene-level overlaps, we develop a novel pathway-based similarity framework that integrates multi-database enrichment analysis to quantify functional convergence across diseases. The resulting disease similarity network reveals both known comorbidities and previously undocumented cross-category links. By examining shared biological pathways, we explore potential molecular mechanisms underlying these connections, offering functional hypotheses that go beyond symptom-based taxonomies. We further show how background conditions such as obesity and hypertension modulate transcriptomic similarity, and identify therapeutic repurposing opportunities for rare diseases like autism spectrum disorder based on their molecular proximity to better-characterized conditions. In addition, this work demonstrates how biologically grounded agentic AI can scale transcriptomic analysis while enabling mechanistic interpretation across complex disease landscapes. All results are publicly accessible at this http URL.
[LG-87] CodonMoE: DNA Language Models for mRNA Analyses
Link: https://arxiv.org/abs/2508.04739
Authors: Shiyi Du, Litian Liang, Jiayi Li, Carl Kingsford
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)
Comments:
Abstract:Genomic language models (gLMs) face a fundamental efficiency challenge: either maintain separate specialized models for each biological modality (DNA and RNA) or develop large multi-modal architectures. Both approaches impose significant computational burdens - modality-specific models require redundant infrastructure despite inherent biological connections, while multi-modal architectures demand massive parameter counts and extensive cross-modality pretraining. To address this limitation, we introduce CodonMoE (Adaptive Mixture of Codon Reformative Experts), a lightweight adapter that transforms DNA language models into effective RNA analyzers without RNA-specific pretraining. Our theoretical analysis establishes CodonMoE as a universal approximator at the codon level, capable of mapping arbitrary functions from codon sequences to RNA properties given sufficient expert capacity. Across four RNA prediction tasks spanning stability, expression, and regulation, DNA models augmented with CodonMoE significantly outperform their unmodified counterparts, with HyenaDNA+CodonMoE series achieving state-of-the-art results using 80% fewer parameters than specialized RNA models. By maintaining sub-quadratic complexity while achieving superior performance, our approach provides a principled path toward unifying genomic language modeling, leveraging more abundant DNA data and reducing computational overhead while preserving modality-specific performance advantages.
[LG-88] Understanding protein function with a multimodal retrieval-augmented foundation model
Link: https://arxiv.org/abs/2508.04724
Authors: Timothy Fei Truong Jr, Tristan Bepler
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments:
Abstract:Protein language models (PLMs) learn probability distributions over natural protein sequences. By learning from hundreds of millions of natural protein sequences, protein understanding and design capabilities emerge. Recent works have shown that scaling these models improves structure prediction, but does not seem to improve mutation understanding and representation quality for protein function prediction. We introduce PoET-2, a multimodal, retrieval-augmented protein foundation model that incorporates in-context learning of family-specific evolutionary constraints with optional structure conditioning to learn generative distributions over protein sequences. PoET-2 uses a hierarchical transformer encoder that is equivariant to sequence context ordering and a dual decoder architecture with both causal and masked language modeling objectives, allowing PoET-2 to operate in both fully generative and bidirectional representation learning modes. PoET-2 achieves state-of-the-art performance on zero-shot variant effect prediction, excelling at scoring variants with multiple mutations and challenging indel mutations. In supervised settings, PoET-2 embeddings outperform previous methods for learning sequence-function relationships, especially with small datasets. This work highlights the benefits of combining retrieval augmentation with multimodal, family-centric modeling for advancing protein foundation models.
[LG-89] From Rattle to Roar: Optimizer Showdown for MambaStock on S&P 500
Link: https://arxiv.org/abs/2508.04707
Authors: Alena Chan, Maria Garmonina
Categories: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
Comments:
Abstract:We evaluate the performance of several optimizers on the task of forecasting S&P 500 Index returns with the MambaStock model. Among the most widely used algorithms, gradient-smoothing and adaptive-rate optimizers (for example, Adam and RMSProp) yield the lowest test errors. In contrast, the Lion optimizer offers notably faster training. To combine these advantages, we introduce a novel family of optimizers, Roaree, that dampens the oscillatory loss behavior often seen with Lion while preserving its training speed.
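The oscillatory behavior attributed to Lion in the abstract above follows from its sign-based update, which moves every parameter by the same magnitude regardless of gradient size. Below is a minimal scalar sketch of the published Lion update rule, applied to the toy objective f(x) = x^2. The toy objective and the large learning rate are assumptions chosen to make the oscillation visible; Roaree itself is not shown, since the abstract does not describe its update.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for a scalar parameter."""
    # The update direction is the sign of an interpolation of momentum and gradient.
    update = sign(beta1 * momentum + (1.0 - beta1) * grad)
    new_param = param - lr * (update + weight_decay * param)
    # The momentum buffer is updated afterwards with a different coefficient.
    new_momentum = beta2 * momentum + (1.0 - beta2) * grad
    return new_param, new_momentum


x, m = 1.0, 0.0
trajectory = [x]
for _ in range(6):
    x, m = lion_step(x, 2.0 * x, m, lr=0.3)  # gradient of x^2 is 2x
    trajectory.append(x)
```

With no weight decay, every step changes the parameter by exactly `lr`, so once the iterate is within `lr` of the minimum it overshoots and bounces around it rather than settling. That is the kind of behavior a damping scheme would target.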
[LG-90] Telegrapher's Generative Model via Kac Flows
Link: https://arxiv.org/abs/2506.20641
Authors: Richard Duong, Jannis Chemseddine, Peter K. Friz, Gabriele Steidl
Categories: Analysis of PDEs (math.AP); Machine Learning (cs.LG); Probability (math.PR)
Comments: Update V2: We added CIFAR experiments. Correction V3: The old FID scores CIFAR images of the Kac model corresponded to the schedule g(t) = t. We now updated them with both schedules t and t^2
Abstract:We break the mold in flow-based generative modeling by proposing a new model based on the damped wave equation, also known as telegrapher’s equation. Similar to the diffusion equation and Brownian motion, there is a Feynman-Kac type relation between the telegrapher’s equation and the stochastic Kac process in 1D. The Kac flow evolves stepwise linearly in time, so that the probability flow is Lipschitz continuous in the Wasserstein distance and, in contrast to diffusion flows, the norm of the velocity is globally bounded. Furthermore, the Kac model has the diffusion model as its asymptotic limit. We extend these considerations to a multi-dimensional stochastic process which consists of independent 1D Kac processes in each spatial component. We show that this process gives rise to an absolutely continuous curve in the Wasserstein space and compute the conditional velocity field starting in a Dirac point analytically. Using the framework of flow matching, we train a neural network that approximates the velocity field and use it for sample generation. Our numerical experiments demonstrate the scalability of our approach, and show its advantages over diffusion models.
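The 1D Kac process mentioned in the abstract above is commonly described as a particle moving at constant speed whose direction flips at the jump times of a Poisson process. The sketch below simulates one endpoint under that textbook description. The parameter names `speed` and `flip_rate` are my own, and the flow-matching training step from the paper is not included.

```python
import random


def simulate_kac(t_end, speed, flip_rate, rng, x0=0.0):
    """Sample the position at time t_end of a 1D Kac (telegraph) process."""
    direction = rng.choice((-1.0, 1.0))  # random initial direction
    x, t = x0, 0.0
    while True:
        hold = rng.expovariate(flip_rate)  # time until the next direction flip
        if t + hold >= t_end:
            return x + direction * speed * (t_end - t)
        x += direction * speed * hold
        t += hold
        direction = -direction


rng = random.Random(1)
samples = [simulate_kac(t_end=2.0, speed=1.5, flip_rate=3.0, rng=rng) for _ in range(1000)]
max_displacement = max(abs(s) for s in samples)
```

Since the particle never moves faster than `speed`, every sample lies within `speed * t_end` of the starting point. This is a simple way to see the bounded velocity that the abstract contrasts with diffusion flows.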
Information Retrieval
[IR-0] RankArena: A Unified Platform for Evaluating Retrieval Reranking and RAG with Human and LLM Feedback CIKM2025
Link: https://arxiv.org/abs/2508.05512
Authors: Abdelrahman Abdallah, Mahmoud Abdalla, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, Adam Jatowt
Categories: Information Retrieval (cs.IR)
Comments: Accepted at CIKM 2025
Abstract:Evaluating the quality of retrieval-augmented generation (RAG) and document reranking systems remains challenging due to the lack of scalable, user-centric, and multi-perspective evaluation tools. We introduce RankArena, a unified platform for comparing and analysing the performance of retrieval pipelines, rerankers, and RAG systems using structured human and LLM-based feedback as well as for collecting such feedback. RankArena supports multiple evaluation modes: direct reranking visualisation, blind pairwise comparisons with human or LLM voting, supervised manual document annotation, and end-to-end RAG answer quality assessment. It captures fine-grained relevance feedback through both pairwise preferences and full-list annotations, along with auxiliary metadata such as movement metrics, annotation time, and quality ratings. The platform also integrates LLM-as-a-judge evaluation, enabling comparison between model-generated rankings and human ground truth annotations. All interactions are stored as structured evaluation datasets that can be used to train rerankers, reward models, judgment agents, or retrieval strategy selectors. Our platform is publicly available at this https URL, and the Demo video is provided this https URL.
[IR-1] Does Multimodality Improve Recommender Systems as Expected? A Critical Analysis and Future Directions
Link: https://arxiv.org/abs/2508.05377
Authors: Hongyu Zhou, Yinan Zhang, Aixin Sun, Zhiqi Shen
Categories: Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments:
Abstract:Multimodal recommendation systems are increasingly popular for their potential to improve performance by integrating diverse data types. However, the actual benefits of this integration remain unclear, raising questions about when and how it truly enhances recommendations. In this paper, we propose a structured evaluation framework to systematically assess multimodal recommendations across four dimensions: Comparative Efficiency, Recommendation Tasks, Recommendation Stages, and Multimodal Data Integration. We benchmark a set of reproducible multimodal models against strong traditional baselines and evaluate their performance on different platforms. Our findings show that multimodal data is particularly beneficial in sparse interaction scenarios and during the recall stage of recommendation pipelines. We also observe that the importance of each modality is task-specific, where text features are more useful in e-commerce and visual features are more effective in short-video recommendations. Additionally, we explore different integration strategies and model sizes, finding that Ensemble-Based Learning outperforms Fusion-Based Learning, and that larger models do not necessarily deliver better results. To deepen our understanding, we include case studies and review findings from other recommendation domains. Our work provides practical insights for building efficient and effective multimodal recommendation systems, emphasizing the need for thoughtful modality selection, integration strategies, and model design.
[IR-2] Difference Views for Visual Graph Query Building
Link: https://arxiv.org/abs/2508.05314
Authors: Benedikt Kantz, Stefan Lengauer, Peter Waldert, Tobias Schreck
Categories: Information Retrieval (cs.IR)
Comments: 5 pages, 6 figures, preparing for submission to Semantic Web Conferences
Abstract:Knowledge Graphs (KGs) contain vast amounts of linked resources that encode knowledge in various domains, which can be queried and searched using specialized languages like SPARQL, a query language developed to query KGs. Existing visual query builders enable non-expert users to construct SPARQL queries and utilize the knowledge contained in these graphs. Query building is, however, an iterative and often visual process in which the user's question can change throughout, especially for explorative search. Our visual querying interface communicates these changes between iterative steps in the query building process using graph differences to contrast the changes and the evolution in the graph query. We also enable users to formulate their evolving information needs using a natural language interface directly integrated into the difference query view. Furthermore, we communicate the change in results in the result view by contrasting the differences in both result distribution and individual instances of the prototype graph, and we demonstrate the system’s applicability through case studies on different ontologies and usage scenarios, illustrating how our system fosters both data exploration and analysis of domain-specific graphs.
[IR-3] FIRE: Faithful Interpretable Recommendation Explanations
Link: https://arxiv.org/abs/2508.05225
Authors: S.M.F. Sani, Asal Meskin, Mohammad Amanlou, Hamid R. Rabiee
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Natural language explanations in recommender systems are often framed as a review generation task, leveraging user reviews as ground-truth supervision. While convenient, this approach conflates a user’s opinion with the system’s reasoning, leading to explanations that may be fluent but fail to reflect the true logic behind recommendations. In this work, we revisit the core objective of explainable recommendation: to transparently communicate why an item is recommended by linking user needs to relevant item features. Through a comprehensive analysis of existing methods across multiple benchmark datasets, we identify common limitations: explanations that are weakly aligned with model predictions, vague or inaccurate in identifying user intents, and overly repetitive or generic. To overcome these challenges, we propose FIRE, a lightweight and interpretable framework that combines SHAP-based feature attribution with structured, prompt-driven language generation. FIRE produces faithful, diverse, and user-aligned explanations, grounded in the actual decision-making process of the model. Our results demonstrate that FIRE not only achieves competitive recommendation accuracy but also significantly improves explanation quality along critical dimensions such as alignment, structure, and faithfulness. This work highlights the need to move beyond the review-as-explanation paradigm and toward explanation methods that are both accountable and interpretable.
[IR-4] Community-Aware Social Community Recommendation CIKM2025
Link: https://arxiv.org/abs/2508.05107
Authors: Runhao Jiang, Renchi Yang, Wenqing Lin
Categories: Social and Information Networks (cs.SI); Information Retrieval (cs.IR)
Comments: This is the technical report of the paper “Community-Aware Social Community Recommendation” accepted by CIKM 2025
Abstract:Social recommendation, which seeks to leverage social ties among users to alleviate the sparsity issue of user-item interactions, has emerged as a popular technique for elevating personalized services in recommender systems. Despite being effective, existing social recommendation models are mainly devised for recommending regular items such as blogs, images, and products, and largely fail for community recommendations due to overlooking the unique characteristics of communities. Distinctly, communities are constituted by individuals, who present high dynamicity and relate to rich structural patterns in social networks. To our knowledge, limited research has been devoted to comprehensively exploiting this information for recommending communities. To bridge this gap, this paper presents CASO, a novel and effective model specially designed for social community recommendation. Under the hood, CASO harnesses three carefully-crafted encoders for user embedding, wherein two of them extract community-related global and local structures from the social network via social modularity maximization and social closeness aggregation, while the third one captures user preferences using collaborative filtering with observed user-community affiliations. To further eliminate feature redundancy therein, we introduce a mutual exclusion between social and collaborative signals. Finally, CASO includes a community detection loss in the model optimization, thereby producing community-aware embeddings for communities. Our extensive experiments evaluating CASO against nine strong baselines on six real-world social networks demonstrate its consistent and remarkable superiority over the state of the art in terms of community recommendation performance. 
[IR-5] An End-to-End Multi-objective Ensemble Ranking Framework for Video Recommendation
Link: https://arxiv.org/abs/2508.05093
Authors: Tiantian He, Minzhi Xie, Runtong Li, Xiaoxiao Xu, Jiaqi Yu, Zixiu Wang, Lantao Hu, Han Li, Kun Gai
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract: We propose a novel End-to-end Multi-objective Ensemble Ranking framework (EMER) for the multi-objective ensemble ranking module, which is the most critical component of the short video recommendation system. EMER enhances personalization by replacing manually-designed heuristic formulas with an end-to-end modeling paradigm. EMER introduces a meticulously designed loss function to address the fundamental challenge of defining effective supervision for ensemble ranking, where no single ground-truth signal can fully capture user satisfaction. Moreover, EMER introduces a novel sample organization method and a transformer-based network architecture to capture the comparative relationships among candidates, which are critical for effective ranking. Additionally, we propose an offline-online consistent evaluation system to enhance the efficiency of offline model optimization, which is an established yet persistent challenge within the multi-objective ranking domain in industry. Extensive empirical tests are conducted on a real industrial dataset, and the results demonstrate the effectiveness of our proposed framework. In addition, our framework has been deployed in the primary scenarios of Kuaishou, a short video recommendation platform with hundreds of millions of daily active users, achieving a 1.39% increase in overall App Stay Time and a 0.196% increase in 7-day user Lifetime (LT7), which are substantial improvements.
[IR-6] Data-Aware Socratic Query Refinement in Database Systems
Link: https://arxiv.org/abs/2508.05061
Authors: Ruiyuan Zhang, Chrysanthi Kosyfaki, Xiaofang Zhou
Subjects: Databases (cs.DB); Information Retrieval (cs.IR)
Comments:
Abstract: In this paper, we propose Data-Aware Socratic Guidance (DASG), a dialogue-based query enhancement framework that embeds interactive clarification as a first-class operator within database systems to resolve ambiguity in natural language queries. DASG treats dialogue as an optimization decision, asking clarifying questions only when the expected execution cost reduction exceeds the interaction overhead. The system quantifies ambiguity through linguistic fuzziness, schema grounding confidence, and projected costs across relational and vector backends. Our algorithm selects the optimal clarifications by combining semantic relevance, catalog-based information gain, and potential cost reduction. We evaluate our proposed framework on three datasets. The results show that DASG demonstrates improved query precision while maintaining efficiency, establishing a cooperative analytics paradigm where systems actively participate in query formulation rather than passively translating user requests.
[IR-7] Augmented Question-guided Retrieval (AQgR) of Indian Case Law with LLM RAG and Structured Summaries
Link: https://arxiv.org/abs/2508.04710
Authors: Vishnuprabha V, Daleesha M Viswanathan, Rajesh R, Aneesh V Pillai
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Identifying relevant legal precedents remains challenging, as most retrieval methods emphasize factual similarity over legal issues, and current systems often lack explanations clarifying case relevance. This paper proposes the use of Large Language Models (LLMs) to address this gap by facilitating the retrieval of relevant cases, generating explanations to elucidate relevance, and identifying core legal issues all autonomously, without requiring legal expertise. Our approach combines Retrieval Augmented Generation (RAG) with structured summaries optimized for Indian case law. Leveraging the Augmented Question-guided Retrieval (AQgR) framework, the system generates targeted legal questions based on factual scenarios to identify relevant case law more effectively. The structured summaries were assessed manually by legal experts, given the absence of a suitable structured summary dataset. Case law retrieval was evaluated using the FIRE dataset, and explanations were reviewed by legal experts, as explanation generation alongside case retrieval is an emerging innovation. Experimental evaluation on a subset of the FIRE 2019 dataset yielded promising outcomes, achieving a Mean Average Precision (MAP) score of 0.36 and a Mean Average Recall (MAR) of 0.67 across test queries, significantly surpassing the current MAP benchmark of 0.1573. This work introduces a suite of novel contributions to advance case law retrieval. By transitioning from fact-based to legal-issue-based retrieval, the proposed approach delivers more contextually relevant results that align closely with legal professionals’ needs. Integrating legal questions within the retrieval process through the AQgR framework ensures more precise and meaningful retrieval by refining the context of queries.