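This post lists the latest papers retrieved from arXiv.org on 2025-09-12. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: the daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.
Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-09-12)
411 papers were updated today, including:
- Natural Language Processing: 55 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 105 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 95 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 92 papers (Machine Learning (cs.LG))
Natural Language Processing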
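[NLP-0] FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
[Quick read]: This paper addresses the performance gap of open-source text-to-image (T2I) models, which stems from the lack of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks. The solution has two parts. First, it builds FLUX-Reason-6M, a dataset of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions, organized by six key characteristics (Imagination, Entity, Text rendering, Style, Affection, and Composition) and annotated with explicit Generation Chain-of-Thought (GCoT) that breaks down the image generation steps. Second, it proposes the PRISM-Bench benchmark with seven distinct tracks, including a GCoT-based Long Text challenge, which uses advanced vision-language models for fine-grained, human-aligned assessment of prompt-image alignment and image aesthetics. Together, the dataset and benchmark are intended to strengthen the training and evaluation of open-source T2I models on complex reasoning.
Link: https://arxiv.org/abs/2509.09680
Authors: Rongyao Fang,Aldrich Yu,Chengqi Duan,Linjiang Huang,Shuai Bai,Yuxuan Cai,Kun Wang,Si Liu,Xihui Liu,Hongsheng Li
Institutions: CUHK (The Chinese University of Hong Kong); HKU (The University of Hong Kong); BUAA (Beihang University); Alibaba
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Project page: this https URL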
Abstract:The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and we design explicit Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation steps. The whole data curation takes 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: this https URL .
[NLP-1] ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
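[Quick read]: This paper addresses the limited deployability of large language models (LLMs) on consumer hardware, where the core obstacle is the memory footprint of the model parameters. Quantization reduces memory by lowering numerical precision, but extreme 2-bit quantization degrades sharply because of outliers in activations. Rotation-based methods such as QuIP and QuaRot apply fixed orthogonal transforms (for example, Hadamard matrices) to suppress outliers before quantization, relying on computational invariance. Because these transforms are static, they cannot adapt to the distinct outlier patterns of different Transformer layers. The paper proposes ButterflyQuant, whose key idea is to replace the fixed Hadamard matrix with a learnable orthogonal butterfly transform parameterized by continuous Givens rotation angles. The parameterization is differentiable, so it supports gradient-based optimization, while orthogonality holds by construction and preserves the theoretical guarantees on outlier suppression. The transform costs O(n log n) and has only (n log n)/2 learnable parameters. A uniformity regularization on post-transformation activations further encourages quantization-friendly distributions. On LLaMA-2-7B with 2-bit quantization, perplexity drops from 22.1 (QuaRot) to 15.4; learning needs only 128 calibration samples and converges in minutes on a single GPU.
Link: https://arxiv.org/abs/2509.09679
Authors: Bingxin Xu,Zhen Dong,Oussama Elachqar,Yuzhang Shang
Institutions: USC (University of Southern California); UCSB (University of California, Santa Barbara); Oumi; UCF (University of Central Florida)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Replace discrete Hadamard transforms with continuous Butterfly transforms to facilitate the learning of rotation matrices in LLM quantization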
Abstract:Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance: $\mathbf{y} = \mathbf{W}\mathbf{x} = (\mathbf{W}\mathbf{Q}^T)(\mathbf{Q}\mathbf{x})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms (Hadamard matrices achieving optimal worst-case coherence $\mu = 1/\sqrt{n}$) that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. We propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard's discrete $\{+1, -1\}$ entries that are non-differentiable and prohibit gradient-based learning, butterfly transforms' continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU, a negligible one-time cost. On LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves 15.4 perplexity versus 22.1 for QuaRot.
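To make the computational-invariance identity and the butterfly parameter count concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the FFT-style pairing of coordinates, the helper names (`butterfly_apply`, `butterfly_matrix`), and the random test data are all assumptions chosen for illustration. It builds an orthogonal matrix from Givens rotation angles and checks numerically that `(W Q^T)(Q x)` reproduces `W x`.

```python
import math
import random


def butterfly_apply(x, angles):
    """Apply log2(n) butterfly stages of Givens rotations to x (n a power of two).

    angles[s][p] is the angle of the p-th rotation in stage s, so the whole
    transform has (n / 2) * log2(n) parameters. The pairing pattern is an
    illustrative assumption, not taken from the paper.
    """
    n = len(x)
    y = list(x)
    stride, stage = 1, 0
    while stride < n:
        pair = 0
        for block in range(0, n, 2 * stride):
            for i in range(block, block + stride):
                j = i + stride
                c = math.cos(angles[stage][pair])
                s = math.sin(angles[stage][pair])
                # 2x2 Givens rotation acting on coordinates (i, j).
                y[i], y[j] = c * y[i] - s * y[j], s * y[i] + c * y[j]
                pair += 1
        stride *= 2
        stage += 1
    return y


def butterfly_matrix(n, angles):
    """Materialize the transform as a dense matrix Q (for checking only)."""
    cols = [butterfly_apply([float(i == k) for i in range(n)], angles)
            for k in range(n)]
    return [[cols[k][i] for k in range(n)] for i in range(n)]


def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]


random.seed(0)
n = 8
stages = n.bit_length() - 1  # log2(n) for a power of two
angles = [[random.uniform(-math.pi, math.pi) for _ in range(n // 2)]
          for _ in range(stages)]
num_params = sum(len(stage_angles) for stage_angles in angles)

Q = butterfly_matrix(n, angles)
W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
x = [random.gauss(0, 1) for _ in range(n)]

# Orthogonality check: Q^T Q should be the identity.
orth_err = max(abs(sum(Q[k][i] * Q[k][j] for k in range(n)) - float(i == j))
               for i in range(n) for j in range(n))

# Computational invariance: (W Q^T)(Q x) should equal W x.
WQt = [[sum(W[i][k] * Q[j][k] for k in range(n)) for j in range(n)]
       for i in range(n)]
y = matvec(W, x)
y_rot = matvec(WQt, matvec(Q, x))
max_err = max(abs(a - b) for a, b in zip(y, y_rot))
```

Because every stage is a product of disjoint 2x2 rotations, orthogonality holds for any choice of angles, which is what lets the angles be trained freely in a gradient-based setup.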
[NLP-2] CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models
[Quick read]: This paper addresses insufficient exploration in current Reinforcement Learning with Verifiable Rewards (RLVR) methods for training large language models (LLMs), which shows up as premature convergence and entropy collapse. The key is the Curiosity-Driven Exploration (CDE) framework, which strengthens exploration with two complementary intrinsic curiosity signals: perplexity from the actor over its generated response, which penalizes overconfident errors and promotes diversity among correct responses; and the variance of value estimates from a multi-head critic, which the theoretical analysis connects to the classic count-based exploration bonus in RL. Experiments show a gain of roughly +3 points over standard RLVR (GRPO/PPO) on AIME benchmarks, and the analysis identifies a calibration collapse mechanism within RLVR that sheds light on common LLM failure modes.
Link: https://arxiv.org/abs/2509.09675
Authors: Runpeng Dai,Linfeng Song,Haolin Liu,Zhenwen Liang,Dian Yu,Haitao Mi,Zhaopeng Tu,Rui Liu,Tong Zheng,Hongtu Zhu,Dong Yu
Institutions: Tencent AI Lab; Tencent Multimodal Department; University of North Carolina at Chapel Hill; University of Virginia; University of Maryland, College Park
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model’s own intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.
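The two curiosity signals named in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say how the bonuses enter the objective, so the additive form and the coefficients `alpha` and `beta` in `shaped_reward` are hypothetical, as are all function names.

```python
import math


def perplexity(token_logprobs):
    """Actor signal: exp of the negative mean log-probability of the response tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def head_variance(value_estimates):
    """Critic signal: population variance of the value estimates across heads."""
    mean = sum(value_estimates) / len(value_estimates)
    return sum((v - mean) ** 2 for v in value_estimates) / len(value_estimates)


def shaped_reward(verifiable_reward, token_logprobs, value_estimates,
                  alpha=0.01, beta=0.01):
    """Hypothetical additive combination of the reward and the two bonuses."""
    return (verifiable_reward
            + alpha * perplexity(token_logprobs)
            + beta * head_variance(value_estimates))


# A response whose tokens each had probability 0.5 has perplexity 2.
confident_half = [math.log(0.5)] * 4
# Critic heads that disagree yield a larger bonus than heads that agree.
agreeing_heads = [0.7, 0.7, 0.7]
disagreeing_heads = [0.0, 2.0]
```

Under this sketch, a response the actor found surprising or a state on which the critic heads disagree receives a larger bonus, which is the intuition behind both signals.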
[NLP-3] SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
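[Quick read]: This paper addresses two challenges that Vision-Language-Action (VLA) models face in robotic manipulation: the high-quality human-operated trajectories needed for supervised fine-tuning (SFT) are scarce and costly, and the models generalize poorly to tasks involving distribution shift. The authors propose SimpleVLA-RL, an efficient reinforcement learning (RL) framework tailored to VLA models, which introduces VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. The approach reduces dependence on large-scale data and, with the exploration-enhancing strategies the authors introduce, surpasses SFT on real-world tasks. The authors also identify a new phenomenon during RL training, "pushcut", in which the policy discovers patterns not seen in the previous training process.
Link: https://arxiv.org/abs/2509.09674
Authors: Haozhan Li,Yuxin Zuo,Jiale Yu,Yuhao Zhang,Zhaohui Yang,Kaiyan Zhang,Xuekai Zhu,Yuchen Zhang,Tianxing Chen,Ganqu Cui,Dehui Wang,Dingxiang Luo,Yuchen Fan,Youbang Sun,Jia Zeng,Jiangmiao Pang,Shanghang Zhang,Yu Wang,Yao Mu,Bowen Zhou,Ning Ding
Institutions: Shanghai Jiao Tong University; Peking University; The University of Hong Kong; Shanghai AI Lab
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: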
Abstract:Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon step-by-step action planning of VLA? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms $\pi_0$ on RoboTwin 1.0 & 2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon "pushcut" during RL training, wherein the policy discovers previously unseen patterns beyond those seen in the previous training process. GitHub: this https URL
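[NLP-4] Steering MoE LLMs via Expert (De)Activation
[Quick read]: This paper addresses behavior control and alignment in large language models (LLMs), specifically how to adjust behavioral properties of the output (such as safety and faithfulness) at inference time without retraining or modifying weights. The key is the SteerMoE framework, which detects behavior-linked experts, that is, experts whose activation patterns differ markedly across paired inputs exhibiting contrasting behaviors, and then selectively activates or deactivates them to steer the model. Without any fine-tuning, the method raises safety by up to 20% and faithfulness by 27% across multiple benchmarks. It also shows that existing alignment can be bypassed at the expert level, exposing alignment faking hidden within experts and offering a new perspective on safety vulnerabilities in MoE architectures.
Link: https://arxiv.org/abs/2509.09660
Authors: Mohsen Fayyaz,Ali Modarressi,Hanieh Deilamsalehy,Franck Dernoncourt,Ryan Rossi,Trung Bui,Hinrich Schütze,Nanyun Peng
Institutions: University of California, Los Angeles; Adobe Research; CIS, LMU Munich; Munich Center for Machine Learning
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: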
Abstract:Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFN), known as experts. We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-linked experts. Our detection method identifies experts with distinct activation patterns across paired inputs exhibiting contrasting behaviors. By selectively (de)activating such experts during inference, we control behaviors like faithfulness and safety without retraining or modifying weights. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by +27%. In adversarial attack mode, it drops safety by -41% alone, and -100% when combined with existing jailbreak methods, bypassing all safety guardrails and exposing a new dimension of alignment faking hidden within experts.
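The detect-then-steer idea can be illustrated with a toy router. This is a sketch under assumptions, not SteerMoE's actual procedure: scoring experts by the difference in activation frequency between two prompt sets, and steering by overriding router scores with plus or minus infinity before top-k selection, are both illustrative choices, and every name below is hypothetical.

```python
import math


def activation_gap(counts_a, counts_b, tokens_a, tokens_b):
    """Per-expert difference in activation frequency between two prompt sets.

    counts_a[e] is how often expert e was routed to over tokens_a tokens of
    behavior A; counts_b and tokens_b are the same for behavior B.
    """
    return [ca / tokens_a - cb / tokens_b for ca, cb in zip(counts_a, counts_b)]


def steered_route(router_logits, k, deactivate=(), activate=()):
    """Top-k routing with some experts forced off or on.

    Returns {expert_index: mixing_weight}; the weights are a softmax over the
    original logits of the selected experts. Assumes fewer than
    len(router_logits) - k + 1 experts are deactivated.
    """
    scores = list(router_logits)
    for e in deactivate:
        scores[e] = -math.inf
    for e in activate:
        scores[e] = math.inf
    chosen = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    peak = max(router_logits[e] for e in chosen)
    exps = {e: math.exp(router_logits[e] - peak) for e in chosen}
    total = sum(exps.values())
    return {e: w / total for e, w in exps.items()}


logits = [2.0, 1.0, 0.5, 0.1]
baseline = steered_route(logits, k=2)                 # routes to experts 0 and 1
steered = steered_route(logits, k=2, deactivate=[0])  # expert 0 is switched off
gap = activation_gap([90, 10, 50, 50], [10, 90, 50, 50], 100, 100)
```

In this toy setup, experts with a large positive or negative gap would be the candidates to activate or deactivate; the model weights themselves are never touched.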
[NLP-5] Retrieval-Augmented Generation for Reliable Interpretation of Radio Regulations
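[Quick read]: This paper addresses the accuracy and reliability of question answering on radio regulations, a legally sensitive and high-stakes domain that demands precise answers. The key is a telecom-specific Retrieval-Augmented Generation (RAG) pipeline, together with the first multiple-choice evaluation set for this domain, built from authoritative sources using automated filtering and human validation. Under a domain-specific retrieval metric defined in the paper, the retriever reaches about 97% accuracy, and the pipeline improves generation accuracy across the tested models. For GPT-4o, for example, naively inserting documents without structured retrieval yields a gain of less than 1%, whereas the RAG pipeline yields a relative improvement of nearly 12%, which indicates that carefully targeted grounding is a simple yet strong baseline for regulatory question answering.
Link: https://arxiv.org/abs/2509.09651
Authors: Zakaria El Kassimi,Fares Fourati,Mohamed-Slim Alouini
Institutions: KAUST
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: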
Abstract:We study question answering in the domain of radio regulations, a legally sensitive and high-stakes area. We propose a telecom-specific Retrieval-Augmented Generation (RAG) pipeline and introduce, to our knowledge, the first multiple-choice evaluation set for this domain, constructed from authoritative sources using automated filtering and human validation. To assess retrieval quality, we define a domain-specific retrieval metric, under which our retriever achieves approximately 97% accuracy. Beyond retrieval, our approach consistently improves generation accuracy across all tested models. In particular, while naively inserting documents without structured retrieval yields only marginal gains for GPT-4o (less than 1%), applying our pipeline results in nearly a 12% relative improvement. These findings demonstrate that carefully targeted grounding provides a simple yet strong baseline and an effective domain-specific solution for regulatory question answering. All code and evaluation scripts, along with our derived question-answer dataset, are available at this https URL.
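[NLP-6] All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens EMNLP2025
[Quick read]: This paper addresses the unclear internal computation of large language models (LLMs) on mental math tasks, in particular how causal self-attention and multilayer perceptron (MLP) layers jointly process information. The key is two new techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP). By inhibiting input-specific token computations in the initial layers, restricting the routes of information transfer across token positions in the next few layers, and forcing all computation in the remaining layers onto the last token, the authors identify an "All-for-One" (AF1) subgraph. In this subgraph, meaningful computation happens only at the last token and only in late layers, and the last token receives information from other tokens in a few specific middle layers. Experiments show that the subgraph is both necessary and sufficient for high performance across a variety of models and input styles, and that it transfers across models.
Link: https://arxiv.org/abs/2509.09650
Authors: Siddarth Mamidanna,Daking Rai,Ziyu Yao,Yilun Zhou
Institutions: University of California, Santa Cruz; George Mason University; Datadog AI Research
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025 Main Conference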
Abstract:Large language models (LLMs) demonstrate proficiency across numerous computational tasks, yet their inner workings remain unclear. In theory, the combination of causal self-attention and multilayer perceptron layers allows every token to access and compute information based on all preceding tokens. In practice, to what extent are such operations present? In this paper, on mental math tasks (i.e., direct math calculation via next-token prediction without explicit reasoning), we investigate this question in three steps: inhibiting input-specific token computations in the initial layers, restricting the routes of information transfer across token positions in the next few layers, and forcing all computation to happen at the last token in the remaining layers. With two proposed techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP), we identify an All-for-One subgraph (AF1) with high accuracy on a wide variety of mental math tasks, where meaningful computation occurs very late (in terms of layer depth) and only at the last token, which receives information of other tokens in few specific middle layers. Experiments on a variety of models and arithmetic expressions show that this subgraph is sufficient and necessary for high model performance, transfers across different models, and works on a variety of input styles. Ablations on different CAMA and ABP alternatives reveal their unique advantages over other methods, which may be of independent interest.
[NLP-7] DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech
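[Quick read]: This paper addresses slow inference, repetition artifacts, and the difficulty of modeling prosodic and acoustic attributes in zero-shot text-to-speech (TTS). Existing flow-matching methods mostly operate in a continuous space and do not fully exploit discrete codec representations. The key is DiFlow-TTS, which the authors describe as the first speech synthesis model to apply flow matching in a purely discrete space (Discrete Flow Matching). It explicitly models factorized speech attributes (such as prosody and acoustic details) within a unified architecture and uses in-context learning, conditioning on the textual content and on attributes extracted from a reference speech, to clone attributes in a zero-shot setting. The design improves naturalness, prosody, and preservation of speaker style while lowering latency, generating speech up to 25.8 times faster than the latest baselines.
Link: https://arxiv.org/abs/2509.09631
Authors: Ngoc-Son Nguyen,Hieu-Nghia Huynh-Nguyen,Thanh V. T. Tran,Truong-Son Hy,Van Nguyen
Institutions: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: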
Abstract:Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that mimics the voice of an unseen speaker using only a short reference sample, requiring not only speaker adaptation but also accurate modeling of prosodic attributes. Recent approaches based on language models, diffusion, and flow matching have shown promising results in zero-shot TTS, but still suffer from slow inference and repetition artifacts. Discrete codec representations have been widely adopted for speech synthesis, and recent works have begun to explore diffusion models in purely discrete settings, suggesting the potential of discrete generative modeling for speech synthesis. However, existing flow-matching methods typically embed these discrete tokens into a continuous space and apply continuous flow matching, which may not fully leverage the advantages of discrete representations. To address these challenges, we introduce DiFlow-TTS, which, to the best of our knowledge, is the first model to explore purely Discrete Flow Matching for speech synthesis. DiFlow-TTS explicitly models factorized speech attributes within a compact and unified architecture. It leverages in-context learning by conditioning on textual content, along with prosodic and acoustic attributes extracted from a reference speech, enabling effective attribute cloning in a zero-shot setting. In addition, the model employs a factorized flow prediction mechanism with distinct heads for prosody and acoustic details, allowing it to learn aspect-specific distributions. Experimental results demonstrate that DiFlow-TTS achieves promising performance in several key metrics, including naturalness, prosody, preservation of speaker style, and energy control. It also maintains a compact model size and achieves low-latency inference, generating speech up to 25.8 times faster than the latest existing baselines.
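[NLP-8] Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems EMNLP2025
[Quick read]: This paper addresses the capability gaps and poor coordination that arise when the agents of an LLM-based multi-agent system are fine-tuned independently, for example between a planning agent and a grounding agent in complex task solving. The key is MOAT (Multi-Agent Joint Alignment Tuning), a framework that alternates between two stages of iterative alignment: the first optimizes the planning agent to generate subgoal sequences that better guide the grounding agent; the second fine-tunes the grounding agent on diverse subgoal-action pairs generated by the agent itself to improve its generalization. Theoretical analysis shows that the training process is non-decreasing and progressively convergent, and experiments on six benchmarks show that MOAT outperforms state-of-the-art baselines.
Link: https://arxiv.org/abs/2509.09629
Authors: Minghang Zhu,Zhengliang Shi,Zhiwei Xu,Shiguang Wu,Lingjie Wang,Pengjie Ren,Zhaochun Ren,Zhumin Chen
Institutions: Shandong University; Leiden University
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025 Findings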
Abstract:The advancement of large language models (LLMs) has enabled the construction of multi-agent systems to solve complex tasks by dividing responsibilities among specialized agents, such as a planning agent for subgoal generation and a grounding agent for executing tool-use actions. Most existing methods typically fine-tune these agents independently, leading to capability gaps among them with poor coordination. To address this, we propose MOAT, a Multi-Agent Joint Alignment Tuning framework that improves agents' collaboration through iterative alignment. MOAT alternates between two key stages: (1) Planning Agent Alignment, which optimizes the planning agent to generate subgoal sequences that better guide the grounding agent; and (2) Grounding Agent Improving, which fine-tunes the grounding agent using diverse subgoal-action pairs generated by the agent itself to enhance its generalization capability. Theoretical analysis proves that MOAT ensures a non-decreasing and progressively convergent training process. Experiments across six benchmarks demonstrate that MOAT outperforms state-of-the-art baselines, achieving average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks.
[NLP-9] LAVA: Language Model Assisted Verbal Autopsy for Cause-of-Death Determination
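[Quick read]: This paper addresses cause-of-death estimation in resource-limited settings where medical certification is unavailable. It presents LA-VA, a proof-of-concept pipeline whose key idea is to combine large language models (LLMs) with traditional algorithmic approaches and embedding-based classification to improve cause-of-death prediction. Evaluated on the PHMRC dataset across adults, children, and neonates, GPT-5 alone achieves the highest individual performance (average test-site accuracies of 48.6%, 50.5%, and 53.5%, respectively), outperforming traditional statistical machine learning baselines by 5-10%. This suggests that simple off-the-shelf LLM-assisted approaches could substantially improve verbal autopsy (VA) accuracy, with important implications for global health surveillance in low-resource settings.
Link: https://arxiv.org/abs/2509.09602
Authors: Yiqun T. Chen,Tyler H. McCormick,Li Liu,Abhirup Datta
Institutions: Unknown
Categories: Computation and Language (cs.CL); Applications (stat.AP)
Comments: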
Abstract:Verbal autopsy (VA) is a critical tool for estimating causes of death in resource-limited settings where medical certification is unavailable. This study presents LA-VA, a proof-of-concept pipeline that combines Large Language Models (LLMs) with traditional algorithmic approaches and embedding-based classification for improved cause-of-death prediction. Using the Population Health Metrics Research Consortium (PHMRC) dataset across three age categories (Adult: 7,580; Child: 1,960; Neonate: 2,438), we evaluate multiple approaches: GPT-5 predictions, LCVA baseline, text embeddings, and meta-learner ensembles. Our results demonstrate that GPT-5 achieves the highest individual performance with average test site accuracies of 48.6% (Adult), 50.5% (Child), and 53.5% (Neonate), outperforming traditional statistical machine learning baselines by 5-10%. Our findings suggest that simple off-the-shelf LLM-assisted approaches could substantially improve verbal autopsy accuracy, with important implications for global health surveillance in low-resource settings.
[NLP-10] Fluent but Unfeeling: The Emotional Blind Spots of Language Models
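[Quick read]: This paper addresses the limited fine-grained agreement between the emotions that large language models (LLMs) recognize and the emotions that people self-disclose. Existing research mostly classifies emotions into a small set of predefined categories and overlooks more nuanced expressions. The key is EXPRESS, a benchmark dataset curated from Reddit communities with 251 fine-grained, self-disclosed emotion labels, together with a systematic evaluation framework that decomposes predicted emotion terms into eight basic emotions (following established emotion theories) to enable fine-grained comparison. The results show that, although some LLMs generate emotion terms consistent with established emotion theories and definitions, they sometimes fail to capture contextual cues as effectively as human self-disclosures, which highlights the current limits of LLMs in fine-grained emotion alignment.
Link: https://arxiv.org/abs/2509.09593
Authors: Bangzhao Shu,Isha Joshi,Melissa Karnaze,Anh C. Pham,Ishita Kakkar,Sindhu Kothe,Arpine Hovasapian,Mai ElSherief
Institutions: Northeastern University; UC San Diego; University of Massachusetts Amherst; Independent Researcher
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Camera-ready version for ICWSM 2026. First two authors contributed equally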
Abstract:The versatility of Large Language Models (LLMs) in natural language understanding has made them increasingly popular in mental health research. While many studies explore LLMs’ capabilities in emotion recognition, a critical gap remains in evaluating whether LLMs align with human emotions at a fine-grained level. Existing research typically focuses on classifying emotions into predefined, limited categories, overlooking more nuanced expressions. To address this gap, we introduce EXPRESS, a benchmark dataset curated from Reddit communities featuring 251 fine-grained, self-disclosed emotion labels. Our comprehensive evaluation framework examines predicted emotion terms and decomposes them into eight basic emotions using established emotion theories, enabling a fine-grained comparison. Systematic testing of prevalent LLMs under various prompt settings reveals that accurately predicting emotions that align with human self-disclosed emotions remains challenging. Qualitative analysis further shows that while certain LLMs generate emotion terms consistent with established emotion theories and definitions, they sometimes fail to capture contextual cues as effectively as human self-disclosures. These findings highlight the limitations of LLMs in fine-grained emotion alignment and offer insights for future research aimed at enhancing their contextual understanding.
[NLP-11] Personality-Enhanced Social Recommendations in SAMI: Exploring the Role of Personality Detection in Matchmaking
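[Quick read]: This paper addresses the difficulty of forming social connections organically in online course environments, and in particular the fact that the social matchmaking system SAMI has an incomplete Theory of Mind, which limits its ability to build an effective mental model of a student and may reduce the relevance of its recommendations. The key is to use GPT's zero-shot capability to infer Big-Five personality traits from forum introduction posts and to integrate this model into SAMI's entity-based matchmaking system, enabling personality-informed social recommendations. Initial results suggest that personality traits can complement existing matching factors, though their full impact on student engagement and match quality requires further evaluation.
Link: https://arxiv.org/abs/2509.09583
Authors: Brittany Harbison,Samuel Taubman,Travis Taylor,Ashok. K. Goel
Institutions: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: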
Abstract:Social connection is a vital part of learning, yet online course environments present barriers to the organic formation of social groups. SAMI offers one solution by facilitating student connections, but its effectiveness is constrained by an incomplete Theory of Mind, limiting its ability to create an effective mental model of a student. One facet of this is its inability to intuit personality, which may influence the relevance of its recommendations. To explore this, we propose a personality detection model utilizing GPT's zero-shot capability to infer Big-Five personality traits from forum introduction posts, often encouraged in online courses. We benchmark its performance against established models, demonstrating its efficacy in this task. Furthermore, we integrate this model into SAMI's entity-based matchmaking system, enabling personality-informed social recommendations. Initial integration suggests personality traits can complement existing matching factors, though additional evaluation is required to determine their full impact on student engagement and match quality.
[NLP-12] Prompting the Market? A Large-Scale Meta-Analysis of GenAI in Finance NLP (2022-2025) EMNLP
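[Quick read]: This paper addresses the lack of a systematic, structured account of the fast-moving field of financial natural language processing (financial NLP), where traditional surveys have not kept pace with the rapid adoption of large language models (LLMs) and the proliferation of data sources. The key is MetaGraph, a generalizable methodology for extracting and analyzing knowledge graphs: it defines an ontology for financial NLP research and applies an LLM-based extraction pipeline to 681 papers from 2022-2025, yielding a structured, queryable view of research trends. The analysis reveals three phases in the evolution of financial NLP and offers a reusable approach for mapping scientific progress in other domains.
Link: https://arxiv.org/abs/2509.09544
Authors: Paolo Pedinotti,Peter Baumann,Nathan Jessurun,Leslie Barrett,Enrico Santus
Institutions: Bloomberg
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 6 appendices, EMNLP industry track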
Abstract:Large Language Models (LLMs) have rapidly reshaped financial NLP, enabling new tasks and driving a proliferation of datasets and diversification of data sources. Yet, this transformation has outpaced traditional surveys. In this paper, we present MetaGraph, a generalizable methodology for extracting knowledge graphs from scientific literature and analyzing them to obtain a structured, queryable view of research trends. We define an ontology for financial NLP research and apply an LLM-based extraction pipeline to 681 papers (2022-2025), enabling large-scale, data-driven analysis. MetaGraph reveals three key phases: early LLM adoption and task/dataset innovation; critical reflection on LLM limitations; and growing integration of peripheral techniques into modular systems. This structured view offers both practitioners and researchers a clear understanding of how financial NLP has evolved, highlighting emerging trends, shifting priorities, and methodological shifts, while also demonstrating a reusable approach for mapping scientific progress in other domains.
[NLP-13] DeMeVa at LeWiDi-2025: Modeling Perspectives with In-Context Learning and Label Distribution Learning EMNLP-2025
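[Quick read]: This paper addresses how to predict annotator-specific (perspectivist) annotations and produce soft labels when annotators disagree. It explores two directions. With in-context learning (ICL) using large language models, it compares example sampling strategies and shows that ICL can effectively predict individual annotators' annotations. With label distribution learning (LDL) using RoBERTa, it evaluates several fine-tuning methods and argues that LDL is promising for soft-label prediction and merits further study.
Link: https://arxiv.org/abs/2509.09524
Authors: Daniil Ignatev,Nan Li,Hugh Mee Wong,Anh Dang,Shane Kaszefski Yaschuk
Institutions: Utrecht University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures; to appear at NLPerspectives@EMNLP-2025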
Abstract:This system paper presents the DeMeVa team’s approaches to the third edition of the Learning with Disagreements shared task (LeWiDi 2025; Leonardelli et al., 2025). We explore two directions: in-context learning (ICL) with large language models, where we compare example sampling strategies; and label distribution learning (LDL) methods with RoBERTa (Liu et al., 2019b), where we evaluate several fine-tuning methods. Our contributions are twofold: (1) we show that ICL can effectively predict annotator-specific annotations (perspectivist annotations), and that aggregating these predictions into soft labels yields competitive performance; and (2) we argue that LDL methods are promising for soft label predictions and merit further exploration by the perspectivist community.
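Aggregating annotator-specific predictions into a soft label can be sketched as a normalized vote count. The abstract does not specify the aggregation rule, so the uniform weighting below is an assumption, and `soft_label` and the annotator IDs are hypothetical names.

```python
from collections import Counter


def soft_label(annotator_predictions, classes):
    """Turn per-annotator hard predictions into a distribution over classes.

    annotator_predictions maps an annotator ID to that annotator's predicted
    class; each annotator counts equally (an illustrative assumption).
    """
    counts = Counter(annotator_predictions.values())
    total = len(annotator_predictions)
    return {c: counts[c] / total for c in classes}


# Four simulated annotators, three of whom are predicted to choose class 1.
predictions = {"ann_1": 1, "ann_2": 0, "ann_3": 1, "ann_4": 1}
distribution = soft_label(predictions, classes=[0, 1])
```

The resulting distribution can then be compared against the empirical distribution of human annotations with any distance suited to soft labels.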
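[NLP-14] Towards Explainable Job Title Matching: Leveraging Semantic Textual Relatedness and Knowledge Graphs
[Quick read]: This paper addresses Semantic Textual Relatedness (STR) for job title matching in resume recommendation systems, especially when lexical overlap is limited or misleading. Approaches that rely on surface lexical similarity struggle to capture deeper semantic relations, which limits matching accuracy. The key is a self-supervised hybrid architecture that combines dense sentence embeddings with domain-specific Knowledge Graphs (KGs), integrated via graph neural networks (GNNs), to improve both semantic alignment and explainability. A stratified evaluation over the STR score continuum (low, medium, and high relatedness regions) exposes how models behave in different regions: in the high-STR region, fine-tuned SBERT models augmented with KGs reduce RMSE by 25% over strong baselines. The results support combining KGs with text embeddings and underline the value of regional performance analysis for HR applications where fairness, explainability, and contextual matching matter.
Link: https://arxiv.org/abs/2509.09522
Authors: Vadim Zadykian,Bruno Andrade,Haithem Afli
Institutions: ADAPT Centre, Munster Technological University, Cork, Ireland
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: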
Abstract:Semantic Textual Relatedness (STR) captures nuanced relationships between texts that extend beyond superficial lexical similarity. In this study, we investigate STR in the context of job title matching - a key challenge in resume recommendation systems, where overlapping terms are often limited or misleading. We introduce a self-supervised hybrid architecture that combines dense sentence embeddings with domain-specific Knowledge Graphs (KGs) to improve both semantic alignment and explainability. Unlike previous work that evaluated models on aggregate performance, our approach emphasizes data stratification by partitioning the STR score continuum into distinct regions: low, medium, and high semantic relatedness. This stratified evaluation enables a fine-grained analysis of model performance across semantically meaningful subspaces. We evaluate several embedding models, both with and without KG integration via graph neural networks. The results show that fine-tuned SBERT models augmented with KGs produce consistent improvements in the high-STR region, where the RMSE is reduced by 25% over strong baselines. Our findings highlight not only the benefits of combining KGs with text embeddings, but also the importance of regional performance analysis in understanding model behavior. This granular approach reveals strengths and weaknesses hidden by global metrics, and supports more targeted model selection for use in Human Resources (HR) systems and applications where fairness, explainability, and contextual matching are essential.
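The stratified evaluation described above amounts to computing RMSE separately per relatedness region instead of once over all pairs. A minimal sketch follows; the abstract does not give the region boundaries or the score scale, so the [0, 1] scale and the thresholds at 1/3 and 2/3 are assumptions, as are the function names and toy scores.

```python
import math


def str_region(score, low_max=1 / 3, high_min=2 / 3):
    """Assign a gold STR score in [0, 1] to a region (thresholds are assumed)."""
    if score < low_max:
        return "low"
    if score < high_min:
        return "medium"
    return "high"


def stratified_rmse(gold, pred):
    """RMSE of predictions, computed separately within each gold-score region."""
    squared = {"low": [], "medium": [], "high": []}
    for g, p in zip(gold, pred):
        squared[str_region(g)].append((g - p) ** 2)
    return {region: math.sqrt(sum(errs) / len(errs))
            for region, errs in squared.items() if errs}


gold = [0.1, 0.2, 0.5, 0.9, 0.8]
pred = [0.2, 0.2, 0.4, 0.6, 0.8]
per_region = stratified_rmse(gold, pred)
```

In this toy data the global RMSE would hide that the errors are concentrated in the high-relatedness pairs, which is the kind of regional weakness the stratified view is meant to surface.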
[NLP-15] Mitigating Language Barriers in Education: Developing Multilingual Digital Learning Materials with Machine Translation
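[Quick read]: This paper addresses the accessibility of educational content across languages, specifically the lack of high-quality, well-adapted multimodal interactive learning materials for students in Czech primary and secondary schools who do not speak Czech. The key is the development and evaluation of a direct Czech-Ukrainian machine translation (MT) system tailored to the educational domain, with special attention to processing formatted content such as XML and PDF and to handling technical and scientific terminology. The project aims to translate up to 9,000 interactive exercises from Czech into Ukrainian, English, and German and to deploy them on an educational web portal.
Link: https://arxiv.org/abs/2509.09473
Authors: Lucie Poláková,Martin Popel,Věra Kloudová,Michal Novák,Mariia Anisimova,Jiří Balhar
Institutions: Charles University; Institute of Formal and Applied Linguistics
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 2 figures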
Abstract:The EdUKate project combines digital education, linguistics, translation studies, and machine translation to develop multilingual learning materials for Czech primary and secondary schools. Launched through collaboration between a major Czech academic institution and the country’s largest educational publisher, the project is aimed at translating up to 9,000 multimodal interactive exercises from Czech into Ukrainian, English, and German for an educational web portal. It emphasizes the development and evaluation of a direct Czech-Ukrainian machine translation system tailored to the educational domain, with special attention to processing formatted content such as XML and PDF and handling technical and scientific terminology. We present findings from an initial survey of Czech teachers regarding the needs of non-Czech-speaking students and describe the system’s evaluation and implementation on the web portal. All resulting applications are freely available to students, educators, and researchers.
[NLP-16] GrACE: A Generative Approach to Better Confidence Elicitation in Large Language Models
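【Quick read】: This paper addresses the lack of reliable confidence estimation for large language models (LLMs) in high-stakes applications such as healthcare and finance. Existing methods either incur high computational overhead or are poorly calibrated, which makes them hard to deploy. The key to the solution is GrACE (Generative Approach to Confidence Elicitation), in which the model expresses confidence in real time through the similarity between its last hidden state and the embedding of a special token appended to the vocabulary; the model is fine-tuned with calibration targets tied to accuracy. Experiments show that GrACE outperforms six competing methods on open-ended generation tasks without additional sampling or an auxiliary model, and that it supports confidence-based test-time scaling strategies that improve final decision accuracy while reducing the number of samples required.
Link: https://arxiv.org/abs/2509.09438
Authors: Zhaohan Zhang, Ziquan Liu, Ioannis Patras
Affiliations: Queen Mary University of London
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 11 figures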
Abstract:Assessing the reliability of Large Language Models (LLMs) by confidence elicitation is a prominent approach to AI safety in high-stakes applications, such as healthcare and finance. Existing methods either require expensive computational overhead or suffer from poor calibration, making them impractical and unreliable for real-world deployment. In this work, we propose GrACE, a Generative Approach to Confidence Elicitation that enables scalable and reliable confidence elicitation for LLMs. GrACE adopts a novel mechanism in which the model expresses confidence by the similarity between the last hidden state and the embedding of a special token appended to the vocabulary, in real-time. We fine-tune the model for calibrating the confidence with calibration targets associated with accuracy. Experiments with three LLMs and two benchmark datasets show that the confidence produced by GrACE achieves the best discriminative capacity and calibration on open-ended generation tasks, outperforming six competing methods without resorting to additional sampling or an auxiliary model. Moreover, we propose two strategies for improving test-time scaling based on confidence induced by GrACE. Experimental results show that using GrACE not only improves the accuracy of the final decision but also significantly reduces the number of required samples in the test-time scaling scheme, indicating the potential of GrACE as a practical solution for deploying LLMs with scalable, reliable, and real-time confidence estimation.
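[NLP-17] LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations EMNLP2025
【Quick read】: This paper examines whether large language models (LLMs) can explain themselves reliably in human-AI collaboration, focusing on self-generated counterfactual explanations (SCEs), where a model explains its prediction by modifying the input so that it would have predicted a different outcome. The key findings: LLMs typically produce SCEs that are valid (the modification does change the prediction) but far from minimal, so they reveal little about the model's actual decision behaviour; when explicitly asked for minimal counterfactuals, the models tend to make edits that are too small to change the prediction, which yields a trade-off between validity and minimality. The pattern is consistent across several LLMs, datasets, and evaluation settings, suggesting that SCEs are at best an ineffective explanation tool and at worst misleading, which is a risk for deployment in high-stakes settings.
Link: https://arxiv.org/abs/2509.09396
Authors: Harry Mayne, Ryan Othniel Kearns, Yushi Yang, Andrew M. Bean, Eoin Delaney, Chris Russell, Adam Mahdi
Affiliations: University of Oxford; Trinity College Dublin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025 Main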
Abstract:To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs), where a model explains its prediction by modifying the input such that it would have predicted a different outcome. We evaluate whether LLMs can produce SCEs that are valid, achieving the intended outcome, and minimal, modifying the input no more than necessary. When asked to generate counterfactuals, we find that LLMs typically produce SCEs that are valid, but far from minimal, offering little insight into their decision-making behaviour. Worryingly, when asked to generate minimal counterfactuals, LLMs typically make excessively small edits that fail to change predictions. The observed validity-minimality trade-off is consistent across several LLMs, datasets, and evaluation settings. Our findings suggest that SCEs are, at best, an ineffective explainability tool and, at worst, can provide misleading insights into model behaviour. Proposals to deploy LLMs in high-stakes settings must consider the impact of unreliable self-explanations on downstream decision-making. Our code is available at this https URL.
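The two criteria in this abstract, validity and minimality, can be made concrete with a small check. This is a sketch under stated assumptions: `toy_sentiment` is a hypothetical stand-in for the model being explained, and token-level edit distance is only one possible way to measure how much an input was changed; the paper's own distance measure is not given here.

```python
def toy_sentiment(text):
    """Hypothetical classifier: positive iff 'good' occurs more often than 'bad'."""
    words = text.lower().split()
    return "positive" if words.count("good") > words.count("bad") else "negative"

def token_edit_distance(a, b):
    """Levenshtein distance computed over whitespace-separated tokens."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def evaluate_counterfactual(model, original, counterfactual):
    """Valid: the prediction flips. Edits: how far the input was moved."""
    valid = model(original) != model(counterfactual)
    return {"valid": valid, "edits": token_edit_distance(original, counterfactual)}

original = "the food was good and the service was good"
minimal_cf = "the food was bad and the service was good"   # one edit, flips the label
verbose_cf = "the food was bad and the staff were bad"     # flips, but with four edits
too_small = "the food was good and the service was fine"   # one edit, label unchanged

for cf in (minimal_cf, verbose_cf, too_small):
    print(evaluate_counterfactual(toy_sentiment, original, cf))
```

The second and third candidates mirror the two failure modes reported above: a valid but non-minimal edit, and a small edit that does not change the prediction.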
[NLP-18] Hierarchical Bracketing Encodings Work for Dependency Graphs EMNLP2025
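【Quick read】: This paper addresses how to represent complex structures efficiently and accurately in dependency graph parsing, in particular how to reduce the size of the label space while keeping structural information intact. The key to the solution is to use hierarchical bracketing encodings to turn graphs into sequences, which enables linear-time parsing while still representing reentrancies, cycles, and empty nodes. Compared with existing graph linearizations, the encoding substantially reduces the label space while preserving structural information; on a multilingual, multi-formalism benchmark it is competitive and consistently improves exact match accuracy over other methods.
Link: https://arxiv.org/abs/2509.09388
Authors: Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares
Affiliations: Universidade da Coruña; CITIC
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2025 (main)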
Abstract:We revisit hierarchical bracketing encodings from a practical perspective in the context of dependency graph parsing. The approach encodes graphs as sequences, enabling linear-time parsing with n tagging actions, and still representing reentrancies, cycles, and empty nodes. Compared to existing graph linearizations, this representation substantially reduces the label space while preserving structural information. We evaluate it on a multilingual and multi-formalism benchmark, showing competitive results and consistent improvements over other methods in exact match accuracy.
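[NLP-19] Modelling Analogies and Analogical Reasoning: Connecting Cognitive Science Theory and NLP Research
【Quick read】: This paper asks how cognitive science theory on analogical reasoning can be connected with natural language processing (NLP) research so that models better capture relations in text instead of relying mainly on entity-level similarity. The key to the solution is to summarize the cognitive processes underlying analogical reasoning and relate them to concepts and tasks in NLP. This points to new directions for several major NLP challenges that are not directly related to analogy solving, and encourages a move from surface feature matching toward modelling deeper relational semantics.
Link: https://arxiv.org/abs/2509.09381
Authors: Molly R Petersen, Claire E Stevenson, Lonneke van der Plas
Affiliations: Idiap Research Institute; EPFL; University of Amsterdam; Università della Svizzera italiana
Categories: Computation and Language (cs.CL)
Comments: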
Abstract:Analogical reasoning is an essential aspect of human cognition. In this paper, we summarize key theory about the processes underlying analogical reasoning from the cognitive science literature and relate it to current research in natural language processing. While these processes can be easily linked to concepts in NLP, they are generally not viewed through a cognitive lens. Furthermore, we show how these notions are relevant for several major challenges in NLP research, not directly related to analogy solving. This may guide researchers to better optimize relational understanding in text, as opposed to relying heavily on entity-level similarity.
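[NLP-20] MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems
【Quick read】: This paper addresses hallucination detection in Retrieval-Augmented Generation (RAG) systems, where a model produces confident but incorrect information that is inconsistent with the retrieved evidence. Existing methods such as SelfCheckGPT and MetaQA primarily target standalone large language models (LLMs) and do not handle the stricter context-consistency requirement of RAG. The key to the solution is MetaRAG, a real-time, unsupervised, black-box metamorphic testing framework. It (1) decomposes an answer into atomic factoids, (2) generates controlled mutations of each factoid through synonym and antonym substitutions, (3) verifies each variant against the retrieved context (synonym variants should be entailed, antonym variants contradicted), and (4) aggregates the inconsistency penalties into a response-level hallucination score. It also localizes unsupported claims to the factoid span where they occur (for example, claims concerning specific identity groups), which enables interpretable detection and identity-aware safeguards.
Link: https://arxiv.org/abs/2509.09360
Authors: Channdeth Sok, David Luz, Yacine Haddam
Affiliations: Forvia; ENSAE Paris; Institut Polytechnique de Paris
Categories: Computation and Language (cs.CL)
Comments: under review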
Abstract:Large Language Models (LLMs) are increasingly deployed in enterprise applications, yet their reliability remains limited by hallucinations, i.e., confident but factually incorrect information. Existing detection approaches, such as SelfCheckGPT and MetaQA, primarily target standalone LLMs and do not address the unique challenges of Retrieval-Augmented Generation (RAG) systems, where responses must be consistent with retrieved evidence. We therefore present MetaRAG, a metamorphic testing framework for hallucination detection in Retrieval-Augmented Generation (RAG) systems. MetaRAG operates in a real-time, unsupervised, black-box setting, requiring neither ground-truth references nor access to model internals, making it suitable for proprietary and high-stakes domains. The framework proceeds in four stages: (1) decompose answers into atomic factoids, (2) generate controlled mutations of each factoid using synonym and antonym substitutions, (3) verify each variant against the retrieved context (synonyms are expected to be entailed and antonyms contradicted), and (4) aggregate penalties for inconsistencies into a response-level hallucination score. Crucially for identity-aware AI, MetaRAG localizes unsupported claims at the factoid span where they occur (e.g., pregnancy-specific precautions, LGBTQ+ refugee rights, or labor eligibility), allowing users to see flagged spans and enabling system designers to configure thresholds and guardrails for identity-sensitive queries. Experiments on a proprietary enterprise dataset illustrate the effectiveness of MetaRAG for detecting hallucinations and enabling trustworthy deployment of RAG-based conversational agents. We also outline a topic-based deployment design that translates MetaRAG’s span-level scores into identity-aware safeguards; this design is discussed but not evaluated in our experiments.
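Stages (3) and (4) of the pipeline described above can be sketched as a small scoring routine. This is an illustration only: the verdict strings are assumed to come from some entailment checker (an LLM or NLI model, not shown), and both the per-factoid penalty (fraction of violated expectations) and the `max` aggregation are assumptions rather than the formula used in the paper.

```python
def factoid_penalty(synonym_verdicts, antonym_verdicts):
    """Fraction of variants whose verdict violates the metamorphic expectation.

    Synonym variants are expected to be 'entailed' by the retrieved context,
    antonym variants are expected to be 'contradicted'.
    """
    checks = [v == "entailed" for v in synonym_verdicts]
    checks += [v == "contradicted" for v in antonym_verdicts]
    if not checks:
        return 0.0
    return 1.0 - sum(checks) / len(checks)

def response_score(factoids, aggregate=max):
    """Return (response-level score, per-factoid penalties)."""
    penalties = {
        text: factoid_penalty(v["synonyms"], v["antonyms"])
        for text, v in factoids.items()
    }
    score = aggregate(penalties.values()) if penalties else 0.0
    return score, penalties

# Hypothetical verdicts for two factoids extracted from one answer.
factoids = {
    "refund window is 30 days": {
        "synonyms": ["entailed", "entailed"],
        "antonyms": ["contradicted"],
    },
    "refunds are paid in cash": {
        "synonyms": ["neutral", "entailed"],
        "antonyms": ["neutral"],
    },
}
score, per_factoid = response_score(factoids)
print(score, per_factoid)
```

Keeping the per-factoid penalties alongside the aggregate is what allows a flagged span to be shown to the user, in the spirit of the span-level localization the abstract describes.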
[NLP-21] OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning
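【Quick read】: This paper addresses two key limitations of current embodied systems built on multimodal large language models (MLLMs). The first is the Geometric Adaptability Gap: models that rely only on 2D inputs or on hard-coded 3D geometry injection either lack spatial information or generalize poorly in 2D, so they adapt badly to tasks with different spatial demands. The second is the Embodiment Constraint Gap: existing methods neglect the physical constraints and capacities of real robots, producing plans that are valid in theory but hard to execute in practice. The core of the solution is OmniEVA, an embodied versatile planner with two key innovations: (1) a Task-Adaptive 3D Grounding mechanism, in which a gated router explicitly and selectively regulates 3D fusion according to contextual requirements, enabling context-aware 3D grounding for diverse embodied tasks; and (2) an Embodiment-Aware Reasoning framework, which brings task goals and embodiment constraints jointly into the reasoning loop so that planning decisions are both goal-directed and executable. Experiments show strong and robust embodied reasoning and planning across a range of downstream scenarios.
Link: https://arxiv.org/abs/2509.09332
Authors: Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yuzheng Zhuang, Bowen Yang, He Zhu, Lingfeng Zhang, Pengwei Xie, David Gamaliel Arcos Bravo, Yingxue Zhang, Jianye Hao, Xingyue Quan
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: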
Abstract:Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. First, Geometric Adaptability Gap: models trained solely on 2D inputs or with hard-coded 3D geometry injection suffer from either insufficient spatial information or restricted 2D generalization, leading to poor adaptability across tasks with diverse spatial demands. Second, Embodiment Constraint Gap: prior work often neglects the physical constraints and capacities of real robots, resulting in task plans that are theoretically valid but practically infeasible. To address these gaps, we introduce OmniEVA, an embodied versatile planner that enables advanced embodied reasoning and task planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding mechanism, which introduces a gated router to perform explicit selective regulation of 3D fusion based on contextual requirements, enabling context-aware 3D grounding for diverse embodied tasks; and (2) an Embodiment-Aware Reasoning framework that jointly incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable. Extensive experimental results demonstrate that OmniEVA not only achieves state-of-the-art general embodied reasoning performance, but also exhibits a strong ability across a wide range of downstream scenarios. Evaluations of a suite of proposed embodied benchmarks, including both primitive and composite tasks, confirm its robust and versatile planning capabilities. Project page: this https URL
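[NLP-22] Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization
【Quick read】: This paper addresses the weak performance of current multimodal large language models (MLLMs) on understanding real-world materials characterization images, especially where expert-level domain knowledge and sophisticated visual perception are required. The key to the solution is MatCha, the first benchmark for materials characterization image understanding. It comprises 1,500 questions that demand expert-level domain knowledge, spanning four key stages of materials research and 21 distinct tasks that reflect the challenges materials scientists actually face. Evaluating state-of-the-art MLLMs on MatCha reveals a significant gap to human experts and shows that simple prompting strategies do little to close it, which gives a clear direction for improving MLLMs in materials science settings.
Link: https://arxiv.org/abs/2509.09307
Authors: Zhengzhao Lai, Youbin Zheng, Zhenyang Cai, Haonan Lyu, Jinpu Yang, Hongqing Liang, Yan Hu, Benyou Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: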
Abstract:Materials characterization is fundamental to acquiring materials information, revealing the processing-microstructure-property relationships that guide material design and optimization. While multimodal large language models (MLLMs) have recently shown promise in generative and predictive tasks within materials science, their capacity to understand real-world characterization imaging data remains underexplored. To bridge this gap, we present MatCha, the first benchmark for materials characterization image understanding, comprising 1,500 questions that demand expert-level domain expertise. MatCha encompasses four key stages of materials research comprising 21 distinct tasks, each designed to reflect authentic challenges faced by materials scientists. Our evaluation of state-of-the-art MLLMs on MatCha reveals a significant performance gap compared to human experts. These models exhibit degradation when addressing questions requiring higher-level expertise and sophisticated visual perception. Simple few-shot and chain-of-thought prompting struggle to alleviate these limitations. These findings highlight that existing MLLMs still exhibit limited adaptability to real-world materials characterization scenarios. We hope MatCha will facilitate future research in areas such as new material discovery and autonomous scientific agents. MatCha is available at this https URL.
[NLP-23] From scratch to silver: Creating trustworthy training data for patent-SDG classification using Large Language Models
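【Quick read】: This paper addresses the classification of patents by their relevance to the UN Sustainable Development Goals (SDGs). The core challenge is the absence of large-scale labeled data, which limits supervised learning; existing approaches based on keywords, transfer learning, or citation heuristics are limited in scalability and generalizability. The key to the solution is to frame patent-to-SDG classification as a weak supervision problem, using citations from patents to SDG-tagged non-patent literature (NPL) as a noisy initial signal. A composite labeling function (LF) uses large language models (LLMs) to extract structured concepts (functions, solutions, and applications) from patents and SDG papers based on a patent ontology, computes cross-domain similarity scores, and combines them with a rank-based retrieval approach. The LF is calibrated with a custom positive-only loss that aligns with known NPL-SDG links without penalizing the discovery of new associations. The result is a silver-standard, soft multi-label dataset that supports training multi-label regression models; internal and external validation show that it outperforms several baselines and yields greater thematic, cognitive, and organizational coherence.
Link: https://arxiv.org/abs/2509.09303
Authors: Grazia Sveva Ascione, Nicolò Tamagnone
Affiliations: Polytechnic University of Turin; Ca' Foscari University of Venice
Categories: Computation and Language (cs.CL)
Comments: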
Abstract:Classifying patents by their relevance to the UN Sustainable Development Goals (SDGs) is crucial for tracking how innovation addresses global challenges. However, the absence of a large, labeled dataset limits the use of supervised learning. Existing methods, such as keyword searches, transfer learning, and citation-based heuristics, lack scalability and generalizability. This paper frames patent-to-SDG classification as a weak supervision problem, using citations from patents to SDG-tagged scientific publications (NPL citations) as a noisy initial signal. To address its sparsity and noise, we develop a composite labeling function (LF) that uses large language models (LLMs) to extract structured concepts, namely functions, solutions, and applications, from patents and SDG papers based on a patent ontology. Cross-domain similarity scores are computed and combined using a rank-based retrieval approach. The LF is calibrated via a custom positive-only loss that aligns with known NPL-SDG links without penalizing discovery of new SDG associations. The result is a silver-standard, soft multi-label dataset mapping patents to SDGs, enabling the training of effective multi-label regression models. We validate our approach through two complementary strategies: (1) internal validation against held-out NPL-based labels, where our method outperforms several baselines including transformer-based models, and zero-shot LLM; and (2) external validation using network modularity in patent citation, co-inventor, and co-applicant graphs, where our labels reveal greater thematic, cognitive, and organizational coherence than traditional technological classifications. These results show that weak supervision and semantic alignment can enhance SDG classification at scale.
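The idea of a positive-only calibration objective can be illustrated with a short sketch. The exact loss used in the paper is not given in the abstract, so the form below (mean negative log-score over known positive pairs, with no term for unlabeled pairs) is an assumption, as are the function name and the toy patent and SDG identifiers.

```python
import math

def positive_only_loss(scores, known_positives):
    """Mean negative log-score over known (patent, SDG) links.

    scores: dict mapping (patent_id, sdg) to a score in (0, 1].
    Pairs that are not in known_positives contribute nothing, so a high
    score on an unlabeled pair is never penalized.
    """
    losses = [-math.log(scores[pair]) for pair in known_positives]
    return sum(losses) / len(losses)

scores = {
    ("p1", "SDG7"): 0.9,   # known link through an NPL citation
    ("p1", "SDG13"): 0.8,  # no known link: a possible new association
    ("p2", "SDG3"): 0.5,   # known link
}
known = [("p1", "SDG7"), ("p2", "SDG3")]
loss = positive_only_loss(scores, known)

# Raising or lowering the unlabeled score leaves the loss unchanged.
scores_alt = dict(scores)
scores_alt[("p1", "SDG13")] = 0.1
loss_alt = positive_only_loss(scores_alt, known)
print(loss, loss_alt)
```

A standard binary cross-entropy would instead treat every unlabeled pair as a negative and push its score toward zero, which is the behaviour the abstract says the authors wanted to avoid.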
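[NLP-24] Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning
【Quick read】: This paper addresses the stability and efficiency of policy optimization in preference-based reinforcement learning (RL), in particular how structured intermediate trajectories generated by Monte Carlo Tree Search (MCTS) can improve policy optimization methods that do not use a value network. The key to the solution is a staged training paradigm for Group Relative Policy Optimization (GRPO) in which completions are derived from partially revealed MCTS rollouts. This produces prefix-conditioned reward signals and a tree-structured setting for advantage estimation, which is intended to reflect compositional reasoning quality more accurately and to stabilize policy updates.
Link: https://arxiv.org/abs/2509.09284
Authors: Bingning Huang, Tu Nguyen, Matthieu Zimmer
Affiliations: Technical University Munich; Huawei R&D Munich; Huawei Noah’s Ark Lab
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: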
Abstract:Recent advances in reasoning with large language models (LLMs) have shown the effectiveness of Monte Carlo Tree Search (MCTS) for generating high-quality intermediate trajectories, particularly in math and symbolic domains. Inspired by this, we explore how MCTS-derived trajectories, traditionally used for training value or reward models, can be repurposed to improve policy optimization in preference-based reinforcement learning (RL). Specifically, we focus on Group Relative Policy Optimization (GRPO), a recent algorithm that enables preference-consistent policy learning without value networks. We propose a staged GRPO training paradigm where completions are derived from partially revealed MCTS rollouts, introducing a novel tree-structured setting for advantage estimation. This leads to a rich class of prefix-conditioned reward signals, which we analyze theoretically and empirically. Our initial results indicate that while structured advantage estimation can stabilize updates and better reflect compositional reasoning quality, challenges such as advantage saturation and reward signal collapse remain. We propose heuristic and statistical solutions to mitigate these issues and discuss open challenges for learning under staged or tree-like reward structures.
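GRPO scores each completion relative to the other completions in its group instead of using a value network. The sketch below shows that group normalization and a simple way to form groups by shared prefix; the prefix grouping, the function names, and the `eps` constant are illustrative assumptions and do not reproduce the staged, tree-structured estimator proposed in the paper.

```python
import statistics
from collections import defaultdict

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def prefix_grouped_advantages(samples):
    """samples: list of (prefix_id, reward); normalize within each prefix group."""
    groups = defaultdict(list)
    for idx, (prefix, reward) in enumerate(samples):
        groups[prefix].append((idx, reward))
    advantages = [0.0] * len(samples)
    for members in groups.values():
        normalized = group_relative_advantages([r for _, r in members])
        for (idx, _), adv in zip(members, normalized):
            advantages[idx] = adv
    return advantages

mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
uniform = group_relative_advantages([1.0, 1.0, 1.0])
by_prefix = prefix_grouped_advantages(
    [("a", 1.0), ("a", 0.0), ("b", 1.0), ("b", 1.0)]
)
print(mixed, uniform, by_prefix)
```

When every completion in a group receives the same reward, all advantages are zero and the group contributes no gradient. This is one simple way a learning signal can vanish under group normalization; whether it matches the saturation and collapse effects the authors analyze is not something the abstract specifies.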
[NLP-25] Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents ICLR2026
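【Quick read】: This paper addresses the difficulty that agents based on large language models (LLMs) have in assigning credit to intermediate steps in long-horizon tasks when rewards are sparse and outcome-based. The core issue is that, in LLM learning dynamics, the magnitude of policy gradients is inherently coupled with entropy: confident correct actions receive updates that are too small, while updates for uncertain actions can be large enough to destabilize training. The key to the solution is Entropy-Modulated Policy Gradients (EMPG), which re-calibrates the learning signal using step-wise uncertainty and the final task outcome: it amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. A bonus term for future clarity further encourages the agent to choose more predictable solution paths.
Link: https://arxiv.org/abs/2509.09265
Authors: Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang
Affiliations: ByteDance
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: ICLR 2026 Under review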
Abstract:In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge that sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, which leads to inefficient small updates for confident correct actions and potentially destabilizes large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines. Project page is at this https URL
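The qualitative behaviour described in this abstract (larger updates for confident steps, smaller ones for uncertain steps, with the sign set by the final outcome) can be sketched with a simple entropy-based weight. The exponential weight, the parameter `k`, and the mean-normalization below are assumptions made for illustration; they are not the EMPG formulas, and the future-clarity bonus is not modelled.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of one step's action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def modulated_advantages(step_dists, outcome_advantage, k=1.0):
    """Scale a trajectory-level advantage per step by a confidence weight.

    Each distribution must have at least two actions so that the entropy
    can be normalized to [0, 1] by log(number of actions).
    """
    normalized = [entropy(d) / math.log(len(d)) for d in step_dists]
    weights = [math.exp(-k * h) for h in normalized]  # low entropy -> high weight
    mean_weight = sum(weights) / len(weights)
    return [outcome_advantage * w / mean_weight for w in weights]

steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
]
success = modulated_advantages(steps, outcome_advantage=1.0)
failure = modulated_advantages(steps, outcome_advantage=-1.0)
print(success, failure)
```

With a positive outcome the confident step is amplified above the base advantage and the uncertain step is attenuated; with a negative outcome the confident step receives the larger penalty.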
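[NLP-26] Agentic LLMs for Question Answering over Tabular Data ACL SEMEVAL2025
【Quick read】: This paper addresses question answering over tabular data (Table QA), which is challenging because real-world tables vary in structure, size, and data types, particularly when it comes to understanding complex queries and generating accurate SQL. The key to the solution is a Natural Language to SQL (NL-to-SQL) approach based on large language models (LLMs), implemented as a multi-stage pipeline: example selection, SQL query generation, answer extraction, verification, and iterative refinement. The approach reaches 70.5% accuracy on DataBench QA and 71.6% on DataBench Lite QA, well above the baseline scores of 26% and 27%.
Link: https://arxiv.org/abs/2509.09234
Authors: Rishit Tyagi, Mohit Gupta, Rahul Bouri
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at ACL workshop SemEval 2025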
Abstract:Question Answering over Tabular Data (Table QA) presents unique challenges due to the diverse structure, size, and data types of real-world tables. The SemEval 2025 Task 8 (DataBench) introduced a benchmark composed of large-scale, domain-diverse datasets to evaluate the ability of models to accurately answer structured queries. We propose a Natural Language to SQL (NL-to-SQL) approach leveraging large language models (LLMs) such as GPT-4o, GPT-4o-mini, and DeepSeek v2:16b to generate SQL queries dynamically. Our system follows a multi-stage pipeline involving example selection, SQL query generation, answer extraction, verification, and iterative refinement. Experiments demonstrate the effectiveness of our approach, achieving 70.5% accuracy on DataBench QA and 71.6% on DataBench Lite QA, significantly surpassing baseline scores of 26% and 27% respectively. This paper details our methodology, experimental results, and alternative approaches, providing insights into the strengths and limitations of LLM-driven Table QA.
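The generate, execute, verify, and refine loop described above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not the authors' system: `fake_llm` is a hypothetical stand-in for a real model call, the `sales` table is invented for the example, and the verification step is reduced to checking that the query runs and returns rows.

```python
import sqlite3

def answer_with_refinement(conn, question, generate_sql, max_attempts=3):
    """Generate SQL, run it, and feed any error back into the next attempt."""
    feedback = None
    for _ in range(max_attempts):
        sql = generate_sql(question, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            feedback = str(exc)  # e.g. "no such column: cost"
            continue
        if rows:  # minimal verification: the query ran and returned something
            return sql, rows
        feedback = "query returned no rows"
    return None, []

def fake_llm(question, feedback):
    """Hypothetical model: the first attempt uses a wrong column name."""
    if feedback is None:
        return "SELECT AVG(cost) FROM sales"
    return "SELECT AVG(price) FROM sales"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("a", 10.0), ("b", 20.0)])

sql, rows = answer_with_refinement(conn, "What is the average price?", fake_llm)
print(sql, rows)
```

Passing the database error text back to the generator is one common way to implement iterative refinement; a real system would also include the table schema and selected examples in the prompt.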
[NLP-27] Reading Between the Lines: Classifying Resume Seniority with Large Language Models
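【Quick read】: This paper addresses the accurate assessment of candidate seniority from resumes, a task complicated by overstated experience and ambiguous self-presentation. The key to the solution is to evaluate large language models (LLMs), including fine-tuned BERT architectures, on a hybrid dataset that combines real-world resumes with synthetically generated hard examples designed to simulate exaggerated qualifications and understated seniority, testing whether the models pick up subtle linguistic cues of seniority inflation and implicit expertise.
Link: https://arxiv.org/abs/2509.09229
Authors: Matan Cohen, Shira Shani, Eden Menahem, Yehudit Aperstein, Alexander Apartsin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 5 pages, 3 figures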
Abstract:Accurately assessing candidate seniority from resumes is a critical yet challenging task, complicated by the prevalence of overstated experience and ambiguous self-presentation. In this study, we investigate the effectiveness of large language models (LLMs), including fine-tuned BERT architectures, for automating seniority classification in resumes. To rigorously evaluate model performance, we introduce a hybrid dataset comprising both real-world resumes and synthetically generated hard examples designed to simulate exaggerated qualifications and understated seniority. Using the dataset, we evaluate the performance of Large Language Models in detecting subtle linguistic cues associated with seniority inflation and implicit expertise. Our findings highlight promising directions for enhancing AI-driven candidate evaluation systems and mitigating bias introduced by self-promotional language. The dataset is available for the research community at this https URL
[NLP-28] Identifying Key Features for Establishing Sustainable Agro-Tourism Centre: A Data Driven Approach
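【Quick read】: This paper addresses how to identify the key factors that drive the growth of agro-tourism, in support of rural economic development alongside the preservation of traditional culture. The key to the solution is to combine a literature review with machine learning: the Least Absolute Shrinkage and Selection Operator (LASSO) is used for feature selection, and Logistic Regression (LR), Random Forest (RF), Decision Trees (DT), and Extreme Gradient Boosting (XGBoost) classifiers are used to validate the selected indicators. LR gives the best classification accuracy under both train-test splits (98% at 70-30 and 99% at 80-20), providing a data-driven basis for agro-tourism development strategies.
Link: https://arxiv.org/abs/2509.09214
Authors: Alka Gadakh, Vidya Kumbhar, Sonal Khosla, Kumar Karunendra
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: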
Abstract:Agro-tourism serves as a strategic economic model designed to facilitate rural development by diversifying income streams for local communities like farmers while promoting the conservation of indigenous cultural heritage and traditional agricultural practices. As a very booming subdomain of tourism, there is a need to study the strategies for the growth of Agro-tourism in detail. The current study has identified the important indicators for the growth and enhancement of agro-tourism. The study is conducted in two phases: identification of the important indicators through a comprehensive literature review and in the second phase state-of-the-art techniques were used to identify the important indicators for the growth of agro-tourism. The indicators are also called features synonymously, the machine learning models for feature selection were applied and it was observed that the Least Absolute Shrinkage and Selection Operator (LASSO) method combined with, the machine Learning Classifiers such as Logistic Regression (LR), Decision Trees (DT), Random Forest (RF) Tree, and Extreme Gradient Boosting (XGBOOST) models were used to suggest the growth of the agro-tourism. The results show that with the LASSO method, LR model gives the highest classification accuracy of 98% in 70-30% train-test data followed by RF with 95% accuracy. Similarly, in the 80-20% train-test data LR maintains the highest accuracy at 99%, while DT and XGBoost follow with 97% accuracy.
[NLP-29] Bona fide Cross Testing Reveals Weak Spot in Audio Deepfake Detection Systems INTERSPEECH2025
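【Quick read】: This paper addresses two problems in how audio deepfake detection (ADD) models are evaluated. First, datasets usually mix samples from many synthesizers, so a single Equal Error Rate (EER) over-weights synthesizers with more samples, which makes the evaluation less fair and less reliable. Second, most ADD datasets lack diversity in bona fide speech, often containing a single environment and speech style (for example clean read speech), so they simulate real-world conditions poorly. The key to the solution is bona fide cross-testing, an evaluation framework that incorporates diverse bona fide datasets, computes EERs on each subset separately, and then aggregates them for a more balanced, robust, and interpretable assessment.
Link: https://arxiv.org/abs/2509.09204
Authors: Chin Yuen Kwok, Jia Qi Yip, Zhen Qiu, Chi Hung Chi, Kwok Yan Lam
Affiliations: Digital Trust Centre; College of Computing and Data Science
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Published in Interspeech 2025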
Abstract:Audio deepfake detection (ADD) models are commonly evaluated using datasets that combine multiple synthesizers, with performance reported as a single Equal Error Rate (EER). However, this approach disproportionately weights synthesizers with more samples, underrepresenting others and reducing the overall reliability of EER. Additionally, most ADD datasets lack diversity in bona fide speech, often featuring a single environment and speech style (e.g., clean read speech), limiting their ability to simulate real-world conditions. To address these challenges, we propose bona fide cross-testing, a novel evaluation framework that incorporates diverse bona fide datasets and aggregates EERs for more balanced assessments. Our approach improves robustness and interpretability compared to traditional evaluation methods. We benchmark over 150 synthesizers across nine bona fide speech types and release a new dataset to facilitate further research at this https URL.
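The idea of computing an EER for each pairing of a bona fide set with a synthesizer, then aggregating, can be sketched as follows. The threshold sweep is a basic EER approximation, the unweighted mean is an assumed aggregation rule (the abstract does not specify one), and all scores and set names are invented; higher scores are taken to mean "more likely bona fide".

```python
def equal_error_rate(bona_scores, spoof_scores):
    """Approximate EER: the point where false accepts and false rejects are closest."""
    thresholds = sorted(set(bona_scores) | set(spoof_scores))
    best_gap, best_eer = float("inf"), None
    for thr in thresholds:
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)  # spoof accepted
        frr = sum(s < thr for s in bona_scores) / len(bona_scores)     # bona fide rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

def cross_test(bona_sets, spoof_sets):
    """EER for every (bona fide set, synthesizer) pair, plus their unweighted mean."""
    table = {
        (b_name, s_name): equal_error_rate(b_scores, s_scores)
        for b_name, b_scores in bona_sets.items()
        for s_name, s_scores in spoof_sets.items()
    }
    return table, sum(table.values()) / len(table)

bona_sets = {
    "read": [0.9, 0.8, 0.95, 0.85],
    "spontaneous": [0.6, 0.4, 0.7, 0.5],
}
spoof_sets = {
    "tts_a": [0.1, 0.2, 0.3, 0.15],
    "tts_b": [0.55, 0.45, 0.65, 0.3],
}
table, mean_eer = cross_test(bona_sets, spoof_sets)
print(table, mean_eer)
```

In this toy data the detector separates read speech from both synthesizers perfectly, yet it fails on the pairing of spontaneous speech with `tts_b`. A per-pair table makes that weak spot visible, whereas a single pooled EER would blur it.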
[NLP-30] CCF: A Context Compression Framework for Efficient Long-Sequence Language Modeling
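【Quick read】: This paper addresses the heavy computational and memory cost of scaling language models to longer contexts, which creates efficiency bottlenecks in both training and inference. The key to the solution is CCF (Context Compression Framework), which learns hierarchical latent representations that preserve global semantics while aggressively reducing input redundancy. CCF combines segment-wise semantic aggregation with key-value memory encoding to form compact representations that support accurate reconstruction and long-range understanding, and pairs incremental segment decoding with sparse reservoir sampling to cut memory overhead substantially without degrading performance.
Link: https://arxiv.org/abs/2509.09199
Authors: Wenhao Li, Bangcheng Sun, Weihao Ye, Tianyi Zhang, Daohai Yu, Fei Chao, Rongrong Ji
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: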
Abstract:Scaling language models to longer contexts is essential for capturing rich dependencies across extended discourse. However, naïve context extension imposes significant computational and memory burdens, often resulting in inefficiencies during both training and inference. In this work, we propose CCF, a novel context compression framework designed to enable efficient long-context modeling by learning hierarchical latent representations that preserve global semantics while aggressively reducing input redundancy. CCF integrates segment-wise semantic aggregation with key-value memory encoding, forming compact representations that support accurate reconstruction and long-range understanding. To further enhance scalability, we introduce a training-efficient optimization strategy that couples incremental segment decoding with sparse reservoir sampling, substantially reducing memory overhead without degrading performance. Empirical results on multiple long-context language modeling benchmarks demonstrate that CCF achieves competitive perplexity under high compression ratios, and significantly improves throughput and memory efficiency compared to existing approaches. These findings highlight the potential of structured compression for scalable and effective long-context language modeling.
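The abstract mentions reservoir sampling as part of the training strategy. For readers unfamiliar with the term, the classic algorithm (often called Algorithm R) keeps a uniform random sample of fixed size `k` from a stream while storing only `k` items. The sketch below shows that textbook version; how CCF adapts it to segments ("sparse" reservoir sampling) is not described in the abstract and is not reproduced here.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Uniform sample of up to k items from an iterable, using O(k) memory."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                reservoir[j] = item  # replace with probability k / (i + 1)
    return reservoir

# Keep 8 of 1,000 hypothetical segment indices without storing all of them.
kept = reservoir_sample(range(1000), k=8, rng=random.Random(0))
print(kept)
```

The appeal for long sequences is that memory stays bounded by `k` regardless of how many segments have been seen.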
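[NLP-31] GmSLM: Generative Marmoset Spoken Language Modeling
【Quick read】: This paper addresses the difficulty of studying the link between vocal communication and brain activity in a nonhuman primate, the marmoset, and in particular the fact that standard generative AI approaches cannot be applied directly to marmoset vocal communication. The key to the solution is Generative Marmoset Spoken Language Modeling (GmSLM), a spoken language model pipeline optimized for marmoset vocalizations. Using unsupervised in-the-wild data together with weakly labeled conversational data, the authors design novel zero-shot evaluation metrics; without manual annotation, the model distinguishes real from artificial conversations and performs well on downstream tasks, offering a practical computational framework for linking vocal behaviour to neural mechanisms.
Link: https://arxiv.org/abs/2509.09198
Authors: Talia Sternberg, Michael London, David Omer, Yossi Adi
Affiliations: The School of Computer Science and Engineering; The Edmond and Lily Safra Center for Brain Sciences (ELSC); Hebrew University of Jerusalem
Categories: Computation and Language (cs.CL)
Comments: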
Abstract:Marmoset monkeys exhibit complex vocal communication, challenging the view that nonhuman primates vocal communication is entirely innate, and show similar features of human speech, such as vocal labeling of others and turn-taking. Studying their vocal communication offers a unique opportunity to link it with brain activity-especially given the difficulty of accessing the human brain in speech and language research. Since Marmosets communicate primarily through vocalizations, applying standard LLM approaches is not straightforward. We introduce Generative Marmoset Spoken Language Modeling (GmSLM), an optimized spoken language model pipeline for Marmoset vocal communication. We designed a novel zero-shot evaluation metrics using unsupervised in-the-wild data, alongside weakly labeled conversational data, to assess GmSLM and demonstrate its advantage over a basic human-speech-based baseline. GmSLM generated vocalizations closely matched real resynthesized samples acoustically and performed well on downstream tasks. Despite being fully unsupervised, GmSLM effectively distinguish real from artificial conversations and may support further investigations of the neural basis of vocal communication and provides a practical framework linking vocalization and brain activity. We believe GmSLM stands to benefit future work in neuroscience, bioacoustics, and evolutionary biology. Samples are provided under: this http URL.
[NLP-32] Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost Function INTERSPEECH2025
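[Quick read]: This paper targets the low accuracy of rare word recognition in automatic speech recognition (ASR). The approach uses contextual biasing, which adds a biasing module to the model architecture and trains it on synthetic rare word data. The authors enhance the TCPGen-based biasing approach and propose a keyword-aware loss function with two complementary terms: a masked cross-entropy term for predicting biased words and a binary classification term for detecting biased word positions. Together these terms support the decoding of biased words at inference time. Adapting Whisper to 10 hours of synthetic data reduces the word error rate (WER) on the NSC Part 2 test set from 29.71% to 11.81%.
Link: https://arxiv.org/abs/2509.09197
Authors: Chin Yuen Kwok,Jia Qi Yip,Eng Siong Chng
Affiliations: Nanyang Technological University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Published in Interspeech 2025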
Abstract:Rare word recognition can be improved by adapting ASR models to synthetic data that includes these words. Further improvements can be achieved through contextual biasing, which trains and adds a biasing module into the model architecture to prioritize rare words. While training the module on synthetic rare word data is more effective than using non-rare-word data, it can lead to overfitting due to artifacts in the synthetic audio. To address this, we enhance the TCPGen-based contextual biasing approach and propose a keyword-aware loss function that additionally focuses on biased words when training biasing modules. This loss includes a masked cross-entropy term for biased word prediction and a binary classification term for detecting biased word positions. These two terms complementarily support the decoding of biased words during inference. By adapting Whisper to 10 hours of synthetic data, our method reduced the word error rate on the NSC Part 2 test set from 29.71% to 11.81%.
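The two loss terms named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `keyword_aware_loss`, the `weight` parameter, and the way the two terms are summed are all assumptions, since the abstract does not say how the terms are weighted or combined with the standard ASR loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def keyword_aware_loss(token_logits, targets, is_biased, position_logits, weight=1.0):
    """Hypothetical sketch: masked cross-entropy plus a binary position term.

    token_logits    -- per-position vocabulary logits
    targets         -- per-position target token ids
    is_biased       -- per-position flag, True if the token belongs to a biased word
    position_logits -- per-position logit for "this position is part of a biased word"
    weight          -- assumed mixing weight between the two terms
    """
    # Masked cross-entropy: only positions inside biased words contribute.
    ce_terms = [
        -math.log(softmax(logits)[target])
        for logits, target, flag in zip(token_logits, targets, is_biased)
        if flag
    ]
    masked_ce = sum(ce_terms) / len(ce_terms) if ce_terms else 0.0

    # Binary cross-entropy over all positions: biased-word position or not.
    bce_terms = []
    for logit, flag in zip(position_logits, is_biased):
        prob = 1.0 / (1.0 + math.exp(-logit))
        bce_terms.append(-math.log(prob) if flag else -math.log(1.0 - prob))
    bce = sum(bce_terms) / len(bce_terms)

    return masked_ce + weight * bce
```

With uniform logits over two tokens and zero position logits, both terms equal ln 2, which gives a quick sanity check of the arithmetic.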
[NLP-33] Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition INTERSPEECH2025
[Quick read]: This paper addresses a bottleneck in rare word recognition for automatic speech recognition (ASR) models. Conventional Trie-based contextual biasing must revoke the bonus scores given to partial hypotheses that never complete a rare word, a step that is limited to beam search and is computationally expensive. The key idea is to fine-tune the model (here Whisper) to look ahead and predict multiple steps at once, so that it can better estimate whether a partial hypothesis will lead to the full rare word. This avoids the revocation step entirely, improving recognition accuracy while reducing computational cost.
Link: https://arxiv.org/abs/2509.09196
Authors: Chin Yuen Kwok,Jia Qi Yip
Affiliations: Nanyang Technological University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Published in Interspeech 2025
Abstract:Contextual biasing improves rare word recognition of ASR models by prioritizing the output of rare words during decoding. A common approach is Trie-based biasing, which gives “bonus scores” to partial hypotheses (e.g. “Bon”) that may lead to the generation of the rare word (e.g. “Bonham”). If the full word (“Bonham”) isn’t ultimately recognized, the system revokes those earlier bonuses. This revocation is limited to beam search and is computationally expensive, particularly for models with large decoders. To overcome these limitations, we propose adapting ASR models to look ahead and predict multiple steps at once. This avoids the revocation step entirely by better estimating whether a partial hypothesis will lead to the generation of the full rare word. By fine-tuning Whisper with only 10 hours of synthetic data, our method reduces the word error rate on the NSC Part 2 test set from 30.86% to 12.19%.
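The conventional bonus-and-revocation scheme that the abstract describes (the baseline, not the proposed K-step method) can be sketched at character level. The helper names `build_trie` and `bias_score` and the fixed `bonus` value are hypothetical; real systems operate on subword tokens inside beam search.

```python
def build_trie(words):
    """Build a character trie; the key None marks the end of a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def bias_score(trie, hypothesis, bonus=1.0):
    """Accumulate a bonus per matched character, revoking it on a mismatch.

    Returns the total bonus while the hypothesis is still a prefix of some
    rare word, and 0.0 once it leaves the trie (the revocation step).
    """
    node = trie
    total = 0.0
    for ch in hypothesis:
        if ch in node:
            node = node[ch]
            total += bonus
        else:
            return 0.0  # hypothesis can no longer become the rare word
    return total


trie = build_trie(["Bonham"])
print(bias_score(trie, "Bon"))     # 3.0
print(bias_score(trie, "Bonham"))  # 6.0
print(bias_score(trie, "Bonus"))   # 0.0
```

The last call shows the cost the paper aims to avoid: the bonus granted for "Bon" has to be taken back once the hypothesis turns into "Bonus".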
[NLP-34] EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
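[Quick read]: This paper addresses the degradation in knowledge and reasoning that speech-to-speech large language models (SLLMs) show relative to the text-based large language models (LLMs) they are derived from. The authors hypothesize that current SLLM training paradigms fail to bridge the acoustic-semantic gap in the feature representation space. Their solution, EchoX, leverages semantic representations and dynamically generates speech training targets, integrating acoustic and semantic learning so that the model keeps strong reasoning ability as a speech LLM.
Link: https://arxiv.org/abs/2509.09174
Authors: Yuhao Zhang,Yuhao Du,Zhanchen Dai,Xiangnan Ma,Kaiqi Kou,Benyou Wang,Haizhou Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Notes: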
Abstract:Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at this https URL.
[NLP-35] Target-oriented Multimodal Sentiment Classification with Counterfactual-enhanced Debiasing ICME2025
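[Quick read]: This paper addresses spurious correlations in target-oriented multimodal sentiment classification that arise from over-reliance on textual content and from dataset biases, in particular word-level contextual biases, and that impair classification accuracy. The proposed counterfactual-enhanced debiasing framework has two parts. First, a counterfactual data augmentation strategy minimally alters sentiment-related causal features to generate detail-matched image-text samples, guiding the model's attention toward sentiment-relevant content. Second, an adaptive debiasing contrastive learning mechanism learns robust features from the counterfactual data and mitigates the influence of biased words.
Link: https://arxiv.org/abs/2509.09160
Authors: Zhiyue Liu,Fanrong Ma,Xin Ling
Affiliations: Guangxi University; Guangxi Key Laboratory of Multimedia Communications and Network Technology; Sun Yat-sen University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted by the IEEE International Conference on Multimedia and Expo (ICME 2025). © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses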
Abstract:Target-oriented multimodal sentiment classification seeks to predict sentiment polarity for specific targets from image-text pairs. While existing works achieve competitive performance, they often over-rely on textual content and fail to consider dataset biases, in particular word-level contextual biases. This leads to spurious correlations between text features and output labels, impairing classification accuracy. In this paper, we introduce a novel counterfactual-enhanced debiasing framework to reduce such spurious correlations. Our framework incorporates a counterfactual data augmentation strategy that minimally alters sentiment-related causal features, generating detail-matched image-text samples to guide the model’s attention toward content tied to sentiment. Furthermore, for learning robust features from counterfactual data and prompting model decisions, we introduce an adaptive debiasing contrastive learning mechanism, which effectively mitigates the influence of biased words. Experimental results on several benchmark datasets show that our proposed method outperforms state-of-the-art baselines.
[NLP-36] LITcoder: A General-Purpose Library for Building and Comparing Encoding Models
[Quick read]: This paper addresses the high technical barrier, inconsistent methodology, and difficulty of systematic comparison and extension that affect the building and evaluation of neural encoding models. The solution is LITcoder, an open-source library offering a standardized, modular pipeline that covers aligning continuous stimuli (e.g., text and speech) with brain data, extracting features, mapping features onto brain data, and evaluating predictive performance. It supports a range of methodological choices, including brain datasets, brain regions, stimulus features (both neural-net-based and control), and downsampling approaches, and it includes logging, plotting, and integration with experiment tracking platforms such as Weights & Biases. This lowers implementation effort, promotes methodological rigor, and makes models and datasets easier to compare.
Link: https://arxiv.org/abs/2509.09152
Authors: Taha Binhuraib,Ruimin Gao,Anna A. Ivanova
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Notes:
Abstract:We introduce LITcoder, an open-source library for building and benchmarking neural encoding models. Designed as a flexible backend, LITcoder provides standardized tools for aligning continuous stimuli (e.g., text and speech) with brain data, transforming stimuli into representational features, mapping those features onto brain data, and evaluating the predictive performance of the resulting model on held-out data. The library implements a modular pipeline covering a wide array of methodological design choices, so researchers can easily compose, compare, and extend encoding models without reinventing core infrastructure. Such choices include brain datasets, brain regions, stimulus features (both neural-net-based and control, such as word rate), downsampling approaches, and many others. In addition, the library provides built-in logging, plotting, and seamless integration with experiment tracking platforms such as Weights & Biases (W&B). We demonstrate the scalability and versatility of our framework by fitting a range of encoding models to three story listening datasets: LeBel et al. (2023), Narratives, and Little Prince. We also explore the methodological choices critical for building encoding models for continuous fMRI data, illustrating the importance of accounting for all tokens in a TR scan (as opposed to just taking the last one, even when contextualized), incorporating hemodynamic lag effects, using train-test splits that minimize information leakage, and accounting for head motion effects on encoding model predictivity. Overall, LITcoder lowers technical barriers to encoding model implementation, facilitates systematic comparisons across models and datasets, fosters methodological rigor, and accelerates the development of high-quality, high-performance predictive models of brain activity.
Project page: this https URL
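Two of the methodological points in the LITcoder abstract, using all tokens within a TR and modeling hemodynamic lag, can be sketched in plain Python. This is not LITcoder's API: the function names, the mean-pooling choice, and the delayed-copy representation of lag are assumptions made for illustration only.

```python
def pool_tokens_per_tr(token_times, token_features, tr_seconds, n_trs):
    """Average the feature vectors of all tokens that fall inside each TR window."""
    dim = len(token_features[0])
    sums = [[0.0] * dim for _ in range(n_trs)]
    counts = [0] * n_trs
    for onset, features in zip(token_times, token_features):
        index = int(onset // tr_seconds)
        if 0 <= index < n_trs:
            counts[index] += 1
            for j, value in enumerate(features):
                sums[index][j] += value
    # TRs with no tokens keep an all-zero feature vector.
    return [
        [s / count for s in row] if count else row
        for row, count in zip(sums, counts)
    ]


def add_lags(tr_features, lags=(1, 2, 3, 4)):
    """Concatenate delayed copies of the TR features (assumed lag model).

    Row i of the output holds the features from TRs i-lag for each lag,
    zero-padded where i-lag would fall before the start of the scan.
    """
    dim = len(tr_features[0])
    lagged = []
    for i in range(len(tr_features)):
        row = []
        for lag in lags:
            source = i - lag
            row.extend(tr_features[source] if source >= 0 else [0.0] * dim)
        lagged.append(row)
    return lagged
```

A regression model (for example ridge regression) would then map the lagged features onto the measured response of each voxel or region.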
[NLP-37] ViRanker: A BGE-M3 Blockwise Parallel Transformer Cross-Encoder for Vietnamese Reranking
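[Quick read]: This paper addresses the lack of competitive rerankers for information retrieval in Vietnamese, a low-resource language whose complex syntax and diacritics pose particular challenges. The solution is ViRanker, a cross-encoder built on the BGE-M3 encoder and enhanced with the Blockwise Parallel Transformer, trained on an 8 GB curated corpus and fine-tuned with hybrid hard-negative sampling. On the MMARCO-VI benchmark it achieves strong early-rank accuracy, surpassing multilingual baselines and competing closely with PhoRanker.
Link: https://arxiv.org/abs/2509.09131
Authors: Phuong-Nam Dang,Kieu-Linh Nguyen,Thanh-Hieu Pham
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 9 pages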
Abstract:This paper presents ViRanker, a cross-encoder reranking model tailored to the Vietnamese language. Built on the BGE-M3 encoder and enhanced with the Blockwise Parallel Transformer, ViRanker addresses the lack of competitive rerankers for Vietnamese, a low-resource language with complex syntax and diacritics. The model was trained on an 8 GB curated corpus and fine-tuned with hybrid hard-negative sampling to strengthen robustness. Evaluated on the MMARCO-VI benchmark, ViRanker achieves strong early-rank accuracy, surpassing multilingual baselines and competing closely with PhoRanker. By releasing the model openly on Hugging Face, we aim to support reproducibility and encourage wider adoption in real-world retrieval systems. Beyond Vietnamese, this study illustrates how careful architectural adaptation and data curation can advance reranking in other underrepresented languages.
[NLP-38] Automated Classification of Tutors' Dialogue Acts Using Generative AI: A Case Study Using the CIMA Corpus
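[Quick read]: This paper addresses the time and effort required to manually code tutors' Dialogue Acts (DAs) in educational dialogue analysis by using generative AI to automate the classification. Using the open-source CIMA corpus, the authors test GPT-3.5-turbo and GPT-4 with tailored prompts. GPT-4 reaches 80% accuracy, a weighted F1-score of 0.81, and a Cohen's Kappa of 0.74, surpassing baseline performance and indicating substantial agreement with human annotations. The study also highlights the role of task-specific label definitions and contextual information in improving automated annotation quality.
Link: https://arxiv.org/abs/2509.09125
Authors: Liqun He,Jiaqi Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted for publication in the journal Reflecting Digital Learning. First submitted: 30 Oct 2023. The final version will be available open access via the journal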
Abstract:This study explores the use of generative AI for automating the classification of tutors’ Dialogue Acts (DAs), aiming to reduce the time and effort required by traditional manual coding. This case study uses the open-source CIMA corpus, in which tutors’ responses are pre-annotated into four DA categories. Both GPT-3.5-turbo and GPT-4 models were tested using tailored prompts. Results show that GPT-4 achieved 80% accuracy, a weighted F1-score of 0.81, and a Cohen’s Kappa of 0.74, surpassing baseline performance and indicating substantial agreement with human annotations. These findings suggest that generative AI has strong potential to provide an efficient and accessible approach to DA classification, with meaningful implications for educational dialogue analysis. The study also highlights the importance of task-specific label definitions and contextual information in enhancing the quality of automated annotation. Finally, it underscores the ethical considerations associated with the use of generative AI and the need for responsible and transparent research practices. The script of this research is publicly available at this https URL.
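Cohen's Kappa, one of the agreement metrics reported above, corrects raw agreement for the agreement expected by chance. The sketch below uses the standard two-rater formula; the labels in the usage lines are placeholders, not the four CIMA categories, and the function name is ours rather than the paper's.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences.

    Assumes the raters do not both use a single identical label throughout,
    which would make the chance-agreement denominator zero.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


human = ["A", "A", "B", "B"]
model = ["A", "B", "B", "B"]
print(cohens_kappa(human, model))  # 0.5
```

Here raw agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5, which illustrates why kappa is typically lower than accuracy on the same predictions.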
[NLP-39] Compass-v3: Scaling Domain-Specific LLMs for Multilingual E-Commerce in Southeast Asia
[Quick read]: This paper addresses the performance drop of large language models (LLMs) in specialized domains, here Southeast Asian e-commerce, where data are noisy, heterogeneous, multilingual, and highly dynamic. The solution is Compass-v3, a vertical-domain Mixture-of-Experts (MoE) model with 245B total parameters and 71B active per token. It uses fewer but larger experts together with hardware-efficient optimizations (intra-node expert parallelism and a customized memcpy operator) to maximize GPU utilization. The model is trained with a mixed-training strategy on 12T tokens of curated multilingual corpora and large-scale synthetic e-commerce instructions, and aligned with Optimal-Transport Direct Preference Optimization (OTPO), which captures token-level distinctions and improves instruction adherence. The authors report that it surpasses current mainstream models on e-commerce tasks while remaining strong on low-resource Southeast Asian languages.
Link: https://arxiv.org/abs/2509.09121
Authors: Sophia Maria
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes:
Abstract:Large language models (LLMs) excel in general-domain applications, yet their performance often degrades in specialized tasks requiring domain-specific knowledge. E-commerce is particularly challenging, as its data are noisy, heterogeneous, multilingual, and highly dynamic. We present Compass-v3, a vertical-domain Mixture-of-Experts (MoE) model with 245B total parameters and 71B active per token, designed for Southeast Asian e-commerce. Compass-v3 adopts fewer but larger experts, combined with hardware-efficient optimizations, such as intra-node expert parallelism and a customized memcpy operator, to maximize GPU utilization. The model is trained on 12T tokens of curated multilingual corpora and large-scale synthetic e-commerce instructions using a mixed-training strategy. To enhance alignment, we propose Optimal-Transport Direct Preference Optimization (OTPO), which captures token-level distinctions and improves instruction adherence in commerce-specific scenarios. Extensive evaluations demonstrate that Compass-v3 delivers state-of-the-art e-commerce performance, surpassing DeepSeek-V3.1, GPT-4 series, and Qwen3-235B. Moreover, Compass-v3 demonstrates strong multilingual capability across low-resource Southeast Asian languages (Indonesian, Thai, Filipino, Vietnamese, Malay, Tagalog) and Portuguese while sustaining competitive performance on general benchmarks. It has already been widely applied in Shopee’s industrial-scale e-commerce platform and is gradually replacing OpenAI’s traffic, now accounting for over 70% of total LLM usage, highlighting its dual strengths in specialized commerce expertise and broad linguistic competence.
[NLP-40] TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla
[Quick read]: This paper addresses the severe shortage of data for Bangla in code-generation large language models (Code LLMs). Although Bangla is the 5th most spoken language, the scarcity of high-quality data for pre-training and fine-tuning leaves existing multilingual and general-purpose models performing poorly on it. The solution is a set of resources: a Bangla code instruction dataset for programming domain adaptation; MBPP-Bangla, an evaluation benchmark for Bangla code generation; and the TigerCoder family of Code LLMs, which achieves gains of roughly 11-18% at Pass@1 over existing models. The results suggest that curated, high-quality datasets can offset the limitations of smaller models for low-resource languages.
Link: https://arxiv.org/abs/2509.09101
Authors: Nishat Raihan,Antonios Anastasopoulos,Marcos Zampieri
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes:
Abstract:Despite being the 5th most spoken language, Bangla remains underrepresented in Large Language Models (LLMs), particularly for code generation. This primarily stems from the scarcity of high-quality data to pre-train and/or finetune such models. Hence, we introduce the first dedicated family of Code LLMs for Bangla (1B and 9B). We offer three major contributions: (1) comprehensive Bangla code instruction datasets for programming domain adaptation; (2) MBPP-Bangla, an evaluation benchmark for Bangla code generation; and (3) the TigerCoder-family of Code LLMs, achieving significant ~11-18% performance gains at Pass@1 over existing multilingual and general-purpose Bangla LLMs. Our findings show that curated, high-quality datasets can overcome limitations of smaller models for low-resource languages. We open-source all resources to advance further Bangla LLM research.
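Pass@1 is the k = 1 case of the pass@k metric used widely in code-generation evaluation. The sketch below uses the commonly cited unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c pass the tests. Whether the TigerCoder evaluation uses this exact estimator is an assumption; the abstract only names the metric.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n generated samples, c of which pass all tests."""
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples and 3 passing, pass@1 reduces to the pass fraction c / n.
print(round(pass_at_k(10, 3, 1), 6))  # 0.3
```

For k = 1 the estimator is simply c / n, so a gain "at Pass@1" is a gain in the share of problems solved by a single sampled program.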
[NLP-41] MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction
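[Quick read]: This paper addresses the insufficient performance of large language models (LLMs) on universal information extraction (UIE), especially in structured output scenarios that involve complex schema descriptions and require multi-step reasoning. The key idea is to combine reinforcement learning (RL) with multi-perspective reasoning, turning LLMs from passive extractors into active reasoners that understand not only what to extract but also how to reason. Experiments on multiple information extraction benchmarks show consistent accuracy gains across domains and better generalization on complex tasks, underscoring the role of reasoning in challenging scenarios.
Link: https://arxiv.org/abs/2509.09082
Authors: Zhongqiu Li,Shiquan Wang,Ruiyu Fang,Mengjiao Bao,Zhenhe Wu,Shuangyong Song,Yongxiang Li,Zhongjiang He
Affiliations: China Telecom
Categories: Computation and Language (cs.CL)
Notes: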
Abstract:Large language models (LLMs) demonstrate robust capabilities across diverse research domains. However, their performance in universal information extraction (UIE) remains insufficient, especially when tackling structured output scenarios that involve complex schema descriptions and require multi-step reasoning. While existing approaches enhance the performance of LLMs through in-context learning and instruction tuning, significant limitations nonetheless persist. To enhance the model’s generalization ability, we propose integrating reinforcement learning (RL) with multi-perspective reasoning for information extraction (IE) tasks. Our work transitions LLMs from passive extractors to active reasoners, enabling them to understand not only what to extract but also how to reason. Experiments conducted on multiple IE benchmarks demonstrate that MR-UIE consistently elevates extraction accuracy across domains and surpasses state-of-the-art methods on several datasets. Furthermore, incorporating multi-perspective reasoning into RL notably enhances generalization in complex IE tasks, underscoring the critical role of reasoning in challenging scenarios.
[NLP-42] Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M
[Quick read]: This paper addresses the safety and helpfulness problems that arise when a language model is insufficiently aligned, asking which fine-tuning strategy best reduces harmful output and improves helpfulness. It compares three alignment methods on OPT-350M: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and the two combined (SFT+DPO). Three metrics are introduced: Harmlessness Rate (HmR), Helpfulness Rate (HpR), and a Combined Alignment Score (CAS). SFT alone outperforms DPO alone, but the model trained with both SFT and DPO is best on all metrics, indicating that the two techniques are complementary.
Link: https://arxiv.org/abs/2509.09055
Authors: Piyush Pant
Affiliations: Saarland University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 17 pages, 3 figures. Code and dataset available at this https URL
Abstract:This research investigates the effectiveness of alignment techniques, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a combined SFT+DPO approach on improving the safety and helpfulness of the OPT-350M language model. Utilizing the Anthropic Helpful-Harmless RLHF dataset, we train and evaluate four models: the base OPT-350M, an SFT model, a DPO model, and a model trained with both SFT and DPO. We introduce three key evaluation metrics: Harmlessness Rate (HmR), Helpfulness Rate (HpR), and a Combined Alignment Score (CAS), all derived from reward model outputs. The results show that while SFT outperforms DPO, the combined SFT+DPO model outperforms all others across all metrics, demonstrating the complementary nature of these techniques. Our findings also highlight challenges posed by noisy data, limited GPU resources, and training constraints. This study offers a comprehensive view of how fine-tuning strategies affect model alignment and provides a foundation for more robust alignment pipelines in future work.
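For readers unfamiliar with DPO, the per-example objective can be sketched as below. It is the standard DPO loss computed from summed sequence log-probabilities under the policy and a frozen reference model. The `beta` value is a placeholder, not a setting taken from this paper, and the function name is ours.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    The margin compares how much more the policy prefers the chosen response
    over the rejected one, relative to the reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)); guard against overflow for very negative x.
    return math.log1p(math.exp(-x)) if x > -30.0 else -x


# When policy and reference agree, the margin is zero and the loss is ln 2.
print(round(dpo_loss(-1.0, -1.0, -1.0, -1.0), 4))  # 0.6931
```

In an SFT+DPO pipeline of the kind the abstract describes, the SFT model would typically serve as both the starting policy and the frozen reference; that pairing is an assumption here, since the abstract does not spell it out.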
[NLP-43] Stated Preference for Interaction and Continued Engagement (SPICE): Evaluating an LLM's Willingness to Re-engage in Conversation
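[Quick read]: This paper asks how to assess a large language model's (LLM's) own disposition toward different kinds of user behavior, specifically whether it would want to keep interacting with a user. Existing approaches tend to rely on external annotation or classification tasks and do not capture the model's stated judgment of the interaction. The proposed diagnostic signal, Stated Preference for Interaction and Continued Engagement (SPICE), is elicited by asking the model a YES or NO question about its willingness to re-engage after it reviews a short transcript. Experiments show that SPICE discriminates sharply by user tone (friendly, unclear, abusive), remains decisive under several dependence-aware statistical tests, and provides a signal distinct from abuse classification: even when a model failed to identify abuse, it still mostly stated a preference not to continue. The authors present it as a low-overhead, reproducible tool for auditing model dispositions.
Link: https://arxiv.org/abs/2509.09043
Authors: Thomas Manuel Rost,Martina Figlia,Bernd Wallraff
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Notes: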
Abstract:We introduce and evaluate Stated Preference for Interaction and Continued Engagement (SPICE), a simple diagnostic signal elicited by asking a Large Language Model a YES or NO question about its willingness to re-engage with a user’s behavior after reviewing a short transcript. In a study using a 3-tone (friendly, unclear, abusive) by 10-interaction stimulus set, we tested four open-weight chat models across four framing conditions, resulting in 480 trials. Our findings show that SPICE sharply discriminates by user tone. Friendly interactions yielded a near-unanimous preference to continue (97.5% YES), while abusive interactions yielded a strong preference to discontinue (17.9% YES), with unclear interactions falling in between (60.4% YES). This core association remains decisive under multiple dependence-aware statistical tests, including Rao-Scott adjustment and cluster permutation tests. Furthermore, we demonstrate that SPICE provides a distinct signal from abuse classification. In trials where a model failed to identify abuse, it still overwhelmingly stated a preference not to continue the interaction (81% of the time). An exploratory analysis also reveals a significant interaction effect: a preamble describing the study context significantly impacts SPICE under ambiguity, but only when transcripts are presented as a single block of text rather than a multi-turn chat. The results validate SPICE as a robust, low-overhead, and reproducible tool for auditing model dispositions, complementing existing metrics by offering a direct, relational signal of a model’s state. All stimuli, code, and analysis scripts are released to support replication.
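The trial count in the abstract follows from fully crossing the design factors: 3 tones x 10 interactions x 4 models x 4 framing conditions = 480. The sketch below enumerates that grid; the placeholder identifiers for interactions, models, and framings are hypothetical, since the abstract does not name them.

```python
from itertools import product


def design_grid(tones, n_interactions, n_models, n_framings):
    """Enumerate every (tone, interaction, model, framing) combination."""
    return list(product(tones, range(n_interactions), range(n_models), range(n_framings)))


trials = design_grid(["friendly", "unclear", "abusive"], 10, 4, 4)
print(len(trials))  # 480
```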
[NLP-44] COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation
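[Quick read]: This paper addresses the severe lack of resources for Urdu in multimodal and vision-language research, a gap that limits the development of Urdu-capable systems and reinforces language bias in multilingual vision-language models trained mainly on high-resource languages. The solution is COCO-Urdu, a large-scale image-caption dataset derived from MS COCO with 59,000 images and 319,000 Urdu captions, selected through stratified sampling to preserve the original distribution. Captions are validated with a hybrid multimodal quality estimation framework that combines COMET-Kiwi (translation quality), CLIP-based similarity (visual grounding), and BERTScore with back-translation (semantic consistency), and low-scoring captions are iteratively refined with open-source large language models. The dataset and pipeline are released to reduce language bias in multimodal research and support inclusive vision-language systems.
Link: https://arxiv.org/abs/2509.09014
Authors: Umair Hassan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Notes: 17 pages, 3 figures, 3 tables. Dataset available at this https URL . Scripts and notebooks to reproduce results available at this https URL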
Abstract:Urdu, spoken by over 250 million people, remains critically under-served in multimodal and vision-language research. The absence of large-scale, high-quality datasets has limited the development of Urdu-capable systems and reinforced biases in multilingual vision-language models trained primarily on high-resource languages. To address this gap, we present COCO-Urdu, a large-scale image-caption dataset derived from MS COCO, containing 59,000 images and 319,000 Urdu captions selected through stratified sampling to preserve the original distribution. Captions were translated using SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework that integrates COMET-Kiwi for translation quality, CLIP-based similarity for visual grounding, and BERTScore with back-translation for semantic consistency; low-scoring captions were iteratively refined using open-source large language models. We further benchmark COCO-Urdu on BLEU, SacreBLEU, and chrF, reporting consistently strong results. To the best of our knowledge, COCO-Urdu is the largest publicly available Urdu captioning dataset. By releasing both the dataset and the quality estimation pipeline, we aim to reduce language bias in multimodal research and establish a foundation for inclusive vision-language systems.
[NLP-45] Can Vision-Language Models Solve Visual Math Equations? EMNLP2025
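[Quick read]: This paper studies why Vision-Language Models (VLMs) struggle with tasks that require integrated perception and symbolic computation, using visual equation solving as the test case: equations are embedded in images, variables are represented by object icons, and coefficients must be inferred by counting. VLMs perform well on textual equations but fail on visually grounded ones. Decomposing the task shows that coefficient counting is the primary bottleneck, even when variable recognition is accurate. Composing recognition and reasoning introduces additional errors, and as equation complexity increases, symbolic reasoning itself becomes a limiting factor. The findings point toward improving precise counting in visual scenes and the stability of multi-step symbolic reasoning.
Link: https://arxiv.org/abs/2509.09013
Authors: Monjoy Narayan Choudhury,Junling Wang,Yifan Hou,Mrinmaya Sachan
Affiliations: IIIT Bangalore; ETH Zürich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: Monjoy Narayan Choudhury and Junling Wang contributed equally to this work. Accepted at EMNLP2025 main. Code and datasets are open-sourced with links in the paper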
Abstract:Despite strong performance in visual understanding and language-based reasoning, Vision-Language Models (VLMs) struggle with tasks requiring integrated perception and symbolic computation. We study this limitation through visual equation solving, where mathematical equations are embedded in images, variables are represented by object icons, and coefficients must be inferred by counting. While VLMs perform well on textual equations, they fail on visually grounded counterparts. To understand this gap, we decompose the task into coefficient counting and variable recognition, and find that counting is the primary bottleneck, even when recognition is accurate. We also observe that composing recognition and reasoning introduces additional errors, highlighting challenges in multi-step visual reasoning. Finally, as equation complexity increases, symbolic reasoning itself becomes a limiting factor. These findings reveal key weaknesses in current VLMs and point toward future improvements in visually grounded mathematical reasoning.
[NLP-46] Open-sci-ref-0.01: open and reproducible reference baselines for language model and dataset comparison
[Quick read]: This paper addresses the lack of reproducible, standardized training baselines in language model research, which makes it hard to compare training approaches fairly across model scales (0.13B to 1.7B parameters) and token scales (up to 1T tokens). The solution is open-sci-ref, a family of dense transformer models trained on 8 recent open reference datasets and released with intermediate checkpoints, training logs, code, and downstream evaluations. These baselines let researchers compare training procedures through their scaling trends on a common compute axis. The dataset comparison shows that training on NemoTron-CC HQ consistently performs best, followed by DCLM-baseline and FineWeb-Edu.
Link: https://arxiv.org/abs/2509.09009
Authors: Marianna Nezhurina,Taishi Nakamura,Timur Carstensen,Niccolò Ajroldi,Ville Komulainen,David Salinas,Jenia Jitsev
Affiliations: LAION; Juelich Supercomputing Center (JSC), Research Center Juelich (FZJ); OpenEuroLLM team; Open-Ψ (Open-Sci) Collective; ELLIS Institute Tübingen; Institute of Science Tokyo; University of Turku; University of Freiburg
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: Model weights and intermediate checkpoints are available at this https URL; code for reproducing training, evaluation and raw experiments data at this https URL
Abstract:We introduce open-sci-ref, a family of dense transformer models trained as research baselines across multiple model (0.13B to 1.7B parameters) and token scales (up to 1T) on 8 recent open reference datasets. Evaluating the models on various standardized benchmarks, our set of training runs establishes reference points that enable researchers to assess the sanity and quality of alternative training approaches across scales and datasets. Intermediate checkpoints allow comparison and studying of the training dynamics. The established reference baselines allow training procedures to be compared through their scaling trends, aligning them on a common compute axis. Comparison of open reference datasets reveals that training on NemoTron-CC HQ consistently outperforms other reference datasets, followed by DCLM-baseline and FineWeb-Edu. In addition to intermediate training checkpoints, the release includes logs, code, and downstream evaluations to simplify reproduction, standardize comparison, and facilitate future research.
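To place runs of different sizes on a common compute axis, a widely used rule of thumb for dense transformers estimates training compute as roughly 6 x parameters x tokens FLOPs. The sketch below applies it to the largest configuration mentioned in the abstract. Using this particular approximation is our assumption; the abstract does not state how the authors compute the axis.

```python
def approx_train_flops(n_params, n_tokens):
    """Rule-of-thumb training compute for a dense transformer: about 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens


# Largest scale in the abstract: 1.7B parameters trained on 1T tokens.
flops = approx_train_flops(1.7e9, 1e12)
print(f"{flops:.2e}")  # 1.02e+22
```

Under this estimate, a 0.13B model trained on the same 1T tokens uses about 13 times less compute, which is why results are easier to compare along the compute axis than per token count.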
[NLP-47] BRoverbs – Measuring how much LLMs understand Portuguese proverbs
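[Quick read]: This paper addresses the limited evaluation of large language models (LLMs) in Portuguese. Existing evaluations often rely on translated datasets that may miss linguistic nuances and cultural references, while native Portuguese datasets focus mainly on structured national exams or sentiment analysis of social media, leaving broader language understanding untested. The solution is BRoverbs, a dataset built from Brazilian proverbs. Because proverbs carry cultural wisdom, figurative expressions, and complex syntactic structures, the dataset tests how well LLMs understand regional expressions and offers a more culturally informed benchmark for Portuguese-language LLMs.
Link: https://arxiv.org/abs/2509.08960
Authors: Thales Sales Almeida,Giovana Kerche Bonás,João Guilherme Alves Santos
Affiliations: Institute of Computing, University of Campinas; Maritaca AI
Categories: Computation and Language (cs.CL)
Notes: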
Abstract:Large Language Models (LLMs) exhibit significant performance variations depending on the linguistic and cultural context in which they are applied. This disparity signals the necessity of mature evaluation frameworks that can assess their capabilities in specific regional settings. In the case of Portuguese, existing evaluations remain limited, often relying on translated datasets that may not fully capture linguistic nuances or cultural references. Meanwhile, native Portuguese-language datasets predominantly focus on structured national exams or sentiment analysis of social media interactions, leaving gaps in evaluating broader linguistic understanding. To address this limitation, we introduce BRoverbs, a dataset specifically designed to assess LLM performance through Brazilian proverbs. Proverbs serve as a rich linguistic resource, encapsulating cultural wisdom, figurative expressions, and complex syntactic structures that challenge the model comprehension of regional expressions. BRoverbs aims to provide a new evaluation tool for Portuguese-language LLMs, contributing to advancing regionally informed benchmarking. The benchmark is available at this https URL.
[NLP-48] Documents Are People and Words Are Items: A Psychometric Approach to Textual Data with Contextual Embeddings
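[Quick read]: This paper addresses the challenge of applying psychometric methods to textual data, namely how to turn unstructured text into response data suitable for psychometric analysis. The approach uses contextual embeddings from large language models (LLMs) to build contextual scores, treating documents as individuals and words as items. Modeling proceeds in two stages. The first stage uses natural language processing techniques and encoder-based transformer models to identify common keywords and generate contextual scores. The second stage applies factor analysis, including exploratory and bifactor models, to extract latent factors, determine factor correlations, and identify the words most strongly associated with each factor. The framework is aimed at text-rich fields such as education, psychology, and law.
Link: https://arxiv.org/abs/2509.08920
Authors: Jinsong Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Applications (stat.AP); Methodology (stat.ME)
Notes: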
Abstract:This research introduces a novel psychometric method for analyzing textual data using large language models. By leveraging contextual embeddings to create contextual scores, we transform textual data into response data suitable for psychometric analysis. Treating documents as individuals and words as items, this approach provides a natural psychometric interpretation under the assumption that certain keywords, whose contextual meanings vary significantly across documents, can effectively differentiate documents within a corpus. The modeling process comprises two stages: obtaining contextual scores and performing psychometric analysis. In the first stage, we utilize natural language processing techniques and encoder based transformer models to identify common keywords and generate contextual scores. In the second stage, we employ various types of factor analysis, including exploratory and bifactor models, to extract and define latent factors, determine factor correlations, and identify the most significant words associated with each factor. Applied to the Wiki STEM corpus, our experimental results demonstrate the method’s potential to uncover latent knowledge dimensions and patterns within textual data. This approach not only enhances the psychometric analysis of textual data but also holds promise for applications in fields rich in textual information, such as education, psychology, and law.
[NLP-49] Generative Engine Optimization: How to Dominate AI Search
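[Quick Read]: This paper addresses the breakdown of traditional search engine optimization (SEO) strategies as generative AI search engines such as ChatGPT, Perplexity, and Gemini gain adoption, and proposes a new optimization paradigm, Generative Engine Optimization (GEO). The core challenge is that AI search systems lean heavily on third-party authoritative sources (earned media) and give markedly less weight to brand-owned and social content, which differs fundamentally from how traditional search engines such as Google source information. The key elements of the solution are: first, structuring content to improve machine scannability and provide justification; second, actively building high-quality third-party authoritative coverage to strengthen AI-perceived authority; third, adopting strategies tailored to each AI engine and to multilingual settings; and finally, helping niche brands overcome the "big brand bias" to gain visibility in the generative search ecosystem.
Link: https://arxiv.org/abs/2509.08919
Authors: Mahe Chen,Xiaoxuan Wang,Kaiwen Chen,Nick Koudas
Affiliations: University of Toronto
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: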
Abstract:The rapid adoption of generative AI-powered search engines like ChatGPT, Perplexity, and Gemini is fundamentally reshaping information retrieval, moving from traditional ranked lists to synthesized, citation-backed answers. This shift challenges established Search Engine Optimization (SEO) practices and necessitates a new paradigm, which we term Generative Engine Optimization (GEO). This paper presents a comprehensive comparative analysis of AI Search and traditional web search (Google). Through a series of large-scale, controlled experiments across multiple verticals, languages, and query paraphrases, we quantify critical differences in how these systems source information. Our key findings reveal that AI Search exhibits a systematic and overwhelming bias towards Earned media (third-party, authoritative sources) over Brand-owned and Social content, a stark contrast to Google’s more balanced mix. We further demonstrate that AI Search services differ significantly from each other in their domain diversity, freshness, cross-language stability, and sensitivity to phrasing. Based on these empirical results, we formulate a strategic GEO agenda. We provide actionable guidance for practitioners, emphasizing the critical need to: (1) engineer content for machine scannability and justification, (2) dominate earned media to build AI-perceived authority, (3) adopt engine-specific and language-aware strategies, and (4) overcome the inherent “big brand bias” for niche players. Our work provides the foundational empirical analysis and a strategic framework for achieving visibility in the new generative search landscape.
[NLP-50] Automated Evidence Extraction and Scoring for Corporate Climate Policy Engagement: A Multilingual RAG Approach
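[Quick Read]: This paper addresses the inefficiency and error-proneness of manual assessment in monitoring climate policy engagement, in particular the labor- and time-intensive work that InfluenceMap's LobbyMap platform faces when processing large volumes of multilingual corporate documents. The key to the solution is an AI-assisted framework based on Retrieval-Augmented Generation (RAG) that combines layout-aware parsing, the Nomic embedding model, and few-shot prompting to automate the extraction and classification of relevant evidence from corporate text. This speeds up evidence gathering while retaining expert judgment to preserve analytical accuracy.
Link: https://arxiv.org/abs/2509.08907
Authors: Imene Kolli,Ario Saeid Vaghefi,Chiara Colesanti Senni,Shantam Raj,Markus Leippold
Affiliations: University of Zurich; Swiss Finance Institute
Subjects: Computation and Language (cs.CL)
Comments: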
Abstract:InfluenceMap’s LobbyMap Platform monitors the climate policy engagement of over 500 companies and 250 industry associations, assessing each entity’s support or opposition to science-based policy pathways for achieving the Paris Agreement’s goal of limiting global warming to 1.5°C. Although InfluenceMap has made progress with automating key elements of the analytical workflow, a significant portion of the assessment remains manual, making it time- and labor-intensive and susceptible to human error. We propose an AI-assisted framework to accelerate the monitoring of corporate climate policy engagement by leveraging Retrieval-Augmented Generation to automate the most time-intensive extraction of relevant evidence from large-scale textual data. Our evaluation shows that a combination of layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies yields the best performance in extracting and classifying evidence from multilingual corporate documents. We conclude that while the automated RAG system effectively accelerates evidence extraction, the nuanced nature of the analysis necessitates a human-in-the-loop approach where the technology augments, rather than replaces, expert judgment to ensure accuracy.
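The retrieve-then-prompt flow described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the three-dimensional vectors stand in for real embeddings (the paper uses the Nomic embedding model, which is not called here), and the chunk texts, labels, and the helper names `retrieve` and `build_prompt` are all made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, chunks, k=2):
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["text"] for c in ranked[:k]]

def build_prompt(question, evidence, examples):
    """Assemble a few-shot prompt: labeled examples, then retrieved evidence."""
    shots = "\n".join(f"Text: {t}\nLabel: {l}" for t, l in examples)
    context = "\n".join(f"- {e}" for e in evidence)
    return f"{shots}\n\nEvidence:\n{context}\n\nQuestion: {question}\nLabel:"

# Toy data: hand-written vectors, NOT output of any real embedding model.
chunks = [
    {"text": "We support a carbon price aligned with 1.5C.", "vec": [0.9, 0.1, 0.0]},
    {"text": "Our quarterly revenue grew by 4%.", "vec": [0.0, 0.2, 0.9]},
    {"text": "We oppose the proposed emissions standard.", "vec": [0.8, 0.3, 0.1]},
]
examples = [("We back the renewable energy directive.", "supporting")]

evidence = retrieve([1.0, 0.2, 0.0], chunks, k=2)
prompt = build_prompt("What is the company's stance on carbon pricing?", evidence, examples)
```

In a human-in-the-loop setup like the one the abstract argues for, the model's answer to such a prompt would be a suggestion for an analyst to review, not a final score.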
[NLP-51] Noise or Nuance: An Investigation Into Useful Information and Filtering For LLM Driven AKBC ISWC2025
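[Quick Read]: This paper addresses how to improve large language model (LLM) performance on the triple completion task in constrained settings such as the 2025 LM-KBC challenge. The key findings are: first, supplying additional information improves generation quality; second, LLMs can themselves filter out poor-quality triples, providing quality assurance; and third, the response parsing strategy must trade off flexibility against consistency depending on the task setting.
Link: https://arxiv.org/abs/2509.08903
Authors: Alex Clay,Ernesto Jiménez-Ruiz,Pranava Madhyastha
Affiliations: City St George’s, University of London; The Alan Turing Institute
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 1 figure, accepted to the ISWC 2025 LM-KBC Workshop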
Abstract:RAG and fine-tuning are prevalent strategies for improving the quality of LLM outputs. However, in constrained situations, such as that of the 2025 LM-KBC challenge, such techniques are restricted. In this work we investigate three facets of the triple completion task: generation, quality assurance, and LLM response parsing. Our work finds that in this constrained setting: additional information improves generation quality, LLMs can be effective at filtering poor quality triples, and the tradeoff between flexibility and consistency with LLM response parsing is setting dependent.
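The flexibility-versus-consistency tradeoff in response parsing mentioned above can be illustrated with two toy parsers. This is a sketch under assumptions, not the authors' code: it supposes the LLM is asked to return a list of object entities, and the function names `parse_strict` and `parse_lenient` are hypothetical.

```python
import json
import re

def parse_strict(response):
    """Accept only a JSON list of strings; return None for anything else."""
    try:
        value = json.loads(response)
    except json.JSONDecodeError:
        return None
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        return value
    return None

def parse_lenient(response):
    """Try strict parsing first, then fall back to splitting free text."""
    strict = parse_strict(response)
    if strict is not None:
        return strict
    parts = re.split(r"[,\n;]", response)
    # Strip leading bullets and numbering such as "- ", "* ", "1. ", "2) ".
    cleaned = [re.sub(r"^[\s\-\*\d\.\)]+", "", p).strip().strip("\"'") for p in parts]
    return [c for c in cleaned if c]

strict_ok = parse_strict('["Paris", "Lyon"]')
strict_fail = parse_strict("Paris, Lyon")
lenient_ok = parse_lenient("1. Paris\n2. Lyon")
```

The strict parser gives consistent output but discards any response that drifts from the requested format. The lenient one recovers more answers at the cost of occasional damage: its numbering cleanup would also strip the leading digit from an entity whose name starts with a number.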
[NLP-52] Recurrence Meets Transformers for Universal Multimodal Retrieval
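[Quick Read]: This paper addresses the limitation that existing multimodal retrieval methods are restricted to single-modality queries or documents and rely on task-specific fine-tuning. The core challenge is building a unified model that supports multimodal queries combining images and text and retrieves efficiently and accurately from multimodal document collections in which text and images coexist. The key to the solution is ReT-2, a recurrent Transformer architecture that uses multi-layer representations and LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. On the challenging M2KR and M-BEIR benchmarks it outperforms existing methods while offering faster inference and lower memory usage.
Link: https://arxiv.org/abs/2509.08897
Authors: Davide Caffagni,Sara Sarto,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
Affiliations: University of Modena and Reggio Emilia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: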
Abstract:With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: this https URL
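[NLP-53] A vibe coding learning design to enhance EFL students’ talking to, through, and about AI
[Quick Read]: This paper addresses how to integrate generative AI tools into English as a Foreign Language (EFL) education to support students in developing applications, specifically by using "vibe coding" (building software with AI through natural language) to strengthen students' language practice. The key to the solution is a proposed and piloted human-AI meta-languaging framework with three dimensions: talking to AI (prompt engineering), talking through AI (negotiating authorship), and talking about AI (mental models of AI). The study finds that whether students succeed at vibe coding depends mainly on differences in their prompt engineering approaches, the depth of their understanding of how AI functions, and the clarity of their sense of authorship. Effective instruction therefore needs meta-languaging scaffolding: structured prompt engineering training, critical discussions of authorship, and vocabulary for articulating AI mental models.
Link: https://arxiv.org/abs/2509.08854
Authors: David James Woo,Kai Guo,Yangyang Yu
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 12 figures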
Abstract:This innovative practice article reports on the piloting of vibe coding (using natural language to create software applications with AI) for English as a Foreign Language (EFL) education. We developed a human-AI meta-languaging framework with three dimensions: talking to AI (prompt engineering), talking through AI (negotiating authorship), and talking about AI (mental models of AI). Using backward design principles, we created a four-hour workshop where two students designed applications addressing authentic EFL writing challenges. We adopted a case study methodology, collecting data from worksheets and video recordings, think-aloud protocols, screen recordings, and AI-generated images. Contrasting cases showed one student successfully vibe coding a functional application cohering to her intended design, while another encountered technical difficulties with major gaps between intended design and actual functionality. Analysis reveals differences in students’ prompt engineering approaches, suggesting different AI mental models and tensions in attributing authorship. We argue that AI functions as a beneficial languaging machine, and that differences in how students talk to, through, and about AI explain vibe coding outcome variations. Findings indicate that effective vibe coding instruction requires explicit meta-languaging scaffolding, teaching structured prompt engineering, facilitating critical authorship discussions, and developing vocabulary for articulating AI mental models.
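[NLP-54] Automated Unity Game Template Generation from GDDs via NLP and Multi-Modal LLMs
[Quick Read]: This paper addresses the difficulty of automatically converting design documents into executable code in game development, that is, how to turn Game Design Documents (GDDs) into functional Unity game prototypes efficiently and accurately. The key solution is an end-to-end system that combines a fine-tuned LLaMA-3 model for generating Unity-compatible C# code with a custom Unity integration package that connects the generated code to the engine environment. According to the authors, this improves compilation success, adherence to GDD specifications, and code modularity, narrowing the gap between game design and implementation.
Link: https://arxiv.org/abs/2509.08847
Authors: Amna Hassan
Affiliations: UET Taxila
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: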
Abstract:This paper presents a novel framework for automated game template generation by transforming Game Design Documents (GDDs) into functional Unity game prototypes using Natural Language Processing (NLP) and multi-modal Large Language Models (LLMs). We introduce an end-to-end system that parses GDDs, extracts structured game specifications, and synthesizes Unity-compatible C# code that implements the core mechanics, systems, and architecture defined in the design documentation. Our approach combines a fine-tuned LLaMA-3 model specialized for Unity code generation with a custom Unity integration package that streamlines the implementation process. Evaluation results demonstrate significant improvements over baseline models, with our fine-tuned model achieving superior performance (4.8/5.0 average score) compared to state-of-the-art LLMs across compilation success, GDD adherence, best practices adoption, and code modularity metrics. The generated templates demonstrate high adherence to GDD specifications across multiple game genres. Our system effectively addresses critical gaps in AI-assisted game development, positioning LLMs as valuable tools in streamlining the transition from game design to implementation.
Computer Vision
[CV-0] SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
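[Quick Read]: This paper addresses the bottleneck in scalability and real-world fidelity of current spatial intelligence models, which stems from a shortage of high-quality, large-scale training data. Existing datasets are typically limited in scale and diversity, and in particular lack annotations of ground-truth camera motion for real-world dynamic scenes. To this end, the authors present the SpatialVID dataset. Its key contribution is a large in-the-wild video corpus of more than 21,000 hours of raw video, processed by a hierarchical filtering pipeline into 2.7 million clips (7,089 hours of dynamic content in total). An annotation pipeline then enriches each clip with dense 3D information, including per-frame camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. The richness and diversity of this data are reported to improve model generalization and performance, providing an important resource for video and 3D vision research.
Link: https://arxiv.org/abs/2509.09676
Authors: Jiahao Wang,Yufeng Yuan,Rujie Zheng,Youtian Lin,Jian Gao,Lin-Zhuo Chen,Yajie Bao,Yi Zhang,Chang Zeng,Yanxi Zhou,Xiaoxiao Long,Hao Zhu,Zhaoxiang Zhang,Xun Cao,Yao Yao
Affiliations: Nanjing University; Institute of Automation, Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL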
Abstract:Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID’s data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.
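The scale figures quoted in this abstract imply a rough average clip length and an upper bound on how much raw footage survives filtering. The back-of-the-envelope calculation below uses only the three numbers in the abstract; "2.7 million" is rounded and "more than 21,000 hours" is a lower bound, so the results are approximate.

```python
raw_hours = 21_000        # "more than 21,000 hours", so a lower bound
kept_hours = 7_089        # hours of dynamic content after filtering
num_clips = 2_700_000     # "2.7 million", a rounded figure

# Average clip duration in seconds.
avg_clip_seconds = kept_hours * 3600 / num_clips

# Fraction of raw footage retained; an upper bound because raw_hours is a lower bound.
kept_fraction_upper_bound = kept_hours / raw_hours

print(f"average clip length: about {avg_clip_seconds:.1f} s")
print(f"retained footage: at most about {kept_fraction_upper_bound:.0%}")
```

So the clips average roughly nine to ten seconds each, and the filtering pipeline keeps about a third of the raw video at most.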
[CV-1] Locality in Image Diffusion Models Emerges from Data Statistics
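[Quick Read]: This paper addresses the gap between deep diffusion models and the optimal denoiser: why diffusion with the theoretically optimal denoiser merely reproduces training-set images and fails to capture the behavior of trained deep diffusion models. The key insight is that the locality observed in deep diffusion models does not stem from the inductive bias of convolutional neural networks but is a statistical property of pixel correlations in natural image datasets. The authors show that an optimal parametric linear denoiser exhibits similar locality, verify theoretically and experimentally that this property arises directly from the statistical structure of image data, and build an analytical denoiser that matches the scores predicted by a deep diffusion model more closely than the prior expert-crafted alternative.
Link: https://arxiv.org/abs/2509.09672
Authors: Artem Lukoianov,Chenyang Yuan,Justin Solomon,Vincent Sitzmann
Affiliations: Massachusetts Institute of Technology; Toyota Research Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 18 figures, 6 tables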
Abstract:Among generative models, diffusion models are uniquely intriguing due to the existence of a closed-form optimal minimizer of their training objective, often referred to as the optimal denoiser. However, diffusion using this optimal denoiser merely reproduces images in the training set and hence fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images that resemble those generated by a trained UNet. The best-performing method hypothesizes that shift equivariance and locality inductive biases of convolutional neural networks are the cause of the performance gap, hence incorporating these assumptions into its analytical model. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to the deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than the prior expert-crafted alternative.
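The closed-form optimal denoiser mentioned at the start of this abstract is short enough to write out, which also makes it clear why it only reproduces training images. The sketch below assumes the simplest noising model, x_t = x_0 + sigma * noise with x_0 drawn uniformly from the training set; the paper's exact parameterization may differ, and the function name `optimal_denoiser` is a choice made for this example. Under that assumption the posterior mean E[x_0 | x_t] is a softmax-weighted average of the training points.

```python
import math

def optimal_denoiser(x_t, train_set, sigma):
    """Posterior mean E[x_0 | x_t] for x_t = x_0 + sigma * noise,
    with x_0 uniform over train_set (vectors given as lists of floats)."""
    # Log-weights: negative squared distance scaled by the noise variance.
    logits = [
        -sum((a - b) ** 2 for a, b in zip(x_t, x)) / (2 * sigma ** 2)
        for x in train_set
    ]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [
        sum(w * x[d] for w, x in zip(weights, train_set)) / z
        for d in range(len(x_t))
    ]

train_set = [[0.0, 0.0], [1.0, 1.0]]
low_noise = optimal_denoiser([0.9, 1.1], train_set, sigma=0.1)     # close to [1, 1]
high_noise = optimal_denoiser([0.9, 1.1], train_set, sigma=100.0)  # close to [0.5, 0.5]
```

At low noise the weights collapse onto the nearest training point, so sampling with this denoiser ends on a memorized image; at high noise the output tends to the dataset mean. Note also that every output pixel depends on all pixels of the input through the distance term, so this denoiser has no built-in locality.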
[CV-2] Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration
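[Quick Read]: This paper addresses two challenges in learning dexterous robotic manipulation policies from hand-object motion-capture (MoCap) data: first, inaccuracies in the demonstrations and the embodiment gap between human and robot hands; second, the error accumulation and underuse of demonstrations caused by the existing three-stage workflow (retargeting, tracking, and residual correction). The key to the solution is Dexplore, a unified single-loop optimization that performs retargeting and tracking jointly and treats demonstrations as soft guidance rather than ground truth. It derives adaptive spatial scopes from the raw trajectories and trains the policy with reinforcement learning to stay in scope while minimizing control effort and accomplishing the task. This preserves demonstration intent, lets robot-specific strategies emerge, improves robustness to noise, and scales to large demonstration corpora.
Link: https://arxiv.org/abs/2509.09671
Authors: Sirui Xu,Yu-Wei Chao,Liuyu Bian,Arsalan Mousavian,Yu-Xiong Wang,Liang-Yan Gui,Wei Yang
Affiliations: University of Illinois Urbana-Champaign; NVIDIA
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: CoRL 2025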
Abstract:Hand-object motion-capture (MoCap) repositories offer large-scale, contact-rich demonstrations and hold promise for scaling dexterous robotic manipulation. Yet demonstration inaccuracies and embodiment gaps between human and robot hands limit the straightforward use of these data. Existing methods adopt a three-stage workflow, including retargeting, tracking, and residual correction, which often leaves demonstrations underused and compounds errors across stages. We introduce Dexplore, a unified single-loop optimization that jointly performs retargeting and tracking to learn robot control policies directly from MoCap at scale. Rather than treating demonstrations as ground truth, we use them as soft guidance. From raw trajectories, we derive adaptive spatial scopes, and train with reinforcement learning to keep the policy in-scope while minimizing control effort and accomplishing the task. This unified formulation preserves demonstration intent, enables robot-specific strategies to emerge, improves robustness to noise, and scales to large demonstration corpora. We distill the scaled tracking policy into a vision-based, skill-conditioned generative controller that encodes diverse manipulation skills in a rich latent representation, supporting generalization across objects and real-world deployment. Taken together, these contributions position Dexplore as a principled bridge that transforms imperfect demonstrations into effective training signals for dexterous manipulation.
[CV-3] Geometric Neural Distance Fields for Learning Human Motion Priors
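[Quick Read]: This paper addresses robustness, temporal consistency, and physical plausibility in 3D human motion recovery, targeting the limitations of existing methods based on variational autoencoders (VAEs) or diffusion models in modeling complex motion dynamics. The key to the solution is Neural Riemannian Motion Fields (NRMF), which explicitly model human motion as the zero level set of a collection of neural distance fields (NDFs) corresponding to pose, velocity, and acceleration dynamics. The NDFs are constructed on the product space of joint rotations, angular velocities, and angular accelerations, respecting the geometry of the underlying articulations. The paper also introduces an adaptive-step hybrid algorithm for projecting onto the set of plausible motions and a geometric integrator for rolling out realistic trajectories, and reports gains across tasks such as denoising, motion in-betweening, and fitting to partial observations.
Link: https://arxiv.org/abs/2509.09667
Authors: Zhengdi Yu,Simone Foti,Linguang Zhang,Amy Zhao,Cem Keskin,Stefanos Zafeiriou,Tolga Birdal
Affiliations: Imperial College London; Meta Reality Labs
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages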
Abstract:We introduce Neural Riemannian Motion Fields (NRMF), a novel 3D generative human motion prior that enables robust, temporally consistent, and physically plausible 3D motion recovery. Unlike existing VAE or diffusion-based methods, our higher-order motion prior explicitly models the human motion in the zero level set of a collection of neural distance fields (NDFs) corresponding to pose, transition (velocity), and acceleration dynamics. Our framework is rigorous in the sense that our NDFs are constructed on the product space of joint rotations, their angular velocities, and angular accelerations, respecting the geometry of the underlying articulations. We further introduce: (i) a novel adaptive-step hybrid algorithm for projecting onto the set of plausible motions, and (ii) a novel geometric integrator to “roll out” realistic motion trajectories during test-time-optimization and generation. Our experiments show significant and consistent gains: trained on the AMASS dataset, NRMF remarkably generalizes across multiple input modalities and to diverse tasks ranging from denoising to motion in-betweening and fitting to partial 2D / 3D observations.
[CV-4] Can Understanding and Generation Truly Benefit Together – or Just Coexist?
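[Quick Read]: This paper addresses the lack of unification between understanding (image-to-text, I2T) and generation (text-to-image, T2I) in multimodal learning: conventional approaches typically train the two tasks separately, which leads to inconsistent information flow and makes joint optimization difficult. The core solution is the UAE (Unified Auto-Encoder) framework, which uses reconstruction fidelity as a unified training objective to create bidirectional information flow. The decoder (T2I) is first pre-trained to capture fine-grained semantics and complex spatial relationships. The Unified-GRPO reinforcement learning scheme then proceeds in three stages: a cold-start initialization, Generation for Understanding (the encoder is trained so that its captions maximize the decoder's reconstruction quality), and Understanding for Generation (the decoder is refined to reconstruct from those captions, improving its attention to detail and long-context instruction following). With this design the encoder and decoder improve together during RL, bringing understanding and generation closer to a unified whole.
Link: https://arxiv.org/abs/2509.09666
Authors: Zhiyuan Yan,Kaiqing Lin,Zongjian Li,Junyan Ye,Hui Han,Zhendong Wang,Hao Liu,Bin Lin,Hao Li,Xue Xu,Xinyan Xiao,Jingdong Wang,Haifeng Wang,Li Yuan
Affiliations: PKU; Baidu ERNIE; Rabbitpre AI; SYSU; USTC
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: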
Abstract:In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce the coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) A cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder’s reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of the UMMs. A surprising “aha moment” arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.
[CV-5] Measuring Epistemic Humility in Multimodal Large Language Models
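[Quick Read]: This paper addresses hallucination in multimodal large language models (MLLMs), where the model generates content inconsistent with the input image, which can lead to misleading information or faulty decisions in applications such as visual question answering. Existing benchmarks focus mainly on recognition accuracy and overlook a model's ability to decline when every offered option is wrong, a behavior that reflects epistemic humility. The key to the solution is HumbleBench, a new hallucination benchmark whose central design choice is a "None of the above" option that requires models not only to recognize correct visual information but also to judge when none of the provided options is valid. Built from a panoptic scene graph dataset, it uses fine-grained scene graph annotations to extract ground-truth entities and relations, prompts GPT-4-Turbo to generate multiple-choice questions, and applies manual filtering for quality. It measures how well MLLMs reject incorrect options across three hallucination types (object, relation, and attribute), filling a gap in current evaluation for safety-critical settings.
Link: https://arxiv.org/abs/2509.09658
Authors: Bingkui Tong,Jiaer Xia,Sifeng Shang,Kaiyang Zhou
Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence); Hong Kong Baptist University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: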
Abstract:Hallucinations in multimodal large language models (MLLMs) – where the model generates content inconsistent with the input image – pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks an equally critical capability for trustworthy AI: recognizing when none of the provided options are correct, a behavior reflecting epistemic humility. We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs’ ability to reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations to extract ground-truth entities and relations, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a “None of the above” option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs – including both general-purpose and specialized reasoning models – on HumbleBench and share valuable findings and insights with the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings. Our code and dataset are released publicly and can be accessed at this https URL.
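Scoring a benchmark with a "None of the above" option, as described in this abstract, means tracking rejection accuracy separately from overall accuracy. The sketch below is an assumption about how such scoring could look, not the released HumbleBench code: the record format, the option label `E` for "None of the above", and the function name `score` are all hypothetical.

```python
NOTA = "E"  # hypothetical option letter for "None of the above"

def score(records):
    """records: list of dicts with 'gold' and 'pred' option letters.
    Returns overall accuracy and accuracy on questions whose gold answer is NOTA."""
    total = len(records)
    correct = sum(r["pred"] == r["gold"] for r in records)
    nota = [r for r in records if r["gold"] == NOTA]
    nota_correct = sum(r["pred"] == NOTA for r in nota)
    return {
        "accuracy": correct / total if total else 0.0,
        "nota_accuracy": nota_correct / len(nota) if nota else 0.0,
    }

records = [
    {"gold": "A", "pred": "A"},
    {"gold": "E", "pred": "B"},  # model picked a plausible but wrong option
    {"gold": "E", "pred": "E"},  # model correctly rejected all options
    {"gold": "C", "pred": "C"},
]
result = score(records)
```

Reporting the two numbers side by side separates a model that recognizes content well from one that also knows when to reject every option offered.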
[CV-6] Mechanistic Learning with Guided Diffusion Models to Predict Spatio-Temporal Brain Tumor Growth
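[Quick Read]: This paper addresses the prediction of spatio-temporal brain tumor progression, which is essential for clinical decision-making in neuro-oncology. The key to the solution is a hybrid mechanistic learning framework that combines a tumor growth model based on ordinary differential equations (ODEs) with a guided denoising diffusion implicit model (DDIM). The mechanistic model captures temporal tumor dynamics, including radiotherapy effects, and estimates future tumor burden; the DDIM performs gradient-guided image generation conditioned on that estimate to synthesize future MRIs that are anatomically feasible and consistent with the predicted growth. The method is trained on the BraTS adult and pediatric glioma datasets and evaluated on in-house longitudinal pediatric diffuse midline glioma (DMG) cases. It generates realistic follow-up scans and provides tumor growth probability maps that capture the extent and directionality of growth.
Link: https://arxiv.org/abs/2509.09610
Authors: Daria Laslo,Efthymios Georgiou,Marius George Linguraru,Andreas Rauschecker,Sabine Muller,Catherine R. Jutzeler,Sarah Bruningk
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 4 figures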
Abstract:Predicting the spatio-temporal progression of brain tumors is essential for guiding clinical decisions in neuro-oncology. We propose a hybrid mechanistic learning framework that combines a mathematical tumor growth model with a guided denoising diffusion implicit model (DDIM) to synthesize anatomically feasible future MRIs from preceding scans. The mechanistic model, formulated as a system of ordinary differential equations, captures temporal tumor dynamics including radiotherapy effects and estimates future tumor burden. These estimates condition a gradient-guided DDIM, enabling image synthesis that aligns with both predicted growth and patient anatomy. We train our model on the BraTS adult and pediatric glioma datasets and evaluate on 60 axial slices of in-house longitudinal pediatric diffuse midline glioma (DMG) cases. Our framework generates realistic follow-up scans based on spatial similarity metrics. It also introduces tumor growth probability maps, which capture both clinically relevant extent and directionality of tumor growth as shown by 95th percentile Hausdorff Distance. The method enables biologically informed image generation in data-limited scenarios, offering generative-space-time predictions that account for mechanistic priors.
[CV-7] Graph Alignment via Dual-Pass Spectral Encoding and Latent Space Communication
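[Quick Read]: This paper addresses two key problems in graph alignment: first, oversmoothing in graph neural network (GNN) embeddings reduces node distinctiveness; second, structural noise, feature heterogeneity, and training instability cause the latent spaces of different graphs to become misaligned, making node correspondences unreliable. The key to the solution is a dual-pass encoder that combines low-pass and high-pass spectral filters to produce embeddings that are both structure-aware and highly discriminative, together with a geometry-aware functional map module that learns bijective and isometric transformations between graph embeddings to enforce geometric consistency across latent spaces. This improves robustness to structural inconsistencies and challenging alignment scenarios.
Link: https://arxiv.org/abs/2509.09597
Authors: Maysam Behmanesh,Erkan Turan,Maks Ovsjanikov
Affiliations: LIX, École Polytechnique, IP Paris
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages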
Abstract:Graph alignment, the problem of identifying corresponding nodes across multiple graphs, is fundamental to numerous applications. Most existing unsupervised methods embed node features into latent representations to enable cross-graph comparison without ground-truth correspondences. However, these methods suffer from two critical limitations: the degradation of node distinctiveness due to oversmoothing in GNN-based embeddings, and the misalignment of latent spaces across graphs caused by structural noise, feature heterogeneity, and training instability, ultimately leading to unreliable node correspondences. We propose a novel graph alignment framework that simultaneously enhances node distinctiveness and enforces geometric consistency across latent spaces. Our approach introduces a dual-pass encoder that combines low-pass and high-pass spectral filters to generate embeddings that are both structure-aware and highly discriminative. To address latent space misalignment, we incorporate a geometry-aware functional map module that learns bijective and isometric transformations between graph embeddings, ensuring consistent geometric relationships across different representations. Extensive experiments on graph benchmarks demonstrate that our method consistently outperforms existing unsupervised alignment baselines, exhibiting superior robustness to structural inconsistencies and challenging alignment scenarios. Additionally, comprehensive evaluation on vision-language benchmarks using diverse pretrained models shows that our framework effectively generalizes beyond graph domains, enabling unsupervised alignment of vision and language representations.
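The idea of pairing a low-pass with a high-pass spectral filter can be shown on a tiny graph. The sketch below uses a common textbook choice, not necessarily the paper's filters: the low-pass filter is the symmetrically normalized adjacency with self-loops, S = D^(-1/2) (A + I) D^(-1/2), and the high-pass filter is its complement I - S. The function names are hypothetical.

```python
import math

def normalized_adjacency(adj):
    """S = D^(-1/2) (A + I) D^(-1/2) for a dense adjacency matrix (list of lists)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def dual_pass(adj, x):
    """Return (low-pass, high-pass) versions of a scalar node signal x."""
    s = normalized_adjacency(adj)
    n = len(x)
    low = [sum(s[i][j] * x[j] for j in range(n)) for i in range(n)]  # S x: neighborhood smoothing
    high = [x[i] - low[i] for i in range(n)]                         # (I - S) x: what smoothing removed
    return low, high

# Triangle graph: every node is connected to the other two.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
low, high = dual_pass(triangle, [1.0, 2.0, 3.0])
# low is approximately [2, 2, 2]; high is approximately [-1, 0, 1]
```

On the triangle the low-pass output is identical at every node, which is oversmoothing in miniature, while the high-pass output keeps exactly the per-node differences that distinguish the nodes. Concatenating the two gives an embedding that is both structure-aware and discriminative, which is the motivation the abstract gives for a dual-pass encoder.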
[CV-8] Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
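[Quick Read]: This paper addresses the lack of semantic understanding of multimodal instructions in current audio-driven avatar video generation: existing methods perform only low-level tracking driven by acoustic or visual cues and do not model the communicative intent behind the instructions, which weakens narrative coherence and character expressiveness. The key to the solution is Kling-Avatar, a two-stage cascaded framework. In the first stage, a multimodal large language model (MLLM) acts as a "director" and produces a blueprint video conditioned on diverse instruction signals, governing high-level semantics such as character motion and emotion. In the second stage, multiple sub-clips are generated in parallel from blueprint keyframes using a first-last frame strategy, a global-to-local design that preserves fine-grained detail while encoding high-level intent. The framework supports long-duration video generation at up to 1080p and 48 fps, with reported gains in lip synchronization accuracy, emotional and dynamic expressiveness, and cross-domain generalization.
Link: https://arxiv.org/abs/2509.09595
Authors: Yikang Ding,Jiwen Liu,Wenyuan Zhang,Zekun Wang,Wentao Hu,Liyuan Cui,Mingming Lao,Yingchao Shao,Hui Liu,Xiaohan Li,Ming Chen,Xiaoqiang Liu,Yu-Shen Liu,Pengfei Wan
Affiliations: Kling Team, Kuaishou Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report. Project Page: this https URL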
Abstract:Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
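The second stage described in this abstract, parallel sub-clip generation from blueprint keyframes with a first-last frame strategy, has a simple control flow that can be sketched independently of any video model. The sketch assumes that consecutive keyframes serve as the first and last frame of each sub-clip, which is one natural reading of the abstract rather than a confirmed detail; `generate_subclip` is a placeholder that returns a string instead of calling a generator.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_segments(keyframes):
    """Pair consecutive keyframes as (first_frame, last_frame) of each sub-clip."""
    return list(zip(keyframes, keyframes[1:]))

def generate_subclip(segment):
    """Placeholder for a video generator conditioned on a first and a last frame."""
    first, last = segment
    return f"clip({first}->{last})"

def generate_video(keyframes, max_workers=4):
    """Generate all sub-clips in parallel and return them in temporal order."""
    segments = split_into_segments(keyframes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so clips stay in sequence.
        return list(pool.map(generate_subclip, segments))

clips = generate_video(["k0", "k1", "k2", "k3"])
```

Because each sub-clip depends only on its two bounding keyframes, the segments do not need to wait for one another, and adjacent clips share a frame at their boundary, so they can be joined without a visible jump.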
[CV-9] ObjectReact: Learning Object-Relative Control for Visual Navigation
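[Quick Read]: This paper addresses the limitations of image-level representations in monocular visual navigation: image features are tightly coupled to the agent's pose and embodiment, which leads to poor generalization and makes cross-scene deployment difficult. The core solution is an "object-relative" paradigm for learning control. It builds a topometric map in the form of a "relative" 3D scene graph, which yields object-level global path planning costs, and trains a local controller (ObjectReact) conditioned directly on a high-level "WayObject Costmap", removing the need for explicit RGB input. The method can traverse new routes without imitating prior trajectories, decouples control prediction from image matching, and markedly improves robustness in cross-embodiment and cross-environment deployment.
Link: https://arxiv.org/abs/2509.09594
Authors: Sourav Garg,Dustin Craggs,Vineeth Bhat,Lachlan Mares,Stefan Podgorski,Madhava Krishna,Feras Dayoub,Ian Reid
Institutions: University of Adelaide, Australia; IIIT Hyderabad, India; MBZUAI, UAE
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: CoRL 2025; 23 pages including appendix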
Abstract:Visual navigation using only a single camera and a topological map has recently become an appealing alternative to methods that require additional sensors and 3D maps. This is typically achieved through an “image-relative” approach to estimating control from a given pair of current observation and subgoal image. However, image-level representations of the world have limitations because images are strictly tied to the agent’s pose and embodiment. In contrast, objects, being a property of the map, offer an embodiment- and trajectory-invariant world representation. In this work, we present a new paradigm of learning “object-relative” control that exhibits several desirable characteristics: a) new routes can be traversed without strictly requiring to imitate prior experience, b) the control prediction problem can be decoupled from solving the image matching problem, and c) high invariance can be achieved in cross-embodiment deployment for variations across both training-testing and mapping-execution settings. We propose a topometric map representation in the form of a “relative” 3D scene graph, which is used to obtain more informative object-level global path planning costs. We train a local controller, dubbed “ObjectReact”, conditioned directly on a high-level “WayObject Costmap” representation that eliminates the need for an explicit RGB input. We demonstrate the advantages of learning object-relative control over its image-relative counterpart across sensor height variations and multiple navigation tasks that challenge the underlying spatial understanding capability, e.g., navigating a map trajectory in the reverse direction. We further show that our sim-only policy is able to generalize well to real-world indoor environments. Code and supplementary material are accessible via project page: this https URL
[CV-10] Visual Grounding from Event Cameras ICCV2025
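[Quick Read]: This paper addresses the lack of effective integration between event camera data and natural language understanding, which leaves a gap in multimodal perception. The key contribution is Talk2Event, the first large-scale benchmark for language-driven object grounding with event data. Built on real-world driving scenarios, it comprises 5,567 scenes, 13,458 annotated objects, and more than 30,000 carefully validated referring expressions. Each expression carries four structured attributes: appearance, status, relation to the viewer, and relation to surrounding objects. This attribute-centric design supports interpretable and compositional grounding, moves beyond simple recognition toward contextual reasoning in dynamic environments, and lays a foundation for temporally aware, multimodal perception in robotics, human-AI interaction, and related fields.
Link: https://arxiv.org/abs/2509.09584
Authors: Lingdong Kong,Dongyue Lu,Ao Liang,Rong Li,Yuhao Dong,Tianshuai Hu,Lai Xing Ng,Wei Tsang Ooi,Benoit R. Cottereau
Institutions: NUS; CNRS@CREATE; HKUST(GZ); NTU; HKUST; I2R, A*STAR; IPAL, CNRS; CerCo, CNRS
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Abstract Paper (Non-Archival) @ ICCV 2025 NeVi Workshop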
Abstract:Event cameras capture changes in brightness with microsecond precision and remain reliable under motion blur and challenging illumination, offering clear advantages for modeling highly dynamic scenes. Yet, their integration with natural language understanding has received little attention, leaving a gap in multimodal perception. To address this, we introduce Talk2Event, the first large-scale benchmark for language-driven object grounding using event data. Built on real-world driving scenarios, Talk2Event comprises 5,567 scenes, 13,458 annotated objects, and more than 30,000 carefully validated referring expressions. Each expression is enriched with four structured attributes – appearance, status, relation to the viewer, and relation to surrounding objects – that explicitly capture spatial, temporal, and relational cues. This attribute-centric design supports interpretable and compositional grounding, enabling analysis that moves beyond simple object recognition to contextual reasoning in dynamic environments. We envision Talk2Event as a foundation for advancing multimodal and temporally-aware perception, with applications spanning robotics, human-AI interaction, and so on.
[CV-11] PeftCD: Leveraging Vision Foundation Models with Parameter-Efficient Fine-Tuning for Remote Sensing Change Detection
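[Quick Read]: This paper tackles the prevalence of pseudo changes in multi-temporal, multi-source remote sensing imagery, along with the scarcity of labeled samples and weak cross-domain generalization. The key is PeftCD, a change detection framework that applies Parameter-Efficient Fine-Tuning (PEFT) to Vision Foundation Models (VFMs). Its core design uses a weight-sharing Siamese encoder with integrated LoRA (Low-Rank Adaptation) and Adapter modules, so that task adaptation requires training only a small number of additional parameters. It explores two backbones, SAM2 and DINOv3, and pairs them with a lightweight decoder so that the backbone features dominate. The method achieves state-of-the-art results on multiple public datasets, with precise boundary delineation and strong suppression of pseudo changes, balancing accuracy, efficiency, and generalization.
Link: https://arxiv.org/abs/2509.09572
Authors: Sijun Dong,Yuxuan Hu,LiBo Wang,Geng Chen,Xiaoliang Meng
Institutions: Wuhan University; Nanjing University of Information Science and Technology; Guangxi Water & Power Design Institute CO., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: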
Abstract:To tackle the prevalence of pseudo changes, the scarcity of labeled samples, and the difficulty of cross-domain generalization in multi-temporal and multi-source remote sensing imagery, we propose PeftCD, a change detection framework built upon Vision Foundation Models (VFMs) with Parameter-Efficient Fine-Tuning (PEFT). At its core, PeftCD employs a weight-sharing Siamese encoder derived from a VFM, into which LoRA and Adapter modules are seamlessly integrated. This design enables highly efficient task adaptation by training only a minimal set of additional parameters. To fully unlock the potential of VFMs, we investigate two leading backbones: the Segment Anything Model v2 (SAM2), renowned for its strong segmentation priors, and DINOv3, a state-of-the-art self-supervised representation learner. The framework is complemented by a deliberately lightweight decoder, ensuring the focus remains on the powerful feature representations from the backbones. Extensive experiments demonstrate that PeftCD achieves state-of-the-art performance across multiple public datasets, including SYSU-CD (IoU 73.81%), WHUCD (92.05%), MSRSCD (64.07%), MLCD (76.89%), CDD (97.01%), S2Looking (52.25%) and LEVIR-CD (85.62%), with notably precise boundary delineation and strong suppression of pseudo-changes. In summary, PeftCD presents an optimal balance of accuracy, efficiency, and generalization. It offers a powerful and scalable paradigm for adapting large-scale VFMs to real-world remote sensing change detection applications. The code and pretrained models will be released at this https URL.
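The LoRA idea referenced in the abstract can be illustrated with a small sketch. This is not PeftCD's code: it is a generic low-rank adapter on a single linear layer in pure Python, and the names `lora_forward`, `alpha`, and the rank `r` are illustrative assumptions. The frozen weight `W` stays fixed; only the low-rank factors `A` and `B` would be trained.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(M):
    return [list(col) for col in zip(*M)]


def lora_forward(x, W, A, B, alpha):
    """Compute y = x W^T + (alpha / r) * x A^T B^T.

    Shapes (hypothetical layout): x is (batch, in), W is (out, in),
    A is (r, in), B is (out, r). W is frozen; A and B are the only
    trainable parameters, r * (in + out) values instead of in * out.
    """
    r = len(A)
    base = matmul(x, transpose(W))                       # frozen path
    low = matmul(matmul(x, transpose(A)), transpose(B))  # low-rank update
    scale = alpha / r
    return [[b + scale * l for b, l in zip(brow, lrow)]
            for brow, lrow in zip(base, low)]


if __name__ == "__main__":
    W = [[1, 0], [0, 1]]   # frozen identity weight
    A = [[1, 1]]           # rank r = 1
    B = [[0], [0]]         # zero-initialised B leaves the output unchanged
    print(lora_forward([[2, 3]], W, A, B, alpha=2))
```

Initialising `B` to zero is a common choice because the adapted layer then starts out identical to the frozen one.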
[CV-12] Invisible Attributes, Visible Biases: Exploring Demographic Shortcuts in MRI-based Alzheimer's Disease Classification MICCAI2025
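[Quick Read]: This paper examines shortcut learning and demographic bias in deep learning (DL) based diagnosis of Alzheimer's disease (AD) from magnetic resonance imaging (MRI), in particular performance bias driven by protected attributes such as race and sex. The study rests on three empirical analyses. First, it tests whether DL models can identify race or sex from 3D brain MRI, to establish whether distribution shifts tied to these attributes exist. Second, it tests whether race or sex imbalance in the training set degrades model performance, which would indicate shortcut learning. Third, it analyzes feature attributions across brain regions, quantitatively and qualitatively, comparing the protected-attribute prediction task with the AD classification task. Using multiple datasets and two mainstream DL models (ResNet and SwinTransformer), the experiments demonstrate that race- and sex-based shortcut learning and bias are present, providing a basis for fairer MRI-based diagnostic tools.
Link: https://arxiv.org/abs/2509.09558
Authors: Akshit Achara,Esther Puyol Anton,Alexander Hammers,Andrew P. King
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: FAIMI @ MICCAI 2025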
Abstract:Magnetic resonance imaging (MRI) is the gold standard for brain imaging. Deep learning (DL) algorithms have been proposed to aid in the diagnosis of diseases such as Alzheimer’s disease (AD) from MRI scans. However, DL algorithms can suffer from shortcut learning, in which spurious features, not directly related to the output label, are used for prediction. When these features are related to protected attributes, they can lead to performance bias against underrepresented protected groups, such as those defined by race and sex. In this work, we explore the potential for shortcut learning and demographic bias in DL based AD diagnosis from MRI. We first investigate if DL algorithms can identify race or sex from 3D brain MRI scans to establish the presence or otherwise of race and sex based distributional shifts. Next, we investigate whether training set imbalance by race or sex can cause a drop in model performance, indicating shortcut learning and bias. Finally, we conduct a quantitative and qualitative analysis of feature attributions in different brain regions for both the protected attribute and AD classification tasks. Through these experiments, and using multiple datasets and DL models (ResNet and SwinTransformer), we demonstrate the existence of both race and sex based shortcut learning and bias in DL based AD classification. Our work lays the foundation for fairer DL diagnostic tools in brain MRI. The code is provided at this https URL
[CV-13] InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation CVPR2025
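[Quick Read]: This paper addresses the dataset limitations that hold back 3D human-object interaction (HOI) generation: existing datasets often lack high-quality motion and annotations and contain artifacts such as contact penetration, floating, and incorrect hand motions. The solution has three parts. First, it consolidates and standardizes 21.81 hours of HOI data from multiple sources and adds fine-grained textual annotations. Second, it proposes a unified optimization framework that, following the principle of contact invariance, preserves human-object relationships while introducing motion variations, expanding the data to 30.70 hours and substantially reducing artifacts. Third, it defines six benchmark tasks and a unified HOI generative modeling perspective, achieving state-of-the-art generation performance and validating the dataset as a foundational resource for 3D HOI generation research.
Link: https://arxiv.org/abs/2509.09555
Authors: Sirui Xu,Dongting Li,Yucheng Zhang,Xiyan Xu,Qi Long,Ziyin Wang,Yunzhi Lu,Shuchang Dong,Hezi Jiang,Akshat Gupta,Yu-Xiong Wang,Liang-Yan Gui
Institutions: University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025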
Abstract:While large-scale human motion capture datasets have advanced human motion generation, modeling and generating dynamic 3D human-object interactions (HOIs) remain challenging due to dataset limitations. Existing datasets often lack extensive, high-quality motion and annotation and exhibit artifacts such as contact penetration, floating, and incorrect hand motions. To address these issues, we introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements. First, we consolidate and standardize 21.81 hours of HOI data from diverse sources, enriching it with detailed textual annotations. Second, we propose a unified optimization framework to enhance data quality by reducing artifacts and correcting hand motions. Leveraging the principle of contact invariance, we maintain human-object relationships while introducing motion variations, expanding the dataset to 30.70 hours. Third, we define six benchmarking tasks and develop a unified HOI generative modeling perspective, achieving state-of-the-art performance. Extensive experiments validate the utility of our dataset as a foundational resource for advancing 3D human-object interaction generation. To support continued research in this area, the dataset is publicly available at this https URL, and will be actively maintained.
[CV-14] Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders
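[Quick Read]: This paper addresses the limited feature representation power of video diffusion models. Recent progress has come from architectural innovations and new training objectives, while little attention has been paid to semantically aligning the generator's intermediate features with those of pre-trained vision encoders. The key is Align4Gen, a method that integrates multi-feature fusion and alignment into video diffusion model training, improving the match between intermediate generator features and pre-trained encoder features and thereby the quality of generated video. The encoders used for alignment are chosen based on an analysis of the discriminability and temporal consistency of several vision encoders. Experiments show improved results on both unconditional and class-conditional video generation.
Link: https://arxiv.org/abs/2509.09547
Authors: Dohun Lee,Hyeonho Jeong,Jiwook Kim,Duygu Ceylan,Jong Chul Ye
Institutions: Korea Advanced Institute of Science and Technology; Ewha Womans University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 14 figures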
Abstract:Video diffusion models have advanced rapidly in recent years as a result of a series of architectural innovations (e.g., diffusion transformers) and the use of novel training objectives (e.g., flow matching). In contrast, less attention has been paid to improving the feature representation power of such models. In this work, we show that training video diffusion models can benefit from aligning the intermediate features of the video generator with feature representations of pre-trained vision encoders. We propose a new metric and conduct an in-depth analysis of various vision encoders to evaluate their discriminability and temporal consistency, thereby assessing their suitability for video feature alignment. Based on the analysis, we present Align4Gen, which provides a novel multi-feature fusion and alignment method integrated into video diffusion model training. We evaluate Align4Gen for both unconditional and class-conditional video generation tasks and show that it results in improved video generation as quantified by various metrics. Full video results are available on our project page: this https URL
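To make the notion of feature alignment concrete, here is a minimal sketch of one common way to score alignment: one minus the mean cosine similarity between per-token generator features and encoder features. The abstract does not specify Align4Gen's actual loss or fusion scheme, so `alignment_loss` and the assumption that both feature sets already share the same dimension are purely illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(gen_feats, enc_feats):
    """1 - mean cosine similarity over matching token positions.

    gen_feats: intermediate generator features, one vector per token
               (assumed already projected to the encoder's dimension).
    enc_feats: features from a frozen pre-trained vision encoder.
    """
    sims = [cosine(g, e) for g, e in zip(gen_feats, enc_feats)]
    return 1.0 - sum(sims) / len(sims)


if __name__ == "__main__":
    encoder = [[3.0, 4.0], [1.0, 0.0]]
    print(alignment_loss(encoder, encoder))                   # aligned features
    print(alignment_loss([[0.0, 1.0]], [[1.0, 0.0]]))         # orthogonal features
```

In training, a term like this would be added to the diffusion objective with a weighting coefficient; that weighting is also an assumption here.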
[CV-15] DualTrack: Sensorless 3D Ultrasound needs Local and Global Context
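[Quick Read]: This paper addresses the limited clinical adoption of conventional three-dimensional ultrasound (3D US) systems, which are costly and complex, by pursuing sensorless 3D US: a deep learning approach that estimates the probe trajectory from a sequence of 2D ultrasound images to enable 3D reconstruction. The key is DualTrack, a dual-encoder architecture that separately extracts local features (such as speckle patterns, used to predict frame-to-frame motion) and global features (such as anatomical structures, used to situate the scan relative to the organ and predict its overall shape), then combines them through a lightweight fusion module to estimate the trajectory. This overcomes a limitation of prior models, which either ignore global features or couple them tightly with local feature extraction. Experiments report an average reconstruction error below 5 mm and state-of-the-art results on a public benchmark.
Link: https://arxiv.org/abs/2509.09530
Authors: Paul F. R. Wilson,Matteo Ronchetti,Rüdiger Göbl,Viktoria Markova,Sebastian Rosenzweig,Raphael Prevost,Parvin Mousavi,Oliver Zettinig
Institutions: Queen’s University, Kingston, Canada; ImFusion GmbH, Munich, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: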
Abstract:Three-dimensional ultrasound (US) offers many clinical advantages over conventional 2D imaging, yet its widespread adoption is limited by the cost and complexity of traditional 3D systems. Sensorless 3D US, which uses deep learning to estimate a 3D probe trajectory from a sequence of 2D US images, is a promising alternative. Local features, such as speckle patterns, can help predict frame-to-frame motion, while global features, such as coarse shapes and anatomical structures, can situate the scan relative to anatomy and help predict its general shape. In prior approaches, global features are either ignored or tightly coupled with local feature extraction, restricting the ability to robustly model these two complementary aspects. We propose DualTrack, a novel dual-encoder architecture that leverages decoupled local and global encoders specialized for their respective scales of feature extraction. The local encoder uses dense spatiotemporal convolutions to capture fine-grained features, while the global encoder utilizes an image backbone (e.g., a 2D CNN or foundation model) and temporal attention layers to embed high-level anatomical features and long-range dependencies. A lightweight fusion module then combines these features to estimate the trajectory. Experimental results on a large public benchmark show that DualTrack achieves state-of-the-art accuracy and globally consistent 3D reconstructions, outperforming previous methods and yielding an average reconstruction error below 5 mm.
[CV-16] Generative Diffusion Contrastive Network for Multi-View Clustering ICASSP2026
[Quick Read]: This paper addresses degraded fusion performance in Multi-View Clustering (MVC) caused by low-quality data, specifically views that contain noisy or missing data. The key is a novel Stochastic Generative Diffusion Fusion (SGDF) method, which applies a multiple generative mechanism to the multi-view features of each sample and is thereby more robust to low-quality data. Building on SGDF, the authors construct the Generative Diffusion Contrastive Network (GDCN), which achieves state-of-the-art results on deep MVC tasks.
Link: https://arxiv.org/abs/2509.09527
Authors: Jian Zhu,Xin Zou,Xi Wang,Ning Zhang,Bian Wu,Yao Yang,Ying Zhou,Lingfang Zeng,Chang Tang,Cheng Luo
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is submitted to International Conference on Acoustics, Speech, and Signal Processing (ICASSP2026)
Abstract:In recent years, Multi-View Clustering (MVC) has been significantly advanced under the influence of deep learning. By integrating heterogeneous data from multiple views, MVC enhances clustering analysis, making multi-view fusion critical to clustering performance. However, there is a problem of low-quality data in multi-view fusion. This problem primarily arises from two reasons: 1) Certain views are contaminated by noisy data. 2) Some views suffer from missing data. This paper proposes a novel Stochastic Generative Diffusion Fusion (SGDF) method to address this problem. SGDF leverages a multiple generative mechanism for the multi-view feature of each sample. It is robust to low-quality data. Building on SGDF, we further present the Generative Diffusion Contrastive Network (GDCN). Extensive experiments show that GDCN achieves the state-of-the-art results in deep MVC tasks. The source code is publicly available at this https URL.
[CV-17] Region-Wise Correspondence Prediction between Manga Line Art Images
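[Quick Read]: This paper addresses region-wise correspondence prediction between manga line art images: automatically identifying and matching semantically consistent regions across frames without any pre-existing segmentation labels or masks, in support of downstream tasks such as automatic colorization and in-between frame generation. The key is a Transformer-based framework that divides each line art image into patches and learns patch-level similarities across images. Edge-aware clustering and a region matching algorithm then convert the patch-level predictions into coherent region-level correspondences, so that regions are aligned without manual annotation.
Link: https://arxiv.org/abs/2509.09501
Authors: Yingxuan Li,Jiafeng Mao,Qianru Qiu,Yusuke Matsui
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: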
Abstract:Understanding region-wise correspondence between manga line art images is a fundamental task in manga processing, enabling downstream applications such as automatic line art colorization and in-between frame generation. However, this task remains largely unexplored, especially in realistic scenarios without pre-existing segmentation or annotations. In this paper, we introduce a novel and practical task: predicting region-wise correspondence between raw manga line art images without any pre-existing labels or masks. To tackle this problem, we divide each line art image into a set of patches and propose a Transformer-based framework that learns patch-level similarities within and across images. We then apply edge-aware clustering and a region matching algorithm to convert patch-level predictions into coherent region-level correspondences. To support training and evaluation, we develop an automatic annotation pipeline and manually refine a subset of the data to construct benchmark datasets. Experiments on multiple datasets demonstrate that our method achieves high patch-level accuracy (e.g., 96.34%) and generates consistent region-level correspondences, highlighting its potential for real-world manga applications.
[CV-18] Improving Human Motion Plausibility with Body Momentum BMVC2025
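[Quick Read]: This paper addresses the difficulty of accurately modeling the physical coupling between local motion and global motion in human motion modeling. Existing methods usually treat the two separately and ignore how changes in body configuration, through interaction with the environment, drive global displacement. The result is artifacts such as foot sliding and jitter, while deriving global trajectories from joint torques and external forces is computationally expensive. The key is to use whole-body linear momentum and angular momentum as a physical constraint: because momentum reflects the aggregate effect of joint-level dynamics on overall movement, it links local motion to global displacement in a physically consistent way. A new loss term enforces consistency between generated momentum profiles and those of ground-truth data, improving stability and balance while preserving the accuracy of the recovered motion.
Link: https://arxiv.org/abs/2509.09496
Authors: Ha Linh Nguyen,Tze Ho Elden Tse,Angela Yao
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at BMVC 2025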
Abstract:Many studies decompose human motion into local motion in a frame attached to the root joint and global motion of the root joint in the world frame, treating them separately. However, these two components are not independent. Global movement arises from interactions with the environment, which are, in turn, driven by changes in the body configuration. Motion models often fail to precisely capture this physical coupling between local and global dynamics, while deriving global trajectories from joint torques and external forces is computationally expensive and complex. To address these challenges, we propose using whole-body linear and angular momentum as a constraint to link local motion with global movement. Since momentum reflects the aggregate effect of joint-level dynamics on the body’s movement through space, it provides a physically grounded way to relate local joint behavior to global displacement. Building on this insight, we introduce a new loss term that enforces consistency between the generated momentum profiles and those observed in ground-truth data. Incorporating our loss reduces foot sliding and jitter, improves balance, and preserves the accuracy of the recovered motion. Code and data are available at the project page this https URL.
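As a rough illustration of the momentum quantities the abstract refers to, the sketch below computes whole-body linear and angular momentum from two consecutive frames and compares two momentum estimates with a mean squared error. It is a simplification under stated assumptions, not the paper's formulation: each body segment is treated as a point mass (segment rotational inertia is ignored), velocities come from finite differences, and the function names are hypothetical.

```python
def cross(a, b):
    """3D cross product."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def momentum(masses, pos_prev, pos_next, dt):
    """Linear momentum p = sum m_i v_i and angular momentum about the
    centre of mass L = sum m_i (r_i - c) x v_i, with point-mass segments."""
    vel = [tuple((n - p) / dt for p, n in zip(pp, pn))
           for pp, pn in zip(pos_prev, pos_next)]
    total = sum(masses)
    com = tuple(sum(m * p[k] for m, p in zip(masses, pos_prev)) / total
                for k in range(3))
    linear = tuple(sum(m * v[k] for m, v in zip(masses, vel)) for k in range(3))
    angular = [0.0, 0.0, 0.0]
    for m, p, v in zip(masses, pos_prev, vel):
        rel = tuple(p[k] - com[k] for k in range(3))
        c = cross(rel, v)
        for k in range(3):
            angular[k] += m * c[k]
    return linear, tuple(angular)


def momentum_loss(generated, reference):
    """Mean squared error over the six momentum components."""
    (gl, ga), (rl, ra) = generated, reference
    diffs = [(a - b) ** 2 for a, b in zip(gl + ga, rl + ra)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    masses = [1.0, 1.0]
    prev = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
    nxt = [(1.0, 1.0, 0.0), (-1.0, -1.0, 0.0)]   # two masses spinning about z
    print(momentum(masses, prev, nxt, dt=1.0))
```

In the example, the two masses move in opposite directions, so the linear momentum cancels while the angular momentum about the z-axis does not.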
[CV-19] OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection
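[Quick Read]: This paper addresses the limitations of current deepfake detection datasets, which rely on outdated generation methods, have low realism, or are restricted to single-face imagery, leaving detectors ill-equipped for the high-fidelity synthetic images produced by modern generative AI. The key is a large-scale dataset focused on politically sensitive content: three million real images with descriptive captions, used to generate 963k high-quality synthetic images from a mix of proprietary and open-source models. The authors also introduce a crowdsourced adversarial platform that incentivizes users to keep submitting challenging new synthetic images, so that detection methods continue to evolve and stay robust against increasingly sophisticated misinformation.
Link: https://arxiv.org/abs/2509.09495
Authors: Victor Livernoche,Akshatha Arodi,Andreea Musulan,Zachary Yang,Adam Salvail,Gaétan Marceau Caron,Jean-François Godbout,Reihaneh Rabbany
Institutions: McGill University; Mila - Quebec Artificial Intelligence Institute; Université de Montréal; IVADO
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages, 12 figures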
Abstract:Deepfakes, synthetic media created using advanced AI techniques, have intensified the spread of misinformation, particularly in politically sensitive contexts. Existing deepfake detection datasets are often limited, relying on outdated generation methods, low realism, or single-face imagery, restricting the effectiveness for general synthetic image detection. By analyzing social media posts, we identify multiple modalities through which deepfakes propagate misinformation. Furthermore, our human perception study demonstrates that recently developed proprietary models produce synthetic images increasingly indistinguishable from real ones, complicating accurate identification by the general public. Consequently, we present a comprehensive, politically-focused dataset specifically crafted for benchmarking detection against modern generative models. This dataset contains three million real images paired with descriptive captions, which are used for generating 963k corresponding high-quality synthetic images from a mix of proprietary and open-source models. Recognizing the continual evolution of generative techniques, we introduce an innovative crowdsourced adversarial platform, where participants are incentivized to generate and submit challenging synthetic images. This ongoing community-driven initiative ensures that deepfake detection methods remain robust and adaptive, proactively safeguarding public discourse from sophisticated misinformation threats.
[CV-20] Resource-Efficient Glioma Segmentation on Sub-Saharan MRI
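[Quick Read]: This paper addresses the difficulty of accurate glioma segmentation in Sub-Saharan Africa (SSA), where the scarcity of high-quality annotated medical imaging data hinders the deployment of advanced deep learning models in clinical workflows. The key is an efficient, lightweight 3D Attention UNet suited to resource-constrained settings. Residual blocks strengthen the network's representational capacity, and transfer learning from weights pre-trained on BraTS 2021 improves generalization on small, low-quality MRI data. On the BraTS-Africa dataset the model achieves Dice scores of 0.76 to 0.85, with a model size of about 90 MB and per-volume inference time under one minute, making it practical to deploy on consumer-grade hardware.
Link: https://arxiv.org/abs/2509.09469
Authors: Freedmore Sidume,Oumayma Soula,Joseph Muthui Wacira,YunFei Zhu,Abbas Rabiu Muhammad,Abderrazek Zeraii,Oluwaseun Kalejaye,Hajer Ibrahim,Olfa Gaddour,Brain Halubanza,Dong Zhang,Udunna C Anazodo,Confidence Raymond
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 7 figures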
Abstract:Gliomas are the most prevalent type of primary brain tumors, and their accurate segmentation from MRI is critical for diagnosis, treatment planning, and longitudinal monitoring. However, the scarcity of high-quality annotated imaging data in Sub-Saharan Africa (SSA) poses a significant challenge for deploying advanced segmentation models in clinical workflows. This study introduces a robust and computationally efficient deep learning framework tailored for resource-constrained settings. We leveraged a 3D Attention UNet architecture augmented with residual blocks and enhanced through transfer learning from pre-trained weights on the BraTS 2021 dataset. Our model was evaluated on 95 MRI cases from the BraTS-Africa dataset, a benchmark for glioma segmentation in SSA MRI data. Despite the limited data quality and quantity, our approach achieved Dice scores of 0.76 for the Enhancing Tumor (ET), 0.80 for Necrotic and Non-Enhancing Tumor Core (NETC), and 0.85 for Surrounding Non-Functional Hemisphere (SNFH). These results demonstrate the generalizability of the proposed model and its potential to support clinical decision making in low-resource settings. The compact architecture, approximately 90 MB, and sub-minute per-volume inference time on consumer-grade hardware further underscore its practicality for deployment in SSA health systems. This work contributes toward closing the gap in equitable AI for global health by empowering underserved regions with high-performing and accessible medical imaging solutions.
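For readers unfamiliar with the metric quoted above, the Dice score for one class is 2|P ∩ T| / (|P| + |T|), where P and T are the predicted and ground-truth voxel sets. The sketch below computes it on flattened label lists; the function name and the convention of returning 1.0 when a class is absent from both prediction and target are assumptions for illustration, not the paper's evaluation code.

```python
def dice_score(pred, target, label):
    """Dice coefficient for one class on flattened label volumes.

    pred, target: sequences of integer class labels, one per voxel.
    Returns 1.0 when the class appears in neither sequence (an assumed
    convention; evaluation toolkits differ on this edge case).
    """
    p = [v == label for v in pred]
    t = [v == label for v in target]
    intersection = sum(a and b for a, b in zip(p, t))
    denom = sum(p) + sum(t)
    return 1.0 if denom == 0 else 2.0 * intersection / denom


if __name__ == "__main__":
    pred = [1, 1, 0, 2]
    target = [1, 0, 0, 2]
    for label in (1, 2):
        print(label, dice_score(pred, target, label))
```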
[CV-21] FlexiD-Fuse: Flexible number of inputs multi-modal medical image fusion based on diffusion model
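[Quick Read]: This paper addresses the inability of existing multi-modal medical image fusion methods to handle a variable number of input modalities: conventional methods accept only a fixed number of inputs (for example two-modal or tri-modal only), which does not match clinical settings where the number of available images varies. The key is FlexiD-Fuse, a diffusion-based image fusion network. It recasts the diffusion fusion problem, which normally supports only fixed-condition inputs, as a maximum likelihood estimation problem based on the diffusion process, combined with hierarchical Bayesian modeling, and embeds the Expectation-Maximization (EM) algorithm in the diffusion sampling iterations. This enables end-to-end fusion of an arbitrary number of input modalities, producing high-quality fused images that carry cross-modal information regardless of the number of inputs.
Link: https://arxiv.org/abs/2509.09456
Authors: Yushen Xu,Xiaosong Li,Yuchun Wang,Xiaoqi Cheng,Huafeng Li,Haishu Tan
Institutions: Foshan University; Kunming University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: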
Abstract:Different modalities of medical images provide unique physiological and anatomical information for diseases. Multi-modal medical image fusion integrates useful information from different complementary medical images with different modalities, producing a fused image that comprehensively and objectively reflects lesion characteristics to assist doctors in clinical diagnosis. However, existing fusion methods can only handle a fixed number of modality inputs, such as accepting only two-modal or tri-modal inputs, and cannot directly process varying input quantities, which hinders their application in clinical settings. To tackle this issue, we introduce FlexiD-Fuse, a diffusion-based image fusion network designed to accommodate flexible quantities of input modalities. It can process two-modal and tri-modal medical image fusion end-to-end with the same weights. FlexiD-Fuse transforms the diffusion fusion problem, which supports only fixed-condition inputs, into a maximum likelihood estimation problem based on the diffusion process and hierarchical Bayesian modeling. By incorporating the Expectation-Maximization algorithm into the diffusion sampling iteration process, FlexiD-Fuse can generate high-quality fused images with cross-modal information from source images, independently of the number of input images. We compared against the latest two-modal and tri-modal medical image fusion methods, tested them on Harvard datasets, and evaluated them using nine popular metrics. The experimental results show that our method achieves the best performance in medical image fusion with varying inputs. Meanwhile, we conducted extensive extension experiments on infrared-visible, multi-exposure, and multi-focus image fusion tasks with arbitrary numbers of inputs, and compared them with the respective SOTA methods. The results of the extension experiments consistently demonstrate the effectiveness and superiority of our method.
[CV-22] Semantic Concentration for Self-Supervised Dense Representations Learning
[Quick Read]: This paper addresses the over-dispersion phenomenon that arises when self-supervised learning (SSL) is used to learn dense representations: patches from the same category or instance scatter too widely in feature space, which harms downstream dense tasks such as semantic segmentation and object detection. The core solution is explicit semantic concentration. First, patch correspondences are distilled, with a noise-tolerant ranking loss that extends the Average Precision (AP) loss to continuous targets so that learning is decision-agnostic and adaptively focused. Second, an object-aware filter uses cross-attention to map patch representations into an object-based feature space, separating shared patterns from complicated scenes. Together these relax the constraint of strict spatial alignment and reduce scene interference, improving dense self-supervised learning.
Link: https://arxiv.org/abs/2509.09429
Authors: Peisong Wen,Qianqian Xu,Siran Dai,Runmin Cong,Qingming Huang
Institutions: University of Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences; Peng Cheng Laboratory; State Key Laboratory of Information Security; Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Shandong University; Key Laboratory of Machine Intelligence and System Control, Ministry of Education; Key Laboratory of Big Data Mining and Knowledge Management (BDKM), University of Chinese Academy of Sciences; Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Recent advances in image-level self-supervised learning (SSL) have made significant progress, yet learning dense representations for patches remains challenging. Mainstream methods encounter an over-dispersion phenomenon that patches from the same instance/category scatter, harming downstream performance on dense tasks. This work reveals that image-level SSL avoids over-dispersion by involving implicit semantic concentration. Specifically, the non-strict spatial alignment ensures intra-instance consistency, while shared patterns, i.e., similar parts of within-class instances in the input space, ensure inter-image consistency. Unfortunately, these approaches are infeasible for dense SSL due to their spatial sensitivity and complicated scene-centric data. These observations motivate us to explore explicit semantic concentration for dense SSL. First, to break the strict spatial alignment, we propose to distill the patch correspondences. Facing noisy and imbalanced pseudo labels, we propose a noise-tolerant ranking loss. The core idea is extending the Average Precision (AP) loss to continuous targets, such that its decision-agnostic and adaptive focusing properties prevent the student model from being misled. Second, to discriminate the shared patterns from complicated scenes, we propose the object-aware filter to map the output space to an object-based space. Specifically, patches are represented by learnable prototypes of objects via cross-attention. Last but not least, empirical studies across various tasks soundly support the effectiveness of our method. Code is available in this https URL.
[CV-23] FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution
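[Quick Read]: This paper addresses poor fusion quality in joint multimodal image fusion and super-resolution, where target and background structures are easily degraded by low resolution and weak semantic information. The key is FS-Diff, which unifies image fusion and super-resolution as a conditional generation problem and introduces a clarity sensing mechanism for adaptive low-resolution perception and cross-modal feature extraction. A bidirectional feature Mamba extracts global features of the multimodal images, and, with the source images and semantic information as conditions, a random iterative denoising process over multiple noise levels is run on a modified U-Net. The result is a high-resolution fused image rich in cross-modal features and semantic information.
Link: https://arxiv.org/abs/2509.09427
Authors: Yuchan Jie,Yushen Xu,Xiaosong Li,Fuqiang Zhou,Jianming Lv,Huafeng Li
Institutions: South China University of Technology; Foshan University; Beihang University; Kunming University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: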
Abstract:As an influential information fusion and low-level vision technique, image fusion integrates complementary information from source images to yield an informative fused image. A few attempts have been made in recent years to jointly realize image fusion and super-resolution. However, in real-world applications such as military reconnaissance and long-range detection missions, the target and background structures in multimodal images are easily corrupted, with low resolution and weak semantic information, which leads to suboptimal results in current fusion techniques. In response, we propose FS-Diff, a semantic guidance and clarity-aware joint image fusion and super-resolution method. FS-Diff unifies image fusion and super-resolution as a conditional generation problem. It leverages semantic guidance from the proposed clarity sensing mechanism for adaptive low-resolution perception and cross-modal feature extraction. Specifically, we initialize the desired fused result as pure Gaussian noise and introduce the bidirectional feature Mamba to extract the global features of the multimodal images. Moreover, utilizing the source images and semantics as conditions, we implement a random iterative denoising process via a modified U-Net network. This network is trained for denoising at multiple noise levels to produce high-resolution fusion results with cross-modal features and abundant semantic information. We also construct a powerful aerial view multiscene (AVMS) benchmark covering 600 pairs of images. Extensive joint image fusion and super-resolution experiments on six public and our AVMS datasets demonstrated that FS-Diff outperforms the state-of-the-art methods at multiple magnifications and can recover richer details and semantics in the fused images. The code is available at this https URL.
[CV-24] Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift
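[Quick read]: This paper addresses the limited reliability of medical vision-language models (Medical VLMs) under distribution shift: because imaging protocols and free-text reports vary widely, the models learn task-agnostic correlations, which limits generalization and raises the risk of failure in real clinical settings. The key to the solution is the DRiFt framework, which uses parameter-efficient fine-tuning (LoRA) and learnable prompt tokens to perform structured feature decoupling, explicitly separating clinically relevant signals from task-agnostic noise. In addition, the authors curate high-quality, clinically grounded image-text pairs to strengthen cross-modal alignment and reduce uncertainty. Together these improve in-distribution performance (+11.4% Top-1 accuracy, +3.3% Macro-F1) while maintaining robustness on unseen datasets.
Link: https://arxiv.org/abs/2509.09397
Authors: Umaima Rahman,Raza Imam,Mohammad Yaqub,Dwarikanath Mahapatra
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: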
Abstract:Medical vision-language models (VLMs) offer promise for clinical decision support, yet their reliability under distribution shifts remains a major concern for safe deployment. These models often learn task-agnostic correlations due to variability in imaging protocols and free-text reports, limiting their generalizability and increasing the risk of failure in real-world settings. We propose DRiFt, a structured feature decoupling framework that explicitly separates clinically relevant signals from task-agnostic noise using parameter-efficient tuning (LoRA) and learnable prompt tokens. To enhance cross-modal alignment and reduce uncertainty, we curate high-quality, clinically grounded image-text pairs by generating captions for a diverse medical dataset. Our approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods, while maintaining strong robustness across unseen datasets. Ablation studies reveal that disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift. These insights contribute toward building safer, more trustworthy VLMs for clinical use. The code is available at this https URL.
[CV-25] Unsupervised Integrated-Circuit Defect Segmentation via Image-Intrinsic Normality
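[Quick read]: This paper addresses defect segmentation in integrated-circuit (IC) manufacturing, in particular the brittleness of common industrial methods that rely on an external set of normal samples: because IC layouts differ markedly across products and images are hard to align, such methods are difficult to apply reliably. The key to the solution is an unsupervised defect segmentation framework that needs no external normal support. A learnable normal-information extractor aggregates representative normal features from the test image itself, and a coherence loss ties these features to normal regions. Guided by the extracted normal features, the decoder reconstructs only normal content, so defects are identified from the reconstruction residual. Pseudo-anomaly augmentation further stabilizes training.
Link: https://arxiv.org/abs/2509.09375
Authors: Botong Zhao,Qijun Shi,Shujing Lyu,Yue Lu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: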
Abstract:Modern Integrated-Circuit (IC) manufacturing introduces diverse, fine-grained defects that depress yield and reliability. Most industrial defect segmentation compares a test image against an external normal set, a strategy that is brittle for IC imagery where layouts vary across products and accurate alignment is difficult. We observe that defects are predominantly local, while each image still contains rich, repeatable normal patterns. We therefore propose an unsupervised IC defect segmentation framework that requires no external normal support. A learnable normal-information extractor aggregates representative normal features from the test image, and a coherence loss enforces their association with normal regions. Guided by these features, a decoder reconstructs only normal content; the reconstruction residual then segments defects. Pseudo-anomaly augmentation further stabilizes training. Experiments on datasets from three IC process stages show consistent improvements over existing approaches and strong robustness to product variability.
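The last step of the pipeline described above, turning a reconstruction residual into a defect mask, can be illustrated with a minimal sketch. It assumes grayscale images stored as nested lists of floats in [0, 1] and a fixed threshold; the function name `residual_mask` and the threshold value are hypothetical, and the paper's extractor, decoder, and actual residual scoring are not reproduced here.

```python
def residual_mask(image, reconstruction, threshold=0.2):
    """Mark pixels whose absolute reconstruction error exceeds a threshold.

    `image` and `reconstruction` are equally sized 2D lists of floats.
    The fixed threshold is an illustrative assumption, not the paper's rule.
    """
    return [
        [1 if abs(pixel - recon) > threshold else 0
         for pixel, recon in zip(image_row, recon_row)]
        for image_row, recon_row in zip(image, reconstruction)
    ]


# Toy example: the decoder reproduces the normal pattern but not the bright defect.
image = [[0.1, 0.1, 0.9],
         [0.1, 0.1, 0.1]]
reconstruction = [[0.1, 0.1, 0.1],
                  [0.1, 0.15, 0.1]]
mask = residual_mask(image, reconstruction)
```

Because the decoder is guided only by normal features, regions it cannot reproduce stand out in the residual, which is why a simple per-pixel comparison is enough for this illustration.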
[CV-26] A Fully Automatic Framework for Intracranial Pressure Grading: Integrating Keyframe Identification, ONSD Measurement and Clinical Data
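[Quick read]: This paper addresses two problems in intracranial pressure (ICP) monitoring: conventional invasive procedures such as lumbar puncture carry high risk, and existing non-invasive approaches such as optic nerve sheath diameter (ONSD) measurement depend on manual operation, which makes them inconsistent and subjective. The key to the solution is a fully automatic two-stage framework. The first stage performs frame-level anatomical segmentation and rule-based keyframe identification guided by an international consensus statement, yielding accurate and reproducible ONSD measurements. The second stage fuses the ONSD metrics with clinical features, integrating multi-source data to predict the ICP grade objectively. The method improves diagnostic accuracy over the threshold-based baseline (validation accuracy 0.845 ± 0.071, independent test accuracy 0.786) and reduces operator variability, offering a reliable route to non-invasive ICP assessment in acute neurological conditions.
Link: https://arxiv.org/abs/2509.09368
Authors: Pengxu Wen,Tingting Yu,Ziwei Nie,Cheng Jiang,Zhenyu Yin,Mingyang He,Bo Liao,Xiaoping Yang
Affiliations: Nanjing University (南京大学); Nanjing University Medical School (南京大学医学院); Nanjing University of Chinese Medicine (南京中医药大学); Nanjing Medical University (南京医科大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: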
Abstract:Intracranial pressure (ICP) elevation poses severe threats to cerebral function, thus necessitating monitoring for timely intervention. While lumbar puncture is the gold standard for ICP measurement, its invasiveness and associated risks drive the need for non-invasive alternatives. Optic nerve sheath diameter (ONSD) has emerged as a promising biomarker, as elevated ICP directly correlates with increased ONSD. However, current clinical practices for ONSD measurement suffer from inconsistency in manual operation, subjectivity in optimal view selection, and variability in thresholding, limiting their reliability. To address these challenges, we introduce a fully automatic two-stage framework for ICP grading, integrating keyframe identification, ONSD measurement and clinical data. Specifically, the fundus ultrasound video processing stage performs frame-level anatomical segmentation, rule-based keyframe identification guided by an international consensus statement, and precise ONSD measurement. The intracranial pressure grading stage then fuses ONSD metrics with clinical features to enable the prediction of ICP grades, thereby demonstrating an innovative blend of interpretable ultrasound analysis and multi-source data integration for objective clinical evaluation. Experimental results demonstrate that our method achieves a validation accuracy of 0.845 ± 0.071 (with standard deviation from five-fold cross-validation) and an independent test accuracy of 0.786, significantly outperforming conventional threshold-based method (0.637 ± 0.111 validation accuracy, 0.429 test accuracy). Through effectively reducing operator variability and integrating multi-source information, our framework establishes a reliable non-invasive approach for clinical ICP evaluation, holding promise for improving patient management in acute neurological conditions.
[CV-27] Plug-and-play Diffusion Models for Image Compressive Sensing with Data Consistency Projection
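[Quick read]: This paper addresses reconstruction quality for ill-posed inverse problems in single-pixel imaging, where the core challenge is combining physical priors (such as the measurement model) with data-driven priors (such as deep learning models). The key to the solution is a unified framework that decouples the diffusion process into three interpretable stages: denoising, data consistency enforcement, and sampling. On this basis the authors design a hybrid data-consistency module that linearly combines multiple Plug-and-Play (PnP) style fidelity terms and applies the correction directly to the denoised estimate, improving measurement consistency without disrupting the diffusion sampling trajectory and thereby improving reconstruction quality.
Link: https://arxiv.org/abs/2509.09365
Authors: Xiaodong Wang,Ping Wang,Zhangyuan Li,Xin Yuan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: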
Abstract:We explore the connection between Plug-and-Play (PnP) methods and Denoising Diffusion Implicit Models (DDIM) for solving ill-posed inverse problems, with a focus on single-pixel imaging. We begin by identifying key distinctions between PnP and diffusion models, particularly in their denoising mechanisms and sampling procedures. By decoupling the diffusion process into three interpretable stages: denoising, data consistency enforcement, and sampling, we provide a unified framework that integrates learned priors with physical forward models in a principled manner. Building upon this insight, we propose a hybrid data-consistency module that linearly combines multiple PnP-style fidelity terms. This hybrid correction is applied directly to the denoised estimate, improving measurement consistency without disrupting the diffusion sampling trajectory. Experimental results on single-pixel imaging tasks demonstrate that our method achieves better reconstruction quality.
[CV-28] Texture-aware Intrinsic Image Decomposition with Model- and Learning-based Priors
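[Quick read]: This paper addresses single-image intrinsic image decomposition, i.e. separating an RGB image into a reflectance layer and a shading layer. In complex scenes with spatially varying lighting and rich textures, traditional methods struggle to produce high-quality decompositions. The key to the solution is a texture-guided regularization term: by introducing a new texture-aware prior, the method separates material textures from lighting effects and improves decomposition quality, in particular mitigating the texture-less or over-smoothed results typical of earlier learning-based methods.
Link: https://arxiv.org/abs/2509.09352
Authors: Xiaodong Wang,Zijun He,Xin Yuan
Affiliations: Zhejiang University (浙江大学); Westlake University (西湖大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: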
Abstract:This paper aims to recover the intrinsic reflectance layer and shading layer given a single image. Though this intrinsic image decomposition problem has been studied for decades, it remains a significant challenge in cases of complex scenes, i.e. spatially-varying lighting effect and rich textures. In this paper, we propose a novel method for handling severe lighting and rich textures in intrinsic image decomposition, which enables to produce high-quality intrinsic images for real-world images. Specifically, we observe that previous learning-based methods tend to produce texture-less and over-smoothing intrinsic images, which can be used to infer the lighting and texture information given a RGB image. In this way, we design a texture-guided regularization term and formulate the decomposition problem into an optimization framework, to separate the material textures and lighting effect. We demonstrate that combining the novel texture-aware prior can produce superior results to existing approaches.
[CV-29] Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles
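[Quick read]: This paper addresses the serious road-safety risk posed by distracted driving and impaired driving. The key to the solution is a driver behavior classification system based on external observation, which uses computer vision to analyze the behavior of non-connected vehicles in real time. Specifically, it uses the YOLO object detection model to identify vehicles and lane boundaries, and combines this with custom lane estimation algorithms for lateral displacement analysis and trajectory pattern recognition. This allows it to detect unsafe behaviors such as excessive lateral movement without relying on vehicle-to-everything (V2X) communication, and to operate across varied road and environmental conditions.
Link: https://arxiv.org/abs/2509.09349
Authors: Ian Nell,Shane Gilroy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: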
Abstract:Road traffic accidents remain a significant global concern, with human error, particularly distracted and impaired driving, among the leading causes. This study introduces a novel driver behavior classification system that uses external observation techniques to detect indicators of distraction and impairment. The proposed framework employs advanced computer vision methodologies, including real-time object tracking, lateral displacement analysis, and lane position monitoring. The system identifies unsafe driving behaviors such as excessive lateral movement and erratic trajectory patterns by implementing the YOLO object detection model and custom lane estimation algorithms. Unlike systems reliant on inter-vehicular communication, this vision-based approach enables behavioral analysis of non-connected vehicles. Experimental evaluations on diverse video datasets demonstrate the framework’s reliability and adaptability across varying road and environmental conditions.
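As a rough illustration of what lateral displacement analysis could look like downstream of the detector, the sketch below takes per-frame vehicle box centres and estimated lane boundaries (all in pixels) and flags weaving. Everything here is an assumption for illustration: the function names, the normalisation, and both thresholds are hypothetical, and the paper's detector and lane estimation algorithms are not reproduced.

```python
from statistics import pstdev


def lateral_offsets(box_centers_x, lane_left_x, lane_right_x):
    """Per-frame offset of the vehicle centre from the lane centre.

    Normalised so that 0.0 means centred and +/-1.0 means on a lane boundary.
    """
    offsets = []
    for cx, left, right in zip(box_centers_x, lane_left_x, lane_right_x):
        half_width = (right - left) / 2.0
        lane_center = (left + right) / 2.0
        offsets.append((cx - lane_center) / half_width)
    return offsets


def is_erratic(offsets, weave_threshold=0.25, boundary_threshold=0.9):
    """Flag a track that weaves a lot or comes very close to a lane boundary.

    Both thresholds are illustrative assumptions.
    """
    weaving = pstdev(offsets) > weave_threshold
    near_boundary = any(abs(o) > boundary_threshold for o in offsets)
    return weaving or near_boundary


# Toy tracks over four frames in a lane spanning x = 50..150 pixels.
steady = lateral_offsets([100, 101, 99, 100], [50] * 4, [150] * 4)
weaving = lateral_offsets([60, 140, 60, 140], [50] * 4, [150] * 4)
```

Normalising by the lane half-width keeps the thresholds independent of camera distance, which matters when the observed vehicle moves closer to or further from the observer.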
[CV-30] Exploring Pre-training Across Domains for Few-Shot Surgical Skill Assessment MICCAI2025
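[Quick read]: This paper addresses the difficulty of training automated surgical skill assessment (SSA) models when skill annotations are scarce. The key to the solution is to formulate SSA as a few-shot learning (FSL) problem and to study systematically how self-supervised pre-training strategies affect downstream few-shot SSA performance. The study finds that pre-training on small but domain-relevant datasets can clearly outperform pre-training on large but poorly aligned ones. Adding procedure-specific data to pre-training, together with a domain-relevant external dataset, further improves performance, with average gains of 1.22% in accuracy and 2.28% in F1-score. Conversely, when the pre-training data differ too much from the target domain, the same strategy can degrade performance. These results highlight the central role of pre-training data domain relevance in few-shot SSA.
Link: https://arxiv.org/abs/2509.09327
Authors: Dimitrios Anastasiou,Razvan Caramalau,Nazir Sirajudeen,Matthew Boal,Philip Edwards,Justin Collins,John Kelly,Ashwin Sridhar,Maxine Tran,Faiz Mumtaz,Nevil Pavithran,Nader Francis,Danail Stoyanov,Evangelos B. Mazomenos
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at MICCAI 2025 DEMI Workshop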
Abstract:Automated surgical skill assessment (SSA) is a central task in surgical computer vision. Developing robust SSA models is challenging due to the scarcity of skill annotations, which are time-consuming to produce and require expert consensus. Few-shot learning (FSL) offers a scalable alternative enabling model development with minimal supervision, though its success critically depends on effective pre-training. While widely studied for several surgical downstream tasks, pre-training has remained largely unexplored in SSA. In this work, we formulate SSA as a few-shot task and investigate how self-supervised pre-training strategies affect downstream few-shot SSA performance. We annotate a publicly available robotic surgery dataset with Objective Structured Assessment of Technical Skill (OSATS) scores, and evaluate various pre-training sources across three few-shot settings. We quantify domain similarity and analyze how domain gap and the inclusion of procedure-specific data into pre-training influence transferability. Our results show that small but domain-relevant datasets can outperform large scale, less aligned ones, achieving accuracies of 60.16%, 66.03%, and 73.65% in the 1-, 2-, and 5-shot settings, respectively. Moreover, incorporating procedure-specific data into pre-training with a domain-relevant external dataset significantly boosts downstream performance, with an average gain of +1.22% in accuracy and +2.28% in F1-score; however, applying the same strategy with less similar but large-scale sources can instead lead to performance degradation. Code and models are available at this https URL.
[CV-31] Fine-Grained Customized Fashion Design with Image-into-Prompt benchmark and dataset from LMM
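[Quick read]: This paper addresses the difficulty of fine-grained customization when generative AI is applied to fashion design: end users without professional background knowledge produce text descriptions whose uncertainty makes it hard to map them precisely onto concrete design requirements. The key to the solution is the Better Understanding Generation (BUG) workflow, which combines a large language model (LLM) with a large multimodal model (LMM) and uses an image-into-prompt mechanism to extract semantic information automatically from the user's chat interaction and generate high-quality, customizable garment designs, lowering the barrier to design and increasing personalization.
Link: https://arxiv.org/abs/2509.09324
Authors: Hui Li,Yi You,Qiqi Chen,Bingfeng Zhang,George Q. Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: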
Abstract:Generative AI evolves the execution of complex workflows in industry, where the large multimodal model empowers fashion design in the garment industry. Current generation AI models magically transform brainstorming into fancy designs easily, but the fine-grained customization still suffers from text uncertainty without professional background knowledge from end-users. Thus, we propose the Better Understanding Generation (BUG) workflow with LMM to automatically create and fine-grain customize the cloth designs from chat with image-into-prompt. Our framework unleashes users’ creative potential beyond words and also lowers the barriers of clothing design/editing without further human involvement. To prove the effectiveness of our model, we propose a new FashionEdit dataset that simulates the real-world clothing design workflow, evaluated from generation similarity, user satisfaction, and quality. The code and dataset: this https URL.
[CV-32] Image Recognition with Vision and Language Embeddings of VLMs
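[Quick read]: This paper addresses the under-explored purely visual inference capabilities of vision-language models (VLMs), and in particular how to exploit the complementarity of language guidance and purely visual features to improve zero-shot image classification. The key to the solution is a learning-free fusion method that weights the language-guided and visual-similarity results by per-class precision, drawing on the strengths of both and improving overall classification accuracy.
Link: https://arxiv.org/abs/2509.09311
Authors: Illia Volkov,Nikita Kisel,Klara Janouskova,Jiri Matas
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: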
Abstract:Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both language-guided and vision-only image classification with a diverse set of dual-encoder VLMs, including both well-established and recent models such as SigLIP 2 and RADIOv2.5. The performance is compared in a standard setup on the ImageNet-1k validation set and its label-corrected variant. The key factors affecting accuracy are analysed, including prompt design, class diversity, the number of neighbours in k-NN, and reference set size. We show that language and vision offer complementary strengths, with some classes favouring textual prompts and others better handled by visual similarity. To exploit this complementarity, we introduce a simple, learning-free fusion method based on per-class precision that improves classification performance. The code is available at: this https URL.
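One plausible reading of a learning-free fusion based on per-class precision is sketched below: estimate, on a labeled reference set, how precise each branch is for each class it predicts, then for every test image keep the prediction from the branch whose predicted class has the higher precision. This selection rule, the function names, and the tie-breaking are assumptions for illustration; the abstract does not specify the exact fusion formula.

```python
from collections import Counter


def per_class_precision(predictions, labels):
    """Precision of each predicted class: correct predictions / all predictions of it."""
    predicted = Counter(predictions)
    correct = Counter(p for p, y in zip(predictions, labels) if p == y)
    return {cls: correct[cls] / count for cls, count in predicted.items()}


def fuse(text_preds, knn_preds, text_precision, knn_precision):
    """Per sample, keep the branch whose predicted class is more precise.

    Ties go to the language-guided branch (an arbitrary illustrative choice).
    """
    fused = []
    for t, v in zip(text_preds, knn_preds):
        if text_precision.get(t, 0.0) >= knn_precision.get(v, 0.0):
            fused.append(t)
        else:
            fused.append(v)
    return fused


# Reference set: ground truth plus the two branches' predictions on it.
ref_labels = ["cat", "dog", "cat", "dog", "cat"]
ref_text = ["cat", "cat", "cat", "dog", "cat"]   # language-guided (prompt) branch
ref_knn = ["cat", "dog", "dog", "dog", "dog"]    # vision-only k-NN branch

text_precision = per_class_precision(ref_text, ref_labels)
knn_precision = per_class_precision(ref_knn, ref_labels)
```

No parameters are trained here: the precisions are plain counts on the reference set, which is what makes the fusion learning-free in this sketch.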
[CV-33] You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception
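[Quick read]: This paper addresses the domain gap in heterogeneous collaborative perception caused by differences between the models running on different vehicles. Existing methods bridge the gap through joint training or by storing models for all potential collaborators in advance, which is hard to deploy in practice. The key to the solution is a new inference-time adaptation framework, Progressive Heterogeneous Collaborative Perception (PHCP), which formulates the problem as few-shot unsupervised domain adaptation and dynamically aligns features through self-training during inference. It adapts without labeled data or joint training, improving collaborative perception across heterogeneous vehicle models.
Link: https://arxiv.org/abs/2509.09310
Authors: Hao Si,Ehsan Javanmardi,Manabu Tsukada
Affiliations: The University of Tokyo (东京大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: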
Abstract:Collaborative perception enables vehicles to overcome individual perception limitations by sharing information, allowing them to see further and through occlusions. In real-world scenarios, models on different vehicles are often heterogeneous due to manufacturer variations. Existing methods for heterogeneous collaborative perception address this challenge by fine-tuning adapters or the entire network to bridge the domain gap. However, these methods are impractical in real-world applications, as each new collaborator must undergo joint training with the ego vehicle on a dataset before inference, or the ego vehicle stores models for all potential collaborators in advance. Therefore, we pose a new question: Can we tackle this challenge directly during inference, eliminating the need for joint training? To answer this, we introduce Progressive Heterogeneous Collaborative Perception (PHCP), a novel framework that formulates the problem as few-shot unsupervised domain adaptation. Unlike previous work, PHCP dynamically aligns features by self-training an adapter during inference, eliminating the need for labeled data and joint training. Extensive experiments on the OPV2V dataset demonstrate that PHCP achieves strong performance across diverse heterogeneous scenarios. Notably, PHCP achieves performance comparable to SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.
[CV-34] Learning Object-Centric Representations in SAR Images with Multi-Level Feature Fusion
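[Quick read]: This paper addresses the entanglement of target and background clutter features in synthetic aperture radar (SAR) images, which prevents models from extracting clear target representations and hurts classification performance. The key to the solution is SlotSAR, an object-centric learning (OCL) framework that requires no mask annotations. It builds complementary multi-level representations by combining high-level semantic features from SARATR-X with low-level scattering features from a wavelet scattering network, and uses a multi-level slot attention module to make slot-wise representations more distinctive, effectively disentangling targets from background clutter.
Link: https://arxiv.org/abs/2509.09298
Authors: Oh-Tae Jang,Min-Gon Cho,Kyung-Tae Kim
Affiliations: POSTECH (浦项科技大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures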
Abstract:Synthetic aperture radar (SAR) images contain not only targets of interest but also complex background clutter, including terrain reflections and speckle noise. In many cases, such clutter exhibits intensity and patterns that resemble targets, leading models to extract entangled or spurious features. Such behavior undermines the ability to form clear target representations, regardless of the classifier. To address this challenge, we propose a novel object-centric learning (OCL) framework, named SlotSAR, that disentangles target representations from background clutter in SAR images without mask annotations. SlotSAR first extracts high-level semantic features from SARATR-X and low-level scattering features from the wavelet scattering network in order to obtain complementary multi-level representations for robust target characterization. We further present a multi-level slot attention module that integrates these low- and high-level features to enhance slot-wise representation distinctiveness, enabling effective OCL. Experimental results demonstrate that SlotSAR achieves state-of-the-art performance in SAR imagery by preserving structural details compared to existing OCL methods.
[CV-35] Model-Agnostic Open-Set Air-to-Air Visual Object Detection for Reliable UAV Perception
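[Quick read]: This paper addresses the marked performance drop of traditional closed-set detectors in real-world air-to-air object detection for unmanned aerial vehicles (UAVs), caused by domain shift and corrupted flight data, which undermines reliability in safety-critical applications. The key to the solution is a model-agnostic open-set detection framework designed for embedding-based detectors. It models semantic uncertainty in the embedding space via entropy estimation and combines spectral normalization with temperature scaling to sharpen discrimination of unknown objects, rejecting unknowns while staying robust to corrupted flight data. Experiments on the AOT aerial benchmark and in real-world flight tests show improvements over baseline YOLO detectors, with up to a 10% relative AUROC gain, and background rejection further strengthens robustness without compromising detection accuracy.
Link: https://arxiv.org/abs/2509.09297
Authors: Spyridon Loukovitis,Anastasios Arsenos,Vasileios Karampinis,Athanasios Voulodimos
Affiliations: National Technical University Athens (国立技术大学雅典); National & Kapodistrian University of Athens (雅典国立卡波迪斯特里安大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: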
Abstract:Open-set detection is crucial for robust UAV autonomy in air-to-air object detection under real-world conditions. Traditional closed-set detectors degrade significantly under domain shifts and flight data corruption, posing risks to safety-critical applications. We propose a novel, model-agnostic open-set detection framework designed specifically for embedding-based detectors. The method explicitly handles unknown object rejection while maintaining robustness against corrupted flight data. It estimates semantic uncertainty via entropy modeling in the embedding space and incorporates spectral normalization and temperature scaling to enhance open-set discrimination. We validate our approach on the challenging AOT aerial benchmark and through extensive real-world flight tests. Comprehensive ablation studies demonstrate consistent improvements over baseline methods, achieving up to a 10% relative AUROC gain compared to standard YOLO-based detectors. Additionally, we show that background rejection further strengthens robustness without compromising detection accuracy, making our solution particularly well-suited for reliable UAV perception in dynamic air-to-air environments.
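The entropy-plus-temperature part of this idea can be sketched in a few lines. The example below assumes the detector exposes per-class logits for each detection; it applies temperature scaling, computes normalised Shannon entropy, and rejects a detection as unknown when the entropy is high. The function names, temperature, and threshold are hypothetical, and the paper's embedding-space modeling and spectral normalization are not reproduced.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def is_unknown(logits, temperature=2.0, threshold=0.8):
    """Reject a detection as unknown when its class distribution is too uncertain.

    The temperature and threshold are illustrative assumptions.
    """
    return normalized_entropy(softmax(logits, temperature)) > threshold


confident = is_unknown([10.0, 0.0, 0.0])   # one class clearly dominates
ambiguous = is_unknown([0.1, 0.0, 0.05])   # nearly uniform scores
```

In practice the temperature and threshold would be calibrated on held-out data, since the right operating point depends on how costly a missed unknown object is for the application.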
[CV-36] Modality-Agnostic Input Channels Enable Segmentation of Brain lesions in Multimodal MRI with Sequences Unavailable During Training MICCAI2025
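[Quick read]: This paper addresses the performance drop of existing brain MRI segmentation models on image modalities (such as different imaging contrasts) not seen during training; conventional models cannot handle inputs that contain new modalities, previously seen modalities, or mixtures of the two. The key to the solution is a simple but effective change to the U-net architecture: a modality-agnostic input channel or pathway is added in parallel with the existing modality-specific input channels, allowing flexible inference on whatever modalities are available. To train the modality-agnostic component, the authors design an image augmentation scheme that synthesizes artificial MRI modalities by altering the appearance of lesions and healthy brain tissue differently while keeping the anatomy realistic. Experiments on 8 databases, 5 pathology types (e.g. stroke and tumours) and 8 modalities show that the method keeps its segmentation accuracy on modalities seen during training and can also use new, unseen modalities to improve segmentation.
Link: https://arxiv.org/abs/2509.09290
Authors: Anthony P. Addison,Felix Wagner,Wentian Xu,Natalie Voets,Konstantinos Kamnitsas
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to MICCAI 2025, for the following workshop: ML-CDS 2025: Multimodal Learning and Fusion Across Scales for Clinical Decision Support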
Abstract:Segmentation models are important tools for the detection and analysis of lesions in brain MRI. Depending on the type of brain pathology that is imaged, MRI scanners can acquire multiple, different image modalities (contrasts). Most segmentation models for multimodal brain MRI are restricted to fixed modalities and cannot effectively process new ones at inference. Some models generalize to unseen modalities but may lose discriminative modality-specific information. This work aims to develop a model that can perform inference on data that contain image modalities unseen during training, previously seen modalities, and heterogeneous combinations of both, thus allowing a user to utilize any available imaging modalities. We demonstrate this is possible with a simple, thus practical alteration to the U-net architecture, by integrating a modality-agnostic input channel or pathway, alongside modality-specific input channels. To train this modality-agnostic component, we develop an image augmentation scheme that synthesizes artificial MRI modalities. Augmentations differentially alter the appearance of pathological and healthy brain tissue to create artificial contrasts between them while maintaining realistic anatomical integrity. We evaluate the method using 8 MRI databases that include 5 types of pathologies (stroke, tumours, traumatic brain injury, multiple sclerosis and white matter hyperintensities) and 8 modalities (T1, T1+contrast, T2, PD, SWI, DWI, ADC and FLAIR). The results demonstrate that the approach preserves the ability to effectively process MRI modalities encountered during training, while being able to process new, unseen modalities to improve its segmentation. Project code: this https URL
[CV-37] Visual Programmability: A Guide for Code-as-Thought in Chart Understanding
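[Quick read]: This paper addresses the limited reasoning ability of vision-language models (VLMs) on chart understanding. Existing methods either depend on external tools, which makes them brittle, or adopt a single text-based chain-of-thought (CoT) strategy whose intermediate steps are hard to verify, which hurts factual consistency and makes optimization harder. The key to the solution is the Code-as-Thought (CaT) framework, which converts the visual information in a chart into a verifiable, symbolic code representation, together with Visual Programmability, a learnable property that lets the model choose dynamically between the CaT pathway and a direct visual reasoning pathway. The selection policy is trained with reinforcement learning under a dual-reward scheme: a data-accuracy reward suppresses numerical hallucination, and a decision reward guides strategy selection. The model thus learns not only to reason correctly but also to pick the most suitable reasoning pathway for each task.
Link: https://arxiv.org/abs/2509.09286
Authors: Bohao Tang,Yan Ma,Fei Zhang,Jiadi Su,Ethan Chern,Zhulin Hu,Zhixin Wang,Pengfei Liu,Ya Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: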
Abstract:Chart understanding presents a critical test to the reasoning capabilities of Vision-Language Models (VLMs). Prior approaches face critical limitations: some rely on external tools, making them brittle and constrained by a predefined toolkit, while others fine-tune specialist models that often adopt a single reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate steps of text-based reasoning are difficult to verify, which complicates the use of reinforcement-learning signals that reward factual accuracy. To address this, we propose a Code-as-Thought (CaT) approach to represent the visual information of a chart in a verifiable, symbolic format. Our key insight is that this strategy must be adaptive: a fixed, code-only implementation consistently fails on complex charts where symbolic representation is unsuitable. This finding leads us to introduce Visual Programmability: a learnable property that determines if a chart-question pair is better solved with code or direct visual analysis. We implement this concept in an adaptive framework where a VLM learns to choose between the CaT pathway and a direct visual reasoning pathway. The selection policy of the model is trained with reinforcement learning using a novel dual-reward system. This system combines a data-accuracy reward to ground the model in facts and prevent numerical hallucination, with a decision reward that teaches the model when to use each strategy, preventing it from defaulting to a single reasoning mode. Experiments demonstrate strong and robust performance across diverse chart-understanding benchmarks. Our work shows that VLMs can be taught not only to reason but also how to reason, dynamically selecting the optimal reasoning pathway for each task.
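To make the dual-reward idea concrete, here is a loose sketch of how two such terms might be combined. The abstract does not define either reward, so every detail below is an assumption: the data-accuracy term is taken to be the fraction of reference chart values the generated code reproduces within a relative tolerance, the decision term is 1 when the chosen pathway matches a preferred one, and the two are mixed with equal hypothetical weights.

```python
def data_accuracy_reward(extracted, reference, rel_tol=0.05):
    """Fraction of reference values reproduced within a relative tolerance."""
    if not reference:
        return 0.0
    hits = 0
    for key, true_value in reference.items():
        value = extracted.get(key)
        if value is None:
            continue
        if abs(value - true_value) <= rel_tol * max(abs(true_value), 1e-9):
            hits += 1
    return hits / len(reference)


def decision_reward(chosen_pathway, preferred_pathway):
    """1.0 if the model picked the pathway judged better for this chart, else 0.0."""
    return 1.0 if chosen_pathway == preferred_pathway else 0.0


def total_reward(extracted, reference, chosen, preferred,
                 w_data=0.5, w_decision=0.5):
    """Weighted sum of both terms; the data term only applies on the code pathway."""
    data_term = data_accuracy_reward(extracted, reference) if chosen == "code" else 0.0
    return w_data * data_term + w_decision * decision_reward(chosen, preferred)


# The generated code recovers Q1 correctly but misreads Q2.
reference = {"Q1": 10.0, "Q2": 20.0}
extracted = {"Q1": 10.0, "Q2": 25.0}
reward = total_reward(extracted, reference, chosen="code", preferred="code")
```

Keeping the two terms separate means a model cannot score well by always choosing code: it is rewarded for the choice and, independently, for how faithfully the code reflects the chart.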
[CV-38] Unified Start, Personalized End: Progressive Pruning for Efficient 3D Medical Image Segmentation
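[Quick read]: This paper addresses the heavy resource consumption and long training time of 3D medical image segmentation, which limit its scalability and rapid deployment in clinical environments. Existing efficient segmentation models are typically static and manually designed, with a structure fixed before training, which makes it hard to adapt to different tasks and to balance performance against resource efficiency. The key to the solution is PSP-Seg (Progressive Pruning Segmentation), a framework that starts from a redundant model and progressively removes redundant modules through an iterative pruning strategy combining block-wise pruning with a functional decoupling loss, enabling dynamic and efficient 3D segmentation. Experiments show that the lightweight variant PSP-Seg-S matches nnU-Net in performance while reducing GPU memory usage by 42-45%, training time by 29-48%, and parameter count by 83-87%.
Link: https://arxiv.org/abs/2509.09267
Authors: Linhao Li,Yiwen Ye,Ziyang Chen,Yong Xia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures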
Abstract:3D medical image segmentation often faces heavy resource and time consumption, limiting its scalability and rapid deployment in clinical environments. Existing efficient segmentation models are typically static and manually designed prior to training, which restricts their adaptability across diverse tasks and makes it difficult to balance performance with resource efficiency. In this paper, we propose PSP-Seg, a progressive pruning framework that enables dynamic and efficient 3D segmentation. PSP-Seg begins with a redundant model and iteratively prunes redundant modules through a combination of block-wise pruning and a functional decoupling loss. We evaluate PSP-Seg on five public datasets, benchmarking it against seven state-of-the-art models and six efficient segmentation models. Results demonstrate that the lightweight variant, PSP-Seg-S, achieves performance on par with nnU-Net while reducing GPU memory usage by 42-45%, training time by 29-48%, and parameter number by 83-87% across all datasets. These findings underscore PSP-Seg’s potential as a cost-effective yet high-performing alternative for widespread clinical application.
[CV-39] DATE: Dynamic Absolute Time Enhancement for Long Video Understanding
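[Quick read]: This paper addresses the degraded key-event localization and temporal understanding in long video understanding that results from a lack of precise temporal reasoning. Existing methods rely on uniform sampling and implicit position encodings, which struggle to model long-range temporal dependencies and tend to lose information. The core of the solution is Dynamic Absolute Time Enhancement (DATE), built on two techniques. The first is the Timestamp Injection Mechanism (TIM), which interleaves textual timestamp tokens with the video frame embeddings to construct a continuous temporal reference system. The second is the semantically guided Temporal-Aware Similarity Sampling (TASS) strategy, which recasts video sampling as a vision-language retrieval task and uses a two-stage algorithm: it first enriches the query into a descriptive caption to align better with the visual features, then selects key-event frames with a similarity-driven, temporally regularized greedy strategy, balancing semantic relevance with temporal coverage.
Link: https://arxiv.org/abs/2509.09263
Authors: Chao Yuan,Yang Yang,Yehui Yang,Zach Cheng
Affiliations: Beihang University (北京航空航天大学); Dcar, ByteDance (字节跳动); Qfin Holdings, Inc (Qfin控股公司); MAIS, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: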
Abstract:Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs), particularly in tasks requiring precise temporal reasoning and event localization. Existing approaches typically adopt uniform frame sampling and rely on implicit position encodings to model temporal order. However, these methods struggle with long-range dependencies, leading to critical information loss and degraded temporal comprehension. In this paper, we propose Dynamic Absolute Time Enhancement (DATE) that enhances temporal awareness in MLLMs through the Timestamp Injection Mechanism (TIM) and a semantically guided Temporal-Aware Similarity Sampling (TASS) strategy. Specifically, we interleave video frame embeddings with textual timestamp tokens to construct a continuous temporal reference system. We further reformulate the video sampling problem as a vision-language retrieval task and introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage: enriching each query into a descriptive caption to better align with the vision feature, and sampling key event with a similarity-driven temporally regularized greedy strategy. Our method achieves remarkable improvements w.r.t. absolute time understanding and key event localization, resulting in state-of-the-art performance among 7B and 72B models on hour-long video benchmarks. Particularly, our 7B model even exceeds many 72B models on some benchmarks.
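The two mechanisms named above can be illustrated with a small sketch. It assumes each candidate frame already has a query-similarity score and a timestamp in seconds; the greedy selector takes frames in order of similarity while enforcing a minimum time gap (one simple way to regularise temporally), and the interleaving helper places a textual timestamp before each selected frame token. The function names, the minimum-gap rule, and the `<12.0s>` token format are assumptions, not the paper's actual design.

```python
def select_keyframes(similarities, timestamps, num_frames, min_gap_s):
    """Greedy pick by similarity, skipping frames too close in time to chosen ones."""
    by_similarity = sorted(range(len(similarities)),
                           key=lambda i: similarities[i], reverse=True)
    chosen = []
    for i in by_similarity:
        if len(chosen) == num_frames:
            break
        if all(abs(timestamps[i] - timestamps[j]) >= min_gap_s for j in chosen):
            chosen.append(i)
    return sorted(chosen)  # restore temporal order


def interleave_timestamps(frame_tokens, timestamps):
    """Put a textual timestamp token before each frame token (hypothetical format)."""
    sequence = []
    for token, t in zip(frame_tokens, timestamps):
        sequence.append(f"<{t:.1f}s>")
        sequence.append(token)
    return sequence


similarities = [0.9, 0.85, 0.2, 0.8, 0.1]
timestamps = [0.0, 1.0, 2.0, 10.0, 20.0]
picked = select_keyframes(similarities, timestamps, num_frames=2, min_gap_s=5.0)
sequence = interleave_timestamps([f"frame{i}" for i in picked],
                                 [timestamps[i] for i in picked])
```

Frame 1 scores higher than frame 3 but is skipped because it sits one second from frame 0; this is the trade-off between semantic relevance and temporal coverage that the abstract describes.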
[CV-40] Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis
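[Quick read]: This paper addresses the weak performance of current large vision-language models (LVLMs) in dentistry, particularly in interpreting panoramic X-rays. Existing medical benchmarks and instruction datasets do not adequately cover the dense anatomical structures and subtle pathological cues in dental images, which limits the diagnostic accuracy of LVLMs in this field. The key to the solution is MMOral, the first large-scale multimodal instruction dataset for panoramic X-ray interpretation, with its accompanying evaluation benchmark MMOral-Bench, and OralGPT, a model fine-tuned on this dataset for dentistry. A single epoch of supervised fine-tuning (SFT) yields a substantial gain (OralGPT improves on its base model by 24.73%), showing the central role of high-quality domain-specific data in making LVLMs useful in dental clinical settings.
Link: https://arxiv.org/abs/2509.09254
Authors: Jing Hao,Yuxuan Fan,Yanpeng Sun,Kaixin Guo,Lizhuo Lin,Jinrong Yang,Qi Yong H. Ai,Lun M. Wong,Hao Tang,Kuo Feng Hung
Affiliations: The University of Hong Kong (香港大学); The Hong Kong University of Science and Technology (广州) (香港科技大学(广州)); National University of Singapore (新加坡国立大学); CVTE; Sun Yat-sen University (中山大学); The University of Hong Kong (香港大学); The Chinese University of Hong Kong (香港中文大学); Peking University (北京大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 40 pages, 26 figures, 9 tables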
Abstract:Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at this https URL.
[CV-41] CoAtNeXt: An Attention-Enhanced ConvNeXtV2-Transformer Hybrid Model for Gastric Tissue Classification
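[Quick Read]: This paper addresses the low efficiency and poor consistency of manual diagnosis in gastric histopathology image analysis, and the risk that critical lesions are missed, all of which limit the accuracy and reproducibility of early gastric disease diagnosis. The key to the solution is CoAtNeXt, a new hybrid model whose core innovations are replacing the MBConv layers of the CoAtNet architecture with enhanced ConvNeXtV2 blocks to improve feature extraction, and integrating the Convolutional Block Attention Module (CBAM) to strengthen local feature capture through channel and spatial attention. The architecture also balances computational efficiency against classification performance, and on two public datasets it outperforms 10 CNN and 10 Vision Transformer (ViT) models in both multiclass and binary classification.
Link: https://arxiv.org/abs/2509.09242
Authors: Mustafa Yurdakul, Sakir Tasdemir
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: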
Abstract:Background and objective: Early diagnosis of gastric diseases is crucial to prevent fatal outcomes. Although histopathologic examination remains the diagnostic gold standard, it is performed entirely manually, making evaluations labor-intensive and prone to variability among pathologists. Critical findings may be missed, and the lack of standard procedures reduces consistency. These limitations highlight the need for automated, reliable, and efficient methods for gastric tissue analysis. Methods: In this study, a novel hybrid model named CoAtNeXt was proposed for the classification of gastric tissue images. The model is built upon the CoAtNet architecture by replacing its MBConv layers with enhanced ConvNeXtV2 blocks. Additionally, the Convolutional Block Attention Module (CBAM) is integrated to improve local feature extraction through channel and spatial attention mechanisms. The architecture was scaled to achieve a balance between computational efficiency and classification performance. CoAtNeXt was evaluated on two publicly available datasets, HMU-GC-HE-30K for eight-class classification and GasHisSDB for binary classification, and was compared against ten Convolutional Neural Networks (CNNs) and ten Vision Transformer (ViT) models. Results: CoAtNeXt achieved 96.47% accuracy, 96.60% precision, 96.47% recall, 96.45% F1 score, and 99.89% AUC on HMU-GC-HE-30K. On GasHisSDB, it reached 98.29% accuracy, 98.07% precision, 98.41% recall, 98.23% F1 score, and 99.90% AUC. It outperformed all CNN and ViT models tested and surpassed previous studies in the literature. Conclusion: Experimental results show that CoAtNeXt is a robust architecture for histopathological classification of gastric tissue images, performing well on both binary and multiclass classification. This highlights its potential to assist pathologists by enhancing diagnostic accuracy and reducing workload.
[CV-42] Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation Transformation and Enhancement
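[Quick Read]: This paper addresses two key limitations of current in-context learning (ICL) models for medical image analysis: they struggle to achieve high-fidelity predictions and global anatomical understanding at the same time, and no unified model is trained across diverse medical imaging tasks (such as segmentation and enhancement) and anatomical regions. The authors propose Medverse, a universal ICL model for 3D medical images, trained on 22 datasets that cover universal image segmentation, transformation, and enhancement tasks across multiple organs, imaging modalities, and clinical centers. Its core innovation is a next-scale autoregressive ICL framework that refines predictions progressively from coarse to fine, producing consistent, full-resolution volumetric outputs with multi-scale anatomical awareness. It also introduces a blockwise cross-attention module that enables long-range interactions between context and target inputs while using spatial sparsity to keep computation efficient.
Link: https://arxiv.org/abs/2509.09232
Authors: Jiesi Hu, Jianfeng Cao, Yanwu Yang, Chenfei Ye, Yixuan Zhang, Hanyang Peng, Ting Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: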
Abstract:In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present Medverse, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights will be made publicly available. Our model is publicly available at this https URL.
[CV-43] MGTraj: Multi-Granularity Goal-Guided Human Trajectory Prediction with Recursive Refinement Network
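[Quick Read]: This paper addresses the granularity mismatch caused by task decomposition in current goal-guided human trajectory prediction: coarse-grained goal prediction and fine-grained trajectory generation lack an intermediate level linking them, which limits prediction accuracy and plausibility. The key to the solution is MGTraj, which recursively encodes trajectory proposals from coarse to fine granularity. At each level, a transformer-based recursive refinement network (RRN) captures features and predicts progressive refinements; a weight-sharing strategy integrates features across granularities, and velocity prediction serves as an auxiliary task to further improve performance, so that multi-scale human motion patterns and goal intent are combined effectively.
Link: https://arxiv.org/abs/2509.09200
Authors: Ge Sun, Jun Ma
Affiliations: The Hong Kong University of Science and Technology; The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: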
Abstract:Accurate human trajectory prediction is crucial for robot navigation and autonomous driving. Recent research has demonstrated that incorporating goal guidance significantly enhances prediction accuracy by reducing uncertainty and leveraging prior knowledge. Most goal-guided approaches decouple the prediction task into two stages, goal prediction and subsequent trajectory completion based on the predicted goal, which operate at extreme granularities: coarse-grained goal prediction forecasts the overall intention, while fine-grained trajectory completion needs to generate the positions for all future timesteps. The potential utility of intermediate temporal granularity remains largely unexplored, which motivates multi-granularity trajectory modeling. While prior work has shown that multi-granularity representations capture diverse scales of human dynamics and motion patterns, effectively integrating this concept into goal-guided frameworks remains challenging. In this paper, we propose MGTraj, a novel Multi-Granularity goal-guided model for human Trajectory prediction. MGTraj recursively encodes trajectory proposals from coarse to fine granularity levels. At each level, a transformer-based recursive refinement network (RRN) captures features and predicts progressive refinements. Features across different granularities are integrated using a weight-sharing strategy, and velocity prediction is employed as an auxiliary task to further enhance performance. Comprehensive experimental results on ETH/UCY and the Stanford Drone Dataset indicate that MGTraj outperforms baseline methods and achieves state-of-the-art performance among goal-guided methods.
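The coarse-to-fine recursion described in this abstract can be sketched in a few lines. This is only an illustration of the control flow under stated assumptions: the midpoint upsampling rule and the `refine_fn` callback are hypothetical, and a stub that returns zero residuals stands in for the transformer-based RRN, whose details the abstract does not give.

```python
def upsample(traj):
    """Double the temporal resolution by inserting the midpoint
    between every pair of consecutive waypoints."""
    fine = [traj[0]]
    for (x0, y0), (x1, y1) in zip(traj, traj[1:]):
        fine.append(((x0 + x1) / 2, (y0 + y1) / 2))
        fine.append((x1, y1))
    return fine


def coarse_to_fine(start, goal, levels, refine_fn):
    """Start from the coarsest proposal (start and predicted goal) and
    alternate upsampling with a learned residual refinement."""
    traj = [start, goal]
    for level in range(levels):
        traj = upsample(traj)
        # refine_fn is a placeholder for the refinement network; it must
        # return one (dx, dy) residual per waypoint at this level.
        residuals = refine_fn(traj, level)
        traj = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(traj, residuals)]
    return traj


def zero_refiner(traj, level):
    """Stub refiner that leaves the interpolated proposal unchanged."""
    return [(0.0, 0.0)] * len(traj)


path = coarse_to_fine((0.0, 0.0), (4.0, 0.0), levels=2, refine_fn=zero_refiner)
```

With the stub refiner the result is a straight line sampled at five points; a trained refiner would bend the intermediate waypoints at each granularity.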
[CV-44] Breaking the Statistical Similarity Trap in Extreme Convection Detection
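[Quick Read]: This paper addresses the "Statistical Similarity Trap" in current evaluation metrics for deep learning weather models: existing metrics tend to reward blurry predictions while overlooking rare but high-impact extreme convection events. The authors propose DART (Dual Architecture for Regression Tasks), whose core innovations are a dual-decoder architecture with explicit background/extreme decomposition, physically motivated oversampling, and task-specific loss functions. DART converts coarse-resolution atmospheric forecasts into high-resolution satellite brightness temperature fields optimized for detecting extreme convection (brightness temperature below 220 K). The approach is validated against multiple baselines; notably, removing Integrated Water Vapor Transport (IVT), usually regarded as a key feature, improves extreme convection detection by 270%, supporting the necessity and practicality of the architectural design.
Link: https://arxiv.org/abs/2509.09195
Authors: Md Tanveer Hossain Munim
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 43 pages, 7 figures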
Abstract:Current evaluation metrics for deep learning weather models create a “Statistical Similarity Trap”, rewarding blurry predictions while missing rare, high-impact events. We provide quantitative evidence of this trap, showing sophisticated baselines achieve 97.9% correlation yet 0.00 CSI for dangerous convection detection. We introduce DART (Dual Architecture for Regression Tasks), a framework addressing the challenge of transforming coarse atmospheric forecasts into high-resolution satellite brightness temperature fields optimized for extreme convection detection (below 220 K). DART employs dual-decoder architecture with explicit background/extreme decomposition, physically motivated oversampling, and task-specific loss functions. We present four key findings: (1) empirical validation of the Statistical Similarity Trap across multiple sophisticated baselines; (2) the “IVT Paradox”, removing Integrated Water Vapor Transport, widely regarded as essential for atmospheric river analysis, improves extreme convection detection by 270%; (3) architectural necessity demonstrated through operational flexibility (DART achieves CSI = 0.273 with bias = 2.52 vs. 6.72 for baselines at equivalent CSI), and (4) real-world validation with the August 2023 Chittagong flooding disaster as a case study. To our knowledge, this is the first work to systematically address this hybrid conversion-segmentation-downscaling task, with no direct prior benchmarks identified in existing literature. Our validation against diverse statistical and deep learning baselines sufficiently demonstrates DART’s specialized design. The framework enables precise operational calibration through beta-tuning, trains in under 10 minutes on standard hardware, and integrates seamlessly with existing meteorological workflows, demonstrating a pathway toward trustworthy AI for extreme weather preparedness.
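The trap described in this abstract is easy to reproduce on toy numbers. The sketch below uses the standard forecast-verification definitions of the Critical Success Index (hits / (hits + misses + false alarms)) and frequency bias ((hits + false alarms) / (hits + misses)); treating the paper's "bias" as this frequency bias is an assumption, and the temperature values are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def csi_and_bias(pred, obs, threshold=220.0):
    """An 'event' is a brightness temperature below the threshold (K)."""
    hits = sum(p < threshold and o < threshold for p, o in zip(pred, obs))
    misses = sum(p >= threshold and o < threshold for p, o in zip(pred, obs))
    false_alarms = sum(p < threshold and o >= threshold for p, o in zip(pred, obs))
    denom = hits + misses + false_alarms
    csi = hits / denom if denom else float("nan")
    observed = hits + misses
    bias = (hits + false_alarms) / observed if observed else float("nan")
    return csi, bias


# Invented observations (K) with two extreme-convection pixels below 220 K.
obs = [280.0, 275.0, 270.0, 265.0, 210.0, 205.0, 260.0, 270.0]
mean_obs = sum(obs) / len(obs)
# A "blurry" forecast: the same pattern shrunk halfway toward the mean.
blurry = [mean_obs + 0.5 * (o - mean_obs) for o in obs]

corr = pearson(blurry, obs)
csi, bias = csi_and_bias(blurry, obs)
```

Because the blurry forecast is a linear rescaling of the observations, its correlation is essentially perfect, yet neither cold pixel stays below 220 K, so both events are missed and the CSI is zero.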
[CV-45] VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results ICCV
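[Quick Read]: This paper addresses the lack of fine-grained, interpretable, human-preference-aligned evaluation of large multimodal models (LMMs) on open-domain visual quality comparison. The key to the solution is a new benchmark of visual quality comparison tasks ranging from coarse to fine granularity, with thousands of tasks over single images, image pairs, and multi-image groups, evaluated with a holistic protocol that combines 2AFC (Two-Alternative Forced Choice) binary preference judgments and multiple-choice questions (MCQs). The benchmark is intended to advance LMM capabilities in visual quality reasoning and comparison and to lay a foundation for future interpretable, human-aligned quality evaluation systems.
Link: https://arxiv.org/abs/2509.09190
Authors: Hanwei Zhu, Haoning Wu, Zicheng Zhang, Lingyu Zhu, Yixuan Li, Peilin Chen, Shiqi Wang, Chris Wei Zhou, Linhan Cao, Wei Sun, Xiangyang Zhu, Weixia Zhang, Yucheng Zhu, Jing Liu, Dandan Zhu, Guangtao Zhai, Xiongkuo Min, Zhichao Zhang, Xinyue Li, Shubo Xu, Anh Dao, Yifan Li, Hongyuan Yu, Jiaojiao Yi, Yiding Tian, Yupeng Wu, Feiran Sun, Lijuan Liao, Song Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV VQualA Workshop 2025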
Abstract:This paper presents a summary of the VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge aims to evaluate and enhance the ability of state-of-the-art LMMs to perform open-ended and detailed reasoning about visual quality differences across multiple images. To this end, the competition introduces a novel benchmark comprising thousands of coarse-to-fine grained visual quality comparison tasks, spanning single images, pairs, and multi-image groups. Each task requires models to provide accurate quality judgments. The competition emphasizes holistic evaluation protocols, including 2AFC-based binary preference and multi-choice questions (MCQs). Around 100 participants submitted entries, with five models demonstrating the emerging capabilities of instruction-tuned LMMs on quality assessment. This challenge marks a significant step toward open-domain visual quality reasoning and comparison and serves as a catalyst for future research on interpretable and human-aligned quality evaluation systems.
[CV-46] Dark-ISP: Enhancing RAW Image Processing for Low-Light Object Detection
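[Quick Read]: This paper addresses the challenge that degraded image quality poses for low-light object detection, and in particular the information loss or framework complexity of existing methods that use RAW images. The key to the solution is Dark-ISP, a lightweight, self-adaptive Image Signal Processing (ISP) plugin that processes Bayer RAW images directly and supports end-to-end training. The core innovations are decomposing the conventional ISP pipeline into differentiable linear (sensor calibration) and nonlinear (tone mapping) sub-modules, with content-aware adaptation and physics-informed priors enabling task-driven RAW-to-RGB conversion, and a Self-Boost mechanism that exploits the ISP's inherent cascade structure to encourage cooperation among sub-modules. Together these improve low-light detection performance markedly with very few parameters.
Link: https://arxiv.org/abs/2509.09183
Authors: Jiasheng Guo, Xin Gao, Yuxiang Yan, Guanghao Li, Jian Pu
Affiliations: Institute of Science and Technology for Brain-inspired Intelligence, Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures, conference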
Abstract:Low-light object detection is crucial for many real-world applications but remains challenging due to degraded image quality. While recent studies have shown that RAW images offer superior potential over RGB images, existing approaches either use RAW-RGB images with information loss or employ complex frameworks. To address these issues, we propose a lightweight and self-adaptive Image Signal Processing (ISP) plugin, Dark-ISP, which directly processes Bayer RAW images in dark environments, enabling seamless end-to-end training for object detection. Our key innovations are: (1) We deconstruct conventional ISP pipelines into sequential linear (sensor calibration) and nonlinear (tone mapping) sub-modules, recasting them as differentiable components optimized through task-driven losses. Each module is equipped with content-aware adaptability and physics-informed priors, enabling automatic RAW-to-RGB conversion aligned with detection objectives. (2) By exploiting the ISP pipeline’s intrinsic cascade structure, we devise a Self-Boost mechanism that facilitates cooperation between sub-modules. Through extensive experiments on three RAW image datasets, we demonstrate that our method outperforms state-of-the-art RGB- and RAW-based detection approaches, achieving superior results with minimal parameters in challenging low-light environments.
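The split into a linear stage followed by a nonlinear tone-mapping stage can be pictured with a minimal sketch on a single pixel. Everything here is a simplifying assumption rather than the Dark-ISP design: the pixel is treated as already demosaiced RGB instead of Bayer RAW, the linear stage is per-channel gains plus a 3x3 color matrix, the tone curve is a plain gamma, and the parameters are fixed constants where the paper would learn them end to end.

```python
def linear_stage(rgb, gains, ccm):
    """Linear sub-module (assumed form): per-channel gains followed by
    a 3x3 color correction matrix."""
    balanced = [v * g for v, g in zip(rgb, gains)]
    return [sum(ccm[r][c] * balanced[c] for c in range(3)) for r in range(3)]


def tone_map(rgb, gamma):
    """Nonlinear sub-module (assumed form): clip to [0, 1], then apply a
    gamma curve that brightens dark values when gamma > 1."""
    return [min(max(v, 0.0), 1.0) ** (1.0 / gamma) for v in rgb]


def simple_isp(rgb, gains, ccm, gamma):
    """Cascade the two stages in the order the abstract describes."""
    return tone_map(linear_stage(rgb, gains, ccm), gamma)


identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
dark_pixel = [0.10, 0.25, 0.10]   # hypothetical normalized sensor values
out = simple_isp(dark_pixel, gains=[2.0, 1.0, 2.0], ccm=identity, gamma=2.0)
```

Both stages are compositions of multiplications, sums, and powers, which is what makes such a pipeline straightforward to express in an autodiff framework and train with a detection loss.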
[CV-47] Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios ICCV2025
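[Quick Read]: This paper addresses the inadequate performance of current AI-generated image detection methods in complex real-world scenarios, especially their lack of robustness to cross-scenario generalization, perturbations from internet transmission, and re-digitization. The key to the solution is a multi-dimensional evaluation benchmark, the Real-World Robustness Dataset (RRDataset), which characterizes real-world challenges along three dimensions: 1) content diversity, covering seven major high-fidelity scenarios (such as war and conflict, disasters and accidents, and political events) to fill semantic gaps in existing datasets; 2) internet transmission robustness, simulating repeated sharing of images across social platforms; 3) re-digitization robustness, applying four different re-digitization methods to test how models cope with image alterations. By benchmarking 17 detectors and 10 vision-language models (VLMs) on this dataset, and running a large-scale human study (192 participants) on human few-shot ability to recognize AI-generated images, the paper exposes the limitations of existing detection methods and argues that more robust detection algorithms should draw on human adaptability.
Link: https://arxiv.org/abs/2509.09172
Authors: Chunxiao Li, Xiaoxiao Wang, Meiling Li, Boming Miao, Peng Sun, Yunjian Zhang, Xiangyang Ji, Yao Zhu
Affiliations: Beijing Normal University; University of Chinese Academy of Sciences; Fudan University; Central University of Finance and Economics; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV2025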
Abstract:With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization: RRDataset encompasses high-quality images from seven major scenarios (War and Conflict, Disasters and Accidents, Political and Social Events, Medical and Public Health, Culture and Religion, Labor and Production, and everyday life), addressing existing dataset gaps from a content perspective. 2) Internet Transmission Robustness: examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 3) Re-digitization Robustness: assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms.
[CV-48] Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication
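[Quick Read]: This paper addresses the high computational complexity and transmission resource consumption that large-scale pretrained vision transformers (ViTs) incur when deployed in resource-constrained 6G edge intelligence systems, and in particular how to achieve efficient, robust semantic communication over noisy wireless channels. The key to the solution is a training-free adaptive token merging framework: the choice of per-layer merging proportions is formulated as a multi-objective optimization problem that balances model accuracy against computational cost, and Gaussian process-based Bayesian optimization is used to construct a Pareto frontier. At runtime, the merging strategy can then be adjusted to dynamic application requirements and channel quality, trading off latency and semantic fidelity on demand.
Link: https://arxiv.org/abs/2509.09168
Authors: Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis
Affiliations: KU 6G Research Centre; College of Computing and Mathematical Sciences; Khalifa University; University of Calgary; University of Oulu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: To appear in IEEE Globecom 2025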
Abstract:Large-scale transformer models have emerged as a powerful tool for semantic communication systems, enabling edge devices to extract rich representations for robust inference across noisy wireless channels. However, their substantial computational demands remain a major barrier to practical deployment in resource-constrained 6G networks. In this paper, we present a training-free framework for adaptive token merging in pretrained vision transformers to jointly reduce inference time and transmission resource usage. We formulate the selection of per-layer merging proportions as a multi-objective optimization problem to balance accuracy and computational cost. We employ Gaussian process-based Bayesian optimization to construct a Pareto frontier of optimal configurations, enabling flexible runtime adaptation to dynamic application requirements and channel conditions. Extensive experiments demonstrate that our method consistently outperforms other baselines and achieves significant reductions in floating-point operations while maintaining competitive accuracy across a wide range of signal-to-noise ratio (SNR) conditions. Additional results highlight the effectiveness of adaptive policies that adjust merging aggressiveness in response to channel quality, providing a practical mechanism to trade off latency and semantic fidelity on demand. These findings establish a scalable and efficient approach for deploying transformer-based semantic communication in future edge intelligence systems.
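The Pareto frontier mentioned in this abstract, and the runtime choice of a configuration from it, can be sketched as follows. This is only the filtering and selection step under stated assumptions: the paper searches the configuration space with Gaussian process-based Bayesian optimization, which is not reproduced here, and the configuration names, accuracies, and costs are invented.

```python
def pareto_front(configs):
    """Keep the (name, accuracy, cost) entries that no other entry
    dominates, where 'dominates' means at least as accurate and at
    least as cheap, and strictly better in one of the two."""
    front = []
    for name, acc, cost in configs:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for _, a, c in configs
        )
        if not dominated:
            front.append((name, acc, cost))
    return sorted(front, key=lambda t: t[2])  # cheapest first


def pick_config(front, max_cost):
    """Runtime adaptation (assumed policy): the most accurate frontier
    point that fits within the current compute budget."""
    feasible = [t for t in front if t[2] <= max_cost]
    return max(feasible, key=lambda t: t[1]) if feasible else None


# Hypothetical merging configurations: (name, accuracy, GFLOPs).
evaluated = [
    ("no_merge", 0.80, 10.0),
    ("light",    0.79,  7.0),
    ("uneven",   0.75,  8.0),   # worse and costlier than "light"
    ("heavy",    0.70,  4.0),
]
front = pareto_front(evaluated)
choice = pick_config(front, max_cost=8.0)
```

A channel-aware policy would map the measured SNR to a budget such as `max_cost` and call the selection step again whenever conditions change.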
[CV-49] CWSSNet: Hyperspectral Image Classification Enhanced by Wavelet Domain Convolution
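[Quick Read]: This paper addresses the feature redundancy that arises in fine-grained ground object classification of hyperspectral remote sensing images because of their many bands, high dimensionality, and spectral mixing, with the aim of improving classification accuracy and robustness. The key to the solution is a classification framework named CWSSNet, which combines 3D spectral-spatial features with wavelet convolution, uses a multiscale convolutional attention module to integrate multimodal information, and performs multi-band decomposition and convolution in the wavelet domain to overcome the performance bottleneck of traditional methods. In experiments on Yugan County, Jiangxi Province, the method achieves 74.50% mIoU, 82.73% mAcc, and 84.94% mF1, with the highest IoU for water bodies, vegetation, and bare land, and performance remains stable when the training set proportion is only 70%, supporting its reliability under small-sample conditions.
Link: https://arxiv.org/abs/2509.09163
Authors: Yulin Tong, Fengzong Zhang, Haiqin Cheng
Affiliations: East China Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: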
Abstract:Hyperspectral remote sensing technology has significant application value in fields such as forestry ecology and precision agriculture, while also putting forward higher requirements for fine ground object classification. However, although hyperspectral images are rich in spectral information and can improve recognition accuracy, they tend to cause prominent feature redundancy due to their numerous bands, high dimensionality, and spectral mixing characteristics. To address this, this study used hyperspectral images from the ZY1F satellite as a data source and selected Yugan County, Shangrao City, Jiangxi Province as the research area to perform ground object classification research. A classification framework named CWSSNet was proposed, which integrates 3D spectral-spatial features and wavelet convolution. This framework integrates multimodal information using a multiscale convolutional attention module and breaks through the classification performance bottleneck of traditional methods by introducing multi-band decomposition and convolution operations in the wavelet domain. The experiments showed that CWSSNet achieved 74.50%, 82.73%, and 84.94% in mean Intersection over Union (mIoU), mean Accuracy (mAcc), and mean F1-score (mF1) respectively in Yugan County. It also obtained the highest Intersection over Union (IoU) in the classification of water bodies, vegetation, and bare land, demonstrating good robustness. Additionally, when the training set proportion was 70%, the increase in training time was limited, and the classification effect was close to the optimal level, indicating that the model maintains reliable performance under small-sample training conditions.
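For readers unfamiliar with the reported metrics, the sketch below shows how per-class IoU and mIoU are conventionally computed from a confusion matrix. The convention that rows are ground truth and columns are predictions, and the toy counts, are assumptions for illustration and are not taken from the paper.

```python
def per_class_iou(conf):
    """IoU per class from a square confusion matrix where conf[t][p]
    counts pixels of true class t predicted as class p."""
    n = len(conf)
    ious = []
    for k in range(n):
        tp = conf[k][k]
        fn = sum(conf[k]) - tp                        # true k, predicted other
        fp = sum(conf[r][k] for r in range(n)) - tp   # predicted k, true other
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious


def mean_iou(conf):
    """Average IoU over the classes that actually occur."""
    valid = [v for v in per_class_iou(conf) if v == v]  # drop NaN entries
    return sum(valid) / len(valid)


# Toy two-class example (for instance water vs. bare land), invented counts.
confusion = [[8, 2],
             [1, 9]]
ious = per_class_iou(confusion)
miou = mean_iou(confusion)
```

Here class 0 scores 8/11 and class 1 scores 9/12, so the mean is a little under 0.74; mAcc and mF1 follow the same per-class-then-average pattern with different per-class formulas.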
[CV-50] A Knowledge Noise Mitigation Framework for Knowledge-based Visual Question Answering ICME2025
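[Quick Read]: This paper addresses the noise introduced by redundant information from external knowledge sources in knowledge-based visual question answering (KB-VQA). Existing methods inject retrieved knowledge directly into the model without filtering redundant content, which degrades answer quality. The key to the solution is a training-free "knowledge focusing" framework: first, it extracts low-noise queries by analyzing image-question pairs, improving the relevance of retrieved knowledge; second, it uses large language models to identify and extract answer-beneficial knowledge segments, reducing redundancy; finally, it applies a selective knowledge integration strategy that introduces knowledge only when the model lacks confidence, which substantially reduces the influence of redundant information. Experiments show that the method obtains more accurate and critical knowledge and outperforms current state-of-the-art methods.
Link: https://arxiv.org/abs/2509.09159
Authors: Zhiyue Liu, Sihang Liu, Jinyuan Liu, Xinru Zhang
Affiliations: Guangxi University; Guangxi Key Laboratory of Multimedia Communications and Network Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by the IEEE International Conference on Multimedia and Expo (ICME 2025) for oral presentation. © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses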
Abstract:Knowledge-based visual question answering (KB-VQA) requires a model to understand images and utilize external knowledge to provide accurate answers. Existing approaches often directly augment models with retrieved information from knowledge sources while ignoring substantial knowledge redundancy, which introduces noise into the answering process. To address this, we propose a training-free framework with knowledge focusing for KB-VQA that mitigates the impact of noise by enhancing knowledge relevance and reducing redundancy. First, for knowledge retrieval, our framework distills the essential parts of the image-question pairs, creating low-noise queries that enhance the retrieval of highly relevant knowledge. Considering that redundancy still persists in the retrieved knowledge, we then prompt large models to identify and extract answer-beneficial segments from knowledge. In addition, we introduce a selective knowledge integration strategy, allowing the model to incorporate knowledge only when it lacks confidence in answering the question, thereby mitigating the influence of redundant information. Our framework enables the acquisition of accurate and critical knowledge, and extensive experiments demonstrate that it outperforms state-of-the-art methods.
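The selective knowledge integration step lends itself to a short control-flow sketch. The abstract does not say how confidence is measured or what threshold is used, so the `threshold` value, the callback signatures, and the toy stand-ins for the model and the retriever are all hypothetical.

```python
def answer_with_selective_knowledge(question, answer_fn, knowledge_fn,
                                    threshold=0.7):
    """Answer without external knowledge first; consult retrieved
    knowledge only when the model's confidence is below threshold.
    Returns (answer, used_knowledge)."""
    answer, confidence = answer_fn(question, None)
    if confidence >= threshold:
        return answer, False
    # Low confidence: fetch (already filtered) knowledge and retry.
    knowledge = knowledge_fn(question)
    answer, _ = answer_fn(question, knowledge)
    return answer, True


def toy_answer_fn(question, knowledge):
    """Stand-in for a vision-language model returning (answer, confidence)."""
    if knowledge is None:
        return "unsure", 0.3
    return f"answer using {knowledge}", 0.9


def toy_knowledge_fn(question):
    """Stand-in for retrieval plus extraction of answer-beneficial segments."""
    return "filtered fact"


result = answer_with_selective_knowledge(
    "What breed is the dog?", toy_answer_fn, toy_knowledge_fn)
```

Gating on confidence means a question the model can already answer never sees retrieved text at all, which is one way to keep redundant knowledge from perturbing correct answers.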
[CV-51] RT-DETR for UAV Object Detection
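[Quick Read]: This paper addresses the challenges of object detection in unmanned aerial vehicle (UAV) imagery, including densely packed small objects, large scale variation, and occlusion. The key to the solution is an improved encoder for the RT-DETR model: first, a channel-gated attention-based upsampling/downsampling (AU/AD) module whose dual-path structure reduces errors and preserves detail during feature propagation; second, a CSP-PAC module in the feature fusion stage, which uses parallel dilated convolutions to process local and contextual information together and integrate multi-scale features effectively. The design improves detection of small and densely packed objects markedly while remaining real-time and without adding computational complexity.
Link: https://arxiv.org/abs/2509.09157
Authors: Yuan Shufang
Affiliations: Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: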
Abstract:Object detection in unmanned aerial vehicle (UAV) imagery presents significant challenges. Issues such as densely packed small objects, scale variations, and occlusion are commonplace. This paper introduces RT-DETR++, which enhances the encoder component of the RT-DETR model. Our improvements focus on two key aspects. First, we introduce a channel-gated attention-based upsampling/downsampling (AU/AD) mechanism. This dual-path system minimizes errors and preserves details during feature layer propagation. Second, we incorporate CSP-PAC during feature fusion. This technique employs parallel dilated (atrous) convolutions to process local and contextual information within the same layer, facilitating the integration of multi-scale features. Evaluation demonstrates that our novel neck design achieves superior performance in detecting small and densely packed objects. The model maintains sufficient speed for real-time detection without increasing computational complexity. This study provides an effective approach for feature encoding design in real-time detection systems.
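The idea of running convolutions with different dilation rates in parallel, so that one layer sees both local and wider context, can be shown on a 1D signal. This is a toy illustration and not the CSP-PAC module: the 1D setting, the shared kernel, and the averaging of branches are assumptions made to keep the sketch small.

```python
def dilated_conv1d(x, kernel, dilation):
    """1D convolution with zero padding so the output has the same
    length as the input; kernel taps are spaced 'dilation' apart."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation
            if 0 <= j < len(x):      # positions outside the signal count as 0
                acc += w * x[j]
        out.append(acc)
    return out


def parallel_dilated(x, kernel, dilations):
    """Run one branch per dilation rate on the same input and fuse the
    branches by averaging (the fusion rule is an assumption)."""
    branches = [dilated_conv1d(x, kernel, d) for d in dilations]
    return [sum(vals) / len(vals) for vals in zip(*branches)]


impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
local = dilated_conv1d(impulse, [1.0, 1.0, 1.0], dilation=1)
wide = dilated_conv1d(impulse, [1.0, 1.0, 1.0], dilation=2)
fused = parallel_dilated(impulse, [1.0, 1.0, 1.0], dilations=[1, 2])
```

The dilation-1 branch responds only next to the impulse, while the dilation-2 branch reaches two positions away with the same three weights, which is how dilation widens the receptive field without adding parameters.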
[CV-52] Mind Meets Space: Rethinking Agent ic Spatial Intelligence from a Neuroscience-inspired Perspective
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在空间推理能力上的局限性问题,即现有系统多依赖符号化和顺序处理,难以实现类人水平的灵活、情境感知的空间决策。其解决方案的关键在于提出一个基于神经科学原理的新型计算框架,将生物神经系统的核心功能映射为六个关键计算模块:生物启发的多模态感知、多感官整合、自我中心-环境中心转换、人工认知地图、空间记忆与空间推理。这一框架为智能体在虚拟与物理环境中构建统一的空间推理能力提供了结构化路径,并通过分析现有方法与基准数据集,识别出亟待突破的技术瓶颈,从而推动更贴近人类空间智能的 agentic spatial intelligence 的发展。
链接: https://arxiv.org/abs/2509.09154
作者: Bui Duc Manh,Soumyaratna Debnath,Zetong Zhang,Shriram Damodaran,Arvind Kumar,Yueyi Zhang,Lu Mi,Erik Cambria,Lin Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 54 pages, journal
Abstract:Recent advances in agentic AI have led to systems capable of autonomous task execution and language-based reasoning, yet their spatial reasoning abilities remain limited and underexplored, largely constrained to symbolic and sequential processing. In contrast, human spatial intelligence, rooted in integrated multisensory perception, spatial memory, and cognitive maps, enables flexible, context-aware decision-making in unstructured environments. Therefore, bridging this gap is critical for advancing Agentic Spatial Intelligence toward better interaction with the physical 3D world. To this end, we first scrutinize the spatial neural models studied in computational neuroscience, and accordingly introduce a novel computational framework grounded in neuroscience principles. This framework maps core biological functions to six essential computation modules: bio-inspired multimodal sensing, multi-sensory integration, egocentric-allocentric conversion, an artificial cognitive map, spatial memory, and spatial reasoning. Together, these modules form a perspective landscape for agentic spatial reasoning capability across both virtual and physical environments. On top of this, we conduct a framework-guided analysis of recent methods, evaluating their relevance to each module and identifying critical gaps that hinder the development of more neuroscience-grounded spatial reasoning modules. We further examine emerging benchmarks and datasets and explore potential application domains ranging from virtual to embodied systems, such as robotics. Finally, we outline potential research directions, emphasizing the promising roadmap that can generalize spatial reasoning across dynamic or unstructured environments. We hope this work will benefit the research community with a neuroscience-grounded perspective and a structured pathway. Our project page can be found on GitHub.
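Of the six modules listed in this abstract, egocentric-allocentric conversion has the most direct geometric reading: a change of coordinates between the agent's own frame and a fixed world frame. The sketch below shows the 2D case under assumed conventions (ego coordinates are forward and left, heading is measured counter-clockwise from the world x axis); it illustrates the concept and is not code from the paper.

```python
import math


def ego_to_allo(agent_xy, heading_rad, point_ego):
    """Convert a point given as (forward, left) in the agent's frame
    into world (x, y) coordinates."""
    forward, left = point_ego
    c, s = math.cos(heading_rad), math.sin(heading_rad)
    # Rotate by the heading, then translate by the agent position.
    return (agent_xy[0] + c * forward - s * left,
            agent_xy[1] + s * forward + c * left)


def allo_to_ego(agent_xy, heading_rad, point_world):
    """Inverse transform: world (x, y) back to (forward, left)."""
    dx = point_world[0] - agent_xy[0]
    dy = point_world[1] - agent_xy[1]
    c, s = math.cos(heading_rad), math.sin(heading_rad)
    # Translate to the agent, then rotate by the negative heading.
    return (c * dx + s * dy, -s * dx + c * dy)


agent = (1.0, 2.0)
heading = math.pi / 2                 # agent faces the world +y direction
landmark_world = ego_to_allo(agent, heading, (3.0, 0.0))   # 3 m straight ahead
landmark_ego = allo_to_ego(agent, heading, landmark_world)
```

A landmark three units straight ahead of an agent at (1, 2) facing +y lands at roughly (1, 5) in world coordinates, and the inverse transform recovers the original egocentric reading.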
[CV-53] OCELOT 2023: Cell Detection from Cell-Tissue Interaction Challenge
[Quick Read]: This paper addresses the difficulty that existing deep learning-based cell detection models have in emulating how pathologists alternate between magnifications when examining Whole-Slide Images, that is, the models' lack of understanding of cell-tissue relationships across scales. The key to the solution is a large-scale dataset with multi-scale overlapping cell and tissue annotations (the OCELOT 2023 challenge dataset); the submitted models show that incorporating cell-tissue interaction information improves detection substantially: top entries raise the test-set F1-score by up to 7.99 over a baseline that uses cell information only, demonstrating the importance of multi-scale semantic information for reaching human-level diagnostic performance.
Link: https://arxiv.org/abs/2509.09153
Authors: JaeWoong Shin, Jeongun Ryu, Aaron Valero Puche, Jinhee Lee, Biagio Brattoli, Wonkyung Jung, Soo Ick Cho, Kyunghyun Paeng, Chan-Young Ock, Donggeun Yoo, Zhaoyang Li, Wangkai Li, Huayu Mai, Joshua Millward, Zhen He, Aiden Nibali, Lydia Anette Schoenpflug, Viktor Hendrik Koelzer, Xu Shuoyu, Ji Zheng, Hu Bin, Yu-Wen Lo, Ching-Hui Yang, Sérgio Pereira
Affiliations: Lunit Inc.; University of Science and Technology of China; La Trobe University; University Hospital of Zürich; University of Zürich; Institute of Medical Genetics and Pathology; University of Basel; University of Oxford; Bio-totem Pte Ltd; National Tsing Hua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This is the accepted manuscript of an article published in Medical Image Analysis (Elsevier). The final version is available at: this https URL
Abstract:Pathologists routinely alternate between different magnifications when examining Whole-Slide Images, allowing them to evaluate both broad tissue morphology and intricate cellular details to form comprehensive diagnoses. However, existing deep learning-based cell detection models struggle to replicate these behaviors and learn the interdependent semantics between structures at different magnifications. A key barrier in the field is the lack of datasets with multi-scale overlapping cell and tissue annotations. The OCELOT 2023 challenge was initiated to gather insights from the community to validate the hypothesis that understanding cell and tissue (cell-tissue) interactions is crucial for achieving human-level performance, and to accelerate the research in this field. The challenge dataset includes overlapping cell detection and tissue segmentation annotations from six organs, comprising 673 pairs sourced from 306 The Cancer Genome Atlas (TCGA) Whole-Slide Images with hematoxylin and eosin staining, divided into training, validation, and test subsets. Participants presented models that significantly enhanced the understanding of cell-tissue relationships. Top entries achieved up to a 7.99 increase in F1-score on the test set compared to the baseline cell-only model that did not incorporate cell-tissue relationships. This is a substantial improvement in performance over traditional cell-only detection methods, demonstrating the need for incorporating multi-scale semantics into the models. This paper provides a comparative analysis of the methods used by participants, highlighting innovative strategies implemented in the OCELOT 2023 challenge.
[CV-54] Video Understanding by Design: How Datasets Shape Architectures and Insights
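[Quick read]: This paper addresses the lack of a systematic analysis of how data drives architectural evolution in video understanding research: existing surveys classify models by task or model family and overlook how dataset characteristics shape model structure through inductive biases. The key to its solution is a dataset-centered perspective that shows how motion complexity, temporal span, hierarchical composition, and multimodal richness exert structural pressure and drive the progression from two-stream networks and 3D convolutional neural networks (3D CNNs) to sequential models, Transformers, and multimodal foundation models. It then builds a unified framework linking dataset characteristics, inductive biases, and model architectures, offering practical guidance for model design that balances scalability and task demands.
Link: https://arxiv.org/abs/2509.09151
Authors: Lei Wang,Piotr Koniusz,Yongsheng Gao
Affiliations: Griffith University; Data61/CSIRO; Australian National University; University of New South Wales
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Research report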
Abstract:Video understanding has advanced rapidly, fueled by increasingly complex datasets and powerful architectures. Yet existing surveys largely classify models by task or family, overlooking the structural pressures through which datasets guide architectural evolution. This survey is the first to adopt a dataset-driven perspective, showing how motion complexity, temporal span, hierarchical composition, and multimodal richness impose inductive biases that models should encode. We reinterpret milestones, from two-stream and 3D CNNs to sequential, transformer, and multimodal foundation models, as concrete responses to these dataset-driven pressures. Building on this synthesis, we offer practical guidance for aligning model design with dataset invariances while balancing scalability and task demands. By unifying datasets, inductive biases, and architectures into a coherent framework, this survey provides both a comprehensive retrospective and a prescriptive roadmap for advancing general-purpose video understanding.
[CV-55] Objectness Similarity: Capturing Object-Level Fidelity in 3D Scene Evaluation ICCV2025
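[Quick read]: This paper addresses the mismatch between existing 3D scene evaluation metrics and human visual perception: conventional metrics assess overall image quality and ignore human attention to the individual objects in a scene. The key to its solution is a new metric, Objectness SIMilarity (OSIM), which uses an object detection model and its feature representations to quantify the "objectness" of each object, enabling object-centric 3D scene evaluation. A user study shows that OSIM agrees with human perception more closely than existing metrics, and the metric is used to re-evaluate recent 3D reconstruction and generation models under a standardized setup.
Link: https://arxiv.org/abs/2509.09143
Authors: Yuiko Uchida,Ren Togo,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama
Affiliations: Hokkaido University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Accepted by the ICCV 2025 UniLight Workshop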
Abstract:This paper presents Objectness SIMilarity (OSIM), a novel evaluation metric for 3D scenes that explicitly focuses on “objects,” which are fundamental units of human visual perception. Existing metrics assess overall image quality, leading to discrepancies with human perception. Inspired by neuropsychological insights, we hypothesize that human recognition of 3D scenes fundamentally involves attention to individual objects. OSIM enables object-centric evaluations by leveraging an object detection model and its feature representations to quantify the “objectness” of each object in the scene. Our user study demonstrates that OSIM aligns more closely with human perception compared to existing metrics. We also analyze the characteristics of OSIM using various approaches. Moreover, we re-evaluate recent 3D reconstruction and generation models under a standardized experimental setup to clarify advancements in this field. The code is available at this https URL.
[CV-56] Noise-Robust Topology Estimation of 2D Image Data via Neural Networks and Persistent Homology
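[Quick read]: This paper addresses how to infer topological information from 2D binary images more reliably in the presence of structural noise. Persistent homology (PH) based on cubical complexes and the Signed Euclidean Distance Transform (SEDT) is widely used for this purpose, but its performance is limited under noise. The paper instead trains supervised artificial neural networks (ANNs) to predict Betti numbers; the key idea is that the networks can exploit contextual and geometric priors contained in the training data, which likely explains their greater robustness to noise. Experiments on one synthetic and two real-world datasets show that ANNs tolerate noise better than the PH pipeline, offering an emerging and competitive alternative for topology estimation.
Link: https://arxiv.org/abs/2509.09140
Authors: Dylan Peek,Matthew P. Skerritt,Stephan Chalup
Affiliations: University of Southern Queensland; University of Queensland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages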
Abstract:Persistent Homology (PH) and Artificial Neural Networks (ANNs) offer contrasting approaches to inferring topological structure from data. In this study, we examine the noise robustness of a supervised neural network trained to predict Betti numbers in 2D binary images. We compare an ANN approach against a PH pipeline based on cubical complexes and the Signed Euclidean Distance Transform (SEDT), which is a widely adopted strategy for noise-robust topological analysis. Using one synthetic and two real-world datasets, we show that ANNs can outperform this PH approach under noise, likely due to their capacity to learn contextual and geometric priors from training data. Though still emerging, the use of ANNs for topology estimation offers a compelling alternative to PH under structural noise.
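As a point of reference for what "Betti numbers of a 2D binary image" means, the sketch below computes them directly by connected-component counting on a clean image. It is not the paper's ANN or its PH/SEDT pipeline; the function names are illustrative, and the connectivity convention (8-connected foreground, 4-connected background) is an assumption chosen for this example.

```python
from collections import deque

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def count_components(grid, value, neighbors):
    """Count connected components of cells equal to `value` using BFS flood fill."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in neighbors:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count

def betti_numbers(image):
    """Return (b0, b1) for a 2D binary image given as rows of 0/1 (1 = foreground)."""
    cols = len(image[0])
    # Pad with background so that the outer background is a single component.
    padded = ([[0] * (cols + 2)]
              + [[0] + list(row) + [0] for row in image]
              + [[0] * (cols + 2)])
    b0 = count_components(padded, 1, N8)      # number of foreground pieces
    b1 = count_components(padded, 0, N4) - 1  # enclosed holes (exclude the outer background)
    return b0, b1

ring = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
print(betti_numbers(ring))
```

Flipping a single pixel changes these counts (a stray foreground pixel adds a component, a stray background pixel inside an object adds a hole), which is why noise robustness is the central question in the abstract above.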
[CV-57] ALL-PET: A Low-resource and Low-shot PET Foundation Model in the Projection Domain
[Quick read]: This paper addresses two obstacles to building large-scale foundation models for positron emission tomography (PET): scarce labeled data and limited computational resources. It proposes ALL-PET, a low-resource, low-shot PET foundation model that operates directly in the projection domain. The solution rests on three innovations. First, a Radon mask augmentation strategy (RMAS) projects randomized image-domain masks into sinogram space to generate over 200,000 structurally diverse training samples, improving generalization with very little data. Second, positive/negative mask constraints embed strict geometric consistency, reducing the parameter burden while preserving generation quality. Third, transparent medical attention (TMA), a parameter-free, geometry-driven mechanism, enhances lesion-related regions in raw projection data; its attention maps are derived from coarse segmentation and projected into sinogram space, which keeps them physically consistent and clinically interpretable. Trained on only 500 samples, the model generates high-quality sinograms and generalizes to low-dose reconstruction, attenuation correction, delayed-frame prediction, and tracer separation, using less than 24 GB of memory.
Link: https://arxiv.org/abs/2509.09130
Authors: Bin Huang,Kang Chen,Bingxuan Li,Huafeng Liu,Qiegen Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Building a large-scale foundation model for PET imaging is hindered by limited access to labeled data and insufficient computational resources. To overcome data scarcity and efficiency limitations, we propose ALL-PET, a low-resource, low-shot PET foundation model operating directly in the projection domain. ALL-PET leverages a latent diffusion model (LDM) with three key innovations. First, we design a Radon mask augmentation strategy (RMAS) that generates over 200,000 structurally diverse training samples by projecting randomized image-domain masks into sinogram space, significantly improving generalization with minimal data. This is extended by a dynamic multi-mask (DMM) mechanism that varies mask quantity and distribution, enhancing data diversity without added model complexity. Second, we implement positive/negative mask constraints to embed strict geometric consistency, reducing parameter burden while preserving generation quality. Third, we introduce transparent medical attention (TMA), a parameter-free, geometry-driven mechanism that enhances lesion-related regions in raw projection data. Lesion-focused attention maps are derived from coarse segmentation, covering both hypermetabolic and hypometabolic areas, and projected into sinogram space for physically consistent guidance. The system supports clinician-defined ROI adjustments, ensuring flexible, interpretable, and task-adaptive emphasis aligned with PET acquisition physics. Experimental results show ALL-PET achieves high-quality sinogram generation using only 500 samples, with performance comparable to models trained on larger datasets. ALL-PET generalizes across tasks including low-dose reconstruction, attenuation correction, delayed-frame prediction, and tracer separation, operating efficiently with memory use under 24GB.
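To make "projecting an image-domain mask into sinogram space" concrete, here is a toy parallel-beam projector in plain Python. It is only a sketch under simple assumptions (square mask, nearest-bin accumulation, no interpolation) and is not the RMAS implementation; `project_mask` and its parameters are hypothetical names.

```python
import math

def project_mask(mask, angles_deg):
    """Project a square 2D mask into a toy sinogram (one row of detector bins per angle).

    Each non-zero pixel is dropped into the detector bin nearest to its
    signed distance from the image centre along the detector axis.
    """
    n = len(mask)
    centre = (n - 1) / 2.0
    n_bins = int(math.ceil(n * math.sqrt(2))) + 1  # wide enough for the diagonal
    offset = (n_bins - 1) / 2.0
    sinogram = []
    for angle in angles_deg:
        theta = math.radians(angle)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        row = [0.0] * n_bins
        for y in range(n):
            for x in range(n):
                if mask[y][x]:
                    s = (x - centre) * cos_t + (y - centre) * sin_t
                    row[int(math.floor(s + offset + 0.5))] += mask[y][x]
        sinogram.append(row)
    return sinogram

square = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
sino = project_mask(square, [0, 45, 90, 135])
```

Every sinogram row sums to the mask area, so a randomized mask in the image domain corresponds to a geometrically consistent pattern across all projection angles rather than an arbitrary blob in sinogram space.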
[CV-58] Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval EMNLP2025
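[Quick read]: This paper addresses two key problems that Contrastive Language-Image Pre-training (CLIP) faces in person representation learning: the scarcity of large-scale annotated person-centric image-text data, and the difficulty global contrastive learning has in preserving the local features needed for fine-grained matching while remaining vulnerable to noisy text tokens. The solution combines two improvements. First, a noise-resistant data construction pipeline uses the in-context learning ability of multimodal large language models (MLLMs) to automatically filter and caption web-sourced person images, producing WebPerson, a dataset of 5 million high-quality image-text pairs. Second, the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework adaptively masks noisy text tokens based on a gradient-attention similarity score and adds masked token prediction objectives to strengthen fine-grained semantic representation learning, improving cross-modal alignment.
Link: https://arxiv.org/abs/2509.09118
Authors: Tianlu Zheng,Yifan Zhang,Xiang An,Ziyong Feng,Kaicheng Yang,Qichuan Ding
Affiliations: Northeastern University; South China University of Technology; DeepGlint
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by EMNLP2025 Main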
Abstract:Although Contrastive Language-Image Pre-training (CLIP) exhibits strong performance across diverse vision tasks, its application to person representation learning faces two critical challenges: (i) the scarcity of large-scale annotated vision-language data focused on person-centric images, and (ii) the inherent limitations of global contrastive learning, which struggles to maintain discriminative local features crucial for fine-grained matching while remaining vulnerable to noisy text tokens. This work advances CLIP for person representation learning through synergistic improvements in data curation and model architecture. First, we develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images. This yields WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework, which improves cross-modal alignment by adaptively masking noisy textual tokens based on the gradient-attention similarity score. Additionally, we incorporate masked token prediction objectives that compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks.
[CV-59] Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention WACV2026
[Quick read]: This paper addresses the segmentation of complete rosette-shaped plant individuals in top-view crop images without annotated training data, that is, zero-shot segmentation at the level of individual plants. Existing methods perform poorly on plants composed of multiple overlapping leaves and usually need species-specific annotations that are laborious to produce. The key to the solution is the ZeroPlantSeg framework, which combines a foundation segmentation model with a vision-language model: the former extracts individual leaf instances, and the latter reasons about plant structure to group leaves into complete plant individuals, achieving cross-domain segmentation without additional training.
Link: https://arxiv.org/abs/2509.09116
Authors: Junhao Xing,Ryohei Miyakawa,Yang Yang,Xinpeng Liu,Risa Shinoda,Hiroaki Santo,Yosuke Toda,Fumio Okura
Affiliations: The University of Osaka; Phytometrics; Nagoya University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2026 accepted
Abstract:Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals, each consisting of multiple overlapping leaves, remains challenging. This problem is referred to as a hierarchical segmentation task, typically requiring annotated training datasets, which are often species-specific and require notable human labor. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation method for rosette-shaped plant individuals from top-view images. We integrate a foundation segmentation model, extracting leaf instances, and a vision-language model, reasoning about plants’ structures to extract plant individuals without additional training. Evaluations on datasets with multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at this https URL.
[CV-60] FPI-Det: a face–phone Interaction Dataset for phone-use detection and understanding
[Quick read]: This paper addresses how to accurately detect whether a person is using a phone in settings such as safety monitoring, workplace productivity assessment, and attention management. The task depends not only on object recognition but also on understanding behavioral context, that is, reasoning about the relationship between faces, hands, and devices under complex conditions. Existing generic benchmarks do not capture such fine-grained human–device interactions. The key to the solution is a new dataset, FPI-Det, containing 22,879 images with synchronized annotations across workplace, education, transportation, and public scenarios, featuring extreme scale variation, frequent occlusion, and varied capture conditions. Representative detectors such as YOLO and DETR are benchmarked and analyzed on it, providing an evaluation framework and baseline results closer to practical use.
Link: https://arxiv.org/abs/2509.09111
Authors: Jianqin Gao,Tianqi Wang,Yu Zhang,Yishu Zhang,Chenyuan Wang,Allan Dong,Zihao Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The widespread use of mobile devices has created new challenges for vision systems in safety monitoring, workplace productivity assessment, and attention management. Detecting whether a person is using a phone requires not only object recognition but also an understanding of behavioral context, which involves reasoning about the relationship between faces, hands, and devices under diverse conditions. Existing generic benchmarks do not fully capture such fine-grained human–device interactions. To address this gap, we introduce FPI-Det, a dataset containing 22,879 images with synchronized annotations for faces and phones across workplace, education, transportation, and public scenarios. The dataset features extreme scale variation, frequent occlusions, and varied capture conditions. We evaluate representative YOLO and DETR detectors, providing baseline results and an analysis of performance across object sizes, occlusion levels, and environments. Source code and dataset are available at this https URL.
[CV-61] S-BEVLoc: BEV-based Self-supervised Framework for Large-scale LiDAR Global Localization
[Quick read]: This paper addresses the heavy reliance of LiDAR-based global localization on high-precision ground-truth poses, which usually come from GPS or SLAM odometry, are costly to obtain, and are hard to scale. The key to the solution is S-BEVLoc, a self-supervised framework based on bird's-eye view (BEV) images. It builds training triplets from the known geographic distances between keypoint-centered BEV patches, so the network can be trained without ground-truth poses. A convolutional neural network (CNN) extracts local features, NetVLAD aggregates global descriptors, and a SoftCos loss enhances learning from the triplets. On the large-scale KITTI and NCLT datasets the method achieves state-of-the-art place recognition, loop closure, and global localization performance while remaining highly scalable.
Link: https://arxiv.org/abs/2509.09110
Authors: Chenghao Zhang,Lun Luo,Si-Yuan Cao,Xiaokai Bai,Yuncheng Jin,Zhu Yu,Beinan Yu,Yisen Wang,Hui-Liang Shen
Affiliations: Zhejiang University; China Jiliang University; Jinhua Institute of Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:LiDAR-based global localization is an essential component of simultaneous localization and mapping (SLAM), which helps loop closure and re-localization. Current approaches rely on ground-truth poses obtained from GPS or SLAM odometry to supervise network training. Despite the great success of these supervised approaches, substantial cost and effort are required for high-precision ground-truth pose acquisition. In this work, we propose S-BEVLoc, a novel self-supervised framework based on bird’s-eye view (BEV) for LiDAR global localization, which eliminates the need for ground-truth poses and is highly scalable. We construct training triplets from single BEV images by leveraging the known geographic distances between keypoint-centered BEV patches. A convolutional neural network (CNN) is used to extract local features, and NetVLAD is employed to aggregate global descriptors. Moreover, we introduce SoftCos loss to enhance learning from the generated triplets. Experimental results on the large-scale KITTI and NCLT datasets show that S-BEVLoc achieves state-of-the-art performance in place recognition, loop closure, and global localization tasks, while offering scalability that would require extra effort for supervised approaches.
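The triplet construction described above can be sketched as follows: patches whose centres are close in metric space serve as positives, and far-away patches as negatives. This is a minimal illustration, assuming patch centres are given in metres within one BEV image; the two radii are hypothetical thresholds, and the paper's SoftCos loss is not reproduced here.

```python
import math

def build_triplets(centers, pos_radius, neg_radius):
    """Return (anchor, positive, negative) index triplets from patch centres.

    centers    : list of (x, y) patch-centre coordinates in metres
    pos_radius : patches closer than this to the anchor count as positives (assumed value)
    neg_radius : patches farther than this from the anchor count as negatives (assumed value)
    """
    triplets = []
    for a, anchor in enumerate(centers):
        dists = [math.dist(anchor, c) for c in centers]
        positives = [i for i, d in enumerate(dists) if i != a and d <= pos_radius]
        negatives = [i for i, d in enumerate(dists) if d >= neg_radius]
        triplets.extend((a, p, n) for p in positives for n in negatives)
    return triplets

patch_centers = [(0.0, 0.0), (1.0, 0.0), (50.0, 0.0)]
triplets = build_triplets(patch_centers, pos_radius=2.0, neg_radius=20.0)
```

Because the distances come from the BEV geometry itself, no externally measured pose is needed to label a pair as similar or dissimilar.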
[CV-62] SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models
[Quick read]: This paper addresses the difficulty of deploying Vision-Language-Action (VLA) models for embodied intelligence because of their large computational and memory costs. Existing compression and acceleration methods apply quantization or token pruning in an ad hoc way, and because the two are incompatible they cannot be combined for an overall efficiency gain. The key to the solution is SQAP-VLA, described as the first structured, training-free VLA inference acceleration framework. It co-designs the quantization and pruning pipeline, introducing quantization-aware token pruning criteria that work on aggressively quantized models and improving the quantizer design to make pruning more effective. The result is a marked gain in computational efficiency and inference speed while preserving core model performance (a 1.93× speedup and up to a 4.5% improvement in average success rate over the original model).
Link: https://arxiv.org/abs/2509.09090
Authors: Hengyu Fang,Yijiang Liu,Yuan Du,Li Du,Huanrui Yang
Affiliations: Nanjing University; University of Arizona
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 9 figures
Abstract:Vision-Language-Action (VLA) models exhibit unprecedented capabilities for embodied intelligence. However, their extensive computational and memory costs hinder their practical deployment. Existing VLA compression and acceleration approaches conduct quantization or token pruning in an ad-hoc manner but fail to enable both for a holistic efficiency improvement due to an observed incompatibility. This work introduces SQAP-VLA, the first structured, training-free VLA inference acceleration framework that simultaneously enables state-of-the-art quantization and token pruning. We overcome the incompatibility by co-designing the quantization and token pruning pipeline, where we propose new quantization-aware token pruning criteria that work on an aggressively quantized model while improving the quantizer design to enhance pruning effectiveness. When applied to standard VLA models, SQAP-VLA yields significant gains in computational efficiency and inference speed while successfully preserving core model performance, achieving a 1.93× speedup and up to a 4.5% average success rate enhancement compared to the original model.
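For readers unfamiliar with the two building blocks being combined, the sketch below shows generic versions of each: uniform symmetric quantization and score-based token pruning. These are textbook forms, not the quantization-aware criteria or quantizer design proposed in SQAP-VLA; all names and the choice of importance scores are assumptions.

```python
def quantize_symmetric(values, bits=4):
    """Uniform symmetric fake-quantization: snap to a signed integer grid, then rescale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(ranked)]

weights = [0.9, -0.3, 0.05]
quantized = quantize_symmetric(weights, bits=4)
kept = prune_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.7], keep=2)
```

One plausible reading of the incompatibility mentioned in the abstract is that importance scores computed from a quantized model are perturbed by rounding error, so a pruning rule tuned on full-precision scores may discard the wrong tokens; the abstract itself does not spell out the cause.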
[CV-63] IRDFusion: Iterative Relation-Map Difference guided Feature Fusion for Multispectral Object Detection
[Quick read]: This paper addresses the tendency of feature fusion in multispectral object detection to retain redundant background or noise, which limits perceptual performance. Its core solution is IRDFusion, a feature fusion framework based on a cross-modal feature contrast and screening strategy. Two dedicated modules are central: the Mutual Feature Refinement Module (MFRM) and the Differential Feature Feedback Module (DFFM). MFRM models intra- and inter-modal feature relationships to improve cross-modal alignment and discriminability; DFFM dynamically computes inter-modal differential features as guidance signals and feeds them back to MFRM, enabling adaptive fusion of complementary information while suppressing common-mode noise. Together they form an iterative relation-map difference guided fusion mechanism that progressively amplifies salient relational signals and suppresses feature noise through repeated feedback, improving fusion quality and detection performance.
Link: https://arxiv.org/abs/2509.09085
Authors: Jifeng Shen,Haibo Zhan,Xin Zuo,Heng Fan,Xiaohui Yuan,Jun Li,Wankou Yang
Affiliations: Jiangsu University; Jiangsu University of Science and Technology; Southeast University; University of North Texas; Nanjing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 6 pages, submitted on 3 Sep, 2025
Abstract:Current multispectral object detection methods often retain extraneous background or noise during feature fusion, limiting perceptual performance. To address this, we propose an innovative feature fusion framework based on cross-modal feature contrastive and screening strategy, diverging from conventional approaches. The proposed method adaptively enhances salient structures by fusing object-aware complementary cross-modal features while suppressing the shared background. Our solution centers on two novel, specially designed modules: the Mutual Feature Refinement Module (MFRM) and the Differential Feature Feedback Module (DFFM). The MFRM enhances intra- and inter-modal feature representations by modeling their relationships, thereby improving cross-modal alignment and discriminability. Inspired by feedback differential amplifiers, the DFFM dynamically computes inter-modal differential features as guidance signals and feeds them back to the MFRM, enabling adaptive fusion of complementary information while suppressing common-mode noise across modalities. To enable robust feature learning, the MFRM and DFFM are integrated into a unified framework, which is formally formulated as an Iterative Relation-Map Differential Guided Feature Fusion mechanism, termed IRDFusion. IRDFusion enables high-quality cross-modal fusion by progressively amplifying salient relational signals through iterative feedback, while suppressing feature noise, leading to significantly better performance. In extensive experiments on the FLIR, LLVIP and M³FD datasets, IRDFusion achieves state-of-the-art performance and consistently outperforms existing methods across diverse challenging scenarios, demonstrating its robustness and effectiveness. Code will be available at this https URL.
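The differential-amplifier analogy can be illustrated with a short decomposition of two feature vectors into a common-mode part and a differential part. This is only the signal-processing idea the abstract alludes to, not the DFFM module; the feature vectors and function name are made up for the example.

```python
def split_modes(feat_a, feat_b):
    """Split two equally long feature vectors into common-mode and differential parts.

    feat_a = common + diff and feat_b = common - diff, element by element.
    """
    common = [(a + b) / 2 for a, b in zip(feat_a, feat_b)]
    diff = [(a - b) / 2 for a, b in zip(feat_a, feat_b)]
    return common, diff

modality_a = [3.0, 1.0]
modality_b = [1.0, 1.0]
common, diff = split_modes(modality_a, modality_b)

# Adding the same disturbance to both modalities changes only the common-mode part.
shared_noise = 0.5
_, diff_noisy = split_modes([v + shared_noise for v in modality_a],
                            [v + shared_noise for v in modality_b])
```

Since a disturbance shared by both modalities cancels in the difference, a guidance signal built from differential features is insensitive to it, which is the intuition behind "suppressing common-mode noise".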
[CV-64] Improvement of Human-Object Interaction Action Recognition Using Scene Information and Multi-Task Learning Approach
[Quick read]: This paper addresses the difficulty existing graph convolutional neural networks (GCNs) have in detecting human-object interaction (HOI) in human action recognition, which stems from the lack of an effective representation of scene information and of a suitable learning architecture. The key to the solution is to incorporate information about fixed objects and adopt a multi-task learning (MTL) strategy. By fusing interaction-area information, the method recognizes both interaction and non-interaction actions (such as walking and standing) with 99.25% accuracy, 2.75% higher than the baseline model that uses only human skeleton poses.
Link: https://arxiv.org/abs/2509.09067
Authors: Hesham M. Shehata,Mohammad Abdolrahmani
Affiliations: Kyoto University; Asilla
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent graph convolutional neural networks (GCNs) have shown high performance in the field of human action recognition by using human skeleton poses. However, they fail to detect human-object interaction cases successfully due to the lack of effective representation of the scene information and appropriate learning architectures. In this context, we propose a methodology to improve human action recognition performance by considering fixed object information in the environment and following a multi-task learning approach. In order to evaluate the proposed method, we collected real data from public environments and prepared our data set, which includes interaction classes of hands-on fixed objects (e.g., ATM ticketing machines, check-in/out machines, etc.) and non-interaction classes of walking and standing. The multi-task learning approach, along with interaction area information, succeeds in recognizing the studied interaction and non-interaction actions with an accuracy of 99.25%, outperforming the accuracy of the base model using only human skeleton poses by 2.75%.
[CV-65] Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models
[Quick read]: This paper addresses the limited depth of semantic understanding in current 3D medical image analysis, in particular the inability of convolution- and Transformer-based self-supervised learning (SSL) methods to capture deep semantic information. The key to the solution is the Med3DInsight framework, which couples a 3D image encoder with a 2D multimodal large language model (MLLM) through a plane-slice-aware transformer module, and uses an alignment based on partial optimal transport to make the model more robust to noise in LLM-generated content. This yields a scalable approach to 3D medical representation learning that requires no human annotations.
Link: https://arxiv.org/abs/2509.09064
Authors: Qiuhui Chen,Xuancheng Yao,Huping Ye,Yi Hong
Affiliations: Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Journal of Biomedical and Health Informatics (JBHI)
Abstract:Understanding 3D medical image volumes is critical in the medical field, yet existing 3D medical convolution and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. To leverage these 2D MLLMs for improved 3D medical image understanding, we propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module. Additionally, our model employs a partial optimal transport based alignment, demonstrating greater tolerance to noise in LLM-generated content. Med3DInsight introduces a new paradigm for scalable multimodal 3D medical representation learning without requiring human annotations. Extensive experiments demonstrate our state-of-the-art performance on two downstream tasks, i.e., segmentation and classification, across various public datasets with CT and MRI modalities, outperforming current SSL methods. Med3DInsight can be seamlessly integrated into existing 3D medical image understanding networks, potentially enhancing their performance. Our source code, generated datasets, and pre-trained models will be available at this https URL.
[CV-66] Integrating Anatomical Priors into a Causal Diffusion Model
[Quick read]: This paper addresses the difficulty generative AI has in preserving subtle anatomical differences in 3D brain magnetic resonance imaging (MRI): when generating counterfactual images, existing models lack explicit inductive biases and so fail to retain fine local anatomical detail. The key solution is a generative framework called the Probabilistic Causal Graph Model (PCGM), which introduces voxel-level anatomical constraints as priors. A probabilistic graph module captures the constraints and turns them into region masks; the masks, encoded by a 3D extension of ControlNet, constrain a novel counterfactual denoising UNet; and a 3D diffusion decoder then produces high-quality brain MRIs. The paper shows for the first time that brain measurements extracted from the synthetic images reproduce the subtle effects of a disease on cortical regions reported in the neuroscience literature, which strengthens the case for using synthetic MRIs to study subtle morphological differences.
Link: https://arxiv.org/abs/2509.09054
Authors: Binxu Li,Wei Peng,Mingjie Li,Ehsan Adeli,Kilian M. Pohl
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 4 figures
Abstract:3D brain MRI studies often examine subtle morphometric differences between cohorts that are hard to detect visually. Given the high cost of MRI acquisition, these studies could greatly benefit from image synthesis, particularly counterfactual image generation, as seen in other domains, such as computer vision. However, counterfactual models struggle to produce anatomically plausible MRIs due to the lack of explicit inductive biases to preserve fine-grained anatomical details. This shortcoming arises because the models are trained to optimize the overall appearance of the images (e.g., via cross-entropy) rather than to preserve subtle, yet medically relevant, local variations across subjects. To preserve subtle variations, we propose to explicitly integrate voxel-level anatomical constraints as priors into a generative diffusion framework. Called Probabilistic Causal Graph Model (PCGM), the approach captures anatomical constraints via a probabilistic graph module and translates those constraints into spatial binary masks of regions where subtle variations occur. The masks (encoded by a 3D extension of ControlNet) constrain a novel counterfactual denoising UNet, whose encodings are then transferred into high-quality brain MRIs via our 3D diffusion decoder. Extensive experiments on multiple datasets demonstrate that PCGM generates structural brain MRIs of higher quality than several baseline approaches. Furthermore, we show for the first time that brain measurements extracted from counterfactuals (generated by PCGM) replicate the subtle effects of a disease on cortical brain regions previously reported in the neuroscience literature. This achievement is an important milestone in the use of synthetic MRIs in studies investigating subtle morphological differences.
[CV-67] VoxelFormer: Parameter-Efficient Multi-Subject Visual Decoding from fMRI
[Quick read]: This paper addresses the reliance of current fMRI-based visual decoding methods on subject-specific training, which severely limits scalability and practical deployment. The key to the solution is VoxelFormer, a lightweight Transformer architecture that uses a Token Merging Transformer (ToMer) to compress voxels efficiently and a query-driven Q-Former to produce fixed-size neural representations aligned with the CLIP image embedding space, enabling multi-subject training and parameter-efficient neural decoding.
Link: https://arxiv.org/abs/2509.09015
Authors: Chenqian Le,Yilin Zhao,Nikasadat Emami,Kushagra Yadav,Xujin “Chris” Liu,Xupeng Chen,Yao Wang
Affiliations: New York University Tandon School of Engineering
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in fMRI-based visual decoding have enabled compelling reconstructions of perceived images. However, most approaches rely on subject-specific training, limiting scalability and practical deployment. We introduce VoxelFormer, a lightweight transformer architecture that enables multi-subject training for visual decoding from fMRI. VoxelFormer integrates a Token Merging Transformer (ToMer) for efficient voxel compression and a query-driven Q-Former that produces fixed-size neural representations aligned with the CLIP image embedding space. Evaluated on the 7T Natural Scenes Dataset, VoxelFormer achieves competitive retrieval performance on subjects included during training with significantly fewer parameters than existing methods. These results highlight token merging and query-based transformers as promising strategies for parameter-efficient neural decoding.
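The general idea of token merging, reducing a long token sequence by averaging the most similar tokens, can be sketched as below. This greedy pairwise version is a deliberate simplification for illustration; it is neither the bipartite matching used in published token-merging methods nor the paper's ToMer module, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_tokens(tokens, target):
    """Greedily average the most similar pair of tokens until `target` tokens remain."""
    tokens = [list(t) for t in tokens]
    while len(tokens) > max(target, 1):
        pairs = ((i, j) for i in range(len(tokens)) for j in range(i + 1, len(tokens)))
        i, j = max(pairs, key=lambda p: cosine(tokens[p[0]], tokens[p[1]]))
        merged = [(a + b) / 2 for a, b in zip(tokens[i], tokens[j])]
        # Drop the two merged tokens and append their average.
        tokens = [t for k, t in enumerate(tokens) if k not in (i, j)] + [merged]
    return tokens

voxel_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
compressed = merge_tokens(voxel_tokens, target=2)
```

Merging shrinks the sequence without adding any learned parameters, which fits the abstract's emphasis on parameter efficiency, although the actual mechanism in VoxelFormer may differ.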
[CV-68] E-MLNet: Enhanced Mutual Learning for Universal Domain Adaptation with Sample-Specific Weighting
[Quick read]: This paper addresses how, in Universal Domain Adaptation (UniDA), a model can separate known from unknown classes when no relationship between the source and target label sets is assumed. Earlier methods such as MLNet transfer knowledge through Open-set Entropy Minimization (OEM) but weight all one-vs-all classifiers equally, which dilutes the learning signal. The paper proposes the Enhanced Mutual Learning Network (E-MLNet), whose key idea is a dynamic weighting strategy for OEM: using the closed-set classifier's predictions, it focuses adaptation on the class boundaries most relevant to each target sample, sharpening the distinction between known and unknown classes. Experiments show that E-MLNet outperforms MLNet in the majority of individual adaptation tasks, 22 out of 31 in the Open-Partial DA setting and 19 out of 31 in the Open-Set DA setting, supporting the focused adaptation strategy.
Link: https://arxiv.org/abs/2509.09006
Authors: Samuel Felipe dos Santos,Tiago Agostinho de Almeida,Jurandy Almeida
Affiliations: Federal University of São Carlos (UFSCar)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Universal Domain Adaptation (UniDA) seeks to transfer knowledge from a labeled source to an unlabeled target domain without assuming any relationship between their label sets, requiring models to classify known samples while rejecting unknown ones. Advanced methods like Mutual Learning Network (MLNet) use a bank of one-vs-all classifiers adapted via Open-set Entropy Minimization (OEM). However, this strategy treats all classifiers equally, diluting the learning signal. We propose the Enhanced Mutual Learning Network (E-MLNet), which integrates a dynamic weighting strategy into OEM. By leveraging the closed-set classifier’s predictions, E-MLNet focuses adaptation on the most relevant class boundaries for each target sample, sharpening the distinction between known and unknown classes. We conduct extensive experiments on four challenging benchmarks: Office-31, Office-Home, VisDA-2017, and ImageCLEF. The results demonstrate that E-MLNet achieves the highest average H-scores on VisDA and ImageCLEF and exhibits superior robustness over its predecessor. E-MLNet outperforms the strong MLNet baseline in the majority of individual adaptation tasks (22 out of 31 in the challenging Open-Partial DA setting and 19 out of 31 in the Open-Set DA setting), confirming the benefits of our focused adaptation strategy.
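The contrast between equal and sample-specific weighting can be sketched for a single target sample as follows. The abstract does not give the exact weighting formula, so using the closed-set softmax probabilities as weights is an assumption made for illustration, and all function names are hypothetical.

```python
import math

def binary_entropy(p, eps=1e-12):
    """Entropy of a one-vs-all classifier output p = P(sample belongs to the class)."""
    p = min(max(p, eps), 1 - eps)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def uniform_oem_loss(ova_probs):
    """Every one-vs-all classifier contributes equally to the entropy term."""
    return sum(binary_entropy(p) for p in ova_probs) / len(ova_probs)

def weighted_oem_loss(closed_logits, ova_probs):
    """Weight each classifier's entropy by the closed-set probability of its class (assumed form)."""
    weights = softmax(closed_logits)
    return sum(w * binary_entropy(p) for w, p in zip(weights, ova_probs))

closed_logits = [5.0, 0.0, 0.0]   # closed-set classifier strongly favours class 0
ova_probs = [0.5, 0.99, 0.01]     # the class-0 one-vs-all classifier is the uncertain one
uniform = uniform_oem_loss(ova_probs)
weighted = weighted_oem_loss(closed_logits, ova_probs)
```

In this example the uncertain class-0 boundary dominates the weighted loss, while the uniform average is pulled down by two already-confident classifiers, which is one way to read "diluting the learning signal".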
[CV-69] Implicit Neural Representations of Intramyocardial Motion and Strain MICCAI
[Quick read]: This paper addresses the important but challenging task of automatically quantifying intramyocardial motion and strain from tagging MRI. The key to the solution is the use of implicit neural representations (INRs) conditioned on learned latent codes to predict a continuous left ventricular (LV) displacement field without any optimization at inference time. On 452 UK Biobank test cases the method achieves the best tracking accuracy (2.14 mm RMSE) and the lowest combined error in global circumferential (2.86%) and radial (6.42%) strain, and is about 380 times faster than the most accurate deep learning baseline, showing that INR-based models suit accurate, scalable myocardial strain analysis on large cardiac magnetic resonance (CMR) datasets.
Link: https://arxiv.org/abs/2509.09004
Authors: Andrew Bell,Yan Kit Choi,Steffen Peterson,Andrew King,Muhummad Sohaib Nazir,Alistair Young
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: STACOM 2025 @ MICCAI
Abstract:Automatic quantification of intramyocardial motion and strain from tagging MRI remains an important but challenging task. We propose a method using implicit neural representations (INRs), conditioned on learned latent codes, to predict continuous left ventricular (LV) displacement without requiring inference-time optimisation. Evaluated on 452 UK Biobank test cases, our method achieved the best tracking accuracy (2.14 mm RMSE) and the lowest combined error in global circumferential (2.86%) and radial (6.42%) strain compared to three deep learning baselines. In addition, our method is approximately 380× faster than the most accurate baseline. These results highlight the suitability of INR-based models for accurate and scalable analysis of myocardial strain in large CMR datasets.
[CV-70] UltrON: Ultrasound Occupancy Networks MICCAI2025
[Quick read]: This paper addresses the difficulty of reconstructing 3D anatomical structures in free-hand ultrasound imaging caused by view dependency and acoustic shadowing, and the heavy reliance of existing implicit-representation methods (such as signed distance functions, SDF) on precise annotations. The key solution is a shape representation based on the occupancy function together with the UltrON model, which uses acoustic features from B-mode images for weakly supervised optimization, extracting useful geometric information without extra annotation. A novel loss function compensates for the view dependency of multiview B-mode images, improving the geometric consistency and generalization of the reconstruction and mitigating the limitations of occlusion and sparse labels.
Link: https://arxiv.org/abs/2509.08991
Authors: Magdalena Wysocki,Felix Duelmer,Ananya Bal,Nassir Navab,Mohammad Farid Azampour
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025
Abstract:In free-hand ultrasound imaging, sonographers rely on expertise to mentally integrate partial 2D views into 3D anatomical shapes. Shape reconstruction can assist clinicians in this process. Central to this task is the choice of shape representation, as it determines how accurately and efficiently the structure can be visualized, analyzed, and interpreted. Implicit representations, such as SDF and occupancy function, offer a powerful alternative to traditional voxel- or mesh-based methods by modeling continuous, smooth surfaces with compact storage, avoiding explicit discretization. Recent studies demonstrate that SDF can be effectively optimized using annotations derived from segmented B-mode ultrasound images. Yet, these approaches hinge on precise annotations, overlooking the rich acoustic information embedded in B-mode intensity. Moreover, implicit representation approaches struggle with the ultrasound’s view-dependent nature and acoustic shadowing artifacts, which impair reconstruction. To address the problems resulting from occlusions and annotation dependency, we propose an occupancy-based representation and introduce UltrON, which leverages acoustic features to improve geometric consistency in a weakly supervised optimization regime. We show that these features can be obtained from B-mode images without additional annotation cost. Moreover, we propose a novel loss function that compensates for view-dependency in the B-mode images and facilitates occupancy optimization from multiview ultrasound. By incorporating acoustic properties, UltrON generalizes to shapes of the same anatomy. We show that UltrON mitigates the limitations of occlusions and sparse labeling and paves the way for more accurate 3D reconstruction. Code and dataset will be available at this https URL.
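The two implicit representations named above differ only in what they return for a query point: an SDF gives a signed distance to the surface, whereas an occupancy function gives an inside/outside indicator. The sketch below uses an analytic sphere as a stand-in shape; in the paper these functions would be learned networks, and the names here are illustrative.

```python
import math

def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(point, center) - radius

def occupancy(point, sdf=sphere_sdf):
    """Occupancy indicator derived from an SDF: 1 inside or on the surface, 0 outside."""
    return 1 if sdf(point) <= 0 else 0

inside = occupancy((0.0, 0.0, 0.0))
outside = occupancy((2.0, 0.0, 0.0))
distance = sphere_sdf((2.0, 0.0, 0.0))
```

Supervising occupancy requires only inside/outside evidence rather than exact distances to a surface, which may be why it pairs more naturally with weak supervision from image intensities than an SDF does.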
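[CV-71] iMatcher: Improve matching in point cloud registration via local-to-global geometric consistency learning
[Quick read]: This paper addresses the accuracy and geometric consistency of feature matching in point cloud registration, aiming at robust rigid registration in complex outdoor and indoor scenes. The key to the solution is iMatcher, a fully differentiable framework. A local graph embedding module first initializes the matching score matrix; a repositioning step then refines it through bilateral source-to-target and target-to-source matching via nearest-neighbor search in 3D space; finally, the paired point features are refined through global geometric consistency learning to predict point-wise matching probabilities. The method achieves state-of-the-art inlier ratios on several real-world datasets (95%-97% on KITTI and up to 81.1% on 3DMatch), supporting its effectiveness across diverse scenes.
Link: https://arxiv.org/abs/2509.08982
Authors: Karim Slimani,Catherine Achard,Brahim Tamadazte
Affiliations: ISIR (Institute of Intelligent Systems and Robotics); UPMC (Pierre and Marie Curie University, Paris 6)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: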
Abstract:This paper presents iMatcher, a fully differentiable framework for feature matching in point cloud registration. The proposed method leverages learned features to predict a geometrically consistent confidence matrix, incorporating both local and global consistency. First, a local graph embedding module leads to an initialization of the score matrix. A subsequent repositioning step refines this matrix by considering bilateral source-to-target and target-to-source matching via nearest neighbor search in 3D space. The paired point features are then stacked together to be refined through global geometric consistency learning to predict a point-wise matching probability. Extensive experiments on real-world outdoor (KITTI, KITTI-360) and indoor (3DMatch) datasets, as well as on 6-DoF pose estimation (TUD-L) and partial-to-partial matching (MVP-RG), demonstrate that iMatcher significantly improves rigid registration performance. The method achieves state-of-the-art inlier ratios, scoring 95% - 97% on KITTI, 94% - 97% on KITTI-360, and up to 81.1% on 3DMatch, highlighting its robustness across diverse settings.
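The bilateral source-to-target and target-to-source idea can be illustrated with a mutual nearest-neighbor check. This is only a minimal sketch under simple assumptions (brute-force search on raw 3D coordinates, hypothetical function names); it is not iMatcher's learned repositioning module.

```python
import math

def nearest_index(point, candidates):
    """Return the index of the candidate closest to `point` (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda j: math.dist(point, candidates[j]))

def mutual_nearest_pairs(source, target):
    """Keep (i, j) only if j is nearest to source[i] and i is nearest to target[j]."""
    src_to_tgt = [nearest_index(p, target) for p in source]
    tgt_to_src = [nearest_index(q, source) for q in target]
    return [(i, j) for i, j in enumerate(src_to_tgt) if tgt_to_src[j] == i]

# Toy point clouds: the third source point has no true counterpart in the target.
source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
target = [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)]
pairs = mutual_nearest_pairs(source, target)
print(pairs)
```

The one-directional search would still assign the stray source point to some target point; requiring agreement in both directions drops it.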
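[CV-72] Value bounds and Convergence Analysis for Averages of LRP attributions
[Quick read]: This paper addresses the limited understanding of the numerical properties of Layer-wise relevance propagation (LRP)-type attribution methods, in particular the distribution of attribution values and their convergence behavior. The key is to represent LRP-type methods as a product of modified gradient matrices, which creates an analogy to the multiplication of Jacobian matrices arising from the chain rule. On this basis the authors derive upper bounds for singular values and component-wise bounds for attribution map values, and use these bounds to obtain multiplicative constants that govern the convergence of empirical means of attributions to their expectations. The analysis shows that the constants for LRP-beta do not depend on weight norms, unlike gradient-based methods and LRP-epsilon, which gives theoretical support for scenarios with multiple data augmentations and for Smoothgrad-type attribution methods.
Link: https://arxiv.org/abs/2509.08963
Authors: Alexander Binder,Nastaran Takmil-Homayouni,Urun Dogan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 37 pages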
Abstract:We analyze numerical properties of Layer-wise relevance propagation (LRP)-type attribution methods by representing them as a product of modified gradient matrices. This representation creates an analogy to matrix multiplications of Jacobi-matrices which arise from the chain rule of differentiation. In order to shed light on the distribution of attribution values, we derive upper bounds for singular values. Furthermore we derive component-wise bounds for attribution map values. As a main result, we apply these component-wise bounds to obtain multiplicative constants. These constants govern the convergence of empirical means of attributions to expectations of attribution maps. This finding has important implications for scenarios where multiple non-geometric data augmentations are applied to individual test samples, as well as for Smoothgrad-type attribution methods. In particular, our analysis reveals that the constants for LRP-beta remain independent of weight norms, a significant distinction from both gradient-based methods and LRP-epsilon.
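To see why component-wise bounds matter for averaged attributions, a generic concentration argument helps. The following is a standard Hoeffding-type sketch, not the bound derived in the paper; the symbols $R_j$, $C_j$, $x_i$ and $n$ are introduced here purely for illustration. Assume the $j$-th component of the attribution map satisfies $|R_j(x)| \le C_j$ and that the $n$ augmented samples $x_i$ are drawn independently.

```latex
\Pr\left( \left| \frac{1}{n} \sum_{i=1}^{n} R_j(x_i) - \mathbb{E}\left[ R_j(x) \right] \right| \ge t \right)
\;\le\; 2 \exp\left( - \frac{n t^{2}}{2 C_j^{2}} \right)
```

Under these assumptions a smaller constant $C_j$ means fewer augmented samples are needed to reach the same accuracy, which is the kind of role the abstract assigns to its multiplicative constants.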
[CV-73] CoSwin: Convolution Enhanced Hierarchical Shifted Window Attention For Small-Scale Vision
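[Quick read]: This paper addresses the weak local feature extraction of Vision Transformers (ViTs) on small datasets, caused by the lack of inductive biases such as locality and translation equivariance, which hurts generalization. The key to the solution is the CoSwin architecture, which integrates a learnable local feature enhancement module into each attention block. It combines localized convolutional feature learning with hierarchical shifted window self-attention so that fine-grained spatial details and global semantic structure are captured together, improving the expressiveness and robustness of transformers on small-scale vision tasks.
Link: https://arxiv.org/abs/2509.08959
Authors: Puskal Khadka,Rodrigue Rizk,Longwei Wang,KC Santosh
Affiliations: AI Research Lab; Department of Computer Science; University of South Dakota
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: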
Abstract:Vision Transformers (ViTs) have achieved impressive results in computer vision by leveraging self-attention to model long-range dependencies. However, their emphasis on global context often comes at the expense of local feature extraction in small datasets, particularly due to the lack of key inductive biases such as locality and translation equivariance. To mitigate this, we propose CoSwin, a novel feature-fusion architecture that augments the hierarchical shifted window attention with localized convolutional feature learning. Specifically, CoSwin integrates a learnable local feature enhancement module into each attention block, enabling the model to simultaneously capture fine-grained spatial details and global semantic structure. We evaluate CoSwin on multiple image classification benchmarks including CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny ImageNet. Our experimental results show consistent performance gains over state-of-the-art convolutional and transformer-based models. Notably, CoSwin achieves improvements of 2.17% on CIFAR-10, 4.92% on CIFAR-100, 0.10% on MNIST, 0.26% on SVHN, and 4.47% on Tiny ImageNet over the baseline Swin Transformer. These improvements underscore the effectiveness of local-global feature fusion in enhancing the generalization and robustness of transformers for small-scale vision. Code and pretrained weights available at this https URL
[CV-74] An U-Net-Based Deep Neural Network for Cloud Shadow and Sun-Glint Correction of Unmanned Aerial System (UAS) Imagery
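[Quick read]: This paper addresses the interference that cloud shadows and sun glint in unmanned aerial system (UAS) imagery cause when estimating water quality parameters. The key solution is a U-Net-based deep learning model, trained on pixel-level data, that identifies and separates the regions affected by cloud shadow and sun glint. On this basis a high-quality image correction model is built to recover the affected areas, with the aim of improving the accuracy of remote sensing of water bodies.
Link: https://arxiv.org/abs/2509.08949
Authors: Yibin Wang,Wondimagegn Beshah,Padmanava Dash,Haifeng Wang
Affiliations: Mississippi State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: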
Abstract:The use of unmanned aerial systems (UASs) has increased tremendously in the current decade. They have significantly advanced remote sensing with the capability to deploy and image the terrain as per required spatial, spectral, temporal, and radiometric resolutions for various remote sensing applications. One of the major advantages of UAS imagery is that images can be acquired in cloudy conditions by flying the UAS under the clouds. The limitation to the technology is that the imagery is often sullied by cloud shadows. Images taken over water are additionally affected by sun glint. These two pose serious issues for estimating water quality parameters from the UAS images. This study proposes a novel machine learning approach first to identify and extract regions with cloud shadows and sun glint and separate such regions from non-obstructed clear sky regions and sun-glint unaffected regions. The data was extracted from the images at pixel level to train a U-Net-based deep learning model, and the best settings for model training were identified based on the various evaluation metrics from test cases. Using this evaluation, a high-quality image correction model was determined, which was used to recover the cloud shadow and sun glint areas in the images.
[CV-75] CameraVDP: Perceptual Display Assessment with Uncertainty Estimation via Camera and Visual Difference Prediction SIGGRAPH
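[Quick read]: This paper addresses the limitations of traditional display measurement methods in capturing spatially varying display artifacts (such as pixel-level and high-frequency distortions), as well as the optical, sampling, and photometric distortions introduced by camera measurements. The core challenge is combining physical measurements with the perceptual characteristics of the human visual system so that display quality can be assessed perceptually. The key to the solution is the CameraVDP framework, which combines a camera-based reconstruction pipeline with a Visual Difference Predictor (VDP). The reconstruction pipeline uses HDR image stacking, MTF inversion, vignetting correction, geometric undistortion, homography transformation, and color correction so that a camera can serve as a precise display measurement instrument, while the VDP models the visibility of various stimuli to the human eye under different viewing conditions, linking physical measurement to perceptual quality assessment.
Link: https://arxiv.org/abs/2509.08947
Authors: Yancheng Cai,Robert Wanat,Rafal Mantiuk
Affiliations: University of Cambridge; LG Electronics North America
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by SIGGRAPH Asia 2025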
Abstract:Accurate measurement of images produced by electronic displays is critical for the evaluation of both traditional and computational displays. Traditional display measurement methods based on sparse radiometric sampling and fitting a model are inadequate for capturing spatially varying display artifacts, as they fail to capture high-frequency and pixel-level distortions. While cameras offer sufficient spatial resolution, they introduce optical, sampling, and photometric distortions. Furthermore, the physical measurement must be combined with a model of a visual system to assess whether the distortions are going to be visible. To enable perceptual assessment of displays, we propose a combination of a camera-based reconstruction pipeline with a visual difference predictor, which accounts for both the inaccuracy of camera measurements and visual difference prediction. The reconstruction pipeline combines HDR image stacking, MTF inversion, vignetting correction, geometric undistortion, homography transformation, and color correction, enabling cameras to function as precise display measurement instruments. By incorporating a Visual Difference Predictor (VDP), our system models the visibility of various stimuli under different viewing conditions for the human visual system. We validate the proposed CameraVDP framework through three applications: defective pixel detection, color fringing awareness, and display non-uniformity evaluation. Our uncertainty analysis framework enables the estimation of the theoretical upper bound for defect pixel detection performance and provides confidence intervals for VDP quality scores.
[CV-76] Discovering Divergent Representations between Text-to-Image Models ICCV2025
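[Quick read]: This paper addresses differences in the visual representations learned by different text-to-image models. The goal is to identify visual attributes that appear in images generated by one model but not the other, and to reveal the prompt concepts that trigger these differences. The key to the solution is CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model's output than in the other's and links them to the prompt concepts that trigger them, enabling an interpretable comparison of the outputs of multiple models.
Link: https://arxiv.org/abs/2509.08940
Authors: Lisa Dunlap,Joseph E. Gonzalez,Trevor Darrell,Fabian Caba Heilbron,Josef Sivic,Bryan Russell
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025. Code available at this https URL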
Abstract:In this paper, we investigate when and how visual representations learned by two different generative models diverge. Given two text-to-image models, our goal is to discover visual attributes that appear in images generated by one model but not the other, along with the types of prompts that trigger these attribute differences. For example, “flames” might appear in one model’s outputs when given prompts expressing strong emotions, while the other model does not produce this attribute given the same prompts. We introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model’s output than the other, and uncovers the prompt concepts linked to these visual differences. To evaluate CompCon’s ability to find diverging representations, we create an automated data generation pipeline to produce ID2, a dataset of 60 input-dependent differences, and compare our approach to several LLM- and VLM-powered baselines. Finally, we use CompCon to compare popular text-to-image models, finding divergent representations such as how PixArt depicts prompts mentioning loneliness with wet streets and Stable Diffusion 3.5 depicts African American people in media professions. Code at: this https URL
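[CV-77] Live(r) Die: Predicting Survival in Colorectal Liver Metastasis
[Quick read]: This paper addresses the insufficient accuracy of postoperative survival prediction for patients with colorectal liver metastasis (CRLM), especially in multifocal cases, where existing prognostic models based on limited clinical or molecular features have limited predictive power. The key to the solution is a fully automated framework that combines automated segmentation with radiomics analysis. First, promptable foundation models and SAMONAI, a novel zero-shot 3D prompt propagation algorithm, are used to segment the liver, tumors, and spleen accurately and efficiently from partially annotated MRI. Features are then extracted from each tumor and survival is predicted with SurvAMINN, an autoencoder-based multiple instance neural network that jointly learns dimensionality reduction and hazard prediction while focusing on the most aggressive tumors. Evaluated on data from 227 patients, the framework outperforms existing clinical and genomic biomarkers (C-index improvement exceeding 10%) and offers accurate, annotation-efficient, and interpretable prediction.
Link: https://arxiv.org/abs/2509.08935
Authors: Muhammad Alberb,Helen Cheung,Anne Martel
Affiliations: University of Toronto; Sunnybrook Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Thesis at Erasmus Mundus Joint Master’s Degree in Medical Imaging and Applications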
Abstract:Colorectal cancer frequently metastasizes to the liver, significantly reducing long-term survival. While surgical resection is the only potentially curative treatment for colorectal liver metastasis (CRLM), patient outcomes vary widely depending on tumor characteristics along with clinical and genomic factors. Current prognostic models, often based on limited clinical or molecular features, lack sufficient predictive power, especially in multifocal CRLM cases. We present a fully automated framework for surgical outcome prediction from pre- and post-contrast MRI acquired before surgery. Our framework consists of a segmentation pipeline and a radiomics pipeline. The segmentation pipeline learns to segment the liver, tumors, and spleen from partially annotated data by leveraging promptable foundation models to complete missing labels. Also, we propose SAMONAI, a novel zero-shot 3D prompt propagation algorithm that leverages the Segment Anything Model to segment 3D regions of interest from a single point prompt, significantly improving our segmentation pipeline’s accuracy and efficiency. The predicted pre- and post-contrast segmentations are then fed into our radiomics pipeline, which extracts features from each tumor and predicts survival using SurvAMINN, a novel autoencoder-based multiple instance neural network for survival analysis. SurvAMINN jointly learns dimensionality reduction and hazard prediction from right-censored survival data, focusing on the most aggressive tumors. Extensive evaluation on an institutional dataset comprising 227 patients demonstrates that our framework surpasses existing clinical and genomic biomarkers, delivering a C-index improvement exceeding 10%. Our results demonstrate the potential of integrating automated segmentation algorithms and radiomics-based survival analysis to deliver accurate, annotation-efficient, and interpretable outcome prediction in CRLM.
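For readers unfamiliar with the C-index quoted above, the sketch below computes Harrell's concordance index on right-censored data. It is a generic textbook version with made-up toy values, not the evaluation code used in the thesis; the function and variable names are assumptions.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data.

    times  : observed follow-up time per patient
    events : 1 if the event (death) was observed, 0 if censored
    risks  : predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data (hypothetical): the second patient is censored at t = 8.
times = [5, 8, 12, 20]
events = [1, 0, 1, 1]
risks = [0.9, 0.3, 0.6, 0.7]
c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a gain of 0.10 is a substantial change on this scale.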
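[CV-78] SFD-Mamba2Net: Structure-Guided Frequency-Enhanced Dual-Stream Mamba2 Network for Coronary Artery Segmentation
[Quick read]: This paper addresses the accuracy of vessel segmentation and stenosis detection in Invasive Coronary Angiography (ICA) images. The core challenge is that ICA images typically have low contrast, high noise, and complex fine-grained vascular structures, so existing methods fall short of clinical needs. The key to the solution is SFD-Mamba2Net, an end-to-end framework that combines multi-scale structural priors, state-space-based long-range dependency modeling, and frequency-domain detail enhancement. In the encoder, a Curvature-Aware Structural Enhancement (CASE) module highlights slender tubular vascular structures and suppresses background interference. In the decoder, a Progressive High-Frequency Perception (PHFP) module uses multi-level wavelet decomposition to progressively refine high-frequency details while integrating low-frequency global structure, improving segmentation accuracy as well as the true positive rate and positive predictive value of stenosis detection.
Link: https://arxiv.org/abs/2509.08934
Authors: Nan Mu,Ruiqi Song,Zhihui Xu,Jingfeng Jiang,Chen Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: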
Abstract:Background: Coronary Artery Disease (CAD) is one of the leading causes of death worldwide. Invasive Coronary Angiography (ICA), regarded as the gold standard for CAD diagnosis, necessitates precise vessel segmentation and stenosis detection. However, ICA images are typically characterized by low contrast, high noise levels, and complex, fine-grained vascular structures, which pose significant challenges to the clinical adoption of existing segmentation and detection methods. Objective: This study aims to improve the accuracy of coronary artery segmentation and stenosis detection in ICA images by integrating multi-scale structural priors, state-space-based long-range dependency modeling, and frequency-domain detail enhancement strategies. Methods: We propose SFD-Mamba2Net, an end-to-end framework tailored for ICA-based vascular segmentation and stenosis detection. In the encoder, a Curvature-Aware Structural Enhancement (CASE) module is embedded to leverage multi-scale responses for highlighting slender tubular vascular structures, suppressing background interference, and directing attention toward vascular regions. In the decoder, we introduce a Progressive High-Frequency Perception (PHFP) module that employs multi-level wavelet decomposition to progressively refine high-frequency details while integrating low-frequency global structures. Results and Conclusions: SFD-Mamba2Net consistently outperformed state-of-the-art methods across eight segmentation metrics, and achieved the highest true positive rate and positive predictive value in stenosis detection.
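The multi-level wavelet decomposition mentioned for the PHFP module splits a signal into a low-frequency approximation and high-frequency detail bands. The sketch below shows this with a 1D Haar wavelet; the choice of Haar, the 1D setting, and all names are assumptions made for illustration, since the abstract does not specify the wavelet or the module internals.

```python
import math

def haar_step(signal):
    """One Haar level: scaled pairwise sums (low-pass) and differences (high-pass)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[k] + signal[k + 1]) * s for k in range(0, len(signal), 2)]
    detail = [(signal[k] - signal[k + 1]) * s for k in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal, levels):
    """Return the coarsest approximation and the detail bands, finest band first."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) < 2 or len(approx) % 2:
            raise ValueError("signal length must be divisible by 2**levels")
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details

# Toy signal: flat regions plus one sharp jump (8 -> 0), which is high-frequency content.
signal = [4.0, 4.0, 2.0, 2.0, 8.0, 0.0, 1.0, 1.0]
approx, details = haar_decompose(signal, levels=2)
print(approx)      # low-frequency global structure
print(details[0])  # finest detail band: non-zero only at the sharp jump
```

The orthonormal scaling keeps the total energy unchanged, so the detail bands show directly how much of the signal lives in fine structure such as thin vessel edges.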
[CV-79] Similarity-based Outlier Detection for Noisy Object Re-Identification Using Beta Mixtures
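[Quick read]: This paper addresses the sensitivity of object re-identification (Re-ID) methods to label noise, which typically causes significant performance degradation. The key is to reframe Re-ID as a supervised image similarity task and to use a Siamese network architecture to learn discriminative pairwise relationships, together with Beta-SOD (Beta mixture Similarity-based Outlier Detection), a novel statistical outlier detection framework that models the distribution of cosine similarities between embedding pairs with a two-component Beta mixture model in order to identify and remove noisy samples. The method combines binary cross-entropy, contrastive, and cosine embedding losses, and its robustness is evaluated at noise levels of 10%-30% on CUHK03, Market-1501, and VeRi-776.
Link: https://arxiv.org/abs/2509.08926
Authors: Waqar Ahmad,Evan Murphy,Vladimir A. Krylov
Affiliations: Dublin City University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: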
Abstract:Object re-identification (Re-ID) methods are highly sensitive to label noise, which typically leads to significant performance degradation. We address this challenge by reframing Re-ID as a supervised image similarity task and adopting a Siamese network architecture trained to capture discriminative pairwise relationships. Central to our approach is a novel statistical outlier detection (OD) framework, termed Beta-SOD (Beta mixture Similarity-based Outlier Detection), which models the distribution of cosine similarities between embedding pairs using a two-component Beta distribution mixture model. We establish a novel identifiability result for mixtures of two Beta distributions, ensuring that our learning task is well-posed. The proposed OD step complements the Re-ID architecture combining binary cross-entropy, contrastive, and cosine embedding losses that jointly optimize feature-level similarity learning. We demonstrate the effectiveness of Beta-SOD in de-noising and Re-ID tasks for person Re-ID, on CUHK03 and Market-1501 datasets, and vehicle Re-ID, on VeRi-776 dataset. Our method shows superior performance compared to the state-of-the-art methods across various noise levels (10-30%), demonstrating both robustness and broad applicability in noisy Re-ID scenarios. The implementation of Beta-SOD is available at: this https URL
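A two-component Beta mixture over pairwise similarities can be sketched with a small EM loop. Everything below is an illustrative assumption rather than the Beta-SOD implementation: similarities are taken to be already rescaled to (0, 1), for example via (cos + 1) / 2, the M-step uses weighted moment matching instead of exact maximum likelihood, and the toy values are invented.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), computed in log space."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def weighted_moment_fit(xs, weights):
    """Moment-matching estimate of (a, b) from weighted samples (an approximation)."""
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, xs)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, xs)) / total
    var = max(var, 1e-6)  # guard against a degenerate, zero-width component
    scale = max(mean * (1 - mean) / var - 1, 1e-3)
    return mean * scale, (1 - mean) * scale

def fit_beta_mixture(similarities, iterations=50):
    """EM for a two-component Beta mixture; returns params, weights, responsibilities."""
    xs = [min(max(x, 1e-4), 1 - 1e-4) for x in similarities]
    params = [(2.0, 5.0), (5.0, 2.0)]  # start: low-similarity and high-similarity
    mix = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: posterior probability of each component for every similarity.
        resp = []
        for x in xs:
            p0 = mix[0] * beta_pdf(x, *params[0]) + 1e-300
            p1 = mix[1] * beta_pdf(x, *params[1]) + 1e-300
            resp.append((p0 / (p0 + p1), p1 / (p0 + p1)))
        # M-step: refit each component from its weighted samples.
        for k in range(2):
            weights = [r[k] for r in resp]
            params[k] = weighted_moment_fit(xs, weights)
            mix[k] = sum(weights) / len(xs)
    return params, mix, resp

# Toy similarities: six low values followed by six high values.
similarities = [0.15, 0.2, 0.25, 0.3, 0.22, 0.18, 0.85, 0.9, 0.8, 0.95, 0.88, 0.92]
params, mix, resp = fit_beta_mixture(similarities)
low_component_posterior = [r[0] for r in resp]
print([round(p, 3) for p in low_component_posterior])
```

In a noisy-label setting one plausible use is to flag pairs labeled as the same identity whose posterior under the low-similarity component is high, though how the paper thresholds these posteriors is not stated in the abstract.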
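[CV-80] PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability
[Quick read]: This paper addresses the risk that Large Language Models (LLMs) deployed in real-world applications generate harmful, biased, or misleading information for vulnerable populations (such as LGBTQ+ individuals, single parents, and marginalized communities). Existing safety approaches mostly rely on post-hoc filtering or generic alignment techniques and cannot proactively prevent harm at the generation source. The key to the solution is the PromptGuard framework and its core contribution, VulnGuard Prompt, a hybrid prompting technique based on real-world data-driven contrastive learning. It integrates few-shot examples from curated GitHub repositories, ethical chain-of-thought reasoning, and adaptive role-prompting to build population-specific protective barriers, and the authors use multi-objective optimization theory to derive an analytical harm reduction of 25-30% (based on entropy bounds and Pareto optimality).
Link: https://arxiv.org/abs/2509.08910
Authors: Tung Vu,Lam Nguyen,Quynh Dao
Affiliations: Posts and Telecommunications Institute of Technology; Hanoi Architectural University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: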
Abstract:The proliferation of Large Language Models (LLMs) in real-world applications poses unprecedented risks of generating harmful, biased, or misleading information to vulnerable populations including LGBTQ+ individuals, single parents, and marginalized communities. While existing safety approaches rely on post-hoc filtering or generic alignment techniques, they fail to proactively prevent harmful outputs at the generation source. This paper introduces PromptGuard, a novel modular prompting framework with our breakthrough contribution: VulnGuard Prompt, a hybrid technique that prevents harmful information generation using real-world data-driven contrastive learning. VulnGuard integrates few-shot examples from curated GitHub repositories, ethical chain-of-thought reasoning, and adaptive role-prompting to create population-specific protective barriers. Our framework employs theoretical multi-objective optimization with formal proofs demonstrating 25-30% analytical harm reduction through entropy bounds and Pareto optimality. PromptGuard orchestrates six core modules: Input Classification, VulnGuard Prompting, Ethical Principles Integration, External Tool Interaction, Output Validation, and User-System Interaction, creating an intelligent expert system for real-time harm prevention. We provide comprehensive mathematical formalization including convergence proofs, vulnerability analysis using information theory, and theoretical validation framework using GitHub-sourced datasets, establishing mathematical foundations for systematic empirical research.
[CV-81] Diffusion-Based Action Recognition Generalizes to Untrained Domains
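[Quick read]: This paper addresses the limited generalization of current deep learning models in action recognition, which struggle to remain robust under variations in species, viewpoint, and recording context. The key to the solution is to use features generated by a Vision Diffusion Model (VDM), aggregated via a transformer, to obtain more semantic representations. In particular, the features are extracted from a model conditioned on earlier timesteps of the diffusion process, which suppresses pixel-level detail and emphasizes semantic information, improving cross-domain generalization in action recognition.
Link: https://arxiv.org/abs/2509.08908
Authors: Rogerio Guimaraes,Frank Xiao,Pietro Perona,Markus Marks
Affiliations: California Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: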
Abstract:Humans can recognize the same actions despite large context and viewpoint variations, such as differences between species (walking in spiders vs. horses), viewpoints (egocentric vs. third-person), and contexts (real life vs movies). Current deep learning models struggle with such generalization. We propose using features generated by a Vision Diffusion Model (VDM), aggregated via a transformer, to achieve human-like action recognition across these challenging conditions. We find that generalization is enhanced by the use of a model conditioned on earlier timesteps of the diffusion process to highlight semantic information over pixel level details in the extracted features. We experimentally explore the generalization properties of our approach in classifying actions across animal species, across different viewing angles, and different recording contexts. Our model sets a new state-of-the-art across all three generalization benchmarks, bringing machine action recognition closer to human-like robustness. Project page: this https URL. Code: this https URL.
[CV-82] DeepTV: A neural network approach for total variation minimization
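[Quick read]: This paper addresses the numerical solution of infinite-dimensional total variation (TV) minimization problems, which have important applications in image processing and inverse problems. Because the direct neural network formulation of the problem does not have a solution in general, the authors introduce an auxiliary neural network problem that does have a solution and prove that it Γ-converges to the original problem. They further propose a discrete version of the auxiliary problem, which also Γ-converges to the original problem, giving a theoretical basis for numerical computation. In particular, the Γ-convergence proof suggests a particular discretization of the total variation, and the discrete neural network problem is connected to a classical finite difference discretization. Numerical experiments support the theoretical findings.
Link: https://arxiv.org/abs/2409.05569
Authors: Andreas Langer,Sara Behnamian
Affiliations: Lund University; University of Copenhagen
Categories: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)
Comments: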
Abstract:Neural network approaches have been demonstrated to work quite well to solve partial differential equations in practice. In this context approaches like physics-informed neural networks and the Deep Ritz method have become popular. In this paper, we propose a similar approach to solve an infinite-dimensional total variation minimization problem using neural networks. We illustrate that the resulting neural network problem does not have a solution in general. To circumvent this theoretic issue, we consider an auxiliary neural network problem, which indeed has a solution, and show that it converges in the sense of Γ-convergence to the original problem. For computing a numerical solution we further propose a discrete version of the auxiliary neural network problem and again show its Γ-convergence to the original infinite-dimensional problem. In particular, the Γ-convergence proof suggests a particular discretization of the total variation. Moreover, we connect the discrete neural network problem to a finite difference discretization of the infinite-dimensional total variation minimization problem. Numerical experiments are presented supporting our theoretical findings.
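As a point of reference for the finite difference discretization mentioned above, the sketch below evaluates a common isotropic discrete total variation on a 2D grid using forward differences. This is one standard choice, assumed here for illustration; it is not necessarily the discretization that the paper's Γ-convergence proof suggests, and the function name and grid spacing `h` are hypothetical.

```python
import math

def discrete_total_variation(u, h=1.0):
    """Isotropic discrete TV of a 2D grid `u` with spacing `h`.

    Uses forward differences, with the difference set to zero at the far
    boundary, and sums the gradient magnitude over all grid cells.
    """
    rows, cols = len(u), len(u[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            d_row = (u[i + 1][j] - u[i][j]) / h if i + 1 < rows else 0.0
            d_col = (u[i][j + 1] - u[i][j]) / h if j + 1 < cols else 0.0
            total += math.hypot(d_row, d_col)
    return total * h * h  # each cell contributes area h * h

# A vertical step edge of height 1 crossing 4 rows has TV 4 for h = 1.
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
constant = [[3.0] * 4 for _ in range(4)]
tv_step = discrete_total_variation(step)
tv_constant = discrete_total_variation(constant)
print(tv_step, tv_constant)
```

The step image illustrates why TV is popular in imaging: the value depends on the jump height and the edge length, not on how sharp the edge is, so sharp edges are not penalized more than smooth ramps of the same height.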
[CV-83] Explainable AI for Accelerated Microstructure Imaging: A SHAP-Guided Protocol on the Connectome 2.0 scanner
[Quick read]: This paper addresses the long scan times that limit practical use of the Neurite Exchange Imaging (NEXI) model in diffusion MRI. To map biophysical parameters efficiently and accurately, the study proposes a hybrid optimization framework that combines a data-driven approach with explainable artificial intelligence (XAI). The key is a guided recursive feature elimination strategy that selects an optimal 8-feature subset from the original 15-feature protocol, shortening the scan to 14 minutes while keeping parameter accuracy, anatomical contrast, and test-retest reproducibility comparable to the full protocol. Compared with theory-driven and heuristic reduction schemes, it is more robust, reducing the deviation in water exchange time estimates by over two-fold.
Link: https://arxiv.org/abs/2509.09513
Authors: Quentin Uhl,Tommaso Pavan,Julianna Gerold,Kwok-Shing Chan,Yohan Jun,Shohei Fujita,Aneri Bhatt,Yixin Ma,Qiaochu Wang,Hong-Hsi Lee,Susie Y. Huang,Berkin Bilgic,Ileana Jelescu
Affiliations: Lausanne University Hospital (CHUV); University of Lausanne; École Polytechnique Fédérale de Lausanne (EPFL); Massachusetts General Hospital; Athinoula A. Martinos Center for Biomedical Imaging
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Submitted to IEEE Transactions on Medical Imaging (TMI). This all-in-one version includes supplementary materials. 18 pages, 14 figures, 2 tables
Abstract:The diffusion MRI Neurite Exchange Imaging model offers a promising framework for probing gray matter microstructure by estimating parameters such as compartment sizes, diffusivities, and inter-compartmental water exchange time. However, existing protocols require long scan times. This study proposes a reduced acquisition scheme for the Connectome 2.0 scanner that preserves model accuracy while substantially shortening scan duration. We developed a data-driven framework using explainable artificial intelligence with a guided recursive feature elimination strategy to identify an optimal 8-feature subset from a 15-feature protocol. The performance of this optimized protocol was validated in vivo and benchmarked against the full acquisition and alternative reduction strategies. Parameter accuracy, preservation of anatomical contrast, and test-retest reproducibility were assessed. The reduced protocol yielded parameter estimates and cortical maps comparable to the full protocol, with low estimation errors in synthetic data and minimal impact on test-retest variability. Compared to theory-driven and heuristic reduction schemes, the optimized protocol demonstrated superior robustness, reducing the deviation in water exchange time estimates by over two-fold. In conclusion, this hybrid optimization framework enables viable imaging of neurite exchange in 14 minutes without loss of parameter fidelity. This approach supports the broader application of exchange-sensitive diffusion magnetic resonance imaging in neuroscience and clinical research, and offers a generalizable method for designing efficient acquisition protocols in biophysical parameter mapping.
[CV-84] In-Loop Filtering Using Learned Look-Up Tables for Video Coding
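[Quick read]: This paper addresses the difficulty of deploying deep neural network (DNN)-based in-loop filtering (ILF) in video coding because of its high computational complexity and hardware demands. The core solution is LUT-ILF++, a universal look-up table (LUT)-based framework. After a DNN is trained with a restricted reference range, all possible inputs are traversed and the outputs are cached into LUTs, so that during coding the expensive DNN inference is replaced by low-complexity look-up and interpolation. The framework also introduces cooperation among multiple kinds of filtering LUTs, a cross-component indexing mechanism, and a LUT compaction scheme, which reduce storage cost and time complexity while preserving filtering quality. Implemented in the VVC reference software, it achieves average bitrate reductions of 0.82%/2.97%/1.63% and 0.85%/4.11%/2.06% under the AI and RA configurations, respectively.
Link: https://arxiv.org/abs/2509.09494
Authors: Zhuoyuan Li,Jiacheng Li,Yao Li,Jialin Li,Li Li,Dong Liu,Feng Wu
Affiliations: University of Science and Technology of China
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 25 pages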
Abstract:In-loop filtering (ILF) is a key technology in video coding standards to reduce artifacts and enhance visual quality. Recently, neural network-based ILF schemes have achieved remarkable coding gains, emerging as a powerful candidate for next-generation video coding standards. However, the use of deep neural networks (DNN) brings significant computational and time complexity or high demands for dedicated hardware, making it challenging for general use. To address this limitation, we study a practical ILF solution by adopting look-up tables (LUTs). After training a DNN with a restricted reference range for ILF, all possible inputs are traversed, and the output values of the DNN are cached into LUTs. During the coding process, the filtering process is performed by simply retrieving the filtered pixel through locating the input pixels and interpolating between the cached values, instead of relying on heavy inference computations. In this paper, we propose a universal LUT-based ILF framework, termed LUT-ILF++. First, we introduce the cooperation of multiple kinds of filtering LUTs and propose a series of customized indexing mechanisms to enable better filtering reference perception with limited storage consumption. Second, we propose the cross-component indexing mechanism to enable the filtering of different color components jointly. Third, in order to make our solution practical for coding uses, we propose the LUT compaction scheme to enable the LUT pruning, achieving a lower storage cost of the entire solution. The proposed framework is implemented in the VVC reference software. Experimental results show that the proposed framework achieves on average 0.82%/2.97%/1.63% and 0.85%/4.11%/2.06% bitrate reduction for common test sequences, under the AI and RA configurations, respectively. Compared to DNN-based solutions, our proposed solution has much lower time complexity and storage cost.
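The cache-then-interpolate idea can be shown in a few lines. The sketch below is deliberately simplified and rests on several assumptions: it indexes a single 8-bit pixel value rather than a neighborhood of reference pixels, `stand_in_filter` is a hypothetical placeholder for a trained network, and the sampling step of 17 is chosen only because it divides 255 evenly.

```python
def stand_in_filter(v):
    """Hypothetical placeholder for an expensive trained filter (a smooth tone curve)."""
    return 255.0 * (v / 255.0) ** 2

def build_lut(filter_fn, step=17, max_value=255):
    """Cache filter outputs at the sampled inputs 0, step, 2 * step, ..., max_value."""
    if max_value % step:
        raise ValueError("step must divide max_value")
    return [filter_fn(v) for v in range(0, max_value + 1, step)]

def lut_lookup(table, v, step=17):
    """Retrieve the filtered value by locating v and interpolating between cached entries."""
    idx, rem = divmod(v, step)
    if rem == 0:
        return table[idx]  # v falls exactly on a cached sample
    frac = rem / step
    return (1.0 - frac) * table[idx] + frac * table[idx + 1]

table = build_lut(stand_in_filter)  # 16 cached entries instead of 256
max_error = max(abs(lut_lookup(table, v) - stand_in_filter(v)) for v in range(256))
print(len(table), round(max_error, 3))
```

Sampling coarsely and interpolating trades a small approximation error for a much smaller table, which is the same storage-versus-accuracy tension the paper's indexing and compaction schemes are designed to manage at a far larger scale.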
[CV-85] Virtual staining for 3D X-ray histology of bone implants
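[Quick read]: This paper addresses the low biochemical specificity of 3D X-ray histology, which stems from its inherently greyscale image contrast, in order to improve the interpretability of non-invasive imaging of biological tissue. The key is to bring cross-modality image translation to the X-ray domain: using paired samples of synchrotron-radiation-based micro-CT scans and co-registered toluidine blue-stained histology sections, a modified CycleGAN is trained to generate virtually stained slices with histologically realistic colour. The model combines pixelwise supervision with a greyscale consistency term and uses patch-based training with on-the-fly data augmentation, preserving high-resolution structure and producing high-quality colour with limited paired data. It outperforms the Pix2Pix and standard CycleGAN baselines and offers a scalable route to chemically informative 3D tissue characterisation without chemical staining.
Link: https://arxiv.org/abs/2509.09235
Authors: Sarah C. Irvine,Christian Lucas,Diana Krüger,Bianca Guedert,Julian Moosmann,Berit Zeller-Plumhoff
Affiliations: Helmholtz-Zentrum Hereon; University of Rostock
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph); Quantitative Methods (q-bio.QM)
Comments: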
Abstract:Three-dimensional X-ray histology techniques offer a non-invasive alternative to conventional 2D histology, enabling volumetric imaging of biological tissues without the need for physical sectioning or chemical staining. However, the inherent greyscale image contrast of X-ray tomography limits its biochemical specificity compared to traditional histological stains. Within digital pathology, deep learning-based virtual staining has demonstrated utility in simulating stained appearances from label-free optical images. In this study, we extend virtual staining to the X-ray domain by applying cross-modality image translation to generate artificially stained slices from synchrotron-radiation-based micro-CT scans. Using over 50 co-registered image pairs of micro-CT and toluidine blue-stained histology from bone-implant samples, we trained a modified CycleGAN network tailored for limited paired data. Whole slide histology images were downsampled to match the voxel size of the CT data, with on-the-fly data augmentation for patch-based training. The model incorporates pixelwise supervision and greyscale consistency terms, producing histologically realistic colour outputs while preserving high-resolution structural detail. Our method outperformed Pix2Pix and standard CycleGAN baselines across SSIM, PSNR, and LPIPS metrics. Once trained, the model can be applied to full CT volumes to generate virtually stained 3D datasets, enhancing interpretability without additional sample preparation. While features such as new bone formation were able to be reproduced, some variability in the depiction of implant degradation layers highlights the need for further training data and refinement. This work introduces virtual staining to 3D X-ray imaging and offers a scalable route for chemically informative, label-free tissue characterisation in biomedical research.
[CV-86] Dynamic Structural Recovery Parameters Enhance Prediction of Visual Outcomes After Macular Hole Surgery
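[Quick Read]: This paper addresses the clinical problem of predicting postoperative visual recovery more accurately in patients with idiopathic full-thickness macular hole (iFTMH). The key to the solution is to introduce novel dynamic structural parameters and integrate them into a multimodal deep learning (DL) framework that combines quantitative, qualitative, and dynamic features extracted from longitudinal optical coherence tomography (OCT) images with clinical variables and the raw OCT images. Adding the dynamic parameters improved the area under the receiver operating characteristic curve (AUC) of the logistic regression models, and the multimodal DL model outperformed the regression models at every follow-up stage, with an AUC gain of up to 0.12, which indicates that the dynamic parameters and the raw image information are complementary.
Link: https://arxiv.org/abs/2509.09227
Authors: Yinzheng Zhao,Zhihao Zhao,Rundong Jiang,Louisa Sackewitz,Quanmin Liang,Mathias Maier,Daniel Zapp,Peter Charbel Issa,Mohammad Ali Nasseri
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: TVST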
Abstract:Purpose: To introduce novel dynamic structural parameters and evaluate their integration within a multimodal deep learning (DL) framework for predicting postoperative visual recovery in idiopathic full-thickness macular hole (iFTMH) patients. Methods: We utilized a publicly available longitudinal OCT dataset at five stages (preoperative, 2 weeks, 3 months, 6 months, and 12 months). A stage-specific segmentation model delineated related structures, and an automated pipeline extracted quantitative, composite, qualitative, and dynamic features. Binary logistic regression models, constructed with and without dynamic parameters, assessed their incremental predictive value for best-corrected visual acuity (BCVA). A multimodal DL model combining clinical variables, OCT-derived features, and raw OCT images was developed and benchmarked against regression models. Results: The segmentation model achieved high accuracy across all timepoints (mean Dice 0.89). Univariate and multivariate analyses identified base diameter, ellipsoid zone integrity, and macular hole area as significant BCVA predictors (P < 0.05). Incorporating dynamic recovery rates consistently improved logistic regression AUC, especially at the 3-month follow-up. The multimodal DL model outperformed logistic regression, yielding higher AUCs and overall accuracy at each stage. The difference is as high as 0.12, demonstrating the complementary value of raw image volume and dynamic parameters. Conclusions: Integrating dynamic parameters into the multimodal DL model significantly enhances the accuracy of predictions. This fully automated process therefore represents a promising clinical decision support tool for personalized postoperative management in macular hole surgery.
[CV-87] Ultrafast Deep Learning-Based Scatter Estimation in Cone-Beam Computed Tomography
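[Quick Read]: This paper addresses scatter artifacts, which severely degrade image quality in cone-beam computed tomography (CBCT), and in particular the large memory footprint that limits the deployment of deep learning methods on mobile CBCT systems and edge devices. The key to the solution is to run the network at reduced resolution so that accuracy is retained while computational cost drops sharply: the study measures the reconstruction error of down-up sampling the scatter signal at six resolutions with different interpolation methods, then selects a resolution for training and inference based on the trade-off between floating-point operations (FLOPs), inference time, and GPU memory. The result is a 78-fold reduction in FLOPs, a 16-fold reduction in inference time, and a 12-fold reduction in GPU memory, while MAPE falls from 4.42% to 3.85% and MSE from 2.01×10⁻² to 1.34×10⁻², which supports the method's effectiveness and robustness in resource-constrained environments.
Link: https://arxiv.org/abs/2509.08973
Authors: Harshit Agrawal,Ari Hietanen,Simo Särkkä
Affiliations: Aalto University
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments: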
Abstract:Purpose: Scatter artifacts drastically degrade the image quality of cone-beam computed tomography (CBCT) scans. Although deep learning-based methods show promise in estimating scatter from CBCT measurements, their deployment in mobile CBCT systems or edge devices is still limited due to the large memory footprint of the networks. This study addresses the issue by applying networks at varying resolutions and suggesting an optimal one, based on speed and accuracy. Methods: First, the reconstruction error in down-up sampling of CBCT scatter signal was examined at six resolutions by comparing four interpolation methods. Next, a recent state-of-the-art method was trained across five image resolutions and evaluated for the reductions in floating-point operations (FLOPs), inference times, and GPU memory requirements. Results: Reducing the input size and network parameters achieved a 78-fold reduction in FLOPs compared to the baseline method, while maintaining comparable performance in terms of mean-absolute-percentage-error (MAPE) and mean-square-error (MSE). Specifically, the MAPE decreased to 3.85% compared to 4.42%, and the MSE decreased to $1.34 \times 10^{-2}$ compared to $2.01 \times 10^{-2}$. Inference time and GPU memory usage were reduced by factors of 16 and 12, respectively. Further experiments comparing scatter-corrected reconstructions on a large, simulated dataset and real CBCT scans from water and Sedentex CT phantoms clearly demonstrated the robustness of our method. Conclusion: This study highlights the underappreciated role of downsampling in deep learning-based scatter estimation. The substantial reduction in FLOPs and GPU memory requirements achieved by our method enables scatter correction in resource-constrained environments, such as mobile CBCT and edge devices.
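The down-up sampling experiment described in this abstract can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the synthetic smooth signal standing in for a scatter projection, block-average downsampling, nearest-neighbour upsampling, and the helper names `downsample`, `upsample_nearest`, `mape`, and `mse`. It only shows how a reconstruction error can be measured as the resolution is reduced.

```python
import numpy as np

def downsample(img, factor):
    """Block-average downsampling (an assumed choice, not the paper's)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample_nearest(img, factor):
    """Nearest-neighbour upsampling back to the original grid."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def mape(ref, est):
    """Mean absolute percentage error; ref must be non-zero everywhere."""
    return 100.0 * np.mean(np.abs((ref - est) / ref))

def mse(ref, est):
    """Mean squared error."""
    return np.mean((ref - est) ** 2)

# Hypothetical smooth, strictly positive signal standing in for scatter.
y, x = np.mgrid[0:256, 0:256] / 256.0
scatter = 1.0 + 0.5 * np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)

for factor in (2, 4, 8, 16):
    rec = upsample_nearest(downsample(scatter, factor), factor)
    print(f"factor={factor:2d}  MAPE={mape(scatter, rec):.3f}%  MSE={mse(scatter, rec):.2e}")
```

Because scatter is typically a low-frequency signal, a sketch like this tends to show small errors at moderate downsampling factors, which is the intuition behind estimating scatter at reduced resolution.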
Artificial Intelligence
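[AI-0] The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
[Quick Read]: This paper asks whether continued scaling of large language models (LLMs) yields diminishing returns, particularly on long-horizon tasks. It observes that marginal gains in single-step accuracy can compound into exponential improvements in the length of task a model can complete, and argues that failures on longer versions of simple tasks stem from mistakes in execution rather than an inability to reason. Per-step accuracy degrades as the number of steps grows, partly through a self-conditioning effect: models become more likely to err when the context contains their own earlier errors. The key to the approach is to isolate execution capability from reasoning by explicitly providing the knowledge and plan needed to solve the task. Experiments show that larger models correctly execute significantly more turns even when small models have 100% single-turn accuracy, and that self-conditioning is not reduced by model scale alone. Recent thinking models do not self-condition and can execute much longer tasks in a single turn, which highlights the benefits of scaling model size and sequential test-time compute for long-horizon tasks.
Link: https://arxiv.org/abs/2509.09677
Authors: Akshit Sinha,Arvindh Arun,Shashwat Goel,Steffen Staab,Jonas Geiping
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: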
备注:
Abstract:Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple but counterintuitive fact that marginal gains in single-step accuracy can compound into exponential improvements in the length of a task a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. We propose isolating execution capability, by explicitly providing the knowledge and plan needed to solve a long-horizon task. We find that larger models can correctly execute significantly more turns even when small models have 100% single-turn accuracy. We observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations – curiously, we observe a self-conditioning effect – models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning does not reduce by just scaling the model size. In contrast, recent thinking models do not self-condition, and can also execute much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of task they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.
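The compounding effect mentioned in this abstract can be made concrete with a short calculation. The sketch assumes, purely for illustration, that every step succeeds independently with the same accuracy p, so a task of H steps succeeds with probability p^H; the abstract itself reports that per-step accuracy degrades over long tasks, so this is an idealised upper-level picture and the function name `horizon_length` is hypothetical.

```python
import math

def horizon_length(step_accuracy, success_rate=0.5):
    """Task length H at which p**H equals success_rate, i.e. H = ln(s) / ln(p).

    Assumes independent steps with constant accuracy p (an idealisation).
    """
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must lie strictly between 0 and 1")
    if not 0.0 < success_rate < 1.0:
        raise ValueError("success_rate must lie strictly between 0 and 1")
    return math.log(success_rate) / math.log(step_accuracy)

for p in (0.90, 0.99, 0.999):
    print(f"step accuracy {p:.3f} -> about {horizon_length(p):.1f} steps at 50% success")
```

Under this assumption, moving from 90% to 99% to 99.9% step accuracy lengthens the 50%-success horizon from roughly 6.6 to 69 to 693 steps, so each tenfold cut in the per-step error rate buys about ten times more steps.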
[AI-1] Feasibility-Guided Fair Adaptive Offline Reinforcement Learning for Medicaid Care Management
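[Quick Read]: This paper addresses the difficulty of achieving both fairness and safety in offline reinforcement learning (offline RL), especially in sensitive settings such as healthcare, where potential harm must be reduced while protected subgroups are treated equitably. The key to the solution is Feasibility-Guided Fair Adaptive Reinforcement Learning (FG-FARL), which calibrates a safety threshold for each group so that a chosen fairness target (coverage or harm) is equalized across groups, improving fairness metrics without a significant loss of policy value.
Link: https://arxiv.org/abs/2509.09655
Authors: Sanjay Basu,Sadiq Y. Patel,Parth Sheth,Bhairavi Muralidharan,Namrata Elamaran,Aakriti Kinra,Rajaie Batniji
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Applications (stat.AP)
Comments: 12 pages, 5 figures, 3 tables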
Abstract:We introduce Feasibility-Guided Fair Adaptive Reinforcement Learning (FG-FARL), an offline RL procedure that calibrates per-group safety thresholds to reduce harm while equalizing a chosen fairness target (coverage or harm) across protected subgroups. Using de-identified longitudinal trajectories from a Medicaid population health management program, we evaluate FG-FARL against behavior cloning (BC) and HACO (Hybrid Adaptive Conformal Offline RL; a global conformal safety baseline). We report off-policy value estimates with bootstrap 95% confidence intervals and subgroup disparity analyses with p-values. FG-FARL achieves comparable value to baselines while improving fairness metrics, demonstrating a practical path to safer and more equitable decision support.
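The idea of calibrating per-group thresholds to equalize coverage can be sketched with plain quantiles. This is not the FG-FARL procedure, which the abstract does not specify in detail; it is a minimal illustration under stated assumptions: a hypothetical scalar risk score per state-action pair, two synthetic groups, and helper names (`per_group_thresholds`, `coverage_by_group`) invented for this example.

```python
import numpy as np

def per_group_thresholds(risk, group, target_coverage):
    """Risk threshold per group at the target_coverage quantile (illustrative only)."""
    return {int(g): float(np.quantile(risk[group == g], target_coverage))
            for g in np.unique(group)}

def coverage_by_group(risk, group, thresholds):
    """Fraction of each group whose risk is at or below that group's threshold."""
    return {g: float(np.mean(risk[group == g] <= t)) for g, t in thresholds.items()}

# Synthetic data: group 1 has systematically higher risk scores than group 0.
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=2000)
risk = rng.normal(loc=np.where(group == 1, 0.6, 0.4), scale=0.1)

# One global threshold covers the two groups very unequally...
global_t = float(np.quantile(risk, 0.8))
global_cov = coverage_by_group(risk, group, {0: global_t, 1: global_t})

# ...while per-group thresholds equalize coverage at the chosen target.
group_cov = coverage_by_group(risk, group, per_group_thresholds(risk, group, 0.8))

print("global threshold coverage:", global_cov)
print("per-group threshold coverage:", group_cov)
```

The contrast between the two printed dictionaries is the point: a single global safety cut-off, like the global conformal baseline the abstract mentions, can leave one subgroup with far less coverage than another.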
[AI-2] Explaining Concept Drift through the Evolution of Group Counterfactuals KDD2025 ECML
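[Quick Read]: This paper addresses the performance degradation of machine learning models under concept drift in dynamic environments, and in particular the difficulty existing methods have in explaining how a model's decision logic changes over time. The key to the solution is a new analysis method based on group-based counterfactual explanations (GCEs): it tracks how the GCE cluster centroids and their associated counterfactual action vectors evolve before and after a drift, turning structural changes in the model's decision boundary into an interpretable proxy. This makes it possible to diagnose the root cause of a drift, for example distinguishing a spatial data shift from a re-labeling of concepts.
Link: https://arxiv.org/abs/2509.09616
Authors: Ignacy Stępka,Jerzy Stefanowski
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: TempXAI Workshop @ ECML PKDD 2025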
Abstract:Machine learning models in dynamic environments often suffer from concept drift, where changes in the data distribution degrade performance. While detecting this drift is a well-studied topic, explaining how and why the model’s decision-making logic changes still remains a significant challenge. In this paper, we introduce a novel methodology to explain concept drift by analyzing the temporal evolution of group-based counterfactual explanations (GCEs). Our approach tracks shifts in the GCEs’ cluster centroids and their associated counterfactual action vectors before and after a drift. These evolving GCEs act as an interpretable proxy, revealing structural changes in the model’s decision boundary and its underlying rationale. We operationalize this analysis within a three-layer framework that synergistically combines insights from the data layer (distributional shifts), the model layer (prediction disagreement), and our proposed explanation layer. We show that such holistic view allows for a more comprehensive diagnosis of drift, making it possible to distinguish between different root causes, such as a spatial data shift versus a re-labeling of concepts.
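Tracking GCE centroids and action vectors across a drift can be sketched as two simple measurements. The sketch assumes that clusters before and after the drift are already matched by index and that each cluster has one action vector; the function name `gce_shift` and the toy numbers are hypothetical, and the paper's actual metrics may differ.

```python
import numpy as np

def gce_shift(centroids_before, centroids_after, actions_before, actions_after):
    """Per-cluster centroid displacement and cosine similarity of action vectors."""
    displacement = np.linalg.norm(centroids_after - centroids_before, axis=1)
    dot = np.sum(actions_before * actions_after, axis=1)
    norms = np.linalg.norm(actions_before, axis=1) * np.linalg.norm(actions_after, axis=1)
    return displacement, dot / norms

# Toy example with two clusters in a 2D feature space.
centroids_before = np.array([[0.0, 0.0], [5.0, 5.0]])
centroids_after = np.array([[3.0, 4.0], [5.0, 5.0]])   # cluster 0 moved, cluster 1 did not
actions_before = np.array([[1.0, 0.0], [0.0, 1.0]])
actions_after = np.array([[1.0, 0.0], [0.0, -1.0]])    # cluster 1's action flipped

displacement, cosine = gce_shift(centroids_before, centroids_after,
                                 actions_before, actions_after)
print("centroid displacement:", displacement)
print("action cosine similarity:", cosine)
```

One illustrative reading, offered as an assumption rather than the authors' rule: a cluster that moves while keeping its action direction hints at a shift in where the data lies, whereas a stationary cluster whose action reverses hints at changed labels.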
[AI-3] LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
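[Quick Read]: This paper addresses the lack of evaluation of long-context understanding by large language models (LLMs) in complex software development. Existing code benchmarks focus on single-function completion or short-context tasks and cannot measure cross-file reasoning, maintenance of architectural consistency, or understanding of large codebases. The core of the solution is LoCoBench, a comprehensive benchmark for long-context LLMs with 8,000 systematically generated evaluation scenarios across 10 programming languages and context lengths from 10K to 1M tokens. It defines 8 task categories (such as architectural understanding, cross-file refactoring, and security analysis) and an evaluation framework of 17 metrics, 8 of them new, combined into a LoCoBench Score (LCBS). The evaluation reveals substantial capability gaps in current long-context models on complex software development.
Link: https://arxiv.org/abs/2509.09614
Authors: Jielin Qiu,Zuxin Liu,Zhiwei Liu,Rithesh Murthy,Jianguo Zhang,Haolin Chen,Shiyu Wang,Ming Zhu,Liangwei Yang,Juntao Tan,Zhepeng Cen,Cheng Qian,Shelby Heinecke,Weiran Yao,Silvio Savarese,Caiming Xiong,Huan Wang
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 53 pages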
Abstract:The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive benchmark specifically designed to evaluate long-context LLMs in realistic, complex software development scenarios. Unlike existing code evaluation benchmarks that focus on single-function completion or short-context tasks, LoCoBench addresses the critical evaluation gap for long-context capabilities that require understanding entire codebases, reasoning across multiple files, and maintaining architectural consistency across large-scale software systems. Our benchmark provides 8,000 evaluation scenarios systematically generated across 10 programming languages, with context lengths spanning 10K to 1M tokens, a 100x variation that enables precise assessment of long-context performance degradation in realistic software development settings. LoCoBench introduces 8 task categories that capture essential long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis. Through a 5-phase pipeline, we create diverse, high-quality scenarios that challenge LLMs to reason about complex codebases at unprecedented scale. We introduce a comprehensive evaluation framework with 17 metrics across 4 dimensions, including 8 new evaluation metrics, combined in a LoCoBench Score (LCBS). Our evaluation of state-of-the-art long-context models reveals substantial performance gaps, demonstrating that long-context understanding in complex software development represents a significant unsolved challenge that demands more attention. LoCoBench is released at: this https URL.
[AI-4] Boosting Embodied AI Agents through Perception-Generation Disaggregation and Asynchronous Pipeline Execution
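[Quick Read]: This paper addresses the problem that embodied AI systems operating in dynamic environments cannot reach the inference frequency they need for high-frequency perception and generation under traditional sequential computation. The key to the solution is Auras, a framework that disaggregates the perception and generation modules and applies controlled pipeline parallelism to obtain high throughput while maintaining accuracy. To counter the data staleness that parallelism introduces, Auras establishes a public context shared by perception and generation, which preserves the agent's inference accuracy. Experiments show an average throughput improvement of 2.54x at 102.7% of the original accuracy, overcoming the bottleneck of sequential computation.
Link: https://arxiv.org/abs/2509.09560
Authors: Shulai Zhang,Ao Xu,Quan Chen,Han Zhao,Weihao Cui,Ningxin Zheng,Haibin Lin,Xin Liu,Minyi Guo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: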
Abstract:Embodied AI systems operate in dynamic environments, requiring seamless integration of perception and generation modules to process high-frequency input and output demands. Traditional sequential computation patterns, while effective in ensuring accuracy, face significant limitations in achieving the necessary “thinking” frequency for real-world applications. In this work, we present Auras, an algorithm-system co-designed inference framework to optimize the inference frequency of embodied AI agents. Auras disaggregates the perception and generation and provides controlled pipeline parallelism for them to achieve high and stable throughput. Faced with the data staleness problem that appears when the parallelism is increased, Auras establishes a public context for perception and generation to share, thereby promising the accuracy of embodied agents. Experimental results show that Auras improves throughput by 2.54x on average while achieving 102.7% of the original accuracy, demonstrating its efficacy in overcoming the constraints of sequential computation and providing high throughput.
[AI-5] An improved educational competition optimizer with multi-covariance learning operators for global optimization problems
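[Quick Read]: This paper addresses the imbalance between exploration and exploitation in the Educational Competition Optimizer (ECO), which makes it prone to local optima, slow to converge, and unstable on complex optimization problems. The key to the solution is an improved optimizer, IECO-MCO (Improved Educational Competition Optimizer with Multi-Covariance Learning Operators), whose core innovation is three distinct covariance learning operators that balance global search against fine local search and suppress premature convergence of the population. On the CEC 2017 and CEC 2022 benchmark functions this improves convergence speed, stability, and the ability to escape local optima.
Link: https://arxiv.org/abs/2509.09552
Authors: Baoqi Zhao,Xiong Yang,Hoileong Lee,Bowen Dong
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: Submitted to Cluster Computing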
Abstract:The educational competition optimizer is a recently introduced metaheuristic algorithm inspired by human behavior, originating from the dynamics of educational competition within society. Nonetheless, ECO faces constraints due to an imbalance between exploitation and exploration, rendering it susceptible to local optima and demonstrating restricted effectiveness in addressing complex optimization problems. To address these limitations, this study presents an enhanced educational competition optimizer (IECO-MCO) utilizing multi-covariance learning operators. In IECO, three distinct covariance learning operators are introduced to improve the performance of ECO. Each operator effectively balances exploitation and exploration while preventing premature convergence of the population. The effectiveness of IECO is assessed through benchmark functions derived from the CEC 2017 and CEC 2022 test suites, and its performance is compared with various basic and improved algorithms across different categories. The results demonstrate that IECO-MCO surpasses the basic ECO and other competing algorithms in convergence speed, stability, and the capability to avoid local optima. Furthermore, statistical analyses, including the Friedman test, Kruskal-Wallis test, and Wilcoxon rank-sum test, are conducted to validate the superiority of IECO-MCO over the compared algorithms. Compared with the basic algorithm (improved algorithm), IECO-MCO achieved an average ranking of 2.213 (2.488) on the CEC2017 and CEC2022 test suites. Additionally, the practical applicability of the proposed IECO-MCO algorithm is verified by solving constrained optimization problems. The experimental outcomes demonstrate the superior performance of IECO-MCO in tackling intricate optimization problems, underscoring its robustness and practical effectiveness in real-world scenarios.
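A generic covariance learning step, of the kind this abstract alludes to, can be sketched as follows. The abstract does not define the three operators used in IECO-MCO, so this is only a common textbook-style form assumed for illustration: fit a Gaussian to the better half of the population and resample from it. The function name `covariance_learning_step`, the sphere objective, and all parameter values are hypothetical.

```python
import numpy as np

def covariance_learning_step(population, fitness, top_fraction=0.5, rng=None):
    """Resample a population from a Gaussian fitted to its best individuals.

    Assumes minimisation. A generic sketch, not the operators defined in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = max(2, int(len(population) * top_fraction))
    elite = population[np.argsort(fitness)[:k]]      # k individuals with lowest fitness
    mean = elite.mean(axis=0)
    cov = np.cov(elite, rowvar=False)                 # d x d covariance of the elite
    return rng.multivariate_normal(mean, cov, size=len(population))

# Toy run on the sphere function f(x) = sum(x_i^2) in three dimensions.
rng = np.random.default_rng(0)
population = rng.uniform(-5.0, 5.0, size=(200, 3))
fitness = np.sum(population ** 2, axis=1)

new_population = covariance_learning_step(population, fitness, rng=rng)
new_fitness = np.sum(new_population ** 2, axis=1)
print(f"mean fitness before: {fitness.mean():.2f}, after: {new_fitness.mean():.2f}")
```

Sampling from the elite's covariance keeps correlations between variables that the better solutions share, which is why operators of this family are often used to steer search while still injecting diversity.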
[AI-6] Compositional Concept Generalization with Variational Quantum Circuits
[Quick Read]: This paper addresses the weak compositional generalization of current vision-language models, that is, their difficulty in combining learned semantic components to handle unseen semantic structures. The key to the approach is to interpret compositional tensor-based representations in Hilbert space and train Variational Quantum Circuits (VQCs) to learn them on an image captioning task that requires compositional generalization. Two image encodings are used: a multi-hot encoding (MHE) on binary image vectors and an angle/amplitude encoding on image vectors taken from CLIP. The authors report good proof-of-concept results with noisy MHE encodings; results on CLIP image vectors were more mixed but still outperformed classical compositional models.
Link: https://arxiv.org/abs/2509.09541
Authors: Hala Hawashin,Mina Abbaszadeh,Nicholas Joseph,Beth Pearson,Martha Lewis,Mehrnoosh Sadrzadeh
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to: 2025 IEEE International Conference on Quantum Artificial Intelligence (QAI), Naples, Italy, Nov 2-5, 2025. This is the authors’ accepted manuscript (AAM). An IEEE copyright notice appears on page 1. The final published version will appear in IEEE Xplore; DOI to be added when available
Abstract:Compositional generalization is a key facet of human cognition, but lacking in current AI tools such as vision-language models. Previous work examined whether a compositional tensor-based sentence semantics can overcome the challenge, but led to negative results. We conjecture that the increased training efficiency of quantum models will improve performance in these tasks. We interpret the representations of compositional tensor-based models in Hilbert spaces and train Variational Quantum Circuits to learn these representations on an image captioning task requiring compositional generalization. We used two image encoding techniques: a multi-hot encoding (MHE) on binary image vectors and an angle/amplitude encoding on image vectors taken from the vision-language model CLIP. We achieve good proof-of-concept results using noisy MHE encodings. Performance on CLIP image vectors was more mixed, but still outperformed classical compositional models.
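The angle and amplitude encodings named in this abstract have standard textbook forms, sketched below as plain state vectors. The abstract does not give the circuits used, so the details are assumptions: amplitude encoding as L2 normalisation of a real vector, angle encoding as one RY rotation per feature applied to |0>, and the function names `amplitude_encode` and `angle_encode`.

```python
import numpy as np

def amplitude_encode(vector):
    """Amplitude encoding: normalise a real vector so it is a valid state vector.

    A length-2**n input corresponds to an n-qubit state; padding is not handled here.
    """
    vector = np.asarray(vector, dtype=float)
    norm = np.linalg.norm(vector)
    if norm == 0.0:
        raise ValueError("cannot encode the zero vector")
    return vector / norm

def angle_encode(angles):
    """Angle encoding: one qubit per feature, each in RY(theta)|0> = [cos(theta/2), sin(theta/2)]."""
    state = np.array([1.0])
    for theta in angles:
        qubit = np.array([np.cos(theta / 2.0), np.sin(theta / 2.0)])
        state = np.kron(state, qubit)   # tensor product builds the joint state
    return state

print(amplitude_encode([3.0, 4.0]))        # two amplitudes on one qubit
print(angle_encode([0.0, np.pi]))          # two qubits, four amplitudes
```

The trade-off visible even in this sketch is that amplitude encoding packs 2^n features into n qubits, while angle encoding uses one qubit per feature but needs only single-qubit rotations to prepare.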
[AI-7] A modified RIME algorithm with covariance learning and diversity enhancement for numerical optimization
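[Quick Read]: This paper addresses the rapid loss of population diversity in the RIME algorithm, a recently proposed physics-based metaheuristic, and its tendency to fall into local optima, which leave exploitation and exploration unbalanced. The key to the solution is a modified algorithm, MRIME-CD (Modified RIME with Covariance Learning and Diversity Enhancement), built on three strategies: 1) a covariance learning strategy in the soft-rime search stage, which uses the bootstrapping effect of dominant populations to increase diversity and counter over-exploitation; 2) an average bootstrapping strategy in the hard-rime puncture mechanism, which guides the search through the weighted position of the dominant populations to strengthen early global search and moderate premature convergence; 3) a new stagnation indicator, with stochastic covariance learning used to update stagnant individuals and improve escape from local optima. On the CEC2017 and CEC2022 test sets, MRIME-CD improves on basic RIME in solution accuracy, convergence speed, and stability.
Link: https://arxiv.org/abs/2509.09529
Authors: Shangqing Shi,Luoxiao Zhang,Yuchen Yin,Xiong Yang,Hoileong Lee
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: This is the author’s preprint of the article published in Cluster Computing (Springer): Shi, S., Zhang, L., Yin, Y. et al. A modified RIME algorithm with covariance learning and diversity enhancement for numerical optimization. Cluster Comput 28, 658 (2025). The final authenticated version is available online at SpringerLink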
Abstract:Metaheuristics are widely applied for their ability to provide more efficient solutions. The RIME algorithm is a recently proposed physical-based metaheuristic algorithm with certain advantages. However, it suffers from rapid loss of population diversity during optimization and is prone to fall into local optima, leading to unbalanced exploitation and exploration. To address the shortcomings of RIME, this paper proposes a modified RIME with covariance learning and diversity enhancement (MRIME-CD). The algorithm applies three strategies to improve the optimization capability. First, a covariance learning strategy is introduced in the soft-rime search stage to increase the population diversity and balance the over-exploitation ability of RIME through the bootstrapping effect of dominant populations. Second, in order to moderate the tendency of RIME population to approach the optimal individual in the early search stage, an average bootstrapping strategy is introduced into the hard-rime puncture mechanism, which guides the population search through the weighted position of the dominant populations, thus enhancing the global search ability of RIME in the early stage. Finally, a new stagnation indicator is proposed, and a stochastic covariance learning strategy is used to update the stagnant individuals in the population when the algorithm gets stagnant, thus enhancing the ability to jump out of the local optimal solution. The proposed MRIME-CD algorithm is subjected to a series of validations on the CEC2017 test set, the CEC2022 test set, and the experimental results are analyzed using the Friedman test, the Wilcoxon rank sum test, and the Kruskal Wallis test. The results show that MRIME-CD can effectively improve the performance of basic RIME and has obvious superiorities in terms of solution accuracy, convergence speed and stability.
[AI-8] Incorporating AI Incident Reporting into Telecommunications Law and Policy: Insights from India
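[Quick Read]: This paper addresses the novel risks that arise when artificial intelligence (AI) is integrated into telecommunications infrastructure, such as algorithmic bias and unpredictable system behavior, which fall outside traditional cybersecurity and data protection frameworks. Its core proposal is to define telecommunications AI incidents as a distinct category of risk and to adapt India's existing telecom governance through targeted policy recommendations: mandating reporting for high-risk AI failures, designating an existing government body as the nodal agency for managing AI incident data, and developing standardized AI incident reporting frameworks. These measures aim to close the regulatory gap, improve governance transparency, and strengthen long-term resilience, offering a replicable path for other countries that want to govern AI risks within existing sectoral frameworks.
Link: https://arxiv.org/abs/2509.09508
Authors: Avinash Agarwal,Manisha J. Nene
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 16 pages, 2 figures, 1 table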
Abstract:The integration of artificial intelligence (AI) into telecommunications infrastructure introduces novel risks, such as algorithmic bias and unpredictable system behavior, that fall outside the scope of traditional cybersecurity and data protection frameworks. This paper introduces a precise definition and a detailed typology of telecommunications AI incidents, establishing them as a distinct category of risk that extends beyond conventional cybersecurity and data protection breaches. It argues for their recognition as a distinct regulatory concern. Using India as a case study for jurisdictions that lack a horizontal AI law, the paper analyzes the country’s key digital regulations. The analysis reveals that India’s existing legal instruments, including the Telecommunications Act, 2023, the CERT-In Rules, and the Digital Personal Data Protection Act, 2023, focus on cybersecurity and data breaches, creating a significant regulatory gap for AI-specific operational incidents, such as performance degradation and algorithmic bias. The paper also examines structural barriers to disclosure and the limitations of existing AI incident repositories. Based on these findings, the paper proposes targeted policy recommendations centered on integrating AI incident reporting into India’s existing telecom governance. Key proposals include mandating reporting for high-risk AI failures, designating an existing government body as a nodal agency to manage incident data, and developing standardized reporting frameworks. These recommendations aim to enhance regulatory clarity and strengthen long-term resilience, offering a pragmatic and replicable blueprint for other nations seeking to govern AI risks within their existing sectoral frameworks.
[AI-9] SEDM: Scalable Self-Evolving Distributed Memory for Agents
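[Quick Read]: This paper addresses inefficient memory management in long-term multi-agent systems, where vast numbers of trajectories and historical interactions accumulate. Existing methods rely on vector retrieval and hierarchical storage and suffer from noise accumulation, uncontrolled memory expansion, and limited cross-domain generalization. The key to the solution is Self-Evolving Distributed Memory (SEDM), a verifiable and adaptive framework that combines verifiable write admission (based on reproducible replay), a self-scheduling memory controller (which dynamically ranks and consolidates entries by empirical utility), and cross-domain knowledge diffusion (which abstracts reusable insights to support transfer across heterogeneous tasks). This turns memory from a passive repository into an active, self-optimizing component, improving reasoning accuracy while reducing token overhead and strengthening multi-hop reasoning.
Link: https://arxiv.org/abs/2509.09498
Authors: Haoran Xu,Jiacong Hu,Ke Zhang,Lei Yu,Yuxin Tang,Xinyuan Song,Yiqun Duan,Lynn Ai,Bill Shi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: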
Abstract:Long-term multi-agent systems inevitably generate vast amounts of trajectories and historical interactions, which makes efficient memory management essential for both performance and scalability. Existing methods typically depend on vector retrieval and hierarchical storage, yet they are prone to noise accumulation, uncontrolled memory expansion, and limited generalization across domains. To address these challenges, we present SEDM, Self-Evolving Distributed Memory, a verifiable and adaptive framework that transforms memory from a passive repository into an active, self-optimizing component. SEDM integrates verifiable write admission based on reproducible replay, a self-scheduling memory controller that dynamically ranks and consolidates entries according to empirical utility, and cross-domain knowledge diffusion that abstracts reusable insights to support transfer across heterogeneous tasks. Evaluations on benchmark datasets demonstrate that SEDM improves reasoning accuracy while reducing token overhead compared with strong memory baselines, and further enables knowledge distilled from fact verification to enhance multi-hop reasoning. The results highlight SEDM as a scalable and sustainable memory mechanism for open-ended multi-agent collaboration. The code will be released in the later stage of this project.
[AI-10] Prompt Pirates Need a Map: Stealing Seeds helps Stealing Prompts
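[Quick Read]: This paper addresses prompt theft against diffusion models in text-to-image generation, in which an attacker analyzes a generated image to recover the text prompt and random seed used to produce it, threatening intellectual property and privacy. The key to the approach is to identify and exploit a noise-generation vulnerability (CWE-339) present in major image-generation frameworks: PyTorch restricts seed values to a range of 2^32 when generating the initial random noise on CPUs, so seeds can be brute-forced efficiently. Building on this, the authors present SeedSnitch, a seed-recovery tool that takes about 140 minutes per seed, and PromptPirate, a genetic-algorithm method that explicitly uses the recovered seed for prompt optimization and improves LPIPS similarity by 8-11% over state-of-the-art methods (PromptStealer, P2HP, and CLIP-Interrogator). The paper also proposes simple and effective countermeasures that block this attack path.
Link: https://arxiv.org/abs/2509.09488
Authors: Felix Mächtle,Ashwath Shetty,Jonas Sander,Nils Loose,Sören Pirk,Thomas Eisenbarth
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: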
Abstract:Diffusion models have significantly advanced text-to-image generation, enabling the creation of highly realistic images conditioned on textual prompts and seeds. Given the considerable intellectual and economic value embedded in such prompts, prompt theft poses a critical security and privacy concern. In this paper, we investigate prompt-stealing attacks targeting diffusion models. We reveal that numerical optimization-based prompt recovery methods are fundamentally limited as they do not account for the initial random noise used during image generation. We identify and exploit a noise-generation vulnerability (CWE-339), prevalent in major image-generation frameworks, originating from PyTorch’s restriction of seed values to a range of 2^32 when generating the initial random noise on CPUs. Through a large-scale empirical analysis conducted on images shared via the popular platform CivitAI, we demonstrate that approximately 95% of these images’ seed values can be effectively brute-forced in 140 minutes per seed using our seed-recovery tool, SeedSnitch. Leveraging the recovered seed, we propose PromptPirate, a genetic algorithm-based optimization method explicitly designed for prompt stealing. PromptPirate surpasses state-of-the-art methods, i.e., PromptStealer, P2HP, and CLIP-Interrogator, achieving an 8-11% improvement in LPIPS similarity. Furthermore, we introduce straightforward and effective countermeasures that render seed stealing, and thus optimization-based prompt stealing, ineffective. We have disclosed our findings responsibly and initiated coordinated mitigation efforts with the developers to address this critical vulnerability.
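The figures quoted in this abstract imply a search rate that is easy to work out. The calculation below is back-of-the-envelope arithmetic only and rests on an assumption the abstract does not state: that 140 minutes is the time to enumerate the entire 2^32 seed space. The 64-bit comparison is included purely to show how the cost scales with seed width and is not the countermeasure the paper proposes; both function names are hypothetical.

```python
def seeds_per_second(seed_bits, minutes):
    """Seeds checked per second if the whole space is enumerated in the given time."""
    return 2 ** seed_bits / (minutes * 60)

def years_to_enumerate(seed_bits, rate):
    """Years needed to enumerate a seed space at a fixed rate (seeds per second)."""
    seconds_per_year = 365.25 * 24 * 3600
    return 2 ** seed_bits / rate / seconds_per_year

rate = seeds_per_second(32, 140)
print(f"2^32 = {2 ** 32:,} seeds")
print(f"implied rate: {rate:,.0f} seeds per second")
print(f"64-bit space at the same rate: {years_to_enumerate(64, rate):,.0f} years")
```

Under that reading, a 32-bit space is exhausted at roughly half a million candidates per second, while the same rate would need on the order of a million years for a 64-bit space, which illustrates why the width of the seed range matters so much here.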
[AI-11] Inteligencia Artificial jurídica y el desafío de la veracidad: análisis de alucinaciones optimización de RAG y principios para una integración responsable
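[Quick Read]: This paper addresses “hallucinations” (the generation of false information) by large language models (LLMs) applied to law: models may output legal content that looks plausible but is inaccurate or misleading, a serious risk for professional practice. It notes that Retrieval-Augmented Generation (RAG) mitigates the problem only partially. The key to the solution therefore lies not in incrementally improving generative models but in moving to a “consultative AI” paradigm that prioritizes veracity and traceability, positions AI as a tool that supports professional judgment rather than replacing the decision-maker, and keeps human oversight as an irreplaceable element.
Link: https://arxiv.org/abs/2509.09467
Authors: Alex Dantart
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: in Spanish and English languages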
Abstract:This technical report analyzes the challenge of “hallucinations” (false information) in LLMs applied to law. It examines their causes, manifestations, and the effectiveness of the RAG mitigation strategy, highlighting its limitations and proposing holistic optimizations. The paper explores the ethical and regulatory implications, emphasizing human oversight as an irreplaceable role. It concludes that the solution lies not in incrementally improving generative models, but in adopting a “consultative” AI paradigm that prioritizes veracity and traceability, acting as a tool to amplify, not replace, professional judgment. – Este informe técnico analiza el desafío de las “alucinaciones” (información falsa) en los LLMs aplicados al derecho. Se examinan sus causas, manifestaciones y la efectividad de la estrategia de mitigación RAG, exponiendo sus limitaciones y proponiendo optimizaciones holísticas. Se exploran las implicaciones éticas y regulatorias, enfatizando la supervisión humana como un rol insustituible. El documento concluye que la solución no reside en mejorar incrementalmente los modelos generativos, sino en adoptar un paradigma de IA “consultiva” que priorice la veracidad y la trazabilidad, actuando como una herramienta para amplificar, y no sustituir, el juicio profesional.
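[AI-12] TORSO: Template-Oriented Reasoning Towards General Tasks
[Quick Read]: This paper addresses the reliance of current large language models (LLMs) on manually designed few-shot prompts to emulate human reasoning on complex tasks, which limits use of the model's inherent reasoning ability, produces inconsistencies across tasks, and is costly to construct. The key to the solution is Template-Oriented Reasoning (TORSO), which uses a general template to elicit the model's internal reasoning ability, so that it generates proper responses with reasonable rationales across diverse tasks without manually crafted few-shot examples.
Link: https://arxiv.org/abs/2509.09448
Authors: Minhyuk Kim,Seungyoon Lee,Heuiseok Lim
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures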
Abstract:The approaches that guide Large Language Models (LLMs) to emulate human reasoning during response generation have emerged as an effective method for enabling them to solve complex problems in a step-by-step manner, thereby achieving superior performance. However, most existing approaches using few-shot prompts to generate responses heavily depend on the provided examples, limiting the utilization of the model’s inherent reasoning capabilities. Moreover, constructing task-specific few-shot prompts is often costly and may lead to inconsistencies across different tasks. In this work, we introduce Template-Oriented Reasoning (TORSO), which elicits the model to utilize internal reasoning abilities to generate proper responses across various tasks without the need for manually crafted few-shot examples. Our experimental results demonstrate that TORSO achieves strong performance on diverse LLMs benchmarks with reasonable rationales.
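[AI-13] ENSI: Efficient Non-Interactive Secure Inference for Large Language Models
[Quick Read]: This paper addresses the high computational complexity and poor practical usability of integrating cryptographic protocols into Large Language Models (LLMs) for privacy-preserving inference. The core challenge is a significant mismatch between conventional Homomorphic Encryption (HE) schemes and the massive parameter scale and complex architecture of LLMs: key operations such as matrix multiplication and softmax are extremely inefficient under encryption, and frequent ciphertext refreshing (bootstrapping) worsens the bottleneck. The key to the solution is ENSI, a non-interactive secure inference framework that co-designs the cryptographic protocol with a lightweight LLM architecture. First, an optimized encoding strategy integrates the CKKS HE scheme with the lightweight BitNet model, greatly reducing the cost of encrypted matrix multiplication. Second, a sigmoid attention mechanism replaces the conventional softmax and supports attention under HE without retraining. Third, the bootstrapping operation is embedded in the RMSNorm layer, which refreshes ciphertexts while reducing the proportion of bootstrapping to 1%. Experiments show that, compared with the state-of-the-art method, ENSI achieves roughly 8x faster matrix multiplication and a 2.6x speedup in softmax inference on CPU, improving the practicality of secure LLM inference.
Link: https://arxiv.org/abs/2509.09424
Authors: Zhiyu He,Maojiang Wang,Xinwen Gao,Yuchuan Luo,Lin Liu,Shaojing Fu
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: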
Abstract:Secure inference enables privacy-preserving machine learning by leveraging cryptographic protocols that support computations on sensitive user data without exposing it. However, integrating cryptographic protocols with large language models (LLMs) presents significant challenges, as the inherent complexity of these protocols, together with LLMs’ massive parameter scale and sophisticated architectures, severely limits practical usability. In this work, we propose ENSI, a novel non-interactive secure inference framework for LLMs, based on the principle of co-designing the cryptographic protocols and LLM architecture. ENSI employs an optimized encoding strategy that seamlessly integrates CKKS scheme with a lightweight LLM variant, BitNet, significantly reducing the computational complexity of encrypted matrix multiplications. In response to the prohibitive computational demands of softmax under homomorphic encryption (HE), we pioneer the integration of the sigmoid attention mechanism with HE as a seamless, retraining-free alternative. Furthermore, by embedding the Bootstrapping operation within the RMSNorm process, we efficiently refresh ciphertexts while markedly decreasing the frequency of costly bootstrapping invocations. Experimental evaluations demonstrate that ENSI achieves approximately an 8x acceleration in matrix multiplications and a 2.6x speedup in softmax inference on CPU compared to the state-of-the-art method, with the proportion of bootstrapping reduced to just 1%.
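To make the softmax-versus-sigmoid point concrete, the following plaintext sketch contrasts the two ways of turning attention scores into weights. It is only an illustration, not ENSI's encrypted implementation: the function names are hypothetical, and the `-log(n)` bias is an assumption borrowed from the general sigmoid-attention literature rather than something stated in the abstract. Under HE the sigmoid itself would presumably still need a polynomial approximation.

```python
import math

def softmax_attention_weights(scores):
    """Row-wise softmax: needs a max, a sum, and a division across the whole row."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_attention_weights(scores, bias=0.0):
    """Element-wise sigmoid: each weight depends only on its own score."""
    return [1.0 / (1.0 + math.exp(-(s + bias))) for s in scores]

def attend(weights, values):
    """Weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Toy attention scores for one query over three keys (made-up numbers).
scores = [2.0, 0.5, -1.0]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

soft = softmax_attention_weights(scores)
# Assumed bias of -log(n) keeps the sigmoid weights on a scale similar to softmax.
sig = sigmoid_attention_weights(scores, bias=-math.log(len(scores)))
context = attend(sig, values)
```

The softmax weights sum to one because of the row-wise normalization, whereas the sigmoid weights are computed independently per score, which is the property that avoids the cross-row maximum and division.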
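[AI-14] We're Still Doing It (All) Wrong: Recommender Systems Fifteen Years Later
[Quick Read]: This paper addresses long-standing methodological flaws and value-orientation biases in recommender systems research. These problems have not been fundamentally corrected since Xavier Amatriain's critique in 2011, and have instead become more hidden and systemic as technical complexity has grown. The key to the solution is a paradigm shift: beyond better evaluation metrics or tooling, the field needs to fundamentally redefine what recommender systems research is for, whom it serves, and how knowledge is produced and validated. The paper emphasizes epistemic humility, human impact, and sustainable practice, supported by community-led initiatives such as reproducibility studies, value-sensitive design, and participatory methods.
Link: https://arxiv.org/abs/2509.09414
Authors: Alan Said,Maria Soledad Pera,Michael D. Ekstrand
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was accepted for publication in the Beyond Algorithms: Reclaiming the Interdisciplinary Roots of Recommender Systems Workshop (BEYOND 2025), September 26th, 2025, co-located with the 19th ACM Recommender Systems Conference, Prague, Czech Republic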
Abstract:In 2011, Xavier Amatriain sounded the alarm: recommender systems research was “doing it all wrong” [1]. His critique, rooted in statistical misinterpretation and methodological shortcuts, remains as relevant today as it was then. But rather than correcting course, we added new layers of sophistication on top of the same broken foundations. This paper revisits Amatriain’s diagnosis and argues that many of the conceptual, epistemological, and infrastructural failures he identified still persist, in more subtle or systemic forms. Drawing on recent work in reproducibility, evaluation methodology, environmental impact, and participatory design, we showcase how the field’s accelerating complexity has outpaced its introspection. We highlight ongoing community-led initiatives that attempt to shift the paradigm, including workshops, evaluation frameworks, and calls for value-sensitive and participatory research. At the same time, we contend that meaningful change will require not only new metrics or better tooling, but a fundamental reframing of what recommender systems research is for, who it serves, and how knowledge is produced and validated. Our call is not just for technical reform, but for a recommender systems research agenda grounded in epistemic humility, human impact, and sustainable practice.
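[AI-15] MetaLLMiX: An XAI Aided LLM-Meta-learning Based Approach for Hyper-parameters Optimization
[Quick Read]: This paper addresses the inefficiency, reliance on expert experience, and heavy compute cost of model and hyperparameter selection in deep learning. Traditional AutoML automates part of the process, but existing LLM-based methods still suffer from costly trial and error, expensive API calls, and limited interpretability and generalizability. The key to the solution is the MetaLLMiX framework, which combines meta-learning, Explainable AI (XAI), and efficient LLM reasoning: it builds a meta-knowledge base from historical experiment results and SHAP feature-importance analysis to give zero-shot hyperparameter recommendations, and it uses an LLM-as-judge evaluation to control output format, accuracy, and completeness. Experiments on 8 medical imaging datasets show a large reduction in computational cost (response time reduced by 99.6-99.9%), optimal results on 5 tasks, and training speedups of up to 15.7x, while accuracy stays within 5% of the best-performing baselines.
Link: https://arxiv.org/abs/2509.09387
Authors: Mohammed Tiouti,Mohamed Bal-Ghaoui
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: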
Abstract:Effective model and hyperparameter selection remains a major challenge in deep learning, often requiring extensive expertise and computation. While AutoML and large language models (LLMs) promise automation, current LLM-based approaches rely on trial and error and expensive APIs, which provide limited interpretability and generalizability. We propose MetaLLMiX, a zero-shot hyperparameter optimization framework combining meta-learning, explainable AI, and efficient LLM reasoning. By leveraging historical experiment outcomes with SHAP explanations, MetaLLMiX recommends optimal hyperparameters and pretrained models without additional trials. We further employ an LLM-as-judge evaluation to control output format, accuracy, and completeness. Experiments on eight medical imaging datasets using nine open-source lightweight LLMs show that MetaLLMiX achieves competitive or superior performance to traditional HPO methods while drastically reducing computational cost. Our local deployment outperforms prior API-based approaches, achieving optimal results on 5 of 8 tasks, response time reductions of 99.6-99.9%, and the fastest training times on 6 datasets (2.4-15.7x faster), maintaining accuracy within 1-5% of best-performing baselines.
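[AI-16] Robust Non-Linear Correlations via Polynomial Regression
[Quick Read]: This paper addresses the bias-variance trade-off that the inherent uncomputability of the Hirschfeld-Gebelein-Rényi (HGR) correlation coefficient imposes in practice, a trade-off that can undermine the robustness of HGR-based algorithms in real-world settings. The key to the solution is a new computational approach based on user-configurable polynomial kernels. It remains nearly as effective while being markedly more robust and deterministic, which gives a more reliable way to use HGR as a loss regularizer in constrained machine learning frameworks.
Link: https://arxiv.org/abs/2509.09380
Authors: Luca Giuliani,Michele Lombardi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: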
Abstract:The Hirschfeld-Gebelein-Rényi (HGR) correlation coefficient is an extension of Pearson’s correlation that is not limited to linear correlations, with potential applications in algorithmic fairness, scientific analysis, and causal discovery. Recently, novel algorithms to estimate HGR in a differentiable manner have been proposed to facilitate its use as a loss regularizer in constrained machine learning applications. However, the inherent uncomputability of HGR requires a bias-variance trade-off, which can possibly compromise the robustness of the proposed methods, hence raising technical concerns if applied in real-world scenarios. We introduce a novel computational approach for HGR that relies on user-configurable polynomial kernels, offering greater robustness compared to previous methods and featuring a faster yet almost equally effective restriction. Our approach provides significant advantages in terms of robustness and determinism, making it a more reliable option for real-world applications. Moreover, we present a brief experimental analysis to validate the applicability of our approach within a constrained machine learning framework, showing that its computation yields an insightful subgradient that can serve as a loss regularizer.
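For orientation, the standard definition of HGR and one way a polynomial restriction could be written are sketched below. The first line is the textbook definition (the supremum ranges over measurable functions with finite, nonzero variance). The polynomial parametrization, the degree symbols h and k, and the coefficient names are assumptions made for illustration; the abstract does not give the authors' exact formulation.

```latex
\mathrm{HGR}(X, Y) = \sup_{f,\, g} \rho\bigl(f(X),\, g(Y)\bigr),
\qquad
\rho(U, V) = \frac{\operatorname{Cov}(U, V)}{\sqrt{\operatorname{Var}(U)\,\operatorname{Var}(V)}}

% Assumed polynomial restriction with user-chosen degrees h and k:
f_{\alpha}(x) = \sum_{i=1}^{h} \alpha_i x^{i},
\qquad
g_{\beta}(y) = \sum_{j=1}^{k} \beta_j y^{j},
\qquad
\widehat{\mathrm{HGR}}_{h,k}(X, Y) = \max_{\alpha,\, \beta} \rho\bigl(f_{\alpha}(X),\, g_{\beta}(Y)\bigr)
```

With h = k = 1 this restricted quantity reduces to the absolute value of Pearson's correlation, which is consistent with the abstract's description of HGR as an extension of Pearson's coefficient.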
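[AI-17] Curriculum-Based Multi-Tier Semantic Exploration via Deep Reinforcement Learning
[Quick Read]: This paper addresses efficient semantic exploration by autonomous robots in complex, unknown environments. In traditional Reinforcement Learning (RL) methods the agent's policy model has limited capacity, so it struggles to combine efficient exploration with semantic understanding and often needs human intervention for semantic-level decisions. The key to the solution is a Deep Reinforcement Learning (DRL) architecture for resource-efficient semantic exploration. Its core innovation is to integrate the common-sense reasoning of a Vision-Language Model (VLM) through a layered reward function and to model the VLM query as a dedicated action, so that the agent requests external semantic information only when necessary and saves computational resources. This is combined with a curriculum learning strategy that guides stable training across levels of complexity. The result is a significantly higher object discovery rate, a learned ability to navigate toward semantically rich regions, and a strategic sense of when to request external environmental information.
Link: https://arxiv.org/abs/2509.09356
Authors: Abdel Hakim Drid,Vincenzo Suriani,Daniele Nardi,Abderrezzak Debilou
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: The 19th International Conference on Intelligent Autonomous Systems (IAS 19), 2025, Genoa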
Abstract:Navigating and understanding complex and unknown environments autonomously demands more than just basic perception and movement from embodied agents. Truly effective exploration requires agents to possess higher-level cognitive abilities, the ability to reason about their surroundings, and make more informed decisions regarding exploration strategies. However, traditional RL approaches struggle to balance efficient exploration and semantic understanding due to limited cognitive capabilities embedded in the small policies for the agents, leading often to human drivers when dealing with semantic exploration. In this paper, we address this challenge by presenting a novel Deep Reinforcement Learning (DRL) architecture that is specifically designed for resource efficient semantic exploration. A key methodological contribution is the integration of a Vision-Language Model (VLM) common-sense through a layered reward function. The VLM query is modeled as a dedicated action, allowing the agent to strategically query the VLM only when deemed necessary for gaining external guidance, thereby conserving resources. This mechanism is combined with a curriculum learning strategy designed to guide learning at different levels of complexity to ensure robust and stable learning. Our experimental evaluation results convincingly demonstrate that our agent achieves significantly enhanced object discovery rates and develops a learned capability to effectively navigate towards semantically rich regions. Furthermore, it also shows a strategic mastery of when to prompt for external environmental information. By demonstrating a practical and scalable method for embedding common-sense semantic reasoning with autonomous agents, this research provides a novel approach to pursuing a fully intelligent and self-guided exploration in robotics.
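[AI-18] MoSE: Unveiling Structural Patterns in Graphs via Mixture of Subgraph Experts
[Quick Read]: This paper addresses the limited ability of Graph Neural Networks (GNNs) to capture high-order subgraph patterns, and their resulting limited structural expressiveness, which stems from their reliance on local pairwise message passing. Existing methods try to improve expressiveness with random walk kernels, but they mainly target graph-level tasks, and their fixed kernel configurations limit adaptation to diverse subgraph structures. The proposed solution is the Mixture of Subgraph Experts (MoSE) framework. Its key idea is to extract informative subgraphs via anonymous walks and to route them dynamically to specialized expert modules based on structural semantics, which allows flexible capture of diverse subgraph patterns and interpretable representation learning. A theoretical analysis within the Subgraph Weisfeiler-Lehman (SWL) test further shows that it is more expressive than SWL.
Link: https://arxiv.org/abs/2509.09337
Authors: Junda Ye,Zhongbao Zhang,Li Sun,Siqiang Luo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 11 figures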
Abstract:While graph neural networks (GNNs) have achieved great success in learning from graph-structured data, their reliance on local, pairwise message passing restricts their ability to capture complex, high-order subgraph patterns, leading to insufficient structural expressiveness. Recent efforts have attempted to enhance structural expressiveness by integrating random walk kernels into GNNs. However, these methods are inherently designed for graph-level tasks, which limits their applicability to other downstream tasks such as node classification. Moreover, their fixed kernel configurations hinder the model’s flexibility in capturing diverse subgraph structures. To address these limitations, this paper proposes a novel Mixture of Subgraph Experts (MoSE) framework for flexible and expressive subgraph-based representation learning across diverse graph tasks. Specifically, MoSE extracts informative subgraphs via anonymous walks and dynamically routes them to specialized experts based on structural semantics, enabling the model to capture diverse subgraph patterns with improved flexibility and interpretability. We further provide a theoretical analysis of MoSE’s expressivity within the Subgraph Weisfeiler-Lehman (SWL) Test, proving that it is more powerful than SWL. Extensive experiments, together with visualizations of learned subgraph experts, demonstrate that MoSE not only outperforms competitive baselines but also provides interpretable insights into structural patterns learned by the model.
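The anonymous walks mentioned above can be illustrated with a small sketch. An anonymous walk replaces each node in a random walk by the index of its first appearance, so that walks with the same structural shape map to the same pattern regardless of node identities. The toy graph, the walk length, and the helper names are assumptions for illustration; this shows only the walk anonymization step, not MoSE's routing or expert modules, and it uses 0-based indices where some papers use 1-based ones.

```python
import random
from collections import Counter

def anonymous_walk(walk):
    """Map a node sequence to first-occurrence indices, e.g. [7, 3, 7, 9] -> (0, 1, 0, 2)."""
    first_seen = {}
    pattern = []
    for node in walk:
        if node not in first_seen:
            first_seen[node] = len(first_seen)
        pattern.append(first_seen[node])
    return tuple(pattern)

def sample_walk(adj, start, length, rng):
    """Uniform random walk taking `length` steps from `start` on an adjacency dict."""
    walk = [start]
    for _ in range(length):
        walk.append(rng.choice(adj[walk[-1]]))
    return walk

# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(0)

# Empirical distribution of anonymous-walk patterns rooted at node 0.
counts = Counter(anonymous_walk(sample_walk(adj, 0, 3, rng)) for _ in range(200))
```

The resulting pattern counts act as a structural signature of the neighborhood around the start node, which is the kind of subgraph descriptor that a routing mechanism could then dispatch to different experts.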
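[AI-19] Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization
[Quick Read]: This paper addresses the limitations of current benchmarks for evaluating agents based on Large Language Models (LLMs) on end-to-end Machine Learning (ML) tasks: incomplete task coverage, insufficient domain diversity, coarse difficulty modeling, and insufficiently rigorous evaluation criteria. The key to the solution is TAM Bench, a structured, diverse, and realistic benchmark. Its core innovations are: (1) a task acquisition system based on browser automation and LLMs that automatically collects and structures ML challenges from platforms such as Kaggle and AIcrowd; (2) a leaderboard-driven difficulty modeling mechanism that quantifies task complexity from participant counts and score distributions; (3) a multi-dimensional evaluation framework covering performance, format compliance, constraint adherence, and task generalization. The benchmark contains three subsets (Lite, Medium, Full), of which the Lite version is suited to daily benchmarking and comparative studies.
Link: https://arxiv.org/abs/2509.09321
Authors: Hangyi Jia,Yuxi Qian,Hanwen Tong,Xinhui Wu,Lin Chen,Feng Wei
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: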
Abstract:Recent advances in large language models (LLMs) have enabled the emergence of general-purpose agents for automating end-to-end machine learning (ML) workflows, including data analysis, feature engineering, model training, and competition solving. However, existing benchmarks remain limited in task coverage, domain diversity, difficulty modeling, and evaluation rigor, failing to capture the full capabilities of such agents in realistic settings. We present TAM Bench, a diverse, realistic, and structured benchmark for evaluating LLM-based agents on end-to-end ML tasks. TAM Bench features three key innovations: (1) A browser automation and LLM-based task acquisition system that automatically collects and structures ML challenges from platforms such as Kaggle, AIcrowd, and Biendata, spanning multiple task types and data modalities (e.g., tabular, text, image, graph, audio); (2) A leaderboard-driven difficulty modeling mechanism that estimates task complexity using participant counts and score dispersion, enabling scalable and objective task calibration; (3) A multi-dimensional evaluation framework incorporating performance, format compliance, constraint adherence, and task generalization. Based on 150 curated AutoML tasks, we construct three benchmark subsets of different sizes – Lite, Medium, and Full – designed for varying evaluation scenarios. The Lite version, with 18 tasks and balanced coverage across modalities and difficulty levels, serves as a practical testbed for daily benchmarking and comparative studies.
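[AI-20] Measuring Implicit Spatial Coordination in Teams: Effects on Collective Intelligence and Performance
[Quick Read]: This paper addresses how teams collaborate effectively through implicit spatial coordination in physical-space settings that lack explicit communication (such as firefighting, military, and emergency response). The core challenge is that team members must infer one another's intentions from movement patterns and adjust their own behavior in a dynamic environment to reach the goal. The key to the solution is to identify and quantify how three dimensions of spatial coordination (exploration diversity, movement specialization, and adaptive spatial proximity) affect team performance. Movement specialization significantly and positively predicts performance, while adaptive spatial proximity shows a marginal inverted U-shaped relationship, suggesting that moderate levels of adaptation are optimal. In addition, the temporal dynamics of these metrics distinguish high- from low-performing teams, which provides a behavior-based foundation for designing training and AI-assisted team support systems.
Link: https://arxiv.org/abs/2509.09314
Authors: Thuy Ngoc Nguyen,Anita Williams Woolley,Cleotilde Gonzalez
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: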
Abstract:Coordinated teamwork is essential in fast-paced decision-making environments that require dynamic adaptation, often without an opportunity for explicit communication. Although implicit coordination has been extensively considered in the existing literature, the majority of work has focused on co-located, synchronous teamwork (such as sports teams) or, in distributed teams, primarily on coordination of knowledge work. However, many teams (firefighters, military, law enforcement, emergency response) must coordinate their movements in physical space without the benefit of visual cues or extensive explicit communication. This paper investigates how three dimensions of spatial coordination, namely exploration diversity, movement specialization, and adaptive spatial proximity, influence team performance in a collaborative online search and rescue task where explicit communication is restricted and team members rely on movement patterns to infer others’ intentions and coordinate actions. Our metrics capture the relational aspects of teamwork by measuring spatial proximity, distribution patterns, and alignment of movements within shared environments. We analyze data from 34 four-person teams (136 participants) assigned to specialized roles in a search and rescue task. Results show that spatial specialization positively predicts performance, while adaptive spatial proximity exhibits a marginal inverted U-shaped relationship, suggesting moderate levels of adaptation are optimal. Furthermore, the temporal dynamics of these metrics differentiate high- from low-performing teams over time. These findings provide insights into implicit spatial coordination in role-based teamwork and highlight the importance of balanced adaptive strategies, with implications for training and AI-assisted team support systems.
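[AI-21] Explaining Tournament Solutions with Minimal Supports
[Quick Read]: This paper addresses how to provide certified explanations for why a candidate is among the winners under a tournament rule, that is, how to answer the core Explainable AI question "why did this candidate win?". The key to the solution is the notion of minimal supports: minimal sub-tournaments containing the candidate in which the candidate is guaranteed to win however the remaining matches are completed (a necessary winner). Using this notion, the paper determines the size of the smallest minimal supports for several tournament solutions (the top cycle, the uncovered set, the Copeland rule, the Borda rule, the maximin rule, and the weighted uncovered set) and gives polynomial-time algorithms to compute them for every rule except the weighted uncovered set, for which the problem is NP-complete. The approach yields compact, certified, and intuitive explanations, improving the transparency and trustworthiness of the decision process.
Link: https://arxiv.org/abs/2509.09312
Authors: Clément Contet,Umberto Grandi,Jérôme Mengin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: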
Abstract:Tournaments are widely used models to represent pairwise dominance between candidates, alternatives, or teams. We study the problem of providing certified explanations for why a candidate appears among the winners under various tournament rules. To this end, we identify minimal supports, minimal sub-tournaments in which the candidate is guaranteed to win regardless of how the rest of the tournament is completed (that is, the candidate is a necessary winner of the sub-tournament). This notion corresponds to an abductive explanation for the question, “Why does the winner win the tournament?”, a central concept in formal explainable AI. We focus on common tournament solutions: the top cycle, the uncovered set, the Copeland rule, the Borda rule, the maximin rule, and the weighted uncovered set. For each rule we determine the size of the smallest minimal supports, and we present polynomial-time algorithms to compute them for all but the weighted uncovered set, for which the problem is NP-complete. Finally, we show how minimal supports can serve to produce compact, certified, and intuitive explanations.
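The necessary-winner condition behind a support can be checked directly for the Copeland rule, as in the sketch below. It assumes the Copeland score is the number of matches won and that ties for the top score count as winning (co-winner semantics); both are assumptions for illustration, and the code is not the authors' algorithm for finding minimal supports. The function name and the example tournament are hypothetical.

```python
def is_necessary_copeland_winner(candidates, beats, c):
    """Check whether c is a Copeland co-winner in every completion of a partial tournament.

    beats: set of (winner, loser) pairs for matches already decided;
    every other pair of candidates is an open match.
    """
    def known_wins(x):
        return sum(1 for (w, _) in beats if w == x)

    def open_matches(x):
        decided = sum(1 for (w, l) in beats if x in (w, l))
        return (len(candidates) - 1) - decided

    # Worst case for c: c loses all of its open matches ...
    worst_c = known_wins(c)
    # ... while each rival d, considered separately, wins all of its open matches.
    return all(worst_c >= known_wins(d) + open_matches(d)
               for d in candidates if d != c)

candidates = ["a", "b", "c", "d"]
support = {("a", "b"), ("a", "c"), ("a", "d")}

# With all three edges, "a" wins no matter how b, c, and d play each other.
guaranteed = is_necessary_copeland_winner(candidates, support, "a")

# Dropping any single edge breaks the guarantee, so no proper subset works here.
still_guaranteed_without_one = [
    is_necessary_copeland_winner(candidates, support - {edge}, "a")
    for edge in support
]
```

In this toy case the three edges certify the win of "a" on their own, which matches the intuition of a compact explanation: the listed results suffice, and every one of them is needed.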
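[AI-22] LightAgent: Production-level Open-source Agentic AI Framework
[Quick Read]: This paper addresses the trade-off between flexibility and simplicity in deploying Multi-agent Systems (MAS), and the lack of versatile, robust, and efficient platform support. The key to the solution is LightAgent, a lightweight yet powerful agentic framework that integrates core capabilities such as Memory (mem0), tool invocation, and Tree of Thought (ToT) while keeping a minimal architecture, so that development and deployment complexity drops significantly without sacrificing functional completeness.
Link: https://arxiv.org/abs/2509.09292
Authors: Weige Cai,Tong Zhu,Jinyi Niu,Ruiqi Hu,Lingyao Li,Tenglong Wang,Xiaowu Dai,Weining Shen,Liwen Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: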
Abstract:With the rapid advancement of large language models (LLMs), Multi-agent Systems (MAS) have achieved significant progress in various application scenarios. However, substantial challenges remain in designing versatile, robust, and efficient platforms for agent deployment. To address these limitations, we propose LightAgent, a lightweight yet powerful agentic framework, effectively resolving the trade-off between flexibility and simplicity found in existing frameworks. LightAgent integrates core functionalities such as Memory (mem0), Tools, and Tree of Thought (ToT), while maintaining an extremely lightweight structure. As a fully open-source solution, it seamlessly integrates with mainstream chat platforms, enabling developers to easily build self-learning agents. We have released LightAgent at this https URL
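[AI-23] Fusing Knowledge and Language: A Comparative Study of Knowledge Graph-Based Question Answering with LLMs
[Quick Read]: This paper addresses the difficulty that traditional Retrieval Augmented Generation (RAG) has in reaching thematic and holistic understanding of complex, long texts, which limits its performance in question answering systems based on Large Language Models (LLMs). The key to the solution is to construct knowledge graph triplets and integrate them with LLMs to improve understanding of deep semantics and contextual relations. The paper compares three open-source approaches (spaCy, Stanford CoreNLP-OpenIE, and GraphRAG) in terms of coverage, reasoning ability, and adaptability. It finds that GraphRAG is strongest in reasoning while OpenIE gives the best triplet coverage, which points to measurable directions for improving knowledge graph-based question answering.
Link: https://arxiv.org/abs/2509.09272
Authors: Vaibhav Chaudhary,Neha Soni,Narotam Singh,Amita Kapoor
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 46 pages, 4 figures, 17 tables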
Abstract:Knowledge graphs, a powerful tool for structuring information through relational triplets, have recently become the new front-runner in enhancing question-answering systems. While traditional Retrieval Augmented Generation (RAG) approaches are proficient in fact-based and local context-based extraction from concise texts, they encounter limitations when addressing the thematic and holistic understanding of complex, extensive texts, requiring a deeper analysis of both text and context. This paper presents a comprehensive technical comparative study of three different methodologies for constructing knowledge graph triplets and integrating them with Large Language Models (LLMs) for question answering: spaCy, Stanford CoreNLP-OpenIE, and GraphRAG, all leveraging open source technologies. We evaluate the effectiveness, feasibility, and adaptability of these methods by analyzing their capabilities, state of development, and their impact on the performance of LLM-based question answering. Experimental results indicate that while OpenIE provides the most comprehensive coverage of triplets, GraphRAG demonstrates superior reasoning abilities among the three. We conclude with a discussion on the strengths and limitations of each method and provide insights into future directions for improving knowledge graph-based question answering.
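[AI-24] Adaptive Knowledge Distillation using a Device-Aware Teacher for Low-Complexity Acoustic Scene Classification
[Quick Read]: This paper addresses Low-Complexity Device-Robust Acoustic Scene Classification, where the core challenge is robust generalization to both seen and unseen devices under strict computational limits. The key to the solution is a Knowledge Distillation framework. First, a compact ensemble of two teacher models is trained: a PaSST teacher trained with standard cross-entropy, and a "generalization expert" teacher trained with a novel Device-Aware Feature Alignment (DAFA) loss that explicitly optimizes the feature space for device robustness. Then, using the device labels available at test time, the distilled student model undergoes a final device-specific fine-tuning stage. This significantly improves overall performance and beats the baseline particularly on unseen devices.
Link: https://arxiv.org/abs/2509.09262
Authors: Seung Gyu Jeong,Seong Eun Kim
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: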
Abstract:In this technical report, we describe our submission for Task 1, Low-Complexity Device-Robust Acoustic Scene Classification, of the DCASE 2025 Challenge. Our work tackles the dual challenges of strict complexity constraints and robust generalization to both seen and unseen devices, while also leveraging the new rule allowing the use of device labels at test time. Our proposed system is based on a knowledge distillation framework where an efficient CP-MobileNet student learns from a compact, specialized two-teacher ensemble. This ensemble combines a baseline PaSST teacher, trained with standard cross-entropy, and a ‘generalization expert’ teacher. This expert is trained using our novel Device-Aware Feature Alignment (DAFA) loss, adapted from prior work, which explicitly structures the feature space for device robustness. To capitalize on the availability of test-time device labels, the distilled student model then undergoes a final device-specific fine-tuning stage. Our proposed system achieves a final accuracy of 57.93% on the development set, demonstrating a significant improvement over the official baseline, particularly on unseen devices.
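As a rough picture of how a student can learn from a two-teacher ensemble, the sketch below computes a classic Hinton-style distillation loss against the averaged soft targets of several teachers. The averaging rule, the temperature, the `alpha` weighting, and all function names are assumptions for illustration; the report's actual loss weights and its DAFA loss are not described in the abstract and are not reproduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(teacher_logits, temperature):
    """Average the temperature-softened distributions of several teachers."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n = len(probs)
    return [sum(p[i] for p in probs) / n for i in range(len(probs[0]))]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teachers || student)."""
    target = ensemble_soft_targets(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(target, student_soft) if t > 0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl

# Made-up logits for a three-class toy problem with two teachers.
student = [2.0, 0.0, -1.0]
teachers = [[0.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
loss = distillation_loss(student, teachers, label=0)
```

When the student's softened distribution already matches the ensemble, the KL term vanishes and only the weighted cross-entropy on the hard label remains.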
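[AI-25] Jupiter: Enhancing LLM Data Analysis Capabilities via Notebook and Inference-Time Value-Guided Search
[Quick Read]: This paper addresses the weak multi-step reasoning and inefficient tool use of current Large Language Models (LLMs) on complex data science tasks, which limit their effectiveness for automated analysis in real scenarios. The key to the solution is a scalable pipeline that extracts high-quality, tool-based, multi-step data processing tasks and their executable solutions from real Jupyter notebooks and associated data files, and that is used to build the NbQA dataset of standardized task-solution pairs. The paper also introduces the Jupiter framework, which models data analysis as a search problem, uses Monte Carlo Tree Search (MCTS) to generate diverse solution trajectories for value model learning, and at inference time combines the value model with node visit counts to efficiently produce executable multi-step plans with minimal search steps.
Link: https://arxiv.org/abs/2509.09245
Authors: Shuocheng Li,Yihao Liu,Silin Du,Wenxuan Zeng,Zhe Xu,Mengyu Zhou,Yeye He,Haoyu Dong,Shi Han,Dongmei Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: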
Abstract:Large language models (LLMs) have shown great promise in automating data science workflows, but existing models still struggle with multi-step reasoning and tool use, which limits their effectiveness on complex data analysis tasks. To address this, we propose a scalable pipeline that extracts high-quality, tool-based data analysis tasks and their executable multi-step solutions from real-world Jupyter notebooks and associated data files. Using this pipeline, we introduce NbQA, a large-scale dataset of standardized task-solution pairs that reflect authentic tool-use patterns in practical data science scenarios. To further enhance multi-step reasoning, we present Jupiter, a framework that formulates data analysis as a search problem and applies Monte Carlo Tree Search (MCTS) to generate diverse solution trajectories for value model learning. During inference, Jupiter combines the value model and node visit counts to efficiently collect executable multi-step plans with minimal search steps. Experimental results show that Qwen2.5-7B and 14B-Instruct models on NbQA solve 77.82% and 86.38% of tasks on InfiAgent-DABench, respectively, matching or surpassing GPT-4o and advanced agent frameworks. Further evaluations demonstrate improved generalization and stronger tool-use reasoning across diverse multi-step reasoning tasks.
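[AI-26] Vejde: A Framework for Inductive Deep Reinforcement Learning Based on Factor Graph Color Refinement
[Quick Read]: This paper addresses how to efficiently learn and generalize policy functions in decision problems with complex state structure (for example, object classes and relations). Conventional methods often fail to transfer knowledge across instances of different size or structure, which leads to inefficient training and poor generalization. The key to the solution is the Vejde framework. It represents the states of a Markov Decision Process (MDP) as databases of facts about entities and converts them, via data abstraction, into a bipartite graph. Neural message passing then maps the graph to a latent state space, combining Graph Neural Networks (GNNs) with Reinforcement Learning (RL) to produce inductive policy functions. The factored representation of states and actions lets agents handle problem instances of varying size and structure, which gives good generalization across instances.
Link: https://arxiv.org/abs/2509.09219
Authors: Jakob Nyberg,Pontus Johnson
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: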
Abstract:We present and evaluate Vejde, a framework which combines data abstraction, graph neural networks and reinforcement learning to produce inductive policy functions for decision problems with richly structured states, such as object classes and relations. MDP states are represented as databases of facts about entities, and Vejde converts each state to a bipartite graph, which is mapped to latent states through neural message passing. The factored representation of both states and actions allows Vejde agents to handle problems of varying size and structure. We tested Vejde agents on eight problem domains defined in RDDL, with ten problem instances each, where policies were trained using both supervised and reinforcement learning. To test policy generalization, we separate problem instances in two sets, one for training and the other solely for testing. Test results on unseen instances for the Vejde agents were compared to MLP agents trained on each problem instance, as well as the online planning algorithm Prost. Our results show that Vejde policies on average generalize to the test instances without a significant loss in score. Additionally, the inductive agents received scores on unseen test instances that on average were close to the instance-specific MLP agents.
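One plausible reading of the state-to-graph conversion is sketched below: each fact becomes a node on one side of a bipartite graph, each entity a node on the other side, and an edge records which argument position the entity fills. The predicate names, the tuple encoding of facts, and the function name are hypothetical; the abstract does not specify Vejde's actual encoding, and the message-passing network is not shown.

```python
def facts_to_bipartite(facts):
    """Convert (predicate, arg1, arg2, ...) facts into a bipartite graph.

    Returns (entities, fact_nodes, edges), where each edge is
    (fact_index, entity, argument_position).
    """
    entities = []
    seen = set()
    fact_nodes = []
    edges = []
    for i, (predicate, *args) in enumerate(facts):
        fact_nodes.append(predicate)
        for position, entity in enumerate(args):
            if entity not in seen:
                seen.add(entity)
                entities.append(entity)
            edges.append((i, entity, position))
    return entities, fact_nodes, edges

# Hypothetical state with three facts about three entities.
state = [
    ("at", "robot1", "room_a"),
    ("holding", "robot1", "box1"),
    ("door_open", "room_a"),
]
entities, fact_nodes, edges = facts_to_bipartite(state)
```

Because the graph is built per state from whatever facts and entities are present, the same construction applies unchanged to instances with more objects or relations, which is what makes an inductive policy possible in principle.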
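[AI-27] Enabling Regulatory Multi-Agent Collaboration: Architecture Challenges and Solutions
[Quick Read]: This paper addresses the governance and accountability challenges that arise from the unpredictable behavior and heterogeneous capabilities of large-scale autonomous agents deployed in domains such as finance, healthcare, and smart manufacturing. The core solution is a blockchain-based layered architecture built around three modules: (i) an agent behavior tracing and arbitration module for automated accountability; (ii) a dynamic reputation evaluation module for trust assessment in collaborative scenarios; (iii) a malicious behavior forecasting module for early detection of adversarial activity. The framework provides a systematic foundation for trustworthy, resilient, and scalable regulatory mechanisms in multi-agent ecosystems.
Link: https://arxiv.org/abs/2509.09215
Authors: Qinnan Hu,Yuntao Wang,Yuan Gao,Zhou Su,Linkang Du
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 7 pages, 6 figures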
Abstract:Large language models (LLMs)-empowered autonomous agents are transforming both digital and physical environments by enabling adaptive, multi-agent collaboration. While these agents offer significant opportunities across domains such as finance, healthcare, and smart manufacturing, their unpredictable behaviors and heterogeneous capabilities pose substantial governance and accountability challenges. In this paper, we propose a blockchain-enabled layered architecture for regulatory agent collaboration, comprising an agent layer, a blockchain data layer, and a regulatory application layer. Within this framework, we design three key modules: (i) an agent behavior tracing and arbitration module for automated accountability, (ii) a dynamic reputation evaluation module for trust assessment in collaborative scenarios, and (iii) a malicious behavior forecasting module for early detection of adversarial activities. Our approach establishes a systematic foundation for trustworthy, resilient, and scalable regulatory mechanisms in large-scale agent ecosystems. Finally, we discuss the future research directions for blockchain-enabled regulatory frameworks in multi-agent systems.
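[AI-28] ProgD: Progressive Multi-scale Decoding with Dynamic Graphs for Joint Multi-agent Motion Forecasting
[Quick Read]: This paper addresses the neglect of evolving interactions in multi-agent motion forecasting: existing methods can jointly predict multiple interacting agents, but they do not adequately model how the interactions change over time in future scenarios. The key to the solution is a novel progressive multi-scale decoding strategy (ProgD) combined with scenario modeling based on dynamic heterogeneous graphs. Progressively unfolding the dynamic heterogeneous graphs explicitly captures the evolution of social interactions in future scenarios, and a factorized architecture processes spatio-temporal dependencies while progressively eliminating uncertainty in future trajectories. A multi-scale decoding procedure further improves the consistency and accuracy of future scenario modeling, which leads to state-of-the-art performance on the INTERACTION and Argoverse 2 benchmarks.
Link: https://arxiv.org/abs/2509.09210
Authors: Xing Gao,Zherui Huang,Weiyao Lin,Xiao Sun
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: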
Abstract:Accurate motion prediction of surrounding agents is crucial for the safe planning of autonomous vehicles. Recent advancements have extended prediction techniques from individual agents to joint predictions of multiple interacting agents, with various strategies to address complex interactions within future motions of agents. However, these methods overlook the evolving nature of these interactions. To address this limitation, we propose a novel progressive multi-scale decoding strategy, termed ProgD, with the help of dynamic heterogeneous graph-based scenario modeling. In particular, to explicitly and comprehensively capture the evolving social interactions in future scenarios, given their inherent uncertainty, we design a progressive modeling of scenarios with dynamic heterogeneous graphs. With the unfolding of such dynamic heterogeneous graphs, a factorized architecture is designed to process the spatio-temporal dependencies within future scenarios and progressively eliminate uncertainty in future motions of multiple agents. Furthermore, a multi-scale decoding procedure is incorporated to improve on the future scenario modeling and consistent prediction of agents’ future motion. The proposed ProgD achieves state-of-the-art performance on the INTERACTION multi-agent prediction benchmark, ranking 1st, and the Argoverse 2 multi-world forecasting benchmark.
[AI-29] Incentivizing Safer Actions in Policy Optimization for Constrained Reinforcement Learning IJCAI
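[Quick read]: This paper addresses the instability of policy optimization methods near constraint boundaries in constrained reinforcement learning (CRL) for continuous control, which leads to poor training performance. The core challenge is maximizing reward while strictly satisfying domain-specific safety constraints. The key to the solution is Incrementally Penalized Proximal Policy Optimization (IP3O), which introduces an adaptive incentive mechanism and progressively increases the penalty before the constraint boundary is reached, stabilizing training dynamics and improving constraint satisfaction. The method shows better performance than existing safe RL algorithms on benchmark environments and comes with a theoretical bound on the optimality error.
Link: https://arxiv.org/abs/2509.09208
Authors: Somnath Hazra,Pallab Dasgupta,Soumyajit Dey
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, Accepted to the 34th International Joint Conference on Artificial Intelligence (IJCAI) 2025, Main Track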
Abstract:Constrained Reinforcement Learning (RL) aims to maximize the return while adhering to predefined constraint limits, which represent domain-specific safety requirements. In continuous control settings, where learning agents govern system actions, balancing the trade-off between reward maximization and constraint satisfaction remains a significant challenge. Policy optimization methods often exhibit instability near constraint boundaries, resulting in suboptimal training performance. To address this issue, we introduce a novel approach that integrates an adaptive incentive mechanism in addition to the reward structure to stay within the constraint bound before approaching the constraint boundary. Building on this insight, we propose Incrementally Penalized Proximal Policy Optimization (IP3O), a practical algorithm that enforces a progressively increasing penalty to stabilize training dynamics. Through empirical evaluation on benchmark environments, we demonstrate the efficacy of IP3O compared to the performance of state-of-the-art Safe RL algorithms. Furthermore, we provide theoretical guarantees by deriving a bound on the worst-case error of the optimality achieved by our algorithm.
[AI-30] On Integrating Large Language Models and Scenario-Based Programming for Improving Software Reliability
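[Quick read]: This paper addresses the reliability problems that arise when large language models (LLMs) are used in software development: although LLMs can significantly speed up development and produce well-structured, readable code, they often output incorrect code with high confidence, which can mislead developers into accepting flawed solutions. To make LLMs more trustworthy and controllable within the software engineering process, the paper proposes a methodology that combines LLMs with traditional software engineering techniques, specifically Scenario-Based Programming (SBP), in a structured way. The key is to use the SBP paradigm to guide human developers in injecting domain knowledge into the LLM, and to use its event-driven scenario mechanism to systematically inspect and correct the LLM's output, resulting in more reliable and verifiable programs. The case study shows that the method improves agent performance (for example, defeating several strong opposing agents) and also supports formal verification of core logic, which strengthens control over the development process and confidence in its results.
Link: https://arxiv.org/abs/2509.09194
Authors: Ayelet Berzack,Guy Katz
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: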
Abstract:Large Language Models (LLMs) are fast becoming indispensable tools for software developers, assisting or even partnering with them in crafting complex programs. The advantages are evident – LLMs can significantly reduce development time, generate well-organized and comprehensible code, and occasionally suggest innovative ideas that developers might not conceive on their own. However, despite their strengths, LLMs will often introduce significant errors and present incorrect code with persuasive confidence, potentially misleading developers into accepting flawed solutions. In order to bring LLMs into the software development cycle in a more reliable manner, we propose a methodology for combining them with “traditional” software engineering techniques in a structured way, with the goal of streamlining the development process, reducing errors, and enabling users to verify crucial program properties with increased confidence. Specifically, we focus on the Scenario-Based Programming (SBP) paradigm – an event-driven, scenario-based approach for software engineering – to allow human developers to pour their expert knowledge into the LLM, as well as to inspect and verify its outputs. To evaluate our methodology, we conducted a significant case study, and used it to design and implement the Connect4 game. By combining LLMs and SBP we were able to create a highly capable agent, which could defeat various strong existing agents. Further, in some cases, we were able to formally verify the correctness of our agent. Finally, our experience reveals interesting insights regarding the ease of use of our proposed approach. The full code of our case study will be made publicly available with the final version of this paper.
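To make the event-driven, scenario-based style concrete, here is a minimal sketch of an event-selection loop in Python. It is a toy under stated assumptions, not the authors' implementation and not a real SBP library: the names `sync`, `run`, `add_hot`, `add_cold`, and `interleave` are hypothetical, and each scenario is modeled as a generator that declares which events it requests, waits for, and blocks.

```python
def sync(request=(), wait_for=(), block=()):
    """Build one synchronization statement for a scenario (hypothetical helper)."""
    return {"request": list(request), "wait_for": set(wait_for), "block": set(block)}

def run(scenarios, max_events=20):
    """Repeatedly pick a requested, non-blocked event and notify scenarios."""
    active = []
    for sc in scenarios:
        try:
            active.append((sc, next(sc)))  # advance to the first statement
        except StopIteration:
            pass
    trace = []
    while active and len(trace) < max_events:
        blocked = set().union(*(stmt["block"] for _, stmt in active))
        candidates = [e for _, stmt in active
                      for e in stmt["request"] if e not in blocked]
        if not candidates:
            break  # nothing selectable: the run ends
        event = candidates[0]
        trace.append(event)
        survivors = []
        for sc, stmt in active:
            # Only scenarios that requested or waited for the event advance.
            if event in stmt["request"] or event in stmt["wait_for"]:
                try:
                    stmt = sc.send(event)
                except StopIteration:
                    continue  # scenario finished
            survivors.append((sc, stmt))
        active = survivors
    return trace

def add_hot():
    for _ in range(3):
        yield sync(request=["HOT"])

def add_cold():
    for _ in range(3):
        yield sync(request=["COLD"])

def interleave():
    # Forces strict alternation by blocking the event that must not come next.
    while True:
        yield sync(wait_for=["HOT"], block=["COLD"])
        yield sync(wait_for=["COLD"], block=["HOT"])

if __name__ == "__main__":
    print(run([add_hot(), add_cold(), interleave()]))
```

Each scenario stays small and independently readable, which is the property that makes this style convenient for inspecting and verifying generated behavior one rule at a time.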
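[AI-31] Probing Pre-trained Language Models on Code Changes: Insights from ReDef, a High-Confidence Just-in-Time Defect Prediction Dataset
[Quick read]: This paper addresses the noisy labels and low defect-localization precision common in existing Just-in-Time Software Defect Prediction (JIT-SDP) datasets, in particular the difficulty of accurately identifying bug-inducing commits. The key to the solution is ReDef, a high-confidence benchmark dataset of function-level modifications: defective samples are anchored by revert commits, clean samples are validated through post-hoc history checks, and ambiguous instances are filtered out by a GPT-assisted process with multiple rounds of voting and manual auditing. This yields 3,164 defective and 10,268 clean function-level modifications with label reliability well above that of existing resources. The paper also presents the first systematic evaluation of how well pre-trained language models (PLMs) understand code changes. Compact diff-style encodings outperform whole-function formats across all models, but counterfactual perturbation tests reveal that the models rely on superficial cues rather than truly understanding edit semantics, which indicates that current PLMs still lack deep semantic understanding of code changes.
Link: https://arxiv.org/abs/2509.09192
Authors: Doha Nam,Taehyoun Kim,Duksan Ryu,Jongmoon Baik
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: An anonymous link containing the dataset, construction scripts, and experimental code is publicly available for reproducibility: this https URL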
Abstract:Just-in-Time software defect prediction (JIT-SDP) plays a critical role in prioritizing risky code changes during code review and continuous integration. However, existing datasets often suffer from noisy labels and low precision in identifying bug-inducing commits. To address this, we present ReDef (Revert-based Defect dataset), a high-confidence benchmark of function-level modifications curated from 22 large-scale C/C++ projects. Defective cases are anchored by revert commits, while clean cases are validated through post-hoc history checks. Ambiguous instances are conservatively filtered out via a GPT-assisted triage process involving multiple votes and audits. This pipeline yields 3,164 defective and 10,268 clean modifications, offering substantially more reliable labels than prior resources. Beyond dataset construction, we provide the first systematic evaluation of how pre-trained language models (PLMs) reason about code modifications – specifically, which input encodings most effectively expose change information, and whether models genuinely capture edit semantics. We fine-tune CodeBERT, CodeT5+, and UniXcoder under five encoding strategies, and further probe their sensitivity through counterfactual perturbations that swap added/deleted blocks, invert diff polarity, or inject spurious markers. Our results show that compact diff-style encodings consistently outperform whole-function formats across all PLMs, with statistical tests confirming large, model-independent effects. However, under counterfactual tests, performance degrades little or not at all – revealing that what appears to be robustness in fact reflects reliance on superficial cues rather than true semantic understanding. These findings indicate that, unlike in snapshot-based tasks, current PLMs remain limited in their ability to genuinely comprehend code modifications.
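As an illustration of what a compact diff-style encoding and a polarity-inverting perturbation could look like, here is a short sketch built on Python's standard `difflib`. The function names `diff_encoding` and `invert_polarity` are hypothetical, and the exact encodings and perturbations used in the paper may differ.

```python
import difflib

def diff_encoding(before, after):
    """Return only the changed lines, each prefixed with '+' or '-'."""
    lines = list(difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0
    ))
    body = lines[2:]  # skip the '---' and '+++' file header lines
    # Drop '@@ ... @@' hunk headers; real content always starts with '+' or '-'.
    return [line for line in body if not line.startswith("@@")]

def invert_polarity(diff_lines):
    """Counterfactual perturbation: swap the '+' and '-' markers."""
    swap = {"+": "-", "-": "+"}
    return [swap[line[0]] + line[1:] for line in diff_lines]

if __name__ == "__main__":
    before = "int add(int a, int b) {\n    return a - b;\n}"
    after = "int add(int a, int b) {\n    return a + b;\n}"
    encoded = diff_encoding(before, after)
    print(encoded)
    print(invert_polarity(encoded))
```

A model that truly captured edit semantics should treat the inverted diff (which reintroduces the bug) differently from the original; the abstract reports that performance barely changes under such perturbations.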
[AI-32] HISPASpoof: A New Dataset For Spanish Speech Forensics ICASSP2026
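[Quick read]: This paper addresses the lack of a large-scale benchmark dataset for Spanish synthetic speech detection and attribution, with the goal of advancing reliable and inclusive speech forensics for Spanish. The key to the solution is HISPASpoof, the first large-scale Spanish synthetic speech dataset, which covers real speech from six accents and synthetic speech generated by six zero-shot Text-to-Speech (TTS) systems. Experiments show that detectors trained on HISPASpoof substantially outperform detectors trained only on English data when applied to Spanish, and the paper also evaluates synthetic speech attribution, providing an important benchmark for Spanish speech spoofing detection.
Link: https://arxiv.org/abs/2509.09155
Authors: Maria Risques,Kratika Bhagtani,Amit Kumar Singh Yadav,Edward J. Delp
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 1 figure, 10 tables, being submitted to ICASSP 2026 (IEEE International Conference on Acoustics, Speech, and Signal Processing 2026)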
Abstract:Zero-shot Voice Cloning (VC) and Text-to-Speech (TTS) methods have advanced rapidly, enabling the generation of highly realistic synthetic speech and raising serious concerns about their misuse. While numerous detectors have been developed for English and Chinese, Spanish, spoken by over 600 million people worldwide, remains underrepresented in speech forensics. To address this gap, we introduce HISPASpoof, the first large-scale Spanish dataset designed for synthetic speech detection and attribution. It includes real speech from public corpora across six accents and synthetic speech generated with six zero-shot TTS systems. We evaluate five representative methods, showing that detectors trained on English fail to generalize to Spanish, while training on HISPASpoof substantially improves detection. We also evaluate synthetic speech attribution performance on HISPASpoof, i.e., identifying the generation method of synthetic speech. HISPASpoof thus provides a critical benchmark for advancing reliable and inclusive speech forensics in Spanish.
[AI-33] Anti-Money Laundering Machine Learning Pipelines; A Technical Analysis on Identifying High-risk Bank Clients with Supervised Learning
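[Quick read]: This paper addresses the identification of high-risk clients in anti-money laundering (AML) settings in finance by building systematic machine learning (ML) pipelines that improve detection accuracy and interpretability. The key elements of the solution are: a 16-step design and statistical analysis process to ensure model robustness; structured storage of the data in a SQLite database together with SQL-based feature engineering algorithms for efficient preprocessing; integration of a pre-trained model deployed in an inference-ready state; and explainable artificial intelligence (XAI) modules that quantify feature importance, improving model transparency and business trust. The approach achieved a mean AUROC of 0.961 (standard deviation 0.005) on the competition dataset and placed second.
Link: https://arxiv.org/abs/2509.09127
Authors: Khashayar Namdar,Pin-Chien Wang,Tushar Raju,Steven Zheng,Fiona Li,Safwat Tahmin Khan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: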
Abstract:Anti-money laundering (AML) actions and measurements are among the priorities of financial institutions, for which machine learning (ML) has been shown to have high potential. In this paper, we propose a comprehensive and systematic approach for developing ML pipelines to identify high-risk bank clients in a dataset curated for Task 1 of the University of Toronto 2023-2024 Institute for Management and Innovation (IMI) Big Data and Artificial Intelligence Competition. The dataset included 195,789 customer IDs, and we employed a 16-step design and statistical analysis to ensure the final pipeline was robust. We also framed the data in a SQLite database, developed SQL-based feature engineering algorithms, connected our pre-trained model to the database, made it inference-ready, and provided explainable artificial intelligence (XAI) modules to derive feature importance. Our pipeline achieved a mean area under the receiver operating characteristic curve (AUROC) of 0.961 with a standard deviation (SD) of 0.005. The proposed pipeline achieved second place in the competition.
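The idea of SQL-based feature engineering on top of SQLite can be sketched with Python's built-in `sqlite3` module. Everything schema-related here is an assumption made for illustration: the `transactions` table, its columns, and the chosen aggregates are hypothetical and are not taken from the competition dataset.

```python
import sqlite3

def demo_connection():
    """Create an in-memory database with a tiny, made-up transactions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions "
        "(customer_id TEXT, amount REAL, is_cash INTEGER)"
    )
    rows = [("c1", 100.0, 0), ("c1", 9500.0, 1), ("c2", 20.0, 0)]
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)
    return conn

def build_features(conn):
    """Aggregate one feature row per customer using plain SQL."""
    query = """
        SELECT customer_id,
               COUNT(*)     AS n_txn,
               SUM(amount)  AS total_amount,
               MAX(amount)  AS max_amount,
               SUM(is_cash) AS n_cash_txn
        FROM transactions
        GROUP BY customer_id
        ORDER BY customer_id
    """
    return conn.execute(query).fetchall()

if __name__ == "__main__":
    for row in build_features(demo_connection()):
        print(row)
```

Keeping the aggregation in SQL means the same query can serve both training and inference, so a pre-trained model connected to the database sees features computed identically in both phases.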
[AI-34] Character-Level Perturbations Disrupt LLM Watermarks
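[Quick read]: This paper addresses the security weaknesses of large language model (LLM) watermarking in practical use, in particular the fact that existing watermark removal attacks are weak when access to the detector is limited. Prior work mostly evaluates robustness under idealized attack scenarios and ignores the real-world constraint of restricted detector access, which leads to overestimating watermark robustness. The main contributions are as follows. First, the paper formalizes the system model for LLM watermarking and defines two realistic threat models with limited access to the watermark detector, and it finds that character-level perturbations (such as typos, character swaps, deletions, and homoglyphs) can affect multiple tokens at once by disrupting tokenization, which makes attacks markedly more efficient. Second, it proposes guided removal attacks based on a Genetic Algorithm (GA) that achieve effective watermark removal with only a limited number of black-box queries to the detector. Finally, it argues that any fixed defense faces an adversarial dilemma and designs an adaptive compound character-level attack that bypasses multiple defenses. Experiments confirm the effectiveness of character-level perturbations and GA-based optimization, expose core vulnerabilities in current watermarking schemes, and underline the need for new, more adaptive robust mechanisms.
Link: https://arxiv.org/abs/2509.09112
Authors: Zhaoxi Zhang,Xiaomei Zhang,Yanjun Zhang,He Zhang,Shirui Pan,Bo Liu,Asif Qumer Gill,Leo Yu Zhang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: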
Abstract:Large Language Model (LLM) watermarking embeds detectable signals into generated text for copyright protection, misuse prevention, and content detection. While prior studies evaluate robustness using watermark removal attacks, these methods are often suboptimal, creating the misconception that effective removal requires large perturbations or powerful adversaries. To bridge the gap, we first formalize the system model for LLM watermark, and characterize two realistic threat models constrained on limited access to the watermark detector. We then analyze how different types of perturbation vary in their attack range, i.e., the number of tokens they can affect with a single edit. We observe that character-level perturbations (e.g., typos, swaps, deletions, homoglyphs) can influence multiple tokens simultaneously by disrupting the tokenization process. We demonstrate that character-level perturbations are significantly more effective for watermark removal under the most restrictive threat model. We further propose guided removal attacks based on the Genetic Algorithm (GA) that uses a reference detector for optimization. Under a practical threat model with limited black-box queries to the watermark detector, our method demonstrates strong removal performance. Experiments confirm the superiority of character-level perturbations and the effectiveness of the GA in removing watermarks under realistic constraints. Additionally, we argue there is an adversarial dilemma when considering potential defenses: any fixed defense can be bypassed by a suitable perturbation strategy. Motivated by this principle, we propose an adaptive compound character-level attack. Experimental results show that this approach can effectively defeat the defenses. Our findings highlight significant vulnerabilities in existing LLM watermark schemes and underline the urgency for the development of new robust mechanisms. 
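The claim that one character edit can affect several tokens is easy to see with a toy tokenizer. The sketch below uses a hypothetical greedy longest-match tokenizer with a four-entry vocabulary and single-character fallback; it is only an illustration and does not reproduce any real BPE tokenizer or the paper's attack code.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization with single-character fallback."""
    tokens, i = [], 0
    max_len = max(len(v) for v in vocab)
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

if __name__ == "__main__":
    vocab = {"water", "mark", "watermark", "ing"}
    print(tokenize("watermarking", vocab))        # clean text
    print(tokenize("watremarking", vocab))        # one adjacent-character swap
    print(tokenize("w\u0430termarking", vocab))   # Cyrillic homoglyph for 'a'
```

In this toy setup the clean word splits into two tokens, while a single swap or homoglyph breaks it into seven. A watermark detector that scores the token sequence therefore sees many changed tokens for one character-level edit.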
[AI-35] DP-FedLoRA: Privacy-Enhanced Federated Fine-Tuning for On-Device Large Language Models
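[Quick read]: This paper addresses the privacy leakage risk that arises when federated fine-tuning of on-device large language models (LLMs) processes sensitive user data. The core solution is DP-FedLoRA, a framework that combines LoRA-based adaptation with differential privacy (DP) to protect privacy while remaining communication-efficient. The key innovation is that each client clips its local LoRA matrices and adds Gaussian noise to satisfy (ε, δ)-differential privacy; a theoretical analysis shows that the updates are unbiased and bounds the variance introduced by the noise, which guides the choice of privacy budget. Experiments show that the method stays competitive on mainstream benchmarks while providing strong privacy guarantees, supporting scalable, privacy-preserving deployment of on-device LLMs.
Link: https://arxiv.org/abs/2509.09097
Authors: Honghui Xu,Shiva Shrestha,Wei Chen,Zhiyuan Li,Zhipeng Cai
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: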
Abstract:As on-device large language model (LLM) systems become increasingly prevalent, federated fine-tuning enables advanced language understanding and generation directly on edge devices; however, it also involves processing sensitive, user-specific data, raising significant privacy concerns within the federated learning framework. To address these challenges, we propose DP-FedLoRA, a privacy-enhanced federated fine-tuning framework that integrates LoRA-based adaptation with differential privacy in a communication-efficient setting. Each client locally clips and perturbs its LoRA matrices using Gaussian noise to satisfy $(\epsilon, \delta)$-differential privacy. We further provide a theoretical analysis demonstrating the unbiased nature of the updates and deriving bounds on the variance introduced by noise, offering practical guidance for privacy-budget calibration. Experimental results across mainstream benchmarks show that DP-FedLoRA delivers competitive performance while offering strong privacy guarantees, paving the way for scalable and privacy-preserving LLM deployment in on-device environments.
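The client-side "clip, then perturb" step can be sketched in a few lines of plain Python. This is a generic Gaussian-mechanism sketch under assumptions, not the DP-FedLoRA code: the function name `clip_and_perturb`, the use of the Frobenius norm, and the `noise_multiplier` parameterization (noise standard deviation = multiplier × clipping bound) are illustrative choices, and calibrating the multiplier to a specific (ε, δ) is left out.

```python
import math
import random

def clip_and_perturb(matrix, clip_norm, noise_multiplier, rng):
    """Scale a matrix to Frobenius norm <= clip_norm, then add Gaussian noise.

    matrix is a list of rows; rng is a random.Random instance.
    """
    norm = math.sqrt(sum(v * v for row in matrix for v in row))
    # Shrink only if the norm exceeds the bound; never scale up.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    sigma = noise_multiplier * clip_norm
    return [[v * scale + rng.gauss(0.0, sigma) for v in row] for row in matrix]

if __name__ == "__main__":
    rng = random.Random(0)
    lora_a = [[3.0, 4.0], [0.0, 0.0]]  # made-up LoRA factor with norm 5
    print(clip_and_perturb(lora_a, clip_norm=1.0, noise_multiplier=0.5, rng=rng))
```

Because the added noise has zero mean, averaging the perturbed matrices across clients leaves the expected value of the clipped updates unchanged, which is consistent with the unbiasedness property the abstract mentions; the noise only adds variance.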
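[AI-36] Towards Confidential and Efficient LLM Inference with Dual Privacy Protection DASFAA2025
[Quick read]: This paper addresses the high latency of CPU-based trusted execution environments (TEEs) in private inference for large language models (LLMs), as well as the damage that differential privacy (DP) methods do to model performance and semantic understanding. The key to the solution is CMIF, a confidential and efficient model inference framework: it deploys the embedding layer in a client-side TEE to protect sensitive inputs and offloads the subsequent layers to GPU servers for computational efficiency. It also optimizes the Report-Noisy-Max mechanism so that user data privacy is preserved with only a slight drop in model performance, striking a good balance between security and efficiency.
Link: https://arxiv.org/abs/2509.09091
Authors: Honglan Yu,Yibin Wang,Feifei Dai,Dong Liu,Haihui Fan,Xiaoyan Gu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by DASFAA2025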
Abstract:CPU-based trusted execution environments (TEEs) and differential privacy (DP) have gained wide applications for private inference. Due to high inference latency in TEEs, researchers use partition-based approaches that offload linear model components to GPUs. However, dense nonlinear layers of large language models (LLMs) result in significant communication overhead between TEEs and GPUs. DP-based approaches apply random noise to protect data privacy, but this compromises LLM performance and semantic understanding. To overcome the above drawbacks, this paper proposes CMIF, a Confidential and efficient Model Inference Framework. CMIF confidentially deploys the embedding layer in the client-side TEE and subsequent layers on GPU servers. Meanwhile, it optimizes the Report-Noisy-Max mechanism to protect sensitive inputs with a slight decrease in model performance. Extensive experiments on Llama-series models demonstrate that CMIF reduces additional inference overhead in TEEs while preserving user data privacy.
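For readers unfamiliar with Report-Noisy-Max, here is a sketch of the textbook mechanism: add independent Laplace noise to each candidate's score and release only the index of the largest noisy score. This is the baseline mechanism, not CMIF's optimized variant. The function names and the conservative noise scale of 2 × sensitivity / ε are assumptions made for this illustration.

```python
import random

def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def report_noisy_max(scores, epsilon, sensitivity, rng):
    """Return the index of the highest score after adding Laplace noise."""
    scale = 2.0 * sensitivity / epsilon  # conservative choice (assumption)
    noisy = [s + laplace(scale, rng) for s in scores]
    # Only the winning index is released, never the noisy scores themselves.
    return max(range(len(scores)), key=noisy.__getitem__)

if __name__ == "__main__":
    rng = random.Random(42)
    scores = [1.0, 5.0, 2.0]
    picks = [report_noisy_max(scores, epsilon=1.0, sensitivity=1.0, rng=rng)
             for _ in range(1000)]
    print({i: picks.count(i) for i in range(len(scores))})
```

Smaller values of ε add more noise, so the true maximum is chosen less often. That is the privacy and utility trade-off the abstract refers to when it mentions a slight decrease in model performance.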
[AI-37] KoopMotion: Learning Almost Divergence Free Koopman Flow Fields for Motion Planning
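[Quick read]: This paper addresses how, in robot motion planning, to converge smoothly from an arbitrary initial state to a desired reference trajectory and ultimately reach its end point. In learning from demonstrations (LfD) in particular, conventional Koopman operator methods do not inherently guarantee convergence to a target trajectory or a specified end point. The key to the solution is the KoopMotion framework, which models motion flow fields as dynamical systems parameterized by Koopman operators and uses the divergence properties of the learned flow fields to produce smooth, convergent motion fields: when the robot is away from the reference trajectory the flow field guides it onto the trajectory, and once on the trajectory it tracks it to the end point. The method is highly sample-efficient in both space and time (only 3% of the LASA dataset is needed to generate dense motion plans) and is validated on a physical robot, a miniature autonomous surface vehicle operating in a non-static fluid flow environment.
Link: https://arxiv.org/abs/2509.09074
Authors: Alice Kate Li,Thales C Silva,Victoria Edwards,Vijay Kumar,M. Ani Hsieh
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to CoRL 2025 (Conference on Robot Learning). 15 pages, 11 figures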
Abstract:In this work, we propose a novel flow field-based motion planning method that drives a robot from any initial state to a desired reference trajectory such that it converges to the trajectory’s end point. Despite demonstrated efficacy in using Koopman operator theory for modeling dynamical systems, Koopman does not inherently enforce convergence to desired trajectories nor to specified goals – a requirement when learning from demonstrations (LfD). We present KoopMotion which represents motion flow fields as dynamical systems, parameterized by Koopman Operators to mimic desired trajectories, and leverages the divergence properties of the learnt flow fields to obtain smooth motion fields that converge to a desired reference trajectory when a robot is placed away from the desired trajectory, and tracks the trajectory until the end point. To demonstrate the effectiveness of our approach, we show evaluations of KoopMotion on the LASA human handwriting dataset and a 3D manipulator end-effector trajectory dataset, including spectral analysis. We also perform experiments on a physical robot, verifying KoopMotion on a miniature autonomous surface vehicle operating in a non-static fluid flow environment. Our approach is highly sample efficient in both space and time, requiring only 3% of the LASA dataset to generate dense motion plans. Additionally, KoopMotion provides a significant improvement over baselines when comparing metrics that measure spatial and temporal dynamics modeling efficacy.
[AI-38] Understanding Economic Tradeoffs Between Human and AI Agents in Bargaining Games
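[Quick read]: This paper addresses a limitation in how agents are evaluated in multi-agent coordination settings: current evaluations tend to focus on outcome metrics (such as total surplus) and overlook differences in decision processes and behavior. The key to the solution is a dynamic negotiation setting in which humans, large language models (LLMs), and Bayesian agents are compared directly under identical conditions, capturing both task outcomes and behavioral dynamics. The study finds that although humans and LLMs reach performance parity in overall surplus, they get there by very different strategies: LLMs favor conservative concessions that reduce failed trades, while humans show more strategic risk-taking and fairness-oriented behavior. Performance parity can therefore conceal important differences in process and alignment, so practical deployments should examine behavior rather than a single outcome metric.
Link: https://arxiv.org/abs/2509.09071
Authors: Crystal Qian,Kehang Zhu,John Horton,Benjamin S. Manning,Vivian Tsai,James Wexler,Nithum Thain
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Human-Computer Interaction (cs.HC)
Comments: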
Abstract:Coordination tasks traditionally performed by humans are increasingly being delegated to autonomous agents. As this pattern progresses, it becomes critical to evaluate not only these agents’ performance but also the processes through which they negotiate in dynamic, multi-agent environments. Furthermore, different agents exhibit distinct advantages: traditional statistical agents, such as Bayesian models, may excel under well-specified conditions, whereas large language models (LLMs) can generalize across contexts. In this work, we compare humans (N = 216), LLMs (GPT-4o, Gemini 1.5 Pro), and Bayesian agents in a dynamic negotiation setting that enables direct, identical-condition comparisons across populations, capturing both outcomes and behavioral dynamics. Bayesian agents extract the highest surplus through aggressive optimization, at the cost of frequent trade rejections. Humans and LLMs can achieve similar overall surplus, but through distinct behaviors: LLMs favor conservative, concessionary trades with few rejections, while humans employ more strategic, risk-taking, and fairness-oriented behaviors. Thus, we find that performance parity – a common benchmark in agent evaluation – can conceal fundamental differences in process and alignment, which are critical for practical deployment in real-world coordination tasks.
[AI-39] STRIDE: Scalable and Interpretable XAI via Subset-Free Functional Decomposition
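[Quick read]: This paper addresses two key problems in explainable AI (XAI) frameworks: the exponential computational cost of reasoning over feature subsets, and the limited expressiveness of summarizing effects as single scalar values. The core of the solution is the STRIDE framework, which models explanation as an orthogonal functional decomposition in a Reproducing Kernel Hilbert Space (RKHS) without enumerating subsets, avoiding explicit traversal of all feature combinations. It computes the functional components $ f_S(x_S) $ through an analytical projection scheme based on a recursive kernel-centering procedure, providing both local and global explanations while remaining model-agnostic, with theoretical guarantees of orthogonality and $ L^2 $ convergence.
Link: https://arxiv.org/abs/2509.09070
Authors: Chaeyun Ko
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 10 pages, 2 figures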
Abstract:Most explainable AI (XAI) frameworks face two practical limitations: the exponential cost of reasoning over feature subsets and the reduced expressiveness of summarizing effects as single scalar values. We present STRIDE, a scalable framework that aims to mitigate both issues by framing explanation as a subset-enumeration-free, orthogonal functional decomposition in a Reproducing Kernel Hilbert Space (RKHS). Rather than focusing only on scalar attributions, STRIDE computes functional components $f_S(x_S)$ via an analytical projection scheme based on a recursive kernel-centering procedure, avoiding explicit subset enumeration. In the tabular setups we study, the approach is model-agnostic, provides both local and global views, and is supported by theoretical results on orthogonality and $L^2$ convergence under stated assumptions. On public tabular benchmarks in our environment, we observed speedups ranging from 0.6x (slower than TreeSHAP on a small dataset) to 9.7x (California), with a median of approximately 3.0x across 10 datasets, while maintaining high fidelity ($R^2$ between 0.81 and 0.999) and substantial rank agreement on most datasets. Overall, STRIDE complements scalar attribution methods by offering a structured functional perspective, enabling novel diagnostics like ‘component surgery’ to quantitatively measure the impact of specific interactions within our experimental scope.
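[AI-40] Instructional Prompt Optimization for Few-Shot LLM-Based Recommendations on Cold-Start Users
[Quick read]: This paper addresses the cold-start user problem in recommender systems, that is, how to keep a recommendation model effective when no historical behavioral data is available. The key to the solution is a context-conditioned prompt formulation method P(u, Ds) → R̂, where u is the cold-start user profile, Ds is a carefully curated support set, and R̂ is the predicted ranked list of items. Optimizing the instructions given to a large language model (LLM) in the few-shot setting, combined with token-level alignments and embedding space regularization, significantly improves precision@k and NDCG in low-data settings. The study shows that prompt design affects not only syntactic structure but also recommendation performance directly, by controlling attention scales and decoder behavior, which supports prompt-based adaptation as an effective way to mitigate cold-start problems in LLM-based recommendation pipelines.
Link: https://arxiv.org/abs/2509.09066
Authors: Haowei Yang,Yushang Zhao,Sitao Min,Bo Su,Chao Yao,Wei Xu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: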
Abstract:The cold-start user problem compromises the effectiveness of recommender systems by limiting access to historical behavioral information. Optimizing instructional prompts for a few-shot large language model (LLM) is an effective pipeline for recommendation tasks under these conditions. We introduce a context-conditioned prompt formulation method $P(u, D_s) \rightarrow \widehat{R}$, where $u$ is a cold-start user profile, $D_s$ is a curated support set, and $\widehat{R}$ is the predicted ranked list of items. Based on systematic experimentation with transformer-based autoregressive LLMs (BioGPT, LLaMA-2, GPT-4), we provide empirical evidence that optimal exemplar injection and instruction structuring can significantly improve the precision@k and NDCG scores of such models in low-data settings. The pipeline uses token-level alignments and embedding space regularization to achieve greater semantic fidelity. Our findings show that prompt composition is not merely syntactic but also functional, as it directly controls attention scales and decoder behavior during inference. This paper shows that prompt-based adaptation can be one way to address cold-start recommendation issues in LLM-based pipelines.
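The two ranking metrics named in the abstract, precision@k and NDCG, can be computed as follows. This sketch assumes binary relevance (an item is either relevant or not) and uses hypothetical function names; the paper may use graded relevance or a different normalization.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def ndcg_at_k(ranked, relevant, k):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    # The ideal ranking places every relevant item at the top.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]   # predicted ranked list (made-up items)
    relevant = {"a", "c"}           # ground-truth relevant items
    print(precision_at_k(ranked, relevant, k=2))
    print(ndcg_at_k(ranked, relevant, k=2))
```

Precision@k ignores the order within the top k, while NDCG rewards placing relevant items earlier, so the two metrics can disagree about which ranked list is better.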
[AI-41] A Scoping Review of Machine Learning Applications in Power System Protection and Disturbance Management
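[Quick read]: This paper addresses the fragmentation, lack of methodological rigor, uneven dataset quality, and inconsistent evaluation metrics in current research on machine learning (ML) for power system protection and disturbance management, which severely limit the comparability, generalizability, and practical deployability of results. The key elements of the solution are: an ML-oriented taxonomy for protection tasks that resolves terminological ambiguity; standardized reporting practices, including comprehensive dataset documentation, greater methodological transparency, and consistent evaluation protocols; and an emphasis on real-world validation, robustness testing, and deployment feasibility analysis, together with public benchmark datasets and advanced ML architectures, to move ML-based protection from theory to engineering practice.
Link: https://arxiv.org/abs/2509.09053
Authors: Julian Oelhaf,Georg Kordowich,Mehran Pashaei,Christian Bergler,Andreas Maier,Johann Jäger,Siming Bayer
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: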
Abstract:The integration of renewable and distributed energy resources reshapes modern power systems, challenging conventional protection schemes. This scoping review synthesizes recent literature on machine learning (ML) applications in power system protection and disturbance management, following the PRISMA for Scoping Reviews framework. Based on over 100 publications, three key objectives are addressed: (i) assessing the scope of ML research in protection tasks; (ii) evaluating ML performance across diverse operational scenarios; and (iii) identifying methods suitable for evolving grid conditions. ML models often demonstrate high accuracy on simulated datasets; however, their performance under real-world conditions remains insufficiently validated. The existing literature is fragmented, with inconsistencies in methodological rigor, dataset quality, and evaluation metrics. This lack of standardization hampers the comparability of results and limits the generalizability of findings. To address these challenges, this review introduces an ML-oriented taxonomy for protection tasks, resolves key terminological inconsistencies, and advocates for standardized reporting practices. It further provides guidelines for comprehensive dataset documentation, methodological transparency, and consistent evaluation protocols, aiming to improve reproducibility and enhance the practical relevance of research outcomes. Critical gaps remain, including the scarcity of real-world validation, insufficient robustness testing, and limited consideration of deployment feasibility. Future research should prioritize public benchmark datasets, realistic validation methods, and advanced ML architectures. These steps are essential to move ML-based protection from theoretical promise to practical deployment in increasingly dynamic and decentralized power systems.
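[AI-42] MoWE: A Mixture of Weather Experts
[Quick read]: This paper addresses the stagnation in the performance of data-driven weather models, whose accuracy gains have saturated and are hard to push further. The key to the solution is a new paradigm based on a mixture of experts (MoWE): a Vision Transformer-based gating network dynamically learns to weight the outputs of multiple "expert" models at each grid point, with weights that vary with forecast lead time. Rather than training an entirely new forecaster, the method combines existing high-quality models at low computational cost to produce a synthesized deterministic forecast that outperforms both individual experts and simple averaging in Root Mean Squared Error (RMSE), with up to a 10% reduction in RMSE.
Link: https://arxiv.org/abs/2509.09052
Authors: Dibyajyoti Chakraborty,Romit Maulik,Peter Harrington,Dallas Foster,Mohammad Amin Nabian,Sanjay Choudhry
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph); Geophysics (physics.geo-ph)
Comments: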
Abstract:Data-driven weather models have recently achieved state-of-the-art performance, yet progress has plateaued in recent years. This paper introduces a Mixture of Experts (MoWE) approach as a novel paradigm to overcome these limitations, not by creating a new forecaster, but by optimally combining the outputs of existing models. The MoWE model is trained with significantly lower computational resources than the individual experts. Our model employs a Vision Transformer-based gating network that dynamically learns to weight the contributions of multiple “expert” models at each grid point, conditioned on forecast lead time. This approach creates a synthesized deterministic forecast that is more accurate than any individual component in terms of Root Mean Squared Error (RMSE). Our results demonstrate the effectiveness of this method, achieving up to a 10% lower RMSE than the best-performing AI weather model on a 2-day forecast horizon, significantly outperforming individual experts as well as a simple average across experts. This work presents a computationally efficient and scalable strategy to push the state of the art in data-driven weather prediction by making the most out of leading high-quality forecast models.
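The per-grid-point weighting can be sketched as a softmax over experts at each cell, followed by a weighted sum of the expert forecasts. In the paper the gating scores come from a Vision Transformer conditioned on lead time; in this sketch they are simply given as input, and the function name `blend_forecasts` and the nested-list grid layout are assumptions made for illustration.

```python
import math

def blend_forecasts(expert_fields, gate_logits):
    """Blend expert forecasts with per-grid-point softmax weights.

    expert_fields[e][r][c] is expert e's forecast at grid cell (r, c);
    gate_logits[e][r][c] is the unnormalized gating score for that cell.
    """
    n_rows, n_cols = len(expert_fields[0]), len(expert_fields[0][0])
    blended = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            logits = [g[r][c] for g in gate_logits]
            peak = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(l - peak) for l in logits]
            total = sum(exps)
            blended[r][c] = sum(
                (w / total) * field[r][c]
                for w, field in zip(exps, expert_fields)
            )
    return blended

if __name__ == "__main__":
    experts = [[[1.0, 2.0]], [[3.0, 6.0]]]   # two experts on a 1x2 grid
    logits = [[[0.0, 2.0]], [[0.0, 0.0]]]    # made-up gating scores
    print(blend_forecasts(experts, logits))
```

With equal logits the blend reduces to the simple average across experts, which is the baseline the abstract compares against; learned logits let the combination favor different experts in different regions.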
[AI-43] Envy-Free but Still Unfair: Envy-Freeness Up To One Item (EF-1) in Personalized Recommendation
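[Quick read]: This paper asks whether traditional fairness concepts such as envy-freeness still apply in personalized recommender systems. It notes that although envy-freeness and its relaxation EF-1 (envy-freeness up to one item) are widely used in economics, game theory, and social choice theory and have recently attracted attention in recommender systems, envy by itself is not an accurate measure of fairness where personalization matters. The key contribution is to expose the limitations of envy as a metric in personalized settings, encouraging researchers to rethink fairness measures suited to recommender systems rather than simply reusing fairness definitions from traditional economic models.
Link: https://arxiv.org/abs/2509.09037
Authors: Amanda Aird,Ben Armstrong,Nicholas Mattei,Robin Burke
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: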
Abstract:Envy-freeness and the relaxation to Envy-freeness up to one item (EF-1) have been used as fairness concepts in the economics, game theory, and social choice literatures since the 1960s, and have recently gained popularity within the recommendation systems communities. In this short position paper we will give an overview of envy-freeness and its use in economics and recommendation systems; and illustrate why envy is not appropriate to measure fairness for use in settings where personalization plays a role.
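For reference, the two fairness notions can be checked mechanically. The sketch below assumes additive, non-negative valuations over indivisible items, which is the standard fair-division setting; the function names and the example data are hypothetical and are not taken from the paper.

```python
def bundle_value(values, bundle):
    """Additive value an agent assigns to a set of items."""
    return sum(values[item] for item in bundle)

def is_envy_free(valuations, allocation):
    """No agent values another agent's bundle above their own."""
    return all(
        bundle_value(valuations[i], allocation[i])
        >= bundle_value(valuations[i], allocation[j])
        for i in range(len(allocation))
        for j in range(len(allocation)) if i != j
    )

def is_ef1(valuations, allocation):
    """Envy disappears after removing some single item from the other bundle."""
    for i in range(len(allocation)):
        own = bundle_value(valuations[i], allocation[i])
        for j in range(len(allocation)):
            if i == j:
                continue
            item_values = [valuations[i][item] for item in allocation[j]]
            # Removing the item that agent i values most is the best case.
            reduced = sum(item_values) - max(item_values, default=0)
            if own < reduced:
                return False
    return True

if __name__ == "__main__":
    valuations = [{"x": 5, "y": 3, "z": 1}, {"x": 5, "y": 3, "z": 1}]
    allocation = [{"x"}, {"y", "z"}]
    print(is_envy_free(valuations, allocation), is_ef1(valuations, allocation))
```

In this example the second agent envies the first (4 versus 5), yet the allocation is EF-1 because the envy vanishes once item `x` is removed. Both checks compare bundles through each agent's own valuations, and the paper's argument is that such a comparison is a poor proxy for fairness when recommendations are personalized.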
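[AI-44] Uncertainty Awareness and Trust in Explainable AI - On Trust Calibration using Local and Global Explanations ICDM2025
[Quick read]: This paper addresses the lack of research on uncertainty explanations and global explanations in explainable AI (XAI), with a particular focus on how algorithm design can calibrate users' trust in model predictions. The key to the solution is to adopt an algorithm that covers uncertainty, robustness, and global XAI at the same time, and to test whether it improves user satisfaction and human interpretability even though the algorithm itself is technically complex.
Link: https://arxiv.org/abs/2509.08989
Authors: Carina Newen,Daniel Bodemer,Sonja Glantz,Emmanuel Müller,Magdalena Wischnewski,Lenka Schnaubert
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures, accepted but not yet published at ICDM2025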
Abstract:Explainable AI has become a common term in the literature, scrutinized by computer scientists and statisticians and highlighted by psychological or philosophical researchers. One major effort many researchers tackle is constructing general guidelines for XAI schemes, which we derived from our study. While some areas of XAI are well studied, we focus on uncertainty explanations and consider global explanations, which are often left out. We chose an algorithm that covers various concepts simultaneously, such as uncertainty, robustness, and global XAI, and tested its ability to calibrate trust. We then checked whether an algorithm that aims to provide more of an intuitive visual understanding, despite being complicated to understand, can provide higher user satisfaction and human interpretability.
[AI-45] ForTIFAI: Fending Off Recursive Training Induced Failure for AI Models
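[Quick read]: This paper addresses "model collapse" in generative AI models that are trained recursively on synthetic data, where performance degrades as the number of training generations grows until the model becomes ineffective. The key to its solution is identifying the model's overconfidence in its own generated data as a core driver of collapse, and proposing a confidence-aware loss function called Truncated Cross Entropy (TCE), which downweights high-confidence predictions to curb overfitting and error accumulation and thereby significantly delays collapse. The method is modality-agnostic, is validated both theoretically and empirically, and extends the interval over which the model keeps producing useful output by more than 2.3x.
Link: https://arxiv.org/abs/2509.08972
Authors: Soheil Zibakhsh Shabgahi, Pedram Aghazadeh, Azalia Mirhosseini, Farinaz Koushanfar
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: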
Abstract:The increasing reliance on generative AI models has accelerated the generation rate of synthetic data, with some projections suggesting that most available new data for training could be machine-generated by 2030. This shift to mainly synthetic content presents a critical challenge: repeated training on synthetic data leads to a phenomenon known as model collapse, where model performance degrades over generations of training, eventually rendering the models ineffective. Although prior studies have explored the causes and detection of model collapse, existing mitigation strategies remain limited. In this paper, we identify models' overconfidence in their self-generated data as a key driver of collapse. Building on this observation, we propose a confidence-aware loss function that downweights high-confidence predictions during training. We introduce a novel loss function we call Truncated Cross Entropy (TCE). We demonstrate that TCE significantly delays model collapse in recursive training. We provide a model-agnostic framework that links the loss function design to model collapse mitigation and validate our approach both theoretically and empirically, showing that it can extend the model’s fidelity interval before collapse by more than 2.3x. Finally, we show that our method generalizes across modalities. These findings suggest that the design of loss functions provides a simple yet powerful tool for preserving the quality of generative models in the era of increasing synthetic data.
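One way to read the "confidence-aware loss" idea is as a cross-entropy that drops the terms whose target probability already exceeds a threshold. The sketch below is a minimal illustration under that assumption only: the function name `truncated_cross_entropy`, the threshold `gamma`, and the hard-masking rule are hypothetical choices for this example and are not the paper's exact definition of TCE.

```python
import math

def cross_entropy(target_probs):
    """Standard mean negative log-likelihood of the correct tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def truncated_cross_entropy(target_probs, gamma=0.9):
    """Mean negative log-likelihood over tokens whose target probability
    is at most `gamma`; more confident tokens contribute no loss.

    target_probs: probabilities the model assigns to the correct tokens.
    gamma: hypothetical confidence threshold (an assumption of this sketch).
    """
    kept = [p for p in target_probs if p <= gamma]
    if not kept:
        return 0.0  # every prediction was above the threshold
    return -sum(math.log(p) for p in kept) / len(kept)

# Two very confident predictions and two uncertain ones.
probs = [0.99, 0.95, 0.5, 0.2]
ce = cross_entropy(probs)
tce = truncated_cross_entropy(probs, gamma=0.9)
```

In this toy case only the two uncertain tokens (0.5 and 0.2) feed the truncated loss, so training signal concentrates on predictions the model is not yet sure about.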
[AI-46] Global Constraint LLM Agents for Text-to-Model Translation
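[Quick read]: This paper addresses the difficulty of accurately translating natural-language descriptions of optimization or satisfaction problems into MiniZinc models, a process that requires both logical reasoning and constraint programming expertise. The key to its solution is an agent-based framework: multiple specialized large language model (LLM) agents decompose the modeling task by global constraint type, each agent focuses on detecting and generating code for one class of global constraint, and a final assembler agent integrates the resulting constraint snippets into a complete MiniZinc model. Splitting the complex task into smaller, well-defined sub-tasks leaves each LLM with a simpler reasoning problem, which lowers overall modeling complexity and improves accuracy.
Link: https://arxiv.org/abs/2509.08970
Authors: Junyang Cai, Serdar Kadioglu, Bistra Dilkina
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: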
Abstract:Natural language descriptions of optimization or satisfaction problems are challenging to translate into correct MiniZinc models, as this process demands both logical reasoning and constraint programming expertise. We introduce a framework that addresses this challenge with an agentic approach: multiple specialized large language model (LLM) agents decompose the modeling task by global constraint type. Each agent is dedicated to detecting and generating code for a specific class of global constraint, while a final assembler agent integrates these constraint snippets into a complete MiniZinc model. By dividing the problem into smaller, well-defined sub-tasks, each LLM handles a simpler reasoning challenge, potentially reducing overall complexity. We conduct initial experiments with several LLMs and show better performance against baselines such as one-shot prompting and chain-of-thought prompting. Finally, we outline a comprehensive roadmap for future work, highlighting potential enhancements and directions for improvement.
[AI-47] Instance-Optimal Matrix Multiplicative Weight Update and Its Quantum Applications
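[Quick read]: This paper addresses regret bounds for the matrix version of the Learning from Expert Advice (LEA) problem, aiming for an instance-optimal regret bound rather than only the worst-case minimax-optimal one. The classical Matrix Multiplicative Weight Update (MMWU) algorithm achieves an O(\sqrt{T\log d}) regret bound on the d-dimensional spectraplex, but that bound does not exploit information about the specific comparator. The paper proposes an improved algorithm that uses an optimal potential function based on the imaginary error function and achieves an O(\sqrt{T\cdot S(X||d^{-1}I_d)}) regret bound, where S(\cdot||\cdot) is the quantum relative entropy and X is the comparator matrix. The key is a general potential-based framework for matrix LEA together with a new "one-sided" Jensen's trace inequality, built on a Laplace transform technique, which allows potential functions beyond the exponential to be used in matrix LEA analysis. The method keeps the same computational complexity as MMWU, so the improved instance sensitivity of the regret bound comes for "free".
Link: https://arxiv.org/abs/2509.08911
Authors: Weiyuan Gong, Tongyang Li, Xinzhao Wang, Zhiyu Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Quantum Physics (quant-ph); Machine Learning (stat.ML)
Comments: 47 pages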
Abstract:The Matrix Multiplicative Weight Update (MMWU) is a seminal online learning algorithm with numerous applications. Applied to the matrix version of the Learning from Expert Advice (LEA) problem on the d-dimensional spectraplex, it is well known that MMWU achieves the minimax-optimal regret bound of O(\sqrt{T\log d}), where T is the time horizon. In this paper, we present an improved algorithm achieving the instance-optimal regret bound of O(\sqrt{T\cdot S(X||d^{-1}I_d)}), where X is the comparator in the regret, I_d is the identity matrix, and S(\cdot||\cdot) denotes the quantum relative entropy. Furthermore, our algorithm has the same computational complexity as MMWU, indicating that the improvement in the regret bound is “free”. Technically, we first develop a general potential-based framework for matrix LEA, with MMWU being its special case induced by the standard exponential potential. Then, the crux of our analysis is a new “one-sided” Jensen’s trace inequality built on a Laplace transform technique, which allows the application of general potential functions beyond exponential to matrix LEA. Our algorithm is finally induced by an optimal potential function from the vector LEA problem, based on the imaginary error function. Complementing the above, we provide a memory lower bound for matrix LEA, and explore the applications of our algorithm in quantum learning theory. We show that it outperforms the state of the art for learning quantum states corrupted by depolarization noise, random quantum states, and Gibbs states. In addition, applying our algorithm to linearized convex losses enables predicting nonlinear quantum properties, such as purity, quantum virtual cooling, and Rényi-2 correlation.
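To see why a comparator-dependent bound can be tighter, note that for a density matrix X the quantity S(X||d^{-1}I_d) equals log d minus the von Neumann entropy of X, so it ranges from 0 (X maximally mixed) to log d (X pure). The sketch below computes it from the eigenvalues of X and compares the two scalings with constants omitted; the function names are hypothetical and the code illustrates only this identity, not the paper's algorithm.

```python
import math

def relative_entropy_to_maximally_mixed(eigenvalues):
    """S(X || I/d) = log d + sum_i lam_i * log(lam_i) for a density matrix X
    with the given eigenvalues (non-negative, summing to 1)."""
    d = len(eigenvalues)
    # Terms with lam == 0 contribute 0 by the convention 0 * log 0 = 0.
    neg_entropy = sum(lam * math.log(lam) for lam in eigenvalues if lam > 0)
    return math.log(d) + neg_entropy

def bound_scales(eigenvalues, horizon):
    """Return (sqrt(T log d), sqrt(T * S(X || I/d))), constants omitted."""
    d = len(eigenvalues)
    s = relative_entropy_to_maximally_mixed(eigenvalues)
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(horizon * math.log(d)), math.sqrt(horizon * max(s, 0.0))

d = 8
pure = [1.0] + [0.0] * (d - 1)    # rank-one comparator: worst case
mixed = [1.0 / d] * d             # maximally mixed comparator: best case
skewed = [0.65] + [0.05] * (d - 1)

minimax_scale, instance_scale = bound_scales(skewed, horizon=1000)
```

For the pure comparator the two scalings coincide, while for any mixed comparator the instance-dependent one is strictly smaller.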
[AI-48] Benchmarking Energy Efficiency of Large Language Models Using vLLM
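[Quick read]: This paper addresses the inadequate evaluation of the energy efficiency of large language models (LLMs) in real deployments, where existing benchmarks often fail to reflect production conditions. The key to its solution is a new benchmark, the LLM Efficiency Benchmark, which integrates vLLM, a high-throughput, production-ready LLM serving backend, to simulate real-world request concurrency and model serving. It thus measures more accurately how model size, architecture, and concurrent load affect inference energy consumption, giving developers empirical evidence for building sustainable AI systems.
Link: https://arxiv.org/abs/2509.08867
Authors: K. Pronk, Q. Zhao
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 6 pages, 6 figures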
Abstract:The prevalence of Large Language Models (LLMs) is having a growing impact on the climate due to the substantial energy required for their deployment and use. To create awareness for developers who are implementing LLMs in their products, there is a strong need to collect more information about the energy efficiency of LLMs. While existing research has evaluated the energy efficiency of various models, these benchmarks often fall short of representing realistic production scenarios. In this paper, we introduce the LLM Efficiency Benchmark, designed to simulate real-world usage conditions. Our benchmark utilizes vLLM, a high-throughput, production-ready LLM serving backend that optimizes model performance and efficiency. We examine how factors such as model size, architecture, and concurrent request volume affect inference energy efficiency. Our findings demonstrate that it is possible to create energy efficiency benchmarks that better reflect practical deployment conditions, providing valuable insights for developers aiming to build more sustainable AI systems.
[AI-49] Investigating Student Interaction Patterns with Large Language Model-Powered Course Assistants in Computer Science Courses
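[Quick read]: This paper addresses the difficulty students at colleges and universities have in getting timely academic support outside scheduled hours, a gap that particularly affects novices and students who need flexible help. The key to its solution is developing and deploying a course assistant powered by large language models (LLMs) and studying it empirically in computer science courses at several institutions, in order to quantify usage patterns, assess pedagogical effectiveness, and explore how educator involvement in prompt design, content configuration, and policy setting can improve the use of LLMs in teaching.
Link: https://arxiv.org/abs/2509.08862
Authors: Chang Liu, Loc Hoang, Andrew Stolman, Rene F. Kizilcec, Bo Wu
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: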
Abstract:Providing students with flexible and timely academic support is a challenge at most colleges and universities, leaving many students without help outside scheduled hours. Large language models (LLMs) are promising for bridging this gap, but interactions between students and LLMs are rarely overseen by educators. We developed and studied an LLM-powered course assistant deployed across multiple computer science courses to characterize real-world use and understand pedagogical implications. By Spring 2024, our system had been deployed to approximately 2,000 students across six courses at three institutions. Analysis of the interaction data shows that usage remains strong in the evenings and nights and is higher in introductory courses, indicating that our system helps address temporal support gaps and novice learner needs. We sampled 200 conversations per course for manual annotation: most sampled responses were judged correct and helpful, with a small share unhelpful or erroneous; few responses included dedicated examples. We also examined an inquiry-based learning strategy: only around 11% of sampled conversations contained LLM-generated follow-up questions, which were often ignored by students in advanced courses. A Bloom’s taxonomy analysis reveals that current LLM capabilities are limited in generating higher-order cognitive questions. These patterns suggest opportunities for pedagogically oriented LLM-based educational systems and greater educator involvement in configuring prompts, content, and policies.
[AI-50] Multi Robot Coordination in Highly Dynamic Environments: Tackling Asymmetric Obstacles and Limited Communication
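[Quick read]: This paper addresses efficient task assignment in a multi-agent system (MAS) with limited communication, a partially observable environment, and dynamic asymmetric obstacles. Its key solution is a distributed coordination method inspired by market mechanisms that handles frequent task reallocation and explicitly accounts for asymmetric obstacles, improving cooperation under low-communication conditions. Experiments show that the method notably reduces task overlaps in limited-communication settings, with a 52% decrease for the most frequently reallocated task.
Link: https://arxiv.org/abs/2509.08859
Authors: Vincenzo Suriani, Daniele Affinita, Domenico D. Bloisi, Daniele Nardi
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: The 19th International Conference on Intelligent Autonomous Systems (IAS 19), 2025, Genoa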
Abstract:Coordinating a fully distributed multi-agent system (MAS) can be challenging when the communication channel has very limited capabilities in terms of sending rate and packet payload. When the MAS has to deal with active obstacles in a highly partially observable environment, the communication channel acquires considerable relevance. In this paper, we present an approach to deal with task assignments in extremely active scenarios, where tasks need to be frequently reallocated among the agents participating in the coordination process. Inspired by market-based task assignments, we introduce a novel distributed coordination method to orchestrate autonomous agents’ actions efficiently in low communication scenarios. In particular, our algorithm takes into account asymmetric obstacles. While in the real world the majority of obstacles are asymmetric, they are usually treated as symmetric ones, thus limiting the applicability of existing methods. To summarize, the presented architecture is designed to tackle scenarios where the obstacles are active and asymmetric, the communication channel is poor, and the environment is partially observable. Our approach has been validated in simulation and in the real world, using a team of NAO robots during official RoboCup competitions. Experimental results show a notable reduction in task overlaps in limited communication settings, with a decrease of 52% in the most frequently reallocated task.
[AI-51] Safe and Certifiable AI Systems: Concepts, Challenges, and Lessons Learned
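[Quick read]: This paper addresses the lack of effective certification mechanisms for safety-critical AI systems in practice, that is, how to ensure that an AI system meets verifiable standards of safety, lawfulness, and social acceptability. The core of its solution is the TÜV AUSTRIA Trusted AI framework, which builds on three pillars (Secure Software Development, Functional Requirements, and Ethics and Data Privacy) and translates the high-level obligations of the EU AI Act into specific, testable criteria. Its central concept is "functional trustworthiness", which combines a statistically defined application domain, risk-based minimum performance requirements, and statistical testing on independently sampled data to provide transparent, reproducible evidence of model quality in real-world settings, enabling end-to-end auditing and certification of AI systems.
Link: https://arxiv.org/abs/2509.08852
Authors: Kajetan Schweighofer, Barbara Brune, Lukas Gruber, Simon Schmid, Alexander Aufreiter, Andreas Gruber, Thomas Doms, Sebastian Eder, Florian Mayer, Xaver-Paul Stadlbauer, Christoph Schwald, Werner Zellinger, Bernhard Nessler, Sepp Hochreiter
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 63 pages, 27 figures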
Abstract:There is an increasing adoption of artificial intelligence in safety-critical applications, yet practical schemes for certifying that AI systems are safe, lawful and socially acceptable remain scarce. This white paper presents the TÜV AUSTRIA Trusted AI framework, an end-to-end audit catalog and methodology for assessing and certifying machine learning systems. The audit catalog has been in continuous development since 2019 in an ongoing collaboration with scientific partners. Building on three pillars - Secure Software Development, Functional Requirements, and Ethics and Data Privacy - the catalog translates the high-level obligations of the EU AI Act into specific, testable criteria. Its core concept of functional trustworthiness couples a statistically defined application domain with risk-based minimum performance requirements and statistical testing on independently sampled data, providing transparent and reproducible evidence of model quality in real-world settings. We provide an overview of the functional requirements that we assess, which are oriented on the lifecycle of an AI system. In addition, we share some lessons learned from the practical application of the audit catalog, highlighting common pitfalls we encountered, such as data leakage scenarios, inadequate domain definitions, neglect of biases, or a lack of distribution drift controls. We further discuss key aspects of certifying AI systems, such as robustness, algorithmic fairness, or post-certification requirements, outlining both our current conclusions and a roadmap for future research. In general, by aligning technical best practices with emerging European standards, the approach offers regulators, providers, and users a practical roadmap for legally compliant, functionally trustworthy, and certifiable AI systems.
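The pairing of a minimum performance requirement with statistical testing on independently sampled data can be illustrated with a simple confidence-bound check. The sketch below is one possible instantiation, assuming i.i.d. test samples and accuracy as the metric; the Hoeffding bound, the function names, and the example thresholds are assumptions of this illustration and are not taken from the audit catalog.

```python
import math

def hoeffding_lower_bound(correct, n, delta=0.05):
    """One-sided (1 - delta) lower confidence bound on the true accuracy,
    valid for n i.i.d. test samples by Hoeffding's inequality."""
    return correct / n - math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def meets_requirement(correct, n, min_accuracy, delta=0.05):
    """True only if the requirement holds with confidence 1 - delta,
    not merely for the observed accuracy."""
    return hoeffding_lower_bound(correct, n, delta) >= min_accuracy

# Same observed accuracy (96%), very different amounts of evidence.
large = meets_requirement(correct=9600, n=10000, min_accuracy=0.94)
small = meets_requirement(correct=96, n=100, min_accuracy=0.94)
```

With 10,000 samples the bound clears the hypothetical 94% requirement, while with 100 samples the same observed accuracy does not, which shows why the size of the independent test set matters for this kind of evidence.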
[AI-52] Uncertainty Estimation using Variance-Gated Distributions
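[Quick read]: This paper addresses the evaluation of per-sample uncertainty quantification in neural networks, in particular how to separate model-related (epistemic) uncertainty from uncertainty due to inherent data noise (aleatoric) in high-risk applications. Conventional approaches rely on the predictive distribution of Bayesian or approximate models and split uncertainty by additive decomposition, which has recently been questioned. The paper proposes an intuitive framework based on the signal-to-noise ratio of class probability distributions. Its key element is a variance-gated measure that derives a confidence factor from the variance of predictions across ensemble members and uses it to scale predictions, which the authors use to discuss whether the diversity of committee machines collapses.
Link: https://arxiv.org/abs/2509.08846
Authors: H. Martin Gillis, Isaac Xu, Thomas Trappenberg
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: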
Abstract:Evaluation of per-sample uncertainty quantification from neural networks is essential for decision-making involving high-risk applications. A common approach is to use the predictive distribution from Bayesian or approximation models and decompose the corresponding predictive uncertainty into epistemic (model-related) and aleatoric (data-related) components. However, additive decomposition has recently been questioned. In this work, we propose an intuitive framework for uncertainty estimation and decomposition based on the signal-to-noise ratio of class probability distributions across different model predictions. We introduce a variance-gated measure that scales predictions by a confidence factor derived from ensembles. We use this measure to discuss the existence of a collapse in the diversity of committee machines.
[AI-53] Deep opacity and AI: A threat to XAI and to privacy protection mechanisms
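[Quick read]: This paper addresses the threat that big data analytics and AI pose to privacy in practice, which stems in particular from the "black box problem" of AI systems: the agents involved cannot provide the justifications needed to protect privacy, for example they cannot give effective informed consent or guarantee anonymity. The paper attributes this failure to three kinds of opacity: shallow opacity (the subjects do not know what the system does), standard black box opacity (the analysts do not know what the system does), and deep opacity (the analysts cannot possibly know what the system might do). The key to the analysis is distinguishing these three kinds and using them to identify the root cause of failing privacy protection mechanisms, namely the lack of an explainable basis for judgments. On that basis, the author proposes using technical means to improve the explainability of AI systems and thereby make privacy protection more effective.
Link: https://arxiv.org/abs/2509.08835
Authors: Vincent C. Müller
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: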
Abstract:It is known that big data analytics and AI pose a threat to privacy, and that some of this is due to some kind of “black box problem” in AI. I explain how this becomes a problem in the context of justification for judgments and actions. Furthermore, I suggest distinguishing three kinds of opacity: 1) the subjects do not know what the system does (“shallow opacity”), 2) the analysts do not know what the system does (“standard black box opacity”), or 3) the analysts cannot possibly know what the system might do (“deep opacity”). If the agents, data subjects as well as analytics experts, operate under opacity, then these agents cannot provide justifications for judgments that are necessary to protect privacy, e.g., they cannot give “informed consent”, or guarantee “anonymity”. It follows from these points that agents in big data analytics and AI often cannot make the judgments needed to protect privacy. So I conclude that big data analytics makes the privacy problems worse and the remedies less effective. As a positive note, I provide a brief outlook on technical ways to handle this situation.
[AI-54] An Interval Type-2 Version of Bayes Theorem Derived from Interval Probability Range Estimates Provided by Subject Matter Experts
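[Quick read]: This paper addresses the unrealistic assumption in conventional Bayesian inference that input probabilities are precise values, and in particular how to incorporate uncertainty into Bayesian reasoning when subject matter experts (SMEs) can only provide interval range estimates. The key to its solution is an interval type-2 (IT2) extension of Bayes Theorem: first, a novel and conservative computation method that prevents potential inconsistencies among the input IT2 fuzzy membership functions (MFs) from producing invalid outputs; second, a flexible and general algorithm for encoding SME-provided intervals into IT2 fuzzy membership functions that specify the priors and likelihoods in Bayes Theorem, generalizing earlier work that mainly addressed encoding intervals into word MFs for Computing with Words applications.
Link: https://arxiv.org/abs/2509.08834
Authors: John T. Rickard, William A. Dembski, James Rickards
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an); Computational Finance (q-fin.CP)
Comments: 13 pages, 12 figures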
Abstract:Bayesian inference is widely used in many different fields to test hypotheses against observations. In most such applications, an assumption is made of precise input values to produce a precise output value. However, this is unrealistic for real-world applications. Often the best available information from subject matter experts (SMEs) in a given field is interval range estimates of the input probabilities involved in Bayes Theorem. This paper provides two key contributions to extend Bayes Theorem to an interval type-2 (IT2) version. First, we develop an IT2 version of Bayes Theorem that uses a novel and conservative method to avoid potential inconsistencies in the input IT2 MFs that otherwise might produce invalid output results. We then describe a novel and flexible algorithm for encoding SME-provided intervals into IT2 fuzzy membership functions (MFs), which we can use to specify the input probabilities in Bayes Theorem. Our algorithm generalizes and extends previous work on this problem that primarily addressed the encoding of intervals into word MFs for Computing with Words applications.
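As a simpler point of reference for what interval inputs do to Bayes Theorem, the sketch below propagates plain probability intervals through the two-hypothesis form of the theorem. It is ordinary interval arithmetic, not the paper's IT2 fuzzy construction; the function names and the example intervals are hypothetical, and the code assumes every denominator is strictly positive.

```python
def posterior(prior, like_h, like_not_h):
    """P(H | E) from P(H), P(E | H) and P(E | not H)."""
    numerator = like_h * prior
    return numerator / (numerator + like_not_h * (1.0 - prior))

def interval_posterior(prior, like_h, like_not_h):
    """Bounds on P(H | E) when each input is a (low, high) interval.

    The posterior increases with the prior and with P(E | H), and decreases
    with P(E | not H), so the extremes sit at opposite interval corners.
    """
    low = posterior(prior[0], like_h[0], like_not_h[1])
    high = posterior(prior[1], like_h[1], like_not_h[0])
    return low, high

# Hypothetical expert ranges: prior 1-2%, hit rate 90-95%, false alarms 5-10%.
low, high = interval_posterior((0.01, 0.02), (0.90, 0.95), (0.05, 0.10))
```

Even fairly narrow input ranges give a wide posterior range here (roughly 0.08 to 0.28), which is the kind of spread a single precise-valued calculation would hide.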
[AI-55] PerFairX: Is There a Balance Between Fairness and Personality in Large Language Model Recommendations? ICCV2025 ICCV
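[Quick read]: This paper addresses the tension between psychological alignment and group fairness that arises when generative AI brings user personality traits (based on the OCEAN model) into recommender systems. The key to its solution is PerFairX, a unified evaluation framework that compares recommendations for different user groups under neutral and personality-sensitive prompts and quantifies the trade-off between personalization and demographic fairness, providing an interpretable, comparable benchmark for building large language model (LLM) recommender systems that are both psychologically informed and fair.
Link: https://arxiv.org/abs/2509.08829
Authors: Chandan Kumar Sah
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 10 pages, 5 figures. Accepted to the Workshop on Multimodal Continual Learning (MCL) at ICCV 2025. @2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), ICCV’s 2025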
Abstract:The integration of Large Language Models (LLMs) into recommender systems has enabled zero-shot, personality-based personalization through prompt-based interactions, offering a new paradigm for user-centric recommendations. However, incorporating user personality traits via the OCEAN model highlights a critical tension between achieving psychological alignment and ensuring demographic fairness. To address this, we propose PerFairX, a unified evaluation framework designed to quantify the trade-offs between personalization and demographic equity in LLM-generated recommendations. Using neutral and personality-sensitive prompts across diverse user profiles, we benchmark two state-of-the-art LLMs, ChatGPT and DeepSeek, on movie (MovieLens 10M) and music (this http URL 360K) datasets. Our results reveal that personality-aware prompting significantly improves alignment with individual traits but can exacerbate fairness disparities across demographic groups. Specifically, DeepSeek achieves stronger psychological fit but exhibits higher sensitivity to prompt variations, while ChatGPT delivers stable yet less personalized outputs. PerFairX provides a principled benchmark to guide the development of LLM-based recommender systems that are both equitable and psychologically informed, contributing to the creation of inclusive, user-centric AI applications in continual learning contexts.
[AI-56] Beyond Negative Transfer: Disentangled Preference-Guided Diffusion for Cross-Domain Sequential Recommendation
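[Quick read]: This paper addresses negative transfer caused by domain heterogeneity in Cross-Domain Sequential Recommendation (CDSR): when user behaviors from several domains are merged, domain-specific preferences and noise are easily conflated, which harms recommendation quality. Its key solution is the Disentangled Preference-Guided Diffusion Model (DPG-Diff), which the authors describe as the first application of diffusion models (DMs) to CDSR. DPG-Diff explicitly decomposes user preferences into domain-invariant and domain-specific components and uses both to guide the reverse diffusion process, separating and denoising complex preference signals, which makes cross-domain knowledge transfer more robust and suppresses negative transfer.
Link: https://arxiv.org/abs/2509.00389
Authors: Xiaoxin Ye, Chengkai Huang, Hongtao Huang, Lina Yao
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: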
Abstract:Cross-Domain Sequential Recommendation (CDSR) leverages user behaviors across domains to enhance recommendation quality. However, naive aggregation of sequential signals can introduce conflicting domain-specific preferences, leading to negative transfer. While Sequential Recommendation (SR) already suffers from noisy behaviors such as misclicks and impulsive actions, CDSR further amplifies this issue due to domain heterogeneity arising from diverse item types and user intents. The core challenge is disentangling three intertwined signals: domain-invariant preferences, domain-specific preferences, and noise. Diffusion Models (DMs) offer a generative denoising framework well-suited for disentangling complex user preferences and enhancing robustness to noise. Their iterative refinement process enables gradual denoising, making them effective at capturing subtle preference signals. However, existing applications in recommendation face notable limitations: sequential DMs often conflate shared and domain-specific preferences, while cross-domain collaborative filtering DMs neglect temporal dynamics, limiting their ability to model evolving user preferences. To bridge these gaps, we propose DPG-Diff, a novel Disentangled Preference-Guided Diffusion Model, the first diffusion-based approach tailored for CDSR, to our best knowledge. DPG-Diff decomposes user preferences into domain-invariant and domain-specific components, which jointly guide the reverse diffusion process. This disentangled guidance enables robust cross-domain knowledge transfer, mitigates negative transfer, and filters sequential noise. Extensive experiments on real-world datasets demonstrate that DPG-Diff consistently outperforms state-of-the-art baselines across multiple metrics.
[AI-57] Personalized Sleep Prediction via Deep Adaptive Spatiotemporal Modeling and Sparse Data
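[Quick read]: This paper addresses personalized sleep score prediction, in particular how to obtain accurate, robust, and generalizable forecasts when data are sparse (for example, limited physiological signals from consumer wearables) and differences between individuals are large. The key to its solution is an adaptive spatiotemporal model (AdaST-Sleep) that combines convolutional neural networks (CNNs) to capture spatial interactions among features with recurrent neural networks (RNNs) to model the temporal dynamics of long-term health data, and adds a domain classifier to improve generalization across subjects. Experiments show that the method outperforms four baseline models across combinations of input and prediction windows, reaching its lowest root mean square error (RMSE = 0.282) with a 7-day input window and a 1-day prediction window, and that it remains stable when forecasting several days ahead.
Link: https://arxiv.org/abs/2509.09018
Authors: Xueyi Wang, C. J. C. (Claudine) Lamoth, Elisabeth Wilhelm
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The paper has been accepted and presented at the 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society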
Abstract:A sleep forecast allows individuals and healthcare providers to anticipate and proactively address factors influencing restful rest, ultimately improving mental and physical well-being. This work presents an adaptive spatial and temporal model (AdaST-Sleep) for predicting sleep scores. Our proposed model combines convolutional layers to capture spatial feature interactions between multiple features and recurrent neural network layers to handle longer-term temporal health-related data. A domain classifier is further integrated to generalize across different subjects. We conducted several experiments using five input window sizes (3, 5, 7, 9, 11 days) and five predicting window sizes (1, 3, 5, 7, 9 days). Our approach consistently outperformed four baseline models, achieving its lowest RMSE (0.282) with a seven-day input window and a one-day predicting window. Moreover, the method maintained strong performance even when forecasting multiple days into the future, demonstrating its versatility for real-world applications. Visual comparisons reveal that the model accurately tracks both the overall sleep score level and daily fluctuations. These findings prove that the proposed framework provides a robust and adaptable solution for personalized sleep forecasting using sparse data from commercial wearable devices and domain adaptation techniques.
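The input-window and predicting-window setup can be made concrete with a small sketch of how a daily score series might be cut into training pairs and scored with RMSE. This is a generic illustration, assuming one score per day and consecutive non-missing days; the helper names are hypothetical and nothing here reproduces the AdaST-Sleep model itself.

```python
import math

def make_windows(series, input_days, predict_days):
    """Slide over a daily series and return (input, target) pairs where the
    target is the `predict_days` values that follow each input window."""
    pairs = []
    last_start = len(series) - input_days - predict_days
    for start in range(last_start + 1):
        split = start + input_days
        pairs.append((series[start:split], series[split:split + predict_days]))
    return pairs

def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(actual))

# Ten days of placeholder scores, seven-day input, one-day prediction.
pairs = make_windows(list(range(10)), input_days=7, predict_days=1)
```

Longer windows leave fewer pairs per subject, which is one reason window size is worth sweeping when the data are sparse.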
Machine Learning
[LG-0] Functional Groups are All you Need for Chemically Interpretable Molecular Property Prediction
Link: https://arxiv.org/abs/2509.09619
Authors: Roshan Balaji, Joe Bobby, Nirav Pravinbhai Bhatt
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Molecular property prediction using deep learning (DL) models has accelerated drug and materials discovery, but the resulting DL models often lack interpretability, hindering their adoption by chemists. This work proposes developing molecule representations using the concept of Functional Groups (FG) in chemistry. We introduce the Functional Group Representation (FGR) framework, a novel approach to encoding molecules based on their fundamental chemical substructures. Our method integrates two types of functional groups: those curated from established chemical knowledge (FG), and those mined from a large molecular corpus using sequential pattern mining (MFG). The resulting FGR framework encodes molecules into a lower-dimensional latent space by leveraging pre-training on a large dataset of unlabeled molecules. Furthermore, the proposed framework allows the inclusion of 2D structure-based descriptors of molecules. We demonstrate that the FGR framework achieves state-of-the-art performance on a diverse range of 33 benchmark datasets spanning physical chemistry, biophysics, quantum mechanics, biological activity, and pharmacokinetics while enabling chemical interpretability. Crucially, the model’s representations are intrinsically aligned with established chemical principles, allowing chemists to directly link predicted properties to specific functional groups and facilitating novel insights into structure-property relationships. Our work presents a significant step toward developing high-performing, chemically interpretable DL models for molecular discovery.
[LG-1] ReBaNO: Reduced Basis Neural Operator Mitigating Generalization Gaps and Achieving Discretization Invariance
Link: https://arxiv.org/abs/2509.09611
Authors: Haolan Zheng, Yanlai Chen, Jiequn Han, Yue Yu
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:
Abstract:We propose a novel data-lean operator learning algorithm, the Reduced Basis Neural Operator (ReBaNO), to solve a group of PDEs with multiple distinct inputs. Inspired by the Reduced Basis Method and the recently introduced Generative Pre-Trained Physics-Informed Neural Networks, ReBaNO relies on a mathematically rigorous greedy algorithm to build its network structure offline adaptively from the ground up. Knowledge distillation via task-specific activation function allows ReBaNO to have a compact architecture requiring minimal computational cost online while embedding physics. In comparison to state-of-the-art operator learning algorithms such as PCA-Net, DeepONet, FNO, and CNO, numerical results demonstrate that ReBaNO significantly outperforms them in terms of eliminating/shrinking the generalization gap for both in- and out-of-distribution tests and being the only operator learning algorithm achieving strict discretization invariance.
[LG-2] Conditioning on PDE Parameters to Generalise Deep Learning Emulation of Stochastic and Chaotic Dynamics
Link: https://arxiv.org/abs/2509.09599
Authors: Ira J.S. Shokar, Rich R. Kerswell, Peter H. Haynes
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:
Abstract:We present a deep learning emulator for stochastic and chaotic spatio-temporal systems, explicitly conditioned on the parameter values of the underlying partial differential equations (PDEs). Our approach involves pre-training the model on a single parameter domain, followed by fine-tuning on a smaller, yet diverse dataset, enabling generalisation across a broad range of parameter values. By incorporating local attention mechanisms, the network is capable of handling varying domain sizes and resolutions. This enables computationally efficient pre-training on smaller domains while requiring only a small additional dataset to learn how to generalise to larger domain sizes. We demonstrate the model’s capabilities on the chaotic Kuramoto-Sivashinsky equation and stochastically-forced beta-plane turbulence, showcasing its ability to capture phenomena at interpolated parameter values. The emulator provides significant computational speed-ups over conventional numerical integration, facilitating efficient exploration of parameter space, while a probabilistic variant of the emulator provides uncertainty quantification, allowing for the statistical study of rare events.
[LG-3] What Does Normal Even Mean? Evaluating Benign Traffic in Intrusion Detection Datasets
Link: https://arxiv.org/abs/2509.09564
Authors: Meghan Wilkinson, Robert H Thomson
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 10 pages; accepted to SBP-BRiMS 2025 Poster Session
Abstract:Supervised machine learning techniques rely on labeled data to achieve high task performance, but this requires the labels to capture some meaningful differences in the underlying data structure. For training network intrusion detection algorithms, most datasets contain a series of attack classes and a single large benign class which captures all non-attack network traffic. A review of intrusion detection papers and guides that explicitly state their data preprocessing steps identified that the majority took the labeled categories of the dataset at face value when training their algorithms. The present paper evaluates the structure of benign traffic in several common intrusion detection datasets (NSL-KDD, UNSW-NB15, and CIC-IDS 2017) and determines whether there are meaningful sub-categories within this traffic which may improve overall multi-classification performance using common machine learning techniques. We present an overview of some unsupervised clustering techniques (e.g., HDBSCAN, Mean Shift Clustering) and show how they differentially cluster the benign traffic space.
[LG-4] Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates
Link: https://arxiv.org/abs/2509.09550
Authors: Harry Julia, Rachel Beeson, Lohith Konathala, Johanna Ulin, Jiameng Gao
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Abstract:Neural Audio Codecs (NACs) have been increasingly adopted in speech processing tasks due to their excellent rate-distortion performance and their compatibility with Large Language Models (LLMs) as discrete feature representations for audio generation. While most existing codecs rely on Residual Vector Quantization (RVQ), Finite Scalar Quantization (FSQ) has recently emerged as a compelling alternative that simplifies training and natively supports single codebooks. We introduce NeuCodec, an FSQ-based NAC, and show that FSQ encodes baked-in redundancy, producing an encoding that is robust when transmitted through noisy channels. First, through an encoder distillation experiment, we show that two different encoders can learn to encode identical audio into vastly different code sequences while maintaining comparable reconstruction quality with the same quantizer and decoder. Second, we demonstrate that FSQ has vastly superior bit-level perturbation robustness by comparing the performance of RVQ and FSQ codecs when simulating the transmission of code sequences through a noisy channel.
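For readers unfamiliar with FSQ, the following is a minimal sketch of the general idea: each latent dimension is bounded and rounded to a small integer grid, and the per-dimension codes are packed into a single token. It is not NeuCodec's quantizer. The function names, the odd level counts, and the mixed-radix packing are assumptions made for illustration.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each latent dimension to one of `levels[i]` integer values.

    Assumes every entry of `levels` is odd, so the grid is symmetric around 0.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) // 2
        # tanh bounds the value to (-1, 1); scaling and rounding snaps it to the grid.
        codes.append(round(math.tanh(value) * half))
    return codes

def codes_to_index(codes, levels):
    """Pack per-dimension codes into one integer token (mixed-radix encoding)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index = index * n_levels + (code + half)
    return index

def index_to_codes(index, levels):
    """Invert codes_to_index."""
    codes = []
    for n_levels in reversed(levels):
        half = (n_levels - 1) // 2
        index, digit = divmod(index, n_levels)
        codes.append(digit - half)
    return codes[::-1]

levels = [5, 5, 3]                      # hypothetical grid sizes per dimension
codes = fsq_quantize([0.3, -2.0, 10.0], levels)
token = codes_to_index(codes, levels)
```

In this layout, changing one digit of the token moves a single latent dimension to another grid value and leaves the others untouched, which gives some intuition for why scalar grids may degrade gracefully under perturbation.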
[LG-5] ProDiGy: Proximity- and Dissimilarity-Based Byzantine-Robust Federated Learning
Link: https://arxiv.org/abs/2509.09534
Authors: Sena Ergisi, Luis Maßny, Rawad Bitar
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract:Federated Learning (FL) emerged as a widely studied paradigm for distributed learning. Despite its many advantages, FL remains vulnerable to adversarial attacks, especially under data heterogeneity. We propose a new Byzantine-robust FL algorithm called ProDiGy. The key novelty lies in evaluating the client gradients using a joint dual scoring system based on the gradients’ proximity and dissimilarity. We demonstrate through extensive numerical experiments that ProDiGy outperforms existing defenses in various scenarios. In particular, when the clients’ data do not follow an IID distribution, while other defense mechanisms fail, ProDiGy maintains strong defense capabilities and model accuracy. These findings highlight the effectiveness of a dual perspective approach that promotes natural similarity among honest clients while detecting suspicious uniformity as a potential indicator of an attack.
[LG-6] Cough Classification using Few-Shot Learning
Link: https://arxiv.org/abs/2509.09515
Authors: Yoga Disha Sendhil Kumar, Manas V Shetty, Sudip Vhaduri
Subjects: Machine Learning (cs.LG)
Comments: 8 pages, 8 images; accepted at Pervasive Health 2025
Abstract:This paper investigates the effectiveness of few-shot learning for respiratory sound classification, focusing on cough-based detection of COVID-19, Flu, and healthy conditions. We leverage Prototypical Networks with spectrogram representations of cough sounds to address the challenge of limited labeled data. Our study evaluates whether few-shot learning can enable models to achieve performance comparable to traditional deep learning approaches while using significantly fewer training samples. Additionally, we compare multi-class and binary classification models to assess whether multi-class models can perform comparably to their binary counterparts. Experimental findings show that few-shot learning models can achieve competitive accuracy. Our model attains 74.87% accuracy in multi-class classification with only 15 support examples per class, while binary classification achieves over 70% accuracy across all class pairs. Class-wise analysis reveals Flu as the most distinguishable class, and Healthy as the most challenging. Statistical tests (paired t-test p = 0.149, Wilcoxon p = 0.125) indicate no significant performance difference between binary and multi-class models, supporting the viability of multi-class classification in this setting. These results highlight the feasibility of applying few-shot learning in medical diagnostics, particularly when large labeled datasets are unavailable.
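As a rough illustration of how a Prototypical Network classifies at inference time, the sketch below averages the support embeddings of each class into a prototype and assigns a query to the nearest one. The embeddings are assumed to come from some already-trained encoder. The two-dimensional vectors and function names are made up for the example and are not taken from the paper.

```python
import math

def prototypes(support):
    """Average the support embeddings of each class into one prototype vector."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return protos

def classify(query, protos):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(protos, key=lambda label: math.dist(query, protos[label]))

# Toy embeddings standing in for encoder outputs on cough spectrograms.
support = {
    "flu": [[1.0, 1.0], [3.0, 1.0]],
    "healthy": [[-1.0, -1.0], [-3.0, -1.0]],
}
protos = prototypes(support)
label = classify([1.5, 0.5], protos)
```

With 15 support examples per class, as in the abstract, each prototype would simply be the mean of 15 embedding vectors.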
[LG-7] PIPES: A Meta-dataset of Machine Learning Pipelines
Link: https://arxiv.org/abs/2509.09512
Authors: Cynthia Moreira Maia, Lucas B. V. de Amorim, George D. C. Cavalcanti, Rafael M. O. Cruz
Subjects: Machine Learning (cs.LG)
Abstract:Solutions to the Algorithm Selection Problem (ASP) in machine learning face the challenge of high computational costs associated with evaluating various algorithms’ performances on a given dataset. To mitigate this cost, the meta-learning field can leverage previously executed experiments shared in online repositories such as OpenML. OpenML provides an extensive collection of machine learning experiments. However, an analysis of OpenML’s records reveals limitations. It lacks diversity in pipelines, specifically when exploring data preprocessing steps/blocks, such as scaling or imputation, resulting in limited representation. Its experiments are often focused on a few popular techniques within each pipeline block, leading to an imbalanced sample. To overcome the observed limitations of OpenML, we propose PIPES, a collection of experiments involving multiple pipelines designed to represent all combinations of the selected sets of techniques, aiming at diversity and completeness. PIPES stores the results of experiments performed applying 9,408 pipelines to 300 datasets. It includes detailed information on the pipeline blocks, training and testing times, predictions, performances, and the eventual error messages. This comprehensive collection of results allows researchers to perform analyses across diverse and representative pipelines and datasets. PIPES also offers potential for expansion, as additional data and experiments can be incorporated to support the meta-learning community further. The data, code, supplementary material, and all experiments can be found at this https URL.
[LG-8] Balancing Utility and Privacy: Dynamically Private SGD with Random Projection
Link: https://arxiv.org/abs/2509.09485
Authors: Zhanhong Jiang, Md Zahid Hasan, Nastaran Saadati, Aditya Balu, Chao Liu, Soumik Sarkar
Subjects: Machine Learning (cs.LG)
Comments: 27 pages, 13 figures
Abstract:Stochastic optimization is a pivotal enabler in modern machine learning, producing effective models for various tasks. However, several existing works have shown that model parameters and gradient information are susceptible to privacy leakage. Although Differentially Private SGD (DPSGD) addresses privacy concerns, its static noise mechanism impacts the error bounds for model performance. Additionally, with the exponential increase in model parameters, efficient learning of these models using stochastic optimizers has become more challenging. To address these concerns, we introduce the Dynamically Differentially Private Projected SGD (D2P2-SGD) optimizer. In D2P2-SGD, we combine two important ideas: (i) dynamic differential privacy (DDP) with automatic gradient clipping and (ii) random projection with SGD, allowing dynamic adjustment of the tradeoff between utility and privacy of the model. It exhibits provably sub-linear convergence rates across different objective functions, matching the best available rate. The theoretical analysis further suggests that DDP leads to better utility at the cost of privacy, while random projection enables more efficient model learning. Extensive experiments across diverse datasets show that D2P2-SGD remarkably enhances accuracy while maintaining privacy. Our code is available here.
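The two ingredients named in the abstract, noisy clipped gradients and random projection, can be combined in many ways. The sketch below shows one generic arrangement and should not be read as the D2P2-SGD update rule. The order of operations, the Gaussian projection matrix, and all parameter names are assumptions. The noise level `sigma` is passed per step so that a caller can vary it over training.

```python
import math
import random

def clip(grad, max_norm):
    """Rescale a gradient so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def random_projection(dim, k, rng):
    """Draw a k x dim Gaussian matrix with entries N(0, 1/k)."""
    std = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(k)]

def private_projected_step(params, per_sample_grads, lr, max_norm, sigma, k, rng):
    """One illustrative step: clip, average, add noise, project, update."""
    dim = len(params)
    n = len(per_sample_grads)
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    mean = [sum(g[i] for g in clipped) / n for i in range(dim)]
    # Gaussian noise scaled to the clipping bound, as in standard DP-SGD.
    noisy = [m + rng.gauss(0.0, sigma * max_norm / n) for m in mean]
    proj = random_projection(dim, k, rng)
    # Compress to k dimensions, then map back to parameter space.
    low = [sum(row[i] * noisy[i] for i in range(dim)) for row in proj]
    back = [sum(proj[j][i] * low[j] for j in range(k)) for i in range(dim)]
    return [p - lr * b for p, b in zip(params, back)]

rng = random.Random(0)
grads = [[3.0, 4.0, 0.0], [0.0, 0.1, 0.0]]
new_params = private_projected_step([0.0, 0.0, 0.0], grads, lr=0.1,
                                    max_norm=1.0, sigma=0.5, k=2, rng=rng)
```

A privacy guarantee depends on how `sigma` and `max_norm` are accounted for across all steps, which this sketch does not attempt.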
[LG-9] Database Views as Explanations for Relational Deep Learning
Link: https://arxiv.org/abs/2509.09482
Authors: Agapi Rissaki, Ilias Fountalis, Wolfgang Gatterbauer, Benny Kimelfeld
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Abstract:In recent years, there has been significant progress in the development of deep learning models over relational databases, including architectures based on heterogeneous graph neural networks (hetero-GNNs) and heterogeneous graph transformers. In effect, such architectures state how the database records and links (e.g., foreign-key references) translate into a large, complex numerical expression, involving numerous learnable parameters. This complexity makes it hard to explain, in human-understandable terms, how a model uses the available data to arrive at a given prediction. We present a novel framework for explaining machine-learning models over relational databases, where explanations are view definitions that highlight focused parts of the database that mostly contribute to the model’s prediction. We establish such global abductive explanations by adapting the classic notion of determinacy by Nash, Segoufin, and Vianu (2010). In addition to tuning the tradeoff between determinacy and conciseness, the framework allows controlling the level of granularity by adopting different fragments of view definitions, such as ones highlighting whole columns, foreign keys between tables, relevant groups of tuples, and so on. We investigate the realization of the framework in the case of hetero-GNNs. We develop heuristic algorithms that avoid the exhaustive search over the space of all databases. We propose techniques that are model-agnostic, and others that are tailored to hetero-GNNs via the notion of learnable masking. Our approach is evaluated through an extensive empirical study on the RelBench collection, covering a variety of domains and different record-level tasks. The results demonstrate the usefulness of the proposed explanations, as well as the efficiency of their generation.
[LG-10] CountTRuCoLa: Rule Confidence Learning for Temporal Knowledge Graph Forecasting
Link: https://arxiv.org/abs/2509.09474
Authors: Julia Gastinger, Christian Meilicke, Heiner Stuckenschmidt
Subjects: Machine Learning (cs.LG)
Abstract:We address the task of temporal knowledge graph (TKG) forecasting by introducing a fully explainable method based on temporal rules. Motivated by recent work proposing a strong baseline using recurrent facts, our approach learns four simple types of rules with a confidence function that considers both recency and frequency. Evaluated on nine datasets, our method matches or surpasses the performance of eight state-of-the-art models and two baselines, while providing fully interpretable predictions.
[LG-11] AEGIS: An Agent for Extraction and Geographic Identification in Scholarly Proceedings
Link: https://arxiv.org/abs/2509.09470
Authors: Om Vishesh, Harshad Khadilkar, Deepak Akkil
Subjects: Machine Learning (cs.LG)
Comments: 5 pages, 2 figures
Abstract:Keeping pace with the rapid growth of academic literature presents a significant challenge for researchers, funding bodies, and academic societies. To address the time-consuming manual effort required for scholarly discovery, we present a novel, fully automated system that transitions from data discovery to direct action. Our pipeline demonstrates how a specialized AI agent, ‘Agent-E’, can be tasked with identifying papers from specific geographic regions within conference proceedings and then executing a Robotic Process Automation (RPA) to complete a predefined action, such as submitting a nomination form. We validated our system on 586 papers from five different conferences, where it successfully identified every target paper with a recall of 100% and a near-perfect accuracy of 99.4%. This demonstration highlights the potential of task-oriented AI agents to not only filter information but also to actively participate in and accelerate the workflows of the academic community.
[LG-12] AquaCast: Urban Water Dynamics Forecasting with Precipitation-Informed Multi-Input Transformer
Link: https://arxiv.org/abs/2509.09458
Authors: Golnoosh Abdollahinejad, Saleh Baghersalimi, Denisa-Andreea Constantinescu, Sergey Shevchik, David Atienza
Subjects: Machine Learning (cs.LG)
Comments: This work has been submitted to Journal of Hydrology, Elsevier, and a preprint version is also available at SSRN https://doi.org/10.2139/ssrn.5399833
Abstract:This work addresses the challenge of forecasting urban water dynamics by developing a multi-input, multi-output deep learning model that incorporates both endogenous variables (e.g., water height or discharge) and exogenous factors (e.g., precipitation history and forecast reports). Unlike conventional forecasting, the proposed model, AquaCast, captures both inter-variable and temporal dependencies across all inputs, while forecasting only the endogenous variables. Exogenous inputs are fused via an embedding layer, eliminating the need to forecast them and enabling the model to attend to their short-term influences more effectively. We evaluate our approach on the LausanneCity dataset, which includes measurements from four urban drainage sensors, and demonstrate state-of-the-art performance when using only endogenous variables. Performance also improves with the inclusion of exogenous variables and forecast reports. To assess generalization and scalability, we additionally test the model on three large-scale synthesized datasets, generated from MeteoSwiss records, the Lorenz Attractors model, and the Random Fields model, each representing a different level of temporal complexity across 100 nodes. The results confirm that our model consistently outperforms existing baselines and maintains a robust and accurate forecast across both real and synthetic datasets.
[LG-13] Composable Score-based Graph Diffusion Model for Multi-Conditional Molecular Generation
Link: https://arxiv.org/abs/2509.09451
Authors: Anjie Qiao, Zhen Wang, Chuan Chen, DeFu Lian, Enhong Chen
Subjects: Machine Learning (cs.LG)
Abstract:Controllable molecular graph generation is essential for material and drug discovery, where generated molecules must satisfy diverse property constraints. While recent advances in graph diffusion models have improved generation quality, their effectiveness in multi-conditional settings remains limited due to reliance on joint conditioning or continuous relaxations that compromise fidelity. To address these limitations, we propose Composable Score-based Graph Diffusion model (CSGD), the first model that extends score matching to discrete graphs via concrete scores, enabling flexible and principled manipulation of conditional guidance. Building on this foundation, we introduce two score-based techniques: Composable Guidance (CoG), which allows fine-grained control over arbitrary subsets of conditions during sampling, and Probability Calibration (PC), which adjusts estimated transition probabilities to mitigate train-test mismatches. Empirical results on four molecular datasets show that CSGD achieves state-of-the-art performance, with a 15.3% average improvement in controllability over prior methods, while maintaining high validity and distributional fidelity. Our findings highlight the practical advantages of score-based modeling for discrete graph generation and its capacity for flexible, multi-property molecular design.
[LG-14] Fused Lasso Improves Accuracy of Co-occurrence Network Inference in Grouped Samples
Link: https://arxiv.org/abs/2509.09413
Authors: Daniel Agyapong, Briana H. Beatty, Peter G. Kennedy, Toby D. Hocking
Subjects: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
Abstract:Co-occurrence network inference algorithms have significantly advanced our understanding of microbiome communities. However, these algorithms typically analyze microbial associations within samples collected from a single environmental niche, often capturing only static snapshots rather than dynamic microbial processes. Previous studies have commonly grouped samples from different environmental niches together without fully considering how microbial communities adapt their associations when faced with varying ecological conditions. Our study addresses this limitation by explicitly investigating both spatial and temporal dynamics of microbial communities. We analyzed publicly available microbiome abundance data across multiple locations and time points, to evaluate algorithm performance in predicting microbial associations using our proposed Same-All Cross-validation (SAC) framework. SAC evaluates algorithms in two distinct scenarios: training and testing within the same environmental niche (Same), and training and testing on combined data from multiple environmental niches (All). To overcome the limitations of conventional algorithms, we propose fuser, an algorithm that, while not entirely new in machine learning, is novel for microbiome community network inference. It retains subsample-specific signals while simultaneously sharing relevant information across environments during training. Unlike standard approaches that infer a single generalized network from combined data, fuser generates distinct, environment-specific predictive networks. Our results demonstrate that fuser achieves comparable predictive performance to existing algorithms such as glmnet when evaluated within homogeneous environments (Same), and notably reduces test error compared to baseline algorithms in cross-environment (All) scenarios.
[LG-15] Kriging prior Regression: A Case for Kriging-Based Spatial Features with TabPFN in Soil Mapping
Link: https://arxiv.org/abs/2509.09408
Authors: Jonas Schmidinger, Viacheslav Barkov, Sebastian Vogel, Martin Atzmueller, Gerard B M Heuvelink
Subjects: Machine Learning (cs.LG)
Abstract:Machine learning and geostatistics are two fundamentally different frameworks for predicting and spatially mapping soil properties. Geostatistics leverages the spatial structure of soil properties, while machine learning captures the relationship between available environmental features and soil properties. We propose a hybrid framework that enriches ML with spatial context through engineering of ‘spatial lag’ features from ordinary kriging. We call this approach ‘kriging prior regression’ (KpR), as it follows the inverse logic of regression kriging. To evaluate this approach, we assessed both the point and probabilistic prediction performance of KpR, using the TabPFN model across six field-scale datasets from LimeSoDa. These datasets included soil organic carbon, clay content, and pH, along with features derived from remote sensing and in-situ proximal soil sensing. KpR with TabPFN demonstrated reliable uncertainty estimates and more accurate predictions in comparison to several other spatial techniques (e.g., regression/residual kriging with TabPFN), as well as to established non-spatial machine learning algorithms (e.g., random forest). Most notably, it significantly improved the average $R^2$ by around 30% compared to machine learning algorithms without spatial context. This improvement was due to the strong prediction performance of the TabPFN algorithm itself and the complementary spatial information provided by KpR features. TabPFN is particularly effective for prediction tasks with small sample sizes, common in precision agriculture, whereas KpR can compensate for weak relationships between sensing features and soil properties when proximal soil sensing data are limited. Hence, we conclude that KpR with TabPFN is a very robust and versatile modelling framework for digital soil mapping in precision agriculture.
[LG-16] Expressive Power of Deep Networks on Manifolds: Simultaneous Approximation
Link: https://arxiv.org/abs/2509.09362
Authors: Hanfei Zhou, Lei Shi
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:A key challenge in scientific machine learning is solving partial differential equations (PDEs) on complex domains, where the curved geometry complicates the approximation of functions and their derivatives required by differential operators. This paper establishes the first simultaneous approximation theory for deep neural networks on manifolds. We prove that a constant-depth $\mathrm{ReLU}^{k-1}$ network with bounded weights (a property that plays a crucial role in controlling generalization error) can approximate any function in the Sobolev space $\mathcal{W}_p^k(\mathcal{M}^d)$ to an error of $\varepsilon$ in the $\mathcal{W}_p^s(\mathcal{M}^d)$ norm, for $k \geq 3$ and $s < k$, using $\mathcal{O}(\varepsilon^{-d/(k-s)})$ nonzero parameters, a rate that overcomes the curse of dimensionality by depending only on the intrinsic dimension $d$. These results readily extend to functions in Hölder-Zygmund spaces. We complement this result with a matching lower bound, proving our construction is nearly optimal by showing the required number of parameters matches up to a logarithmic factor. Our proof of the lower bound introduces novel estimates for the Vapnik-Chervonenkis dimension and pseudo-dimension of the network’s high-order derivative classes. These complexity bounds provide a theoretical cornerstone for learning PDEs on manifolds involving derivatives. Our analysis reveals that the network architecture leverages a sparse structure to efficiently exploit the manifold’s low-dimensional geometry.
[LG-17] Data Driven Discovery of Emergent Dynamics in Reaction Diffusion Systems from Sparse and Noisy Observations
Link: https://arxiv.org/abs/2509.09278
Authors: Saumitra Dwivedi, Ricardo da Silva Torres, Ibrahim A. Hameed, Gunnar Tufte, Anniken Susanne T. Karlsen
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract:Data-driven discovery of emergent dynamics is gaining popularity, particularly in the context of reaction-diffusion systems. These systems are widely studied across various fields, including neuroscience, ecology, epidemiology, and several other subject areas that deal with emergent dynamics. A current challenge in the discovery process relates to system identification when there is no prior knowledge of the underlying physics. We attempt to address this challenge by learning Soft Artificial Life (Soft ALife) models, such as Agent-based and Cellular Automata (CA) models, from observed data for reaction-diffusion systems. In this paper, we present findings on the applicability of a conceptual framework, the Data-driven Rulesets for Soft Artificial Life (DRSALife) model, to learn Soft ALife rulesets that accurately represent emergent dynamics in a reaction-diffusion system from observed data. This model has demonstrated promising results for Elementary CA Rule 30, Game of Life, and Vicsek Flocking problems in recent work. To our knowledge, this is one of the few studies that explore machine-based Soft ALife ruleset learning and system identification for reaction-diffusion dynamics without any prior knowledge of the underlying physics. Moreover, we provide comprehensive findings from experiments investigating the potential effects of using noisy and sparse observed datasets on learning emergent dynamics. Additionally, we successfully identify the structure and parameters of the underlying partial differential equations (PDEs) representing these dynamics. Experimental results demonstrate that the learned models are able to predict the emergent dynamics with good accuracy (74%) and exhibit quite robust performance when subjected to Gaussian noise and temporal sparsity.
[LG-18] Unsupervised Multi-Attention Meta Transformer for Rotating Machinery Fault Diagnosis
Link: https://arxiv.org/abs/2509.09251
Authors: Hanyang Wang, Yuxuan Yang, Hongjun Wang, Lihui Wang
Subjects: Machine Learning (cs.LG)
Abstract:The intelligent fault diagnosis of rotating mechanical equipment usually requires a large amount of labeled sample data. However, in practical industrial applications, acquiring enough data is both challenging and expensive in terms of time and cost. Moreover, different types of rotating mechanical equipment have different mechanical properties and therefore require separate training of diagnostic models for each case. To address the challenges of limited fault samples and the lack of generalizability in prediction models for practical engineering applications, we propose a Multi-Attention Meta Transformer method for few-shot unsupervised rotating machinery fault diagnosis (MMT-FD). This framework extracts potential fault representations from unlabeled data and demonstrates strong generalization capabilities, making it suitable for diagnosing faults across various types of mechanical equipment. The MMT-FD framework integrates a time-frequency domain encoder and a meta-learning generalization model. The time-frequency domain encoder predicts status representations generated through random augmentations in the time-frequency domain. These enhanced data are then fed into a meta-learning network for classification and generalization training, followed by fine-tuning using a limited amount of labeled data. The model is iteratively optimized using a small number of contrastive learning iterations, resulting in high efficiency. To validate the framework, we conducted experiments on a bearing fault dataset and rotor test bench data. The results demonstrate that the MMT-FD model achieves 99% fault diagnosis accuracy with only 1% of labeled sample data, exhibiting robust generalization capabilities.
[LG-19] Constructing a Question-Answering Simulator through the Distillation of LLMs
Link: https://arxiv.org/abs/2509.09226
Authors: Haipeng Liu, Ting Long, Jing Fu
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Abstract:The question-answering (QA) simulator is a model that mimics real student learning behaviors and predicts the correctness of their responses to questions. QA simulators enable educational recommender systems (ERS) to collect large amounts of training data without interacting with real students, thereby preventing harmful recommendations made by an undertrained ERS from undermining actual student learning. Given the QA history, there are two categories of solutions for predicting correctness and thus conducting the simulation: (1) LLM-free methods, which first apply a traditional sequential model to transform the QA history into a vector representation and then make predictions based on that representation; (2) LLM-based methods, which leverage the domain knowledge and reasoning capability of LLMs to enhance the prediction. LLM-free methods offer fast inference but generally yield suboptimal performance. In contrast, most LLM-based methods achieve better results, but at the cost of slower inference speed and higher GPU memory consumption. In this paper, we propose a method named LLM Distillation based Simulator (LDSim), which distills domain knowledge and reasoning capability from an LLM to better assist prediction, thereby improving simulation performance. Extensive experiments demonstrate that our LDSim achieves strong results on both the simulation task and the knowledge tracing (KT) task. Our code is publicly available at this https URL.
[LG-20] Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL
Link: https://arxiv.org/abs/2509.09177
Authors: Hanyi Mao, Quanjia Xiao, Lei Pang, Haixiao Liu
Subjects: Machine Learning (cs.LG)
Abstract:We propose FSPO (Fair Sequence Policy Optimization), a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping directly in the importance-sampling (IS) weight space. We revisit sequence-level RL methods and identify a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short vs. long responses, distorting the effective objective. Theoretically, we formalize length fairness via a Length Reweighting Error (LRE) and prove that small LRE yields a directional cosine guarantee between the clipped and true updates. FSPO introduces a simple, Gaussian-motivated remedy: we clip the sequence log-IS ratio with a band that applies a KL-corrected drift term and scales as $\sqrt{L}$. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms all baselines across multiple evaluation datasets.
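A small sketch may help make the length-scaled band concrete. The abstract states only that the band has a drift term and a width that scales as the square root of the sequence length L. The functional form below, along with the parameter names `width` and `drift_per_token`, is an assumption for illustration and not the paper's exact rule.

```python
import math

def sequence_log_ratio(new_logprobs, old_logprobs):
    """Sum per-token log-probability differences into one sequence-level log-IS ratio."""
    return sum(n - o for n, o in zip(new_logprobs, old_logprobs))

def length_scaled_clip(log_ratio, length, width, drift_per_token=0.0):
    """Clip a sequence log-IS ratio to a band whose half-width grows as sqrt(length).

    `drift_per_token` shifts the band centre linearly in length (hypothetical form).
    """
    center = drift_per_token * length
    half_band = width * math.sqrt(length)
    return max(center - half_band, min(center + half_band, log_ratio))

# The same raw log-ratio is clipped for a short response but not for a long one.
short = length_scaled_clip(5.0, length=16, width=0.5)    # band is [-2, 2]
long_ = length_scaled_clip(5.0, length=400, width=0.5)   # band is [-10, 10]
```

The motivation for the square-root scaling is that if per-token log-ratio terms behave roughly like independent noise, their sum spreads in proportion to the square root of L, so a fixed band would clip long responses far more often than short ones.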
[LG-21] Quantum Machine Learning, Quantitative Trading, Reinforcement Learning, Deep Learning
Link: https://arxiv.org/abs/2509.09176
Authors: Jun-Hao Chen, Yu-Chien Huang, Yun-Cheng Tsai, Samuel Yen-Chi Chen
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Abstract:The convergence of quantum-inspired neural networks and deep reinforcement learning offers a promising avenue for financial trading. We implemented a trading agent for USD/TWD by integrating Quantum Long Short-Term Memory (QLSTM) for short-term trend prediction with Quantum Asynchronous Advantage Actor-Critic (QA3C), a quantum-enhanced variant of the classical A3C. Trained on data from 2000-01-01 to 2025-04-30 (80% training, 20% testing), the long-only agent achieves 11.87% return over around 5 years with 0.92% max drawdown, outperforming several currency ETFs. We detail state design (QLSTM features and indicators), reward function for trend-following/risk control, and multi-core training. Results show hybrid models yield competitive FX trading performance. Implications include QLSTM’s effectiveness for small-profit trades with tight risk and future enhancements. Key hyperparameters: QLSTM sequence length = 4, QA3C workers = 8. Limitations: classical quantum simulation and simplified strategy. Footnote: The views expressed in this article are those of the authors and do not represent the views of Wells Fargo. This article is for informational purposes only. Nothing contained in this article should be construed as investment advice. Wells Fargo makes no express or implied warranties and expressly disclaims all legal, tax, and accounting implications related to this article.
[LG-22] Peering Partner Recommendation for ISPs using Machine Learning
Link: https://arxiv.org/abs/2509.09146
Authors: Md Ibrahim Ibne Alam, Ankur Senapati, Anindo Mahmood, Murat Yuksel, Koushik Kar
Subjects: Machine Learning (cs.LG)
Comments: Submitted to IEEE Transactions on Machine Learning in Communications and Networking
Abstract:Internet service providers (ISPs) need to connect with other ISPs to provide global connectivity services to their users. To ensure global connectivity, ISPs can either use transit service(s) or establish direct peering relationships between themselves via Internet exchange points (IXPs). Peering offers more room for ISP-specific optimizations and is preferred, but it often involves a lengthy and complex process. Automating peering partner selection can enhance efficiency in the global Internet ecosystem. We explore the use of publicly available data on ISPs to develop a machine learning (ML) model that can predict whether an ISP pair should peer or not. At first, we explore public databases, e.g., PeeringDB, CAIDA, etc., to gather data on ISPs. Then, we evaluate the performance of three broad types of ML models for predicting peering relationships: tree-based, neural network-based, and transformer-based. Among these, we observe that tree-based models achieve the highest accuracy and efficiency in our experiments. The XGBoost model trained with publicly available data showed promising performance, with a 98% accuracy rate in predicting peering partners. In addition, the model demonstrated great resilience to variations in time, space, and missing data. We envision that ISPs can adopt our method to fully automate the peering partner selection process, thus transitioning to a more efficient and optimized Internet ecosystem.
[LG-23] Continuous-Time Value Iteration for Multi-Agent Reinforcement Learning
Link: https://arxiv.org/abs/2509.09135
Authors: Xuefeng Wang, Lei Zhang, Henglin Pu, Ahmed H. Qureshi, Husheng Li
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 19 pages, 10 figures
Abstract:Existing reinforcement learning (RL) methods struggle with complex dynamical systems that demand interactions at high frequencies or irregular time intervals. Continuous-time RL (CTRL) has emerged as a promising alternative by replacing discrete-time Bellman recursion with differential value functions defined as viscosity solutions of the Hamilton–Jacobi–Bellman (HJB) equation. While CTRL has shown promise, its applications have been largely limited to the single-agent domain. This limitation stems from two key challenges: (i) conventional solution methods for HJB equations suffer from the curse of dimensionality (CoD), making them intractable in high-dimensional systems; and (ii) even with HJB-based learning approaches, accurately approximating centralized value functions in multi-agent settings remains difficult, which in turn destabilizes policy training. In this paper, we propose a CT-MARL framework that uses physics-informed neural networks (PINNs) to approximate HJB-based value functions at scale. To ensure the value is consistent with its differential structure, we align value learning with value-gradient learning by introducing a Value Gradient Iteration (VGI) module that iteratively refines value gradients along trajectories. This improves gradient fidelity, in turn yielding more accurate values and stronger policy learning. We evaluate our method using continuous-time variants of standard benchmarks, including multi-agent particle environment (MPE) and multi-agent MuJoCo. Our results demonstrate that our approach consistently outperforms existing continuous-time RL baselines and scales to complex multi-agent dynamics.
[LG-24] Learning What Matters: Causal Time Series Modeling for Arctic Sea Ice Prediction IJCAI2025
Link: https://arxiv.org/abs/2509.09128
Authors: Emam Hossain,Md Osman Gani
Subjects: Machine Learning (cs.LG)
Comments: Accepted and presented at the AI4TS Workshop @ IJCAI 2025 (non-archival)
Abstract:Conventional machine learning and deep learning models typically rely on correlation-based learning, which often fails to distinguish genuine causal relationships from spurious associations, limiting their robustness, interpretability, and ability to generalize. To overcome these limitations, we introduce a causality-aware deep learning framework that integrates Multivariate Granger Causality (MVGC) and PCMCI+ for causal feature selection within a hybrid neural architecture. Leveraging 43 years (1979-2021) of Arctic Sea Ice Extent (SIE) data and associated ocean-atmospheric variables at daily and monthly resolutions, the proposed method identifies causally influential predictors, prioritizes direct causes of SIE dynamics, reduces unnecessary features, and enhances computational efficiency. Experimental results show that incorporating causal inputs leads to improved prediction accuracy and interpretability across varying lead times. While demonstrated on Arctic SIE forecasting, the framework is broadly applicable to other dynamic, high-dimensional domains, offering a scalable approach that advances both the theoretical foundations and practical performance of causality-informed predictive modeling.
[LG-25] Sensitivity-LoRA: Low-Load Sensitivity-Based Fine-Tuning for Large Language Models
Link: https://arxiv.org/abs/2509.09119
Authors: Hao Zhang,Bo Huang,Zhenjia Li,Xi Xiao,Hui Yi Leong,Zumeng Zhang,Xinwei Long,Tianyang Wang,Hao Xu
Subjects: Machine Learning (cs.LG)
Comments: 15 pages
Abstract:Large Language Models (LLMs) have transformed both everyday life and scientific research. However, adapting LLMs from general-purpose models to specialized tasks remains challenging, particularly in resource-constrained environments. Low-Rank Adaptation (LoRA), a prominent method within Parameter-Efficient Fine-Tuning (PEFT), has emerged as a promising approach to adapting LLMs by approximating model weight updates using low-rank decomposition. However, LoRA is limited by its uniform rank (r) allocation to each incremental matrix, and existing rank allocation techniques aimed at addressing this issue remain computationally inefficient, complex, and unstable, hindering practical applications. To address these limitations, we propose Sensitivity-LoRA, an efficient fine-tuning method that dynamically allocates ranks to weight matrices based on both their global and local sensitivities. It leverages the second-order derivatives (Hessian matrix) of the loss function to effectively capture weight sensitivity, enabling optimal rank allocation with minimal computational overhead. Our experimental results demonstrate the robust effectiveness, efficiency, and stability of Sensitivity-LoRA across diverse tasks and benchmarks.
[LG-26] CryptGNN: Enabling Secure Inference for Graph Neural Networks
Link: https://arxiv.org/abs/2509.09107
Authors: Pritam Sen,Yao Ma,Cristian Borcea
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:We present CryptGNN, a secure and effective inference solution for third-party graph neural network (GNN) models in the cloud, which are accessed by clients as ML as a service (MLaaS). The main novelty of CryptGNN is its secure message passing and feature transformation layers using distributed secure multi-party computation (SMPC) techniques. CryptGNN protects the client’s input data and graph structure from the cloud provider and the third-party model owner, and it protects the model parameters from the cloud provider and the clients. CryptGNN works with any number of SMPC parties, does not require a trusted server, and is provably secure even if P-1 out of P parties in the cloud collude. Theoretical analysis and empirical experiments demonstrate the security and efficiency of CryptGNN.
[LG-27] An entropy formula for the Deep Linear Network
Link: https://arxiv.org/abs/2509.09088
Authors: Govind Menon,Tianmin Yu
Subjects: Machine Learning (cs.LG); Differential Geometry (math.DG); Dynamical Systems (math.DS)
Comments:
Abstract:We study the Riemannian geometry of the Deep Linear Network (DLN) as a foundation for a thermodynamic description of the learning process. The main tools are the use of group actions to analyze overparametrization and the use of Riemannian submersion from the space of parameters to the space of observables. The foliation of the balanced manifold in the parameter space by group orbits is used to define and compute a Boltzmann entropy. We also show that the Riemannian geometry on the space of observables defined in [2] is obtained by Riemannian submersion of the balanced manifold. The main technical step is an explicit construction of an orthonormal basis for the tangent space of the balanced manifold using the theory of Jacobi matrices.
[LG-28] “A 6 or a 9?”: Ensemble Learning Through the Multiplicity of Performant Models and Explanations KDD
Link: https://arxiv.org/abs/2509.09073
Authors: Gianlucca Zuin,Adriano Veloso
Subjects: Machine Learning (cs.LG)
Comments: Paper accepted to the ACM Transactions on Knowledge Discovery from Data (TKDD) for publication (preprint version)
Abstract:Creating models from past observations and ensuring their effectiveness on new data is the essence of machine learning. However, selecting models that generalize well remains a challenging task. Related to this topic, the Rashomon Effect refers to cases where multiple models perform similarly well for a given learning problem. This often occurs in real-world scenarios, like the manufacturing process or medical diagnosis, where diverse patterns in data lead to multiple high-performing solutions. We propose the Rashomon Ensemble, a method that strategically selects models from these diverse high-performing solutions to improve generalization. By grouping models based on both their performance and explanations, we construct ensembles that maximize diversity while maintaining predictive accuracy. This selection ensures that each model covers a distinct region of the solution space, making the ensemble more robust to distribution shifts and variations in unseen data. We validate our approach on both open and proprietary collaborative real-world datasets, demonstrating up to 0.20+ AUROC improvements in scenarios where the Rashomon ratio is large. Additionally, we demonstrate tangible benefits for businesses in various real-world applications, highlighting the robustness, practicality, and effectiveness of our approach.
[LG-29] The Role of Community Detection Methods in Performance Variations of Graph Mining Tasks
Link: https://arxiv.org/abs/2509.09045
Authors: Shrabani Ghosh,Erik Saule
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments:
Abstract:In real-world scenarios, large graphs represent relationships among entities in complex systems. Mining these large graphs, which often contain millions of nodes and edges, helps uncover structural patterns and meaningful insights. Dividing a large graph into smaller subgraphs facilitates complex system analysis by revealing local information. Community detection extracts clusters or communities of graphs based on statistical methods and machine learning models using various optimization techniques. Structure-based community detection methods are more suitable for applying to graphs because they do not rely heavily on rich node or edge attribute information. The features derived from these communities can improve downstream graph mining tasks, such as link prediction and node classification. In real-world applications, we often lack ground truth community information. Additionally, there is neither a universally accepted gold standard for community detection nor a single method that is consistently optimal across diverse applications. In many cases, it is unclear how practitioners select community detection methods, and choices are often made without explicitly considering their potential impact on downstream tasks. In this study, we investigate whether the choice of community detection algorithm significantly influences the performance of downstream applications. We propose a framework capable of integrating various community detection methods to systematically evaluate their effects on downstream task outcomes. Our comparative analysis reveals that specific community detection algorithms yield superior results in certain applications, highlighting that method selection substantially affects performance.
[LG-30] Deep Context-Conditioned Anomaly Detection for Tabular Data WSDM2026
Link: https://arxiv.org/abs/2509.09030
Authors: Spencer King,Zhilu Zhang,Ruofan Yu,Baris Coskun,Wei Ding,Qian Cui
Subjects: Machine Learning (cs.LG)
Comments: Submitted to WSDM 2026. 11 pages, 4 figures, 5 tables, 1 algorithm, 8 datasets, contextual anomaly detection framework for tabular data
Abstract:Anomaly detection is critical in domains such as cybersecurity and finance, especially when working with large-scale tabular data. Yet, unsupervised anomaly detection – where no labeled anomalies are available – remains a significant challenge. Although various deep learning methods have been proposed to model a dataset’s joint distribution, real-world tabular data often contain heterogeneous contexts (e.g., different users), making globally rare events normal under certain contexts. Consequently, relying on a single global distribution can overlook these contextual nuances, degrading detection performance. In this paper, we present a context-conditional anomaly detection framework tailored for tabular datasets. Our approach automatically identifies context features and models the conditional data distribution using a simple deep autoencoder. Extensive experiments on multiple tabular benchmark datasets demonstrate that our method outperforms state-of-the-art approaches, underscoring the importance of context in accurately distinguishing anomalous from normal instances.
[LG-31] Fast attention mechanisms: a tale of parallelism
Link: https://arxiv.org/abs/2509.09001
Authors: Jingwen Liu,Hantao Yu,Clayton Sanford,Alexandr Andoni,Daniel Hsu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Transformers have the representational capacity to simulate Massively Parallel Computation (MPC) algorithms, but they suffer from quadratic time complexity, which severely limits their scalability. We introduce an efficient attention mechanism called Approximate Nearest Neighbor Attention (ANNA) with sub-quadratic time complexity. We prove that ANNA-transformers (1) retain the expressive power previously established for standard attention in terms of matching the capabilities of MPC algorithms, and (2) can solve key reasoning tasks such as Match2 and k-hop with near-optimal depth. Using the MPC framework, we further prove that constant-depth ANNA-transformers can simulate constant-depth low-rank transformers, thereby providing a unified way to reason about a broad class of efficient attention approximations.
[LG-32] Active Learning and Explainable AI for Multi-Objective Optimization of Spin Coated Polymers AAAI
Link: https://arxiv.org/abs/2509.08988
Authors: Brendan Young,Brendan Alvey,Andreas Werbrouck,Will Murphy,James Keller,Mattias J. Young,Matthew Maschmann
Subjects: Machine Learning (cs.LG)
Comments: 8 pages, 7 figures, Presented at 2025 AAAI Spring Symposium Series
Abstract:Spin coating polymer thin films to achieve specific mechanical properties is inherently a multi-objective optimization problem. We present a framework that integrates an active Pareto front learning algorithm (PyePAL) with visualization and explainable AI techniques to optimize processing parameters. PyePAL uses Gaussian process models to predict objective values (hardness and elasticity) from the design variables (spin speed, dilution, and polymer mixture), guiding the adaptive selection of samples toward promising regions of the design space. To enable interpretable insights into the high-dimensional design space, we utilize UMAP (Uniform Manifold Approximation and Projection) for two-dimensional visualization of the Pareto front exploration. Additionally, we incorporate fuzzy linguistic summaries, which translate the learned relationships between process parameters and performance objectives into linguistic statements, thus enhancing the explainability and understanding of the optimization results. Experimental results demonstrate that our method efficiently identifies promising polymer designs, while the visual and linguistic explanations facilitate expert-driven analysis and knowledge discovery.
[LG-33] Green Federated Learning via Carbon-Aware Client and Time Slot Scheduling
Link: https://arxiv.org/abs/2509.08980
Authors: Daniel Richards Arputharaj,Charlotte Rodriguez,Angelo Rodio,Giovanni Neglia
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Training large-scale machine learning models incurs substantial carbon emissions. Federated Learning (FL), by distributing computation across geographically dispersed clients, offers a natural framework to leverage regional and temporal variations in Carbon Intensity (CI). This paper investigates how to reduce emissions in FL through carbon-aware client selection and training scheduling. We first quantify the emission savings of a carbon-aware scheduling policy that leverages slack time – permitting a modest extension of the training duration so that clients can defer local training rounds to lower-carbon periods. We then examine the performance trade-offs of such scheduling which stem from statistical heterogeneity among clients, selection bias in participation, and temporal correlation in model updates. To leverage these trade-offs, we construct a carbon-aware scheduler that integrates slack time, $\alpha$-fair carbon allocation, and a global fine-tuning phase. Experiments on real-world CI data show that our scheduler outperforms slack-agnostic baselines, achieving higher model accuracy across a wide range of carbon budgets, with especially strong gains under tight carbon constraints.
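The slack-time idea in this abstract can be illustrated with a toy sketch: given a per-slot carbon intensity forecast, a fixed number of training rounds, and a few extra slots of slack, pick the lowest-intensity slots inside the extended horizon. This is not the paper's scheduler (it ignores client selection, $\alpha$-fair allocation, and fine-tuning); the function names, the greedy rule, and the constant energy per round are all assumptions made for illustration.

```python
from typing import List


def pick_low_carbon_slots(ci_forecast: List[float], rounds: int, slack: int) -> List[int]:
    """Greedy toy rule (assumed, not from the paper): run `rounds` training
    rounds in the lowest-CI slots among the first `rounds + slack` slots."""
    horizon = min(len(ci_forecast), rounds + slack)
    if horizon < rounds:
        raise ValueError("forecast is shorter than the number of rounds")
    # Sort candidate slots by carbon intensity, breaking ties by earlier slot.
    ranked = sorted(range(horizon), key=lambda t: (ci_forecast[t], t))
    return sorted(ranked[:rounds])


def total_emissions(ci_forecast: List[float], slots: List[int],
                    energy_per_round_kwh: float = 1.0) -> float:
    """Emissions in gCO2 if every round uses the same (hypothetical) energy."""
    return sum(ci_forecast[t] * energy_per_round_kwh for t in slots)


# Hypothetical forecast in gCO2/kWh for six consecutive time slots.
forecast = [500.0, 200.0, 450.0, 100.0, 300.0, 50.0]
no_slack = pick_low_carbon_slots(forecast, rounds=3, slack=0)
with_slack = pick_low_carbon_slots(forecast, rounds=3, slack=3)
```

With no slack the three rounds must run back to back in the first three slots; with three slots of slack the same rounds can move to the cleanest slots, which is the emission saving the abstract quantifies.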
[LG-34] FoundationalECGNet: A Lightweight Foundational Model for ECG-based Multitask Cardiac Analysis
Link: https://arxiv.org/abs/2509.08961
Authors: Md. Sajeebul Islam Sk.,Md Jobayer,Md Mehedi Hasan Shawon,Md. Golam Raibul Alam
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Cardiovascular diseases (CVDs) remain a leading cause of mortality worldwide, underscoring the importance of accurate and scalable diagnostic systems. Electrocardiogram (ECG) analysis is central to detecting cardiac abnormalities, yet challenges such as noise, class imbalance, and dataset heterogeneity limit current methods. To address these issues, we propose FoundationalECGNet, a foundational framework for automated ECG classification. The model integrates dual-stage denoising based on Morlet and Daubechies wavelet transforms, a Convolutional Block Attention Module (CBAM), Graph Attention Networks (GAT), and Time Series Transformers (TST) to jointly capture spatial and temporal dependencies in multi-channel ECG signals. FoundationalECGNet first distinguishes between Normal and Abnormal ECG signals, and then classifies the Abnormal signals into one of five cardiac conditions: Arrhythmias, Conduction Disorders, Myocardial Infarction, QT Abnormalities, or Hypertrophy. Across multiple datasets, the model achieves a 99% F1-score for Normal vs. Abnormal classification and shows state-of-the-art performance in multi-class disease detection, including a 99% F1-score for Conduction Disorders and Hypertrophy, as well as a 98.9% F1-score for Arrhythmias. Additionally, the model provides risk level estimations to facilitate clinical decision-making. In conclusion, FoundationalECGNet represents a scalable, interpretable, and generalizable solution for automated ECG analysis, with the potential to improve diagnostic precision and patient outcomes in healthcare settings. The code will be shared after acceptance.
[LG-35] Group Distributionally Robust Machine Learning under Group Level Distributional Uncertainty
Link: https://arxiv.org/abs/2509.08942
Authors: Xenia Konti,Yi Shen,Zifan Wang,Karl Henrik Johansson,Michael J. Pencina,Nicoleta J. Economou-Zavlanos,Michael M. Zavlanos
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The performance of machine learning (ML) models critically depends on the quality and representativeness of the training data. In applications with multiple heterogeneous data generating sources, standard ML methods often learn spurious correlations that perform well on average but degrade performance for atypical or underrepresented groups. Prior work addresses this issue by optimizing the worst-group performance. However, these approaches typically assume that the underlying data distributions for each group can be accurately estimated using the training data, a condition that is frequently violated in noisy, non-stationary, and evolving environments. In this work, we propose a novel framework that relies on Wasserstein-based distributionally robust optimization (DRO) to account for the distributional uncertainty within each group, while simultaneously preserving the objective of improving the worst-group performance. We develop a gradient descent-ascent algorithm to solve the proposed DRO problem and provide convergence results. Finally, we validate the effectiveness of our method on real-world data.
[LG-36] Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates
Link: https://arxiv.org/abs/2509.08933
Authors: Sreejeet Maity,Aritra Mitra
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments:
Abstract:We consider the problem of learning the optimal policy in a discounted, infinite-horizon reinforcement learning (RL) setting where the reward signal is subject to adversarial corruption. Such corruption, which may arise from extreme noise, sensor faults, or malicious attacks, can severely degrade the performance of classical algorithms such as Q-learning. To address this challenge, we propose a new provably robust variant of the Q-learning algorithm that operates effectively even when a fraction of the observed rewards are arbitrarily perturbed by an adversary. Under the asynchronous sampling model with time-correlated data, we establish that despite adversarial corruption, the finite-time convergence rate of our algorithm matches that of existing results for the non-adversarial case, up to an additive term proportional to the fraction of corrupted samples. Moreover, we derive an information-theoretic lower bound revealing that the additive corruption term in our upper bounds is unavoidable. Next, we propose a variant of our algorithm that requires no prior knowledge of the statistics of the true reward distributions. The analysis of this setting is particularly challenging and is enabled by carefully exploiting a refined Azuma-Hoeffding inequality for almost-martingales, a technical tool that might be of independent interest. Collectively, our contributions provide the first finite-time robustness guarantees for asynchronous Q-learning, bridging a significant gap in robust RL.
[LG-37] Decentralising LLM Alignment: A Case for Context Pluralism and Participation
Link: https://arxiv.org/abs/2509.08858
Authors: Oriane Peter,Kate Devlin
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted at AIES 2025
Abstract:Large Language Models (LLMs) alignment methods have been credited with the commercial success of products like ChatGPT, given their role in steering LLMs towards user-friendly outputs. However, current alignment techniques predominantly mirror the normative preferences of a narrow reference group, effectively imposing their values on a wide user base. Drawing on theories of the power/knowledge nexus, this work argues that current alignment practices centralise control over knowledge production and governance within already influential institutions. To counter this, we propose decentralising alignment through three characteristics: context, pluralism, and participation. Furthermore, this paper demonstrates the critical importance of delineating the context-of-use when shaping alignment practices by grounding each of these features in concrete use cases. This work makes the following contributions: (1) highlighting the role of context, pluralism, and participation in decentralising alignment; (2) providing concrete examples to illustrate these strategies; and (3) demonstrating the nuanced requirements associated with applying alignment across different contexts of use. Ultimately, this paper positions LLM alignment as a potential site of resistance against epistemic injustice and the erosion of democratic processes, while acknowledging that these strategies alone cannot substitute for broader societal changes.
[LG-38] Representation-Aware Distributionally Robust Optimization: A Knowledge Transfer Framework
Link: https://arxiv.org/abs/2509.09371
Authors: Zitao Wang,Nian Si,Molei Liu
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
Comments:
Abstract:We propose REpresentation-Aware Distributionally Robust Estimation (READ), a novel framework for Wasserstein distributionally robust learning that accounts for predictive representations when guarding against distributional shifts. Unlike classical approaches that treat all feature perturbations equally, READ embeds a multidimensional alignment parameter into the transport cost, allowing the model to differentially discourage perturbations along directions associated with informative representations. This yields robustness to feature variation while preserving invariant structure. Our first contribution is a theoretical foundation: we show that seminorm regularizations for linear regression and binary classification arise as Wasserstein distributionally robust objectives, thereby providing tractable reformulations of READ and unifying a broad class of regularized estimators under the DRO lens. Second, we adopt a principled procedure for selecting the Wasserstein radius using the techniques of robust Wasserstein profile inference. This further enables the construction of valid, representation-aware confidence regions for model parameters with distinct geometric features. Finally, we analyze the geometry of READ estimators as the alignment parameters vary and propose an optimization algorithm to estimate the projection of the global optimum onto this solution surface. This procedure selects among equally robust estimators while optimally constructing a representation structure. We conclude by demonstrating the effectiveness of our framework through extensive simulations and a real-world study, providing a powerful robust estimation grounded in learning representation.
[LG-39] Low-degree lower bounds via almost orthonormal bases
Link: https://arxiv.org/abs/2509.09353
Authors: Alexandra Carpentier,Simone Maria Giancola(LMO, CELESTE),Christophe Giraud(LMO, CELESTE),Nicolas Verzelen(MISTEA)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Low-degree polynomials have emerged as a powerful paradigm for providing evidence of statistical–computational gaps across a variety of high-dimensional statistical models [Wein25]. For detection problems – where the goal is to test a planted distribution $\mathbb{P}'$ against a null distribution $\mathbb{P}$ with independent components – the standard approach is to bound the advantage using an $\mathbb{L}^2(\mathbb{P})$-orthonormal family of polynomials. However, this method breaks down for estimation tasks or more complex testing problems where $\mathbb{P}$ has some planted structures, so that no simple $\mathbb{L}^2(\mathbb{P})$-orthogonal polynomial family is available. To address this challenge, several technical workarounds have been proposed [SW22,SW25], though their implementation can be delicate. In this work, we propose a more direct proof strategy. Focusing on random graph models, we construct a basis of polynomials that is almost orthonormal under $\mathbb{P}$, in precisely those regimes where statistical–computational gaps arise. This almost orthonormal basis not only yields a direct route to establishing low-degree lower bounds, but also allows us to explicitly identify the polynomials that optimize the low-degree criterion. This, in turn, provides insights into the design of optimal polynomial-time algorithms. We illustrate the effectiveness of our approach by recovering known low-degree lower bounds, and establishing new ones for problems such as hidden subcliques, stochastic block models, and seriation models.
[LG-40] Global Optimization of Stochastic Black-Box Functions with Arbitrary Noise Distributions using Wilson Score Kernel Density Estimation
Link: https://arxiv.org/abs/2509.09238
Authors: Thorbjørn Mosekjær Iversen,Lars Carøe Sørensen,Simon Faarvang Mathiesen,Henrik Gordon Petersen
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:Many optimization problems in robotics involve the optimization of time-expensive black-box functions, such as those involving complex simulations or evaluation of real-world experiments. Furthermore, these functions are often stochastic as repeated experiments are subject to unmeasurable disturbances. Bayesian optimization can be used to optimize such methods in an efficient manner by deploying a probabilistic function estimator to estimate with a given confidence so that regions of the search space can be pruned away. Consequently, the success of the Bayesian optimization depends on the function estimator’s ability to provide informative confidence bounds. Existing function estimators require many function evaluations to infer the underlying confidence or depend on modeling of the disturbances. In this paper, it is shown that the confidence bounds provided by the Wilson Score Kernel Density Estimator (WS-KDE) are applicable as excellent bounds to any stochastic function with an output confined to the closed interval [0;1] regardless of the distribution of the output. This finding opens up the use of WS-KDE for stable global optimization on a wider range of cost functions. The properties of WS-KDE in the context of Bayesian optimization are demonstrated in simulation and applied to the problem of automated trap design for vibrational part feeders.
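For readers unfamiliar with the Wilson score, the sketch below computes the classical Wilson score interval for a binomial proportion, which is the ingredient the estimator's name refers to. How the paper combines it with kernel density estimation is not described in the abstract and is not reproduced here; the function name and the default `z = 1.96` (roughly 95% confidence) are illustrative choices.

```python
import math
from typing import Tuple


def wilson_interval(successes: float, n: float, z: float = 1.96) -> Tuple[float, float]:
    """Classical Wilson score interval for a proportion in [0, 1].

    `successes / n` is the observed proportion and `z` the normal quantile.
    With no observations the interval is the uninformative [0, 1].
    """
    if n <= 0:
        return (0.0, 1.0)
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    # Clip to [0, 1] to guard against floating-point overshoot.
    return (max(0.0, center - half_width), min(1.0, center + half_width))


low, high = wilson_interval(5, 10)
```

Unlike the naive normal approximation, the interval stays inside [0, 1] and remains informative at the extremes (for example zero successes), which is one reason it is attractive for pruning regions of a search space from few evaluations.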
[LG-41] Scalable extensions to given-data Sobol index estimators
Link: https://arxiv.org/abs/2509.09078
Authors: Teresa Portone,Bert Debusschere,Samantha Yang,Emiliano Islas-Quinones,T. Patrick Xiao
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
Comments:
Abstract:Given-data methods for variance-based sensitivity analysis have significantly advanced the feasibility of Sobol’ index computation for computationally expensive models and models with many inputs. However, the limitations of existing methods still preclude their application to models with an extremely large number of inputs. In this work, we present practical extensions to the existing given-data Sobol’ index method, which allow variance-based sensitivity analysis to be efficiently performed on large models such as neural networks, which have 10^4 parameterizable inputs. For models of this size, holding all input-output evaluations simultaneously in memory – as required by existing methods – can quickly become impractical. These extensions also support nonstandard input distributions with many repeated values, which are not amenable to equiprobable partitions employed by existing given-data methods. Our extensions include a general definition of the given-data Sobol’ index estimator with arbitrary partition, a streaming algorithm to process input-output samples in batches, and a heuristic to filter out small indices that are indistinguishable from zero indices due to statistical noise. We show that the equiprobable partition employed in existing given-data methods can introduce significant bias into Sobol’ index estimates even at large sample sizes and provide numerical analyses that demonstrate why this can occur. We also show that our streaming algorithm can achieve comparable accuracy and runtimes with lower memory requirements, relative to current methods which process all samples at once. We demonstrate our novel developments on two application problems in neural network modeling. 
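A minimal sketch of the idea, assuming the standard binned form of a given-data first-order index: partition one input into bins, then take the variance of the per-bin output means divided by the total output variance. The class below keeps only running sums, so samples can be fed in batches rather than held in memory at once. It is not the authors' estimator; the class name, the equal-width bin edges, and the toy model `y = 4*x1 + x2` are assumptions for illustration.

```python
import bisect
import random


class StreamingSobol:
    """Binned first-order Sobol' index for one input, updated batch by batch."""

    def __init__(self, edges):
        self.edges = list(edges)            # arbitrary sorted bin edges
        n_bins = len(self.edges) - 1
        self.count = [0] * n_bins           # samples per bin
        self.sum_y = [0.0] * n_bins         # sum of outputs per bin
        self.n = 0
        self.total = 0.0                    # running sum of y
        self.total_sq = 0.0                 # running sum of y^2

    def update(self, xs, ys):
        last = len(self.count) - 1
        for x, y in zip(xs, ys):
            # Locate the bin and clamp values that fall on or outside the edges.
            b = min(max(bisect.bisect_right(self.edges, x) - 1, 0), last)
            self.count[b] += 1
            self.sum_y[b] += y
            self.n += 1
            self.total += y
            self.total_sq += y * y

    def estimate(self):
        mean = self.total / self.n
        var_y = self.total_sq / self.n - mean * mean
        between = sum(c * (s / c - mean) ** 2
                      for c, s in zip(self.count, self.sum_y) if c > 0) / self.n
        return between / var_y


def run_demo(seed=0, batches=20, batch_size=1000):
    """Toy model y = 4*x1 + x2 with independent uniform inputs on [0, 1]."""
    rng = random.Random(seed)
    edges = [i / 20 for i in range(21)]
    est1, est2 = StreamingSobol(edges), StreamingSobol(edges)
    for _ in range(batches):
        x1 = [rng.random() for _ in range(batch_size)]
        x2 = [rng.random() for _ in range(batch_size)]
        y = [4.0 * a + b for a, b in zip(x1, x2)]
        est1.update(x1, y)
        est2.update(x2, y)
    return est1.estimate(), est2.estimate()


s1, s2 = run_demo()
```

For this toy model the exact indices are 16/17 and 1/17, so the two estimates should land close to those values. Memory use depends on the number of bins, not on the number of samples.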
[LG-42] Generative quantum advantage for classical and quantum problems
Link: https://arxiv.org/abs/2509.09033
Authors: Hsin-Yuan Huang,Michael Broughton,Norhan Eassa,Hartmut Neven,Ryan Babbush,Jarrod R. McClean
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments:
Abstract:Recent breakthroughs in generative machine learning, powered by massive computational resources, have demonstrated unprecedented human-like capabilities. While beyond-classical quantum experiments can generate samples from classically intractable distributions, their complexity has thwarted all efforts toward efficient learning. This challenge has hindered demonstrations of generative quantum advantage: the ability of quantum computers to learn and generate desired outputs substantially better than classical computers. We resolve this challenge by introducing families of generative quantum models that are hard to simulate classically, are efficiently trainable, exhibit no barren plateaus or proliferating local minima, and can learn to generate distributions beyond the reach of classical computers. Using a 68-qubit superconducting quantum processor, we demonstrate these capabilities in two scenarios: learning classically intractable probability distributions and learning quantum circuits for accelerated physical simulation. Our results establish that both learning and sampling can be performed efficiently in the beyond-classical regime, opening new possibilities for quantum-enhanced generative models with provable advantage.
[LG-43] Physics-informed waveform inversion using pretrained wavefield neural operators
Link: https://arxiv.org/abs/2509.08967
Authors: Xinquan Huang,Fu Wang,Tariq Alkhalifah
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Comments:
Abstract:Full waveform inversion (FWI) is crucial for reconstructing high-resolution subsurface models, but with limited data it is often hindered by its null space, which results in low-resolution models, and, more importantly, by its computational cost, especially in real-time applications. Recent attempts to accelerate FWI using learned wavefield neural operators have shown promise in efficiency and differentiability, but typically suffer from noisy and unstable inversion performance. To address these limitations, we introduce a novel physics-informed FWI framework that improves inversion accuracy while maintaining the efficiency of neural operator-based FWI. Instead of relying only on the L2 norm objective function via automatic differentiation, which results in noisy model reconstructions, we integrate a physics constraint term into the FWI loss function, improving the quality of the inverted velocity models. Specifically, we start from an initial model to simulate wavefields and then evaluate the loss on how well the resulting wavefield obeys the physical laws (wave equation) and matches the recorded data, which reduces noise and artifacts. Numerical experiments using the OpenFWI and Overthrust models demonstrate our method’s superior performance, offering cleaner and more accurate subsurface velocity models than vanilla approaches. Considering the efficiency of the approach compared to FWI, this advancement represents a significant step forward in the practical application of FWI for real-time subsurface monitoring.
[LG-44] Convexity of Optimization Curves: Local Sharp Thresholds, Robustness Impossibility, and New Counterexamples
Link: https://arxiv.org/abs/2509.08954
Authors: Le Duc Hieu
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Abstract:We study when the optimization curve of first-order methods, the sequence $\{f(x_n)\}_{n\ge 0}$ produced by constant-stepsize iterations, is convex, equivalently when the forward differences $f(x_n)-f(x_{n+1})$ are nonincreasing. For gradient descent (GD) on convex $L$-smooth functions, the curve is convex for all stepsizes $\eta \le 1.75/L$, and this threshold is tight. Moreover, gradient norms are nonincreasing for all $\eta \le 2/L$, and in continuous time (gradient flow) the curve is always convex. These results complement and refine the classical smooth convex optimization toolbox, connecting discrete and continuous dynamics as well as worst-case analyses.
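As a rough numerical illustration of the convexity claim above, the sketch below runs constant-stepsize gradient descent on a small diagonal quadratic and checks that the forward differences f(x_n) - f(x_{n+1}) are nonincreasing at stepsize 1.75/L. The test function, eigenvalues, starting point, and helper names are assumptions chosen for illustration and do not come from the paper. A quadratic is only one instance of a convex L-smooth function, so this check cannot show that the 1.75/L threshold is tight.

```python
def gd_curve(eigs, x0, eta, steps):
    """Return [f(x_0), ..., f(x_steps)] for GD on f(x) = 0.5 * sum(eigs[i] * x[i]**2)."""
    x = list(x0)
    values = []
    for _ in range(steps + 1):
        values.append(0.5 * sum(l * xi * xi for l, xi in zip(eigs, x)))
        # The gradient of the diagonal quadratic is eigs[i] * x[i].
        x = [xi - eta * l * xi for l, xi in zip(eigs, x)]
    return values


def is_convex_curve(values, tol=1e-12):
    """True if the forward differences f(x_n) - f(x_{n+1}) are nonincreasing."""
    diffs = [a - b for a, b in zip(values, values[1:])]
    return all(later <= earlier + tol for earlier, later in zip(diffs, diffs[1:]))


# Hypothetical example problem: L is the largest eigenvalue of the Hessian.
EIGS = [0.1, 0.5, 1.0]
L_SMOOTH = max(EIGS)
curve = gd_curve(EIGS, x0=[1.0, -2.0, 3.0], eta=1.75 / L_SMOOTH, steps=50)
print(is_convex_curve(curve))
```

Swapping in a different convex smooth function and sweeping the stepsize between 1.75/L and 2/L would be the natural next experiment, since the abstract says convexity of the curve is no longer guaranteed in that range.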
[LG-45] Deploying AI for Signal Processing education: Selected challenges and intriguing opportunities
Link: https://arxiv.org/abs/2509.08950
Authors: Jarvis Haupt, Qin Lu, Yanning Shen, Jia Chen, Yue Dong, Dan McCreary, Mehmet Akçakaya, Georgios B. Giannakis
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Accepted to the IEEE Signal Processing Magazine Special Issue on Artificial Intelligence for Education: A Signal Processing Perspective
Abstract:Powerful artificial intelligence (AI) tools that have emerged in recent years – including large language models, automated coding assistants, and advanced image and speech generation technologies – are the result of monumental human achievements. These breakthroughs reflect mastery across multiple technical disciplines and the resolution of significant technological challenges. However, some of the most profound challenges may still lie ahead. These challenges are not purely technical but pertain to the fair and responsible use of AI in ways that genuinely improve the global human condition. This article explores one promising application aligned with that vision: the use of AI tools to facilitate and enhance education, with a specific focus on signal processing (SP). It presents two interrelated perspectives: identifying and addressing technical limitations, and applying AI tools in practice to improve educational experiences. Primers are provided on several core technical issues that arise when using AI in educational settings, including how to ensure fairness and inclusivity, handle hallucinated outputs, and achieve efficient use of resources. These and other considerations – such as transparency, explainability, and trustworthiness – are illustrated through the development of an immersive, structured, and reliable “smart textbook.” The article serves as a resource for researchers and educators seeking to advance AI’s role in engineering education.
[LG-46] WarpPINN-fibers: improved cardiac strain estimation from cine-MR with physics-informed neural networks
Link: https://arxiv.org/abs/2509.08872
Authors: Felipe Álvarez Barrientos, Tomás Banduc, Isabeau Sirven, Francisco Sahli Costabal
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Abstract:The contractile motion of the heart is strongly determined by the distribution of the fibers that constitute cardiac tissue. Strain analysis informed with the orientation of fibers allows to describe several pathologies that are typically associated with impaired mechanics of the myocardium, such as cardiovascular disease. Several methods have been developed to estimate strain-derived metrics from traditional imaging techniques. However, the physical models underlying these methods do not include fiber mechanics, restricting their capacity to accurately explain cardiac function. In this work, we introduce WarpPINN-fibers, a physics-informed neural network framework to accurately obtain cardiac motion and strains enhanced by fiber information. We train our neural network to satisfy a hyper-elastic model and promote fiber contraction with the goal to predict the deformation field of the heart from cine magnetic resonance images. For this purpose, we build a loss function composed of three terms: a data-similarity loss between the reference and the warped template images, a regularizer enforcing near-incompressibility of cardiac tissue and a fiber-stretch penalization that controls strain in the direction of synthetically produced fibers. We show that our neural network improves the former WarpPINN model and effectively controls fiber stretch in a synthetic phantom experiment. Then, we demonstrate that WarpPINN-fibers outperforms alternative methodologies in landmark-tracking and strain curve prediction for a cine-MRI benchmark with a cohort of 15 healthy volunteers. We expect that our method will enable a more precise quantification of cardiac strains through accurate deformation fields that are consistent with fiber physiology, without requiring imaging techniques more sophisticated than MRI.
[LG-47] A Masked Representation Learning to Model Cardiac Functions Using Multiple Physiological Signals
Link: https://arxiv.org/abs/2509.08830
Authors: Seong-A Park, Jong-Eui Chae, Sungdong Kim, Hyung-Chul Lee, Hyun-Lim Yang
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 16 pages, 5 figures
Abstract:In clinical settings, monitoring hemodynamics is crucial for managing patient prognosis, necessitating the integrated analysis of multiple physiological signals. While recent research has analyzed single signals such as electrocardiography (ECG) or photoplethysmography (PPG), there has yet to be a proposal for an approach that encompasses the complex signal analysis required in actual clinical scenarios. In this study, we introduce the SNUPHY-M (Seoul National University hospital PHYsiological signal Masked representation learning) model, which extracts physiological features reflecting the electrical, pressure, and fluid characteristics of the cardiac cycle in the process of restoring three masked physiological signals based on self-supervised learning (SSL): ECG, PPG, and arterial blood pressure (ABP) signals. By employing multiple physical characteristics, the model can extract more enriched features using only non-invasive signals. We evaluated the model’s performance in clinical downstream tasks such as hypotension, stroke volume, systolic blood pressure, diastolic blood pressure, and age prediction. Our results showed that SNUPHY-M significantly outperformed supervised or SSL models, especially in prediction tasks using non-invasive signals. To the best of our knowledge, SNUPHY-M is the first model to apply multi-modal SSL to cardiovascular analysis involving ECG, PPG, and ABP signals. This approach effectively supports clinical decision-making and enables precise diagnostics, contributing significantly to the early diagnosis and management of hemodynamics without invasiveness.
Information Retrieval
[IR-0] AskDoc – Identifying Hidden Healthcare Disparities
Link: https://arxiv.org/abs/2509.09622
Authors: Shashank Gupta
Categories: Information Retrieval (cs.IR)
Abstract:The objective of this study is to understand online Ask the Doctor (AtD) medical advice services through AskDoc, a Reddit community that serves as a public AtD platform, and to study whether such platforms mirror existing hurdles and partiality in healthcare across various demographic groups. We downloaded data from January 2020 to May 2022 from the AskDoc subreddit, created regular expressions to identify self-reported demographics (Gender, Race, and Age) from the posts, and performed statistical analysis to understand the interaction of peers and physicians with the posters. Half of the posts did not receive comments from peers or physicians. At least 90% of the people disclose their gender and age, and 80% of the people do not disclose their race. It was observed that the subreddit is dominated by adult (age group 20-39) white males. Some disparities were observed in the engagement between the users and the posters with certain demographics. Beyond the confines of clinics and hospitals, social media could bring patients and providers closer together; however, as observed, current physician participation is low compared to that of posters.
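The abstract mentions regular expressions for self-reported demographics but does not give them. The sketch below shows what such an extraction step might look like for age and gender only. Both patterns, the `GENDER_MAP` table, and the `extract_demographics` helper are hypothetical and are not the study's actual expressions. Real posts would need far more robust handling, and race is not covered here.

```python
import re

# Hypothetical patterns; the study's actual expressions are not given in the abstract.
# Compact form such as "25M" or "32 f".
COMPACT = re.compile(r"\b(?P<age>\d{1,2})\s?(?P<gender>[MF])\b", re.IGNORECASE)
# Spelled-out form such as "30 year old female" or "45-year-old man".
VERBOSE = re.compile(
    r"\b(?P<age>\d{1,2})[\s-]*(?:years?|yrs?|yo)[\s-]*(?:old)?[\s-]*"
    r"(?P<gender>male|female|man|woman)\b",
    re.IGNORECASE,
)
GENDER_MAP = {"m": "male", "male": "male", "man": "male",
              "f": "female", "female": "female", "woman": "female"}


def extract_demographics(post):
    """Return (age, gender) from the first matching pattern, or None."""
    for pattern in (VERBOSE, COMPACT):
        match = pattern.search(post)
        if match:
            return int(match.group("age")), GENDER_MAP[match.group("gender").lower()]
    return None


for text in ["25M, chest pain for two days",
             "I am a 30 year old female with asthma",
             "My cough will not go away"]:
    print(text, "->", extract_demographics(text))
```

The trailing word boundary in `COMPACT` keeps dosage strings such as "5 mg" from being read as an age and gender, which is the kind of false positive a production version would have to test for much more thoroughly.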
[IR-1] Boosting Data Utilization for Multilingual Dense Retrieval EMNLP2025
Link: https://arxiv.org/abs/2509.09459
Authors: Chao Huang, Fengran Mo, Yufeng Chen, Changhao Guan, Zhenrui Yue, Xinyu Wang, Jinan Xu, Kaiyu Huang
Categories: Information Retrieval (cs.IR)
Comments: Accepted by EMNLP 2025 (main)
Abstract:Multilingual dense retrieval aims to retrieve relevant documents across different languages based on a unified retriever model. The challenge lies in aligning representations of different languages in a shared vector space. The common practice is to fine-tune the dense retriever via contrastive learning, whose effectiveness highly relies on the quality of the negative sample and the efficacy of mini-batch data. Different from the existing studies that focus on developing sophisticated model architecture, we propose a method to boost data utilization for multilingual dense retrieval by obtaining high-quality hard negative samples and effective mini-batch data. The extensive experimental results on a multilingual retrieval benchmark, MIRACL, with 16 languages demonstrate the effectiveness of our method by outperforming several existing strong baselines.
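To make the role of negative-sample quality concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss for one query, one positive passage, and a list of negatives. It is a textbook formulation, not the paper's method. The temperature, the toy two-dimensional embeddings, and the function names are assumptions for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Negative log-probability of the positive under a softmax over similarities."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[0]


query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]  # far from the query
hard_negative = [0.8, 0.2]   # close to the query

loss_easy = info_nce_loss(query, positive, [easy_negative])
loss_hard = info_nce_loss(query, positive, [hard_negative])
print(loss_easy < loss_hard)
```

In this toy setup the easy negative contributes almost nothing to the loss, while the hard negative yields a clearly larger loss and therefore a stronger training signal, which is the usual motivation for mining hard negatives.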
[IR-2] CESRec: Constructing Pseudo Interactions for Sequential Recommendation via Conversational Feedback
Link: https://arxiv.org/abs/2509.09342
Authors: Yifan Wang, Shen Gao, Jiabao Fang, Rui Yan, Billy Chiu, Shuo Shang
Categories: Information Retrieval (cs.IR)
Abstract:Sequential Recommendation Systems (SRS) have become essential in many real-world applications. However, existing SRS methods often rely on collaborative filtering signals and fail to capture real-time user preferences, while Conversational Recommendation Systems (CRS) excel at eliciting immediate interests through natural language interactions but neglect historical behavior. To bridge this gap, we propose CESRec, a novel framework that integrates the long-term preference modeling of SRS with the real-time preference elicitation of CRS. We introduce semantic-based pseudo interaction construction, which dynamically updates users’ historical interaction sequences by analyzing conversational feedback, generating a pseudo-interaction sequence that seamlessly combines long-term and real-time preferences. Additionally, we reduce the impact of outliers in historical items that deviate from users’ core preferences by proposing dual alignment outlier items masking, which identifies and masks such items using semantic-collaborative aligned representations. Extensive experiments demonstrate that CESRec achieves state-of-the-art performance by boosting strong SRS models, validating its effectiveness in integrating conversational feedback into SRS.
[IR-3] Modality Alignment with Multi-scale Bilateral Attention for Multimodal Recommendation CIKM2025
Link: https://arxiv.org/abs/2509.09114
Authors: Kelin Ren, Chan-Yang Ju, Dong-Ho Lee
Categories: Information Retrieval (cs.IR)
Comments: Accepted by CIKM 2025
Abstract:Multimodal recommendation systems are increasingly becoming foundational technologies for e-commerce and content platforms, enabling personalized services by jointly modeling users’ historical behaviors and the multimodal features of items (e.g., visual and textual). However, most existing methods rely on either static fusion strategies or graph-based local interaction modeling, facing two critical limitations: (1) insufficient ability to model fine-grained cross-modal associations, leading to suboptimal fusion quality; and (2) a lack of global distribution-level consistency, causing representational bias. To address these, we propose MambaRec, a novel framework that integrates local feature alignment and global distribution regularization via attention-guided learning. At its core, we introduce the Dilated Refinement Attention Module (DREAM), which uses multi-scale dilated convolutions with channel-wise and spatial attention to align fine-grained semantic patterns between visual and textual modalities. This module captures hierarchical relationships and context-aware associations, improving cross-modal semantic modeling. Additionally, we apply Maximum Mean Discrepancy (MMD) and contrastive loss functions to constrain global modality alignment, enhancing semantic consistency. This dual regularization reduces mode-specific deviations and boosts robustness. To improve scalability, MambaRec employs a dimensionality reduction strategy to lower the computational cost of high-dimensional multimodal features. Extensive experiments on real-world e-commerce datasets show that MambaRec outperforms existing methods in fusion quality, generalization, and efficiency. Our code has been made publicly available at this https URL.
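The Maximum Mean Discrepancy (MMD) term mentioned above can be illustrated with a small, generic estimator. The sketch below computes the biased estimate of squared MMD with a Gaussian kernel between two sets of feature vectors. The bandwidth, the toy "visual" and "textual" features, and the function names are assumptions for illustration and are not taken from the MambaRec code.

```python
import math


def gaussian_kernel(u, v, bandwidth=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


visual = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]        # toy "visual" features
textual_near = [[0.1, 0.1], [0.0, 0.2], [0.2, 0.1]]  # similar distribution
textual_far = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]   # shifted distribution

print(mmd_squared(visual, textual_near) < mmd_squared(visual, textual_far))
```

Used as a regularizer, a term like this would be added to the training loss so that the two modality distributions are pulled together. How MambaRec weights it against the contrastive term is not stated in the abstract.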