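This post lists the latest papers retrieved from arXiv.org on 2025-06-18. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: The daily paper data is fetched from arXiv.org and updated automatically at around 12:00 every day.
Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-06-18)
545 papers were updated today, including:
- Natural language processing: 84 papers (Computation and Language (cs.CL))
- Artificial intelligence: 182 papers (Artificial Intelligence (cs.AI))
- Computer vision: 93 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 151 papers (Machine Learning (cs.LG))
Natural Language Processing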
[NLP-0] A Variational Framework for Improving Naturalness in Generative Spoken Language Models ICML
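[Quick read]: This paper addresses the loss of naturalness in speech generation caused by semantic tokens that capture only linguistic information. Existing methods mitigate this by adding pitch features, but pitch cannot fully represent paralinguistic attributes, and feature selection requires manual design. The key to the proposed solution is an end-to-end variational approach that automatically learns to encode continuous speech attributes to enhance the semantic tokens, removing the need to manually extract and select paralinguistic features and producing speech continuations that human raters prefer.
Link: https://arxiv.org/abs/2506.14767
Authors: Li-Wei Chen,Takuya Higuchi,Zakaria Aldeneh,Ahmed Hussen Abdelaziz,Alexander Rudnicky
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: International Conference on Machine Learning (ICML) 2025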
Abstract:The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, it produces preferred speech continuations according to human raters. Code, samples and models are available at this https URL.
[NLP-1] ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM
[Quick read]: This paper aims to address hallucination, which is widespread in Multimodal Large Language Models (MLLMs): models over-rely on partial cues and generate inaccurate responses. The key insight is that methods such as Visual Contrastive Decoding (VCD) and Instruction Contrastive Decoding (ICD) reduce hallucination by influencing the model's internal attention mechanisms, not merely through surface-level modification of the logits. Inspired by this, the authors propose an attention-steerable contrastive decoding framework that intervenes directly in the model's attention mechanisms, offering a more principled approach to hallucination mitigation.
Link: https://arxiv.org/abs/2506.14766
Authors: Yujun Wang,Jinhe Bi,Yunpu Ma,Soeren Pirk
Affiliations: Christian-Albrechts-Universität zu Kiel; Ludwig Maximilian University of Munich; Munich Center for Machine Learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 15 pages, 7 figures
Abstract:Multimodal Large Language Models (MLLMs) often suffer from hallucinations. They over-rely on partial cues and generate incorrect responses. Recently, methods like Visual Contrastive Decoding (VCD) and Instruction Contrastive Decoding (ICD) have been proposed to mitigate hallucinations by contrasting predictions from perturbed or negatively prefixed inputs against original outputs. In this work, we uncover that methods like VCD and ICD fundamentally influence internal attention dynamics of the model. This observation suggests that their effectiveness may not stem merely from surface-level modifications to logits but from deeper shifts in attention distribution. Inspired by this insight, we propose an attention-steerable contrastive decoding framework that directly intervenes in attention mechanisms of the model to offer a more principled approach to mitigating hallucinations. Our experiments across multiple MLLM architectures and diverse decoding methods demonstrate that our approach significantly reduces hallucinations and improves the performance on benchmarks such as POPE, CHAIR, and MMHal-Bench, while simultaneously enhancing performance on standard VQA benchmarks.
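The logit-level contrast that this abstract attributes to methods like VCD can be sketched as follows. This is a minimal illustration and not the paper's attention-steering method: the function names, the toy logits, and the weight `alpha` are assumptions, and the rule `(1 + alpha) * original - alpha * perturbed` is the form commonly used for contrastive decoding.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(original, perturbed, alpha=1.0):
    """Amplify what the original input supports and the perturbed input does not.

    `alpha` is a hypothetical contrast weight chosen for illustration.
    """
    return [(1 + alpha) * o - alpha * p for o, p in zip(original, perturbed)]


# Toy next-token logits over a 3-token vocabulary (made-up numbers).
original = [2.0, 1.0, 0.5]   # logits given the clean input
perturbed = [2.0, 0.2, 0.5]  # logits given a perturbed input

probs = softmax(contrastive_logits(original, perturbed, alpha=1.0))
print(probs)
```

In this toy case the second token loses support under the perturbed input, so the contrast raises its probability relative to plain decoding from `original`.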
[NLP-2] From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
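[Quick read]: This paper addresses the fixed granularity that conventional tokenization methods such as Byte Pair Encoding (BPE) impose on input text, which constrains how a language model operates on data and how far ahead it predicts. The key to the solution is an autoregressive U-Net that learns to embed its own tokens during training, giving a dynamic multi-scale view of the sequence. The network reads raw bytes and progressively pools them into words, pairs of words, and then groups of up to four words, so that deeper stages predict further into the future and focus on broader semantic patterns while shallower stages handle fine details.
Link: https://arxiv.org/abs/2506.14761
Authors: Mathurin Videau,Badr Youbi Idrissi,Alessandro Leite,Marc Schoenauer,Olivier Teytaud,David Lopez-Paz
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: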
Abstract:Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and leave the model stuck with that choice. We relax this rigidity by introducing an autoregressive U-Net that learns to embed its own tokens as it trains. The network reads raw bytes, pools them into words, then pairs of words, then up to 4 words, giving it a multi-scale view of the sequence. At deeper stages, the model must predict further into the future – anticipating the next few words rather than the next byte – so deeper stages focus on broader semantic patterns while earlier stages handle fine details. When carefully tuning and controlling pretraining compute, shallow hierarchies tie strong BPE baselines, and deeper hierarchies have a promising trend. Because tokenization now lives inside the model, the same system can handle character-level tasks and carry knowledge across low-resource languages.
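The multi-scale view described in this abstract (bytes, then words, then pairs of words, then groups of up to 4 words) can be pictured with a small segmentation sketch. It only illustrates the grouping hierarchy: the real model pools learned embeddings inside a U-Net, and splitting on spaces, the helper `pool`, and the sample sentence are all assumptions made for this example.

```python
def pool(units, size):
    """Group consecutive units into chunks of at most `size` items."""
    return [units[i:i + size] for i in range(0, len(units), size)]


text = "the cat sat on the mat"

raw_bytes = list(text.encode("utf-8"))  # stage 0: raw bytes
words = text.split(" ")                 # stage 1: bytes pooled into words
pairs = pool(words, 2)                  # stage 2: pairs of words
quads = pool(words, 4)                  # stage 3: groups of up to 4 words

print(len(raw_bytes), len(words), len(pairs), len(quads))
```

Each stage yields a shorter sequence than the one before it, which is why a prediction made at a deeper stage covers a longer stretch of future text.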
[NLP-3] Reasoning with Exploration: An Entropy Perspective
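[Quick read]: This paper addresses the balance between exploration and exploitation in reinforcement learning (RL), in particular the fact that existing methods for improving language model (LM) reasoning lean toward exploitation and struggle to break through performance plateaus. The key to the solution is to revisit entropy as an exploration signal and add an entropy-based term to the advantage function of standard RL, which encourages longer and deeper reasoning chains rather than promoting exploration simply by increasing uncertainty. The method achieves significant gains on the Pass@K metric, especially at large K, pushing the boundaries of LM reasoning.
Link: https://arxiv.org/abs/2506.14758
Authors: Daixuan Cheng,Shaohan Huang,Xuekai Zhu,Bo Dai,Wayne Xin Zhao,Zhenliang Zhang,Furu Wei
Affiliations: RUC (Renmin University of China); MSRA (Microsoft Research Asia); SJTU (Shanghai Jiao Tong University); BIGAI (Beijing Institute for General Artificial Intelligence)
Categories: Computation and Language (cs.CL)
Comments: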
Abstract:Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing language model (LM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy – a signal of exploration in RL – and examine its relationship to exploratory reasoning in LMs. Through empirical analysis, we uncover strong positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric – an upper-bound estimator of LM reasoning capabilities – even when evaluated with extremely large K values, pushing the boundaries of LM reasoning.
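The idea of "augmenting the advantage function with an entropy-based term" can be sketched roughly as below. The abstract does not give the exact form of the term, so the additive rule `advantage + alpha * entropy`, the coefficient `alpha`, and the function names are assumptions for illustration only.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def shaped_advantages(advantages, token_dists, alpha=0.1):
    """Add a per-token entropy bonus to each advantage (hypothetical form)."""
    return [a + alpha * token_entropy(d) for a, d in zip(advantages, token_dists)]


# Two tokens with the same base advantage: one confident, one uncertain.
advantages = [1.0, 1.0]
token_dists = [[1.0, 0.0], [0.5, 0.5]]

print(shaped_advantages(advantages, token_dists, alpha=0.1))
```

The token drawn from the uncertain distribution receives the larger shaped advantage, which matches the abstract's observation that high-entropy positions coincide with exploratory reasoning steps.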
[NLP-4] Optimizing Length Compression in Large Reasoning Models
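[Quick read]: This paper addresses the redundant and unnecessary content that Large Reasoning Models (LRMs) produce when generating reasoning chains. It identifies the core issue as "invalid thinking": models tend to repeatedly double-check their work after reaching the correct answer. The key to the solution is two fine-grained principles: Brevity, which eliminates redundancy, and Sufficiency, which ensures that critical reasoning steps are preserved. Based on these principles, the authors propose LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO) that combines a Length Reward with a Compress Reward, substantially reducing the verbosity of the reasoning process while maintaining high accuracy.
Link: https://arxiv.org/abs/2506.14755
Authors: Zhengxiang Cheng,Dongping Chen,Mingyang Fu,Tianyi Zhou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages, 7 figures, 4 tables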
Abstract:Large Reasoning Models (LRMs) have achieved remarkable success, yet they often suffer from producing unnecessary and verbose reasoning chains. We identify a core aspect of this issue as “invalid thinking” – models tend to repeatedly double-check their work after having derived the correct answer. To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency to propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process. Extensive experiments on multiple reasoning benchmarks demonstrate that LC-R1 achieves a significant reduction in sequence length (~50%) with only a marginal (~2%) drop in accuracy, achieving a favorable trade-off point on the Pareto frontier that prioritizes high compression. Our analysis further validates the robustness of LC-R1 and provides valuable insights for developing more powerful yet computationally efficient LRMs. Our code is released at this https URL.
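A rough sketch of how a length-sensitive reward could feed into GRPO-style group-relative advantages is shown below. It is not LC-R1's actual reward: the abstract does not define the Length Reward or Compress Reward, so `length_aware_reward`, its `weight`, the sample data, and the use of the population standard deviation are all assumptions. Only the normalization step (reward minus group mean, divided by group standard deviation) follows the general GRPO recipe.

```python
import statistics


def length_aware_reward(correct, length, max_length, weight=0.5):
    """Hypothetical reward: 1 for a correct answer plus a bonus for brevity."""
    if not correct:
        return 0.0
    return 1.0 + weight * (1.0 - length / max_length)


def group_relative_advantages(rewards):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# (is_correct, token_length) for three sampled responses to one prompt.
samples = [(True, 200), (True, 800), (False, 400)]
rewards = [length_aware_reward(c, n, max_length=1000) for c, n in samples]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Under these assumptions the short correct response gets the highest advantage, the long correct response a smaller one, and the incorrect response a negative one.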
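[NLP-5] Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs
[Quick read]: This paper aims to balance efficiency and robustness in large language models on reasoning tasks, in particular reducing computational cost while maintaining high performance. The key solution is a model built on the Mixture-of-Experts (MoE) architecture and optimized with reinforcement learning (RL), together with a joint training pipeline that combines distillation with RL. In addition, the paper proposes the C3PO algorithm, which improves training stability and computational throughput through algorithm-system co-design, and selects distillation checkpoints based on entropy loss to obtain a better performance-efficiency trade-off. Finally, a two-stage training paradigm resolves domain conflicts that arise when integrating multi-domain data.
Link: https://arxiv.org/abs/2506.14731
Authors: Ring Team,Bin Hu,Cai Chen,Deng Zhao,Ding Liu,Dingnan Jin,Feng Zhu,Hao Dai,Hongzhi Luan,Jia Guo,Jiaming Liu,Jiewei Wu,Jun Mei,Jun Zhou,Junbo Zhao,Junwu Xiong,Kaihong Zhang,Kuan Xu,Lei Liang,Liang Jiang,Liangcheng Fu,Longfei Zheng,Qiang Gao,Qing Cui,Quan Wan,Shaomian Zheng,Shuaicheng Li,Tongkai Yang,Wang Ren,Xiaodong Yan,Xiaopei Wan,Xiaoyun Feng,Xin Zhao,Xinxing Yang,Xinyu Kong,Xuemin Yang,Yang Li,Yingting Wu,Yongkang Liu,Zhankai Xu,Zhenduo Zhang,Zhenglei Zhou,Zhenyu Huang,Zhiqiang Zhang,Zihao Wang,Zujie Wen
Affiliations: Inclusion AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Technical Report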
Abstract:We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8 billion parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training. First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization(C3PO), a novel approach that enhances training stability and improves computational throughput via algorithm-system co-design methodology. Second, we empirically demonstrate that selecting distillation checkpoints based on entropy loss for RL training, rather than validation metrics, yields superior performance-efficiency trade-offs in subsequent RL training. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing domain conflicts that arise in training with mixed dataset. We will release the model, dataset, and code.
[NLP-6] Capacity Matters: a Proof-of-Concept for Transformer Memorization on Real-World Data ACL2025
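[Quick read]: This paper studies how model architecture and data configuration affect the empirical memorization capacity of generative transformers. Models are trained on synthetic text datasets derived from the Systematized Nomenclature of Medicine (SNOMED) knowledge graph, and the effect of different factors on memorization is analyzed. The findings: embedding size is the primary determinant of learning speed and capacity; the choice of activation function matters considerably, with Softmax showing greater stability and capacity; and increasing dataset complexity appears to improve final memorization.
Link: https://arxiv.org/abs/2506.14704
Authors: Anton Changalidis,Aki Härmä
Affiliations: Maastricht University
Categories: Computation and Language (cs.CL)
Comments: This work has been accepted for publication at the First Workshop on Large Language Model Memorization (L2M2) at ACL 2025, Vienna, Austria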
Abstract:This paper studies how the model architecture and data configurations influence the empirical memorization capacity of generative transformers. The models are trained using synthetic text datasets derived from the Systematized Nomenclature of Medicine (SNOMED) knowledge graph: triplets, representing static connections, and sequences, simulating complex relation patterns. The results show that embedding size is the primary determinant of learning speed and capacity, while additional layers provide limited benefits and may hinder performance on simpler datasets. Activation functions play a crucial role, and Softmax demonstrates greater stability and capacity. Furthermore, increasing the complexity of the data set seems to improve the final memorization. These insights improve our understanding of transformer memory mechanisms and provide a framework for optimizing model design with structured real-world data.
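[NLP-7] Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers
[Quick read]: This paper addresses the poor performance of modern machine learning on the long tail of rare and low-frequency features, that is, the performance drop on use cases underrepresented in the training data. The key to the solution is to optimize the training protocol to improve both controllability and performance on low-frequency use cases at inference time. The authors build a detailed taxonomy of data characteristics and task provenance to explicitly control generation attributes and implicitly condition generations at inference time, and fine-tune a base model to infer these markers automatically, so that they are optional at inference time. The approach yields pronounced improvements, especially on long-tail data.
Link: https://arxiv.org/abs/2506.14702
Authors: Daniel D’souza,Julia Kreutzer,Adrien Morisot,Ahmet Üstün,Sara Hooker
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: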
Abstract:One of the most profound challenges of modern machine learning is performing well on the long-tail of rare and underrepresented features. Large general-purpose models are trained for many tasks, but work best on high-frequency use cases. After training, it is hard to adapt a model to perform well on specific use cases underrepresented in the training corpus. Relying on prompt engineering or few-shot examples to maximize the output quality on a particular test case can be frustrating, as models can be highly sensitive to small changes, react in unpredicted ways or rely on a fixed system prompt for maintaining performance. In this work, we ask: “Can we optimize our training protocols to both improve controllability and performance on underrepresented use cases at inference time?” We revisit the divide between training and inference techniques to improve long-tail performance while providing users with a set of control levers the model is trained to be responsive to. We create a detailed taxonomy of data characteristics and task provenance to explicitly control generation attributes and implicitly condition generations at inference time. We fine-tune a base model to infer these markers automatically, which makes them optional at inference time. This principled and flexible approach yields pronounced improvements in performance, especially on examples from the long tail of the training distribution. While we observe an average lift of 5.7% win rates in open-ended generation quality with our markers, we see over 9.1% gains in underrepresented domains. We also observe relative lifts of up to 14.1% on underrepresented tasks like CodeRepair and absolute improvements of 35.3% on length instruction following evaluations.
[NLP-8] Massive Supervised Fine-tuning Experiments Reveal How Data Layer and Training Factors Shape LLM Alignment Quality
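[Quick read]: This paper addresses the many open questions about supervised fine-tuning (SFT) in aligning large language models (LLMs) with human instructions and values. The key approach is to train a wide range of base models on a variety of datasets, producing more than 1,000 SFT models, and to systematically analyze dataset properties and the layer-wise parameter changes introduced by SFT. This reveals factors that shape SFT outcomes, such as the ability of perplexity to predict SFT effectiveness and the correlation between mid-layer weight changes and performance gains.
Link: https://arxiv.org/abs/2506.14681
Authors: Yuto Harada,Yusuke Yamauchi,Yusuke Oda,Yohei Oseki,Yusuke Miyao,Yu Takagi
Affiliations: NII LLMC; The University of Tokyo; NAIST; Nagoya Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: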
Abstract:Supervised fine-tuning (SFT) is a critical step in aligning large language models (LLMs) with human instructions and values, yet many aspects of SFT remain poorly understood. We trained a wide range of base models on a variety of datasets including code generation, mathematical reasoning, and general-domain tasks, resulting in 1,000+ SFT models under controlled conditions. We then identified the dataset properties that matter most and examined the layer-wise modifications introduced by SFT. Our findings reveal that some training-task synergies persist across all models while others vary substantially, emphasizing the importance of model-specific strategies. Moreover, we demonstrate that perplexity consistently predicts SFT effectiveness–often surpassing superficial similarity between trained data and benchmark–and that mid-layer weight changes correlate most strongly with performance gains. We will release these 1,000+ SFT models and benchmark results to accelerate further research.
[NLP-9] GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors
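[Quick read]: This paper addresses two limitations of combining parameter-efficient fine-tuning (PEFT) methods such as Low-Rank Adaptation (LoRA) with Mixture-of-Experts (MoE): first, the influence of downstream tasks on the allocation of expert numbers, and second, the uniform rank assigned to all LoRA experts, which restricts representational diversity. The key to the solution is GuiLoMo, a fine-grained layer-wise strategy for allocating expert numbers and ranks based on GuidedSelection Vectors (GSVs). The GSVs are learned through a prior bilevel optimization process to capture model- and task-specific needs and are then used to allocate optimal expert numbers and ranks.
Link: https://arxiv.org/abs/2506.14646
Authors: Hengyuan Zhang,Xinrong Chen,Yingmin Qiu,Xiao Liang,Ziyue Li,Guanyu Wang,Weiping Li,Tong Mo,Wenyue Li,Hayden Kwok-Hay So,Ngai Wong
Affiliations: Tsinghua University; Peking University; Beijing University of Posts and Telecommunications; National Science Library, Chinese Academy of Sciences; Baidu; The University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: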
Abstract:Parameter-efficient fine-tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), offer an efficient way to adapt large language models with reduced computational costs. However, their performance is limited by the small number of trainable parameters. Recent work combines LoRA with the Mixture-of-Experts (MoE), i.e., LoRA-MoE, to enhance capacity, but two limitations remain in hindering the full exploitation of its potential: 1) the influence of downstream tasks when assigning expert numbers, and 2) the uniform rank assignment across all LoRA experts, which restricts representational diversity. To mitigate these gaps, we propose GuiLoMo, a fine-grained layer-wise expert numbers and ranks allocation strategy with GuidedSelection Vectors (GSVs). GSVs are learned via a prior bilevel optimization process to capture both model- and task-specific needs, and are then used to allocate optimal expert numbers and ranks. Experiments on three backbone models across diverse benchmarks show that GuiLoMo consistently achieves superior or comparable performance to all baselines. Further analysis offers key insights into how expert numbers and ranks vary across layers and tasks, highlighting the benefits of adaptive expert configuration. Our code is available at this https URL.
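[NLP-10] Passing the Turing Test in Political Discourse: Fine-Tuning LLMs to Mimic Polarized Social Media Comments
[Quick read]: This paper examines the risk that large language models (LLMs) may exacerbate ideological polarization in political discussion, in particular through the automated generation of persuasive and biased content. The approach is to fine-tune an open-source LLM on a dataset of politically charged discussions extracted from Reddit so that it produces context-aware and ideologically aligned responses, and to evaluate the credibility and rhetorical alignment of its outputs through linguistic analysis, sentiment scoring, and human annotation. The results show that LLMs trained on partisan data can produce highly plausible and provocative comments, which raises ethical questions about the use of AI in political discourse, disinformation, and manipulation campaigns.
Link: https://arxiv.org/abs/2506.14645
Authors: . Pazzaglia,V. Vendetti,L. D. Comencini,F. Deriu,V. Modugno
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: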
Abstract:The increasing sophistication of large language models (LLMs) has sparked growing concerns regarding their potential role in exacerbating ideological polarization through the automated generation of persuasive and biased content. This study explores the extent to which fine-tuned LLMs can replicate and amplify polarizing discourse within online environments. Using a curated dataset of politically charged discussions extracted from Reddit, we fine-tune an open-source LLM to produce context-aware and ideologically aligned responses. The model’s outputs are evaluated through linguistic analysis, sentiment scoring, and human annotation, with particular attention to credibility and rhetorical alignment with the original discourse. The results indicate that, when trained on partisan data, LLMs are capable of producing highly plausible and provocative comments, often indistinguishable from those written by humans. These findings raise significant ethical questions about the use of AI in political discourse, disinformation, and manipulation campaigns. The paper concludes with a discussion of the broader implications for AI governance, platform regulation, and the development of detection tools to mitigate adversarial fine-tuning risks.
[NLP-11] Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot
[Quick read]: This paper examines whether exemplar-based Chain-of-Thought (CoT) prompting is still effective for large language models (LLMs), specifically whether traditional CoT exemplars still improve performance on mathematical reasoning tasks. The study finds that for strong models such as the Qwen2.5 series, traditional CoT exemplars do not outperform zero-shot CoT; their main function is to align the output format with human expectations. The authors further explore enhanced CoT exemplars generated by advanced models, but experiments show that these still fail to improve reasoning ability, revealing the limitations of the current ICL+CoT framework in mathematical reasoning.
Link: https://arxiv.org/abs/2506.14641
Authors: Xiang Cheng,Chengyan Pan,Minjun Zhao,Deyang Li,Fangchao Liu,Xinyu Zhang,Xiao Zhang,Yong Liu
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Huawei Poisson Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 22 figures
Abstract:In-Context Learning (ICL) is an essential emergent ability of Large Language Models (LLMs), and recent studies introduce Chain-of-Thought (CoT) to exemplars of ICL to enhance the reasoning capability, especially in mathematics tasks. However, given the continuous advancement of model capabilities, it remains unclear whether CoT exemplars still benefit recent, stronger models in such tasks. Through systematic experiments, we find that for recent strong models such as the Qwen2.5 series, adding traditional CoT exemplars does not improve reasoning performance compared to Zero-Shot CoT. Instead, their primary function is to align the output format with human expectations. We further investigate the effectiveness of enhanced CoT exemplars, constructed using answers from advanced models such as Qwen2.5-Max and DeepSeek-R1. Experimental results indicate that these enhanced exemplars still fail to improve the model’s reasoning performance. Further analysis reveals that models tend to ignore the exemplars and focus primarily on the instructions, leading to no observable gain in reasoning ability. Overall, our findings highlight the limitations of the current ICL+CoT framework in mathematical reasoning, calling for a re-examination of the ICL paradigm and the definition of exemplars.
[NLP-12] AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation
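[Quick read]: This paper examines how large language models (LLMs) can be used to classify open-ended survey responses in survey research, in particular their applicability and classification quality in non-English contexts and on complex topics. The approach is to compare different LLMs and prompting strategies on German-language data and to evaluate predictive performance against human expert codings. The results show that only a fine-tuned LLM achieves satisfactory classification performance, and that performance differs markedly across LLMs and prompting approaches.
Link: https://arxiv.org/abs/2506.14634
Authors: Leah von der Heyde,Anna-Carolina Haensch,Bernd Weiß,Jessika Daikeler
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: to appear in Survey Research Methods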
Abstract:The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs’ performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs’ unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.
[NLP-13] VisText-Mosquito: A Multimodal Dataset and Benchmark for AI-Based Mosquito Breeding Site Detection and Reasoning
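[Quick read]: This paper addresses early detection of mosquito-borne disease transmission and proactive control of breeding sites to prevent outbreaks. The key to the solution is VisText-Mosquito, a multimodal dataset that integrates visual and textual data to support automated detection, segmentation, and reasoning for mosquito breeding sites. Using deep learning models such as YOLOv9s and YOLOv11n-Seg, together with a fine-tuned BLIP model, the work achieves high-precision object detection, water surface segmentation, and natural language reasoning generation, reflecting the theme "Prevention is Better than Cure".
Link: https://arxiv.org/abs/2506.14629
Authors: Md. Adnanul Islam,Md. Faiyaz Abdullah Sayeedi,Md. Asaduzzaman Shuvo,Muhammad Ziaur Rahman,Shahanur Rahman Bappy,Raiyan Rahman,Swakkhar Shatabda
Affiliations: United International University, Bangladesh; University of Portsmouth, United Kingdom; BRAC University, Bangladesh
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: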
Abstract:Mosquito-borne diseases pose a major global health risk, requiring early detection and proactive control of breeding sites to prevent outbreaks. In this paper, we present VisText-Mosquito, a multimodal dataset that integrates visual and textual data to support automated detection, segmentation, and reasoning for mosquito breeding site analysis. The dataset includes 1,828 annotated images for object detection, 142 images for water surface segmentation, and natural language reasoning texts linked to each image. The YOLOv9s model achieves the highest precision of 0.92926 and mAP@50 of 0.92891 for object detection, while YOLOv11n-Seg reaches a segmentation precision of 0.91587 and mAP@50 of 0.79795. For reasoning generation, our fine-tuned BLIP model achieves a final loss of 0.0028, with a BLEU score of 54.7, BERTScore of 0.91, and ROUGE-L of 0.87. This dataset and model framework emphasize the theme “Prevention is Better than Cure”, showcasing how AI-based detection can proactively address mosquito-borne disease risks. The dataset and implementation code are publicly available at GitHub: this https URL
[NLP-14] Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models
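[Quick read]: This paper addresses the divergence in moral reasoning that large language models (LLMs) show when confronted with complex, multi-factor moral dilemmas. The key to the solution is a framework that synthesizes the moral judgments of multiple LLMs into a collective moral judgment and realigns models that deviate from this consensus. The aggregation mechanism fuses continuous moral acceptability scores into a collective probability, weighting contributions by model reliability; for deviating models, a targeted embedding-optimization procedure fine-tunes their token embeddings to minimize the JS divergence to the consensus while preserving semantic integrity.
Link: https://arxiv.org/abs/2506.14625
Authors: Chenchen Yuan,Zheyu Zhang,Shuo Yang,Bardh Prenkaj,Gjergji Kasneci
Affiliations: Technical University of Munich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages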
Abstract:Large Language Models (LLMs) have shown impressive moral reasoning abilities. Yet they often diverge when confronted with complex, multi-factor moral dilemmas. To address these discrepancies, we propose a framework that synthesizes multiple LLMs’ moral judgments into a collectively formulated moral judgment, realigning models that deviate significantly from this consensus. Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability, weighting contributions by model reliability. For misaligned models, a targeted embedding-optimization procedure fine-tunes token embeddings for moral philosophical theories, minimizing JS divergence to the consensus while preserving semantic integrity. Experiments on a large-scale social moral dilemma dataset show our approach builds robust consensus and improves individual model fidelity. These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems.
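The two quantities named in this abstract, a reliability-weighted collective probability and the JS divergence of each model from the consensus, can be sketched as follows. The weighted mean is an assumed aggregation rule (the abstract does not specify one), and the scores, reliability weights, and function names are made up for illustration. The Jensen-Shannon divergence itself follows its standard definition.

```python
import math


def collective_probability(scores, reliabilities):
    """Reliability-weighted mean of per-model acceptability scores (assumed rule)."""
    total = sum(reliabilities)
    return sum(s * w for s, w in zip(scores, reliabilities)) / total


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats, skipping zero-mass terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Each model's probability that an action is morally acceptable (toy values).
scores = [0.9, 0.7, 0.2]
reliabilities = [1.0, 1.0, 2.0]  # hypothetical reliability weights

consensus = collective_probability(scores, reliabilities)
divergences = [
    js_divergence([s, 1 - s], [consensus, 1 - consensus]) for s in scores
]
print(consensus, divergences)
```

A model whose divergence from the consensus is large would be the candidate for the realignment step the abstract describes.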
[NLP-15] When Does Meaning Backfire? Investigating the Role of AMRs in NLI
[Quick read]: This paper addresses the shortcomings of pretrained language models in semantic understanding and generalization for Natural Language Inference (NLI). The approach is to introduce Abstract Meaning Representation (AMR) to help the model parse the semantic content of the premise and hypothesis. Experiments that integrate AMR in both fine-tuning and prompting settings show that, although AMR brings slight gains in the prompting setting, the improvement mainly comes from amplifying surface-level differences rather than aiding semantic reasoning, which can mislead the model into predicting non-entailment even when the core meaning is preserved.
Link: https://arxiv.org/abs/2506.14613
Authors: Junghyun Min,Xiulin Yang,Shira Wein
Affiliations: Georgetown University; Amherst College
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 2 figures
Abstract:Natural Language Inference (NLI) relies heavily on adequately parsing the semantic content of the premise and hypothesis. In this work, we investigate whether adding semantic information in the form of an Abstract Meaning Representation (AMR) helps pretrained language models better generalize in NLI. Our experiments integrating AMR into NLI in both fine-tuning and prompting settings show that the presence of AMR in fine-tuning hinders model generalization while prompting with AMR leads to slight gains in GPT-4o. However, an ablation study reveals that the improvement comes from amplifying surface-level differences rather than aiding semantic reasoning. This amplification can mislead models to predict non-entailment even when the core meaning is preserved.
[NLP-16] Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees
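[Quick read]: This paper addresses fast, flexible, and correct translation of low-level programs across instruction set architectures (ISAs) to improve the portability and longevity of existing code. The core challenge lies in the differences between complex (CISC) and reduced (RISC) instruction set architectures, including instruction complexity, memory models, and execution paradigms. The key to the solution is GG (Guaranteed Guess), which combines the translation ability of pre-trained large language models (LLMs) with the rigor of established software testing constructs, embedding the generated translations in a software-testing framework to build quantifiable confidence in the results.
Link: https://arxiv.org/abs/2506.14606
Authors: Ahmed Heakl,Sarim Hashmi,Chaimaa Abi,Celine Lee,Abdulrahman Mahmoud
Affiliations: MBZUAI; Cornell University
Categories: Computation and Language (cs.CL); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments: Project page: this https URL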
Abstract:The hardware ecosystem is rapidly evolving, with increasing interest in translating low-level programs across different instruction set architectures (ISAs) in a quick, flexible, and correct way to enhance the portability and longevity of existing code. A particularly challenging class of this transpilation problem is translating between complex- (CISC) and reduced- (RISC) hardware architectures, due to fundamental differences in instruction complexity, memory models, and execution paradigms. In this work, we introduce GG (Guaranteed Guess), an ISA-centric transpilation pipeline that combines the translation power of pre-trained large language models (LLMs) with the rigor of established software testing constructs. Our method generates candidate translations using an LLM from one ISA to another, and embeds such translations within a software-testing framework to build quantifiable confidence in the translation. We evaluate our GG approach over two diverse datasets, enforce high code coverage (98%) across unit tests, and achieve functional/semantic correctness of 99% on HumanEval programs and 49% on BringupBench programs, respectively. Further, we compare our approach to the state-of-the-art Rosetta 2 framework on Apple Silicon, showcasing 1.73x faster runtime performance, 1.47x better energy efficiency, and 2.41x better memory usage for our transpiled code, demonstrating the effectiveness of GG for real-world CISC-to-RISC translation tasks. We will open-source our codes, data, models, and benchmarks to establish a common foundation for ISA-level code translation research.
[NLP-17] Computational Studies in Influencer Marketing: A Systematic Literature Review
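[Quick read]: This paper addresses the fragmentation of computational research on influencer marketing, in particular the scarcity of systematic reviews of the computational methods used. That gap leaves overarching scientific measurement of the influencer economy scarce, to the detriment of stakeholders outside the platforms, such as regulators and researchers from other fields. The key to the solution is a systematic literature review (SLR) based on the PRISMA model that analyzes 69 studies to identify the field's key research themes, methods, and future directions, laying groundwork for more comprehensive, context-sensitive computational research.
Link: https://arxiv.org/abs/2506.14602
Authors: Haoyang Gui,Thales Bertaglia,Catalina Goanta,Gerasimos Spanakis
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: journal submission, under review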
Abstract:Influencer marketing has become a crucial feature of digital marketing strategies. Despite its rapid growth and algorithmic relevance, the field of computational studies in influencer marketing remains fragmented, especially with limited systematic reviews covering the computational methodologies employed. This makes overarching scientific measurements in the influencer economy very scarce, to the detriment of interested stakeholders outside of platforms themselves, such as regulators, but also researchers from other fields. This paper aims to provide an overview of the state of the art of computational studies in influencer marketing by conducting a systematic literature review (SLR) based on the PRISMA model. The paper analyses 69 studies to identify key research themes, methodologies, and future directions in this research field. The review identifies four major research themes: Influencer identification and characterisation, Advertising strategies and engagement, Sponsored content analysis and discovery, and Fairness. Methodologically, the studies are categorised into machine learning-based techniques (e.g., classification, clustering) and non-machine-learning-based techniques (e.g., statistical analysis, network analysis). Key findings reveal a strong focus on optimising commercial outcomes, with limited attention to regulatory compliance and ethical considerations. The review highlights the need for more nuanced computational research that incorporates contextual factors such as language, platform, and industry type, as well as improved model explainability and dataset reproducibility. The paper concludes by proposing a multidisciplinary research agenda that emphasises the need for further links to regulation and compliance technology, finer granularity in analysis, and the development of standardised datasets.
[NLP-18] GenerationPrograms: Fine-grained Attribution with Executable Programs
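[Quick read]: This paper addresses the difficulty large language models (LLMs) have in providing fine-grained attributions in source-conditioned text generation, which undermines the verifiability and trustworthiness of their outputs. Existing attribution methods also fail to explain how and why a model uses the provided source documents to produce its final response, limiting interpretability. The proposed solution is GenerationPrograms, a modular generation framework whose key idea is to split generation into two separate stages: first, create an executable program plan, tailored to the query, composed of modular text operations (such as paraphrasing, compression, and fusion); second, execute those operations as the program specifies to produce the final response. This staged structure significantly improves attribution quality and makes the model more interpretable.
Link: https://arxiv.org/abs/2506.14580
Authors: David Wan,Eran Hirsch,Elias Stengel-Eskin,Ido Dagan,Mohit Bansal
Affiliations: UNC Chapel Hill; Bar-Ilan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 27 Pages. Code: this https URL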
Abstract:Recent large language models (LLMs) achieve impressive performance in source-conditioned text generation but often fail to correctly provide fine-grained attributions for their outputs, undermining verifiability and trust. Moreover, existing attribution methods do not explain how and why models leverage the provided source documents to generate their final responses, limiting interpretability. To overcome these challenges, we introduce a modular generation framework, GenerationPrograms, inspired by recent advancements in executable “code agent” architectures. Unlike conventional generation methods that simultaneously generate outputs and attributions or rely on post-hoc attribution, GenerationPrograms decomposes the process into two distinct stages: first, creating an executable program plan composed of modular text operations (such as paraphrasing, compression, and fusion) explicitly tailored to the query, and second, executing these operations following the program’s specified instructions to produce the final response. Empirical evaluations demonstrate that GenerationPrograms significantly improves attribution quality at both the document level and sentence level across two long-form question-answering tasks and a multi-document summarization task. We further demonstrate that GenerationPrograms can effectively function as a post-hoc attribution method, outperforming traditional techniques in recovering accurate attributions. In addition, the interpretable programs generated by GenerationPrograms enable localized refinement through modular-level improvements that further enhance overall attribution quality.
[NLP-19] GDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization ICML2025
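[Quick read]: This paper addresses the difficulty of using fine-grained token-level reward models as guidance in Direct Preference Optimization (DPO), which is formulated as a sequence-level bandit problem. The key to the solution is to decompose sequence-level Proximal Policy Optimization (PPO) into a sequence of token-level proximal policy optimization problems and, on that basis, to frame token-level PPO with token-level reward guidance, from which a closed-form optimal token-level policy and the corresponding token-level reward are derived. Combining the resulting reward with the Bradley-Terry model yields a framework of computable loss functions, and the paper proposes a practical reward guidance based on the induced DPO reward, so that different tokens deviate from the reference policy to different degrees according to their rewards.
Link: https://arxiv.org/abs/2506.14574
Authors: Mingkang Zhu,Xi Chen,Zhongdao Wang,Bei Yu,Hengshuang Zhao,Jiaya Jia
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ICML 2025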
Abstract:Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language models. However, it is challenging to leverage such token-level reward as guidance for Direct Preference Optimization (DPO), since DPO is formulated as a sequence-level bandit problem. To address this challenge, this work decomposes the sequence-level PPO into a sequence of token-level proximal policy optimization problems and then frames the problem of token-level PPO with token-level reward guidance, from which closed-form optimal token-level policy and the corresponding token-level reward can be derived. Using the obtained reward and Bradley-Terry model, this work establishes a framework of computable loss functions with token-level reward guidance for DPO, and proposes a practical reward guidance based on the induced DPO reward. This formulation enables different tokens to exhibit varying degrees of deviation from reference policy based on their respective rewards. Experiment results demonstrate that our method achieves substantial performance improvements over DPO, with win rate gains of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard. Code is available at this https URL.
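For background on the sequence-level formulation the abstract starts from, the sketch below computes the standard DPO loss for one preference pair under the Bradley-Terry model. It is not the token-level GDPO loss, whose exact form the abstract does not give. The function name `dpo_loss`, its argument names, and the default `beta` are assumptions made for illustration; the inputs are summed log-probabilities of whole responses under the policy and the reference model.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard sequence-level DPO loss for a single preference pair."""
    # Implicit rewards: scaled log-ratio between policy and reference model.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # Bradley-Terry negative log-likelihood: -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy prefers the chosen response more than the reference does: lower loss.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

Because the reward here is a single scalar per response, every token is pushed away from the reference policy with the same strength `beta`, which is the limitation that token-level guidance is meant to relax.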
[NLP-20] AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs
[Quick read]: This paper addresses a shortcoming of conventional weight decay when training large language models (LLMs): applying one uniform decay rate to every layer ignores the structural diversity of the model and the differing spectral properties across modules. The key to the solution is AlphaDecay, which builds on Heavy-Tailed Self-Regularization (HT-SR) theory: it analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify "heavy-tailedness" and adaptively assigns each module its own weight decay strength accordingly, balancing the spectral differences between modules and improving model performance.
Link: https://arxiv.org/abs/2506.14562
Authors: Di He,Ajay Jaiswal,Songjun Tu,Li Shen,Ganzhao Yuan,Shiwei Liu,Lu Yin
Affiliations: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Peng Cheng Laboratory; University of Texas at Austin; Shenzhen Campus of Sun Yat-sen University; University of Oxford; University of Surrey
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify “heavy-tailedness.” Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines.
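The assignment rule in the abstract (heavier-tailed modules get weaker decay, lighter-tailed modules get stronger decay) can be sketched as below. This is an illustration under stated assumptions, not the AlphaDecay implementation: it assumes each module already has a fitted power-law exponent `alpha` for its ESD, with a smaller `alpha` meaning a heavier tail as in HT-SR analyses, and it uses a simple linear interpolation. The function name `assign_weight_decay`, the module names, and the `low`/`high` multipliers are hypothetical.

```python
from typing import Dict


def assign_weight_decay(
    alphas: Dict[str, float],
    base_decay: float = 0.1,
    low: float = 0.5,
    high: float = 1.5,
) -> Dict[str, float]:
    """Map per-module tail exponents to per-module weight decay values.

    Smaller alpha (heavier tail) maps toward base_decay * low,
    larger alpha (lighter tail) maps toward base_decay * high.
    """
    lo_a, hi_a = min(alphas.values()), max(alphas.values())
    span = hi_a - lo_a
    decays = {}
    for name, alpha in alphas.items():
        # Position of this module between the heaviest and lightest tail.
        t = 0.5 if span == 0 else (alpha - lo_a) / span
        decays[name] = base_decay * (low + t * (high - low))
    return decays


# Hypothetical exponents for three modules of one transformer block.
alphas = {"attn.q_proj": 2.0, "attn.v_proj": 3.0, "mlp.up_proj": 4.0}
for module, decay in assign_weight_decay(alphas).items():
    print(module, round(decay, 3))
```

In a training setup the resulting values would typically be passed as per-parameter-group `weight_decay` settings to the optimizer and refreshed as the spectra change.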
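[NLP-21] M2BeamLLM: Multimodal Sensing-empowered mmWave Beam Prediction with Large Language Models
[Quick read]: This paper aims to improve the accuracy and robustness of beam prediction in millimeter-wave (mmWave) massive multi-input multi-output (mMIMO) communication systems. The key to the solution is M2BeamLLM, a neural network framework that integrates multimodal sensor data (images, radar, LiDAR, and GPS) and uses the reasoning capabilities of large language models (LLMs) such as GPT-2 for beam prediction. Through sensing data encoding, multimodal alignment and fusion, and supervised fine-tuning (SFT), M2BeamLLM significantly outperforms traditional deep learning (DL) models in both standard and few-shot scenarios, and its prediction performance keeps improving as the diversity of sensing modalities increases.
Link: https://arxiv.org/abs/2506.14532
Authors: Can Zheng,Jiguang He,Chung G. Kang,Guofa Cai,Zitong Yu,Merouane Debbah
Affiliations: Korea University; Great Bay University; Guangdong University of Technology; Khalifa University of Science and Technology
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 20 figures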
Abstract:This paper introduces a novel neural network framework called M2BeamLLM for beam prediction in millimeter-wave (mmWave) massive multi-input multi-output (mMIMO) communication systems. M2BeamLLM integrates multi-modal sensor data, including images, radar, LiDAR, and GPS, leveraging the powerful reasoning capabilities of large language models (LLMs) such as GPT-2 for beam prediction. By combining sensing data encoding, multimodal alignment and fusion, and supervised fine-tuning (SFT), M2BeamLLM achieves significantly higher beam prediction accuracy and robustness, demonstrably outperforming traditional deep learning (DL) models in both standard and few-shot scenarios. Furthermore, its prediction performance consistently improves with increased diversity in sensing modalities. Our study provides an efficient and intelligent beam prediction solution for vehicle-to-infrastructure (V2I) mmWave communication systems.
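[NLP-22] LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops
[Quick read]: This paper examines how attackers can induce multimodal large language models (MLLMs) to generate excessive output at inference time, exhausting resources and degrading service. It proposes an attack called LingoLoop built on two core mechanisms: a Part-of-Speech (POS)-aware delay mechanism that adjusts attention weights to postpone generation of the end-of-sequence (EOS) token, and a generative path pruning mechanism that limits the magnitude of hidden states to push the model into repetitive loops, significantly increasing the number of generated tokens and the energy consumed.
Link: https://arxiv.org/abs/2506.14493
Authors: Jiyuan Fu,Kaixun Jiang,Lingyi Hong,Jinglun Li,Haijing Guo,Dingkang Yang,Zhaoyu Chen,Wenqiang Zhang
Affiliations: Fudan University
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: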
Abstract:Multimodal Large Language Models (MLLMs) have shown great promise but require substantial computational resources during inference. Attackers can exploit this by inducing excessive output, leading to resource exhaustion and service degradation. Prior energy-latency attacks aim to increase generation time by broadly shifting the output token distribution away from the EOS token, but they neglect the influence of token-level Part-of-Speech (POS) characteristics on EOS and sentence-level structural patterns on output counts, limiting their efficacy. To address this, we propose LingoLoop, an attack designed to induce MLLMs to generate excessively verbose and repetitive sequences. First, we find that the POS tag of a token strongly affects the likelihood of generating an EOS token. Based on this insight, we propose a POS-Aware Delay Mechanism to postpone EOS token generation by adjusting attention weights guided by POS information. Second, we identify that constraining output diversity to induce repetitive loops is effective for sustained generation. We introduce a Generative Path Pruning Mechanism that limits the magnitude of hidden states, encouraging the model to produce persistent loops. Extensive experiments demonstrate LingoLoop can increase generated tokens by up to 30 times and energy consumption by a comparable factor on models like Qwen2.5-VL-3B, consistently driving MLLMs towards their maximum generation limits. These findings expose significant MLLMs’ vulnerabilities, posing challenges for their reliable deployment. The code will be released publicly following the paper’s acceptance.
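[NLP-23] LexiMark: Robust Watermarking via Lexical Substitutions to Enhance Membership Verification of an LLM's Textual Training Data
[Quick read]: This paper addresses how to covertly detect whether a large language model (LLM) was trained on unauthorized data without harming the semantic integrity of the text. Existing methods lack stealth and are easy to detect and remove. The key to the proposed solution, LexiMark, is to embed the watermark through synonym substitutions for carefully selected high-entropy words, which strengthens the model's memorization of the watermarked text while keeping the text semantically consistent, making the watermark hard to detect and remove.
Link: https://arxiv.org/abs/2506.14474
Authors: Eyal German,Sagiv Antebi,Edan Habler,Asaf Shabtai,Yuval Elovici
Affiliations: Ben-Gurion University of the Negev
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: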
Abstract:Large language models (LLMs) can be trained or fine-tuned on data obtained without the owner’s consent. Verifying whether a specific LLM was trained on particular data instances or an entire dataset is extremely challenging. Dataset watermarking addresses this by embedding identifiable modifications in training data to detect unauthorized use. However, existing methods often lack stealth, making them relatively easy to detect and remove. In light of these limitations, we propose LexiMark, a novel watermarking technique designed for text and documents, which embeds synonym substitutions for carefully selected high-entropy words. Our method aims to enhance an LLM’s memorization capabilities on the watermarked text without altering the semantic integrity of the text. As a result, the watermark is difficult to detect, blending seamlessly into the text with no visible markers, and is resistant to removal due to its subtle, contextually appropriate substitutions that evade automated and manual detection. We evaluated our method using baseline datasets from recent studies and seven open-source models: LLaMA-1 7B, LLaMA-3 8B, Mistral 7B, Pythia 6.9B, as well as three smaller variants from the Pythia family (160M, 410M, and 1B). Our evaluation spans multiple training settings, including continued pretraining and fine-tuning scenarios. The results demonstrate significant improvements in AUROC scores compared to existing methods, underscoring our method’s effectiveness in reliably verifying whether unauthorized watermarked data was used in LLM training.
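[NLP-24] How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison
[Quick read]: This paper addresses a limitation of current LLM evaluation: existing benchmarks mainly assess static knowledge and do not adequately capture a model's ability to learn quickly from dynamic experience. The key to the solution is a testbed based on semantic games for evaluating test-time learning, the ability of a model to improve its performance through experience during testing. The study introduces an objective evaluation framework that compares model performance under limited and cumulative experience settings and includes four forms of experience representation, giving a comprehensive measure of dynamic learning ability.
Link: https://arxiv.org/abs/2506.14448
Authors: Jiayin Wang,Zhiquang Guo,Weizhi Ma,Min Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: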
Abstract:As evaluation designs of large language models may shape our trajectory toward artificial general intelligence, comprehensive and forward-looking assessment is essential. Existing benchmarks primarily assess static knowledge, while intelligence also entails the ability to rapidly learn from experience. To this end, we advocate for the evaluation of Test-time Learning, the capacity to improve performance in experience-based, reasoning-intensive tasks during test time. In this work, we propose semantic games as effective testbeds for evaluating test-time learning, due to their resistance to saturation and inherent demand for strategic reasoning. We introduce an objective evaluation framework that compares model performance under both limited and cumulative experience settings, and contains four forms of experience representation. To provide a comparative baseline, we recruit eight human participants to complete the same task. Results show that LLMs exhibit measurable test-time learning capabilities; however, their improvements are less stable under cumulative experience and progress more slowly than those observed in humans. These findings underscore the potential of LLMs as general-purpose learning machines, while also revealing a substantial intellectual gap between models and humans, irrespective of how well LLMs perform on static benchmarks.
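[NLP-25] LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
[Quick read]: This paper addresses the underexplored long-context capabilities of diffusion language models (diffusion LLMs) in generative AI, in particular the lack of systematic analysis and methods for context extrapolation. The key to the solution is LongLLaDA, a training-free long-context extension method that combines LLaDA with NTK-based RoPE extrapolation; it effectively improves the performance of diffusion LLMs on long-context tasks and reveals characteristics specific to their context extrapolation.
Link: https://arxiv.org/abs/2506.14429
Authors: Xiaoran Liu,Zhigeng Liu,Zengfeng Huang,Qipeng Guo,Ziwei He,Xipeng Qiu
Affiliations: Fudan University; Shanghai Innovation Institute; Shanghai AI Lab
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 12 figures, work in progress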
Abstract:Large Language Diffusion Models, or diffusion LLMs, have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We first identify a unique characteristic of diffusion LLMs, unlike auto-regressive LLMs, they maintain remarkably stable perplexity during direct context extrapolation. Furthermore, where auto-regressive models fail outright during the Needle-In-A-Haystack task with context exceeding their pretrained length, we discover diffusion LLMs exhibit a distinct local perception phenomenon, enabling successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with the NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs. Furthermore, we identify long-context tasks where diffusion LLMs outperform auto-regressive LLMs and others where they fall short. Consequently, this study establishes the first context extrapolation method for diffusion LLMs while providing essential theoretical insights and empirical benchmarks critical for advancing future research on long-context diffusion LLMs.
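To make "NTK-based RoPE extrapolation" concrete, the sketch below applies the commonly used NTK-aware rule, which enlarges the rotary base by `scale ** (dim / (dim - 2))` instead of retraining. It is an illustration of that general technique, not LongLLaDA's code: the abstract does not state which scaling factor the authors choose, and the function names, head dimension, and scale used here are assumptions.

```python
import math
from typing import List


def rope_inv_freq(dim: int, base: float = 10000.0) -> List[float]:
    """Inverse rotary frequencies, one per pair of embedding dimensions."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def ntk_scaled_base(base: float, scale: float, dim: int) -> float:
    """NTK-aware base adjustment for extending context by a factor of `scale`."""
    return base * scale ** (dim / (dim - 2))


dim, scale = 64, 4.0  # hypothetical head dimension and context extension factor
original = rope_inv_freq(dim)
scaled = rope_inv_freq(dim, ntk_scaled_base(10000.0, scale, dim))

# The highest frequency is untouched, so nearby positions stay distinguishable.
print(original[0], scaled[0])
# The lowest frequency shrinks by exactly `scale`, stretching the longest wavelength.
print(original[-1] / scaled[-1])
```

The design choice is that high-frequency dimensions, which encode local order, are left almost unchanged while low-frequency dimensions are interpolated, which is why the rule can be applied without any further training.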
[NLP-26] ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge
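[Quick read]: This paper addresses the reliance of conventional retrieval systems on surface-level cues (such as keyword overlap and lexical semantic similarity) and aims to evaluate retrieval on harder reasoning tasks. The key to the solution is the ImpliRet benchmark, which moves the reasoning challenge from the query side to the document side: the queries are simple, but relevance depends on facts stated implicitly in the documents through temporal, arithmetic, and world-knowledge relationships. This design requires retrieval systems to understand and reason over documents rather than rely on surface features.
Link: https://arxiv.org/abs/2506.14407
Authors: Zeinab Sadat Taghavi,Ali Modarressi,Yunpu Ma,Hinrich Schütze
Affiliations: Center for Information and Language Processing, LMU Munich; Munich Center for Machine Learning (MCML)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: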
Abstract:Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing techniques – like prompting or multi-hop retrieval – that can help resolve complexity. In contrast, we present ImpliRet, a benchmark that shifts the reasoning challenge to document-side processing: The queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving “two days ago”), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 15.07%. We also test whether long-context models can overcome this limitation. But even with a short context of only ten documents, including the positive document, GPT-4.1 scores only 35.06%, showing that document-side reasoning remains a challenge. Our codes are available at this http URL.
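The headline retrieval number above is nDCG@10. As a reminder of what that metric measures, here is a minimal sketch of the standard definition; it is not the benchmark's evaluation script. It assumes the input list holds the relevance label of every document in ranked order (so that all relevant documents appear somewhere in the list), and the function name `ndcg_at_k` is chosen for illustration.

```python
import math
from typing import Sequence


def ndcg_at_k(ranked_relevance: Sequence[float], k: int = 10) -> float:
    """Normalized discounted cumulative gain over the top-k ranked results."""

    def dcg(rels: Sequence[float]) -> float:
        # Rank 1 gets discount log2(2) = 1, rank 2 gets log2(3), and so on.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return 0.0 if ideal == 0 else dcg(ranked_relevance) / ideal


# One relevant document per query, as in a single-positive retrieval setting.
print(ndcg_at_k([1, 0, 0]))          # positive ranked first
print(ndcg_at_k([0, 1, 0]))          # positive ranked second
print(ndcg_at_k([0] * 10 + [1]))     # positive ranked eleventh, outside the top 10
```

With a single positive per query, the score depends only on the rank of that document, so a low average indicates that the positive is usually ranked far down or outside the top ten.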
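[NLP-27] Thunder-NUBench: A Benchmark for LLMs' Sentence-Level Negation Understanding
[Quick read]: This paper addresses the challenges large language models (LLMs) face in sentence-level negation understanding, and in particular the fact that existing benchmarks treat negation as a side case within broader tasks, leaving no evaluation dedicated to negation. The key to the solution is a new benchmark, Thunder-NUBench, designed specifically to assess sentence-level negation understanding in LLMs by contrasting standard negation with structurally diverse alternatives (such as local negation, contradiction, and paraphrase), enabling in-depth evaluation.
Link: https://arxiv.org/abs/2506.14397
Authors: Yeonkyoung So,Gyuseong Lee,Sungmok Jung,Joonhak Lee,JiA Kang,Sangho Kim,Jaejin Lee
Affiliations: Seoul National University
Categories: Computation and Language (cs.CL)
Comments: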
Abstract:Negation is a fundamental linguistic phenomenon that poses persistent challenges for Large Language Models (LLMs), particularly in tasks requiring deep semantic understanding. Existing benchmarks often treat negation as a side case within broader tasks like natural language inference, resulting in a lack of benchmarks that exclusively target negation understanding. In this work, we introduce Thunder-NUBench, a novel benchmark explicitly designed to assess sentence-level negation understanding in LLMs. Thunder-NUBench goes beyond surface-level cue detection by contrasting standard negation with structurally diverse alternatives such as local negation, contradiction, and paraphrase. The benchmark consists of manually curated sentence-negation pairs and a multiple-choice dataset that enables in-depth evaluation of models’ negation understanding.
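[NLP-28] ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection
[Quick read]: This paper addresses the concern that widespread adoption of chat interfaces based on large language models (LLMs) may promote superficial learning and undermine the development of critical thinking skills. The key to the solution is to use LLMs to generate critical questions that challenge unsupported or vague claims, encouraging deeper engagement with argumentative texts. The study proposes a two-step framework with a Questioner that generates multiple candidate questions and a Judge that selects the most relevant ones; the system ranked first in the shared task, demonstrating the potential of the LLM-based approach to encourage critical engagement.
Link: https://arxiv.org/abs/2506.14371
Authors: Lucile Favero,Daniel Frases,Juan Antonio Pérez-Ortiz,Tanja Käser,Nuria Oliver
Affiliations: ELLIS Alicante; Universidad Alfonso X el Sabio; Universitat d’Alacant; École Polytechnique Fédérale de Lausanne
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Proceedings of the 12th Workshop on Argument Mining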
Abstract:The widespread adoption of chat interfaces based on Large Language Models (LLMs) raises concerns about promoting superficial learning and undermining the development of critical thinking skills. Instead of relying on LLMs purely for retrieving factual information, this work explores their potential to foster deeper reasoning by generating critical questions that challenge unsupported or vague claims in debate interventions. This study is part of a shared task of the 12th Workshop on Argument Mining, co-located with ACL 2025, focused on automatic critical question generation. We propose a two-step framework involving two small-scale open source language models: a Questioner that generates multiple candidate questions and a Judge that selects the most relevant ones. Our system ranked first in the shared task competition, demonstrating the potential of the proposed LLM-based approach to encourage critical engagement with argumentative texts.
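[NLP-29] Digital Gatekeepers: Google's Role in Curating Hashtags and Subreddits ACL2025
[Quick read]: This paper examines how search engines, acting as digital gatekeepers, shape the visibility of web and social media content through algorithmic curation. The key to the study is comparing search engine results with nonsampled data from Reddit and Twitter/X to reveal systematic biases in content visibility, and then characterizing how Google's algorithms suppress subreddits and hashtags related to sexually explicit material, conspiracy theories, advertisements, and cryptocurrencies while promoting content with higher engagement.
Link: https://arxiv.org/abs/2506.14370
Authors: Amrit Poudel,Yifan Ding,Jurgen Pfeffer,Tim Weninger
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 Main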
Abstract:Search engines play a crucial role as digital gatekeepers, shaping the visibility of Web and social media content through algorithmic curation. This study investigates how search engines like Google selectively promote or suppress certain hashtags and subreddits, impacting the information users encounter. By comparing search engine results with nonsampled data from Reddit and Twitter/X, we reveal systematic biases in content visibility. Google’s algorithms tend to suppress subreddits and hashtags related to sexually explicit material, conspiracy theories, advertisements, and cryptocurrencies, while promoting content associated with higher engagement. These findings suggest that Google’s gatekeeping practices influence public discourse by curating the social media narratives available to users.
[NLP-30] A Vision for Geo-Temporal Deep Research Systems: Towards Comprehensive, Transparent, and Reproducible Geo-Temporal Information Synthesis
[Quick read]: This paper addresses the lack of geo-temporal capabilities in current deep research systems when answering context-rich questions that involve geographic and/or temporal constraints. The key to the solution is to augment the retrieval and synthesis processes with the ability to handle geo-temporal constraints, supported by open and reproducible infrastructure and rigorous evaluation protocols, to advance more capable, geo-temporally aware deep research systems.
Link: https://arxiv.org/abs/2506.14345
Authors: Bruno Martins,Piotr Szymański,Piotr Gramacki
Affiliations: Instituto Superior Técnico and INESC-ID; University of Lisbon; Wrocław University of Science and Technology
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:The emergence of Large Language Models (LLMs) has transformed information access, with current LLMs also powering deep research systems that can generate comprehensive report-style answers, through planned iterative search, retrieval, and reasoning. Still, current deep research systems lack the geo-temporal capabilities that are essential for answering context-rich questions involving geographic and/or temporal constraints, frequently occurring in domains like public health, environmental science, or socio-economic analysis. This paper reports our vision towards next generation systems, identifying important technical, infrastructural, and evaluative challenges in integrating geo-temporal reasoning into deep research pipelines. We argue for augmenting retrieval and synthesis processes with the ability to handle geo-temporal constraints, supported by open and reproducible infrastructures and rigorous evaluation protocols. Our vision outlines a path towards more advanced and geo-temporally aware deep research systems, of potential impact to the future of AI-driven information access.
[NLP-31] Evaluation Should Not Ignore Variation: On the Impact of Reference Set Choice on Summarization Metrics
[Quick read]: This paper addresses the sensitivity of summarization evaluation to the choice of reference set: existing reference-based metrics are not stable across reference sets, which makes model rankings unreliable. The key to the solution is a systematic analysis of how sensitive commonly used metrics are on several diverse multi-reference summarization datasets (SummEval, GUMSum, and DUC2004), together with a recommendation to incorporate reference set variation into summarization evaluation to improve consistency alongside correlation with human judgments, especially when evaluating large language models (LLMs).
Link: https://arxiv.org/abs/2506.14335
Authors: Silvia Casola,Yang Janet Liu,Siyao Peng,Oliver Kraus,Albert Gatt,Barbara Plank
Affiliations: MaiNLP, Center for Information and Language Processing, LMU Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany; NLP Group, Department of Information and Computing Sciences, Utrecht University, Netherlands
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 13 figures
Abstract:Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of using different reference sets on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.
[NLP-32] Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent ACL2025
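[Quick read]: This paper aims to address the tendency of Conversational Recommendation Agents (CRAs) to generate short-sighted responses in multi-turn dialogue, failing to sustain user guidance and meet expectations. Existing preference optimization methods are costly and perform poorly in multi-turn dialogue. The key to the solution is ECPO, a new multi-turn preference optimization paradigm based on Expectation Confirmation Theory that explicitly models how user satisfaction evolves over a multi-turn dialogue and uncovers the underlying causes of dissatisfaction, enabling turn-level preference optimization. ECPO eliminates the heavy sampling overhead of existing methods while ensuring that the optimization process yields meaningful improvements, significantly enhancing the interaction capabilities of CRAs.
Link: https://arxiv.org/abs/2506.14302
Authors: Xueyang Feng,Jingsen Zhang,Jiakai Tang,Wei Li,Guohao Cai,Xu Chen,Quanyu Dai,Yue Zhu,Zhenhua Dong
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE; Huawei Noah’s Ark Lab, China
Categories: Computation and Language (cs.CL)
Comments: Accepted to Findings of ACL 2025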
Abstract:Recent advancements in Large Language Models (LLMs) have significantly propelled the development of Conversational Recommendation Agents (CRAs). However, these agents often generate short-sighted responses that fail to sustain user guidance and meet expectations. Although preference optimization has proven effective in aligning LLMs with user expectations, it remains costly and performs poorly in multi-turn dialogue. To address this challenge, we introduce a novel multi-turn preference optimization (MTPO) paradigm ECPO, which leverages Expectation Confirmation Theory to explicitly model the evolution of user satisfaction throughout multi-turn dialogues, uncovering the underlying causes of dissatisfaction. These causes can be utilized to support targeted optimization of unsatisfactory responses, thereby achieving turn-level preference optimization. ECPO ingeniously eliminates the significant sampling overhead of existing MTPO methods while ensuring the optimization process drives meaningful improvements. To support ECPO, we introduce an LLM-based user simulator, AILO, to simulate user feedback and perform expectation confirmation during conversational recommendations. Experimental results show that ECPO significantly enhances CRA’s interaction capabilities, delivering notable improvements in both efficiency and effectiveness over existing MTPO methods.
[NLP-33] From What to Respond to When to Respond: Timely Response Generation for Open-domain Dialogue Agents
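[Quick read]: This paper addresses the question of when to respond in dialogue response generation (that is, response timing grounded in temporal context), which remains underexplored. The key to the solution is a new task, timely dialogue response generation, and the TimelyChat benchmark, which evaluates the ability of language models to predict appropriate time intervals and generate time-conditioned responses. The authors also build a large-scale training dataset by taking unlabeled event knowledge from a temporal commonsense knowledge graph and using a large language model (LLM) to synthesize 55K event-driven dialogues, and then train Timer, a dialogue agent that proactively predicts time intervals and generates timely responses aligned with them.
Link: https://arxiv.org/abs/2506.14285
Authors: Seongbo Jang,Minjin Jeon,Jaehoon Lee,Seonghyeon Lee,Dongha Lee,Hwanjo Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Work in progress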
Abstract:While research on dialogue response generation has primarily focused on generating coherent responses conditioning on textual context, the critical question of when to respond grounded on the temporal context remains underexplored. To bridge this gap, we propose a novel task called timely dialogue response generation and introduce the TimelyChat benchmark, which evaluates the capabilities of language models to predict appropriate time intervals and generate time-conditioned responses. Additionally, we construct a large-scale training dataset by leveraging unlabeled event knowledge from a temporal commonsense knowledge graph and employing a large language model (LLM) to synthesize 55K event-driven dialogues. We then train Timer, a dialogue agent designed to proactively predict time intervals and generate timely responses that align with those intervals. Experimental results show that Timer outperforms prompting-based LLMs and other fine-tuned baselines in both turn-level and dialogue-level evaluations. We publicly release our data, model, and code.
[NLP-34] Improving LoRA with Variational Learning
[Quick Read]: This paper addresses the problem that Bayesian methods for LoRA finetuning improve calibration but have only a marginal, sometimes detrimental, effect on other metrics such as accuracy, while also adding computational overhead. The key to the solution is a variational algorithm called IVON, which is easy to implement, costs about as much as AdamW, and markedly improves many metrics through a simple posterior pruning technique.
Link: https://arxiv.org/abs/2506.14280
Authors: Bai Cong,Nico Daheim,Yuesong Shen,Rio Yokota,Mohammad Emtiyaz Khan,Thomas Möllenhoff
Affiliations: Institute of Science Tokyo; RIKEN Center for AI Project; Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, Technical University of Darmstadt; National Research Center for Applied Cybersecurity ATHENE, Germany; Huawei Hong Kong Research Center
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 16 pages, 4 figures
Abstract:Bayesian methods have recently been used to improve LoRA finetuning and, although they improve calibration, their effect on other metrics (such as accuracy) is marginal and can sometimes even be detrimental. Moreover, Bayesian methods also increase computational overheads and require additional tricks for them to work well. Here, we fix these issues by using a recently proposed variational algorithm called IVON. We show that IVON is easy to implement and has similar costs to AdamW, and yet it can also drastically improve many metrics by using a simple posterior pruning technique. We present extensive results on billion-scale LLMs (Llama and Qwen series) going way beyond the scale of existing applications of IVON. For example, we finetune a Llama-3.2-3B model on a set of commonsense reasoning tasks and improve accuracy over AdamW by 1.3% and reduce ECE by 5.4%, outperforming AdamW and other recent Bayesian methods like Laplace-LoRA and BLoB. Overall, our results show that variational learning with IVON can effectively improve LoRA finetuning.
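The abstract reports calibration as ECE (expected calibration error). As a reminder of what that number measures, here is a minimal sketch of the common equal-width-bin estimator. The function name, the bin count, and the toy inputs are illustrative assumptions; the paper's exact evaluation setup is not given in this digest.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: predicted probability of the chosen answer, each in [0, 1].
    correct: booleans, True where the chosen answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes its calibration gap, weighted by its share of samples.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy example (made-up values): two confident hits, one hit and one miss at 0.6.
toy_ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False])
```

A lower value means the model's stated confidence tracks its actual accuracy more closely, which is the sense in which the abstract's "reduce ECE by 5.4%" is an improvement.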
[NLP-35] Re-Initialization Token Learning for Tool-Augmented Large Language Models
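[Quick Read]: This paper addresses the weak performance of large language models (LLMs) on complex tasks such as numerical reasoning and plan generation, and the limitations of existing methods for integrating external tools such as calculators and databases. Current methods assign each tool a unique token so that the LLM calls tools through token prediction, but they ignore the relationship between tool tokens and word tokens, which limits the model's adaptability. The key to the solution is a new token learning method that aligns tool tokens with the existing word embedding space from the perspective of initialization. The method builds a prior token embedding for each tool from its name or description and uses it to initialize and regularize the learnable tool token embeddings, so that the learned embeddings stay well aligned with the word token space and tool calls become more accurate.
Link: https://arxiv.org/abs/2506.14248
Authors: Chenghao Li,Liu Liu,Baosheng Yu,Jiayan Qiu,Yibing Zhan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: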
Abstract:Large language models have demonstrated exceptional performance, yet struggle with complex tasks such as numerical reasoning and plan generation. Integrating external tools, such as calculators and databases, into large language models (LLMs) is crucial for enhancing problem-solving capabilities. Current methods assign a unique token to each tool, enabling LLMs to call tools through token prediction, similar to word generation. However, this approach fails to account for the relationship between tool and word tokens, limiting adaptability within pre-trained LLMs. To address this issue, we propose a novel token learning method that aligns tool tokens with the existing word embedding space from the perspective of initialization, thereby enhancing model performance. We begin by constructing prior token embeddings for each tool based on the tool’s name or description, which are used to initialize and regularize the learnable tool token embeddings. This ensures the learned embeddings are well-aligned with the word token space, improving tool call accuracy. We evaluate the method on tasks such as numerical reasoning, knowledge-based question answering, and embodied plan generation using GSM8K-XL, FuncQA, KAMEL, and VirtualHome datasets. The results demonstrate clear improvements over recent baselines, including CoT, REACT, ICL, and ToolkenGPT, indicating that our approach effectively augments LLMs with tools through relevant tokens across diverse domains.
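The idea of initializing and regularizing a tool token from a prior can be sketched in a few lines. This is only an illustration under stated assumptions: mean pooling over the word embeddings of the tool name and a squared-L2 penalty are choices made here for concreteness, and the helper names and toy vectors are hypothetical. The abstract does not specify how the prior is built or which regularizer is used.

```python
def prior_embedding(tool_name_tokens, word_embeddings):
    """Assumed prior: the mean of the word embeddings of the tool name's tokens."""
    vectors = [word_embeddings[token] for token in tool_name_tokens]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def l2_penalty(tool_embedding, prior, weight=0.1):
    """Assumed regularizer: weight * squared distance between embedding and prior."""
    return weight * sum((e - p) ** 2 for e, p in zip(tool_embedding, prior))


# Toy 2-dimensional word embeddings (made-up values).
word_embeddings = {"add": [1.0, 0.0], "numbers": [0.0, 1.0]}

prior = prior_embedding(["add", "numbers"], word_embeddings)

# Initialization: the learnable tool token starts at the prior...
tool_embedding = list(prior)

# ...and during training the penalty is added to the task loss so the
# embedding does not drift far from the word embedding space.
drifted = [1.5, 0.5]
penalty = l2_penalty(drifted, prior)
```

In a real model the penalty would be one term of the training loss alongside the tool-call prediction loss; here it is computed on its own to show how drift away from the prior is charged.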
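[NLP-36] Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
[Quick Read]: This paper addresses a paradox around Reinforcement Learning with Verifiable Rewards (RLVR) as a way to improve the reasoning of Large Language Models (LLMs): RLVR-tuned models underperform their base models on Pass@K, which suggests that RLVR may merely re-weight existing reasoning paths at the cost of reasoning diversity. The key to the solution is identifying a flaw in Pass@K itself: it looks only at the correctness of the final answer and ignores whether the chain of thought (CoT) is accurate and complete. The authors introduce a more precise metric, CoT-Pass@K, which requires both the reasoning path and the final answer to be correct, and provide a theoretical foundation for how RLVR is uniquely structured to incentivize logical integrity. Empirically, under CoT-Pass@K, RLVR incentivizes the generalization of correct reasoning for all values of K.
Link: https://arxiv.org/abs/2506.14245
Authors: Xumeng Wen,Zihan Liu,Shun Zheng,Zhijian Xu,Shengyu Ye,Zhirong Wu,Xiao Liang,Yang Wang,Junjie Li,Ziming Miao,Jiang Bian,Mao Yang
Affiliations: Microsoft Research Asia; Peking University; The Chinese University of Hong Kong; University of California, Los Angeles
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint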
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned models often underperform their base models on the Pass@K metric for solution-finding, leading to the hypothesis that RLVR merely re-weights existing reasoning paths at the cost of reasoning diversity. In this work, we resolve this contradiction by identifying the source of the problem: the Pass@K metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To address this, we introduce a more precise evaluation metric, CoT-Pass@K, which mandates that both the reasoning path and the final answer be correct. We provide a new theoretical foundation that formalizes how RLVR, unlike traditional RL, is uniquely structured to incentivize logical integrity. Our empirical results are supportive: using CoT-Pass@K, we observe that RLVR can incentivize the generalization of correct reasoning for all values of K. Furthermore, by analyzing the training dynamics, we find that this enhanced reasoning capability emerges early in the training process and smoothly generalizes. Our work provides a clear perspective on the role of RLVR, offers a more reliable method for its evaluation, and confirms its potential to genuinely advance machine reasoning.
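The difference between the two metrics is easy to see in code. The sketch below uses the widely used unbiased estimator 1 - C(n-c, k)/C(n, k) for Pass@K, which is an assumption here because the abstract does not say which estimator the authors use. The CoT variant simply counts a sample as correct only when both its final answer and its chain of thought are judged correct. Function names and the toy data are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct (unbiased estimator)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def cot_pass_at_k(samples, k):
    """samples: list of (answer_correct, cot_correct) booleans per sample.
    A sample counts only if BOTH the answer and the reasoning are correct."""
    n = len(samples)
    c = sum(1 for answer_ok, cot_ok in samples if answer_ok and cot_ok)
    return pass_at_k(n, c, k)


# Toy data: 4 samples, 2 with the right answer, only 1 of those with sound reasoning.
samples = [(True, True), (True, False), (False, False), (False, False)]
answer_only_c = sum(1 for answer_ok, _ in samples if answer_ok)

plain = pass_at_k(len(samples), answer_only_c, k=2)
strict = cot_pass_at_k(samples, k=2)
```

On this toy input the answer-only score is higher than the stricter score, which is the kind of gap the paper attributes to correct answers reached through flawed reasoning.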
[NLP-37] A Multi-Expert Structural-Semantic Hybrid Framework for Unveiling Historical Patterns in Temporal Knowledge Graphs ACL25
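[Quick Read]: This paper addresses two gaps in temporal knowledge graph reasoning: the lack of joint modeling of structure learning and semantic reasoning, and the inability to distinguish the inherent differences between historical and non-historical events, both of which limit generalization across temporal contexts. The key to the solution is the Multi-Expert Structural-Semantic Hybrid (MESH) framework, which uses three kinds of expert modules to fuse structural and semantic information and guide the reasoning process for different events.
Link: https://arxiv.org/abs/2506.14235
Authors: Yimin Deng,Yuxia Wu,Yejing Wang,Guoshuai Zhao,Li Zhu,Qidong Liu,Derong Xu,Zichuan Fu,Xian Wu,Yefeng Zheng,Xiangyu Zhao,Xueming Qian
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: ACL25 findings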
Abstract:Temporal knowledge graph reasoning aims to predict future events with knowledge of existing facts and plays a key role in various downstream tasks. Previous methods focused on either graph structure learning or semantic reasoning, failing to integrate dual reasoning perspectives to handle different prediction scenarios. Moreover, they lack the capability to capture the inherent differences between historical and non-historical events, which limits their generalization across different temporal contexts. To this end, we propose a Multi-Expert Structural-Semantic Hybrid (MESH) framework that employs three kinds of expert modules to integrate both structural and semantic information, guiding the reasoning process for different events. Extensive experiments on three datasets demonstrate the effectiveness of our approach.
[NLP-38] Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team
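[Quick Read]: This paper addresses the lack of experience accumulation and knowledge integration in current large language models (LLMs) on complex reasoning tasks: models typically treat each problem independently and cannot draw on past experience. The key to the solution is Xolver, a training-free multi-agent reasoning framework that builds a persistent, evolving memory of holistic experience. It integrates diverse experience modalities, including external and self-retrieval, tool use, collaborative interactions, agent-driven evaluation, and iterative refinement, so that the model learns from relevant strategies, code fragments, and abstract reasoning patterns at inference time instead of generating solutions from scratch, marking a shift from isolated inference toward experience-aware language agents.
Link: https://arxiv.org/abs/2506.14234
Authors: Md Tanzib Hosain,Salman Rahman,Md Kishor Morol,Md Rizwan Parvez
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: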
Abstract:Despite impressive progress on complex reasoning, current large language models (LLMs) typically operate in isolation - treating each problem as an independent attempt, without accumulating or integrating experiential knowledge. In contrast, expert problem solvers - such as Olympiad or programming contest teams - leverage a rich tapestry of experiences: absorbing mentorship from coaches, developing intuition from past problems, leveraging knowledge of tool usage and library functionality, adapting strategies based on the expertise and experiences of peers, continuously refining their reasoning through trial and error, and learning from other related problems even during competition. We introduce Xolver, a training-free multi-agent reasoning framework that equips a black-box LLM with a persistent, evolving memory of holistic experience. Xolver integrates diverse experience modalities, including external and self-retrieval, tool use, collaborative interactions, agent-driven evaluation, and iterative refinement. By learning from relevant strategies, code fragments, and abstract reasoning patterns at inference time, Xolver avoids generating solutions from scratch - marking a transition from isolated inference toward experience-aware language agents. Built on both open-weight and proprietary models, Xolver consistently outperforms specialized reasoning agents. Even with lightweight backbones (e.g., QWQ-32B), it often surpasses advanced models including Qwen3-235B, Gemini 2.5 Pro, o3, and o4-mini-high. With o3-mini-high, it achieves new best results on GSM8K (98.1%), AIME’24 (94.4%), AIME’25 (93.7%), Math-500 (99.8%), and LiveCodeBench-V5 (91.6%) - highlighting holistic experience learning as a key step toward generalist agents capable of expert-level reasoning. Code and data are available at this https URL.
[NLP-39] Fretting-Transformer: Encoder-Decoder Model for MIDI to Tablature Transcription
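[Quick Read]: This paper addresses the fact that, in Music Information Retrieval (MIR), symbolic music notations such as MIDI lack crucial playability information for stringed instruments like the guitar. The key to the solution is the Fretting-Transformer, an encoder-decoder model based on the T5 transformer architecture. By framing the task as a symbolic translation problem, it handles string-fret ambiguity and physical playability. The model draws on diverse datasets with novel data pre-processing and tokenization strategies, and adds context-sensitive processing and tuning/capo conditioning to improve performance.
Link: https://arxiv.org/abs/2506.14223
Authors: Anna Hamberger,Sebastian Murgul,Jochen Schmidt,Michael Heizmann
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Accepted to the 50th International Computer Music Conference (ICMC), 2025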
Abstract:Music transcription plays a pivotal role in Music Information Retrieval (MIR), particularly for stringed instruments like the guitar, where symbolic music notations such as MIDI lack crucial playability information. This contribution introduces the Fretting-Transformer, an encoder-decoder model that utilizes a T5 transformer architecture to automate the transcription of MIDI sequences into guitar tablature. By framing the task as a symbolic translation problem, the model addresses key challenges, including string-fret ambiguity and physical playability. The proposed system leverages diverse datasets, including DadaGP, GuitarToday, and Leduc, with novel data pre-processing and tokenization strategies. We have developed metrics for tablature accuracy and playability to quantitatively evaluate the performance. The experimental results demonstrate that the Fretting-Transformer surpasses baseline methods like A* and commercial applications like Guitar Pro. The integration of context-sensitive processing and tuning/capo conditioning further enhances the model’s performance, laying a robust foundation for future developments in automated guitar transcription.
[NLP-40] Chaining Event Spans for Temporal Relation Grounding
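[Quick Read]: This paper addresses ambiguity in temporal relation understanding, in particular how to tell apart questions that are lexically near-identical but differ in temporal order, in temporal reading comprehension (TRC) and relation extraction (TRE). Existing methods rely on answer overlap as a proxy label to contrast similar and dissimilar questions, which can be unreliable when two unrelated questions happen to share the same answer. The key to the solution is the Timeline Reasoning Network (TRN), which works through a two-step inductive reasoning process: it first answers each question using semantic and syntactic information, then chains multiple questions about the same event to predict a timeline and uses that timeline to ground the answers, which resolves the spurious answer overlaps.
Link: https://arxiv.org/abs/2506.14213
Authors: Jongho Kim,Dohyeon Lee,Minsoo Kim,Seung-won Hwang
Affiliations: Seoul National University; Interdisciplinary Program in Artificial Intelligence, Seoul National University
Categories: Computation and Language (cs.CL)
Comments: In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1689-1700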
Abstract:Accurately understanding temporal relations between events is a critical building block of diverse tasks, such as temporal reading comprehension (TRC) and relation extraction (TRE). For example, in TRC, we need to understand the temporal semantic differences between the following two questions that are lexically near-identical: “What finished right before the decision?” or “What finished right after the decision?”. To discern the two questions, existing solutions have relied on answer overlaps as a proxy label to contrast similar and dissimilar questions. However, we claim that answer overlap can lead to unreliable results, due to spurious overlaps of two dissimilar questions with coincidentally identical answers. To address the issue, we propose a novel approach that elicits proper reasoning behaviors through a module for predicting time spans of events. We introduce the Timeline Reasoning Network (TRN) operating in a two-step inductive reasoning process: In the first step, the model answers each question with semantic and syntactic information. The next step chains multiple questions on the same event to predict a timeline, which is then used to ground the answers. Results on TORQUE and TB-dense, for the TRC and TRE tasks respectively, demonstrate that TRN outperforms previous methods by effectively resolving the spurious overlaps using the predicted timeline.
[NLP-41] Explainable Detection of Implicit Influential Patterns in Conversations via Data Augmentation
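[Quick Read]: This paper addresses the problem that, in the digital era, malicious actors use implicit influential verbal patterns embedded in conversations to shape public perception, and these patterns are harder for existing models to detect than explicit ones. The key to the solution is an improved approach that augments the existing dataset using the reasoning capabilities of state-of-the-art language models, together with a framework that improves detection of implicit influential patterns by 6% and improves the multi-label classification of influence techniques and victim vulnerability by 33% and 43%, respectively.
Link: https://arxiv.org/abs/2506.14211
Authors: Sina Abdidizaji,Md Kowsher,Niloofar Yousefi,Ivan Garibay
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at the HCI International conference 2025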
Abstract:In the era of digitalization, as individuals increasingly rely on digital platforms for communication and news consumption, various actors employ linguistic strategies to influence public perception. While models have become proficient at detecting explicit patterns, which typically appear in texts as single remarks referred to as utterances, such as social media posts, malicious actors have shifted toward utilizing implicit influential verbal patterns embedded within conversations. These verbal patterns aim to mentally penetrate the victim’s mind in order to influence them, enabling the actor to obtain the desired information through implicit means. This paper presents an improved approach for detecting such implicit influential patterns. Furthermore, the proposed model is capable of identifying the specific locations of these influential elements within a conversation. To achieve this, the existing dataset was augmented using the reasoning capabilities of state-of-the-art language models. Our designed framework resulted in a 6% improvement in the detection of implicit influential patterns in conversations. Moreover, this approach improved the multi-label classification tasks related to both the techniques used for influence and the vulnerability of victims by 33% and 43%, respectively.
[NLP-42] CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation
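[Quick Read]: This paper addresses the challenges of generating mixed-type data, especially high-quality tabular data: heterogeneous data types, complex inter-variable relationships, and intricate column-wise distributions. The key to the solution is CausalDiffTab, a diffusion-based generative model designed for mixed tabular data with both numerical and categorical features, which is more flexible in capturing complex interactions among variables. The method also introduces a hybrid adaptive causal regularization based on the principle of Hierarchical Prior Fusion, which adaptively controls the weight of causal regularization and improves performance without compromising generative capability.
Link: https://arxiv.org/abs/2506.14206
Authors: Jia-Chen Zhang,Zheng Zhou,Yu-Jie Xiong,Chun-Ming Xia,Fei Dai
Affiliations: Sue’s University (Shanghai Institute of Technology)
Categories: Computation and Language (cs.CL)
Comments: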
Abstract:Training data has been proven to be one of the most critical components in training generative AI. However, obtaining high-quality data remains challenging, with data privacy issues presenting a significant hurdle. To address the need for high-quality data, synthetic data has emerged as a mainstream solution, demonstrating impressive performance in areas such as images, audio, and video. Generating mixed-type data, especially high-quality tabular data, still faces significant challenges. These primarily include its inherent heterogeneous data types, complex inter-variable relationships, and intricate column-wise distributions. In this paper, we introduce CausalDiffTab, a diffusion model-based generative model specifically designed to handle mixed tabular data containing both numerical and categorical features, while being more flexible in capturing complex interactions among variables. We further propose a hybrid adaptive causal regularization method based on the principle of Hierarchical Prior Fusion. This approach adaptively controls the weight of causal regularization, enhancing the model’s performance without compromising its generative capabilities. Comprehensive experiments conducted on seven datasets demonstrate that CausalDiffTab outperforms baseline methods across all metrics. Our code is publicly available at: this https URL.
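[NLP-43] AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
[Quick Read]: This paper addresses how to efficiently generate high-quality, diverse task and trajectory datasets for training generalist computer-use agents. The key to the solution is the AgentSynth pipeline, which exploits information asymmetry to build subtasks that are simple during generation but become much harder when composed into long-horizon tasks, giving precise control over task complexity while keeping data generation cheap.
Link: https://arxiv.org/abs/2506.14205
Authors: Jingxu Xie,Dylan Xu,Xuandong Zhao,Dawn Song
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: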
Abstract:We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents. Leveraging information asymmetry, AgentSynth constructs subtasks that are simple during generation but significantly more challenging when composed into long-horizon tasks, enabling the creation of over 6,000 diverse and realistic tasks. Our pipeline begins with an LLM-based task proposer guided by a persona, followed by an execution agent that completes the task and logs the trajectory. This process is repeated iteratively to form a sequence of subtasks, which are then summarized by a separate agent into a composite task of controllable difficulty. A key strength of AgentSynth is its ability to precisely modulate task complexity by varying the number of subtasks. Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18% success at difficulty level 1 to just 4% at level 6, highlighting the benchmark’s difficulty and discriminative power. Moreover, our pipeline achieves a low average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotations. Our code and data are publicly available at this https URL
[NLP-44] Intended Target Identification for Anomia Patients with Gradient-based Selective Augmentation EMNLP2024
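[Quick Read]: This paper addresses the difficulty that patients with anomia have in naming items, where identifying the intended target from a patient's circumlocution involves two challenges: terms that remain unseen (term failure) and terms perturbed by semantic paraphasia (term error). The key to the solution is to make the model robust to semantically paraphasic errors and to enhance it with unseen terms through gradient-based selective augmentation, where the gradient value controls the quality of augmented data amid semantic errors and the gradient variance guides the inclusion of unseen but relevant terms.
Link: https://arxiv.org/abs/2506.14203
Authors: Jongho Kim,Romain Storaï,Seung-won Hwang
Affiliations: Seoul National University; Interdisciplinary Program in Artificial Intelligence, Seoul National University
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024 Findings (long)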
Abstract:In this study, we investigate the potential of language models (LMs) in aiding patients experiencing anomia, a difficulty identifying the names of items. Identifying the intended target item from patient’s circumlocution involves the two challenges of term failure and error: (1) The terms relevant to identifying the item remain unseen. (2) What makes the challenge unique is inherent perturbed terms by semantic paraphasia, which are not exactly related to the target item, hindering the identification process. To address each, we propose robustifying the model from semantically paraphasic errors and enhancing the model with unseen terms with gradient-based selective augmentation. Specifically, the gradient value controls augmented data quality amid semantic errors, while the gradient variance guides the inclusion of unseen but relevant terms. Due to limited domain-specific datasets, we evaluate the model on the Tip-of-the-Tongue dataset as an intermediary task and then apply our findings to real patient data from AphasiaBank. Our results demonstrate strong performance against baselines, aiding anomia patients by addressing the outlined challenges.
[NLP-45] ELI-Why: Evaluating the Pedagogical Utility of Language Model Explanations ACL2025
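[Quick Read]: This paper addresses the limited ability of current language models, in educational settings, to tailor explanations to learners with different informational needs and knowledge backgrounds (their pedagogical capabilities). The key to the solution is the ELI-Why benchmark, a dataset of 13.4K "Why" questions for evaluating language model explanations, along with two large-scale human studies that assess how well model-generated explanations suit three educational grades: elementary, high school, and graduate school. The results show that GPT-4-generated explanations match the intended educational background and users' informational needs markedly less often than human-written explanations.
Link: https://arxiv.org/abs/2506.14200
Authors: Brihi Joshi,Keyu He,Sahana Ramnath,Sadra Sabouri,Kaitlyn Zhou,Souti Chattopadhyay,Swabha Swayamdipta,Xiang Ren
Affiliations: University of Southern California; Stanford University
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Findings of ACL 2025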
Abstract:Language models today are widely used in education, yet their ability to tailor responses for learners with varied informational needs and knowledge backgrounds remains under-explored. To this end, we introduce ELI-Why, a benchmark of 13.4K “Why” questions to evaluate the pedagogical capabilities of language models. We then conduct two extensive human studies to assess the utility of language model-generated explanatory answers (explanations) on our benchmark, tailored to three distinct educational grades: elementary, high-school and graduate school. In our first study, human raters assume the role of an “educator” to assess model explanations’ fit to different educational grades. We find that GPT-4-generated explanations match their intended educational background only 50% of the time, compared to 79% for lay human-curated explanations. In our second study, human raters assume the role of a learner to assess if an explanation fits their own informational needs. Across all educational backgrounds, users deemed GPT-4-generated explanations 20% less suited on average to their informational needs, when compared to explanations curated by lay people. Additionally, automated evaluation metrics reveal that explanations generated across different language model families for different informational needs remain indistinguishable in their grade-level, limiting their pedagogical effectiveness.
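[NLP-46] MAS-LitEval: Multi-Agent System for Literary Translation Quality Assessment EMNLP
[Quick Read]: This paper addresses the failure of traditional translation metrics such as BLEU and METEOR to assess cultural nuances and stylistic elements in literary translation: because they focus on lexical overlap, they neglect narrative consistency and stylistic fidelity. The key to the solution is MAS-LitEval, a multi-agent system built on Large Language Models (LLMs) that evaluates translations along three dimensions: terminology, narrative, and style.
Link: https://arxiv.org/abs/2506.14199
Authors: Junghwan Kim,Kieun Park,Sohee Park,Hyunggug Kim,Bongwon Suh
Affiliations: Seoul National University; Infiniction
Categories: Computation and Language (cs.CL)
Comments: 4 Pages, 2 tables, EMNLP submitted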
Abstract:Literary translation requires preserving cultural nuances and stylistic elements, which traditional metrics like BLEU and METEOR fail to assess due to their focus on lexical overlap. This oversight neglects the narrative consistency and stylistic fidelity that are crucial for literary works. To address this, we propose MAS-LitEval, a multi-agent system using Large Language Models (LLMs) to evaluate translations based on terminology, narrative, and style. We tested MAS-LitEval on translations of The Little Prince and A Connecticut Yankee in King Arthur’s Court, generated by various LLMs, and compared it to traditional metrics. MAS-LitEval outperformed these metrics, with top models scoring up to 0.890 in capturing literary nuances. This work introduces a scalable, nuanced framework for Translation Quality Assessment (TQA), offering a practical tool for translators and researchers.
[NLP-47] AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR
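[Quick Read]: This paper addresses two obstacles in building code-switched ASR systems: language ambiguity and limited exposure to multilingual, code-switched data, which is costly to collect as speech. The key to the solution is the AsyncSwitch framework, which uses large-scale text data to pre-expose the ASR model to diverse code-switched domains before fine-tuning. It proceeds in three stages: first training the decoder's self-attention and feedforward layers on code-switched text, then aligning decoder and encoder via cross-attention using limited speech-text data, and finally fine-tuning the entire model.
Link: https://arxiv.org/abs/2506.14190
Authors: Tuan Nguyen,Huy-Dat Tran
Affiliations: Institute for Infocomm Research (I²R); A*STAR (Agency for Science, Technology and Research)
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: This work has been submitted to the IEEE for possible publication. This paper is a preprint version submitted to the 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025)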
Abstract:Developing code-switched ASR systems is challenging due to language ambiguity and limited exposure to multilingual, code-switched data, while collecting such speech is costly. Prior work generates synthetic audio from text, but these methods are computationally intensive and hard to scale. We introduce AsyncSwitch, a novel asynchronous adaptation framework that leverages large-scale, text-rich web data to pre-expose ASR models to diverse code-switched domains before fine-tuning on paired speech-text corpora. Our three-stage process (1) trains decoder self-attention and feedforward layers on code-switched text, (2) aligns decoder and encoder via cross-attention using limited speech-text data, and (3) fully fine-tunes the entire model. Experiments with Whisper on Malay-English code-switching demonstrate a 9.02% relative WER reduction, while improving monolingual performance in Singlish, Malay, and other English variants.
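[NLP-48] Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languages INTERSPEECH2025
[Quick Read]: This paper addresses the challenges of code-switching (CS) automatic speech recognition (ASR) in multilingual settings, in particular the scarcity and cost of transcribed data caused by linguistic complexity. The key to the solution is a phrase-level mixing method that generates synthetic CS data mimicking natural patterns; monolingual data augmented with this synthetic phrase-mixed CS data is then used to fine-tune large pretrained ASR models and improve CS-ASR performance.
Link: https://arxiv.org/abs/2506.14177
Authors: Tuan Nguyen,Huy-Dat Tran
Affiliations: Institute for Infocomm Research (I2R)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted by Interspeech 2025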
Abstract:Code-switching (CS), common in multilingual settings, presents challenges for ASR due to scarce and costly transcribed data caused by linguistic complexity. This study investigates building CS-ASR using synthetic CS data. We propose a phrase-level mixing method to generate synthetic CS data that mimics natural patterns. We use monolingual data augmented with synthetic phrase-mixed CS data to fine-tune large pretrained ASR models (Whisper, MMS, SeamlessM4T). This paper focuses on three under-resourced Southeast Asian language pairs: Malay-English (BM-EN), Mandarin-Malay (ZH-BM), and Tamil-English (TA-EN), establishing a new comprehensive benchmark for CS-ASR to evaluate the performance of leading ASR models. Experimental results show that the proposed training strategy enhances ASR performance on monolingual and CS tests, with BM-EN showing highest gains, then TA-EN and ZH-BM. This finding offers a cost-effective approach for CS-ASR development, benefiting research and industry.
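To make "phrase-level mixing" concrete, here is a deliberately simple text-only sketch. It assumes two phrase-aligned monolingual segmentations of the same sentence and swaps selected phrases from one language into the other. The function name, the alignment assumption, and the example phrases are all hypothetical; the abstract does not describe how phrases are chosen, aligned, or turned into audio.

```python
def mix_phrases(matrix_phrases, embedded_phrases, switch_positions):
    """Build a code-switched phrase sequence.

    matrix_phrases: phrases of the main (matrix) language, in order.
    embedded_phrases: phrase-aligned counterparts in the second language.
    switch_positions: indices at which the second-language phrase is used.
    """
    if len(matrix_phrases) != len(embedded_phrases):
        raise ValueError("phrase lists must be aligned one-to-one")
    return [
        embedded_phrases[i] if i in switch_positions else matrix_phrases[i]
        for i in range(len(matrix_phrases))
    ]


# Hypothetical Malay-English example with a hand-made phrase alignment.
malay = ["saya mahu", "pergi ke", "pasar malam"]
english = ["I want", "to go to", "the night market"]

mixed = mix_phrases(malay, english, switch_positions={2})
sentence = " ".join(mixed)
```

A method that "mimics natural patterns", as the abstract puts it, would need a principled way of choosing the switch positions; the explicit `switch_positions` argument only marks where that decision plugs in.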
[NLP-49] GRAM: A Generative Foundation Reward Model for Reward Generalization ICML2025
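[Quick Read]: This paper addresses the fact that conventional reward models are trained only on labeled human preference data and cannot make use of unlabeled data. The key to the solution is a generative reward model that is first pre-trained via large-scale unsupervised learning and then fine-tuned via supervised learning. The authors show that label smoothing amounts to optimizing a regularized pairwise ranking loss, which links generative and discriminative models under the same class of training objectives. The resulting foundation reward model generalizes well and performs strongly across tasks with little or no further fine-tuning.
Link: https://arxiv.org/abs/2506.14175
Authors: Chenglong Wang,Yang Gan,Yifu Huo,Yongyu Mu,Qiaozhi He,Murun Yang,Bei Li,Tong Xiao,Chunliang Zhang,Tongran Liu,Jingbo Zhu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2025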
Abstract:In aligning large language models (LLMs), reward models have played an important role, but are standardly trained as discriminative models and rely only on labeled human preference data. In this paper, we explore methods that train reward models using both unlabeled and labeled data. Building on the generative models in LLMs, we develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. We also show that by using label smoothing, we are in fact optimizing a regularized pairwise ranking loss. This result, in turn, provides a new view of training reward models, which links generative models and discriminative models under the same class of training objectives. The outcome of these techniques is a foundation reward model, which can be applied to a wide range of tasks with little or no further fine-tuning effort. Extensive experiments show that this model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, achieving significant performance improvements over several strong baseline models.
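One way to see why label smoothing acts as a regularizer on a pairwise ranking loss is to write the smoothed objective out. The sketch below uses the standard Bradley-Terry style loss -log σ(r_chosen - r_rejected) and mixes in the flipped label with weight ε. This textbook form is an assumption made for illustration; the paper's exact objective may differ.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def smoothed_pairwise_loss(r_chosen, r_rejected, eps=0.1):
    """Label-smoothed pairwise ranking loss.

    With eps = 0 this is the usual -log sigmoid(margin). With eps > 0 the
    flipped preference gets weight eps, which penalizes very large margins.
    """
    margin = r_chosen - r_rejected
    return -(1.0 - eps) * log_sigmoid(margin) - eps * log_sigmoid(-margin)


eps = 0.1
# Setting the derivative to zero gives sigmoid(margin) = 1 - eps,
# so the loss is minimized at a finite margin of log((1 - eps) / eps).
best_margin = math.log((1.0 - eps) / eps)

loss_at_best = smoothed_pairwise_loss(best_margin, 0.0, eps)
loss_at_large = smoothed_pairwise_loss(10.0, 0.0, eps)
loss_unsmoothed_large = smoothed_pairwise_loss(10.0, 0.0, eps=0.0)
```

Without smoothing, the loss keeps decreasing as the margin grows. With smoothing, pushing the margin past log((1-ε)/ε) raises the loss again, which keeps reward gaps bounded.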
[NLP-50] MIST: Towards Multi-dimensional Implicit Bias and Stereotype Evaluation of LLMs via Theory of Mind
[Quick Read]: This paper addresses systematic implicit bias that arises from failures in the Theory of Mind (ToM) capacity of Large Language Models (LLMs). Conventional direct-query methods are susceptible to social desirability effects and struggle to capture this subtle, multi-dimensional bias. The key to the solution is an evaluation framework based on the Stereotype Content Model (SCM) that reframes bias as a failure of ToM across three dimensions: Competence, Sociability, and Morality. It introduces two indirect tasks, the Word Association Bias Test (WABT) and the Affective Attribution Test (AAT), which probe latent stereotypes without triggering model avoidance and so reveal the structure of implicit bias more fully.
Link: https://arxiv.org/abs/2506.14161
Authors: Yanlin Li,Hao Liu,Huimin Liu,Yinwei Wei,Yupeng Hu
Affiliations: School of Software, Shandong University; School of Computing, National University of Singapore; School of Psychology, Hainan Normal University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Theory of Mind (ToM) in Large Language Models (LLMs) refers to their capacity for reasoning about mental states, yet failures in this capacity often manifest as systematic implicit bias. Evaluating this bias is challenging, as conventional direct-query methods are susceptible to social desirability effects and fail to capture its subtle, multi-dimensional nature. To this end, we propose an evaluation framework that leverages the Stereotype Content Model (SCM) to reconceptualize bias as a multi-dimensional failure in ToM across Competence, Sociability, and Morality. The framework introduces two indirect tasks: the Word Association Bias Test (WABT) to assess implicit lexical associations and the Affective Attribution Test (AAT) to measure covert affective leanings, both designed to probe latent stereotypes without triggering model avoidance. Extensive experiments on 8 State-of-the-Art LLMs demonstrate our framework’s capacity to reveal complex bias structures, including pervasive sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification, thereby providing a more robust methodology for identifying the structural nature of implicit bias.
[NLP-51] S4C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models
[Quick Read]: This paper addresses the high inference latency caused by the autoregressive nature of Large Language Models (LLMs), which hurts real-time applications. The key to the solution is Speculative Sampling with Syntactic and Semantic Coherence (S^4C), a framework that uses multi-head drafting for rapid token generation and a continuous verification tree for efficient candidate validation and feature reuse, improving generation efficiency and parallelism.
Link: https://arxiv.org/abs/2506.14158
Authors: Tao He,Guang Huang,Yu Yang,Tianshi Xu,Sicheng Zhao,Guiguang Ding,Pengyang Wang,Feng Tian
Affiliations: GRG Banking Equipment Co., Ltd; South China University of Technology; University of Macau; Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) exhibit remarkable reasoning capabilities across diverse downstream tasks. However, their autoregressive nature leads to substantial inference latency, posing challenges for real-time applications. Speculative sampling mitigates this issue by introducing a drafting phase followed by a parallel validation phase, enabling faster token generation and verification. Existing approaches, however, overlook the inherent coherence in text generation, limiting their efficiency. To address this gap, we propose a Speculative Sampling with Syntactic and Semantic Coherence (S^4C) framework, which extends speculative sampling by leveraging multi-head drafting for rapid token generation and a continuous verification tree for efficient candidate validation and feature reuse. Experimental results demonstrate that S^4C surpasses baseline methods across mainstream tasks, offering enhanced efficiency, parallelism, and the ability to generate more valid tokens with fewer computational resources. On Spec-bench benchmarks, S^4C achieves an acceleration ratio of 2.26x-2.60x, outperforming state-of-the-art methods.
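The draft-then-verify loop that the abstract attributes to speculative sampling can be sketched as follows. This is a minimal sketch of the generic acceptance rule only, not of S^4C's multi-head drafting or its continuous verification tree; the function name `verify_draft` and the dictionary-based probability tables are hypothetical, chosen for illustration.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Generic speculative-sampling verification (illustrative only).

    draft_probs[i] / target_probs[i] map token -> probability at position i
    under the small draft model and the large target model respectively.
    Returns the tokens kept from this drafting round.
    """
    kept = []
    for i, tok in enumerate(draft_tokens):
        q = draft_probs[i][tok]              # draft model probability
        p = target_probs[i].get(tok, 0.0)    # target model probability
        if rng.random() < min(1.0, p / q):
            kept.append(tok)                 # accept the drafted token
            continue
        # Rejected: resample from the residual max(0, p - q), then stop.
        residual = {
            t: max(0.0, target_probs[i][t] - draft_probs[i].get(t, 0.0))
            for t in target_probs[i]
        }
        threshold = rng.random() * sum(residual.values())
        running = 0.0
        for t, weight in residual.items():
            running += weight
            if threshold < running:
                kept.append(t)
                break
        break
    return kept


rng = random.Random(0)
same = [{"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}]
all_accepted = verify_draft(["a", "b"], same, same, rng)
corrected = verify_draft(["a"], [{"a": 1.0}], [{"z": 1.0}], rng)
```

When the draft and target distributions agree, every drafted token is accepted; when the target assigns the drafted token zero probability, the round ends with a single corrected token drawn from the target.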
[NLP-52] DCRM: A Heuristic to Measure Response Pair Quality in Preference Optimization
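[Quick read]: This paper examines how training-data quality affects what Large Language Models (LLMs) learn in preference optimization (PO). The core problem is that the differences between responses in existing preference datasets may not match the differences the model should ideally learn, which hurts performance. The key to the solution is the Distance Calibrated Reward Margin (DCRM), which quantifies the distance and the reward margin between the preferred and dispreferred responses to assess the quality of a response pair. Building on DCRM, the paper proposes a best-of-N^2 pairing method that selects the response pairs with the highest DCRM for training, improving model performance on several benchmarks.
Link: https://arxiv.org/abs/2506.14157
Authors: Chengyu Huang,Tanya Goyal
Affiliations: Cornell University
Categories: Computation and Language (cs.CL)
Comments: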
Abstract:Recent research has attempted to associate preference optimization (PO) performance with the underlying preference datasets. In this work, our observation is that the differences between the preferred response y^+ and dispreferred response y^- influence what LLMs can learn, which may not match the desirable differences to learn. Therefore, we use distance and reward margin to quantify these differences, and combine them to get Distance Calibrated Reward Margin (DCRM), a metric that measures the quality of a response pair for PO. Intuitively, DCRM encourages minimal noisy differences and maximal desired differences. With this, we study 3 types of commonly used preference datasets, classified along two axes: the source of the responses and the preference labeling function. We establish a general correlation between higher DCRM of the training set and better learning outcome. Inspired by this, we propose a best-of-N^2 pairing method that selects response pairs with the highest DCRM. Empirically, in various settings, our method produces training datasets that can further improve models’ performance on AlpacaEval, MT-Bench, and Arena-Hard over the existing training sets.
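A rough sketch of the pairing idea described above: score every ordered pair among N sampled responses and keep the best one. The abstract does not give the DCRM formula, so the ratio `margin / distance` used here is an assumption made purely for illustration, as are the word-level Jaccard distance and the names `word_distance` and `best_pair`.

```python
from itertools import permutations


def word_distance(a, b):
    """Toy distance: 1 - Jaccard overlap of word sets (an assumption)."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def best_pair(responses, rewards, eps=1e-6):
    """Pick the (preferred, dispreferred) pair with the highest score.

    The score is reward margin divided by distance, a stand-in for DCRM
    that favors a large reward gap between otherwise similar responses.
    """
    best = None
    for i, j in permutations(range(len(responses)), 2):
        margin = rewards[i] - rewards[j]
        if margin <= 0:
            continue  # response i must be the preferred one
        score = margin / (word_distance(responses[i], responses[j]) + eps)
        if best is None or score > best[0]:
            best = (score, responses[i], responses[j])
    return best


responses = ["the cat sat", "the cat sat down", "dogs bark loudly"]
rewards = [0.9, 0.5, 0.1]
score, chosen, rejected = best_pair(responses, rewards)
```

Under this stand-in score, the near-duplicate pair with a moderate reward gap wins over the unrelated pair with a larger gap, which mirrors the stated intuition of minimal noisy differences and maximal desired differences.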
[NLP-53] Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models INTERSPEECH2025
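[Quick read]: This paper targets the spoofing attacks made possible by advances in generative AI speech synthesis, which pose a major challenge for automatic speaker verification systems. The key to the solution is replacing the traditional Multi-Layer Perceptron with a Kolmogorov-Arnold Network (KAN), an architecture based on the Kolmogorov-Arnold representation theorem, to improve self-supervised learning (SSL) models for synthetic speech detection. Experiments on ASVspoof2021 show a marked improvement in detection performance.
Link: https://arxiv.org/abs/2506.14153
Authors: Tuan Dat Phuong,Long-Vu Hoang,Huy Dat Tran
Affiliations: SoICT; Institute for Infocomm Research (I2R)
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2025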
Abstract:Recent advancements in speech synthesis technologies have led to increasingly advanced spoofing attacks, posing significant challenges for automatic speaker verification systems. While systems based on self-supervised learning (SSL) models, particularly the XLSR-Conformer model, have demonstrated remarkable performance in synthetic speech detection, there remains room for architectural improvements. In this paper, we propose a novel approach that replaces the traditional Multi-Layer Perceptron in the XLSR-Conformer model with a Kolmogorov-Arnold Network (KAN), a novel architecture based on the Kolmogorov-Arnold representation theorem. Our results on ASVspoof2021 demonstrate that integrating KAN into the SSL-based models can improve the performance by 60.55% relatively on LA and DF sets, further achieving 0.70% EER on the 21LA set. These findings suggest that incorporating KAN into SSL-based models is a promising direction for advances in synthetic speech detection.
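The equal error rate (EER) quoted above is the operating point where the false acceptance rate equals the false rejection rate. The following is a minimal sketch of one common way to estimate it from raw detector scores, assuming higher scores mean "bona fide"; the function name and the toy scores are hypothetical and unrelated to the paper's evaluation code.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Estimate EER by sweeping a threshold over all observed scores.

    A sample is accepted as bona fide when its score >= threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection: bona fide speech scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed speech scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


separable = equal_error_rate([0.9, 0.8], [0.1, 0.2])
overlapping = equal_error_rate([0.9, 0.4], [0.6, 0.1])
```

With perfectly separated scores the estimate is 0; with the overlapping toy scores one of two samples on each side is misclassified at the balance point.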
[NLP-54] Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment INTERSPEECH2025
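[Quick read]: This paper addresses non-invasive object classification, specifically assessing hair type and moisture content. The key to the solution is acoustic scattering: acoustic stimuli are emitted, the signals scattered from the hair surface are captured, and AI-driven, deep-learning-based sound classification determines the hair attributes.
Link: https://arxiv.org/abs/2506.14148
Authors: Long-Vu Hoang,Tuan Nguyen,Tran Huy Dat
Affiliations: Institute for Infocomm Research (I2R)
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2025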
Abstract:This paper presents a novel non-invasive object classification approach using acoustic scattering, demonstrated through a case study on hair assessment. When an incident wave interacts with an object, it generates a scattered acoustic field encoding structural and material properties. By emitting acoustic stimuli and capturing the scattered signals from head-with-hair-sample objects, we classify hair type and moisture using AI-driven, deep-learning-based sound classification. We benchmark comprehensive methods, including (i) fully supervised deep learning, (ii) embedding-based classification, (iii) supervised foundation model fine-tuning, and (iv) self-supervised model fine-tuning. Our best strategy achieves nearly 90% classification accuracy by fine-tuning all parameters of a self-supervised model. These results highlight acoustic scattering as a privacy-preserving, non-contact alternative to visual classification, opening huge potential for applications in various industries.
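[NLP-55] RadFabric: Agentic AI System with Reasoning Capability for Radiology
[Quick read]: This paper addresses the limitations of current automated chest X-ray (CXR) diagnostic systems in pathology coverage, diagnostic accuracy, and the integration of visual and textual reasoning. The key to the solution is RadFabric, a multi-agent, multimodal reasoning framework built on the Model Context Protocol (MCP) that unifies visual and textual analysis for comprehensive CXR interpretation. It comprises specialized CXR agents for pathology detection, an Anatomical Interpretation Agent that maps findings to anatomical structures, and a Reasoning Agent that uses large models for multimodal reasoning, improving diagnostic accuracy and transparency.
Link: https://arxiv.org/abs/2506.14142
Authors: Wenting Chen,Yi Dong,Zhaojun Ding,Yucheng Shi,Yifan Zhou,Fang Zeng,Yijun Luo,Tianyu Lin,Yihang Su,Yichen Wu,Kai Zhang,Zhen Xiang,Tianming Liu,Ninghao Liu,Lichao Sun,Yixuan Yuan,Xiang Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 4 figures, 2 tables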
Abstract:Chest X-ray (CXR) imaging remains a critical diagnostic tool for thoracic conditions, but current automated systems face limitations in pathology coverage, diagnostic accuracy, and integration of visual and textual reasoning. To address these gaps, we propose RadFabric, a multi-agent, multimodal reasoning framework that unifies visual and textual analysis for comprehensive CXR interpretation. RadFabric is built on the Model Context Protocol (MCP), enabling modularity, interoperability, and scalability for seamless integration of new diagnostic agents. The system employs specialized CXR agents for pathology detection, an Anatomical Interpretation Agent to map visual findings to precise anatomical structures, and a Reasoning Agent powered by large multimodal reasoning models to synthesize visual, anatomical, and clinical data into transparent and evidence-based diagnoses. RadFabric achieves significant performance improvements, with near-perfect detection of challenging pathologies like fractures (1.000 accuracy) and superior overall diagnostic accuracy (0.799) compared to traditional systems (0.229 to 0.527). By integrating cross-modal feature alignment and preference-driven reasoning, RadFabric advances AI-driven radiology toward transparent, anatomically precise, and clinically actionable CXR analysis.
[NLP-56] Sampling from Your Language Model One Byte at a Time
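[Quick read]: This paper addresses the generation distortion that tokenization introduces in language models, in particular the Prompt Boundary Problem (PBP), as well as the obstacles to model composition and interoperability caused by mismatched tokenizers. The key to the solution is an inference-time method that converts any autoregressive language model with a BPE tokenizer into a character-level or byte-level language model without changing its generative distribution at the text level. The method resolves the PBP and can unify the vocabularies of language models with different tokenizers, enabling ensembling at inference time and transfer of post-training via proxy-tuning.
Link: https://arxiv.org/abs/2506.14123
Authors: Jonathan Hayase,Alisa Liu,Noah A. Smith,Sewoong Oh
Affiliations: University of Washington; Allen Institute for AI
Categories: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments: 23 pages, 8 figures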
Abstract:Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model’s generations. For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. This Prompt Boundary Problem (PBP) also arises in languages such as Chinese and in code generation, where tokens often do not line up with syntactic boundaries. Additionally, mismatching tokenizers often hinder model composition and interoperability. For example, it is not possible to directly ensemble models with different tokenizers due to their mismatching vocabularies. To address these issues, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM, without changing its generative distribution at the text level. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time as well as transfer the post-training from one model to another using proxy-tuning. We demonstrate in experiments that the ensemble and proxy-tuned models outperform their constituents on downstream evals.
[NLP-57] Essential-Web v1.0: 24T tokens of organized web data
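[Quick read]: This paper addresses the costly and inaccessible data pipelines that result from the lack of massive, well-organized pre-training datasets for language models. The key to the solution is Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. The labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model whose annotator agreement is close to that of Qwen2.5-32B-Instruct. Using only SQL-style filters, the approach yields competitive web-curated datasets in math, code, STEM, and medicine.
Link: https://arxiv.org/abs/2506.14111
Authors: Essential AI: Andrew Hojel,Michael Pust,Tim Romanski,Yash Vanjani,Ritvik Kapila,Mohit Parmar,Adarsh Chaluvaraju,Alok Tripathy,Anil Thomas,Ashish Tanwer,Darsh J Shah,Ishaan Shah,Karl Stratos,Khoi Nguyen,Kurt Smith,Michael Callahan,Peter Rushton,Philip Monk,Platon Mazarakis,Saad Jamal,Saurabh Srivastava,Somanshu Singla,Ashish Vaswani
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: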
Abstract:Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: this https URL
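To make "nothing more than SQL-style filters" concrete, here is a small sketch that selects a topical subset from a table of per-document taxonomy labels. The table layout, column names (`topic`, `doc_format`, `complexity`, `quality`), label values, and thresholds are all hypothetical; the actual Essential-Web v1.0 schema is not given in the abstract.

```python
import sqlite3

# Hypothetical per-document labels; the real schema may differ.
rows = [
    (1, "mathematics", "tutorial", 3, 4),
    (2, "cooking", "recipe", 1, 4),
    (3, "mathematics", "forum", 2, 1),
    (4, "mathematics", "paper", 4, 5),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs ("
    "id INTEGER, topic TEXT, doc_format TEXT, "
    "complexity INTEGER, quality INTEGER)"
)
conn.executemany("INSERT INTO docs VALUES (?, ?, ?, ?, ?)", rows)

# Curate a math subset purely by filtering on the annotated labels.
query = (
    "SELECT id FROM docs "
    "WHERE topic = 'mathematics' AND quality >= 3 "
    "ORDER BY id"
)
math_ids = [row[0] for row in conn.execute(query)]
conn.close()
```

The point of the design is that curation becomes a query over precomputed labels rather than a new classification pass over the raw text.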
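[NLP-58] Innovating China's Intangible Cultural Heritage with DeepSeek + MidJourney: The Case of Yangliuqing theme Woodblock Prints
[Quick read]: This paper asks how to innovate while preserving Yangliuqing woodblock prints, a traditional Chinese art form. The key to the solution is a hybrid DeepSeek + MidJourney approach that combines DeepSeek-generated thematic prompts, MidJourney-generated thematic images, original Yangliuqing prints, and DeepSeek-generated key prompts to improve the quality and cultural representativeness of the AI-generated images, blending traditional artistic elements with modern AI-driven creativity.
Link: https://arxiv.org/abs/2506.14104
Authors: RuiKun Yang,ZhongLiang Wei,Longdi Xian
Affiliations: Unknown
Categories: Graphics (cs.GR); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: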
Abstract:Yangliuqing woodblock prints, a cornerstone of China’s intangible cultural heritage, are celebrated for their intricate designs and vibrant colors. However, preserving these traditional art forms while fostering innovation presents significant challenges. This study explores the DeepSeek + MidJourney approach to generating creative, themed Yangliuqing woodblock prints focused on the fight against COVID-19 and depicting joyous winners. Using Fréchet Inception Distance (FID) scores for evaluation, the method that combined DeepSeek-generated thematic prompts, MidJourney-generated thematic images, original Yangliuqing prints, and DeepSeek-generated key prompts in MidJourney-generated outputs achieved the lowest mean FID score (150.2) with minimal variability (σ = 4.9). Additionally, feedback from 62 participants, collected via questionnaires, confirmed that this hybrid approach produced the most representative results. Moreover, the questionnaire data revealed that participants demonstrated the highest willingness to promote traditional culture and the strongest interest in consuming the AI-generated images produced through this method. These findings underscore the effectiveness of an innovative approach that seamlessly blends traditional artistic elements with modern AI-driven creativity, ensuring both cultural preservation and contemporary relevance.
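For readers unfamiliar with the metric, the standard definition of the Fréchet Inception Distance is given below. The subscripts r (reference images, here the original prints) and g (generated images) are notation introduced for this note, not taken from the paper; the means and covariances are computed over Inception-network feature vectors.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values mean the two feature distributions are closer, which is why the abstract reports the lowest mean score as the best result.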
[NLP-59] Abstract Meaning Representation for Hospital Discharge Summarization
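[Quick read]: This paper addresses hallucination in Large Language Models (LLMs), in particular the provenance and trustworthiness of content when discharge summaries are generated automatically in the clinical domain. The key to the solution is combining language-based graphs with deep learning models to make automatic summarization more reliable and traceable.
Link: https://arxiv.org/abs/2506.14101
Authors: Paul Landes,Sitara Rao,Aaron Jeremy Chaise,Barbara Di Eugenio
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: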
Abstract:The Achilles heel of Large Language Models (LLMs) is hallucination, which has drastic consequences for the clinical domain. This is particularly important with regard to automatically generating discharge summaries (a lengthy medical document that summarizes a hospital in-patient visit). Automatically generating these summaries would free physicians to care for patients and reduce documentation burden. The goal of this work is to discover new methods that combine language-based graphs and deep learning models to address provenance of content and trustworthiness in automatic summarization. Our method shows impressive reliability results on the publicly available Medical Information Mart for Intensive Care III (MIMIC-III) corpus and clinical notes written by physicians at Anonymous Hospital. We provide our method, generated discharge summary output examples, source code and trained models.
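[NLP-60] InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking
[Quick read]: This paper addresses the shortcomings of existing Large Language Model (LLM) rerankers on retrieval tasks with complex queries that require reasoning over documents rather than simple keyword matching or semantic similarity. The key to the solution is InsertRank, an LLM-based reranker that uses lexical signals such as BM25 scores during reranking to further improve retrieval performance. Experiments on the BRIGHT and R2MED benchmarks show that InsertRank is effective across multiple LLM families, indicating that combining lexical features with LLM reasoning improves retrieval for complex queries.
Link: https://arxiv.org/abs/2506.14086
Authors: Rahul Seetharaman,Kaustubh D. Dhole,Aman Bansal
Affiliations: UMass Amherst; Emory University
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: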
Abstract:Large Language Models (LLMs) have demonstrated significant strides across various information retrieval tasks, particularly as rerankers, owing to their strong generalization and knowledge-transfer capabilities acquired from extensive pretraining. In parallel, the rise of LLM-based chat interfaces has raised user expectations, encouraging users to pose more complex queries that necessitate retrieval by "reasoning" over documents rather than through simple keyword matching or semantic similarity. While some recent efforts have exploited reasoning abilities of LLMs for reranking such queries, considerable potential for improvement remains. In that regard, we introduce InsertRank, an LLM-based reranker that leverages lexical signals like BM25 scores during reranking to further improve retrieval performance. InsertRank demonstrates improved retrieval effectiveness on BRIGHT, a reasoning benchmark spanning 12 diverse domains, and R2MED, a specialized medical reasoning retrieval benchmark spanning 8 different tasks. We conduct an exhaustive evaluation and several ablation studies and demonstrate that InsertRank consistently improves retrieval effectiveness across multiple families of LLMs, including GPT, Gemini, and Deepseek models. In addition, we also conduct ablation studies on normalization by varying the scale of the BM25 scores, and positional bias by shuffling the order of the documents. With Deepseek-R1, InsertRank achieves a score of 37.5 on the BRIGHT benchmark and 51.1 on the R2MED benchmark, surpassing previous methods.
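One way to picture "leveraging BM25 scores during reranking" is a listwise prompt in which each candidate passage carries its lexical score. The sketch below only illustrates that idea; the prompt wording, the `[n] (BM25: x)` layout, and the function name `build_listwise_prompt` are assumptions, not the template used in the paper.

```python
def build_listwise_prompt(query, passages, bm25_scores):
    """Assemble a listwise reranking prompt with BM25 scores inlined.

    The layout is hypothetical: one line per passage, tagged with its
    identifier and the first-stage BM25 score as a lexical hint.
    """
    lines = [
        f"Query: {query}",
        "Rank the passages by relevance to the query.",
        "Each passage is shown with its BM25 score as a lexical hint.",
        "",
    ]
    for idx, (text, score) in enumerate(zip(passages, bm25_scores), start=1):
        lines.append(f"[{idx}] (BM25: {score:.2f}) {text}")
    lines.append("")
    lines.append("Answer with identifiers in descending relevance.")
    return "\n".join(lines)


prompt = build_listwise_prompt(
    "why does ice float on water",
    ["Ice is less dense than liquid water.", "Boats float because of buoyancy."],
    [12.3, 0.5],
)
```

The resulting string would be sent to whichever LLM acts as the reranker; parsing the returned ordering is left out of this sketch.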
[NLP-61] Automatic Extraction of Clausal Embedding Based on Large-Scale English Text Data
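[Quick read]: This paper addresses the lack of statistical information and naturally occurring examples in current research on English embedded clauses. Traditional work relies on constructed language examples and does not exploit the real data in large corpora. The key to the solution is a method based on constituency parsing and a set of parsing heuristics that detects and annotates naturally occurring embedded clauses in large-scale text. The tool is evaluated on the hand-annotated Golden Embedded Clause Set (GECS), and a large-scale dataset of naturally occurring embedded clauses is extracted from the open-source corpus Dolma.
Link: https://arxiv.org/abs/2506.14064
Authors: Iona Carslaw,Sivan Milton,Nicolas Navarre,Ciyang Qing,Wataru Uegaki
Affiliations: University of Edinburgh
Categories: Computation and Language (cs.CL)
Comments: Accepted in the Society for Computation in Linguistics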
Abstract:For linguists, embedded clauses have been of special interest because of their intricate distribution of syntactic and semantic features. Yet, current research relies on schematically created language examples to investigate these constructions, missing out on statistical information and naturally-occurring examples that can be gained from large language corpora. Thus, we present a methodological approach for detecting and annotating naturally-occurring examples of English embedded clauses in large-scale text data using constituency parsing and a set of parsing heuristics. Our tool has been evaluated on our dataset Golden Embedded Clause Set (GECS), which includes hand-annotated examples of naturally-occurring English embedded clause sentences. Finally, we present a large-scale dataset of naturally-occurring English embedded clauses which we have extracted from the open-source corpus Dolma using our extraction tool.
[NLP-62] Ace-CEFR – A Dataset for Automated Evaluation of the Linguistic Difficulty of Conversational Texts for LLM Applications
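[Quick read]: This paper addresses the unmet need to evaluate the language difficulty of short conversational passages, particularly for training and filtering Large Language Models (LLMs). The key to the solution is Ace-CEFR, a dataset of English conversational passages annotated by experts with their level of text difficulty. Experiments with several models, including Transformer-based models and LLMs, show that models trained on the dataset measure text difficulty more accurately than human experts and have latency suitable for production environments.
Link: https://arxiv.org/abs/2506.14046
Authors: David Kogan,Max Schumacher,Sam Nguyen,Masanori Suzuki,Melissa Smith,Chloe Sophia Bellows,Jared Bernstein
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: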
Abstract:There is an unmet need to evaluate the language difficulty of short, conversational passages of text, particularly for training and filtering Large Language Models (LLMs). We introduce Ace-CEFR, a dataset of English conversational text passages expert-annotated with their corresponding level of text difficulty. We experiment with several models on Ace-CEFR, including Transformer-based models and LLMs. We show that models trained on Ace-CEFR can measure text difficulty more accurately than human experts and have latency appropriate to production environments. Finally, we release the Ace-CEFR dataset to the public for research and development.
[NLP-63] An Interdisciplinary Review of Commonsense Reasoning and Intent Detection
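[Quick read]: This paper reviews two key challenges in natural language understanding: commonsense reasoning and intent detection. It analyzes 28 papers from ACL, EMNLP, and CHI (2020-2025), organized by methodology and application, highlights the trend toward more adaptive, multilingual, and context-aware models, and identifies key research gaps in grounding, generalization, and benchmark design.
Link: https://arxiv.org/abs/2506.14040
Authors: Md Nazmus Sakib
Affiliations: University of Maryland, Baltimore County
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: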
Abstract:This review explores recent advances in commonsense reasoning and intent detection, two key challenges in natural language understanding. We analyze 28 papers from ACL, EMNLP, and CHI (2020-2025), organizing them by methodology and application. Commonsense reasoning is reviewed across zero-shot learning, cultural adaptation, structured evaluation, and interactive contexts. Intent detection is examined through open-set models, generative formulations, clustering, and human-centered systems. By bridging insights from NLP and HCI, we highlight emerging trends toward more adaptive, multilingual, and context-aware models, and identify key gaps in grounding, generalization, and benchmark design.
[NLP-64] MultiFinBen: A Multilingual Multimodal and Difficulty-Aware Benchmark for Financial LLM Evaluation
[Quick read]: This paper addresses the limitations of existing financial NLP benchmarks, which are usually restricted to monolingual and unimodal settings and rely on simple tasks that do not reflect the complexity of real-world financial communication. The key to the solution is MultiFinBen, the first multilingual and multimodal benchmark for the global financial domain, covering text, vision, and audio modalities as well as monolingual, bilingual, and multilingual settings. It introduces two new groups of tasks: PolyFiQA-Easy and PolyFiQA-Expert, and the OCR-embedded financial QA tasks EnglishOCR and SpanishOCR, which challenge models to reason over cross-lingual and multimodal financial information. The paper also proposes a dynamic, difficulty-aware selection mechanism that yields a compact, balanced benchmark rather than a simple aggregation of existing datasets.
Link: https://arxiv.org/abs/2506.14028
Authors: Xueqing Peng,Lingfei Qian,Yan Wang,Ruoyu Xiang,Yueru He,Yang Ren,Mingyang Jiang,Jeff Zhao,Huan He,Yi Han,Yun Feng,Yuechen Jiang,Yupeng Cao,Haohang Li,Yangyang Yu,Xiaoyu Wang,Penglei Gao,Shengyuan Lin,Keyi Wang,Shanshan Yang,Yilun Zhao,Zhiwei Liu,Peng Lu,Jerry Huang,Suyuchen Wang,Triantafillos Papadopoulos,Polydoros Giannouris,Efstathia Soufleri,Nuo Chen,Guojun Xiong,Zhiyang Deng,Yijia Zhao,Mingquan Lin,Meikang Qiu,Kaleb E Smith,Arman Cohan,Xiao-Yang Liu,Jimin Huang,Alejandro Lopez-Lira,Xi Chen,Junichi Tsujii,Jian-Yun Nie,Sophia Ananiadou,Qianqian Xie
Affiliations: The FinAI; Yale University; Columbia University; Georgia Institute of Technology; NVIDIA; New York University; Stevens Institute of Technology; Archimedes/Athena RC Athens; University of Florida; University of Manchester; University of Montreal; Harvard University; University of Minnesota; Augusta University; Athens University of Economics and Business and Archimedes; National Institute of Advanced Industrial Science and Technology; Asian Development Bank; Quantitative Health Science, Cleveland Clinic; National University of Singapore; Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Recent advances in large language models (LLMs) have accelerated progress in financial NLP and applications, yet existing benchmarks remain limited to monolingual and unimodal settings, often over-relying on simple tasks and failing to reflect the complexity of real-world financial communication. We introduce MultiFinBen, the first multilingual and multimodal benchmark tailored to the global financial domain, evaluating LLMs across modalities (text, vision, audio) and linguistic settings (monolingual, bilingual, multilingual) on domain-specific tasks. We introduce two novel tasks, including PolyFiQA-Easy and PolyFiQA-Expert, the first multilingual financial benchmarks requiring models to perform complex reasoning over mixed-language inputs; and EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks challenging models to extract and reason over information from visual-text financial documents. Moreover, we propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark rather than simply aggregating existing datasets. Extensive evaluation of 22 state-of-the-art models reveals that even the strongest models, despite their general multimodal and multilingual capabilities, struggle dramatically when faced with complex cross-lingual and multimodal tasks in the financial domain. MultiFinBen is publicly released to foster transparent, reproducible, and inclusive progress in financial studies and applications.
[NLP-65] Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text
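[Quick read]: This paper studies how well Large Language Models (LLMs) understand and reason about code-switched (CSW) text. The key to the approach is generating CSW variants of established reasoning and comprehension benchmarks to evaluate LLMs systematically, and exploring how well different methods, such as prompting and fine-tuning, mitigate the degradation caused by language mixing. The study finds that foreign tokens can disrupt English text, whereas embedding English into other languages sometimes improves comprehension, and that fine-tuning is a more stable route to mitigation than prompting.
Link: https://arxiv.org/abs/2506.14012
Authors: Amr Mohamed,Yang Zhang,Michalis Vazirgiannis,Guokan Shang
Affiliations: MBZUAI; Ecole Polytechnique
Categories: Computation and Language (cs.CL)
Comments: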
Abstract:Code-switching (CSW) is the act of alternating between two or more languages within a single discourse. This phenomenon is widespread in multilingual communities, and increasingly prevalent in online content, where users naturally mix languages in everyday communication. As a result, Large Language Models (LLMs), now central to content processing and generation, are frequently exposed to code-switched inputs. Given their widespread use, it is crucial to understand how LLMs process and reason about such mixed-language text. This paper presents a systematic evaluation of LLM comprehension under code-switching by generating CSW variants of established reasoning and comprehension benchmarks. While degradation is evident when foreign tokens disrupt English text, even under linguistic constraints, embedding English into other languages often improves comprehension. Though prompting yields mixed results, fine-tuning offers a more stable path to degradation mitigation.
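[NLP-66] AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science
[Quick read]: This paper asks whether Large Language Models (LLMs) can critically leverage external domain knowledge in automated data science workflows the way human data scientists do. The key to the solution is the AssistedDS benchmark, which uses synthetic datasets and real-world Kaggle competitions, each paired with structured domain-knowledge documents, to evaluate systematically how LLMs handle domain knowledge in tabular prediction tasks, focusing on discerning information, applying helpful versus harmful knowledge, and predictive performance.
Link: https://arxiv.org/abs/2506.13992
Authors: An Luo,Xun Xian,Jin Du,Fangqiao Tian,Ganghua Wang,Ming Zhong,Shengchun Zhao,Xuan Bi,Zirui Liu,Jiawei Zhou,Jayanth Srinivasa,Ashish Kundu,Charles Fleming,Mingyi Hong,Jie Ding
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Methodology (stat.ME)
Comments: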
Abstract:Large language models (LLMs) have advanced the automation of data science workflows. Yet it remains unclear whether they can critically leverage external domain knowledge as human data scientists do in practice. To answer this question, we introduce AssistedDS (Assisted Data Science), a benchmark designed to systematically evaluate how LLMs handle domain knowledge in tabular prediction tasks. AssistedDS features both synthetic datasets with explicitly known generative mechanisms and real-world Kaggle competitions, each accompanied by curated bundles of helpful and adversarial documents. These documents provide domain-specific insights into data cleaning, feature engineering, and model selection. We assess state-of-the-art LLMs on their ability to discern and apply beneficial versus harmful domain knowledge, evaluating submission validity, information recall, and predictive performance. Our results demonstrate three key findings: (1) LLMs frequently exhibit an uncritical adoption of provided information, significantly impairing their predictive performance when adversarial content is introduced, (2) helpful guidance is often insufficient to counteract the negative influence of adversarial information, and (3) in Kaggle datasets, LLMs often make errors in handling time-series data, applying consistent feature engineering across different folds, and interpreting categorical variables correctly. These findings highlight a substantial gap in current models’ ability to critically evaluate and leverage expert knowledge, underscoring an essential research direction for developing more robust, knowledge-aware automated data science systems.
[NLP-67] AI shares emotion with humans across languages and cultures
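[Quick read]: This paper addresses how to achieve a regulated and meaningful exchange of emotions between humans and artificial intelligence (AI), in particular whether Large Language Models (LLMs) represent emotion in language as humans do and how the emotional tone of their output can be controlled. The key to the solution is analyzing human-AI emotional alignment across linguistic-cultural groups and model families using interpretable LLM features, and modulating model expression stably and naturally with steering vectors derived from human-centric emotion concepts, which shows that human emotion concepts can systematically induce LLMs to produce the corresponding affective states.
Link: https://arxiv.org/abs/2506.13978
Authors: Xiuwen Wu,Hao Wang,Zhiang Yan,Xiaohan Tang,Pengfei Xu,Wai-Ting Siok,Ping Li,Jia-Hong Gao,Bingjiang Lyu,Lang Qin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: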
Abstract:Effective and safe human-machine collaboration requires the regulated and meaningful exchange of emotions between humans and artificial intelligence (AI). Current AI systems based on large language models (LLMs) can provide feedback that makes people feel heard. Yet it remains unclear whether LLMs represent emotion in language as humans do, or whether and how the emotional tone of their output can be controlled. We assess human-AI emotional alignment across linguistic-cultural groups and model-families, using interpretable LLM features translated from concept-sets for over twenty nuanced emotion categories (including six basic emotions). Our analyses reveal that LLM-derived emotion spaces are structurally congruent with human perception, underpinned by the fundamental affective dimensions of valence and arousal. Furthermore, these emotion-related features also accurately predict large-scale behavioural data on word ratings along these two core dimensions, reflecting both universal and language-specific patterns. Finally, by leveraging steering vectors derived solely from human-centric emotion concepts, we show that model expressions can be stably and naturally modulated across distinct emotion categories, which provides causal evidence that human emotion concepts can be used to systematically induce LLMs to produce corresponding affective states when conveying content. These findings suggest AI not only shares emotional representations with humans but its affective outputs can be precisely guided using psychologically grounded emotion concepts.
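The abstract does not say how its steering vectors are built. A common construction, shown here only as an assumed illustration, takes the difference between mean activations on concept-related and baseline inputs and adds a scaled copy of it to a hidden state. All names and the toy two-dimensional activations below are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized activation vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(concept_acts, baseline_acts):
    """Difference of means: concept activations minus baseline activations."""
    concept_mean = mean_vector(concept_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [c - b for c, b in zip(concept_mean, baseline_mean)]


def steer(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering direction by strength alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations standing in for a real model's hidden states.
joy_acts = [[1.0, 2.0], [3.0, 4.0]]
neutral_acts = [[0.0, 1.0], [2.0, 1.0]]
direction = steering_vector(joy_acts, neutral_acts)
steered = steer([0.0, 0.0], direction, alpha=0.5)
```

In a real setting the vectors would be residual-stream activations from a chosen layer, and `alpha` would control how strongly the emotional tone is pushed.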
[NLP-68] CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios
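[Quick read]: This paper addresses the unexpected errors that arise when Large Language Models (LLMs) use external tools on complex tasks, covering how to identify, diagnose, and recover from them. The key to the solution is CRITICTOOL, a comprehensive critique evaluation benchmark specialized for tool learning. Its dataset is built with a novel evolutionary strategy and contains tool-use errors of varying complexity, which better reflects real-world scenarios; experiments validate the generalization and effectiveness of the benchmark construction strategy.
Link: https://arxiv.org/abs/2506.13977
Authors: Shiting Huang,Zhen Fang,Zehui Chen,Siyu Yuan,Junjie Ye,Yu Zeng,Lin Chen,Qi Mao,Feng Zhao
Affiliations: University of Science and Technology of China; Fudan University; Communication University of China
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: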
Abstract:The ability of large language models (LLMs) to utilize external tools has enabled them to tackle an increasingly diverse range of tasks. However, as the tasks become more complex and long-horizon, the intricate tool utilization process may trigger various unexpected errors. Therefore, how to effectively handle such errors, including identifying, diagnosing, and recovering from them, has emerged as a key research direction for advancing tool learning. In this work, we first extensively analyze the types of errors encountered during the function-calling process on several competitive tool evaluation benchmarks. Based on it, we introduce CRITICTOOL, a comprehensive critique evaluation benchmark specialized for tool learning. Building upon a novel evolutionary strategy for dataset construction, CRITICTOOL holds diverse tool-use errors with varying complexities, which better reflects real-world scenarios. We conduct extensive experiments on CRITICTOOL, and validate the generalization and effectiveness of our constructed benchmark strategy. We also provide an in-depth analysis of the tool reflection ability on various LLMs, offering a new perspective on the field of tool learning in LLMs. The code is available at this https URL.
[NLP-69] Are manual annotations necessary for statutory interpretations retrieval?
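[Quick Read]: This paper addresses the difficulty of efficiently retrieving interpretations of legal concepts in legal research, in particular how automation can reduce reliance on manual annotation. Current methods depend on sentence ranking and on training language models over annotated examples, a process that is costly and must be repeated for every legal concept. The key to the solution is exploring whether annotation can be automated: experiments analyze the number of annotations, the annotation strategy, and the effect of automating annotation with a large language model (LLM), with the aim of reducing manual effort and improving efficiency.
Link: https://arxiv.org/abs/2506.13965
Authors: Aleksander Smywiński-Pohl,Tomer Libal,Adam Kaczmarczyk,Magdalena Król
Affiliations: AGH University; University of Luxembourg
Categories: Computation and Language (cs.CL)
Comments: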
Abstract:One of the elements of legal research is looking for cases where judges have extended the meaning of a legal concept by providing interpretations of what a concept means or does not mean. This allows legal professionals to use such interpretations as precedents, and laymen to better understand the legal concept. The state-of-the-art approach for retrieving the most relevant interpretations for these concepts currently depends on the ranking of sentences and the training of language models over annotated examples. That manual annotation process can be quite expensive and needs to be repeated for each such concept, which prompted recent research in trying to automate this process. In this paper, we highlight the results of various experiments conducted to determine the volume, scope and even the need for manual annotation. First, we check what the optimal number of annotations per legal concept is. Second, we check whether we can draw the sentences for annotation randomly or whether model performance improves when only the best candidates are annotated. As the last question, we check what the outcome is of automating the annotation process with the help of an LLM.
[NLP-70] ASMR: Augmenting Life Scenario using Large Generative Models for Robotic Action Reflection
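[Quick Read]: This paper addresses how to improve understanding of user intent by incorporating visual cues when designing robots that assist with everyday human activities, a process framed as a multimodal classification task. Collecting a large-scale dataset with both visual and linguistic elements for model training is difficult and time-consuming. The key to the solution is a novel data augmentation framework that uses an advanced large language model to simulate potential dialogues and environmental contexts, then uses a stable diffusion model to generate matching environment images. This expands the dataset and improves the action selection of multimodal models when target data is limited.
Link: https://arxiv.org/abs/2506.13956
Authors: Shang-Chi Tsai,Seiya Kawano,Angel Garcia Contreras,Koichiro Yoshino,Yun-Nung Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: IWSDS 2024 Best Paper Award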
Abstract:When designing robots to assist in everyday human activities, it is crucial to enhance user requests with visual cues from their surroundings for improved intent understanding. This process is defined as a multimodal classification task. However, gathering a large-scale dataset encompassing both visual and linguistic elements for model training is challenging and time-consuming. To address this issue, our paper introduces a novel framework focusing on data augmentation in robotic assistance scenarios, encompassing both dialogues and related environmental imagery. This approach involves leveraging a sophisticated large language model to simulate potential conversations and environmental contexts, followed by the use of a stable diffusion model to create images depicting these environments. The additionally generated data serves to refine the latest multimodal models, enabling them to more accurately determine appropriate actions in response to user interactions with the limited target data. Our experimental results, based on a dataset collected from real-world scenarios, demonstrate that our methodology significantly enhances the robot’s action selection capabilities, achieving the state-of-the-art performance.
[NLP-71] Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models
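[Quick Read]: This paper studies how reasoning models trained with reinforcement learning on verifiable rewards (RLVR) learn to solve new problems. It finds that RLVR improves performance through two main mechanisms: compressing pass@k into pass@1, and "capability gain", in which the model learns to solve problems it previously could not. Although capability gain appears across model scales, learning to solve new problems is driven mainly by self-distillation. The paper also proposes Guide, a new online training algorithm that adaptively adds hints to the model's context and adjusts the importance sampling ratio for off-policy trajectories, so that the policy is optimized for contexts without hints and generalizes better.
Link: https://arxiv.org/abs/2506.13923
Authors: Vaskar Nath,Elaine Lau,Anisha Gunjal,Manasi Sharma,Nikhil Baharte,Sean Hendryx
Affiliations: Scale AI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: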
Abstract:We study the process through which reasoning models trained with reinforcement learning on verifiable rewards (RLVR) can learn to solve new problems. We find that RLVR drives performance through two main means: (1) by compressing pass@k into pass@1 and (2) via “capability gain” in which models learn to solve new problems that they previously could not solve even at high k. We find that while capability gain exists across model scales, learning to solve new problems is primarily driven through self-distillation. We demonstrate these findings across model scales ranging from 0.5B to 72B on 500,000 reasoning problems with prompts and verifiable final answers across math, science, and code domains. We further show that we can significantly improve pass@k rates by leveraging natural language guidance for the model to consider within context while still requiring the model to derive a solution chain from scratch. Based on these insights, we derive Guide, a new class of online training algorithms. Guide adaptively incorporates hints into the model’s context on problems for which all rollouts were initially incorrect and adjusts the importance sampling ratio for the “off-policy” trajectories in order to optimize the policy for contexts in which the hints are no longer present. We describe variants of Guide for GRPO and PPO and empirically show that Guide-GRPO on 7B and 32B parameter models improves generalization over its vanilla counterpart with up to 4% macro-average improvement across math benchmarks. We include careful ablations to analyze Guide's components and theoretically analyze Guide’s learning efficiency.
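The pass@k quantity in this abstract can be made concrete with a short sketch. It assumes the widely used unbiased estimator from the code-generation literature (n rollouts sampled per problem, c of them correct); the abstract does not say which estimator the authors use, so treat this as an illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled rollouts, c of which are correct.

    Assumption: the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # C(n - c, k) / C(n, k) is the chance that a random size-k subset
    # of the rollouts contains no correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With this definition, "compressing pass@k into pass@1" means that pass_at_k(n, c, 1) after training approaches the value pass_at_k(n, c, k) had before training, while "capability gain" corresponds to problems whose correct count c was 0 before training and becomes positive afterwards.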
[NLP-72] Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry Cluster Divergence and Layer wise Pooled Representations
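[Quick Read]: This paper addresses the alignment of large language models (LLMs) deployed in high-stakes domains, i.e. ensuring that model behavior reliably reflects human values and safety constraints. Current evaluations rely on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, which have major blind spots. The proposed solution is a new geometric, prompt-invariant metric, the Alignment Quality Index (AQI). Its key idea is to assess alignment empirically by measuring how well safe and unsafe activations separate in latent space, combining several clustering quality indices to detect hidden misalignment and jailbreak risk.
Link: https://arxiv.org/abs/2506.13901
Authors: Abhilekh Borah,Chhavi Sharma,Danush Khanna,Utkarsh Bhatt,Gurpreet Singh,Hasnat Md Abdullah,Raghav Kaushik Ravi,Vinija Jain,Jyoti Patel,Shubham Singh,Vasu Sharma,Arpita Vats,Rahul Raja,Aman Chadha,Amitava Das
Affiliations: Manipal University Jaipur, India; LinkedIn; IIT Kharagpur, India; IIIT Guwahati, India; Texas A&M University, USA; Vellore Institute of Technology, Chennai, India; Meta AI; Evalueserve; New York University, USA; Amazon AI; BITS Goa, India
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: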
Abstract:Alignment is no longer a luxury, it is a necessity. As large language models (LLMs) enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots. Aligned models are often vulnerable to jailbreaking, stochasticity of generation, and alignment faking. To address this issue, we introduce the Alignment Quality Index (AQI). This novel geometric and prompt-invariant metric empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding invariant tool for behavior agnostic safety auditing. Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across different models trained under DPO, GRPO, and RLHF conditions demonstrate AQI’s correlation with external judges and ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.
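To illustrate what "separation of safe and unsafe activations" can mean numerically, here is a minimal sketch of one of the named ingredients, the Davies-Bouldin score, on toy 2-D points. This is not AQI itself: the abstract does not give the formula for combining the indices, and the toy vectors below are hypothetical stand-ins for pooled layer activations.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def davies_bouldin(clusters):
    """Davies-Bouldin score for a list of clusters; lower means better separated."""
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance from each point to its own cluster centroid.
    scatter = [sum(dist(p, m) for p in c) / len(c) for c, m in zip(clusters, centers)]
    k = len(clusters)
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst_ratios) / k


# Hypothetical toy "activations": one safe cluster, two candidate unsafe clusters.
safe = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
unsafe_far = [(x + 10.0, y) for x, y in safe]
unsafe_near = [(x + 1.0, y) for x, y in safe]
```

Here davies_bouldin([safe, unsafe_far]) is smaller than davies_bouldin([safe, unsafe_near]), matching the intuition that well-separated safe and unsafe representations score better on this index.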
[NLP-73] EmoNews: A Spoken Dialogue System for Expressive News Conversations
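[Quick Read]: This paper addresses insufficient emotional speech regulation in task-oriented spoken dialogue systems (SDS) for news conversations, with the aim of making the conversations more empathetic. Work in this area is still exploratory because SDS research and emotional text-to-speech (TTS) research are compartmentalized and there are no standardized evaluation metrics for social goals. The key to the solution is a large language model (LLM)-based sentiment analyzer that identifies appropriate emotions, combined with PromptTTS to synthesize context-appropriate emotional speech, plus a subjective evaluation scale for emotion regulation performance. Experiments show that the proposed system outperforms a baseline in emotion regulation and user engagement.
Link: https://arxiv.org/abs/2506.13894
Authors: Ryuki Matsuura,Shikhar Bharadwaj,Jiarui Liu,Dhatchi Kunde Govindarajan
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: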
Abstract:We develop a task-oriented spoken dialogue system (SDS) that regulates emotional speech based on contextual cues to enable more empathetic news conversations. Despite advancements in emotional text-to-speech (TTS) techniques, task-oriented emotional SDSs remain underexplored due to the compartmentalized nature of SDS and emotional TTS research, as well as the lack of standardized evaluation metrics for social goals. We address these challenges by developing an emotional SDS for news conversations that utilizes a large language model (LLM)-based sentiment analyzer to identify appropriate emotions and PromptTTS to synthesize context-appropriate emotional speech. We also propose a subjective evaluation scale for emotional SDSs and judge the emotion regulation performance of the proposed and baseline systems. Experiments showed that our emotional SDS outperformed a baseline system in terms of emotion regulation and engagement. These results suggest the critical role of speech emotion for more engaging conversations. All our source code is open-sourced at this https URL
[NLP-74] VL-GenRM: Enhancing Vision-Language Verification via Vision Experts and Iterative Training
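[Quick Read]: This paper addresses two main challenges in reinforcement fine-tuning of Vision-Language (VL) models. The first is a bootstrapping dilemma, since high-quality training data depends on VL models that are already strong. The second is modality bias and negative example amplification, which arise when VL models hallucinate incorrect visual attributes. The key to the solution is an iterative training framework that combines vision experts, Chain-of-Thought (CoT) rationales, and Margin-based Rejection Sampling to refine preference datasets, strengthen structured critiques, and progressively improve reasoning.
Link: https://arxiv.org/abs/2506.13888
Authors: Jipeng Zhang,Kehao Miao,Renjie Pi,Zhaowei Wang,Runtao Liu,Rui Pan,Tong Zhang
Affiliations: The Hong Kong University of Science and Technology; Nanyang Technological University; University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: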
Abstract:Reinforcement Fine-Tuning (RFT) with verifiable rewards has advanced large language models but remains underexplored for Vision-Language (VL) models. The Vision-Language Reward Model (VL-RM) is key to aligning VL models by providing structured feedback, yet training effective VL-RMs faces two major challenges. First, the bootstrapping dilemma arises as high-quality training data depends on already strong VL models, creating a cycle where self-generated supervision reinforces existing biases. Second, modality bias and negative example amplification occur when VL models hallucinate incorrect visual attributes, leading to flawed preference data that further misguides training. To address these issues, we propose an iterative training framework leveraging vision experts, Chain-of-Thought (CoT) rationales, and Margin-based Rejection Sampling. Our approach refines preference datasets, enhances structured critiques, and iteratively improves reasoning. Experiments across VL-RM benchmarks demonstrate superior performance in hallucination detection and multimodal reasoning, advancing VL model alignment with reinforcement learning.
[NLP-75] Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles
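[Quick Read]: This paper addresses the poor performance of large language models (LLMs) on linguistic-mathematical puzzles involving cross-linguistic numeral systems. It finds that LLMs struggle with the implicit structure in how numerals are constructed and combined, which humans infer from their understanding of language. Models solve such problems consistently only when the mathematical operations are marked explicitly with known symbols (such as "+" and "×"). The study further shows that current reasoning models still struggle to flexibly infer compositional rules from implicit patterns in human-scale data.
Link: https://arxiv.org/abs/2506.13886
Authors: Antara Raaghavi Bhattacharya,Isabel Papadimitriou,Kathryn Davidson,David Alvarez-Melis
Affiliations: Harvard University; Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: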
Abstract:Across languages, numeral systems vary widely in how they construct and combine numbers. While humans consistently learn to navigate this diversity, large language models (LLMs) struggle with linguistic-mathematical puzzles involving cross-linguistic numeral systems, which humans can learn to solve successfully. We investigate why this task is difficult for LLMs through a series of experiments that untangle the linguistic and mathematical aspects of numbers in language. Our experiments establish that models cannot consistently solve such problems unless the mathematical operations in the problems are explicitly marked using known symbols (+, ×, etc., as in “twenty + three”). In further ablation studies, we probe how individual parameters of numeral construction and combination affect performance. While humans use their linguistic understanding of numbers to make inferences about the implicit compositional structure of numerals, LLMs seem to lack this notion of implicit numeral structure. We conclude that the ability to flexibly infer compositional rules from implicit patterns in human-scale data remains an open challenge for current reasoning models.
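The contrast between explicitly marked and implicit numeral structure can be sketched with a toy evaluator. The lexicon and function name below are hypothetical and only illustrate why marked expressions are mechanically easy: once "+" and "×" are written out, evaluation is a plain sum of products.

```python
# Hypothetical toy lexicon; real puzzles use numerals from many languages.
LEXICON = {"three": 3, "seven": 7, "twenty": 20}


def evaluate_marked(expression: str) -> int:
    """Evaluate a numeral expression whose operations are written explicitly.

    Multiplication binds tighter than addition, as in ordinary arithmetic.
    """
    total = 0
    for term in expression.split("+"):
        product = 1
        for word in term.split("×"):
            product *= LEXICON[word.strip()]
        total += product
    return total
```

For example, evaluate_marked("twenty + three") gives 23. An unmarked input such as "twenty three" raises a KeyError in this sketch, because deciding whether juxtaposition means addition or multiplication is exactly the implicit structure the abstract says models fail to infer.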
[NLP-76] Investigating the Potential of Large Language Model-Based Router Multi-Agent Architectures for Foundation Design Automation: A Task Classification and Expert Selection Study
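[Quick Read]: This paper aims to automate foundation design calculations by using intelligent task classification and expert selection in a router-based multi-agent system. The key to the solution is a router-based expert selection mechanism combined with a dual-tier classification framework that distinguishes foundation types and enables the appropriate analytical approach, yielding higher performance than conventional agentic workflows on shallow foundation and pile design scenarios.
Link: https://arxiv.org/abs/2506.13811
Authors: Sompote Youwai,David Phim,Vianne Gayl Murcia,Rianne Clair Onas
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: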
Abstract:This study investigates router-based multi-agent systems for automating foundation design calculations through intelligent task classification and expert selection. Three approaches were evaluated: single-agent processing, multi-agent designer-checker architecture, and router-based expert selection. Performance assessment utilized baseline models including DeepSeek R1, ChatGPT 4 Turbo, Grok 3, and Gemini 2.5 Pro across shallow foundation and pile design scenarios. The router-based configuration achieved performance scores of 95.00% for shallow foundations and 90.63% for pile design, representing improvements of 8.75 and 3.13 percentage points over standalone Grok 3 performance respectively. The system outperformed conventional agentic workflows by 10.0 to 43.75 percentage points. Grok 3 demonstrated superior standalone performance without external computational tools, indicating advances in direct LLM mathematical reasoning for engineering applications. The dual-tier classification framework successfully distinguished foundation types, enabling appropriate analytical approaches. Results establish router-based multi-agent systems as optimal for foundation design automation while maintaining professional documentation standards. Given safety-critical requirements in civil engineering, continued human oversight remains essential, positioning these systems as advanced computational assistance tools rather than autonomous design replacements in professional practice.
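The percentage-point figures in this abstract imply standalone Grok 3 scores that are not stated directly. The sketch below backs them out; the derived numbers are simple arithmetic on the quoted values, not results reported by the authors, and the dictionary and function names are illustrative.

```python
# Router-based scores and gains over standalone Grok 3, as quoted in the abstract.
ROUTER_SCORE = {"shallow foundation": 95.00, "pile design": 90.63}
GAIN_OVER_GROK3 = {"shallow foundation": 8.75, "pile design": 3.13}  # percentage points


def implied_baseline(task: str) -> float:
    """Standalone score implied by subtracting the percentage-point gain."""
    return round(ROUTER_SCORE[task] - GAIN_OVER_GROK3[task], 2)


def relative_gain(task: str) -> float:
    """Relative improvement in percent, which is not the same as the point gain."""
    return round(100.0 * GAIN_OVER_GROK3[task] / implied_baseline(task), 2)
```

The implied standalone scores are 86.25% for shallow foundations and 87.50% for pile design. The distinction matters when comparing systems: a gain of 8.75 percentage points on a base of 86.25% is a relative improvement of about 10.14%.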
[NLP-77] ClimateChat: Designing Data and Methods for Instruction Tuning LLMs to Answer Climate Change Queries ICLR2025
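[Quick Read]: This paper addresses the inefficient, low-precision production of instruction data for climate science, which limits further development of climate-specific large language models (LLMs). The key to the solution is an automated method for constructing instruction data: it generates instructions from facts and background knowledge in documents and increases diversity through web scraping and the collection of seed instructions. The resulting climate science instruction dataset, ClimateChat-Corpus, is used to fine-tune open-source LLMs and improve their performance on climate question answering.
Link: https://arxiv.org/abs/2506.13796
Authors: Zhou Chen,Xiao Wang,Yuanhong Liao,Ming Lin,Yuqi Bai
Affiliations: Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICLR 2025 camera ready, 13 pages, 4 figures, 4 tables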
Abstract:As the issue of global climate change becomes increasingly severe, the demand for research in climate science continues to grow. Natural language processing technologies, represented by Large Language Models (LLMs), have been widely applied to climate change-specific research, providing essential information support for decision-makers and the public. Some studies have improved model performance on relevant tasks by constructing climate change-related instruction data and instruction-tuning LLMs. However, current research remains inadequate in efficiently producing large volumes of high-precision instruction data for climate change, which limits further development of climate change LLMs. This study introduces an automated method for constructing instruction data. The method generates instructions using facts and background knowledge from documents and enhances the diversity of the instruction data through web scraping and the collection of seed instructions. Using this method, we constructed a climate change instruction dataset, named ClimateChat-Corpus, which was used to fine-tune open-source LLMs, resulting in an LLM named ClimateChat. Evaluation results show that ClimateChat significantly improves performance on climate change question-and-answer tasks. Additionally, we evaluated the impact of different base models and instruction data on LLM performance and demonstrated its capability to adapt to a wide range of climate change scientific discovery tasks, emphasizing the importance of selecting an appropriate base model for instruction tuning. This research provides valuable references and empirical support for constructing climate change instruction data and training climate change-specific LLMs.
[NLP-78] ICE-ID: A Novel Historical Census Data Benchmark Comparing NARS against LLMs & a ML Ensemble on Longitudinal Identity Resolution
[Quick Read]: This paper addresses historical identity resolution, i.e. matching person entities across long-term longitudinal data, specifically Icelandic census records with name variations, demographic changes, and rich genealogical links. The key to the solution is a novel approach to tabular data based on NARS (Non-Axiomatic Reasoning System), a general-purpose AI framework built on Non-Axiomatic Logic (NAL) that reasons with limited knowledge and resources. Experiments show that NARS is competitive with standard approaches, reaches state-of-the-art (SOTA) performance on the task, and shows potential for handling complex historical data.
Link: https://arxiv.org/abs/2506.13792
Authors: Gonçalo Hora de Carvalho,Lazar S. Popov,Sander Kaatee,Kristinn R. Thórisson,Tangrui Li,Pétur Húni Björnsson,Jilles S. Dibangoye
Affiliations: IIIM, Iceland; Reykjavik University; Temple University; University of Copenhagen; Bernoulli Institute, University of Groningen
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Applications (stat.AP)
Comments:
Abstract:We introduce ICE-ID, a novel benchmark dataset for historical identity resolution, comprising 220 years (1703-1920) of Icelandic census records. ICE-ID spans multiple generations of longitudinal data, capturing name variations, demographic changes, and rich genealogical links. To the best of our knowledge, this is the first large-scale, open tabular dataset specifically designed to study long-term person-entity matching in a real-world population. We define identity resolution tasks (within and across census waves) with clearly documented metrics and splits. We evaluate a range of methods: handcrafted rule-based matchers, a ML ensemble as well as LLMs for structured data (e.g. transformer-based tabular networks) against a novel approach to tabular data called NARS (Non-Axiomatic Reasoning System) - a general-purpose AI framework designed to reason with limited knowledge and resources. Its core is Non-Axiomatic Logic (NAL), a term-based logic. Our experiments show that NARS is surprisingly simple and competitive with other standard approaches, achieving SOTA at our task. By releasing ICE-ID and our code, we enable reproducible benchmarking of identity resolution approaches in longitudinal settings and hope that ICE-ID opens new avenues for cross-disciplinary research in data linkage and historical analytics.
[NLP-79] AcademicBrowse: Benchmarking Academic Browse Ability of LLMs
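[Quick Read]: This paper addresses the weak performance of current large language models (LLMs) on academic information retrieval, and in particular the fact that existing benchmarks such as OpenAI's BrowseComp do not cover the specific demands of academic search: deeper literature tracing and organization, support for professional academic databases, navigation of long-tail academic knowledge, and academic rigor. The key to the solution is AcademicBrowse, the first dataset designed specifically to evaluate the complex information retrieval capabilities of LLMs in academic research. It offers academic practicality, high difficulty, concise evaluation, and broad coverage, allowing more precise measurement and improvement of LLM performance on complex academic retrieval tasks.
Link: https://arxiv.org/abs/2506.13784
Authors: Junting Zhou,Wang Li,Yiyan Liao,Nengyuan Zhang,Tingjia Miao,Zhihui Qi,Yuhan Wu,Tong Yang
Affiliations: Peking University
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: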
Abstract:Large Language Models (LLMs)’ search capabilities have garnered significant attention. Existing benchmarks, such as OpenAI’s BrowseComp, primarily focus on general search scenarios and fail to adequately address the specific demands of academic search. These demands include deeper literature tracing and organization, professional support for academic databases, the ability to navigate long-tail academic knowledge, and ensuring academic rigor. Here, we propose AcademicBrowse, the first dataset specifically designed to evaluate the complex information retrieval capabilities of Large Language Models (LLMs) in academic research. AcademicBrowse possesses the following key characteristics: Academic Practicality, where question content closely mirrors real academic learning and research environments, avoiding deliberately misleading models; High Difficulty, with answers that are challenging for single models (e.g., Grok DeepSearch or Gemini Deep Research) to provide directly, often requiring at least three deep searches to derive; Concise Evaluation, where limiting conditions ensure answers are as unique as possible, accompanied by clear sources and brief solution explanations, greatly facilitating subsequent audit and verification, surpassing the current lack of analyzed search datasets both domestically and internationally; and Broad Coverage, as the dataset spans at least 15 different academic disciplines. Through AcademicBrowse, we expect to more precisely measure and promote the performance improvement of LLMs in complex academic information retrieval tasks. The data is available at: this https URL
[NLP-80] Knowledge Compression via Question Generation: Enhancing Multihop Document Retrieval without Fine-tuning
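[Quick Read]: This paper aims to remove the need for fine-tuning or traditional chunking in retrieval-augmented generation (RAG) systems while improving retrieval and generation quality. The key to the solution is question-based knowledge encoding: textual content is encoded through generated questions that span the lexical and semantic space, creating targeted retrieval cues, combined with a custom syntactic reranking method. The method achieves higher recall on single-hop retrieval, outperforms traditional chunking and fine-tuned baselines on multihop tasks, and reduces compute and storage costs.
Link: https://arxiv.org/abs/2506.13778
Authors: Anvi Alex Eponon,Moein Shahiki-Tash,Ildar Batyrshin,Christian E. Maldonado-Sifuentes,Grigori Sidorov,Alexander Gelbukh
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: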
Abstract:This study presents a question-based knowledge encoding approach that improves retrieval-augmented generation (RAG) systems without requiring fine-tuning or traditional chunking. We encode textual content using generated questions that span the lexical and semantic space, creating targeted retrieval cues combined with a custom syntactic reranking method. In single-hop retrieval over 109 scientific papers, our approach achieves a Recall@3 of 0.84, outperforming traditional chunking methods by 60 percent. We also introduce “paper-cards”, concise paper summaries under 300 characters, which enhance BM25 retrieval, increasing MRR@3 from 0.56 to 0.85 on simplified technical queries. For multihop tasks, our reranking method reaches an F1 score of 0.52 with LLaMA2-Chat-7B on the LongBench 2WikiMultihopQA dataset, surpassing chunking and fine-tuned baselines which score 0.328 and 0.412 respectively. This method eliminates fine-tuning requirements, reduces retrieval latency, enables intuitive question-driven knowledge access, and decreases vector storage demands by 80%, positioning it as a scalable and efficient RAG alternative. (arXiv v1 submitted Mon, 9 Jun 2025 16:15:11 UTC.)
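For readers less familiar with the retrieval metrics quoted here, the following sketch shows the usual definitions of Recall@k and MRR@k. The abstract does not spell out the paper's exact evaluation protocol (for example, how many relevant documents each query has), so these are the textbook forms, with illustrative function names and document ids.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)


def mrr_at_k(rankings, relevant_sets, k):
    """Mean reciprocal rank of the first relevant hit within the top k.

    A query with no relevant document in its top k contributes 0.
    """
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_sets):
        for rank, doc in enumerate(ranked[:k], start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)
```

As a usage example, recall_at_k(["a", "b", "c", "d"], {"c", "z"}, 3) is 0.5, since one of the two relevant documents is in the top three. An MRR@3 of 0.85, as reported for the paper-cards, means the first relevant result sits at or very near rank 1 for most queries.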
[NLP-81] LittleBit: Ultra Low-Bit Quantization via Latent Factorization
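[Quick Read]: This paper addresses the substantial memory and compute costs of deploying large language models (LLMs), in particular the performance degradation that is hard to overcome in the sub-1-bit quantization regime. The key to the solution is LittleBit, which represents weights in low-rank form through latent matrix factorization and then binarizes the factors, reaching extremely low bit widths such as 0.1 bits per weight (BPW). To counter the information loss at this extreme precision, LittleBit adds a multi-scale compensation mechanism and combines Dual Sign-Value-Independent Decomposition (Dual-SVID) with integrated Residual Compensation for stable and effective quantization-aware training (QAT).
Link: https://arxiv.org/abs/2506.13771
Authors: Banseok Lee,Dongkyu Kim,Youngcheon You,Youngmin Kim
Affiliations: Samsung Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: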
Abstract:Deploying large language models (LLMs) often faces challenges from substantial memory and computational costs. Quantization offers a solution, yet performance degradation in the sub-1-bit regime remains particularly difficult. This paper introduces LittleBit, a novel method for extreme LLM compression. It targets levels like 0.1 bits per weight (BPW), achieving nearly 31× memory reduction, e.g., Llama2-13B to under 0.9 GB. LittleBit represents weights in a low-rank form using latent matrix factorization, subsequently binarizing these factors. To counteract information loss from this extreme precision, it integrates a multi-scale compensation mechanism. This includes row, column, and an additional latent dimension that learns per-rank importance. Two key contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for stable quantization-aware training (QAT) initialization, and integrated Residual Compensation to mitigate errors. Extensive experiments confirm LittleBit’s superiority in sub-1-bit quantization: e.g., its 0.1 BPW performance on Llama2-7B surpasses the leading method’s 0.7 BPW. This establishes a superior size-performance trade-off, with kernel-level benchmarks indicating potential for a 5× speedup compared to FP16. LittleBit paves the way for deploying powerful LLMs in resource-constrained environments.
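A rough storage calculation shows how binarized low-rank factors can land below one bit per weight. The layout assumed here (two 1-bit factors of rank r plus FP16 scale vectors for rows, columns, and the latent dimension) is an illustrative guess based on the abstract's description, not the paper's exact format, and it ignores the Residual Compensation path. The 13e9 parameter count for Llama2-13B is likewise an approximation.

```python
def bits_per_weight(d_out: int, d_in: int, rank: int, scale_bits: int = 16) -> float:
    """Approximate BPW for one (d_out x d_in) weight matrix.

    Assumed layout: sign factors of shape (d_out x rank) and (rank x d_in)
    at 1 bit per entry, plus row, column, and per-rank scale vectors.
    """
    binary_bits = rank * (d_out + d_in)
    scale_count = d_out + d_in + rank
    return (binary_bits + scale_bits * scale_count) / (d_out * d_in)


def compressed_gb(param_count: float, reduction: float, bytes_per_param: int = 2) -> float:
    """Size in GB after dividing an FP16 model's footprint by a reduction factor."""
    return param_count * bytes_per_param / reduction / 1e9
```

Under these assumptions, a 4096 x 4096 layer with rank 188 comes out just under 0.1 BPW, and an FP16 model of about 13e9 parameters (26 GB) reduced 31-fold is roughly 0.84 GB, consistent with the "under 0.9 GB" figure in the abstract.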
[NLP-82] Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios INTERSPEECH2025
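[Quick Read]: This paper addresses the trade-off between latency and accuracy in streaming and offline automatic speech recognition (ASR), especially recognition in highly overlapping multi-talker scenarios. Its key solutions are: pairing a Continuous Speech Separation (CSS) single-channel front-end with an end-to-end (E2E) system to improve accuracy on overlapped multi-talker speech; using either dual models or a two-pass model based on cascaded encoders to serve streaming and offline scenarios; and exploring segment-based Serialized Output Training (segSOT) to improve the readability of multi-talker transcriptions.
Link: https://arxiv.org/abs/2506.14204
Authors: Aswin Shanmugam Subramanian,Amit Das,Naoyuki Kanda,Jinyu Li,Xiaofei Wang,Yifan Gong
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to Interspeech 2025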
Abstract:We extend the frameworks of Serialized Output Training (SOT) to address practical needs of both streaming and offline automatic speech recognition (ASR) applications. Our approach focuses on balancing latency and accuracy, catering to real-time captioning and summarization requirements. We propose several key improvements: (1) Leveraging Continuous Speech Separation (CSS) single-channel front-end with end-to-end (E2E) systems for highly overlapping scenarios, challenging the conventional wisdom of E2E versus cascaded setups. The CSS framework improves the accuracy of the ASR system by separating overlapped speech from multiple speakers. (2) Implementing dual models – Conformer Transducer for streaming and Sequence-to-Sequence for offline – or alternatively, a two-pass model based on cascaded encoders. (3) Exploring segment-based SOT (segSOT) which is better suited for offline scenarios while also enhancing readability of multi-talker transcriptions.
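Serialized Output Training turns overlapping multi-talker speech into a single target sequence. The sketch below shows the basic serialization step as it is commonly described for the original SOT formulation: transcripts are ordered by start time and joined with a speaker-change token. The token string, class, and function names are assumptions for illustration, and the streaming and segSOT variants discussed in this paper are not modeled.

```python
from dataclasses import dataclass

SPEAKER_CHANGE = "<sc>"  # separator token; the exact symbol is an assumption


@dataclass
class Utterance:
    speaker: str
    start: float  # start time in seconds
    text: str


def serialize(utterances):
    """Join transcripts in order of start time, separated by a speaker-change token."""
    ordered = sorted(utterances, key=lambda u: u.start)
    return f" {SPEAKER_CHANGE} ".join(u.text for u in ordered)
```

For two overlapping utterances starting at 0.0 s ("how are you") and 1.2 s ("fine thanks"), the training target becomes "how are you <sc> fine thanks" regardless of the order in which they are passed in.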
[NLP-83] Multimodal Fusion with Semi-Supervised Learning Minimizes Annotation Quantity for Modeling Videoconference Conversation Experience INTERSPEECH2025
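[Quick Read]: This paper addresses the modeling of subjective negative experience in videoconferences, i.e. non-fluid or unenjoyable moments. These moments are rare in naturalistic data, so training a supervised learning (SL) model requires costly manual annotation. The key to the solution is semi-supervised learning (SSL): labeled and unlabeled clips are combined in co-training over multimodal (audio, facial, text) deep features. With only 8% of the labeled data, the best SSL model reaches 96% of the performance of the SL model trained on all the data, demonstrating an annotation-efficient framework.
Link: https://arxiv.org/abs/2506.13971
Authors: Andrew Chang,Chenkai Hu,Ji Qi,Zhuojian Wei,Kexin Zhang,Viswadruth Akkaraju,David Poeppel,Dustin Freeman
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Interspeech 2025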
Abstract:Group conversations over videoconferencing are a complex social behavior. However, the subjective moments of negative experience, where the conversation loses fluidity or enjoyment, remain understudied. These moments are infrequent in naturalistic data, and thus training a supervised learning (SL) model requires costly manual data annotation. We applied semi-supervised learning (SSL) to leverage targeted labeled and unlabeled clips for training multimodal (audio, facial, text) deep features to predict non-fluid or unenjoyable moments in holdout videoconference sessions. The modality-fused co-training SSL achieved an ROC-AUC of 0.9 and an F1 score of 0.6, outperforming SL models by up to 4% with the same amount of labeled data. Remarkably, the best SSL model with just 8% labeled data matched 96% of the SL model’s full-data performance. This shows an annotation-efficient framework for modeling videoconference experience.
Computer Vision
[CV-0] CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion
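[Quick Read]: This paper addresses two problems in real-world robot learning: hardware limitations degrade data quality, and real-time constraints restrict model inference to instantaneous state and scene observations. Both reduce learning efficacy and lead to failures in object localization, grasp planning, and long-horizon task execution. The key to the solution is Causal Diffusion Policy (CDP), a transformer-based diffusion model that conditions on historical action sequences to make action prediction more coherent and context-aware, improving visuomotor policy learning. To lower the computational cost of autoregressive inference, CDP also adds a caching mechanism that stores attention key-value pairs from previous timesteps, substantially reducing redundant computation during execution.
Link: https://arxiv.org/abs/2506.14769
Authors: Jiahua Ma,Yiran Qin,Yixiong Li,Xuanqi Liao,Yulan Guo,Ruimao Zhang
Affiliations: Sun Yat-sen University; CUHK(SZ)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: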
Abstract:Diffusion Policy (DP) enables robots to learn complex behaviors by imitating expert demonstrations through action diffusion. However, in practical applications, hardware limitations often degrade data quality, while real-time constraints restrict model inference to instantaneous state and scene observations. These limitations seriously reduce the efficacy of learning from expert demonstrations, resulting in failures in object localization, grasp planning, and long-horizon task execution. To address these challenges, we propose Causal Diffusion Policy (CDP), a novel transformer-based diffusion model that enhances action prediction by conditioning on historical action sequences, thereby enabling more coherent and context-aware visuomotor policy learning. To further mitigate the computational cost associated with autoregressive inference, a caching mechanism is also introduced to store attention key-value pairs from previous timesteps, substantially reducing redundant computations during execution. Extensive experiments in both simulated and real-world environments, spanning diverse 2D and 3D manipulation tasks, demonstrate that CDP uniquely leverages historical action sequences to achieve significantly higher accuracy than existing methods. Moreover, even when faced with degraded input observation quality, CDP maintains remarkable precision by reasoning through temporal continuity, which highlights its practical robustness for robotic control under realistic, imperfect conditions.
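The caching idea mentioned in this abstract can be shown with a generic causal-attention sketch. This is not CDP's architecture: the scalar "projections", the unprojected query, and all names below are simplifications chosen so the example stays short, and only the bookkeeping of stored keys and values is the point.

```python
from math import exp


class CountingProjection:
    """Toy projection (multiplication by a scalar) that counts how often it runs."""

    def __init__(self, scale):
        self.scale = scale
        self.calls = 0

    def __call__(self, vector):
        self.calls += 1
        return [self.scale * x for x in vector]


def attend(query, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [exp(s - top) for s in scores]
    norm = sum(weights)
    dims = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm for d in range(dims)]


def run_without_cache(tokens, proj_k, proj_v):
    """Recompute every key and value from scratch at each timestep."""
    outputs = []
    for t in range(len(tokens)):
        keys = [proj_k(x) for x in tokens[: t + 1]]
        values = [proj_v(x) for x in tokens[: t + 1]]
        outputs.append(attend(tokens[t], keys, values))
    return outputs


def run_with_cache(tokens, proj_k, proj_v):
    """Project each token once and reuse stored keys and values afterwards."""
    keys, values, outputs = [], [], []
    for x in tokens:
        keys.append(proj_k(x))
        values.append(proj_v(x))
        outputs.append(attend(x, keys, values))
    return outputs
```

Both functions return the same outputs, but for T timesteps the uncached version performs T(T+1)/2 key projections while the cached version performs T, which is the kind of redundant computation a key-value cache removes during autoregressive execution.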
[CV-1] Scaling-Up the Pretraining of the Earth Observation Foundation Model PhilEO to the MajorTOM Dataset
Quick read: This paper addresses the mismatch between the massive volume of Earth Observation (EO) satellite data and the scarcity of labeled data by pretraining EO Foundation Models (FMs) that adapt efficiently to multiple downstream tasks. The key to the solution is pretraining on large unlabeled datasets so that models can be fine-tuned with minimal labeled data. The study uses the 23TB MajorTOM dataset and the 2TB FastTOM subset, and explores model and data scaling by varying the number of parameters and the architecture (from U-Net to Vision Transformers).
Link: https://arxiv.org/abs/2506.14765
Authors: Nikolaos Dionelis,Jente Bosmans,Riccardo Musto,Giancarlo Paoletti,Simone Sarti,Giacomo Cascarano,Casper Fibaek,Luke Camilleri,Bertrand Le Saux,Nicolas Longépé
Affiliations: European Space Agency; Φ-lab; Leonardo Labs; e-Geos; Trust Stamp; European Commission
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 9 figures, 1 table, 29 references
Abstract:Today, Earth Observation (EO) satellites generate massive volumes of data, with the Copernicus Sentinel-2 constellation alone producing approximately 1.6TB per day. To fully exploit this information, it is essential to pretrain EO Foundation Models (FMs) on large unlabeled datasets, enabling efficient fine-tuning for several different downstream tasks with minimal labeled data. In this work, we present the scaling-up of our recently proposed EO Foundation Model, PhilEO Geo-Aware U-Net, on the unlabeled 23TB dataset MajorTOM, which covers the vast majority of the Earth’s surface, as well as on the specialized subset FastTOM 2TB that does not include oceans and ice. We develop and study various PhilEO model variants with different numbers of parameters and architectures. Finally, we fine-tune the models on the PhilEO Bench for road density estimation, building density pixel-wise regression, and land cover semantic segmentation, and we evaluate the performance. Our results demonstrate that for all n-shots for road density regression, the PhilEO 44M MajorTOM 23TB model outperforms PhilEO Globe 0.5TB 44M. We also show that for most n-shots for road density estimation and building density regression, PhilEO 200M FastTOM outperforms all the other models. The effectiveness of both dataset and model scaling is validated using the PhilEO Bench. We also study the impact of architecture scaling, transitioning from U-Net Convolutional Neural Networks (CNN) to Vision Transformers (ViT).
[CV-2] Cost-Aware Routing for Efficient Text-To-Image Generation
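Quick read: This paper targets the high computational cost that accompanies high-fidelity image generation with generative AI, aiming for an optimal balance between quality and compute. The key to the solution is a framework that varies the amount of computation with the complexity of the input prompt: each prompt is automatically routed to the most appropriate text-to-image generation function, which may correspond to a distinct number of denoising steps or to an independent text-to-image model, lowering overall compute while preserving generation quality.
Link: https://arxiv.org/abs/2506.14753
Authors: Qinchan(Wing)Li,Kenneth Chen,Changyue(Tina)Su,Wittawat Jitkrittum,Qi Sun,Patsorn Sangkloy
Affiliations: New York University; Google Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: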
Abstract:Diffusion models are well known for their ability to generate a high-fidelity image for an input prompt through an iterative denoising process. Unfortunately, the high fidelity also comes at a high computational cost due to the inherently sequential generative process. In this work, we seek to optimally balance quality and computational cost, and propose a framework to allow the amount of computation to vary for each prompt, depending on its complexity. Each prompt is automatically routed to the most appropriate text-to-image generation function, which may correspond to a distinct number of denoising steps of a diffusion model, or a disparate, independent text-to-image model. Unlike uniform cost reduction techniques (e.g., distillation, model quantization), our approach achieves the optimal trade-off by learning to reserve expensive choices (e.g., 100+ denoising steps) only for a few complex prompts, and employ more economical choices (e.g., small distilled model) for less sophisticated prompts. We empirically demonstrate on COCO and DiffusionDB that by learning to route to nine already-trained text-to-image models, our approach is able to deliver an average quality that is higher than that achievable by any of these models alone.
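One simple way to picture per-prompt routing is a rule that picks, for each prompt, the option with the best predicted quality after a cost penalty. The sketch below is an illustration under assumptions, not the paper's router: the `route` function, the cost values, the predicted qualities, and the `trade_off` weight are all hypothetical, and in practice the quality predictions would come from a learned model.

```python
def route(predicted_quality, costs, trade_off):
    """Pick, per prompt, the option maximizing quality - trade_off * cost."""
    choices = []
    for qualities in predicted_quality:
        scores = [q - trade_off * c for q, c in zip(qualities, costs)]
        choices.append(max(range(len(costs)), key=lambda i: scores[i]))
    return choices

# Hypothetical relative costs of three generation options
# (for example a distilled model, a mid-size model, many denoising steps).
costs = [1.0, 4.0, 10.0]

# Hypothetical predicted quality of each option for two prompts.
predicted_quality = [
    [0.80, 0.82, 0.83],  # simple prompt: all options do about equally well
    [0.40, 0.70, 0.90],  # complex prompt: only the expensive option does well
]

cost_aware = route(predicted_quality, costs, trade_off=0.02)
quality_only = route(predicted_quality, costs, trade_off=0.0)
```

With a nonzero `trade_off` the simple prompt goes to the cheapest option while the complex prompt still reaches the expensive one; with `trade_off=0.0` every prompt goes to the most expensive option.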
[CV-3] SyncTalk++: High-Fidelity and Efficient Synchronized Talking Heads Synthesis Using Gaussian Splatting
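Quick read: This paper addresses the difficulty of achieving high synchronization when synthesizing realistic, speech-driven talking head videos, a problem the authors call the "devil" in creating realistic talking heads. The key to the solution is SyncTalk++, which comprises a Dynamic Portrait Renderer with Gaussian Splatting that keeps subject identity consistent; a Face-Sync Controller that aligns lip movements with speech and uses a 3D facial blendshape model to reconstruct facial expressions; and a Head-Sync Stabilizer that optimizes head poses for greater stability. An Expression Generator and a Torso Restorer improve robustness to out-of-distribution audio, which markedly improves synchronization and realism.
Link: https://arxiv.org/abs/2506.14742
Authors: Ziqiao Peng,Wentao Hu,Junyuan Ma,Xiangyu Zhu,Xiaomei Zhang,Hao Zhao,Hui Tian,Jun He,Hongyan Liu,Zhaoxin Fan
Affiliations: Renmin University of China; Beijing University of Posts and Telecommunications; Chinese Academy of Sciences; Tsinghua University; Beihang University; Hangzhou International Innovation Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: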
Abstract:Achieving high synchronization in the synthesis of realistic, speech-driven talking head videos presents a significant challenge. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic results. To address the critical issue of synchronization, identified as the “devil” in creating realistic talking heads, we introduce SyncTalk++, which features a Dynamic Portrait Renderer with Gaussian Splatting to ensure consistent subject identity preservation and a Face-Sync Controller that aligns lip movements with speech while innovatively using a 3D facial blendshape model to reconstruct accurate facial expressions. To ensure natural head movements, we propose a Head-Sync Stabilizer, which optimizes head poses for greater stability. Additionally, SyncTalk++ enhances robustness to out-of-distribution (OOD) audio by incorporating an Expression Generator and a Torso Restorer, which generate speech-matched facial expressions and seamless torso regions. Our approach maintains consistency and continuity in visual details across frames and significantly improves rendering speed and quality, achieving up to 101 frames per second. Extensive experiments and user studies demonstrate that SyncTalk++ outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video: this https URL.
[CV-4] Active InSAR monitoring of building damage in Gaza during the Israel-Hamas War
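Quick read: This paper addresses real-time monitoring of urban damage during an ongoing conflict, specifically the 2023 Israel-Hamas War in the Gaza Strip, where traditional methods struggle to deliver timely and accurate assessments of a dynamically changing area. The key to the solution is interferometric SAR data combined with a long temporal-arc coherent change detection (LT-CCD) approach that tracks weekly damage trends. The method detects 92.5% of damage labels in United Nations reference data with a low 1.2% false positive rate, offering high spatio-temporal fidelity and practical utility.
Link: https://arxiv.org/abs/2506.14730
Authors: Corey Scher,Jamon Van Den Hoek
Affiliations: City University of New York Graduate Center; Oregon State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: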
Abstract:Aerial bombardment of the Gaza Strip beginning October 7, 2023 is one of the most intense bombing campaigns of the twenty-first century, driving widespread urban damage. Characterizing damage over a geographically dynamic and protracted armed conflict requires active monitoring. Synthetic aperture radar (SAR) has precedence for mapping disaster-induced damage with bi-temporal methods but applications to active monitoring during sustained crises are limited. Using interferometric SAR data from Sentinel-1, we apply a long temporal-arc coherent change detection (LT-CCD) approach to track weekly damage trends over the first year of the 2023- Israel-Hamas War. We detect 92.5% of damage labels in reference data from the United Nations with a negligible (1.2%) false positive rate. The temporal fidelity of our approach reveals rapidly increasing damage during the first three months of the war focused in northern Gaza, a notable pause in damage during a temporary ceasefire, and surges of new damage as conflict hot-spots shift from north to south. Three-fifths (191,263) of all buildings are damaged or destroyed by the end of the study. With massive need for timely data on damage in armed conflict zones, our low-cost and low-latency approach enables rapid uptake of damage information at humanitarian and journalistic organizations.
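The two headline figures in the abstract (share of damage labels detected, false positive rate) are standard confusion-matrix quantities. The sketch below shows how such rates are usually computed from binary labels; the abstract does not spell out the exact evaluation protocol, so the definitions, the function name, and the toy labels here are assumptions.

```python
def detection_and_false_positive_rates(predicted, reference):
    """Return (detection rate, false positive rate) for binary damage labels."""
    true_pos = sum(1 for p, r in zip(predicted, reference) if p and r)
    false_neg = sum(1 for p, r in zip(predicted, reference) if not p and r)
    false_pos = sum(1 for p, r in zip(predicted, reference) if p and not r)
    true_neg = sum(1 for p, r in zip(predicted, reference) if not p and not r)
    detection_rate = true_pos / (true_pos + false_neg)
    false_positive_rate = false_pos / (false_pos + true_neg)
    return detection_rate, false_positive_rate

# Toy labels (1 = damaged, 0 = undamaged), made up for illustration.
reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

detection_rate, false_positive_rate = detection_and_false_positive_rates(
    predicted, reference
)
```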
[CV-5] DiFuse-Net: RGB and Dual-Pixel Depth Estimation using Window Bi-directional Parallax Attention and Cross-modal Transfer Learning IROS2025
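Quick read: This paper addresses depth estimation under the cost, power, and robustness limitations of traditional stereo and active depth sensors, asking how the dual-pixel (DP) technology ubiquitous in smartphone cameras can yield more accurate depth prediction. The key to the solution is DiFuse-Net, a modality-decoupled network that uses a window bi-directional parallax attention mechanism (WBiPAM) to capture the subtle DP disparity cues unique to small-aperture smartphone cameras and fuses them with contextual features from the RGB image to improve depth prediction. A Cross-modal Transfer Learning (CmTL) mechanism is also introduced to overcome the difficulty of obtaining a large-scale RGB-DP-D dataset.
Link: https://arxiv.org/abs/2506.14709
Authors: Kunal Swami,Debtanu Gupta,Amrit Kumar Muduli,Chirag Jaiswal,Pankaj Kumar Bajpai
Affiliations: Samsung Research India Bangalore
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted in IROS 2025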
Abstract:Depth estimation is crucial for intelligent systems, enabling applications from autonomous navigation to augmented reality. While traditional stereo and active depth sensors have limitations in cost, power, and robustness, dual-pixel (DP) technology, ubiquitous in modern cameras, offers a compelling alternative. This paper introduces DiFuse-Net, a novel modality decoupled network design for disentangled RGB and DP based depth estimation. DiFuse-Net features a window bi-directional parallax attention mechanism (WBiPAM) specifically designed to capture the subtle DP disparity cues unique to smartphone cameras with small aperture. A separate encoder extracts contextual information from the RGB image, and these features are fused to enhance depth prediction. We also propose a Cross-modal Transfer Learning (CmTL) mechanism to utilize large-scale RGB-D datasets in the literature to cope with the limitations of obtaining large-scale RGB-DP-D dataset. Our evaluation and comparison of the proposed method demonstrates its superiority over the DP and stereo-based baseline methods. Additionally, we contribute a new, high-quality, real-world RGB-DP-D training dataset, named Dual-Camera Dual-Pixel (DCDP) dataset, created using our novel symmetric stereo camera hardware setup, stereo calibration and rectification protocol, and AI stereo disparity estimation method.
[CV-6] Iterative Camera-LiDAR Extrinsic Optimization via Surrogate Diffusion IROS2025
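Quick read: This paper addresses insufficient accuracy in the extrinsic calibration needed to fuse camera and LiDAR data in autonomous driving. Most existing end-to-end calibration methods predict in a single step and lack iterative optimization, which makes high accuracy hard to reach. The key to the solution is a versatile iterative framework based on surrogate diffusion: the initial extrinsic parameters are refined through an iterative denoising process in which the original calibration method acts as a surrogate denoiser that estimates the final extrinsics at each step, improving accuracy, robustness, and stability.
Link: https://arxiv.org/abs/2506.14706
Authors: Ni Ou,Zhuo Chen,Xinru Zhang,Junzheng Wang
Affiliations: Beijing Institute of Technology; King’s College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 4 figures, accepted by IROS 2025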
Abstract:Cameras and LiDAR are essential sensors for autonomous vehicles. The fusion of camera and LiDAR data addresses the limitations of individual sensors but relies on precise extrinsic calibration. Recently, numerous end-to-end calibration methods have been proposed; however, most predict extrinsic parameters in a single step and lack iterative optimization capabilities. To address the increasing demand for higher accuracy, we propose a versatile iterative framework based on surrogate diffusion. This framework can enhance the performance of any calibration method without requiring architectural modifications. Specifically, the initial extrinsic parameters undergo iterative refinement through a denoising process, in which the original calibration method serves as a surrogate denoiser to estimate the final extrinsics at each step. For comparative analysis, we selected four state-of-the-art calibration methods as surrogate denoisers and compared the results of our diffusion process with those of two other iterative approaches. Extensive experiments demonstrate that when integrated with our diffusion model, all calibration methods achieve higher accuracy, improved robustness, and greater stability compared to other iterative techniques and their single-step counterparts.
[CV-7] Towards Desiderata-Driven Design of Visual Counterfactual Explainers
Quick read: This paper argues that existing visual counterfactual explainers (VCEs) focus too narrowly on sample quality or change minimality and neglect more holistic desiderata for an explanation, such as fidelity, understandability, and sufficiency. The key to the solution is to explore new mechanisms for counterfactual generation and combine them into a novel Smooth Counterfactual Explorer (SCE) algorithm that better fulfills these desiderata.
Link: https://arxiv.org/abs/2506.14698
Authors: Sidney Bender,Jan Herrmann,Klaus-Robert Müller,Grégoire Montavon
Affiliations: Technische Universität Berlin; BASF SE; BIFOLD – Berlin Institute for the Foundations of Learning and Data; Korea University; Max-Planck Institute for Informatics; Charité – Universitätsmedizin Berlin
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Visual counterfactual explainers (VCEs) are a straightforward and promising approach to enhancing the transparency of image classifiers. VCEs complement other types of explanations, such as feature attribution, by revealing the specific data transformations to which a machine learning model responds most strongly. In this paper, we argue that existing VCEs focus too narrowly on optimizing sample quality or change minimality; they fail to consider the more holistic desiderata for an explanation, such as fidelity, understandability, and sufficiency. To address this shortcoming, we explore new mechanisms for counterfactual generation and investigate how they can help fulfill these desiderata. We combine these mechanisms into a novel ‘smooth counterfactual explorer’ (SCE) algorithm and demonstrate its effectiveness through systematic evaluations on synthetic and real data.
[CV-8] YOLOv11-RGBT: Towards a Comprehensive Single-Stage Multispectral Object Detection Framework
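Quick read: This paper addresses open problems in multispectral object detection: the lack of a unified single-stage framework, the difficulty of balancing performance and fusion strategy, and unreasonable modality weight allocation. The key to the solution is YOLOv11-RGBT, built on the YOLOv11 framework, which designs six multispectral fusion modes and introduces a P3 mid-fusion strategy and a multispectral controllable fine-tuning (MCF) strategy to optimize feature fusion, reduce redundancy and mismatches, and improve overall model performance.
Link: https://arxiv.org/abs/2506.14696
Authors: Dahang Wan,Rongsheng Lu,Yang Fang,Xianli Lang,Shuangbao Shu,Jingjing Chen,Siyuan Shen,Ting Xu,Zecong Ye
Affiliations: Hefei University of Technology; Engineering University of PAP
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 8 figures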
Abstract:Multispectral object detection, which integrates information from multiple bands, can enhance detection accuracy and environmental adaptability, holding great application potential across various fields. Although existing methods have made progress in cross-modal interaction, low-light conditions, and model lightweight, there are still challenges like the lack of a unified single-stage framework, difficulty in balancing performance and fusion strategy, and unreasonable modality weight allocation. To address these, based on the YOLOv11 framework, we present YOLOv11-RGBT, a new comprehensive multimodal object detection framework. We designed six multispectral fusion modes and successfully applied them to models from YOLOv3 to YOLOv12 and RT-DETR. After reevaluating the importance of the two modalities, we proposed a P3 mid-fusion strategy and multispectral controllable fine-tuning (MCF) strategy for multispectral models. These improvements optimize feature fusion, reduce redundancy and mismatches, and boost overall model performance. Experiments show our framework excels on three major open-source multispectral object detection datasets, like LLVIP and FLIR. Particularly, the multispectral controllable fine-tuning strategy significantly enhanced model adaptability and robustness. On the FLIR dataset, it consistently improved YOLOv11 models’ mAP by 3.41%-5.65%, reaching a maximum of 47.61%, verifying the framework and strategies’ effectiveness. The code is available at: this https URL.
[CV-9] FocalClick-XL: Towards Unified and High-quality Interactive Segmentation
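Quick read: This paper addresses two limitations of interactive segmentation: limited interaction forms and difficulty capturing fine details. The key to the solution is a new pipeline, FocalClick-XL, that decomposes interactive segmentation into meta-tasks at different levels of information (context, object, and detail) and assigns a dedicated subnet to each, so that each subnet can undergo scaled pretraining to maximize its effectiveness. Context- and detail-level information is shared across interaction forms as common knowledge, while a prompting layer at the object level encodes specific interaction types, improving flexibility and adaptability.
Link: https://arxiv.org/abs/2506.14686
Authors: Xi Chen,Hengshuang Zhao
Affiliations: The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: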
Abstract:Interactive segmentation enables users to extract binary masks of target objects through simple interactions such as clicks, scribbles, and boxes. However, existing methods often support only limited interaction forms and struggle to capture fine details. In this paper, we revisit the classical coarse-to-fine design of FocalClick and introduce significant extensions. Inspired by its multi-stage strategy, we propose a novel pipeline, FocalClick-XL, to address these challenges simultaneously. Following the emerging trend of large-scale pretraining, we decompose interactive segmentation into meta-tasks that capture different levels of information – context, object, and detail – assigning a dedicated subnet to each. This decomposition allows each subnet to undergo scaled pretraining with independent data and supervision, maximizing its effectiveness. To enhance flexibility, we share context- and detail-level information across different interaction forms as common knowledge while introducing a prompting layer at the object level to encode specific interaction types. As a result, FocalClick-XL achieves state-of-the-art performance on click-based benchmarks and demonstrates remarkable adaptability to diverse interaction formats, including boxes, scribbles, and coarse masks. Beyond binary mask generation, it is also capable of predicting alpha mattes with fine-grained details, making it a versatile and powerful tool for interactive segmentation.
[CV-10] Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models
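Quick read: This paper addresses key problems in image geo-localization: traditional methods rely on black-box decisions that lack interpretability, and existing approaches based on large vision-language models (LVLMs) are limited in data diversity and reasoning capability. The key to the solution is a new pipeline that constructs a reasoning-oriented geo-localization dataset, MP16-Reason, together with GLOBE (Group-relative policy optimization for Locatability assessment and Optimized visual-clue reasoning, yielding Bi-objective geo-Enhancement), which jointly optimizes locatability assessment, visual-clue reasoning, and geolocation accuracy to strengthen the VLM in both recognition and reasoning.
Link: https://arxiv.org/abs/2506.14674
Authors: Ling Li,Yao Zhou,Yuxuan Liang,Fugee Tsung,Jiaheng Wei
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: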
Abstract:Previous methods for image geo-localization have typically treated the task as either classification or retrieval, often relying on black-box decisions that lack interpretability. The rise of large vision-language models (LVLMs) has enabled a rethinking of geo-localization as a reasoning-driven task grounded in visual cues. However, two major challenges persist. On the data side, existing reasoning-focused datasets are primarily based on street-view imagery, offering limited scene diversity and constrained viewpoints. On the modeling side, current approaches predominantly rely on supervised fine-tuning, which yields only marginal improvements in reasoning capabilities. To address these challenges, we propose a novel pipeline that constructs a reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images. We introduce GLOBE, Group-relative policy optimization for Locatability assessment and Optimized visual-clue reasoning, yielding Bi-objective geo-Enhancement for the VLM in recognition and reasoning. GLOBE incorporates task-specific rewards that jointly enhance locatability assessment, visual clue reasoning, and geolocation accuracy. Both qualitative and quantitative results demonstrate that GLOBE outperforms state-of-the-art open-source LVLMs on geo-localization tasks, particularly in diverse visual scenes, while also generating more insightful and interpretable reasoning trajectories.
[CV-11] DDS-NAS: Dynamic Data Selection within Neural Architecture Search via On-line Hard Example Mining applied to Image Classification
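Quick read: This paper addresses the scalability problem in Neural Architecture Search (NAS). The key to the solution is dynamic hard example mining within a curriculum learning framework: an autoencoder enforces an image similarity embedding in latent space, and an efficient kd-tree built on that embedding orders images by furthest-neighbour dissimilarity. Given a query image from the subsample dataset, the most dissimilar image in the global dataset can then be identified in logarithmic time, which allows an unbiased subsample dataset for NAS optimisation to be re-formulated dynamically.
Link: https://arxiv.org/abs/2506.14667
Authors: Matt Poyser,Toby P. Breckon
Affiliations: Durham University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 single-column pages, 8 figures, to be published in Pattern Recognition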
Abstract:In order to address the scalability challenge within Neural Architecture Search (NAS), we speed up NAS training via dynamic hard example mining within a curriculum learning framework. By utilizing an autoencoder that enforces an image similarity embedding in latent space, we construct an efficient kd-tree structure to order images by furthest neighbour dissimilarity in a low-dimensional embedding. From a given query image from our subsample dataset, we can identify the most dissimilar image within the global dataset in logarithmic time. Via curriculum learning, we then dynamically re-formulate an unbiased subsample dataset for NAS optimisation, upon which the current NAS solution architecture performs poorly. We show that our DDS-NAS framework speeds up gradient-based NAS strategies by up to 27x without loss in performance. By maximising the contribution of each image sample during training, we reduce the duration of a NAS training cycle and the number of iterations required for convergence.
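The furthest-neighbour query at the heart of the data selection step can be illustrated with a brute-force scan. This is deliberately not the kd-tree structure the abstract describes (which is what gives logarithmic-time lookup); the linear scan below only shows what the query returns. The function name and the toy two-dimensional embeddings are assumptions standing in for autoencoder latents.

```python
import math

def furthest_neighbour(query, embeddings):
    """Return the index of the embedding furthest from `query` (Euclidean)."""
    return max(range(len(embeddings)), key=lambda i: math.dist(query, embeddings[i]))

# Toy low-dimensional embeddings, made up for illustration.
global_embeddings = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [-2.0, 0.0]]
query_embedding = [0.5, 0.5]

most_dissimilar = furthest_neighbour(query_embedding, global_embeddings)
```

A linear scan costs time proportional to the dataset size per query, which is the cost the tree-based ordering in the paper is meant to avoid.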
[CV-12] 3DGS-IEval-15K: A Large-scale Image Quality Evaluation Database for 3D Gaussian-Splatting
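Quick read: This paper addresses the storage burden that 3D Gaussian Splatting (3DGS) faces in practice, and in particular the lack of a comprehensive framework for evaluating the perceptual quality of compressed 3DGS representations. The key to the solution is 3DGS-IEval-15K, the first large-scale image quality assessment (IQA) dataset for compressed 3DGS representations. It contains 15,200 images rendered from 10 real-world scenes by 6 representative 3DGS algorithms at 20 selected viewpoints, covers the distortions produced by different compression levels, and includes human perception data collected from 60 viewers in controlled subjective experiments, providing a basis for developing 3DGS-specific IQA metrics and for studying view-dependent quality distribution patterns.
Link: https://arxiv.org/abs/2506.14642
Authors: Yuke Xing,Jiarui Wang,Peizhi Niu,Wenjie Huang,Guangtao Zhai,Yiling Xu
Affiliations: Shanghai Jiao Tong University; University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: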
Abstract:3D Gaussian Splatting (3DGS) has emerged as a promising approach for novel view synthesis, offering real-time rendering with high visual fidelity. However, its substantial storage requirements present significant challenges for practical applications. While recent state-of-the-art (SOTA) 3DGS methods increasingly incorporate dedicated compression modules, there is a lack of a comprehensive framework to evaluate their perceptual impact. Therefore we present 3DGS-IEval-15K, the first large-scale image quality assessment (IQA) dataset specifically designed for compressed 3DGS representations. Our dataset encompasses 15,200 images rendered from 10 real-world scenes through 6 representative 3DGS algorithms at 20 strategically selected viewpoints, with different compression levels leading to various distortion effects. Through controlled subjective experiments, we collect human perception data from 60 viewers. We validate dataset quality through scene diversity and MOS distribution analysis, and establish a comprehensive benchmark with 30 representative IQA metrics covering diverse types. As the largest-scale 3DGS quality assessment dataset to date, our work provides a foundation for developing 3DGS specialized IQA metrics, and offers essential data for investigating view-dependent quality distribution patterns unique to 3DGS. The database is publicly available at this https URL.
[CV-13] Unsupervised Imaging Inverse Problems with Diffusion Distribution Matching
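Quick read: This paper addresses image restoration in realistic settings where neither paired degraded and ground-truth images nor a known forward model are available, settings that traditional methods handle poorly. The key to the solution is to model the distribution of degraded observations with conditional flow matching while simultaneously learning the forward model through a distribution-matching loss, enabling effective restoration under minimal assumptions from only small, unpaired datasets.
Link: https://arxiv.org/abs/2506.14605
Authors: Giacomo Meanti,Thomas Ryckeboer,Michael Arbel,Julien Mairal
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Code available at this https URL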
Abstract:This work addresses image restoration tasks through the lens of inverse problems using unpaired datasets. In contrast to traditional approaches – which typically assume full knowledge of the forward model or access to paired degraded and ground-truth images – the proposed method operates under minimal assumptions and relies only on small, unpaired datasets. This makes it particularly well-suited for real-world scenarios, where the forward model is often unknown or misspecified, and collecting paired data is costly or infeasible. The method leverages conditional flow matching to model the distribution of degraded observations, while simultaneously learning the forward model via a distribution-matching loss that arises naturally from the framework. Empirically, it outperforms both single-image blind and unsupervised approaches on deblurring and non-uniform point spread function (PSF) calibration tasks. It also matches state-of-the-art performance on blind super-resolution. We also showcase the effectiveness of our method with a proof of concept for lens calibration: a real-world application traditionally requiring time-consuming experiments and specialized equipment. In contrast, our approach achieves this with minimal data acquisition effort.
[CV-14] Align Your Flow: Scaling Continuous-Time Flow Map Distillation
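Quick read: This paper addresses the many sampling steps required by diffusion- and flow-based generative models, and the fact that consistency models, which distill them into few-step generators, inevitably degrade as the number of steps increases. The key to the solution is two new continuous-time objectives for training flow maps, which connect any two noise levels in a single step and remain effective across all step counts. The paper also uses autoguidance and adversarial finetuning to improve generation quality with minimal loss in sample diversity.
Link: https://arxiv.org/abs/2506.14603
Authors: Amirmojtaba Sabour,Sanja Fidler,Karsten Kreis
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL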
Abstract:Diffusion- and flow-based models have emerged as state-of-the-art generative modeling approaches, but they require many sampling steps. Consistency models can distill these models into efficient one-step generators; however, unlike flow- and diffusion-based methods, their performance inevitably degrades when increasing the number of steps, which we show both analytically and empirically. Flow maps generalize these approaches by connecting any two noise levels in a single step and remain effective across all step counts. In this paper, we introduce two new continuous-time objectives for training flow maps, along with additional novel training techniques, generalizing existing consistency and flow matching objectives. We further demonstrate that autoguidance can improve performance, using a low-quality model for guidance during distillation, and an additional boost can be achieved by adversarial finetuning, with minimal loss in sample diversity. We extensively validate our flow map models, called Align Your Flow, on challenging image generation benchmarks and achieve state-of-the-art few-step generation performance on both ImageNet 64x64 and 512x512, using small and efficient neural networks. Finally, we show text-to-image flow map models that outperform all existing non-adversarially trained few-step samplers in text-conditioned synthesis.
[CV-15] PoseGRAF: Geometric-Reinforced Adaptive Fusion for Monocular 3D Human Pose Estimation
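Quick read: This paper addresses the tendency of existing monocular 3D pose estimation methods to rely on joint positional features while overlooking the intrinsic directional and angular correlations within the skeleton, which produces implausible poses under joint occlusion or rapid motion changes. The key to the solution is the PoseGRAF framework, which builds a dual graph convolutional structure that processes joint and bone graphs separately, introduces a Cross-Attention module to model interdependencies between bone directions and joint features, designs a dynamic fusion module that adaptively integrates the two feature types using the relational dependencies between joints and bones, and incorporates an improved Transformer encoder to generate the final output.
Link: https://arxiv.org/abs/2506.14596
Authors: Ming Xu,Xu Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: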
Abstract:Existing monocular 3D pose estimation methods primarily rely on joint positional features, while overlooking intrinsic directional and angular correlations within the skeleton. As a result, they often produce implausible poses under joint occlusions or rapid motion changes. To address these challenges, we propose the PoseGRAF framework. We first construct a dual graph convolutional structure that separately processes joint and bone graphs, effectively capturing their local dependencies. A Cross-Attention module is then introduced to model interdependencies between bone directions and joint features. Building upon this, a dynamic fusion module is designed to adaptively integrate both feature types by leveraging the relational dependencies between joints and bones. An improved Transformer encoder is further incorporated in a residual manner to generate the final output. Experimental results on the Human3.6M and MPI-INF-3DHP datasets show that our method exceeds state-of-the-art approaches. Additional evaluations on in-the-wild videos further validate its generalizability. The code is publicly available at this https URL.
[CV-16] Synthetic Data Augmentation for Table Detection: Re-evaluating TableNet's Performance with Automatically Generated Document Images
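Quick read: This paper addresses automatic extraction of tables from document pages captured by smartphones or scanners, where manual extraction is slow and error-prone. The key to the solution is an automated LaTeX-based pipeline that synthesizes two-column pages with visually diverse table layouts and aligned ground-truth masks. The resulting synthetic corpus augments the real-world Marmot benchmark and enables a systematic resolution study of TableNet.
Link: https://arxiv.org/abs/2506.14583
Authors: Krishna Sahukara,Zineddine Bettouche,Andreas Fischer
Affiliations: Deggendorf Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: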
Abstract:Document pages captured by smartphones or scanners often contain tables, yet manual extraction is slow and error-prone. We introduce an automated LaTeX-based pipeline that synthesizes realistic two-column pages with visually diverse table layouts and aligned ground-truth masks. The generated corpus augments the real-world Marmot benchmark and enables a systematic resolution study of TableNet. Training TableNet on our synthetic data achieves a pixel-wise XOR error of 4.04% on our synthetic test set with a 256x256 input resolution, and 4.33% with 1024x1024. The best performance on the Marmot benchmark is 9.18% (at 256x256), while cutting manual annotation effort through automation.
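The abstract reports results as a pixel-wise XOR error without defining it. A plausible reading, assumed here, is the percentage of pixels where the predicted table mask and the ground-truth mask disagree. The sketch below computes that quantity on toy masks; the function name and the masks are hypothetical.

```python
def xor_error_percent(predicted_mask, truth_mask):
    """Percentage of pixels where two binary masks disagree (assumed definition)."""
    mismatched = 0
    total = 0
    for predicted_row, truth_row in zip(predicted_mask, truth_mask):
        for predicted_pixel, truth_pixel in zip(predicted_row, truth_row):
            # XOR is true exactly when one mask marks the pixel and the other does not.
            mismatched += bool(predicted_pixel) ^ bool(truth_pixel)
            total += 1
    return 100.0 * mismatched / total

# Toy 2x4 masks (1 = table pixel), made up for illustration.
predicted_mask = [[0, 1, 1, 0],
                  [0, 1, 1, 0]]
truth_mask = [[0, 1, 1, 0],
              [0, 0, 1, 1]]

error = xor_error_percent(predicted_mask, truth_mask)
```

Two of the eight pixels differ in this toy case, so lower values indicate closer agreement between the masks.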
[CV-17] Busting the Paper Ballot: Voting Meets Adversarial Machine Learning CCS2025
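Quick read: This paper examines the security risk of using machine learning classifiers in United States election tabulators, focusing on the vulnerability of the task of recognizing marks on ballots. The key to the solution is to show why traditional white-box attacks are ineffective in the voting domain, to overcome the underlying gradient masking with an improved loss function (a modified difference of logits ratio loss), and thereby to generate effective adversarial examples whose attacks are validated in the physical world.
Link: https://arxiv.org/abs/2506.14582
Authors: Kaleel Mahmood,Caleb Manicke,Ethan Rathbun,Aayushi Verma,Sohaib Ahmad,Nicholas Stamatakis,Laurent Michel,Benjamin Fuller
Affiliations: University of Rhode Island; University of Connecticut; Northeastern University; Stony Brook University; Synchrony Chair in Cybersecurity
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 Pages. Author version of article to appear at CCS 2025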
Abstract:We show the security risk associated with using machine learning classifiers in United States election tabulators. The central classification task in election tabulation is deciding whether a mark does or does not appear on a bubble associated to an alternative in a contest on the ballot. Barretto et al. (E-Vote-ID 2021) reported that convolutional neural networks are a viable option in this field, as they outperform simple feature-based classifiers. Our contributions to election security can be divided into four parts. To demonstrate and analyze the hypothetical vulnerability of machine learning models on election tabulators, we first introduce four new ballot datasets. Second, we train and test a variety of different models on our new datasets. These models include support vector machines, convolutional neural networks (a basic CNN, VGG and ResNet), and vision transformers (Twins and CaiT). Third, using our new datasets and trained models, we demonstrate that traditional white box attacks are ineffective in the voting domain due to gradient masking. Our analyses further reveal that gradient masking is a product of numerical instability. We use a modified difference of logits ratio loss to overcome this issue (Croce and Hein, ICML 2020). Fourth, in the physical world, we conduct attacks with the adversarial examples generated using our new methods. In traditional adversarial machine learning, a high (50% or greater) attack success rate is ideal. However, for certain elections, even a 5% attack success rate can flip the outcome of a race. We show such an impact is possible in the physical domain. We thoroughly discuss attack realism, and the challenges and practicality associated with printing and scanning ballot adversarial examples.
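The claim that a 5% attack success rate can flip a race is easy to check with arithmetic. The sketch below is not from the paper: the function name `tally_after_attack`, the vote totals, and the simplifying assumption that every ballot for the leading candidate is targeted are all hypothetical, chosen only to illustrate why a low success rate matters in a close contest.

```python
def tally_after_attack(votes_leader: int, votes_runner_up: int,
                       success_pct: int, mode: str = "void") -> tuple[int, int]:
    """Return (leader, runner_up) totals after a hypothetical misread attack.

    success_pct: percentage of the leader's ballots whose mark is misread.
    mode="void": misread ballots count for nobody in this contest.
    mode="flip": misread ballots are counted for the runner-up instead.
    """
    if not 0 <= success_pct <= 100:
        raise ValueError("success_pct must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding on the ballot count.
    misread = votes_leader * success_pct // 100
    if mode == "void":
        return votes_leader - misread, votes_runner_up
    if mode == "flip":
        return votes_leader - misread, votes_runner_up + misread
    raise ValueError("mode must be 'void' or 'flip'")


# Hypothetical contest: 51% vs 49% of 10,000 ballots, 5% success rate.
leader, runner_up = tally_after_attack(5100, 4900, success_pct=5)
print(leader, runner_up, "flipped" if runner_up > leader else "unchanged")
```

Under these assumed numbers, a two-point margin does not survive a 5% misread rate even when misread ballots are merely voided rather than reassigned.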
[CV-18] Risk Estimation of Knee Osteoarthritis Progression via Predictive Multi-task Modelling from Efficient Diffusion Model using X-ray Images
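[Quick read]: This paper addresses the limited interpretability of machine learning methods for assessing knee osteoarthritis (OA) progression risk, as well as the limitations of existing methods in generating future images and localizing anatomical knee landmarks. The key to its solution is a new interpretable machine learning method that uses multi-task predictive modelling to jointly classify future knee OA severity and predict anatomical knee landmarks, using a diffusion model in a class-conditioned latent space to efficiently generate high-quality future images and thereby provide a visual forecast of disease progression.
Link: https://arxiv.org/abs/2506.14560
Authors: David Butler,Adrian Hilton,Gustavo Carneiro
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: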
Abstract:Medical imaging plays a crucial role in assessing knee osteoarthritis (OA) risk by enabling early detection and disease monitoring. Recent machine learning methods have improved risk estimation (i.e., predicting the likelihood of disease progression) and predictive modelling (i.e., the forecasting of future outcomes based on current data) using medical images, but clinical adoption remains limited due to their lack of interpretability. Existing approaches that generate future images for risk estimation are complex and impractical. Additionally, previous methods fail to localize anatomical knee landmarks, limiting interpretability. We address these gaps with a new interpretable machine learning method to estimate the risk of knee OA progression via multi-task predictive modelling that classifies future knee OA severity and predicts anatomical knee landmarks from efficiently generated high-quality future images. Such image generation is achieved by leveraging a diffusion model in a class-conditioned latent space to forecast disease progression, offering a visual representation of how particular health conditions may evolve. Applied to the Osteoarthritis Initiative dataset, our approach improves the state-of-the-art (SOTA) by 2%, achieving an AUC of 0.71 in predicting knee OA progression while offering ~9% faster inference time.
[CV-19] DreamLight: Towards Harmonious and Consistent Image Relighting
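[Quick read]: This paper addresses universal image relighting: seamlessly compositing a subject into a new background while keeping lighting and color tone aesthetically consistent. Existing methods focus mainly on image-based relighting (backgrounds given as natural images), with little exploration of text-based relighting, and most struggle to produce realistic light interaction between foreground and background. The key to the solution is to reorganize the input data into a unified format and leverage the semantic prior of a pretrained diffusion model to encourage natural results. It also proposes a Position-Guided Light Adapter (PGLA), which condenses light information from different directions in the background into designed light query embeddings and modulates the foreground with direction-biased masked attention, and a post-processing module, the Spectral Foreground Fixer (SFF), which adaptively reorganizes the frequency components of the subject and the relit background to improve the consistency of foreground appearance.
Link: https://arxiv.org/abs/2506.14549
Authors: Yong Liu,Wenpeng Xiao,Qianqian Wang,Junlin Chen,Shiyin Wang,Yitong Wang,Xinglong Wu,Yansong Tang
Institutions: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院,清华大学); ByteDance Inc. (字节跳动公司)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: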
Abstract:We introduce a model named DreamLight for universal image relighting in this work, which can seamlessly composite subjects into a new background while maintaining aesthetic uniformity in terms of lighting and color tone. The background can be specified by natural images (image-based relighting) or generated from unlimited text prompts (text-based relighting). Existing studies primarily focus on image-based relighting, with scant exploration of text-based scenarios. Some works employ intricate disentanglement pipeline designs relying on environment maps to provide relevant information, which suffer from the expensive data cost required for intrinsic decomposition and light sources. Other methods treat this task as an image translation problem and perform pixel-level transformation with an autoencoder architecture. While these methods have achieved decent harmonization effects, they struggle to generate realistic and natural light interaction effects between the foreground and background. To alleviate these challenges, we reorganize the input data into a unified format and leverage the semantic prior provided by the pretrained diffusion model to facilitate the generation of natural results. Moreover, we propose a Position-Guided Light Adapter (PGLA) that condenses light information from different directions in the background into designed light query embeddings, and modulates the foreground with direction-biased masked attention. In addition, we present a post-processing module named Spectral Foreground Fixer (SFF) to adaptively reorganize different frequency components of the subject and the relighted background, which helps enhance the consistency of foreground appearance. Extensive comparisons and a user study demonstrate that our DreamLight achieves remarkable relighting performance.
[CV-20] Exploring Diffusion with Test-Time Training on Efficient Image Restoration
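[Quick read]: This paper addresses ineffective feature fusion, computational bottlenecks, and inefficient diffusion processes in image restoration. The key to its solution is the DiffRWKVIR framework, which combines Test-Time Training (TTT) with efficient diffusion and introduces three core innovations: Omni-Scale 2D State Evolution, for feature extraction with global contextual awareness at linear complexity; Chunk-Optimized Flash Processing, which improves parallel efficiency through contiguous chunk processing; and Prior-Guided Efficient Diffusion, which extracts a compact image prior representation in a small number of steps, substantially speeding up training and inference.
Link: https://arxiv.org/abs/2506.14541
Authors: Rongchang Lu,Tianduo Luo,Yunzhi Zhang,Conghan Yue,Pei Yang,Guibao Liu,Changyang Gu
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to The 8th Chinese Conference on Pattern Recognition and Computer Vision (2025). Contact to nomodeset@qq.com. Source code will open in 4 months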
Abstract:Image restoration faces challenges including ineffective feature fusion, computational bottlenecks and inefficient diffusion processes. To address these, we propose DiffRWKVIR, a novel framework unifying Test-Time Training (TTT) with efficient diffusion. Our approach introduces three key innovations: (1) Omni-Scale 2D State Evolution extends RWKV’s location-dependent parameterization to hierarchical multi-directional 2D scanning, enabling global contextual awareness with linear complexity O(L); (2) Chunk-Optimized Flash Processing accelerates intra-chunk parallelism by 3.2x via contiguous chunk processing (O(LCd) complexity), reducing sequential dependencies and computational overhead; (3) Prior-Guided Efficient Diffusion extracts a compact Image Prior Representation (IPR) in only 5-20 steps, proving 45% faster training/inference than DiffIR while solving computational inefficiency in denoising. Evaluated across super-resolution and inpainting benchmarks (Set5, Set14, BSD100, Urban100, Places365), DiffRWKVIR outperforms SwinIR, HAT, and MambaIR/v2 in PSNR, SSIM, LPIPS, and efficiency metrics. Our method establishes a new paradigm for adaptive, high-efficiency image restoration with optimized hardware utilization.
[CV-21] VisLanding: Monocular 3D Perception for UAV Safe Landing via Depth-Normal Synergy IROS2025
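[Quick read]: This paper addresses the safe landing of autonomous unmanned aerial vehicles (UAVs) in complex and unknown environments. The key to its solution is to use the depth-normal synergy prediction capability of the Metric3D V2 model to build an end-to-end safe landing zone (SLZ) estimation framework, and to add a safe-zone segmentation branch that recasts landing zone estimation as a binary semantic segmentation problem, improving the accuracy of safe-zone identification.
Link: https://arxiv.org/abs/2506.14525
Authors: Zhuoyue Tan,Boyong He,Yuxiang Ji,Liaoni Wu
Institutions: Xiamen University (厦门大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by IROS2025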
Abstract:This paper presents VisLanding, a monocular 3D perception-based framework for safe UAV (Unmanned Aerial Vehicle) landing. Addressing the core challenge of autonomous UAV landing in complex and unknown environments, this study innovatively leverages the depth-normal synergy prediction capabilities of the Metric3D V2 model to construct an end-to-end safe landing zones (SLZ) estimation framework. By introducing a safe zone segmentation branch, we transform the landing zone estimation task into a binary semantic segmentation problem. The model is fine-tuned and annotated using the WildUAV dataset from a UAV perspective, while a cross-domain evaluation dataset is constructed to validate the model’s robustness. Experimental results demonstrate that VisLanding significantly enhances the accuracy of safe zone identification through a depth-normal joint optimization mechanism, while retaining the zero-shot generalization advantages of Metric3D V2. The proposed method exhibits superior generalization and robustness in cross-domain testing compared to other approaches. Furthermore, it enables the estimation of landing zone area by integrating predicted depth and normal information, providing critical decision-making support for practical applications.
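[CV-22] Train Once Forget Precisely: Anchored Optimization for Efficient Post-Hoc Unlearning ICML
[Quick read]: Against a backdrop of tightening privacy regulation, this paper addresses how to selectively remove specific information from deep image classification models without full retraining. The key to its solution is the Forget-Aligned Model Reconstruction (FAMR) framework, which models forgetting as a constrained optimization problem: it minimizes a uniform-prediction loss on the forget set while an ℓ₂ regularizer anchors the model parameters near their original values, yielding efficient and theoretically grounded post-hoc forgetting.
Link: https://arxiv.org/abs/2506.14515
Authors: Prabhav Sanga,Jaskaran Singh,Arun K. Dubey
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICML MUGen’25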
Abstract:As machine learning systems increasingly rely on data subject to privacy regulation, selectively unlearning specific information from trained models has become essential. In image classification, this involves removing the influence of particular training samples, semantic classes, or visual styles without full retraining. We introduce Forget-Aligned Model Reconstruction (FAMR), a theoretically grounded and computationally efficient framework for post-hoc unlearning in deep image classifiers. FAMR frames forgetting as a constrained optimization problem that minimizes a uniform-prediction loss on the forget set while anchoring model parameters to their original values via an ℓ₂ penalty. A theoretical analysis links FAMR’s solution to influence-function-based retraining approximations, with bounds on parameter and output deviation. Empirical results on class forgetting tasks using CIFAR-10 and ImageNet-100 demonstrate FAMR’s effectiveness, with strong performance retention and minimal computational overhead. The framework generalizes naturally to concept and style erasure, offering a scalable and certifiable route to efficient post-hoc forgetting in vision models.
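The two terms named in the abstract (a uniform-prediction loss on the forget set plus an ℓ₂ anchor to the original parameters) can be written down in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact loss, so KL divergence from the uniform distribution is assumed here, and the function names and the weight `lam` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uniform_kl(logits):
    """KL(uniform || softmax(logits)); zero exactly when the prediction is uniform."""
    k = len(logits)
    u = 1.0 / k
    return sum(u * math.log(u / p) for p in softmax(logits))


def famr_style_objective(forget_logits, theta, theta_orig, lam=0.1):
    """Assumed objective: mean uniform-prediction loss + (lam / 2) * ||theta - theta_orig||^2."""
    forget_term = sum(uniform_kl(z) for z in forget_logits) / len(forget_logits)
    anchor_term = 0.5 * lam * sum((t - t0) ** 2 for t, t0 in zip(theta, theta_orig))
    return forget_term + anchor_term


# A confident prediction on a forget-set sample is penalized; drifting parameters are too.
print(famr_style_objective([[10.0, 0.0, 0.0]], [1.0, 2.0], [1.0, 2.0]))
print(famr_style_objective([[0.0, 0.0, 0.0]], [2.0, 2.0], [1.0, 2.0]))
```

The first term pushes forget-set predictions toward uniform, while the second keeps the parameters, and therefore behavior on retained data, close to the original model; `lam` trades one against the other.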
[CV-23] GAMORA: A Gesture Articulated Meta Operative Robotic Arm for Hazardous Material Handling in Containment-Level Environments
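[Quick read]: This paper addresses how to protect operators while preserving precision and efficiency in high-risk laboratory settings, particularly virology labs. The key to its solution is GAMORA, a new virtual reality (VR)-guided robotic system that integrates the Oculus Quest 2, NVIDIA Jetson Nano, and Robot Operating System (ROS) to provide gesture-driven remote operation, real-time immersive control, digital twin simulation, and inverse-kinematics-based positioning of the robotic arm, reducing direct human exposure and improving operating precision.
Link: https://arxiv.org/abs/2506.14513
Authors: Farha Abdul Wasay,Mohammed Abdul Rahman,Hania Ghouse
Institutions: Muffakham Jah College of Engineering and Technology
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: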
Abstract:The convergence of robotics and virtual reality (VR) has enabled safer and more efficient workflows in high-risk laboratory settings, particularly virology labs. As biohazard complexity increases, minimizing direct human exposure while maintaining precision becomes essential. We propose GAMORA (Gesture Articulated Meta Operative Robotic Arm), a novel VR-guided robotic system that enables remote execution of hazardous tasks using natural hand gestures. Unlike existing scripted automation or traditional teleoperation, GAMORA integrates the Oculus Quest 2, NVIDIA Jetson Nano, and Robot Operating System (ROS) to provide real-time immersive control, digital twin simulation, and inverse kinematics-based articulation. The system supports VR-based training and simulation while executing precision tasks in physical environments via a 3D-printed robotic arm. Inverse kinematics ensure accurate manipulation for delicate operations such as specimen handling and pipetting. The pipeline includes Unity-based 3D environment construction, real-time motion planning, and hardware-in-the-loop testing. GAMORA achieved a mean positional discrepancy of 2.2 mm (improved from 4 mm), pipetting accuracy within 0.2 mL, and repeatability of 1.2 mm across 50 trials. Integrated object detection via YOLOv8 enhances spatial awareness, while energy-efficient operation (50% reduced power output) ensures sustainable deployment. The system’s digital-physical feedback loop enables safe, precise, and repeatable automation of high-risk lab tasks. GAMORA offers a scalable, immersive solution for robotic control and biosafety in biomedical research environments.
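The abstract relies on inverse kinematics to place the arm but gives no geometry. As a generic illustration only, here is the textbook closed-form solution for a two-link planar arm; the link lengths, the target, and the function names are hypothetical and say nothing about GAMORA's actual arm or solver.

```python
import math


def two_link_ik(x, y, l1, l2):
    """Return (shoulder, elbow) joint angles in radians that reach (x, y).

    Picks the solution with a non-negative elbow angle.
    """
    cos_elbow = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(cos_elbow) > 1.0 + 1e-9:
        raise ValueError("target is outside the reachable workspace")
    cos_elbow = max(-1.0, min(1.0, cos_elbow))  # clamp tiny rounding overshoot
    elbow = math.acos(cos_elbow)
    shoulder = math.atan2(y, x) - math.atan2(l2 * math.sin(elbow),
                                             l1 + l2 * math.cos(elbow))
    return shoulder, elbow


def forward_kinematics(shoulder, elbow, l1, l2):
    """End-effector position for the given joint angles (used to check the IK)."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y


# Hypothetical arm with two 10 cm links, target given in metres.
angles = two_link_ik(0.12, 0.08, 0.10, 0.10)
print(forward_kinematics(*angles, 0.10, 0.10))
```

Feeding the solved angles back through forward kinematics is a cheap sanity check; positional discrepancy figures like those in the abstract are measured on hardware, where calibration and backlash add error beyond this ideal model.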
[CV-24] SIRI-Bench: Challenging VLMs Spatial Intelligence through Complex Reasoning Tasks
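[Quick read]: This paper addresses the insufficient evaluation of spatial reasoning in Vision-Language Models (VLMs); in particular, high-level reasoning in complex spatial contexts has not been studied systematically. The key to its solution is SIRI-Bench, a benchmark that evaluates VLMs' spatial intelligence through video-based reasoning tasks. It comprises nearly 1,000 video-question-answer triplets, each problem embedded in a realistic 3D scene, with questions and scenes designed so that solving them requires both spatial comprehension and high-level reasoning. To support large-scale data generation, the authors also develop an Automatic Scene Creation Engine that uses multiple specialized LLM agents to generate 3D scenes faithful to abstract math problems.
Link: https://arxiv.org/abs/2506.14512
Authors: Zijian Song,Xiaoxin Lin,Qiuming Huang,Guangrun Wang,Liang Lin
Institutions: Sun Yat-sen University (中山大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 9 figures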
Abstract:Large Language Models (LLMs) are experiencing rapid advancements in complex reasoning, exhibiting remarkable generalization in mathematics and programming. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic evaluation of their complex reasoning ability within spatial contexts remains underexplored. To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs’ spatial intelligence through video-based reasoning tasks. SIRI-Bench comprises nearly 1K video-question-answer triplets, where each problem is embedded in a realistic 3D scene and captured by video. By carefully designing questions and corresponding 3D scenes, our benchmark ensures that solving the questions requires both spatial comprehension for extracting information and high-level reasoning for deriving solutions, making it a challenging benchmark for evaluating VLMs. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine. This engine, leveraging multiple specialized LLM agents, can generate realistic 3D scenes from abstract math problems, ensuring faithfulness to the original descriptions. Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of spatial reasoning. We hope that our study will bring researchers’ attention to spatially grounded reasoning and advance VLMs in visual problem-solving.
[CV-25] MOL: Joint Estimation of Micro-Expression Optical Flow and Landmark via Transformer-Graph-Style Convolution
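[Quick read]: This paper addresses facial micro-expression recognition (MER), which is challenging because micro-expression actions are transient and subtle. Existing methods typically rely on hand-crafted features, key frames (such as onset, apex, and offset frames), or deep networks limited by small-scale, low-diversity datasets. The proposed solution is an end-to-end micro-action-aware deep learning framework whose key component is the F5C block, composed of fully-connected convolution and channel correspondence convolution, which extracts local-global features directly from a sequence of raw frames without prior knowledge of key frames. In addition, MER, optical flow estimation, and facial landmark detection are trained jointly on the shared local-global features, which helps capture subtle facial actions and mitigates the shortage of training data.
Link: https://arxiv.org/abs/2506.14511
Authors: Zhiwen Shao,Yifan Cheng,Feiran Li,Yong Zhou,Xuequan Lu,Yuan Xie,Lizhuang Ma
Institutions: China University of Mining and Technology (中国矿业大学); Mine Digitization Engineering Research Center of the Ministry of Education (教育部矿山数字化工程研究中心); The Hong Kong University of Science and Technology (香港科技大学); Shanghai Jiao Tong University (上海交通大学); The University of Western Australia (西澳大学); East China Normal University (华东师范大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence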
Abstract:Facial micro-expression recognition (MER) is a challenging problem, due to transient and subtle micro-expression (ME) actions. Most existing methods depend on hand-crafted features, key frames like onset, apex, and offset frames, or deep networks limited by small-scale and low-diversity datasets. In this paper, we propose an end-to-end micro-action-aware deep learning framework with advantages from transformer, graph convolution, and vanilla convolution. In particular, we propose a novel F5C block composed of fully-connected convolution and channel correspondence convolution to directly extract local-global features from a sequence of raw frames, without the prior knowledge of key frames. The transformer-style fully-connected convolution is proposed to extract local features while maintaining global receptive fields, and the graph-style channel correspondence convolution is introduced to model the correlations among feature patterns. Moreover, MER, optical flow estimation, and facial landmark detection are jointly trained by sharing the local-global features. The two latter tasks contribute to capturing facial subtle action information for MER, which can alleviate the impact of insufficient training data. Extensive experiments demonstrate that our framework (i) outperforms the state-of-the-art MER methods on CASME II, SAMM, and SMIC benchmarks, (ii) works well for optical flow estimation and facial landmark detection, and (iii) can capture facial subtle muscle actions in local regions associated with MEs. The code is available at this https URL.
[CV-26] I Speak and You Find: Robust 3D Visual Grounding with Noisy and Ambiguous Speech Inputs
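[Quick read]: This paper addresses the reliance of existing 3D visual grounding (3DVG) methods on precise text prompts, which limits their performance on noisy and ambiguous speech-to-text input. The key to its solution is the SpeechRefer framework, which improves robustness through two core innovations: a Speech Complementary Module, which captures acoustic similarities between phonetically related words and generates complementary proposal scores, reducing dependence on potentially erroneous transcriptions; and a Contrastive Complementary Module, which uses contrastive learning to align erroneous text features with the corresponding speech features so that performance stays stable even when transcription errors dominate.
Link: https://arxiv.org/abs/2506.14495
Authors: Yu Qi,Lipeng Gu,Honghua Chen,Liangliang Nan,Mingqiang Wei
Institutions: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Delft University of Technology (代尔夫特理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: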
Abstract:Existing 3D visual grounding methods rely on precise text prompts to locate objects within 3D scenes. Speech, as a natural and intuitive modality, offers a promising alternative. Real-world speech inputs, however, often suffer from transcription errors due to accents, background noise, and varying speech rates, limiting the applicability of existing 3DVG methods. To address these challenges, we propose SpeechRefer, a novel 3DVG framework designed to enhance performance in the presence of noisy and ambiguous speech-to-text transcriptions. SpeechRefer integrates seamlessly with existing 3DVG models and introduces two key innovations. First, the Speech Complementary Module captures acoustic similarities between phonetically related words and highlights subtle distinctions, generating complementary proposal scores from the speech signal. This reduces dependence on potentially erroneous transcriptions. Second, the Contrastive Complementary Module employs contrastive learning to align erroneous text features with corresponding speech features, ensuring robust performance even when transcription errors dominate. Extensive experiments on the SpeechRefer and SpeechNr3D datasets demonstrate that SpeechRefer improves the performance of existing 3DVG methods by a large margin, which highlights SpeechRefer’s potential to bridge the gap between noisy speech inputs and reliable 3DVG, enabling more intuitive and practical multimodal systems.
[CV-27] Foundation Model Insights and a Multi-Model Approach for Superior Fine-Grained One-shot Subset Selection ICML2025
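[Quick read]: This paper addresses the high cost of deep learning training through one-shot subset selection, which picks an informative subset of a dataset. Traditional information extractors (IEs) are typically pre-trained on the target dataset and are therefore dataset-dependent, whereas foundation models (FMs) offer a potential alternative. The core of the solution is RAM-APL (RAnking Mean-Accuracy of Pseudo-class Labels), a method that exploits the complementary strengths of multiple foundation models to improve subset selection on fine-grained image datasets. Experiments show state-of-the-art results on several fine-grained datasets.
Link: https://arxiv.org/abs/2506.14473
Authors: Zhijing Wan,Zhixiang Wang,Zheng Wang,Xin Xu,Shin’ichi Satoh
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, 10 figures, accepted by ICML 2025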
Abstract:One-shot subset selection serves as an effective tool to reduce deep learning training costs by identifying an informative data subset based on the information extracted by an information extractor (IE). Traditional IEs, typically pre-trained on the target dataset, are inherently dataset-dependent. Foundation models (FMs) offer a promising alternative, potentially mitigating this limitation. This work investigates two key questions: (1) Can FM-based subset selection outperform traditional IE-based methods across diverse datasets? (2) Do all FMs perform equally well as IEs for subset selection? Extensive experiments uncovered surprising insights: FMs consistently outperform traditional IEs on fine-grained datasets, whereas their advantage diminishes on coarse-grained datasets with noisy labels. Motivated by these findings, we propose RAM-APL (RAnking Mean-Accuracy of Pseudo-class Labels), a method tailored for fine-grained image datasets. RAM-APL leverages multiple FMs to enhance subset selection by exploiting their complementary strengths. Our approach achieves state-of-the-art performance on fine-grained datasets, including Oxford-IIIT Pet, Food-101, and Caltech-UCSD Birds-200-2011.
[CV-28] Dense360: Dense Understanding from Omnidirectional Panoramas
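[Quick read]: This paper addresses the reliance of Multimodal Large Language Models (MLLMs) on limited field-of-view (FOV) visual input when building a comprehensive understanding of the physical world, and proposes dense understanding from omnidirectional panoramas. The key is ERP-RoPE, a position encoding scheme designed for the equirectangular projection (ERP) of panoramas, which targets two core challenges that ERP introduces: spatial continuity along circles of latitude and latitude-dependent variation in information density. The paper also introduces Dense360-Bench, the first benchmark for evaluating MLLMs on omnidirectional captioning and grounding, providing a comprehensive framework for advancing dense vision-language understanding in panoramic settings.
Link: https://arxiv.org/abs/2506.14471
Authors: Yikang Zhou,Tao Zhang,Dizhe Zhang,Shunping Ji,Xiangtai Li,Lu Qi
Institutions: Wuhan University (武汉大学); Insta360; Peking University (北京大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: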
Abstract:Multimodal Large Language Models (MLLMs) require comprehensive visual inputs to achieve dense understanding of the physical world. While existing MLLMs demonstrate impressive world understanding capabilities through limited field-of-view (FOV) visual inputs (e.g., 70 degrees), we take the first step toward dense understanding from omnidirectional panoramas. We first introduce an omnidirectional panorama dataset featuring a comprehensive suite of reliability-scored annotations. Specifically, our dataset contains 160K panoramas with 5M dense entity-level captions, 1M unique referring expressions, and 100K entity-grounded panoramic scene descriptions. Compared to multi-view alternatives, panoramas can provide more complete, compact, and continuous scene representations through equirectangular projections (ERP). However, the use of ERP introduces two key challenges for MLLMs: i) spatial continuity along the circle of latitude, and ii) latitude-dependent variation in information density. We address these challenges through ERP-RoPE, a position encoding scheme specifically designed for panoramic ERP. In addition, we introduce Dense360-Bench, the first benchmark for evaluating MLLMs on omnidirectional captioning and grounding, establishing a comprehensive framework for advancing dense visual-language understanding in panoramic settings.
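The two ERP challenges named in the abstract follow directly from the projection geometry, and a short sketch makes them concrete. The code below does not implement ERP-RoPE, whose details are not given here; it only illustrates the underlying facts, and the function names and image size are hypothetical.

```python
import math


def row_latitude(v, height):
    """Latitude in radians of the centre of pixel row v (row 0 is nearest the north pole)."""
    return (0.5 - (v + 0.5) / height) * math.pi


def pixel_solid_angle(v, width, height):
    """Approximate solid angle of one ERP pixel in row v; it shrinks toward the poles."""
    d_lon = 2.0 * math.pi / width
    d_lat = math.pi / height
    return d_lon * d_lat * math.cos(row_latitude(v, height))


def wrapped_column_distance(u1, u2, width):
    """Column distance along a circle of latitude, where the left and right edges meet."""
    d = abs(u1 - u2) % width
    return min(d, width - d)


W, H = 1024, 512  # hypothetical panorama size
print(wrapped_column_distance(0, W - 1, W))                      # edge columns are neighbours
print(pixel_solid_angle(0, W, H) / pixel_solid_angle(H // 2, W, H))  # polar vs equatorial pixel
```

A position encoding that treats columns 0 and W-1 as far apart, or treats a polar row as carrying as much scene content as an equatorial row, ignores both effects, which is the motivation the abstract gives for a panorama-specific scheme.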
[CV-29] Adapting Lightweight Vision Language Models for Radiological Visual Question Answering
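[Quick read]: This paper addresses challenges in data procurement, modeling complexity, and insufficient evaluation for radiological visual question answering (Radiological VQA) models. The key to its solution is to fine-tune a lightweight 3B-parameter vision-language model with a cost-effective training pipeline that runs from synthetic question-answer pair generation to multi-stage fine-tuning on radiology-specific datasets (such as ROCO v2.0 and MedPix v2.0), achieving robust performance despite the small parameter count and limited training data.
Link: https://arxiv.org/abs/2506.14451
Authors: Aditya Shourya,Michel Dumontier,Chang Sun
Institutions: Maastricht University (马斯特里赫特大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: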
Abstract:Recent advancements in vision-language systems have improved the accuracy of Radiological Visual Question Answering (VQA) Models. However, some challenges remain across each stage of model development: limited expert-labeled images hinder data procurement at scale; the intricate and nuanced patterns of radiological images make modeling inherently difficult; and the lack of evaluation efforts makes it difficult to identify cases where the model might be ill-conditioned. In this study, we fine-tune a lightweight 3B parameter vision-language model for Radiological VQA, demonstrating that small models, when appropriately tuned with curated data, can achieve robust performance across both open- and closed-ended questions. We propose a cost-effective training pipeline from synthetic question-answer pair generation to multi-stage fine-tuning on specialised radiological domain-targeted datasets (e.g., ROCO v2.0, MedPix v2.0). Our results show that despite operating at a fraction of the scale of state-of-the-art models such as LLaVA-Med, our model achieves promising performance given its small parameter size and the limited scale of training data. We introduce a lightweight saliency-based diagnostic tool that enables domain experts to inspect VQA model performance and identify ill-conditioned failure modes through saliency analysis.
[CV-30] Model compression using knowledge distillation with integrated gradients
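[Quick read]: This paper addresses model compression for deploying deep learning models on resource-constrained devices. Its core solution is to strengthen knowledge distillation by using integrated gradients (IG) as a data augmentation strategy: IG maps are overlaid on the input images during training, giving the student model deeper insight into the teacher model's decision process, so that the model can be compressed substantially while keeping high test accuracy.
Link: https://arxiv.org/abs/2506.14440
Authors: David E. Hernandez,Jose Chang,Torbjörn E. M. Nordling
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 49 pages, 12 figures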
Abstract:Model compression is critical for deploying deep learning models on resource-constrained devices. We introduce a novel method enhancing knowledge distillation with integrated gradients (IG) as a data augmentation strategy. Our approach overlays IG maps onto input images during training, providing student models with deeper insights into teacher models’ decision-making processes. Extensive evaluation on CIFAR-10 demonstrates that our IG-augmented knowledge distillation achieves 92.6% testing accuracy with a 4.1x compression factor, a significant 1.1 percentage point improvement (p < 0.001) over non-distilled models (91.5%). This compression reduces inference time from 140 ms to 13 ms. Our method precomputes IG maps before training, transforming substantial runtime costs into a one-time preprocessing step. Our comprehensive experiments include: (1) comparisons with attention transfer, revealing complementary benefits when combined with our approach; (2) Monte Carlo simulations confirming statistical robustness; (3) systematic evaluation of compression factor versus accuracy trade-offs across a wide range (2.2x-1122x); and (4) validation on an ImageNet subset aligned with CIFAR-10 classes, demonstrating generalisability beyond the initial dataset. These extensive ablation studies confirm that IG-based knowledge distillation consistently outperforms conventional approaches across varied architectures and compression ratios. Our results establish this framework as a viable compression technique for real-world deployment on edge devices while maintaining competitive accuracy.
[CV-31] MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models
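[Quick read]: This paper addresses the high memory footprint of large multimodal Mixture-of-Experts (MoE) models when deployed on edge devices. Previous approaches use full-precision experts during sparse up-cycling, so a large number of experts incurs high memory overhead. The key to its solution is MoTE, a scalable and memory-efficient training approach that trains more low-precision experts during up-cycling instead of fewer high-precision ones. Specifically, it uses the pre-trained feed-forward network as a shared expert and trains ternary routed experts with parameters in {-1, 0, 1}. Experiments show that the approach substantially reduces memory footprint while maintaining performance, and that its advantage grows when combined with post-training quantization.
Link: https://arxiv.org/abs/2506.14435
Authors: Hongyu Wang,Jiayu Xu,Ruiping Wang,Yan Feng,Yitao Zhai,Peng Pei,Xunliang Cai,Xilin Chen
Institutions: Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所人工智能安全重点实验室); Meituan (美团); University of Chinese Academy of Sciences (中国科学院大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Work in progress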
Abstract:Large multimodal Mixture-of-Experts (MoEs) effectively scale the model size to boost performance while maintaining fixed active parameters. However, previous works primarily utilized full-precision experts during sparse up-cycling. Although they show superior performance on end tasks, the large number of experts introduces a higher memory footprint, which poses significant challenges for deployment on edge devices. In this work, we propose MoTE, a scalable and memory-efficient approach to train Mixture-of-Ternary-Experts models from a dense checkpoint. Instead of training fewer high-precision experts, we propose to train more low-precision experts during up-cycling. Specifically, we use the pre-trained FFN as a shared expert and train ternary routed experts with parameters in {-1, 0, 1}. Extensive experiments show that our approach has a promising scaling trend along model size. MoTE achieves comparable performance to the full-precision baseline MoE-LLaVA while offering a lower memory footprint. Furthermore, our approach is compatible with post-training quantization methods, and the advantage is further amplified as the memory constraint tightens. Given the same expert memory footprint of 3.4GB and combined with post-training quantization, MoTE outperforms MoE-LLaVA by a gain of 4.3% average accuracy on end tasks, demonstrating its effectiveness and potential for memory-constrained devices.
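To make "parameters in {-1, 0, 1}" concrete, here is a small sketch of one common way to ternarize a weight vector. The abstract does not say which quantizer MoTE uses, so the absmean rule below (scale by the mean absolute weight, round, clip) is an assumption, and the function names are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Map real weights to codes in {-1, 0, 1} plus one shared scale (absmean rule, assumed)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes and the shared scale."""
    return [c * scale for c in codes]


codes, scale = ternarize([0.9, -0.05, -1.2, 0.3])
print(codes, scale)
print(dequantize(codes, scale))
```

Each ternary code needs at most 2 bits (about 1.58 bits in the information-theoretic limit), compared with 16 bits for a half-precision weight, which is why many ternary experts can fit in the memory budget of a few full-precision ones.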
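[CV-32] Toward Rich Video Human-Motion2D Generation
[Quick read]: This paper addresses the generation of realistic and controllable human motion, especially motion involving rich multi-character interaction, which is difficult because data are scarce and inter-personal dynamics are complex to model. The key to its solution is a large-scale rich video human motion 2D dataset (Motion2D-Video-150K) and, built on it, a diffusion-based rich video human motion2D generation model (RVHM2D). RVHM2D uses an enhanced text conditioning mechanism, with either dual text encoders or T5-XXL and both global and local features, and a two-stage training strategy to further improve motion realism and text alignment.
Link: https://arxiv.org/abs/2506.14428
Authors: Ruihao Xi,Xuekuan Wang,Yongcheng Li,Shuhua Li,Zichen Wang,Yiwei Wang,Feng Wei,Cairong Zhao
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: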
Abstract:Generating realistic and controllable human motions, particularly those involving rich multi-character interactions, remains a significant challenge due to data scarcity and the complexities of modeling inter-personal dynamics. To address these limitations, we first introduce a new large-scale rich video human motion 2D dataset (Motion2D-Video-150K) comprising 150,000 video sequences. Motion2D-Video-150K features a balanced distribution of diverse single-character and, crucially, double-character interactive actions, each paired with detailed textual descriptions. Building upon this dataset, we propose a novel diffusion-based rich video human motion2D generation (RVHM2D) model. RVHM2D incorporates an enhanced textual conditioning mechanism utilizing either dual text encoders (CLIP-L/B) or T5-XXL with both global and local features. We devise a two-stage training strategy: the model is first trained with a standard diffusion objective, and then fine-tuned using reinforcement learning with an FID-based reward to further enhance motion realism and text alignment. Extensive experiments demonstrate that RVHM2D achieves leading performance on the Motion2D-Video-150K benchmark in generating both single and interactive double-character scenarios.
[CV-33] Compositional Attribute Imbalance in Vision Datasets
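[Quick read]: This paper addresses visual attribute imbalance in image classification, an issue that has received little attention but significantly affects model performance and generalization. The key to its solution is a CLIP-based visual attribute dictionary that enables automatic evaluation of image attributes, together with adjusting each sample's sampling probability according to the rarity of its compositional attributes. The strategy is combined with data augmentation techniques (such as CutMix, Fmix, and SaliencyMix) to improve the model's ability to represent rare attributes. Experiments show that the method mitigates attribute imbalance and improves the robustness and fairness of deep neural networks.
Link: https://arxiv.org/abs/2506.14418
Authors: Jiayi Chen,Yanbiao Ma,Andi Zhang,Weidong Tang,Wei Dai,Bowei Liu
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: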
Abstract:Visual attribute imbalance is a common yet underexplored issue in image classification, significantly impacting model performance and generalization. In this work, we first define the first-level and second-level attributes of images and then introduce a CLIP-based framework to construct a visual attribute dictionary, enabling automatic evaluation of image attributes. By systematically analyzing both single-attribute imbalance and compositional attribute imbalance, we reveal how the rarity of attributes affects model performance. To tackle these challenges, we propose adjusting the sampling probability of samples based on the rarity of their compositional attributes. This strategy is further integrated with various data augmentation techniques (such as CutMix, Fmix, and SaliencyMix) to enhance the model’s ability to represent rare attributes. Extensive experiments on benchmark datasets demonstrate that our method effectively mitigates attribute imbalance, thereby improving the robustness and fairness of deep neural networks. Our research highlights the importance of modeling visual attribute distributions and provides a scalable solution for long-tail image classification tasks.
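The idea of adjusting sampling probability by the rarity of a sample's compositional attributes can be sketched with simple inverse-frequency weighting. This is an illustration under assumptions: the abstract does not specify the weighting rule, so the inverse-frequency form, the `power` knob, and the attribute tuples below are hypothetical.

```python
from collections import Counter


def rarity_sampling_probs(attribute_combos, power=1.0):
    """Per-sample sampling probabilities, larger for rarer attribute combinations.

    attribute_combos: one hashable tuple of attributes per sample.
    power: 0 gives uniform sampling; 1 gives full inverse-frequency weighting.
    """
    counts = Counter(attribute_combos)
    weights = [(1.0 / counts[combo]) ** power for combo in attribute_combos]
    total = sum(weights)
    return [w / total for w in weights]


# Three samples share a common combination; one sample has a rare combination.
combos = [("red", "round")] * 3 + [("blue", "square")]
print(rarity_sampling_probs(combos))
```

With full inverse-frequency weighting, each distinct combination receives the same total probability mass, so the single rare sample here is drawn as often as the three common ones together. The resulting list could be passed as `weights` to `random.choices` to draw a rebalanced batch.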
[CV-34] Causally Steered Diffusion for Automated Video Counterfactual Generation
[Quick read]: This paper addresses the problem of preserving causal relationships in video editing: when video content is modified without regard to causally dependent attributes, the result can be unrealistic or misleading. The key to its solution is a causally faithful framework guided by a vision-language model (VLM), which steers generation by optimizing text prompts, without access to the internal mechanisms of the video editing system and without fine-tuning, thereby addressing the challenge of latent-space control.
Link: https://arxiv.org/abs/2506.14404
Authors: Nikos Spyrou,Athanasios Vlontzos,Paraskevas Pegios,Thomas Melistas,Nefeli Gkouti,Yannis Panagakis,Giorgos Papanastasiou,Sotirios A. Tsaftaris
Affiliations: National & Kapodistrian University of Athens, Greece; Archimedes, Athena Research Center, Greece; The University of Edinburgh, UK; Technical University of Denmark; Pioneer Centre for AI, Denmark; The University of Essex, UK; Monzo Bank, UK
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Adapting text-to-image (T2I) latent diffusion models for video editing has shown strong visual fidelity and controllability, but challenges remain in maintaining causal relationships in video content. Edits affecting causally dependent attributes risk generating unrealistic or misleading outcomes if these relationships are ignored. In this work, we propose a causally faithful framework for counterfactual video generation, guided by a vision-language model (VLM). Our method is agnostic to the underlying video editing system and does not require access to its internal mechanisms or finetuning. Instead, we guide the generation by optimizing text prompts based on an assumed causal graph, addressing the challenge of latent space control in LDMs. We evaluate our approach using standard video quality metrics and counterfactual-specific criteria, such as causal effectiveness and minimality. Our results demonstrate that causally faithful video counterfactuals can be effectively generated within the learned distribution of LDMs through prompt-based causal steering. With its compatibility with any black-box video editing system, our method holds significant potential for generating realistic “what-if” video scenarios in diverse areas such as healthcare and digital media.
[CV-35] Decoupled Classifier-Free Guidance for Counterfactual Diffusion Models
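[Quick read]: This paper addresses a weakness of standard classifier-free guidance (CFG) in counterfactual image generation: because a single global weight is applied to all conditioning variables, identity preservation is poor and spurious attribute changes appear, a phenomenon known as attribute amplification. The key to its solution is Decoupled Classifier-Free Guidance (DCFG), which uses an attribute-split embedding strategy to disentangle semantic inputs and so enable selective guidance on user-defined attribute groups. Attributes are partitioned into intervened and invariant sets according to a causal graph, and distinct guidance is applied to each, improving intervention fidelity, reducing unintended changes, and enhancing reversibility.
Link: https://arxiv.org/abs/2506.14399
Authors: Tian Xia,Fabio De Sousa Ribeiro,Rajat R Rasal,Avinash Kori,Raghav Mehta,Ben Glocker
Affiliations: Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: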
Abstract:Counterfactual image generation aims to simulate realistic visual outcomes under specific causal interventions. Diffusion models have recently emerged as a powerful tool for this task, combining DDIM inversion with conditional generation via classifier-free guidance (CFG). However, standard CFG applies a single global weight across all conditioning variables, which can lead to poor identity preservation and spurious attribute changes - a phenomenon known as attribute amplification. To address this, we propose Decoupled Classifier-Free Guidance (DCFG), a flexible and model-agnostic framework that introduces group-wise conditioning control. DCFG builds on an attribute-split embedding strategy that disentangles semantic inputs, enabling selective guidance on user-defined attribute groups. For counterfactual generation, we partition attributes into intervened and invariant sets based on a causal graph and apply distinct guidance to each. Experiments on CelebA-HQ, MIMIC-CXR, and EMBED show that DCFG improves intervention fidelity, mitigates unintended changes, and enhances reversibility, enabling more faithful and interpretable counterfactual image generation.
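The contrast between one global guidance weight and group-wise weights can be illustrated with a toy sketch. The `cfg` function is the standard classifier-free guidance rule. The `groupwise_cfg` function is only an assumption about how per-group weights might be combined (each group adds its own weighted offset from the unconditional prediction); the exact DCFG formulation may differ, and the group names and numbers are hypothetical.

```python
def cfg(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance with a single global weight w."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def groupwise_cfg(eps_uncond, eps_by_group, weights):
    """Hypothetical group-wise variant: one guidance weight per attribute group.

    eps_by_group maps a group name to the noise prediction conditioned on
    that group only; weights maps the same names to guidance weights.
    """
    out = list(eps_uncond)
    for name, eps_group in eps_by_group.items():
        w = weights[name]
        out = [o + w * (g - u) for o, g, u in zip(out, eps_group, eps_uncond)]
    return out


# Toy two-dimensional "noise predictions".
eps_u = [0.0, 0.0]
eps_groups = {"intervened": [1.0, 0.0], "invariant": [0.0, 1.0]}

# Guide strongly on the intervened group and weakly on the invariant group.
guided = groupwise_cfg(eps_u, eps_groups, {"intervened": 3.0, "invariant": 1.0})
```

With a single global weight, both directions would be amplified equally, which is the behavior the abstract associates with attribute amplification.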
[CV-36] Enclosing Prototypical Variational Autoencoder for Explainable Out-of-Distribution Detection
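[Quick read]: This paper addresses the interpretability of decisions made by deep machine learning models, and trust in their reliability, in safety-relevant applications. Its key solution extends self-explainable prototypical variational models with autoencoder-based out-of-distribution (OOD) detection. A variational autoencoder learns a meaningful latent space used for distance-based classification, likelihood estimation, and reconstruction. The in-distribution (ID) region is defined by a Gaussian mixture distribution whose modes are represented by learned prototypes, and a new restriction loss keeps the ID region compact without collapsing it into single points. The reconstruction capability of the autoencoder makes the prototypes and the ID region explainable, which further helps discriminate OOD samples.
Link: https://arxiv.org/abs/2506.14390
Authors: Conrad Orglmeister,Erik Bochinski,Volker Eiselein,Elvira Fleig
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in Computer Safety, Reliability and Security - SAFECOMP 2024 Workshops - DECSoS, SASSUR, TOASTS, and WAISE, and is available online at this https URL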
Abstract:Understanding the decision-making and trusting the reliability of Deep Machine Learning Models is crucial for adopting such methods to safety-relevant applications. We extend self-explainable Prototypical Variational models with autoencoder-based out-of-distribution (OOD) detection: A Variational Autoencoder is applied to learn a meaningful latent space which can be used for distance-based classification, likelihood estimation for OOD detection, and reconstruction. The In-Distribution (ID) region is defined by a Gaussian mixture distribution with learned prototypes representing the center of each mode. Furthermore, a novel restriction loss is introduced that promotes a compact ID region in the latent space without collapsing it into single points. The reconstructive capabilities of the Autoencoder ensure the explainability of the prototypes and the ID region of the classifier, further aiding the discrimination of OOD samples. Extensive evaluations on common OOD detection benchmarks as well as a large-scale dataset from a real-world railway application demonstrate the usefulness of the approach, outperforming previous methods.
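As a rough illustration of prototype-based classification with an OOD check, the sketch below assigns a latent vector to its nearest prototype and flags it as OOD when the squared distance exceeds a threshold. It assumes an equal-weight mixture of isotropic Gaussians, in which case the nearest prototype is also the most likely mode; the prototype names, coordinates, and threshold are hypothetical, and the paper's actual likelihood estimate is not reproduced here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify_with_ood(z, prototypes, threshold):
    """Return (nearest prototype label, is_ood) for a latent vector z.

    Assumption: a sample is OOD when it is farther than `threshold`
    (in squared distance) from every prototype.
    """
    dists = {label: squared_distance(z, p) for label, p in prototypes.items()}
    label = min(dists, key=dists.get)
    return label, dists[label] > threshold


# Hypothetical two-class latent space with one prototype per class.
prototypes = {"train": [0.0, 0.0], "signal": [4.0, 0.0]}

near = classify_with_ood([0.5, 0.2], prototypes, threshold=2.0)
far = classify_with_ood([3.0, 9.0], prototypes, threshold=2.0)
```

Here `near` lands inside the region around the "train" prototype, while `far` is closest to "signal" but too distant to be accepted as in-distribution.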
[CV-37] GrFormer: A Novel Transformer on Grassmann Manifold for Infrared and Visible Image Fusion
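[Quick read]: This paper targets two problems in infrared and visible image fusion: traditional Euclidean methods cannot capture the intrinsic topological structure of the data, and an inadequate balance between low-level details and high-level semantics degrades fusion performance. The key to its solution is a new attention mechanism based on the Grassmann manifold (GrFormer), which builds a low-rank subspace mapping through projection constraints on the manifold and compresses attention features into subspaces of different rank levels, decoupling high-frequency details from low-frequency semantics and achieving multi-scale semantic fusion. It also introduces a covariance-mask-based cross-modal fusion strategy (CMS) that maximizes complementarity between modalities and suppresses highly correlated features.
Link: https://arxiv.org/abs/2506.14384
Authors: Huan Kang,Hui Li,Xiao-Jun Wu,Tianyang Xu,Rui Wang,Chunyang Cheng,Josef Kittler
Affiliations: Jiangnan University; University of Surrey
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 11 figures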
Abstract:In the field of image fusion, promising progress has been made by modeling data from different modalities as linear subspaces. However, in practice, the source images are often located in a non-Euclidean space, where the Euclidean methods usually cannot encapsulate the intrinsic topological structure. Typically, the inner product performed in the Euclidean space calculates the algebraic similarity rather than the semantic similarity, which results in undesired attention output and a decrease in fusion performance. In addition, the balance between low-level details and high-level semantics should be considered in infrared and visible image fusion tasks. To address these issues, in this paper, we propose a novel attention mechanism based on Grassmann manifold for infrared and visible image fusion (GrFormer). Specifically, our method constructs a low-rank subspace mapping through projection constraints on the Grassmann manifold, compressing attention features into subspaces of varying rank levels. This forces the features to decouple into high-frequency details (local low-rank) and low-frequency semantics (global low-rank), thereby achieving multi-scale semantic fusion. Additionally, to effectively integrate the significant information, we develop a cross-modal fusion strategy (CMS) based on a covariance mask to maximise the complementary properties between different modalities and to suppress the features with high correlation, which are deemed redundant. The experimental results demonstrate that our network outperforms SOTA methods both qualitatively and quantitatively on multiple image fusion benchmarks. The code is available at this https URL.
[CV-38] DepthSeg: Depth prompting in remote sensing semantic segmentation
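[Quick read]: This paper addresses land cover misclassification in remote sensing semantic segmentation caused by ignoring elevation differences between targets, particularly in complex scenes with shadow occlusion and spectral confusion. The key to its solution is DepthSeg, a depth-prompting two-dimensional (2D) remote sensing semantic segmentation framework that automatically models depth/height information and integrates it into the segmentation pipeline to mitigate spectral confusion and shadow occlusion. Specifically, DepthSeg introduces a lightweight adapter in the feature extraction phase for efficient fine-tuning of a pre-trained vision transformer encoder, a depth prompter in the depth prompting phase to model depth/height features explicitly, and a semantic classification decoder in the semantic prediction phase that couples the depth prompts with high-dimensional land-cover features, improving the extraction of land-cover types.
Link: https://arxiv.org/abs/2506.14382
Authors: Ning Zhou,Shanxiong Chen,Mingting Zhou,Haigang Sui,Lieyun Hu,Han Li,Li Hua,Qiming Zhou
Affiliations: Wuhan University; Changjiang Spatial Information Technology Engineering Co., Ltd; Huazhong Agr. University; Hong Kong Baptist University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: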
Abstract:Remote sensing semantic segmentation is crucial for extracting detailed land surface information, enabling applications such as environmental monitoring, land use planning, and resource assessment. In recent years, advancements in artificial intelligence have spurred the development of automatic remote sensing semantic segmentation methods. However, the existing semantic segmentation methods focus on distinguishing spectral characteristics of different objects while ignoring the differences in the elevation of the different targets. This results in land cover misclassification in complex scenarios involving shadow occlusion and spectral confusion. In this paper, we introduce a depth prompting two-dimensional (2D) remote sensing semantic segmentation framework (DepthSeg). It automatically models depth/height information from 2D remote sensing images and integrates it into the semantic segmentation framework to mitigate the effects of spectral confusion and shadow occlusion. During the feature extraction phase of DepthSeg, we introduce a lightweight adapter to enable cost-effective fine-tuning of the large-parameter vision transformer encoder pre-trained by natural images. In the depth prompting phase, we propose a depth prompter to model depth/height features explicitly. In the semantic prediction phase, we introduce a semantic classification decoder that couples the depth prompts with high-dimensional land-cover features, enabling accurate extraction of land-cover types. Experiments on the LiuZhou dataset validate the advantages of the DepthSeg framework in land cover mapping tasks. Detailed ablation studies further highlight the significance of the depth prompts in remote sensing semantic segmentation.
[CV-39] Discrete JEPA: Learning Discrete Token Representations without Reconstruction
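[Quick read]: This paper addresses the significant limitations of current image tokenization methods on tasks that require symbolic abstraction and logical reasoning, which restrict systematic inference. The key to its solution is Discrete-JEPA, which extends the latent predictive coding framework with semantic tokenization and novel complementary objectives to produce robust tokenization for symbolic reasoning tasks.
Link: https://arxiv.org/abs/2506.14373
Authors: Junyeob Baek,Hosung Lee,Christopher Hoang,Mengye Ren,Sungjin Ahn
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: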
Abstract:The cornerstone of cognitive intelligence lies in extracting hidden patterns from observations and leveraging these principles to systematically predict future outcomes. However, current image tokenization methods demonstrate significant limitations in tasks requiring symbolic abstraction and logical reasoning capabilities essential for systematic inference. To address this challenge, we propose Discrete-JEPA, extending the latent predictive coding framework with semantic tokenization and novel complementary objectives to create robust tokenization for symbolic reasoning tasks. Discrete-JEPA dramatically outperforms baselines on visual symbolic prediction tasks, while striking visual evidence reveals the spontaneous emergence of deliberate systematic patterns within the learned semantic token space. Though an initial model, our approach promises a significant impact for advancing symbolic world modeling and planning capabilities in artificial intelligence systems.
[CV-40] DGG-XNet: A Hybrid Deep Learning Framework for Multi-Class Brain Disease Classification with Explainable AI
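[Quick read]: This paper addresses accurate diagnosis of brain disorders (such as Alzheimer's disease and brain tumors) in medical imaging, where conventional methods based on manual MRI analysis are inefficient and error-prone. The key to its solution is DGG-XNet, a deep learning model that fuses VGG16 and DenseNet121: the dense connectivity of DenseNet121 promotes feature reuse and efficient gradient flow, while VGG16 provides strong hierarchical spatial representations, improving feature extraction and classification. Grad-CAM is used for model interpretability.
Link: https://arxiv.org/abs/2506.14367
Authors: Sumshun Nahar Eity,Mahin Montasir Afif,Tanisha Fairooz,Md. Mortuza Ahmmed,Md Saef Ullah Miah
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: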
Abstract:Accurate diagnosis of brain disorders such as Alzheimer’s disease and brain tumors remains a critical challenge in medical imaging. Conventional methods based on manual MRI analysis are often inefficient and error-prone. To address this, we propose DGG-XNet, a hybrid deep learning model integrating VGG16 and DenseNet121 to enhance feature extraction and classification. DenseNet121 promotes feature reuse and efficient gradient flow through dense connectivity, while VGG16 contributes strong hierarchical spatial representations. Their fusion enables robust multiclass classification of neurological conditions. Grad-CAM is applied to visualize salient regions, enhancing model transparency. Trained on a combined dataset from BraTS 2021 and Kaggle, DGG-XNet achieved a test accuracy of 91.33%, with precision, recall, and F1-score all exceeding 91%. These results highlight DGG-XNet’s potential as an effective and interpretable tool for computer-aided diagnosis (CAD) of neurodegenerative and oncological brain disorders.
[CV-41] HydroChronos: Forecasting Decades of Surface Water Change
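[Quick read]: This paper addresses the lack of comprehensive datasets and standardized benchmarks for forecasting surface water dynamics. Its key solution is HydroChronos, a large-scale, multi-modal spatiotemporal dataset for surface water dynamics forecasting, paired with three forecasting tasks. The dataset contains over three decades of aligned Landsat 5 and Sentinel-2 imagery, climate data, and digital elevation models covering lakes and rivers across Europe, North America, and South America. The paper also proposes AquaClimaTempo UNet, a new spatiotemporal architecture whose dedicated climate data branch significantly improves forecasting performance.
Link: https://arxiv.org/abs/2506.14362
Authors: Daniele Rege Cambrin,Eleonora Poeta,Eliana Pastor,Isaac Corley,Tania Cerquitelli,Elena Baralis,Paolo Garza
Affiliations: Politecnico di Torino; Wherobots
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: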
Abstract:Forecasting surface water dynamics is crucial for water resource management and climate change adaptation. However, the field lacks comprehensive datasets and standardized benchmarks. In this paper, we introduce HydroChronos, a large-scale, multi-modal spatiotemporal dataset for surface water dynamics forecasting designed to address this gap. We couple the dataset with three forecasting tasks. The dataset includes over three decades of aligned Landsat 5 and Sentinel-2 imagery, climate data, and Digital Elevation Models for diverse lakes and rivers across Europe, North America, and South America. We also propose AquaClimaTempo UNet, a novel spatiotemporal architecture with a dedicated climate data branch, as a strong benchmark baseline. Our model significantly outperforms a Persistence baseline for forecasting future water dynamics by +14% and +11% F1 across change detection and direction of change classification tasks, and by +0.1 MAE on the magnitude of change regression. Finally, we conduct an Explainable AI analysis to identify the key climate variables and input channels that influence surface water change, providing insights to inform and guide future modeling efforts.
[CV-42] EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization
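[Quick read]: This paper addresses three key problems in egocentric video-language understanding: high pre-training cost, ineffective spatial-temporal encoding, and imprecise learning objectives in soft-label multi-instance retrieval. The key to its solution is threefold. First, an image-based CLIP model is efficiently converted into a unified video encoder through single-stage pre-training. Second, spatial-temporal rotary positional embeddings with joint attention encode spatial and temporal information across the entire hidden dimension. Third, the Symmetric Multi-Similarity (SMS) loss and a new training framework improve the soft-label representation of both positive and negative pairs, giving a more precise learning objective.
Link: https://arxiv.org/abs/2506.14356
Authors: Xiaoqi Wang,Yi Wang,Lap-Pui Chau
Affiliations: The Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: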
Abstract:Egocentric video-language understanding demands both high efficiency and accurate spatial-temporal modeling. Existing approaches face three key challenges: 1) Excessive pre-training cost arising from multi-stage pre-training pipelines, 2) Ineffective spatial-temporal encoding due to manually split 3D rotary positional embeddings that hinder feature interactions, and 3) Imprecise learning objectives in soft-label multi-instance retrieval, which neglect negative pair correlations. In this paper, we introduce EVA02-AT, a suite of EVA02-based video-language foundation models tailored to egocentric video understanding tasks. EVA02-AT first efficiently transfers an image-based CLIP model into a unified video encoder via a single-stage pre-training. Second, instead of applying rotary positional embeddings to isolated dimensions, we introduce spatial-temporal rotary positional embeddings along with joint attention, which can effectively encode both spatial and temporal information on the entire hidden dimension. This joint encoding of spatial-temporal features enables the model to learn cross-axis relationships, which are crucial for accurately modeling motion and interaction in videos. Third, focusing on multi-instance video-language retrieval tasks, we introduce the Symmetric Multi-Similarity (SMS) loss and a novel training framework that advances all soft labels for both positive and negative pairs, providing a more precise learning objective. Extensive experiments on Ego4D, EPIC-Kitchens-100, and Charades-Ego under zero-shot and fine-tuning settings demonstrate that EVA02-AT achieves state-of-the-art performance across diverse egocentric video-language tasks with fewer parameters. Models with our SMS loss also show significant performance gains on multi-instance retrieval benchmarks. Our code and models are publicly available at this https URL.
[CV-43] FGA-NN: Film Grain Analysis Neural Network
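[Quick read]: This paper addresses the loss of film grain during compression at medium to low bitrates, which is caused by the random nature of grain, in order to preserve artistic intent. The key to its solution is FGA-NN, a learning-based film grain analysis method that estimates film grain parameters compatible with conventional synthesis, achieving a superior balance between analysis accuracy and synthesis complexity together with good robustness and applicability.
Link: https://arxiv.org/abs/2506.14350
Authors: Zoubida Ameur,Frédéric Lefebvre,Philippe De Lagrange,Miloš Radosavljević
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: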
Abstract:Film grain, once a by-product of analog film, is now present in most cinematographic content for aesthetic reasons. However, when such content is compressed at medium to low bitrates, film grain is lost due to its random nature. To preserve artistic intent while compressing efficiently, film grain is analyzed and modeled before encoding and synthesized after decoding. This paper introduces FGA-NN, the first learning-based film grain analysis method to estimate conventional film grain parameters compatible with conventional synthesis. Quantitative and qualitative results demonstrate FGA-NN’s superior balance between analysis accuracy and synthesis complexity, along with its robustness and applicability.
[CV-44] FRIDU: Functional Map Refinement with Guided Image Diffusion
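[Quick read]: This paper addresses the refinement of a correspondence map between two shapes, where the map is typically represented as a functional map, that is, a change-of-basis matrix. The key to its solution is to treat the functional map as a 2D image and train an image diffusion model directly in the space of functional maps, so that it generates an accurate map conditioned on an inaccurate initial map. Training takes place purely in the functional space and is therefore highly efficient. At inference time, the pointwise map corresponding to the current functional map guides the diffusion process, and the guidance can further encourage functional map objectives such as orthogonality and commutativity with the Laplace-Beltrami operator.
Link: https://arxiv.org/abs/2506.14322
Authors: Avigail Cohen Rimon,Mirela Ben-Chen,Or Litany
Affiliations: Technion - Israel Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to SGP 2025 (Symposium on Geometry Processing)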
Abstract:We propose a novel approach for refining a given correspondence map between two shapes. A correspondence map represented as a functional map, namely a change of basis matrix, can be additionally treated as a 2D image. With this perspective, we train an image diffusion model directly in the space of functional maps, enabling it to generate accurate maps conditioned on an inaccurate initial map. The training is done purely in the functional space, and thus is highly efficient. At inference time, we use the pointwise map corresponding to the current functional map as guidance during the diffusion process. The guidance can additionally encourage different functional map objectives, such as orthogonality and commutativity with the Laplace-Beltrami operator. We show that our approach is competitive with state-of-the-art methods of map refinement and that guided diffusion models provide a promising pathway to functional map processing.
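The two functional map objectives named in the abstract have standard penalty forms, sketched below in plain Python. The sketch assumes the functional map C is a small dense matrix stored as a list of rows and that the Laplace-Beltrami operators are expressed in their own eigenbases, so they reduce to lists of eigenvalues. How the paper turns these penalties into diffusion guidance is not shown here, and the function names are hypothetical.

```python
def orthogonality_penalty(C):
    """Squared Frobenius norm of C^T C - I."""
    rows, cols = len(C), len(C[0])
    total = 0.0
    for i in range(cols):
        for j in range(cols):
            dot = sum(C[r][i] * C[r][j] for r in range(rows))
            target = 1.0 if i == j else 0.0
            total += (dot - target) ** 2
    return total


def commutativity_penalty(C, evals_src, evals_tgt):
    """Squared Frobenius norm of C diag(evals_src) - diag(evals_tgt) C.

    Entry (i, j) of the difference is C[i][j] * (evals_src[j] - evals_tgt[i]).
    """
    return sum(
        (C[i][j] * (evals_src[j] - evals_tgt[i])) ** 2
        for i in range(len(C))
        for j in range(len(C[0]))
    )


# The identity map between two shapes with equal spectra satisfies both objectives.
identity = [[1.0, 0.0], [0.0, 1.0]]
orth_ok = orthogonality_penalty(identity)
comm_ok = commutativity_penalty(identity, [0.0, 1.0], [0.0, 1.0])

# A map that rescales the second basis function violates orthogonality.
scaled = [[1.0, 0.0], [0.0, 2.0]]
orth_bad = orthogonality_penalty(scaled)
```

Both penalties are zero for an ideal map between near-isometric shapes, which is why they are useful as guidance or regularization terms.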
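[CV-45] ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies
[Quick read]: This paper addresses the complex modeling pipelines and limited visual realism of automatic scene creation for immersive virtual reality (VR). Existing methods usually rely on high-poly mesh modeling or massive numbers of 3D Gaussians, which leads to a complex pipeline or limited visual quality. The key to the proposed solution is the ImmerseGen framework, which represents a scene as a hierarchical composition of lightweight geometric proxies and synthesizes RGBA textures onto them for a photorealistic appearance, simplifying modeling while improving visual quality. The approach avoids complex geometry creation and decimation, supports real-time rendering, and improves both the efficiency of scene generation and the sense of immersion.
Link: https://arxiv.org/abs/2506.14315
Authors: Jinyan Yuan,Bangbang Yang,Keke Wang,Panwang Pan,Lin Ma,Xuehai Zhang,Xiao Liu,Zhaopeng Cui,Yuewen Ma
Affiliations: ByteDance; Zhejiang University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project webpage: this https URL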
Abstract:Automatic creation of 3D scenes for immersive VR presence has been a significant research focus for decades. However, existing methods often rely on either high-poly mesh modeling with post-hoc simplification or massive 3D Gaussians, resulting in a complex pipeline or limited visual realism. In this paper, we demonstrate that such exhaustive modeling is unnecessary for achieving compelling immersive experience. We introduce ImmerseGen, a novel agent-guided framework for compact and photorealistic world modeling. ImmerseGen represents scenes as hierarchical compositions of lightweight geometric proxies, i.e., simplified terrain and billboard meshes, and generates photorealistic appearance by synthesizing RGBA textures onto these proxies. Specifically, we propose terrain-conditioned texturing for user-centric base world synthesis, and RGBA asset texturing for midground and foreground scenery. This reformulation offers several advantages: (i) it simplifies modeling by enabling agents to guide generative models in producing coherent textures that integrate seamlessly with the scene; (ii) it bypasses complex geometry creation and decimation by directly synthesizing photorealistic textures on proxies, preserving visual quality without degradation; (iii) it enables compact representations suitable for real-time rendering on mobile VR headsets. To automate scene creation from text prompts, we introduce VLM-based modeling agents enhanced with semantic grid-based analysis for improved spatial reasoning and accurate asset placement. ImmerseGen further enriches scenes with dynamic effects and ambient audio to support multisensory immersion. Experiments on scene generation and live VR showcases demonstrate that ImmerseGen achieves superior photorealism, spatial coherence and rendering efficiency compared to prior methods. Project webpage: this https URL.
[CV-46] Leader360V: The Large-scale Real-world 360 Video Dataset for Multi-task Learning in Diverse Environment
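[Quick read]: This paper addresses the lack of large-scale, real-world annotated data for 360 video scene understanding, which limits the application and development of foundation models in this area. The key to its solution is an automatic labeling pipeline that coordinates pre-trained 2D segmentors and large language models (LLMs) for efficient, high-quality annotation in three stages: semantic- and distortion-aware initial annotation, automatic refinement, and manual revision. The result is Leader360V, the first large-scale, real-world 360 video dataset for instance segmentation and tracking.
Link: https://arxiv.org/abs/2506.14271
Authors: Weiming Zhang,Dingwen Xiao,Aobotao Dai,Yexin Liu,Tianbo Pan,Shiqi Wen,Lei Chen,Lin Wang
Affiliations: HKUST (GZ); HKUST; National University of Singapore; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 16 figures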
Abstract:360 video captures the complete surrounding scene with an ultra-large field of view of 360×180. This makes 360 scene understanding tasks, e.g., segmentation and tracking, crucial for applications such as autonomous driving and robotics. With the recent emergence of foundation models, the community is, however, impeded by the lack of large-scale, labeled real-world datasets. This is caused by the inherent spherical properties, e.g., severe distortion in polar regions and content discontinuities, which make annotation costly and complex. This paper introduces Leader360V, the first large-scale, labeled real-world 360 video dataset for instance segmentation and tracking. Our dataset offers high scene diversity, ranging from indoor and urban settings to natural and dynamic outdoor scenes. To automate annotation, we design an automatic labeling pipeline, which subtly coordinates pre-trained 2D segmentors and large language models to facilitate the labeling. The pipeline operates in three novel stages. Specifically, in the Initial Annotation Phase, we introduce a Semantic- and Distortion-aware Refinement (SDR) module, which combines object mask proposals from multiple 2D segmentors with LLM-verified semantic labels. These are then converted into mask prompts to guide SAM2 in generating distortion-aware masks for subsequent frames. In the Auto-Refine Annotation Phase, missing or incomplete regions are corrected either by applying the SDR again or by resolving the discontinuities near the horizontal borders. The Manual Revision Phase finally incorporates LLMs and human annotators to further refine and validate the annotations. Extensive user studies and evaluations demonstrate the effectiveness of our labeling pipeline. Meanwhile, experiments confirm that Leader360V significantly enhances model performance for 360 video segmentation and tracking, paving the way for more scalable 360 scene understanding.
[CV-47] Exploring Non-contrastive Self-supervised Representation Learning for Image-based Profiling CVPR2025
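[Quick read]: This paper addresses the limited generalization of feature extractors in image-based cell profiling, in particular the challenges posed by the large distribution gap between cell images and natural images and by the difficulty of fusing information from multiple input images. The key to its solution is SSLProfiler, a non-contrastive self-supervised learning framework designed specifically for cell profiling, which introduces tailored data augmentation and representation post-processing to improve the robustness and transferability of the feature extractor.
Link: https://arxiv.org/abs/2506.14265
Authors: Siran Dai,Qianqian Xu,Peisong Wen,Yang Liu,Qingming Huang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences; School of Computer Science and Technology, University of Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 Computer Vision for Drug Discovery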
Abstract:Image-based cell profiling aims to create informative representations of cell images. This technique is critical in drug discovery and has greatly advanced with recent improvements in computer vision. Inspired by recent developments in non-contrastive Self-Supervised Learning (SSL), this paper provides an initial exploration into training a generalizable feature extractor for cell images using such methods. However, there are two major challenges: 1) There is a large difference between the distributions of cell images and natural images, causing the view-generation process in existing SSL methods to fail; and 2) Unlike typical scenarios where each representation is based on a single image, cell profiling often involves multiple input images, making it difficult to effectively combine all available information. To overcome these challenges, we propose SSLProfiler, a non-contrastive SSL framework specifically designed for cell profiling. We introduce specialized data augmentation and representation post-processing methods tailored to cell images, which effectively address the issues mentioned above and result in a robust feature extractor. With these improvements, SSLProfiler won the Cell Line Transferability challenge at CVPR 2025.
[CV-48] Comparison of Two Methods for Stationary Incident Detection Based on Background Image
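[Quick read]: This paper addresses the detection of temporarily stationary objects in video scenes. Background subtraction has traditionally been used to detect moving objects, and the paper extends it to stationary ones. The key to its solution is two background-subtraction-based schemes for stationary object detection: the first uses a single background, while the second uses dual backgrounds generated with different learning rates to better detect temporarily stopped objects. Normalized cross correlation (NCC) based image comparison is then used to monitor and track the detected stationary objects.
Link: https://arxiv.org/abs/2506.14256
Authors: Deepak Ghimire,Joonwhoan Lee
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures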
Abstract:In general, background subtraction-based methods are used to detect moving objects in visual tracking applications. In this paper, we employ a background subtraction-based scheme to detect temporarily stationary objects. We propose two schemes for stationary object detection and compare them in terms of detection performance and computational complexity. In the first approach, we use a single background; in the second, we use dual backgrounds, generated with different learning rates, to detect temporarily stopped objects. Finally, we use normalized cross correlation (NCC) based image comparison to monitor and track the detected stationary object in a video scene. The proposed method is robust to partial occlusion, short-term full occlusion, and illumination changes, and it can operate in real time.
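The dual-background idea can be sketched with two running-average backgrounds updated at different learning rates: a stopped object is absorbed quickly by the fast background but barely at all by the slow one, so a large difference between the two marks candidate stationary pixels. The sketch below works on tiny one-dimensional "images" in plain Python. The update rule, learning rates, and threshold are assumptions for illustration and are not taken from the paper.

```python
import math


def update_background(background, frame, rate):
    """Running average: B <- (1 - rate) * B + rate * frame."""
    return [(1.0 - rate) * b + rate * f for b, f in zip(background, frame)]


def stationary_mask(bg_fast, bg_slow, threshold):
    """Mark pixels where the fast and slow backgrounds disagree strongly."""
    return [abs(f - s) > threshold for f, s in zip(bg_fast, bg_slow)]


def ncc(a, b):
    """Normalized cross correlation between two equal-length patches."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return num / den if den > 0 else 0.0


# Hypothetical scene: an object appears in the two middle pixels and stays put.
empty = [10.0, 10.0, 10.0, 10.0]
with_object = [10.0, 200.0, 200.0, 10.0]

bg_fast, bg_slow = list(empty), list(empty)
for _ in range(20):
    bg_fast = update_background(bg_fast, with_object, rate=0.5)
    bg_slow = update_background(bg_slow, with_object, rate=0.001)

mask = stationary_mask(bg_fast, bg_slow, threshold=50.0)

# NCC close to 1 means the region still looks like the stored object template.
similarity = ncc(with_object, bg_fast)
```

Because NCC subtracts the mean and normalizes by the variance, it is largely insensitive to uniform brightness changes, which fits the illumination robustness claimed in the abstract.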
[CV-49] synth-dacl: Does Synthetic Defect Data Enhance Segmentation Accuracy and Robustness for Real-World Bridge Inspections?
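【Quick Read】: This paper addresses the low efficiency, poor accuracy, and insufficient safety in bridge inspection that result from the limited automation of defect and component classification. The key to the solution is the "synth-dacl" dataset extension, built on synthetic concrete textures, which aims to mitigate the class imbalance in the existing dacl10k dataset and improve segmentation of fine-grained classes such as cracks and cavities. Experiments show that with the synth-dacl extensions, model robustness improves substantially across multiple perturbed test sets, with a 2% gain in mean IoU, F1 score, recall, and precision.
Link: https://arxiv.org/abs/2506.14255
Authors: Johannes Flotzinger, Fabian Deuser, Achref Jaziri, Heiko Neumann, Norbert Oswald, Visvanathan Ramesh, Thomas Braml
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: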
Abstract:Adequate bridge inspection is increasingly challenging in many countries due to a growing stock of ailing structures, compounded by a lack of staff and financial resources. Automating the key task of visual bridge inspection, the classification of defects and building components at pixel level, improves efficiency, increases accuracy and enhances safety in the inspection process and the resulting building assessment. Models undertaking this task must cope with an assortment of real-world conditions. They must be robust to variations in image quality, as well as background texture, as defects often appear on surfaces of diverse texture and degree of weathering. dacl10k is the largest and most diverse dataset for real-world concrete bridge inspections. However, the dataset exhibits class imbalance, which leads to notably poor model performance, particularly when segmenting fine-grained classes such as cracks and cavities. This work introduces “synth-dacl”, a compilation of three novel dataset extensions based on synthetic concrete textures. These extensions are designed to balance class distribution in dacl10k and enhance model performance, especially for crack and cavity segmentation. When incorporating the synth-dacl extensions, we observe substantial improvements in model robustness across 15 perturbed test sets. Notably, on the perturbed test set, a model trained on dacl10k combined with all synthetic extensions achieves a 2% increase in mean IoU, F1 score, Recall, and Precision compared to the same model trained solely on dacl10k.
[CV-50] Cross-Modal Geometric Hierarchy Fusion: An Implicit-Submap Driven Framework for Resilient 3D Place Recognition
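【Quick Read】: This paper addresses two key problems of LiDAR-based place recognition in long-term autonomous robotics and autonomous driving systems: descriptor instability caused by inconsistent point cloud density arising from ego-motion dynamics and environmental disturbances, and representation fragility caused by reliance on single-level geometric abstractions. The key to the solution is an implicit 3D representation based on elastic points that is insensitive to the density of the original scene point cloud and is uniformly distributed. Occupancy grid and normal vector information are extracted from it, and descriptors are generated by fusing geometric information from the bird's-eye view (capturing macro-level spatial layout) and 3D segments (encoding micro-scale surface geometry), improving the robustness and discriminative power of place recognition.
Link: https://arxiv.org/abs/2506.14243
Authors: Xiaohui Jiang, Haijiang Zhu, Chadei Li, Fulin Tang, Ning An
Affiliations: Beijing University of Chemical Technology; University of Chinese Academy of Sciences; China Coal Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: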
Abstract:LiDAR-based place recognition serves as a crucial enabler for long-term autonomy in robotics and autonomous driving systems. Yet, prevailing methodologies relying on handcrafted feature extraction face dual challenges: (1) inconsistent point cloud density, induced by ego-motion dynamics and environmental disturbances during repeated traversals, leads to descriptor instability, and (2) representation fragility stems from reliance on single-level geometric abstractions that lack discriminative power in structurally complex scenarios. To address these limitations, we propose a novel framework that redefines 3D place recognition through density-agnostic geometric reasoning. Specifically, we introduce an implicit 3D representation based on elastic points, which is immune to the interference of original scene point cloud density and achieves the characteristic of uniform distribution. Subsequently, we derive the occupancy grid and normal vector information of the scene from this implicit representation. Finally, with the aid of these two types of information, we obtain descriptors that fuse geometric information from both bird’s-eye view (capturing macro-level spatial layouts) and 3D segment (encoding micro-scale surface geometries) perspectives. We conducted extensive experiments on numerous datasets (KITTI, KITTI-360, MulRan, NCLT) across diverse environments. The experimental results demonstrate that our method achieves state-of-the-art performance. Moreover, our approach strikes an optimal balance between accuracy, runtime, and memory optimization for historical maps, showcasing excellent resilience and scalability. Our code will be open-sourced in the future.
[CV-51] Unified Representation Space for 3D Visual Grounding
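【Quick Read】: This paper addresses the gap in spatial geometry and semantic categories between modalities in 3D visual grounding (3DVG), which arises because vision and text encoders are pre-trained separately and which often causes object localization and classification errors. The key to the solution is UniSpace-3D, which introduces a unified representation space into which visual and textual features are mapped, and combines it with a multi-modal contrastive learning module and a language-guided query selection module to narrow the modality gap.
Link: https://arxiv.org/abs/2506.14238
Authors: Yinuo Zheng, Lipeng Gu, Honghua Chen, Liangliang Nan, Mingqiang Wei
Affiliations: Nanjing University of Aeronautics and Astronautics; Delft University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: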
Abstract:3D visual grounding (3DVG) is a critical task in scene understanding that aims to identify objects in 3D scenes based on text descriptions. However, existing methods rely on separately pre-trained vision and text encoders, resulting in a significant gap between the two modalities in terms of spatial geometry and semantic categories. This discrepancy often causes errors in object positioning and classification. The paper proposes UniSpace-3D, which innovatively introduces a unified representation space for 3DVG, effectively bridging the gap between visual and textual features. Specifically, UniSpace-3D incorporates three innovative designs: i) a unified representation encoder that leverages the pre-trained CLIP model to map visual and textual features into a unified representation space, effectively bridging the gap between the two modalities; ii) a multi-modal contrastive learning module that further reduces the modality gap; iii) a language-guided query selection module that utilizes the positional and semantic information to identify object candidate points aligned with textual descriptions. Extensive experiments demonstrate that UniSpace-3D outperforms baseline models by at least 2.24% on the ScanRefer and Nr3D/Sr3D datasets. The code will be made available upon acceptance of the paper.
[CV-52] HRGS: Hierarchical Gaussian Splatting for Memory-Efficient High-Resolution 3D Reconstruction
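【Quick Read】: This paper addresses the memory scalability problem of 3D Gaussian Splatting (3DGS) in high-resolution scenarios. The key to the solution is Hierarchical Gaussian Splatting (HRGS), which improves memory efficiency through a block-wise optimization strategy: generating a global coarse Gaussian representation from low-resolution data, partitioning the scene into blocks and refining each with high-resolution data, and reducing computation and memory use through Importance-Driven Gaussian Pruning (IDGP).
Link: https://arxiv.org/abs/2506.14229
Authors: Changbai Li, Haodong Zhu, Hanlin Chen, Juan Zhang, Tongfei Chen, Shuo Yang, Shuwei Shao, Wenhao Dong, Baochang Zhang
Affiliations: Beihang University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: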
Abstract:3D Gaussian Splatting (3DGS) has made significant strides in real-time 3D scene reconstruction, but faces memory scalability issues in high-resolution scenarios. To address this, we propose Hierarchical Gaussian Splatting (HRGS), a memory-efficient framework with hierarchical block-level optimization. First, we generate a global, coarse Gaussian representation from low-resolution data. Then, we partition the scene into multiple blocks, refining each block with high-resolution data. The partitioning involves two steps: Gaussian partitioning, where irregular scenes are normalized into a bounded cubic space with a uniform grid for task distribution, and training data partitioning, where only relevant observations are retained for each block. By guiding block refinement with the coarse Gaussian prior, we ensure seamless Gaussian fusion across adjacent blocks. To reduce computational demands, we introduce Importance-Driven Gaussian Pruning (IDGP), which computes importance scores for each Gaussian and removes those with minimal contribution, speeding up convergence and reducing memory usage. Additionally, we incorporate normal priors from a pretrained model to enhance surface reconstruction quality. Our method enables high-quality, high-resolution 3D scene reconstruction even under memory constraints. Extensive experiments on three benchmarks show that HRGS achieves state-of-the-art performance in high-resolution novel view synthesis (NVS) and surface reconstruction tasks.
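The pruning step can be sketched generically. The abstract does not say how IDGP computes its importance scores, so the sketch below takes the scores as given and assumes a fixed prune ratio; `prune_by_importance` and the toy data are hypothetical.

```python
def prune_by_importance(gaussians, scores, prune_ratio=0.3):
    """Drop the `prune_ratio` fraction of Gaussians with the lowest scores.

    Assumption: `scores` holds one precomputed importance value per Gaussian;
    how the paper derives these values is not specified in the abstract.
    """
    if len(gaussians) != len(scores):
        raise ValueError("one score per Gaussian is required")
    n_drop = int(len(gaussians) * prune_ratio)
    # Indices sorted from least to most important; the first n_drop are removed.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    dropped = set(order[:n_drop])
    return [g for i, g in enumerate(gaussians) if i not in dropped]

# Placeholder Gaussians identified by name only.
gaussians = ["g0", "g1", "g2", "g3", "g4"]
scores = [0.90, 0.05, 0.40, 0.01, 0.70]
kept = prune_by_importance(gaussians, scores, prune_ratio=0.4)
```

With a ratio of 0.4, the two lowest-scoring entries (`g3` and `g1`) are removed and the original ordering of the survivors is preserved.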
[CV-53] AMPLIFY: Actionless Motion Priors for Robot Learning from Videos
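【Quick Read】: This paper addresses the scarcity and high cost of action-labeled data in robotics, which limits the generalization of learned policies. The key to the solution is a new framework named AMPLIFY, which exploits large-scale action-free video data by encoding visual dynamics into compact discrete motion tokens derived from keypoint trajectories. Its core innovation is separating visual motion prediction from action inference, decoupling learning what motion defines a task from how a robot performs it. A forward dynamics model is trained on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, so the two can be scaled independently.
Link: https://arxiv.org/abs/2506.14198
Authors: Jeremy A. Collins, Loránd Cheng, Kunal Aneja, Albert Wilcox, Benjamin Joffe, Animesh Garg
Affiliations: Georgia Tech; Georgia Tech Research Institute
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: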
Abstract:Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate that the learned dynamics are both accurate, achieving up to 3.7x better MSE and over 2.5x better pixel prediction accuracy compared to prior approaches, and broadly useful. In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data. Beyond robotic control, we find the dynamics learned by AMPLIFY to be a versatile latent world model, enhancing video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models. More information can be found at this https URL.
[CV-54] Egocentric Human-Object Interaction Detection: A New Benchmark and Method
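【Quick Read】: This paper addresses human-object interaction (HOI) detection from the egocentric view (Ego-HOI). Existing methods focus mainly on third-person perspectives and overlook the more intuitive egocentric understanding of interactions. The key to the solution is the Hand Geometry and Interactivity Refinement (HGIR) scheme, which uses hand pose and geometric information as important cues for interpreting interactions. It explicitly extracts global hand geometric features and refines interaction features with a pose-interaction attention mechanism, yielding a robust and powerful interaction representation that significantly improves Ego-HOI detection performance.
Link: https://arxiv.org/abs/2506.14189
Authors: Kunyuan Deng, Yi Wang, Lap-Pui Chau
Affiliations: The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: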
Abstract:Understanding the interaction between humans and objects has gained much attention in recent years. Existing human-object interaction (HOI) detection methods mainly focus on third-person perspectives, overlooking a more intuitive way from the egocentric view of HOI, namely Ego-HOI. This paper introduces Ego-HOIBench, a new dataset to promote the benchmarking and development of Ego-HOI detection. Our Ego-HOIBench comprises more than 27K egocentric images with high-quality hand-verb-object triplet annotations across 123 fine-grained interaction categories and locations, covering a rich diversity of scenarios, object types, and hand configurations in daily activities. In addition, we explore and adapt third-person HOI detection methods to Ego-HOIBench and illustrate the challenges of hand-occluded objects and the complexity of single- and two-hand interactions. To build a new baseline, we propose a Hand Geometry and Interactivity Refinement (HGIR) scheme, which leverages hand pose and geometric information as valuable cues for interpreting interactions. Specifically, the HGIR scheme explicitly extracts global hand geometric features from the estimated hand pose proposals and refines the interaction-specific features using pose-interaction attention. This scheme enables the model to obtain a robust and powerful interaction representation, significantly improving the Ego-HOI detection capability. Our approach is lightweight and effective, and it can be easily applied to HOI baselines in a plug-and-play manner to achieve state-of-the-art results on Ego-HOIBench. Our project is available at: this https URL
[CV-55] Meta-SurDiff: Classification Diffusion Model Optimized by Meta Learning is Reliable for Online Surgical Phase Recognition
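【Quick Read】: This paper addresses the uncertainty in online surgical phase recognition caused by ambiguous video frames and the unbalanced distribution of surgical phases, which must be handled for reliable recognition. The key to the solution is a meta-learning-optimized classification diffusion model (Meta-SurDiff), which takes full advantage of deep generative models and meta-learning to achieve precise probability estimation at the fine-grained frame level, improving the reliability of online surgical phase recognition.
Link: https://arxiv.org/abs/2506.14181
Authors: Yufei Li, Jirui Wu, Long Tian, Liming Wang, Xiaonan Liu, Zijun Liu, Xiyang Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 5 figures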
Abstract:Online surgical phase recognition has drawn great attention recently due to its potential downstream applications closely related to human life and health. Although deep models have made significant advances in capturing the discriminative long-term dependency of surgical videos to achieve improved recognition, they rarely account for exploring and modeling the uncertainty in surgical videos, which should be crucial for reliable online surgical phase recognition. We categorize the sources of uncertainty into two types, frame ambiguity in videos and unbalanced distribution among surgical phases, which are inevitable in surgical videos. To address this pivotal issue, we introduce a meta-learning-optimized classification diffusion model (Meta-SurDiff), to take full advantage of the deep generative model and meta-learning in achieving precise frame-level distribution estimation for reliable online surgical phase recognition. For coarse recognition caused by ambiguous video frames, we employ a classification diffusion model to assess the confidence of recognition results at a finer-grained frame-level instance. For coarse recognition caused by unbalanced phase distribution, we use a meta-learning based objective to learn the diffusion model, thus enhancing the robustness of classification boundaries for different surgical phases. We establish the effectiveness of Meta-SurDiff in online surgical phase recognition through extensive experiments on five widely used datasets using more than four practical metrics. The datasets include Cholec80, AutoLaparo, M2Cai16, OphNet, and NurViD, where OphNet comes from ophthalmic surgeries, NurViD is the daily care dataset, while the others come from laparoscopic surgeries. We will release the code upon acceptance.
[CV-56] One-Shot Neural Architecture Search with Network Similarity Directed Initialization for Pathological Image Classification
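【Quick Read】: This paper addresses the unique challenges of deep learning in pathological image analysis, in particular that existing methods apply computer vision models directly to medical tasks and neglect the characteristics of pathological images, leading to computational inefficiency, especially in edge-computing scenarios. The key to the solution is a novel Network Similarity Directed Initialization (NSDI) strategy that improves the stability of neural architecture search (NAS), together with the introduction of domain adaptation into one-shot NAS to better handle variations in staining and semantic scale across pathology datasets.
Link: https://arxiv.org/abs/2506.14176
Authors: Renao Yan
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: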
Abstract:Deep learning-based pathological image analysis presents unique challenges due to the practical constraints of network design. Most existing methods apply computer vision models directly to medical tasks, neglecting the distinct characteristics of pathological images. This mismatch often leads to computational inefficiencies, particularly in edge-computing scenarios. To address this, we propose a novel Network Similarity Directed Initialization (NSDI) strategy to improve the stability of neural architecture search (NAS). Furthermore, we introduce domain adaptation into one-shot NAS to better handle variations in staining and semantic scale across pathology datasets. Experiments on the BRACS dataset demonstrate that our method outperforms existing approaches, delivering both superior classification performance and clinically relevant feature localization.
[CV-57] A multi-stage augmented multimodal interaction network for fish feeding intensity quantification
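【Quick Read】: This paper addresses the difficulty of further improving the accuracy, applicability, and reliability of multimodal fusion models in recirculating aquaculture systems, owing to limitations in modality selection, feature extraction and fusion, and joint inference for decision making. The key to the solution is a Multi-stage Augmented Multimodal Interaction Network (MAINet), which uses a general feature extraction framework to efficiently extract features from image, audio, and water wave data, an Auxiliary-modality Reinforcement Primary-modality Mechanism (ARPM) to enable inter-modal interaction and generate enhanced features, and an Evidence Reasoning (ER) rule to fuse the outputs of each modality and make decisions, thereby quantifying fish feeding intensity.
Link: https://arxiv.org/abs/2506.14170
Authors: Shulong Zhang, Mingyuan Yao, Jiayin Zhao, Xiao Liu, Haihua Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: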
Abstract:In recirculating aquaculture systems, accurate and effective assessment of fish feeding intensity is crucial for reducing feed costs and calculating optimal feeding times. However, current studies have limitations in modality selection, feature extraction and fusion, and co-inference for decision making, which restrict further improvement in the accuracy, applicability and reliability of multimodal fusion models. To address this problem, this study proposes a Multi-stage Augmented Multimodal Interaction Network (MAINet) for quantifying fish feeding intensity. First, a general feature extraction framework is proposed to efficiently extract feature information from input image, audio and water wave data. Second, an Auxiliary-modality Reinforcement Primary-modality Mechanism (ARPM) is designed to enable inter-modal interaction and generate enhanced features; it consists of a Channel Attention Fusion Network (CAFN) and a Dual-mode Attention Fusion Network (DAFN). Finally, an Evidence Reasoning (ER) rule is introduced to fuse the output results of each modality and make decisions, thereby completing the quantification of fish feeding intensity. The experimental results show that the constructed MAINet reaches 96.76%, 96.78%, 96.79% and 96.79% in accuracy, precision, recall and F1-Score respectively, and its performance is significantly higher than that of the comparison models. Compared with models that adopt single-modality, dual-modality fusion and different decision-making fusion methods, it also has obvious advantages. Meanwhile, the ablation experiments further verify the key role of the proposed improvement strategy in improving the robustness and feature utilization efficiency of the model, which can effectively improve the accuracy of the quantitative results of fish feeding intensity.
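[CV-58] VideoMAR: Autoregressive Video Generation with Continuous Tokens NEURIPS2025
【Quick Read】: This paper addresses the under-explored potential of mask-based autoregressive models for video generation, in particular how to generate video efficiently in continuous space. The key to the solution is VideoMAR, a concise and efficient decoder-only autoregressive image-to-video model that composes temporal frame-by-frame generation with spatial masked generation. It adopts temporal causality and spatial bi-directionality as the first principle of video autoregressive models and uses a next-frame diffusion loss to integrate masked and video generation. To cope with the high cost and difficulty of long-sequence autoregressive modeling, it further proposes temporal short-to-long curriculum learning and spatial progressive-resolution training, and applies a progressive temperature strategy at inference time to mitigate error accumulation.
Link: https://arxiv.org/abs/2506.14168
Authors: Hu Yu, Biao Gong, Hangjie Yuan, DanDan Zheng, Weilong Chai, Jingdong Chen, Kecheng Zheng, Feng Zhao
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to NeurIPS 2025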
Abstract:Mask-based autoregressive models have demonstrated promising image generation capability in continuous space. However, their potential for video generation remains under-explored. In this paper, we propose VideoMAR, a concise and efficient decoder-only autoregressive image-to-video model with continuous tokens, composing temporal frame-by-frame and spatial masked generation. We first identify temporal causality and spatial bi-directionality as the first principle of video AR models, and propose the next-frame diffusion loss for the integration of mask and video generation. Besides, the huge cost and difficulty of long sequence autoregressive modeling is a basic but crucial issue. To this end, we propose the temporal short-to-long curriculum learning and spatial progressive resolution training, and employ progressive temperature strategy at inference time to mitigate the accumulation error. Furthermore, VideoMAR replicates several unique capacities of language models to video generation. It inherently bears high efficiency due to simultaneous temporal-wise KV cache and spatial-wise parallel generation, and presents the capacity of spatial and temporal extrapolation via 3D rotary embeddings. On the VBench-I2V benchmark, VideoMAR surpasses the previous state-of-the-art (Cosmos I2V) while requiring significantly fewer parameters (9.3%), training data (0.5%), and GPU resources (0.2%).
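[CV-59] SceneAware: Scene-Constrained Pedestrian Trajectory Prediction with LLM-Guided Walkability
【Quick Read】: This paper addresses the neglect of environmental context in pedestrian trajectory prediction: existing methods focus mainly on social interactions between pedestrians and overlook environmental factors that significantly shape human movement patterns. The key to the solution is the SceneAware framework, which explicitly incorporates scene understanding to improve trajectory prediction accuracy. It uses a Vision Transformer (ViT) to process environmental information from static scene images, uses multi-modal large language models (MLLMs) to generate walkability masks, fuses a Transformer-based trajectory encoder with the ViT scene encoder to capture temporal dynamics and spatial constraints, and adds a collision penalty mechanism to ensure physical plausibility.
Link: https://arxiv.org/abs/2506.14144
Authors: Juho Bai, Inwook Shim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: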
Abstract:Accurate prediction of pedestrian trajectories is essential for applications in robotics and surveillance systems. While existing approaches primarily focus on social interactions between pedestrians, they often overlook the rich environmental context that significantly shapes human movement patterns. In this paper, we propose SceneAware, a novel framework that explicitly incorporates scene understanding to enhance trajectory prediction accuracy. Our method leverages a Vision Transformer (ViT) scene encoder to process environmental context from static scene images, while Multi-modal Large Language Models (MLLMs) generate binary walkability masks that distinguish between accessible and restricted areas during training. We combine a Transformer-based trajectory encoder with the ViT-based scene encoder, capturing both temporal dynamics and spatial constraints. The framework integrates collision penalty mechanisms that discourage predicted trajectories from violating physical boundaries, ensuring physically plausible predictions. SceneAware is implemented in both deterministic and stochastic variants. Comprehensive experiments on the ETH/UCY benchmark datasets show that our approach outperforms state-of-the-art methods, with more than 50% improvement over previous models. Our analysis based on different trajectory categories shows that the model performs consistently well across various types of pedestrian movement. This highlights the importance of using explicit scene information and shows that our scene-aware approach is both effective and reliable in generating accurate and physically plausible predictions. Code is available at: this https URL.
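To make the collision penalty concrete, here is a minimal sketch under stated assumptions. It counts how many predicted positions land on restricted cells of a binary walkability mask. The paper's actual penalty is presumably a differentiable training loss, which this counting version is not; `collision_penalty`, the grid layout, and the cell size are all hypothetical.

```python
def collision_penalty(trajectory, walkable, cell_size=1.0):
    """Fraction of predicted points that fall on non-walkable or out-of-map cells.

    trajectory: list of (x, y) positions in metres.
    walkable:   2-D list of 0/1 values indexed as walkable[row][col],
                where 1 marks an accessible cell (assumed mask layout).
    """
    if not trajectory:
        return 0.0
    rows, cols = len(walkable), len(walkable[0])
    violations = 0
    for x, y in trajectory:
        col, row = int(x // cell_size), int(y // cell_size)
        inside = 0 <= row < rows and 0 <= col < cols
        if not inside or walkable[row][col] == 0:
            violations += 1
    return violations / len(trajectory)

# A 3x4 scene with an obstacle occupying the two middle cells of the middle row.
mask = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
path = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (2.5, 1.5)]
penalty = collision_penalty(path, mask)
```

Two of the four points cross the obstacle, so the penalty here is 0.5. A term like this could be added to the prediction loss with a weighting factor chosen on validation data.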
[CV-60] Interpreting Biomedical VLMs on High-Imbalance Out-of-Distributions: An Insight into BiomedCLIP on Radiology
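【Quick Read】: This paper addresses the generalization and reliability of biomedical vision-language models (VLMs) in medical image classification, in particular their limitations on a highly imbalanced, out-of-distribution, multi-label medical dataset. The key to the solution is to analyze the embedding space of BiomedCLIP, evaluate its classification performance under zero-shot inference, full fine-tuning, and linear probing, and use Grad-CAM heatmaps for visual interpretation, revealing the model's strengths, weaknesses, and directions for improvement in each setting.
Link: https://arxiv.org/abs/2506.14136
Authors: Nafiz Sadman, Farhana Zulkernine, Benjamin Kwan
Affiliations: Queen’s University; Kingston Health Sciences Centre
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: GitHub: this https URL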
Abstract:In this paper, we pursue two research objectives: i) explore the learned embedding space of BiomedCLIP, an open-source large vision language model, to analyse meaningful class separations, and ii) quantify the limitations of BiomedCLIP when applied to a highly imbalanced, out-of-distribution multi-label medical dataset. We experiment on the IU-xray dataset, which meets the aforementioned criteria, and evaluate BiomedCLIP in classifying images (radiographs) in three contexts: zero-shot inference, full fine-tuning, and linear probing. The results show that the model under zero-shot settings over-predicts all labels, leading to poor precision and inter-class separability. Full fine-tuning improves classification of distinct diseases, while linear probing detects overlapping features. We demonstrate visual understanding of the model using Grad-CAM heatmaps and compare with 15 annotations by a radiologist. We highlight the need for careful adaptations of the models to foster reliability and applicability in a real-world setting. The code for the experiments in this work is available and maintained on GitHub.
[CV-61] GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation
【Quick Read】: This paper addresses the accuracy of action inference in vision-guided robotic manipulation, where existing methods often produce inaccurate actions in complex and dynamic manipulation scenes. The key to the solution is a Vision-to-4D-to-Action (V-4D-A) framework that performs action reasoning directly from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) with learnable motion attributes, so it can model dynamic scenes and manipulation actions simultaneously.
Link: https://arxiv.org/abs/2506.14135
Authors: Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Liangjun Xing, Hongwen Zhang, Yebin Liu
Affiliations: Tsinghua University; Beijing Normal University; Beijing University of Posts and Telecommunications
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: this http URL
Abstract:Accurate action inference is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often struggle with action inaccuracies due to the complexity and dynamic nature of manipulation scenes. In this paper, we propose a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing simultaneous modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF supports three key query types: reconstruction of the current scene, prediction of future frames, and estimation of initial action via robot motion. Furthermore, the high-quality current and future frames generated by GAF facilitate manipulation action refinement through a GAF-guided diffusion model. Extensive experiments demonstrate significant improvements, with GAF achieving +11.5385 dB PSNR and -0.5574 LPIPS improvements in reconstruction quality, while boosting the average success rate in robotic manipulation tasks by 10.33% over state-of-the-art methods. Project page: this http URL
[CV-62] KDMOS:Knowledge Distillation for Motion Segmentation
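【Quick Read】: This paper addresses the difficulty of balancing accuracy and real-time inference in Motion Object Segmentation (MOS). The key to the solution is a logits-based knowledge distillation framework that uses a Bird’s Eye View (BEV) projection-based model as the student and a non-projection model as the teacher, decouples the moving and non-moving classes, and applies tailored distillation strategies to each, helping the model learn key motion-related features and significantly reducing false positives and false negatives.
Link: https://arxiv.org/abs/2506.14130
Authors: Chunyu Cao, Jintao Cheng, Zeyu Chen, Linfan Zhan, Rui Fan, Zhijian He, Xiaoyu Tang
Affiliations: South China Normal University; Tongji University; Shenzhen Technology University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: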
Abstract:Motion Object Segmentation (MOS) is crucial for autonomous driving, as it enhances localization, path planning, map construction, scene flow estimation, and future state prediction. While existing methods achieve strong performance, balancing accuracy and real-time inference remains a challenge. To address this, we propose a logits-based knowledge distillation framework for MOS, aiming to improve accuracy while maintaining real-time efficiency. Specifically, we adopt a Bird’s Eye View (BEV) projection-based model as the student and a non-projection model as the teacher. To handle the severe imbalance between moving and non-moving classes, we decouple them and apply tailored distillation strategies, allowing the teacher model to better learn key motion-related features. This approach significantly reduces false positives and false negatives. Additionally, we introduce dynamic upsampling, optimize the network architecture, and achieve a 7.69% reduction in parameter count, mitigating overfitting. Our method achieves a notable IoU of 78.8% on the hidden test set of the SemanticKITTI-MOS dataset and delivers competitive results on the Apollo dataset. The KDMOS implementation is available at this https URL.
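For readers unfamiliar with logits-based distillation, the following sketch shows the generic temperature-softened KL objective. It is the standard formulation, not the decoupled moving/non-moving strategy that KDMOS describes; the function names, the temperature, and the example logits are assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2.

    This is the generic logits-based distillation loss; the class-decoupled
    variant in the paper is not reproduced here.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2

# Hypothetical logits for one point over two classes: [non-moving, moving].
teacher = [0.2, 3.1]
good_student = [0.1, 2.9]
poor_student = [2.5, 0.3]
loss_good = distillation_loss(good_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
```

The student that agrees with the teacher receives a much smaller loss than the one that disagrees. One way to approximate a decoupled scheme would be to compute this loss separately for moving and non-moving points and weight the two terms differently, though that is an assumption rather than the paper's stated procedure.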
[CV-63] FADPNet: Frequency-Aware Dual-Path Network for Face Super-Resolution
【Quick Read】: This paper addresses the poor performance of face super-resolution (FSR) under limited computational cost: existing methods treat all facial pixels equally, leading to suboptimal allocation of computational resources and degraded FSR results. The key to the solution is FADPNet, a frequency-aware dual-path network that decomposes facial features into low- and high-frequency components and processes them with Mamba and a CNN respectively, modeling and optimizing each frequency band effectively.
Link: https://arxiv.org/abs/2506.14121
Authors: Siyu Xu, Wenjie Li, Guangwei Gao, Jian Yang, Guo-Jun Qi, Chia-Wen Lin
Affiliations: Institute of Advanced Technology, Nanjing University of Posts and Telecommunications; Key Laboratory of Artificial Intelligence, Ministry of Education; Provincial Key Laboratory for Computer Information Processing Technology, Soochow University; School of Artificial Intelligence, Beijing University of Posts and Telecommunications; School of Computer Science and Technology, Nanjing University of Science and Technology; Research Center for Industries of the Future and the School of Engineering, Westlake University; OPPO Research; Department of Electrical Engineering and the Institute of Communications Engineering, National Tsing Hua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 11 figures, 6 tables
Abstract:Face super-resolution (FSR) under limited computational costs remains an open problem. Existing approaches typically treat all facial pixels equally, resulting in suboptimal allocation of computational resources and degraded FSR performance. CNN is relatively sensitive to high-frequency facial features, such as component contours and facial outlines. Meanwhile, Mamba excels at capturing low-frequency features like facial color and fine-grained texture, and does so with lower complexity than Transformers. Motivated by these observations, we propose FADPNet, a Frequency-Aware Dual-Path Network that decomposes facial features into low- and high-frequency components and processes them via dedicated branches. For low-frequency regions, we introduce a Mamba-based Low-Frequency Enhancement Block (LFEB), which combines state-space attention with squeeze-and-excitation operations to extract low-frequency global interactions and emphasize informative channels. For high-frequency regions, we design a CNN-based Deep Position-Aware Attention (DPA) module to enhance spatially-dependent structural details, complemented by a lightweight High-Frequency Refinement (HFR) module that further refines frequency-specific representations. Through the above designs, our method achieves an excellent balance between FSR quality and model efficiency, outperforming existing approaches.
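The low/high-frequency split can be pictured with a simple signal-processing stand-in. The abstract does not state how FADPNet separates frequencies inside the network, so this sketch assumes a moving-average low-pass filter on a single row of pixels; `split_frequencies` and the sample row are hypothetical.

```python
def split_frequencies(signal, radius=1):
    """Split a 1-D signal into a smooth low-frequency part and a high-frequency residual.

    Assumption: the low-pass filter is a moving average whose window is
    clipped at the borders; the residual is whatever the average removes.
    """
    n = len(signal)
    low = []
    for i in range(n):
        window = signal[max(0, i - radius): min(n, i + radius + 1)]
        low.append(sum(window) / len(window))
    high = [x - l for x, l in zip(signal, low)]
    return low, high

# One row of pixel intensities with a sharp edge between positions 2 and 3.
row = [10.0, 10.0, 10.0, 90.0, 90.0, 90.0]
low, high = split_frequencies(row)
```

The high-frequency residual is zero in the flat regions and non-zero only around the edge, which is the kind of content the abstract assigns to the CNN branch, while the smooth part would go to the Mamba branch. Adding the two parts back together recovers the original row.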
[CV-64] Déjà Vu: Efficient Video-Language Query Engine with Learning-based Inter-Frame Computation Reuse VLDB
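【Quick Read】: This paper addresses the computational inefficiency of Vision Transformer (ViT)-based Video-Language Models (VideoLMs) on large-scale video, which limits their deployment in real-world settings. The key to the solution is the Déjà Vu system, whose core component ReuseViT detects opportunities to reuse computation across frames, substantially reducing computation while maintaining high accuracy. Déjà Vu also incorporates memory-compute joint compaction techniques that turn the computational savings into actual performance gains.
Link: https://arxiv.org/abs/2506.14107
Authors: Jinwoo Hwang, Daeun Kim, Sangyeop Lee, Yoonsung Kim, Guseul Heo, Hojoon Kim, Yunseok Jeong, Tadiwos Meaza, Eunhyeok Park, Jeongseob Ahn, Jongse Park
Affiliations: KAIST; POSTECH; Korea University
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to VLDB 2025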
Abstract:Recently, Video-Language Models (VideoLMs) have demonstrated remarkable capabilities, offering significant potential for flexible and powerful video query systems. These models typically rely on Vision Transformers (ViTs), which process video frames individually to extract visual embeddings. However, generating embeddings for large-scale videos requires ViT inference across numerous frames, posing a major hurdle to real-world deployment and necessitating solutions for integration into scalable video data management systems. This paper introduces Déjà Vu, a video-language query engine that accelerates ViT-based VideoLMs by reusing computations across consecutive frames. At its core is ReuseViT, a modified ViT model specifically designed for VideoLM tasks, which learns to detect inter-frame reuse opportunities, striking an effective balance between accuracy and reuse. Although ReuseViT significantly reduces computation, these savings do not directly translate into performance gains on GPUs. To overcome this, Déjà Vu integrates memory-compute joint compaction techniques that convert the FLOP savings into tangible performance gains. Evaluations on three VideoLM tasks show that Déjà Vu accelerates embedding generation by up to 2.64x within a 2% error bound, dramatically enhancing the practicality of VideoLMs for large-scale video analytics.
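The inter-frame reuse idea can be illustrated with a toy sketch. The abstract says ReuseViT learns where reuse is safe; the fixed `threshold`, the mean-absolute-difference test, and the names `embed_with_reuse` and `embed_fn` below are hypothetical stand-ins for illustration, not the authors' method. The sketch also assumes every frame has the same number of patches in the same order.

```python
def embed_with_reuse(frames, embed_fn, threshold=0.05):
    """Embed each frame patch by patch, reusing cached embeddings when a patch barely changed.

    frames: list of frames; each frame is a list of patches (lists of floats).
    Returns (per-frame embeddings, number of embed_fn calls actually made).
    """
    ref_patches = []  # the patch that produced each cached embedding
    cached = []       # cached embedding per patch position
    outputs, computed = [], 0
    for patches in frames:
        embs = []
        for k, patch in enumerate(patches):
            if k < len(ref_patches):
                # Mean absolute difference against the patch behind the cached embedding.
                diff = sum(abs(a - b) for a, b in zip(patch, ref_patches[k])) / len(patch)
                if diff < threshold:
                    embs.append(cached[k])  # reuse: skip the expensive call
                    continue
                ref_patches[k], cached[k] = patch, embed_fn(patch)
            else:
                # First frame: nothing cached yet for this position.
                ref_patches.append(patch)
                cached.append(embed_fn(patch))
            computed += 1
            embs.append(cached[k])
        outputs.append(embs)
    return outputs, computed
```

Comparing against the patch that produced the cached embedding, rather than against the previous frame, keeps small per-frame changes from accumulating unnoticed.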
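[CV-65] Image Segmentation with Large Language Models : A Survey with Perspectives for Intelligent Transportation Systems
[Quick read]: This paper asks how combining Large Language Models (LLMs) with image segmentation can improve segmentation performance in Intelligent Transportation Systems (ITS) and so enable more accurate scene understanding. The key to the solution is to use the semantic understanding and context awareness of LLMs to strengthen feature representation and decision-making in computer vision tasks, improving the accuracy and robustness of road scene understanding.
Link: https://arxiv.org/abs/2506.14096
Authors: Sanjeda Akter,Ibne Farabi Shihab,Anuj Sharma
Affiliations: Iowa State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: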
Abstract:The integration of Large Language Models (LLMs) with computer vision is profoundly transforming perception tasks like image segmentation. For intelligent transportation systems (ITS), where accurate scene understanding is critical for safety and efficiency, this new paradigm offers unprecedented capabilities. This survey systematically reviews the emerging field of LLM-augmented image segmentation, focusing on its applications, challenges, and future directions within ITS. We provide a taxonomy of current approaches based on their prompting mechanisms and core architectures, and we highlight how these innovations can enhance road scene understanding for autonomous driving, traffic monitoring, and infrastructure maintenance. Finally, we identify key challenges, including real-time performance and safety-critical reliability, and outline a perspective centered on explainable, human-centric AI as a prerequisite for the successful deployment of this technology in next-generation transportation systems.
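[CV-66] SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement
[Quick read]: This paper addresses multi-modal information processing and efficient evidence page retrieval in Document Visual Question Answering (DocVQA), that is, answering complex questions over multi-page documents that mix modalities such as images and tables. The key to the solution is SimpleDoc, a lightweight yet effective retrieval-augmented framework that first retrieves candidate pages by embedding similarity, then filters and re-ranks them based on page summaries. A single Vision-Language Model (VLM)-based reasoner agent then repeatedly invokes this dual-cue retriever, iteratively pulling new pages into working memory until the question is confidently answered.
Link: https://arxiv.org/abs/2506.14035
Authors: Chelsi Jain,Yiran Wu,Yifan Zeng,Jiale Liu,Shengyu Dai,Zhenwen Shao,Qingyun Wu,Huazheng Wang
Affiliations: Oregon State University; Pennsylvania State University; AG2AI, Inc.; Johnson & Johnson
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: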
Abstract:Document Visual Question Answering (DocVQA) is a practical yet challenging task: answering questions based on documents while referring to multiple pages and different modalities of information, e.g., images and tables. To handle multi-modality, recent methods follow a similar Retrieval Augmented Generation (RAG) pipeline, but use embedding models based on Visual Language Models (VLMs) to embed and retrieve relevant pages as images, and generate answers with VLMs that can accept an image as input. In this paper, we introduce SimpleDoc, a lightweight yet powerful retrieval-augmented framework for DocVQA. It boosts evidence page gathering by first retrieving candidates through embedding similarity and then filtering and re-ranking these candidates based on page summaries. A single VLM-based reasoner agent repeatedly invokes this dual-cue retriever, iteratively pulling fresh pages into a working memory until the question is confidently answered. SimpleDoc outperforms previous baselines by 3.2% on average on 4 DocVQA datasets while retrieving far fewer pages. Our code is available at this https URL.
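The two retrieval cues described in the abstract (embedding similarity first, page summaries second) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function names, the page dictionary layout, and the keyword-overlap score are hypothetical, and the overlap score stands in for whatever summary-based judgment the authors actually use.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def dual_cue_retrieve(query_vec, query_terms, pages, top_k=3):
    """pages: list of dicts with keys 'id', 'vec', 'summary'. Returns kept page ids."""
    # Cue 1: take the top_k pages by embedding similarity.
    ranked = sorted(pages, key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    candidates = ranked[:top_k]

    # Cue 2: filter and re-rank candidates by their summaries.
    # Keyword overlap is an assumed placeholder for a model-based relevance check.
    def overlap(page):
        return len(set(page["summary"].lower().split()) & query_terms)

    kept = [p for p in candidates if overlap(p) > 0]
    kept.sort(key=overlap, reverse=True)
    return [p["id"] for p in kept]
```

In the full system a reasoner would call such a retriever repeatedly with refined queries; that loop is omitted here.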
[CV-67] Disentangling 3D from Large Vision-Language Models for Controlled Portrait Generation
[Quick read]: This paper addresses disentangling 3D information from a Large Vision-Language Model (LVLM), applied to generating 3D portraits. The key to the solution is to achieve feature disentanglement by canonicalizing a deformable neural 3D triplane representation to a 2D reference frame, and to use Jacobian regularization to mitigate noise in the LVLM embedding space, improving output quality and diversity.
Link: https://arxiv.org/abs/2506.14015
Authors: Nick Yiwen Huang,Akin Caliskan,Berkay Kicanaoglu,James Tompkin,Hyeongwoo Kim
Affiliations: Brown University; Flawless AI; Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We consider the problem of disentangling 3D from large vision-language models, which we show on generative 3D portraits. This allows free-form text control of appearance attributes like age, hair style, and glasses, and 3D geometry control of face expression and camera pose. In this setting, we assume we use a pre-trained large vision-language model (LVLM; CLIP) to generate from a smaller 2D dataset with no additional paired labels and with a pre-defined 3D morphable model (FLAME). First, we disentangle using canonicalization to a 2D reference frame from a deformable neural 3D triplane representation. But another form of entanglement arises from the significant noise in the LVLM’s embedding space that describes irrelevant features. This damages output quality and diversity, but we overcome this with a Jacobian regularization that can be computed efficiently with a stochastic approximator. Compared to existing methods, our approach produces portraits with added text and 3D control, where portraits remain consistent when either control is changed. Broadly, this approach lets creators control 3D generators on their own 2D face data without needing resources to label large data or train large models.
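[CV-68] FindMeIfYouCan: Bringing Open Set metrics to near, far and farther Out-of-Distribution Object Detection
[Quick read]: This paper addresses a flaw in the current evaluation protocol for out-of-distribution object detection (OOD-OD): it violates the assumption that OOD objects do not overlap with the in-distribution (ID) datasets, and it can obscure critical cases such as ignored unknown objects, which may lead to overconfidence when truly novel objects appear in deployment. The key to the solution is to manually curate and enrich the existing benchmark, using semantic similarity to create new evaluation splits categorized as near, far, and farther, and to bring in established metrics from the open set recognition community for a deeper analysis of how well methods detect unknown objects, when they ignore them, and when they misclassify OOD objects as ID.
Link: https://arxiv.org/abs/2506.14008
Authors: Daniel Montoya,Aymen Bouguerra,Alexandra Gomez-Villa,Fabio Arnez
Affiliations: Université Paris-Saclay, CEA, List; Computer Vision Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint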
Abstract:State-of-the-art Object Detection (OD) methods predominantly operate under a closed-world assumption, where test-time categories match those encountered during training. However, detecting and localizing unknown objects is crucial for safety-critical applications in domains such as autonomous driving and medical imaging. Recently, Out-Of-Distribution (OOD) detection has emerged as a vital research direction for OD, focusing on identifying incorrect predictions typically associated with unknown objects. This paper shows that the current evaluation protocol for OOD-OD violates the assumption of non-overlapping objects with respect to the In-Distribution (ID) datasets, and obscures crucial situations such as ignoring unknown objects, potentially leading to overconfidence in deployment scenarios where truly novel objects might be encountered. To address these limitations, we manually curate and enrich the existing benchmark by exploiting semantic similarity to create new evaluation splits categorized as near, far, and farther from ID distributions. Additionally, we incorporate established metrics from the Open Set community, providing deeper insights into how effectively methods detect unknowns, when they ignore them, and when they mistakenly classify OOD objects as ID. Our comprehensive evaluation demonstrates that semantically and visually close OOD objects are easier to localize than far ones, but are also more easily confounded with ID objects. Far and farther objects are harder to localize but less prone to be taken for an ID object.
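[CV-69] Mapping Farmed Landscapes from Remote Sensing
[Quick read]: This paper addresses the lack of detailed, large-scale ecological maps for managing agricultural landscapes effectively, a gap that hampers progress toward global biodiversity targets. The key to the solution is Farmscapes, the first map of rural landscape features that covers most of England at a resolution of 25 cm. It was produced by a deep learning segmentation model trained on aerial imagery using a dataset of 942 manually annotated tiles. The model accurately identifies key habitats, with high F1-scores for woodland (96%) and farmed land (95%), and it segments linear features well (72% F1-score).
Link: https://arxiv.org/abs/2506.13993
Authors: Michelangelo Conserva,Alex Wilson,Charlotte Stanton,Vishal Batchu,Varun Gulshan
Affiliations: Google Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: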
Abstract:Effective management of agricultural landscapes is critical for meeting global biodiversity targets, but efforts are hampered by the absence of detailed, large-scale ecological maps. To address this, we introduce Farmscapes, the first large-scale (covering most of England), high-resolution (25cm) map of rural landscape features, including ecologically vital elements like hedgerows, woodlands, and stone walls. This map was generated using a deep learning segmentation model trained on a novel dataset of 942 manually annotated tiles derived from aerial imagery. Our model accurately identifies key habitats, achieving high F1-scores for woodland (96%) and farmed land (95%), and demonstrates strong capability in segmenting linear features, with an F1-score of 72% for hedgerows. By releasing the England-wide map on Google Earth Engine, we provide a powerful, open-access tool for ecologists and policymakers. This work enables data-driven planning for habitat restoration, supports the monitoring of initiatives like the EU Biodiversity Strategy, and lays the foundation for advanced analysis of landscape connectivity.
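For readers unfamiliar with the F1-score quoted above, a minimal sketch of how it is computed for one class from binary pixel masks follows. The function name and the flat-list mask layout are assumptions for illustration; the abstract does not describe the authors' evaluation code.

```python
def f1_score(pred, target):
    """F1 for one class from flat binary masks (1 = class present, 0 = absent)."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)          # true positives
    fp = sum(1 for p, t in zip(pred, target) if p and not t)      # false positives
    fn = sum(1 for p, t in zip(pred, target) if not p and t)      # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

Because F1 ignores true negatives, it is a stricter measure than plain accuracy for thin, sparse classes such as hedgerows.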
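[CV-70] HierVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment
[Quick read]: This paper addresses semi-supervised semantic segmentation under severe label scarcity and domain variability, where existing vision-only methods fall short on generalization, pixel misclassification between similar classes, and boundary localization. The key to the solution is HierVL, a framework that integrates abstract text embeddings into a mask-transformer architecture tailored for semi-supervised segmentation, combining vision-language semantics with spatial grounding. It has three novel components: a Hierarchical Semantic Query Generator, a Cross-Modal Spatial Alignment Module, and a Dual-Query Transformer Decoder. Targeted regularization losses maintain vision-language alignment, improving segmentation accuracy and instance-aware generalization.
Link: https://arxiv.org/abs/2506.13925
Authors: Numair Nadeem,Saeed Anwar,Muhammad Hamza Asad,Abdul Bais
Affiliations: University of Regina; The University of Western Australia; University Canada West
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: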
Abstract:Semi-supervised semantic segmentation remains challenging under severe label scarcity and domain variability. Vision-only methods often struggle to generalize, resulting in pixel misclassification between similar classes, poor generalization and boundary localization. Vision-Language Models offer robust, domain-invariant semantics but lack the spatial grounding required for dense prediction. We introduce HierVL, a unified framework that bridges this gap by integrating abstract text embeddings into a mask-transformer architecture tailored for semi-supervised segmentation. HierVL features three novel components: a Hierarchical Semantic Query Generator that filters and projects abstract class embeddings into multi-scale queries to suppress irrelevant classes and handle intra-class variability; a Cross-Modal Spatial Alignment Module that aligns semantic queries with pixel features for sharper boundaries under sparse supervision; and a Dual-Query Transformer Decoder that fuses semantic and instance-level queries to prevent instance collapse. We also introduce targeted regularization losses that maintain vision-language alignment throughout training to reinforce semantic grounding. HierVL establishes a new state-of-the-art by achieving a +4.4% mean improvement of the intersection over the union on COCO (with 232 labeled images), +3.1% on Pascal VOC (with 92 labels), +5.9% on ADE20 (with 158 labels) and +1.8% on Cityscapes (with 100 labels), demonstrating better performance under 1% supervision on four benchmark datasets. Our results show that language-guided segmentation closes the label efficiency gap and unlocks new levels of fine-grained, instance-aware generalization.
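The gains above are reported in mean intersection over union (mIoU). As a rough illustration of the metric only, here is a small sketch on flat label lists; the function name and the choice to skip classes absent from both prediction and ground truth are assumptions, since benchmarks differ on that detail.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes, from flat per-pixel class-id lists."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # assumed convention: skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.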
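[CV-71] Intelligent Image Sensing for Crime Analysis: A ML Approach towards Enhanced Violence Detection and Investigation
[Quick read]: This paper addresses the limits of traditional surveillance methods in promptly detecting diverse and unpredictable acts of violence, in the face of a rising global crime rate and the substantial human and property losses it brings. The key to the solution is a comprehensive Machine Learning framework for violence detection and classification, which uses Supervised Learning for both binary and multi-class classification of violent events. The detection model is based on 3D Convolutional Neural Networks, while the classification model combines a separable convolutional 3D network for feature extraction with a bidirectional LSTM for temporal processing, improving computational resource efficiency and detection accuracy.
Link: https://arxiv.org/abs/2506.13910
Authors: Aritra Dutta,Pushpita Boral,G Suseela
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: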
Abstract:The increasing global crime rate, coupled with substantial human and property losses, highlights the limitations of traditional surveillance methods in promptly detecting diverse and unexpected acts of violence. Addressing this pressing need for automatic violence detection, we leverage Machine Learning to detect and categorize violent events in video streams. This paper introduces a comprehensive framework for violence detection and classification, employing Supervised Learning for both binary and multi-class violence classification. The detection model relies on 3D Convolutional Neural Networks, while the classification model utilizes the separable convolutional 3D model for feature extraction and bidirectional LSTM for temporal processing. Training is conducted on diverse customized datasets with frame-level annotations, incorporating videos from surveillance cameras, human recordings, and the hockey fight, sohas and wvd datasets across various platforms. Additionally, a camera module integrated with a Raspberry Pi is used to capture a live video feed, which is sent to the ML model for processing. The framework thus demonstrates improved performance in terms of computational resource efficiency and accuracy.
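[CV-72] OPTIMUS: Observing Persistent Transformations in Multi-temporal Unlabeled Satellite-data WACV2025
[Quick read]: This paper addresses the difficulty of detecting surface changes with supervised methods when annotated satellite imagery is scarce, especially for rare categories of change. The key to the solution is OPTIMUS, a self-supervised learning method built on one principle: if a model can recover the relative order of images in a time series, the images must contain long-lasting changes. By applying change point detection methods to the model's outputs over a time series, OPTIMUS directly detects changes of interest in satellite imagery and markedly improves the AUROC score.
Link: https://arxiv.org/abs/2506.13902
Authors: Raymond Yu,Paul Han,Josh Myers-Dean,Piper Wolters,Favyen Bastani
Affiliations: University of Washington; Allen Institute for AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2025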
Abstract:In the face of pressing environmental issues in the 21st century, monitoring surface changes on Earth is more important than ever. Large-scale remote sensing, such as satellite imagery, is an important tool for this task. However, using supervised methods to detect changes is difficult because of the lack of satellite data annotated with change labels, especially for rare categories of change. Annotation proves challenging due to the sparse occurrence of changes in satellite images. Even within a vast collection of images, only a small fraction may exhibit persistent changes of interest. To address this challenge, we introduce OPTIMUS, a self-supervised learning method based on an intuitive principle: if a model can recover information about the relative order of images in the time series, then that implies that there are long-lasting changes in the images. OPTIMUS demonstrates this principle by using change point detection methods on model outputs in a time series. We demonstrate that OPTIMUS can directly detect interesting changes in satellite images, achieving an improvement in AUROC score from 56.3% to 87.6% at distinguishing changed time series from unchanged ones compared to baselines. Our code and dataset are available at this https URL.
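The abstract does not say which change point detection method is applied to the model outputs. As a generic illustration only, the sketch below finds the single split of a 1-D score series that maximizes the gap between the mean before and the mean after; the function name and this mean-shift criterion are assumptions, not the OPTIMUS procedure.

```python
def best_change_point(scores):
    """Return (index, gap) for the split maximizing |mean(left) - mean(right)|.

    index is the position where the right segment starts; it is None when the
    series is too short or no split yields a nonzero gap.
    """
    best_idx, best_gap = None, 0.0
    for i in range(1, len(scores)):
        left, right = scores[:i], scores[i:]
        gap = abs(sum(left) / len(left) - sum(right) / len(right))
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx, best_gap
```

A time series could then be flagged as changed when the returned gap exceeds a chosen threshold, which gives a score suitable for AUROC-style evaluation.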
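[CV-73] DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding ICCV2025
[Quick read]: This paper addresses the underuse of multi-modal contrastive pre-training for human activity understanding, in particular cross-modal alignment and joint embedding space learning across LiDAR point clouds, human skeleton poses, inertial measurement unit (IMU) data, and text. The key to the solution is the DeSPITE model, which learns a joint embedding space across these four modalities through noise contrastive estimation, enabling more accurate human activity recognition, retrieval, and temporal moment retrieval.
Link: https://arxiv.org/abs/2506.13897
Authors: Thomas Kreutz,Max Mühlhäuser,Alejandro Sanchez Guinea
Affiliations: Telekooperation Lab, Technical University Darmstadt; NTT DATA, Luxembourg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work is currently under review at ICCV 2025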
Abstract:Despite LiDAR (Light Detection and Ranging) being an effective privacy-preserving alternative to RGB cameras to perceive human activities, it remains largely underexplored in the context of multi-modal contrastive pre-training for human activity understanding (e.g., human activity recognition (HAR), retrieval, or person re-identification (RE-ID)). To close this gap, our work explores learning the correspondence between LiDAR point clouds, human skeleton poses, IMU data, and text in a joint embedding space. More specifically, we present DeSPITE, a Deep Skeleton-Pointcloud-IMU-Text Embedding model, which effectively learns a joint embedding space across these four modalities through noise contrastive estimation. At the heart of our empirical exploration, we have combined the existing LIPD and Babel datasets, which enabled us to synchronize data of all four modalities, allowing us to explore the learning of a new joint embedding space. Our experiments demonstrate novel human activity understanding tasks for point cloud sequences enabled through DeSPITE, including Skeleton-Pointcloud-IMU matching, retrieval, and temporal moment retrieval. Furthermore, we show that DeSPITE is an effective pre-training strategy for point cloud HAR through experiments in MSR-Action3D and HMPEAR.
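To make "noise contrastive estimation" between two modalities concrete, here is a sketch of an InfoNCE-style loss in one direction (modality A to modality B). The abstract does not give the exact objective, so the function name, the `temperature` value, and the convention that matched pairs sit on the diagonal of the similarity matrix are all assumptions.

```python
import math


def info_nce(sim, temperature=0.1):
    """One-directional InfoNCE-style loss.

    sim[i][j] is the similarity between item i of modality A and item j of
    modality B; the matched pair for item i is assumed to be sim[i][i].
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]  # negative log-softmax of the matched pair
    return loss / n
```

With four modalities, a loss of this kind would typically be summed over modality pairs and both directions; that pairing scheme is not specified in the abstract.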
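[CV-74] Fake it till You Make it: Reward Modeling as Discriminative Prediction
[Quick read]: This paper addresses the implementation complexity of conventional reward models used for post-training enhancement of visual generative models. That complexity stems from reliance on large amounts of human-annotated preference data or on carefully engineered quality dimensions, which are often incomplete and costly to engineer. The key to the solution is the GAN-RM framework, which trains the reward model to discriminate between a small set of representative, unpaired target samples (called Preference Proxy Data) and ordinary model-generated outputs. This avoids manual preference annotation and explicit quality dimension engineering, and training needs only a few hundred target samples.
Link: https://arxiv.org/abs/2506.13846
Authors: Runtao Liu,Jiahao Zhan,Yingqing He,Chen Wei,Alan Yuille,Qifeng Chen
Affiliations: HKUST; Fudan University; Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: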
Abstract:An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches to reward modeling suffer from implementation complexity due to their reliance on extensive human-annotated preference data or meticulously engineered quality dimensions that are often incomplete and engineering-intensive. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering. Our method trains the reward model through discrimination between a small set of representative, unpaired target samples (denoted as Preference Proxy Data) and model-generated ordinary outputs, requiring only a few hundred target samples. Comprehensive experiments demonstrate our GAN-RM’s effectiveness across multiple key applications including test-time scaling implemented as Best-of-N sample filtering, post-training approaches like Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
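Best-of-N sample filtering, mentioned above as one use of the reward model, is simple to sketch: generate N candidates and keep the one the reward model scores highest. The function name and the `reward_fn` callable are hypothetical; in the paper's setting the scorer would be the trained GAN-RM discriminator.

```python
def best_of_n(samples, reward_fn):
    """Return the sample with the highest reward; ties go to the earliest sample."""
    if not samples:
        raise ValueError("samples must be non-empty")
    return max(samples, key=reward_fn)
```

For example, `best_of_n(["a", "abc", "ab"], len)` uses string length as a toy reward and picks `"abc"`.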
[CV-75] Hidden Bias in the Machine: Stereotypes in Text-to-Image Models CVPR2025
[Quick read]: This paper examines how Generative AI, in the form of Text-to-Image (T2I) models, may replicate and amplify societal biases. The key to the approach is to build a diverse prompt set spanning many thematic categories, generate a large number of images with the Stable Diffusion 1.5 and Flux-1 models, and analyze them alongside comparison images from Google Image Search, so as to systematically evaluate disparities in how the models represent gender, race, age, somatotype, and other human-centric factors.
Link: https://arxiv.org/abs/2506.13780
Authors: Sedat Porikli,Vedat Porikli
Affiliations: Canyon Crest Academy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Equal contribution by both authors, Published at CVPR 2025 Workshop on Experimental Model Auditing via Controllable Synthesis (EMACS) and Workshop on Demographic Diversity in Computer Vision (DemoDiv)
Abstract:Text-to-Image (T2I) models have transformed visual content creation, producing highly realistic images from natural language prompts. However, concerns persist around their potential to replicate and magnify existing societal biases. To investigate these issues, we curated a diverse set of prompts spanning thematic categories such as occupations, traits, actions, ideologies, emotions, family roles, place descriptions, spirituality, and life events. For each of the 160 unique topics, we crafted multiple prompt variations to reflect a wide range of meanings and perspectives. Using Stable Diffusion 1.5 (UNet-based) and Flux-1 (DiT-based) models with original checkpoints, we generated over 16,000 images under consistent settings. Additionally, we collected 8,000 comparison images from Google Image Search. All outputs were filtered to exclude abstract, distorted, or nonsensical results. Our analysis reveals significant disparities in the representation of gender, race, age, somatotype, and other human-centric factors across generated images. These disparities often mirror and reinforce harmful stereotypes embedded in societal narratives. We discuss the implications of these findings and emphasize the need for more inclusive datasets and development practices to foster fairness in generative visual systems.
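[CV-76] CDST: Color Disentangled Style Transfer for Universal Style Reference Customization
[Quick read]: This paper addresses the difficulty of fully disentangling color from style in conventional style transfer methods, and the challenge of achieving universal style transfer without fine-tuning. The key to the solution is Color Disentangled Style Transfer (CDST), a new two-stream style transfer training paradigm that completely separates color from style and forces the style stream to be blind to color information. CDST also significantly improves style similarity through multi-feature image embeddings compression, and it preserves strong editing capability via a new style definition inspired by the Diffusion UNet disentanglement law.
Link: https://arxiv.org/abs/2506.13770
Authors: Shiwen Zhang,Zhuowei Chen,Lang Chen,Yanze Wu
Affiliations: ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: codes and models will be released if the paper is accepted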
Abstract:We introduce Color Disentangled Style Transfer (CDST), a novel and efficient two-stream style transfer training paradigm which completely isolates color from style and forces the style stream to be color-blinded. With a single model, CDST unlocks universal style transfer capabilities in a tuning-free manner during inference. In particular, characteristics-preserved style transfer with style and content references is solved in a tuning-free way for the first time. CDST significantly improves the style similarity by multi-feature image embeddings compression and preserves strong editing capability via our new CDST style definition inspired by Diffusion UNet disentanglement law. By conducting thorough qualitative and quantitative experiments and human evaluations, we demonstrate that CDST achieves state-of-the-art results on various style transfer tasks.
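[CV-77] Non-planar Object Detection and Identification by Features Matching and Triangulation Growth
[Quick read]: This paper addresses detecting and identifying distorted instances of a given template in a scene image, in particular when geometric models such as homography do not hold, for example when the template is non-planar or is planar but appears deformed in the image. The key to the solution is a feature-based incremental grouping method guided by the Delaunay triangulation of the template features, which evaluates feature matches using local consistency criteria derived from geometric and photometric properties, so that the object can be identified accurately.
Link: https://arxiv.org/abs/2506.13769
Authors: Filippo Leveni
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Master’s thesis at Politecnico di Milano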
Abstract:Object detection and identification is a fundamental topic in the computer vision field; it plays a crucial role in many applications such as object tracking, industrial robot control, image retrieval, etc. We propose a feature-based approach for detecting and identifying distorted occurrences of a given template in a scene image by incremental grouping of feature matches between the image and the template. For this purpose, we use the Delaunay triangulation of template features as a guide for this iterative approach. The triangulation is treated as a graph and, starting from a single triangle, neighboring nodes are considered and the corresponding features are identified; then matches related to them are evaluated to determine whether they are worth grouping. This evaluation is based on local consistency criteria derived from geometric and photometric properties of local features. Our solution allows the identification of the object in situations where geometric models (e.g. homography) do not hold, thus enabling the detection of objects whose template is non-planar or is planar but appears distorted in the image. We show that our approach performs as well as or better than homography-based RANSAC in scenarios in which distortion is nearly absent, while when the deformation becomes relevant, our method shows better description performance.
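[CV-78] Plug-and-Play with 2.5D Artifact Reduction Prior for Fast and Accurate Industrial Computed Tomography Reconstruction
[Quick read]: This paper addresses high-quality 3D reconstruction in sparse-view cone-beam X-ray computed tomography (XCT), in particular the degraded image quality and high computational cost that come from having too few measurements when imaging dense materials. The key to the solution is a plug-and-play (PnP) reconstruction framework that uses a 2.5D artifact reduction convolutional neural network (CNN) as the prior. By drawing on inter-slice information from adjacent slices, the method captures richer spatial context while remaining computationally efficient, which improves reconstruction quality and directly suppresses common XCT artifacts such as beam hardening, removing the need for a separate artifact correction pre-processing step.
Link: https://arxiv.org/abs/2506.14719
Authors: Haley Duba-Sullivan,Aniket Pramanik,Venkatakrishnan Singanallur,Amirkoushyar Ziabari
Affiliations: Oak Ridge National Laboratory
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Journal of Nondestructive Evaluation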
Abstract:Cone-beam X-ray computed tomography (XCT) is an essential imaging technique for generating 3D reconstructions of internal structures, with applications ranging from medical to industrial imaging. Producing high-quality reconstructions typically requires many X-ray measurements; this process can be slow and expensive, especially for dense materials. Recent work incorporating artifact reduction priors within a plug-and-play (PnP) reconstruction framework has shown promising results in improving image quality from sparse-view XCT scans while enhancing the generalizability of deep learning-based solutions. However, this method uses a 2D convolutional neural network (CNN) for artifact reduction, which captures only slice-independent information from the 3D reconstruction, limiting performance. In this paper, we propose a PnP reconstruction method that uses a 2.5D artifact reduction CNN as the prior. This approach leverages inter-slice information from adjacent slices, capturing richer spatial context while remaining computationally efficient. We show that this 2.5D prior not only improves the quality of reconstructions but also enables the model to directly suppress commonly occurring XCT artifacts (such as beam hardening), eliminating the need for artifact correction pre-processing. Experiments on both experimental and synthetic cone-beam XCT data demonstrate that the proposed method better preserves fine structural details, such as pore size and shape, leading to more accurate defect detection compared to 2D priors. In particular, we demonstrate strong performance on experimental XCT data using a 2.5D artifact reduction prior trained entirely on simulated scans, highlighting the proposed method’s ability to generalize across domains.
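A "2.5D" input is commonly built by stacking a slice with its neighbors along the channel axis, so a 2D network sees some through-plane context. The sketch below shows only that stacking step; the function name, the default `radius` of one slice on each side, and the edge-clamping rule are assumptions, since the abstract does not say how many adjacent slices are used or how volume boundaries are handled.

```python
def stack_adjacent_slices(volume, index, radius=1):
    """Return the slices index-radius .. index+radius as a list of channels.

    volume: sequence of 2D slices. Indices past either end are clamped, so
    edge slices are repeated rather than padded with zeros.
    """
    if not volume:
        raise ValueError("volume must contain at least one slice")
    last = len(volume) - 1
    return [volume[min(max(i, 0), last)]
            for i in range(index - radius, index + radius + 1)]
```

The output always has `2 * radius + 1` channels, which keeps the network input shape fixed across the whole volume.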
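[CV-79] MobileHolo: A Lightweight Complex-Valued Deformable CNN for High-Quality Computer-Generated Hologram
[Quick read]: This paper addresses the difficulty of accurately modeling the diffraction process in computer-generated holograms (CGH) caused by an inadequate effective receptive field (ERF). The key to the solution is a complex-valued deformable convolution, integrated into the network, that dynamically adjusts the shape of the convolution kernel, making the ERF more flexible and improving feature extraction.
Link: https://arxiv.org/abs/2506.14542
Authors: Xie Shuyang,Zhou Jie,Xu Bo,Wang Jun,Xu Renjing
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Sichuan University
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 9 figures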
Abstract:Holographic displays have significant potential in virtual reality and augmented reality owing to their ability to provide all the depth cues. Deep learning-based methods play an important role in computer-generated holograms (CGH). During the diffraction process, each pixel exerts an influence on the reconstructed image. However, previous works face challenges in capturing sufficient information to accurately model this process, primarily due to the inadequacy of their effective receptive field (ERF). Here, we designed complex-valued deformable convolution for integration into the network, enabling dynamic adjustment of the convolution kernel’s shape to increase the flexibility of the ERF for better feature extraction. This approach allows us to utilize a single model while achieving state-of-the-art performance in both simulated and optical experiment reconstructions, surpassing existing open-source models. Specifically, our method has a peak signal-to-noise ratio that is 2.04 dB, 5.31 dB, and 9.71 dB higher than that of CCNN-CGH, HoloNet, and Holo-encoder, respectively, when the resolution is 1920 × 1072. The number of parameters of our model is only about one-eighth of that of CCNN-CGH.
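The peak signal-to-noise ratio (PSNR) figures above follow the standard definition, 10 · log10(MAX² / MSE). A minimal sketch on flat pixel lists is given below; the function name and the default `max_val` of 1.0 (images scaled to the range 0 to 1) are assumptions, and the paper's own evaluation code is not shown in the abstract.

```python
import math


def psnr(reference, test, max_val=1.0):
    """PSNR in dB between two equal-length, non-empty flat pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_val ** 2 / mse)
```

Because the scale is logarithmic, a gain of 2.04 dB corresponds to the mean squared error dropping by a factor of about 1.6 (10^0.204).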
[CV-80] Integrating Radiomics with Deep Learning Enhances Multiple Sclerosis Lesion Delineation
[Quick read]: This paper addresses the accuracy of multiple sclerosis (MS) lesion segmentation, where current deep learning methods face robustness challenges. The key to the solution is to fuse radiomic features with raw imaging data and integrate them through deep learning architectures (a ResNeXt-UNet and an attention-augmented U-Net), improving segmentation performance and model stability.
Link: https://arxiv.org/abs/2506.14524
Authors: Nadezhda Alsahanova,Pavel Bartenev,Maksim Sharaev,Milos Ljubisavljevic,Taleb Al. Mansoori,Yauhen Statsenko
Affiliations: Skolkovo Institute of Science and Technology; University of Sharjah; United Arab Emirates University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Background: Accurate lesion segmentation is critical for multiple sclerosis (MS) diagnosis, yet current deep learning approaches face robustness challenges. Aim: This study improves MS lesion segmentation by combining data fusion and deep learning techniques. Materials and Methods: We suggested novel radiomic features (concentration rate and Rényi entropy) to characterize different MS lesion types and fused these with raw imaging data. The study integrated radiomic features with imaging data through a ResNeXt-UNet architecture and attention-augmented U-Net architecture. Our approach was evaluated on scans from 46 patients (1102 slices), comparing performance before and after data fusion. Results: The radiomics-enhanced ResNeXt-UNet demonstrated high segmentation accuracy, achieving significant improvements in precision and sensitivity over the MRI-only baseline and a Dice score of 0.774 ± 0.05; p < 0.001 according to Bonferroni-adjusted Wilcoxon signed-rank tests. The radiomics-enhanced attention-augmented U-Net model showed greater model stability, evidenced by reduced performance variability (SDD = 0.18 ± 0.09 vs. 0.21 ± 0.06; p=0.03) and smoother validation curves with radiomics integration. Conclusion: These results validate our hypothesis that fusing radiomics with raw imaging data boosts segmentation performance and stability in state-of-the-art models.
zh
[CV-81] Towards Reliable WMH Segmentation under Domain Shift: An Application Study using Maximum Entropy Regularization to Improve Uncertainty Estimation
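【Quick read】: This paper addresses model calibration and uncertainty estimation for white matter hyperintensity (WMH) segmentation under domain shift, such as changes in MRI machine type or acquisition parameters. The key to its solution is maximum-entropy regularization, which improves model calibration and strengthens entropy-based uncertainty estimates, so that predictive uncertainty can flag segmentation errors after deployment without ground-truth labels.
Link: https://arxiv.org/abs/2506.14497
Authors: Franco Matzkin,Agostina Larrazabal,Diego H Milone,Jose Dolz,Enzo Ferrante
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 7 figures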
Abstract:Accurate segmentation of white matter hyperintensities (WMH) is crucial for clinical decision-making, particularly in the context of multiple sclerosis. However, domain shifts, such as variations in MRI machine types or acquisition parameters, pose significant challenges to model calibration and uncertainty estimation. This study investigates the impact of domain shift on WMH segmentation by proposing maximum-entropy regularization techniques to enhance model calibration and uncertainty estimation, with the purpose of identifying errors post-deployment using predictive uncertainty as a proxy measure that does not require ground-truth labels. To do this, we conducted experiments using a U-Net architecture to evaluate these regularization schemes on two publicly available datasets, assessing performance with the Dice coefficient, expected calibration error, and entropy-based uncertainty estimates. Our results show that entropy-based uncertainty estimates can anticipate segmentation errors, and that maximum-entropy regularization further strengthens the correlation between uncertainty and segmentation performance while also improving model calibration under domain shift.
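One common form of maximum-entropy regularization subtracts a weighted entropy term from the cross-entropy loss, which discourages over-confident predictions. The sketch below illustrates that general idea in NumPy. The function names and the weight `lam` are assumptions for illustration, and the paper's exact regularization schemes may differ.

```python
import numpy as np


def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def entropy(probs: np.ndarray, axis: int = -1, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy of each predictive distribution (in nats)."""
    return -(probs * np.log(probs + eps)).sum(axis=axis)


def max_entropy_regularized_loss(logits: np.ndarray, labels: np.ndarray,
                                 lam: float = 0.1) -> float:
    """Cross-entropy minus lam * mean entropy.

    logits: (N, C) per-pixel class scores; labels: (N,) integer classes.
    `lam` is a hypothetical hyperparameter, not a value from the paper.
    """
    probs = softmax(logits)
    n = labels.shape[0]
    ce = -np.log(probs[np.arange(n), labels] + 1e-12).mean()
    return float(ce - lam * entropy(probs).mean())


# Entropy doubles as a label-free uncertainty estimate per pixel.
logits = np.array([[4.0, 0.0], [0.2, 0.0], [0.0, 3.0]])
labels = np.array([0, 1, 1])
print("loss:", max_entropy_regularized_loss(logits, labels))
print("per-pixel uncertainty:", entropy(softmax(logits)))
```

The per-pixel entropy values are the kind of signal the abstract describes as a proxy for segmentation errors when no ground truth is available.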
[CV-82] A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning
【Quick read】: This paper addresses the development and benchmarking of self-supervised learning methods for medical imaging at scale. The key to its solution is FOMO60K, a large-scale, heterogeneous brain magnetic resonance imaging (MRI) dataset of 60,529 scans from 13,900 sessions and 11,187 subjects, aggregated from multiple public sources. The dataset includes clinical- and research-grade images, multiple MRI sequences, and wide anatomical and pathological variability. Only minimal preprocessing was applied, preserving the original image characteristics while lowering the barrier to entry for new users. The paper also provides accompanying code for self-supervised pretraining and finetuning to support method development and evaluation.
Link: https://arxiv.org/abs/2506.14432
Authors: Asbjørn Munk,Stefano Cerri,Jakob Ambsdorf,Julia Machnio,Sebastian Nørgaard Llambias,Vardan Nersesjan,Christian Hedeager Krag,Peirong Liu,Pablo Rocamora García,Mostafa Mehdipour Ghazi,Mikael Boesen,Michael Eriksen Benros,Juan Eugenio Iglesias,Mads Nielsen
Affiliations: University of Copenhagen; Pioneer Centre For AI; Copenhagen Research Centre for Biological and Precision Psychiatry; Mental Health Centre Copenhagen; Copenhagen University Hospital; Rigshospitalet; Radiological AI Testcenter; Faculty of Health and Medical Sciences; Athinoula A. Martinos Center for Biomedical Imaging; Massachusetts General Hospital; Harvard Medical School; Johns Hopkins University; Department of Computer Science; Copenhagen University Hospital, Bispebjerg & Frederiksberg Hospital; Department of Clinical Medicine; Massachusetts Institute of Technology; Hawkes Institute, University College London
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present FOMO60K, a large-scale, heterogeneous dataset of 60,529 brain Magnetic Resonance Imaging (MRI) scans from 13,900 sessions and 11,187 subjects, aggregated from 16 publicly available sources. The dataset includes both clinical- and research-grade images, multiple MRI sequences, and a wide range of anatomical and pathological variability, including scans with large brain anomalies. Minimal preprocessing was applied to preserve the original image characteristics while reducing barriers to entry for new users. Accompanying code for self-supervised pretraining and finetuning is provided. FOMO60K is intended to support the development and benchmarking of self-supervised learning methods in medical imaging at scale.
[CV-83] Compressed Video Super-Resolution based on Hierarchical Encoding
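【Quick read】: This paper aims to improve the perceptual quality of compressed video, in particular by repairing the many compression artifacts produced under heavy compression. The key to its solution is a hierarchical-encoding transformer block structure, optimized to remove the wide range of artifacts that H.265/HEVC encoding introduces across quantization parameter (QP) levels, enabling high-quality 4x upscaling of low-resolution video (180p to 720p, or 270p to 1080p).
Link: https://arxiv.org/abs/2506.14381
Authors: Yuxuan Jiang,Siyue Teng,Qiang Zhu,Chen Feng,Chengxi Zeng,Fan Zhang,Shuyuan Zhu,Bing Zeng,David Bull
Affiliations: University of Bristol; University of Electronic Science and Technology of China
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: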
Abstract:This paper presents a general-purpose video super-resolution (VSR) method, dubbed VSR-HE, specifically designed to enhance the perceptual quality of compressed content. Targeting scenarios characterized by heavy compression, the method upscales low-resolution videos by a ratio of four, from 180p to 720p or from 270p to 1080p. VSR-HE adopts hierarchical encoding transformer blocks and has been carefully optimized to eliminate a wide range of compression artifacts commonly introduced by H.265/HEVC encoding across various quantization parameter (QP) levels. To ensure robustness and generalization, the model is trained and evaluated under diverse compression settings, allowing it to effectively restore fine-grained details and preserve visual fidelity. The proposed VSR-HE has been officially submitted to the ICME 2025 Grand Challenge on VSR for Video Conferencing (Team BVI-VSR), under both Track 1 (General-Purpose Real-World Video Content) and Track 2 (Talking Head Videos).
[CV-84] BRISC: Annotated Dataset for Brain Tumor Segmentation and Classification with Swin-HAFNet
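【Quick read】: This paper addresses accurate segmentation and classification of brain tumors from magnetic resonance imaging (MRI), which remains a key challenge in medical image analysis, largely because high-quality, balanced, and diverse datasets are lacking. The key to its solution is a curated MRI dataset of 6,000 contrast-enhanced T1-weighted scans annotated by certified radiologists and physicians, covering three major tumor types (glioma, meningioma, and pituitary) as well as non-tumorous cases, with high-resolution labels and categorization across axial, sagittal, and coronal imaging planes to support robust model development and cross-view generalization. The paper also proposes a transformer-based segmentation model, which achieves the highest weighted mean Intersection-over-Union (IoU), 82.3%, in the benchmark.
Link: https://arxiv.org/abs/2506.14318
Authors: Amirreza Fateh,Yasin Rezvani,Sara Moayedi,Sadjad Rezvani,Fatemeh Fateh,Mansoor Fateh
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: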
Abstract:Accurate segmentation and classification of brain tumors from Magnetic Resonance Imaging (MRI) remain key challenges in medical image analysis, largely due to the lack of high-quality, balanced, and diverse datasets. In this work, we present a new curated MRI dataset designed specifically for brain tumor segmentation and classification tasks. The dataset comprises 6,000 contrast-enhanced T1-weighted MRI scans annotated by certified radiologists and physicians, spanning three major tumor types (glioma, meningioma, and pituitary) as well as non-tumorous cases. Each sample includes high-resolution labels and is categorized across axial, sagittal, and coronal imaging planes to facilitate robust model development and cross-view generalization. To demonstrate the utility of the dataset, we propose a transformer-based segmentation model and benchmark it against established baselines. Our method achieves the highest weighted mean Intersection-over-Union (IoU) of 82.3%, with improvements observed across all tumor categories. Importantly, this study serves primarily as an introduction to the dataset, establishing foundational benchmarks for future research. We envision this dataset as a valuable resource for advancing machine learning applications in neuro-oncology, supporting both academic research and clinical decision-support development. Dataset link: this https URL
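The abstract reports a weighted mean IoU without stating the weighting scheme. The sketch below assumes each class is weighted by its ground-truth pixel count, which is one common choice. The function name, the weighting, and the toy labels are all illustrative assumptions.

```python
import numpy as np


def weighted_mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Per-class IoU averaged with weights proportional to ground-truth pixels.

    Assumption: classes absent from the ground truth get zero weight, and
    `target` contains at least one label in range(num_classes).
    """
    ious, weights = [], []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        support = t.sum()
        if support == 0:
            continue  # class not present in the reference labels
        union = np.logical_or(p, t).sum()
        ious.append(np.logical_and(p, t).sum() / union)
        weights.append(support)
    weights = np.asarray(weights, dtype=float)
    return float(np.dot(ious, weights) / weights.sum())


# Toy flattened label maps (hypothetical data).
pred = np.array([0, 0, 1, 1])
target = np.array([0, 1, 1, 1])
print(f"weighted mIoU: {weighted_mean_iou(pred, target, num_classes=2):.3f}")
```

With this weighting, frequent classes dominate the score, so an unweighted mean over classes can differ noticeably on imbalanced data.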
[CV-85] orGAN: A Synthetic Data Augmentation Pipeline for Simultaneous Generation of Surgical Images and Ground Truth Labels
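【Quick read】: This paper addresses obstacles facing deep learning in medical imaging: limited data diversity, ethical issues, high acquisition costs, and the need for precise annotations. Bleeding detection and localization during surgery is especially difficult because high-quality datasets that reflect real surgical scenarios are scarce. The key to its solution is orGAN, a Generative Adversarial Network (GAN)-based system that generates high-fidelity, annotated surgical images of bleeding. It leverages small "mimicking organ" datasets, synthetic models that replicate tissue properties and bleeding, which reduces ethical concerns and data-collection costs. orGAN combines StyleGAN with Relational Positional Learning to simulate bleeding events realistically, and a LaMa-based inpainting module restores clean pre-bleed visuals, yielding precise pixel-level annotations.
Link: https://arxiv.org/abs/2506.14303
Authors: Niran Nataraj,Maina Sogabe,Kenji Kawashima
Affiliations: The University of Tokyo
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 7 figures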
Abstract:Deep learning in medical imaging faces obstacles: limited data diversity, ethical issues, high acquisition costs, and the need for precise annotations. Bleeding detection and localization during surgery is especially challenging due to the scarcity of high-quality datasets that reflect real surgical scenarios. We propose orGAN, a GAN-based system for generating high-fidelity, annotated surgical images of bleeding. By leveraging small “mimicking organ” datasets, synthetic models that replicate tissue properties and bleeding, our approach reduces ethical concerns and data-collection costs. orGAN builds on StyleGAN with Relational Positional Learning to simulate bleeding events realistically and mark bleeding coordinates. A LaMa-based inpainting module then restores clean, pre-bleed visuals, enabling precise pixel-level annotations. In evaluations, a balanced dataset of orGAN and mimicking-organ images achieved 90% detection accuracy in surgical settings and up to 99% frame-level accuracy. While our development data lack diverse organ morphologies and contain intraoperative artifacts, orGAN markedly advances ethical, efficient, and cost-effective creation of realistic annotated bleeding datasets, supporting broader integration of AI in surgical practice.
[CV-86] Latent Anomaly Detection: Masked VQ-GAN for Unsupervised Segmentation in Medical CBCT
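【Quick read】: This paper addresses the scarcity of labeled data in imaging of osteoradionecrosis of the jaw (ONJ), which makes supervised training impractical. The key to its solution is an unsupervised approach that identifies anomalies in imaging scans automatically through a two-stage training pipeline. In the first stage, a VQ-GAN is trained to accurately reconstruct normal subjects. In the second stage, random cube masking and ONJ-specific masking are used to train a new encoder to recover the data, which enables segmentation.
Link: https://arxiv.org/abs/2506.14209
Authors: Pengwei Wang
Affiliations: National University of Singapore
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: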
Abstract:Advances in treatment technology now allow for the use of customizable 3D-printed hydrogel wound dressings for patients with osteoradionecrosis (ORN) of the jaw (ONJ). Meanwhile, deep learning has enabled precise segmentation of 3D medical images using tools like nnUNet. However, the scarcity of labeled data in ONJ imaging makes supervised training impractical. This study aims to develop an unsupervised training approach for automatically identifying anomalies in imaging scans. We propose a novel two-stage training pipeline. In the first stage, a VQ-GAN is trained to accurately reconstruct normal subjects. In the second stage, random cube masking and ONJ-specific masking are applied to train a new encoder capable of recovering the data. The proposed method achieves successful segmentation on both simulated and real patient data. This approach provides a fast initial segmentation solution, reducing the burden of manual labeling. Additionally, it has the potential to be directly used for 3D printing when combined with hand-tuned post-processing. 
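The random cube masking step in the second stage can be pictured as zeroing out randomly placed cubes in a 3D array. The sketch below is a generic illustration on a NumPy volume. The function name, the cube size, the number of cubes, and the fill value are assumptions, and the abstract does not say whether masking is applied to the image volume or to a latent grid.

```python
import numpy as np


def random_cube_mask(volume: np.ndarray, cube_size: int, num_cubes: int,
                     rng: np.random.Generator, fill_value: float = 0.0):
    """Return a masked copy of a 3D array plus the boolean mask that was applied.

    Cubes may overlap, so the masked voxel count is at most
    num_cubes * cube_size**3.
    """
    masked = volume.copy()
    mask = np.zeros(volume.shape, dtype=bool)
    depth, height, width = volume.shape
    for _ in range(num_cubes):
        # The upper bound of Generator.integers is exclusive, so every cube
        # fits fully inside the volume.
        z = rng.integers(0, depth - cube_size + 1)
        y = rng.integers(0, height - cube_size + 1)
        x = rng.integers(0, width - cube_size + 1)
        mask[z:z + cube_size, y:y + cube_size, x:x + cube_size] = True
    masked[mask] = fill_value
    return masked, mask


rng = np.random.default_rng(0)
volume = np.ones((16, 16, 16), dtype=np.float32)  # stand-in for a CBCT patch
masked, mask = random_cube_mask(volume, cube_size=4, num_cubes=3, rng=rng)
print("masked voxels:", int(mask.sum()))
```

A recovery model would then be trained to reproduce `volume` from `masked`, with the mask indicating where reconstruction is being tested.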
[CV-87] Reliable Noninvasive Glucose Sensing via CNN-Based Spectroscopy ALT
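【Quick read】: This paper addresses the clinical need for continuous non-invasive glucose monitoring, where the core challenge is combining high accuracy, low cost, and wearable integration. The key to its solution is a dual-modal AI framework based on short-wave infrared (SWIR) spectroscopy. The first modality uses a multi-wavelength SWIR imaging system with convolutional neural networks (CNNs) to extract spatial features linked to glucose absorption. The second uses a compact photodiode voltage sensor and machine learning regressors (e.g., random forest) on normalized optical signals, balancing clinical accuracy against cost efficiency and wearability.
Link: https://arxiv.org/abs/2506.13819
Authors: El Arbi Belfarsi,Henry Flores,Maria Valero
Affiliations: Kennesaw State University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to the IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2025)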
Abstract:In this study, we present a dual-modal AI framework based on short-wave infrared (SWIR) spectroscopy. The first modality employs a multi-wavelength SWIR imaging system coupled with convolutional neural networks (CNNs) to capture spatial features linked to glucose absorption. The second modality uses a compact photodiode voltage sensor and machine learning regressors (e.g., random forest) on normalized optical signals. Both approaches were evaluated on synthetic blood phantoms and skin-mimicking materials across physiological glucose levels (70 to 200 mg/dL). The CNN achieved a mean absolute percentage error (MAPE) of 4.82% at 650 nm with 100% Zone A coverage in the Clarke Error Grid, while the photodiode system reached 86.4% Zone A accuracy. This framework constitutes a state-of-the-art solution that balances clinical accuracy, cost efficiency, and wearable integration, paving the way for reliable continuous non-invasive glucose monitoring.
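The two metrics quoted in this abstract can be computed from paired reference and predicted glucose values. The sketch below uses the usual MAPE formula and the commonly cited Zone A rule of the Clarke Error Grid (prediction within 20% of the reference, or both values below 70 mg/dL). The function names and sample readings are hypothetical, and the paper's own evaluation code may differ.

```python
import numpy as np


def mape(reference, predicted) -> float:
    """Mean absolute percentage error, in percent."""
    reference = np.asarray(reference, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(predicted - reference) / reference) * 100.0)


def clarke_zone_a_fraction(reference, predicted) -> float:
    """Fraction of readings in Zone A of the Clarke Error Grid.

    Zone A rule assumed here: |pred - ref| <= 20% of ref, or both values
    are below 70 mg/dL.
    """
    reference = np.asarray(reference, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    within_20_percent = np.abs(predicted - reference) <= 0.2 * reference
    both_hypoglycemic = (reference < 70.0) & (predicted < 70.0)
    return float(np.mean(within_20_percent | both_hypoglycemic))


# Hypothetical readings in mg/dL, not data from the paper.
reference = [100.0, 200.0, 70.0]
predicted = [110.0, 150.0, 77.0]
print(f"MAPE: {mape(reference, predicted):.2f}%")
print(f"Zone A fraction: {clarke_zone_a_fraction(reference, predicted):.2f}")
```

Zone A coverage and MAPE answer different questions: the first counts clinically acceptable readings, the second averages relative error, so one can be high while the other is mediocre.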
[CV-88] BraTS orchestrator : Democratizing and Disseminating state-of-the-art brain tumor image analysis
【Quick read】: This paper addresses the limited adoption, in both scientific and clinical communities, of algorithms and models developed through the Brain Tumor Segmentation (BraTS) challenges. The key to its solution is BraTS orchestrator, an open-source Python package that provides seamless access to state-of-the-art segmentation and synthesis algorithms for diverse brain tumors. By abstracting the complexities of deep learning, it lets non-expert users deploy winning BraTS algorithms, promoting their dissemination and use across neuro-radiology and neuro-oncology.
Link: https://arxiv.org/abs/2506.13807
Authors: Florian Kofler,Marcel Rosier,Mehdi Astaraki,Ujjwal Baid,Hendrik Möller,Josef A. Buchner,Felix Steinbauer,Eva Oswald,Ezequiel de la Rosa,Ivan Ezhov,Constantin von See,Jan Kirschke,Anton Schmick,Sarthak Pati,Akis Linardos,Carla Pitarch,Sanyukta Adap,Jeffrey Rudie,Maria Correia de Verdier,Rachit Saluja,Evan Calabrese,Dominic LaBella,Mariam Aboian,Ahmed W. Moawad,Nazanin Maleki,Udunna Anazodo,Maruf Adewole,Marius George Linguraru,Anahita Fathi Kazerooni,Zhifan Jiang,Gian Marco Conte,Hongwei Li,Juan Eugenio Iglesias,Spyridon Bakas,Benedikt Wiestler,Marie Piraud,Bjoern Menze
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 27p, 2figs, 3tabs
Abstract:The Brain Tumor Segmentation (BraTS) cluster of challenges has significantly advanced brain tumor image analysis by providing large, curated datasets and addressing clinically relevant tasks. However, despite its success and popularity, algorithms and models developed through BraTS have seen limited adoption in both scientific and clinical communities. To accelerate their dissemination, we introduce BraTS orchestrator, an open-source Python package that provides seamless access to state-of-the-art segmentation and synthesis algorithms for diverse brain tumors from the BraTS challenge ecosystem. Available on GitHub (this https URL), the package features intuitive tutorials designed for users with minimal programming experience, enabling both researchers and clinicians to easily deploy winning BraTS algorithms for inference. By abstracting the complexities of modern deep learning, BraTS orchestrator democratizes access to the specialized knowledge developed within the BraTS community, making these advances readily available to broader neuro-radiology and neuro-oncology audiences.
Artificial Intelligence
[AI-0] Exploring Speaker Diarization with Mixture of Experts
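【Quick read】: This paper addresses speaker diarization, that is, determining which speech segments belong to which speaker in multi-speaker recordings. The key to its solution is a novel neural speaker diarization system using memory-aware multi-speaker embedding with a sequence-to-sequence architecture (NSD-MS2S): a memory module enhances speaker embeddings, and a Seq2Seq framework efficiently maps acoustic features to speaker labels. The paper also introduces a Shared and Soft Mixture of Experts (SS-MoE) module to mitigate model bias and further improve performance.
Link: https://arxiv.org/abs/2506.14750
Authors: Gaobin Yang,Maokui He,Shutong Niu,Ruoyu Wang,Hang Chen,Jun Du
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: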
Abstract:In this paper, we propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates a memory-aware multi-speaker embedding module with a sequence-to-sequence architecture. The system leverages a memory module to enhance speaker embeddings and employs a Seq2Seq framework to efficiently map acoustic features to speaker labels. Additionally, we explore the application of mixture of experts in speaker diarization, and introduce a Shared and Soft Mixture of Experts (SS-MoE) module to further mitigate model bias and enhance performance. Incorporating SS-MoE leads to the extended model NSD-MS2S-SSMoE. Experiments on multiple complex acoustic datasets, including CHiME-6, DiPCo, Mixer 6 and DIHARD-III evaluation sets, demonstrate meaningful improvements in robustness and generalization. The proposed methods achieve state-of-the-art results, showcasing their effectiveness in challenging real-world scenarios.
[AI-1] AgentDistill: Training-Free Agent Distillation with Generalizable MCP Boxes
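【Quick read】: This paper addresses how to distill agents based on large language models (LLMs), which involve planning, memory, and tool use. Existing distillation methods often fail to produce student agents that plan and act dynamically in novel environments. The key to its solution is AgentDistill, a training-free agent distillation framework that transfers knowledge efficiently and scalably by directly reusing Model-Context-Protocols (MCPs), structured and reusable task-solving modules generated autonomously by teacher agents.
Link: https://arxiv.org/abs/2506.14728
Authors: Jiahao Qiu,Xinzhe Juan,Yimin Wang,Ling Yang,Xuan Qi,Tongcheng Zhang,Jiacheng Guo,Yifu Lu,Zixin Yao,Hongru Wang,Shilong Liu,Xun Jiang,Liu Leqi,Mengdi Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures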
Abstract:While knowledge distillation has become a mature field for compressing large language models (LLMs) into smaller ones by aligning their outputs or internal representations, the distillation of LLM-based agents, which involve planning, memory, and tool use, remains relatively underexplored. Existing agent distillation methods typically replay full teacher trajectories or imitate step-by-step teacher tool usage, but they often struggle to train student agents to dynamically plan and act in novel environments. We propose AgentDistill, a novel, training-free agent distillation framework that enables efficient and scalable knowledge transfer via direct reuse of Model-Context-Protocols (MCPs), which are structured and reusable task-solving modules autonomously generated by teacher agents. The reuse of these distilled MCPs enables student agents to generalize their capabilities across domains and solve new problems with minimal supervision or human intervention. Experiments on biomedical and mathematical benchmarks demonstrate that our distilled student agents, built on small language models, can achieve performance comparable to advanced systems using large LLMs such as OctoTools (GPT-4o), highlighting the effectiveness of our framework in building scalable and cost-efficient intelligent agents.
[AI-2] Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models
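【Quick read】: This paper addresses a central challenge in real-world assistive teleoperation: the robot must infer a wide range of human intentions from user control inputs and assist with correct actions. Existing methods are confined to simple predefined scenarios or to task-specific data distributions, which limits their support for real-world assistance. The key to its solution is Casper, a system that uses commonsense knowledge embedded in pre-trained visual language models (VLMs) for real-time intent inference and flexible skill execution. It comprises an open-world perception module, a VLM-powered intent inference mechanism, and an expanded skill library, improving task performance, reducing human cognitive load, and raising user satisfaction.
Link: https://arxiv.org/abs/2506.14727
Authors: Huihan Liu,Rutav Shah,Shuijing Liu,Jack Pittenger,Mingyo Seo,Yuchen Cui,Yonatan Bisk,Roberto Martín-Martín,Yuke Zhu
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: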
Abstract:Assistive teleoperation, where control is shared between a human and a robot, enables efficient and intuitive human-robot collaboration in diverse and unstructured environments. A central challenge in real-world assistive teleoperation is for the robot to infer a wide range of human intentions from user control inputs and to assist users with correct actions. Existing methods are either confined to simple, predefined scenarios or restricted to task-specific data distributions at training, limiting their support for real-world assistance. We introduce Casper, an assistive teleoperation system that leverages commonsense knowledge embedded in pre-trained visual language models (VLMs) for real-time intent inference and flexible skill execution. Casper incorporates an open-world perception module for a generalized understanding of novel objects and scenes, a VLM-powered intent inference mechanism that leverages commonsense reasoning to interpret snippets of teleoperated user input, and a skill library that expands the scope of prior assistive teleoperation systems to support diverse, long-horizon mobile manipulation tasks. Extensive empirical evaluation, including human studies and system ablations, demonstrates that Casper improves task performance, reduces human cognitive load, and achieves higher user satisfaction than direct teleoperation and assistive teleoperation baselines.
[AI-3] Adaptive Accompaniment with ReaLchords ICML2024
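【Quick read】: This paper addresses the inability of current generative music models to improvise accompaniment in real time, that is, to generate online, simultaneously with other musicians (human or otherwise). The key to its solution is ReaLchords, an online generative model pretrained by maximum likelihood and finetuned with reinforcement learning for online use. The finetuning objective combines a novel reward model that scores harmonic and temporal coherence between melody and chords with a novel distillation term from a teacher model that can see the future melody, so that the model adapts well to unfamiliar input and produces fitting accompaniment.
Link: https://arxiv.org/abs/2506.14723
Authors: Yusong Wu,Tim Cooijmans,Kyle Kastner,Adam Roberts,Ian Simon,Alexander Scarlatos,Chris Donahue,Cassie Tarakajian,Shayegan Omidshafiei,Aaron Courville,Pablo Samuel Castro,Natasha Jaques,Cheng-Zhi Anna Huang
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2024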
Abstract:Jamming requires coordination, anticipation, and collaborative creativity between musicians. Current generative models of music produce expressive output but are not able to generate in an online manner, meaning simultaneously with other musicians (human or otherwise). We propose ReaLchords, an online generative model for improvising chord accompaniment to user melody. We start with an online model pretrained by maximum likelihood, and use reinforcement learning to finetune the model for online use. The finetuning objective leverages both a novel reward model that provides feedback on both harmonic and temporal coherency between melody and chord, and a divergence term that implements a novel type of distillation from a teacher model that can see the future melody. Through quantitative experiments and listening tests, we demonstrate that the resulting model adapts well to unfamiliar input and produces fitting accompaniment. ReaLchords opens the door to live jamming, as well as simultaneous co-creation in other modalities.
[AI-4] Refining music sample identification with a self-supervised graph neural network
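【Quick read】: This paper addresses automatic sample identification (ASID): detecting and identifying portions of audio recordings that have been reused in new musical works, a challenging task in audio query-based retrieval. Audio fingerprinting has made significant progress under "real world" (noisy, reverberant) conditions, but ASID systems struggle with samples altered by music production transformations such as time-stretching, pitch-shifting, effects processing, and underlying or overlaying music. The proposed solution is a lightweight, scalable encoding architecture whose core is a Graph Neural Network within a contrastive learning framework. It uses only 9% of the trainable parameters of the current state-of-the-art system while achieving comparable performance, with a mean average precision (mAP) of 44.2%. To improve retrieval quality, the paper also proposes a two-stage approach: an initial coarse similarity search, followed by a cross-attention classifier that rejects irrelevant matches and refines the ranking of retrieved candidates.
Link: https://arxiv.org/abs/2506.14684
Authors: Aditya Bhattacharjee,Ivan Meresman Higgs,Mark Sandler,Emmanouil Benetos
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted at International Conference for Music Information Retrieval (ISMIR) 2025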
Abstract:Automatic sample identification (ASID), the detection and identification of portions of audio recordings that have been reused in new musical works, is an essential but challenging task in the field of audio query-based retrieval. While a related task, audio fingerprinting, has made significant progress in accurately retrieving musical content under “real world” (noisy, reverberant) conditions, ASID systems struggle to identify samples that have undergone musical modifications. Thus, a system robust to common music production transformations such as time-stretching, pitch-shifting, effects processing, and underlying or overlaying music is an important open challenge. In this work, we propose a lightweight and scalable encoding architecture employing a Graph Neural Network within a contrastive learning framework. Our model uses only 9% of the trainable parameters compared to the current state-of-the-art system while achieving comparable performance, reaching a mean average precision (mAP) of 44.2%. To enhance retrieval quality, we introduce a two-stage approach consisting of an initial coarse similarity search for candidate selection, followed by a cross-attention classifier that rejects irrelevant matches and refines the ranking of retrieved candidates - an essential capability absent in prior models. In addition, because queries in real-world applications are often short in duration, we benchmark our system for short queries using new fine-grained annotations for the Sample100 dataset, which we publish as part of this work.
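Mean average precision (mAP), the retrieval metric quoted in this abstract, averages per-query average precision over all queries. The sketch below shows a standard computation from ranked relevance flags. It assumes every relevant item appears somewhere in each ranked list, and the function names and toy rankings are illustrative, not the paper's evaluation code.

```python
import numpy as np


def average_precision(ranked_relevance) -> float:
    """Average precision for one query.

    ranked_relevance: 0/1 flags in retrieval order (1 = correct match).
    Assumption: all relevant items are present in the ranked list.
    """
    rel = np.asarray(ranked_relevance, dtype=bool)
    if rel.sum() == 0:
        return 0.0
    hits = np.cumsum(rel)                   # relevant items seen so far
    ranks = np.arange(1, len(rel) + 1)      # 1-based rank positions
    precision_at_k = hits / ranks
    return float(precision_at_k[rel].mean())


def mean_average_precision(all_rankings) -> float:
    """Mean of per-query average precision."""
    return float(np.mean([average_precision(r) for r in all_rankings]))


# Two hypothetical queries with their ranked candidate lists.
rankings = [[1, 0, 1], [0, 1]]
print(f"mAP: {mean_average_precision(rankings):.3f}")
```

A second-stage classifier that rejects irrelevant matches would raise this score by moving true matches up the ranked lists.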
[AI-5] Unified Software Engineering agent as AI Software Engineer
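【Quick read】: This paper addresses the fact that current uses of large language models (LLMs) in software engineering are limited to specific tasks and lack the unified capabilities needed for complex software development scenarios. The key to its solution is a Unified Software Engineering agent (USEagent) that orchestrates and handles multiple capabilities, so it can tackle complex tasks such as fixing an incomplete patch, adding new features, or taking over code written by others. To evaluate USEagent, the authors also build a Unified Software Engineering bench (USEbench) covering a variety of tasks. Experiments show that USEagent has improved efficacy compared to existing general agents.
Link: https://arxiv.org/abs/2506.14683
Authors: Leonhard Applis,Yuntong Zhang,Shanchao Liang,Nan Jiang,Lin Tan,Abhik Roychoudhury
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Leonhard Applis and Yuntong Zhang contributed equally to this work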
Abstract:The growth of Large Language Model (LLM) technology has raised expectations for automated coding. However, software engineering is more than coding and is concerned with activities including maintenance and evolution of a project. In this context, the concept of LLM agents has gained traction, which utilize LLMs as reasoning engines to invoke external tools autonomously. But is an LLM agent the same as an AI software engineer? In this paper, we seek to understand this question by developing a Unified Software Engineering agent or USEagent. Unlike existing work which builds specialized agents for specific software tasks such as testing, debugging, and repair, our goal is to build a unified agent which can orchestrate and handle multiple capabilities. This gives the agent the promise of handling complex scenarios in software development such as fixing an incomplete patch, adding new features, or taking over code written by others. We envision USEagent as the first draft of a future AI Software Engineer which can be a team member in future software development teams involving both AI and humans. To evaluate the efficacy of USEagent, we build a Unified Software Engineering bench (USEbench) comprising myriad tasks such as coding, testing, and patching. USEbench is a judicious mixture of tasks from existing benchmarks such as SWE-bench, SWT-bench, and REPOCOD. In an evaluation on USEbench consisting of 1,271 repository-level software engineering tasks, USEagent shows improved efficacy compared to existing general agents such as OpenHands CodeActAgent. There exist gaps in the capabilities of USEagent for certain coding tasks, which provide hints on further developing the AI Software Engineer of the future.
[AI-6] Design an Editable Speech-to-Sign-Language Transformer System: A Human-Centered AI Approach
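【Quick read】: This paper addresses shortcomings of conventional sign language technology in naturalness, expressiveness, and user agency, and the lack of an intermediate layer that users can edit and interpret. The key to its solution is a human-centered, real-time, user-adaptive speech-to-sign-language animation system that combines Transformer-based motion generation with a transparent, editable JSON intermediate layer, letting users inspect and modify sign segments directly and thereby improving naturalness, expressiveness, and user involvement.
Link: https://arxiv.org/abs/2506.14677
Authors: Yingchao Li
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: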
Abstract:This paper presents a human-centered, real-time, user-adaptive speech-to-sign language animation system that integrates Transformer-based motion generation with a transparent, user-editable JSON intermediate layer. The framework overcomes key limitations in prior sign language technologies by enabling direct user inspection and modification of sign segments, thus enhancing naturalness, expressiveness, and user agency. Leveraging a streaming Conformer encoder and autoregressive Transformer-MDN decoder, the system synchronizes spoken input into upper-body and facial motion for 3D avatar rendering. Edits and user ratings feed into a human-in-the-loop optimization loop for continuous improvement. Experiments with 20 deaf signers and 5 interpreters show that the editable interface and participatory feedback significantly improve comprehension, naturalness, usability, and trust, while lowering cognitive load. With sub-20 ms per-frame inference on standard hardware, the system is ready for real-time communication and education. This work illustrates how technical and participatory innovation together enable accessible, explainable, and user-adaptive AI for sign language technology.
[AI-7] StreetLens: Enabling Human-Centered AI Agents for Neighborhood Assessment from Street View Imagery
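【Quick read】: This paper addresses the drawbacks of traditional neighborhood research methods for identifying and analyzing environmental characteristics: they are time-consuming, depend on expert intervention, and adapt poorly across research designs and geographic contexts. The key to its solution is StreetLens, a researcher-centered, configurable workflow that embeds social science expertise in a vision-language model (VLM) for scalable neighborhood environmental assessment. StreetLens grounds the analysis in questions derived from established interview protocols, retrieves relevant street view imagery, and generates a wide range of semantic annotations from objective features to subjective perceptions, placing domain knowledge at the core of the analysis. It also supports integrating prior survey data to improve robustness and broaden applicability.
Link: https://arxiv.org/abs/2506.14670
Authors: Jina Kim,Leeje Jang,Yao-Yi Chiang,Guanyu Wang,Michelle Pasco
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: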
Abstract:Traditionally, neighborhood studies have employed interviews, surveys, and manual image annotation guided by detailed protocols to identify environmental characteristics, including physical disorder, decay, street safety, and sociocultural symbols, and to examine their impact on developmental and health outcomes. While these methods yield rich insights, they are time-consuming and require intensive expert intervention. Recent technological advances, including vision-language models (VLMs), have begun to automate parts of this process; however, existing efforts are often ad hoc and lack adaptability across research designs and geographic contexts. In this demo paper, we present StreetLens, a human-centered, researcher-configurable workflow that embeds relevant social science expertise in a VLM for scalable neighborhood environmental assessments. StreetLens mimics the process of trained human coders by grounding the analysis in questions derived from established interview protocols, retrieving relevant street view imagery (SVI), and generating a wide spectrum of semantic annotations from objective features (e.g., the number of cars) to subjective perceptions (e.g., the sense of disorder in an image). By enabling researchers to define the VLM’s role through domain-informed prompting, StreetLens places domain knowledge at the core of the analysis process. It also supports the integration of prior survey data to enhance robustness and expand the range of characteristics assessed across diverse settings. We provide a Google Colab notebook to make StreetLens accessible and extensible for researchers working with public or custom SVI datasets. StreetLens represents a shift toward flexible, agentic AI systems that work closely with researchers to accelerate and scale neighborhood studies.
[AI-8] Rigor in AI: Doing Rigorous AI Work Requires a Broader Responsible AI-Informed Conception of Rigor
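【Quick read】: This paper addresses the overly narrow understanding of rigor in current Artificial Intelligence (AI) research and practice, which centers on methodological rigor and has contributed to concerns raised by the responsible AI community, such as overblown claims about AI capabilities. The key to its proposal is a broader conception of rigor that covers not only methodological rigor but also background knowledge (epistemic rigor), normative influences (normative rigor), clarity of theoretical constructs (conceptual rigor), reporting quality (reporting rigor), and how well inferences are supported (interpretative rigor), with the aim of giving the AI field's many stakeholders a comprehensive framework for dialogue.
Link: https://arxiv.org/abs/2506.14652
Authors: Alexandra Olteanu,Su Lin Blodgett,Agathe Balayn,Angelina Wang,Fernando Diaz,Flavio du Pin Calmon,Margaret Mitchell,Michael Ekstrand,Reuben Binns,Solon Barocas
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 1 figure, 1 table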
Abstract:In AI research and practice, rigor remains largely understood in terms of methodological rigor – such as whether mathematical, statistical, or computational methods are correctly applied. We argue that this narrow conception of rigor has contributed to the concerns raised by the responsible AI community, including overblown claims about AI capabilities. Our position is that a broader conception of what rigorous AI research and practice should entail is needed. We believe such a conception – in addition to a more expansive understanding of (1) methodological rigor – should include aspects related to (2) what background knowledge informs what to work on (epistemic rigor); (3) how disciplinary, community, or personal norms, standards, or beliefs influence the work (normative rigor); (4) how clearly articulated the theoretical constructs under use are (conceptual rigor); (5) what is reported and how (reporting rigor); and (6) how well-supported the inferences from existing evidence are (interpretative rigor). In doing so, we also aim to provide useful language and a framework for much-needed dialogue about the AI community’s work by researchers, policymakers, journalists, and other stakeholders.
[AI-9] SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning
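【Quick read】: This paper addresses the low human-feedback efficiency and poor sample efficiency of Preference-based Reinforcement Learning (PbRL). The key to its solution is a new method called SENIOR, built on two mechanisms. The first is Motion-Distinction-based Selection (MDS), which uses kernel density estimation over states to select behavior segment pairs with clearly different motion, making human preference labeling more efficient. The second is Preference-guided Exploration (PGE), which encourages the agent to explore states with high preference and low visit counts, continuously guiding it toward valuable samples. Together, the two mechanisms significantly accelerate reward-model and policy learning.
Link: https://arxiv.org/abs/2506.14648
Authors: Hexian Ni,Tao Lu,Haoyuan Hu,Yinghao Cai,Shuo Wang
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 8 figures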
Abstract:Preference-based Reinforcement Learning (PbRL) methods provide a solution to avoid reward engineering by learning reward models based on human preferences. However, poor feedback efficiency and sample efficiency remain problems that hinder the application of PbRL. In this paper, we present a novel efficient query selection and preference-guided exploration method, called SENIOR, which selects meaningful and easy-to-compare behavior segment pairs to improve human feedback efficiency and accelerates policy learning with the designed preference-guided intrinsic rewards. Our key idea is twofold: (1) We designed a Motion-Distinction-based Selection scheme (MDS). It selects segment pairs with apparent motion and different directions through kernel density estimation of states, which is more task-related and easier for human preference labeling; (2) We proposed a novel preference-guided exploration method (PGE). It encourages exploration towards states with high preference and low visits and continuously guides the agent toward valuable samples. The synergy between the two mechanisms could significantly accelerate the progress of reward and policy learning. Our experiments show that SENIOR outperforms five other existing methods in both human feedback efficiency and policy convergence speed on six complex robot manipulation tasks in simulation and four real-world tasks.
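To make the idea of rewarding "high preference and low visits" concrete, here is a minimal sketch that scores a state by the ratio of two kernel density estimates. This is an assumed form chosen for illustration, not the intrinsic reward defined in the paper: the one-dimensional states, the bandwidth, the ratio itself, and the helper names `gaussian_kde` and `intrinsic_reward` are all hypothetical.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

def intrinsic_reward(state, preferred_density, visit_density, eps=1e-6):
    """Assumed bonus: large where preferred states are dense and visits are sparse."""
    return preferred_density(state) / (visit_density(state) + eps)

# Toy data: the agent has mostly visited states near 0,
# while states near 1 come from segments that humans preferred.
visited = [0.0, 0.1, 0.2, 0.1, 0.0, 0.15]
preferred = [0.9, 1.0, 1.1]

pref_kde = gaussian_kde(preferred, bandwidth=0.2)
visit_kde = gaussian_kde(visited, bandwidth=0.2)

near_goal = intrinsic_reward(1.0, pref_kde, visit_kde)
near_start = intrinsic_reward(0.1, pref_kde, visit_kde)
print(near_goal > near_start)
```

In this toy setting the bonus is much larger near the preferred but rarely visited region, which is the qualitative behavior the abstract describes.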
[AI-10] Navigating the growing field of research on AI for software testing – the taxonomy for AI-augmented software testing and an ontology-driven literature survey
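【Quick read】: This paper addresses the considerable effort required to design, develop, maintain, and evolve test automation, and asks how Artificial Intelligence (AI) can improve testing efficiency and effectiveness. The key to its approach is to review AI augmentation of software test automation across the range from no automation to full automation, discuss new forms of testing made possible by AI, and present a new taxonomy, ai4st, to classify recent research systematically and identify open research questions.
Link: https://arxiv.org/abs/2506.14640
Authors: Ina K. Schieferdecker
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures, 1 table, 2 listings (will be presented at FMICS 2025)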
Abstract:In industry, software testing is the primary method to verify and validate the functionality, performance, security, usability, and so on, of software-based systems. Test automation has gained increasing attention in industry over the last decade, following decades of intense research into test automation and model-based testing. However, designing, developing, maintaining and evolving test automation is a considerable effort. Meanwhile, AI’s breakthroughs in many engineering fields are opening up new perspectives for software testing, for both manual and automated testing. This paper reviews recent research on AI augmentation in software test automation, from no automation to full automation. It also discusses new forms of testing made possible by AI. Based on this, the newly developed taxonomy, ai4st, is presented and used to classify recent research and identify open research questions.
[AI-11] ACM Survey Draft on Formalising Software Requirements with Large Language Models
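【Quick read】: This paper surveys 94 related papers and focuses on key areas including Traceability of Software Requirements, Formal Methods and Its Tools, Unifying Theories of Programming (UTP), and Theory of Institutions. The key to its approach is to systematically consolidate and analyze existing research and to provide structured summaries and comparisons in support of research and practice at the intersection of software engineering and AI.
Link: https://arxiv.org/abs/2506.14627
Authors: Arshad Beg,Diarmuid O’Donoghue,Rosemary Monahan
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 22 pages. 6 summary tables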
Abstract:This draft is a working document summarizing ninety-four (94) papers, with additional sections on Traceability of Software Requirements (Section 4), Formal Methods and Its Tools (Section 5), and Unifying Theories of Programming (UTP) and Theory of Institutions (Section 6). Please refer to the abstracts of [7,8]. The key differences between this draft and our recent papers with similar titles, i.e. AACS 2025 [7] and SAIV 2025 [8], are as follows. [7] is a two-page submission to the ADAPT Annual Conference, Ireland. Submitted on 18 March 2025, it went through a lightweight blind review and was accepted for poster presentation. The conference was held on 15 May 2025. [8] is a nine-page paper with an additional nine pages of references and summary tables, submitted to the Symposium on AI Verification (SAIV 2025) on 24 April 2025. It went through a rigorous review process. The version uploaded to arXiv.org [8] is an improved version of that submission, revised to address the specific suggestions for improving the paper.
ACM classes: D.2.1; D.2.4; D.2.10; F.4.1; F.4.3
[AI-12] Low-code to fight climate change: the Climaborough project
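【Quick read】: This paper addresses how European cities working toward carbon neutrality by 2030 can quickly deploy effective climate monitoring tools to evaluate local experimental initiatives and assess the potential impact of scaling them up. The key to its solution is a low-code/no-code strategy: low-code techniques accelerate the development of climate dashboards, and a no-code philosophy lets all types of citizen users configure and adapt the dashboards to their own needs.
Link: https://arxiv.org/abs/2506.14623
Authors: Aaron Conrardy,Armen Sulejmani,Cindy Guerlain,Daniele Pagani,David Hick,Matteo Satta,Jordi Cabot
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: This paper was presented in the Research Projects Track of the 19th International Conference on Research Challenges in Information Science (RCIS 2025)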
Abstract:The EU-funded Climaborough project supports European cities to achieve carbon neutrality by 2030. Eleven cities in nine countries will deploy in real conditions products and services fostering climate transition in their local environment. The Climaborough City Platform is being developed to monitor the cities’ overall progress towards their climate goals by aggregating historic and real-time data and displaying the results in user-friendly dashboards that will be used by non-technical experts to evaluate the effectiveness of local experimental initiatives, identify those that yield significant impact, and assess the potential consequences of scaling them up to a broader level. In this paper, we explain how we have put in place a low-code/no-code strategy in Climaborough in response to the project’s aim to quickly deploy climate dashboards. A low-code strategy is used to accelerate the development of the dashboards. The dashboards embed a no-code philosophy that enables all types of citizen profiles to configure and adapt the dashboard to their specific needs.
[AI-13] Object-Centric Neuro-Argumentative Learning
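【Quick read】: This paper addresses the safety, reliability, and interpretability concerns that arise when deep learning is used for critical decisions. The key to its solution is a novel Neural Argumentative Learning (NAL) architecture that integrates Assumption-Based Argumentation (ABA) with deep learning for image analysis. The architecture has a neural component, which segments and encodes images using object-centric learning, and a symbolic component, which applies ABA learning to build ABA frameworks that enable predictions from images.
Link: https://arxiv.org/abs/2506.14577
Authors: Abdul Rahman Jacob,Avinash Kori,Emanuele De Angelis,Ben Glocker,Maurizio Proietti,Francesca Toni
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Proceedings of Machine Learning Research, 2025 19th Conference on Neurosymbolic Learning and Reasoning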
Abstract:Over the last decade, as we rely more on deep learning technologies to make critical decisions, concerns regarding their safety, reliability and interpretability have emerged. We introduce a novel Neural Argumentative Learning (NAL) architecture that integrates Assumption-Based Argumentation (ABA) with deep learning for image analysis. Our architecture consists of neural and symbolic components. The former segments and encodes images into facts using object-centric learning, while the latter applies ABA learning to develop ABA frameworks enabling predictions with images. Experiments on synthetic data show that the NAL architecture can be competitive with a state-of-the-art alternative.
[AI-14] From Points to Places: Towards Human Mobility-Driven Spatiotemporal Foundation Models via Understanding Places
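【Quick read】: This paper addresses how to analyze spatiotemporal data in a scalable and transferable way across different geographies and contexts, particularly given the spatial, temporal, and semantic complexity of human mobility data. The key to its proposal is a new class of spatial foundation models that integrate geolocation semantics with human mobility and shift from modeling discrete points of interest to understanding "places": context-rich regions shaped by human behavior and mobility dynamics. The paper identifies key gaps in adaptability, scalability, and multi-granular reasoning, and proposes research directions centered on modeling places to enable efficient learning and next-generation geospatial intelligence.
Link: https://arxiv.org/abs/2506.14570
Authors: Mohammad Hashemi,Andreas Zufle
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: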
Abstract:Capturing human mobility is essential for modeling how people interact with and move through physical spaces, reflecting social behavior, access to resources, and dynamic spatial patterns. To support scalable and transferable analysis across diverse geographies and contexts, there is a need for a generalizable foundation model for spatiotemporal data. While foundation models have transformed language and vision, they remain limited in handling the unique challenges posed by the spatial, temporal, and semantic complexity of mobility data. This vision paper advocates for a new class of spatial foundation models that integrate geolocation semantics with human mobility across multiple scales. Central to our vision is a shift from modeling discrete points of interest to understanding places: dynamic, context-rich regions shaped by human behavior and mobility that may comprise many places of interest. We identify key gaps in adaptability, scalability, and multi-granular reasoning, and propose research directions focused on modeling places and enabling efficient learning. Our goal is to guide the development of scalable, context-aware models for next-generation geospatial intelligence. These models unlock powerful applications ranging from personalized place discovery and logistics optimization to urban planning, ultimately enabling smarter and more responsive spatial decision-making.
[AI-15] Enhancing Symbolic Machine Learning by Subsymbolic Representations
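【Quick read】: This paper addresses the respective limitations of symbolic AI and subsymbolic AI, and in particular the inefficiency of general neuro-symbolic systems in simpler settings. The key to its solution is to give symbolic machine learning methods access to neural embeddings, increasing their expressiveness and performance. Using the TILDE system as an example, the paper shows how embeddings of constants can be used in similarity predicates, and how the embeddings can be further refined to fit the symbolic theory. Experiments in three real-world domains show that the approach outperforms the other baseline methods.
Link: https://arxiv.org/abs/2506.14569
Authors: Stephen Roth,Lennart Baur,Derian Boer,Stefan Kramer
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: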
Abstract:The goal of neuro-symbolic AI is to integrate symbolic and subsymbolic AI approaches, to overcome the limitations of either. Prominent systems include Logic Tensor Networks (LTN) or DeepProbLog, which offer neural predicates and end-to-end learning. The versatility of systems like LTNs and DeepProbLog, however, makes them less efficient in simpler settings, for instance, for discriminative machine learning, in particular in domains with many constants. Therefore, we follow a different approach: We propose to enhance symbolic machine learning schemes by giving them access to neural embeddings. In the present paper, we show this for TILDE and embeddings of constants used by TILDE in similarity predicates. The approach can be fine-tuned by further refining the embeddings depending on the symbolic theory. In experiments in three real-world domains, we show that this simple, yet effective, approach outperforms all other baseline methods in terms of the F1 score. The approach could be useful beyond this setting: Enhancing symbolic learners in this way could be extended to similarities between instances (effectively working like kernels within a logical language), for analogical reasoning, or for propositionalization.
[AI-16] QUEST: Quality-aware Semi-supervised Table Extraction for Business Documents ICDAR2025
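【Quick read】: This paper addresses automatic Table Extraction (TE) from business documents, a task that is critical for industrial workflows but difficult because annotated data is sparse and multi-stage pipelines are error-prone. Its solution is the QUEST framework, whose key idea is quality-aware semi-supervised learning: a quality assessment model evaluates structural and contextual features of extracted tables to predict F1 scores instead of relying on confidence metrics, which guides pseudo-label selection more accurately, while diversity measures reduce confirmation bias and improve model performance.
Link: https://arxiv.org/abs/2506.14568
Authors: Eliott Thomas,Mickael Coustaty,Aurelie Joseph,Gaspar Deloin,Elodie Carel,Vincent Poulain D’Andecy,Jean-Marc Ogier
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at ICDAR 2025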
Abstract:Automating table extraction (TE) from business documents is critical for industrial workflows but remains challenging due to sparse annotations and error-prone multi-stage pipelines. While semi-supervised learning (SSL) can leverage unlabeled data, existing methods rely on confidence scores that poorly reflect extraction quality. We propose QUEST, a Quality-aware Semi-supervised Table extraction framework designed for business documents. QUEST introduces a novel quality assessment model that evaluates structural and contextual features of extracted tables, trained to predict F1 scores instead of relying on confidence metrics. This quality-aware approach guides pseudo-label selection during iterative SSL training, while diversity measures (DPP, Vendi score, IntDiv) mitigate confirmation bias. Experiments on a proprietary business dataset (1000 annotated + 10000 unannotated documents) show QUEST improves F1 from 64% to 74% and reduces empty predictions by 45% (from 12% to 6.5%). On the DocILE benchmark (600 annotated + 20000 unannotated documents), QUEST achieves a 50% F1 score (up from 42%) and reduces empty predictions by 19% (from 27% to 22%). The framework’s interpretable quality assessments and robustness to annotation scarcity make it particularly suited for business documents, where structural consistency and data completeness are paramount.
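The abstract mixes absolute figures (F1 from 64% to 74%) with relative ones (empty predictions down "by 45%" and "by 19%"). The short check below shows that the two quoted reductions are relative drops computed from the before and after rates given in the abstract; the helper name `relative_reduction` is an illustrative choice.

```python
def relative_reduction(before, after):
    """Relative drop from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before

# Empty-prediction rates quoted in the abstract, in percent.
proprietary = relative_reduction(12.0, 6.5)  # proprietary business dataset
docile = relative_reduction(27.0, 22.0)      # DocILE benchmark

print(f"{proprietary:.1%}")
print(f"{docile:.1%}")
```

The results, roughly 45.8% and 18.5%, match the abstract's rounded figures. The F1 improvements, by contrast, are stated as absolute values: 10 percentage points on the proprietary dataset and 8 on DocILE.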
[AI-17] Controlling Context: Generative AI at Work in Integrated Circuit Design and Other High-Precision Domains
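【Quick read】: This paper examines how people working in high-precision engineering domains can stay vigilant for errors when using generative AI tools, and what other difficulties arise in using such tools. Through interviews with hardware and software engineers in integrated circuit design and their collaborators, it analyzes the role accuracy plays in their use of generative AI tools and identifies other forms of trouble. Its key finding is that controlling the context of interactions between engineers and generative AI tools is one of the largest challenges they face, and it recommends mitigating this by increasing the ability to control context interactively.
Link: https://arxiv.org/abs/2506.14567
Authors: Emanuel Moss,Elizabeth Watkins,Christopher Persaud,Passant Karunaratne,Dawn Nafus
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: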
Abstract:Generative AI tools have become more prevalent in engineering workflows, particularly through chatbots and code assistants. As the perceived accuracy of these tools improves, questions arise about whether and how those who work in high-precision domains might maintain vigilance for errors, and what other aspects of using such tools might trouble their work. This paper analyzes interviews with hardware and software engineers, and their collaborators, who work in integrated circuit design to identify the role accuracy plays in their use of generative AI tools and what other forms of trouble they face in using such tools. The paper inventories these forms of trouble, which are then mapped to elements of generative AI systems, to conclude that controlling the context of interactions between engineers and the generative AI tools is one of the largest challenges they face. The paper concludes with recommendations for mitigating this form of trouble by increasing the ability to control context interactively.
[AI-18] Aligning Evaluation with Clinical Priorities: Calibration Label Shift and Error Costs
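【Quick read】: This paper addresses a problem in evaluating and selecting classifiers for machine learning-based decision support in clinical settings: widely used scoring rules such as accuracy and AUC-ROC fail to reflect key clinical priorities, including calibration, robustness to distributional shift, and sensitivity to asymmetric error costs. The key to its solution is a principled yet practical evaluation framework that adjusts cross-entropy (log score) to account for uncertainty in class prevalence and for the cost asymmetries specific to clinical scenarios, so as to select thresholded classifiers that are both calibrated and robust to real-world variation.
Link: https://arxiv.org/abs/2506.14540
Authors: Gerardo A. Flores,Alyssa H. Smith,Julia A. Fukuyama,Ashia C. Wilson
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: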
Abstract:Machine learning-based decision support systems are increasingly deployed in clinical settings, where probabilistic scoring functions are used to inform and prioritize patient management decisions. However, widely used scoring rules, such as accuracy and AUC-ROC, fail to adequately reflect key clinical priorities, including calibration, robustness to distributional shifts, and sensitivity to asymmetric error costs. In this work, we propose a principled yet practical evaluation framework for selecting calibrated thresholded classifiers that explicitly accounts for the uncertainty in class prevalences and domain-specific cost asymmetries often found in clinical settings. Building on the theory of proper scoring rules, particularly the Schervish representation, we derive an adjusted variant of cross-entropy (log score) that averages cost-weighted performance over clinically relevant ranges of class balance. The resulting evaluation is simple to apply, sensitive to clinical deployment conditions, and designed to prioritize models that are both calibrated and robust to real-world variations.
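Two standard ingredients sit behind the terms "asymmetric error costs" and "label shift", and the sketch below shows them in isolation. It is background only, not the paper's adjusted log score: for a calibrated probability, the cost-minimizing rule predicts positive when the probability reaches `cost_fp / (cost_fp + cost_fn)`, and under label shift (class-conditional feature distributions unchanged) the posterior odds scale with the ratio of new to old prior odds. The function names and example numbers are illustrative assumptions.

```python
def cost_threshold(cost_fp, cost_fn):
    """Decision threshold that minimizes expected cost for a calibrated score."""
    return cost_fp / (cost_fp + cost_fn)

def shift_prior(p, old_prev, new_prev):
    """Re-express probability p (0 < p < 1) under a new class prevalence.

    Assumes pure label shift: only the class prior changes.
    """
    odds = p / (1.0 - p)
    prior_ratio = (new_prev / (1.0 - new_prev)) / (old_prev / (1.0 - old_prev))
    shifted_odds = odds * prior_ratio
    return shifted_odds / (1.0 + shifted_odds)

# Illustrative numbers: missing a case is 9x as costly as a false alarm.
threshold = cost_threshold(cost_fp=1.0, cost_fn=9.0)

# A score of 0.2 learned at 50% prevalence, deployed where prevalence is 10%.
p_train = 0.2
p_deploy = shift_prior(p_train, old_prev=0.5, new_prev=0.1)

print(threshold, p_train >= threshold, p_deploy >= threshold)
```

With these numbers the same patient is flagged at the training prevalence but not at the deployment prevalence, which illustrates why an evaluation that ignores prevalence uncertainty can favor the wrong thresholded classifier.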
[AI-19] Doppelgänger Method: Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial Attack
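【Quick read】: This paper addresses the safety and behavioral-consistency problems that prompt engineering introduces into generative AI agents, in particular the risk that adversarial transfer attacks expose system instructions and internal information. The key to its approach is the "Doppelgänger method", which demonstrates the risk of an agent being hijacked, together with the "Prompt Alignment Collapse under Adversarial Transfer (PACAT)" level for measuring vulnerability. It also designs a "Caution for Adversarial Transfer (CAT)" prompt as a defense, which effectively improves resistance to the attack.
Link: https://arxiv.org/abs/2506.14539
Authors: Daewon Kang,YeongHwan Shin,Doyeon Kim,Kyu-Hwan Jung,Meong Hi Son
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: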
Abstract:Since the advent of large language models, prompt engineering now enables the rapid, low-effort creation of diverse autonomous agents that are already in widespread use. Yet this convenience raises urgent concerns about the safety, robustness, and behavioral consistency of the underlying prompts, along with the pressing challenge of preventing those prompts from being exposed to user’s attempts. In this paper, we propose the ‘‘Doppelgänger method’’ to demonstrate the risk of an agent being hijacked, thereby exposing system instructions and internal information. Next, we define the ‘‘Prompt Alignment Collapse under Adversarial Transfer (PACAT)’’ level to evaluate the vulnerability to this adversarial transfer attack. We also propose a ‘‘Caution for Adversarial Transfer (CAT)’’ prompt to counter the Doppelgänger method. The experimental results demonstrate that the Doppelgänger method can compromise the agent’s consistency and expose its internal information. In contrast, CAT prompts enable effective defense against this adversarial attack.
[AI-20] Automatic Qiskit Code Refactoring Using Large Language Models WWW
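【Quick read】: This paper addresses the difficulty of keeping code compatible as the quantum software framework Qiskit evolves and its APIs change frequently. The key to its solution is a refactoring method based on Large Language Models (LLMs): a taxonomy of migration scenarios is extracted from official documentation and supplied together with the original code, so that the LLM can identify migration scenarios in the code and suggest corresponding refactorings. By structuring the input and the reasoning process, the method works within the context-length limits of current LLMs, making them more practical for quantum code migration.
Link: https://arxiv.org/abs/2506.14535
Authors: José Manuel Suárez,Luis Mariano Bibbó,Joaquin Bogado,Alejandro Fernandez
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Submitted for review to “Taller Latinoamericano de Ingeniería de Software Cuántico” ( this https URL )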
Abstract:As quantum software frameworks evolve, developers face increasing challenges in maintaining compatibility with rapidly changing APIs. In this work, we present a novel methodology for refactoring Qiskit code using large language models (LLMs). We begin by extracting a taxonomy of migration scenarios from the different sources of official Qiskit documentation (such as release notes), capturing common patterns such as migration of functionality to different modules and deprecated usage. This taxonomy, along with the original Python source code, is provided as input to an LLM, which is then tasked with identifying instances of migration scenarios in the code and suggesting appropriate refactoring solutions. Our approach is designed to address the context length limitations of current LLMs by structuring the input and reasoning process in a targeted, efficient manner. The results demonstrate that LLMs, when guided by domain-specific migration knowledge, can effectively assist in automating Qiskit code migration. This work contributes both a set of proven prompts and a taxonomy for Qiskit code migration from earlier versions to version 0.46 and a methodology to assess the capabilities of LLMs to assist in the migration of quantum code.
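[AI-21] Toward Safety-First Human-Like Decision Making for Autonomous Vehicles in Time-Varying Traffic Flow
【Quick read】: This paper addresses the tendency of autonomous vehicle (AV) behavior policies to get stuck in local optima, generalize poorly, and fail to imitate human driving decisions in time-varying traffic flow, especially in dense and highly interactive scenarios. The key to its solution is a safety-first human-like decision-making framework (SF-HLDM) that combines a spatial-temporal attention mechanism for inferring other road users' intentions, a social compliance estimation module for behavior regulation, and a deep evolutionary reinforcement learning model that expands the search space efficiently, avoiding local optima and reducing the risk of overfitting, to produce human-like decisions that are interpretable and flexible.
Link: https://arxiv.org/abs/2506.14502
Authors: Xiao Wang,Junru Yu,Jun Huang,Qiong Wu,Ljubo Vacic,Changyin Sun
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: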
Abstract:Although recent advances in artificial intelligence technologies have shown great potential for improving transport efficiency and safety, autonomous vehicles (AVs) still face great challenges when driving in time-varying traffic flow, especially in dense and interactive situations. Meanwhile, humans have free will and usually do not make the same decisions even in exactly the same scenarios, so data-driven methods suffer from poor migratability and high search costs, which decreases the efficiency and effectiveness of the behavior policy. In this research, we propose a safety-first human-like decision-making framework (SF-HLDM) for AVs to drive safely, comfortably, efficiently, and in a socially compatible manner. The framework integrates a hierarchical progressive framework, which combines a spatial-temporal attention (S-TA) mechanism for inferring other road users’ intentions, a social compliance estimation module for behavior regulation, and a Deep Evolutionary Reinforcement Learning (DERL) model that expands the search space efficiently and effectively to avoid falling into local optima and to reduce the risk of overfitting, thus making human-like decisions with interpretability and flexibility. The SF-HLDM framework enables autonomous driving AI agents to dynamically adjust decision parameters to maintain safety margins while adhering to contextually appropriate driving behaviors.
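[AI-22] LLM-Powered Swarms: A New Frontier or a Conceptual Stretch?
【Quick read】: This paper examines the differences between traditional swarm intelligence and swarm systems built on large language models (LLMs), along with the computational and coordination challenges the latter raise. The key to its approach is to implement traditional swarm algorithms (Boids and Ant Colony Optimization) alongside LLM-driven counterparts, compare them on latency, resource usage, and behavioral accuracy, and discuss the suitability and limits of LLMs for swarm intelligence. The study highlights LLMs' strengths in reasoning and abstraction while noting the new constraints in computation and coordination that they impose on traditional swarm design.
Link: https://arxiv.org/abs/2506.14496
Authors: Muhammad Atta Ur Rahman,Melanie Schranz
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: This is the author’s version of a paper submitted to IEEE Intelligent Systems. 6 Tables, 3 Figures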
Abstract:Swarm intelligence traditionally refers to systems of simple, decentralized agents whose local interactions lead to emergent, collective behavior. Recently, the term ‘swarm’ has been extended to describe AI systems like OpenAI’s Swarm, where large language models (LLMs) act as collaborative agents. This paper contrasts traditional swarm algorithms with LLM-driven swarms exploring how decentralization, scalability, and emergence are redefined in modern artificial intelligence (AI). We implement and compare both paradigms using Boids and Ant Colony Optimization (ACO), evaluating latency, resource usage, and behavioral accuracy. The suitability of both cloud-based and local LLMs is assessed for the agent-based use in swarms. Although LLMs offer powerful reasoning and abstraction capabilities, they introduce new constraints in computation and coordination that challenge traditional notions of swarm design. This study highlights the opportunities and limitations of integrating LLMs into swarm systems and discusses the evolving definition of ‘swarm’ in modern AI research.
[AI-23] GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies NIPS2025
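【Quick read】: This paper addresses the fact that existing datasets for evaluating GUI agents are built under idealized conditions and fail to reflect the many anomalies that occur in real deployments. The key to its solution is the GUI-Robust dataset, which explicitly includes seven common types of anomalies, together with a semi-automated dataset construction paradigm that collects user action sequences with RPA tools and uses multimodal large language models (MLLMs) to generate the corresponding step and task descriptions, reducing annotation time cost by a factor of more than 19.
Link: https://arxiv.org/abs/2506.14477
Authors: Jingqi Yang,Zhilong Song,Jiawei Chen,Mingli Song,Sheng Zhou,linjun sun,Xiaogang Ouyang,Chun Chen,Can Wang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures, submitted to NIPS 2025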
Abstract:The development of high-quality datasets is crucial for benchmarking and advancing research in Graphical User Interface (GUI) agents. Despite their importance, existing datasets are often constructed under idealized conditions, overlooking the diverse anomalies frequently encountered in real-world deployments. To address this limitation, we introduce GUI-Robust, a novel dataset designed for comprehensive GUI agent evaluation, explicitly incorporating seven common types of anomalies observed in everyday GUI interactions. Furthermore, we propose a semi-automated dataset construction paradigm that collects user action sequences from natural interactions via RPA tools and then generates corresponding step and task descriptions for these actions with the assistance of MLLMs. This paradigm reduces annotation time cost by a factor of over 19. Finally, we assess state-of-the-art GUI agents using the GUI-Robust dataset, revealing their substantial performance degradation in abnormal scenarios. We anticipate that our work will highlight the importance of robustness in GUI agents and inspire more future research in this direction. The dataset and code are available at this https URL…
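[AI-24] Leveraging External Factors in Household-Level Electrical Consumption Forecasting using Hypernetworks KDD2025 ECML
【Quick read】: This paper addresses the fact that adding external factors (such as weather indicators, holidays, and major local events) often degrades global electricity consumption forecasting models, even though the same factors improve individual household-level models. The key to its solution is a hypernetwork architecture that adjusts the model weights specifically for each consumer, so that external factors can be used effectively to improve the accuracy of a global forecasting model.
Link: https://arxiv.org/abs/2506.14472
Authors: Fabien Bernier,Maxime Cordy,Yves Le Traon
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ECML PKDD 2025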
Abstract:Accurate electrical consumption forecasting is crucial for efficient energy management and resource allocation. While traditional time series forecasting relies on historical patterns and temporal dependencies, incorporating external factors – such as weather indicators – has shown significant potential for improving prediction accuracy in complex real-world applications. However, the inclusion of these additional features often degrades the performance of global predictive models trained on entire populations, despite improving individual household-level models. To address this challenge, we found that a hypernetwork architecture can effectively leverage external factors to enhance the accuracy of global electrical consumption forecasting models, by specifically adjusting the model weights to each consumer. We collected a comprehensive dataset spanning two years, comprising consumption data from over 6000 Luxembourgish households and corresponding external factors such as weather indicators, holidays, and major local events. By comparing various forecasting models, we demonstrate that a hypernetwork approach outperforms existing methods when combined with external factors, reducing forecasting errors and achieving the best accuracy while maintaining the benefits of a global model.
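The core idea of a hypernetwork, one network producing the weights of another, can be sketched in a few lines. The example below is a deliberately tiny, assumed setup and not the paper's architecture: the hypernetwork is a single linear map, the forecaster is a linear model over three past readings, and the context features (a normalised temperature and a holiday flag) and all numeric values are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def hypernetwork(context, hyper_weights, hyper_bias):
    """Map context features to the forecaster's weights (linear hypernetwork)."""
    return [w + b for w, b in zip(matvec(hyper_weights, context), hyper_bias)]

def forecast(history, weights):
    """Linear forecaster whose weights come from the hypernetwork."""
    return sum(w * h for w, h in zip(weights, history))

# Shared (global) parameters: 3 forecaster weights x 2 context features.
# All values are made up for illustration.
hyper_weights = [[0.0, 0.0], [0.1, 0.0], [-0.1, 0.2]]
hyper_bias = [0.2, 0.3, 0.5]

history = [1.0, 2.0, 3.0]  # last three consumption readings of one household

# Context = [normalised temperature, is_holiday]; hypothetical features.
weights_plain = hypernetwork([0.0, 0.0], hyper_weights, hyper_bias)
weights_holiday = hypernetwork([1.0, 1.0], hyper_weights, hyper_bias)

print(forecast(history, weights_plain), forecast(history, weights_holiday))
```

Only `hyper_weights` and `hyper_bias` would be trained, and they are shared across all households, so the model stays global while the effective forecaster differs per context. That is the property the abstract credits for combining global training with consumer-specific adjustment.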
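[AI-25] AST-Enhanced or AST-Overloaded? The Surprising Impact of Hybrid Graph Representations on Code Clone Detection
【Quick read】: This paper addresses the limited effectiveness of code clone detection caused by the lack of semantic depth in conventional Abstract Syntax Trees (ASTs). The key to its approach is to build hybrid graph representations that combine the AST with Control Flow Graphs (CFGs), Data Flow Graphs (DFGs), and Flow-Augmented ASTs (FA-AST) in order to improve the performance of Graph Neural Networks (GNNs) on code clone detection. The study systematically evaluates how different hybrid representations affect several GNN architectures. It finds that AST+CFG+DFG performs best for convolution- and attention-based models, whereas FA-AST can reduce performance because of its structural complexity. It also notes that GMN still performs well with the standard AST alone, which indicates an advantage in cross-code similarity detection.
Link: https://arxiv.org/abs/2506.14470
Authors: Zixian Zhang, Takfarinas Saber
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: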
Abstract:As one of the most detrimental code smells, code clones significantly increase software maintenance costs and heighten vulnerability risks, making their detection a critical challenge in software engineering. Abstract Syntax Trees (ASTs) dominate deep learning-based code clone detection due to their precise syntactic structure representation, but they inherently lack semantic depth. Recent studies address this by enriching AST-based representations with semantic graphs, such as Control Flow Graphs (CFGs) and Data Flow Graphs (DFGs). However, the effectiveness of various enriched AST-based representations and their compatibility with different graph-based machine learning techniques remains an open question, warranting further investigation to unlock their full potential in addressing the complexities of code clone detection. In this paper, we present a comprehensive empirical study to rigorously evaluate the effectiveness of AST-based hybrid graph representations in Graph Neural Network (GNN)-based code clone detection. We systematically compare various hybrid representations (CFG, DFG, and Flow-Augmented ASTs (FA-AST)) across multiple GNN architectures. Our experiments reveal that hybrid representations impact GNNs differently: while AST+CFG+DFG consistently enhances accuracy for convolution- and attention-based models (Graph Convolutional Networks (GCN), Graph Attention Networks (GAT)), FA-AST frequently introduces structural complexity that harms performance. Notably, GMN outperforms others even with standard AST representations, highlighting its superior cross-code similarity detection and reducing the need for enriched structures.
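As an illustration of what an AST+CFG+DFG hybrid graph can look like, the sketch below builds typed edges over a Python snippet with the standard `ast` module. It is not the paper's pipeline: the control-flow and data-flow edges are deliberately crude approximations, and the edge labels and the `hybrid_graph` function are hypothetical.

```python
import ast

def hybrid_graph(source):
    """Return typed edges (kind, src_id, dst_id) over AST nodes of `source`."""
    tree = ast.parse(source)
    ids = {}
    for node in ast.walk(tree):
        ids.setdefault(node, len(ids))   # shared ctx nodes keep a single id
    edges = []
    # AST edges: parent -> child.
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            edges.append(("ast", ids[node], ids[child]))
    # Crude control-flow edges: consecutive statements in the same body.
    for node in ast.walk(tree):
        body = getattr(node, "body", None)
        if isinstance(body, list):
            for a, b in zip(body, body[1:]):
                edges.append(("cfg", ids[a], ids[b]))
    # Crude data-flow edges: latest store of a name -> each later load.
    # Ordering by source position is an approximation; it mis-orders
    # statements such as `x = x + 1`.
    last_store = {}
    names = sorted((n for n in ast.walk(tree) if isinstance(n, ast.Name)),
                   key=lambda n: (n.lineno, n.col_offset))
    for n in names:
        if isinstance(n.ctx, ast.Store):
            last_store[n.id] = ids[n]
        elif n.id in last_store:
            edges.append(("dfg", last_store[n.id], ids[n]))
    return edges

edges = hybrid_graph("x = 1\ny = x + 2\n")
print(sorted({kind for kind, _, _ in edges}))
```

A GNN would consume such a graph with one embedding per edge type. The study's finding that richer graphs do not always help corresponds to how many extra edge types are layered on top of the `ast` edges.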
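[AI-26] A Scalable Hybrid Training Approach for Recurrent Spiking Neural Networks
【Quick read】: This paper addresses two limitations of conventional gradient-based training, such as backpropagation through time (BPTT), when training recurrent spiking neural networks (RSNNs): it does not support online training, and its memory consumption grows linearly with the number of computation steps. The key to the solution is an algorithm called HYPR (HYbrid PRopagation), which combines the efficiency of parallelization with approximate online forward learning to achieve high-throughput online learning with constant memory requirements (independent of sequence length). HYPR parallelizes the computation of parameter updates over subsequences, which lets RSNNs process continuous, infinite-length input sequences efficiently.
Link: https://arxiv.org/abs/2506.14464
Authors: Maximilian Baronig, Yeganeh Bahariasl, Ozan Özdenizci, Robert Legenstein
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: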
Abstract:Recurrent spiking neural networks (RSNNs) can be implemented very efficiently in neuromorphic systems. Nevertheless, training of these models with powerful gradient-based learning algorithms is mostly performed on standard digital hardware using backpropagation through time (BPTT). However, BPTT has substantial limitations. It does not permit online training and its memory consumption scales linearly with the number of computation steps. In contrast, learning methods using forward propagation of gradients operate in an online manner with a memory consumption independent of the number of time steps. These methods enable SNNs to learn from continuous, infinite-length input sequences. Yet, slow execution speed on conventional hardware as well as inferior performance has hindered their widespread application. In this work, we introduce HYbrid PRopagation (HYPR) that combines the efficiency of parallelization with approximate online forward learning. Our algorithm yields high-throughput online learning through parallelization, paired with constant, i.e., sequence length independent, memory demands. HYPR enables parallelization of parameter update computation over the subsequences for RSNNs consisting of almost arbitrary non-linear spiking neuron models. We apply HYPR to networks of spiking neurons with oscillatory subthreshold dynamics. We find that this type of neuron model is particularly well trainable by HYPR, resulting in an unprecedentedly low task performance gap between approximate forward gradient learning and BPTT.
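The contrast between BPTT and forward propagation of gradients can be shown on a toy problem. The sketch below is not HYPR and contains no spiking neurons: it assumes a single linear recurrent unit h_t = a*h_{t-1} + w*x_t with a squared-error loss, and shows that the gradient with respect to w can be accumulated online while storing only two scalars.

```python
def online_forward_gradient(xs, targets, w, a=0.9):
    """Accumulate dLoss/dw online for h_t = a*h_{t-1} + w*x_t.

    Loss = sum_t 0.5 * (h_t - y_t)**2. Only the state h and the sensitivity
    s = dh/dw are kept, so memory does not grow with sequence length.
    """
    h, s, grad = 0.0, 0.0, 0.0
    for x, y in zip(xs, targets):
        s = a * s + x          # forward-propagated sensitivity dh_t/dw
        h = a * h + w * x      # state update
        grad += (h - y) * s    # contribution of step t to dLoss/dw
    return grad

xs = [1.0, -0.5, 2.0, 0.3]
ys = [0.5, 0.0, 1.0, -1.0]
print(online_forward_gradient(xs, ys, w=0.4))
```

BPTT would instead store every h_t and sweep backwards once the sequence ends. According to the abstract, HYPR recovers parallel throughput by processing subsequences in parallel while keeping memory constant in this way.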
[AI-27] sHGCN: Simplified hyperbolic graph convolutional neural networks
【Quick read】: This paper addresses the performance challenges of hyperbolic neural networks, particularly in computational efficiency and in tasks requiring high precision. The key to the solution is to simplify key operations within hyperbolic neural networks, which notably improves both runtime and performance.
Link: https://arxiv.org/abs/2506.14438
Authors: Pol Arévalo, Alexis Molina, Álvaro Ciudad
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Hyperbolic geometry has emerged as a powerful tool for modeling complex, structured data, particularly where hierarchical or tree-like relationships are present. By enabling embeddings with lower distortion, hyperbolic neural networks offer promising alternatives to Euclidean-based models for capturing intricate data structures. Despite these advantages, they often face performance challenges, particularly in computational efficiency and tasks requiring high precision. In this work, we address these limitations by simplifying key operations within hyperbolic neural networks, achieving notable improvements in both runtime and performance. Our findings demonstrate that streamlined hyperbolic operations can lead to substantial gains in computational speed and predictive accuracy, making hyperbolic neural networks a more viable choice for a broader range of applications.
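[AI-28] Unifying Streaming and Non-streaming Zipformer-based ASR ACL2025
【Quick read】: This paper aims to unify the training and deployment of streaming and non-streaming automatic speech recognition (ASR) models in order to reduce development, training, and deployment costs. The key to the solution is a unified framework that introduces dynamic right-context into the training of Zipformer-based ASR models through chunked attention masking, so that future context is modeled. This keeps recognition accuracy high while controlling the latency of streaming ASR, and allows a flexible latency-accuracy tradeoff.
Link: https://arxiv.org/abs/2506.14434
Authors: Bidisha Sharma, Karthik Pandia Durai, Shankar Venkatesan, Jeena J Prakash, Shashi Kumar, Malolan Chetlur, Andreas Stolcke
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted in ACL2025 Industry track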
Abstract:There has been increasing interest in unifying streaming and non-streaming automatic speech recognition (ASR) models to reduce development, training, and deployment costs. We present a unified framework that trains a single end-to-end ASR model for both streaming and non-streaming applications, leveraging future context information. We propose to use dynamic right-context through the chunked attention masking in the training of Zipformer-based ASR models. We demonstrate that using right-context is more effective in Zipformer models compared to other conformer models due to its multi-scale nature. We analyze the effect of varying the number of right-context frames on accuracy and latency of the streaming ASR models. We use LibriSpeech and large in-house conversational datasets to train different versions of streaming and non-streaming models and evaluate them in a production grade server-client setup across diverse testsets of different domains. The proposed strategy reduces word error by a relative 7.9% with a small degradation in user-perceived latency. By adding more right-context frames, we are able to achieve streaming performance close to that of non-streaming models. Our approach also allows flexible control of the latency-accuracy tradeoff according to customers' requirements.
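A chunked attention mask with right-context can be sketched in a few lines. The function below is an illustrative assumption rather than the authors' or any toolkit's implementation; `chunk_size`, `right_context`, and `left_chunks` are hypothetical parameter names.

```python
def chunked_attention_mask(num_frames, chunk_size, right_context=0, left_chunks=None):
    """mask[i][j] is True when query frame i may attend to key frame j."""
    mask = []
    for i in range(num_frames):
        chunk = i // chunk_size
        # The visible span ends at the end of the current chunk plus
        # `right_context` future frames.
        end = min(num_frames, (chunk + 1) * chunk_size + right_context)
        # Optionally limit how many past chunks stay visible.
        start = 0 if left_chunks is None else max(0, (chunk - left_chunks) * chunk_size)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask

for row in chunked_attention_mask(num_frames=6, chunk_size=2, right_context=1):
    print("".join("x" if visible else "." for visible in row))
```

Sampling `right_context` anew for each training batch is one way to read "dynamic right-context". Setting it to the sequence length recovers full (non-streaming) attention, which is how a single model can serve both modes.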
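[AI-29] Is Selection All You Need in Differential Evolution?
【Quick read】: This paper addresses the limited population diversity caused by the fixed population size in conventional Differential Evolution (DE), along with the resulting bottlenecks in search efficiency and performance. The key to the solution is a new framework called Unbounded Differential Evolution (UDE), which discards no individuals and adds every generated candidate to the population. This avoids the complexities of generational replacement, such as archive management and dynamic population sizing, and relies solely on selection mechanisms to obtain a simpler yet powerful search.
Link: https://arxiv.org/abs/2506.14425
Authors: Tomofumi Kitamura, Alex Fukunaga
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 39 pages, 7 figures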
Abstract:Differential Evolution (DE) is a widely used evolutionary algorithm for black-box optimization problems. However, in modern DE implementations, a major challenge lies in the limited population diversity caused by the fixed population size enforced by the generational replacement. Population size is a critical control parameter that significantly affects DE performance. Larger populations inherently contain a more diverse set of individuals, thereby facilitating broader exploration of the search space. Conversely, when the maximum evaluation budget is constrained, smaller populations focusing on a limited number of promising candidates may be more suitable. Many state-of-the-art DE variants incorporate an archive mechanism, in which a subset of discarded individuals is preserved in an archive during generation replacement and reused in mutation operations. However, maintaining what is essentially a secondary population via an archive introduces additional design considerations, such as policies for insertion, deletion, and appropriate sizing. To address these limitations, we propose a novel DE framework called Unbounded Differential Evolution (UDE), which adds all generated candidates to the population without discarding any individual based on fitness. Unlike conventional DE, which removes inferior individuals during generational replacement, UDE eliminates replacement altogether, along with the associated complexities of archive management and dynamic population sizing. UDE represents a fundamentally new approach to DE, relying solely on selection mechanisms and enabling a more straightforward yet powerful search algorithm.
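The "never discard, only select" idea can be sketched as follows. The abstract does not specify the selection rule or the mutation strategy, so the binary tournament and the rand/1-style mutation with binomial crossover used here are assumptions, as are all parameter names.

```python
import random

def unbounded_de(f, bounds, init_size=10, num_candidates=200, F=0.5, CR=0.9, seed=0):
    """Minimize f; every generated candidate joins the population for good."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = []
    for _ in range(init_size):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        pop.append((f(x), x))

    def pick():
        # Binary tournament (assumed selection rule): fitter of two random members.
        a, b = rng.choice(pop), rng.choice(pop)
        return a if a[0] <= b[0] else b

    for _ in range(num_candidates):
        target, base, r1, r2 = pick()[1], pick()[1], pick()[1], pick()[1]
        j_rand = rng.randrange(dim)
        trial = []
        for j, (lo, hi) in enumerate(bounds):
            if rng.random() < CR or j == j_rand:
                v = base[j] + F * (r1[j] - r2[j])   # differential mutation
            else:
                v = target[j]                        # inherit from target
            trial.append(min(hi, max(lo, v)))        # clip to the bounds
        pop.append((f(trial), trial))                # no replacement, no archive
    return min(pop, key=lambda p: p[0]), pop

sphere = lambda x: sum(v * v for v in x)
best, pop = unbounded_de(sphere, [(-5.0, 5.0)] * 3)
print(len(pop), best[0])
```

Because nothing is removed, search pressure comes entirely from how parents are picked, which is the role the abstract assigns to the selection mechanism.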
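[AI-30] RAGtifier: Evaluating RAG Generation Approaches of State-of-the-Art RAG Systems for the SIGIR LiveRAG Competition SIGIR2025
【Quick read】: This paper addresses the insufficient factual accuracy and the hallucinations of Large Language Models (LLMs) when generating answers, improving question-answering accuracy by combining the model's internal parametric knowledge with external non-parametric sources. The key to the solution is the Retrieval-Augmented Generation (RAG) framework, specifically InstructRAG combined with a Pinecone retriever and a BGE reranker, which optimizes answer generation under the constraint of LLMs with at most 10B parameters and achieves relatively high correctness and faithfulness on DataMorgana's QA pairs.
Link: https://arxiv.org/abs/2506.14412
Authors: Tim Cofala, Oleh Astappiev, William Xion, Hailay Teklehaymanot
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 5 figures. Report for SIGIR 2025 LiveRAG Challenge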
Abstract:Retrieval-Augmented Generation (RAG) enriches Large Language Models (LLMs) by combining their internal, parametric knowledge with external, non-parametric sources, with the goal of improving factual correctness and minimizing hallucinations. The LiveRAG 2025 challenge explores RAG solutions to maximize accuracy on DataMorgana’s QA pairs, which are composed of single-hop and multi-hop questions. The challenge provides access to sparse OpenSearch and dense Pinecone indices of the Fineweb 10BT dataset. It restricts model use to LLMs with up to 10B parameters and final answer generation with Falcon-3-10B. A judge-LLM assesses the submitted answers along with human evaluators. By exploring distinct retriever combinations and RAG solutions under the challenge conditions, our final solution emerged using InstructRAG in combination with a Pinecone retriever and a BGE reranker. Our solution achieved a correctness score of 1.13 and a faithfulness score of 0.55, placing fourth in the SIGIR 2025 LiveRAG Challenge.
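[AI-31] Adaptive Reinforcement Learning for Unobservable Random Delays
【Quick read】: This paper addresses the performance degradation of Reinforcement Learning (RL) in real-world dynamic environments caused by unobservable, time-varying delays between the agent and the system. Existing methods usually assume a known fixed upper bound on the delay, which is conservative because the actual delay is often much lower. The key to the solution is the interaction layer, a general framework that lets agents adaptively handle unobservable and time-varying delays. The agent generates a matrix of possible future actions to cope with unpredictable delays and with action packets lost in network transmission. Building on this framework, the authors develop the model-based algorithm ACDA (Actor-Critic with Delay Adaptation), which dynamically adjusts to delay patterns.
Link: https://arxiv.org/abs/2506.14411
Authors: John Wikman, Alexandre Proutiere, David Broman
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: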
Abstract:In standard Reinforcement Learning (RL) settings, the interaction between the agent and the environment is typically modeled as a Markov Decision Process (MDP), which assumes that the agent observes the system state instantaneously, selects an action without delay, and executes it immediately. In real-world dynamic environments, such as cyber-physical systems, this assumption often breaks down due to delays in the interaction between the agent and the system. These delays can vary stochastically over time and are typically unobservable, meaning they are unknown when deciding on an action. Existing methods deal with this uncertainty conservatively by assuming a known fixed upper bound on the delay, even if the delay is often much lower. In this work, we introduce the interaction layer, a general framework that enables agents to adaptively and seamlessly handle unobservable and time-varying delays. Specifically, the agent generates a matrix of possible future actions to handle both unpredictable delays and lost action packets sent over networks. Building on this framework, we develop a model-based algorithm, Actor-Critic with Delay Adaptation (ACDA), which dynamically adjusts to delay patterns. Our method significantly outperforms state-of-the-art approaches across a wide range of locomotion benchmark environments.
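The abstract does not spell out how the action matrix is laid out, so the following is only one plausible reading, offered as a sketch. It assumes that row d holds the actions to use if the packet arrives after d steps, and that column k is the action for the k-th step after arrival, which covers the case where later packets are lost. The `InteractionLayer` class and its methods are hypothetical.

```python
class InteractionLayer:
    """System-side buffer that picks actions from the newest received matrix."""

    def __init__(self, default_action=0.0):
        self.matrix = None            # rows: possible delays, columns: steps since arrival
        self.row = 0
        self.col = 0
        self.default_action = default_action

    def receive(self, matrix, delay):
        """Called when an action packet arrives `delay` steps after it was sent."""
        self.matrix = matrix
        self.row = min(delay, len(matrix) - 1)   # the delay is known only on arrival
        self.col = 0

    def next_action(self):
        """Return the action to execute at the current step."""
        if self.matrix is None:
            return self.default_action
        actions = self.matrix[self.row]
        action = actions[min(self.col, len(actions) - 1)]  # repeat last if exhausted
        self.col += 1
        return action

layer = InteractionLayer()
layer.receive([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], delay=1)
print([layer.next_action() for _ in range(4)])
```

Under this reading the agent never needs to observe the delay itself: it prepares an action for each possible delay, and the receiving side selects the matching row.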
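[AI-32] HiLight: A Hierarchical Reinforcement Learning Framework with Global Adversarial Guidance for Large-Scale Traffic Signal Control
【Quick read】: This paper addresses the tension between scalability and global coordination in existing Reinforcement Learning (RL) methods for large-scale traffic signal control (TSC). Centralized RL scales poorly, while decentralized approaches lack unified objectives, which limits network-level efficiency. The key to the solution is HiLight, a hierarchical RL framework whose core consists of a high-level Meta-Policy that partitions the network into subregions and generates sub-goals with a Transformer-LSTM architecture, and a low-level Sub-Policy that controls individual intersections with global awareness. An adversarial training mechanism improves the alignment between global planning and local execution.
Link: https://arxiv.org/abs/2506.14391
Authors: Yaqiao Zhu, Hongkai Wen, Geyong Min, Man Luo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: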
Abstract:Efficient traffic signal control (TSC) is essential for mitigating urban congestion, yet existing reinforcement learning (RL) methods face challenges in scaling to large networks while maintaining global coordination. Centralized RL suffers from scalability issues, while decentralized approaches often lack unified objectives, resulting in limited network-level efficiency. In this paper, we propose HiLight, a hierarchical reinforcement learning framework with global adversarial guidance for large-scale TSC. HiLight consists of a high-level Meta-Policy, which partitions the traffic network into subregions and generates sub-goals using a Transformer-LSTM architecture, and a low-level Sub-Policy, which controls individual intersections with global awareness. To improve the alignment between global planning and local execution, we introduce an adversarial training mechanism, where the Meta-Policy generates challenging yet informative sub-goals, and the Sub-Policy learns to surpass these targets, leading to more effective coordination. We evaluate HiLight across both synthetic and real-world benchmarks, and additionally construct a large-scale Manhattan network with diverse traffic conditions, including peak transitions, adverse weather, and holiday surges. Experimental results show that HiLight exhibits significant advantages in large-scale scenarios and remains competitive across standard benchmarks of varying sizes.
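[AI-33] Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning
【Quick read】: This paper addresses the degradation, caused by catastrophic forgetting during Large Language Model (LLM) fine-tuning, of key capabilities instilled through safety alignment, in particular the model's ability to faithfully express ignorance. This degradation leads to undesired behaviors such as hallucinations. The key to the solution is SEAT, which integrates two components: (1) sparse training that constrains activation drift, and (2) an entity perturbation method with KL-divergence regularization that counters knowledge entanglement. Together they preserve the model's awareness of its own ignorance while retaining fine-tuning performance.
Link: https://arxiv.org/abs/2506.14387
Authors: William F. Shen, Xinchi Qiu, Nicola Cancedda, Nicholas D. Lane
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: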
Abstract:Existing work on mitigating catastrophic forgetting in large language model (LLM) fine-tuning has primarily focused on preserving specific data or tasks, while critically overlooking the degradation of essential capabilities instilled through safety alignment, particularly the model’s ability to faithfully express ignorance. In this work, we show that this capability is significantly degraded during conventional fine-tuning, leading to undesired behaviors such as hallucinations. To address this novel but highly practical problem, we propose SEAT, a simple and effective fine-tuning approach that preserves both fine-tuning performance and the model’s inherent ability to acknowledge its ignorance. SEAT integrates two key components: (1) sparse training that constrains activation drift, and (2) a novel entity perturbation method with KL-divergence regularization, designed to counter knowledge entanglement. Experimental results demonstrate that SEAT significantly outperforms baselines in preserving ignorance awareness while retaining fine-tuning performance, offering a more robust solution for LLM fine-tuning.
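A KL-divergence regularizer of the kind the abstract mentions can be sketched generically. How SEAT builds its perturbed-entity inputs, which KL direction it uses, and how the term is weighted are not stated here, so the direction KL(reference || fine-tuned), the weight `lam`, and the function names below are all assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def regularized_loss(task_loss, ref_logits, new_logits, lam=1.0):
    """Task loss plus a penalty for drifting from the reference model's
    next-token distribution on a (hypothetical) perturbed-entity prompt."""
    penalty = kl_divergence(softmax(ref_logits), softmax(new_logits))
    return task_loss + lam * penalty

print(regularized_loss(0.8, [2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # no drift
print(regularized_loss(0.8, [2.0, 0.5, -1.0], [0.0, 3.0, -1.0]))  # drift penalized
```

The penalty is zero while the fine-tuned model answers the perturbed prompt as the original model would, and grows as its distribution drifts away.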
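[AI-34] ResNets Are Deeper Than You Think NEURIPS2025
【Quick read】: This paper asks why residual connections so markedly improve training performance in modern neural network architectures, and whether factors other than optimization explain the advantage of residual networks (ResNets) over feedforward networks. The key idea is that residual networks do not merely reparameterize feedforward networks but inhabit a different function space. Using a controlled post-training comparison, the authors show that variable-depth architectures generalize better than fixed-depth networks even when optimization is unlikely to make a difference. This suggests that the advantage of residual connections goes beyond optimization and points to a deeper inductive bias aligned with the structure of natural data.
Link: https://arxiv.org/abs/2506.14386
Authors: Christian H.X. Ali Mehmeti-Göpel, Michael Wand
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025 Submission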
Abstract:Residual connections remain ubiquitous in modern neural network architectures nearly a decade after their introduction. Their widespread adoption is often credited to their dramatically improved trainability: residual networks train faster, more stably, and achieve higher accuracy than their feedforward counterparts. While numerous techniques, ranging from improved initialization to advanced learning rate schedules, have been proposed to close the performance gap between residual and feedforward networks, this gap has persisted. In this work, we propose an alternative explanation: residual networks do not merely reparameterize feedforward networks, but instead inhabit a different function space. We design a controlled post-training comparison to isolate generalization performance from trainability; we find that variable-depth architectures, similar to ResNets, consistently outperform fixed-depth networks, even when optimization is unlikely to make a difference. These results suggest that residual connections confer performance advantages beyond optimization, pointing instead to a deeper inductive bias aligned with the structure of natural data.
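[AI-35] IntelliLung: Advancing Safe Mechanical Ventilation using Offline RL with Hybrid Actions and Clinically Aligned Rewards ECAI2025
【Quick read】: This paper addresses the complexity and error-proneness of optimizing mechanical ventilation (MV) settings in the intensive care unit (ICU), especially given patient-specific variability. The key to the solution is to optimize action space reduction techniques and to adapt state-of-the-art Offline Reinforcement Learning (RL) algorithms (IQL and EDAC) to hybrid action spaces (containing continuous and discrete actions), which avoids the restricted action space and distribution shift that discretization introduces. The study also introduces a clinically grounded reward function based on ventilator-free days and physiological targets, a more meaningful objective than traditional sparse mortality-based rewards.
Link: https://arxiv.org/abs/2506.14375
Authors: Muhammad Hamza Yousuf, Jason Li, Sahar Vahdati, Raphael Theilen, Jakob Wittenstein, Jens Lehmann
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: under review, PAIS track @ ECAI 2025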
Abstract:Invasive mechanical ventilation (MV) is a life-sustaining therapy for critically ill patients in the intensive care unit (ICU). However, optimizing its settings remains a complex and error-prone process due to patient-specific variability. While Offline Reinforcement Learning (RL) shows promise for MV control, current state-of-the-art (SOTA) methods struggle with the hybrid (continuous and discrete) nature of MV actions. Discretizing the action space limits available actions due to exponential growth in combinations and introduces distribution shifts that can compromise safety. In this paper, we propose optimizations that build upon prior work in action space reduction to address the challenges of discrete action spaces. We also adapt SOTA offline RL algorithms (IQL and EDAC) to operate directly on hybrid action spaces, thereby avoiding the pitfalls of discretization. Additionally, we introduce a clinically grounded reward function based on ventilator-free days and physiological targets, which provides a more meaningful optimization objective compared to traditional sparse mortality-based rewards. Our findings demonstrate that AI-assisted MV optimization may enhance patient safety and enable individualized lung support, representing a significant advancement toward intelligent, data-driven critical care solutions.
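[AI-36] LLM-Powered Intent-Based Categorization of Phishing Emails
【Quick read】: This paper addresses the limitations of traditional detection systems in identifying phishing emails, in particular their poor handling of phishing emails that experienced users can recognize from the text alone. The key to the solution is to use Large Language Models (LLMs) to analyze the intent of emails and to introduce an intent-type taxonomy that lets the LLMs classify emails into distinct categories and thereby generate actionable threat information.
Link: https://arxiv.org/abs/2506.14337
Authors: Even Eilertsen, Vasileios Mavroeidis, Gudmund Grov
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: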
Abstract:Phishing attacks remain a significant threat to modern cybersecurity, as they successfully deceive both humans and the defense mechanisms intended to protect them. Traditional detection systems primarily focus on email metadata that users cannot see in their inboxes. Additionally, these systems struggle with phishing emails, which experienced users can often identify empirically by the text alone. This paper investigates the practical potential of Large Language Models (LLMs) to detect these emails by focusing on their intent. In addition to the binary classification of phishing emails, the paper introduces an intent-type taxonomy, which is operationalized by the LLMs to classify emails into distinct categories and, therefore, generate actionable threat information. To facilitate our work, we have curated publicly available datasets into a custom dataset containing a mix of legitimate and phishing emails. Our results demonstrate that existing LLMs are capable of detecting and categorizing phishing emails, underscoring their potential in this domain.
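[AI-37] AviationLLM: An LLM-based Knowledge System for Aviation Training
【Quick read】: This paper addresses inefficient knowledge transfer in existing aviation training systems, which stems mainly from the limited number of instructors and the insufficient accuracy of professional answers obtained from the Internet. The key to the solution is to introduce a Large Language Model (LLM) and apply domain alignment based on Direct Preference Optimization (DPO) to improve both answer accuracy on professional aviation theory and training efficiency. To reduce hallucinations caused by training data biases, outdated knowledge, or domain knowledge gaps, the system also integrates Retrieval-Augmented Generation (RAG), yielding precise, high-quality answers.
Link: https://arxiv.org/abs/2506.14336
Authors: Jia’ang Wan, Feng Shen, Fujuan Li, Yanjin Sun, Yan Li, Shiwen Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: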
Abstract:Aviation training is a core link in ensuring flight safety, improving industry efficiency and promoting sustainable development. It not only involves flight simulation but also requires the learning of a great deal of professional aviation theory knowledge. In the existing training system, the knowledge is mainly imparted by instructors. However, the number of instructors is limited and the professional answers obtained from the Internet are not accurate enough, resulting in low training efficiency. To address this, we introduce an LLM, but the basic pre-trained model cannot provide accurate answers in professional fields, so we fine-tune it. Traditional Supervised Fine-Tuning (SFT) risks generating superficially plausible but factually incorrect responses due to insufficient data coverage. To address this, we employ Direct Preference Optimization (DPO). This paper proposes Retrieval-Augmented LLM Alignment via Direct Preference Optimization (RALA-DPO). We select the open-source pre-trained LLM Qwen and adapt it to aviation theory training through DPO-based domain alignment. Simultaneously, to mitigate hallucinations caused by training data biases, knowledge obsolescence, or domain knowledge gaps, we implement Retrieval-Augmented Generation (RAG) technology that combines generative and retrieval models. RALA-DPO effectively retrieves relevant information from external knowledge bases and delivers precise and high-quality responses through the generative model. Experimental results demonstrate that RALA-DPO can improve accuracy in response to professional aviation knowledge. With integrated RAG mechanisms, this system can further improve the accuracy of answers and achieve zero-cost knowledge updates simultaneously.
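For readers unfamiliar with DPO, the standard per-pair loss can be written as a short function. This is the generic DPO objective, not RALA-DPO's training code. The inputs are summed log-probabilities of the preferred and rejected answers under the policy and under a frozen reference model, and `beta=0.1` is an arbitrary illustrative value.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (policy log-ratio margin over the reference model))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: loss equals log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy raises the preferred answer relative to the reference: loss drops.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0))
```

The loss needs only preference pairs and no separate reward model, which is what makes this style of alignment practical for a narrow domain.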
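[AI-38] ADRD: LLM-Driven Autonomous Driving Based on Rule-based Decision Systems
【Quick read】: This paper addresses how to build an interpretable decision-making system for autonomous driving, a current focus of academic research. The key to the solution is to use Large Language Models (LLMs) to generate executable rule-based decision systems. The ADRD (LLM-Driven Autonomous Driving Based on Rule-based Decision Systems) framework integrates an Information Module, an Agents Module, and a Testing Module. Drawing on the reasoning and programming abilities of LLMs, it aggregates contextual information about the driving scenario, then generates and iteratively refines rule-based tactics, and shows significant advantages in interpretability, response speed, and driving performance.
Link: https://arxiv.org/abs/2506.14299
Authors: Fanzhi Zeng, Siqi Wang, Chuzhao Zhu, Li Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: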
Abstract:How to construct an interpretable autonomous driving decision-making system has become a focal point in academic research. In this study, we propose a novel approach that leverages large language models (LLMs) to generate executable, rule-based decision systems to address this challenge. Specifically, harnessing the strong reasoning and programming capabilities of LLMs, we introduce the ADRD (LLM-Driven Autonomous Driving Based on Rule-based Decision Systems) framework, which integrates three core modules: the Information Module, the Agents Module, and the Testing Module. The framework operates by first aggregating contextual driving scenario information through the Information Module, then utilizing the Agents Module to generate rule-based driving tactics. These tactics are iteratively refined through continuous interaction with the Testing Module. Extensive experimental evaluations demonstrate that ADRD exhibits superior performance in autonomous driving decision tasks. Compared to traditional reinforcement learning approaches and the most advanced LLM-based methods, ADRD shows significant advantages in terms of interpretability, response speed, and driving performance. These results highlight the framework’s ability to achieve comprehensive and accurate understanding of complex driving scenarios, and underscore the promising future of transparent, rule-based decision systems that are easily modifiable and broadly applicable. To the best of our knowledge, this is the first work that integrates large language models with rule-based systems for autonomous driving decision-making, and our findings validate its potential for real-world deployment.
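[AI-39] Uncertainty-Driven Radar-Inertial Fusion for Instantaneous 3D Ego-Velocity Estimation
【Quick read】: This paper aims to improve the accuracy and robustness of ego-motion estimation in autonomous navigation, in particular by overcoming the limitations of traditional radar-based ego-motion estimation. The key to the solution is to fuse high-resolution imaging radar with inertial measurement unit (IMU) data: a neural network processes complex-valued raw radar data to estimate instantaneous linear ego-velocity together with its uncertainty. An Extended Kalman Filter then fuses this uncertainty-aware velocity estimate with the IMU data, refining the inertial sensor's noise and bias parameters and improving the accuracy and reliability of the overall ego-motion estimate.
Link: https://arxiv.org/abs/2506.14294
Authors: Prashant Kumar Rai, Elham Kowsari, Nataliya Strokina, Reza Ghabcheloo
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: This paper has been accepted for presentation at the 28th International Conference on Information Fusion (Fusion 2025)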
Abstract:We present a method for estimating ego-velocity in autonomous navigation by integrating high-resolution imaging radar with an inertial measurement unit. The proposed approach addresses the limitations of traditional radar-based ego-motion estimation techniques by employing a neural network to process complex-valued raw radar data and estimate instantaneous linear ego-velocity along with its associated uncertainty. This uncertainty-aware velocity estimate is then integrated with inertial measurement unit data using an Extended Kalman Filter. The filter leverages the network-predicted uncertainty to refine the inertial sensor’s noise and bias parameters, improving the overall robustness and accuracy of the ego-motion estimation. We evaluated the proposed method on the publicly available ColoRadar dataset. Our approach achieves significantly lower error compared to the closest publicly available method and also outperforms both instantaneous and scan matching-based techniques.
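The role of the network-predicted uncertainty in the filter can be seen in a one-dimensional toy version. The sketch below is a plain linear Kalman filter on a single velocity component, not the paper's Extended Kalman Filter. The process noise `q`, the measurement variance `r`, and the function names are illustrative assumptions, with `r` standing in for the variance the network would predict.

```python
def imu_predict(v, p, accel, dt, q=0.01):
    """Propagate velocity with an IMU acceleration; variance grows by q."""
    return v + accel * dt, p + q

def kalman_update(v_pred, p_pred, z, r):
    """Fuse the prediction with a radar velocity z whose variance is r."""
    k = p_pred / (p_pred + r)             # Kalman gain
    v_new = v_pred + k * (z - v_pred)     # corrected velocity
    p_new = (1.0 - k) * p_pred            # corrected variance
    return v_new, p_new

v, p = imu_predict(v=1.0, p=0.5, accel=0.2, dt=0.1)
print(kalman_update(v, p, z=1.5, r=0.05))   # confident radar estimate: trust z
print(kalman_update(v, p, z=1.5, r=50.0))   # uncertain radar estimate: trust IMU
```

A small predicted variance pulls the state toward the radar velocity, while a large one leaves the inertial prediction almost unchanged. That weighting is the mechanism by which uncertainty-aware fusion becomes more robust to poor radar frames.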
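[AI-40] Steering Robots with Inference-Time Interactions
【Quick read】: This paper addresses the lack of effective mechanisms for users to correct a pretrained policy's behavior when it makes errors during deployment. The key to the solution is to keep the pretrained policy frozen as a fixed skill repertoire while letting user interactions guide behavior generation toward user preferences at inference time. Specifically, it proposes inference-time steering, which uses interactions to switch between discrete skills, and task and motion imitation, which lets interactions edit continuous motions under task constraints. These let users correct misaligned policy predictions without finetuning, maximizing the utility of pretrained models while achieving inference-time user objectives.
Link: https://arxiv.org/abs/2506.14287
Authors: Yanwei Wang
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: MIT Robotics PhD Thesis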
Abstract:Imitation learning has driven the development of generalist policies capable of autonomously solving multiple tasks. However, when a pretrained policy makes errors during deployment, there are limited mechanisms for users to correct its behavior. While collecting additional data for finetuning can address such issues, doing so for each downstream use case is inefficient at deployment. My research proposes an alternative: keeping pretrained policies frozen as a fixed skill repertoire while allowing user interactions to guide behavior generation toward user preferences at inference time. By making pretrained policies steerable, users can help correct policy errors when the model struggles to generalize, without needing to finetune the policy. Specifically, I propose (1) inference-time steering, which leverages user interactions to switch between discrete skills, and (2) task and motion imitation, which enables user interactions to edit continuous motions while satisfying task constraints defined by discrete symbolic plans. These frameworks correct misaligned policy predictions without requiring additional training, maximizing the utility of pretrained models while achieving inference-time user objectives.
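[AI-41] Don't throw the baby out with the bathwater: How and why deep learning for ARC
【Quick read】: This paper addresses efficient reasoning in unfamiliar domains, targeting the challenging Abstraction and Reasoning Corpus (ARC-AGI) benchmark. The key to the solution is to commit fully to the deep learning paradigm: by training the neural network (NN) on the fly at test time, both the network and the optimizer become part of the inference process, which improves generalization to unseen tasks. The work proposes a methodology for enhancing ARC reasoning starting from pretrained large language models (LLMs), and introduces the test-time techniques Test-Time Fine-Tuning (TTFT) and Augment Inference Reverse-Augmentation and Vote (AIRV), which markedly improve performance.
Link: https://arxiv.org/abs/2506.14276
Authors: Jack Cole, Mohamed Osman
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 6 figures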
Abstract:The Abstraction and Reasoning Corpus (ARC-AGI) presents a formidable challenge for AI systems. Despite the typically low performance on ARC, the deep learning paradigm remains the most effective known strategy for generating skillful (state-of-the-art) neural networks (NN) across varied modalities and tasks in vision, language etc. The deep learning paradigm has proven to be able to train these skillful neural networks and learn the abstractions needed in these diverse domains. Our work doubles down on that and continues to leverage this paradigm by incorporating on-the-fly NN training at test time. We demonstrate that fully committing to deep learning’s capacity to acquire novel abstractions yields state-of-the-art performance on ARC. Specifically, we treat both the neural network and the optimizer (rather than just a pre-trained network) as integral components of the inference process, fostering generalization to unseen tasks. Concretely, we propose a methodology for training on ARC, starting from pretrained LLMs, and enhancing their ARC reasoning. We also propose Test-Time Fine-Tuning (TTFT) and the Augment Inference Reverse-Augmentation and Vote (AIRV) as effective test-time techniques. We are the first to propose and show deep learning can be used effectively for ARC, showing boosts of up to 260% in accuracy with AIRV and a further 300% boost with TTFT. An early version of this approach secured first place in the 2023 ARCathon competition, while the final version achieved the current best score on the ARC private test-set (58%). Our findings highlight the key ingredients of a robust reasoning system in unfamiliar domains, underscoring the central mechanisms that improve broad perceptual reasoning.
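The name AIRV describes a four-step loop: augment the input, run inference, reverse the augmentation on the prediction, and vote. The sketch below follows that reading on small grids. The set of augmentations, the grid encoding, and the `airv` and `flaky_model` functions are assumptions for illustration, and `flaky_model` is a hypothetical stand-in for a trained network.

```python
from collections import Counter

def identity(g):
    return [list(row) for row in g]

def rot90(g):
    """Rotate a grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]

def rot270(g):
    """Rotate a grid 90 degrees counterclockwise."""
    return [list(row) for row in zip(*g)][::-1]

def flip(g):
    """Mirror a grid left to right."""
    return [row[::-1] for row in g]

# (augmentation, its inverse) pairs.
AUGMENTATIONS = [(identity, identity), (rot90, rot270), (rot270, rot90), (flip, flip)]

def airv(model, grid):
    """Augment, infer, reverse-augment, then majority-vote over predictions."""
    votes = Counter()
    for forward, inverse in AUGMENTATIONS:
        pred = inverse(model(forward(grid)))
        votes[tuple(map(tuple, pred))] += 1
    best, _ = votes.most_common(1)[0]
    return [list(row) for row in best]

def flaky_model(g):
    """Hypothetical model: adds 1 to every cell but errs on one view."""
    out = [[c + 1 for c in row] for row in g]
    if g[0][0] == 3:
        out[0][0] = 0
    return out

print(airv(flaky_model, [[1, 2], [3, 4]]))
```

Voting after mapping predictions back to the original frame lets correct answers from several views outvote an error that appears in only one of them.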
[AI-42] Knowledge Adaptation as Posterior Correction
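[Quick read]: This paper asks how machines can adapt to new tasks or environments as quickly as humans and animals do. The key idea is to view all adaptation methods as ways of "correcting" an approximate posterior: a more accurate posterior requires a smaller correction, which in turn means faster adaptation. The result follows from a dual perspective on the Bayesian Learning Rule of Khan and Rue (2023), in which the interference created during adaptation is characterized as a natural-gradient mismatch over past data.
Link: https://arxiv.org/abs/2506.14262
Authors: Mohammad Emtiyaz Khan
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: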
Abstract:Adaptation is the holy grail of intelligence, but even the best AI models (like GPT) lack the adaptivity of toddlers. So the question remains: how can machines adapt quickly? Despite a lot of progress on model adaptation to facilitate continual and federated learning, as well as model merging, editing, unlearning, etc., little is known about the mechanisms by which machines can naturally learn to adapt in a similar way as humans and animals. Here, we show that all such adaptation methods can be seen as different ways of ‘correcting’ the approximate posteriors. More accurate posteriors lead to smaller corrections, which in turn imply quicker adaptation. The result is obtained by using a dual-perspective of the Bayesian Learning Rule of Khan and Rue (2023) where interference created during adaptation is characterized by the natural-gradient mismatch over the past data. We present many examples to demonstrate the use of posterior-correction as a natural mechanism for the machines to learn to adapt quickly.
[AI-43] Mxplainer: Explain and Learn Insights by Imitating Mahjong Agents
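[Quick read]: This paper addresses how to extract interpretable features and decision mechanisms from black-box Mahjong AI agents. The key solution is Mxplainer, a parameterized search algorithm that can be converted into an equivalent neural network in order to learn the parameters of black-box agents. Experiments show that the learned parameters give human-understandable insight into the agents' characteristics and play styles, and that the search-based framework can locally explain the agents' decisions in most Mahjong game states.
Link: https://arxiv.org/abs/2506.14246
Authors: Lingfeng Li,Yunlong Lu,Yongyi Wang,Qifan Zheng,Wenxin Li
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: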
Abstract:People need to internalize the skills of AI agents to improve their own capabilities. Our paper focuses on Mahjong, a multiplayer game involving imperfect information and requiring effective long-term decision-making amidst randomness and hidden information. Through the efforts of AI researchers, several impressive Mahjong AI agents have already achieved performance levels comparable to those of professional human players; however, these agents are often treated as black boxes from which few insights can be gleaned. This paper introduces Mxplainer, a parameterized search algorithm that can be converted into an equivalent neural network to learn the parameters of black-box agents. Experiments conducted on AI and human player data demonstrate that the learned parameters provide human-understandable insights into these agents’ characteristics and play styles. In addition to analyzing the learned parameters, we also showcase how our search-based framework can locally explain the decision-making processes of black-box agents for most Mahjong game states.
[AI-44] Causes in neuron diagrams and testing causal reasoning in Large Language Models. A glimpse of the future of philosophy?
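[Quick read]: This paper addresses how to assess the abstract causal reasoning ability of AI. The key is an experimental test built on scholarship in the philosophy of causation, in particular the neuron diagrams popularized by D. Lewis. The test is illustrated on advanced Large Language Models (LLMs), and the authors propose a definition of cause in neuron diagrams with wider validity than previously published ones, in order to assess the causal reasoning of AI systems more accurately.
Link: https://arxiv.org/abs/2506.14239
Authors: Louis Vervoort,Vitaly Nikolaev
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by Journal for General Philosophy of Science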
Abstract:We propose a test for abstract causal reasoning in AI, based on scholarship in the philosophy of causation, in particular on the neuron diagrams popularized by D. Lewis. We illustrate the test on advanced Large Language Models (ChatGPT, DeepSeek and Gemini). Remarkably, these chatbots are already capable of correctly identifying causes in cases that are hotly debated in the literature. In order to assess the results of these LLMs and future dedicated AI, we propose a definition of cause in neuron diagrams with a wider validity than published hitherto, which challenges the widespread view that such a definition is elusive. We submit that these results are an illustration of how future philosophical research might evolve: as an interplay between human and artificial expertise.
[AI-45] ImpReSS: Implicit Recommender System for Support Conversations
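[Quick read]: This paper addresses the implicit integration of recommendations into customer support conversations: how to identify, without assuming any purchasing intent on the user's part, solution product categories (SPCs) that help resolve an issue or prevent its recurrence. The key solution is ImpReSS, a fully implicit recommender system that detects recommendation opportunities from the content of a support conversation and runs alongside existing support chatbots without disrupting the support flow, improving service quality while supporting business growth.
Link: https://arxiv.org/abs/2506.14231
Authors: Omri Haller,Yair Meidan,Dudu Mimran,Yuval Elovici,Asaf Shabtai
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: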
Abstract:Following recent advancements in large language models (LLMs), LLM-based chatbots have transformed customer support by automating interactions and providing consistent, scalable service. While LLM-based conversational recommender systems (CRSs) have attracted attention for their ability to enhance the quality of recommendations, limited research has addressed the implicit integration of recommendations within customer support interactions. In this work, we introduce ImpReSS, an implicit recommender system designed for customer support conversations. ImpReSS operates alongside existing support chatbots, where users report issues and chatbots provide solutions. Based on a customer support conversation, ImpReSS identifies opportunities to recommend relevant solution product categories (SPCs) that help resolve the issue or prevent its recurrence – thereby also supporting business growth. Unlike traditional CRSs, ImpReSS functions entirely implicitly and does not rely on any assumption of a user’s purchasing intent. Our empirical evaluation of ImpReSS’s ability to recommend relevant SPCs that can help address issues raised in support conversations shows promising results, including an MRR@1 (and recall@3) of 0.72 (0.89) for general problem solving, 0.82 (0.83) for information security support, and 0.85 (0.67) for cybersecurity troubleshooting. To support future research, our data and code will be shared upon request.
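The abstract reports MRR@1 and recall@3. As a reminder of what such ranking metrics compute, here is a minimal sketch using their common textbook definitions; the paper's exact definitions may differ, and the category names in the example are invented for illustration.

```python
def reciprocal_rank_at_k(ranked, relevant, k):
    """Return 1/rank of the first relevant item within the top k, else 0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear within the top k."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)

def mean_over_queries(metric, runs, k):
    """Average a per-query metric over (ranked, relevant) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)

# Hypothetical recommendations for two support conversations.
runs = [
    (["password manager", "vpn", "antivirus"], {"vpn"}),
    (["backup drive", "ups", "router"], {"backup drive"}),
]
print(mean_over_queries(reciprocal_rank_at_k, runs, 1))
print(mean_over_queries(recall_at_k, runs, 3))
```

With a cutoff of 1, the reciprocal rank is either 0 or 1 per query, so MRR@1 reduces to the share of conversations whose top recommendation is relevant.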
[AI-46] From Black Boxes to Transparent Minds: Evaluating and Enhancing the Theory of Mind in Multimodal Large Language Models ICML2025
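[Quick read]: This paper addresses the evaluation of Theory of Mind (ToM) in Multimodal Large Language Models (MLLMs). Existing methods focus mainly on unimodal models and treat them as black boxes, without interpretive analysis of their internal mechanisms. The key is an approach grounded in internal mechanisms: the authors build GridToM, a multimodal ToM test dataset, analyze how attention heads distinguish cognitive information across perspectives, and propose a lightweight, training-free method that significantly improves the model's exhibited ToM by making adjustments along the attention-head direction.
Link: https://arxiv.org/abs/2506.14224
Authors: Xinyang Li,Siqi Liu,Bochao Zou,Jiansheng Chen,Huimin Ma
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages, 22 figures, accepted at ICML 2025, project page: see this https URL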
Abstract:As large language models evolve, there is growing anticipation that they will emulate human-like Theory of Mind (ToM) to assist with routine tasks. However, existing methods for evaluating machine ToM focus primarily on unimodal models and largely treat these models as black boxes, lacking an interpretative exploration of their internal mechanisms. In response, this study adopts an approach based on internal mechanisms to provide an interpretability-driven assessment of ToM in multimodal large language models (MLLMs). Specifically, we first construct a multimodal ToM test dataset, GridToM, which incorporates diverse belief testing tasks and perceptual information from multiple perspectives. Next, our analysis shows that attention heads in multimodal large models can distinguish cognitive information across perspectives, providing evidence of ToM capabilities. Furthermore, we present a lightweight, training-free approach that significantly enhances the model’s exhibited ToM by adjusting in the direction of the attention head.
[AI-47] TriGuard: Testing Model Safety with Attribution Entropy Verification and Drift
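[Quick read]: This paper addresses the limited reliability of deep neural networks under adversarial and distributional shifts, and in particular the mismatch between model accuracy and interpretability. The key solution is TriGuard, a framework that combines formal robustness verification, attribution entropy to quantify saliency concentration, and a new Attribution Drift Score that measures explanation stability, giving a fuller assessment of model safety.
Link: https://arxiv.org/abs/2506.14217
Authors: Dipesh Tharu Mahato,Rohan Poudel,Pramod Dhungana
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 tables, 6 figures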
Abstract:Deep neural networks often achieve high accuracy, but ensuring their reliability under adversarial and distributional shifts remains a pressing challenge. We propose TriGuard, a unified safety evaluation framework that combines (1) formal robustness verification, (2) attribution entropy to quantify saliency concentration, and (3) a novel Attribution Drift Score measuring explanation stability. TriGuard reveals critical mismatches between model accuracy and interpretability: verified models can still exhibit unstable reasoning, and attribution-based signals provide complementary safety insights beyond adversarial accuracy. Extensive experiments across three datasets and five architectures show how TriGuard uncovers subtle fragilities in neural reasoning. We further demonstrate that entropy-regularized training reduces explanation drift without sacrificing performance. TriGuard advances the frontier in robust, interpretable model evaluation.
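To make "attribution entropy" and "explanation drift" concrete, the sketch below treats normalized absolute attributions as a probability distribution and computes its Shannon entropy, plus a simple distance between the attributions of a clean and a perturbed input. Both formulas are assumptions made for illustration; the abstract does not give the paper's exact definitions, and the Attribution Drift Score in particular may be defined differently.

```python
import math

def _normalize(attributions):
    # Turn attribution magnitudes into a distribution that sums to 1.
    magnitudes = [abs(a) for a in attributions]
    total = sum(magnitudes)
    return [m / total for m in magnitudes] if total > 0 else magnitudes

def attribution_entropy(attributions):
    """Shannon entropy of the saliency distribution (low = concentrated)."""
    probs = _normalize(attributions)
    return -sum(p * math.log(p) for p in probs if p > 0)

def attribution_drift(clean, perturbed):
    """Assumed drift measure: L2 distance between normalized attributions."""
    p, q = _normalize(clean), _normalize(perturbed)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Saliency focused on one feature versus spread evenly over four features.
print(attribution_entropy([1.0, 0.0, 0.0, 0.0]))
print(attribution_entropy([1.0, 1.0, 1.0, 1.0]))
print(attribution_drift([1.0, 0.0], [0.0, 1.0]))
```

Under these assumed definitions, a concentrated explanation has entropy near zero, a uniform one has entropy log(n), and a stable explanation has drift near zero.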
[AI-48] What's in the Box? Reasoning about Unseen Objects from Multimodal Cues
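[Quick read]: This paper addresses how to infer objects that cannot be perceived directly by flexibly integrating several sources of information: auditory and visual cues, language, and prior knowledge. The key solution is a neurosymbolic model that uses neural networks to parse open-ended multimodal inputs and then applies a Bayesian model to integrate the information sources and evaluate competing hypotheses.
Link: https://arxiv.org/abs/2506.14212
Authors: Lance Ying,Daniel Xu,Alicia Zhang,Katherine M. Collins,Max H. Siegel,Joshua B. Tenenbaum
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Paper published at CogSci 2025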
Abstract:People regularly make inferences about objects in the world that they cannot see by flexibly integrating information from multiple sources: auditory and visual cues, language, and our prior beliefs and knowledge about the scene. How are we able to so flexibly integrate many sources of information to make sense of the world around us, even if we have no direct knowledge? In this work, we propose a neurosymbolic model that uses neural networks to parse open-ended multimodal inputs and then applies a Bayesian model to integrate different sources of information to evaluate different hypotheses. We evaluate our model with a novel object guessing game called “What’s in the Box?” where humans and models watch a video clip of an experimenter shaking boxes and then try to guess the objects inside the boxes. Through a human experiment, we show that our model correlates strongly with human judgments, whereas unimodal ablated models and large multimodal neural model baselines show poor correlation.
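The Bayesian integration step described in the abstract can be illustrated with a small sketch: a prior over hypotheses is multiplied by one likelihood per cue and renormalized. The hypotheses, the numbers, and the assumption that cues are conditionally independent given the hidden object are all illustrative and do not come from the paper.

```python
def posterior(prior, cue_likelihoods):
    """Combine a prior with per-cue likelihoods using Bayes' rule.

    Assumes the cues are conditionally independent given the hypothesis.
    """
    unnormalized = {}
    for hypothesis, p in prior.items():
        for likelihood in cue_likelihoods:
            p *= likelihood[hypothesis]
        unnormalized[hypothesis] = p
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Hypothetical question: which object is inside the box?
prior = {"keys": 0.5, "ball": 0.5}
audio = {"keys": 0.8, "ball": 0.2}    # P(sound heard | object)
visual = {"keys": 0.6, "ball": 0.4}   # P(box motion seen | object)

print(posterior(prior, [audio, visual]))
```

In the paper's setup the likelihoods would come from neural networks that parse the audio, video, and language inputs; here they are hard-coded numbers.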
[AI-49] DiffusionBlocks: Blockwise Training for Generative Models via Score-Based Diffusion
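[Quick read]: This paper addresses the significant memory bottleneck that end-to-end backpropagation creates when training large neural networks, which limits access to state-of-the-art AI research. The key solution is DiffusionBlocks, a training framework that interprets neural network blocks as denoising operations in a continuous-time diffusion process. By partitioning the network into independently trainable blocks and assigning noise levels to blocks according to equal cumulative probability mass, it achieves substantial memory savings while remaining competitive with conventional backpropagation on generative tasks.
Link: https://arxiv.org/abs/2506.14202
Authors: Makoto Shing,Takuya Akiba
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: To appear at TTODLer-FM Workshop of the 42nd International Conference on Machine Learning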
Abstract:Training large neural networks with end-to-end backpropagation creates significant memory bottlenecks, limiting accessibility to state-of-the-art AI research. We propose DiffusionBlocks, a novel training framework that interprets neural network blocks as performing denoising operations in a continuous-time diffusion process. By partitioning the network into independently trainable blocks and optimizing noise level assignments based on equal cumulative probability mass, our approach achieves significant memory efficiency while maintaining competitive performance compared to traditional backpropagation in generative tasks. Experiments on image generation and language modeling tasks demonstrate memory reduction proportional to the number of blocks while achieving superior performance. DiffusionBlocks provides a promising pathway for democratizing access to large-scale neural network training with limited computational resources.
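One way to read "noise level assignments based on equal cumulative probability mass" is that the noise-level axis is cut at quantiles of the training noise distribution, so every block is responsible for the same share of probability mass. The sketch below assumes a log-normal noise distribution with illustrative parameters; the distribution, its parameters, the `eps` tail cutoff, and the function name are assumptions rather than details taken from the paper.

```python
import math
from statistics import NormalDist

def equal_mass_boundaries(num_blocks, log_mean=-1.2, log_std=1.2, eps=1e-3):
    """Return num_blocks + 1 noise levels that split a log-normal
    distribution over sigma into intervals of equal probability mass.

    eps trims the extreme tails so that the quantiles stay finite.
    """
    log_sigma = NormalDist(log_mean, log_std)
    quantiles = [eps + (1 - 2 * eps) * i / num_blocks
                 for i in range(num_blocks + 1)]
    return [math.exp(log_sigma.inv_cdf(q)) for q in quantiles]

boundaries = equal_mass_boundaries(num_blocks=4)
for block, (low, high) in enumerate(zip(boundaries, boundaries[1:])):
    # Block `block` would be trained only on noise levels in [low, high).
    print(block, round(low, 4), round(high, 4))
```

Compared with a uniform split of the noise range, a quantile split gives narrower intervals where the density is high, so no block is assigned a region that is rarely sampled during training.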
[AI-50] Balancing Caregiving and Self-Care: Exploring Mental Health Needs of Alzheimer's and Dementia Caregivers
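[Quick read]: This paper addresses the fact that existing support systems overlook the mental health needs of family caregivers of people with Alzheimer's Disease and Related Dementias (AD/ADRD) over the course of long-term caregiving. Through semi-structured interviews, the study analyzes caregivers' mental health challenges and coping strategies and maps how their mental wellbeing evolves across three distinct stages of the caregiving journey. The key recommendation is to develop accessible, scalable, and personalized mental health technologies that adapt to caregivers' changing needs, providing stage-sensitive, holistic support.
Link: https://arxiv.org/abs/2506.14196
Authors: Jiayue Melissa Shi,Keran Wang,Dong Whi Yoo,Ravi Karkar,Koustuv Saha
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: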
Abstract:Alzheimer’s Disease and Related Dementias (AD/ADRD) are progressive neurodegenerative conditions that impair memory, thought processes, and functioning. Family caregivers of individuals with AD/ADRD face significant mental health challenges due to long-term caregiving responsibilities. Yet, current support systems often overlook the evolving nature of their mental wellbeing needs. Our study examines caregivers’ mental wellbeing concerns, focusing on the practices they adopt to manage the burden of caregiving and the technologies they use for support. Through semi-structured interviews with 25 family caregivers of individuals with AD/ADRD, we identified the key causes and effects of mental health challenges, and developed a temporal mapping of how caregivers’ mental wellbeing evolves across three distinct stages of the caregiving journey. Additionally, our participants shared insights into improvements for existing mental health technologies, emphasizing the need for accessible, scalable, and personalized solutions that adapt to caregivers’ changing needs over time. These findings offer a foundation for designing dynamic, stage-sensitive interventions that holistically support caregivers’ mental wellbeing, benefiting both caregivers and care recipients.
[AI-51] StorySage: Conversational Autobiography Writing Powered by a Multi-Agent Framework
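[Quick read]: This paper addresses the shortcomings of existing conversational writing assistants in capturing personal memories and building a complete autobiographical narrative. These stem from reliance on generic user interactions and predefined guidelines, which adapt poorly to individual memory organization and long-term narrative development. The key solution is StorySage, a system built on a multi-agent framework comprising an Interviewer, Session Scribe, Planner, Section Writer, and Session Coordinator. It iteratively collects user memories, updates the autobiography, and plans future conversations, supporting flexible yet structured autobiography writing.
Link: https://arxiv.org/abs/2506.14159
Authors: Shayan Talaei,Meijin Li,Kanu Grover,James Kent Hippler,Diyi Yang,Amin Saberi
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: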
Abstract:Every individual carries a unique and personal life story shaped by their memories and experiences. However, these memories are often scattered and difficult to organize into a coherent narrative, a challenge that defines the task of autobiography writing. Existing conversational writing assistants tend to rely on generic user interactions and pre-defined guidelines, making it difficult for these systems to capture personal memories and develop a complete biography over time. We introduce StorySage, a user-driven software system designed to meet the needs of a diverse group of users that supports a flexible conversation and a structured approach to autobiography writing. Powered by a multi-agent framework composed of an Interviewer, Session Scribe, Planner, Section Writer, and Session Coordinator, our system iteratively collects user memories, updates their autobiography, and plans for future conversations. In experimental simulations, StorySage demonstrates its ability to navigate multiple sessions and capture user memories across many conversations. User studies (N=28) highlight how StorySage maintains improved conversational flow, narrative completeness, and higher user satisfaction when compared to a baseline. In summary, StorySage contributes both a novel architecture for autobiography writing and insights into how multi-agent systems can enhance human-AI creative partnerships.
[AI-52] Collaborative Editable Model
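[Quick read]: This paper addresses the dependence of vertical-domain large language models (LLMs) on large-scale annotated data and substantial compute for training, which impedes rapid development and continuous iteration. The key solution is the Collaborative Editable Model (CoEM), which builds a candidate knowledge pool from user-contributed domain snippets, combines interactive user-model dialogues with user ratings and attribution analysis to identify high-value knowledge fragments, and injects those fragments through in-context prompts for lightweight domain adaptation, improving the accuracy and domain relevance of generated content.
Link: https://arxiv.org/abs/2506.14146
Authors: Kaiwen Tang,Aitong Wu,Yao Lu,Guangda Sun
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: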
Abstract:Vertical-domain large language models (LLMs) play a crucial role in specialized scenarios such as finance, healthcare, and law; however, their training often relies on large-scale annotated data and substantial computational resources, impeding rapid development and continuous iteration. To address these challenges, we introduce the Collaborative Editable Model (CoEM), which constructs a candidate knowledge pool from user-contributed domain snippets, leverages interactive user-model dialogues combined with user ratings and attribution analysis to pinpoint high-value knowledge fragments, and injects these fragments via in-context prompts for lightweight domain adaptation. With high-value knowledge, the LLM can generate more accurate and domain-specific content. In a financial information scenario, we collect 15k feedback from about 120 users and validate CoEM with user ratings to assess the quality of generated insights, demonstrating significant improvements in domain-specific generation while avoiding the time and compute overhead of traditional fine-tuning workflows.
[AI-53] NeuroCoreX: An Open-Source FPGA-Based Spiking Neural Network Emulator with On-Chip Learning
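[Quick read]: This paper targets the limitations of conventional Artificial Neural Networks (ANNs) in energy efficiency and topological flexibility by providing an efficient, flexible hardware platform for emulating and testing Spiking Neural Networks (SNNs). The key is the design and implementation of NeuroCoreX, an FPGA-based emulator that supports all-to-all connectivity, a biologically motivated local learning mechanism based on Spike-Timing-Dependent Plasticity (STDP), and a simple interface for configuring parameters and interacting with the network, supporting flexible SNN development and deployment.
Link: https://arxiv.org/abs/2506.14138
Authors: Ashish Gautam,Prasanna Date,Shruti Kulkarni,Robert Patton,Thomas Potok
Affiliation: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: Neuromorphic computing, FPGA, STDP, Spiking Graph Neural Networks, Spiking Neural Networks, VHDL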
Abstract:Spiking Neural Networks (SNNs) are computational models inspired by the structure and dynamics of biological neuronal networks. Their event-driven nature enables them to achieve high energy efficiency, particularly when deployed on neuromorphic hardware platforms. Unlike conventional Artificial Neural Networks (ANNs), which primarily rely on layered architectures, SNNs naturally support a wide range of connectivity patterns, from traditional layered structures to small-world graphs characterized by locally dense and globally sparse connections. In this work, we introduce NeuroCoreX, an FPGA-based emulator designed for the flexible co-design and testing of SNNs. NeuroCoreX supports all-to-all connectivity, providing the capability to implement diverse network topologies without architectural restrictions. It features a biologically motivated local learning mechanism based on Spike-Timing-Dependent Plasticity (STDP). The neuron model implemented within NeuroCoreX is the Leaky Integrate-and-Fire (LIF) model, with current-based synapses facilitating spike integration and transmission. A Universal Asynchronous Receiver-Transmitter (UART) interface is provided for programming and configuring the network parameters, including neuron, synapse, and learning rule settings. Users interact with the emulator through a simple Python-based interface, streamlining SNN deployment from model design to hardware execution. NeuroCoreX is released as an open-source framework, aiming to accelerate research and development in energy-efficient, biologically inspired computing.
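For readers unfamiliar with the two mechanisms named in the abstract, the following is a software-only sketch of a discrete-time Leaky Integrate-and-Fire neuron with a current-based synapse, and of a pair-based STDP weight update. It is not NeuroCoreX's hardware implementation or its Python interface; the function names and all constants (leak, decay, threshold, learning rates) are illustrative assumptions.

```python
import math

def simulate_lif(input_spikes, weight, leak=0.9, syn_decay=0.5, threshold=1.0):
    """Discrete-time LIF neuron driven by one current-based synapse."""
    v, i_syn, output = 0.0, 0.0, []
    for spike in input_spikes:
        i_syn = syn_decay * i_syn + weight * spike  # synaptic current decays, then jumps
        v = leak * v + i_syn                        # leaky integration of the current
        if v >= threshold:
            output.append(1)
            v = 0.0                                 # reset the membrane after firing
        else:
            output.append(0)
    return output

def stdp_update(weight, dt, a_plus=0.05, a_minus=0.055, tau=20.0,
                w_min=0.0, w_max=1.0):
    """Pair-based STDP, where dt = t_post - t_pre in time steps."""
    if dt > 0:      # pre fired before post: potentiate
        weight += a_plus * math.exp(-dt / tau)
    elif dt < 0:    # post fired before pre: depress
        weight -= a_minus * math.exp(dt / tau)
    return min(w_max, max(w_min, weight))

print(simulate_lif([1, 1, 1, 0, 0], weight=0.6))
print(stdp_update(0.5, dt=5), stdp_update(0.5, dt=-5))
```

The update is local in the sense that it depends only on the spike times of the two neurons the synapse connects, which is the property that makes STDP attractive for on-chip learning.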
[AI-54] Less is More: Undertraining Experts Improves Model Upcycling
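[Quick read]: This paper addresses the performance degradation that arises when fine-tuned expert models are upcycled: long fine-tuning that optimizes each expert's individual performance leads to worse merging performance and worse downstream results. The key is the finding that the degradation stems from memorization of a small set of difficult examples during late fine-tuning, which are then forgotten during merging. A task-dependent, aggressive early-stopping strategy mitigates this and significantly improves upcycling performance.
Link: https://arxiv.org/abs/2506.14126
Authors: Stefan Horoi,Guy Wolf,Eugene Belilovsky,Gintare Karolina Dziugaite
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: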
Abstract:Modern deep learning is increasingly characterized by the use of open-weight foundation models that can be fine-tuned on specialized datasets. This has led to a proliferation of expert models and adapters, often shared via platforms like HuggingFace and AdapterHub. To leverage these resources, numerous model upcycling methods have emerged, enabling the reuse of fine-tuned models in multi-task systems. A natural pipeline has thus formed to harness the benefits of transfer learning and amortize sunk training costs: models are pre-trained on general data, fine-tuned on specific tasks, and then upcycled into more general-purpose systems. A prevailing assumption is that improvements at one stage of this pipeline propagate downstream, leading to gains at subsequent steps. In this work, we challenge that assumption by examining how expert fine-tuning affects model upcycling. We show that long fine-tuning of experts that optimizes for their individual performance leads to degraded merging performance, both for fully fine-tuned and LoRA-adapted models, and to worse downstream results when LoRA adapters are upcycled into MoE layers. We trace this degradation to the memorization of a small set of difficult examples that dominate late fine-tuning steps and are subsequently forgotten during merging. Finally, we demonstrate that a task-dependent aggressive early stopping strategy can significantly improve upcycling performance.
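As background for the merging step the abstract refers to, here is a minimal sketch of one common merging recipe: averaging the experts' parameter deltas relative to a shared base model. The abstract does not say which merging methods the paper evaluates, so this specific recipe, the function name, and the toy parameter values are assumptions.

```python
def merge_experts(base, experts, scale=1.0):
    """Merge fine-tuned experts by averaging their deltas from the base model.

    `base` and every expert map parameter names to lists of floats.
    """
    merged = {}
    for name, base_values in base.items():
        merged[name] = []
        for i, base_value in enumerate(base_values):
            deltas = [expert[name][i] - base_value for expert in experts]
            merged[name].append(base_value + scale * sum(deltas) / len(deltas))
    return merged

# Hypothetical two-parameter model with two experts fine-tuned on different tasks.
base = {"w": [0.0, 0.0]}
expert_a = {"w": [1.0, 0.0]}
expert_b = {"w": [0.0, 1.0]}
print(merge_experts(base, [expert_a, expert_b]))
```

In a setup like this, the paper's finding would suggest taking each expert from an earlier checkpoint of its fine-tuning run before merging, rather than from the checkpoint with the best individual score.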
[AI-55] Situational-Constrained Sequential Resources Allocation via Reinforcement Learning
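[Quick read]: This paper addresses Sequential Resource Allocation with situational constraints, a problem common in real-world applications where resource demands and priorities depend on context. The key solution is a framework called SCRL, which formalizes situational constraints as logical implications and uses a new algorithm that dynamically penalizes constraint violations. To handle situational constraints effectively, it also introduces a probabilistic selection mechanism that overcomes limitations of traditional constrained reinforcement learning (CRL) approaches.
Link: https://arxiv.org/abs/2506.14125
Authors: Libo Zhang,Yang Chen,Toru Takisaka,Kaiqi Zhao,Weidong Li,Jiamou Liu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: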
Abstract:Sequential Resource Allocation with situational constraints presents a significant challenge in real-world applications, where resource demands and priorities are context-dependent. This paper introduces a novel framework, SCRL, to address this problem. We formalize situational constraints as logic implications and develop a new algorithm that dynamically penalizes constraint violations. To handle situational constraints effectively, we propose a probabilistic selection mechanism to overcome limitations of traditional constraint reinforcement learning (CRL) approaches. We evaluate SCRL across two scenarios: medical resource allocation during a pandemic and pesticide distribution in agriculture. Experiments demonstrate that SCRL outperforms existing baselines in satisfying constraints while maintaining high resource efficiency, showcasing its potential for real-world, context-sensitive decision-making tasks.
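A situational constraint of the form "if the situation holds, then the allocation must satisfy a requirement" is violated only when the antecedent is true and the consequent is false. The sketch below pairs that check with a penalty weight that grows while violations persist, in the style of a Lagrangian multiplier update. This is a generic illustration and not the SCRL algorithm; the class, its parameters, and the example scenario are assumptions.

```python
def implication_violated(antecedent, consequent):
    """An implication A => C is violated only when A holds and C does not."""
    return antecedent and not consequent

class AdaptivePenalty:
    """Penalty weight that rises while the constraint keeps being violated."""

    def __init__(self, step=0.5, initial=0.0):
        self.step = step
        self.weight = initial

    def shaped_reward(self, reward, violated):
        # Subtract the current penalty from the task reward on violation.
        return reward - self.weight * (1.0 if violated else 0.0)

    def update(self, violation_rate, tolerance=0.0):
        # Dual-ascent style update: grow when violations exceed the tolerance.
        self.weight = max(0.0, self.weight + self.step * (violation_rate - tolerance))

# Hypothetical rule: if a region's infection rate is high, it gets >= 10 units.
infection_rate, allocated = 0.4, 6
violated = implication_violated(infection_rate > 0.3, allocated >= 10)

penalty = AdaptivePenalty()
penalty.update(violation_rate=1.0)
print(violated, penalty.shaped_reward(1.0, violated))
```

Because the constraint is inactive whenever the antecedent is false, the penalty applies only in the situations the rule describes, which is what separates situational constraints from a fixed global budget.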
[AI-56] CLGNN: A Contrastive Learning-based GNN Model for Betweenness Centrality Prediction on Temporal Graphs
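[Quick read]: This paper addresses the high cost of computing Temporal Betweenness Centrality (TBC) in temporal networks and the extreme imbalance of its distribution, which causes learning-based models to overfit to zero-centrality nodes, predict TBC inaccurately, and miss truly central nodes. The key solution is CLGNN, a scalable and inductive contrastive learning-based graph neural network. It builds an instance graph that preserves path validity and temporal order, encodes structural and temporal features with dual aggregation (mean and edge-to-node multi-head attention), and combines a stability-based, clustering-guided contrastive module (KContrastNet) with a regression module (ValueNet), mitigating class imbalance and improving the accuracy and efficiency of TBC prediction.
Link: https://arxiv.org/abs/2506.14122
Authors: Tianming Zhang,Renbo Zhang,Zhengyi Yang,Yunjun Gao,Bin Cao,Jing Fan
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: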
Abstract:Temporal Betweenness Centrality (TBC) measures how often a node appears on optimal temporal paths, reflecting its importance in temporal networks. However, exact computation is highly expensive, and real-world TBC distributions are extremely imbalanced. The severe imbalance leads learning-based models to overfit to zero-centrality nodes, resulting in inaccurate TBC predictions and failure to identify truly central nodes. Existing graph neural network (GNN) methods either fail to handle such imbalance or ignore temporal dependencies altogether. To address these issues, we propose a scalable and inductive contrastive learning-based GNN (CLGNN) for accurate and efficient TBC prediction. CLGNN builds an instance graph to preserve path validity and temporal order, then encodes structural and temporal features using dual aggregation, i.e., mean and edge-to-node multi-head attention mechanisms, enhanced by temporal path count and time encodings. A stability-based clustering-guided contrastive module (KContrastNet) is introduced to separate high-, median-, and low-centrality nodes in representation space, mitigating class imbalance, while a regression module (ValueNet) estimates TBC values. CLGNN also supports multiple optimal path definitions to accommodate diverse temporal semantics. Extensive experiments demonstrate the effectiveness and efficiency of CLGNN across diverse benchmarks. CLGNN achieves up to a 663.7× speedup compared to state-of-the-art exact TBC computation methods. It outperforms leading static GNN baselines with up to 31.4× lower MAE and 16.7× higher Spearman correlation, and surpasses state-of-the-art temporal GNNs with up to 5.7× lower MAE and 3.9× higher Spearman correlation.
[AI-57] SKOLR: Structured Koopman Operator Linear RNN for Time-Series Forecasting
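[Quick read]: This paper addresses how to obtain a tractable finite-dimensional approximation of the Koopman operator for nonlinear dynamical system analysis and time-series forecasting. The key is to connect Koopman operator approximation with linear Recurrent Neural Networks (RNNs): the method uses a learnable spectral decomposition of the input signal together with a multilayer perceptron (MLP) as measurement functions, and implements a structured Koopman operator for efficient time-series modeling and forecasting.
Link: https://arxiv.org/abs/2506.14113
Authors: Yitian Zhang,Liheng Ma,Antonios Valkanas,Boris N. Oreshkin,Mark Coates
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: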
Abstract:Koopman operator theory provides a framework for nonlinear dynamical system analysis and time-series forecasting by mapping dynamics to a space of real-valued measurement functions, enabling a linear operator representation. Despite the advantage of linearity, the operator is generally infinite-dimensional. Therefore, the objective is to learn measurement functions that yield a tractable finite-dimensional Koopman operator approximation. In this work, we establish a connection between Koopman operator approximation and linear Recurrent Neural Networks (RNNs), which have recently demonstrated remarkable success in sequence modeling. We show that by considering an extended state consisting of lagged observations, we can establish an equivalence between a structured Koopman operator and linear RNN updates. Building on this connection, we present SKOLR, which integrates a learnable spectral decomposition of the input signal with a multilayer perceptron (MLP) as the measurement functions and implements a structured Koopman operator via a highly parallel linear RNN stack. Numerical experiments on various forecasting benchmarks and dynamical systems show that this streamlined, Koopman-theory-based design delivers exceptional performance.
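The abstract's central observation is that stacking lagged observations into an extended state lets the dynamics be advanced by a purely linear update. The toy sketch below shows this for a series that obeys a linear recurrence, using identity measurement functions and a companion-form update. SKOLR itself learns its measurement functions with an MLP and a spectral decomposition, so this is only an illustration of the underlying idea; the function names are hypothetical.

```python
def companion_step(state, coeffs):
    """One linear update of the lagged state [y_t, y_{t-1}, ...].

    Equivalent to multiplying the state by a companion matrix whose
    first row is `coeffs`.
    """
    next_value = sum(c * s for c, s in zip(coeffs, state))
    return [next_value] + state[:-1]

def forecast(history, coeffs, steps):
    """Roll the linear update forward; no nonlinearity is applied."""
    state = history[::-1][:len(coeffs)]  # most recent observation first
    predictions = []
    for _ in range(steps):
        state = companion_step(state, coeffs)
        predictions.append(state[0])
    return predictions

# The Fibonacci recurrence y_{t+1} = y_t + y_{t-1} as a linear system.
print(forecast([1, 1], coeffs=[1, 1], steps=3))
```

Because the update is linear, repeated steps amount to repeated multiplication by one fixed matrix, which is the structure that linear RNNs exploit for parallel computation.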
[AI-58] Toward a Graph Foundation Model: Pre-Training Transformers With Random Walks
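[Quick read]: This paper addresses how to build a graph foundation model analogous to the foundation models of natural language, and in particular how a sequence model can encode graphs of varying sizes and from different domains. The key is to represent each node as multiple random walks, so that a Transformer can extract node representations from sequences, which in turn form edge and graph representations. The paper also proposes a new context prediction loss for these random walks and theoretically analyzes their expressive power in distinguishing neighborhoods and graphs.
Link: https://arxiv.org/abs/2506.14098
Authors: Ziyuan Tang,Jie Chen
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: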
Abstract:A foundation model like GPT elicits many emergent abilities, owing to the pre-training with broad inclusion of data and the use of the powerful Transformer architecture. While foundation models in natural languages are prevalent, can we build similar models for graphs? This paper describes an approach toward a graph foundation model that is pre-trained with diverse graph datasets by adapting the Transformer backbone. A central challenge toward this end is how a sequence model encodes graphs of varying sizes and from different domains. We propose representing a node as multiple random walks, such that the Transformer can extract node representations from sequences, which in turn form edge and graph representations. We develop a novel context prediction loss for these random walks and theoretically analyze their expressive power in distinguishing neighborhoods and graphs. We also demonstrate the pre-training of our model and its adaptation to downstream tasks, showcasing its potential as a foundation for processing and reasoning with graph-structured data.
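Representing a node as several random walks can be sketched as follows: starting from the node, repeatedly hop to a uniformly chosen neighbor and record the visited sequence, which a sequence model could then consume as tokens. The uniform sampling scheme, the walk parameters, and the function name are assumptions, since the abstract does not specify how the walks are drawn.

```python
import random

def sample_random_walks(adjacency, start, num_walks, walk_length, seed=0):
    """Sample uniform random walks that all begin at `start`.

    `adjacency` maps each node to the list of its neighbors.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    walks = []
    for _ in range(num_walks):
        walk = [start]
        while len(walk) < walk_length:
            neighbors = adjacency.get(walk[-1], [])
            if not neighbors:  # dead end: stop this walk early
                break
            walk.append(rng.choice(neighbors))
        walks.append(walk)
    return walks

# Hypothetical small undirected graph.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
walks = sample_random_walks(graph, start="a", num_walks=3, walk_length=5)
for walk in walks:
    print(" ".join(walk))
```

Each walk is a fixed-length sequence regardless of the size of the graph, which is what lets one sequence model handle graphs of very different sizes.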
[AI-59] Fragile Preferences: A Deep Dive Into Order Effects in Large Language Models
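[Quick read]: This paper addresses positional bias in Large Language Models (LLMs) used in high-stakes decision settings such as hiring and university admissions, where it can undermine fairness and accuracy. The study finds strong and consistent order effects, including a centrality bias not previously documented in human or machine decision-making, and a quality-dependent shift: models show primacy bias when options are high quality but favor later options when quality is low. It also finds a bias favoring certain names over others. To separate superficial tie-breaking from true distortions of judgment, the authors introduce a framework that classifies pairwise preferences as robust, fragile, or indifferent. The key to the solution is to identify and quantify these positional biases and to propose targeted mitigation strategies, including a novel use of the temperature parameter to reduce order-driven distortions.
Link: https://arxiv.org/abs/2506.14092
Authors: Haonan Yin,Shai Vardi,Vidyanand Choudhary
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: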
Abstract:Large language models (LLMs) are increasingly used in decision-support systems across high-stakes domains such as hiring and university admissions, where decisions often involve selecting among competing alternatives. While prior work has noted positional order biases in LLM-driven comparisons, these biases have not been systematically dissected or linked to underlying preference structures. We provide the first comprehensive investigation of positional biases across multiple LLM architectures and domains, uncovering strong and consistent order effects, including a novel centrality bias not previously documented in human or machine decision-making. We also find a quality-dependent shift: when options are high quality, models exhibit primacy bias, but favor latter options when option quality is low. We further identify a previously undocumented bias favoring certain names over others. To distinguish superficial tie-breaking from true distortions of judgment, we introduce a framework that classifies pairwise preferences as robust, fragile, or indifferent. We show that order effects can lead models to select strictly inferior options, and that positional biases are typically stronger than gender biases. These findings suggest that LLMs are not merely inheriting human-like biases, but exhibit distinct failure modes not seen in human decision-making. We propose targeted mitigation strategies, including a novel use of the temperature parameter, to reduce order-driven distortions.
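A basic probe for order effects is to present the same pair of options in both orders and check whether the choice follows the option or the position. The sketch below does only that. It is not the paper's robust/fragile/indifferent classification, whose definitions the abstract does not give, and the chooser functions are hypothetical stand-ins for calls to an LLM.

```python
def probe_order_effect(choose, option_a, option_b):
    """Ask for a choice in both presentation orders and compare the answers."""
    first_order = choose([option_a, option_b])
    second_order = choose([option_b, option_a])
    if first_order == second_order:
        return "order-consistent"   # the same option wins in either order
    return "order-dependent"        # the winner changes with position

def primacy_chooser(options):
    # Hypothetical biased judge: always picks whatever is listed first.
    return options[0]

def quality_chooser(options):
    # Hypothetical unbiased judge: picks the option with the higher score.
    return max(options, key=lambda option: option["score"])

alice = {"name": "Alice", "score": 8}
bob = {"name": "Bob", "score": 6}
print(probe_order_effect(primacy_chooser, alice, bob))
print(probe_order_effect(quality_chooser, alice, bob))
```

With a real model, the probe would be repeated many times per pair, because sampled outputs vary from run to run.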
[AI-60] Lightweight Relevance Grader in RAG
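[Quick read]: This paper addresses how to verify that the documents a Retrieval-Augmented Generation (RAG) system retrieves from its vector database are relevant to the user's query. The key is to use a lightweight small language model as a relevance grader, which lowers compute requirements: fine-tuning Llama-3.2-1B raises grading precision from 0.1301 to 0.7750, comparable to that of Llama-3.1-70B.
Link: https://arxiv.org/abs/2506.14084
Authors: Taehee Jeong
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: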
Abstract:Retrieval-Augmented Generation (RAG) addresses limitations of large language models (LLMs) by leveraging a vector database to provide more accurate and up-to-date information. When a user submits a query, RAG executes a vector search to find relevant documents, which are then used to generate a response. However, ensuring the relevance of retrieved documents with a query would be a big challenge. To address this, a secondary model, known as a relevant grader, can be served to verify its relevance. To reduce computational requirements of a relevant grader, a lightweight small language model is preferred. In this work, we finetuned llama-3.2-1b as a relevant grader and achieved a significant increase in precision from 0.1301 to 0.7750. Its precision is comparable to that of llama-3.1-70b. Our code is available at this https URL.
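The headline number in the abstract is precision, the share of documents the grader marks as relevant that really are relevant. A minimal sketch of computing it, and of using grader decisions to filter retrieved documents before generation, is shown below. The labels are made up and `filter_relevant` is a hypothetical helper, not the paper's code.

```python
def precision(predicted, actual):
    """Precision = true positives / (true positives + false positives)."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0

def filter_relevant(documents, grades):
    """Keep only the documents that the grader marked as relevant."""
    return [doc for doc, keep in zip(documents, grades) if keep]

# Hypothetical grader outputs (1 = relevant) against human labels.
grader_says = [1, 1, 0, 1]
human_says = [1, 0, 0, 1]
print(precision(grader_says, human_says))
print(filter_relevant(["doc1", "doc2", "doc3", "doc4"], grader_says))
```

Precision ignores relevant documents that the grader rejects, so a full evaluation would normally report recall next to it.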
[AI-61] FormGym: Doing Paperwork with Agents
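[Quick read]: This paper addresses form filling in the pure-image domain, where no OCR, typeset PDF text, or DOM is available and a computer agent needs multimodal understanding, information retrieval, and tool use. The key is a new form-filling benchmark of 432 fields across 55 documents and 3 tasks, together with FieldFinder, a tool that helps large language models (LLMs) identify where to place text on a form and yields equal or better performance for every model in all study conditions.
Link: https://arxiv.org/abs/2506.14079
Authors: Matthew Toles,Rattandeep Singh,Isaac Song Zhou Yu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: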
Abstract:Completing paperwork is a challenging and time-consuming problem. Form filling is especially challenging in the pure-image domain without access to OCR, typeset PDF text, or a DOM. For computer agents, it requires multiple abilities, including multi-modal understanding, information retrieval, and tool-use. We present a novel form-filling benchmark consisting of 432 fields spread across 55 documents and 3 tasks, requiring knowledge of 236 features per user. We find that baseline VLAs achieve less than 1% accuracy in most cases, primarily due to poor localization ability. GUI agents also struggle, scoring between 10.6-68.0% despite high cost and latency. Therefore, we also contribute FieldFinder, a tool to assist LLMs in identifying where to place text on a form. With FieldFinder, all models achieve equal or better performance in all six study conditions, with a maximum increase from 2% to 56%.
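[AI-62] Into the Unknown: Applying Inductive Spatial-Semantic Location Embeddings for Predicting Individuals Mobility Beyond Visited Places
[Quick read]: This paper addresses next-location prediction for individuals in human mobility modelling. Traditional methods rely on location embeddings learned from historical mobility patterns, which limits their ability to encode explicit spatial information, integrate rich urban semantic context, and handle previously unseen locations. The proposed solution applies CaLLiPer, a multimodal representation learning framework that fuses the spatial coordinates and semantic features of points of interest through contrastive learning, to produce spatially explicit and semantically enriched location embeddings. The key is that the embeddings are inductive by design, so prediction stays robust in scenarios that include emerging locations.
Link: https://arxiv.org/abs/2506.14070
Authors: Xinglei Wang, Tao Cheng, Stephen Law, Zichao Zeng, Ilya Ilyankou, Junyuan Liu, Lu Yin, Weiming Huang, Natchapon Jongwiriyanurak
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures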
Abstract:Predicting individuals’ next locations is a core task in human mobility modelling, with wide-ranging implications for urban planning, transportation, public policy and personalised mobility services. Traditional approaches largely depend on location embeddings learned from historical mobility patterns, limiting their ability to encode explicit spatial information, integrate rich urban semantic context, and accommodate previously unseen locations. To address these challenges, we explore the application of CaLLiPer – a multimodal representation learning framework that fuses spatial coordinates and semantic features of points of interest through contrastive learning – for location embedding in individual mobility prediction. CaLLiPer’s embeddings are spatially explicit, semantically enriched, and inductive by design, enabling robust prediction performance even in scenarios involving emerging locations. Through extensive experiments on four public mobility datasets under both conventional and inductive settings, we demonstrate that CaLLiPer consistently outperforms strong baselines, particularly excelling in inductive scenarios. Our findings highlight the potential of multimodal, inductive location embeddings to advance the capabilities of human mobility prediction systems. We also release the code and data (this https URL) to foster reproducibility and future research.
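[AI-63] Scientifically-Interpretable Reasoning Network (ScIReN): Uncovering the Black-Box of Nature NEURIPS2025
[Quick read]: This paper addresses two problems: conventional neural networks lack interpretability for scientific discovery, and process-based models depend on free parameters that are set ad hoc and fit observations poorly in cross-scale predictions. The key to the solution is a fully transparent framework, the Scientifically-Interpretable Reasoning Network (ScIReN), which combines interpretable neural reasoning with process-based modeling. An interpretable encoder predicts scientifically meaningful latent parameters, a differentiable process-based decoder maps them to output predictions, and a hard-sigmoid constraint layer keeps the latent parameters within ranges set by scientific prior knowledge, improving both predictive accuracy and scientific interpretability.
Link: https://arxiv.org/abs/2506.14054
Authors: Joshua Fan, Haodi Xu, Feng Tao, Md Nasim, Marc Grimson, Yiqi Luo, Carla P. Gomes
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 28 pages, 9 figures, submitted to NeurIPS 2025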
Abstract:Neural networks are a powerful tool for learning patterns from data. However, they do not respect known scientific laws, nor can they reveal novel scientific insights due to their black-box nature. In contrast, scientific reasoning distills biological or physical principles from observations and controlled experiments, and quantitatively interprets them with process-based models made of mathematical equations. Yet, process-based models rely on numerous free parameters that must be set in an ad-hoc manner, and thus often fit observations poorly in cross-scale predictions. While prior work has embedded process-based models in conventional neural networks, discovering interpretable relationships between parameters in process-based models and input features is still a grand challenge for scientific discovery. We thus propose Scientifically-Interpretable Reasoning Network (ScIReN), a fully-transparent framework that combines interpretable neural and process-based reasoning. An interpretable encoder predicts scientifically-meaningful latent parameters, which are then passed through a differentiable process-based decoder to predict labeled output variables. ScIReN also uses a novel hard-sigmoid constraint layer to restrict latent parameters to meaningful ranges defined by scientific prior knowledge, further enhancing its interpretability. While the embedded process-based model enforces established scientific knowledge, the encoder reveals new scientific mechanisms and relationships hidden in conventional black-box models. We apply ScIReN on two tasks: simulating the flow of organic carbon through soils, and modeling ecosystem respiration from plants. In both tasks, ScIReN outperforms black-box networks in predictive accuracy while providing substantial scientific interpretability – it can infer latent scientific mechanisms and their relationships with input features.
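The hard-sigmoid constraint layer mentioned above can be pictured as a clipped linear map from an unconstrained encoder output into a prior range. The sketch below assumes one common hard-sigmoid definition, clip(x/6 + 0.5, 0, 1); the abstract does not give the paper's exact form, and the function names and the example range are hypothetical.

```python
def hard_sigmoid(x):
    """Piecewise-linear sigmoid: 0 for x <= -3, 1 for x >= 3, linear in between.

    Assumption: this particular slope and offset are one common convention,
    not necessarily the one used in the paper.
    """
    return min(1.0, max(0.0, x / 6.0 + 0.5))


def constrain(raw, low, high):
    """Map an unconstrained value into the closed range [low, high]."""
    return low + (high - low) * hard_sigmoid(raw)


# Hypothetical prior range [2.0, 10.0] for one latent parameter.
for raw in (-10.0, 0.0, 10.0):
    print(raw, constrain(raw, 2.0, 10.0))  # 2.0, 6.0, 10.0
```

Unlike a smooth sigmoid, the clipped form can reach the endpoints of the range exactly, at the cost of zero gradient once the output saturates.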
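[AI-64] Discovering Temporal Structure: An Overview of Hierarchical Reinforcement Learning
[Quick read]: This paper addresses how agents can explore, plan, and learn in complex open-ended environments. The central open questions are what constitutes a good hierarchical structure and for which problems identifying it actually helps. The overview is organized around the hierarchical reinforcement learning (HRL) framework, which discovers and exploits temporal structure in a stream of experience to improve decision-making, and it analyzes HRL's effect on agents' performance trade-offs.
Link: https://arxiv.org/abs/2506.14045
Authors: Martin Klissarov, Akhil Bagaria, Ziyan Luo, George Konidaris, Doina Precup, Marlos C. Machado
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: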
Abstract:Developing agents capable of exploring, planning and learning in complex open-ended environments is a grand challenge in artificial intelligence (AI). Hierarchical reinforcement learning (HRL) offers a promising solution to this challenge by discovering and exploiting the temporal structure within a stream of experience. The strong appeal of the HRL framework has led to a rich and diverse body of literature attempting to discover a useful structure. However, it is still not clear how one might define what constitutes good structure in the first place, or the kind of problems in which identifying it may be helpful. This work aims to identify the benefits of HRL from the perspective of the fundamental challenges in decision-making, as well as highlight its impact on the performance trade-offs of AI agents. Through these benefits, we then cover the families of methods that discover temporal structure in HRL, ranging from learning directly from online experience to offline datasets, to leveraging large language models (LLMs). Finally, we highlight the challenges of temporal structure discovery and the domains that are particularly well-suited for such endeavours.
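[AI-65] Asymptotically Smaller Encodings for Graph Problems and Scheduling
[Quick read]: This paper addresses how to encode graph problems (such as vertex cover, independent set, and k-coloring) into CNF for Boolean satisfiability with fewer clauses. Standard encodings need Ω(|V|²) constraints, whereas the proposed encoding uses the biclique-covering result of Erdős, Chung, and Spencer (1983) to get by with O(|V|² / lg |V|) clauses. The key is to exploit graph structure, which also gives theoretical support for preprocessing techniques such as Bounded Variable Addition. The paper further presents an independent-set encoding for some dense interval graphs that needs only O(|V| lg |V|) clauses.
Link: https://arxiv.org/abs/2506.14042
Authors: Bernardo Subercaseaux
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: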
Abstract:We show how several graph problems (e.g., vertex-cover, independent-set, $k$-coloring) can be encoded into CNF using only $O(|V|^2 / \lg |V|)$ many clauses, as opposed to the $\Omega(|V|^2)$ constraints used by standard encodings. This somewhat surprising result is a simple consequence of a result of Erdős, Chung, and Spencer (1983) about biclique coverings of graphs, and opens theoretical avenues to understand the success of "Bounded Variable Addition" (Manthey, Heule, and Biere, 2012) as a preprocessing tool. Finally, we show a novel encoding for independent sets in some dense interval graphs using only $O(|V| \lg |V|)$ clauses (the direct encoding uses $\Omega(|V|^2)$), which we have successfully applied to a string-compression encoding posed by Bannai et al. (2022). As a direct byproduct, we obtain a reduction in the encoding size of a scheduling problem posed by Mayank and Modal (2020) from $O(NMT^2)$ to $O(NMT + MT^2 \lg T)$, where $N$ is the number of tasks, $T$ the total timespan, and $M$ the number of machines.
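To see why biclique coverings shrink an encoding, consider the independent-set constraint on a single biclique: the direct encoding needs one clause per edge, while one fresh auxiliary variable covers all of its edges with one clause per vertex. The sketch below illustrates only this per-biclique saving; it does not implement the Erdős-Chung-Spencer covering that gives the asymptotic bound. The variable numbering and function names are assumptions, and equivalence is checked by brute force.

```python
from itertools import product


def direct_clauses(edges):
    """Direct encoding: one clause (not x_u or not x_v) per edge."""
    return [[-u, -v] for u, v in edges]


def biclique_clauses(left, right, aux):
    """Cover all |left| * |right| edges with |left| + |right| clauses.

    aux is a fresh variable: picking any left vertex forces aux true,
    and aux true forbids every right vertex.
    """
    return [[-u, aux] for u in left] + [[-v, -aux] for v in right]


def satisfied(clauses, bits):
    """True if the assignment bits (variable i at index i - 1) satisfies all clauses."""
    return all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses)


def projections(clauses, n_vars, n_keep):
    """Assignments to variables 1..n_keep that extend to a satisfying assignment."""
    return {bits[:n_keep]
            for bits in product([False, True], repeat=n_vars)
            if satisfied(clauses, bits)}


# Complete bipartite graph K_{3,3}: vertices 1-3 on the left, 4-6 on the right.
left, right = [1, 2, 3], [4, 5, 6]
edges = [(u, v) for u in left for v in right]
direct = direct_clauses(edges)                   # 9 clauses
compact = biclique_clauses(left, right, aux=7)   # 6 clauses
print(len(direct), len(compact))
print(projections(direct, 6, 6) == projections(compact, 7, 6))  # True
```

For a graph whose edges are covered by several bicliques, each biclique would get its own auxiliary variable, and the total clause count becomes the sum of the biclique sizes rather than the number of edges.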
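[AI-66] Bures-Wasserstein Flow Matching for Graph Generation
[Quick read]: This paper addresses shortcomings of existing graph generation methods, in particular that they model the evolution of nodes and edges independently and build probability paths by linear interpolation. Those choices assume the data lie in Euclidean space, whereas graphs have intrinsically non-Euclidean structure and interconnected patterns. The key to the solution is to represent graphs as connected systems parameterized by Markov random fields (MRFs) and to use the optimal transport displacement between MRF objects to design the probability path, which better captures graph geometry and improves generation quality and sampling convergence.
Link: https://arxiv.org/abs/2506.14020
Authors: Keyue Jiang, Jiahao Cui, Xiaowen Dong, Laura Toni
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: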
Abstract:Graph generation has emerged as a critical task in fields ranging from molecule design to drug discovery. Contemporary approaches, notably diffusion and flow-based models, have achieved solid graph generative performance through constructing a probability path that interpolates between a reference distribution and the data distribution. However, these methods typically model the evolution of individual nodes and edges independently and use linear interpolations to build the path assuming that the data lie in Euclidean space. We show that this is suboptimal given the intrinsic non-Euclidean structure and interconnected patterns of graphs, and it poses risks to the sampling convergence. To build a better probability path, we model the joint evolution of the nodes and edges by representing graphs as connected systems parameterized by Markov random fields (MRF). We then leverage the optimal transport displacement between MRF objects to design the probability path for graph generation. Based on this, we introduce BWFlow, a flow-matching framework for graph generation that respects the underlying geometry of graphs and provides smooth velocities in the probability path. The novel framework can be adapted to both continuous and discrete flow-matching algorithms. Experimental evaluations in plain graph generation and 2D/3D molecule generation validate the effectiveness of BWFlow in graph generation with competitive performance, stable training, and guaranteed sampling convergence.
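For reference, the Bures-Wasserstein distance named in the title is, in its standard form, the 2-Wasserstein distance between two Gaussian measures. The abstract does not say how the paper derives such objects from Markov random fields, so the symbols below are generic notation rather than the authors' own.

```latex
d_{\mathrm{BW}}^{2}\bigl(\mathcal{N}(\mu_0,\Sigma_0),\,\mathcal{N}(\mu_1,\Sigma_1)\bigr)
  = \lVert \mu_0 - \mu_1 \rVert_2^{2}
  + \operatorname{tr}\!\Bigl(\Sigma_0 + \Sigma_1
      - 2\bigl(\Sigma_0^{1/2}\,\Sigma_1\,\Sigma_0^{1/2}\bigr)^{1/2}\Bigr)
```

The first term compares the means and the trace term compares the covariances, which is what lets an interpolation along this geometry move both jointly instead of entry by entry.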
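[AI-67] Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders
[Quick read]: This paper addresses theoretically guaranteed feature recovery with Sparse Autoencoders (SAEs) for interpreting Large Language Models (LLMs). Existing SAE training algorithms usually lack rigorous mathematical guarantees and suffer in practice from hyperparameter sensitivity and instability. The paper proposes a new statistical framework that defines feature identifiability by modeling polysemantic features as sparse mixtures of underlying monosemantic concepts. On top of it, the authors introduce a new SAE training algorithm based on "bias adaptation", which adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity. The algorithm is proved to recover all monosemantic features when the input data follow the proposed statistical model, and its improved variant, Group Bias Adaptation (GBA), outperforms benchmark methods on LLMs with up to 1.5 billion parameters. The key is the combination of theoretical framework and algorithm design, yielding the first SAE algorithm with theoretical recovery guarantees.
Link: https://arxiv.org/abs/2506.14002
Authors: Siyu Chen, Heejune Sheen, Xuyuan Xiong, Tianhao Wang, Zhuoran Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments: 136 pages, 21 figures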
Abstract:We study the challenge of achieving theoretically grounded feature recovery using Sparse Autoencoders (SAEs) for the interpretation of Large Language Models. Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations such as hyperparameter sensitivity and instability. To address these issues, we first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability by modeling polysemantic features as sparse mixtures of underlying monosemantic concepts. Building on this framework, we introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity. We theoretically prove that this algorithm correctly recovers all monosemantic features when input data is sampled from our proposed statistical model. Furthermore, we develop an improved empirical variant, Group Bias Adaptation (GBA), and demonstrate its superior performance against benchmark methods when applied to LLMs with up to 1.5 billion parameters. This work represents a foundational step in demystifying SAE training by providing the first SAE algorithm with theoretical recovery guarantees, thereby advancing the development of more transparent and trustworthy AI systems through enhanced mechanistic interpretability.
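[AI-68] Machine Mirages: Defining the Undefined
[Quick read]: This paper addresses a new class of cognitive aberrations, "machine mirages", that multimodal machine intelligence systems began to exhibit once they reached average animal-level and average human-level fluency. These errors include hallucination, illusion, confabulation, misattribution, and many other forms that mimic but do not replicate human or animal fallibility. The paper argues that they must be explicitly defined and systematically assessed. The key is a careful understanding and systematic analysis of these errors, both to improve the reliability of machine intelligence and to build a multiscale, ethical, co-evolving intelligence ecosystem that respects diverse forms of life, cognition, and expression.
Link: https://arxiv.org/abs/2506.13990
Authors: Hamidou Tembine
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted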
Abstract:As multimodal machine intelligence systems started achieving average animal-level and average human-level fluency in many measurable tasks in processing images, language, and sound, they began to exhibit a new class of cognitive aberrations: machine mirages. These include delusion, illusion, confabulation, hallucination, misattribution error, semantic drift, semantic compression, exaggeration, causal inference failure, uncanny valley of perception, bluffing-patter-bullshitting, cognitive stereotypy, pragmatic misunderstanding, hypersignification, semantic reheating-warming, simulated authority effect, fallacious abductive leap, contextual drift, referential hallucination, semiotic Frankenstein effect, calibration failure, spurious correlation, bias amplification, concept drift sensitivity, misclassification under uncertainty, adversarial vulnerability, overfitting, prosodic misclassification, accent bias, turn boundary failure, semantic boundary confusion, noise overfitting, latency-induced decision drift, ambiguity collapse and other forms of error that mimic but do not replicate human or animal fallibility. This article presents some of the errors and argues that these failures must be explicitly defined and systematically assessed. Understanding machine mirages is essential not only for improving machine intelligence reliability but also for constructing a multiscale ethical, co-evolving intelligence ecosystem that respects the diverse forms of life, cognition, and expression it will inevitably touch.
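[AI-69] AMLgentex: Mobilizing Data-Driven Research to Combat Money Laundering
[Quick read]: This paper addresses the inadequate evaluation of real-world money laundering detection methods, in particular the challenges of data complexity, partial observability, sparse labels, strategic behavior, temporal dynamics, class imbalance, and network-level dependencies. The key to the solution is AMLGentex, an open-source suite that generates realistic, configurable transaction data and benchmarks detection methods, so that anti-money laundering (AML) systems can be evaluated systematically in a controlled environment.
Link: https://arxiv.org/abs/2506.13989
Authors: Johan Östman, Edvin Callisen, Anton Chen, Kristiina Ausmees, Emanuel Gårdh, Jovan Zamac, Jolanta Goldsteine, Hugo Wefer, Simon Whelan, Markus Reimegård
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: 21 figures, 22 pages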
Abstract:Money laundering enables organized crime by allowing illicit funds to enter the legitimate economy. Although trillions of dollars are laundered each year, only a small fraction is ever uncovered. This stems from a range of factors, including deliberate evasion by launderers, the rarity of confirmed cases, and the limited visibility each financial institution has into the global transaction network. While several synthetic datasets are available, they fail to model the structural and behavioral complexity of real-world money laundering. In particular, they often overlook partial observability, sparse and uncertain labels, strategic behavior, temporal dynamics, class imbalance, and network-level dependencies. To address these limitations, we present AMLGentex, an open-source suite for generating realistic, configurable transaction data and benchmarking detection methods. It enables systematic evaluation of anti-money laundering (AML) systems in a controlled environment that captures key real-world challenges. We demonstrate how the framework can be used to rigorously evaluate methods under conditions that reflect the complexity of practical AML scenarios.
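[AI-70] SANGAM: SystemVerilog Assertion Generation via Monte Carlo Tree Self-Refine
[Quick read]: This paper addresses the automatic generation of SystemVerilog Assertions (SVAs) from industry-level specifications, with the aim of making hardware verification more efficient and more automated. The key to the solution is the SANGAM framework, which uses the large language model (LLM)-guided Monte Carlo Tree Self-Refine (MCTSr) algorithm and proceeds in three stages: multimodal specification processing, MCTSr-based automatic reasoning, and combination of the reasoning traces into SVAs for each signal.
Link: https://arxiv.org/abs/2506.13983
Authors: Adarsh Gupta, Bhabesh Mali, Chandan Karfa
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Adarsh Gupta and Bhabesh Mali contributed equally to this work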
Abstract:Recent advancements in the field of reasoning using Large Language Models (LLMs) have created new possibilities for more complex and automatic Hardware Assertion Generation techniques. This paper introduces SANGAM, a SystemVerilog Assertion Generation framework using LLM-guided Monte Carlo Tree Search for the automatic generation of SVAs from industry-level specifications. The proposed framework utilizes a three-stage approach: Stage 1 consists of multi-modal Specification Processing using Signal Mapper, SPEC Analyzer, and Waveform Analyzer LLM Agents. Stage 2 consists of using the Monte Carlo Tree Self-Refine (MCTSr) algorithm for automatic reasoning about SVAs for each signal, and finally, Stage 3 combines the MCTSr-generated reasoning traces to generate SVA assertions for each signal. The results demonstrated that our framework, SANGAM, can generate a robust set of SVAs, performing better in the evaluation process in comparison to the recent methods.
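[AI-71] HAELT: A Hybrid Attentive Ensemble Learning Transformer Framework for High-Frequency Stock Price Forecasting
[Quick read]: This paper addresses non-stationarity, noise, and volatility in high-frequency stock price prediction. The key to the solution is the Hybrid Attentive Ensemble Learning Transformer (HAELT) framework, which combines a ResNet-based noise-mitigation module, temporal self-attention that dynamically focuses on relevant history, and a hybrid LSTM-Transformer core that captures both local and long-range dependencies, with the components ensembled adaptively according to recent performance.
Link: https://arxiv.org/abs/2506.13981
Authors: Thanh Dan Bui
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: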
Abstract:High-frequency stock price prediction is challenging due to non-stationarity, noise, and volatility. To tackle these issues, we propose the Hybrid Attentive Ensemble Learning Transformer (HAELT), a deep learning framework combining a ResNet-based noise-mitigation module, temporal self-attention for dynamic focus on relevant history, and a hybrid LSTM-Transformer core that captures both local and long-range dependencies. These components are adaptively ensembled based on recent performance. Evaluated on hourly Apple Inc. (AAPL) data from Jan 2024 to May 2025, HAELT achieves the highest F1-Score on the test set, effectively identifying both upward and downward price movements. This demonstrates HAELT’s potential for robust, practical financial forecasting and algorithmic trading.
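The abstract does not say how the components are "adaptively ensembled based on recent performance". One plausible reading, shown purely as an illustration, weights each component by a softmax over its negative recent error; the weighting rule, the function names, and the numbers are all assumptions rather than the paper's method.

```python
import math


def ensemble_weights(recent_errors, temperature=1.0):
    """Softmax over negative recent errors: lower error gets a larger weight."""
    scores = [-e / temperature for e in recent_errors]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def ensemble_predict(predictions, recent_errors):
    """Weighted average of component predictions (e.g. probability of an up move)."""
    weights = ensemble_weights(recent_errors)
    return sum(w * p for w, p in zip(weights, predictions))


# Hypothetical components: their latest up-move probabilities and recent errors.
probs = [0.70, 0.40, 0.55]
errors = [0.10, 0.50, 0.30]
print(ensemble_weights(errors))
print(ensemble_predict(probs, errors))
```

A smaller `temperature` makes the ensemble follow the recently best component more closely, while a larger one moves it toward a plain average.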
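[AI-72] ProfiLLM: An LLM-Based Framework for Implicit Profiling of Chatbot Users
[Quick read]: This paper addresses the weak personalization of chatbots powered by large language models (LLMs), especially in knowledge-intensive technical domains such as IT/cybersecurity (ITSec), where responses are not adjusted dynamically to a user's actual technical proficiency, learning style, and communication preferences. The key to the solution is the ProfiLLM framework, which builds user profiles implicitly and dynamically from chatbot interactions, using an LLM-based method to assess and update the user's proficiency continuously against a domain-specific taxonomy and thereby improve personalization.
Link: https://arxiv.org/abs/2506.13980
Authors: Shahaf David, Yair Meidan, Ido Hersko, Daniel Varnovitzky, Dudu Mimran, Yuval Elovici, Asaf Shabtai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: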
Abstract:Despite significant advancements in conversational AI, large language model (LLM)-powered chatbots often struggle with personalizing their responses according to individual user characteristics, such as technical expertise, learning style, and communication preferences. This lack of personalization is particularly problematic in specialized knowledge-intense domains like IT/cybersecurity (ITSec), where user knowledge levels vary widely. Existing approaches for chatbot personalization primarily rely on static user categories or explicit self-reported information, limiting their adaptability to an evolving perception of the user’s proficiency, obtained in the course of ongoing interactions. In this paper, we propose ProfiLLM, a novel framework for implicit and dynamic user profiling through chatbot interactions. This framework consists of a taxonomy that can be adapted for use in diverse domains and an LLM-based method for user profiling in terms of the taxonomy. To demonstrate ProfiLLM’s effectiveness, we apply it in the ITSec domain where troubleshooting interactions are used to infer chatbot users’ technical proficiency. Specifically, we developed ProfiLLM[ITSec], an ITSec-adapted variant of ProfiLLM, and evaluated its performance on 1,760 human-like chatbot conversations from 263 synthetic users. Results show that ProfiLLM[ITSec] rapidly and accurately infers ITSec profiles, reducing the gap between actual and predicted scores by up to 55–65% after a single prompt, followed by minor fluctuations and further refinement. In addition to evaluating our new implicit and dynamic profiling framework, we also propose an LLM-based persona simulation methodology, a structured taxonomy for ITSec proficiency, our codebase, and a dataset of chatbot interactions to support future research.
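[AI-73] Making deep neural networks work for medical audio: representation compression and domain adaptation
[Quick read]: This thesis addresses the technical challenges of applying machine learning to medical audio signals, in particular automating the understanding and interpretation of lung, heart, and voice sounds. The core problems are that conventional practice relies on auditory interpretation by experts using devices such as stethoscopes, and that specialist physicians are scarce in low-resource settings. The solution has four parts. First, with limited data, large databases of adult speech are used through neural transfer learning to make infant cry analysis models more accurate and robust. Second, an end-to-end model compression method based on tensor decomposition compresses recurrent networks without post-hoc processing and suits resource-constrained devices. Third, domain adaptation techniques tailored to audio models are introduced, alongside methods adapted from computer vision, to reduce dataset bias and improve generalization. Finally, an open-source infant cry dataset developed with clinicians worldwide is released to advance research in the area.
Link: https://arxiv.org/abs/2506.13970
Authors: Charles C Onu
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: PhD Thesis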
Abstract:This thesis addresses the technical challenges of applying machine learning to understand and interpret medical audio signals. The sounds of our lungs, heart, and voice convey vital information about our health. Yet, in contemporary medicine, these sounds are primarily analyzed through auditory interpretation by experts using devices like stethoscopes. Automated analysis offers the potential to standardize the processing of medical sounds, enable screening in low-resource settings where physicians are scarce, and detect subtle patterns that may elude human perception, thereby facilitating early diagnosis and treatment. Focusing on the analysis of infant cry sounds to predict medical conditions, this thesis contributes on four key fronts. First, in low-data settings, we demonstrate that large databases of adult speech can be harnessed through neural transfer learning to develop more accurate and robust models for infant cry analysis. Second, in cost-effective modeling, we introduce an end-to-end model compression approach for recurrent networks using tensor decomposition. Our method requires no post-hoc processing, achieves compression rates of several hundred-fold, and delivers accurate, portable models suitable for resource-constrained devices. Third, we propose novel domain adaptation techniques tailored for audio models and adapt existing methods from computer vision. These approaches address dataset bias and enhance generalization across domains while maintaining strong performance on the original data. Finally, to advance research in this domain, we release a unique, open-source dataset of infant cry sounds, developed in collaboration with clinicians worldwide. This work lays the foundation for recognizing the infant cry as a vital sign and highlights the transformative potential of AI-driven audio monitoring in shaping the future of accessible and affordable healthcare. 
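[AI-74] Safe Domains of Attraction for Discrete-Time Nonlinear Systems: Characterization and Verifiable Neural Network Estimation
[Quick read]: This paper addresses accurate estimation of safe (state-constrained) domains of attraction for nonlinear autonomous systems, where existing methods tend to be conservative or limited to low-dimensional systems. The key to the solution is a new Zubov equation whose solution corresponds to the exact safe domain of attraction and is shown to be unique and continuous over the whole state space. The equation is then solved approximately with a physics-informed neural network approach, and a verification framework built on standard verification tools (such as α,β-CROWN and dReal) yields certifiable estimates of the domain of attraction.
Link: https://arxiv.org/abs/2506.13961
Authors: Mohamed Serry, Haoyu Li, Ruikun Zhou, Huan Zhang, Jun Liu
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: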
Abstract:Analysis of nonlinear autonomous systems typically involves estimating domains of attraction, which have been a topic of extensive research interest for decades. Despite that, accurately estimating domains of attraction for nonlinear systems remains a challenging task, where existing methods are conservative or limited to low-dimensional systems. The estimation becomes even more challenging when accounting for state constraints. In this work, we propose a framework to accurately estimate safe (state-constrained) domains of attraction for discrete-time autonomous nonlinear systems. In establishing this framework, we first derive a new Zubov equation, whose solution corresponds to the exact safe domain of attraction. The solution to the aforementioned Zubov equation is shown to be unique and continuous over the whole state space. We then present a physics-informed approach to approximating the solution of the Zubov equation using neural networks. To obtain certifiable estimates of the domain of attraction from the neural network approximate solutions, we propose a verification framework that can be implemented using standard verification tools (e.g., $\alpha,\beta$-CROWN and dReal). To illustrate its effectiveness, we demonstrate our approach through numerical examples concerning nonlinear systems with state constraints.
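[AI-75] Toward Explainable Offline RL: Analyzing Representations in Intrinsically Motivated Decision Transformers
[Quick read]: This paper addresses how intrinsic motivation mechanisms in offline reinforcement learning shape the embedding structure learned by Elastic Decision Transformers (EDTs), and how that structure relates to policy learning performance. The key to the solution is a systematic post-hoc explainability framework that statistically analyzes embedding properties (covariance structure, vector magnitudes, and orthogonality). It shows that different intrinsic motivation variants produce fundamentally different representational structures and finds environment-specific correlations between embedding metrics and performance, clarifying how intrinsic motivation influences decision-making by acting as a representational prior.
Link: https://arxiv.org/abs/2506.13958
Authors: Leonardo Guiducci, Antonio Rizzo, Giovanna Maria Dimitri
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: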
Abstract:Elastic Decision Transformers (EDTs) have proved to be particularly successful in offline reinforcement learning, offering a flexible framework that unifies sequence modeling with decision-making under uncertainty. Recent research has shown that incorporating intrinsic motivation mechanisms into EDTs improves performance across exploration tasks, yet the representational mechanisms underlying these improvements remain unexplored. In this paper, we introduce a systematic post-hoc explainability framework to analyze how intrinsic motivation shapes learned embeddings in EDTs. Through statistical analysis of embedding properties (including covariance structure, vector magnitudes, and orthogonality), we reveal that different intrinsic motivation variants create fundamentally different representational structures. Our analysis demonstrates environment-specific correlation patterns between embedding metrics and performance that explain why intrinsic motivation improves policy learning. These findings show that intrinsic motivation operates beyond simple exploration bonuses, acting as a representational prior that shapes embedding geometry in biologically plausible ways, creating environment-specific organizational structures that facilitate better decision-making.
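Two of the embedding properties named above, vector magnitude and orthogonality, are easy to compute from a set of embedding vectors. The following sketch uses plain Python and toy vectors; the metric definitions (mean Euclidean norm and mean absolute pairwise cosine similarity) and the function names are assumptions, since the abstract does not specify the exact statistics, and the covariance analysis is omitted.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def mean_norm(embeddings):
    """Average vector magnitude across embeddings."""
    return sum(norm(v) for v in embeddings) / len(embeddings)


def mean_abs_cosine(embeddings):
    """Average |cosine similarity| over distinct pairs; 0 means mutually orthogonal."""
    total, pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            a, b = embeddings[i], embeddings[j]
            dot = sum(x * y for x, y in zip(a, b))
            total += abs(dot) / (norm(a) * norm(b))
            pairs += 1
    return total / pairs


# Toy embeddings (made up): two orthogonal vectors of length 5.
toy = [[3.0, 4.0], [4.0, -3.0]]
print(mean_norm(toy))        # 5.0
print(mean_abs_cosine(toy))  # 0.0
```

Values of `mean_abs_cosine` near 1 would indicate embeddings that are close to collinear, which is the opposite end of the orthogonality scale.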
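[AI-76] How Does LLM Reasoning Work for Code? A Survey and a Call to Action
[Quick read]: This paper addresses the usefulness and performance of Large Language Models (LLMs) on real-world software engineering (SWE) tasks such as GitHub issue resolution. The key to the solution is an in-depth analysis of the techniques underlying code reasoning, an exploration of the paradigms that drive its performance, and a systematic taxonomy and evaluation overview that shows how core properties of code affect different reasoning techniques, giving direction and reference points for future research.
Link: https://arxiv.org/abs/2506.13932
Authors: Ira Ceka, Saurabh Pujar, Irene Manotas, Gail Kaiser, Baishakhi Ray, Shyam Ramji
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: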
Abstract:The rise of large language models (LLMs) has led to dramatic improvements across a wide range of natural language tasks. These advancements have extended into the domain of code, facilitating complex tasks such as code generation, translation, summarization, and repair. However, their utility for real-world deployment in-the-wild has only recently been studied, particularly on software engineering (SWE) tasks such as GitHub issue resolution. In this study, we examine the code reasoning techniques that underlie the ability to perform such tasks, and examine the paradigms used to drive their performance. Our contributions in this paper are: (1) the first dedicated survey on code reasoning for code tasks, highlighting overarching strategies, hybrid and agentic approaches; (2) a taxonomy of various techniques used to drive code reasoning; (3) a comprehensive overview of performance on common benchmarks and a showcase of new, under-explored benchmarks with high potential in SWE; (4) an exploration on how core properties of code can be used to explain different reasoning techniques; and (5) gaps and potentially under-explored areas for future research.
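[AI-77] Integrating Knowledge Graphs and Bayesian Networks: A Hybrid Approach for Explainable Disease Risk Prediction
[Quick read]: This paper addresses how disease risk prediction can adapt general medical knowledge to specific healthcare settings and patient populations while handling the uncertainty of incomplete data and non-deterministic health outcomes and remaining explainable. The key to the solution is to combine knowledge graphs (KGs) with Bayesian networks (BNs), producing a risk prediction system that balances general medical knowledge with patient-specific context, handles uncertainty effectively, and is highly explainable.
Link: https://arxiv.org/abs/2506.13920
Authors: Mbithe Nzomo, Deshendran Moodley
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: This work has been accepted for presentation at the 49th IEEE International Conference on Computers, Software, and Applications (COMPSAC 2025). The final published version will be available via IEEE Xplore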
Abstract:Multimodal electronic health record (EHR) data is useful for disease risk prediction based on medical domain knowledge. However, general medical knowledge must be adapted to specific healthcare settings and patient populations to achieve practical clinical use. Additionally, risk prediction systems must handle uncertainty from incomplete data and non-deterministic health outcomes while remaining explainable. These challenges can be alleviated by the integration of knowledge graphs (KGs) and Bayesian networks (BNs). We present a novel approach for constructing BNs from ontology-based KGs and multimodal EHR data for explainable disease risk prediction. Through an application use case of atrial fibrillation and real-world EHR data, we demonstrate that the approach balances generalised medical knowledge with patient-specific context, effectively handles uncertainty, is highly explainable, and achieves good predictive performance.
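[AI-78] Evaluating Explainability: A Framework for Systematic Assessment and Reporting of Explainable AI Features
[Quick read]: This paper addresses the lack of systematic evaluation methods for explainable AI features, in particular how to assess the quality of the explanations an AI device provides. The key to the solution is an evaluation framework built on four core criteria, Consistency, Plausibility, Fidelity, and Usefulness, which are used to evaluate explainability methods and give a complete assessment of explanation quality.
Link: https://arxiv.org/abs/2506.13917
Authors: Miguel A. Lago, Ghada Zamzmi, Brandon Eich, Jana G. Delfino
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: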
Abstract:Explainability features are intended to provide insight into the internal mechanisms of an AI device, but there is a lack of evaluation techniques for assessing the quality of provided explanations. We propose a framework to assess and report explainable AI features. Our evaluation framework for AI explainability is based on four criteria: 1) Consistency quantifies the variability of explanations to similar inputs, 2) Plausibility estimates how close the explanation is to the ground truth, 3) Fidelity assesses the alignment between the explanation and the model internal mechanisms, and 4) Usefulness evaluates the impact on task performance of the explanation. Finally, we developed a scorecard for AI explainability methods that serves as a complete description and evaluation to accompany this type of algorithm. We describe these four criteria and give examples on how they can be evaluated. As a case study, we use Ablation CAM and Eigen CAM to illustrate the evaluation of explanation heatmaps on the detection of breast lesions on synthetic mammographies. The first three criteria are evaluated for clinically-relevant scenarios. Our proposed framework establishes criteria through which the quality of explanations provided by AI models can be evaluated. We intend for our framework to spark a dialogue regarding the value provided by explainability features and help improve the development and evaluation of AI-based medical devices.
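[AI-79] Logical Expressiveness of Graph Neural Networks with Hierarchical Node Individualization NEURIPS2025
[Quick read]: This paper addresses the limited expressive power of Graph Neural Networks (GNNs) in distinguishing graph structures, in particular their inability to tell certain non-isomorphic graphs apart. The key to the solution is Hierarchical Ego Graph Neural Networks (HEGNNs), which use hierarchical node individualization, borrowed from the Individualization-Refinement paradigm for graph isomorphism testing, to build a hierarchy of increasingly expressive models that, in the limit, can distinguish all non-isomorphic graphs.
Link: https://arxiv.org/abs/2506.13911
Authors: Arie Soeteman,Balder ten Cate
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Submitted to NeurIPS 2025, 28 pages, 5 figures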
Abstract:We propose and study Hierarchical Ego Graph Neural Networks (HEGNNs), an expressive extension of graph neural networks (GNNs) with hierarchical node individualization, inspired by the Individualization-Refinement paradigm for graph isomorphism testing. HEGNNs generalize subgraph-GNNs and form a hierarchy of increasingly expressive models that, in the limit, can distinguish graphs up to isomorphism. We provide a logical characterization of HEGNN node classifiers, with and without subgraph restrictions, using graded hybrid logic. This characterization enables us to relate the separating power of HEGNNs to that of higher-order GNNs, GNNs enriched with local homomorphism count features, and color refinement algorithms based on Individualization-Refinement. Our experimental results confirm the practical feasibility of HEGNNs and show benefits in comparison with traditional GNN architectures, both with and without local homomorphism count features.
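The Individualization-Refinement idea mentioned in the abstract can be illustrated with plain color refinement (1-WL). The sketch below is not the HEGNN architecture; it is a minimal, standard-library illustration, with hypothetical function names, of why giving one node a unique color lets refinement separate two graphs that ordinary color refinement cannot (a 6-cycle versus two disjoint triangles).

```python
from collections import Counter

def color_refinement(adj, init_colors):
    """Plain 1-WL color refinement. adj maps node -> list of neighbours."""
    colors = dict(init_colors)
    for _ in range(len(adj)):
        # Signature = own color plus the sorted multiset of neighbour colors.
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new_colors = {v: palette[sigs[v]] for v in adj}
        stable = len(set(new_colors.values())) == len(set(colors.values()))
        colors = new_colors
        if stable:
            break
    return colors

def class_sizes(colors):
    """Sorted sizes of the color classes (a simple isomorphism invariant)."""
    return sorted(Counter(colors.values()).values())

def refine(adj, individualized=None):
    # Individualization: the chosen node starts with its own unique color.
    init = {v: (1 if v == individualized else 0) for v in adj}
    return class_sizes(color_refinement(adj, init))

six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

print(refine(six_cycle), refine(two_triangles))                  # identical partitions
print(refine(six_cycle, 0), refine(two_triangles, 0))            # different partitions
```

Both graphs are 2-regular, so uniform starting colors never split; once node 0 is individualized, distances to it become visible and the partitions differ.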
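[AI-80] Few-Shot Learning for Industrial Time Series: A Comparative Analysis Using the Example of Screw-Fastening Process Monitoring
[Quick read]: This paper aims to address the limited use of Few-shot Learning (FSL) on industrial time-series data, in particular in screw-fastening process monitoring, where the high cost of annotating new defects makes conventional methods hard to deploy. The key to the solution is a label-aware episodic sampler that converts multi-label sequences into multiple single-label tasks, keeping the output dimensionality fixed while preserving combinatorial label information. Combined with metric learning and lightweight backbones (such as 1D CNN and InceptionTime), this achieves strong classification and detection performance from only a few samples.
Link: https://arxiv.org/abs/2506.13909
Authors: Xinyuan Tu,Haocheng Zhang,Tao Chengxu,Zuyi Chen(Friedrich-Alexander-Universität Erlangen, Germany)
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: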
Abstract:Few-shot learning (FSL) has shown promise in vision but remains largely unexplored for industrial time-series data, where annotating every new defect is prohibitively expensive. We present a systematic FSL study on screw-fastening process monitoring, using a 2,300-sample multivariate torque dataset that covers 16 uni- and multi-factorial defect types. Beyond benchmarking, we introduce a label-aware episodic sampler that collapses multi-label sequences into multiple single-label tasks, keeping the output dimensionality fixed while preserving combinatorial label information. Two FSL paradigms are investigated: the metric-based Prototypical Network and the gradient-based Model-Agnostic Meta-Learning (MAML), each paired with three backbones: 1D CNN, InceptionTime and the 341 M-parameter transformer Moment. On 10-shot, 3-way evaluation, the InceptionTime + Prototypical Network combination achieves a 0.944 weighted F1 in the multi-class regime and 0.935 in the multi-label regime, outperforming finetuned Moment by up to 5.3% while requiring two orders of magnitude fewer parameters and training time. Across all backbones, metric learning consistently surpasses MAML, and our label-aware sampling yields an additional 1.7% F1 over traditional class-based sampling. These findings challenge the assumption that large foundation models are always superior: when data are scarce, lightweight CNN architectures augmented with simple metric learning not only converge faster but also generalize better. We release code, data splits and pre-trained weights to foster reproducible research and to catalyze the adoption of FSL in high-value manufacturing inspection.
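The metric-based paradigm named in the abstract (Prototypical Network) classifies a query by its distance to per-class mean embeddings. The sketch below shows only that classification rule on toy 2-D vectors; the backbone that would produce the embeddings is omitted, and the class names, vectors, and function names are hypothetical, not taken from the paper or its released code.

```python
def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def build_prototypes(support):
    """support: dict label -> list of embedding vectors (the 'shots')."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(query, prototypes):
    """Assign the label of the nearest prototype."""
    return min(prototypes, key=lambda label: sq_dist(query, prototypes[label]))

# Toy 3-way, 2-shot episode with hypothetical embeddings.
support = {
    "class_a": [[1.0, 0.0], [0.8, 0.2]],
    "class_b": [[0.0, 1.0], [0.2, 0.8]],
    "class_c": [[-1.0, -1.0], [-0.8, -1.2]],
}
prototypes = build_prototypes(support)
print(classify([0.9, 0.1], prototypes))
```

Because the prototypes are recomputed from whatever support set is supplied, new defect classes can be added at inference time without retraining the classifier head, which is what makes the approach attractive when labels are scarce.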
[AI-81] A Systematic Review of User-Centred Evaluation of Explainable AI in Healthcare
[Quick read]: This paper addresses the lack of a systematic evaluation framework and clear guidelines for Explainable Artificial Intelligence (XAI) methods in real-world settings, which makes their user experience, trustworthiness, and usability hard to validate. The key to the solution is a framework of well-defined, atomic properties that characterise the user experience of XAI in healthcare, together with context-sensitive guidelines for choosing evaluation strategies based on system characteristics, to help interdisciplinary teams design and carry out effective evaluations for specific application contexts.
Link: https://arxiv.org/abs/2506.13904
Authors: Ivania Donoso-Guzmán,Kristýna Sirka Kacafírková,Maxwell Szymanski,An Jacobs,Denis Parra,Katrien Verbert
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: v1 submitted Mon, 16 Jun 2025 18:30:00 UTC
Abstract:Despite promising developments in Explainable Artificial Intelligence, the practical value of XAI methods remains under-explored and insufficiently validated in real-world settings. Robust and context-aware evaluation is essential, not only to produce understandable explanations but also to ensure their trustworthiness and usability for intended users, but tends to be overlooked because of no clear guidelines on how to design an evaluation with users. This study addresses this gap with two main goals: (1) to develop a framework of well-defined, atomic properties that characterise the user experience of XAI in healthcare; and (2) to provide clear, context-sensitive guidelines for defining evaluation strategies based on system characteristics. We conducted a systematic review of 82 user studies, sourced from five databases, all situated within healthcare settings and focused on evaluating AI-generated explanations. The analysis was guided by a predefined coding scheme informed by an existing evaluation framework, complemented by inductive codes developed iteratively. The review yields three key contributions: (1) a synthesis of current evaluation practices, highlighting a growing focus on human-centred approaches in healthcare XAI; (2) insights into the interrelations among explanation properties; and (3) an updated framework and a set of actionable guidelines to support interdisciplinary teams in designing and implementing effective evaluation strategies for XAI systems tailored to specific application contexts.
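[AI-82] Enhancing interpretability of rule-based classifiers through feature graphs
[Quick read]: This paper aims to address the challenge of estimating feature contributions in rule-based systems, especially in domains such as healthcare that require transparency and trustworthiness. As rule-based models grow more complex, it becomes hard to identify crucial features, understand their interactions, and compare feature contributions across rule sets. The key to the solution is a comprehensive framework comprising a graph-based feature visualisation strategy, a new feature importance metric that is agnostic to the rule-based predictor, and a distance metric for comparing rule sets by feature contributions. Experiments on two clinical datasets and four rule-based methods show that the approach can reveal the combined predictive value of clinical features.
Link: https://arxiv.org/abs/2506.13903
Authors: Christel Sirocchi,Damiano Verda
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: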
Abstract:In domains where transparency and trustworthiness are crucial, such as healthcare, rule-based systems are widely used and often preferred over black-box models for decision support systems due to their inherent interpretability. However, as rule-based models grow complex, discerning crucial features, understanding their interactions, and comparing feature contributions across different rule sets becomes challenging. To address this, we propose a comprehensive framework for estimating feature contributions in rule-based systems, introducing a graph-based feature visualisation strategy, a novel feature importance metric agnostic to rule-based predictors, and a distance metric for comparing rule sets based on feature contributions. By experimenting on two clinical datasets and four rule-based methods (decision trees, logic learning machines, association rules, and neural networks with rule extraction), we showcase our method’s capability to uncover novel insights on the combined predictive value of clinical features, both at the dataset and class-specific levels. These insights can aid in identifying new risk factors, signature genes, and potential biomarkers, and determining the subset of patient information that should be prioritised to enhance diagnostic accuracy. Comparative analysis of the proposed feature importance score with state-of-the-art methods on 15 public benchmarks demonstrates competitive performance and superior robustness. The method implementation is available on GitHub: this https URL.
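[AI-83] Scaling Algorithm Distillation for Continuous Control with Mamba
[Quick read]: This paper addresses the high computational cost and limited context length that the Transformer attention mechanism imposes on In-Context Reinforcement Learning (ICRL). The key to the solution is Mamba, a model built on the Selective Structured State Space Sequence (S6) architecture, which performs strongly on long-range sequence modeling and scales linearly in sequence length. This lets Algorithm Distillation (AD) handle longer contexts and improves ICRL performance.
Link: https://arxiv.org/abs/2506.13892
Authors: Samuel Beaussant,Mehdi Mounsif
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: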
Abstract:Algorithm Distillation (AD) was recently proposed as a new approach to perform In-Context Reinforcement Learning (ICRL) by modeling across-episodic training histories autoregressively with a causal transformer model. However, due to practical limitations induced by the attention mechanism, experiments were bottlenecked by the transformer’s quadratic complexity and limited to simple discrete environments with short time horizons. In this work, we propose leveraging the recently proposed Selective Structured State Space Sequence (S6) models, which achieved state-of-the-art (SOTA) performance on long-range sequence modeling while scaling linearly in sequence length. Through four complex and continuous Meta Reinforcement Learning environments, we demonstrate the overall superiority of Mamba, a model built with S6 layers, over a transformer model for AD. Additionally, we show that scaling AD to very long contexts can improve ICRL performance and make it competitive even with a SOTA online meta RL baseline.
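[AI-84] StaQ it! Growing neural networks for Policy Mirror Descent
[Quick read]: This paper addresses the computational cost of regularized policy optimization in deep reinforcement learning, in particular the intractability of the Policy Mirror Descent (PMD) framework, whose closed-form solution requires summing all past Q-functions. The key to the solution is a PMD-like algorithm that keeps only the last M Q-functions. The authors show that for finite and large enough M a convergent algorithm can be derived that introduces no error in the policy update, which yields a more stable learning process.
Link: https://arxiv.org/abs/2506.13862
Authors: Alena Shilova,Alex Davey,Brahim Driss,Riad Akrour
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 44 pages, 12 figures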
Abstract:In Reinforcement Learning (RL), regularization has emerged as a popular tool both in theory and practice, typically based either on an entropy bonus or a Kullback-Leibler divergence that constrains successive policies. In practice, these approaches have been shown to improve exploration, robustness and stability, giving rise to popular Deep RL algorithms such as SAC and TRPO. Policy Mirror Descent (PMD) is a theoretical framework that solves this general regularized policy optimization problem, however the closed-form solution involves the sum of all past Q-functions, which is intractable in practice. We propose and analyze PMD-like algorithms that only keep the last $M$ Q-functions in memory, and show that for finite and large enough $M$, a convergent algorithm can be derived, introducing no error in the policy update, unlike prior deep RL PMD implementations. StaQ, the resulting algorithm, enjoys strong theoretical guarantees and is competitive with deep RL baselines, while exhibiting less performance oscillation, paving the way for fully stable deep RL algorithms and providing a testbed for experimentation with Policy Mirror Descent.
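The idea of keeping only the last M Q-functions can be pictured with a bounded stack whose summed values feed a softmax policy. The sketch below is an illustration for a single state with tabular action values, not the StaQ algorithm itself: the class name, the uniform weighting of stored Q-functions, and the temperature `eta` are all assumptions made for the example.

```python
import math
from collections import deque

class LastMPolicy:
    """Softmax policy over the sum of the most recent M Q-value vectors (one state)."""

    def __init__(self, num_actions, M, eta=1.0):
        self.num_actions = num_actions
        self.eta = eta                    # hypothetical step size / inverse temperature
        self.q_stack = deque(maxlen=M)    # the oldest Q-function is dropped automatically

    def push(self, q_values):
        """Store the newest Q-value vector."""
        self.q_stack.append(list(q_values))

    def probs(self):
        """Action probabilities proportional to exp(eta * sum of stored Q-values)."""
        logits = [self.eta * sum(q[a] for q in self.q_stack)
                  for a in range(self.num_actions)]
        top = max(logits)                 # subtract the max for numerical stability
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        return [e / total for e in exps]

policy = LastMPolicy(num_actions=2, M=2)
for q in ([1.0, 0.0], [1.0, 0.0], [0.0, 5.0]):
    policy.push(q)
print(len(policy.q_stack), policy.probs())
```

With `M=2`, the first pushed vector has already been evicted by the time the third arrives, so memory stays constant no matter how many policy updates have happened.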
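[AI-85] Students' Reliance on AI in Higher Education: Identifying Contributing Factors
[Quick read]: This paper addresses students' overreliance on Artificial Intelligence (AI) tools in educational settings and the resulting decline in learning outcomes. By combining pre- and post-surveys with a controlled experimental task, the study observes students' reliance patterns (overreliance, appropriate reliance, and underreliance) when they receive AI suggestions of varying reliability. The key to the solution is identifying the core factors that shape these reliance patterns, such as programming self-efficacy, programming literacy, and need for cognition, and revealing how they relate to post-task trust in and satisfaction with the AI, which provides a basis for designing interventions that promote appropriate reliance.
Link: https://arxiv.org/abs/2506.13845
Authors: Griffin Pitts,Neha Rani,Weedguet Mildort,Eva-Marie Cook
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: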
Abstract:The increasing availability and use of artificial intelligence (AI) tools in educational settings has raised concerns about students’ overreliance on these technologies. Overreliance occurs when individuals accept incorrect AI-generated recommendations, often without critical evaluation, leading to flawed problem solutions and undermining learning outcomes. This study investigates potential factors contributing to patterns of AI reliance among undergraduate students, examining not only overreliance but also appropriate reliance (correctly accepting helpful and rejecting harmful recommendations) and underreliance (incorrectly rejecting helpful recommendations). Our approach combined pre- and post-surveys with a controlled experimental task where participants solved programming problems with an AI assistant that provided both accurate and deliberately incorrect suggestions, allowing direct observation of students’ reliance patterns when faced with varying AI reliability. We find that appropriate reliance is significantly related to students’ programming self-efficacy, programming literacy, and need for cognition, while showing negative correlations with post-task trust and satisfaction. Overreliance showed significant correlations with post-task trust and satisfaction with the AI assistant. Underreliance was negatively correlated with programming literacy, programming self-efficacy, and need for cognition. Overall, the findings provide insights for developing targeted interventions that promote appropriate reliance on AI tools, with implications for the integration of AI in curriculum and educational technologies.
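The three reliance patterns defined in the abstract map directly onto two observations per trial: whether the AI suggestion was correct and whether the student accepted it. The sketch below encodes that mapping; the function names and the trial data are hypothetical and are not the study's instrument or results.

```python
from collections import Counter

def reliance_label(ai_correct: bool, accepted: bool) -> str:
    """Classify one trial using the definitions given in the abstract."""
    if accepted and not ai_correct:
        return "overreliance"        # accepted an incorrect recommendation
    if not accepted and ai_correct:
        return "underreliance"       # rejected a helpful recommendation
    return "appropriate"             # accepted helpful or rejected harmful

def reliance_rates(trials):
    """trials: iterable of (ai_correct, accepted) pairs -> share of each pattern."""
    counts = Counter(reliance_label(c, a) for c, a in trials)
    total = sum(counts.values())
    return {k: counts[k] / total
            for k in ("appropriate", "overreliance", "underreliance")}

# Hypothetical participant: four trials.
trials = [(True, True), (False, True), (False, False), (True, False)]
print(reliance_rates(trials))
```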
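[AI-86] LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning
[Quick read]: This paper addresses the evaluation of Large Language Models' (LLMs') reasoning ability in complex real-world scenarios, specifically on real-world site selection tasks. LLM reasoning has so far been validated mainly in specific domains such as mathematical problem solving and code generation, without comprehensive evaluation in practical applications. The proposed solution is the LocationReasoner benchmark, which comprises over 300 carefully crafted queries involving varied spatial, environmental, and logistical constraints, supported by a sandbox environment for constraint-based location search. The core of the benchmark is that it reproduces real-world complexity, exposing the limits of current state-of-the-art models in holistic and non-linear reasoning.
Link: https://arxiv.org/abs/2506.13841
Authors: Miho Koda,Yu Zheng,Ruixian Ma,Mingyang Sun,Devesh Pansare,Fabio Duarte,Paolo Santi
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: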
Abstract:Recent advances in large language models (LLMs), particularly those enhanced through reinforced post-training, have demonstrated impressive reasoning capabilities, as exemplified by models such as OpenAI o1 and DeepSeek-R1. However, these capabilities are predominantly benchmarked on domains like mathematical problem solving and code generation – leaving open the question of whether such reasoning skills generalize to complex, real-world scenarios. In this paper, we introduce LocationReasoner, a benchmark designed to evaluate LLMs’ reasoning abilities in the context of real-world site selection, where models must identify feasible locations by reasoning over diverse and complicated spatial, environmental, and logistical constraints. The benchmark comprises over 300 carefully crafted queries of varying difficulty levels, supported by a sandbox environment with in-house tools for constraint-based location search. Extensive evaluations reveal that state-of-the-art reasoning models offer limited improvement over their non-reasoning predecessors in real-world contexts, with even the latest OpenAI o4 model failing on 30% of site selection tasks. Moreover, agentic strategies such as ReAct and Reflexion often suffer from over-reasoning, leading to worse outcomes than direct code-generation prompting. With key limitations of LLMs in holistic and non-linear reasoning highlighted, we release LocationReasoner to foster the development of LLMs and agents capable of robust, grounded reasoning in real-world decision-making tasks. Codes and data for our benchmark are available at this https URL.
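[AI-87] Sustainable Machine Learning Retraining: Optimizing Energy Efficiency Without Compromising Accuracy
[Quick read]: This paper addresses the reliability of Machine Learning (ML) software systems as data changes over time, and the high energy consumption and environmental impact of model retraining. The key to the solution is studying common retraining techniques to improve energy efficiency while maintaining model accuracy. The study finds that retraining on only the most recent data reduces energy consumption by up to 25%, and that retraining on demand, triggered by data change detection, reduces it by up to 40% provided the detector is reliable. These findings support recommendations for more energy-efficient retraining strategies in sustainable ML systems.
Link: https://arxiv.org/abs/2506.13838
Authors: Lorena Poenaru-Olaru,June Sallou,Luis Cruz,Jan Rellermeyer,Arie van Deursen
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 12 pages. Accepted at ICT4Sustainability 2025 conference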
Abstract:The reliability of machine learning (ML) software systems is heavily influenced by changes in data over time. For that reason, ML systems require regular maintenance, typically based on model retraining. However, retraining requires significant computational demand, which makes it energy-intensive and raises concerns about its environmental impact. To understand which retraining techniques should be considered when designing sustainable ML applications, in this work, we study the energy consumption of common retraining techniques. Since the accuracy of ML systems is also essential, we compare retraining techniques in terms of both energy efficiency and accuracy. We showcase that retraining with only the most recent data, compared to all available data, reduces energy consumption by up to 25%, being a sustainable alternative to the status quo. Furthermore, our findings show that retraining a model only when there is evidence that updates are necessary, rather than on a fixed schedule, can reduce energy consumption by up to 40%, provided a reliable data change detector is in place. Our findings pave the way for better recommendations for ML practitioners, guiding them toward more energy-efficient retraining techniques when designing sustainable ML software systems.
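The contrast between fixed-schedule retraining and retraining only when a data change is detected can be sketched with a toy drift check. Everything below is an assumption for illustration: the mean-shift detector, its threshold, and the batches are hypothetical, and the number of retraining runs is only a crude stand-in for energy, which the paper measures directly.

```python
from statistics import mean, pstdev

def drift_detected(reference, recent, threshold=3.0):
    """Flag drift when the recent mean is more than `threshold` reference
    standard deviations away from the reference mean (a deliberately simple rule)."""
    spread = pstdev(reference) or 1e-12   # guard against zero spread
    return abs(mean(recent) - mean(reference)) / spread > threshold

def count_retrains(batches, on_demand, threshold=3.0):
    """Count retraining runs under a fixed schedule or a drift-triggered policy."""
    reference = batches[0]
    retrains = 0
    for batch in batches[1:]:
        if not on_demand or drift_detected(reference, batch, threshold):
            retrains += 1
            reference = batch             # retrain on the most recent data only
    return retrains

# Hypothetical one-feature stream with a single abrupt shift.
batches = [
    [1.0, 1.1, 0.9, 1.0],
    [1.0, 0.9, 1.1, 1.0],
    [1.05, 0.95, 1.0, 1.0],
    [5.0, 5.1, 4.9, 5.0],
    [5.0, 5.0, 5.1, 4.9],
]
print(count_retrains(batches, on_demand=False), count_retrains(batches, on_demand=True))
```

The saving depends entirely on the detector: a detector that misses real changes trades energy for accuracy, which is why the abstract conditions the result on a reliable data change detector.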
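[AI-88] Robustness of Reinforcement Learning-Based Traffic Signal Control under Incidents: A Comparative Study
[Quick read]: This paper aims to address the insufficient robustness of Reinforcement Learning-based Traffic Signal Control (RL-TSC) under real-world disruptions such as traffic incidents. The key to the solution is T-REX, an open-source, SUMO-based simulation framework for training and evaluating RL-TSC methods in dynamic incident scenarios. By simulating drivers' probabilistic rerouting, speed adaptation, and contextual lane-changing, T-REX models incident-induced congestion propagation, and it introduces an extended set of robustness metrics to measure how RL-TSC methods perform in complex traffic environments.
Link: https://arxiv.org/abs/2506.13836
Authors: Dang Viet Anh Nguyen,Carlos Lima Azevedo,Tomer Toledo,Filipe Rodrigues
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 35 pages, 5 figures, 3 tables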
Abstract:Reinforcement learning-based traffic signal control (RL-TSC) has emerged as a promising approach for improving urban mobility. However, its robustness under real-world disruptions such as traffic incidents remains largely underexplored. In this study, we introduce T-REX, an open-source, SUMO-based simulation framework for training and evaluating RL-TSC methods under dynamic, incident scenarios. T-REX models realistic network-level performance considering drivers’ probabilistic rerouting, speed adaptation, and contextual lane-changing, enabling the simulation of congestion propagation under incidents. To assess robustness, we propose a suite of metrics that extend beyond conventional traffic efficiency measures. Through extensive experiments across synthetic and real-world networks, we showcase T-REX for the evaluation of several state-of-the-art RL-TSC methods under multiple real-world deployment paradigms. Our findings show that while independent value-based and decentralized pressure-based methods offer fast convergence and generalization in stable traffic conditions and homogeneous networks, their performance degrades sharply under incident-driven distribution shifts. In contrast, hierarchical coordination methods tend to offer more stable and adaptable performance in large-scale, irregular networks, benefiting from their structured decision-making architecture. However, this comes with the trade-off of slower convergence and higher training complexity. These findings highlight the need for robustness-aware design and evaluation in RL-TSC research. T-REX contributes to this effort by providing an open, standardized and reproducible platform for benchmarking RL methods under dynamic and disruptive traffic scenarios.
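[AI-89] Evolvable Conditional Diffusion
[Quick read]: This paper aims to address how black-box, non-differentiable multi-physics models can guide generative AI for autonomous scientific discovery when no differentiable proxy is available. The key to the solution is formulating guidance as an optimization problem, in which a target fitness function is optimized by updating the descriptive statistic of the denoising distribution, and deriving an evolution-guided approach from the perspective of probabilistic evolution. The resulting update algorithm is analogous to gradient-based guided diffusion but requires no derivatives, removing the dependence on differentiable models.
Link: https://arxiv.org/abs/2506.13834
Authors: Zhao Wei,Chin Chun Ooi,Abhishek Gupta,Jian Cheng Wong,Pao-Hsiung Chiu,Sheares Xue Wen Toh,Yew-Soon Ong
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: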
Abstract:This paper presents an evolvable conditional diffusion method such that black-box, non-differentiable multi-physics models, as are common in domains like computational fluid dynamics and electromagnetics, can be effectively used for guiding the generative process to facilitate autonomous scientific discovery. We formulate the guidance as an optimization problem where one optimizes for a desired fitness function through updates to the descriptive statistic for the denoising distribution, and derive an evolution-guided approach from first principles through the lens of probabilistic evolution. Interestingly, the final derived update algorithm is analogous to the update as per common gradient-based guided diffusion models, but without ever having to compute any derivatives. We validate our proposed evolvable diffusion algorithm in two AI for Science scenarios: the automated design of fluidic topology and meta-surface. Results demonstrate that this method effectively generates designs that better satisfy specific optimization objectives without reliance on differentiable proxies, providing an effective means of guidance-based diffusion that can capitalize on the wealth of black-box, non-differentiable multi-physics numerical models common across Science.
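To give a feel for guidance without derivatives, the sketch below updates the mean of a Gaussian by a fitness-weighted average of sampled candidates, calling the fitness function only as a black box. This is a generic evolution-strategy style update chosen for illustration, not the update rule derived in the paper; the toy fitness function, the softmax weighting, and all parameter values are assumptions.

```python
import math
import random

def black_box_fitness(x):
    """Stand-in for a non-differentiable simulator; higher is better."""
    return -((x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2)

def evolution_guided_step(mean, sigma, fitness, rng, num_samples=32, temperature=1.0):
    """Move the mean toward high-fitness samples using only fitness evaluations."""
    samples = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
               for _ in range(num_samples)]
    scores = [fitness(s) / temperature for s in samples]
    top = max(scores)                                  # stabilise the exponentials
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * s[d] for w, s in zip(weights, samples)) / total
            for d in range(len(mean))]

def guide(mean, steps=30, sigma=0.5, seed=0):
    """Apply the derivative-free update repeatedly from a starting mean."""
    rng = random.Random(seed)
    for _ in range(steps):
        mean = evolution_guided_step(mean, sigma, black_box_fitness, rng)
    return mean

start = [0.0, 0.0]
final = guide(start)
print(black_box_fitness(start), black_box_fitness(final))
```

In a diffusion setting the analogous update would be applied to the statistic of the denoising distribution at each step; here the single Gaussian keeps the mechanics visible.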
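[AI-90] A Survey on World Models Grounded in Acoustic Physical Information
[Quick read]: This paper addresses how acoustic signals can be used to build acoustic world models for high-fidelity environmental perception, causal physical reasoning, and prediction of dynamic events, with the core idea being that the physical information carried by acoustic signals enables a deeper understanding of the environment. The key to the solution is combining physical laws with machine learning methods, in particular Physics-Informed Neural Networks (PINNs), generative models, and self-supervised multimodal learning frameworks, to extract latent information about material properties, internal geometric structures, and complex interaction dynamics from acoustic signals, advancing acoustic intelligence across multiple domains.
Link: https://arxiv.org/abs/2506.13833
Authors: Xiaoliang Chen,Le Chang,Xin Yu,Yunhe Huang,Xianling Tu
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Robotics (cs.RO); Audio and Speech Processing (eess.AS); Applied Physics (physics.app-ph)
Comments: 28 pages, 11 equations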
Abstract:This survey provides a comprehensive overview of the emerging field of world models grounded in the foundation of acoustic physical information. It examines the theoretical underpinnings, essential methodological frameworks, and recent technological advancements in leveraging acoustic signals for high-fidelity environmental perception, causal physical reasoning, and predictive simulation of dynamic events. The survey explains how acoustic signals, as direct carriers of mechanical wave energy from physical events, encode rich, latent information about material properties, internal geometric structures, and complex interaction dynamics. Specifically, this survey establishes the theoretical foundation by explaining how fundamental physical laws govern the encoding of physical information within acoustic signals. It then reviews the core methodological pillars, including Physics-Informed Neural Networks (PINNs), generative models, and self-supervised multimodal learning frameworks. Furthermore, the survey details the significant applications of acoustic world models in robotics, autonomous driving, healthcare, and finance. Finally, it systematically outlines the important technical and ethical challenges while proposing a concrete roadmap for future research directions toward robust, causal, uncertainty-aware, and responsible acoustic intelligence. These elements collectively point to a research pathway towards embodied active acoustic intelligence, empowering AI systems to construct an internal “intuitive physics” engine through sound.
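[AI-91] FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation
[Quick read]: This paper addresses the shortcomings of existing front-end code generation benchmarks in task complexity, test-case rigor, and end-to-end validation. The key to the solution is FrontendBench, a benchmark co-developed by humans and LLMs that categorizes tasks by code functionality and incorporates interactive test scenarios, enabling a more comprehensive and practical evaluation of front-end code generation. The paper also introduces an automatic evaluation framework that executes generated code in a sandbox environment and assesses it with predefined test scripts, achieving high agreement with expert human evaluation.
Link: https://arxiv.org/abs/2506.13832
Authors: Hongda Zhu,Yiwen Zhang,Bing Zhao,Jingzhe Ding,Siyao Liu,Tong Liu,Dandan Wang,Yanan Liu,Zhaojian Li
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: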
Abstract:Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end validation is absent. These issues hinder the accurate assessment of model performance. To address these challenges, we present FrontendBench, a benchmark co-developed by humans and LLMs. FrontendBench categorizes tasks based on code functionality and incorporates interactive test scenarios, enabling a more comprehensive and practical evaluation of front-end code generation capabilities. The benchmark comprises 148 meticulously crafted prompt-test case pairs spanning five levels of web components, from basic UI elements to complex interactive features. Each task reflects realistic front-end development challenges. Furthermore, we introduce an automatic evaluation framework that executes generated code within a sandbox environment and assesses outcomes using predefined test scripts. This framework achieves a 90.54% agreement rate with expert human evaluations, demonstrating high reliability. We benchmark several state-of-the-art LLMs on FrontendBench and observe substantial performance disparities in handling real-world front-end tasks. These results highlight FrontendBench as a reliable and scalable benchmark, supporting consistent multimodal evaluation and providing a robust foundation for future research in front-end code generation. Our data and code will be released soon.
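[AI-92] Quantifying Structure in CLIP Embeddings: A Statistical Framework for Concept Interpretation
[Quick read]: This paper addresses the interpretability of embeddings from deep neural network models such as CLIP, in particular the lack of statistical rigor in existing concept-based methods, which makes it hard to validate identified concepts and compare methods. The key to the solution is a hypothesis testing framework that quantifies rotation-sensitive structures in the CLIP embedding space, together with a post-hoc concept decomposition method that comes with theoretical guarantees that the discovered concepts are robust, reproducible patterns rather than method-specific artifacts, and that outperforms other techniques in reconstruction error.
Link: https://arxiv.org/abs/2506.13831
Authors: Jitian Zhao,Chenghui Li,Frederic Sala,Karl Rohe
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: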
Abstract:Concept-based approaches, which aim to identify human-understandable concepts within a model’s internal representations, are a promising method for interpreting embeddings from deep neural network models, such as CLIP. While these approaches help explain model behavior, current methods lack statistical rigor, making it challenging to validate identified concepts and compare different techniques. To address this challenge, we introduce a hypothesis testing framework that quantifies rotation-sensitive structures within the CLIP embedding space. Once such structures are identified, we propose a post-hoc concept decomposition method. Unlike existing approaches, it offers theoretical guarantees that discovered concepts represent robust, reproducible patterns (rather than method-specific artifacts) and outperforms other techniques in terms of reconstruction error. Empirically, we demonstrate that our concept-based decomposition algorithm effectively balances reconstruction accuracy with concept interpretability and helps mitigate spurious cues in data. Applied to a popular spurious correlation dataset, our method yields a 22.6% increase in worst-group accuracy after removing spurious background concepts.
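[AI-93] Balancing Preservation and Modification: A Region and Semantic Aware Metric for Instruction-Based Image Editing
[Quick read]: This paper addresses the lack of a comprehensive evaluation metric for instruction-based image editing. Existing metrics either require costly human evaluation or are adapted from other tasks, and so fail to assess both the instructed modification and the preservation of irrelevant regions, which biases the evaluation. The key to the solution is a new metric, Balancing Preservation and Modification (BPM), which explicitly separates the image into editing-relevant and irrelevant regions and evaluates each accordingly. A Region-Aware Judge checks whether the position and size of the edited region match the instruction, and a Semantic-Aware Judge then assesses instruction compliance within the edited region and content preservation in the irrelevant regions, yielding a comprehensive and interpretable quality assessment.
Link: https://arxiv.org/abs/2506.13827
Authors: Zhuoying Li,Zhu Xu,Yuxin Peng,Yang Liu
Affiliation: unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI)
Comments: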
Abstract:Instruction-based image editing, which aims to modify the image faithfully according to the instruction while preserving irrelevant content unchanged, has made significant progress. However, there still lacks a comprehensive metric for assessing the editing quality. Existing metrics either require high human evaluation costs, which hinder large-scale evaluation, or are adapted from other tasks and lose task-specific concerns, failing to comprehensively evaluate both instruction-based modification and preservation of irrelevant regions, resulting in biased evaluation. To tackle this, we introduce a new metric called Balancing Preservation and Modification (BPM), tailored for instruction-based image editing by explicitly disentangling the image into editing-relevant and irrelevant regions for specific consideration. We first identify and locate editing-relevant regions, followed by a two-tier process to assess editing quality: Region-Aware Judge evaluates whether the position and size of the edited region align with the instruction, and Semantic-Aware Judge further assesses the instruction content compliance within editing-relevant regions as well as content preservation within irrelevant regions, yielding comprehensive and interpretable quality assessment. Moreover, the editing-relevant region localization in BPM can be integrated into image editing approaches to improve editing quality, demonstrating its broad applicability. We verify the effectiveness of the BPM metric on comprehensive instruction-editing data, and the results show the highest alignment with human evaluation compared to existing metrics, indicating its efficacy. Code is available at: this https URL
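[AI-94] The Reflexive Integrated Information Unit: A Differentiable Primitive for Artificial Consciousness
[Quick read]: This paper addresses the absence, in artificial consciousness research, of a small trainable module comparable to the perceptron that can be copied, benchmarked, and iteratively improved. The key to the solution is the Reflexive Integrated Information Unit (RIIU), a recurrent cell that augments its hidden state $h$ with two additional vectors, a meta-state $\mu$ and a broadcast buffer $B$, to record and expose its own causal footprint and thereby maximize local information integration. RIIUs are end-to-end differentiable, compose additively, and exhibit $\Phi$-monotone plasticity under gradient ascent, and they perform well on the reported task.
Link: https://arxiv.org/abs/2506.13825
Authors: Gnankan Landry Regis N’guessan,Issa Karambal
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: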
Abstract:Research on artificial consciousness lacks the equivalent of the perceptron: a small, trainable module that can be copied, benchmarked, and iteratively improved. We introduce the Reflexive Integrated Information Unit (RIIU), a recurrent cell that augments its hidden state $h$ with two additional vectors: (i) a meta-state $\mu$ that records the cell’s own causal footprint, and (ii) a broadcast buffer $B$ that exposes that footprint to the rest of the network. A sliding-window covariance and a differentiable Auto-$\Phi$ surrogate let each RIIU maximize local information integration online. We prove that RIIUs (1) are end-to-end differentiable, (2) compose additively, and (3) perform $\Phi$-monotone plasticity under gradient ascent. In an eight-way Grid-world, a four-layer RIIU agent restores 90% reward within 13 steps after actuator failure, twice as fast as a parameter-matched GRU, while maintaining a non-zero Auto-$\Phi$ signal. By shrinking “consciousness-like” computation down to unit scale, RIIUs turn a philosophical debate into an empirical mathematical problem.
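[AI-95] MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios ACL2025
[Quick read]: This paper addresses code debugging in real-world multi-library scenarios. Current research focuses on no-library or single-library settings and overlooks the debugging challenges of complex multi-library environments. The key to the solution is MLDebugging, a comprehensive benchmark for assessing debugging challenges in multi-library Python code. It covers 126 distinct Python libraries and seven types of multi-library code issues, providing a basis for evaluating large language models (LLMs) in multi-library debugging scenarios.
Link: https://arxiv.org/abs/2506.13824
Authors: Jinyang Huang,Xiachong Feng,Qiguang Chen,Hanjie Zhao,Zihui Cheng,Jiesong Bai,Jingxuan Zhou,Min Li,Libo Qin
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: ACL 2025 Findings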
Abstract:Code debugging is a crucial task in software engineering, which attracts increasing attention. While remarkable success has been made in the era of large language models (LLMs), current research still focuses on the simple no-library or single-library setting, ignoring the complex multi-library scenario in real-world applications. To address this limitation, we make the first attempt to introduce MLDebugging (Multi-Library Debugging), a comprehensive benchmark designed to assess debugging challenges within multi-library Python code. Specifically, MLDebugging encompasses 126 distinct Python libraries, covering a wide range of multi-library code issues, categorized into seven distinct types. Furthermore, we conduct a thorough evaluation of MLDebugging using both mainstream open-source and closed-source LLMs and highlight that current LLMs still struggle to correctly perform code debugging across multi-library scenarios. We hope this work can uncover the potential of LLMs in multi-library debugging scenarios and offer insights for future research.
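[AI-96] Structured Program Synthesis using LLMs: Results and Insights from the IPARC Challenge
Quick read: This paper addresses the challenge of automatic program construction over synthetic images, in particular program synthesis tasks built on the control structures of sequence, selection, and iteration. The IPARC Challenge comprises 600 tasks that have resisted automated solutions. The proposed solution is a structured inductive programming approach based on large language models (LLMs). Its key lies in prior structuring, human-LLM collaborative refinement, freezing of correct code, and code reuse, which together improve the efficiency and accuracy of program synthesis.
Link: https://arxiv.org/abs/2506.13820
Authors: Shraddha Surana, Ashwin Srinivasan, Michael Bain
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: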
Abstract:The IPARC Challenge, inspired by ARC, provides controlled program synthesis tasks over synthetic images to evaluate automatic program construction, focusing on sequence, selection, and iteration. This set of 600 tasks has resisted automated solutions. This paper presents a structured inductive programming approach with LLMs that successfully solves tasks across all IPARC categories. The controlled nature of IPARC reveals insights into LLM-based code generation, including the importance of prior structuring, LLMs’ ability to aid structuring (requiring human refinement), the need to freeze correct code, the efficiency of code reuse, and how LLM-generated code can spark human creativity. These findings suggest valuable mechanisms for human-LLM collaboration in tackling complex program synthesis.
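[AI-97] Bridging Pattern-Aware Complexity with NP-Hard Optimization: A Unifying Framework and Empirical Study
Quick read: This paper starts from the observation that NP-hard optimization problems such as the Traveling Salesman Problem (TSP) are intractable in the worst case, yet real-world instances often exhibit exploitable structural regularities. The key to its solution is a pattern-aware complexity framework that quantifies and exploits such regularities (for example clustering and symmetry) to reduce effective computational complexity and improve solving efficiency. The method introduces metrics such as Pattern Utilization Efficiency (PUE) and uses a meta-learning-driven solver pipeline to improve performance across domains, for example achieving solution quality gains of up to 79% on TSP benchmarks.
Link: https://arxiv.org/abs/2506.13810
Authors: Olivier Saidi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: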
Abstract:NP-hard optimization problems like the Traveling Salesman Problem (TSP) defy efficient solutions in the worst case, yet real-world instances often exhibit exploitable patterns. We propose a novel pattern-aware complexity framework that quantifies and leverages structural regularities (e.g., clustering, symmetry) to reduce effective computational complexity across domains, including financial forecasting and LLM optimization. With rigorous definitions, theorems, and a meta-learning-driven solver pipeline, we introduce metrics like Pattern Utilization Efficiency (PUE) and achieve up to 79 percent solution quality gains in TSP benchmarks (22 to 2392 cities). Distinct from theoretical NP-hardness, our approach offers a unified, practical lens for pattern-driven efficiency.
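[AI-98] Dr. GPT Will See You Now but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases
Quick read: This paper addresses the insufficient evaluation of how effective Large Language Models (LLMs) are for everyday health consultation, especially when they handle health questions posed by general users in real-world settings. Existing studies concentrate on expert-level health queries or clinical cases and overlook how LLMs actually perform on the everyday health concerns of general users. The key to the solution is a university-level competition that crowdsourced real or imagined health questions and had nine board-certified physicians evaluate the LLM-generated responses, providing an evaluation framework closer to real-world use.
Link: https://arxiv.org/abs/2506.13805
Authors: Bonam Mingole, Aditya Majumdar, Firdaus Ahmed Choudhury, Jennifer L. Kraschnewski, Shyam S. Sundar, Amulya Yadav
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: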
Abstract:The proliferation of Large Language Models (LLMs) in high-stakes applications such as medical (self-)diagnosis and preliminary triage raises significant ethical and practical concerns about the effectiveness, appropriateness, and possible harmfulness of the use of these technologies for health-related concerns and queries. Some prior work has considered the effectiveness of LLMs in answering expert-written health queries/prompts, questions from medical examination banks, or queries based on pre-existing clinical cases. Unfortunately, these existing studies completely ignore an in-the-wild evaluation of the effectiveness of LLMs in answering everyday health concerns and queries typically asked by general users, which corresponds to the more prevalent use case for LLMs. To address this research gap, this paper presents the findings from a university-level competition that leveraged a novel, crowdsourced approach for evaluating the effectiveness of LLMs in answering everyday health queries. Over the course of a week, a total of 34 participants prompted four publicly accessible LLMs with 212 real (or imagined) health concerns, and the LLM generated responses were evaluated by a team of nine board-certified physicians. At a high level, our findings indicate that on average, 76% of the 212 LLM responses were deemed to be accurate by physicians. Further, with the help of medical professionals, we investigated whether RAG versions of these LLMs (powered with a comprehensive medical knowledge base) can improve the quality of responses generated by LLMs. Finally, we also derive qualitative insights to explain our quantitative findings by conducting interviews with seven medical professionals who were shown all the prompts in our competition. This paper aims to provide a more grounded understanding of how LLMs perform in real-world everyday health communication.
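[AI-99] Instruction and Solution Probabilities as Heuristics for Inductive Programming
Quick read: This paper addresses the computational inefficiency caused by the excessively large search space in inductive programming (IP). The key to the solution is extending the instruction subset (IS) approach with instruction probabilities and solution probabilities as additional heuristics. Instruction probability is based on how frequently an instruction occurs in a large code sample and reflects the expectation that the instruction appears in a solution. Solution probability is the product of the probabilities of all instructions that make up a program. Using the minimum observed solution probabilities as thresholds, the search can be pruned while partial solutions are constructed, eliminating unlikely combinations of instructions and substantially shrinking the search space.
Link: https://arxiv.org/abs/2506.13804
Authors: Edward McDaid, Sarah McDaid
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 10 pages, 10 figures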
Abstract:Instruction subsets (ISs) are heuristics that can shrink the size of the inductive programming (IP) search space by tens of orders of magnitude. Here, we extend the IS approach by introducing instruction and solution probabilities as additional heuristics. Instruction probability reflects the expectation of an instruction occurring in a solution, based on the frequency of instruction occurrence in a large code sample. The solution probability for a partial or complete program is simply the product of all constituent instruction probabilities, including duplicates. We treat the minimum solution probabilities observed in code sample program units of different sizes as solution probability thresholds. These thresholds are used to prune the search space as partial solutions are constructed, thereby eliminating any branches containing unlikely combinations of instructions. The new approach has been evaluated using a large sample of human code. We tested two formulations of instruction probability: one based on instruction occurrence across the entire code sample and another that measured the distribution separately for each IS. Our results show that both variants produce substantial further reductions in the IP search space size of up to tens of orders of magnitude, depending on solution size. In combination with IS, reductions of over 100 orders of magnitude can be achieved. We also carried out cross-validation testing to show that the heuristics should work effectively with unseen code. The approach is described and the results and some ideas for future work are discussed.
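The pruning idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy instruction names, the function names, and the choice to keep branches whose size has no observed threshold are all assumptions made for illustration. Instruction probabilities are relative frequencies in a code sample, a solution probability is the product of its instruction probabilities (duplicates included), and the per-size minimum seen in the sample acts as the pruning threshold.

```python
from collections import Counter
from math import prod


def instruction_probabilities(programs):
    """Relative frequency of each instruction across the whole code sample."""
    counts = Counter(instr for program in programs for instr in program)
    total = sum(counts.values())
    return {instr: n / total for instr, n in counts.items()}


def solution_probability(program, probs):
    """Product of the constituent instruction probabilities, duplicates included."""
    return prod(probs.get(instr, 0.0) for instr in program)


def size_thresholds(programs, probs):
    """Minimum solution probability observed for each program size in the sample."""
    thresholds = {}
    for program in programs:
        p = solution_probability(program, probs)
        thresholds[len(program)] = min(p, thresholds.get(len(program), 1.0))
    return thresholds


def should_prune(partial, probs, thresholds):
    """Prune a (partial) program that is less likely than anything seen at its size."""
    threshold = thresholds.get(len(partial))
    if threshold is None:
        return False  # assumption: no evidence for this size, so keep the branch
    return solution_probability(partial, probs) < threshold


# Hypothetical three-program code sample over a toy instruction set.
sample = [["load", "add", "store"], ["load", "load", "add"], ["load", "mul", "store"]]
probs = instruction_probabilities(sample)
thresholds = size_thresholds(sample, probs)
```

With this sample, a branch such as `["mul", "mul", "mul"]` falls below the size-3 threshold and is pruned, while `["load", "add", "store"]` is kept.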
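[AI-100] Causality in the human niche: lessons for machine learning
Quick read: This paper addresses the weakness of current machine learning systems in causal reasoning, in particular in emulating the human ability to generalize and learn efficiently in new domains. The core issue is that the existing Structural Causal Model (SCM) framework fails to capture some key aspects of human causal cognition, which has hindered progress in areas where humans excel. The key to the proposed solution is a deeper understanding of how human causal cognition is adaptive in, and motivated by, the “human niche”, so that machine learning can adopt more human-like inductive biases and build more capable, controllable, and interpretable AI systems.
Link: https://arxiv.org/abs/2506.13803
Authors: Richard D. Lange, Konrad P. Kording
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 23 pages, 2 figures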
Abstract:Humans interpret the world around them in terms of cause and effect and communicate their understanding of the world to each other in causal terms. These causal aspects of human cognition are thought to underlie humans’ ability to generalize and learn efficiently in new domains, an area where current machine learning systems are weak. Building human-like causal competency into machine learning systems may facilitate the construction of effective and interpretable AI. Indeed, the machine learning community has been importing ideas on causality formalized by the Structural Causal Model (SCM) framework, which provides a rigorous formal language for many aspects of causality and has led to significant advances. However, the SCM framework fails to capture some salient aspects of human causal cognition and has likewise not yet led to advances in machine learning in certain critical areas where humans excel. We contend that the problem of causality in the “human niche” – for a social, autonomous, and goal-driven agent sensing and acting in the world in which humans live – is quite different from the kind of causality captured by SCMs. For example, everyday objects come in similar types that have similar causal properties, and so humans readily generalize knowledge of one type of object (cups) to another related type (bowls) by drawing causal analogies between objects with similar properties, but such analogies are at best awkward to express in SCMs. We explore how such causal capabilities are adaptive in, and motivated by, the human niche. By better appreciating properties of human causal cognition and, crucially, how those properties are adaptive in the niche in which humans live, we hope that future work at the intersection of machine learning and causality will leverage more human-like inductive biases to create more capable, controllable, and interpretable systems.
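[AI-101] Enhancing Clinical Decision Support and EHR Insights through LLMs and the Model Context Protocol: An Open-Source MCP-FHIR Framework
Quick read: This paper aims to address persistent challenges in digital health: enhancing clinical decision support (CDS), reducing documentation burden, and improving patient health literacy. The key to its solution is an open-source, agent-based framework that integrates Large Language Models (LLMs) with HL7 FHIR data via the Model Context Protocol (MCP), enabling dynamic extraction and reasoning over electronic health records (EHRs). Built on an established MCP-FHIR implementation, the framework provides declarative access to diverse FHIR resources through JSON-based configurations and supports real-time summarization, interpretation, and personalized communication for different user personas (clinicians, caregivers, and patients), improving scalability, explainability, and interoperability.
Link: https://arxiv.org/abs/2506.13800
Authors: Abul Ehtesham, Aditi Singh, Saket Kumar
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: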
Abstract:Enhancing clinical decision support (CDS), reducing documentation burdens, and improving patient health literacy remain persistent challenges in digital health. This paper presents an open-source, agent-based framework that integrates Large Language Models (LLMs) with HL7 FHIR data via the Model Context Protocol (MCP) for dynamic extraction and reasoning over electronic health records (EHRs). Built on the established MCP-FHIR implementation, the framework enables declarative access to diverse FHIR resources through JSON-based configurations, supporting real-time summarization, interpretation, and personalized communication across multiple user personas, including clinicians, caregivers, and patients. To ensure privacy and reproducibility, the framework is evaluated using synthetic EHR data from the SMART Health IT sandbox (this https URL), which conforms to the FHIR R4 standard. Unlike traditional approaches that rely on hardcoded retrieval and static workflows, the proposed method delivers scalable, explainable, and interoperable AI-powered EHR applications. The agentic architecture further supports multiple FHIR formats, laying a robust foundation for advancing personalized digital health solutions.
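[AI-102] Feedforward Ordering in Neural Connectomes via Feedback Arc Minimization
Quick read: This paper aims to minimize feedback arcs in large-scale weighted directed graphs in order to reveal biologically meaningful feedforward structure in neural connectomes. The key to its solution is combining greedy heuristics, gain-aware local refinements, and global structural analysis based on strongly connected components, which increases the total weight of forward-pointing edges.
Link: https://arxiv.org/abs/2506.13799
Authors: Soroush Vahidi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: This is a preliminary paper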
Abstract:We present a suite of scalable algorithms for minimizing feedback arcs in large-scale weighted directed graphs, with the goal of revealing biologically meaningful feedforward structure in neural connectomes. Using the FlyWire Connectome Challenge dataset, we demonstrate the effectiveness of our ranking strategies in maximizing the total weight of forward-pointing edges. Our methods integrate greedy heuristics, gain-aware local refinements, and global structural analysis based on strongly connected components. Experiments show that our best solution improves the forward edge weight over previous top-performing methods. All algorithms are implemented efficiently in Python and validated using cloud-based execution on Google Colab Pro+.
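To make the objective concrete, here is a small sketch of feedback arc minimization framed as maximizing forward edge weight under a node ordering. It is only an illustration under stated assumptions: the greedy score (outgoing minus incoming weight) and the adjacent-swap refinement are generic stand-ins, not the paper's algorithms, and the toy graph is hypothetical. The strongly-connected-component analysis mentioned in the abstract is not shown.

```python
def forward_weight(order, edges):
    """Total weight of edges that point forward under the given node ordering."""
    position = {node: i for i, node in enumerate(order)}
    return sum(w for u, v, w in edges if position[u] < position[v])


def greedy_order(nodes, edges):
    """Greedy heuristic: rank nodes by outgoing minus incoming weight, largest first."""
    score = {n: 0.0 for n in nodes}
    for u, v, w in edges:
        score[u] += w
        score[v] -= w
    return sorted(nodes, key=lambda n: -score[n])


def local_refine(order, edges):
    """Swap adjacent nodes for as long as a swap strictly increases forward weight."""
    order = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            candidate = order[:i] + [order[i + 1], order[i]] + order[i + 2:]
            if forward_weight(candidate, edges) > forward_weight(order, edges):
                order, improved = candidate, True
    return order


# Hypothetical weighted directed graph given as (source, target, weight) triples.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 5), ("b", "c", 4), ("c", "a", 1), ("c", "d", 3), ("d", "b", 2)]
order = local_refine(greedy_order(nodes, edges), edges)
```

Because each accepted swap strictly increases the forward weight, the refinement terminates, but it can stop at a local optimum. On this toy graph the ordering `a, b, c, d` has a higher forward weight than the one the sketch finds, which is one reason global structure is worth analyzing in addition to local moves.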
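[AI-103] Contemporary AI foundation models increase biological weapons risk
Quick read: This paper argues that current safety assessments of foundation AI models underestimate their potential risk in biological weapons development. Its key move is to challenge the assumption that biological weapons development requires tacit knowledge, and to show how AI can uplift both nonexperts and already-skilled individuals. By analyzing cases in which people without formal training successfully carried out complex technical tasks, and by showing that pathogen construction processes can be conveyed in text, the study argues that large language models can describe the “elements of success” for biological weapons development, such as acquiring materials and performing technical steps. Applying this framework, the authors report that advanced AI models can accurately guide users through recovering live poliovirus from commercially obtained synthetic DNA, which calls into question claims that current models pose low biosecurity risk.
Link: https://arxiv.org/abs/2506.13798
Authors: Roger Brent, T. Greg McKelvey Jr
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 58 pages, 10 figures, 4 tables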
Abstract:The rapid advancement of artificial intelligence has raised concerns about its potential to facilitate biological weapons development. We argue existing safety assessments of contemporary foundation AI models underestimate this risk, largely due to flawed assumptions and inadequate evaluation methods. First, assessments mistakenly assume biological weapons development requires tacit knowledge, or skills gained through hands-on experience that cannot be easily verbalized. Second, they rely on imperfect benchmarks that overlook how AI can uplift both nonexperts and already-skilled individuals. To challenge the tacit knowledge assumption, we examine cases where individuals without formal expertise, including a 2011 Norwegian ultranationalist who synthesized explosives, successfully carried out complex technical tasks. We also review efforts to document pathogen construction processes, highlighting how such tasks can be conveyed in text. We identify “elements of success” for biological weapons development that large language models can describe in words, including steps such as acquiring materials and performing technical procedures. Applying this framework, we find that advanced AI models Llama 3.1 405B, ChatGPT-4o, and Claude 3.5 Sonnet can accurately guide users through the recovery of live poliovirus from commercially obtained synthetic DNA, challenging recent claims that current models pose minimal biosecurity risk. We advocate for improved benchmarks, while acknowledging the window for meaningful implementation may have already closed.
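[AI-104] BotTrans: A Multi-Source Graph Domain Adaptation Approach for Social Bot Detection ECML-PKDD2025
Quick read: This paper aims to address the performance limits caused by label scarcity when detecting social bots and other anomalies. Its key solution is BotTrans, a multi-source graph domain adaptation model that transfers knowledge from multiple source domains. The method first builds a cross-source-domain topology to increase network homophily, then aggregates cross-domain neighbor information to make source node embeddings more discriminative, and integrates the relevance of each source-target pair into model optimization so that knowledge more relevant to the detection task is transferred, improving performance when the target detection task is unlabeled.
Link: https://arxiv.org/abs/2506.13795
Authors: Boshen Shi, Yongqing Wang, Fangda Guo, Jiangli Shao, Huawei Shen, Xueqi Cheng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: Accepted to ECML-PKDD 2025 Research Track as oral; Code and data: this https URL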
Abstract:Transferring extensive knowledge from relevant social networks has emerged as a promising solution to overcome label scarcity in detecting social bots and other anomalies with GNN-based models. However, effective transfer faces two critical challenges. Firstly, the network heterophily problem, which is caused by bots hiding malicious behaviors via indiscriminately interacting with human users, hinders the model’s ability to learn sufficient and accurate bot-related knowledge from source domains. Secondly, single-source transfer might lead to inferior and unstable results, as the source network may embody weak relevance to the task and provide limited knowledge. To address these challenges, we explore multiple source domains and propose a multi-source graph domain adaptation model named BotTrans. We initially leverage the labeling knowledge shared across multiple source networks to establish a cross-source-domain topology with increased network homophily. We then aggregate cross-domain neighbor information to enhance the discriminability of source node embeddings. Subsequently, we integrate the relevance between each source-target pair with model optimization, which facilitates knowledge transfer from source networks that are more relevant to the detection task. Additionally, we propose a refinement strategy to improve detection performance by utilizing semantic knowledge within the target domain. Extensive experiments on real-world datasets demonstrate that BotTrans outperforms the existing state-of-the-art methods, revealing its efficacy in leveraging multi-source knowledge when the target detection task is unlabeled.
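The heterophily problem described here can be quantified with the commonly used edge homophily ratio, the fraction of edges whose endpoints share a label. The sketch below assumes that standard definition, not necessarily the measure used in the paper, and the node names and labels are hypothetical.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    if not edges:
        return 0.0
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Hypothetical network: bots b1 and b2 interact indiscriminately with humans.
labels = {"h1": "human", "h2": "human", "h3": "human", "b1": "bot", "b2": "bot"}
edges = [
    ("h1", "h2"), ("h2", "h3"),                # human-human
    ("b1", "h1"), ("b1", "h2"), ("b2", "h3"),  # bot-human (heterophilous)
    ("b1", "b2"),                              # bot-bot
]
ratio = edge_homophily(edges, labels)
```

A low ratio means that neighborhood aggregation in a GNN mixes bot and human features, which is the motivation the abstract gives for building a topology with higher homophily.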
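[AI-105] Med-REFL: Medical Reasoning Enhancement via Self-Corrected Fine-grained Reflection
Quick read: This paper addresses the limited reasoning ability of large reasoning models in the medical domain, in particular their weaker performance in high-stakes medical scenarios caused by insufficient attention to the quality of intermediate reflection steps. The key to the solution is Med-REFL, which uses a tree-of-thought strategy to decompose medical questions into fine-grained reasoning paths and quantitatively evaluates each step and its subsequent reflections. This allows direct preference optimization data to be constructed automatically, reducing reliance on expensive expert annotation while guiding the model to identify and correct reasoning errors.
Link: https://arxiv.org/abs/2506.13793
Authors: Zongxian Yang, Jiayu Qian, Zegao Peng, Haoyu Zhang, Zhi-An Huang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: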
Abstract:Large reasoning models have recently made significant strides in mathematical and code reasoning, yet their success has not transferred smoothly to the medical domain. While multiple factors contribute to this disparity, a critical issue is the inadequate focus on the quality of intermediate reflection steps, which is particularly crucial in high-stakes medical scenarios. To address this challenge, we propose Med-REFL, a Medical Reasoning Enhancement via self-corrected Fine-grained refLection. Our method leverages a tree-of-thought approach to decompose medical questions into fine-grained reasoning paths, quantitatively evaluating each step and its subsequent reflections. These assessments enable automatic construction of direct preference optimization data, reducing reliance on expensive expert annotations while guiding models to identify and correct reasoning errors. Experimental results on the MedQA-USMLE benchmark demonstrate Med-REFL achieves consistent improvements, with average gains up to 4.11%. Notably, it further boosts the state-of-the-art performance of 7B/8B models by an additional 4.13%. Furthermore, Med-REFL exhibits strong generalization capabilities and robustness across several challenging medical question-answering datasets. Our work illustrates that prioritizing reflection quality leads to more accurate and trustworthy reasoning in medical AI applications. Checkpoints, code, and data can be found at this https URL.
[AI-106] The NordDRG AI Benchmark for Large Language Models
Quick read: This paper aims to address the lack of an open benchmark for evaluating Large Language Models (LLMs) on the Diagnosis-Related Group (DRG) reimbursement rules of the hospital funding layer. The key to its solution is the release of NordDRG-AI-Benchmark, the first public test-bed that captures a complete DRG rule set and evaluates an LLM’s ability to reason over multilingual diagnosis, procedure, and tariff logic. It bundles definition tables, expert manuals, changelog templates, and 14 CaseMix tasks, offering a reproducible baseline for research on trustworthy automation in hospital funding.
Link: https://arxiv.org/abs/2506.13790
Authors: Tapio Pitkäranta
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures; submitted (v1) Wed, 11 Jun 2025
Abstract:Large language models (LLMs) are already being piloted for clinical coding and decision support. However, until now, no open benchmark has targeted the hospital funding layer where Diagnosis-Related Groups (DRG) determine reimbursement across many countries. We release NordDRG-AI-Benchmark, the first public test-bed that captures a complete DRG rule set and evaluates an LLM’s ability to reason over multilingual diagnosis, procedure, and tariff logic. The benchmark bundles three classes of artefacts: (i) definition tables with 20 interlinked tables covering DRG logic, ICD and NCSP codes, age/sex splits, and country flags; (ii) expert manuals and changelog templates describing real governance workflows; and (iii) a prompt pack of 14 CaseMix tasks that span code lookup, cross-table inference, multilingual terminology, and quality-assurance audits. All artefacts are available at: this https URL. A baseline demonstration shows that five state-of-the-art LLMs perform very differently on the nine automatically verifiable tasks: o3 (OpenAI) scores 9 out of 9, GPT-4o and o4-mini-high score 7 out of 9, while Gemini 2.5 Pro and Gemini 2.5 Flash solve only 5 out of 9 and 3 out of 9, respectively. These results confirm that NordDRG-AI-Benchmark highlights domain-specific strengths and weaknesses that remain hidden in generic LLM benchmarks, offering a reproducible baseline for research on trustworthy automation in hospital funding.
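[AI-107] Analysis of Anonymous User Interaction Relationships and Prediction of Advertising Feedback Based on Graph Neural Network
Quick read: This paper aims to address the difficulty of accurately describing the complex behavior patterns of anonymous users in online advertising, which arises because existing graph models struggle to capture the multi-scale temporal, semantic, and higher-order dependency features of interaction networks. The key to its solution is the Decoupled Temporal-Hierarchical Graph Neural Network (DTH-GNN), which models and optimizes user behavior at fine granularity through temporal edge decomposition, hierarchical heterogeneous aggregation, and feedback-aware contrastive regularization.
Link: https://arxiv.org/abs/2506.13787
Authors: Yanjun Dai, Haoyang Feng, Yuan Gao
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: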
Abstract:Online advertising depends heavily on implicit interaction networks of anonymous users for engagement inference and for the selection and optimization of delivery strategies. However, existing graph models seldom capture the multi-scale temporal, semantic, and higher-order dependency features of these interaction networks, which makes it hard to describe the complicated patterns of anonymous behavior. In this paper, we propose the Decoupled Temporal-Hierarchical Graph Neural Network (DTH-GNN), which makes three main contributions. First, we introduce temporal edge decomposition, which divides each interaction into three types of channels (short-term burst, diurnal cycle, and long-range memory) and extracts features using parallel dilated residual convolution kernels. Second, our model builds a hierarchical heterogeneous aggregation in which user-user, user-advertisement, and advertisement-advertisement subgraphs are combined through a meta-path conditional Transformer encoder, and structural noise is dynamically suppressed through the synergy of cross-channel self-attention and a gating relationship selector. Third, we formulate a feedback-aware contrastive regularization: the consistency of various time slices is maximized, the entropy of control exposure information with dual-view target is maximized, a global prototype of dual-momentum queue distillation is presented, and a lightweight policy gradient layer is combined with the delayed conversion signal to fine-tune the node representations toward a benefit-oriented objective. Compared with the best baseline model, DTH-GNN improves AUC by 8.2% and logarithmic loss by 5.7%.
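[AI-108] Enhancing Bagging Ensemble Regression with Data Integration for Time Series-Based Diabetes Prediction
Quick read: This paper aims to forecast diabetes prevalence across U.S. cities as a time series, in support of effective healthcare planning and targeted interventions. Its key solution is an enhanced bagging ensemble regression model (EBMBag+): diabetes-related datasets from 2011 to 2021 are integrated into a comprehensive feature set on which the model makes its predictions, and the experiments show that EBMBag+ performs best on several evaluation metrics.
Link: https://arxiv.org/abs/2506.13786
Authors: Vuong M. Ngo, Tran Quang Vinh, Patricia Kearney, Mark Roantree
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17th International Conference on Computational Collective Intelligence, LNAI, Springer, 11 pages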
Abstract:Diabetes is a chronic metabolic disease characterized by elevated blood glucose levels, leading to complications like heart disease, kidney failure, and nerve damage. Accurate state-level predictions are vital for effective healthcare planning and targeted interventions, but in many cases, data for necessary analyses are incomplete. This study begins with a data engineering process to integrate diabetes-related datasets from 2011 to 2021 to create a comprehensive feature set. We then introduce an enhanced bagging ensemble regression model (EBMBag+) for time series forecasting to predict diabetes prevalence across U.S. cities. Several baseline models, including SVMReg, BDTree, LSBoost, NN, LSTM, and ERMBag, were evaluated for comparison with our EBMBag+ algorithm. The experimental results demonstrate that EBMBag+ achieved the best performance, with an MAE of 0.41, RMSE of 0.53, MAPE of 4.01, and an R2 of 0.9.
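For readers less familiar with the four metrics reported above, the following sketch computes MAE, RMSE, MAPE (in percent), and R² using their standard textbook definitions. Whether the paper uses exactly these conventions (for example MAPE as a percentage) is an assumption, and the prevalence values are made up for illustration.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (percent), and R^2 for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = sqrt(sum(e * e for e in errors) / n)
    # MAPE is undefined when a true value is zero; the toy data avoids that case.
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}


# Hypothetical prevalence values (percent) and model predictions.
metrics = regression_metrics([10.0, 12.0, 8.0, 10.0], [9.0, 12.0, 9.0, 10.0])
```

MAE and RMSE are in the units of the target, MAPE is scale-free, and R² compares the residual error with the variance of the observations, so the four numbers are complementary rather than redundant.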
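[AI-109] XGraphRAG: Interactive Visual Analysis for Graph-based Retrieval-Augmented Generation
Quick read: This paper addresses the limited interpretability and accessibility of graph-based Retrieval-Augmented Generation (RAG) in practice, which stem from its complex information processing pipeline and its large number of large language model (LLM) invocations. The key to the solution is a visual analysis framework that helps RAG developers identify critical recalls of GraphRAG and trace them through the entire GraphRAG pipeline, making it easier to collect failure cases and identify opportunities for improvement. Building on this framework, the authors develop XGraphRAG, a prototype system that incorporates a set of interactive visualizations to support the analysis process.
Link: https://arxiv.org/abs/2506.13782
Authors: Ke Wang, Bo Pan, Yingchaojie Feng, Yuwei Wu, Jieyi Chen, Minfeng Zhu, Wei Chen
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE Pacific Visualization Conference 2025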
Abstract:Graph-based Retrieval-Augmented Generation (RAG) has shown great capability in enhancing the answers of Large Language Models (LLMs) with an external knowledge base. Compared to traditional RAG, it introduces a graph as an intermediate representation to better capture structured relational knowledge in the corpus, elevating the precision and comprehensiveness of generation results. However, developers usually face challenges in analyzing the effectiveness of GraphRAG on their dataset due to GraphRAG’s complex information processing pipeline and the overwhelming amount of LLM invocations involved during graph construction and query, which limits GraphRAG interpretability and accessibility. This research proposes a visual analysis framework that helps RAG developers identify critical recalls of GraphRAG and trace these recalls through the GraphRAG pipeline. Based on this framework, we develop XGraphRAG, a prototype system incorporating a set of interactive visualizations to facilitate users’ analysis process, supporting the collection of failure cases and the identification of improvement opportunities. Our evaluation demonstrates the effectiveness and usability of our approach. Our work is open-sourced and available at this https URL.
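[AI-110] Solving the Job Shop Scheduling Problem with Graph Neural Networks: A Customizable Reinforcement Learning Environment
Quick read: This paper addresses the job shop scheduling problem (JSSP), an NP-hard combinatorial optimization problem relevant to manufacturing and timetabling. Traditional approaches rely on priority dispatching rules based on simple heuristics, while recent work attempts to replace these rules with deep learning models, particularly graph neural networks (GNNs), that learn to assign priorities from data. The key to the solution is JobShopLib, a modular library that allows customizing key factors such as graph representation, node features, action space, and reward functions, and supports experimentation through a reinforcement learning environment. The authors also train several dispatchers through imitation learning to demonstrate the utility of the environment. One model that uses only individual operation features outperforms various graph-based dispatchers, highlighting the importance of feature customization.
Link: https://arxiv.org/abs/2506.13781
Authors: Pablo Ariño Fernández, Carlos Quesada González
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
Comments: Bachelor’s thesis, Universidad Politécnica de Madrid, 2025. 150 pages, 23 figures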
Abstract:The job shop scheduling problem is an NP-hard combinatorial optimization problem relevant to manufacturing and timetabling. Traditional approaches use priority dispatching rules based on simple heuristics. Recent work has attempted to replace these with deep learning models, particularly graph neural networks (GNNs), that learn to assign priorities from data. However, training such models requires customizing numerous factors: graph representation, node features, action space, and reward functions. The lack of modular libraries for experimentation makes this research time-consuming. This work introduces JobShopLib, a modular library that allows customizing these factors and creating new components with its reinforcement learning environment. We trained several dispatchers through imitation learning to demonstrate the environment’s utility. One model outperformed various graph-based dispatchers using only individual operation features, highlighting the importance of feature customization. Our GNN model achieved near state-of-the-art results on large-scale problems. These results suggest significant room for improvement in developing such models. JobShopLib provides the necessary tools for future experimentation.
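As background for the "priority dispatching rules" that learned dispatchers aim to replace, here is a minimal, self-contained sketch of a dispatcher driven by a pluggable priority function. It deliberately does not use the JobShopLib API: the `dispatch` function, the shortest-processing-time rule, and the two-job instance are all hypothetical illustrations.

```python
def dispatch(jobs, priority):
    """Build a schedule by repeatedly dispatching the highest-priority ready operation.

    jobs: list of jobs; each job is a list of (machine, duration) in processing order.
    priority: function (job_id, machine, duration) -> sortable key; smallest goes first.
    Returns (makespan, schedule), where schedule holds (job_id, machine, start, end).
    """
    next_op = [0] * len(jobs)    # index of the next unscheduled operation per job
    job_ready = [0] * len(jobs)  # time at which each job's previous operation ends
    machine_ready = {}           # time at which each machine becomes free
    schedule = []
    while any(next_op[j] < len(jobs[j]) for j in range(len(jobs))):
        candidates = [
            (j, *jobs[j][next_op[j]])
            for j in range(len(jobs))
            if next_op[j] < len(jobs[j])
        ]
        j, machine, duration = min(candidates, key=lambda c: priority(*c))
        start = max(job_ready[j], machine_ready.get(machine, 0))
        end = start + duration
        schedule.append((j, machine, start, end))
        job_ready[j] = machine_ready[machine] = end
        next_op[j] += 1
    return max(e for _, _, _, e in schedule), schedule


def shortest_processing_time(job_id, machine, duration):
    """Classic SPT rule: prefer the shortest operation, break ties by job id."""
    return (duration, job_id)


# Hypothetical instance: two jobs, two machines.
jobs = [
    [(0, 3), (1, 2)],  # job 0: machine 0 for 3 time units, then machine 1 for 2
    [(1, 2), (0, 4)],  # job 1: machine 1 for 2 time units, then machine 0 for 4
]
makespan, schedule = dispatch(jobs, shortest_processing_time)
```

A learned dispatcher would replace `shortest_processing_time` with a model that scores candidate operations from features of the current state, which is where choices about graph representation and node features come in.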
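[AI-111] Recommendations and Reporting Checklist for Rigorous & Transparent Human Baselines in Model Evaluations ICML2025
Quick read: This paper addresses the insufficient rigor and transparency of human baselines in current foundation model evaluations, which prevents meaningful comparisons of human and AI performance. The key to its solution is a framework for designing, executing, and reporting human baselines, together with a reporting checklist, to make baselining methods more rigorous and reproducible. The checklist is used to systematically review 115 human baselines in foundation model evaluations and identify shortcomings of existing methods.
Link: https://arxiv.org/abs/2506.13776
Authors: Kevin L. Wei, Patricia Paskov, Sunishchal Dev, Michael J. Byun, Anka Reuel, Xavier Roberts-Gaal, Rachel Calcott, Evie Coxon, Chinmay Deshpande
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: A version of this paper has been accepted to ICML 2025 as a position paper (spotlight), with the title: “Position: Human Baselines in Model Evaluations Need Rigor and Transparency (With Recommendations & Reporting Checklist).”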
Abstract:In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve “super-human” performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human baselines (studies) in foundation model evaluations and thus identify shortcomings in existing baselining methods; our checklist can also assist researchers in conducting human baselines and reporting results. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: this https URL
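[AI-112] Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values
[Quick Read]: This paper addresses the problem of aligning agentic AI systems, which plan and act autonomously, with human values, safety requirements, and compliance needs in practical deployment. Existing alignment methods tend to induce confabulation or operational inefficiency when supplying deep, personalized contextual information. The key to its solution is a personalized oversight mechanism called the "superego" agent, which dynamically steers AI planning by referencing user-selected "Creed Constitutions" and pairs this with a real-time compliance enforcer that validates plans against those constitutions and a universal ethical floor, reducing harmful outputs and improving system safety.
Link: https://arxiv.org/abs/2506.13774
Authors: Nell Watson,Ahmed Amer,Evan Harris,Preeti Ravindra,Shujun Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: 39 pages, 5 figures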
Abstract:Agentic AI systems, possessing capabilities for autonomous planning and action, exhibit immense potential across diverse domains. However, their practical deployment is significantly hampered by challenges in aligning their behavior with varied human values, complex safety requirements, and specific compliance needs. Existing alignment methodologies often falter when faced with the intricate task of providing deep, personalized contextual information without inducing confabulation or operational inefficiencies. This paper introduces a novel solution: a ‘superego’ agent, designed as a personalized oversight mechanism for agentic AI. This system dynamically steers AI planning by referencing user-selected “Creed Constitutions”-encapsulating diverse rule sets-with adjustable adherence levels to fit non-negotiable values. A real-time compliance enforcer validates plans against these constitutions and a universal ethical floor before execution. We present a functional system, including a demonstration interface (this http URL) with a prototypical constitution-sharing portal, and successful integration with third-party models via the Model Context Protocol (MCP). Comprehensive benchmark evaluations (HarmBench, AgentHarm) demonstrate that our Superego agent dramatically reduces harmful outputs, achieving up to a 98.3% harm score reduction and near-perfect refusal rates (e.g., 100% with Claude Sonnet 4 on AgentHarm’s harmful set) for leading LLMs like Gemini 2.5 Flash and GPT-4o. This approach substantially simplifies personalized AI alignment, rendering agentic systems more reliably attuned to individual and cultural contexts, while also enabling substantial safety improvements. An overview on this research with examples is available at this https URL.
[AI-113] Representing Time-Continuous Behavior of Cyber-Physical Systems in Knowledge Graphs
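[Quick Read]: This paper addresses how behavioral information in the form of differential equations can be contextualized and integrated with other Cyber-Physical System (CPS) information across lifecycle phases. Knowledge graphs provide a formal description and structuring mechanism, but reusable ontological artifacts and methods for reducing manual instantiation effort are lacking. The key to the solution is two artifacts: first, a standards-based modular semantic model that represents differential equations directly in knowledge graphs and enriches them semantically; second, a method for efficient knowledge graph generation. Validation in the aviation maintenance domain shows that the differential equations of a complex electro-hydraulic servoactuator can be formally represented in a knowledge graph and contextualized with other lifecycle data, demonstrating practical applicability.
Link: https://arxiv.org/abs/2506.13773
Authors: Milapji Singh Gill,Tom Jeleniewski,Felix Gehlhoff,Alexander Fay
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: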
Abstract:Time-continuous dynamic models are essential for various Cyber-Physical System (CPS) applications. To ensure effective usability in different lifecycle phases, such behavioral information in the form of differential equations must be contextualized and integrated with further CPS information. While knowledge graphs provide a formal description and structuring mechanism for this task, there is a lack of reusable ontological artifacts and methods to reduce manual instantiation effort. Hence, this contribution introduces two artifacts: Firstly, a modular semantic model based on standards is introduced to represent differential equations directly within knowledge graphs and to enrich them semantically. Secondly, a method for efficient knowledge graph generation is presented. A validation of these artifacts was conducted in the domain of aviation maintenance. Results show that differential equations of a complex Electro-Hydraulic Servoactuator can be formally represented in a knowledge graph and be contextualized with other lifecycle data, proving the artifacts’ practical applicability.
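[AI-114] MobiEdit: Resource-efficient Knowledge Editing for Personalized On-device LLMs
[Quick Read]: This paper addresses hallucination when large language models (LLMs) are personalized on mobile devices: models produce inaccurate or outdated responses to personalized or unseen queries. The key to its solution is MobiEdit, a mobile knowledge editing framework that replaces full-precision backpropagation with quantized forward-only gradient estimation, enabling efficient on-device knowledge editing that is compatible with energy-efficient mobile neural processing units (NPUs). MobiEdit also introduces an early stopping mechanism and a prefix cache to further improve gradient estimation efficiency, enabling real-time editing of a 3B-parameter model on commercial off-the-shelf (COTS) mobile devices.
Link: https://arxiv.org/abs/2506.13772
Authors: Zhenyan Lu,Daliang Xu,Dongqi Cai,Zexi Li,Wei Liu,Fangming Liu,Shangguang Wang,Mengwei Xu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: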
Abstract:Large language models (LLMs) are deployed on mobile devices to power killer applications such as intelligent assistants. LLMs pre-trained on general corpora often hallucinate when handling personalized or unseen queries, leading to incorrect or outdated responses. Knowledge editing addresses this by identifying and adjusting a small crucial portion of model weights, without compromising the general knowledge. However, prior knowledge editing methods are impractical to run on local devices due to the resource-heavy backpropagation (BP) needed for updates. We present MobiEdit, the first mobile knowledge editing framework that enables efficient LLM personalization on commercial off-the-shelf (COTS) mobile devices. MobiEdit replaces full-precision BP with quantized forward-only gradient estimation, making it compatible with energy-efficient mobile neural processing units (NPUs). To further improve gradient estimation efficiency, we introduce two optimizations: an early stopping mechanism that adaptively terminates editing upon success and a prefix cache that reuses computation across steps. Our approach enables real-time editing of a 3B-parameter model (Qwen2.5-3B-Instruct) on COTS mobile devices with 7.6× less memory, 14.7× less energy and 3.6× less latency compared to previous knowledge editing methods.
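The abstract names forward-only gradient estimation and early stopping but does not spell out the estimator. The following is a minimal sketch of the general idea, assuming a two-point random-direction (SPSA-style) estimate on a toy quadratic loss. The function names, the toy loss, and all hyperparameters are hypothetical, and quantization and the prefix cache are left out; this is not MobiEdit's implementation.

```python
import random

def loss(w, target):
    """Toy stand-in for an editing loss: squared distance to a target vector."""
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

def estimate_gradient(f, w, eps=1e-3, rng=random):
    """Two-point estimate along a random direction: two forward passes, no backprop."""
    u = [rng.gauss(0.0, 1.0) for _ in w]
    w_plus = [wi + eps * ui for wi, ui in zip(w, u)]
    w_minus = [wi - eps * ui for wi, ui in zip(w, u)]
    scale = (f(w_plus) - f(w_minus)) / (2.0 * eps)  # directional derivative along u
    return [scale * ui for ui in u]

def edit(w, target, lr=0.05, max_steps=5000, tol=1e-3, seed=0):
    """Run forward-only updates and stop as soon as the loss falls below tol."""
    rng = random.Random(seed)
    f = lambda v: loss(v, target)
    for step in range(max_steps):
        if f(w) < tol:  # early stopping: the edit is considered successful
            return w, step
        g = estimate_gradient(f, w, rng=rng)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w, max_steps

if __name__ == "__main__":
    weights, steps = edit([0.0, 0.0, 0.0, 0.0], [1.0, -1.0, 0.5, 2.0])
    print("stopped after", steps, "steps")
```

Each update costs two loss evaluations, which is what makes this family of estimators attractive on inference-only hardware; the price is a noisier gradient than backpropagation would give.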
[AI-115] Memory States from Almost Nothing: Representing and Computing in a Non-associative Algebra
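[Quick Read]: This paper addresses how to represent and compute with information items in high-dimensional space while preserving the temporal structure of sequences. Models that rely on associative bundling typically lose order information and need auxiliary structures such as position markers to represent sequences. The key to the solution is a non-associative bundling mechanism that builds sparse representations of arbitrarily long sequences while keeping their temporal structure, treating noise as a constituent of the order representation rather than as interference. Left-associative bundling yields an L-state that emphasizes recent information, and right-associative bundling yields an R-state that encodes finite sequences or chunks; the two are associated with prefrontal cortex activity in short-term memory and hippocampal encoding in long-term memory, respectively.
Link: https://arxiv.org/abs/2506.13768
Authors: Stefan Reimann
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 27 pages, 6 figures, journal article (accepted)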
Abstract:This note presents a non-associative algebraic framework for the representation and computation of information items in high-dimensional space. This framework is consistent with the principles of spatial computing and with the empirical findings in cognitive science about memory. Computations are performed through a process of multiplication-like binding and non-associative interference-like bundling. Models that rely on associative bundling typically lose order information, which necessitates the use of auxiliary order structures, such as position markers, to represent sequential information that is important for cognitive tasks. In contrast, the non-associative bundling proposed allows the construction of sparse representations of arbitrarily long sequences that maintain their temporal structure across arbitrary lengths. In this operation, noise is a constituent element of the representation of order information, rather than a means of obscuring it. The non-associative nature of the proposed framework results in the representation of a single sequence by two distinct states. The L-state, generated through left-associative bundling, continuously updates and emphasises a recency effect, while the R-state, formed through right-associative bundling, encodes finite sequences or chunks, capturing a primacy effect. The construction of these states may be associated with activity in the prefrontal cortex in relation to short-term memory and hippocampal encoding in long-term memory, respectively. The accuracy of retrieval is contingent upon a decision-making process that is based on the mutual information between the memory states and the cue. The model is able to replicate the Serial Position Curve, which reflects the empirical recency and primacy effects observed in cognitive experiments.
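The abstract does not define the bundling operator, so the sketch below uses an assumed one purely for illustration: each coordinate of the result is copied at random from one of the two inputs. That operator is noise-driven and non-associative, and folding it left or right over a sequence reproduces the qualitative recency and primacy pattern described above. All names are hypothetical and this is not the paper's algebra.

```python
import random
from functools import reduce

def random_item(dim, rng):
    """A random bipolar vector standing in for an information item."""
    return [rng.choice((-1, 1)) for _ in range(dim)]

def bundle(a, b, rng):
    """Assumed operator: each coordinate is taken from a or b with equal probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def similarity(a, b):
    """Normalized dot product of two bipolar vectors."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

def l_state(items, rng):
    """Left-associative fold ((a1 + a2) + a3) + ...: the last item dominates."""
    return reduce(lambda acc, item: bundle(acc, item, rng), items)

def r_state(items, rng):
    """Right-associative fold a1 + (a2 + (a3 + ...)): the first item dominates."""
    return reduce(lambda acc, item: bundle(item, acc, rng), reversed(items))

if __name__ == "__main__":
    rng = random.Random(0)
    seq = [random_item(4000, rng) for _ in range(5)]
    L, R = l_state(seq, rng), r_state(seq, rng)
    print("L-state: first vs last", similarity(L, seq[0]), similarity(L, seq[-1]))
    print("R-state: first vs last", similarity(R, seq[0]), similarity(R, seq[-1]))
```

Under this assumed operator an item's expected similarity to the state halves with each later bundling step, which is why the two fold directions weight opposite ends of the sequence.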
[AI-116] Accurate and scalable exchange-correlation with deep learning
[Quick Read]: This paper addresses the trade-off between accuracy and computational cost in approximations of the exchange-correlation (XC) functional in Density Functional Theory (DFT); current XC functionals do not reach chemical accuracy (typically defined as errors below 1 kcal/mol) when predicting laboratory experiments. The proposed solution is Skala, a deep learning-based XC functional whose key idea is to learn representations directly from data instead of relying on expensive hand-designed features. Skala reaches chemical accuracy for atomization energies of small molecules while retaining the computational efficiency of semi-local DFT, and its performance improves further as training data grows.
Link: https://arxiv.org/abs/2506.14665
Authors: Giulia Luise,Chin-Wei Huang,Thijs Vogels,Derk P. Kooi,Sebastian Ehlert,Stephanie Lanius,Klaas J. H. Giesbertz,Amir Karton,Deniz Gunceler,Megan Stanley,Wessel P. Bruinsma,Lin Huang,Xinran Wei,José Garrido Torres,Abylay Katbashev,Bálint Máté,Sékou-Oumar Kaba,Roberto Sordillo,Yingrong Chen,David B. Williams-Young,Christopher M. Bishop,Jan Hermann,Rianne van den Berg,Paola Gori-Giorgi
Affiliation: Unknown
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: Main: 13 pages plus references, 11 figures and tables. Supplementary information: 19 pages, 12 figures and tables
Abstract:Density Functional Theory (DFT) is the most widely used electronic structure method for predicting the properties of molecules and materials. Although DFT is, in principle, an exact reformulation of the Schrödinger equation, practical applications rely on approximations to the unknown exchange-correlation (XC) functional. Most existing XC functionals are constructed using a limited set of increasingly complex, hand-crafted features that improve accuracy at the expense of computational efficiency. Yet, no current approximation achieves the accuracy and generality for predictive modeling of laboratory experiments at chemical accuracy – typically defined as errors below 1 kcal/mol. In this work, we present Skala, a modern deep learning-based XC functional that bypasses expensive hand-designed features by learning representations directly from data. Skala achieves chemical accuracy for atomization energies of small molecules while retaining the computational efficiency typical of semi-local DFT. This performance is enabled by training on an unprecedented volume of high-accuracy reference data generated using computationally intensive wavefunction-based methods. Notably, Skala systematically improves with additional training data covering diverse chemistry. By incorporating a modest amount of additional high-accuracy data tailored to chemistry beyond atomization energies, Skala achieves accuracy competitive with the best-performing hybrid functionals across general main group chemistry, at the cost of semi-local DFT. As the training dataset continues to expand, Skala is poised to further enhance the predictive power of first-principles simulations.
[AI-117] Complete Characterization for Adjustment in Summary Causal Graphs of Time Series UAI
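[Quick Read]: This paper addresses the identifiability problem for multiple interventions in time series when only an abstraction of the true causal graph, a summary causal graph, is available. The key to its solution is a set of necessary and sufficient conditions for the adjustment criterion, which is shown to be complete in this setting, together with a pseudo-linear algorithm that decides whether a query is identifiable.
Link: https://arxiv.org/abs/2506.14534
Authors: Clément Yvernes,Emilie Devijver,Eric Gaussier
Affiliation: Unknown
Categories: Statistics Theory (math.ST); Artificial Intelligence (cs.AI)
Comments: Accepted at the 41st Conference on Uncertainty in Artificial Intelligence (UAI)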
Abstract:The identifiability problem for interventions aims at assessing whether the total causal effect can be written with a do-free formula, and thus be estimated from observational data only. We study this problem, considering multiple interventions, in the context of time series when only an abstraction of the true causal graph, in the form of a summary causal graph, is available. We propose in particular both necessary and sufficient conditions for the adjustment criterion, which we show is complete in this setting, and provide a pseudo-linear algorithm to decide whether the query is identifiable or not.
[AI-118] Sharp Generalization Bounds for Foundation Models with Asymmetric Randomized Low-Rank Adapters
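[Quick Read]: This paper addresses the uncertainty about how Low-Rank Adaptation (LoRA) generalizes in a single fine-tuning run, in particular the effect of the asymmetric initialization of its low-rank factors. The key to the solution is a theoretical analysis of asymmetric LoRA: the authors derive a high-probability upper bound with sample complexity $\tilde{\mathcal{O}}\left(\frac{\sqrt{r}}{\sqrt{N}}\right)$ for rank-$r$ LoRAs trained on $N$ samples, and establish a matching lower bound of $\mathcal{O}\left(\frac{1}{\sqrt{N}}\right)$ on sample efficiency, giving a theoretical basis for the practical reliability of asymmetric LoRA.
Link: https://arxiv.org/abs/2506.14530
Authors: Anastasis Kratsios,Tin Sum Cheng,Aurelien Lucchi,Haitz Sáez de Ocáriz Borde
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Statistics Theory (math.ST)
Comments: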
Abstract:Low-Rank Adaptation (LoRA) has emerged as a widely adopted parameter-efficient fine-tuning (PEFT) technique for foundation models. Recent work has highlighted an inherent asymmetry in the initialization of LoRA’s low-rank factors, which has been present since its inception and was presumably derived experimentally. This paper focuses on providing a comprehensive theoretical characterization of asymmetric LoRA with frozen random factors. First, while existing research provides upper-bound generalization guarantees based on averages over multiple experiments, the behaviour of a single fine-tuning run with specific random factors remains an open question. We address this by investigating the concentration of the typical LoRA generalization gap around its mean. Our main upper bound reveals a sample complexity of $\tilde{\mathcal{O}}\left(\frac{\sqrt{r}}{\sqrt{N}}\right)$ with high probability for rank $r$ LoRAs trained on $N$ samples. Additionally, we also determine the fundamental limits in terms of sample efficiency, establishing a matching lower bound of $\mathcal{O}\left(\frac{1}{\sqrt{N}}\right)$. By more closely reflecting the practical scenario of a single fine-tuning run, our findings offer crucial insights into the reliability and practicality of asymmetric LoRA.
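To get a feel for how the stated rates scale, here is a small arithmetic sketch. It evaluates only the leading terms sqrt(r)/sqrt(N) and 1/sqrt(N), ignoring the constants and logarithmic factors hidden by the tilde notation, so the numbers are relative scalings rather than actual generalization gaps. The function names are hypothetical.

```python
import math

def upper_rate(rank, n_samples):
    """Leading term sqrt(r)/sqrt(N) of the upper bound (constants and logs dropped)."""
    return math.sqrt(rank) / math.sqrt(n_samples)

def lower_rate(n_samples):
    """Leading term 1/sqrt(N) of the lower bound."""
    return 1.0 / math.sqrt(n_samples)

if __name__ == "__main__":
    for rank, n in [(4, 10_000), (16, 10_000), (4, 40_000)]:
        print(f"r={rank:>2} N={n:>6} upper~{upper_rate(rank, n):.4f} lower~{lower_rate(n):.4f}")
```

Quadrupling the rank doubles the leading term of the upper bound, while quadrupling the sample count halves it; the two rates differ only by the sqrt(r) factor.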
[AI-119] Hamiltonian Formalism for Comparing Quantum and Classical Intelligence
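[Quick Read]: This paper addresses how to compare the operation of artificial general intelligence (AGI) directly across classical and quantum environments; the core challenge is a mathematical framework that describes how AGI behaves differently on different physical substrates. The key to the solution is a Hamiltonian formalism that decomposes AGI dynamics into Hamiltonian generators for core functions such as induction, reasoning, recursion, learning, measurement, and memory, providing a precise mathematical language for how quantum and classical agents differ via environmental interaction.
Link: https://arxiv.org/abs/2506.14456
Authors: Elija Perrier
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: This is the version accepted at AGI 25 (camera ready length limit of 10 pages plus references and appendices). Further work detailing bounds and limitations is in preparation. Comments and criticisms welcome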
Abstract:The prospect of AGI instantiated on quantum substrates motivates the development of mathematical frameworks that enable direct comparison of their operation in classical and quantum environments. To this end, we introduce a Hamiltonian formalism for describing classical and quantum AGI tasks as a means of contrasting their interaction with the environment. We propose a decomposition of AGI dynamics into Hamiltonian generators for core functions such as induction, reasoning, recursion, learning, measurement, and memory. This formalism aims to contribute to the development of a precise mathematical language for how quantum and classical agents differ via environmental interaction.
[AI-120] Adjustment for Confounding using Pre-Trained Representations ICML2025
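[Quick Read]: This paper addresses confounding that arises when non-tabular data such as images and text are brought into average treatment effect (ATE) estimation; neglecting such confounders risks biased results and flawed scientific conclusions. The key to the solution is to adjust for confounding with latent features from pre-trained neural networks and to obtain valid statistical inference through methods such as double machine learning. The study examines the challenges posed by the high dimensionality and non-identifiability of latent representations, shows that the usual structural assumptions behind additive or sparse linear models are unrealistic here, and argues that neural networks are largely insensitive to these issues, achieving fast convergence rates by adapting to the intrinsic sparsity and dimension of the problem.
Link: https://arxiv.org/abs/2506.14329
Authors: Rickmer Schulte,David Rügamer,Thomas Nagler
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
Comments: Accepted at ICML 2025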
Abstract:There is growing interest in extending average treatment effect (ATE) estimation to incorporate non-tabular data, such as images and text, which may act as sources of confounding. Neglecting these effects risks biased results and flawed scientific conclusions. However, incorporating non-tabular data necessitates sophisticated feature extractors, often in combination with ideas of transfer learning. In this work, we investigate how latent features from pre-trained neural networks can be leveraged to adjust for sources of confounding. We formalize conditions under which these latent features enable valid adjustment and statistical inference in ATE estimation, demonstrating results along the example of double machine learning. We discuss critical challenges inherent to latent feature learning and downstream parameter estimation arising from the high dimensionality and non-identifiability of representations. Common structural assumptions for obtaining fast convergence rates with additive or sparse linear models are shown to be unrealistic for latent features. We argue, however, that neural networks are largely insensitive to these issues. In particular, we show that neural networks can achieve fast convergence rates by adapting to intrinsic notions of sparsity and dimension of the learning problem.
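For readers unfamiliar with double machine learning, the sketch below shows the doubly robust (AIPW) score that is commonly averaged to estimate the ATE. It assumes the nuisance predictions are already available: in the setting the abstract describes, they would be fitted on held-out folds using latent features from a pre-trained network as covariates. The function and argument names are hypothetical, and the fitting and cross-fitting steps are omitted.

```python
def aipw_ate(y, t, m1, m0, e):
    """Average of doubly robust scores.

    y  : observed outcomes
    t  : treatment indicators (0 or 1)
    m1 : predicted outcome under treatment, per unit
    m0 : predicted outcome under control, per unit
    e  : predicted propensity P(T=1 | features), strictly between 0 and 1
    """
    scores = []
    for yi, ti, m1i, m0i, ei in zip(y, t, m1, m0, e):
        psi = (m1i - m0i
               + ti * (yi - m1i) / ei                 # residual correction for treated units
               - (1 - ti) * (yi - m0i) / (1 - ei))    # residual correction for control units
        scores.append(psi)
    return sum(scores) / len(scores)

if __name__ == "__main__":
    y = [3.0, 1.0, 3.5, 0.5]
    t = [1, 0, 1, 0]
    m1 = [3.0, 3.0, 3.0, 3.0]
    m0 = [1.0, 1.0, 1.0, 1.0]
    e = [0.5, 0.5, 0.5, 0.5]
    print(aipw_ate(y, t, m1, m0, e))  # 2.5
```

The score stays consistent if either the outcome models or the propensity model is correct, which is why it tolerates the slower convergence of flexible nuisance estimators.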
[AI-121] Mirror Descent Using the Tempesta Generalized Multi-parametric Logarithms
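[Quick Read]: This paper addresses the design of optimization algorithms in machine learning, introducing a generalized family of Mirror Descent (MD) algorithms to make optimization more flexible and adaptive. The key to the solution is to use the Bregman divergence with the Tempesta multi-parametric deformed logarithm as the link function (also called the mirror function), which defines the mapping between the primal and dual spaces and is associated with a very wide (in theory infinite) class of generalized trace-form entropies. New MD updates are derived by estimating a generalized exponential function that approximates the inverse of the multi-parametric Tempesta logarithm, and the hyperparameters can be learned to adapt to the distribution or geometry of the training data, allowing the properties of the MD algorithm to be tuned.
Link: https://arxiv.org/abs/2506.13984
Authors: Andrzej Cichocki
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: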
Abstract:In this paper, we develop a wide class of Mirror Descent (MD) algorithms, which play a key role in machine learning. For this purpose we formulate a constrained optimization problem in which we exploit the Bregman divergence with the Tempesta multi-parametric deformation logarithm as a link function. This link function, also called the mirror function, defines the mapping between the primal and dual spaces and is associated with a very wide (in fact, theoretically infinite) class of generalized trace-form entropies. In order to derive novel MD updates, we estimate a generalized exponential function that closely approximates the inverse of the multi-parametric Tempesta generalized logarithm. The shape and properties of the Tempesta logarithm and its inverse, the deformed exponential function, can be tuned by several hyperparameters. By learning these hyperparameters, we can adapt to the distribution or geometry of the training data and adjust them to achieve desired properties of MD algorithms. The concept of applying multi-parametric logarithms allows us to generate a new wide and flexible family of MD and mirror-less MD updates.
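The role of the link function is easiest to see in the classical special case. The sketch below runs mirror descent on the probability simplex with the natural logarithm as link and the exponential as its inverse, which gives the well-known exponentiated-gradient update. The abstract does not give the Tempesta logarithm's formula, so it is not implemented here; as the abstract describes it, the generalized method would replace this log/exp pair with the deformed logarithm and its approximate inverse. Names and hyperparameters are hypothetical.

```python
import math

def mirror_descent_simplex(grad, x0, lr=0.5, steps=200):
    """Entropic mirror descent: log maps to the dual space, exp maps back."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        # Link function (log) takes x to the dual space, where the gradient step is taken.
        y = [math.log(max(xi, 1e-300)) - lr * gi for xi, gi in zip(x, g)]
        # Inverse link (exp) maps back; subtracting max(y) avoids overflow.
        shift = max(y)
        z = [math.exp(yi - shift) for yi in y]
        total = sum(z)
        x = [zi / total for zi in z]  # normalization keeps x on the simplex
    return x

if __name__ == "__main__":
    cost = [0.3, 0.1, 0.7]
    x = mirror_descent_simplex(lambda _x: cost, [1 / 3, 1 / 3, 1 / 3])
    print([round(v, 4) for v in x])
```

On a linear cost the iterate concentrates on the cheapest coordinate, and the multiplicative form of the update keeps every coordinate positive without an explicit projection.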
[AI-122] Beyond Shapley Values: Cooperative Games for the Interpretation of Machine Learning Models
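[Quick Read]: This paper addresses the weak theoretical grounding and limited flexibility of current Shapley-value-based interpretability methods for feature attribution, revisiting the use of cooperative game theory in post-hoc interpretability of machine learning. The key to the solution is a broader and more principled use of these tools, highlighting two families of efficient allocations, the Weber and Harsanyi sets, which extend beyond Shapley values and offer richer interpretative flexibility; by clarifying the distinction between value functions and aggregation rules, the authors set out a three-step blueprint for constructing reliable and theoretically grounded feature attributions.
Link: https://arxiv.org/abs/2506.13900
Authors: Marouane Il Idrissi,Agathe Fernandes Machado,Arthur Charpentier
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: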
Abstract:Cooperative game theory has become a cornerstone of post-hoc interpretability in machine learning, largely through the use of Shapley values. Yet, despite their widespread adoption, Shapley-based methods often rest on axiomatic justifications whose relevance to feature attribution remains debatable. In this paper, we revisit cooperative game theory from an interpretability perspective and argue for a broader and more principled use of its tools. We highlight two general families of efficient allocations, the Weber and Harsanyi sets, that extend beyond Shapley values and offer richer interpretative flexibility. We present an accessible overview of these allocation schemes, clarify the distinction between value functions and aggregation rules, and introduce a three-step blueprint for constructing reliable and theoretically-grounded feature attributions. Our goal is to move beyond fixed axioms and provide the XAI community with a coherent framework to design attribution methods that are both meaningful and robust to shifting methodological trends.
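As a concrete illustration of allocations beyond the Shapley value, the sketch below computes marginal contribution vectors for a small cooperative game. The Weber set is the convex hull of these vectors, so any weighted average over player orderings is an efficient allocation, and uniform weights recover the Shapley value. The toy game and all names are hypothetical; the Harsanyi set and the paper's three-step blueprint are not covered.

```python
from itertools import permutations

def marginal_vector(order, value):
    """Payoff of each player when players join the coalition in the given order."""
    payoff = {}
    coalition = frozenset()
    for player in order:
        joined = coalition | {player}
        payoff[player] = value(joined) - value(coalition)
        coalition = joined
    return payoff

def random_order_value(players, value, weights=None):
    """Weighted average of marginal vectors; uniform weights give the Shapley value."""
    orders = list(permutations(players))
    if weights is None:
        weights = [1.0 / len(orders)] * len(orders)
    result = {p: 0.0 for p in players}
    for w, order in zip(weights, orders):
        mv = marginal_vector(order, value)
        for p in players:
            result[p] += w * mv[p]
    return result

def toy_game(coalition):
    """Worth 1 if player 'a' is present together with at least one other player."""
    return 1.0 if "a" in coalition and len(coalition) >= 2 else 0.0

if __name__ == "__main__":
    players = ["a", "b", "c"]
    print(random_order_value(players, toy_game))  # Shapley: a = 2/3, b = c = 1/6
    # All weight on the ordering (a, b, c): a different point of the Weber set.
    print(random_order_value(players, toy_game, [1, 0, 0, 0, 0, 0]))
```

Both allocations sum to the worth of the grand coalition, which is the efficiency property shared by every point of the Weber set; choosing the weights is where the interpretative flexibility comes in.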
[AI-123] DeepSeq: High-Throughput Single-Cell RNA Sequencing Data Labeling via Web Search-Augmented Agentic Generative AI Foundation Models ICML2025
[Quick Read]: This paper addresses a key bottleneck in supervised learning on structured omics data: labeling experimental data is slow and error-prone. The key to the solution is to use agentic foundation models with real-time web search to automate labeling, increasing annotation throughput without manual curation and reaching up to 82.5% accuracy. This opens the way to virtual cell foundation models capable of downstream tasks such as cell-typing and perturbation prediction.
Link: https://arxiv.org/abs/2506.13817
Authors: Saleem A. Al Dajani,Abel Sanchez,John R. Williams
Affiliation: Unknown
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE); Quantitative Methods (q-bio.QM)
Comments: 4 pages, 5 figures, Accepted by ICML 2025 FM4LS this https URL. Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences (FM4LS), July 2025
Abstract:Generative AI foundation models offer transformative potential for processing structured biological data, particularly in single-cell RNA sequencing, where datasets are rapidly scaling toward billions of cells. We propose the use of agentic foundation models with real-time web search to automate the labeling of experimental data, achieving up to 82.5% accuracy. This addresses a key bottleneck in supervised learning for structured omics data by increasing annotation throughput without manual curation and human error. Our approach enables the development of virtual cell foundation models capable of downstream tasks such as cell-typing and perturbation prediction. As data volume grows, these models may surpass human performance in labeling, paving the way for reliable inference in large-scale perturbation screens. This application demonstrates domain-specific innovation in health monitoring and diagnostics, aligned with efforts like the Human Cell Atlas and Human Tumor Atlas Network.
[AI-124] Analysis and Optimization of Probabilities of Beneficial Mutation and Crossover Recombination in a Hamming Space
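[Quick Read]: This paper addresses how to optimize the parameters of mutation and recombination operators in genetic algorithms so that search is more efficient and approaches the optimum faster. The key to the solution, inspired by Fisher's geometric approach, is a geometric and combinatorial analysis of the probabilities of beneficial mutation and crossover recombination of strings in a Hamming space, yielding closed-form expressions for transition probabilities between spheres around an optimum and hence a complete description of the Markov evolution of distances from the optimum. The analysis provides the theoretical basis for the optimal mutation and recombination radii that maximize the probability of mutating or crossing over into the optimum.
Link: https://arxiv.org/abs/2506.13809
Authors: Roman V. Belavkin
Affiliation: Unknown
Categories: Populations and Evolution (q-bio.PE); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: 42 pages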
Abstract:Inspired by Fisher’s geometric approach to study beneficial mutations, we analyse probabilities of beneficial mutation and crossover recombination of strings in a general Hamming space with arbitrary finite alphabet. Mutations and recombinations that reduce the distance to an optimum are considered as beneficial. Geometric and combinatorial analysis is used to derive closed-form expressions for transition probabilities between spheres around an optimum giving a complete description of Markov evolution of distances from an optimum over multiple generations. This paves the way for optimization of parameters of mutation and recombination operators. Here we derive optimality conditions for mutation and recombination radii maximizing the probabilities of mutation and crossover into the optimum. The analysis highlights important differences between these evolutionary operators. While mutation can potentially reach any part of the search space, the probability of beneficial mutation decreases with distance to an optimum, and the optimal mutation radius or rate should also decrease resulting in a slow-down of evolution near the optimum. Crossover recombination, on the other hand, acts in a subspace of the search space defined by the current population of strings. However, probabilities of beneficial and deleterious crossover are balanced, and their characteristics, such as variance, are translation invariant in a Hamming space, suggesting that recombination may complement mutation and boost the rate of evolution near the optimum.
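The paper derives closed-form transition probabilities; the sketch below only checks the underlying idea by brute force on a tiny example. It assumes a simple mutation model in which the mutant is drawn uniformly from the Hamming sphere of a given radius around the current string. That model and all names are assumptions made for illustration, not the paper's definitions.

```python
from itertools import combinations, product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def sphere(center, radius, alphabet):
    """All strings at Hamming distance exactly `radius` from `center`."""
    points = []
    for positions in combinations(range(len(center)), radius):
        choices = [[s for s in alphabet if s != center[i]] for i in positions]
        for letters in product(*choices):
            mutant = list(center)
            for i, s in zip(positions, letters):
                mutant[i] = s
            points.append(tuple(mutant))
    return points

def mutation_probabilities(current, optimum, radius, alphabet):
    """Return (P(beneficial), P(hit optimum)) for a uniform draw from the sphere."""
    mutants = sphere(current, radius, alphabet)
    n = hamming(current, optimum)
    beneficial = sum(hamming(m, optimum) < n for m in mutants)
    hits = sum(m == tuple(optimum) for m in mutants)
    return beneficial / len(mutants), hits / len(mutants)

if __name__ == "__main__":
    optimum, current = (0, 0, 0, 0), (1, 1, 0, 0)  # current is at distance 2
    for r in (1, 2, 3):
        print(r, mutation_probabilities(current, optimum, r, (0, 1)))
```

In this toy case only radius 2, equal to the current distance, can reach the optimum in one step, which is consistent with the abstract's point that the best mutation radius shrinks as the search approaches the optimum.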
[AI-125] A Survey of Physics-Informed AI for Complex Urban Systems
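[Quick Read]: This paper addresses shortcomings in predictive accuracy, interpretability, and decision support in urban system modeling. The key to the solution is to combine physics-based models with artificial intelligence (AI) into physics-informed AI methods, which exploit AI's strength in capturing complex nonlinear relationships while keeping models consistent with real-world physical laws, improving the reliability, efficiency, and adaptability of urban systems.
Link: https://arxiv.org/abs/2506.13777
Authors: En Xu,Huandong Wang,Yunke Zhang,Sibo Li,Yinzhou Tang,Zhilun Zhou,Yuming Lin,Yuan Yuan,Xiaochen Fan,Jingtao Ding,Yong Li
Affiliation: Unknown
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: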
Abstract:Urban systems are typical examples of complex systems, where the integration of physics-based modeling with artificial intelligence (AI) presents a promising paradigm for enhancing predictive accuracy, interpretability, and decision-making. In this context, AI excels at capturing complex, nonlinear relationships, while physics-based models ensure consistency with real-world laws and provide interpretable insights. We provide a comprehensive review of physics-informed AI methods in urban applications. The proposed taxonomy categorizes existing approaches into three paradigms - Physics-Integrated AI, Physics-AI Hybrid Ensemble, and AI-Integrated Physics - and further details seven representative methods. This classification clarifies the varying degrees and directions of physics-AI integration, guiding the selection and development of appropriate methods based on application needs and data availability. We systematically examine their applications across eight key urban domains: energy, environment, economy, transportation, information, public services, emergency management, and the urban system as a whole. Our analysis highlights how these methodologies leverage physical laws and data-driven models to address urban challenges, enhancing system reliability, efficiency, and adaptability. By synthesizing existing methodologies and their urban applications, we identify critical gaps and outline future research directions, paving the way toward next-generation intelligent urban system modeling.
Machine Learning
[LG-0] On the Hardness of Bandit Learning
Link: https://arxiv.org/abs/2506.14746
Authors: Nataly Brukhim,Aldo Pacchiano,Miroslav Dudik,Robert Schapire
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 13 main pages
Abstract:We study the task of bandit learning, also known as best-arm identification, under the assumption that the true reward function f belongs to a known, but arbitrary, function class F. We seek a general theory of bandit learnability, akin to the PAC framework for classification. Our investigation is guided by the following two questions: (1) which classes F are learnable, and (2) how they are learnable. For example, in the case of binary PAC classification, learnability is fully determined by a combinatorial dimension - the VC dimension- and can be attained via a simple algorithmic principle, namely, empirical risk minimization (ERM). In contrast to classical learning-theoretic results, our findings reveal limitations of learning in structured bandits, offering insights into the boundaries of bandit learnability. First, for the question of “which”, we show that the paradigm of identifying the learnable classes via a dimension-like quantity fails for bandit learning. We give a simple proof demonstrating that no combinatorial dimension can characterize bandit learnability, even in finite classes, following a standard definition of dimension introduced by Ben-David et al. (2019). For the question of “how”, we prove a computational hardness result: we construct a reward function class for which at most two queries are needed to find the optimal action, yet no algorithm can do so in polynomial time unless RP=NP. We also prove that this class admits efficient algorithms for standard algorithmic operations often considered in learning theory, such as an ERM. This implies that computational hardness is in this case inherent to the task of bandit learning. Beyond these results, we investigate additional themes such as learning under noise, trade-offs between noise models, and the relationship between query complexity and regret minimization.
[LG-1] Feasibility-Driven Trust Region Bayesian Optimization
Link: https://arxiv.org/abs/2506.14619
Authors: Paolo Ascia,Elena Raponi,Thomas Bäck,Fabian Duddeck
Categories: Machine Learning (cs.LG)
Comments: Accepted for publication at AutoML2025
Abstract:Bayesian optimization is a powerful tool for solving real-world optimization tasks under tight evaluation budgets, making it well-suited for applications involving costly simulations or experiments. However, many of these tasks are also characterized by the presence of expensive constraints whose analytical formulation is unknown and often defined in high-dimensional spaces where feasible regions are small, irregular, and difficult to identify. In such cases, a substantial portion of the optimization budget may be spent just trying to locate the first feasible solution, limiting the effectiveness of existing methods. In this work, we present a Feasibility-Driven Trust Region Bayesian Optimization (FuRBO) algorithm. FuRBO iteratively defines a trust region from which the next candidate solution is selected, using information from both the objective and constraint surrogate models. Our adaptive strategy allows the trust region to shift and resize significantly between iterations, enabling the optimizer to rapidly refocus its search and consistently accelerate the discovery of feasible and good-quality solutions. We empirically demonstrate the effectiveness of FuRBO through extensive testing on the full BBOB-constrained COCO benchmark suite and other physics-inspired benchmarks, comparing it against state-of-the-art baselines for constrained black-box optimization across varying levels of constraint severity and problem dimensionalities ranging from 2 to 60.
[LG-2] Expressive Score-Based Priors for Distribution Matching with Geometry-Preserving Regularization ICML2025
Link: https://arxiv.org/abs/2506.14607
Authors: Ziyu Gong,Jim Lim,David I. Inouye
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: 32 pages, 20 figures. Accepted to ICML 2025
Abstract:Distribution matching (DM) is a versatile domain-invariant representation learning technique that has been applied to tasks such as fair classification, domain adaptation, and domain translation. Non-parametric DM methods struggle with scalability and adversarial DM approaches suffer from instability and mode collapse. While likelihood-based methods are a promising alternative, they often impose unnecessary biases through fixed priors or require explicit density models (e.g., flows) that can be challenging to train. We address this limitation by introducing a novel approach to training likelihood-based DM using expressive score-based prior distributions. Our key insight is that gradient-based DM training only requires the prior’s score function – not its density – allowing us to train the prior via denoising score matching. This approach eliminates biases from fixed priors (e.g., in VAEs), enabling more effective use of geometry-preserving regularization, while avoiding the challenge of learning an explicit prior density model (e.g., a flow-based prior). Our method also demonstrates better stability and computational efficiency compared to other diffusion-based priors (e.g., LSGM). Furthermore, experiments demonstrate superior performance across multiple tasks, establishing our score-based method as a stable and effective approach to distribution matching. Source code available at this https URL.
[LG-3] Deep Learning Surrogates for Real-Time Gas Emission Inversion
链接: https://arxiv.org/abs/2506.14597
作者: Thomas Newman,Christopher Nemeth,Matthew Jones,Philip Jonathan
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 3 figures, 11 pages
Abstract:Real-time identification and quantification of greenhouse-gas emissions under transient atmospheric conditions is a critical challenge in environmental monitoring. We introduce a spatio-temporal inversion framework that embeds a deep-learning surrogate of computational fluid dynamics (CFD) within a sequential Monte Carlo algorithm to perform Bayesian inference of both emission rate and source location in dynamic flow fields. By substituting costly numerical solvers with a multilayer perceptron trained on high-fidelity CFD outputs, our surrogate captures spatial heterogeneity and temporal evolution of gas dispersion, while delivering near-real-time predictions. Validation on the Chilbolton methane release dataset demonstrates comparable accuracy to full CFD solvers and Gaussian plume models, yet achieves orders-of-magnitude faster runtimes. Further experiments under simulated obstructed-flow scenarios confirm robustness in complex environments. This work reconciles physical fidelity with computational feasibility, offering a scalable solution for industrial emissions monitoring and other time-sensitive spatio-temporal inversion tasks in environmental and scientific modeling.
[LG-4] SCISSOR: Mitigating Semantic Bias through Cluster-Aware Siamese Networks for Robust Classification
链接: https://arxiv.org/abs/2506.14587
作者: Shuo Yang,Bardh Prenkaj,Gjergji Kasneci
类目: Machine Learning (cs.LG)
*备注: 20 pages
Abstract:Shortcut learning undermines model generalization to out-of-distribution data. While the literature attributes shortcuts to biases in superficial features, we show that imbalances in the semantic distribution of sample embeddings induce spurious semantic correlations, compromising model robustness. To address this issue, we propose SCISSOR (Semantic Cluster Intervention for Suppressing ShORtcut), a Siamese network-based debiasing approach that remaps the semantic space by discouraging latent clusters exploited as shortcuts. Unlike prior data-debiasing approaches, SCISSOR eliminates the need for data augmentation and rewriting. We evaluate SCISSOR on 6 models across 4 benchmarks: Chest-XRay and Not-MNIST in computer vision, and GYAFC and Yelp in NLP tasks. Compared to several baselines, SCISSOR reports +5.3 absolute points in F1 score on GYAFC, +7.3 on Yelp, +7.7 on Chest-XRay, and +1 on Not-MNIST. SCISSOR is also highly advantageous for lightweight models with ~9.5% improvement on F1 for ViT on computer vision datasets and ~11.9% for BERT on NLP. Our study redefines the landscape of model generalization by addressing overlooked semantic biases, establishing SCISSOR as a foundational framework for mitigating shortcut learning and fostering more robust, bias-resistant AI systems.
[LG-5] Single-Example Learning in a Mixture of GPDMs with Latent Geometries
链接: https://arxiv.org/abs/2506.14563
作者: Jesse St. Amand,Leonardo Gizzi,Martin A. Giese
类目: Machine Learning (cs.LG)
*备注: 13 pages, 2 figures, 3 tables
Abstract:We present the Gaussian process dynamical mixture model (GPDMM) and show its utility in single-example learning of human motion data. The Gaussian process dynamical model (GPDM) is a form of the Gaussian process latent variable model (GPLVM), but optimized with a hidden Markov model dynamical prior. The GPDMM combines multiple GPDMs in a probabilistic mixture-of-experts framework, utilizing embedded geometric features to allow for diverse sequences to be encoded in a single latent space, enabling the categorization and generation of each sequence class. GPDMs and our mixture model are particularly advantageous in addressing the challenges of modeling human movement in scenarios where data is limited and model interpretability is vital, such as in patient-specific medical applications like prosthesis control. We score the GPDMM on classification accuracy and generative ability in single-example learning, showcase model variations, and benchmark it against LSTMs, VAEs, and transformers.
[LG-6] Automated Decision-Making on Networks with LLMs through Knowledge-Guided Evolution
链接: https://arxiv.org/abs/2506.14529
作者: Xiaohan Zheng,Lanning Wei,Yong Li,Quanming Yao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Effective decision-making on networks often relies on learning from graph-structured data, where Graph Neural Networks (GNNs) play a central role, but GNNs take considerable effort to configure and tune. In this demo, we propose LLMNet, showing how to automate GNN design through Large Language Models. Our system develops a set of agents that construct graph-related knowledge bases and then leverages Retrieval-Augmented Generation (RAG) to support automated configuration and refinement of GNN models through a knowledge-guided evolution process. These agents, equipped with specialized knowledge bases, extract insights into tasks and graph structures by interacting with the knowledge bases. Empirical results show that LLMNet excels on twelve datasets across three graph learning tasks, validating its effectiveness in designing GNN models.
[LG-7] Towards Improved Research Methodologies for Industrial AI: A case study of false call reduction
链接: https://arxiv.org/abs/2506.14521
作者: Korbinian Pfab,Marcel Rothering
类目: Machine Learning (cs.LG)
*备注: Submitted and accepted to IEEE COMPSAC 2025
Abstract:Are current artificial intelligence (AI) research methodologies ready to create successful, productive, and profitable AI applications? This work presents a case study on an industrial AI use case called false call reduction for automated optical inspection to demonstrate the shortcomings of current best practices. We identify seven weaknesses prevalent in related peer-reviewed work and experimentally show their consequences. We show that the best-practice methodology would fail for this use case. Among other things, we argue for the necessity of requirement-aware metrics to ensure that business objectives are achieved, clear definitions of success criteria, and a thorough analysis of temporal dynamics in experimental datasets. Our work encourages researchers to critically assess their methodologies for more successful applied AI research.
[LG-8] Two-Player Zero-Sum Games with Bandit Feedback
链接: https://arxiv.org/abs/2506.14518
作者: Elif Yılmaz,Christos Dimitrakakis
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:
Abstract:We study a two-player zero-sum game (TPZSG) in which the row player aims to maximize their payoff against an adversarial column player, under an unknown payoff matrix estimated through bandit feedback. We propose and analyze two algorithms: ETC-TPZSG, which directly applies ETC to the TPZSG setting, and ETC-TPZSG-AE, which improves upon it by incorporating an action pair elimination (AE) strategy that leverages the \varepsilon-Nash Equilibrium property to efficiently select the optimal action pair. Our objective is to demonstrate the applicability of ETC in a TPZSG setting by focusing on learning a pure strategy Nash Equilibrium. A key contribution of our work is a derivation of instance-dependent upper bounds on the expected regret for both algorithms, a topic that has received limited attention in the literature on zero-sum games. In particular, after T rounds, we achieve instance-dependent regret upper bounds of O(\Delta + \sqrt{T}) for ETC-TPZSG and O(\frac{\log(T \Delta^2)}{\Delta}) for ETC-TPZSG-AE, where \Delta denotes the suboptimality gap. Therefore, our results indicate that ETC-based algorithms perform effectively in adversarial game settings, achieving regret bounds comparable to existing methods while providing insights through instance-dependent analysis.
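The explore-then-commit idea in a matrix game with bandit feedback can be sketched as follows. This is a hedged illustration, not the paper's ETC-TPZSG: it assumes ETC stands for the standard explore-then-commit scheme, uses a hypothetical uniform exploration schedule and Gaussian payoff noise, and omits the action pair elimination variant entirely.

```python
import random

def etc_pure_nash(payoff_means, rounds_per_pair, noise_std, rng):
    """Explore every action pair equally, then commit to a pure maximin pair.

    payoff_means holds the true (unknown to the learner) expected payoffs of
    the row player; the learner only sees noisy samples of the chosen entry.
    """
    n_rows = len(payoff_means)
    n_cols = len(payoff_means[0])
    est = [[0.0] * n_cols for _ in range(n_rows)]

    # Exploration phase: sample each (row, column) pair the same number of times.
    for i in range(n_rows):
        for j in range(n_cols):
            total = 0.0
            for _ in range(rounds_per_pair):
                total += payoff_means[i][j] + rng.gauss(0.0, noise_std)
            est[i][j] = total / rounds_per_pair

    # Commit phase: the row player maximizes its worst case, the column player
    # minimizes the row player's best case, both on the estimated matrix.
    row = max(range(n_rows), key=lambda i: min(est[i]))
    col = min(range(n_cols), key=lambda j: max(est[i][j] for i in range(n_rows)))
    return row, col, est

if __name__ == "__main__":
    game = [[0.5, 0.6, 0.8],
            [0.1, 0.2, 0.9],
            [0.4, 0.3, 0.7]]
    row, col, _ = etc_pure_nash(game, 200, 0.1, random.Random(0))
    print(row, col)
```

In the toy matrix, entry (0, 0) is the smallest value in its row and the largest in its column, so it is a pure strategy Nash equilibrium, and the committed pair should coincide with it once the estimates are accurate enough.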
[LG-9] Zeroth-Order Optimization is Secretly Single-Step Policy Optimization
链接: https://arxiv.org/abs/2506.14460
作者: Junbin Qiu,Zhengpeng Xie,Xiangda Yan,Yongjie Yang,Yao Shu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Zeroth-Order Optimization (ZOO) provides powerful tools for optimizing functions where explicit gradients are unavailable or expensive to compute. However, the underlying mechanisms of popular ZOO methods, particularly those employing randomized finite differences, and their connection to other optimization paradigms like Reinforcement Learning (RL) are not fully elucidated. This paper establishes a fundamental and previously unrecognized connection: ZOO with finite differences is equivalent to a specific instance of single-step Policy Optimization (PO). We formally unveil that the implicitly smoothed objective function optimized by common ZOO algorithms is identical to a single-step PO objective. Furthermore, we show that widely used ZOO gradient estimators are mathematically equivalent to the REINFORCE gradient estimator with a specific baseline function, revealing the variance-reducing mechanism in ZOO from a PO perspective. Based on this unified framework, we propose ZoAR (Zeroth-Order Optimization with Averaged Baseline and Query Reuse), a novel ZOO algorithm incorporating PO-inspired variance reduction techniques: an averaged baseline from recent evaluations and query reuse analogous to experience replay. Our theoretical analysis further substantiates that these techniques reduce variance and enhance convergence. Extensive empirical studies validate our theory and demonstrate that ZoAR significantly outperforms other methods in terms of convergence speed and final performance. Overall, our work provides a new theoretical lens for understanding ZOO and offers practical algorithmic improvements derived from its connection to PO.
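The claimed equivalence can be checked numerically for one common estimator. The sketch below assumes the one-sided Gaussian finite-difference estimator and a Gaussian policy N(x, mu^2 I) with baseline f(x); these specific choices and the function names are illustrative assumptions, and the paper may cover other estimators and baselines.

```python
import random

def zo_estimate(f, x, mu, u):
    """One-sided randomized finite-difference gradient estimate along u."""
    shifted = [xi + mu * ui for xi, ui in zip(x, u)]
    scale = (f(shifted) - f(x)) / mu
    return [scale * ui for ui in u]

def reinforce_estimate(f, x, mu, u, baseline):
    """REINFORCE estimate for a single-step Gaussian policy N(x, mu^2 I).

    The 'action' is y = x + mu * u, the 'reward' is f(y), and the score of the
    policy with respect to its mean is (y - x) / mu^2.
    """
    y = [xi + mu * ui for xi, ui in zip(x, u)]
    score = [(yi - xi) / mu ** 2 for yi, xi in zip(y, x)]
    advantage = f(y) - baseline
    return [advantage * s for s in score]

if __name__ == "__main__":
    rng = random.Random(0)
    f = lambda v: sum(vi * vi for vi in v)
    x = [1.0, -2.0, 0.5]
    u = [rng.gauss(0.0, 1.0) for _ in x]
    print(zo_estimate(f, x, 0.1, u))
    print(reinforce_estimate(f, x, 0.1, u, baseline=f(x)))
```

With the baseline set to f(x), the two functions return the same vector up to floating-point rounding for any sampled direction `u`, because (y - x) / mu^2 equals u / mu.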
[LG-10] A Model-Mediated Stacked Ensemble Approach for Depression Prediction Among Professionals
链接: https://arxiv.org/abs/2506.14459
作者: Md. Mortuza Ahmmed,Abdullah Al Noman,Mahin Montasir Afif,K. M. Tahsin Kabir,Md. Mostafizur Rahman,Mufti Mahmud
类目: Machine Learning (cs.LG)
*备注:
Abstract:Depression is a significant mental health concern, particularly in professional environments where work-related stress, financial pressure, and lifestyle imbalances contribute to deteriorating well-being. Despite increasing awareness, researchers and practitioners face critical challenges in developing accurate and generalizable predictive models for mental health disorders. Traditional classification approaches often struggle with the complexity of depression, as it is influenced by multifaceted, interdependent factors, including occupational stress, sleep patterns, and job satisfaction. This study addresses these challenges by proposing a stacking-based ensemble learning approach to improve the predictive accuracy of depression classification among professionals. The Depression Professional Dataset has been collected from Kaggle. The dataset comprises demographic, occupational, and lifestyle attributes that influence mental well-being. Our stacking model integrates multiple base learners with a logistic regression-mediated model, effectively capturing diverse learning patterns. The experimental results demonstrate that the proposed model achieves high predictive performance, with an accuracy of 99.64% on training data and 98.75% on testing data, with precision, recall, and F1-score all exceeding 98%. These findings highlight the effectiveness of ensemble learning in mental health analytics and underscore its potential for early detection and intervention strategies.
[LG-11] Dataset distillation for memorized data: Soft labels can leak held-out teacher knowledge
链接: https://arxiv.org/abs/2506.14457
作者: Freya Behrens,Lenka Zdeborová
类目: Machine Learning (cs.LG)
*备注: 9 pages, 21 figures
Abstract:Dataset distillation aims to compress training data into fewer examples via a teacher, from which a student can learn effectively. While its success is often attributed to structure in the data, modern neural networks also memorize specific facts, but if and how such memorized information can be transferred in distillation settings remains less understood. In this work, we show that students trained on soft labels from teachers can achieve non-trivial accuracy on held-out memorized data they never directly observed. This effect persists on structured data when the teacher has not generalized. To analyze it in isolation, we consider finite random i.i.d. datasets where generalization is a priori impossible and a successful teacher fit implies pure memorization. Still, students can learn non-trivial information about the held-out data, in some cases up to perfect accuracy. In those settings, enough soft labels are available to recover the teacher functionally: the student matches the teacher’s predictions on all possible inputs, including the held-out memorized data. We show that these phenomena strongly depend on the temperature with which the logits are smoothed, but persist across varying network capacities, architectures and dataset compositions.
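The temperature dependence mentioned at the end of the abstract refers to how teacher logits are turned into soft labels. A minimal sketch, assuming the usual temperature-scaled softmax (the abstract does not spell out the exact smoothing, and `soft_labels` is a hypothetical helper name):

```python
import math

def soft_labels(logits, temperature):
    """Temperature-scaled softmax: higher temperatures give flatter labels."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    teacher_logits = [2.0, 1.0, 0.0]
    print(soft_labels(teacher_logits, 0.5))
    print(soft_labels(teacher_logits, 5.0))
```

A low temperature pushes the soft label towards a one-hot vector, which carries little beyond the predicted class, while a high temperature exposes more of the relative ordering among the teacher's non-maximal logits.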
[LG-12] Detecting immune cells with label-free two-photon autofluorescence and deep learning
链接: https://arxiv.org/abs/2506.14449
作者: Lucas Kreiss,Amey Chaware,Maryam Roohian,Sarah Lemire,Oana-Maria Thoma,Birgitta Carlé,Maximilian Waldner,Sebastian Schürmann,Oliver Friedrich,Roarke Horstmeyer
类目: Machine Learning (cs.LG); Optics (physics.optics)
*备注:
Abstract:Label-free imaging has gained broad interest because of its potential to omit elaborate staining procedures, which is especially relevant for in vivo use. Label-free multiphoton microscopy (MPM), for instance, exploits two-photon excitation of natural autofluorescence (AF) from native, metabolic proteins, making it ideal for in vivo endomicroscopy. Deep learning (DL) models have been widely used in other optical imaging technologies to predict specific target annotations and thereby digitally augment the specificity of these label-free images. However, this computational specificity has only rarely been implemented for MPM. In this work, we used a data set of label-free MPM images from a series of different immune cell types (5,075 individual cells for binary classification in mixed samples and 3,424 cells for a multi-class classification task) and trained a convolutional neural network (CNN) to classify cell types based on this label-free AF as input. A low-complexity SqueezeNet architecture was able to achieve reliable immune cell classification results (0.89 ROC-AUC and 0.95 PR-AUC for binary classification in mixed samples; 0.689 F1 score, 0.697 precision, 0.748 recall, and 0.683 MCC for six-class classification in isolated samples). Perturbation tests confirmed that the model is not confused by the extracellular environment and that both input AF channels (NADH and FAD) are about equally important to the classification. In the future, such predictive DL models could directly detect specific immune cells in unstained images and thus computationally improve the specificity of label-free MPM, which would have great potential for in vivo endomicroscopy.
[LG-13] A General Framework for Off-Policy Learning with Partially-Observed Reward ICLR2025
链接: https://arxiv.org/abs/2506.14439
作者: Rikiya Takehi,Masahiro Asami,Kosuke Kawakami,Yuta Saito
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures. Published as a conference paper at ICLR 2025
Abstract:Off-policy learning (OPL) in contextual bandits aims to learn a decision-making policy that maximizes the target rewards by using only historical interaction data collected under previously developed policies. Unfortunately, when rewards are only partially observed, the effectiveness of OPL degrades severely. Well-known examples of such partial rewards include explicit ratings in content recommendations, conversion signals on e-commerce platforms that are partial due to delay, and the issue of censoring in medical problems. One possible solution to deal with such partial rewards is to use secondary rewards, such as dwelling time, clicks, and medical indicators, which are more densely observed. However, relying solely on such secondary rewards can also lead to poor policy learning since they may not align with the target reward. Thus, this work studies a new and general problem of OPL where the goal is to learn a policy that maximizes the expected target reward by leveraging densely observed secondary rewards as supplemental data. We then propose a new method called Hybrid Policy Optimization for Partially-Observed Reward (HyPeR), which effectively uses the secondary rewards in addition to the partially-observed target reward to achieve effective OPL despite the challenging scenario. We also discuss a case where we aim to optimize not only the expected target reward but also the expected secondary rewards to some extent; counter-intuitively, we will show that leveraging the two objectives is in fact advantageous also for the optimization of only the target reward. Along with statistical analysis of our proposed methods, empirical evaluations on both synthetic and real-world data show that HyPeR outperforms existing methods in various scenarios.
[LG-14] MoORE: SVD-based Model MoE-ization for Conflict- and Oblivion-Resistant Multi-Task Adaptation
链接: https://arxiv.org/abs/2506.14436
作者: Shen Yuan,Yin Zheng,Taifeng Wang,Binbin Liu,Hongteng Xu
类目: Machine Learning (cs.LG)
*备注: 23 pages, 6 figures
Abstract:Adapting large-scale foundation models in multi-task scenarios often suffers from task conflict and oblivion. To mitigate such issues, we propose a novel “model MoE-ization” strategy that leads to a conflict- and oblivion-resistant multi-task adaptation method. Given a weight matrix of a pre-trained model, our method applies SVD to it and introduces a learnable router to adjust its singular values based on tasks and samples. Accordingly, the weight matrix becomes a Mixture of Orthogonal Rank-one Experts (MoORE), in which each expert corresponds to the outer product of a left singular vector and the corresponding right one. We can improve the model capacity by imposing a learnable orthogonal transform on the right singular vectors. Unlike low-rank adaptation (LoRA) and its MoE-driven variants, MoORE guarantees the experts’ orthogonality and maintains the column space of the original weight matrix. These two properties make the adapted model resistant to the conflicts among the new tasks and the oblivion of its original tasks, respectively. Experiments on various datasets demonstrate that MoORE outperforms existing multi-task adaptation methods consistently, showing its superiority in terms of conflict- and oblivion-resistance. The code of the experiments is available at this https URL.
[LG-15] Unsupervised Skill Discovery through Skill Regions Differentiation
链接: https://arxiv.org/abs/2506.14420
作者: Ting Xiao,Jiakun Zheng,Rushuai Yang,Kang Xu,Qiaosheng Zhang,Peng Liu,Chenjia Bai
类目: Machine Learning (cs.LG)
*备注:
Abstract:Unsupervised Reinforcement Learning (RL) aims to discover diverse behaviors that can accelerate the learning of downstream tasks. Previous methods typically focus on entropy-based exploration or empowerment-driven skill learning. However, entropy-based exploration struggles in large-scale state spaces (e.g., images), and empowerment-based methods with Mutual Information (MI) estimations have limitations in state exploration. To address these challenges, we propose a novel skill discovery objective that maximizes the deviation of the state density of one skill from the explored regions of other skills, encouraging inter-skill state diversity similar to the initial MI objective. For state-density estimation, we construct a novel conditional autoencoder with soft modularization for different skill policies in high-dimensional space. Meanwhile, to incentivize intra-skill exploration, we formulate an intrinsic reward based on the learned autoencoder that resembles count-based exploration in a compact latent space. Through extensive experiments in challenging state and image-based tasks, we find our method learns meaningful skills and achieves superior performance in various downstream tasks.
[LG-16] One Size Fits None: Rethinking Fairness in Medical AI ACL2025
链接: https://arxiv.org/abs/2506.14400
作者: Roland Roller,Michael Hahn,Ajay Madhavan Ravichandran,Bilgin Osmanodja,Florian Oetke,Zeineb Sassi,Aljoscha Burchardt,Klaus Netter,Klemens Budde,Anne Herrmann,Tobias Strapatsas,Peter Dabrock,Sebastian Möller
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: Accepted at the 6th Workshop on Gender Bias in Natural Language Processing at ACL 2025
Abstract:Machine learning (ML) models are increasingly used to support clinical decision-making. However, real-world medical datasets are often noisy, incomplete, and imbalanced, leading to performance disparities across patient subgroups. These differences raise fairness concerns, particularly when they reinforce existing disadvantages for marginalized groups. In this work, we analyze several medical prediction tasks and demonstrate how model performance varies with patient characteristics. While ML models may demonstrate good overall performance, we argue that subgroup-level evaluation is essential before integrating them into clinical workflows. By conducting a performance analysis at the subgroup level, differences can be clearly identified, allowing, on the one hand, for performance disparities to be considered in clinical practice, and on the other hand, for these insights to inform the responsible development of more effective models. Thereby, our work contributes to a practical discussion around the subgroup-sensitive development and deployment of medical ML models and the interconnectedness of fairness and transparency.
[LG-17] Excessive Reasoning Attack on Reasoning LLMs
链接: https://arxiv.org/abs/2506.14374
作者: Wai Man Si,Mingjie Li,Michael Backes,Yang Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Recent reasoning large language models (LLMs), such as OpenAI o1 and DeepSeek-R1, exhibit strong performance on complex tasks through test-time inference scaling. However, prior studies have shown that these models often incur significant computational costs due to excessive reasoning, such as frequent switching between reasoning trajectories (e.g., underthinking) or redundant reasoning on simple questions (e.g., overthinking). In this work, we expose a novel threat: adversarial inputs can be crafted to exploit excessive reasoning behaviors and substantially increase computational overhead without compromising model utility. Therefore, we propose a novel loss framework consisting of three components: (1) Priority Cross-Entropy Loss, a modification of the standard cross-entropy objective that emphasizes key tokens by leveraging the autoregressive nature of LMs; (2) Excessive Reasoning Loss, which encourages the model to initiate additional reasoning paths during inference; and (3) Delayed Termination Loss, which is designed to extend the reasoning process and defer the generation of final outputs. We optimize and evaluate our attack for the GSM8K and ORCA datasets on DeepSeek-R1-Distill-LLaMA and DeepSeek-R1-Distill-Qwen. Empirical results demonstrate a 3x to 9x increase in reasoning length with comparable utility performance. Furthermore, our crafted adversarial inputs exhibit transferability, inducing computational overhead in o3-mini, o1-mini, DeepSeek-R1, and QWQ models.
[LG-18] Fair for a few: Improving Fairness in Doubly Imbalanced Datasets
链接: https://arxiv.org/abs/2506.14306
作者: Ata Yalcin,Asli Umay Ozturk,Yigit Sever,Viktoria Pauw,Stephan Hachinger,Ismail Hakki Toroslu,Pinar Karagoz
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 33 pages, 3 figures, submitted to AI Review
Abstract:Fairness has been identified as an important aspect of Machine Learning and Artificial Intelligence solutions for decision making. Recent literature offers a variety of approaches for debiasing; however, many of them fall short when the data collection is imbalanced. In this paper, we focus on a particular case, fairness in doubly imbalanced datasets, where the data collection is imbalanced both for the label and for the groups in the sensitive attribute. First, we present an exploratory analysis to illustrate limitations in debiasing on a doubly imbalanced dataset. Then, a multi-criteria based solution is proposed for finding the most suitable sampling and distribution for label and sensitive attribute, in terms of fairness and classification accuracy.
[LG-19] SLEEPING-DISCO 9M: A large-scale pre-training dataset for generative music modeling
链接: https://arxiv.org/abs/2506.14293
作者: Tawsif Ahmed,Andrej Radonjic,Gollam Rabby
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
Abstract:We present Sleeping-DISCO 9M, a large-scale pre-training dataset for music and song. To the best of our knowledge, there is no open-source, high-quality dataset representing popular and well-known songs for generative music modeling tasks such as text-music, music-captioning, singing-voice synthesis, melody reconstruction and cross-model retrieval. Past contributions focused on isolated and constrained factors, with the core aim of creating synthetic or re-recorded music corpora (e.g. GTSinger, M4Singer); arbitrarily large-scale audio datasets (e.g. DISCO-10M and LAIONDISCO-12M) have been another focus for the community. Unfortunately, adoption of these datasets in the generative music community has been limited, as they fail to reflect real-world music and its flavour. Our dataset changes this narrative: it is constructed using actual popular music and world-renowned artists.
[LG-20] Equivariance Everywhere All At Once: A Recipe for Graph Foundation Models
链接: https://arxiv.org/abs/2506.14291
作者: Ben Finkelshtein,İsmail İlkan Ceylan,Michael Bronstein,Ron Levie
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
*备注:
Abstract:Graph machine learning architectures are typically tailored to specific tasks on specific datasets, which hinders their broader applicability. This has led to a new quest in graph machine learning: how to build graph foundation models capable of generalizing across arbitrary graphs and features? In this work, we present a recipe for designing graph foundation models for node-level tasks from first principles. The key ingredient underpinning our study is a systematic investigation of the symmetries that a graph foundation model must respect. In a nutshell, we argue that label permutation-equivariance alongside feature permutation-invariance are necessary in addition to the common node permutation-equivariance on each local neighborhood of the graph. To this end, we first characterize the space of linear transformations that are equivariant to permutations of nodes and labels, and invariant to permutations of features. We then prove that the resulting network is a universal approximator on multisets that respect the aforementioned symmetries. Our recipe uses such layers on the multiset of features induced by the local neighborhood of the graph to obtain a class of graph foundation models for node property prediction. We validate our approach through extensive experiments on 29 real-world node classification datasets, demonstrating both strong zero-shot empirical performance and consistent improvement as the number of training graphs increases.
[LG-21] Towards Robust Learning to Optimize with Theoretical Guarantees CVPR2024
链接: https://arxiv.org/abs/2506.14263
作者: Qingyu Song,Wei Lin,Juncheng Wang,Hong Xu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Published in CVPR 2024, 55 pages, 17 figures; this version fixes some typos
Abstract:Learning to optimize (L2O) is an emerging technique to solve mathematical optimization problems with learning-based methods. Despite great success in many real-world scenarios such as wireless communications, computer networks, and electronic design, existing L2O works lack theoretical demonstration of their performance and robustness in out-of-distribution (OOD) scenarios. We address this gap by providing comprehensive proofs. First, we prove a sufficient condition for a robust L2O model with homogeneous convergence rates over all In-Distribution (InD) instances. We assume an L2O model achieves robustness for an InD scenario. Based on our proposed methodology of aligning OOD problems to InD problems, we also demonstrate that the L2O model’s convergence rate in OOD scenarios will deteriorate by an equation of the L2O model’s input features. Moreover, we propose an L2O model with a concise gradient-only feature construction and a novel gradient-based history modeling method. Numerical simulation demonstrates that our proposed model outperforms the state-of-the-art baseline in both InD and OOD scenarios and achieves up to 10\times convergence speedup. The code of our method can be found at this https URL.
[LG-22] RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
链接: https://arxiv.org/abs/2506.14261
作者: Rohan Gupta,Erik Jenner
类目: Machine Learning (cs.LG)
*备注:
Abstract:Latent-space monitors aim to detect undesirable behaviours in large language models by leveraging internal model representations rather than relying solely on black-box outputs. These methods have shown promise in identifying behaviours such as deception and unsafe completions, but a critical open question remains: can LLMs learn to evade such monitors? To study this, we introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to bypass latent-space monitors while maintaining coherent generations. We apply RL-Obfuscation to LLMs ranging from 7B to 14B parameters and evaluate evasion success against a suite of monitors. We find that token-level latent-space monitors are highly vulnerable to this attack. More holistic monitors, such as max-pooling or attention-based probes, remain robust. Moreover, we show that adversarial policies trained to evade a single static monitor generalise to unseen monitors of the same type. Finally, we study how the policy learned by RL bypasses these monitors and find that the model can also learn to repurpose tokens to mean something different internally.
[LG-23] Convergence-Privacy-Fairness Trade-Off in Personalized Federated Learning
链接: https://arxiv.org/abs/2506.14251
作者: Xiyu Zhao,Qimei Cui,Weicai Li,Wei Ni,Ekram Hossain,Quan Z. Sheng,Xiaofeng Tao,Ping Zhang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Personalized federated learning (PFL), e.g., the renowned Ditto, strikes a balance between personalization and generalization by conducting federated learning (FL) to guide personalized learning (PL). While FL is unaffected by personalized model training, in Ditto, PL depends on the outcome of the FL. However, the clients’ concern about their privacy and consequent perturbation of their local models can affect the convergence and (performance) fairness of PL. This paper presents a PFL method called DP-Ditto, a non-trivial extension of Ditto under the protection of differential privacy (DP), and analyzes the trade-off among its privacy guarantee, model convergence, and performance distribution fairness. We also analyze the convergence upper bound of the personalized models under DP-Ditto and derive the optimal number of global aggregations given a privacy budget. Further, we analyze the performance fairness of the personalized models, and reveal the feasibility of optimizing DP-Ditto jointly for convergence and fairness. Experiments validate our analysis and demonstrate that DP-Ditto can surpass the DP-perturbed versions of the state-of-the-art PFL models, such as FedAMP, pFedMe, APPLE, and FedALA, by over 32.71% in fairness and 9.66% in accuracy.
[LG-24] Can Large Language Models Improve Spectral Graph Neural Networks?
Link: https://arxiv.org/abs/2506.14220
Authors: Kangkang Lu,Yanhua Yu,Zhiyong Huang,Tat-Seng Chua
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Spectral Graph Neural Networks (SGNNs) have attracted significant attention due to their ability to approximate arbitrary filters. They typically rely on supervision from downstream tasks to adaptively learn appropriate filters. However, under label-scarce conditions, SGNNs may learn suboptimal filters, leading to degraded performance. Meanwhile, the remarkable success of Large Language Models (LLMs) has inspired growing interest in exploring their potential within the GNN domain. This naturally raises an important question: Can LLMs help overcome the limitations of SGNNs and enhance their performance? In this paper, we propose a novel approach that leverages LLMs to estimate the homophily of a given graph. The estimated homophily is then used to adaptively guide the design of polynomial spectral filters, thereby improving the expressiveness and adaptability of SGNNs across diverse graph structures. Specifically, we introduce a lightweight pipeline in which the LLM generates homophily-aware priors, which are injected into the filter coefficients to better align with the underlying graph topology. Extensive experiments on benchmark datasets demonstrate that our LLM-driven SGNN framework consistently outperforms existing baselines under both homophilic and heterophilic settings, with minimal computational and monetary overhead.
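For readers unfamiliar with the quantity being estimated here: edge homophily is commonly defined as the fraction of edges whose two endpoints share a label. The sketch below computes that ratio from known labels. It is not the paper's LLM-based estimator; the function name and the toy graph are illustrative assumptions.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Hypothetical toy graph: 4 nodes, 4 undirected edges, 2 classes.
toy_edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
toy_labels = {0: "a", 1: "a", 2: "b", 3: "b"}
toy_h = edge_homophily(toy_edges, toy_labels)  # 2 of the 4 edges join same-label nodes
```

A value near 1 indicates a homophilic graph and a value near 0 a heterophilic one, which is the kind of signal the abstract describes feeding into the filter design.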
[LG-25] A Variational Information Theoretic Approach to Out-of-Distribution Detection
Link: https://arxiv.org/abs/2506.14194
Authors: Sudeepta Mondal,Zhuolin Jiang,Ganesh Sundaramoorthi
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We present a theory for the construction of out-of-distribution (OOD) detection features for neural networks. We introduce random features for OOD through a novel information-theoretic loss functional consisting of two terms: the first, based on the KL divergence, separates the resulting in-distribution (ID) and OOD feature distributions, and the second is the Information Bottleneck, which favors compressed features that retain the OOD information. We formulate a variational procedure to optimize the loss and obtain OOD features. Based on assumptions on OOD distributions, one can recover properties of existing OOD features, i.e., shaping functions. Furthermore, we show that our theory can predict a new shaping function that out-performs existing ones on OOD benchmarks. Our theory provides a general framework for constructing a variety of new features with clear explainability.
[LG-26] Hard Contacts with Soft Gradients: Refining Differentiable Simulators for Learning and Control
Link: https://arxiv.org/abs/2506.14186
Authors: Anselm Paulus,A. René Geist,Pierre Schumacher,Vít Musil,Georg Martius
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:Contact forces pose a major challenge for gradient-based optimization of robot dynamics as they introduce jumps in the system’s velocities. Penalty-based simulators, such as MuJoCo, simplify gradient computation by softening the contact forces. However, realistically simulating hard contacts requires very stiff contact settings, which leads to incorrect gradients when using automatic differentiation. On the other hand, using non-stiff settings strongly increases the sim-to-real gap. We analyze the contact computation of penalty-based simulators to identify the causes of gradient errors. Then, we propose DiffMJX, which combines adaptive integration with MuJoCo XLA, to notably improve gradient quality in the presence of hard contacts. Finally, we address a key limitation of contact gradients: they vanish when objects do not touch. To overcome this, we introduce Contacts From Distance (CFD), a mechanism that enables the simulator to generate informative contact gradients even before objects are in contact. To preserve physical realism, we apply CFD only in the backward pass using a straight-through trick, allowing us to compute useful gradients without modifying the forward simulation.
[LG-27] Structured and Informed Probabilistic Modeling with the Thermodynamic Kolmogorov-Arnold Model
Link: https://arxiv.org/abs/2506.14167
Authors: Prithvi Raj
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We adapt the Kolmogorov-Arnold Representation Theorem to generative modeling by reinterpreting its inner functions as a Markov Kernel between probability spaces via inverse transform sampling. We present a generative model that is interpretable, easy to design, and efficient. Our approach couples a Kolmogorov-Arnold Network generator with independent energy-based priors, trained via Maximum Likelihood. Inverse sampling enables fast inference, while prior knowledge can be incorporated before training to better align priors with posteriors, thereby improving learning efficiency and sample quality. The learned prior is also recoverable and visualizable post-training, offering an empirical Bayes perspective. To address inflexibility and mitigate prior-posterior mismatch, we introduce scalable extensions based on mixture distributions and Langevin Monte Carlo methods, admitting a trade-off between flexibility and training efficiency. Our contributions connect classical representation theorems with modern probabilistic modeling, while balancing training stability, inference speed, and the quality and diversity of generations.
[LG-28] Light Aircraft Game: Basic Implementation and training results analysis
Link: https://arxiv.org/abs/2506.14164
Authors: Hanzhong Cao
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:This paper investigates multi-agent reinforcement learning (MARL) in a partially observable, cooperative-competitive combat environment known as LAG. We describe the environment’s setup, including agent actions, hierarchical controls, and reward design across different combat modes such as No Weapon and ShootMissile. Two representative algorithms are evaluated: HAPPO, an on-policy hierarchical variant of PPO, and HASAC, an off-policy method based on soft actor-critic. We analyze their training stability, reward progression, and inter-agent coordination capabilities. Experimental results show that HASAC performs well in simpler coordination tasks without weapons, while HAPPO demonstrates stronger adaptability in more dynamic and expressive scenarios involving missile combat. These findings provide insights into the trade-offs between on-policy and off-policy methods in multi-agent settings.
[LG-29] Common Benchmarks Undervalue the Generalization Power of Programmatic Policies
Link: https://arxiv.org/abs/2506.14162
Authors: Amirhossein Rajabpour,Kiarash Aghakasiri,Sandra Zilles,Levi H. S. Lelis
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 5 figures
Abstract:Algorithms for learning programmatic representations for sequential decision-making problems are often evaluated on out-of-distribution (OOD) problems, with the common conclusion that programmatic policies generalize better than neural policies on OOD problems. In this position paper, we argue that commonly used benchmarks undervalue the generalization capabilities of programmatic representations. We analyze the experiments of four papers from the literature and show that neural policies, which were shown not to generalize, can generalize as effectively as programmatic policies on OOD problems. This is achieved with simple changes to the neural policies’ training pipeline. Namely, we show that simpler neural architectures with the same type of sparse observation used with programmatic policies can help attain OOD generalization. Another modification we have shown to be effective is the use of reward functions that allow for safer policies (e.g., agents that drive slowly can generalize better). Also, we argue for creating benchmark problems highlighting concepts needed for OOD generalization that may challenge neural policies but align with programmatic representations, such as tasks requiring algorithmic constructs like stacks.
[LG-30] Leveraging Predictive Equivalence in Decision Trees ICML2025
Link: https://arxiv.org/abs/2506.14143
Authors: Hayden McTavish,Zachery Boner,Jon Donnelly,Margo Seltzer,Cynthia Rudin
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICML 2025
Abstract:Decision trees are widely used for interpretable machine learning due to their clearly structured reasoning process. However, this structure belies a challenge we refer to as predictive equivalence: a given tree’s decision boundary can be represented by many different decision trees. The presence of models with identical decision boundaries but different evaluation processes makes model selection challenging. The models will have different variable importance and behave differently in the presence of missing values, but most optimization procedures will arbitrarily choose one such model to return. We present a boolean logical representation of decision trees that does not exhibit predictive equivalence and is faithful to the underlying decision boundary. We apply our representation to several downstream machine learning tasks. Using our representation, we show that decision trees are surprisingly robust to test-time missingness of feature values; we address predictive equivalence’s impact on quantifying variable importance; and we present an algorithm to optimize the cost of reaching predictions.
[LG-31] Evaluating Loss Functions for Graph Neural Networks: Towards Pretraining and Generalization
Link: https://arxiv.org/abs/2506.14114
Authors: Khushnood Abbas,Ruizhe Hou,Zhou Wengang,Dong Shi,Niu Ling,Satyaki Nan,Alireza Abbasi
Categories: Machine Learning (cs.LG)
Comments: ACM single column 633 pages
Abstract:Graph Neural Networks (GNNs) became useful for learning on non-Euclidean data. However, their best performance depends on choosing the right model architecture and the training objective, also called the loss function. Researchers have studied these parts separately, but a large-scale evaluation has not looked at how GNN models and many loss functions work together across different tasks. To fix this, we ran a thorough study - it included seven well-known GNN architectures. We also used a large group of 30 single plus mixed loss functions. The study looked at both inductive and transductive settings. Our evaluation spanned three distinct real-world datasets, assessing performance in both inductive and transductive settings using 21 comprehensive evaluation metrics. From these extensive results (detailed in supplementary information 1 & 2), we meticulously analyzed the top ten model-loss combinations for each metric based on their average rank. Our findings reveal that, especially for the inductive case: 1) Hybrid loss functions generally yield superior and more robust performance compared to single loss functions, indicating the benefit of multi-objective optimization. 2) The GIN architecture always showed the highest-level average performance, especially with Cross-Entropy loss. 3) Although some combinations had overall lower average ranks, models such as GAT, particularly with certain hybrid losses, demonstrated incredible specialized strengths, maximizing the most top-1 results among the individual metrics, emphasizing subtle strengths for particular task demands. 4) On the other hand, the MPNN architecture typically lagged behind the scenarios it was tested against.
[LG-32] Transformers Learn Faster with Semantic Focus
Link: https://arxiv.org/abs/2506.14095
Authors: Parikshit Ram,Kenneth L. Clarkson,Tim Klinger,Shashanka Ubaru,Alexander G. Gray
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Various forms of sparse attention have been explored to mitigate the quadratic computational and memory cost of the attention mechanism in transformers. We study sparse transformers not through a lens of efficiency but rather in terms of learnability and generalization. Empirically studying a range of attention mechanisms, we find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models, while input-agnostic sparse attention models show no such benefits – a phenomenon that is robust across architectural and optimization hyperparameter choices. This can be interpreted as demonstrating that concentrating a model’s “semantic focus” with respect to the tokens currently being considered (in the form of input-dependent sparse attention) accelerates learning. We develop a theoretical characterization of the conditions that explain this behavior. We establish a connection between the stability of the standard softmax and the loss function’s Lipschitz properties, then show how sparsity affects the stability of the softmax and the subsequent convergence and generalization guarantees resulting from the attention mechanism. This allows us to theoretically establish that input-agnostic sparse attention does not provide any benefits. We also characterize conditions when semantic focus (input-dependent sparse attention) can provide improved guarantees, and we validate that these conditions are in fact met in our empirical evaluations.
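To make "input-dependent sparse attention" concrete, the sketch below shows one common instance, top-k attention, in which the set of active keys is chosen from the scores themselves. An input-agnostic variant would instead use a fixed mask such as a local window. Whether the paper uses this exact mechanism is an assumption; the function names and example vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_attention_weights(query, keys, k):
    """Attention weights in which only the k highest-scoring keys stay active."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]
    # The kept set depends on the scores, hence on the input itself.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    kept_weights = softmax([scores[i] for i in keep])
    weights = [0.0] * len(keys)
    for i, w in zip(keep, kept_weights):
        weights[i] = w
    return weights


# Hypothetical example: one query attending over four keys, keeping the top 2.
example_keys = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
weights = topk_attention_weights([1.0, 0.0], example_keys, k=2)
```

Here only the first and third keys receive non-zero weight, and the softmax is renormalized over those two.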
[LG-33] Multi-Scale Finetuning for Encoder-based Time Series Foundation Models
Link: https://arxiv.org/abs/2506.14087
Authors: Zhongzheng Qiao,Chenghao Liu,Yiming Zhang,Ming Jin,Quang Pham,Qingsong Wen,P.N. Suganthan,Xudong Jiang,Savitha Ramasamy
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Time series foundation models (TSFMs) demonstrate impressive zero-shot performance for time series forecasting. However, an important yet underexplored challenge is how to effectively finetune TSFMs on specific downstream tasks. While naive finetuning can yield performance gains, we argue that it falls short of fully leveraging TSFMs’ capabilities, often resulting in overfitting and suboptimal performance. Given the diverse temporal patterns across sampling scales and the inherent multi-scale forecasting capabilities of TSFMs, we adopt a causal perspective to analyze the finetuning process, through which we highlight the critical importance of explicitly modeling multiple scales and reveal the shortcomings of naive approaches. Focusing on encoder-based TSFMs, we propose MultiScale FineTuning (MSFT), a simple yet general framework that explicitly integrates multi-scale modeling into the finetuning process. Experimental results on three different backbones (Moirai, Moment and UniTS) demonstrate that TSFMs finetuned with MSFT not only outperform naive and typical parameter efficient finetuning methods but also surpass state-of-the-art deep learning methods.
[LG-34] Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification
Link: https://arxiv.org/abs/2506.14074
Authors: Nathaniel Pinckney,Chenhui Deng,Chia-Tung Ho,Yun-Da Tsai,Mingjie Liu,Wenfei Zhou,Brucek Khailany,Haoxing Ren
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: 16 pages with appendix
Abstract:We present the Comprehensive Verilog Design Problems (CVDP) benchmark, a new dataset and infrastructure to advance LLM and agent research in hardware design and verification. CVDP includes 783 problems across 13 task categories, covering RTL generation, verification, debugging, specification alignment, and technical QA authored by experienced hardware engineers. Problems are offered in both non-agentic and agentic formats. The benchmark introduces more realistic and challenging contexts than prior work, with state-of-the-art models achieving no more than 34% pass@1 on code generation. Agentic tasks, especially those involving RTL reuse and verification, are particularly difficult. Evaluation uses open-source tools and model scoring infrastructure, with comprehension tasks assessed via BLEU and LLM-based judging. CVDP reveals substantial gaps in current model capabilities, underscoring the need for continued research toward robust, real-world hardware design automation.
[LG-35] A Regret Perspective on Online Selective Generation
Link: https://arxiv.org/abs/2506.14067
Authors: Minjae Lee,Yoonjae Jung,Sangdon Park
Categories: Machine Learning (cs.LG)
Comments: 10 pages
Abstract:Large language generative models increasingly interact with humans, while their falsified responses raise concerns. To address this hallucination effect, selectively abstaining from answering, called selective generation, provides an effective way for generators to control hallucination when they are unsure of their answers. However, as selective generators are interacting under non-stochastic environments and having partial feedback from users on selective generation (e.g., thumbs up or down on the selected answer), learning methods for selective generation under such practical setups are crucial but currently missing. To address these limitations, we propose an online learning algorithm for selective generation under partial feedback. In particular, as learning under partial feedback is well-studied by multi-armed bandit problems, we reduce selective generation to bandits and provide a novel conversion lemma from bandits back to selective generation to leverage any known bandit algorithms and theoretical properties. This mainly connects regret guarantees of bandits to false discovery rate (FDR) guarantees of selective generation for controlling hallucination. However, naively exploiting known bandit algorithms and their regret bounds suffers from slow convergence speed in practice due to the nature of partial feedback. To overcome this, we exploit a unique structure of arms in selective generation for feedback unlocking, i.e., unlocking unknown feedback from observed feedback. We theoretically and empirically evaluate the efficacy of the proposed online selective generation algorithm under partial feedback over diverse data environment setups, resulting in controlling a desired FDR, while maintaining reasonable selection efficiency, i.e., the ratio of non-abstaining answers, compared to baselines.
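The two quantities the abstract trades off can be illustrated with a small offline sketch: the false discovery rate (share of wrong answers among those the generator chose to give) and the selection efficiency (share of queries it answered at all). The confidence-threshold selector, the function name, and the sample data are assumptions for illustration; the paper's online bandit algorithm is not reproduced here.

```python
def selective_metrics(confidences, correct, threshold):
    """Return (FDR, selection efficiency) for a confidence-threshold selector.

    An answer is given only when its confidence is at least `threshold`;
    otherwise the generator abstains.
    """
    selected = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    efficiency = len(selected) / len(confidences)
    if not selected:
        return 0.0, efficiency  # nothing answered, so nothing falsely answered
    fdr = sum(1 for ok in selected if not ok) / len(selected)
    return fdr, efficiency


# Hypothetical data: five queries with model confidences and correctness flags.
confs = [0.9, 0.8, 0.6, 0.4, 0.2]
is_correct = [True, True, False, True, False]
fdr, efficiency = selective_metrics(confs, is_correct, threshold=0.5)
```

With this data, three of five queries are answered and one of those three is wrong. Raising the threshold to 0.7 would drive the FDR to zero at the cost of answering only two queries.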
[LG-36] Load Balancing Mixture of Experts with Similarity Preserving Routers
Link: https://arxiv.org/abs/2506.14038
Authors: Nabil Omi,Siddhartha Sen,Ali Farhadi
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Sparse Mixture of Experts (MoE) models offer a scalable and efficient architecture for training large neural networks by activating only a subset of parameters (“experts”) for each input. A learned router computes a distribution over these experts, and assigns input tokens to a small subset. However, without auxiliary balancing mechanisms, routers often converge to using only a few experts, severely limiting model capacity and degrading performance. Most current load balancing mechanisms encourage a distribution over experts that resembles a roughly uniform distribution of experts per token. During training, this can result in inconsistent routing behavior, resulting in the model spending its capacity to learn redundant knowledge. We address this by introducing a novel load balancing loss that preserves token-wise relational structure, encouraging consistent expert choices for similar inputs during training. Our experimental results show that applying our loss to the router results in 36% faster convergence and lower redundancy compared to a popular load balancing loss.
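As background for the "popular load balancing loss" mentioned above, the sketch below implements one widely used auxiliary loss of this kind (the Switch Transformer style term, the number of experts times the sum over experts of dispatch fraction times mean router probability). The abstract does not say which baseline the authors compare against, so treating it as this loss is an assumption, and the paper's similarity-preserving loss is not shown.

```python
def load_balancing_loss(router_probs):
    """Auxiliary balance loss: N * sum_i f_i * P_i.

    router_probs: one probability list over experts per token.
    f_i is the fraction of tokens whose top-1 expert is i;
    P_i is the mean router probability assigned to expert i.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    dispatch_frac = [0.0] * num_experts
    mean_prob = [0.0] * num_experts
    for probs in router_probs:
        top = max(range(num_experts), key=lambda i: probs[i])
        dispatch_frac[top] += 1.0 / num_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    return num_experts * sum(f * p for f, p in zip(dispatch_frac, mean_prob))


# Hypothetical two-expert routers: one balanced, one collapsed onto expert 0.
balanced = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]])
collapsed = load_balancing_loss([[0.9, 0.1], [0.8, 0.2]])
```

The loss equals 1 under perfectly uniform routing and grows as tokens concentrate on a few experts, which is why minimizing it pushes toward the roughly uniform assignment the abstract describes.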
[LG-37] Robust Physics-Informed Neural Network Approach for Estimating Heterogeneous Elastic Properties from Noisy Displacement Data
Link: https://arxiv.org/abs/2506.14036
Authors: Tatthapong Srikitrungruang,Sina Aghaee Dabaghan Fard,Matthew Lemon,Jaesung Lee,Yuxiao Zhou
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Accurately estimating spatially heterogeneous elasticity parameters, particularly Young’s modulus and Poisson’s ratio, from noisy displacement measurements remains significantly challenging in inverse elasticity problems. Existing inverse estimation techniques are often limited by instability, pronounced sensitivity to measurement noise, and difficulty in recovering absolute-scale Young’s modulus. This work presents a novel Inverse Elasticity Physics-Informed Neural Network (IE-PINN) specifically designed to robustly reconstruct heterogeneous distributions of elasticity parameters from noisy displacement data based on linear elasticity physics. IE-PINN integrates three distinct neural network architectures dedicated to separately modeling displacement fields, strain fields, and elasticity distributions, thereby significantly enhancing stability and accuracy against measurement noise. Additionally, a two-phase estimation strategy is introduced: the first phase recovers relative spatial distributions of Young’s modulus and Poisson’s ratio, and the second phase calibrates the absolute scale of Young’s modulus using imposed loading boundary conditions. Additional methodological innovations, including positional encoding, sine activation functions, and a sequential pretraining protocol, further enhance the model’s performance and robustness. Extensive numerical experiments demonstrate that IE-PINN effectively overcomes critical limitations encountered by existing methods, delivering accurate absolute-scale elasticity estimations even under severe noise conditions. This advancement holds substantial potential for clinical imaging diagnostics and mechanical characterization, where measurements typically encounter substantial noise.
[LG-38] Sketched Sum-Product Networks for Joins
Link: https://arxiv.org/abs/2506.14034
Authors: Brian Tsan,Abylay Amanbayev,Asoke Datta,Florin Rusu
Categories: Databases (cs.DB); Machine Learning (cs.LG)
Comments:
Abstract:Sketches have shown high accuracy in multi-way join cardinality estimation, a critical problem in cost-based query optimization. Accurately estimating the cardinality of a join operation – analogous to its computational cost – allows the optimization of query execution costs in relational database systems. However, although sketches have shown high efficacy in query optimization, they are typically constructed specifically for predefined selections in queries that are assumed to be given a priori, hindering their applicability to new queries. As a more general solution, we propose using Sum-Product Networks to dynamically approximate sketches on the fly. Sum-Product Networks can decompose and model multivariate distributions, such as relations, as linear combinations of multiple univariate distributions. By representing these univariate distributions as sketches, Sum-Product Networks can combine them element-wise to efficiently approximate the sketch of any query selection. These approximate sketches can then be applied to join cardinality estimation. In particular, we implement the Fast-AGMS and Bound Sketch methods, which have successfully been used in prior work, despite their costly construction. By accurately approximating them instead, our work provides a practical alternative to apply these sketches to query optimization.
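For context on what is being approximated, here is a minimal single-row sketch in the Fast-AGMS style: each join key is hashed to one counter and added with a pseudo-random sign, and the join size is estimated as the inner product of two sketches built with the same hashes. The class name, hash constants, and toy relations are assumptions. A production version would use 4-wise independent sign hashes and take the median over several rows, and the Sum-Product Network layer from the paper is not shown.

```python
class FastAGMS:
    """Single-row Fast-AGMS style summary of an integer join column."""

    PRIME = 2_147_483_647  # modulus for the toy linear hash functions

    def __init__(self, width, seed=(3, 7, 11, 13)):
        self.width = width
        self.a_bucket, self.b_bucket, self.a_sign, self.b_sign = seed
        self.counters = [0] * width

    def _bucket(self, key):
        return ((self.a_bucket * key + self.b_bucket) % self.PRIME) % self.width

    def _sign(self, key):
        h = (self.a_sign * key + self.b_sign) % self.PRIME
        return 1 if h % 2 == 0 else -1

    def add(self, key, count=1):
        """Record `count` occurrences of `key`."""
        self.counters[self._bucket(key)] += self._sign(key) * count

    def join_estimate(self, other):
        """Estimate the equi-join size as the inner product of the counters."""
        return sum(x * y for x, y in zip(self.counters, other.counters))


# Hypothetical relations: key 5 appears 3x / 4x, key 6 appears 2x / 1x.
left, right = FastAGMS(width=64), FastAGMS(width=64)
left.add(5, 3)
left.add(6, 2)
right.add(5, 4)
right.add(6, 1)
estimate = left.join_estimate(right)  # true join size is 3*4 + 2*1
```

Because the sketch is a vector of counters, sketches of disjoint pieces of a relation add element-wise, which is the property that lets a model combine per-attribute sketches as the abstract describes.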
[LG-39] Unlearning Isn't Invisible: Detecting Unlearning Traces in LLMs from Model Outputs
Link: https://arxiv.org/abs/2506.14003
Authors: Yiwei Chen,Soumyadeep Pal,Yimeng Zhang,Qing Qu,Sijia Liu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Machine unlearning (MU) for large language models (LLMs), commonly referred to as LLM unlearning, seeks to remove specific undesirable data or knowledge from a trained model, while maintaining its performance on standard tasks. While unlearning plays a vital role in protecting data privacy, enforcing copyright, and mitigating sociotechnical harms in LLMs, we identify a new vulnerability post-unlearning: unlearning trace detection. We discover that unlearning leaves behind persistent “fingerprints” in LLMs, detectable traces in both model behavior and internal representations. These traces can be identified from output responses, even when prompted with forget-irrelevant inputs. Specifically, a simple supervised classifier can reliably determine whether a model has undergone unlearning based solely on its textual outputs. Further analysis shows that these traces are embedded in intermediate activations and propagate nonlinearly to the final layer, forming low-dimensional, learnable manifolds in activation space. Through extensive experiments, we show that forget-relevant prompts enable over 90% accuracy in detecting unlearning traces across all model sizes. Even with forget-irrelevant inputs, large LLMs maintain high detectability, demonstrating the broad applicability of unlearning trace detection. These findings reveal that unlearning leaves measurable signatures, introducing a new risk of reverse-engineering forgotten information when a model is identified as unlearned given an input query. Code is available at this https URL.
[LG-40] Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences
Link: https://arxiv.org/abs/2506.13996
Authors: Stas Bekman,Samyam Rajbhandari,Michael Wyatt,Jeff Rasley,Tunji Ruwase,Zhewei Yao,Aurick Qiao,Yuxiong He
Categories: Machine Learning (cs.LG)
Comments: 19 pages, 13 figures
Abstract:Long sequences are critical for applications like RAG, long document summarization, multi-modality, etc., and modern LLMs, like Llama 4 Scout, support max sequence length of up to 10 million tokens. However, outside of enterprise labs, long sequence training is challenging for the AI community with limited system support in the open-source space. Out-of-box, even on a modern NVIDIA H100 80GB GPU cluster, training Llama 8B model with sequence over 32K runs out of memory on a basic Hugging Face (HF) model due to two reasons: i) LLM training workloads are not optimized to fully leverage a single GPU memory, ii) existing solutions for leveraging multiple GPU memory are not easily available to HF models, making long sequence training inaccessible. We address this with Arctic Long Sequence Training (ALST). It offers a combination of attention-agnostic single GPU and multi-GPU memory optimizations, that enables it to support out-of-box training of multi-million sequence length for a wide variety of HF models. ALST supports training Meta’s Llama 8B model with 500K sequence length on a single H100 GPU, 3.7M on a single 8xH100 GPU node, and over 15M on a 4 node cluster, an increase of over 400x compared to the 32K baseline for the latter. ALST is fully compatible with HF models and open-sourced via Deepspeed this https URL and Arctic Training this https URL.
[LG-41] Quantum-Informed Contrastive Learning with Dynamic Mixup Augmentation for Class-Imbalanced Expert Systems
Link: https://arxiv.org/abs/2506.13987
Authors: Md Abrar Jahin,Adiba Abid,M. F. Mridha
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Expert systems often operate in domains characterized by class-imbalanced tabular data, where detecting rare but critical instances is essential for safety and reliability. While conventional approaches, such as cost-sensitive learning, oversampling, and graph neural networks, provide partial solutions, they suffer from drawbacks like overfitting, label noise, and poor generalization in low-density regions. To address these challenges, we propose QCL-MixNet, a novel Quantum-Informed Contrastive Learning framework augmented with k-nearest neighbor (kNN) guided dynamic mixup for robust classification under imbalance. QCL-MixNet integrates three core innovations: (i) a Quantum Entanglement-inspired layer that models complex feature interactions through sinusoidal transformations and gated attention, (ii) a sample-aware mixup strategy that adaptively interpolates feature representations of semantically similar instances to enhance minority class representation, and (iii) a hybrid loss function that unifies focal reweighting, supervised contrastive learning, triplet margin loss, and variance regularization to improve both intra-class compactness and inter-class separability. Extensive experiments on 18 real-world imbalanced datasets (binary and multi-class) demonstrate that QCL-MixNet consistently outperforms 20 state-of-the-art machine learning, deep learning, and GNN-based baselines in macro-F1 and recall, often by substantial margins. Ablation studies further validate the critical role of each architectural component. Our results establish QCL-MixNet as a new benchmark for tabular imbalance handling in expert systems. Theoretical analyses reinforce its expressiveness, generalization, and optimization robustness.
[LG-42] Constant Stepsize Local GD for Logistic Regression: Acceleration by Instability ICML2025
Link: https://arxiv.org/abs/2506.13974
Authors: Michael Crawshaw,Blake Woodworth,Mingrui Liu
Categories: Machine Learning (cs.LG)
Comments: ICML 2025
Abstract:Existing analysis of Local (Stochastic) Gradient Descent for heterogeneous objectives requires stepsizes $\eta \leq 1/K$ where $K$ is the communication interval, which ensures monotonic decrease of the objective. In contrast, we analyze Local Gradient Descent for logistic regression with separable, heterogeneous data using any stepsize $\eta > 0$. With $R$ communication rounds and $M$ clients, we show convergence at a rate $\mathcal{O}(1/\eta K R)$ after an initial unstable phase lasting for $\widetilde{\mathcal{O}}(\eta K M)$ rounds. This improves upon the existing $\mathcal{O}(1/R)$ rate for general smooth, convex objectives. Our analysis parallels the single machine analysis of \cite{wu2024large} in which instability is caused by extremely large stepsizes, but in our setting another source of instability is large local updates with heterogeneous objectives.
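The algorithm under analysis can be sketched in a few lines: each of the M clients runs K gradient steps on its own logistic loss starting from the shared model, and the server averages the results once per round for R rounds. The code below is a minimal pure-Python illustration with hypothetical toy data and a small stepsize chosen for a stable demo; it does not reproduce the paper's large-stepsize regime or its analysis.

```python
import math


def logistic_loss(w, data):
    """Mean of log(1 + exp(-y * <w, x>)) over (x, y) pairs with y in {-1, +1}."""
    total = 0.0
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        total += math.log1p(math.exp(-margin))
    return total / len(data)


def logistic_grad(w, data):
    """Gradient of the mean logistic loss with respect to w."""
    grad = [0.0] * len(w)
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        coeff = -y / (1.0 + math.exp(margin))
        for j, xj in enumerate(x):
            grad[j] += coeff * xj / len(data)
    return grad


def local_gd(clients, eta, K, R, dim):
    """Local GD: K local steps per client, then server averaging, for R rounds."""
    w = [0.0] * dim
    for _ in range(R):
        local_models = []
        for data in clients:
            v = list(w)
            for _ in range(K):
                g = logistic_grad(v, data)
                v = [vj - eta * gj for vj, gj in zip(v, g)]
            local_models.append(v)
        w = [sum(m[j] for m in local_models) / len(local_models) for j in range(dim)]
    return w


def global_loss(w, clients):
    return sum(logistic_loss(w, data) for data in clients) / len(clients)


# Hypothetical separable, heterogeneous data for M = 2 clients.
clients = [
    [((1.0, 0.5), 1), ((-1.0, 0.5), -1)],
    [((2.0, -1.0), 1), ((-0.5, -0.2), -1)],
]
initial_loss = global_loss([0.0, 0.0], clients)
w_final = local_gd(clients, eta=0.1, K=5, R=50, dim=2)
final_loss = global_loss(w_final, clients)
```

Increasing `eta` or `K` in this sketch is a simple way to observe the early non-monotone phase the abstract refers to, although the toy data here is far too small to say anything about rates.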
[LG-43] Membership Inference Attacks as Privacy Tools: Reliability Disparity and Ensemble
Link: https://arxiv.org/abs/2506.13972
Authors: Zhiqi Wang,Chengyu Zhang,Yuetian Chen,Nathalie Baracaldo,Swanand Kadhe,Lei Yu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Membership inference attacks (MIAs) pose a significant threat to the privacy of machine learning models and are widely used as tools for privacy assessment, auditing, and machine unlearning. While prior MIA research has primarily focused on performance metrics such as AUC, accuracy, and TPR@low FPR - either by developing new methods to enhance these metrics or using them to evaluate privacy solutions - we found that it overlooks the disparities among different attacks. These disparities, both between distinct attack methods and between multiple instantiations of the same method, have crucial implications for the reliability and completeness of MIAs as privacy evaluation tools. In this paper, we systematically investigate these disparities through a novel framework based on coverage and stability analysis. Extensive experiments reveal significant disparities in MIAs, their potential causes, and their broader implications for privacy evaluation. To address these challenges, we propose an ensemble framework with three distinct strategies to harness the strengths of state-of-the-art MIAs while accounting for their disparities. This framework not only enables the construction of more powerful attacks but also provides a more robust and comprehensive methodology for privacy evaluation.
[LG-44] A Hybrid Neural Network – Polynomial Series Scheme for Learning Invariant Manifolds of Discrete Dynamical Systems
Link: https://arxiv.org/abs/2506.13950
Authors: Dimitrios G. Patsatzis,Nikolaos Kazantzis,Ioannis G. Kevrekidis,Constantinos Siettos
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: 36 pages (31 pages of main text and Appendix, 5 of Supplement), 8 Figures (6 in the main text and Appendix and 2 in the Supplement)
Abstract:We propose a hybrid machine learning scheme to learn – in physics-informed and numerical analysis-informed fashion – invariant manifolds (IM) of discrete maps for constructing reduced-order models (ROMs) for dynamical systems. The proposed scheme combines polynomial series with shallow neural networks, exploiting the complementary strengths of both approaches. Polynomials enable an efficient and accurate modeling of ROMs with guaranteed local exponential convergence rate around the fixed point, where, under certain assumptions, the IM is demonstrated to be analytic. Neural networks provide approximations to more complex structures beyond the reach of the polynomials’ convergence. We evaluate the efficiency of the proposed scheme using three benchmark examples, examining convergence behavior, numerical approximation accuracy, and computational training cost. Additionally, we compare the IM approximations obtained solely with neural networks and with polynomial expansions. We demonstrate that the proposed hybrid scheme outperforms both pure polynomial approximations (power series, Legendre and Chebyshev polynomials) and standalone shallow neural network approximations in terms of numerical approximation accuracy.
[LG-45] ReinDSplit: Reinforced Dynamic Split Learning for Pest Recognition in Precision Agriculture
Link: https://arxiv.org/abs/2506.13935
Authors: Vishesh Kumar Tanwar, Soumik Sarkar, Asheesh K. Singh, Sajal K. Das
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
Abstract:To empower precision agriculture through distributed machine learning (DML), split learning (SL) has emerged as a promising paradigm, partitioning deep neural networks (DNNs) between edge devices and servers to reduce computational burdens and preserve data privacy. However, conventional SL frameworks’ one-split-fits-all strategy is a critical limitation in agricultural ecosystems where edge insect monitoring devices exhibit vast heterogeneity in computational power, energy constraints, and connectivity. This leads to straggler bottlenecks, inefficient resource utilization, and compromised model performance. Bridging this gap, we introduce ReinDSplit, a novel reinforcement learning (RL)-driven framework that dynamically tailors DNN split points for each device, optimizing efficiency without sacrificing accuracy. Specifically, a Q-learning agent acts as an adaptive orchestrator, balancing workloads and latency thresholds across devices to mitigate computational starvation or overload. By framing split layer selection as a finite-state Markov decision process, ReinDSplit convergence ensures that highly constrained devices contribute meaningfully to model training over time. Evaluated on three insect classification datasets using ResNet18, GoogleNet, and MobileNetV2, ReinDSplit achieves 94.31% accuracy with MobileNetV2. Beyond agriculture, ReinDSplit pioneers a paradigm shift in SL by harmonizing RL for resource efficiency, privacy, and scalability in heterogeneous environments.
[LG-46] Branching Stein Variational Gradient Descent for sampling multimodal distributions
Link: https://arxiv.org/abs/2506.13916
Authors: Isaias Banales, Arturo Jaramillo, Heli Ricalde Guerrero
Categories: Machine Learning (cs.LG); Computation (stat.CO)
Abstract:We propose a novel particle-based variational inference method designed to work with multimodal distributions. Our approach, referred to as Branched Stein Variational Gradient Descent (BSVGD), extends the classical Stein Variational Gradient Descent (SVGD) algorithm by incorporating a random branching mechanism that encourages the exploration of the state space. In this work, a theoretical guarantee for the convergence in distribution is presented, as well as numerical experiments to validate the suitability of our algorithm. Performance comparisons between the BSVGD and the SVGD are presented using the Wasserstein distance between samples and the corresponding computational times.
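For orientation, the classical SVGD update that BSVGD extends can be sketched in a few lines. This is a minimal one-dimensional sketch of plain SVGD only: the random branching mechanism is not described in the abstract and is not reproduced here. The fixed RBF bandwidth, step size, and the function names `rbf_kernel`, `svgd_step`, and `run_svgd` are assumptions made for illustration.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """RBF kernel k(x, y) = exp(-(x - y)^2 / (2 h^2))."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def svgd_step(particles, score, step_size=0.1, bandwidth=1.0):
    """One classical SVGD update for 1-D particles.

    `score(x)` must return d/dx log p(x) for the target density p.
    """
    n = len(particles)
    updated = []
    for x_i in particles:
        phi = 0.0
        for x_j in particles:
            k = rbf_kernel(x_j, x_i, bandwidth)
            # Driving term: kernel-weighted score pulls toward high density.
            phi += k * score(x_j)
            # Repulsive term: d/dx_j k(x_j, x_i) keeps particles spread out.
            phi += -(x_j - x_i) / bandwidth ** 2 * k
        updated.append(x_i + step_size * phi / n)
    return updated


def run_svgd(particles, score, num_steps, step_size=0.1, bandwidth=1.0):
    """Apply `num_steps` SVGD updates and return the final particles."""
    for _ in range(num_steps):
        particles = svgd_step(particles, score, step_size, bandwidth)
    return particles


if __name__ == "__main__":
    # Hypothetical target: a unit-variance Gaussian centred at 2.
    start = [-1.0 + 2.0 * i / 19 for i in range(20)]
    final = run_svgd(start, lambda x: -(x - 2.0), num_steps=500)
    print(sum(final) / len(final))
```

On a unimodal target like this one the particles drift to the mode and spread out; the multimodal case, where plain SVGD can leave modes unexplored, is what the branching step in the abstract is aimed at.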
[LG-47] Density-aware Walks for Coordinated Campaign Detection ECML-PKDD2025
Link: https://arxiv.org/abs/2506.13912
Authors: Atul Anand Gopalakrishnan, Jakir Hossain, Tuğrulcan Elmas, Ahmet Erdem Sarıyüce
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments: 16 Pages. Accepted at ECML-PKDD 2025
Abstract:Coordinated campaigns frequently exploit social media platforms by artificially amplifying topics, making inauthentic trends appear organic, and misleading users into engagement. Distinguishing these coordinated efforts from genuine public discourse remains a significant challenge due to the sophisticated nature of such attacks. Our work focuses on detecting coordinated campaigns by modeling the problem as a graph classification task. We leverage the recently introduced Large Engagement Networks (LEN) dataset, which contains over 300 networks capturing engagement patterns from both fake and authentic trends on Twitter prior to the 2023 Turkish elections. The graphs in LEN were constructed by collecting interactions related to campaigns that stemmed from ephemeral astroturfing. Established graph neural networks (GNNs) struggle to accurately classify campaign graphs, highlighting the challenges posed by LEN due to the large size of its networks. To address this, we introduce a new graph classification method that leverages the density of local network structures. We propose a random weighted walk (RWW) approach in which node transitions are biased by local density measures such as degree, core number, or truss number. These RWWs are encoded using the Skip-gram model, producing density-aware structural embeddings for the nodes. Training message-passing neural networks (MPNNs) on these density-aware embeddings yields superior results compared to the simpler node features available in the dataset, with nearly a 12% and 5% improvement in accuracy for binary and multiclass classification, respectively. Our findings demonstrate that incorporating density-aware structural encoding with MPNNs provides a robust framework for identifying coordinated inauthentic behavior on social media networks such as Twitter.
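The density-biased walk described in the abstract can be illustrated with a small sketch. It assumes an undirected graph stored as an adjacency dictionary and uses node degree as the density measure; core number or truss number could be substituted as weights. The function name `degree_biased_walk` and the toy graph are hypothetical, and the Skip-gram and MPNN stages are not shown.

```python
import random


def degree_biased_walk(adjacency, start, length, rng):
    """Random walk whose next step is sampled proportionally to neighbor degree.

    `adjacency` maps each node to a list of its neighbors.
    """
    walk = [start]
    current = start
    for _ in range(length - 1):
        neighbors = adjacency[current]
        if not neighbors:
            break  # isolated node: the walk cannot continue
        # Local density measure (assumption): degree of each candidate node.
        weights = [len(adjacency[node]) for node in neighbors]
        current = rng.choices(neighbors, weights=weights, k=1)[0]
        walk.append(current)
    return walk


if __name__ == "__main__":
    toy_graph = {
        "a": ["b", "c"],
        "b": ["a", "c", "d"],
        "c": ["a", "b"],
        "d": ["b"],
    }
    print(degree_biased_walk(toy_graph, "a", 10, random.Random(0)))
```

Walks generated this way would then serve as the "sentences" fed to a Skip-gram model, so that nodes sharing dense neighborhoods end up with similar embeddings.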
[LG-48] GITO: Graph-Informed Transformer Operator for Learning Complex Partial Differential Equations
Link: https://arxiv.org/abs/2506.13906
Authors: Milad Ramezankhani, Janak M. Patel, Anirudh Deodhar, Dagnachew Birru
Categories: Machine Learning (cs.LG)
Abstract:We present a novel graph-informed transformer operator (GITO) architecture for learning complex partial differential equation systems defined on irregular geometries and non-uniform meshes. GITO consists of two main modules: a hybrid graph transformer (HGT) and a transformer neural operator (TNO). HGT leverages a graph neural network (GNN) to encode local spatial relationships and a transformer to capture long-range dependencies. A self-attention fusion layer integrates the outputs of the GNN and transformer to enable more expressive feature learning on graph-structured data. TNO module employs linear-complexity cross-attention and self-attention layers to map encoded input functions to predictions at arbitrary query locations, ensuring discretization invariance and enabling zero-shot super-resolution across any mesh. Empirical results on benchmark PDE tasks demonstrate that GITO outperforms existing transformer-based neural operators, paving the way for efficient, mesh-agnostic surrogate solvers in engineering applications.
[LG-49] SatHealth: A Multimodal Public Health Dataset with Satellite-based Environmental Factors KDD2025
Link: https://arxiv.org/abs/2506.13842
Authors: Yuanlong Wang, Pengqi Wang, Changchang Yin, Ping Zhang
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 6 figures. To be published in SIGKDD 2025 Datasets and Benchmarks Track
Abstract:Living environments play a vital role in the prevalence and progression of diseases, and understanding their impact on patient’s health status becomes increasingly crucial for developing AI models. However, due to the lack of long-term and fine-grained spatial and temporal data in public and population health studies, most existing studies fail to incorporate environmental data, limiting the models’ performance and real-world application. To address this shortage, we developed SatHealth, a novel dataset combining multimodal spatiotemporal data, including environmental data, satellite images, all-disease prevalences estimated from medical claims, and social determinants of health (SDoH) indicators. We conducted experiments under two use cases with SatHealth: regional public health modeling and personal disease risk prediction. Experimental results show that living environmental information can significantly improve AI models’ performance and temporal-spatial generalizability on various tasks. Finally, we deploy a web-based application to provide an exploration tool for SatHealth and one-click access to both our data and regional environmental embedding to facilitate plug-and-play utilization. SatHealth is now published with data in Ohio, and we will keep updating SatHealth to cover the other parts of the US. With the web application and published code pipeline, our work provides valuable angles and resources to include environmental data in healthcare research and establishes a foundational framework for future research in environmental health informatics.
[LG-50] Hybrid Meta-Learning Framework for Anomaly Forecasting in Nonlinear Dynamical Systems via Physics-Inspired Simulation and Deep Ensembles
Link: https://arxiv.org/abs/2506.13828
Authors: Abdullah Burkan Bereketoglu
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Accelerator Physics (physics.acc-ph)
Comments: 6 pages, 5 figures, 5 algorithms
Abstract:We propose a hybrid meta-learning framework for forecasting and anomaly detection in nonlinear dynamical systems characterized by nonstationary and stochastic behavior. The approach integrates a physics-inspired simulator that captures nonlinear growth-relaxation dynamics with random perturbations, representative of many complex physical, industrial, and cyber-physical systems. We use CNN-LSTM architectures for spatio-temporal feature extraction, Variational Autoencoders (VAE) for unsupervised anomaly scoring, and Isolation Forests for residual-based outlier detection in addition to a Dual-Stage Attention Recurrent Neural Network (DA-RNN) for one-step forecasting on top of the generated simulation data. To create composite anomaly forecasts, these models are combined using a meta-learner that combines forecasting outputs, reconstruction errors, and residual scores. The hybrid ensemble performs better than standalone models in anomaly localization, generalization, and robustness to nonlinear deviations, according to simulation-based experiments. The framework provides a broad, data-driven approach to early defect identification and predictive monitoring in nonlinear systems, which may be applied to a variety of scenarios where complete physical models might not be accessible.
[LG-51] The Synthetic Mirror – Synthetic Data at the Age of Agentic AI
Link: https://arxiv.org/abs/2506.13818
Authors: Marcelle Momha
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract:Synthetic data, which is artificially generated and intelligently mimicking or supplementing the real-world data, is increasingly used. The proliferation of AI agents and the adoption of synthetic data create a synthetic mirror that conceptualizes a representation and potential distortion of reality, thus generating trust and accountability deficits. This paper explores the implications for privacy and policymaking stemming from synthetic data generation, and the urgent need for new policy instruments and legal framework adaptation to ensure appropriate levels of trust and accountability for AI agents relying on synthetic data. Rather than creating entirely new policy or legal regimes, the most practical approach involves targeted amendments to existing frameworks, recognizing synthetic data as a distinct regulatory category with unique characteristics.
[LG-52] ReFrame: Layer Caching for Accelerated Inference in Real-Time Rendering ICML2025
Link: https://arxiv.org/abs/2506.13814
Authors: Lufei Liu, Tor M. Aamodt
Categories: Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Published at ICML 2025
Abstract:Graphics rendering applications increasingly leverage neural networks in tasks such as denoising, supersampling, and frame extrapolation to improve image quality while maintaining frame rates. The temporal coherence inherent in these tasks presents an opportunity to reuse intermediate results from previous frames and avoid redundant computations. Recent work has shown that caching intermediate features to be reused in subsequent inferences is an effective method to reduce latency in diffusion models. We extend this idea to real-time rendering and present ReFrame, which explores different caching policies to optimize trade-offs between quality and performance in rendering workloads. ReFrame can be applied to a variety of encoder-decoder style networks commonly found in rendering pipelines. Experimental results show that we achieve 1.4x speedup on average with negligible quality loss in three real-time rendering tasks. Code available: this https URL
[LG-53] Agile Orchestration at Will: An Entire Smart Service-Based Security Architecture Towards 6G
Link: https://arxiv.org/abs/2505.22963
Authors: Zhuoran Duan, Guoshun Nan, Rushan Li, Zijun Wang, Lihua Xiong, Chaoying Yuan, Guorong Liu, Hui Xu, Qimei Cui, Xiaofeng Tao, Tony Q.S. Quek
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Abstract:The upcoming 6G will fundamentally reshape mobile networks beyond communications, unlocking a multitude of applications that were once considered unimaginable. Meanwhile, security and resilience are especially highlighted in the 6G design principles. However, safeguarding 6G networks will be quite challenging due to various known and unknown threats from highly heterogeneous networks and diversified security requirements of distinct use cases, calling for a comprehensive re-design of security architecture. This motivates us to propose ES3A (Entire Smart Service-based Security Architecture), a novel security architecture for 6G networks. Specifically, we first discuss six high-level principles of our ES3A that include hierarchy, flexibility, scalability, resilience, endogeny, and trust and privacy. With these goals in mind, we then introduce three guidelines from a deployment perspective, envisioning our ES3A that offers service-based security, end-to-end protection, and smart security automation for 6G networks. Our architecture consists of three layers and three domains. It relies on a two-stage orchestration mechanism to tailor smart security strategies for customized protection in high-dynamic 6G networks, thereby addressing the aforementioned challenges. Finally, we prototype the proposed ES3A on a real-world radio system based on Software-Defined Radio (SDR). Experiments show the effectiveness of our ES3A. We also provide a case to show the superiority of our architecture.
[LG-54] Markov Regime-Switching Intelligent Driver Model for Interpretable Car-Following Behavior
Link: https://arxiv.org/abs/2506.14762
Authors: Chengyuan Zhang, Cathy Wu, Lijun Sun
Categories: Applications (stat.AP); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract:Accurate and interpretable car-following models are essential for traffic simulation and autonomous vehicle development. However, classical models like the Intelligent Driver Model (IDM) are fundamentally limited by their parsimonious and single-regime structure. They fail to capture the multi-modal nature of human driving, where a single driving state (e.g., speed, relative speed, and gap) can elicit many different driver actions. This forces the model to average across distinct behaviors, reducing its fidelity and making its parameters difficult to interpret. To overcome this, we introduce a regime-switching framework that allows driving behavior to be governed by different IDM parameter sets, each corresponding to an interpretable behavioral mode. This design enables the model to dynamically switch between interpretable behavioral modes, rather than averaging across diverse driving contexts. We instantiate the framework using a Factorial Hidden Markov Model with IDM dynamics (FHMM-IDM), which explicitly separates intrinsic driving regimes (e.g., aggressive acceleration, steady-state following) from external traffic scenarios (e.g., free-flow, congestion, stop-and-go) through two independent latent Markov processes. Bayesian inference via Markov chain Monte Carlo (MCMC) is used to jointly estimate the regime-specific parameters, transition dynamics, and latent state trajectories. Experiments on the HighD dataset demonstrate that FHMM-IDM uncovers interpretable structure in human driving, effectively disentangling internal driver actions from contextual traffic conditions and revealing dynamic regime-switching patterns. This framework provides a tractable and principled solution to modeling context-dependent driving behavior under uncertainty, offering improvements in the fidelity of traffic simulations, the efficacy of safety analyses, and the development of more human-centric ADAS.
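The regime-switching idea can be sketched with the standard IDM acceleration formula and a single Markov chain over parameter sets. This is a simplified illustration, not the paper's FHMM-IDM: it uses one latent chain rather than two, and it simulates forward instead of doing MCMC inference. The two regimes, every parameter value, the transition matrix, and the names `IDMParams`, `idm_acceleration`, `next_regime`, and `simulate_follower` are assumptions.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class IDMParams:
    v0: float      # desired speed (m/s)
    T: float       # desired time headway (s)
    s0: float      # minimum gap (m)
    a_max: float   # maximum acceleration (m/s^2)
    b: float       # comfortable deceleration (m/s^2)
    delta: float = 4.0  # acceleration exponent


def idm_acceleration(p, v, gap, dv):
    """Standard IDM acceleration; dv = v - v_lead is the approach rate."""
    interaction = v * p.T + v * dv / (2.0 * math.sqrt(p.a_max * p.b))
    s_star = p.s0 + max(0.0, interaction)  # desired gap, clamped at s0
    return p.a_max * (1.0 - (v / p.v0) ** p.delta - (s_star / gap) ** 2)


def next_regime(current, transition, rng):
    """Sample the next regime index from row `current` of the transition matrix."""
    row = transition[current]
    return rng.choices(range(len(row)), weights=row, k=1)[0]


def simulate_follower(regimes, transition, v_lead, steps, dt, rng):
    """Follow a constant-speed leader while the IDM parameter set switches."""
    v, gap, regime = v_lead, 30.0, 0
    speeds = []
    for _ in range(steps):
        regime = next_regime(regime, transition, rng)
        a = idm_acceleration(regimes[regime], v, gap, v - v_lead)
        v = max(0.0, v + a * dt)
        gap = max(0.1, gap + (v_lead - v) * dt)  # keep the gap strictly positive
        speeds.append(v)
    return speeds


# Hypothetical behavioral modes (illustrative values only).
AGGRESSIVE = IDMParams(v0=33.0, T=1.0, s0=2.0, a_max=2.0, b=2.0)
CAUTIOUS = IDMParams(v0=28.0, T=2.0, s0=3.0, a_max=0.8, b=1.5)

if __name__ == "__main__":
    sticky = [[0.95, 0.05], [0.05, 0.95]]  # regimes persist for a while
    trace = simulate_follower([AGGRESSIVE, CAUTIOUS], sticky, 20.0, 200, 0.1,
                              random.Random(0))
    print(trace[-1])
```

The same driving state (20 m/s, 40 m gap, no closing speed) yields acceleration under the aggressive parameters and braking under the cautious ones, which is the one-state-many-actions behavior a single-regime IDM would average away.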
[LG-55] Uniform Mean Estimation for Heavy-Tailed Distributions via Median-of-Means
Link: https://arxiv.org/abs/2506.14673
Authors: Mikael Møller Høgsgaard, Andrea Paudice
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:The Median of Means (MoM) is a mean estimator that has gained popularity in the context of heavy-tailed data. In this work, we analyze its performance in the task of simultaneously estimating the mean of each function in a class $\mathcal{F}$ when the data distribution possesses only the first $p$ moments for $p \in (1,2]$. We prove a new sample complexity bound using a novel symmetrization technique that may be of independent interest. Additionally, we present applications of our result to $k$-means clustering with unbounded inputs and linear regression with general losses, improving upon existing works.
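For readers unfamiliar with the estimator, a minimal sketch of the basic scalar Median of Means follows. It covers only the single-mean case, not the uniform estimation over a function class analyzed in the paper. The function name `median_of_means`, the equal-size blocking, and the choice to drop leftover samples are assumptions.

```python
import random
import statistics


def median_of_means(samples, num_blocks, rng=None):
    """Split the data into `num_blocks` equal blocks and return the median of block means.

    Samples that do not fit into equal-size blocks are dropped. Pass an `rng`
    to shuffle the data first, which is advisable when the input is ordered.
    """
    data = list(samples)
    if not 1 <= num_blocks <= len(data):
        raise ValueError("num_blocks must be between 1 and the sample size")
    if rng is not None:
        rng.shuffle(data)
    block_size = len(data) // num_blocks
    block_means = [
        statistics.fmean(data[i * block_size:(i + 1) * block_size])
        for i in range(num_blocks)
    ]
    return statistics.median(block_means)


if __name__ == "__main__":
    # One extreme observation ruins the plain mean but only one block mean.
    observations = [1.0] * 8 + [1000.0]
    print(statistics.fmean(observations))        # 112.0
    print(median_of_means(observations, 3))      # 1.0
```

The number of blocks trades robustness against variance: more blocks tolerate more outliers, but each block mean is computed from fewer samples.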
[LG-56] The Perception of Phase Intercept Distortion and its Application in Data Augmentation
Link: https://arxiv.org/abs/2506.14571
Authors: Venkatakrishnan Vaidyanathapuram Krishnan, Nathaniel Condit-Schultz
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Submitted to the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2025
Abstract:Phase distortion refers to the alteration of the phase relationships between frequencies in a signal, which can be perceptible. In this paper, we discuss a special case of phase distortion known as phase-intercept distortion, which is created by a frequency-independent phase shift. We hypothesize that, though this form of distortion changes a signal’s waveform significantly, the distortion is imperceptible. Human-subject experiment results are reported which are consistent with this hypothesis. Furthermore, we discuss how the imperceptibility of phase-intercept distortion can be useful for machine learning, specifically for data augmentation. We conducted multiple experiments using phase-intercept distortion as a novel approach to data augmentation, and obtained improved results for audio machine learning tasks.
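A frequency-independent phase shift can be sketched by rotating every positive-frequency DFT coefficient by the same angle and every negative-frequency coefficient by the opposite angle. The sketch below uses a naive O(n^2) DFT so that it needs only the standard library; a real pipeline would use an FFT. Leaving the DC and Nyquist bins unchanged, the sign convention (cos(wt) becomes cos(wt + theta)), and the names `dft`, `idft`, and `phase_intercept_shift` are assumptions, not details from the paper.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (fine for short illustrative signals)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def phase_intercept_shift(signal, theta):
    """Shift the phase of every frequency component of a real signal by `theta`."""
    n = len(signal)
    shifted = []
    for k, coeff in enumerate(dft(signal)):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            shifted.append(coeff)  # DC and Nyquist: left untouched (assumption)
        elif k < n / 2:
            shifted.append(coeff * cmath.exp(1j * theta))   # positive frequencies
        else:
            shifted.append(coeff * cmath.exp(-1j * theta))  # negative frequencies
    # Conjugate symmetry is preserved, so the imaginary part is numerical noise.
    return [value.real for value in idft(shifted)]


if __name__ == "__main__":
    n = 32
    tone = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
    quarter_turn = phase_intercept_shift(tone, math.pi / 2)
    print(quarter_turn[:4])
```

For augmentation, one would draw `theta` uniformly from [0, 2π) per training example; the magnitude spectrum is unchanged while the waveform shape varies.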
[LG-57] Reimagining Target-Aware Molecular Generation through Retrieval-Enhanced Aligned Diffusion
Link: https://arxiv.org/abs/2506.14488
Authors: Dong Xu, Zhangfan Yang, Ka-chun Wong, Zexuan Zhu, Jiangqiang Li, Junkai Ji
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 13 pages, 5 figures
Abstract:Breakthroughs in high-accuracy protein structure prediction, such as AlphaFold, have established receptor-based molecule design as a critical driver for rapid early-phase drug discovery. However, most approaches still struggle to balance pocket-specific geometric fit with strict valence and synthetic constraints. To resolve this trade-off, a Retrieval-Enhanced Aligned Diffusion termed READ is introduced, which is the first to merge molecular Retrieval-Augmented Generation with an SE(3)-equivariant diffusion model. Specifically, a contrastively pre-trained encoder aligns atom-level representations during training, then retrieves graph embeddings of pocket-matched scaffolds to guide each reverse-diffusion step at inference. This single mechanism can inject real-world chemical priors exactly where needed, producing valid, diverse, and shape-complementary ligands. Experimental results demonstrate that READ can achieve very competitive performance in CBGBench, surpassing state-of-the-art generative models and even native ligands. That suggests retrieval and diffusion can be co-optimized for faster, more reliable structure-based drug design.
[LG-58] Adaptive Data Augmentation for Thompson Sampling
Link: https://arxiv.org/abs/2506.14479
Authors: Wonyoung Kim
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:In linear contextual bandits, the objective is to select actions that maximize cumulative rewards, modeled as a linear function with unknown parameters. Although Thompson Sampling performs well empirically, it does not achieve optimal regret bounds. This paper proposes a nearly minimax optimal Thompson Sampling for linear contextual bandits by developing a novel estimator with the adaptive augmentation and coupling of the hypothetical samples that are designed for efficient parameter learning. The proposed estimator accurately predicts rewards for all arms without relying on assumptions for the context distribution. Empirical results show robust performance and significant improvement over existing methods.
[LG-59] NeuralPDR: Neural Differential Equations as surrogate models for Photodissociation Regions
Link: https://arxiv.org/abs/2506.14270
Authors: Gijs Vermariën, Thomas G. Bisbas, Serena Viti, Yue Zhao, Xuefei Tang, Rahul Ravichandran
Categories: Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
Comments: Accepted for publication in Machine Learning: Science and Technology. Focus on ML and the Physical Sciences, Mach. Learn.: Sci. Technol (2025)
Abstract:Computational astrochemical models are essential for helping us interpret and understand the observations of different astrophysical environments. In the age of high-resolution telescopes such as JWST and ALMA, the substructure of many objects can be resolved, raising the need for astrochemical modeling at these smaller scales, meaning that the simulations of these objects need to include both the physics and chemistry to accurately model the observations. The computational cost of the simulations coupling both the three-dimensional hydrodynamics and chemistry is enormous, creating an opportunity for surrogate models that can effectively substitute the chemical solver. In this work we present surrogate models that can replace the original chemical code, namely Latent Augmented Neural Ordinary Differential Equations. We train these surrogate architectures on three datasets of increasing physical complexity, with the last dataset derived directly from a three-dimensional simulation of a molecular cloud using a Photodissociation Region (PDR) code, 3D-PDR. We show that these surrogate models can provide speedup and reproduce the original observable column density maps of the dataset. This enables the rapid inference of the chemistry (on the GPU), allowing for the faster statistical inference of observations or increasing the resolution in hydrodynamical simulations of astrophysical environments.
[LG-60] Universal Rates of ERM for Agnostic Learning COLT
Link: https://arxiv.org/abs/2506.14110
Authors: Steve Hanneke, Mingyue Xu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted for presentation at the Conference on Learning Theory (COLT) 2025
Abstract:The universal learning framework has been developed to obtain guarantees on the learning rates that hold for any fixed distribution, which can be much faster than the ones that hold uniformly over all the distributions. Given that the Empirical Risk Minimization (ERM) principle is fundamental in the PAC theory and ubiquitous in practical machine learning, the recent work of arXiv:2412.02810 studied the universal rates of ERM for binary classification under the realizable setting. However, the assumption of realizability is too restrictive to hold in practice. Indeed, the majority of the literature on universal learning has focused on the realizable case, leaving the non-realizable case barely explored. In this paper, we consider the problem of universal learning by ERM for binary classification under the agnostic setting, where the "learning curve" reflects the decay of the excess risk as the sample size increases. We explore the possibilities of agnostic universal rates and reveal a compact trichotomy: there are three possible agnostic universal rates of ERM, being either $e^{-n}$, $o(n^{-1/2})$, or arbitrarily slow. We provide a complete characterization of which concept classes fall into each of these categories. Moreover, we also establish complete characterizations for the target-dependent universal rates as well as the Bayes-dependent universal rates.
[LG-61] Estimation of Treatment Effects in Extreme and Unobserved Data
Link: https://arxiv.org/abs/2506.14051
Authors: Jiyuan Tan, Jose Blanchet, Vasilis Syrgkanis
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Abstract:Causal effect estimation seeks to determine the impact of an intervention from observational data. However, the existing causal inference literature primarily addresses treatment effects on frequently occurring events. But what if we are interested in estimating the effects of a policy intervention whose benefits, while potentially important, can only be observed and measured in rare yet impactful events, such as extreme climate events? The standard causal inference methodology is not designed for this type of inference since the events of interest may be scarce in the observed data and some degree of extrapolation is necessary. Extreme Value Theory (EVT) provides methodologies for analyzing statistical phenomena in such extreme regimes. We introduce a novel framework for assessing treatment effects in extreme data to capture the causal effect at the occurrence of rare events of interest. In particular, we employ the theory of multivariate regular variation to model extremities. We develop a consistent estimator for extreme treatment effects and present a rigorous non-asymptotic analysis of its performance. We illustrate the performance of our estimator using both synthetic and semi-synthetic data.
[LG-62] AI-Informed Model Analogs for Subseasonal-to-Seasonal Prediction
Link: https://arxiv.org/abs/2506.14022
Authors: Jacob B. Landsberg, Elizabeth A. Barnes, Matthew Newman
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 23 pages, 12 figures
Abstract:Subseasonal-to-seasonal forecasting is crucial for public health, disaster preparedness, and agriculture, and yet it remains a particularly challenging timescale to predict. We explore the use of an interpretable AI-informed model analog forecasting approach, previously employed on longer timescales, to improve S2S predictions. Using an artificial neural network, we learn a mask of weights to optimize analog selection and showcase its versatility across three varied prediction tasks: 1) classification of Week 3-4 Southern California summer temperatures; 2) regional regression of Month 1 midwestern U.S. summer temperatures; and 3) classification of Month 1-2 North Atlantic wintertime upper atmospheric winds. The AI-informed analogs outperform traditional analog forecasting approaches, as well as climatology and persistence baselines, for deterministic and probabilistic skill metrics on both climate model and reanalysis data. We find the analog ensembles built using the AI-informed approach also produce better predictions of temperature extremes and improve representation of forecast uncertainty. Finally, by using an interpretable-AI framework, we analyze the learned masks of weights to better understand S2S sources of predictability.
[LG-63] Evolutionary chemical learning in dimerization networks
Link: https://arxiv.org/abs/2506.14006
Authors: Alexei V. Tkachenko, Bortolo Matteo Mognetti, Sergei Maslov
Categories: Statistical Mechanics (cond-mat.stat-mech); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO); Data Analysis, Statistics and Probability (physics.data-an); Molecular Networks (q-bio.MN)
Comments: 7 pages, 5 figures + SI
Abstract:We present a novel framework for chemical learning based on Competitive Dimerization Networks (CDNs) - systems in which multiple molecular species, e.g. proteins or DNA/RNA oligomers, reversibly bind to form dimers. We show that these networks can be trained in vitro through directed evolution, enabling the implementation of complex learning tasks such as multiclass classification without digital hardware or explicit parameter tuning. Each molecular species functions analogously to a neuron, with binding affinities acting as tunable synaptic weights. A training protocol involving mutation, selection, and amplification of DNA-based components allows CDNs to robustly discriminate among noisy input patterns. The resulting classifiers exhibit strong output contrast and high mutual information between input and output, especially when guided by a contrast-enhancing loss function. Comparative analysis with in silico gradient descent training reveals closely correlated performance. These results establish CDNs as a promising platform for analog physical computation, bridging synthetic biology and machine learning, and advancing the development of adaptive, energy-efficient molecular computing systems.
[LG-64] Comparison of ConvNeXt and Vision-Language Models for Breast Density Assessment in Screening Mammography
Link: https://arxiv.org/abs/2506.13964
Authors: Yusdivia Molina-Román, David Gómez-Ortiz, Ernestina Menasalvas-Ruiz, José Gerardo Tamez-Peña, Alejandro Santos-Díaz
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures
Abstract:Mammographic breast density classification is essential for cancer risk assessment but remains challenging due to subjective interpretation and inter-observer variability. This study compares multimodal and CNN-based methods for automated classification using the BI-RADS system, evaluating BioMedCLIP and ConvNeXt across three learning scenarios: zero-shot classification, linear probing with textual descriptions, and fine-tuning with numerical labels. Results show that zero-shot classification achieved modest performance, while the fine-tuned ConvNeXt model outperformed the BioMedCLIP linear probe. Although linear probing demonstrated potential with pretrained embeddings, it was less effective than full fine-tuning. These findings suggest that despite the promise of multimodal learning, CNN-based models with end-to-end fine-tuning provide stronger performance for specialized medical imaging. The study underscores the need for more detailed textual representations and domain-specific adaptations in future radiology applications.
[LG-65] Projecting U.S. coastal storm surge risks and impacts with deep learning
Link: https://arxiv.org/abs/2506.13963
Authors: Julian R. Rice, Karthik Balaguru, Fadia Ticona Rollano, John Wilson, Brent Daniel, David Judi, Ning Sun, L. Ruby Leung
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Abstract:Storm surge is one of the deadliest hazards posed by tropical cyclones (TCs), yet assessing its current and future risk is difficult due to the phenomenon’s rarity and physical complexity. Recent advances in artificial intelligence applications to natural hazard modeling suggest a new avenue for addressing this problem. We utilize a deep learning storm surge model to efficiently estimate coastal surge risk in the United States from 900,000 synthetic TC events, accounting for projected changes in TC behavior and sea levels. The derived historical 100-year surge (the event with a 1% yearly exceedance probability) agrees well with historical observations and other modeling techniques. When coupled with an inundation model, we find that heightened TC intensities and sea levels by the end of the century result in a 50% increase in population at risk. Key findings include markedly heightened risk in Florida, and critical thresholds identified in Georgia and South Carolina.
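The abstract defines the 100-year surge as the event with a 1% yearly exceedance probability. The sketch below shows what that definition implies over a multi-year horizon. It assumes independent, identically distributed years, which is a simplification made for this example and not a claim from the paper; the function names are hypothetical.

```python
def return_period(annual_exceedance_prob: float) -> float:
    """Return period in years for a given yearly exceedance probability."""
    if not 0.0 < annual_exceedance_prob <= 1.0:
        raise ValueError("annual_exceedance_prob must be in (0, 1]")
    return 1.0 / annual_exceedance_prob


def prob_at_least_one(annual_exceedance_prob: float, years: int) -> float:
    """Chance of at least one exceedance in `years` years.

    Assumes each year is independent (an assumption of this sketch).
    """
    if not 0.0 <= annual_exceedance_prob <= 1.0:
        raise ValueError("annual_exceedance_prob must be in [0, 1]")
    if years < 0:
        raise ValueError("years must be non-negative")
    return 1.0 - (1.0 - annual_exceedance_prob) ** years


# A "100-year" event over a 30-year horizon.
risk_30_years = prob_at_least_one(0.01, 30)
```

Under these assumptions, a 100-year event has roughly a one-in-four chance of occurring at least once within 30 years, which is why the "100-year" label can understate the risk faced over the lifetime of a building.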
[LG-66] Bridging Unsupervised and Semi-Supervised Anomaly Detection: A Theoretically-Grounded and Practical Framework with Synthetic Anomalies
Link: https://arxiv.org/abs/2506.13955
Authors: Matthew Lau,Tian-Yi Zhou,Xiangchi Yuan,Jizhou Chen,Wenke Lee,Xiaoming Huo
Categories: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Applications (stat.AP)
Abstract:Anomaly detection (AD) is a critical task across domains such as cybersecurity and healthcare. In the unsupervised setting, an effective and theoretically-grounded principle is to train classifiers to distinguish normal data from (synthetic) anomalies. We extend this principle to semi-supervised AD, where training data also include a limited labeled subset of the anomalies that may be present at test time. We propose a theoretically-grounded and empirically effective framework for semi-supervised AD that combines known and synthetic anomalies during training. To analyze semi-supervised AD, we introduce the first mathematical formulation of semi-supervised AD, which generalizes unsupervised AD. Here, we show that synthetic anomalies enable (i) better anomaly modeling in low-density regions and (ii) optimal convergence guarantees for neural network classifiers – the first theoretical result for semi-supervised AD. We empirically validate our framework on five diverse benchmarks, observing consistent performance gains. These improvements also extend beyond our theoretical framework to other classification-based AD methods, validating the generalizability of the synthetic anomaly principle in AD.
[LG-67] Meta Optimality for Demographic Parity Constrained Regression via Post-Processing ICML2025
Link: https://arxiv.org/abs/2506.13947
Authors: Kazuto Fukuchi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: ICML2025
Abstract:We address the regression problem under the constraint of demographic parity, a commonly used fairness definition. Recent studies have revealed fair minimax optimal regression algorithms, the most accurate algorithms that adhere to the fairness constraint. However, these analyses are tightly coupled with specific data generation models. In this paper, we provide meta-theorems that can be applied to various situations to validate the fair minimax optimality of the corresponding regression algorithms. Furthermore, we demonstrate that fair minimax optimal regression can be achieved through post-processing methods, allowing researchers and practitioners to focus on improving conventional regression techniques, which can then be efficiently adapted for fair regression.
[LG-68] Rademacher learning rates for iterated random functions
Link: https://arxiv.org/abs/2506.13946
Authors: Nikola Sandrić
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Abstract:Most existing literature on supervised machine learning assumes that the training dataset is drawn from an i.i.d. sample. However, many real-world problems exhibit temporal dependence and strong correlations between the marginal distributions of the data-generating process, suggesting that the i.i.d. assumption is often unrealistic. In such cases, models naturally include time-series processes with mixing properties, as well as irreducible and aperiodic ergodic Markov chains. Moreover, the learning rates typically obtained in these settings are independent of the data distribution, which can lead to restrictive choices of hypothesis classes and suboptimal sample complexities for the learning algorithm. In this article, we consider the case where the training dataset is generated by an iterated random function (i.e., an iteratively defined time-homogeneous Markov chain) that is not necessarily irreducible or aperiodic. Under the assumption that the governing function is contractive with respect to its first argument and subject to certain regularity conditions on the hypothesis class, we first establish a uniform convergence result for the corresponding sample error. We then demonstrate the learnability of the approximate empirical risk minimization algorithm and derive its learning rate bound. Both rates are data-distribution dependent, expressed in terms of the Rademacher complexities of the underlying hypothesis class, allowing them to more accurately reflect the properties of the data-generating distribution.
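The abstract states its rates in terms of the Rademacher complexity of the hypothesis class. As a rough illustration of that quantity, the sketch below computes the textbook empirical Rademacher complexity of a small finite class exactly, by enumerating every sign vector. This is the standard definition over a fixed sample; the paper's Markov-chain setting may use a variant, and the function name and toy inputs are hypothetical.

```python
from itertools import product


def empirical_rademacher(function_values):
    """Exact empirical Rademacher complexity of a finite hypothesis class.

    `function_values` holds one list per hypothesis, giving that
    hypothesis's outputs on the same n sample points. The result is the
    average over all 2^n sign vectors of the best signed correlation:
    E_sigma[ max_f (1/n) * sum_i sigma_i * f(z_i) ].
    Only practical for small n, since the enumeration is exponential.
    """
    n = len(function_values[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        total += max(
            sum(s * v for s, v in zip(signs, values)) / n
            for values in function_values
        )
    return total / 2 ** n


# Toy class: a constant +1 predictor and its negation, on 4 samples.
toy_class = [[1, 1, 1, 1], [-1, -1, -1, -1]]
complexity = empirical_rademacher(toy_class)
```

A class with a single hypothesis cannot adapt to random signs, so its complexity is zero; richer classes fit random signs better and score higher.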
[LG-69] Connecting phases of matter to the flatness of the loss landscape in analog variational quantum algorithms
Link: https://arxiv.org/abs/2506.13865
Authors: Kasidit Srimahajariyapong,Supanut Thanasilp,Thiparat Chotibut
Categories: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Comments: 15+7 pages, 7+5 figures
Abstract:Variational quantum algorithms (VQAs) promise near-term quantum advantage, yet parametrized quantum states commonly built from the digital gate-based approach often suffer from scalability issues such as barren plateaus, where the loss landscape becomes flat. We study an analog VQA ansatz composed of M quenches of a disordered Ising chain, whose dynamics is native to several quantum simulation platforms. By tuning the disorder strength we place each quench in either a thermalized phase or a many-body-localized (MBL) phase and analyse (i) the ansatz’s expressivity and (ii) the scaling of loss variance. Numerics show that both phases reach maximal expressivity at large M, but barren plateaus emerge at far smaller M in the thermalized phase than in the MBL phase. Exploiting this gap, we propose an MBL initialisation strategy: initialise the ansatz in the MBL regime at an intermediate number of quenches M, enabling initial trainability while retaining sufficient expressivity for subsequent optimization. The results link quantum phases of matter and VQA trainability, and provide practical guidelines for scaling analog-hardware VQAs.
[LG-70] A Silent Speech Decoding System from EEG and EMG with Heterogenous Electrode Configurations INTERSPEECH2025
Link: https://arxiv.org/abs/2506.13835
Authors: Masakazu Inoue,Motoshige Sato,Kenichi Tomeoka,Nathania Nah,Eri Hatakeyama,Kai Arulkumaran,Ilya Horiguchi,Shuntaro Sasai
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: Accepted for presentation at Interspeech 2025. 5 pages, 4 figures, 2 tables
Abstract:Silent speech decoding, which performs unvocalized human speech recognition from electroencephalography/electromyography (EEG/EMG), increases accessibility for speech-impaired humans. However, data collection is difficult and performed using varying experimental setups, making it nontrivial to collect a large, homogeneous dataset. In this study we introduce neural networks that can handle EEG/EMG with heterogeneous electrode placements and show strong performance in silent speech decoding via multi-task training on large-scale EEG/EMG datasets. We achieve improved word classification accuracy in both healthy participants (95.3%), and a speech-impaired patient (54.5%), substantially outperforming models trained on single-subject data (70.1% and 13.2%). Moreover, our models also show gains in cross-language calibration performance. This increase in accuracy suggests the feasibility of developing practical silent speech decoding systems, particularly for speech-impaired patients.
[LG-71] Infected Smallville: How Disease Threat Shapes Sociality in LLM Agents
Link: https://arxiv.org/abs/2506.13783
Authors: Soyeon Choi,Kangwook Lee,Oliver Sng,Joshua M. Ackerman
Categories: Physics and Society (physics.soc-ph); Machine Learning (cs.LG)
Comments: 8 pages
Abstract:How does the threat of infectious disease influence sociality among generative agents? We used generative agent-based modeling (GABM), powered by large language models, to experimentally test hypotheses about the behavioral immune system. Across three simulation runs, generative agents who read news about an infectious disease outbreak showed significantly reduced social engagement compared to agents who received no such news, including lower attendance at a social gathering, fewer visits to third places (e.g., cafe, store, park), and fewer conversations throughout the town. In interview responses, agents explicitly attributed their behavioral changes to disease-avoidance motivations. A validity check further indicated that they could distinguish between infectious and noninfectious diseases, selectively reducing social engagement only when there was a risk of infection. Our findings highlight the potential of GABM as an experimental tool for exploring complex human social dynamics at scale.
Information Retrieval
[IR-0] A Systematic Replicability and Comparative Study of BSARec and SASRec for Sequential Recommendation
Link: https://arxiv.org/abs/2506.14692
Authors: Chiara D’Ercoli,Giulia Di Teodoro,Federico Siciliano
Categories: Information Retrieval (cs.IR)
Abstract:This study compares two sequential recommender systems, Self-Attention based Sequential Recommendation (SASRec) and Beyond Self-Attention based Sequential Recommendation (BSARec), to assess the improvement that frequency enhancement, the element added in BSARec, brings to recommendations. The models in the study have been re-implemented with a common base structure from EasyRec, with the aim of obtaining a fair and reproducible comparison. The results show that BSARec, by including bias terms for frequency enhancement, does indeed outperform SASRec, although the performance gains are not as high as those reported by the original authors. This work aims to offer an overview of existing methods and, most importantly, to underline the importance of implementation details for performance comparison.
[IR-1] RMIT-ADM+S at the SIGIR 2025 LiveRAG Challenge SIGIR2025
Link: https://arxiv.org/abs/2506.14516
Authors: Kun Ran,Shuoqi Sun,Khoi Nguyen Dinh Anh,Damiano Spina,Oleg Zendel
Categories: Information Retrieval (cs.IR)
Comments: Accepted for oral presentation at SIGIR 2025 LiveRAG
Abstract:This paper presents the RMIT–ADM+S participation in the SIGIR 2025 LiveRAG Challenge. Our Generation-Retrieval-Augmented Generation (GRAG) approach relies on generating a hypothetical answer that is used in the retrieval phase, alongside the original question. GRAG also incorporates a pointwise large language model (LLM)-based re-ranking step prior to final answer generation. We describe the system architecture and the rationale behind our design choices. In particular, a systematic evaluation using the Grid of Points (GoP) framework and N-way ANOVA enabled comparison across multiple configurations, including query variant generation, question decomposition, rank fusion strategies, and prompting techniques for answer generation. Our system achieved a Relevance score of 1.199 and a Faithfulness score of 0.477 on the private leaderboard, placing among the top four finalists in the LiveRAG 2025 Challenge.
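The abstract describes retrieving with both the original question and a generated hypothetical answer, and lists rank fusion strategies among the configurations compared. The sketch below shows one way two such result lists could be merged, using reciprocal rank fusion (RRF), a widely used fusion method. Whether GRAG uses RRF is not stated in the abstract; the constant `k=60`, the document ids, and the function name are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears
    in, with ranks starting at 1. Ties are broken by document id so the
    output order is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical hits from the two query variants.
question_hits = ["d1", "d2", "d3"]
hypothetical_answer_hits = ["d2", "d4", "d1"]
fused = reciprocal_rank_fusion([question_hits, hypothetical_answer_hits])
```

Documents retrieved by both query variants rise to the top of the fused list, which is the usual motivation for fusing rather than concatenating the two result sets.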
[IR-2] Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval INTERSPEECH2025
Link: https://arxiv.org/abs/2506.14445
Authors: Ruofan Hu,Yan Xia,Minjie Hong,Jieming Zhu,Bo Chen,Xiaoda Yang,Minghui Fang,Tao Jin
Categories: Information Retrieval (cs.IR)
Comments: Accepted by Interspeech 2025
Abstract:Multimodal large language models (MLLMs) have seen substantial progress in recent years. However, their ability to represent multimodal information in the acoustic domain remains underexplored. In this work, we introduce Vela, a novel framework designed to adapt MLLMs for the generation of universal multimodal embeddings. By leveraging MLLMs with specially crafted prompts and selected in-context learning examples, Vela effectively bridges the modality gap across various modalities. We then propose a single-modality training approach, where the model is trained exclusively on text pairs. Our experiments show that Vela outperforms traditional CLAP models in standard text-audio retrieval tasks. Furthermore, we introduce new benchmarks that expose CLAP models’ limitations in handling long texts and complex retrieval tasks. In contrast, Vela, by harnessing the capabilities of MLLMs, demonstrates robust performance in these scenarios. Our code will soon be available.
[IR-3] Similarity = Value? Consultation Value Assessment and Alignment for Personalized Search
Link: https://arxiv.org/abs/2506.14437
Authors: Weicong Qin,Yi Xu,Weijie Yu,Teng Shi,Chenglei Shen,Ming He,Jianping Fan,Xiao Zhang,Jun Xu
Categories: Information Retrieval (cs.IR)
Abstract:Personalized search systems in e-commerce platforms increasingly involve user interactions with AI assistants, where users consult about products, usage scenarios, and more. Leveraging consultation to personalize search services is trending. Existing methods typically rely on semantic similarity to align historical consultations with current queries due to the absence of ‘value’ labels, but we observe that semantic similarity alone often fails to capture the true value of consultation for personalization. To address this, we propose a consultation value assessment framework that evaluates historical consultations from three novel perspectives: (1) Scenario Scope Value, (2) Posterior Action Value, and (3) Time Decay Value. Based on this, we introduce VAPS, a value-aware personalized search model that selectively incorporates high-value consultations through a consultation-user action interaction module and an explicit objective that aligns consultations with user actions. Experiments on both public and commercial datasets show that VAPS consistently outperforms baselines in both retrieval and ranking tasks.
[IR-4] hyperFA*IR: A hypergeometric approach to fair rankings with finite candidate pool
Link: https://arxiv.org/abs/2506.14349
Authors: Mauritz N. Cartier van Dissel,Samuel Martin-Gutierrez,Lisette Espín-Noboa,Ana María Jaramillo,Fariba Karimi
Categories: Computers and Society (cs.CY); Information Retrieval (cs.IR); Applications (stat.AP)
Comments: In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT’25)
Abstract:Ranking algorithms play a pivotal role in decision-making processes across diverse domains, from search engines to job applications. When rankings directly impact individuals, ensuring fairness becomes essential, particularly for groups that are marginalised or misrepresented in the data. Most existing group fairness frameworks rely on ensuring proportional representation of protected groups. However, these approaches face limitations in accounting for the stochastic nature of ranking processes or the finite size of candidate pools. To this end, we present hyperFA*IR, a framework for assessing and enforcing fairness in rankings drawn from a finite set of candidates. It relies on a generative process based on the hypergeometric distribution, which models real-world scenarios by sampling without replacement from fixed group sizes. This approach improves fairness assessment when top-k selections are large relative to the pool or when protected groups are small. We compare our approach to the widely used binomial model, which treats each draw as independent with fixed probability, and demonstrate, both analytically and empirically, that our method more accurately reproduces the statistical properties of sampling from a finite population. To operationalise this framework, we propose a Monte Carlo-based algorithm that efficiently detects unfair rankings by avoiding computationally expensive parameter tuning. Finally, we adapt our generative approach to define affirmative action policies by introducing weights into the sampling process.
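The abstract contrasts a hypergeometric model (sampling without replacement from fixed group sizes) with the binomial model (independent draws with fixed probability). The sketch below compares the two on a small pool to show how far they can diverge when the top-k is large relative to the pool. It illustrates only the two distributions and is not the paper's fairness test; the function names and the toy pool are hypothetical.

```python
from math import comb


def hypergeom_cdf(k, pool_size, protected_in_pool, draws):
    """P(X <= k) protected candidates among `draws` picks without replacement."""
    total = comb(pool_size, draws)
    favourable = sum(
        comb(protected_in_pool, i) * comb(pool_size - protected_in_pool, draws - i)
        for i in range(min(k, draws) + 1)
    )
    return favourable / total


def binom_cdf(k, p, draws):
    """P(X <= k) protected candidates among `draws` independent picks."""
    return sum(
        comb(draws, i) * p ** i * (1 - p) ** (draws - i)
        for i in range(min(k, draws) + 1)
    )


# Toy pool: 20 candidates, 5 protected, top-10 selected.
p_none_hypergeom = hypergeom_cdf(0, pool_size=20, protected_in_pool=5, draws=10)
p_none_binom = binom_cdf(0, p=5 / 20, draws=10)
```

In this toy setting the binomial model assigns a noticeably higher probability to a top-10 with no protected candidates than the hypergeometric model does, so a test calibrated on the binomial model would be more lenient towards such a ranking.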
[IR-5] LLM-Driven Data Generation and a Novel Soft Metric for Evaluating Text-to-SQL in Aviation MRO
Link: https://arxiv.org/abs/2506.13785
Authors: Patrick Sutanto,Jonathan Kenrick,Max Lorenz,Joan Santoso
Categories: Databases (cs.DB); Information Retrieval (cs.IR)
Abstract:The application of Large Language Models (LLMs) to text-to-SQL tasks promises to democratize data access, particularly in critical industries like aviation Maintenance, Repair, and Operation (MRO). However, progress is hindered by two key challenges: the rigidity of conventional evaluation metrics such as execution accuracy, which offer coarse, binary feedback, and the scarcity of domain-specific evaluation datasets. This paper addresses these gaps. To enable more nuanced assessment, we introduce a novel F1-score-based ‘soft’ metric that quantifies the informational overlap between generated and ground-truth SQL results. To address data scarcity, we propose an LLM-driven pipeline that synthesizes realistic question-SQL pairs from database schemas. We demonstrate our contributions through an empirical evaluation on an authentic MRO database. Our experiments show that the proposed soft metric provides more insightful performance analysis than strict accuracy, and our data generation technique is effective in creating a domain-specific benchmark. Together, these contributions offer a robust framework for evaluating and advancing text-to-SQL systems in specialized environments.
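The abstract describes an F1-score-based "soft" metric that measures the informational overlap between generated and ground-truth SQL results, in contrast to binary execution accuracy. The sketch below shows one plausible reading, assuming each result is a multiset of rows and overlap is counted per row. The paper's exact definition is not given in the abstract, so the function name, the row-level granularity, and the sample rows are assumptions.

```python
from collections import Counter


def soft_f1(predicted_rows, gold_rows):
    """Row-level F1 between two SQL result sets, treated as multisets.

    Returns 1.0 when both results are empty and 0.0 when no rows match.
    """
    predicted = Counter(tuple(row) for row in predicted_rows)
    gold = Counter(tuple(row) for row in gold_rows)
    if not predicted and not gold:
        return 1.0
    # Multiset intersection keeps the minimum count of each shared row.
    overlap = sum((predicted & gold).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(predicted.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical query results: (aircraft type, open work orders).
gold_result = [("A320", 3), ("B737", 5)]
predicted_result = [("A320", 3), ("B737", 4), ("A350", 1)]
score = soft_f1(predicted_result, gold_result)
```

Strict execution accuracy would score this prediction as a plain failure, while the row-level F1 gives partial credit for the one row that matches.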