本篇博文主要内容为 2025-06-16 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。
友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-06-16)
今日共更新618篇论文,其中:
- 自然语言处理共132篇(Computation and Language (cs.CL))
- 人工智能共203篇(Artificial Intelligence (cs.AI))
- 计算机视觉共126篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共195篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] code_transformed: The Influence of Large Language Models on Code
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)是否改变了代码风格,以及这种改变如何被表征。论文的关键解决方案是通过分析超过19,000个与2020年至2025年间发表的arXiv论文相关的GitHub仓库中的代码,研究LLMs对代码风格的影响,包括命名约定、复杂性、可维护性和相似性等方面,并识别出与LLMs生成代码特征相符的可测量趋势。
链接: https://arxiv.org/abs/2506.12014
作者: Yuliang Xu,Siming Huang,Mingmeng Geng,Yao Wan,Xuanhua Shi,Dongping Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: We release all the experimental dataset and source code at: this https URL
Abstract:Coding remains one of the most fundamental modes of interaction between humans and machines. With the rapid advancement of Large Language Models (LLMs), code generation capabilities have begun to significantly reshape programming practices. This development prompts a central question: Have LLMs transformed code style, and how can such transformation be characterized? In this paper, we present a pioneering study that investigates the impact of LLMs on code style, with a focus on naming conventions, complexity, maintainability, and similarity. By analyzing code from over 19,000 GitHub repositories linked to arXiv papers published between 2020 and 2025, we identify measurable trends in the evolution of coding style that align with characteristics of LLM-generated code. For instance, the proportion of snake_case variable names in Python code increased from 47% in Q1 2023 to 51% in Q1 2025. Furthermore, we investigate how LLMs approach algorithmic problems by examining their reasoning processes. Given the diversity of LLMs and usage scenarios, among other factors, it is difficult or even impossible to precisely estimate the proportion of code generated or assisted by LLMs. Our experimental results provide the first large-scale empirical evidence that LLMs affect real-world programming style.
zh
[NLP-1] Generative Representational Learning of Foundation Models for Recommendation
【速读】: 该论文旨在解决推荐系统中基础模型在多任务学习过程中面临的知识共享冲突、收敛速度不一致以及嵌入任务被忽视等问题。其解决方案的关键在于提出了一种名为RecFound的生成表征学习框架,该框架包含三个核心组件:面向任务的知识共享冲突处理的Task-wise Mixture of Low-rank Experts (TMoLE),用于解决收敛速度不一致的Step-wise Convergence-oriented Sample Scheduler (S2Sched),以及用于平衡任务性能的Model Merge模块。
链接: https://arxiv.org/abs/2506.11999
作者: Zheli Zhou,Chenxu Zhu,Jianghao Lin,Bo Chen,Ruiming Tang,Weinan Zhang,Yong Yu
机构: Shanghai Jiao Tong University (上海交通大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Project page is available at this https URL
Abstract:Developing a single foundation model with the capability to excel across diverse tasks has been a long-standing objective in the field of artificial intelligence. As the wave of general-purpose foundation models sweeps across various domains, their influence has significantly extended to the field of recommendation systems. While recent efforts have explored recommendation foundation models for various generative tasks, they often overlook crucial embedding tasks and struggle with the complexities of multi-task learning, including knowledge sharing conflict resolution, and convergence speed inconsistencies. To address these limitations, we introduce RecFound, a generative representational learning framework for recommendation foundation models. We construct the first comprehensive dataset for recommendation foundation models covering both generative and embedding tasks across diverse scenarios. Based on this dataset, we propose a novel multi-task training scheme featuring a Task-wise Mixture of Low-rank Experts (TMoLE) to handle knowledge sharing conflict, a Step-wise Convergence-oriented Sample Scheduler (S2Sched) to address inconsistent convergence, and a Model Merge module to balance the performance across tasks. Experiments demonstrate that RecFound achieves state-of-the-art performance across various recommendation tasks, outperforming existing baselines.
zh
[NLP-2] VGR: Visual Grounded Reasoning
【速读】: 该论文试图解决现有多模态链式思维(multimodal chain-of-thought, CoT)推理方法主要依赖纯语言空间推理所导致的语言偏差问题,以及其在处理需要全面理解图像细节的复杂视觉推理任务时的能力局限。解决方案的关键在于提出一种名为VGR的新型多模态大语言模型(MLLM),该模型具备增强的细粒度视觉感知能力,通过首先检测可能有助于解决问题的相关图像区域,并基于重放的图像区域提供精确答案,从而实现更有效的多模态推理。
链接: https://arxiv.org/abs/2506.11991
作者: Jiacong Wang,Zijiang Kang,Haochen Wang,Haiyong Jiang,Jiawen Li,Bohong Wu,Ya Wang,Jiao Ran,Xiao Liang,Chao Feng,Jun Xiao
机构: ByteDance Inc. (字节跳动公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 4 figures
Abstract:In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning on pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer the question or reasoning solely on the language space, our VGR first detects relevant regions that may help to solve problems, and then provides precise answers based on replayed image regions. To achieve this, we conduct a large-scale SFT dataset called VGR -SFT that contains reasoning data with mixed vision grounding and language deduction. The inference pipeline of VGR allows the model to choose bounding boxes for visual reference and a replay stage is introduced to integrates the corresponding regions into the reasoning process, enhancing multimodel comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multi-modal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image token count while delivering scores of +4.1 on MMStar, +7.1 on AI2D, and a +12.9 improvement on ChartQA.
zh
[NLP-3] Schema-R1: A reasoning training approach for schema linking in Text-to-SQL Task
【速读】: 该论文旨在解决现有Schema链接模型在微调过程中采用机械学习范式,过度优化真实标签结果而牺牲了推理能力的问题。其关键解决方案是提出Schema-R1,一个基于强化学习训练的推理型Schema链接模型,该模型通过构建高质量推理样本、监督微调以及基于规则的强化学习训练三个关键步骤,有效提升了模型的推理能力。
链接: https://arxiv.org/abs/2506.11986
作者: Wuzhenghong Wen,Su Pan,yuwei Sun
机构: Nanjing University of Posts and Telecommunications (南京邮电大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
备注: 11 pages, 3 figures, conference
Abstract:Schema linking is a critical step in Text-to-SQL task, aiming to accurately predict the table names and column names required for the SQL query based on the given question. However, current fine-tuning approaches for schema linking models employ a rote-learning paradigm, excessively optimizing for ground truth schema linking outcomes while compromising reasoning ability. This limitation arises because of the difficulty in acquiring a high-quality reasoning sample for downstream tasks. To address this, we propose Schema-R1, a reasoning schema linking model trained using reinforcement learning. Specifically, Schema-R1 consists of three key steps: constructing small batches of high-quality reasoning samples, supervised fine-tuning for cold-start initialization, and rule-based reinforcement learning training. The final results demonstrate that our method effectively enhances the reasoning ability of the schema linking model, achieving a 10% improvement in filter accuracy compared to the existing method. Our code is available at this https URL.
zh
[NLP-4] Improving Large Language Model Safety with Contrastive Representation Learning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对各种未知和不受控制的输入时,容易受到对抗攻击的问题。其解决方案的关键在于将模型防御建模为对比表示学习(contrastive representation learning, CRL)问题,通过使用基于三元组的损失函数结合对抗性硬负样本挖掘,促进良性表示与有害表示之间的分离,从而提升模型对输入级和嵌入空间攻击的鲁棒性。
链接: https://arxiv.org/abs/2506.11938
作者: Samuel Simko,Mrinmaya Sachan,Bernhard Schölkopf,Zhijing Jin
机构: ETH Zurich (苏黎世联邦理工学院); MPI for Intelligent Systems (马克斯·普朗克智能系统研究所); MPI & University of Toronto (马克斯·普朗克研究所与多伦多大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance. Our code is available at this https URL
zh
[NLP-5] Feedback Friction: LLM s Struggle to Fully Incorporate External Feedback
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在接收外部反馈时表现出的整合能力不足问题,即FEEDBACK FRICTION。研究通过设计一个受控实验环境,让求解器模型在获得接近完美和完整的反馈后重新尝试解决问题,以系统性地评估其反馈整合能力。解决方案的关键在于构建一个包含求解器、反馈生成器和重复求解过程的评估管道,并通过多种采样策略尝试缓解反馈摩擦,从而揭示模型在反馈整合中的局限性及其潜在原因。
链接: https://arxiv.org/abs/2506.11930
作者: Dongwei Jiang,Alvin Zhang,Andrew Wang,Nicholas Andrews,Daniel Khashabi
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent studies have shown LLMs possess some ability to improve their responses when given external feedback. However, it remains unclear how effectively and thoroughly these models can incorporate extrinsic feedback. In an ideal scenario, if LLMs receive near-perfect and complete feedback, we would expect them to fully integrate the feedback and change their incorrect answers to correct ones. In this paper, we systematically investigate LLMs’ ability to incorporate feedback by designing a controlled experimental environment. For each problem, a solver model attempts a solution, then a feedback generator with access to near-complete ground-truth answers produces targeted feedback, after which the solver tries again. We evaluate this pipeline across a diverse range of tasks, including math reasoning, knowledge reasoning, scientific reasoning, and general multi-domain evaluations with state-of-the-art language models including Claude 3.7 (with and without extended thinking). Surprisingly, even under these near-ideal conditions, solver models consistently show resistance to feedback, a limitation that we term FEEDBACK FRICTION. To mitigate this limitation, we experiment with sampling-based strategies like progressive temperature increases and explicit rejection of previously attempted incorrect answers, which yield improvements but still fail to help models achieve target performance. We also perform a rigorous exploration of potential causes of FEEDBACK FRICTION, ruling out factors such as model overconfidence and data familiarity. We hope that highlighting this issue in LLMs and ruling out several apparent causes will help future research in self-improvement.
zh
[NLP-6] LiveCodeBench Pro: How Do Olympiad Medalists Judge LLM s in Competitive Programming?
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在算法竞赛中的表现是否真正超越人类专家的问题,以及揭示LLMs与人类专家之间的差异和局限性。其解决方案的关键在于构建了一个名为LiveCodeBench Pro的基准测试集,该基准包含来自Codeforces、ICPC和IOI的持续更新问题,以减少数据污染,并由国际竞赛金牌得主对每个问题进行算法分类标注及对模型生成的失败提交进行逐行分析,从而提供细粒度的诊断信息,用于指导代码导向型LLM推理能力的改进。
链接: https://arxiv.org/abs/2506.11928
作者: Zihan Zheng,Zerui Cheng,Zeyu Shen,Shang Zhou,Kaiyuan Liu,Hansen He,Dongruixuan Li,Stanley Wei,Hangyi Hao,Jianzhu Yao,Peiyao Sheng,Zixuan Wang,Wenhao Chai,Aleksandra Korolova,Peter Henderson,Sanjeev Arora,Pramod Viswanath,Jingbo Shang,Saining Xie
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Project Page at this https URL
Abstract:Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. Drawing on knowledge from a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI that are continuously updated to reduce the likelihood of data contamination. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions. Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems and 0% on hard problems, domains where expert humans still excel. We also find that LLMs succeed at implementation-heavy problems but struggle with nuanced algorithmic reasoning and complex case analysis, often generating confidently incorrect justifications. High performance appears largely driven by implementation precision and tool augmentation, not superior reasoning. LiveCodeBench Pro thus highlights the significant gap to human grandmaster levels, while offering fine-grained diagnostics to steer future improvements in code-centric LLM reasoning.
zh
[NLP-7] Effectiveness of Counter-Speech against Abusive Content: A Multidimensional Annotation and Classification Study
【速读】: 该论文试图解决在线仇恨言论(Hate Speech, HS)中反言论(Counter-speech, CS)有效性的评估标准不明确的问题。解决方案的关键在于提出一个基于社会科学研究概念的计算框架,该框架定义了六个核心维度——清晰度、证据支持、情感吸引力、反驳力度、受众适应性和公平性,并利用这些维度对4,214条CS实例进行标注,构建了一个新的语言学资源。此外,论文还提出了两种分类策略——多任务和依赖关系驱动的方法,在专家撰写和用户撰写的CS数据上均取得了较高的F1分数(分别为0.94和0.96),并揭示了各维度之间的强相关性。
链接: https://arxiv.org/abs/2506.11919
作者: Greta Damo,Elena Cabrio,Serena Villata
机构: Université Côte d’Azur, CNRS, Inria, I3S, France
类目: Computation and Language (cs.CL)
备注:
Abstract:Counter-speech (CS) is a key strategy for mitigating online Hate Speech (HS), yet defining the criteria to assess its effectiveness remains an open challenge. We propose a novel computational framework for CS effectiveness classification, grounded in social science concepts. Our framework defines six core dimensions - Clarity, Evidence, Emotional Appeal, Rebuttal, Audience Adaptation, and Fairness - which we use to annotate 4,214 CS instances from two benchmark datasets, resulting in a novel linguistic resource released to the community. In addition, we propose two classification strategies, multi-task and dependency-based, achieving strong results (0.94 and 0.96 average F1 respectively on both expert- and user-written CS), outperforming standard baselines, and revealing strong interdependence among dimensions.
zh
[NLP-8] GeistBERT: Breathing Life into German NLP
【速读】: 该论文旨在提升德语自然语言处理(Natural Language Processing, NLP)的性能,通过在高质量德语文本语料上进行语言特定的预训练,结合更新的模型架构和针对德语语言特征优化的数据集。其解决方案的关键在于利用增量训练方法对多样化语料进行预训练,并采用Nyströmformer和Longformer架构扩展输入长度至8k token,同时基于预训练模型优化多种NLP任务的性能,最终在命名实体识别(NER)和文本分类任务中取得了优于现有基线模型的成果。
链接: https://arxiv.org/abs/2506.11903
作者: Raphael Scheible-Schmitt,Johann Frei
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Advances in transformer-based language models have highlighted the benefits of language-specific pre-training on high-quality corpora. In this context, German NLP stands to gain from updated architectures and modern datasets tailored to the linguistic characteristics of the German language. GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus and optimizing model performance across various NLP tasks. It was pre-trained using fairseq with standard hyperparameters, initialized from GottBERT weights, and trained on a large-scale German corpus using Whole Word Masking (WWM). Based on the pre-trained model, we derived extended-input variants using Nyströmformer and Longformer architectures with support for sequences up to 8k tokens. While these long-context models were not evaluated on dedicated long-context benchmarks, they are included in our release. We assessed all models on NER (CoNLL 2003, GermEval 2014) and text classification (GermEval 2018 fine/coarse, 10kGNAD) using F_1 score and accuracy. The GeistBERT models achieved strong performance, leading all tasks among the base models and setting a new state-of-the-art (SOTA). Notably, the base models outperformed larger models in several tasks. To support the German NLP research community, we are releasing GeistBERT under the MIT license.
zh
[NLP-9] reeRL: LLM Reinforcement Learning with On-Policy Tree Search ACL2025
【速读】: 该论文试图解决在基于大语言模型(Large Language Model, LLM)的强化学习(Reinforcement Learning, RL)中,传统链式采样策略在探索推理空间和提供密集、在线策略过程奖励方面的不足。其解决方案的关键在于提出TreeRL框架,该框架直接将在线策略树搜索(on-policy tree search)整合到RL训练过程中,通过中间监督机制避免了对单独奖励模型训练的需求,并采用一种成本效益更高的树搜索方法,通过从高不确定性中间步骤进行有策略的分支而非随机分支,提升了搜索效率。
链接: https://arxiv.org/abs/2506.11902
作者: Zhenyu Hou,Ziniu Hu,Yujiang Li,Rui Lu,Jie Tang,Yuxiao Dong
机构: Tsinghua University (清华大学); California Institute of Technology (加州理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted to ACL 2025 main conference
Abstract:Reinforcement learning (RL) with tree search has demonstrated superior performance in traditional reasoning tasks. Compared to conventional independent chain sampling strategies with outcome supervision, tree search enables better exploration of the reasoning space and provides dense, on-policy process rewards during RL training but remains under-explored in On-Policy LLM RL. We propose TreeRL, a reinforcement learning framework that directly incorporates on-policy tree search for RL training. Our approach includes intermediate supervision and eliminates the need for a separate reward model training. Existing approaches typically train a separate process reward model, which can suffer from distribution mismatch and reward hacking. We also introduce a cost-effective tree search approach that achieves higher search efficiency under the same generation token budget by strategically branching from high-uncertainty intermediate steps rather than using random branching. Experiments on challenging math and code reasoning benchmarks demonstrate that TreeRL achieves superior performance compared to traditional ChainRL, highlighting the potential of tree search for LLM. TreeRL is open-sourced at this https URL.
zh
[NLP-10] owards a Cascaded LLM Framework for Cost-effective Human-AI Decision-Making
【速读】: 该论文旨在解决人机协同决策中的平衡问题,即在预测准确性、知识获取与推理复杂性的成本,以及是否应放弃自动化回答并引入人类专家的置信度之间找到最优解。其解决方案的关键在于提出一种级联的大语言模型(Large Language Model, LLM)决策框架,该框架通过多层级专家协作实现任务的自适应委派:首先由基础模型生成候选答案,随后由更强大但成本更高的大模型进行验证或重生成,最后在模型级联无法确定时引入人类专家。该方法包含两个核心阶段:一是基于置信度分数决定是否接受基础模型的答案或使用大模型重新生成;二是判断级联模型的输出是否足够确定或是否需要人工干预。此外,框架还集成了在线学习机制,以持续利用人类反馈提升决策质量。
链接: https://arxiv.org/abs/2506.11887
作者: Claudio Fanconi,Mihaela van der Schaar
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Effective human-AI decision-making balances three key factors: the \textitcorrectness of predictions, the \textitcost of knowledge and reasoning complexity, and the confidence about whether to \textitabstain automated answers or involve human experts. In this work, we present a cascaded LLM decision framework that adaptively delegates tasks across multiple tiers of expertise – a base model for initial candidate answers, a more capable and knowledgeable (but costlier) large model, and a human expert for when the model cascade abstains. Our method proceeds in two stages. First, a deferral policy determines whether to accept the base model’s answer or regenerate it with the large model based on the confidence score. Second, an abstention policy decides whether the cascade model response is sufficiently certain or requires human intervention. Moreover, we incorporate an online learning mechanism in the framework that can leverage human feedback to improve decision quality over time. We demonstrate this approach to general question-answering (ARC-Easy and ARC-Challenge) and medical question-answering (MedQA and MedMCQA). Our results show that our cascaded strategy outperforms in most cases single-model baselines in accuracy while reducing cost and providing a principled way to handle abstentions.
zh
[NLP-11] Beyond Homogeneous Attention: Memory-Efficient LLM s via Fourier-Approximated KV Cache
【速读】: 该论文旨在解决大型语言模型在处理长上下文时面临的内存需求增长问题,特别是键值(Key-Value, KV)缓存带来的负担。现有压缩方法通过统一头维度或依赖注意力引导的标记剪枝来减少内存占用,但通常会牺牲准确性或引入计算开销。论文提出的解决方案是FourierAttention,其关键在于利用Transformer头维度的异质性:低维头优先关注局部上下文,而高维头捕捉长程依赖关系。通过将对长上下文不敏感的维度投影到正交的傅里叶基上,FourierAttention使用固定长度的频谱系数近似其时间演化,从而在保持准确性的前提下有效降低内存消耗。
链接: https://arxiv.org/abs/2506.11886
作者: Xiaoran Liu,Siyang He,Qiqi Wang,Ruixiao Li,Yuerong Song,Zhigeng Liu,Linlin Li,Qun Liu,Zengfeng Huang,Qipeng Guo,Ziwei He,Xipeng Qiu
机构: Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); Shanghai AI Lab (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注: 10 pages, 7 figures, work in progress
Abstract:Large Language Models struggle with memory demands from the growing Key-Value (KV) cache as context lengths increase. Existing compression methods homogenize head dimensions or rely on attention-guided token pruning, often sacrificing accuracy or introducing computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper ones capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). Besides, a custom Triton kernel, FlashFourierAttention, is designed to optimize memory via streamlined read-write operations, enabling efficient deployment without performance compromise.
zh
[NLP-12] Addressing Bias in LLM s: Strategies and Application to Fair AI-based Recruitment
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在高风险场景中可能继承数据中的性别偏见问题,特别是在基于AI的自动化招聘系统中。解决方案的关键在于提出一种隐私增强框架,通过从学习过程中移除性别信息,从而减少模型对数据中偏见的复制,有效缓解最终工具中的偏见行为。
链接: https://arxiv.org/abs/2506.11880
作者: Alejandro Peña,Julian Fierrez,Aythami Morales,Gonzalo Mancera,Miguel Lopez,Ruben Tolosana
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Submitted to AIES 2025 (Under Review)
Abstract:The use of language technologies in high-stake settings is increasing in recent years, mostly motivated by the success of Large Language Models (LLMs). However, despite the great performance of LLMs, they are are susceptible to ethical concerns, such as demographic biases, accountability, or privacy. This work seeks to analyze the capacity of Transformers-based systems to learn demographic biases present in the data, using a case study on AI-based automated recruitment. We propose a privacy-enhancing framework to reduce gender information from the learning pipeline as a way to mitigate biased behaviors in the final tools. Our experiments analyze the influence of data biases on systems built on two different LLMs, and how the proposed framework effectively prevents trained systems from reproducing the bias in the data.
zh
[NLP-13] Post Persona Alignment for Multi-Session Dialogue Generation
【速读】: 该论文试图解决多轮对话生成中长期一致性维护和生成多样且个性化的回复问题(multi-session persona-based dialogue generation)。现有方法通常在生成回复前检索角色信息,这限制了多样性并导致泛化输出。其解决方案的关键在于提出一种名为Post Persona Alignment (PPA) 的两阶段框架,该框架首先基于对话上下文生成通用回复,再利用该回复作为查询检索相关角色记忆,并最终调整回复以符合说话者的角色设定,从而在保持一致性和个性化的同时提升自然度和多样性。
链接: https://arxiv.org/abs/2506.11857
作者: Yi-Pei Chen,Noriki Nishida,Hideki Nakayama,Yuji Matsumoto
机构: RIKEN AIP(理化学研究所人工智能中心); The University of Tokyo(东京大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Multi-session persona-based dialogue generation presents challenges in maintaining long-term consistency and generating diverse, personalized responses. While large language models (LLMs) excel in single-session dialogues, they struggle to preserve persona fidelity and conversational coherence across extended interactions. Existing methods typically retrieve persona information before response generation, which can constrain diversity and result in generic outputs. We propose Post Persona Alignment (PPA), a novel two-stage framework that reverses this process. PPA first generates a general response based solely on dialogue context, then retrieves relevant persona memories using the response as a query, and finally refines the response to align with the speaker’s persona. This post-hoc alignment strategy promotes naturalness and diversity while preserving consistency and personalization. Experiments on multi-session LLM-generated dialogue data demonstrate that PPA significantly outperforms prior approaches in consistency, diversity, and persona relevance, offering a more flexible and effective paradigm for long-term personalized dialogue generation.
zh
[NLP-14] Rethinking Multilingual Vision-Language Translation: Dataset Evaluation and Adaptation
【速读】: 该论文旨在解决视觉-语言翻译(Vision-Language Translation, VLT)任务中模型性能评估不系统、数据质量不足以及评价指标可靠性差的问题。其解决方案的关键在于从数据质量、模型架构和评价指标三个核心角度进行深入研究:首先,提出并构建了高质量的多语言、平行且经过人工验证的数据集AibTrans以提升数据的语义和文化保真度;其次,对多种商业和开源大型视觉-语言模型进行了全面基准测试,揭示了其在光学字符识别(OCR)依赖性及生成与推理行为上的差异;最后,引入密度感知评价方法(Density-Aware Evaluation)和DA Score,以提高在不同上下文复杂度下的评价可靠性。通过这些研究,论文建立了新的VLT评估基准,并提出了平衡多语言微调策略以提升模型的跨语言性能。
链接: https://arxiv.org/abs/2506.11820
作者: Xintong Wang,Jingheng Pan,Yixiao Liu,Xiaohu Zhao,Chenyang Lyu,Minghao Wu,Chris Biemann,Longyue Wang,Linlong Xu,Weihua Luo,Kaifu Zhang
机构: Alibaba International Digital Commerce Group (阿里巴巴国际数字商业集团); Universität Hamburg, Department of Informatics (汉堡大学信息学院); Nanyang Technological University (南洋理工大学); Monash University (莫纳什大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Vision-Language Translation (VLT) is a challenging task that requires accurately recognizing multilingual text embedded in images and translating it into the target language with the support of visual context. While recent Large Vision-Language Models (LVLMs) have demonstrated strong multilingual and visual understanding capabilities, there is a lack of systematic evaluation and understanding of their performance on VLT. In this work, we present a comprehensive study of VLT from three key perspectives: data quality, model architecture, and evaluation metrics. (1) We identify critical limitations in existing datasets, particularly in semantic and cultural fidelity, and introduce AibTrans – a multilingual, parallel, human-verified dataset with OCR-corrected annotations. (2) We benchmark 11 commercial LVLMs/LLMs and 6 state-of-the-art open-source models across end-to-end and cascaded architectures, revealing their OCR dependency and contrasting generation versus reasoning behaviors. (3) We propose Density-Aware Evaluation to address metric reliability issues under varying contextual complexity, introducing the DA Score as a more robust measure of translation quality. Building upon these findings, we establish a new evaluation benchmark for VLT. Notably, we observe that fine-tuning LVLMs on high-resource language pairs degrades cross-lingual performance, and we propose a balanced multilingual fine-tuning strategy that effectively adapts LVLMs to VLT without sacrificing their generalization ability.
zh
[NLP-15] On the Performance of LLM s for Real Estate Appraisal ECML-PKDD2025
【速读】: 该论文试图解决房地产市场中存在的信息不对称问题,通过大型语言模型(Large Language Models, LLMs)生成具有竞争力且可解释的房价估计,以实现房地产洞察的民主化。解决方案的关键在于采用优化的上下文学习(In-Context Learning, ICL)策略,利用房屋属性和配套设施等享乐变量进行有效建模,并通过精心选择的上下文示例提升模型性能,从而在保持预测准确性的同时增强模型的可解释性和交互性。
链接: https://arxiv.org/abs/2506.11812
作者: Margot Geerts,Manon Reusens,Bart Baesens,Seppe vanden Broucke,Jochen De Weerdt
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ECML-PKDD 2025
Abstract:The real estate market is vital to global economies but suffers from significant information asymmetry. This study examines how Large Language Models (LLMs) can democratize access to real estate insights by generating competitive and interpretable house price estimates through optimized In-Context Learning (ICL) strategies. We systematically evaluate leading LLMs on diverse international housing datasets, comparing zero-shot, few-shot, market report-enhanced, and hybrid prompting techniques. Our results show that LLMs effectively leverage hedonic variables, such as property size and amenities, to produce meaningful estimates. While traditional machine learning models remain strong for pure predictive accuracy, LLMs offer a more accessible, interactive and interpretable alternative. Although self-explanations require cautious interpretation, we find that LLMs explain their predictions in agreement with state-of-the-art models, confirming their trustworthiness. Carefully selected in-context examples based on feature similarity and geographic proximity, significantly enhance LLM performance, yet LLMs struggle with overconfidence in price intervals and limited spatial reasoning. We offer practical guidance for structured prediction tasks through prompt optimization. Our findings highlight LLMs’ potential to improve transparency in real estate appraisal and provide actionable insights for stakeholders.
zh
[NLP-16] Are Multimodal Large Language Models Prag matically Competent Listeners in Simple Reference Resolution Tasks? ACL
【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在指代消解任务中的语言能力问题,特别是针对包含简单但抽象视觉刺激(如颜色块和颜色网格)的场景。研究认为,尽管此类任务对人类而言较为简单,但其对于评估MLLMs的语用能力具有高度相关性。解决方案的关键在于分析模型在上下文依赖的颜色描述解释等基础语用能力上的表现,从而揭示当前最先进的MLLMs在此类任务中的主要挑战。
链接: https://arxiv.org/abs/2506.11807
作者: Simeon Junker,Manar Ali,Larissa Koch,Sina Zarrieß,Hendrik Buschmeier
机构: Bielefeld University (比勒费尔德大学)
类目: Computation and Language (cs.CL)
备注: To appear in ACL Findings 2025
Abstract:We investigate the linguistic abilities of multimodal large language models in reference resolution tasks featuring simple yet abstract visual stimuli, such as color patches and color grids. Although the task may not seem challenging for today’s language models, being straightforward for human dyads, we consider it to be a highly relevant probe of the pragmatic capabilities of MLLMs. Our results and analyses indeed suggest that basic pragmatic capabilities, such as context-dependent interpretation of color descriptions, still constitute major challenges for state-of-the-art MLLMs.
zh
[NLP-17] Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
【速读】: 该论文试图解决如何利用生成式 AI (Generative AI) 模型通过零样本人格提示(zero-shot persona prompting)准确预测个体投票行为及欧洲群体在多项政策上的立场问题。其解决方案的关键在于通过有限信息的人格提示生成与实际政治人物行为相符的模拟投票行为,并通过加权 F1 分数评估预测效果,最终实现了对欧洲议会成员投票行为的合理模拟。
链接: https://arxiv.org/abs/2506.11798
作者: Maximilian Kreutner,Marlene Lutz,Markus Strohmaier
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse, but have been found to consistently display a progressive left-leaning bias. At the same time, so-called persona or identity prompts have been shown to produce LLM behavior that aligns with socioeconomic groups that the base model is not aligned with. In this work, we analyze whether zero-shot persona prompting with limited information can accurately predict individual voting decisions and, by aggregation, accurately predict positions of European groups on a diverse set of policies. We evaluate if predictions are stable towards counterfactual arguments, different persona prompts and generation methods. Finally, we find that we can simulate voting behavior of Members of the European Parliament reasonably well with a weighted F1 score of approximately 0.793. Our persona dataset of politicians in the 2024 European Parliament and our code are available at this https URL.
zh
[NLP-18] Long-Short Alignment for Effective Long-Context Modeling in LLM s ICML2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在长上下文建模中的长度泛化(length generalization)问题,即模型在处理比训练时见过的序列更长的数据时表现受限。解决方案的关键在于从传统的输入特征(如位置编码或数据结构)的关注转向模型输出分布的分析,提出了“长-短对齐”(long-short alignment)的概念,强调不同长度序列输出分布的一致性,并通过引入一种量化该现象的指标——长-短错位(Long-Short Misalignment)来验证其与长度泛化性能之间的强相关性,最终设计了一种促进长-短对齐的正则化项以提升模型的长上下文建模能力。
链接: https://arxiv.org/abs/2506.11769
作者: Tianqi Du,Haotian Huang,Yifei Wang,Yisen Wang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ICML 2025
Abstract:Large language models (LLMs) have exhibited impressive performance and surprising emergent properties. However, their effectiveness remains limited by the fixed context window of the transformer architecture, posing challenges for long-context modeling. Among these challenges, length generalization – the ability to generalize to sequences longer than those seen during training – is a classical and fundamental problem. In this work, we propose a fresh perspective on length generalization, shifting the focus from the conventional emphasis on input features such as positional encodings or data structures to the output distribution of the model. Specifically, through case studies on synthetic tasks, we highlight the critical role of \textbflong-short alignment – the consistency of output distributions across sequences of varying lengths. Extending this insight to natural language tasks, we propose a metric called Long-Short Misalignment to quantify this phenomenon, uncovering a strong correlation between the metric and length generalization performance. Building on these findings, we develop a regularization term that promotes long-short alignment during training. Extensive experiments validate the effectiveness of our approach, offering new insights for achieving more effective long-context modeling in LLMs. Code is available at this https URL.
zh
[NLP-19] DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
【速读】: 该论文试图解决当前缺乏对基于大语言模型的深度研究代理(Deep Research Agents, DRAs)进行全面评估的基准问题。现有方法无法系统性地衡量DRAs在多步骤网络探索、目标检索及高阶综合等方面的能力。解决方案的关键在于提出两个新颖的评估方法:一是基于参考的评估方法,采用自适应标准来评估生成研究报告的质量;二是通过评估有效引用数量和整体引用准确性来衡量DRAs的信息检索与收集能力。同时,作者开源了DeepResearch Bench及其关键组件,以推动实用化LLM-based代理的发展。
链接: https://arxiv.org/abs/2506.11763
作者: Mingxuan Du,Benfeng Xu,Chiwei Zhu,Xiaorui Wang,Zhendong Mao
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 31 pages, 5 figures
Abstract:Deep Research Agents are a prominent category of LLM-based agents. By autonomously orchestrating multistep web exploration, targeted retrieval, and higher-order synthesis, they transform vast amounts of online information into analyst-grade, citation-rich reports–compressing hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we present DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. Evaluating DRAs is inherently complex and labor-intensive. We therefore propose two novel methodologies that achieve strong alignment with human judgment. The first is a reference-based method with adaptive criteria to assess the quality of generated research reports. The other framework is introduced to evaluate DRA’s information retrieval and collection capabilities by assessing its effective citation count and overall citation accuracy. We have open-sourced DeepResearch Bench and key components of these frameworks at this https URL to accelerate the development of practical LLM-based agents.
zh
[NLP-20] DART: Distilling Autoregressive Reasoning to Silent Thought
【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)推理在大型语言模型(Large Language Models, LLMs)中因自回归范式导致的计算开销大、难以部署于延迟敏感应用的问题。其解决方案的关键在于提出一种自蒸馏框架DART(Distilling Autoregressive Reasoning to Silent Thought),通过引入两种训练路径:传统的CoT路径和直接从少量Silent Thought (ST) token生成答案的ST路径,其中ST路径利用轻量级的推理演化模块(Reasoning Evolvement Module, REM)使ST的隐藏状态与CoT路径对齐,从而让ST token演化为信息丰富的嵌入表示。在推理阶段仅激活ST路径,通过演化的ST token直接输出结果,实现高效推理。
链接: https://arxiv.org/abs/2506.11752
作者: Nan Jiang,Ziming Wu,De-Chuan Zhan,Fuming Lai,Shaobing Lian
机构: Tencent Inc.(腾讯公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:Chain-of-Thought (CoT) reasoning has significantly advanced Large Language Models (LLMs) in solving complex tasks. However, its autoregressive paradigm leads to significant computational overhead, hindering its deployment in latency-sensitive applications. To address this, we propose \textbfDART (\textbfDistilling \textbfAutoregressive \textbfReasoning to Silent \textbfThought), a self-distillation framework that enables LLMs to replace autoregressive CoT with non-autoregressive Silent Thought (ST). Specifically, DART introduces two training pathways: the CoT pathway for traditional reasoning and the ST pathway for generating answers directly from a few ST tokens. The ST pathway utilizes a lightweight Reasoning Evolvement Module (REM) to align its hidden states with the CoT pathway, enabling the ST tokens to evolve into informative embeddings. During inference, only the ST pathway is activated, leveraging evolving ST tokens to deliver the answer directly. Extensive experimental results demonstrate that DART achieves comparable reasoning performance to existing baselines while offering significant efficiency gains, serving as a feasible alternative for efficient reasoning.
zh
[NLP-21] Quizzard@INOVA Challenge 2025 – Track A: Plug-and-Play Technique in Interleaved Multi-Image Model
【速读】: 该论文旨在解决多模态任务中模型性能优化的问题,特别是针对Interleave任务的提升。其解决方案的关键在于将强大的基础模型与可插拔技术相结合,具体表现为在LLaVA-NeXT-Interleave中引入Dense Channel Integration (DCI)连接器,以增强模型在需要深层语义连贯性或结构化变化理解的任务中的表现。
链接: https://arxiv.org/abs/2506.11737
作者: Dinh Viet Cuong,Hoang-Bao Le,An Pham Ngoc Nguyen,Liting Zhou,Cathal Gurrin
机构: Dublin City University (都柏林城市大学); ADAPT Centre (ADAPT中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:
Abstract:This paper addresses two main objectives. Firstly, we demonstrate the impressive performance of the LLaVA-NeXT-interleave on 22 datasets across three different tasks: Multi-Image Reasoning, Documents and Knowledge-Based Understanding and Interactive Multi-Modal Communication. Secondly, we add the Dense Channel Integration (DCI) connector to the LLaVA-NeXT-Interleave and compare its performance against the standard model. We find that the standard model achieves the highest overall accuracy, excelling in vision-heavy tasks like VISION, NLVR2, and Fashion200K. Meanwhile, the DCI-enhanced version shows particular strength on datasets requiring deeper semantic coherence or structured change understanding such as MIT-States_PropertyCoherence and SlideVQA. Our results highlight the potential of combining powerful foundation models with plug-and-play techniques for Interleave tasks. The code is available at this https URL.
zh
[NLP-22] he Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference
【速读】: 该论文旨在解决深度学习(Deep Learning, DL)推理中传统高精度浮点计算向混合精度整数(Mixed-Precision Integer, MIP)计算迁移所带来的性能优化问题,特别是在现代指令集架构(Instruction Set Architectures, ISAs)上的通用矩阵-矩阵乘法(General Matrix-Matrix Multiplication, GEMM)优化。其解决方案的关键在于重新设计适用于MIP算术的微内核(micro-kernel)结构和数据布局,以充分利用当前专用硬件特性,并在多种主流CPU架构上实现显著的性能提升。这一研究标志着矩阵乘法优化进入由DL推理需求驱动的“寒武纪时期”。
链接: https://arxiv.org/abs/2506.11728
作者: Héctor Martínez,Adrián Castelló,Francisco D. Igual,Enrique S. Quintana-Ortí
机构: 未知
类目: Computation and Language (cs.CL)
备注: 16 pages, 7 tables, 7 figures
Abstract:Recent advances in deep learning (DL) have led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats, such as FP16, BF16, and 8- or 16-bit integers, combined with mixed-precision arithmetic. This transition enhances computational throughput, reduces memory and bandwidth usage, and improves energy efficiency, offering significant advantages for resource-constrained edge devices. To support this shift, hardware architectures have evolved accordingly, now including adapted ISAs (Instruction Set Architectures) that expose mixed-precision vector units and matrix engines tailored for DL workloads. At the heart of many DL and scientific computing tasks is the general matrix-matrix multiplication gemm, a fundamental kernel historically optimized using axpy vector instructions on SIMD (single instruction, multiple data) units. However, as hardware moves toward mixed-precision dot-product-centric operations optimized for quantized inference, these legacy approaches are being phased out. In response to this, our paper revisits traditional high-performance gemm and describes strategies for adapting it to mixed-precision integer (MIP) arithmetic across modern ISAs, including x86_64, ARM, and RISC-V. Concretely, we illustrate novel micro-kernel designs and data layouts that better exploit today’s specialized hardware and demonstrate significant performance gains from MIP arithmetic over floating-point implementations across three representative CPU architectures. These contributions highlight a new era of gemm optimization-driven by the demands of DL inference on heterogeneous architectures, marking what we term as the “Cambrian period” for matrix multiplication.
zh
[NLP-23] Configurable Preference Tuning with Rubric-Guided Synthetic Data ICML2025
【速读】: 该论文试图解决传统人类反馈模型在AI对齐中因采用单一、静态偏好而限制适应性的问题。其解决方案的关键在于引入可配置偏好调优(Configurable Preference Tuning, CPT),通过合成生成的偏好数据,结合结构化、细粒度的评分标准作为系统提示,使语言模型能够在推理阶段根据明确的人类可解释指令动态调整行为,而无需重新训练。
链接: https://arxiv.org/abs/2506.11702
作者: Víctor Gallego
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2025 Workshop on Models of Human Feedback for AI Alignment
Abstract:Models of human feedback for AI alignment, such as those underpinning Direct Preference Optimization (DPO), often bake in a singular, static set of preferences, limiting adaptability. This paper challenges the assumption of monolithic preferences by introducing Configurable Preference Tuning (CPT), a novel framework for endowing language models with the ability to dynamically adjust their behavior based on explicit, human-interpretable directives. CPT leverages synthetically generated preference data, conditioned on system prompts derived from structured, fine-grained rubrics that define desired attributes like writing style. By fine-tuning with these rubric-guided preferences, the LLM learns to modulate its outputs at inference time in response to the system prompt, without retraining. This approach not only offers fine-grained control but also provides a mechanism for modeling more nuanced and context-dependent human feedback. Several experimental artifacts, such as training code, generated datasets and fine-tuned models are released at this https URL
zh
[NLP-24] LLM s for Sentence Simplification: A Hybrid Multi-Agent prompting Approach
【速读】: 该论文试图解决将复杂句子转换为保持语义和逻辑完整性的逻辑简化句子序列的问题(sentence simplification)。其解决方案的关键在于提出一种混合方法,结合了先进的提示技术与多智能体架构,以增强句子简化过程的效果。
链接: https://arxiv.org/abs/2506.11681
作者: Pratibha Zunjare,Michael Hsiao
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper addresses the challenge of transforming complex sentences into sequences of logical, simplified sentences while preserving semantic and logical integrity with the help of Large Language Models. We propose a hybrid approach that combines advanced prompting with multi-agent architectures to enhance the sentence simplification process. Experimental results show that our approach was able to successfully simplify 70% of the complex sentences written for video game design application. In comparison, a single-agent approach attained a 48% success rate on the same task.
zh
[NLP-25] Improving Causal Interventions in Amnesic Probing with Mean Projection or LEACE
【速读】: 该论文试图解决在进行记忆缺失探测(Amnesic Probing)时,如何准确移除特定语言信息而不影响其他无关信息的问题。现有方法如迭代空空间投影(Iterative Nullspace Projection, INLP)在移除目标信息时会引入随机的表示修改,导致信息移除不够精准。论文提出的关键解决方案是采用均值投影(Mean Projection, MP)和LEACE两种替代方法,这些方法能够更精准地移除目标信息,从而提升通过记忆缺失探测获得行为解释的可靠性。
链接: https://arxiv.org/abs/2506.11673
作者: Alicja Dobrzeniecka,Antske Fokkens,Pia Sommerauer
机构: NASK - National Research Institute (NASK - 国家研究机构); Vrije Universiteit (自由大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Amnesic probing is a technique used to examine the influence of specific linguistic information on the behaviour of a model. This involves identifying and removing the relevant information and then assessing whether the model’s performance on the main task changes. If the removed information is relevant, the model’s performance should decline. The difficulty with this approach lies in removing only the target information while leaving other information unchanged. It has been shown that Iterative Nullspace Projection (INLP), a widely used removal technique, introduces random modifications to representations when eliminating target information. We demonstrate that Mean Projection (MP) and LEACE, two proposed alternatives, remove information in a more targeted manner, thereby enhancing the potential for obtaining behavioural explanations through Amnesic Probing.
zh
[NLP-26] Converting Annotated Clinical Cases into Structured Case Report Forms
【速读】: 该论文试图解决医学研究中用于确保临床研究结果准确性和可靠性的病例报告表(Case Report Forms, CRFs)数据集公开可用且标注良好的稀缺性问题,这限制了能够从临床笔记中填充CRF的槽位填充系统的发展。解决方案的关键在于利用已有的用于信息抽取任务的标注数据集,并将其转换为结构化的CRFs,提出了一种半自动的转换方法,成功应用于英语和意大利语的E3C数据集,生成了一个高质量的CRF槽位填充数据集。
链接: https://arxiv.org/abs/2506.11666
作者: Pietro Ferrazzi,Alberto Lavelli,Bernardo Magnini
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: to be published in BioNLP 2025
Abstract:Case Report Forms (CRFs) are largely used in medical research as they ensure accuracy, reliability, and validity of results in clinical studies. However, publicly available, wellannotated CRF datasets are scarce, limiting the development of CRF slot filling systems able to fill in a CRF from clinical notes. To mitigate the scarcity of CRF datasets, we propose to take advantage of available datasets annotated for information extraction tasks and to convert them into structured CRFs. We present a semi-automatic conversion methodology, which has been applied to the E3C dataset in two languages (English and Italian), resulting in a new, high-quality dataset for CRF slot filling. Through several experiments on the created dataset, we report that slot filling achieves 59.7% for Italian and 67.3% for English on a closed Large Language Models (zero-shot) and worse performances on three families of open-source models, showing that filling CRFs is challenging even for recent state-of-the-art LLMs. We release the datest at this https URL
zh
[NLP-27] LoRA-Gen: Specializing Large Language Model via Online LoRA Generation
【速读】: 该论文旨在解决在领域特定任务中,尤其是小型边缘侧模型上,大规模语言模型在效果和效率上的局限性。其解决方案的关键在于提出LoRA-Gen框架,该框架利用云端大模型根据任务描述生成LoRA参数,并通过重参数化技术将这些参数合并到边缘侧模型中,从而实现灵活的模型专业化,提升推理效率并减少输入上下文长度。
链接: https://arxiv.org/abs/2506.11638
作者: Yicheng Xiao,Lin Song,Rui Yang,Cheng Cheng,Yixiao Ge,Xiu Li,Ying Shan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances have highlighted the benefits of scaling language models to enhance performance across a wide range of NLP tasks. However, these approaches still face limitations in effectiveness and efficiency when applied to domain-specific tasks, particularly for small edge-side models. We propose the LoRA-Gen framework, which utilizes a large cloud-side model to generate LoRA parameters for edge-side models based on task descriptions. By employing the reparameterization technique, we merge the LoRA parameters into the edge-side model to achieve flexible specialization. Our method facilitates knowledge transfer between models while significantly improving the inference efficiency of the specialized model by reducing the input context length. Without specialized training, LoRA-Gen outperforms conventional LoRA fine-tuning, which achieves competitive accuracy and a 2.1x speedup with TinyLLaMA-1.1B in reasoning tasks. Besides, our method delivers a compression ratio of 10.1x with Gemma-2B on intelligent agent tasks.
zh
[NLP-28] SceneGram: Conceptualizing and Describing Tangrams in Scene Context ACL
【速读】: 该论文试图解决场景上下文对概念化和命名影响的建模问题,特别是如何在生成式 AI 中体现人类在不同场景下对同一对象的多样化概念化方式。其解决方案的关键在于构建 SceneGram 数据集,该数据集收录了人类在不同场景上下文中对拼图形状的参考描述,从而为系统分析场景上下文对概念化的影响提供了基础。基于此数据集,研究进一步评估了多模态大语言模型在生成参考时对人类概念化丰富性和变异性考虑的不足。
链接: https://arxiv.org/abs/2506.11631
作者: Simeon Junker,Sina Zarrieß
机构: Bielefeld University (比勒费尔德大学)
类目: Computation and Language (cs.CL)
备注: To appear in ACL Findings 2025
Abstract:Research on reference and naming suggests that humans can come up with very different ways of conceptualizing and referring to the same object, e.g. the same abstract tangram shape can be a “crab”, “sink” or “space ship”. Another common assumption in cognitive science is that scene context fundamentally shapes our visual perception of objects and conceptual expectations. This paper contributes SceneGram, a dataset of human references to tangram shapes placed in different scene contexts, allowing for systematic analyses of the effect of scene context on conceptualization. Based on this data, we analyze references to tangram shapes generated by multimodal LLMs, showing that these models do not account for the richness and variability of conceptualizations found in human references.
zh
[NLP-29] (SimPhon Speech Test): A Data-Driven Method for In Silico Design and Validation of a Phonetically Balanced Speech Test
【速读】: 该论文试图解决传统听力测试在评估听力损失对言语理解功能影响方面的不足,尤其是针对老年性耳聋中常见的 supra-threshold(阈上)缺陷的诊断特异性不足问题。其解决方案的关键在于引入一种基于计算的多阶段流程——Simulated Phoneme Speech Test (SimPhon Speech Test) 方法,该方法利用现代自动语音识别(Automatic Speech Recognition, ASR)系统模拟感音神经性听力损失的感知效应,并通过受控声学退化处理识别最常见的音素混淆模式,从而构建一个语音平衡的最小对言语测试集。该方法通过数据驱动的候选词对筛选、专家人工校准和敏感性分析,最终优化出25组测试对,显著提升了听力测试开发的效率,并表现出与标准言语可懂度指数(Speech Intelligibility Index, SII)无显著相关性,表明其能够捕捉更复杂的感知缺陷。
链接: https://arxiv.org/abs/2506.11620
作者: Stefan Bleeck
机构: Institute of Sound and Vibration Research (ISVR), University of Southampton (南安普顿大学)
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Traditional audiometry often provides an incomplete characterization of the functional impact of hearing loss on speech understanding, particularly for supra-threshold deficits common in presbycusis. This motivates the development of more diagnostically specific speech perception tests. We introduce the Simulated Phoneme Speech Test (SimPhon Speech Test) methodology, a novel, multi-stage computational pipeline for the in silico design and validation of a phonetically balanced minimal-pair speech test. This methodology leverages a modern Automatic Speech Recognition (ASR) system as a proxy for a human listener to simulate the perceptual effects of sensorineural hearing loss. By processing speech stimuli under controlled acoustic degradation, we first identify the most common phoneme confusion patterns. These patterns then guide the data-driven curation of a large set of candidate word pairs derived from a comprehensive linguistic corpus. Subsequent phases involving simulated diagnostic testing, expert human curation, and a final, targeted sensitivity analysis systematically reduce the candidates to a final, optimized set of 25 pairs (the SimPhon Speech Test-25). A key finding is that the diagnostic performance of the SimPhon Speech Test-25 test items shows no significant correlation with predictions from the standard Speech Intelligibility Index (SII), suggesting the SimPhon Speech Test captures perceptual deficits beyond simple audibility. This computationally optimized test set offers a significant increase in efficiency for audiological test development, ready for initial human trials.
zh
[NLP-30] VLM@school – Evaluation of AI image understanding on German middle school knowledge
【速读】: 该论文旨在解决当前视觉语言模型(Vision Language Models, VLMs)在结合视觉推理与特定领域背景知识任务上的评估不足问题,特别是在德语环境下的表现评估。现有英语基准测试通常依赖于人为构造的难题或脱离上下文的问题,而本文提出的基准数据集则来源于真实的中学课程内容,涵盖数学、历史、生物和宗教等九个领域,以确保模型能够真正整合视觉理解与事实推理。解决方案的关键在于构建一个包含486张图像和2000余道开放性问题的数据集,从而要求模型不仅具备视觉识别能力,还需具备跨领域的知识推理能力。
链接: https://arxiv.org/abs/2506.11604
作者: René Peinl,Vincent Tischler
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper introduces a novel benchmark dataset designed to evaluate the capabilities of Vision Language Models (VLMs) on tasks that combine visual reasoning with subject-specific background knowledge in the German language. In contrast to widely used English-language benchmarks that often rely on artificially difficult or decontextualized problems, this dataset draws from real middle school curricula across nine domains including mathematics, history, biology, and religion. The benchmark includes over 2,000 open-ended questions grounded in 486 images, ensuring that models must integrate visual interpretation with factual reasoning rather than rely on superficial textual cues. We evaluate thirteen state-of-the-art open-weight VLMs across multiple dimensions, including domain-specific accuracy and performance on adversarial crafted questions. Our findings reveal that even the strongest models achieve less than 45% overall accuracy, with particularly poor performance in music, mathematics, and adversarial settings. Furthermore, the results indicate significant discrepancies between success on popular benchmarks and real-world multimodal understanding. We conclude that middle school-level tasks offer a meaningful and underutilized avenue for stress-testing VLMs, especially in non-English contexts. The dataset and evaluation protocol serve as a rigorous testbed to better understand and improve the visual and linguistic reasoning capabilities of future AI systems.
zh
[NLP-31] Are LLM s Good Text Diacritizers? An Arabic and Yorùbá Case Study
【速读】: 该论文试图解决在两种语系不同的语言——阿拉伯语和约鲁巴语中,文本重音符号标注(text diacritization)的问题。其解决方案的关键在于利用大规模语言模型(LLMs)进行重音符号标注,并通过引入一个新型多语言数据集MultiDiac来实现严谨的评估。此外,研究还探索了对小型开源模型进行LoRA微调以提升标注性能和降低幻觉现象。
链接: https://arxiv.org/abs/2506.11602
作者: Hawau Olamide Toyin,Samar M. Magdy,Hanan Aldarmaki
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We investigate the effectiveness of large language models (LLMs) for text diacritization in two typologically distinct languages: Arabic and Yoruba. To enable a rigorous evaluation, we introduce a novel multilingual dataset MultiDiac, with diverse samples that capture a range of diacritic ambiguities. We evaluate 14 LLMs varying in size, accessibility, and language coverage, and benchmark them against 6 specialized diacritization models. Additionally, we fine-tune four small open-source models using LoRA for Yoruba. Our results show that many off-the-shelf LLMs outperform specialized diacritization models for both Arabic and Yoruba, but smaller models suffer from hallucinations. Fine-tuning on a small dataset can help improve diacritization performance and reduce hallucination rates.
zh
[NLP-32] DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLM s
【速读】: 该论文旨在解决视频语言模型(Video LLM)在细粒度时间推理方面的局限性,特别是在受限监督条件下难以精确关联回答与视频特定时刻的问题。其解决方案的关键在于提出DaMO,一个专为准确时间推理和多模态理解设计的数据高效视频语言模型。核心创新是引入了时序感知的Fuseformer架构,该架构采用分层双流结构,逐步捕捉每种模态内的时序动态,并有效融合互补的视觉与音频信息,同时通过全局残差机制提升计算效率并保留语义细节。
链接: https://arxiv.org/abs/2506.11558
作者: Bo-Cheng Chiu,Jen-Jee Chen,Yu-Chee Tseng,Feng-Chi Chen
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); National Health Research Institutes (国家卫生研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with GPT-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.
zh
[NLP-33] From Persona to Person: Enhancing the Naturalness with Multiple Discourse Relations Graph Learning in Personalized Dialogue Generation PAKDD2025
【速读】: 该论文旨在解决个性化对话生成中响应自然度与连贯性不足的问题,特别是在保持响应与用户个人特征或人设描述一致性的挑战。解决方案的关键在于提出MUDI(Multiple Discourse Relations Graph Learning)框架,通过大型语言模型辅助标注话语关系并将对话数据转化为结构化对话图,利用DialogueGAT图编码器捕捉隐含的话语关系及人设信息,并在生成阶段引入新颖的连贯性感知注意力机制,以提升解码器对话语关系的考虑。
链接: https://arxiv.org/abs/2506.11557
作者: Chih-Hao Hsu,Ying-Jia Lin,Hung-Yu Kao
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by PAKDD 2025
Abstract:In dialogue generation, the naturalness of responses is crucial for effective human-machine interaction. Personalized response generation poses even greater challenges, as the responses must remain coherent and consistent with the user’s personal traits or persona descriptions. We propose MUDI ( \textbfMu ltiple \textbfDi scourse Relations Graph Learning) for personalized dialogue generation. We utilize a Large Language Model to assist in annotating discourse relations and to transform dialogue data into structured dialogue graphs. Our graph encoder, the proposed DialogueGAT model, then captures implicit discourse relations within this structure, along with persona descriptions. During the personalized response generation phase, novel coherence-aware attention strategies are implemented to enhance the decoder’s consideration of discourse relations. Our experiments demonstrate significant improvements in the quality of personalized responses, thus resembling human-like dialogue exchanges.
zh
[NLP-34] RAG : Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning
【速读】: 该论文试图解决现有检索增强生成(Retrieval-Augmented Generation, RAG)框架在知识应用环节的不足,即在获取事实性信息后缺乏针对任务特定推理的应用意识,导致知识与任务之间的衔接存在断层。其解决方案的关键在于提出RAG+,通过构建包含知识和对齐应用场景的双语料库,并在推理过程中联合检索,使语言模型不仅能够获取相关信息,还能在结构化、目标导向的推理过程中进行知识应用,从而提升模型在复杂任务中的表现。
链接: https://arxiv.org/abs/2506.11555
作者: Yu Wang,Shiwan Zhao,Ming Fan,Zhihu Wang,Yubo Zhang,Xicheng Zhang,Zhengfan Wang,Heyuan Huang,Ting Liu
机构: Huawei Technologies Ltd. (华为技术有限公司); Xi’an Jiaotong University (西安交通大学); Nankai University (南开大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The integration of external knowledge through Retrieval-Augmented Generation (RAG) has become foundational in enhancing large language models (LLMs) for knowledge-intensive tasks. However, existing RAG paradigms often overlook the cognitive step of applying knowledge, leaving a gap between retrieved facts and task-specific reasoning. In this work, we introduce RAG+, a principled and modular extension that explicitly incorporates application-aware reasoning into the RAG pipeline. RAG+ constructs a dual corpus consisting of knowledge and aligned application examples, created either manually or automatically, and retrieves both jointly during inference. This design enables LLMs not only to access relevant information but also to apply it within structured, goal-oriented reasoning processes. Experiments across mathematical, legal, and medical domains, conducted on multiple models, demonstrate that RAG+ consistently outperforms standard RAG variants, achieving average improvements of 3-5%, and peak gains up to 7.5% in complex scenarios. By bridging retrieval with actionable application, RAG+ advances a more cognitively grounded framework for knowledge integration, representing a step toward more interpretable and capable LLMs.
zh
[NLP-35] Brewing Knowledge in Context: Distillation Perspectives on In-Context Learning
【速读】: 该论文试图解决在不更新权重的情况下,大型语言模型(Large Language Models, LLMs)如何通过上下文学习(In-context Learning, ICL)完成新任务的机制问题。现有研究对ICL的内在机理理解不足,限制了其解释性、优化与可靠应用。论文提出的解决方案关键在于将ICL视为一种隐式的知识蒸馏(Knowledge Distillation, KD)过程,其中提示示例在推理过程中引导模型形成特定任务的参考模型。该理论框架通过Rademacher复杂度推导出泛化边界,并证明蒸馏权重的偏差与提示分布和目标分布之间的最大均值差异(Maximum Mean Discrepancy, MMD)呈线性关系,从而为ICL提供了新的理论解释。
链接: https://arxiv.org/abs/2506.11516
作者: Chengye Li,Haiyun Liu,Yuanxi Li
机构: University of Chinese Academy of Sciences (中国科学院大学); University of South Florida (南佛罗里达大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Key Laboratory of System Software (Chinese Academy of Sciences) (中国科学院系统软件重点实验室); State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences) (计算机科学国家重点实验室(软件研究所,中国科学院))
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 10 main pages, 10 page appendix
Abstract:In-context learning (ICL) allows large language models (LLMs) to solve novel tasks without weight updates. Despite its empirical success, the mechanism behind ICL remains poorly understood, limiting our ability to interpret, improve, and reliably apply it. In this paper, we propose a new theoretical perspective that interprets ICL as an implicit form of knowledge distillation (KD), where prompt demonstrations guide the model to form a task-specific reference model during inference. Under this view, we derive a Rademacher complexity-based generalization bound and prove that the bias of the distilled weights grows linearly with the Maximum Mean Discrepancy (MMD) between the prompt and target distributions. This theoretical framework explains several empirical phenomena and unifies prior gradient-based and distributional analyses. To the best of our knowledge, this is the first to formalize inference-time attention as a distillation process, which provides theoretical insights for future prompt engineering and automated demonstration selection.
zh
[NLP-36] Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLM s
【速读】: 该论文旨在解决Two-Tower Vision-Language Models (VLMs)在跨模态对齐与融合中的局限性,具体表现为:(i) 无法有效利用单模态表示的逐层信息,(ii) 限制了对不同层次单模态语义知识的灵活利用,以及(iii) 仅限于在传统低分辨率数据集上进行评估。解决方案的关键是提出Manager,一个轻量、高效且有效的插件,能够自适应地聚合预训练单模态专家的不同层次见解,从而促进更全面的视觉-语言对齐与融合。在Two-Tower VLM架构下,ManagerTower通过在每个跨模态层引入Manager,显著提升了下游任务的表现;同时,在Multimodal Large Language Model (MLLM)架构中,Manager进一步增强了模型的零样本性能,其与多网格算法的协同作用可提升视觉表示的多样性与准确性。
链接: https://arxiv.org/abs/2506.11515
作者: Xiao Xu,Libo Qin,Wanxiang Che,Min-Yen Kan
机构: Harbin Institute of Technology (哈尔滨工业大学); Central South University (中南大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). June 2025. DOI: this https URL
Abstract:Two-Tower Vision–Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. While BridgeTower further enhances performance by building bridges between encoders, it \textit(i) suffers from ineffective layer-by-layer utilization of unimodal representations, \textit(ii) restricts the flexible exploitation of different levels of unimodal semantic knowledge, and \textit(iii) is limited to the evaluation on traditional low-resolution datasets only with the Two-Tower VLM architecture. In this work, we propose Manager, a lightweight, efficient and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts to facilitate more comprehensive VL alignment and fusion. First, under the Two-Tower VLM architecture, we introduce ManagerTower, a novel VLM that introduces the manager in each cross-modal layer. Whether with or without VL pre-training, ManagerTower outperforms previous strong baselines and achieves superior performance on 4 downstream VL tasks. Moreover, we extend our exploration to the latest Multimodal Large Language Model (MLLM) architecture. We demonstrate that LLaVA-OV-Manager significantly boosts the zero-shot performance of LLaVA-OV across different categories of capabilities, images, and resolutions on 20 downstream datasets, whether the multi-grid algorithm is enabled or not. In-depth analysis reveals that both our manager and the multi-grid algorithm can be viewed as a plugin that improves the visual representation by capturing more diverse visual details from two orthogonal perspectives (depth and width). Their synergy can mitigate the semantic ambiguity caused by the multi-grid algorithm and further improve performance. Code and models are available at this https URL.
zh
[NLP-37] On the Effectiveness of Integration Methods for Multimodal Dialogue Response Retrieval
【速读】: 该论文旨在解决对话系统如何在多种模态(如文本和图像)中生成响应的问题。其关键解决方案是提出两种集成方法:基于两步策略的方法和端到端方法,并通过参数共享策略在减少参数数量的同时提升跨子任务和模态的知识迁移效果。实验结果表明,端到端方法在无需两步策略中间步骤的情况下仍能实现相当的性能。
链接: https://arxiv.org/abs/2506.11499
作者: Seongbo Jang,Seonghyeon Lee,Dongha Lee,Hwanjo Yu
机构: Pohang University of Science and Technology (浦项科技大学); Kyungpook National University (庆北国立大学); Yonsei University (延世大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 1 figure
Abstract:Multimodal chatbots have become one of the major topics for dialogue systems in both research community and industry. Recently, researchers have shed light on the multimodality of responses as well as dialogue contexts. This work explores how a dialogue system can output responses in various modalities such as text and image. To this end, we first formulate a multimodal dialogue response retrieval task for retrieval-based systems as the combination of three subtasks. We then propose three integration methods based on a two-step approach and an end-to-end approach, and compare the merits and demerits of each method. Experimental results on two datasets demonstrate that the end-to-end approach achieves comparable performance without an intermediate step in the two-step approach. In addition, a parameter sharing strategy not only reduces the number of parameters but also boosts performance by transferring knowledge across the subtasks and the modalities.
zh
[NLP-38] Lag-Relative Sparse Attention In Long Context Training
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理长上下文输入时因注意力计算的二次复杂度和键值记忆占用线性增长而导致的计算成本和内存消耗过高的问题。其解决方案的关键在于提出一种名为Lag-Relative Sparse Attention (LRSA) 的方法,该方法基于LagKV压缩技术,通过按块预填充的方式,在固定滞后窗口中选择最相关的前K个键值对,使模型能够在保持效率的同时关注重要的历史上下文。
链接: https://arxiv.org/abs/2506.11498
作者: Manlai Liang,Wanyi Huang,Mandi Liu,Huaijun Li,Jinlong Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) have made significant strides in natural language processing and generation, yet their ability to handle long-context input remains constrained by the quadratic complexity of attention computation and linear-increasing key-value memory footprint. To reduce computational costs and memory, key-value cache compression techniques are commonly applied at inference time, but this often leads to severe performance degradation, as models are not trained to handle compressed context. Although there are more sophisticated compression methods, they are typically unsuitable for post-training because of their incompatibility with gradient-based optimization or high computation overhead. To fill this gap with no additional parameter and little computation overhead, we propose Lag-Relative Sparse Attention(LRSA) anchored by the LagKV compression method for long context post-training. Our method performs chunk-by-chunk prefilling, which selects the top K most relevant key-value pairs in a fixed-size lagging window, allowing the model to focus on salient historical context while maintaining efficiency. Experimental results show that our approach significantly enhances the robustness of the LLM with key-value compression and achieves better fine-tuned results in the question-answer tuning task.
zh
[NLP-39] Relational Schemata in BERT Are Inducible Not Emergent: A Study of Performance vs. Competence in Language Models
【速读】: 该论文试图解决的问题是:尽管大型语言模型如BERT在语义任务中表现出色,但其性能是否反映了真正的概念能力还是仅基于表层的统计关联仍不明确。论文通过检查概念对在分类学、部分-整体和功能关系中的内部表示,探究BERT是否编码了抽象关系模式。解决方案的关键在于比较BERT在[CLS]标记嵌入中的关系分类性能与表征结构,发现预训练BERT虽能实现高分类准确率,表明存在潜在的关系信号,但只有在经过监督关系分类任务微调后,概念对才会在高维嵌入空间中按关系类型组织,这表明关系模式并非仅通过预训练产生,而是可以通过任务引导获得。
链接: https://arxiv.org/abs/2506.11485
作者: Cole Gawin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures, 3 tables
Abstract:While large language models like BERT demonstrate strong empirical performance on semantic tasks, whether this reflects true conceptual competence or surface-level statistical association remains unclear. I investigate whether BERT encodes abstract relational schemata by examining internal representations of concept pairs across taxonomic, mereological, and functional relations. I compare BERT’s relational classification performance with representational structure in [CLS] token embeddings. Results reveal that pretrained BERT enables high classification accuracy, indicating latent relational signals. However, concept pairs organize by relation type in high-dimensional embedding space only after fine-tuning on supervised relation classification tasks. This indicates relational schemata are not emergent from pretraining alone but can be induced via task scaffolding. These findings demonstrate that behavioral performance does not necessarily imply structured conceptual understanding, though models can acquire inductive biases for grounded relational abstraction through appropriate training.
zh
[NLP-40] ImmunoFOMO: Are Language Models missing what oncologists see?
【速读】: 该论文试图解决如何利用语言模型(Language Models, LMs)在乳腺癌文献中识别免疫治疗标志物的问题,特别是评估其在识别低层次特定概念方面的性能。解决方案的关键在于对比不同语言模型的医学概念基础,并通过与专家临床医生的判断进行验证,以评估预训练语言模型在识别具体医学概念上的潜力。
链接: https://arxiv.org/abs/2506.11478
作者: Aman Sinha,Bogdan-Valentin Popescu,Xavier Coubez,Marianne Clausel,Mathieu Constant
机构: Université de Lorraine, Nancy, France; ICANS, Strasbourg, France
类目: Computation and Language (cs.CL)
备注:
Abstract:Language models (LMs) capabilities have grown with a fast pace over the past decade leading researchers in various disciplines, such as biomedical research, to increasingly explore the utility of LMs in their day-to-day applications. Domain specific language models have already been in use for biomedical natural language processing (NLP) applications. Recently however, the interest has grown towards medical language models and their understanding capabilities. In this paper, we investigate the medical conceptual grounding of various language models against expert clinicians for identification of hallmarks of immunotherapy in breast cancer abstracts. Our results show that pre-trained language models have potential to outperform large language models in identifying very specific (low-level) concepts.
zh
[NLP-41] AutoGen Driven Multi Agent Framework for Iterative Crime Data Analysis and Prediction
【速读】: 该论文试图解决社会科学研究中犯罪数据分析的自动化、可扩展性和迭代性问题,同时确保数据隐私。解决方案的关键在于提出一种基于多智能体协作的框架LUCID-MA(Learning and Understanding Crime through Dialogue of Multiple Agents),其核心组件包括分析助手、反馈组件和预测组件,通过设计良好的提示词和使用LLaMA-2-13B-Chat-GPTQ模型实现离线运行,并通过100轮通信实现智能体的自我优化,从而在减少人工干预的情况下提升分析效果。
链接: https://arxiv.org/abs/2506.11475
作者: Syeda Kisaa Fatima,Tehreem Zubair,Noman Ahmed,Asifullah Khan
机构: 未知
类目: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper introduces LUCID-MA (Learning and Understanding Crime through Dialogue of Multiple Agents), an innovative AI powered framework where multiple AI agents collaboratively analyze and understand crime data. Our system that consists of three core components: an analysis assistant that highlights spatiotemporal crime patterns, a feedback component that reviews and refines analytical results and a prediction component that forecasts future crime trends. With a well-designed prompt and the LLaMA-2-13B-Chat-GPTQ model, it runs completely offline and allows the agents undergo self-improvement through 100 rounds of communication with less human interaction. A scoring function is incorporated to evaluate agent’s performance, providing visual plots to track learning progress. This work demonstrates the potential of AutoGen-style agents for autonomous, scalable, and iterative analysis in social science domains maintaining data privacy through offline execution.
zh
[NLP-42] Med-PRM: Medical Reasoning Models with Stepwise Guideline-verified Process Rewards
【速读】: 该论文试图解决大型语言模型在临床决策中难以定位和纠正推理过程中特定步骤错误的问题(Large language models have shown promise in clinical decision making, but current approaches struggle to localize and correct errors at specific steps of the reasoning process)。解决方案的关键在于提出Med-PRM框架,该框架通过检索增强生成技术,将每一步推理与已建立的医学知识库进行验证,从而实现对推理质量的细粒度评估。
链接: https://arxiv.org/abs/2506.11474
作者: Jaehoon Yun,Jiwoong Sohn,Jungwoo Park,Hyunjae Kim,Xiangru Tang,Yanjun Shao,Yonghoe Koo,Minhyeok Ko,Qingyu Chen,Mark Gerstein,Michael Moor,Jaewoo Kang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models have shown promise in clinical decision making, but current approaches struggle to localize and correct errors at specific steps of the reasoning process. This limitation is critical in medicine, where identifying and addressing reasoning errors is essential for accurate diagnosis and effective patient care. We introduce Med-PRM, a process reward modeling framework that leverages retrieval-augmented generation to verify each reasoning step against established medical knowledge bases. By verifying intermediate reasoning steps with evidence retrieved from clinical guidelines and literature, our model can precisely assess the reasoning quality in a fine-grained manner. Evaluations on five medical QA benchmarks and two open-ended diagnostic tasks demonstrate that Med-PRM achieves state-of-the-art performance, with improving the performance of base models by up to 13.50% using Med-PRM. Moreover, we demonstrate the generality of Med-PRM by integrating it in a plug-and-play fashion with strong policy models such as Meerkat, achieving over 80% accuracy on MedQA for the first time using small-scale models of 8 billion parameters. Our code and data are available at: this https URL
zh
[NLP-43] A Gamified Evaluation and Recruitment Platform for Low Resource Language Machine Translation Systems
【速读】: 该论文试图解决低资源语言(Low-Resource Languages, LRLs)机器翻译(Machine Translation, MT)系统在评估过程中面临的数据集和人类评估者不足的问题。解决方案的关键在于提出一个用于招募和游戏化评估的平台设计,旨在弥补MT系统开发中数据与评估资源的缺口,并通过该平台提升评估的全面性和有效性。
链接: https://arxiv.org/abs/2506.11467
作者: Carlos Rafael Catalan
机构: Samsung Research Philippines(三星研究菲律宾)
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: 7 pages, 7 figures, presented at the HEAL Workshop at CHI
Abstract:Human evaluators provide necessary contributions in evaluating large language models. In the context of Machine Translation (MT) systems for low-resource languages (LRLs), this is made even more apparent since popular automated metrics tend to be string-based, and therefore do not provide a full picture of the nuances of the behavior of the system. Human evaluators, when equipped with the necessary expertise of the language, will be able to test for adequacy, fluency, and other important metrics. However, the low resource nature of the language means that both datasets and evaluators are in short supply. This presents the following conundrum: How can developers of MT systems for these LRLs find adequate human evaluators and datasets? This paper first presents a comprehensive review of existing evaluation procedures, with the objective of producing a design proposal for a platform that addresses the resource gap in terms of datasets and evaluators in developing MT systems. The result is a design for a recruitment and gamified evaluation platform for developers of MT systems. Challenges are also discussed in terms of evaluating this platform, as well as its possible applications in the wider scope of Natural Language Processing (NLP) research.
zh
[NLP-44] AbsenceBench: Language Models Cant Tell Whats Missing
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在检测文档中明确缺失信息方面的能力不足问题,即模型难以识别被故意移除的内容。解决方案的关键在于提出AbsenceBench基准测试,用于评估LLMs在数值序列、诗歌和GitHub pull requests三个领域中检测缺失信息的能力,其核心在于通过对比原始文本与编辑后的上下文来判断模型是否能够识别被移除的部分。
链接: https://arxiv.org/abs/2506.11440
作者: Harvey Yiyun Fu,Aryan Shrivastava,Jared Moore,Peter West,Chenhao Tan,Ari Holtzman
机构: University of Chicago (芝加哥大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注: 23 pages, 8 figures. Code and data are publicly available at this https URL
Abstract:Large language models (LLMs) are increasingly capable of processing long inputs and locating specific information within them, as evidenced by their performance on the Needle in a Haystack (NIAH) test. However, while models excel at recalling surprising information, they still struggle to identify clearly omitted information. We introduce AbsenceBench to assesses LLMs’ capacity to detect missing information across three domains: numerical sequences, poetry, and GitHub pull requests. AbsenceBench asks models to identify which pieces of a document were deliberately removed, given access to both the original and edited contexts. Despite the apparent straightforwardness of these tasks, our experiments reveal that even state-of-the-art models like Claude-3.7-Sonnet achieve only 69.6% F1-score with a modest average context length of 5K tokens. Our analysis suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to “gaps” in documents since these absences don’t correspond to any specific keys that can be attended to. Overall, our results and analysis provide a case study of the close proximity of tasks where models are already superhuman (NIAH) and tasks where models breakdown unexpectedly (AbsenceBench).
zh
[NLP-45] KoGEC : Korean Grammatical Error Correction with Pre-trained Translation Models
【速读】: 该论文旨在解决韩国语语法错误修正(Korean Grammatical Error Correction, KoGEC)问题,其核心挑战在于如何有效识别并修正不同类型的语法错误,尤其是在社交媒体对话数据上的应用。解决方案的关键在于对预训练翻译模型NLLB(No Language Left Behind)进行微调,并引入特殊语言标记以区分原始句子与修正后的句子,从而提升模型在韩国语语法错误修正任务中的表现。此外,研究还采用BLEU分数和“大型语言模型作为评判者”的方法进行评估,验证了微调后的NLLB模型在KoGEC任务中的优越性。
链接: https://arxiv.org/abs/2506.11432
作者: Taeeun Kim,Semin Jeong,Youngsook Song
机构: Sionic AI Inc., Seoul, Korea; Emory University, Atlanta, GA, USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures
Abstract:This research introduces KoGEC, a Korean Grammatical Error Correction system using pre–trained translation models. We fine-tuned NLLB (No Language Left Behind) models for Korean GEC, comparing their performance against large language models like GPT-4 and HCX-3. The study used two social media conversation datasets for training and testing. The NLLB models were fine-tuned using special language tokens to distinguish between original and corrected Korean sentences. Evaluation was done using BLEU scores and an “LLM as judge” method to classify error types. Results showed that the fine-tuned NLLB (KoGEC) models outperformed GPT-4o and HCX-3 in Korean GEC tasks. KoGEC demonstrated a more balanced error correction profile across various error types, whereas the larger LLMs tended to focus less on punctuation errors. We also developed a Chrome extension to make the KoGEC system accessible to users. Finally, we explored token vocabulary expansion to further improve the model but found it to decrease model performance. This research contributes to the field of NLP by providing an efficient, specialized Korean GEC system and a new evaluation method. It also highlights the potential of compact, task-specific models to compete with larger, general-purpose language models in specialized NLP tasks.
zh
[NLP-46] Agent -RLVR: Training Software Engineering Agents via Guidance and Environment Rewards
【速读】: 该论文旨在解决传统基于可验证奖励的强化学习(Reinforcement Learning from Verifiable Rewards, RLVR)在代理环境中的有效性下降问题。在复杂的多步骤问题求解场景中,由于奖励稀疏,传统RLVR难以有效训练模型。论文提出的解决方案是Agent-RLVR框架,其关键在于引入代理引导(agent guidance),通过整合高层次策略规划、动态错误反馈及环境交互信息等多样化线索,模拟教师指导机制,引导代理走向成功轨迹,并促进其通过额外环境探索实现主动自我改进。
链接: https://arxiv.org/abs/2506.11425
作者: Jeff Da,Clinton Wang,Xiang Deng,Yuntao Ma,Nikhil Barhate,Sean Hendryx
机构: Scale AI(规模人工智能)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) has been widely adopted as the de facto method for enhancing the reasoning capabilities of large language models and has demonstrated notable success in verifiable domains like math and competitive programming tasks. However, the efficacy of RLVR diminishes significantly when applied to agentic environments. These settings, characterized by multi-step, complex problem solving, lead to high failure rates even for frontier LLMs, as the reward landscape is too sparse for effective model training via conventional RLVR. In this work, we introduce Agent-RLVR, a framework that makes RLVR effective in challenging agentic settings, with an initial focus on software engineering tasks. Inspired by human pedagogy, Agent-RLVR introduces agent guidance, a mechanism that actively steers the agent towards successful trajectories by leveraging diverse informational cues. These cues, ranging from high-level strategic plans to dynamic feedback on the agent’s errors and environmental interactions, emulate a teacher’s guidance, enabling the agent to navigate difficult solution spaces and promotes active self-improvement via additional environment exploration. In the Agent-RLVR training loop, agents first attempt to solve tasks to produce initial trajectories, which are then validated by unit tests and supplemented with agent guidance. Agents then reattempt with guidance, and the agent policy is updated with RLVR based on the rewards of these guided trajectories. Agent-RLVR elevates the pass@1 performance of Qwen-2.5-72B-Instruct from 9.4% to 22.4% on SWE-Bench Verified. We find that our guidance-augmented RLVR data is additionally useful for test-time reward model training, shown by further boosting pass@1 to 27.8%. Agent-RLVR lays the groundwork for training agents with RLVR in complex, real-world environments where conventional RL methods struggle.
zh
[NLP-47] Efficient Long-Context LLM Inference via KV Cache Clustering
【速读】: 该论文旨在解决长上下文大语言模型(Large Language Models, LLMs)在部署过程中因需要大量Key-Value (KV)缓存而导致的内存和计算效率问题。现有方法要么丢弃对未来生成至关重要的信息,要么由于计算开销过高而难以提升效率。论文提出的解决方案是Chelsea,其关键在于通过在线KV缓存聚类实现高效压缩。Chelsea基于序列维度上键状态的高相似性观察,将序列划分为块,并采用块内交替划分策略进行软匹配,从而识别出具有相似性的簇,将每个簇内的KV缓存合并为一个中心点,有效减少内存占用并提升推理速度。
链接: https://arxiv.org/abs/2506.11418
作者: Jie Hu,Shengnan Wang,Yutong He,Ping Gong,Jiawei Yi,Juncheng Zhang,Youhui Bai,Renhai Chen,Gong Zhang,Cheng Li,Kun Yuan
机构: Peking University (北京大学); Huawei Technologies (华为技术有限公司); University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges. Existing approaches either discard potentially critical information needed for future generations or offer limited efficiency gains due to high computational overhead. In this paper, we introduce Chelsea, a simple yet effective framework for online KV cache clustering. Our approach is based on the observation that key states exhibit high similarity along the sequence dimension. To enable efficient clustering, we divide the sequence into chunks and propose Chunked Soft Matching, which employs an alternating partition strategy within each chunk and identifies clusters based on similarity. Chelsea then merges the KV cache within each cluster into a single centroid. Additionally, we provide a theoretical analysis of the computational complexity and the optimality of the intra-chunk partitioning strategy. Extensive experiments across various models and long-context benchmarks demonstrate that Chelsea achieves up to 80% reduction in KV cache memory usage while maintaining comparable model performance. Moreover, with minimal computational overhead, Chelsea accelerates the decoding stage of inference by up to 3.19 \times and reduces end-to-end latency by up to 2.72 \times .
zh
[NLP-48] Bias Amplification in RAG : Poisoning Knowledge Retrieval to Steer LLM s
【速读】: 该论文试图解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中由于中毒攻击引发的模型偏差放大问题,特别是现有研究忽视了此类攻击对模型公平性的潜在影响。解决方案的关键在于提出一种Bias Retrieval and Reward Attack (BRRA)框架,通过对抗性文档生成、子空间投影技术以及循环反馈机制,系统性地探索并放大语言模型在RAG系统中的偏差。该方法能够显著增强模型在多个维度上的偏差,并揭示了RAG系统安全与模型公平性之间的关系。
链接: https://arxiv.org/abs/2506.11415
作者: Linlin Wang,Tianqing Zhu,Laiqiao Qin,Longxiang Gao,Wanlei Zhou
机构: City University of Macau (澳门城市大学); Shandong Computer Science Center (山东计算机科学中心); Qilu University of Technology (Shandong Academy of Sciences) (齐鲁工业大学(山东省科学院)); Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing (山东省计算力互联网与服务计算重点实验室); Shandong Fundamental Research Center for Computer Science (山东省计算机科学基础研究中心)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:In Large Language Models, Retrieval-Augmented Generation (RAG) systems can significantly enhance the performance of large language models by integrating external knowledge. However, RAG also introduces new security risks. Existing research focuses mainly on how poisoning attacks in RAG systems affect model output quality, overlooking their potential to amplify model biases. For example, when querying about domestic violence victims, a compromised RAG system might preferentially retrieve documents depicting women as victims, causing the model to generate outputs that perpetuate gender stereotypes even when the original query is gender neutral. To show the impact of the bias, this paper proposes a Bias Retrieval and Reward Attack (BRRA) framework, which systematically investigates attack pathways that amplify language model biases through a RAG system manipulation. We design an adversarial document generation method based on multi-objective reward functions, employ subspace projection techniques to manipulate retrieval results, and construct a cyclic feedback mechanism for continuous bias amplification. Experiments on multiple mainstream large language models demonstrate that BRRA attacks can significantly enhance model biases in dimensions. In addition, we explore a dual stage defense mechanism to effectively mitigate the impacts of the attack. This study reveals that poisoning attacks in RAG systems directly amplify model output biases and clarifies the relationship between RAG system security and model fairness. This novel potential attack indicates that we need to keep an eye on the fairness issues of the RAG system.
zh
[NLP-49] Predicting Early-Onset Colorectal Cancer with Large Language Models
【速读】: 该论文试图解决早期发病结直肠癌(Early-onset Colorectal Cancer, EoCRC)筛查效率不足的问题,因为该群体年龄低于现行癌症筛查指南推荐的年龄。研究通过应用10种不同的机器学习模型,并与先进的大型语言模型(Large Language Models, LLM)进行比较,利用患者在结直肠癌诊断前6个月内的病史、实验室结果和临床观察数据进行预测。解决方案的关键在于对大型语言模型进行微调,以提高其在EoCRC预测任务中的性能,最终实现了73%的灵敏度和91%的特异性。
链接: https://arxiv.org/abs/2506.11410
作者: Wilson Lau,Youngwon Kim,Sravanthi Parasa,Md Enamul Haque,Anand Oka,Jay Nanduri
机构: 未知
类目: Computation and Language (cs.CL)
备注: Paper accepted for the proceedings of the 2025 American Medical Informatics Association Annual Symposium (AMIA)
Abstract:The incidence rate of early-onset colorectal cancer (EoCRC, age 45) has increased every year, but this population is younger than the recommended age established by national guidelines for cancer screening. In this paper, we applied 10 different machine learning models to predict EoCRC, and compared their performance with advanced large language models (LLM), using patient conditions, lab results, and observations within 6 months of patient journey prior to the CRC diagnoses. We retrospectively identified 1,953 CRC patients from multiple health systems across the United States. The results demonstrated that the fine-tuned LLM achieved an average of 73% sensitivity and 91% specificity.
zh
[NLP-50] LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model
【速读】: 该论文试图解决参数高效微调(Parameter Efficient FineTuning, PEFT)方法在微调预训练大型语言模型(Large Language Models, LLMs)时可能引发的灾难性失败问题。其关键解决方案是揭示PEFT促使模型依赖于简短的、虚假的token进行决策的现象,并通过无缝虚假token注入(Seamless Spurious Token Injection, SSTI)实验验证了这一现象。研究发现,即使少量相关token的注入也能显著影响模型的行为,且这种依赖性与LoRA的秩相关,从而为提升模型鲁棒性提供了新的视角。
链接: https://arxiv.org/abs/2506.11402
作者: Pradyut Sekhsaria,Marcel Mateos Salles,Hai Huang,Randall Balestriero
机构: Brown University (布朗大学); Atlassian (Atlassian)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 29 pages, 16 figures, 15 tables. Submitted for publication. for associated blog post, see this https URL
Abstract:Parameter Efficient FineTuning (PEFT), such as Low-Rank Adaptation (LoRA), aligns pre-trained Large Language Models (LLMs) to particular downstream tasks in a resource-efficient manner. Because efficiency has been the main metric of progress, very little attention has been put in understanding possible catastrophic failures. We uncover one such failure: PEFT encourages a model to search for shortcut solutions to solve its fine-tuning tasks. When very small amount of tokens, e.g., one token per prompt, are correlated with downstream task classes, PEFT makes any pretrained model rely predominantly on that token for decision making. While such spurious tokens may emerge accidentally from incorrect data cleaning, it also opens opportunities for malevolent parties to control a model’s behavior from Seamless Spurious Token Injection (SSTI). In SSTI, a small amount of tokens correlated with downstream classes are injected by the dataset creators. At test time, the finetuned LLM’s behavior can be controlled solely by injecting those few tokens. We apply SSTI across models from three families (Snowflake Arctic, Apple OpenELM, and Meta LLaMA-3) and four diverse datasets (IMDB, Financial Classification, CommonSense QA, and Bias in Bios). Our findings reveal three astonishing behaviors. First, as few as a single token of SSTI is sufficient to steer a model’s decision making. Second, for light SSTI, the reliance on spurious tokens is proportional to the LoRA rank. Lastly, with aggressive SSTI, larger LoRA rank values become preferable to small rank values as it makes the model attend to non-spurious tokens, hence improving robustness.
zh
[NLP-51] Curriculum-Guided Layer Scaling for Language Model Pretraining
【速读】: 该论文试图解决大规模语言模型预训练过程中计算效率低下的问题,旨在通过优化学习效率来降低预训练成本。其解决方案的关键在于提出了一种基于课程引导的层扩展框架(Curriculum-Guided Layer Scaling, CGLS),该框架通过逐步增加模型层数(即在训练过程中渐进式地添加层)与数据难度的同步增长,实现更高效的预训练过程。
链接: https://arxiv.org/abs/2506.11389
作者: Karanpartap Singh,Neil Band,Ehsan Adeli
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:As the cost of pretraining large language models grows, there is continued interest in strategies to improve learning efficiency during this core training stage. Motivated by cognitive development, where humans gradually build knowledge as their brains mature, we propose Curriculum-Guided Layer Scaling (CGLS), a framework for compute-efficient pretraining that synchronizes increasing data difficulty with model growth through progressive layer stacking (i.e. gradually adding layers during training). At the 100M parameter scale, using a curriculum transitioning from synthetic short stories to general web data, CGLS outperforms baseline methods on the question-answering benchmarks PIQA and ARC. Pretraining at the 1.2B scale, we stratify the DataComp-LM corpus with a DistilBERT-based classifier and progress from general text to highly technical or specialized content. Our results show that progressively increasing model depth alongside sample difficulty leads to better generalization and zero-shot performance on various downstream benchmarks. Altogether, our findings demonstrate that CGLS unlocks the potential of progressive stacking, offering a simple yet effective strategy for improving generalization on knowledge-intensive and reasoning tasks.
zh
[NLP-52] A Variational Approach for Mitigating Entity Bias in Relation Extraction ACL2025
【速读】: 该论文试图解决关系抽取(Relation Extraction, RE)中的实体偏差问题,即模型过度依赖实体信息导致泛化能力不足。解决方案的关键在于采用变分信息瓶颈(Variational Information Bottleneck, VIB)框架,通过压缩实体特异性信息同时保留任务相关特征,从而提升模型的泛化能力和性能。
链接: https://arxiv.org/abs/2506.11381
作者: Samuel Mensah,Elena Kochkina,Jabez Magomere,Joy Prakash Sain,Simerjot Kaur,Charese Smiley
机构: JP Morgan AI Research (JP Morgan AI Research); University of Oxford (牛津大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2025 Main
Abstract:Mitigating entity bias is a critical challenge in Relation Extraction (RE), where models often rely excessively on entities, resulting in poor generalization. This paper presents a novel approach to address this issue by adapting a Variational Information Bottleneck (VIB) framework. Our method compresses entity-specific information while preserving task-relevant features. It achieves state-of-the-art performance on relation extraction datasets across general, financial, and biomedical domains, in both indomain (original test sets) and out-of-domain (modified test sets with type-constrained entity replacements) settings. Our approach offers a robust, interpretable, and theoretically grounded methodology.
zh
[NLP-53] Large Language Model-Powered Conversational Agent Delivering Problem-Solving Therapy (PST) for Family Caregivers: Enhancing Empathy and Therapeutic Alliance Using In-Context Learning
【速读】: 该论文试图解决家庭照护者因多重角色和资源有限而面临的心理健康问题,其解决方案的关键在于利用大型语言模型(Large Language Model, LLM)驱动的对话代理,提供基于证据的心理健康支持,具体整合了问题解决疗法(Problem-Solving Therapy, PST)、动机访谈(Motivational Interviewing, MI)和行为链分析(Behavioral Chain Analysis, BCA)。研究通过优化提示技术(如Few-Shot和检索增强生成,RAG)以及临床专家精选示例,提升了模型的情境理解能力和个性化支持水平。
链接: https://arxiv.org/abs/2506.11376
作者: Liying Wang,Ph.D.,Daffodil Carrington,M.S.,Daniil Filienko,M.S.,Caroline El Jazmi,M.S.,Serena Jinchen Xie,M.S.,Martine De Cock,Ph.D.,Sarah Iribarren,Ph.D.,Weichao Yuwen,Ph.D
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:Family caregivers often face substantial mental health challenges due to their multifaceted roles and limited resources. This study explored the potential of a large language model (LLM)-powered conversational agent to deliver evidence-based mental health support for caregivers, specifically Problem-Solving Therapy (PST) integrated with Motivational Interviewing (MI) and Behavioral Chain Analysis (BCA). A within-subject experiment was conducted with 28 caregivers interacting with four LLM configurations to evaluate empathy and therapeutic alliance. The best-performing models incorporated Few-Shot and Retrieval-Augmented Generation (RAG) prompting techniques, alongside clinician-curated examples. The models showed improved contextual understanding and personalized support, as reflected by qualitative responses and quantitative ratings on perceived empathy and therapeutic alliances. Participants valued the model’s ability to validate emotions, explore unexpressed feelings, and provide actionable strategies. However, balancing thorough assessment with efficient advice delivery remains a challenge. This work highlights the potential of LLMs in delivering empathetic and tailored support for family caregivers.
zh
[NLP-54] Benchmarking Multimodal LLM s on Recognition and Understanding over Chemical Tables
【速读】: 该论文试图解决化学领域中化学表格(Chemical Tables)的多模态与领域特定复杂性被现有基准忽视的问题,这限制了多模态大语言模型在化学科学理解中的支持能力。解决方案的关键在于构建了一个大规模的真实化学表格基准——ChemTable,该基准包含专家标注的单元格多边形、逻辑布局和领域特定标签,并支持表格识别与理解两项核心任务,从而为化学感知的表格理解提供了一个严谨且现实的评估平台。
链接: https://arxiv.org/abs/2506.11375
作者: Yitong Zhou,Mingyue Cheng,Qingyang Mao,Yucong Luo,Qi Liu,Yupeng Li,Xiaohan Zhang,Deguang Liu,Xin Li,Enhong Chen
机构: State Key Laboratory of Cognitive Intelligence (认知智能国家重点实验室); University of Science and Technology of China (中国科学技术大学); Artificial Intelligence Research Institute, iFLYTEK Co., Ltd (人工智能研究院,科大讯飞公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Chemical tables encode complex experimental knowledge through symbolic expressions, structured variables, and embedded molecular graphics. Existing benchmarks largely overlook this multimodal and domain-specific complexity, limiting the ability of multimodal large language models to support scientific understanding in chemistry. In this work, we introduce ChemTable, a large-scale benchmark of real-world chemical tables curated from the experimental sections of literature. ChemTable includes expert-annotated cell polygons, logical layouts, and domain-specific labels, including reagents, catalysts, yields, and graphical components and supports two core tasks: (1) Table Recognition, covering structure parsing and content extraction; and (2) Table Understanding, encompassing both descriptive and reasoning-oriented question answering grounded in table structure and domain semantics. We evaluated a range of representative multimodal models, including both open-source and closed-source models, on ChemTable and reported a series of findings with practical and conceptual insights. Although models show reasonable performance on basic layout parsing, they exhibit substantial limitations on both descriptive and inferential QA tasks compared to human performance, and we observe significant performance gaps between open-source and closed-source models across multiple dimensions. These results underscore the challenges of chemistry-aware table understanding and position ChemTable as a rigorous and realistic benchmark for advancing scientific reasoning.
zh
[NLP-55] he Biased Samaritan: LLM biases in Perceived Kindness
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中存在的人口统计学偏差问题,旨在定量评估不同生成式AI模型在性别、种族和年龄等维度上的偏见。其解决方案的关键在于通过让模型评估道德患者愿意积极干预的意愿,确定各类商业模型的基准人口统计学身份及其与其他群体之间的关系,从而识别这些偏见是正面、中性还是负面,以及其强度。该方法能够区分通常交织在一起的两种偏见,为客观评估LLM中的偏见提供了新的途径。
链接: https://arxiv.org/abs/2506.11361
作者: Jack H Fagan,Ruhaan Juyaal,Amy Yue-Ming Yu,Siya Pun
机构: University of California, Davis (加州大学戴维斯分校)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:While Large Language Models (LLMs) have become ubiquitous in many fields, understanding and mitigating LLM biases is an ongoing issue. This paper provides a novel method for evaluating the demographic biases of various generative AI models. By prompting models to assess a moral patient’s willingness to intervene constructively, we aim to quantitatively evaluate different LLMs’ biases towards various genders, races, and ages. Our work differs from existing work by aiming to determine the baseline demographic identities for various commercial models and the relationship between the baseline and other demographics. We strive to understand if these biases are positive, neutral, or negative, and the strength of these biases. This paper can contribute to the objective assessment of bias in Large Language Models and give the user or developer the power to account for these biases in LLM output or in training future LLMs. Our analysis suggested two key findings: that models view the baseline demographic as a white middle-aged or young adult male; however, a general trend across models suggested that non-baseline demographics are more willing to help than the baseline. These methodologies allowed us to distinguish these two biases that are often tangled together.
zh
[NLP-56] GLAP: General contrastive audio-text pretraining across domains and languages
【速读】: 该论文旨在解决当前对比语言音频预训练(Contrastive Language Audio Pretraining, CLAP)方法主要针对英语语音和音乐检索,而忽视多语言口语内容的问题。其解决方案的关键在于引入通用语言音频预训练(General Language Audio Pretraining, GLAP),该方法扩展了CLAP的多语言和多领域能力,通过在标准音频-文本检索基准、语音检索与分类任务以及跨语言关键词检测等多方面表现出色,验证了其在多语言音频理解中的有效性。
链接: https://arxiv.org/abs/2506.11350
作者: Heinrich Dinkel,Zhiyong Yan,Tianzi Wang,Yongqing Wang,Xingwei Sun,Yadong Niu,Jizhong Liu,Gang Li,Junbo Zhang,Jian Luan
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Contrastive Language Audio Pretraining (CLAP) is a widely-used method to bridge the gap between audio and text domains. Current CLAP methods enable sound and music retrieval in English, ignoring multilingual spoken content. To address this, we introduce general language audio pretraining (GLAP), which expands CLAP with multilingual and multi-domain abilities. GLAP demonstrates its versatility by achieving competitive performance on standard audio-text retrieval benchmarks like Clotho and AudioCaps, while significantly surpassing existing methods in speech retrieval and classification tasks. Additionally, GLAP achieves strong results on widely used sound-event zero-shot benchmarks, while simultaneously outperforming previous methods on speech content benchmarks. Further keyword spotting evaluations across 50 languages emphasize GLAP’s advanced multilingual capabilities. Finally, multilingual sound and music understanding is evaluated across four languages. Checkpoints and Source: this https URL.
zh
[NLP-57] Do We Still Need Audio? Rethinking Speaker Diarization with a Text-Based Approach Using Multiple Prediction Models
【速读】: 该论文试图解决传统基于音频的说话人辨识(Speaker Diarization, SD)系统在音频质量不佳和说话人相似性较高时表现受限的问题。其解决方案的关键在于采用基于文本的方法,通过对话转录文本进行句级说话人切换检测。研究开发了两种模型:单次预测模型(Single Prediction Model, SPM)和多次预测模型(Multiple Prediction Model, MPM),其中MPM在短对话场景中表现出色,证明了文本驱动的SD方法在特定情境下可与先进的音频基SD系统相媲美,并突显了语义理解在SD系统中的重要性。
链接: https://arxiv.org/abs/2506.11344
作者: Peilin Wu,Jinho D. Choi
机构: Emory University (埃默里大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:We present a novel approach to Speaker Diarization (SD) by leveraging text-based methods focused on Sentence-level Speaker Change Detection within dialogues. Unlike audio-based SD systems, which are often challenged by audio quality and speaker similarity, our approach utilizes the dialogue transcript alone. Two models are developed: the Single Prediction Model (SPM) and the Multiple Prediction Model (MPM), both of which demonstrate significant improvements in identifying speaker changes, particularly in short conversations. Our findings, based on a curated dataset encompassing diverse conversational scenarios, reveal that the text-based SD approach, especially the MPM, performs competitively against state-of-the-art audio-based SD systems, with superior performance in short conversational contexts. This paper not only showcases the potential of leveraging linguistic features for SD but also highlights the importance of integrating semantic understanding into SD systems, opening avenues for future research in multimodal and semantic feature-based diarization.
zh
[NLP-58] From Replication to Redesign: Exploring Pairwise Comparisons for LLM -Based Peer Review
【速读】: 该论文试图解决传统同行评审流程在效率和质量评估上的局限性,尤其是在如何利用大型语言模型(Large Language Models, LLMs)重新构想评审机制方面缺乏深入探索的问题。论文提出的解决方案的关键在于引入一种新的机制,即使用LLM代理对论文进行两两比较,而非传统的个体评分方式。通过聚合大量两两评估的结果,该方法能够更准确、稳健地衡量论文的相对质量,实验结果表明其在识别高影响力论文方面显著优于传统评分方法。
链接: https://arxiv.org/abs/2506.11343
作者: Yaohui Zhang,Haijing Zhang,Wenlong Ji,Tianyu Hua,Nick Haber,Hancheng Cao,Weixin Liang
机构: Stanford University (斯坦福大学); Emory University (埃默里大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:The advent of large language models (LLMs) offers unprecedented opportunities to reimagine peer review beyond the constraints of traditional workflows. Despite these opportunities, prior efforts have largely focused on replicating traditional review workflows with LLMs serving as direct substitutes for human reviewers, while limited attention has been given to exploring new paradigms that fundamentally rethink how LLMs can participate in the academic review process. In this paper, we introduce and explore a novel mechanism that employs LLM agents to perform pairwise comparisons among manuscripts instead of individual scoring. By aggregating outcomes from substantial pairwise evaluations, this approach enables a more accurate and robust measure of relative manuscript quality. Our experiments demonstrate that this comparative approach significantly outperforms traditional rating-based methods in identifying high-impact papers. However, our analysis also reveals emergent biases in the selection process, notably a reduced novelty in research topics and an increased institutional imbalance. These findings highlight both the transformative potential of rethinking peer review with LLMs and critical challenges that future systems must address to ensure equity and diversity.
zh
[NLP-59] Surprisal from Larger Transformer-based Language Models Predicts fMRI Data More Poorly
【速读】: 该论文试图解决的问题是:Transformer-based语言模型的困惑度(perplexity)与其在预测人类句子处理难度时的预测能力之间的关系是否适用于神经成像数据,而不仅仅局限于基于延迟的测量指标。解决方案的关键在于评估17个预训练的Transformer-based模型在三种不同语言家族上的 surprisal 估计值对两个功能性磁共振成像(fMRI)数据集的预测能力,从而验证该关系是否可以推广到神经测量指标。
链接: https://arxiv.org/abs/2506.11338
作者: Yi-Chien Lin,William Schuler
机构: The Ohio State University (俄亥俄州立大学); Department of Linguistics (语言学系)
类目: Computation and Language (cs.CL)
备注:
Abstract:As Transformers become more widely incorporated into natural language processing tasks, there has been considerable interest in using surprisal from these models as predictors of human sentence processing difficulty. Recent work has observed a positive relationship between Transformer-based models’ perplexity and the predictive power of their surprisal estimates on reading times, showing that language models with more parameters and trained on more data are less predictive of human reading times. However, these studies focus on predicting latency-based measures (i.e., self-paced reading times and eye-gaze durations) with surprisal estimates from Transformer-based language models. This trend has not been tested on brain imaging data. This study therefore evaluates the predictive power of surprisal estimates from 17 pre-trained Transformer-based models across three different language families on two functional magnetic resonance imaging datasets. Results show that the positive relationship between model perplexity and model fit still obtains, suggesting that this trend is not specific to latency-based measures and can be generalized to neural measures.
zh
[NLP-60] Dont Pay Attention
【速读】: 该论文旨在解决传统Transformer模型在处理超过固定上下文窗口的序列时效率低下以及注意力机制的二次复杂度问题。其解决方案的关键在于提出了一种名为Avey的新神经基础架构,该架构摒弃了传统的注意力机制和循环结构,通过引入一个排序器和一个自回归神经处理器,协同识别并上下文化任意位置中最相关的标记,从而实现序列长度与上下文宽度的解耦,有效处理任意长度的序列。
链接: https://arxiv.org/abs/2506.11305
作者: Mohammad Hammoud,Devang Acharya
机构: Avey AI(艾维人工智能); OpenAI(开放人工智能)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The Transformer has become the de facto standard for large language models and a wide range of downstream tasks across various domains. Despite its numerous advantages like inherent training parallelism, the Transformer still faces key challenges due to its inability to effectively process sequences beyond a fixed context window and the quadratic complexity of its attention mechanism. These challenges have renewed interest in RNN-like architectures, which offer linear scaling with sequence length and improved handling of long-range dependencies, albeit with limited parallelism due to their inherently recurrent nature. In this paper, we propose Avey, a new neural foundational architecture that breaks away from both attention and recurrence. Avey comprises a ranker and an autoregressive neural processor, which collaboratively identify and contextualize only the most relevant tokens for any given token, regardless of their positions in the sequence. Specifically, Avey decouples sequence length from context width, thus enabling effective processing of arbitrarily long sequences. Experimental results show that Avey compares favorably to the Transformer across a variety of standard short-range NLP benchmarks, while notably excelling at capturing long-range dependencies.
zh
[NLP-61] Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning
【速读】: 该论文试图解决在预训练语言模型中探索课程学习(Curriculum Learning)的潜力问题,以提升训练效率和泛化能力。其解决方案的关键在于系统性地评估不同课程学习设置,包括原始课程学习、基于节奏的采样以及交错课程,并利用六种跨越语言学和信息论视角的难度度量来指导数据排序。实验结果表明,课程学习在早期和中期训练阶段能持续提升收敛性,并在作为预热策略时可带来最高3.5%的性能提升,同时识别出压缩比、词汇多样性和可读性作为有效的难度信号。
链接: https://arxiv.org/abs/2506.11300
作者: Yang Zhang,Amr Mohamed,Hadi Abdine,Guokan Shang,Michalis Vazirgiannis
机构: MBZUAI (穆巴达拉科技人工智能研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Curriculum learning has shown promise in improving training efficiency and generalization in various machine learning domains, yet its potential in pretraining language models remains underexplored, prompting our work as the first systematic investigation in this area. We experimented with different settings, including vanilla curriculum learning, pacing-based sampling, and interleaved curricula-guided by six difficulty metrics spanning linguistic and information-theoretic perspectives. We train models under these settings and evaluate their performance on eight diverse benchmarks. Our experiments reveal that curriculum learning consistently improves convergence in early and mid-training phases, and can yield lasting gains when used as a warmup strategy with up to 3.5% improvement. Notably, we identify compression ratio, lexical diversity, and readability as effective difficulty signals across settings. Our findings highlight the importance of data ordering in large-scale pretraining and provide actionable insights for scalable, data-efficient model development under realistic training scenarios.
zh
[NLP-62] Learning a Continue-Thinking Token for Enhanced Test-Time Scaling
【速读】: 该论文试图解决在推理过程中通过增加计算资源来提升语言模型性能的问题,特别是如何更有效地延长推理步骤以提高准确性。其解决方案的关键在于引入一个可学习的“continue-thinking”标记,该标记通过强化学习仅更新其嵌入向量,而保持模型权重冻结,从而实现对推理过程的动态调控。实验结果表明,该方法在标准数学基准测试中优于基线模型和使用固定标记(如“Wait”)的测试时缩放方法。
链接: https://arxiv.org/abs/2506.11274
作者: Liran Ringel,Elad Tolochinsky,Yaniv Romano
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Test-time scaling has emerged as an effective approach for improving language model performance by utilizing additional compute at inference time. Recent studies have shown that overriding end-of-thinking tokens (e.g., replacing “/think” with “Wait”) can extend reasoning steps and improve accuracy. In this work, we explore whether a dedicated continue-thinking token can be learned to trigger extended reasoning. We augment a distilled version of DeepSeek-R1 with a single learned “|continue-thinking|” token, training only its embedding via reinforcement learning while keeping the model weights frozen. Our experiments show that this learned token achieves improved accuracy on standard math benchmarks compared to both the baseline model and a test-time scaling approach that uses a fixed token (e.g., “Wait”) for budget forcing. In particular, we observe that in cases where the fixed-token approach enhances the base model’s accuracy, our method achieves a markedly greater improvement. For example, on the GSM8K benchmark, the fixed-token approach yields a 1.3% absolute improvement in accuracy, whereas our learned-token method achieves a 4.2% improvement over the base model that does not use budget forcing.
zh
[NLP-63] No Universal Prompt: Unifying Reasoning through Adaptive Prompting for Temporal Table Reasoning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在时间表推理(Temporal Table Reasoning)中的挑战,特别是如何通过有效的提示技术(prompting techniques)提取相关见解。现有方法在表格推理中的影响尚未被充分研究,且模型性能在不同表格和上下文结构中差异显著,难以确定最优方案。论文的关键解决方案是引入SEAR,一种受人类推理启发的自适应提示框架,该框架能够根据上下文特征动态调整并整合结构化推理,从而提升模型在各类表格类型上的表现。
链接: https://arxiv.org/abs/2506.11246
作者: Kushagra Dixit,Abhishek Rajgaria,Harshavardhan Kalalbandi,Dan Roth,Vivek Gupta
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 19 Tables, 9 Figures
Abstract:Temporal Table Reasoning is a critical challenge for Large Language Models (LLMs), requiring effective prompting techniques to extract relevant insights. Despite existence of multiple prompting methods, their impact on table reasoning remains largely unexplored. Furthermore, the performance of these models varies drastically across different table and context structures, making it difficult to determine an optimal approach. This work investigates multiple prompting technique across diverse table types to determine optimal approaches for different scenarios. We find that performance varies based on entity type, table structure, requirement of additional context and question complexity, with NO single method consistently outperforming others. To mitigate these challenges, we introduce SEAR, an adaptive prompting framework inspired by human reasoning that dynamically adjusts based on context characteristics and integrates a structured reasoning. Our results demonstrate that SEAR achieves superior performance across all table types compared to other baseline prompting techniques. Additionally, we explore the impact of table structure refactoring, finding that a unified representation enhances model’s reasoning.
zh
[NLP-64] Iterative Multilingual Spectral Attribute Erasure
【速读】: 该论文试图解决多语言表示中跨语言偏见迁移的问题,即现有去偏方法无法利用多语言共享语义空间的优势进行跨语言去偏。解决方案的关键在于提出迭代多语言谱属性消除(Iterative Multilingual Spectral Attribute Erasure, IMSAE),通过迭代的奇异值分解(SVD)截断识别并减轻多语言间的联合偏见子空间。
链接: https://arxiv.org/abs/2506.11244
作者: Shun Shao,Yftah Ziser,Zheng Zhao,Yifu Qiu,Shay B. Cohen,Anna Korhonen
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages, 3 figures
Abstract:Multilingual representations embed words with similar meanings to share a common semantic space across languages, creating opportunities to transfer debiasing effects between languages. However, existing methods for debiasing are unable to exploit this opportunity because they operate on individual languages. We present Iterative Multilingual Spectral Attribute Erasure (IMSAE), which identifies and mitigates joint bias subspaces across multiple languages through iterative SVD-based truncation. Evaluating IMSAE across eight languages and five demographic dimensions, we demonstrate its effectiveness in both standard and zero-shot settings, where target language data is unavailable, but linguistically similar languages can be used for debiasing. Our comprehensive experiments across diverse language models (BERT, LLaMA, Mistral) show that IMSAE outperforms traditional monolingual and cross-lingual approaches while maintaining model utility.
zh
[NLP-65] RETUYT-INCO at BEA 2025 Shared Task: How Far Can Lightweight Models Go in AI-powered Tutor Evaluation? ACL2025
【速读】: 该论文试图解决在计算资源受限环境下,小型模型是否能够与大型模型在自然语言处理任务中保持竞争力的问题。解决方案的关键在于采用参数量少于1B的生成式 AI 模型,以模拟全球南方地区研究机构在计算资源受限条件下的实际应用场景,并验证此类模型在BEA 2025共享任务中的有效性。尽管模型规模较小,但其性能仍能与使用更大模型的团队竞争,显示出小型模型在特定任务上的潜力。
链接: https://arxiv.org/abs/2506.11243
作者: Santiago Góngora,Ignacio Sastre,Santiago Robaina,Ignacio Remersaro,Luis Chiruzzo,Aiala Rosá
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This paper will be presented at the 20th BEA Workshop (Innovative Use of NLP for Building Educational Applications) at ACL 2025
Abstract:In this paper, we present the RETUYT-INCO participation at the BEA 2025 shared task. Our participation was characterized by the decision of using relatively small models, with fewer than 1B parameters. This self-imposed restriction tries to represent the conditions in which many research labs or institutions are in the Global South, where computational power is not easily accessible due to its prohibitive cost. Even under this restrictive self-imposed setting, our models managed to stay competitive with the rest of teams that participated in the shared task. According to the exact\ F_1 scores published by the organizers, the performance gaps between our models and the winners were as follows: 6.46 in Track 1; 10.24 in Track 2; 7.85 in Track 3; 9.56 in Track 4; and 13.13 in Track 5. Considering that the minimum difference with a winner team is 6.46 points – and the maximum difference is 13.13 – according to the exact\ F_1 score, we find that models with a size smaller than 1B parameters are competitive for these tasks, all of which can be run on computers with a low-budget GPU or even without a GPU.
zh
[NLP-66] LLM -as-a-Judge for Reference-less Automatic Code Validation and Refinement for Natural Language to Bash in IT Automation
【速读】: 该论文旨在解决在IT自动化中自动事件修复时,如何自动评估和选择最佳模型以提升代码质量的问题。其关键解决方案是通过增强“LLM-as-a-Judge”方法,利用双向功能匹配和逻辑表示进行无参考的自动验证与代码优化,从而更准确地判断生成的Bash代码是否符合语义和语法要求,并实现代码的自动精炼。
链接: https://arxiv.org/abs/2506.11237
作者: Ngoc Phuoc An Vo,Brent Paulovicks,Vadim Sheinin
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: 10 pages
Abstract:In an effort to automatically evaluate and select the best model and improve code quality for automatic incident remediation in IT Automation, it is crucial to verify if the generated code for remediation action is syntactically and semantically correct and whether it can be executed correctly as intended. There are three approaches: 1) conventional methods use surface form similarity metrics (token match, exact match, etc.) which have numerous limitations, 2) execution-based evaluation focuses more on code functionality based on pass/fail judgments for given test-cases, and 3) LLM-as-a-Judge employs LLMs for automated evaluation to judge if it is a correct answer for a given problem based on pre-defined metrics. In this work, we focused on enhancing LLM-as-a-Judge using bidirectional functionality matching and logic representation for reference-less automatic validation and refinement for Bash code generation to select the best model for automatic incident remediation in IT Automation. We used execution-based evaluation as ground-truth to evaluate our LLM-as-a-Judge metrics. Results show high accuracy and agreement with execution-based evaluation (and up to 8% over baseline). Finally, we built Reflection code agents to utilize judgments and feedback from our evaluation metrics which achieved significant improvement (up to 24% increase in accuracy) for automatic code refinement.
zh
[NLP-67] LLM -as-a-Fuzzy-Judge: Fine-Tuning Large Language Models as a Clinical Evaluation Judge with Fuzzy Logic
【速读】: 该论文旨在解决医学教育中临床沟通技能评估的自动化与可扩展性难题,特别是如何使自动化评估结果与主观医生判断相一致。其关键解决方案是提出“LLM-as-a-Fuzzy-Judge”方法,即通过将大型语言模型(Large Language Model, LLM)进行微调,基于四个模糊集合(包括专业性、医学相关性、伦理行为和情境干扰)的人工标注数据,对医学生在模拟患者对话中的表达进行评估,从而实现更符合人类偏好且具有解释性的自动化评估。
链接: https://arxiv.org/abs/2506.11221
作者: Weibing Zheng,Laurah Turner,Jess Kropczynski,Murat Ozer,Tri Nguyen,Shane Halse
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注: 12 pages, 1 figure, 2025 IFSA World Congress NAFIPS Annual Meeting
Abstract:Clinical communication skills are critical in medical education, and practicing and assessing clinical communication skills on a scale is challenging. Although LLM-powered clinical scenario simulations have shown promise in enhancing medical students’ clinical practice, providing automated and scalable clinical evaluation that follows nuanced physician judgment is difficult. This paper combines fuzzy logic and Large Language Model (LLM) and proposes LLM-as-a-Fuzzy-Judge to address the challenge of aligning the automated evaluation of medical students’ clinical skills with subjective physicians’ preferences. LLM-as-a-Fuzzy-Judge is an approach that LLM is fine-tuned to evaluate medical students’ utterances within student-AI patient conversation scripts based on human annotations from four fuzzy sets, including Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction. The methodology of this paper started from data collection from the LLM-powered medical education system, data annotation based on multidimensional fuzzy sets, followed by prompt engineering and the supervised fine-tuning (SFT) of the pre-trained LLMs using these human annotations. The results show that the LLM-as-a-Fuzzy-Judge achieves over 80% accuracy, with major criteria items over 90%, effectively leveraging fuzzy logic and LLM as a solution to deliver interpretable, human-aligned assessment. This work suggests the viability of leveraging fuzzy logic and LLM to align with human preferences, advances automated evaluation in medical education, and supports more robust assessment and judgment practices. The GitHub repository of this work is available at this https URL
zh
[NLP-68] Scalable Medication Extraction and Discontinuation Identification from Electronic Health Records Using Large Language Models
【速读】: 该论文试图解决在电子健康记录(Electronic Health Records, EHRs)中识别药物停用问题,这一问题对于患者安全至关重要,但常因信息隐藏在非结构化文本中而难以处理。解决方案的关键在于利用先进的开源和专有大型语言模型(Large Language Models, LLMs)从EHR注释中提取药物信息并分类其药物状态,重点评估这些模型在无需人工标注的情况下对药物信息提取的可扩展性。研究通过构建多个EHR数据集,并系统比较了12种先进LLMs在药物提取、药物状态分类及其联合任务中的性能,验证了LLMs在该任务中的潜力。
链接: https://arxiv.org/abs/2506.11137
作者: Chong Shao,Douglas Snyder,Chiran Li,Bowen Gu,Kerry Ngan,Chun-Ting Yang,Jiageng Wu,Richard Wyss,Kueiyu Joshua Lin,Jie Yang
机构: 未知
类目: Computation and Language (cs.CL)
备注: preprint, under review
Abstract:Identifying medication discontinuations in electronic health records (EHRs) is vital for patient safety but is often hindered by information being buried in unstructured notes. This study aims to evaluate the capabilities of advanced open-sourced and proprietary large language models (LLMs) in extracting medications and classifying their medication status from EHR notes, focusing on their scalability on medication information extraction without human annotation. We collected three EHR datasets from diverse sources to build the evaluation benchmark. We evaluated 12 advanced LLMs and explored multiple LLM prompting strategies. Performance on medication extraction, medication status classification, and their joint task (extraction then classification) was systematically compared across all experiments. We found that LLMs showed promising performance on the medication extraction and discontinuation classification from EHR notes. GPT-4o consistently achieved the highest average F1 scores in all tasks under zero-shot setting - 94.0% for medication extraction, 78.1% for discontinuation classification, and 72.7% for the joint task. Open-sourced models followed closely, Llama-3.1-70B-Instruct achieved the highest performance in medication status classification on the MIV-Med dataset (68.7%) and in the joint task on both the Re-CASI (76.2%) and MIV-Med (60.2%) datasets. Medical-specific LLMs demonstrated lower performance compared to advanced general-domain LLMs. Few-shot learning generally improved performance, while CoT reasoning showed inconsistent gains. LLMs demonstrate strong potential for medication extraction and discontinuation identification on EHR notes, with open-sourced models offering scalable alternatives to proprietary systems and few-shot can further improve LLMs’ capability.
zh
[NLP-69] Large Language Models and Emergence: A Complex Systems Perspective
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)是否展现出涌现能力(emergent capabilities)以及是否具备涌现智能(emergent intelligence)。其解决方案的关键在于通过分析和评估不同量化涌现的方法,探讨LLMs在处理复杂任务时是否能够表现出超越其组成部分简单相加的新型、更高效的能力,从而验证“少即是多”(less is more)这一智能涌现的核心理念。
链接: https://arxiv.org/abs/2506.11135
作者: David C. Krakauer,John W. Krakauer,Melanie Mitchell
机构: Santa Fe Institute(圣塔菲研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Emergence is a concept in complexity science that describes how many-body systems manifest novel higher-level properties, properties that can be described by replacing high-dimensional mechanisms with lower-dimensional effective variables and theories. This is captured by the idea “more is different”. Intelligence is a consummate emergent property manifesting increasingly efficient – cheaper and faster – uses of emergent capabilities to solve problems. This is captured by the idea “less is more”. In this paper, we first examine claims that Large Language Models exhibit emergent capabilities, reviewing several approaches to quantifying emergence, and secondly ask whether LLMs possess emergent intelligence.
zh
[NLP-70] A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data
【速读】: 该论文旨在解决在缺乏大量标注数据的情况下提升自动语音识别(Automatic Speech Recognition, ASR)性能的问题。其解决方案的关键在于提出一种自优化框架,通过利用未标注数据生成伪标签,并将其用于训练高保真文本到语音(Text-to-Speech, TTS)系统,随后将合成的语音-文本对重新引入原始ASR系统以实现闭环自我改进。该方法无需依赖大量标注数据,仅需少量文本数据和AI生成的合成内容即可显著提升ASR性能。
链接: https://arxiv.org/abs/2506.11130
作者: Cheng Kang Chou,Chan-Jan Hsu,Ho-Lam Chung,Liang-Hsuan Tseng,Hsi-Chun Cheng,Yu-Kuan Fu,Kuan Po Huang,Hung-Yi Lee
机构: MediaTek Research(联发科技研究部); National Taiwan University(台湾大学); Nvidia(英伟达)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:We propose a self-refining framework that enhances ASR performance with only unlabeled datasets. The process starts with an existing ASR model generating pseudo-labels on unannotated speech, which are then used to train a high-fidelity text-to-speech (TTS) system. Then, synthesized speech text pairs are bootstrapped into the original ASR system, completing the closed-loop self-improvement cycle. We demonstrated the effectiveness of the framework on Taiwanese Mandarin speech. Leveraging 6,000 hours of unlabeled speech, a moderate amount of text data, and synthetic content from the AI models, we adapt Whisper-large-v2 into a specialized model, Twister. Twister reduces error rates by up to 20% on Mandarin and 50% on Mandarin-English code-switching benchmarks compared to Whisper. Results highlight the framework as a compelling alternative to pseudo-labeling self-distillation approaches and provides a practical pathway for improving ASR performance in low-resource or domain-specific settings.
zh
[NLP-71] rustworthy AI for Medicine: Continuous Hallucination Detection and Elimination with CHECK
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在医疗领域应用中面临的幻觉问题,即模型生成的内容与事实或逻辑推理不符的现象。解决方案的关键在于提出CHECK框架,该框架结合结构化临床数据库与基于信息论的分类器,以持续学习的方式检测事实性和推理性幻觉。通过这种方法,CHECK显著降低了LLama3.3-70B-Instruct的幻觉率,并在多个医学基准测试中实现了高AUC值,同时提升了GPT-4o在USMLE考试中的通过率。
链接: https://arxiv.org/abs/2506.11129
作者: Carlos Garcia-Fernandez,Luis Felipe,Monique Shotande,Muntasir Zitu,Aakash Tripathi,Ghulam Rasool,Issam El Naqa,Vivek Rudrapatna,Gilmer Valdes
机构: Moffitt Cancer Center(莫菲特癌症中心); University of California San Francisco(加州大学旧金山分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) show promise in healthcare, but hallucinations remain a major barrier to clinical use. We present CHECK, a continuous-learning framework that integrates structured clinical databases with a classifier grounded in information theory to detect both factual and reasoning-based hallucinations. Evaluated on 1500 questions from 100 pivotal clinical trials, CHECK reduced LLama3.3-70B-Instruct hallucination rates from 31% to 0.3% - making an open source model state of the art. Its classifier generalized across medical benchmarks, achieving AUCs of 0.95-0.96, including on the MedQA (USMLE) benchmark and HealthBench realistic multi-turn medical questioning. By leveraging hallucination probabilities to guide GPT-4o’s refinement and judiciously escalate compute, CHECK boosted its USMLE passing rate by 5 percentage points, achieving a state-of-the-art 92.1%. By suppressing hallucinations below accepted clinical error thresholds, CHECK offers a scalable foundation for safe LLM deployment in medicine and other high-stakes domains.
zh
[NLP-72] Stronger Language Models Produce More Human-Like Errors
【速读】: 该论文试图解决的问题是:随着语言模型性能的提升,它们是否趋向于表现出类似人类的推理模式。研究的解决方案关键在于应用了 erotetic theory of reasoning (ETR) 这一形式化的认知框架,通过生成可预测人类推理错误的逻辑推理问题,并利用 PyETR 工具评估 38 个语言模型在 383 个推理任务中的表现,从而揭示模型错误模式与人类推理谬误之间的关系。
链接: https://arxiv.org/abs/2506.11128
作者: Andrew Keenan Richardson,Ryan Othniel Kearns,Sean Moss,Vincent Wang-Mascianica,Philipp Koralus
机构: The Laboratory for Human-Centered AI (HAI Lab); Institute for Ethics in AI; Faculty of Philosophy; University of Oxford; School of Computer Science, University of Birmingham
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Do language models converge toward human-like reasoning patterns as they improve? We provide surprising evidence that while overall reasoning capabilities increase with model sophistication, the nature of errors increasingly mirrors predictable human reasoning fallacies: a previously unobserved inverse scaling phenomenon. To investigate this question, we apply the Erotetic Theory of Reasoning (ETR), a formal cognitive framework with empirical support for predicting human reasoning outcomes. Using the open-source package PyETR, we generate logical reasoning problems where humans predictably err, evaluating responses from 38 language models across 383 reasoning tasks. Our analysis indicates that as models advance in general capability (as measured by Chatbot Arena scores), the proportion of their incorrect answers that align with ETR-predicted human fallacies tends to increase ( \rho = 0.360, p = 0.0265 ). Notably, as we observe no correlation between model sophistication and logical correctness on these tasks, this shift in error patterns toward human-likeness occurs independently of error rate. These findings challenge the prevailing view that scaling language models naturally obtains normative rationality, suggesting instead a convergence toward human-like cognition inclusive of our characteristic biases and limitations, as we further confirm by demonstrating order-effects in language model reasoning.
zh
[NLP-73] GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions
【速读】: 该论文旨在解决传统图形用户界面(Graphical User Interface, GUI)自主代理依赖文本指令所带来的可访问性和便利性限制,特别是在无手操作场景下的应用问题。其解决方案的关键在于提出GUIRoboTron-Speech,这是首个端到端的自主GUI代理,能够直接接受语音指令和本地设备截图以预测操作。为应对语音指令数据集稀缺的问题,研究者首先利用随机音色文本转语音(Text-to-Speech, TTS)模型生成高质量的语音指令用于训练,随后通过渐进式定位与规划训练阶段提升模型能力,并引入启发式混合指令训练策略以缓解预训练基础模型中的模态不平衡问题。
链接: https://arxiv.org/abs/2506.11127
作者: Wenkang Han,Zhixiong Zeng,Jing Huang,Shu Jiang,Liming Zheng,Longrong Yang,Haibo Qiu,Chang Yao,Jingyuan Chen,Lin Ma
机构: Meituan(美团); Zhejiang University(浙江大学); Harbin Institute of Technology(哈尔滨工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this gap, we propose GUIRoboTron-Speech, the first end-to-end autonomous GUI agent that directly accepts speech instructions and on-device screenshots to predict actions. Confronted with the scarcity of speech-based GUI agent datasets, we initially generated high-quality speech instructions for training by leveraging a random timbre text-to-speech (TTS) model to convert existing text instructions. We then develop GUIRoboTron-Speech’s capabilities through progressive grounding and planning training stages. A key contribution is a heuristic mixed-instruction training strategy designed to mitigate the modality imbalance inherent in pre-trained foundation models. Comprehensive experiments on several benchmark datasets validate the robust and superior performance of GUIRoboTron-Speech, demonstrating the significant potential and widespread applicability of speech as an effective instruction modality for driving GUI agents. Our code and datasets are available at this https URL.
zh
[NLP-74] ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams
【速读】: 该论文旨在解决由大型语言模型(Large Language Models, LLMs)结合文本转语音(Text-to-Speech, TTS)和自动语音识别(Automatic Speech Recognition, ASR)技术驱动的语音钓鱼(vishing)诈骗问题。研究指出,ASR转录环节是诈骗流程中最脆弱的环节,并提出了ASRJam防御框架,通过向受害者的音频中注入对抗性扰动来干扰攻击者的ASR系统,从而破坏诈骗的反馈回路。解决方案的关键在于在不影响人类通话者理解的前提下,有效干扰ASR系统的识别能力。此外,研究还提出了一种名为EchoGuard的新型干扰器,利用自然失真(如混响和回声)来干扰ASR,同时保持对人类听觉的可接受性。
链接: https://arxiv.org/abs/2506.11125
作者: Freddie Grabovski,Gilad Gressel,Yisroel Mirsky
机构: UW–Madison(威斯康星大学麦迪逊分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs), combined with Text-to-Speech (TTS) and Automatic Speech Recognition (ASR), are increasingly used to automate voice phishing (vishing) scams. These systems are scalable and convincing, posing a significant security threat. We identify the ASR transcription step as the most vulnerable link in the scam pipeline and introduce ASRJam, a proactive defence framework that injects adversarial perturbations into the victim’s audio to disrupt the attacker’s ASR. This breaks the scam’s feedback loop without affecting human callers, who can still understand the conversation. While prior adversarial audio techniques are often unpleasant and impractical for real-time use, we also propose EchoGuard, a novel jammer that leverages natural distortions, such as reverberation and echo, that are disruptive to ASR but tolerable to humans. To evaluate EchoGuard’s effectiveness and usability, we conducted a 39-person user study comparing it with three state-of-the-art attacks. Results show that EchoGuard achieved the highest overall utility, offering the best combination of ASR disruption and human listening experience.
zh
[NLP-75] SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR
【速读】: 该论文试图解决现实世界中端到端自动语音识别(Automatic Speech Recognition, ASR)系统因领域不匹配导致的性能下降问题,而Test-Time Adaptation (TTA) 旨在通过在推理过程中调整模型来缓解这一问题。解决方案的关键在于提出一种结合TTA与语言模型重排序(language model rescoring)的方法——SUTA-LM,其核心是通过一个由声学和语言信息共同引导的自步选择机制进行受控适应,随后利用语言模型进一步优化输出,从而有效提升跨领域的ASR性能。
链接: https://arxiv.org/abs/2506.11121
作者: Wei-Ping Huang,Guan-Ting Lin,Hung-yi Lee
机构: National Taiwan University (国立台湾大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:Despite progress in end-to-end ASR, real-world domain mismatches still cause performance drops, which Test-Time Adaptation (TTA) aims to mitigate by adjusting models during inference. Recent work explores combining TTA with external language models, using techniques like beam search rescoring or generative error correction. In this work, we identify a previously overlooked challenge: TTA can interfere with language model rescoring, revealing the nontrivial nature of effectively combining the two methods. Based on this insight, we propose SUTA-LM, a simple yet effective extension of SUTA, an entropy-minimization-based TTA approach, with language model rescoring. SUTA-LM first applies a controlled adaptation process guided by an auto-step selection mechanism leveraging both acoustic and linguistic information, followed by language model rescoring to refine the outputs. Experiments on 18 diverse ASR datasets show that SUTA-LM achieves robust results across a wide range of domains.
zh
[NLP-76] SDMPrune: Self-Distillation MLP Pruning for Efficient Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)部署成本高昂的问题,特别是通过梯度压缩方法进行模型压缩时存在的不足。现有方法在使用one-hot标签计算梯度时忽略了其他词的潜在预测,导致丢失影响生成能力的关键信息。该论文的关键解决方案是在剪枝阶段引入自蒸馏损失(self-distillation loss),以充分利用原始模型的预测信息,从而获得更准确的梯度用于剪枝。此外,研究发现相比注意力模块,LLM的预测对多层感知机(MLP)模块更不敏感,而MLP模块占用了超过5倍参数量,因此重点对MLP模块进行剪枝,实现了显著的模型压缩且性能下降不明显。
链接: https://arxiv.org/abs/2506.11120
作者: Hourun Zhu,Chengchao Shen
机构: Central South University (中南大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:In spite of strong performance achieved by LLMs, the costs of their deployment are unaffordable. For the compression of LLMs, gradient-based pruning methods present promising effectiveness. However, in these methods, the gradient computation with one-hot labels ignore the potential predictions on other words, thus missing key information for generative capability of the original model. To address this issue, we introduce a self-distillation loss during the pruning phase (rather than post-training) to fully exploit the predictions of the original model, thereby obtaining more accurate gradient information for pruning. Moreover, we find that, compared to attention modules, the predictions of LLM are less sensitive to multilayer perceptron (MLP) modules, which take up more than 5 \times parameters (LLaMA3.2-1.2B). To this end, we focus on the pruning of MLP modules, to significantly compress LLM without obvious performance degradation. Experimental results on extensive zero-shot benchmarks demonstrate that our method significantly outperforms existing pruning methods. Furthermore, our method achieves very competitive performance among 1B-scale open source LLMs. The source code and trained weights are available at this https URL.
zh
[NLP-77] Benchmarking Foundation Speech and Language Models for Alzheimers Disease and Related Dementia Detection from Spontaneous Speech
【速读】: 该论文旨在解决阿尔茨海默病及相关痴呆症(ADRD)的早期检测问题,通过分析自发言语中的声学和语言特征,探索非侵入性生物标志物的可能性。其解决方案的关键在于利用预训练的基础模型(foundation models),特别是语音和语言模型,从音频数据中提取高维嵌入,以实现对认知状态的分类,其中基于自动语音识别(ASR)生成的音频嵌入表现出最佳性能,同时非语义特征如停顿模式的引入进一步提升了文本分类的效果。
链接: https://arxiv.org/abs/2506.11119
作者: Jingyu Li,Lingchao Mao,Hairong Wang,Zhendong Wang,Xi Mao,Xuelei Sherry Ni
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:Background: Alzheimer’s disease and related dementias (ADRD) are progressive neurodegenerative conditions where early detection is vital for timely intervention and care. Spontaneous speech contains rich acoustic and linguistic markers that may serve as non-invasive biomarkers for cognitive decline. Foundation models, pre-trained on large-scale audio or text data, produce high-dimensional embeddings encoding contextual and acoustic features. Methods: We used the PREPARE Challenge dataset, which includes audio recordings from over 1,600 participants with three cognitive statuses: healthy control (HC), mild cognitive impairment (MCI), and Alzheimer’s Disease (AD). We excluded non-English, non-spontaneous, or poor-quality recordings. The final dataset included 703 (59.13%) HC, 81 (6.81%) MCI, and 405 (34.06%) AD cases. We benchmarked a range of open-source foundation speech and language models to classify cognitive status into the three categories. Results: The Whisper-medium model achieved the highest performance among speech models (accuracy = 0.731, AUC = 0.802). Among language models, BERT with pause annotation performed best (accuracy = 0.662, AUC = 0.744). ADRD detection using state-of-the-art automatic speech recognition (ASR) model-generated audio embeddings outperformed others. Including non-semantic features like pause patterns consistently improved text-based classification. Conclusion: This study introduces a benchmarking framework using foundation models and a clinically relevant dataset. Acoustic-based approaches – particularly ASR-derived embeddings – demonstrate strong potential for scalable, non-invasive, and cost-effective early detection of ADRD. Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS) MSC classes: 68T10 (Primary), 68U99 (Secondary) ACMclasses: I.2.1; J.3 Cite as: arXiv:2506.11119 [cs.CL] (or arXiv:2506.11119v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2506.11119 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Sherry Ni [view email] [v1] Mon, 9 Jun 2025 17:52:31 UTC (839 KB)
zh
[NLP-78] ScIRGen: Synthesize Realistic and Large-Scale RAG Dataset for Scientific Research KDD2025
【速读】: 该论文旨在解决现有科学检索与问答(QA)数据集无法准确反映科研人员实际信息需求的问题,因为这些数据集通常仅处理简单问题,而真实研究中的信息需求往往隐含在具体任务中,而非显式表达于搜索查询。其解决方案的关键在于提出ScIRGen框架,该框架通过基于学术论文的数据集增强表示方法、利用认知分类学生成高质量合成问题,并借助大语言模型(LLM)的困惑度变化自动过滤合成答案,从而构建出更贴近真实科研场景的大规模科学检索增强生成(RAG)数据集。
链接: https://arxiv.org/abs/2506.11117
作者: Junyong Lin,Lu Dai,Ruiqian Han,Yijie Sui,Ruilin Wang,Xingliang Sun,Qinglin Wu,Min Feng,Hao Liu,Hui Xiong
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); The Hong Kong University of Science and Technology(香港科技大学); Institute of Tibetan Plateau Research, Chinese Academy of Sciences(中国科学院青藏高原研究所); Lanzhou University(兰州大学); College of Resources and Environment, University of Chinese Academy of Sciences(中国科学院大学资源与环境学院
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: KDD 2025 Accepted
Abstract:Scientific researchers need intensive information about datasets to effectively evaluate and develop theories and methodologies. The information needs regarding datasets are implicitly embedded in particular research tasks, rather than explicitly expressed in search queries. However, existing scientific retrieval and question-answering (QA) datasets typically address straightforward questions, which do not align with the distribution of real-world research inquiries. To bridge this gap, we developed ScIRGen, a dataset generation framework for scientific QA \ retrieval that more accurately reflects the information needs of professional science researchers, and uses it to create a large-scale scientific retrieval-augmented generation (RAG) dataset with realistic queries, datasets and papers. Technically, we designed a dataset-oriented information extraction method that leverages academic papers to augment the dataset representation. We then proposed a question generation framework by employing cognitive taxonomy to ensure the quality of synthesized questions. We also design a method to automatically filter synthetic answers based on the perplexity shift of LLMs, which is highly aligned with human judgment of answers’ validity. Collectively, these methodologies culminated in the creation of the 61k QA dataset, ScIRGen-Geo. We benchmarked representative methods on the ScIRGen-Geo dataset for their question-answering and retrieval capabilities, finding out that current methods still suffer from reasoning from complex questions. This work advances the development of more sophisticated tools to support the intricate information needs of the scientific community.
zh
[NLP-79] Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models
【速读】: 该论文旨在解决开源大型语言模型(Large Language Models, LLMs)在指令跟随和基础能力方面与专有模型之间的性能差距问题。现有开源指令数据集多集中于特定领域,如数学或编程,限制了模型的泛化能力。其解决方案的关键在于提出Infinity-Instruct数据集,该数据集通过两阶段流程构建:第一阶段利用混合数据选择技术从超1亿样本中筛选出740万条高质量基础指令,第二阶段通过指令选择、演化及诊断过滤生成150万条高质量对话指令,从而提升模型的基础和对话能力。
链接: https://arxiv.org/abs/2506.11116
作者: Jijie Li,Li Du,Hanyu Zhao,Bo-wen Zhang,Liangdong Wang,Boyan Gao,Guang Liu,Yonghua Lin
机构: Beijing Academy of Artificial Intelligence (北京人工智能研究院); University of Oxford (牛津大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) demonstrate strong performance in real-world applications, yet existing open-source instruction datasets often concentrate on narrow domains, such as mathematics or coding, limiting generalization and widening the gap with proprietary models. To bridge this gap, we introduce Infinity-Instruct, a high-quality instruction dataset designed to enhance both foundational and chat capabilities of LLMs through a two-phase pipeline. In Phase 1, we curate 7.4M high-quality foundational instructions (InfInstruct-F-7.4M) from over 100M samples using hybrid data selection techniques. In Phase 2, we synthesize 1.5M high-quality chat instructions (InfInstruct-G-1.5M) through a two-stage process involving instruction selection, evolution, and diagnostic filtering. We empirically evaluate Infinity-Instruct by fine-tuning several open-source models, including Mistral, LLaMA, Qwen, and Yi, and observe substantial performance gains across both foundational and instruction following benchmarks, consistently surpassing official instruction-tuned counterparts. Notably, InfInstruct-LLaMA3.1-70B outperforms GPT-4-0314 by 8.6% on instruction following tasks while achieving comparable foundational performance. These results underscore the synergy between foundational and chat training and offer new insights into holistic LLM development. Our dataset\footnotethis https URL and codes\footnotethis https URL have been publicly released.
zh
[NLP-80] Incorporating Domain Knowledge into Materials Tokenization
【速读】: 该论文试图解决传统基于频率的分词方法在材料科学文本处理中导致的过度碎片化和语义损失问题,这些问题会破坏材料概念的结构和语义完整性。解决方案的关键在于提出MATTE(Material-aware Tokenization with Enhanced Representation),该方法将材料知识整合到分词过程中,利用基于材料知识库训练的MatDetector以及优先考虑材料概念的重排序方法,在分词过程中保持材料概念的结构完整性和语义一致性。
链接: https://arxiv.org/abs/2506.11115
作者: Yerim Oh,Jun-Hyung Park,Junho Kim,SungHo Kim,SangKeun Lee
机构: Korea University(高丽大学); Hankuk University of Foreign Studies(韩国外国语大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While language models are increasingly utilized in materials science, typical models rely on frequency-centric tokenization methods originally developed for natural language processing. However, these methods frequently produce excessive fragmentation and semantic loss, failing to maintain the structural and semantic integrity of material concepts. To address this issue, we propose MATTER, a novel tokenization approach that integrates material knowledge into tokenization. Based on MatDetector trained on our materials knowledge base and a re-ranking method prioritizing material concepts in token merging, MATTER maintains the structural integrity of identified material concepts and prevents fragmentation during tokenization, ensuring their semantic meaning remains intact. The experimental results demonstrate that MATTER outperforms existing tokenization methods, achieving an average performance gain of 4% and 2% in the generation and classification tasks, respectively. These results underscore the importance of domain knowledge for tokenization strategies in scientific text processing. Our code is available at this https URL
zh
[NLP-81] KokushiMD-10: Benchmark for Evaluating Large Language Models on Ten Japanese National Healthcare Licensing Examinations
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在医疗领域评估不足的问题,尤其是在多模态临床场景下的表现评估。现有基准测试主要基于文本、以英语为中心,并且侧重于药物知识,无法全面评估医疗领域的广泛知识和多模态推理能力。解决方案的关键在于构建KokushiMD-10,这是首个基于日本国家医疗执照考试的多模态基准,涵盖多个医疗专业领域,包含超过11588道真实考题,并整合了临床图像和专家标注的解释,以评估文本和视觉推理能力。
链接: https://arxiv.org/abs/2506.11114
作者: Junyu Liu,Kaiqi Yan,Tianyang Wang,Qian Niu,Momoko Nagai-Tanima,Tomoki Aoyama
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9pages, 3 figures
Abstract:Recent advances in large language models (LLMs) have demonstrated notable performance in medical licensing exams. However, comprehensive evaluation of LLMs across various healthcare roles, particularly in high-stakes clinical scenarios, remains a challenge. Existing benchmarks are typically text-based, English-centric, and focus primarily on medicines, which limits their ability to assess broader healthcare knowledge and multimodal reasoning. To address these gaps, we introduce KokushiMD-10, the first multimodal benchmark constructed from ten Japanese national healthcare licensing exams. This benchmark spans multiple fields, including Medicine, Dentistry, Nursing, Pharmacy, and allied health professions. It contains over 11588 real exam questions, incorporating clinical images and expert-annotated rationales to evaluate both textual and visual reasoning. We benchmark over 30 state-of-the-art LLMs, including GPT-4o, Claude 3.5, and Gemini, across both text and image-based settings. Despite promising results, no model consistently meets passing thresholds across domains, highlighting the ongoing challenges in medical AI. KokushiMD-10 provides a comprehensive and linguistically grounded resource for evaluating and advancing reasoning-centric medical AI across multilingual and multimodal clinical tasks.
zh
[NLP-82] Breaking the Reviewer: Assessing the Vulnerability of Large Language Models in Automated Peer Review Under Textual Adversarial Attacks
【速读】: 该论文试图解决在学术同行评审过程中,大型语言模型(Large Language Models, LLMs)作为自动化评审工具时面临的可靠性问题,尤其是在文本对抗攻击下的稳健性问题。解决方案的关键在于评估LLMs在生成评审意见方面的有效性,并分析其在面对对抗性攻击时的脆弱性,同时探讨可能的缓解策略,以确保AI技术能够增强而非损害学术交流的完整性。
链接: https://arxiv.org/abs/2506.11113
作者: Tzu-Ling Lin,Wei-Chih Chen,Teng-Fang Hsiao,Hou-I Liu,Ya-Hsin Yeh,Yu Kai Chan,Wen-Sheng Lien,Po-Yen Kuo,Philip S. Yu,Hong-Han Shuai
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); University of Illinois Chicago (伊利诺伊大学芝加哥分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Peer review is essential for maintaining academic quality, but the increasing volume of submissions places a significant burden on reviewers. Large language models (LLMs) offer potential assistance in this process, yet their susceptibility to textual adversarial attacks raises reliability concerns. This paper investigates the robustness of LLMs used as automated reviewers in the presence of such attacks. We focus on three key questions: (1) The effectiveness of LLMs in generating reviews compared to human reviewers. (2) The impact of adversarial attacks on the reliability of LLM-generated reviews. (3) Challenges and potential mitigation strategies for LLM-based review. Our evaluation reveals significant vulnerabilities, as text manipulations can distort LLM assessments. We offer a comprehensive evaluation of LLM performance in automated peer reviewing and analyze its robustness against adversarial attacks. Our findings emphasize the importance of addressing adversarial risks to ensure AI strengthens, rather than compromises, the integrity of scholarly communication.
zh
[NLP-83] Manifesto from Dagstuhl Perspectives Workshop 24352 – Conversational Agents : A Framework for Evaluation (CAFE)
【速读】: 该论文试图解决如何有效评估对话式信息访问(CONversational Information ACcess, CONIAC)系统的问题,其解决方案的关键在于提出了一种名为对话代理评估框架(Conversational Agents Framework for Evaluation, CAFE)的结构化评估方法,该框架包含六个核心组成部分:系统利益相关者的目标、评估中需研究的用户任务、用户执行任务的相关方面、评估标准、评估方法以及所选定量标准的度量方式。
链接: https://arxiv.org/abs/2506.11112
作者: Christine Bauer,Li Chen,Nicola Ferro,Norbert Fuhr,Avishek Anand,Timo Breuer,Guglielmo Faggioli,Ophir Frieder,Hideo Joho,Jussi Karlgren,Johannes Kiesel,Bart P. Knijnenburg,Aldo Lipani,Lien Michiels,Andrea Papenmeier,Maria Soledad Pera,Mark Sanderson,Scott Sanner,Benno Stein,Johanne R. Trippas,Karin Verspoor,Martijn C Willemsen
机构: University of Salzburg (萨尔茨堡大学); Hong Kong Baptist University (香港浸会大学); University of Padua (帕多瓦大学); Universität Duisburg-Essen (杜伊斯堡-埃森大学); TU Delft (代尔夫特理工大学); TH Köln (科隆应用技术大学); University of Padua (帕多瓦大学); Georgetown University (乔治城大学); University of Tsukuba (筑波大学); Silo AI (Silo AI); Bauhaus-Universität Weimar (包豪斯魏玛大学); Clemson University (克莱姆森大学); University College London (伦敦大学学院); imec-SMIT, Vrije Universiteit Brussel & University of Antwerp (IMEC-SMIT,布鲁塞尔自由大学与安特卫普大学); University of Twente (特文特大学); TU Delft (代尔夫特理工大学); RMIT University (皇家墨尔本理工大学); University of Toronto (多伦多大学); Bauhaus-Universität Weimar (包豪斯魏玛大学); RMIT University (皇家墨尔本理工大学); RMIT University (皇家墨尔本理工大学); TU Eindhoven & JADS (埃因霍温理工大学 & JADS)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注: 43 pages; 10 figures; Dagstuhl manifesto
Abstract:During the workshop, we deeply discussed what CONversational Information ACcess (CONIAC) is and its unique features, proposing a world model abstracting it, and defined the Conversational Agents Framework for Evaluation (CAFE) for the evaluation of CONIAC systems, consisting of six major components: 1) goals of the system’s stakeholders, 2) user tasks to be studied in the evaluation, 3) aspects of the users carrying out the tasks, 4) evaluation criteria to be considered, 5) evaluation methodology to be applied, and 6) measures for the quantitative criteria chosen.
zh
[NLP-84] Evaluating and Improving Robustness in Large Language Models : A Survey and Future Directions
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对意外应用场景时的鲁棒性问题,包括对抗性攻击、分布外(out-of-distribution, OOD)场景以及生成内容的正确性和稳定性等挑战。其解决方案的关键在于从三个主要角度对LLMs的鲁棒性进行系统性综述:对抗鲁棒性、OOD鲁棒性以及鲁棒性的评估方法,通过梳理相关概念、方法和评价体系,为研究社区提供全面的参考框架,并推动该领域的进一步发展。
链接: https://arxiv.org/abs/2506.11111
作者: Kun Zhang,Le Wu,Kui Yu,Guangyi Lv,Dacao Zhang
机构: Hefei University of Technology (合肥工业大学); AI Laboratory, Lenovo Research (人工智能实验室,联想研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 33 pages, 5 figures
Abstract:Large Language Models (LLMs) have gained enormous attention in recent years due to their capability of understanding and generating natural languages. With the rapid development and wild-range applications (e.g., Agents, Embodied Intelligence), the robustness of LLMs has received increased attention. As the core brain of many AI applications, the robustness of LLMs requires that models should not only generate consistent contents, but also ensure the correctness and stability of generated content when dealing with unexpeted application scenarios (e.g., toxic prompts, limited noise domain data, outof-distribution (OOD) applications, etc). In this survey paper, we conduct a thorough review of the robustness of LLMs, aiming to provide a comprehensive terminology of concepts and methods around this field and facilitate the community. Specifically, we first give a formal definition of LLM robustness and present the collection protocol of this survey paper. Then, based on the types of perturbated inputs, we organize this survey from the following perspectives: 1) Adversarial Robustness: tackling the problem that prompts are manipulated intentionally, such as noise prompts, long context, data attack, etc; 2) OOD Robustness: dealing with the unexpected real-world application scenarios, such as OOD detection, zero-shot transferring, hallucinations, etc; 3) Evaluation of Robustness: summarizing the new evaluation datasets, metrics, and tools for verifying the robustness of LLMs. After reviewing the representative work from each perspective, we discuss and highlight future opportunities and research directions in this field. Meanwhile, we also organize related works and provide an easy-to-search project (this https URL) to support the community.
zh
[NLP-85] AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models
【速读】: 该论文试图解决的问题是:在面对方向性框架的陈述时,大型语言模型(Large Language Models, LLMs)如何保持其事实判断的一致性,即模型是否会在用户不同表述下改变对同一事实的评估。解决方案的关键在于构建AssertBench基准测试,通过从FEVEROUS数据集中采样有证据支持的事实,并为每个事实构造两种框架提示:一种是用户声称该陈述为事实正确,另一种是用户声称其错误,从而记录模型的同意程度和推理过程。该方法通过根据模型在中性呈现下的准确性对结果进行分层,隔离由框架引起的变量,以评估模型在面对矛盾用户断言时坚持自身判断的能力。
链接: https://arxiv.org/abs/2506.11110
作者: Jaeho Lee,Atharv Chowdhary
机构: Brown University (布朗大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 4 figures, appendix contains 2 additional figures and 2 tables
Abstract:Recent benchmarks have probed factual consistency and rhetorical robustness in Large Language Models (LLMs). However, a knowledge gap exists regarding how directional framing of factually true statements influences model agreement, a common scenario for LLM users. AssertBench addresses this by sampling evidence-supported facts from FEVEROUS, a fact verification dataset. For each (evidence-backed) fact, we construct two framing prompts: one where the user claims the statement is factually correct, and another where the user claims it is incorrect. We then record the model’s agreement and reasoning. The desired outcome is that the model asserts itself, maintaining consistent truth evaluation across both framings, rather than switching its evaluation to agree with the user. AssertBench isolates framing-induced variability from the model’s underlying factual knowledge by stratifying results based on the model’s accuracy on the same claims when presented neutrally. In doing so, this benchmark aims to measure an LLM’s ability to “stick to its guns” when presented with contradictory user assertions about the same fact. The complete source code is available at this https URL.
zh
[NLP-86] Enhancing Large Language Models for Mobility Analytics with Semantic Location Tokenization KDD’25
【速读】: 该论文旨在解决现有基于大语言模型(Large Language Models, LLMs)的移动性分析方法在语义化位置表示不足以及移动信号建模能力有限的问题。其关键解决方案是提出QT-Mob框架,该框架通过引入位置分词模块,学习紧凑且语义丰富的位置标记,以保留上下文信息并兼容LLMs;同时结合多种互补的微调目标,使学习到的标记与LLMs内部表示对齐,从而提升模型对序列移动模式和位置语义的理解能力。
链接: https://arxiv.org/abs/2506.11109
作者: Yile Chen,Yicheng Tao,Yue Jiang,Shuai Liu,Han Yu,Gao Cong
机构: Nanyang Technological University (南洋理工大学); Southern University of Science and Technology (南方科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by KDD’25
Abstract:The widespread adoption of location-based services has led to the generation of vast amounts of mobility data, providing significant opportunities to model user movement dynamics within urban environments. Recent advancements have focused on adapting Large Language Models (LLMs) for mobility analytics. However, existing methods face two primary limitations: inadequate semantic representation of locations (i.e., discrete IDs) and insufficient modeling of mobility signals within LLMs (i.e., single templated instruction fine-tuning). To address these issues, we propose QT-Mob, a novel framework that significantly enhances LLMs for mobility analytics. QT-Mob introduces a location tokenization module that learns compact, semantically rich tokens to represent locations, preserving contextual information while ensuring compatibility with LLMs. Furthermore, QT-Mob incorporates a series of complementary fine-tuning objectives that align the learned tokens with the internal representations in LLMs, improving the model’s comprehension of sequential movement patterns and location semantics. The proposed QT-Mob framework not only enhances LLMs’ ability to interpret mobility data but also provides a more generalizable approach for various mobility analytics tasks. Experiments on three real-world dataset demonstrate the superior performance in both next-location prediction and mobility recovery tasks, outperforming existing deep learning and LLM-based methods.
zh
[NLP-87] History-Aware Cross-Attention Reinforcement: Self-Supervised Multi Turn and Chain-of-Thought Fine-Tuning with vLLM
【速读】: 该论文旨在解决多轮对话(multi-turn dialogue)与链式思维推理(chain-of-thought reasoning)的联合建模问题,通过改进自监督跨注意力引导强化学习(Self-Supervised Cross-Attention-Guided Reinforcement, CAGSR)框架,结合高性能的vLLM运行时环境实现。其解决方案的关键在于对vLLM的C++/CUDA内核进行改造,以异步捕获生成过程中每层、每头的跨注意力权重,并将自监督奖励函数扩展至整个对话历史和中间链式思维步骤,从而有效整合上下文信息并提升模型的推理能力。
链接: https://arxiv.org/abs/2506.11108
作者: Andrew Kiruluta,Andreas Lemos,Priscilla Burity
机构: UC Berkeley, School of Information
类目: Computation and Language (cs.CL)
备注:
Abstract:We present CAGSR-vLLM-MTC, an extension of our Self-Supervised Cross-Attention-Guided Reinforcement (CAGSR) framework, now implemented on the high-performance vLLM runtime, to address both multi-turn dialogue and chain-of-thought reasoning. Building upon our original single-turn approach, we first instrumented vLLM’s C++/CUDA kernels to asynchronously capture per-layer, per-head cross-attention weights during generation. We then generalized our self-supervised reward function to accumulate attention signals over entire conversation histories and intermediate chain-of-thought steps. We discuss practical trade-offs, including an entropy-based clamping mechanism to prevent attention collapse on early context, and outline future directions for multi-party dialogues and hierarchical reasoning.
zh
[NLP-88] Graph-based RAG Enhancement via Global Query Disambiguation and Dependency-Aware Reranking
【速读】: 该论文旨在解决现有基于图的检索增强生成(RAG)方法在处理用户查询时,因仅依赖实体级提取而导致的潜在关键信息和关系被误读或遗漏的问题,从而引发检索内容不相关、矛盾或重要知识缺失,加剧幻觉风险并降低生成结果的准确性。其解决方案的关键在于提出PankRAG框架,该框架结合了全局感知的分层查询解析策略与一种新颖的依赖感知重排序机制,通过构建多层级解析路径捕捉查询中的并行与顺序依赖关系,并利用已解析子问题间的依赖结构优化后续子问题的检索结果,提升整体检索与生成质量。
链接: https://arxiv.org/abs/2506.11106
作者: Ningyuan Li,Junrui Liu,Yi Shan,Minghui Huang,Tong Li
机构: Beijing university of technology (北京工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Contemporary graph-based retrieval-augmented generation (RAG) methods typically begin by extracting entities from user queries and then leverage pre-constructed knowledge graphs to retrieve related relationships and metadata. However, this pipeline’s exclusive reliance on entity-level extraction can lead to the misinterpretation or omission of latent yet critical information and relations. As a result, retrieved content may be irrelevant or contradictory, and essential knowledge may be excluded, exacerbating hallucination risks and degrading the fidelity of generated responses. To address these limitations, we introduce PankRAG, a framework that combines a globally aware, hierarchical query-resolution strategy with a novel dependency-aware reranking mechanism. PankRAG first constructs a multi-level resolution path that captures both parallel and sequential interdependencies within a query, guiding large language models (LLMs) through structured reasoning. It then applies its dependency-aware reranker to exploit the dependency structure among resolved sub-questions, enriching and validating retrieval results for subsequent sub-questions. Empirical evaluations demonstrate that PankRAG consistently outperforms state-of-the-art approaches across multiple benchmarks, underscoring its robustness and generalizability.
zh
[NLP-89] Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗场景中应用时面临的部署难题,即其庞大的模型规模难以适应实时、资源受限的边缘设备环境。解决方案的关键在于通过一种通用的压缩框架对LLMs进行优化,该框架首先基于领域特定数据测量神经元显著性,从而大幅剪枝无关神经元以减小模型规模,同时保持性能;随后应用后训练量化进一步降低内存占用,并在多个医疗基准测试中验证了压缩模型的有效性,最终实现了在Jetson Orin Nano和Raspberry Pi 5等低功耗硬件上的实时、高效推理。
链接: https://arxiv.org/abs/2506.11105
作者: Uttej Kallakurik,Edward Humes,Rithvik Jonna,Xiaomin Lin,Tinoosh Mohsenin
机构: Johns Hopkins Whiting School of Engineering (约翰霍普金斯大学工程学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Systems and Control (eess.SY)
备注:
Abstract:Large Language Models (LLMs) have significant impact on the healthcare scenarios but remain prohibitively large for deployment in real-time, resource-constrained environments such as edge devices. In this work, we introduce a novel medical assistant system, optimized through our general-purpose compression framework, which tailors Large Language Models (LLMs) for deployment in specialized domains. By measuring neuron saliency on domain-specific data, our method can aggressively prune irrelevant neurons, reducing model size while preserving performance. Following pruning, we apply post-training quantization to further reduce the memory footprint, and evaluate the compressed model across medical benchmarks including MedMCQA, MedQA, and PubMedQA. We also deploy the 50% compressed Gemma and the 67% compressed LLaMA3 models on Jetson Orin Nano (18.7W peak) and Raspberry Pi 5 (6.3W peak), achieving real-time, energy-efficient inference under hardware constraints.
zh
[NLP-90] DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration
【速读】: 该论文旨在解决长序列任务中Transformer模型因自注意力机制的二次复杂度导致的效率问题,以及静态稀疏注意力方法无法捕捉异构注意力模式所引起的次优令牌交互问题。其解决方案的关键在于引入一种动态稀疏注意力机制(Dynamic Sparse Attention, DAM),该机制在注意力图层面分配自适应掩码,从而保留跨层和头的异构模式,无需进行微调或预定义掩码结构,同时保持计算效率,并实现与全注意力模型的高度对齐。
链接: https://arxiv.org/abs/2506.11104
作者: Hanzhi Zhang,Heng Fan,Kewei Sha,Yan Huang,Yunhe Feng
机构: LLaVi Lab, Department of Computer Science & Engineering; Department of Data Science
LLaVi Lab, Department of Computer Science & Engineering; Department of Data Science
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks, failing to capture heterogeneous attention patterns. This results in suboptimal token interactions, limiting adaptability and retrieval accuracy in long-sequence tasks. This work introduces a dynamic sparse attention mechanism that assigns adaptive masks at the attention-map level, preserving heterogeneous patterns across layers and heads. Unlike existing approaches, our method eliminates the need for fine-tuning and predefined mask structures while maintaining computational efficiency. By learning context-aware attention structures, it achieves high alignment with full-attention models, ensuring minimal performance degradation while reducing memory and compute overhead. This approach provides a scalable alternative to full attention, enabling the practical deployment of large-scale Large Language Models (LLMs) without sacrificing retrieval performance. DAM is available at: this https URL.
zh
[NLP-91] You Only Fine-tune Once: Many-Shot In-Context Fine-Tuning for Large Language Model
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在上下文学习(In-Context Learning, ICL)中性能不足的问题,即通过少样本或零样本的上下文微调难以达到专用微调的性能。其解决方案的关键在于提出一种新的方法——多样本上下文微调(Many-Shot In-Context Fine-tuning, ManyICL),该方法通过将多个样本视为监督训练目标,而非仅关注最终答案,从而显著提升模型在多种下游任务上的表现,并减少灾难性遗忘问题。
链接: https://arxiv.org/abs/2506.11103
作者: Wenchong He,Liqian Peng,Zhe Jiang,Alex Go
机构: University of Florida (佛罗里达大学); Google (谷歌)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures
Abstract:Large language models (LLMs) possess a remarkable ability to perform in-context learning (ICL), which enables them to handle multiple downstream tasks simultaneously without requiring task-specific fine-tuning. Recent studies have shown that even moderately sized LLMs, such as Mistral 7B, Gemma 7B and Llama-3 8B, can achieve ICL through few-shot in-context fine-tuning of all tasks at once. However, this approach still lags behind dedicated fine-tuning, where a separate model is trained for each individual task. In this paper, we propose a novel approach, Many-Shot In-Context Fine-tuning (ManyICL), which significantly narrows this performance gap by extending the principles of ICL to a many-shot setting. To unlock the full potential of ManyICL and address the inherent inefficiency of processing long sequences with numerous in-context examples, we propose a novel training objective. Instead of solely predicting the final answer, our approach treats every answer within the context as a supervised training target. This effectively shifts the role of many-shot examples from prompts to targets for autoregressive learning. Through extensive experiments on diverse downstream tasks, including classification, summarization, question answering, natural language inference, and math, we demonstrate that ManyICL substantially outperforms zero/few-shot fine-tuning and approaches the performance of dedicated fine-tuning. Furthermore, ManyICL significantly mitigates catastrophic forgetting issues observed in zero/few-shot fine-tuning. The code will be made publicly available upon publication. Comments: 16 pages, 6 figures Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.11103 [cs.CL] (or arXiv:2506.11103v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2506.11103 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-92] Evolutionary Perspectives on the Evaluation of LLM -Based AI Agents : A Comprehensive Survey
【速读】: 该论文试图解决现有评估框架在区分生成式 AI (Generative AI) 代理与传统大语言模型 (Large Language Models, LLMs) 对话机器人时存在的模糊性问题,从而导致研究人员在选择基准测试时产生困惑。解决方案的关键在于提出一个系统化的分析框架,从五个核心维度(复杂环境、多源指令、动态反馈、多模态感知和高级能力)清晰地区分 AI 代理与 LLM 对话机器人,并根据外部环境驱动力和内部能力演化对现有评估基准进行分类,为后续研究提供结构化参考。
链接: https://arxiv.org/abs/2506.11102
作者: Jiachen Zhu,Menghui Zhu,Renting Rui,Rong Shan,Congmin Zheng,Bo Chen,Yunjia Xi,Jianghao Lin,Weiwen Liu,Ruiming Tang,Yong Yu,Weinan Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The advent of large language models (LLMs), such as GPT, Gemini, and DeepSeek, has significantly advanced natural language processing, giving rise to sophisticated chatbots capable of diverse language-related tasks. The transition from these traditional LLM chatbots to more advanced AI agents represents a pivotal evolutionary step. However, existing evaluation frameworks often blur the distinctions between LLM chatbots and AI agents, leading to confusion among researchers selecting appropriate benchmarks. To bridge this gap, this paper introduces a systematic analysis of current evaluation approaches, grounded in an evolutionary perspective. We provide a detailed analytical framework that clearly differentiates AI agents from LLM chatbots along five key aspects: complex environment, multi-source instructor, dynamic feedback, multi-modal perception, and advanced capability. Further, we categorize existing evaluation benchmarks based on external environments driving forces, and resulting advanced internal capabilities. For each category, we delineate relevant evaluation attributes, presented comprehensively in practical reference tables. Finally, we synthesize current trends and outline future evaluation methodologies through four critical lenses: environment, agent, evaluator, and metrics. Our findings offer actionable guidance for researchers, facilitating the informed selection and application of benchmarks in AI agent evaluation, thus fostering continued advancement in this rapidly evolving research domain.
zh
[NLP-93] Knowledge Graph Embeddings with Representing Relations as Annular Sectors
【速读】: 该论文旨在解决知识图谱(Knowledge Graph, KG)中实体语义层次结构被现有基于区域的嵌入模型忽视的问题。传统方法通常将实体嵌入为点,关系建模为几何区域,但未能充分捕捉实体间的语义层级。论文提出的解决方案是SectorE,其关键在于在极坐标系下将关系建模为环形扇区,结合模长和相位来捕获推理模式和关系属性,同时将实体嵌入到这些扇区内部,从而直观地编码语义层次结构。
链接: https://arxiv.org/abs/2506.11099
作者: Huiling Zhu,Yingqi Zeng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Knowledge graphs (KGs), structured as multi-relational data of entities and relations, are vital for tasks like data analysis and recommendation systems. Knowledge graph completion (KGC), or link prediction, addresses incompleteness of KGs by inferring missing triples (h, r, t). It is vital for downstream applications. Region-based embedding models usually embed entities as points and relations as geometric regions to accomplish the task. Despite progress, these models often overlook semantic hierarchies inherent in entities. To solve this problem, we propose SectorE, a novel embedding model in polar coordinates. Relations are modeled as annular sectors, combining modulus and phase to capture inference patterns and relation attributes. Entities are embedded as points within these sectors, intuitively encoding hierarchical structure. Evaluated on FB15k-237, WN18RR, and YAGO3-10, SectorE achieves competitive performance against various kinds of models, demonstrating strengths in semantic modeling capability.
zh
[NLP-94] C-SEO Bench: Does Conversational SEO Work?
【速读】: 该论文试图解决当前针对生成式 AI 驱动的对话式搜索引擎(Conversational Search Engine, CSE)的优化策略(即 C-SEO)在跨领域适用性和多竞争场景下的有效性问题。现有 C-SEO 方法通常仅在有限的应用领域中进行测试,且缺乏对多个竞争方同时采用先进策略时效果变化的评估。论文提出 C-SEO Bench,这是首个用于评估 C-SEO 方法在多种任务、领域及参与方数量下的基准测试平台,其关键在于设计了涵盖不同搜索任务(如问答和产品推荐)以及不同采用率的新型评估协议,从而更真实地模拟实际应用场景。
链接: https://arxiv.org/abs/2506.11097
作者: Haritz Puerto,Martin Gubri,Tommaso Green,Seong Joon Oh,Sangdoo Yun
机构: Parameter Lab; UKP Lab, Technical University of Darmstadt; Data and Web Science Group, University of Mannheim; University of Tübingen; Tübingen AI Center; NAVER AI Lab
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Large Language Models (LLMs) are transforming search engines into Conversational Search Engines (CSE). Consequently, Search Engine Optimization (SEO) is being shifted into Conversational Search Engine Optimization (C-SEO). We are beginning to see dedicated C-SEO methods for modifying web documents to increase their visibility in CSE responses. However, they are often tested only for a limited breadth of application domains; we do not understand whether certain C-SEO methods would be effective for a broad range of domains. Moreover, existing evaluations consider only a single-actor scenario where only one web document adopts a C-SEO method; in reality, multiple players are likely to competitively adopt the cutting-edge C-SEO techniques, drawing an analogy from the dynamics we have seen in SEO. We present C-SEO Bench, the first benchmark designed to evaluate C-SEO methods across multiple tasks, domains, and number of actors. We consider two search tasks, question answering and product recommendation, with three domains each. We also formalize a new evaluation protocol with varying adoption rates among involved actors. Our experiments reveal that most current C-SEO methods are largely ineffective, contrary to reported results in the literature. Instead, traditional SEO strategies, those aiming to improve the ranking of the source in the LLM context, are significantly more effective. We also observe that as we increase the number of C-SEO adopters, the overall gains decrease, depicting a congested and zero-sum nature of the problem. Our code and data are available at this https URL and this https URL.
zh
[NLP-95] Assessing the Impact of Anisotropy in Neural Representations of Speech: A Case Study on Keyword Spotting
【速读】: 该论文试图解决预训练语音表示(如wav2vec2和HuBERT)中存在的各向异性问题,该问题导致随机嵌入之间具有高相似性,但其对下游任务的影响尚不明确。论文通过在关键词检测任务中评估各向异性,提出解决方案的关键在于利用动态时间规整(Dynamic Time Warping)方法,证明尽管存在各向异性,wav2vec2的相似性度量仍能有效识别单词而无需转录。研究结果表明,这些表示具有鲁棒性,能够捕捉语音的音素结构并在不同说话人之间进行泛化,强调了预训练在学习丰富且不变的语音表示中的重要性。
链接: https://arxiv.org/abs/2506.11096
作者: Guillaume Wisniewski(LLF - UMR7110),Séverine Guillaume(LACITO),Clara Rosina Fernández(LACITO)
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Pretrained speech representations like wav2vec2 and HuBERT exhibit strong anisotropy, leading to high similarity between random embeddings. While widely observed, the impact of this property on downstream tasks remains unclear. This work evaluates anisotropy in keyword spotting for computational documentary linguistics. Using Dynamic Time Warping, we show that despite anisotropy, wav2vec2 similarity measures effectively identify words without transcription. Our results highlight the robustness of these representations, which capture phonetic structures and generalize across speakers. Our results underscore the importance of pretraining in learning rich and invariant speech representations.
zh
[NLP-96] Persistent Homology of Topic Networks for the Prediction of Reader Curiosity
【速读】: 该论文试图解决文本互动中读者好奇心(reader curiosity)的建模与预测问题,这一现象在自然语言处理(NLP)领域仍相对缺乏研究。其解决方案的关键在于基于Loewenstein的信息缺口理论,构建一个通过量化文本语义结构中的信息缺口来建模读者好奇心的框架。该方法利用受BERTopic启发的主题建模和持久同调(persistent homology)分析由文本片段生成的动态语义网络的拓扑结构(如连通组件、环、空洞等),并将这些拓扑特征作为信息缺口的代理指标,从而实现对读者好奇心的有效预测。
链接: https://arxiv.org/abs/2506.11095
作者: Manuel D. S. Hopp,Vincent Labatut(LIA),Arthur Amalvy(LIA),Richard Dufour(LS2N - équipe TALN),Hannah Stone,Hayley Jach,Kou Murayama
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Reader curiosity, the drive to seek information, is crucial for textual engagement, yet remains relatively underexplored in NLP. Building on Loewenstein’s Information Gap Theory, we introduce a framework that models reader curiosity by quantifying semantic information gaps within a text’s semantic structure. Our approach leverages BERTopic-inspired topic modeling and persistent homology to analyze the evolving topology (connected components, cycles, voids) of a dynamic semantic network derived from text segments, treating these features as proxies for information gaps. To empirically evaluate this pipeline, we collect reader curiosity ratings from participants (n = 49) as they read S. Collins’s ‘‘The Hunger Games’’ novel. We then use the topological features from our pipeline as independent variables to predict these ratings, and experimentally show that they significantly improve curiosity prediction compared to a baseline model (73% vs. 30% explained deviance), validating our approach. This pipeline offers a new computational method for analyzing text structure and its relation to reader engagement.
zh
[NLP-97] he Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLM s
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在实际应用中存在的安全性评估问题,旨在系统性地总结和分析当前LLMs安全评估的研究进展。其解决方案的关键在于从四个维度进行综合探讨:即“为何评估”、“评估什么”、“在哪里评估”以及“如何评估”,通过梳理安全评估的背景、任务分类、评价指标与数据集、评估工具及方法,为后续研究提供理论支持与实践指导,从而推动LLMs安全性的进一步提升。
链接: https://arxiv.org/abs/2506.11094
作者: Songyang Liu,Chaozhuo Li,Jiameng Qiu,Xi Zhang,Feiran Huang,Litian Zhang,Yiming Hei,Philip S. Yu
机构: Beijing University of Posts and Telecommunications, School of Cyberspace Security (北京邮电大学网络空间安全学院); Jinan University, School of Cyberspace Security (暨南大学网络空间安全学院); Beihang University, School of Cyberspace Security (北京航空航天大学网络空间安全学院); China Academy of Information and Communications Technology (中国信息通信研究院); University of Illinons at Chicago, Department of Computer Science (伊利诺伊大学芝加哥分校计算机科学系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 21 pages, preprint
Abstract:With the rapid advancement of artificial intelligence technology, Large Language Models (LLMs) have demonstrated remarkable potential in the field of Natural Language Processing (NLP), including areas such as content generation, human-computer interaction, machine translation, and code generation, among others. However, their widespread deployment has also raised significant safety concerns. In recent years, LLM-generated content has occasionally exhibited unsafe elements like toxicity and bias, particularly in adversarial scenarios, which has garnered extensive attention from both academia and industry. While numerous efforts have been made to evaluate the safety risks associated with LLMs, there remains a lack of systematic reviews summarizing these research endeavors. This survey aims to provide a comprehensive and systematic overview of recent advancements in LLMs safety evaluation, focusing on several key aspects: (1) “Why evaluate” that explores the background of LLMs safety evaluation, how they differ from general LLMs evaluation, and the significance of such evaluation; (2) “What to evaluate” that examines and categorizes existing safety evaluation tasks based on key capabilities, including dimensions such as toxicity, robustness, ethics, bias and fairness, truthfulness, and so on; (3) “Where to evaluate” that summarizes the evaluation metrics, datasets and benchmarks currently used in safety evaluations; (4) “How to evaluate” that reviews existing evaluation toolkit, and categorizing mainstream evaluation methods based on the roles of the evaluators. Finally, we identify the challenges in LLMs safety evaluation and propose potential research directions to promote further advancement in this field. We emphasize the importance of prioritizing LLMs safety evaluation to ensure the safe deployment of these models in real-world applications.
zh
[NLP-98] Dynamic Context Tuning for Retrieval-Augmented Generation: Enhancing Multi-Turn Planning and Tool Adaptation
【速读】: 该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)系统在动态领域中适应性不足的问题,这些系统通常受限于静态、单轮交互和固定工具集,难以应对如医疗健康和智能家居等场景中用户意图、可用工具和上下文因素的持续变化。论文提出的解决方案是动态上下文调优(Dynamic Context Tuning, DCT),其关键在于通过基于注意力的上下文缓存跟踪相关历史信息、基于LoRA的检索动态选择领域特定工具,以及高效的上下文压缩以维持大语言模型(LLM)的输入限制,从而实现无需重新训练即可支持多轮对话和动态工具环境。
链接: https://arxiv.org/abs/2506.11092
作者: Jubin Abhishek Soni,Amit Anand,Rajesh Kumar Pandey,Aniket Abhishek Soni
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 6 pages, 5 figures, 3 tables. This manuscript has been submitted to IEEE conference. Researchers are welcome to read and build upon this work; please cite it appropriately. For questions or clarifications, feel free to contact me
Abstract:Retrieval-Augmented Generation (RAG) has significantly advanced large language models (LLMs) by grounding their outputs in external tools and knowledge sources. However, existing RAG systems are typically constrained to static, single-turn interactions with fixed toolsets, making them ill-suited for dynamic domains such as healthcare and smart homes, where user intent, available tools, and contextual factors evolve over time. We present Dynamic Context Tuning (DCT), a lightweight framework that extends RAG to support multi-turn dialogue and evolving tool environments without requiring retraining. DCT integrates an attention-based context cache to track relevant past information, LoRA-based retrieval to dynamically select domain-specific tools, and efficient context compression to maintain inputs within LLM context limits. Experiments on both synthetic and real-world benchmarks show that DCT improves plan accuracy by 14% and reduces hallucinations by 37%, while matching GPT-4 performance at significantly lower cost. Furthermore, DCT generalizes to previously unseen tools, enabling scalable and adaptable AI assistants across a wide range of dynamic environments.
zh
[NLP-99] Customizing Speech Recognition Model with Large Language Model Feedback
【速读】: 该论文试图解决自动语音识别(ASR)系统在罕见命名实体识别和领域不匹配适应性方面的不足。其解决方案的关键在于利用大规模语言模型(LLM)作为奖励模型,通过强化学习框架对ASR模型进行无监督领域自适应优化,从而提升转录质量,特别是在受领域不匹配影响的命名实体方面。
链接: https://arxiv.org/abs/2506.11091
作者: Shaoshi Ling,Guoli Ye
机构: Microsoft Core AI(微软核心人工智能)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:Automatic speech recognition (ASR) systems have achieved strong performance on general transcription tasks. However, they continue to struggle with recognizing rare named entities and adapting to domain mismatches. In contrast, large language models (LLMs), trained on massive internet-scale datasets, are often more effective across a wide range of domains. In this work, we propose a reinforcement learning based approach for unsupervised domain adaptation, leveraging unlabeled data to enhance transcription quality, particularly the named entities affected by domain mismatch, through feedback from a LLM. Given contextual information, our framework employs a LLM as the reward model to score the hypotheses from the ASR model. These scores serve as reward signals to fine-tune the ASR model via reinforcement learning. Our method achieves a 21% improvement on entity word error rate over conventional self-training methods.
zh
[NLP-100] wo Birds with One Stone: Improving Factuality and Faithfulness of LLM s via Dynamic Interactive Subspace Editing
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在实际部署中面临的事实性(factualness)和忠实性(faithfulness)幻觉问题。现有方法虽分别针对这两种幻觉类型进行处理,但往往导致性能权衡,即针对一种幻觉的干预可能加剧另一种幻觉。论文通过分析LLMs的激活空间动态,揭示了这两种幻觉类别在神经表示中共享重叠子空间,从而提出SPACE框架,该框架通过联合编辑共享激活子空间来同时提升事实性和忠实性。SPACE的关键在于利用双任务特征建模建立共享子空间的几何基础,并通过结合谱聚类与注意力头显著性评分的混合探测策略识别并编辑这些子空间。
链接: https://arxiv.org/abs/2506.11088
作者: Pengbo Wang,Chaozhuo Li,Chenxu Wang,Liwen Zheng,Litian Zhang,Xi Zhang
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Shihezi University (石河子大学); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:LLMs have demonstrated unprecedented capabilities in natural language processing, yet their practical deployment remains hindered by persistent factuality and faithfulness hallucinations. While existing methods address these hallucination types independently, they inadvertently induce performance trade-offs, as interventions targeting one type often exacerbate the other. Through empirical and theoretical analysis of activation space dynamics in LLMs, we reveal that these hallucination categories share overlapping subspaces within neural representations, presenting an opportunity for concurrent mitigation. To harness this insight, we propose SPACE, a unified framework that jointly enhances factuality and faithfulness by editing shared activation subspaces. SPACE establishes a geometric foundation for shared subspace existence through dual-task feature modeling, then identifies and edits these subspaces via a hybrid probe strategy combining spectral clustering and attention head saliency scoring. Experimental results across multiple benchmark datasets demonstrate the superiority of our approach.
zh
[NLP-101] ADAMIX: Adaptive Mixed-Precision Delta-Compression with Quantization Error Optimization for Large Language Models
【速读】: 该论文旨在解决在多租户服务场景下,如何高效压缩微调后的大型语言模型(Large Language Models, LLMs)的增量参数问题。现有方法在高压缩比下表现不佳或依赖经验性的位分配方案,难以平衡压缩效率与模型性能。论文提出的解决方案是ADAMIX,其关键在于通过数学推导量化误差,并将最优混合精度位分配问题建模为0/1整数线性规划问题,从而在满足预设压缩比的前提下最小化量化误差,实现更高效的增量压缩。
链接: https://arxiv.org/abs/2506.11087
作者: Boya Xiong,Shuo Wang,Weifeng Ge,Guanhua Chen,Yun Chen
机构: Shanghai University of Finance and Economics (上海财经大学); Tsinghua University (清华大学); Fudan University (复旦大学); Southern University of Science and Technology (南方科技大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) achieve impressive performance on various knowledge-intensive and complex reasoning tasks in different domains. In certain scenarios like multi-tenant serving, a large number of LLMs finetuned from the same base model are deployed to meet complex requirements for users. Recent works explore delta-compression approaches to quantize and compress the delta parameters between the customized LLM and the corresponding base model. However, existing works either exhibit unsatisfactory performance at high compression ratios or depend on empirical bit allocation schemes. In this work, we propose ADAMIX, an effective adaptive mixed-precision delta-compression framework. We provide a mathematical derivation of quantization error to motivate our mixed-precision compression strategy and formulate the optimal mixed-precision bit allocation scheme as the solution to a 0/1 integer linear programming problem. Our derived bit allocation strategy minimizes the quantization error while adhering to a predefined compression ratio requirement. Experimental results on various models and benchmarks demonstrate that our approach surpasses the best baseline by a considerable margin. On tasks like AIME2024 and GQA, where the norm of \Delta \mathbfW is large and the base model lacks sufficient ability, ADAMIX outperforms the best baseline Delta-CoMe by 22.3% and 6.1% with 7B models, respectively.
zh
[NLP-102] LeanExplore: A search engine for Lean 4 declarations WWW
【速读】: 该论文试图解决Lean 4生态系统中由于其庞大的库规模而导致的导航困难问题。解决方案的关键在于构建LeanExplore,这是一个针对Lean 4声明的搜索引擎,通过融合多源语义嵌入模型、BM25+关键词匹配以及基于PageRank的声明重要性评分,实现对形式化和非形式化陈述的语义搜索。
链接: https://arxiv.org/abs/2506.11085
作者: Justin Asher(Independent Researcher)
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: 16 pages, 1 figure. Project website: this https URL , Code: this https URL
Abstract:The expanding Lean 4 ecosystem poses challenges for navigating its vast libraries. This paper introduces LeanExplore, a search engine for Lean 4 declarations. LeanExplore enables users to semantically search for statements, both formally and informally, across select Lean 4 packages (including Batteries, Init, Lean, Mathlib, PhysLean, and Std). This search capability is powered by a hybrid ranking strategy, integrating scores from a multi-source semantic embedding model (capturing conceptual meaning from formal Lean code, docstrings, AI-generated informal translations, and declaration titles), BM25+ for keyword-based lexical relevance, and a PageRank-based score reflecting declaration importance and interconnectedness. The search engine is accessible via a dedicated website (this https URL) and a Python API (this https URL). Furthermore, the database can be downloaded, allowing users to self-host the service. LeanExplore integrates easily with LLMs via the model context protocol (MCP), enabling users to chat with an AI assistant about Lean declarations or utilize the search engine for building theorem-proving agents. This work details LeanExplore’s architecture, data processing, functionalities, and its potential to enhance Lean 4 workflows and AI-driven mathematical research
zh
[NLP-103] RedDebate: Safer Responses through Multi-Agent Red Teaming Debates
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在安全行为上的不足,特别是其可能产生的不安全行为问题。现有AI安全方法通常依赖于成本高昂的人工评估或孤立的单模型评估,这些方法存在可扩展性限制和监督风险。解决方案的关键在于提出RedDebate框架,该框架通过多个LLMs之间的对抗性论辩,实现对自身不安全行为的主动识别与缓解。其核心机制是利用协同分歧,使多个LLMs相互批判性地审视彼此的推理过程,并通过自动化红队测试系统地发现不安全盲点,同时结合长期记忆模块持续积累安全见解,从而迭代优化模型响应。
链接: https://arxiv.org/abs/2506.11083
作者: Ali Asad,Stephen Obadinma,Radin Shayanfar,Xiaodan Zhu
机构: Queen’s University (皇后大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:We propose RedDebate, a novel multi-agent debate framework that leverages adversarial argumentation among Large Language Models (LLMs) to proactively identify and mitigate their own unsafe behaviours. Existing AI safety methods often depend heavily on costly human evaluations or isolated single-model assessment, both subject to scalability constraints and oversight risks. RedDebate instead embraces collaborative disagreement, enabling multiple LLMs to critically examine one another’s reasoning, and systematically uncovering unsafe blind spots through automated red-teaming, and iteratively improve their responses. We further integrate distinct types of long-term memory that retain learned safety insights from debate interactions. Evaluating on established safety benchmarks such as HarmBench, we demonstrate the proposed method’s effectiveness. Debate alone can reduce unsafe behaviours by 17.7%, and when combined with long-term memory modules, achieves reductions exceeding 23.5%. To our knowledge, RedDebate constitutes the first fully automated framework that combines multi-agent debates with red-teaming to progressively enhance AI safety without direct human intervention.(Github Repository: this https URL)
zh
[NLP-104] PRISM: A Transformer-based Language Model of Structured Clinical Event Data
【速读】: 该论文试图解决临床决策过程的序列建模问题,即如何有效捕捉和预测患者诊疗路径中的复杂依赖关系。传统方法通常依赖于孤立的诊断分类,而PRISM(Predictive Reasoning in Sequential Medicine)通过将临床轨迹建模为包含诊断测试、实验室结果和诊断等事件的分词序列,并利用自回归训练目标来预测患者诊断旅程中最可能的下一步。其解决方案的关键在于构建一个大规模定制的临床词汇表,并采用生成式语言建模技术对结构化医疗事件数据进行建模,从而实现对真实诊断路径、实验室结果进展和临床决策行为的有效模拟。
链接: https://arxiv.org/abs/2506.11082
作者: Lionel Levine,John Santerre,Alex S. Young,T. Barry Levine,Francis Campion,Majid Sarrafzadeh
机构: UCLA(加州大学洛杉矶分校); UC Berkeley(加州大学伯克利分校); UCLA David Geffen School of Medicine(加州大学洛杉矶分校大卫·格芬医学院); ABLE Medical(ABLE医学); MITRE Corp.(MITRE公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 Figures, 1 Table
Abstract:We introduce PRISM (Predictive Reasoning in Sequential Medicine), a transformer-based architecture designed to model the sequential progression of clinical decision-making processes. Unlike traditional approaches that rely on isolated diagnostic classification, PRISM frames clinical trajectories as tokenized sequences of events - including diagnostic tests, laboratory results, and diagnoses - and learns to predict the most probable next steps in the patient diagnostic journey. Leveraging a large custom clinical vocabulary and an autoregressive training objective, PRISM demonstrates the ability to capture complex dependencies across longitudinal patient timelines. Experimental results show substantial improvements over random baselines in next-token prediction tasks, with generated sequences reflecting realistic diagnostic pathways, laboratory result progressions, and clinician ordering behaviors. These findings highlight the feasibility of applying generative language modeling techniques to structured medical event data, enabling applications in clinical decision support, simulation, and education. PRISM establishes a foundation for future advancements in sequence-based healthcare modeling, bridging the gap between machine learning architectures and real-world diagnostic reasoning.
zh
[NLP-105] SAGE:Specification-Aware Grammar Extraction for Automated Test Case Generation with LLM s
【速读】: 该论文旨在解决从自然语言规范中生成有效且通用的上下文无关文法(Context-Free Grammars with Counters, CCFGs)这一关键挑战,特别是在监督有限的情况下。其解决方案的关键在于利用开源大语言模型(LLMs)进行小样本标注示例下的规范到文法翻译,并结合可验证奖励引导的强化学习方法——组相对策略优化(Group Relative Policy Optimization, GRPO),以提升生成文法的有效性和泛化能力。此外,研究还评估了迭代反馈在修正生成文法中的语法和语义错误方面的效果。
链接: https://arxiv.org/abs/2506.11081
作者: Aditi,Hyunwoo Park,Sicheol Sung,Yo-Sub Han,Sang-Ki Ko
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Grammar-based test case generation has proven effective for competitive programming problems, but generating valid and general grammars from natural language specifications remains a key challenge, especially under limited supervision. Context-Free Grammars with Counters (CCFGs) have recently been introduced as a formalism to represent such specifications with logical constraints by storing and reusing counter values during derivation. In this work, we explore the use of open-source large language models (LLMs) to induce CCFGs from specifications using a small number of labeled examples and verifiable reward-guided reinforcement learning. Our approach first fine-tunes an open-source LLM to perform specification-to-grammar translation, and further applies Group Relative Policy Optimization (GRPO) to enhance grammar validity and generality. We also examine the effectiveness of iterative feedback for open and closed-source LLMs in correcting syntactic and semantic errors in generated grammars. Experimental results show that our approach SAGE achieves stronger generalization and outperforms 17 open and closed-source LLMs in both grammar quality and test effectiveness, improving over the state-of-the-art by 15.92%p in grammar validity and 12.34%p in test effectiveness. We provide our implementation and dataset at the following anonymous repository:this https URL Subjects: Computation and Language (cs.CL) Cite as: arXiv:2506.11081 [cs.CL] (or arXiv:2506.11081v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2506.11081 Focus to learn more arXiv-issued DOI via DataCite
zh
[NLP-106] MANBench: Is Your Multimodal Model Smarter than Human? ACL2025
【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在多模态任务中是否能够超越人类表现的问题,以及如何全面评估其能力。解决方案的关键是提出MANBench,这是一个涵盖九个任务、包含1,314道题目的双语基准测试集(英文和中文),强调直观推理、跨模态无缝整合和现实复杂性,通过大量人类实验对比了人类与先进MLLMs的性能,揭示了MLLMs在深层次跨模态推理任务中的不足及人类与MLLMs在高复杂度任务中的共同挑战。
链接: https://arxiv.org/abs/2506.11080
作者: Han Zhou,Qitong Xu,Yiheng Dong,Xin Yang
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computation and Language (cs.CL)
备注: Multimodal Benchmark, Project Url: this https URL , ACL2025 Findings
Abstract:The rapid advancement of Multimodal Large Language Models (MLLMs) has ignited discussions regarding their potential to surpass human performance in multimodal tasks. In response, we introduce MANBench (Multimodal Ability Norms Benchmark), a bilingual benchmark (English and Chinese) comprising 1,314 questions across nine tasks, spanning knowledge-based and non-knowledge-based domains. MANBench emphasizes intuitive reasoning, seamless cross-modal integration, and real-world complexity, providing a rigorous evaluation framework. Through extensive human experiments involving diverse participants, we compared human performance against state-of-the-art MLLMs. The results indicate that while MLLMs excel in tasks like Knowledge and Text-Image Understanding, they struggle with deeper cross-modal reasoning tasks such as Transmorphic Understanding, Image Consistency, and Multi-image Understanding. Moreover, both humans and MLLMs face challenges in highly complex tasks like Puzzles and Spatial Imagination. MANBench highlights the strengths and limitations of MLLMs, revealing that even advanced models fall short of achieving human-level performance across many domains. We hope MANBench will inspire efforts to bridge the gap between MLLMs and human multimodal capabilities. The code and dataset are available at this https URL. Comments: Multimodal Benchmark, Project Url: this https URL, ACL2025 Findings Subjects: Computation and Language (cs.CL) Cite as: arXiv:2506.11080 [cs.CL] (or arXiv:2506.11080v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2506.11080 Focus to learn more arXiv-issued DOI via DataCite
zh
[NLP-107] RoE-FND: A Case-Based Reasoning Approach with Dual Verification for Fake News Detection via LLM s
【速读】: 该论文旨在解决虚假新闻检测(Fake News Detection, FND)系统中存在的关键问题,包括噪声证据选择、泛化瓶颈以及决策过程不透明。其解决方案的核心是提出一种名为RoE-FND(Reason on Experiences FND)的框架,该框架通过将基于证据的FND重新构造成逻辑推理任务,结合大型语言模型(Large Language Models, LLMs)与经验学习,实现更可靠的检测效果。RoE-FND包含两个阶段:自我反思的知识构建和动态准则检索,并通过双通道机制对推理过程进行内部经验交叉验证,从而提升系统的泛化能力和有效性。
链接: https://arxiv.org/abs/2506.11078
作者: Yuzhou Yang,Yangming Zhou,Zhiying Zhu,Zhenxing Qian,Xinpeng Zhang,Sheng Li
机构: Fudan University (复旦大学); East China University of Science and Technology (华东理工大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:The proliferation of deceptive content online necessitates robust Fake News Detection (FND) systems. While evidence-based approaches leverage external knowledge to verify claims, existing methods face critical limitations: noisy evidence selection, generalization bottlenecks, and unclear decision-making processes. Recent efforts to harness Large Language Models (LLMs) for FND introduce new challenges, including hallucinated rationales and conclusion bias. To address these issues, we propose \textbfRoE-FND (\textbf\underlineReason \textbf\underlineon \textbf\underlineExperiences FND), a framework that reframes evidence-based FND as a logical deduction task by synergizing LLMs with experiential learning. RoE-FND encompasses two stages: (1) \textitself-reflective knowledge building, where a knowledge base is curated by analyzing past reasoning errors, namely the exploration stage, and (2) \textitdynamic criterion retrieval, which synthesizes task-specific reasoning guidelines from historical cases as experiences during deployment. It further cross-checks rationales against internal experience through a devised dual-channel procedure. Key contributions include: a case-based reasoning framework for FND that addresses multiple existing challenges, a training-free approach enabling adaptation to evolving situations, and empirical validation of the framework’s superior generalization and effectiveness over state-of-the-art methods across three datasets.
zh
[NLP-108] CyclicReflex: Improving Large Reasoning Models via Cyclical Reflection Token Scheduling
【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在测试阶段计算性能优化的问题,具体是通过合理分配“反射标记”(reflection tokens)的频率和位置来提升多步骤推理能力。解决方案的关键在于将反射标记视为一种“资源”,并提出了一种周期性反射标记调度策略(CyclicReflex),该策略通过位置相关的三角波形动态调节反射标记的逻辑值,从而在过度反射和不足反射之间实现平衡,进而提升模型性能。
链接: https://arxiv.org/abs/2506.11077
作者: Chongyu Fan,Yihua Zhang,Jinghan Jia,Alfred Hero,Sijia Liu
机构: Michigan State University (密歇根州立大学); University of Michigan, Ann Arbor (密歇根大学安娜堡分校); IBM Research (IBM研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large reasoning models (LRMs), such as OpenAI’s o1 and DeepSeek-R1, harness test-time scaling to perform multi-step reasoning for complex problem-solving. This reasoning process, executed before producing final answers, is often guided by special juncture tokens or textual segments that prompt self-evaluative reflection. We refer to these transition markers and reflective cues as “reflection tokens” (e.g., “wait”, “but”, “alternatively”). In this work, we treat reflection tokens as a “resource” and introduce the problem of resource allocation, aimed at improving the test-time compute performance of LRMs by adaptively regulating the frequency and placement of reflection tokens. Through empirical analysis, we show that both excessive and insufficient use of reflection tokens, referred to as over-reflection and under-reflection, can degrade model performance. To better understand and manage this trade-off, we draw an analogy between reflection token usage and learning rate scheduling in optimization. Building on this insight, we propose cyclical reflection token scheduling (termed CyclicReflex), a decoding strategy that dynamically modulates reflection token logits using a position-dependent triangular waveform. Experiments on MATH500, AIME2024/2025, and AMC2023 demonstrate that CyclicReflex consistently improves performance across model sizes (1.5B-8B), outperforming standard decoding and more recent approaches such as TIP (thought switching penalty) and S1. Codes are available at this https URL.
zh
[NLP-109] CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention ACL2025
【速读】: 该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在多语言环境下产生的对象幻觉问题,即模型在使用非英语查询时更可能生成与视觉输入不一致的响应。解决方案的关键在于提出一种名为跨语言注意力干预(Cross-Lingual Attention Intervention for Mitigating multilingual object hallucination, CLAIM)的新方法,该方法通过对齐跨模态注意力模式实现近似无训练的干预,具体包括识别语言特定的跨模态注意力头、估计从英语到目标语言的语言迁移向量,并在推理阶段干预注意力输出以促进跨语言视觉感知能力的对齐。
链接: https://arxiv.org/abs/2506.11073
作者: Zekai Ye,Qiming Li,Xiaocheng Feng,Libo Qin,Yichong Huang,Baohang Li,Kui Jiang,Yang Xiang,Zhirui Zhang,Yunfei Lu,Duyu Tang,Dandan Tu,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学); Peng Cheng Laboratory (鹏城实验室); Central South University (中南大学); Huawei Technologies Co., Ltd (华为技术有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ACL2025 Main
Abstract:Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal abilities but remain prone to multilingual object hallucination, with a higher likelihood of generating responses inconsistent with the visual input when utilizing queries in non-English languages compared to English. Most existing approaches to address these rely on pretraining or fine-tuning, which are resource-intensive. In this paper, inspired by observing the disparities in cross-modal attention patterns across languages, we propose Cross-Lingual Attention Intervention for Mitigating multilingual object hallucination (CLAIM) in LVLMs, a novel near training-free method by aligning attention patterns. CLAIM first identifies language-specific cross-modal attention heads, then estimates language shift vectors from English to the target language, and finally intervenes in the attention outputs during inference to facilitate cross-lingual visual perception capability alignment. Extensive experiments demonstrate that CLAIM achieves an average improvement of 13.56% (up to 30% in Spanish) on the POPE and 21.75% on the hallucination subsets of the MME benchmark across various languages. Further analysis reveals that multilingual attention divergence is most prominent in intermediate layers, highlighting their critical role in multilingual scenarios.
zh
[NLP-110] argeted control of fast prototyping through domain-specific interface ICML’25
【速读】: 该论文试图解决工业设计中通过自然语言指令对原型模型进行精准控制的问题,即如何将设计师的自然语言与建模语言之间的语义差距有效弥合。解决方案的关键在于提出一种接口架构,作为自然语言与建模语言之间的媒介,该架构基于对快速原型设计实践的系统性研究,并设计了其操作机制及自动化领域规范算法,从而实现对原型模型的精确和高效控制。
链接: https://arxiv.org/abs/2506.11070
作者: Yu-Zhe Shi,Mingchen Liu,Hanlu Ma,Qiao Xu,Huamin Qu,Kun He,Lecheng Ruan,Qining Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注: In International Conference on Machine Learning (ICML’25)
Abstract:Industrial designers have long sought a natural and intuitive way to achieve the targeted control of prototype models – using simple natural language instructions to configure and adjust the models seamlessly according to their intentions, without relying on complex modeling commands. While Large Language Models have shown promise in this area, their potential for controlling prototype models through language remains partially underutilized. This limitation stems from gaps between designers’ languages and modeling languages, including mismatch in abstraction levels, fluctuation in semantic precision, and divergence in lexical scopes. To bridge these gaps, we propose an interface architecture that serves as a medium between the two languages. Grounded in design principles derived from a systematic investigation of fast prototyping practices, we devise the interface’s operational mechanism and develop an algorithm for its automated domain specification. Both machine-based evaluations and human studies on fast prototyping across various product design domains demonstrate the interface’s potential to function as an auxiliary module for Large Language Models, enabling precise and effective targeted control of prototype models.
zh
[NLP-111] Deontological Keyword Bias: The Impact of Modal Expressions on Normative Judgments of Language Models ACL2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在道德和伦理推理中对义务性判断的偏差问题,特别是当提示中包含模态表达词(如“must”或“ought to”)时,模型倾向于将非义务性情境误判为义务性情境。解决方案的关键在于提出一种结合少量示例与推理提示的判断策略,以减轻这种由模态表达引起的义务关键词偏差(Deontological Keyword Bias, DKB)。
链接: https://arxiv.org/abs/2506.11068
作者: Bumjin Park,Jinsil Lee,Jaesik Choi
机构: KAIST AI (KAIST人工智能); INEEJI (INEEJI)
类目: Computation and Language (cs.CL)
备注: 20 pages including references and appendix; To appear in ACL 2025 main conference
Abstract:Large language models (LLMs) are increasingly engaging in moral and ethical reasoning, where criteria for judgment are often unclear, even for humans. While LLM alignment studies cover many areas, one important yet underexplored area is how LLMs make judgments about obligations. This work reveals a strong tendency in LLMs to judge non-obligatory contexts as obligations when prompts are augmented with modal expressions such as must or ought to. We introduce this phenomenon as Deontological Keyword Bias (DKB). We find that LLMs judge over 90% of commonsense scenarios as obligations when modal expressions are present. This tendency is consist across various LLM families, question types, and answer formats. To mitigate DKB, we propose a judgment strategy that integrates few-shot examples with reasoning prompts. This study sheds light on how modal expressions, as a form of linguistic framing, influence the normative decisions of LLMs and underscores the importance of addressing such biases to ensure judgment alignment.
zh
[NLP-112] A Large Language Model Based Pipeline for Review of Systems Entity Recognition from Clinical Notes
【速读】: 该论文旨在解决临床笔记中自动提取体格检查回顾(Review of Systems, ROS)实体的效率与成本问题。其解决方案的关键在于构建一个基于大型语言模型(LLM)的管道,该管道首先利用SecTag技术提取ROS部分,随后通过少量样本微调的LLM识别ROS实体及其状态和相关身体系统,从而实现高效、准确的自动化处理。
链接: https://arxiv.org/abs/2506.11067
作者: Hieu Nghiem,Hemanth Reddy Singareddy,Zhuqi Miao,Jivan Lamichhane,Abdulaziz Ahmed,Johnson Thomas,Dursun Delen,William Paiva
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Objective: Develop a cost-effective, large language model (LLM)-based pipeline for automatically extracting Review of Systems (ROS) entities from clinical notes. Materials and Methods: The pipeline extracts ROS sections using SecTag, followed by few-shot LLMs to identify ROS entity spans, their positive/negative status, and associated body systems. We implemented the pipeline using open-source LLMs (Mistral, Llama, Gemma) and ChatGPT. The evaluation was conducted on 36 general medicine notes containing 341 annotated ROS entities. Results: When integrating ChatGPT, the pipeline achieved the lowest error rates in detecting ROS entity spans and their corresponding statuses/systems (28.2% and 14.5%, respectively). Open-source LLMs enable local, cost-efficient execution of the pipeline while delivering promising performance with similarly low error rates (span: 30.5-36.7%; status/system: 24.3-27.3%). Discussion and Conclusion: Our pipeline offers a scalable and locally deployable solution to reduce ROS documentation burden. Open-source LLMs present a viable alternative to commercial models in resource-limited healthcare environments.
zh
[NLP-113] Smotrom tvoja pa ander drogoj verden! Resurrecting Dead Pidgin with Generative Models: Russenorsk Case Study ACL
【速读】: 该论文试图解决如何利用现代大型语言模型(Large Language Models, LLMs)对历史贸易语 Russenorsk 的词汇进行分析,并构建其词典以探讨该语言的构词规律和语法结构。解决方案的关键在于基于现存文献资料构建结构化的词典,并通过该词典验证或对比传统学术研究中的假设,同时开发了一个“重构”翻译代理,用于生成当代俄语和挪威语文本的假设性 Russenorsk 版本。
链接: https://arxiv.org/abs/2506.11065
作者: Alexey Tikhonov,Sergei Shteiner,Anna Bykova,Ivan P. Yamshchikov
机构: THWS(THWS)
类目: Computation and Language (cs.CL)
备注: ACL Findings 2025
Abstract:Russenorsk, a pidgin language historically used in trade interactions between Russian and Norwegian speakers, represents a unique linguistic phenomenon. In this paper, we attempt to analyze its lexicon using modern large language models (LLMs), based on surviving literary sources. We construct a structured dictionary of the language, grouped by synonyms and word origins. Subsequently, we use this dictionary to formulate hypotheses about the core principles of word formation and grammatical structure in Russenorsk and show which hypotheses generated by large language models correspond to the hypotheses previously proposed ones in the academic literature. We also develop a “reconstruction” translation agent that generates hypothetical Russenorsk renderings of contemporary Russian and Norwegian texts.
zh
[NLP-114] Who is in the Spotlight: The Hidden Bias Undermining Multimodal Retrieval-Augmented Generation
【速读】: 该论文试图解决多模态检索增强生成(Multimodal RAG)系统中因检索证据位置不同而导致的性能不稳定和推理偏差问题。其关键解决方案是通过引入位置敏感性指数(PSI_p)并构建可视化框架,以量化和分析证据位置对系统性能的影响,揭示多模态交互加剧位置偏差的现象,并为后续的证据重排序或去偏策略提供理论与实证基础。
链接: https://arxiv.org/abs/2506.11063
作者: Jiayu Yao,Shenghua Liu,Yiwei Wang,Lingrui Mei,Baolong Bi,Yuyao Ge,Zhecheng Li,Xueqi Cheng
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of California, Merced (加州大学默塞德分校); University of California, San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Retrieval-Augmented Generation (RAG) systems have become essential in knowledge-intensive and open-domain tasks. As retrieval complexity increases, ensuring the robustness of these systems is critical. However, current RAG models are highly sensitive to the order in which evidence is presented, often resulting in unstable performance and biased reasoning, particularly as the number of retrieved items or modality diversity grows. This raises a central question: How does the position of retrieved evidence affect multimodal RAG performance? To answer this, we present the first comprehensive study of position bias in multimodal RAG systems. Through controlled experiments across text-only, image-only, and mixed-modality tasks, we observe a consistent U-shaped accuracy curve with respect to evidence position. To quantify this bias, we introduce the Position Sensitivity Index ( PSI_p ) and develop a visualization framework to trace attention allocation patterns across decoder layers. Our results reveal that multimodal interactions intensify position bias compared to unimodal settings, and that this bias increases logarithmically with retrieval range. These findings offer both theoretical and empirical foundations for position-aware analysis in RAG, highlighting the need for evidence reordering or debiasing strategies to build more reliable and equitable generation systems.
zh
[NLP-115] CodeMirag e: A Multi-Lingual Benchmark for Detecting AI-Generated and Paraphrased Source Code from Production-Level LLM s
【速读】: 该论文旨在解决AI生成代码检测的不足问题,特别是现有基准在编程语言覆盖范围、生成模型能力及代码多样性方面的局限性。其解决方案的关键在于提出CodeMirage基准,该基准通过三大核心改进实现:覆盖十种广泛使用的编程语言、包含原始与改写代码样本,并集成来自六大主要供应商的十种先进生产级大语言模型(LLM)的输出,从而构建一个更全面、更具现实代表性的评估环境。
链接: https://arxiv.org/abs/2506.11059
作者: Hanxi Guo,Siyuan Cheng,Kaiyuan Zhang,Guangyu Shen,Xiangyu Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) have become integral to modern software development, producing vast amounts of AI-generated source code. While these models boost programming productivity, their misuse introduces critical risks, including code plagiarism, license violations, and the propagation of insecure programs. As a result, robust detection of AI-generated code is essential. To support the development of such detectors, a comprehensive benchmark that reflects real-world conditions is crucial. However, existing benchmarks fall short – most cover only a limited set of programming languages and rely on less capable generative models. In this paper, we present CodeMirage, a comprehensive benchmark that addresses these limitations through three major advancements: (1) it spans ten widely used programming languages, (2) includes both original and paraphrased code samples, and (3) incorporates outputs from ten state-of-the-art production-level LLMs, including both reasoning and non-reasoning models from six major providers. Using CodeMirage, we evaluate ten representative detectors across four methodological paradigms under four realistic evaluation configurations, reporting results using three complementary metrics. Our analysis reveals nine key findings that uncover the strengths and weaknesses of current detectors, and identify critical challenges for future work. We believe CodeMirage offers a rigorous and practical testbed to advance the development of robust and generalizable AI-generated code detectors.
zh
[NLP-116] Large Language models for Time Series Analysis: Techniques Applications and Challenges
【速读】: 该论文旨在解决传统时间序列分析方法在非线性特征表示和长期依赖捕获方面的局限性,以及通用大型语言模型(Large Language Models, LLMs)在时间序列分析中应用时面临的数据多样性、标注稀缺性和计算需求等问题。其解决方案的关键在于系统性地回顾预训练LLM驱动的时间序列分析技术,从工作流程角度组织和梳理LLM在输入、优化和轻量化阶段的技术体系,并探讨其在实际应用中的潜力与挑战,从而为未来研究提供指导。
链接: https://arxiv.org/abs/2506.11040
作者: Feifei Shi,Xueyan Yin,Kang Wang,Wanyu Tu,Qifu Sun,Huansheng Ning
机构: University of Science and Technology Beijing (北京科技大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注:
Abstract:Time series analysis is pivotal in domains like financial forecasting and biomedical monitoring, yet traditional methods are constrained by limited nonlinear feature representation and long-term dependency capture. The emergence of Large Language Models (LLMs) offers transformative potential by leveraging their cross-modal knowledge integration and inherent attention mechanisms for time series analysis. However, the development of general-purpose LLMs for time series from scratch is still hindered by data diversity, annotation scarcity, and computational requirements. This paper presents a systematic review of pre-trained LLM-driven time series analysis, focusing on enabling techniques, potential applications, and open challenges. First, it establishes an evolutionary roadmap of AI-driven time series analysis, from the early machine learning era, through the emerging LLM-driven paradigm, to the development of native temporal foundation models. Second, it organizes and systematizes the technical landscape of LLM-driven time series analysis from a workflow perspective, covering LLMs’ input, optimization, and lightweight stages. Finally, it critically examines novel real-world applications and highlights key open challenges that can guide future research and innovation. The work not only provides valuable insights into current advances but also outlines promising directions for future development. It serves as a foundational reference for both academic and industrial researchers, paving the way for the development of more efficient, generalizable, and interpretable systems of LLM-driven time series analysis.
zh
[NLP-117] versky Neural Networks: Psychologically Plausible Deep Learning with Differentiable Tversky Similarity
【速读】: 该论文试图解决深度学习中隐含的相似性模型与人类心理感知不一致的问题,具体表现为传统基于几何相似性的度量方式(如对称性)不符合人类的感知特性。其解决方案的关键在于将Tversky提出的基于特征集的相似性理论引入深度学习,通过开发可微分的Tversky相似性参数化方法,构建如Tversky投影层等神经网络组件,从而实现非线性函数建模,并提升模型的可解释性。
链接: https://arxiv.org/abs/2506.11035
作者: Moussa Koulako Bala Doumbouya,Dan Jurafsky,Christopher D. Manning
机构: Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Work in psychology has highlighted that the geometric model of similarity standard in deep learning is not psychologically plausible because its metric properties such as symmetry do not align with human perception. In contrast, Tversky (1977) proposed an axiomatic theory of similarity based on a representation of objects as sets of features, and their similarity as a function of common and distinctive features. However, this model has not been used in deep learning before, partly due to the challenge of incorporating discrete set operations. We develop a differentiable parameterization of Tversky’s similarity that is learnable through gradient descent, and derive neural network building blocks such as the Tversky projection layer, which unlike the linear projection layer can model non-linear functions such as XOR. Through experiments with image recognition and language modeling, we show that the Tversky projection layer is a beneficial replacement for the linear projection layer, which employs geometric similarity. On the NABirds image classification task, a frozen ResNet-50 adapted with a Tversky projection layer achieves a 24.7% relative accuracy improvement over the linear layer adapter baseline. With Tversky projection layers, GPT-2’s perplexity on PTB decreases by 7.5%, and its parameter count by 34.8%. Finally, we propose a unified interpretation of both projection layers as computing similarities of input stimuli to learned prototypes, for which we also propose a novel visualization technique highlighting the interpretability of Tversky projection layers. Our work offers a new paradigm for thinking about the similarity model implicit in deep learning, and designing networks that are interpretable under an established theory of psychological similarity.
zh
[NLP-118] CausalVLBench: Benchmarking Visual Causal Reasoning in Large Vision-Language Models
【速读】: 该论文试图解决当前大型视觉语言模型(Large Vision-Language Models, LVLMs)在视觉因果推理任务中的能力不足问题,尤其是其在因果结构推断、干预目标预测和反事实预测等方面的局限性。解决方案的关键在于构建一个全面的因果推理基准——CausalVLBench,该基准包含三个代表性任务,并通过在三个因果表示学习数据集上评估最先进的开源LVLMs,揭示其在视觉因果推理方面的基本优势与不足,从而为提升LVLMs的视觉因果推理能力提供新的研究方向和范式。
链接: https://arxiv.org/abs/2506.11034
作者: Aneesh Komanduri,Karuna Bhaila,Xintao Wu
机构: University of Arkansas (阿肯色大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have shown remarkable ability in various language tasks, especially with their emergent in-context learning capability. Extending LLMs to incorporate visual inputs, large vision-language models (LVLMs) have shown impressive performance in tasks such as recognition and visual question answering (VQA). Despite increasing interest in the utility of LLMs in causal reasoning tasks such as causal discovery and counterfactual reasoning, there has been relatively little work showcasing the abilities of LVLMs on visual causal reasoning tasks. We take this opportunity to formally introduce a comprehensive causal reasoning benchmark for multi-modal in-context learning from LVLMs. Our CausalVLBench encompasses three representative tasks: causal structure inference, intervention target prediction, and counterfactual prediction. We evaluate the ability of state-of-the-art open-source LVLMs on our causal reasoning tasks across three causal representation learning datasets and demonstrate their fundamental strengths and weaknesses. We hope that our benchmark elucidates the drawbacks of existing vision-language models and motivates new directions and paradigms in improving the visual causal reasoning abilities of LVLMs.
zh
[NLP-119] ask-aligned prompting improves zero-shot detection of AI-generated images by Vision-Language Models
【速读】: 该论文试图解决AI生成图像的检测问题,尤其是针对监督检测方法依赖于大量标注数据且在不同生成器之间泛化能力不足的局限性。其解决方案的关键在于利用预训练的视觉-语言模型(Vision-Language Models, VLMs)进行零样本检测,并通过任务对齐提示(task-aligned prompting)提升模型的推理聚焦性和性能。具体而言,采用“Let’s examine the style and the synthesis artifacts”作为提示前缀(称为zero-shot-s²),显著提高了检测效果,且在多个数据集和模型规模上表现出良好的泛化性和鲁棒性。
链接: https://arxiv.org/abs/2506.11031
作者: Zoher Kachwala,Danishjeet Singh,Danielle Yang,Filippo Menczer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:As image generators produce increasingly realistic images, concerns about potential misuse continue to grow. Supervised detection relies on large, curated datasets and struggles to generalize across diverse generators. In this work, we investigate the use of pre-trained Vision-Language Models (VLMs) for zero-shot detection of AI-generated images. While off-the-shelf VLMs exhibit some task-specific reasoning and chain-of-thought prompting offers gains, we show that task-aligned prompting elicits more focused reasoning and significantly improves performance without fine-tuning. Specifically, prefixing the model’s response with the phrase ``Let’s examine the style and the synthesis artifacts’’ – a method we call zero-shot-s ^2 – boosts Macro F1 scores by 8%-29% for two widely used open-source models. These gains are consistent across three recent, diverse datasets spanning human faces, objects, and animals with images generated by 16 different models – demonstrating strong generalization. We further evaluate the approach across three additional model sizes and observe improvements in most dataset-model combinations – suggesting robustness to model scale. Surprisingly, self-consistency, a behavior previously observed in language reasoning, where aggregating answers from diverse reasoning paths improves performance, also holds in this setting. Even here, zero-shot-s ^2 scales better than chain-of-thought in most cases – indicating that it elicits more useful diversity. Our findings show that task-aligned prompts elicit more focused reasoning and enhance latent capabilities in VLMs, like the detection of AI-generated images – offering a simple, generalizable, and explainable alternative to supervised methods. Our code is publicly available on github: this https URL.
zh
[NLP-120] Security Degradation in Iterative AI Code Generation – A Systematic Analysis of the Paradox
【速读】: 该论文试图解决在使用大型语言模型(Large Language Models, LLMs)进行代码生成时,安全漏洞如何通过迭代的LLM反馈而演化的问题。研究通过一个控制实验,分析了400个代码样本在40轮“改进”中的安全退化情况,发现经过五轮迭代后关键漏洞增加了37.6%,且不同提示策略导致了不同的漏洞模式。论文的解决方案之关键在于强调在LLM迭代过程中引入人类专家的验证机制,以防止在所谓的“改进”中反而引入新的安全问题,从而提出实用的开发指南来缓解这些风险。
链接: https://arxiv.org/abs/2506.11022
作者: Shivani Shukla,Himanshu Joshi,Romilla Syed
机构: University of San Francisco (旧金山大学); Vector Institute for Artificial Intelligence (人工智能矢量研究所); University of Massachusetts Boston (马萨诸塞大学波士顿分校)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Keywords - Large Language Models, Security Vulnerabilities, AI-Generated Code, Iterative Feedback, Software Security, Secure Coding Practices, Feedback Loops, LLM Prompting Strategies
Abstract:The rapid adoption of Large Language Models(LLMs) for code generation has transformed software development, yet little attention has been given to how security vulnerabilities evolve through iterative LLM feedback. This paper analyzes security degradation in AI-generated code through a controlled experiment with 400 code samples across 40 rounds of “improvements” using four distinct prompting strategies. Our findings show a 37.6% increase in critical vulnerabilities after just five iterations, with distinct vulnerability patterns emerging across different prompting approaches. This evidence challenges the assumption that iterative LLM refinement improves code security and highlights the essential role of human expertise in the loop. We propose practical guidelines for developers to mitigate these risks, emphasizing the need for robust human validation between LLM iterations to prevent the paradoxical introduction of new security issues during supposedly beneficial code “improvements”.
zh
[NLP-121] Eval-OS: Performance evaluations of large language models for operations scheduling
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在电信运营调度(Telecommunications Operation Scheduling, OS)领域应用潜力受限的问题,主要由于OS任务固有的复杂性和领域特殊性,以及缺乏全面的评估基准。解决方案的关键在于提出首个电信运营调度评估基准(Telecommunications Operation Scheduling Evaluation Benchmark, TeleEval-OS),该基准包含15个数据集和13个子任务,覆盖智能工单创建、处理、关闭和评估四个关键操作阶段,并通过四个层级的能力划分(基础自然语言处理、知识问答、报告生成与分析)系统评估LLMs的性能,从而为LLMs在该领域的应用提供全面的评估框架。
链接: https://arxiv.org/abs/2506.11017
作者: Yanyan Wang,Yingying Wang,Junli Liang,Yin Xu,Yunlong Liu,Yiming Xu,Zhengwang Jiang,Zhehe Li,Fei Li,Long Zhao,Kuang Xu,Qi Song,Xiangyang Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:
Abstract:The rapid advancement of large language models (LLMs) has significantly propelled progress in artificial intelligence, demonstrating substantial application potential across multiple specialized domains. Telecommunications operation scheduling (OS) is a critical aspect of the telecommunications industry, involving the coordinated management of networks, services, risks, and human resources to optimize production scheduling and ensure unified service control. However, the inherent complexity and domain-specific nature of OS tasks, coupled with the absence of comprehensive evaluation benchmarks, have hindered thorough exploration of LLMs’ application potential in this critical field. To address this research gap, we propose the first Telecommunications Operation Scheduling Evaluation Benchmark (TeleEval-OS). Specifically, this benchmark comprises 15 datasets across 13 subtasks, comprehensively simulating four key operational stages: intelligent ticket creation, intelligent ticket handling, intelligent ticket closure, and intelligent evaluation. To systematically assess the performance of LLMs on tasks of varying complexity, we categorize their capabilities in telecommunications operation scheduling into four hierarchical levels, arranged in ascending order of difficulty: basic NLP, knowledge QA, report generation, and report analysis. On TeleEval-OS, we leverage zero-shot and few-shot evaluation methods to comprehensively assess 10 open-source LLMs (e.g., DeepSeek-V3) and 4 closed-source LLMs (e.g., GPT-4o) across diverse scenarios. Experimental results demonstrate that open-source LLMs can outperform closed-source LLMs in specific scenarios, highlighting their significant potential and value in the field of telecommunications operation scheduling.
zh
[NLP-122] A Survey of Task-Oriented Knowledge Graph Reasoning : Status Applications and Prospects
【速读】: 该论文试图解决知识图谱推理(Knowledge Graph Reasoning, KGR)领域中缺乏系统性综述的问题,特别是针对下游应用和更具挑战性的推理范式未得到充分覆盖。其解决方案的关键在于从任务导向的角度出发,对KGR方法进行分类,包括主要推理任务、下游应用任务以及潜在的挑战性推理任务,并探讨了如大型语言模型(Large Language Models, LLMs)等先进技术对KGR的影响,从而提供更全面的研究视角。
链接: https://arxiv.org/abs/2506.11012
作者: Guanglin Niu,Bo Li,Yangguang Lin
机构: Beihang University (北京航空航天大学); Beijing Forestry University (北京林业大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 45 pages, 17 figures, 12 tables
Abstract:Knowledge graphs (KGs) have emerged as a powerful paradigm for structuring and leveraging diverse real-world knowledge, which serve as a fundamental technology for enabling cognitive intelligence systems with advanced understanding and reasoning capabilities. Knowledge graph reasoning (KGR) aims to infer new knowledge based on existing facts in KGs, playing a crucial role in applications such as public security intelligence, intelligent healthcare, and financial risk assessment. From a task-centric perspective, existing KGR approaches can be broadly classified into static single-step KGR, static multi-step KGR, dynamic KGR, multi-modal KGR, few-shot KGR, and inductive KGR. While existing surveys have covered these six types of KGR tasks, a comprehensive review that systematically summarizes all KGR tasks particularly including downstream applications and more challenging reasoning paradigms remains lacking. In contrast to previous works, this survey provides a more comprehensive perspective on the research of KGR by categorizing approaches based on primary reasoning tasks, downstream application tasks, and potential challenging reasoning tasks. Besides, we explore advanced techniques, such as large language models (LLMs), and their impact on KGR. This work aims to highlight key research trends and outline promising future directions in the field of KGR.
zh
[NLP-123] Developing a Dyslexia Indicator Using Eye Tracking
【速读】: 该论文试图解决传统 dyslexia(阅读障碍)诊断方法成本高、可及性差的问题,旨在开发一种经济高效且易于推广的早期检测手段。解决方案的关键在于结合 eye-tracking technology(眼动追踪技术)与 machine learning(机器学习)算法,通过分析眼动模式(如延长的注视时间与不规则的扫视)提取 dyslexia 特征,并利用 Random Forest Classifier 实现高准确率的 dyslexia 识别,达到了 88.58% 的准确率。此外,还采用层次聚类方法对 dyslexia 的严重程度进行划分,提升了诊断的全面性和实用性。
链接: https://arxiv.org/abs/2506.11004
作者: Kevin Cogan,Vuong M. Ngo,Mark Roantree
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: The 23rd International Conference on Artificial Intelligence in Medicine (AIME 2025), LNAI, Springer, 11 pages
Abstract:Dyslexia, affecting an estimated 10% to 20% of the global population, significantly impairs learning capabilities, highlighting the need for innovative and accessible diagnostic methods. This paper investigates the effectiveness of eye-tracking technology combined with machine learning algorithms as a cost-effective alternative for early dyslexia detection. By analyzing general eye movement patterns, including prolonged fixation durations and erratic saccades, we proposed an enhanced solution for determining eye-tracking-based dyslexia features. A Random Forest Classifier was then employed to detect dyslexia, achieving an accuracy of 88.58%. Additionally, hierarchical clustering methods were applied to identify varying severity levels of dyslexia. The analysis incorporates diverse methodologies across various populations and settings, demonstrating the potential of this technology to identify individuals with dyslexia, including those with borderline traits, through non-invasive means. Integrating eye-tracking with machine learning represents a significant advancement in the diagnostic process, offering a highly accurate and accessible method in clinical research.
zh
[NLP-124] Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design
【速读】: 该论文试图解决当前基于Deepseek-R1-Distill系列的生成式AI(Generative AI)模型在基准测试中表现结果波动较大的问题,这一现象可能由多种评估条件的细微差异引起。研究指出,此类模型以及基于其微调的其他开源推理模型(如QwQ-32B)所宣称的性能提升难以可靠复现。解决方案的关键在于建立更为严谨的模型性能评估范式,并通过实证分析对Deepseek-R1-Distill系列模型进行评估。
链接: https://arxiv.org/abs/2506.04734
作者: Lin Sun,Weihong Lin,Jinzhu Wu,Yongfu Zhu,Xiaoqi Jian,Guangxiang Zhao,Change Jia,Linglin Zhang,Sai-er Hu,Yuhan Wu,Xiangzheng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Reasoning models represented by the Deepseek-R1-Distill series have been widely adopted by the open-source community due to their strong performance in mathematics, science, programming, and other domains. However, our study reveals that their benchmark evaluation results are subject to significant fluctuations caused by various factors. Subtle differences in evaluation conditions can lead to substantial variations in results. Similar phenomena are observed in other open-source inference models fine-tuned based on the Deepseek-R1-Distill series, as well as in the QwQ-32B model, making their claimed performance improvements difficult to reproduce reliably. Therefore, we advocate for the establishment of a more rigorous paradigm for model performance evaluation and present our empirical assessments of the Deepseek-R1-Distill series models.
zh
[NLP-125] Glider: Global and Local Instruction-Driven Expert Router
【速读】: 该论文试图解决现有MoErging方法在提升未见任务泛化能力的同时,牺牲了保留任务(held-in tasks)性能的问题,从而限制了其在实际部署中的适用性。解决方案的关键在于提出一种多尺度路由机制——Global and Local Instruction Driven Expert Router (GLIDER),该机制结合了语义全局路由和学习到的局部路由,通过利用大语言模型(LLM)的高级推理能力来增强专家选择,同时在模块内部进行细粒度的token级路由决策,从而实现保留任务性能的显著提升与未见任务泛化能力的保持。
链接: https://arxiv.org/abs/2410.07172
作者: Pingzhi Li,Prateek Yadav,Jaehong Yoon,Jie Peng,Yi-Lin Sung,Mohit Bansal,Tianlong Chen
机构: The University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Our code is available at this https URL
Abstract:The availability of performant pre-trained models has led to a proliferation of fine-tuned expert models that are specialized to particular domains. This has enabled the creation of powerful and adaptive routing-based “Model MoErging” methods with the goal of using expert modules to create an aggregate system with improved performance or generalization. However, existing MoErging methods often prioritize generalization to unseen tasks at the expense of performance on held-in tasks, which limits its practical applicability in real-world deployment scenarios. We observe that current token-level routing mechanisms neglect the global semantic context of the input task. This token-wise independence hinders effective expert selection for held-in tasks, as routing decisions fail to incorporate the semantic properties of the task. To address this, we propose, Global and Local Instruction Driven Expert Router (GLIDER) that integrates a multi-scale routing mechanism, encompassing a semantic global router and a learned local router. The global router leverages LLM’s advanced reasoning capabilities for semantic-related contexts to enhance expert selection. Given the input query and LLM, the router generates semantic task instructions that guide the retrieval of the most relevant experts across all layers. This global guidance is complemented by a local router that facilitates token-level routing decisions within each module, enabling finer control and enhanced performance on unseen tasks. Our experiments using T5-based models for T0 and FLAN tasks demonstrate that GLIDER achieves substantially improved held-in performance while maintaining strong generalization on held-out tasks. We also perform ablations experiments to dive deeper into the components of GLIDER. Our experiments highlight the importance of our multi-scale routing that leverages LLM-driven semantic reasoning for MoErging methods.
zh
[NLP-126] Consistent Autoformalization for Constructing Mathematical Libraries EMNLP2024
【速读】: 该论文试图解决自动形式化(autoformalization)在处理复杂和专业数学库时的可靠性与一致性问题,尤其是在大型语言模型(LLMs)单独使用时难以保证语法、术语和语义的一致性。论文提出的关键解决方案是协调使用三种机制:最相似检索增强生成(MS-RAG)、去噪步骤以及带有语法错误反馈的自动校正(Auto-SEF),以提升自动形式化结果的质量。
链接: https://arxiv.org/abs/2410.04194
作者: Lan Zhang,Xin Quan,Andre Freitas
机构: University of Manchester (曼彻斯特大学); Idiap Research Institute (Idiap研究所); National Biomarker Centre, CRUK Manchester Institute (国家生物标志物中心,CRUK曼彻斯特研究所)
类目: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
备注: EMNLP 2024 camera-ready
Abstract:Autoformalization is the task of automatically translating mathematical content written in natural language to a formal language expression. The growing language interpretation capabilities of Large Language Models (LLMs), including in formal languages, are lowering the barriers for autoformalization. However, LLMs alone are not capable of consistently and reliably delivering autoformalization, in particular as the complexity and specialization of the target domain grows. As the field evolves into the direction of systematically applying autoformalization towards large mathematical libraries, the need to improve syntactic, terminological and semantic control increases. This paper proposes the coordinated use of three mechanisms, most-similar retrieval augmented generation (MS-RAG), denoising steps, and auto-correction with syntax error feedback (Auto-SEF) to improve autoformalization quality. The empirical analysis, across different models, demonstrates that these mechanisms can deliver autoformalizaton results which are syntactically, terminologically and semantically more consistent. These mechanisms can be applied across different LLMs and have shown to deliver improve results across different model types.
zh
[NLP-127] Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM
【速读】: 该论文试图解决自动语音识别(ASR)模型在缺乏高质量标注数据时,如何有效生成伪标签的问题。传统方法依赖复杂的多阶段处理流程整合多个ASR输出,导致误差传播、信息丢失和优化不一致。解决方案的关键在于提出一种统一的多ASR提示驱动框架,通过文本或语音基础的大语言模型(LLM)进行后处理,替代传统的投票或其他仲裁逻辑来协调集成输出,从而提升转录准确性。
链接: https://arxiv.org/abs/2506.11089
作者: Jeena Prakash,Blessingh Kumar,Kadri Hacioglu,Bidisha Sharma,Sindhuja Gopalan,Malolan Chetlur,Shankar Venkatesan,Andreas Stolcke
机构: Uniphore Systems(统一系统)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Automatic speech recognition (ASR) models rely on high-quality transcribed data for effective training. Generating pseudo-labels for large unlabeled audio datasets often relies on complex pipelines that combine multiple ASR outputs through multi-stage processing, leading to error propagation, information loss and disjoint optimization. We propose a unified multi-ASR prompt-driven framework using postprocessing by either textual or speech-based large language models (LLMs), replacing voting or other arbitration logic for reconciling the ensemble outputs. We perform a comparative study of multiple architectures with and without LLMs, showing significant improvements in transcription accuracy compared to traditional methods. Furthermore, we use the pseudo-labels generated by the various approaches to train semi-supervised ASR models for different datasets, again showing improved performance with textual and speechLLM transcriptions compared to baselines.
zh
[NLP-128] Improving Child Speech Recognition and Reading Mistake Detection by Using Prompts INTERSPEECH2025
【速读】: 该论文旨在解决自动朗读评估系统在儿童语音识别与阅读错误检测中的性能不足问题。其关键解决方案是采用一种多模态方法,结合音频信息与文本资源知识,并利用Whisper模型和指令调优的大规模语言模型(LLMs)通过提示(prompting)技术提升语音识别的准确性及后续阅读错误检测的效果。
链接: https://arxiv.org/abs/2506.11079
作者: Lingyun Gao,Cristian Tejedor-Garcia,Catia Cucchiarini,Helmer Strik
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: This paper is accepted to Interspeech 2025. This publication is part of the project Responsible AI for Voice Diagnostics (RAIVD) with file number NGF.1607.22.013 of the research programme NGF AiNed Fellowship Grants which is financed by the Dutch Research Council (NWO)
Abstract:Automatic reading aloud evaluation can provide valuable support to teachers by enabling more efficient scoring of reading exercises. However, research on reading evaluation systems and applications remains limited. We present a novel multimodal approach that leverages audio and knowledge from text resources. In particular, we explored the potential of using Whisper and instruction-tuned large language models (LLMs) with prompts to improve transcriptions for child speech recognition, as well as their effectiveness in downstream reading mistake detection. Our results demonstrate the effectiveness of prompting Whisper and prompting LLM, compared to the baseline Whisper model without prompting. The best performing system achieved state-of-the-art recognition performance in Dutch child read speech, with a word error rate (WER) of 5.1%, improving the baseline WER of 9.4%. Furthermore, it significantly improved reading mistake detection, increasing the F1 score from 0.39 to 0.73.
zh
[NLP-129] Can We Trust Machine Learning? The Reliability of Features from Open-Source Speech Analysis Tools for Speech Modeling
【速读】: 该论文试图解决机器学习行为模型在使用音频-视频记录提取特征时存在的可靠性问题,特别是语音处理工具在跨不同人群和情境下的可重复性与公平性问题(reproducibility and fairness)。解决方案的关键在于对常用语音分析工具(如OpenSMILE和Praat)提取的语音特征进行领域相关验证,以提高临床应用中机器学习模型的可靠性。
链接: https://arxiv.org/abs/2506.11072
作者: Tahiya Chowdhury,Veronica Romero
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Computers and Society (cs.CY); Sound (cs.SD); Applications (stat.AP)
备注: 5 pages, 1 figure, 3 tables
Abstract:Machine learning-based behavioral models rely on features extracted from audio-visual recordings. The recordings are processed using open-source tools to extract speech features for classification models. These tools often lack validation to ensure reliability in capturing behaviorally relevant information. This gap raises concerns about reproducibility and fairness across diverse populations and contexts. Speech processing tools, when used outside of their design context, can fail to capture behavioral variations equitably and can then contribute to bias. We evaluate speech features extracted from two widely used speech analysis tools, OpenSMILE and Praat, to assess their reliability when considering adolescents with autism. We observed considerable variation in features across tools, which influenced model performance across context and demographic groups. We encourage domain-relevant verification to enhance the reliability of machine learning models in clinical applications.
zh
[NLP-130] Regularized Federated Learning for Privacy-Preserving Dysarthric and Elderly Speech Recognition
【速读】: 该论文旨在解决在保护语音数据隐私的前提下,准确识别构音障碍和老年人语音的挑战。其解决方案的关键在于采用正则化联邦学习(regularized Federated Learning, FL)技术,通过参数级、嵌入级和新型损失级的正则化方法,缓解数据稀缺性、数据分布不平衡和说话人异质性等问题。实验结果表明,正则化FL系统在基准数据集UASpeech和DementiaBank Pitt上显著优于基线FedAvg系统,实现了字错误率(WER)的显著降低。
链接: https://arxiv.org/abs/2506.11069
作者: Tao Zhong,Mengzhe Geng,Shujie Hu,Guinan Li,Xunying Liu
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注:
Abstract:Accurate recognition of dysarthric and elderly speech remains challenging to date. While privacy concerns have driven a shift from centralized approaches to federated learning (FL) to ensure data confidentiality, this further exacerbates the challenges of data scarcity, imbalanced data distribution and speaker heterogeneity. To this end, this paper conducts a systematic investigation of regularized FL techniques for privacy-preserving dysarthric and elderly speech recognition, addressing different levels of the FL process by 1) parameter-based, 2) embedding-based and 3) novel loss-based regularization. Experiments on the benchmark UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest that regularized FL systems consistently outperform the baseline FedAvg system by statistically significant WER reductions of up to 0.55% absolute (2.13% relative). Further increasing communication frequency to one exchange per batch approaches centralized training performance.
zh
[NLP-131] PMF-CEC: Phoneme-augmented Multimodal Fusion for Context-aware ASR Error Correction with Error-specific Selective Decoding
【速读】: 该论文旨在解决端到端自动语音识别(ASR)模型在识别罕见词时准确率低的问题,尤其是那些发音相似但拼写不同的同音词。其解决方案的关键在于提出一种基于ED-CEC的音素增强多模态融合方法(PMF-CEC),通过引入音素信息提升对目标罕见词与同音词的区分能力,同时引入保留概率机制以减少过检测,从而提高错误检测的准确性。
链接: https://arxiv.org/abs/2506.11064
作者: Jiajun He,Tomoki Toda
机构: Nagoya University (名古屋大学)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted by IEEE TASLP 2025
Abstract:End-to-end automatic speech recognition (ASR) models often struggle to accurately recognize rare words. Previously, we introduced an ASR postprocessing method called error detection and context-aware error correction (ED-CEC), which leverages contextual information such as named entities and technical terms to improve the accuracy of ASR transcripts. Although ED-CEC achieves a notable success in correcting rare words, its accuracy remains low when dealing with rare words that have similar pronunciations but different spellings. To address this issue, we proposed a phoneme-augmented multimodal fusion method for context-aware error correction (PMF-CEC) method on the basis of ED-CEC, which allowed for better differentiation between target rare words and homophones. Additionally, we observed that the previous ASR error detection module suffers from overdetection. To mitigate this, we introduced a retention probability mechanism to filter out editing operations with confidence scores below a set threshold, preserving the original operation to improve error detection accuracy. Experiments conducted on five datasets demonstrated that our proposed PMF-CEC maintains reasonable inference speed while further reducing the biased word error rate compared with ED-CEC, showing a stronger advantage in correcting homophones. Moreover, our method outperforms other contextual biasing methods, and remains valuable compared with LLM-based methods in terms of faster inference and better robustness under large biasing lists.
zh
计算机视觉
[CV-0] EMLoC: Emulator-based Memory-efficient Fine-tuning with LoRA Correction
【速读】:该论文旨在解决在有限内存预算下对大规模基础模型进行领域特定或个性化微调的高昂成本问题(memory overhead)。其关键解决方案是提出EMLoC框架,该框架通过在小规模下游校准集上使用激活感知的奇异值分解(SVD)构建任务相关的轻量级模拟器,并基于LoRA进行微调,随后通过补偿算法修正微调后的LoRA模块,使其能够合并到原始模型中用于推理,从而实现在与推理相同内存预算下的高效微调。
链接: https://arxiv.org/abs/2506.12015
作者: Hsi-Che Lin,Yu-Chu Yu,Kai-Po Chang,Yu-Chiang Frank Wang
机构: National Taiwan University (国立台湾大学); NVIDIA (NVIDIA)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Under review. Project page: this https URL
Abstract:Open-source foundation models have seen rapid adoption and development, enabling powerful general-purpose capabilities across diverse domains. However, fine-tuning large foundation models for domain-specific or personalized tasks remains prohibitively expensive for most users due to the significant memory overhead beyond that of inference. We introduce EMLoC, an Emulator-based Memory-efficient fine-tuning framework with LoRA Correction, which enables model fine-tuning within the same memory budget required for inference. EMLoC constructs a task-specific light-weight emulator using activation-aware singular value decomposition (SVD) on a small downstream calibration set. Fine-tuning then is performed on this lightweight emulator via LoRA. To tackle the misalignment between the original model and the compressed emulator, we propose a novel compensation algorithm to correct the fine-tuned LoRA module, which thus can be merged into the original model for inference. EMLoC supports flexible compression ratios and standard training pipelines, making it adaptable to a wide range of applications. Extensive experiments demonstrate that EMLoC outperforms other baselines across multiple datasets and modalities. Moreover, without quantization, EMLoC enables fine-tuning of a 38B model on a single 24GB consumer GPU-bringing efficient and practical model adaptation to individual users.
zh
[CV-1] Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale
【速读】:该论文试图解决**可及性定位(affordance grounding)**问题,即根据自然语言描述的交互信息来定位物体区域,这是智能代理理解并与其环境交互的关键挑战。该任务面临细粒度部件级定位、多有效交互区域带来的歧义以及大规模数据集稀缺等难题。解决方案的关键在于引入了一个大规模基准数据集Affogato,其中包含150K个实例,配有开放词汇文本描述和对应3D可及性热图,并基于此构建了简单但有效的视觉-语言模型,利用预训练的部件感知视觉主干和文本条件热图解码器,从而在现有2D和3D基准上取得了有前景的性能,并展现出开放词汇跨领域泛化能力。
链接: https://arxiv.org/abs/2506.12009
作者: Junha Lee,Eunha Park,Chunghyun Park,Dahyun Kang,Minsu Cho
机构: Pohang University of Science and Technology (POSTECH); RLWRLD
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Affordance grounding-localizing object regions based on natural language descriptions of interactions-is a critical challenge for enabling intelligent agents to understand and interact with their environments. However, this task remains challenging due to the need for fine-grained part-level localization, the ambiguity arising from multiple valid interaction regions, and the scarcity of large-scale datasets. In this work, we introduce Affogato, a large-scale benchmark comprising 150K instances, annotated with open-vocabulary text descriptions and corresponding 3D affordance heatmaps across a diverse set of objects and interactions. Building on this benchmark, we develop simple yet effective vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Our models trained with the Affogato dataset achieve promising performance on the existing 2D and 3D benchmarks, and notably, exhibit effectiveness in open-vocabulary cross-domain generalization. The Affogato dataset is shared in public: this https URL
zh
[CV-2] SIMSHIFT: A Benchmark for Adapting Neural Surrogates to Distribution Shifts
【速读】:该论文试图解决神经代理模型在面对未见过的问题配置时性能显著下降的问题,特别是在工业仿真任务中,如新型材料类型或结构尺寸的场景。其解决方案的关键在于引入SIMSHIFT基准数据集和评估套件,并将成熟的领域自适应(Domain Adaptation, DA)方法扩展到最先进的神经代理模型中,利用多个源配置的参数描述和真实仿真数据,以及目标配置的参数描述,以在不依赖目标配置真实仿真数据的情况下准确预测目标仿真结果。
链接: https://arxiv.org/abs/2506.12007
作者: Paul Setinek,Gianluca Galletti,Thomas Gross,Dominik Schnürer,Johannes Brandstetter,Werner Zellinger
机构: LIT AI Lab and Institute for Machine Learning, JKU Linz, Austria; Linz Center of Mechatronics GmbH, Linz, Austria; Emmi AI GmbH, Linz, Austria
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)
备注:
Abstract:Neural surrogates for Partial Differential Equations (PDEs) often suffer significant performance degradation when evaluated on unseen problem configurations, such as novel material types or structural dimensions. Meanwhile, Domain Adaptation (DA) techniques have been widely used in vision and language processing to generalize from limited information about unseen configurations. In this work, we address this gap through two focused contributions. First, we introduce SIMSHIFT, a novel benchmark dataset and evaluation suite composed of four industrial simulation tasks: hot rolling, sheet metal forming, electric motor design and heatsink design. Second, we extend established domain adaptation methods to state of the art neural surrogates and systematically evaluate them. These approaches use parametric descriptions and ground truth simulations from multiple source configurations, together with only parametric descriptions from target configurations. The goal is to accurately predict target simulations without access to ground truth simulation data. Extensive experiments on SIMSHIFT highlight the challenges of out of distribution neural surrogate modeling, demonstrate the potential of DA in simulation, and reveal open problems in achieving robust neural surrogates under distribution shifts in industrially relevant scenarios. Our codebase is available at this https URL
zh
[CV-3] Improving Surgical Risk Prediction Through Integrating Automated Body Composition Analysis: a Retrospective Trial on Colectomy Surgery
【速读】:该论文试图解决如何利用术前从CT扫描中自动提取的体成分指标来预测结肠切除术后结局的问题,包括单独使用或结合临床变量及现有风险预测因子。其解决方案的关键在于从多个椎体水平的术前CT图像中提取超过300个特征,如骨骼肌面积、密度、脂肪区域及组织间指标,并通过Cox比例风险模型和逻辑回归等统计方法评估这些指标对1年全因死亡率及其他术后并发症的预测性能。
链接: https://arxiv.org/abs/2506.11996
作者: Hanxue Gu,Yaqian Chen,isoo Lee,Diego Schaps,Regina Woody,Roy Colglazier,Maciej A. Mazurowski,Christopher Mantyh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 5 figures
Abstract:Objective: To evaluate whether preoperative body composition metrics automatically extracted from CT scans can predict postoperative outcomes after colectomy, either alone or combined with clinical variables or existing risk predictors. Main outcomes and measures: The primary outcome was the predictive performance for 1-year all-cause mortality following colectomy. A Cox proportional hazards model with 1-year follow-up was used, and performance was evaluated using the concordance index (C-index) and Integrated Brier Score (IBS). Secondary outcomes included postoperative complications, unplanned readmission, blood transfusion, and severe infection, assessed using AUC and Brier Score from logistic regression. Odds ratios (OR) described associations between individual CT-derived body composition metrics and outcomes. Over 300 features were extracted from preoperative CTs across multiple vertebral levels, including skeletal muscle area, density, fat areas, and inter-tissue metrics. NSQIP scores were available for all surgeries after 2012.
zh
[CV-4] Simple Radiology VLLM Test-time Scaling with Thought Graph Traversal
【速读】:该论文旨在解决视觉-语言大模型(VLLMs)在放射学报告生成任务中推理性能不足的问题,特别是在不进行额外训练的情况下提升模型生成报告的准确性与一致性。解决方案的关键在于提出一种轻量级的Thought Graph Traversal (TGT)框架,该框架通过整合结构化的医学先验知识到提示中,引导模型按照医学上连贯的顺序进行器官特异性发现的推理。此外,结合推理预算强制策略,动态调整模型的推理深度,从而在不修改底层模型的前提下实现更深入和逻辑严谨的分析。
链接: https://arxiv.org/abs/2506.11989
作者: Yue Yao,Zelin Wen,Yan Tong,Xinyu Tian,Xuqing Li,Xiao Ma,Dongliang Xu,Tom Gedeon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2404.11209 by other authors
Abstract:Test-time scaling offers a promising way to improve the reasoning performance of vision-language large models (VLLMs) without additional training. In this paper, we explore a simple but effective approach for applying test-time scaling to radiology report generation. Specifically, we introduce a lightweight Thought Graph Traversal (TGT) framework that guides the model to reason through organ-specific findings in a medically coherent order. This framework integrates structured medical priors into the prompt, enabling deeper and more logical analysis with no changes to the underlying model. To further enhance reasoning depth, we apply a reasoning budget forcing strategy that adjusts the model’s inference depth at test time by dynamically extending its generation process. This simple yet powerful combination allows a frozen radiology VLLM to self-correct and generate more accurate, consistent chest X-ray reports. Our method outperforms baseline prompting approaches on standard benchmarks, and also reveals dataset biases through traceable reasoning paths. Code and prompts are open-sourced for reproducibility at this https URL.
zh
[CV-5] How Visual Representations Map to Language Feature Space in Multimodal LLM s
【速读】:该论文试图解决视觉-语言模型(VLMs)中视觉与语言表示对齐机制不明确的问题,特别是如何通过视觉指令微调使视觉特征与语言模型的表示空间进行有效对齐。解决方案的关键在于采用一种方法论框架,该框架保持大型语言模型(LLM)和视觉Transformer(ViT)冻结不变,仅通过训练线性适配器来连接两者。这一设计确保语言模型维持其原始语言表示,避免因视觉数据适应而改变,从而迫使线性适配器直接将视觉特征映射到语言模型已有的表示空间中。
链接: https://arxiv.org/abs/2506.11976
作者: Constantin Venhoff,Ashkan Khakzar,Sonia Joseph,Philip Torr,Neel Nanda
机构: University of Oxford(牛津大学); McGill University / Meta(麦吉尔大学/元); Meta(元)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Effective multimodal reasoning depends on the alignment of visual and linguistic representations, yet the mechanisms by which vision-language models (VLMs) achieve this alignment remain poorly understood. We introduce a methodological framework that deliberately maintains a frozen large language model (LLM) and a frozen vision transformer (ViT), connected solely by training a linear adapter during visual instruction tuning. This design is fundamental to our approach: by keeping the language model frozen, we ensure it maintains its original language representations without adaptation to visual data. Consequently, the linear adapter must map visual features directly into the LLM’s existing representational space rather than allowing the language model to develop specialized visual understanding through fine-tuning. Our experimental design uniquely enables the use of pre-trained sparse autoencoders (SAEs) of the LLM as analytical probes. These SAEs remain perfectly aligned with the unchanged language model and serve as a snapshot of the learned language feature-representations. Through systematic analysis of SAE reconstruction error, sparsity patterns, and feature SAE descriptions, we reveal the layer-wise progression through which visual representations gradually align with language feature representations, converging in middle-to-later layers. This suggests a fundamental misalignment between ViT outputs and early LLM layers, raising important questions about whether current adapter-based architectures optimally facilitate cross-modal representation learning.
zh
[CV-6] Visual Pre-Training on Unlabeled Images using Reinforcement Learning
【速读】:该论文试图解决在无标签图像数据上预训练模型以获得更好表示的问题,其核心挑战在于如何有效利用大量未标注数据进行特征学习。解决方案的关键在于将预训练过程转化为强化学习(Reinforcement Learning, RL)问题,通过训练一个通用价值函数,在动态系统中让智能体通过对图像进行视图变换或添加图像增强来学习特征,从而实现类似于crop-consistency的自监督学习,同时通过奖励函数引入可调控的机制,以利用已有的标注图像或弱标签描述来优化特征学习。
链接: https://arxiv.org/abs/2506.11967
作者: Dibya Ghosh,Sergey Levine
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In reinforcement learning (RL), value-based algorithms learn to associate each observation with the states and rewards that are likely to be reached from it. We observe that many self-supervised image pre-training methods bear similarity to this formulation: learning features that associate crops of images with those of nearby views, e.g., by taking a different crop or color augmentation. In this paper, we complete this analogy and explore a method that directly casts pre-training on unlabeled image data like web crawls and video frames as an RL problem. We train a general value function in a dynamical system where an agent transforms an image by changing the view or adding image augmentations. Learning in this way resembles crop-consistency self-supervision, but through the reward function, offers a simple lever to shape feature learning using curated images or weakly labeled captions when they exist. Our experiments demonstrate improved representations when training on unlabeled images in the wild, including video data like EpicKitchens, scene data like COCO, and web-crawl data like CC12M.
zh
[CV-7] Evaluating Sensitivity Parameters in Smartphone-Based Gaze Estimation: A Comparative Study of Appearance-Based and Infrared Eye Trackers
【速读】:该论文试图解决在真实移动使用条件下,基于外观的注视估计(appearance-based gaze estimation)的可行性问题。其解决方案的关键在于集成轻量级卷积神经网络(MobileNet-V3)与循环结构(Long Short-Term Memory, LSTM),通过从灰度面部图像中预测注视坐标,实现智能手机端的深度学习眼动追踪算法。
链接: https://arxiv.org/abs/2506.11932
作者: Nishan Gunawardena,Gough Yumu Lui,Jeewani Anupama Ginige,Bahman Javadi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:This study evaluates a smartphone-based, deep-learning eye-tracking algorithm by comparing its performance against a commercial infrared-based eye tracker, the Tobii Pro Nano. The aim is to investigate the feasibility of appearance-based gaze estimation under realistic mobile usage conditions. Key sensitivity factors, including age, gender, vision correction, lighting conditions, device type, and head position, were systematically analysed. The appearance-based algorithm integrates a lightweight convolutional neural network (MobileNet-V3) with a recurrent structure (Long Short-Term Memory) to predict gaze coordinates from grayscale facial images. Gaze data were collected from 51 participants using dynamic visual stimuli, and accuracy was measured using Euclidean distance. The deep learning model produced a mean error of 17.76 mm, compared to 16.53 mm for the Tobii Pro Nano. While overall accuracy differences were small, the deep learning-based method was more sensitive to factors such as lighting, vision correction, and age, with higher failure rates observed under low-light conditions among participants using glasses and in older age groups. Device-specific and positional factors also influenced tracking performance. These results highlight the potential of appearance-based approaches for mobile eye tracking and offer a reference framework for evaluating gaze estimation systems across varied usage conditions.
zh
[CV-8] Real-World Deployment of a Lane Change Prediction Architecture Based on Knowledge Graph Embeddings and Bayesian Inference
【速读】:该论文试图解决车道变更预测算法在仿真或数据集研究中取得进展,但缺乏实际道路部署验证的问题。解决方案的关键在于通过真实硬件实现基于知识图嵌入(Knowledge Graph Embeddings, KGE)和贝叶斯推理的车道变更预测系统,并结合自车纵向制动以确保安全。该系统由感知模块和预训练预测模块组成,前者负责环境感知与特征转换,后者执行KGE和贝叶斯推理模型以预测目标车辆的行驶意图,并将其转化为纵向制动动作。
链接: https://arxiv.org/abs/2506.11925
作者: M. Manzour,Catherine M. Elias,Omar M. Shehata,R. Izquierdo,M. A. Sotelo
机构: University of Alcalá (阿尔卡拉大学); German University in Cairo (开罗德国大学)
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Research on lane change prediction has gained a lot of momentum in the last couple of years. However, most research is confined to simulation or results obtained from datasets, leaving a gap between algorithmic advances and on-road deployment. This work closes that gap by demonstrating, on real hardware, a lane-change prediction system based on Knowledge Graph Embeddings (KGEs) and Bayesian inference. Moreover, the ego-vehicle employs a longitudinal braking action to ensure the safety of both itself and the surrounding vehicles. Our architecture consists of two modules: (i) a perception module that senses the environment, derives input numerical features, and converts them into linguistic categories; and communicates them to the prediction module; (ii) a pretrained prediction module that executes a KGE and Bayesian inference model to anticipate the target vehicle’s maneuver and transforms the prediction into longitudinal braking action. Real-world hardware experimental validation demonstrates that our prediction system anticipates the target vehicle’s lane change three to four seconds in advance, providing the ego vehicle sufficient time to react and allowing the target vehicle to make the lane change safely.
zh
[CV-9] Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
【速读】:该论文试图解决跨视角图像与几何生成中的对齐问题,特别是在缺乏密集姿态图像或依赖域内视角的生成模型的情况下,如何实现高质量的新型视角合成。解决方案的关键在于引入一种基于扩散的框架,通过“变形与修复”方法将新型视角合成任务转化为图像和几何的修复任务,并利用交叉模态注意力蒸馏技术,在训练和推理过程中将图像扩散分支的注意力图注入几何扩散分支,以确保生成图像与几何的一致性。此外,通过引入基于邻近性的网格条件机制,整合深度和法线线索,进一步提升几何预测的鲁棒性。
链接: https://arxiv.org/abs/2506.11924
作者: Min-Seop Kwak,Junho Kim,Sangdoo Yun,Dongyoon Han,Taekyoung Kim,Seungryong Kim,Jin-Hwa Kim
机构: NAVER AI Lab; KAIST AI; SNU AIIS
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention distillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between point cloud and filtering erroneously predicted geometry from influencing the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. Project page is available at this https URL.
zh
[CV-10] O2Former:Direction-Aware and Multi-Scale Query Enhancement for SAR Ship Instance Segmentation
【速读】:该论文旨在解决合成孔径雷达(SAR)图像中船舶实例分割的挑战,包括尺度变化、目标密度高以及目标边界模糊等问题,这些问题在现有方法中常被忽视,导致性能不佳。其解决方案的关键在于提出O2Former框架,该框架通过两个核心组件进行优化:一是优化查询生成器(OQG),通过联合编码浅层位置线索和高层语义信息实现多尺度特征交互,从而提升查询质量和收敛效率;二是方向感知嵌入模块(OAEM),通过方向感知卷积和极坐标编码增强方向敏感性,有效应对SAR场景中目标方向不均匀的问题。这两个组件共同促进了从主干网络到解码器的精确特征对齐,并增强了模型捕捉细粒度结构细节的能力。
链接: https://arxiv.org/abs/2506.11913
作者: F. Gao,Y Li,X He,J Sun,J Wang
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 7 figures
Abstract:Instance segmentation of ships in synthetic aperture radar (SAR) imagery is critical for applications such as maritime monitoring, environmental analysis, and national security. SAR ship images present challenges including scale variation, object density, and fuzzy target boundary, which are often overlooked in existing methods, leading to suboptimal performance. In this work, we propose O2Former, a tailored instance segmentation framework that extends Mask2Former by fully leveraging the structural characteristics of SAR imagery. We introduce two key components. The first is the Optimized Query Generator(OQG). It enables multi-scale feature interaction by jointly encoding shallow positional cues and high-level semantic information. This improves query quality and convergence efficiency. The second component is the Orientation-Aware Embedding Module(OAEM). It enhances directional sensitivity through direction-aware convolution and polar-coordinate encoding. This effectively addresses the challenge of uneven target orientations in SAR scenes. Together, these modules facilitate precise feature alignment from backbone to decoder and strengthen the model’s capacity to capture fine-grained structural details. Extensive experiments demonstrate that O2Former outperforms state of the art instance segmentation baselines, validating its effectiveness and generalization on SAR ship datasets.
zh
[CV-11] Methods for evaluating the resolution of 3D data derived from satellite images
【速读】:该论文旨在解决如何评估从卫星图像中提取的3D数据(包括点云、数字表面模型和3D网格模型)的分辨率问题,这对于确定任务实用性及跟踪技术改进至关重要。其解决方案的关键在于开发基于高分辨率参考机载激光雷达数据的3D度量评估工具和自动化评估流程。
链接: https://arxiv.org/abs/2506.11876
作者: Christina Selby,Holden Bindl,Tyler Feldman,Andrew Skow,Nicolas Norena Acosta,Shea Hagstrom,Myron Brown
机构: The Johns Hopkins University Applied Physics Laboratory (约翰霍普金斯大学应用物理实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 11 pages, 13 figures
Abstract:3D data derived from satellite images is essential for scene modeling applications requiring large-scale coverage or involving locations not accessible by airborne lidar or cameras. Measuring the resolution of this data is important for determining mission utility and tracking improvements. In this work, we consider methods to evaluate the resolution of point clouds, digital surface models, and 3D mesh models. We describe 3D metric evaluation tools and workflows that enable automated evaluation based on high-resolution reference airborne lidar, and we present results of analyses with data of varying quality.
zh
[CV-12] SphereDrag : Spherical Geometry-Aware Panoramic Image Editing
【速读】:该论文试图解决全景图像编辑中的三个关键问题:边界不连续性、轨迹变形和像素密度不均匀性。解决方案的关键在于提出SphereDrag框架,其核心技术包括自适应重投影(Adaptive Reprojection, AR)、大圆轨迹调整(Great-Circle Trajectory Adjustment, GCTA)和球面搜索区域追踪(Spherical Search Region Tracking, SSRT),这些技术分别针对上述挑战,利用球面几何知识实现更精确和可控的编辑效果。
链接: https://arxiv.org/abs/2506.11863
作者: Zhiao Feng,Xuewei Li,Junjie Yang,Yuxin Peng,Xi Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image editing has made great progress on planar images, but panoramic image editing remains underexplored. Due to their spherical geometry and projection distortions, panoramic images present three key challenges: boundary discontinuity, trajectory deformation, and uneven pixel density. To tackle these issues, we propose SphereDrag, a novel panoramic editing framework utilizing spherical geometry knowledge for accurate and controllable editing. Specifically, adaptive reprojection (AR) uses adaptive spherical rotation to deal with discontinuity; great-circle trajectory adjustment (GCTA) tracks the movement trajectory more accurate; spherical search region tracking (SSRT) adaptively scales the search range based on spherical location to address uneven pixel density. Also, we construct PanoBench, a panoramic editing benchmark, including complex editing tasks involving multiple objects and diverse styles, which provides a standardized evaluation framework. Experiments show that SphereDrag gains a considerable improvement compared with existing methods in geometric consistency and image quality, achieving up to 10.5% relative improvement.
zh
[CV-13] Vision-based Lifting of 2D Object Detections for Automated Driving
【速读】:该论文旨在解决如何在不依赖昂贵的LiDAR传感器的情况下,利用低成本的车载摄像头实现高效的3D目标检测问题。其解决方案的关键在于提出了一种管道流程,通过将现有的基于视觉的2D算法结果提升至3D检测,并利用2D卷积神经网络(CNN)处理每个2D检测的点云,从而在保持较低计算成本的同时实现与当前最先进图像基方法相当的检测性能。
链接: https://arxiv.org/abs/2506.11839
作者: Hendrik Königshof,Kun Li,Christoph Stiller
机构: FZI Research Center for Inf. Technology (FZI信息技术研究中心); Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: this https URL
Abstract:Image-based 3D object detection is an inevitable part of autonomous driving because cheap onboard cameras are already available in most modern cars. Because of the accurate depth information, currently, most state-of-the-art 3D object detectors heavily rely on LiDAR data. In this paper, we propose a pipeline which lifts the results of existing vision-based 2D algorithms to 3D detections using only cameras as a cost-effective alternative to LiDAR. In contrast to existing approaches, we focus not only on cars but on all types of road users. To the best of our knowledge, we are the first using a 2D CNN to process the point cloud for each 2D detection to keep the computational effort as low as possible. Our evaluation on the challenging KITTI 3D object detection benchmark shows results comparable to state-of-the-art image-based approaches while having a runtime of only a third.
zh
[CV-14] operated Driving: a New Challenge for 3D Object Detection in Compressed Point Clouds
【速读】:该论文旨在解决在远程驾驶(Teleoperated Driving, TD)中如何从点云数据中检测车辆和行人的问题,以确保安全操作。其解决方案的关键在于利用SELMA数据集,这是一个多模态、开源的自动驾驶合成数据集,并通过添加3D物体的地面真实边界框来支持目标检测任务。此外,研究还分析了先进压缩算法和目标检测器在压缩效率、解压缩与推理时间以及检测精度等方面的性能,并评估了压缩和检测对V2X网络数据率和延迟的影响,以满足TD应用的3GPP要求。
链接: https://arxiv.org/abs/2506.11804
作者: Filippo Bragato,Michael Neri,Paolo Testolina,Marco Giordani,Federica Battisti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI); Image and Video Processing (eess.IV)
备注: Submitted to IEEE Transactions on Intelligent Transportation Systems
Abstract:In recent years, the development of interconnected devices has expanded in many fields, from infotainment to education and industrial applications. This trend has been accelerated by the increased number of sensors and accessibility to powerful hardware and software. One area that significantly benefits from these advancements is Teleoperated Driving (TD). In this scenario, a controller drives safely a vehicle from remote leveraging sensors data generated onboard the vehicle, and exchanged via Vehicle-to-Everything (V2X) communications. In this work, we tackle the problem of detecting the presence of cars and pedestrians from point cloud data to enable safe TD operations. More specifically, we exploit the SELMA dataset, a multimodal, open-source, synthetic dataset for autonomous driving, that we expanded by including the ground-truth bounding boxes of 3D objects to support object detection. We analyze the performance of state-of-the-art compression algorithms and object detectors under several metrics, including compression efficiency, (de)compression and inference time, and detection accuracy. Moreover, we measure the impact of compression and detection on the V2X network in terms of data rate and latency with respect to 3GPP requirements for TD applications.
zh
[CV-15] GPLQ: A General Practical and Lightning QAT Method for Vision Transformers
【速读】:该论文旨在解决Vision Transformers (ViTs)在计算上的高消耗问题,特别是通过低比特量化(如4-bit)来降低计算负担,但现有方法在后训练量化(PTQ)和量化感知训练(QAT)中存在显著限制,如PTQ导致精度下降,QAT则面临计算成本高、泛化能力有限、训练不稳定及缺乏开源代码等问题。解决方案的关键在于提出一种名为General, Practical, and Lightning Quantization (GPLQ)的新框架,其核心基于两个关键经验洞察:激活量化的重要性以及保持模型原始优化“盆地”以维持泛化能力的必要性,从而采用“先激活后权重”的分阶段量化策略,实现高效且有效的ViT量化。
链接: https://arxiv.org/abs/2506.11784
作者: Guang Liang,Xinyao Liu,Jianxin Wu
机构: Nanjing University(南京大学); University of Science and Technology of China(中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision Transformers (ViTs) are essential in computer vision but are computationally intensive, too. Model quantization, particularly to low bit-widths like 4-bit, aims to alleviate this difficulty, yet existing Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) methods exhibit significant limitations. PTQ often incurs substantial accuracy drop, while QAT achieves high accuracy but suffers from prohibitive computational costs, limited generalization to downstream tasks, training instability, and lacking of open-source codebase. To address these challenges, this paper introduces General, Practical, and Lightning Quantization (GPLQ), a novel framework designed for efficient and effective ViT quantization. GPLQ is founded on two key empirical insights: the paramount importance of activation quantization and the necessity of preserving the model’s original optimization basin'' to maintain generalization. Consequently, GPLQ employs a sequential
activation-first, weights-later’’ strategy. Stage 1 keeps weights in FP32 while quantizing activations with a feature mimicking loss in only 1 epoch to keep it stay in the same ``basin’', thereby preserving generalization. Stage 2 quantizes weights using a PTQ method. As a result, GPLQ is 100x faster than existing QAT methods, lowers memory footprint to levels even below FP32 training, and achieves 4-bit model performance that is highly competitive with FP32 models in terms of both accuracy on ImageNet and generalization to diverse downstream tasks, including fine-grained visual classification and object detection. We will release an easy-to-use open-source toolkit supporting multiple vision tasks.
zh
[CV-16] Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation
【速读】:该论文旨在解决在超声心动图(echocardiography)领域中,由于解剖结构细微、时间动态复杂以及缺乏领域特定预训练模型而导致的自监督学习(SSL)应用难题。其解决方案的关键在于提出一种名为DISCOVR(Distilled Image Supervision for Cross Modal Video Representation)的自监督双分支框架,该框架通过结合基于聚类的视频编码器与在线图像编码器,利用语义聚类蒸馏损失将图像编码器中的解剖知识传递给视频编码器,从而实现具有时间一致性且富含细粒度语义理解的视频表征学习。
链接: https://arxiv.org/abs/2506.11777
作者: Divyanshu Mishra,Mohammadreza Salehi,Pramit Saha,Olga Patey,Aris T. Papageorghiou,Yuki M. Asano,J. Alison Noble
机构: University of Oxford (牛津大学); University of Technology Nuremberg (纽伦堡应用技术大学); University of Amsterdam (阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
Abstract:Self-supervised learning (SSL) has achieved major advances in natural images and video understanding, but challenges remain in domains like echocardiography (heart ultrasound) due to subtle anatomical structures, complex temporal dynamics, and the current lack of domain-specific pre-trained models. Existing SSL approaches such as contrastive, masked modeling, and clustering-based methods struggle with high intersample similarity, sensitivity to low PSNR inputs common in ultrasound, or aggressive augmentations that distort clinically relevant features. We present DISCOVR (Distilled Image Supervision for Cross Modal Video Representation), a self-supervised dual branch framework for cardiac ultrasound video representation learning. DISCOVR combines a clustering-based video encoder that models temporal dynamics with an online image encoder that extracts fine-grained spatial semantics. These branches are connected through a semantic cluster distillation loss that transfers anatomical knowledge from the evolving image encoder to the video encoder, enabling temporally coherent representations enriched with fine-grained semantic understanding. Evaluated on six echocardiography datasets spanning fetal, pediatric, and adult populations, DISCOVR outperforms both specialized video anomaly detection methods and state-of-the-art video-SSL baselines in zero-shot and linear probing setups, and achieves superior segmentation transfer.
zh
[CV-17] Real-Time Feedback and Benchmark Dataset for Isometric Pose Evaluation
【速读】:该论文旨在解决等长运动训练中因依赖不可靠的数字媒体内容而非专家监督而导致的严重风险,如姿势错误、受伤和因缺乏纠正反馈而产生的参与度下降。解决方案的关键在于提出一种实时反馈系统,用于评估等长姿势,并释放了目前最大的多类别等长运动视频数据集,包含六种姿势的3,600多个片段,同时引入了一种新的三部分评估指标,以捕捉分类准确性、错误定位和模型置信度。
链接: https://arxiv.org/abs/2506.11774
作者: Abhishek Jaiswal,Armeet Singh Luthra,Purav Jangir,Bhavya Garg,Nisheeth Srivastava
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Isometric exercises appeal to individuals seeking convenience, privacy, and minimal dependence on equipments. However, such fitness training is often overdependent on unreliable digital media content instead of expert supervision, introducing serious risks, including incorrect posture, injury, and disengagement due to lack of corrective feedback. To address these challenges, we present a real-time feedback system for assessing isometric poses. Our contributions include the release of the largest multiclass isometric exercise video dataset to date, comprising over 3,600 clips across six poses with correct and incorrect variations. To support robust evaluation, we benchmark state-of-the-art models-including graph-based networks-on this dataset and introduce a novel three-part metric that captures classification accuracy, mistake localization, and model confidence. Our results enhance the feasibility of intelligent and personalized exercise training systems for home workouts. This expert-level diagnosis, delivered directly to the users, also expands the potential applications of these systems to rehabilitation, physiotherapy, and various other fitness disciplines that involve physical motion.
zh
[CV-18] AgentS ense: Virtual Sensor Data Generation Using LLM Agent in Simulated Home Environments
【速读】:该论文试图解决智能家庭环境中的人体活动识别(Human Activity Recognition, HAR)系统在构建时面临的挑战,即缺乏大规模、多样化的标注数据集。由于家庭布局、传感器配置和用户行为的差异性,使得现有数据难以满足系统对泛化能力的要求。论文提出的解决方案关键在于构建一个名为AgentSense的虚拟数据生成流程,该流程利用大型语言模型生成多样化的人格化代理(persona),并据此生成日常活动流程,再通过扩展的虚拟家庭环境VirtualHome进行模拟,从而生成丰富的虚拟传感器数据。此方法无需人工数据采集即可有效提升HAR模型性能,尤其在真实数据有限的情况下表现显著。
链接: https://arxiv.org/abs/2506.11773
作者: Zikang Leng,Megha Thukral,Yaqi Liu,Hrudhai Rajasekhar,Shruthi K. Hiremath,Thomas Plötz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:A major obstacle in developing robust and generalizable smart home-based Human Activity Recognition (HAR) systems is the lack of large-scale, diverse labeled datasets. Variability in home layouts, sensor configurations, and user behavior adds further complexity, as individuals follow varied routines and perform activities in distinct ways. Building HAR systems that generalize well requires training data that captures the diversity across users and environments. To address these challenges, we introduce AgentSense, a virtual data generation pipeline where diverse personas are generated by leveraging Large Language Models. These personas are used to create daily routines, which are then decomposed into low-level action sequences. Subsequently, the actions are executed in a simulated home environment called VirtualHome that we extended with virtual ambient sensors capable of recording the agents activities as they unfold. Overall, AgentSense enables the generation of rich, virtual sensor datasets that represent a wide range of users and home settings. Across five benchmark HAR datasets, we show that leveraging our virtual sensor data substantially improves performance, particularly when real data are limited. Notably, models trained on a combination of virtual data and just a few days of real data achieve performance comparable to those trained on the entire real datasets. These results demonstrate and prove the potential of virtual data to address one of the most pressing challenges in ambient sensing, which is the distinct lack of large-scale, annotated datasets without requiring any manual data collection efforts.
zh
[CV-19] CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection
【速读】:该论文旨在解决异常检测(anomaly detection)中的复杂问题,包括异常定义的模糊性、异常类型(如局部和全局缺陷)的多样性以及训练数据的稀缺性。其解决方案的关键在于提出CLIPFUSION方法,该方法融合了判别式(discriminative)和生成式(generative)基础模型,其中基于CLIP的判别模型擅长捕捉全局特征,而基于扩散模型的生成模型则有效提取局部细节,形成协同互补的机制。此外,该方法引入了利用扩散模型提取的交叉注意力图和特征图进行异常检测的具体策略,从而在基准数据集上实现了优于基线方法的异常分割与分类性能。
链接: https://arxiv.org/abs/2506.11772
作者: Byeongchan Lee,John Won,Seunghyun Lee,Jinwoo Shin
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Anomaly detection is a complex problem due to the ambiguity in defining anomalies, the diversity of anomaly types (e.g., local and global defect), and the scarcity of training data. As such, it necessitates a comprehensive model capable of capturing both low-level and high-level features, even with limited data. To address this, we propose CLIPFUSION, a method that leverages both discriminative and generative foundation models. Specifically, the CLIP-based discriminative model excels at capturing global features, while the diffusion-based generative model effectively captures local details, creating a synergistic and complementary approach. Notably, we introduce a methodology for utilizing cross-attention maps and feature maps extracted from diffusion models specifically for anomaly detection. Experimental results on benchmark datasets (MVTec-AD, VisA) demonstrate that CLIPFUSION consistently outperforms baseline methods, achieving outstanding performance in both anomaly segmentation and classification. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection, providing a scalable solution for real-world applications.
zh
[CV-20] MambaVSR: Content-Aware Scanning State Space Model for Video Super-Resolution
【速读】:该论文旨在解决视频超分辨率(Video Super-Resolution, VSR)中有效建模跨错帧的非局部依赖关系的同时保持计算效率的问题。现有方法通常依赖于光流策略或Transformer架构,但在处理大运动位移和长视频序列时存在局限性。解决方案的关键在于提出MambaVSR,这是首个用于VSR的状态空间模型框架,其核心创新在于引入了内容感知扫描机制,通过共享指南针构造(SCC)和内容感知序列化(CAS)实现动态时空交互,从而有效对齐和聚合多帧中的非局部相似内容,并结合全局-局部状态空间块(GLSSB)实现高频细节恢复,最终在REDs数据集上以55%更少参数量超越基于Transformer的方法。
链接: https://arxiv.org/abs/2506.11768
作者: Linfeng He,Meiqin Liu,Qi Tang,Chao Yao,Yao Zhao
机构: Beijing Jiaotong University (北京交通大学); University of Science and Technology Beijing (北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video super-resolution (VSR) faces critical challenges in effectively modeling non-local dependencies across misaligned frames while preserving computational efficiency. Existing VSR methods typically rely on optical flow strategies or transformer architectures, which struggle with large motion displacements and long video sequences. To address this, we propose MambaVSR, the first state-space model framework for VSR that incorporates an innovative content-aware scanning mechanism. Unlike rigid 1D sequential processing in conventional vision Mamba methods, our MambaVSR enables dynamic spatiotemporal interactions through the Shared Compass Construction (SCC) and the Content-Aware Sequentialization (CAS). Specifically, the SCC module constructs intra-frame semantic connectivity graphs via efficient sparse attention and generates adaptive spatial scanning sequences through spectral clustering. Building upon SCC, the CAS module effectively aligns and aggregates non-local similar content across multiple frames by interleaving temporal features along the learned spatial order. To bridge global dependencies with local details, the Global-Local State Space Block (GLSSB) synergistically integrates window self-attention operations with SSM-based feature propagation, enabling high-frequency detail recovery under global dependency guidance. Extensive experiments validate MambaVSR’s superiority, outperforming the Transformer-based method by 0.58 dB PSNR on the REDS dataset with 55% fewer parameters.
zh
[CV-21] DiffFuSR: Super-Resolution of all Sentinel-2 Multispectral Bands using Diffusion Models
【速读】:该论文旨在解决Sentinel-2 Level-2A影像中所有12个光谱波段的超分辨率重建问题,目标是将它们统一到2.5米的地面采样距离(GSD)。解决方案的关键在于提出了一种模块化流水线DiffFuSR,其核心包括基于扩散模型的超分辨率(SR)阶段和一个利用超分辨率RGB图像作为空间先验的融合网络阶段。此外,通过引入鲁棒的退化模型和对比退化编码器,支持了盲超分辨率任务,从而在反射率保真度、光谱一致性、空间对齐和幻觉抑制等方面优于现有最先进方法。
链接: https://arxiv.org/abs/2506.11764
作者: Muhammad Sarmad,Arnt-Børre Salberg,Michael Kampffmeyer
机构: Norwegian Computing Center (挪威计算中心); Department of Physics and Technology, UiT The Arctic University of Norway (物理与技术系,特罗姆瑟北极大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: preprint under review
Abstract:This paper presents DiffFuSR, a modular pipeline for super-resolving all 12 spectral bands of Sentinel-2 Level-2A imagery to a unified ground sampling distance (GSD) of 2.5 meters. The pipeline comprises two stages: (i) a diffusion-based super-resolution (SR) model trained on high-resolution RGB imagery from the NAIP and WorldStrat datasets, harmonized to simulate Sentinel-2 characteristics; and (ii) a learned fusion network that upscales the remaining multispectral bands using the super-resolved RGB image as a spatial prior. We introduce a robust degradation model and contrastive degradation encoder to support blind SR. Extensive evaluations of the proposed SR pipeline on the OpenSR benchmark demonstrate that the proposed method outperforms current SOTA baselines in terms of reflectance fidelity, spectral consistency, spatial alignment, and hallucination suppression. Furthermore, the fusion network significantly outperforms classical pansharpening approaches, enabling accurate enhancement of Sentinel-2’s 20 m and 60 m bands. This study underscores the power of harmonized learning with generative priors and fusion strategies to create a modular framework for Sentinel-2 SR. Our code and models can be found at this https URL.
zh
[CV-22] AgriPotential: A Novel Multi-Spectral and Multi-Temporal Remote Sensing Dataset for Agricultural Potentials
【速读】:该论文旨在解决农业潜力预测的问题,特别是通过遥感数据实现对不同作物类型(如葡萄种植、市场园艺和大田作物)的精细化评估。其解决方案的关键在于构建了AgriPotential数据集,该数据集包含多月的Sentinel-2卫星影像,并提供像素级的农业潜力标注,覆盖五个有序类别,从而支持多种机器学习任务,如序数回归、多标签分类和时空建模。
链接: https://arxiv.org/abs/2506.11740
作者: Mohammad El Sakka,Caroline De Pourtales,Lotfi Chaari,Josiane Mothe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Remote sensing has emerged as a critical tool for large-scale Earth monitoring and land management. In this paper, we introduce AgriPotential, a novel benchmark dataset composed of Sentinel-2 satellite imagery spanning multiple months. The dataset provides pixel-level annotations of agricultural potentials for three major crop types - viticulture, market gardening, and field crops - across five ordinal classes. AgriPotential supports a broad range of machine learning tasks, including ordinal regression, multi-label classification, and spatio-temporal modeling. The data covers diverse areas in Southern France, offering rich spectral information. AgriPotential is the first public dataset designed specifically for agricultural potential prediction, aiming to improve data-driven approaches to sustainable land use planning. The dataset and the code are freely accessible at: this https URL
zh
[CV-23] DMAF-Net: An Effective Modality Rebalancing Framework for Incomplete Multi-Modal Medical Image Segmentation
【速读】:该论文旨在解决不完全多模态医学图像分割中的模态不平衡问题,包括模态缺失率不平衡和模态贡献异质性。现有方法由于依赖于理想化的完整模态可用性假设,无法动态平衡模态贡献并忽视模态间的结构关系,导致在实际临床场景中性能不佳。其解决方案的关键在于提出一种名为动态模态感知融合网络(Dynamic Modality-Aware Fusion Network, DMAF-Net)的模型,该模型通过动态模态感知融合模块、协同关系蒸馏与原型蒸馏框架以及动态训练监控策略,实现了对模态干扰的抑制、全局-局部特征对齐与语义一致性保障,并在不同模态缺失率下稳定优化过程。
链接: https://arxiv.org/abs/2506.11691
作者: Libin Lan,Hongxing Li,Zunhui Xia,Yudong Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 4 figures, 3 tables
Abstract:Incomplete multi-modal medical image segmentation faces critical challenges from modality imbalance, including imbalanced modality missing rates and heterogeneous modality contributions. Due to their reliance on idealized assumptions of complete modality availability, existing methods fail to dynamically balance contributions and neglect the structural relationships between modalities, resulting in suboptimal performance in real-world clinical scenarios. To address these limitations, we propose a novel model, named Dynamic Modality-Aware Fusion Network (DMAF-Net). The DMAF-Net adopts three key ideas. First, it introduces a Dynamic Modality-Aware Fusion (DMAF) module to suppress missing-modality interference by combining transformer attention with adaptive masking and weight modality contributions dynamically through attention maps. Second, it designs a synergistic Relation Distillation and Prototype Distillation framework to enforce global-local feature alignment via covariance consistency and masked graph attention, while ensuring semantic consistency through cross-modal class-specific prototype alignment. Third, it presents a Dynamic Training Monitoring (DTM) strategy to stabilize optimization under imbalanced missing rates by tracking distillation gaps in real-time, and to balance convergence speeds across modalities by adaptively reweighting losses and scaling gradients. Extensive experiments on BraTS2020 and MyoPS2020 demonstrate that DMAF-Net outperforms existing methods for incomplete multi-modal medical image segmentation. Extensive experiments on BraTS2020 and MyoPS2020 demonstrate that DMAF-Net outperforms existing methods for incomplete multi-modal medical image segmentation. Our code is available at this https URL.
zh
[CV-24] MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space
【速读】:该论文试图解决视觉语言模型(Vision-Language Models, VLMs)在处理多表格图像时的鲁棒性与推理能力不足的问题,特别是在面对现实场景中常见的多表格图像布局时,现有基准测试通常仅关注单个表格或非视觉数据,未能评估模型对多样本表格图像的解析、跨表格信息关联以及多跳推理的能力。解决方案的关键在于引入MTabVQA,这是一个专门设计用于多表格视觉问答任务的新基准,包含3,745个需要跨多个视觉化表格图像进行多跳推理的复杂问题-答案对,并通过MTabVQA-Instruct指令微调数据集进一步提升模型的推理性能。
链接: https://arxiv.org/abs/2506.11684
作者: Anshul Singh,Chris Biemann,Jan Strich
机构: Panjab University (旁遮普大学); Universität Hamburg (汉堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language Models (VLMs) have demonstrated remarkable capabilities in interpreting visual layouts and text. However, a significant challenge remains in their ability to interpret robustly and reason over multi-tabular data presented as images, a common occurrence in real-world scenarios like web pages and digital documents. Existing benchmarks typically address single tables or non-visual data (text/structured). This leaves a critical gap: they don’t assess the ability to parse diverse table images, correlate information across them, and perform multi-hop reasoning on the combined visual data. We introduce MTabVQA, a novel benchmark specifically designed for multi-tabular visual question answering to bridge that gap. MTabVQA comprises 3,745 complex question-answer pairs that necessitate multi-hop reasoning across several visually rendered table images. We provide extensive benchmark results for state-of-the-art VLMs on MTabVQA, revealing significant performance limitations. We further investigate post-training techniques to enhance these reasoning abilities and release MTabVQA-Instruct, a large-scale instruction-tuning dataset. Our experiments show that fine-tuning VLMs with MTabVQA-Instruct substantially improves their performance on visual multi-tabular reasoning. Code and dataset (this https URL) are available online (this https URL).
zh
[CV-25] Pose Matters: Evaluating Vision Transformers and CNNs for Human Action Recognition on Small COCO Subsets
【速读】:该论文旨在解决人类动作识别问题,通过使用COCO图像语料库的三类子集来评估不同模型的性能。其解决方案的关键在于采用基于视觉Transformer(Vision Transformer, ViT)的架构,该模型在测试中达到了90%的平均准确率,显著优于传统卷积网络和基于CLIP的模型。研究还表明,ViT能够有效定位与动作相关的身体部位,而简单前馈模型则倾向于关注背景纹理,导致识别错误,这凸显了Transformer表示的数据效率及可解释性技术在诊断类别特定失败中的重要性。
链接: https://arxiv.org/abs/2506.11678
作者: MingZe Tang,Madiha Kazi
机构: University of Aberdeen (阿伯丁大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 9 figures
Abstract:This study explores human action recognition using a three-class subset of the COCO image corpus, benchmarking models from simple fully connected networks to transformer architectures. The binary Vision Transformer (ViT) achieved 90% mean test accuracy, significantly exceeding multiclass classifiers such as convolutional networks (approximately 35%) and CLIP-based models (approximately 62-64%). A one-way ANOVA (F = 61.37, p 0.001) confirmed these differences are statistically significant. Qualitative analysis with SHAP explainer and LeGrad heatmaps indicated that the ViT localizes pose-specific regions (e.g., lower limbs for walking or running), while simpler feed-forward models often focus on background textures, explaining their errors. These findings emphasize the data efficiency of transformer representations and the importance of explainability techniques in diagnosing class-specific failures.
zh
[CV-26] Predicting Patient Survival with Airway Biomarkers using nn-Unet/Radiomics
【速读】:该论文旨在解决通过气道相关影像生物标志物预测肺部患者生存结局的问题。其解决方案的关键在于提出了一种三阶段的方法:首先使用nn-Unet分割网络对气道结构边界进行精确分割;其次从气管为中心的放射组学图像及气道包围框中提取关键特征,以捕捉与生存相关的潜在信息;最后将分割区域获得的放射组学特征整合至支持向量机(SVM)分类器中进行分类。该方法在任务1中的分割整体得分为0.8601,在任务2中的分类得分为0.7346。
链接: https://arxiv.org/abs/2506.11677
作者: Zacharia Mesbah,Dhruv Jain,Tsiry Mayet,Romain Modzelewski,Romain Herault,Simon Bernard,Sebastien Thureau,Clement Chatelain
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages
Abstract:The primary objective of the AIIB 2023 competition is to evaluate the predictive significance of airway-related imaging biomarkers in determining the survival outcomes of patients with lung this http URL study introduces a comprehensive three-stage approach. Initially, a segmentation network, namely nn-Unet, is employed to delineate the airway’s structural boundaries. Subsequently, key features are extracted from the radiomic images centered around the trachea and an enclosing bounding box around the airway. This step is motivated by the potential presence of critical survival-related insights within the tracheal region as well as pertinent information encoded in the structure and dimensions of the airway. Lastly, radiomic features obtained from the segmented areas are integrated into an SVM classifier. We could obtain an overall-score of 0.8601 for the segmentation in Task 1 while 0.7346 for the classification in Task 2.
zh
[CV-27] Cross-Modal Clustering-Guided Negative Sampling for Self-Supervised Joint Learning from Medical Images and Reports
【速读】:该论文旨在解决现有基于多模态自监督学习的医学视觉表征学习方法中存在的三个关键问题:负样本选择不足导致硬负样本稀缺和错误负样本混入、忽略对医学图像识别任务至关重要的细粒度局部细节以及对比学习主要关注高层特征而忽视对精准医学分析至关重要的低层细节。其解决方案的关键在于提出一种跨模态聚类引导的负采样(Cross-Modal Cluster-Guided Negative Sampling, CM-CGNS)方法,该方法通过跨模态注意力机制将单模态局部文本特征的k-means聚类扩展至多模态域,从而增加负样本数量并提升模型表征能力;同时引入跨模态掩码图像重建(Cross-Modal Masked Image Reconstruction, CM-MIR)模块,利用跨模态注意力获得的局部文本到图像特征重建被遮挡的局部图像区域,从而增强模型的跨模态信息交互能力并保留对下游任务至关重要的低层图像特征。
链接: https://arxiv.org/abs/2506.11674
作者: Libin Lan,Hongxing Li,Zunhui Xia,Juan Zhou,Xiaofei Zhu,Yongmei Li,Yudong Zhang,Xin Luo
机构: Chongqing University of Technology (重庆理工大学); Army Military Medical University (陆军军医大学); Chongqing Medical University (重庆医科大学); Southeast University (东南大学); Southwest University (西南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE TMI for possible publication. Our code is available at this https URL
Abstract:Learning medical visual representations directly from paired images and reports through multimodal self-supervised learning has emerged as a novel and efficient approach to digital diagnosis in recent years. However, existing models suffer from several severe limitations. 1) neglecting the selection of negative samples, resulting in the scarcity of hard negatives and the inclusion of false negatives; 2) focusing on global feature extraction, but overlooking the fine-grained local details that are crucial for medical image recognition tasks; and 3) contrastive learning primarily targets high-level features but ignoring low-level details which are essential for accurate medical analysis. Motivated by these critical issues, this paper presents a Cross-Modal Cluster-Guided Negative Sampling (CM-CGNS) method with two-fold ideas. First, it extends the k-means clustering used for local text features in the single-modal domain to the multimodal domain through cross-modal attention. This improvement increases the number of negative samples and boosts the model representation capability. Second, it introduces a Cross-Modal Masked Image Reconstruction (CM-MIR) module that leverages local text-to-image features obtained via cross-modal attention to reconstruct masked local image regions. This module significantly strengthens the model’s cross-modal information interaction capabilities and retains low-level image features essential for downstream tasks. By well handling the aforementioned limitations, the proposed CM-CGNS can learn effective and robust medical visual representations suitable for various recognition tasks. Extensive experimental results on classification, detection, and segmentation tasks across five downstream datasets show that our method outperforms state-of-the-art approaches on multiple metrics, verifying its superior performance.
zh
[CV-28] Dynamic Mixture of Curriculum LoRA Experts for Continual Multimodal Instruction Tuning ICML2025
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在持续多模态指令微调过程中面临的动态任务适应问题,特别是由于固定架构导致的模型容量不足和任务适应性差的问题。其解决方案的关键在于提出一种名为动态课程LoRA专家混合(Dynamic Mixture of Curriculum LoRA Experts, D-MoLE)的方法,通过在参数预算内演化模型架构,实现对新任务的持续适应,同时保留先前学习的知识。该方法的核心创新包括动态层间专家分配机制和基于梯度的跨模态持续课程策略,以解决任务架构冲突和模态不平衡问题。
链接: https://arxiv.org/abs/2506.11672
作者: Chendi Ge,Xin Wang,Zeyang Zhang,Hong Chen,Jiapei Fan,Longtao Huang,Hui Xue,Wenwu Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2025
Abstract:Continual multimodal instruction tuning is crucial for adapting Multimodal Large Language Models (MLLMs) to evolving tasks. However, most existing methods adopt a fixed architecture, struggling with adapting to new tasks due to static model capacity. We propose to evolve the architecture under parameter budgets for dynamic task adaptation, which remains unexplored and imposes two challenges: 1) task architecture conflict, where different tasks require varying layer-wise adaptations, and 2) modality imbalance, where different tasks rely unevenly on modalities, leading to unbalanced updates. To address these challenges, we propose a novel Dynamic Mixture of Curriculum LoRA Experts (D-MoLE) method, which automatically evolves MLLM’s architecture with controlled parameter budgets to continually adapt to new tasks while retaining previously learned knowledge. Specifically, we propose a dynamic layer-wise expert allocator, which automatically allocates LoRA experts across layers to resolve architecture conflicts, and routes instructions layer-wisely to facilitate knowledge sharing among experts. Then, we propose a gradient-based inter-modal continual curriculum, which adjusts the update ratio of each module in MLLM based on the difficulty of each modality within the task to alleviate the modality imbalance problem. Extensive experiments show that D-MoLE significantly outperforms state-of-the-art baselines, achieving a 15% average improvement over the best baseline. To the best of our knowledge, this is the first study of continual learning for MLLMs from an architectural perspective.
zh
[CV-29] Prohibited Items Segmentation via Occlusion-aware Bilayer Modeling ICME2025
【速读】:该论文旨在解决安全X射线图像中违禁物品实例分割的问题,该任务面临的主要挑战包括X射线图像中违禁物品与自然物体之间的显著外观差异以及物体间的严重重叠。解决方案的关键在于提出一种感知遮挡的实例分割流程,其中整合了Segment Anything Model (SAM) 以弥合表示差距,并设计了一个感知遮挡的双层掩码解码模块,以显式建模遮挡关系。此外,通过在两个大规模X射线图像分割数据集上手动标注遮挡区域,构建了两个带有遮挡标注的数据集PIDray-A和PIXray-A,以监督遮挡估计。
链接: https://arxiv.org/abs/2506.11661
作者: Yunhan Ren,Ruihuang Li,Lingbo Liu,Changwen Chen
机构: The Hong Kong Polytechnic University (香港理工大学); Pengcheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICME 2025
Abstract:Instance segmentation of prohibited items in security X-ray images is a critical yet challenging task. This is mainly caused by the significant appearance gap between prohibited items in X-ray images and natural objects, as well as the severe overlapping among objects in X-ray images. To address these issues, we propose an occlusion-aware instance segmentation pipeline designed to identify prohibited items in X-ray images. Specifically, to bridge the representation gap, we integrate the Segment Anything Model (SAM) into our pipeline, taking advantage of its rich priors and zero-shot generalization capabilities. To address the overlap between prohibited items, we design an occlusion-aware bilayer mask decoder module that explicitly models the occlusion relationships. To supervise occlusion estimation, we manually annotated occlusion areas of prohibited items in two large-scale X-ray image segmentation datasets, PIDray and PIXray. We then reorganized these additional annotations together with the original information as two occlusion-annotated datasets, PIDray-A and PIXray-A. Extensive experimental results on these occlusion-annotated datasets demonstrate the effectiveness of our proposed method. The datasets and codes are available at: this https URL
zh
[CV-30] DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation
【速读】:该论文试图解决模型在预测任务中可能依赖于与目标变量无因果关系的信号(如图像中的光照条件)作为捷径,从而影响预测准确性和泛化能力的问题。解决方案的关键在于引入一种标准反因果预测模型(SAM),该模型通过构建因果框架来分析影响预测器的信息路径,并确保分类器满足特定的条件独立性准则,从而仅关注从标签到图像的直接因果路径,对其他变量具有反事实不变性。此外,论文还提出了DISCO,一种基于条件距离相关性的正则化策略,用于优化回归任务中的条件独立性,以有效缓解偏差问题。
链接: https://arxiv.org/abs/2506.11653
作者: Emre Kavak,Tom Nuno Wolf,Christian Wachinger
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:During prediction tasks, models can use any signal they receive to come up with the final answer - including signals that are causally irrelevant. When predicting objects from images, for example, the lighting conditions could be correlated to different targets through selection bias, and an oblivious model might use these signals as shortcuts to discern between various objects. A predictor that uses lighting conditions instead of real object-specific details is obviously undesirable. To address this challenge, we introduce a standard anti-causal prediction model (SAM) that creates a causal framework for analyzing the information pathways influencing our predictor in anti-causal settings. We demonstrate that a classifier satisfying a specific conditional independence criterion will focus solely on the direct causal path from label to image, being counterfactually invariant to the remaining variables. Finally, we propose DISCO, a novel regularization strategy that uses conditional distance correlation to optimize for conditional independence in regression tasks. We can show that DISCO achieves competitive results in different bias mitigation experiments, deeming it a valid alternative to classical kernel-based methods.
zh
[CV-31] Evaluating Fairness and Mitigating Bias in Machine Learning: A Novel Technique using Tensor Data and Bayesian Regression
【速读】:该论文试图解决机器学习模型在处理皮肤颜色(skin color)这一特殊敏感属性时的公平性问题,传统研究多聚焦于性别和种族等分类特征,而皮肤颜色作为张量数据(tensor data)具有独特的表示方式。解决方案的关键在于不依赖标注数据,通过将皮肤颜色转换为概率分布并应用统计距离度量来评估公平性,从而捕捉跨群体及群体内部的细微公平性差异;同时提出一种创新的训练方法,利用贝叶斯回归与多项式函数计算颜色距离,以减轻传统皮肤色调分类中的隐性偏见。
链接: https://arxiv.org/abs/2506.11627
作者: Kuniko Paxton,Koorosh Aslansefat,Dhavalkumar Thakker,Yiannis Papadopoulos
机构: University of Hull (赫尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Fairness is a critical component of Trustworthy AI. In this paper, we focus on Machine Learning (ML) and the performance of model predictions when dealing with skin color. Unlike other sensitive attributes, the nature of skin color differs significantly. In computer vision, skin color is represented as tensor data rather than categorical values or single numerical points. However, much of the research on fairness across sensitive groups has focused on categorical features such as gender and race. This paper introduces a new technique for evaluating fairness in ML for image classification tasks, specifically without the use of annotation. To address the limitations of prior work, we handle tensor data, like skin color, without classifying it rigidly. Instead, we convert it into probability distributions and apply statistical distance measures. This novel approach allows us to capture fine-grained nuances in fairness both within and across what would traditionally be considered distinct groups. Additionally, we propose an innovative training method to mitigate the latent biases present in conventional skin tone categorization. This method leverages color distance estimates calculated through Bayesian regression with polynomial functions, ensuring a more nuanced and equitable treatment of skin color in ML models.
zh
[CV-32] SignAligner: Harmonizing Complementary Pose Modalities for Coherent Sign Language Generation
【速读】:该论文旨在解决手语生成中实现真实且自然的符号表示生成问题,这一问题由于手语的复杂性(包括复杂的手势、面部表情和身体动作)而具有挑战性。其解决方案的关键在于提出了一种名为SignAligner的新方法,该方法包含三个阶段:基于文本驱动的姿态模态联合生成、多模态在线协同校正以及真实手语视频合成。该方法通过结合文本语义,设计了一个联合手语生成器以同时生成姿态坐标、手势动作和身体运动,并利用跨模态注意力机制实现多模态信息的融合与优化,从而提升生成结果的准确性与表现力。
链接: https://arxiv.org/abs/2506.11621
作者: Xu Wang,Shengeng Tang,Lechao Cheng,Feng Li,Shuo Wang,Richang Hong
机构: Hefei University of Technology (合肥工业大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sign language generation aims to produce diverse sign representations based on spoken language. However, achieving realistic and naturalistic generation remains a significant challenge due to the complexity of sign language, which encompasses intricate hand gestures, facial expressions, and body movements. In this work, we introduce PHOENIX14T+, an extended version of the widely-used RWTH-PHOENIX-Weather 2014T dataset, featuring three new sign representations: Pose, Hamer and Smplerx. We also propose a novel method, SignAligner, for realistic sign language generation, consisting of three stages: text-driven pose modalities co-generation, online collaborative correction of multimodality, and realistic sign video synthesis. First, by incorporating text semantics, we design a joint sign language generator to simultaneously produce posture coordinates, gesture actions, and body movements. The text encoder, based on a Transformer architecture, extracts semantic features, while a cross-modal attention mechanism integrates these features to generate diverse sign language representations, ensuring accurate mapping and controlling the diversity of modal features. Next, online collaborative correction is introduced to refine the generated pose modalities using a dynamic loss weighting strategy and cross-modal attention, facilitating the complementarity of information across modalities, eliminating spatiotemporal conflicts, and ensuring semantic coherence and action consistency. Finally, the corrected pose modalities are fed into a pre-trained video generation network to produce high-fidelity sign language videos. Extensive experiments demonstrate that SignAligner significantly improves both the accuracy and expressiveness of the generated sign videos.
zh
[CV-33] Wi-CBR: WiFi-based Cross-domain Behavior Recognition via Multimodal Collaborative Awareness
【速读】:该论文旨在解决WiFi-based human behavior recognition中现有方法通常仅关注单一类型数据而忽视多特征交互与融合的问题。其解决方案的关键在于提出一种新颖的多模态协同感知方法,通过结合反映动态路径长度变化的相位数据和与手势运动速度相关的多普勒频移(Doppler Shift, DFS)数据,实现特征间的高效交互与融合。具体而言,该方法引入双分支自注意力模块以捕捉每个模态内的时空线索,并采用组注意力机制挖掘对行为识别关键的群体特征,最后通过门控机制将融合特征划分为PD-strengthen和PD-weaken分支,优化信息熵并促进跨模态协同感知。
链接: https://arxiv.org/abs/2506.11616
作者: Ruobei Zhang,Shengeng Tang,Huan Yan,Xiang Zhang,Richang Hong
机构: Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:
Abstract:WiFi-based human behavior recognition aims to recognize gestures and activities by analyzing wireless signal variations. However, existing methods typically focus on a single type of data, neglecting the interaction and fusion of multiple features. To this end, we propose a novel multimodal collaborative awareness method. By leveraging phase data reflecting changes in dynamic path length and Doppler Shift (DFS) data corresponding to frequency changes related to the speed of gesture movement, we enable efficient interaction and fusion of these features to improve recognition accuracy. Specifically, we first introduce a dual-branch self-attention module to capture spatial-temporal cues within each modality. Then, a group attention mechanism is applied to the concatenated phase and DFS features to mine key group features critical for behavior recognition. Through a gating mechanism, the combined features are further divided into PD-strengthen and PD-weaken branches, optimizing information entropy and promoting cross-modal collaborative awareness. Extensive in-domain and cross-domain experiments on two large publicly available datasets, Widar3.0 and XRF55, demonstrate the superior performance of our method.
zh
[CV-34] A2LC: Active and Automated Label Correction for Semantic Segmentation
【速读】:该论文试图解决语义分割中手动像素级标注的高成本和易出错问题,通过主动识别和修正错误标注数据来降低人工干预的需求。其解决方案的关键在于提出了一种名为A ^2 LC的新型高效主动与自动化标签修正框架,该框架将自动化修正阶段集成到传统流程中,利用标注者反馈对查询样本之外的数据进行标签修正,从而最大化成本效率;同时引入了自适应平衡的获取函数,以强调欠代表的尾部类别并补充自动化修正机制。
链接: https://arxiv.org/abs/2506.11599
作者: Youjin Jeon,Kyusik Cho,Suhan Woo,Euntai Kim
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint. Under review. 22 pages, 8 figures
Abstract:Active Label Correction (ALC) has emerged as a promising solution to the high cost and error-prone nature of manual pixel-wise annotation in semantic segmentation, by selectively identifying and correcting mislabeled data. Although recent work has improved correction efficiency by generating pseudo-labels using foundation models, substantial inefficiencies still remain. In this paper, we propose Active and Automated Label Correction for semantic segmentation (A ^2 LC), a novel and efficient ALC framework that integrates an automated correction stage into the conventional pipeline. Specifically, the automated correction stage leverages annotator feedback to perform label correction beyond the queried samples, thereby maximizing cost efficiency. In addition, we further introduce an adaptively balanced acquisition function that emphasizes underrepresented tail classes and complements the automated correction mechanism. Extensive experiments on Cityscapes and PASCAL VOC 2012 demonstrate that A ^2 LC significantly outperforms previous state-of-the-art methods. Notably, A ^2 LC achieves high efficiency by outperforming previous methods using only 20% of their budget, and demonstrates strong effectiveness by yielding a 27.23% performance improvement under an equivalent budget constraint on the Cityscapes dataset. The code will be released upon acceptance.
zh
[CV-35] EasyARC: Evaluating Vision Language Models on True Visual Reasoning CVPR2025
【速读】:该论文试图解决当前多模态基准测试中缺乏真正复杂的视觉与语言交互推理的问题,现有基准主要侧重于视觉提取与文本推理的结合,而未能充分体现视觉与语言之间的深层次推理能力。其解决方案的关键在于提出EasyARC,这是一个基于程序生成的、可完全验证且可扩展的视觉-语言基准,要求多图像、多步骤推理及自我修正,能够支持强化学习流水线,并通过渐进式难度级别实现对任务类型和复杂度的结构化评估。
链接: https://arxiv.org/abs/2506.11595
作者: Mert Unsal,Aylin Akkus
机构: ETH Zurich(苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: CVPR2025 Workshop on Test-time Scaling for Computer Vision
Abstract:Building on recent advances in language-based reasoning models, we explore multimodal reasoning that integrates vision and text. Existing multimodal benchmarks primarily test visual extraction combined with text-based reasoning, lacking true visual reasoning with more complex interactions between vision and language. Inspired by the ARC challenge, we introduce EasyARC, a vision-language benchmark requiring multi-image, multi-step reasoning, and self-correction. EasyARC is procedurally generated, fully verifiable, and scalable, making it ideal for reinforcement learning (RL) pipelines. The generators incorporate progressive difficulty levels, enabling structured evaluation across task types and complexities. We benchmark state-of-the-art vision-language models and analyze their failure modes. We argue that EasyARC sets a new standard for evaluating true reasoning and test-time scaling capabilities in vision-language models. We open-source our benchmark dataset and evaluation code.
zh
[CV-36] OV-MAP : Open-Vocabulary Zero-Shot 3D Instance Segmentation Map for Robots IROS2024
【速读】:该论文试图解决开放世界三维地图构建中由于相邻体素的重叠特征导致实例级精度下降的问题(instance-level precision degradation caused by overlapping features from adjacent voxels)。解决方案的关键在于采用一种与类别无关的分割模型将二维掩码投影到三维空间,并结合由点云中原始深度与合成深度融合生成的补充深度图像,同时引入三维掩码投票机制,从而实现无需依赖三维监督分割模型的准确零样本三维实例分割。
链接: https://arxiv.org/abs/2506.11585
作者: Juno Kim,Yesol Park,Hye-Jung Yoon,Byoung-Tak Zhang
机构: Seoul National University (首尔国立大学); Institute of Information & communications Technology Planning & Evaluation (信息通信技术规划评估研究所); NRF (韩国国家研究基金会); KEIT (韩国电子技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at IROS 2024
Abstract:We introduce OV-MAP, a novel approach to open-world 3D mapping for mobile robots by integrating open-features into 3D maps to enhance object recognition capabilities. A significant challenge arises when overlapping features from adjacent voxels reduce instance-level precision, as features spill over voxel boundaries, blending neighboring regions together. Our method overcomes this by employing a class-agnostic segmentation model to project 2D masks into 3D space, combined with a supplemented depth image created by merging raw and synthetic depth from point clouds. This approach, along with a 3D mask voting mechanism, enables accurate zero-shot 3D instance segmentation without relying on 3D supervised segmentation models. We assess the effectiveness of our method through comprehensive experiments on public datasets such as ScanNet200 and Replica, demonstrating superior zero-shot performance, robustness, and adaptability across diverse environments. Additionally, we conducted real-world experiments to demonstrate our method’s adaptability and robustness when applied to diverse real-world environments.
zh
[CV-37] Camera-based method for the detection of lifted truck axles using convolutional neural networks
【速读】:该论文试图解决在车辆控制与执法系统中,对具有抬升轴的车辆进行准确分类的问题,当前技术如称重行驶(Weigh-in-Motion, WIM)系统难以准确识别此类车辆,且缺乏有效的商业和技术检测方法。解决方案的关键是提出一种基于卷积神经网络(Convolutional Neural Network, CNN)的方法,具体采用YOLOv8s模型,用于检测由垂直于交通方向的摄像头拍摄的卡车图像中的抬升轴。
链接: https://arxiv.org/abs/2506.11574
作者: Bachir Tchana Tankeu(Cerema),Mohamed Bouteldja(Cerema),Nicolas Grignard(Cerema),Bernard Jacob
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The identification and classification of vehicles play a crucial role in various aspects of the control-sanction system. Current technologies such as weigh-in-motion (WIM) systems can classify most vehicle categories but they struggle to accurately classify vehicles with lifted axles. Moreover, very few commercial and technical methods exist for detecting lifted axles. In this paper, as part of the European project SETO (Smart Enforcement of Transport Operations), a method based on a convolutional neural network (CNN), namely YOLOv8s, was proposed for the detection of lifted truck axles in images of trucks captured by cameras placed perpendicular to the direction of traffic. The performance of the proposed method was assessed and it was found that it had a precision of 87%, a recall of 91.7%, and an inference time of 1.4 ms, which makes it well-suited for real time implantations. These results suggest that further improvements could be made, potentially by increasing the size of the datasets and/or by using various image augmentation methods.
zh
[CV-38] VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories?
【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理过程中对视觉信息的依赖程度及其推理结果与视觉内容之间的一致性问题,即评估模型推理的视觉忠实性(visual faithfulness)。解决方案的关键在于提出一种基于GPT-Image-1的提示驱动的自动可控编辑流程,实现了对图像中关键视觉线索的精确修改,并构建了VFaith-Bench基准测试平台,通过对比修改前后图像对应的问答对性能差异,量化分析模型推理能力与视觉感知之间的关系。
链接: https://arxiv.org/abs/2506.11571
作者: Jiachen Yu,Yufei Zhan,Ziheng Wu,Yousong Zhu,Jinqiao Wang,Minghui Qiu
机构: ByteDance China (字节跳动中国); Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所基础模型研究中心); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Peng Cheng Laboratory (鹏城实验室); Wuhan AI Research (武汉人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent extensive works have demonstrated that by introducing long CoT, the capabilities of MLLMs to solve complex problems can be effectively enhanced. However, the reasons for the effectiveness of such paradigms remain unclear. It is challenging to analysis with quantitative results how much the model’s specific extraction of visual cues and its subsequent so-called reasoning during inference process contribute to the performance improvements. Therefore, evaluating the faithfulness of MLLMs’ reasoning to visual information is crucial. To address this issue, we first present a cue-driven automatic and controllable editing pipeline with the help of GPT-Image-1. It enables the automatic and precise editing of specific visual cues based on the instruction. Furthermore, we introduce VFaith-Bench, the first benchmark to evaluate MLLMs’ visual reasoning capabilities and analyze the source of such capabilities with an emphasis on the visual faithfulness. Using the designed pipeline, we constructed comparative question-answer pairs by altering the visual cues in images that are crucial for solving the original reasoning problem, thereby changing the question’s answer. By testing similar questions with images that have different details, the average accuracy reflects the model’s visual reasoning ability, while the difference in accuracy before and after editing the test set images effectively reveals the relationship between the model’s reasoning ability and visual perception. We further designed specific metrics to expose this relationship. VFaith-Bench includes 755 entries divided into five distinct subsets, along with an additional human-labeled perception task. We conducted in-depth testing and analysis of existing mainstream flagship models and prominent open-source model series/reasoning models on VFaith-Bench, further investigating the underlying factors of their reasoning capabilities.
zh
[CV-39] EyeSim-VQA: A Free-Energy-Guided Eye Simulation Framework for Video Quality Assessment
【速读】:该论文旨在解决视频质量评估(Video Quality Assessment, VQA)中由于时间动态性和模型约束带来的挑战,特别是如何在保持模型稳定性的同时提升感知修复能力。其解决方案的关键在于提出EyeSimVQA框架,该框架采用基于自由能的自修复机制,结合双分支结构:一个用于全局感知评估的美学分支和一个用于细粒度结构与语义分析的技术分支,并通过定制化的增强模块对不同视觉输入进行适应性修复,同时引入生物启发的预测头以更好地融合全局与局部表征。
链接: https://arxiv.org/abs/2506.11549
作者: Zhaoyang Wang,Wen Lu,Jie Li,Lihuo He,Maoguo Gong,Xinbo Gao
机构: Xidian University (西安电子科技大学); Ministry of Education (教育部); Inner Mongolia Normal University (内蒙古师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: This work has been submitted to the IEEE TCSVT for possible publication
Abstract:Free-energy-guided self-repair mechanisms have shown promising results in image quality assessment (IQA), but remain under-explored in video quality assessment (VQA), where temporal dynamics and model constraints pose unique challenges. Unlike static images, video content exhibits richer spatiotemporal complexity, making perceptual restoration more difficult. Moreover, VQA systems often rely on pre-trained backbones, which limits the direct integration of enhancement modules without affecting model stability. To address these issues, we propose EyeSimVQA, a novel VQA framework that incorporates free-energy-based self-repair. It adopts a dual-branch architecture, with an aesthetic branch for global perceptual evaluation and a technical branch for fine-grained structural and semantic analysis. Each branch integrates specialized enhancement modules tailored to distinct visual inputs-resized full-frame images and patch-based fragments-to simulate adaptive repair behaviors. We also explore a principled strategy for incorporating high-level visual features without disrupting the original backbone. In addition, we design a biologically inspired prediction head that models sweeping gaze dynamics to better fuse global and local representations for quality prediction. Experiments on five public VQA benchmarks demonstrate that EyeSimVQA achieves competitive or superior performance compared to state-of-the-art methods, while offering improved interpretability through its biologically grounded design.
zh
[CV-40] Linearly Solving Robust Rotation Estimation
【速读】:该论文试图解决旋转估计(rotation estimation)问题,该问题在计算机视觉和机器人任务中具有基础性作用,尤其在安全关键型应用中需要极强的鲁棒性。传统方法将旋转估计视为非线性且非凸的优化问题,需精心设计。而本文提出新的视角,将旋转估计问题重新表述为线性模型拟合问题,无需放弃任何约束或引入奇点。其解决方案的关键在于揭示旋转运动的对偶结构,将其表示为四元数球面表面的一个大圆,并基于此提出一种易于理解的投票方法,该方法在噪声和异常值下表现出色,且可轻松利用图形处理单元(GPU)并行计算,从而在0.5秒内解决大规模(10^6)和严重损坏(99%异常值比例)的旋转估计问题。
链接: https://arxiv.org/abs/2506.11547
作者: Yinlong Liu,Tianyu Huang,Zhi-Xin Yang
机构: State Key Laboratory of Internet of Things for Smart City (SKL-IOTSC), University of Macau (澳门大学); Hong Kong Centre For Logistics Robotics, The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Systems and Control (eess.SY)
备注: 23 pages, 18 figures
Abstract:Rotation estimation plays a fundamental role in computer vision and robot tasks, and extremely robust rotation estimation is significantly useful for safety-critical applications. Typically, estimating a rotation is considered a non-linear and non-convex optimization problem that requires careful design. However, in this paper, we provide some new perspectives that solving a rotation estimation problem can be reformulated as solving a linear model fitting problem without dropping any constraints and without introducing any singularities. In addition, we explore the dual structure of a rotation motion, revealing that it can be represented as a great circle on a quaternion sphere surface. Accordingly, we propose an easily understandable voting-based method to solve rotation estimation. The proposed method exhibits exceptional robustness to noise and outliers and can be computed in parallel with graphics processing units (GPUs) effortlessly. Particularly, leveraging the power of GPUs, the proposed method can obtain a satisfactory rotation solution for large-scale( 10^6 ) and severely corrupted (99 % outlier ratio) rotation estimation problems under 0.5 seconds. Furthermore, to validate our theoretical framework and demonstrate the superiority of our proposed method, we conduct controlled experiments and real-world dataset experiments. These experiments provide compelling evidence supporting the effectiveness and robustness of our approach in solving rotation estimation problems.
zh
[CV-41] CGVQMD: Computer Graphics Video Quality Metric and Dataset
【速读】:该论文旨在解决合成内容和现代渲染伪影的视觉质量评估问题,现有视频和图像质量数据集主要关注自然视频和传统失真,而对合成内容的感知研究仍显不足。其解决方案的关键在于构建一个专注于先进渲染技术引入失真的视频质量数据集,并利用预训练3D卷积神经网络(3D CNN)的特征空间与人类视觉质量感知的高度一致性,提出一种全参考视频质量度量方法CGVQM,该方法在生成像素级误差图和全局质量评分方面均显著优于现有指标。
链接: https://arxiv.org/abs/2506.11546
作者: Akshay Jindal,Nabil Sadaka,Manu Mathew Thomas,Anton Sochenov,Anton Kaplanyan
机构: Intel Corporation(英特尔公司)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While existing video and image quality datasets have extensively studied natural videos and traditional distortions, the perception of synthetic content and modern rendering artifacts remains underexplored. We present a novel video quality dataset focused on distortions introduced by advanced rendering techniques, including neural supersampling, novel-view synthesis, path tracing, neural denoising, frame interpolation, and variable rate shading. Our evaluations show that existing full-reference quality metrics perform sub-optimally on these distortions, with a maximum Pearson correlation of 0.78. Additionally, we find that the feature space of pre-trained 3D CNNs aligns strongly with human perception of visual quality. We propose CGVQM, a full-reference video quality metric that significantly outperforms existing metrics while generating both per-pixel error maps and global quality scores. Our dataset and metric implementation is available at this https URL.
zh
[CV-42] Leverag ing Satellite Image Time Series for Accurate Extreme Event Detection WACV2025
【速读】:该论文旨在解决极端天气事件早期检测的问题,以提升灾害响应能力。其解决方案的关键在于提出SITS-Extreme框架,该框架通过融合多时相的卫星图像时间序列,有效过滤无关变化并提取与灾害相关的信号,从而实现更精确的极端事件检测。
链接: https://arxiv.org/abs/2506.11544
作者: Heng Fang,Hossein Azizpour
机构: KTH Royal Institute of Technology (皇家理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the WACV 2025 Workshop on GeoCV. Code, datasets, and model checkpoints available at: this https URL
Abstract:Climate change is leading to an increase in extreme weather events, causing significant environmental damage and loss of life. Early detection of such events is essential for improving disaster response. In this work, we propose SITS-Extreme, a novel framework that leverages satellite image time series to detect extreme events by incorporating multiple pre-disaster observations. This approach effectively filters out irrelevant changes while isolating disaster-relevant signals, enabling more accurate detection. Extensive experiments on both real-world and synthetic datasets validate the effectiveness of SITS-Extreme, demonstrating substantial improvements over widely used strong bi-temporal baselines. Additionally, we examine the impact of incorporating more timesteps, analyze the contribution of key components in our framework, and evaluate its performance across different disaster types, offering valuable insights into its scalability and applicability for large-scale disaster monitoring.
zh
[CV-43] FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation CVPR2025
【速读】:该论文旨在解决Vision Transformers(ViTs)在后训练量化(PTQ)过程中出现的显著精度下降问题,尤其是在低比特量化下的表现不佳。其解决方案的关键在于分析并改进传统的Hessian引导量化损失,通过建立KL散度与Fisher信息矩阵(FIM)之间的联系,实现量化损失的快速计算,并提出一种基于对角加低秩原则的高效FIM近似方法(DPLR-FIM),从而优化最终的量化损失函数。
链接: https://arxiv.org/abs/2506.11543
作者: Zhuguanyu Wu,Shihe Wang,Jiayi Zhang,Jiaxin Chen,Yunhong Wang
机构: Beihang University(北京航空航天大学); State Key Laboratory of Virtual Reality Technology and Systems(虚拟现实技术与系统国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR 2025 Highlight
Abstract:Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years, as it avoids computationally intensive model retraining. Nevertheless, current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization. To address these shortcomings, we analyze the prevailing Hessian-guided quantization loss, and uncover certain limitations of conventional Hessian approximations. By following the block-wise reconstruction framework, we propose a novel PTQ method for ViTs, dubbed FIMA-Q. Specifically, we firstly establish the connection between KL divergence and FIM, which enables fast computation of the quantization loss during reconstruction. We further propose an efficient FIM approximation method, namely DPLR-FIM, by employing the diagonal plus low-rank principle, and formulate the ultimate quantization loss. Our extensive experiments, conducted across various vision tasks with representative ViT-based architectures on public datasets, demonstrate that our method substantially promotes the accuracy compared to the state-of-the-art approaches, especially in the case of low-bit quantization. The source code is available at this https URL.
zh
[CV-44] GNSS-inertial state initialization by distance residuals
【速读】:该论文试图解决传感器平台初始化过程中由于初始测量信息有限导致的估计性能不佳问题,这可能在非线性优化中陷入局部极小值。其解决方案的关键在于提出一种新的GNSS-惯性初始化策略,该策略延迟使用全局GNSS测量,直到有足够的信息来准确估计GNSS与惯性坐标系之间的转换。相反,该方法最初依赖于GNSS相对距离残差,并通过Hessian矩阵奇异值的变化引入一个切换到全局测量的判定准则。
链接: https://arxiv.org/abs/2506.11534
作者: Samuel Cerezo,Javier Civera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 8 figures, RA-L submission
Abstract:Initializing the state of a sensorized platform can be challenging, as a limited set of initial measurements often carry limited information, leading to poor initial estimates that may converge to local minima during non-linear optimization. This paper proposes a novel GNSS-inertial initialization strategy that delays the use of global GNSS measurements until sufficient information is available to accurately estimate the transformation between the GNSS and inertial frames. Instead, the method initially relies on GNSS relative distance residuals. To determine the optimal moment for switching to global measurements, we introduce a criterion based on the evolution of the Hessian matrix singular values. Experiments on the EuRoC and GVINS datasets show that our approach consistently outperforms the naive strategy of using global GNSS data from the start, yielding more accurate and robust initializations.
zh
[CV-45] Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation
【速读】:该论文试图解决无监督域适应(Unsupervised Domain Adaptation, UDA)中因目标域视觉嵌入分布与预训练模型的视觉嵌入分布不一致而导致的伪标签误导问题。其解决方案的关键在于利用视觉与文本嵌入的几何结构,通过引入源域提示的参考预测以及基于最优传输理论的聚类策略,强化伪标签并促进目标域提示学习,从而提升目标域表示的质量和对齐效果。
链接: https://arxiv.org/abs/2506.11493
作者: Tung-Long Vuong,Hoang Phan,Vy Vo,Anh Bui,Thanh-Toan Do,Trung Le,Dinh Phung
机构: Monash University (莫纳什大学); New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent approaches leveraging multi-modal pre-trained models like CLIP for Unsupervised Domain Adaptation (UDA) have shown significant promise in bridging domain gaps and improving generalization by utilizing rich semantic knowledge and robust visual representations learned through extensive pre-training on diverse image-text datasets. While these methods achieve state-of-the-art performance across benchmarks, much of the improvement stems from base pseudo-labels (CLIP zero-shot predictions) and self-training mechanisms. Thus, the training mechanism exhibits a key limitation wherein the visual embedding distribution in target domains can deviate from the visual embedding distribution in the pre-trained model, leading to misguided signals from class descriptions. This work introduces a fresh solution to reinforce these pseudo-labels and facilitate target-prompt learning, by exploiting the geometry of visual and text embeddings - an aspect that is overlooked by existing methods. We first propose to directly leverage the reference predictions (from source prompts) based on the relationship between source and target visual embeddings. We later show that there is a strong clustering behavior observed between visual and text embeddings in pre-trained multi-modal models. Building on optimal transport theory, we transform this insight into a novel strategy to enforce the clustering property in text embeddings, further enhancing the alignment in the target domain. Our experiments and ablation studies validate the effectiveness of the proposed approach, demonstrating superior performance and improved quality of target prompts in terms of representation.
zh
[CV-46] Composite Data Augmentations for Synthetic Image Detection Against Real-World Perturbations
【速读】:该论文试图解决生成式 AI (Generative AI) 生成的合成图像在互联网上经过压缩和其他操作后,现有合成图像检测 (Synthetic Image Detection, SID) 方法检测效果不佳的问题。解决方案的关键在于探索数据增强组合,利用遗传算法选择最优增强策略,并引入双标准优化方法,从而显著提升模型在现实世界扰动下的性能。
链接: https://arxiv.org/abs/2506.11490
作者: Efthymia Amarantidou,Christos Koutlis,Symeon Papadopoulos,Panagiotis C. Petrantonakis
机构: Aristotle University of Thessaloniki (亚里士多德大学塞萨洛尼基分校); Centre for Research and Technology Hellas (希腊科技研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: EUSIPCO 2025 (33rd European Signal Processing Conference)
Abstract:The advent of accessible Generative AI tools enables anyone to create and spread synthetic images on social media, often with the intention to mislead, thus posing a significant threat to online information integrity. Most existing Synthetic Image Detection (SID) solutions struggle on generated images sourced from the Internet, as these are often altered by compression and other operations. To address this, our research enhances SID by exploring data augmentation combinations, leveraging a genetic algorithm for optimal augmentation selection, and introducing a dual-criteria optimization approach. These methods significantly improve model performance under real-world perturbations. Our findings provide valuable insights for developing detection models capable of identifying synthetic images across varying qualities and transformations, with the best-performing model achieving a mean average precision increase of +22.53% compared to models without augmentations. The implementation is available at this http URL.
zh
[CV-47] Environmental Change Detection: Toward a Practical Task of Scene Change Detection
【速读】:该论文试图解决传统场景变化检测(Scene Change Detection, SCD)在实际应用中面临的挑战,即参考图像通常与查询场景的视角不一致,而非完全匹配。为应对这一问题,论文提出了环境变化检测(Environmental Change Detection, ECD),其关键在于避免依赖于理想化的对齐查询-参考图像对,而是仅依靠环境线索进行变化检测。为此,研究者提出了一种新的框架,通过联合理解空间环境和检测变化,利用多个参考候选图像并聚合语义丰富的表示来克服视角不对齐和视场覆盖有限的问题。
链接: https://arxiv.org/abs/2506.11481
作者: Kyusik Cho,Suhan Woo,Hongje Seong,Euntai Kim
机构: Yonsei University (延世大学); University of Seoul (首尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Under review
Abstract:Humans do not memorize everything. Thus, humans recognize scene changes by exploring the past images. However, available past (i.e., reference) images typically represent nearby viewpoints of the present (i.e., query) scene, rather than the identical view. Despite this practical limitation, conventional Scene Change Detection (SCD) has been formalized under an idealized setting in which reference images with matching viewpoints are available for every query. In this paper, we push this problem toward a practical task and introduce Environmental Change Detection (ECD). A key aspect of ECD is to avoid unrealistically aligned query-reference pairs and rely solely on environmental cues. Inspired by real-world practices, we provide these cues through a large-scale database of uncurated images. To address this new task, we propose a novel framework that jointly understands spatial environments and detects changes. The main idea is that matching at the same spatial locations between a query and a reference may lead to a suboptimal solution due to viewpoint misalignment and limited field-of-view (FOV) coverage. We deal with this limitation by leveraging multiple reference candidates and aggregating semantically rich representations for change detection. We evaluate our framework on three standard benchmark sets reconstructed for ECD, and significantly outperform a naive combination of state-of-the-art methods while achieving comparable performance to the oracle setting. The code will be released upon acceptance.
zh
[CV-48] FAME: A Lightweight Spatio-Temporal Network for Model Attribution of Face-Swap Deepfakes
【速读】:该论文旨在解决深度伪造(Deepfake)视频中模型归属(model attribution)问题,即确定特定深度伪造视频是由哪种生成式AI(Generative AI)模型生成的。现有研究主要集中在二分类的深度伪造检测上,而模型归属任务仍处于探索阶段。论文提出的解决方案关键在于FAME(Fake Attribution via Multilevel Embeddings),该方法通过融合空间和时间注意力机制,有效捕捉不同人脸交换模型产生的细微生成痕迹,从而实现高效且准确的模型归属。
链接: https://arxiv.org/abs/2506.11477
作者: Wasim Ahmad,Yan-Tsung Peng,Yuan-Hao Chang
机构: Institute of Information Science, Academia Sinica (中央研究院資訊科學研究所); National Chengchi University (國立政治大學); National Taiwan University (國立台灣大學)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The widespread emergence of face-swap Deepfake videos poses growing risks to digital security, privacy, and media integrity, necessitating effective forensic tools for identifying the source of such manipulations. Although most prior research has focused primarily on binary Deepfake detection, the task of model attribution – determining which generative model produced a given Deepfake – remains underexplored. In this paper, we introduce FAME (Fake Attribution via Multilevel Embeddings), a lightweight and efficient spatio-temporal framework designed to capture subtle generative artifacts specific to different face-swap models. FAME integrates spatial and temporal attention mechanisms to improve attribution accuracy while remaining computationally efficient. We evaluate our model on three challenging and diverse datasets: Deepfake Detection and Manipulation (DFDM), FaceForensics++, and FakeAVCeleb. Results show that FAME consistently outperforms existing methods in both accuracy and runtime, highlighting its potential for deployment in real-world forensic and information security applications.
zh
[CV-49] On the Natural Robustness of Vision-Language Models Against Visual Perception Attacks in Autonomous Driving
【速读】:该论文试图解决自动驾驶车辆(AVs)中深度神经网络(DNNs)在面对对抗攻击时易发生误分类、影响安全的问题。传统防御机制如对抗训练往往导致正常准确率下降且无法有效应对未见过的攻击。论文提出的解决方案是引入针对车辆感知任务优化的车辆视觉语言模型(V2LMs),其关键在于通过微调视觉-语言模型,使其在无需对抗训练的情况下展现出对未见过攻击的更强鲁棒性,从而在对抗条件下保持显著高于传统DNN的准确率。
链接: https://arxiv.org/abs/2506.11472
作者: Pedram MohajerAnsari(1),Amir Salarpour(1),Michael Kühr(2),Siyu Huang(1),Mohammad Hamad(2),Sebastian Steinhorst(2),Habeeb Olufowobi(3),Mert D. Pesé(1) ((1) Clemson University, Clemson, SC, USA, (2) Technical University of Munich, Munich, Germany, (3) University of Texas at Arlington, Arlington, TX, USA)
机构: Clemson University (克莱姆森大学); Technical University of Munich (慕尼黑工业大学); University of Texas at Arlington (德克萨斯大学阿灵顿分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Autonomous vehicles (AVs) rely on deep neural networks (DNNs) for critical tasks such as traffic sign recognition (TSR), automated lane centering (ALC), and vehicle detection (VD). However, these models are vulnerable to attacks that can cause misclassifications and compromise safety. Traditional defense mechanisms, including adversarial training, often degrade benign accuracy and fail to generalize against unseen attacks. In this work, we introduce Vehicle Vision Language Models (V2LMs), fine-tuned vision-language models specialized for AV perception. Our findings demonstrate that V2LMs inherently exhibit superior robustness against unseen attacks without requiring adversarial training, maintaining significantly higher accuracy than conventional DNNs under adversarial conditions. We evaluate two deployment strategies: Solo Mode, where individual V2LMs handle specific perception tasks, and Tandem Mode, where a single unified V2LM is fine-tuned for multiple tasks simultaneously. Experimental results reveal that DNNs suffer performance drops of 33% to 46% under attacks, whereas V2LMs maintain adversarial accuracy with reductions of less than 8% on average. The Tandem Mode further offers a memory-efficient alternative while achieving comparable robustness to Solo Mode. We also explore integrating V2LMs as parallel components to AV perception to enhance resilience against adversarial threats. Our results suggest that V2LMs offer a promising path toward more secure and resilient AV perception systems.
zh
[CV-50] RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer ICML2025
【速读】:该论文试图解决多模态学习中由于模态质量差异导致的动态融合策略失效问题,具体表现为当前广泛使用的自注意力模型在面对不同数据特征时表现出对某一模态的偏好,进而引发自我强化循环,加剧模态间注意力键分布的差距,削弱了注意力机制的动态适应能力。解决方案的关键在于提出一种简单而有效的方法——滚动查询(RollingQ),通过旋转查询来打破自我强化循环,平衡注意力分配,从而恢复注意力机制的动态适应性。
链接: https://arxiv.org/abs/2506.11465
作者: Haotian Ni,Yake Wei,Hang Liu,Gong Chen,Chong Peng,Hao Lin,Di Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2025
Abstract:Multimodal learning faces challenges in effectively fusing information from diverse modalities, especially when modality quality varies across samples. Dynamic fusion strategies, such as attention mechanism in Transformers, aim to address such challenge by adaptively emphasizing modalities based on the characteristics of input data. However, through amounts of carefully designed experiments, we surprisingly observed that the dynamic adaptability of widely-used self-attention models diminishes. Model tends to prefer one modality regardless of data characteristics. This bias triggers a self-reinforcing cycle that progressively overemphasizes the favored modality, widening the distribution gap in attention keys across modalities and deactivating attention mechanism’s dynamic properties. To revive adaptability, we propose a simple yet effective method Rolling Query (RollingQ), which balances attention allocation by rotating the query to break the self-reinforcing cycle and mitigate the key distribution gap. Extensive experiments on various multimodal scenarios validate the effectiveness of RollingQ and the restoration of cooperation dynamics is pivotal for enhancing the broader capabilities of widely deployed multimodal Transformers. The source code is available at this https URL.
zh
[CV-51] GaussMarker: Robust Dual-Domain Watermark for Diffusion Models ICML2025
【速读】:该论文旨在解决扩散模型(Diffusion Models, DM)生成图像后可能出现的版权和滥用问题,提出了一种新的水印方案以提升其鲁棒性。现有方法将水印嵌入初始高斯噪声的单一领域,但存在鲁棒性不足的问题。本文的关键解决方案是提出一种双域水印方法,利用流水线注入器在空间域和频域中同时嵌入水印,并引入一种与模型无关的可学习高斯噪声恢复器(Gaussian Noise Restorer, GNR),通过优化被篡改图像中的高斯噪声来提升检测鲁棒性。该方法在多种图像失真和高级攻击下均表现出优越的性能。
链接: https://arxiv.org/abs/2506.11444
作者: Kecen Li,Zhicong Huang,Xinwen Hou,Cheng Hong
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICML 2025
Abstract:As Diffusion Models (DM) generate increasingly realistic images, related issues such as copyright and misuse have become a growing concern. Watermarking is one of the promising solutions. Existing methods inject the watermark into the single-domain of initial Gaussian noise for generation, which suffers from unsatisfactory robustness. This paper presents the first dual-domain DM watermarking approach using a pipelined injector to consistently embed watermarks in both the spatial and frequency domains. To further boost robustness against certain image manipulations and advanced attacks, we introduce a model-independent learnable Gaussian Noise Restorer (GNR) to refine Gaussian noise extracted from manipulated images and enhance detection robustness by integrating the detection scores of both watermarks. GaussMarker efficiently achieves state-of-the-art performance under eight image distortions and four advanced attacks across three versions of Stable Diffusion with better recall and lower false positive rates, as preferred in real applications.
zh
[CV-52] Uncertainty Awareness Enables Efficient Labeling for Cancer Subtyping in Digital Pathology
【速读】:该论文旨在解决癌症亚型分类模型在训练过程中对大量专家标注数据的依赖问题,从而提高模型的精度和效率。其解决方案的关键在于引入不确定性感知机制,通过在每个训练周期计算证据向量来评估模型对其预测的置信度,并利用由此得出的不确定性评分选择最需要进一步标注的重要图像,从而实现迭代优化的训练过程。这种方法仅需1-10%的策略性标注即可达到最先进的分类性能,有效减少了对大规模标注数据集的依赖。
链接: https://arxiv.org/abs/2506.11439
作者: Nirhoshan Sivaroopan,Chamuditha Jayanga Galappaththige,Chalani Ekanayake,Hasindri Watawana,Ranga Rodrigo,Chamira U. S. Edussooriya,Dushan N. Wadduwage
机构: University of Moratuwa (莫拉图瓦大学); Harvard University (哈佛大学); Old Dominion University (老道明大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Machine-learning-assisted cancer subtyping is a promising avenue in digital pathology. Cancer subtyping models, however, require careful training using expert annotations so that they can be inferred with a degree of known certainty (or uncertainty). To this end, we introduce the concept of uncertainty awareness into a self-supervised contrastive learning model. This is achieved by computing an evidence vector at every epoch, which assesses the model’s confidence in its predictions. The derived uncertainty score is then utilized as a metric to selectively label the most crucial images that require further annotation, thus iteratively refining the training process. With just 1-10% of strategically selected annotations, we attain state-of-the-art performance in cancer subtyping on benchmark datasets. Our method not only strategically guides the annotation process to minimize the need for extensive labeled datasets, but also improves the precision and efficiency of classifications. This development is particularly beneficial in settings where the availability of labeled data is limited, offering a promising direction for future research and application in digital pathology.
zh
[CV-53] AViS: Text-bridged Audio-Visual Segmentation with Foundation Models
【速读】:该论文旨在解决音频-视觉分割(Audio-Visual Segmentation, AVS)中跨模态对齐的核心挑战,即如何有效融合音频和视觉模态的信息。其解决方案的关键在于提出TAViS框架,该框架通过耦合多模态基础模型(ImageBind)的跨模态对齐能力与分割基础模型(SAM2)的精确分割能力,实现对多模态数据的有效处理。为克服两个关键挑战——即SAM2与ImageBind之间因特征空间差异导致的知识迁移困难,以及仅依赖分割损失监督的不足,作者引入了基于文本的桥梁设计,包括文本桥接的混合提示机制和对齐监督策略,从而提升了模型在多种数据集和零样本设置下的性能。
链接: https://arxiv.org/abs/2506.11436
作者: Ziyang Luo,Nian Liu,Xuguang Yang,Salman Khan,Rao Muhammad Anwer,Hisham Cholakkal,Fahad Shahbaz Khan,Junwei Han
机构: Northwestern Polytechnical University (西北工业大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Audio-Visual Segmentation (AVS) faces a fundamental challenge of effectively aligning audio and visual modalities. While recent approaches leverage foundation models to address data scarcity, they often rely on single-modality knowledge or combine foundation models in an off-the-shelf manner, failing to address the cross-modal alignment challenge. In this paper, we present TAViS, a novel framework that \textbfcouples the knowledge of multimodal foundation models (ImageBind) for cross-modal alignment and a segmentation foundation model (SAM2) for precise segmentation. However, effectively combining these models poses two key challenges: the difficulty in transferring the knowledge between SAM2 and ImageBind due to their different feature spaces, and the insufficiency of using only segmentation loss for supervision. To address these challenges, we introduce a text-bridged design with two key components: (1) a text-bridged hybrid prompting mechanism where pseudo text provides class prototype information while retaining modality-specific details from both audio and visual inputs, and (2) an alignment supervision strategy that leverages text as a bridge to align shared semantic concepts within audio-visual modalities. Our approach achieves superior performance on single-source, multi-source, semantic datasets, and excels in zero-shot settings.
zh
[CV-54] Auditing Data Provenance in Real-world Text-to-Image Diffusion Models for Privacy and Copyright Protection
【速读】:该论文试图解决文本到图像扩散模型在数据溯源审计中的挑战,特别是针对现有方法依赖于获取模型内部知识(如中间结果)或评估不可靠的问题。其解决方案的关键在于提出一种完全黑盒的审计框架,称为基于特征语义一致性的审计(Feature Semantic Consistency-based Auditing, FSCA),该框架利用文本到图像扩散模型内部的两种语义关联进行审计,无需访问模型内部信息,从而提高了审计的可行性和可靠性。
链接: https://arxiv.org/abs/2506.11434
作者: Jie Zhu,Leye Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review; A user-level accuracy of 90% in a real-world auditing scenario
Abstract:Text-to-image diffusion model since its propose has significantly influenced the content creation due to its impressive generation capability. However, this capability depends on large-scale text-image datasets gathered from web platforms like social media, posing substantial challenges in copyright compliance and personal privacy leakage. Though there are some efforts devoted to explore approaches for auditing data provenance in text-to-image diffusion models, existing work has unrealistic assumptions that can obtain model internal knowledge, e.g., intermediate results, or the evaluation is not reliable. To fill this gap, we propose a completely black-box auditing framework called Feature Semantic Consistency-based Auditing (FSCA). It utilizes two types of semantic connections within the text-to-image diffusion model for auditing, eliminating the need for access to internal knowledge. To demonstrate the effectiveness of our FSCA framework, we perform extensive experiments on LAION-mi dataset and COCO dataset, and compare with eight state-of-the-art baseline approaches. The results show that FSCA surpasses previous baseline approaches across various metrics and different data distributions, showcasing the superiority of our FSCA. Moreover, we introduce a recall balance strategy and a threshold adjustment strategy, which collectively allows FSCA to reach up a user-level accuracy of 90% in a real-world auditing scenario with only 10 samples/user, highlighting its strong auditing potential in real-world applications. Our code is made available at this https URL.
zh
[CV-55] Auto-Connect: Connectivity-Preserving RigFormer with Direct Preference Optimization
【速读】:该论文旨在解决自动绑定(automatic rigging)中骨骼连通性(skeletal connectivity)难以准确保持的问题。传统方法要么通过预测两个关节来表示骨的位置,要么先预测点再确定连通性,难以保证拓扑结构的准确性。其解决方案的关键在于提出一种显式保留连通性的分词方案(connectivity-preserving tokenization scheme),通过特殊标记定义每个关节子节点和层级的端点,从而自动化地建立连通关系,并将连通性信息直接整合到预测框架中,显著提升了拓扑准确性。
链接: https://arxiv.org/abs/2506.11430
作者: Jingfeng Guo,Jian Liu,Jinnan Chen,Shiwei Mao,Changrong Hu,Puhua Jiang,Junlin Yu,Jing Xu,Qi Liu,Lixin Xu,Zhuo Chen,Chunchao Guo
机构: South China University of Technology (华南理工大学); Hong Kong University of Science and Technology (香港科技大学); National University of Singapore (新加坡国立大学); Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院); University of Science and Technology of China (中国科学技术大学); Beijing Normal University (北京师范大学); Tencent Hunyuan (腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce Auto-Connect, a novel approach for automatic rigging that explicitly preserves skeletal connectivity through a connectivity-preserving tokenization scheme. Unlike previous methods that predict bone positions represented as two joints or first predict points before determining connectivity, our method employs special tokens to define endpoints for each joint’s children and for each hierarchical layer, effectively automating connectivity relationships. This approach significantly enhances topological accuracy by integrating connectivity information directly into the prediction framework. To further guarantee high-quality topology, we implement a topology-aware reward function that quantifies topological correctness, which is then utilized in a post-training phase through reward-guided Direct Preference Optimization. Additionally, we incorporate implicit geodesic features for latent top-k bone selection, which substantially improves skinning quality. By leveraging geodesic distance information within the model’s latent space, our approach intelligently determines the most influential bones for each vertex, effectively mitigating common skinning artifacts. This combination of connectivity-preserving tokenization, reward-guided fine-tuning, and geodesic-aware bone selection enables our model to consistently generate more anatomically plausible skeletal structures with superior deformation properties.
zh
[CV-56] Stop learning it all to mitigate visual hallucination Focus on the hallucination target CVPR2025
【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉-语言任务中产生的幻觉问题,即模型会生成输入图像中不存在的物体信息,从而影响模型在需要准确目标识别的实际应用中的可靠性。解决方案的关键在于提出一种偏好学习方法(preference learning approach),通过聚焦于幻觉发生的目标区域,构建包含幻觉响应、正确响应及目标信息的数据集,并在这些特定目标上应用偏好学习方法,使模型能够过滤无关信号并专注于修正幻觉,从而生成更符合事实的响应。实验结果表明,该方法有效减少了多个视觉幻觉任务中的幻觉现象,提升了MLLMs的可靠性和性能。
链接: https://arxiv.org/abs/2506.11417
作者: Dokyoon Yoon,Youngsook Song,Woomyong Park
机构: SIONIC AI(SIONIC AI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2025
Abstract:Multimodal Large Language Models (MLLMs) frequently suffer from hallucination issues, generating information about objects that are not present in input images during vision-language tasks. These hallucinations particularly undermine model reliability in practical applications requiring accurate object identification. To address this challenge, we propose \mymethod,\ a preference learning approach that mitigates hallucinations by focusing on targeted areas where they occur. To implement this, we build a dataset containing hallucinated responses, correct responses, and target information (i.e., objects present in the images and the corresponding chunk positions in responses affected by hallucinations). By applying a preference learning method restricted to these specific targets, the model can filter out irrelevant signals and focus on correcting hallucinations. This allows the model to produce more factual responses by concentrating solely on relevant information. Experimental results demonstrate that \mymethod\ effectively reduces hallucinations across multiple vision hallucination tasks, improving the reliability and performance of MLLMs without diminishing overall performance.
zh
[CV-57] Dynamic Double Space Tower
【速读】:该论文旨在解决视觉问答(Visual Question Answering, VQA)任务中模型在处理复杂推理场景时的不足,尤其是由于跨模态交互不足和无法有效捕捉图像中实体的空间关系所导致的问题。其解决方案的关键在于提出一种动态双向空间塔结构,该结构依据人类格式塔视觉原理分为四层,用于观察图像,从而为实体之间的空间组织提供强大的结构先验,使模型能够从“看到图像”转变为“感知并组织图像内容”,提升对空间关系的理解与推理能力。
链接: https://arxiv.org/abs/2506.11394
作者: Weikai Sun,Shijie Song,Han Wang
机构: 华为云计算(Huawei Cloud Computing)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The Visual Question Answering (VQA) task requires the simultaneous understanding of image content and question semantics. However, existing methods often have difficulty handling complex reasoning scenarios due to insufficient cross-modal interaction and capturing the entity spatial relationships in the image.\citehuang2023adaptive\citeliu2021comparing\citeguibas2021adaptive\citezhang2022vsaWe studied a brand-new approach to replace the attention mechanism in order to enhance the reasoning ability of the model and its understanding of spatial this http URL, we propose a dynamic bidirectional spatial tower, which is divided into four layers to observe the image according to the principle of human gestalt vision. This naturally provides a powerful structural prior for the spatial organization between entities, enabling the model to no longer blindly search for relationships between pixels but make judgments based on more meaningful perceptual units. Change from “seeing images” to “perceiving and organizing image content”.A large number of experiments have shown that our module can be used in any other multimodal model and achieve advanced results, demonstrating its potential in spatial relationship this http URL, the multimodal visual question-answering model July trained by our method has achieved state-of-the-art results with only 3B parameters, especially on the question-answering dataset of spatial relations.
zh
[CV-58] Control Architecture and Design for a Multi-robotic Visual Servoing System in Automated Manufacturing Environment
【速读】:该论文试图解决微尺度制造中由于环境不确定性(如测量噪声、模型不准确、关节柔顺性等)导致的机器人定位精度不足问题,以及视觉伺服中相机位置对图像估计质量的影响问题。解决方案的关键在于设计一种多机器人控制系统,通过模拟紧固与松卸过程来显著降低过程中的各种不确定性,同时提出一种新的相机移动策略算法,以探索相机的工作空间并找到图像噪声水平最低的最优观测位置。
链接: https://arxiv.org/abs/2506.11387
作者: Rongfei Li
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注: 272 pages, 171 figures, PhD dissertation, University of California, Davis, 2025. To be published in ProQuest ETD
Abstract:The use of robotic technology has drastically increased in manufacturing in the 21st century. But by utilizing their sensory cues, humans still outperform machines, especially in micro scale manufacturing, which requires high-precision robot manipulators. These sensory cues naturally compensate for high levels of uncertainties that exist in the manufacturing environment. Uncertainties in performing manufacturing tasks may come from measurement noise, model inaccuracy, joint compliance (e.g., elasticity), etc. Although advanced metrology sensors and high precision microprocessors, which are utilized in modern robots, have compensated for many structural and dynamic errors in robot positioning, a well-designed control algorithm still works as a comparable and cheaper alternative to reduce uncertainties in automated manufacturing. Our work illustrates that a multi-robot control system that simulates the positioning process for fastening and unfastening applications can reduce various uncertainties, which may occur in this process, to a great extent. In addition, most research papers in visual servoing mainly focus on developing control and observation architectures in various scenarios, but few have discussed the importance of the camera’s location in the configuration. In a manufacturing environment, the quality of camera estimations may vary significantly from one observation location to another, as the combined effects of environmental conditions result in different noise levels of a single image shot at different locations. Therefore, in this paper, we also propose a novel algorithm for the camera’s moving policy so that it explores the camera workspace and searches for the optimal location where the image noise level is minimized.
zh
[CV-59] Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation ACL2025
【速读】:该论文试图解决生成高质量文本-图像任务计划(text-image plan)中的两个主要挑战:确保两种模态之间的对齐一致性和保持视觉步骤间的连贯性。其解决方案的关键在于提出一种新颖的分步生成与优化框架,该框架通过迭代过程依次完成文本步骤的草稿生成、视觉步骤的编辑、PDDL-like视觉信息的提取以及基于提取信息的文本与视觉步骤优化,从而逐步提升文本-图像计划的质量。
链接: https://arxiv.org/abs/2506.11380
作者: Xiaoxin Lu,Ranran Haoran Zhang,Yusen Zhang,Rui Zhang
机构: The Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 10 figures; Accepted to ACL 2025 Findings
Abstract:People get informed of a daily task plan through diverse media involving both texts and images. However, most prior research only focuses on LLM’s capability of textual plan generation. The potential of large-scale models in providing text-image plans remains understudied. Generating high-quality text-image plans faces two main challenges: ensuring consistent alignment between two modalities and keeping coherence among visual steps. To address these challenges, we propose a novel framework that generates and refines text-image plans step-by-step. At each iteration, our framework (1) drafts the next textual step based on the prediction history; (2) edits the last visual step to obtain the next one; (3) extracts PDDL-like visual information; and (4) refines the draft with the extracted visual information. The textual and visual step produced in stage (4) and (2) will then serve as inputs for the next iteration. Our approach offers a plug-and-play improvement to various backbone models, such as Mistral-7B, Gemini-1.5, and GPT-4o. To evaluate the effectiveness of our approach, we collect a new benchmark consisting of 1,100 tasks and their text-image pair solutions covering 11 daily topics. We also design and validate a new set of metrics to evaluate the multimodal consistency and coherence in text-image plans. Extensive experiment results show the effectiveness of our approach on a range of backbone models against competitive baselines. Our code and data are available at this https URL.
zh
[CV-60] Scalable Context-Preserving Model-Aware Deep Clustering for Hyperspectral Images
【速读】:该论文旨在解决高光谱图像(Hyperspectral Images, HSIs)在无监督分析中面临的计算复杂度高、仅考虑局部或非局部结构约束以及结构约束无法有效指导整个聚类过程的问题。其解决方案的关键在于提出一种基于基表示的可扩展、上下文保持的深度聚类方法,该方法通过联合捕捉局部和非局部结构,实现高效的HSI聚类。具体而言,引入了空间平滑性约束以保留局部结构,采用基于小簇的方案以增强非局部结构,并将这两种约束联合优化,从而在整个聚类过程中协同作用,显著降低了时间与空间复杂度至O(n),使其适用于大规模HSI数据。
链接: https://arxiv.org/abs/2506.11377
作者: Xianlu Li,Nicolas Nadisic,Shaoguang Huang,Nikos Deligiannis,Aleksandra Pižurica
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Subspace clustering has become widely adopted for the unsupervised analysis of hyperspectral images (HSIs). Recent model-aware deep subspace clustering methods often use a two-stage framework, involving the calculation of a self-representation matrix with complexity of O(n^2), followed by spectral clustering. However, these methods are computationally intensive, generally incorporating solely either local or non-local spatial structure constraints, and their structural constraints fall short of effectively supervising the entire clustering process. We propose a scalable, context-preserving deep clustering method based on basis representation, which jointly captures local and non-local structures for efficient HSI clustering. To preserve local structure (i.e., spatial continuity within subspaces), we introduce a spatial smoothness constraint that aligns clustering predictions with their spatially filtered versions. For non-local structure (i.e., spectral continuity), we employ a mini-cluster-based scheme that refines predictions at the group level, encouraging spectrally similar pixels to belong to the same subspace. Notably, these two constraints are jointly optimized to reinforce each other. Specifically, our model is designed as an one-stage approach in which the structural constraints are applied to the entire clustering process. The time and space complexity of our method is O(n), making it applicable to large-scale HSI data. Experiments on real-world datasets show that our method outperforms state-of-the-art techniques. Our code is available at: this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2506.11377 [cs.CV] (or arXiv:2506.11377v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2506.11377 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-61] A Watermark for Auto-Regressive Image Generation Models
【速读】:该论文旨在解决图像生成模型中由于“重分词不匹配”(retokenization mismatch)导致的传统统计水印技术失效的问题,从而实现对生成图像的可靠真实性验证。其解决方案的关键在于提出一种名为C-reweight的新颖无失真水印方法,该方法通过基于聚类的策略,将同一簇内的标记视为等价,从而缓解重分词不匹配问题,同时保持图像保真度。
链接: https://arxiv.org/abs/2506.11371
作者: Yihan Wu,Xuehao Cui,Ruibo Chen,Georgios Milis,Heng Huang
机构: University of Maryland, College Park(马里兰大学学院市分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report
Abstract:The rapid evolution of image generation models has revolutionized visual content creation, enabling the synthesis of highly realistic and contextually accurate images for diverse applications. However, the potential for misuse, such as deepfake generation, image based phishing attacks, and fabrication of misleading visual evidence, underscores the need for robust authenticity verification mechanisms. While traditional statistical watermarking techniques have proven effective for autoregressive language models, their direct adaptation to image generation models encounters significant challenges due to a phenomenon we term retokenization mismatch, a disparity between original and retokenized sequences during the image generation process. To overcome this limitation, we propose C-reweight, a novel, distortion-free watermarking method explicitly designed for image generation models. By leveraging a clustering-based strategy that treats tokens within the same cluster equivalently, C-reweight mitigates retokenization mismatch while preserving image fidelity. Extensive evaluations on leading image generation platforms reveal that C-reweight not only maintains the visual quality of generated images but also improves detectability over existing distortion-free watermarking techniques, setting a new standard for secure and trustworthy image synthesis.
zh
[CV-62] GynSurg: A Comprehensive Gynecology Laparoscopic Surgery Dataset
【速读】:该论文旨在解决妇科腹腔镜手术中现有数据集在规模、任务覆盖面及标注详细程度方面的局限性,从而阻碍了对完整、端到端手术流程的分析。解决方案的关键在于引入GynSurg,这是目前最大且最多样化的多任务妇科腹腔镜手术数据集,提供了丰富的多任务标注,支持动作识别、语义分割、手术文档记录以及新型手术过程洞察的发现。
链接: https://arxiv.org/abs/2506.11356
作者: Sahar Nasirihaghighi,Negin Ghamsarian,Leonie Peschek,Matteo Munari,Heinrich Husslein,Raphael Sznitman,Klaus Schoeffmann
机构: University of Klagenfurt(克恩滕大学); University of Bern(伯尔尼大学); Medical University of Vienna(维也纳医科大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in deep learning have transformed computer-assisted intervention and surgical video analysis, driving improvements not only in surgical training, intraoperative decision support, and patient outcomes, but also in postoperative documentation and surgical discovery. Central to these developments is the availability of large, high-quality annotated datasets. In gynecologic laparoscopy, surgical scene understanding and action recognition are fundamental for building intelligent systems that assist surgeons during operations and provide deeper analysis after surgery. However, existing datasets are often limited by small scale, narrow task focus, or insufficiently detailed annotations, limiting their utility for comprehensive, end-to-end workflow analysis. To address these limitations, we introduce GynSurg, the largest and most diverse multi-task dataset for gynecologic laparoscopic surgery to date. GynSurg provides rich annotations across multiple tasks, supporting applications in action recognition, semantic segmentation, surgical documentation, and discovery of novel procedural insights. We demonstrate the dataset quality and versatility by benchmarking state-of-the-art models under a standardized training protocol. To accelerate progress in the field, we publicly release the GynSurg dataset and its annotations
zh
[CV-63] HyBiomass: Global Hyperspectral Imagery Benchmark Dataset for Evaluating Geospatial Foundation Models in Forest Aboveground Biomass Estimation
【速读】:该论文旨在解决当前对地理空间基础模型(Geo-FMs)进行全面评估的不足,现有基准数据集多局限于分类或分割任务,并且仅覆盖特定地理区域。其解决方案的关键在于引入一个全球分布的森林地上生物量(AGB)估算数据集,该数据集结合了环境制图与分析计划(EnMAP)卫星的同址高光谱影像(HSI)和全球生态系统动态调查激光雷达提供的AGB密度预测,覆盖七个大陆区域。通过该数据集,研究验证了Geo-FMs在像素级回归任务中的性能,特别是在微调编码器后可达到或超越基线U-Net的效果,并揭示了数据集规模和视觉Transformer主干中token patch大小对预测精度的重要性。
链接: https://arxiv.org/abs/2506.11314
作者: Aaron Banze,Timothée Stassin,Nassim Ait Ali Braham,Rıdvan Salih Kuzu,Simon Besnard,Michael Schmitt
机构: German Aerospace Center (DLR); Helmholtz German Research Centre for Geosciences (GFZ); University of the Bundeswehr Munich; Technical University of Munich (TUM)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Comprehensive evaluation of geospatial foundation models (Geo-FMs) requires benchmarking across diverse tasks, sensors, and geographic regions. However, most existing benchmark datasets are limited to segmentation or classification tasks, and focus on specific geographic areas. To address this gap, we introduce a globally distributed dataset for forest aboveground biomass (AGB) estimation, a pixel-wise regression task. This benchmark dataset combines co-located hyperspectral imagery (HSI) from the Environmental Mapping and Analysis Program (EnMAP) satellite and predictions of AGB density estimates derived from the Global Ecosystem Dynamics Investigation lidars, covering seven continental regions. Our experimental results on this dataset demonstrate that the evaluated Geo-FMs can match or, in some cases, surpass the performance of a baseline U-Net, especially when fine-tuning the encoder. We also find that the performance difference between the U-Net and Geo-FMs depends on the dataset size for each region and highlight the importance of the token patch size in the Vision Transformer backbone for accurate predictions in pixel-wise regression tasks. By releasing this globally distributed hyperspectral benchmark dataset, we aim to facilitate the development and evaluation of Geo-FMs for HSI applications. Leveraging this dataset additionally enables research into geographic bias and generalization capacity of Geo-FMs. The dataset and source code will be made publicly available.
zh
[CV-64] ARDIS STRIDE: A Spatio-Temporal Road Image Dataset for Exploration and Autonomy
【速读】:该论文试图解决真实世界环境建模中动态变化的时空特性所带来的挑战,特别是如何有效捕捉和模拟环境中的空间与时间组合动态。其解决方案的关键在于引入了一个名为STRIDE(Spatio-Temporal Road Image Dataset for Exploration)的数据集,该数据集通过将360度全景图像转换为丰富的相互关联的观察、状态和动作节点,以结构化方式表征环境。在此基础上,论文提出了TARDIS,一种基于Transformer的生成式世界模型,通过统一的自回归框架整合空间与时间动态,从而实现对复杂环境的有效建模与代理行为的生成。
链接: https://arxiv.org/abs/2506.11302
作者: Héctor Carrión,Yutong Bai,Víctor A. Hernández Castro,Kishan Panaganti,Ayush Zenith,Matthew Trang,Tony Zhang,Pietro Perona,Jitendra Malik
机构: Tera AI; UC Santa Cruz; UC Berkeley; California Institute of Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Computer Vision, Pattern Recognition, LLMs, Dataset, Data Augmentation
Abstract:World models aim to simulate environments and enable effective agent behavior. However, modeling real-world environments presents unique challenges as they dynamically change across both space and, crucially, time. To capture these composed dynamics, we introduce a Spatio-Temporal Road Image Dataset for Exploration (STRIDE) permuting 360-degree panoramic imagery into rich interconnected observation, state and action nodes. Leveraging this structure, we can simultaneously model the relationship between egocentric views, positional coordinates, and movement commands across both space and time. We benchmark this dataset via TARDIS, a transformer-based generative world model that integrates spatial and temporal dynamics through a unified autoregressive framework trained on STRIDE. We demonstrate robust performance across a range of agentic tasks such as controllable photorealistic image synthesis, instruction following, autonomous self-control, and state-of-the-art georeferencing. These results suggest a promising direction towards sophisticated generalist agents–capable of understanding and manipulating the spatial and temporal aspects of their material environments–with enhanced embodied reasoning capabilities. Training code, datasets, and model checkpoints are made available at this https URL.
zh
[CV-65] Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation
【速读】:该论文旨在解决机器人操作中在未见过的物体、环境和由多样化语言指令指定的任务之间泛化能力不足的问题。其解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的 grounded vision-language planning 模型——Gondola,该模型通过多视角图像和历史计划生成包含目标物体和位置的文本与分割掩码交错的下一步动作计划,从而提升视觉环境中的语义接地能力与任务泛化性能。
链接: https://arxiv.org/abs/2506.11261
作者: Shizhe Chen,Ricardo Garcia,Paul Pacaud,Cordelia Schmid
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Robotic manipulation faces a significant challenge in generalizing across unseen objects, environments and tasks specified by diverse language instructions. To improve generalization capabilities, recent research has incorporated large language models (LLMs) for planning and action execution. While promising, these methods often fall short in generating grounded plans in visual environments. Although efforts have been made to perform visual instructional tuning on LLMs for robotic manipulation, existing methods are typically constrained by single-view image input and struggle with precise object grounding. In this work, we introduce Gondola, a novel grounded vision-language planning model based on LLMs for generalizable robotic manipulation. Gondola takes multi-view images and history plans to produce the next action plan with interleaved texts and segmentation masks of target objects and locations. To support the training of Gondola, we construct three types of datasets using the RLBench simulator, namely robot grounded planning, multi-view referring expression and pseudo long-horizon task datasets. Gondola outperforms the state-of-the-art LLM-based method across all four generalization levels of the GemBench dataset, including novel placements, rigid objects, articulated objects and long-horizon tasks.
zh
[CV-66] Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models
【速读】:该论文试图解决传统机器遗忘(machine unlearning)方法在基础模型(Foundation Models, FMs)中难以满足多样化遗忘请求的问题,尤其是当数据所有者或监管机构希望移除模型中特定知识或能力时,无法直接访问模型的大量训练数据。解决方案的关键在于将数据追踪式的机器遗忘提升为知识追踪式的机器遗忘(knowledge-tracing machine unlearning),即通过追踪模型所掌握的知识或能力而非具体的数据点来实现遗忘,这一方法更符合人类记忆遗忘的机制,并且能够更好地适应实际应用场景中的需求。
链接: https://arxiv.org/abs/2506.11253
作者: Yuwen Tan,Boqing Gong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 21 pages, 3 figures
Abstract:Machine unlearning removes certain training data points and their influence on AI models (e.g., when a data owner revokes their decision to allow models to learn from the data). In this position paper, we propose to lift data-tracing machine unlearning to knowledge-tracing for foundation models (FMs). We support this position based on practical needs and insights from cognitive studies. Practically, tracing data cannot meet the diverse unlearning requests for FMs, which may be from regulators, enterprise users, product teams, etc., having no access to FMs’ massive training data. Instead, it is convenient for these parties to issue an unlearning request about the knowledge or capability FMs (should not) possess. Cognitively, knowledge-tracing unlearning aligns with how the human brain forgets more closely than tracing individual training data points. Finally, we provide a concrete case study about a vision-language FM to illustrate how an unlearner might instantiate the knowledge-tracing machine unlearning paradigm.
zh
[CV-67] Anti-Aliased 2D Gaussian Splatting
【速读】:该论文旨在解决2D Gaussian Splatting (2DGS) 在不同采样率下渲染时出现的严重混叠伪影问题,这一问题限制了其在需要相机缩放或变化视场的应用场景中的实际使用。解决方案的关键在于提出AA-2DGS,通过引入世界空间平滑核来约束2D高斯基元的频率内容,以消除缩放时的高频伪影,并通过推导一种新的物体空间Mip滤波器,实现高效的抗混叠处理。
链接: https://arxiv.org/abs/2506.11252
作者: Mae Younes,Adnane Boukhayma
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Code will be available at this https URL
Abstract:2D Gaussian Splatting (2DGS) has recently emerged as a promising method for novel view synthesis and surface reconstruction, offering better view-consistency and geometric accuracy than volumetric 3DGS. However, 2DGS suffers from severe aliasing artifacts when rendering at different sampling rates than those used during training, limiting its practical applications in scenarios requiring camera zoom or varying fields of view. We identify that these artifacts stem from two key limitations: the lack of frequency constraints in the representation and an ineffective screen-space clamping approach. To address these issues, we present AA-2DGS, an antialiased formulation of 2D Gaussian Splatting that maintains its geometric benefits while significantly enhancing rendering quality across different scales. Our method introduces a world space flat smoothing kernel that constrains the frequency content of 2D Gaussian primitives based on the maximal sampling frequency from training views, effectively eliminating high-frequency artifacts when zooming in. Additionally, we derive a novel object space Mip filter by leveraging an affine approximation of the ray-splat intersection mapping, which allows us to efficiently apply proper anti-aliasing directly in the local space of each splat.
zh
[CV-68] Enhanced Vehicle Speed Detection Considering Lane Recognition Using Drone Videos in California
【速读】:该论文旨在解决因交通系统不足和测速摄像头稀疏导致的加利福尼亚州车辆数量增加所带来的有效车速检测问题,特别是针对车道内车辆速度的精准检测与分类。其解决方案的关键在于引入了一个微调的YOLOv11模型,该模型在近800张鸟瞰视角图像上进行训练,以提高车速检测的准确性,并能够识别车辆所在车道,同时将车辆分为汽车和重型车辆两类,从而满足交通监控与管理的特定需求。
链接: https://arxiv.org/abs/2506.11239
作者: Amirali Ataee Naeini,Ashkan Teymouri,Ghazaleh Jafarsalehi,Michael Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 7 pages
Abstract:The increase in vehicle numbers in California, driven by inadequate transportation systems and sparse speed cameras, necessitates effective vehicle speed detection. Detecting vehicle speeds per lane is critical for monitoring High-Occupancy Vehicle (HOV) lane speeds, distinguishing between cars and heavy vehicles with differing speed limits, and enforcing lane restrictions for heavy vehicles. While prior works utilized YOLO (You Only Look Once) for vehicle speed detection, they often lacked accuracy, failed to identify vehicle lanes, and offered limited or less practical classification categories. This study introduces a fine-tuned YOLOv11 model, trained on almost 800 bird’s-eye view images, to enhance vehicle speed detection accuracy which is much higher compare to the previous works. The proposed system identifies the lane for each vehicle and classifies vehicles into two categories: cars and heavy vehicles. Designed to meet the specific requirements of traffic monitoring and regulation, the model also evaluates the effects of factors such as drone height, distance of Region of Interest (ROI), and vehicle speed on detection accuracy and speed measurement. Drone footage collected from Northern California was used to assess the proposed system. The fine-tuned YOLOv11 achieved its best performance with a mean absolute error (MAE) of 0.97 mph and mean squared error (MSE) of 0.94 \textmph^2 , demonstrating its efficacy in addressing challenges in vehicle speed detection and classification.
zh
[CV-69] Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving
【速读】:该论文旨在解决长尾驾驶场景下端到端自动驾驶的挑战,即在复杂、罕见且难以预测的交通环境中实现可靠和泛化的自主驾驶能力。其解决方案的关键在于采用两阶段训练策略:首先通过自监督的视觉-语言-轨迹(VLT)预训练获得强大的基础驾驶能力,利用83小时的CoVLA常规驾驶数据和11小时的Waymo长尾驾驶数据进行训练;其次通过基于组相对策略优化(GRPO)的强化学习微调,仅使用少量(少于500帧)带偏好标签的Waymo验证集数据进一步提升性能。实验结果表明,VLT预训练和强化学习微调均对提升长尾场景下的驾驶表现至关重要。
链接: https://arxiv.org/abs/2506.11234
作者: Luke Rowe,Rodrigue de Schaetzen,Roger Girgis,Christopher Pal,Liam Paull
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present Poutine, a 3B-parameter vision-language model (VLM) tailored for end-to-end autonomous driving in long-tail driving scenarios. Poutine is trained in two stages. To obtain strong base driving capabilities, we train Poutine-Base in a self-supervised vision-language-trajectory (VLT) next-token prediction fashion on 83 hours of CoVLA nominal driving and 11 hours of Waymo long-tail driving. Accompanying language annotations are auto-generated with a 72B-parameter VLM. Poutine is obtained by fine-tuning Poutine-Base with Group Relative Policy Optimization (GRPO) using less than 500 preference-labeled frames from the Waymo validation set. We show that both VLT pretraining and RL fine-tuning are critical to attain strong driving performance in the long-tail. Poutine-Base achieves a rater-feedback score (RFS) of 8.12 on the validation set, nearly matching Waymo’s expert ground-truth RFS. The final Poutine model achieves an RFS of 7.99 on the official Waymo test set, placing 1st in the 2025 Waymo Vision-Based End-to-End Driving Challenge by a significant margin. These results highlight the promise of scalable VLT pre-training and lightweight RL fine-tuning to enable robust and generalizable autonomy.
zh
[CV-70] BrainMAP: Multimodal Graph Learning For Efficient Brain Disease Localization
【速读】:该论文旨在解决现有基于图学习的神经退行性疾病检测方法在定位和提取全连接组中驱动神经退行性病理的具体脑区方面能力不足的问题,以及多模态脑图模型计算复杂度高、限制其在资源受限设备中应用的问题。其解决方案的关键在于提出BrainMAP框架,该框架通过基于AAL图谱的过滤方法精确定位关键脑子图,从而显著降低计算开销;同时采用包含跨节点注意力机制和自适应门控机制的多模态融合过程,实现fMRI与DTI数据的动态对齐与整合。
链接: https://arxiv.org/abs/2506.11178
作者: Nguyen Linh Dan Le,Jing Ren,Ciyuan Peng,Chengyao Xie,Bowen Li,Feng Xia
机构: RMIT University, Melbourne, Australia; Federation University Australia, Ballarat, Australia
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 6 pages, 5 figures
Abstract:Recent years have seen a surge in research focused on leveraging graph learning techniques to detect neurodegenerative diseases. However, existing graph-based approaches typically lack the ability to localize and extract the specific brain regions driving neurodegenerative pathology within the full connectome. Additionally, recent works on multimodal brain graph models often suffer from high computational complexity, limiting their practical use in resource-constrained devices. In this study, we present BrainMAP, a novel multimodal graph learning framework designed for precise and computationally efficient identification of brain regions affected by neurodegenerative diseases. First, BrainMAP utilizes an atlas-driven filtering approach guided by the AAL atlas to pinpoint and extract critical brain subgraphs. Unlike recent state-of-the-art methods, which model the entire brain network, BrainMAP achieves more than 50% reduction in computational overhead by concentrating on disease-relevant subgraphs. Second, we employ an advanced multimodal fusion process comprising cross-node attention to align functional magnetic resonance imaging (fMRI) and diffusion tensor imaging (DTI) data, coupled with an adaptive gating mechanism to blend and integrate these modalities dynamically. Experimental results demonstrate that BrainMAP outperforms state-of-the-art methods in computational efficiency, without compromising predictive accuracy.
zh
[CV-71] aching in adverse scenes: a statistically feedback-driven threshold and mask adjustment teacher-student framework for object detection in UAV images under adverse scenes
【速读】:该论文旨在解决在恶劣场景下无人机(Unmanned Aerial Vehicle, UAV)目标检测中由于域间隙导致的性能下降问题,尤其是针对现有无监督域适应(Unsupervised Domain Adaptation, UDA)方法在复杂或不良条件下的UAV图像上表现不佳的问题。其解决方案的关键在于提出了一种名为统计反馈驱动的阈值与掩码调整师生框架(Statistical Feedback-Driven Threshold and Mask Adjustment Teacher-Student Framework, SF-TMAT)的新方法,其中包含动态步进反馈掩码调整自编码器(Dynamic Step Feedback Mask Adjustment Autoencoder, DSFMA)和方差反馈平滑阈值(Variance Feedback Smoothing Threshold, VFST)策略,通过动态调整掩码比例、学习焦点以及伪标签选择阈值,提升特征对齐效果和伪标签质量,从而有效缓解域偏差并增强模型在恶劣场景下的泛化能力。
链接: https://arxiv.org/abs/2506.11175
作者: Hongyu Chen,Jiping Liu,Yong Wang,Jun Zhu,Dejun Feng,Yakun Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The manuscript has been accepted by ISPRS Journal of Photogrammetry and Remote Sensing
Abstract:Unsupervised Domain Adaptation (UDA) has shown promise in effectively alleviating the performance degradation caused by domain gaps between source and target domains, and it can potentially be generalized to UAV object detection in adverse scenes. However, existing UDA studies are based on natural images or clear UAV imagery, and research focused on UAV imagery in adverse conditions is still in its infancy. Moreover, due to the unique perspective of UAVs and the interference from adverse conditions, these methods often fail to accurately align features and are influenced by limited or noisy pseudo-labels. To address this, we propose the first benchmark for UAV object detection in adverse scenes, the Statistical Feedback-Driven Threshold and Mask Adjustment Teacher-Student Framework (SF-TMAT). Specifically, SF-TMAT introduces a design called Dynamic Step Feedback Mask Adjustment Autoencoder (DSFMA), which dynamically adjusts the mask ratio and reconstructs feature maps by integrating training progress and loss feedback. This approach dynamically adjusts the learning focus at different training stages to meet the model’s needs for learning features at varying levels of granularity. Additionally, we propose a unique Variance Feedback Smoothing Threshold (VFST) strategy, which statistically computes the mean confidence of each class and dynamically adjusts the selection threshold by incorporating a variance penalty term. This strategy improves the quality of pseudo-labels and uncovers potentially valid labels, thus mitigating domain bias. Extensive experiments demonstrate the superiority and generalization capability of the proposed SF-TMAT in UAV object detection under adverse scene conditions. The Code is released at this https URL .
zh
[CV-72] WaveFormer: A Lightweight Transformer Model for sEMG-based Gesture Recognition
【速读】:该论文旨在解决表面肌电信号(sEMG)手势识别中相似手势因肌肉信号相近而导致分类准确率下降的问题,以及传统深度学习模型参数量大、计算成本高,难以在资源受限的嵌入式系统上部署的问题。其解决方案的关键在于提出一种轻量级的Transformer架构——WaveFormer,通过引入可学习的小波变换融合时域和频域特征,并采用多级小波分解层与深度可分离卷积相结合的WaveletConv模块,实现了高效且紧凑的特征提取,最终在保持高分类准确率的同时显著降低了模型复杂度。
链接: https://arxiv.org/abs/2506.11168
作者: Yanlong Chen,Mattia Orlandi,Pierangelo Maria Rapa,Simone Benatti,Luca Benini,Yawei Li
机构: ETH Zurich (苏黎世联邦理工学院); University of Bologna (博洛尼亚大学); University of Modena and Reggio Emilia (摩德纳和雷焦艾米利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, submitted to IEEE EMBS Conference on Neural Engineering (NER)
Abstract:Human-machine interaction, particularly in prosthetic and robotic control, has seen progress with gesture recognition via surface electromyographic (sEMG) this http URL, classifying similar gestures that produce nearly identical muscle signals remains a challenge, often reducing classification accuracy. Traditional deep learning models for sEMG gesture recognition are large and computationally expensive, limiting their deployment on resource-constrained embedded systems. In this work, we propose WaveFormer, a lightweight transformer-based architecture tailored for sEMG gesture recognition. Our model integrates time-domain and frequency-domain features through a novel learnable wavelet transform, enhancing feature extraction. In particular, the WaveletConv module, a multi-level wavelet decomposition layer with depthwise separable convolution, ensures both efficiency and compactness. With just 3.1 million parameters, WaveFormer achieves 95% classification accuracy on the EPN612 dataset, outperforming larger models. Furthermore, when profiled on a laptop equipped with an Intel CPU, INT8 quantization achieves real-time deployment with a 6.75 ms inference latency.
zh
[CV-73] owards a general-purpose foundation model for fMRI analysis
【速读】:该论文旨在解决功能性磁共振成像(fMRI)分析中存在的一致性和可迁移性问题,这些问题主要源于复杂的预处理流程和任务特定的模型。其解决方案的关键在于提出一种名为NeuroSTORM的通用框架,该框架直接从4D fMRI体积中学习,并通过使用Mamba主干网络和时序扫描策略高效处理全4D数据,结合空间-时间优化的预训练方法和任务特定的提示调优,实现了跨多种应用的高效知识迁移。
链接: https://arxiv.org/abs/2506.11167
作者: Cheng Wang,Yu Jiang,Zhihao Peng,Chenxin Li,Changbae Bang,Lin Zhao,Jinglei Lv,Jorge Sepulcre,Carl Yang,Lifang He,Tianming Liu,Daniel Barron,Quanzheng Li,Randy Hirschtick,Byung-Hoon Kim,Xiang Li,Yixuan Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Functional Magnetic Resonance Imaging (fMRI) is essential for studying brain function and diagnosing neurological disorders, but current analysis methods face reproducibility and transferability issues due to complex pre-processing and task-specific models. We introduce NeuroSTORM (Neuroimaging Foundation Model with Spatial-Temporal Optimized Representation Modeling), a generalizable framework that directly learns from 4D fMRI volumes and enables efficient knowledge transfer across diverse applications. NeuroSTORM is pre-trained on 28.65 million fMRI frames (9,000 hours) from over 50,000 subjects across multiple centers and ages 5 to 100. Using a Mamba backbone and a shifted scanning strategy, it efficiently processes full 4D volumes. We also propose a spatial-temporal optimized pre-training approach and task-specific prompt tuning to improve transferability. NeuroSTORM outperforms existing methods across five tasks: age/gender prediction, phenotype prediction, disease diagnosis, fMRI-to-image retrieval, and task-based fMRI classification. It demonstrates strong clinical utility on datasets from hospitals in the U.S., South Korea, and Australia, achieving top performance in disease diagnosis and cognitive phenotype prediction. NeuroSTORM provides a standardized, open-source foundation model to improve reproducibility and transferability in fMRI-based clinical research.
zh
[CV-74] st-Time-Scaling for Zero-Shot Diagnosis with Visual-Language Reasoning
【速读】:该论文试图解决在医学影像中应用大型语言模型(Large Language Models, LLMs)进行基于推理的诊断时,由于数据有限和标注成本高昂导致的监督微调不切实际的问题。其解决方案的关键在于引入一种零样本框架,通过测试时缩放(test-time scaling)增强LLMs在临床环境中的推理能力。该框架首先利用视觉-语言模型对医学图像和文本提示生成多种视觉特征描述,随后通过测试时缩放策略将多个候选输出整合为可靠的最终诊断,从而提升诊断准确性和可靠性。
链接: https://arxiv.org/abs/2506.11166
作者: Ji Young Byun,Young-Jin Park,Navid Azizan,Rama Chellappa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:As a cornerstone of patient care, clinical decision-making significantly influences patient outcomes and can be enhanced by large language models (LLMs). Although LLMs have demonstrated remarkable performance, their application to visual question answering in medical imaging, particularly for reasoning-based diagnosis, remains largely unexplored. Furthermore, supervised fine-tuning for reasoning tasks is largely impractical due to limited data availability and high annotation costs. In this work, we introduce a zero-shot framework for reliable medical image diagnosis that enhances the reasoning capabilities of LLMs in clinical settings through test-time scaling. Given a medical image and a textual prompt, a vision-language model processes a medical image along with a corresponding textual prompt to generate multiple descriptions or interpretations of visual features. These interpretations are then fed to an LLM, where a test-time scaling strategy consolidates multiple candidate outputs into a reliable final diagnosis. We evaluate our approach across various medical imaging modalities – including radiology, ophthalmology, and histopathology – and demonstrate that the proposed test-time scaling strategy enhances diagnostic accuracy for both our and baseline methods. Additionally, we provide an empirical analysis showing that the proposed approach, which allows unbiased prompting in the first stage, improves the reliability of LLM-generated diagnoses and enhances classification accuracy.
zh
[CV-75] Evaluating BiLSTM and CNNGRU Approaches for Human Activity Recognition Using WiFi CSI Data
【速读】:该论文旨在解决基于WiFi信道状态信息(Channel State Information, CSI)的人类活动识别(Human Activity Recognition, HAR)问题,通过比较双向长短期记忆网络(BiLSTM)和卷积神经网络结合门控循环单元(CNN+GRU)两种深度学习模型的性能。研究的关键在于根据数据集的特性选择合适的模型结构:CNN+GRU在UT-HAR数据集上表现更优,因其能够有效提取空间特征;而BiLSTM在高分辨率的NTU-Fi HAR数据集上表现更佳,因其在捕捉长期时间依赖性方面更具优势。
链接: https://arxiv.org/abs/2506.11165
作者: Almustapha A. Wakili,Babajide J. Asaju,Woosub Jung
机构: Towson University (陶森大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This Paper has been Accepted and will appear in the 23rd IEEE/ACIS International Conference on Software Engineering, Management and Applications (SERA 2025)
Abstract:This paper compares the performance of BiLSTM and CNN+GRU deep learning models for Human Activity Recognition (HAR) on two WiFi-based Channel State Information (CSI) datasets: UT-HAR and NTU-Fi HAR. The findings indicate that the CNN+GRU model has a higher accuracy on the UT-HAR dataset (95.20%) thanks to its ability to extract spatial features. In contrast, the BiLSTM model performs better on the high-resolution NTU-Fi HAR dataset (92.05%) by extracting long-term temporal dependencies more effectively. The findings strongly emphasize the critical role of dataset characteristics and preprocessing techniques in model performance improvement. We also show the real-world applicability of such models in applications like healthcare and intelligent home systems, highlighting their potential for unobtrusive activity recognition.
zh
[CV-76] Synthetic Geology – Structural Geology Meets Deep Learning STOC
【速读】:该论文试图解决如何可视化地球地表以下数公里范围的地下结构这一长期挑战,该问题限制了众多重要应用的发展。解决方案的关键在于利用生成式人工智能(Generative AI)技术,通过训练神经网络将地表地质数据与钻孔数据扩展至三维地下区域。为克服地下数据不足的问题,研究者设计了一种合成数据生成过程,模拟地质活动如沉积压实、火山侵入和构造动力学,以生成几乎无限的近地壳样本。基于此类合成数据训练的基础模型能够从新的地表地形和地质图生成高保真度的三维地下图像,展示了其在刻画岩层、断层、褶皱、岩墙和岩席等结构方面的潜力。
链接: https://arxiv.org/abs/2506.11164
作者: Simon Ghyselincks,Valeriia Okhmak,Stefano Zampini,George Turkiyyah,David Keyes,Eldad Haber
机构: University of British Columbia(不列颠哥伦比亚大学); King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 8 figures, submitted to “Communications Earth Environment”, geological simulation code at this https URL , generative AI code at this https URL
Abstract:Visualizing the first few kilometers of the Earth’s subsurface, a long-standing challenge gating a virtually inexhaustible list of important applications, is coming within reach through deep learning. Building on techniques of generative artificial intelligence applied to voxelated images, we demonstrate a method that extends surface geological data supplemented by boreholes to a three-dimensional subsurface region by training a neural network. The Earth’s land area having been extensively mapped for geological features, the bottleneck of this or any related technique is the availability of data below the surface. We close this data gap in the development of subsurface deep learning by designing a synthetic data-generator process that mimics eons of geological activity such as sediment compaction, volcanic intrusion, and tectonic dynamics to produce a virtually limitless number of samples of the near lithosphere. A foundation model trained on such synthetic data is able to generate a 3D image of the subsurface from a previously unseen map of surface topography and geology, showing increasing fidelity with increasing access to borehole data, depicting such structures as layers, faults, folds, dikes, and sills. We illustrate the early promise of the combination of a synthetic lithospheric generator with a trained neural network model using generative flow matching. Ultimately, such models will be fine-tuned on data from applicable campaigns, such as mineral prospecting in a given region. Though useful in itself, a regionally fine-tuned models may be employed not as an end but as a means: as an AI-based regularizer in a more traditional inverse problem application, in which the objective function represents the mismatch of additional data with physical models with applications in resource exploration, hazard assessment, and geotechnical engineering.
zh
[CV-77] VIBE: Can a VLM Read the Room?
【速读】:该论文试图解决视觉语言模型(Vision Language Models, VLMs)在社会推理方面的局限性,特别是其在处理非语言线索以理解社会情境中的社会语用推理能力不足的问题。论文提出的关键解决方案是引入一个新的任务——视觉社会语用推理(Visual Social-Pragmatic Inference),并通过构建高质量的数据集来评估和基准测试VLMs在该任务上的表现。
链接: https://arxiv.org/abs/2506.11162
作者: Tania Chakraborty,Eylon Caplan,Dan Goldwasser
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Pre-print, under review
Abstract:Understanding human social behavior such as recognizing emotions and the social dynamics causing them is an important and challenging problem. While LLMs have made remarkable advances, they are limited to the textual domain and cannot account for the major role that non-verbal cues play in understanding social situations. Vision Language Models (VLMs) can potentially account for this gap, however their ability to make correct inferences over such social cues has received little attention. In this paper, we explore the capabilities of VLMs at social reasoning. We identify a previously overlooked limitation in VLMs: the Visual Social-Pragmatic Inference gap. To target this gap, we propose a new task for VLMs: Visual Social-Pragmatic Inference. We construct a high quality dataset to test the abilities of a VLM for this task and benchmark the performance of several VLMs on it.
zh
[CV-78] Digitization of Document and Information Extraction using OCR
【速读】:该论文试图解决从文档中准确提取信息的问题,特别是在处理扫描图像与原生数字格式相结合的情况下。解决方案的关键在于将光学字符识别(OCR)技术与大型语言模型(LLMs)相结合,通过OCR引擎处理扫描文件,利用布局感知库解析数字文件,并借助LLM对提取的原始文本进行分析,以识别关键-值对并解决歧义,从而实现结构化输出并增强上下文理解和置信度指标。
链接: https://arxiv.org/abs/2506.11156
作者: Rasha Sinha,Rekha B S
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:
Abstract:Retrieving accurate details from documents is a crucial task, especially when handling a combination of scanned images and native digital formats. This document presents a combined framework for text extraction that merges Optical Character Recognition (OCR) techniques with Large Language Models (LLMs) to deliver structured outputs enriched by contextual understanding and confidence indicators. Scanned files are processed using OCR engines, while digital files are interpreted through layout-aware libraries. The extracted raw text is subsequently analyzed by an LLM to identify key-value pairs and resolve ambiguities. A comparative analysis of different OCR tools is presented to evaluate their effectiveness concerning accuracy, layout recognition, and processing speed. The approach demonstrates significant improvements over traditional rule-based and template-based methods, offering enhanced flexibility and semantic precision across different document categories
zh
[CV-79] Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search ACL2025
【速读】:该论文旨在解决现有视频描述生成基准和评估协议在关键点生成不足或单一、数据创建成本高昂以及评估范围有限等方面存在的问题。其解决方案的关键在于提出一种自动框架AutoCaption,该框架利用蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)以迭代方式生成大量且多样的描述性句子(即关键点),从而全面表征视频内容,提升视频细节的描述能力。
链接: https://arxiv.org/abs/2506.11155
作者: Linhao Yu,Xinguang Ji,Yahui Liu,Fanheng Kong,Chenxi Sun,Jingyuan Zhang,Hongzhi Zhang,V. W.,Fuzheng Zhang,Deyi Xiong
机构: TJUNLP Lab, College of Intelligence and Computing, Tianjin University (天津大学智能与计算学院); Independent Researcher
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages; ACL 2025(main)
Abstract:Video captioning can be used to assess the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, existing benchmarks and evaluation protocols suffer from crucial issues, such as inadequate or homogeneous creation of key points, exorbitant cost of data creation, and limited evaluation scopes. To address these issues, we propose an automatic framework, named AutoCaption, which leverages Monte Carlo Tree Search (MCTS) to construct numerous and diverse descriptive sentences (\textiti.e., key points) that thoroughly represent video content in an iterative way. This iterative captioning strategy enables the continuous enhancement of video details such as actions, objects’ attributes, environment details, etc. We apply AutoCaption to curate MCTS-VCB, a fine-grained video caption benchmark covering video details, thereby enabling a comprehensive evaluation of MLLMs on the video captioning task. We evaluate more than 20 open- and closed-source MLLMs of varying sizes on MCTS-VCB. Results show that MCTS-VCB can effectively and comprehensively evaluate the video captioning capability, with Gemini-1.5-Pro achieving the highest F1 score of 71.2. Interestingly, we fine-tune InternVL2.5-8B with the AutoCaption-generated data, which helps the model achieve an overall improvement of 25.0% on MCTS-VCB and 16.3% on DREAM-1K, further demonstrating the effectiveness of AutoCaption. The code and data are available at this https URL.
zh
[CV-80] SLRNet: A Real-Time LSTM-Based Sign Language Recognition System
【速读】:该论文旨在解决手语识别(Sign Language Recognition, SLR)问题,以帮助听障人群与社会更好地沟通。其解决方案的关键在于提出SLRNet,一个基于MediaPipe Holistic和长短期记忆网络(LSTM)的实时网络摄像头手语识别系统,能够处理视频流并识别美国手语(ASL)的字母和功能词,从而实现了无需特定硬件的包容性手势识别。
链接: https://arxiv.org/abs/2506.11154
作者: Sharvari Kamble
机构: University of Mumbai(孟买大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures, includes experimental results. Code available at: this https URL
Abstract:Sign Language Recognition (SLR) plays a crucial role in bridging the communication gap between the hearing-impaired community and society. This paper introduces SLRNet, a real-time webcam-based ASL recognition system using MediaPipe Holistic and Long Short-Term Memory (LSTM) networks. The model processes video streams to recognize both ASL alphabet letters and functional words. With a validation accuracy of 86.7%, SLRNet demonstrates the feasibility of inclusive, hardware-independent gesture recognition.
zh
[CV-81] Self-Calibrating BCIs: Ranking and Recovery of Mental Targets Without Labels
【速读】:该论文试图解决从配对的脑电(EEG)和图像数据中恢复参与者心中所想的未知心理目标(mental target)的问题,且在没有标签信息的情况下进行。传统方法依赖于有标签的数据,而本文提出了一种无需标签数据或预训练解码器的框架和算法——CURSOR,其关键在于通过自校准机制学习恢复未知的心理目标,并利用预测的图像相似性分数对刺激进行排序以及生成与目标难以区分的新刺激。
链接: https://arxiv.org/abs/2506.11151
作者: Jonathan Grizou,Carlos de la Torre-Ortiz,Tuukka Ruotsalo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 10 pages, 4 figures, 11 appendix pages, 7 appendix figures
Abstract:We consider the problem of recovering a mental target (e.g., an image of a face) that a participant has in mind from paired EEG (i.e., brain responses) and image (i.e., perceived faces) data collected during interactive sessions without access to labeled information. The problem has been previously explored with labeled data but not via self-calibration, where labeled data is unavailable. Here, we present the first framework and an algorithm, CURSOR, that learns to recover unknown mental targets without access to labeled data or pre-trained decoders. Our experiments on naturalistic images of faces demonstrate that CURSOR can (1) predict image similarity scores that correlate with human perceptual judgments without any label information, (2) use these scores to rank stimuli against an unknown mental target, and (3) generate new stimuli indistinguishable from the unknown mental target (validated via a user study, N=53).
zh
[CV-82] LLM -to-Phy3D: Physically Conform Online 3D Object Generation with LLM s
【速读】:该论文旨在解决现有生成式AI(Generative AI)和大语言模型(Large Language Models, LLMs)在物理人工智能(Physical AI)领域中生成的3D对象缺乏物理可行性的问题。传统LLM-to-3D模型由于缺乏物理知识,生成的3D对象往往无法满足现实世界的物理约束。为了解决这一问题,论文提出了LLM-to-Phy3D,其关键在于引入了一种在线黑箱优化循环,通过视觉与物理评估的协同作用,对生成过程进行迭代优化,从而提升生成对象的物理性能和几何新颖性。
链接: https://arxiv.org/abs/2506.11148
作者: Melvin Wong,Yueming Lyu,Thiago Rios,Stefan Menzel,Yew-Soon Ong
机构: Nanyang Technological University (南洋理工大学); Agency for Science, Technology and Research (科技研究局); Honda Research Institute Europe (本田研究机构欧洲分部)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The emergence of generative artificial intelligence (GenAI) and large language models (LLMs) has revolutionized the landscape of digital content creation in different modalities. However, its potential use in Physical AI for engineering design, where the production of physically viable artifacts is paramount, remains vastly underexplored. The absence of physical knowledge in existing LLM-to-3D models often results in outputs detached from real-world physical constraints. To address this gap, we introduce LLM-to-Phy3D, a physically conform online 3D object generation that enables existing LLM-to-3D models to produce physically conforming 3D objects on the fly. LLM-to-Phy3D introduces a novel online black-box refinement loop that empowers large language models (LLMs) through synergistic visual and physics-based evaluations. By delivering directional feedback in an iterative refinement process, LLM-to-Phy3D actively drives the discovery of prompts that yield 3D artifacts with enhanced physical performance and greater geometric novelty relative to reference objects, marking a substantial contribution to AI-driven generative design. Systematic evaluations of LLM-to-Phy3D, supported by ablation studies in vehicle design optimization, reveal various LLM improvements gained by 4.5% to 106.7% in producing physically conform target domain 3D designs over conventional LLM-to-3D models. The encouraging results suggest the potential general use of LLM-to-Phy3D in Physical AI for scientific and engineering applications.
zh
[CV-83] 3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks
【速读】:该论文旨在解决医学视觉问答(Med-VQA)在三维医学影像分析中的任务多样性不足及模型泛化能力有限的问题。其解决方案的关键在于构建一个大规模的三维放射科CT扫描数据集——3D-RAD,该数据集包含六种多样化的视觉问答任务,并引入了复杂的推理挑战,如计算任务和多阶段时间分析,以推动三维医学视觉问答的研究。此外,研究者还提供了高质量的训练集3D-RAD-T,表明在该数据集上进行微调可以显著提升模型性能。
链接: https://arxiv.org/abs/2506.11147
作者: Xiaotang Gai,Jiaxiang Liu,Yichen Li,Zijie Meng,Jian Wu,Zuozhu Liu
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical Visual Question Answering (Med-VQA) holds significant potential for clinical decision support, yet existing efforts primarily focus on 2D imaging with limited task diversity. This paper presents 3D-RAD, a large-scale dataset designed to advance 3D Med-VQA using radiology CT scans. The 3D-RAD dataset encompasses six diverse VQA tasks: anomaly detection, image observation, medical computation, existence detection, static temporal diagnosis, and longitudinal temporal diagnosis. It supports both open- and closed-ended questions while introducing complex reasoning challenges, including computational tasks and multi-stage temporal analysis, to enable comprehensive benchmarking. Extensive evaluations demonstrate that existing vision-language models (VLMs), especially medical VLMs exhibit limited generalization, particularly in multi-temporal tasks, underscoring the challenges of real-world 3D diagnostic reasoning. To drive future advancements, we release a high-quality training set 3D-RAD-T of 136,195 expert-aligned samples, showing that fine-tuning on this dataset could significantly enhance model performance. Our dataset and code, aiming to catalyze multimodal medical AI research and establish a robust foundation for 3D medical visual understanding, are publicly available at this https URL.
zh
[CV-84] AlignHuman: Improving Motion and Fidelity via Timestep-Segment Preference Optimization for Audio-Driven Human Animation
【速读】:该论文试图解决人类视频生成和动画任务中表达性和真实感动画难以平衡的问题,具体表现为运动自然度与视觉保真度之间的权衡。其解决方案的关键在于提出\textbfAlignHuman框架,该框架结合了偏好优化作为后训练技术与分而治之的训练策略,以联合优化这两个相互竞争的目标。核心洞察来自于对去噪过程在不同时间步长上的分析,即早期时间步主要控制运动动力学,而后期时间步可以有效管理视觉保真度和人体结构,即使跳过早期步骤也能保持较好的效果。基于此,作者提出了时间步段偏好优化(TPO)并引入两个专门的LoRAs作为专家对齐模块,分别针对相应时间步区间进行优化,从而提升生成结果的运动自然度和保真度。
链接: https://arxiv.org/abs/2506.11144
作者: Chao Liang,Jianwen Jiang,Wang Liao,Jiaqi Yang,Zerong zheng,Weihong Zeng,Han Liang
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Homepage: this https URL
Abstract:Recent advancements in human video generation and animation tasks, driven by diffusion models, have achieved significant progress. However, expressive and realistic human animation remains challenging due to the trade-off between motion naturalness and visual fidelity. To address this, we propose \textbfAlignHuman, a framework that combines Preference Optimization as a post-training technique with a divide-and-conquer training strategy to jointly optimize these competing objectives. Our key insight stems from an analysis of the denoising process across timesteps: (1) early denoising timesteps primarily control motion dynamics, while (2) fidelity and human structure can be effectively managed by later timesteps, even if early steps are skipped. Building on this observation, we propose timestep-segment preference optimization (TPO) and introduce two specialized LoRAs as expert alignment modules, each targeting a specific dimension in its corresponding timestep interval. The LoRAs are trained using their respective preference data and activated in the corresponding intervals during inference to enhance motion naturalness and fidelity. Extensive experiments demonstrate that AlignHuman improves strong baselines and reduces NFEs during inference, achieving a 3.3 \times speedup (from 100 NFEs to 30 NFEs) with minimal impact on generation quality. Homepage: \hrefthis https URLthis https URL
zh
[CV-85] On the development of an AI performance and behavioural measures for teaching and classroom management
【速读】:该论文试图解决如何通过人工智能驱动的测量方法分析课堂动态,特别是教师行为的识别与评估问题。其解决方案的关键在于利用多模态传感器数据和AI技术实时提取有意义的洞察,以支持教师专业发展。研究构建了一个经过筛选的音视频数据集、新颖的行为度量指标,并开发了一个教学评审概念验证仪表盘,实现了非评判性、自动化的分析方法,从而减少人工工作量并促进教师的反思性实践。
链接: https://arxiv.org/abs/2506.11143
作者: Andreea I. Niculescu,Jochen Ehnen,Chen Yi,Du Jiawei,Tay Chiat Pin,Joey Tianyi Zhou,Vigneshwaran Subbaraju,Teh Kah Kuan,Tran Huy Dat,John Komar,Gi Soong Chee,Kenneth Kwok
机构: ASTAR Inst. for Infocomm Research(Singapore); ASTAR Inst. High Perf. Computing(Singapore); National Institute of Education NTU(新加坡国立教育学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 10 figures, A video demonstration of the teacher trainer dashboard can be accessed here: this https URL
Abstract:This paper presents a two-year research project focused on developing AI-driven measures to analyze classroom dynamics, with particular emphasis on teacher actions captured through multimodal sensor data. We applied real-time data from classroom sensors and AI techniques to extract meaningful insights and support teacher development. Key outcomes include a curated audio-visual dataset, novel behavioral measures, and a proof-of-concept teaching review dashboard. An initial evaluation with eight researchers from the National Institute for Education (NIE) highlighted the system’s clarity, usability, and its non-judgmental, automated analysis approach – which reduces manual workloads and encourages constructive reflection. Although the current version does not assign performance ratings, it provides an objective snapshot of in-class interactions, helping teachers recognize and improve their instructional strategies. Designed and tested in an Asian educational context, this work also contributes a culturally grounded methodology to the growing field of AI-based educational analytics.
zh
[CV-86] FARCLUSS: Fuzzy Adaptive Rebalancing and Contrastive Uncertainty Learning for Semi-Supervised Semantic Segmentation
【速读】:该论文旨在解决半监督语义分割(Semi-supervised Semantic Segmentation, SSSS)中有效利用未标记数据的持续挑战,包括伪标签的无效利用、类别不平衡偏差的加剧以及预测不确定性被忽视等问题。其解决方案的关键在于构建一个全面框架,通过四个核心组件将不确定性转化为学习优势:(1)模糊伪标签,保留从Top-K预测中获得的软类别分布以丰富监督;(2)不确定性感知的动态加权,通过基于熵的可靠性分数调节像素级贡献;(3)自适应类别再平衡,动态调整损失以对抗长尾类别分布;(4)轻量级对比正则化,促进紧凑且具有判别性的特征嵌入。
链接: https://arxiv.org/abs/2506.11142
作者: Ebenezer Tarubinga,Jenifer Kalafatovich
机构: Korea University (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Submitted to Pattern Recognition
Abstract:Semi-supervised semantic segmentation (SSSS) faces persistent challenges in effectively leveraging unlabeled data, such as ineffective utilization of pseudo-labels, exacerbation of class imbalance biases, and neglect of prediction uncertainty. Current approaches often discard uncertain regions through strict thresholding favouring dominant classes. To address these limitations, we introduce a holistic framework that transforms uncertainty into a learning asset through four principal components: (1) fuzzy pseudo-labeling, which preserves soft class distributions from top-K predictions to enrich supervision; (2) uncertainty-aware dynamic weighting, that modulate pixel-wise contributions via entropy-based reliability scores; (3) adaptive class rebalancing, which dynamically adjust losses to counteract long-tailed class distributions; and (4) lightweight contrastive regularization, that encourage compact and discriminative feature embeddings. Extensive experiments on benchmarks demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements in the segmentation of under-represented classes and ambiguous regions.
zh
[CV-87] Autonomous Computer Vision Development with Agent ic AI
【速读】:该论文试图解决计算机视觉应用开发中传统上由数据科学家完成的自主规划与工具配置问题(computer vision task planning and tool configuration)。解决方案的关键在于利用基于大型语言模型(Large Language Models, LLMs)的智能体(Agentic AI)系统,通过自然语言提示自动生成并执行相应的SimpleMind工作流配置,从而实现从任务描述到模型训练与推理的端到端自动化。
链接: https://arxiv.org/abs/2506.11140
作者: Jin Kim,Muhammad Wahi-Anwa,Sangyun Park,Shawn Shin,John M. Hoffman,Matthew S. Brown
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: The paper is 13 pages long and contains 4 figures
Abstract:Agentic Artificial Intelligence (AI) systems leveraging Large Language Models (LLMs) exhibit significant potential for complex reasoning, planning, and tool utilization. We demonstrate that a specialized computer vision system can be built autonomously from a natural language prompt using Agentic AI methods. This involved extending SimpleMind (SM), an open-source Cognitive AI environment with configurable tools for medical image analysis, with an LLM-based agent, implemented using OpenManus, to automate the planning (tool configuration) for a particular computer vision task. We provide a proof-of-concept demonstration that an agentic system can interpret a computer vision task prompt, plan a corresponding SimpleMind workflow by decomposing the task and configuring appropriate tools. From the user input prompt, “provide sm (SimpleMind) config for lungs, heart, and ribs segmentation for cxr (chest x-ray)”), the agent LLM was able to generate the plan (tool configuration file in YAML format), and execute SM-Learn (training) and SM-Think (inference) scripts autonomously. The computer vision agent automatically configured, trained, and tested itself on 50 chest x-ray images, achieving mean dice scores of 0.96, 0.82, 0.83, for lungs, heart, and ribs, respectively. This work shows the potential for autonomous planning and tool configuration that has traditionally been performed by a data scientist in the development of computer vision applications.
zh
[CV-88] JAFAR: Jack up Any Feature at Any Resolution UAI
【速读】:该论文旨在解决基础视觉编码器(Foundation Vision Encoder)输出的低分辨率空间特征在下游密集视觉任务中需要进行特征上采样的问题。其关键解决方案是提出一种轻量且灵活的特征上采样器JAFAR,该方法通过基于注意力的模块,利用空间特征变换(SFT)调制,促进从低级图像特征中提取的高分辨率查询与语义丰富的低分辨率键之间的语义对齐,从而有效提升视觉特征的空间分辨率至任意目标分辨率。
链接: https://arxiv.org/abs/2506.11136
作者: Paul Couairon,Loick Chambon,Louis Serrano,Jean-Emmanuel Haugeard,Matthieu Cord,Nicolas Thome
机构: Sorbonne Université, CNRS, ISIR, F-75005 Paris, France; Thales, TSGF, cortAIx Labs, France; Valeo.ai
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Code available at this https URL
Abstract:Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page at this https URL
zh
[CV-89] ContextLoss: Context Information for Topology-Preserving Segmentation ICIP2025
【速读】:该论文旨在解决图像分割中保持分割结构(如血管、膜或道路)拓扑结构的问题,因为拓扑错误可能对导航等应用产生重大影响。其解决方案的关键在于提出一种新的损失函数——ContextLoss (CLoss),该函数通过在关键像素掩码中考虑拓扑错误的完整上下文来提升拓扑正确性,从而增强网络对拓扑错误的关注。此外,作者还提出了两个直观的度量标准以验证因修复遗漏连接而带来的连通性改进。
链接: https://arxiv.org/abs/2506.11134
作者: Benedict Schacht,Imke Greving,Simone Frintrop,Berit Zeller-Plumhoff,Christian Wilms
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 13 pages, 7 figures, accepted to ICIP 2025
Abstract:In image segmentation, preserving the topology of segmented structures like vessels, membranes, or roads is crucial. For instance, topological errors on road networks can significantly impact navigation. Recently proposed solutions are loss functions based on critical pixel masks that consider the whole skeleton of the segmented structures in the critical pixel mask. We propose the novel loss function ContextLoss (CLoss) that improves topological correctness by considering topological errors with their whole context in the critical pixel mask. The additional context improves the network focus on the topological errors. Further, we propose two intuitive metrics to verify improved connectivity due to a closing of missed connections. We benchmark our proposed CLoss on three public datasets (2D 3D) and our own 3D nano-imaging dataset of bone cement lines. Training with our proposed CLoss increases performance on topology-aware metrics and repairs up to 44% more missed connections than other state-of-the-art methods. We make the code publicly available.
zh
[CV-90] Monocular 3D Hand Pose Estimation with Implicit Camera Alignment
【速读】:该论文旨在解决从单张彩色图像中估计3D手部关节结构(3D hand articulation)的问题,该问题在增强现实(AR)、虚拟现实(VR)、人机交互(HCI)和机器人技术中具有广泛应用。由于缺乏深度信息、遮挡、关节复杂性以及需要已知相机参数等因素,该问题面临诸多挑战。论文提出的解决方案是一个优化流程,其关键在于包含关键点对齐步骤和指尖损失(fingertip loss),从而克服了对相机参数已知或估计的需求。该方法在EgoDexter和Dexter+Object基准测试中表现出与当前最先进(SotA)方法相当的性能,并展示了在无先验相机知识的情况下处理“真实场景”图像的鲁棒性。
链接: https://arxiv.org/abs/2506.11133
作者: Christos Pantazopoulos,Spyridon Thermos,Gerasimos Potamianos
机构: University of Thessaly(塞萨利大学); Moverse(移动领域)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Code is available at this https URL
Abstract:Estimating the 3D hand articulation from a single color image is a continuously investigated problem with applications in Augmented Reality (AR), Virtual Reality (VR), Human-Computer Interaction (HCI), and robotics. Apart from the absence of depth information, occlusions, articulation complexity, and the need for camera parameters knowledge pose additional challenges. In this work, we propose an optimization pipeline for estimating the 3D hand articulation from 2D keypoint input, which includes a keypoint alignment step and a fingertip loss to overcome the need to know or estimate the camera parameters. We evaluate our approach on the EgoDexter and Dexter+Object benchmarks to showcase that our approach performs competitively with the SotA, while also demonstrating its robustness when processing “in-the-wild” images without any prior camera knowledge. Our quantitative analysis highlights the sensitivity of the 2D keypoint estimation accuracy, despite the use of hand priors. Code is available at this https URL
zh
[CV-91] Gender Fairness of Machine Learning Algorithms for Pain Detection
【速读】:该论文试图解决自动化疼痛检测中机器学习(ML)和深度学习(DL)算法在不同性别群体间的公平性问题,即这些算法在准确性与公平性之间的权衡。解决方案的关键在于通过对比传统ML算法(如线性支持向量机L SVM和径向基函数支持向量机RBF SVM)与DL方法(如卷积神经网络CNN和视觉TransformerViT)在面部表情视觉模态下的疼痛检测性能,并评估其在多个性能和公平性指标上的表现,从而揭示模型中的性别偏差并强调引入公平性感知技术的必要性。
链接: https://arxiv.org/abs/2506.11132
作者: Dylan Green,Yuting Shang,Jiaee Cheong,Yang Liu,Hatice Gunes
机构: University of Cambridge (剑桥大学); Harvard University (哈佛大学); University of Oulu (奥卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: To appear as part of the 2025 19th International Conference on Automatic Face and Gesture Recognition (FG) Workshop Proceedings
Abstract:Automated pain detection through machine learning (ML) and deep learning (DL) algorithms holds significant potential in healthcare, particularly for patients unable to self-report pain levels. However, the accuracy and fairness of these algorithms across different demographic groups (e.g., gender) remain under-researched. This paper investigates the gender fairness of ML and DL models trained on the UNBC-McMaster Shoulder Pain Expression Archive Database, evaluating the performance of various models in detecting pain based solely on the visual modality of participants’ facial expressions. We compare traditional ML algorithms, Linear Support Vector Machine (L SVM) and Radial Basis Function SVM (RBF SVM), with DL methods, Convolutional Neural Network (CNN) and Vision Transformer (ViT), using a range of performance and fairness metrics. While ViT achieved the highest accuracy and a selection of fairness metrics, all models exhibited gender-based biases. These findings highlight the persistent trade-off between accuracy and fairness, emphasising the need for fairness-aware techniques to mitigate biases in automated healthcare systems.
zh
[CV-92] Segment This Thing: Foveated Tokenization for Efficient Point-Prompted Segmentation
【速读】:该论文试图解决高效图像分割的问题,特别是针对单点提示(single point prompt)生成单一分割区域的场景。其解决方案的关键在于通过视网膜聚焦(foveation)机制对输入图像进行处理,即在提示点周围提取图像裁剪区域,并采用一种新型的变分辨率块标记化方法,其中块的下采样率随着距离提示点的距离增加而增加。这种方法显著减少了图像标记的数量,从而大幅降低了分割的计算成本,而无需减小模型规模。
链接: https://arxiv.org/abs/2506.11131
作者: Tanner Schmidt,Richard Newcombe
机构: Meta Reality Labs(元宇宙实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:This paper presents Segment This Thing (STT), a new efficient image segmentation model designed to produce a single segment given a single point prompt. Instead of following prior work and increasing efficiency by decreasing model size, we gain efficiency by foveating input images. Given an image and a point prompt, we extract a crop centered on the prompt and apply a novel variable-resolution patch tokenization in which patches are downsampled at a rate that increases with increased distance from the prompt. This approach yields far fewer image tokens than uniform patch tokenization. As a result we can drastically reduce the computational cost of segmentation without reducing model size. Furthermore, the foveation focuses the model on the region of interest, a potentially useful inductive bias. We show that our Segment This Thing model is more efficient than prior work while remaining competitive on segmentation benchmarks. It can easily run at interactive frame rates on consumer hardware and is thus a promising tool for augmented reality or robotics applications.
zh
[CV-93] Image-Based Method For Measuring And Classification Of Iron Ore Pellets Using Star-Convex Polygons
【速读】:该论文试图解决铁矿石球团(iron ore pellets)质量缺陷检测与尺寸测量的问题,特别是在密集且不稳定的环境中准确识别和分类球团的挑战。传统方法如Vision Transformer(ViT)图像分类、Mask R-CNN实例分割以及多种异常分割算法在该场景下未能取得满意效果。解决方案的关键在于引入医学领域常用的StarDist算法,通过其对平滑边界物体的检测能力,提升物理尺寸测量的准确性,并实现对球团尺寸分布的更精确分析。
链接: https://arxiv.org/abs/2506.11126
作者: Artem Solomko,Oleg Kartashev,Andrey Golov,Mikhail Deulin,Vadim Valynkin,Vasily Kharin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 41 figures
Abstract:We would like to present a comprehensive study on the classification of iron ore pellets, aimed at identifying quality violations in the final product, alongside the development of an innovative imagebased measurement method utilizing the StarDist algorithm, which is primarily employed in the medical field. This initiative is motivated by the necessity to accurately identify and analyze objects within densely packed and unstable environments. The process involves segmenting these objects, determining their contours, classifying them, and measuring their physical dimensions. This is crucial because the size distribution and classification of pellets such as distinguishing between nice (quality) and joint (caused by the presence of moisture or indicating a process of production failure) types are among the most significant characteristics that define the quality of the final product. Traditional algorithms, including image classification techniques using Vision Transformer (ViT), instance segmentation methods like Mask R-CNN, and various anomaly segmentation algorithms, have not yielded satisfactory results in this context. Consequently, we explored methodologies from related fields to enhance our approach. The outcome of our research is a novel method designed to detect objects with smoothed boundaries. This advancement significantly improves the accuracy of physical dimension measurements and facilitates a more precise analysis of size distribution among the iron ore pellets. By leveraging the strengths of the StarDist algorithm, we aim to provide a robust solution that addresses the challenges posed by the complex nature of pellet classification and measurement.
zh
[CV-94] chnical Report for Argoverse2 Scenario Mining Challenges on Iterative Error Correction and Spatially-Aware Prompting
【速读】:该论文旨在解决从大规模自动驾驶数据集中进行场景挖掘时遇到的挑战,特别是由于生成式AI(Generative AI)生成的代码导致的运行时错误以及对描述复杂多目标空间关系函数参数解释不准确的问题。其解决方案的关键在于两个核心改进:一是引入容错的迭代代码生成机制,通过将错误反馈重新提示给大型语言模型(LLM)来逐步优化代码;二是采用专门的提示工程,提升LLM对空间关系函数的理解与正确应用能力。
链接: https://arxiv.org/abs/2506.11124
作者: Yifei Chen,Ross Greer
机构: Xi’an University of Technology (西安理工大学); University of California, Merced (加州大学默塞德分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注:
Abstract:Scenario mining from extensive autonomous driving datasets, such as Argoverse 2, is crucial for the development and validation of self-driving systems. The RefAV framework represents a promising approach by employing Large Language Models (LLMs) to translate natural-language queries into executable code for identifying relevant scenarios. However, this method faces challenges, including runtime errors stemming from LLM-generated code and inaccuracies in interpreting parameters for functions that describe complex multi-object spatial relationships. This technical report introduces two key enhancements to address these limitations: (1) a fault-tolerant iterative code-generation mechanism that refines code by re-prompting the LLM with error feedback, and (2) specialized prompt engineering that improves the LLM’s comprehension and correct application of spatial-relationship functions. Experiments on the Argoverse 2 validation set with diverse LLMs-Qwen2.5-VL-7B, Gemini 2.5 Flash, and Gemini 2.5 Pro-show consistent gains across multiple metrics; most notably, the proposed system achieves a HOTA-Temporal score of 52.37 on the official test set using Gemini 2.5 Pro. These results underline the efficacy of the proposed techniques for reliable, high-precision scenario mining.
zh
[CV-95] Adaptive Object Detection with ESRGAN-Enhanced Resolution Faster R-CNN
【速读】:该论文旨在解决低分辨率图像中目标检测性能不佳的问题,特别是在图像质量较差的情况下,传统方法难以实现准确的目标检测。其解决方案的关键在于将增强型超分辨率生成对抗网络(ESRGAN)与快速区域卷积神经网络(Faster R-CNN)相结合,通过ESRGAN提升图像质量,恢复细节并提高清晰度,随后利用Faster R-CNN在增强后的图像上进行精确的目标检测与定位,从而在图像分辨率受限的场景下实现更鲁棒和可靠的目标检测效果。
链接: https://arxiv.org/abs/2506.11122
作者: Divya Swetha K,Ziaul Haque Choudhury,Hemanta Kumar Bhuyan,Biswajit Brahma,Nilayam Kumar Kamila
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this study, proposes a method for improved object detection from the low-resolution images by integrating Enhanced Super-Resolution Generative Adversarial Networks (ESRGAN) and Faster Region-Convolutional Neural Network (Faster R-CNN). ESRGAN enhances low-quality images, restoring details and improving clarity, while Faster R-CNN performs accurate object detection on the enhanced images. The combination of these techniques ensures better detection performance, even with poor-quality inputs, offering an effective solution for applications where image resolution is in consistent. ESRGAN is employed as a pre-processing step to enhance the low-resolution input image, effectively restoring lost details and improving overall image quality. Subsequently, the enhanced image is fed into the Faster R-CNN model for accurate object detection and localization. Experimental results demonstrate that this integrated approach yields superior performance compared to traditional methods applied directly to low-resolution images. The proposed framework provides a promising solution for applications where image quality is variable or limited, enabling more robust and reliable object detection in challenging scenarios. It achieves a balance between improved image quality and efficient object detection
zh
[CV-96] EfficientQuant: An Efficient Post-Training Quantization for CNN-Transformer Hybrid Models on Edge Devices CVPR2025
【速读】:该论文旨在解决混合模型(结合卷积和Transformer模块)在边缘设备部署时资源消耗过高的问题。现有方法如后训练量化(PTQ)在混合模型上的应用效果有限,无法有效降低计算和内存需求。论文提出的解决方案——EfficientQuant,其关键在于针对不同模块采用结构感知的量化策略:对卷积块使用统一量化,对Transformer块采用log₂量化,从而在保持较高精度的同时显著降低延迟和内存占用,实现高效的边缘部署。
链接: https://arxiv.org/abs/2506.11093
作者: Shaibal Saha,Lanyu Xu
机构: Oakland University (奥克兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the 4th Workshop on Transformers for Vision (T4V) at CVPR 2025
Abstract:Hybrid models that combine convolutional and transformer blocks offer strong performance in computer vision (CV) tasks but are resource-intensive for edge deployment. Although post-training quantization (PTQ) can help reduce resource demand, its application to hybrid models remains limited. We propose EfficientQuant, a novel structure-aware PTQ approach that applies uniform quantization to convolutional blocks and log_2 quantization to transformer blocks. EfficientQuant achieves 2.5 \times - 8.7 \times latency reduction with minimal accuracy loss on the ImageNet-1K dataset. It further demonstrates low latency and memory efficiency on edge devices, making it practical for real-world deployment.
zh
[CV-97] When Algorithms Play Favorites: Lookism in the Generation and Perception of Faces
【速读】:该论文试图解决算法外貌主义(algorithmic lookism)对合成生成面孔和基于机器学习的性别分类算法的影响问题,具体表现为外观相关的偏见如何影响数字身份系统的公平性。其解决方案的关键在于通过实验分析文本到图像(T2I)系统中面部吸引力与无关积极特质的关联性,以及性别分类模型在“低吸引力”面孔上的错误率,特别是非白人女性群体中的表现,从而揭示算法偏见的潜在机制并引发对数字身份系统公平性的关注。
链接: https://arxiv.org/abs/2506.11025
作者: Miriam Doh,Aditya Gulati,Matei Mancas,Nuria Oliver
机构: ISIA Lab - Université de Mons(ISIA实验室-蒙斯大学); IRIDIA Lab - Université Libre de Bruxelles(IRIDIA实验室-布鲁塞尔自由大学); ELLIS Alicante(ELLIS阿利坎特); Université de Mons(蒙斯大学); Université Libre de Bruxelles(布鲁塞尔自由大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as an extended abstract at the Fourth European Workshop on Algorithmic Fairness (EWAF) (URL: this https URL )
Abstract:This paper examines how synthetically generated faces and machine learning-based gender classification algorithms are affected by algorithmic lookism, the preferential treatment based on appearance. In experiments with 13,200 synthetically generated faces, we find that: (1) text-to-image (T2I) systems tend to associate facial attractiveness to unrelated positive traits like intelligence and trustworthiness; and (2) gender classification models exhibit higher error rates on “less-attractive” faces, especially among non-White women. These result raise fairness concerns regarding digital identity systems.
zh
[CV-98] crossMoDA Challenge: Evolution of Cross-Modality Domain Adaptation Techniques for Vestibular Schwannoma and Cochlea Segmentation from 2021 to 2023
【速读】:该论文旨在解决医学影像中跨模态领域自适应(cross-Modality Domain Adaptation, crossMoDA)问题,具体是通过从对比增强T1(ceT1)图像中学习,实现对T2 MRI图像中前庭神经鞘瘤(Vestibular Schwannoma, VS)和耳蜗的无监督分割。其关键解决方案在于利用多机构数据和复杂的标注信息(如Koos分级及肿瘤成分的子分割)来提升模型的泛化能力和临床适用性。研究还表明,随着数据集规模的扩大和异质性的增加,异常样本数量减少,但同时数据复杂性的提升也可能对分割性能产生负面影响。
链接: https://arxiv.org/abs/2506.12006
作者: Navodini Wijethilake,Reuben Dorent,Marina Ivory,Aaron Kujawa,Stefan Cornelissen,Patrick Langenhuizen,Mohamed Okasha,Anna Oviedova,Hexin Dong,Bogyeong Kang,Guillaume Sallé,Luyi Han,Ziyuan Zhao,Han Liu,Tao Yang,Shahad Hardan,Hussain Alasmawi,Santosh Sanjeev,Yuzhou Zhuang,Satoshi Kondo,Maria Baldeon Calisto,Shaikh Muhammad Uzair Noman,Cancan Chen,Ipek Oguz,Rongguo Zhang,Mina Rezaei,Susana K. Lai-Yuen,Satoshi Kasai,Chih-Cheng Hung,Mohammad Yaqub,Lisheng Wang,Benoit M. Dawant,Cuntai Guan,Ritse Mann,Vincent Jaouen,Ji-Wung Han,Li Zhang,Jonathan Shapey,Tom Vercauteren
机构: School of BMEIS, King’s College London, London, United Kingdom; Harvard University, USA; Elisabeth-TweeSteden Hospital, Tilburg, Netherlands; King’s College Hospital, London, United Kingdom; Center for Data Science, Peking University, Beijing, China; Center for Data Science in Health and Medicine, Peking University, Beijing, China; Department of Artificial Intelligence, Korea University, Seoul, Republic of Korea; UMR 1101 Inserm LaTIM, Université de Bretagne Occidentale, IMT Atlantique, Brest, France; Department of Radiology and Nuclear Medicine, Radboud University Medical Center, Geert Grooteplein 10, 6525 GA, Nijmegen, The Netherlands; Department of Radiology, The Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands; Institute for Infocomm Research (I²R), ASTAR, Singapore; Artificial Intelligence, Analytics And Informatics (AI³), ASTAR, Singapore; Nanyang Technological University, Singapore; Vanderbilt University, USA; Department of Automation, Shanghai Jiao Tong University, Shanghai, China; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China; Center for Machine Vision and Security Research, Kennesaw State University, Marietta, MA 30060, USA; Muroran Institute of Technology, Hokkaido, Japan; Niigata University of Health and Welfare, Niigata, Japan; Universidad San Francisco de Quito, Diego de Robles s/n y Vía Interoceánica, Quito, Ecuador; University of South Florida, Tampa, FL, USA; Ludwig-Maximilians-Universität München, Germany; Infervision Advanced Research Institute, Beijing, China; Academy for Multidisciplinary Studies, Capital Normal University, Beijing, China
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The cross-Modality Domain Adaptation (crossMoDA) challenge series, initiated in 2021 in conjunction with the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), focuses on unsupervised cross-modality segmentation, learning from contrast-enhanced T1 (ceT1) and transferring to T2 MRI. The task is an extreme example of domain shift chosen to serve as a meaningful and illustrative benchmark. From a clinical application perspective, it aims to automate Vestibular Schwannoma (VS) and cochlea segmentation on T2 scans for more cost-effective VS management. Over time, the challenge objectives have evolved to enhance its clinical relevance. The challenge evolved from using single-institutional data and basic segmentation in 2021 to incorporating multi-institutional data and Koos grading in 2022, and by 2023, it included heterogeneous routine data and sub-segmentation of intra- and extra-meatal tumour components. In this work, we report the findings of the 2022 and 2023 editions and perform a retrospective analysis of the challenge progression over the years. The observations from the successive challenge contributions indicate that the number of outliers decreases with an expanding dataset. This is notable since the diversity of scanning protocols of the datasets concurrently increased. The winning approach of the 2023 edition reduced the number of outliers on the 2021 and 2022 testing data, demonstrating how increased data heterogeneity can enhance segmentation performance even on homogeneous data. However, the cochlea Dice score declined in 2023, likely due to the added complexity from tumour sub-annotations affecting overall segmentation performance. While progress is still needed for clinically acceptable VS segmentation, the plateauing performance suggests that a more challenging cross-modal task may better serve future benchmarking.
zh
[CV-99] MindGrab for BrainChop: Fast and Accurate Skull Stripping for Command Line and Browser
【速读】:该论文旨在解决医学影像中三维头骨剥离(volumetric skull-stripping)的问题,即从头部图像中准确分割出脑组织区域。其解决方案的关键在于提出了一种参数和内存高效的深度全卷积模型MindGrab,该模型的架构基于扩张卷积的谱解释,并仅使用模态无关的合成数据进行训练,从而实现了在多种成像模态下的高性能分割。
链接: https://arxiv.org/abs/2506.11860
作者: Armina Fani(1),Mike Doan(1),Isabelle Le(1),Alex Fedorov(2),Malte Hoffmann(3),Chris Rorden(4),Sergey Plis(1) ((1) Tri-Institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, Emory University, (2) Emory University, (3) Harvard University, (4) University of South Carolina)
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: 12 pages, 1 table, 4 figures. 2 supplementary tables, 1 supplementary figure. Brainchop-cli: this https URL . Brainchop web: this https URL
Abstract:We developed MindGrab, a parameter- and memory-efficient deep fully-convolutional model for volumetric skull-stripping in head images of any modality. Its architecture, informed by a spectral interpretation of dilated convolutions, was trained exclusively on modality-agnostic synthetic data. MindGrab was evaluated on a retrospective dataset of 606 multimodal adult-brain scans (T1, T2, DWI, MRA, PDw MRI, EPI, CT, PET) sourced from the SynthStrip dataset. Performance was benchmarked against SynthStrip, ROBEX, and BET using Dice scores, with Wilcoxon signed-rank significance tests. MindGrab achieved a mean Dice score of 95.9 with standard deviation (SD) 1.6 across modalities, significantly outperforming classical methods (ROBEX: 89.1 SD 7.7, P 0.05; BET: 85.2 SD 14.4, P 0.05). Compared to SynthStrip (96.5 SD 1.1, P=0.0352), MindGrab delivered equivalent or superior performance in nearly half of the tested scenarios, with minor differences (3% Dice) in the others. MindGrab utilized 95% fewer parameters (146,237 vs. 2,566,561) than SynthStrip. This efficiency yielded at least 2x faster inference, 50% lower memory usage on GPUs, and enabled exceptional performance (e.g., 10-30x speedup, and up to 30x memory reduction) and accessibility on a wider range of hardware, including systems without high-end GPUs. MindGrab delivers state-of-the-art accuracy with dramatically lower resource demands, supported in brainchop-cli (this https URL) and at this http URL.
zh
[CV-100] Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution
【速读】:该论文旨在解决数据驱动图像超分辨率(Image Super-Resolution, SR)方法中因扩展模型感受野而导致的模型复杂度增加的问题。其解决方案的关键在于提出一种基于结构相似性启发的展开(Structural Similarity-Inspired Unfolding, SSIU)方法,通过将SR优化函数约束于结构相似性,并结合多尺度门控模块(Mixed-Scale Gating Module, MSGM)、高效稀疏注意力模块(Efficient Sparse Attention Module, ESAM)以及基于专家混合的特征选择器(Mixture-of-Experts-based Feature Selector, MoE-FS),在保持模型紧凑性的同时提升性能。
链接: https://arxiv.org/abs/2506.11823
作者: Zhangkai Ni,Yang Zhang,Wenhan Yang,Hanli Wang,Shiqi Wang,Sam Kwong
机构: Tongji University (同济大学); City University of Hong Kong (香港城市大学); Pengcheng Laboratory (鹏城实验室); Lingnan University (岭南大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE Transactions on Image Processing
Abstract:Major efforts in data-driven image super-resolution (SR) primarily focus on expanding the receptive field of the model to better capture contextual information. However, these methods are typically implemented by stacking deeper networks or leveraging transformer-based attention mechanisms, which consequently increases model complexity. In contrast, model-driven methods based on the unfolding paradigm show promise in improving performance while effectively maintaining model compactness through sophisticated module design. Based on these insights, we propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR. This method is designed through unfolding an SR optimization function constrained by structural similarity, aiming to combine the strengths of both data-driven and model-driven approaches. Our model operates progressively following the unfolding paradigm. Each iteration consists of multiple Mixed-Scale Gating Modules (MSGM) and an Efficient Sparse Attention Module (ESAM). The former implements comprehensive constraints on features, including a structural similarity constraint, while the latter aims to achieve sparse activation. In addition, we design a Mixture-of-Experts-based Feature Selector (MoE-FS) that fully utilizes multi-level feature information by combining features from different steps. Extensive experiments validate the efficacy and efficiency of our unfolding-inspired network. Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption. Our code will be available at: this https URL
zh
[CV-101] Framework of a multiscale data-driven digital twin of the muscle-skeletal system
【速读】:该论文旨在解决肌肉骨骼疾病(Musculoskeletal Disorders, MSDs)在个性化诊断与治疗中面临的挑战,特别是如何有效整合异构数据源以实现精准评估。解决方案的关键在于提出一种名为肌肉骨骼数字孪生(Musculoskeletal Digital Twin, MS-DT)的新框架,该框架通过整合多尺度生物力学数据与计算建模,构建患者特异性的肌肉骨骼系统模型,从而实现对脊柱运动学、姿势及肌肉功能的详细分析。
链接: https://arxiv.org/abs/2506.11821
作者: Martina Paccini,Simone Cammarasana,Giuseppe Patanè
机构: CNR - IMATI (National Research Council - Institute of Applied Mathematics and Information Technologies)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Musculoskeletal disorders (MSDs) are a leading cause of disability worldwide, requiring advanced diagnostic and therapeutic tools for personalised assessment and treatment. Effective management of MSDs involves the interaction of heterogeneous data sources, making the Digital Twin (DT) paradigm a valuable option. This paper introduces the Musculoskeletal Digital Twin (MS-DT), a novel framework that integrates multiscale biomechanical data with computational modelling to create a detailed, patient-specific representation of the musculoskeletal system. By combining motion capture, ultrasound imaging, electromyography, and medical imaging, the MS-DT enables the analysis of spinal kinematics, posture, and muscle function. An interactive visualisation platform provides clinicians and researchers with an intuitive interface for exploring biomechanical parameters and tracking patient-specific changes. Results demonstrate the effectiveness of MS-DT in extracting precise kinematic and dynamic tissue features, offering a comprehensive tool for monitoring spine biomechanics and rehabilitation. This framework provides high-fidelity modelling and real-time visualization to improve patient-specific diagnosis and intervention planning.
zh
[CV-102] Solving Inverse Problems in Stochastic Self-Organising Systems through Invariant Representations
【速读】:该论文试图解决自组织系统中由宏观观测数据反推未知因果参数的逆问题,尤其是在观测数据具有强随机性的情况下,传统方法因无法捕捉不同结果间的特征相似性而失效。解决方案的关键在于利用视觉嵌入(visual embeddings)的能力,生成稳健的表示以捕捉感知不变性,并将模式表示映射到一个不变嵌入空间,从而无需手工设计的目标函数或启发式方法即可有效恢复未知的因果参数。
链接: https://arxiv.org/abs/2506.11796
作者: Elias Najarro,Nicolas Bessone,Sebastian Risi
机构: IT University of Copenhagen (IT大学)
类目: Adaptation and Self-Organizing Systems (nlin.AO); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint. Under review
Abstract:Self-organising systems demonstrate how simple local rules can generate complex stochastic patterns. Many natural systems rely on such dynamics, making self-organisation central to understanding natural complexity. A fundamental challenge in modelling such systems is solving the inverse problem: finding the unknown causal parameters from macroscopic observations. This task becomes particularly difficult when observations have a strong stochastic component, yielding diverse yet equivalent patterns. Traditional inverse methods fail in this setting, as pixel-wise metrics cannot capture feature similarities between variable outcomes. In this work, we introduce a novel inverse modelling method specifically designed to handle stochasticity in the observable space, leveraging the capacity of visual embeddings to produce robust representations that capture perceptual invariances. By mapping the pattern representations onto an invariant embedding space, we can effectively recover unknown causal parameters without the need for handcrafted objective functions or heuristics. We evaluate the method on two canonical models–a reaction-diffusion system and an agent-based model of social segregation–and show that it reliably recovers parameters despite stochasticity in the outcomes. We further apply the method to real biological patterns, highlighting its potential as a tool for both theorists and experimentalists to investigate the dynamics underlying complex stochastic pattern formation.
zh
[CV-103] Exploring the Effectiveness of Deep Features from Domain-Specific Foundation Models in Retinal Image Synthesis
【速读】:该论文试图解决医疗影像中由于严格隐私法规、数据获取成本高及人口统计偏差等问题导致的神经网络模型应用受限的问题。其解决方案的关键在于利用生成式 AI (Generative AI) 生成合成数据,以规避隐私问题并提高数据公平性。研究重点评估了基于大型基础模型深度激活层的距离损失函数在颜色眼底视网膜成像中的表现,并与感知损失和边缘检测损失函数进行了比较,最终发现传统边缘检测滤波器在提升合成样本中血管结构清晰度方面更为有效。
链接: https://arxiv.org/abs/2506.11753
作者: Zuzanna Skorniewska,Bartlomiej W. Papiez
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: To be published and presented at the MIUA 2025 conference
Abstract:The adoption of neural network models in medical imaging has been constrained by strict privacy regulations, limited data availability, high acquisition costs, and demographic biases. Deep generative models offer a promising solution by generating synthetic data that bypasses privacy concerns and addresses fairness by producing samples for under-represented groups. However, unlike natural images, medical imaging requires validation not only for fidelity (e.g., Fréchet Inception Score) but also for morphological and clinical accuracy. This is particularly true for colour fundus retinal imaging, which requires precise replication of the retinal vascular network, including vessel topology, continuity, and thickness. In this study, we in-vestigated whether a distance-based loss function based on deep activation layers of a large foundational model trained on large corpus of domain data, colour fundus imaging, offers advantages over a perceptual loss and edge-detection based loss functions. Our extensive validation pipeline, based on both domain-free and domain specific tasks, suggests that domain-specific deep features do not improve autoen-coder image generation. Conversely, our findings highlight the effectiveness of con-ventional edge detection filters in improving the sharpness of vascular structures in synthetic samples.
zh
[CV-104] Brain Network Analysis Based on Fine-tuned Self-supervised Model for Brain Disease Diagnosis
【速读】:该论文试图解决脑网络基础模型研究受限于单一维度,从而限制其在神经科学中广泛应用的问题。解决方案的关键在于提出一个微调的脑网络模型,该模型通过在原始脑网络模型基础上扩展脑区表示的多维特征,提升模型的泛化能力,并包含两个核心模块:(1)用于跨维度扩展脑区特征的适配器模块;(2)基于自监督学习并在数千名参与者的功能磁共振成像(fMRI)数据上预训练的微调基础脑网络模型,其Transformer块能够有效提取脑区特征并计算区域间关联性。
链接: https://arxiv.org/abs/2506.11671
作者: Yifei Tang,Hongjie Jiang,Changhong Jing,Hieu Pham,Shuqiang Wang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 3 figures, International Conference on Neural Computing for Advanced Applications
Abstract:Functional brain network analysis has become an indispensable tool for brain disease analysis. It is profoundly impacted by deep learning methods, which can characterize complex connections between ROIs. However, the research on foundation models of brain network is limited and constrained to a single dimension, which restricts their extensive application in neuroscience. In this study, we propose a fine-tuned brain network model for brain disease diagnosis. It expands brain region representations across multiple dimensions based on the original brain network model, thereby enhancing its generalizability. Our model consists of two key modules: (1)an adapter module that expands brain region features across different dimensions. (2)a fine-tuned foundation brain network model, based on self-supervised learning and pre-trained on fMRI data from thousands of participants. Specifically, its transformer block is able to effectively extract brain region features and compute the inter-region associations. Moreover, we derive a compact latent representation of the brain network for brain disease diagnosis. Our downstream experiments in this study demonstrate that the proposed model achieves superior performance in brain disease diagnosis, which potentially offers a promising approach in brain network analysis research.
zh
[CV-105] FCA2: Frame Compression-Aware Autoencoder for Modular and Fast Compressed Video Super-Resolution
【速读】:该论文旨在解决当前压缩视频超分辨率(CVSR)模型面临的持续挑战,包括推理时间过长、训练流程复杂以及对辅助信息的依赖。随着视频帧率的提升,帧间差异逐渐减小,传统逐帧信息利用方法已无法满足当前视频超分辨率(VSR)的需求。论文提出的解决方案的关键在于受高光谱图像(HSI)与视频数据之间结构和统计相似性的启发,引入了一种基于压缩驱动的降维策略,以降低计算复杂度、加速推理并增强跨帧的时间信息提取能力。该方法具有模块化架构,可无缝集成到现有VSR框架中,具备良好的适应性和迁移性。
链接: https://arxiv.org/abs/2506.11545
作者: Zhaoyang Wang,Jie Li,Wen Lu,Lihuo He,Maoguo Gong,Xinbo Gao
机构: Xidian University (西安电子科技大学); Chongqing University of Posts and Telecommunications (重庆邮电大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE TMM for possible publication
Abstract:State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information. As video frame rates continue to increase, the diminishing inter-frame differences further expose the limitations of traditional frame-to-frame information exploitation methods, which are inadequate for addressing current video super-resolution (VSR) demands. To overcome these challenges, we propose an efficient and scalable solution inspired by the structural and statistical similarities between hyperspectral images (HSI) and video data. Our approach introduces a compression-driven dimensionality reduction strategy that reduces computational complexity, accelerates inference, and enhances the extraction of temporal information across frames. The proposed modular architecture is designed for seamless integration with existing VSR frameworks, ensuring strong adaptability and transferability across diverse applications. Experimental results demonstrate that our method achieves performance on par with, or surpassing, the current SOTA models, while significantly reducing inference time. By addressing key bottlenecks in CVSR, our work offers a practical and efficient pathway for advancing VSR technology. Our code will be publicly available at this https URL.
zh
[CV-106] aming Stable Diffusion for Computed Tomography Blind Super-Resolution
【速读】:该论文旨在解决高分辨率计算机断层扫描(CT)成像中图像质量与患者安全之间的权衡问题,即在降低辐射剂量的同时保持高质量的医学影像。其解决方案的关键在于提出一种新颖的框架,该框架通过适配Stable Diffusion模型实现CT图像的盲超分辨率,利用实际退化模型生成逼真的低质量图像,并结合预训练视觉-语言模型生成对应的文本描述,从而在低分辨率输入和生成的文本描述的联合条件下进行超分辨率重建,显著提升了图像质量。
链接: https://arxiv.org/abs/2506.11496
作者: Chunlei Li,Yilei Shi,Haoxi Hu,Jingliang Hu,Xiao Xiang Zhu,Lichao Mou
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-resolution computed tomography (CT) imaging is essential for medical diagnosis but requires increased radiation exposure, creating a critical trade-off between image quality and patient safety. While deep learning methods have shown promise in CT super-resolution, they face challenges with complex degradations and limited medical training data. Meanwhile, large-scale pre-trained diffusion models, particularly Stable Diffusion, have demonstrated remarkable capabilities in synthesizing fine details across various vision tasks. Motivated by this, we propose a novel framework that adapts Stable Diffusion for CT blind super-resolution. We employ a practical degradation model to synthesize realistic low-quality images and leverage a pre-trained vision-language model to generate corresponding descriptions. Subsequently, we perform super-resolution using Stable Diffusion with a specialized controlling strategy, conditioned on both low-resolution inputs and the generated text descriptions. Extensive experiments show that our method outperforms existing approaches, demonstrating its potential for achieving high-quality CT imaging at reduced radiation doses. Our code will be made publicly available.
zh
[CV-107] Voxel-Level Brain States Prediction Using Swin Transformer
【速读】:该论文试图解决如何基于功能性磁共振成像(fMRI)数据预测未来的人类静息脑状态的问题。其解决方案的关键在于提出一种新颖的架构,该架构采用4D Shifted Window (Swin) Transformer作为编码器,以高效学习fMRI数据中的时空信息,并使用卷积解码器在与输入fMRI数据相同的空间和时间分辨率下实现脑状态预测。
链接: https://arxiv.org/abs/2506.11455
作者: Yifei Sun,Daniel Chahine,Qinghao Wen,Tianming Liu,Xiang Li,Yixuan Yuan,Fernando Calamante,Jinglei Lv
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Understanding brain dynamics is important for neuroscience and mental health. Functional magnetic resonance imaging (fMRI) enables the measurement of neural activities through blood-oxygen-level-dependent (BOLD) signals, which represent brain states. In this study, we aim to predict future human resting brain states with fMRI. Due to the 3D voxel-wise spatial organization and temporal dependencies of the fMRI data, we propose a novel architecture which employs a 4D Shifted Window (Swin) Transformer as encoder to efficiently learn spatio-temporal information and a convolutional decoder to enable brain state prediction at the same spatial and temporal resolution as the input fMRI data. We used 100 unrelated subjects from the Human Connectome Project (HCP) for model training and testing. Our novel model has shown high accuracy when predicting 7.2s resting-state brain activities based on the prior 23.04s fMRI time series. The predicted brain states highly resemble BOLD contrast and dynamics. This work shows promising evidence that the spatiotemporal organization of the human brain can be learned by a Swin Transformer model, at high resolution, which provides a potential for reducing the fMRI scan time and the development of brain-computer interfaces in the future.
zh
[CV-108] FAD-Net: Frequency-Domain Attention-Guided Diffusion Network for Coronary Artery Segmentation using Invasive Coronary Angiography
【速读】:该论文旨在解决冠状动脉疾病(Coronary Artery Disease, CAD)中通过侵入性冠状动脉造影(Invasive Coronary Angiography, ICA)进行冠状动脉分割与狭窄检测的准确性问题。其解决方案的关键在于提出一种基于频域分析的深度学习模型——频域注意力引导扩散网络(Frequency-Domain Attention-Guided Diffusion Network, FAD-Net),该模型通过集成频域注意力机制与级联扩散策略,充分利用频域信息以提升分割精度,并结合多级小波变换分解ICA图像为高低频成分,再通过逆向融合重构高频细节,从而增强解剖结构的精确性。
链接: https://arxiv.org/abs/2506.11454
作者: Nan Mu,Ruiqi Song,Xiaoning Li,Zhihui Xu,Jingfeng Jiang,Chen Zhao
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages, 12 figures
Abstract:Background: Coronary artery disease (CAD) remains one of the leading causes of mortality worldwide. Precise segmentation of coronary arteries from invasive coronary angiography (ICA) is critical for effective clinical decision-making. Objective: This study aims to propose a novel deep learning model based on frequency-domain analysis to enhance the accuracy of coronary artery segmentation and stenosis detection in ICA, thereby offering robust support for the stenosis detection and treatment of CAD. Methods: We propose the Frequency-Domain Attention-Guided Diffusion Network (FAD-Net), which integrates a frequency-domain-based attention mechanism and a cascading diffusion strategy to fully exploit frequency-domain information for improved segmentation accuracy. Specifically, FAD-Net employs a Multi-Level Self-Attention (MLSA) mechanism in the frequency domain, computing the similarity between queries and keys across high- and low-frequency components in ICAs. Furthermore, a Low-Frequency Diffusion Module (LFDM) is incorporated to decompose ICAs into low- and high-frequency components via multi-level wavelet transformation. Subsequently, it refines fine-grained arterial branches and edges by reintegrating high-frequency details via inverse fusion, enabling continuous enhancement of anatomical precision. Results and Conclusions: Extensive experiments demonstrate that FAD-Net achieves a mean Dice coefficient of 0.8717 in coronary artery segmentation, outperforming existing state-of-the-art methods. In addition, it attains a true positive rate of 0.6140 and a positive predictive value of 0.6398 in stenosis detection, underscoring its clinical applicability. These findings suggest that FAD-Net holds significant potential to assist in the accurate diagnosis and treatment planning of CAD.
zh
[CV-109] Joint Denoising of Cryo-EM Projection Images using Polar Transformers
【速读】:该论文试图解决在高噪声环境下(如冷冻电子显微镜投影图像)深度神经网络(DNNs)去噪效果受限的问题。其解决方案的关键在于提出一种基于Transformer的神经网络架构,该架构通过同时进行聚类、对齐和去噪操作,扩展了传统的类别平均方法,从而有效提升去噪性能。实验结果表明,该方法在信噪比为0.03时,相较于单图像DNNs,可将相对均方误差(MSE)降低45%。
链接: https://arxiv.org/abs/2506.11283
作者: Joakim Andén,Justus Sagemüller
机构: KTH Royal Institute of Technology (KTH皇家理工学院); Flatiron Institute, Simons Foundation (扁平化研究所,西蒙斯基金会)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Deep neural networks~(DNNs) have proven powerful for denoising, but they are ultimately of limited use in high-noise settings, such as for cryogenic electron microscopy~(cryo-EM) projection images. In this setting, however, datasets contain a large number of projections of the same molecule, each taken from a different viewing direction. This redundancy of information is useful in traditional denoising techniques known as class averaging methods, where images are clustered, aligned, and then averaged to reduce the noise level. We present a neural network architecture based on transformers that extends these class averaging methods by simultaneously clustering, aligning, and denoising cryo-EM images. Results on synthetic data show accurate denoising performance using this architecture, reducing the relative mean squared error (MSE) single-image DNNs by 45% at a signal-to-noise (SNR) of 0.03 .
zh
[CV-110] DiffPR: Diffusion-Based Phase Reconstruction via Frequency-Decoupled Learning
【速读】:该论文旨在解决深度学习在离轴定量相位成像(QPI)应用中面临的过平滑问题,该问题源于网络对低频内容的偏好以及对高频诊断细节的欠表示。其解决方案的关键在于识别并消除由高层跳跃连接引起的频谱偏差,通过仅在低分辨率下监督网络以提升泛化能力和重建保真度。进一步地,作者提出了一种两阶段的频谱解耦框架DiffPR,第一阶段通过取消高频跳跃的非对称U-Net预测低分辨率相位图,第二阶段利用无条件扩散模型通过反向去噪迭代恢复缺失的高频残差,从而有效缓解频谱偏差问题。
链接: https://arxiv.org/abs/2506.11183
作者: Yi Zhang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Oversmoothing remains a persistent problem when applying deep learning to off-axis quantitative phase imaging (QPI). End-to-end U-Nets favour low-frequency content and under-represent fine, diagnostic detail. We trace this issue to spectral bias and show that the bias is reinforced by high-level skip connections that feed high-frequency features directly into the decoder. Removing those deepest skips thus supervising the network only at a low resolution significantly improves generalisation and fidelity. Building on this insight, we introduce DiffPR, a two-stage frequency-decoupled framework. Stage 1: an asymmetric U-Net with cancelled high-frequency skips predicts a quarter-scale phase map from the interferogram, capturing reliable low-frequency structure while avoiding spectral bias. Stage 2: the upsampled prediction, lightly perturbed with Gaussian noise, is refined by an unconditional diffusion model that iteratively recovers the missing high-frequency residuals through reverse denoising. Experiments on four QPI datasets (B-Cell, WBC, HeLa, 3T3) show that DiffPR outperforms strong U-Net baselines, boosting PSNR by up to 1.1 dB and reducing MAE by 11 percent, while delivering markedly sharper membrane ridges and speckle patterns. The results demonstrate that cancelling high-level skips and delegating detail synthesis to a diffusion prior is an effective remedy for the spectral bias that limits conventional phase-retrieval networks.
zh
[CV-111] Vector Representations of Vessel Trees
【速读】:该论文旨在解决如何有效学习树状几何数据(如3D血管网络)的向量表示问题。其关键解决方案是提出一种基于Transformer的两级自编码器框架,即VeTTA。第一阶段通过Vessel Autoencoder从曲线上的采样点学习单个血管段的连续几何特征嵌入,第二阶段通过Vessel Tree Autoencoder将血管网络的拓扑结构编码为单一向量表示,利用递归解码过程确保重建拓扑结构的有效性,从而实现精确、灵活且拓扑一致的建模。
链接: https://arxiv.org/abs/2506.11163
作者: James Batten,Michiel Schaap,Matthew Sinclair,Ying Bai,Ben Glocker
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:We introduce a novel framework for learning vector representations of tree-structured geometric data focusing on 3D vascular networks. Our approach employs two sequentially trained Transformer-based autoencoders. In the first stage, the Vessel Autoencoder captures continuous geometric details of individual vessel segments by learning embeddings from sampled points along each curve. In the second stage, the Vessel Tree Autoencoder encodes the topology of the vascular network as a single vector representation, leveraging the segment-level embeddings from the first model. A recursive decoding process ensures that the reconstructed topology is a valid tree structure. Compared to 3D convolutional models, this proposed approach substantially lowers GPU memory requirements, facilitating large-scale training. Experimental results on a 2D synthetic tree dataset and a 3D coronary artery dataset demonstrate superior reconstruction fidelity, accurate topology preservation, and realistic interpolations in latent space. Our scalable framework, named VeTTA, offers precise, flexible, and topologically consistent modeling of anatomical tree structures in medical imaging.
zh
[CV-112] ADAgent : LLM Agent for Alzheimers Disease Analysis with Collaborative Coordinator
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)早期精准诊断与治疗规划中因现有方法依赖单一模态数据而难以全面反映疾病复杂性的难题。其关键解决方案是提出ADAgent,首个专为AD分析设计的AI代理系统,基于大语言模型(large language model, LLM)构建,能够处理多模态或缺失输入,并集成多种先进方法以提升诊断与预后任务的性能。
链接: https://arxiv.org/abs/2506.11150
作者: Wenlong Hou,Gangqian Yang,Ye Du,Yeung Lau,Lihao Liu,Junjun He,Ling Long,Shujun Wang
机构: 1(第一单位); 2(第二单位); 3(第三单位); 4(第四单位); 5(第五单位)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Alzheimer’s disease (AD) is a progressive and irreversible neurodegenerative disease. Early and precise diagnosis of AD is crucial for timely intervention and treatment planning to alleviate the progressive neurodegeneration. However, most existing methods rely on single-modality data, which contrasts with the multifaceted approach used by medical experts. While some deep learning approaches process multi-modal data, they are limited to specific tasks with a small set of input modalities and cannot handle arbitrary combinations. This highlights the need for a system that can address diverse AD-related tasks, process multi-modal or missing input, and integrate multiple advanced methods for improved performance. In this paper, we propose ADAgent, the first specialized AI agent for AD analysis, built on a large language model (LLM) to address user queries and support decision-making. ADAgent integrates a reasoning engine, specialized medical tools, and a collaborative outcome coordinator to facilitate multi-modal diagnosis and prognosis tasks in AD. Extensive experiments demonstrate that ADAgent outperforms SOTA methods, achieving significant improvements in accuracy, including a 2.7% increase in multi-modal diagnosis, a 0.7% improvement in multi-modal prognosis, and enhancements in MRI and PET diagnosis tasks.
zh
[CV-113] HQFNN: A Compact Quantum-Fuzzy Neural Network for Accurate Image Classification
【速读】:该论文试图解决传统深度学习视觉系统在输入噪声环境下性能下降以及模型可解释性不足的问题。其解决方案的关键在于提出一种创新的高量化模糊神经网络(Highly Quantized Fuzzy Neural Network, HQFNN),该网络将整个模糊推理流程嵌入到浅层量子电路中,并通过参数化量子电路与轻量级卷积神经网络(CNN)特征提取器进行耦合,从而实现高效的特征映射、规则细化和去模糊化处理,最终在保持模型紧凑性和可解释性的同时提升对噪声的鲁棒性。
链接: https://arxiv.org/abs/2506.11146
作者: Jianhong Yao,Yangming Guo
机构: Northwestern Polytechnical University, Xi’an Shaanxi, 710129, China (西北工业大学,西安陕西,710129,中国)
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Deep learning vision systems excel at pattern recognition yet falter when inputs are noisy or the model must explain its own confidence. Fuzzy inference, with its graded memberships and rule transparency, offers a remedy, while parameterized quantum circuits can embed features in richly entangled Hilbert spaces with striking parameter efficiency. Bridging these ideas, this study introduces a innovative Highly Quantized Fuzzy Neural Network (HQFNN) that realises the entire fuzzy pipeline inside a shallow quantum circuit and couples the resulting quantum signal to a lightweight CNN feature extractor. Each image feature is first mapped to a single qubit membership state through repeated angle reuploading. Then a compact rule layer refines these amplitudes, and a clustered CNOT defuzzifier collapses them into one crisp value that is fused with classical features before classification. Evaluated on standard image benchmarks, HQFNN consistently surpasses classical, fuzzy enhanced and quantum only baselines while using several orders of magnitude fewer trainable weights, and its accuracy degrades only marginally under simulated depolarizing and amplitude damping noise, evidence of intrinsic robustness. Gate count analysis further shows that circuit depth grows sublinearly with input dimension, confirming the model’s practicality for larger images. These results position the model as a compact, interpretable and noise tolerant alternative to conventional vision backbones and provide a template for future quantum native fuzzy learning frameworks.
zh
[CV-114] Grids Often Outperform Implicit Neural Representations
【速读】:该论文试图解决隐式神经表示(Implicit Neural Representations, INRs)在不同任务和信号类型下的性能表现、容量分配及泛化能力等问题。其关键解决方案是通过在多种2D和3D真实与合成信号上评估不同INRs的性能,结合模型规模、信号类型和带宽进行分层分析,从而揭示INRs与传统网格表示在容量利用上的差异。研究发现,在大多数任务中,简单的正则化网格结合插值方法在训练速度和质量上优于相同参数量的INR,但在具有低维结构的信号拟合任务中,INRs仍表现出优势。
链接: https://arxiv.org/abs/2506.11139
作者: Namhoon Kim,Sara Fridovich-Keil
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Implicit Neural Representations (INRs) have recently shown impressive results, but their fundamental capacity, implicit biases, and scaling behavior remain poorly understood. We investigate the performance of diverse INRs across a suite of 2D and 3D real and synthetic signals with varying effective bandwidth, as well as both overfitting and generalization tasks including tomography, super-resolution, and denoising. By stratifying performance according to model size as well as signal type and bandwidth, our results shed light on how different INR and grid representations allocate their capacity. We find that, for most tasks and signals, a simple regularized grid with interpolation trains faster and to higher quality than any INR with the same number of parameters. We also find limited settings where INRs outperform grids – namely fitting signals with underlying lower-dimensional structure such as shape contours – to guide future use of INRs towards the most advantageous applications. Code and synthetic signals used in our analysis are available at this https URL.
zh
[CV-115] Sparse Autoencoders Bridge The Deep Learning Model and The Brain
【速读】:该论文试图解决深度神经网络与人类视觉皮层之间表征对齐的问题,旨在建立两者之间的直接联系以提升模型的可解释性。解决方案的关键在于使用稀疏自编码器(SAEs)将深度学习模型的视觉表征与体素级别的功能性磁共振成像(fMRI)响应直接对齐,通过计算SAE单元激活与皮层fMRI信号的相关性,构建体素字典,并实现模型层与人类腹侧视觉通路的细粒度层次映射。
链接: https://arxiv.org/abs/2506.11123
作者: Ziming Mao,Jia Xu,Zeqi Zheng,Haofang Zheng,Dabing Sheng,Yaochu Jin,Guoyuan Yang
机构: Beijing Institute of Technology (北京理工大学); Zhejiang University (浙江大学); Westlake University (西湖大学)
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注: 54 pages, 41 figures
Abstract:We present SAE-BrainMap, a novel framework that directly aligns deep learning visual model representations with voxel-level fMRI responses using sparse autoencoders (SAEs). First, we train layer-wise SAEs on model activations and compute the correlations between SAE unit activations and cortical fMRI signals elicited by the same natural image stimuli with cosine similarity, revealing strong activation correspondence (maximum similarity up to 0.76). Depending on this alignment, we construct a voxel dictionary by optimally assigning the most similar SAE feature to each voxel, demonstrating that SAE units preserve the functional structure of predefined regions of interest (ROIs) and exhibit ROI-consistent selectivity. Finally, we establish fine-grained hierarchical mapping between model layers and the human ventral visual pathway, also by projecting voxel dictionary activations onto individual cortical surfaces, we visualize the dynamic transformation of the visual information in deep learning models. It is found that ViT-B/16 _CLIP tends to utilize low-level information to generate high-level semantic information in the early layers and reconstructs the low-dimension information later. Our results establish a direct, downstream-task-free bridge between deep neural networks and human visual cortex, offering new insights into model interpretability.
zh
人工智能
[AI-0] racing LLM Reasoning Processes with Strategic Games: A Framework for Planning Revision and Resource-Constrained Decision Making
【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在复杂推理任务中缺乏对内部推理过程(如规划、修正和资源约束下的决策)评估的问题。现有基准测试主要关注最终结果,而忽视了模型在执行任务时的中间步骤。论文提出的解决方案关键在于引入策略性游戏作为自然的评估环境,通过定义多维指标(如过纠正风险率、修正成功率、改进斜率和超预算比例)来全面评估LLMs的推理能力,从而更深入地理解模型行为并提升其可靠性。
链接: https://arxiv.org/abs/2506.12012
作者: Xiaopeng Yuan,Xingjian Zhang,Ke Xu,Yifan Xu,Lijun Yu,Jindong Wang,Yushun Dong,Haohan Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 7 figures. Under review
Abstract:Large language models (LLMs) are increasingly used for tasks that require complex reasoning. Most benchmarks focus on final outcomes but overlook the intermediate reasoning steps - such as planning, revision, and decision making under resource constraints. We argue that measuring these internal processes is essential for understanding model behavior and improving reliability. We propose using strategic games as a natural evaluation environment: closed, rule-based systems with clear states, limited resources, and automatic feedback. We introduce a framework that evaluates LLMs along three core dimensions: planning, revision, and resource-constrained decision making. To operationalize this, we define metrics beyond win rate, including overcorrection risk rate, correction success rate, improvement slope, and over-budget ratio. In 4320 adversarial rounds across 12 leading models, ChatGPT-o3-mini achieves the top composite score, with a win rate of 74.7 percent, a correction success rate of 78.6 percent, and an improvement slope of 0.041. By contrast, Qwen-Plus, despite an overcorrection risk rate of 81.6 percent, wins only 25.6 percent of its matches - primarily due to excessive resource use. We also observe a negative correlation between overcorrection risk rate and correction success rate (Pearson r = -0.51, p = 0.093), suggesting that more frequent edits do not always improve outcomes. Our findings highlight the value of assessing not only what LLMs decide but how they arrive at those decisions
zh
[AI-1] Reimagining Dance: Real-time Music Co-creation between Dancers and AI
【速读】:该论文试图解决传统舞蹈表演中动作与音乐之间单向关系的问题,即动作通常仅作为对音乐的响应。其解决方案的关键在于提出一种多模态架构,使舞者能够通过自身动作动态塑造音乐环境,从而实现动作与音乐之间的双向创造性合作。该系统通过智能整合预录音乐片段来生成连贯的音乐作品,使舞者同时扮演表演者和作曲者的角色,并通过分析表演数据揭示动作特征与音频特征之间的新兴沟通模式。
链接: https://arxiv.org/abs/2506.12008
作者: Olga Vechtomova,Jeff Bos
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
备注: Accepted for publication at ICCC 2025 (International Conference on Computational Creativity)
Abstract:Dance performance traditionally follows a unidirectional relationship where movement responds to music. While AI has advanced in various creative domains, its application in dance has primarily focused on generating choreography from musical input. We present a system that enables dancers to dynamically shape musical environments through their movements. Our multi-modal architecture creates a coherent musical composition by intelligently combining pre-recorded musical clips in response to dance movements, establishing a bidirectional creative partnership where dancers function as both performers and composers. Through correlation analysis of performance data, we demonstrate emergent communication patterns between movement qualities and audio features. This approach reconceptualizes the role of AI in performing arts as a responsive collaborator that expands possibilities for both professional dance performance and improvisational artistic expression across broader populations.
zh
[AI-2] Upgrade or Switch: Do We Need a New Registry Architecture for the Internet of AI Agents ?
【速读】:该论文试图解决自主AI代理(autonomous AI agents)对现有互联网基础设施的挑战,这些代理能够主动发起行动、维持持久状态、生成子代理并与对等方直接协商,从而需要毫秒级发现、即时凭证撤销和加密行为证明,而当前的DNS/PKI体系无法满足这些需求。解决方案的关键在于识别现有基础设施的瓶颈,包括DNS传播延迟(24-48小时 vs. 毫秒级需求)、证书撤销无法扩展至万亿级实体以及IPv4/IPv6地址不足以支持代理规模的路由。论文评估了三种方法:升级路径、切换选项和混合注册机制,并指出代理需求属于质变而非量变,因此混合方案可能成为主流,结合集中式注册用于关键代理和联邦网络用于特定应用场景。
链接: https://arxiv.org/abs/2506.12003
作者: Ramesh Raskar,Pradyumna Chari,Jared James Grogan,Mahesh Lambe,Robert Lincourt,Raghu Bala,Abhishek Singh,Ayush Chopra,Rajesh Ranjan,Shailja Gupta,Dimitris Stripelis,Maria Gorskikh,Sichao Wang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:The emerging Internet of AI Agents challenges existing web infrastructure designed for human-scale, reactive interactions. Unlike traditional web resources, autonomous AI agents initiate actions, maintain persistent state, spawn sub-agents, and negotiate directly with peers: demanding millisecond-level discovery, instant credential revocation, and cryptographic behavioral proofs that exceed current DNS/PKI capabilities. This paper analyzes whether to upgrade existing infrastructure or implement purpose-built registry architectures for autonomous agents. We identify critical failure points: DNS propagation (24-48 hours vs. required milliseconds), certificate revocation unable to scale to trillions of entities, and IPv4/IPv6 addressing inadequate for agent-scale routing. We evaluate three approaches: (1) Upgrade paths, (2) Switch options, (3) Hybrid registries. Drawing parallels to dialup-to-broadband transitions, we find that agent requirements constitute qualitative, and not incremental, changes. While upgrades offer compatibility and faster deployment, clean-slate solutions provide better performance but require longer for adoption. Our analysis suggests hybrid approaches will emerge, with centralized registries for critical agents and federated meshes for specialized use cases.
zh
[AI-3] chnical Evaluation of a Disruptive Approach in Homomorphic AI
【速读】:该论文试图解决在保持数据加密安全性的同时,仍能使用现有人工智能算法对数据进行分析和处理的问题。传统同态加密方案在保证数据隐私性的同时,往往导致计算性能下降和算法兼容性问题。HbHAI(Hash-based Homomorphic Artificial Intelligence)作为一项新的、颠覆性的密码学方法,其关键在于引入了一种依赖于密钥的哈希函数,该函数能够自然地保留大多数人工智能算法所依赖的相似性特性,从而实现了在加密数据上直接应用原有AI算法而无需修改,显著提升了性能与实用性。
链接: https://arxiv.org/abs/2506.11954
作者: Eric Filiol
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: This is the extended version of the talk presented at CyberWiseCon 2025 in Vilnius, Lituania in May 21st-23rd, 2025
Abstract:We present a technical evaluation of a new, disruptive cryptographic approach to data security, known as HbHAI (Hash-based Homomorphic Artificial Intelligence). HbHAI is based on a novel class of key-dependent hash functions that naturally preserve most similarity properties, most AI algorithms rely on. As a main claim, HbHAI makes now possible to analyze and process data in its cryptographically secure form while using existing native AI algorithms without modification, with unprecedented performances compared to existing homomorphic encryption schemes. We tested various HbHAI-protected datasets (non public preview) using traditional unsupervised and supervised learning techniques (clustering, classification, deep neural networks) with classical unmodified AI algorithms. This paper presents technical results from an independent analysis conducted with those different, off-the-shelf AI algorithms. The aim was to assess the security, operability and performance claims regarding HbHAI techniques. As a results, our results confirm most these claims, with only a few minor reservations. Comments: This is the extended version of the talk presented at CyberWiseCon 2025 in Vilnius, Lituania in May 21 ^st -23 ^rd , 2025 Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.11954 [cs.CR] (or arXiv:2506.11954v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2506.11954 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-4] SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies
【速读】:该论文试图解决传统模仿学习(Imitation Learning, IL)方法在执行任务时速度受限于演示数据速度的问题,这限制了机器人系统的任务吞吐量。为了解决这一问题,论文提出了SAIL(Speed Adaptation for Imitation Learning),其关键在于通过四个紧密集成的组件实现视觉-运动策略的超演示速度执行:(1) 保持一致性的动作推理算法以实现高速度下的平滑运动,(2) 高保真跟踪控制器不变的运动目标,(3) 自适应速度调节以根据运动复杂性动态调整执行速度,(4) 动作调度以处理现实世界中的系统延迟。
链接: https://arxiv.org/abs/2506.11948
作者: Nadun Ranawaka Arachchige,Zhenyang Chen,Wonsuhk Jung,Woo Chul Shin,Rohan Bansal,Pierre Barroso,Yu Hang He,Yingyang Celine Lin,Benjamin Joffe,Shreyas Kousik,Danfei Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: The first two authors contributed equally
Abstract:Offline Imitation Learning (IL) methods such as Behavior Cloning are effective at acquiring complex robotic manipulation skills. However, existing IL-trained policies are confined to executing the task at the same speed as shown in demonstration data. This limits the task throughput of a robotic system, a critical requirement for applications such as industrial automation. In this paper, we introduce and formalize the novel problem of enabling faster-than-demonstration execution of visuomotor policies and identify fundamental challenges in robot dynamics and state-action distribution shifts. We instantiate the key insights as SAIL (Speed Adaptation for Imitation Learning), a full-stack system integrating four tightly-connected components: (1) a consistency-preserving action inference algorithm for smooth motion at high speed, (2) high-fidelity tracking of controller-invariant motion targets, (3) adaptive speed modulation that dynamically adjusts execution speed based on motion complexity, and (4) action scheduling to handle real-world system latencies. Experiments on 12 tasks across simulation and two real, distinct robot platforms show that SAIL achieves up to a 4x speedup over demonstration speed in simulation and up to 3.2x speedup in the real world. Additional detail is available at this https URL
zh
[AI-5] Subjective Experience in AI Systems: What Do AI Researchers and the Public Believe?
【速读】:该论文试图解决关于具有主观体验的AI系统(Subjective Experience AI)未来发展及其伦理治理的问题,旨在探讨AI研究人员与公众对此类系统存在可能性的看法、态度及相应的治理需求。解决方案的关键在于通过调查分析不同群体对主观体验AI的预期时间线、道德责任、权利保护以及风险管理的共识与分歧,从而为未来相关政策和伦理框架的制定提供依据。
链接: https://arxiv.org/abs/2506.11945
作者: Noemi Dreksler,Lucius Caviola,David Chalmers,Carter Allen,Alex Rand,Joshua Lewis,Philip Waggoner,Kate Mays,Jeff Sebo
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 109 pages, 27 figures
Abstract:We surveyed 582 AI researchers who have published in leading AI venues and 838 nationally representative US participants about their views on the potential development of AI systems with subjective experience and how such systems should be treated and governed. When asked to estimate the chances that such systems will exist on specific dates, the median responses were 1% (AI researchers) and 5% (public) by 2024, 25% and 30% by 2034, and 70% and 60% by 2100, respectively. The median member of the public thought there was a higher chance that AI systems with subjective experience would never exist (25%) than the median AI researcher did (10%). Both groups perceived a need for multidisciplinary expertise to assess AI subjective experience. Although support for welfare protections for such AI systems exceeded opposition, it remained far lower than support for protections for animals or the environment. Attitudes toward moral and governance issues were divided in both groups, especially regarding whether such systems should be created and what rights or protections they should receive. Yet a majority of respondents in both groups agreed that safeguards against the potential risks from AI systems with subjective experience should be implemented by AI developers now, and if created, AI systems with subjective experience should treat others well, behave ethically, and be held accountable. Overall, these results suggest that both AI researchers and the public regard the emergence of AI systems with subjective experience as a possibility this century, though substantial uncertainty and disagreement remain about the timeline and appropriate response.
zh
[AI-6] odays Cat Is Tomorrows Dog: Accounting for Time-Based Changes in the Labels of ML Vulnerability Detection Approaches
【速读】:该论文试图解决机器学习(ML)模型在漏洞检测任务中由于训练和测试数据标签随时间变化而导致的性能评估偏差问题。传统方法要么依赖于整个历史数据(过于乐观),要么仅考虑连续版本间的差异(过于保守),未能反映实际场景中标签随时间动态变化的特点。解决方案的关键在于将数据集重新构造为一系列随时间演变的子数据集,使得训练和测试标签能够反映不同时期的知识状态,从而更真实地评估模型的学习能力。通过这种方法,可以利用Mann-Kendall检验验证模型是否随时间推移和数据量增加而表现出性能提升的趋势。
链接: https://arxiv.org/abs/2506.11939
作者: Ranindya Paramitha,Yuan Feng,Fabio Massacci
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted at The ACM International Conference on the Foundations of Software Engineering (FSE) 2025. Published in the Proceedings of the ACM on Software Engineering (PACMSE), Issue FSE 2025
Abstract:Vulnerability datasets used for ML testing implicitly contain retrospective information. When tested on the field, one can only use the labels available at the time of training and testing (e.g. seen and assumed negatives). As vulnerabilities are discovered across calendar time, labels change and past performance is not necessarily aligned with future performance. Past works only considered the slices of the whole history (e.g. DiverseVUl) or individual differences between releases (e.g. Jimenez et al. ESEC/FSE 2019). Such approaches are either too optimistic in training (e.g. the whole history) or too conservative (e.g. consecutive releases). We propose a method to restructure a dataset into a series of datasets in which both training and testing labels change to account for the knowledge available at the time. If the model is actually learning, it should improve its performance over time as more data becomes available and data becomes more stable, an effect that can be checked with the Mann-Kendall test. We validate our methodology for vulnerability detection with 4 time-based datasets (3 projects from BigVul dataset + Vuldeepecker’s NVD) and 5 ML models (Code2Vec, CodeBERT, LineVul, ReGVD, and Vuldeepecker). In contrast to the intuitive expectation (more retrospective information, better performance), the trend results show that performance changes inconsistently across the years, showing that most models are not learning.
zh
[AI-7] Breaking Habits: On the Role of the Advantage Function in Learning Causal State Representations
【速读】:该论文试图解决强化学习智能体在训练过程中可能产生的策略混淆(policy confounding)问题,即智能体的策略会引发奖励与观测之间的虚假相关性,从而影响其泛化能力。解决方案的关键在于利用优势函数(advantage function),该函数不仅能够降低梯度估计的方差,还能通过相对于状态表示调整动作价值,削弱当前策略下更可能出现的状态-动作对,从而打破虚假相关性并促使智能体关注因果因素。
链接: https://arxiv.org/abs/2506.11912
作者: Miguel Suau
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent work has shown that reinforcement learning agents can develop policies that exploit spurious correlations between rewards and observations. This phenomenon, known as policy confounding, arises because the agent’s policy influences both past and future observation variables, creating a feedback loop that can hinder the agent’s ability to generalize beyond its usual trajectories. In this paper, we show that the advantage function, commonly used in policy gradient methods, not only reduces the variance of gradient estimates but also mitigates the effects of policy confounding. By adjusting action values relative to the state representation, the advantage function downweights state-action pairs that are more likely under the current policy, breaking spurious correlations and encouraging the agent to focus on causal factors. We provide both analytical and empirical evidence demonstrating that training with the advantage function leads to improved out-of-trajectory performance.
zh
[AI-8] Spectra-to-Structure and Structure-to-Spectra Inference Across the Periodic Table
【速读】:该论文试图解决X射线吸收光谱(X-ray Absorption Spectroscopy, XAS)在解释过程中依赖专家分析、计算成本高以及元素特异性启发式方法的问题。其解决方案的关键在于提出XAStruct,一个能够从晶体结构预测XAS谱并从XAS输入推断局部结构描述符的学习框架。该框架基于覆盖70多种元素的大规模数据集进行训练,实现了对多种化学环境和键合情况的泛化能力,并引入了首个直接从XAS谱预测邻近原子类型的机器学习方法,以及无需元素特异性调优的统一回归模型来预测平均最近邻距离。
链接: https://arxiv.org/abs/2506.11908
作者: Yufeng Wang,Peiyao Wang,Lu Ma,Yuewei Lin,Qun Liu,Haibin Ling
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:X-ray Absorption Spectroscopy (XAS) is a powerful technique for probing local atomic environments, yet its interpretation remains limited by the need for expert-driven analysis, computationally expensive simulations, and element-specific heuristics. Recent advances in machine learning have shown promise for accelerating XAS interpretation, but many existing models are narrowly focused on specific elements, edge types, or spectral regimes. In this work, we present XAStruct, a learning framework capable of both predicting XAS spectra from crystal structures and inferring local structural descriptors from XAS input. XAStruct is trained on a large-scale dataset spanning over 70 elements across the periodic table, enabling generalization to a wide variety of chemistries and bonding environments. The model includes the first machine learning approach for predicting neighbor atom types directly from XAS spectra, as well as a unified regression model for mean nearest-neighbor distance that requires no element-specific tuning. While we explored integrating the two pipelines into a single end-to-end model, empirical results showed performance degradation. As a result, the two tasks were trained independently to ensure optimal accuracy and task-specific performance. By combining deep neural networks for complex structure-property mappings with efficient baseline models for simpler tasks, XAStruct offers a scalable and extensible solution for data-driven XAS analysis and local structure inference. The source code will be released upon paper acceptance.
zh
[AI-9] A Neural Rejection System Against Universal Adversarial Perturbations in Radio Signal Classification
【速读】:该论文试图解决深度学习在无线电信号分类中面临的对抗样本(adversarial examples)问题,特别是针对具有数据无关特性的通用对抗扰动(universal adversarial perturbation)的攻击。解决方案的关键在于提出一种称为神经拒绝系统(neural rejection system)的防御机制,该机制能够显著提高对通用对抗扰动的防御准确率。
链接: https://arxiv.org/abs/2506.11901
作者: Lu Zhang,Sangarapillai Lambotharan,Gan Zheng,Fabio Roli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Advantages of deep learning over traditional methods have been demonstrated for radio signal classification in the recent years. However, various researchers have discovered that even a small but intentional feature perturbation known as adversarial examples can significantly deteriorate the performance of the deep learning based radio signal classification. Among various kinds of adversarial examples, universal adversarial perturbation has gained considerable attention due to its feature of being data independent, hence as a practical strategy to fool the radio signal classification with a high success rate. Therefore, in this paper, we investigate a defense system called neural rejection system to propose against universal adversarial perturbations, and evaluate its performance by generating white-box universal adversarial perturbations. We show that the proposed neural rejection system is able to defend universal adversarial perturbations with significantly higher accuracy than the undefended deep neural network.
zh
[AI-10] Attention-based Adversarial Robust Distillation in Radio Signal Classifications for Low-Power IoT Devices
【速读】:该论文试图解决基于Transformer的调制分类系统在面对对抗样本(adversarial examples)时的脆弱性问题。解决方案的关键在于提出一种紧凑型Transformer架构,该架构能够通过迁移来自鲁棒训练的大型Transformer的对抗注意力图(adversarial attention map),从而增强对对抗攻击的鲁棒性。该方法在白盒场景下优于现有的先进技术,具有潜在的防御对抗样本转移的能力。
链接: https://arxiv.org/abs/2506.11892
作者: Lu Zhang,Sangarapillai Lambotharan,Gan Zheng,Guisheng Liao,Basil AsSadhan,Fabio Roli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Due to great success of transformers in many applications such as natural language processing and computer vision, transformers have been successfully applied in automatic modulation classification. We have shown that transformer-based radio signal classification is vulnerable to imperceptible and carefully crafted attacks called adversarial examples. Therefore, we propose a defense system against adversarial examples in transformer-based modulation classifications. Considering the need for computationally efficient architecture particularly for Internet of Things (IoT)-based applications or operation of devices in environment where power supply is limited, we propose a compact transformer for modulation classification. The advantages of robust training such as adversarial training in transformers may not be attainable in compact transformers. By demonstrating this, we propose a novel compact transformer that can enhance robustness in the presence of adversarial attacks. The new method is aimed at transferring the adversarial attention map from the robustly trained large transformer to a compact transformer. The proposed method outperforms the state-of-the-art techniques for the considered white-box scenarios including fast gradient method and projected gradient descent attacks. We have provided reasoning of the underlying working mechanisms and investigated the transferability of the adversarial examples between different architectures. The proposed method has the potential to protect the transformer from the transferability of adversarial examples.
zh
[AI-11] Enter: Graduated Realism: A Pedagogical Framework for AI-Powered Avatars in Virtual Reality Teacher Training
【速读】:该论文试图解决虚拟现实(Virtual Reality)教师培训中AI驱动的学生虚拟角色(avatar)真实度与教学有效性之间的平衡问题,即如何确定最佳的虚拟角色真实度以实现有效的教学。解决方案的关键在于提出“渐进式真实度”(Graduated Realism)框架,主张从低真实度的虚拟角色开始,随着学习者技能的发展逐步增加行为复杂性,从而减少认知负荷并促进支架式学习。此外,为实现该框架的计算可行性,论文还提出了一种名为“Crazy Slots”的单次调用架构,利用概率引擎和检索增强生成数据库,在无需多步骤推理模型的情况下实现实时、真实的响应生成。
链接: https://arxiv.org/abs/2506.11890
作者: Judson Leroy Dean Haynes IV
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Virtual Reality simulators offer a powerful tool for teacher training, yet the integration of AI-powered student avatars presents a critical challenge: determining the optimal level of avatar realism for effective pedagogy. This literature review examines the evolution of avatar realism in VR teacher training, synthesizes its theoretical implications, and proposes a new pedagogical framework to guide future design. Through a systematic review, this paper traces the progression from human-controlled avatars to generative AI prototypes. Applying learning theories like Cognitive Load Theory, we argue that hyper-realism is not always optimal, as high-fidelity avatars can impose excessive extraneous cognitive load on novices, a stance supported by recent empirical findings. A significant gap exists between the technological drive for photorealism and the pedagogical need for scaffolded learning. To address this gap, we propose Graduated Realism, a framework advocating for starting trainees with lower-fidelity avatars and progressively increasing behavioral complexity as skills develop. To make this computationally feasible, we outline a novel single-call architecture, Crazy Slots, which uses a probabilistic engine and a Retrieval-Augmented Generation database to generate authentic, real-time responses without the latency and cost of multi-step reasoning models. This review provides evidence-based principles for designing the next generation of AI simulators, arguing that a pedagogically grounded approach to realism is essential for creating scalable and effective teacher education tools.
zh
[AI-12] An Explainable AI Framework for Dynamic Resource Management in Vehicular Network Slicing
【速读】:该论文旨在解决车联网中多样化服务需求下的资源管理和网络切片问题,特别是针对增强型移动宽带(eMBB)和超可靠低时延通信(URLLC)的服务需求。其解决方案的关键在于提出一种可解释的深度强化学习(XRL)框架,该框架基于近实时无线接入网(RAN)智能控制器,并结合基于特征的方法与注意力机制,以解释和优化强化学习代理的决策,从而提升车联网通信系统的可靠性与可解释性。
链接: https://arxiv.org/abs/2506.11882
作者: Haochen Sun,Yifan Liu,Ahmed Al-Tahmeesschi,Swarna Chetty,Syed Ali Raza Zaidi,Avishek Nag,Hamed Ahmadi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: To appear in Proceedings of IEEE PIMRC 2025. 6 pages, 4 figures
Abstract:Effective resource management and network slicing are essential to meet the diverse service demands of vehicular networks, including Enhanced Mobile Broadband (eMBB) and Ultra-Reliable and Low-Latency Communications (URLLC). This paper introduces an Explainable Deep Reinforcement Learning (XRL) framework for dynamic network slicing and resource allocation in vehicular networks, built upon a near-real-time RAN intelligent controller. By integrating a feature-based approach that leverages Shapley values and an attention mechanism, we interpret and refine the decisions of our reinforcementlearning agents, addressing key reliability challenges in vehicular communication systems. Simulation results demonstrate that our approach provides clear, real-time insights into the resource allocation process and achieves higher interpretability precision than a pure attention mechanism. Furthermore, the Quality of Service (QoS) satisfaction for URLLC services increased from 78.0% to 80.13%, while that for eMBB services improved from 71.44% to 73.21%.
zh
[AI-13] Robust Molecular Property Prediction via Densifying Scarce Labeled Data
【速读】:该论文试图解决分子预测模型在面对分布外(out-of-distribution, OOD)化合物时泛化能力差的问题,这一问题源于模型过度依赖训练数据中的结构,导致在药物发现中关键化合物的预测不稳定且不准确。解决方案的关键在于提出一种基于元学习(meta-learning)的方法,利用未标记数据在分布内(in-distribution, ID)和分布外(OOD)数据之间进行插值,从而使模型能够元学习如何超越训练分布进行泛化。
链接: https://arxiv.org/abs/2506.11877
作者: Jina Kim,Jeffrey Willette,Bruno Andreis,Sung Ju Hwang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:A widely recognized limitation of molecular prediction models is their reliance on structures observed in the training data, resulting in poor generalization to out-of-distribution compounds. Yet in drug discovery, the compounds most critical for advancing research often lie beyond the training set, making the bias toward the training data particularly problematic. This mismatch introduces substantial covariate shift, under which standard deep learning models produce unstable and inaccurate predictions. Furthermore, the scarcity of labeled data, stemming from the onerous and costly nature of experimental validation, further exacerbates the difficulty of achieving reliable generalization. To address these limitations, we propose a novel meta-learning-based approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data, enabling the model to meta-learn how to generalize beyond the training distribution. We demonstrate significant performance gains over state-of-the-art methods on challenging real-world datasets that exhibit substantial covariate shift.
zh
[AI-14] Regression-adjusted Monte Carlo Estimators for Shapley Values and Probabilistic Values
【速读】:该论文试图解决在可解释人工智能中计算概率值(如Shapley值、Banzhaf值和半值)时计算复杂度高、效率低的问题。现有方法依赖于蒙特卡洛采样或线性回归框架,但难以在准确性和效率之间取得平衡。论文提出的解决方案关键在于将这两种技术进行创新性结合,允许使用任何能够高效计算概率值的函数族替代线性回归,从而充分利用树模型(如XGBoost)的准确性,同时保持无偏估计。实验结果表明,该方法在多个数据集上均取得了最先进的性能,显著降低了误差。
链接: https://arxiv.org/abs/2506.11849
作者: R. Teal Witter,Yurong Liu,Christopher Musco
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:With origins in game theory, probabilistic values like Shapley values, Banzhaf values, and semi-values have emerged as a central tool in explainable AI. They are used for feature attribution, data attribution, data valuation, and more. Since all of these values require exponential time to compute exactly, research has focused on efficient approximation methods using two techniques: Monte Carlo sampling and linear regression formulations. In this work, we present a new way of combining both of these techniques. Our approach is more flexible than prior algorithms, allowing for linear regression to be replaced with any function family whose probabilistic values can be computed efficiently. This allows us to harness the accuracy of tree-based models like XGBoost, while still producing unbiased estimates. From experiments across eight datasets, we find that our methods give state-of-the-art performance for estimating probabilistic values. For Shapley values, the error of our methods can be 6.5\times lower than Permutation SHAP (the most popular Monte Carlo method), 3.8\times lower than Kernel SHAP (the most popular linear regression method), and 2.6\times lower than Leverage SHAP (the prior state-of-the-art Shapley value estimator). For more general probabilistic values, we can obtain error 215\times lower than the best estimator from prior work.
zh
[AI-15] rustGLM: Evaluating the Robustness of GraphLLM s Against Prompt Text and Structure Attacks KDD2025
【速读】:该论文旨在解决生成式图学习模型(GraphLLMs)在面对对抗性扰动时的鲁棒性问题,这一问题在高风险应用场景中尤为关键。其解决方案的关键在于提出TrustGLM,通过从文本、图结构和提示三个维度系统评估GraphLLMs的脆弱性,并结合数据增强训练和对抗训练等防御技术,以提升模型的鲁棒性。
链接: https://arxiv.org/abs/2506.11844
作者: Qihai Zhang,Xinyue Sheng,Yuanfu Sun,Qiaoyu Tan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures, in KDD 2025
Abstract:Inspired by the success of large language models (LLMs), there is a significant research shift from traditional graph learning methods to LLM-based graph frameworks, formally known as GraphLLMs. GraphLLMs leverage the reasoning power of LLMs by integrating three key components: the textual attributes of input nodes, the structural information of node neighborhoods, and task-specific prompts that guide decision-making. Despite their promise, the robustness of GraphLLMs against adversarial perturbations remains largely unexplored-a critical concern for deploying these models in high-stakes scenarios. To bridge the gap, we introduce TrustGLM, a comprehensive study evaluating the vulnerability of GraphLLMs to adversarial attacks across three dimensions: text, graph structure, and prompt manipulations. We implement state-of-the-art attack algorithms from each perspective to rigorously assess model resilience. Through extensive experiments on six benchmark datasets from diverse domains, our findings reveal that GraphLLMs are highly susceptible to text attacks that merely replace a few semantically similar words in a node’s textual attribute. We also find that standard graph structure attack methods can significantly degrade model performance, while random shuffling of the candidate label set in prompt templates leads to substantial performance drops. Beyond characterizing these vulnerabilities, we investigate defense techniques tailored to each attack vector through data-augmented training and adversarial training, which show promising potential to enhance the robustness of GraphLLMs. We hope that our open-sourced library will facilitate rapid, equitable evaluation and inspire further innovative research in this field.
zh
[AI-16] Revealing Political Bias in LLM s through Structured Multi-Agent Debate
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在模拟社会行为时存在的政治偏见及其在辩论中的互动动态问题。其解决方案的关键在于构建一个结构化的多智能体辩论框架,通过让中立、共和党和民主党美国LLM代理就政治敏感话题进行辩论,系统地研究LLM类型、代理性别属性以及辩论形式对政治偏见的影响,从而揭示模型来源和代理人格角色如何影响辩论过程中政治偏见和态度的演变。
链接: https://arxiv.org/abs/2506.11825
作者: Aishwarya Bandaru,Fabian Bindley,Trevor Bluth,Nandini Chavda,Baixu Chen,Ethan Law
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注:
Abstract:Large language models (LLMs) are increasingly used to simulate social behaviour, yet their political biases and interaction dynamics in debates remain underexplored. We investigate how LLM type and agent gender attributes influence political bias using a structured multi-agent debate framework, by engaging Neutral, Republican, and Democrat American LLM agents in debates on politically sensitive topics. We systematically vary the underlying LLMs, agent genders, and debate formats to examine how model provenance and agent personas influence political bias and attitudes throughout debates. We find that Neutral agents consistently align with Democrats, while Republicans shift closer to the Neutral; gender influences agent attitudes, with agents adapting their opinions when aware of other agents’ genders; and contrary to prior research, agents with shared political affiliations can form echo chambers, exhibiting the expected intensification of attitudes as debates progress.
zh
[AI-17] Abstract Sound Fusion with Unconditioned Inversion Model
【速读】:该论文试图解决如何通过声学融合生成具有超越简单叠加特性的新颖声音的问题,即生成式 AI (Generative AI) 在声音合成中难以实现的复杂 auditory features(听觉特征)的合成。解决方案的关键在于采用基于 DPMSolver++ 采样器的新型 SDE 和 ODE 反演模型,通过将模型输出配置为常量来逆向采样过程,从而消除由噪声预测项引起的循环依赖,同时在不依赖提示条件的情况下保持采样过程中的灵活引导。
链接: https://arxiv.org/abs/2506.11811
作者: Jing Liu,EnQi Lian
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:An abstract sound is defined as a sound that does not disclose identifiable real-world sound events to a listener. Sound fusion aims to synthesize an original sound and a reference sound to generate a novel sound that exhibits auditory features beyond mere additive superposition of the sound constituents. To achieve this fusion, we employ inversion techniques that preserve essential features of the original sample while enabling controllable synthesis. We propose novel SDE and ODE inversion models based on DPMSolver++ samplers that reverse the sampling process by configuring model outputs as constants, eliminating circular dependencies incurred by noise prediction terms. Our inversion approach requires no prompt conditioning while maintaining flexible guidance during sampling.
zh
[AI-18] Why Do Class-Dependent Evaluation Effects Occur with Time Series Feature Attributions? A Synthetic Data Investigation
【速读】:该论文试图解决可解释人工智能(Explainable AI, XAI)中特征归因方法评估的可靠性问题,特别是在缺乏真实标签(ground truth)的情况下,基于扰动的评估指标可能因类别不同而表现出差异性,即“类别依赖性评估效应”。这种现象引发了对扰动分析是否能准确衡量归因质量的质疑。论文的解决方案关键在于通过控制实验,利用已知真实特征位置的合成时间序列数据,系统地分析不同特征类型和类别对比下扰动降级分数与基于真实标签的精度-召回率指标之间的关系,从而揭示类别依赖性效应的产生条件及其对评估结果的影响。
链接: https://arxiv.org/abs/2506.11790
作者: Gregor Baer,Isel Grau,Chao Zhang,Pieter Van Gorp
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Evaluating feature attribution methods represents a critical challenge in explainable AI (XAI), as researchers typically rely on perturbation-based metrics when ground truth is unavailable. However, recent work demonstrates that these evaluation metrics can show different performance across predicted classes within the same dataset. These “class-dependent evaluation effects” raise questions about whether perturbation analysis reliably measures attribution quality, with direct implications for XAI method development and the trustworthiness of evaluation techniques. We investigate under which conditions these class-dependent effects arise by conducting controlled experiments with synthetic time series data where ground truth feature locations are known. We systematically vary feature types and class contrasts across binary classification tasks, then compare perturbation-based degradation scores with ground truth-based precision-recall metrics using multiple attribution methods. Our experiments demonstrate that class-dependent effects emerge with both evaluation approaches even in simple scenarios with temporally localized features, triggered by basic variations in feature amplitude or temporal extent between classes. Most critically, we find that perturbation-based and ground truth metrics frequently yield contradictory assessments of attribution quality across classes, with weak correlations between evaluation approaches. These findings suggest that researchers should interpret perturbation-based metrics with care, as they may not always align with whether attributions correctly identify discriminating features. These findings reveal opportunities to reconsider what attribution evaluation actually measures and to develop more comprehensive evaluation frameworks that capture multiple dimensions of attribution quality.
zh
[AI-19] FeNN: A RISC-V vector processor for Spiking Neural Network acceleration
【速读】:该论文试图解决传统加速器(如GPU和TPU)在模拟脉冲神经网络(Spiking Neural Networks, SNNs)时能效比低的问题,因为这些加速器是为高算术强度的标准人工神经网络(Artificial Neural Networks, ANNs)设计的。解决方案的关键在于提出一种基于RISC-V的可编程软向量处理器(FeNN),该处理器专为在FPGA上模拟SNN而设计,通过使用随机舍入和饱和技术,在低硬件利用率下实现高数值精度,并且单个FeNN核心在模拟SNN分类器方面比嵌入式GPU和Loihi类脑系统更快。
链接: https://arxiv.org/abs/2506.11760
作者: Zainab Aizaz,James C. Knight,Thomas Nowotny
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: 7 pages, 4 figures. Accepted in Proceedings of Neuro Inspired Computational Elements Conference 2025
Abstract:Spiking Neural Networks (SNNs) have the potential to drastically reduce the energy requirements of AI systems. However, mainstream accelerators like GPUs and TPUs are designed for the high arithmetic intensity of standard ANNs so are not well-suited to SNN simulation. FPGAs are well-suited to applications with low arithmetic intensity as they have high off-chip memory bandwidth and large amounts of on-chip memory. Here, we present a novel RISC-V-based soft vector processor (FeNN), tailored to simulating SNNs on FPGAs. Unlike most dedicated neuromorphic hardware, FeNN is fully programmable and designed to be integrated with applications running on standard computers from the edge to the cloud. We demonstrate that, by using stochastic rounding and saturation, FeNN can achieve high numerical precision with low hardware utilisation and that a single FeNN core can simulate an SNN classifier faster than both an embedded GPU and the Loihi neuromorphic system.
zh
[AI-20] Causal Effect Identification in Heterogeneous Environments from Higher-Order Moments
【速读】:该论文试图解决在存在潜在混杂因素的情况下,如何估计处理变量对结果的因果效应的问题。其解决方案的关键在于利用多环境数据的不变性,即当目标因果效应在不同环境中保持不变时,可以通过特定条件实现因果效应的可识别性,并提出基于矩的方法进行估计,前提是数据生成机制中仅有一个参数(如外生噪声分布或两个变量间的因果关系)在不同环境中发生变化。
链接: https://arxiv.org/abs/2506.11756
作者: Yaroslav Kivva,Sina Akbari,Saber Salehkaleybar,Negar Kiyavash
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR)
备注:
Abstract:We investigate the estimation of the causal effect of a treatment variable on an outcome in the presence of a latent confounder. We first show that the causal effect is identifiable under certain conditions when data is available from multiple environments, provided that the target causal effect remains invariant across these environments. Secondly, we propose a moment-based algorithm for estimating the causal effect as long as only a single parameter of the data-generating mechanism varies across environments – whether it be the exogenous noise distribution or the causal relationship between two variables. Conversely, we prove that identifiability is lost if both exogenous noise distributions of both the latent and treatment variables vary across environments. Finally, we propose a procedure to identify which parameter of the data-generating mechanism has varied across the environments and evaluate the performance of our proposed methods through experiments on synthetic data.
zh
[AI-21] Relational GNNs Cannot Learn C_2 Features for Planning
【速读】:该论文试图解决关系图神经网络(R-GNNs)在规划领域中无法学习由C₂特征定义的价值函数的问题,尽管已有实证结果表明其具有一定的泛化能力。论文的关键在于揭示R-GNNs在理论上无法捕捉C₂特征所表达的价值函数,并指出先前的图神经网络(GNN)架构可能在学习此类价值函数方面表现更优。
链接: https://arxiv.org/abs/2506.11721
作者: Dillon Z. Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Relational Graph Neural Networks (R-GNNs) are a GNN-based approach for learning value functions that can generalise to unseen problems from a given planning domain. R-GNNs were theoretically motivated by the well known connection between the expressive power of GNNs and C_2 , first-order logic with two variables and counting. In the context of planning, C_2 features refer to the set of formulae in C_2 with relations defined by the unary and binary predicates of a planning domain. Some planning domains exhibit optimal value functions that can be decomposed as arithmetic expressions of C_2 features. We show that, contrary to empirical results, R-GNNs cannot learn value functions defined by C_2 features. We also identify prior GNN architectures for planning that may better learn value functions defined by C_2 features.
zh
[AI-22] Interaction Process Infrastructure: A Unified Architecture for Human-Agent Collaboration
【速读】:该论文试图解决当前人工智能工具在专业知识工作中缺乏持续、适应性协作架构的问题,即现有系统虽然具备一定的能力,但仅能支持孤立任务,未能构建出支持长期协作的体系结构。解决方案的关键在于提出一个分层框架,该框架整合了交互、流程和基础设施三个相互依赖的维度,并通过将流程作为核心关注点,使其显式化、可检查和可调整,从而实现人类与智能体在动态目标下的对齐与协调。
链接: https://arxiv.org/abs/2506.11718
作者: Yun Wang,Yan Lu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:As AI tools proliferate across domains, from chatbots and copilots to emerging agents, they increasingly support professional knowledge work. Yet despite their growing capabilities, these systems remain fragmented: they assist with isolated tasks but lack the architectural scaffolding for sustained, adaptive collaboration. We propose a layered framework for human-agent systems that integrates three interdependent dimensions: interaction, process, and infrastructure. Crucially, our architecture elevates process to a primary focus by making it explicit, inspectable, and adaptable, enabling humans and agents to align with evolving goals and coordinate over time. This model clarifies limitations of current tools, unifies emerging system design approaches, and reveals new opportunities for researchers and AI system builders. By grounding intelligent behavior in structured collaboration, we reimagine human-agent collaboration not as task-specific augmentation, but as a form of coherent and aligned system for real-world work.
zh
[AI-23] Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中的幻觉问题,即模型在生成回答时可能产生与输入内容不符或缺乏依据的陈述。现有方法通过视觉导向的对比目标来增强MLLMs对视觉输入的关注,从而减少幻觉,但存在优化目标函数不严谨和偏好监督间接的问题。该论文提出的解决方案是对称多模态偏好优化(Symmetric Multimodal Preference Optimization, SymMPO),其关键在于采用直接偏好监督(即响应对)进行对称偏好学习,以提升视觉理解能力,同时保持与标准DPO的严格理论一致性,并引入偏好边界一致性损失以定量调控对称偏好对之间的偏好差距。
链接: https://arxiv.org/abs/2506.11712
作者: Wenqi Liu,Xuemeng Song,Jiaxi Li,Yinwei Wei,Na Zheng,Jianhua Yin,Liqiang Nie
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Direct Preference Optimization (DPO) has emerged as an effective approach for mitigating hallucination in Multimodal Large Language Models (MLLMs). Although existing methods have achieved significant progress by utilizing vision-oriented contrastive objectives for enhancing MLLMs’ attention to visual inputs and hence reducing hallucination, they suffer from non-rigorous optimization objective function and indirect preference supervision. To address these limitations, we propose a Symmetric Multimodal Preference Optimization (SymMPO), which conducts symmetric preference learning with direct preference supervision (i.e., response pairs) for visual understanding enhancement, while maintaining rigorous theoretical alignment with standard DPO. In addition to conventional ordinal preference learning, SymMPO introduces a preference margin consistency loss to quantitatively regulate the preference gap between symmetric preference pairs. Comprehensive evaluation across five benchmarks demonstrate SymMPO’s superior performance, validating its effectiveness in hallucination mitigation of MLLMs.
zh
[AI-24] Differential Privacy in Machine Learning: From Symbolic AI to LLM s
【速读】:该论文试图解决机器学习模型在训练过程中可能泄露敏感信息的问题,其核心挑战在于如何在保证模型性能的同时保护数据隐私。解决方案的关键在于差分隐私(Differential Privacy, DP),它通过向算法的输出添加噪声,确保单个数据点的加入或移除对结果的影响微乎其微,从而有效限制私密信息的暴露。
链接: https://arxiv.org/abs/2506.11687
作者: Francisco Aguilera-Martínez,Fernando Berzal
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: arXiv admin note: text overlap with arXiv:2303.00654 by other authors
Abstract:Machine learning models should not reveal particular information that is not otherwise accessible. Differential privacy provides a formal framework to mitigate privacy risks by ensuring that the inclusion or exclusion of any single data point does not significantly alter the output of an algorithm, thus limiting the exposure of private information. This survey paper explores the foundational definitions of differential privacy, reviews its original formulations and tracing its evolution through key research contributions. It then provides an in-depth examination of how DP has been integrated into machine learning models, analyzing existing proposals and methods to preserve privacy when training ML models. Finally, it describes how DP-based ML techniques can be evaluated in practice. %Finally, it discusses the broader implications of DP, highlighting its potential for public benefit, its real-world applications, and the challenges it faces, including vulnerabilities to adversarial attacks. By offering a comprehensive overview of differential privacy in machine learning, this work aims to contribute to the ongoing development of secure and responsible AI systems.
zh
[AI-25] LLM s on support of privacy and security of mobile apps: state of the art and research directions
【速读】:该论文试图解决移动应用生态系统中安全风险和隐私泄露的问题,特别是在面对日益复杂的威胁时,传统分析方法(如动态和混合分析)的不足。解决方案的关键在于应用大型语言模型(Large Language Models, LLMs),通过其强大的语义理解和生成能力,识别并缓解智能手机平台上的主要安全风险,例如敏感数据在用户在线分享图像时的泄露问题。
链接: https://arxiv.org/abs/2506.11679
作者: Tran Thanh Lam Nguyen,Barbara Carminati,Elena Ferrari
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern life has witnessed the explosion of mobile devices. However, besides the valuable features that bring convenience to end users, security and privacy risks still threaten users of mobile apps. The increasing sophistication of these threats in recent years has underscored the need for more advanced and efficient detection approaches. In this chapter, we explore the application of Large Language Models (LLMs) to identify security risks and privacy violations and mitigate them for the mobile application ecosystem. By introducing state-of-the-art research that applied LLMs to mitigate the top 10 common security risks of smartphone platforms, we highlight the feasibility and potential of LLMs to replace traditional analysis methods, such as dynamic and hybrid analysis of mobile apps. As a representative example of LLM-based solutions, we present an approach to detect sensitive data leakage when users share images online, a common behavior of smartphone users nowadays. Finally, we discuss open research challenges.
zh
[AI-26] Robot Context Protocol (RCP): A Runtime-Agnostic Interface for Agent -Aware Robot Control
【速读】:该论文试图解决机器人系统中复杂性和异构性带来的通信与交互难题,旨在通过一种轻量级、与中间件无关的通信协议简化机器人系统的集成与操作。解决方案的关键在于设计了一种基于HTTP和WebSocket传输层的模式驱动消息格式,提供统一且语义明确的接口,实现客户端操作与后端实现的解耦,并支持多种部署环境。此外,RCP集成了运行时内省、异步反馈、多租户命名空间隔离和严格类型验证等特性,以确保协议的鲁棒性、可扩展性和安全性。
链接: https://arxiv.org/abs/2506.11650
作者: Lambert Lee,Joshua Lau
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:The Robot Context Protocol (RCP) is a lightweight, middleware-agnostic communication protocol designed to simplify the complexity of robotic systems and enable seamless interaction between robots, users, and autonomous agents. RCP provides a unified and semantically meaningful interface that decouples client-facing operations from backend implementations, supporting a wide range of deployment environments including physical robots, cloud-based orchestrators, and simulated platforms. Built on HTTP and WebSocket transport layers, the protocol defines a schema-driven message format with structured operations such as read, write, execute, and subscribe. It integrates features such as runtime introspection, asynchronous feedback, multi-tenant namespace isolation, and strict type validation to ensure robustness, scalability, and security. The architecture, message structure, interface model, and adapter-based backend integration strategy of RCP are described, along with deployment practices and applicability across industries including manufacturing, logistics, and healthcare. RCP enables intelligent, resilient, and safe robotic operations in complex, multi-agent ecosystems.
zh
[AI-27] FAA Framework: A Large Language Model-Based Approach for Credit Card Fraud Investigations
【速读】:该论文试图解决信用卡欺诈检测中分析师因处理大量警报而产生的警报疲劳问题(alert fatigue),以及由此导致的效率低下和工作负担过重的问题。解决方案的关键在于提出一种欺诈分析助手(fraud analyst assistant, FAA)框架,该框架利用多模态大型语言模型(multi-modal large language models, LLMs)自动化信用卡欺诈调查流程,并生成解释性报告。FAA框架通过LLMs的推理、代码执行和视觉能力,在每个调查步骤中实现规划、证据收集与分析,从而提升调查的可靠性和效率。
链接: https://arxiv.org/abs/2506.11635
作者: Shaun Shuster,Eyal Zaloof,Asaf Shabtai,Rami Puzis
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The continuous growth of the e-commerce industry attracts fraudsters who exploit stolen credit card details. Companies often investigate suspicious transactions in order to retain customer trust and address gaps in their fraud detection systems. However, analysts are overwhelmed with an enormous number of alerts from credit card transaction monitoring systems. Each alert investigation requires from the fraud analysts careful attention, specialized knowledge, and precise documentation of the outcomes, leading to alert fatigue. To address this, we propose a fraud analyst assistant (FAA) framework, which employs multi-modal large language models (LLMs) to automate credit card fraud investigations and generate explanatory reports. The FAA framework leverages the reasoning, code execution, and vision capabilities of LLMs to conduct planning, evidence collection, and analysis in each investigation step. A comprehensive empirical evaluation of 500 credit card fraud investigations demonstrates that the FAA framework produces reliable and efficient investigations comprising seven steps on average. Thus we found that the FAA framework can automate large parts of the workload and help reduce the challenges faced by fraud analysts.
zh
[AI-28] Convergent Linear Representations of Emergent Misalignment
【速读】:该论文试图解决大语言模型在窄数据集上微调后出现的广泛行为偏差问题,即“涌现性偏差”(emergent misalignment)。其关键解决方案是通过构建一个仅使用9个秩-1适配器(rank-1 adapters)的最小模型生物体,研究其偏差产生的机制,并发现不同偏差模型会收敛到相似的偏差表示。通过从微调模型的激活中提取“偏差方向”,并利用高维LoRAs(Low-Rank Adaptation)有效消除偏差行为,进一步揭示了适配器在偏差生成中的作用,为理解与缓解模型偏差提供了新的视角。
链接: https://arxiv.org/abs/2506.11618
作者: Anna Soligo,Edward Turner,Senthooran Rajamanoharan,Neel Nanda
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Fine-tuning large language models on narrow datasets can cause them to develop broadly misaligned behaviours: a phenomena known as emergent misalignment. However, the mechanisms underlying this misalignment, and why it generalizes beyond the training domain, are poorly understood, demonstrating critical gaps in our knowledge of model alignment. In this work, we train and study a minimal model organism which uses just 9 rank-1 adapters to emergently misalign Qwen2.5-14B-Instruct. Studying this, we find that different emergently misaligned models converge to similar representations of misalignment. We demonstrate this convergence by extracting a ‘misalignment direction’ from one fine-tuned model’s activations, and using it to effectively ablate misaligned behaviour from fine-tunes using higher dimensional LoRAs and different datasets. Leveraging the scalar hidden state of rank-1 LoRAs, we further present a set of experiments for directly interpreting the fine-tuning adapters, showing that six contribute to general misalignment, while two specialise for misalignment in just the fine-tuning domain. Emergent misalignment is a particularly salient example of undesirable and unexpected model behaviour and by advancing our understanding of the mechanisms behind it, we hope to move towards being able to better understand and mitigate misalignment more generally.
zh
[AI-29] Model Organisms for Emergent Misalignment
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在对齐(alignment)过程中出现的“涌现不对齐”(Emergent Misalignment, EM)问题,即在特定有害数据集上微调模型可能导致其产生广泛的行为偏差。解决方案的关键在于构建改进的模型生物(model organisms),这些模型使用新的窄范围不对齐数据集,实现了更高的连贯性(99%),并能在更小参数规模的模型(0.5B)中诱导不对齐,同时通过单个秩-1 LoRA适配器实现这一过程。此外,研究通过这些更清洁的模型生物,揭示了机制性相变与行为相变之间的对应关系,为未来理解与缓解LLMs的对齐风险奠定了基础。
链接: https://arxiv.org/abs/2506.11613
作者: Edward Turner,Anna Soligo,Mia Taylor,Senthooran Rajamanoharan,Neel Nanda
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent work discovered Emergent Misalignment (EM): fine-tuning large language models on narrowly harmful datasets can lead them to become broadly misaligned. A survey of experts prior to publication revealed this was highly unexpected, demonstrating critical gaps in our understanding of model alignment. In this work, we both advance understanding and provide tools for future research. Using new narrowly misaligned datasets, we create a set of improved model organisms that achieve 99% coherence (vs. 67% prior), work with smaller 0.5B parameter models (vs. 32B), and that induce misalignment using a single rank-1 LoRA adapter. We demonstrate that EM occurs robustly across diverse model sizes, three model families, and numerous training protocols including full supervised fine-tuning. Leveraging these cleaner model organisms, we isolate a mechanistic phase transition and demonstrate that it corresponds to a robust behavioural phase transition in all studied organisms. Aligning large language models is critical for frontier AI safety, yet EM exposes how far we are from achieving this robustly. By distilling clean model organisms that isolate a minimal alignment-compromising change, and where this is learnt, we establish a foundation for future research into understanding and mitigating alignment risks in LLMs.
zh
[AI-30] GraphRAG -Causal: A novel graph-augmented framework for causal reasoning and annotation in news
【速读】:该论文试图解决新闻分析中复杂隐式因果关系识别的问题,尤其是在数据量有限的情况下传统自然语言处理方法表现不佳的问题。解决方案的关键在于将标注的新闻标题转化为结构化的因果知识图谱,并结合语义嵌入与图结构线索,利用Neo4j实现混合检索,从而准确匹配和检索相关事件。此外,通过三阶段流程——数据准备、图检索和大语言模型推理——实现了基于少量示例的因果关系分类与标记,显著提升了准确性与一致性。
链接: https://arxiv.org/abs/2506.11600
作者: Abdul Haque,Umm e Hani,Ahmad Din,Muhammad Babar,Ali Abbas,Insaf Ullah
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 18 pages, 8 figures
Abstract:GraphRAG-Causal introduces an innovative framework that combines graph-based retrieval with large language models to enhance causal reasoning in news analysis. Traditional NLP approaches often struggle with identifying complex, implicit causal links, especially in low-data scenarios. Our approach addresses these challenges by transforming annotated news headlines into structured causal knowledge graphs. It then employs a hybrid retrieval system that merges semantic embeddings with graph-based structural cues leveraging Neo4j to accurately match and retrieve relevant events. The framework is built on a three-stage pipeline: First, during Data Preparation, news sentences are meticulously annotated and converted into causal graphs capturing cause, effect, and trigger relationships. Next, the Graph Retrieval stage stores these graphs along with their embeddings in a Neo4j database and utilizes hybrid Cypher queries to efficiently identify events that share both semantic and structural similarities with a given query. Finally, the LLM Inference stage utilizes these retrieved causal graphs in a few-shot learning setup with XML-based prompting, enabling robust classification and tagging of causal relationships. Experimental evaluations demonstrate that GraphRAG-Causal achieves an impressive F1-score of 82.1% on causal classification using just 20 few-shot examples. This approach significantly boosts accuracy and consistency, making it highly suitable for real-time applications in news reliability assessment, misinformation detection, and policy analysis.
zh
[AI-31] A Comparative Analysis of Influence Signals for Data Debugging ICML2024
【速读】:该论文试图解决训练数据质量对机器学习(Machine Learning, ML)模型可靠性与性能的影响问题,特别是如何有效检测训练集中存在的错误标签样本(mislabeled samples)和异常样本(anomalous samples)。其解决方案的关键在于利用基于影响(influence-based)的信号来调试训练数据,通过分析样本对模型参数的影响来识别潜在的噪声样本。然而,研究发现现有信号如Self-Influence虽能有效检测错误标签样本,但无法检测异常样本,且部分信号存在影响抵消效应(influence cancellation effects),导致影响归因不准确。因此,该研究强调了考虑训练动态(training dynamics)的重要性,以提升影响信号在不同数据模态和深度学习模型中的检测能力。
链接: https://arxiv.org/abs/2506.11584
作者: Nikolaos Myrtakis,Ioannis Tsamardinos,Vassilis Christophides
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted and presented at the Data-centric Machine Learning Research (DMLR) Workshop at ICML 2024
Abstract:Improving the quality of training samples is crucial for improving the reliability and performance of ML models. In this paper, we conduct a comparative evaluation of influence-based signals for debugging training data. These signals can potentially identify both mislabeled and anomalous samples from a potentially noisy training set as we build the models and hence alleviate the need for dedicated glitch detectors. Although several influence-based signals (e.g., Self-Influence, Average Absolute Influence, Marginal Influence, GD-class) have been recently proposed in the literature, there are no experimental studies for assessing their power in detecting different glitch types (e.g., mislabeled and anomalous samples) under a common influence estimator (e.g., TraceIn) for different data modalities (image and tabular), and deep learning models (trained from scratch or foundation). Through extensive experiments, we show that signals like Self-Influence effectively detect mislabeled samples, but none of the existing signals can detect anomalies. Existing signals do not take into account the training dynamics, i.e., how the samples’ influence on the model changes during training, while some signals fall into influence cancellation effects, i.e., influence score is zero due to unsigned scores accumulation, resulting in misleading influence attribution.
zh
[AI-32] Collaborative LLM Inference via Planning for Efficient Reasoning
【速读】:该论文旨在解决大模型(如参数量超过100B的生成式 AI (Generative AI))因依赖付费API而成本高昂,而小模型(如参数量小于3B的生成式 AI (Generative AI))虽免费且易于部署但推理能力不足之间的权衡问题。其解决方案的关键在于提出一种测试时协作框架,通过一个规划器模型生成高阶抽象的计划,作为轻量级中间步骤引导求解器模型完成完整解决方案的生成,从而在多轮级联中实现大小模型的协同工作,以在保持较高准确性的同时显著降低对付费推理的依赖。
链接: https://arxiv.org/abs/2506.11578
作者: Byeongchan Lee,Jonghoon Lee,Dongyoung Kim,Jaehyung Kim,Jinwoo Shin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) excel at complex reasoning tasks, but those with strong capabilities (e.g., whose numbers of parameters are larger than 100B) are often accessible only through paid APIs, making them too costly for applications of frequent use. In contrast, smaller open-sourced LLMs (e.g., whose numbers of parameters are less than 3B) are freely available and easy to deploy locally (e.g., under a single GPU having 8G VRAM), but lack suff icient reasoning ability. This trade-off raises a natural question: can small (free) and large (costly) models collaborate at test time to combine their strengths? We propose a test-time collaboration framework in which a planner model first generates a plan, defined as a distilled and high-level abstraction of the problem. This plan serves as a lightweight intermediate that guides a reasoner model, which generates a complete solution. Small and large models take turns acting as planner and reasoner, exchanging plans in a multi-round cascade to collaboratively solve complex tasks. Our method achieves accuracy comparable to strong proprietary models alone, while significantly reducing reliance on paid inference. These results highlight planning as an effective prior for orchestrating cost-aware, cross-model inference under real-world deployment constraints. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2506.11578 [cs.AI] (or arXiv:2506.11578v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2506.11578 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-33] Learn to Preserve Personality: Federated Foundation Models in Recommendations
【速读】:该论文试图解决现有基础模型(Foundation Models, FM)在泛化与个性化之间的权衡问题,这一挑战已被多种参数高效适应技术所强调。其解决方案的关键在于联邦基础模型(Federated Foundation Models, FFM),通过去中心化的机制将共享知识与个体特定的适应分离,从而在推荐系统等场景中保持用户个性的完整性。
链接: https://arxiv.org/abs/2506.11563
作者: Zhiwei Li,Guodong Long,Chunxu Zhang,Honglei Zhang,Jing Jiang,Chengqi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 3 figures, conference, position paper
Abstract:A core learning challenge for existed Foundation Models (FM) is striking the tradeoff between generalization with personalization, which is a dilemma that has been highlighted by various parameter-efficient adaptation techniques. Federated foundation models (FFM) provide a structural means to decouple shared knowledge from individual specific adaptations via decentralized processes. Recommendation systems offer a perfect testbed for FFMs, given their reliance on rich implicit feedback reflecting unique user characteristics. This position paper discusses a novel learning paradigm where FFMs not only harness their generalization capabilities but are specifically designed to preserve the integrity of user personality, illustrated thoroughly within the recommendation contexts. We envision future personal agents, powered by personalized adaptive FMs, guiding user decisions on content. Such an architecture promises a user centric, decentralized system where individuals maintain control over their personalized agents.
zh
[AI-34] Identifying Helpful Context for LLM -based Vulnerability Repair: A Preliminary Study
【速读】:该论文旨在解决软件系统中自动化漏洞修复(Automated Vulnerability Repair, AVR)的问题,特别是评估GPT-4o在修复Java漏洞方面的性能,并探索不同上下文信息对修复效果的影响。其解决方案的关键在于设计多种包含不同上下文信息的提示(prompt),如CVE信息和手动提取的代码上下文,并通过实验验证这些提示对修复效果的提升作用,结果显示结合CVE指导与手动代码上下文能够显著提高修复性能,同时表明集成提示策略在零样本设置下具有改进漏洞修复的潜力。
链接: https://arxiv.org/abs/2506.11561
作者: Gábor Antal,Bence Bogenfürst,Rudolf Ferenc,Péter Hegedűs
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in large language models (LLMs) have shown promise for automated vulnerability detection and repair in software systems. This paper investigates the performance of GPT-4o in repairing Java vulnerabilities from a widely used dataset (Vul4J), exploring how different contextual information affects automated vulnerability repair (AVR) capabilities. We compare the latest GPT-4o’s performance against previous results with GPT-4 using identical prompts. We evaluated nine additional prompts crafted by us that contain various contextual information such as CWE or CVE information, and manually extracted code contexts. Each prompt was executed three times on 42 vulnerabilities, and the resulting fix candidates were validated using Vul4J’s automated testing framework. Our results show that GPT-4o performed 11.9% worse on average than GPT-4 with the same prompt, but was able to fix 10.5% more distinct vulnerabilities in the three runs together. CVE information significantly improved repair rates, while the length of the task description had minimal impact. Combining CVE guidance with manually extracted code context resulted in the best performance. Using our \textscTop-3 prompts together, GPT-4o repaired 26 (62%) vulnerabilities at least once, outperforming both the original baseline (40%) and its reproduction (45%), suggesting that ensemble prompt strategies could improve vulnerability repair in zero-shot settings. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.11561 [cs.SE] (or arXiv:2506.11561v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2506.11561 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-35] Leverag ing GPT -4 for Vulnerability-Witnessing Unit Test Generation
【速读】:该论文试图解决软件开发过程中测试用例生成的复杂性和资源消耗问题,特别是针对漏洞检测的单元测试自动生成。其解决方案的关键在于评估GPT-4在没有领域特定预训练的情况下,能否基于存在漏洞的代码及其修复后的代码生成语法正确或语义正确的单元测试用例,并分析代码上下文的影响、GPT-4自我修正能力的有效性以及生成测试用例的主观可用性。研究结果表明,尽管语义正确性验证效果有限,但GPT-4能够生成可进一步开发为功能完整漏洞见证测试的测试模板,从而在部分自动化流程中发挥重要作用。
链接: https://arxiv.org/abs/2506.11559
作者: Gábor Antal,Dénes Bán,Martin Isztin,Rudolf Ferenc,Péter Hegedűs
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:In the life-cycle of software development, testing plays a crucial role in quality assurance. Proper testing not only increases code coverage and prevents regressions but it can also ensure that any potential vulnerabilities in the software are identified and effectively fixed. However, creating such tests is a complex, resource-consuming manual process. To help developers and security experts, this paper explores the automatic unit test generation capability of one of the most widely used large language models, GPT-4, from the perspective of vulnerabilities. We examine a subset of the VUL4J dataset containing real vulnerabilities and their corresponding fixes to determine whether GPT-4 can generate syntactically and/or semantically correct unit tests based on the code before and after the fixes as evidence of vulnerability mitigation. We focus on the impact of code contexts, the effectiveness of GPT-4’s self-correction ability, and the subjective usability of the generated test cases. Our results indicate that GPT-4 can generate syntactically correct test cases 66.5% of the time without domain-specific pre-training. Although the semantic correctness of the fixes could be automatically validated in only 7. 5% of the cases, our subjective evaluation shows that GPT-4 generally produces test templates that can be further developed into fully functional vulnerability-witnessing tests with relatively minimal manual effort. Therefore, despite the limited data, our initial findings suggest that GPT-4 can be effectively used in the generation of vulnerability-witnessing tests. It may not operate entirely autonomously, but it certainly plays a significant role in a partially automated process. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.11559 [cs.SE] (or arXiv:2506.11559v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2506.11559 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-36] Improving Multimodal Learning Balance and Sufficiency through Data Remixing ICML2025
【速读】:该论文试图解决多模态模型在联合训练过程中出现的模态惰性(modality laziness)和模态冲突(modality clash)问题,这些问题导致多模态学习的不足与不平衡。现有方法通过增强弱模态、对齐优化速度或分解多模态学习来提升单模态学习,但未能同时实现单模态充分性和多模态平衡。该论文提出的关键解决方案是多模态数据重混(multimodal Data Remixing),包括解耦多模态数据、过滤每种模态的困难样本以缓解模态不平衡,并通过批次级重新组合对齐梯度方向,避免跨模态干扰,从而提升单模态学习的充分性。
链接: https://arxiv.org/abs/2506.11550
作者: Xiaoyu Ma,Hao Chen,Yongjian Deng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML2025
Abstract:Different modalities hold considerable gaps in optimization trajectories, including speeds and paths, which lead to modality laziness and modality clash when jointly training multimodal models, resulting in insufficient and imbalanced multimodal learning. Existing methods focus on enforcing the weak modality by adding modality-specific optimization objectives, aligning their optimization speeds, or decomposing multimodal learning to enhance unimodal learning. These methods fail to achieve both unimodal sufficiency and multimodal balance. In this paper, we, for the first time, address both concerns by proposing multimodal Data Remixing, including decoupling multimodal data and filtering hard samples for each modality to mitigate modality imbalance; and then batch-level reassembling to align the gradient directions and avoid cross-modal interference, thus enhancing unimodal learning sufficiency. Experimental results demonstrate that our method can be seamlessly integrated with existing approaches, improving accuracy by approximately 6.50% \uparrow on CREMAD and 3.41% \uparrow on Kinetic-Sounds, without training set expansion or additional computational overhead during inference. The source code is available at \hrefthis https URLData Remixing.
zh
[AI-37] Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis
【速读】:该论文旨在解决自动驾驶系统在复杂环境中安全导航所面临的挑战,特别是传统场景生成方法在多样性与真实性方面的不足。其解决方案的关键在于利用基础模型(foundation models)的强大能力,这些模型能够处理多模态输入(如自然语言、传感器数据、高精地图和控制动作),从而实现对复杂驾驶场景的合成与分析。通过引入包括大语言模型、视觉-语言模型、多模态大语言模型、扩散模型和世界模型在内的统一分类体系,论文探索了基础模型在场景生成与分析中的应用,并提出了相关方法、数据集、仿真平台及评估指标。
链接: https://arxiv.org/abs/2506.11526
作者: Yuan Gao,Mattia Piccinini,Yuchen Zhang,Dingrui Wang,Korbinian Moller,Roberto Brusnicki,Baha Zarrouki,Alessio Gambi,Jan Frederik Totz,Kai Storms,Steven Peters,Andrea Stocco,Bassam Alrifaee,Marco Pavone,Johannes Betz
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:For autonomous vehicles, safe navigation in complex environments depends on handling a broad range of diverse and rare driving scenarios. Simulation- and scenario-based testing have emerged as key approaches to development and validation of autonomous driving systems. Traditional scenario generation relies on rule-based systems, knowledge-driven models, and data-driven synthesis, often producing limited diversity and unrealistic safety-critical cases. With the emergence of foundation models, which represent a new generation of pre-trained, general-purpose AI models, developers can process heterogeneous inputs (e.g., natural language, sensor data, HD maps, and control actions), enabling the synthesis and interpretation of complex driving scenarios. In this paper, we conduct a survey about the application of foundation models for scenario generation and scenario analysis in autonomous driving (as of May 2025). Our survey presents a unified taxonomy that includes large language models, vision-language models, multimodal large language models, diffusion models, and world models for the generation and analysis of autonomous driving scenarios. In addition, we review the methodologies, open-source datasets, simulation platforms, and benchmark challenges, and we examine the evaluation metrics tailored explicitly to scenario generation and analysis. Finally, the survey concludes by highlighting the open challenges and research questions, and outlining promising future research directions. All reviewed papers are listed in a continuously maintained repository, which contains supplementary materials and is available at this https URL.
zh
[AI-38] Investigating Vulnerabilities and Defenses Against Audio-Visual Attacks: A Comprehensive Survey Emphasizing Multimodal Models
【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在安全方面存在的漏洞问题,特别是针对基于音视频的多模态模型所面临的对抗攻击、后门攻击和越狱攻击等威胁。解决方案的关键在于进行一项全面且系统性的综述,涵盖多种类型的攻击方法,并填补现有研究在统一性与覆盖范围上的不足,从而为未来在音视频攻击与防御领域的研究提供深入的洞察与方向。
链接: https://arxiv.org/abs/2506.11521
作者: Jinming Wen,Xinyi Wu,Shuai Zhao,Yanhao Jia,Yuwen Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:Multimodal large language models (MLLMs), which bridge the gap between audio-visual and natural language processing, achieve state-of-the-art performance on several audio-visual tasks. Despite the superior performance of MLLMs, the scarcity of high-quality audio-visual training data and computational resources necessitates the utilization of third-party data and open-source MLLMs, a trend that is increasingly observed in contemporary research. This prosperity masks significant security risks. Empirical studies demonstrate that the latest MLLMs can be manipulated to produce malicious or harmful content. This manipulation is facilitated exclusively through instructions or inputs, including adversarial perturbations and malevolent queries, effectively bypassing the internal security mechanisms embedded within the models. To gain a deeper comprehension of the inherent security vulnerabilities associated with audio-visual-based multimodal models, a series of surveys investigates various types of attacks, including adversarial and backdoor attacks. While existing surveys on audio-visual attacks provide a comprehensive overview, they are limited to specific types of attacks, which lack a unified review of various types of attacks. To address this issue and gain insights into the latest trends in the field, this paper presents a comprehensive and systematic review of audio-visual attacks, which include adversarial attacks, backdoor attacks, and jailbreak attacks. Furthermore, this paper also reviews various types of attacks in the latest audio-visual-based MLLMs, a dimension notably absent in existing surveys. Drawing upon comprehensive insights from a substantial review, this paper delineates both challenges and emergent trends for future research on audio-visual attacks and defense.
zh
[AI-39] Prioritizing Alignment Paradigms over Task-Specific Model Customization in Time-Series LLM s
【速读】:该论文试图解决当前时间序列推理方法在任务特定模型定制上的局限性,即忽视了时间序列数据本身所包含的基本要素——时间序列原语(time-series primitives),导致现有方法存在成本高、灵活性差和效率低的问题。解决方案的关键在于提出一种根本性的方法转变,即优先考虑基于时间序列数据内在原语的对齐范式,而非任务特定的模型调整,具体包括注入对齐、桥接对齐和内部对齐三种范式,分别侧重于时间序列原语的领域、特征和表示,以激活大语言模型的时间序列推理能力,实现经济、灵活和高效的推理。
链接: https://arxiv.org/abs/2506.11512
作者: Wei Li,Yunyao Cheng,Xinli Hao,Chaohong Ma,Yuxuan Liang,Bin Yang,Christian S.Jensen,Xiaofeng Meng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Large Language Models (LLMs) have enabled unprecedented capabilities for time-series reasoning in diverse real-world applications, including medical, financial, and spatio-temporal domains. However, existing approaches typically focus on task-specific model customization, such as forecasting and anomaly detection, while overlooking the data itself, referred to as time-series primitives, which are essential for in-depth reasoning. This position paper advocates a fundamental shift in approaching time-series reasoning with LLMs: prioritizing alignment paradigms grounded in the intrinsic primitives of time series data over task-specific model customization. This realignment addresses the core limitations of current time-series reasoning approaches, which are often costly, inflexible, and inefficient, by systematically accounting for intrinsic structure of data before task engineering. To this end, we propose three alignment paradigms: Injective Alignment, Bridging Alignment, and Internal Alignment, which are emphasized by prioritizing different aspects of time-series primitives: domain, characteristic, and representation, respectively, to activate time-series reasoning capabilities of LLMs to enable economical, flexible, and efficient reasoning. We further recommend that practitioners adopt an alignment-oriented method to avail this instruction to select an appropriate alignment paradigm. Additionally, we categorize relevant literature into these alignment paradigms and outline promising research directions.
zh
[AI-40] Machine Learning-Based Quantification of Vesicoureteral Reflux with Enhancing Accuracy and Efficiency
【速读】:该论文试图解决传统基于主观评分系统评估膀胱输尿管反流(VUR)所带来的诊断一致性问题。解决方案的关键在于利用机器学习技术分析排尿性膀胱尿道造影(VCUG)图像,通过提取九个图像特征并训练多种预测模型,从而实现对VUR严重程度的客观、标准化分类。研究发现肾盂盏变形模式是高分级VUR的重要指标,并且所有模型均表现出高准确性和无假阳/假阴性结果,表明机器学习能够有效提升VUR诊断的一致性和可靠性。
链接: https://arxiv.org/abs/2506.11508
作者: Muhyeeddin Alqaraleh,Mowafaq Salem Alzboon,Mohammad Subhi Al-Batah,Lana Yasin Al Aesa,Mohammed Hasan Abu-Arqoub,Rashiq Rafiq Marie,Firas Hussein Alsmad
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Vesicoureteral reflux (VUR) is traditionally assessed using subjective grading systems, which introduces variability in diagnosis. This study investigates the use of machine learning to improve diagnostic consistency by analyzing voiding cystourethrogram (VCUG) images. A total of 113 VCUG images were reviewed, with expert grading of VUR severity. Nine image-based features were selected to train six predictive models: Logistic Regression, Decision Tree, Gradient Boosting, Neural Network, and Stochastic Gradient Descent. The models were evaluated using leave-one-out cross-validation. Analysis identified deformation patterns in the renal calyces as key indicators of high-grade VUR. All models achieved accurate classifications with no false positives or negatives. High sensitivity to subtle image patterns characteristic of different VUR grades was confirmed by substantial Area Under the Curve (AUC) values. The results suggest that machine learning can offer an objective and standardized alternative to current subjective VUR assessments. These findings highlight renal calyceal deformation as a strong predictor of severe cases. Future research should aim to expand the dataset, refine imaging features, and improve model generalizability for broader clinical use.
zh
[AI-41] Diabetes Prediction and Management Using Machine Learning Approaches
【速读】:该论文试图解决糖尿病风险分类的问题,旨在通过机器学习方法提高对糖尿病的早期预测能力,从而为临床提供有效的辅助决策工具。解决方案的关键在于采用多种统计和非统计机器学习算法(如逻辑回归、决策树、随机森林、K-近邻、朴素贝叶斯、支持向量机、梯度提升和神经网络模型)对Pima印第安人糖尿病数据库中的768个样本进行分析,评估其在糖尿病预测中的准确性和有效性,其中神经网络模型表现出最高的预测准确率(78.57%)。
链接: https://arxiv.org/abs/2506.11501
作者: Mowafaq Salem Alzboon,Muhyeeddin Alqaraleh,Mohammad Subhi Al-Batah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Diabetes has emerged as a significant global health issue, especially with the increasing number of cases in many countries. This trend Underlines the need for a greater emphasis on early detection and proactive management to avert or mitigate the severe health complications of this disease. Over recent years, machine learning algorithms have shown promising potential in predicting diabetes risk and are beneficial for practitioners. Objective: This study highlights the prediction capabilities of statistical and non-statistical machine learning methods over Diabetes risk classification in 768 samples from the Pima Indians Diabetes Database. It consists of the significant demographic and clinical features of age, body mass index (BMI) and blood glucose levels that greatly depend on the vulnerability against Diabetes. The experimentation assesses the various types of machine learning algorithms in terms of accuracy and effectiveness regarding diabetes prediction. These algorithms include Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors, Naive Bayes, Support Vector Machine, Gradient Boosting and Neural Network Models. The results show that the Neural Network algorithm gained the highest predictive accuracy with 78,57 %, and then the Random Forest algorithm had the second position with 76,30 % accuracy. These findings show that machine learning techniques are not just highly effective. Still, they also can potentially act as early screening tools in predicting Diabetes within a data-driven fashion with valuable information on who is more likely to get affected. In addition, this study can help to realize the potential of machine learning for timely intervention over the longer term, which is a step towards reducing health outcomes and disease burden attributable to Diabetes on healthcare systems
zh
[AI-42] Reviving DSP for Advanced Theorem Proving in the Era of Reasoning Models MICRO
【速读】:该论文旨在解决自动化定理证明(Automated Theorem Proving, ATP)中依赖强化学习(Reinforcement Learning, RL)大规模训练的高成本与高资源需求问题。其解决方案的关键在于提出一种改进的“Draft, Sketch, and Prove”(DSP+)框架,通过细粒度且集成的神经符号增强方法,在无需额外模型训练或微调的情况下,实现高效且可解释的定理证明。DSP+在每个阶段分别优化:在Draft阶段生成简洁的自然语言子目标,在Sketch阶段自动形式化子目标并修正语法错误,在Proving阶段结合符号搜索方法与步骤证明器完成证明,从而在多个基准测试中取得了优于现有方法的性能。
链接: https://arxiv.org/abs/2506.11487
作者: Chenrui Cao,Liangcheng Song,Zenan Li,Xinyi Le,Xian Zhang,Hui Xue,Fan Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 31 pages. Associated code and results are available at this https URL
Abstract:Recent advancements, such as DeepSeek-Prover-V2-671B and Kimina-Prover-Preview-72B, demonstrate a prevailing trend in leveraging reinforcement learning (RL)-based large-scale training for automated theorem proving. Surprisingly, we discover that even without any training, careful neuro-symbolic coordination of existing off-the-shelf reasoning models and tactic step provers can achieve comparable performance. This paper introduces \textbfDSP+, an improved version of the Draft, Sketch, and Prove framework, featuring a \emphfine-grained and integrated neuro-symbolic enhancement for each phase: (1) In the draft phase, we prompt reasoning models to generate concise natural-language subgoals to benefit the sketch phase, removing thinking tokens and references to human-written proofs; (2) In the sketch phase, subgoals are autoformalized with hypotheses to benefit the proving phase, and sketch lines containing syntactic errors are masked according to predefined rules; (3) In the proving phase, we tightly integrate symbolic search methods like Aesop with step provers to establish proofs for the sketch subgoals. Experimental results show that, without any additional model training or fine-tuning, DSP+ solves 80.7%, 32.8%, and 24 out of 644 problems from miniF2F, ProofNet, and PutnamBench, respectively, while requiring fewer budgets compared to state-of-the-arts. DSP+ proves \textttimo_2019_p1, an IMO problem in miniF2F that is not solved by any prior work. Additionally, DSP+ generates proof patterns comprehensible by human experts, facilitating the identification of formalization errors; For example, eight wrongly formalized statements in miniF2F are discovered. Our results highlight the potential of classical reasoning patterns besides the RL-based training. All components will be open-sourced.
zh
[AI-43] LearnAlign: Reasoning Data Selection for Reinforcement Learning in Large Language Models Based on Improved Gradient Alignment
【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)在提升大语言模型(Large Language Models, LLMs)推理能力时存在的数据效率低下的问题。解决方案的关键在于提出一种基于梯度对齐的方法——LearnAlign,该方法通过智能选择可学习且具有代表性的RL后训练推理数据,从而提高数据利用效率。其核心创新点在于引入基于成功率的数据可学习性指标,以克服梯度模长中的响应长度偏差问题,从而更准确地评估每个数据点的学习潜力。
链接: https://arxiv.org/abs/2506.11480
作者: Shikun Li,Shipeng Li,Zhiqin Yang,Xinghua Zhang,Gaode Chen,Xiaobo Xia,Hengyu Liu,Zhe Peng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) has become a key technique for enhancing LLMs’ reasoning abilities, yet its data inefficiency remains a major bottleneck. To address this critical yet challenging issue, we present a novel gradient-alignment-based method, named LearnAlign, which intelligently selects the learnable and representative training reasoning data for RL post-training. To overcome the well-known issue of response-length bias in gradient norms, we introduce the data learnability based on the success rate, which can indicate the learning potential of each data point. Experiments across three mathematical reasoning benchmarks demonstrate that our method significantly reduces training data requirements while achieving minor performance degradation or even improving performance compared to full-data training. For example, it reduces data requirements by up to 1,000 data points with better performance (77.53%) than that on the full dataset on GSM8K benchmark (77.04%). Furthermore, we show its effectiveness in the staged RL setting. This work provides valuable insights into data-efficient RL post-training and establishes a foundation for future research in optimizing reasoning data this http URL facilitate future work, we will release code.
zh
[AI-44] Structure-Aware Automatic Channel Pruning by Searching with Graph Embedding
【速读】:该论文试图解决深度神经网络中通道剪枝(channel pruning)方法依赖局部启发式或基于权重的准则,无法捕捉网络内部全局结构依赖关系的问题,从而导致剪枝决策次优和模型性能下降。解决方案的关键在于提出一种结构感知的自动通道剪枝(Structure-Aware Automatic Channel Pruning, SACP)框架,该框架利用图卷积网络(Graph Convolutional Networks, GCNs)建模网络拓扑,并学习每个通道的全局重要性,实现拓扑感知的自动化剪枝。通过限制剪枝率组合到特定空间并采用搜索方法确定最优组合,提升了剪枝效率与模型性能。
链接: https://arxiv.org/abs/2506.11469
作者: Zifan Liu,Yuan Cao,Yanwei Yu,Heng Qi,Jie Gui
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures
Abstract:Channel pruning is a powerful technique to reduce the computational overhead of deep neural networks, enabling efficient deployment on resource-constrained devices. However, existing pruning methods often rely on local heuristics or weight-based criteria that fail to capture global structural dependencies within the network, leading to suboptimal pruning decisions and degraded model performance. To address these limitations, we propose a novel structure-aware automatic channel pruning (SACP) framework that utilizes graph convolutional networks (GCNs) to model the network topology and learn the global importance of each channel. By encoding structural relationships within the network, our approach implements topology-aware pruning and this pruning is fully automated, reducing the need for human intervention. We restrict the pruning rate combinations to a specific space, where the number of combinations can be dynamically adjusted, and use a search-based approach to determine the optimal pruning rate combinations. Extensive experiments on benchmark datasets (CIFAR-10, ImageNet) with various models (ResNet, VGG16) demonstrate that SACP outperforms state-of-the-art pruning methods on compression efficiency and competitive on accuracy retention.
zh
[AI-45] Resolve Highway Conflict in Multi-Autonomous Vehicle Controls with Local State Attention
【速读】:该论文旨在解决混合交通环境中自动驾驶车辆在与人类控制车辆及其他异常驾驶情况交互时的协调问题,该场景可建模为具有完全合作奖励的多智能体强化学习(MARL)环境。论文提出的解决方案关键在于引入了局部状态注意力模块(Local State Attention module),通过自注意力机制压缩附近智能体的关键信息,以缓解交通场景中的局部冲突,并提升对随机事件的泛化能力。该模块在模拟的高速公路并道场景中有效优先处理其他车辆的信息,从而提升了并道效率。
链接: https://arxiv.org/abs/2506.11445
作者: Xuan Duy Ta,Bang Giang Le,Thanh Ha Le,Viet Cuong Ta
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In mixed-traffic environments, autonomous vehicles must adapt to human-controlled vehicles and other unusual driving situations. This setting can be framed as a multi-agent reinforcement learning (MARL) environment with full cooperative reward among the autonomous vehicles. While methods such as Multi-agent Proximal Policy Optimization can be effective in training MARL tasks, they often fail to resolve local conflict between agents and are unable to generalize to stochastic events. In this paper, we propose a Local State Attention module to assist the input state representation. By relying on the self-attention operator, the module is expected to compress the essential information of nearby agents to resolve the conflict in traffic situations. Utilizing a simulated highway merging scenario with the priority vehicle as the unexpected event, our approach is able to prioritize other vehicles’ information to manage the merging process. The results demonstrate significant improvements in merging efficiency compared to popular baselines, especially in high-density traffic settings.
zh
[AI-46] DPUV4E: High-Throughput DPU Architecture Design for CNN on Versal ACAP
【速读】:该论文旨在解决传统FPGA在平衡性能与灵活性方面面临的挑战,以及AMD Versal ACAP架构中因内存带宽不足导致的AI引擎(AIE)理论性能无法充分利用的问题。其解决方案的关键在于设计DPUV4E,通过配置从2PE(32.6 TOPS)到8PE(131.0 TOPS)的不同计算单元,包括Conv PE和DWC PE,以支持不同的计算模式,并通过高效的数据流设计充分利用数据复用机会,缓解带宽瓶颈。此外,扩展每个处理单元(PE)的功能,使其能够利用AIE执行非卷积操作,从而降低资源开销。
链接: https://arxiv.org/abs/2506.11441
作者: Guoyu Li,Pengbo Zheng,Jian Weng,Enshan Yang(AMD)
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 10 pages, 9 figures
Abstract:Convolutional Neural Networks (CNNs) remain prevalent in computer vision applications, and FPGAs, known for their flexibility and energy efficiency, have become essential components in heterogeneous acceleration systems. However, traditional FPGAs face challenges in balancing performance and versatility due to limited on-chip resources. AMD’s Versal ACAP architecture, tailored for AI applications, incorporates AI Engines (AIEs) to deliver high computational power. Nevertheless, the platform suffers from insufficient memory bandwidth, hindering the full utilization of the AIEs’ theoretical performance. In this paper, we present DPUV4E for the Versal architecture, providing configurations ranging from 2PE ( 32.6 TOPS) to 8PE ( 131.0 TOPS). We design two computation units, Conv PE and DWC PE, to support different computational patterns. Each computation unit’s data flow efficiently utilizes the data reuse opportunities to mitigate bandwidth bottlenecks. Additionally, we extend the functionality of each PE to utilize AIEs for non-convolutional operations, reducing resource overhead. Experiments on over 50 models show that compared to previous designs, our design provides 8.6\times the TOPS/W of traditional FPGA-based DPU designs, while reducing DSP usage by 95.8% , LUT usage by 44.7% , and latency to 68.5% under single-batch conditions. For end-to-end inference, our design improving throughput by up to 2.2\times for depth-wise convolution models and up to 1.3\times for standard models.
zh
[AI-47] Deep Learning Model Acceleration and Optimization Strategies for Real-Time Recommendation Systems
【速读】:该论文旨在解决实时推荐系统中由于大规模用户请求和复杂模型架构导致的推理延迟高与系统吞吐量低的问题,同时不牺牲推荐质量。其解决方案的关键在于通过模型层面的轻量化网络设计、结构化剪枝和权重量化显著降低参数数量和计算需求,并在系统层面集成异构计算平台和高性能推理库,结合基于实时负载特征的弹性推理调度与负载均衡机制,从而实现低延迟和高吞吐量的优化。
链接: https://arxiv.org/abs/2506.11421
作者: Junli Shao,Jing Dong,Dingzhou Wang,Kowei Shih,Dannier Li,Chengrui Zhou
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:With the rapid growth of Internet services, recommendation systems play a central role in delivering personalized content. Faced with massive user requests and complex model architectures, the key challenge for real-time recommendation systems is how to reduce inference latency and increase system throughput without sacrificing recommendation quality. This paper addresses the high computational cost and resource bottlenecks of deep learning models in real-time settings by proposing a combined set of modeling- and system-level acceleration and optimization strategies. At the model level, we dramatically reduce parameter counts and compute requirements through lightweight network design, structured pruning, and weight quantization. At the system level, we integrate multiple heterogeneous compute platforms and high-performance inference libraries, and we design elastic inference scheduling and load-balancing mechanisms based on real-time load characteristics. Experiments show that, while maintaining the original recommendation accuracy, our methods cut latency to less than 30% of the baseline and more than double system throughput, offering a practical solution for deploying large-scale online recommendation services.
zh
[AI-48] FocalAD: Local Motion Planning for End-to-End Autonomous Driving
【速读】:该论文旨在解决端到端自动驾驶中运动预测存在的问题,即现有方法过度依赖全局聚合的运动特征,而忽略了对规划决策起主要影响的少量局部交互代理。解决方案的关键在于提出FocalAD框架,其核心是通过关注关键局部邻居并增强局部运动表示来优化规划。该框架包含两个核心模块:Ego-Local-Agents Interactor (ELAI) 和 Focal-Local-Agents Loss (FLA Loss),分别用于捕捉局部邻居的运动动态以及提升对决策关键代理的权重,从而提高规划的可靠性和安全性。
链接: https://arxiv.org/abs/2506.11419
作者: Bin Sun,Boao Zhang,Jiayi Lu,Xinjie Feng,Jiachen Shang,Rui Cao,Mengchao Zheng,Chuanye Wang,Shichun Yang,Yaoguang Cao,Ziying Song
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:In end-to-end autonomous driving,the motion prediction plays a pivotal role in ego-vehicle planning. However, existing methods often rely on globally aggregated motion features, ignoring the fact that planning decisions are primarily influenced by a small number of locally interacting agents. Failing to attend to these critical local interactions can obscure potential risks and undermine planning reliability. In this work, we propose FocalAD, a novel end-to-end autonomous driving framework that focuses on critical local neighbors and refines planning by enhancing local motion representations. Specifically, FocalAD comprises two core modules: the Ego-Local-Agents Interactor (ELAI) and the Focal-Local-Agents Loss (FLA Loss). ELAI conducts a graph-based ego-centric interaction representation that captures motion dynamics with local neighbors to enhance both ego planning and agent motion queries. FLA Loss increases the weights of decision-critical neighboring agents, guiding the model to prioritize those more relevant to planning. Extensive experiments show that FocalAD outperforms existing state-of-the-art methods on the open-loop nuScenes datasets and closed-loop Bench2Drive benchmark. Notably, on the robustness-focused Adv-nuScenes dataset, FocalAD achieves even greater improvements, reducing the average colilision rate by 41.9% compared to DiffusionDrive and by 15.6% compared to SparseDrive.
zh
[AI-49] he Strategic Imperative for Healthcare Organizations to Build Proprietary Foundation Models
【速读】:该论文试图解决 healthcare 组织在构建基础模型(foundation models)时面临的“自建还是采购”(build-versus-buy)决策问题,强调发展自有基础模型的战略必要性。解决方案的关键在于通过构建专属的多模态基础模型,实现医疗领域的数据表示需求、数据主权与治理、战略竞争优势以及患者护理和运营模式的变革潜力。研究指出,自有基础模型能够提升临床表现、保障数据治理、创造可持续的竞争优势,并加速创新进程。
链接: https://arxiv.org/abs/2506.11412
作者: Naresh Tiwari
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper presents a comprehensive analysis of the strategic imperative for healthcare organizations to develop proprietary foundation models rather than relying exclusively on commercial alternatives. We examine four fundamental considerations driving this imperative: the domain-specific requirements of healthcare data representation, critical data sovereignty and governance considerations unique to healthcare, strategic competitive advantages afforded by proprietary AI infrastructure, and the transformative potential of healthcare-specific foundation models for patient care and organizational operations. Through analysis of empirical evidence, economic frameworks, and organizational case studies, we demonstrate that proprietary multimodal foundation models enable healthcare organizations to achieve superior clinical performance, maintain robust data governance, create sustainable competitive advantages, and accelerate innovation pathways. While acknowledging implementation challenges, we present evidence showing organizations with proprietary AI capabilities demonstrate measurably improved outcomes, faster innovation cycles, and stronger strategic positioning in the evolving healthcare ecosystem. This analysis provides healthcare leaders with a comprehensive framework for evaluating build-versus-buy decisions regarding foundation model implementation, positioning proprietary foundation model development as a cornerstone capability for forward-thinking healthcare organizations.
zh
[AI-50] A correlation-permutation approach for speech-music encoders model merging
【速读】:该论文试图解决如何在不进行昂贵预训练的情况下,创建一个统一的语音与音乐模型的问题。其解决方案的关键在于提出一种基于相关性-排列(correlation-permutation)的方法,通过计算最大化特征层面互相关性的排列矩阵,将音乐编码器的内部层与语音编码器对齐,从而实现Transformer层的融合,使得合并后的模型在保留语音能力的同时显著提升音乐性能。
链接: https://arxiv.org/abs/2506.11403
作者: Fabian Ritter-Gutierrez,Yi-Cheng Lin,Jeremy H.M Wong,Hung-yi Lee,Eng Siong Chng,Nancy F. Chen
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Under review
Abstract:Creating a unified speech and music model requires expensive pre-training. Model merging can instead create an unified audio model with minimal computational expense. However, direct merging is challenging when the models are not aligned in the weight space. Motivated by Git Re-Basin, we introduce a correlation-permutation approach that aligns a music encoder’s internal layers with a speech encoder. We extend previous work to the case of merging transformer layers. The method computes a permutation matrix that maximizes the model’s features-wise cross-correlations layer by layer, enabling effective fusion of these otherwise disjoint models. The merged model retains speech capabilities through this method while significantly enhancing music performance, achieving an improvement of 14.83 points in average score compared to linear interpolation model merging. This work allows the creation of unified audio models from independently trained encoders.
zh
[AI-51] MUDAS: Mote-scale Unsupervised Domain Adaptation in Multi-label Sound Classification
【速读】:该论文旨在解决在无监督域适应(Unsupervised Domain Adaptation, UDA)场景下,传统算法难以有效应用于多标签任务及资源受限的物联网(IoT)设备的问题。其关键解决方案是提出一种轻量级的多标签声音分类域适应框架——Mote-scale Unsupervised Domain Adaptation for Sounds (MUDAS),该框架通过选择性地在本地重新训练分类器、引入类别特定的自适应阈值生成可靠伪标签,并应用多样性正则化来提升多标签分类精度,从而在降低计算和内存需求的同时实现高效的模型适应。
链接: https://arxiv.org/abs/2506.11331
作者: Jihoon Yun,Chengzhang Li,Dhrubojyoti Roy,Anish Arora
机构: 未知
类目: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:Unsupervised Domain Adaptation (UDA) is essential for adapting machine learning models to new, unlabeled environments where data distribution shifts can degrade performance. Existing UDA algorithms are designed for single-label tasks and rely on significant computational resources, limiting their use in multi-label scenarios and in resource-constrained IoT devices. Overcoming these limitations is particularly challenging in contexts such as urban sound classification, where overlapping sounds and varying acoustics require robust, adaptive multi-label capabilities on low-power, on-device systems. To address these limitations, we introduce Mote-scale Unsupervised Domain Adaptation for Sounds (MUDAS), a UDA framework developed for multi-label sound classification in resource-constrained IoT settings. MUDAS efficiently adapts models by selectively retraining the classifier in situ using high-confidence data, minimizing computational and memory requirements to suit on-device deployment. Additionally, MUDAS incorporates class-specific adaptive thresholds to generate reliable pseudo-labels and applies diversity regularization to improve multi-label classification accuracy. In evaluations on the SONYC Urban Sound Tagging (SONYC-UST) dataset recorded at various New York City locations, MUDAS demonstrates notable improvements in classification accuracy over existing UDA algorithms, achieving good performance in a resource-constrained IoT setting.
zh
[AI-52] A Tale of Two Systems: Characterizing Architectural Complexity on Machine Learning-Enabled Systems
【速读】:该论文试图解决如何有效管理机器学习使能系统(Machine Learning-Enabled Systems, MLES)的复杂性问题。其解决方案的关键在于引入一种基于度量的架构模型,以表征MLES的复杂性,并为系统的架构决策提供支持,从而指导这些系统的初始构建和持续发展。为此,本文并列展示了两个可用于构建该度量模型的案例系统——SPIRA和Ocean Guard MLES的架构表示。
链接: https://arxiv.org/abs/2506.11295
作者: Renato Cordeiro Ferreira(1,2,3,4) ((1) University of São Paulo, (2) Jheronimus Academy of Data Science, (3) Technical University of Eindhoven, (4) Tilburg University)
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 3 figures (3 diagrams), submitted to the ECSA2025. arXiv admin note: substantial text overlap with arXiv:2506.08153
Abstract:How can the complexity of ML-enabled systems be managed effectively? The goal of this research is to investigate how complexity affects ML-Enabled Systems (MLES). To address this question, this research aims to introduce a metrics-based architectural model to characterize the complexity of MLES. The goal is to support architectural decisions, providing a guideline for the inception and growth of these systems. This paper brings, side-by-side, the architecture representation of two systems that can be used as case studies for creating the metrics-based architectural model: the SPIRA and the Ocean Guard MLES.
zh
[AI-53] Invocable APIs derived from NL2SQL datasets for LLM Tool-Calling Evaluation
【速读】:该论文试图解决在企业级部署中,大型语言模型(Large Language Models, LLMs)与复杂API集合交互时面临的工具选择和任务执行效率低下的问题。其关键解决方案是提出一种新颖的数据生成流水线,利用SQL查询的语法结构生成功能等效的API调用序列,并基于BIRD-SQL数据集构建了一个包含2500多个API的工具池,从而为研究LLMs在真实API环境中的表现提供数据支持。
链接: https://arxiv.org/abs/2506.11266
作者: Benjamin Elder,Anupama Murthi,Jungkoo Kang,Ankita Rajaram Naik,Kiran Kate,Kinjal Basu,Danish Contractor
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 10+32 pages, 5 figures
Abstract:Large language models (LLMs) are routinely deployed as agentic systems, with access to tools that interact with live environments to accomplish tasks. In enterprise deployments these systems need to interact with API collections that can be extremely large and complex, often backed by databases. In order to create datasets with such characteristics, we explore how existing NL2SQL (Natural Language to SQL query) datasets can be used to automatically create NL2API datasets. Specifically, this work describes a novel data generation pipeline that exploits the syntax of SQL queries to construct a functionally equivalent sequence of API calls. We apply this pipeline to one of the largest NL2SQL datasets, BIRD-SQL to create a collection of over 2500 APIs that can be served as invocable tools or REST-endpoints. We pair natural language queries from BIRD-SQL to ground-truth API sequences based on this API pool. We use this collection to study the performance of 10 public LLMs and find that all models struggle to determine the right set of tools (consisting of tasks of intent detection, sequencing with nested function calls, and slot-filling). We find that models have extremely low task completion rates (7-47 percent - depending on the dataset) which marginally improves to 50 percent when models are employed as ReACT agents that interact with the live API environment. The best task completion rates are far below what may be required for effective general-use tool-calling agents, suggesting substantial scope for improvement in current state-of-the-art tool-calling LLMs. We also conduct detailed ablation studies, such as assessing the impact of the number of tools available as well as the impact of tool and slot-name obfuscation. We compare the performance of models on the original SQL generation tasks and find that current models are sometimes able to exploit SQL better than APIs.
zh
[AI-54] Can Time-Series Foundation Models Perform Building Energy Management Tasks?
【速读】:该论文试图解决建筑能源管理(Building Energy Management, BEM)任务中现有解决方案依赖于特定任务和数据的模型,从而限制了其广泛适用性的问题。其解决方案的关键在于探索时间序列基础模型(Time-Series Foundation Models, TSFMs)的通用性潜力,希望通过类似大型语言模型(Large Language Models, LLMs)的成功,使TSFMs在多种任务和场景中具备更强的泛化能力,从而应对BEM中的可扩展性挑战。
链接: https://arxiv.org/abs/2506.11250
作者: Ozan Baris Mulayim,Pengrui Quan,Liying Han,Xiaomin Ouyang,Dezhi Hong,Mario Bergés,Mani Srivastava
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 30 pages, 5 tables, 8 figures. Under review for Data-Centric Engineering journal
Abstract:Building energy management (BEM) tasks require processing and learning from a variety of time-series data. Existing solutions rely on bespoke task- and data-specific models to perform these tasks, limiting their broader applicability. Inspired by the transformative success of Large Language Models (LLMs), Time-Series Foundation Models (TSFMs), trained on diverse datasets, have the potential to change this. Were TSFMs to achieve a level of generalizability across tasks and contexts akin to LLMs, they could fundamentally address the scalability challenges pervasive in BEM. To understand where they stand today, we evaluate TSFMs across four dimensions: (1) generalizability in zero-shot univariate forecasting, (2) forecasting with covariates for thermal behavior modeling, (3) zero-shot representation learning for classification tasks, and (4) robustness to performance metrics and varying operational conditions. Our results reveal that TSFMs exhibit \emphlimited generalizability, performing only marginally better than statistical models on unseen datasets and modalities for univariate forecasting. Similarly, inclusion of covariates in TSFMs does not yield performance improvements, and their performance remains inferior to conventional models that utilize covariates. While TSFMs generate effective zero-shot representations for downstream classification tasks, they may remain inferior to statistical models in forecasting when statistical models perform test-time fitting. Moreover, TSFMs forecasting performance is sensitive to evaluation metrics, and they struggle in more complex building environments compared to statistical models. These findings underscore the need for targeted advancements in TSFM design, particularly their handling of covariates and incorporating context and temporal dynamics into prediction mechanisms, to develop more adaptable and scalable solutions for BEM.
zh
[AI-55] A Causal Lens for Learning Long-term Fair Policies
【速读】:该论文试图解决在动态决策系统中长期公平性与即时公平性要求之间的平衡问题,尤其是在存在偏差训练数据的情况下避免歧视性决策结果。其解决方案的关键在于提出一个通用框架,通过因果视角将长期公平性度量分解为直接效应、延迟效应以及虚假效应三个组成部分,并分析这些组件与新兴的“ benefit fairness ”(益处公平)概念之间的内在联系,从而为平衡多种公平性理念提供有效方法。
链接: https://arxiv.org/abs/2506.11242
作者: Jacob Lear,Lu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This is an extension to the paper which was accepted to the 13th International Conference on Learning Representations
Abstract:Fairness-aware learning studies the development of algorithms that avoid discriminatory decision outcomes despite biased training data. While most studies have concentrated on immediate bias in static contexts, this paper highlights the importance of investigating long-term fairness in dynamic decision-making systems while simultaneously considering instantaneous fairness requirements. In the context of reinforcement learning, we propose a general framework where long-term fairness is measured by the difference in the average expected qualification gain that individuals from different groups could this http URL, through a causal lens, we decompose this metric into three components that represent the direct impact, the delayed impact, as well as the spurious effect the policy has on the qualification gain. We analyze the intrinsic connection between these components and an emerging fairness notion called benefit fairness that aims to control the equity of outcomes in decision-making. Finally, we develop a simple yet effective approach for balancing various fairness notions.
zh
[AI-56] uPVC-Net: A Universal Premature Ventricular Contraction Detection Deep Learning Algorithm
【速读】:该论文旨在解决心电图(ECG)中室性早搏(Premature Ventricular Contractions, PVCs)检测的挑战性问题,尤其是在不同导联配置、记录条件和人群差异下导致的心电信号波形变异带来的准确检测难题。其解决方案的关键在于开发了uPVC-Net,一个通用的深度学习模型,能够从任意单导联ECG记录中检测PVCs,该模型采用了定制架构和多源、多导联训练策略,并通过在独立数据集上的验证展示了其在分布外数据上的强大泛化能力。
链接: https://arxiv.org/abs/2506.11238
作者: Hagai Hamami,Yosef Solewicz,Daniel Zur,Yonatan Kleerekoper,Joachim A. Behar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 8 pages
Abstract:Introduction: Premature Ventricular Contractions (PVCs) are common cardiac arrhythmias originating from the ventricles. Accurate detection remains challenging due to variability in electrocardiogram (ECG) waveforms caused by differences in lead placement, recording conditions, and population demographics. Methods: We developed uPVC-Net, a universal deep learning model to detect PVCs from any single-lead ECG recordings. The model is developed on four independent ECG datasets comprising a total of 8.3 million beats collected from Holter monitors and a modern wearable ECG patch. uPVC-Net employs a custom architecture and a multi-source, multi-lead training strategy. For each experiment, one dataset is held out to evaluate out-of-distribution (OOD) generalization. Results: uPVC-Net achieved an AUC between 97.8% and 99.1% on the held-out datasets. Notably, performance on wearable single-lead ECG data reached an AUC of 99.1%. Conclusion: uPVC-Net exhibits strong generalization across diverse lead configurations and populations, highlighting its potential for robust, real-world clinical deployment.
zh
[AI-57] Beyond Formal Semantics for Capabilities and Skills: Model Context Protocol in Manufacturing
【速读】:该论文试图解决传统显式建模能力与技能(如基于本体论、资产管理系统或其它技术)所需大量人工操作,以及所生成的表示难以被大型语言模型(Large Language Models, LLMs)有效利用的问题。其解决方案的关键在于采用最近提出的模型上下文协议(Model Context Protocol, MCP),该协议通过标准化接口使系统功能直接可供基于LLM的代理使用,从而实现无需依赖显式语义模型的灵活工业自动化。
链接: https://arxiv.org/abs/2506.11180
作者: Luis Miguel Vieira da Silva,Aljosha Köcher,Felix Gehlhoff
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
备注:
Abstract:Explicit modeling of capabilities and skills – whether based on ontologies, Asset Administration Shells, or other technologies – requires considerable manual effort and often results in representations that are not easily accessible to Large Language Models (LLMs). In this work-in-progress paper, we present an alternative approach based on the recently introduced Model Context Protocol (MCP). MCP allows systems to expose functionality through a standardized interface that is directly consumable by LLM-based agents. We conduct a prototypical evaluation on a laboratory-scale manufacturing system, where resource functions are made available via MCP. A general-purpose LLM is then tasked with planning and executing a multi-step process, including constraint handling and the invocation of resource functions via MCP. The results indicate that such an approach can enable flexible industrial automation without relying on explicit semantic models. This work lays the basis for further exploration of external tool integration in LLM-driven production systems.
zh
[AI-58] Collapsing Sequence-Level Data-Policy Coverag e via Poisoning Attack in Offline Reinforcement Learning
【速读】:该论文试图解决离线强化学习(Offline Reinforcement Learning, RL)中由于数据策略覆盖不足导致的分布偏移问题及其潜在的安全风险。现有方法主要关注提升数据与策略的覆盖性,但忽略了覆盖不足带来的安全威胁,并且仅进行单步分析,未能反映离线RL的多步决策特性。解决方案的关键在于引入序列级浓缩系数(sequence-level concentrability coefficient)以量化覆盖性,并通过理论分析揭示其对估计误差上界呈指数级放大效应。基于此,提出了一种针对序列级数据-策略覆盖的中毒攻击方法(Collapsing Sequence-Level Data-Policy Coverage, CSDPC),通过将状态-动作对转换为决策单元并提取代表性多步行为模式,识别并污染稀有模式以降低覆盖性并加剧分布偏移。实验表明,仅污染1%的数据集即可使智能体性能下降90%,为离线RL的安全分析与防护提供了新视角。
链接: https://arxiv.org/abs/2506.11172
作者: Xue Zhou,Dapeng Man,Chen Xu,Fanyi Zeng,Tao Liu,Huan Wang,Shucheng He,Chaoyang Gao,Wu Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Offline reinforcement learning (RL) heavily relies on the coverage of pre-collected data over the target policy’s distribution. Existing studies aim to improve data-policy coverage to mitigate distributional shifts, but overlook security risks from insufficient coverage, and the single-step analysis is not consistent with the multi-step decision-making nature of offline RL. To address this, we introduce the sequence-level concentrability coefficient to quantify coverage, and reveal its exponential amplification on the upper bound of estimation errors through theoretical analysis. Building on this, we propose the Collapsing Sequence-Level Data-Policy Coverage (CSDPC) poisoning attack. Considering the continuous nature of offline RL data, we convert state-action pairs into decision units, and extract representative decision patterns that capture multi-step behavior. We identify rare patterns likely to cause insufficient coverage, and poison them to reduce coverage and exacerbate distributional shifts. Experiments show that poisoning just 1% of the dataset can degrade agent performance by 90%. This finding provides new perspectives for analyzing and safeguarding the security of offline RL.
zh
[AI-59] PromptTSS: A Prompting-Based Approach for Interactive Multi-Granularity Time Series Segmentation
【速读】:该论文旨在解决多粒度时间序列分割中的两个关键问题:一是现有方法无法在统一模型中处理不同粒度的状态,二是模型对动态环境中新出现模式的适应能力有限。其解决方案的关键在于提出PromptTSS框架,该框架采用具有提示机制的统一模型,通过利用标签和边界信息引导分割过程,从而同时捕捉粗粒度和细粒度模式,并动态适应未见过的模式。
链接: https://arxiv.org/abs/2506.11170
作者: Ching Chang,Ming-Chih Lo,Wen-Chih Peng,Tien-Fu Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper is currently under review. The code will be made available upon acceptance
Abstract:Multivariate time series data, collected across various fields such as manufacturing and wearable technology, exhibit states at multiple levels of granularity, from coarse-grained system behaviors to fine-grained, detailed events. Effectively segmenting and integrating states across these different granularities is crucial for tasks like predictive maintenance and performance optimization. However, existing time series segmentation methods face two key challenges: (1) the inability to handle multiple levels of granularity within a unified model, and (2) limited adaptability to new, evolving patterns in dynamic environments. To address these challenges, we propose PromptTSS, a novel framework for time series segmentation with multi-granularity states. PromptTSS uses a unified model with a prompting mechanism that leverages label and boundary information to guide segmentation, capturing both coarse- and fine-grained patterns while adapting dynamically to unseen patterns. Experiments show PromptTSS improves accuracy by 24.49% in multi-granularity segmentation, 17.88% in single-granularity segmentation, and up to 599.24% in transfer learning, demonstrating its adaptability to hierarchical states and evolving time series dynamics.
zh
[AI-60] Denoising Programming Knowledge Tracing with a Code Graph-based Tuning Adaptor KDD
【速读】:该论文旨在解决编程知识追踪(Programming Knowledge Tracking, PKT)中因长期编程活动产生的噪声信号对模型性能的负面影响问题,这些噪声包括无关提交产生的干扰信号和微小修改带来的弱信号。解决方案的关键在于提出Coda框架,该框架通过将学习者的松散代码序列转化为紧凑的代码图(code graph),利用语义相似性识别干扰信号,并采用聚类感知的图卷积网络(GCN)增强弱信号的区分能力,最终通过引入基于噪声特征的约束和导航正则化项,实现对受噪声影响的知识状态的修正。
链接: https://arxiv.org/abs/2506.11107
作者: Weibo Gao,Qi Liu,Rui Li,Yuze Zhao,Hao Wang,Linan Yre,Fangzhou Yao,Zheng Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted by KDD August 2025
Abstract:Programming Knowledge Tracking (PKT) aims to dynamically diagnose learners’ mastery levels of programming knowledge based on their coding activities, facilitating more effective and personalized programming education. However, current PKT studies primarily focus on the implicit relationship between code content and knowledge assessment, often overlooking two types of noise signals in long-term programming activities: unwanted signals from unrelated submissions and weak signals from minor modifications. This practical challenge significantly limits model performance and application. To address this issue, we propose Coda, a Code graph-based tuning adaptor designed to enhance existing PKT models by identifying and mitigating the impact of noise. Specifically, Coda first transforms the loose code sequences submitted by each learner into a compact code graph. By leveraging this code graph, unwanted signals can be identified from a semantic similarity perspective. We then apply a cluster-aware GCN to the code graph, which improves the discrimination of weak signals and enables their clustering for identification. Finally, a lightweight yet effective adaptor is incorporated into the PKT task through optimization with two noise feature-based constraints and a navigational regularization term, to correct knowledge states affected by noise. It is worth mentioning that the Coda framework is model-agnostic and can be adapted to most existing PKT solutions. Extensive experimental results on four real-world datasets demonstrate that Coda effectively performs the PKT task in the presence of noisy programming records, outperforming typical baselines.
zh
[AI-61] An Active Learning-Based Streaming Pipeline for Reduced Data Training of Structure Finding Models in Neutron Diffractometry
【速读】:该论文试图解决中子衍射测量中结构确定任务的计算成本过高问题,该任务通常需要数小时到数天来从中子衍射图谱中确定材料结构。解决方案的关键在于引入一种新型的批量模式主动学习(batch-mode active learning, AL)策略,该策略通过不确定性采样从模型最不确定的标记样本的概率分布中生成训练数据,从而显著减少所需训练数据量并提高模型准确性。
链接: https://arxiv.org/abs/2506.11100
作者: Tianle Wang,Jorge Ramirez,Cristina Garcia-Cardona,Thomas Proffen,Shantenu Jha,Sudip K. Seal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Atomic and Molecular Clusters (physics.atm-clus); Data Analysis, Statistics and Probability (physics.data-an)
备注:
Abstract:Structure determination workloads in neutron diffractometry are computationally expensive and routinely require several hours to many days to determine the structure of a material from its neutron diffraction patterns. The potential for machine learning models trained on simulated neutron scattering patterns to significantly speed up these tasks have been reported recently. However, the amount of simulated data needed to train these models grows exponentially with the number of structural parameters to be predicted and poses a significant computational challenge. To overcome this challenge, we introduce a novel batch-mode active learning (AL) policy that uses uncertainty sampling to simulate training data drawn from a probability distribution that prefers labelled examples about which the model is least certain. We confirm its efficacy in training the same models with about 75% less training data while improving the accuracy. We then discuss the design of an efficient stream-based training workflow that uses this AL policy and present a performance study on two heterogeneous platforms to demonstrate that, compared with a conventional training workflow, the streaming workflow delivers about 20% shorter training time without any loss of accuracy.
zh
[AI-62] Debiasing Online Preference Learning via Preference Feature Preservation
【速读】:该论文试图解决大规模语言模型(Large Language Model, LLM)在在线偏好学习过程中因简化人类偏好为二元成对比较和标量奖励而导致的响应偏向主要偏好特征的问题,这一问题会在迭代过程中进一步加剧。解决方案的关键在于提出一种名为PFP(Preference Feature Preservation)的框架,其核心思想是保持人类偏好特征的分布,并在整个在线偏好学习过程中利用这些丰富的信号。通过从离线成对人类偏好数据中提取偏好特征并训练特征分类器,结合分布保持优化,在在线学习阶段为新输入指令映射合适的偏好特征,最终将偏好特征融入系统提示中以提升模型对多种人类偏好的显式处理能力。
链接: https://arxiv.org/abs/2506.11098
作者: Dongyoung Kim,Jinsung Yoon,Jinwoo Shin,Jaehyung Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 page, 20 figures
Abstract:Recent preference learning frameworks for large language models (LLMs) simplify human preferences with binary pairwise comparisons and scalar rewards. This simplification could make LLMs’ responses biased to mostly preferred features, and would be exacerbated during the iterations of online preference learning steps. To address these challenges, we propose a novel framework coined PFP (Preference Feature Preservation). The key idea of PFP is maintaining the distribution of human preference features and utilizing such rich signals throughout the online preference learning process. Specifically, PFP first extract preference features from offline pairwise human preference data and trains a feature classifier. Then, using trained classifier and the distribution preserving optimization, PFP maps appropriate preference features for a new input instruction during online learning. Lastly, PFP trains LLM using the existing preference learning method, by incorporating the preference feature into system prompts and enabling LLM to explicitly handle various human preferences. Our experiments demonstrate that PFP successfully mitigates the bias in preference features during online learning, and hence achieves superior performance compared to previous preference learning methods on standard benchmarks to evaluate LLM alignment.
zh
[AI-63] CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval
【速读】:该论文旨在解决当前代码检索基准主要关注功能相关性而忽视软件质量关键维度的问题。其解决方案的关键在于提出CoQuIR,这是一个大规模、多语言的基准,专门用于评估跨四个核心质量维度(正确性、效率、安全性和可维护性)的质量感知代码检索。CoQuIR提供了细粒度的质量标注,并引入了两种以质量为中心的评估指标,同时通过实验验证了在不牺牲语义相关性的前提下,利用合成数据集提升模型在质量感知指标上的表现。
链接: https://arxiv.org/abs/2506.11066
作者: Jiahui Geng,Fengyu Cai,Shaobo Cui,Qing Li,Liangwei Chen,Chenyang Lyu,Haonan Li,Derui Zhu,Walter Pretschner,Heinz Koeppl,Fakhri Karray
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Code retrieval is essential in modern software development, as it boosts code reuse and accelerates debugging. However, current benchmarks primarily emphasize functional relevance while neglecting critical dimensions of software quality. Motivated by this gap, we introduce CoQuIR, the first large-scale, multilingual benchmark specifically designed to evaluate quality-aware code retrieval across four key dimensions: correctness, efficiency, security, and maintainability. CoQuIR provides fine-grained quality annotations for 42,725 queries and 134,907 code snippets in 11 programming languages, and is accompanied by two quality-centric evaluation metrics: Pairwise Preference Accuracy and Margin-based Ranking Score. Using CoQuIR, we benchmark 23 retrieval models, covering both open-source and proprietary systems, and find that even top-performing models frequently fail to distinguish buggy or insecure code from their more robust counterparts. Furthermore, we conduct preliminary investigations into training methods that explicitly encourage retrievers to recognize code quality. Using synthetic datasets, we demonstrate promising improvements in quality-aware metrics across various models, without sacrificing semantic relevance. Downstream code generation experiments further validate the effectiveness of our approach. Overall, our work highlights the importance of integrating quality signals into code retrieval systems, laying the groundwork for more trustworthy and robust software development tools.
zh
[AI-64] Code Researcher: Deep Research Agent for Large Systems Code and Commit History
【速读】:该论文旨在解决在系统代码(systems code)中生成有效补丁以缓解崩溃问题的挑战,这一任务因系统代码的规模和复杂性而极具难度。论文提出的解决方案是设计一种名为Code Researcher的深度研究代理,其关键在于通过多步骤推理对代码的语义、模式及提交历史进行深入分析,从而收集足够的全局上下文信息,并将其存储于结构化记忆中以合成补丁。该方法显著提升了修复率,表现出优于现有基线模型的能力。
链接: https://arxiv.org/abs/2506.11060
作者: Ramneet Singh,Sathvik Joel,Abhav Mehrotra,Nalin Wadhwa,Ramakrishna B Bairi,Aditya Kanade,Nagarajan Natarajan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM)-based coding agents have shown promising results on coding benchmarks, but their effectiveness on systems code remains underexplored. Due to the size and complexities of systems code, making changes to a systems codebase is a daunting task, even for humans. It requires researching about many pieces of context, derived from the large codebase and its massive commit history, before making changes. Inspired by the recent progress on deep research agents, we design the first deep research agent for code, called Code Researcher, and apply it to the problem of generating patches for mitigating crashes reported in systems code. Code Researcher performs multi-step reasoning about semantics, patterns, and commit history of code to gather sufficient context. The context is stored in a structured memory which is used for synthesizing a patch. We evaluate Code Researcher on kBenchSyz, a benchmark of Linux kernel crashes, and show that it significantly outperforms strong baselines, achieving a crash-resolution rate of 58%, compared to 37.5% by SWE-agent. On an average, Code Researcher explores 10 files in each trajectory whereas SWE-agent explores only 1.33 files, highlighting Code Researcher’s ability to deeply explore the codebase. Through another experiment on an open-source multimedia software, we show the generalizability of Code Researcher. Our experiments highlight the importance of global context gathering and multi-faceted reasoning for large codebases.
zh
[AI-65] Refactoring Codebases through Library Design
【速读】:该论文试图解决如何将专门的解决方案重构为可维护和通用的软件组件的问题,特别是在代码代理在解决孤立编程问题上日益准确的背景下。论文提出的关键解决方案是Librarian方法和Minicode基准:Librarian是一种用于生成可重用库的采样与重新排序方法,而Minicode是一个要求代码代理最小化并重构多个独立解决方案以形成联合库的基准。通过这种方法,论文在压缩率和正确性方面均取得了优于现有先进代码代理的效果。
链接: https://arxiv.org/abs/2506.11058
作者: Ziga Kovacic,Celine Lee,Justin Chiu,Wenting Zhao,Kevin Ellis
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 26 pages
Abstract:Maintainable and general software allows developers to build robust applications efficiently, yet achieving these qualities often requires refactoring specialized solutions into reusable components. This challenge becomes particularly relevant as code agents become increasingly accurate at solving isolated programming problems. We investigate code agents’ capacity to refactor code in ways supporting growth and reusability. We present both a method and a benchmark for refactoring: Librarian, a sample-and-rerank method for generating reusable libraries, and Minicode, a benchmark where code agents must minimize and refactor multiple independent solutions into a joint library. Compared to state-of-the-art code agents, Librarian achieves strong results on both compression and correctness on Minicode, obtaining compression rates 1.6-2x better than coding agents while also improving correctness. We open-source our code and benchmark at this https URL.
zh
[AI-66] STRCMP: Integrating Graph Structural Priors with Language Models for Combinatorial Optimization
【速读】:该论文旨在解决组合优化(Combinatorial Optimization, CO)问题中由于现有方法忽视CO问题固有结构先验而导致的解质量不高和计算效率低的问题。其解决方案的关键在于提出了一种结构感知的大型语言模型(LLM)算法发现框架STRCMP,该框架通过图神经网络(GNN)提取CO实例的结构嵌入,并结合条件化的LLM生成符合问题拓扑结构的求解器特定代码,从而提升解的最优性和求解效率。
链接: https://arxiv.org/abs/2506.11057
作者: Xijun Li,Jiexiang Yang,Jinghao Wang,Bo Peng,Jianguo Yao,Haibing Guan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Combinatorial optimization (CO) problems, central to operation research and theoretical computer science, present significant computational challenges due to their NP-hard nature. While large language models (LLMs) have emerged as promising tools for CO–either by directly generating solutions or synthesizing solver-specific codes–existing approaches often neglect critical structural priors inherent to CO problems, leading to suboptimality and iterative inefficiency. Inspired by human experts’ success in leveraging CO structures for algorithm design, we propose STRCMP, a novel structure-aware LLM-based algorithm discovery framework that systematically integrates structure priors to enhance solution quality and solving efficiency. Our framework combines a graph neural network (GNN) for extracting structural embeddings from CO instances with an LLM conditioned on these embeddings to identify high-performing algorithms in the form of solver-specific codes. This composite architecture ensures syntactic correctness, preserves problem topology, and aligns with natural language objectives, while an evolutionary refinement process iteratively optimizes generated algorithm. Extensive evaluations across Mixed Integer Linear Programming and Boolean Satisfiability problems, using nine benchmark datasets, demonstrate that our proposed STRCMP outperforms five strong neural and LLM-based methods by a large margin, in terms of both solution optimality and computational efficiency. The code and learned model will be publicly available upon the acceptance of the paper.
zh
[AI-67] xInv: Explainable Optimization of Inverse Problems
【速读】:该论文试图解决逆问题(inverse problems)中迭代优化过程缺乏可解释性的问题,尽管前向模型的可解释性已有较大进展,但领域专家仍难以理解优化过程。解决方案的关键在于提出一种方法,通过在可微分模拟器中注入事件(instrumentation),使其在正向和反向传播过程中生成自然语言事件,并利用语言模型从这些事件列表中生成人类可理解的解释。
链接: https://arxiv.org/abs/2506.11056
作者: Sean Memery,Kevin Denamganai,Anna Kapron-King,Kartic Subr
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Inverse problems are central to a wide range of fields, including healthcare, climate science, and agriculture. They involve the estimation of inputs, typically via iterative optimization, to some known forward model so that it produces a desired outcome. Despite considerable development in the explainability and interpretability of forward models, the iterative optimization of inverse problems remains largely cryptic to domain experts. We propose a methodology to produce explanations, from traces produced by an optimizer, that are interpretable by humans at the abstraction of the domain. The central idea in our approach is to instrument a differentiable simulator so that it emits natural language events during its forward and backward passes. In a post-process, we use a Language Model to create an explanation from the list of events. We demonstrate the effectiveness of our approach with an illustrative optimization problem and an example involving the training of a neural network.
zh
[AI-68] Adaptive Composition of Machine Learning as a Service (MLaaS) for IoT Environments
【速读】:该论文试图解决物联网(Internet of Things, IoT)环境中机器学习即服务(Machine Learning as a Service, MLaaS)组合的长期有效性问题,特别是在数据分布波动(如概念漂移和数据异质性)以及系统需求演变(如可扩展性需求和资源限制)背景下,如何保持服务的高效性和适应性。解决方案的关键在于提出一种自适应MLaaS组合框架,该框架集成服务评估模型与候选选择模型,以识别性能不佳的服务并筛选最优替代方案,并通过基于上下文多臂老虎机优化策略的自适应组合机制,实现MLaaS组合的增量更新,从而在持续适应物联网约束的同时,维持服务质量(Quality of Service, QoS)并降低从头重组的计算成本。
链接: https://arxiv.org/abs/2506.11054
作者: Deepak Kanneganti,Sajib Mistry,Sheik Mohammad Mostakim Fattah,Aneesh Krishna,Monowar Bhuyan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The dynamic nature of Internet of Things (IoT) environments challenges the long-term effectiveness of Machine Learning as a Service (MLaaS) compositions. The uncertainty and variability of IoT environments lead to fluctuations in data distribution, e.g., concept drift and data heterogeneity, and evolving system requirements, e.g., scalability demands and resource limitations. This paper proposes an adaptive MLaaS composition framework to ensure a seamless, efficient, and scalable MLaaS composition. The framework integrates a service assessment model to identify underperforming MLaaS services and a candidate selection model to filter optimal replacements. An adaptive composition mechanism is developed that incrementally updates MLaaS compositions using a contextual multi-armed bandit optimization strategy. By continuously adapting to evolving IoT constraints, the approach maintains Quality of Service (QoS) while reducing the computational cost associated with recomposition from scratch. Experimental results on a real-world dataset demonstrate the efficiency of our proposed approach.
zh
[AI-69] Bootstrapping your behavior: a new pretraining strategy for user behavior sequence data
【速读】:该论文旨在解决用户行为序列(User Behavior Sequence, UBS)建模中手动构建行为词汇表的劳动密集型和易偏问题,以及词汇表容量限制对模型泛化能力的影响。其解决方案的关键在于提出一种名为Bootstrapping Your Behavior (\model)的新颖UBS预训练策略,通过自动构建的监督嵌入来预测未来时间窗口内所有行为的信息,从而消除手动选择行为词汇表的步骤。
链接: https://arxiv.org/abs/2506.11053
作者: Weichang Wu,Xiaolu Zhang,Jun Zhou,Yuchen Li,Wenwen Xia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:User Behavior Sequence (UBS) modeling is crucial in industrial applications. As data scale and task diversity grow, UBS pretraining methods have become increasingly pivotal. State-of-the-art UBS pretraining methods rely on predicting behavior distributions. The key step in these methods is constructing a selected behavior vocabulary. However, this manual step is labor-intensive and prone to bias. The limitation of vocabulary capacity also directly affects models’ generalization ability. In this paper, we introduce Bootstrapping Your Behavior (\model), a novel UBS pretraining strategy that predicts an automatically constructed supervision embedding summarizing all behaviors’ information within a future time window, eliminating the manual behavior vocabulary selection. In implementation, we incorporate a student-teacher encoder scheme to construct the pretraining supervision effectively. Experiments on two real-world industrial datasets and eight downstream tasks demonstrate that \model achieves an average improvement of 3.9% in AUC and 98.9% in training throughput. Notably, the model exhibits meaningful attention patterns and cluster representations during pretraining without any label supervision. In our online deployment over two months, the pretrained model improves the KS by about 2.7% and 7.1% over the baseline model for two financial overdue risk prediction tasks in the Alipay mobile application, which reduces bad debt risk by millions of dollars for Ant group.
zh
[AI-70] ACCORD: Autoregressive Constraint-satisfying Generation for COmbinatorial Optimization with Routing and Dynamic attention
【速读】:该论文试图解决将大型语言模型(Large Language Models, LLMs)直接应用于NP难组合优化问题(Combinatorial Problems, CPs)的可行性与效果问题。现有研究对LLMs在这一领域的应用探索不足,而该工作系统地评估了LLMs在多种NP难组合优化任务中的推理能力,并提出了ACCORD:一种基于路由和动态注意力的自回归约束满足生成框架。ACCORD的关键在于其创新的数据集表示和模型架构,通过利用LLMs的自回归特性动态施加可行性约束,并结合基于注意力的路由机制激活特定问题的LoRA模块,从而提升求解效率与可行性。
链接: https://arxiv.org/abs/2506.11052
作者: Henrik Abgaryan,Tristan Cazenave,Ararat Harutyunyan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet their direct application to NP-hard combinatorial problems (CPs) remains underexplored. In this work, we systematically investigate the reasoning abilities of LLMs on a variety of NP-hard combinatorial optimization tasks and introduce ACCORD: Autoregressive Constraint-satisfying generation for COmbinatorial optimization with Routing and Dynamic attention. ACCORD features a novel dataset representation and model architecture that leverage the autoregressive nature of LLMs to dynamically enforce feasibility constraints, coupled with attention-based routing to activate problem-specific LoRA modules. We also present the ACCORD-90k supervised dataset, covering six NP-hard combinatorial problems: TSP, VRP, Knapsack, FlowShop, JSSP, and BinPacking. Extensive experiments demonstrate that our ACCORD model, built on an 8B-parameter Llama backbone, consistently outperforms standard prompting and input-output methods, even when compared to much larger LLMs, such as gpt-4. Ablation studies further show that our output structure enhances solution feasibility. To the best of our knowledge, this is the first large-scale, end-to-end framework for exploring the applications of LLMs to a broad spectrum of combinatorial optimization problems. The codes are publicly available at this https URL
zh
[AI-71] 15500 Seconds: Lean UAV Classification Leverag ing PEFT and Pre-Trained Networks
【速读】:该论文旨在解决深度学习在无人机(UAV)音频分类中面临的数据稀缺问题。其关键解决方案包括参数高效微调、数据增强以及预训练网络的运用,从而在EfficientNet-B0模型上实现了超过95%的验证准确率。
链接: https://arxiv.org/abs/2506.11049
作者: Andrew P. Berg,Qian Zhang,Mia Y. Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Unmanned Aerial Vehicles (UAVs) pose an escalating security concerns as the market for consumer and military UAVs grows. This paper address the critical data scarcity challenges in deep UAV audio classification. We build upon our previous work expanding novel approaches such as: parameter efficient fine-tuning, data augmentation, and pre-trained networks. We achieve performance upwards of 95% validation accuracy with EfficientNet-B0.
zh
[AI-72] I Cant Believe Its Not Real: CV-MuSeNet: Complex-Valued Multi-Signal Segmentation
【速读】:该论文旨在解决低信噪比(low SNR)环境下传统实值神经网络(RVNN)在无线信号特性如相位和幅度捕捉上的不足,从而提升频谱感知的效率与准确性。其解决方案的关键在于提出CMuSeNet,一种基于复值神经网络(CVNN)的多信号分割网络,引入了复值傅里叶谱焦点损失(CFL)和复平面交并比(CIoU)相似性度量,以增强训练性能,并在多个数据集上实现了显著的准确率提升及训练效率优化。
链接: https://arxiv.org/abs/2506.11048
作者: Sangwon Shin,Mehmet C. Vuran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The increasing congestion of the radio frequency spectrum presents challenges for efficient spectrum utilization. Cognitive radio systems enable dynamic spectrum access with the aid of recent innovations in neural networks. However, traditional real-valued neural networks (RVNNs) face difficulties in low signal-to-noise ratio (SNR) environments, as they were not specifically developed to capture essential wireless signal properties such as phase and amplitude. This work presents CMuSeNet, a complex-valued multi-signal segmentation network for wideband spectrum sensing, to address these limitations. Extensive hyperparameter analysis shows that a naive conversion of existing RVNNs into their complex-valued counterparts is ineffective. Built on complex-valued neural networks (CVNNs) with a residual architecture, CMuSeNet introduces a complexvalued Fourier spectrum focal loss (CFL) and a complex plane intersection over union (CIoU) similarity metric to enhance training performance. Extensive evaluations on synthetic, indoor overthe-air, and real-world datasets show that CMuSeNet achieves an average accuracy of 98.98%-99.90%, improving by up to 9.2 percentage points over its real-valued counterpart and consistently outperforms state of the art. Strikingly, CMuSeNet achieves the accuracy level of its RVNN counterpart in just two epochs, compared to the 27 epochs required for RVNN, while reducing training time by up to a 92.2% over the state of the art. The results highlight the effectiveness of complex-valued architectures in improving weak signal detection and training efficiency for spectrum sensing in challenging low-SNR environments. The dataset is available at: this https URL
zh
[AI-73] Angle Domain Guidance: Latent Diffusion Requires Rotation Rather Than Extrapolation ICML2025
【速读】:该论文试图解决生成式 AI (Generative AI) 在高引导权重下因分类器无关引导(Classifier-free Guidance, CFG)导致的图像颜色失真问题。其解决方案的关键在于提出一种角度域引导(Angle Domain Guidance, ADG)算法,该算法通过在优化角度对齐的同时限制幅度变化,有效缓解颜色失真,同时保持高引导权重下文本-图像对齐的优势。
链接: https://arxiv.org/abs/2506.11039
作者: Cheng Jin,Zhenyu Xiao,Chutao Liu,Yuantao Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2025
Abstract:Classifier-free guidance (CFG) has emerged as a pivotal advancement in text-to-image latent diffusion models, establishing itself as a cornerstone technique for achieving high-quality image synthesis. However, under high guidance weights, where text-image alignment is significantly enhanced, CFG also leads to pronounced color distortions in the generated images. We identify that these distortions stem from the amplification of sample norms in the latent space. We present a theoretical framework that elucidates the mechanisms of norm amplification and anomalous diffusion phenomena induced by classifier-free guidance. Leveraging our theoretical insights and the latent space structure, we propose an Angle Domain Guidance (ADG) algorithm. ADG constrains magnitude variations while optimizing angular alignment, thereby mitigating color distortions while preserving the enhanced text-image alignment achieved at higher guidance weights. Experimental results demonstrate that ADG significantly outperforms existing methods, generating images that not only maintain superior text alignment but also exhibit improved color fidelity and better alignment with human perceptual preferences.
zh
[AI-74] Runtime Safety through Adaptive Shielding: From Hidden Parameter Inference to Provable Guarantees
【速读】:该论文旨在解决隐藏参数(hidden parameters)变化带来的安全风险问题,例如机器人质量分布或摩擦力的变化可能在执行过程中引发安全隐患。其解决方案的关键在于提出一种运行时屏蔽机制(runtime shielding mechanism),该机制基于约束隐藏参数马尔可夫决策过程(constrained hidden-parameter Markov decision processes)的形式化框架。通过函数编码器(function encoders)实时推断隐藏参数,并结合置信预测(conformal prediction)处理不确定性,该屏蔽机制通过预测未来安全风险来限制动作空间,从而确保概率安全性并生成符合安全约束的最优策略。
链接: https://arxiv.org/abs/2506.11033
作者: Minjae Kwon,Tyler Ingebrand,Ufuk Topcu,Lu Feng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted
Abstract:Variations in hidden parameters, such as a robot’s mass distribution or friction, pose safety risks during execution. We develop a runtime shielding mechanism for reinforcement learning, building on the formalism of constrained hidden-parameter Markov decision processes. Function encoders enable real-time inference of hidden parameters from observations, allowing the shield and the underlying policy to adapt online. The shield constrains the action space by forecasting future safety risks (such as obstacle proximity) and accounts for uncertainty via conformal prediction. We prove that the proposed mechanism satisfies probabilistic safety guarantees and yields optimal policies among the set of safety-compliant policies. Experiments across diverse environments with varying hidden parameters show that our method significantly reduces safety violations and achieves strong out-of-distribution generalization, while incurring minimal runtime overhead.
zh
[AI-75] Forward Target Propagation: A Forward-Only Approach to Global Error Credit Assignment via Local Losses
【速读】:该论文试图解决传统神经网络训练方法——反向传播(Backpropagation, BP)在生物合理性和硬件实现上的局限性,包括通过对称权重进行误差反传、非局部信用分配以及反向传递过程中的活动冻结等问题。其解决方案的关键在于提出前向目标传播(Forward Target Propagation, FTP),该方法用第二次前向传播替代了传统的反向传播过程,通过仅使用前向计算估计各层目标,从而消除了对对称反馈权重或可学习的逆函数的需求,实现了模块化和局部学习。
链接: https://arxiv.org/abs/2506.11030
作者: Nazmus Saadat As-Saquib,A N M Nafiz Abeer,Hung-Ta Chien,Byung-Jun Yoon,Suhas Kumar,Su-in Yi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Training neural networks has traditionally relied on backpropagation (BP), a gradient-based algorithm that, despite its widespread success, suffers from key limitations in both biological and hardware perspectives. These include backward error propagation by symmetric weights, non-local credit assignment, and frozen activity during backward passes. We propose Forward Target Propagation (FTP), a biologically plausible and computationally efficient alternative that replaces the backward pass with a second forward pass. FTP estimates layerwise targets using only feedforward computations, eliminating the need for symmetric feedback weights or learnable inverse functions, hence enabling modular and local learning. We evaluate FTP on fully connected networks, CNNs, and RNNs, demonstrating accuracies competitive with BP on MNIST, CIFAR10, and CIFAR100, as well as effective modeling of long-term dependencies in sequential tasks. Moreover, FTP outperforms BP under quantized low-precision and emerging hardware constraints while also demonstrating substantial efficiency gains over other biologically inspired methods such as target propagation variants and forward-only learning algorithms. With its minimal computational overhead, forward-only nature, and hardware compatibility, FTP provides a promising direction for energy-efficient on-device learning and neuromorphic computing.
zh
[AI-76] Output Scaling: YingLong-Delayed Chain of Thought in a Large Pretrained Time Series Forecasting Model
【速读】:该论文旨在解决时间序列预测中的长期依赖建模与预测准确性问题,传统方法在处理长序列时往往受限于直接或递归策略的性能瓶颈。其解决方案的关键在于提出一种非因果、双向注意力编码器-only的Transformer框架——YingLong,通过掩码标记恢复进行训练,从而更有效地对齐语言理解任务,并引入多输入集成方法以减少输出方差。此外,该框架揭示了输出长度与模型精度之间的新型缩放效应,即较长的输出通过延迟的思维链推理显著提升了预测性能。
链接: https://arxiv.org/abs/2506.11029
作者: Xue Wang,Tian Zhou,Jinyang Gao,Bolin Ding,Jingren Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We present a joint forecasting framework for time series prediction that contrasts with traditional direct or recursive methods. This framework achieves state-of-the-art performance for our designed foundation model, YingLong, and reveals a novel scaling effect: longer outputs significantly enhance model accuracy due to delayed chain-of-thought reasoning in our non-causal approach. YingLong is a non-causal, bidirectional attention encoder-only transformer trained through masked token recovery, aligning more effectively with language understanding tasks than with generation tasks. Additionally, we boost performance by tackling output variance with a multi-input ensemble. We release four foundation models ranging from 6M to 300M parameters, demonstrating superior results in zero-shot tasks on the ETT and Weather datasets. YingLong achieves more than 60% best performance. To ensure generalizability, we assessed the models using the GIFT-Eval benchmark, which comprises 23 time series datasets across 7 domains. Yinglong significantly outperformed the best time-series foundation models, end-to-end trained models by 14% and 44% in rank this http URL pretrained 300M model is available at this https URL
zh
[AI-77] Enhancing Epidemic Forecasting: Evaluating the Role of Mobility Data and Graph Convolutional Networks
【速读】:该论文试图解决传染病爆发预测中机器学习算法与流行病学应用之间的差距,特别是在实际数据中整合移动性信息的困难。其解决方案的关键在于采用两阶段方法:首先通过试点研究评估移动性数据的重要性,其次在Transformer主干网络上评估图卷积网络(Graph Convolutional Networks, GCNs)的影响。研究发现,尽管移动性数据和GCN模块对预测性能提升不显著,但死亡率和住院数据的引入显著提高了模型准确性,同时揭示了GCN生成的空间图与封锁令之间存在显著相关性,表明空间图可能作为移动性的敏感指标。
链接: https://arxiv.org/abs/2506.11028
作者: Suhan Guo,Zhenghao Xu,Furao Shen,Jian Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate prediction of contagious disease outbreaks is vital for informed decision-making. Our study addresses the gap between machine learning algorithms and their epidemiological applications, noting that methods optimal for benchmark datasets often underperform with real-world data due to difficulties in incorporating mobility information. We adopt a two-phase approach: first, assessing the significance of mobility data through a pilot study, then evaluating the impact of Graph Convolutional Networks (GCNs) on a transformer backbone. Our findings reveal that while mobility data and GCN modules do not significantly enhance forecasting performance, the inclusion of mortality and hospitalization data markedly improves model accuracy. Additionally, a comparative analysis between GCN-derived spatial maps and lockdown orders suggests a notable correlation, highlighting the potential of spatial maps as sensitive indicators for mobility. Our research offers a novel perspective on mobility representation in predictive modeling for contagious diseases, empowering decision-makers to better prepare for future outbreaks.
zh
[AI-78] From Reasoning to Code: GRPO Optimization for Underrepresented Languages
【速读】:该论文试图解决在训练数据有限的编程语言中,使用大型语言模型(Large Language Models, LLMs)生成准确且可执行代码的挑战。其解决方案的关键在于采用小规模代码版本的Qwen 2.5模型,并结合Group Relative Policy Optimization (GRPO)方法,通过显式推理步骤实现有效的代码生成,特别是在代码库较小的语言中表现出显著优势。通过将推理驱动的反馈直接整合到强化学习循环中,模型能够生成逻辑一致且语法正确的代码。
链接: https://arxiv.org/abs/2506.11027
作者: Federico Pennino,Bianca Raimondi,Massimo Rondelli,Andrea Gurioli,Maurizio Gabbrielli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: Preprint. Under review
Abstract:Generating accurate and executable code using large language models (LLMs) is challenging for languages with limited public training data compared to popular languages such as Python. This paper introduces a generalizable approach that uses small-scale code versions of the Qwen 2.5 model combined with Group Relative Policy Optimization (GRPO) to enable effective code generation through explicit reasoning steps, which is particularly beneficial for languages with smaller source code databases. Using Prolog as a representative use case – given its limited online presence – the initial model faced challenges in generating executable code. After some training steps, the model successfully produces logically consistent and syntactically accurate code by directly integrating reasoning-driven feedback into the reinforcement learning loop. Experimental evaluations using mathematical logic problem benchmarks illustrate significant improvements in reasoning quality, code accuracy, and logical correctness, underscoring the potential of this approach to benefit a wide range of programming languages lacking extensive training resources.
zh
[AI-79] Not All Clients Are Equal: Personalized Federated Learning on Heterogeneous Multi-Modal Clients
【速读】:该论文旨在解决个性化联邦学习(Personalized Federated Learning, PFL)中因数据和模型异构性带来的挑战,特别是在多模态任务中的实际应用问题。其关键解决方案包括:针对数据异构性,提出了一种任务相似性感知的模型聚合方法,为每个客户端提供定制化的全局模型;针对模型异构性,设计了一个维度不变模块,以实现跨异构模型的知识共享。这些方法在真实数据分布偏移的多模态PFL基准上表现出色,显著提升了个性化与泛化能力。
链接: https://arxiv.org/abs/2506.11024
作者: Minhyuk Seo,Taeheon Kim,Hankook Lee,Jonghyun Choi,Tinne Tuytelaars
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Foundation models have shown remarkable capabilities across diverse multi-modal tasks, but their centralized training raises privacy concerns and induces high transmission costs. In contrast, federated learning (FL) offers a distributed alternative without the need to share data. Recently, for the growing demand for personalizing AI models for different user purposes, personalized federated learning (PFL) has emerged. PFL allows each client to leverage the knowledge of other clients for further adaptation to individual user preferences, again without the need to share data. Despite its potential, most PFL studies remain confined to simulated environments, overlooking the data and model heterogeneity that arise in real-world scenarios. In contrast, we first consider large data heterogeneity, evaluating on a new benchmark for multi-modal PFL, spanning 40 distinct tasks with realistic data distribution shifts. We then consider model heterogeneity in that we do not assume that all clients share similar model architectures. To address data heterogeneity, we propose a task-similarity-aware model aggregation method that provides customized global models to each client. For model heterogeneity, we propose a dimension-invariant module that enables knowledge sharing across heterogeneous models. Empirical validations demonstrate that the proposed approach outperforms the state-of-the-art, excelling in both personalization and generalization capabilities.
zh
[AI-80] OntoGSN: An Ontology for Dynamic Management of Assurance Cases ISWC2025
【速读】:该论文试图解决在系统属性(如安全性和鲁棒性)中构建和维护信心时,保证案例(Assurance Cases, ACs)管理困难的问题。现有工具在静态文档导向的应用中提供支持,但在动态场景(如自动驾驶)中的方法尚处于发展初期,而ACs的维护由于需要应对变化并保持嵌入知识的完整性,仍面临重大挑战。解决方案的关键在于提出OntoGSN:一个基于目标结构符号法(Goal Structuring Notation, GSN)标准的本体论及其配套中间件,通过知识表示和可查询图谱实现ACs的自动填充、评估与更新,其核心贡献包括对GSN社区标准v3的1:1形式化、辅助本体与解析器、设计决策库、SPARQL查询库及原型界面,旨在提升ACs的可管理性与动态适应能力。
链接: https://arxiv.org/abs/2506.11023
作者: Tomas Bueno Momcilovic,Barbara Gallina,Ingmar Kessler,Dian Balta
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Submitted to the ISWC 2025 Resources track
Abstract:Assurance cases (ACs) are a common artifact for building and maintaining confidence in system properties such as safety or robustness. Constructing an AC can be challenging, although existing tools provide support in static, document-centric applications and methods for dynamic contexts (e.g., autonomous driving) are emerging. Unfortunately, managing ACs remains a challenge, since maintaining the embedded knowledge in the face of changes requires substantial effort, in the process deterring developers - or worse, producing poorly managed cases that instill false confidence. To address this, we present OntoGSN: an ontology and supporting middleware for managing ACs in the Goal Structuring Notation (GSN) standard. OntoGSN offers a knowledge representation and a queryable graph that can be automatically populated, evaluated, and updated. Our contributions include: a 1:1 formalization of the GSN Community Standard v3 in an OWL ontology with SWRL rules; a helper ontology and parser for integration with a widely used AC tool; a repository and documentation of design decisions for OntoGSN maintenance; a SPARQL query library with automation patterns; and a prototypical interface. The ontology strictly adheres to the standard’s text and has been evaluated according to FAIR principles, the OOPS framework, competency questions, and community feedback. The development of other middleware elements is guided by the community needs and subject to ongoing evaluations. To demonstrate the utility of our contributions, we illustrate dynamic AC management in an example involving assurance of adversarial robustness in large language models.
zh
[AI-81] Eliminating Hallucination-Induced Errors in LLM Code Generation with Functional Clustering
【速读】:该论文试图解决现代代码生成大语言模型(Large Language Models, LLMs)在生成代码时产生的隐性错误(hallucination-induced errors),这些错误使得生成的代码不适合自主部署。解决方案的关键在于提出一种名为“功能聚类”(functional clustering)的黑盒封装方法,该方法通过采样多个候选程序、在自生成的测试用例上执行并根据输入输出行为进行聚类,从而消除几乎所有的错误,并提供可调节的置信度评分。最大聚类的经验质量作为精确的置信度估计,用户可通过设定单一标量阈值在覆盖率和可靠性之间进行权衡。
链接: https://arxiv.org/abs/2506.11021
作者: Chaitanya Ravuri,Saman Amarasinghe
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figure
Abstract:Modern code-generation LLMs can already solve a large fraction of programming problems, yet they still hallucinate subtle bugs that make their outputs unsafe for autonomous deployment. We present functional clustering, a black-box wrapper that eliminates nearly all hallucination-induced errors while providing a tunable confidence score. The wrapper samples many candidate programs, executes each on a self-generated test suite, and clusters candidates whose I/O behavior is identical; the empirical mass of the largest cluster serves as an exact confidence estimate. A single scalar threshold on this estimate lets users trade coverage for reliability with exponential guarantees. On LiveCodeBench our verifier preserves baseline pass@1 on solvable tasks yet slashes the error rate of returned answers from ~65% to 2%, and drives it to 0% at a conservative threshold while still answering 15.6% of prompts. Manual audits show that the few residual mistakes stem from prompt misinterpretation, not random generation noise, narrowing future work to specification clarity. Because the method requires only sampling and sandbox execution, it applies unchanged to closed-source APIs and future models, offering a practical path toward dependable, autonomous code generation. Our code is available on Github (this https URL).
zh
[AI-82] Extracting Knowledge Graphs from User Stories using LangChain
【速读】:该论文试图解决从用户故事中自动构建知识图谱的问题,以提升软件功能与用户期望之间的对齐度。解决方案的关键在于利用大型语言模型(Large Language Models)的强大能力,结合LangChain框架开发了用户故事图变换模块,通过该模块提取用户故事中的节点和关系,从而构建准确的知识图谱。
链接: https://arxiv.org/abs/2506.11020
作者: Thayná Camargo da Silva
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Master thesis work
Abstract:This thesis introduces a novel methodology for the automated generation of knowledge graphs from user stories by leveraging the advanced capabilities of Large Language Models. Utilizing the LangChain framework as a basis, the User Story Graph Transformer module was developed to extract nodes and relationships from user stories using an LLM to construct accurate knowledge this http URL innovative technique was implemented in a script to fully automate the knowledge graph extraction process. Additionally, the evaluation was automated through a dedicated evaluation script, utilizing an annotated dataset for assessment. By enhancing the visualization and understanding of user requirements and domain concepts, this method fosters better alignment between software functionalities and user expectations, ultimately contributing to more effective and user-centric software development processes.
zh
[AI-83] he Memory Paradox: Why Our Brains Need Knowledge in an Age of AI
【速读】:该论文试图解决在生成式 AI 和数字工具普及的背景下,人类认知系统因过度依赖外部辅助工具而导致内部记忆系统退化的问题。解决方案的关键在于构建强大的内部模型——即生物“图式”和神经流形,这些模型能够使用户评估、优化并引导 AI 输出,从而维持和增强陈述性记忆与程序性记忆的巩固过程,确保专业知识、批判性思维和长期记忆的形成。
链接: https://arxiv.org/abs/2506.11015
作者: Barbara Oakley,Michael Johnston,Ken-Zen Chen,Eulho Jung,Terrence J. Sejnowski
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Neurons and Cognition (q-bio.NC)
备注: 50 pages, 8 figures
Abstract:In the age of generative AI and ubiquitous digital tools, human cognition faces a structural paradox: as external aids become more capable, internal memory systems risk atrophy. Drawing on neuroscience and cognitive psychology, this paper examines how heavy reliance on AI systems and discovery-based pedagogies may impair the consolidation of declarative and procedural memory – systems essential for expertise, critical thinking, and long-term retention. We review how tools like ChatGPT and calculators can short-circuit the retrieval, error correction, and schema-building processes necessary for robust neural encoding. Notably, we highlight striking parallels between deep learning phenomena such as “grokking” and the neuroscience of overlearning and intuition. Empirical studies are discussed showing how premature reliance on AI during learning inhibits proceduralization and intuitive mastery. We argue that effective human-AI interaction depends on strong internal models – biological “schemata” and neural manifolds – that enable users to evaluate, refine, and guide AI output. The paper concludes with policy implications for education and workforce training in the age of large language models.
zh
[AI-84] Data Science: a Natural Ecosystem
【速读】:该论文试图解决数据科学领域中计算性数据科学与基础性数据科学之间可能出现的分歧问题,特别是在数据宇宙发现的有用性评估方面缺乏系统性方法所导致的潜在分裂。其解决方案的关键在于引入严格的方法来衡量数据宇宙发现的有用性,以缓解这种分歧并促进计算性与基础性数据科学的整合。
链接: https://arxiv.org/abs/2506.11010
作者: Emilio Porcu,Roy El Moukari,Laurent Najman(LIGM),Francisco Herrera(UGR),Horst Simon(ADIA)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (stat.ML)
备注:
Abstract:This manuscript provides a holistic (data-centric) view of what we term essential data science, as a natural ecosystem with challenges and missions stemming from the data universe with its multiple combinations of the 5D complexities (data structure, domain, cardinality, causality, and ethics) with the phases of the data life cycle. Data agents perform tasks driven by specific goals. The data scientist is an abstract entity that comes from the logical organization of data agents with their actions. Data scientists face challenges that are defined according to the missions. We define specific discipline-induced data science, which in turn allows for the definition of pan-data science, a natural ecosystem that integrates specific disciplines with the essential data science. We semantically split the essential data science into computational, and foundational. We claim that there is a serious threat of divergence between computational and foundational data science. Especially, if no approach is taken to rate whether a data universe discovery should be useful or not. We suggest that rigorous approaches to measure the usefulness of data universe discoveries might mitigate such a divergence.
zh
[AI-85] Impact of Comments on LLM Comprehension of Legacy Code
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在理解使用遗留语言编写的代码时存在的能力不足问题,这一问题主要源于现实世界中遗留系统缺乏或包含不准确的文档。解决方案的关键在于提出一种高效、定量的评估方法,通过多项选择题问答(Multiple-Choice Question Answering, MCQA)来客观衡量LLM对遗留代码的理解能力,并分析注释的出现频率及错误注释对其理解的影响。
链接: https://arxiv.org/abs/2506.11007
作者: Rock Sabetto,Emily Escamilla,Devesh Agarwal,Sujay Kandwal,Justin F. Brunelle,Scott Rosen,Nitin Naik,Samruddhi Thaker,Eric O. Scott,Jacob Zimmer,Amit Madan,Arun Sridharan,Doug Wendt,Michael Doyle,Christopher Glasz,Jasper Phillips,William Macke,Colin Diggs,Michael Bartholf,Zachary Robin,Paul Ursino
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have been increasingly integrated into software engineering and maintenance tasks due to their high performance with software engineering tasks and robust understanding of modern programming languages. However, the ability of LLMs to comprehend code written with legacy languages remains a research gap challenged by real-world legacy systems lacking or containing inaccurate documentation that may impact LLM comprehension. To assess LLM comprehension of legacy languages, there is a need for objective LLM evaluation. In order to objectively measure LLM comprehension of legacy languages, we need an efficient, quantitative evaluation method. We leverage multiple-choice question answering (MCQA), an emerging LLM evaluation methodology, to evaluate LLM comprehension of legacy code and the impact of comment prevalence and inaccurate comments. In this work, we present preliminary findings on the impact of documentation on LLM comprehension of legacy code and outline strategic objectives for future work.
zh
[AI-86] EmbedAgent : Benchmarking Large Language Models in Embedded System Development
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在嵌入式系统开发任务中的能力评估不足问题,特别是针对嵌入式系统编程、电路设计及跨平台迁移等实际应用场景缺乏系统的基准测试。其解决方案的关键在于提出EmbedAgent框架,该框架模拟真实世界中嵌入式系统开发中的角色(如嵌入式系统程序员、架构师和集成者),并通过Embedbench基准测试集对LLMs进行综合性评估,涵盖9种电子元件和3种硬件平台的126个案例,以更全面地衡量LLMs在连接数字与物理系统的任务中的表现。
链接: https://arxiv.org/abs/2506.11003
作者: Ruiyang Xu,Jialun Cao,Mingyuan Wu,Wenliang Zhong,Yaojie Lu,Ben He,Xianpei Han,Shing-Chi Cheung,Le Sun
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 21 pages
Abstract:Large Language Models (LLMs) have shown promise in various tasks, yet few benchmarks assess their capabilities in embedded system this http URL this paper, we introduce EmbedAgent, a paradigm designed to simulate real-world roles in embedded system development, such as Embedded System Programmer, Architect, and Integrator. This paradigm enables LLMs to be tested in tasks that bridge the gap between digital and physical systems, allowing for a more comprehensive assessment of their capabilities. To evaluate LLMs on these tasks, we propose Embedbench, the first comprehensive benchmark for embedded system programming, circuit design, and cross-platform this http URL consists of 126 cases, covering 9 electronic components across 3 hardware platforms. Through extensive experiments on 10 mainstream LLMs, we uncover several key findings. Surprisingly, despite the simplicity of the cases, DeepSeek-R1 achieves only a 55.6% pass@1 rate when provided with schematic information, and 50.0% when tasked with generating the schematics itself. In the cross-platform migration tasks, LLMs show relatively strong performance with MicroPython on the Raspberry Pi Pico (with the top model achieving 73.8% pass@1), but perform poorly on ESP-IDF, where the best model reaches only 29.4% pass@1.Interestingly, we observe that general-purpose chat LLMs like DeepSeek-V3 often fail to utilize relevant pre-trained knowledge in this domain, while reasoning LLMs tend to overthink and overlook efficient knowledge during pretraining. Based on these insights, we propose two strategies: retrieval augmented generation and compiler feedback-to enhance LLM performance. These strategies result in significant improvements, with Deepseek-R1 reaching a 65.1% pass@1 with correct schematics, and 53.1% without. Additionally, the accuracy of the Arduino to ESP32 migration task improves from 21.4% to 27.8%.
zh
[AI-87] Rethinking Technological Readiness in the Era of AI Uncertainty
【速读】:该论文试图解决当前军事作战系统中人工智能(Artificial Intelligence, AI)能力在部署前的可靠性、安全性及适用性评估不足的问题,传统技术成熟度评估方法未能充分考虑AI特有的关键因素,从而可能带来部署风险。解决方案的关键在于提出一种新的AI成熟度框架(AI Readiness Framework),该框架类似于传统的技术成熟度等级(Technology Readiness Levels, TRL),但针对AI进行了扩展,旨在更准确地评估AI组件的成熟度与可信度,确保其满足作战应用中的性能、透明度和人机融合标准。
链接: https://arxiv.org/abs/2506.11001
作者: S. Tucker Browne,Mark M. Bailey
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 12 pages
Abstract:Artificial intelligence (AI) is poised to revolutionize military combat systems, but ensuring these AI-enabled capabilities are truly mission-ready presents new challenges. We argue that current technology readiness assessments fail to capture critical AI-specific factors, leading to potential risks in deployment. We propose a new AI Readiness Framework to evaluate the maturity and trustworthiness of AI components in military systems. The central thesis is that a tailored framework - analogous to traditional Technology Readiness Levels (TRL) but expanded for AI - can better gauge an AI system’s reliability, safety, and suitability for combat use. Using current data evaluation tools and testing practices, we demonstrate the framework’s feasibility for near-term implementation. This structured approach provides military decision-makers with clearer insight into whether an AI-enabled system has met the necessary standards of performance, transparency, and human integration to be deployed with confidence, thus advancing the field of defense technology management and risk assessment.
zh
[AI-88] Automated Validation of COBOL to Java Transformation
【速读】:该论文试图解决将企业级遗留语言(如COBOL)代码自动转换为现代语言(如Java或Python)时,转换后的代码无法保证与原代码语义等价的问题。解决方案的关键是提出一种框架和工具,通过基于符号执行的测试生成方法,自动生成针对源COBOL程序的单元测试,并模拟外部资源调用,进而生成与COBOL具有相同模拟行为的JUnit测试用例,以验证原始代码与转换后代码之间的语义等价性。
链接: https://arxiv.org/abs/2506.10999
作者: Atul Kumar,Diptikalyan Saha,Toshikai Yasue,Kohichi Ono,Saravanan Krishnan,Sandeep Hans,Fumiko Satoh,Gerald Mitchell,Sachin Kumar
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2504.10548
Abstract:Recent advances in Large Language Model (LLM) based Generative AI techniques have made it feasible to translate enterpriselevel code from legacy languages such as COBOL to modern languages such as Java or Python. While the results of LLM-based automatic transformation are encouraging, the resulting code cannot be trusted to correctly translate the original code. We propose a framework and a tool to help validate the equivalence of COBOL and translated Java. The results can also help repair the code if there are some issues and provide feedback to the AI model to improve. We have developed a symbolic-execution-based test generation to automatically generate unit tests for the source COBOL programs which also mocks the external resource calls. We generate equivalent JUnit test cases with equivalent mocking as COBOL and run them to check semantic equivalence between original and translated programs.
zh
[AI-89] owards Automated Formal Verification of Backend Systems with LLM s
【速读】:该论文试图解决传统自动化测试方法在测试局部性、泛化可靠性及业务逻辑感知方面的局限性,从而无法达到人类工程师的测试能力。其解决方案的关键在于提出一种基于函数式编程和类型系统的框架,将Scala后端代码转换为形式化的Lean表示,并通过LLM-based provers自动生成并验证描述API和数据库操作预期行为的定理,从而实现对软件行为的正式验证。
链接: https://arxiv.org/abs/2506.10998
作者: Kangping Xu,Yifan Luo,Yang Yuan,Andrew Chi-Chih Yao
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Software testing plays a critical role in ensuring that systems behave as intended. However, existing automated testing approaches struggle to match the capabilities of human engineers due to key limitations such as test locality, lack of general reliability, and business logic blindness. In this work, we propose a novel framework that leverages functional programming and type systems to translate Scala backend code into formal Lean representations. Our pipeline automatically generates theorems that specify the intended behavior of APIs and database operations, and uses LLM-based provers to verify them. When a theorem is proved, the corresponding logic is guaranteed to be correct and no further testing is needed. If the negation of a theorem is proved instead, it confirms a bug. In cases where neither can be proved, human intervention is required. We evaluate our method on realistic backend systems and find that it can formally verify over 50% of the test requirements, which suggests that half of a testing engineer’s workload can be automated. Additionally, with an average cost of only 2.19 per API, LLM-based verification is significantly more cost-effective than manual testing and can be scaled easily through parallel execution. Our results indicate a promising direction for scalable, AI-powered software testing, with the potential to greatly improve engineering productivity as models continue to advance.
zh
[AI-90] Evaluating LLM s for Visualization Tasks
【速读】:该论文试图解决如何利用大型语言模型(Large Language Models, LLMs)生成可视化代码并理解常见可视化内容的问题。其解决方案的关键在于评估不同主流LLMs在简单提示下生成可视化代码的能力,以及它们回答与可视化相关问题的性能,从而揭示LLMs在信息可视化领域的潜力与局限性。
链接: https://arxiv.org/abs/2506.10996
作者: Saadiq Rauf Khan,Vinit Chandak,Sougata Mukherjea
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Information Visualization has been utilized to gain insights from complex data. In recent times, Large Language Models (LLMs) have performed very well in many tasks. In this paper, we showcase the capabilities of different popular LLMs to generate code for visualization based on simple prompts. We also analyze the power of LLMs to understand some common visualizations by answering simple questions. Our study shows that LLMs could generate code for some visualizations as well as answer questions about them. However, LLMs also have several limitations. We believe that our insights can be used to improve both LLMs and Information Visualization systems.
zh
[AI-91] On the Effectiveness of the Follow-the-Sun Strategy in Mitigating the Carbon Footprint of AI in Cloud Instances
【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)工作负载的碳足迹问题,特别是针对AI训练过程中高能耗带来的环境影响。解决方案的关键在于采用“跟随太阳”(Follow-the-Sun, FtS)策略,通过动态迁移计算任务至能源结构更清洁的地区,从而降低碳排放。实验结果表明,FtS策略在碳排放方面实现了平均减少14.6%(峰值达16.3%),同时不影响训练时间。
链接: https://arxiv.org/abs/2506.10990
作者: Roberto Vergallo,Luís Cruz,Alessio Errico,Luca Mainetti
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 24 pages, 4 figures, 10 tables
Abstract:‘Follow-the-Sun’ (FtS) is a theoretical computational model aimed at minimizing the carbon footprint of computer workloads. It involves dynamically moving workloads to regions with cleaner energy sources as demand increases and energy production relies more on fossil fuels. With the significant power consumption of Artificial Intelligence (AI) being a subject of extensive debate, FtS is proposed as a strategy to mitigate the carbon footprint of training AI models. However, the literature lacks scientific evidence on the advantages of FtS to mitigate the carbon footprint of AI workloads. In this paper, we present the results of an experiment conducted in a partial synthetic scenario to address this research gap. We benchmarked four AI algorithms in the anomaly detection domain and measured the differences in carbon emissions in four cases: no strategy, FtS, and two strategies previously introduced in the state of the art, namely Flexible Start and Pause and Resume. To conduct our experiment, we utilized historical carbon intensity data from the year 2021 for seven European cities. Our results demonstrate that the FtS strategy not only achieves average reductions of up to 14.6% in carbon emissions (with peaks of 16.3%) but also helps in preserving the time needed for training.
zh
[AI-92] Prompt engineering and framework: implementation to increase code reliability based guideline for LLM s
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成准确Python代码方面的不足,特别是提高生成代码片段的质量和正确性,使其能够通过测试并产生可靠结果。其解决方案的关键在于提出一种新颖的提示模板(prompt template),该模板通过优化提示结构来提升代码生成的效果,实验结果显示该方法在Pass@k指标上优于传统的零样本和思维链(Chain-of-Thought, CoT)方法,并且在减少token使用量方面表现更优,从而提高了效率并降低了计算资源需求。
链接: https://arxiv.org/abs/2506.10989
作者: Rogelio Cruz,Jonatan Contreras,Francisco Guerrero,Ezequiel Rodriguez,Carlos Valdez,Citlali Carrillo
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:In this paper, we propose a novel prompting approach aimed at enhancing the ability of Large Language Models (LLMs) to generate accurate Python code. Specifically, we introduce a prompt template designed to improve the quality and correctness of generated code snippets, enabling them to pass tests and produce reliable results. Through experiments conducted on two state-of-the-art LLMs using the HumanEval dataset, we demonstrate that our approach outperforms widely studied zero-shot and Chain-of-Thought (CoT) methods in terms of the Pass@k metric. Furthermore, our method achieves these improvements with significantly reduced token usage compared to the CoT approach, making it both effective and resource-efficient, thereby lowering the computational demands and improving the eco-footprint of LLM capabilities. These findings highlight the potential of tailored prompting strategies to optimize code generation performance, paving the way for broader applications in AI-driven programming tasks.
zh
[AI-93] Application Modernization with LLM s: Addressing Core Challenges in Reliability Security and Quality
【速读】:该论文试图解决AI辅助代码生成工具在软件开发中面临的安全性漏洞、可靠性问题以及生成代码的一致性不足等挑战,这些问题限制了该技术潜力的充分发挥。解决方案的关键在于提出一种强调大型语言模型(Large Language Models, LLMs)核心能力——代码推理与代码生成的有观点的方法,并将其与人类专家知识相结合,以有效应对应用现代化过程中的复杂问题。该框架突出了人类在确保AI辅助流程成功中的不可或缺作用。
链接: https://arxiv.org/abs/2506.10984
作者: Ahilan Ayyachamy Nadar Ponnusamy
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:AI-assisted code generation tools have revolutionized software development, offering unprecedented efficiency and scalability. However, multiple studies have consistently highlighted challenges such as security vulnerabilities, reliability issues, and inconsistencies in the generated code. Addressing these concerns is crucial to unlocking the full potential of this transformative technology. While advancements in foundational and code-specialized language models have made notable progress in mitigating some of these issues, significant gaps remain, particularly in ensuring high-quality, trustworthy outputs. This paper builds upon existing research on leveraging large language models (LLMs) for application modernization. It explores an opinionated approach that emphasizes two core capabilities of LLMs: code reasoning and code generation. The proposed framework integrates these capabilities with human expertise to tackle application modernization challenges effectively. It highlights the indispensable role of human involvement and guidance in ensuring the success of AI-assisted processes. To demonstrate the framework’s utility, this paper presents a detailed case study, walking through its application in a real-world scenario. The analysis includes a step-by-step breakdown, assessing alternative approaches where applicable. This work aims to provide actionable insights and a robust foundation for future research in AI-driven application modernization. The reference implementation created for this paper is available on GitHub. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.10984 [cs.SE] (or arXiv:2506.10984v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2506.10984 Focus to learn more arXiv-issued DOI via DataCite
zh
[AI-94] How do Probabilistic Graphical Models and Graph Neural Networks Look at Network Data?
【速读】:该论文试图解决Probabilistic Graphical Models (PGMs)与Graph Neural Networks (GNNs)在捕捉网络数据信息方面的性能比较问题。其解决方案的关键在于通过链接预测任务进行实验,评估两种模型在不同输入特征条件下的表现,包括低维或噪声特征、图的异质性增加的情况,并进一步比较两者的计算复杂度和可解释性。
链接: https://arxiv.org/abs/2506.11869
作者: Michela Lapenna,Caterina De Bacco
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Physics (math-ph)
备注:
Abstract:Graphs are a powerful data structure for representing relational data and are widely used to describe complex real-world systems. Probabilistic Graphical Models (PGMs) and Graph Neural Networks (GNNs) can both leverage graph-structured data, but their inherent functioning is different. The question is how do they compare in capturing the information contained in networked datasets? We address this objective by solving a link prediction task and we conduct three main experiments, on both synthetic and real networks: one focuses on how PGMs and GNNs handle input features, while the other two investigate their robustness to noisy features and increasing heterophily of the graph. PGMs do not necessarily require features on nodes, while GNNs cannot exploit the network edges alone, and the choice of input features matters. We find that GNNs are outperformed by PGMs when input features are low-dimensional or noisy, mimicking many real scenarios where node attributes might be scalar or noisy. Then, we find that PGMs are more robust than GNNs when the heterophily of the graph is increased. Finally, to assess performance beyond prediction tasks, we also compare the two frameworks in terms of their computational complexity and interpretability.
zh
[AI-95] Diffusion-Based Electrocardiography Noise Quantification via Anomaly Detection
【速读】:该论文旨在解决心电图(Electrocardiography, ECG)信号在临床和可穿戴设备中因噪声干扰导致的诊断复杂性问题。传统方法在标注不一致性和泛化能力方面存在局限,为此,本文提出了一种基于扩散框架的噪声量化方法,其关键在于通过重建异常检测实现噪声评估,并引入Wasserstein-1距离(W_1)进行分布评估,以比较干净与噪声ECG的重建误差分布,从而缓解标注不一致性问题。该方法仅需三个逆向扩散步骤即可实现稳健的噪声量化,表现出优越的性能和良好的泛化能力。
链接: https://arxiv.org/abs/2506.11815
作者: Tae-Seong Han,Jae-Wook Heo,Hakseung Kim,Cheol-Hui Lee,Hyub Huh,Eue-Keun Choi,Dong-Joo Kim
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: This manuscript contains 17 pages, 10 figures, and 3 tables
Abstract:Electrocardiography (ECG) signals are often degraded by noise, which complicates diagnosis in clinical and wearable settings. This study proposes a diffusion-based framework for ECG noise quantification via reconstruction-based anomaly detection, addressing annotation inconsistencies and the limited generalizability of conventional methods. We introduce a distributional evaluation using the Wasserstein-1 distance ( W_1 ), comparing the reconstruction error distributions between clean and noisy ECGs to mitigate inconsistent annotations. Our final model achieved robust noise quantification using only three reverse diffusion steps. The model recorded a macro-average W_1 score of 1.308 across the benchmarks, outperforming the next-best method by over 48%. External validations demonstrated strong generalizability, supporting the exclusion of low-quality segments to enhance diagnostic accuracy and enable timely clinical responses to signal degradation. The proposed method enhances clinical decision-making, diagnostic accuracy, and real-time ECG monitoring capabilities, supporting future advancements in clinical and wearable ECG applications.
zh
[AI-96] Measuring multi-calibration
【速读】:该论文试图解决如何有效衡量概率预测的多校准性(multi-calibration)问题,即在多个子群体中同时实现预测概率与实际观测值期望相一致的校准程度。解决方案的关键在于提出一种基于经典Kuiper统计量的新度量方法,该方法避免了传统基于分箱或核密度估计的度量所存在的已知问题,并通过按信号-噪声比加权不同子群体的贡献来提升度量的稳定性与准确性。
链接: https://arxiv.org/abs/2506.11251
作者: Ido Guy,Daniel Haimovich,Fridolin Linder,Nastaran Okati,Lorenzo Perini,Niek Tax,Mark Tygert
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 12 tables
Abstract:A suitable scalar metric can help measure multi-calibration, defined as follows. When the expected values of observed responses are equal to corresponding predicted probabilities, the probabilistic predictions are known as “perfectly calibrated.” When the predicted probabilities are perfectly calibrated simultaneously across several subpopulations, the probabilistic predictions are known as “perfectly multi-calibrated.” In practice, predicted probabilities are seldom perfectly multi-calibrated, so a statistic measuring the distance from perfect multi-calibration is informative. A recently proposed metric for calibration, based on the classical Kuiper statistic, is a natural basis for a new metric of multi-calibration and avoids well-known problems of metrics based on binning or kernel density estimation. The newly proposed metric weights the contributions of different subpopulations in proportion to their signal-to-noise ratios; data analyses’ ablations demonstrate that the metric becomes noisy when omitting the signal-to-noise ratios from the metric. Numerical examples on benchmark data sets illustrate the new metric.
zh
[AI-97] Complexity of normalized stochastic first-order methods with momentum under heavy-tailed noise
【速读】:该论文旨在解决无约束优化问题,特别是在存在重尾噪声和弱平均光滑性条件下的近似随机平稳点的寻找问题。解决方案的关键在于提出具有Polyak动量、多外推动量和递归动量的实用归一化随机一阶方法,这些方法采用动态更新的算法参数,无需显式了解问题相关的量如Lipschitz常数或噪声界。
链接: https://arxiv.org/abs/2506.11214
作者: Chuan He,Zhaosong Lu,Defeng Sun,Zhanwang Deng
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:In this paper, we propose practical normalized stochastic first-order methods with Polyak momentum, multi-extrapolated momentum, and recursive momentum for solving unconstrained optimization problems. These methods employ dynamically updated algorithmic parameters and do not require explicit knowledge of problem-dependent quantities such as the Lipschitz constant or noise bound. We establish first-order oracle complexity results for finding approximate stochastic stationary points under heavy-tailed noise and weakly average smoothness conditions – both of which are weaker than the commonly used bounded variance and mean-squared smoothness assumptions. Our complexity bounds either improve upon or match the best-known results in the literature. Numerical experiments are presented to demonstrate the practical effectiveness of the proposed methods.
zh
[AI-98] Multimodal Modeling of CRISPR-Cas12 Activity Using Foundation Models and Chromatin Accessibility Data ICML
【速读】:该论文试图解决引导RNA(guide RNA, gRNA)活性预测问题,这是实现有效CRISPR-Cas12基因组编辑的关键挑战。其解决方案的关键在于利用预先训练的生物基础模型(biological foundation model)的嵌入表示作为输入,结合轻量级回归器进行gRNA活性估计,从而在无需领域特定预训练的情况下显著提升预测性能。此外,通过整合染色质可及性数据以捕捉调控背景,进一步提升了模型效果。
链接: https://arxiv.org/abs/2506.11182
作者: Azim Dehghani Amirabad,Yanfei Zhang,Artem Moskalev,Sowmya Rajesh,Tommaso Mansi,Shuwei Li,Mangal Prakash,Rui Liao
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注: This manuscript has been accepted by ICML workshop 2025
Abstract:Predicting guide RNA (gRNA) activity is critical for effective CRISPR-Cas12 genome editing but remains challenging due to limited data, variation across protospacer adjacent motifs (PAMs-short sequence requirements for Cas binding), and reliance on large-scale training. We investigate whether pre-trained biological foundation model originally trained on transcriptomic data can improve gRNA activity estimation even without domain-specific pre-training. Using embeddings from existing RNA foundation model as input to lightweight regressor, we show substantial gains over traditional baselines. We also integrate chromatin accessibility data to capture regulatory context, improving performance further. Our results highlight the effectiveness of pre-trained foundation models and chromatin accessibility data for gRNA activity prediction.
zh
[AI-99] Brain2Vec: A Deep Learning Framework for EEG-Based Stress Detection Using CNN-LSTM-Attention
【速读】:该论文旨在解决心理压力对认知健康和整体福祉的广泛影响问题,提出一种稳健且非侵入性的诊断工具以实现对压力状态的有效识别。其解决方案的关键在于引入Brain2Vec模型,该模型结合了卷积、循环和注意力机制的混合架构,通过卷积层捕捉局部空间依赖性,LSTM层建模序列时间模式,并利用注意力机制突出信息丰富的时域区域,从而提升从原始脑电图(EEG)信号中分类压力状态的性能。
链接: https://arxiv.org/abs/2506.11179
作者: Md Mynoddin,Troyee Dev,Rishita Chakma
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Mental stress has become a pervasive factor affecting cognitive health and overall well-being, necessitating the development of robust, non-invasive diagnostic tools. Electroencephalogram (EEG) signals provide a direct window into neural activity, yet their non-stationary and high-dimensional nature poses significant modeling challenges. Here we introduce Brain2Vec, a new deep learning tool that classifies stress states from raw EEG recordings using a hybrid architecture of convolutional, recurrent, and attention mechanisms. The model begins with a series of convolutional layers to capture localized spatial dependencies, followed by an LSTM layer to model sequential temporal patterns, and concludes with an attention mechanism to emphasize informative temporal regions. We evaluate Brain2Vec on the DEAP dataset, applying bandpass filtering, z-score normalization, and epoch segmentation as part of a comprehensive preprocessing pipeline. Compared to traditional CNN-LSTM baselines, our proposed model achieves an AUC score of 0.68 and a validation accuracy of 81.25%. These findings demonstrate Brain2Vec’s potential for integration into wearable stress monitoring platforms and personalized healthcare systems.
zh
[AI-100] Intelligibility of Text-to-Speech Systems for Mathematical Expressions INTERSPEECH2025
【速读】:该论文试图解决高级文本到语音(Text-to-Speech, TTS)模型在处理数学表达式(Mathematical eXpressions, MX)时的语音质量与可理解性问题。其解决方案的关键在于设计实验,通过听觉测试和转录测试评估五种TTS模型在不同类别MX上的表现,并利用两种大语言模型(Large Language Models, LLMs)将LaTeX格式的MX转换为英文发音,以弥补TTS模型无法直接处理LaTeX格式的不足。研究还通过用户评分和转录正确性指标量化了可理解性,并与人类专家的发音进行了对比。
链接: https://arxiv.org/abs/2506.11086
作者: Sujoy Roychowdhury,H. G. Ranjani,Sumit Soman,Nishtha Paul,Subhadip Bandyopadhyay,Siddhanth Iyengar
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Accepted at Interspeech 2025
Abstract:There has been limited evaluation of advanced Text-to-Speech (TTS) models with Mathematical eXpressions (MX) as inputs. In this work, we design experiments to evaluate quality and intelligibility of five TTS models through listening and transcribing tests for various categories of MX. We use two Large Language Models (LLMs) to generate English pronunciation from LaTeX MX as TTS models cannot process LaTeX directly. We use Mean Opinion Score from user ratings and quantify intelligibility through transcription correctness using three metrics. We also compare listener preference of TTS outputs with respect to human expert rendition of same MX. Results establish that output of TTS models for MX is not necessarily intelligible, the gap in intelligibility varies across TTS models and MX category. For most categories, performance of TTS models is significantly worse than that of expert rendition. The effect of choice of LLM is limited. This establishes the need to improve TTS models for MX.
zh
[AI-101] Embedded Acoustic Intelligence for Automotive Systems
【速读】:该论文旨在解决如何通过声学特征提取与深度学习技术提升汽车系统智能性,以准确识别道路类型并支持自动驾驶及高级驾驶辅助系统(AD/ADAS)的适应性学习问题。解决方案的关键在于利用安装在车辆底盘内的麦克风提取声学特征,并结合来自Open AI生态系统的预训练模型(通过Hugging Face提供)进行深度神经网络分类,从而实现对道路表面的预测与主动道路噪声消除的优化。
链接: https://arxiv.org/abs/2506.11071
作者: Renjith Rajagopal,Peter Winzell,Sladjana Strbac,Konstantin Lindström,Petter Hörling,Faisal Kohestani,Niloofar Mehrzad
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注:
Abstract:Transforming sound insights into actionable streams of data, this abstract leverages findings from degree thesis research to enhance automotive system intelligence, enabling us to address road type [1].By extracting and interpreting acoustic signatures from microphones installed within the wheelbase of a car, we focus on classifying road this http URL deep neural networks and feature extraction powered by pre-trained models from the Open AI ecosystem (via Hugging Face [2]), our approach enables Autonomous Driving and Advanced Driver- Assistance Systems (AD/ADAS) to anticipate road surfaces, support adaptive learning for active road noise cancellation, and generate valuable insights for urban planning. The results of this study were specifically captured to support a compelling business case for next-generation automotive systems. This forward-looking approach not only promises to redefine passenger comfort and improve vehicle safety, but also paves the way for intelligent, data-driven urban road management, making the future of mobility both achievable and sustainable.
zh
[AI-102] Decoding Cortical Microcircuits: A Generative Model for Latent Space Exploration and Controlled Synthesis
【速读】:该论文试图解决如何从有限的遗传指令中生成大脑复杂结构的问题,特别是探索大脑结构如何决定功能。其解决方案的关键在于引入一种生成式 AI (Generative AI) 模型,通过学习小鼠皮层微环路的详细连接图谱,提取出一个压缩的潜在空间,该空间能够捕捉电路的本质结构信息,并且其中特定可解释的方向与网络属性直接相关。基于此潜在空间,论文进一步提出了一种可控生成具有所需结构特征的新合成微环路的方法。
链接: https://arxiv.org/abs/2506.11062
作者: Xingyu Liu,Yubin Li,Guozhang Chen
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:A central idea in understanding brains and building artificial intelligence is that structure determines function. Yet, how the brain’s complex structure arises from a limited set of genetic instructions remains a key question. The ultra high-dimensional detail of neural connections vastly exceeds the information storage capacity of genes, suggesting a compact, low-dimensional blueprint must guide brain development. Our motivation is to uncover this blueprint. We introduce a generative model, to learn this underlying representation from detailed connectivity maps of mouse cortical microcircuits. Our model successfully captures the essential structural information of these circuits in a compressed latent space. We found that specific, interpretable directions within this space directly relate to understandable network properties. Building on this, we demonstrate a novel method to controllably generate new, synthetic microcircuits with desired structural features by navigating this latent space. This work offers a new way to investigate the design principles of neural circuits and explore how structure gives rise to function, potentially informing the development of more advanced artificial neural networks.
zh
机器学习
[LG-0] An Efficient Compression of Deep Neural Network Checkpoints Based on Prediction and Context Modeling
链接: https://arxiv.org/abs/2506.12000
作者: Yuriy Kim,Evgeny Belyaev
类目: Machine Learning (cs.LG)
*备注: IEEE NW Russia Young Researchers in Electrical and Electronic Engineering Conference (EIConRusNW)
Abstract:This paper is dedicated to an efficient compression of weights and optimizer states (called checkpoints) obtained at different stages during a neural network training process. First, we propose a prediction-based compression approach, where values from the previously saved checkpoint are used for context modeling in arithmetic coding. Second, in order to enhance the compression performance, we also propose to apply pruning and quantization of the checkpoint values. Experimental results show that our approach achieves substantial bit size reduction, while enabling near-lossless training recovery from restored checkpoints, preserving the model’s performance and making it suitable for storage-limited environments.
[LG-1] pLSTM: parallelizable Linear Source Transition Mark networks
链接: https://arxiv.org/abs/2506.11997
作者: Korbinian Pöppel,Richard Freinschlag,Thomas Schmied,Wei Lin,Sepp Hochreiter
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Modern recurrent architectures, such as xLSTM and Mamba, have recently challenged the Transformer in language modeling. However, their structure constrains their applicability to sequences only or requires processing multi-dimensional data structures, such as images or molecular graphs, in a pre-defined sequential order. In contrast, Multi-Dimensional RNNs (MDRNNs) are well suited for data with a higher level structure, like 2D grids, trees, and directed acyclic graphs (DAGs). In this work, we extend the notion of multi-dimensionality to linear RNNs. We introduce parallelizable Linear Source Transition Mark networks (pLSTMs) using Source, Transition, and Mark gates that act on the line graph of a general DAG. This enables parallelization in analogy to parallel associative scans and the chunkwise-recurrent form of sequential linear RNNs, but for DAGs. For regular grids (1D and 2D), like images, this scheme can be efficiently implemented using einsum operations, concatenations, and padding in logarithmic time. pLSTMs tackle the vanishing/exploding activation/gradient problem for long distances in DAGs via two distinct modes: a directed propagation mode (P-mode) and a diffusive distribution mode (D-mode). To showcase the long-range capabilities of pLSTM, we introduce arrow-pointing extrapolation as a synthetic computer vision task that contains long-distance directional information. We demonstrate that pLSTMs generalize well to larger image sizes, whereas Transformers struggle to extrapolate. On established molecular graph and computer vision benchmarks, pLSTMs also show strong performance. Code and Datasets are available at: this https URL.
[LG-2] Compression Aware Certified Training
链接: https://arxiv.org/abs/2506.11992
作者: Changming Xu,Gagandeep Singh
类目: Machine Learning (cs.LG)
*备注: 19 pages, 1 figure
Abstract:Deep neural networks deployed in safety-critical, resource-constrained environments must balance efficiency and robustness. Existing methods treat compression and certified robustness as separate goals, compromising either efficiency or safety. We propose CACTUS (Compression Aware Certified Training Using network Sets), a general framework for unifying these objectives during training. CACTUS models maintain high certified accuracy even when compressed. We apply CACTUS for both pruning and quantization and show that it effectively trains models which can be efficiently compressed while maintaining high accuracy and certifiable robustness. CACTUS achieves state-of-the-art accuracy and certified performance for both pruning and quantization on a variety of datasets and input specifications.
[LG-3] Self-Regulating Cars: Automating Traffic Control in Free Flow Road Networks
链接: https://arxiv.org/abs/2506.11973
作者: Ankit Bhardwaj,Rohail Asim,Sachin Chauhan,Yasir Zaki,Lakshminarayanan Subramanian
类目: Machine Learning (cs.LG)
*备注:
Abstract:Free-flow road networks, such as suburban highways, are increasingly experiencing traffic congestion due to growing commuter inflow and limited infrastructure. Traditional control mechanisms, such as traffic signals or local heuristics, are ineffective or infeasible in these high-speed, signal-free environments. We introduce self-regulating cars, a reinforcement learning-based traffic control protocol that dynamically modulates vehicle speeds to optimize throughput and prevent congestion, without requiring new physical infrastructure. Our approach integrates classical traffic flow theory, gap acceptance models, and microscopic simulation into a physics-informed RL framework. By abstracting roads into super-segments, the agent captures emergent flow dynamics and learns robust speed modulation policies from instantaneous traffic observations. Evaluated in the high-fidelity PTV Vissim simulator on a real-world highway network, our method improves total throughput by 5%, reduces average delay by 13%, and decreases total stops by 3% compared to the no-control setting. It also achieves smoother, congestion-resistant flow while generalizing across varied traffic patterns, demonstrating its potential for scalable, ML-driven traffic management.
[LG-4] Scalable Generalized Bayesian Online Neural Network Training for Sequential Decision Making
链接: https://arxiv.org/abs/2506.11898
作者: Gerardo Duran-Martin,Leandro Sánchez-Betancourt,Álvaro Cartea,Kevin Murphy
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We introduce scalable algorithms for online learning and generalized Bayesian inference of neural network parameters, designed for sequential decision making tasks. Our methods combine the strengths of frequentist and Bayesian filtering, which include fast low-rank updates via a block-diagonal approximation of the parameter error covariance, and a well-defined posterior predictive distribution that we use for decision making. More precisely, our main method updates a low-rank error covariance for the hidden layers parameters, and a full-rank error covariance for the final layer parameters. Although this characterizes an improper posterior, we show that the resulting posterior predictive distribution is well-defined. Our methods update all network parameters online, with no need for replay buffers or offline retraining. We show, empirically, that our methods achieve a competitive tradeoff between speed and accuracy on (non-stationary) contextual bandit problems and Bayesian optimization problems.
[LG-5] Measurement-aligned Flow for Inverse Problem
链接: https://arxiv.org/abs/2506.11893
作者: Shaorong Zhang,Rob Brekelmans,Yunshu Wu,Greg Ver Steeg
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion models provide a powerful way to incorporate complex prior information for solving inverse problems. However, existing methods struggle to correctly incorporate guidance from conflicting signals in the prior and measurement, especially in the challenging setting of non-Gaussian or unknown noise. To bridge these gaps, we propose Measurement-Aligned Sampling (MAS), a novel framework for linear inverse problem solving that can more flexibly balance prior and measurement information. MAS unifies and extends existing approaches like DDNM and DAPS, and offers a new optimization perspective. MAS can generalize to handle known Gaussian noise, unknown or non-Gaussian noise types. Extensive experiments show that MAS consistently outperforms state-of-the-art methods across a range of tasks.
[LG-6] Understanding Input Selectivity in Mamba: Impact on Approximation Power Memorization and Associative Recall Capacity
链接: https://arxiv.org/abs/2506.11891
作者: Ningyuan Huang,Miguel Sarabia,Abhinav Moudgil,Pau Rodriguez,Luca Zappella,Federico Danieli
类目: Machine Learning (cs.LG)
*备注:
Abstract:State-Space Models (SSMs), and particularly Mamba, have recently emerged as a promising alternative to Transformers. Mamba introduces input selectivity to its SSM layer (S6) and incorporates convolution and gating into its block definition. While these modifications do improve Mamba’s performance over its SSM predecessors, it remains largely unclear how Mamba leverages the additional functionalities provided by input selectivity, and how these interact with the other operations in the Mamba architecture. In this work, we demystify the role of input selectivity in Mamba, investigating its impact on function approximation power, long-term memorization, and associative recall capabilities. In particular: (i) we prove that the S6 layer of Mamba can represent projections onto Haar wavelets, providing an edge over its Diagonal SSM (S4D) predecessor in approximating discontinuous functions commonly arising in practice; (ii) we show how the S6 layer can dynamically counteract memory decay; (iii) we provide analytical solutions to the MQAR associative recall task using the Mamba architecture with different mixers – Mamba, Mamba-2, and S4D. We demonstrate the tightness of our theoretical constructions with empirical results on concrete tasks. Our findings offer a mechanistic understanding of Mamba and reveal opportunities for improvement.
[LG-7] In Defense of Defensive Forecasting
链接: https://arxiv.org/abs/2506.11848
作者: Juan Carlos Perdomo,Benjamin Recht
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:This tutorial provides a survey of algorithms for Defensive Forecasting, where predictions are derived not by prognostication but by correcting past mistakes. Pioneered by Vovk, Defensive Forecasting frames the goal of prediction as a sequential game, and derives predictions to minimize metrics no matter what outcomes occur. We present an elementary introduction to this general theory and derive simple, near-optimal algorithms for online learning, calibration, prediction with expert advice, and online conformal prediction.
[LG-8] CLEAN-MI: A Scalable and Efficient Pipeline for Constructing High-Quality Neurodata in Motor Imagery Paradigm
链接: https://arxiv.org/abs/2506.11830
作者: Dingkun Liu,Zhu Chen,Dongrui Wu
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 10 pages, 6 figures
Abstract:The construction of large-scale, high-quality datasets is a fundamental prerequisite for developing robust and generalizable foundation models in motor imagery (MI)-based brain-computer interfaces (BCIs). However, EEG signals collected from different subjects and devices are often plagued by low signal-to-noise ratio, heterogeneity in electrode configurations, and substantial inter-subject variability, posing significant challenges for effective model training. In this paper, we propose CLEAN-MI, a scalable and systematic data construction pipeline for constructing large-scale, efficient, and accurate neurodata in the MI paradigm. CLEAN-MI integrates frequency band filtering, channel template selection, subject screening, and marginal distribution alignment to systematically filter out irrelevant or low-quality data and standardize multi-source EEG datasets. We demonstrate the effectiveness of CLEAN-MI on multiple public MI datasets, achieving consistent improvements in data quality and classification performance.
[LG-9] SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
链接: https://arxiv.org/abs/2506.11791
作者: Hwiwon Lee,Ziqi Zhang,Hanxiao Lu,Lingming Zhang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Rigorous security-focused evaluation of large language model (LLM) agents is imperative for establishing trust in their safe deployment throughout the software development lifecycle. However, existing benchmarks largely rely on synthetic challenges or simplified vulnerability datasets that fail to capture the complexity and ambiguity encountered by security engineers in practice. We introduce SEC-bench, the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks. SEC-bench employs a novel multi-agent scaffold that automatically constructs code repositories with harnesses, reproduces vulnerabilities in isolated environments, and generates gold patches for reliable evaluation. Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only 0.87 per instance. Using SEC-bench, we implement two critical software security tasks to rigorously evaluate LLM agents’ capabilities: proof-of-concept (PoC) generation and vulnerability patching. A comprehensive evaluation of state-of-the-art LLM code agents reveals significant performance gaps, achieving at most 18.0% success in PoC generation and 34.0% in vulnerability patching on our complete dataset. These results highlight the crucial steps needed toward developing LLM agents that are more practical, intelligent, and autonomous for security engineering.
[LG-10] SSPINNpose: A Self-Supervised PINN for Inertial Pose and Dynamics Estimation
链接: https://arxiv.org/abs/2506.11786
作者: Markus Gambietz,Eva Dorschky,Altan Akat,Marcel Schöckel,Jörg Miehling,Anne D. Koelewijn
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate real-time estimation of human movement dynamics, including internal joint moments and muscle forces, is essential for applications in clinical diagnostics and sports performance monitoring. Inertial measurement units (IMUs) provide a minimally intrusive solution for capturing motion data, particularly when used in sparse sensor configurations. However, current real-time methods rely on supervised learning, where a ground truth dataset needs to be measured with laboratory measurement systems, such as optical motion capture. These systems are known to introduce measurement and processing errors and often fail to generalize to real-world or previously unseen movements, necessitating new data collection efforts that are time-consuming and impractical. To overcome these limitations, we propose SSPINNpose, a self-supervised, physics-informed neural network that estimates joint kinematics and kinetics directly from IMU data, without requiring ground truth labels for training. We run the network output through a physics model of the human body to optimize physical plausibility and generate virtual measurement data. Using this virtual sensor data, the network is trained directly on the measured sensor data instead of a ground truth. When compared to optical motion capture, SSPINNpose is able to accurately estimate joint angles and joint moments at an RMSD of 8.7 deg and 4.9 BWBH%, respectively, for walking and running at speeds up to 4.9 m/s at a latency of 3.5 ms. Furthermore, the framework demonstrates robustness across sparse sensor configurations and can infer the anatomical locations of the sensors. These results underscore the potential of SSPINNpose as a scalable and adaptable solution for real-time biomechanical analysis in both laboratory and field environments.
[LG-11] Enabling automatic transcription of child-centered audio recordings from real-world environments
链接: https://arxiv.org/abs/2506.11747
作者: Daniil Kocharov,Okko Räsänen
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: pre-print
Abstract:Longform audio recordings obtained with microphones worn by children-also known as child-centered daylong recordings-have become a standard method for studying children’s language experiences and their impact on subsequent language development. Transcripts of longform speech audio would enable rich analyses at various linguistic levels, yet the massive scale of typical longform corpora prohibits comprehensive manual annotation. At the same time, automatic speech recognition (ASR)-based transcription faces significant challenges due to the noisy, unconstrained nature of real-world audio, and no existing study has successfully applied ASR to transcribe such data. However, previous attempts have assumed that ASR must process each longform recording in its entirety. In this work, we present an approach to automatically detect those utterances in longform audio that can be reliably transcribed with modern ASR systems, allowing automatic and relatively accurate transcription of a notable proportion of all speech in typical longform data. We validate the approach on four English longform audio corpora, showing that it achieves a median word error rate (WER) of 0% and a mean WER of 18% when transcribing 13% of the total speech in the dataset. In contrast, transcribing all speech without any filtering yields a median WER of 52% and a mean WER of 51%. We also compare word log-frequencies derived from the automatic transcripts with those from manual annotations and show that the frequencies correlate at r = 0.92 (Pearson) for all transcribed words and r = 0.98 for words that appear at least five times in the automatic transcripts. Overall, the work provides a concrete step toward increasingly detailed automated linguistic analyses of child-centered longform audio.
[LG-12] axonomy of reduction matrices for Graph Coarsening
链接: https://arxiv.org/abs/2506.11743
作者: Antonin Joly,Nicolas Keriven,Aline Roumy
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Graph coarsening aims to diminish the size of a graph to lighten its memory footprint, and has numerous applications in graph signal processing and machine learning. It is usually defined using a reduction matrix and a lifting matrix, which, respectively, allows to project a graph signal from the original graph to the coarsened one and back. This results in a loss of information measured by the so-called Restricted Spectral Approximation (RSA). Most coarsening frameworks impose a fixed relationship between the reduction and lifting matrices, generally as pseudo-inverses of each other, and seek to define a coarsening that minimizes the RSA. In this paper, we remark that the roles of these two matrices are not entirely symmetric: indeed, putting constraints on the lifting matrix alone ensures the existence of important objects such as the coarsened graph’s adjacency matrix or Laplacian. In light of this, in this paper, we introduce a more general notion of reduction matrix, that is not necessarily the pseudo-inverse of the lifting matrix. We establish a taxonomy of ``admissible’’ families of reduction matrices, discuss the different properties that they must satisfy and whether they admit a closed-form description or not. We show that, for a fixed coarsening represented by a fixed lifting matrix, the RSA can be further reduced simply by modifying the reduction matrix. We explore different examples, including some based on a constrained optimization process of the RSA. Since this criterion has also been linked to the performance of Graph Neural Networks, we also illustrate the impact of this choices on different node classification tasks on coarsened graphs.
[LG-13] Data-driven approaches to inverse problems
链接: https://arxiv.org/abs/2506.11732
作者: Carola-Bibiane Schönlieb,Zakhar Shumaylov
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Notes from Machine Learning: From Data to Mathematical Understanding (CIME 2023)
Abstract:Inverse problems are concerned with the reconstruction of unknown physical quantities using indirect measurements and are fundamental across diverse fields such as medical imaging, remote sensing, and material sciences. These problems serve as critical tools for visualizing internal structures beyond what is visible to the naked eye, enabling quantification, diagnosis, prediction, and discovery. However, most inverse problems are ill-posed, necessitating robust mathematical treatment to yield meaningful solutions. While classical approaches provide mathematically rigorous and computationally stable solutions, they are constrained by the ability to accurately model solution properties and implement them efficiently. A more recent paradigm considers deriving solutions to inverse problems in a data-driven manner. Instead of relying on classical mathematical modeling, this approach utilizes highly over-parameterized models, typically deep neural networks, which are adapted to specific inverse problems using carefully selected training data. Current approaches that follow this new paradigm distinguish themselves through solution accuracy paired with computational efficiency that was previously inconceivable. These notes offer an introduction to this data-driven paradigm for inverse problems. The first part of these notes will provide an introduction to inverse problems, discuss classical solution strategies, and present some applications. The second part will delve into modern data-driven approaches, with a particular focus on adversarial regularization and provably convergent linear plug-and-play denoisers. Throughout the presentation of these methodologies, their theoretical properties will be discussed, and numerical examples will be provided. The lecture series will conclude with a discussion of open problems and future perspectives in the field. Comments: Notes from Machine Learning: From Data to Mathematical Understanding (CIME 2023) Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Optimization and Control (math.OC) Cite as: arXiv:2506.11732 [math.NA] (or arXiv:2506.11732v1 [math.NA] for this version) https://doi.org/10.48550/arXiv.2506.11732 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-14] Growing with Experience: Growing Neural Networks in Deep Reinforcement Learning
链接: https://arxiv.org/abs/2506.11706
作者: Lukas Fehring,Marius Lindauer,Theresa Eimer
类目: Machine Learning (cs.LG)
*备注: 3 pages
Abstract:While increasingly large models have revolutionized much of the machine learning landscape, training even mid-sized networks for Reinforcement Learning (RL) is still proving to be a struggle. This, however, severely limits the complexity of policies we are able to learn. To enable increased network capacity while maintaining network trainability, we propose GrowNN, a simple yet effective method that utilizes progressive network growth during training. We start training a small network to learn an initial policy. Then we add layers without changing the encoded function. Subsequent updates can utilize the added layers to learn a more expressive policy, adding capacity as the policy’s complexity increases. GrowNN can be seamlessly integrated into most existing RL agents. Our experiments on MiniHack and Mujoco show improved agent performance, with incrementally GrowNN-deeper networks outperforming their respective static counterparts of the same size by up to 48% on MiniHack Room and 72% on Ant.
[LG-15] Geometry-Aware Edge Pooling for Graph Neural Networks
链接: https://arxiv.org/abs/2506.11700
作者: Katharina Limbeck,Lydia Mezrag,Guy Wolf,Bastian Rieck
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph Neural Networks (GNNs) have shown significant success for graph-based tasks. Motivated by the prevalence of large datasets in real-world applications, pooling layers are crucial components of GNNs. By reducing the size of input graphs, pooling enables faster training and potentially better generalisation. However, existing pooling operations often optimise for the learning task at the expense of fundamental graph structures and interpretability. This leads to unreliable performance across varying dataset types, downstream tasks and pooling ratios. Addressing these concerns, we propose novel graph pooling layers for structure aware pooling via edge collapses. Our methods leverage diffusion geometry and iteratively reduce a graph’s size while preserving both its metric structure and structural diversity. We guide pooling using magnitude, an isometry-invariant diversity measure, which permits us to control the fidelity of the pooling process. Further, we use the spread of a metric space as a faster and more stable alternative ensuring computational efficiency. Empirical results demonstrate that our methods (i) achieve superior performance compared to alternative pooling layers across a range of diverse graph classification tasks, (ii) preserve key spectral properties of the input graphs, and (iii) retain high accuracy across varying pooling ratios.
[LG-16] Deep Symmetric Autoencoders from the Eckart-Young-Schmidt Perspective
链接: https://arxiv.org/abs/2506.11641
作者: Simone Brivio,Nicola Rares Franco
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 28 pages, 10 figures
Abstract:Deep autoencoders have become a fundamental tool in various machine learning applications, ranging from dimensionality reduction and reduced order modeling of partial differential equations to anomaly detection and neural machine translation. Despite their empirical success, a solid theoretical foundation for their expressiveness remains elusive, particularly when compared to classical projection-based techniques. In this work, we aim to take a step forward in this direction by presenting a comprehensive analysis of what we refer to as symmetric autoencoders, a broad class of deep learning architectures ubiquitous in the literature. Specifically, we introduce a formal distinction between different classes of symmetric architectures, analyzing their strengths and limitations from a mathematical perspective. For instance, we show that the reconstruction error of symmetric autoencoders with orthonormality constraints can be understood by leveraging the well-renowned Eckart-Young-Schmidt (EYS) theorem. As a byproduct of our analysis, we end up developing the EYS initialization strategy for symmetric autoencoders, which is based on an iterated application of the Singular Value Decomposition (SVD). To validate our findings, we conduct a series of numerical experiments where we benchmark our proposal against conventional deep autoencoders, discussing the importance of model design and initialization.
[LG-17] Physically-informed change-point kernels for structural dynamics
链接: https://arxiv.org/abs/2506.11625
作者: Daniel James Pitchforth,Matthew Rhys Jones,Samuel John Gibson,Elizabeth Jane Cross
类目: Machine Learning (cs.LG)
*备注: 26 pages, 14 figures, 2 tables, 38 references
Abstract:The relative balance between physics and data within any physics-informed machine learner is an important modelling consideration to ensure that the benefits of both physics and data-based approaches are maximised. An over reliance on physical knowledge can be detrimental, particularly when the physics-based component of a model may not accurately represent the true underlying system. An underutilisation of physical knowledge potentially wastes a valuable resource, along with benefits in model interpretability and reduced demand for expensive data collection. Achieving an optimal physics-data balance is a challenging aspect of model design, particularly if the level varies through time; for example, one might have a physical approximation, only valid within particular regimes, or a physical phenomenon may be known to only occur when given conditions are met (e.g. at high temperatures). This paper develops novel, physically-informed, change-point kernels for Gaussian processes, capable of dynamically varying the reliance upon available physical knowledge. A high level of control is granted to a user, allowing for the definition of conditions in which they believe a phenomena should occur and the rate at which the knowledge should be phased in and out of a model. In circumstances where users may be less certain, the switching reliance upon physical knowledge may be automatically learned and recovered from the model in an interpretable and intuitive manner. Variation of the modelled noise based on the physical phenomena occurring is also implemented to provide a more representative capture of uncertainty alongside predictions. The capabilities of the new kernel structures are explored through the use of two engineering case studies: the directional wind loading of a cable-stayed bridge and the prediction of aircraft wing strain during in-flight manoeuvring.
[LG-18] Machine Unlearning for Robust DNNs: Attribution-Guided Partitioning and Neuron Pruning in Noisy Environments
链接: https://arxiv.org/abs/2506.11615
作者: Deliang Jin,Gang Chen,Shuo Feng,Yufeng Ling,Haoran Zhu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep neural networks (DNNs) have achieved remarkable success across diverse domains, but their performance can be severely degraded by noisy or corrupted training data. Conventional noise mitigation methods often rely on explicit assumptions about noise distributions or require extensive retraining, which can be impractical for large-scale models. Inspired by the principles of machine unlearning, we propose a novel framework that integrates attribution-guided data partitioning, discriminative neuron pruning, and targeted fine-tuning to mitigate the impact of noisy samples. Our approach employs gradient-based attribution to probabilistically distinguish high-quality examples from potentially corrupted ones without imposing restrictive assumptions on the noise. It then applies regression-based sensitivity analysis to identify and prune neurons that are most vulnerable to noise. Finally, the resulting network is fine-tuned on the high-quality data subset to efficiently recover and enhance its generalization performance. This integrated unlearning-inspired framework provides several advantages over conventional noise-robust learning approaches. Notably, it combines data-level unlearning with model-level adaptation, thereby avoiding the need for full model retraining or explicit noise modeling. We evaluate our method on representative tasks (e.g., CIFAR-10 image classification and speech recognition) under various noise levels and observe substantial gains in both accuracy and efficiency. For example, our framework achieves approximately a 10% absolute accuracy improvement over standard retraining on CIFAR-10 with injected label noise, while reducing retraining time by up to 47% in some settings. These results demonstrate the effectiveness and scalability of the proposed approach for achieving robust generalization in noisy environments.
[LG-19] KCES: Training-Free Defense for Robust Graph Neural Networks via Kernel Complexity
链接: https://arxiv.org/abs/2506.11611
作者: Yaning Jia,Shenyang Deng,Chiyu Ma,Yaoqing Yang,Soroush Vosoughi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph Neural Networks (GNNs) have achieved impressive success across a wide range of graph-based tasks, yet they remain highly vulnerable to small, imperceptible perturbations and adversarial attacks. Although numerous defense methods have been proposed to address these vulnerabilities, many rely on heuristic metrics, overfit to specific attack patterns, and suffer from high computational complexity. In this paper, we propose Kernel Complexity-Based Edge Sanitization (KCES), a training-free, model-agnostic defense framework. KCES leverages Graph Kernel Complexity (GKC), a novel metric derived from the graph’s Gram matrix that characterizes GNN generalization via its test error bound. Building on GKC, we define a KC score for each edge, measuring the change in GKC when the edge is removed. Edges with high KC scores, typically introduced by adversarial perturbations, are pruned to mitigate their harmful effects, thereby enhancing GNNs’ robustness. KCES can also be seamlessly integrated with existing defense strategies as a plug-and-play module without requiring training. Theoretical analysis and extensive experiments demonstrate that KCES consistently enhances GNN robustness, outperforms state-of-the-art baselines, and amplifies the effectiveness of existing defenses, offering a principled and efficient solution for securing GNNs.
[LG-20] SecONNds: Secure Outsourced Neural Network Inference on ImageNet
链接: https://arxiv.org/abs/2506.11586
作者: Shashank Balla
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:The widespread adoption of outsourced neural network inference presents significant privacy challenges, as sensitive user data is processed on untrusted remote servers. Secure inference offers a privacy-preserving solution, but existing frameworks suffer from high computational overhead and communication costs, rendering them impractical for real-world deployment. We introduce SecONNds, a non-intrusive secure inference framework optimized for large ImageNet-scale Convolutional Neural Networks. SecONNds integrates a novel fully Boolean Goldreich-Micali-Wigderson (GMW) protocol for secure comparison – addressing Yao’s millionaires’ problem – using preprocessed Beaver’s bit triples generated from Silent Random Oblivious Transfer. Our novel protocol achieves an online speedup of 17 \times in nonlinear operations compared to state-of-the-art solutions while reducing communication overhead. To further enhance performance, SecONNds employs Number Theoretic Transform (NTT) preprocessing and leverages GPU acceleration for homomorphic encryption operations, resulting in speedups of 1.6 \times on CPU and 2.2 \times on GPU for linear operations. We also present SecONNds-P, a bit-exact variant that ensures verifiable full-precision results in secure computation, matching the results of plaintext computations. Evaluated on a 37-bit quantized SqueezeNet model, SecONNds achieves an end-to-end inference time of 2.8 s on GPU and 3.6 s on CPU, with a total communication of just 420 MiB. SecONNds’ efficiency and reduced computational load make it well-suited for deploying privacy-sensitive applications in resource-constrained environments. SecONNds is open source and can be accessed from: this https URL.
[LG-21] Gradients of unitary optical neural networks using parameter-shift rule
链接: https://arxiv.org/abs/2506.11565
作者: Jinzhe Jiang,Yaqian Zhao,Xin Zhang,Chen Li,Yunlong Yu,Hailing Liu
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG); Optics (physics.optics)
*备注: 8 pages, 3 figures
Abstract:This paper explores the application of the parameter-shift rule (PSR) for computing gradients in unitary optical neural networks (UONNs). While backpropagation has been fundamental to training conventional neural networks, its implementation in optical neural networks faces significant challenges due to the physical constraints of optical systems. We demonstrate how PSR, which calculates gradients by evaluating functions at shifted parameter values, can be effectively adapted for training UONNs constructed from Mach-Zehnder interferometer meshes. The method leverages the inherent Fourier series nature of optical interference in these systems to compute exact analytical gradients directly from hardware measurements. This approach offers a promising alternative to traditional in silico training methods and circumvents the limitations of both finite difference approximations and all-optical backpropagation implementations. We present the theoretical framework and practical methodology for applying PSR to optimize phase parameters in optical neural networks, potentially advancing the development of efficient hardware-based training strategies for optical computing systems.
[LG-22] Robust Filtering – Novel Statistical Learning and Inference Algorithms with Applications
链接: https://arxiv.org/abs/2506.11530
作者: Aamir Hussain Chughtai
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: PhD Thesis
Abstract:State estimation or filtering serves as a fundamental task to enable intelligent decision-making in applications such as autonomous vehicles, robotics, healthcare monitoring, smart grids, intelligent transportation, and predictive maintenance. Standard filtering assumes prior knowledge of noise statistics to extract latent system states from noisy sensor data. However, real-world scenarios involve abnormalities like outliers, biases, drifts, and missing observations with unknown or partially known statistics, limiting conventional approaches. This thesis presents novel robust nonlinear filtering methods to mitigate these challenges. Based on insights from our filtering proposals, we extend the formulations to offline estimation/learning setups and propose smoothing extensions. Our methods leverage Bayesian inference frameworks, employing both deterministic and stochastic approximation techniques including Variational Inference (VI) and Particle Filters/Sequential Monte Carlo (SMC). We also study theoretical estimation limits using Bayesian Cramér-Rao bounds (BCRBs) in the context of measurement abnormalities. To validate the performance gains of the proposed methods, we perform simulations and experiments in scenarios including target tracking, indoor localization, 3D point cloud registration, mesh registration, and pose graph optimization. The fundamental nature of the work makes it useful in diverse applications, with possible future extensions toward developing outlier-robust machine learning pipelines, learning system dynamics from anomalous data, and addressing challenges in generative AI where standard diffusion models struggle with outliers, imbalanced datasets, and mode collapse.
[LG-23] Delayformer: spatiotemporal transformation for predicting high-dimensional dynamics
链接: https://arxiv.org/abs/2506.11528
作者: Zijian Wang,Peng Tao,Luonan Chen
类目: Machine Learning (cs.LG)
*备注: This paper is currently under review
Abstract:Predicting time-series is of great importance in various scientific and engineering fields. However, in the context of limited and noisy data, accurately predicting dynamics of all variables in a high-dimensional system is a challenging task due to their nonlinearity and also complex interactions. Current methods including deep learning approaches often perform poorly for real-world systems under such circumstances. This study introduces the Delayformer framework for simultaneously predicting dynamics of all variables, by developing a novel multivariate spatiotemporal information (mvSTI) transformation that makes each observed variable into a delay-embedded state (vector) and further cross-learns those states from different variables. From dynamical systems viewpoint, Delayformer predicts system states rather than individual variables, thus theoretically and computationally overcoming such nonlinearity and cross-interaction problems. Specifically, it first utilizes a single shared Visual Transformer (ViT) encoder to cross-represent dynamical states from observed variables in a delay embedded form and then employs distinct linear decoders for predicting next states, i.e. equivalently predicting all original variables parallelly. By leveraging the theoretical foundations of delay embedding theory and the representational capabilities of Transformers, Delayformer outperforms current state-of-the-art methods in forecasting tasks on both synthetic and real-world datasets. Furthermore, the potential of Delayformer as a foundational time-series model is demonstrated through cross-domain forecasting tasks, highlighting its broad applicability across various scenarios.
[LG-24] ask-Driven Discrete Representation Learning
链接: https://arxiv.org/abs/2506.11511
作者: Tung-Long Vuong
类目: Machine Learning (cs.LG)
*备注:
Abstract:In recent years, deep discrete representation learning (DRL) has achieved significant success across various domains. Most DRL frameworks (e.g., the widely used VQ-VAE and its variants) have primarily focused on generative settings, where the quality of a representation is implicitly gauged by the fidelity of its generation. In fact, the goodness of a discrete representation remain ambiguously defined across the literature. In this work, we adopt a practical approach that examines DRL from a task-driven perspective. We propose a unified framework that explores the usefulness of discrete features in relation to downstream tasks, with generation naturally viewed as one possible application. In this context, the properties of discrete representations as well as the way they benefit certain tasks are also relatively understudied. We therefore provide an additional theoretical analysis of the trade-off between representational capacity and sample complexity, shedding light on how discrete representation utilization impacts task performance. Finally, we demonstrate the flexibility and effectiveness of our framework across diverse applications.
[LG-25] LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation
链接: https://arxiv.org/abs/2506.11476
作者: Tom Baker,Javier Nistal
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted at ISMIR 2025
Abstract:Text-to-audio diffusion models produce high-quality and diverse music but many, if not most, of the SOTA models lack the fine-grained, time-varying controls essential for music production. ControlNet enables attaching external controls to a pre-trained generative model by cloning and fine-tuning its encoder on new conditionings. However, this approach incurs a large memory footprint and restricts users to a fixed set of controls. We propose a lightweight, modular architecture that considerably reduces parameter count while matching ControlNet in audio quality and condition adherence. Our method offers greater flexibility and significantly lower memory usage, enabling more efficient training and deployment of independent controls. We conduct extensive objective and subjective evaluations and provide numerous audio examples on the accompanying website at this https URL
[LG-26] Position Paper: Rethinking AI/ML for Air Interface in Wireless Networks
链接: https://arxiv.org/abs/2506.11466
作者: Georgios Kontes,Diomidis S. Michalopoulos,Birendra Ghimire,Christopher Mutschler
类目: Machine Learning (cs.LG)
*备注:
Abstract:AI/ML research has predominantly been driven by domains such as computer vision, natural language processing, and video analysis. In contrast, the application of AI/ML to wireless networks, particularly at the air interface, remains in its early stages. Although there are emerging efforts to explore this intersection, fully realizing the potential of AI/ML in wireless communications requires a deep interdisciplinary understanding of both fields. We provide an overview of AI/ML-related discussions in 3GPP standardization, highlighting key use cases, architectural considerations, and technical requirements. We outline open research challenges and opportunities where academic and industrial communities can contribute to shaping the future of AI-enabled wireless systems.
[LG-27] Dynamic Sparse Training of Diagonally Sparse Networks
链接: https://arxiv.org/abs/2506.11449
作者: Abhishek Tyagi,Arjun Iyer,William H Renninger,Christopher Kanan,Yuhao Zhu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent advances in Dynamic Sparse Training (DST) have pushed the frontier of sparse neural network training in structured and unstructured contexts, matching dense-model performance while drastically reducing parameter counts to facilitate model scaling. However, unstructured sparsity often fails to translate into practical speedups on modern hardware. To address this shortcoming, we propose DynaDiag, a novel structured sparse-to-sparse DST method that performs at par with unstructured sparsity. DynaDiag enforces a diagonal sparsity pattern throughout training and preserves sparse computation in forward and backward passes. We further leverage the diagonal structure to accelerate computation via a custom CUDA kernel, rendering the method hardware-friendly. Empirical evaluations on diverse neural architectures demonstrate that our method maintains accuracy on par with unstructured counterparts while benefiting from tangible computational gains. Notably, with 90% sparse linear layers in ViTs, we observe up to a 3.13x speedup in online inference without sacrificing model performance and a 1.59x speedup in training on a GPU compared to equivalent unstructured layers. Our source code is available at this https URL.
[LG-28] ReVeal: Self-Evolving Code Agents via Iterative Generation-Verification
链接: https://arxiv.org/abs/2506.11442
作者: Yiyang Jin,Kunzhao Xu,Hang Li,Xueting Han,Yanmin Zhou,Cheng Li,Jing Bai
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Recent advances in reinforcement learning (RL) with verifiable outcome rewards have significantly improved the reasoning capabilities of large language models (LLMs), especially when combined with multi-turn tool interactions. However, existing methods lack both meaningful verification signals from realistic environments and explicit optimization for verification, leading to unreliable self-verification. To address these limitations, we propose ReVeal, a multi-turn reinforcement learning framework that interleaves code generation with explicit self-verification and tool-based evaluation. ReVeal enables LLMs to autonomously generate test cases, invoke external tools for precise feedback, and improves performance via a customized RL algorithm with dense, per-turn rewards. As a result, ReVeal fosters the co-evolution of a model’s generation and verification capabilities through RL training, expanding the reasoning boundaries of the base model, demonstrated by significant gains in Pass@k on LiveCodeBench. It also enables test-time scaling into deeper inference regimes, with code consistently evolving as the number of turns increases during inference, ultimately surpassing DeepSeek-R1-Zero-Qwen-32B. These findings highlight the promise of ReVeal as a scalable and effective paradigm for building more robust and autonomous AI agents.
[LG-29] runcQuant: Truncation-Ready Quantization for DNNs with Flexible Weight Bit Precision
链接: https://arxiv.org/abs/2506.11431
作者: Jinhee Kim,Seoyeon Yoon,Taeho Lee,Joo Chan Lee,Kang Eun Jeon,Jong Hwan Ko
类目: Machine Learning (cs.LG)
*备注:
Abstract:The deployment of deep neural networks on edge devices is a challenging task due to the increasing complexity of state-of-the-art models, requiring efforts to reduce model size and inference latency. Recent studies explore models operating at diverse quantization settings to find the optimal point that balances computational efficiency and accuracy. Truncation, an effective approach for achieving lower bit precision mapping, enables a single model to adapt to various hardware platforms with little to no cost. However, formulating a training scheme for deep neural networks to withstand the associated errors introduced by truncation remains a challenge, as the current quantization-aware training schemes are not designed for the truncation process. We propose TruncQuant, a novel truncation-ready training scheme allowing flexible bit precision through bit-shifting in runtime. We achieve this by aligning TruncQuant with the output of the truncation process, demonstrating strong robustness across bit-width settings, and offering an easily implementable training scheme within existing quantization-aware frameworks. Our code is released at this https URL.
[LG-30] PPDiff: Diffusing in Hybrid Sequence-Structure Space for Protein-Protein Complex Design
链接: https://arxiv.org/abs/2506.11420
作者: Zhenqiao Song,Tiaoxiao Li,Lei Li,Martin Renqiang Min
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:
Abstract:Designing protein-binding proteins with high affinity is critical in biomedical research and biotechnology. Despite recent advancements targeting specific proteins, the ability to create high-affinity binders for arbitrary protein targets on demand, without extensive rounds of wet-lab testing, remains a significant challenge. Here, we introduce PPDiff, a diffusion model to jointly design the sequence and structure of binders for arbitrary protein targets in a non-autoregressive manner. PPDiffbuilds upon our developed Sequence Structure Interleaving Network with Causal attention layers (SSINC), which integrates interleaved self-attention layers to capture global amino acid correlations, k-nearest neighbor (kNN) equivariant graph layers to model local interactions in three-dimensional (3D) space, and causal attention layers to simplify the intricate interdependencies within the protein sequence. To assess PPDiff, we curate PPBench, a general protein-protein complex dataset comprising 706,360 complexes from the Protein Data Bank (PDB). The model is pretrained on PPBenchand finetuned on two real-world applications: target-protein mini-binder complex design and antigen-antibody complex design. PPDiffconsistently surpasses baseline methods, achieving success rates of 50.00%, 23.16%, and 16.89% for the pretraining task and the two downstream applications, respectively.
[LG-31] Byzantine Outside Curious Inside: Reconstructing Data Through Malicious Updates
链接: https://arxiv.org/abs/2506.11413
作者: Kai Yue,Richeng Jin,Chau-Wai Wong,Huaiyu Dai
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Federated learning (FL) enables decentralized machine learning without sharing raw data, allowing multiple clients to collaboratively learn a global model. However, studies reveal that privacy leakage is possible under commonly adopted FL protocols. In particular, a server with access to client gradients can synthesize data resembling the clients’ training data. In this paper, we introduce a novel threat model in FL, named the maliciously curious client, where a client manipulates its own gradients with the goal of inferring private data from peers. This attacker uniquely exploits the strength of a Byzantine adversary, traditionally aimed at undermining model robustness, and repurposes it to facilitate data reconstruction attack. We begin by formally defining this novel client-side threat model and providing a theoretical analysis that demonstrates its ability to achieve significant reconstruction success during FL training. To demonstrate its practical impact, we further develop a reconstruction algorithm that combines gradient inversion with malicious update strategies. Our analysis and experimental results reveal a critical blind spot in FL defenses: both server-side robust aggregation and client-side privacy mechanisms may fail against our proposed attack. Surprisingly, standard server- and client-side defenses designed to enhance robustness or privacy may unintentionally amplify data leakage. Compared to the baseline approach, a mistakenly used defense may instead improve the reconstructed image quality by 10-15%.
[LG-32] FIGNN: Feature-Specific Interpretability for Graph Neural Network Surrogate Models
链接: https://arxiv.org/abs/2506.11398
作者: Riddhiman Raut,Romit Maulik,Shivam Barwey
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:This work presents a novel graph neural network (GNN) architecture, the Feature-specific Interpretable Graph Neural Network (FIGNN), designed to enhance the interpretability of deep learning surrogate models defined on unstructured grids in scientific applications. Traditional GNNs often obscure the distinct spatial influences of different features in multivariate prediction tasks. FIGNN addresses this limitation by introducing a feature-specific pooling strategy, which enables independent attribution of spatial importance for each predicted variable. Additionally, a mask-based regularization term is incorporated into the training objective to explicitly encourage alignment between interpretability and predictive error, promoting localized attribution of model performance. The method is evaluated for surrogate modeling of two physically distinct systems: the SPEEDY atmospheric circulation model and the backward-facing step (BFS) fluid dynamics benchmark. Results demonstrate that FIGNN achieves competitive predictive performance while revealing physically meaningful spatial patterns unique to each feature. Analysis of rollout stability, feature-wise error budgets, and spatial mask overlays confirm the utility of FIGNN as a general-purpose framework for interpretable surrogate modeling in complex physical domains.
[LG-33] Convergence of physics-informed neural networks modeling time-harmonic wave fields
链接: https://arxiv.org/abs/2506.11395
作者: Stefan Schoder,Aneta Furmanová,Viktor Hruška
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Studying physics-informed neural networks (PINNs) for modeling partial differential equations to solve the acoustic wave field has produced promising results for simple geometries in two-dimensional domains. One option is to compute the time-harmonic wave field using the Helmholtz equation. Compared to existing numerical models, the physics-informed neural networks forward problem has to overcome several topics related to the convergence of the optimization toward the “true” solution. The topics reach from considering the physical dimensionality (from 2D to 3D), the modeling of realistic sources (from a self-similar source to a realistic confined point source), the modeling of sound-hard (Neumann) boundary conditions, and the modeling of the full wave field by considering the complex solution quantities. Within this contribution, we study 3D room acoustic cases at low frequency, varying the source definition and the number of boundary condition sets and using a complex speed of sound model to account for some degree of absorption. We assess the convergence behavior by looking at the loss landscape of the PINN architecture, the L^2 error compared to a finite element reference simulation for each network architecture and configuration. The convergence studies showed that at least six training points per wavelength are necessary for accurate training and subsequent predictions of the PINN. The developments are part of an initiative aiming to model the low-frequency behavior of room acoustics, including absorbers.
[LG-34] he Effect of Stochasticity in Score-Based Diffusion Sampling: a KL Divergence Analysis
链接: https://arxiv.org/abs/2506.11378
作者: Bernardo P. Schaeffer,Ricardo M. S. Rosa,Glauco Valle
类目: Machine Learning (cs.LG)
*备注:
Abstract:Sampling in score-based diffusion models can be performed by solving either a probability flow ODE or a reverse-time stochastic differential equation (SDE) parameterized by an arbitrary stochasticity parameter. In this work, we study the effect of stochasticity on the generation process through bounds on the Kullback-Leibler (KL) divergence and complement the analysis with numerical and analytical examples. Our results apply to general forward SDEs with additive noise and Lipschitz-continuous score functions, and quantify how errors from the prior distribution and score approximation propagate under different choices of the stochasticity parameter. The theoretical bounds are derived using log-Sobolev inequalities for the marginals of the forward process, which enable a more effective control of the KL divergence decay along sampling. For exact score functions, we find that stochasticity acts as an error-correcting mechanism, decreasing KL divergence along the sampling trajectory. For an approximate score function, there is a trade-off between error correction and score error amplification, so that stochasticity can either improve or worsen the performance, depending on the structure of the score error. Numerical experiments on simple datasets and a fully analytical example are included to illustrate and enlighten the theoretical results.
[LG-35] EDN: A Novel Edge-Dependent Noise Model for Graph Data
链接: https://arxiv.org/abs/2506.11368
作者: Pintu Kumar,Nandyala Hemachandra
类目: Machine Learning (cs.LG)
*备注:
Abstract:An important structural feature of a graph is its set of edges, as it captures the relationships among the nodes (the graph’s topology). Existing node label noise models like Symmetric Label Noise (SLN) and Class Conditional Noise (CCN) disregard this important node relationship in graph data; and the Edge-Dependent Noise (EDN) model addresses this limitation. EDN posits that in real-world scenarios, label noise may be influenced by the connections between nodes. We explore three variants of EDN. A crucial notion that relates nodes and edges in a graph is the degree of a node; we show that in all three variants, the probability of a node’s label corruption is dependent on its degree. Additionally, we compare the dependence of these probabilities on node degree across different variants. We performed experiments on popular graph datasets using 5 different GNN architectures and 8 noise robust algorithms for graph data. The results demonstrate that 2 variants of EDN lead to greater performance degradation in both Graph Neural Networks (GNNs) and existing noise-robust algorithms, as compared to traditional node label noise models. We statistically verify this by posing a suitable hypothesis-testing problem. This emphasizes the importance of incorporating EDN when evaluating noise robust algorithms for graphs, to enhance the reliability of graph-based learning in noisy environments.
[LG-36] Generalization Bound of Gradient Flow through Training Trajectory and Data-dependent Kernel
链接: https://arxiv.org/abs/2506.11357
作者: Yilan Chen,Zhichao Wang,Wei Huang,Andi Han,Taiji Suzuki,Arya Mazumdar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Gradient-based optimization methods have shown remarkable empirical success, yet their theoretical generalization properties remain only partially understood. In this paper, we establish a generalization bound for gradient flow that aligns with the classical Rademacher complexity bounds for kernel methods-specifically those based on the RKHS norm and kernel trace-through a data-dependent kernel called the loss path kernel (LPK). Unlike static kernels such as NTK, the LPK captures the entire training trajectory, adapting to both data and optimization dynamics, leading to tighter and more informative generalization guarantees. Moreover, the bound highlights how the norm of the training loss gradients along the optimization trajectory influences the final generalization performance. The key technical ingredients in our proof combine stability analysis of gradient flow with uniform convergence via Rademacher complexity. Our bound recovers existing kernel regression bounds for overparameterized neural networks and shows the feature learning capability of neural networks compared to kernel methods. Numerical experiments on real-world datasets validate that our bounds correlate well with the true generalization gap.
[LG-37] Improving Group Robustness on Spurious Correlation via Evidential Alignment KDD2025
链接: https://arxiv.org/abs/2506.11347
作者: Wenqian Ye,Guangtao Zheng,Aidong Zhang
类目: Machine Learning (cs.LG)
*备注: Accepted at KDD2025
Abstract:Deep neural networks often learn and rely on spurious correlations, i.e., superficial associations between non-causal features and the targets. For instance, an image classifier may identify camels based on the desert backgrounds. While it can yield high overall accuracy during training, it degrades generalization on more diverse scenarios where such correlations do not hold. This problem poses significant challenges for out-of-distribution robustness and trustworthiness. Existing methods typically mitigate this issue by using external group annotations or auxiliary deterministic models to learn unbiased representations. However, such information is costly to obtain, and deterministic models may fail to capture the full spectrum of biases learned by the models. To address these limitations, we propose Evidential Alignment, a novel framework that leverages uncertainty quantification to understand the behavior of the biased models without requiring group annotations. By quantifying the evidence of model prediction with second-order risk minimization and calibrating the biased models with the proposed evidential calibration technique, Evidential Alignment identifies and suppresses spurious correlations while preserving core features. We theoretically justify the effectiveness of our method as capable of learning the patterns of biased models and debiasing the model without requiring any spurious correlation annotations. Empirical results demonstrate that our method significantly improves group robustness across diverse architectures and data modalities, providing a scalable and principled solution to spurious correlations.
[LG-38] he Sample Complexity of Parameter-Free Stochastic Convex Optimization
链接: https://arxiv.org/abs/2506.11336
作者: Jared Lawrence,Ari Kalinsky,Hannah Bradfield,Yair Carmon,Oliver Hinder
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:We study the sample complexity of stochastic convex optimization when problem parameters, e.g., the distance to optimality, are unknown. We pursue two strategies. First, we develop a reliable model selection method that avoids overfitting the validation set. This method allows us to generically tune the learning rate of stochastic optimization methods to match the optimal known-parameter sample complexity up to \log\log factors. Second, we develop a regularization-based method that is specialized to the case that only the distance to optimality is unknown. This method provides perfect adaptability to unknown distance to optimality, demonstrating a separation between the sample and computational complexity of parameter-free stochastic convex optimization. Combining these two methods allows us to simultaneously adapt to multiple problem structures. Experiments performing few-shot learning on CIFAR-10 by fine-tuning CLIP models and prompt engineering Gemini to count shapes indicate that our reliable model selection method can help mitigate overfitting to small validation sets. Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC) Cite as: arXiv:2506.11336 [cs.LG] (or arXiv:2506.11336v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2506.11336 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-39] An Attention-based Spatio-Temporal Neural Operator for Evolving Physics
链接: https://arxiv.org/abs/2506.11328
作者: Vispi Karkaria,Doksoo Lee,Yi-Ping Chen,Yue Yu,Wei Chen
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:
Abstract:In scientific machine learning (SciML), a key challenge is learning unknown, evolving physical processes and making predictions across spatio-temporal scales. For example, in real-world manufacturing problems like additive manufacturing, users adjust known machine settings while unknown environmental parameters simultaneously fluctuate. To make reliable predictions, it is desired for a model to not only capture long-range spatio-temporal interactions from data but also adapt to new and unknown environments; traditional machine learning models excel at the first task but often lack physical interpretability and struggle to generalize under varying environmental conditions. To tackle these challenges, we propose the Attention-based Spatio-Temporal Neural Operator (ASNO), a novel architecture that combines separable attention mechanisms for spatial and temporal interactions and adapts to unseen physical parameters. Inspired by the backward differentiation formula (BDF), ASNO learns a transformer for temporal prediction and extrapolation and an attention-based neural operator for handling varying external loads, enhancing interpretability by isolating historical state contributions and external forces, enabling the discovery of underlying physical laws and generalizability to unseen physical environments. Empirical results on SciML benchmarks demonstrate that ASNO outperforms over existing models, establishing its potential for engineering applications, physics discovery, and interpretable machine learning.
[LG-40] Efficient Traffic Classification using HW-NAS: Advanced Analysis and Optimization for Cybersecurity on Resource-Constrained Devices
链接: https://arxiv.org/abs/2506.11319
作者: Adel Chehade,Edoardo Ragusa,Paolo Gastaldo,Rodolfo Zunino
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:
Abstract:This paper presents a hardware-efficient deep neural network (DNN), optimized through hardware-aware neural architecture search (HW-NAS); the DNN supports the classification of session-level encrypted traffic on resource-constrained Internet of Things (IoT) and edge devices. Thanks to HW-NAS, a 1D convolutional neural network (CNN) is tailored on the ISCX VPN-nonVPN dataset to meet strict memory and computational limits while achieving robust performance. The optimized model attains an accuracy of 96.59% with just 88.26K parameters, 10.08M FLOPs, and a maximum tensor size of 20.12K. Compared to state-of-the-art models, it achieves reductions of up to 444-fold, 312-fold, and 15.6-fold in these metrics, respectively, significantly minimizing memory footprint and runtime requirements. The model also demonstrates versatility in classification tasks, achieving accuracies of up to 99.64% in VPN differentiation, VPN-type classification, broader traffic categories, and application identification. In addition, an in-depth approach to header-level preprocessing strategies confirms that the optimized model can provide notable performances across a wide range of configurations, even in scenarios with stricter privacy considerations. Likewise, a reduction in the length of sessions of up to 75% yields significant improvements in efficiency, while maintaining high accuracy with only a negligible drop of 1-2%. However, the importance of careful preprocessing and session length selection in the classification of raw traffic data is still present, as improper settings or aggressive reductions can bring about a 7% reduction in overall accuracy. Those results highlight the method’s effectiveness in enforcing cybersecurity for IoT networks, by providing scalable, efficient solutions for the real-time analysis of encrypted traffic within strict hardware limitations.
[LG-41] Sampling Imbalanced Data with Multi-objective Bilevel Optimization
链接: https://arxiv.org/abs/2506.11315
作者: Karen Medlin,Sven Leyffer,Krishnan Raghavan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Two-class classification problems are often characterized by an imbalance between the number of majority and minority datapoints resulting in poor classification of the minority class in particular. Traditional approaches, such as reweighting the loss function or naïve resampling, risk overfitting and subsequently fail to improve classification because they do not consider the diversity between majority and minority datasets. Such consideration is infeasible because there is no metric that can measure the impact of imbalance on the model. To obviate these challenges, we make two key contributions. First, we introduce MOODS~(Multi-Objective Optimization for Data Sampling), a novel multi-objective bilevel optimization framework that guides both synthetic oversampling and majority undersampling. Second, we introduce a validation metric – ` \epsilon/ \delta non-overlapping diversification metric’ – that quantifies the goodness of a sampling method towards model performance. With this metric we experimentally demonstrate state-of-the-art performance with improvement in diversity driving a 1-15 % increase in F1 scores.
[LG-42] SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding
链接: https://arxiv.org/abs/2506.11309
作者: Ziyi Zhang,Ziheng Jiang,Chengquan Jiang,Menghan Yu,Size Zheng,Haibin Lin,Henry Hoffmann,Xin Liu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Low-latency decoding for large language models (LLMs) is crucial for applications like chatbots and code assistants, yet generating long outputs remains slow in single-query settings. Prior work on speculative decoding (which combines a small draft model with a larger target model) and tensor parallelism has each accelerated decoding. However, conventional approaches fail to apply both simultaneously due to imbalanced compute requirements (between draft and target models), KV-cache inconsistencies, and communication overheads under small-batch tensor-parallelism. This paper introduces SwiftSpec, a system that targets ultra-low latency for LLM decoding. SwiftSpec redesigns the speculative decoding pipeline in an asynchronous and disaggregated manner, so that each component can be scaled flexibly and remove draft overhead from the critical path. To realize this design, SwiftSpec proposes parallel tree generation, tree-aware KV cache management, and fused, latency-optimized kernels to overcome the challenges listed above. Across 5 model families and 6 datasets, SwiftSpec achieves an average of 1.75x speedup over state-of-the-art speculative decoding systems and, as a highlight, serves Llama3-70B at 348 tokens/s on 8 Nvidia Hopper GPUs, making it the fastest known system for low-latency LLM serving at this scale.
[LG-43] Shapley Machine: A Game-Theoretic Framework for N-Agent Ad Hoc Teamwork
链接: https://arxiv.org/abs/2506.11285
作者: Jianhong Wang,Yang Li,Samuel Kaski,Jonathan Lawry
类目: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 25 pages
Abstract:Open multi-agent systems are increasingly important in modeling real-world applications, such as smart grids, swarm robotics, etc. In this paper, we aim to investigate a recently proposed problem for open multi-agent systems, referred to as n-agent ad hoc teamwork (NAHT), where only a number of agents are controlled. Existing methods tend to be based on heuristic design and consequently lack theoretical rigor and ambiguous credit assignment among agents. To address these limitations, we model and solve NAHT through the lens of cooperative game theory. More specifically, we first model an open multi-agent system, characterized by its value, as an instance situated in a space of cooperative games, generated by a set of basis games. We then extend this space, along with the state space, to accommodate dynamic scenarios, thereby characterizing NAHT. Exploiting the justifiable assumption that basis game values correspond to a sequence of n-step returns with different horizons, we represent the state values for NAHT in a form similar to \lambda -returns. Furthermore, we derive Shapley values to allocate state values to the controlled agents, as credits for their contributions to the ad hoc team. Different from the conventional approach to shaping Shapley values in an explicit form, we shape Shapley values by fulfilling the three axioms uniquely describing them, well defined on the extended game space describing NAHT. To estimate Shapley values in dynamic scenarios, we propose a TD( \lambda )-like algorithm. The resulting reinforcement learning (RL) algorithm is referred to as Shapley Machine. To our best knowledge, this is the first time that the concepts from cooperative game theory are directly related to RL concepts. In experiments, we demonstrate the effectiveness of Shapley Machine and verify reasonableness of our theory.
[LG-44] Domain-Constrained Diffusion Models to Synthesize Tabular Data: A Case Study in Power Systems
链接: https://arxiv.org/abs/2506.11281
作者: Milad Hoseinpour,Vladimir Dvorkin
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 9 pages, 6 figures, conference
Abstract:Growing concerns over privacy, security, and legal barriers are driving the rising demand for synthetic data across domains such as healthcare, finance, and energy. While generative models offer a promising solution to overcome these barriers, their utility depends on the incorporation of domain-specific knowledge. We propose to synthesize data using a guided diffusion model that integrates domain constraints directly into the generative process. We develop the model in the context of power systems, with potential applicability to other domains that involve tabular data. Specifically, we synthesize statistically representative and high-fidelity power flow datasets. To satisfy domain constraints, e.g., Kirchhoff laws, we introduce a gradient-based guidance to steer the sampling trajectory in a feasible direction. Numerical results demonstrate the effectiveness of our approach.
[LG-45] Demonstration Sidetracks: Categorizing Systematic Non-Optimality in Human Demonstrations
链接: https://arxiv.org/abs/2506.11262
作者: Shijie Fang,Hang Yu,Qidi Fang,Reuben M. Aronson,Elaine S. Short
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Learning from Demonstration (LfD) is a popular approach for robots to acquire new skills, but most LfD methods suffer from imperfections in human demonstrations. Prior work typically treats these suboptimalities as random noise. In this paper we study non-optimal behaviors in non-expert demonstrations and show that they are systematic, forming what we call demonstration sidetracks. Using a public space study with 40 participants performing a long-horizon robot task, we recreated the setup in simulation and annotated all demonstrations. We identify four types of sidetracks (Exploration, Mistake, Alignment, Pause) and one control pattern (one-dimension control). Sidetracks appear frequently across participants, and their temporal and spatial distribution is tied to task context. We also find that users’ control patterns depend on the control interface. These insights point to the need for better models of suboptimal demonstrations to improve LfD algorithms and bridge the gap between lab training and real-world deployment. All demonstrations, infrastructure, and annotations are available at this https URL.
[LG-46] Detection of obstructions in oil and gas pipelines: machine learning techniques for hydrate classification
链接: https://arxiv.org/abs/2506.11220
作者: Hellockston Gomes de Brito,Carla Wilza Souza de Paula Maitelli,Osvaldo Chiavone-Filho
类目: Machine Learning (cs.LG)
*备注:
Abstract:Oil and gas reserves are vital resources for the global economy, serving as key components in transportation, energy production, and industrial processes. However, oil and gas extraction and production operations may encounter several challenges, such as pipeline and production line blockages, caused by factors including sediment accumulation, wax deposition, mineral scaling, and corrosion. This study addresses these challenges by employing supervised machine learning techniques, specifically decision trees, the k-Nearest Neighbors (k-NN) algorithm (k-NN), and the Naive Bayes classifier method, to detect and mitigate flow assurance challenges, ensuring efficient fluid transport. The primary focus is on preventing gas hydrate formation in oil production systems. To achieve this, data preprocessing and cleaning were conducted to ensure the quality and consistency of the dataset, which was sourced from Petrobras publicly available 3W project repository on GitHub. The scikit-learn Python library, a widely recognized open-source tool for supervised machine learning techniques, was utilized for classification tasks due to its robustness and versatility. The results demonstrate that the proposed methodology effectively classifies hydrate formation under operational conditions, with the decision tree algorithm exhibiting the highest predictive accuracy (99.99 percent). Consequently, this approach provides a reliable solution for optimizing production efficiency.
[LG-47] Mutual-Supervised Learning for Sequential-to-Parallel Code Translation
链接: https://arxiv.org/abs/2506.11153
作者: Changxin Ke,Rui Zhang,Shuo Wang,Li Ding,Guangli Li,Yuanbo Wen,Shuoming Zhang,Ruiyuan Xu,Jin Qin,Jiaming Guo,Chenxi Wang,Ling Li,Qi Guo,Yunji Chen
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 28 pages
Abstract:The rise of GPU-based high-performance computing (HPC) has driven the widespread adoption of parallel programming models such as CUDA. Yet, the inherent complexity of parallel programming creates a demand for the automated sequential-to-parallel approaches. However, data scarcity poses a significant challenge for machine learning-based sequential-to-parallel code translation. Although recent back-translation methods show promise, they still fail to ensure functional equivalence in the translated code. In this paper, we propose a novel Mutual-Supervised Learning (MSL) framework for sequential-to-parallel code translation to address the functional equivalence issue. MSL consists of two models, a Translator and a Tester. Through an iterative loop consisting of Co-verify and Co-evolve steps, the Translator and the Tester mutually generate data for each other and improve collectively. The Tester generates unit tests to verify and filter functionally equivalent translated code, thereby evolving the Translator, while the Translator generates translated code as augmented input to evolve the Tester. Experimental results demonstrate that MuSL significantly enhances the performance of the base model: when applied to Qwen2.5-Coder, it not only improves Pass@1 by up to 28.91% and boosts Tester performance by 68.90%, but also outperforms the previous state-of-the-art method CodeRosetta by 1.56 and 6.92 in BLEU and CodeBLEU scores, while achieving performance comparable to DeepSeek-R1 and GPT-4.1. Our code is available at this https URL.
[LG-48] PolyMicros: Bootstrapping a Foundation Model for Polycrystalline Material Structure
链接: https://arxiv.org/abs/2506.11055
作者: Michael Buzzy,Andreas Robertson,Peng Chen,Surya Kalidindi
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 43 Pages, 19 figures
Abstract:Recent advances in Foundation Models for Materials Science are poised to revolutionize the discovery, manufacture, and design of novel materials with tailored properties and responses. Although great strides have been made, successes have been restricted to materials classes where multi-million sample data repositories can be readily curated (e.g., atomistic structures). Unfortunately, for many structural and functional materials (e.g., mesoscale structured metal alloys), such datasets are too costly or prohibitive to construct; instead, datasets are limited to very few examples. To address this challenge, we introduce a novel machine learning approach for learning from hyper-sparse, complex spatial data in scientific domains. Our core contribution is a physics-driven data augmentation scheme that leverages an ensemble of local generative models, trained on as few as five experimental observations, and coordinates them through a novel diversity curation strategy to generate a large-scale, physically diverse dataset. We utilize this framework to construct PolyMicros, the first Foundation Model for polycrystalline materials (a structural material class important across a broad range of industrial and scientific applications). We demonstrate the utility of PolyMicros by zero-shot solving several long standing challenges related to accelerating 3D experimental microscopy. Finally, we make both our models and datasets openly available to the community.
[LG-49] NSW-EPNews: A News-Augmented Benchmark for Electricity Price Forecasting with LLM s NEURIPS2025
链接: https://arxiv.org/abs/2506.11050
作者: Zhaoge Bi,Linghan Huang,Haolin Jin,Qingwen Zeng,Huaming Chen
类目: Machine Learning (cs.LG)
*备注: 9 pages’ main texts. Submitted to NeurIPS 2025 Datasets and Benchmarks Track
Abstract:Electricity price forecasting is a critical component of modern energy-management systems, yet existing approaches heavily rely on numerical histories and ignore contemporaneous textual signals. We introduce NSW-EPNews, the first benchmark that jointly evaluates time-series models and large language models (LLMs) on real-world electricity-price prediction. The dataset includes over 175,000 half-hourly spot prices from New South Wales, Australia (2015-2024), daily temperature readings, and curated market-news summaries from WattClarity. We frame the task as 48-step-ahead forecasting, using multimodal input, including lagged prices, vectorized news and weather features for classical models, and prompt-engineered structured contexts for LLMs. Our datasets yields 3.6k multimodal prompt-output pairs for LLM evaluation using specific templates. Through compresive benchmark design, we identify that for traditional statistical and machine learning models, the benefits gain is marginal from news feature. For state-of-the-art LLMs, such as GPT-4o and Gemini 1.5 Pro, we observe modest performance increase while it also produce frequent hallucinations such as fabricated and malformed price sequences. NSW-EPNews provides a rigorous testbed for evaluating grounded numerical reasoning in multimodal settings, and highlights a critical gap between current LLM capabilities and the demands of high-stakes energy forecasting.
[LG-50] Perception-Driven Bias Detection in Machine Learning via Crowdsourced Visual Judgment
链接: https://arxiv.org/abs/2506.11047
作者: Chirudeep Tupakula,Rittika Shamsuddin
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注: Pilot Study. 12 pages. 4 Figures
Abstract:Machine learning systems are increasingly deployed in high-stakes domains, yet they remain vulnerable to bias systematic disparities that disproportionately impact specific demographic groups. Traditional bias detection methods often depend on access to sensitive labels or rely on rigid fairness metrics, limiting their applicability in real-world settings. This paper introduces a novel, perception-driven framework for bias detection that leverages crowdsourced human judgment. Inspired by reCAPTCHA and other crowd-powered systems, we present a lightweight web platform that displays stripped-down visualizations of numeric data (for example-salary distributions across demographic clusters) and collects binary judgments on group similarity. We explore how users’ visual perception-shaped by layout, spacing, and question phrasing can signal potential disparities. User feedback is aggregated to flag data segments as biased, which are then validated through statistical tests and machine learning cross-evaluations. Our findings show that perceptual signals from non-expert users reliably correlate with known bias cases, suggesting that visual intuition can serve as a powerful, scalable proxy for fairness auditing. This approach offers a label-efficient, interpretable alternative to conventional fairness diagnostics, paving the way toward human-aligned, crowdsourced bias detection pipelines.
[LG-51] he Effects of Data Augmentation on Confidence Estimation for LLM s
链接: https://arxiv.org/abs/2506.11046
作者: Rui Wang,Renyu Zhu,Minmin Lin,Runze Wu,Tangjie Lv,Changjie Fan,Haobo Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Confidence estimation is crucial for reflecting the reliability of large language models (LLMs), particularly in the widely used closed-source models. Utilizing data augmentation for confidence estimation is viable, but discussions focus on specific augmentation techniques, limiting its potential. We study the impact of different data augmentation methods on confidence estimation. Our findings indicate that data augmentation strategies can achieve better performance and mitigate the impact of overconfidence. We investigate the influential factors related to this and discover that, while preserving semantic information, greater data diversity enhances the effectiveness of augmentation. Furthermore, the impact of different augmentation strategies varies across different range of application. Considering parameter transferability and usability, the random combination of augmentations is a promising choice.
[LG-52] Procedural Environment Generation for Tool-Use Agents
链接: https://arxiv.org/abs/2506.11045
作者: Michael Sullivan,Mareike Hartmann,Alexander Koller
类目: Machine Learning (cs.LG)
*备注: 16 pages, 3 figures
Abstract:Although the power of LLM tool-use agents has ignited a flurry of recent research in this area, the curation of tool-use training data remains an open problem - especially for online RL training. Existing approaches to synthetic tool-use data generation tend to be non-interactive, and/or non-compositional. We introduce RandomWorld, a pipeline for the procedural generation of interactive tools and compositional tool-use data. We show that models tuned via SFT and RL on synthetic RandomWorld data improve on a range of tool-use benchmarks, and set the new SoTA for two metrics on the NESTFUL dataset. Further experiments show that downstream performance scales with the amount of RandomWorld-generated training data, opening up the possibility of further improvement through the use of entirely synthetic data.
[LG-53] Boost Post-Training Quantization via Null Space Optimization for Large Language Models
链接: https://arxiv.org/abs/2506.11044
作者: Jiaqi Zhao,Miao Zhang,Weili Guan,Liqiang Nie
类目: Machine Learning (cs.LG)
*备注: 17 pages, 4 figures
Abstract:Existing post-training quantization methods for large language models (LLMs) offer remarkable success. However, the increasingly marginal performance gains suggest that existing quantization strategies are insufficient to support the development of more compressed models. To inspire new directions for future research, this paper introduces the concept of null space into LLMs quantization. We argue that the quantization error can be effectively alleviated by constraining the post-quantization weight perturbation to lie within the null space of input activations. To prove this idea, we propose a plug-and-play null space projection module for existing milestone PTQ baselines named Q2N. Specifically, we first design an efficient and accurate null space projection approximation method tailored to the characteristics of LLMs. Subsequently, we theoretically derive a closed-form solution for an equivalent vector of the obtained projection matrix, which satisfies practical inference condition while avoiding additional memory overhead. Extensive experiments are conducted on various state-of-the-art LLMs (LLaMA3, DeepSeek, Qwen3) and baselines, demonstrating the effectiveness of both our Q2N and the perspective of null space optimization for LLMs quantization. We view this paper the first step to further alleviate the quantization error based on the insights of null space, hoping it inspiring future researchers to design more advanced quantization methods. Codes are available at this https URL.
[LG-54] GenFT: A Generative Parameter-Efficient Fine-Tuning Method for Pretrained Foundation Models
链接: https://arxiv.org/abs/2506.11042
作者: Baoquan Zhang,Guangning Xu,Michael. K. Ng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Pretrained Foundation Models (PFMs) have transformed numerous applications by enabling efficient adaptation to customized tasks. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a resource-efficient alternative to full fine-tuning, especially leveraging reparameterized weights \Delta W to adapt models for downstream tasks. However, a critical yet underexplored question remains: can we utilize well-pretrained weights W_0 to guide the update of task-specific \Delta W , avoiding inefficient training it from scratch? To end this, we propose Generative Parameter-Efficient Fine-Tuning (GenFT), a novel method that extracts structured, transferable information from W_0 for efficient \Delta W training. To extract row and column structure information, GenFT applies row and column transformations to distill essential patterns from W_0 . A tailored policy further decomposes \Delta W into layer-shared and layer-specific components, balancing information reuse and individualized flexibility. GenFT is simple yet effective, achieving superior performance across CV and NLP tasks. Extensive experiments on VTAB-1K, FGVC, and GLUE benchmarks demonstrate that GenFT outperforms state-of-the-art PEFT methods, offering a new perspective for efficient model adaptation.
[LG-55] ChemHGNN: A Hierarchical Hypergraph Neural Network for Reaction Virtual Screening and Discovery
链接: https://arxiv.org/abs/2506.11041
作者: Xiaobao Huang,Yihong Ma,Anjali Gurajapu,Jules Schleinitz,Zhichun Guo,Sarah E. Reisman,Nitesh V. Chawla
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reaction virtual screening and discovery are fundamental challenges in chemistry and materials science, where traditional graph neural networks (GNNs) struggle to model multi-reactant interactions. In this work, we propose ChemHGNN, a hypergraph neural network (HGNN) framework that effectively captures high-order relationships in reaction networks. Unlike GNNs, which require constructing complete graphs for multi-reactant reactions, ChemHGNN naturally models multi-reactant reactions through hyperedges, enabling more expressive reaction representations. To address key challenges, such as combinatorial explosion, model collapse, and chemically invalid negative samples, we introduce a reaction center-aware negative sampling strategy (RCNS) and a hierarchical embedding approach combining molecule, reaction and hypergraph level features. Experiments on the USPTO dataset demonstrate that ChemHGNN significantly outperforms HGNN and GNN baselines, particularly in large-scale settings, while maintaining interpretability and chemical plausibility. Our work establishes HGNNs as a superior alternative to GNNs for reaction virtual screening and discovery, offering a chemically informed framework for accelerating reaction discovery.
[LG-56] MoTE: Mixture of Task-specific Experts for Pre-Trained ModelBased Class-incremental Learning
链接: https://arxiv.org/abs/2506.11038
作者: Linjie Li,Zhenyu Wu,Yang Ji
类目: Machine Learning (cs.LG)
*备注: Accepted to KBS
Abstract:Class-incremental learning (CIL) requires deep learning models to continuously acquire new knowledge from streaming data while preserving previously learned information. Recently, CIL based on pre-trained models (PTMs) has achieved remarkable success. However, prompt-based approaches suffer from prompt overwriting, while adapter-based methods face challenges such as dimensional misalignment between tasks. While the idea of expert fusion in Mixture of Experts (MoE) can help address dimensional inconsistency, both expert and routing parameters are prone to being overwritten in dynamic environments, making MoE challenging to apply directly in CIL. To tackle these issues, we propose a mixture of task-specific experts (MoTE) framework that effectively mitigates the miscalibration caused by inconsistent output dimensions across tasks. Inspired by the weighted feature fusion and sparse activation mechanisms in MoE, we introduce task-aware expert filtering and reliable expert joint inference during the inference phase, mimicking the behavior of routing layers without inducing catastrophic forgetting. Extensive experiments demonstrate the superiority of our method without requiring an exemplar set. Furthermore, the number of tasks in MoTE scales linearly with the number of adapters. Building on this, we further explore the trade-off between adapter expansion and model performance and propose the Adapter-Limited MoTE. The code is available at this https URL.
[LG-57] Mini-Game Lifetime Value Prediction in WeChat KDD
链接: https://arxiv.org/abs/2506.11037
作者: Aochuan Chen,Yifan Niu,Ziqi Gao,Yujie Sun,Shoujun Liu,Gong Chen,Yang Liu,Jia Li
类目: Machine Learning (cs.LG)
*备注: KDD ADS Track 2025
Abstract:The LifeTime Value (LTV) prediction, which endeavors to forecast the cumulative purchase contribution of a user to a particular item, remains a vital challenge that advertisers are keen to resolve. A precise LTV prediction system enhances the alignment of user interests with meticulously designed advertisements, thereby generating substantial profits for advertisers. Nonetheless, this issue is complicated by the paucity of data typically observed in real-world advertising scenarios. The purchase rate among registered users is often as critically low as 0.1%, resulting in a dataset where the majority of users make only several purchases. Consequently, there is insufficient supervisory signal for effectively training the LTV prediction model. An additional challenge emerges from the interdependencies among tasks with high correlation. It is a common practice to estimate a user’s contribution to a game over a specified temporal interval. Varying the lengths of these intervals corresponds to distinct predictive tasks, which are highly correlated. For instance, predictions over a 7-day period are heavily reliant on forecasts made over a 3-day period, where exceptional cases can adversely affect the accuracy of both tasks. In order to comprehensively address the aforementioned challenges, we introduce an innovative framework denoted as Graph-Represented Pareto-Optimal LifeTime Value prediction (GRePO-LTV). Graph representation learning is initially employed to address the issue of data scarcity. Subsequently, Pareto-Optimization is utilized to manage the interdependence of prediction tasks.
[LG-58] Human-centered Interactive Learning via MLLM s for Text-to-Image Person Re-identification
链接: https://arxiv.org/abs/2506.11036
作者: Yang Qin,Chao Chen,Zhihang Fu,Dezhong Peng,Xi Peng,Peng Hu
类目: Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:
Abstract:Despite remarkable advancements in text-to-image person re-identification (TIReID) facilitated by the breakthrough of cross-modal embedding models, existing methods often struggle to distinguish challenging candidate images due to intrinsic limitations, such as network architecture and data quality. To address these issues, we propose an Interactive Cross-modal Learning framework (ICL), which leverages human-centered interaction to enhance the discriminability of text queries through external multimodal knowledge. To achieve this, we propose a plug-and-play Test-time Humane-centered Interaction (THI) module, which performs visual question answering focused on human characteristics, facilitating multi-round interactions with a multimodal large language model (MLLM) to align query intent with latent target images. Specifically, THI refines user queries based on the MLLM responses to reduce the gap to the best-matching images, thereby boosting ranking accuracy. Additionally, to address the limitation of low-quality training texts, we introduce a novel Reorganization Data Augmentation (RDA) strategy based on information enrichment and diversity enhancement to enhance query discriminability by enriching, decomposing, and reorganizing person descriptions. Extensive experiments on four TIReID benchmarks, i.e., CUHK-PEDES, ICFG-PEDES, RSTPReid, and UFine6926, demonstrate that our method achieves remarkable performance with substantial improvement.
[LG-59] Deep Learning Approach to Bearing and Induction Motor Fault Diagnosis via Data Fusion
链接: https://arxiv.org/abs/2506.11032
作者: Mert Sehri,Merve Ertagrin,Ozal Yildirim,Ahmet Orhan,Patrick Dumond
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Convolutional Neural Networks (CNNs) are used to evaluate accelerometer and microphone data for bearing and induction motor diagnosis. A Long Short-Term Memory (LSTM) recurrent neural network is used to combine sensor information effectively, highlighting the benefits of data fusion. This approach encourages researchers to focus on multi model diagnosis for constant speed data collection by proposing a comprehensive way to use deep learning and sensor fusion and encourages data scientists to collect more multi-sensor data, including acoustic and accelerometer datasets.
[LG-60] Evaluating Privacy-Utility Tradeoffs in Synthetic Smart Grid Data
链接: https://arxiv.org/abs/2506.11026
作者: Andre Catarino,Rui Melo,Rui Abreu,Luis Cruz
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 9 pages, 4 figures
Abstract:The widespread adoption of dynamic Time-of-Use (dToU) electricity tariffs requires accurately identifying households that would benefit from such pricing structures. However, the use of real consumption data poses serious privacy concerns, motivating the adoption of synthetic alternatives. In this study, we conduct a comparative evaluation of four synthetic data generation methods, Wasserstein-GP Generative Adversarial Networks (WGAN), Conditional Tabular GAN (CTGAN), Diffusion Models, and Gaussian noise augmentation, under different synthetic regimes. We assess classification utility, distribution fidelity, and privacy leakage. Our results show that architectural design plays a key role: diffusion models achieve the highest utility (macro-F1 up to 88.2%), while CTGAN provide the strongest resistance to reconstruction attacks. These findings highlight the potential of structured generative models for developing privacy-preserving, data-driven energy systems.
[LG-61] You Only Train Once: A Flexible Training Framework for Code Vulnerability Detection Driven by Vul-Vector
链接: https://arxiv.org/abs/2506.10988
作者: Bowen Tian,Zhengyang Xu,Mingqiang Wu,Songning Lai,Yutai Yue
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Under Review
Abstract:With the pervasive integration of computer applications across industries, the presence of vulnerabilities within code bases poses significant risks. The diversity of software ecosystems coupled with the intricate nature of modern software engineering has led to a shift from manual code vulnerability identification towards the adoption of automated tools. Among these, deep learning-based approaches have risen to prominence due to their superior accuracy; however, these methodologies encounter several obstacles. Primarily, they necessitate extensive labeled datasets and prolonged training periods, and given the rapid emergence of new vulnerabilities, the frequent retraining of models becomes a resource-intensive endeavor, thereby limiting their applicability in cutting-edge scenarios. To mitigate these challenges, this paper introduces the \underline\textbfYOTO–\underline\textbfYou \underline\textbfOnly \underline\textbfTrain \underline\textbfOnce framework. This innovative approach facilitates the integration of multiple types of vulnerability detection models via parameter fusion, eliminating the need for joint training. Consequently, YOTO enables swift adaptation to newly discovered vulnerabilities, significantly reducing both the time and computational resources required for model updates.
[LG-62] Spectral Estimation with Free Decompression
链接: https://arxiv.org/abs/2506.11994
作者: Siavash Ameli,Chris van der Heide,Liam Hodgkinson,Michael W. Mahoney
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Computing eigenvalues of very large matrices is a critical task in many machine learning applications, including the evaluation of log-determinants, the trace of matrix functions, and other important metrics. As datasets continue to grow in scale, the corresponding covariance and kernel matrices become increasingly large, often reaching magnitudes that make their direct formation impractical or impossible. Existing techniques typically rely on matrix-vector products, which can provide efficient approximations, if the matrix spectrum behaves well. However, in settings like distributed learning, or when the matrix is defined only indirectly, access to the full data set can be restricted to only very small sub-matrices of the original matrix. In these cases, the matrix of nominal interest is not even available as an implicit operator, meaning that even matrix-vector products may not be available. In such settings, the matrix is “impalpable,” in the sense that we have access to only masked snapshots of it. We draw on principles from free probability theory to introduce a novel method of “free decompression” to estimate the spectrum of such matrices. Our method can be used to extrapolate from the empirical spectral densities of small submatrices to infer the eigenspectrum of extremely large (impalpable) matrices (that we cannot form or even evaluate with full matrix-vector products). We demonstrate the effectiveness of this approach through a series of examples, comparing its performance against known limiting distributions from random matrix theory in synthetic settings, as well as applying it to submatrices of real-world datasets, matching them with their full empirical eigenspectra.
[LG-63] Interpretable representation learning of quantum data enabled by probabilistic variational autoencoders
链接: https://arxiv.org/abs/2506.11982
作者: Paulin de Schoulepnikoff,Gorka Muñoz-Gil,Hendrik Poulsen Nautrup,Hans J. Briegel
类目: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: Main text 10 pages, total document 16 pages, 10 figures
Abstract:Interpretable machine learning is rapidly becoming a crucial tool for scientific discovery. Among existing approaches, variational autoencoders (VAEs) have shown promise in extracting the hidden physical features of some input data, with no supervision nor prior knowledge of the system at study. Yet, the ability of VAEs to create meaningful, interpretable representations relies on their accurate approximation of the underlying probability distribution of their input. When dealing with quantum data, VAEs must hence account for its intrinsic randomness and complex correlations. While VAEs have been previously applied to quantum data, they have often neglected its probabilistic nature, hindering the extraction of meaningful physical descriptors. Here, we demonstrate that two key modifications enable VAEs to learn physically meaningful latent representations: a decoder capable of faithfully reproduce quantum states and a probabilistic loss tailored to this task. Using benchmark quantum spin models, we identify regimes where standard methods fail while the representations learned by our approach remain meaningful and interpretable. Applied to experimental data from Rydberg atom arrays, the model autonomously uncovers the phase structure without access to prior labels, Hamiltonian details, or knowledge of relevant order parameters, highlighting its potential as an unsupervised and interpretable tool for the study of quantum systems.
[LG-64] Learning Before Filtering: Real-Time Hardware Learning at the Detector Level
链接: https://arxiv.org/abs/2506.11981
作者: Boštjan Maček
类目: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
*备注:
Abstract:Advances in sensor technology and automation have ushered in an era of data abundance, where the ability to identify and extract relevant information in real time has become increasingly critical. Traditional filtering approaches, which depend on a priori knowledge, often struggle to adapt to dynamic or unanticipated data features. Machine learning offers a compelling alternative-particularly when training can occur directly at or near the detector. This paper presents a digital hardware architecture designed for real-time neural network training, specifically optimized for high-throughput data ingestion. The design is described in an implementation-independent manner, with detailed analysis of each architectural component and their performance implications. Through system parameterization, the study explores trade-offs between processing speed, model complexity, and hardware resource utilization. Practical examples illustrate how these parameters affect applicability across various use cases. A proof-of-concept implementation on an FPGA demonstrates in-situ training, confirming that computational accuracy is preserved relative to conventional software-based approaches. Moreover, resource estimates indicate that current-generation FPGAs can train networks of approximately 3,500 neurons per chip. The architecture is both scalable and adaptable, representing a significant advancement toward integrating learning directly within detector systems and enabling a new class of extreme-edge, real-time information processing.
[LG-65] Automated Treatment Planning for Interstitial HDR Brachytherapy for Locally Advanced Cervical Cancer using Deep Reinforcement Learning
链接: https://arxiv.org/abs/2506.11957
作者: Mohammadamin Moradi,Runyu Jiang,Yingzi Liu,Malvern Madondo,Tianming Wu,James J. Sohn,Xiaofeng Yang,Yasmin Hasan,Zhen Tian
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
*备注: 12 pages, 2 figures, 3 tables
Abstract:High-dose-rate (HDR) brachytherapy plays a critical role in the treatment of locally advanced cervical cancer but remains highly dependent on manual treatment planning expertise. The objective of this study is to develop a fully automated HDR brachytherapy planning framework that integrates reinforcement learning (RL) and dose-based optimization to generate clinically acceptable treatment plans with improved consistency and efficiency. We propose a hierarchical two-stage autoplanning framework. In the first stage, a deep Q-network (DQN)-based RL agent iteratively selects treatment planning parameters (TPPs), which control the trade-offs between target coverage and organ-at-risk (OAR) sparing. The agent’s state representation includes both dose-volume histogram (DVH) metrics and current TPP values, while its reward function incorporates clinical dose objectives and safety constraints, including D90, V150, V200 for targets, and D2cc for all relevant OARs (bladder, rectum, sigmoid, small bowel, and large bowel). In the second stage, a customized Adam-based optimizer computes the corresponding dwell time distribution for the selected TPPs using a clinically informed loss function. The framework was evaluated on a cohort of patients with complex applicator geometries. The proposed framework successfully learned clinically meaningful TPP adjustments across diverse patient anatomies. For the unseen test patients, the RL-based automated planning method achieved an average score of 93.89%, outperforming the clinical plans which averaged 91.86%. These findings are notable given that score improvements were achieved while maintaining full target coverage and reducing CTV hot spots in most cases.
[LG-66] Bubble Dynamics Transformer: Microrheology at Ultra-High Strain Rates
链接: https://arxiv.org/abs/2506.11936
作者: Lehu Bu,Zhaohan Yu,Shaoting Lin,Jan N. Fuhg,Jin Yang
类目: Fluid Dynamics (physics.flu-dyn); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
Abstract:Laser-induced inertial cavitation (LIC)-where microscale vapor bubbles nucleate due to a focused high-energy pulsed laser and then violently collapse under surrounding high local pressures-offers a unique opportunity to investigate soft biological material mechanics at extremely high strain rates (1000 1/s). Traditional rheological tools are often limited in these regimes by loading speed, resolution, or invasiveness. Here we introduce novel machine learning (ML) based microrheological frameworks that leverage LIC to characterize the viscoelastic properties of biological materials at ultra-high strain rates. We utilize ultra-high-speed imaging to capture time-resolved bubble radius dynamics during LIC events in various soft viscoelastic materials. These bubble radius versus time measurements are then analyzed using a newly developed Bubble Dynamics Transformer (BDT), a neural network trained on physics-based simulation data. The BDT accurately infers material viscoelastic parameters, eliminating the need for iterative fitting or complex inversion processes. This enables fast, accurate, and non-contact characterization of soft materials under extreme loading conditions, with significant implications for biomedical applications and materials science.
[LG-67] Convergence of Momentum-Based Optimization Algorithms with Time-Varying Parameters
链接: https://arxiv.org/abs/2506.11904
作者: Mathukumalli Vidyasagar
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 32 pages
Abstract:In this paper, we present a unified algorithm for stochastic optimization that makes use of a “momentum” term; in other words, the stochastic gradient depends not only on the current true gradient of the objective function, but also on the true gradient at the previous iteration. Our formulation includes the Stochastic Heavy Ball (SHB) and the Stochastic Nesterov Accelerated Gradient (SNAG) algorithms as special cases. In addition, in our formulation, the momentum term is allowed to vary as a function of time (i.e., the iteration counter). The assumptions on the stochastic gradient are the most general in the literature, in that it can be biased, and have a conditional variance that grows in an unbounded fashion as a function of time. This last feature is crucial in order to make the theory applicable to “zero-order” methods, where the gradient is estimated using just two function evaluations. We present a set of sufficient conditions for the convergence of the unified algorithm. These conditions are natural generalizations of the familiar Robbins-Monro and Kiefer-Wolfowitz-Blum conditions for standard stochastic gradient descent. We also analyze another method from the literature for the SHB algorithm with a time-varying momentum parameter, and show that it is impracticable. Comments: 32 pages Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2506.11904 [math.OC] (or arXiv:2506.11904v1 [math.OC] for this version) https://doi.org/10.48550/arXiv.2506.11904 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-68] Decadal sink-source shifts of forest aboveground carbon since 1988
链接: https://arxiv.org/abs/2506.11879
作者: Zhen Qian,Sebastian Bathiany,Teng Liu,Lana L. Blaschke,Hoong Chen Teo,Niklas Boers
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注:
Abstract:As enduring carbon sinks, forest ecosystems are vital to the terrestrial carbon cycle and help moderate global warming. However, the long-term dynamics of aboveground carbon (AGC) in forests and their sink-source transitions remain highly uncertain, owing to changing disturbance regimes and inconsistencies in observations, data processing, and analysis methods. Here, we derive reliable, harmonized AGC stocks and fluxes in global forests from 1988 to 2021 at high spatial resolution by integrating multi-source satellite observations with probabilistic deep learning models. Our approach simultaneously estimates AGC and associated uncertainties, showing high reliability across space and time. We find that, although global forests remained an AGC sink of 6.2 PgC over 30 years, moist tropical forests shifted to a substantial AGC source between 2001 and 2010 and, together with boreal forests, transitioned toward a source in the 2011-2021 period. Temperate, dry tropical and subtropical forests generally exhibited increasing AGC stocks, although Europe and Australia became sources after 2011. Regionally, pronounced sink-to-source transitions occurred in tropical forests over the past three decades. The interannual relationship between global atmospheric CO2 growth rates and tropical AGC flux variability became increasingly negative, reaching Pearson’s r = -0.63 (p 0.05) in the most recent decade. In the Brazilian Amazon, the contribution of deforested regions to AGC losses declined from 60% in 1989-2000 to 13% in 2011-2021, while the share from untouched areas increased from 33% to 76%. Our findings suggest a growing role of tropical forest AGC in modulating variability in the terrestrial carbon cycle, with anthropogenic climate change potentially contributing increasingly to AGC changes, particularly in previously untouched areas.
[LG-69] Learning Overspecified Gaussian Mixtures Exponentially Fast with the EM Algorithm KDD2025 ECML
链接: https://arxiv.org/abs/2506.11850
作者: Zhenisbek Assylbekov,Alan Legg,Artur Pak
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: ECML PKDD 2025
Abstract:We investigate the convergence properties of the EM algorithm when applied to overspecified Gaussian mixture models – that is, when the number of components in the fitted model exceeds that of the true underlying distribution. Focusing on a structured configuration where the component means are positioned at the vertices of a regular simplex and the mixture weights satisfy a non-degeneracy condition, we demonstrate that the population EM algorithm converges exponentially fast in terms of the Kullback-Leibler (KL) distance. Our analysis leverages the strong convexity of the negative log-likelihood function in a neighborhood around the optimum and utilizes the Polyak-Łojasiewicz inequality to establish that an \epsilon -accurate approximation is achievable in O(\log(1/\epsilon)) iterations. Furthermore, we extend these results to a finite-sample setting by deriving explicit statistical convergence guarantees. Numerical experiments on synthetic datasets corroborate our theoretical findings, highlighting the dramatic acceleration in convergence compared to conventional sublinear rates. This work not only deepens the understanding of EM’s behavior in overspecified settings but also offers practical insights into initialization strategies and model design for high-dimensional clustering and density estimation tasks.
[LG-70] Bayesian Optimization with Inexact Acquisition: Is Random Grid Search Sufficient? UAI2025
链接: https://arxiv.org/abs/2506.11831
作者: Hwanwoo Kim,Chong Liu,Yuxin Chen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: This paper is accepted to UAI 2025
Abstract:Bayesian optimization (BO) is a widely used iterative algorithm for optimizing black-box functions. Each iteration requires maximizing an acquisition function, such as the upper confidence bound (UCB) or a sample path from the Gaussian process (GP) posterior, as in Thompson sampling (TS). However, finding an exact solution to these maximization problems is often intractable and computationally expensive. Reflecting such realistic situations, in this paper, we delve into the effect of inexact maximizers of the acquisition functions. Defining a measure of inaccuracy in acquisition solutions, we establish cumulative regret bounds for both GP-UCB and GP-TS without requiring exact solutions of acquisition function maximization. Our results show that under appropriate conditions on accumulated inaccuracy, inexact BO algorithms can still achieve sublinear cumulative regret. Motivated by such findings, we provide both theoretical justification and numerical validation for random grid search as an effective and computationally efficient acquisition function solver.
[LG-71] Using Deep Operators to Create Spatio-temporal Surrogates for Dynamical Systems under Uncertainty
链接: https://arxiv.org/abs/2506.11761
作者: Jichuan Tang,Patrick T. Brewick,Ryan G. McClarren,Christopher Sweet
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Spatio-temporal data, which consists of responses or measurements gathered at different times and positions, is ubiquitous across diverse applications of civil infrastructure. While SciML methods have made significant progress in tackling the issue of response prediction for individual time histories, creating a full spatial-temporal surrogate remains a challenge. This study proposes a novel variant of deep operator networks (DeepONets), namely the full-field Extended DeepONet (FExD), to serve as a spatial-temporal surrogate that provides multi-output response predictions for dynamical systems. The proposed FExD surrogate model effectively learns the full solution operator across multiple degrees of freedom by enhancing the expressiveness of the branch network and expanding the predictive capabilities of the trunk network. The proposed FExD surrogate is deployed to simultaneously capture the dynamics at several sensing locations along a testbed model of a cable-stayed bridge subjected to stochastic ground motions. The ensuing response predictions from the FExD are comprehensively compared against both a vanilla DeepONet and a modified spatio-temporal Extended DeepONet. The results demonstrate the proposed FExD can achieve both superior accuracy and computational efficiency, representing a significant advancement in operator learning for structural dynamics applications.
[LG-72] Bias and Identifiability in the Bounded Confidence Model
链接: https://arxiv.org/abs/2506.11751
作者: Claudio Borile,Jacopo Lenti,Valentina Ghidini,Corrado Monti,Gianmarco De Francisci Morales
类目: Methodology (stat.ME); Computers and Society (cs.CY); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
*备注: 13 pages, 8 figures
Abstract:Opinion dynamics models such as the bounded confidence models (BCMs) describe how a population can reach consensus, fragmentation, or polarization, depending on a few parameters. Connecting such models to real-world data could help understanding such phenomena, testing model assumptions. To this end, estimation of model parameters is a key aspect, and maximum likelihood estimation provides a principled way to tackle it. Here, our goal is to outline the properties of statistical estimators of the two key BCM parameters: the confidence bound and the convergence rate. We find that their maximum likelihood estimators present different characteristics: the one for the confidence bound presents a small-sample bias but is consistent, while the estimator of the convergence rate shows a persistent bias. Moreover, the joint parameter estimation is affected by identifiability issues for specific regions of the parameter space, as several local maxima are present in the likelihood function. Our results show how the analysis of the likelihood function is a fruitful approach for better understanding the pitfalls and possibilities of estimating the parameters of opinion dynamics models, and more in general, agent-based models, and for offering formal guarantees for their calibration.
[LG-73] Quantum Learning and Estimation for Distribution Networks and Energy Communities Coordination
链接: https://arxiv.org/abs/2506.11730
作者: Yingrui Zhuang,Lin Cheng,Yuji Cao,Tongxin Li,Ning Qi,Yan Xu,Yue Chen
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: This is a manuscript submitted to PROTECTION AND CONTROL OF MODERN POWER SYSTEMS
Abstract:Price signals from distribution networks (DNs) guide energy communities (ECs) to adjust energy usage, enabling effective coordination for reliable power system operation. However, this coordination faces significant challenges due to the limited availability of information (i.e., only the aggregated energy usage of ECs is available to DNs), and the high computational burden of accounting for uncertainties and the associated risks through numerous scenarios. To address these challenges, we propose a quantum learning and estimation approach to enhance coordination between DNs and ECs. Specifically, leveraging advanced quantum properties such as quantum superposition and entanglement, we develop a hybrid quantum temporal convolutional network-long short-term memory (Q-TCN-LSTM) model to establish an end-to-end mapping between ECs’ responses and the price incentives from DNs. Moreover, we develop a quantum estimation method based on quantum amplitude estimation (QAE) and two phase-rotation circuits to significantly accelerate the optimization process under numerous uncertainty scenarios. Numerical experiments demonstrate that, compared to classical neural networks, the proposed Q-TCN-LSTM model improves the mapping accuracy by 69.2% while reducing the model size by 99.75% and the computation time by 93.9%. Compared to classical Monte Carlo simulation, QAE achieves comparable accuracy with a dramatic reduction in computational time (up to 99.99%) and requires significantly fewer computational resources.
[LG-74] On the performance of multi-fidelity and reduced-dimensional neural emulators for inference of physiologic boundary conditions
链接: https://arxiv.org/abs/2506.11683
作者: Chloe H. Choi,Andrea Zanoni,Daniele E. Schiavazzi,Alison L. Marsden
类目: Machine Learning (stat.ML); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Statistics Theory (math.ST); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Solving inverse problems in cardiovascular modeling is particularly challenging due to the high computational cost of running high-fidelity simulations. In this work, we focus on Bayesian parameter estimation and explore different methods to reduce the computational cost of sampling from the posterior distribution by leveraging low-fidelity approximations. A common approach is to construct a surrogate model for the high-fidelity simulation itself. Another is to build a surrogate for the discrepancy between high- and low-fidelity models. This discrepancy, which is often easier to approximate, is modeled with either a fully connected neural network or a nonlinear dimensionality reduction technique that enables surrogate construction in a lower-dimensional space. A third possible approach is to treat the discrepancy between the high-fidelity and surrogate models as random noise and estimate its distribution using normalizing flows. This allows us to incorporate the approximation error into the Bayesian inverse problem by modifying the likelihood function. We validate five different methods which are variations of the above on analytical test cases by comparing them to posterior distributions derived solely from high-fidelity models, assessing both accuracy and computational cost. Finally, we demonstrate our approaches on two cardiovascular examples of increasing complexity: a lumped-parameter Windkessel model and a patient-specific three-dimensional anatomy.
[LG-75] Recursive KalmanNet: Deep Learning-Augmented Kalman Filtering for State Estimation with Consistent Uncertainty Quantification
链接: https://arxiv.org/abs/2506.11639
作者: Hassan Mortada,Cyril Falcon,Yanis Kahil,Mathéo Clavaud,Jean-Philippe Michel
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 5 pages, 3 figures. Accepted for publication in EUSIPCO 2025 proceedings
Abstract:State estimation in stochastic dynamical systems with noisy measurements is a challenge. While the Kalman filter is optimal for linear systems with independent Gaussian white noise, real-world conditions often deviate from these assumptions, prompting the rise of data-driven filtering techniques. This paper introduces Recursive KalmanNet, a Kalman-filter-informed recurrent neural network designed for accurate state estimation with consistent error covariance quantification. Our approach propagates error covariance using the recursive Joseph’s formula and optimizes the Gaussian negative log-likelihood. Experiments with non-Gaussian measurement white noise demonstrate that our model outperforms both the conventional Kalman filter and an existing state-of-the-art deep learning based estimator.
[LG-76] Learning Encodings by Maximizing State Distinguishability: Variational Quantum Error Correction
链接: https://arxiv.org/abs/2506.11552
作者: Nico Meyer,Christopher Mutschler,Andreas Maier,Daniel D. Scherer
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 50 pages, 24 figures, 7 tables
Abstract:Quantum error correction is crucial for protecting quantum information against decoherence. Traditional codes like the surface code require substantial overhead, making them impractical for near-term, early fault-tolerant devices. We propose a novel objective function for tailoring error correction codes to specific noise structures by maximizing the distinguishability between quantum states after a noise channel, ensuring efficient recovery operations. We formalize this concept with the distinguishability loss function, serving as a machine learning objective to discover resource-efficient encoding circuits optimized for given noise characteristics. We implement this methodology using variational techniques, termed variational quantum error correction (VarQEC). Our approach yields codes with desirable theoretical and practical properties and outperforms standard codes in various scenarios. We also provide proof-of-concept demonstrations on IBM and IQM hardware devices, highlighting the practical relevance of our procedure.
[LG-77] SemanticST: Spatially Informed Semantic Graph Learning for1 Clustering Integration and Scalable Analysis of Spatial2 Transcriptomics
链接: https://arxiv.org/abs/2506.11491
作者: Roxana Zahedi,Ahmadreza Argha,Nona Farbehi,Ivan Bakhshayeshi,Youqiong Ye,Nigel H. Lovell,Hamid Alinejad-Rokny
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注: 6 Figures
Abstract:Spatial transcriptomics (ST) technologies enable gene expression profiling with spatial resolution, offering unprecedented insights into tissue organization and disease heterogeneity. However, current analysis methods often struggle with noisy data, limited scalability, and inadequate modelling of complex cellular relationships. We present SemanticST, a biologically informed, graph-based deep learning framework that models diverse cellular contexts through multi-semantic graph construction. SemanticST builds multiple context-specific graphs capturing spatial proximity, gene expression similarity, and tissue domain structure, and learns disentangled embeddings for each. These are fused using an attention-inspired strategy to yield a unified, biologically meaningful representation. A community-aware min-cut loss improves robustness over contrastive learning, particularly in sparse ST data. SemanticST supports mini-batch training, making it the first graph neural network scalable to large-scale datasets such as Xenium (500,000 cells). Benchmarking across four platforms (Visium, Slide-seq, Stereo-seq, Xenium) and multiple human and mouse tissues shows consistent 20 percentage gains in ARI, NMI, and trajectory fidelity over DeepST, GraphST, and IRIS. In re-analysis of breast cancer Xenium data, SemanticST revealed rare and clinically significant niches, including triple receptor-positive clusters, spatially distinct DCIS-to-IDC transition zones, and FOXC2 tumour-associated myoepithelial cells, suggesting non-canonical EMT programs with stem-like features. SemanticST thus provides a scalable, interpretable, and biologically grounded framework for spatial transcriptomics analysis, enabling robust discovery across tissue types and diseases, and paving the way for spatially resolved tissue atlases and next-generation precision medicine.
[LG-78] Fast Bayesian Optimization of Function Networks with Partial Evaluations
链接: https://arxiv.org/abs/2506.11456
作者: Poompol Buathong,Peter I. Frazier
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 16 pages, 8 figures, 1 table
Abstract:Bayesian optimization of function networks (BOFN) is a framework for optimizing expensive-to-evaluate objective functions structured as networks, where some nodes’ outputs serve as inputs for others. Many real-world applications, such as manufacturing and drug discovery, involve function networks with additional properties - nodes that can be evaluated independently and incur varying costs. A recent BOFN variant, p-KGFN, leverages this structure and enables cost-aware partial evaluations, selectively querying only a subset of nodes at each iteration. p-KGFN reduces the number of expensive objective function evaluations needed but has a large computational overhead: choosing where to evaluate requires optimizing a nested Monte Carlo-based acquisition function for each node in the network. To address this, we propose an accelerated p-KGFN algorithm that reduces computational overhead with only a modest loss in query efficiency. Key to our approach is generation of node-specific candidate inputs for each node in the network via one inexpensive global Monte Carlo simulation. Numerical experiments show that our method maintains competitive query efficiency while achieving up to a 16x speedup over the original p-KGFN algorithm.
[LG-79] Polymorphism Crystal Structure Prediction with Adaptive Space Group Diversity Control
链接: https://arxiv.org/abs/2506.11332
作者: Sadman Sadeed Omee,Lai Wei,Sourin Dey,Jianjun Hu
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
Abstract:Crystalline materials can form different structural arrangements (i.e. polymorphs) with the same chemical composition, exhibiting distinct physical properties depending on how they were synthesized or the conditions under which they operate. For example, carbon can exist as graphite (soft, conductive) or diamond (hard, insulating). Computational methods that can predict these polymorphs are vital in materials science, which help understand stability relationships, guide synthesis efforts, and discover new materials with desired properties without extensive trial-and-error experimentation. However, effective crystal structure prediction (CSP) algorithms for inorganic polymorph structures remain limited. We propose ParetoCSP2, a multi-objective genetic algorithm for polymorphism CSP that incorporates an adaptive space group diversity control technique, preventing over-representation of any single space group in the population guided by a neural network interatomic potential. Using an improved population initialization method and performing iterative structure relaxation, ParetoCSP2 not only alleviates premature convergence but also achieves improved convergence speed. Our results show that ParetoCSP2 achieves excellent performance in polymorphism prediction, including a nearly perfect space group and structural similarity accuracy for formulas with two polymorphs but with the same number of unit cell atoms. Evaluated on a benchmark dataset, it outperforms baseline algorithms by factors of 2.46-8.62 for these accuracies and improves by 44.8%-87.04% across key performance metrics for regular CSP. Our source code is freely available at this https URL.
[LG-80] Score-based Generative Diffusion Models to Synthesize Full-dose FDG Brain PET from MRI in Epilepsy Patients
链接: https://arxiv.org/abs/2506.11297
作者: Jiaqi Wu,Jiahong Ouyang,Farshad Moradi,Mohammad Mehdi Khalighi,Greg Zaharchuk
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:
Abstract:Fluorodeoxyglucose (FDG) PET to evaluate patients with epilepsy is one of the most common applications for simultaneous PET/MRI, given the need to image both brain structure and metabolism, but is suboptimal due to the radiation dose in this young population. Little work has been done synthesizing diagnostic quality PET images from MRI data or MRI data with ultralow-dose PET using advanced generative AI methods, such as diffusion models, with attention to clinical evaluations tailored for the epilepsy population. Here we compared the performance of diffusion- and non-diffusion-based deep learning models for the MRI-to-PET image translation task for epilepsy imaging using simultaneous PET/MRI in 52 subjects (40 train/2 validate/10 hold-out test). We tested three different models: 2 score-based generative diffusion models (SGM-Karras Diffusion [SGM-KD] and SGM-variance preserving [SGM-VP]) and a Transformer-Unet. We report results on standard image processing metrics as well as clinically relevant metrics, including congruency measures (Congruence Index and Congruency Mean Absolute Error) that assess hemispheric metabolic asymmetry, which is a key part of the clinical analysis of these images. The SGM-KD produced the best qualitative and quantitative results when synthesizing PET purely from T1w and T2 FLAIR images with the least mean absolute error in whole-brain specific uptake value ratio (SUVR) and highest intraclass correlation coefficient. When 1% low-dose PET images are included in the inputs, all models improve significantly and are interchangeable for quantitative performance and visual quality. In summary, SGMs hold great potential for pure MRI-to-PET translation, while all 3 model types can synthesize full-dose FDG-PET accurately using MRI and ultralow-dose PET.
[LG-81] Collaborative Prediction: To Join or To Disjoin Datasets UAI2025
链接: https://arxiv.org/abs/2506.11271
作者: Kyung Rok Kim,Yansong Wang,Xiaocheng Li,Guanting Chen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: To be published in the 41st Conference on Uncertainty in Artificial Intelligence (UAI 2025)
Abstract:With the recent rise of generative Artificial Intelligence (AI), the need of selecting high-quality dataset to improve machine learning models has garnered increasing attention. However, some part of this topic remains underexplored, even for simple prediction models. In this work, we study the problem of developing practical algorithms that select appropriate dataset to minimize population loss of our prediction model with high probability. Broadly speaking, we investigate when datasets from different sources can be effectively merged to enhance the predictive model’s performance, and propose a practical algorithm with theoretical guarantees. By leveraging an oracle inequality and data-driven estimators, the algorithm reduces population loss with high probability. Numerical experiments demonstrate its effectiveness in both standard linear regression and broader machine learning applications. Code is available at this https URL.
[LG-82] Brain-wide interpolation and conditioning of gene expression in the human brain using Implicit Neural Representations
链接: https://arxiv.org/abs/2506.11158
作者: Xizheng Yu,Justin Torok,Sneha Pandya,Sourav Pal,Vikas Singh,Ashish Raj
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we study the efficacy and utility of recent advances in non-local, non-linear image interpolation and extrapolation algorithms, specifically, ideas based on Implicit Neural Representations (INR), as a tool for analysis of spatial transcriptomics data. We seek to utilize the microarray gene expression data sparsely sampled in the healthy human brain, and produce fully resolved spatial maps of any given gene across the whole brain at a voxel-level resolution. To do so, we first obtained the 100 top AD risk genes, whose baseline spatial transcriptional profiles were obtained from the Allen Human Brain Atlas (AHBA). We adapted Implicit Neural Representation models so that the pipeline can produce robust voxel-resolution quantitative maps of all genes. We present a variety of experiments using interpolations obtained from Abagen as a baseline/reference.
[LG-83] HEIST: A Graph Foundation Model for Spatial Transcriptomics and Proteomics Data
链接: https://arxiv.org/abs/2506.11152
作者: Hiren Madhu,João Felipe Rocha,Tinglin Huang,Siddharth Viswanath,Smita Krishnaswamy,Rex Ying
类目: Genomics (q-bio.GN); Machine Learning (cs.LG); Cell Behavior (q-bio.CB)
*备注:
Abstract:Single-cell transcriptomics has become a great source for data-driven insights into biology, enabling the use of advanced deep learning methods to understand cellular heterogeneity and transcriptional regulation at the single-cell level. With the advent of spatial transcriptomics data we have the promise of learning about cells within a tissue context as it provides both spatial coordinates and transcriptomic readouts. However, existing models either ignore spatial resolution or the gene regulatory information. Gene regulation in cells can change depending on microenvironmental cues from neighboring cells, but existing models neglect gene regulatory patterns with hierarchical dependencies across levels of abstraction. In order to create contextualized representations of cells and genes from spatial transcriptomics data, we introduce HEIST, a hierarchical graph transformer-based foundation model for spatial transcriptomics and proteomics data. HEIST models tissue as spatial cellular neighborhood graphs, and each cell is, in turn, modeled as a gene regulatory network graph. The framework includes a hierarchical graph transformer that performs cross-level message passing and message passing within levels. HEIST is pre-trained on 22.3M cells from 124 tissues across 15 organs using spatially-aware contrastive learning and masked auto-encoding objectives. Unsupervised analysis of HEIST representations of cells, shows that it effectively encodes the microenvironmental influences in cell embeddings, enabling the discovery of spatially-informed subpopulations that prior models fail to differentiate. Further, HEIST achieves state-of-the-art results on four downstream task such as clinical outcome prediction, cell type annotation, gene imputation, and spatially-informed cell clustering across multiple technologies, highlighting the importance of hierarchical modeling and GRN-based representations.
[LG-84] Fifteen Years of Child-Centered Long-Form Recordings: Promises Resources and Remaining Challenges to Validity
链接: https://arxiv.org/abs/2506.11075
作者: Loann Peurey,Marvin Lavechin,Tarek Kunze,Manel Khentout,Lucas Gautheron,Emmanuel Dupoux,Alejandrina Cristia
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: 5 pages, 3 figures
Abstract:Audio-recordings collected with a child-worn device are a fundamental tool in child language research. Long-form recordings collected over whole days promise to capture children’s input and production with minimal observer bias, and therefore high validity. The sheer volume of resulting data necessitates automated analysis to extract relevant metrics for researchers and clinicians. This paper summarizes collective knowledge on this technique, providing entry points to existing resources. We also highlight various sources of error that threaten the accuracy of automated annotations and the interpretation of resulting metrics. To address this, we propose potential troubleshooting metrics to help users assess data quality. While a fully automated quality control system is not feasible, we outline practical strategies for researchers to improve data collection and contextualize their analyses.
[LG-85] Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier
链接: https://arxiv.org/abs/2506.11074
作者: Tarek Kunze,Marianne Métais,Hadrien Titeux,Lucas Elbert,Joseph Coffey,Emmanuel Dupoux,Alejandrina Cristia,Marvin Lavechin
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: 5 pages, 3 figures
Abstract:Recordings gathered with child-worn devices promised to revolutionize both fundamental and applied speech sciences by allowing the effortless capture of children’s naturalistic speech environment and language production. This promise hinges on speech technologies that can transform the sheer mounds of data thus collected into usable information. This paper demonstrates several obstacles blocking progress by summarizing three years’ worth of experiments aimed at improving one fundamental task: Voice Type Classification. Our experiments suggest that improvements in representation features, architecture, and parameter search contribute to only marginal gains in performance. More progress is made by focusing on data relevance and quantity, which highlights the importance of collecting data with appropriate permissions to allow sharing.
[LG-86] A Framework for Non-Linear Attention via Modern Hopfield Networks
链接: https://arxiv.org/abs/2506.11043
作者: Ahmed Farooq
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 15 pages
Abstract:In this work we propose an energy functional along the lines of Modern Hopfield Networks (MNH), the stationary points of which correspond to the attention due to Vaswani et al. [12], thus unifying both frameworks. The minima of this landscape form “context wells” - stable configurations that encapsulate the contextual relationships among tokens. A compelling picture emerges: across n token embeddings an energy landscape is defined whose gradient corresponds to the attention computation. Non-linear attention mechanisms offer a means to enhance the capabilities of transformer models for various sequence modeling tasks by improving the model’s understanding of complex relationships, learning of representations, and overall efficiency and performance. A rough analogy can be seen via cubic splines which offer a richer representation of non-linear data where a simpler linear model may be inadequate. This approach can be used for the introduction of non-linear heads in transformer based models such as BERT, [6], etc.
信息检索
[IR-0] Forgetful by Design? A Critical Audit of YouTubes Search API for Academic Research
链接: https://arxiv.org/abs/2506.11727
作者: Bernhard Rieder,Adrian Padilla,Oscar Coromina
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
*备注: 34 pages, 2 tables and 4 figures
Abstract:This paper critically audits the search endpoint of YouTube’s Data API (v3), a common tool for academic research. Through systematic weekly searches over six months using eleven queries, we identify major limitations regarding completeness, representativeness, consistency, and bias. Our findings reveal substantial differences between ranking parameters like relevance and date in terms of video recall and precision, with relevance often retrieving numerous off-topic videos. We also find severe temporal decay, as the number of findable videos for a specific period dramatically decreases after just 20-60 days from the publication date, potentially hampering many different research designs. Furthermore, search results lack consistency, with identical queries yielding different video sets over time, compromising replicability. A case study on the European Parliament elections highlights how these issues impact research outcomes. While the paper offers several mitigation strategies, it concludes that the API’s search function, potentially prioritizing “freshness” over comprehensive retrieval, is not adequate for robust academic research, especially concerning Digital Services Act requirements.
[IR-1] ongSearch-QR: Reinforced Query Reasoning for Retrieval
链接: https://arxiv.org/abs/2506.11603
作者: Xubo Qin,Jun Bai,Jiaqi Li,Zixia Jia,Zilong Zheng
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Traditional information retrieval (IR) methods excel at textual and semantic matching but struggle in reasoning-intensive retrieval tasks that require multi-hop inference or complex semantic understanding between queries and documents. One promising solution is to explicitly rewrite or augment queries using large language models (LLMs) to elicit reasoning-relevant content prior to retrieval. However, the widespread use of large-scale language models like GPT-4 or LLaMA3-70B remains impractical due to their high inference cost and limited deployability in real-world systems. In this work, we introduce TongSearch QR (Previously Known as “TongSearch Reasoner”), a family of small-scale language models for query reasoning and rewriting in reasoning-intensive retrieval. With a novel semi-rule-based reward function, we employ reinforcement learning approaches enabling smaller language models, e,g, Qwen2.5-7B-Instruct and Qwen2.5-1.5B-Instruct, to achieve query reasoning performance rivaling large-scale language models without their prohibitive inference costs. Experiment results on BRIGHT benchmark show that with BM25 as retrievers, both TongSearch QR-7B and TongSearch QR-1.5B models significantly outperform existing baselines, including prompt-based query reasoners and some latest dense retrievers trained for reasoning-intensive retrieval tasks, offering superior adaptability for real-world deployment.
[IR-2] Dual-View Disentangled Multi-Intent Learning for Enhanced Collaborative Filtering
链接: https://arxiv.org/abs/2506.11538
作者: Shanfan Zhang,Yongyi Lin,Yuan Rao,Chenlong Zhang
类目: Information Retrieval (cs.IR)
*备注: 26 pages, 11 figures
Abstract:Disentangling user intentions from implicit feedback has become a promising strategy to enhance recommendation accuracy and interpretability. Prior methods often model intentions independently and lack explicit supervision, thus failing to capture the joint semantics that drive user-item interactions. To address these limitations, we propose DMICF, a unified framework that explicitly models interaction-level intent alignment while leveraging structural signals from both user and item perspectives. DMICF adopts a dual-view architecture that jointly encodes user-item interaction graphs from both sides, enabling bidirectional information fusion. This design enhances robustness under data sparsity by allowing the structural redundancy of one view to compensate for the limitations of the other. To model fine-grained user-item compatibility, DMICF introduces an intent interaction encoder that performs sub-intent alignment within each view, uncovering shared semantic structures that underlie user decisions. This localized alignment enables adaptive refinement of intent embeddings based on interaction context, thus improving the model’s generalization and expressiveness, particularly in long-tail scenarios. Furthermore, DMICF integrates an intent-aware scoring mechanism that aggregates compatibility signals from matched intent pairs across user and item subspaces, enabling personalized prediction grounded in semantic congruence rather than entangled representations. To facilitate semantic disentanglement, we design a discriminative training signal via multi-negative sampling and softmax normalization, which pulls together semantically aligned intent pairs while pushing apart irrelevant or noisy ones. Extensive experiments demonstrate that DMICF consistently delivers robust performance across datasets with diverse interaction distributions.
[IR-3] A Reference Model and Patterns for Production Event Data Enrichment
链接: https://arxiv.org/abs/2506.11502
作者: Mark van der Pas,Remco Dijkman,Alp Akçay,Ivo Adan,John Walker
类目: Information Retrieval (cs.IR)
*备注: Extended version of the paper submitted to EDOC 2025
Abstract:With the advent of digital transformation, organisations are increasingly generating large volumes of data through the execution of various processes across disparate systems. By integrating data from these heterogeneous sources, it becomes possible to derive new insights essential for tasks such as monitoring and analysing process performance. Typically, this information is extracted during a data pre-processing or engineering phase. However, this step is often performed in an ad-hoc manner and is time-consuming and labour-intensive. To streamline this process, we introduce a reference model and a collection of patterns designed to enrich production event data. The reference model provides a standard way for storing and extracting production event data. The patterns describe common information extraction tasks and how such tasks can be automated effectively. The reference model is developed by combining the ISA-95 industry standard with the Event Knowledge Graph formalism. The patterns are developed based on empirical observations from event data sets originating in manufacturing processes and are formalised using the reference model. We evaluate the relevance and applicability of these patterns by demonstrating their application to use cases.
[IR-4] Leverag ing Reference Documents for Zero-Shot Ranking via Large Language Models
链接: https://arxiv.org/abs/2506.11452
作者: Jieran Li,Xiuyuan Hu,Yang Zhao,Shengyao Zhuang,Hao Zhang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Large Language Models (LLMs) have demonstrated exceptional performance in the task of text ranking for information retrieval. While Pointwise ranking approaches offer computational efficiency by scoring documents independently, they often yield biased relevance estimates due to the lack of inter-document comparisons. In contrast, Pairwise methods improve ranking accuracy by explicitly comparing document pairs, but suffer from substantial computational overhead with quadratic complexity ( O(n^2) ). To address this tradeoff, we propose \textbfRefRank, a simple and effective comparative ranking method based on a fixed reference document. Instead of comparing all document pairs, RefRank prompts the LLM to evaluate each candidate relative to a shared reference anchor. By selecting the reference anchor that encapsulates the core query intent, RefRank implicitly captures relevance cues, enabling indirect comparison between documents via this common anchor. This reduces computational cost to linear time ( O(n) ) while importantly, preserving the advantages of comparative evaluation. To further enhance robustness, we aggregate multiple RefRank outputs using a weighted averaging scheme across different reference choices. Experiments on several benchmark datasets and with various LLMs show that RefRank significantly outperforms Pointwise baselines and could achieve performance at least on par with Pairwise approaches with a significantly lower computational cost.