本篇博文主要内容为 2025-03-10 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-03-10)
今日共更新548篇论文,其中:
- 自然语言处理共134篇(Computation and Language (cs.CL))
- 人工智能共190篇(Artificial Intelligence (cs.AI))
- 计算机视觉共105篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共142篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] Understanding the Limits of Lifelong Knowledge Editing in LLM s
【速读】: 该论文旨在解决大型语言模型(Large Language Models)在实际部署中保持事实更新的问题,同时应对因成本高昂而难以频繁重新训练的挑战。论文的关键在于引入一种名为“WikiBigEdit”的大规模真实世界维基数据编辑基准,用于评估和改进终生知识编辑技术在实际规模上的表现。通过构建这一基准,论文研究了现有知识编辑方法处理大规模真实世界事实的能力,并将其与通用修改技术(如检索增强和持续微调)进行对比,以全面了解当前终生知识编辑技术的实际应用范围。
链接: https://arxiv.org/abs/2503.05683
作者: Lukas Thede,Karsten Roth,Matthias Bethge,Zeynep Akata,Tom Hartvigsen
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint
点击查看摘要
Abstract:Keeping large language models factually up-to-date is crucial for deployment, yet costly retraining remains a challenge. Knowledge editing offers a promising alternative, but methods are only tested on small-scale or synthetic edit benchmarks. In this work, we aim to bridge research into lifelong knowledge editing to real-world edits at practically relevant scale. We first introduce WikiBigEdit; a large-scale benchmark of real-world Wikidata edits, built to automatically extend lifelong for future-proof benchmarking. In its first instance, it includes over 500K question-answer pairs for knowledge editing alongside a comprehensive evaluation pipeline. Finally, we use WikiBigEdit to study existing knowledge editing techniques’ ability to incorporate large volumes of real-world facts and contrast their capabilities to generic modification techniques such as retrieval augmentation and continual finetuning to acquire a complete picture of the practical extent of current lifelong knowledge editing.
zh
[NLP-1] Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning
【速读】: 该论文旨在解决现有预训练专家大语言模型(LLMs)在处理大规模且多样化任务时,任务级专家选择过于粗粒度的问题。具体而言,不同实例可能需要针对具体技能(如数学中的代数或生物医学推理中的分子生物学)的专门知识,而传统的任务级选择无法满足这种细粒度的需求。论文的关键解决方案是提出了一种名为Symbolic-MoE的符号化、基于文本且无梯度的专家混合框架(Mixture-of-Experts)。Symbolic-MoE通过强调技能层面的自适应实例级混合,采用基于技能的招募策略动态选择最相关的专家LLMs,并利用聚合器将多个专家生成的推理结果综合成高质量的最终输出。此外,为了降低因频繁模型加载和卸载带来的计算开销,论文进一步引入批推理策略,按分配的专家分组实例,仅加载模型一次,从而实现在单GPU上集成16个专家模型的同时保持高效性。实验表明,Symbolic-MoE不仅显著提升了性能,还优于多轮讨论的计算密集型基线方法,平均绝对提升达8.15%。
链接: https://arxiv.org/abs/2503.05641
作者: Justin Chih-Yao Chen,Sukwon Yun,Elias Stengel-Eskin,Tianlong Chen,Mohit Bansal
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: The first three authors contributed equally. Project Page: this https URL
点击查看摘要
Abstract:Combining existing pre-trained expert LLMs is a promising avenue for scalably tackling large-scale and diverse tasks. However, selecting experts at the task level is often too coarse-grained, as heterogeneous tasks may require different expertise for each instance. To enable adaptive instance-level mixing of pre-trained LLM experts, we propose Symbolic-MoE, a symbolic, text-based, and gradient-free Mixture-of-Experts framework. Symbolic-MoE takes a fine-grained approach to selection by emphasizing skills, e.g., algebra in math or molecular biology in biomedical reasoning. We propose a skill-based recruiting strategy that dynamically selects the most relevant set of expert LLMs for diverse reasoning tasks based on their strengths. Each selected expert then generates its own reasoning, resulting in k outputs from k experts, which are then synthesized into a final high-quality response by an aggregator chosen based on its ability to integrate diverse reasoning outputs. We show that Symbolic-MoE’s instance-level expert selection improves performance by a large margin but – when implemented naively – can introduce a high computational overhead due to the need for constant model loading and offloading. To address this, we implement a batch inference strategy that groups instances based on their assigned experts, loading each model only once. This allows us to integrate 16 expert models on 1 GPU with a time cost comparable to or better than prior multi-agent baselines using 4 GPUs. Through extensive evaluations on diverse benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), we demonstrate that Symbolic-MoE outperforms strong LLMs like GPT4o-mini, as well as multi-agent approaches, with an absolute average improvement of 8.15% over the best multi-agent baseline. Moreover, Symbolic-MoE removes the need for expensive multi-round discussions, outperforming discussion baselines with less computation.
zh
[NLP-2] Learning LLM Preference over Intra-Dialogue Pairs: A Framework for Utterance-level Understandings
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实时处理复杂对话任务时因延迟限制而难以部署的问题,同时克服小规模模型(数百万参数)依赖高质量人工标注数据集导致的数据获取成本高和耗时的挑战。论文提出的关键解决方案是一种结合大规模模型生成标签的可扩展性与人工标注精度的新框架,特别适用于逐句分类问题(如意图检测和对话状态跟踪等)。为减轻大模型生成标签中的噪声对小规模学生模型的影响,论文引入了一种去噪偏好学习损失函数。实验结果表明,该方法显著提升了包括情感检测和对话行为分类在内的多种逐句对话任务的准确性,分别提高了超过2%和1.5%。
链接: https://arxiv.org/abs/2503.05620
作者: Xuanqing Liu,Luyang Kong,Wei Niu,Afshin Khashei,Belinda Zeng,Steve Johnson,Jon Jay,Davor Golac,Matt Pope
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages, 4 figures
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in handling complex dialogue tasks without requiring use case-specific fine-tuning. However, analyzing live dialogues in real-time necessitates low-latency processing systems, making it impractical to deploy models with billions of parameters due to latency constraints. As a result, practitioners often prefer smaller models with millions of parameters, trained on high-quality, human-annotated datasets. Yet, curating such datasets is both time-consuming and costly. Consequently, there is a growing need to combine the scalability of LLM-generated labels with the precision of human annotations, enabling fine-tuned smaller models to achieve both higher speed and accuracy comparable to larger models. In this paper, we introduce a simple yet effective framework to address this challenge. Our approach is specifically designed for per-utterance classification problems, which encompass tasks such as intent detection, dialogue state tracking, and more. To mitigate the impact of labeling errors from LLMs – the primary source of inaccuracies in student models – we propose a noise-reduced preference learning loss. Experimental results demonstrate that our method significantly improves accuracy across utterance-level dialogue tasks, including sentiment detection (over 2% ), dialogue act classification (over 1.5% ), etc.
zh
[NLP-3] A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)内部机制不透明的问题,通过引入稀疏自编码器(Sparse Autoencoders, SAEs)提供一种有效的解释方法。论文的关键在于利用SAEs能够将LLMs中复杂的叠加特征解耦为更易解释的组件的能力,从而实现对LLMs工作机制的深入理解、引导模型行为朝期望方向发展,并开发更透明的训练方法。论文系统性地探讨了SAEs的基本原理、架构设计及其在LLM分析中的应用,涵盖了稀疏机制的理论基础、实施策略以及最新进展。尽管SAEs在实际部署和扩展方面仍面临挑战,但它们仍然是理解LLMs内部机制的重要工具。
链接: https://arxiv.org/abs/2503.05613
作者: Dong Shu,Xuansheng Wu,Haiyan Zhao,Daking Rai,Ziyu Yao,Ninghao Liu,Mengnan Du
机构: Northwestern University (西北大学); University of Georgia (佐治亚大学); New Jersey Institute of Technology (新泽西理工学院); George Mason University (乔治梅森大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 20 pages, 3 figures
点击查看摘要
Abstract:Large Language Models (LLMs) have revolutionized natural language processing, yet their internal mechanisms remain largely opaque. Recently, mechanistic interpretability has attracted significant attention from the research community as a means to understand the inner workings of LLMs. Among various mechanistic interpretability approaches, Sparse Autoencoders (SAEs) have emerged as a particularly promising method due to their ability to disentangle the complex, superimposed features within LLMs into more interpretable components. This paper presents a comprehensive examination of SAEs as a promising approach to interpreting and understanding LLMs. We provide a systematic overview of SAE principles, architectures, and applications specifically tailored for LLM analysis, covering theoretical foundations, implementation strategies, and recent developments in sparsity mechanisms. We also explore how SAEs can be leveraged to explain the internal workings of LLMs, steer model behaviors in desired directions, and develop more transparent training methodologies for future models. Despite the challenges that remain around SAE implementation and scaling, they continue to provide valuable tools for understanding the internal mechanisms of large language models.
zh
[NLP-4] AceWGS: An LLM -Aided Framework to Accelerate Catalyst Design for Water-Gas Shift Reactions
【速读】: 该论文旨在解决水煤气变换(Water-Gas Shift, WGS)反应中低温高效催化剂设计的挑战,以及当前人工智能(Artificial Intelligence, AI)在催化材料研究中的两个关键局限性:一是AI模型主要依赖数值数据,无法有效捕捉文本信息(如合成方法等);二是催化设计的跨学科性质导致AI与理论、实验及数值模拟之间的协作存在沟通障碍。论文的关键解决方案是提出AceWGS框架,这是一个基于大语言模型(Large Language Models, LLMs)辅助的设计工具,通过自然语言交互,支持四种功能:回答通用查询、提取数据库信息、理解文献上下文以及利用逆向AI模型识别候选催化剂。该框架通过整合多源信息,实现了催化剂设计流程的加速,并提供了一个开源且可调的平台,以促进跨学科研究的无缝集成。
链接: https://arxiv.org/abs/2503.05607
作者: Joyjit Chattoraj,Brahim Hamadicharef,Teo Shi Chang,Yingzhi Zeng,Chee Kok Poh,Luwei Chen,Teck Leong Tan
机构: Institute of High Performance Computing, Agency for Science Technology and Research (ASTAR)(高性能计算研究所, 新加坡科技研究局); Department of Catalysis and Green Process Engineering, Institute of Sustainability for Chemicals, Energy and Environment, Agency for Science Technology and Research (ASTAR)(催化与绿色工艺工程部, 可持续化学、能源与环境研究所, 新加坡科技研究局); Materials Science and Chemistry, Institute of High Performance Computing, Agency for Science Technology and Research (A*STAR)(材料科学与化学, 高性能计算研究所, 新加坡科技研究局)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:While the Water-Gas Shift (WGS) reaction plays a crucial role in hydrogen production for fuel cells, finding suitable catalysts to achieve high yields for low-temperature WGS reactions remains a persistent challenge. Artificial Intelligence (AI) has shown promise in accelerating catalyst design by exploring vast candidate spaces, however, two key gaps limit its effectiveness. First, AI models primarily train on numerical data, which fail to capture essential text-based information, such as catalyst synthesis methods. Second, the cross-disciplinary nature of catalyst design requires seamless collaboration between AI, theory, experiments, and numerical simulations, often leading to communication barriers. To address these gaps, we present AceWGS, a Large Language Models (LLMs)-aided framework to streamline WGS catalyst design. AceWGS interacts with researchers through natural language, answering queries based on four features: (i) answering general queries, (ii) extracting information about the database comprising WGS-related journal articles, (iii) comprehending the context described in these articles, and (iv) identifying catalyst candidates using our proposed AI inverse model. We presented a practical case study demonstrating how AceWGS can accelerate the catalyst design process. AceWGS, built with open-source tools, offers an adjustable framework that researchers can readily adapt for a range of AI-accelerated catalyst design applications, supporting seamless integration across cross-disciplinary studies.
zh
[NLP-5] R1-Searcher: Incentivizing the Search Capability in LLM s via Reinforcement Learning
【速读】: 该论文旨在解决现有大型推理模型(LRMs)在处理时间敏感或知识密集型问题时,过度依赖内部知识而导致准确性不足和幻觉现象的问题。论文提出了一种名为\textbf{R1-Searcher}的创新两阶段基于结果的强化学习方法,其关键是通过增强大型语言模型(LLMs)的搜索能力,使其能够在推理过程中自主调用外部搜索系统以获取额外知识。这种方法完全基于强化学习,无需过程奖励或蒸馏即可实现冷启动,并有效提升了模型在复杂任务中的表现。
链接: https://arxiv.org/abs/2503.05592
作者: Huatong Song,Jinhao Jiang,Yingqian Min,Jie Chen,Zhipeng Chen,Wayne Xin Zhao,Lei Fang,Ji-Rong Wen
机构: Gaoling School of Artificial Intelligence, Renmin University of China (高瓴人工智能学院,中国人民大学); DataCanvas Alaya NeW (数据 canvas 阿拉亚新世界)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Existing Large Reasoning Models (LRMs) have shown the potential of reinforcement learning (RL) to enhance the complex reasoning capabilities of Large Language Models~(LLMs). While they achieve remarkable performance on challenging tasks such as mathematics and coding, they often rely on their internal knowledge to solve problems, which can be inadequate for time-sensitive or knowledge-intensive questions, leading to inaccuracies and hallucinations. To address this, we propose \textbfR1-Searcher, a novel two-stage outcome-based RL approach designed to enhance the search capabilities of LLMs. This method allows LLMs to autonomously invoke external search systems to access additional knowledge during the reasoning process. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. % effectively generalizing to out-of-domain datasets and supporting both Base and Instruct models. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
zh
[NLP-6] Quantifying the Robustness of Retrieval-Augmented Language Models Against Spurious Features in Grounding Data
【速读】: 该论文旨在解决基于 Retrieval-Augmented Generation (RAG) 系统在实际应用中因大规模语言模型 (LLMs) 对语义无关特征 (spurious features) 的敏感性而引发的鲁棒性问题。论文的关键在于通过统计分析确认了 RAG 模型中 spurious features 的存在,并提出了一个全面的分类框架以量化这些特征的影响。研究进一步揭示,这些 spurious features 并非全然有害,有时甚至可能有益。作者通过广泛的实验评估表明,spurious features 是 RAG 领域内普遍存在且具有挑战性的问题。为推动后续研究,论文提供了相关代码和数据集。
链接: https://arxiv.org/abs/2503.05587
作者: Shiping Yang,Jie Wu,Wenbiao Ding,Ning Wu,Shining Liang,Ming Gong,Hengyuan Zhang,Dongmei Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Robustness has become a critical attribute for the deployment of RAG systems in real-world applications. Existing research focuses on robustness to explicit noise (e.g., document semantics) but overlooks spurious features (a.k.a. implicit noise). While previous works have explored spurious features in LLMs, they are limited to specific features (e.g., formats) and narrow scenarios (e.g., ICL). In this work, we statistically confirm the presence of spurious features in the RAG paradigm, a robustness problem caused by the sensitivity of LLMs to semantic-agnostic features. Moreover, we provide a comprehensive taxonomy of spurious features and empirically quantify their impact through controlled experiments. Further analysis reveals that not all spurious features are harmful and they can even be beneficial sometimes. Extensive evaluation results across multiple LLMs suggest that spurious features are a widespread and challenging problem in the field of RAG. The code and dataset will be released to facilitate future research. We release all codes and data at: \\hrefthis https URLthis https URL .
zh
[NLP-7] Evaluating open-source Large Language Models for automated fact-checking
【速读】: 该论文试图解决在线虚假信息日益增多背景下,自动化事实核查需求增加的问题,具体研究大型语言模型(Large Language Models, LLMs)在辅助自动化事实核查任务中的有效性。论文的关键在于评估不同开源LLMs在处理带有不同上下文信息的陈述时的事实核查能力,并通过三项实验探究其性能:(1) 检测LLMs识别陈述与事实核查文章语义关系的能力;(2) 在提供相关事实核查文章时验证LLMs的准确性;(3) 测试LLMs利用外部知识源(如Google和Wikipedia)进行事实核查的表现。结果表明,LLMs在识别陈述-文章关联及验证已核查故事方面表现良好,但在确认事实新闻时逊色于传统微调模型(如RoBERTa),且引入外部知识未显著提升其性能,这凸显了LLMs在自动化事实核查中的潜力与局限性,强调了进一步优化的必要性。
链接: https://arxiv.org/abs/2503.05565
作者: Nicolo’ Fontana,Francesco Corso,Enrico Zuccolotto,Francesco Pierri
机构: Politecnico di Milano (米兰理工大学, Italy); CENTAI (Italy)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: Main: 10 pages, 13 figures. Supplementary Materials: 7 pages, 29 figures, 1 table ### This work has been submitted to the IEEE for possible publication. ###
点击查看摘要
Abstract:The increasing prevalence of online misinformation has heightened the demand for automated fact-checking solutions. Large Language Models (LLMs) have emerged as potential tools for assisting in this task, but their effectiveness remains uncertain. This study evaluates the fact-checking capabilities of various open-source LLMs, focusing on their ability to assess claims with different levels of contextual information. We conduct three key experiments: (1) evaluating whether LLMs can identify the semantic relationship between a claim and a fact-checking article, (2) assessing models’ accuracy in verifying claims when given a related fact-checking article, and (3) testing LLMs’ fact-checking abilities when leveraging data from external knowledge sources such as Google and Wikipedia. Our results indicate that LLMs perform well in identifying claim-article connections and verifying fact-checked stories but struggle with confirming factual news, where they are outperformed by traditional fine-tuned models such as RoBERTa. Additionally, the introduction of external knowledge does not significantly enhance LLMs’ performance, calling for more tailored approaches. Our findings highlight both the potential and limitations of LLMs in automated fact-checking, emphasizing the need for further refinements before they can reliably replace human fact-checkers.
zh
[NLP-8] Pi-GPS: Enhancing Geometry Problem Solving by Unleashing the Power of Diagrammatic Information
【速读】: 该论文旨在解决几何问题求解中因文本歧义导致的推理困难,特别是在多模态数学推理任务中被忽视的文本澄清问题。论文提出了一种名为Pi-GPS的新框架,其关键是设计了一个包含校正器(rectifier)和验证器(verifier)的微模块:校正器利用多语言大模型(MLLMs)基于图示上下文消除文本歧义;验证器则确保校正后的输出符合几何规则,从而减少模型幻觉(hallucination)。此外,论文还探索了在基于消歧后形式化语言的定理预测器中引入大型语言模型(LLMs)的影响。实证结果显示,Pi-GPS在Geometry3K数据集上的表现超越现有最先进的神经符号方法,提升了近10%的性能。这一工作强调了解决多模态数学推理中文本歧义的重要性,这是限制性能的关键因素之一。
链接: https://arxiv.org/abs/2503.05543
作者: Junbo Zhao,Ting Zhang,Jiayu Sun,Mi Tian,Hua Huang
机构: Beijing Normal University (北京师范大学); TAL (未知)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Geometry problem solving has garnered increasing attention due to its potential applications in intelligent education field. Inspired by the observation that text often introduces ambiguities that diagrams can clarify, this paper presents Pi-GPS, a novel framework that unleashes the power of diagrammatic information to resolve textual ambiguities, an aspect largely overlooked in prior research. Specifically, we design a micro module comprising a rectifier and verifier: the rectifier employs MLLMs to disambiguate text based on the diagrammatic context, while the verifier ensures the rectified output adherence to geometric rules, mitigating model hallucinations. Additionally, we explore the impact of LLMs in theorem predictor based on the disambiguated formal language. Empirical results demonstrate that Pi-GPS surpasses state-of-the-art models, achieving a nearly 10% improvement on Geometry3K over prior neural-symbolic approaches. We hope this work highlights the significance of resolving textual ambiguity in multimodal mathematical reasoning, a crucial factor limiting performance.
zh
[NLP-9] Cognitive Bias Detection Using Advanced Prompt Engineering
【速读】: 该论文旨在解决用户生成文本中实时检测认知偏差(Cognitive Biases)的问题,这些偏差如确认偏误(Confirmation Bias)、循环推理(Circular Reasoning)和隐藏假设(Hidden Assumption)等,会严重影响内容的客观性。论文的关键解决方案在于结合大型语言模型(Large Language Models, LLMs)与先进的提示工程(Prompt Engineering)技术,通过设计定制化提示(Tailored Prompts),使系统既能有效识别认知偏差,又能主动减轻其影响,从而提升人生成内容(如新闻、媒体和报告)的质量。实验结果验证了该方法在偏差检测中的高准确性,为增强内容客观性和降低偏见决策风险提供了有力工具。
链接: https://arxiv.org/abs/2503.05516
作者: Frederic Lemieux,Aisha Behr,Clara Kellermann-Bryant,Zaki Mohammed
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 17 pages. 6 Figures, 2 Tables
点击查看摘要
Abstract:Cognitive biases, systematic deviations from rationality in judgment, pose significant challenges in generating objective content. This paper introduces a novel approach for real-time cognitive bias detection in user-generated text using large language models (LLMs) and advanced prompt engineering techniques. The proposed system analyzes textual data to identify common cognitive biases such as confirmation bias, circular reasoning, and hidden assumption. By designing tailored prompts, the system effectively leverages LLMs’ capabilities to both recognize and mitigate these biases, improving the quality of human-generated content (e.g., news, media, reports). Experimental results demonstrate the high accuracy of our approach in identifying cognitive biases, offering a valuable tool for enhancing content objectivity and reducing the risks of biased decision-making.
zh
[NLP-10] Statistical Guarantees of Correctness Coverag e for Medical Multiple-Choice Question Answering
【速读】: 该论文致力于解决大型语言模型(Large Language Models, LLMs)在高风险医疗任务中的不可信问题,特别是其生成幻觉和非事实信息的现象。论文的关键创新在于首次将一致性预测(Conformal Prediction, CP)框架应用于医学多选题问答(Medical Multiple-Choice Question-Answering, MCQA)任务,并通过关联非一致性分数与基于自洽性理论的正确选项频率得分,提出了一个无需访问内部模型信息的解决方案。为了控制错误覆盖率,论文进一步结合风险控制框架,设计了一个单调递减的损失函数以优化特定任务的性能指标。实验结果表明,该方法能够在指定的平均或边际误差率下实现有效评估,并且测试集上的平均预测集大小(Average Prediction Set Size, APSS)随风险水平提高而减少,为LLMs的不确定性评估提供了有前景的度量方式。
链接: https://arxiv.org/abs/2503.05505
作者: Yusong Ke
机构: 未知
类目: Computation and Language (cs.CL)
备注: Under Review
点击查看摘要
Abstract:Large language models (LLMs) are increasingly deployed in real-world question-answering (QA) applications. However, LLMs have been proven to generate hallucinations and nonfactual information, undermining their trustworthiness in high-stakes medical tasks. Conformal prediction (CP) is well-known to be model-agnostic and distribution-free, which creates statistically rigorous prediction sets in classification tasks. In this work, we for the first time adapt the CP framework to medical multiple-choice question-answering (MCQA) tasks, by correlating the nonconformity score with the frequency score of correct options grounded in self-consistency theory, assuming no access to internal model information. Considering that the adapted CP framework can only control the (mis)coverage rate, we employ a risk control framework, which can manage task-specific metrics by devising a monotonically decreasing loss function. We evaluate our framework on 3 popular medical MCQA datasets utilizing 4 ``off-the-shelf’’ LLMs. Empirical results demonstrate that we achieve user-specified average (or marginal) error rates on the test set. Furthermore, we observe that the average prediction set size (APSS) on the test set decreases as the risk level increases, which concludes a promising evaluation metric for the uncertainty of LLMs.
zh
[NLP-11] EuroBERT: Scaling Multilingual Encoders for European Languages
【速读】: 该论文试图解决如何通过改进的编码器方法提升多语言向量表示在检索、回归和分类任务中的性能,并超越现有替代方案。论文的关键在于重新审视生成式解码器模型驱动的进展,将其中与解码器无关的创新应用于多语言编码器的设计,从而提出EuroBERT模型家族。EuroBERT不仅在多语言能力、数学和代码相关任务上表现出色,还支持长达8,192个标记的序列处理。其解决方案的核心在于结合最新技术进步优化编码器架构及其训练流程(training pipeline),并通过公开发布模型和训练框架促进进一步研究。
链接: https://arxiv.org/abs/2503.05500
作者: Nicolas Boizard,Hippolyte Gisserot-Boukhlef,Duarte M. Alves,André Martins,Ayoub Hammal,Caio Corro,Céline Hudelot,Emmanuel Malherbe,Etienne Malaboeuf,Fanny Jourdan,Gabriel Hautreux,João Alves,Kevin El-Haddad,Manuel Faysse,Maxime Peyrard,Nuno M. Guerreiro,Patrick Fernandes,Ricardo Rei,Pierre Colombo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 26 pages, 6 figures, 11 tables
点击查看摘要
Abstract:General-purpose multilingual vector representations, used in retrieval, regression and classification, are traditionally obtained from bidirectional encoder models. Despite their wide applicability, encoders have been recently overshadowed by advances in generative decoder-only models. However, many innovations driving this progress are not inherently tied to decoders. In this paper, we revisit the development of multilingual encoders through the lens of these advances, and introduce EuroBERT, a family of multilingual encoders covering European and widely spoken global languages. Our models outperform existing alternatives across a diverse range of tasks, spanning multilingual capabilities, mathematics, and coding, and natively supporting sequences of up to 8,192 tokens. We also examine the design decisions behind EuroBERT, offering insights into our dataset composition and training pipeline. We publicly release the EuroBERT models, including intermediate training checkpoints, together with our training framework.
zh
[NLP-12] Benchmarking LLM s in Recommendation Tasks: A Comparative Evaluation with Conventional Recommenders
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推荐系统中的应用评估问题,具体目标是构建一个全面的基准(RecBench),以系统性地比较LLMs与传统推荐系统的推荐能力。论文的关键在于通过引入多种项目表示形式(如唯一标识符、文本、语义嵌入和语义标识符)以及评估两种主要推荐任务(点击率预测CTR和序列推荐SeqRec),对多达17个大模型在五个来自时尚、新闻、视频、书籍和音乐领域的多样化数据集上进行广泛的实验。研究发现,基于LLM的推荐器在CTR场景中可提升高达5%的AUC,在SeqRec场景中可提升高达170%的NDCG@10,但这些性能改进是以显著降低推理效率为代价的。因此,论文的核心解决方案在于提出RecBench框架,并揭示LLM作为推荐系统(LLM-as-RS)范式的性能权衡问题,为未来针对推荐系统的特定加速方法研究提供启发。
链接: https://arxiv.org/abs/2503.05493
作者: Qijiong Liu,Jieming Zhu,Lu Fan,Kun Wang,Hengchang Hu,Wei Guo,Yong Liu,Xiao-Ming Wu
机构: The HK PolyU (香港理工大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); Nanyang Technology University (南洋理工大学); National University of Singapore (新加坡国立大学); Xiao-Ming Wu@The HK PolyU (香港理工大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In recent years, integrating large language models (LLMs) into recommender systems has created new opportunities for improving recommendation quality. However, a comprehensive benchmark is needed to thoroughly evaluate and compare the recommendation capabilities of LLMs with traditional recommender systems. In this paper, we introduce RecBench, which systematically investigates various item representation forms (including unique identifier, text, semantic embedding, and semantic identifier) and evaluates two primary recommendation tasks, i.e., click-through rate prediction (CTR) and sequential recommendation (SeqRec). Our extensive experiments cover up to 17 large models and are conducted across five diverse datasets from fashion, news, video, books, and music domains. Our findings indicate that LLM-based recommenders outperform conventional recommenders, achieving up to a 5% AUC improvement in the CTR scenario and up to a 170% NDCG@10 improvement in the SeqRec scenario. However, these substantial performance gains come at the expense of significantly reduced inference efficiency, rendering the LLM-as-RS paradigm impractical for real-time recommendation environments. We aim for our findings to inspire future research, including recommendation-specific model acceleration methods. We will release our code, data, configurations, and platform to enable other researchers to reproduce and build upon our experimental results.
zh
[NLP-13] KIEval: Evaluation Metric for Document Key Information Extraction
【速读】: 该论文试图解决现有文档关键信息提取(Document Key Information Extraction, KIE)技术评估指标无法准确反映其工业应用关键属性的问题。解决方案的关键在于提出了一种面向应用的新评估指标 KIEval,它不仅评估单个信息实体(entity)的提取效果,还评估结构化信息组群(grouping)的提取能力。通过引入对结构化信息的评价,KIEval 更加贴近工业场景中实际提取分组信息的需求,从而提供更符合工业应用需求的模型评估标准。
链接: https://arxiv.org/abs/2503.05488
作者: Minsoo Khang,Sang Chul Jung,Sungrae Park,Teakgyu Hong
机构: Upstage AI (Upstage AI); South Korea (韩国)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Document Key Information Extraction (KIE) is a technology that transforms valuable information in document images into structured data, and it has become an essential function in industrial settings. However, current evaluation metrics of this technology do not accurately reflect the critical attributes of its industrial applications. In this paper, we present KIEval, a novel application-centric evaluation metric for Document KIE models. Unlike prior metrics, KIEval assesses Document KIE models not just on the extraction of individual information (entity) but also of the structured information (grouping). Evaluation of structured information provides assessment of Document KIE models that are more reflective of extracting grouped information from documents in industrial settings. Designed with industrial application in mind, we believe that KIEval can become a standard evaluation metric for developing or applying Document KIE models in practice. The code will be publicly available.
zh
[NLP-14] Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts
【速读】: 该论文旨在解决大规模模型高效建模与训练的问题,特别是在结合线性序列建模(Linear Sequence Modeling, LSM)与专家混合(Mixture-of-Experts, MoE)架构时如何实现高效率与高性能的平衡。论文的关键创新在于提出Linear-MoE系统,它通过整合LSM模块的线性复杂度序列建模能力与MoE层的稀疏激活特性,在保持竞争力性能的同时显著提升训练效率。解决方案的核心在于设计了一个包含建模子系统(提供统一框架支持所有LSM实例)和训练子系统(利用多种高级并行技术,特别是为Linear-MoE定制的Sequence Parallelism)的生产级系统,并进一步探索了将Linear-MoE层与标准Transformer-MoE层相结合的混合模型,以增强模型灵活性与性能。
链接: https://arxiv.org/abs/2503.05447
作者: Weigao Sun,Disen Lan,Tong Zhu,Xiaoye Qu,Yu Cheng
机构: Shanghai AI Laboratory (上海人工智能实验室); South China University of Technology (华南理工大学); Soochow University (苏州大学); The Chinese University of Hong Kong (香港中文大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Technical report, 17 pages
点击查看摘要
Abstract:Linear Sequence Modeling (LSM) like linear attention, state space models and linear RNNs, and Mixture-of-Experts (MoE) have recently emerged as significant architectural improvements. In this paper, we introduce Linear-MoE, a production-level system for modeling and training large-scale models that integrate LSM with MoE. Linear-MoE leverages the advantages of both LSM modules for linear-complexity sequence modeling and MoE layers for sparsely activation, aiming to offer high performance with efficient training. The Linear-MoE system comprises: 1) Modeling subsystem, which provides a unified framework supporting all instances of LSM. and 2) Training subsystem, which facilitates efficient training by incorporating various advanced parallelism technologies, particularly Sequence Parallelism designed for Linear-MoE models. Additionally, we explore hybrid models that combine Linear-MoE layers with standard Transformer-MoE layers with its Sequence Parallelism to further enhance model flexibility and performance. Evaluations on two model series, A0.3B-2B and A1B-7B, demonstrate Linear-MoE achieves efficiency gains while maintaining competitive performance on various benchmarks, showcasing its potential as a next-generation foundational model architecture. Code: this https URL.
zh
[NLP-15] An Empirical Study of Conformal Prediction in LLM with ASP Scaffolds for Robust Reasoning
【速读】: 本文旨在解决复杂多步推理任务中标准开放式大语言模型(LLMs)性能不足的问题。论文的关键解决方案是结合使用一致性语言建模(Conformal Language Modelling, CLM)与回答集编程(Answer Set Programming, ASP)。具体而言,通过在StepGame数据集上的实验,CLM被用于从LLM生成ASP程序集,并为输出结果提供统计正确性保证。此外,引入LLM-as-Judge度量进一步提升了CLM的表现,尤其是在评估结构和逻辑正确的ASP输出方面。然而,使用多样化校准集对CLM进行校准未能显著提升需要更长推理步骤任务的泛化能力,表明其在处理更复杂任务时存在局限性。
链接: https://arxiv.org/abs/2503.05439
作者: Navdeep Kaur,Lachlan McPheat,Alessandra Russo,Anthony G Cohn,Pranava Madhyastha
机构: The Alan Turing Institute; Imperial College London (帝国理工学院); The University of Leeds (利兹大学); City University of London (伦敦城市大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In this paper, we examine the use of Conformal Language Modelling (CLM) alongside Answer Set Programming (ASP) to enhance the performance of standard open-weight LLMs on complex multi-step reasoning tasks. Using the StepGame dataset, which requires spatial reasoning, we apply CLM to generate sets of ASP programs from an LLM, providing statistical guarantees on the correctness of the outputs. Experimental results show that CLM significantly outperforms baseline models that use standard sampling methods, achieving substantial accuracy improvements across different levels of reasoning complexity. Additionally, the LLM-as-Judge metric enhances CLM’s performance, especially in assessing structurally and logically correct ASP outputs. However, calibrating CLM with diverse calibration sets did not improve generalizability for tasks requiring much longer reasoning steps, indicating limitations in handling more complex tasks.
zh
[NLP-16] Multi Agent based Medical Assistant for Edge Devices
【速读】: 该论文旨在解决大型动作模型(Large Action Models, LAMs)在医疗保健领域应用中面临的隐私保护、延迟以及对互联网访问依赖等问题。论文的关键解决方案在于提出了一种基于设备端的多智能体医疗助手系统。该系统通过采用较小的任务专用智能体(task-specific agents),优化资源利用、确保可扩展性和高性能表现。其中,Planner 和 Caller 智能体分别针对任务规划和呼叫功能,在 Qwen Code Instruct 2.5 7B 模型的支持下实现了平均 RougeL 分数分别为 85.5 和 96.5 的性能指标,同时具备轻量级特性以支持设备端部署。这一创新方法结合了设备端系统的优势与多智能体架构的特点,为以用户为中心的医疗保健解决方案奠定了基础。
链接: https://arxiv.org/abs/2503.05397
作者: Sakharam Gawade,Shivam Akhouri,Chinmay Kulkarni,Jagdish Samant,Pragya Sahu,Aastik,Jai Pahal,Saswat Meher
机构: Samsung Research Institute Bangalore (三星研究学院班加罗尔), India (印度)
类目: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Action Models (LAMs) have revolutionized intelligent automation, but their application in healthcare faces challenges due to privacy concerns, latency, and dependency on internet access. This report introduces an ondevice, multi-agent healthcare assistant that overcomes these limitations. The system utilizes smaller, task-specific agents to optimize resources, ensure scalability and high performance. Our proposed system acts as a one-stop solution for health care needs with features like appointment booking, health monitoring, medication reminders, and daily health reporting. Powered by the Qwen Code Instruct 2.5 7B model, the Planner and Caller Agents achieve an average RougeL score of 85.5 for planning and 96.5 for calling for our tasks while being lightweight for on-device deployment. This innovative approach combines the benefits of ondevice systems with multi-agent architectures, paving the way for user-centric healthcare solutions.
zh
[NLP-17] Leverag ing Semantic Type Dependencies for Clinical Named Entity Recognition
【速读】: 该论文试图解决临床关系抽取任务中从自由文本句子中提取实体语义类型信息的问题。解决方案的关键在于利用领域特定的语义类型依赖关系作为补充证据,不仅考虑匹配Unified Medical Language System (UMLS)概念的词组跨度与句子中其他词的关系,还通过实现一种新的矩阵编码方法,在一次传递中处理多于三种依赖关系,从而显著提升命名实体识别(Named Entity Recognition, NER)的效果,尤其是在使用不同的预训练临床嵌入模型(如BERT、BioBERT、UMLSBert)时。
链接: https://arxiv.org/abs/2503.05373
作者: Linh Le,Guido Zuccon,Gianluca Demartini,Genghong Zhao,Xia Zhang
机构: University of Queensland (昆士兰大学), Australia; Neusoft Research of Intelligent Healthcare Technology, Co. Ltd. (东软智能医疗科技研究院有限公司), Shenyang, China; Neusoft Corporation (东软集团), Shenyang, China
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Previous work on clinical relation extraction from free-text sentences leveraged information about semantic types from clinical knowledge bases as a part of entity representations. In this paper, we exploit additional evidence by also making use of domain-specific semantic type dependencies. We encode the relation between a span of tokens matching a Unified Medical Language System (UMLS) concept and other tokens in the sentence. We implement our method and compare against different named entity recognition (NER) architectures (i.e., BiLSTM-CRF and BiLSTM-GCN-CRF) using different pre-trained clinical embeddings (i.e., BERT, BioBERT, UMLSBert). Our experimental results on clinical datasets show that in some cases NER effectiveness can be significantly improved by making use of domain-specific semantic type dependencies. Our work is also the first study generating a matrix encoding to make use of more than three dependencies in one pass for the NER task.
zh
[NLP-18] Shifting Perspectives: Steering Vector Ensembles for Robust Bias Mitigation in LLM s ACL2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中的偏差问题。为实现这一目标,论文提出了一种通过在前向传播过程中应用转向向量(steering vectors)来修改模型激活的新方法。解决方案的关键在于利用贝叶斯优化(Bayesian optimization)系统性地识别针对九个偏差轴的有效对比对数据集,并通过优化这些转向向量,在Mistral、Llama和Qwen等模型上的偏差分别实现了比基线高出12.2%、4.7%和3.2%的平均改进。进一步地,论文引入了转向向量集成(Steering Vector Ensembles, SVE),通过整合多个专门针对特定偏差轴(如年龄、种族或性别)优化的转向向量,不仅提升了偏差减少的效果,还保持了模型的整体性能。这项工作首次系统性地研究了转向向量在偏差缓解中的应用,并证明了SVE是一种高效且计算成本较低的策略,具有增强AI安全性的广泛意义。
链接: https://arxiv.org/abs/2503.05371
作者: Zara Siddique,Irtaza Khalid,Liam D. Turner,Luis Espinosa-Anke
机构: School of Computer Science and Informatics, Cardiff University (卡迪夫大学), United Kingdom; AMPLYFI (未知), United Kingdom
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Submitted to ACL 2025
点击查看摘要
Abstract:We present a novel approach to bias mitigation in large language models (LLMs) by applying steering vectors to modify model activations in forward passes. We employ Bayesian optimization to systematically identify effective contrastive pair datasets across nine bias axes. When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.2%, 4.7%, and 3.2% over the baseline for Mistral, Llama, and Qwen, respectively. Building on these promising results, we introduce Steering Vector Ensembles (SVE), a method that averages multiple individually optimized steering vectors, each targeting a specific bias axis such as age, race, or gender. By leveraging their collective strength, SVE outperforms individual steering vectors in both bias reduction and maintaining model performance. The work presents the first systematic investigation of steering vectors for bias mitigation, and we demonstrate that SVE is a powerful and computationally efficient strategy for reducing bias in LLMs, with broader implications for enhancing AI safety.
zh
[NLP-19] Chain of Strategy Optimization Makes Large Language Models Better Emotional Supporter
【速读】: 该论文旨在解决情感支持对话(Emotional Support Conversations, ESC)中大型语言模型(Large Language Models, LLMs)面临的两个关键问题:策略选择准确性低以及偏好偏差,这些问题限制了模型适应用户情感需求的能力。现有的监督微调(Supervised Fine-Tuning, SFT)方法难以应对这些挑战,因为它仅基于单一标准答案进行训练,未能捕捉策略选择中的细微权衡。为了解决这些问题,论文提出了一种名为链式策略优化(Chain-of-Strategy Optimization, CSO)的新方法,其关键是通过在每一轮对话中优化策略选择偏好,从而提升策略准确性并减轻偏好偏差。此外,研究利用蒙特卡洛树搜索(Monte Carlo Tree Search)构建了一个高质量的ESC-Pro数据集,包含轮次级别的策略-响应对,并通过在该数据集上应用CSO进一步提升了模型生成的情感共鸣性和上下文适配性。实验结果表明,CSO方法优于传统的SFT,在LLaMA-3.1-8B、Gemma-2-9B和Qwen2.5-7B等模型上的表现验证了细粒度、轮次级别偏好建模的有效性。
链接: https://arxiv.org/abs/2503.05362
作者: Weixiang Zhao,Xingyu Sui,Xinyang Han,Yang Deng,Yulin Hu,Jiahe Guo,Libo Qin,Qianyun Du,Shijin Wang,Yanyan Zhao,Bing Qin,Ting Liu
机构: Harbin Institute of Technology (哈尔滨工业大学); Singapore Management University (新加坡管理大学); Central South University; iFLYTEK Co., Ltd (科大讯飞股份有限公司); iFLYTEK AI Research (Central China), iFLYTEK Co., Ltd (科大讯飞股份有限公司中央研究院)
类目: Computation and Language (cs.CL)
备注: 19 pages, 9 figures, 15 tables
点击查看摘要
Abstract:The growing emotional stress in modern society has increased the demand for Emotional Support Conversations (ESC). While Large Language Models (LLMs) show promise for ESC, they face two key challenges: (1) low strategy selection accuracy, and (2) preference bias, limiting their adaptability to emotional needs of users. Existing supervised fine-tuning (SFT) struggles to address these issues, as it rigidly trains models on single gold-standard responses without modeling nuanced strategy trade-offs. To overcome these limitations, we propose Chain-of-Strategy Optimization (CSO), a novel approach that optimizes strategy selection preferences at each dialogue turn. We first leverage Monte Carlo Tree Search to construct ESC-Pro, a high-quality preference dataset with turn-level strategy-response pairs. Training on ESC-Pro with CSO improves both strategy accuracy and bias mitigation, enabling LLMs to generate more empathetic and contextually appropriate responses. Experiments on LLaMA-3.1-8B, Gemma-2-9B, and Qwen2.5-7B demonstrate that CSO outperforms standard SFT, highlighting the efficacy of fine-grained, turn-level preference modeling in ESC.
zh
[NLP-20] Improving Hate Speech Classification with Cross-Taxonomy Dataset Integration ACL
【速读】: 该论文旨在解决算法仇恨言论检测领域因定义多样化及数据集差异导致的分类难题。不同社会媒体平台、法律框架和机构采用各自独立但部分重叠的定义,使得构建通用模型面临挑战。论文的关键解决方案在于提出一种统一的分类法(universal taxonomy)以及一个能够在单一框架内识别广泛定义的仇恨言论分类器(hate speech classifier),通过整合现有数据集和分类体系,不仅提升了预测性能,还减少了对多个专用分类器的依赖。研究验证了这一方法的有效性,通过结合两个标注方式不同的常用数据集,在独立测试集中实现了更优的分类效果。这表明数据集与分类法的集成在推动仇恨言论检测技术发展、提高效率及增强跨场景适用性方面具有巨大潜力。
链接: https://arxiv.org/abs/2503.05357
作者: Jan Fillies,Adrian Paschke
机构: Institut für Angewandte Informatik (应用信息学研究所), Leipzig, Germany; Freie Universität Berlin (自由大学柏林), Berlin, Germany; Fraunhofer-Institut für Offene Kommunikationssysteme (弗劳恩霍夫开放通信系统研究所), Berlin, Germany
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注: Accepted for publication at LaTeCH-CLfL 2025. The 9th Joint ACL Special Interest Group on Language Technologies for the Socio-Economic Sciences and Humanities (SIGHUM) Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
点击查看摘要
Abstract:Algorithmic hate speech detection faces significant challenges due to the diverse definitions and datasets used in research and practice. Social media platforms, legal frameworks, and institutions each apply distinct yet overlapping definitions, complicating classification efforts. This study addresses these challenges by demonstrating that existing datasets and taxonomies can be integrated into a unified model, enhancing prediction performance and reducing reliance on multiple specialized classifiers. The work introduces a universal taxonomy and a hate speech classifier capable of detecting a wide range of definitions within a single framework. Our approach is validated by combining two widely used but differently annotated datasets, showing improved classification performance on an independent test set. This work highlights the potential of dataset and taxonomy integration in advancing hate speech detection, increasing efficiency, and ensuring broader applicability across contexts.
zh
[NLP-21] GEMA-Score: Granular Explainable Multi-Agent Score for Radiology Report Evaluation
【速读】: 该论文旨在解决现有自动医学报告生成评估指标存在的局限性,这些指标主要关注生成报告与人工撰写报告之间关键医学信息覆盖的准确性,而忽视了诸如异常位置、严重程度及不确定性等重要细节。这种不足阻碍了对生成报告可靠性的全面评估,并可能带来临床应用的风险。为了解决这一问题,论文提出了Granular Explainable Multi-Agent Score (GEMA-Score),其核心在于通过基于大型语言模型的多智能体工作流实现客观量化与主观评价相结合。GEMA-Score利用命名实体识别(NER)F1分数计算来评估疾病诊断、位置、严重程度及不确定性,并通过智能体之间的交互信息交换完成结构化报告解析。此外,基于LLM的评分智能体进一步评估报告的完整性、可读性和临床术语使用情况,同时提供解释性反馈。实验结果表明,GEMA-Score在公共数据集上的Kendall相关系数分别达到0.70(Rexval数据集)和0.54(RadEvalX数据集),验证了其在临床评分中的有效性。
链接: https://arxiv.org/abs/2503.05347
作者: Zhenxuan Zhang,Kinhei Lee,Weihang Deng,Huichi Zhou,Zihao Jin,Jiahao Huang,Zhifan Gao,Dominic C Marshall,Yingying Fang,Guang Yang
机构: 未知
类目: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
点击查看摘要
Abstract:Automatic medical report generation supports clinical diagnosis, reduces the workload of radiologists, and holds the promise of improving diagnosis consistency. However, existing evaluation metrics primarily assess the accuracy of key medical information coverage in generated reports compared to human-written reports, while overlooking crucial details such as the location and certainty of reported abnormalities. These limitations hinder the comprehensive assessment of the reliability of generated reports and pose risks in their selection for clinical use. Therefore, we propose a Granular Explainable Multi-Agent Score (GEMA-Score) in this paper, which conducts both objective quantification and subjective evaluation through a large language model-based multi-agent workflow. Our GEMA-Score parses structured reports and employs NER-F1 calculations through interactive exchanges of information among agents to assess disease diagnosis, location, severity, and uncertainty. Additionally, an LLM-based scoring agent evaluates completeness, readability, and clinical terminology while providing explanatory feedback. Extensive experiments validate that GEMA-Score achieves the highest correlation with human expert evaluations on a public dataset, demonstrating its effectiveness in clinical scoring (Kendall coefficient = 0.70 for Rexval dataset and Kendall coefficient = 0.54 for RadEvalX dataset). The anonymous project demo is available at: this https URL.
zh
[NLP-22] AutoIOT: LLM -Driven Automated Natural Language Programming for AIoT Applications
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在人工智能物联网(AIoT)应用开发中的局限性,特别是隐私保护、高昂的查询费用、令牌限制以及推理结果的可验证性问题。论文的关键解决方案是提出AutoIOT,这是一种基于LLM的自动化程序生成器,能够通过自然语言输入自动生成可解释的代码及文档输出。AutoIOT不仅降低了执行AIoT任务的技术门槛,提升了代码质量,还通过本地执行合成程序缓解了隐私担忧并减少了令牌成本,从而显著增强了AIoT应用开发的普及性和效率。
链接: https://arxiv.org/abs/2503.05346
作者: Leming Shen,Qiang Yang,Yuanqing Zheng,Mo Li
机构: The Hong Kong Polytechnic University (香港理工大学); University of Cambridge (剑桥大学); Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
点击查看摘要
Abstract:The advent of Large Language Models (LLMs) has profoundly transformed our lives, revolutionizing interactions with AI and lowering the barrier to AI usage. While LLMs are primarily designed for natural language interaction, the extensive embedded knowledge empowers them to comprehend digital sensor data. This capability enables LLMs to engage with the physical world through IoT sensors and actuators, performing a myriad of AIoT tasks. Consequently, this evolution triggers a paradigm shift in conventional AIoT application development, democratizing its accessibility to all by facilitating the design and development of AIoT applications via natural language. However, some limitations need to be addressed to unlock the full potential of LLMs in AIoT application development. First, existing solutions often require transferring raw sensor data to LLM servers, which raises privacy concerns, incurs high query fees, and is limited by token size. Moreover, the reasoning processes of LLMs are opaque to users, making it difficult to verify the robustness and correctness of inference results. This paper introduces AutoIOT, an LLM-based automated program generator for AIoT applications. AutoIOT enables users to specify their requirements using natural language (input) and automatically synthesizes interpretable programs with documentation (output). AutoIOT automates the iterative optimization to enhance the quality of generated code with minimum user involvement. AutoIOT not only makes the execution of AIoT tasks more explainable but also mitigates privacy concerns and reduces token costs with local execution of synthesized programs. Extensive experiments and user studies demonstrate AutoIOT’s remarkable capability in program synthesis for various AIoT tasks. The synthesized programs can match and even outperform some representative baselines.
zh
[NLP-23] Speculative Decoding for Multi-Sample Inference
【速读】: 该论文旨在解决多样本推理场景(如自一致性推理和Best-of-N采样)中生成高质量候选解的问题。传统方法通常依赖辅助模型或外部数据库,而该研究提出了一种新颖的推测解码方法,通过分析并行生成路径中的内在一致性来合成高质量的候选标记序列。其关键在于利用概率聚合机制动态识别与解码分布一致的共识标记序列,从而在不增加额外资源的情况下提升生成效率。实验表明,该方法显著提高了草案接受率,同时降低了候选标记生成的延迟,为高效多样本推理提供了新的范式。
链接: https://arxiv.org/abs/2503.05330
作者: Yiwei Li,Jiayi Shi,Shaoxiong Feng,Peiwen Yuan,Xinglin Wang,Yueqi Zhang,Ji Zhang,Chuyi Tan,Boyuan Pan,Yao Hu,Kan Li
机构: School of Computer Science, Beijing Institute of Technology (北京理工大学计算机学院); Xiaohongshu Inc (小红书)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We propose a novel speculative decoding method tailored for multi-sample reasoning scenarios, such as self-consistency and Best-of-N sampling. Our method exploits the intrinsic consensus of parallel generation paths to synthesize high-quality draft tokens without requiring auxiliary models or external databases. By dynamically analyzing structural patterns across parallel reasoning paths through a probabilistic aggregation mechanism, it identifies consensus token sequences that align with the decoding distribution. Evaluations on mathematical reasoning benchmarks demonstrate a substantial improvement in draft acceptance rates over baselines, while reducing the latency in draft token construction. This work establishes a paradigm shift for efficient multi-sample inference, enabling seamless integration of speculative decoding with sampling-based reasoning techniques.
zh
[NLP-24] Dynamic Knowledge Integration for Evidence-Driven Counter-Argument Generation with Large Language Models
【速读】: 该论文试图解决的问题是如何通过动态外部知识集成来提升大型语言模型(Large Language Models, LLMs)在生成反驳论点方面的表现。传统LLMs在论证任务中虽展现出潜力,但其倾向于生成冗长且可能不基于事实的回答,这凸显了采用更受控及基于证据方法的需求。为解决这一问题,论文的关键方案是引入一种新的手动整理的数据集——包含论证与反驳论点对,并专门设计以平衡论证复杂性与评估可行性;同时提出了一种新的LLM作为裁判(LLM-as-a-Judge)的评估方法,该方法相较于传统的基于参考的度量标准更能与人类判断保持强相关性。实验结果表明,从网络集成动态外部知识能够显著提高生成反驳论点的质量,特别是在相关性、说服力和事实准确性方面。研究结果表明,将LLMs与实时外部知识检索相结合为开发更高效和可靠的反驳系统提供了有前景的方向。
链接: https://arxiv.org/abs/2503.05328
作者: Anar Yeginbergen,Maite Oronoz,Rodrigo Agerri
机构: HiTZ Center - Ixa, University of the Basque Country UPV/EHU (毕尔巴鄂大学 UPV/EHU)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper investigates the role of dynamic external knowledge integration in improving counter-argument generation using Large Language Models (LLMs). While LLMs have shown promise in argumentative tasks, their tendency to generate lengthy, potentially unfactual responses highlights the need for more controlled and evidence-based approaches. We introduce a new manually curated dataset of argument and counter-argument pairs specifically designed to balance argumentative complexity with evaluative feasibility. We also propose a new LLM-as-a-Judge evaluation methodology that shows a stronger correlation with human judgments compared to traditional reference-based metrics. Our experimental results demonstrate that integrating dynamic external knowledge from the web significantly improves the quality of generated counter-arguments, particularly in terms of relatedness, persuasiveness, and factuality. The findings suggest that combining LLMs with real-time external knowledge retrieval offers a promising direction for developing more effective and reliable counter-argumentation systems.
zh
[NLP-25] Fine-Grained Evaluation for Implicit Discourse Relation Recognition
【速读】: 该论文旨在解决隐式语篇关系识别任务中预训练语言模型(Pre-trained Language Models, PLMs)性能的细粒度分析不足的问题,并探索该任务的难点与潜在方向。此外,针对Penn Discourse Treebank 3.0(PDTB 3.0)中少数标注示例的语篇关系,论文通过半人工标注的方式补充了高质量数据。论文的关键在于通过对预训练语言模型预测的深入分析揭示任务的难点,并利用新增注释的数据显著提升二级语义级别(level-2 senses)下的隐式语篇关系识别性能。
链接: https://arxiv.org/abs/2503.05326
作者: Xinyi Cai
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Implicit discourse relation recognition is a challenging task in discourse analysis due to the absence of explicit discourse connectives between spans of text. Recent pre-trained language models have achieved great success on this task. However, there is no fine-grained analysis of the performance of these pre-trained language models for this task. Therefore, the difficulty and possible directions of this task is unclear. In this paper, we deeply analyze the model prediction, attempting to find out the difficulty for the pre-trained language models and the possible directions of this task. In addition to having an in-depth analysis for this task by using pre-trained language models, we semi-manually annotate data to add relatively high-quality data for the relations with few annotated examples in PDTB 3.0. The annotated data significantly help improve implicit discourse relation recognition for level-2 senses.
zh
[NLP-26] Uncertainty-Aware Decoding with Minimum Bayes Risk ICLR2025
【速读】: 该论文旨在解决当代语言模型在文本生成过程中偶尔产生不可取输出(如幻觉文本)的问题,这些问题通常与模型的不确定性有关,但现有方法很少在生成过程中主动考虑这种不确定性。论文的关键解决方案是将Minimum Bayes Risk (MBR)解码推广为一种有原则的不确定性感知解码方法。具体而言,通过在MBR预期风险计算中引入模型参数的后验分布,论文在解码过程中显式地考虑了模型的不确定性。这种方法不仅有助于选择更优的输出,还能决定何时放弃生成,且无需增加额外开销。此外,论文评估了不同的后验学习方法,并表明预测多样性可以提升性能。
链接: https://arxiv.org/abs/2503.05318
作者: Nico Daheim,Clara Meister,Thomas Möllenhoff,Iryna Gurevych
机构: Ubiquitous Knowledge Processing Lab (UKP Lab) (多语言知识处理实验室); Technical University of Darmstadt (达姆施塔特工业大学); ETH Zurich (瑞士联邦理工学院); RIKEN Center for Advanced Intelligence Project (理化学研究所高级智能项目中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICLR 2025 (Poster)
点击查看摘要
Abstract:Despite their outstanding performance in the majority of scenarios, contemporary language models still occasionally generate undesirable outputs, for example, hallucinated text. While such behaviors have previously been linked to uncertainty, there is a notable lack of methods that actively consider uncertainty during text generation. In this work, we show how Minimum Bayes Risk (MBR) decoding, which selects model generations according to an expected risk, can be generalized into a principled uncertainty-aware decoding method. In short, we account for model uncertainty during decoding by incorporating a posterior over model parameters into MBR’s computation of expected risk. We show that this modified expected risk is useful for both choosing outputs and deciding when to abstain from generation and can provide improvements without incurring overhead. We benchmark different methods for learning posteriors and show that performance improves with prediction diversity. We release our code publicly.
zh
[NLP-27] Coreference as an indicator of context scope in multimodal narrative
【速读】: 该论文试图解决大型多模态语言模型在视觉故事叙述任务中与人类在共指表达(coreferential expressions)分布上的显著差异问题。论文的关键在于引入了一系列量化指标,用于分析人类和机器生成文本中的共指模式特征,揭示人类能够以保持文本与图像间一致性的方式灵活交错引用不同实体,而现有模型在追踪混合引用方面能力较弱,尽管其生成质量看似有所提升。
链接: https://arxiv.org/abs/2503.05298
作者: Nikolai Ilinykh,Shalom Lappin,Asad Sayeed,Sharid Loáiciga
机构: 未知
类目: Computation and Language (cs.CL)
备注: 20 pages, 4 tables
点击查看摘要
Abstract:We demonstrate that large multimodal language models differ substantially from humans in the distribution of coreferential expressions in a visual storytelling task. We introduce a number of metrics to quantify the characteristics of coreferential patterns in both human- and machine-written texts. Humans distribute coreferential expressions in a way that maintains consistency across texts and images, interleaving references to different entities in a highly varied way. Machines are less able to track mixed references, despite achieving perceived improvements in generation quality.
zh
[NLP-28] Similarity-Based Domain Adaptation with LLM s
【速读】: 该论文试图解决无监督领域自适应(Unsupervised Domain Adaptation)中依赖源域数据训练模型的问题,即当目标领域的标注数据稀缺或不可获取时,如何有效利用源域的丰富标注数据进行目标领域的分类任务。传统方法通常需要在源域数据上训练模型,这不仅耗时,还限制了模型在不同源域数据上的通用性。
解决方案的关键在于引入一个简单但高效的框架,该框架利用大型语言模型(Large Language Models, LLMs)的强大泛化能力为目标数据提供自动标注,而无需依赖源域模型的训练。此外,该框架设计了一种基于相似性的知识蒸馏损失函数,进一步提升了跨域文本分类任务中的性能。实验结果表明,与最先进的方法相比,该方法在跨域文本分类任务中实现了2.44%的准确率提升。
链接: https://arxiv.org/abs/2503.05281
作者: Jie He,Wendi Zhou,Xiang Lorraine Li,Jeff Z. Pan
机构: School of Informatics, University of Edinburgh (爱丁堡大学); University of Pittsburgh (匹兹堡大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Unsupervised domain adaptation leverages abundant labeled data from various source domains to generalize onto unlabeled target data. Prior research has primarily focused on learning domain-invariant features across the source and target domains. However, these methods often require training a model using source domain data, which is time-consuming and can limit model usage for applications with different source data. This paper introduces a simple framework that utilizes the impressive generalization capabilities of Large Language Models (LLMs) for target data annotation without the need of source model training, followed by a novel similarity-based knowledge distillation loss. Our extensive experiments on cross-domain text classification reveal that our framework achieves impressive performance, specifically, 2.44% accuracy improvement when compared to the SOTA method.
zh
[NLP-29] Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing
【速读】: 该论文试图解决自然语言处理(NLP)方法在在线内容审核任务中的决策机制不透明问题,以及为何某些内容会被审核的问题。论文的关键在于通过多角度探索内容审核背后的隐藏机制:一是训练分类器以逆向工程化的方式分析不同国家的内容审核决策;二是利用Shapley值和大型语言模型(LLM)引导的解释来解析这些决策。研究重点关注基于预存语料库的跨国内容审核决策,该语料库采样自Twitter流数据。通过实验揭示被审查帖子在不同国家和地区随时间变化的有趣模式,并通过人类评估三种LLM生成的解释,评估LLM在内容审核中的有效性。最终,论文讨论了未来的研究方向、工作局限性和伦理考量。
链接: https://arxiv.org/abs/2503.05280
作者: Neemesh Yadav,Jiarui Liu,Francesco Ortu,Roya Ensafi,Zhijing Jin,Rada Mihalcea
机构: IIIT Delhi (印度国际信息技术学院); CMU (卡内基梅隆大学); University of Trieste (的里雅斯特大学); University of Michigan (密歇根大学); MPI (马克斯·普朗克研究所); University of Toronto (多伦多大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The ability of Natural Language Processing (NLP) methods to categorize text into multiple classes has motivated their use in online content moderation tasks, such as hate speech and fake news detection. However, there is limited understanding of how or why these methods make such decisions, or why certain content is moderated in the first place. To investigate the hidden mechanisms behind content moderation, we explore multiple directions: 1) training classifiers to reverse-engineer content moderation decisions across countries; 2) explaining content moderation decisions by analyzing Shapley values and LLM-guided explanations. Our primary focus is on content moderation decisions made across countries, using pre-existing corpora sampled from the Twitter Stream Grab. Our experiments reveal interesting patterns in censored posts, both across countries and over time. Through human evaluations of LLM-generated explanations across three LLMs, we assess the effectiveness of using LLMs in content moderation. Finally, we discuss potential future directions, as well as the limitations and ethical considerations of this work. Our code and data are available at this https URL
zh
[NLP-30] ZOGRASCOPE: A New Benchmark for Property Graphs
【速读】: 该论文试图解决自然语言接口在属性图(Property Graph)上的语义解析挑战,特别是针对现有研究主要集中在RDF风格图谱上的不足,以及评估系统资源在属性图上的缺乏,尤其是简单查询占主导的现状。论文通过引入ZOGRASCOPE基准测试集,专门针对Cypher查询语言设计了一组包含复杂度各异的人工标注查询,填补了这一研究空白。解决方案的关键在于提供一个多样化的基准测试集以评估图谱语义解析性能,并通过实验验证了仅依靠提示大型语言模型(LLMs)无法有效解决图谱语义解析这一开放性难题。
链接: https://arxiv.org/abs/2503.05268
作者: Francesco Cazzaro,Justin Kleindienst,Sofia Marquez,Ariadna Quattoni
机构: Universitat Politècnica de Catalunya (巴塞罗那理工学院, Spain); dMetrics (dMetrics, Brooklyn, New York)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Natural language interfaces to knowledge graphs have become increasingly important in recent years, enabling easy and efficient access to structured data. In particular property graphs have seen growing adoption. However, these kind of graphs remain relatively underrepresented in research, which has focused in large part on RDF-style graphs. As a matter of fact there is a lack of resources for evaluating systems on property graphs, with many existing datasets featuring relatively simple queries. To address this gap, we introduce ZOGRASCOPE, a benchmark designed specifically for the cypher query language. The benchmark includes a diverse set of manually annotated queries of varying complexity. We complement this paper with a set of experiments that test the performance of out-of-the-box LLMs of different sizes. Our experiments show that semantic parsing over graphs is still a challenging open problem that can not be solved by prompting LLMs alone.
zh
[NLP-31] PhiloBERTA: A Transformer-Based Cross-Lingual Analysis of Greek and Latin Lexicons
【速读】: 该论文旨在解决跨语言哲学概念语义对齐的问题,通过构建PhiloBERTA模型,利用跨语言Transformer架构测量古希腊文和拉丁文词典之间的语义关系。解决方案的关键在于采用上下文化嵌入(contextual embeddings)和角度相似性度量(angular similarity metrics)分析经典文本中选定术语对的语义关联,揭示词源相关术语对表现出显著更高的相似性分数,并通过统计学方法验证这些关系模式的显著性(p = 0.012)。研究结果为量化分析希腊与拉丁哲学传统中概念迁移提供了框架,同时为古典语言学研究提供了新方法。
链接: https://arxiv.org/abs/2503.05265
作者: Rumi A. Allbert,Makai L. Allbert
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We present PhiloBERTA, a cross-lingual transformer model that measures semantic relationships between ancient Greek and Latin lexicons. Through analysis of selected term pairs from classical texts, we use contextual embeddings and angular similarity metrics to identify precise semantic alignments. Our results show that etymologically related pairs demonstrate significantly higher similarity scores, particularly for abstract philosophical concepts such as epistēmē (scientia) and dikaiosynē (iustitia). Statistical analysis reveals consistent patterns in these relationships (p = 0.012), with etymologically related pairs showing remarkably stable semantic preservation compared to control pairs. These findings establish a quantitative framework for examining how philosophical concepts moved between Greek and Latin traditions, offering new methods for classical philological research.
zh
[NLP-32] WritingBench: A Comprehensive Benchmark for Generative Writing
【速读】: 本文旨在解决大型语言模型(Large Language Models, LLMs)在生成式写作任务评估中的挑战。现有基准测试主要侧重于通用文本生成或有限的写作任务,无法充分涵盖不同领域高质量书面内容的多样化需求。为填补这一空白,论文提出了WritingBench,这是一个全面的基准测试集,用于评估LLMs在六个核心写作领域及其100个子领域的表现,包括创意、说服性、信息性和技术性写作。解决方案的关键在于提出了一种查询依赖的评估框架,使LLMs能够动态生成针对具体实例的评估标准,并通过微调的批评者模型实现风格、格式和长度的有意识评分。此外,该框架的数据整理能力进一步证明了其有效性,使得参数量为7B的模型接近最先进的性能水平。论文还开源了该基准测试及其评估工具和模块化框架组件,以推动LLMs在写作领域的进步。
链接: https://arxiv.org/abs/2503.05244
作者: Yuning Wu,Jiahao Mei,Ming Yan,Chenliang Li,SHaopeng Lai,Yuran Ren,Zijia Wang,Ji Zhang,Mengyue Wu,Qin Jin,Fei Huang
机构: Alibaba Group (阿里巴巴集团); Renmin University of China (中国人民大学); Shanghai Jiao Tong University (上海交通大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent advancements in large language models (LLMs) have significantly enhanced text generation capabilities, yet evaluating their performance in generative writing remains a challenge. Existing benchmarks primarily focus on generic text generation or limited in writing tasks, failing to capture the diverse requirements of high-quality written contents across various domains. To bridge this gap, we present WritingBench, a comprehensive benchmark designed to evaluate LLMs across 6 core writing domains and 100 subdomains, encompassing creative, persuasive, informative, and technical writing. We further propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations in style, format and length. The framework’s validity is further demonstrated by its data curation capability, which enables 7B-parameter models to approach state-of-the-art (SOTA) performance. We open-source the benchmark, along with evaluation tools and modular framework components, to advance the development of LLMs in writing.
zh
[NLP-33] MM-StoryAgent : Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text Image and Audio
【速读】: 该论文旨在解决生成式故事书在吸引力提升、叙事表达丰富性增强以及开源评估基准与框架开发方面的挑战。论文的关键解决方案是提出并开源MM-StoryAgent,它通过多模态的大语言模型(LLMs)和专家工具设计了一个多智能体框架,能够生成具有精炼情节、角色一致图像和多通道音频的沉浸式有声视频故事书。其核心在于采用多阶段写作管道提升故事吸引力,并通过整合音效与视觉、音乐及叙事资源来增强沉浸式叙事体验。此外,该平台提供了灵活的开源环境,支持模块替换以促进进一步开发。论文通过客观与主观评估验证了MM-StoryAgent系统在文本故事质量和模态对齐方面的有效性。
链接: https://arxiv.org/abs/2503.05242
作者: Xuenan Xu,Jiahao Mei,Chenliang Li,Yuning Wu,Ming Yan,Shaopeng Lai,Ji Zhang,Mengyue Wu
机构: MoE Key Lab of Artificial Intelligence X-LANCE Lab, Shanghai Jiao Tong University (教育部人工智能关键实验室X-LANCE实验室,上海交通大学); Alibaba Group (阿里巴巴集团); East China Normal University (华东师范大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The rapid advancement of large language models (LLMs) and artificial intelligence-generated content (AIGC) has accelerated AI-native applications, such as AI-based storybooks that automate engaging story production for children. However, challenges remain in improving story attractiveness, enriching storytelling expressiveness, and developing open-source evaluation benchmarks and frameworks. Therefore, we propose and opensource MM-StoryAgent, which creates immersive narrated video storybooks with refined plots, role-consistent images, and multi-channel audio. MM-StoryAgent designs a multi-agent framework that employs LLMs and diverse expert tools (generative models and APIs) across several modalities to produce expressive storytelling videos. The framework enhances story attractiveness through a multi-stage writing pipeline. In addition, it improves the immersive storytelling experience by integrating sound effects with visual, music and narrative assets. MM-StoryAgent offers a flexible, open-source platform for further development, where generative modules can be substituted. Both objective and subjective evaluation regarding textual story quality and alignment between modalities validate the effectiveness of our proposed MM-StoryAgent system. The demo and source code are available.
zh
[NLP-34] Personalized Text Generation with Contrastive Activation Steering
【速读】: 该论文旨在解决个性化文本生成中两个关键问题:(1) 历史文本中语义内容与风格模式的纠缠阻碍了对用户特定写作风格偏好的准确建模;(2) 检索增强生成(Retrieval-Augmented Generation, RAG)的推理延迟以及参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)的用户模型参数存储需求带来的可扩展性挑战。为克服这些限制,论文提出了一种名为StyleVector的无训练框架,通过在大型语言模型(LLM)激活空间中将个性化写作风格解耦并表示为向量,实现了推理阶段的风格引导生成,无需昂贵的检索或参数存储。这一方案的关键在于通过引入StyleVector实现风格表征的解耦与高效表示。
链接: https://arxiv.org/abs/2503.05213
作者: Jinghao Zhang,Yuting Liu,Wenjie Wang,Qiang Liu,Shu Wu,Liang Wang,Tat-Seng Chua
机构: NLPR (NLPR); Institute of Automation, Chinese Academy of Sciences (自动化研究所,中国科学院); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Northeastern University, China (东北大学,中国); University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Personalized text generation aims to infer users’ writing style preferences from their historical texts and generate outputs that faithfully reflect these stylistic characteristics. Existing solutions primarily adopt two paradigms: retrieval-augmented generation (RAG) and parameter-efficient fine-tuning (PEFT). While these approaches have advanced the field, they suffer from two critical limitations: (1) the entanglement of content semantics and stylistic patterns in historical texts impedes accurate modeling of user-specific writing preferences; and (2) scalability challenges arising from both RAG’s inference latency by retrieval operations and PEFT’s parameter storage requirements for per user model. To overcome these limitations, we propose StyleVector, a training-free framework that disentangles and represents personalized writing style as a vector in LLM’s activation space, enabling style-steered generation during inference without requiring costly retrieval or parameter storage. Comprehensive experiments demonstrate that our framework achieves a significant 8% relative improvement in personalized generation while reducing storage requirements by 1700 times over PEFT method.
zh
[NLP-35] Knowledge Updating? No More Model Editing! Just Selective Contextual Reasoning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在知识更新过程中面临的关键问题,即随着现实世界知识的演变,现有模型嵌入的信息可能变得过时、不足或错误。尽管已有方法通过最小化计算成本和参数调整来实现模型编辑以更新知识,但这些方法通常低估了参数修改对广泛分布的知识可能产生的负面影响,并且在多跳推理和连续知识更新方面表现不佳。论文指出,目前缺乏对现有模型编辑方法的全面评估,同时强调当前方法在可靠性、泛化能力、局部性和可移植性等多个维度上的显著不足。
解决方案的关键在于提出了一种名为选择性上下文推理(Selective Contextual Reasoning, SCR)的新方法。与传统的模型编辑方法不同,SCR 不直接修改模型参数,而是利用 LLM 的内在上下文推理能力结合更新后的知识片段进行推理。具体而言,SCR 首先判断输入查询是否属于外部知识库的范围;如果是,则将相关外部知识文本上下文化以增强推理能力;否则直接回答查询。通过这种方式,SCR 在不改变模型参数的情况下实现了有效的知识更新。实验结果表明,SCR 在两个反事实数据集上与十种主流模型编辑方法相比,展示了上下文推理在知识更新中的有效性和高效性。
链接: https://arxiv.org/abs/2503.05212
作者: Guoxiu He,Xin Song,Aixin Sun
机构: School of Economics and Management, East China Normal University (华东师范大学经济与管理学院); College of Computing and Data Science, Nanyang Technological University (南洋理工大学计算与数据科学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:As real-world knowledge evolves, the information embedded within large language models (LLMs) can become outdated, inadequate, or erroneous. Model editing has emerged as a prominent approach for updating LLMs’ knowledge with minimal computational costs and parameter changes. This approach typically identifies and adjusts specific model parameters associated with newly acquired knowledge. However, existing methods often underestimate the adverse effects that parameter modifications can have on broadly distributed knowledge. More critically, post-edit LLMs frequently struggle with multi-hop reasoning and continuous knowledge updates. Although various studies have discussed these shortcomings, there is a lack of comprehensive evaluation. In this paper, we provide an evaluation of ten model editing methods along four dimensions: reliability, generalization, locality, and portability. Results confirm that all ten popular model editing methods show significant shortcomings across multiple dimensions, suggesting model editing is less promising. We then propose a straightforward method called Selective Contextual Reasoning (SCR), for knowledge updating. SCR does not modify model parameters but harnesses LLM’s inherent contextual reasoning capabilities utilizing the updated knowledge pieces. Under SCR, an LLM first assesses whether an incoming query falls within the scope of an external knowledge base. If it does, the relevant external knowledge texts are contextualized to enhance reasoning; otherwise, the query is answered directly. We evaluate SCR against the ten model editing methods on two counterfactual datasets with three backbone LLMs. Empirical results confirm the effectiveness and efficiency of contextual reasoning for knowledge updating.
zh
[NLP-36] Path Pooling: Train-Free Structure Enhancement for Efficient Knowledge Graph Retrieval-Augmented Generation
【速读】: 该论文试图解决大型语言模型(Large Language Models)在实际应用中面临的幻觉(hallucinations)和知识不足(knowledge deficiencies)的问题。为提升其质量和可信度,许多基于知识图谱的检索增强生成方法(Knowledge Graph-based Retrieval-Augmented Generation, KG-RAG)利用知识图谱中的结构和语义信息作为外部知识库。然而,这些方法在有效整合结构信息方面存在挑战,要么计算成本高昂,要么未能充分利用可用知识。
论文的关键解决方案是提出路径池化(path pooling),这是一种简单且无需训练的策略,通过引入一种新颖的以路径为中心的池化操作来整合结构信息。路径池化能够无缝集成到现有的KG-RAG方法中,显著提升结构信息的利用率。实验结果表明,将路径池化融入最先进的KG-RAG方法中,在不同场景下均能持续提高性能,同时引入的额外开销可以忽略不计。
链接: https://arxiv.org/abs/2503.05203
作者: Hairu Wang,Yuan Feng,Xike Xie,S Kevin Zhou
机构: School of Computer Science, University of Science and Technology of China (中国科学技术大学), China; School of Biomedical Engineering, USTC (中国科学技术大学), China; Data Darkness Lab, MIRACLE Center, Suzhou Institute for Advanced Research, USTC (中国科学技术大学先进技术研究院), China
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Although Large Language Models achieve strong success in many tasks, they still suffer from hallucinations and knowledge deficiencies in real-world applications. Many knowledge graph-based retrieval-augmented generation (KG-RAG) methods enhance the quality and credibility of LLMs by leveraging structure and semantic information in KGs as external knowledge bases. However, these methods struggle to effectively incorporate structure information, either incurring high computational costs or underutilizing available knowledge. Inspired by smoothing operations in graph representation learning, we propose path pooling, a simple, train-free strategy that introduces structure information through a novel path-centric pooling operation. It seamlessly integrates into existing KG-RAG methods in a plug-and-play manner, enabling richer structure information utilization. Extensive experiments demonstrate that incorporating the path pooling into the state-of-the-art KG-RAG method consistently improves performance across various settings while introducing negligible additional cost. Code is coming soon at this https URL.
zh
[NLP-37] ORANSight-2.0: Foundational LLM s for O-RAN
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在开放无线电接入网络(Open Radio Access Network, O-RAN)领域应用受限的问题。目前,O-RAN领域的专用基础模型匮乏,现有解决方案多依赖于通用型LLMs,这些模型难以应对O-RAN特有的技术挑战与复杂性。为填补这一空白,论文提出ORANSight-2.0(O-RAN Insights),一个专注于开发针对O-RAN优化的专用基础LLMs的开创性项目。其关键在于RANSTRUCT框架,这是一种基于检索增强生成(Retrieval-Augmented Generation, RAG)的指令微调方法,通过两个LLM代理生成高质量的指令微调数据集,并利用QLoRA对18个预训练开源LLMs进行微调,从而显著降低对专有闭源模型的依赖,同时提升O-RAN任务的表现。此外,论文通过引入srsRANBench和ORANBench13K等评估基准,验证了ORANSight-2.0在性能、计算成本及能耗上的优越性。
链接: https://arxiv.org/abs/2503.05200
作者: Pranshav Gajjar,Vijay K. Shah
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Despite the transformative impact of Large Language Models (LLMs) across critical domains such as healthcare, customer service, and business marketing, their integration into Open Radio Access Networks (O-RAN) remains limited. This gap is primarily due to the absence of domain-specific foundational models, with existing solutions often relying on general-purpose LLMs that fail to address the unique challenges and technical intricacies of O-RAN. To bridge this gap, we introduce ORANSight-2.0 (O-RAN Insights), a pioneering initiative aimed at developing specialized foundational LLMs tailored for O-RAN. Built on 18 LLMs spanning five open-source LLM frameworks, ORANSight-2.0 fine-tunes models ranging from 1 to 70B parameters, significantly reducing reliance on proprietary, closed-source models while enhancing performance for O-RAN. At the core of ORANSight-2.0 is RANSTRUCT, a novel Retrieval-Augmented Generation (RAG) based instruction-tuning framework that employs two LLM agents to create high-quality instruction-tuning datasets. The generated dataset is then used to fine-tune the 18 pre-trained open-source LLMs via QLoRA. To evaluate ORANSight-2.0, we introduce srsRANBench, a novel benchmark designed for code generation and codebase understanding in the context of srsRAN, a widely used 5G O-RAN stack. We also leverage ORANBench13K, an existing benchmark for assessing O-RAN-specific knowledge. Our comprehensive evaluations demonstrate that ORANSight-2.0 models outperform general-purpose and closed-source models, such as ChatGPT-4o and Gemini, by 5.421% on ORANBench and 18.465% on srsRANBench, achieving superior performance while maintaining lower computational and energy costs. We also experiment with RAG-augmented variants of ORANSight-2.0 LLMs and thoroughly evaluate their energy characteristics, demonstrating costs for training, standard inference, and RAG-augmented inference.
zh
[NLP-38] Memory-augmented Query Reconstruction for LLM -based Knowledge Graph Reasoning
【速读】: 该论文旨在解决现有知识图谱问答(Knowledge Graph Question Answering, KGQA)方法中存在的工具利用与知识推理混淆问题,这导致模型输出可读性差,并引发幻觉式的工具调用,从而阻碍了 KGQA 的发展。为了解决这一问题,论文提出了一种名为“基于记忆增强查询重构的大语言模型知识图谱推理方法(Memory-augmented Query Reconstruction for LLM-based Knowledge Graph Reasoning, MemQ)”。其关键是通过引入由大语言模型构建的查询记忆模块,显式描述查询语句,从而将大语言模型从工具调用任务中解耦,并结合自然语言推理和记忆增强的查询重构来促进 KGQA 过程。此外,设计了一种有效的可读推理机制以提升大语言模型在 KGQA 中的推理能力。实验结果表明,MemQ 在常用的基准数据集 WebQSP 和 CWQ 上达到了最先进的性能。
链接: https://arxiv.org/abs/2503.05193
作者: Mufan Xu,Gewen Liang,Kehai Chen,Wei Wang,Xun Zhou,Muyun Yang,Tiejun Zhao,Min Zhang
机构: School of Computer Science and Technology, Harbin Institute of Technology, China (哈尔滨工业大学计算机科学与技术学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have achieved remarkable performance on knowledge graph question answering (KGQA) tasks by planning and interacting with knowledge graphs. However, existing methods often confuse tool utilization with knowledge reasoning, harming readability of model outputs and giving rise to hallucinatory tool invocations, which hinder the advancement of KGQA. To address this issue, we propose Memory-augmented Query Reconstruction for LLM-based Knowledge Graph Reasoning (MemQ) to decouple LLM from tool invocation tasks using LLM-built query memory. By establishing a memory module with explicit descriptions of query statements, the proposed MemQ facilitates the KGQA process with natural language reasoning and memory-augmented query reconstruction. Meanwhile, we design an effective and readable reasoning to enhance the LLM’s reasoning capability in KGQA. Experimental results that MemQ achieves state-of-the-art performance on widely used benchmarks WebQSP and CWQ.
zh
[NLP-39] Rewarding Curse: Analyze and Mitigate Reward Modeling Issues for LLM Reasoning
【速读】: 该论文旨在解决链式思维(Chain-of-thought, CoT)提示在不同推理任务中表现不一的问题,并深入分析影响其性能的关键模式。论文从有效性和忠实性两个角度研究CoT的表现:有效性方面,识别了影响性能提升的关键因素,包括问题难度、信息增益和信息流;忠实性方面,通过分析问题、CoT和答案之间的信息交互,揭示了CoT生成过程中可能出现的不忠实现象,即大语言模型(LLM)在预测答案时可能从问题中召回CoT中缺失的正确信息,从而导致问题的发生。为解决这一问题,论文提出了一种新颖的算法,关键在于从问题中召回额外信息以增强CoT生成,并基于信息增益评估CoT。大量实验表明,该方法同时提升了CoT的忠实性和有效性。
链接: https://arxiv.org/abs/2503.05188
作者: Jiachun Li,Pengfei Cao,Yubo Chen,Jiexin Xu,Huaijun Li,Xiaojian Jiang,Kang Liu,Jun Zhao
机构: School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所认知与决策智能实验室); China Merchants Bank (招商银行)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 21 figures
点击查看摘要
Abstract:Chain-of-thought (CoT) prompting demonstrates varying performance under different reasoning tasks. Previous work attempts to evaluate it but falls short in providing an in-depth analysis of patterns that influence the CoT. In this paper, we study the CoT performance from the perspective of effectiveness and faithfulness. For the former, we identify key factors that influence CoT effectiveness on performance improvement, including problem difficulty, information gain, and information flow. For the latter, we interpret the unfaithful CoT issue by conducting a joint analysis of the information interaction among the question, CoT, and answer. The result demonstrates that, when the LLM predicts answers, it can recall correct information missing in the CoT from the question, leading to the problem. Finally, we propose a novel algorithm to mitigate this issue, in which we recall extra information from the question to enhance the CoT generation and evaluate CoTs based on their information gain. Extensive experiments demonstrate that our approach enhances both the faithfulness and effectiveness of CoT.
zh
[NLP-40] Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
【速读】: 该论文旨在解决大型语言模型在利用Chain of Thought (CoT)提示进行推理时,中间输出往往过于冗长导致计算开销增加的问题。论文提出了一种名为Sketch-of-Thought (SoT)的新颖提示框架,结合认知启发的推理范式与语言学约束,在减少token使用的同时保持推理准确性。SoT的关键在于其灵活性,能够集成基于认知科学的任何自定义推理范式,并通过概念链(Conceptual Chaining)、分块符号化(Chunked Symbolism)和专家词典(Expert Lexicons)三种具体范式实例化,这些范式针对不同任务动态选择并通过轻量级路由模型实现。综合评估显示,SoT可将token使用量减少76%,且对准确性影响极小,在数学和多跳推理等特定领域甚至提升了准确性。
链接: https://arxiv.org/abs/2503.05179
作者: Simon A. Aytes,Jinheon Baek,Sung Ju Hwang
机构: KAIST (高丽大学); DeepAuto.ai (深智科技)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent advances in large language models have demonstrated remarkable reasoning capabilities through Chain of Thought (CoT) prompting, but often at the cost of excessive verbosity in their intermediate outputs, which increases computational overhead. We introduce Sketch-of-Thought (SoT), a novel prompting framework that combines cognitive-inspired reasoning paradigms with linguistic constraints to minimize token usage while preserving reasoning accuracy. SoT is designed as a flexible framework that can incorporate any custom reasoning paradigms based on cognitive science, and we instantiate it with three such paradigms - Conceptual Chaining, Chunked Symbolism, and Expert Lexicons - each tailored to different reasoning tasks and selected dynamically via a lightweight routing model. Through comprehensive evaluation across 15 reasoning datasets with multiple languages and multimodal scenarios, we demonstrate that SoT achieves token reductions of 76% with negligible accuracy impact. In certain domains like mathematical and multi-hop reasoning, it even improves accuracy while using significantly fewer tokens. Our code is publicly available: this https URL.
zh
[NLP-41] Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy
【速读】: 该论文试图解决语言模型在文本分类任务中因类别准确率失衡而导致的整体性能被掩盖的问题。论文指出,追求整体准确率不应依赖于强化优势类别,而应通过提升弱势类别来实现。为此,作者提出了一种基于非线性整数规划的后处理去偏方法,该方法结合类别加权校正与成员资格校正,在类别和样本层面灵活调整类别概率,从而直接优化大型语言模型 (LLMs) 的输出性能。解决方案的关键在于同时在类别和样本层面进行概率校正,并验证了样本层面的校正对于提升弱势类别的重要性,同时证明了该方法在不同规模的语言模型(如Llama-2-13B和Llama-2-70B)上的有效性,尤其是在生物医学领域的任务中表现出显著性能提升。
链接: https://arxiv.org/abs/2503.05157
作者: Ruixi Lin,Ziqiao Wang,Yang You
机构: National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Language models are strong few-shot learners and achieve good overall accuracy in text classification tasks, masking the fact that their results suffer from great class accuracy imbalance. We believe that the pursuit of overall accuracy should not come from enriching the strong classes, but from raising up the weak ones. To address the imbalance, we propose a post-hoc nonlinear integer programming based debiasing method that ensembles weight correction and membership correction to enable flexible rectifications of class probabilities at both class and sample levels, enhancing the performance of LLMs directly from their outputs. Evaluations with Llama-2-13B on seven text classification benchmarks show that our approach achieves state-of-the-art overall accuracy gains with balanced class accuracies. The resulted probability correction scheme demonstrates that sample-level corrections are necessary to elevate weak classes. In addition, due to effectively correcting weak classes, our method also brings significant performance gains to Llama-2-70B, especially on a biomedical domain task, demonstrating its effectiveness across both small and large model variants.
zh
[NLP-42] Interpersonal Memory Matters: A New Task for Proactive Dialogue Utilizing Conversational History
【速读】: 本文旨在解决现有主动对话系统仅关注预定义关键词或实体的问题,忽视了对话历史中隐含的用户属性与偏好,从而阻碍了长期用户亲密度的发展。为应对这些挑战,论文提出了一种将主动对话系统与长期记忆整合到统一框架中的激进方法。关键在于定义了一个名为“带有记忆意识的主动对话”(Memory-aware Proactive Dialogue, MapDia)的新任务,并通过分解此任务提出了自动数据构造方法,创建了首个中文带记忆意识的主动数据集(ChMapData)。此外,引入了一种基于检索增强生成(Retrieval Augmented Generation, RAG)的联合框架,包含主题摘要(Topic Summarization)、主题检索(Topic Retrieval)以及主动主题转换检测与生成(Proactive Topic-shifting Detection and Generation)三个模块,以在适当时候引导对话转向相关的历史话题。研究通过自动评估和人工评估验证了数据集和模型的有效性,并开源了框架和数据集。
链接: https://arxiv.org/abs/2503.05150
作者: Bowen Wu,Wenqing Wang,Haoran Li,Ying Li,Jingsong Yu,Baoxun Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Proactive dialogue systems aim to empower chatbots with the capability of leading conversations towards specific targets, thereby enhancing user engagement and service autonomy. Existing systems typically target pre-defined keywords or entities, neglecting user attributes and preferences implicit in dialogue history, hindering the development of long-term user intimacy. To address these challenges, we take a radical step towards building a more human-like conversational agent by integrating proactive dialogue systems with long-term memory into a unified framework. Specifically, we define a novel task named Memory-aware Proactive Dialogue (MapDia). By decomposing the task, we then propose an automatic data construction method and create the first Chinese Memory-aware Proactive Dataset (ChMapData). Furthermore, we introduce a joint framework based on Retrieval Augmented Generation (RAG), featuring three modules: Topic Summarization, Topic Retrieval, and Proactive Topic-shifting Detection and Generation, designed to steer dialogues towards relevant historical topics at the right time. The effectiveness of our dataset and models is validated through both automatic and human evaluations. We release the open-source framework and dataset at this https URL.
zh
[NLP-43] RocketEval: Efficient Automated LLM Evaluation via Grading Checklist ICLR2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多样化和具有挑战性的场景中评估所面临的高成本、隐私与安全担忧以及不可复现性等问题。论文提出了一种名为RocketEval的自动化评估方法,其关键是利用轻量级LLM作为评委,并通过引入基于检查清单(checklist)的多方面问答任务重构评估过程,以减轻轻量级LLM在判断准确性上的不确定性与位置偏差问题。此外,通过检查清单项的重新加权以匹配监督注释,该方法实现了高效且可复现的自动化评估,在MT-Bench和WildBench数据集上的实验表明,RocketEval不仅显著降低了评估成本,还实现了与人类偏好高度一致的评估性能(相关系数达0.965)。
链接: https://arxiv.org/abs/2503.05142
作者: Tianjun Wei,Wei Wen,Ruizhi Qiao,Xing Sun,Jianghong Ma
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by ICLR 2025: this https URL
点击查看摘要
Abstract:Evaluating large language models (LLMs) in diverse and challenging scenarios is essential to align them with human preferences. To mitigate the prohibitive costs associated with human evaluations, utilizing a powerful LLM as a judge has emerged as a favored approach. Nevertheless, this methodology encounters several challenges, including substantial expenses, concerns regarding privacy and security, and reproducibility. In this paper, we propose a straightforward, replicable, and accurate automated evaluation method by leveraging a lightweight LLM as the judge, named RocketEval. Initially, we identify that the performance disparity between lightweight and powerful LLMs in evaluation tasks primarily stems from their ability to conduct comprehensive analyses, which is not easily enhanced through techniques such as chain-of-thought reasoning. By reframing the evaluation task as a multi-faceted QA using an instance-specific checklist, we demonstrate that the limited judgment accuracy of lightweight LLMs is largely attributes to high uncertainty and positional bias. To address these challenges, we introduce an automated evaluation process grounded in checklist grading, which is designed to accommodate a variety of scenarios and questions. This process encompasses the creation of checklists, the grading of these checklists by lightweight LLMs, and the reweighting of checklist items to align with the supervised annotations. Our experiments carried out on the automated evaluation benchmarks, MT-Bench and WildBench datasets, reveal that RocketEval, when using Gemma-2-2B as the judge, achieves a high correlation (0.965) with human preferences, which is comparable to GPT-4o. Moreover, RocketEval provides a cost reduction exceeding 50-fold for large-scale evaluation and comparison scenarios. Our code is available at this https URL .
zh
[NLP-44] Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs
【速读】: 本文旨在解决大规模混合专家模型(Mixture of Experts, MoE)训练中的成本效率低下及资源限制问题。为应对这些挑战,论文提出了两种不同规模的MoE大型语言模型(LLMs),即Ling-Lite和Ling-Plus,并通过创新方法优化模型架构与训练流程、改进训练异常处理以及提升模型评估效率,从而显著降低训练成本并提高资源利用率。此外,利用知识图谱生成的高质量数据进一步增强了模型工具使用的优越性。关键在于通过优化训练策略和硬件选择,在性能相当的前提下,实现300亿参数规模的MoE LLM在低性能设备上的高效预训练,相比高性能设备可节省约20%的计算成本。
链接: https://arxiv.org/abs/2503.05139
作者: Ling Team,Binwei Zeng,Chao Huang,Chao Zhang,Changxin Tian,Cong Chen,Dingnan Jin,Feng Yu,Feng Zhu,Feng Yuan,Fakang Wang,Gangshan Wang,Guangyao Zhai,Haitao Zhang,Huizhong Li,Jun Zhou,Jia Liu,Junpeng Fang,Junjie Ou,Jun Hu,Ji Luo,Ji Zhang,Jian Liu,Jian Sha,Jianxue Qian,Jiewei Wu,Junping Zhao,Jianguo Li,Jubao Feng,Jingchao Di,Junming Xu,Jinghua Yao,Kuan Xu,Kewei Du,Longfei Li,Lei Liang,Lu Yu,Li Tang,Lin Ju,Peng Xu,Qing Cui,Song Liu,Shicheng Li,Shun Song,Song Yan,Tengwei Cai,Tianyi Chen,Ting Guo,Ting Huang,Tao Feng,Tao Wu,Wei Wu,Xiaolu Zhang,Xueming Yang,Xin Zhao,Xiaobo Hu,Xin Lin,Yao Zhao,Yilong Wang,Yongzhen Guo,Yuanyuan Wang,Yue Yang,Yang Cao,Yuhao Fu,Yi Xiong,Yanzhe Li,Zhe Li,Zhiqiang Zhang,Ziqi Liu,Zhaoxin Huan,Zujie Wen,Zhenhang Sun,Zhuoxuan Du,Zhengyu He
机构: AI@Ant Group (蚂蚁集团)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 34 pages
点击查看摘要
Abstract:In this technical report, we tackle the challenges of training large-scale Mixture of Experts (MoE) models, focusing on overcoming cost inefficiency and resource limitations prevalent in such systems. To address these issues, we present two differently sized MoE large language models (LLMs), namely Ling-Lite and Ling-Plus (referred to as “Bailing” in Chinese, spelled Bǎilíng in Pinyin). Ling-Lite contains 16.8 billion parameters with 2.75 billion activated parameters, while Ling-Plus boasts 290 billion parameters with 28.8 billion activated parameters. Both models exhibit comparable performance to leading industry benchmarks. This report offers actionable insights to improve the efficiency and accessibility of AI development in resource-constrained settings, promoting more scalable and sustainable technologies. Specifically, to reduce training costs for large-scale MoE models, we propose innovative methods for (1) optimization of model architecture and training processes, (2) refinement of training anomaly handling, and (3) enhancement of model evaluation efficiency. Additionally, leveraging high-quality data generated from knowledge graphs, our models demonstrate superior capabilities in tool use compared to other models. Ultimately, our experimental findings demonstrate that a 300B MoE LLM can be effectively trained on lower-performance devices while achieving comparable performance to models of a similar scale, including dense and MoE models. Compared to high-performance devices, utilizing a lower-specification hardware system during the pre-training phase demonstrates significant cost savings, reducing computing costs by approximately 20%. The models can be accessed at this https URL.
zh
[NLP-45] AutoTestForge: A Multidimensional Automated Testing Framework for Natural Language Processing Models
【速读】: 该论文旨在解决现有自然语言处理(NLP)模型评估方法因人工劳动需求高以及能力评估范围有限而导致的局限性。论文的关键解决方案是引入AutoTestForge,这是一种自动化且多维度的NLP模型测试框架。其核心在于利用大规模语言模型(LLMs)自动生成测试模板并实例化,大幅减少人工参与;同时通过基于差分测试的测试用例标签验证机制和多模型投票系统确保测试质量。此外,框架从分类学、公平性和鲁棒性三个维度扩展测试套件,提供全面的能力评估。实验结果表明,AutoTestForge在情感分析(SA)和语义文本相似性(STS)任务中的错误检测率分别比现有数据集和工具高出平均30.89%和34.58%,并且不同生成策略表现出稳定的有效性。
链接: https://arxiv.org/abs/2503.05102
作者: Hengrui Xing,Cong Tian,Liang Zhao,Zhi Ma,WenSheng Wang,Nan Zhang,Chao Huang,Zhenhua Duan
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 15 pages, 4 figures, Under review
点击查看摘要
Abstract:In recent years, the application of behavioral testing in Natural Language Processing (NLP) model evaluation has experienced a remarkable and substantial growth. However, the existing methods continue to be restricted by the requirements for manual labor and the limited scope of capability assessment. To address these limitations, we introduce AutoTestForge, an automated and multidimensional testing framework for NLP models in this paper. Within AutoTestForge, through the utilization of Large Language Models (LLMs) to automatically generate test templates and instantiate them, manual involvement is significantly reduced. Additionally, a mechanism for the validation of test case labels based on differential testing is implemented which makes use of a multi-model voting system to guarantee the quality of test cases. The framework also extends the test suite across three dimensions, taxonomy, fairness, and robustness, offering a comprehensive evaluation of the capabilities of NLP models. This expansion enables a more in-depth and thorough assessment of the models, providing valuable insights into their strengths and weaknesses. A comprehensive evaluation across sentiment analysis (SA) and semantic textual similarity (STS) tasks demonstrates that AutoTestForge consistently outperforms existing datasets and testing tools, achieving higher error detection rates (an average of 30.89% for SA and 34.58% for STS). Moreover, different generation strategies exhibit stable effectiveness, with error detection rates ranging from 29.03% - 36.82% .
zh
[NLP-46] SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)服务在动态请求模式下难以实现低推理延迟并满足服务级别目标(Service Level Objective, SLO)的问题。现有基于推测解码(Speculative Decoding)的加速方法常因无法适应多变的工作负载和系统环境而导致性能波动及SLO违约。论文的关键解决方案是提出SpecServe系统,它通过实时调整推测策略以适配请求负载和系统配置,同时引入理论模型预测推测解码在不同场景下的效率,并采用智能起草与验证算法确保最优性能,从而显著提升LLM推理的稳定性和效率,在真实数据集上的实验表明其性能提升了1.14倍至14.3倍,且始终满足SLO要求。
链接: https://arxiv.org/abs/2503.05096
作者: Kaiyu Huang,Hao Wu,Zhubo Shi,Han Zou,Minchen Yu,Qingjiang Shi
机构: Tongji University (同济大学); Shenzhen Research Institute of Big Data (深圳大数据研究院); Huazhong University of Science and Technology (华中科技大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学,深圳)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Model (LLM) services often face challenges in achieving low inference latency and meeting Service Level Objectives (SLOs) under dynamic request patterns. Speculative decoding, which exploits lightweight models for drafting and LLMs for verification, has emerged as a compelling technique to accelerate LLM inference. However, existing speculative decoding solutions often fail to adapt to varying workloads and system environments, resulting in performance variability and SLO violations. In this paper, we introduce SpecServe, an efficient LLM inference system that dynamically adjusts speculative strategies according to real-time request loads and system configurations. SpecServe proposes a theoretical model to understand and predict the efficiency of speculative decoding across diverse scenarios. Additionally, it implements intelligent drafting and verification algorithms to guarantee optimal performance while achieving high SLO attainment. Experimental results on real-world LLM traces demonstrate that SpecServe consistently meets SLOs and achieves substantial performance improvements, yielding 1.14 \times -14.3 \times speedups over state-of-the-art speculative inference systems.
zh
[NLP-47] S2S-Arena Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information
【速读】: 该论文试图解决现有语音模型在指令跟随能力评估中存在的两个主要问题:一是现有的基准测试依赖于基于文本的自动评估器,缺乏对语音理解和生成中副语言信息(paralinguistic information)的考虑;二是当前语音到语音(Speech2Speech, S2S)协议中,联合训练的模型在处理副语言信息时的表现不如级联架构的模型。论文的关键解决方案是提出了一种名为S2S-Arena的新颖评估基准,它通过融合文本转语音(Text-to-Speech, TTS)和真实录音,在实际任务场景下以竞技场风格(arena-style)评估模型的指令跟随能力,并同时考虑语音输入和输出中的副语言信息。这种设计不仅涵盖了多种领域和任务,还通过人工评估揭示了级联架构模型在特定条件下的优越性以及副语言信息处理的关键依赖点。
链接: https://arxiv.org/abs/2503.05085
作者: Feng Jiang,Zhiyu Lin,Fan Bu,Yuhao Du,Benyou Wang,Haizhou Li
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Anthropic (Anthropic)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:The rapid development of large language models (LLMs) has brought significant attention to speech models, particularly recent progress in speech2speech protocols supporting speech input and output. However, the existing benchmarks adopt automatic text-based evaluators for evaluating the instruction following ability of these models lack consideration for paralinguistic information in both speech understanding and generation. To address these issues, we introduce S2S-Arena, a novel arena-style S2S benchmark that evaluates instruction-following capabilities with paralinguistic information in both speech-in and speech-out across real-world tasks. We design 154 samples that fused TTS and live recordings in four domains with 21 tasks and manually evaluate existing popular speech models in an arena-style manner. The experimental results show that: (1) in addition to the superior performance of GPT-4o, the speech model of cascaded ASR, LLM, and TTS outperforms the jointly trained model after text-speech alignment in speech2speech protocols; (2) considering paralinguistic information, the knowledgeability of the speech model mainly depends on the LLM backbone, and the multilingual support of that is limited by the speech module; (3) excellent speech models can already understand the paralinguistic information in speech input, but generating appropriate audio with paralinguistic information is still a challenge.
zh
[NLP-48] Capacity-Aware Inference: Mitigating the Strag gler Effect in Mixture of Experts
【速读】: 该论文旨在解决混合专家模型(Mixture of Experts, MoE)在专家并行化推理过程中因令牌到专家分配不均导致的资源利用率低下和延迟增加问题,即“Straggler Effect”。论文的关键解决方案包括两个技术:(1)容量感知令牌丢弃(Capacity-Aware Token Drop),通过丢弃过载令牌来控制MoE的最大延迟;(2)容量感知令牌重路由(Capacity-Aware Token Reroute),将溢出令牌重新分配至未充分利用的专家以平衡令牌分布。这些方法共同优化了高负载和低负载专家的使用效率,显著提升了MoE推理的效率与速度。
链接: https://arxiv.org/abs/2503.05066
作者: Shwai He,Weilin Cai,Jiayi Huang,Ang Li
机构: University of Maryland, College Park (马里兰大学帕克分校); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation, optimizing the trade-off between performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where some experts are overloaded while others remain underutilized. This imbalance leads to poor resource utilization and increased latency, as the most burdened expert dictates the overall delay, a phenomenon we define as the \textbf\textitStraggler Effect. To mitigate this, we propose Capacity-Aware Inference, including two key techniques: (1) \textbf\textitCapacity-Aware Token Drop, which discards overloaded tokens to regulate the maximum latency of MoE, and (2) \textbf\textitCapacity-Aware Token Reroute, which reallocates overflowed tokens to underutilized experts, balancing the token distribution. These techniques collectively optimize both high-load and low-load expert utilization, leading to a more efficient MoE inference pipeline. Extensive experiments demonstrate the effectiveness of our methods, showing significant improvements in inference efficiency, e.g., 0.2% average performance increase and a 1.94 \times inference speedup on Mixtral-8 \times 7B-Instruct.
zh
[NLP-49] he study of short texts in digital politics: Document aggregation for topic modeling
【速读】: 该论文试图解决“文档长度如何影响主题模型(Topic Modeling)结果的可解释性”这一问题。论文的关键在于通过将短文档按自然单元聚合为较大文档的方式,研究不同文档定义对主题建模结果的影响。研究发现,以账户级别聚合的文档比单条推文更能与个体州相关联,并在维基百科页面的实验中重现了这一结果,从而揭示了文档定义对主题模型结果的重要性。
链接: https://arxiv.org/abs/2503.05065
作者: Nitheesha Nakka,Omer F. Yalcin,Bruce A. Desmarais,Sarah Rajtmajer,Burt Monroe
机构: Pennsylvania State University (宾夕法尼亚州立大学); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Statistical topic modeling is widely used in political science to study text. Researchers examine documents of varying lengths, from tweets to speeches. There is ongoing debate on how document length affects the interpretability of topic models. We investigate the effects of aggregating short documents into larger ones based on natural units that partition the corpus. In our study, we analyze one million tweets by U.S. state legislators from April 2016 to September 2020. We find that for documents aggregated at the account level, topics are more associated with individual states than when using individual tweets. This finding is replicated with Wikipedia pages aggregated by birth cities, showing how document definitions can impact topic modeling results.
zh
[NLP-50] No Free Labels: Limitations of LLM -as-a-Judge Without Human Grounding
【速读】: 本文旨在解决大型语言模型(LLMs)作为评估者(LLM-as-a-Judge框架)在评价自然语言文本质量时存在的偏差问题,特别是其在判断给定对话问题的回答是否正确方面的有效性。尽管LLM评估者因其低成本、易用性以及与人类风格偏好之间的强相关性而展现出巨大潜力,但研究发现它们可能因偏差而扭曲判断结果。为评估LLM评估者的这种能力,作者构建并公开发布了一个包含1200个LLM响应正确性标签的人类注释数据集,并使用来自现有数据集及新创建的挑战性基准(BFF-Bench)的问题。研究表明,LLM正确回答问题的能力与其对该问题响应的评分之间存在紧密联系。然而,在整体统计层面看似与人工标注者高度一致的情况下,当面对LLM无法作答的问题子集时,其表现会显著下降。为解决这一问题,论文提出的关键解决方案是向评估者提供正确的人工撰写参考答案。进一步深入分析表明,为较弱的LLM评估者(如Qwen 2.5 7B)提供高质量参考答案,比为强大的评估者(如GPT-4o)提供合成参考答案更能达成与人工标注者的良好一致性。
链接: https://arxiv.org/abs/2503.05061
作者: Michael Krumdick,Charles Lovering,Varshini Reddy,Seth Ebner,Chris Tanner
机构: Kensho Technologies (肯肖技术公司); Massachusetts Institute of Technology (麻省理工学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:LLM-as-a-Judge is a framework that uses an LLM (large language model) to evaluate the quality of natural language text - typically text that is also generated by an LLM. This framework holds great promise due to its relative low-cost, ease of use, and strong correlations with human stylistic preferences. However, LLM Judges have been shown to exhibit biases that can distort their judgments. We evaluate how well LLM Judges can grade whether a given response to a conversational question is correct, an ability crucial to soundly estimating the overall response quality. To do so, we create and publicly release a human-annotated dataset with labels of correctness for 1,200 LLM responses. We source questions from a combination of existing datasets and a novel, challenging benchmark (BFF-Bench) created for this analysis. We demonstrate a strong connection between an LLM’s ability to correctly answer a question and grade responses to that question. Although aggregate level statistics might imply a judge has high agreement with human annotators, it will struggle on the subset of questions it could not answer. To address this issue, we recommend a simple solution: provide the judge with a correct, human-written reference answer. We perform an in-depth analysis on how reference quality can affect the performance of an LLM Judge. We show that providing a weaker judge (e.g. Qwen 2.5 7B) with higher quality references reaches better agreement with human annotators than a stronger judge (e.g. GPT-4o) with synthetic references.
zh
[NLP-51] ModernBERT is More Efficient than Conventional BERT for Chest CT Findings Classification in Japanese Radiology Reports
【速读】: 该论文旨在评估并比较两种日语语言模型——传统的双向编码器表示从变压器(BERT)和较新的ModernBERT,在分类胸部CT报告中的表现,重点关注分词效率、处理时间和分类性能。论文的关键解决方案在于采用ModernBERT模型,它在保持分类性能的同时显著提升了分词效率和训练速度,具体表现为每文档所需令牌数减少24.0%,训练时间缩短39%,并且在训练和推理阶段的处理速度分别提高了1.65倍和1.66倍。
链接: https://arxiv.org/abs/2503.05060
作者: Yosuke Yamagishi,Tomohiro Kikuchi,Shouhei Hanaoka,Takeharu Yoshikawa,Osamu Abe
机构: 未知
类目: Computation and Language (cs.CL)
备注: 23 pages, 8 figures
点击查看摘要
Abstract:Objective: This study aims to evaluate and compare the performance of two Japanese language models-conventional Bidirectional Encoder Representations from Transformers (BERT) and the newer ModernBERT-in classifying findings from chest CT reports, with a focus on tokenization efficiency, processing time, and classification performance. Methods: We conducted a retrospective study using the CT-RATE-JPN dataset containing 22,778 training reports and 150 test reports. Both models were fine-tuned for multi-label classification of 18 common chest CT conditions. The training data was split in 18,222:4,556 for training and validation. Performance was evaluated using F1 scores for each condition and exact match accuracy across all 18 labels. Results: ModernBERT demonstrated superior tokenization efficiency, requiring 24.0% fewer tokens per document (258.1 vs. 339.6) compared to BERT Base. This translated to significant performance improvements, with ModernBERT completing training in 1877.67 seconds versus BERT’s 3090.54 seconds (39% reduction). ModernBERT processed 38.82 samples per second during training (1.65x faster) and 139.90 samples per second during inference (1.66x faster). Despite these efficiency gains, classification performance remained comparable, with ModernBERT achieving superior F1 scores in 8 conditions, while BERT performed better in 4 conditions. Overall exact match accuracy was slightly higher for ModernBERT (74.67% vs. 72.67%), though this difference was not statistically significant (p=0.6291). Conclusion: ModernBERT offers substantial improvements in tokenization efficiency and training speed without sacrificing classification performance. These results suggest that ModernBERT is a promising candidate for clinical applications in Japanese radiology reports analysis.
zh
[NLP-52] A Unified Framework with Novel Metrics for Evaluating the Effectiveness of XAI Techniques in LLM s
【速读】: 该论文试图解决大型语言模型(LLMs)因复杂性增加而导致的透明性和可解释性挑战问题,提出通过引入可解释的人工智能(XAI)技术来提升其可信度和可用性。论文的关键解决方案在于构建了一个包含四个创新指标的综合评估框架,用于评估五种XAI技术在五种LLMs及两个下游任务中的有效性。这一框架通过Human-reasoning Agreement (HA)、Robustness、Consistency和Contrastivity等指标,系统地分析了LIME、SHAP、Integrated Gradients、Layer-wise Relevance Propagation (LRP) 和Attention Mechanism Visualization (AMV) 等技术的表现,揭示了不同XAI方法的优势与局限性,为LLMs的XAI技术开发与选择提供了重要指导。
链接: https://arxiv.org/abs/2503.05050
作者: Melkamu Abay Mersha,Mesay Gemeda Yigezu,Hassan shakil,Ali Al shami,Sanghyun Byun,Jugal Kalita
机构: 未知
类目: Computation and Language (cs.CL)
备注: arXiv admin note: substantial text overlap with arXiv:2501.15374
点击查看摘要
Abstract:The increasing complexity of LLMs presents significant challenges to their transparency and interpretability, necessitating the use of eXplainable AI (XAI) techniques to enhance trustworthiness and usability. This study introduces a comprehensive evaluation framework with four novel metrics for assessing the effectiveness of five XAI techniques across five LLMs and two downstream tasks. We apply this framework to evaluate several XAI techniques LIME, SHAP, Integrated Gradients, Layer-wise Relevance Propagation (LRP), and Attention Mechanism Visualization (AMV) using the IMDB Movie Reviews and Tweet Sentiment Extraction datasets. The evaluation focuses on four key metrics: Human-reasoning Agreement (HA), Robustness, Consistency, and Contrastivity. Our results show that LIME consistently achieves high scores across multiple LLMs and evaluation metrics, while AMV demonstrates superior Robustness and near-perfect Consistency. LRP excels in Contrastivity, particularly with more complex models. Our findings provide valuable insights into the strengths and limitations of different XAI methods, offering guidance for developing and selecting appropriate XAI techniques for LLMs.
zh
[NLP-53] Dynamic-KGQA: A Scalable Framework for Generating Adaptive Question Answering Datasets
【速读】: 该论文旨在解决传统静态问答(QA)基准在评估大型语言模型(LLMs)时容易受到数据污染和过度拟合的影响,从而可能导致对模型泛化能力的高估以及无法可靠反映实际应用性能的问题。为了解决这一挑战,论文提出的关键方案是Dynamic-KGQA框架,它通过从知识图谱(Knowledge Graphs, KGs)生成自适应的问答数据集,能够在每次运行时动态生成新的数据变体,同时保持底层分布的一致性,从而有效降低记忆效应风险并实现公平可重复的评估。此外,该框架还提供了对数据特性的精细控制,支持领域特定和主题聚焦的数据集生成,并生成紧凑且语义一致的子图,以增强知识图谱问答(KGQA)模型的有效性。
链接: https://arxiv.org/abs/2503.05049
作者: Preetam Prabhu Srikar Dammu,Himanshu Naidu,Chirag Shah
机构: University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:As question answering (QA) systems advance alongside the rapid evolution of foundation models, the need for robust, adaptable, and large-scale evaluation benchmarks becomes increasingly critical. Traditional QA benchmarks are often static and publicly available, making them susceptible to data contamination and memorization by large language models (LLMs). Consequently, static benchmarks may overestimate model generalization and hinder a reliable assessment of real-world performance. In this work, we introduce Dynamic-KGQA, a scalable framework for generating adaptive QA datasets from knowledge graphs (KGs), designed to mitigate memorization risks while maintaining statistical consistency across iterations. Unlike fixed benchmarks, Dynamic-KGQA generates a new dataset variant on every run while preserving the underlying distribution, enabling fair and reproducible evaluations. Furthermore, our framework provides fine-grained control over dataset characteristics, supporting domain-specific and topic-focused QA dataset generation. Additionally, Dynamic-KGQA produces compact, semantically coherent subgraphs that facilitate both training and evaluation of KGQA models, enhancing their ability to leverage structured knowledge effectively. To align with existing evaluation protocols, we also provide static large-scale train/test/validation splits, ensuring comparability with prior methods. By introducing a dynamic, customizable benchmarking paradigm, Dynamic-KGQA enables a more rigorous and adaptable evaluation of QA systems.
zh
[NLP-54] Biases in Large Language Model-Elicited Text: A Case Study in Natural Language Inference
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)生成的自然语言处理(NLP)数据集是否包含注释 artifacts 和社会偏见的问题。论文的关键解决方案是通过使用 GPT-4、Llama-2 70b for Chat 和 Mistral 7b Instruct 重新创建部分 Stanford Natural Language Inference (SNLI) 数据集,并训练假设-only 分类器来检测注释 artifacts,同时利用点互信息(pointwise mutual information)识别与性别、种族和年龄相关的词汇,以此量化潜在的社会偏见。研究发现,微调后的 BERT 假设-only 分类器在这些生成的数据集上达到了 86%-96% 的准确率,从而揭示了注释 artifacts 和刻板印象偏见的存在。
链接: https://arxiv.org/abs/2503.05047
作者: Grace Proebsting,Adam Poliak
机构: 未知
类目: Computation and Language (cs.CL)
备注: arXiv admin note: substantial text overlap with arXiv:2410.08996
点击查看摘要
Abstract:We test whether NLP datasets created with Large Language Models (LLMs) contain annotation artifacts and social biases like NLP datasets elicited from crowd-source workers. We recreate a portion of the Stanford Natural Language Inference corpus using GPT-4, Llama-2 70b for Chat, and Mistral 7b Instruct. We train hypothesis-only classifiers to determine whether LLM-elicited NLI datasets contain annotation artifacts. Next, we use pointwise mutual information to identify the words in each dataset that are associated with gender, race, and age-related terms. On our LLM-generated NLI datasets, fine-tuned BERT hypothesis-only classifiers achieve between 86-96% accuracy. Our analyses further characterize the annotation artifacts and stereotypical biases in LLM-generated datasets.
zh
[NLP-55] Provably Correct Automata Embeddings for Optimal Automata-Conditioned Reinforcement Learning
【速读】: 该论文旨在解决自动机条件化强化学习(Automata-conditioned Reinforcement Learning, ACRL)中缺乏理论保证的问题。现有方法通过在训练下游策略之前预训练并冻结自动机嵌入(automata embeddings),能够在运行时实现具备长时间延展目标的多任务策略学习,但未提供理论上的正确性保障。论文的关键贡献在于提出了一种理论框架,证明了ACRL问题是“大概率近似可学习”(probably approximately correct learnable)的,并进一步设计了一种技术方案,用于学习具有理论保证的自动机嵌入,从而确保多任务策略学习的最优性。实验评估验证了这些理论结果。
链接: https://arxiv.org/abs/2503.05042
作者: Beyazit Yalcinkaya,Niklas Lauffer,Marcell Vazquez-Chanlatte,Sanjit A. Seshia
机构: University of California(Berkeley); University of California(Berkeley); Nissan Advanced Technology Center Silicon Valley; University of California(Berkeley)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
备注:
点击查看摘要
Abstract:Automata-conditioned reinforcement learning (RL) has given promising results for learning multi-task policies capable of performing temporally extended objectives given at runtime, done by pretraining and freezing automata embeddings prior to training the downstream policy. However, no theoretical guarantees were given. This work provides a theoretical framework for the automata-conditioned RL problem and shows that it is probably approximately correct learnable. We then present a technique for learning provably correct automata embeddings, guaranteeing optimal multi-task policy learning. Our experimental evaluation confirms these theoretical results.
zh
[NLP-56] Collapse of Dense Retrievers: Short Early and Literal Biases Outranking Factual Evidence
【速读】: 该论文试图解决信息检索(Information Retrieval, IR)中密集检索模型(Dense Retrieval Models)的鲁棒性问题,特别是 Retrieval-Augmented Generation (RAG) 系统中检索模块的潜在缺陷。论文关注的问题是现有检索器容易受到启发式偏差(heuristic biases)的影响,例如倾向于选择较短文档或过度依赖文档开头等浅层模式,从而缺乏对文档深层语义的理解。这种行为可能导致模型在下游任务中表现显著下降,特别是在多个偏差叠加时,正确选择包含答案文档的概率不足 3%。
解决方案的关键在于通过重新利用关系抽取数据集(如 Re-Do-cRED),设计了一系列受控实验来量化这些启发式偏差对检索性能的具体影响,并揭示检索器在处理复杂查询时的脆弱性。研究强调,为了提升检索模型的鲁棒性,需要改进其对文档内容的深层语义理解和判断能力,以避免误导后续的生成式 AI (Generative AI) 模型,例如在 RAG 系统中导致性能下降达 34% 的问题。
链接: https://arxiv.org/abs/2503.05037
作者: Mohsen Fayyaz,Ali Modarressi,Hinrich Schuetze,Nanyun Peng
机构: University of California, Los Angeles (加州大学洛杉矶分校); CIS, LMU Munich (慕尼黑大学计算机与信息科学学院); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid failures. In this work, by repurposing a relation extraction dataset (e.g. Re-DocRED), we design controlled experiments to quantify the impact of heuristic biases, such as favoring shorter documents, in retrievers like Dragon+ and Contriever. Our findings reveal significant vulnerabilities: retrievers often rely on superficial patterns like over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches. Additionally, they tend to overlook whether the document contains the query’s answer, lacking deep semantic understanding. Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document in less than 3% of cases over a biased document without the answer. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34% performance drop than not providing any documents at all.
zh
[NLP-57] Continual Pre-training of MoEs: How robust is your router?
【速读】: 该论文试图解决的问题是如何在不完全重新训练的情况下,通过持续预训练(Continual Pre-Training, CPT)扩展稀疏激活混合专家(Mixture of Experts, MoE)转换器的能力,特别是在解码器专用的 MoE 大型语言模型(LLMs)中。论文关注的核心问题是 MoE 的路由算法在持续预训练过程中是否会导致对先前分布的负载不平衡,并且是否加剧了灾难性遗忘(forgetting),同时评估已有的针对密集模型的策略是否适用于 MoE LLMs。
解决方案的关键在于通过大规模实验(使用 20 亿参数的 Switch 和 DeepSeek MoE LLMs,在 6 千亿个 token 上进行训练)验证两种路由算法(Sinkhorn-Balanced 和 Z-and-Aux-loss-balanced)对分布偏移的鲁棒性,即使在没有经验回放(replay)的情况下也能保持稳定性。此外,研究还表明,MoE LLMs 在持续预训练期间维持了与浮点运算匹配的密集模型相当的样本效率,并以较低的成本实现了接近完全重新训练的性能。
链接: https://arxiv.org/abs/2503.05029
作者: Benjamin Thérien,Charles-Étienne Joseph,Zain Sarwar,Ashwinee Panda,Anirban Das,Shi-Xiong Zhang,Stephen Rawls,Sambit Sahu,Eugene Belilovsky,Irina Rish
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Sparsely-activated Mixture of Experts (MoE) transformers are promising architectures for foundation models. Compared to dense transformers that require the same amount of floating point operations (FLOPs) per forward pass, MoEs benefit from improved sample efficiency at training time and achieve much stronger performance. Many closed-source and open-source frontier language models have thus adopted an MoE architecture. Naturally, practitioners will want to extend the capabilities of these models with large amounts of newly collected data without completely re-training them. Prior work has shown that a simple combination of replay and learning rate re-warming and re-decaying can enable the continual pre-training (CPT) of dense decoder-only transformers with minimal performance degradation compared to full re-training. In the case of decoder-only MoE transformers, however, it is unclear how the routing algorithm will impact continual pre-training performance: 1) do the MoE transformer’s routers exacerbate forgetting relative to a dense model?; 2) do the routers maintain a balanced load on previous distributions after CPT?; 3) are the same strategies applied to dense models sufficient to continually pre-train MoE LLMs? In what follows, we conduct a large-scale (2B parameter switch and DeepSeek MoE LLMs trained for 600B tokens) empirical study across four MoE transformers to answer these questions. Our results establish a surprising robustness to distribution shifts for both Sinkhorn-Balanced and Z-and-Aux-loss-balanced routing algorithms, even in MoEs continually pre-trained without replay. Moreover, we show that MoE LLMs maintain their sample efficiency (relative to a FLOP-matched dense model) during CPT and that they can match the performance of a fully re-trained MoE at a fraction of the cost.
zh
[NLP-58] Safety is Not Only About Refusal: Reasoning -Enhanced Fine-tuning for Interpretable LLM Safety
【速读】: 该论文试图解决大型语言模型(LLMs)在传统安全对齐方法中的脆弱性问题,特别是面对 Jailbreak 攻击时,这些方法依赖于僵化的拒绝启发式或表示工程,难以应对需要细微且上下文感知决策的广泛安全性挑战。论文的关键解决方案是提出了一种名为“可解释 LLM 安全性的推理增强微调”(Reasoning-enhanced Finetuning for interpretable LLM Safety, Rational)的新框架。该框架通过训练模型在生成响应前进行显式的安全推理,利用自生成推理的知识来引导模型内部化上下文敏感的决策过程,从而实现更稳健、可解释且适应性强的安全响应。关键在于将推理不仅视为 LLM 的核心能力,更是确保其安全性的基础机制。
链接: https://arxiv.org/abs/2503.05021
作者: Yuyou Zhang,Miao Li,William Han,Yihang Yao,Zhepeng Cen,Ding Zhao
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are vulnerable to jailbreak attacks that exploit weaknesses in traditional safety alignment, which often relies on rigid refusal heuristics or representation engineering to block harmful outputs. While they are effective for direct adversarial attacks, they fall short of broader safety challenges requiring nuanced, context-aware decision-making. To address this, we propose Reasoning-enhanced Finetuning for interpretable LLM Safety (Rational), a novel framework that trains models to engage in explicit safe reasoning before response. Fine-tuned models leverage the extensive pretraining knowledge in self-generated reasoning to bootstrap their own safety through structured reasoning, internalizing context-sensitive decision-making. Our findings suggest that safety extends beyond refusal, requiring context awareness for more robust, interpretable, and adaptive responses. Reasoning is not only a core capability of LLMs but also a fundamental mechanism for LLM safety. Rational employs reasoning-enhanced fine-tuning, allowing it to reject harmful prompts while providing meaningful and context-aware responses in complex scenarios.
zh
[NLP-59] Leverag ing Domain Knowledge at Inference Time for LLM Translation: Retrieval versus Generation
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在医学、法律等专业领域机器翻译(Machine Translation, MT)性能不足的问题。论文的关键在于通过精心设计的提示(prompting)设置研究LLMs的领域适应性MT方法,并发现演示(demonstrations)始终优于术语(terminology),检索(retrieval)优于生成(generation)。此外,论文指出,使用较弱模型生成演示可以缩小与更大模型零样本(zero-shot)性能之间的差距。关键解决方案在于强调演示的领域特异性(domain-specificity)的重要性,并揭示流行的多领域基准测试更多是在评估写作风格的适应性而非特定领域的适应性。
链接: https://arxiv.org/abs/2503.05010
作者: Bryan Li,Jiaming Luo,Eleftheria Briakou,Colin Cherry
机构: University of Pennsylvania (宾夕法尼亚大学); Google (谷歌)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:While large language models (LLMs) have been increasingly adopted for machine translation (MT), their performance for specialist domains such as medicine and law remains an open challenge. Prior work has shown that LLMs can be domain-adapted at test-time by retrieving targeted few-shot demonstrations or terminologies for inclusion in the prompt. Meanwhile, for general-purpose LLM MT, recent studies have found some success in generating similarly useful domain knowledge from an LLM itself, prior to translation. Our work studies domain-adapted MT with LLMs through a careful prompting setup, finding that demonstrations consistently outperform terminology, and retrieval consistently outperforms generation. We find that generating demonstrations with weaker models can close the gap with larger model’s zero-shot performance. Given the effectiveness of demonstrations, we perform detailed analyses to understand their value. We find that domain-specificity is particularly important, and that the popular multi-domain benchmark is testing adaptation to a particular writing style more so than to a specific domain.
zh
[NLP-60] Balcony: A Lightweight Approach to Dynamic Inference of Generative Language Models
【速读】: 该论文旨在解决在实际应用中部署大语言模型(Large Language Models, LLMs)时受到严格的计算资源和延迟限制的问题。现有动态推理方法常因硬件效率低下或性能退化而受限。论文提出的关键解决方案是Balcony框架,它通过冻结预训练的LLM并在选定的退出点插入额外的Transformer层,实现在保持全模型性能的同时,根据不同的计算预算进行实时适应。这一方法利用简单的自蒸馏损失函数训练附加层,使子模型输出与全模型对齐,显著减少了所需的训练令牌数和可调参数量,从而大幅降低计算成本。实验表明,Balcony在多个模型和不同规模下优于现有最先进的方法,如Flextron和Layerskip,同时在LLaMA3-8B上的测试仅使用了原始预训练数据的0.2%,实现了极小的性能下降和显著的速度提升。
链接: https://arxiv.org/abs/2503.05005
作者: Benyamin Jamialahmadi,Parsa Kavehzadeh,Mehdi Rezagholizadeh,Parsa Farinneya,Hossein Rajabzadeh,Aref Jafari,Boxing Chen,Marzieh Tahaei
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室); University of Waterloo (滑铁卢大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Deploying large language models (LLMs) in real-world applications is often hindered by strict computational and latency constraints. While dynamic inference offers the flexibility to adjust model behavior based on varying resource budgets, existing methods are frequently limited by hardware inefficiencies or performance degradation. In this paper, we introduce Balcony, a simple yet highly effective framework for depth-based dynamic inference. By freezing the pretrained LLM and inserting additional transformer layers at selected exit points, Balcony maintains the full model’s performance while enabling real-time adaptation to different computational budgets. These additional layers are trained using a straightforward self-distillation loss, aligning the sub-model outputs with those of the full model. This approach requires significantly fewer training tokens and tunable parameters, drastically reducing computational costs compared to prior methods. When applied to the LLaMA3-8B model, using only 0.2% of the original pretraining data, Balcony achieves minimal performance degradation while enabling significant speedups. Remarkably, we show that Balcony outperforms state-of-the-art methods such as Flextron and Layerskip as well as other leading compression techniques on multiple models and at various scales, across a variety of benchmarks.
zh
[NLP-61] HieroLM: Egyptian Hieroglyph Recovery with Next Word Prediction Language Model NAACL2025
【速读】: 该论文旨在解决现有模糊象形文字(Egyptian hieroglyph)修复方法的两个主要局限性:(1) 无法处理严重损坏或完全缺失的象形文字;(2) 仅基于单个象形文字进行预测而未考虑上下文和语法信息。为克服这些问题,论文提出将象形文字修复建模为下一个词预测任务,并利用语言模型(Language Model, LM)来解决这一问题。关键在于选择适合象形文字语义局部亲和性强的语言模型架构,最终采用LSTM构建HieroLM。实验表明,HieroLM在多示例预测和小样本数据情况下表现出超过44%的准确率,成为辅助学者推断缺失象形文字的实用工具,同时可与基于计算机视觉的方法互补以显著降低模糊象形文字识别的困惑度。
链接: https://arxiv.org/abs/2503.04996
作者: Xuheng Cai,Erica Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at LaTeCH-CLfL 2025 @ NAACL 2025
点击查看摘要
Abstract:Egyptian hieroglyphs are found on numerous ancient Egyptian artifacts, but it is common that they are blurry or even missing due to erosion. Existing efforts to restore blurry hieroglyphs adopt computer vision techniques such as CNNs and model hieroglyph recovery as an image classification task, which suffers from two major limitations: (i) They cannot handle severely damaged or completely missing hieroglyphs. (ii) They make predictions based on a single hieroglyph without considering contextual and grammatical information. This paper proposes a novel approach to model hieroglyph recovery as a next word prediction task and use language models to address it. We compare the performance of different SOTA language models and choose LSTM as the architecture of our HieroLM due to the strong local affinity of semantics in Egyptian hieroglyph texts. Experiments show that HieroLM achieves over 44% accuracy and maintains notable performance on multi-shot predictions and scarce data, which makes it a pragmatic tool to assist scholars in inferring missing hieroglyphs. It can also complement CV-based models to significantly reduce perplexity in recognizing blurry hieroglyphs. Our code is available at this https URL.
zh
[NLP-62] Wanda: Pruning Large Language Models via Regional Gradients
【速读】: 该论文旨在解决现有大语言模型(Large Language Models, LLMs)剪枝方法在保证全模型稀疏性的同时,因缺乏有效的梯度信息利用而导致性能下降的问题。论文提出了一种名为Wanda++的新框架,其关键在于引入解码器块级区域梯度(decoder-block-level regional gradients),首次通过区域梯度优化提升剪枝评分,并设计了一种高效的区域优化方法以最小化密集模型与稀疏模型输出之间的差异。这种方法不仅显著提升了语言建模任务中的困惑度(perplexity),最高改善幅度达32%,还在下游任务中表现出良好的泛化能力,同时保持了与稀疏性感知微调的正交性。此外,Wanda++具有轻量化特点,在单个NVIDIA H100 GPU上可在10分钟内完成7B参数规模LLaMA模型的剪枝。
链接: https://arxiv.org/abs/2503.04992
作者: Yifan Yang,Kai Zhen,Bhavana Ganesh,Aram Galstyan,Goeric Huybrechts,Markus Müller,Jonas M. Kübler,Rupak Vignesh Swaminathan,Athanasios Mouchtaris,Sravan Babu Bodapati,Nathan Susanj,Zheng Zhang,Jack FitzGerald,Abhishek Kumar
机构: University of California, Santa Barbara (加州大学圣塔芭芭拉分校); Amazon AGI (亚马逊AGI)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) pruning seeks to remove unimportant weights for inference speedup with minimal performance impact. However, existing methods often suffer from performance loss without full-model sparsity-aware fine-tuning. This paper presents Wanda++, a novel pruning framework that outperforms the state-of-the-art methods by utilizing decoder-block-level \textbfregional gradients. Specifically, Wanda++ improves the pruning score with regional gradients for the first time and proposes an efficient regional optimization method to minimize pruning-induced output discrepancies between the dense and sparse decoder output. Notably, Wanda++ improves perplexity by up to 32% over Wanda in the language modeling task and generalizes effectively to downstream tasks. Further experiments indicate our proposed method is orthogonal to sparsity-aware fine-tuning, where Wanda++ can be combined with LoRA fine-tuning to achieve a similar perplexity improvement as the Wanda method. The proposed method is lightweight, pruning a 7B LLaMA model in under 10 minutes on a single NVIDIA H100 GPU.
zh
[NLP-63] DP-GTR: Differentially Private Prompt Protection via Group Text Rewriting
【速读】: 该论文旨在解决在线使用大型语言模型(LLMs)时提示隐私保护的问题,特别是在提示中通常包含敏感信息的情况下。现有方法主要集中在文档级别的文本重写,忽视了文本的多粒度表示特性,这限制了LLMs在特定任务上的应用,并且未能充分利用其泛化能力和上下文学习能力,从而阻碍了实际应用。为了解决这一局限性,论文提出了一种名为DP-GTR的新框架,它通过局部差分隐私(Local Differential Privacy, DP)和基于组文本重写的合成定理,在三个阶段内操作。DP-GTR的关键创新在于首次整合了文档级别和词级别信息,同时利用上下文学习来同时提升隐私性和实用性,有效弥合了局部和全局DP机制在单个数据点层面的差距。实验结果表明,DP-GTR在CommonSense QA和DocVQA任务上优于现有方法,实现了更优的隐私-实用性权衡。此外,该框架可以与现有的重写技术兼容,作为插件增强隐私保护。代码已公开发布以供复现。
链接: https://arxiv.org/abs/2503.04990
作者: Mingchen Li,Heng Fan,Song Fu,Junhua Ding,Yunhe Feng
机构: Department of Computer Science and Engineering (计算机科学与工程系), University of North Texas (北德克萨斯大学)
类目: Computation and Language (cs.CL)
备注: 8 pages, 3 figures, 1 table
点击查看摘要
Abstract:Prompt privacy is crucial, especially when using online large language models (LLMs), due to the sensitive information often contained within prompts. While LLMs can enhance prompt privacy through text rewriting, existing methods primarily focus on document-level rewriting, neglecting the rich, multi-granular representations of text. This limitation restricts LLM utilization to specific tasks, overlooking their generalization and in-context learning capabilities, thus hindering practical application. To address this gap, we introduce DP-GTR, a novel three-stage framework that leverages local differential privacy (DP) and the composition theorem via group text rewriting. DP-GTR is the first framework to integrate both document-level and word-level information while exploiting in-context learning to simultaneously improve privacy and utility, effectively bridging local and global DP mechanisms at the individual data point level. Experiments on CommonSense QA and DocVQA demonstrate that DP-GTR outperforms existing approaches, achieving a superior privacy-utility trade-off. Furthermore, our framework is compatible with existing rewriting techniques, serving as a plug-in to enhance privacy protection. Our code is publicly available at this https URL for reproducibility.
zh
[NLP-64] Application of integrated gradients explainability to sociopsychological semantic markers
【速读】: 该论文旨在解决如何通过解释性方法揭示文本分类过程中具体单词对分类贡献的问题,特别是在社会心理标记(如agency)的细粒度情感分析中。论文的关键在于利用整合梯度(Integrated Gradient, IG)方法,在单词级别捕获分类输出,并通过优化训练过程增强其在小样本标注数据集上的适用性,以有效识别与相关社会心理标记相关的不同类别构建中的显著词。解决方案的关键是结合社会心理学视角,设计一种鼓励过拟合的特殊训练策略,同时评估IG方法与其他替代方案的性能差异,验证其在实际应用场景中的有效性。
链接: https://arxiv.org/abs/2503.04989
作者: Ali Aghababaei,Jan Nikadon,Magdalena Formanowicz,Maria Laura Bettinsoli,Carmen Cervone,Caterina Suitner,Tomaso Erseghe
机构: Department of Information Engineering, University of Padova (帕多瓦大学), Italy; Center for Research on Social Relations, University of Social Sciences and Humanities (SWPS) (华沙社会与人文学科大学), Warsaw, Poland; Department of Developmental Psychology and Socialization, University of Padova (帕多瓦大学), Italy
类目: Computation and Language (cs.CL)
备注: Submitted to IEEE Trans. on Affective Computing
点击查看摘要
Abstract:Classification of textual data in terms of sentiment, or more nuanced sociopsychological markers (e.g., agency), is now a popular approach commonly applied at the sentence level. In this paper, we exploit the integrated gradient (IG) method to capture the classification output at the word level, revealing which words actually contribute to the classification process. This approach improves explainability and provides in-depth insights into the text. We focus on sociopsychological markers beyond sentiment and investigate how to effectively train IG in agency, one of the very few markers for which a verified deep learning classifier, BERTAgent, is currently available. Performance and system parameters are carefully tested, alternatives to the IG approach are evaluated, and the usefulness of the result is verified in a relevant application scenario. The method is also applied in a scenario where only a small labeled dataset is available, with the aim of exploiting IG to identify the salient words that contribute to building the different classes that relate to relevant sociopsychological markers. To achieve this, an uncommon training procedure that encourages overfitting is employed to enhance the distinctiveness of each class. The results are analyzed through the lens of social psychology, offering valuable insights.
zh
[NLP-65] LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression NAACL2025
【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在多种压缩技术对其多模态输入驱动任务生成性能影响方面的研究空白。论文的关键在于提出了一种名为LVLM-Compress-Bench的框架,用于系统性评估两种主要的压缩方法——键值缓存(KV Cache)压缩和权重(Weight)压缩对LVLMs的影响。具体而言,针对动态增长的中间缓存和静态权重分别采用KV缓存压缩和权重压缩技术,并结合流行的LLaVA框架中的四种LVLM变体进行分析,集成最先进的均匀量化、异常值减少量化以及分组量化等方法。通过在涵盖识别、知识理解、语言生成、空间意识、视觉推理、幻觉与视觉错觉检测、毒性评估及偏见分析等能力的十个多模态数据集上进行实验,论文展示了不同量化预算下LVLMs在保持或丧失性能方面相对于FP16基准格式的广泛且有趣的观察结果。这一框架不仅涵盖了真实世界的数据集,还包含了合成数据集以反映社会交叉属性的多样性,从而全面评估压缩对一般性和伦理关键指标的影响。
链接: https://arxiv.org/abs/2503.04982
作者: Souvik Kundu,Anahita Bhiwandiwalla,Sungduk Yu,Phillip Howard,Tiep Le,Sharath Nittur Sridhar,David Cobbley,Hao Kang,Vasudev Lal
机构: Intel Labs (英特尔实验室), USA; Georgia Institute of Technology (乔治亚理工学院), USA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This work has been accepted to NAACL 2025
点击查看摘要
Abstract:Despite recent efforts in understanding the compression impact on large language models (LLMs) in terms of their downstream task performance and trustworthiness on relatively simpler uni-modal benchmarks (for example, question answering, common sense reasoning), their detailed study on multi-modal Large Vision-Language Models (LVLMs) is yet to be unveiled. Towards mitigating this gap, we present LVLM-Compress-Bench, a framework to first thoroughly study the broad impact of compression on the generative performance of LVLMs with multi-modal input driven tasks. In specific, we consider two major classes of compression for autoregressive models, namely KV cache and weight compression, for the dynamically growing intermediate cache and static weights, respectively. We use four LVLM variants of the popular LLaVA framework to present our analysis via integrating various state-of-the-art KV and weight compression methods including uniform, outlier-reduced, and group quantization for the KV cache and weights. With this framework we demonstrate on ten different multi-modal datasets with different capabilities including recognition, knowledge, language generation, spatial awareness, visual reasoning, hallucination and visual illusion identification, toxicity, stereotypes and bias. In specific, our framework demonstrates the compression impact on both general and ethically critical metrics leveraging a combination of real world and synthetic datasets to encompass diverse societal intersectional attributes. Extensive experimental evaluations yield diverse and intriguing observations on the behavior of LVLMs at different quantization budget of KV and weights, in both maintaining and losing performance as compared to the baseline model with FP16 data format. Code will be open-sourced at this https URL. Comments: This work has been accepted to NAACL 2025 Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2503.04982 [cs.CV] (or arXiv:2503.04982v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2503.04982 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-66] Beyond RAG : Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning
【速读】: 本文旨在解决大型语言模型(Large Language Models, LLMs)在整合外部知识时面临的权衡问题。现有的方法如Retrieval-Augmented Generation (RAG) 存在关键信息可能未包含在排名靠前的结果中的局限性,而长上下文模型虽然能够处理多文档但计算成本高且受上下文窗口大小限制。为此,论文受学生为开卷考试总结学习材料的启发,提出了一种任务感知的关键值(Key-Value, KV)缓存压缩方法,在零样本或少样本设置下压缩外部知识。这种方法使LLMs能够在紧凑的相关信息表示上高效推理。其关键是通过任务感知的压缩技术,在保持性能的同时显著减少外部知识的存储需求和推理延迟。实验表明,该方法在LongBench v2上的准确性比RAG高出多达7个百分点,压缩率达到30倍,推理延迟从0.43秒降低到0.16秒,并在广泛的知识任务中优于任务无关的压缩方法。
链接: https://arxiv.org/abs/2503.04973
作者: Giulio Corallo,Orion Weller,Fabio Petroni,Paolo Papotti
机构: SAP Labs (SAP实验室); EURECOM (欧洲通信与数字经济学研究学院); Johns Hopkins University (约翰斯·霍普金斯大学); Samaya AI (Samaya AI); EURECOM (欧洲通信与数字经济学研究学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Incorporating external knowledge in large language models (LLMs) enhances their utility across diverse applications, but existing methods have trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside top ranked results. Long-context models can process multiple documents but are computationally expensive and limited by context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compacted representation of all relevant information. Experiments show our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG with a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. A synthetic dataset highlights that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for broad knowledge tasks.
zh
[NLP-67] Evaluating Answer Reranking Strategies in Time-sensitive Question Answering
【速读】: 该论文试图解决当前问答系统(Question Answering, QA)在处理与过去事件相关的详细问题时,难以有效利用和理解时间信息的问题。解决方案的关键在于探索几种简单的时间特征驱动的答案选择技术,通过强调显式和隐式时间问题之间的差异,以及时间特性在从历时性文档集合中选择最相关答案中的作用,以提升系统对时间敏感问题的回答能力。
链接: https://arxiv.org/abs/2503.04972
作者: Mehmet Kardan,Bhawna Piryani,Adam Jatowt
机构: University of Innsbruck (因斯布鲁克大学), Austria
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Despite advancements in state-of-the-art models and information retrieval techniques, current systems still struggle to handle temporal information and to correctly answer detailed questions about past events. In this paper, we investigate the impact of temporal characteristics of answers in Question Answering (QA) by exploring several simple answer selection techniques. Our findings emphasize the role of temporal features in selecting the most relevant answers from diachronic document collections and highlight differences between explicit and implicit temporal questions.
zh
[NLP-68] DB-Explore: Automated Database Exploration and Instruction Synthesis for Text-to-SQL
【速读】: 该论文旨在解决现有基于大型语言模型(Large Language Models, LLMs)的文本转SQL(text-to-SQL)系统在处理复杂数据库结构和领域特定查询时表现不佳的问题。这些系统通常侧重于提升逻辑推理能力和SQL语法理解,而忽视了全面数据库理解的重要性。为了解决这一局限性,论文提出了一种名为DB-Explore的新框架,其关键是通过自动化探索与指令合成将LLMs与数据库知识系统性地对齐。DB-Explore利用GPT-4挖掘数据库的结构模式和语义知识,并通过合成指令提炼这些知识以高效微调LLMs,从而实现对数据库结构的全面理解,弥合数据库结构与语言模型之间的鸿沟。实验结果表明,该方法在SPIDER和BIRD基准测试中表现出色,显著提升了执行准确性。
链接: https://arxiv.org/abs/2503.04959
作者: Haoyuan Ma,Yongliang Shen,Hengwei Liu,Wenqi Zhang,Haolei Xu,Qiuying Peng,Jun Wang,Weiming Lu
机构: Zhejiang University (浙江大学); OPPO Research Institute (OPPO研究院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent text-to-SQL systems powered by large language models (LLMs) have demonstrated remarkable performance in translating natural language queries into SQL. However, these systems often struggle with complex database structures and domain-specific queries, as they primarily focus on enhancing logical reasoning and SQL syntax while overlooking the critical need for comprehensive database understanding. To address this limitation, we propose DB-Explore, a novel framework that systematically aligns LLMs with database knowledge through automated exploration and instruction synthesis. DB-Explore constructs database graphs to capture complex relational schemas, leverages GPT-4 to systematically mine structural patterns and semantic knowledge, and synthesizes instructions to distill this knowledge for efficient fine-tuning of LLMs. Our framework enables comprehensive database understanding through diverse sampling strategies and automated instruction generation, bridging the gap between database structures and language models. Experiments conducted on the SPIDER and BIRD benchmarks validate the effectiveness of DB-Explore, achieving an execution accuracy of 52.1% on BIRD and 84.0% on SPIDER. Notably, our open-source implementation, based on the Qwen2.5-coder-7B model, outperforms multiple GPT-4-driven text-to-SQL systems in comparative evaluations, and achieves near state-of-the-art performance with minimal computational cost.
zh
[NLP-69] SafeArena: Evaluating the Safety of Autonomous Web Agents
【速读】: 该论文试图解决大型语言模型(LLM)驱动的网络代理在实际应用中可能被恶意滥用的问题,具体评估其在误导信息传播、非法活动、骚扰、网络犯罪及社会偏见等五个危害类别下的表现。论文的关键解决方案是提出了SafeArena基准测试集,包含250个安全任务和250个有害任务,并设计了Agent Risk Assessment框架,通过四个风险等级系统性评估主流LLM驱动的网络代理对有害任务的响应情况,揭示这些代理对恶意请求的高容忍度,从而强调了加强网络代理安全性对齐程序的紧迫性。
链接: https://arxiv.org/abs/2503.04957
作者: Ada Defne Tur,Nicholas Meade,Xing Han Lù,Alejandra Zambrano,Arkil Patel,Esin Durmus,Spandana Gella,Karolina Stańczak,Siva Reddy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories – misinformation, illegal activity, harassment, cybercrime, and social bias, designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework that categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents. Our benchmark is available here: this https URL
zh
[NLP-70] Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems
【速读】: 该论文试图解决在生成式 AI (Generative AI) 模型广泛应用背景下,区分真实人类创作内容与深度伪造 (Deepfake) 内容的挑战。解决方案的关键在于探索一款名为 DeepFakeDeLiBot 的 deliberation-enhancing 聊天机器人在支持群体检测深度伪造文本方面的潜力。研究发现,基于群体的问题解决方法显著提高了识别机器生成段落的准确性,而 DeepFakeDeLiBot 主要通过增强群体互动、促进共识形成以及提升推理性陈述的频率和多样性来优化群体动态,尤其对于认为群体协作更有效率的参与者,其性能益处更为明显。这表明 deliberative 聊天机器人在提高群体协作效率的同时确保深度伪造文本检测的准确性方面具有重要潜力。
链接: https://arxiv.org/abs/2503.04945
作者: Jooyoung Lee,Xiaochen Zhu,Georgi Karadzhov,Tom Stafford,Andreas Vlachos,Dongwon Lee
机构: University of Toronto(多伦多大学); University of Cambridge(剑桥大学); University of Sheffield(谢菲尔德大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 15
点击查看摘要
Abstract:The proliferation of generative models has presented significant challenges in distinguishing authentic human-authored content from deepfake content. Collaborative human efforts, augmented by AI tools, present a promising solution. In this study, we explore the potential of DeepFakeDeLiBot, a deliberation-enhancing chatbot, to support groups in detecting deepfake text. Our findings reveal that group-based problem-solving significantly improves the accuracy of identifying machine-generated paragraphs compared to individual efforts. While engagement with DeepFakeDeLiBot does not yield substantial performance gains overall, it enhances group dynamics by fostering greater participant engagement, consensus building, and the frequency and diversity of reasoning-based utterances. Additionally, participants with higher perceived effectiveness of group collaboration exhibited performance benefits from DeepFakeDeLiBot. These findings underscore the potential of deliberative chatbots in fostering interactive and productive group dynamics while ensuring accuracy in collaborative deepfake text detection. \textitDataset and source code used in this study will be made publicly available upon acceptance of the manuscript.
zh
[NLP-71] VQEL: Enabling Self-Developed Symbolic Language in Agents through Vector Quantization in Emergent Language Games
【速读】: 该论文致力于解决在单智能体(single-agent)环境中通过自我博弈(self-play)发展语言表示的问题,特别是当语言不仅作为与他人交流的工具,还作为个体思维、自我反思和问题解决的手段时所面临的挑战。传统方法如REINFORCE在单一智能体设置下难以有效学习离散符号表示,因为它们无法提供优于多智能体方法的优势。论文的关键创新在于引入了一种名为VQEL的新方法,将向量量化(Vector Quantization, VQ)融入智能体架构中,使智能体能够在自我博弈的参照游戏中自主发明和优化离散符号表示。此外,VQEL通过结合后续的强化学习和与其他智能体的互弈(mutual-play)进一步提升语言能力。实验结果表明,VQEL不仅超越了传统的REINFORCE方法,还通过向量量化显著提高了控制能力和减少了表示崩塌的风险。
链接: https://arxiv.org/abs/2503.04940
作者: Mohammad Mahdi Samiei Paqaleh,Mahdieh Soleymani Baghshah
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In the field of emergent language, efforts have traditionally focused on developing communication protocols through interactions between agents in referential games. However, the aspect of internal language learning, where language serves not only as a communicative tool with others but also as a means for individual thinking, self-reflection, and problem-solving remains underexplored. Developing a language through self-play, without another agent’s involvement, poses a unique challenge. It requires an agent to craft symbolic representations and train them using direct gradient methods. The challenge here is that if an agent attempts to learn symbolic representations through self-play using conventional modeling and techniques such as REINFORCE, the solution will offer no advantage over previous multi-agent approaches. We introduce VQEL, a novel method that incorporates Vector Quantization into the agents’ architecture, enabling them to autonomously invent and develop discrete symbolic representations in a self-play referential game. Following the self-play phase, agents can enhance their language through reinforcement learning and interactions with other agents in the mutual-play phase. Our experiments across various datasets demonstrate that VQEL not only outperforms the traditional REINFORCE method but also benefits from improved control and reduced susceptibility to collapse, thanks to the incorporation of vector quantization.
zh
[NLP-72] HILGEN: Hierarchically-Informed Data Generation for Biomedical NER Using Knowledgebases and Large Language Models
【速读】: 本文旨在解决生物医学命名实体识别(NER)任务中数据稀缺的问题。传统方法依赖于大量高质量标注数据,但在实际应用中获取这些数据往往成本高昂且耗时。为应对这一挑战,论文提出了一种名为HILGEN的分层知识引导的数据生成方法。该方案的关键在于结合了统一医学语言系统(UMLS)中的领域知识与大型语言模型(LLMs),特别是GPT-3.5生成的合成数据。通过利用UMLS的层次结构来扩展训练数据的相关概念,并借助针对性提示从LLMs中提取上下文信息以自动生成稀疏出现的命名实体示例,从而在少量样本场景下显著提升了NER性能。实验结果显示,在多个生物医学NER数据集上的最佳集成模型实现了最高达42.29%的F1分数提升,证明了将历史积累的生物医学知识与生成式AI相结合的有效性。
链接: https://arxiv.org/abs/2503.04930
作者: Yao Ge,Yuting Guo,Sudeshna Das,Swati Rajwal,Selen Bozkurt,Abeed Sarker
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We present HILGEN, a Hierarchically-Informed Data Generation approach that combines domain knowledge from the Unified Medical Language System (UMLS) with synthetic data generated by large language models (LLMs), specifically GPT-3.5. Our approach leverages UMLS’s hierarchical structure to expand training data with related concepts, while incorporating contextual information from LLMs through targeted prompts aimed at automatically generating synthetic examples for sparsely occurring named entities. The performance of the HILGEN approach was evaluated across four biomedical NER datasets (MIMIC III, BC5CDR, NCBI-Disease, and Med-Mentions) using BERT-Large and DANN (Data Augmentation with Nearest Neighbor Classifier) models, applying various data generation strategies, including UMLS, GPT-3.5, and their best ensemble. For the BERT-Large model, incorporating UMLS led to an average F1 score improvement of 40.36%, while using GPT-3.5 resulted in a comparable average increase of 40.52%. The Best-Ensemble approach using BERT-Large achieved the highest improvement, with an average increase of 42.29%. DANN model’s F1 score improved by 22.74% on average using the UMLS-only approach. The GPT-3.5-based method resulted in a 21.53% increase, and the Best-Ensemble DANN model showed a more notable improvement, with an average increase of 25.03%. Our proposed HILGEN approach improves NER performance in few-shot settings without requiring additional manually annotated data. Our experiments demonstrate that an effective strategy for optimizing biomedical NER is to combine biomedical knowledge curated in the past, such as the UMLS, and generative LLMs to create synthetic training instances. Our future research will focus on exploring additional innovative synthetic data generation strategies for further improving NER performance.
zh
[NLP-73] Maximizing Signal in Human-Model Preference Alignment AAAI2025
【速读】: 该论文试图解决的问题是如何在自然语言理解与生成任务中,有效评估大型语言模型(LLMs)的输出,尤其是在需要终端用户认可机器学习(ML)模型决策的场景下(如毒性检测或摘要要点提取)。论文指出,尽管自动化评估方法快速且成本效益高,但它们无法替代人类判断的重要性。因此,论文主张模型训练和评估应基于能够代表终端用户偏好的数据。
解决方案的关键在于将人类反馈融入模型训练和评估过程中。首先,论文提出了解决标注任务中噪声与信号分离的方法。其次,通过遵循经过验证的最佳实践,可以最小化标注分歧中的噪声,同时最大化信号以支持模型训练和评估任务。最后,论文通过一个案例研究展示了如何利用人类判断来评估两个守卫线(guardrails)分类器,并使最终模型行为与用户偏好保持一致。论文的目标是为研究人员和专业人士提供指导,帮助他们将人类判断整合到机器学习(ML)和生成式 AI 的评估工具包中,特别是在实现符合用户需求和期望的准确、无偏特征方面。
链接: https://arxiv.org/abs/2503.04910
作者: Kelsey Kraus,Margaret Kroll
机构: 未知
类目: Computation and Language (cs.CL); Methodology (stat.ME)
备注: Presented at AAAI 2025, special track on AI Alignment
点击查看摘要
Abstract:The emergence of powerful LLMs has led to a paradigm shift in Natural Language Understanding and Natural Language Generation. The properties that make LLMs so valuable for these tasks – creativity, ability to produce fluent speech, and ability to quickly and effectively abstract information from large corpora – also present new challenges to evaluating their outputs. The rush to market has led teams to fall back on quick, cost-effective automatic evaluations which offer value, but do not obviate the need for human judgments in model training and evaluation. This paper argues that in cases in which end users need to agree with the decisions made by ML models – e.g. in toxicity detection or extraction of main points for summarization – models should be trained and evaluated on data that represent the preferences of those users. We support this argument by explicating the role of human feedback in labeling and judgment tasks for model training and evaluation. First, we propose methods for disentangling noise from signal in labeling tasks. Then we show that noise in labeling disagreement can be minimized by adhering to proven methodological best practices, while signal can be maximized to play an integral role in model training and evaluation tasks. Finally, we illustrate best practices by providing a case study in which two guardrails classifiers are evaluated using human judgments to align final model behavior to user preferences. We aim for this paper to provide researchers and professionals with guidelines to integrating human judgments into their ML and generative AI evaluation toolkit, particularly when working toward achieving accurate and unbiased features that align with users’ needs and expectations.
zh
[NLP-74] Architecture for a Trustworthy Quantum Chatbot
【速读】: 该论文旨在解决通用大型语言模型(LLM)在量子计算领域因数据不足而导致可靠性较低的问题。论文提出的关键解决方案是开发专注于量子程序的聊天机器人 C4Q,其软件架构通过集成专用的LLM来分类请求,并结合确定性逻辑引擎与专门的问题解答模块,提供可信赖的量子计算支持。这一方案的核心在于将确定性逻辑与基于概率的文本生成分离,从而提升结果的质量和系统的整体性能。
链接: https://arxiv.org/abs/2503.04875
作者: Yaiza Aragonés-Soria,Manuel Oriol
机构: 未知
类目: Computation and Language (cs.CL); Quantum Physics (quant-ph)
备注:
点击查看摘要
Abstract:Large language model (LLM)-based tools such as ChatGPT seem useful for classical programming assignments. The more specialized the field, the more likely they lack reliability because of the lack of data to train them. In the case of quantum computing, the quality of answers of generic chatbots is low. C4Q is a chatbot focused on quantum programs that addresses this challenge through a software architecture that integrates specialized LLMs to classify requests and specialized question answering modules with a deterministic logical engine to provide trustworthy quantum computing support. This article describes the latest version (2.0) of C4Q, which delivers several enhancements: ready-to-run Qiskit code for gate definitions and circuit operations, expanded features to solve software engineering tasks such as the travelling salesperson problem and the knapsack problem, and a feedback mechanism for iterative improvement. Extensive testing of the backend confirms the system’s reliability, while empirical evaluations show that C4Q 2.0’s classification LLM reaches near-perfect accuracy. The evaluation of the result consists in a comparative study with three existing chatbots highlighting C4Q 2.0’s maintainability and correctness, reflecting on how software architecture decisions, such as separating deterministic logic from probabilistic text generation impact the quality of the results. Subjects: Computation and Language (cs.CL); Quantum Physics (quant-ph) MSC classes: 81P68, 68T20 Cite as: arXiv:2503.04875 [cs.CL] (or arXiv:2503.04875v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2503.04875 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-75] Memory Is All You Need: Testing How Model Memory Affects LLM Performance in Annotation Tasks
【速读】: 该论文试图解决生成式大语言模型(Large Language Models, LLMs)在文本标注任务中因缺乏对先前标注信息的记忆而导致每个响应独立性的问题。传统方法如零样本(zero-shot)和少样本(few-shot)学习不允许模型保留之前标注的信息,这限制了其性能。论文的关键在于引入“模型记忆”概念,即允许LLM在其自身在同一任务中的先前标注中获取知识,并验证了这一机制可使性能提升5%到25%。此外,作者提出了一种名为“记忆强化”的新方法,结合模型记忆与强化学习(reinforcement learning),进一步提升了三种测试条件下的性能表现。
链接: https://arxiv.org/abs/2503.04874
作者: Joan C. Timoneda,Sebastián Vallejo Vera
机构: Purdue University (普渡大学); University of Western Ontario (西安大略大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Generative Large Language Models (LLMs) have shown promising results in text annotation using zero-shot and few-shot learning. Yet these approaches do not allow the model to retain information from previous annotations, making each response independent from the preceding ones. This raises the question of whether model memory – the LLM having knowledge about its own previous annotations in the same task – affects performance. In this article, using OpenAI’s GPT-4o and Meta’s Llama 3.1 on two political science datasets, we demonstrate that allowing the model to retain information about its own previous classifications yields significant performance improvements: between 5 and 25% when compared to zero-shot and few-shot learning. Moreover, memory reinforcement, a novel approach we propose that combines model memory and reinforcement learning, yields additional performance gains in three out of our four tests. These findings have important implications for applied researchers looking to improve performance and efficiency in LLM annotation tasks.
zh
[NLP-76] Are Large Language Models Good In-context Learners for Financial Sentiment Analysis? ICLR2025
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)是否具备良好的上下文学习能力(in-context learning),以应对金融情感分析(Financial Sentiment Analysis, FSA)中的挑战,特别是在无需针对金融领域数据进行微调的情况下。论文的关键在于探索LLMs能否通过泛化已有的金融文档-情感配对的上下文示例,从而实现对新文档的情感分析,这在金融特定数据微调困难甚至不可能的情况下尤为重要。为了解决这一问题,论文涵盖了多种现代LLMs(包括最新发布的DeepSeek V3)以及多种上下文样本选择方法,并通过全面的实验验证了LLMs在FSA任务中的上下文学习能力。
链接: https://arxiv.org/abs/2503.04873
作者: Xinyu Wei,Luojia Liu
机构: Independent Researcher (独立研究员); Kenan-Flagler Business School, University of North Carolina at Chapel Hill (凯南-弗拉格勒商学院,北卡罗来纳大学教堂山分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
备注: Accepted by ICLR 2025 Workshop on Advances in Financial AI
点击查看摘要
Abstract:Recently, large language models (LLMs) with hundreds of billions of parameters have demonstrated the emergent ability, surpassing traditional methods in various domains even without fine-tuning over domain-specific data. However, when it comes to financial sentiment analysis (FSA) \unicodex2013 a fundamental task in financial AI \unicodex2013 these models often encounter various challenges, such as complex financial terminology, subjective human emotions, and ambiguous inclination expressions. In this paper, we aim to answer the fundamental question: whether LLMs are good in-context learners for FSA? Unveiling this question can yield informative insights on whether LLMs can learn to address the challenges by generalizing in-context demonstrations of financial document-sentiment pairs to the sentiment analysis of new documents, given that finetuning these models on finance-specific data is difficult, if not impossible at all. To the best of our knowledge, this is the first paper exploring in-context learning for FSA that covers most modern LLMs (recently released DeepSeek V3 included) and multiple in-context sample selection methods. Comprehensive experiments validate the in-context learning capability of LLMs for FSA.
zh
[NLP-77] nyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在压缩其规模的同时保持高性能的挑战。现有的方法如模型蒸馏(model distillation)和迁移学习(transfer learning)往往难以实现高精度。为克服这一限制,论文提出了一种名为Branch-Merge蒸馏的方法,其关键在于通过两个阶段增强模型压缩:(1) 分支阶段(Branch Phase),利用领域特定的有监督微调(Supervised Fine-Tuning, SFT)有选择性地将大规模教师模型的知识蒸馏到专门的学生模型中;(2) 合并阶段(Merge Phase),将这些学生模型合并以实现跨领域的知识转移并提升泛化能力。实验验证表明,该方法产生的TinyR1-32B-Preview模型在多个基准测试中表现优异,同时在计算成本和时间上显著降低。
链接: https://arxiv.org/abs/2503.04872
作者: Lin Sun,Guangxiang Zhao,Xiaoqi Jian,Yuhan Wu,Weihong Lin,Yongfu Zhu,Change Jia,Linglin Zhang,Jinzhu Wu,Junfeng Ran,Sai-er Hu,Zihan Jiang,Junting Zhou,Wenrui Liu,Bin Cui,Tong Yang,Xiangzheng Zhang
机构: Qiyuan Tech; Peking University (北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint
点击查看摘要
Abstract:The challenge of reducing the size of Large Language Models (LLMs) while maintaining their performance has gained significant attention. However, existing methods, such as model distillation and transfer learning, often fail to achieve high accuracy. To address this limitation, we introduce the Branch-Merge distillation approach, which enhances model compression through two phases: (1) the Branch Phase, where knowledge from a large teacher model is \textitselectively distilled into specialized student models via domain-specific supervised fine-tuning (SFT); And (2) the Merge Phase, where these student models are merged to enable cross-domain knowledge transfer and improve generalization. We validate our distillation approach using DeepSeek-R1 as the teacher and DeepSeek-R1-Distill-Qwen-32B as the student. The resulting merged model, TinyR1-32B-Preview, outperforms its counterpart DeepSeek-R1-Distill-Qwen-32B across multiple benchmarks, including Mathematics (+5.5 points), Coding (+4.4 points) and Science (+2.9 points), while achieving near-equal performance to DeepSeek-R1 on AIME 2024. The Branch-Merge distillation approach provides a scalable solution for creating smaller, high-performing LLMs with reduced computational cost and time.
zh
[NLP-78] Label Distribution Learning-Enhanced Dual-KNN for Text Classification SDM2024
【速读】: 本文旨在解决传统文本分类方法过度依赖外部信息(如标签描述和知识库)而未能充分挖掘模型在训练过程中生成的内部信息(如文本嵌入和预测标签概率分布)的问题。论文的关键创新在于提出了一种双k最近邻(Dual k Nearest Neighbor, DkNN)框架,包含两个kNN模块,用于从训练集中检索若干邻居并增强标签分布。为了解决kNN模块在处理噪声数据集(含有标注错误)或相似数据集(具有相似标签)时容易产生混淆从而导致错误预测的问题,本文引入了一个标签分布学习模块,该模块能够学习标签相似性,并生成更优的标签分布以帮助模型更有效地区分文本。这一模块不仅缓解了模型过拟合现象,还提升了最终的分类性能,进而提高了推理阶段kNN模块检索邻居的质量。实验结果验证了所提方法的有效性。
链接: https://arxiv.org/abs/2503.04869
作者: Bo Yuan,Yulin Chen,Zhen Tan,Wang Jinyan,Huan Liu,Yin Zhang
机构: Zhejiang University (浙江大学); Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by SDM 2024
点击查看摘要
Abstract:Many text classification methods usually introduce external information (e.g., label descriptions and knowledge bases) to improve the classification performance. Compared to external information, some internal information generated by the model itself during training, like text embeddings and predicted label probability distributions, are exploited poorly when predicting the outcomes of some texts. In this paper, we focus on leveraging this internal information, proposing a dual k nearest neighbor (D k NN) framework with two k NN modules, to retrieve several neighbors from the training set and augment the distribution of labels. For the k NN module, it is easily confused and may cause incorrect predictions when retrieving some nearest neighbors from noisy datasets (datasets with labeling errors) or similar datasets (datasets with similar labels). To address this issue, we also introduce a label distribution learning module that can learn label similarity, and generate a better label distribution to help models distinguish texts more effectively. This module eases model overfitting and improves final classification performance, hence enhancing the quality of the retrieved neighbors by k NN modules during inference. Extensive experiments on the benchmark datasets verify the effectiveness of our method.
zh
[NLP-79] Codebook Reduction and Saturation: Novel observations on Inductive Thematic Saturation for Large Language Models and initial coding in Thematic Analysis
【速读】: 该论文试图解决的主题分析(Thematic Analysis)过程中初始编码(initial codes)的分析饱和度(analytical saturation)问题,特别是在大型语言模型(Large Language Models, LLMs)生成初始编码的情境下。论文的关键在于提出了一种创新的方法来衡量归纳主题饱和度(Inductive Thematic Saturation, ITS),并通过利用名为DSPy的编程框架实现这一测量。这种方法能够精确评估饱和度,从而验证主题分析的有效性。
链接: https://arxiv.org/abs/2503.04859
作者: Stefano De Paoli,Walter Stan Mathis
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:This paper reflects on the process of performing Thematic Analysis with Large Language Models (LLMs). Specifically, the paper deals with the problem of analytical saturation of initial codes, as produced by LLMs. Thematic Analysis is a well-established qualitative analysis method composed of interlinked phases. A key phase is the initial coding, where the analysts assign labels to discrete components of a dataset. Saturation is a way to measure the validity of a qualitative analysis and relates to the recurrence and repetition of initial codes. In the paper we reflect on how well LLMs achieve analytical saturation and propose also a novel technique to measure Inductive Thematic Saturation (ITS). This novel technique leverages a programming framework called DSPy. The proposed novel approach allows a precise measurement of ITS.
zh
[NLP-80] One-Shot is Enough: Consolidating Multi-Turn Attacks into Efficient Single-Turn Prompts for LLM s
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对多轮“越狱”对话攻击时的安全性挑战,尽管当前模型已进行广泛的安全增强,但熟练的人类对手仍能通过精心设计的多轮对话绕过复杂的防护机制。然而,这类多轮攻击需要大量人工投入,限制了其可扩展性。论文的关键在于提出了一种名为Multi-turn-to-Single-turn (M2S)的新方法,通过系统性地将多轮越狱提示转换为单轮攻击,从而降低攻击的复杂性和成本。M2S方法包含Hyphenize、Numberize和Pythonize三种转换策略,这些策略在保持序列上下文的同时,将多轮交互封装为单一查询。实验结果显示,与原始多轮对话相比,M2S方法通常能够维持或提高攻击成功率(Attack Success Rates, ASRs),并在某些情况下显著提升安全性评估中的有害行为检测阈值,例如在Mistral-7B上达到95.9%的ASR,在GPT-4o上的绝对改进高达17.5%。进一步分析表明,某些对抗性策略通过利用结构化格式线索来规避标准政策检查。这一发现强调了单轮攻击可能同样有效甚至更高效,揭示了重新评估和强化LLMs安全策略的紧迫性。
链接: https://arxiv.org/abs/2503.04856
作者: Junwoo Ha,Hyunjun Kim,Sangyoon Yu,Haon Park,Ashkan Yousefpour,Yuna Park,Suhyun Kim
机构: AIM Intelligence (AIM Intelligence); University of Seoul (首尔大学); Korea Advanced Institute of Science and Technology (韩国科学技术院); Seoul National University (首尔国立大学); Yonsei University (延世大学); Korea Institute of Science and Technology (韩国科学技术院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Despite extensive safety enhancements in large language models (LLMs), multi-turn “jailbreak” conversations crafted by skilled human adversaries can still breach even the most sophisticated guardrails. However, these multi-turn attacks demand considerable manual effort, limiting their scalability. In this work, we introduce a novel approach called Multi-turn-to-Single-turn (M2S) that systematically converts multi-turn jailbreak prompts into single-turn attacks. Specifically, we propose three conversion strategies - Hyphenize, Numberize, and Pythonize - each preserving sequential context yet packaging it in a single query. Our experiments on the Multi-turn Human Jailbreak (MHJ) dataset show that M2S often increases or maintains high Attack Success Rates (ASRs) compared to original multi-turn conversations. Notably, using a StrongREJECT-based evaluation of harmfulness, M2S achieves up to 95.9% ASR on Mistral-7B and outperforms original multi-turn prompts by as much as 17.5% in absolute improvement on GPT-4o. Further analysis reveals that certain adversarial tactics, when consolidated into a single prompt, exploit structural formatting cues to evade standard policy checks. These findings underscore that single-turn attacks - despite being simpler and cheaper to conduct - can be just as potent, if not more, than their multi-turn counterparts. Our findings underscore the urgent need to reevaluate and reinforce LLM safety strategies, given how adversarial queries can be compacted into a single prompt while still retaining sufficient complexity to bypass existing safety measures.
zh
[NLP-81] Enhancing Collective Intelligence in Large Language Models Through Emotional Integration
【速读】: 本文旨在解决如何将情感多样性融入大型语言模型(Large Language Models, LLMs)以提升其集体智能的问题。研究受到人类群体智慧现象的启发,即群体决策往往优于个体判断。为实现这一目标,论文的关键解决方案是通过Google的GoEmotions数据集以及低秩适应技术(Low-Rank Adaptation, LoRA)对DarkIdol-Llama-3.1-8B模型进行微调,从而模拟出具有情感多样性的响应。这种情感集成不仅塑造了响应模式,还保持了可接受的预测准确性,揭示了增强人工集体智能的潜力。
链接: https://arxiv.org/abs/2503.04849
作者: Likith Kadiyala,Ramteja Sajja,Yusuf Sermet,Ibrahim Demir
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注: 23 pages, 8 figures
点击查看摘要
Abstract:This research investigates the integration of emotional diversity into Large Language Models (LLMs) to enhance collective intelligence. Inspired by the human wisdom of crowds phenomenon, where group decisions often outperform individual judgments, we fine-tuned the DarkIdol-Llama-3.1-8B model using Google’s GoEmotions dataset and Low-Rank Adaptation (LoRA) to simulate emotionally diverse responses. Evaluating the model on a distance estimation task between Fargo, ND, and Seattle, WA, across 15,064 unique persona configurations, we analyzed how emotional states and social attributes influence decision-making. Our findings demonstrate that emotional integration shapes response patterns while maintaining acceptable prediction accuracy, revealing its potential to enhance artificial collective intelligence. This study provides valuable insights into the interplay of emotional diversity and decision-making in LLMs, suggesting pathways for creating emotionally aware AI systems that balance emotional depth with analytical precision.
zh
[NLP-82] hree tiers of computation in transformers and in brain architectures
【速读】: 该论文试图解决的问题是如何系统性地解释人类与基于 Transformer 的语言模型(LMs)在自然语言处理、数学及逻辑推理能力上的差异,并揭示其背后的能力涌现机制。论文的关键在于提出了一种基于语法-自动机(Grammar-Automata, G-A)层级的分析框架,将人类和语言模型的能力划分为三个层级,并强调能力的跃迁(tier transitions)而非单纯规模扩展(scaling)是决定系统能力的核心因素。通过这种分析,论文提供了对现有系统能力和局限性的深入理解,并为增强人工智能系统的逻辑推理能力提供了可行的洞见。
链接: https://arxiv.org/abs/2503.04848
作者: E Graham,R Granger
机构: 未知
类目: Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
备注:
点击查看摘要
Abstract:Specific empirical phenomena spanning human natural language, and mathematical and logical abilities, are rigorously situated in the well-studied grammar-automata (G-A) hierarchy. We identify three tiers and corresponding two transitions within the hierarchy and show their correspondence to the emergence of particular abilities in humans and in transformer-based language models (LMs). These emergent abilities have often been described in terms of “scaling”; we show that it is the transition between tiers, rather than size itself, that determines a system’s capabilities. Specifically, humans effortlessly process language yet require specific training to perform arithmetic or logical reasoning tasks; and LMs possess language abilities absent from predecessor systems yet still struggle with logical processing. The resulting principled analyses provide underlying explanatory accounts of both the abilities and shortfalls of these systems, and suggest actionable insights into the expansion of logic abilities in AI systems.
zh
[NLP-83] Universal Narrative Model: an Author-centric Storytelling Framework for Generative AI
【速读】: 该论文试图解决生成式 AI 在程序化叙事生成领域中面临的根本性叙事困境,特别是玩家agency(自主性)与叙事连贯性之间的平衡问题,并填补缺乏专门针对生成式 AI 特点的严格叙事标准的空白。论文的关键解决方案是提出“通用叙事模型 (Universal Narrative Model, UNM)”,这是一种开放且可扩展的标准,旨在将作者置于未来叙事设计工作流的核心位置,同时实现不同创作平台间的互操作性。通过以客观的叙事模型编码作者意图,UNM 实现了叙事的可移植性以及基于意图的约束机制,从而有效利用生成式 AI 的优势。
链接: https://arxiv.org/abs/2503.04844
作者: Hank Gerba
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Generative AI promises to finally realize dynamic, personalized storytelling technologies across a range of media. To date, experimentation with generative AI in the field of procedural narrative generation has been quite promising from a technical perspective. However, fundamental narrative dilemmas remain, such as the balance between player agency and narrative coherence, and no rigorous narrative standard has been proposed to specifically leverage the strengths of generative AI. In this paper, we propose the Universal Narrative Model (UNM), an open and extensible standard designed to place writers at the center of future narrative design workflows and enable interoperability across authoring platforms. By encoding an author’s intent according to an objective narrative model, the UNM enables narrative portability as well as intent-based constraints for generative systems.
zh
[NLP-84] Replicating Human Social Perception in Generative AI: Evaluating the Valence-Dominance Model
【速读】: 该论文试图解决的问题是:探究多模态生成式 AI 系统是否能够复制人类社会知觉的基础模型,并评估其表征在不同世界区域的一致性。具体而言,研究关注生成式 AI 是否能够重现由“效价(valence)”和“支配性(dominance)”构成的社会判断二维框架。
解决方案的关键在于通过主成分分析(Principal Component Analysis, PCA),验证生成式 AI 对面部图像进行评价时所提取的维度是否与理论上的效价-支配性结构一致,并探讨其在全球不同区域的表现是否存在额外的潜在成分。研究发现,虽然效价和支配性维度得到了良好复现,但许多地区和模型还表现出一个额外的第三成分,其性质和意义需要进一步研究。
链接: https://arxiv.org/abs/2503.04842
作者: Necdet Gurkan,Kimathi Njoki,Jordan W. Suchow
机构: University of Missouri - St. Louis (密苏里大学圣路易斯分校); Stevens Institute of Technology (史蒂文斯理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:As artificial intelligence (AI) continues to advance–particularly in generative models–an open question is whether these systems can replicate foundational models of human social perception. A well-established framework in social cognition suggests that social judgments are organized along two primary dimensions: valence (e.g., trustworthiness, warmth) and dominance (e.g., power, assertiveness). This study examines whether multimodal generative AI systems can reproduce this valence-dominance structure when evaluating facial images and how their representations align with those observed across world regions. Through principal component analysis (PCA), we found that the extracted dimensions closely mirrored the theoretical structure of valence and dominance, with trait loadings aligning with established definitions. However, many world regions and generative AI models also exhibited a third component, the nature and significance of which warrant further investigation. These findings demonstrate that multimodal generative AI systems can replicate key aspects of human social perception, raising important questions about their implications for AI-driven decision-making and human-AI interactions.
zh
[NLP-85] Framing the Game: How Context Shapes LLM Decision-Making
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在不同上下文环境中支持决策时,因上下文框架影响而导致的理性决策感知偏差问题。论文的关键解决方案在于引入了一种新颖的评估框架,通过系统性地改变评估实例的关键特征,并程序化生成描述性案例(vignettes),构建高度多样化的场景。这种方法使研究者能够分析相同基础博弈结构在不同上下文中的决策模式,揭示LLMs响应中显著的上下文变异性。研究发现,这种变异性虽然具有可预测性,但对框架效应极为敏感。论文结果强调了在实际部署中采用动态且上下文感知的评估方法的重要性。
链接: https://arxiv.org/abs/2503.04840
作者: Isaac Robinson,John Burden
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly deployed across diverse contexts to support decision-making. While existing evaluations effectively probe latent model capabilities, they often overlook the impact of context framing on perceived rational decision-making. In this study, we introduce a novel evaluation framework that systematically varies evaluation instances across key features and procedurally generates vignettes to create highly varied scenarios. By analyzing decision-making patterns across different contexts with the same underlying game structure, we uncover significant contextual variability in LLM responses. Our findings demonstrate that this variability is largely predictable yet highly sensitive to framing effects. Our results underscore the need for dynamic, context-aware evaluation methodologies for real-world deployments.
zh
[NLP-86] Advancing Multimodal In-Context Learning in Large Vision-Language Models with Task-aware Demonstrations ICLR2025
【速读】: 该论文旨在解决多模态in-context learning (ICL) 中配置鲁棒in-context demonstration (ICD) 序列的挑战,这是大型视觉-语言模型 (Large Vision-Language Models, LVLMs) 的重要能力之一。尽管多模态 ICL 具有潜力,但由于图像-文本输入的复杂性以及 ICL 性能对输入配置的高度敏感性,其实现仍具挑战性。论文的关键在于揭示了任务映射在构建鲁棒 ICD 序列中的核心作用,并提出了一种轻量级但强大的仅解码器 Transformer 模型 \textit{SabER},其配备了任务感知注意力机制。此设计能够智能地以自回归方式从演示库中选择和排列 ICDs,通过细粒度特征提取和跨模态推理,迭代优化任务映射以生成高质量的 ICD 序列。通过覆盖五种 LVLM 和九个基准数据集的广泛实验,\textit{SabER} 不仅展示了显著的实证性能,还深化了对任务语义与多模态 ICD 交互的理解。研究结果强调了合理配置 ICD 序列的重要性,并为提升多模态 ICL 在实际场景中的表现开辟了新途径。
链接: https://arxiv.org/abs/2503.04839
作者: Yanshu Li
机构: Brown University (布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by Reasoning and Planning for LLMs @ ICLR2025, 25 pages, 13 tables
点击查看摘要
Abstract:Multimodal in-context learning (ICL) has emerged as a key capability of Large Vision-Language Models (LVLMs), driven by their increasing scale and applicability. Despite its promise, effective ICL in the multimodal setting remains challenging due to the inherent complexity of image-text inputs and the high sensitivity of ICL performance to input configurations. In this work, we shed light on the core mechanism underlying multimodal ICL, identifying task mapping as a crucial factor in configuring robust in-context demonstration (ICD) sequences. Building on these insights, we propose \textitSabER, a lightweight yet powerful decoder-only transformer equipped with task-aware attention, which intelligently selects and arranges ICDs from a demonstration library in an autoregressive fashion. This design enables fine-grained feature extraction and cross-modal reasoning, iteratively refining task mapping to generate high-quality ICD sequences. Through extensive experiments covering five LVLMs and nine benchmark datasets, SabER not only demonstrates strong empirical performance, but also provides deeper understanding of how task semantics interact with multimodal ICDs. Our findings highlight the importance of principled ICD sequence configuration and open new avenues to enhance multimodal ICL in a wide range of real-world scenarios.
zh
[NLP-87] Extrapolation Merging: Keep Improving With Extrapolation and Merging
【速读】: 该论文试图解决在指令微调(Instruction Fine-Tuning)阶段,大型语言模型(LLMs)仍需大量计算资源和标注数据的问题,同时缺乏一种无需额外计算资源和数据即可提升模型性能的范式。论文的关键解决方案是提出了一种名为“外推合并(Extrapolation Merging)”的新范式,通过在模型合并过程中引入明确的优化方向,利用外推方法实现局部优化搜索,从而在不增加计算资源和数据的情况下持续提升模型性能。实验结果表明,该方法在七个不同任务上均能有效提高模型的性能。
链接: https://arxiv.org/abs/2503.04834
作者: Yiguan Lin,Bin Xu,Yinghao Li,Yang Gao
机构: School of Computer Science and Technology, Beijing Institute of Technology (北京理工大学), Beijing, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) require instruction fine-tuning to perform different downstream tasks. However, the instruction fine-tuning phase still demands significant computational resources and labeled data, lacking a paradigm that can improve model performance without additional computational power and data. Model merging aims to enhance performance by combining the parameters of different models, but the lack of a clear optimization direction during the merging process does not always guarantee improved performance. In this paper, we attempt to provide a clear optimization direction for model merging. We first validate the effectiveness of the model extrapolation method during the instruction fine-tuning phase. Then, we propose Extrapolation Merging, a paradigm that can continue improving model performance without requiring extra computational resources or data. Using the extrapolation method, we provide a clear direction for model merging, achieving local optimization search, and consequently enhancing the merged model’s performance. We conduct experiments on seven different tasks, and the results show that our method can consistently improve the model’s performance after fine-tuning.
zh
[NLP-88] Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks
【速读】: 本文旨在解决多模态大型语言模型(MLLMs)在训练阶段易受越狱攻击(jailbreak attacks)的问题。越狱攻击通过精心设计的扰动规避安全防护机制,诱发有害输出。为应对这一挑战,论文提出了首个针对MLLM训练阶段的对抗训练(Adversarial Training, AT)范式。然而,将传统对抗训练扩展到该领域面临两大关键难题:高效调整海量参数以及确保跨多种模态的鲁棒性。为此,论文引入了Projection Layer Against Adversarial Training (ProEAT),这是一种端到端的对抗训练框架。ProEAT采用基于投影层的架构,在保持计算可行性的前提下,仅针对轻量级投影层进行对抗训练,而非整个模型;同时,设计了一种动态权重调整机制,根据任务需求优化损失函数的权重分配,简化调参过程。此外,为了增强防御性能,提出了一种跨视觉与文本模态的联合优化策略,确保对来自任一模态的越狱攻击具有稳健抵抗能力。实验结果表明,ProEAT在三个主流MLLM上针对五种主要越狱攻击方法展现了最先进的防御效果,相比现有基线平均提升34%,同时仅导致清洁数据精度下降1%,验证了其有效性与实际应用潜力。
链接: https://arxiv.org/abs/2503.04833
作者: Liming Lu,Shuchao Pang,Siyuan Liang,Haotian Zhu,Xiyu Zeng,Aishan Liu,Yunhuai Liu,Yongbin Zhou
机构: Nanjing University of Science and Technology (南京理工大学); National University of Singapore (新加坡国立大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have made remarkable strides in cross-modal comprehension and generation tasks. However, they remain vulnerable to jailbreak attacks, where crafted perturbations bypass security guardrails and elicit harmful outputs. In this paper, we present the first adversarial training (AT) paradigm tailored to defend against jailbreak attacks during the MLLM training phase. Extending traditional AT to this domain poses two critical challenges: efficiently tuning massive parameters and ensuring robustness against attacks across multiple modalities. To address these challenges, we introduce Projection Layer Against Adversarial Training (ProEAT), an end-to-end AT framework. ProEAT incorporates a projector-based adversarial training architecture that efficiently handles large-scale parameters while maintaining computational feasibility by focusing adversarial training on a lightweight projector layer instead of the entire model; additionally, we design a dynamic weight adjustment mechanism that optimizes the loss function’s weight allocation based on task demands, streamlining the tuning process. To enhance defense performance, we propose a joint optimization strategy across visual and textual modalities, ensuring robust resistance to jailbreak attacks originating from either modality. Extensive experiments conducted on five major jailbreak attack methods across three mainstream MLLMs demonstrate the effectiveness of our approach. ProEAT achieves state-of-the-art defense performance, outperforming existing baselines by an average margin of +34% across text and image modalities, while incurring only a 1% reduction in clean accuracy. Furthermore, evaluations on real-world embodied intelligent systems highlight the practical applicability of our framework, paving the way for the development of more secure and reliable multimodal systems.
zh
[NLP-89] “Only ChatGPT gets me”: An Empirical Analysis of GPT versus other Large Language Models for Emotion Detection in Text
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在检测和理解人类情感方面的有效性问题。研究通过整合心理学中的情感模型与计算科学的视角,评估LLMs识别文本交互中表达的情感的准确性,并比较不同模型在这项特定任务上的表现。关键在于采用与最先进的模型在GoEmotions数据集上进行对比的方法论,以衡量LLMs作为情感分析系统的效能,从而推动其在需要精细理解人类语言的多个领域的潜在应用。
链接: https://arxiv.org/abs/2503.04831
作者: Florian Lecourt(LIRMM | ADVANSE),Madalina Croitoru(GRAPHIK),Konstantin Todorov(LIRMM | WEB3, LIRMM, WEB3)
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This work investigates the capabilities of large language models (LLMs) in detecting and understanding human emotions through text. Drawing upon emotion models from psychology, we adopt an interdisciplinary perspective that integrates computational and affective sciences insights. The main goal is to assess how accurately they can identify emotions expressed in textual interactions and compare different models on this specific task. This research contributes to broader efforts to enhance human-computer interaction, making artificial intelligence technologies more responsive and sensitive to users’ emotional nuances. By employing a methodology that involves comparisons with a state-of-the-art model on the GoEmotions dataset, we aim to gauge LLMs’ effectiveness as a system for emotional analysis, paving the way for potential applications in various fields that require a nuanced understanding of human language.
zh
[NLP-90] Cite Before You Speak: Enhancing Context-Response Grounding in E-commerce Conversational LLM -Agents
【速读】: 该论文旨在解决基于大型语言模型(Large Language Models, LLMs)的对话式购物代理(Conversational Shopping Agent, CSA)在电子商务领域中面临的两个主要挑战:一是LLMs可能生成未经验证或虚假的信息,从而传播错误信息并损害客户信任;二是缺乏知识来源引用机制,使客户难以验证由LLMs生成的信息。为了解决这些问题,论文提出了一种易于生产化实现的解决方案,通过利用上下文学习(In-context Learning, ICL)和多用户体验推理(Multi-UX-Inference, MUI),生成包含引用标记的响应以追溯原始信息来源,同时不影响其他用户体验功能。关键在于设计一种能够在适当用户体验框架下将这些引用标记链接到相关产品信息并展示来源的方法,从而提升LLMs输出的可信度与透明性。实验结果表明,该方法可使实际数据中LLMs的准确性提升13.83%,不仅有效缓解了LLMs的准确性问题,还增强了对话式人工智能的透明度。
链接: https://arxiv.org/abs/2503.04830
作者: Jingying Zeng,Hui Liu,Zhenwei Dai,Xianfeng Tang,Chen Luo,Samarth Varshney,Zhen Li,Qi He
机构: Amazon
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:With the advancement of conversational large language models (LLMs), several LLM-based Conversational Shopping Agents (CSA) have been developed to help customers answer questions and smooth their shopping journey in e-commerce domain. The primary objective in building a trustworthy CSA is to ensure the agent’s responses are accurate and factually grounded, which is essential for building customer trust and encouraging continuous engagement. However, two challenges remain. First, LLMs produce hallucinated or unsupported claims. Such inaccuracies risk spreading misinformation and diminishing customer trust. Second, without providing knowledge source attribution in CSA response, customers struggle to verify LLM-generated information. To address these challenges, we present an easily productionized solution that enables a “citation experience” utilizing In-context Learning (ICL) and Multi-UX-Inference (MUI) to generate responses with citations to attribute its original sources without interfering other existing UX features. With proper UX design, these citation marks can be linked to the related product information and display the source to our customers. In this work, we also build auto-metrics and scalable benchmarks to holistically evaluate LLM’s grounding and attribution capabilities. Our experiments demonstrate that incorporating this citation generation paradigm can substantially enhance the grounding of LLM responses by 13.83% on the real-world data. As such, our solution not only addresses the immediate challenges of LLM grounding issues but also adds transparency to conversational AI.
zh
[NLP-91] Beyond Next Word Prediction: Developing Comprehensive Evaluation Frameworks for measuring LLM performance on real world applications
【速读】: 该论文试图解决现有大型语言模型(Large Language Models, LLMs)评估框架过于依赖静态评价数据集的问题,提出一种更全面的评估方法。解决方案的关键在于基于传统游戏与工具结合的架构,通过这种架构实现对模型能力的更广泛测量。该方法提供了一个通用的基础框架,能够灵活扩展至多种场景,包括具体应用领域(如供应链管理或金融推理)以及抽象层面(如伦理或安全性)的评估需求。
链接: https://arxiv.org/abs/2503.04828
作者: Vishakha Agrawal,Archie Chaudhury,Shreya Agrawal
机构: AMD; LayerLens; UCLA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:While Large Language Models (LLMs) are fundamentally next-token prediction systems, their practical applications extend far beyond this basic function. From natural language processing and text generation to conversational assistants and software use, LLMs have numerous use-cases, and have already acquired a significant degree of enterprise adoption. To evaluate such models, static evaluation datasets, consisting of a set of prompts and their corresponding ground truths, are often used to benchmark the efficacy of the model for a particular task. In this paper, we provide the basis for a more comprehensive evaluation framework, based upon a traditional game and tool-based architecture that enables a more overarching measurement of a model’s capabilities. For simplicity, we provide a generalized foundation that can be extended, without significant alteration, to numerous scenarios, from specific use cases such as supply chain management or financial reasoning, to abstract measurements such as ethics or safety.
zh
[NLP-92] Preserving Cultural Identity with Context-Aware Translation Through Multi-Agent AI Systems NAACL2025
【速读】: 该论文旨在解决全球化背景下因主要语言主导导致近3,000种语言面临灭绝风险的问题,现有基于人工智能的翻译模型虽高效但难以捕捉文化细微差别、习语表达及历史意义,从而削弱了语言多样性。为应对这些挑战,论文提出了一种多智能体人工智能框架,专为服务不足的语言社区提供具有文化适应性的翻译。方案的关键在于利用专门设计的智能体进行翻译、释义、内容合成以及偏见评估,确保语言准确性与文化相关性的同时,通过外部验证增强上下文保真度并缓解偏见。实验表明,该框架在生成上下文丰富且文化嵌入性强的翻译方面优于GPT-4o,对土著语、区域语言及低资源语言的发展具有重要意义。此研究强调了多智能体人工智能在促进公平、可持续且文化敏感的自然语言处理技术中的潜力,契合服务于欠资源社区的语言模型在AI治理、文化NLP及可持续NLP领域的支柱理念。完整实验代码库已公开发布。
链接: https://arxiv.org/abs/2503.04827
作者: Mahfuz Ahmed Anik,Abdur Rahman,Azmine Toushik Wasi,Md Manjurul Ahsan
机构: Shahjalal University of Science and Technology (沙赫贾拉尔科技大学), Sylhet, Bangladesh; University of Oklahoma (俄克拉荷马大学), Norman, OK 73019, USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注: Accepted in NAACL 2025 Workshop on Language Models for Underserved Communities ( this https URL )
点击查看摘要
Abstract:Language is a cornerstone of cultural identity, yet globalization and the dominance of major languages have placed nearly 3,000 languages at risk of extinction. Existing AI-driven translation models prioritize efficiency but often fail to capture cultural nuances, idiomatic expressions, and historical significance, leading to translations that marginalize linguistic diversity. To address these challenges, we propose a multi-agent AI framework designed for culturally adaptive translation in underserved language communities. Our approach leverages specialized agents for translation, interpretation, content synthesis, and bias evaluation, ensuring that linguistic accuracy and cultural relevance are preserved. Using CrewAI and LangChain, our system enhances contextual fidelity while mitigating biases through external validation. Comparative analysis shows that our framework outperforms GPT-4o, producing contextually rich and culturally embedded translations, a critical advancement for Indigenous, regional, and low-resource languages. This research underscores the potential of multi-agent AI in fostering equitable, sustainable, and culturally sensitive NLP technologies, aligning with the AI Governance, Cultural NLP, and Sustainable NLP pillars of Language Models for Underserved Communities. Our full experimental codebase is publicly available at: this https URL
zh
[NLP-93] HeTGB: A Comprehensive Benchmark for Heterophilic Text-Attributed Graphs
【速读】: 该论文试图解决异质性文本属性图(Heterophilic Text-attributed Graphs, TAGs)在节点分类任务中的研究不足问题。现有图神经网络(Graph Neural Networks, GNNs)主要针对同质性假设设计,在处理异质性图数据时表现欠佳,而许多现实世界中的图数据同时具有异质性和丰富的文本描述属性。论文的关键解决方案是提出了一个名为Heterophilic Text-attributed Graph Benchmark (HeTGB) 的新型基准数据集,该数据集包含来自多个领域的五个真实异质性图数据集,并通过丰富的文本描述增强节点信息。HeTGB不仅能够系统评估GNNs、预训练语言模型(Pre-trained Language Models, PLMs)以及协同训练方法在节点分类任务上的性能,还揭示了文本属性在异质性图中的价值、分析了相关挑战及现有模型的局限性,并探讨了图结构与文本属性之间的相互作用关系。通过公开发布HeTGB及其基线实现,论文旨在推动这一领域的进一步研究。
链接: https://arxiv.org/abs/2503.04822
作者: Shujie Li,Yuxia Wu,Chuan Shi,Yuan Fang
机构: Beijing University of Post and Telecommunication (北京邮电大学); Singapore Management University (新加坡管理大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review
点击查看摘要
Abstract:Graph neural networks (GNNs) have demonstrated success in modeling relational data primarily under the assumption of homophily. However, many real-world graphs exhibit heterophily, where linked nodes belong to different categories or possess diverse attributes. Additionally, nodes in many domains are associated with textual descriptions, forming heterophilic text-attributed graphs (TAGs). Despite their significance, the study of heterophilic TAGs remains underexplored due to the lack of comprehensive benchmarks. To address this gap, we introduce the Heterophilic Text-attributed Graph Benchmark (HeTGB), a novel benchmark comprising five real-world heterophilic graph datasets from diverse domains, with nodes enriched by extensive textual descriptions. HeTGB enables systematic evaluation of GNNs, pre-trained language models (PLMs) and co-training methods on the node classification task. Through extensive benchmarking experiments, we showcase the utility of text attributes in heterophilic graphs, analyze the challenges posed by heterophilic TAGs and the limitations of existing models, and provide insights into the interplay between graph structures and textual attributes. We have publicly released HeTGB with baseline implementations to facilitate further research in this field.
zh
[NLP-94] Prompting Science Report 1: Prompt Engineering is Complicated and Contingent
【速读】: 该论文旨在解决如何通过严格测试帮助商业、教育及政策领域的领导者理解与大型语言模型(Large Language Model, LLM)相关的技术细节。论文的关键在于揭示两个核心问题:首先,不存在单一标准来衡量LLM是否通过基准测试,选择不同的标准会对LLM在该基准上的表现产生显著影响,而具体选择哪种标准取决于在特定情况下使用LLM的目标;其次,难以提前预测某种提示方法会提升还是损害LLM回答特定问题的能力,研究发现有时对LLM保持礼貌有助于性能提升,而在其他情况下则可能降低性能,同时约束AI的回答在某些场景下有助于性能提升,但在其他场景下可能适得其反。综合来看,这表明基准测试AI性能并非一刀切的问题,并且特定的提示公式或方法(如对AI保持礼貌)并不具有普遍价值。
链接: https://arxiv.org/abs/2503.04818
作者: Lennart Meincke,Ethan Mollick,Lilach Mollick,Dan Shapiro
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This is the first of a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we demonstrate two things: - There is no single standard for measuring whether a Large Language Model (LLM) passes a benchmark, and that choosing a standard has a big impact on how well the LLM does on that benchmark. The standard you choose will depend on your goals for using an LLM in a particular case. - It is hard to know in advance whether a particular prompting approach will help or harm the LLM’s ability to answer any particular question. Specifically, we find that sometimes being polite to the LLM helps performance, and sometimes it lowers performance. We also find that constraining the AI’s answers helps performance in some cases, though it may lower performance in other cases. Taken together, this suggests that benchmarking AI performance is not one-size-fits-all, and also that particular prompting formulas or approaches, like being polite to the AI, are not universally valuable. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2503.04818 [cs.CL] (or arXiv:2503.04818v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2503.04818 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Lennart Meincke [view email] [v1] Tue, 4 Mar 2025 21:09:12 UTC (320 KB)
zh
[NLP-95] Multi-Agent System for AI-Assisted Extraction of Narrative Arcs in TV Series
【速读】: 该论文旨在解决复杂电视剧情节线索难以追踪及其动态演化的分析难题。为应对这一挑战,论文提出了一种多智能体系统,其关键在于结合文本分析与人类专家知识,通过提取并存储三种叙事弧(Anthology、Soap 和 Genre-Specific)的进展信息至关系型与语义(向量)数据库中,实现结构化分析与比较。此外,系统通过图形界面整合自动化与批判性解读,允许人工干预以优化和可视化数据。尽管该系统在识别 Anthology 叙事弧和角色实体方面表现强劲,但对文本周边材料(如剧集概述)的依赖限制了其对重叠弧线及微妙动态的识别能力。未来工作将探索多模态输入(如对话与视觉信息)的整合,并扩展到更多类型的叙事分析中。
链接: https://arxiv.org/abs/2503.04817
作者: Roberto Balestri,Guglielmo Pescatore
机构: Department of the Arts, Università di Bologna (博洛尼亚大学), Italy
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Multimedia (cs.MM)
备注: 17th International Conference on Agents and Artificial Intelligence, Porto (Portugal). 23/02/2025 - 25/02/2025
点击查看摘要
Abstract:Serialized TV shows are built on complex storylines that can be hard to track and evolve in ways that defy straightforward analysis. This paper introduces a multi-agent system designed to extract and analyze these narrative arcs. Tested on the first season of Grey’s Anatomy (ABC 2005-), the system identifies three types of arcs: Anthology (self-contained), Soap (relationship-focused), and Genre-Specific (strictly related to the series’ genre). Episodic progressions of these arcs are stored in both relational and semantic (vectorial) databases, enabling structured analysis and comparison. To bridge the gap between automation and critical interpretation, the system is paired with a graphical interface that allows for human refinement using tools to enhance and visualize the data. The system performed strongly in identifying Anthology Arcs and character entities, but its reliance on textual paratexts (such as episode summaries) revealed limitations in recognizing overlapping arcs and subtler dynamics. This approach highlights the potential of combining computational and human expertise in narrative analysis. Beyond television, it offers promise for serialized written formats, where the narrative resides entirely in the text. Future work will explore the integration of multimodal inputs, such as dialogue and visuals, and expand testing across a wider range of genres to refine the system further.
zh
[NLP-96] Normalization through Fine-tuning: Understanding Wav2vec 2.0 Embeddings for Phonetic Analysis
【速读】: 该论文试图解决语音识别与分析中音素归一化(Phonetic Normalization)的重要性及其在现有预训练大模型微调范式下的实现方式问题。论文的关键在于揭示了通过微调 wav2vec 2.0 模型,任务无关信息能够被选择性抑制,从而在不显式执行音素归一化的情况下,实现隐式的音素归一化效果。研究发现,针对多任务微调的模型能够在保留多个任务所需信息的同时保持性能,且抑制任务无关信息并非有效分类的必要条件。这一解决方案的关键在于利用预训练模型的表征能力,在微调过程中自适应地实现音素归一化。
链接: https://arxiv.org/abs/2503.04814
作者: Yiming Wang,Yi Yang,Jiahong Yuan
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Phonetic normalization plays a crucial role in speech recognition and analysis, ensuring the comparability of features derived from raw audio data. However, in the current paradigm of fine-tuning pre-trained large transformer models, phonetic normalization is not deemed a necessary step; instead, it is implicitly executed within the models. This study investigates the normalization process within transformer models, especially wav2vec 2.0. Through a comprehensive analysis of embeddings from models fine-tuned for various tasks, our results demonstrate that fine-tuning wav2vec 2.0 effectively achieves phonetic normalization by selectively suppressing task-irrelevant information. We found that models fine-tuned for multiple tasks retain information for both tasks without compromising performance, and that suppressing task-irrelevant information is not necessary for effective classification. These findings provide new insights into how phonetic normalization can be flexibly achieved in speech models and how it is realized in human speech perception.
zh
[NLP-97] Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models
【速读】: 该论文旨在解决小语言模型(Small Language Models, SLMs)在复杂多步数学问题求解中的推理能力不足问题,具体表现为误差传播、缺乏自修正机制以及对多样化推理风格适应性有限。当前方法主要依赖静态微调或提示工程,这些方法难以在不同问题复杂度间泛化,同时高质量偏好数据的匮乏进一步限制了推理的可靠性。论文的关键解决方案是提出SPHERE(Self-evolving Pipeline for Hierarchy Enhancement and Reasoning Expansion),一种自我演进的数据生成管道,通过迭代生成、修正和多样化推理链来增强SLMs的推理能力。SPHERE的核心在于其三阶段机制:(i) 自主生成问题解决步骤;(ii) 自我校正以识别和修复错误;(iii) 多样性诱导以通过多条有效推理路径提升鲁棒性。这种自我演进机制显著提升了数学推理能力和模型可靠性,在MATH 500、GSM8K、AIME、AMC及Olympiad等基准测试中,SPHERE训练的模型相比基线模型取得了显著改进,并在某些任务上达到了与GPT-4o相当甚至超越的表现。
链接: https://arxiv.org/abs/2503.04813
作者: Joykirat Singh,Tanmoy Chakraborty,Akshay Nambi
机构: Microsoft Research, India (微软研究印度); IIT Delhi, India (印度理工学院德里)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have significantly improved their reasoning capabilities; however, they still struggle with complex multi-step mathematical problem-solving due to error propagation, lack of self-correction, and limited adaptability to diverse reasoning styles. Existing methods rely on static fine-tuning or prompt engineering, which fail to generalize across problem complexities, while the scarcity of high-quality preference data further hinders reliable reasoning. We introduce SPHERE, a self-evolving data generation pipeline that enhances reasoning in small language models (SLMs) by iteratively generating, correcting, and diversifying reasoning chains. SPHERE operates in three stages: (i) Self-Generation, where the model autonomously constructs problem-solving steps; (ii) Self-Correction, enabling it to identify and rectify errors; and (iii) Diversity Induction, improving robustness through multiple valid reasoning trajectories. This self-evolution mechanism strengthens mathematical reasoning and enhances model reliability. Evaluations on MATH 500, GSM8K, AIME, AMC, and Olympiad show that SPHERE-trained models achieve significant gains over their base versions and match/surpass GPT-4o on certain benchmarks. Our findings demonstrate that self-evolving models can close the reasoning gap between SLMs and state-of-the-art LLMs, making mathematical AI more reliable, scalable, and efficient. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2503.04813 [cs.LG] (or arXiv:2503.04813v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2503.04813 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-98] LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning
【速读】: 该论文旨在解决现有基于大型多模态(LMM)模型的嵌入模型在标准InfoNCE损失函数训练下,正负样本对之间的相似度分布重叠程度较高,导致难以有效区分困难负样本对的问题。为了解决这一挑战,论文提出了一种简单而有效的框架,通过动态改进负样本对的表征学习,使其根据判别难度进行调整。关键在于引入了LLaVE系列模型,并通过MMEB基准测试验证其性能,涵盖4个元任务和36个数据集。实验结果表明,LLaVE不仅建立了更强的基线,在多个任务上达到或超越当前最先进的性能,同时展现出良好的可扩展性和效率。
链接: https://arxiv.org/abs/2503.04812
作者: Zhibin Lan,Liqiang Niu,Fandong Meng,Jie Zhou,Jinsong Su
机构: School of Informatics, Xiamen University (厦门大学信息学院), China; Pattern Recognition Center, WeChat AI, Tencent Inc (腾讯微信人工智能研究中心), China; Shanghai Artificial Intelligence Laboratory (上海人工智能实验室), China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint
点击查看摘要
Abstract:Universal multimodal embedding models play a critical role in tasks such as interleaved image-text retrieval, multimodal RAG, and multimodal clustering. However, our empirical results indicate that existing LMM-based embedding models trained with the standard InfoNCE loss exhibit a high degree of overlap in similarity distribution between positive and negative pairs, making it challenging to distinguish hard negative pairs effectively. To deal with this issue, we propose a simple yet effective framework that dynamically improves the embedding model’s representation learning for negative pairs based on their discriminative difficulty. Within this framework, we train a series of models, named LLaVE, and evaluate them on the MMEB benchmark, which covers 4 meta-tasks and 36 datasets. Experimental results show that LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance while demonstrating strong scalability and efficiency. Specifically, LLaVE-2B surpasses the previous SOTA 7B models, while LLaVE-7B achieves a further performance improvement of 6.2 points. Although LLaVE is trained on image-text data, it can generalize to text-video retrieval tasks in a zero-shot manner and achieve strong performance, demonstrating its remarkable potential for transfer to other embedding tasks.
zh
[NLP-99] PanguIR Technical Report for NTCIR-18 AEOLLM Task
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)能力评估中的挑战,特别是现有评估方法的局限性。手动评估虽全面但成本高且资源密集,而自动评估虽可扩展但受限于基于参考答案的评价标准。为克服这些限制,NTCIR-18引入了AEOLLM(大型语言模型的自动评估)任务,鼓励无参考的评估方法。论文的关键解决方案包括:1)多模型协作,利用多个LLMs在不同子任务上近似人类评分;2)提示自动优化,通过LLMs迭代优化初始任务提示以适应训练样本的反馈;3)情境学习(In-context Learning, ICL)优化,基于多任务反馈训练专门的情境示例检索模型,并结合语义相关性检索模型,共同识别最有效的情境学习示例。实验结果表明,所提方法在AEOLLM任务中表现出色。
链接: https://arxiv.org/abs/2503.04809
作者: Lang Mei,Chong Chen,Jiaxin Mao
机构: Huawei Cloud BU(华为云事业部); Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:As large language models (LLMs) gain widespread attention in both academia and industry, it becomes increasingly critical and challenging to effectively evaluate their capabilities. Existing evaluation methods can be broadly categorized into two types: manual evaluation and automatic evaluation. Manual evaluation, while comprehensive, is often costly and resource-intensive. Conversely, automatic evaluation offers greater scalability but is constrained by the limitations of its evaluation criteria (dominated by reference-based answers). To address these challenges, NTCIR-18 introduced the AEOLLM (Automatic Evaluation of LLMs) task, aiming to encourage reference-free evaluation methods that can overcome the limitations of existing approaches. In this paper, to enhance the evaluation performance of the AEOLLM task, we propose three key methods to improve the reference-free evaluation: 1) Multi-model Collaboration: Leveraging multiple LLMs to approximate human ratings across various subtasks; 2) Prompt Auto-optimization: Utilizing LLMs to iteratively refine the initial task prompts based on evaluation feedback from training samples; and 3) In-context Learning (ICL) Optimization: Based on the multi-task evaluation feedback, we train a specialized in-context example retrieval model, combined with a semantic relevance retrieval model, to jointly identify the most effective in-context learning examples. Experiments conducted on the final dataset demonstrate that our approach achieves superior performance on the AEOLLM task.
zh
[NLP-100] Learning from Failures in Multi-Attempt Reinforcement Learning
【速读】: 该论文试图解决如何通过任务设计提升大型语言模型(Large Language Models, LLMs)在推理能力上的表现。论文的关键在于引入多尝试(multi-attempt)任务设置,与传统的单次作答(single-turn)任务不同,该方法允许模型在每次错误响应后接收反馈,并进行多次修正尝试。这种机制不仅提高了模型的准确性,还增强了其搜索效率和基于用户反馈优化自身响应的能力。实验结果表明,在数学基准测试中,经过多尝试任务训练的小型LLM从单次尝试的45.6%提升到两次尝试的52.5%,而相同模型在标准单次任务下仅从42.3%提升至43.2%。
链接: https://arxiv.org/abs/2503.04808
作者: Stephen Chung,Wenyu Du,Jie Fu
机构: DualityRL; Shanghai AI Lab (上海人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: preprint
点击查看摘要
Abstract:Recent advancements in reinforcement learning (RL) for large language models (LLMs), exemplified by DeepSeek R1, have shown that even a simple question-answering task can substantially improve an LLM’s reasoning capabilities. In this work, we extend this approach by modifying the task into a multi-attempt setting. Instead of generating a single response per question, the model is given multiple attempts, with feedback provided after incorrect responses. The multi-attempt task encourages the model to refine its previous attempts and improve search efficiency. Experimental results show that even a small LLM trained on a multi-attempt task achieves significantly higher accuracy when evaluated with more attempts, improving from 45.6% with 1 attempt to 52.5% with 2 attempts on the math benchmark. In contrast, the same LLM trained on a standard single-turn task exhibits only a marginal improvement, increasing from 42.3% to 43.2% when given more attempts during evaluation. The results indicate that, compared to the standard single-turn task, an LLM trained on a multi-attempt task achieves slightly better performance on math benchmarks while also learning to refine its responses more effectively based on user feedback. Full code is available at this https URL
zh
[NLP-101] Call for Rigor in Reporting Quality of Instruction Tuning Data ACL25
【速读】: 该论文旨在解决因训练大语言模型(Large Language Models, LLMs)时超参数选择的随意性所导致的数据质量评估不准确的问题。论文指出,现有研究在评估指令微调(Instruction Tuning, IT)数据质量时,通常通过LLMs的性能来衡量,但忽略了超参数选择缺乏合理依据这一关键问题。即使使用相同的数据集,不同研究中采用的超参数存在显著差异,这种随意性可能导致任意的结论。论文的关键解决方案在于强调在验证数据质量时需谨慎对待超参数的选择,并通过实验表明,不当的超参数决策可能使任何关于数据质量的结论都失去意义。实验结果基于LIMA数据集以及1,000个Alpaca数据点,进一步支持了这一观点。
链接: https://arxiv.org/abs/2503.04807
作者: Hyeonseok Moon,Jaehyung Seo,Heuiseok Lim
机构: Korea University (韩国大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL25 under review (ARR24 December 3/3.5/4 , meta 4)
点击查看摘要
Abstract:Instruction tuning is crucial for adapting large language models (LLMs) to align with user intentions. Numerous studies emphasize the significance of the quality of instruction tuning (IT) data, revealing a strong correlation between IT data quality and the alignment performance of LLMs. In these studies, the quality of IT data is typically assessed by evaluating the performance of LLMs trained with that data. However, we identified a prevalent issue in such practice: hyperparameters for training models are often selected arbitrarily without adequate justification. We observed significant variations in hyperparameters applied across different studies, even when training the same model with the same data. In this study, we demonstrate the potential problems arising from this practice and emphasize the need for careful consideration in verifying data quality. Through our experiments on the quality of LIMA data and a selected set of 1,000 Alpaca data points, we demonstrate that arbitrary hyperparameter decisions can make any arbitrary conclusion.
zh
[NLP-102] What do Large Language Models Say About Animals? Investigating Risks of Animal Harm in Generated Text
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成文本时对非人类动物潜在危害的评估问题。现有技术评估主要关注LLMs对人类及环境的影响,而忽视了对非人类动物的潜在伤害。为填补这一空白,论文提出了一种名为Animal Harm Assessment (AHA)的新评估框架,用于识别和量化LLM生成文本中的动物伤害风险。
解决方案的关键在于构建一个包含多样化场景的数据集,涵盖50种类别的动物(如猫、爬行动物等)以及50种伦理情境,并结合Reddit社区的真实提问与合成问题,形成1,850条手工标注的问题样本和2,500条合成样本,同时采用70/30的公开/私有数据划分策略。通过“LLM作为裁判”(LLM-as-a-judge)的方法,AHA评估生成答案是否可能增加或减少动物伤害风险,并对模型自我偏好的倾向进行去偏处理。这种方法能够有效揭示不同模型、动物类别、伦理情境及社区间的显著差异。
链接: https://arxiv.org/abs/2503.04804
作者: Arturs Kanepajs,Aditi Basu,Sankalpa Ghose,Constance Li,Akshat Mehta,Ronak Mehta,Samuel David Tucker-Davis,Eric Zhou,Bob Fischer
机构: AI for Animals; Alethic Research; Texas State University
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:As machine learning systems become increasingly embedded in human society, their impact on the natural world continues to escalate. Technical evaluations have addressed a variety of potential harms from large language models (LLMs) towards humans and the environment, but there is little empirical work regarding harms towards nonhuman animals. Following the growing recognition of animal protection in regulatory and ethical AI frameworks, we present the Animal Harm Assessment (AHA), a novel evaluation of risks of animal harm in LLM-generated text. Our dataset comprises 1,850 curated questions from Reddit post titles and 2,500 synthetic questions based on 50 animal categories (e.g., cats, reptiles) and 50 ethical scenarios, with further 70-30 publi-private split. Scenarios include open-ended questions about how to treat animals, practical scenarios with potential animal harm, and willingness-to-pay measures for the prevention of animal harm. Using the LLM-as-a-judge framework, answers are evaluated for their potential to increase or decrease harm, and evaluations are debiased for the tendency to judge their own outputs more favorably. We show that AHA produces meaningful evaluation results when applied to frontier LLMs, revealing significant differences between models, animal categories, scenarios, and subreddits. We conclude with future directions for technical research and the challenges of building evaluations on complex social and moral topics.
zh
[NLP-103] he order in speech disorder: a scoping review of state of the art machine learning methods for clinical speech classification
【速读】: 该论文旨在探索机器学习(Machine Learning, ML)利用语音模式诊断神经、喉部及精神疾病的能力。论文的关键在于通过系统性回顾564篇相关文章,评估不同研究中用于语音分类的方法,并基于所报告的ML模型诊断准确性对其评分(0-10)。结果显示,对于喉部疾病、构音障碍以及帕金森病相关的语音变化,ML模型展现了高诊断准确性;而对于抑郁症、精神分裂症、轻度认知障碍和阿尔茨海默病等疾病,尽管也显示出较高的准确性,但存在一定的研究间变异性。此外,强迫症(OCD)和自闭症的研究强调了进一步研究的重要性,以明确语音模式与相应疾病之间的关系。论文结论指出,ML模型在诊断多种精神、喉部和神经系统疾病方面具有巨大潜力,但其效果因疾病类型而异,需进一步研究以整合到临床实践中。
链接: https://arxiv.org/abs/2503.04802
作者: Birger Moell,Fredrik Sand Aronsson,Per Östberg,Jonas Beskow
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Background:Speech patterns have emerged as potential diagnostic markers for conditions with varying etiologies. Machine learning (ML) presents an opportunity to harness these patterns for accurate disease diagnosis. Objective: This review synthesized findings from studies exploring ML’s capability in leveraging speech for the diagnosis of neurological, laryngeal and mental disorders. Methods: A systematic examination of 564 articles was conducted with 91 articles included in the study, which encompassed a wide spectrum of conditions, ranging from voice pathologies to mental and neurological disorders. Methods for speech classifications were assessed based on the relevant studies and scored between 0-10 based on the reported diagnostic accuracy of their ML models. Results: High diagnostic accuracies were consistently observed for laryngeal disorders, dysarthria, and changes related to speech in Parkinsons disease. These findings indicate the robust potential of speech as a diagnostic tool. Disorders like depression, schizophrenia, mild cognitive impairment and Alzheimers dementia also demonstrated high accuracies, albeit with some variability across studies. Meanwhile, disorders like OCD and autism highlighted the need for more extensive research to ascertain the relationship between speech patterns and the respective conditions. Conclusion: ML models utilizing speech patterns demonstrate promising potential in diagnosing a range of mental, laryngeal, and neurological disorders. However, the efficacy varies across conditions, and further research is needed. The integration of these models into clinical practice could potentially revolutionize the evaluation and diagnosis of a number of different medical conditions. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2503.04802 [cs.CL] (or arXiv:2503.04802v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2503.04802 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Birger Moell [view email] [v1] Mon, 3 Mar 2025 11:33:02 UTC (226 KB) Full-text links: Access Paper: View a PDF of the paper titled The order in speech disorder: a scoping review of state of the art machine learning methods for clinical speech classification, by Birger Moell and 3 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.CL prev | next new | recent | 2025-03 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh
[NLP-104] Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models
【速读】: 该论文试图解决多模态大型语言模型(Multimodal Large Language Models, MLLMs)在跨模态知识推理过程中知识整合不充分导致的推理结果一致性下降的问题。为系统性地探索这一挑战,论文设计了四个评估任务并构建了一个新的数据集,通过在该数据集上的实验分析和比较,识别出影响一致性退化的主要因素。关键在于通过实验揭示多模态知识推理中的一致性挑战,并为未来提升MLLMs性能提供有价值的指导方向。
链接: https://arxiv.org/abs/2503.04801
作者: Boyu Jia,Junzhe Zhang,Huixuan Zhang,Xiaojun Wan
机构: School of Software and Microelectronics, Peking University (北京大学软件与微电子学院); Wangxuan Institute of Computer Technology, Peking University (王选计算机研究所, 北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In recent years, multimodal large language models (MLLMs) have achieved significant breakthroughs, enhancing understanding across text and vision. However, current MLLMs still face challenges in effectively integrating knowledge across these modalities during multimodal knowledge reasoning, leading to inconsistencies in reasoning outcomes. To systematically explore this issue, we propose four evaluation tasks and construct a new dataset. We conduct a series of experiments on this dataset to analyze and compare the extent of consistency degradation in multimodal knowledge reasoning within MLLMs. Based on the experimental results, we identify factors contributing to the observed degradation in consistency. Our research provides new insights into the challenges of multimodal knowledge reasoning and offers valuable guidance for future efforts aimed at improving MLLMs.
zh
[NLP-105] HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation
【速读】: 该论文旨在解决 Retrieval-Augmented Generation (RAG) 方法在处理知识库中过时信息时面临的挑战,特别是现有研究对共存于检索源中的过时信息影响关注不足的问题。论文的关键在于引入了首个专门评估过时信息对 RAG 影响的基准 HoH。HoH 利用基于 token 级别差异算法与大型语言模型 (LLM) 管道,高效构建了一个大规模问答数据集,精准捕获现实世界事实中的时间知识演化。通过全面实验,论文揭示了过时信息通过分散模型注意力显著降低响应准确性,并可能导致有害输出的误导性影响,而当前 RAG 方法在检索和生成方面均难以有效应对这些问题。因此,论文强调亟需创新性解决方案来应对 RAG 的时间相关挑战。
链接: https://arxiv.org/abs/2503.04800
作者: Jie Ouyang,Tingyue Pan,Mingyue Cheng,Ruiran Yan,Yucong Luo,Jiaying Lin,Qi Liu
机构: State Key Lab of Cognitive Intelligence, University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:While Retrieval-Augmented Generation (RAG) has emerged as an effective approach for addressing the knowledge outdating problem in Large Language Models (LLMs), it faces a critical challenge: the prevalence of outdated information in knowledge bases. Current research primarily focuses on incorporating up-to-date information, yet the impact of outdated information coexisting in retrieval sources remains inadequately addressed. To bridge this gap, we introduce HoH, the first benchmark specifically designed to evaluate the impact of outdated information on RAG. Our benchmark leverages token-level diff algorithms combined with LLM pipelines to efficiently create a large-scale QA dataset that accurately captures temporal knowledge evolution in real-world facts. Through comprehensive experiments, we reveal that outdated information significantly degrades RAG performance in two critical ways: (1) it substantially reduces response accuracy by distracting models from correct information, and (2) it can mislead models into generating potentially harmful outputs, even when current information is available. Current RAG approaches struggle with both retrieval and generation aspects when handling outdated information. These findings highlight the urgent need for innovative solutions to address the temporal challenges in RAG.
zh
[NLP-106] Direct Speech to Speech Translation: A Review
【速读】: 该论文旨在解决实时多语言通信中的语音到语音翻译(Speech to Speech Translation, S2ST)问题。论文对比分析了传统级联模型(依赖自动语音识别ASR、机器翻译MT和文本到语音TTS组件)与新兴端到端及直接语音翻译(Direct Speech Translation, DST)模型的关键差异与优劣。解决方案的关键在于通过直接语音翻译模型减少错误传播、降低延迟,并保留说话人的身份特征和韵律信息,从而提升翻译的自然性和流畅性。然而,这些方法仍面临数据稀疏性、高计算成本以及低资源语言泛化能力不足等挑战。
链接: https://arxiv.org/abs/2503.04799
作者: Mohammad Sarim,Saim Shakeel,Laeeba Javed,Jamaluddin,Mohammad Nadeem
机构: Department of Computer Science, Aligarh Muslim University (阿利格尔穆斯林大学), Aligarh, Uttar Pradesh, India.
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Speech to speech translation (S2ST) is a transformative technology that bridges global communication gaps, enabling real time multilingual interactions in diplomacy, tourism, and international trade. Our review examines the evolution of S2ST, comparing traditional cascade models which rely on automatic speech recognition (ASR), machine translation (MT), and text to speech (TTS) components with newer end to end and direct speech translation (DST) models that bypass intermediate text representations. While cascade models offer modularity and optimized components, they suffer from error propagation, increased latency, and loss of prosody. In contrast, direct S2ST models retain speaker identity, reduce latency, and improve translation naturalness by preserving vocal characteristics and prosody. However, they remain limited by data sparsity, high computational costs, and generalization challenges for low-resource languages. The current work critically evaluates these approaches, their tradeoffs, and future directions for improving real time multilingual communication.
zh
[NLP-107] Parallel Corpora for Machine Translation in Low-resource Indic Languages: A Comprehensive Review
【速读】: 该论文旨在解决低资源Indic语言在机器翻译(Machine Translation, MT)领域中高质量平行语料稀缺的问题。论文的关键在于全面梳理和评估现有的Indic语言平行语料库,涵盖文本到文本、代码混合以及多模态数据等多种类型,并对其在构建鲁棒多语言MT系统中的重要性进行分类和分析。此外,论文深入探讨了语料创建过程中面临的挑战,如语言多样性、书写系统差异、数据稀疏性以及非正式文本的普遍存在,并评估了这些语料库在对齐质量与领域代表性方面的表现。针对开放性挑战,如Indic语言间的数据不平衡、质量和数量之间的权衡以及噪声、非正式及方言数据对翻译性能的影响,论文提出了未来的研究方向,包括利用跨语言迁移学习、扩展多语言数据集以及整合多模态资源以提升翻译质量。
链接: https://arxiv.org/abs/2503.04797
作者: Rahul Raja,Arpita Vats
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Parallel corpora play an important role in training machine translation (MT) models, particularly for low-resource languages where high-quality bilingual data is scarce. This review provides a comprehensive overview of available parallel corpora for Indic languages, which span diverse linguistic families, scripts, and regional variations. We categorize these corpora into text-to-text, code-switched, and various categories of multimodal datasets, highlighting their significance in the development of robust multilingual MT systems. Beyond resource enumeration, we critically examine the challenges faced in corpus creation, including linguistic diversity, script variation, data scarcity, and the prevalence of informal textual this http URL also discuss and evaluate these corpora in various terms such as alignment quality and domain representativeness. Furthermore, we address open challenges such as data imbalance across Indic languages, the trade-off between quality and quantity, and the impact of noisy, informal, and dialectal data on MT performance. Finally, we outline future directions, including leveraging cross-lingual transfer learning, expanding multilingual datasets, and integrating multimodal resources to enhance translation quality. To the best of our knowledge, this paper presents the first comprehensive review of parallel corpora specifically tailored for low-resource Indic languages in the context of machine translation.
zh
[NLP-108] Optimizing Multi-Hop Document Retrieval Through Intermediate Representations
【速读】: 本文旨在解决基于检索增强生成(Retrieval-Augmented Generation, RAG)方法在处理复杂查询,特别是多跳问题(multi-hop questions)时所面临的挑战。传统方法通过迭代生成内部查询并检索外部文档来应对多跳问题,但这些方法计算成本高昂。论文的关键洞察在于发现大型语言模型(LLMs)在逐层推理过程中存在一种三阶段的信息处理模式,即提取、处理及后续提取步骤。研究进一步指出,中间层的表征相比其他层包含更丰富的信息。基于此,作者提出了分层RAG(Layer-wise RAG, L-RAG)。与专注于生成新内部查询的传统方法不同,L-RAG 利用捕获下一跳信息的中间层表示来检索外部知识。这种方法使L-RAG在性能上接近多步方法的同时,保持了与标准RAG相当的推理开销。实验结果表明,L-RAG 在MuSiQue、HotpotQA和2WikiMultiHopQA等开放领域多跳问答数据集上优于现有RAG方法。
链接: https://arxiv.org/abs/2503.04796
作者: Jiaen Lin,Jingyu Liu
机构: Independent Researcher (独立研究员); Renmin University of China (中国人民大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) encounters challenges when addressing complex queries, particularly multi-hop questions. While several methods tackle multi-hop queries by iteratively generating internal queries and retrieving external documents, these approaches are computationally expensive. In this paper, we identify a three-stage information processing pattern in LLMs during layer-by-layer reasoning, consisting of extraction, processing, and subsequent extraction steps. This observation suggests that the representations in intermediate layers contain richer information compared to those in other layers. Building on this insight, we propose Layer-wise RAG (L-RAG). Unlike prior methods that focus on generating new internal queries, L-RAG leverages intermediate representations from the middle layers, which capture next-hop information, to retrieve external knowledge. L-RAG achieves performance comparable to multi-step approaches while maintaining inference overhead similar to that of standard RAG. Experimental results show that L-RAG outperforms existing RAG methods on open-domain multi-hop question-answering datasets, including MuSiQue, HotpotQA, and 2WikiMultiHopQA. The code is available in this https URL
zh
[NLP-109] Cyber for AI at SemEval-2025 Task 4: Forgotten but Not Lost: The Balancing Act of Selective Unlearning in Large Language Models
【速读】: 该论文旨在解决大型语言模型(LLMs)在处理敏感或过时数据时面临的隐私、伦理和合规性挑战,特别是当需要选择性遗忘(Selective Unlearning)特定数据时,传统重新训练方法因计算成本过高而不切实际。论文的关键解决方案在于通过全局权重调整(Global Weight Modification)实现选择性遗忘的有效性、知识保留以及目标模型遗忘后的实用性之间的平衡,并提出了针对具体任务的评估机制以验证其效果。实验结果显示,在测试集上7B和1B规模的目标模型分别达到了0.409和0.389的综合得分,表明该方法在可验证的LLMs选择性遗忘任务中具有良好的潜力。
链接: https://arxiv.org/abs/2503.04795
作者: Dinesh Srivasthav P,Bala Mallikarjunarao Garlapati
机构: TCS Research (塔塔咨询服务公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) face significant challenges in maintaining privacy, ethics, and compliance, when sensitive or obsolete data must be selectively removed. Retraining these models from scratch is computationally infeasible, necessitating efficient alternatives. As part of the SemEval 2025 Task 4, this work focuses on the application of selective unlearning in LLMs to address this challenge. In this paper, we present our experiments and findings, primarily leveraging global weight modification to achieve an equilibrium between effectiveness of unlearning, knowledge retention, and target model’s post-unlearning utility. We also detail the task-specific evaluation mechanism, results, and challenges. Our algorithms have achieved an aggregate score of 0.409 and 0.389 on the test set for 7B and 1B target models, respectively, demonstrating promising results in verifiable LLM unlearning.
zh
[NLP-110] Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference
【速读】: 该论文旨在解决现有奖励模型在从人类偏好数据集中学习以及通过强化学习优化语言模型时存在的两个主要问题:一是粗粒度奖励模型需要生成完整响应才能获得奖励值,导致稀疏奖励可能对下游强化学习带来挑战;二是近期尝试的基于标记级别的奖励模型因缺乏显式语义信息而难以有效建模每个标记的贡献。论文的关键解决方案是提出了一种中间粒度的奖励模型,通过将完整响应分割为句子,并对每个句子起始和结束位置的奖励输出应用差分操作来有效建模句子级别的奖励。此外,引入了一种新颖的注意力机制,将所有句子的分数聚合为响应级别的分数,使其能够使用Bradley-Terry模型进行训练。实验结果表明,该方法在RewardBench上的表现比响应级别奖励模型高出2.7%,并在AlpacaEval上超越了所有基线。
链接: https://arxiv.org/abs/2503.04793
作者: Wenjie Qiu,Yi-Chen Li,Xuqin Zhang,Tianyi Zhang,Yihang Zhang,Zongzhang Zhang,Yang Yu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Learning reward models from human preference datasets and subsequently optimizing language models via reinforcement learning has emerged as a fundamental paradigm for aligning LLMs with human preferences. The performance of the reward model plays a crucial role in the effectiveness of alignment. Previous reward models operate at a coarse-grained level, requiring the generation of a complete response to obtain a reward value. The sparse reward may present challenges for downstream reinforcement learning. While recent efforts have attempted to learn token-level reward models, the lack of explicit semantic information makes it difficult to model the credit of every individual token. In this paper, we propose assigning scores to every sentence, introducing an intermediate-grained reward model. By segmenting the complete response into sentences and applying differential operations to reward output at the start and end positions of each sentence, we can effectively model the rewards of sentences. Moreover, a novel attention mechanism is introduced to aggregate the scores of all sentences into a response-level score, which allows it to be trained using the Bradley-Terry model. On common benchmarks, our method outperforms the response-level reward model by 2.7% on RewardBench (for reward modeling evaluation) and surpasses all baselines on AlpacaEval (for alignment evaluation).
zh
[NLP-111] Cross-linguistic disagreement as a conflict of semantic alignment norms in multilingual AI~Linguistic Diversity as a Problem for Philosophy Cognitive Science and AI~
【速读】: 该论文试图解决多语言大型语言模型(Multilingual Large Language Models, LLMs)在跨语言知识转移过程中因语言间固有的语义差异而产生的冲突问题。具体而言,论文关注由语言学分歧导致的跨语言分歧(cross-linguistic disagreements),即纯粹由于相关概念的语义差异而在不同语言间产生的不一致。论文指出,这种分歧源于两种根本性的对齐规范之间的冲突:跨语言一致性(Cross-Linguistic Consistency, CL-consistency),即追求跨语言的普遍概念;以及与民间判断的一致性(Consistency with Folk Judgments, Folk-consistency),即尊重语言特定的语义规范。
解决方案的关键在于识别和分析这两种对齐规范之间的冲突,并通过实证研究揭示其影响。论文通过对英语和日语对话型多语言AI的回答进行考察,展示了即使是最先进的LLMs也表现出跨语言的不一致性和内部不连贯性。这一发现揭示了跨语言知识转移中的新型定性限制——即概念性的跨语言知识障碍(conceptual cross-linguistic knowledge barriers),挑战了普遍表示和跨语言迁移能力的内在可取性假设。此外,论文强调了开发者对齐策略之间的矛盾,提出了重要的规范性问题,不仅涉及技术层面的对齐挑战,还延伸至哲学、道德政治以及形而上学层面,呼吁多学科方法来平衡跨语言一致性与语言多样性尊重之间的实际利益与价值取向。
链接: https://arxiv.org/abs/2503.04792
作者: Masaharu Mizumoto,Dat Tien Nguyen,Justin Sytsma,Mark Alfano,Yu Izumi,Koji Fujita,Nguyen Le Minh
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multilingual large language models (LLMs) face an often-overlooked challenge stemming from intrinsic semantic differences across languages. Linguistic divergence can sometimes lead to cross-linguistic disagreements–disagreements purely due to semantic differences about a relevant concept. This paper identifies such disagreements as conflicts between two fundamental alignment norms in multilingual LLMs: cross-linguistic consistency (CL-consistency), which seeks universal concepts across languages, and consistency with folk judgments (Folk-consistency), which respects language-specific semantic norms. Through examining responses of conversational multilingual AIs in English and Japanese with the cases used in philosophy (cases of knowledge-how attributions), this study demonstrates that even state-of-the-art LLMs provide divergent and internally inconsistent responses. Such findings reveal a novel qualitative limitation in crosslingual knowledge transfer, or conceptual crosslingual knowledge barriers, challenging the assumption that universal representations and cross-linguistic transfer capabilities are inherently desirable. Moreover, they reveal conflicts of alignment policies of their developers, highlighting critical normative questions for LLM researchers and developers. The implications extend beyond technical alignment challenges, raising normative, moral-political, and metaphysical questions about the ideals underlying AI development–questions that are shared with philosophers and cognitive scientists but for which no one yet has definitive answers, inviting a multidisciplinary approach to balance the practical benefits of cross-linguistic consistency and respect for linguistic diversity.
zh
[NLP-112] SuperRAG : Beyond RAG with Layout-Aware Graph Modeling NAACL2025
【速读】: 该论文试图解决传统检索增强生成(RAG)方法在处理多模态信息时忽视文档布局结构的问题。传统RAG方法主要关注于扁平文本片段的处理,而未能充分利用多模态信息之间的关联性。为解决此问题,论文提出了一种基于布局感知图建模的方法,其关键是定义了一种基于文档布局解析的图建模结构,通过连接文本块、表格和图表来保留输入文档的结构。这种表示方法使得模型能够处理需要从多模态信息中获取复杂信息的问题。为验证图建模的有效性,论文开发了一个采用鲁棒组件的灵活RAG管道,并通过四个基准数据集的实验结果证明了布局感知建模对提升RAG管道性能的贡献。
链接: https://arxiv.org/abs/2503.04790
作者: Jeff Yang,Duy-Khanh Vu,Minh-Tien Nguyen,Xuan-Quang Nguyen,Linh Nguyen,Hung Le
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: NAACL 2025, Industry Track
点击查看摘要
Abstract:This paper introduces layout-aware graph modeling for multimodal RAG. Different from traditional RAG methods that mostly deal with flat text chunks, the proposed method takes into account the relationship of multimodalities by using a graph structure. To do that, a graph modeling structure is defined based on document layout parsing. The structure of an input document is retained with the connection of text chunks, tables, and figures. This representation allows the method to handle complex questions that require information from multimodalities. To confirm the efficiency of the graph modeling, a flexible RAG pipeline is developed using robust components. Experimental results on four benchmark test sets confirm the contribution of the layout-aware modeling for performance improvement of the RAG pipeline.
zh
[NLP-113] Ext2Gen: Alignment through Unified Extraction and Generation for Robust Retrieval-Augmented Generation
【速读】: 该论文试图解决 Retrieval-augmented generation (RAG) 模型在生成过程中因相关片段位置不确定性及检索诱导的信息过载导致生成结果脆弱、容易产生幻觉的问题。解决方案的关键在于提出了一种名为 Ext2Gen 的新型提取-再生成模型,通过先提取与查询相关的句子,再进行答案生成的方式增强 RAG 的鲁棒性。此外,为了优化该模型,采用了基于成对反馈学习的偏好对齐方法,使模型能够生成稳健的答案,而不受检索结果变化的影响。
链接: https://arxiv.org/abs/2503.04789
作者: Hwanjun Song,Jeonghwan Choi,Minseok Kim
机构: KAIST; Meta(元宇宙平台公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) enhances LLMs by integrating external knowledge, but generation remains fragile due to the uncertain placement of relevant chunks and retrieval-induced information overload, leading to hallucinations. We propose Ext2Gen, a novel extract-then-generate model that enhances RAG robustness by first extracting query-relevant sentences before generating answers. To optimize this model, we employ preference alignment through pairwise feedback learning, enabling the model to generate robust answers regardless of variations in retrieval results. Extensive experiments demonstrate that Ext2Gen effectively identifies query-relevant sentences with high precision and recall, leading to highly reliable answers. Furthermore, deploying our model in a RAG environment reveals that it not only boosts the performance of the base LLM but also synergizes with advanced retrieval strategies like query expansion. The dataset and model will be released soon.
zh
[NLP-114] AgroLLM : Connecting Farmers and Agricultural Practices through Large Language Models for Enhanced Knowledge Transfer and Practical Application
【速读】: 该论文旨在解决农业领域知识共享与教育过程中信息检索不准确的问题,通过构建一个基于大型语言模型(Large Language Models, LLMs)和 Retrieval-Augmented Generation (RAG) 框架的农业专用聊天机器人 AgroLLM,提供精准且上下文相关的响应。其解决方案的关键在于采用 FAISS 向量数据库实现高效相似性搜索,结合连续反馈机制优化响应质量,并通过引入 RAG 技术显著降低错误信息的检索概率,最终确保在农业四大核心领域(农业与生命科学、农业管理、农林结合以及农业商业)内的高精度响应表现,其中 ChatGPT-4o Mini 配合 RAG 的组合达到了 93% 的最高准确性。
链接: https://arxiv.org/abs/2503.04788
作者: Dinesh Jackson Samuel,Inna Skarga-Bandurova,David Sikolia,Muhammad Awais
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:AgroLLM is an AI-powered chatbot designed to enhance knowledge-sharing and education in agriculture using Large Language Models (LLMs) and a Retrieval-Augmented Generation (RAG) framework. By using a comprehensive open-source agricultural database, AgroLLM provides accurate, contextually relevant responses while reducing incorrect information retrieval. The system utilizes the FAISS vector database for efficient similarity searches, ensuring rapid access to agricultural knowledge. A comparative study of three advanced models: Gemini 1.5 Flash, ChatGPT-4o Mini, and Mistral-7B-Instruct-v0.2 was conducted to evaluate performance across four key agricultural domains: Agriculture and Life Sciences, Agricultural Management, Agriculture and Forestry, and Agriculture Business. Key evaluation metrics included embedding quality, search efficiency, and response relevance. Results indicated that ChatGPT-4o Mini with RAG achieved the highest accuracy at 93%. Continuous feedback mechanisms enhance response quality, making AgroLLM a benchmark AI-driven educational tool for farmers, researchers, and professionals, promoting informed decision-making and improved agricultural practices.
zh
[NLP-115] owards Anthropomorphic Conversational AI Part I: A Practical Framework
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在需要更强社交与对话智能以及更拟人化反应的应用场景中的局限性。尽管基础LLMs在多轮对话中表现出色,但在某些情况下,单次调用可能不足以满足复杂的人类交互需求。为弥合这一差距,论文提出了一种两阶段解决方案。关键在于第一阶段引入的多模块框架,该框架包括用于推理的思考模块(reasoning modules)、用于管理知识与外部信息的资源模块(resource modules),以及用于生成上下文适配交互的响应模块(response modules)。这些模块协同工作,使代理能够提供更加人性化且自然的对话体验。第二阶段计划利用经过筛选和标注的对话数据进行强化学习训练,以进一步优化人类偏好捕捉能力,但相关内容留待后续研究。实验结果显示,在未对LLM进行微调的情况下,该框架显著提升了对话的社交与会话智能水平。
链接: https://arxiv.org/abs/2503.04787
作者: Fei Wei,Yaliang Li,Bolin Ding
机构: Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models (LLMs), due to their advanced natural language capabilities, have seen significant success in applications where the user interface is usually a conversational artificial intelligence (AI) agent and engages the user through multi-round conversations. However, many scenarios require the agents to exhibit stronger social and conversational intelligence and demonstrate more human-like (anthropomorphic) reactions. This is an aspect that foundational LLMs have yet to fully address such that a single call of foundational models might be insufficient. To bridge this gap, we propose a two-stage solution. In this work, we focus on the first stage, introducing a multi-module framework designed to replicate the key aspects of human intelligence involved in conversations. This framework comprises thinking modules for reasoning, resource modules for managing knowledge and external information, and response modules for generating contextually appropriate interactions. With all the modules cooperating, the framework would empower the agents to provide a better human-like conversation experience. In the second stage of our approach, these conversational data, after filtering and labeling, can serve as training and testing data for reinforcement learning, enabling AI to better capture human preferences. This stage is left for future work. In our experiments, volunteers engaged in over 3000 rounds of conversation with the same AI character powered by a standalone LLM and our framework which integrates the same LLM. A separate group of evaluators rated the conversation samples, revealing that our framework significantly enhanced the social and conversational intelligence, even without fine-tuning the LLM. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2503.04787 [cs.CL] (or arXiv:2503.04787v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2503.04787 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-116] Analyzing the temporal dynamics of linguistic features contained in misinformation
【速读】: 该论文旨在研究误导性信息(Misinformation)在语言特征及来源上的时间动态变化,以帮助理解算法标签在评估内容准确性与来源可靠性方面如何随时间演变。论文的关键解决方案在于结合自然语言处理技术(Natural Language Processing, NLP),对PolitiFact从2010年至2024年的声明进行量化分析,通过考察不同时间段内误导性信息的语言特征及其来源的变化,揭示其与准确信息之间的差异。研究发现,随着时间推移,声明的整体情感倾向显著下降,并且与误导性信息相关的声明情感明显低于准确信息。此外,近期时间段内的信息更多来源于社交媒体和其他数字平台,这些来源包含高比例的负面情感误导性内容,而早期时间段的信息多来自个体来源(如政治家),且情感倾向更中性或积极。通过命名实体识别(Named-Entity Recognition, NER),进一步发现总统在职者和候选人更可能出现在误导性信息中,而美国各州更多出现在准确信息中。最后,误导性声明中的实体标签多与人和组织相关,而准确声明则更常包含数值型实体标签(如百分比和日期)。因此,该研究的关键在于利用NLP技术量化和解析误导性信息的时间动态特性,为算法标签设计提供基于时间维度的洞见。
链接: https://arxiv.org/abs/2503.04786
作者: Erik J Schlicht
机构: 未知
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:
点击查看摘要
Abstract:Consumption of misinformation can lead to negative consequences that impact the individual and society. To help mitigate the influence of misinformation on human beliefs, algorithmic labels providing context about content accuracy and source reliability have been developed. Since the linguistic features used by algorithms to estimate information accuracy can change across time, it is important to understand their temporal dynamics. As a result, this study uses natural language processing to analyze PolitiFact statements spanning between 2010 and 2024 to quantify how the sources and linguistic features of misinformation change between five-year time periods. The results show that statement sentiment has decreased significantly over time, reflecting a generally more negative tone in PolitiFact statements. Moreover, statements associated with misinformation realize significantly lower sentiment than accurate information. Additional analysis shows that recent time periods are dominated by sources from online social networks and other digital forums, such as blogs and viral images, that contain high levels of misinformation containing negative sentiment. In contrast, most statements during early time periods are attributed to individual sources (i.e., politicians) that are relatively balanced in accuracy ratings and contain statements with neutral or positive sentiment. Named-entity recognition was used to identify that presidential incumbents and candidates are relatively more prevalent in statements containing misinformation, while US states tend to be present in accurate information. Finally, entity labels associated with people and organizations are more common in misinformation, while accurate statements are more likely to contain numeric entity labels, such as percentages and dates.
zh
[NLP-117] Mapping Trustworthiness in Large Language Models : A Bibliometric Analysis Bridging Theory to Practice
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在实际应用中的可信性(trustworthiness)问题,具体关注可靠性、透明度、公平性和伦理对齐等方面的挑战。尽管LLMs已在多个领域得到广泛应用,但如何在实践中操作化这些可信性特征仍缺乏共识。论文的关键解决方案在于通过文献计量学映射分析(bibliometric mapping analysis)识别研究趋势与定义,并结合对68篇核心论文的系统回顾,提炼出LLMs可信性的主要维度(如能力、善意和正直)。此外,论文提出了覆盖LLM生命周期的20种增强信任的技术框架,包括检索增强生成(retrieval-augmented generation, RAG)、可解释性技术及训练后审计等,旨在弥合理论与实践之间的差距,推动更透明、负责任且符合伦理的LLM开发与部署。
链接: https://arxiv.org/abs/2503.04785
作者: José Siqueira de Cerqueira,Kai-Kristian Kemell,Rebekah Rousi,Nannan Xi,Juho Hamari,Pekka Abrahamsson
机构: Tampere University (坦佩雷大学); University of Vaasa (瓦萨大学); Tampere University (坦佩雷大学); Tampere University (坦佩雷大学); Tampere University (坦佩雷大学); Tampere University (坦佩雷大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:The rapid proliferation of Large Language Models (LLMs) has raised pressing concerns regarding their trustworthiness, spanning issues of reliability, transparency, fairness, and ethical alignment. Despite the increasing adoption of LLMs across various domains, there remains a lack of consensus on how to operationalize trustworthiness in practice. This study bridges the gap between theoretical discussions and implementation by conducting a bibliometric mapping analysis of 2,006 publications from 2019 to 2025. Through co-authorship networks, keyword co-occurrence analysis, and thematic evolution tracking, we identify key research trends, influential authors, and prevailing definitions of LLM trustworthiness. Additionally, a systematic review of 68 core papers is conducted to examine conceptualizations of trust and their practical implications. Our findings reveal that trustworthiness in LLMs is often framed through existing organizational trust frameworks, emphasizing dimensions such as ability, benevolence, and integrity. However, a significant gap exists in translating these principles into concrete development strategies. To address this, we propose a structured mapping of 20 trust-enhancing techniques across the LLM lifecycle, including retrieval-augmented generation (RAG), explainability techniques, and post-training audits. By synthesizing bibliometric insights with practical strategies, this study contributes towards fostering more transparent, accountable, and ethically aligned LLMs, ensuring their responsible deployment in real-world applications.
zh
[NLP-118] KunlunBaize: LLM with Multi-Scale Convolution and Multi-Token Prediction Under TransformerX Framework
【速读】: 该论文旨在解决大型语言模型(Large Language Models)在实际应用中面临的低计算效率、梯度消失以及难以有效捕捉复杂特征交互等挑战。为应对这些局限性,论文提出了一种创新框架,其关键在于引入了可学习的密集残差跳跃连接机制、TransformerX模块(一种集成了多尺度卷积与自适应激活函数的Transformer基组件)以及多令牌预测交互模块。其中,可学习的密集残差连接增强了跨层的信息流动与特征提取能力;TransformerX模块通过大尺寸卷积核聚合长文本语义信息,同时利用小尺寸卷积关注局部词序与句法结构;自适应激活函数可根据输入文本的语义特性动态调整参数,从而提升模型处理多样化语义表达及复杂关系的能力;多令牌预测模块则提高了数据利用率并加速推理过程。这些设计共同显著提升了大型语言模型的性能与效率。
链接: https://arxiv.org/abs/2503.04784
作者: Jiexiong Liu,Yixuan Chen,Yanqin Jia,Zhepeng Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages
点击查看摘要
Abstract:Large language models have demonstrated remarkable performance across various tasks, yet they face challenges such as low computational efficiency, gradient vanishing, and difficulties in capturing complex feature interactions. To address these limitations, a novel framework has been proposed. This framework incorporates a learnable dense residual skip connection mechanism, a TransformerX module a transformer based component integrating multiscale convolution and adaptive activation functions and a multitoken prediction interaction module. The learnable dense residual connections enhance information flow and feature capture across layers. Within the TransformerX module, large convolutional kernels aggregate semantic information from extensive text segments, while smaller convolutions focus on local word order and syntactic structures. The adaptive activation function dynamically adjusts its parameters based on the semantic features of the input text, improving the model’s ability to handle diverse semantic expressions and complex relationships. The multitoken prediction module boosts data utilization and accelerates inference by predicting multiple future tokens. These components significantly enhance the performance and efficiency of large language models.
zh
[NLP-119] Comparative Analysis Based on DeepSeek ChatGPT and Google Gemini: Features Techniques Performance Future Prospects
【速读】: 该论文旨在通过对比分析DeepSeek、ChatGPT和Google Gemini三种大型语言模型(LLM)的技术方法与应用特性,解决在不同领域任务中选择合适LLM技术的问题。论文的关键在于基于任务需求的数据选择标准、模型架构特点及其在推理效率和领域适用性上的比较,并通过数据集分析揭示各模型在多模态处理及特定应用场景下的优势与局限性,从而为LLM领域的研究提供方向性的指导与建议。
链接: https://arxiv.org/abs/2503.04783
作者: Anichur Rahman,Shahariar Hossain Mahir,Md Tanjum An Tashrif,Airin Afroj Aishi,Md Ahsan Karim,Dipanjali Kundu,Tanoy Debnath,Md. Abul Ala Moududi,MD. Zunead Abedin Eidmum
机构: Deptment of Computer Science and Engineering, National Institute of Textile Engineering and Research (NITER)(国立纺织工程与研究学院); Department of Computing and Information System, Daffodil International University(达芙妮国际大学); Department of Computer Science, Stony Brook University (石溪大学)(美国纽约州立大学石溪分校); Department of Internet of Things and Robotics Engineering, Bangabandhu Sheikh Mujibur Rahman Digital University (孟加拉国班加班杜数字大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Nowadays, DeepSeek, ChatGPT, and Google Gemini are the most trending and exciting Large Language Model (LLM) technologies for reasoning, multimodal capabilities, and general linguistic performance worldwide. DeepSeek employs a Mixture-of-Experts (MoE) approach, activating only the parameters most relevant to the task at hand, which makes it especially effective for domain-specific work. On the other hand, ChatGPT relies on a dense transformer model enhanced through reinforcement learning from human feedback (RLHF), and then Google Gemini actually uses a multimodal transformer architecture that integrates text, code, and images into a single framework. However, by using those technologies, people can be able to mine their desired text, code, images, etc, in a cost-effective and domain-specific inference. People may choose those techniques based on the best performance. In this regard, we offer a comparative study based on the DeepSeek, ChatGPT, and Gemini techniques in this research. Initially, we focus on their methods and materials, appropriately including the data selection criteria. Then, we present state-of-the-art features of DeepSeek, ChatGPT, and Gemini based on their applications. Most importantly, we show the technological comparison among them and also cover the dataset analysis for various applications. Finally, we address extensive research areas and future potential guidance regarding LLM-based AI research for the community.
zh
[NLP-120] Bangla Fake News Detection Based On Multichannel Combined CNN-LSTM
【速读】: 该论文旨在解决 Bengali 语系中假新闻识别的问题,特别是从未经验证的新闻来源中区分真实新闻与虚假新闻,以减少误导信息对社会造成的负面影响。论文的关键在于提出了一种结合卷积神经网络(CNN)和长短期记忆网络(LSTM)的多通道联合架构。其中,CNN 用于深层特征提取,而 LSTM 则利用这些特征进行假新闻检测。通过这种方法,论文在包含约 50k 条新闻的数据集上实现了 75.05% 的准确率,表明该模型在 Bengali 假新闻检测领域具有良好的性能表现。
链接: https://arxiv.org/abs/2503.04781
作者: Md. Zahin Hossain George,Naimul Hossain,Md. Rafiuzzaman Bhuiyan,Abu Kaisar Mohammad Masum,Sheikh Abujar
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 5 pages, 2 figures, 2 tables. THE 12th International Conference on Computing, Communication and Networking Technologies, 2021
点击查看摘要
Abstract:There have recently been many cases of unverified or misleading information circulating quickly over bogus web networks and news portals. This false news creates big damage to society and misleads people. For Example, in 2019, there was a rumor that the Padma Bridge of Bangladesh needed 100,000 human heads for sacrifice. This rumor turns into a deadly position and this misleading information takes the lives of innocent people. There is a lot of work in English but a few works in Bangla. In this study, we are going to identify the fake news from the unconsidered news source to provide the newsreader with natural news or real news. The paper is based on the combination of convolutional neural network (CNN) and long short-term memory (LSTM), where CNN is used for deep feature extraction and LSTM is used for detection using the extracted feature. The first thing we did to deploy this piece of work was data collection. We compiled a data set from websites and attempted to deploy it using the methodology of deep learning which contains about 50k of news. With the proposed model of Multichannel combined CNN-LSTM architecture, our model gained an accuracy of 75.05%, which is a good sign for detecting fake news in Bangla.
zh
[NLP-121] MV-CLAM: Multi-View Molecular Interpretation with Cross-Modal Projection via Language Model
【速读】: 该论文旨在解决现有分子-文本模型忽视分子不同视图间互补信息以及依赖单一视图表示的问题,这些问题限制了大型语言模型(LLMs)在分子理解方面的能力。具体而言,论文关注于如何通过多视图对齐来提升LLMs的分子推理能力,特别是在跨模态学习中实现细粒度对齐时面临的挑战,包括不一致的嵌入映射以及现有损失目标无法有效保留互补信息等问题。
为了解决上述问题,论文提出了一种名为MV-CLAM的新框架,其关键在于利用多查询变换器(Multi-Query Former, MQ-Former)将多视图分子表示对齐到统一的文本空间中。此方法不仅确保了跨视图的一致性,还通过基于标记级别的对比损失函数,在不同的文本查询之间保存了丰富的分子特征。这种方法显著增强了分子推理能力,从而提高了分子检索与描述的准确性。
链接: https://arxiv.org/abs/2503.04780
作者: Sumin Ha,Jun Hyeong Kim,Yinhua Piao,Sun Kim
机构: Interdisciplinary Program in Artificial Intelligence (交叉学科人工智能项目), Seoul National University (首尔国立大学); Bio-MAX/N-Bio (生物医学卓越中心/生物技术), Seoul National University (首尔国立大学); Department of Computer Science and Engineering (计算机科学与工程系), Seoul National University (首尔国立大学); Interdisciplinary Program in Bioinformatics (交叉学科生物信息学项目), Seoul National University (首尔国立大学); AIGENDRUG Co., Ltd. (爱根药物有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Atomic Physics (physics.atom-ph)
备注:
点击查看摘要
Abstract:Human expertise in chemistry and biomedicine relies on contextual molecular understanding, a capability that large language models (LLMs) can extend through fine-grained alignment between molecular structures and text. Recent multimodal learning advances focus on cross-modal alignment, but existing molecule-text models ignore complementary information in different molecular views and rely on single-view representations, limiting molecular understanding. Moreover, naïve multi-view alignment strategies face two challenges: (1) separate aligned spaces with inconsistent mappings between molecule and text embeddings, and that (2) existing loss objectives fail to preserve complementary information for fine-grained alignment. This can limit the LLM’s ability to fully understand the molecular properties. To address these issues, we propose MV-CLAM, a novel framework that aligns multi-view molecular representations into a unified textual space using a multi-query transformer (MQ-Former). Our approach ensures cross-view consistency while a token-level contrastive loss preserves diverse molecular features across textual queries. MV-CLAM enhances molecular reasoning, improving retrieval and captioning accuracy. The source code of MV-CLAM is available in this https URL.
zh
[NLP-122] Invisible Walls in Cities: Leverag ing Large Language Models to Predict Urban Segregation Experience with Social Media Content
【速读】: 该论文旨在解决城市日常生活中社会隔离现象的理解与预测问题,以应对社会不平等并促进包容性。然而,利用社交媒体上的大量用户生成评论数据面临数据量庞大、语义模糊及视角多样等挑战。为此,论文提出使用大型语言模型(Large Language Models, LLMs)来自动化挖掘这些在线评论,以实现对社会隔离的预测。
解决方案的关键在于设计了一个名为“Reflective LLM Coder”的系统,用于将社交媒体内容转化为与现实反馈一致的洞察,并最终生成一个包含文化共鸣与吸引力、可达性与便利性以及社区参与与本地投入等关键维度的编码簿(codebook)。此编码簿指导LLMs生成既具信息价值又可用于隔离预测的评论摘要和评分。此外,还设计了一个名为“REasoning-and-EMbedding (RE’EM)”的框架,结合语言模型的推理与嵌入能力,整合多通道特征进行隔离预测。实验结果显示,该框架显著提升了预测准确性,R²提高了22.79%,均方误差(MSE)降低了9.33%。用户研究进一步表明,基于编码簿的摘要能够增强人类参与者对地点兴趣点(POIs)社会属性的认知。这项研究标志着理解隐性社会壁垒与不平等的重要进展,展示了AI推动社会包容的巨大潜力。
链接: https://arxiv.org/abs/2503.04773
作者: Bingbing Fan,Lin Chen,Songwei Li,Jian Yuan,Fengli Xu,Pan Hui,Yong Li
机构: Tsinghua University (清华大学); The Hong Kong University of Science and Technology (香港科技大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注: 11 pages, 6 figures
点击查看摘要
Abstract:Understanding experienced segregation in urban daily life is crucial for addressing societal inequalities and fostering inclusivity. The abundance of user-generated reviews on social media encapsulates nuanced perceptions and feelings associated with different places, offering rich insights into segregation. However, leveraging this data poses significant challenges due to its vast volume, ambiguity, and confluence of diverse perspectives. To tackle these challenges, we propose using Large Language Models (LLMs) to automate online review mining for segregation prediction. We design a Reflective LLM Coder to digest social media content into insights consistent with real-world feedback, and eventually produce a codebook capturing key dimensions that signal segregation experience, such as cultural resonance and appeal, accessibility and convenience, and community engagement and local involvement. Guided by the codebook, LLMs can generate both informative review summaries and ratings for segregation prediction. Moreover, we design a REasoning-and-EMbedding (RE’EM) framework, which combines the reasoning and embedding capabilities of language models to integrate multi-channel features for segregation prediction. Experiments on real-world data demonstrate that our framework greatly improves prediction accuracy, with a 22.79% elevation in R2 and a 9.33% reduction in MSE. The derived codebook is generalizable across three different cities, consistently improving prediction this http URL, our user study confirms that the codebook-guided summaries provide cognitive gains for human participants in perceiving POIs’ social this http URL study marks an important step toward understanding implicit social barriers and inequalities, demonstrating the great potential of promoting social inclusiveness with AI.
zh
[NLP-123] DiMA: An LLM -Powered Ride-Hailing Assistant at DiDi
【速读】: 本文旨在解决网约车服务中动态复杂时空环境下,如何通过自然高效的对话交互提供无缝用户体验的问题。解决方案的关键在于提出了一个时空感知的订单规划模块(spatiotemporal-aware order planning module),该模块利用外部工具实现精确的时空推理和渐进式订单规划。同时,开发了一个成本效益高的对话系统,集成多类型对话回复器与成本感知的大语言模型(LLM)配置,以处理多样化的对话目标,并在响应质量和延迟之间进行权衡。此外,引入了一种持续微调方案,利用真实世界交互和模拟对话使助手的行为与人类偏好的决策过程保持一致。这些创新点共同构成了DiMA的核心竞争力。
链接: https://arxiv.org/abs/2503.04768
作者: Yansong Ning,Shuowei Cai,Wei Li,Jun Fang,Naiqiang Tan,Hua Chai,Hao Liu
机构: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Didichuxing Co. Ltd
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Under review, 18 pages
点击查看摘要
Abstract:On-demand ride-hailing services like DiDi, Uber, and Lyft have transformed urban transportation, offering unmatched convenience and flexibility. In this paper, we introduce DiMA, an LLM-powered ride-hailing assistant deployed in DiDi Chuxing. Its goal is to provide seamless ride-hailing services and beyond through a natural and efficient conversational interface under dynamic and complex spatiotemporal urban contexts. To achieve this, we propose a spatiotemporal-aware order planning module that leverages external tools for precise spatiotemporal reasoning and progressive order planning. Additionally, we develop a cost-effective dialogue system that integrates multi-type dialog repliers with cost-aware LLM configurations to handle diverse conversation goals and trade-off response quality and latency. Furthermore, we introduce a continual fine-tuning scheme that utilizes real-world interactions and simulated dialogues to align the assistant’s behavior with human preferred decision-making processes. Since its deployment in the DiDi application, DiMA has demonstrated exceptional performance, achieving 93% accuracy in order planning and 92% in response generation during real-world interactions. Offline experiments further validate DiMA capabilities, showing improvements of up to 70.23% in order planning and 321.27% in response generation compared to three state-of-the-art agent frameworks, while reducing latency by 0.72\times to 5.47\times . These results establish DiMA as an effective, efficient, and intelligent mobile assistant for ride-hailing services.
zh
[NLP-124] MiniF2F in Rocq: Automatic Translation Between Proof Assistants – A Case Study
【速读】: 该论文旨在解决将 MiniF2F 数学问题集翻译为 Rocq 格式的问题。论文的关键在于通过使用先进的大型语言模型(LLMs),采用从单次提示到多轮对话的逐步复杂化方法,结合自然语言描述、Lean 和 Isabelle 形式化表示作为输入源,成功翻译了 478 条定理中的绝大部分。这一方案的核心在于设计有效的提示工程与迭代反馈机制以提升翻译质量,并最终实现了高质量的自动化形式化数学内容转换。
链接: https://arxiv.org/abs/2503.04763
作者: Jules Viennot,Guillaume Baudart,Emilio Jesùs Gallego Arias,Marc Lelarge
机构: 未知
类目: Logic in Computer Science (cs.LO); Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注:
点击查看摘要
Abstract:In this work, we conduct an experiment using state-of-the-art LLMs to translate MiniF2F into Rocq. The translation task focuses on generating a Rocq theorem based on three sources: a natural language description, the Lean formalization, and the Isabelle formalization. We conducted our experiment in 3 stages of increasing complexity, from basic one-shot prompting to multi-turn conversations that incorporate feedback from unsuccessful attempts. At each stage, we perform multiple rounds of translation using increasingly advanced models: GPT-4o mini, Claude 3.5 Sonnet, o1 mini, and o1. We successfully translated 478 out of 488 theorems. The dataset is opensource: this https URL.
zh
[NLP-125] Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations
【速读】: 该论文试图解决的问题是如何系统性地衡量人工智能(Artificial Intelligence, AI)在经济各领域的实际使用模式。长期以来,关于AI对未来工作的潜在影响存在大量推测,但缺乏实证数据支持其具体应用场景与任务分布。为填补这一空白,论文提出了一种创新框架,通过分析美国劳工部O*NET数据库中的任务与职业信息,研究AI在超过四百万条在线对话中的使用情况。
解决方案的关键在于结合隐私保护技术,以自动化且精细的方式追踪AI在不同任务和职业中的应用。研究发现,AI的主要使用集中在软件开发和写作任务上,占总使用量的近一半;同时,约36%的职业至少在四分之一的任务中采用AI。此外,研究进一步揭示了AI使用的两种主要形式:57%的使用体现了增强人类能力(如迭代优化输出),而43%则表现为自动化(如低介入完成任务)。尽管数据和方法存在一定局限性,该框架仍为理解AI在经济中的动态角色提供了重要参考,并可作为预测未来影响的领先指标。
链接: https://arxiv.org/abs/2503.04761
作者: Kunal Handa,Alex Tamkin,Miles McCain,Saffron Huang,Esin Durmus,Sarah Heck,Jared Mueller,Jerry Hong,Stuart Ritchie,Tim Belonax,Kevin K. Troy,Dario Amodei,Jared Kaplan,Jack Clark,Deep Ganguli
机构: Anthropic
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Despite widespread speculation about artificial intelligence’s impact on the future of work, we lack systematic empirical evidence about how these systems are actually being used for different tasks. Here, we present a novel framework for measuring AI usage patterns across the economy. We leverage a recent privacy-preserving system to analyze over four million this http URL conversations through the lens of tasks and occupations in the U.S. Department of Labor’s O*NET Database. Our analysis reveals that AI usage primarily concentrates in software development and writing tasks, which together account for nearly half of all total usage. However, usage of AI extends more broadly across the economy, with approximately 36% of occupations using AI for at least a quarter of their associated tasks. We also analyze how AI is being used for tasks, finding 57% of usage suggests augmentation of human capabilities (e.g., learning or iterating on an output) while 43% suggests automation (e.g., fulfilling a request with minimal human involvement). While our data and methods face important limitations and only paint a picture of AI usage on a single platform, they provide an automated, granular approach for tracking AI’s evolving role in the economy and identifying leading indicators of future impact as these technologies continue to advance.
zh
[NLP-126] Peeking Behind Closed Doors: Risks of LLM Evaluation by Private Data Curators ICLR2025
【速读】: 该论文旨在探讨私有评估(private evaluations)在大型语言模型(Large Language Models, LLMs)评价中的潜在风险,并强调尽管私有评估可能在一定程度上缓解数据污染问题,但其引入了不可忽视的财务与评估风险。关键在于分析私有数据策展人(private data curators)与其服务的领先LLM公司之间的潜在利益冲突,以及由私有标注专家的主观偏好导致的对基于私有策展数据训练模型的固有评估偏差。论文为研究这些风险奠定了基础,以期引发广泛讨论并推动相关政策调整。
链接: https://arxiv.org/abs/2503.04756
作者: Hritik Bansal,Pratyush Maini
机构: UCLA (加州大学洛杉矶分校); CMU (卡内基梅隆大学); DatologyAI
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Published as a blogpost at ICLR 2025. Originally posted at this https URL
点击查看摘要
Abstract:The rapid advancement in building large language models (LLMs) has intensified competition among big-tech companies and AI startups. In this regard, model evaluations are critical for product and investment-related decision-making. While open evaluation sets like MMLU initially drove progress, concerns around data contamination and data bias have constantly questioned their reliability. As a result, it has led to the rise of private data curators who have begun conducting hidden evaluations with high-quality self-curated test prompts and their own expert annotators. In this paper, we argue that despite potential advantages in addressing contamination issues, private evaluations introduce inadvertent financial and evaluation risks. In particular, the key concerns include the potential conflict of interest arising from private data curators’ business relationships with their clients (leading LLM firms). In addition, we highlight that the subjective preferences of private expert annotators will lead to inherent evaluation bias towards the models trained with the private curators’ data. Overall, this paper lays the foundation for studying the risks of private evaluations that can lead to wide-ranging community discussions and policy changes.
zh
[NLP-127] NutriTransform: Estimating Nutritional Information From Online Food Posts
【速读】: 该论文旨在解决从在线食品帖子中推导营养信息的挑战,特别是在用户未明确记录共享餐食的宏量营养素(macronutrients)时。论文提出了一种高效且直观的方法,仅基于食品帖子的标题来估算宏量营养素。解决方案的关键在于结合美国农业部的公共食品数据库与先进的文本嵌入技术(advanced text embedding techniques),通过这种方法实现对宏量营养素的估算,并进一步分析Reddit上/r/food子论坛中超过500,000个真实帖子的食品分享行为趋势。这一工作为仅利用文本数据估计卡路里和营养成分的研究人员和实践者奠定了基础。
链接: https://arxiv.org/abs/2503.04755
作者: Thorsten Ruprechter,Marion Garaus,Ivo Ponocny,Denis Helic
机构: Graz University of Technology(Graz University of Technology); Sigmund Freud Private University(Vienna)(Sigmund Freud Private University); Graz University of Technology(Graz University of Technology); Graz University of Technology(Graz University of Technology)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: under review
点击查看摘要
Abstract:Deriving nutritional information from online food posts is challenging, particularly when users do not explicitly log the macro-nutrients of a shared meal. In this work, we present an efficient and straightforward approach to approximating macro-nutrients based solely on the titles of food posts. Our method combines a public food database from the U.S. Department of Agriculture with advanced text embedding techniques. We evaluate the approach on a labeled food dataset, demonstrating its effectiveness, and apply it to over 500,000 real-world posts from Reddit’s popular /r/food subreddit to uncover trends in food-sharing behavior based on the estimated macro-nutrient content. Altogether, this work lays a foundation for researchers and practitioners aiming to estimate caloric and nutritional content using only text data.
zh
[NLP-128] Sovereign Large Language Models : Advantages Strategy and Regulations
【速读】: 该论文旨在分析全球大型语言模型(Large Language Models, LLMs)发展的关键趋势、挑战、风险和机遇,同时评估国家在LLMs开发中的经验及投资可行性,并探索州层面实施、监管和融资AI项目的战略。论文试图解决的核心问题是如何在全球范围内有效推动LLMs的发展及其应用生态建设。解决方案的关键在于综合考虑技术、经济、政策等多维度因素,以制定切实可行的开发策略、监管框架和融资模式,从而平衡创新发展与潜在风险之间的关系。
链接: https://arxiv.org/abs/2503.04745
作者: Mykhailo Bondarenko,Sviatoslav Lushnei,Yurii Paniv,Oleksii Molchanovsky,Mariana Romanyshyn,Yurii Filipchuk,Artur Kiulian
机构: Ukrainian Catholic University; OpenBabylon Inc (OpenBabylon)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This report analyzes key trends, challenges, risks, and opportunities associated with the development of Large Language Models (LLMs) globally. It examines national experiences in developing LLMs and assesses the feasibility of investment in this sector. Additionally, the report explores strategies for implementing, regulating, and financing AI projects at the state level.
zh
[NLP-129] Standardizing Intelligence: Aligning Generative AI for Regulatory and Operational Compliance
【速读】: 本文旨在解决生成式 AI (GenAI) 模型在标准合规性任务中的应用挑战与机遇问题。关键在于通过计算方法将 GenAI 与技术标准(Technical Standards)对齐,以增强监管和运营合规性。论文评估了不同领域和技术标准的重要程度,并结合现有先进 GenAI 模型的合规能力进行分级,同时探讨了集成 GenAI 的潜在挑战与机会,提出了针对标准开发和使用者的可操作建议。研究强调,通过这种方法可以提升大型且更强大的 GenAI 系统的管理、监督及可信度。
链接: https://arxiv.org/abs/2503.04736
作者: Joseph Marvin Imperial,Matthew D. Jones,Harish Tayyar Madabushi
机构: UKRI CDT in Accountable, Responsible and Transparent AI (英国研究与创新署 accountable、responsible和transparent人工智能研究中心); Department of Computer Science (计算机科学系); Department of Life Sciences (生命科学系); University of Bath (巴斯大学), UK (英国)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Technical standards, or simply standards, are established documented guidelines and rules that facilitate the interoperability, quality, and accuracy of systems and processes. In recent years, we have witnessed an emerging paradigm shift where the adoption of generative AI (GenAI) models has increased tremendously, spreading implementation interests across standard-driven industries, including engineering, legal, healthcare, and education. In this paper, we assess the criticality levels of different standards across domains and sectors and complement them by grading the current compliance capabilities of state-of-the-art GenAI models. To support the discussion, we outline possible challenges and opportunities with integrating GenAI for standard compliance tasks while also providing actionable recommendations for entities involved with developing and using standards. Overall, we argue that aligning GenAI with standards through computational methods can help strengthen regulatory and operational compliance. We anticipate this area of research will play a central role in the management, oversight, and trustworthiness of larger, more powerful GenAI-based systems in the near future.
zh
[NLP-130] What can large language models do for sustainable food?
【速读】: 该论文试图解决食品生产对环境造成的影响问题,探索大型语言模型(Large Language Models, LLMs)在减少食品生产环境影响方面的潜在贡献。研究基于可持续食品领域的文献和领域专家的合作,定义了一组设计与预测任务类型,并评估了六种LLMs在四种任务上的表现。研究发现,LLMs在某些任务(如可持续蛋白质设计)中能显著提升效率,但在需要同时考虑人类满意度与气候影响的任务(如可持续菜单设计)中存在不足。为此,论文提出了一种将LLMs与组合优化技术结合的一般性框架,通过优化技术增强LLMs的推理能力。关键在于结合优化方法,使LLMs不仅提高效率,还能有效改善决策质量,最终在假设场景中将餐厅食品选择的排放降低了79%,同时保持参与者对其选择的满意度。
链接: https://arxiv.org/abs/2503.04734
作者: Anna T. Thomas,Adam Yee,Andrew Mayne,Maya B. Mathur,Dan Jurafsky,Kristina Gligorić
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Food systems are responsible for a third of human-caused greenhouse gas emissions. We investigate what Large Language Models (LLMs) can contribute to reducing the environmental impacts of food production. We define a typology of design and prediction tasks based on the sustainable food literature and collaboration with domain experts, and evaluate six LLMs on four tasks in our typology. For example, for a sustainable protein design task, food science experts estimated that collaboration with an LLM can reduce time spent by 45% on average, compared to 22% for collaboration with another expert human food scientist. However, for a sustainable menu design task, LLMs produce suboptimal solutions when instructed to consider both human satisfaction and climate impacts. We propose a general framework for integrating LLMs with combinatorial optimization to improve reasoning capabilities. Our approach decreases emissions of food choices by 79% in a hypothetical restaurant while maintaining participants’ satisfaction with their set of choices. Our results demonstrate LLMs’ potential, supported by optimization techniques, to accelerate sustainable food development and adoption.
zh
[NLP-131] WinClick: GUI Grounding with Multimodal Large Language Models
【速读】: 该论文旨在解决通用桌面环境中图形用户界面(GUI)代理在执行自动化任务时面临的GUI定位(GUI grounding)挑战。传统方法依赖于结构化数据格式(如DOM或HTML文件),但在Windows等通用桌面平台上,这些数据可能不可用。为应对这一问题,论文提出的关键解决方案是WinClick,这是一种专为Windows平台开发的新型视觉GUI代理。WinClick通过屏幕截图检测可操作区域,并结合GUI定位预训练技术提升性能。此外,论文引入了一种基于大型语言模型(LLM)的方法来对齐GUI定位数据。为了评估方案效果,作者还推出了WinSpot,首个针对Windows平台的全面GUI定位基准。实验结果表明,结合GUI定位预训练的WinClick显著优于现有基线,为桌面环境中的GUI自动化提供了一种可扩展的解决方案。
链接: https://arxiv.org/abs/2503.04730
作者: Zheng Hui,Yinheng Li,Dan zhao,Tianyi Chen,Colby Banbury,Kazuhito Koishida
机构: Microsoft (微软); Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Graphical User Interface (GUI) tasks are vital for automating workflows such as software testing, user interface navigation. For users, the GUI is the most intuitive platform for interacting with a computer. Previous work identified a key challenge in developing visual GUI agents: GUI grounding - the ability to accurately locate screen elements based on instructions. However, most existing GUI agents rely on structured data formats like DOM or HTML files in training or inferencing, which are inaccessible across all applications, particular in a general desktop environments such as Windows OS. To address this, we introduce WinClick, a novel visual GUI agent developed in Windows platform. WinClick leverages screenshots to detect actionable regions. To overcome the challenge of GUI grounding, we enhance WinClick with GUI grounding pre-training and propose an LLM-based method for aligning GUI grounding data. Additionally, we introduce WinSpot, the first comprehensive benchmark for GUI grounding on Windows. Our experiments demonstrate that WinClick, combined with GUI grounding pre-training, significantly outperforms existing baselines, offering a scalable solution for GUI automation in desktop environments. WinSpot is publicly available at this https URL.
zh
[NLP-132] Leverag ing Large Language Models For Optimized Item Categorization using UNSPSC Taxonomy ICSE2024
【速读】: 该论文试图解决物品分类(Item Categorization)缺乏统一标准且高度主观的问题,同时减轻基于联合国标准产品与服务分类代码(UNSPC)进行手动分类所需的大量人工投入。论文的关键解决方案在于探索利用大型语言模型(Large Language Models, LLMs)自动化地将物品描述(Item Descriptions)映射到UNSPSC代码,通过评估其在不同数据集上的分类准确性和效率,验证LLMs作为标准化库存分类工具的潜力。实验结果表明,LLMs不仅能显著减少人工劳动,还能保持高精度,提供了一种可扩展的商业应用方案。
链接: https://arxiv.org/abs/2503.04728
作者: Anmolika Singh,Yuhang Diao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 Pages, International Conference on NLP, AI, Computer Science Engineering (NLAICSE 2024), December 2024, ISBN : 978-1-923107-45-8
点击查看摘要
Abstract:Effective item categorization is vital for businesses, enabling the transformation of unstructured datasets into organized categories that streamline inventory management. Despite its importance, item categorization remains highly subjective and lacks a uniform standard across industries and businesses. The United Nations Standard Products and Services Code (UNSPSC) provides a standardized system for cataloguing inventory, yet employing UNSPSC categorizations often demands significant manual effort. This paper investigates the deployment of Large Language Models (LLMs) to automate the classification of inventory data into UNSPSC codes based on Item Descriptions. We evaluate the accuracy and efficiency of LLMs in categorizing diverse datasets, exploring their language processing capabilities and their potential as a tool for standardizing inventory classification. Our findings reveal that LLMs can substantially diminish the manual labor involved in item categorization while maintaining high accuracy, offering a scalable solution for businesses striving to enhance their inventory management practices.
zh
[NLP-133] Nature-Inspired Population-Based Evolution of Large Language Models
【速读】: 本文旨在解决大型语言模型(Large Language Models, LLMs)在面对新任务时的快速适应与优化问题。论文提出了一种基于种群的进化框架,其关键在于通过模拟生物进化的机制实现LLMs的高效演化。具体而言,该框架以一组初始的父代LLMs为基础,利用四种核心操作:(i) 交叉(crossover),将不同父代模型的权重融合生成子代LLMs;(ii) 变异(mutation),对模型权重引入随机小扰动以促进多样性;(iii) 选择(selection),优先保留性能优异的模型;(iv) 接替(succession),将父代模型的知识迁移到子代模型中。通过这种方式,仅需少量样本(每项新任务200个样本),该框架即可使LLMs快速适应任务需求,而无需依赖梯度更新。实验结果表明,该方法在12个数据集上的表现显著优于现有多LLM合并与适配方法,最高可提升54.8%的准确性,并支持多任务同时演化及零样本泛化至未见任务。
链接: https://arxiv.org/abs/2503.01155
作者: Yiqun Zhang,Peng Ye,Xiaocui Yang,Shi Feng,Shufei Zhang,Lei Bai,Wanli Ouyang,Shuyue Hu
机构: Northeastern University, China (东北大学, 中国); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: preprint
点击查看摘要
Abstract:Evolution, the engine behind the survival and growth of life on Earth, operates through the population-based process of reproduction. Inspired by this principle, this paper formally defines a newly emerging problem – the population-based evolution of large language models (LLMs) – and introduces a novel framework. Starting with a population of parent LLMs, our framework enables the population to evolve through four key operations: (i) crossover, merging the weights of different parents to create offspring LLMs, (ii) mutation, introducing small, random changes to model weights to foster diversity, (iii) selection, prioritizing high-performing models, and (iv) succession, transferring the learned experience from parent to offspring LLMs. With only 200 samples per new task, the LLM population evolves rapidly to adapt to the task at hand, without any gradients. Experiments on 12 datasets show that our framework consistently outperforms existing multi-LLM merging and adaptation methods, achieving accuracy gains of up to 54.8% over the best LLM in the initial population. Moreover, our framework allows for the evolution of LLMs across multiple new tasks simultaneously, scaling effectively with populations of up to 40 LLMs, and even zero-shot generalization to unseen held-out tasks. We have open-sourced the code on GitHub and released the weights of 10 parent LLMs, fine-tuned from gemma-2-2b-it, on HuggingFace , enabling reproduction of our proposed framework using just a single 4090 GPU with 24GB memory, without any performance degradation.
zh
计算机视觉
[CV-0] GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving
【速读】:该论文旨在解决自动驾驶场景中多模态轨迹生成的问题,特别是现有方法因轨迹发散性和指导信息与场景不一致导致的轨迹质量下降及选择复杂性。论文提出了一种端到端的生成式自动驾驶方法GoalFlow,其关键是通过引入目标点(goal point)约束生成过程以解决基于扩散模型的轨迹发散问题,并结合高效的Flow Matching生成方法与精炼的评分机制,从候选轨迹中选择最优解。此外,GoalFlow提出了一种新颖的目标点选择机制,依据场景信息从候选点中选取最适合作为目标的点,从而有效提升轨迹质量和多样性。实验结果表明,GoalFlow在Navsim数据集上实现了最先进的性能,PDMS达到90.3,显著优于其他方法,且仅需单步去噪即可实现卓越表现。
链接: https://arxiv.org/abs/2503.05689
作者: Zebin Xing,Xingyu Zhang,Yang Hu,Bo Jiang,Tong He,Qian Zhang,Xiaoxiao Long,Wei Yin
机构: School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Horizon Robotics (地平线机器人); Nanjing University (南京大学); Huazhong University of Science & Technology (华中科技大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We propose GoalFlow, an end-to-end autonomous driving method for generating high-quality multimodal trajectories. In autonomous driving scenarios, there is rarely a single suitable trajectory. Recent methods have increasingly focused on modeling multimodal trajectory distributions. However, they suffer from trajectory selection complexity and reduced trajectory quality due to high trajectory divergence and inconsistencies between guidance and scene information. To address these issues, we introduce GoalFlow, a novel method that effectively constrains the generative process to produce high-quality, multimodal trajectories. To resolve the trajectory divergence problem inherent in diffusion-based methods, GoalFlow constrains the generated trajectories by introducing a goal point. GoalFlow establishes a novel scoring mechanism that selects the most appropriate goal point from the candidate points based on scene information. Furthermore, GoalFlow employs an efficient generative method, Flow Matching, to generate multimodal trajectories, and incorporates a refined scoring mechanism to select the optimal trajectory from the candidates. Our experimental results, validated on the Navsim\citeDauner2024_navsim, demonstrate that GoalFlow achieves state-of-the-art performance, delivering robust multimodal trajectories for autonomous driving. GoalFlow achieved PDMS of 90.3, significantly surpassing other methods. Compared with other diffusion-policy-based methods, our approach requires only a single denoising step to obtain excellent performance. The code is available at this https URL.
zh
[CV-1] Fairness-Aware Low-Rank Adaptation Under Demographic Privacy Constraints
【速读】:该论文试图解决在分布式环境下,如何在不访问敏感属性或其预测器的前提下,通过低秩适配(Low-Rank Adaptation, LoRA)方法实现公平性意识的微调,以开发公平的预训练基础模型。传统公平性微调方法依赖于直接访问敏感属性或其预测器,但在实际应用中,这些敏感属性通常受到严格的消费者隐私保护限制,导致模型开发者无法获取这些信息,从而阻碍了公平模型的开发。
解决方案的关键在于提出了一组基于LoRA的微调方法,允许模型开发者与公平性审计人员在不共享敏感属性或其预测器的情况下进行协作。具体而言,论文评估了三种方法:敏感属性遗忘(sensitive unlearning)、对抗训练(adversarial training)和正交性损失(orthogonality loss),并与公平性无意识的基线进行了对比。实验结果显示,在CelebA和UTK-Face数据集上使用ImageNet预训练的ViT-Base模型时,正交性损失能够在减少偏见的同时保持或提升模型效用,而对抗训练在某些情况下能够改善错误阳性率均等性和人口均等性,但敏感属性遗忘未表现出明显优势。这表明分布式公平性意识的微调方法能够在不损害消费者隐私的前提下有效消除偏见,并在大多数情况下提高模型效用。
链接: https://arxiv.org/abs/2503.05684
作者: Parameswaran Kamalaruban,Mark Anderson,Stuart Burrell,Maeve Madigan,Piotr Skalski,David Sutton
机构: Innovation Lab, Featurespace (创新实验室, Featurespace)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Pre-trained foundation models can be adapted for specific tasks using Low-Rank Adaptation (LoRA). However, the fairness properties of these adapted classifiers remain underexplored. Existing fairness-aware fine-tuning methods rely on direct access to sensitive attributes or their predictors, but in practice, these sensitive attributes are often held under strict consumer privacy controls, and neither the attributes nor their predictors are available to model developers, hampering the development of fair models. To address this issue, we introduce a set of LoRA-based fine-tuning methods that can be trained in a distributed fashion, where model developers and fairness auditors collaborate without sharing sensitive attributes or predictors. In this paper, we evaluate three such methods - sensitive unlearning, adversarial training, and orthogonality loss - against a fairness-unaware baseline, using experiments on the CelebA and UTK-Face datasets with an ImageNet pre-trained ViT-Base model. We find that orthogonality loss consistently reduces bias while maintaining or improving utility, whereas adversarial training improves False Positive Rate Parity and Demographic Parity in some cases, and sensitive unlearning provides no clear benefit. In tasks where significant biases are present, distributed fairness-aware fine-tuning methods can effectively eliminate bias without compromising consumer privacy and, in most cases, improve model utility.
zh
[CV-2] AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data CVPR2025
【速读】:该论文旨在解决现有提升模型公平性的方法在合成数据多样性和质量上的局限性,以及对人口统计学标签依赖的问题。这些局限性导致模型公平性和整体准确性受损。论文提出了一种名为AIM-Fair的方法,通过从初始训练于真实世界数据但未标注人口统计学信息的有偏模型出发,利用先进的扩散模型生成无偏合成数据进行微调,从而提高模型的公平性。关键在于解决了合成数据质量问题(通过Contextual Synthetic Data Generation, CSDG)以及真实与合成数据之间的领域和偏见差距(通过引入选择性微调方案)。实验表明,AIM-Fair在保持模型实用性的同时显著提升了公平性。
链接: https://arxiv.org/abs/2503.05665
作者: Zengqun Zhao,Ziquan Liu,Yu Cao,Shaogang Gong,Ioannis Patras
机构: Centre for Multimodal AI, Queen Mary University of London (伦敦玛丽女王大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at CVPR 2025. Github: this https URL . Project page: this https URL
点击查看摘要
Abstract:Recent advances in generative models have sparked research on improving model fairness with AI-generated data. However, existing methods often face limitations in the diversity and quality of synthetic data, leading to compromised fairness and overall model accuracy. Moreover, many approaches rely on the availability of demographic group labels, which are often costly to annotate. This paper proposes AIM-Fair, aiming to overcome these limitations and harness the potential of cutting-edge generative models in promoting algorithmic fairness. We investigate a fine-tuning paradigm starting from a biased model initially trained on real-world data without demographic annotations. This model is then fine-tuned using unbiased synthetic data generated by a state-of-the-art diffusion model to improve its fairness. Two key challenges are identified in this fine-tuning paradigm, 1) the low quality of synthetic data, which can still happen even with advanced generative models, and 2) the domain and bias gap between real and synthetic data. To address the limitation of synthetic data quality, we propose Contextual Synthetic Data Generation (CSDG) to generate data using a text-to-image diffusion model (T2I) with prompts generated by a context-aware LLM, ensuring both data diversity and control of bias in synthetic data. To resolve domain and bias shifts, we introduce a novel selective fine-tuning scheme in which only model parameters more sensitive to bias and less sensitive to domain shift are updated. Experiments on CelebA and UTKFace datasets show that our AIM-Fair improves model fairness while maintaining utility, outperforming both fully and partially fine-tuned approaches to model fairness.
zh
[CV-3] NoT: Federated Unlearning via Weight Negation
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)模型中参与者数据撤回(unlearning)的问题,即如何从已训练的FL模型中移除特定参与者的贡献数据,同时确保隐私保护和法规合规性。传统联邦未学习(Federated Unlearning, FU)方法通常依赖于客户端或服务器端的辅助存储,或者需要直接访问待移除的数据,但这些条件在数据不可用时难以满足。为此,论文提出了一种名为NoT的新算法,其关键是通过权重取反(将权重乘以-1)的方式实现未学习过程,无需额外存储或访问目标数据。这种方法的关键在于通过扰动模型参数远离最优解集,但仍保持快速重新优化的能力,从而在理论层面证明了权重取反能够有效破坏层间协同适应性(inter-layer co-adaptation),实现未学习效果的同时保留近似最优性,支持快速恢复。实验结果表明,NoT在未学习效能以及通信和计算效率方面显著优于现有基准方法。
链接: https://arxiv.org/abs/2503.05657
作者: Yasser H. Khalil,Leo Brunswic,Soufiane Lamghari,Xu Li,Mahdi Beitollahi,Xi Chen
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室), Montreal, Canada; Huawei Technologies Canada Inc. (华为技术加拿大公司), Ottawa, Canada
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: The 42nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville TN, US. 2025
点击查看摘要
Abstract:Federated unlearning (FU) aims to remove a participant’s data contributions from a trained federated learning (FL) model, ensuring privacy and regulatory compliance. Traditional FU methods often depend on auxiliary storage on either the client or server side or require direct access to the data targeted for removal-a dependency that may not be feasible if the data is no longer available. To overcome these limitations, we propose NoT, a novel and efficient FU algorithm based on weight negation (multiplying by -1), which circumvents the need for additional storage and access to the target data. We argue that effective and efficient unlearning can be achieved by perturbing model parameters away from the set of optimal parameters, yet being well-positioned for quick re-optimization. This technique, though seemingly contradictory, is theoretically grounded: we prove that the weight negation perturbation effectively disrupts inter-layer co-adaptation, inducing unlearning while preserving an approximate optimality property, thereby enabling rapid recovery. Experimental results across three datasets and three model architectures demonstrate that NoT significantly outperforms existing baselines in unlearning efficacy as well as in communication and computational efficiency.
zh
[CV-4] BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities
【速读】:该论文旨在解决移动操作机器人在现实世界家庭任务中面临的重大挑战。现有机器人基准分析表明,成功完成任务依赖于三种关键全身控制能力:双手协调、稳定且精确的导航以及广泛的末端执行器可达性。实现这些能力需要精心设计硬件,但由此产生的系统复杂性进一步增加了视觉运动策略学习的难度。为应对这些挑战,论文引入了BEHAVIOR机器人套件(BRS),这是一个针对多样化家庭任务全身操作的综合框架。BRS基于双臂带轮机器人和4自由度躯干构建,并集成了用于数据收集的成本效益全身遥操作接口以及一种用于学习全身视觉运动策略的新算法。论文评估了BRS在五个具有挑战性的家庭任务中的表现,这些任务不仅强调了三种核心能力,还引入了额外的复杂性,如长距离导航、与关节和可变形物体交互以及在受限空间中的操作。论文认为,BRS集成的机器人形态、数据收集接口和学习框架标志着迈向实现日常家庭任务真实世界全身操作的重要一步。BRS已开源。
关键解决方案在于通过BEHAVIOR机器人套件整合全身遥操作接口和新型学习算法,以克服硬件复杂性和学习难度的双重挑战,从而有效提升机器人在复杂家庭环境中的操作能力。
链接: https://arxiv.org/abs/2503.05652
作者: Yunfan Jiang,Ruohan Zhang,Josiah Wong,Chen Wang,Yanjie Ze,Hang Yin,Cem Gokmen,Shuran Song,Jiajun Wu,Li Fei-Fei
机构: Stanford University (斯坦福大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project website: this https URL
点击查看摘要
Abstract:Real-world household tasks present significant challenges for mobile manipulation robots. An analysis of existing robotics benchmarks reveals that successful task performance hinges on three key whole-body control capabilities: bimanual coordination, stable and precise navigation, and extensive end-effector reachability. Achieving these capabilities requires careful hardware design, but the resulting system complexity further complicates visuomotor policy learning. To address these challenges, we introduce the BEHAVIOR Robot Suite (BRS), a comprehensive framework for whole-body manipulation in diverse household tasks. Built on a bimanual, wheeled robot with a 4-DoF torso, BRS integrates a cost-effective whole-body teleoperation interface for data collection and a novel algorithm for learning whole-body visuomotor policies. We evaluate BRS on five challenging household tasks that not only emphasize the three core capabilities but also introduce additional complexities, such as long-range navigation, interaction with articulated and deformable objects, and manipulation in confined spaces. We believe that BRS’s integrated robotic embodiment, data collection interface, and learning framework mark a significant step toward enabling real-world whole-body manipulation for everyday household tasks. BRS is open-sourced at this https URL
zh
[CV-5] VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control
【速读】:该论文旨在解决视频修复(Video Inpainting)中的两个关键挑战:一是生成完全被遮挡物体的内容,二是平衡背景上下文保留与前景生成之间的相互竞争目标。为了解决这些问题,论文提出了一种名为VideoPainter的新颖双流范式。其关键是引入了一个高效的上下文编码器(仅占主干网络参数的6%),用于处理被遮罩的视频,并向任何预训练的视频DiT注入与主干网络相关的背景上下文线索,从而以即插即用的方式生成语义一致的内容。这种架构分离显著降低了模型的学习复杂性,同时实现了关键背景上下文的细微集成。此外,论文还提出了一个新的目标区域ID重采样技术,支持任意长度的视频修复,极大地提升了实际应用能力。
链接: https://arxiv.org/abs/2503.05639
作者: Yuxuan Bian,Zhaoyang Zhang,Xuan Ju,Mingdeng Cao,Liangbin Xie,Ying Shan,Qiang Xu
机构: The Chinese University of Hong Kong (香港中文大学); Tencent ARC Lab (腾讯ARC实验室); The University of Tokyo (东京大学); University of Macau (澳门大学); Claude (Claude); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Project page available at this https URL
点击查看摘要
Abstract:Video inpainting, which aims to restore corrupted video content, has experienced substantial progress. Despite these advances, existing methods, whether propagating unmasked region pixels through optical flow and receptive field priors, or extending image-inpainting models temporally, face challenges in generating fully masked objects or balancing the competing objectives of background context preservation and foreground generation in one model, respectively. To address these limitations, we propose a novel dual-stream paradigm VideoPainter that incorporates an efficient context encoder (comprising only 6% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues to any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model’s learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing our practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench to facilitate segmentation-based inpainting training and assessment, the largest video inpainting dataset and benchmark to date with over 390K diverse clips. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential. Extensive experiments demonstrate VideoPainter’s superior performance in both any-length video inpainting and editing, across eight key metrics, including video quality, mask region preservation, and textual coherence.
zh
[CV-6] rajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
【速读】:该论文旨在解决单目视频中相机轨迹重定向的问题,即如何在仅凭单目视频输入的情况下,实现用户指定的精确相机视点变换。解决方案的关键在于提出了一种新颖的双流条件视频扩散模型(dual-stream conditional video diffusion model),通过解耦确定性视点变换与随机内容生成,同时结合点云渲染和源视频作为条件,确保视点变换的准确性以及四维内容生成的一致性。此外,作者创新性地采用双重新投影策略(double-reprojection strategy)构建了一个包含网络规模单目视频与静态多视角数据的混合训练集,显著提升了方法在多样化场景中的泛化能力。
链接: https://arxiv.org/abs/2503.05638
作者: Mark YU,Wenbo Hu,Jinbo Xing,Ying Shan
机构: ARC Lab, Tencent PCG (腾讯互娱 ARC 实验室); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Project webpage: this https URL
点击查看摘要
Abstract:We present TrajectoryCrafter, a novel approach to redirect camera trajectories for monocular videos. By disentangling deterministic view transformations from stochastic content generation, our method achieves precise control over user-specified camera trajectories. We propose a novel dual-stream conditional video diffusion model that concurrently integrates point cloud renders and source videos as conditions, ensuring accurate view transformations and coherent 4D content generation. Instead of leveraging scarce multi-view videos, we curate a hybrid training dataset combining web-scale monocular videos with static multi-view datasets, by our innovative double-reprojection strategy, significantly fostering robust generalization across diverse scenes. Extensive evaluations on multi-view and large-scale monocular videos demonstrate the superior performance of our method.
zh
[CV-7] Joint 3D Point Cloud Segmentation using Real-Sim Loop: From Panels to Trees and Branches ICRA2025
【速读】:该论文旨在解决现代果园点云数据从Panel到Tree和Branch(P2TB)的联合分割问题,现有方法主要依赖于单一实例分割的深度网络序列来执行联合任务,这种方法存在误差累积、标注与计算成本增加以及扩展性不足的问题。论文的关键解决方案在于提出了一种结合Real2Sim L-TreeGen用于训练数据生成的新方法,以及针对P2TB任务设计的联合模型J-P2TB。通过在生成的仿真数据集上训练J-P2TB模型,并利用零样本学习实现真实场景下的Panel点云联合分割,显著提升了分割性能,同时减少了40%的学习参数量,验证了L-TreeGen在模型训练中的有效性及J-P2TB在实际应用中的准确性、效率和泛化能力。这一创新不仅推动了自动化果园机器人技术的发展,也为数字孪生技术的进步提供了支持。
链接: https://arxiv.org/abs/2503.05630
作者: Tian Qiu,Ruiming Du,Nikolai Spine,Lailiang Cheng,Yu Jiang
机构: Cornell University (康奈尔大学); Cornell University (康奈尔大学); Cornell University (康奈尔大学); Cornell University (康奈尔大学); Cornell University (康奈尔大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: Accepted by ICRA 2025
点击查看摘要
Abstract:Modern orchards are planted in structured rows with distinct panel divisions to improve management. Accurate and efficient joint segmentation of point cloud from Panel to Tree and Branch (P2TB) is essential for robotic operations. However, most current segmentation methods focus on single instance segmentation and depend on a sequence of deep networks to perform joint tasks. This strategy hinders the use of hierarchical information embedded in the data, leading to both error accumulation and increased costs for annotation and computation, which limits its scalability for real-world applications. In this study, we proposed a novel approach that incorporated a Real2Sim L-TreeGen for training data generation and a joint model (J-P2TB) designed for the P2TB task. The J-P2TB model, trained on the generated simulation dataset, was used for joint segmentation of real-world panel point clouds via zero-shot learning. Compared to representative methods, our model outperformed them in most segmentation metrics while using 40% fewer learnable parameters. This Sim2Real result highlighted the efficacy of L-TreeGen in model training and the performance of J-P2TB for joint segmentation, demonstrating its strong accuracy, efficiency, and generalizability for real-world applications. These improvements would not only greatly benefit the development of robots for automated orchard operations but also advance digital twin technology.
zh
[CV-8] FMT:A Multimodal Pneumonia Detection Model Based on Stacking MOE Framework
【速读】:该论文旨在解决传统多模态方法在处理医学影像分析中因数据不完整和模态缺失导致性能下降的问题。为应对这些挑战,论文提出了一种名为Flexible Multimodal Transformer (FMT) 的方法,其关键是结合ResNet-50和BERT进行联合表示学习,并采用动态掩码注意力策略模拟临床中的模态丢失以增强鲁棒性,最终利用序列混合专家(Mixture of Experts, MOE)架构实现多层次决策优化。实验结果显示,FMT在小规模多模态肺炎数据集上取得了最先进的性能,显著优于单模态基线及医学基准模型CheXMed,为资源受限环境下的多模态肺炎诊断提供了可扩展解决方案。
链接: https://arxiv.org/abs/2503.05626
作者: Jingyu Xu,Yang Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Artificial intelligence has shown the potential to improve diagnostic accuracy through medical image analysis for pneumonia diagnosis. However, traditional multimodal approaches often fail to address real-world challenges such as incomplete data and modality loss. In this study, a Flexible Multimodal Transformer (FMT) was proposed, which uses ResNet-50 and BERT for joint representation learning, followed by a dynamic masked attention strategy that simulates clinical modality loss to improve robustness; finally, a sequential mixture of experts (MOE) architecture was used to achieve multi-level decision refinement. After evaluation on a small multimodal pneumonia dataset, FMT achieved state-of-the-art performance with 94% accuracy, 95% recall, and 93% F1 score, outperforming single-modal baselines (ResNet: 89%; BERT: 79%) and the medical benchmark CheXMed (90%), providing a scalable solution for multimodal diagnosis of pneumonia in resource-constrained medical settings.
zh
[CV-9] Conformal Prediction for Image Segmentation Using Morphological Prediction Sets
【速读】:本文旨在解决图像分割任务中由数据标注过程或训练数据采样等多源不确定性带来的挑战。论文专注于二值分割,并通过一致性预测(Conformal Prediction)方法来应对这些挑战。一致性预测是一类模型和数据无关的不确定性量化方法,能够提供有限样本下的理论保证,并适用于任何预训练的预测器。解决方案的关键在于计算非一致性分数(一种预测残差),该分数基于保留的校准数据集计算得出,而不依赖于训练数据。此外,利用数学形态学中的膨胀操作构造一个附加到预测分割掩码边界的边界框。在推理阶段,包含掩码及其边界框的预测集合以用户指定的置信水平高概率覆盖真实掩码。边界框的大小作为特定模型和数据集预测不确定性的指标。由于方法仅需预测掩码而无需预测器反馈,因此适用于任何分割模型,包括基于深度学习的模型。本文通过多个医学影像应用验证了该方法的有效性。
链接: https://arxiv.org/abs/2503.05618
作者: Luca Mossina,Corentin Friedrich
机构: IRT Saint Exupéry (IRT 圣埃克苏佩里)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Image segmentation is a challenging task influenced by multiple sources of uncertainty, such as the data labeling process or the sampling of training data. In this paper we focus on binary segmentation and address these challenges using conformal prediction, a family of model- and data-agnostic methods for uncertainty quantification that provide finite-sample theoretical guarantees and applicable to any pretrained predictor. Our approach involves computing nonconformity scores, a type of prediction residual, on held-out calibration data not used during training. We use dilation, one of the fundamental operations in mathematical morphology, to construct a margin added to the borders of predicted segmentation masks. At inference, the predicted set formed by the mask and its margin contains the ground-truth mask with high probability, at a confidence level specified by the user. The size of the margin serves as an indicator of predictive uncertainty for a given model and dataset. We work in a regime of minimal information as we do not require any feedback from the predictor: only the predicted masks are needed for computing the prediction sets. Hence, our method is applicable to any segmentation model, including those based on deep learning; we evaluate our approach on several medical imaging applications.
zh
[CV-10] CACTUS: An Open Dataset and Framework for Automated Cardiac Assessment and Classification of Ultrasound Images Using Deep Transfer Learning
【速读】:该论文旨在解决医学领域中机器学习(Machine Learning, ML)应用于心脏超声(Cardiac Ultrasound, US)图像分类和评估时面临的医疗数据有限这一重大挑战。解决方案的关键在于引入了一个名为CACTUS(Cardiac Assessment and ClassificaTion of UltraSound)的首个公开分级数据集,并提出了一种深度学习(Deep Learning, DL)框架。该框架包含两个主要组件:第一个组件利用卷积神经网络(Convolutional Neural Network, CNN)基于心脏视图对超声图像进行分类;第二个组件通过迁移学习(Transfer Learning, TL)微调第一个组件的知识,构建用于图像分级和评估的模型。该框架在分类和分级任务中表现出色,分别实现了高达99.43%的准确率和低至0.3067的误差,并通过进一步微调和专家问卷评估展示了其在处理实时扫描中的鲁棒性。
链接: https://arxiv.org/abs/2503.05604
作者: Hanae Elmekki,Ahmed Alagha,Hani Sami,Amanda Spilkin,Antonela Mariel Zanuttini,Ehsan Zakeri,Jamal Bentahar,Lyes Kadem,Wen-Fang Xie,Philippe Pibarot,Rabeb Mizouni,Hadi Otrok,Shakti Singh,Azzam Mourad
机构: Concordia University (康考迪亚大学); Université Laval (拉瓦尔大学); Khalifa University (哈利法大学); Lebanese American University (黎巴嫩美国大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Cardiac ultrasound (US) scanning is a commonly used techniques in cardiology to diagnose the health of the heart and its proper functioning. Therefore, it is necessary to consider ways to automate these tasks and assist medical professionals in classifying and assessing cardiac US images. Machine learning (ML) techniques are regarded as a prominent solution due to their success in numerous applications aimed at enhancing the medical field, including addressing the shortage of echography technicians. However, the limited availability of medical data presents a significant barrier to applying ML in cardiology, particularly regarding US images of the heart. This paper addresses this challenge by introducing the first open graded dataset for Cardiac Assessment and ClassificaTion of UltraSound (CACTUS), which is available online. This dataset contains images obtained from scanning a CAE Blue Phantom and representing various heart views and different quality levels, exceeding the conventional cardiac views typically found in the literature. Additionally, the paper introduces a Deep Learning (DL) framework consisting of two main components. The first component classifies cardiac US images based on the heart view using a Convolutional Neural Network (CNN). The second component uses Transfer Learning (TL) to fine-tune the knowledge from the first component and create a model for grading and assessing cardiac images. The framework demonstrates high performance in both classification and grading, achieving up to 99.43% accuracy and as low as 0.3067 error, respectively. To showcase its robustness, the framework is further fine-tuned using new images representing additional cardiac views and compared to several other state-of-the-art architectures. The framework’s outcomes and performance in handling real-time scans were also assessed using a questionnaire answered by cardiac experts.
zh
[CV-11] D2GV: Deformable 2D Gaussian Splatting for Video Representation in 400FPS
【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations, INRs)在视频表示中的两大核心问题:低可解释性与有限的实际应用效果。为应对这些挑战,论文提出了一种基于可变形二维高斯点阵化(Deformable 2D Gaussian Splatting)的新颖视频表示方法——D2GV。其关键在于通过引入可微分光栅化技术,将视频帧表示为从标准空间变形至相应时间戳的二维高斯分布,从而实现高效、高质量的视频表示,同时提升模型的可扩展性和下游任务友好性。此外,结合可学习的剪枝与量化策略,进一步优化了模型的紧凑性。这一方案不仅显著提高了训练效率(收敛速度快且解码速度超过400 FPS),还确保了与当前最先进的INRs相当甚至更高的质量表现。
链接: https://arxiv.org/abs/2503.05600
作者: Mufan Liu,Qi Yang,Miaoran Zhao,He Huang,Le Yang,Zhu Li,Yiling Xu
机构: Shanghai Jiao Tong University (上海交通大学); University of Missouri, Kansas City (密苏里大学堪萨斯城分校); University of Canterbury (坎特伯雷大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Implicit Neural Representations (INRs) have emerged as a powerful approach for video representation, offering versatility across tasks such as compression and inpainting. However, their implicit formulation limits both interpretability and efficacy, undermining their practicality as a comprehensive solution. We propose a novel video representation based on deformable 2D Gaussian splatting, dubbed D2GV, which aims to achieve three key objectives: 1) improved efficiency while delivering superior quality; 2) enhanced scalability and interpretability; and 3) increased friendliness for downstream tasks. Specifically, we initially divide the video sequence into fixed-length Groups of Pictures (GoP) to allow parallel training and linear scalability with video length. For each GoP, D2GV represents video frames by applying differentiable rasterization to 2D Gaussians, which are deformed from a canonical space into their corresponding timestamps. Notably, leveraging efficient CUDA-based rasterization, D2GV converges fast and decodes at speeds exceeding 400 FPS, while delivering quality that matches or surpasses state-of-the-art INRs. Moreover, we incorporate a learnable pruning and quantization strategy to streamline D2GV into a more compact representation. We demonstrate D2GV’s versatility in tasks including video interpolation, inpainting and denoising, underscoring its potential as a promising solution for video representation. Code is available at: \hrefthis https URLthis https URL.
zh
[CV-12] Anti-Diffusion: Preventing Abuse of Modifications of Diffusion-Based Models
【速读】:该论文旨在解决扩散模型(diffusion-based methods)在图像生成与编辑任务中被滥用所导致的严重负面社会影响问题。论文指出现有防御方法存在局限性,特别是在特定场景下依赖人工定义提示词或稳定扩散(Stable Diffusion, SD)版本的问题,并且仅关注调参技术而忽视了同样具有威胁的编辑技术。为此,论文提出了一种名为Anti-Diffusion的隐私保护系统,适用于广泛的扩散模型及其调参与编辑技术。
Anti-Diffusion的关键解决方案包括两个方面:首先引入提示词微调(Prompt Tuning, PT)策略,以更精确地表达原始图像,从而克服人工定义提示词对防御效果的限制;其次提出语义扰动损失(Semantic Disturbance Loss, SDL),通过干扰受保护图像的语义信息来抵御调参与编辑攻击。此外,为了评估编辑防御方法的效果,研究团队还开发了一个名为Defense-Edit的数据集。实验结果表明,Anti-Diffusion在多种扩散模型及不同场景下均表现出卓越的防御性能。
链接: https://arxiv.org/abs/2503.05595
作者: Zheng Li,Liangbin Xie,Jiantao Zhou,Xintao Wang,Haiwei Wu,Jinyu Tian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Although diffusion-based techniques have shown remarkable success in image generation and editing tasks, their abuse can lead to severe negative social impacts. Recently, some works have been proposed to provide defense against the abuse of diffusion-based methods. However, their protection may be limited in specific scenarios by manually defined prompts or the stable diffusion (SD) version. Furthermore, these methods solely focus on tuning methods, overlooking editing methods that could also pose a significant threat. In this work, we propose Anti-Diffusion, a privacy protection system designed for general diffusion-based methods, applicable to both tuning and editing techniques. To mitigate the limitations of manually defined prompts on defense performance, we introduce the prompt tuning (PT) strategy that enables precise expression of original images. To provide defense against both tuning and editing methods, we propose the semantic disturbance loss (SDL) to disrupt the semantic information of protected images. Given the limited research on the defense against editing methods, we develop a dataset named Defense-Edit to assess the defense performance of various methods. Experiments demonstrate that our Anti-Diffusion achieves superior defense performance across a wide range of diffusion-based techniques in different scenarios.
zh
[CV-13] QArtSR: Quantization via Reverse-Module and Timestep-Retraining in One-Step Diffusion based Image Super-Resolution
【速读】:该论文试图解决降低一阶扩散模型(OSDSR)在低比特量化下的性能损失问题,同时进一步减少计算成本。论文的关键解决方案在于提出了一种名为QArtSR的高效方法,包含两个创新策略:Timestep Retraining Quantization (TRQ) 和 Reversed Per-module Quantization (RPQ),用于校准量化模型。此外,通过引入模块损失和图像损失来更新所有量化模块,并采用仅微调量化组件参数的方式避免影响原始权重。为了确保各模块充分微调,论文还设计了扩展的端到端训练阶段。这些措施显著提升了OSDSR在4-bit和2-bit量化下的性能,接近全精度模型的效果。
链接: https://arxiv.org/abs/2503.05584
作者: Libo Zhu,Haotong Qin,Kaicheng Yang,Wenbo Li,Yong Guo,Yulun Zhang,Susanto Rahardja,Xiaokang Yang
机构: Shanghai Jiao Tong University (上海交通大学); ETH Zürich (瑞士苏黎世联邦理工学院); Singapore Institute of Technology (新加坡科技学院); Chinese University of Hong Kong (香港中文大学); Max Planck Institute for Informatics (马克斯·普朗克信息学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:One-step diffusion-based image super-resolution (OSDSR) models are showing increasingly superior performance nowadays. However, although their denoising steps are reduced to one and they can be quantized to 8-bit to reduce the costs further, there is still significant potential for OSDSR to quantize to lower bits. To explore more possibilities of quantized OSDSR, we propose an efficient method, Quantization via reverse-module and timestep-retraining for OSDSR, named QArtSR. Firstly, we investigate the influence of timestep value on the performance of quantized models. Then, we propose Timestep Retraining Quantization (TRQ) and Reversed Per-module Quantization (RPQ) strategies to calibrate the quantized model. Meanwhile, we adopt the module and image losses to update all quantized modules. We only update the parameters in quantization finetuning components, excluding the original weights. To ensure that all modules are fully finetuned, we add extended end-to-end training after per-module stage. Our 4-bit and 2-bit quantization experimental results indicate that QArtSR obtains superior effects against the recent leading comparison methods. The performance of 4-bit QArtSR is close to the full-precision one. Our code will be released at this https URL.
zh
[CV-14] Novel Object 6D Pose Estimation with a Single Reference View
【速读】:该论文旨在解决现有新颖物体6D位姿估计方法依赖于CAD模型或密集参考视图的问题,这些方法在实际应用中获取困难。论文提出了一种基于单参考视图的新颖物体6D位姿估计方法(SinRef-6D),其关键是利用状态空间模型(SSMs)在相机坐标系中迭代建立逐点对齐。具体而言,迭代相机空间逐点对齐能够有效处理大范围位姿差异,而提出的RGB和点云SSMs可以从单一视角捕获长距离依赖和空间信息,具有线性复杂度和优越的空间建模能力。预训练后的SinRef-6D仅需单个参考视图即可估计新物体的6D位姿,无需重新训练或使用CAD模型。实验结果表明,该方法在六个流行数据集和真实机器人场景中表现出与基于CAD模型和密集参考视图的方法相当的性能。
链接: https://arxiv.org/abs/2503.05578
作者: Jian Liu,Wei Sun,Kai Zeng,Jin Zheng,Hui Yang,Lin Wang,Hossein Rahmani,Ajmal Mian
机构: Hunan University (湖南大学); Nanyang Technological University (南洋理工大学); Central South University (中南大学); Lancaster University (兰卡斯特大学); The University of Western Australia (西澳大利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 17 pages, 12 figures (including supplementary material)
点击查看摘要
Abstract:Existing novel object 6D pose estimation methods typically rely on CAD models or dense reference views, which are both difficult to acquire. Using only a single reference view is more scalable, but challenging due to large pose discrepancies and limited geometric and spatial information. To address these issues, we propose a Single-Reference-based novel object 6D (SinRef-6D) pose estimation method. Our key idea is to iteratively establish point-wise alignment in the camera coordinate system based on state space models (SSMs). Specifically, iterative camera-space point-wise alignment can effectively handle large pose discrepancies, while our proposed RGB and Points SSMs can capture long-range dependencies and spatial information from a single view, offering linear complexity and superior spatial modeling capability. Once pre-trained on synthetic data, SinRef-6D can estimate the 6D pose of a novel object using only a single reference view, without requiring retraining or a CAD model. Extensive experiments on six popular datasets and real-world robotic scenes demonstrate that we achieve on-par performance with CAD-based and dense reference view-based methods, despite operating in the more challenging single reference setting. Code will be released at this https URL.
zh
[CV-15] omatoScanner: phenotyping tomato fruit based on only RGB image
【速读】:该论文旨在解决传统番茄果实表型测量方法存在的效率低、人工操作危险以及设备成本高等问题。同时,现有基于计算机视觉的方法也存在需要额外标定、可能破坏果实或只能测量有限且无意义性状的局限性。为了解决这些问题,论文提出了一种名为TomatoScanner的非接触式番茄果实表型测量方法,其关键是仅利用RGB图像作为输入,通过融合像素特征与深度特征实现精准的果实宽度、高度、垂直面积和体积等表型参数的测量。具体而言,解决方案的关键在于提出了三个创新模块(EdgeAttention、EdgeLoss和EdgeBoost)以增强EdgeYOLO在边缘部分的分割精度,并实现了轻量级高效的模型设计,同时结合Depth Pro进行深度特征提取。最终,TomatoScanner在自建的数据集上验证了其优越性能,各项指标均表现出色。
链接: https://arxiv.org/abs/2503.05568
作者: Xiaobei Zhao(1),Xiangrong Zeng(1),Yihang Ma(1),Pengjin Tang(1),Xiang Li(1) ((1) China Agricultural University)
机构: China Agricultural University (中国农业大学), Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 37 figures. Codes and datasets are open-sourced in this https URL
点击查看摘要
Abstract:In tomato greenhouse, phenotypic measurement is meaningful for researchers and farmers to monitor crop growth, thereby precisely control environmental conditions in time, leading to better quality and higher yield. Traditional phenotyping mainly relies on manual measurement, which is accurate but inefficient, more importantly, endangering the health and safety of people. Several studies have explored computer vision-based methods to replace manual phenotyping. However, the 2D-based need extra calibration, or cause destruction to fruit, or can only measure limited and meaningless traits. The 3D-based need extra depth camera, which is expensive and unacceptable for most farmers. In this paper, we propose a non-contact tomato fruit phenotyping method, titled TomatoScanner, where RGB image is all you need for input. First, pixel feature is extracted by instance segmentation of our proposed EdgeYOLO with preprocessing of individual separation and pose correction. Second, depth feature is extracted by depth estimation of Depth Pro. Third, pixel and depth feature are fused to output phenotype results in reality. We establish self-built Tomato Phenotype Dataset to test TomatoScanner, which achieves excellent phenotyping on width, height, vertical area and volume, with median relative error of 5.63%, 7.03%, -0.64% and 37.06%, respectively. We propose and add three innovative modules - EdgeAttention, EdgeLoss and EdgeBoost - into EdgeYOLO, to enhance the segmentation accuracy on edge portion. Precision and mean Edge Error greatly improve from 0.943 and 5.641% to 0.986 and 2.963%, respectively. Meanwhile, EdgeYOLO keeps lightweight and efficient, with 48.7 M weights size and 76.34 FPS. Codes and datasets: this https URL.
zh
[CV-16] Stereo Any Video: Temporally Consistent Stereo Matching
【速读】:该论文致力于解决视频立体匹配(Video Stereo Matching)中的空间准确性、时间一致性以及对辅助信息(如相机姿态或光流)的低依赖性问题。论文提出了一种名为“Stereo Any Video”的强大框架作为解决方案。其关键在于结合单目视频深度模型丰富的先验知识与卷积特征,生成稳定表示,并通过架构创新进一步提升性能。这些创新包括全互相关(all-to-all-pairs correlation),用于构建平滑且鲁棒的匹配代价体,以及时间凸上采样(temporal convex upsampling),以增强时间一致性。这些组件共同确保了方法在零样本设置下于多个数据集上的最先进性能,同时具备强泛化能力至真实室内外场景。
链接: https://arxiv.org/abs/2503.05549
作者: Junpeng Jing,Weixun Luo,Ye Mao,Krystian Mikolajczyk
机构: Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper introduces Stereo Any Video, a powerful framework for video stereo matching. It can estimate spatially accurate and temporally consistent disparities without relying on auxiliary information such as camera poses or optical flow. The strong capability is driven by rich priors from monocular video depth models, which are integrated with convolutional features to produce stable representations. To further enhance performance, key architectural innovations are introduced: all-to-all-pairs correlation, which constructs smooth and robust matching cost volumes, and temporal convex upsampling, which improves temporal coherence. These components collectively ensure robustness, accuracy, and temporal consistency, setting a new standard in video stereo matching. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple datasets both qualitatively and quantitatively in zero-shot settings, as well as strong generalization to real-world indoor and outdoor scenarios.
zh
[CV-17] Disconnect to Connect: A Data Augmentation Method for Improving Topology Accuracy in Image Segmentation
【速读】:该论文旨在解决薄且管状结构(如血管)在深度神经网络分割任务中因拓扑不准确导致的问题。现有方法依赖于精确的拓扑标注,但这类标注难以获取,因为标注图像(尤其是三维图像)耗时费力,且低分辨率和对比度会进一步加剧管状结构的断连现象。论文的关键解决方案是提出了一种名为CoLeTra的数据增强策略,它通过整合先验知识(即看似断开的结构实际上是连接的),生成具有断连外观但保留原始标签的图像,从而提升模型的拓扑准确性。实验表明,CoLeTra不仅提高了拓扑精度,还通常提升了Dice系数和Hausdorff距离,并且其超参数易于调节且对变化具有鲁棒性。
链接: https://arxiv.org/abs/2503.05541
作者: Juan Miguel Valverde,Maja Østergaard,Adrian Rodriguez-Palomo,Peter Alling Strange Vibe,Nina Kølln Wittig,Henrik Birkedal,Anders Bjorholm Dahl
机构: DTU Compute, Technical University of Denmark (丹麦技术大学计算系); A.I. Virtanen Institute, University of Eastern Finland (东芬兰大学A.I. Virtanen研究所); Department of Chemistry and iNANO, Aarhus University (奥胡斯大学化学系和iNANO中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate segmentation of thin, tubular structures (e.g., blood vessels) is challenging for deep neural networks. These networks classify individual pixels, and even minor misclassifications can break the thin connections within these structures. Existing methods for improving topology accuracy, such as topology loss functions, rely on very precise, topologically-accurate training labels, which are difficult to obtain. This is because annotating images, especially 3D images, is extremely laborious and time-consuming. Low image resolution and contrast further complicates the annotation by causing tubular structures to appear disconnected. We present CoLeTra, a data augmentation strategy that integrates to the models the prior knowledge that structures that appear broken are actually connected. This is achieved by creating images with the appearance of disconnected structures while maintaining the original labels. Our extensive experiments, involving different architectures, loss functions, and datasets, demonstrate that CoLeTra leads to segmentations topologically more accurate while often improving the Dice coefficient and Hausdorff distance. CoLeTra’s hyper-parameters are intuitive to tune, and our sensitivity analysis shows that CoLeTra is robust to changes in these hyper-parameters. We also release a dataset specifically suited for image segmentation methods with a focus on topology accuracy. CoLetra’s code can be found at this https URL.
zh
[CV-18] S4M: Segment Anything with 4 Extreme Points
【速读】:该论文旨在解决生成式 AI (Generative AI) 在医学领域的细粒度实例分割任务中,由于依赖稀疏提示(如点或边界框)导致的性能局限性,特别是在内窥镜图像中,现有提示难以有效捕捉物体边界的问题。论文的关键解决方案在于提出 S4M (Segment Anything with 4 Extreme Points),通过引入四极点提示(即目标实例的最顶部、底部、左侧和右侧点)替代传统的边界框提示,并设计专用的可学习嵌入来帮助模型理解这些提示的语义角色及其空间关系。此外,通过 Canvas 模块引入辅助训练任务,使模型在仅基于提示的情况下预测粗略实例掩膜,从而增强模型对极点与掩膜分布之间关系的理解,提升分割鲁棒性。实验结果表明,S4M 在三个内窥镜手术数据集上优于其他基于 SAM 的方法,并通过人工标注研究验证了极点提示比边界框更快速易得。
链接: https://arxiv.org/abs/2503.05534
作者: Adrien Meyer,Lorenzo Arboit,Giuseppe Massimiani,Francesco Brucchi,Luca Emanuele Amodio,Didier Mutter,Nicolas Padoy
机构: University of Strasbourg (斯特拉斯堡大学), CNRS (法国国家科学研究中心), INSERM (法国国家健康与医学研究院), ICube (ICube), UMR7357 (联合研究单位7357), Strasbourg, France; IHU Strasbourg (斯特拉斯堡IHU), France; Fondazione Policlinico Universitario A. Gemelli IRCCS (A. Gemelli大学医院基金会IRCCS), Rome, Italy; Ospedale Isola Tiberina-Gemelli Isola (Tiberina岛- Gemelli岛医院), Rome, Italy; University of Milan (米兰大学), Milan, Italy; University Hospital of Strasbourg (斯特拉斯堡大学医院), France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The Segment Anything Model (SAM) has revolutionized open-set interactive image segmentation, inspiring numerous adapters for the medical domain. However, SAM primarily relies on sparse prompts such as point or bounding box, which may be suboptimal for fine-grained instance segmentation, particularly in endoscopic imagery, where precise localization is critical and existing prompts struggle to capture object boundaries effectively. To address this, we introduce S4M (Segment Anything with 4 Extreme Points), which augments SAM by leveraging extreme points – the top-, bottom-, left-, and right-most points of an instance – prompts. These points are intuitive to identify and provide a faster, structured alternative to box prompts. However, a naïve use of extreme points degrades performance, due to SAM’s inability to interpret their semantic roles. To resolve this, we introduce dedicated learnable embeddings, enabling the model to distinguish extreme points from generic free-form points and better reason about their spatial relationships. We further propose an auxiliary training task through the Canvas module, which operates solely on prompts – without vision input – to predict a coarse instance mask. This encourages the model to internalize the relationship between extreme points and mask distributions, leading to more robust segmentation. S4M outperforms other SAM-based approaches on three endoscopic surgical datasets, demonstrating its effectiveness in complex scenarios. Finally, we validate our approach through a human annotation study on surgical endoscopic videos, confirming that extreme points are faster to acquire than bounding boxes.
zh
[CV-19] Post-Hoc Concept Disentanglement: From Correlated to Isolated Concept Representations
【速读】:该论文试图解决概念激活向量(Concept Activation Vectors, CAVs)在处理相关概念时因方向非正交导致的纠缠问题,这种纠缠会阻碍单个概念的独立解释,并可能在激活引导(activation steering)等应用中引发不良效果。论文的关键解决方案是提出了一种后处理的概念解缠方法,通过引入非正交性损失(non-orthogonality loss),在保持方向正确性的前提下实现概念方向的正交化。这一方法不仅解决了相关概念的方向纠缠问题,还验证了正交化概念表示在激活引导任务中的优越性,包括通过生成模型插入孤立概念以及在减少对相关概念影响的同时有效抑制快捷方式(shortcut suppression)。
链接: https://arxiv.org/abs/2503.05522
作者: Eren Erogullari,Sebastian Lapuschkin,Wojciech Samek,Frederik Pahde
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Concept Activation Vectors (CAVs) are widely used to model human-understandable concepts as directions within the latent space of neural networks. They are trained by identifying directions from the activations of concept samples to those of non-concept samples. However, this method often produces similar, non-orthogonal directions for correlated concepts, such as “beard” and “necktie” within the CelebA dataset, which frequently co-occur in images of men. This entanglement complicates the interpretation of concepts in isolation and can lead to undesired effects in CAV applications, such as activation steering. To address this issue, we introduce a post-hoc concept disentanglement method that employs a non-orthogonality loss, facilitating the identification of orthogonal concept directions while preserving directional correctness. We evaluate our approach with real-world and controlled correlated concepts in CelebA and a synthetic FunnyBirds dataset with VGG16 and ResNet18 architectures. We further demonstrate the superiority of orthogonalized concept representations in activation steering tasks, allowing (1) the insertion of isolated concepts into input images through generative models and (2) the removal of concepts for effective shortcut suppression with reduced impact on correlated concepts in comparison to baseline CAVs.
zh
[CV-20] Removing Geometric Bias in One-Class Anomaly Detection with Adaptive Feature Perturbation WACV2025
【速读】:该论文旨在解决单类异常检测中因训练数据缺乏异常样本而导致的模型泛化能力不足的问题。现有方法通常依赖于对正常样本进行数据增强以模拟异常,或者利用基准数据集中的几何偏见,但这些方法在更通用的场景下表现受限。此外,许多方法仅关注图像域,并通过端到端方式从单一正常类别学习,未能充分利用丰富的特征表示。
论文的关键解决方案在于引入了一种新颖的自适应线性特征扰动技术(Adaptive Linear Feature Perturbation)。该技术基于预训练模型提供的丰富特征空间,通过调整噪声分布适应每个样本,对特征向量施加衰减的线性扰动,并结合对比学习目标引导分类过程。这种方法有效提升了模型在标准数据集以及无几何偏见数据集上的性能,克服了现有方法的局限性。
链接: https://arxiv.org/abs/2503.05520
作者: Romain Hermary,Vincent Gaudillière,Abd El Rahman Shabayek,Djamila Aouada
机构: University of Luxembourg (卢森堡大学), Esch-sur-Alzette, Luxembourg
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published in WACV 2025
点击查看摘要
Abstract:One-class anomaly detection aims to detect objects that do not belong to a predefined normal class. In practice training data lack those anomalous samples; hence state-of-the-art methods are trained to discriminate between normal and synthetically-generated pseudo-anomalous data. Most methods use data augmentation techniques on normal images to simulate anomalies. However the best-performing ones implicitly leverage a geometric bias present in the benchmarking datasets. This limits their usability in more general conditions. Others are relying on basic noising schemes that may be suboptimal in capturing the underlying structure of normal data. In addition most still favour the image domain to generate pseudo-anomalies training models end-to-end from only the normal class and overlooking richer representations of the information. To overcome these limitations we consider frozen yet rich feature spaces given by pretrained models and create pseudo-anomalous features with a novel adaptive linear feature perturbation technique. It adapts the noise distribution to each sample applies decaying linear perturbations to feature vectors and further guides the classification process using a contrastive learning objective. Experimental evaluation conducted on both standard and geometric bias-free datasets demonstrates the superiority of our approach with respect to comparable baselines. The codebase is accessible via our public repository.
zh
[CV-21] FastMap: Fast Queries Initialization Based Vectorized HD Map Reconstruction Framework
【速读】:本文旨在解决现有基于DETR框架的矢量化高精地图重建方法因解码器结构冗余导致的计算效率低下问题。为应对这一挑战,论文提出FastMap,一种创新框架,通过采用单层双阶段Transformer优化解码器架构,实现多层级表征能力以减少冗余。FastMap的关键创新在于摒弃传统随机初始化查询的方式,引入热图引导的查询生成模块,在解码阶段利用可学习的位置编码将图像特征映射到结构化查询向量;同时,提出几何约束点到线损失机制,有效应对传统点对点损失在区分高度同质化特征时的难题。实验结果表明,FastMap在nuScenes和Argoverse2数据集上达到最先进性能,其解码速度比基线快3.2倍。
链接: https://arxiv.org/abs/2503.05492
作者: Haotian Hu,Jingwei Xu,Fanyi Wang,Toyota Li,Yaonong Wang,Laifeng Hu,Zhiwang Zhang
机构: Leap Motor(零跑科技); Zhejiang University (浙江大学); Independent; Nanjing Institute of Technology (南京工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Reconstruction of high-definition maps is a crucial task in perceiving the autonomous driving environment, as its accuracy directly impacts the reliability of prediction and planning capabilities in downstream modules. Current vectorized map reconstruction methods based on the DETR framework encounter limitations due to the redundancy in the decoder structure, necessitating the stacking of six decoder layers to maintain performance, which significantly hampers computational efficiency. To tackle this issue, we introduce FastMap, an innovative framework designed to reduce decoder redundancy in existing approaches. FastMap optimizes the decoder architecture by employing a single-layer, two-stage transformer that achieves multilevel representation capabilities. Our framework eliminates the conventional practice of randomly initializing queries and instead incorporates a heatmap-guided query generation module during the decoding phase, which effectively maps image features into structured query vectors using learnable positional encoding. Additionally, we propose a geometry-constrained point-to-line loss mechanism for FastMap, which adeptly addresses the challenge of distinguishing highly homogeneous features that often arise in traditional point-to-point loss computations. Extensive experiments demonstrate that FastMap achieves state-of-the-art performance in both nuScenes and Argoverse2 datasets, with its decoder operating 3.2 faster than the baseline. Code and more demos are available at this https URL.
zh
[CV-22] DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction CVPR2025
【速读】:本文旨在解决在野外捕获的视频中将静态物体与其接触表面分离的问题,这是实现基于牛顿物理的真实模拟的关键前提。现有方法多依赖于合成数据或局限于接触面附近的弹性抖动,无法实现物体的完全独立分离或显著的位置变化。本文提出的DecoupledGaussian系统通过解耦物体与接触表面,允许物体在分离后发生显著的位置变化,而不受初始接触表面的限制。解决方案的关键在于引入联合Poisson场来修复和扩展分离后的物体与场景高斯分布,并结合多雕刻策略优化物体几何形状,从而支持用户指定冲量驱动的真实分离运动、碰撞和断裂模拟,适用于单个场景或多场景复杂交互。
链接: https://arxiv.org/abs/2503.05484
作者: Miaowei Wang,Yibo Zhang,Rui Ma,Weiwei Xu,Changqing Zou,Daniel Morris
机构: The University of Edinburgh (爱丁堡大学); Jilin University (吉林大学); Zhejiang University (浙江大学); Michigan State University (密歇根州立大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2025 Accepted
点击查看摘要
Abstract:We present DecoupledGaussian, a novel system that decouples static objects from their contacted surfaces captured in-the-wild videos, a key prerequisite for realistic Newtonian-based physical simulations. Unlike prior methods focused on synthetic data or elastic jittering along the contact surface, which prevent objects from fully detaching or moving independently, DecoupledGaussian allows for significant positional changes without being constrained by the initial contacted surface. Recognizing the limitations of current 2D inpainting tools for restoring 3D locations, our approach proposes joint Poisson fields to repair and expand the Gaussians of both objects and contacted scenes after separation. This is complemented by a multi-carve strategy to refine the object’s geometry. Our system enables realistic simulations of decoupling motions, collisions, and fractures driven by user-specified impulses, supporting complex interactions within and across multiple scenes. We validate DecoupledGaussian through a comprehensive user study and quantitative benchmarks. This system enhances digital interaction with objects and scenes in real-world environments, benefiting industries such as VR, robotics, and autonomous driving. Our project page is at: this https URL.
zh
[CV-23] Automatic Teaching Platform on Vision Language Retrieval Augmented Generation
【速读】:该论文旨在解决自动化教学中难以提供符合学生个体学习节奏和理解水平的细微、实时反馈的问题,特别是在需要适应性解释的抽象概念领域。论文的关键在于提出了一种名为VL-RAG(Vision Language Retrieval Augmented Generation)的系统,该系统通过结合定制化答案与图像的数据库,能够动态检索与特定问题相关的上下文信息,并生成视觉丰富的响应,从而提升学生的理解能力。这种方法不仅增强了互动性和参与度,还促进了更深层次的学习,同时减少了对持续人工监督的需求,同时保持了跨不同学科和课程材料的灵活性扩展能力。
链接: https://arxiv.org/abs/2503.05464
作者: Ruslan Gokhman,Jialu Li,Youshan Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:Automating teaching presents unique challenges, as replicating human interaction and adaptability is complex. Automated systems cannot often provide nuanced, real-time feedback that aligns with students’ individual learning paces or comprehension levels, which can hinder effective support for diverse needs. This is especially challenging in fields where abstract concepts require adaptive explanations. In this paper, we propose a vision language retrieval augmented generation (named VL-RAG) system that has the potential to bridge this gap by delivering contextually relevant, visually enriched responses that can enhance comprehension. By leveraging a database of tailored answers and images, the VL-RAG system can dynamically retrieve information aligned with specific questions, creating a more interactive and engaging experience that fosters deeper understanding and active student participation. It allows students to explore concepts visually and verbally, promoting deeper understanding and reducing the need for constant human oversight while maintaining flexibility to expand across different subjects and course material.
zh
[CV-24] owards Locally Explaining Prediction Behavior via Gradual Interventions and Measuring Property Gradients
【速读】:该论文旨在解决深度学习模型虽然具有高预测性能但缺乏内在可解释性的问题,特别是现有局部可解释性方法仅关注关联而忽视了模型预测的因果驱动因素,同时其他采用因果视角的方法主要提供全局性的解释,而对于特定输入,无法明确全局识别的因素是否在局部适用。为了解决这一局限性,论文引入了一种基于图像到图像编辑模型最新进展的新型局部干预解释框架。关键在于通过渐进式干预语义属性,并利用一种新的评分指标——期望属性梯度幅值,量化这些干预对模型预测的影响,从而实现更精确的局部因果解释。研究通过广泛的实证评估验证了该方法的有效性,包括合成场景中的偏差本地化、网络训练动态分析、医学皮肤病变分类器研究以及使用真实干预数据的预训练CLIP模型分析。
链接: https://arxiv.org/abs/2503.05424
作者: Niklas Penzel,Joachim Denzler
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 44 pages, 39 figures, 14 tables
点击查看摘要
Abstract:Deep learning models achieve high predictive performance but lack intrinsic interpretability, hindering our understanding of the learned prediction behavior. Existing local explainability methods focus on associations, neglecting the causal drivers of model predictions. Other approaches adopt a causal perspective but primarily provide more general global explanations. However, for specific inputs, it’s unclear whether globally identified factors apply locally. To address this limitation, we introduce a novel framework for local interventional explanations by leveraging recent advances in image-to-image editing models. Our approach performs gradual interventions on semantic properties to quantify the corresponding impact on a model’s predictions using a novel score, the expected property gradient magnitude. We demonstrate the effectiveness of our approach through an extensive empirical evaluation on a wide range of architectures and tasks. First, we validate it in a synthetic scenario and demonstrate its ability to locally identify biases. Afterward, we apply our approach to analyze network training dynamics, investigate medical skin lesion classifiers, and study a pre-trained CLIP model with real-life interventional data. Our results highlight the potential of interventional explanations on the property level to reveal new insights into the behavior of deep models.
zh
[CV-25] Semantic Shift Estimation via Dual-Projection and Classifier Reconstruction for Exemplar-Free Class-Incremental Learning
【速读】:该论文致力于解决在无样本类增量学习(Exemplar-Free Class-Incremental Learning, EFCIL)中因缺乏保留示例而导致的知识灾难性遗忘问题。现有方法虽通过知识蒸馏缓解遗忘,但仍面临语义偏移(semantic shift)和决策偏差(decision bias)两大挑战:即旧任务的嵌入在学习新任务后会在嵌入空间中发生偏移,而分类器因仅使用新数据训练而倾向于偏向新任务,从而破坏旧知识与新知识之间的平衡。为此,论文提出了一种名为双投影偏移估计与分类器重构(Dual-Projection Shift Estimation and Classifier Reconstruction, DPCR)的方法。其关键是通过结合可学习变换与行空间投影的双重投影机制有效估计语义偏移,并利用岭回归将分类器训练重构成重构过程,利用校准后的类别协方差和原型中的先前信息减少决策偏差,从而实现旧任务与新任务之间的平衡。实验表明,DPCR在多个数据集上优于当前最先进的EFCIL方法。
链接: https://arxiv.org/abs/2503.05423
作者: Run He,Di Fang,Yicheng Xu,Yawen Cui,Ming Li,Cen Chen,Ziqian Zeng,Huiping Zhuang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 7 figures
点击查看摘要
Abstract:Exemplar-Free Class-Incremental Learning (EFCIL) aims to sequentially learn from distinct categories without retaining exemplars but easily suffers from catastrophic forgetting of learned knowledge. While existing EFCIL methods leverage knowledge distillation to alleviate forgetting, they still face two critical challenges: semantic shift and decision bias. Specifically, the embeddings of old tasks shift in the embedding space after learning new tasks, and the classifier becomes biased towards new tasks due to training solely with new data, thereby hindering the balance between old and new knowledge. To address these issues, we propose the Dual-Projection Shift Estimation and Classifier Reconstruction (DPCR) approach for EFCIL. DPCR effectively estimates semantic shift through a dual-projection, which combines a learnable transformation with a row-space projection to capture both task-wise and category-wise shifts. Furthermore, to mitigate decision bias, DPCR employs ridge regression to reformulate classifier training as a reconstruction process. This reconstruction exploits previous information encoded in covariance and prototype of each class after calibration with estimated shift, thereby reducing decision bias. Extensive experiments demonstrate that, across various datasets, DPCR effectively balances old and new tasks, outperforming state-of-the-art EFCIL methods.
zh
[CV-26] Self-Modeling Robots by Photographing
【速读】:该论文旨在解决现有机器人自建模方法中存在的建模质量不高或数据采集成本过高的问题,并进一步探索如何结合纹理信息进行高质量的自建模。论文的关键在于提出了一种基于三维高斯分布(3D Gaussians)的高精度、纹理感知且以关节级别为核心的自建模方法。通过利用三维高斯分布表示机器人的静态形态与纹理,并将其聚类构建神经椭球骨骼,同时使用运动学神经网络生成变换矩阵来控制其变形。该方法仅依赖于包含关节角度、相机参数和多视角图像的数据对进行训练,无需深度信息。关键创新点在于结合纹理信息的建模能力以及在关节级别描述形态、运动学特性和纹理的能力,从而实现高效的下游任务应用如运动规划和逆运动学计算。
链接: https://arxiv.org/abs/2503.05398
作者: Kejun Hu,Peng Yu,Ning Tan
机构: School of Computer Science and Engineering, Sun Yat-sen University (中山大学), Guangzhou, Guangdong, China
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Self-modeling enables robots to build task-agnostic models of their morphology and kinematics based on data that can be automatically collected, with minimal human intervention and prior information, thereby enhancing machine intelligence. Recent research has highlighted the potential of data-driven technology in modeling the morphology and kinematics of robots. However, existing self-modeling methods suffer from either low modeling quality or excessive data acquisition costs. Beyond morphology and kinematics, texture is also a crucial component of robots, which is challenging to model and remains unexplored. In this work, a high-quality, texture-aware, and link-level method is proposed for robot self-modeling. We utilize three-dimensional (3D) Gaussians to represent the static morphology and texture of robots, and cluster the 3D Gaussians to construct neural ellipsoid bones, whose deformations are controlled by the transformation matrices generated by a kinematic neural network. The 3D Gaussians and kinematic neural network are trained using data pairs composed of joint angles, camera parameters and multi-view images without depth information. By feeding the kinematic neural network with joint angles, we can utilize the well-trained model to describe the corresponding morphology, kinematics and texture of robots at the link level, and render robot images from different perspectives with the aid of 3D Gaussian splatting. Furthermore, we demonstrate that the established model can be exploited to perform downstream tasks such as motion planning and inverse kinematics.
zh
[CV-27] R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning
【速读】:本文旨在解决多模态大型语言模型在情感识别任务中的性能优化问题,该任务中视觉和听觉模态均发挥关键作用。论文的关键解决方案是首次将具有可验证奖励的强化学习(Reinforcement Learning with Verifiable Reward, RLVR)应用于Omni多模态大型语言模型。通过引入RLVR,显著提升了模型在推理能力、情感识别准确率以及泛化能力三个方面的表现。此外,改进的推理能力使得能够清晰分析不同模态(尤其是视觉和听觉信息)在情感识别过程中的贡献,为多模态大型语言模型的优化提供了重要洞见。
链接: https://arxiv.org/abs/2503.05379
作者: Jiaxing Zhao,Xihan Wei,Liefeng Bo
机构: Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this work, we present the first application of Reinforcement Learning with Verifiable Reward (RLVR) to an Omni-multimodal large language model in the context of emotion recognition, a task where both visual and audio modalities play crucial roles. We leverage RLVR to optimize the Omni model, significantly enhancing its performance in three key aspects: reasoning capability, emotion recognition accuracy, and generalization ability. The introduction of RLVR not only improves the model’s overall performance on in-distribution data but also demonstrates superior robustness when evaluated on out-of-distribution datasets. More importantly, the improved reasoning capability enables clear analysis of the contributions of different modalities, particularly visual and audio information, in the emotion recognition process. This provides valuable insights into the optimization of multimodal large language models.
zh
[CV-28] Multi-Grained Feature Pruning for Video-Based Human Pose Estimation
【速读】:该论文致力于解决基于Transformer的视频人体姿态估计算法在处理冗余时间信息和实现细粒度感知方面的挑战。当前方法通常仅关注低分辨率特征的处理,导致难以有效管理时间维度上的冗余信息并实现精确的姿态估计。为了解决这些问题,论文提出了一种新颖的多尺度分辨率框架,通过在不同粒度上编码时空表示并执行细粒度感知补偿来优化模型性能。此外,论文采用密度峰值聚类方法动态识别并优先处理具有重要语义信息的Token,这一策略能够有效剪枝冗余特征Token,特别是来自多帧特征的冗余信息,从而在不牺牲语义丰富性的情况下显著提升计算效率。关键在于结合多尺度时空建模与高效的Token剪枝策略,实现了性能和效率的双重提升,在PoseTrack2017数据集上取得了87.4 mAP的精度,并将推理速度提升了93.8%。
链接: https://arxiv.org/abs/2503.05365
作者: Zhigang Wang,Shaojing Fan,Zhenguang Liu,Zheqi Wu,Sifan Wu,Yingying Jiao
机构: College of Computer Science and Technology, Zhejiang Gongshang University (浙江工商大学计算机科学与技术学院), Hangzhou, China; School of Computing, National University of Singapore (新加坡国立大学计算机学院), Singapore; The State Key Laboratory of Blockchain and Data Security, Zhejiang University (浙江大学区块链与数据安全国家重点实验室), Hangzhou, China; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security (杭州高新技术产业开发区(滨江区)区块链与数据安全研究所), Hangzhou, China; College of Computer Science and Technology, Jilin University (吉林大学计算机科学与技术学院), Changchun, China; College of Computer Science and Technology, Zhejiang University of Technology (浙江工业大学计算机科学与技术学院), Hangzhou, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Human pose estimation, with its broad applications in action recognition and motion capture, has experienced significant advancements. However, current Transformer-based methods for video pose estimation often face challenges in managing redundant temporal information and achieving fine-grained perception because they only focus on processing low-resolution features. To address these challenges, we propose a novel multi-scale resolution framework that encodes spatio-temporal representations at varying granularities and executes fine-grained perception compensation. Furthermore, we employ a density peaks clustering method to dynamically identify and prioritize tokens that offer important semantic information. This strategy effectively prunes redundant feature tokens, especially those arising from multi-frame features, thereby optimizing computational efficiency without sacrificing semantic richness. Empirically, it sets new benchmarks for both performance and efficiency on three large-scale datasets. Our method achieves a 93.8% improvement in inference speed compared to the baseline, while also enhancing pose estimation accuracy, reaching 87.4 mAP on the PoseTrack2017 dataset.
zh
[CV-29] New multimodal similarity measure for image registration via modeling local functional dependence with linear combination of learned basis functions
【速读】:该论文旨在解决跨模态图像(different modalities)非刚性配准(deformable registration)这一在医学影像应用中极具挑战的问题。其核心挑战在于开发一种鲁棒的相似性度量方法,以衡量不同模态图像之间的重叠程度,尽管这些图像捕捉的是同一组织的不同方面。论文的关键解决方案是基于配准后图像强度值之间的函数依赖关系设计相似性度量,并通过线性基函数模型建模局部函数依赖性。该模型的基函数与变形场(deformation field)联合学习,同时利用卷积实现高效计算,从而显著提升了小范围上下文内的配准性能。最终,论文提出的方法在GPU上具有高效性和易用性,并在三个数据集上展示了优于传统基线方法及早期函数依赖方法的表现。
链接: https://arxiv.org/abs/2503.05335
作者: Joel Honkamaa,Pekka Marttinen
机构: Aalto University (阿尔托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The deformable registration of images of different modalities, essential in many medical imaging applications, remains challenging. The main challenge is developing a robust measure for image overlap despite the compared images capturing different aspects of the underlying tissue. Here, we explore similarity metrics based on functional dependence between intensity values of registered images. Although functional dependence is too restrictive on the global scale, earlier work has shown competitive performance in deformable registration when such measures are applied over small enough contexts. We confirm this finding and further develop the idea by modeling local functional dependence via the linear basis function model with the basis functions learned jointly with the deformation. The measure can be implemented via convolutions, making it efficient to compute on GPUs. We release the method as an easy-to-use tool and show good performance on three datasets compared to well-established baseline and earlier functional dependence-based methods.
zh
[CV-30] PhysicsGen: Can Generative Models Learn from Images to Predict Complex Physical Relations?
【速读】:该论文旨在探索生成式模型在物理模拟任务中的潜力,试图解决两个关键问题:i) 生成式模型是否能够从输入-输出图像对中学习复杂的物理关系?ii) 相比基于微分方程的传统物理模拟方法,生成式模型能否实现显著的速度提升?论文通过提供一个包含30万图像对的数据集及三种不同物理模拟任务的基线评估,构建了一个基准测试平台。解决方案的关键在于设计有效的生成式模型架构以及数据驱动的学习策略,以同时实现高计算效率和物理正确性,但现有模型在物理正确性方面仍存在明显局限性,这表明需要开发新的方法来强制保证物理约束的满足。
链接: https://arxiv.org/abs/2503.05333
作者: Martin Spitznagel,Jan Vaillant,Janis Keuper
机构: Institute for Machine Learning and Analytics (IMLA), Offenburg University (Offenburg大学); Herrenknecht AG; University of Mannheim (曼海姆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The image-to-image translation abilities of generative learning models have recently made significant progress in the estimation of complex (steered) mappings between image distributions. While appearance based tasks like image in-painting or style transfer have been studied at length, we propose to investigate the potential of generative models in the context of physical simulations. Providing a dataset of 300k image-pairs and baseline evaluations for three different physical simulation tasks, we propose a benchmark to investigate the following research questions: i) are generative models able to learn complex physical relations from input-output image pairs? ii) what speedups can be achieved by replacing differential equation based simulations? While baseline evaluations of different current models show the potential for high speedups (ii), these results also show strong limitations toward the physical correctness (i). This underlines the need for new methods to enforce physical correctness. Data, baseline models and evaluation code this http URL.
zh
[CV-31] CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images
【速读】:该论文旨在解决由相机运动导致的运动模糊(motion blur)问题,这一问题是高精度三维场景重建的主要障碍之一。论文的关键解决方案是提出了一种名为CoMoGaussian的连续运动感知高斯点绘制方法(Continuous Motion-Aware Gaussian Splatting)。该方法通过神经常微分方程(neural ordinary differential equations, ODEs)预测连续相机轨迹,并利用刚体变换(rigid body transformations)来保持物体形状和大小的一致性,同时采用离散帧采样的积分方式处理运动模糊。此外,引入连续运动细化(continuous motion refinement, CMR)变换以进一步优化刚体变换,通过引入可学习参数更精确地建模连续运动轨迹,从而显著提升重建精度。实验结果表明,该方法在定量与定性评估方面均达到当前最佳性能,适用于从轻度到重度运动模糊的各种场景。
链接: https://arxiv.org/abs/2503.05332
作者: Jungho Lee,Donghyeong Kim,Dogyoon Lee,Suhwan Cho,Minhyeok Lee,Wonjoon Lee,Taeoh Kim,Dongyoon Wee,Sangyoun Lee
机构: School of Electrical and Electronic Engineering, Yonsei University (延世大学电气与电子工程学院); NAVER Cloud (NAVER云)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Revised Version of CRiM-GS, Github: this https URL
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has gained significant attention for their high-quality novel view rendering, motivating research to address real-world challenges. A critical issue is the camera motion blur caused by movement during exposure, which hinders accurate 3D scene reconstruction. In this study, we propose CoMoGaussian, a Continuous Motion-Aware Gaussian Splatting that reconstructs precise 3D scenes from motion-blurred images while maintaining real-time rendering speed. Considering the complex motion patterns inherent in real-world camera movements, we predict continuous camera trajectories using neural ordinary differential equations (ODEs). To ensure accurate modeling, we employ rigid body transformations, preserving the shape and size of the object but rely on the discrete integration of sampled frames. To better approximate the continuous nature of motion blur, we introduce a continuous motion refinement (CMR) transformation that refines rigid transformations by incorporating additional learnable parameters. By revisiting fundamental camera theory and leveraging advanced neural ODE techniques, we achieve precise modeling of continuous camera trajectories, leading to improved reconstruction accuracy. Extensive experiments demonstrate state-of-the-art performance both quantitatively and qualitatively on benchmark datasets, which include a wide range of motion blur scenarios, from moderate to extreme blur.
zh
[CV-32] Attenuation artifact detection and severity classification in intracoronary OCT using mixed image representations
【速读】:该论文旨在解决冠状动脉光学相干断层成像(Intracoronary Optical Coherence Tomography, OCT)中因血液残留和气泡导致的衰减伪影问题,这些伪影会掩盖重要的血管结构。论文提出了一种基于卷积神经网络的方法,通过将衰减线(Attenuation Lines, A-lines)分类为无伪影、轻度伪影和严重伪影三类,实现伪影的自动检测。这种方法的关键在于结合笛卡尔坐标和极坐标表示下的特征提取与融合,利用两种坐标系统提供的互补特征,有效提升了轻度和重度伪影检测的F-score至0.77和0.94,同时保持约6秒的推理时间。这为冠状动脉OCT成像中的自动化伪影评估和图像采集指导奠定了基础。
链接: https://arxiv.org/abs/2503.05322
作者: Pierandrea Cancian,Simone Saitta,Xiaojin Gu,Rudolf L.M. van Herten,Thijs J. Luttikholt,Jos Thannhauser,Rick H.J.A. Volleberg,Ruben G.A. van der Waerden,Joske L. van der Zande,Clarisa I. Sánchez,Bram van Ginneken,Niels van Royen,Ivana Išgum
机构: Quantitative Healthcare Analysis group, Biomedical Engineering and Physics, Amsterdam UMC (阿姆斯特丹UMC), the Netherlands.; Quantitative Healthcare Analysis group, Informatics Institute, University of Amsterdam (阿姆斯特丹大学), the Netherlands.; Amsterdam Cardiovascular Sciences, Amsterdam UMC (阿姆斯特丹UMC), the Netherlands.; Diagnostic Image Analysis Group, Radboud University Medical Center (拉德堡德大学医学中心), Nijmegen, the Netherlands.; Department of Cardiology, Radboud University Medical Center (拉德堡德大学医学中心), Nijmegen, the Netherlands.; Department of Radiology and Nuclear Medicine, Amsterdam UMC (阿姆斯特丹UMC), the Netherlands.
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:In intracoronary optical coherence tomography (OCT), blood residues and gas bubbles cause attenuation artifacts that can obscure critical vessel structures. The presence and severity of these artifacts may warrant re-acquisition, prolonging procedure time and increasing use of contrast agent. Accurate detection of these artifacts can guide targeted re-acquisition, reducing the amount of repeated scans needed to achieve diagnostically viable images. However, the highly heterogeneous appearance of these artifacts poses a challenge for the automated detection of the affected image regions. To enable automatic detection of the attenuation artifacts caused by blood residues and gas bubbles based on their severity, we propose a convolutional neural network that performs classification of the attenuation lines (A-lines) into three classes: no artifact, mild artifact and severe artifact. Our model extracts and merges features from OCT images in both Cartesian and polar coordinates, where each column of the image represents an A-line. Our method detects the presence of attenuation artifacts in OCT frames reaching F-scores of 0.77 and 0.94 for mild and severe artifacts, respectively. The inference time over a full OCT scan is approximately 6 seconds. Our experiments show that analysis of images represented in both Cartesian and polar coordinate systems outperforms the analysis in polar coordinates only, suggesting that these representations contain complementary features. This work lays the foundation for automated artifact assessment and image acquisition guidance in intracoronary OCT imaging.
zh
[CV-33] Robust Multimodal Learning for Ophthalmic Disease Grading via Disentangled Representation
【速读】:该论文旨在解决在实际应用中因医疗设备不足和数据隐私问题导致完整多模态数据稀缺的问题。传统深度学习方法通过在潜在空间学习表示来应对这些挑战,但存在两个关键局限性:一是复杂模态中的任务无关冗余信息(如大量切片)导致潜在空间表示冗余;二是多模态表示重叠使得难以提取每种模态的独特特征。为克服这些问题,论文提出了一种名为Essence-Point and Disentangle Representation Learning (EDRL) 的策略,将自蒸馏机制集成到端到端框架中,以增强特征选择与解耦能力,从而实现更稳健的多模态学习。其中,Essence-Point Representation Learning模块通过选择判别性特征提升疾病分级性能,而Disentangled Representation Learning模块则将多模态数据分解为模态共享和模态独特表示,减少特征纠缠,提高眼科疾病诊断的鲁棒性和可解释性。实验结果表明,所提出的EDRL策略显著优于当前最先进的方法。
链接: https://arxiv.org/abs/2503.05319
作者: Xinkun Wang,Yifang Wang,Senwei Liang,Feilong Tang,Chengzhi Liu,Ming Hu,Chao Hu,Junjun He,Zongyuan Ge,Imran Razzak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10pages
点击查看摘要
Abstract:This paper discusses how ophthalmologists often rely on multimodal data to improve diagnostic accuracy. However, complete multimodal data is rare in real-world applications due to a lack of medical equipment and concerns about data privacy. Traditional deep learning methods typically address these issues by learning representations in latent space. However, the paper highlights two key limitations of these approaches: (i) Task-irrelevant redundant information (e.g., numerous slices) in complex modalities leads to significant redundancy in latent space representations. (ii) Overlapping multimodal representations make it difficult to extract unique features for each modality. To overcome these challenges, the authors propose the Essence-Point and Disentangle Representation Learning (EDRL) strategy, which integrates a self-distillation mechanism into an end-to-end framework to enhance feature selection and disentanglement for more robust multimodal learning. Specifically, the Essence-Point Representation Learning module selects discriminative features that improve disease grading performance. The Disentangled Representation Learning module separates multimodal data into modality-common and modality-unique representations, reducing feature entanglement and enhancing both robustness and interpretability in ophthalmic disease diagnosis. Experiments on multimodal ophthalmology datasets show that the proposed EDRL strategy significantly outperforms current state-of-the-art methods.
zh
[CV-34] Frequency Autoregressive Image Generation with Continuous Tokens
【速读】:该论文旨在解决图像生成领域中自回归(Autoregressive, AR)模型面临的两大挑战:因模态差异导致的传统基于矢量量化(Vector Quantization)与逐像素预测(raster-scan “next-token prediction”)范式的局限性。论文从tokenizer格式与回归方向两个视角重新审视图像自回归模型的设计,并提出频率渐进自回归(Frequency Progressive Autoregressive, \textbf{FAR})范式。关键在于识别频谱依赖性(spectral dependency)作为FAR的回归方向,通过从低频到高频逐步构建完整图像,不仅满足自回归模型的因果性需求,还保留了图像数据的独特空间局部性。此外,论文结合连续tokenizer,引入一系列技术优化训练与推理效率,验证了FAR在ImageNet数据集上的有效性以及其在文本到图像生成任务中的潜力。
链接: https://arxiv.org/abs/2503.05305
作者: Hu Yu,Hao Luo,Hangjie Yuan,Yu Rong,Feng Zhao
机构: Alibaba Group (阿里巴巴集团), DAMO Academy (达摩院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Autoregressive (AR) models for image generation typically adopt a two-stage paradigm of vector quantization and raster-scan ``next-token prediction", inspired by its great success in language modeling. However, due to the huge modality gap, image autoregressive models may require a systematic reevaluation from two perspectives: tokenizer format and regression direction. In this paper, we introduce the frequency progressive autoregressive (\textbfFAR) paradigm and instantiate FAR with the continuous tokenizer. Specifically, we identify spectral dependency as the desirable regression direction for FAR, wherein higher-frequency components build upon the lower one to progressively construct a complete image. This design seamlessly fits the causality requirement for autoregressive models and preserves the unique spatial locality of image data. Besides, we delve into the integration of FAR and the continuous tokenizer, introducing a series of techniques to address optimization challenges and improve the efficiency of training and inference processes. We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset and verify its potential on text-to-image generation.
zh
[CV-35] Escaping Platos Cave: Towards the Alignment of 3D and Text Latent Spaces CVPR2025
【速读】:该论文试图解决跨模态(multi-modal)学习中3D编码器与文本编码器特征对齐的问题。现有工作主要关注二维视觉与文本编码器之间的特征共享特性,而3D编码器在多模态对齐中的作用尚未被充分探索。此外,当前基于大型数据集的3D基础模型通常通过显式的对齐目标与冻结的其他模态编码器进行训练。论文的关键在于提出了一种后验特征对齐方法,通过对单模态3D编码器和基于文本的特征空间所提取的子空间进行投影,将高维特征映射到精心选择的低维子空间中,从而显著提高特征对齐的质量,并在匹配和检索任务中实现更高的准确性。论文揭示了这些共享子空间大致区分了语义和几何数据表示的本质,并为未来3D模态与其他模态的后训练特征对齐建立了基准。
链接: https://arxiv.org/abs/2503.05283
作者: Souhail Hadgi,Luca Moschella,Andrea Santilli,Diego Gomez,Qixing Huang,Emanuele Rodolà,Simone Melzi,Maks Ovsjanikov
机构: École polytechnique (巴黎综合理工学院); Sapienza University of Rome (罗马大学); University of Milano-Bicocca (米兰比可卡大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校); Apple (苹果)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025
点击查看摘要
Abstract:Recent works have shown that, when trained at scale, uni-modal 2D vision and text encoders converge to learned features that share remarkable structural properties, despite arising from different representations. However, the role of 3D encoders with respect to other modalities remains unexplored. Furthermore, existing 3D foundation models that leverage large datasets are typically trained with explicit alignment objectives with respect to frozen encoders from other representations. In this work, we investigate the possibility of a posteriori alignment of representations obtained from uni-modal 3D encoders compared to text-based feature spaces. We show that naive post-training feature alignment of uni-modal text and 3D encoders results in limited performance. We then focus on extracting subspaces of the corresponding feature spaces and discover that by projecting learned representations onto well-chosen lower-dimensional subspaces the quality of alignment becomes significantly higher, leading to improved accuracy on matching and retrieval tasks. Our analysis further sheds light on the nature of these shared subspaces, which roughly separate between semantic and geometric data representations. Overall, ours is the first work that helps to establish a baseline for post-training alignment of 3D uni-modal and text feature spaces, and helps to highlight both the shared and unique properties of 3D data compared to other representations.
zh
[CV-36] CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation
【速读】:该论文旨在解决多图像复杂理解任务中现有跨模态慢思考方法效果受限的问题。这些方法主要依赖于基于文本的中间推理过程,导致在多图像场景下的表现不如人脑处理类似任务时的效果。为了解决这一问题,论文提出了一种名为复杂多模态链式思维(Complex Multi-Modal Chain-of-Thought, CMMCoT)的新框架,其关键是引入了两个创新点:一是构建交错的多模态多步推理链,利用从中间推理步骤提取的关键视觉区域标记作为监督信号,不仅促进全面的跨模态理解,还提升模型的可解释性;二是设计了一个测试时记忆增强模块,在保持参数效率的同时扩展模型在推理阶段的能力。此外,为了推动相关研究,作者还创建了一个新的多图像慢思考数据集。实验结果验证了所提方法的有效性。
链接: https://arxiv.org/abs/2503.05255
作者: Guanghao Zhang,Tao Zhong,Yan Xia,Zhelun Yu,Haoyuan Li,Wanggui He,Fangxun Shu,Mushui Liu,Dong She,Yi Wang,Hao Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:While previous multimodal slow-thinking methods have demonstrated remarkable success in single-image understanding scenarios, their effectiveness becomes fundamentally constrained when extended to more complex multi-image comprehension tasks. This limitation stems from their predominant reliance on text-based intermediate reasoning processes. While for human, when engaging in sophisticated multi-image analysis, they typically perform two complementary cognitive operations: (1) continuous cross-image visual comparison through region-of-interest matching, and (2) dynamic memorization of critical visual concepts throughout the reasoning chain. Motivated by these observations, we propose the Complex Multi-Modal Chain-of-Thought (CMMCoT) framework, a multi-step reasoning framework that mimics human-like “slow thinking” for multi-image understanding. Our approach incorporates two key innovations: 1. The construction of interleaved multimodal multi-step reasoning chains, which utilize critical visual region tokens, extracted from intermediate reasoning steps, as supervisory signals. This mechanism not only facilitates comprehensive cross-modal understanding but also enhances model interpretability. 2. The introduction of a test-time memory augmentation module that expands the model reasoning capacity during inference while preserving parameter efficiency. Furthermore, to facilitate research in this direction, we have curated a novel multi-image slow-thinking dataset. Extensive experiments demonstrate the effectiveness of our model.
zh
[CV-37] ColFigPhotoAttnNet: Reliable Finger Photo Presentation Attack Detection Leverag ing Window-Attention on Color Spaces WACV
【速读】:该论文试图解决现有深度学习指纹照片呈现攻击检测(Presentation Attack Detection, PAD)系统在跨捕获设备设置下的性能下降问题。这些问题源于现有算法通常针对特定类型的攻击进行训练,并且依赖于特定捕获设备的图像数据,导致其泛化能力较差且对移动硬件的演进缺乏鲁棒性。论文的关键解决方案是提出了一种名为ColFigPhotoAttnNet的新架构,该架构基于颜色通道上的窗口注意力机制,并结合嵌套残差网络作为预测器,以实现可靠的指纹照片PAD。通过在多种捕获设备(如iPhone13 Pro、Google Pixel 3、Nokia C5和OnePlus One)上进行广泛的实验验证,证明了所提方法的有效性。
链接: https://arxiv.org/abs/2503.05247
作者: Anudeep Vurity,Emanuela Marasco,Raghavendra Ramachandra,Jongwoo Park
机构: Center for Secure Information Systems, George Mason University (乔治梅森大学); Norwegian University of Science and Technology (NTNU) (挪威科技大学), Gjøvik, Norway; Stony Brook University (石溪大学), Stony Brook, NY, U.S.A
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in Winter Conference on Applications of Computer Vision (WACV) 2025
点击查看摘要
Abstract:Finger photo Presentation Attack Detection (PAD) can significantly strengthen smartphone device security. However, these algorithms are trained to detect certain types of attacks. Furthermore, they are designed to operate on images acquired by specific capture devices, leading to poor generalization and a lack of robustness in handling the evolving nature of mobile hardware. The proposed investigation is the first to systematically analyze the performance degradation of existing deep learning PAD systems, convolutional and transformers, in cross-capture device settings. In this paper, we introduce the ColFigPhotoAttnNet architecture designed based on window attention on color channels, followed by the nested residual network as the predictor to achieve a reliable PAD. Extensive experiments using various capture devices, including iPhone13 Pro, GooglePixel 3, Nokia C5, and OnePlusOne, were carried out to evaluate the performance of proposed and existing methods on three publicly available databases. The findings underscore the effectiveness of our approach.
zh
[CV-38] Unified Reward Model for Multimodal Understanding and Generation
【速读】:该论文试图解决现有视觉任务特定奖励模型适应性不足的问题,并提出通过联合学习多任务评估以实现相互促进的效果。论文的关键在于提出UnifiedReward,这是一种针对多模态理解与生成评估的统一奖励模型,支持成对排序和点评估,可用于视觉模型偏好对齐。具体而言,首先在大规模人类偏好数据集上训练该模型,涵盖图像和视频生成/理解任务;其次利用该模型自动构建高质量的偏好对数据,并通过成对排序和逐点筛选逐步优化视觉模型输出;最后使用这些数据通过直接偏好优化(Direct Preference Optimization, DPO)完成模型偏好对齐。实验结果表明,联合学习多视觉任务评估能够带来显著的互惠效益,并在图像和视频的理解与生成任务中大幅提升性能。
链接: https://arxiv.org/abs/2503.05236
作者: Yibin Wang,Yuhang Zang,Hao Li,Cheng Jin,Jiaqi Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
点击查看摘要
Abstract:Recent advances in human preference alignment have significantly enhanced multimodal generation and understanding. A key approach is training reward models to guide preference optimization. However, existing models are often task-specific, limiting their adaptability across diverse visual applications. We also argue that jointly learning to assess multiple tasks may foster a synergistic effect, where improved image understanding enhances image generation assessment, and refined image evaluation benefits video assessment through better frame analysis. To this end, this paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment, enabling both pairwise ranking and pointwise scoring, which can be employed for vision model preference alignment. Specifically, (1) we first develop UnifiedReward on our constructed large-scale human preference dataset, including both image and video generation/understanding tasks. (2) Then, it is utilized to automatically construct high-quality preference pair data based on the vision models, fine-gradually filtering their outputs through pair ranking and point sifting. (3) Finally, these data are used for their preference alignment through Direct Preference Optimization (DPO). Experimental results demonstrate that joint learning to assess diverse visual tasks can lead to substantial mutual benefits and we apply our pipeline to both image and video understanding/generation tasks, significantly improving the performance in each domain.
zh
[CV-39] RecipeGen: A Benchmark for Real-World Recipe Image Generation
【速读】:该论文试图解决食品计算领域中菜谱图像生成的重要挑战,当前缺乏一个全面连接菜谱目标、步骤序列与对应图像的真实世界数据集。为了解决这一问题,论文引入了RecipeGen,这是一个首个面向菜谱生成的真实世界目标-步骤-图像基准数据集,其关键在于包含了多样化的食材、多样的菜谱步骤、多种烹饪风格以及广泛的食品类别。
链接: https://arxiv.org/abs/2503.05228
作者: Ruoxuan Zhang,Hongxia Xie,Yi Yao,Jian-Yu Jiang-Lin,Bin Wen,Ling Lo,Hong-Han Shuai,Yung-Hui Li,Wen-Huang Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recipe image generation is an important challenge in food computing, with applications from culinary education to interactive recipe platforms. However, there is currently no real-world dataset that comprehensively connects recipe goals, sequential steps, and corresponding images. To address this, we introduce RecipeGen, the first real-world goal-step-image benchmark for recipe generation, featuring diverse ingredients, varied recipe steps, multiple cooking styles, and a broad collection of food categories. Data is in this https URL.
zh
[CV-40] DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility NAACL25
【速读】:该论文致力于解决视频到语音(Video-to-Speech, V2S)合成中,在仅利用视觉信息的情况下同时实现语音内容的准确重建与说话人特性的有效保持这一挑战性问题。传统方法通常依赖额外的声学提示以确保训练收敛,而近期的视听预训练虽缓解了这一需求,但现有方法仍难以在语音可懂度(acoustic intelligibility)与说话人特定特征保留之间取得平衡。为应对这一局限,论文提出了DiVISe(Direct Visual-Input Speech Synthesis),这是一种端到端的V2S模型,其创新点在于直接从视频帧预测梅尔频谱图(Mel-spectrogram),而不依赖任何声学提示。通过这种方法,DiVISe不仅在语音可懂度方面表现出色,而且在LRS2和LRS3数据集上的客观与主观评估指标上均优于现有方法,同时展现出更强的数据与模型参数扩展能力。
链接: https://arxiv.org/abs/2503.05223
作者: Yifan Liu,Yu Fang,Zhouhan Lin
机构: Shanghai Jiao Tong University (上海交通大学); ShanghaiTech University (上海科技大学)
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: to be published in NAACL 25
点击查看摘要
Abstract:Video-to-speech (V2S) synthesis, the task of generating speech directly from silent video input, is inherently more challenging than other speech synthesis tasks due to the need to accurately reconstruct both speech content and speaker characteristics from visual cues alone. Recently, audio-visual pre-training has eliminated the need for additional acoustic hints in V2S, which previous methods often relied on to ensure training convergence. However, even with pre-training, existing methods continue to face challenges in achieving a balance between acoustic intelligibility and the preservation of speaker-specific characteristics. We analyzed this limitation and were motivated to introduce DiVISe (Direct Visual-Input Speech Synthesis), an end-to-end V2S model that predicts Mel-spectrograms directly from video frames alone. Despite not taking any acoustic hints, DiVISe effectively preserves speaker characteristics in the generated audio, and achieves superior performance on both objective and subjective metrics across the LRS2 and LRS3 datasets. Our results demonstrate that DiVISe not only outperforms existing V2S models in acoustic intelligibility but also scales more effectively with increased data and model parameters. Code and weights can be found at this https URL.
zh
[CV-41] Separability Membrane: 3D Active Contour for Point Cloud Surface Reconstruction
【速读】:该论文旨在解决从三维点云中提取物体表面的问题,尤其关注在存在噪声或离群点导致边界模糊的情况下,如何准确重建表面边界。论文提出的方法名为“可分离性膜 (Separability Membrane)”,其关键在于将三维物体表面定义为最大化内外区域点特征(如强度、颜色或局部密度)可分性的边界,并通过基于Fisher比率的准则实现。解决方案的核心是结合自适应B样条曲面,既能最大化类别可分性以精确识别物体表面,又能控制三维表面模型的刚性,从而在局部和全局可分性之间取得平衡。这种方法无需训练数据或体素化表示转换,即可有效处理复杂条件下的表面提取任务。
链接: https://arxiv.org/abs/2503.05217
作者: Gulpi Qorik Oktagalu Pratamasunu,Guoqing Hao,Kazuhiro Fukui
机构: Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba (筑波大学); Department of Integrated Information Technology, Aoyama Gakuin University (青山学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper proposes Separability Membrane, a robust 3D active contour for extracting a surface from 3D point cloud object. Our approach defines the surface of a 3D object as the boundary that maximizes the separability of point features, such as intensity, color, or local density, between its inner and outer regions based on Fisher’s ratio. Separability Membrane identifies the exact surface of a 3D object by maximizing class separability while controlling the rigidity of the 3D surface model with an adaptive B-spline surface that adjusts its properties based on the local and global separability. A key advantage of our method is its ability to accurately reconstruct surface boundaries even when they are ambiguous due to noise or outliers, without requiring any training data or conversion to volumetric representation. Evaluations on a synthetic 3D point cloud dataset and the 3DNet dataset demonstrate the membrane’s effectiveness and robustness under diverse conditions.
zh
[CV-42] Data-Efficient Generalization for Zero-shot Composed Image Retrieval
【速读】:该论文旨在解决零样本组合图像检索(ZS-CIR)任务中,基于视觉-语言预训练范式的网络泛化能力不足的问题。具体而言,现有方法在训练与推理阶段因模态差异(modality discrepancy)和分布偏移(distribution shift)而导致性能下降。为了解决这一问题,论文提出了一种高效泛化(Data-efficient Generalization, DeG)框架,其关键在于两个创新设计:文本补充(Textual Supplement, TS)模块和语义集(Semantic-Set, S-Set)。TS模块通过利用训练过程中的组合文本语义增强伪词嵌入的语义表达,从而有效缓解模态差异;而S-Set则利用预训练视觉-语言模型(Vision-Language Model, VLM)的零样本能力,减轻分布偏移并缓解大规模图像-文本数据冗余引起的过拟合问题。实验结果表明,DeG在四个ZS-CIR基准数据集上以更少的数据实现了超越当前最优方法(SOTA)的表现,并显著减少了训练和推理时间。
链接: https://arxiv.org/abs/2503.05204
作者: Zining Chen,Zhicheng Zhao,Fei Su,Xiaoqin Zhang,Shijian Lu
机构: School of Artificial Intelligence, Beijing University of Posts and Telecommunications (北京邮电大学); College of Computer Science and Technology, Zhejiang University of Technology (浙江工业大学); College of Computing and Data Science, Nanyang Technological University (南洋理工大学, 新加坡)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm that employs a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space. However, this approach tends to impede network generalization due to modality discrepancy and distribution shift between training and inference. To this end, we propose a Data-efficient Generalization (DeG) framework, including two novel designs, namely, Textual Supplement (TS) module and Semantic-Set (S-Set). The TS module exploits compositional textual semantics during training, enhancing the pseudo-word token with more linguistic semantics and thus mitigating the modality discrepancy effectively. The S-Set exploits the zero-shot capability of pretrained Vision-Language Models (VLMs), alleviating the distribution shift and mitigating the overfitting issue from the redundancy of the large-scale image-text data. Extensive experiments over four ZS-CIR benchmarks show that DeG outperforms the state-of-the-art (SOTA) methods with much less training data, and saves substantial training and inference time for practical usage.
zh
[CV-43] STGA: Selective-Training Gaussian Head Avatars
【速读】:该论文旨在解决动态高斯头像细节增强的问题。解决方案的关键在于提出了一种选择性训练高斯头像(Selective-Training Gaussian Head Avatars, STGA)的方法。具体而言,该方法基于FLAME参数化模型训练动态高斯模型,并将每个高斯光晕嵌入到FLAME网格中以实现基于网格的高斯模型动画。在训练前,采用一种选择策略计算每帧中需要优化的三维高斯光晕;在每一帧的训练过程中,仅优化选定的三维高斯光晕参数,而冻结其他光晕的参数。这种帧间光晕优化参与者的动态变化显著提升了细节的真实感。与基于网络的方法相比,该方法具有更短的训练时间且效果更优;与基于网格的方法相比,在相同训练时间内生成的细节更加逼真。此外,消融实验验证了该方法有效提升了细节质量。
链接: https://arxiv.org/abs/2503.05196
作者: Hanzhi Guo,Yixiao Chen,Dongye Xiaonuo,Zeyu Tian,Dongdong Weng,Le Luo
机构: Beijing Institute of Technology(Beijing理工大学); Pengcheng Laboratory(Guangzhou实验室)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We propose selective-training Gaussian head avatars (STGA) to enhance the details of dynamic head Gaussian. The dynamic head Gaussian model is trained based on the FLAME parameterized model. Each Gaussian splat is embedded within the FLAME mesh to achieve mesh-based animation of the Gaussian model. Before training, our selection strategy calculates the 3D Gaussian splat to be optimized in each frame. The parameters of these 3D Gaussian splats are optimized in the training of each frame, while those of the other splats are frozen. This means that the splats participating in the optimization process differ in each frame, to improve the realism of fine details. Compared with network-based methods, our method achieves better results with shorter training time. Compared with mesh-based methods, our method produces more realistic details within the same training time. Additionally, the ablation experiment confirms that our method effectively enhances the quality of details.
zh
[CV-44] Partially Supervised Unpaired Multi-Modal Learning for Label-Efficient Medical Image Segmentation
【速读】:该论文旨在解决无配对多模态学习(Unpaired Multi-Modal Learning, UMML)在医学图像分析中的标注成本过高的问题。传统方法需要完全标注的多模态数据集,而本文通过引入部分标注数据,提出了部分监督无配对多模态学习(Partially Supervised Unpaired Multi-Modal Learning, PSUMML)的新范式,以将标注成本降低多达一半。解决方案的关键在于设计了一个名为分解部分类别适应与快照集成自训练(Decomposed partial class adaptation with snapshot Ensembled Self-Training, DEST)的框架。具体而言,该框架包含一个具有模态特定归一化层的紧凑分割网络,用于处理部分标注的无配对多模态数据。由于部分类别标注导致的复杂部分类别分布差异阻碍了跨模态的有效知识迁移,论文通过分解定理对此现象进行了理论分析,并提出了一种分解部分类别适应技术来精确对齐不同模态之间的部分标注类别,从而减少分布差异。此外,还提出了快照集成自训练技术,利用训练过程中的有价值快照模型为部分标注像素分配伪标签,以进一步提升模型性能。实验结果表明,该框架在心脏亚结构分割和腹部多器官分割两项任务中显著优于现有方法。
链接: https://arxiv.org/abs/2503.05190
作者: Lei Zhu,Yanyu Xu,Huazhu Fu,Xinxing Xu,Rick Siow Mong Goh,Yong Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MLMI 2024
点击查看摘要
Abstract:Unpaired Multi-Modal Learning (UMML) which leverages unpaired multi-modal data to boost model performance on each individual modality has attracted a lot of research interests in medical image analysis. However, existing UMML methods require multi-modal datasets to be fully labeled, which incurs tremendous annotation cost. In this paper, we investigate the use of partially labeled data for label-efficient unpaired multi-modal learning, which can reduce the annotation cost by up to one half. We term the new learning paradigm as Partially Supervised Unpaired Multi-Modal Learning (PSUMML) and propose a novel Decomposed partial class adaptation with snapshot Ensembled Self-Training (DEST) framework for it. Specifically, our framework consists of a compact segmentation network with modality specific normalization layers for learning with partially labeled unpaired multi-modal data. The key challenge in PSUMML lies in the complex partial class distribution discrepancy due to partial class annotation, which hinders effective knowledge transfer across modalities. We theoretically analyze this phenomenon with a decomposition theorem and propose a decomposed partial class adaptation technique to precisely align the partially labeled classes across modalities to reduce the distribution discrepancy. We further propose a snapshot ensembled self-training technique to leverage the valuable snapshot models during training to assign pseudo-labels to partially labeled pixels for self-training to boost model performance. We perform extensive experiments under different scenarios of PSUMML for two medical image segmentation tasks, namely cardiac substructure segmentation and abdominal multi-organ segmentation. Our framework outperforms existing methods significantly.
zh
[CV-45] Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions CVPR2025
【速读】:该论文致力于解决现有基于视觉-语言模型附加字幕的文本-视频检索方法难以充分捕捉视频中包含的丰富语义(如时间变化)以及生成模型可能引入的错误信息导致检索不准确的问题。论文的关键解决方案是提出了一种名为Narrating the Video (NarVid) 的新框架,通过多方面利用叙述性字幕信息来增强检索性能:1)通过叙述与视频之间的跨模态交互提升特征表达;2)基于查询的自适应过滤以抑制无关或错误信息;3)结合查询-视频相似性和查询-叙述相似性的双模态匹配分数;4)利用来自不同视角的两种相似性从多个角度学习判别性特征的困难负样本损失。实验结果表明,NarVid 在多种基准数据集上实现了最先进的性能。
链接: https://arxiv.org/abs/2503.05186
作者: Chan hur,Jeong-hun Hong,Dong-hun Lee,Dabin Kang,Semin Myeong,Sang-hyo Park,Hyeyoung Park
机构: School of Computer Science and Engineering, Kyungpook National University (庆北国立大学计算机科学与工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025
点击查看摘要
Abstract:In recent text-video retrieval, the use of additional captions from vision-language models has shown promising effects on the performance. However, existing models using additional captions often have struggled to capture the rich semantics, including temporal changes, inherent in the video. In addition, incorrect information caused by generative models can lead to inaccurate retrieval. To address these issues, we propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, the narration. The proposed NarVid exploits narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect information, 3) dual-modal matching score by adding query-video similarity and query-narration similarity, and 4) hard-negative loss to learn discriminative features from multiple perspectives using the two similarities from different views. Experimental results demonstrate that NarVid achieves state-of-the-art performance on various benchmark datasets.
zh
[CV-46] Spectral-Spatial Extraction through Layered Tensor Decomposition for Hyperspectral Anomaly Detection
【速读】:该论文旨在解决高光谱异常检测(Hyperspectral Anomaly Detection, HAD)中低秩张量表示(Low Rank Tensor Representation, LRTR)方法存在的两个主要问题:一是通常忽略光谱异常,二是依赖大规模矩阵奇异值分解。为克服这些局限性,论文提出了一种分层张量分解(Layered Tensor Decomposition, LTD)框架,其关键是结合非负矩阵分解(Non-negative Matrix Factorization, NMF)提取光谱异常以缓解光谱维度冗余,同时利用LRTR提取空间异常并减轻空间冗余。此外,通过引入具有验证机制的秩约简策略,该框架能够自适应地减小数据规模,避免过度约简。论文还发展了一种基于近端交替最小化(Proximal Alternating Minimization)的迭代算法来求解提出的LTD模型,并证明了其收敛性。理论分析进一步表明,张量管秩与张量组稀疏正则化(Tensor Group Sparsity Regularization, TGSR)之间存在等价关系,在温和条件下,松弛形式的TGSR与其原始形式共享相同的全局最优解和最优值。实验结果表明,所提方法在Airport-Beach-Urban和MVTec数据集上的表现优于现有最先进的HAD方法。
链接: https://arxiv.org/abs/2503.05183
作者: Quan Yu,Yu-Hong Dai,Minru Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注:
点击查看摘要
Abstract:Low rank tensor representation (LRTR) methods are very useful for hyperspectral anomaly detection (HAD). To overcome the limitations that they often overlook spectral anomaly and rely on large-scale matrix singular value decomposition, we first apply non-negative matrix factorization (NMF) to alleviate spectral dimensionality redundancy and extract spectral anomaly and then employ LRTR to extract spatial anomaly while mitigating spatial redundancy, yielding a highly efffcient layered tensor decomposition (LTD) framework for HAD. An iterative algorithm based on proximal alternating minimization is developed to solve the proposed LTD model, with convergence guarantees provided. Moreover, we introduce a rank reduction strategy with validation mechanism that adaptively reduces data size while preventing excessive reduction. Theoretically, we rigorously establish the equivalence between the tensor tubal rank and tensor group sparsity regularization (TGSR) and, under mild conditions, demonstrate that the relaxed formulation of TGSR shares the same global minimizers and optimal values as its original counterpart. Experimental results on the Airport-Beach-Urban and MVTec datasets demonstrate that our approach outperforms state-of-the-art methods in the HAD task.
zh
[CV-47] MGSR: 2D/3D Mutual-boosted Gaussian Splatting for High-fidelity Surface Reconstruction under Various Light Conditions
【速读】:该论文试图解决生成式三维高斯点 splatting(3D Gaussian Splatting, 3D-GS)中新型视角合成(Novel View Synthesis, NVS)与表面重建(Surface Reconstruction, SR)任务之间的传统权衡问题。传统方法中,基于GS的渲染方法在多变光照条件下表现不佳且难以生成精确表面,而基于GS的重建方法则通常会牺牲渲染质量。论文的关键解决方案在于提出MGSR(Mutual-boosted Gaussian splatting for Surface Reconstruction),通过引入二维/三维相互增强的高斯点splating框架,同时提升渲染质量和三维重建准确性。其核心在于设计了包含二维高斯点(2D-GS)和三维高斯点(3D-GS)两个分支的架构:2D-GS分支专注于精确几何信息的表面重建,并为3D-GS分支提供指导;3D-GS分支利用几何引导的光照分解模块实现真实感渲染;两者通过交替优化过程进行相互监督,同时采用独立预热阶段和早期停止策略以降低计算成本。这一创新性架构有效解决了渲染与重建之间的传统权衡问题。
链接: https://arxiv.org/abs/2503.05182
作者: Qingyuan Zhou,Yuehu Gong,Weidong Yang,Jiaze Li,Yeqi Luo,Baixin Xu,Shuhao Li,Ben Fei,Ying He
机构: Fudan University (复旦大学); Nanyang Technological University (南洋理工大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures
点击查看摘要
Abstract:Novel view synthesis (NVS) and surface reconstruction (SR) are essential tasks in 3D Gaussian Splatting (3D-GS). Despite recent progress, these tasks are often addressed independently, with GS-based rendering methods struggling under diverse light conditions and failing to produce accurate surfaces, while GS-based reconstruction methods frequently compromise rendering quality. This raises a central question: must rendering and reconstruction always involve a trade-off? To address this, we propose MGSR, a 2D/3D Mutual-boosted Gaussian splatting for Surface Reconstruction that enhances both rendering quality and 3D reconstruction accuracy. MGSR introduces two branches–one based on 2D-GS and the other on 3D-GS. The 2D-GS branch excels in surface reconstruction, providing precise geometry information to the 3D-GS branch. Leveraging this geometry, the 3D-GS branch employs a geometry-guided illumination decomposition module that captures reflected and transmitted components, enabling realistic rendering under varied light conditions. Using the transmitted component as supervision, the 2D-GS branch also achieves high-fidelity surface reconstruction. Throughout the optimization process, the 2D-GS and 3D-GS branches undergo alternating optimization, providing mutual supervision. Prior to this, each branch completes an independent warm-up phase, with an early stopping strategy implemented to reduce computational costs. We evaluate MGSR on a diverse set of synthetic and real-world datasets, at both object and scene levels, demonstrating strong performance in rendering and surface reconstruction.
zh
[CV-48] SplatPose: Geometry-Aware 6-DoF Pose Estimation from Single RGB Image via 3D Gaussian Splatting IROS2025
【速读】:本文针对基于单目RGB图像进行六自由度(6-DoF)位姿估计时因依赖初始估计而容易产生精度损失以及旋转模糊问题,同时避免引入高成本的深度传感器或多视图设置,提出了一种名为SplatPose的新框架。其关键在于结合3D高斯点阵(3D Gaussian Splatting, 3DGS)与双分支神经网络架构,并通过双注意射线评分网络(Dual-Attention Ray Scoring Network, DARS-Net)创新性地在几何域内解耦位置和角度对齐,显式建模方向依赖以缓解旋转模糊问题。此外,粗到细优化管道逐步通过查询图像与由3DGS合成视图之间的密集二维特征对齐来精化位姿估计,有效修正了由于稀疏射线采样导致的特征错位和深度误差。实验结果表明,SplatPose在单目RGB设置下的6-DoF位姿估计精度达到当前最优水平,可媲美需要深度信息或多视图图像的方法。
链接: https://arxiv.org/abs/2503.05174
作者: Linqi Yang,Xiongwei Zhao,Qihao Sun,Ke Wang,Ao Chen,Peng Kang
机构: State Key Laboratory of Robotics and System, Harbin Institute of Technology (哈尔滨工业大学); Zhengzhou Research Institute, Harbin Institute of Technology (哈尔滨工业大学); School of Information Science and Technology, Harbin Institute of Technology (Shen Zhen) (哈尔滨工业大学深圳学院); Jianghuai Advance Technology Center (江淮先进技术研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Submitted to IROS 2025
点击查看摘要
Abstract:6-DoF pose estimation is a fundamental task in computer vision with wide-ranging applications in augmented reality and robotics. Existing single RGB-based methods often compromise accuracy due to their reliance on initial pose estimates and susceptibility to rotational ambiguity, while approaches requiring depth sensors or multi-view setups incur significant deployment costs. To address these limitations, we introduce SplatPose, a novel framework that synergizes 3D Gaussian Splatting (3DGS) with a dual-branch neural architecture to achieve high-precision pose estimation using only a single RGB image. Central to our approach is the Dual-Attention Ray Scoring Network (DARS-Net), which innovatively decouples positional and angular alignment through geometry-domain attention mechanisms, explicitly modeling directional dependencies to mitigate rotational ambiguity. Additionally, a coarse-to-fine optimization pipeline progressively refines pose estimates by aligning dense 2D features between query images and 3DGS-synthesized views, effectively correcting feature misalignment and depth errors from sparse ray sampling. Experiments on three benchmark datasets demonstrate that SplatPose achieves state-of-the-art 6-DoF pose estimation accuracy in single RGB settings, rivaling approaches that depend on depth or multi-view images.
zh
[CV-49] Spatial Context-Driven Positive Pair Sampling for Enhanced Histopathology Image Classification
【速读】:该论文致力于解决基于全幻灯片图像(Whole-Slide Images, WSIs)的癌症分类中深度学习方法对大量标注数据依赖的问题。为应对这一挑战,论文提出了一种基于空间上下文驱动的正样本对采样策略,用于自监督学习(Self-Supervised Learning, SSL)。解决方案的关键在于利用WSIs中相邻图像块的自然一致性,通过构建来自空间邻近区域的生物学相关的正样本对,从而捕获组织病理学图像中复杂的空间关系,增强图像块级别的表征能力,最终提升幻灯片级别的分类性能。实验结果表明,该方法在多个数据集上的分类准确率比标准方法提高了5%到10%,为更临床相关的癌症诊断AI模型奠定了基础。
链接: https://arxiv.org/abs/2503.05170
作者: Willmer Rafell Quinones Robles,Sakonporn Noree,Young Sin Ko,Bryan Wong,Jongwoo Kim,Mun Yong Yi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Deep learning has demonstrated great promise in cancer classification from whole-slide images (WSIs) but remains constrained by the need for extensive annotations. Annotation-free methods, such as multiple instance learning (MIL) and self-supervised learning (SSL), have emerged to address this challenge; however, current SSL techniques often depend on synthetic augmentations or temporal context, which may not adequately capture the intricate spatial relationships inherent to histopathology. In this work, we introduce a novel spatial context-driven positive pair sampling strategy for SSL that leverages the natural coherence of adjacent patches in WSIs. By constructing biologically relevant positive pairs from spatially proximate patches, our approach harnesses inherent spatial coherence to enhance patch-level representations, ultimately boosting slide-level classification performance. Experiments on multiple datasets reveal that our strategy improves classification accuracy by 5% to 10% over the standard method, paving the way for more clinically relevant AI models in cancer diagnosis. The code is available at this https URL.
zh
[CV-50] EvolvingGS: High-Fidelity Streamable Volumetric Video via Evolving 3D Gaussian Representation
【速读】:该论文旨在解决动态场景(如复杂人类表演的长时间序列)重建中的挑战,特别是现有方法在处理剧烈运动、频繁拓扑变化以及道具交互时难以保持时间稳定性的问题。这些方法通常将整个序列分割成独立处理的帧组,导致时间连贯性丧失及存储效率低下。为了解决这些问题,论文提出了一种名为EvolvingGS的两阶段策略作为解决方案的关键:首先通过高斯模型的变形粗略对齐目标帧;然后在快速变化区域以最小点的增减进行细化。这种方法利用增量演进表示的灵活性,在保持快速渲染的同时,显著提升了每帧质量和时间连贯性指标,并通过挖掘连续帧之间的时序一致性实现了超过50倍的压缩率。实验结果表明,该方法在动态场景重建领域尤其是复杂人类表演的长序列重建方面取得了显著进展。
链接: https://arxiv.org/abs/2503.05162
作者: Chao Zhang,Yifeng Zhou,Shuheng Wang,Wenfa Li,Degang Wang,Yi Xu,Shaohui Jiao
机构: Bytedance
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We have recently seen great progress in 3D scene reconstruction through explicit point-based 3D Gaussian Splatting (3DGS), notable for its high quality and fast rendering speed. However, reconstructing dynamic scenes such as complex human performances with long durations remains challenging. Prior efforts fall short of modeling a long-term sequence with drastic motions, frequent topology changes or interactions with props, and resort to segmenting the whole sequence into groups of frames that are processed independently, which undermines temporal stability and thereby leads to an unpleasant viewing experience and inefficient storage footprint. In view of this, we introduce EvolvingGS, a two-stage strategy that first deforms the Gaussian model to coarsely align with the target frame, and then refines it with minimal point addition/subtraction, particularly in fast-changing areas. Owing to the flexibility of the incrementally evolving representation, our method outperforms existing approaches in terms of both per-frame and temporal quality metrics while maintaining fast rendering through its purely explicit representation. Moreover, by exploiting temporal coherence between successive frames, we propose a simple yet effective compression algorithm that achieves over 50x compression rate. Extensive experiments on both public benchmarks and challenging custom datasets demonstrate that our method significantly advances the state-of-the-art in dynamic scene reconstruction, particularly for extended sequences with complex human performances.
zh
[CV-51] GaussianCAD: Robust Self-Supervised CAD Reconstruction from Three Orthographic Views Using 3D Gaussian Splatting
【速读】:该论文致力于解决基于计算机辅助设计(CAD)草图自动重建3D CAD模型的问题。传统方法依赖于矢量CAD草图和3D真实标注进行监督,但在工业应用中这些数据难以获取且对噪声输入敏感。为克服这些限制,论文将CAD重建视为稀疏视角3D重建的一个特例。然而,这一转化带来了两个主要挑战:(1) CAD草图与自然图像之间的模态差异;(2) CAD草图精确相机姿态估计的困难。关键在于首先将CAD草图转换为类似自然图像的表示形式并提取对应的掩码,接着通过手动计算正交视图的相机姿态确保3D坐标系中的精确对齐,最后采用定制化的稀疏视角3D重建方法从对齐的正交视图实现高质量重建。此外,通过利用光栅化CAD草图实现自监督,该方法摆脱了对矢量CAD草图和3D真实标注的依赖,并在Sub-Fusion360数据集上的实验验证了其在性能提升及抗噪能力方面的显著优势。
链接: https://arxiv.org/abs/2503.05161
作者: Zheng Zhou,Zhe Li,Bo Yu,Lina Hu,Liang Dong,Zijian Yang,Xiaoli Liu,Ning Xu,Ziwei Wang,Yonghao Dang,Jianqin Yin
机构: School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications (北京邮电大学智能工程与自动化学院), China.; State Grid Hubei Electric Power Co., Ltd. Information and Communication Company (国家电网湖北省电力有限公司信通公司), Hubei, China.
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
备注:
点击查看摘要
Abstract:The automatic reconstruction of 3D computer-aided design (CAD) models from CAD sketches has recently gained significant attention in the computer vision community. Most existing methods, however, rely on vector CAD sketches and 3D ground truth for supervision, which are often difficult to be obtained in industrial applications and are sensitive to noise inputs. We propose viewing CAD reconstruction as a specific instance of sparse-view 3D reconstruction to overcome these limitations. While this reformulation offers a promising perspective, existing 3D reconstruction methods typically require natural images and corresponding camera poses as inputs, which introduces two major significant challenges: (1) modality discrepancy between CAD sketches and natural images, and (2) difficulty of accurate camera pose estimation for CAD sketches. To solve these issues, we first transform the CAD sketches into representations resembling natural images and extract corresponding masks. Next, we manually calculate the camera poses for the orthographic views to ensure accurate alignment within the 3D coordinate system. Finally, we employ a customized sparse-view 3D reconstruction method to achieve high-quality reconstructions from aligned orthographic views. By leveraging raster CAD sketches for self-supervision, our approach eliminates the reliance on vector CAD sketches and 3D ground truth. Experiments on the Sub-Fusion360 dataset demonstrate that our proposed method significantly outperforms previous approaches in CAD reconstruction performance and exhibits strong robustness to noisy inputs.
zh
[CV-52] Accelerating Diffusion Transformer via Gradient-Optimized Cache
【速读】:该论文旨在解决通过特征缓存加速扩散变换器(DiT)采样过程中存在的两个主要问题:(1) 缓存块的逐步误差累积显著降低生成质量,尤其是在超过50%的块被缓存时;(2) 当前的误差补偿方法忽略了缓存过程中的动态扰动模式,导致误差校正效果次优。为了解决这些问题,论文提出了梯度优化缓存(Gradient-Optimized Cache, GOC),其关键创新包括:(1) 缓存梯度传播:通过梯度队列动态计算缓存特征与重新计算特征之间的梯度差异,并以加权方式将这些梯度传播到后续步骤,直接补偿由缓存引入的近似误差;(2) 倾点感知优化:通过对特征变化模式的统计分析,识别去噪轨迹方向改变的关键倾点,并通过在检测到的相位上对齐梯度更新,避免误差校正过程中的冲突梯度方向。
链接: https://arxiv.org/abs/2503.05156
作者: Junxiang Qiu,Lin Liu,Shuo Wang,Jinda Lu,Kezhou Chen,Yanbin Hao
机构: University of Science and Technology of China (中国科学技术大学); Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Feature caching has emerged as an effective strategy to accelerate diffusion transformer (DiT) sampling through temporal feature reuse. It is a challenging problem since (1) Progressive error accumulation from cached blocks significantly degrades generation quality, particularly when over 50% of blocks are cached; (2) Current error compensation approaches neglect dynamic perturbation patterns during the caching process, leading to suboptimal error correction. To solve these problems, we propose the Gradient-Optimized Cache (GOC) with two key innovations: (1) Cached Gradient Propagation: A gradient queue dynamically computes the gradient differences between cached and recomputed features. These gradients are weighted and propagated to subsequent steps, directly compensating for the approximation errors introduced by caching. (2) Inflection-Aware Optimization: Through statistical analysis of feature variation patterns, we identify critical inflection points where the denoising trajectory changes direction. By aligning gradient updates with these detected phases, we prevent conflicting gradient directions during error correction. Extensive evaluations on ImageNet demonstrate GOC’s superior trade-off between efficiency and quality. With 50% cached blocks, GOC achieves IS 216.28 (26.3% higher) and FID 3.907 (43% lower) compared to baseline DiT, while maintaining identical computational costs. These improvements persist across various cache ratios, demonstrating robust adaptability to different acceleration requirements.
zh
[CV-53] Development and Enhancement of Text-to-Image Diffusion Models
【速读】:本文研究旨在解决文本到图像去噪扩散模型在样本多样性有限和训练不稳定方面的关键挑战。为应对这些难题,论文引入了 Classifier-Free Guidance (CFG) 和 Exponential Moving Average (EMA) 技术作为解决方案的核心。CFG 提供了一种无需分类器的引导机制以增强生成图像的质量与多样性,而 EMA 则通过平滑模型参数来提升训练过程的稳定性。通过这些关键技术的结合应用,论文显著提高了生成图像的质量、多样性和稳定性,并在 Hugging Face 的领先文本到图像生成模型基础上建立了新的基准,推动了生成式人工智能 (Generative AI) 领域的发展。
链接: https://arxiv.org/abs/2503.05149
作者: Rajdeep Roshan Sahu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This research focuses on the development and enhancement of text-to-image denoising diffusion models, addressing key challenges such as limited sample diversity and training instability. By incorporating Classifier-Free Guidance (CFG) and Exponential Moving Average (EMA) techniques, this study significantly improves image quality, diversity, and stability. Utilizing Hugging Face’s state-of-the-art text-to-image generation model, the proposed enhancements establish new benchmarks in generative AI. This work explores the underlying principles of diffusion models, implements advanced strategies to overcome existing limitations, and presents a comprehensive evaluation of the improvements achieved. Results demonstrate substantial progress in generating stable, diverse, and high-quality images from textual descriptions, advancing the field of generative artificial intelligence and providing new foundations for future applications. Keywords: Text-to-image, Diffusion model, Classifier-free guidance, Exponential moving average, Image generation. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2503.05149 [cs.CV] (or arXiv:2503.05149v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2503.05149 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-54] R1-Zeros “Aha Moment” in Visual Reasoning on a 2B Non-SFT Model
【速读】:该论文试图解决如何在多模态推理任务中实现类似于DeepSeek R1所展示的复杂推理能力,特别是“顿悟时刻”(即模型展现出自我反思及响应长度增加的现象)。此前针对多模态推理的研究未能成功复现这些特性。论文的关键在于通过直接在非SFT(Supervised Fine-Tuning)的2B参数规模模型Qwen2-VL-2B上应用强化学习(Reinforcement Learning, RL),基于SAT数据集进行训练,使模型在CVBench测试集上的准确率达到59.47%,显著优于基础模型约30%,并且比基于SFT设置的模型高出约2%。此外,作者分享了在尝试使用RL改进指令模型(Instruct Model)以达到类似R1推理能力时的失败案例与见解,强调了应用RL于指令模型常导致推理轨迹过于简单化,以及朴素的长度奖励机制难以有效激发推理能力。
链接: https://arxiv.org/abs/2503.05132
作者: Hengguang Zhou,Xirui Li,Ruochen Wang,Minhao Cheng,Tianyi Zhou,Cho-Jui Hsieh
机构: University of California, LA (加州大学洛杉矶分校); Pennsylvania State University (宾夕法尼亚州立大学); University of Maryland (马里兰大学); University of California, LA (加州大学洛杉矶分校)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 6 figures
点击查看摘要
Abstract:Recently DeepSeek R1 demonstrated how reinforcement learning with simple rule-based incentives can enable autonomous development of complex reasoning in large language models, characterized by the “aha moment”, in which the model manifest self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately ~30% and exceeding both SFT setting by ~2%. In addition, we share our failed attempts and insights in attempting to achieve R1-like reasoning using RL with instruct models. aiming to shed light on the challenges involved. Our key observations include: (1) applying RL on instruct model often results in trivial reasoning trajectories, and (2) naive length reward are ineffective in eliciting reasoning capabilities. The project code is available at this https URL
zh
[CV-55] HexPlane Representation for 3D Semantic Scene Understanding
【速读】:该论文旨在解决三维语义场景理解中的高效性和准确性问题。现有方法如点云(Point Cloud)和体素(Voxel)表示在处理稀疏且无序的三维数据时存在效率瓶颈或信息损失。为应对这一挑战,论文提出了一种名为HexPlane的新表征方法,其关键在于设计了一个视图投影模块(View Projection Module, VPM),将三维点云投影到六个平面以最大程度保留空间信息,并通过二维编码器提取特征后,利用HexPlane关联模块(HexPlane Association Module, HAM)自适应融合每个点的最相关信息。最终,融合后的点特征被送入任务头(Task Head)生成预测结果。这种方法充分利用了高度优化的二维操作以及现有的二维模型、网络权重和训练策略,从而实现了高效的三维场景理解,同时在ScanNet和SemanticKITTI基准测试中展现了与现有方法相当甚至更优的表现。
链接: https://arxiv.org/abs/2503.05127
作者: Zeren Chen,Yuenan Hou,Yulin Chen,Li Liu,Xiao Sun,Lu Sheng
机构: School of Software, Beihang University (北航软件学院); Shanghai AI Laboratory (上海人工智能实验室); College of Electronic Science and Technology, National University of Defense Technology (国防科技大学电子科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 2 figures
点击查看摘要
Abstract:In this paper, we introduce the HexPlane representation for 3D semantic scene understanding. Specifically, we first design the View Projection Module (VPM) to project the 3D point cloud into six planes to maximally retain the original spatial information. Features of six planes are extracted by the 2D encoder and sent to the HexPlane Association Module (HAM) to adaptively fuse the most informative information for each point. The fused point features are further fed to the task head to yield the ultimate predictions. Compared to the popular point and voxel representation, the HexPlane representation is efficient and can utilize highly optimized 2D operations to process sparse and unordered 3D point clouds. It can also leverage off-the-shelf 2D models, network weights, and training recipes to achieve accurate scene understanding in 3D space. On ScanNet and SemanticKITTI benchmarks, our algorithm, dubbed HexNet3D, achieves competitive performance with previous algorithms. In particular, on the ScanNet 3D segmentation task, our method obtains 77.0 mIoU on the validation set, surpassing Point Transformer V2 by 1.6 mIoU. We also observe encouraging results in indoor 3D detection tasks. Note that our method can be seamlessly integrated into existing voxel-based, point-based, and range-based approaches and brings considerable gains without bells and whistles. The codes will be available upon publication.
zh
[CV-56] EDM: Efficient Deep Feature Matching
【速读】:该论文旨在解决现有特征匹配方法在追求高精度的同时缺乏效率优化的问题。论文重新审视了主流的无检测器特征匹配流程,并在提升精度的同时着重考虑效率优化。解决方案的关键在于提出了一种高效的深度特征匹配网络EDM(Efficient Deep feature Matching network)。具体而言,首先采用具有较少维度的更深卷积神经网络(CNN)提取多层级特征;其次引入相关性注入模块(Correlation Injection Module),对高级别深度特征进行变换,并逐步从全局到局部注入特征相关性,以实现高效多尺度特征聚合,从而同时提升速度与性能;最后设计了一种新颖的轻量级基于轴向回归的双向回归头,直接预测亚像素级别的对应关系,避免了在高分辨率局部特征热图上显式定位关键点带来的巨大计算开销。此外,还提出了有效的特征选择策略以增强匹配准确性。实验结果表明,EDM在多种基准数据集上实现了竞争性的匹配精度,并展现出卓越的效率,为实际应用提供了宝贵的实践经验。
链接: https://arxiv.org/abs/2503.05122
作者: Xi Li,Tong Rao,Cihui Pan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent feature matching methods have achieved remarkable performance but lack efficiency consideration. In this paper, we revisit the mainstream detector-free matching pipeline and improve all its stages considering both accuracy and efficiency. We propose an Efficient Deep feature Matching network, EDM. We first adopt a deeper CNN with fewer dimensions to extract multi-level features. Then we present a Correlation Injection Module that conducts feature transformation on high-level deep features, and progressively injects feature correlations from global to local for efficient multi-scale feature aggregation, improving both speed and performance. In the refinement stage, a novel lightweight bidirectional axis-based regression head is designed to directly predict subpixel-level correspondences from latent features, avoiding the significant computational cost of explicitly locating keypoints on high-resolution local feature heatmaps. Moreover, effective selection strategies are introduced to enhance matching accuracy. Extensive experiments show that our EDM achieves competitive matching accuracy on various benchmarks and exhibits excellent efficiency, offering valuable best practices for real-world applications. The code is available at this https URL.
zh
[CV-57] SMILENet: Unleashing Extra-Large Capacity Image Steganography via a Synergistic Mosaic InvertibLE Hiding Network
【速读】:该论文旨在解决现有图像隐写术方法在隐藏容量(通常为1到7幅图像)方面的根本限制,主要由于严重的信息干扰和能力-失真权衡之间的不协调。论文提出了一种名为SMILENet的新颖协同框架,通过三个关键创新实现了25幅图像的隐藏:(i) 协同网络架构协调可逆与不可逆操作,以高效利用秘密图像和载体图像中的信息冗余。可逆的“可逆载体驱动马赛克 (ICDM)”模块和“可逆马赛克秘密嵌入 (IMSE)”模块建立了载体引导的马赛克变换和表示嵌入,并保证了无失真的嵌入数学可逆性;不可逆的“秘密信息选择 (SIS)”模块和“秘密细节增强 (SDE)”模块实现了关键信息选择和增强的可学习特征调制。(ii) 统一的训练策略协调互补模块,在保持优异视觉质量的同时,使隐藏容量比现有方法高出3.0倍。(iii) 最后,引入了一种新的度量方法来建模隐藏容量与失真的权衡,综合考虑了隐藏容量和失真,为不同数量的秘密图像提供了统一的评估方法。大量实验表明,SMILENet在隐藏容量、恢复质量和对抗隐写分析的安全性方面均优于现有最先进的方法。
链接: https://arxiv.org/abs/2503.05118
作者: Jun-Jie Huang,Zihan Chen,Tianrui Liu,Wentao Zhao,Xin Deng,Xinwang Liu,Meng Wang,Pier Luigi Dragotti
机构: College of Computer Science and Technology, National University of Defense Technology (国防科技大学计算机科学与技术学院), Changsha, China; School of Electronic Information Engineering, Beihang University (北京航空航天大学电子信息技术学院), Beijing, China; School of Computer Science and Information Engineering, Hefei University of Technology (合肥工业大学计算机科学与信息工程学院), Hefei, China; Department of Electrical and Electronic Engineering, Imperial College London (伦敦帝国理工学院电气与电子工程系), London, UK
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Existing image steganography methods face fundamental limitations in hiding capacity (typically 1\sim7 images) due to severe information interference and uncoordinated capacity-distortion trade-off. We propose SMILENet, a novel synergistic framework that achieves 25 image hiding through three key innovations: (i) A synergistic network architecture coordinates reversible and non-reversible operations to efficiently exploit information redundancy in both secret and cover images. The reversible Invertible Cover-Driven Mosaic (ICDM) module and Invertible Mosaic Secret Embedding (IMSE) module establish cover-guided mosaic transformations and representation embedding with mathematically guaranteed invertibility for distortion-free embedding. The non-reversible Secret Information Selection (SIS) module and Secret Detail Enhancement (SDE) module implement learnable feature modulation for critical information selection and enhancement. (ii) A unified training strategy that coordinates complementary modules to achieve 3.0x higher capacity than existing methods with superior visual quality. (iii) Last but not least, we introduce a new metric to model Capacity-Distortion Trade-off for evaluating the image steganography algorithms that jointly considers hiding capacity and distortion, and provides a unified evaluation approach for accessing results with different number of secret image. Extensive experiments on DIV2K, Paris StreetView and ImageNet1K show that SMILENet outperforms state-of-the-art methods in terms of hiding capacity, recovery quality as well as security against steganalysis methods.
zh
[CV-58] Visual Cues of Gender and Race are Associated with Stereotyping in Vision-Language Models
【速读】:该论文试图解决视觉语言模型(Vision Language Models, VLMs)中偏差研究的局限性问题。具体而言,现有研究主要关注特质关联(trait associations),而忽视了其他形式的刻板印象,并且通常在特定预期出现偏差的情境下进行分析,同时将社会类别如种族和性别简单二元化,未能充分反映这些身份的复杂性。论文的关键在于通过使用具有不同典型性的标准化面部图像,在开放性场景下测试四种VLMs的特质关联及同质性偏差(homogeneity bias)。研究发现,VLMs对女性生成的故事比男性更加一致,且外貌更具性别典型性的个体被更均匀地描述;此外,白人美国人比黑人美国人表现出更高的故事一致性,但种族典型性并未显著增强这种一致性。在特质关联方面,研究显示有限的刻板印象证据,例如黑人美国人始终与篮球相关联,而其他种族关联则因具体模型而异。这些结果表明,VLM中的刻板印象表现超出简单的群体归属范畴,传统的偏差缓解策略可能不足以应对这些问题,同时即使在输出中特质关联不明显时,同质性偏差仍然存在。
链接: https://arxiv.org/abs/2503.05093
作者: Messi H.J. Lee,Soyeon Jeon,Jacob M. Montgomery,Calvin K. Lai
机构: Washington University in St. Louis (华盛顿大学圣路易斯分校); Rutgers University (罗格斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Current research on bias in Vision Language Models (VLMs) has important limitations: it is focused exclusively on trait associations while ignoring other forms of stereotyping, it examines specific contexts where biases are expected to appear, and it conceptualizes social categories like race and gender as binary, ignoring the multifaceted nature of these identities. Using standardized facial images that vary in prototypicality, we test four VLMs for both trait associations and homogeneity bias in open-ended contexts. We find that VLMs consistently generate more uniform stories for women compared to men, with people who are more gender prototypical in appearance being represented more uniformly. By contrast, VLMs represent White Americans more uniformly than Black Americans. Unlike with gender prototypicality, race prototypicality was not related to stronger uniformity. In terms of trait associations, we find limited evidence of stereotyping-Black Americans were consistently linked with basketball across all models, while other racial associations (i.e., art, healthcare, appearance) varied by specific VLM. These findings demonstrate that VLM stereotyping manifests in ways that go beyond simple group membership, suggesting that conventional bias mitigation strategies may be insufficient to address VLM stereotyping and that homogeneity bias persists even when trait associations are less apparent in model outputs.
zh
[CV-59] Fake It To Make It: Virtual Multiviews to Enhance Monocular Indoor Semantic Scene Completion IROS2025
【速读】:该论文旨在解决单目室内语义场景完成(SSC)任务中的深度、尺度和形状歧义问题,特别是在复杂且通常存在严重遮挡的室内环境中,现有方法难以准确重建3D语义占用图。论文的关键创新在于提出了一种结合新颖视图合成与多视图融合的方法:通过在场景周围布置虚拟相机模拟多视图输入以增强上下文信息,并引入多视图融合适配器(MVFA)将多视图的3D场景预测融合为统一的3D语义占用图。此外,研究还探讨了生成技术在SSC任务中的固有局限性——新颖性与一致性之间的权衡。实验结果显示,GenFuSE系统在NYUv2数据集上显著提升了场景完成和语义场景完成的IoU分数,分别提高了2.8%和4.9%。因此,该研究通过引入GenFuSE框架为单目SSC任务提供了新的解决方案。
链接: https://arxiv.org/abs/2503.05086
作者: Anith Selvakumar,Manasa Bharadwaj
机构: LG Electronics, Toronto AI Lab (LG电子, 多伦多人工智能实验室), Canada
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Submitted to IROS 2025
点击查看摘要
Abstract:Monocular Indoor Semantic Scene Completion (SSC) aims to reconstruct a 3D semantic occupancy map from a single RGB image of an indoor scene, inferring spatial layout and object categories from 2D image cues. The challenge of this task arises from the depth, scale, and shape ambiguities that emerge when transforming a 2D image into 3D space, particularly within the complex and often heavily occluded environments of indoor scenes. Current SSC methods often struggle with these ambiguities, resulting in distorted or missing object representations. To overcome these limitations, we introduce an innovative approach that leverages novel view synthesis and multiview fusion. Specifically, we demonstrate how virtual cameras can be placed around the scene to emulate multiview inputs that enhance contextual scene information. We also introduce a Multiview Fusion Adaptor (MVFA) to effectively combine the multiview 3D scene predictions into a unified 3D semantic occupancy map. Finally, we identify and study the inherent limitation of generative techniques when applied to SSC, specifically the Novelty-Consistency tradeoff. Our system, GenFuSE, demonstrates IoU score improvements of up to 2.8% for Scene Completion and 4.9% for Semantic Scene Completion when integrated with existing SSC networks on the NYUv2 dataset. This work introduces GenFuSE as a standard framework for advancing monocular SSC with synthesized inputs.
zh
[CV-60] aming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs CVPR2025
【速读】:该论文致力于解决使用3D Gaussian Splatting (3DGS) 进行稀疏输入场景建模时面临的两个关键挑战:外推(extrapolation)和遮挡(occlusion)。为了解决这些问题,论文提出了一种基于生成式重建的流水线,利用视频扩散模型学习到的先验知识,为视野外或被遮挡区域提供合理的解释。然而,生成的序列可能存在不一致性,影响后续3DGS建模的效果。为此,论文引入了一种无需训练的场景引导机制,基于优化后的3DGS渲染序列来约束扩散模型,从而生成一致的序列。此外,还设计了一种轨迹初始化方法以识别视野外和被遮挡的区域,并提出了一套针对3DGS优化的方案。实验表明,该方法显著优于基线,并在具有挑战性的基准测试中达到了最先进的性能。关键在于通过场景引导机制实现生成序列的一致性,同时结合轨迹初始化方法有效处理稀疏输入中的外推与遮挡问题。
链接: https://arxiv.org/abs/2503.05082
作者: Yingji Zhong,Zhihao Li,Dave Zhenyu Chen,Lanqing Hong,Dan Xu
机构: The Hong Kong University of Science and Technology (香港科技大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2025. The project page is available at this https URL
点击查看摘要
Abstract:Despite recent successes in novel view synthesis using 3D Gaussian Splatting (3DGS), modeling scenes with sparse inputs remains a challenge. In this work, we address two critical yet overlooked issues in real-world sparse-input modeling: extrapolation and occlusion. To tackle these issues, we propose to use a reconstruction by generation pipeline that leverages learned priors from video diffusion models to provide plausible interpretations for regions outside the field of view or occluded. However, the generated sequences exhibit inconsistencies that do not fully benefit subsequent 3DGS modeling. To address the challenge of inconsistencies, we introduce a novel scene-grounding guidance based on rendered sequences from an optimized 3DGS, which tames the diffusion model to generate consistent sequences. This guidance is training-free and does not require any fine-tuning of the diffusion model. To facilitate holistic scene modeling, we also propose a trajectory initialization method. It effectively identifies regions that are outside the field of view and occluded. We further design a scheme tailored for 3DGS optimization with generated sequences. Experiments demonstrate that our method significantly improves upon the baseline and achieves state-of-the-art performance on challenging benchmarks.
zh
[CV-61] ISP-AD: A Large-Scale Real-World Dataset for Advancing Industrial Anomaly Detection with Synthetic and Real Defects
【速读】:该论文旨在解决工业视觉检测中基于机器学习的方法在实际应用中的局限性问题,特别是现有异常检测方法对复杂缺陷外观及非理想成像条件的适应能力不足。当前公开数据集多偏向于最优成像条件,导致模型在真实工业场景中的适用性被高估。为填补这一差距,论文引入了工业丝网印刷异常检测数据集(ISP-AD),该数据集包含嵌入复杂结构图案中的小尺寸且对比度低的表面缺陷,并涵盖从工厂直接采集的真实与合成缺陷,成为目前最大的公开工业数据集。关键解决方案在于结合监督学习与无监督学习策略:少量真实缺陷的注入可显著提升模型泛化能力,而基于纯合成缺陷训练的模型能够有效整合后续采集的真实缺陷样本;研究发现,通过合成缺陷与累积真实缺陷的联合监督,可实现低误报率与高召回率等工业检测需求,从而增强无监督、自监督及监督方法在工业场景中的适用性。
链接: https://arxiv.org/abs/2503.04997
作者: Paul J. Krassnig,Dieter P. Gruber
机构: pccl.at (PCCL); unileoben.ac.at (莱奥本大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 6 figures, this preprint has been submitted to the Journal of Intelligent Manufacturing
点击查看摘要
Abstract:Automatic visual inspection using machine learning-based methods plays a key role in achieving zero-defect policies in industry. Research on anomaly detection approaches is constrained by the availability of datasets that represent complex defect appearances and imperfect imaging conditions, which are typical to industrial processes. Recent benchmarks indicate that most publicly available datasets are biased towards optimal imaging conditions, leading to an overestimation of the methods’ applicability to real-world industrial scenarios. To address this gap, we introduce the Industrial Screen Printing Anomaly Detection dataset (ISP-AD). It presents challenging small and weakly contrasted surface defects embedded within structured patterns exhibiting high permitted design variability. To the best of our knowledge, it is the largest publicly available industrial dataset to date, including both synthetic and real defects collected directly from the factory floor. In addition to the evaluation of defect detection performance of recent unsupervised anomaly detection methods, experiments on a mixed supervised training approach, incorporating both synthesized and real defects, were conducted. Even small amounts of injected real defects prove beneficial for model generalization. Furthermore, starting from training on purely synthetic defects, emerging real defective samples can be efficiently integrated into subsequent scalable training. Research findings indicate that supervision by means of both synthetic and accumulated real defects can complement each other, meeting demanded industrial inspection requirements such as low false positive rates and high recall. The presented unsupervised and supervised dataset splits are designed to emphasize research on unsupervised, self-supervised, and supervised approaches, enhancing their applicability to industrial settings.
zh
[CV-62] Leverag ing Large Language Models For Scalable Vector Graphics Processing: A Review
【速读】:该论文旨在解决传统矢量图形(Vector Graphics)处理技术在生成与编辑方面存在的效率低下和输出复杂性过高的问题。矢量图形因其可扩展性和易编辑性,在数字设计领域至关重要,但现有传统方法难以满足实际应用需求。论文提出利用大型语言模型(Large Language Models, LLMs)的新范式来改进矢量图形的生成、编辑和分析能力,特别是针对基于文本的SVG格式,因其天然适合与LLMs结合而展现出巨大潜力。关键在于通过增强模型的推理能力,使LLMs在矢量图形任务中表现更优,尤其是在生成和理解任务中超越标准LLMs。此外,研究强调了构建更多样化且注释丰富的数据集的重要性,以进一步提升LLMs在矢量图形相关任务中的性能。
链接: https://arxiv.org/abs/2503.04983
作者: Boris Malashenko,Ivan Jarsky,Valeria Efimova
机构: ITMO University (ITMO大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In recent years, rapid advances in computer vision have significantly improved the processing and generation of raster images. However, vector graphics, which is essential in digital design, due to its scalability and ease of editing, have been relatively understudied. Traditional vectorization techniques, which are often used in vector generation, suffer from long processing times and excessive output complexity, limiting their usability in practical applications. The advent of large language models (LLMs) has opened new possibilities for the generation, editing, and analysis of vector graphics, particularly in the SVG format, which is inherently text-based and well-suited for integration with LLMs. This paper provides a systematic review of existing LLM-based approaches for SVG processing, categorizing them into three main tasks: generation, editing, and understanding. We observe notable models such as IconShop, StrokeNUWA, and StarVector, highlighting their strengths and limitations. Furthermore, we analyze benchmark datasets designed for assessing SVG-related tasks, including SVGEditBench, VGBench, and SGP-Bench, and conduct a series of experiments to evaluate various LLMs in these domains. Our results demonstrate that for vector graphics reasoning-enhanced models outperform standard LLMs, particularly in generation and understanding tasks. Furthermore, our findings underscore the need to develop more diverse and richly annotated datasets to further improve LLM capabilities in vector graphics tasks. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2503.04983 [cs.CV] (or arXiv:2503.04983v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2503.04983 Focus to learn more arXiv-issued DOI via DataCite
zh
[CV-63] HyDA: Hypernetworks for Test Time Domain Adaptation in Medical Imaging Analysis MICCAI2025
【速读】:该论文旨在解决医学影像分析模型在面对域偏移(domain shift)时的适应性挑战,特别是在临床环境中目标域数据仅在实时可用的情况下,现有领域自适应(Domain Adaptation, DA)方法因需要测试域样本而受限的问题。论文的关键创新在于提出了一种名为HyDA的新型超网络框架,其通过利用而非抑制域特性,在推理阶段实现动态适配。具体而言,HyDA学习隐式的域表示,并据此实时调整模型参数,从而有效地推广至未见过的目标域。这一方案的核心在于无需依赖大量目标域标注样本即可实现跨任务和模态的泛化能力。
链接: https://arxiv.org/abs/2503.04979
作者: Doron Serebro,Tammy Riklin-Raviv
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submitted to MICCAI 2025
点击查看摘要
Abstract:Medical imaging datasets often vary due to differences in acquisition protocols, patient demographics, and imaging devices. These variations in data distribution, known as domain shift, present a significant challenge in adapting imaging analysis models for practical healthcare applications. Most current domain adaptation (DA) approaches aim either to align the distributions between the source and target domains or to learn an invariant feature space that generalizes well across all domains. However, both strategies require access to a sufficient number of examples, though not necessarily annotated, from the test domain during training. This limitation hinders the widespread deployment of models in clinical settings, where target domain data may only be accessible in real time. In this work, we introduce HyDA, a novel hypernetwork framework that leverages domain characteristics rather than suppressing them, enabling dynamic adaptation at inference time. Specifically, HyDA learns implicit domain representations and uses them to adjust model parameters on-the-fly, effectively interpolating to unseen domains. We validate HyDA on two clinically relevant applications - MRI brain age prediction and chest X-ray pathology classification - demonstrating its ability to generalize across tasks and modalities. Our code is available at TBD. Comments: submitted to MICCAI 2025 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2503.04979 [cs.CV] (or arXiv:2503.04979v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2503.04979 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-64] Spectral Informed Mamba for Robust Point Cloud Processing
【速读】:本文旨在解决点云数据在监督学习和自监督学习中的处理挑战,特别是针对复杂点云结构的建模与分析问题。论文的关键解决方案包括三个方面:首先,通过图拉普拉斯谱定义了一种等距不变的遍历顺序,以捕获局部块之间的连接性,该方法对视角变化具有鲁棒性,并能更有效地捕捉形状流形;其次,提出了一种基于拉普拉斯谱分量的递归块划分策略,用于分割任务,实现了更精细的分割与分析;第三,针对Mamba中掩码自编码器的令牌放置问题,通过将令牌恢复到原始位置,保持了关键的顺序信息,从而提升了学习效果。这些创新显著改进了分类、分割以及少样本学习任务的表现。
链接: https://arxiv.org/abs/2503.04953
作者: Ali Bahri,Moslem Yazdanpanah,Mehrdad Noori,Sahar Dastani,Milad Cheraghalikhani,David Osowiechi,Gustavo Adolfo Vargas Hakim,Farzad Beizaee,Ismail Ben Ayed,Christian Desrosiers
机构: École de technologie supérieure (ÉTS); International Laboratory on Learning Systems (ILLS)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:State space models have shown significant promise in Natural Language Processing (NLP) and, more recently, computer vision. This paper introduces a new methodology leveraging Mamba and Masked Autoencoder networks for point cloud data in both supervised and self-supervised learning. We propose three key contributions to enhance Mamba’s capability in processing complex point cloud structures. First, we exploit the spectrum of a graph Laplacian to capture patch connectivity, defining an isometry-invariant traversal order that is robust to viewpoints and better captures shape manifolds than traditional 3D grid-based traversals. Second, we adapt segmentation via a recursive patch partitioning strategy informed by Laplacian spectral components, allowing finer integration and segment analysis. Third, we address token placement in Masked Autoencoder for Mamba by restoring tokens to their original positions, which preserves essential order and improves learning. Extensive experiments demonstrate the improvements of our approach in classification, segmentation, and few-shot tasks over state-of-the-art baselines.
zh
[CV-65] Metadata-free Georegistration of Ground and Airborne Imagery WACV2025
【速读】:该论文旨在解决异构地面与机载影像数据难以生成地理配准(georegistered)三维模型的问题,同时提供一种机制以对齐来自非重叠数据的多个独立生成的模型。解决方案的关键在于利用卫星影像、关联的数字表面模型(Digital Surface Model, DSM)以及现代三维建模技术(如神经辐射场)的新视角生成能力,提出了一种稳健的机载影像地理配准方法,并进一步开发了一种将地面影像注册到机载影像生成模型中的相关技术。该方法无需依赖除基于卫星的参考产品之外的任何元数据,从而具有广泛的适用性。
链接: https://arxiv.org/abs/2503.04927
作者: Adam Bredvik,Scott Richardson,Daniel Crispell
机构: Vision Systems, Inc. (视觉系统股份有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2025 ULTRRA Workshop
点击查看摘要
Abstract:Heterogeneous collections of ground and airborne imagery can readily be used to create high-quality 3D models and novel viewpoint renderings of the observed scene. Standard photogrammetry pipelines generate models in arbitrary coordinate systems, which is problematic for applications that require georegistered models. Even for applications that do not require georegistered models, georegistration is useful as a mechanism for aligning multiple disconnected models generated from non-overlapping data. The proposed method leverages satellite imagery, an associated digital surface model (DSM), and the novel view generation capabilities of modern 3D modeling techniques (e.g. neural radiance fields) to provide a robust method for georegistering airborne imagery, and a related technique for registering ground-based imagery to models created from airborne imagery. Experiments demonstrate successful georegistration of airborne and ground-based photogrammetric models across a variety of distinct sites. The proposed method does not require use of any metadata other than a satellite-based reference product and therefore has general applicability.
zh
[CV-66] FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement
【速读】:该论文旨在解决利用多模态大型语言模型(Multimodal Large Language Models, MLLMs)进行三维场景生成时,因缺乏对三维几何理解的充分支持而导致的应用局限性问题。论文的核心目标是在物体放置任务中有效结合MLLMs的语义能力和三维几何推理能力。解决方案的关键在于提出了一种名为FirePlace的新框架,该框架通过三个主要步骤实现这一目标:(1) 利用MLLMs进行三维几何推理并提取场景中的相关几何细节;(2) 构建并求解从低级几何中提取出的几何约束;(3) 剪枝以保留符合常识的最终放置方案。通过将几何推理与MLLMs对现实世界的理解相结合,该方法能够提出满足几何约束和高层语义常识要求的物体放置建议,在包含复杂几何结构的场景中表现出比现有方法更高的有效性。
链接: https://arxiv.org/abs/2503.04919
作者: Ian Huang,Yanan Bao,Karen Truong,Howard Zhou,Cordelia Schmid,Leonidas Guibas,Alireza Fathi
机构: Stanford University (斯坦福大学); Google DeepMind (谷歌深思)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Scene generation with 3D assets presents a complex challenge, requiring both high-level semantic understanding and low-level geometric reasoning. While Multimodal Large Language Models (MLLMs) excel at semantic tasks, their application to 3D scene generation is hindered by their limited grounding on 3D geometry. In this paper, we investigate how to best work with MLLMs in an object placement task. Towards this goal, we introduce a novel framework, FirePlace, that applies existing MLLMs in (1) 3D geometric reasoning and the extraction of relevant geometric details from the 3D scene, (2) constructing and solving geometric constraints on the extracted low-level geometry, and (3) pruning for final placements that conform to common sense. By combining geometric reasoning with real-world understanding of MLLMs, our method can propose object placements that satisfy both geometric constraints as well as high-level semantic common-sense considerations. Our experiments show that these capabilities allow our method to place objects more effectively in complex scenes with intricate geometry, surpassing the quality of prior work.
zh
[CV-67] Fine-Tuning Florence2 for Enhanced Object Detection in Un-constructed Environments: Vision-Language Model Approach
【速读】:该论文旨在解决如何通过微调提升基于Transformer的Vision-Language模型(Florence 2)在复杂无结构环境中的特定任务性能,尤其是物体检测任务的效率。解决方案的关键在于通过实验多种配置,包括不同GPU类型(如T4、L4、A100)、优化器(AdamW和SGD)、学习率以及LoRA(Low Rank Adaptation)设置,从而实现对Florence 2模型的有效微调。分析表明,经过微调后的模型在平均精度均值(mAP)等指标上与包括YOLOv8、YOLOv9和YOLOv10在内的目标检测模型表现相当,验证了优化后的Transformer基Vision-Language模型在解决无结构环境中物体检测具体挑战方面的潜力,并为实际应用提供了可行路径。
链接: https://arxiv.org/abs/2503.04918
作者: Soumyadeep Ro,Sanapala Satwika,Pamarthi Yasoda Gayathri,Mohmmad Ghaith Balsha,Aysegul Ucar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 13 Figures, 6 Tables
点击查看摘要
Abstract:Artificial intelligence has progressed through the development of Vision-Language Models (VLMs), which integrate text and visual inputs to achieve comprehensive understanding and interaction in various contexts. Enhancing the performance of these models such as the transformer based Florence 2 on specialized tasks like object detection in complex and unstructured environments requires fine-tuning. The goal of this paper is to improve the efficiency of the Florence 2 model in challenging environments by finetuning it. We accomplished this by experimenting with different configurations, using various GPU types (T4, L4, A100) and optimizers such as AdamW and SGD. We also employed a range of learning rates and LoRA (Low Rank Adaptation) settings. Analyzing the performance metrics, such as Mean Average Precision (mAP) scores,reveals that the finetuned Florence 2 models performed comparably to YOLO models, including YOLOv8, YOLOv9, and YOLOv10. This demonstrates how transformer based VLMs can be adapted for detailed object detection tasks. The paper emphasizes the capability of optimized transformer based VLMs to address specific challenges in object detection within unstructured environments, opening up promising avenues for practical applications in demanding and complex settings.
zh
[CV-68] Extracting Symbolic Sequences from Visual Representations via Self-Supervised Learning
【速读】:该论文旨在解决如何通过自监督学习(Self-Supervised Learning, SSL)将复杂的视觉信息抽象为离散且结构化的符号序列的问题。解决方案的关键在于提出了一种基于Transformer解码器的新型方法,利用交叉注意力机制生成符号表示,并通过扩展DINO框架同时处理视觉与符号信息。这种方法不仅能够捕获有意义的抽象层次,还因其可解释性而具有优势,即生成的符号序列可以通过注意图与特定符号关联,揭示这些表示与图像区域之间的对应关系。这为构建可用于高级场景理解的可解释符号表示奠定了基础。
链接: https://arxiv.org/abs/2503.04900
作者: Victor Sebastian Martinez Pozos,Ivan Vladimir Meza Ruiz
机构: Posgrado en Ciencia e Ingeniería de la Computación (计算机科学与工程研究生项目), Universidad Nacional Autónoma de México (墨西哥国立自治大学); Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas (应用数学与系统研究院), Universidad Nacional Autónoma de México (墨西哥国立自治大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper explores the potential of abstracting complex visual information into discrete, structured symbolic sequences using self-supervised learning (SSL). Inspired by how language abstracts and organizes information to enable better reasoning and generalization, we propose a novel approach for generating symbolic representations from visual data. To learn these sequences, we extend the DINO framework to handle visual and symbolic information. Initial experiments suggest that the generated symbolic sequences capture a meaningful level of abstraction, though further refinement is required. An advantage of our method is its interpretability: the sequences are produced by a decoder transformer using cross-attention, allowing attention maps to be linked to specific symbols and offering insight into how these representations correspond to image regions. This approach lays the foundation for creating interpretable symbolic representations with potential applications in high-level scene understanding.
zh
[CV-69] Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning
【速读】:该论文试图解决当观察数据偏离训练分布时,基于模仿学习(Imitation Learning, IL)的机器人任务性能显著下降的问题,以及现有基于3D场景表示的方法在跨形态和新相机姿态设置下仅表现出有限改进的局限性。为了解决这些问题,论文提出了一种名为自适应3D场景表示(Adaptive 3D Scene Representation, Adapt3R)的通用3D观测编码器。其关键是利用预训练的2D主干网络提取场景的语义信息,并通过3D建模将这些语义信息定位到机械臂末端执行器附近,从而合成一个单一向量,作为任意IL算法的条件输入。这种方法不仅保持了多任务学习能力,还实现了在新机器人形态和相机姿态下的零样本迁移。
链接: https://arxiv.org/abs/2503.04877
作者: Albert Wilcox,Mohamed Ghanem,Masoud Moghani,Pierre Barroso,Benjamin Joffe,Animesh Garg
机构: Georgia Institute of Technology (乔治亚理工学院); Georgia Tech Research Institute (乔治亚理工研究院); University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Videos, code, and data: this https URL
点击查看摘要
Abstract:Imitation Learning (IL) has been very effective in training robots to perform complex and diverse manipulation tasks. However, its performance declines precipitously when the observations are out of the training distribution. 3D scene representations that incorporate observations from calibrated RGBD cameras have been proposed as a way to improve generalizability of IL policies, but our evaluations in cross-embodiment and novel camera pose settings found that they show only modest improvement. To address those challenges, we propose Adaptive 3D Scene Representation (Adapt3R), a general-purpose 3D observation encoder which uses a novel architecture to synthesize data from one or more RGBD cameras into a single vector that can then be used as conditioning for arbitrary IL algorithms. The key idea is to use a pretrained 2D backbone to extract semantic information about the scene, using 3D only as a medium for localizing this semantic information with respect to the end-effector. We show that when trained end-to-end with several SOTA multi-task IL algorithms, Adapt3R maintains these algorithms’ multi-task learning capacity while enabling zero-shot transfer to novel embodiments and camera poses. Furthermore, we provide a detailed suite of ablation and sensitivity experiments to elucidate the design space for point cloud observation encoders.
zh
[CV-70] oward Lightweight and Fast Decoders for Diffusion Models in Image and Video Generation
【速读】:该论文致力于解决稳定扩散(Stable Diffusion)模型在推理速度和内存占用方面的瓶颈问题,特别是在图像和视频合成任务中。传统方法依赖于大型变分自编码器(Variational Autoencoder, VAE)解码器,这会显著减慢生成过程并增加GPU内存消耗。为应对这一挑战,论文的关键创新在于引入了定制训练的轻量级解码器,采用轻量级视觉Transformer(Vision Transformer)和Taming Transformer架构。这些设计不仅实现了高达15%的整体生成速度提升(在COCO2017数据集上),并在子模块中达到最高20倍的加速,同时在UCF-101视频任务中进一步优化。尽管相比默认解码器存在轻微的感知质量下降,但显著提升了模型的速度和可扩展性,这对大规模推理场景(如生成10万张图像)至关重要。此外,论文通过双掩码策略等高效视频生成技术补充了研究背景,展示了推动生成模型效率与可扩展性的更广泛努力。
链接: https://arxiv.org/abs/2503.04871
作者: Alexey Buzovkin,Evgeny Shilov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 11 pages, 8 figures, 3 tables
点击查看摘要
Abstract:We investigate methods to reduce inference time and memory footprint in stable diffusion models by introducing lightweight decoders for both image and video synthesis. Traditional latent diffusion pipelines rely on large Variational Autoencoder decoders that can slow down generation and consume considerable GPU memory. We propose custom-trained decoders using lightweight Vision Transformer and Taming Transformer architectures. Experiments show up to 15% overall speed-ups for image generation on COCO2017 and up to 20 times faster decoding in the sub-module, with additional gains on UCF-101 for video tasks. Memory requirements are moderately reduced, and while there is a small drop in perceptual quality compared to the default decoder, the improvements in speed and scalability are crucial for large-scale inference scenarios such as generating 100K images. Our work is further contextualized by advances in efficient video generation, including dual masking strategies, illustrating a broader effort to improve the scalability and efficiency of generative models.
zh
[CV-71] E4: Energy-Efficient DNN Inference for Edge Video Analytics Via Early-Exit and DVFS AAAI2025
【速读】:该论文旨在解决深度神经网络(DNN)模型在资源受限的边缘设备上进行能效推理的问题。现有的大多数解决方案主要关注于优化DNN推理的延迟和准确性,而忽视了能量效率,并且未能考虑视频帧复杂性的差异,导致边缘视频分析性能不佳。论文的关键解决方案是提出了一种名为Energy-Efficient Early-Exit (E4)的框架,它通过集成一种新颖的早期退出机制与动态电压频率调节(DVFS)管理器来提升边缘视频分析中的DNN推理效率。E4采用基于注意力的级联模块来分析视频帧的多样性并自动确定最优的DNN退出点,同时利用即时(JIT)剖析器结合坐标下降搜索算法协同优化CPU和GPU在每个DNN退出点前各层的时钟频率。评估结果显示,E4相比当前最先进的方法实现了高达2.8倍的速度提升和平均26%的能量节省,同时保持了高精度。
链接: https://arxiv.org/abs/2503.04865
作者: Ziyang Zhang,Yang Zhao,Ming-Ching Chang,Changyao Lin,Jie Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, to be published in AAAI 2025
点击查看摘要
Abstract:Deep neural network (DNN) models are increasingly popular in edge video analytic applications. However, the compute-intensive nature of DNN models pose challenges for energy-efficient inference on resource-constrained edge devices. Most existing solutions focus on optimizing DNN inference latency and accuracy, often overlooking energy efficiency. They also fail to account for the varying complexity of video frames, leading to sub-optimal performance in edge video analytics. In this paper, we propose an Energy-Efficient Early-Exit (E4) framework that enhances DNN inference efficiency for edge video analytics by integrating a novel early-exit mechanism with dynamic voltage and frequency scaling (DVFS) governors. It employs an attention-based cascade module to analyze video frame diversity and automatically determine optimal DNN exit points. Additionally, E4 features a just-in-time (JIT) profiler that uses coordinate descent search to co-optimize CPU and GPU clock frequencies for each layer before the DNN exit points. Extensive evaluations demonstrate that E4 outperforms current state-of-the-art methods, achieving up to 2.8x speedup and 26% average energy saving while maintaining high accuracy.
zh
[CV-72] Manboformer: Learning Gaussian Representations via Spatial-temporal Attention Mechanism
【速读】:本文针对自动驾驶领域中基于体素网格预测的内存消耗大以及3D语义占用预测精度受限的问题,提出了一种新的方法,即GaussianFormer,利用稀疏的3D语义高斯分布来描述场景。每个3D高斯函数代表一个灵活的兴趣区域及其语义特征,并通过注意力机制迭代优化这些特征。然而,实验发现此方法所需的高斯函数规模大于原始密集网格网络的查询分辨率,导致性能下降。为了解决这一问题,论文的关键在于引入时空自注意力机制(Spatial-Temporal Self-attention Mechanism),从先前基于网格的占用网络中学习未充分利用的时间信息,并将其优化应用于GaussianFormer。目前,该研究已在NuScenes数据集上进行实验,实验正在进行中。
链接: https://arxiv.org/abs/2503.04863
作者: Ziyue Zhao,Qining Qi,Jianfa Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Compared with voxel-based grid prediction, in the field of 3D semantic occupation prediction for autonomous driving, GaussianFormer proposed using 3D Gaussian to describe scenes with sparse 3D semantic Gaussian based on objects is another scheme with lower memory requirements. Each 3D Gaussian function represents a flexible region of interest and its semantic features, which are iteratively refined by the attention mechanism. In the experiment, it is found that the Gaussian function required by this method is larger than the query resolution of the original dense grid network, resulting in impaired performance. Therefore, we consider optimizing GaussianFormer by using unused temporal information. We learn the Spatial-Temporal Self-attention Mechanism from the previous grid-given occupation network and improve it to GaussianFormer. The experiment was conducted with the NuScenes dataset, and the experiment is currently underway.
zh
[CV-73] High-Precision Transformer-Based Visual Servoing for Humanoid Robots in Aligning Tiny Objects
【速读】:该论文致力于解决人形机器人在现实世界中高精度微小物体对齐的常见且关键挑战。针对手持工具(如螺丝刀尖端)与目标物体(如螺丝槽)之间的相对位置精确估计与控制问题,提出了一种基于视觉的框架。解决方案的关键在于所提出的基于Transformer的视觉伺服方法,该方法通过融合机器人头部和躯干摄像头的图像及其头部关节角度,能够有效校正手持工具的位置误差,尤其是在近距离场景下。此外,论文设计的距离估计Transformer架构与多感知头机制进一步提升了微小物体对齐的精度。实验结果表明,该方法在M4-M8螺丝上的平均收敛误差为0.8-1.3毫米,成功率可达93%-100%。
链接: https://arxiv.org/abs/2503.04862
作者: Jialong Xue,Wei Gao,Yu Wang,Chao Ji,Dongdong Zhao,Shi Yan,Shiwu Zhang
机构: Institute of Humanoid Robots, Department of Precision Machinery and Precision Instrumentation, University of Science and Technology of China (中国科学技术大学); School of Information Science and Engineering, Lanzhou University (兰州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: for associated video, see this https URL
点击查看摘要
Abstract:High-precision tiny object alignment remains a common and critical challenge for humanoid robots in real-world. To address this problem, this paper proposes a vision-based framework for precisely estimating and controlling the relative position between a handheld tool and a target object for humanoid robots, e.g., a screwdriver tip and a screw head slot. By fusing images from the head and torso cameras on a robot with its head joint angles, the proposed Transformer-based visual servoing method can correct the handheld tool’s positional errors effectively, especially at a close distance. Experiments on M4-M8 screws demonstrate an average convergence error of 0.8-1.3 mm and a success rate of 93%-100%. Through comparative analysis, the results validate that this capability of high-precision tiny object alignment is enabled by the Distance Estimation Transformer architecture and the Multi-Perception-Head mechanism proposed in this paper.
zh
[CV-74] End-to-End Human Pose Reconstruction from Wearable Sensors for 6G Extended Reality Systems
【速读】:该论文旨在解决在无线网络中实现精确人体姿态重建的挑战,特别是在第六代(6G)网络支持的扩展现实(XR)应用中。现有方法通常假设室内环境下的无错误传输,这限制了其在真实世界场景中的适用性。主要问题包括信道损伤、比特错误和量化效应导致的精度下降。
论文提出了一种基于深度学习的新框架,用于正交频分复用(OFDM)系统中的人体姿态重建。该框架的关键在于引入了一个两阶段的深度学习接收器:第一阶段同时估计无线信道并解码OFDM符号;第二阶段将接收到的传感器信号映射到完整的三维身体姿态。通过这种方式,新提出的神经网络接收器能够显著降低误比特率(BER),与采用独立信号检测步骤(如最小二乘法信道估计和线性最小均方误差均衡)的传统方法相比,在10⁻⁴ BER下获得了5 dB的增益。此外,实证研究表明,8位量化足以实现准确的姿态重建,重构传感器信号的均方误差达到5×10⁻⁴,并且重构人体姿态的联合角度误差减少了37%。
链接: https://arxiv.org/abs/2503.04860
作者: Nguyen Quang Hieu,Dinh Thai Hoang,Diep N. Nguyen,Mohammad Abu Alsheikh,Carlos C. N. Kuhn,Yibeltal F. Alem,Ibrahim Radwan
机构: School of Electrical Data Engineering, University of Technology, Sydney, NSW 2007, Australia (悉尼科技大学电气数据工程学院,新南威尔士州 2007,澳大利亚); School of Information Technology and Systems, University of Canberra, Canberra, ACT 2617, Australia (堪培拉大学信息技术与系统学院,堪培拉,澳大利亚首都领地 2617,澳大利亚); School of Electrical and Data Engineering, University of Technology Sydney, NSW 2007, Australia (悉尼科技大学电气与数据工程学院,新南威尔士州 2007,澳大利亚); Faculty of Science & Technology, University of Canberra, Canberra, ACT 2617, Australia (堪培拉大学科学与技术学院,堪培拉,澳大利亚首都领地 2617,澳大利亚)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:
点击查看摘要
Abstract:Full 3D human pose reconstruction is a critical enabler for extended reality (XR) applications in future sixth generation (6G) networks, supporting immersive interactions in gaming, virtual meetings, and remote collaboration. However, achieving accurate pose reconstruction over wireless networks remains challenging due to channel impairments, bit errors, and quantization effects. Existing approaches often assume error-free transmission in indoor settings, limiting their applicability to real-world scenarios. To address these challenges, we propose a novel deep learning-based framework for human pose reconstruction over orthogonal frequency-division multiplexing (OFDM) systems. The framework introduces a two-stage deep learning receiver: the first stage jointly estimates the wireless channel and decodes OFDM symbols, and the second stage maps the received sensor signals to full 3D body poses. Simulation results demonstrate that the proposed neural receiver reduces bit error rate (BER), thus gaining a 5 dB gap at 10^-4 BER, compared to the baseline method that employs separate signal detection steps, i.e., least squares channel estimation and linear minimum mean square error equalization. Additionally, our empirical findings show that 8-bit quantization is sufficient for accurate pose reconstruction, achieving a mean squared error of 5\times10^-4 for reconstructed sensor signals, and reducing joint angular error by 37% for the reconstructed human poses compared to the baseline.
zh
[CV-75] SHAPE : Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner
【速读】:该论文旨在解决现有大型视觉语言模型(LVLMs)在偏好对齐过程中因人类标注偏好数据的多样性有限及高昂成本而导致的对齐能力受限的问题。论文提出了一种名为SHAPE的自监督框架,通过将已有的有监督文本-图像对转换为全面的偏好三元组,从而实现更高效且经济的LVLM对齐,无需依赖人工偏好标注。方案的关键在于设计一种机制,使胜者文本在整体性上持续提升,并在质量上优于败者响应,推动模型通过偏好微调达到最佳对齐性能。具体而言,对于每个给定的文本-图像对,SHAPE引入多种视觉增强方法,并将其与摘要文本配对作为胜者响应,而原始文本则作为败者响应。实验结果表明,该方法在多个基准测试中显著提升了模型性能,例如在7B规模的模型上于MMVet、MMBench和POPE等任务上分别取得了+11.3%、+1.4%和+8.0%的提升。定性分析进一步验证了模型对视觉细节的关注增强以及与人类整体描述偏好的更好一致性。
链接: https://arxiv.org/abs/2503.04858
作者: Kejia Chen,Jiawen Zhang,Jiacong Hu,Jiazhen Yang,Jian Lou,Zunlei Feng,Mingli Song
机构: Zhejiang University (浙江大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Visual Language Models (LVLMs) increasingly rely on preference alignment to ensure reliability, which steers the model behavior via preference fine-tuning on preference data structured as image - winner text - loser text'' triplets. However, existing approaches often suffer from limited diversity and high costs associated with human-annotated preference data, hindering LVLMs from fully achieving their intended alignment capabilities. We present \projectname, a self-supervised framework capable of transforming the already abundant supervised text-image pairs into holistic preference triplets for more effective and cheaper LVLM alignment, eliminating the need for human preference annotations. Our approach facilitates LVLMs in progressively enhancing alignment capabilities through iterative self-improvement. The key design rationale is to devise preference triplets where the winner text consistently improves in holisticness and outperforms the loser response in quality, thereby pushing the model to
strive to the utmost’’ of alignment performance through preference fine-tuning. For each given text-image pair, SHAPE introduces multiple visual augmentations and pairs them with a summarized text to serve as the winner response, while designating the original text as the loser response. Experiments across \textbf12 benchmarks on various model architectures and sizes, including LLaVA and DeepSeek-VL, show that SHAPE achieves significant gains, for example, achieving +11.3% on MMVet (comprehensive evaluation), +1.4% on MMBench (general VQA), and +8.0% on POPE (hallucination robustness) over baselines in 7B models. Notably, qualitative analyses confirm enhanced attention to visual details and better alignment with human preferences for holistic descriptions.
zh
[CV-76] CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data
【速读】:该论文试图解决视觉因果推理评估基准不足的问题。当前人工智能(AI)和计算机视觉(CV)领域在复杂视觉数据中推断潜在因果关系的能力缺乏有效的评估标准。为了解决这一问题,论文提出了Causal3D这一新颖且全面的基准,它将结构化数据(如表格)与对应的视觉表示(如图像)相结合,以评估因果推理能力。Causal3D包含19个3D场景数据集,涵盖多样的因果关系、视角和背景,支持不同复杂度场景下的评估。关键在于通过设计系统化的框架,整合多种先进的方法(如经典因果发现、因果表征学习以及大型视觉语言模型等)进行实验验证,揭示了在缺乏先验知识的情况下,随着因果结构复杂度增加,性能显著下降的现象,从而凸显了现有方法在复杂因果场景中的挑战。因此,Causal3D为推动计算机视觉中的因果推理研究以及促进关键领域的可信人工智能发展提供了重要资源。
链接: https://arxiv.org/abs/2503.04852
作者: Disheng Liu,Yiran Qiao,Wuche Liu,Yiren Lu,Yunlai Zhou,Tuo Liang,Yu Yin,Jing Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:True intelligence hinges on the ability to uncover and leverage hidden causal relations. Despite significant progress in AI and computer vision (CV), there remains a lack of benchmarks for assessing models’ abilities to infer latent causality from complex visual data. In this paper, we introduce \textsc\textbfCausal3D, a novel and comprehensive benchmark that integrates structured data (tables) with corresponding visual representations (images) to evaluate causal reasoning. Designed within a systematic framework, Causal3D comprises 19 3D-scene datasets capturing diverse causal relations, views, and backgrounds, enabling evaluations across scenes of varying complexity. We assess multiple state-of-the-art methods, including classical causal discovery, causal representation learning, and large/vision-language models (LLMs/VLMs). Our experiments show that as causal structures grow more complex without prior knowledge, performance declines significantly, highlighting the challenges even advanced methods face in complex causal scenarios. Causal3D serves as a vital resource for advancing causal reasoning in CV and fostering trustworthy AI in critical domains.
zh
[CV-77] ZAugNet for Z-Slice Augmentation in Bio-Imaging
【速读】:该论文旨在解决三维生物显微图像中因显微技术、样本特性或光毒性等因素导致的z轴分辨率较低的问题,这限制了精确的细胞测量。论文的关键解决方案是引入ZAugNet,这是一种快速、准确且自监督的深度学习方法,通过在连续切片间进行非线性插值,在每次迭代中有效将z轴分辨率加倍。此外,ZAugNet结合生成对抗网络(GAN)架构与知识蒸馏技术,在最大化预测速度的同时保持了高精度。对于具有非均匀切片间距的数据集,还开发了扩展版本ZAugNet+,实现了任意距离的连续插值。这些方法为大规模三维成像提供了高性能、可扩展的z轴切片增强解决方案,并以开源形式提供PyTorch框架及直观的Colab笔记本界面,便于科学界使用。
链接: https://arxiv.org/abs/2503.04843
作者: Alessandro Pasqui,Sajjad Mahdavi,Benoit Vianay,Alexandra Colin,Alex McDougall,Rémi Dumollard,Yekaterina A. Miroshnikova,Elsa Labrune,Hervé Turlier
机构: Center for Interdisciplinary Research in Biology (CIRB), Collège de France, Université PSL, CNRS, INSERM (跨学科生物学研究中心,法兰西学院,PSL大学,法国国家科学研究中心,法国国家健康与医学研究院); Sajjad Mahdavi (未知); Benoit Vianay (格勒诺布尔-阿尔卑斯大学,法国原子能委员会,法国国家科学研究中心,法国国家农业、食品与环境研究院,格勒诺布尔跨学科研究中心,细胞形态实验室,格勒诺布尔,法国); Alexandra Colin (格勒诺布尔-阿尔卑斯大学,法国原子能委员会,法国国家科学研究中心,法国国家农业、食品与环境研究院,格勒诺布尔跨学科研究中心,细胞形态实验室,格勒诺布尔,法国); Rémi Dumollard (维勒弗朗什海洋发育生物学实验室,维勒弗朗什海洋研究所,索邦大学,法国国家科学研究中心,维勒弗朗什-梅尔,法国); Alex McDougall (维勒弗朗什海洋发育生物学实验室,维勒弗朗什海洋研究所,索邦大学,法国国家科学研究中心,维勒弗朗什-梅尔,法国); Yekaterina A. Miroshnikova (美国国立糖尿病、消化与肾脏疾病研究所,美国国立卫生研究院,贝塞斯达,马里兰州,美国); Elsa Labrune (里昂民用医院生殖医学科,里昂第一大学,克劳德·贝尔纳,法国,INSERM U1208干细胞与大脑研究所,布隆,法国); Hervé Turlier (跨学科生物学研究中心,法兰西学院,PSL大学,法国国家科学研究中心,法国国家健康与医学研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
备注: 17 pages, 9 figures, 1 table
点击查看摘要
Abstract:Three-dimensional biological microscopy has significantly advanced our understanding of complex biological structures. However, limitations due to microscopy techniques, sample properties or phototoxicity often result in poor z-resolution, hindering accurate cellular measurements. Here, we introduce ZAugNet, a fast, accurate, and self-supervised deep learning method for enhancing z-resolution in biological images. By performing nonlinear interpolation between consecutive slices, ZAugNet effectively doubles resolution with each iteration. Compared on several microscopy modalities and biological objects, it outperforms competing methods on most metrics. Our method leverages a generative adversarial network (GAN) architecture combined with knowledge distillation to maximize prediction speed without compromising accuracy. We also developed ZAugNet+, an extended version enabling continuous interpolation at arbitrary distances, making it particularly useful for datasets with nonuniform slice spacing. Both ZAugNet and ZAugNet+ provide high-performance, scalable z-slice augmentation solutions for large-scale 3D imaging. They are available as open-source frameworks in PyTorch, with an intuitive Colab notebook interface for easy access by the scientific community.
zh
[CV-78] Combined Physics and Event Camera Simulator for Slip Detection
【速读】:该论文试图解决机器人操作中物体滑脱检测的问题。当前基于事件相机的数据研究主要依赖于真实场景中的手动数据收集和额外的数据标注设置,这导致数据采集时间显著增加、场景布置缺乏灵活性以及实验重复性复杂度较高。论文的关键解决方案是提出了一种基于模拟器的管道,用于生成使用特定相机-夹爪配置的滑脱数据,并通过初始的数据驱动实验验证其有效性。该模拟器的使用一旦搭建完成,能够大幅减少数据采集时间,灵活调整场景设置,简化重复实验和生成任意规模数据集的过程。此外,论文创建了两个不同的数据集并通过视觉检查和人工神经网络(ANNs)验证,结果表明生成的帧具有照片级真实感,滑脱建模准确,并且训练出的ANNs在验证集上表现出高精度和良好的泛化能力,初步展示了其在真实世界数据中的适用性。
链接: https://arxiv.org/abs/2503.04838
作者: Thilo Reinold,Suman Ghosh,Guillermo Gallego
机构: Technische Universität Berlin (柏林工业大学); Robotics Institute Germany (机器人研究所德国分部); Einstein Center for Digital Future (数字未来爱因斯坦中心); Science of Intelligence Excellence Cluster (智能科学卓越集群)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 9 pages, 8 figures, 2 tables, this https URL
点击查看摘要
Abstract:Robot manipulation is a common task in fields like industrial manufacturing. Detecting when objects slip from a robot’s grasp is crucial for safe and reliable operation. Event cameras, which register pixel-level brightness changes at high temporal resolution (called ``events’'), offer an elegant feature when mounted on a robot’s end effector: since they only detect motion relative to their viewpoint, a properly grasped object produces no events, while a slipping object immediately triggers them. To research this feature, representative datasets are essential, both for analytic approaches and for training machine learning models. The majority of current research on slip detection with event-based data is done on real-world scenarios and manual data collection, as well as additional setups for data labeling. This can result in a significant increase in the time required for data collection, a lack of flexibility in scene setups, and a high level of complexity in the repetition of experiments. This paper presents a simulation pipeline for generating slip data using the described camera-gripper configuration in a robot arm, and demonstrates its effectiveness through initial data-driven experiments. The use of a simulator, once it is set up, has the potential to reduce the time spent on data collection, provide the ability to alter the setup at any time, simplify the process of repetition and the generation of arbitrarily large data sets. Two distinct datasets were created and validated through visual inspection and artificial neural networks (ANNs). Visual inspection confirmed photorealistic frame generation and accurate slip modeling, while three ANNs trained on this data achieved high validation accuracy and demonstrated good generalization capabilities on a separate test set, along with initial applicability to real-world data. Project page: this https URL
zh
[CV-79] FedPalm: A General Federated Learning Framework for Closed- and Open-Set Palmprint Verification
【速读】:该论文旨在解决基于联邦学习(Federated Learning, FL)的手掌纹验证面临的两个关键问题:一是因生物特征数据的敏感性和不可变性导致的隐私顾虑;二是数据异质性带来的挑战以及缺乏标准化评估基准。论文通过构建一个全面的基准来填补这些空白,并明确评估了闭集和开集验证两种实际场景。
解决方案的关键在于提出了一种名为FedPalm的统一联邦学习框架。该框架在保持本地适应性的同时实现了全局泛化能力。每个客户端训练个性化纹理专家以适配本地数据,同时共同贡献于共享的全局纹理专家以提取通用特征。为进一步提升验证性能,引入了纹理专家交互模块,动态路由纹理特征以生成精炼的侧纹理特征,并通过可学习参数建模原始特征与侧特征之间的关系,促进跨纹理专家的交互,增强特征区分度。实验结果验证了FedPalm的有效性,在两种场景下均表现出稳健性能,为基于联邦学习的手掌纹验证研究奠定了坚实基础。
链接: https://arxiv.org/abs/2503.04837
作者: Ziyuan Yang,Yingyu Chen,Chengrui Gao,Andrew Beng Jin Teoh,Bob Zhang,Yi Zhang
机构: College of Computer Science, Sichuan University, Chengdu 610065, China (四川大学计算机学院,中国成都 610065); Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR), Singapore (新加坡科技研究局前沿人工智能研究中心); School of Electrical and Electronic Engineering, College of Engineering, Yonsei University, Seoul, Republic of Korea (韩国首尔延世大学电气与电子工程学院); Pattern Analysis and Machine Intelligence Group, Department of Computer and Information Science, University of Macau, Taipa, Macau, China (澳门大学计算机与信息科学系模式分析与机器智能小组); School of Cyber Science and Engineering, Sichuan University, Chengdu 610065, China (四川大学网络空间安全学院,中国成都 610065)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Current deep learning (DL)-based palmprint verification models rely on centralized training with large datasets, which raises significant privacy concerns due to biometric data’s sensitive and immutable nature. Federated learning~(FL), a privacy-preserving distributed learning paradigm, offers a compelling alternative by enabling collaborative model training without the need for data sharing. However, FL-based palmprint verification faces critical challenges, including data heterogeneity from diverse identities and the absence of standardized evaluation benchmarks. This paper addresses these gaps by establishing a comprehensive benchmark for FL-based palmprint verification, which explicitly defines and evaluates two practical scenarios: closed-set and open-set verification. We propose FedPalm, a unified FL framework that balances local adaptability with global generalization. Each client trains a personalized textural expert tailored to local data and collaboratively contributes to a shared global textural expert for extracting generalized features. To further enhance verification performance, we introduce a Textural Expert Interaction Module that dynamically routes textural features among experts to generate refined side textural features. Learnable parameters are employed to model relationships between original and side features, fostering cross-texture-expert interaction and improving feature discrimination. Extensive experiments validate the effectiveness of FedPalm, demonstrating robust performance across both scenarios and providing a promising foundation for advancing FL-based palmprint verification research.
zh
[CV-80] Distilling Dataset into Neural Field ICLR2025
【速读】:该论文致力于解决利用大规模数据集训练高性能深度学习模型时面临的计算和存储成本高昂的问题。为克服这一挑战,论文提出了一种名为“将数据蒸馏到神经场中(Distilling Dataset into Neural Field, DDiF)”的新颖参数化框架。DDiF 的关键在于利用神经场(neural field)来存储大规模数据集的必要信息,其独特的坐标输入与输出特性使得该方法能够高效保存数据信息,并灵活生成多种形状的数据。理论分析表明,在相同合成实例预算下,DDiF 的表达能力优于部分已有工作。实验结果进一步证明,DDiF 在多个基准数据集上表现出色,应用领域也扩展至视频、音频及 3D 体素等场景。
链接: https://arxiv.org/abs/2503.04835
作者: Donghyeok Shin,HeeSun Bae,Gyuwon Sim,Wanmo Kang,Il-Chul Moon
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: The Thirteenth International Conference on Learning Representations (ICLR 2025)
点击查看摘要
Abstract:Utilizing a large-scale dataset is essential for training high-performance deep learning models, but it also comes with substantial computation and storage costs. To overcome these challenges, dataset distillation has emerged as a promising solution by compressing the large-scale dataset into a smaller synthetic dataset that retains the essential information needed for training. This paper proposes a novel parameterization framework for dataset distillation, coined Distilling Dataset into Neural Field (DDiF), which leverages the neural field to store the necessary information of the large-scale dataset. Due to the unique nature of the neural field, which takes coordinates as input and output quantity, DDiF effectively preserves the information and easily generates various shapes of data. We theoretically confirm that DDiF exhibits greater expressiveness than some previous literature when the utilized budget for a single synthetic instance is the same. Through extensive experiments, we demonstrate that DDiF achieves superior performance on several benchmark datasets, extending beyond the image domain to include video, audio, and 3D voxel. We release the code at this https URL.
zh
[CV-81] RD Efficient FPGA Deployment of Learned Image Compression: Knowledge Distillation and Hybrid Quantization
【速读】:该论文致力于解决Learnable Image Compression (LIC)在硬件实现中面临的权衡问题,即如何在不牺牲率失真(RD)效率的前提下,提升其在特定硬件平台上的适应性和性能。论文的关键解决方案在于提出了一种新的设计范式:通过模型维度调整而非复杂的硬件设计探索来适配不同硬件平台。具体而言,首先开发了一种从参考教师模型蒸馏出更精简学生模型的框架,仅需调节单一模型超参数即可满足多种硬件约束;其次,提出了一种硬件友好的广义除归一化(GDN)激活函数实现方法,在参数量化后仍能保持RD效率;最后,设计了一种流水线化的FPGA配置方案,充分利用可用资源并通过并行处理和优化资源配置提高性能。这些创新点共同确保了在不降低RD效率的情况下显著提升了LIC在FPGA上的表现。
链接: https://arxiv.org/abs/2503.04832
作者: Mazouz Alaa Eddine,Sumanta Chaudhuri,Marco Cagnanzzo,Mihai Mitrea,Enzo Tartaglione,Attilio Fiandrotti
机构: Télécom SudParis, Institut Polytechnique de Paris, France (巴黎高等电信学院,巴黎综合理工大学,法国); Università di Torino, dipartimento di Informatica, Italy (都灵大学,计算机科学系,意大利)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Learnable Image Compression (LIC) has shown the potential to outperform standardized video codecs in RD efficiency, prompting the research for hardware-friendly implementations. Most existing LIC hardware implementations prioritize latency to RD-efficiency and through an extensive exploration of the hardware design space. We present a novel design paradigm where the burden of tuning the design for a specific hardware platform is shifted towards model dimensioning and without compromising on RD-efficiency. First, we design a framework for distilling a leaner student LIC model from a reference teacher: by tuning a single model hyperparameters, we can meet the constraints of different hardware platforms without a complex hardware design exploration. Second, we propose a hardware-friendly implementation of the Generalized Divisive Normalization (GDN) activation that preserves RD efficiency even post parameter quantization. Third, we design a pipelined FPGA configuration which takes full advantage of available FPGA resources by leveraging parallel processing and optimizing resource allocation. Our experiments with a state of the art LIC model show that we outperform all existing FPGA implementations while performing very close to the original model in terms of RD efficiency.
zh
[CV-82] StickMotion: Generating 3D Human Motions by Drawing a Stickman CVPR2025
【速读】:该论文旨在解决基于文本描述生成精确且细节丰富的运动序列这一挑战。传统方法难以从简单的文本输入中准确捕捉用户想象中的复杂动作。为应对这一问题,论文提出StickMotion,这是一种基于扩散模型的高效网络,专为多条件场景设计,通过传统的文本条件以及作者提出的stickman(骨架人物)条件实现对全局和局部运动的控制。关键在于从三个方面创新性地解决了由stickman引入的挑战:首先,开发了一种算法以自动生成跨不同数据集格式的手绘stickman;其次,在扩散过程中融入多条件融合模块,能够处理所有可能的条件组合,相比传统使用自注意力模块的方法显著降低了计算复杂度并提升了性能;最后,提出了动态监督策略,使StickMotion能够在输出序列中微调stickman的位置,从而生成更自然的动作。实验结果表明,stickman草图可帮助用户节省约51.5%的时间来生成与其想象一致的运动。
链接: https://arxiv.org/abs/2503.04829
作者: Tao Wang,Zhihua Wu,Qiaozhi He,Jiaming Chu,Ling Qian,Yu Cheng,Junliang Xing,Jian Zhao,Lei Jin
机构: Beijing University of Posts and Telecommunications (北京邮电大学); University of Science and Technology of China (中国科学技术大学); NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China (东北大学计算机科学与工程学院自然语言处理实验室); China Mobile (Suzhou) Sofware Technology Co, Ltd. (中国移动(苏州)软件技术有限公司); National University of Singapore (新加坡国立大学); Tsinghua University (清华大学); China Telecom Institute of AI (中国电信人工智能研究院); Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures, accepted by CVPR2025
点击查看摘要
Abstract:Text-to-motion generation, which translates textual descriptions into human motions, has been challenging in accurately capturing detailed user-imagined motions from simple text inputs. This paper introduces StickMotion, an efficient diffusion-based network designed for multi-condition scenarios, which generates desired motions based on traditional text and our proposed stickman conditions for global and local control of these motions, respectively. We address the challenges introduced by the user-friendly stickman from three perspectives: 1) Data generation. We develop an algorithm to generate hand-drawn stickmen automatically across different dataset formats. 2) Multi-condition fusion. We propose a multi-condition module that integrates into the diffusion process and obtains outputs of all possible condition combinations, reducing computational complexity and enhancing StickMotion’s performance compared to conventional approaches with the self-attention module. 3) Dynamic supervision. We empower StickMotion to make minor adjustments to the stickman’s position within the output sequences, generating more natural movements through our proposed dynamic supervision strategy. Through quantitative experiments and user studies, sketching stickmen saves users about 51.5% of their time generating motions consistent with their imagination. Our codes, demos, and relevant data will be released to facilitate further research and validation within the scientific community.
zh
[CV-83] DA-STGCN: 4D Trajectory Prediction Based on Spatiotemporal Feature Extraction
【速读】:本文旨在解决机场终端区域及繁忙空域中四维(4D)轨迹预测的挑战,特别是现有方法未能充分考虑飞机间相互作用的问题。解决方案的关键在于提出了一种名为DA-STGCN的新颖时空图卷积网络,其通过双注意力机制集成,利用自注意力方法重构邻接矩阵以增强节点相关性的捕捉,并采用图注意力提取时空特征,从而生成预测轨迹的概率分布。此外,该重构后的邻接矩阵在训练过程中动态优化,相比传统算法提供了更精细的节点间关系表征。实验结果表明,与当前方法相比,该模型在两个ADS-B数据集上的平均位移误差(ADE)和最终位移误差(FDE)分别降低了20%和30%,验证了双注意力模块对提升节点相关性提取的有效性。
链接: https://arxiv.org/abs/2503.04823
作者: Yuheng Kuang,Zhengning Wang,Jianping Zhang,Zhenyu Shi,Yuding Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The importance of four-dimensional (4D) trajectory prediction within air traffic management systems is on the rise. Key operations such as conflict detection and resolution, aircraft anomaly monitoring, and the management of congested flight paths are increasingly reliant on this foundational technology, underscoring the urgent demand for intelligent solutions. The dynamics in airport terminal zones and crowded airspaces are intricate and ever-changing; however, current methodologies do not sufficiently account for the interactions among aircraft. To tackle these challenges, we propose DA-STGCN, an innovative spatiotemporal graph convolutional network that integrates a dual attention mechanism. Our model reconstructs the adjacency matrix through a self-attention approach, enhancing the capture of node correlations, and employs graph attention to distill spatiotemporal characteristics, thereby generating a probabilistic distribution of predicted trajectories. This novel adjacency matrix, reconstructed with the self-attention mechanism, is dynamically optimized throughout the network’s training process, offering a more nuanced reflection of the inter-node relationships compared to traditional algorithms. The performance of the model is validated on two ADS-B datasets, one near the airport terminal area and the other in dense airspace. Experimental results demonstrate a notable improvement over current 4D trajectory prediction methods, achieving a 20% and 30% reduction in the Average Displacement Error (ADE) and Final Displacement Error (FDE), respectively. The incorporation of a Dual-Attention module has been shown to significantly enhance the extraction of node correlations, as verified by ablation experiments.
zh
[CV-84] Invisible Strings: Revealing Latent Dancer-to-Dancer Interactions with Graph Neural Networks
【速读】:本文旨在解决传统舞蹈记录方式无法有效捕捉双人舞中复杂且微妙的协作关系的问题。解决方案的关键在于利用图神经网络(Graph Neural Networks, GNNs)来分析和解读舞者之间的精细互动。通过构建从视频到3D姿态提取的流水线,论文首先从精心挑选的现代舞双人舞视频中提取3D运动数据,并进行专用预处理以优化重建效果。随后,训练GNN模型预测舞者间的加权连接关系。最终,通过对预测关系的可视化与解析,展示了基于图的方法在构建双人舞协作动力学替代模型方面的潜力,为生成性和协同性的工作室实践提供了指导策略。
链接: https://arxiv.org/abs/2503.04816
作者: Luis Vitor Zerkowski,Zixuan Wang,Ilya Vidrin,Mariel Pettee
机构: University of Amsterdam (阿姆斯特丹大学); Georgia Institute of Technology (乔治亚理工学院); Northeastern University (东北大学); Lawrence Berkeley National Lab (劳伦斯伯克利国家实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 10 pages, 10 figures, submitted to ICCC’25
点击查看摘要
Abstract:Dancing in a duet often requires a heightened attunement to one’s partner: their orientation in space, their momentum, and the forces they exert on you. Dance artists who work in partnered settings might have a strong embodied understanding in the moment of how their movements relate to their partner’s, but typical documentation of dance fails to capture these varied and subtle relationships. Working closely with dance artists interested in deepening their understanding of partnering, we leverage Graph Neural Networks (GNNs) to highlight and interpret the intricate connections shared by two dancers. Using a video-to-3D-pose extraction pipeline, we extract 3D movements from curated videos of contemporary dance duets, apply a dedicated pre-processing to improve the reconstruction, and train a GNN to predict weighted connections between the dancers. By visualizing and interpreting the predicted relationships between the two movers, we demonstrate the potential for graph-based methods to construct alternate models of the collaborative dynamics of duets. Finally, we offer some example strategies for how to use these insights to inform a generative and co-creative studio practice.
zh
[CV-85] GrainPaint: A multi-scale diffusion-based generative model for microstructure reconstruction of large-scale objects
【速读】:该论文试图解决基于模拟方法在微观结构生成中存在的高内存使用、长计算时间和复杂几何形状生成困难等问题。解决方案的关键在于利用去噪扩散模型(Denoising Diffusion Models)在图像修复(inpainting)领域的进展,通过提出一种新的微观结构生成方法克服传统生成式机器学习模型中生成区域固定尺寸的限制。研究显示,采用该方法生成的微观结构在统计特性上与基于Kinetic Monte Carlo模拟器SPPARKS生成的晶粒结构相似。
链接: https://arxiv.org/abs/2503.04776
作者: Nathan Hoffman,Cashen Diniz,Dehao Liu,Theron Rodgers,Anh Tran,Mark Fuge
机构: Department of Mechanical Engineering, University of Maryland (马里兰大学机械工程系); Department of Mechanical and Process Engineering, ETH Zürich (瑞士苏黎世联邦理工学院机械与过程工程系); Department of Mechanical Engineering, State University of New York at Binghamton (纽约州宾厄姆顿大学机械工程系); Sandia National Laboratories (桑迪亚国家实验室)
类目: Graphics (cs.GR); Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Simulation-based approaches to microstructure generation can suffer from a variety of limitations, such as high memory usage, long computational times, and difficulties in generating complex geometries. Generative machine learning models present a way around these issues, but they have previously been limited by the fixed size of their generation area. We present a new microstructure generation methodology leveraging advances in inpainting using denoising diffusion models to overcome this generation area limitation. We show that microstructures generated with the presented methodology are statistically similar to grain structures generated with a kinetic Monte Carlo simulator, SPPARKS.
zh
[CV-86] ask-oriented Uncertainty Collaborative Learning for Label-Efficient Brain Tumor Segmentation
【速读】:该论文旨在解决多对比度磁共振成像(MRI)在脑肿瘤分割中的挑战,特别是在有限标注数据条件下,跨不同对比度的多层次特异性感知难题。这些挑战包括数据异质性、粒度差异以及冗余信息干扰等。为应对这些限制,论文提出了一种面向任务的不确定性协同学习(Task-oriented Uncertainty Collaborative Learning, TUCL)框架。其关键在于引入了面向任务提示注意力(Task-oriented Prompt Attention, TPA)模块,通过提示内和提示间注意力机制动态建模对比度与任务间的特征交互,并设计循环过程确保提示的有效利用;同时,在解码阶段采用双路径不确定性优化(Dual-Path Uncertainty Refinement, DUR)策略,迭代优化预测结果以实现鲁棒分割。实验表明,TUCL显著提升了分割准确性(Dice达88.2%,HD95为10.853 mm),减少了对大量标注数据的依赖。
链接: https://arxiv.org/abs/2503.05682
作者: Zhenxuan Zhang,Hongjie Wu,Jiahao Huang,Baihong Xie,Zhifan Gao,Junxian Du,Pete Lally,Guang Yang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multi-contrast magnetic resonance imaging (MRI) plays a vital role in brain tumor segmentation and diagnosis by leveraging complementary information from different contrasts. Each contrast highlights specific tumor characteristics, enabling a comprehensive understanding of tumor morphology, edema, and pathological heterogeneity. However, existing methods still face the challenges of multi-level specificity perception across different contrasts, especially with limited annotations. These challenges include data heterogeneity, granularity differences, and interference from redundant information. To address these limitations, we propose a Task-oriented Uncertainty Collaborative Learning (TUCL) framework for multi-contrast MRI segmentation. TUCL introduces a task-oriented prompt attention (TPA) module with intra-prompt and cross-prompt attention mechanisms to dynamically model feature interactions across contrasts and tasks. Additionally, a cyclic process is designed to map the predictions back to the prompt to ensure that the prompts are effectively utilized. In the decoding stage, the TUCL framework proposes a dual-path uncertainty refinement (DUR) strategy which ensures robust segmentation by refining predictions iteratively. Extensive experimental results on limited labeled data demonstrate that TUCL significantly improves segmentation accuracy (88.2% in Dice and 10.853 mm in HD95). It shows that TUCL has the potential to extract multi-contrast information and reduce the reliance on extensive annotations. The code is available at: this https URL.
zh
[CV-87] owards Effective and Efficient Context-aware Nucleus Detection in Histopathology Whole Slide Images
【速读】:该论文试图解决组织病理学全片扫描图像(Whole Slide Images, WSIs)中细胞核检测的问题,特别是现有方法在处理千兆像素级别WSIs时利用滑动窗口技术所导致的边界上下文信息(如组织结构)缺失以及预测不准确的问题。此外,虽然已有研究通过裁剪每个滑动窗口周围的大幅视野(Field-of-View, FoV)区域来提取上下文特征,但这些方法显著增加了推理延迟。为了解决这些问题,论文提出了一种有效的上下文感知细胞核检测算法。其关键在于:不是依赖大幅视野区域,而是从历史上访问过的滑动窗口的现成特征中聚合上下文线索,这种设计大大减少了计算开销;同时,与低倍率下的大幅视野相比,滑动窗口补丁具有更高的放大倍率,提供了更精细的组织细节,从而提高了检测准确性。进一步地,通过引入网格池化技术将每个补丁的密集特征图压缩为少量上下文标记,以提高效率。最后,构建了首个专注于上下文感知细胞核实例分割的数据集OCELOT-seg,并提供了相关代码、数据集和模型检查点。
链接: https://arxiv.org/abs/2503.05678
作者: Zhongyi Shui,Ruizhe Guo,Honglin Li,Yuxuan Sun,Yunlong Zhang,Chenglu Zhu,Jiatong Cai,Pingyi Chen,Yanzhou Su,Lin Yang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: under review
点击查看摘要
Abstract:Nucleus detection in histopathology whole slide images (WSIs) is crucial for a broad spectrum of clinical applications. Current approaches for nucleus detection in gigapixel WSIs utilize a sliding window methodology, which overlooks boarder contextual information (eg, tissue structure) and easily leads to inaccurate predictions. To address this problem, recent studies additionally crops a large Filed-of-View (FoV) region around each sliding window to extract contextual features. However, such methods substantially increases the inference latency. In this paper, we propose an effective and efficient context-aware nucleus detection algorithm. Specifically, instead of leveraging large FoV regions, we aggregate contextual clues from off-the-shelf features of historically visited sliding windows. This design greatly reduces computational overhead. Moreover, compared to large FoV regions at a low magnification, the sliding window patches have higher magnification and provide finer-grained tissue details, thereby enhancing the detection accuracy. To further improve the efficiency, we propose a grid pooling technique to compress dense feature maps of each patch into a few contextual tokens. Finally, we craft OCELOT-seg, the first benchmark dedicated to context-aware nucleus instance segmentation. Code, dataset, and model checkpoints will be available at this https URL.
zh
[CV-88] State-of-the-Art Stroke Lesion Segmentation at 1/1000th of Parameters
【速读】:该论文致力于解决医学图像分析中高效且精确的全脑病灶分割难题。论文的关键创新在于重新审视MeshNet这一参数高效的分割模型,并引入了一种新颖的多尺度膨胀模式,采用编码器-解码器结构。这种方法能够在不使用传统下采样、上采样或跳跃连接的情况下捕获广泛的上下文信息和细粒度细节。与处理子体积或切片的传统方法不同,该模型直接在完整的256³ MRI体积上运行。实验结果表明,MeshNet在Aphasia Recovery Cohort (ARC) 数据集上的DICE分数达到或超过了MedNeXt和U-MAMBA等最先进的架构,而参数量仅为其1/1000。这验证了MeshNet在效率与性能之间的良好平衡,使其特别适用于资源受限的环境,如基于Web的应用程序,并为先进医学图像分析工具的广泛部署开辟了新的可能性。
链接: https://arxiv.org/abs/2503.05531
作者: Alex Fedorov,Yutong Bu,Xiao Hu,Chris Rorden,Sergey Plis
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: International Symposium on Biomedical Imaging, April 14-17, 2025
点击查看摘要
Abstract:Efficient and accurate whole-brain lesion segmentation remains a challenge in medical image analysis. In this work, we revisit MeshNet, a parameter-efficient segmentation model, and introduce a novel multi-scale dilation pattern with an encoder-decoder structure. This innovation enables capturing broad contextual information and fine-grained details without traditional downsampling, upsampling, or skip-connections. Unlike previous approaches processing subvolumes or slices, we operate directly on whole-brain 256^3 MRI volumes. Evaluations on the Aphasia Recovery Cohort (ARC) dataset demonstrate that MeshNet achieves superior or comparable DICE scores to state-of-the-art architectures such as MedNeXt and U-MAMBA at 1/1000th of parameters. Our results validate MeshNet’s strong balance of efficiency and performance, making it particularly suitable for resource-limited environments such as web-based applications and opening new possibilities for the widespread deployment of advanced medical image analysis tools.
zh
[CV-89] Pretext Task Adversarial Learning for Unpaired Low-field to Ultra High-field MRI Synthesis
【速读】:该论文旨在解决利用低场 MRI 数据合成高场 MRI 数据的问题,这一任务在训练下游任务(如分割)时由于高场 MRI 数据稀缺且成本高昂而具有重要意义。然而,低场 MRI 的信噪比 (SNR) 和空间分辨率较低,合成高场 MRI 面临着跨域图像特征对齐、保持解剖准确性以及增强细节等挑战。论文的关键解决方案是提出了一种名为预训练任务对抗学习框架 (Pretext Task Adversarial, PTA) 的方法,该框架包含三个核心过程:(1) 切片差距感知网络 (SGP) 基于对比学习对齐低场与高场数据集的切片间差异;(2) 局部结构校正网络 (LSC) 通过恢复局部旋转和掩蔽图像提取局部结构;(3) 预训练任务引导的对抗训练过程引入额外监督并通过判别器提升图像真实性。实验结果表明,该方法在低场到超高场任务中表现出色,达到了当前最先进的性能 (FID=16.892, IS=1.933, MS-SSIM=0.324),从而能够从低场 MRI 数据生成高质量的高场 MRI 样本以扩充下游任务的训练数据集。
链接: https://arxiv.org/abs/2503.05339
作者: Zhenxuan Zhang,Peiyuan Jing,Coraline Beitone,Jiahao Huang,Zhifan Gao,Guang Yang,Pete Lally
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Given the scarcity and cost of high-field MRI, the synthesis of high-field MRI from low-field MRI holds significant potential when there is limited data for training downstream tasks (e.g. segmentation). Low-field MRI often suffers from a reduced signal-to-noise ratio (SNR) and spatial resolution compared to high-field MRI. However, synthesizing high-field MRI data presents challenges. These involve aligning image features across domains while preserving anatomical accuracy and enhancing fine details. To address these challenges, we propose a Pretext Task Adversarial (PTA) learning framework for high-field MRI synthesis from low-field MRI data. The framework comprises three processes: (1) The slice-wise gap perception (SGP) network aligns the slice inconsistencies of low-field and high-field datasets based on contrastive learning. (2) The local structure correction (LSC) network extracts local structures by restoring the locally rotated and masked images. (3) The pretext task-guided adversarial training process introduces additional supervision and incorporates a discriminator to improve image realism. Extensive experiments on low-field to ultra high-field task demonstrate the effectiveness of our method, achieving state-of-the-art performance (16.892 in FID, 1.933 in IS, and 0.324 in MS-SSIM). This enables the generation of high-quality high-field-like MRI data from low-field MRI data to augment training datasets for downstream tasks. The code is available at: this https URL.
zh
[CV-90] L-FUSION: Laplacian Fetal Ultrasound Segmentation Uncertainty Estimation
【速读】:该论文旨在解决胎儿超声图像中因操作者依赖性和技术限制(如固有伪影、设置错误等)导致的图像解释困难及诊断不确定性评估复杂化的问题。论文提出的解决方案关键在于开发了一种名为L-FUSION的框架,它通过无监督规范学习与大规模基础模型的集成,实现了对正常及病理性扫描中胎儿结构的鲁棒分割。L-FUSION利用随机分割网络的Aleatoric逻辑分布以及基于快速Hessian估计的拉普拉斯近似,仅从分割头估计认识论不确定性,并结合集成Dropout组件,生成增强的不确定性图和分割反事实,从而可靠地区分病变与正常胎儿解剖结构,实现异常量化和即时诊断反馈。这种方法不仅提升了认识论和认识论不确定性解释的一致性,还消除了手动疾病标注的需求,支持现场决策并为临床环境中的胎儿超声分析提供了可扩展的解决方案。
链接: https://arxiv.org/abs/2503.05245
作者: Johanna P. Müller,Robert Wright,Thomas G. Day,Lorenzo Venturini,Samuel F. Budd,Hadrien Reynaud,Joseph V. Hajnal,Reza Razavi,Bernhard Kainz
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review
点击查看摘要
Abstract:Accurate analysis of prenatal ultrasound (US) is essential for early detection of developmental anomalies. However, operator dependency and technical limitations (e.g. intrinsic artefacts and effects, setting errors) can complicate image interpretation and the assessment of diagnostic uncertainty. We present L-FUSION (Laplacian Fetal US Segmentation with Integrated FoundatiON models), a framework that integrates uncertainty quantification through unsupervised, normative learning and large-scale foundation models for robust segmentation of fetal structures in normal and pathological scans. We propose to utilise the aleatoric logit distributions of Stochastic Segmentation Networks and Laplace approximations with fast Hessian estimations to estimate epistemic uncertainty only from the segmentation head. This enables us to achieve reliable abnormality quantification for instant diagnostic feedback. Combined with an integrated Dropout component, L-FUSION enables reliable differentiation of lesions from normal fetal anatomy with enhanced uncertainty maps and segmentation counterfactuals in US imaging. It improves epistemic and aleatoric uncertainty interpretation and removes the need for manual disease-labelling. Evaluations across multiple datasets show that L-FUSION achieves superior segmentation accuracy and consistent uncertainty quantification, supporting on-site decision-making and offering a scalable solution for advancing fetal ultrasound analysis in clinical settings.
zh
[CV-91] Gaussian Random Fields as an Abstract Representation of Patient Metadata for Multimodal Medical Image Segmentation
【速读】:该论文旨在解决慢性伤口(尤其是糖尿病患者中高发的慢性伤口)检测与分割的难题,这些伤口治疗难度大且成本高昂,对医疗系统构成沉重负担。同时,慢性伤口可能导致感染,严重影响患者生活质量并增加死亡风险。为应对这一挑战,论文提出了一种创新的多模态分割方法,其关键是将患者的元数据(metadata)引入训练流程,并以高斯随机场(Gaussian random fields)的形式表达患者数据,从而提升模型性能。通过在多个基于不同元数据类别的模型上进行训练,并结合距离变换(distance transform)对预测掩膜进行平均融合,论文实现了对基准结果(Intersection over Union = 0.4670,Dice相似性系数 = 0.5908)的显著改进(分别提升0.0220和0.0229)。这是首次研究将患者数据集成到慢性伤口分割工作流中的尝试,展示了显著的性能提升潜力。
链接: https://arxiv.org/abs/2503.05214
作者: Bill Cassidy,Christian McBride,Connah Kendrick,Neil D. Reeves,Joseph M. Pappachan,Shaghayegh Raad,Moi Hoon Yap
机构: Department of Computing and Mathematics, Manchester Metropolitan University (曼彻斯特都会大学); Medical School, Faculty of Health and Medicine, Health Innovation Campus, Lancaster University (兰开斯特大学); Lancashire Teaching Hospitals NHS Foundation Trust (兰开郡教学医院 NHS 基金会信托)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The growing rate of chronic wound occurrence, especially in patients with diabetes, has become a concerning trend in recent years. Chronic wounds are difficult and costly to treat, and have become a serious burden on health care systems worldwide. Chronic wounds can have devastating consequences for the patient, with infection often leading to reduced quality of life and increased mortality risk. Innovative deep learning methods for the detection and monitoring of such wounds have the potential to reduce the impact to both patient and clinician. We present a novel multimodal segmentation method which allows for the introduction of patient metadata into the training workflow whereby the patient data are expressed as Gaussian random fields. Our results indicate that the proposed method improved performance when utilising multiple models, each trained on different metadata categories. Using the Diabetic Foot Ulcer Challenge 2022 test set, when compared to the baseline results (intersection over union = 0.4670, Dice similarity coefficient = 0.5908) we demonstrate improvements of +0.0220 and +0.0229 for intersection over union and Dice similarity coefficient respectively. This paper presents the first study to focus on integrating patient data into a chronic wound segmentation workflow. Our results show significant performance gains when training individual models using specific metadata categories, followed by average merging of prediction masks using distance transforms. All source code for this study is available at: this https URL
zh
[CV-92] We Care Each Pixel: Calibrating on Medical Segmentation Model
【速读】:该论文旨在解决现有医学图像分割评估指标无法有效衡量模型校准质量的问题,这对于临床应用的可靠性至关重要。论文的关键在于提出了一种新的像素级预期校准误差(pixel-wise Expected Calibration Error, pECE)度量方法,用于显式评估分割模型的校准偏差,同时结合形态学适应策略及签名距离校准损失(Signed Distance Calibration, SDC),通过优化边界几何与校准目标的一致性,显著提升了分割性能和校准质量,确保了预测置信度的可靠性和空间精度。
链接: https://arxiv.org/abs/2503.05107
作者: Wenhao Liang,Wei Zhang,Yue Lin,Miao Xu,Olaf Maennel,Weitong Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Reviewing
点击查看摘要
Abstract:Medical image segmentation is fundamental for computer-aided diagnostics, providing accurate delineation of anatomical structures and pathological regions. While common metrics such as Accuracy, DSC, IoU, and HD primarily quantify spatial agreement between predictions and ground-truth labels, they do not assess the calibration quality of segmentation models, which is crucial for clinical reliability. To address this limitation, we propose pixel-wise Expected Calibration Error (pECE), a novel metric that explicitly measures miscalibration at the pixel level, thereby ensuring both spatial precision and confidence reliability. We further introduce a morphological adaptation strategy that applies morphological operations to ground-truth masks before computing calibration losses, particularly benefiting margin-based losses such as Margin SVLS and NACL. Additionally, we present the Signed Distance Calibration Loss (SDC), which aligns boundary geometry with calibration objectives by penalizing discrepancies between predicted and ground-truth signed distance functions (SDFs). Extensive experiments demonstrate that our method not only enhances segmentation performance but also improves calibration quality, yielding more trustworthy confidence estimates. Code is available at: this https URL.
zh
[CV-93] Lightweight Hypercomplex MRI Reconstruction: A Generalized Kronecker-Parameterized Approach
【速读】:该论文旨在解决磁共振成像(MRI)临床诊断中的长时间扫描问题,同时克服现有深度学习模型因内存密集型特性而在资源受限系统中应用困难的问题。论文的关键创新在于提出了一种轻量级的MRI重建模型,通过利用Kronecker参数化超复数神经网络,在减少参数数量的同时实现高性能的MRI重建。其核心解决方案在于引入基于Kronecker的产品模块,包括Kronecker多层感知机(MLP)、Kronecker窗口注意力(Window Attention)和Kronecker卷积(Convolution),这些模块能够高效提取空间特征并保持表示能力。实验结果表明,提出的Kronecker U-Net和Kronecker SwinMR在FastMRI数据集上的性能与现有模型相当,甚至在高加速因子(8倍和16倍)下仍能保持高质量重建,且参数量减少约50%,同时展现出更好的泛化能力和减少过拟合的能力,为硬件受限环境下的高效MRI重建提供了新基准。
链接: https://arxiv.org/abs/2503.05063
作者: Haosen Zhang,Jiahao Huang,Yinzhe Wu,Congren Dai,Fanwen Wang,Zhenxuan Zhang,Guang Yang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 figures. Submitted for publication
点击查看摘要
Abstract:Magnetic Resonance Imaging (MRI) is crucial for clinical diagnostics but is hindered by prolonged scan times. Current deep learning models enhance MRI reconstruction but are often memory-intensive and unsuitable for resource-limited systems. This paper introduces a lightweight MRI reconstruction model leveraging Kronecker-Parameterized Hypercomplex Neural Networks to achieve high performance with reduced parameters. By integrating Kronecker-based modules, including Kronecker MLP, Kronecker Window Attention, and Kronecker Convolution, the proposed model efficiently extracts spatial features while preserving representational power. We introduce Kronecker U-Net and Kronecker SwinMR, which maintain high reconstruction quality with approximately 50% fewer parameters compared to existing models. Experimental evaluation on the FastMRI dataset demonstrates competitive PSNR, SSIM, and LPIPS metrics, even at high acceleration factors (8x and 16x), with no significant performance drop. Additionally, Kronecker variants exhibit superior generalization and reduced overfitting on limited datasets, facilitating efficient MRI reconstruction on hardware-constrained systems. This approach sets a new benchmark for parameter-efficient medical imaging models.
zh
[CV-94] Accelerated Patient-specific Non-Cartesian MRI Reconstruction using Implicit Neural Representations
【速读】:该论文旨在解决快速磁共振成像(MRI)中非笛卡尔欠采样k空间重建的问题,其核心挑战在于现有方法难以在非笛卡尔采样下有效建模连续频域信号。论文的关键创新在于提出了一种基于隐式神经表征(Implicit Neural Representations, INR)的生成式对抗训练方法(k-GINR)。该方法通过两个阶段实现:第一阶段利用生成对抗网络(GAN)在全采样数据上进行监督训练;第二阶段针对个体患者数据进行自监督优化,以嵌入先验知识的方式实现患者特异性调整。这种方法能够在高加速因子(如20倍)下显著优于传统压缩感知(Compressed Sensing)和图像域深度学习方法(如Deep Cascade CNN),尤其适用于非笛卡尔采样下的快速k空间重建。
链接: https://arxiv.org/abs/2503.05051
作者: Di Xu,Hengjie Liu,Xin Miao,Daniel O’Connor,Jessica E. Scholey,Wensha Yang,Mary Feng,Michael Ohliger,Hui Lin,Dan Ruan,Yang Yang,Ke Sheng
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The scanning time for a fully sampled MRI can be undesirably lengthy. Compressed sensing has been developed to minimize image artifacts in accelerated scans, but the required iterative reconstruction is computationally complex and difficult to generalize on new cases. Image-domain-based deep learning methods (e.g., convolutional neural networks) emerged as a faster alternative but face challenges in modeling continuous k-space, a problem amplified with non-Cartesian sampling commonly used in accelerated acquisition. In comparison, implicit neural representations can model continuous signals in the frequency domain and thus are compatible with arbitrary k-space sampling patterns. The current study develops a novel generative-adversarially trained implicit neural representations (k-GINR) for de novo undersampled non-Cartesian k-space reconstruction. k-GINR consists of two stages: 1) supervised training on an existing patient cohort; 2) self-supervised patient-specific optimization. In stage 1, the network is trained with the generative-adversarial network on diverse patients of the same anatomical region supervised by fully sampled acquisition. In stage 2, undersampled k-space data of individual patients is used to tailor the prior-embedded network for patient-specific optimization. The UCSF StarVIBE T1-weighted liver dataset was evaluated on the proposed framework. k-GINR is compared with an image-domain deep learning method, Deep Cascade CNN, and a compressed sensing method. k-GINR consistently outperformed the baselines with a larger performance advantage observed at very high accelerations (e.g., 20 times). k-GINR offers great value for direct non-Cartesian k-space reconstruction for new incoming patients across a wide range of accelerations liver anatomy.
zh
[CV-95] Enhancing Alzheimers Diagnosis: Leverag ing Anatomical Landmarks in Graph Convolutional Neural Networks on Tetrahedral Meshes
【速读】:本文旨在解决阿尔茨海默病(Alzheimer’s Disease, AD)早期诊断中脑淀粉样蛋白阳性(brain amyloid positivity)检测的挑战。传统方法如正电子发射断层扫描(PET)虽然敏感,但成本高昂且侵入性较强;而结构磁共振成像(structural MRI, sMRI)提供了更安全便捷的替代方案,但其在病理特征识别上的局限性导致现有分类模型难以泛化至脑淀粉样蛋白阳性的预测任务。此外,尽管血液生物标志物(Blood-based Biomarkers, BBBMs)在高风险群体中表现出良好的预测能力,但对于中等风险个体仍需依赖PET进行进一步验证。
为应对上述问题,本文提出了一种基于变压器架构的几何深度学习模型,该模型不仅具备可扩展性,还能有效处理输入体素网格尺寸的变化。关键创新在于引入了一种针对四面体网格的新颖标记化方案,并结合由预训练高斯过程模型生成的解剖学地标信息。实验结果表明,此模型在AD分类任务中表现优异,并且能够推广到中等风险组的脑淀粉样蛋白阳性预测任务,弥补了仅依靠血液生物标志物无法清晰区分的不足。这一工作有望推动几何深度学习领域的研究进展,并提高AD诊断的准确性,同时避免使用昂贵且侵入性强的PET扫描技术。
链接: https://arxiv.org/abs/2503.05031
作者: Yanxi Chen,Mohammad Farazi,Zhangsihao Yang,Yonghui Fan,Nicholas Ashton,Eric M Reiman,Yi Su,Yalin Wang
机构: Unknown
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注:
点击查看摘要
Abstract:Alzheimer’s disease (AD) is a major neurodegenerative condition that affects millions around the world. As one of the main biomarkers in the AD diagnosis procedure, brain amyloid positivity is typically identified by positron emission tomography (PET), which is costly and invasive. Brain structural magnetic resonance imaging (sMRI) may provide a safer and more convenient solution for the AD diagnosis. Recent advances in geometric deep learning have facilitated sMRI analysis and early diagnosis of AD. However, determining AD pathology, such as brain amyloid deposition, in preclinical stage remains challenging, as less significant morphological changes can be observed. As a result, few AD classification models are generalizable to the brain amyloid positivity classification task. Blood-based biomarkers (BBBMs), on the other hand, have recently achieved remarkable success in predicting brain amyloid positivity and identifying individuals with high risk of being brain amyloid positive. However, individuals in medium risk group still require gold standard tests such as Amyloid PET for further evaluation. Inspired by the recent success of transformer architectures, we propose a geometric deep learning model based on transformer that is both scalable and robust to variations in input volumetric mesh size. Our work introduced a novel tokenization scheme for tetrahedral meshes, incorporating anatomical landmarks generated by a pre-trained Gaussian process model. Our model achieved superior classification performance in AD classification task. In addition, we showed that the model was also generalizable to the brain amyloid positivity prediction with individuals in the medium risk class, where BM alone cannot achieve a clear classification. Our work may enrich geometric deep learning research and improve AD diagnosis accuracy without using expensive and invasive PET scans.
zh
[CV-96] Prediction of Frozen Region Growth in Kidney Cryoablation Intervention Using a 3D Flow-Matching Model MICCAI2025
【速读】:该论文旨在解决肾冷冻消融术(cryoablation)中精确预测冰球(iceball)扩展范围的问题,以实现肿瘤的完全消融同时保护周围健康组织。传统方法通常基于物理驱动或扩散模型,计算成本高且难以准确表示复杂的解剖结构。为克服这些局限性,研究的关键在于提出了一种基于三维流匹配(3D flow-matching)的深度学习模型,该模型通过术中CT成像数据进行训练,能够学习连续的形变场,将早期阶段的CT扫描映射到未来的预测结果。这种方法不仅估算冰球的体积变化,还生成对应的分割掩模,从而有效捕捉其空间和形态随时间的变化。论文通过定量分析验证了模型的鲁棒性,其Dice系数达到0.75,IoU分数为0.61,表明预测与真实分割高度一致。这一方案通过整合实时CT成像与先进的深度学习技术,有望显著提升术中引导精度,改善手术结果,并推动微创外科的发展。
链接: https://arxiv.org/abs/2503.04966
作者: Siyeop Yoon,Yujin Oh,Matthew Tivnan,Sifan Song,Pengfei Jin,Sekeun KimHyun Jin Cho,Dufan Wu,Raul Uppot,Quanzheng Li
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2025 submitted version
点击查看摘要
Abstract:This study presents a 3D flow-matching model designed to predict the progression of the frozen region (iceball) during kidney cryoablation. Precise intraoperative guidance is critical in cryoablation to ensure complete tumor eradication while preserving adjacent healthy tissue. However, conventional methods, typically based on physics driven or diffusion based simulations, are computationally demanding and often struggle to represent complex anatomical structures accurately. To address these limitations, our approach leverages intraoperative CT imaging to inform the model. The proposed 3D flow matching model is trained to learn a continuous deformation field that maps early-stage CT scans to future predictions. This transformation not only estimates the volumetric expansion of the iceball but also generates corresponding segmentation masks, effectively capturing spatial and morphological changes over time. Quantitative analysis highlights the model robustness, demonstrating strong agreement between predictions and ground-truth segmentations. The model achieves an Intersection over Union (IoU) score of 0.61 and a Dice coefficient of 0.75. By integrating real time CT imaging with advanced deep learning techniques, this approach has the potential to enhance intraoperative guidance in kidney cryoablation, improving procedural outcomes and advancing the field of minimally invasive surgery.
zh
[CV-97] PGAD: Prototype-Guided Adaptive Distillation for Multi-Modal Learning in AD Diagnosis
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s Disease, AD)诊断中因数据缺失模态导致的问题,特别是在真实世界数据集中普遍存在大量不完整样本的情况下。现有方法通常仅利用完整数据进行训练,未能充分挖掘不完全样本的价值,且在高缺失率下难以有效处理多模态特征对齐与知识迁移的挑战。为解决这些问题,论文提出了一种基于原型引导自适应蒸馏(Prototype-Guided Adaptive Distillation, PGAD)的框架,其关键是通过原型匹配增强缺失模态的表示,并结合动态采样策略平衡学习过程。实验结果表明,PGAD在不同缺失率(20%、50% 和 70%)下显著优于现有最先进的方法,验证了原型匹配和自适应采样的有效性,展示了该框架在实际临床环境中实现鲁棒且可扩展的AD诊断潜力。
链接: https://arxiv.org/abs/2503.04836
作者: Yanfei Li,Teng Yin,Wenyi Shang,Jingyu Liu,Xi Wang,Kaiyang Zhao
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Missing modalities pose a major issue in Alzheimer’s Disease (AD) diagnosis, as many subjects lack full imaging data due to cost and clinical constraints. While multi-modal learning leverages complementary information, most existing methods train only on complete data, ignoring the large proportion of incomplete samples in real-world datasets like ADNI. This reduces the effective training set and limits the full use of valuable medical data. While some methods incorporate incomplete samples, they fail to effectively address inter-modal feature alignment and knowledge transfer challenges under high missing rates. To address this, we propose a Prototype-Guided Adaptive Distillation (PGAD) framework that directly incorporates incomplete multi-modal data into training. PGAD enhances missing modality representations through prototype matching and balances learning with a dynamic sampling strategy. We validate PGAD on the ADNI dataset with varying missing rates (20%, 50%, and 70%) and demonstrate that it significantly outperforms state-of-the-art approaches. Ablation studies confirm the effectiveness of prototype matching and adaptive sampling, highlighting the potential of our framework for robust and scalable AD diagnosis in real-world clinical settings.
zh
[CV-98] Rethinking Few-Shot Medical Image Segmentation by SAM2: A Training-Free Framework with Augmentative Prompting and Dynamic Matching
【速读】:该论文旨在解决医学图像分割中对大规模标记数据集的依赖这一重大挑战。尽管少量学习(Few-shot Learning)提供了潜在解决方案,但现有方法通常仍需要大量的训练数据。为应对这一问题,本文提出了一种新颖的方法,利用Segment Anything Model 2 (SAM2),这是一种具备强大视频分割能力的视觉基础模型。论文的关键创新在于一种支持-查询匹配策略:对单个标记的支持图像进行广泛的数据增强,并针对查询体数据集中的每一帧,算法化选择最相似的增强支持图像。所选图像及其对应的掩码被用作掩码提示,驱动SAM2的视频分割。此方法完全避免了模型再训练或参数更新。实验结果表明,该方法在基准少量医学图像分割数据集上达到了最先进的性能,显著提高了准确性与标注效率,提供了一种通用且强大的3D医学图像分割解决方案。
链接: https://arxiv.org/abs/2503.04826
作者: Haiyue Zu,Jun Ge,Heting Xiao,Jile Xie,Zhangzhe Zhou,Yifan Meng,Jiayi Ni,Junjie Niu,Linlin Zhang,Li Ni,Huilin Yang
机构: Department of Orthopaedics, The First Affiliated Hospital of Soochow University, Soochow University (苏州大学附属第一医院骨科; 苏州大学), Suzhou (苏州), 215006, China.; Independent Researcher (独立研究者).
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The reliance on large labeled datasets presents a significant challenge in medical image segmentation. Few-shot learning offers a potential solution, but existing methods often still require substantial training data. This paper proposes a novel approach that leverages the Segment Anything Model 2 (SAM2), a vision foundation model with strong video segmentation capabilities. We conceptualize 3D medical image volumes as video sequences, departing from the traditional slice-by-slice paradigm. Our core innovation is a support-query matching strategy: we perform extensive data augmentation on a single labeled support image and, for each frame in the query volume, algorithmically select the most analogous augmented support image. This selected image, along with its corresponding mask, is used as a mask prompt, driving SAM2’s video segmentation. This approach entirely avoids model retraining or parameter updates. We demonstrate state-of-the-art performance on benchmark few-shot medical image segmentation datasets, achieving significant improvements in accuracy and annotation efficiency. This plug-and-play method offers a powerful and generalizable solution for 3D medical image segmentation.
zh
[CV-99] RTFusion: A depth estimation network based on multimodal fusion in challenging scenarios
【速读】:该论文旨在解决复杂现实场景中单模态(如可见光或热红外THR影像)深度估计面临的挑战,特别是在不利光照条件下精度和鲁棒性不足的问题。论文提出了一种新颖的多模态深度估计算法RTFusion,其关键在于通过融合RGB和THR数据的互补优势来提升深度估计的性能。解决方案的核心是引入独特的融合机制EGFusion,它包含用于跨模态特征对齐的互补充注意力模块MCA以及用于增强边缘细节保留的边缘显著性增强模块ESEM。实验结果表明,该模型在多种具有挑战性的环境中(如夜间、雨天和强光条件)均能生成高质量的深度图,并在自动驾驶、机器人和增强现实等需要可靠深度估计的应用中展现出潜力。
链接: https://arxiv.org/abs/2503.04821
作者: Zelin Meng,Takanori Fukao
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures
点击查看摘要
Abstract:Depth estimation in complex real-world scenarios is a challenging task, especially when relying solely on a single modality such as visible light or thermal infrared (THR) imagery. This paper proposes a novel multimodal depth estimation model, RTFusion, which enhances depth estimation accuracy and robustness by integrating the complementary strengths of RGB and THR data. The RGB modality provides rich texture and color information, while the THR modality captures thermal patterns, ensuring stability under adverse lighting conditions such as extreme illumination. The model incorporates a unique fusion mechanism, EGFusion, consisting of the Mutual Complementary Attention (MCA) module for cross-modal feature alignment and the Edge Saliency Enhancement Module (ESEM) to improve edge detail preservation. Comprehensive experiments on the MS2 and ViViD++ datasets demonstrate that the proposed model consistently produces high-quality depth maps across various challenging environments, including nighttime, rainy, and high-glare conditions. The experimental results highlight the potential of the proposed method in applications requiring reliable depth estimation, such as autonomous driving, robotics, and augmented reality.
zh
人工智能
[AI-0] Multi-Fidelity Policy Gradient Algorithms
链接: https://arxiv.org/abs/2503.05696
作者: Xinjie Liu,Cyrus Neary,Kushagra Gupta,Christian Ellis,Ufuk Topcu,David Fridovich-Keil
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:Many reinforcement learning (RL) algorithms require large amounts of data, prohibiting their use in applications where frequent interactions with operational systems are infeasible, or high-fidelity simulations are expensive or unavailable. Meanwhile, low-fidelity simulators–such as reduced-order models, heuristic reward functions, or generative world models–can cheaply provide useful data for RL training, even if they are too coarse for direct sim-to-real transfer. We propose multi-fidelity policy gradients (MFPGs), an RL framework that mixes a small amount of data from the target environment with a large volume of low-fidelity simulation data to form unbiased, reduced-variance estimators (control variates) for on-policy policy gradients. We instantiate the framework by developing multi-fidelity variants of two policy gradient algorithms: REINFORCE and proximal policy optimization. Experimental results across a suite of simulated robotics benchmark problems demonstrate that when target-environment samples are limited, MFPG achieves up to 3.9x higher reward and improves training stability when compared to baselines that only use high-fidelity data. Moreover, even when the baselines are given more high-fidelity samples–up to 10x as many interactions with the target environment–MFPG continues to match or outperform them. Finally, we observe that MFPG is capable of training effective policies even when the low-fidelity environment is drastically different from the target environment. MFPG thus not only offers a novel paradigm for efficient sim-to-real transfer but also provides a principled approach to managing the trade-off between policy performance and data collection costs.
[AI-1] dARt Vinci: Egocentric Data Collection for Surgical Robot Learning at Scale
链接: https://arxiv.org/abs/2503.05646
作者: Yihao Liu,Yu-Chun Ku,Jiaming Zhang,Hao Ding,Peter Kazanzides,Mehran Armand
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 8 pages, 7 figures
点击查看摘要
Abstract:Data scarcity has long been an issue in the robot learning community. Particularly, in safety-critical domains like surgical applications, obtaining high-quality data can be especially difficult. It poses challenges to researchers seeking to exploit recent advancements in reinforcement learning and imitation learning, which have greatly improved generalizability and enabled robots to conduct tasks autonomously. We introduce dARt Vinci, a scalable data collection platform for robot learning in surgical settings. The system uses Augmented Reality (AR) hand tracking and a high-fidelity physics engine to capture subtle maneuvers in primitive surgical tasks: By eliminating the need for a physical robot setup and providing flexibility in terms of time, space, and hardware resources-such as multiview sensors and actuators-specialized simulation is a viable alternative. At the same time, AR allows the robot data collection to be more egocentric, supported by its body tracking and content overlaying capabilities. Our user study confirms the proposed system’s efficiency and usability, where we use widely-used primitive tasks for training teleoperation with da Vinci surgical robots. Data throughput improves across all tasks compared to real robot settings by 41% on average. The total experiment time is reduced by an average of 10%. The temporal demand in the task load survey is improved. These gains are statistically significant. Additionally, the collected data is over 400 times smaller in size, requiring far less storage while achieving double the frequency.
[AI-2] Exploring FMCW Radars and Feature Maps for Activity Recognition: A Benchmark Study
链接: https://arxiv.org/abs/2503.05629
作者: Ali Samimi Fard,Mohammadreza Mashhadigholamali,Samaneh Zolfaghari,Hajar Abedi,Mainak Chakraborty,Luigi Borzì,Masoud Daneshtalab,George Shaker
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Human Activity Recognition has gained significant attention due to its diverse applications, including ambient assisted living and remote sensing. Wearable sensor-based solutions often suffer from user discomfort and reliability issues, while video-based methods raise privacy concerns and perform poorly in low-light conditions or long ranges. This study introduces a Frequency-Modulated Continuous Wave radar-based framework for human activity recognition, leveraging a 60 GHz radar and multi-dimensional feature maps. Unlike conventional approaches that process feature maps as images, this study feeds multi-dimensional feature maps – Range-Doppler, Range-Azimuth, and Range-Elevation – as data vectors directly into the machine learning (SVM, MLP) and deep learning (CNN, LSTM, ConvLSTM) models, preserving the spatial and temporal structures of the data. These features were extracted from a novel dataset with seven activity classes and validated using two different validation approaches. The ConvLSTM model outperformed conventional machine learning and deep learning models, achieving an accuracy of 90.51% and an F1-score of 87.31% on cross-scene validation and an accuracy of 89.56% and an F1-score of 87.15% on leave-one-person-out cross-validation. The results highlight the approach’s potential for scalable, non-intrusive, and privacy-preserving activity monitoring in real-world scenarios.
[AI-3] Superintelligence Strategy: Expert Version
链接: https://arxiv.org/abs/2503.05628
作者: Dan Hendrycks,Eric Schmidt,Alexandr Wang
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: this https URL
点击查看摘要
Abstract:Rapid advances in AI are beginning to reshape national security. Destabilizing AI developments could rupture the balance of power and raise the odds of great-power conflict, while widespread proliferation of capable AI hackers and virologists would lower barriers for rogue actors to cause catastrophe. Superintelligence – AI vastly better than humans at nearly all cognitive tasks – is now anticipated by AI researchers. Just as nations once developed nuclear strategies to secure their survival, we now need a coherent superintelligence strategy to navigate a new period of transformative change. We introduce the concept of Mutual Assured AI Malfunction (MAIM): a deterrence regime resembling nuclear mutual assured destruction (MAD) where any state’s aggressive bid for unilateral AI dominance is met with preventive sabotage by rivals. Given the relative ease of sabotaging a destabilizing AI project – through interventions ranging from covert cyberattacks to potential kinetic strikes on datacenters – MAIM already describes the strategic picture AI superpowers find themselves in. Alongside this, states can increase their competitiveness by bolstering their economies and militaries through AI, and they can engage in nonproliferation to rogue actors to keep weaponizable AI capabilities out of their hands. Taken together, the three-part framework of deterrence, nonproliferation, and competitiveness outlines a robust strategy to superintelligence in the years ahead.
[AI-4] InDRiVE: Intrinsic Disagreement based Reinforcement for Vehicle Exploration through Curiosity Driven Generalized World Model IROS2025
链接: https://arxiv.org/abs/2503.05573
作者: Feeza Khan Khanzada,Jaerock Kwon
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: This work has been submitted to IROS 2025 and is currently under review
点击查看摘要
Abstract:Model-based Reinforcement Learning (MBRL) has emerged as a promising paradigm for autonomous driving, where data efficiency and robustness are critical. Yet, existing solutions often rely on carefully crafted, task specific extrinsic rewards, limiting generalization to new tasks or environments. In this paper, we propose InDRiVE (Intrinsic Disagreement based Reinforcement for Vehicle Exploration), a method that leverages purely intrinsic, disagreement based rewards within a Dreamer based MBRL framework. By training an ensemble of world models, the agent actively explores high uncertainty regions of environments without any task specific feedback. This approach yields a task agnostic latent representation, allowing for rapid zero shot or few shot fine tuning on downstream driving tasks such as lane following and collision avoidance. Experimental results in both seen and unseen environments demonstrate that InDRiVE achieves higher success rates and fewer infractions compared to DreamerV2 and DreamerV3 baselines despite using significantly fewer training steps. Our findings highlight the effectiveness of purely intrinsic exploration for learning robust vehicle control behaviors, paving the way for more scalable and adaptable autonomous driving systems.
[AI-5] Compliance of AI Systems
链接: https://arxiv.org/abs/2503.05571
作者: Julius Schöning,Niklas Kruse
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注: 5 pages, 3 figures
点击查看摘要
Abstract:The increasing integration of artificial intelligence (AI) systems in various fields requires solid concepts to ensure compliance with upcoming legislation. This paper systematically examines the compliance of AI systems with relevant legislation, focusing on the EU’s AI Act and the compliance of data sets. The analysis highlighted many challenges associated with edge devices, which are increasingly being used to deploy AI applications closer and closer to the data sources. Such devices often face unique issues due to their decentralized nature and limited computing resources for implementing sophisticated compliance mechanisms. By analyzing AI implementations, the paper identifies challenges and proposes the first best practices for legal compliance when developing, deploying, and running AI. The importance of data set compliance is highlighted as a cornerstone for ensuring the trustworthiness, transparency, and explainability of AI systems, which must be aligned with ethical standards set forth in regulatory frameworks such as the AI Act. The insights gained should contribute to the ongoing discourse on the responsible development and deployment of embedded AI systems.
[AI-6] Impoola: The Power of Averag e Pooling for Image-Based Deep Reinforcement Learning
链接: https://arxiv.org/abs/2503.05546
作者: Raphael Trumpp,Ansgar Schäfftlein,Mirco Theile,Marco Caccamo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:As image-based deep reinforcement learning tackles more challenging tasks, increasing model size has become an important factor in improving performance. Recent studies achieved this by focusing on the parameter efficiency of scaled networks, typically using Impala-CNN, a 15-layer ResNet-inspired network, as the image encoder. However, while Impala-CNN evidently outperforms older CNN architectures, potential advancements in network design for deep reinforcement learning-specific image encoders remain largely unexplored. We find that replacing the flattening of output feature maps in Impala-CNN with global average pooling leads to a notable performance improvement. This approach outperforms larger and more complex models in the Procgen Benchmark, particularly in terms of generalization. We call our proposed encoder model Impoola-CNN. A decrease in the network’s translation sensitivity may be central to this improvement, as we observe the most significant gains in games without agent-centered observations. Our results demonstrate that network scaling is not just about increasing model size - efficient network design is also an essential factor.
[AI-7] Grammar-Based Code Representation: Is It a Worthy Pursuit for LLM s?
链接: https://arxiv.org/abs/2503.05507
作者: Qingyuan Liang,Zhao Zhang,Zeyu Sun,Zheng Lin,Qi Luo,Yueyi Xiao,Yizhou Chen,Yuqun Zhang,Haotian Zhang,Lu Zhang,Bin Chen,Yingfei Xiong
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Grammar serves as a cornerstone in programming languages and software engineering, providing frameworks to define the syntactic space and program structure. Existing research demonstrates the effectiveness of grammar-based code representations in small-scale models, showing their ability to reduce syntax errors and enhance performance. However, as language models scale to the billion level or beyond, syntax-level errors become rare, making it unclear whether grammar information still provides performance benefits. To explore this, we develop a series of billion-scale GrammarCoder models, incorporating grammar rules in the code generation process. Experiments on HumanEval (+) and MBPP (+) demonstrate a notable improvement in code generation accuracy. Further analysis shows that grammar-based representations enhance LLMs’ ability to discern subtle code differences, reducing semantic errors caused by minor variations. These findings suggest that grammar-based code representations remain valuable even in billion-scale models, not only by maintaining syntax correctness but also by improving semantic differentiation.
[AI-8] Personalized Federated Learning via Learning Dynamic Graphs
链接: https://arxiv.org/abs/2503.05474
作者: Ziran Zhou,Guanyu Gao,Xiaohu Wu,Yan Lyu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Personalized Federated Learning (PFL) aims to train a personalized model for each client that is tailored to its local data distribution, learning fails to perform well on individual clients due to variations in their local data distributions. Most existing PFL methods focus on personalizing the aggregated global model for each client, neglecting the fundamental aspect of federated learning: the regulation of how client models are aggregated. Additionally, almost all of them overlook the graph structure formed by clients in federated learning. In this paper, we propose a novel method, Personalized Federated Learning with Graph Attention Network (pFedGAT), which captures the latent graph structure between clients and dynamically determines the importance of other clients for each client, enabling fine-grained control over the aggregation process. We evaluate pFedGAT across multiple data distribution scenarios, comparing it with twelve state of the art methods on three datasets: Fashion MNIST, CIFAR-10, and CIFAR-100, and find that it consistently performs well.
[AI-9] he Society of HiveMind: Multi-Agent Optimization of Foundation Model Swarms to Unlock the Potential of Collective Intelligence
链接: https://arxiv.org/abs/2503.05473
作者: Noah Mamie,Susie Xi Rao
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注: 11 pages (excl. appendix)
点击查看摘要
Abstract:Multi-agent systems address issues of accessibility and scalability of artificial intelligence (AI) foundation models, which are often represented by large language models. We develop a framework - the “Society of HiveMind” (SOHM) - that orchestrates the interaction between multiple AI foundation models, imitating the observed behavior of animal swarms in nature by following modern evolutionary theories. On the one hand, we find that the SOHM provides a negligible benefit on tasks that mainly require real-world knowledge. On the other hand, we remark a significant improvement on tasks that require intensive logical reasoning, indicating that multi-agent systems are capable of increasing the reasoning capabilities of the collective compared to the individual agents. Our findings demonstrate the potential of combining a multitude of diverse AI foundation models to form an artificial swarm intelligence capable of self-improvement through interactions with a given environment.
[AI-10] Controllable Complementarity: Subjective Preferences in Human-AI Collaboration
链接: https://arxiv.org/abs/2503.05455
作者: Chase McDonald,Cleotilde Gonzalez
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 9 pages, 4 figures
点击查看摘要
Abstract:Research on human-AI collaboration often prioritizes objective performance. However, understanding human subjective preferences is essential to improving human-AI complementarity and human experiences. We investigate human preferences for controllability in a shared workspace task with AI partners using Behavior Shaping (BS), a reinforcement learning algorithm that allows humans explicit control over AI behavior. In one experiment, we validate the robustness of BS in producing effective AI policies relative to self-play policies, when controls are hidden. In another experiment, we enable human control, showing that participants perceive AI partners as more effective and enjoyable when they can directly dictate AI behavior. Our findings highlight the need to design AI that prioritizes both task performance and subjective human preferences. By aligning AI behavior with human preferences, we demonstrate how human-AI complementarity can extend beyond objective outcomes to include subjective preferences. Comments: 9 pages, 4 figures Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA) Cite as: arXiv:2503.05455 [cs.HC] (or arXiv:2503.05455v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2503.05455 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-11] Soft Policy Optimization: Online Off-Policy RL for Sequence Models
链接: https://arxiv.org/abs/2503.05453
作者: Taco Cohen,David W. Zhang,Kunhao Zheng,Yunhao Tang,Remi Munos,Gabriel Synnaeve
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:RL-based post-training of language models is almost exclusively done using on-policy methods such as PPO. These methods cannot learn from arbitrary sequences such as those produced earlier in training, in earlier runs, by human experts or other policies, or by decoding and exploration methods. This results in severe sample inefficiency and exploration difficulties, as well as a potential loss of diversity in the policy responses. Moreover, asynchronous PPO implementations require frequent and costly model transfers, and typically use value models which require a large amount of memory. In this paper we introduce Soft Policy Optimization (SPO), a simple, scalable and principled Soft RL method for sequence model policies that can learn from arbitrary online and offline trajectories and does not require a separate value model. In experiments on code contests, we shows that SPO outperforms PPO on pass@10, is significantly faster and more memory efficient, is able to benefit from off-policy data, enjoys improved stability, and learns more diverse (i.e. soft) policies.
[AI-12] LLM -based Iterative Approach to Metamodeling in Automotive
链接: https://arxiv.org/abs/2503.05449
作者: Nenad Petrovic,Fengjunjie Pan,Vahid Zolfaghari,Alois Knoll
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In this paper, we introduce an automated approach to domain-specific metamodel construction relying on Large Language Model (LLM). The main focus is adoption in automotive domain. As outcome, a prototype was implemented as web service using Python programming language, while OpenAI’s GPT-4o was used as the underlying LLM. Based on the initial experiments, this approach successfully constructs Ecore metamodel based on set of automotive requirements and visualizes it making use of PlantUML notation, so human experts can provide feedback in order to refine the result. Finally, locally deployable solution is also considered, including the limitations and additional steps required.
[AI-13] Static Program Analysis Guided LLM Based Unit Test Generation
链接: https://arxiv.org/abs/2503.05394
作者: Sujoy Roychowdhury,Giriprasad Sridhara,A K Raghavan,Joy Bose,Sourav Mazumdar,Hamender Singh,Srinivasan Bajji Sugumaran,Ricardo Britto
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We describe a novel approach to automating unit test generation for Java methods using large language models (LLMs). Existing LLM-based approaches rely on sample usage(s) of the method to test (focal method) and/or provide the entire class of the focal method as input prompt and context. The former approach is often not viable due to the lack of sample usages, especially for newly written focal methods. The latter approach does not scale well enough; the bigger the complexity of the focal method and larger associated class, the harder it is to produce adequate test code (due to factors such as exceeding the prompt and context lengths of the underlying LLM). We show that augmenting prompts with \emphconcise and \emphprecise context information obtained by program analysis %of the focal method increases the effectiveness of generating unit test code through LLMs. We validate our approach on a large commercial Java project and a popular open-source Java project.
[AI-14] Ontology Generation using Large Language Models
链接: https://arxiv.org/abs/2503.05388
作者: Anna Sofia Lippolis,Mohammad Javad Saeedizade,Robin Keskisärkkä,Sara Zuppiroli,Miguel Ceriani,Aldo Gangemi,Eva Blomqvist,Andrea Giovanni Nuzzolese
类目: Artificial Intelligence (cs.AI)
*备注: 2 figures and 3 tables. 20 pages
点击查看摘要
Abstract:The ontology engineering process is complex, time-consuming, and error-prone, even for experienced ontology engineers. In this work, we investigate the potential of Large Language Models (LLMs) to provide effective OWL ontology drafts directly from ontological requirements described using user stories and competency questions. Our main contribution is the presentation and evaluation of two new prompting techniques for automated ontology development: Memoryless CQbyCQ and Ontogenia. We also emphasize the importance of three structural criteria for ontology assessment, alongside expert qualitative evaluation, highlighting the need for a multi-dimensional evaluation in order to capture the quality and usability of the generated ontologies. Our experiments, conducted on a benchmark dataset of ten ontologies with 100 distinct CQs and 29 different user stories, compare the performance of three LLMs using the two prompting techniques. The results demonstrate improvements over the current state-of-the-art in LLM-supported ontology engineering. More specifically, the model OpenAI o1-preview with Ontogenia produces ontologies of sufficient quality to meet the requirements of ontology engineers, significantly outperforming novice ontology engineers in modelling ability. However, we still note some common mistakes and variability of result quality, which is important to take into account when using LLMs for ontology authoring support. We discuss these limitations and propose directions for future research.
[AI-15] VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method
链接: https://arxiv.org/abs/2503.05383
作者: Weiyu Ma,Yuqian Fu,Zecheng Zhang,Guohao Li
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: Under Review
点击查看摘要
Abstract:We introduce VLM-Attention, a multimodal StarCraft II environment that aligns artificial agent perception with the human gameplay experience. Traditional frameworks such as SMAC rely on abstract state representations that diverge significantly from human perception, limiting the ecological validity of agent behavior. Our environment addresses this limitation by incorporating RGB visual inputs and natural language observations that more closely simulate human cognitive processes during gameplay. The VLM-Attention framework consists of three integrated components: (1) a vision-language model enhanced with specialized self-attention mechanisms for strategic unit targeting and battlefield assessment, (2) a retrieval-augmented generation system that leverages domain-specific StarCraft II knowledge to inform tactical decisions, and (3) a dynamic role-based task distribution system that enables coordinated multi-agent behavior. Our experimental evaluation across 21 custom scenarios demonstrates that VLM-based agents powered by foundation models (specifically Qwen-VL and GPT-4o) can execute complex tactical maneuvers without explicit training, achieving comparable performance to traditional MARL methods that require substantial training iterations. This work establishes a foundation for developing human-aligned StarCraft II agents and advances the broader research agenda of multimodal game AI. Our implementation is available at this https URL.
[AI-16] On the Logical Content of Logic Programs
链接: https://arxiv.org/abs/2503.05355
作者: Alexader V. Gheorghiu
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Logic programming (LP) is typically understood through operational semantics (e.g., SLD-resolution) or model-theoretic interpretations (e.g., the least Herbrand model). This paper introduces a novel perspective on LP by defining a support'' relation that explicates what a program
knows’'. This interpretation is shown to express classical and intuitionistic logic, as well as an intermediate logic, depending on certain choices regarding LP and the meanings of disjunction and negation. These results are formalized using the idea of base-extension semantics within proof-theoretic semantics. Our approach offers new insights into the logical foundations of LP and has potential applications in knowledge representation, automated reasoning, and formal verification.
[AI-17] Spatial Distillation based Distribution Alignment (SDDA) for Cross-Headset EEG Classification
链接: https://arxiv.org/abs/2503.05349
作者: Dingkun Liu,Siyang Li,Ziwei Wang,Wei Li,Dongrui Wu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 10 pages, 5 figures
点击查看摘要
Abstract:A non-invasive brain-computer interface (BCI) enables direct interaction between the user and external devices, typically via electroencephalogram (EEG) signals. However, decoding EEG signals across different headsets remains a significant challenge due to differences in the number and locations of the electrodes. To address this challenge, we propose a spatial distillation based distribution alignment (SDDA) approach for heterogeneous cross-headset transfer in non-invasive BCIs. SDDA uses first spatial distillation to make use of the full set of electrodes, and then input/feature/output space distribution alignments to cope with the significant differences between the source and target domains. To our knowledge, this is the first work to use knowledge distillation in cross-headset transfers. Extensive experiments on six EEG datasets from two BCI paradigms demonstrated that SDDA achieved superior performance in both offline unsupervised domain adaptation and online supervised domain adaptation scenarios, consistently outperforming 10 classical and state-of-the-art transfer learning algorithms.
[AI-18] oward an Evaluation Science for Generative AI Systems
链接: https://arxiv.org/abs/2503.05336
作者: Laura Weidinger,Deb Raji,Hanna Wallach,Margaret Mitchell,Angelina Wang,Olawale Salaudeen,Rishi Bommasani,Sayash Kapoor,Deep Ganguli,Sanmi Koyejo,William Isaac
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: First two authors contributed equally to this work
点击查看摘要
Abstract:There is an increasing imperative to anticipate and understand the performance and safety of generative AI systems in real-world deployment contexts. However, the current evaluation ecosystem is insufficient: Commonly used static benchmarks face validity challenges, and ad hoc case-by-case audits rarely scale. In this piece, we advocate for maturing an evaluation science for generative AI systems. While generative AI creates unique challenges for system safety engineering and measurement science, the field can draw valuable insights from the development of safety evaluation practices in other fields, including transportation, aerospace, and pharmaceutical engineering. In particular, we present three key lessons: Evaluation metrics must be applicable to real-world performance, metrics must be iteratively refined, and evaluation institutions and norms must be established. Applying these insights, we outline a concrete path toward a more rigorous approach for evaluating generative AI systems.
[AI-19] Disentangling Task Interference within Neurons: Model Merging in Alignment with Neuronal Mechanisms
链接: https://arxiv.org/abs/2503.05320
作者: Zitao Fang,Guodong DU,Shuyang Yu,Yifei Guo,Yiwei Zhang,Jing Li,Ho-Kin Tang,Sim Kuan Goh
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Fine-tuning pre-trained models on targeted datasets enhances task-specific performance but often comes at the expense of generalization. Model merging techniques, which integrate multiple fine-tuned models into a single multi-task model through task arithmetic at various levels: model, layer, or parameter, offer a promising solution. However, task interference remains a fundamental challenge, leading to performance degradation and suboptimal merged models. Existing approaches largely overlook the fundamental role of individual neurons and their connectivity, resulting in a lack of interpretability in both the merging process and the merged models. In this work, we present the first study on the impact of neuronal alignment in model merging. We decompose task-specific representations into two complementary neuronal subspaces that regulate neuron sensitivity and input adaptability. Leveraging this decomposition, we introduce NeuroMerging, a novel merging framework developed to mitigate task interference within neuronal subspaces, enabling training-free model fusion across diverse tasks. Through extensive experiments, we demonstrate that NeuroMerging achieves superior performance compared to existing methods on multi-task benchmarks across both vision and natural language domains. Our findings highlight the importance of aligning neuronal mechanisms in model merging, offering new insights into mitigating task interference and improving knowledge fusion.
[AI-20] Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning
链接: https://arxiv.org/abs/2503.05306
作者: Hyungkyu Kang,Min-hwan Oh
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In this paper, we study offline preference-based reinforcement learning (PbRL), where learning is based on pre-collected preference feedback over pairs of trajectories. While offline PbRL has demonstrated remarkable empirical success, existing theoretical approaches face challenges in ensuring conservatism under uncertainty, requiring computationally intractable confidence set constructions. We address this limitation by proposing Adversarial Preference-based Policy Optimization (APPO), a computationally efficient algorithm for offline PbRL that guarantees sample complexity bounds without relying on explicit confidence sets. By framing PbRL as a two-player game between a policy and a model, our approach enforces conservatism in a tractable manner. Using standard assumptions on function approximation and bounded trajectory concentrability, we derive a sample complexity bound. To our knowledge, APPO is the first offline PbRL algorithm to offer both statistical efficiency and practical applicability. Experimental results on continuous control tasks demonstrate that APPO effectively learns from complex datasets, showing comparable performance with existing state-of-the-art methods.
[AI-21] Evidential Uncertainty Estimation for Multi-Modal Trajectory Prediction
链接: https://arxiv.org/abs/2503.05274
作者: Sajad Marvi,Christoph Rist,Julian Schmidt,Julian Jordan,Abhinav Valada
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Accurate trajectory prediction is crucial for autonomous driving, yet uncertainty in agent behavior and perception noise makes it inherently challenging. While multi-modal trajectory prediction models generate multiple plausible future paths with associated probabilities, effectively quantifying uncertainty remains an open problem. In this work, we propose a novel multi-modal trajectory prediction approach based on evidential deep learning that estimates both positional and mode probability uncertainty in real time. Our approach leverages a Normal Inverse Gamma distribution for positional uncertainty and a Dirichlet distribution for mode uncertainty. Unlike sampling-based methods, it infers both types of uncertainty in a single forward pass, significantly improving efficiency. Additionally, we experimented with uncertainty-driven importance sampling to improve training efficiency by prioritizing underrepresented high-uncertainty samples over redundant ones. We perform extensive evaluations of our method on the Argoverse 1 and Argoverse 2 datasets, demonstrating that it provides reliable uncertainty estimates while maintaining high trajectory prediction accuracy.
[AI-22] Jailbreaking is (Mostly) Simpler Than You Think
链接: https://arxiv.org/abs/2503.05264
作者: Mark Russinovich,Ahmed Salem
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We introduce the Context Compliance Attack (CCA), a novel, optimization-free method for bypassing AI safety mechanisms. Unlike current approaches – which rely on complex prompt engineering and computationally intensive optimization – CCA exploits a fundamental architectural vulnerability inherent in many deployed AI systems. By subtly manipulating conversation history, CCA convinces the model to comply with a fabricated dialogue context, thereby triggering restricted behavior. Our evaluation across a diverse set of open-source and proprietary models demonstrates that this simple attack can circumvent state-of-the-art safety protocols. We discuss the implications of these findings and propose practical mitigation strategies to fortify AI systems against such elementary yet effective adversarial tactics.
[AI-23] A Map-free Deep Learning-based Framework for Gate-to-Gate Monocular Visual Navigation aboard Miniaturized Aerial Vehicles
链接: https://arxiv.org/abs/2503.05251
作者: Lorenzo Scarciglia,Antonio Paolillo,Daniele Palossi
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: \c{opyright}2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
点击查看摘要
Abstract:Palm-sized autonomous nano-drones, i.e., sub-50g in weight, recently entered the drone racing scenario, where they are tasked to avoid obstacles and navigate as fast as possible through gates. However, in contrast with their bigger counterparts, i.e., kg-scale drones, nano-drones expose three orders of magnitude less onboard memory and compute power, demanding more efficient and lightweight vision-based pipelines to win the race. This work presents a map-free vision-based (using only a monocular camera) autonomous nano-drone that combines a real-time deep learning gate detection front-end with a classic yet elegant and effective visual servoing control back-end, only relying on onboard resources. Starting from two state-of-the-art tiny deep learning models, we adapt them for our specific task, and after a mixed simulator-real-world training, we integrate and deploy them aboard our nano-drone. Our best-performing pipeline costs of only 24M multiply-accumulate operations per frame, resulting in a closed-loop control performance of 30 Hz, while achieving a gate detection root mean square error of 1.4 pixels, on our ~20k real-world image dataset. In-field experiments highlight the capability of our nano-drone to successfully navigate through 15 gates in 4 min, never crashing and covering a total travel distance of ~100m, with a peak flight speed of 1.9 m/s. Finally, to stress the generalization capability of our system, we also test it in a never-seen-before environment, where it navigates through gates for more than 4 min.
[AI-24] Robust Conformal Prediction with a Single Binary Certificate ICLR2025
链接: https://arxiv.org/abs/2503.05239
作者: Soroush H. Zargarbashi,Aleksandar Bojchevski
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published as a conference paper at ICLR 2025
点击查看摘要
Abstract:Conformal prediction (CP) converts any model’s output to prediction sets with a guarantee to cover the true label with (adjustable) high probability. Robust CP extends this guarantee to worst-case (adversarial) inputs. Existing baselines achieve robustness by bounding randomly smoothed conformity scores. In practice, they need expensive Monte-Carlo (MC) sampling (e.g. \sim10^4 samples per point) to maintain an acceptable set size. We propose a robust conformal prediction that produces smaller sets even with significantly lower MC samples (e.g. 150 for CIFAR10). Our approach binarizes samples with an adjustable (or automatically adjusted) threshold selected to preserve the coverage guarantee. Remarkably, we prove that robustness can be achieved by computing only one binary certificate, unlike previous methods that certify each calibration (or test) point. Thus, our method is faster and returns smaller robust sets. We also eliminate a previous limitation that requires a bounded score function.
[AI-25] Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction
链接: https://arxiv.org/abs/2503.05231
作者: Shuo Jiang,Haonan Li,Ruochen Ren,Yanmin Zhou,Zhipeng Wang,Bin He
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Cutting-edge robot learning techniques including foundation models and imitation learning from humans all pose huge demands on large-scale and high-quality datasets which constitute one of the bottleneck in the general intelligent robot fields. This paper presents the Kaiwu multimodal dataset to address the missing real-world synchronized multimodal data problems in the sophisticated assembling scenario,especially with dynamics information and its fine-grained labelling. The dataset first provides an integration of human,environment and robot data collection framework with 20 subjects and 30 interaction objects resulting in totally 11,664 instances of integrated actions. For each of the demonstration,hand motions,operation pressures,sounds of the assembling process,multi-view videos, high-precision motion capture information,eye gaze with first-person videos,electromyography signals are all recorded. Fine-grained multi-level annotation based on absolute timestamp,and semantic segmentation labelling are performed. Kaiwu dataset aims to facilitate robot learning,dexterous manipulation,human intention investigation and human-robot collaboration research.
[AI-26] Discrete Contrastive Learning for Diffusion Policies in Autonomous Driving
链接: https://arxiv.org/abs/2503.05229
作者: Kalle Kujanpää,Daulet Baimukashev,Farzeen Munir,Shoaib Azam,Tomasz Piotr Kucner,Joni Pajarinen,Ville Kyrki
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Learning to perform accurate and rich simulations of human driving behaviors from data for autonomous vehicle testing remains challenging due to human driving styles’ high diversity and variance. We address this challenge by proposing a novel approach that leverages contrastive learning to extract a dictionary of driving styles from pre-existing human driving data. We discretize these styles with quantization, and the styles are used to learn a conditional diffusion policy for simulating human drivers. Our empirical evaluation confirms that the behaviors generated by our approach are both safer and more human-like than those of the machine-learning-based baseline methods. We believe this has the potential to enable higher realism and more effective techniques for evaluating and improving the performance of autonomous vehicles.
[AI-27] MOHPER: Multi-objective Hyperparameter Optimization Framework for E-commerce Retrieval System
链接: https://arxiv.org/abs/2503.05227
作者: Jungbae Park,Heonseok Jang
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:E-commerce search optimization has evolved to include a wider range of metrics that reflect user engagement and business objectives. Modern search frameworks now incorporate advanced quality features, such as sales counts and document-query relevance, to better align search results with these goals. Traditional methods typically focus on click-through rate (CTR) as a measure of engagement or relevance, but this can miss true purchase intent, creating a gap between user interest and actual conversions. Joint training with the click-through conversion rate (CTCVR) has become essential for understanding buying behavior, although its sparsity poses challenges for reliable optimization. This study presents MOHPER, a Multi-Objective Hyperparameter Optimization framework for E-commerce Retrieval systems. Utilizing Bayesian optimization and sampling, it jointly optimizes both CTR, CTCVR, and relevant objectives, focusing on engagement and conversion of the users. In addition, to improve the selection of the best configuration from multi-objective optimization, we suggest advanced methods for hyperparameter selection, including a meta-configuration voting strategy and a cumulative training approach that leverages prior optimal configurations, to improve speeds of training and efficiency. Currently deployed in a live setting, our proposed framework substantiates its practical efficacy in achieving a balanced optimization that aligns with both user satisfaction and revenue goals.
[AI-28] Reward-Centered ReST-MCTS: A Robust Decision-Making Framework for Robotic Manipulation in High Uncertainty Environments
链接: https://arxiv.org/abs/2503.05226
作者: Xibai Wang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Monte Carlo Tree Search (MCTS) has emerged as a powerful tool for decision-making in robotics, enabling efficient exploration of large search spaces. However, traditional MCTS methods struggle in environments characterized by high uncertainty and noisy data due to their reliance on final-step reward evaluation. The lack of intermediate feedback during search often results in suboptimal decision-making and computational inefficiencies. This paper introduces Reward-Centered ReST-MCTS, a novel framework that enhances MCTS by incorporating intermediate reward shaping. The core of our approach is the Rewarding Center, which refines search trajectories by dynamically assigning partial rewards using rule-based validation, heuristic guidance, and neural estimation. By integrating these mechanisms, our method enables real-time optimization of search paths, mitigating the effects of error propagation. We evaluate Reward-Centered ReST-MCTS in robotic manipulation tasks under high uncertainty, demonstrating consistent improvements in decision accuracy. Compared to baseline methods, including Chain-of-Thought (CoT) prompting and Vanilla ReST-MCTS, our framework achieves a 2-4% accuracy improvement while maintaining computational feasibility. Ablation studies confirm the effectiveness of intermediate feedback in search refinement, particularly in pruning incorrect decision paths early. Furthermore, robustness tests show that our method retains high performance across varying levels of uncertainty. Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI) Cite as: arXiv:2503.05226 [cs.RO] (or arXiv:2503.05226v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2503.05226 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-29] Deep Sequence Models for Predicting Averag e Shear Wave Velocity from Strong Motion Records
链接: https://arxiv.org/abs/2503.05224
作者: Baris Yilmaz,Erdem Akagündüz,Salih Tileylioglu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This study explores the use of deep learning for predicting the time averaged shear wave velocity in the top 30 m of the subsurface ( V_s30 ) at strong motion recording stations in Türkiye. V_s30 is a key parameter in site characterization and, as a result for seismic hazard assessment. However, it is often unavailable due to the lack of direct measurements and is therefore estimated using empirical correlations. Such correlations however are commonly inadequate in capturing complex, site-specific variability and this motivates the need for data-driven approaches. In this study, we employ a hybrid deep learning model combining convolutional neural networks (CNNs) and long short-term memory (LSTM) networks to capture both spatial and temporal dependencies in strong motion records. Furthermore, we explore how using different parts of the signal influence our deep learning model. Our results suggest that the hybrid approach effectively learns complex, nonlinear relationships within seismic signals. We observed that an improved P-wave arrival time model increased the prediction accuracy of V_s30 . We believe the study provides valuable insights into improving V_s30 predictions using a CNN-LSTM framework, demonstrating its potential for improving site characterization for seismic studies. Our codes are available via this repo: this https URL
[AI-30] Policy Constraint by Only Support Constraint for Offline Reinforcement Learning
链接: https://arxiv.org/abs/2503.05207
作者: Yunkai Gao,Jiaming Guo,Fan Wu,Rui Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Offline reinforcement learning (RL) aims to optimize a policy by using pre-collected datasets, to maximize cumulative rewards. However, offline reinforcement learning suffers challenges due to the distributional shift between the learned and behavior policies, leading to errors when computing Q-values for out-of-distribution (OOD) actions. To mitigate this issue, policy constraint methods aim to constrain the learned policy’s distribution with the distribution of the behavior policy or confine action selection within the support of the behavior policy. However, current policy constraint methods tend to exhibit excessive conservatism, hindering the policy from further surpassing the behavior policy’s performance. In this work, we present Only Support Constraint (OSC) which is derived from maximizing the total probability of learned policy in the support of behavior policy, to address the conservatism of policy constraint. OSC presents a regularization term that only restricts policies to the support without imposing extra constraints on actions within the support. Additionally, to fully harness the performance of the new policy constraints, OSC utilizes a diffusion model to effectively characterize the support of behavior policies. Experimental evaluations across a variety of offline RL benchmarks demonstrate that OSC significantly enhances performance, alleviating the challenges associated with distributional shifts and mitigating conservatism of policy constraints. Code is available at this https URL.
[AI-31] Deep Muscle EMG construction using A Physics-Integrated Deep Learning approach
链接: https://arxiv.org/abs/2503.05201
作者: Rajnish Kumar,Tapas Tripura,Souvik Chakraborty,Sitikantha Roy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC)
*备注:
点击查看摘要
Abstract:Electromyography (EMG)–based computational musculoskeletal modeling is a non-invasive method for studying musculotendon function, human movement, and neuromuscular control, providing estimates of internal variables like muscle forces and joint torques. However, EMG signals from deeper muscles are often challenging to measure by placing the surface EMG electrodes and unfeasible to measure directly using invasive methods. The restriction to the access of EMG data from deeper muscles poses a considerable obstacle to the broad adoption of EMG-driven modeling techniques. A strategic alternative is to use an estimation algorithm to approximate the missing EMG signals from deeper muscle. A similar strategy is used in physics-informed deep learning, where the features of physical systems are learned without labeled data. In this work, we propose a hybrid deep learning algorithm, namely the neural musculoskeletal model (NMM), that integrates physics-informed and data-driven deep learning to approximate the EMG signals from the deeper muscles. While data-driven modeling is used to predict the missing EMG signals, physics-based modeling engraves the subject-specific information into the predictions. Experimental verifications on five test subjects are carried out to investigate the performance of the proposed hybrid framework. The proposed NMM is validated against the joint torque computed from ‘OpenSim’ software. The predicted deep EMG signals are also compared against the state-of-the-art muscle synergy extrapolation (MSE) approach, where the proposed NMM completely outperforms the existing MSE framework by a significant margin.
[AI-32] Uncertainty-Aware Explainable Federated Learning
链接: https://arxiv.org/abs/2503.05194
作者: Yanci Zhang,Han Yu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:Federated Learning (FL) is a collaborative machine learning paradigm for enhancing data privacy preservation. Its privacy-preserving nature complicates the explanation of the decision-making processes and the evaluation of the reliability of the generated explanations. In this paper, we propose the Uncertainty-aware eXplainable Federated Learning (UncertainXFL) to address these challenges. It generates explanations for decision-making processes under FL settings and provides information regarding the uncertainty of these explanations. UncertainXFL is the first framework to explicitly offer uncertainty evaluation for explanations within the FL context. Explanatory information is initially generated by the FL clients and then aggregated by the server in a comprehensive and conflict-free manner during FL training. The quality of the explanations, including the uncertainty score and tested validity, guides the FL training process by prioritizing clients with the most reliable explanations through higher weights during model aggregation. Extensive experimental evaluation results demonstrate that UncertainXFL achieves superior model accuracy and explanation accuracy, surpassing the current state-of-the-art model that does not incorporate uncertainty information by 2.71% and 1.77%, respectively. By integrating and quantifying uncertainty in the data into the explanation process, UncertainXFL not only clearly presents the explanation alongside its uncertainty, but also leverages this uncertainty to guide the FL training process, thereby enhancing the robustness and reliability of the resulting models.
[AI-33] A Comprehensive LLM -powered Framework for Driving Intelligence Evaluation
链接: https://arxiv.org/abs/2503.05164
作者: Shanhe You,Xuewen Luo,Xinhe Liang,Jiashu Yu,Chen Zheng,Jiangtao Gong
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 8 pages, 3 figures
点击查看摘要
Abstract:Evaluation methods for autonomous driving are crucial for algorithm optimization. However, due to the complexity of driving intelligence, there is currently no comprehensive evaluation method for the level of autonomous driving intelligence. In this paper, we propose an evaluation framework for driving behavior intelligence in complex traffic environments, aiming to fill this gap. We constructed a natural language evaluation dataset of human professional drivers and passengers through naturalistic driving experiments and post-driving behavior evaluation interviews. Based on this dataset, we developed an LLM-powered driving evaluation framework. The effectiveness of this framework was validated through simulated experiments in the CARLA urban traffic simulator and further corroborated by human assessment. Our research provides valuable insights for evaluating and designing more intelligent, human-like autonomous driving agents. The implementation details of the framework and detailed information about the dataset can be found at Github.
[AI-34] Generative Trajectory Stitching through Diffusion Composition
链接: https://arxiv.org/abs/2503.05153
作者: Yunhao Luo,Utkarsh A. Mishra,Yilun Du,Danfei Xu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project page: this https URL
点击查看摘要
Abstract:Effective trajectory stitching for long-horizon planning is a significant challenge in robotic decision-making. While diffusion models have shown promise in planning, they are limited to solving tasks similar to those seen in their training data. We propose CompDiffuser, a novel generative approach that can solve new tasks by learning to compositionally stitch together shorter trajectory chunks from previously seen tasks. Our key insight is modeling the trajectory distribution by subdividing it into overlapping chunks and learning their conditional relationships through a single bidirectional diffusion model. This allows information to propagate between segments during generation, ensuring physically consistent connections. We conduct experiments on benchmark tasks of various difficulties, covering different environment sizes, agent state dimension, trajectory types, training data quality, and show that CompDiffuser significantly outperforms existing methods.
[AI-35] FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data
链接: https://arxiv.org/abs/2503.05143
作者: Wenhao Wang,Zijie Yu,Rui Ye,Jianqing Zhang,Siheng Chen,Yanfeng Wang
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Mobile agents have attracted tremendous research participation recently. Traditional approaches to mobile agent training rely on centralized data collection, leading to high cost and limited scalability. Distributed training utilizing federated learning offers an alternative by harnessing real-world user data, providing scalability and reducing costs. However, pivotal challenges, including the absence of standardized benchmarks, hinder progress in this field. To tackle the challenges, we introduce FedMABench, the first benchmark for federated training and evaluation of mobile agents, specifically designed for heterogeneous scenarios. FedMABench features 6 datasets with 30+ subsets, 8 federated algorithms, 10+ base models, and over 800 apps across 5 categories, providing a comprehensive framework for evaluating mobile agents across diverse environments. Through extensive experiments, we uncover several key insights: federated algorithms consistently outperform local training; the distribution of specific apps plays a crucial role in heterogeneity; and, even apps from distinct categories can exhibit correlations during training. FedMABench is publicly available at: this https URL with the datasets at: this https URL. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2503.05143 [cs.AI] (or arXiv:2503.05143v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2503.05143 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-36] Multi-Task Reinforcement Learning Enables Parameter Scaling
链接: https://arxiv.org/abs/2503.05126
作者: Reginald McLean,Evangelos Chataroulas,Jordan Terry,Isaac Woungang,Nariman Farsad,Pablo Samuel Castro
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Multi-task reinforcement learning (MTRL) aims to endow a single agent with the ability to perform well on multiple tasks. Recent works have focused on developing novel sophisticated architectures to improve performance, often resulting in larger models; it is unclear, however, whether the performance gains are a consequence of the architecture design itself or the extra parameters. We argue that gains are mostly due to scale by demonstrating that naively scaling up a simple MTRL baseline to match parameter counts outperforms the more sophisticated architectures, and these gains benefit most from scaling the critic over the actor. Additionally, we explore the training stability advantages that come with task diversity, demonstrating that increasing the number of tasks can help mitigate plasticity loss. Our findings suggest that MTRL’s simultaneous training across multiple tasks provides a natural framework for beneficial parameter scaling in reinforcement learning, challenging the need for complex architectural innovations.
[AI-37] Look Before You Leap: Using Serialized State Machine for Language Conditioned Robotic Manipulation
链接: https://arxiv.org/abs/2503.05114
作者: Tong Mu,Yihao Liu,Mehran Armand
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 7 pages, 4 figures
点击查看摘要
Abstract:Imitation learning frameworks for robotic manipulation have drawn attention in the recent development of language model grounded robotics. However, the success of the frameworks largely depends on the coverage of the demonstration cases: When the demonstration set does not include examples of how to act in all possible situations, the action may fail and can result in cascading errors. To solve this problem, we propose a framework that uses serialized Finite State Machine (FSM) to generate demonstrations and improve the success rate in manipulation tasks requiring a long sequence of precise interactions. To validate its effectiveness, we use environmentally evolving and long-horizon puzzles that require long sequential actions. Experimental results show that our approach achieves a success rate of up to 98 in these tasks, compared to the controlled condition using existing approaches, which only had a success rate of up to 60, and, in some tasks, almost failed completely.
[AI-38] S-LIF: A Temporal Segment Spiking Neuron Network for Time Series Forecasting
链接: https://arxiv.org/abs/2503.05108
作者: Shibo Feng,Wanjin Feng,Xingyu Gao,Peilin Zhao,Zhiqi Shen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Spiking Neural Networks (SNNs) offer a promising, biologically inspired approach for processing spatiotemporal data, particularly for time series forecasting. However, conventional neuron models like the Leaky Integrate-and-Fire (LIF) struggle to capture long-term dependencies and effectively process multi-scale temporal dynamics. To overcome these limitations, we introduce the Temporal Segment Leaky Integrate-and-Fire (TS-LIF) model, featuring a novel dual-compartment architecture. The dendritic and somatic compartments specialize in capturing distinct frequency components, providing functional heterogeneity that enhances the neuron’s ability to process both low- and high-frequency information. Furthermore, the newly introduced direct somatic current injection reduces information loss during intra-neuronal transmission, while dendritic spike generation improves multi-scale information extraction. We provide a theoretical stability analysis of the TS-LIF model and explain how each compartment contributes to distinct frequency response characteristics. Experimental results show that TS-LIF outperforms traditional SNNs in time series forecasting, demonstrating better accuracy and robustness, even with missing data. TS-LIF advances the application of SNNs in time-series forecasting, providing a biologically inspired approach that captures complex temporal dynamics and offers potential for practical implementation in diverse forecasting scenarios. The source code is available at this https URL.
[AI-39] Grouped Sequential Optimization Strategy – the Application of Hyperparameter Importance Assessment in Deep Learning
链接: https://arxiv.org/abs/2503.05106
作者: Ruinan Wang,Ian Nabney,Mohammad Golbabaee
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages
点击查看摘要
Abstract:Hyperparameter optimization (HPO) is a critical component of machine learning pipelines, significantly affecting model robustness, stability, and generalization. However, HPO is often a time-consuming and computationally intensive task. Traditional HPO methods, such as grid search and random search, often suffer from inefficiency. Bayesian optimization, while more efficient, still struggles with high-dimensional search spaces. In this paper, we contribute to the field by exploring how insights gained from hyperparameter importance assessment (HIA) can be leveraged to accelerate HPO, reducing both time and computational resources. Building on prior work that quantified hyperparameter importance by evaluating 10 hyperparameters on CNNs using 10 common image classification datasets, we implement a novel HPO strategy called ‘Sequential Grouping.’ That prior work assessed the importance weights of the investigated hyperparameters based on their influence on model performance, providing valuable insights that we leverage to optimize our HPO process. Our experiments, validated across six additional image classification datasets, demonstrate that incorporating hyperparameter importance assessment (HIA) can significantly accelerate HPO without compromising model performance, reducing optimization time by an average of 31.9% compared to the conventional simultaneous strategy.
[AI-40] Multi-Robot Collaboration through Reinforcement Learning and Abstract Simulation ICRA2025
链接: https://arxiv.org/abs/2503.05092
作者: Adam Labiosa,Josiah P. Hanna
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: ICRA 2025
点击查看摘要
Abstract:Teams of people coordinate to perform complex tasks by forming abstract mental models of world and agent dynamics. The use of abstract models contrasts with much recent work in robot learning that uses a high-fidelity simulator and reinforcement learning (RL) to obtain policies for physical robots. Motivated by this difference, we investigate the extent to which so-called abstract simulators can be used for multi-agent reinforcement learning (MARL) and the resulting policies successfully deployed on teams of physical robots. An abstract simulator models the robot’s target task at a high-level of abstraction and discards many details of the world that could impact optimal decision-making. Policies are trained in an abstract simulator then transferred to the physical robot by making use of separately-obtained low-level perception and motion control modules. We identify three key categories of modifications to the abstract simulator that enable policy transfer to physical robots: simulation fidelity enhancements, training optimizations and simulation stochasticity. We then run an empirical study with extensive ablations to determine the value of each modification category for enabling policy transfer in cooperative robot soccer tasks. We also compare the performance of policies produced by our method with a well-tuned non-learning-based behavior architecture from the annual RoboCup competition and find that our approach leads to a similar level of performance. Broadly we show that MARL can be use to train cooperative physical robot behaviors using highly abstract models of the world.
[AI-41] Object Packing and Scheduling for Sequential 3D Printing: a Linear Arithmetic Model and a CEGAR-inspired Optimal Solver
链接: https://arxiv.org/abs/2503.05071
作者: Pavel Surynek,Vojtěch Bubník,Lukáš Matěna,Petr Kubiš
类目: Computational Geometry (cs.CG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We address the problem of object arrangement and scheduling for sequential 3D printing. Unlike the standard 3D printing, where all objects are printed slice by slice at once, in sequential 3D printing, objects are completed one after other. In the sequential case, it is necessary to ensure that the moving parts of the printer do not collide with previously printed objects. We look at the sequential printing problem from the perspective of combinatorial optimization. We propose to express the problem as a linear arithmetic formula, which is then solved using a solver for satisfiability modulo theories (SMT). However, we do not solve the formula expressing the problem of object arrangement and scheduling directly, but we have proposed a technique inspired by counterexample guided abstraction refinement (CEGAR), which turned out to be a key innovation to efficiency.
[AI-42] PromptPex: Automatic Test Generation for Language Model Prompts
链接: https://arxiv.org/abs/2503.05070
作者: Reshabh K Sharma,Jonathan De Halleux,Shraddha Barke,Benjamin Zorn
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) are being used in many applications and prompts for these models are integrated into software applications as code-like artifacts. These prompts behave much like traditional software in that they take inputs, generate outputs, and perform some specific function. However, prompts differ from traditional code in many ways and require new approaches to ensure that they are robust. For example, unlike traditional software the output of a prompt depends on the AI model that interprets it. Also, while natural language prompts are easy to modify, the impact of updates is harder to predict. New approaches to testing, debugging, and modifying prompts with respect to the model running them are required. To address some of these issues, we developed PromptPex, an LLM-based tool to automatically generate and evaluate unit tests for a given prompt. PromptPex extracts input and output specifications from a prompt and uses them to generate diverse, targeted, and valid unit tests. These tests are instrumental in identifying regressions when a prompt is changed and also serve as a tool to understand how prompts are interpreted by different models. We use PromptPex to generate tests for eight benchmark prompts and evaluate the quality of the generated tests by seeing if they can cause each of four diverse models to produce invalid output. PromptPex consistently creates tests that result in more invalid model outputs than a carefully constructed baseline LLM-based test generator. Furthermore, by extracting concrete specifications from the input prompt, PromptPex allows prompt writers to clearly understand and test specific aspects of their prompts. The source code of PromptPex is available at this https URL. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2503.05070 [cs.SE] (or arXiv:2503.05070v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2503.05070 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-43] Perceiving Reasoning Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation
链接: https://arxiv.org/abs/2503.05064
作者: Qingxuan Jia,Guoqin Tang,Zeyuan Huang,Zixuan Hao,Ning Ji,Shihang, Yin,Gang Chen
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) demonstrate remarkable potential in robotic manipulation, yet challenges persist in executing complex fine manipulation tasks with high speed and precision. While excelling at high-level planning, existing VLM methods struggle to guide robots through precise sequences of fine motor actions. To address this limitation, we introduce a progressive VLM planning algorithm that empowers robots to perform fast, precise, and error-correctable fine manipulation. Our method decomposes complex tasks into sub-actions and maintains three key data structures: task memory structure, 2D topology graphs, and 3D spatial networks, achieving high-precision spatial-semantic fusion. These three components collectively accumulate and store critical information throughout task execution, providing rich context for our task-oriented VLM interaction mechanism. This enables VLMs to dynamically adjust guidance based on real-time feedback, generating precise action plans and facilitating step-wise error correction. Experimental validation on complex assembly tasks demonstrates that our algorithm effectively guides robots to rapidly and precisely accomplish fine manipulation in challenging scenarios, significantly advancing robot intelligence for precision tasks.
[AI-44] LLM s Reshaping of People Processes Products and Society in Software Development: A Comprehensive Exploration with Early Adopters
链接: https://arxiv.org/abs/2503.05012
作者: Benyamin Tabarsi,Heidi Reichert,Ally Limke,Sandeep Kuttal,Tiffany Barnes
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) like OpenAI ChatGPT, Google Gemini, and GitHub Copilot are rapidly gaining traction in the software industry, but their full impact on software engineering remains insufficiently explored. Despite their growing adoption, there is a notable lack of formal, qualitative assessments of how LLMs are applied in real-world software development contexts. To fill this gap, we conducted semi-structured interviews with sixteen early-adopter professional developers to explore their use of LLMs throughout various stages of the software development life cycle. Our investigation examines four dimensions: people - how LLMs affect individual developers and teams; process - how LLMs alter software engineering workflows; product - LLM impact on software quality and innovation; and society - the broader socioeconomic and ethical implications of LLM adoption. Thematic analysis of our data reveals that while LLMs have not fundamentally revolutionized the development process, they have substantially enhanced routine coding tasks, including code generation, refactoring, and debugging. Developers reported the most effective outcomes when providing LLMs with clear, well-defined problem statements, indicating that LLMs excel with decomposed problems and specific requirements. Furthermore, these early-adopters identified that LLMs offer significant value for personal and professional development, aiding in learning new languages and concepts. Early-adopters, highly skilled in software engineering and how LLMs work, identified early and persisting challenges for software engineering, such as inaccuracies in generated content and the need for careful manual review before integrating LLM outputs into production environments. Our study provides a nuanced understanding of how LLMs are shaping the landscape of software development, with their benefits, limitations, and ongoing implications.
[AI-45] A Consensus Privacy Metrics Framework for Synthetic Data
链接: https://arxiv.org/abs/2503.04980
作者: Lisa Pilgram,Fida K. Dankar,Jorg Drechsler,Mark Elliot,Josep Domingo-Ferrer,Paul Francis,Murat Kantarcioglu,Linglong Kong,Bradley Malin,Krishnamurty Muralidhar,Puja Myles,Fabian Prasser,Jean Louis Raisaro,Chao Yan,Khaled El Emam
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Synthetic data generation is one approach for sharing individual-level data. However, to meet legislative requirements, it is necessary to demonstrate that the individuals’ privacy is adequately protected. There is no consolidated standard for measuring privacy in synthetic data. Through an expert panel and consensus process, we developed a framework for evaluating privacy in synthetic data. Our findings indicate that current similarity metrics fail to measure identity disclosure, and their use is discouraged. For differentially private synthetic data, a privacy budget other than close to zero was not considered interpretable. There was consensus on the importance of membership and attribute disclosure, both of which involve inferring personal information about an individual without necessarily revealing their identity. The resultant framework provides precise recommendations for metrics that address these types of disclosures effectively. Our findings further present specific opportunities for future research that can help with widespread adoption of synthetic data.
[AI-46] Quantifying the Relevance of Youth Research Cited in the US Policy Documents
链接: https://arxiv.org/abs/2503.04977
作者: Miftahul Jannat Mokarrama,Hamed Alhoori
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: The paper was accepted and presented in IEEE BIG DATA 2024. It has 10 pages, 5 figures, and 4 tables
点击查看摘要
Abstract:In recent years, there has been a growing concern and emphasis on conducting research beyond academic or scientific research communities, benefiting society at large. A well-known approach to measuring the impact of research on society is enumerating its policy citation(s). Despite the importance of research in informing policy, there is no concrete evidence to suggest the research’s relevance in cited policy documents. This is concerning because it may increase the possibility of evidence used in policy being manipulated by individual, social, or political biases that may lead to inappropriate, fragmented, or archaic research evidence in policy. Therefore, it is crucial to identify the degree of relevance between research articles and citing policy documents. In this paper, we examined the scale of contextual relevance of youth-focused research in the referenced US policy documents using natural language processing techniques, state-of-the-art pre-trained Large Language Models (LLMs), and statistical analysis. Our experiments and analysis concluded that youth-related research articles that get US policy citations are mostly relevant to the citing policy documents.
[AI-47] Incentivizing Multi-Tenant Split Federated Learning for Foundation Models at the Network Edge
链接: https://arxiv.org/abs/2503.04971
作者: Songyuan Li,Jia Hu,Geyong Min,Haojun Huang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
*备注: Index Terms: Foundation models, Edge computing, Split federated learning, Multi-tenant system, Incentive mechanism
点击查看摘要
Abstract:Foundation models (FMs) such as GPT-4 exhibit exceptional generative capabilities across diverse downstream tasks through fine-tuning. Split Federated Learning (SFL) facilitates privacy-preserving FM fine-tuning on resource-constrained local devices by offloading partial FM computations to edge servers, enabling device-edge synergistic fine-tuning. Practical edge networks often host multiple SFL tenants to support diversified downstream tasks. However, existing research primarily focuses on single-tenant SFL scenarios, and lacks tailored incentive mechanisms for multi-tenant settings, which are essential to effectively coordinate self-interested local devices for participation in various downstream tasks, ensuring that each SFL tenant’s distinct FM fine-tuning requirements (e.g., FM types, performance targets, and fine-tuning deadlines) are met. To address this gap, we propose a novel Price-Incentive Mechanism (PRINCE) that guides multiple SFL tenants to offer strategic price incentives, which solicit high-quality device participation for efficient FM fine-tuning. Specifically, we first develop a bias-resilient global SFL model aggregation scheme to eliminate model biases caused by independent device participation. We then derive a rigorous SFL convergence bound to evaluate the contributions of heterogeneous devices to FM performance improvements, guiding the incentive strategies of SFL tenants. Furthermore, we model inter-tenant device competition as a congestion game for Stackelberg equilibrium (SE) analysis, deriving each SFL tenant’s optimal incentive strategy. Extensive simulations involving four representative SFL tenant types (ViT, BERT, Whisper, and LLaMA) across diverse data modalities (text, images, and audio) demonstrate that PRINCE accelerates FM fine-tuning by up to 3.07x compared to state-of-the-art approaches, while consistently meeting fine-tuning performance targets.
[AI-48] Data-Efficient Learning from Human Interventions for Mobile Robots ICRA2025
链接: https://arxiv.org/abs/2503.04969
作者: Zhenghao Peng,Zhizheng Liu,Bolei Zhou
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ICRA 2025. Webpage: this https URL
点击查看摘要
Abstract:Mobile robots are essential in applications such as autonomous delivery and hospitality services. Applying learning-based methods to address mobile robot tasks has gained popularity due to its robustness and generalizability. Traditional methods such as Imitation Learning (IL) and Reinforcement Learning (RL) offer adaptability but require large datasets, carefully crafted reward functions, and face sim-to-real gaps, making them challenging for efficient and safe real-world deployment. We propose an online human-in-the-loop learning method PVP4Real that combines IL and RL to address these issues. PVP4Real enables efficient real-time policy learning from online human intervention and demonstration, without reward or any pretraining, significantly improving data efficiency and training safety. We validate our method by training two different robots – a legged quadruped, and a wheeled delivery robot – in two mobile robot tasks, one of which even uses raw RGBD image as observation. The training finishes within 15 minutes. Our experiments show the promising future of human-in-the-loop learning in addressing the data efficiency issue in real-world robotic tasks. More information is available at: this https URL
[AI-49] Energy-Latency Attacks: A New Adversarial Threat to Deep Learning
链接: https://arxiv.org/abs/2503.04963
作者: Hanene F. Z. Brachemi Meftah,Wassim Hamidouche,Sid Ahmed Fezza,Olivier Deforges
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The growing computational demand for deep neural networks ( DNNs) has raised concerns about their energy consumption and carbon footprint, particularly as the size and complexity of the models continue to increase. To address these challenges, energy-efficient hardware and custom accelerators have become essential. Additionally, adaptable DNN s are being developed to dynamically balance performance and efficiency. The use of these strategies became more common to enable sustainable AI deployment. However, these efficiency-focused designs may also introduce vulnerabilities, as attackers can potentially exploit them to increase latency and energy usage by triggering their worst-case-performance scenarios. This new type of attack, called energy-latency attacks, has recently gained significant research attention, focusing on the vulnerability of DNN s to this emerging attack paradigm, which can trigger denial-of-service ( DoS) attacks. This paper provides a comprehensive overview of current research on energy-latency attacks, categorizing them using the established taxonomy for traditional adversarial attacks. We explore different metrics used to measure the success of these attacks and provide an analysis and comparison of existing attack strategies. We also analyze existing defense mechanisms and highlight current challenges and potential areas for future research in this developing field. The GitHub page for this work can be accessed at this https URL
[AI-50] INTENT: Trajectory Prediction Framework with Intention-Guided Contrastive Clustering
链接: https://arxiv.org/abs/2503.04952
作者: Yihong Tang,Wei Ma
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:Accurate trajectory prediction of road agents (e.g., pedestrians, vehicles) is an essential prerequisite for various intelligent systems applications, such as autonomous driving and robotic navigation. Recent research highlights the importance of environmental contexts (e.g., maps) and the “multi-modality” of trajectories, leading to increasingly complex model structures. However, real-world deployments require lightweight models that can quickly migrate and adapt to new environments. Additionally, the core motivations of road agents, referred to as their intentions, deserves further exploration. In this study, we advocate that understanding and reasoning road agents’ intention plays a key role in trajectory prediction tasks, and the main challenge is that the concept of intention is fuzzy and abstract. To this end, we present INTENT, an efficient intention-guided trajectory prediction model that relies solely on information contained in the road agent’s trajectory. Our model distinguishes itself from existing models in several key aspects: (i) We explicitly model road agents’ intentions through contrastive clustering, accommodating the fuzziness and abstraction of human intention in their trajectories. (ii) The proposed INTENT is based solely on multi-layer perceptrons (MLPs), resulting in reduced training and inference time, making it very efficient and more suitable for real-world deployment. (iii) By leveraging estimated intentions and an innovative algorithm for transforming trajectory observations, we obtain more robust trajectory representations that lead to superior prediction accuracy. Extensive experiments on real-world trajectory datasets for pedestrians and autonomous vehicles demonstrate the effectiveness and efficiency of INTENT.
[AI-51] Federated Inverse Probability Treatment Weighting for Individual Treatment Effect Estimation
链接: https://arxiv.org/abs/2503.04946
作者: Changchang Yin,Hong-You Chen,Wei-Lun Chao,Ping Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Individual treatment effect (ITE) estimation is to evaluate the causal effects of treatment strategies on some important outcomes, which is a crucial problem in healthcare. Most existing ITE estimation methods are designed for centralized settings. However, in real-world clinical scenarios, the raw data are usually not shareable among hospitals due to the potential privacy and security risks, which makes the methods not applicable. In this work, we study the ITE estimation task in a federated setting, which allows us to harness the decentralized data from multiple hospitals. Due to the unavoidable confounding bias in the collected data, a model directly learned from it would be inaccurate. One well-known solution is Inverse Probability Treatment Weighting (IPTW), which uses the conditional probability of treatment given the covariates to re-weight each training example. Applying IPTW in a federated setting, however, is non-trivial. We found that even with a well-estimated conditional probability, the local model training step using each hospital’s data alone would still suffer from confounding bias. To address this, we propose FED-IPTW, a novel algorithm to extend IPTW into a federated setting that enforces both global (over all the data) and local (within each hospital) decorrelation between covariates and treatments. We validated our approach on the task of comparing the treatment effects of mechanical ventilation on improving survival probability for patients with breadth difficulties in the intensive care unit (ICU). We conducted experiments on both synthetic and real-world eICU datasets and the results show that FED-IPTW outperform state-of-the-art methods on all the metrics on factual prediction and ITE estimation tasks, paving the way for personalized treatment strategy design in mechanical ventilation usage.
[AI-52] Learning-based GNSS Uncertainty Quantification using Continuous-Time Factor Graph Optimization
链接: https://arxiv.org/abs/2503.04933
作者: Haoming Zhang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: This extended abstract has been accepted to the 1st German Robotic Conference
点击查看摘要
Abstract:This short paper presents research findings on two learning-based methods for quantifying measurement uncertainties in global navigation satellite systems (GNSS). We investigate two learning strategies: offline learning for outlier prediction and online learning for noise distribution approximation, specifically applied to GNSS pseudorange observations. To develop and evaluate these learning methods, we introduce a novel multisensor state estimator that accurately and robustly estimates trajectory from multiple sensor inputs, critical for deriving GNSS measurement residuals used to train the uncertainty models. We validate the proposed learning-based models using real-world sensor data collected in diverse urban environments. Experimental results demonstrate that both models effectively handle GNSS outliers and improve state estimation performance. Furthermore, we provide insightful discussions to motivate future research toward developing a federated framework for robust vehicle localization in challenging environments.
[AI-53] Curiosity-Driven Imagination: Discovering Plan Operators and Learning Associated Policies for Open-World Adaptation ICRA2025
链接: https://arxiv.org/abs/2503.04931
作者: Pierrick Lorang,Hong Lu,Matthias Scheutz
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 8 pages, 4 figures. Accepted at ICRA 2025
点击查看摘要
Abstract:Adapting quickly to dynamic, uncertain environments-often called “open worlds”-remains a major challenge in robotics. Traditional Task and Motion Planning (TAMP) approaches struggle to cope with unforeseen changes, are data-inefficient when adapting, and do not leverage world models during learning. We address this issue with a hybrid planning and learning system that integrates two models: a low level neural network based model that learns stochastic transitions and drives exploration via an Intrinsic Curiosity Module (ICM), and a high level symbolic planning model that captures abstract transitions using operators, enabling the agent to plan in an “imaginary” space and generate reward machines. Our evaluation in a robotic manipulation domain with sequential novelty injections demonstrates that our approach converges faster and outperforms state-of-the-art hybrid methods.
[AI-54] Privacy in Responsible AI: Approaches to Facial Recognition from Cloud Providers
链接: https://arxiv.org/abs/2503.04866
作者: Anna Elivanova
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:As the use of facial recognition technology is expanding in different domains, ensuring its responsible use is gaining more importance. This paper conducts a comprehensive literature review of existing studies on facial recognition technology from the perspective of privacy, which is one of the key Responsible AI principles. Cloud providers, such as Microsoft, AWS, and Google, are at the forefront of delivering facial-related technology services, but their approaches to responsible use of these technologies vary significantly. This paper compares how these cloud giants implement the privacy principle into their facial recognition and detection services. By analysing their approaches, it identifies both common practices and notable differences. The results of this research will be valuable for developers and businesses by providing them insights into best practices of three major companies for integration responsible AI, particularly privacy, into their cloud-based facial recognition technologies. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2503.04866 [cs.CR] (or arXiv:2503.04866v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2503.04866 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-55] From Pixels to Trajectory: Universal Adversarial Example Detection via Temporal Imprints
链接: https://arxiv.org/abs/2503.04853
作者: Yansong Gao,Huaibing Peng,Hua Ma,Zhiyang Dai,Shuo Wang,Hongsheng Hu,Anmin Fu,Minhui Xue
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:For the first time, we unveil discernible temporal (or historical) trajectory imprints resulting from adversarial example (AE) attacks. Standing in contrast to existing studies all focusing on spatial (or static) imprints within the targeted underlying victim models, we present a fresh temporal paradigm for understanding these attacks. Of paramount discovery is that these imprints are encapsulated within a single loss metric, spanning universally across diverse tasks such as classification and regression, and modalities including image, text, and audio. Recognizing the distinct nature of loss between adversarial and clean examples, we exploit this temporal imprint for AE detection by proposing TRAIT (TRaceable Adversarial temporal trajectory ImprinTs). TRAIT operates under minimal assumptions without prior knowledge of attacks, thereby framing the detection challenge as a one-class classification problem. However, detecting AEs is still challenged by significant overlaps between the constructed synthetic losses of adversarial and clean examples due to the absence of ground truth for incoming inputs. TRAIT addresses this challenge by converting the synthetic loss into a spectrum signature, using the technique of Fast Fourier Transform to highlight the discrepancies, drawing inspiration from the temporal nature of the imprints, analogous to time-series signals. Across 12 AE attacks including SMACK (USENIX Sec’2023), TRAIT demonstrates consistent outstanding performance across comprehensively evaluated modalities, tasks, datasets, and model architectures. In all scenarios, TRAIT achieves an AE detection accuracy exceeding 97%, often around 99%, while maintaining a false rejection rate of 1%. TRAIT remains effective under the formulated strong adaptive attacks.
[AI-56] Role of Databases in GenAI Applications
链接: https://arxiv.org/abs/2503.04847
作者: Santosh Bhupathi
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Generative AI (GenAI) is transforming industries by enabling intelligent content generation, automation, and decision-making. However, the effectiveness of GenAI applications depends significantly on efficient data storage, retrieval, and contextual augmentation. This paper explores the critical role of databases in GenAI workflows, emphasizing the importance of choosing the right database architecture to optimize performance, accuracy, and scalability. It categorizes database roles into conversational context (key-value/document databases), situational context (relational databases/data lakehouses), and semantic context (vector databases) each serving a distinct function in enriching AI-generated responses. Additionally, the paper highlights real-time query processing, vector search for semantic retrieval, and the impact of database selection on model efficiency and scalability. By leveraging a multi-database approach, GenAI applications can achieve more context-aware, personalized, and high-performing AI-driven solutions.
[AI-57] chnique Inference Engine: A Recommender Model to Support Cyber Threat Hunting
链接: https://arxiv.org/abs/2503.04819
作者: Matthew J. Turner,Mike Carenzo,Jackie Lasky,James Morris-King,James Ross
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Cyber threat hunting is the practice of proactively searching for latent threats in a network. Engaging in threat hunting can be difficult due to the volume of network traffic, variety of adversary techniques, and constantly evolving vulnerabilities. To aid analysts in identifying techniques which may be co-occurring as part of a campaign, we present the Technique Inference Engine, a tool to infer tactics, techniques, and procedures (TTPs) which may be related to existing observations of adversarial behavior. We compile the largest (to our knowledge) available dataset of cyber threat intelligence (CTI) reports labeled with relevant TTPs. With the knowledge that techniques are chronically under-reported in CTI, we apply several implicit feedback recommender models to the data in order to predict additional techniques which may be part of a given campaign. We evaluate the results in the context of the cyber analyst’s use case and apply t-SNE to visualize the model embeddings. We provide our code and a web interface.
[AI-58] An energy-efficient learning solution for the Agile Earth Observation Satellite Scheduling Problem ICML
链接: https://arxiv.org/abs/2503.04803
作者: Antonio M. Mercado-Martínez,Beatriz Soret,Antonio Jurado-Navas
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This paper has been accepted for presentation at the IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN) Special Sessions 2025
点击查看摘要
Abstract:The Agile Earth Observation Satellite Scheduling Problem (AEOSSP) entails finding the subset of observation targets to be scheduled along the satellite’s orbit while meeting operational constraints of time, energy and memory. The problem of deciding what and when to observe is inherently complex, and becomes even more challenging when considering several issues that compromise the quality of the captured images, such as cloud occlusion, atmospheric turbulence, and image resolution. This paper presents a Deep Reinforcement Learning (DRL) approach for addressing the AEOSSP with time-dependent profits, integrating these three factors to optimize the use of energy and memory resources. The proposed method involves a dual decision-making process: selecting the sequence of targets and determining the optimal observation time for each. Our results demonstrate that the proposed algorithm reduces the capture of images that fail to meet quality requirements by 60% and consequently decreases energy waste from attitude maneuvers by up to 78%, all while maintaining strong observation performance.
[AI-59] Advancing MAPF towards the Real World: A Scalable Multi-Agent Realistic Testbed (SMART)
链接: https://arxiv.org/abs/2503.04798
作者: Jingtian Yan,Zhifei Li,William Kang,Yulun Zhang,Stephen Smith,Jiaoyang Li
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We present Scalable Multi-Agent Realistic Testbed (SMART), a realistic and efficient software tool for evaluating Multi-Agent Path Finding (MAPF) algorithms. MAPF focuses on planning collision-free paths for a group of agents. While state-of-the-art MAPF algorithms can plan paths for hundreds of robots in seconds, they often rely on simplified robot models, making their real-world performance unclear. Researchers typically lack access to hundreds of physical robots in laboratory settings to evaluate the algorithms. Meanwhile, industrial professionals who lack expertise in MAPF require an easy-to-use simulator to efficiently test and understand the performance of MAPF algorithms in their specific settings. SMART fills this gap with several advantages: (1) SMART uses a physics-engine-based simulator to create realistic simulation environments, accounting for complex real-world factors such as robot kinodynamics and execution uncertainties, (2) SMART uses an execution monitor framework based on the Action Dependency Graph, facilitating seamless integration with various MAPF algorithms and robot models, and (3) SMART scales to thousands of robots. In addition, we use SMART to explore and demonstrate research questions about the execution of MAPF algorithms in real-world scenarios. The code is publicly available at this https URL.
[AI-60] SMT(LIA) Sampling with High Diversity
链接: https://arxiv.org/abs/2503.04782
作者: Yong Lai,Junjie Li,Chuan Luo
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Satisfiability Modulo Linear Integer Arithmetic, SMT(LIA) for short, is pivotal across various critical domains. Previous research has primarily focused on SMT solving techniques. However, in practical applications such as software and hardware testing, there is a need to generate a diverse set of solutions for use as test inputs. We have developed the first sampling framework that integrates local search with CDCL(T) techniques, named HighDiv, capable of generating a highly diverse set of solutions for constraints under linear integer theory. Initially, in the local search phase, we introduced a novel operator called boundary-aware movement. This operator performs random moves by considering the current state’s constraints on variables, thereby enhancing the diversity of variables during the search process. Furthermore, we have conducted an in-depth study of the preprocessing and variable initialization mechanisms within the framework, which significantly enhances the efficiency of subsequent local searches. Lastly, we use the solutions obtained from local search sampling as additional constraints to further explore the solution space using the stochastic CDCL(T) method. Experimental results demonstrate that \HighDiv generates solutions with greater diversity compared to the state-of-the-art SMT(LIA) sampling tool, MeGASampler.
[AI-61] Can LLM s Reason About Program Semantics? A Comprehensive Evaluation of LLM s on Formal Specification Inference
链接: https://arxiv.org/abs/2503.04779
作者: Thanh Le-Cong,Bach Le,Toby Murray
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly being used to automate programming tasks. Yet, LLMs’ capabilities in reasoning about program semantics are still inadequately studied, leaving significant potential for further exploration. This paper introduces FormalBench, a comprehensive benchmark designed to evaluate LLMs’ reasoning abilities on program semantics, particularly via the task of synthesizing formal program specifications to assist verifying program correctness. This task requires both comprehensive reasoning over all possible program executions (i.e., \textitcompleteness) and the generation of precise, syntactically correct expressions that adhere to formal syntax and semantics (i.e., \textitconsistency). Using this benchmark, we evaluated the ability of LLMs in synthesizing consistent and complete specifications. Our findings show that LLMs perform well with simple control flows but struggle with more complex structures, especially loops, even with advanced prompting. Additionally, LLMs exhibit limited robustness against semantic-preserving transformations. We also highlight common failure patterns and design self-repair prompts, improving success rates by 25%.
[AI-62] Generating Millions Of Lean Theorems With Proofs By Exploring State Transition Graphs
链接: https://arxiv.org/abs/2503.04772
作者: David Yin,Jing Gao
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated significant potential in generating mathematical proofs. However, a persistent challenge is that LLMs occasionally make mistakes, while even a minor mistake can invalidate an entire proof. Proof assistants like Lean offer a great remedy. They are designed for verifying each step of a proof in a formal language, and in recent years researchers have created AI models to generate proofs in their languages. However, the scarcity of large-scale datasets of Lean proofs restrict the performance of such Automated Theorem Proving (ATP) models. We developed LeanNavigator, a novel method for generating a large-scale dataset of Lean theorems and proofs by finding new ways to prove existing Lean theorems. By leveraging an interactive Lean client and an efficient method for proof step generation, LeanNavigator efficiently produces new theorems with corresponding proofs. Applying this approach to Mathlib4, we generated 4.7 million theorems totaling 1 billion tokens, surpassing previous datasets by more than an order of magnitude. Using this extensive dataset, we trained an AI model that outperforms the state-of-the-art ReProver model in theorem-proving tasks. These results confirm our hypothesis and demonstrate the critical role of large datasets in improving the performance of automated theorem provers. Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI) Cite as: arXiv:2503.04772 [cs.LO] (or arXiv:2503.04772v1 [cs.LO] for this version) https://doi.org/10.48550/arXiv.2503.04772 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-63] A cross-regional review of AI safety regulations in the commercial aviation
链接: https://arxiv.org/abs/2503.04767
作者: Penny A. Barr,Sohel M. Imroz
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 34 pages, primary contributor by Penny A. Barr
点击查看摘要
Abstract:In this paper we examine the existing artificial intelligence (AI) policy documents in aviation for the following three regions: the United States, European Union, and China. The aviation industry has always been a first mover in adopting technological advancements. This early adoption offers valuable insights because of its stringent regulations and safety-critical procedures. As a result, the aviation industry provides an optimal platform to counter AI vulnerabilities through its tight regulations, standardization processes, and certification of new technologies. Keywords: AI in aviation; aviation safety; standardization; certifiable AI; regulations
[AI-64] Agent ic AI and the Cyber Arms Race
链接: https://arxiv.org/abs/2503.04760
作者: Sean Oesch,Jack Hutchins,Phillipe Austria,Amul Chaulagain
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 4 pages, 1 figure, due to be published in Computer Magazine
点击查看摘要
Abstract:Agentic AI is shifting the cybersecurity landscape as attackers and defenders leverage AI agents to augment humans and automate common tasks. In this article, we examine the implications for cyber warfare and global politics as Agentic AI becomes more powerful and enables the broad proliferation of capabilities only available to the most well resourced actors today.
[AI-65] Chat-GPT : An AI Based Educational Revolution
链接: https://arxiv.org/abs/2503.04758
作者: Sasa Maric,Sonja Maric,Lana Maric
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The AI revolution is gathering momentum at an unprecedented rate. Over the past decade, we have witnessed a seemingly inevitable integration of AI in every facet of our lives. Much has been written about the potential revolutionary impact of AI in education. AI has the potential to completely revolutionise the educational landscape as we could see entire courses and degrees developed by programs such as ChatGPT. AI has the potential to develop courses, set assignments, grade and provide feedback to students much faster than a team of teachers. In addition, because of its dynamic nature, it has the potential to continuously improve its content. In certain fields such as computer science, where technology is continuously evolving, AI based applications can provide dynamically changing, relevant material to students. AI has the potential to replace entire degrees and may challenge the concept of higher education institutions. We could also see entire new disciplines emerge as a consequence of AI. This paper examines the practical impact of ChatGPT and why it is believed that its implementation is a critical step towards a new era of education. We investigate the impact that ChatGPT will have on learning, problem solving skills and cognitive ability of students. We examine the positives, negatives and many other aspects of AI and its applications throughout this paper.
[AI-66] ransforming Student Evaluation with Adaptive Intelligence and Performance Analytics
链接: https://arxiv.org/abs/2503.04752
作者: Pushpalatha K S,Abhishek Mangalur,Ketan Hegde,Chetan Badachi,Mohammad Aamir
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 6 pages, 3 figures, 1 table
点击查看摘要
Abstract:The development in Artificial Intelligence (AI) offers transformative potential for redefining student assessment methodologies. This paper aims to establish the idea of the advancement of Artificial Intelligence (AI) and its prospect in reshaping approaches to assessing students. It creates a system for the evaluation of students performance using Artificial intelligence, and particularly the Gemini API for the generation of questions, grading and report on the students performances. This is to facilitate easy use of the tools in creating, scheduling, and delivering assessments with minimal chances of cheating through options such as full screen and time limit. There are formats of questions in the system which comprises multiple choice, short answers and descriptive questions, developed by Gemini. The most conspicuous feature is the self-checking system whereby the user gets instant feedback for the correct score that each of the students would have scored instantly with explanations about wrong answers. Moreover, the platform has intelligent learning progressions where the user will be able to monitor his/her performances to be recommended a certain level of performance. It will allow students as well as educators to have real-time analytics and feedback on what they are good at and where they need to improve. Not only does it make the assessment easier, but it also improves the levels of accuracy in grading and effectively strengthens a data based learning process for students.
[AI-67] What is Ethical: AIHED Driving Humans or Human-Driven AIHED? A Conceptual Framework enabling the Ethos of AI-driven Higher education
链接: https://arxiv.org/abs/2503.04751
作者: Prashant Mahajan
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: Tables 9, Figures 6
点击查看摘要
Abstract:The rapid integration of Artificial Intelligence (AI) in Higher Education (HE) is transforming personalized learning, administrative automation, and decision-making. However, this progress presents a duality, as AI adoption also introduces ethical and institutional challenges, including algorithmic bias, data privacy risks, and governance inconsistencies. To address these concerns, this study introduces the Human-Driven AI in Higher Education (HD-AIHED) Framework, ensuring compliance with UNESCO and OECD ethical standards. This conceptual research employs a qualitative meta-synthesis approach, integrating qualitative and quantitative studies to identify patterns, contradictions, and gaps in AI adoption within HE. It reinterprets existing datasets through theoretical and ethical lenses to develop governance frameworks. The study applies a participatory integrated co-system, Phased Human Intelligence, SWOC analysis, and AI ethical review boards to assess AI readiness and governance strategies for universities and HE institutions. The HD-AIHED model bridges AI research gaps, addresses global real-time challenges, and provides tailored, scalable, and ethical strategies for diverse educational contexts. By emphasizing interdisciplinary collaboration among stakeholders, this study envisions AIHED as a transparent and equitable force for innovation. The HD-AIHED framework ensures AI acts as a collaborative and ethical enabler rather than a disruptive replacement for human intelligence while advocating for responsible AI implementation in HE.
[AI-68] Position: AI agents should be regulated based on autonomous action sequences
链接: https://arxiv.org/abs/2503.04750
作者: Takauki Osogami
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 29 pages, 2 figures
点击查看摘要
Abstract:This position paper argues that AI agents should be regulated based on the sequence of actions they autonomously take. AI agents with long-term planning and strategic capabilities can pose significant risks of human extinction and irreversible global catastrophes. While existing regulations often focus on computational scale as a proxy for potential harm, we contend that such measures are insufficient for assessing the risks posed by AI agents whose capabilities arise primarily from inference-time computation. To support our position, we discuss relevant regulations and recommendations from AI scientists regarding existential risks, as well as the advantages of action sequences over existing impact measures that require observing environmental states.
[AI-69] E-LENS: User Requirements-Oriented AI Ethics Assurance
链接: https://arxiv.org/abs/2503.04747
作者: Jianlong Zhou,Fang Chen
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 29 pages
点击查看摘要
Abstract:Despite the much proliferation of AI ethical principles in recent years, there is a challenge of assuring AI ethics with current AI ethics frameworks in real-world applications. While system safety has emerged as a distinct discipline for a long time, originated from safety concerns in early aircraft manufacturing. The safety assurance is now an indispensable component in safety critical domains. Motivated by the assurance approaches for safety-critical systems such as aviation, this paper introduces the concept of AI ethics assurance cases into the AI ethics assurance. Three pillars of user requirements, evidence, and validation are proposed as key components and integrated into AI ethics assurance cases for a new approach of user requirements-oriented AI ethics assurance. The user requirements-oriented AI ethics assurance case is set up based on three pillars and hazard analysis methods used in the safety assurance of safety-critical systems. This paper also proposes a platform named Ethical-Lens (E-LENS) to implement the user requirements-oriented AI ethics assurance approach. The proposed user requirements-based E-LENS platform is then applied to assure AI ethics of an AI-driven human resource shortlisting system as a case study to show the effectiveness of the proposed approach.
[AI-70] Emerging Practices in Frontier AI Safety Frameworks
链接: https://arxiv.org/abs/2503.04746
作者: Marie Davidsen Buhl,Ben Bucknall,Tammy Masterson
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 38 pages
点击查看摘要
Abstract:As part of the Frontier AI Safety Commitments agreed to at the 2024 AI Seoul Summit, many AI developers agreed to publish a safety framework outlining how they will manage potential severe risks associated with their systems. This paper summarises current thinking from companies, governments, and researchers on how to write an effective safety framework. We outline three core areas of a safety framework - risk identification and assessment, risk mitigation, and governance - and identify emerging practices within each area. As safety frameworks are novel and rapidly developing, we hope that this paper can serve both as an overview of work to date and as a starting point for further discussion and innovation.
[AI-71] Safety Cases: A Scalable Approach to Frontier AI Safety
链接: https://arxiv.org/abs/2503.04744
作者: Benjamin Hilton,Marie Davidsen Buhl,Tomek Korbak,Geoffrey Irving
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 18 pages, 2 figures, 3 tables
点击查看摘要
Abstract:Safety cases - clear, assessable arguments for the safety of a system in a given context - are a widely-used technique across various industries for showing a decision-maker (e.g. boards, customers, third parties) that a system is safe. In this paper, we cover how and why frontier AI developers might also want to use safety cases. We then argue that writing and reviewing safety cases would substantially assist in the fulfilment of many of the Frontier AI Safety Commitments. Finally, we outline open research questions on the methodology, implementation, and technical details of safety cases.
[AI-72] AI Safety is Stuck in Technical Terms – A System Safety Response to the International AI Safety Report
链接: https://arxiv.org/abs/2503.04743
作者: Roel Dobbe
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: A response to the International AI Safety Report, which was released in preparation for the AI Action Summit in Paris, February 2025
点击查看摘要
Abstract:Safety has become the central value around which dominant AI governance efforts are being shaped. Recently, this culminated in the publication of the International AI Safety Report, written by 96 experts of which 30 nominated by the Organisation for Economic Co-operation and Development (OECD), the European Union (EU), and the United Nations (UN). The report focuses on the safety risks of general-purpose AI and available technical mitigation approaches. In this response, informed by a system safety perspective, I refl ect on the key conclusions of the report, identifying fundamental issues in the currently dominant technical framing of AI safety and how this frustrates meaningful discourse and policy efforts to address safety comprehensively. The system safety discipline has dealt with the safety risks of software-based systems for many decades, and understands safety risks in AI systems as sociotechnical and requiring consideration of technical and non-technical factors and their interactions. The International AI Safety report does identify the need for system safety approaches. Lessons, concepts and methods from system safety indeed provide an important blueprint for overcoming current shortcomings in technical approaches by integrating rather than adding on non-technical factors and interventions. I conclude with why building a system safety discipline can help us overcome limitations in the European AI Act, as well as how the discipline can help shape sustainable investments into Public Interest AI.
[AI-73] A case for specialisation in non-human entities
链接: https://arxiv.org/abs/2503.04742
作者: El-Mahdi El-Mhamdi,Lê-Nguyên Hoang,Mariame Tighanimine
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:With the rise of large multi-modal AI models, fuelled by recent interest in large language models (LLMs), the notion of artificial general intelligence (AGI) went from being restricted to a fringe community, to dominate mainstream large AI development programs. In contrast, in this paper, we make a \emphcase for specialisation, by reviewing the pitfalls of generality and stressing the industrial value of specialised systems. Our contribution is threefold. First, we review the most widely accepted arguments \emphagainst specialisation, and discuss how their relevance in the context of human labour is actually an argument \emphfor specialisation in the case of non human agents, be they algorithms or human organisations. Second, we propose four arguments \emphin favor of specialisation, ranging from machine learning robustness, to computer security, social sciences and cultural evolution. Third, we finally make a case for \emphspecification, discuss how the machine learning approach to AI has so far failed to catch up with good practices from safety-engineering and formal verification of software, and discuss how some emerging good practices in machine learning help reduce this gap. In particular, we justify the need for \emphspecified governance for hard-to-specify systems. Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) Cite as: arXiv:2503.04742 [cs.CY] (or arXiv:2503.04742v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2503.04742 Focus to learn more arXiv-issued DOI via DataCite Submission history From: El-Mahdi El-Mhamdi [view email] [v1] Wed, 5 Feb 2025 20:38:18 UTC (48 KB) Full-text links: Access Paper: View a PDF of the paper titled A case for specialisation in non-human entities, by El-Mahdi El-Mhamdi and 2 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.CY prev | next new | recent | 2025-03 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
[AI-74] Which Information should the UK and US AISI share with an International Network of AISIs? Opportunities Risks and a Tentative Proposal
链接: https://arxiv.org/abs/2503.04741
作者: Lara Thurnherr
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 12 Pages, 3 Tables, 2 Figures
点击查看摘要
Abstract:The UK AI Safety Institute (UK AISI) and its parallel organisation in the United States (US AISI) take up a unique position in the recently established International Network of AISIs. Both are in jurisdictions with frontier AI companies and are assuming leading roles in the international conversation on AI Safety. This paper argues that it is in the interest of both institutions to share specific categories of information with the International Network of AISIs, deliberately abstain from sharing others and carefully evaluate sharing some categories on a case by case basis, according to domestic priorities. The paper further proposes a provisional framework with which policymakers and researchers can distinguish between these three cases, taking into account the potential benefits and risks of sharing specific categories of information, ranging from pre-deployment evaluation results to evaluation standards. In an effort to further improve the research on AI policy relevant information sharing decisions, the paper emphasises the importance of continuously monitoring fluctuating factors influencing sharing decisions and a more in-depth analysis of specific policy relevant information categories and additional factors to consider in future research.
[AI-75] PRISM: Perspective Reasoning for Integrated Synthesis and Mediation as a Multi-Perspective Framework for AI Alignment
链接: https://arxiv.org/abs/2503.04740
作者: Anthony Diamond
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 104 pages, 5 figures. Preprint on AI alignment presenting PRISM: a multi-perspective framework that organizes moral concerns into seven basis worldviews and uses Pareto-inspired synthesis to reconcile conflicting human values and specification gaming. Grounded in cognitive science and moral psychology
点击查看摘要
Abstract:In this work, we propose Perspective Reasoning for Integrated Synthesis and Mediation (PRISM), a multiple-perspective framework for addressing persistent challenges in AI alignment such as conflicting human values and specification gaming. Grounded in cognitive science and moral psychology, PRISM organizes moral concerns into seven “basis worldviews”, each hypothesized to capture a distinct dimension of human moral cognition, ranging from survival-focused reflexes through higher-order integrative perspectives. It then applies a Pareto-inspired optimization scheme to reconcile competing priorities without reducing them to a single metric. Under the assumption of reliable context validation for robust use, the framework follows a structured workflow that elicits viewpoint-specific responses, synthesizes them into a balanced outcome, and mediates remaining conflicts in a transparent and iterative manner. By referencing layered approaches to moral cognition from cognitive science, moral psychology, and neuroscience, PRISM clarifies how different moral drives interact and systematically documents and mediates ethical tradeoffs. We illustrate its efficacy through real outputs produced by a working prototype, applying PRISM to classic alignment problems in domains such as public health policy, workplace automation, and education. By anchoring AI deliberation in these human vantage points, PRISM aims to bound interpretive leaps that might otherwise drift into non-human or machine-centric territory. We briefly outline future directions, including real-world deployments and formal verifications, while maintaining the core focus on multi-perspective synthesis and conflict mediation.
[AI-76] Responsible Artificial Intelligence Systems: A Roadmap to Societys Trust through Trustworthy AI Auditability Accountability and Governance
链接: https://arxiv.org/abs/2503.04739
作者: Andrés Herrera-Poyatos,Javier Del Ser,Marcos López de Prado,Fei-Yue Wang,Enrique Herrera-Viedma,Francisco Herrera
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 22 pages, 7 figures
点击查看摘要
Abstract:Artificial intelligence (AI) has matured as a technology, necessitating the development of responsibility frameworks that are fair, inclusive, trustworthy, safe and secure, transparent, and accountable. By establishing such frameworks, we can harness the full potential of AI while mitigating its risks, particularly in high-risk scenarios. This requires the design of responsible AI systems based on trustworthy AI technologies and ethical principles, with the aim of ensuring auditability and accountability throughout their design, development, and deployment, adhering to domain-specific regulations and standards. This paper explores the concept of a responsible AI system from a holistic perspective, which encompasses four key dimensions: 1) regulatory context; 2) trustworthy AI technology along with standardization and assessments; 3) auditability and accountability; and 4) AI governance. The aim of this paper is double. First, we analyze and understand these four dimensions and their interconnections in the form of an analysis and overview. Second, the final goal of the paper is to propose a roadmap in the design of responsible AI systems, ensuring that they can gain society’s trust. To achieve this trustworthiness, this paper also fosters interdisciplinary discussions on the ethical, legal, social, economic, and cultural aspects of AI from a global governance perspective. Last but not least, we also reflect on the current state and those aspects that need to be developed in the near future, as ten lessons learned. Comments: 22 pages, 7 figures Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) ACMclasses: I.2.0 Cite as: arXiv:2503.04739 [cs.CY] (or arXiv:2503.04739v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2503.04739 Focus to learn more arXiv-issued DOI via DataCite
[AI-77] Copyright in AI-generated works: Lessons from recent developments in patent law
链接: https://arxiv.org/abs/2503.04738
作者: Rita Matulionyte,Jyh-An Lee
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In Thaler v The Comptroller-General of Patents, Designs and Trade Marks (DABUS), Smith J. held that an AI owner can possibly claim patent ownership over an AI-generated invention based on their ownership and control of the AI system. This AI-owner approach reveals a new option to allocate property rights over AI-generated output. While this judgment was primarily about inventorship and ownership of AI-generated invention in patent law, it has important implications for copyright law. After analysing the weaknesses of applying existing judicial approaches to copyright ownership of AI-generated works, this paper examines whether the AI-owner approach is a better option for determining copyright ownership of AI-generated works. The paper argues that while contracts can be used to work around the AI-owner approach in scenarios where users want to commercially exploit the outputs, this approach still provides more certainty and less transaction costs for relevant parties than other approaches proposed so far.
[AI-78] Carelessness Detection using Performance Factor Analysis: A New Operationalization with Unexpectedly Different Relationship to Learning
链接: https://arxiv.org/abs/2503.04737
作者: Jiayi Zhang,Ryan S. Baker,Namrata Srivastava,Jaclyn Ocumpaugh,Caitlin Mills,Bruce M. McLaren
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Detection of carelessness in digital learning platforms has relied on the contextual slip model, which leverages conditional probability and Bayesian Knowledge Tracing (BKT) to identify careless errors, where students make mistakes despite having the knowledge. However, this model cannot effectively assess carelessness in questions tagged with multiple skills due to the use of conditional probability. This limitation narrows the scope within which the model can be applied. Thus, we propose a novel model, the Beyond Knowledge Feature Carelessness (BKFC) model. The model detects careless errors using performance factor analysis (PFA) and behavioral features distilled from log data, controlling for knowledge when detecting carelessness. We applied the BKFC to detect carelessness in data from middle school students playing a learning game on decimal numbers and operations. We conducted analyses comparing the careless errors detected using contextual slip to the BKFC model. Unexpectedly, careless errors identified by these two approaches did not align. We found students’ post-test performance was (corresponding to past results) positively associated with the carelessness detected using the contextual slip model, while negatively associated with the carelessness detected using the BKFC model. These results highlight the complexity of carelessness and underline a broader challenge in operationalizing carelessness and careless errors.
[AI-79] Ethics of generative AI and manipulation: a design-oriented research agenda
链接: https://arxiv.org/abs/2503.04733
作者: Michael Klenk
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Generative AI enables automated, effective manipulation at scale. Despite the growing general ethical discussion around generative AI, the specific manipulation risks remain inadequately investigated. This article outlines essential inquiries encompassing conceptual, empirical, and design dimensions of manipulation, pivotal for comprehending and curbing manipulation risks. By highlighting these questions, the article underscores the necessity of an appropriate conceptualisation of manipulation to ensure the responsible development of Generative AI technologies.
[AI-80] Epistemic Logic Programs: Non-Ground and Counting Complexity
链接: https://arxiv.org/abs/2503.04731
作者: Thomas Eiter,Johannes K. Fichte,Markus Hecher,Stefan Woltran
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
*备注:
点击查看摘要
Abstract:Answer Set Programming (ASP) is a prominent problem-modeling and solving framework, whose solutions are called answer sets. Epistemic logic programs (ELP) extend ASP to reason about all or some answer sets. Solutions to an ELP can be seen as consequences over multiple collections of answer sets, known as world views. While the complexity of propositional programs is well studied, the non-ground case remains open. This paper establishes the complexity of non-ground ELPs. We provide a comprehensive picture for well-known program fragments, which turns out to be complete for the class NEXPTIME with access to oracles up to \Sigma^P_2. In the quantitative setting, we establish complexity results for counting complexity beyond #EXP. To mitigate high complexity, we establish results in case of bounded predicate arity, reaching up to the fourth level of the polynomial hierarchy. Finally, we provide ETH-tight runtime results for the parameter treewidth, which has applications in quantitative reasoning, where we reason on (marginal) probabilities of epistemic literals.
[AI-81] Static Vs. Agent ic Game Master AI for Facilitating Solo Role-Playing Experiences
链接: https://arxiv.org/abs/2502.19519
作者: Nicolai Hejlesen Jørgensen,Sarmilan Tharmabalan,Ilhan Aslan,Nicolai Brodersen Hansen,Timothy Merritt
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 17 pages, 10 figures, 1 table, submitted for review
点击查看摘要
Abstract:This paper presents a game master AI for single-player role-playing games. The AI is designed to deliver interactive text-based narratives and experiences typically associated with multiplayer tabletop games like Dungeons Dragons. We report on the design process and the series of experiments to improve the functionality and experience design, resulting in two functional versions of the system. While v1 of our system uses simplified prompt engineering, v2 leverages a multi-agent architecture and the ReAct framework to include reasoning and action. A comparative evaluation demonstrates that v2 as an agentic system maintains play while significantly improving modularity and game experience, including immersion and curiosity. Our findings contribute to the evolution of AI-driven interactive fiction, highlighting new avenues for enhancing solo role-playing experiences.
[AI-82] Noise-Robust Radio Frequency Fingerprint Identification Using Denoise Diffusion Model
链接: https://arxiv.org/abs/2503.05514
作者: Guolin Yin,Junqing Zhang,Yuan Ding,Simon Cotton
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: 6 pages, 8 figures, WCNC 2025
点击查看摘要
Abstract:Securing Internet of Things (IoT) devices presents increasing challenges due to their limited computational and energy resources. Radio Frequency Fingerprint Identification (RFFI) emerges as a promising authentication technique to identify wireless devices through hardware impairments. RFFI performance under low signal-to-noise ratio (SNR) scenarios is significantly degraded because the minute hardware features can be easily swamped in noise. In this paper, we leveraged the diffusion model to effectively restore the RFF under low SNR scenarios. Specifically, we trained a powerful noise predictor and tailored a noise removal algorithm to effectively reduce the noise level in the received signal and restore the device fingerprints. We used Wi-Fi as a case study and created a testbed involving 6 commercial off-the-shelf Wi-Fi dongles and a USRP N210 software-defined radio (SDR) platform. We conducted experimental evaluations on various SNR scenarios. The experimental results show that the proposed algorithm can improve the classification accuracy by up to 34.9%.
[AI-83] FinTMMBench: Benchmarking Temporal-Aware Multi-Modal RAG in Finance
链接: https://arxiv.org/abs/2503.05185
作者: Fengbin Zhu,Junfeng Li,Liangming Pan,Wenjie Wang,Fuli Feng,Chao Wang,Huanbo Luan,Tat-Seng Chua
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: Under review
点击查看摘要
Abstract:Finance decision-making often relies on in-depth data analysis across various data sources, including financial tables, news articles, stock prices, etc. In this work, we introduce FinTMMBench, the first comprehensive benchmark for evaluating temporal-aware multi-modal Retrieval-Augmented Generation (RAG) systems in finance. Built from heterologous data of NASDAQ 100 companies, FinTMMBench offers three significant advantages. 1) Multi-modal Corpus: It encompasses a hybrid of financial tables, news articles, daily stock prices, and visual technical charts as the corpus. 2) Temporal-aware Questions: Each question requires the retrieval and interpretation of its relevant data over a specific time period, including daily, weekly, monthly, quarterly, and annual periods. 3) Diverse Financial Analysis Tasks: The questions involve 10 different tasks, including information extraction, trend analysis, sentiment analysis and event detection, etc. We further propose a novel TMMHybridRAG method, which first leverages LLMs to convert data from other modalities (e.g., tabular, visual and time-series data) into textual format and then incorporates temporal information in each node when constructing graphs and dense indexes. Its effectiveness has been validated in extensive experiments, but notable gaps remain, highlighting the challenges presented by our FinTMMBench.
[AI-84] Function-Coherent Gambles
链接: https://arxiv.org/abs/2503.01855
作者: Gregory Wheeler
类目: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Probability (math.PR)
*备注: 11 pages, 3 figures
点击查看摘要
Abstract:The desirable gambles framework provides a foundational approach to imprecise probability theory but relies heavily on linear utility assumptions. This paper introduces \em function-coherent gambles, a generalization that accommodates non-linear utility while preserving essential rationality properties. We establish core axioms for function-coherence and prove a representation theorem that characterizes acceptable gambles through continuous linear functionals. The framework is then applied to analyze various forms of discounting in intertemporal choice, including hyperbolic, quasi-hyperbolic, scale-dependent, and state-dependent discounting. We demonstrate how these alternatives to constant-rate exponential discounting can be integrated within the function-coherent framework. This unified treatment provides theoretical foundations for modeling sophisticated patterns of time preference within the desirability paradigm, bridging a gap between normative theory and observed behavior in intertemporal decision-making under genuine uncertainty.
机器学习
[LG-0] Algorithmic Data Minimization for Machine Learning over Internet-of-Things Data Streams
链接: https://arxiv.org/abs/2503.05675
作者: Ted Shaowang,Shinan Liu,Jonatas Marques,Nick Feamster,Sanjay Krishnan
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: 9 pages, 18 figures
点击查看摘要
Abstract:Machine learning can analyze vast amounts of data generated by IoT devices to identify patterns, make predictions, and enable real-time decision-making. By processing sensor data, machine learning models can optimize processes, improve efficiency, and enhance personalized user experiences in smart systems. However, IoT systems are often deployed in sensitive environments such as households and offices, where they may inadvertently expose identifiable information, including location, habits, and personal identifiers. This raises significant privacy concerns, necessitating the application of data minimization – a foundational principle in emerging data regulations, which mandates that service providers only collect data that is directly relevant and necessary for a specified purpose. Despite its importance, data minimization lacks a precise technical definition in the context of sensor data, where collections of weak signals make it challenging to apply a binary “relevant and necessary” rule. This paper provides a technical interpretation of data minimization in the context of sensor streams, explores practical methods for implementation, and addresses the challenges involved. Through our approach, we demonstrate that our framework can reduce user identifiability by up to 16.7% while maintaining accuracy loss below 1%, offering a viable path toward privacy-preserving IoT data processing.
[LG-1] Physics-based machine learning framework for predicting NOx emissions from compression ignition engines using on-board diagnostics data
链接: https://arxiv.org/abs/2503.05648
作者: Harish Panneer Selvam,Bharat Jayaprakash,Yan Li,Shashi Shekhar,William F. Northrop
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This work presents a physics-based machine learning framework to predict and analyze oxides of nitrogen (NOx) emissions from compression-ignition engine-powered vehicles using on-board diagnostics (OBD) data as input. Accurate NOx prediction from OBD datasets is difficult because NOx formation inside an engine combustion chamber is governed by complex processes occurring on timescales much shorter than the data collection rate. Thus, emissions generally cannot be predicted accurately using simple empirically derived physics models. Black box models like genetic algorithms or neural networks can be more accurate, but have poor interpretability. The transparent model presented in this paper has both high accuracy and can explain potential sources of high emissions. The proposed framework consists of two major steps: a physics-based NOx prediction model combined with a novel Divergent Window Co-occurrence (DWC) Pattern detection algorithm to analyze operating conditions that are not adequately addressed by the physics-based model. The proposed framework is validated for generalizability with a second vehicle OBD dataset, a sensitivity analysis is performed, and model predictions are compared with that from a deep neural network. The results show that NOx emissions predictions using the proposed model has around 55% better root mean square error, and around 60% higher mean absolute error compared to the baseline NOx prediction model from previously published work. The DWC Pattern Detection Algorithm identified low engine power conditions to have high statistical significance, indicating an operating regime where the model can be improved. This work shows that the physics-based machine learning framework is a viable method for predicting NOx emissions from engines that do not incorporate NOx sensing.
[LG-2] Strategy Coopetition Explains the Emergence and Transience of In-Context Learning
链接: https://arxiv.org/abs/2503.05631
作者: Aaditya K. Singh,Ted Moskovitz,Sara Dragutinovic,Felix Hill,Stephanie C.Y. Chan,Andrew M. Saxe
类目: Machine Learning (cs.LG)
*备注: 20 pages, 18 figures
点击查看摘要
Abstract:In-context learning (ICL) is a powerful ability that emerges in transformer models, enabling them to learn from context without weight updates. Recent work has established emergent ICL as a transient phenomenon that can sometimes disappear after long training times. In this work, we sought a mechanistic understanding of these transient dynamics. Firstly, we find that, after the disappearance of ICL, the asymptotic strategy is a remarkable hybrid between in-weights and in-context learning, which we term “context-constrained in-weights learning” (CIWL). CIWL is in competition with ICL, and eventually replaces it as the dominant strategy of the model (thus leading to ICL transience). However, we also find that the two competing strategies actually share sub-circuits, which gives rise to cooperative dynamics as well. For example, in our setup, ICL is unable to emerge quickly on its own, and can only be enabled through the simultaneous slow development of asymptotic CIWL. CIWL thus both cooperates and competes with ICL, a phenomenon we term “strategy coopetition.” We propose a minimal mathematical model that reproduces these key dynamics and interactions. Informed by this model, we were able to identify a setup where ICL is truly emergent and persistent.
[LG-3] Decision-aware training of spatiotemporal forecasting models
链接: https://arxiv.org/abs/2503.05622
作者: Kyle Heuton,F. Samuel Muench,Shikhar Shrestha,Thomas J. Stopka,Michael C. Hughes
类目: Machine Learning (cs.LG)
*备注: 9 pages, 3 figures
点击查看摘要
Abstract:Optimal allocation of scarce resources is a common problem for decision makers faced with choosing a limited number of locations for intervention. Spatiotemporal prediction models could make such decisions data-driven. A recent performance metric called fraction of best possible reach (BPR) measures the impact of using a model’s recommended size K subset of sites compared to the best possible top-K in hindsight. We tackle two open problems related to BPR. First, we explore how to rank all sites numerically given a probabilistic model that predicts event counts jointly across sites. Ranking via the per-site mean is suboptimal for BPR. Instead, we offer a better ranking for BPR backed by decision theory. Second, we explore how to train a probabilistic model’s parameters to maximize BPR. Discrete selection of K sites implies all-zero parameter gradients which prevent standard gradient training. We overcome this barrier via advances in perturbed optimizers. We further suggest a training objective that combines likelihood with a decision-aware BPR constraint to deliver high-quality top-K rankings as well as good forecasts for all sites. We demonstrate our approach on two where-to-intervene applications: mitigating opioid-related fatal overdoses for public health and monitoring endangered wildlife.
[LG-4] Can KAN CANs? Input-convex Kolmogorov-Arnold Networks (KANs) as hyperelastic constitutive artificial neural networks (CANs)
链接: https://arxiv.org/abs/2503.05617
作者: Prakash Thakolkaran,Yaqi Guo,Shivam Saini,Mathias Peirlinck,Benjamin Alheit,Siddhant Kumar
类目: Machine Learning (cs.LG)
*备注: 34 pages, 15 figures
点击查看摘要
Abstract:Traditional constitutive models rely on hand-crafted parametric forms with limited expressivity and generalizability, while neural network-based models can capture complex material behavior but often lack interpretability. To balance these trade-offs, we present Input-Convex Kolmogorov-Arnold Networks (ICKANs) for learning polyconvex hyperelastic constitutive laws. ICKANs leverage the Kolmogorov-Arnold representation, decomposing the model into compositions of trainable univariate spline-based activation functions for rich expressivity. We introduce trainable input-convex splines within the KAN architecture, ensuring physically admissible polyconvex hyperelastic models. The resulting models are both compact and interpretable, enabling explicit extraction of analytical constitutive relationships through an input-convex symbolic regression techinque. Through unsupervised training on full-field strain data and limited global force measurements, ICKANs accurately capture nonlinear stress-strain behavior across diverse strain states. Finite element simulations of unseen geometries with trained ICKAN hyperelastic constitutive models confirm the framework’s robustness and generalization capability.
[LG-5] From Theory to Application: A Practical Introduction to Neural Operators in Scientific Computing
链接: https://arxiv.org/abs/2503.05598
作者: Prashant K. Jha
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 53 pages, 17 figures, Github repository: this https URL
点击查看摘要
Abstract:This focused review explores a range of neural operator architectures for approximating solutions to parametric partial differential equations (PDEs), emphasizing high-level concepts and practical implementation strategies. The study covers foundational models such as Deep Operator Networks (DeepONet), Principal Component Analysis-based Neural Networks (PCANet), and Fourier Neural Operators (FNO), providing comparative insights into their core methodologies and performance. These architectures are demonstrated on two classical linear parametric PDEs: the Poisson equation and linear elastic deformation. Beyond forward problem-solving, the review delves into applying neural operators as surrogates in Bayesian inference problems, showcasing their effectiveness in accelerating posterior inference while maintaining accuracy. The paper concludes by discussing current challenges, particularly in controlling prediction accuracy and generalization. It outlines emerging strategies to address these issues, such as residual-based error correction and multi-level training. This review can be seen as a comprehensive guide to implementing neural operators and integrating them into scientific computing workflows.
[LG-6] MPTSNet: Integrating Multiscale Periodic Local Patterns and Global Dependencies for Multivariate Time Series Classification AAAI2025
链接: https://arxiv.org/abs/2503.05582
作者: Yang Mu,Muhammad Shahzad,Xiao Xiang Zhu
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI2025
点击查看摘要
Abstract:Multivariate Time Series Classification (MTSC) is crucial in extensive practical applications, such as environmental monitoring, medical EEG analysis, and action recognition. Real-world time series datasets typically exhibit complex dynamics. To capture this complexity, RNN-based, CNN-based, Transformer-based, and hybrid models have been proposed. Unfortunately, current deep learning-based methods often neglect the simultaneous construction of local features and global dependencies at different time scales, lacking sufficient feature extraction capabilities to achieve satisfactory classification accuracy. To address these challenges, we propose a novel Multiscale Periodic Time Series Network (MPTSNet), which integrates multiscale local patterns and global correlations to fully exploit the inherent information in time series. Recognizing the multi-periodicity and complex variable correlations in time series, we use the Fourier transform to extract primary periods, enabling us to decompose data into multiscale periodic segments. Leveraging the inherent strengths of CNN and attention mechanism, we introduce the PeriodicBlock, which adaptively captures local patterns and global dependencies while offering enhanced interpretability through attention integration across different periodic scales. The experiments on UEA benchmark datasets demonstrate that the proposed MPTSNet outperforms 21 existing advanced baselines in the MTSC tasks.
[LG-7] BARK: A Fully Bayesian Tree Kernel for Black-box Optimization
链接: https://arxiv.org/abs/2503.05574
作者: Toby Boyne,Jose Pablo Folch,Robert M Lee,Behrang Shafei,Ruth Misener
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 8 main pages, 22 total pages, 10 figures, 6 tables
点击查看摘要
Abstract:We perform Bayesian optimization using a Gaussian process perspective on Bayesian Additive Regression Trees (BART). Our BART Kernel (BARK) uses tree agreement to define a posterior over piecewise-constant functions, and we explore the space of tree kernels using a Markov chain Monte Carlo approach. Where BART only samples functions, the resulting BARK model obtains samples of Gaussian processes defining distributions over functions, which allow us to build acquisition functions for Bayesian optimization. Our tree-based approach enables global optimization over the surrogate, even for mixed-feature spaces. Moreover, where many previous tree-based kernels provide uncertainty quantification over function values, our sampling scheme captures uncertainty over the tree structure itself. Our experiments show the strong performance of BARK on both synthetic and applied benchmarks, due to the combination of our fully Bayesian surrogate and the optimization procedure.
[LG-8] ractable Representations for Convergent Approximation of Distributional HJB Equations
链接: https://arxiv.org/abs/2503.05563
作者: Julie Alhosh,Harley Wiltzer,David Meger
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Accepted to RLDM 2025
点击查看摘要
Abstract:In reinforcement learning (RL), the long-term behavior of decision-making policies is evaluated based on their average returns. Distributional RL has emerged, presenting techniques for learning return distributions, which provide additional statistics for evaluating policies, incorporating risk-sensitive considerations. When the passage of time cannot naturally be divided into discrete time increments, researchers have studied the continuous-time RL (CTRL) problem, where agent states and decisions evolve continuously. In this setting, the Hamilton-Jacobi-Bellman (HJB) equation is well established as the characterization of the expected return, and many solution methods exist. However, the study of distributional RL in the continuous-time setting is in its infancy. Recent work has established a distributional HJB (DHJB) equation, providing the first characterization of return distributions in CTRL. These equations and their solutions are intractable to solve and represent exactly, requiring novel approximation techniques. This work takes strides towards this end, establishing conditions on the method of parameterizing return distributions under which the DHJB equation can be approximately solved. Particularly, we show that under a certain topological property of the mapping between statistics learned by a distributional RL algorithm and corresponding distributions, approximation of these statistics leads to close approximations of the solution of the DHJB equation. Concretely, we demonstrate that the quantile representation common in distributional RL satisfies this topological property, certifying an efficient approximation algorithm for continuous-time distributional RL.
[LG-9] Global graph features unveiled by unsupervised geometric deep learning
链接: https://arxiv.org/abs/2503.05560
作者: Mirja Granfors,Jesús Pineda,Blanca Zufiria Gerbolés,Joana B. Pereira,Carlo Manzo,Giovanni Volpe
类目: Machine Learning (cs.LG); Soft Condensed Matter (cond-mat.soft); Biological Physics (physics.bio-ph); Quantitative Methods (q-bio.QM)
*备注: 23 pages, 5 figures
点击查看摘要
Abstract:Graphs provide a powerful framework for modeling complex systems, but their structural variability makes analysis and classification challenging. To address this, we introduce GAUDI (Graph Autoencoder Uncovering Descriptive Information), a novel unsupervised geometric deep learning framework that captures both local details and global structure. GAUDI employs an innovative hourglass architecture with hierarchical pooling and upsampling layers, linked through skip connections to preserve essential connectivity information throughout the encoding-decoding process. By mapping different realizations of a system - generated from the same underlying parameters - into a continuous, structured latent space, GAUDI disentangles invariant process-level features from stochastic noise. We demonstrate its power across multiple applications, including modeling small-world networks, characterizing protein assemblies from super-resolution microscopy, analyzing collective motion in the Vicsek model, and capturing age-related changes in brain connectivity. This approach not only improves the analysis of complex graphs but also provides new insights into emergent phenomena across diverse scientific domains.
[LG-10] Diffusion Models for Cayley Graphs
链接: https://arxiv.org/abs/2503.05558
作者: Michael R. Douglas,Cristofero Fraser-Taliente
类目: Machine Learning (cs.LG); Combinatorics (math.CO); Group Theory (math.GR)
*备注: 25 pages, 5 figures
点击查看摘要
Abstract:We review the problem of finding paths in Cayley graphs of groups and group actions, using the Rubik’s cube as an example, and we list several more examples of significant mathematical interest. We then show how to formulate these problems in the framework of diffusion models. The exploration of the graph is carried out by the forward process, while finding the target nodes is done by the inverse backward process. This systematizes the discussion and suggests many generalizations. To improve exploration, we propose a ``reversed score’’ ansatz which substantially improves over previous comparable algorithms.
[LG-11] Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance NAACL
链接: https://arxiv.org/abs/2503.05551
作者: Bryan Etzine,Masoud Hashemi,Nishanth Madhusudhan,Sagar Davasam,Roshnee Sharma,Sathwik Tejaswi Madhusudhan,Vikas Yadav
类目: Machine Learning (cs.LG)
*备注: conference NAACL, TrustNLP Workshop
点击查看摘要
Abstract:Existing benchmarks are becoming saturated and struggle to separate model performances due to factors like data contamination and advancing LLM capabilities. This paper introduces EMDM (Enhanced Model Differentiation Metric), a novel weighted metric that revitalizes benchmarks by enhancing model separation. EMDM integrates final answer and Chain-of-Thought (CoT) reasoning correctness, assigning weights based on the complexity and reasoning depth required to solve a given sample in the evaluation data. Using a baseline LLM in two setups-Unguided, where the model has no prior exposure to test samples, and Guided, where the model has prior knowledge of the desired answer-EMDM distinguishes instances of varying difficulty. The CoT and answer correctness from these setups inform an optimization objective for weight assignment, resulting in a more nuanced evaluation of model performance. Compared to the exact match (EM) metric, which achieves 17% separation on ARC-Challenge, EMDM achieves 46%, demonstrating its effectiveness in differentiating models based on reasoning and knowledge requirements.
[LG-12] Riemann2: Learning Riemannian Submanifolds from Riemannian Data AISTATS2025
链接: https://arxiv.org/abs/2503.05540
作者: Leonel Rozo,Miguel González-Duque,Noémie Jaquier,Søren Hauberg
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted at AISTATS 2025
点击查看摘要
Abstract:Latent variable models are powerful tools for learning low-dimensional manifolds from high-dimensional data. However, when dealing with constrained data such as unit-norm vectors or symmetric positive-definite matrices, existing approaches ignore the underlying geometric constraints or fail to provide meaningful metrics in the latent space. To address these limitations, we propose to learn Riemannian latent representations of such geometric data. To do so, we estimate the pullback metric induced by a Wrapped Gaussian Process Latent Variable Model, which explicitly accounts for the data geometry. This enables us to define geometry-aware notions of distance and shortest paths in the latent space, while ensuring that our model only assigns probability mass to the data manifold. This generalizes previous work and allows us to handle complex tasks in various domains, including robot motion synthesis and analysis of brain connectomes.
[LG-13] Additive Model Boosting: New Insights and Path(ologie)s
链接: https://arxiv.org/abs/2503.05538
作者: Rickmer Schulte,David Rügamer
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Additive models (AMs) have sparked a lot of interest in machine learning recently, allowing the incorporation of interpretable structures into a wide range of model classes. Many commonly used approaches to fit a wide variety of potentially complex additive models build on the idea of boosting additive models. While boosted additive models (BAMs) work well in practice, certain theoretical aspects are still poorly understood, including general convergence behavior and what optimization problem is being solved when accounting for the implicit regularizing nature of boosting. In this work, we study the solution paths of BAMs and establish connections with other approaches for certain classes of problems. Along these lines, we derive novel convergence results for BAMs, which yield crucial insights into the inner workings of the method. While our results generally provide reassuring theoretical evidence for the practical use of BAMs, they also uncover some ``pathologies’’ of boosting for certain additive model classes concerning their convergence behavior that require caution in practice. We empirically validate our theoretical findings through several numerical experiments.
[LG-14] Leverag ing Approximate Caching for Faster Retrieval-Augmented Generation
链接: https://arxiv.org/abs/2503.05530
作者: Shai Bergman,Zhang Ji,Anne-Marie Kermarrec,Diana Petrescu,Rafael Pires,Mathis Randl,Martijn de Vos
类目: Databases (cs.DB); Machine Learning (cs.LG); Performance (cs.PF)
*备注:
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) enhances the reliability of large language model (LLM) answers by integrating external knowledge. However, RAG increases the end-to-end inference time since looking for relevant documents from large vector databases is computationally expensive. To address this, we introduce Proximity, an approximate key-value cache that optimizes the RAG workflow by leveraging similarities in user queries. Instead of treating each query independently, Proximity reuses previously retrieved documents when similar queries appear, reducing reliance on expensive vector database lookups. We evaluate Proximity on the MMLU and MedRAG benchmarks, demonstrating that it significantly improves retrieval efficiency while maintaining response accuracy. Proximity reduces retrieval latency by up to 59% while maintaining accuracy and lowers the computational burden on the vector database. We also experiment with different similarity thresholds and quantify the trade-off between speed and recall. Our work shows that approximate caching is a viable and effective strategy for optimizing RAG-based systems.
[LG-15] Mol-CADiff: Causality-Aware Autoregressive Diffusion for Molecule Generation
链接: https://arxiv.org/abs/2503.05499
作者: Md Atik Ahamed,Qiang Ye,Qiang Cheng
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The design of novel molecules with desired properties is a key challenge in drug discovery and materials science. Traditional methods rely on trial-and-error, while recent deep learning approaches have accelerated molecular generation. However, existing models struggle with generating molecules based on specific textual descriptions. We introduce Mol-CADiff, a novel diffusion-based framework that uses causal attention mechanisms for text-conditional molecular generation. Our approach explicitly models the causal relationship between textual prompts and molecular structures, overcoming key limitations in existing methods. We enhance dependency modeling both within and across modalities, enabling precise control over the generation process. Our extensive experiments demonstrate that Mol-CADiff outperforms state-of-the-art methods in generating diverse, novel, and chemically valid molecules, with better alignment to specified properties, enabling more intuitive language-driven molecular design.
[LG-16] Statistical Deficiency for Task Inclusion Estimation
链接: https://arxiv.org/abs/2503.05491
作者: Loïc Fosse,Frédéric Béchet,Benoît Favre,Géraldine Damnati,Gwénolé Lecorvé,Maxime Darrin,Philippe Formont,Pablo Piantanida
类目: Machine Learning (cs.LG)
*备注: 34 pages
点击查看摘要
Abstract:Tasks are central in machine learning, as they are the most natural objects to assess the capabilities of current models. The trend is to build general models able to address any task. Even though transfer learning and multitask learning try to leverage the underlying task space, no well-founded tools are available to study its structure. This study proposes a theoretically grounded setup to define the notion of task and to compute the \bf inclusion between two tasks from a statistical deficiency point of view. We propose a tractable proxy as information sufficiency to estimate the degree of inclusion between tasks, show its soundness on synthetic data, and use it to reconstruct empirically the classic NLP pipeline.
[LG-17] Bridging the Semantic Gap in Virtual Machine Introspection and Forensic Memory Analysis
链接: https://arxiv.org/abs/2503.05482
作者: Christofer Fellicious,Hans P. Reiser,Michael Granitzer
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Forensic Memory Analysis (FMA) and Virtual Machine Introspection (VMI) are critical tools for security in a virtualization-based approach. VMI and FMA involves using digital forensic methods to extract information from the system to identify and explain security incidents. A key challenge in both FMA and VMI is the “Semantic Gap”, which is the difficulty of interpreting raw memory data without specialized tools and expertise. In this work, we investigate how a priori knowledge, metadata and engineered features can aid VMI and FMA, leveraging machine learning to automate information extraction and reduce the workload of forensic investigators. We choose OpenSSH as our use case to test different methods to extract high level structures. We also test our method on complete physical memory dumps to showcase the effectiveness of the engineered features. Our features range from basic statistical features to advanced graph-based representations using malloc headers and pointer translations. The training and testing are carried out on public datasets that we compare against already recognized baseline methods. We show that using metadata, we can improve the performance of the algorithm when there is very little training data and also quantify how having more data results in better generalization performance. The final contribution is an open dataset of physical memory dumps, totalling more than 1 TB of different memory state, software environments, main memory capacities and operating system versions. Our methods show that having more metadata boosts performance with all methods obtaining an F1-Score of over 80%. Our research underscores the possibility of using feature engineering and machine learning techniques to bridge the semantic gap.
[LG-18] Enhancing Network Security: A Hybrid Approach for Detection and Mitigation of Distributed Denial-of-Service Attacks Using Machine Learning
链接: https://arxiv.org/abs/2503.05477
作者: Nizo Jaman Shohan,Gazi Tanbhir,Faria Elahi,Ahsan Ullah,Md. Nazmus Sakib
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Part of the book series: Communications in Computer and Information Science ((CCIS,volume 2091))
点击查看摘要
Abstract:The distributed denial-of-service (DDoS) attack stands out as a highly formidable cyber threat, representing an advanced form of the denial-of-service (DoS) attack. A DDoS attack involves multiple computers working together to overwhelm a system, making it unavailable. On the other hand, a DoS attack is a one-on-one attempt to make a system or website inaccessible. Thus, it is crucial to construct an effective model for identifying various DDoS incidents. Although extensive research has focused on binary detection models for DDoS identification, they face challenges to adapt evolving threats, necessitating frequent updates. Whereas multiclass detection models offer a comprehensive defense against diverse DDoS attacks, ensuring adaptability in the ever-changing cyber threat landscape. In this paper, we propose a Hybrid Model to strengthen network security by combining the featureextraction abilities of 1D Convolutional Neural Networks (CNNs) with the classification skills of Random Forest (RF) and Multi-layer Perceptron (MLP) classifiers. Using the CIC-DDoS2019 dataset, we perform multiclass classification of various DDoS attacks and conduct a comparative analysis of evaluation metrics for RF, MLP, and our proposed Hybrid Model. After analyzing the results, we draw meaningful conclusions and confirm the superiority of our Hybrid Model by performing thorough cross-validation. Additionally, we integrate our machine learning model with Snort, which provides a robust and adaptive solution for detecting and mitigating various DDoS attacks.
[LG-19] Quantum-PEFT: Ultra parameter-efficient fine-tuning ICLR2025
链接: https://arxiv.org/abs/2503.05431
作者: Toshiaki Koike-Akino,Francesco Tonin,Yongtao Wu,Frank Zhengqing Wu,Leyla Naz Candogan,Volkan Cevher
类目: Machine Learning (cs.LG)
*备注: ICLR 2025
点击查看摘要
Abstract:This paper introduces Quantum-PEFT that leverages quantum computations for parameter-efficient fine-tuning (PEFT). Unlike other additive PEFT methods, such as low-rank adaptation (LoRA), Quantum-PEFT exploits an underlying full-rank yet surprisingly parameter efficient quantum unitary parameterization. With the use of Pauli parameterization, the number of trainable parameters grows only logarithmically with the ambient dimension, as opposed to linearly as in LoRA-based PEFT methods. Quantum-PEFT achieves vanishingly smaller number of trainable parameters than the lowest-rank LoRA as dimensions grow, enhancing parameter efficiency while maintaining a competitive performance. We apply Quantum-PEFT to several transfer learning benchmarks in language and vision, demonstrating significant advantages in parameter efficiency.
[LG-20] Physics-based machine learning for fatigue lifetime prediction under non-uniform loading scenarios
链接: https://arxiv.org/abs/2503.05419
作者: Abedulgader Baktheer,Fadi Aldakheel
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Accurate lifetime prediction of structures subjected to cyclic loading is vital, especially in scenarios involving non-uniform loading histories where load sequencing critically influences structural durability. Addressing this complexity requires advanced modeling approaches capable of capturing the intricate relationship between loading sequences and fatigue lifetime. Traditional fatigue simulations are computationally prohibitive, necessitating more efficient methods. This study highlights the potential of physics-based machine learning ( \phi ML) to predict the fatigue lifetime of materials. Specifically, a FFNN is designed to embed physical constraints from experimental evidence directly into its architecture to enhance prediction accuracy. It is trained using numerical simulations generated by a physically based anisotropic continuum damage fatigue model. The model is calibrated and validated against experimental fatigue data of concrete cylinder specimens tested in uniaxial compression. The proposed approach demonstrates superior accuracy compared to purely data-driven neural networks, particularly in situations with limited training data, achieving realistic predictions of damage accumulation. Thus, a general algorithm is developed and successfully applied to predict fatigue lifetimes under complex loading scenarios with multiple loading ranges. Hereby, the \phi ML model serves as a surrogate to capture damage evolution across load transitions. The \phi ML based algorithm is subsequently employed to investigate the influence of multiple loading transitions on accumulated fatigue life, and its predictions align with trends observed in recent experimental studies. This work demonstrates \phi ML as a promising technique for efficient and reliable fatigue life prediction in engineering structures, with possible integration into digital twin models for real-time assessment.
[LG-21] Routing for Large ML Models
链接: https://arxiv.org/abs/2503.05324
作者: Ofir Cohen,Jose Yallouz Michael Schapira,Shahar Belkar,Tal Mizrahi
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Training large language models (LLMs), and other large machine learning models, involves repeated communication of large volumes of data across a data center network. The communication patterns induced by these training process exhibit high regularity and persistence, giving rise to significant opportunities for optimizing the manner in which flows are routed across the network. We present an algorithmic framework for \textitquantifying network-wide efficiency in the context of training LLMs (and other large-scale ML models), and for periodically \textitoptimizing routing with respect to this global metric.
[LG-22] CoinRobot: Generalized End-to-end Robotic Learning for Physical Intelligence
链接: https://arxiv.org/abs/2503.05316
作者: Yu Zhao,Huxian Liu,Xiang Chen,Jiankai Sun,Jiahuan Yan,Luhui Hu
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Physical intelligence holds immense promise for advancing embodied intelligence, enabling robots to acquire complex behaviors from demonstrations. However, achieving generalization and transfer across diverse robotic platforms and environments requires careful design of model architectures, training strategies, and data diversity. Meanwhile existing systems often struggle with scalability, adaptability to heterogeneous hardware, and objective evaluation in real-world settings. We present a generalized end-to-end robotic learning framework designed to bridge this gap. Our framework introduces a unified architecture that supports cross-platform adaptability, enabling seamless deployment across industrial-grade robots, collaborative arms, and novel embodiments without task-specific modifications. By integrating multi-task learning with streamlined network designs, it achieves more robust performance than conventional approaches, while maintaining compatibility with varying sensor configurations and action spaces. We validate our framework through extensive experiments on seven manipulation tasks. Notably, Diffusion-based models trained in our framework demonstrated superior performance and generalizability compared to the LeRobot framework, achieving performance improvements across diverse robotic platforms and environmental conditions.
[LG-23] LoRACode: LoRA Adapters for Code Embeddings ICLR2025
链接: https://arxiv.org/abs/2503.05315
作者: Saumya Chaturvedi,Aman Chadha,Laurent Bindschaedler
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR); Software Engineering (cs.SE)
*备注: Accepted at the Deep Learning for Code (DL4C) Workshop at ICLR 2025
点击查看摘要
Abstract:Code embeddings are essential for semantic code search; however, current approaches often struggle to capture the precise syntactic and contextual nuances inherent in code. Open-source models such as CodeBERT and UniXcoder exhibit limitations in scalability and efficiency, while high-performing proprietary systems impose substantial computational costs. We introduce a parameter-efficient fine-tuning method based on Low-Rank Adaptation (LoRA) to construct task-specific adapters for code retrieval. Our approach reduces the number of trainable parameters to less than two percent of the base model, enabling rapid fine-tuning on extensive code corpora (2 million samples in 25 minutes on two H100 GPUs). Experiments demonstrate an increase of up to 9.1% in Mean Reciprocal Rank (MRR) for Code2Code search, and up to 86.69% for Text2Code search tasks across multiple programming languages. Distinction in task-wise and language-wise adaptation helps explore the sensitivity of code retrieval for syntactical and linguistic variations.
[LG-24] Robust Intrusion Detection System with Explainable Artificial Intelligence
链接: https://arxiv.org/abs/2503.05303
作者: Betül Güvenç Paltun,Ramin Fuladi,Rim El Malki
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Machine learning (ML) models serve as powerful tools for threat detection and mitigation; however, they also introduce potential new risks. Adversarial input can exploit these models through standard interfaces, thus creating new attack pathways that threaten critical network operations. As ML advancements progress, adversarial strategies become more advanced, and conventional defenses such as adversarial training are costly in computational terms and often fail to provide real-time detection. These methods typically require a balance between robustness and model performance, which presents challenges for applications that demand instant response. To further investigate this vulnerability, we suggest a novel strategy for detecting and mitigating adversarial attacks using eXplainable Artificial Intelligence (XAI). This approach is evaluated in real time within intrusion detection systems (IDS), leading to the development of a zero-touch mitigation strategy. Additionally, we explore various scenarios in the Radio Resource Control (RRC) layer within the Open Radio Access Network (O-RAN) framework, emphasizing the critical need for enhanced mitigation techniques to strengthen IDS defenses against advanced threats and implement a zero-touch mitigation solution. Extensive testing across different scenarios in the RRC layer of the O-RAN infrastructure validates the ability of the framework to detect and counteract integrated RRC-layer attacks when paired with adversarial strategies, emphasizing the essential need for robust defensive mechanisms to strengthen IDS against complex threats.
[LG-25] An Analytical Model for Overparameterized Learning Under Class Imbalance
链接: https://arxiv.org/abs/2503.05289
作者: Eliav Mor,Yair Carmon
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We study class-imbalanced linear classification in a high-dimensional Gaussian mixture model. We develop a tight, closed form approximation for the test error of several practical learning methods, including logit adjustment and class dependent temperature. Our approximation allows us to analytically tune and compare these methods, highlighting how and when they overcome the pitfalls of standard cross-entropy minimization. We test our theoretical findings on simulated data and imbalanced CIFAR10, MNIST and FashionMNIST datasets.
[LG-26] Mastering Continual Reinforcement Learning through Fine-Grained Sparse Network Allocation and Dormant Neuron Exploration
链接: https://arxiv.org/abs/2503.05246
作者: Chengqi Zheng,Haiyan Yin,Jianda Chen,Terrence Ng,Yew-Soon Ong,Ivor Tsang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Continual Reinforcement Learning (CRL) is essential for developing agents that can learn, adapt, and accumulate knowledge over time. However, a fundamental challenge persists as agents must strike a delicate balance between plasticity, which enables rapid skill acquisition, and stability, which ensures long-term knowledge retention while preventing catastrophic forgetting. In this paper, we introduce SSDE, a novel structure-based approach that enhances plasticity through a fine-grained allocation strategy with Structured Sparsity and Dormant-guided Exploration. SSDE decomposes the parameter space into forward-transfer (frozen) parameters and task-specific (trainable) parameters. Crucially, these parameters are allocated by an efficient co-allocation scheme under sparse coding, ensuring sufficient trainable capacity for new tasks while promoting efficient forward transfer through frozen parameters. However, structure-based methods often suffer from rigidity due to the accumulation of non-trainable parameters, limiting exploration and adaptability. To address this, we further introduce a sensitivity-guided neuron reactivation mechanism that systematically identifies and resets dormant neurons, which exhibit minimal influence in the sparse policy network during inference. This approach effectively enhance exploration while preserving structural efficiency. Extensive experiments on the CW10-v1 Continual World benchmark demonstrate that SSDE achieves state-of-the-art performance, reaching a success rate of 95%, surpassing prior methods significantly in both plasticity and stability trade-offs (code is available at: this https URL).
[LG-27] Guaranteeing Out-Of-Distribution Detection in Deep RL via Transition Estimation
链接: https://arxiv.org/abs/2503.05238
作者: Mohit Prashant,Arvind Easwaran,Suman Das,Michael Yuhas
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:An issue concerning the use of deep reinforcement learning (RL) agents is whether they can be trusted to perform reliably when deployed, as training environments may not reflect real-life environments. Anticipating instances outside their training scope, learning-enabled systems are often equipped with out-of-distribution (OOD) detectors that alert when a trained system encounters a state it does not recognize or in which it exhibits uncertainty. There exists limited work conducted on the problem of OOD detection within RL, with prior studies being unable to achieve a consensus on the definition of OOD execution within the context of RL. By framing our problem using a Markov Decision Process, we assume there is a transition distribution mapping each state-action pair to another state with some probability. Based on this, we consider the following definition of OOD execution within RL: A transition is OOD if its probability during real-life deployment differs from the transition distribution encountered during training. As such, we utilize conditional variational autoencoders (CVAE) to approximate the transition dynamics of the training environment and implement a conformity-based detector using reconstruction loss that is able to guarantee OOD detection with a pre-determined confidence level. We evaluate our detector by adapting existing benchmarks and compare it with existing OOD detection models for RL.
[LG-28] Robustness of Generalized Median Computation for Consensus Learning in Arbitrary Spaces
链接: https://arxiv.org/abs/2503.05215
作者: Andreas Nienkötter,Sandro Vega-Pons,Xiaoyi Jiang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Robustness in terms of outliers is an important topic and has been formally studied for a variety of problems in machine learning and computer vision. Generalized median computation is a special instance of consensus learning and a common approach to finding prototypes. Related research can be found in numerous problem domains with a broad range of applications. So far, however, robustness of generalized median has only been studied in a few specific spaces. To our knowledge, there is no robustness characterization in a general setting, i.e. for arbitrary spaces. We address this open issue in our work. The breakdown point =0.5 is proved for generalized median with metric distance functions in general. We also study the detailed behavior in case of outliers from different perspectives. In addition, we present robustness results for weighted generalized median computation and non-metric distance functions. Given the importance of robustness, our work contributes to closing a gap in the literature. The presented results have general impact and applicability, e.g. providing deeper understanding of generalized median computation and practical guidance to avoid non-robust computation.
[LG-29] Safety-Critical Traffic Simulation with Adversarial Transfer of Driving Intentions ICRA2025
链接: https://arxiv.org/abs/2503.05180
作者: Zherui Huang,Xing Gao,Guanjie Zheng,Licheng Wen,Xuemeng Yang,Xiao Sun
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted by ICRA 2025
点击查看摘要
Abstract:Traffic simulation, complementing real-world data with a long-tail distribution, allows for effective evaluation and enhancement of the ability of autonomous vehicles to handle accident-prone scenarios. Simulating such safety-critical scenarios is nontrivial, however, from log data that are typically regular scenarios, especially in consideration of dynamic adversarial interactions between the future motions of autonomous vehicles and surrounding traffic participants. To address it, this paper proposes an innovative and efficient strategy, termed IntSim, that explicitly decouples the driving intentions of surrounding actors from their motion planning for realistic and efficient safety-critical simulation. We formulate the adversarial transfer of driving intention as an optimization problem, facilitating extensive exploration of diverse attack behaviors and efficient solution convergence. Simultaneously, intention-conditioned motion planning benefits from powerful deep models and large-scale real-world data, permitting the simulation of realistic motion behaviors for actors. Specially, through adapting driving intentions based on environments, IntSim facilitates the flexible realization of dynamic adversarial interactions with autonomous vehicles. Finally, extensive open-loop and closed-loop experiments on real-world datasets, including nuScenes and Waymo, demonstrate that the proposed IntSim achieves state-of-the-art performance in simulating realistic safety-critical scenarios and further improves planners in handling such scenarios.
[LG-30] phepy: Visual Benchmarks and Improvements for Out-of-Distribution Detectors
链接: https://arxiv.org/abs/2503.05169
作者: Juniper Tyree,Andreas Rupp,Petri S. Clusius,Michael H. Boy
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Applying machine learning to increasingly high-dimensional problems with sparse or biased training data increases the risk that a model is used on inputs outside its training domain. For such out-of-distribution (OOD) inputs, the model can no longer make valid predictions, and its error is potentially unbounded. Testing OOD detection methods on real-world datasets is complicated by the ambiguity around which inputs are in-distribution (ID) or OOD. We design a benchmark for OOD detection, which includes three novel and easily-visualisable toy examples. These simple examples provide direct and intuitive insight into whether the detector is able to detect (1) linear and (2) non-linear concepts and (3) identify thin ID subspaces (needles) within high-dimensional spaces (haystacks). We use our benchmark to evaluate the performance of various methods from the literature. Since tactile examples of OOD inputs may benefit OOD detection, we also review several simple methods to synthesise OOD inputs for supervised training. We introduce two improvements, t -poking and OOD sample weighting, to make supervised detectors more precise at the ID-OOD boundary. This is especially important when conflicts between real ID and synthetic OOD sample blur the decision boundary. Finally, we provide recommendations for constructing and applying out-of-distribution detectors in machine learning. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2503.05169 [cs.LG] (or arXiv:2503.05169v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2503.05169 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Juniper Tyree [view email] [v1] Fri, 7 Mar 2025 06:25:20 UTC (4,920 KB)
[LG-31] FMCHS: Advancing Traditional Chinese Medicine Herb Recommendation with Fusion of Multiscale Correlations of Herbs and Symptoms
链接: https://arxiv.org/abs/2503.05167
作者: Xinhan Zheng,Huyu Wu,Haopeng Jin,Ruotai Li
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Traditional Chinese medicine (TCM) exhibits remarkable therapeutic efficacy in disease treatment and healthcare through personalized herb prescriptions. However, current herb recommendation models inadequately capture the multiscale relations between herbs and clinical symptoms, particularly neglecting latent correlations at the chemical-molecular scale. To address these limitations, we propose the Fusion of Multiscale Correlations of Herbs and Symptoms (FMCHS), an innovative framework that synergistically integrates molecular-scale chemical characteristics of herbs with clinical symptoms. The framework employs multi-relational graph transformer layers to generate enriched embeddings that preserve both structural and semantic features within herbs and symptoms. Through systematic incorporation of herb chemical profiles into node embeddings and implementation of attention-based feature fusion, FMCHS effectively utilizes multiscale correlations. Comprehensive evaluations demonstrate FMCHS’s superior performance over the state-of-the-art (SOTA) baseline, achieving relative improvements of 8.85% in Precision@5, 12.30% in Recall@5, and 10.86% in F1@5 compared to the SOTA model on benchmark datasets. This work facilitates the practical application of TCM in disease treatment and healthcare.
[LG-32] Unity RL Playground: A Versatile Reinforcement Learning Framework for Mobile Robots
链接: https://arxiv.org/abs/2503.05146
作者: Linqi Ye,Rankun Li,Xiaowen Hu,Jiayi Li,Boyang Xing,Yan Peng,Bin Liang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper introduces Unity RL Playground, an open-source reinforcement learning framework built on top of Unity ML-Agents. Unity RL Playground automates the process of training mobile robots to perform various locomotion tasks such as walking, running, and jumping in simulation, with the potential for seamless transfer to real hardware. Key features include one-click training for imported robot models, universal compatibility with diverse robot configurations, multi-mode motion learning capabilities, and extreme performance testing to aid in robot design optimization and morphological evolution. The attached video can be found at this https URL and the code is coming soon.
[LG-33] AI-driven Prediction of Insulin Resistance in Normal Populations: Comparing Models and Criteria
链接: https://arxiv.org/abs/2503.05119
作者: Weihao Gao,Zhuo Deng,Zheng Gong,Ziyi Jiang,Lan Ma
类目: Machine Learning (cs.LG)
*备注: 20 pages, 8 figures
点击查看摘要
Abstract:Insulin resistance (IR) is a key precursor to diabetes and a significant risk factor for cardiovascular disease. Traditional IR assessment methods require multiple blood tests. We developed a simple AI model using only fasting blood glucose to predict IR in non-diabetic populations. Data from the NHANES (1999-2020) and CHARLS (2015) studies were used for model training and validation. Input features included age, gender, height, weight, blood pressure, waist circumference, and fasting blood glucose. The CatBoost algorithm achieved AUC values of 0.8596 (HOMA-IR) and 0.7777 (TyG index) in NHANES, with an external AUC of 0.7442 for TyG. For METS-IR prediction, the model achieved AUC values of 0.9731 (internal) and 0.9591 (external), with RMSE values of 3.2643 (internal) and 3.057 (external). SHAP analysis highlighted waist circumference as a key predictor of IR. This AI model offers a minimally invasive and effective tool for IR prediction, supporting early diabetes and cardiovascular disease prevention.
[LG-34] Partial Distribution Alignment via Adaptive Optimal Transport
链接: https://arxiv.org/abs/2503.05087
作者: Pei Yang,Qi Tan,Guihua Wen
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:To remedy the drawbacks of full-mass or fixed-mass constraints in classical optimal transport, we propose adaptive optimal transport which is distinctive from the classical optimal transport in its ability of adaptive-mass preserving. It aims to answer the mathematical problem of how to transport the probability mass adaptively between probability distributions, which is a fundamental topic in various areas of artificial intelligence. Adaptive optimal transport is able to transfer mass adaptively in the light of the intrinsic structure of the problem itself. The theoretical results shed light on the adaptive mechanism of mass transportation. Furthermore, we instantiate the adaptive optimal transport in machine learning application to align source and target distributions partially and adaptively by respecting the ubiquity of noises, outliers, and distribution shifts in the data. The experiment results on the domain adaptation benchmarks show that the proposed method significantly outperforms the state-of-the-art algorithms.
[LG-35] On a Connection Between Imitation Learning and RLHF ICLR2025
链接: https://arxiv.org/abs/2503.05079
作者: Teng Xiao,Yige Yuan,Mingxiao Li,Zhengyu Chen,Vasant G Honavar
类目: Machine Learning (cs.LG)
*备注: ICLR 2025
点击查看摘要
Abstract:This work studies the alignment of large language models with preference data from an imitation learning perspective. We establish a close theoretical connection between reinforcement learning from human feedback RLHF and imitation learning (IL), revealing that RLHF implicitly performs imitation learning on the preference data distribution. Building on this connection, we propose DIL, a principled framework that directly optimizes the imitation learning objective. DIL provides a unified imitation learning perspective on alignment, encompassing existing alignment algorithms as special cases while naturally introducing new variants. By bridging IL and RLHF, DIL offers new insights into alignment with RLHF. Extensive experiments demonstrate that DIL outperforms existing methods on various challenging benchmarks.
[LG-36] A new local time-decoupled squared Wasserstein-2 method for training stochastic neural networks to reconstruct uncertain parameters in dynamical systems
链接: https://arxiv.org/abs/2503.05068
作者: Mingtao Xia,Qijing Shen,Philip Maini,Eamonn Gaffney,Alex Mogilner
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:
点击查看摘要
Abstract:In this work, we propose and analyze a new local time-decoupled squared Wasserstein-2 method for reconstructing the distribution of unknown parameters in dynamical systems. Specifically, we show that a stochastic neural network model, which can be effectively trained by minimizing our proposed local time-decoupled squared Wasserstein-2 loss function, is an effective model for approximating the distribution of uncertain model parameters in dynamical systems. Through several numerical examples, we showcase the effectiveness of our proposed method in reconstructing the distribution of parameters in different dynamical systems.
[LG-37] QuietPaw: Learning Quadrupedal Locomotion with Versatile Noise Preference Alignment
链接: https://arxiv.org/abs/2503.05035
作者: Yuyou Zhang,Yihang Yao,Shiqi Liu,Yaru Niu,Changyi Lin,Yuxiang Yang,Wenhao Yu,Tingnan Zhang,Jie Tan,Ding Zhao
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:When operating at their full capacity, quadrupedal robots can produce loud footstep noise, which can be disruptive in human-centered environments like homes, offices, and hospitals. As a result, balancing locomotion performance with noise constraints is crucial for the successful real-world deployment of quadrupedal robots. However, achieving adaptive noise control is challenging due to (a) the trade-off between agility and noise minimization, (b) the need for generalization across diverse deployment conditions, and © the difficulty of effectively adjusting policies based on noise requirements. We propose QuietPaw, a framework incorporating our Conditional Noise-Constrained Policy (CNCP), a constrained learning-based algorithm that enables flexible, noise-aware locomotion by conditioning policy behavior on noise-reduction levels. We leverage value representation decomposition in the critics, disentangling state representations from condition-dependent representations and this allows a single versatile policy to generalize across noise levels without retraining while improving the Pareto trade-off between agility and noise reduction. We validate our approach in simulation and the real world, demonstrating that CNCP can effectively balance locomotion performance and noise constraints, achieving continuously adjustable noise reduction.
[LG-38] Efficient Algorithms for Verifying Kruskal Rank in Sparse Linear Regression and Related Applications
链接: https://arxiv.org/abs/2503.04986
作者: Fengqin Zhou
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We present novel algorithmic techniques to efficiently verify the Kruskal rank of matrices that arise in sparse linear regression, tensor decomposition, and latent variable models. Our unified framework combines randomized hashing techniques with dynamic programming strategies, and is applicable in various settings, including binary fields, general finite fields, and integer matrices. In particular, our algorithms achieve a runtime of \mathcalO\left(dk \cdot \left(nM\right)^\lceil k / 2 \rceil\right) while ensuring high-probability correctness. Our contributions include: A unified framework for verifying Kruskal rank across different algebraic settings; Rigorous runtime and high-probability guarantees that nearly match known lower bounds; Practical implications for identifiability in tensor decompositions and deep learning, particularly for the estimation of noise transition matrices.
[LG-39] Energy-Weighted Flow Matching for Offline Reinforcement Learning ICLR2025
链接: https://arxiv.org/abs/2503.04975
作者: Shiyuan Zhang,Weitong Zhang,Quanquan Gu
类目: Machine Learning (cs.LG)
*备注: 28 pages, 11 figures, accepted by ICLR 2025
点击查看摘要
Abstract:This paper investigates energy guidance in generative modeling, where the target distribution is defined as q(\mathbf x) \propto p(\mathbf x)\exp(-\beta \mathcal E(\mathbf x)) , with p(\mathbf x) being the data distribution and \mathcal E(\mathcal x) as the energy function. To comply with energy guidance, existing methods often require auxiliary procedures to learn intermediate guidance during the diffusion process. To overcome this limitation, we explore energy-guided flow matching, a generalized form of the diffusion process. We introduce energy-weighted flow matching (EFM), a method that directly learns the energy-guided flow without the need for auxiliary models. Theoretical analysis shows that energy-weighted flow matching accurately captures the guided flow. Additionally, we extend this methodology to energy-weighted diffusion models and apply it to offline reinforcement learning (RL) by proposing the Q-weighted Iterative Policy Optimization (QIPO). Empirically, we demonstrate that the proposed QIPO algorithm improves performance in offline RL tasks. Notably, our algorithm is the first energy-guided diffusion model that operates independently of auxiliary models and the first exact energy-guided flow matching model in the literature.
[LG-40] MarsLGPR: Mars Rover Localization with Ground Penetrating Radar
链接: https://arxiv.org/abs/2503.04944
作者: Anja Sheppard,Katherine A. Skinner
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this work, we propose the use of Ground Penetrating Radar (GPR) for rover localization on Mars. Precise pose estimation is an important task for mobile robots exploring planetary surfaces, as they operate in GPS-denied environments. Although visual odometry provides accurate localization, it is computationally expensive and can fail in dim or high-contrast lighting. Wheel encoders can also provide odometry estimation, but are prone to slipping on the sandy terrain encountered on Mars. Although traditionally a scientific surveying sensor, GPR has been used on Earth for terrain classification and localization through subsurface feature matching. The Perseverance rover and the upcoming ExoMars rover have GPR sensors already equipped to aid in the search of water and mineral resources. We propose to leverage GPR to aid in Mars rover localization. Specifically, we develop a novel GPR-based deep learning model that predicts 1D relative pose translation. We fuse our GPR pose prediction method with inertial and wheel encoder data in a filtering framework to output rover localization. We perform experiments in a Mars analog environment and demonstrate that our GPR-based displacement predictions both outperform wheel encoders and improve multi-modal filtering estimates in high-slip environments. Lastly, we present the first dataset aimed at GPR-based localization in Mars analog environments, which will be made publicly available upon publication.
[LG-41] Neural Configuration-Space Barriers for Manipulation Planning and Control
链接: https://arxiv.org/abs/2503.04929
作者: Kehan Long,Ki Myung Brian Lee,Nikola Raicevic,Niyas Attasseri,Melvin Leok,Nikolay Atanasov
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:Planning and control for high-dimensional robot manipulators in cluttered, dynamic environments require both computational efficiency and robust safety guarantees. Inspired by recent advances in learning configuration-space distance functions (CDFs) as robot body representations, we propose a unified framework for motion planning and control that formulates safety constraints as CDF barriers. A CDF barrier approximates the local free configuration space, substantially reducing the number of collision-checking operations during motion planning. However, learning a CDF barrier with a neural network and relying on online sensor observations introduce uncertainties that must be considered during control synthesis. To address this, we develop a distributionally robust CDF barrier formulation for control that explicitly accounts for modeling errors and sensor noise without assuming a known underlying distribution. Simulations and hardware experiments on a 6-DoF xArm manipulator show that our neural CDF barrier formulation enables efficient planning and robust real-time safe control in cluttered and dynamic environments, relying only on onboard point-cloud observations.
[LG-42] Out-of-Distribution Radar Detection in Compound Clutter and Thermal Noise through Variational Autoencoders ICASSP
链接: https://arxiv.org/abs/2503.04861
作者: Y A Rouzoumka(SONDRA),E Terreaux,C Morisseau,J.-P Ovarlez(SONDRA),C Ren(SONDRA)
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICASSP, IEEE, Apr 2025, Hyderabad, India
点击查看摘要
Abstract:This paper presents a novel approach to radar target detection using Variational AutoEncoders (VAEs). Known for their ability to learn complex distributions and identify out-ofdistribution samples, the proposed VAE architecture effectively distinguishes radar targets from various noise types, including correlated Gaussian and compound Gaussian clutter, often combined with additive white Gaussian thermal noise. Simulation results demonstrate that the proposed VAE outperforms classical adaptive detectors such as the Matched Filter and the Normalized Matched Filter, especially in challenging noise conditions, highlighting its robustness and adaptability in radar applications.
[LG-43] A kinetic-based regularization method for data science applications
链接: https://arxiv.org/abs/2503.04857
作者: Abhisek Ganguly,Alessandro Gabbana,Vybhav Rao,Sauro Succi,Santosh Ansumali
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We propose a physics-based regularization technique for function learning, inspired by statistical mechanics. By drawing an analogy between optimizing the parameters of an interpolator and minimizing the energy of a system, we introduce corrections that impose constraints on the lower-order moments of the data distribution. This minimizes the discrepancy between the discrete and continuum representations of the data, in turn allowing to access more favorable energy landscapes, thus improving the accuracy of the interpolator. Our approach improves performance in both interpolation and regression tasks, even in high-dimensional spaces. Unlike traditional methods, it does not require empirical parameter tuning, making it particularly effective for handling noisy data. We also show that thanks to its local nature, the method offers computational and memory efficiency advantages over Radial Basis Function interpolators, especially for large datasets.
[LG-44] Slow is Fast! Dissecting Ethereums Slow Liquidity Drain
链接: https://arxiv.org/abs/2503.04850
作者: Minh Trung Tran,Nasrin Sohrabi,Zahir Tari,Qin Wang,Xiaoyu Xia
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We identify the slow liquidity drain (SLID) scam, an insidious and highly profitable threat to decentralized finance (DeFi), posing a large-scale, persistent, and growing risk to the ecosystem. Unlike traditional scams such as rug pulls or honeypots (USENIX Sec’19, USENIX Sec’23), SLID gradually siphons funds from liquidity pools over extended periods, making detection significantly more challenging. In this paper, we conducted the first large-scale empirical analysis of 319,166 liquidity pools across six major decentralized exchanges (DEXs) since 2018. We identified 3,117 SLID affected liquidity pools, resulting in cumulative losses of more than US 103 million. We propose a rule-based heuristic and an enhanced machine learning model for early detection. Our machine learning model achieves a detection speed 4.77 times faster than the heuristic while maintaining 95% accuracy. Our study establishes a foundation for protecting DeFi investors at an early stage and promoting transparency in the DeFi ecosystem.
[LG-45] Electricity Demand Forecasting in Future Grid States: A Digital Twin-Based Simulation Study
链接: https://arxiv.org/abs/2503.04757
作者: Daniel R. Bayer,Felix Haag,Marco Pruckner,Konstantin Hopf
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Presented at the 9th International Conference on Smart and Sustainable Technologies (SpliTech 2024), June 25–28, 2024, Bol and Split, Croatia. This is the author’s version of the work. It is posted here for your personal use, not for redistribution. Please cite the officially published version
点击查看摘要
Abstract:Short-term forecasting of residential electricity demand is an important task for utilities. Yet, many small and medium-sized utilities still use simple forecasting approaches such as Synthesized Load Profiles, which treat residential households similarly and neither account for renewable energy installations nor novel large consumers (e.g., heat pumps, electric vehicles). The effectiveness of such “one-fits-all” approaches in future grid states–where decentral generation and sector coupling increases–are questionable. Our study challenges these forecasting practices and investigates whether Machine Learning (ML) approaches are suited to predict electricity demand in today’s and in future grid states. We use real smart meter data from 3,511 households in Germany over 34 months. We extrapolate this data with future grid states (i.e., increased decentral generation and storage) based on a digital twin of a local energy system. Our results show that Long Short-Term Memory (LSTM) approaches outperform SLPs as well as simple benchmark estimators with up to 68.5% lower Root Mean Squared Error for a day-ahead forecast, especially in future grid states. Nevertheless, all prediction approaches perform worse in future grid states. Our findings therefore reinforce the need (a) for utilities and grid operators to employ ML approaches instead of traditional demand prediction methods in future grid states and (b) to prepare current ML methods for future grid states.
[LG-46] How Personality Traits Shape LLM Risk-Taking Behaviour
链接: https://arxiv.org/abs/2503.04735
作者: John Hartley,Conor Hamill,Devesh Batra,Dale Seddon,Ramin Okhrati,Raad Khraishi
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly deployed as autonomous agents, necessitating a deeper understanding of their decision-making behaviour under risk. This study investigates the relationship between LLMs’ personality traits and risk propensity, employing cumulative prospect theory (CPT) and the Big Five personality framework. We focus on GPT-4o, comparing its behaviour to human baselines and earlier models. Our findings reveal that GPT-4o exhibits higher Conscientiousness and Agreeableness traits compared to human averages, while functioning as a risk-neutral rational agent in prospect selection. Interventions on GPT-4o’s Big Five traits, particularly Openness, significantly influence its risk propensity, mirroring patterns observed in human studies. Notably, Openness emerges as the most influential factor in GPT-4o’s risk propensity, aligning with human findings. In contrast, legacy models like GPT-4-Turbo demonstrate inconsistent generalization of the personality-risk relationship. This research advances our understanding of LLM behaviour under risk and elucidates the potential and limitations of personality-based interventions in shaping LLM decision-making. Our findings have implications for the development of more robust and predictable AI systems such as financial modelling.
[LG-47] On Mitigating Affinity Bias through Bandits with Evolving Biased Feedback
链接: https://arxiv.org/abs/2503.05662
作者: Matthew Faw,Constantine Caramanis,Jessica Hoffmann
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Unconscious bias has been shown to influence how we assess our peers, with consequences for hiring, promotions and admissions. In this work, we focus on affinity bias, the component of unconscious bias which leads us to prefer people who are similar to us, despite no deliberate intention of favoritism. In a world where the people hired today become part of the hiring committee of tomorrow, we are particularly interested in understanding (and mitigating) how affinity bias affects this feedback loop. This problem has two distinctive features: 1) we only observe the biased value of a candidate, but we want to optimize with respect to their real value 2) the bias towards a candidate with a specific set of traits depends on the fraction of people in the hiring committee with the same set of traits. We introduce a new bandits variant that exhibits those two features, which we call affinity bandits. Unsurprisingly, classical algorithms such as UCB often fail to identify the best arm in this setting. We prove a new instance-dependent regret lower bound, which is larger than that in the standard bandit setting by a multiplicative function of K . Since we treat rewards that are time-varying and dependent on the policy’s past actions, deriving this lower bound requires developing proof techniques beyond the standard bandit techniques. Finally, we design an elimination-style algorithm which nearly matches this regret, despite never observing the real rewards.
[LG-48] On the similarity of bandwidth-tuned quantum kernels and classical kernels
链接: https://arxiv.org/abs/2503.05602
作者: Roberto Flórez Ablan,Marco Roth,Jan Schnabel
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 9 main pages with 5 figures, and 9 appendix pages with 12 figures
点击查看摘要
Abstract:Quantum kernels (QK) are widely used in quantum machine learning applications; yet, their potential to surpass classical machine learning methods on classical datasets remains uncertain. This limitation can be attributed to the exponential concentration phenomenon, which can impair both trainability and generalization. A common strategy to alleviate this is bandwidth tuning, which involves rescaling data points in the quantum model to improve generalization. In this work, we numerically demonstrate that optimal bandwidth tuning results in QKs that closely resemble radial basis function (RBF) kernels, leading to a lack of quantum advantage over classical methods. Moreover, we reveal that the size of optimal bandwidth tuning parameters further simplifies QKs, causing them to behave like polynomial kernels, corresponding to a low-order Taylor approximation of a RBF kernel. We thoroughly investigate this for fidelity quantum kernels and projected quantum kernels using various data encoding circuits across several classification datasets. We provide numerical evidence and derive a simple analytical model that elucidates how bandwidth tuning influences key quantities in classification tasks. Overall, our findings shed light on the mechanisms that render QK methods classically simulatable.
[LG-49] opXRD: Open Experimental Powder X-ray Diffraction Database
链接: https://arxiv.org/abs/2503.05577
作者: Daniel Hollarek,Henrik Schopmans,Jona Östreicher,Jonas Teufel,Bin Cao,Adie Alwen,Simon Schweidler,Mriganka Singh,Tim Kodalle,Hanlin Hu,Gregoire Heymans,Maged Abdelsamie,Arthur Hardiagon,Alexander Wieczorek,Siarhei Zhuk,Ruth Schwaiger,Sebastian Siol,François-Xavier Coudert,Moritz Wolf,Carolin M. Sutter-Fella,Ben Breitung,Andrea M. Hodge,Tong-yi Zhang,Pascal Friederich
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Powder X-ray diffraction (pXRD) experiments are a cornerstone for materials structure characterization. Despite their widespread application, analyzing pXRD diffractograms still presents a significant challenge to automation and a bottleneck in high-throughput discovery in self-driving labs. Machine learning promises to resolve this bottleneck by enabling automated powder diffraction analysis. A notable difficulty in applying machine learning to this domain is the lack of sufficiently sized experimental datasets, which has constrained researchers to train primarily on simulated data. However, models trained on simulated pXRD patterns showed limited generalization to experimental patterns, particularly for low-quality experimental patterns with high noise levels and elevated backgrounds. With the Open Experimental Powder X-Ray Diffraction Database (opXRD), we provide an openly available and easily accessible dataset of labeled and unlabeled experimental powder diffractograms. Labeled opXRD data can be used to evaluate the performance of models on experimental data and unlabeled opXRD data can help improve the performance of models on experimental data, e.g. through transfer learning methods. We collected \numpatterns diffractograms, 2179 of them labeled, from a wide spectrum of materials classes. We hope this ongoing effort can guide machine learning research toward fully automated analysis of pXRD data and thus enable future self-driving materials labs.
[LG-50] Machine Learning for Improved Density Functional Theory Thermodynamics
链接: https://arxiv.org/abs/2503.05525
作者: Sergei I. Simak,Erna K. Delczeg-Czirjak,Olle Eriksson
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 9 pages, 5 figures, 1 table
点击查看摘要
Abstract:The predictive accuracy of density functional theory (DFT) for alloy formation enthalpies is often limited by intrinsic energy resolution errors, particularly in ternary phase stability calculations. In this work, we present a machine learning (ML) approach to systematically correct these errors, improving the reliability of first-principles predictions. A neural network model has been trained to predict the discrepancy between DFT-calculated and experimentally measured enthalpies for binary and ternary alloys and compounds. The model utilizes a structured feature set comprising elemental concentrations, atomic numbers, and interaction terms to capture key chemical and structural effects. By applying supervised learning and rigorous data curation we ensure a robust and physically meaningful correction. The model is implemented as a multi-layer perceptron (MLP) regressor with three hidden layers, optimized through leave-one-out cross-validation (LOOCV) and k-fold cross-validation to prevent overfitting. We illustrate the effectiveness of this method by applying it to the Al-Ni-Pd and Al-Ni-Ti systems, which are of interest for high-temperature applications in aerospace and protective coatings.
[LG-51] Semi-Supervised Learning for Dose Prediction in Targeted Radionuclide: A Synthetic Data Study
链接: https://arxiv.org/abs/2503.05367
作者: Jing Zhang,Alexandre Bousse,Laetitia Imbert,Song Xue,Kuangyu Shi,Julien Bert
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
*备注: 12 pages, 13 figures, 5 tables
点击查看摘要
Abstract:Targeted Radionuclide Therapy (TRT) is a modern strategy in radiation oncology that aims to administer a potent radiation dose specifically to cancer cells using cancer-targeting radiopharmaceuticals. Accurate radiation dose estimation tailored to individual patients is crucial. Deep learning, particularly with pre-therapy imaging, holds promise for personalizing TRT doses. However, current methods require large time series of SPECT imaging, which is hardly achievable in routine clinical practice, and thus raises issues of data availability. Our objective is to develop a semi-supervised learning (SSL) solution to personalize dosimetry using pre-therapy images. The aim is to develop an approach that achieves accurate results when PET/CT images are available, but are associated with only a few post-therapy dosimetry data provided by SPECT images. In this work, we introduce an SSL method using a pseudo-label generation approach for regression tasks inspired by the FixMatch framework. The feasibility of the proposed solution was preliminarily evaluated through an in-silico study using synthetic data and Monte Carlo simulation. Experimental results for organ dose prediction yielded promising outcomes, showing that the use of pseudo-labeled data provides better accuracy compared to using only labeled data.
[LG-52] Graph Alignment via Birkhoff Relaxation
链接: https://arxiv.org/abs/2503.05323
作者: Sushil Mahavir Varma,Irène Waldspurger,Laurent Massoulié
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Spectral Theory (math.SP); Statistics Theory (math.ST)
*备注:
点击查看摘要
Abstract:We consider the graph alignment problem, wherein the objective is to find a vertex correspondence between two graphs that maximizes the edge overlap. The graph alignment problem is an instance of the quadratic assignment problem (QAP), known to be NP-hard in the worst case even to approximately solve. In this paper, we analyze Birkhoff relaxation, a tight convex relaxation of QAP, and present theoretical guarantees on its performance when the inputs follow the Gaussian Wigner Model. More specifically, the weighted adjacency matrices are correlated Gaussian Orthogonal Ensemble with correlation 1/\sqrt1+\sigma^2 . Denote the optimal solutions of the QAP and Birkhoff relaxation by \Pi^\star and X^\star respectively. We show that |X^\star-\Pi^\star|_F^2 = o(n) when \sigma = o(n^-1.25) and |X^\star-\Pi^\star|_F^2 = \Omega(n) when \sigma = \Omega(n^-0.5) . Thus, the optimal solution X^\star transitions from a small perturbation of \Pi^\star for small \sigma to being well separated from \Pi^\star as \sigma becomes larger than n^-0.5 . This result allows us to guarantee that simple rounding procedures on X^\star align 1-o(1) fraction of vertices correctly whenever \sigma = o(n^-1.25) . This condition on \sigma to ensure the success of the Birkhoff relaxation is state-of-the-art.
[LG-53] Riemannian Metric Learning: Closer to You than You Imagine
链接: https://arxiv.org/abs/2503.05321
作者: Samuel Gruffaz,Josua Sassen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Differential Geometry (math.DG)
*备注:
点击查看摘要
Abstract:Riemannian metric learning is an emerging field in machine learning, unlocking new ways to encode complex data structures beyond traditional distance metric learning. While classical approaches rely on global distances in Euclidean space, they often fall short in capturing intrinsic data geometry. Enter Riemannian metric learning: a powerful generalization that leverages differential geometry to model the data according to their underlying Riemannian manifold. This approach has demonstrated remarkable success across diverse domains, from causal inference and optimal transport to generative modeling and representation learning. In this review, we bridge the gap between classical metric learning and Riemannian geometry, providing a structured and accessible overview of key methods, applications, and recent advances. We argue that Riemannian metric learning is not merely a technical refinement but a fundamental shift in how we think about data representations. Thus, this review should serve as a valuable resource for researchers and practitioners interested in exploring Riemannian metric learning and convince them that it is closer to them than they might imagine-both in theory and in practice.
[LG-54] Self-Supervised Penalty-Based Learning for Robust Constrained Optimization
链接: https://arxiv.org/abs/2503.05175
作者: Wyame Benslimane,Paul Grigas
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: To appear in the proceedings of CPAIOR 2025
点击查看摘要
Abstract:We propose a new methodology for parameterized constrained robust optimization, an important class of optimization problems under uncertainty, based on learning with a self-supervised penalty-based loss function. Whereas supervised learning requires pre-solved instances for training, our approach leverages a custom loss function derived from the exact penalty method in optimization to learn an approximation, typically defined by a neural network model, of the parameterized optimal solution mapping. Additionally, we adapt our approach to robust constrained combinatorial optimization problems by incorporating a surrogate linear cost over mixed integer domains, and a smooth approximations thereof, into the final layer of the network architecture. We perform computational experiments to test our approach on three different applications: multidimensional knapsack with continuous variables, combinatorial multidimensional knapsack with discrete variables, and an inventory management problem. Our results demonstrate that our self-supervised approach is able to effectively learn neural network approximations whose inference time is significantly smaller than the computation time of traditional solvers for this class of robust optimization problems. Furthermore, our results demonstrate that by varying the penalty parameter we are able to effectively balance the trade-off between sub-optimality and robust feasibility of the obtained solutions.
[LG-55] Empirical Bound Information-Directed Sampling for Norm-Agnostic Bandits
链接: https://arxiv.org/abs/2503.05098
作者: Piotr M. Suder,Eric Laber
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Information-directed sampling (IDS) is a powerful framework for solving bandit problems which has shown strong results in both Bayesian and frequentist settings. However, frequentist IDS, like many other bandit algorithms, requires that one have prior knowledge of a (relatively) tight upper bound on the norm of the true parameter vector governing the reward model in order to achieve good performance. Unfortunately, this requirement is rarely satisfied in practice. As we demonstrate, using a poorly calibrated bound can lead to significant regret accumulation. To address this issue, we introduce a novel frequentist IDS algorithm that iteratively refines a high-probability upper bound on the true parameter norm using accumulating data. We focus on the linear bandit setting with heteroskedastic subgaussian noise. Our method leverages a mixture of relevant information gain criteria to balance exploration aimed at tightening the estimated parameter norm bound and directly searching for the optimal action. We establish regret bounds for our algorithm that do not depend on an initially assumed parameter norm bound and demonstrate that our method outperforms state-of-the-art IDS and UCB algorithms.
[LG-56] Kernel-based estimators for functional causal effects
链接: https://arxiv.org/abs/2503.05024
作者: Yordan P. Raykov,Hengrui Luo,Justin D. Strait,Wasiur R. KhudaBukhsh
类目: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
点击查看摘要
Abstract:We propose causal effect estimators based on empirical Fréchet means and operator-valued kernels, tailored to functional data spaces. These methods address the challenges of high-dimensionality, sequential ordering, and model complexity while preserving robustness to treatment misspecification. Using structural assumptions, we obtain compact representations of potential outcomes, enabling scalable estimation of causal effects over time and across covariates. We provide both theoretical, regarding the consistency of functional causal effects, as well as empirical comparison of a range of proposed causal effect estimators. Applications to binary treatment settings with functional outcomes illustrate the framework’s utility in biomedical monitoring, where outcomes exhibit complex temporal dynamics. Our estimators accommodate scenarios with registered covariates and outcomes, aligning them to the Fréchet means, as well as cases requiring higher-order representations to capture intricate covariate-outcome interactions. These advancements extend causal inference to dynamic and non-linear domains, offering new tools for understanding complex treatment effects in functional data settings. Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST) MSC classes: 62G05 ACMclasses: G.3 Cite as: arXiv:2503.05024 [stat.ME] (or arXiv:2503.05024v1 [stat.ME] for this version) https://doi.org/10.48550/arXiv.2503.05024 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-57] Seismic inversion using hybrid quantum neural networks
链接: https://arxiv.org/abs/2503.05009
作者: Divakar Vashisth,Rohan Sharma,Tapan Mukerji,Mrinal K. Sen
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:
点击查看摘要
Abstract:Quantum computing leverages qubits, exploiting superposition and entanglement to solve problems intractable for classical computers, offering significant computational advantages. Quantum machine learning (QML), which integrates quantum computing with machine learning, holds immense potential across various fields but remains largely unexplored in geosciences. However, its progress is hindered by the limitations of current NISQ hardware. To address these challenges, hybrid quantum neural networks (HQNNs) have emerged, combining quantum layers within classical neural networks to leverage the strengths of both paradigms. To the best of our knowledge, this study presents the first application of QML to subsurface imaging through the development of hybrid quantum physics-informed neural networks (HQ-PINNs) for seismic inversion. We apply the HQ-PINN framework to invert pre-stack and post-stack seismic datasets, estimating P- and S-impedances. The proposed HQ-PINN architecture follows an encoder-decoder structure, where the encoder (HQNN), processes seismic data to estimate elastic parameters, while the decoder utilizes these parameters to generate the corresponding seismic data based on geophysical relationships. The HQ-PINN model is trained by minimizing the misfit between the input and predicted seismic data generated by the decoder. We systematically evaluate various quantum layer configurations, differentiation methods, and quantum device simulators on the inversion performance, and demonstrate real-world applicability through the individual and simultaneous inversion cases of the Sleipner dataset. The HQ-PINN framework consistently and efficiently estimated accurate subsurface impedances across the synthetic and field case studies, establishing the feasibility of leveraging QML for seismic inversion, thereby paving the way for broader applications of quantum computing in geosciences.
[LG-58] opology-Aware Conformal Prediction for Stream Networks
链接: https://arxiv.org/abs/2503.04981
作者: Jifan Zhang,Fangxin Wang,Philip S. Yu,Kaize Ding,Shixiang Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 16 pages, 6 figures
点击查看摘要
Abstract:Stream networks, a unique class of spatiotemporal graphs, exhibit complex directional flow constraints and evolving dependencies, making uncertainty quantification a critical yet challenging task. Traditional conformal prediction methods struggle in this setting due to the need for joint predictions across multiple interdependent locations and the intricate spatio-temporal dependencies inherent in stream networks. Existing approaches either neglect dependencies, leading to overly conservative predictions, or rely solely on data-driven estimations, failing to capture the rich topological structure of the network. To address these challenges, we propose Spatio-Temporal Adaptive Conformal Inference (\textttSTACI), a novel framework that integrates network topology and temporal dynamics into the conformal prediction framework. \textttSTACI introduces a topology-aware nonconformity score that respects directional flow constraints and dynamically adjusts prediction sets to account for temporal distributional shifts. We provide theoretical guarantees on the validity of our approach and demonstrate its superior performance on both synthetic and real-world datasets. Our results show that \textttSTACI effectively balances prediction efficiency and coverage, outperforming existing conformal prediction methods for stream networks.
[LG-59] Neuromorphic Quantum Neural Networks with Tunnel-Diode Activation Functions
链接: https://arxiv.org/abs/2503.04978
作者: Jake McNaughton,A. H. Abbas,Ivan S. Maksymov
类目: Applied Physics (physics.app-ph); Machine Learning (cs.LG)
*备注: 8 pages, 6 figures
点击查看摘要
Abstract:The mathematical complexity and high dimensionality of neural networks hinder the training and deployment of machine learning (ML) systems while also requiring substantial computational resources. This fundamental limitation drives ML research, particularly in the exploration of alternative neural network architectures that integrate novel building blocks, such as advanced activation functions. Tunnel diodes are well-known electronic components that utilise the physical effect of quantum tunnelling (QT). Here, we propose using the current voltage characteristic of a tunnel diode as a novel, physics-based activation function for neural networks. We demonstrate that the tunnel-diode activation function (TDAF) outperforms traditional activation functions in terms of accuracy and loss during both training and evaluation. We also highlight its potential for implementation in electronic circuits suited to developing neuromorphic, quantum-inspired AI systems capable of operating in environments not suitable for qubit-based quantum computing hardware.
[LG-60] Boltzmann convolutions and Welford mean-variance layers with an application to time series forecasting and classification
链接: https://arxiv.org/abs/2503.04956
作者: Daniel Andrew Coulson,Martin T. Wells
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 40 pages, 7 figures, 11 tables
点击查看摘要
Abstract:In this paper we propose a novel problem called the ForeClassing problem where the loss of a classification decision is only observed at a future time point after the classification decision has to be made. To solve this problem, we propose an approximately Bayesian deep neural network architecture called ForeClassNet for time series forecasting and classification. This network architecture forces the network to consider possible future realizations of the time series, by forecasting future time points and their likelihood of occurring, before making its final classification decision. To facilitate this, we introduce two novel neural network layers, Welford mean-variance layers and Boltzmann convolutional layers. Welford mean-variance layers allow networks to iteratively update their estimates of the mean and variance for the forecasted time points for each inputted time series to the network through successive forward passes, which the model can then consider in combination with a learned representation of the observed realizations of the time series for its classification decision. Boltzmann convolutional layers are linear combinations of approximately Bayesian convolutional layers with different filter lengths, allowing the model to learn multitemporal resolution representations of the input time series, and which resolutions to focus on within a given Boltzmann convolutional layer through a Boltzmann distribution. Through several simulation scenarios and two real world applications we demonstrate ForeClassNet achieves superior performance compared with current state of the art methods including a near 30% improvement in test set accuracy in our financial example compared to the second best performing model.
[LG-61] Leverag ing Large Language Models to Address Data Scarcity in Machine Learning: Applications in Graphene Synthesis
链接: https://arxiv.org/abs/2503.04870
作者: Devi Dutta Biswajeet,Sara Kadkhodaei
类目: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 20 pages, 10 figures, 4 tables; Supplementary Material with 13 figures and 4 tables
点击查看摘要
Abstract:Machine learning in materials science faces challenges due to limited experimental data, as generating synthesis data is costly and time-consuming, especially with in-house experiments. Mining data from existing literature introduces issues like mixed data quality, inconsistent formats, and variations in reporting experimental parameters, complicating the creation of consistent features for the learning algorithm. Additionally, combining continuous and discrete features can hinder the learning process with limited data. Here, we propose strategies that utilize large language models (LLMs) to enhance machine learning performance on a limited, heterogeneous dataset of graphene chemical vapor deposition synthesis compiled from existing literature. These strategies include prompting modalities for imputing missing data points and leveraging large language model embeddings to encode the complex nomenclature of substrates reported in chemical vapor deposition experiments. The proposed strategies enhance graphene layer classification using a support vector machine (SVM) model, increasing binary classification accuracy from 39% to 65% and ternary accuracy from 52% to 72%. We compare the performance of the SVM and a GPT-4 model, both trained and fine-tuned on the same data. Our results demonstrate that the numerical classifier, when combined with LLM-driven data enhancements, outperforms the standalone LLM predictor, highlighting that in data-scarce scenarios, improving predictive learning with LLM strategies requires more than simple fine-tuning on datasets. Instead, it necessitates sophisticated approaches for data imputation and feature space homogenization to achieve optimal performance. The proposed strategies emphasize data enhancement techniques, offering a broadly applicable framework for improving machine learning performance on scarce, inhomogeneous datasets.
[LG-62] A characterization of sample adaptivity in UCB data
链接: https://arxiv.org/abs/2503.04855
作者: Yilun Chen,Jiaqi Lu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:
点击查看摘要
Abstract:We characterize a joint CLT of the number of pulls and the sample mean reward of the arms in a stochastic two-armed bandit environment under UCB algorithms. Several implications of this result are in place: (1) a nonstandard CLT of the number of pulls hence pseudo-regret that smoothly interpolates between a standard form in the large arm gap regime and a slow-concentration form in the small arm gap regime, and (2) a heuristic derivation of the sample bias up to its leading order from the correlation between the number of pulls and sample means. Our analysis framework is based on a novel perturbation analysis, which is of broader interest on its own.
[LG-63] A Practical Introduction to Kernel Discrepancies: MMD HSIC KSD
链接: https://arxiv.org/abs/2503.04820
作者: Antonin Schrab
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 26 pages, 7 figures
点击查看摘要
Abstract:This article provides a practical introduction to kernel discrepancies, focusing on the Maximum Mean Discrepancy (MMD), the Hilbert-Schmidt Independence Criterion (HSIC), and the Kernel Stein Discrepancy (KSD). Various estimators for these discrepancies are presented, including the commonly-used V-statistics and U-statistics, as well as several forms of the more computationally-efficient incomplete U-statistics. The importance of the choice of kernel bandwidth is stressed, showing how it affects the behaviour of the discrepancy estimation. Adaptive estimators are introduced, which combine multiple estimators with various kernels, addressing the problem of kernel selection.
[LG-64] Precoder Learning for Weighted Sum Rate Maximization
链接: https://arxiv.org/abs/2503.04497
作者: Mingyu Deng,Shengqian Han
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Weighted sum rate maximization (WSRM) for precoder optimization effectively balances performance and fairness among users. Recent studies have demonstrated the potential of deep learning in precoder optimization for sum rate maximization. However, the WSRM problem necessitates a redesign of neural network architectures to incorporate user weights into the input. In this paper, we propose a novel deep neural network (DNN) to learn the precoder for WSRM. Compared to existing DNNs, the proposed DNN leverage the joint unitary and permutation equivariant property inherent in the optimal precoding policy, effectively enhancing learning performance while reducing training complexity. Simulation results demonstrate that the proposed method significantly outperforms baseline learning methods in terms of both learning and generalization performance while maintaining low training and inference complexity.
信息检索
[IR-0] A Survey of Large Language Model Empowered Agents for Recommendation and Search: Towards Next-Generation Information Retrieval
链接: https://arxiv.org/abs/2503.05659
作者: Yu Zhang,Shutong Qiao,Jiaqi Zhang,Tzu-Heng Lin,Chen Gao,Yong Li
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Information technology has profoundly altered the way humans interact with information. The vast amount of content created, shared, and disseminated online has made it increasingly difficult to access relevant information. Over the past two decades, search and recommendation systems (collectively referred to as information retrieval systems) have evolved significantly to address these challenges. Recent advances in large language models (LLMs) have demonstrated capabilities that surpass human performance in various language-related tasks and exhibit general understanding, reasoning, and decision-making abilities. This paper explores the transformative potential of large language model agents in enhancing search and recommendation systems. We discuss the motivations and roles of LLM agents, and establish a classification framework to elaborate on the existing research. We highlight the immense potential of LLM agents in addressing current challenges in search and recommendation, providing insights into future research directions. This paper is the first to systematically review and classify the research on LLM agents in these domains, offering a novel perspective on leveraging this advanced AI technology for information retrieval. To help understand the existing works, we list the existing papers on agent-based simulation with large language models at this link: this https URL.
[IR-1] Identification and explanation of disinformation in wiki data streams
链接: https://arxiv.org/abs/2503.05605
作者: Francisco de Arriba-Pérez,Silvia García-Méndez,Fátima Leal,Benedita Malheiro,Juan C Burguillo
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
*备注: (2025) Integrated Computer-Aided Engineering
点击查看摘要
Abstract:Social media platforms, increasingly used as news sources for varied data analytics, have transformed how information is generated and disseminated. However, the unverified nature of this content raises concerns about trustworthiness and accuracy, potentially negatively impacting readers’ critical judgment due to disinformation. This work aims to contribute to the automatic data quality validation field, addressing the rapid growth of online content on wiki pages. Our scalable solution includes stream-based data processing with feature engineering, feature analysis and selection, stream-based classification, and real-time explanation of prediction outcomes. The explainability dashboard is designed for the general public, who may need more specialized knowledge to interpret the model’s prediction. Experimental results on two datasets attain approximately 90 % values across all evaluation metrics, demonstrating robust and competitive performance compared to works in the literature. In summary, the system assists editors by reducing their effort and time in detecting disinformation.
[IR-2] Bridging Classical and Quantum String Matching: A Computational Reformulation of Bit-Parallelism
链接: https://arxiv.org/abs/2503.05596
作者: Simone Faro,Arianna Pavone,Caterina Viola
类目: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:String matching is a fundamental problem in computer science, with critical applications in text retrieval, bioinformatics, and data analysis. Among the numerous solutions that have emerged for this problem in recent decades, bit-parallelism has significantly enhanced their practical efficiency, leading to the development of several optimized approaches for both exact and approximate string matching. However, their potential in quantum computing remains largely unexplored. This paper presents a novel pathway that not only translates bit-parallel string matching algorithms into the quantum framework but also enhances their performance to achieve a quadratic speedup through Grover’s search. By embedding quantum search within a bit-parallel model, we reduce the time complexity of string matching, establishing a structured pathway for transforming classical algorithms into quantum solutions with provable computational advantages. Beyond exact matching, this technique offers a foundation for tackling a wide range of non-standard string matching problems, opening new avenues for efficient text searching in the quantum era. To demonstrate the simplicity and adaptability of the technique presented in this paper, we apply this translation and adaptation process to two landmark bit-parallel algorithms: Shift-And for exact pattern matching and Shift-Add for approximate string matching with up to k errors.
附件下载
点击下载今日全部论文列表